# A to Z on Correlation and Covariance :-

Correlation is a statistical tool used to measure the relationship between two or more variables: a movement (change) in one variable is accompanied by a corresponding movement (change) in another.

## Positive and Negative Correlation : –

Whether the correlation between the variables is positive or negative depends on the direction of change. The correlation is positive when both variables move in the same direction, i.e. when one variable increases the other, on average, also increases, and when one decreases the other also decreases. On a graph the points follow an upward-sloping trend. The correlation is negative when the variables move in opposite directions, i.e. when one variable increases the other decreases, and vice versa. On a graph the points follow a downward-sloping trend, such as a falling straight line or a decaying curve.
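As a small illustration of direction, the sketch below computes the correlation coefficient for one pair of series that rise together and one pair that move in opposite directions. The numbers and variable names are invented for the example, and it assumes NumPy is available.

```python
import numpy as np

# Invented illustrative data: six students
hours_studied = np.array([1, 2, 3, 4, 5, 6], dtype=float)
exam_score = np.array([52, 55, 61, 64, 70, 75], dtype=float)  # rises with study time
hours_of_tv = np.array([9, 8, 6, 5, 3, 2], dtype=float)       # falls with study time

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r
r_pos = np.corrcoef(hours_studied, exam_score)[0, 1]
r_neg = np.corrcoef(hours_studied, hours_of_tv)[0, 1]

print(f"study vs score: r = {r_pos:+.3f}  (same direction, positive)")
print(f"study vs TV:    r = {r_neg:+.3f}  (opposite direction, negative)")
```

The sign of r carries the direction; the size of r says how closely the points follow the trend.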

## Simple, Partial and Multiple Correlation :–

Whether the correlation is simple, partial or multiple depends on the number of variables studied. The correlation is simple when only two variables are studied. It is either multiple or partial when three or more variables are studied simultaneously. For example, studying the relationship between the yield of wheat per acre and both the amount of fertilizer used and the rainfall received is a problem of multiple correlation. In partial correlation we also have more than two variables, but we study the relationship between only two of them while the effect of the other influencing variables is held constant. In the example above, studying the relationship between yield and fertilizer during periods when a certain average temperature prevailed is a problem of partial correlation (three variables: yield, fertilizer and temperature).
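One common way to hold a third variable constant is to remove its linear effect from both variables and correlate what is left. The sketch below does this on simulated data; the coefficients, noise levels and names (`temperature`, `fertilizer`, `wheat_yield`, `residuals`) are assumptions made up for the illustration, not values from any real study.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated data: temperature drives both fertilizer use and yield,
# and fertilizer also has its own (smaller) effect on yield.
temperature = rng.normal(25, 3, n)
fertilizer = 2.0 * temperature + rng.normal(0, 2, n)
wheat_yield = 0.5 * fertilizer + 1.5 * temperature + rng.normal(0, 2, n)

def residuals(target, control):
    """Part of `target` left over after a straight-line fit on `control`."""
    slope, intercept = np.polyfit(control, target, 1)
    return target - (slope * control + intercept)

# Simple correlation: ignores temperature altogether
simple_r = np.corrcoef(fertilizer, wheat_yield)[0, 1]

# Partial correlation: correlate the parts not explained by temperature
partial_r = np.corrcoef(
    residuals(fertilizer, temperature),
    residuals(wheat_yield, temperature),
)[0, 1]

print(f"simple r  (yield vs fertilizer)          = {simple_r:.2f}")
print(f"partial r (temperature held constant)    = {partial_r:.2f}")
```

In this simulation the simple correlation is inflated by the shared dependence on temperature, so the partial correlation comes out noticeably smaller.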

## Linear and Non-Linear (Curvilinear) Correlation :-

Whether the correlation between the variables is linear or non-linear depends on the constancy of ratio of change between the variables. The correlation is said to be linear when the amount of change in one variable to the amount of change in another variable tends to bear a constant ratio. For example, from the values of two variables given below, it is clear that the ratio of change between the variables is the same:

X: 10 20 30 40 50
Y: 20 40 60 80 100
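The constant ratio in the X and Y values above can be checked directly. This is a minimal sketch using NumPy; the variable names are only for illustration.

```python
import numpy as np

x = np.array([10, 20, 30, 40, 50])
y = np.array([20, 40, 60, 80, 100])

# Ratio of successive changes: (change in Y) / (change in X)
ratios = np.diff(y) / np.diff(x)
print(ratios)  # [2. 2. 2. 2.]

# The relationship is linear when every ratio is the same
is_linear = np.allclose(ratios, ratios[0])
print(is_linear)
```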

The correlation is called non-linear or curvilinear when the amount of change in one variable does not bear a constant ratio to the amount of change in the other variable, i.e. the ratio itself varies. For example, if the amount of fertilizer is doubled, the yield of wheat will not necessarily double, because other factors also affect it.

## a) Auto-Correlations :-

Autocorrelation, also known as serial correlation, is the correlation of a signal with a delayed copy of itself, as a function of the delay. Informally, it is the similarity between observations as a function of the time lag between them. The analysis of autocorrelation is a mathematical tool for finding repeating patterns, such as the presence of a periodic signal obscured by noise.

— Used in signal processing for analyzing functions or series of values, such as time domain signals.

— In analysis of Markov chain Monte Carlo data, autocorrelation must be taken into account for correct error determination.

— Autocorrelation is used to analyze dynamic light scattering data.
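To make the idea of finding a periodic signal obscured by noise concrete, here is a minimal sketch on simulated data. The period, noise level and the helper name `autocorr` are assumptions chosen for the illustration; the normalisation used (dividing by the lag-0 value) is one common convention.

```python
import numpy as np

rng = np.random.default_rng(42)
period = 25                      # hidden period, chosen for the demo
t = np.arange(4000)
signal = np.sin(2 * np.pi * t / period) + rng.normal(0, 1.0, t.size)

def autocorr(x, max_lag):
    """Autocorrelation of x for lags 0..max_lag, normalised so lag 0 equals 1."""
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array(
        [np.dot(x[: x.size - k], x[k:]) / denom for k in range(max_lag + 1)]
    )

acf = autocorr(signal, 40)

# Lag 0 is always the largest value, so search beyond the first few lags
first_lag = 10
best_lag = first_lag + int(np.argmax(acf[first_lag:]))
print("strongest repeat at a lag of about", best_lag, "samples")
```

The peak away from lag zero lands near the hidden period even though the sine wave is hard to see in the raw noisy samples.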

## b) Cross-Correlations :-

In signal processing, cross-correlation is a measure of similarity of two series as a function of the displacement of one relative to the other. This is also known as a sliding dot product or sliding inner-product. It is commonly used for searching a long signal for a shorter, known feature. It has applications in pattern recognition, single particle analysis, electron tomography, averaging, cryptanalysis, and neurophysiology. The cross-correlation is similar in nature to the convolution of two functions. In an autocorrelation, which is the cross-correlation of a signal with itself, there will always be a peak at a lag of zero, and its size will be the signal energy.

In probability and statistics, the term cross-correlations refers to the correlations between the entries of two random vectors X and Y, while the correlations of a random vector X are the correlations between the entries of X itself, those forming the correlation matrix of X. If each of X and Y is a scalar random variable which is realized repeatedly in a time series, then the correlations of the various temporal instances of X are known as autocorrelations of X, and the cross-correlations of X with Y across time are temporal cross-correlations. In probability and statistics, the definition of correlation always includes a standardising factor in such a way that correlations have values between −1 and +1.
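The signal-processing use of cross-correlation, searching a long signal for a shorter known feature, can be sketched as a sliding dot product with `numpy.correlate`. The feature shape, noise level and offset below are invented for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(1)

# A short known feature, hidden somewhere inside a long noisy signal
feature = np.array([0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0])
long_signal = rng.normal(0, 0.1, 200)
true_offset = 120
long_signal[true_offset : true_offset + feature.size] += feature

# mode="valid" slides the feature across every position where it fits
# completely and takes the dot product at each position.
# Subtracting the feature's mean makes the score ignore constant offsets.
scores = np.correlate(long_signal, feature - feature.mean(), mode="valid")
found_offset = int(np.argmax(scores))

print("feature found at offset", found_offset)
```

The position with the highest score is where the long signal looks most like the feature. Note that this raw sliding dot product is not standardised to lie between −1 and +1, unlike the statistical definition.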

## 2) Methods of Determining the Correlation:

Recall that the correlation is linear when the amount of change in one variable tends to bear a constant ratio to the amount of change in the other, and non-linear (curvilinear) when that ratio is not constant.

The following are the important methods used to determine whether the variables are linearly or non-linearly related and the extent to which they are correlated:

## 1) Scatter Diagram Method :-

Definition :- The Scatter Diagram Method is the simplest method of studying the correlation between two variables: each pair of values is plotted on a graph as a dot, giving as many points as there are observations. The degree of correlation can then be judged by looking at how the points are scattered.

The degree to which the variables are related depends on the manner in which the points are scattered over the chart. The more widely the points are scattered, the lower the degree of correlation between the variables. The closer the points lie to a straight line, the higher the degree of correlation. The degree of correlation is denoted by “r”.

The following types of scatter diagrams tell about the degree of correlation between variable X and variable Y.

## a) Perfect Positive Correlation (r=+1) :-

The correlation is said to be perfectly positive when all the points lie on a straight line rising from the lower left-hand corner to the upper right-hand corner.

## b) Perfect Negative Correlation (r=-1) : –

When all the points lie on a straight line falling from the upper left-hand corner to the lower right-hand corner, the variables are said to be perfectly negatively correlated.

## c) High Degree of +Ve Correlation (r= + High) :-

The degree of correlation is high when the plotted points fall within a narrow band and show a rising tendency from left to right.

## d) High Degree of –Ve Correlation (r= – High) :-

The degree of negative correlation is high when the plotted points fall within a narrow band and show a declining tendency from left to right.

## e) Low degree of +Ve Correlation (r= + Low) :-

The correlation between the variables is said to be low but positive when the points are highly scattered over the graph and show a rising tendency from the lower left-hand corner to the upper right-hand corner.

## f) Low Degree of –Ve Correlation (r= – Low) :-

The degree of correlation is low and negative when the points are widely scattered over the graph and show a falling tendency from the upper left-hand corner to the lower right-hand corner.

## g) No Correlation (r= 0) :–

The variables are said to be unrelated when the points are haphazardly scattered over the graph and do not show any specific pattern. Here the correlation is absent and hence r = 0.

Thus, the scatter diagram method is used to study the degree of relationship between the variables by plotting a dot for each pair of values. The chart on which the dots are plotted is also called a Dotogram.
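The link between how tightly the points cluster and the value of r can be reproduced with simulated data. In the sketch below the slopes and noise levels are arbitrary choices for the demonstration; wider noise plays the role of a more scattered diagram.

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(0, 10, 200)

# Each entry imitates one kind of scatter diagram
patterns = {
    "perfect positive": 2 * x + 1,
    "perfect negative": -2 * x + 1,
    "high positive (narrow band)": 2 * x + rng.normal(0, 2, x.size),
    "low positive (wide scatter)": 2 * x + rng.normal(0, 12, x.size),
    "no correlation": rng.normal(0, 5, x.size),
}

r_values = {name: np.corrcoef(x, y)[0, 1] for name, y in patterns.items()}

for name, r in r_values.items():
    print(f"{name:30s} r = {r:+.2f}")
```

Passing any of these `x`, `y` pairs to a plotting library would show the corresponding diagram; the code only reports the r that each pattern produces.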

## 2) Karl Pearson’s Coefficient of Correlation.

Karl Pearson’s Coefficient of Correlation is a widely used mathematical method in which a numerical expression is used to calculate the degree and direction of the relationship between linearly related variables.

Pearson’s method, popularly known as the Pearsonian Coefficient of Correlation, is the most extensively used quantitative method in practice. The coefficient of correlation is denoted by “r”.
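As a rough sketch of the calculation, the function below uses the standard deviation-from-the-mean form of Pearson's r: the sum of products of deviations divided by the square root of the product of the two sums of squared deviations. The function name `pearson_r` is just a label for this example, and the data are the x and y values from the covariance example later in this article.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r computed from deviations about the two means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sum_xx = sum((a - mean_x) ** 2 for a in xs)
    sum_yy = sum((b - mean_y) ** 2 for b in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)

x = [2.1, 2.5, 3.6, 4.0]
y = [8, 10, 12, 14]

r = pearson_r(x, y)
print(f"r = {r:.3f}")  # close to +1: a strong positive linear relationship
```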

## Properties of Coefficient of Correlation :-

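• The value of the coefficient of correlation (r) always lies between −1 and +1. Such as:
r=+1, perfect positive correlation
r=-1, perfect negative correlation
r=0, no correlation
• The coefficient of correlation is independent of origin and scale. Independence of origin means that subtracting any constant from the values of X and Y leaves the value of “r” unchanged. Independence of scale means that multiplying or dividing the values of X and Y by any positive constant has no effect on the value of “r”. (This is why rescaling or standardizing features in Machine Learning does not change their correlation; it is not the same thing as regularization.)

• The coefficient of correlation is the geometric mean of the two regression coefficients. Symbolically it is represented as:

$$ r = \pm\sqrt{b_{xy} \cdot b_{yx}} $$

where $b_{xy}$ is the regression coefficient of X on Y and $b_{yx}$ is the regression coefficient of Y on X; r takes the sign shared by the two coefficients.

• Note :- The coefficient of correlation is zero when the variables X and Y are independent. However, the converse is not true: r = 0 does not imply independence, because r captures only the linear part of a relationship.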
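Two of these properties are easy to check numerically. The sketch below first shifts and rescales a small data set and confirms that r does not move, then builds a pair of variables that are fully dependent (v is the square of u) yet have r = 0. The constants used for shifting and scaling are arbitrary.

```python
import numpy as np

x = np.array([2.1, 2.5, 3.6, 4.0])
y = np.array([8.0, 10.0, 12.0, 14.0])

r_original = np.corrcoef(x, y)[0, 1]
r_shifted = np.corrcoef(x - 3.0, y - 10.0)[0, 1]   # change of origin
r_scaled = np.corrcoef(x * 100.0, y / 4.0)[0, 1]   # change of scale (positive constants)

print(r_original, r_shifted, r_scaled)             # all three agree

# Dependent but uncorrelated: v is completely determined by u,
# yet the relationship is symmetric, so there is no linear trend.
u = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
v = u ** 2
r_dependent = np.corrcoef(u, v)[0, 1]
print(r_dependent)
```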

### Assumptions of Karl Pearson’s Coefficient of Correlation :-

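1. The relationship between the variables is “Linear”, which means that when the two variables are plotted, the points cluster around a straight line.
2. The variables are assumed to be (approximately) normally distributed.
3. The pairs of observations are independent of one another. (The variables X and Y themselves are not assumed to be independent; if they were, r would be close to zero.)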

Note: The coefficient of correlation measures not only the magnitude of the correlation but also its direction. For example, r = -0.67 shows that the correlation is negative, because the sign is “-”, and that its magnitude is 0.67.

## 3) Spearman’s Rank Correlation Coefficient.

Spearman’s Rank Correlation Coefficient is a non-parametric statistical measure used to study the strength of association between two ranked variables. This method is applied to ordinal data, i.e. values that can be arranged in order, one after the other, so that a rank can be given to each.

In the rank correlation coefficient method, ranks are given to each individual on the basis of its quality or quantity. When there are no tied ranks, the coefficient is calculated as:

$$ R = 1 - \frac{6 \sum D^2}{N(N^2 - 1)} $$

Where, R = Rank coefficient of correlation
D = Difference of ranks
N = Number of Observations

The value of R lies between −1 and +1:
R = +1: there is complete agreement in the order of the ranks, which move in the same direction.
R = -1: there is complete disagreement in the order of the ranks, which run in exactly opposite directions.
R = 0: there is no association between the ranks.

Where actual ranks are given: An individual must follow the following steps to calculate the correlation coefficient:

1. First, the difference between the ranks (R1-R2) must be calculated, denoted by D.
2. Then, square these differences to remove the negative signs and obtain their sum ∑D².
3. Apply the formula as shown above.

Where ranks are not given: If the ranks are not given, assign them by taking either the highest value or the lowest value as rank 1. Whichever criterion is chosen, the same method must be applied to all the variables.

Note: Spearman’s rank correlation coefficient is suited to data that are in the form of ranks or can be converted to ranks, and the hand calculation is practical when N (the number of observations) is fairly small, i.e. not greater than 25 or 30.
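The steps can be sketched in a few lines using the usual no-ties formula R = 1 − 6∑D² / (N(N² − 1)). The function names, the judges' rankings and the exam marks below are invented for the illustration, and the ranking helper assumes there are no tied values.

```python
def spearman_rank_correlation(ranks_1, ranks_2):
    """R = 1 - 6 * sum(D^2) / (N * (N^2 - 1)); valid when there are no ties."""
    n = len(ranks_1)
    sum_d_squared = sum((r1 - r2) ** 2 for r1, r2 in zip(ranks_1, ranks_2))
    return 1 - (6 * sum_d_squared) / (n * (n ** 2 - 1))

def assign_ranks(values):
    """Give rank 1 to the highest value (assumes all values are distinct)."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

# Case 1: ranks are given (two judges ranking five contestants)
judge_a = [1, 2, 3, 4, 5]
judge_b = [2, 1, 4, 3, 5]
r_judges = spearman_rank_correlation(judge_a, judge_b)

# Case 2: ranks are not given, so assign them first using the same rule for both
maths = [85, 60, 73, 40, 90]
physics = [93, 75, 65, 50, 80]
r_marks = spearman_rank_correlation(assign_ranks(maths), assign_ranks(physics))

print(r_judges, r_marks)
```

For the judges, the differences D are −1, 1, −1, 1, 0, so ∑D² = 4 and R = 1 − 24/120 = 0.8.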

## 4) Method of Least Squares.

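The Method of Least Squares is another mathematical method that gives the degree of correlation between the variables as the square root of the product of the two regression coefficients, that of x on y and that of y on x.

The formula for calculating the correlation by the method of least squares is given below:

$$ r = \pm\sqrt{b_{xy} \cdot b_{yx}} $$

where $b_{xy}$ is the regression coefficient of x on y and $b_{yx}$ is the regression coefficient of y on x; r takes the sign shared by the two coefficients.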
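A minimal sketch of this method: fit the two least-squares lines (y on x, and x on y), take their slopes as the two regression coefficients, and combine them. It uses `numpy.polyfit` for the fits and the small x, y data set from the covariance example in this article; the variable names are illustrative.

```python
import numpy as np

x = np.array([2.1, 2.5, 3.6, 4.0])
y = np.array([8.0, 10.0, 12.0, 14.0])

# Slope of the least-squares line of y on x, and of x on y.
# np.polyfit with degree 1 returns [slope, intercept].
b_yx = np.polyfit(x, y, 1)[0]
b_xy = np.polyfit(y, x, 1)[0]

# r is the geometric mean of the two coefficients, carrying their common sign
r_least_squares = np.sign(b_yx) * np.sqrt(b_xy * b_yx)

print(f"b_yx = {b_yx:.4f}, b_xy = {b_xy:.4f}, r = {r_least_squares:.4f}")
print("matches np.corrcoef:", np.isclose(r_least_squares, np.corrcoef(x, y)[0, 1]))
```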

## Lag and Lead in Correlation :-

While studying time series data, it may be observed that there is a time gap before a cause-and-effect relationship shows up. This time gap is called a “Lag”; seen from the other side, the variable that moves first is said to “Lead”.

Example 1:- The value of a series observed at time t(i) may reflect a change that happened in another series at time t(i-1); the first series is then lagging the second by one period.

Example 2:- The production of a commodity may increase today without an immediate effect on its price; it may take some time for the price to adjust to the increased production. Higher production is expected to reduce the price, but the reduction arrives only after some time. This delay is the “Lag”.

While computing the correlation between the variables by any of the methods, this time gap must be taken into consideration, otherwise wrong or false conclusions will be drawn.
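The production and price example can be simulated to show what goes wrong when the lag is ignored. In this sketch the three-period lag, the coefficients and the helper name `lagged_corr` are all assumptions made for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(3)
true_lag = 3                                  # price reacts 3 periods late (assumed)

raw_production = rng.normal(100, 10, 200 + true_lag)
# Price at time t falls when production at time t - true_lag was high
price = 50 - 0.3 * raw_production[:-true_lag] + rng.normal(0, 1, 200)
production = raw_production[true_lag:]        # production on the same time axis as price

def lagged_corr(cause, effect, lag):
    """Correlation between cause at time t and effect at time t + lag."""
    if lag == 0:
        return np.corrcoef(cause, effect)[0, 1]
    return np.corrcoef(cause[:-lag], effect[lag:])[0, 1]

corr_by_lag = {lag: lagged_corr(production, price, lag) for lag in range(7)}
for lag, r in corr_by_lag.items():
    print(f"lag {lag}: r = {r:+.2f}")

# The relationship is negative, so look for the most negative correlation
best_lag = min(corr_by_lag, key=corr_by_lag.get)
print("strongest relationship at lag", best_lag)
```

At lag 0 the two series look almost unrelated; only after shifting by the right lag does the strong negative relationship appear.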

## Covariance :-

Covariance is a measure of how much two random variables vary together. It’s similar to variance, but where variance tells you how a single variable varies, covariance tells you how two variables vary together.

The formula for the covariance of a sample is:

$$ \mathrm{Cov}(X, Y) = \frac{\sum (X - \mu)(Y - \nu)}{n - 1} $$

Where X and Y are the random variables
E(X) = μ is the expected value (the mean) of the random variable X and
E(Y) = ν is the expected value (the mean) of the random variable Y
n = the number of items in the data set.

Calculate covariance for the following data set:

x: 2.1, 2.5, 3.6, 4.0 (mean = 3.05)
y: 8, 10, 12, 14 (mean = 11)

Substitute the values into the formula and solve:

Cov(X,Y) = Σ(X-μ)(Y-ν) / (n-1)
= [(2.1-3.05)(8-11) + (2.5-3.05)(10-11) + (3.6-3.05)(12-11) + (4.0-3.05)(14-11)] / (4-1)
= [(-0.95)(-3) + (-0.55)(-1) + (0.55)(1) + (0.95)(3)] / 3
= (2.85 + 0.55 + 0.55 + 2.85) / 3
= 6.8 / 3
= 2.267

The result is positive, meaning that the variables are positively related.
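The hand calculation can be cross-checked in code. The sketch below computes the sample covariance directly from the definition and compares it with `numpy.cov`, which divides by n − 1 by default.

```python
import numpy as np

x = np.array([2.1, 2.5, 3.6, 4.0])
y = np.array([8.0, 10.0, 12.0, 14.0])
n = x.size

# Sum of products of deviations from the means, divided by n - 1
cov_manual = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# np.cov returns the 2x2 covariance matrix; the off-diagonal entry is Cov(X, Y)
cov_numpy = np.cov(x, y)[0, 1]

print(f"manual: {cov_manual:.3f}, numpy: {cov_numpy:.3f}")
```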

Note on dividing by “n” or “n-1”:
When dealing with a sample, only n-1 of the deviations from the sample mean are free to vary (see: Degrees of Freedom), so divide by n-1. If your data cover the entire population, divide by n.

The reason dividing by n-1 corrects the bias is that we are using the sample means, instead of the population means, to calculate the covariance. Since a sample mean is computed from the data, it gets drawn toward the center of mass of that data, so deviations measured from it are on average smaller than deviations from the true mean.
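The bias can be seen in a small simulation. In the sketch below many samples of size 5 are drawn from a population whose true covariance is 1 by construction, and the two divisors are compared on average. The sample size, number of trials and distribution are arbitrary choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 5, 20000

# Population where the true Cov(X, Y) is 1: Y = X + independent noise, Var(X) = 1
x = rng.normal(0, 1, (trials, n))
y = x + rng.normal(0, 1, (trials, n))

# Deviations from each sample's own means (one sample per row)
dx = x - x.mean(axis=1, keepdims=True)
dy = y - y.mean(axis=1, keepdims=True)
sum_products = (dx * dy).sum(axis=1)

mean_div_n = (sum_products / n).mean()
mean_div_n_minus_1 = (sum_products / (n - 1)).mean()

print(f"average estimate dividing by n:     {mean_div_n:.3f}")
print(f"average estimate dividing by n - 1: {mean_div_n_minus_1:.3f}")
```

Dividing by n underestimates the true covariance by a factor of about (n − 1)/n, here roughly 0.8, while dividing by n − 1 averages out close to the true value of 1.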


Thanks

Satyajit Das

“Gaining Knowledge is the first step to wisdom, sharing it is the first step to humanity! “.
