**Correlation** is a statistical tool that measures the relationship between two or more variables: when a movement or change in one variable is accompanied by a movement or change in another, the variables are said to be correlated.

## 1) Types of Correlation :-

**Positive and Negative Correlation :-**

Whether the correlation between the variables is positive or negative depends on its **direction of change**. The correlation is positive when both variables **move in the same direction**, i.e. when one variable increases the other also increases on average, and when one decreases the other also decreases, much like points along a rising straight line. The correlation is negative when the variables **move in opposite directions**, i.e. when one variable increases the other decreases and vice versa, much like an exponentially decreasing curve.

**Simple, Partial and Multiple Correlation :-**

Whether the correlation is simple, partial or multiple depends on the **number of variables studied**. The correlation is simple when **only two variables** are studied. It is either multiple or partial when **three or more variables are studied simultaneously**. For example, if we study the relationship between the yield of wheat per acre and the amounts of fertilizer and rainfall applied, it is a problem of multiple correlation. In a partial correlation, by contrast, although more than two variables are recorded, **we consider only two of them as influencing each other**, keeping the effect of the other influencing variable constant. In the example above, if we study the relationship between the yield and the fertilizer used during periods when a certain average temperature prevailed, it is a problem of partial correlation (three variables: yield, temperature and fertilizer).

**Linear and Non-Linear (Curvilinear) Correlation :-**

Whether the correlation between the variables is linear or non-linear depends on the **constancy of the ratio of change between the variables**. The correlation is said to be **linear** when the amount of change in one variable tends to **bear a constant ratio** to the amount of change in the other. For example, from the values of the two variables given below, it is clear that the ratio of change between the variables is the same:

X: 10 20 30 40 50

Y: 20 40 60 80 100

The correlation is called **non-linear or curvilinear** when the amount of change in one variable **does not bear a constant ratio** to the amount of change in the other. For example, if the amount of fertilizer is doubled, the yield of wheat will not necessarily double, because other factors also influence the yield.
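This constant ratio of change can be checked directly in Python using the X and Y values above:

```python
X = [10, 20, 30, 40, 50]
Y = [20, 40, 60, 80, 100]

# ratio of change between successive pairs of values
ratios = [(Y[i + 1] - Y[i]) / (X[i + 1] - X[i]) for i in range(len(X) - 1)]
print(ratios)   # [2.0, 2.0, 2.0, 2.0]: the ratio is constant, so the relation is linear
```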

## In Signals and Communications we have Two Types of Correlation :-

## a) **Auto-Correlations** :-

Auto-correlation, also known as **serial correlation**, is the correlation of a signal with a delayed copy of itself, as a function of the delay. Informally, it is the similarity between observations as a function of the time lag between them. The analysis of autocorrelation is a mathematical tool for finding repeating patterns, such as the presence of a periodic signal obscured by noise.

— Used in signal processing for analyzing functions or series of values, such as time domain signals.

— In analysis of Markov chain Monte Carlo data, autocorrelation must be taken into account for correct error determination.

— Autocorrelation is used to analyze dynamic light scattering data.
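As a small, self-contained illustration (the signal, period and noise level below are made-up values, not from the text), the period of a noisy periodic signal can be recovered from its autocorrelation with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
t = np.arange(n)
period = 50
# a periodic signal deliberately buried in noise
signal = np.sin(2 * np.pi * t / period) + rng.normal(0.0, 1.0, n)

x = signal - signal.mean()
acf = np.correlate(x, x, mode="full")[n - 1:]   # autocorrelation at lags 0..n-1
acf = acf / acf[0]                              # normalise so that lag 0 equals 1

# look for the strongest correlation in a window around one full cycle
peak = 25 + int(np.argmax(acf[25:75]))
print(peak)   # close to the true period of 50
```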

## b) **Cross-Correlations** :-

In signal processing, cross-correlation is a measure of similarity of two series as a function of the displacement of one relative to the other. This is also known as a sliding dot product or sliding inner-product. It is commonly used for searching a long signal for a shorter, known feature. It has applications in pattern recognition, single particle analysis, electron tomography, averaging, cryptanalysis, and neurophysiology. The cross-correlation is similar in nature to the convolution of two functions. In an autocorrelation, which is the cross-correlation of a signal with itself, there will always be a peak at a lag of zero, and its size will be the signal energy.
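The "sliding dot product" search described above can be sketched in a few lines of NumPy; the template, noise level and offset are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(1)

template = np.array([1.0, 2.0, 3.0, 2.0, 1.0])   # the short, known feature
long_signal = rng.normal(0.0, 0.3, 200)          # long, noisy signal
long_signal[120:125] += template                 # hide the feature at offset 120

# sliding dot product of the signal with the template
scores = np.correlate(long_signal, template, mode="valid")
offset = int(np.argmax(scores))
print(offset)   # at (or right next to) 120
```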

In probability and statistics, the term cross-correlation refers to the correlations between the entries of two random vectors X and Y, while the correlations of a random vector X are the correlations between the entries of X itself, which form the correlation matrix of X. If each of X and Y is a scalar random variable realised repeatedly in a time series, then the correlations of the various temporal instances of X are known as autocorrelations of X, and the cross-correlations of X with Y across time are temporal cross-correlations. In probability and statistics, the definition of correlation always includes a standardising factor such that correlations have values between −1 and +1.

## 2) **Methods of Determining the Correlation** :-

The correlation is linear when the amount of change in one variable tends to bear a constant ratio to the amount of change in the other; it is non-linear or curvilinear when this ratio is not constant.

To determine the linearity or non-linearity among the variables, and the extent to which they are correlated, the following important methods are used:

## 1) Scatter Diagram Method :-

**Definition :-** The **Scatter Diagram Method** is the simplest method of studying the correlation between two variables, wherein the values for each pair of variables are plotted on a graph in the form of dots, thereby obtaining as many points as there are observations. Then, by looking at the scatter of the points, the degree of correlation can be estimated.

The degree to which the variables are related depends on the manner in which the points are scattered over the chart. The more widely the points are scattered, the lower the degree of correlation between the variables; the closer the points lie to a line, the higher the degree of correlation. **The degree of correlation is denoted by “r”.**

The following types of scatter diagrams tell about the degree of correlation between variable X and variable Y.

## a) **Perfect Positive Correlation (r=+1)** :-

The correlation is said to be perfectly positive when all the points lie on a straight line rising from the lower left-hand corner to the upper right-hand corner.

## b) **Perfect Negative Correlation (r=-1)** :-

When all the points lie on a straight line falling from the upper left-hand corner to the lower right-hand corner, the variables are said to be negatively correlated.

## c) **High Degree of +Ve Correlation (r= + High)** :-

The degree of correlation is high when the points plotted fall in a narrow band and show a rising tendency from left to right.

## d) **High Degree of –Ve Correlation (r= – High)** :-

The degree of negative correlation is high when the points plotted fall in a narrow band and show a declining tendency from left to right.

## e) **Low Degree of +Ve Correlation (r= + Low)** :-

The correlation between the variables is said to be low but positive when the points are widely scattered over the graph yet show a rising tendency from the lower left-hand corner to the upper right-hand corner.

## f) **Low Degree of –Ve Correlation (r= – Low)** :-

The degree of correlation is low and negative when the points are widely scattered over the graph and show a falling tendency from the upper left-hand corner to the lower right-hand corner.

## g) **No Correlation (r= 0)** :-

The variables are said to be unrelated when the points are haphazardly scattered over the graph and do not show any specific pattern. Here the correlation is absent and hence **r = 0**.

Thus, the scatter diagram method studies the degree of relationship between the variables by plotting a dot for each given pair of values. The chart on which the dots are plotted is also called a **Dotogram**.

## 2) Karl Pearson’s Coefficient of Correlation :-

**Karl Pearson’s Coefficient of Correlation** is a widely used mathematical method in which a numerical expression calculates the degree and direction of the relationship between linearly related variables.

Pearson’s method, popularly known as the **Pearsonian Coefficient of Correlation**, is the most extensively used quantitative method in practice. The coefficient of correlation is denoted by **“r”**.

## Properties of Coefficient of Correlation :-

- The value of the coefficient of correlation (r) always **lies between ±1**:
  - r = +1, perfect positive correlation
  - r = -1, perfect negative correlation
  - r = 0, no correlation
- The coefficient of correlation is independent of the **origin and scale**. By origin it means that subtracting any constant from the given values of X and Y leaves the value of “r” unchanged. By scale it means that multiplying or dividing the values of X and Y by any positive constant has no effect on the value of “r” (this is why standardising the data, as in feature scaling in machine learning, does not change r).

The coefficient of correlation is the **geometric mean of the two regression coefficients**. Symbolically it is represented as:

r = ±√(b_xy × b_yx)

- Note :- The coefficient of correlation is **“zero”** when the variables X and Y are independent. However, the converse is not true: r = 0 does not by itself imply independence.

### Assumptions of Karl Pearson’s Coefficient of Correlation :-

- The relationship between the variables is **“linear”**, which means that when the two variables are plotted, the points form a straight line.
- The variables are assumed to follow a **normal distribution**.
- The variables are independent of each other.

**Note:** The coefficient of correlation measures not only the magnitude of the correlation but also its direction. For example, r = -0.67 shows a negative correlation because the sign is **“-”**, and the magnitude is **0.67**.
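As a worked sketch, r can be computed from its defining formula in plain Python; the hours/errors data are hypothetical, and the last lines also check the origin-and-scale property listed above:

```python
import math

def pearson_r(x, y):
    # r = sum((x-mx)(y-my)) / sqrt(sum((x-mx)^2) * sum((y-my)^2))
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

hours  = [1, 2, 3, 4, 5]      # hypothetical: hours of practice
errors = [10, 8, 7, 4, 1]     # hypothetical: mistakes made

r = pearson_r(hours, errors)
print(round(r, 2))            # -0.98: the minus sign gives the direction, 0.98 the magnitude

# r is independent of origin and scale: shifting or scaling the data leaves it unchanged
r_scaled = pearson_r([h + 100 for h in hours], [e * 3 for e in errors])
print(abs(r - r_scaled) < 1e-9)
```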

## 3) Spearman’s Rank Correlation Coefficient :-

The **Spearman’s Rank Correlation Coefficient** is a non-parametric statistical measure used to study the strength of association between two ranked variables. This method is applied to ordinal data, i.e. values which can be arranged in order, one after the other, so that a rank can be given to each.

In the rank correlation coefficient method, the ranks are given to each individual on the basis of its quality or quantity. The coefficient is calculated as:

R = 1 − (6ΣD²) / (N(N² − 1))

Where, R = Rank coefficient of correlation, D = Difference of ranks, and N = Number of observations.

**The value of R lies between ±1 such that:**

R = +1, there is complete agreement in the order of the ranks and they move in the same direction.

R = -1, there is complete agreement in the order of the ranks, but they move in opposite directions.

R = 0, there is no association between the ranks.

**Where actual ranks are given:** the following steps are used to calculate the correlation coefficient:

- First, calculate the difference between the ranks (R1 − R2), denoted by D.
- Then, square these differences to remove the negative signs and obtain the sum ∑D².
- Apply the formula as shown above.

**Where ranks are not given:** if the ranks are not given, they may be assigned by taking either the highest value or the lowest value as rank 1. Whichever criterion is chosen, the same method should be applied to all the variables.

**Note:** Spearman’s rank correlation coefficient method is applied only when the initial data are in the form of ranks and N (the number of observations) is fairly small, i.e. not greater than 25 or 30.
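The steps above can be sketched in Python; the marks data are hypothetical and, for simplicity, ties are not handled:

```python
def ranks(values):
    # rank 1 for the highest value; this sketch assumes no ties
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

def spearman_R(x, y):
    n = len(x)
    d_sq = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - (6 * d_sq) / (n * (n ** 2 - 1))

# hypothetical marks of five students in two subjects
maths   = [85, 60, 73, 40, 90]
physics = [93, 55, 65, 50, 80]
print(spearman_R(maths, physics))   # 0.9
```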

## 4) Method of Least Squares :-

The **Method of Least Squares** is another mathematical method that gives the degree of correlation between the variables by using the square root of the product of the two regression coefficients: that of x on y and that of y on x.

The formula to calculate the correlation by the method of least squares is:

r = ±√(b_xy × b_yx)

where b_xy and b_yx are the regression coefficients of x on y and of y on x, and r takes the common sign of the two coefficients.
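A minimal sketch of the method, assuming the usual least-squares slope formula for each regression coefficient; the data are made up:

```python
def slope(x, y):
    # least-squares slope of the regression line of y on x
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sum((a - mx) ** 2 for a in x)
    return num / den

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b_yx = slope(x, y)          # regression coefficient of y on x
b_xy = slope(y, x)          # regression coefficient of x on y

r = (b_yx * b_xy) ** 0.5    # geometric mean of the two coefficients
if b_yx < 0:                # both coefficients always share the same sign
    r = -r
print(round(r, 4))          # 0.7746
```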

## **Lag and Lead in Correlation** :-

While studying time series data, it may be observed that there is a time gap before a cause-and-effect relationship is established; this time gap is called a “lag”.

Example 1: the value of a signal at time t(i) may depend on the value of another variable at an earlier time t(i-1); the signal is then said to lag.

Example 2: the production of a commodity may increase today but not have an immediate effect on its price; it may take some time for the price to adjust itself to the increased production. If production rises, the price should fall, but it falls only after some time; this delay is the “lag”.

While computing the correlation between the variables by any of the methods, this time gap must be taken into consideration, otherwise wrong or false conclusions may be drawn.
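As an illustration of estimating a lag (the series, the lag of 7 steps and the noise levels are synthetic), one can compute the correlation at every candidate lag and keep the strongest:

```python
import numpy as np

rng = np.random.default_rng(2)
n, true_lag = 300, 7

production = rng.normal(0.0, 1.0, n)
# price reacts to production only after a 7-step delay, plus a little noise
price = np.roll(production, true_lag) + rng.normal(0.0, 0.2, n)
price[:true_lag] = rng.normal(0.0, 1.0, true_lag)   # early values have no matching cause yet

def corr_at_lag(x, y, k):
    # correlation of x[t] with y[t + k]
    return np.corrcoef(x[:n - k], y[k:])[0, 1]

best = max(range(30), key=lambda k: corr_at_lag(production, price, k))
print(best)   # 7: the lag at which the two series line up
```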

**Covariance** :-

Covariance is a measure of how much two random variables vary together. It is similar to variance, but where variance tells you how a *single* variable varies, **co**variance tells you how **two** variables vary together.

The formula is: Cov(X,Y) = Σ (X-μ)(Y-ν) / (n-1)

Where X is a random variable

E(X) = μ is the expected value (the mean) of the random variable X and

E(Y) = ν is the expected value (the mean) of the random variable Y

n = the number of items in the data set.

Calculate covariance for the following data set:

x: 2.1, 2.5, 3.6, 4.0 (mean = 3.05)

y: 8, 10, 12, 14 (mean = 11)

**Substitute the values into the formula and solve:**

Cov(X,Y) = Σ (X-μ)(Y-ν) / (n-1)

= [(2.1-3.05)(8-11) + (2.5-3.05)(10-11) + (3.6-3.05)(12-11) + (4.0-3.05)(14-11)] / (4-1)

= [(-0.95)(-3) + (-0.55)(-1) + (0.55)(1) + (0.95)(3)] / 3

= [2.85 + 0.55 + 0.55 + 2.85] / 3

= 6.8/3

= 2.267

The result is positive, meaning that the variables are positively related.
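The worked example can be verified in a few lines of Python:

```python
def sample_cov(x, y):
    # sample covariance: divide by n-1, not n
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

x = [2.1, 2.5, 3.6, 4.0]
y = [8, 10, 12, 14]
print(round(sample_cov(x, y), 3))   # 2.267
```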

**Note on dividing by “n” or “n-1”:**

When dealing with a sample, there are n-1 terms that have the freedom to vary (see: degrees of freedom), so we divide by n-1. If the data cover the entire population rather than a sample, divide by n instead.

The reason **dividing by n-1** corrects the bias is that **we** use the sample mean, instead of the population mean, to **calculate** the covariance. Since the sample mean is based on the data, it gets drawn toward the center of mass of the data.

*More reading:* https://en.wikipedia.org/wiki/Covariance_and_correlation

References :- 1) Introduction to Statistics by David M. Lane, Rice University. 2) Khan Academy. 3) Images from Wikipedia and Google.

Thanks

Satyajit Das

“Gaining Knowledge is the first step to wisdom, sharing it is the first step to humanity! “.

All the best ! Keep Learning and Sharing with Grouply