Positive and Negative Correlation

Correlation is a statistical tool that describes the relationship between two variables. For example, correlation may be used to describe the relationship between the price of a good and its quantity demanded. It explains how two variables are related but does not explain any cause-effect relation. It only gives an understanding of the direction and intensity of the relation between two variables. Correlation can be of three types:

A) Positive Correlation

A correlation in the same direction is called a positive correlation. If one variable increases, the other also increases, and when one variable decreases, the other also decreases. For example, the length of an iron bar will increase as the temperature increases.

Two variables are positively correlated when they move together in the same direction. In economics, quantity supplied increases as the price increases. This is because sellers find it profitable to sell when the prices are high, so they will sell more. Thus, we can call price and quantity supplied to be positively correlated. This is also called the law of supply.

B) Negative Correlation

Correlation in the opposite direction is called a negative correlation. Here, if one variable increases, the other decreases, and vice versa. For example, the volume of a gas will decrease as the pressure increases, and the demand for a particular commodity increases as its price decreases.

Two variables are negatively correlated if they move in opposite directions. For instance, as the price of a good increases, the quantity demanded declines because the good becomes more expensive than it was before the increase. Thus, we can say that price and quantity demanded are negatively correlated. Note that this is the famous law of demand.

C) No Correlation or Zero Correlation

If there is no relationship between the two variables, such that a change in the value of one variable is not accompanied by any systematic change in the other, it is called no correlation or zero correlation.

Simple, Partial and Multiple Correlation: Whether the correlation is simple, partial or multiple depends on the number of variables studied. The correlation is said to be simple when only two variables are studied. It is either multiple or partial when three or more variables are studied. The correlation is said to be multiple when three or more variables are studied simultaneously. For example, if we want to study the relationship between the yield of wheat per acre and both the amount of fertilizer used and the rainfall, it is a problem of multiple correlation.

In the case of partial correlation, by contrast, we study more than two variables but consider only two of them as influencing each other, with the effect of the other influencing variables held constant. In the above example, if we study the relationship between the yield and the fertilizer used during periods when a certain average temperature existed, it is a problem of partial correlation.
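The difference between simple and partial correlation can be sketched in Python. This is only an illustration: the figures for yield, fertilizer and rainfall are hypothetical, rainfall is chosen here as the variable held constant, and the first-order partial correlation formula is one common way to hold a third variable constant. It relies on the Pearson correlation coefficient discussed in the next section.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def partial_r(x, y, z):
    """Correlation of x and y with the linear effect of z held constant."""
    r_xy, r_xz, r_yz = pearson_r(x, y), pearson_r(x, z), pearson_r(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical observations for six plots of land (illustrative only).
wheat_yield = [20, 24, 27, 33, 35, 40]
fertilizer = [1, 2, 3, 4, 5, 6]
rainfall = [30, 28, 35, 33, 40, 38]

simple = pearson_r(wheat_yield, fertilizer)              # two variables only
partial = partial_r(wheat_yield, fertilizer, rainfall)   # rainfall held constant
print(f"simple r = {simple:.3f}, partial r = {partial:.3f}")
```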

Meaning of Correlation, Importance

Correlation, in the finance and investment industries, is a statistic that measures the degree to which two securities move in relation to each other. Correlations are used in advanced portfolio management and are computed as the correlation coefficient, which has a value that must fall between -1.0 and +1.0.

A perfect positive correlation means that the correlation coefficient is exactly 1. This implies that as one security moves, either up or down, the other security moves in lockstep, in the same direction. A perfect negative correlation means that the correlation coefficient is exactly -1, so the two assets move in opposite directions in lockstep, while a zero correlation implies no linear relationship at all.
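As a rough illustration of these three cases, the following Python sketch computes the Pearson correlation coefficient from its textbook definition. The data series are made up for the example; note that the coefficient captures only linear association, so the last series has a coefficient of zero even though its values follow a clear pattern.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

x = [1, 2, 3, 4, 5]
same_direction = [2, 4, 6, 8, 10]      # rises in lockstep with x
opposite_direction = [10, 8, 6, 4, 2]  # falls in lockstep as x rises
no_linear_trend = [2, 1, 0, 1, 2]      # no overall upward or downward trend

r_pos = pearson_r(x, same_direction)
r_neg = pearson_r(x, opposite_direction)
r_zero = pearson_r(x, no_linear_trend)
print(r_pos, r_neg, round(r_zero, 12))
```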

For example, large-cap mutual funds generally have a high positive correlation to the Standard and Poor’s (S&P) 500 Index – very close to 1. Small-cap stocks have a positive correlation to that same index, but it is not as high – generally around 0.8.

However, put option prices and their underlying stock prices will tend to have a negative correlation. As the stock price increases, the put option prices go down. This is a direct and high-magnitude negative correlation.

  • Correlation is a statistic that measures the degree to which two variables move in relation to each other.
  • In finance, correlation can measure the movement of a stock relative to that of a benchmark index; it is one of the inputs to the stock's beta.
  • Correlation measures association, but does not tell you if x causes y or vice versa, or if the association is caused by some third (perhaps unseen) factor.

Importance of Correlation Analysis

Correlation is very important in the fields of psychology and education as a measure of the relationship between test scores and other measures of performance. With the help of correlation, it is possible to form a reasonable idea of a person's working capacity and to learn about the various qualities of an individual.

After finding the correlation between different qualities of an individual, it is also possible to provide vocational guidance. Correlation is likewise helpful in giving a student educational guidance on the selection of subjects of study.

Correlation Statistics and Investing

The correlation between two variables is particularly helpful when investing in the financial markets. For example, a correlation can be helpful in determining how well a mutual fund performs relative to its benchmark index, or another fund or asset class. By adding a low or negatively correlated mutual fund to an existing portfolio, the investor gains diversification benefits.

In other words, investors can use negatively-correlated assets or securities to hedge their portfolio and reduce market risk due to volatility or wild price fluctuations. Many investors hedge the price risk of a portfolio, which effectively reduces any capital gains or losses because they want the dividend income or yield from the stock or security.
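The diversification effect can be illustrated with the standard formula for the volatility of a two-asset portfolio. The sketch below is a simplified illustration, not an investment model: the weights and volatilities are hypothetical, and the function name is chosen only for this example.

```python
import math

def portfolio_volatility(w1, sigma1, sigma2, rho):
    """Volatility of a two-asset portfolio with weights w1 and 1 - w1."""
    w2 = 1.0 - w1
    variance = ((w1 * sigma1) ** 2 + (w2 * sigma2) ** 2
                + 2 * w1 * w2 * rho * sigma1 * sigma2)
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(variance, 0.0))

# Hypothetical assets: 20% and 10% volatility, held in equal weights.
for rho in (1.0, 0.5, 0.0, -0.5, -1.0):
    vol = portfolio_volatility(0.5, 0.20, 0.10, rho)
    print(f"correlation {rho:+.1f} -> portfolio volatility {vol:.4f}")
```

With these inputs the portfolio volatility falls steadily as the correlation moves from +1 towards -1, which is the diversification benefit described above.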

Correlation statistics also allow investors to determine when the correlation between two variables changes. For example, bank stocks typically have a highly positive correlation to interest rates since loan rates are often calculated based on market interest rates. If the stock price of a bank is falling while interest rates are rising, investors can glean that something’s askew. If the stock prices of similar banks in the sector are rising, investors can conclude that the declining bank stock is not due to interest rates. Instead, the poorly performing bank is likely dealing with an internal, fundamental issue.

Coefficient of Variation

The coefficient of variation (CV) is a statistical measure of the dispersion of data points in a data series around the mean. The coefficient of variation represents the ratio of the standard deviation to the mean, and it is a useful statistic for comparing the degree of variation from one data series to another, even if the means are drastically different from one another.

The formula for the coefficient of variation is:

CV = σ / μ

where σ is the standard deviation and μ is the mean.

The coefficient of variation shows the extent of variability of data in a sample in relation to the mean of the population. In finance, the coefficient of variation allows investors to determine how much volatility, or risk, is assumed in comparison to the amount of return expected from investments. The lower the ratio of standard deviation to mean return, the better the risk-return trade-off. Note that if the expected return in the denominator is negative or zero, the coefficient of variation could be misleading.
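A minimal Python sketch of this ratio is given below. The two return series are hypothetical, the population standard deviation is assumed, and the function simply refuses to compute a value when the mean is zero or negative, reflecting the caveat above.

```python
import statistics

def coefficient_of_variation(values):
    """Ratio of the population standard deviation to the mean."""
    mean = statistics.mean(values)
    if mean <= 0:
        raise ValueError("CV is misleading when the mean is zero or negative")
    return statistics.pstdev(values) / mean

# Hypothetical annual returns (in percent) with the same mean of 10.
steady_fund = [8, 10, 12]
volatile_fund = [2, 10, 18]

cv_steady = coefficient_of_variation(steady_fund)
cv_volatile = coefficient_of_variation(volatile_fund)
print(f"steady: {cv_steady:.3f}, volatile: {cv_volatile:.3f}")
```

Both funds have the same mean return, so the lower coefficient for the steady fund indicates less risk assumed per unit of expected return.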

The coefficient of variation is helpful when using the risk/reward ratio to select investments. For example, an investor who is risk-averse may want to consider assets with a historically low degree of volatility and a high degree of return, in relation to the overall market or its industry. Conversely, risk-seeking investors may look to invest in assets with a historically high degree of volatility.

While most often used to analyze dispersion around the mean, quartile, quintile, or decile CVs can also be used to understand variation around the median or 10th percentile, for example.

Mean Deviation and Standard Deviation

Mean Deviation

Mean deviation is a measure of dispersion that indicates the average of the absolute differences between each data point and the mean (or median) of the dataset. It provides an overall sense of how much the values deviate from the central value. To calculate mean deviation, the absolute differences between each data point and the central measure are summed and then divided by the number of observations. Unlike variance, mean deviation is expressed in the same units as the data and is less sensitive to extreme outliers.

The basic formula for mean deviation is:

Mean Deviation = (Sum of absolute values of deviations from ‘a’) ÷ (Number of observations)

where ‘a’ is the chosen central value (the mean or the median).
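The calculation can be sketched in a few lines of Python. The data values are made up for illustration, and the `about` parameter is just a convenience introduced here for choosing the central value.

```python
import statistics

def mean_deviation(values, about="mean"):
    """Average absolute deviation from the mean (default) or the median."""
    if about == "mean":
        center = statistics.mean(values)
    else:
        center = statistics.median(values)
    return sum(abs(v - center) for v in values) / len(values)

data = [2, 4, 6, 8, 20]  # mean is 8, median is 6
md_mean = mean_deviation(data, about="mean")
md_median = mean_deviation(data, about="median")
print(md_mean, md_median)
```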

Standard Deviation

Standard deviation is a widely used measure of dispersion that indicates the typical amount by which data points deviate from the mean. It is calculated by first finding the variance, which is the average of squared deviations, and then taking the square root of the variance. Standard deviation provides a more interpretable measure of spread, as it is in the same units as the original data. A higher standard deviation indicates greater variability, while a lower value indicates that data points lie closer to the mean, that is, less spread and greater consistency.

It is usually represented by σ. It uses the arithmetic mean of the distribution as the reference point and averages the squared deviations of all the data values from this mean.

Therefore, we define the standard deviation of the distribution of a variable X with n data points x1, x2, … , xn and arithmetic mean x̄ as:

$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$
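The two steps described above (average the squared deviations from the mean, then take the square root) can be sketched in Python as follows. The data set is a small illustrative example, and the population form (dividing by n) is assumed.

```python
import math
import statistics

def population_sd(values):
    """Square root of the average squared deviation from the mean."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return math.sqrt(variance)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean is 5, squared deviations sum to 32
sd = population_sd(data)
print(sd)  # 2.0

# Cross-check against the standard library's population standard deviation.
assert math.isclose(sd, statistics.pstdev(data))
```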

Quartile Deviation

The Quartile Deviation is a simple way to estimate the spread of a distribution about a measure of its central tendency (usually the median). It gives you an idea of the range within which the central 50% of your sample data lies. Based on the quartile deviation, the Coefficient of Quartile Deviation can be defined, which makes it easy to compare the spread of two or more different distributions. Since both of these topics are based on the concept of quartiles, we’ll first understand how to calculate the quartiles of a dataset before working with the formulae directly.

Quartiles

A median divides a given dataset (which is already sorted) into two equal halves. Similarly, the quartiles divide a given dataset into four equal parts. Therefore, there are three quartiles for a given distribution, and the second quartile is equal to the median itself. We’ll deal with the other two quartiles in this section.

  • The first quartile (also called the lower quartile or the 25th percentile, and denoted by Q1) corresponds to the value that lies halfway between the median and the lowest value in the distribution (when it is sorted in ascending order). Hence, it marks the region that encloses the first 25% of the data.
  • Similarly, the third quartile (also called the upper quartile or the 75th percentile, and denoted by Q3) corresponds to the value that lies halfway between the median and the highest value in the distribution (when it is sorted in ascending order). It therefore marks the region that encloses the first 75% of the data, leaving the last 25% above it.

[Figure: representation of the quartiles of a Gaussian distribution]

The Quartile Deviation

Formally, the Quartile Deviation is equal to half of the Inter-Quartile Range, and thus we can write it as:

Qd=(Q3–Q1)/2

Therefore, we also call it the Semi Inter-Quartile Range.
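A short Python sketch of the calculation is shown below. It assumes the default ("exclusive") quartile convention of the standard library's statistics.quantiles; other conventions for locating quartiles can give slightly different values for the same data. The data set is illustrative only.

```python
import statistics

def quartile_deviation(values):
    """Half of the inter-quartile range, Qd = (Q3 - Q1) / 2."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return (q3 - q1) / 2

data = [2, 4, 6, 8, 10, 12, 14]
q1, q2, q3 = statistics.quantiles(data, n=4)
qd = quartile_deviation(data)
print(q1, q2, q3, qd)
```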

  • The Quartile Deviation doesn’t take into account the extreme points of the distribution. Thus, the dispersion or the spread of only the central 50% data is considered.
  • If the scale of the data is changed, the Qd also changes in the same ratio.
  • It is the best measure of dispersion for open-ended distributions (those whose extreme classes are open-ended).
  • Also, it is less affected by sampling fluctuations in the dataset as compared to the range (another measure of dispersion).
  • Since it is solely dependent on the central values in the distribution, if in any experiment, these values are abnormal or inaccurate, the result would be affected drastically.

The Coefficient of Quartile Deviation

Based on the quartiles, a relative measure of dispersion, known as the Coefficient of Quartile Deviation, can be defined for any distribution. It is formally defined as:

Coefficient of Quartile Deviation = {(Q3–Q1)/(Q3+Q1)}×100

Since it involves a ratio of two quantities of the same dimensions, it is unit-less. Thus, it can act as a suitable parameter for comparing two or more different datasets which may or may not involve quantities with the same dimensions.
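Because the coefficient is unit-less, it can compare datasets measured in different units. The Python sketch below illustrates this with hypothetical heights and weights; it again assumes the default quartile convention of statistics.quantiles, and the function name is chosen only for this example.

```python
import statistics

def coefficient_of_quartile_deviation(values):
    """(Q3 - Q1) / (Q3 + Q1) x 100, a relative measure of dispersion."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return (q3 - q1) / (q3 + q1) * 100

# Hypothetical measurements in different units.
heights_cm = [150, 155, 160, 165, 170, 175, 180]
weights_kg = [50, 55, 60, 65, 70, 75, 80]

cqd_heights = coefficient_of_quartile_deviation(heights_cm)
cqd_weights = coefficient_of_quartile_deviation(weights_kg)
print(f"heights: {cqd_heights:.2f}, weights: {cqd_weights:.2f}")
```

Both datasets have the same quartile deviation in absolute terms (10 units), yet the weights show the larger relative spread because their quartiles are smaller in magnitude.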

Range

The Range of a distribution gives a measure of the width (or the spread) of the data values of the corresponding random variable. For example, if there are two random variables X and Y such that X corresponds to the age of human beings and Y corresponds to the age of turtles, we know from general knowledge that the values of the variable corresponding to the age of turtles should be spread over a larger interval.

Since the average lifespan of humans is 50-60 years, while that of turtles is about 150-200 years, the values taken by the random variable Y are spread out from 0 to 250 or more, while those of X have a smaller range. Thus, qualitatively, you have already understood what the Range of a distribution means. The mathematical formula for it is:

Range=L–S

where L is the largest (maximum) value attained by the random variable under consideration and S is the smallest (minimum) value.
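In Python, the Range is simply the difference between the largest and smallest values. The ages below are hypothetical figures chosen to mirror the humans-versus-turtles comparison.

```python
def value_range(values):
    """Range = L - S, the largest value minus the smallest value."""
    return max(values) - min(values)

# Hypothetical ages in years (illustrative only).
human_ages = [3, 18, 25, 41, 60, 79]
turtle_ages = [5, 40, 90, 130, 175, 210]

human_range = value_range(human_ages)
turtle_range = value_range(turtle_ages)
print(human_range, turtle_range)  # 76 205
```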

Properties

  • The Range of a given distribution has the same units as the data points.
  • If a random variable is transformed into a new random variable by a change of scale and a shift of origin as –

Y = aX + b

where Y – the new random variable, X – the original random variable and a,b – constants. Then the ranges of X and Y can be related as –

RY = |a|RX

Clearly, the shift in origin doesn’t affect the shape of the distribution, and therefore its spread (or the width) remains unchanged. Only the scaling factor is important.

  • For a grouped class distribution, the Range is defined as the difference between the two extreme class boundaries.
  • A better measure of the spread of a distribution is the Coefficient of Range, given by:

Coefficient of Range (expressed as a percentage) = {(L–S)/(L+S)}×100

Clearly, we need to take the ratio between the Range and the total (combined) extent of the distribution. Besides, since it is a ratio, it is dimensionless, and one can therefore use it to compare the spreads of two or more different distributions as well.

  • The Range is an absolute measure of dispersion of a distribution, while the Coefficient of Range is a relative measure of dispersion.
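Two of the properties above can be checked numerically: the effect of the transformation Y = aX + b on the Range, and the Coefficient of Range. The Python sketch below uses made-up values, with a = -2 and b = 5 chosen arbitrarily, and it assumes L + S is not zero.

```python
def value_range(values):
    """Range = L - S."""
    return max(values) - min(values)

def coefficient_of_range(values):
    """(L - S) / (L + S) x 100; assumes L + S is not zero."""
    largest, smallest = max(values), min(values)
    return (largest - smallest) / (largest + smallest) * 100

x = [3, 7, 12, 20]
a, b = -2, 5                     # arbitrary scale factor and shift of origin
y = [a * value + b for value in x]

# The shift b drops out; only |a| rescales the Range.
print(value_range(x), value_range(y))        # 17 34
print(round(coefficient_of_range(x), 2))     # 73.91
```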

Because it considers only the end-points of a distribution, the Range never gives us any information about the shape of the distribution curve between the extreme points. Thus, we must move on to better measures of dispersion. One such quantity is the Mean Deviation.

Median Characteristics, Applications and Limitations

Median is a measure of central tendency that represents the middle value of an ordered dataset, dividing it into two equal halves. If the dataset has an odd number of values, the median is the middle value. If the dataset has an even number, it is the average of the two middle values. The median is less affected by outliers, making it useful for skewed data or non-uniform distributions.

Example:

The marks of nine students in a geography test that had a maximum possible mark of 50 are given below:

     47     35     37     32     38     39     36     34     35

Find the median of this set of data values.

Solution:

Arrange the data values in order from the lowest value to the highest value:

    32     34     35     35     36     37     38     39     47

The fifth data value, 36, is the middle value in this arrangement.
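The same procedure (sort the values, then pick the middle one, or average the two middle ones when the count is even) can be sketched in Python using the marks from this example:

```python
import statistics

def median(values):
    """Middle value of the sorted data; average of the two middle values if even."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

marks = [47, 35, 37, 32, 38, 39, 36, 34, 35]
print(median(marks))          # 36
print(median([1, 2, 3, 4]))   # 2.5

# The standard library gives the same result.
assert median(marks) == statistics.median(marks)
```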

Characteristics of Median:

  1. Middle Value of Data

The median divides a dataset into two equal halves, with 50% of the values lying below it and 50% above it. It is determined by arranging data in ascending or descending order.

  2. Resistant to Outliers

The median is not influenced by extreme values or outliers. This makes it a more robust measure for datasets with significant variability or skewness.

  3. Applicable to Ordinal and Quantitative Data

The median can be calculated for ordinal data (where data can be ranked) and quantitative data. It is not suitable for nominal data, as there is no inherent order.

  4. Unique Value

For any given dataset, the median is always unique and provides a single central value, ensuring consistency in its interpretation.

  5. Requires Data Sorting

The calculation of the median necessitates ordering the data values. Without arranging the data, the median cannot be identified.

  6. Effective for Skewed Distributions

In skewed datasets, the median better represents the center compared to the mean, as it remains unaffected by the skewness.

  7. Not Affected by Sample Size

Median’s calculation is straightforward and remains valid regardless of the sample size, as long as the data is properly ordered.

Applications of Median:

  1. Income and Wealth Distribution

In economics and social studies, the median is used to analyze income and wealth distributions. For example, the median income indicates the income level at which half the population earns less and half earns more. It is more accurate than the mean in scenarios with extreme disparities, such as high-income earners skewing the average.

  2. Real Estate Market Analysis

Median is commonly applied in the real estate industry to determine the central value of property prices. Median house prices are preferred over averages because they are less affected by outliers, such as extremely high or low-priced properties.

  3. Educational Assessments

In education, the median is used to evaluate student performance. For example, the median test score helps identify the middle-performing student, providing a fair representation when the scores are unevenly distributed.

  4. Medical and Health Statistics

Median is often employed in health sciences to summarize data such as median survival rates or recovery times. These metrics are crucial when the data includes extreme cases or a non-symmetric distribution.

  5. Demographic Studies

Median age, household size, and other demographic measures are widely used in population studies. These metrics provide insights into the central characteristics of populations while avoiding distortion by extremes.

  6. Transportation Planning

In transportation and traffic analysis, the median is used to determine the typical travel time or commute duration. It offers a realistic measure when the data includes unusually long or short travel times.

Demerits or Limitations of Median:

  1. Because the median ignores the magnitude of extreme items, however large, it sometimes fails to remain representative of the series.
  2. It is affected much more by fluctuations of sampling than the arithmetic mean (A.M.).
  3. The median cannot be used for further algebraic treatment. Unlike the A.M., it gives neither the total of the terms nor the median of several groups when they are combined.
  4. In a continuous series it has to be interpolated. We can find its true value only if the frequencies are uniformly spread over the whole class interval in which the median lies.
  5. If the number of items in the series is even, we can only estimate it, since the A.M. of the two middle terms is taken as the median.

Mode, Characteristics, Applications and Limitations

Mode is a measure of central tendency that identifies the most frequently occurring value or values in a dataset. Unlike the mean or median, the mode can be used for both numerical and categorical data. A dataset may have one mode (unimodal), more than one mode (bimodal or multimodal), or no mode at all if no value repeats. The mode is particularly useful for understanding trends in categorical data, such as the most popular product, common response, or frequent event, and is less sensitive to outliers compared to other central tendency measures.

Examples:

For example, in the following list of numbers, 16 is the mode since it appears more times than any other number in the set:

  • 3, 3, 6, 9, 16, 16, 16, 27, 27, 37, 48

A set of numbers can have more than one mode (this is known as bimodal if there are 2 modes) if there are multiple numbers that occur with equal frequency, and more times than the others in the set.

  • 3, 3, 3, 9, 16, 16, 16, 27, 37, 48

In the above example, both the number 3 and the number 16 are modes as they each occur three times and no other number occurs more than that.

If no number in a set of numbers occurs more than once, that set has no mode:

  • 3, 6, 9, 16, 27, 37, 48
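The three cases above (one mode, two modes, no mode) can be reproduced with a short Python sketch. The helper below follows the convention used here, returning an empty list when no value repeats; this differs from the standard library's statistics.multimode, which returns every value in that situation.

```python
from collections import Counter

def modes(values):
    """All most-frequent values, or [] if no value occurs more than once."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # every value appears once, so there is no mode
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([3, 3, 6, 9, 16, 16, 16, 27, 27, 37, 48]))  # [16]
print(modes([3, 3, 3, 9, 16, 16, 16, 27, 37, 48]))      # [3, 16]
print(modes([3, 6, 9, 16, 27, 37, 48]))                 # []
```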

Characteristics of Mode:

  • Can Be Used for Qualitative and Quantitative Data

Mode can be applied to both qualitative (categorical) and quantitative data. For example, in market research, the mode can identify the most common product color or customer preference.

  • Not Affected by Outliers

The mode is not influenced by extreme values or outliers in a dataset. For instance, in a dataset of salaries where most values are clustered around a certain range but a few extreme salaries exist, the mode will still reflect the most frequent salary, making it a useful measure when dealing with skewed data or anomalies.

  • May Have Multiple Values

A dataset may have more than one mode. If there are two values that occur with the same highest frequency, the dataset is considered bimodal. If there are more than two, it is multimodal. In such cases, the mode provides insight into multiple frequent occurrences within the dataset, unlike the mean or median, which offer a single value.

  • Can Be Uniquely Defined or Undefined

In some datasets, there may be no mode if all values occur with equal frequency. For example, in a dataset where every value appears only once, the mode is undefined. Conversely, in datasets with a clear most frequent value, the mode is uniquely defined.

  • Easy to Calculate

The mode is simple to compute. It only requires identifying the value that appears most frequently in the dataset. No complex formulas or data manipulations are needed, making it a straightforward measure for quick analysis.

  • Useful for Categorical Data

The mode is especially useful for categorical data where numerical calculations do not apply. For instance, in surveys where respondents choose their favorite color, the mode will show the most popular choice, providing valuable insights in marketing or social studies.

Applications of Mode:

  1. Market Research

In market research, the mode is used to identify the most popular product, service, or customer preference. For example, if a survey is conducted to determine consumers’ favorite brands, the mode will highlight the brand chosen most frequently, helping businesses focus on popular trends.

  2. Fashion and Retail Industry

The mode is widely used in the fashion and retail sectors to determine popular product styles, colors, or sizes. For example, if a clothing store wants to know the most commonly bought color of a particular item, the mode will provide the answer, guiding inventory decisions and promotional strategies.

  3. Educational Testing

In educational assessments, the mode can be used to determine the most common score or grade achieved by students in a test or examination. This helps educators identify common performance trends and understand the difficulty level of the assessment.

  4. Health and Medical Statistics

In healthcare, the mode is used to find the most common age group, symptom, or diagnosis within a population. For example, in a study of common diseases, the mode can reveal the most frequently occurring disease or the most prevalent age group affected, providing insights into public health needs.

  5. Consumer Behavior Analysis

In consumer behavior studies, the mode is used to determine the most frequently chosen option in surveys and polls. For instance, it can highlight the most common reasons for customer dissatisfaction or preferences regarding product features, aiding companies in product development and customer service strategies.

  6. Sports Statistics

In sports analytics, the mode is used to identify the most frequent performance metric. For example, the mode can be applied to identify the most common score in a set of matches or the most frequent outcome of a particular game, assisting coaches and analysts in understanding patterns in performance.

Advantages:

  • It is easy to understand and simple to calculate.
  • It is not affected by extremely large or small values.
  • It can be located just by inspection in un-grouped data and discrete frequency distribution.
  • It can be useful for qualitative data.
  • It can be computed in an open-end frequency table.
  • It can be located graphically.

Disadvantages:

  • It is not well defined.
  • It is not based on all the values.
  • It is stable only for large datasets, so it may not be well defined if the data consists of a small number of values.
  • It is not capable of further mathematical treatment.
  • Sometimes the data has more than one mode, and sometimes the data has no mode at all.

Harmonic Mean Characteristics, Applications and Limitations

A simple way to define a harmonic mean is to call it the reciprocal of the arithmetic mean of the reciprocals of the observations. The most important criterion for it is that none of the observations should be zero.

A harmonic mean is used in averaging of ratios. The most common examples of ratios are that of speed and time, cost and unit of material, work and time etc. The harmonic mean (H.M.) of n observations is

$$\text{H.M.} = \frac{1}{\frac{1}{n}\sum_{i=1}^{n}\frac{1}{x_i}}$$

In the case of frequency distribution, a harmonic mean is given by

$$\text{H.M.} = \frac{1}{\frac{1}{N}\sum_{i=1}^{n}\frac{f_i}{x_i}}, \quad \text{where } N = \sum_{i=1}^{n} f_i$$
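Both forms can be sketched in Python. The speeds used are a made-up illustration of averaging a ratio: travelling the same distance at 40 km/h and then at 60 km/h gives an average speed equal to the harmonic mean, not the arithmetic mean. The function names are chosen only for this example.

```python
def harmonic_mean(values):
    """Reciprocal of the arithmetic mean of the reciprocals."""
    if any(v == 0 for v in values):
        raise ValueError("harmonic mean is undefined when an observation is zero")
    return len(values) / sum(1 / v for v in values)

def harmonic_mean_grouped(values, frequencies):
    """Harmonic mean of a frequency distribution: N / sum(f_i / x_i)."""
    total = sum(frequencies)
    return total / sum(f / v for v, f in zip(values, frequencies))

speeds = [40, 60]  # km/h over two equal distances (hypothetical)
hm = harmonic_mean(speeds)
hm_grouped = harmonic_mean_grouped([40, 60], [1, 3])
print(round(hm, 6), round(hm_grouped, 6))
```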

Properties of Harmonic Mean

  • If all the observations taken by a variable are equal to a constant, say k, then the harmonic mean of the observations is also k
  • For positive observations, the harmonic mean is never greater than the geometric mean, which in turn is never greater than the arithmetic mean (H.M. ≤ G.M. ≤ A.M.)
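These two properties can be checked numerically. The Python sketch below uses a small set of positive values chosen for illustration and computes the three means directly from their definitions.

```python
import math

data = [2, 4, 8]  # positive observations (illustrative)
n = len(data)

arithmetic = sum(data) / n
geometric = math.prod(data) ** (1 / n)
harmonic = n / sum(1 / v for v in data)

# For positive data the ordering H.M. <= G.M. <= A.M. holds.
print(round(harmonic, 4), round(geometric, 4), round(arithmetic, 4))

# When every observation equals the same constant k, the harmonic mean is k.
constant = [5, 5, 5]
harmonic_constant = len(constant) / sum(1 / v for v in constant)
```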

Advantages of Harmonic Mean

  • A harmonic mean is rigidly defined
  • It is based upon all the observations
  • It is not much affected by fluctuations of sampling
  • More weight is given to smaller items

Disadvantages of Harmonic Mean

  • Not easily understandable
  • Difficult to compute

Geometric Mean Characteristics, Applications and Limitations

A geometric mean is a mean or average which shows the central tendency of a set of numbers by using the product of their values. For a set of n observations, a geometric mean is the nth root of their product. The geometric mean G.M., for a set of numbers x1, x2, … , xn is given as

$$\text{G.M.} = (x_1 \cdot x_2 \cdots x_n)^{1/n}$$

or, equivalently,

$$\text{G.M.} = \left(\prod_{i=1}^{n} x_i\right)^{1/n} = \sqrt[n]{x_1 x_2 \cdots x_n}$$

The geometric mean of two numbers, say x and y, is the square root of their product x×y. For three numbers x, y and z, it is the cube root of their product, i.e., (x y z)^(1/3).
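A minimal Python sketch of the nth-root definition follows. It assumes strictly positive observations, and the growth factors in the last example are hypothetical figures used only to show a typical use.

```python
import math

def geometric_mean(values):
    """nth root of the product of n strictly positive observations."""
    if any(v <= 0 for v in values):
        raise ValueError("this sketch assumes strictly positive observations")
    return math.prod(values) ** (1 / len(values))

print(geometric_mean([4, 9]))               # square root of 36
print(round(geometric_mean([2, 4, 8]), 6))  # cube root of 64

# Hypothetical yearly growth factors: +10%, +20%, -10%.
average_growth = geometric_mean([1.10, 1.20, 0.90])
```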

Properties of Geometric Means

  • The logarithm of geometric mean is the arithmetic mean of the logarithms of given values
  • If all the observations assumed by a variable are equal to a constant, say K > 0, then the G.M. of the observations is also K
  • The geometric mean of the ratio of two variables is the ratio of the geometric means of the two variables
  • The geometric mean of the product of two variables is the product of their geometric means
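The logarithm, ratio and product properties listed above can be verified numerically. The following Python sketch uses two small made-up series of positive values.

```python
import math

def geometric_mean(values):
    """nth root of the product of n positive observations."""
    return math.prod(values) ** (1 / len(values))

x = [2.0, 8.0, 4.0]
y = [1.0, 2.0, 4.0]

# log(G.M.) equals the arithmetic mean of the logarithms.
mean_of_logs = sum(math.log(v) for v in x) / len(x)
gm_from_logs = math.exp(mean_of_logs)

# G.M. of products and of ratios, taken pair by pair.
gm_products = geometric_mean([a * b for a, b in zip(x, y)])
gm_ratios = geometric_mean([a / b for a, b in zip(x, y)])

print(math.isclose(gm_from_logs, geometric_mean(x)))
print(math.isclose(gm_products, geometric_mean(x) * geometric_mean(y)))
print(math.isclose(gm_ratios, geometric_mean(x) / geometric_mean(y)))
```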

Advantages of Geometric Mean

  • A geometric mean is based upon all the observations
  • It is rigidly defined
  • It is not much affected by fluctuations of sampling
  • It gives more weight to small items

Disadvantages of Geometric Mean

  • A geometric mean is not easily understandable by a non-mathematical person
  • If any of the observations is zero, the geometric mean becomes zero
  • If any of the observations is negative, the geometric mean may become imaginary