Scatter Diagram

The scatter diagram method is the simplest way to study the correlation between two variables. Each pair of values is plotted on a graph as a dot, giving as many points as there are observations. The degree of correlation is then judged from the way the points are scattered.

The degree to which the variables are related depends on how the points are scattered over the chart. The more widely the points are scattered, the lower the degree of correlation between the variables. The closer the points lie to a straight line, the higher the degree of correlation. The degree of correlation is denoted by “r”.

The following types of scatter diagrams tell about the degree of correlation between variable X and variable Y.

1. Perfect Positive Correlation (r = +1):

The correlation is said to be perfectly positive when all the points lie on a straight line rising from the lower left-hand corner to the upper right-hand corner.

2. Perfect Negative Correlation (r = -1):

When all the points lie on a straight line falling from the upper left-hand corner to the lower right-hand corner, the variables are said to be perfectly negatively correlated.

3. High Degree of +Ve Correlation (r = + High):

The degree of correlation is high when the plotted points fall within a narrow band, and it is positive when they show a rising tendency from the lower left-hand corner to the upper right-hand corner.

4. High Degree of –Ve Correlation (r = – High):

The degree of negative correlation is high when the plotted points fall within a narrow band and show a declining tendency from the upper left-hand corner to the lower right-hand corner.

5. Low degree of +Ve Correlation (r = + Low):

The correlation between the variables is said to be low but positive when the points are highly scattered over the graph and show a rising tendency from the lower left-hand corner to the upper right-hand corner.

6. Low Degree of –Ve Correlation (r = – Low):

The degree of correlation is low and negative when the points are widely scattered over the graph and show a falling tendency from the upper left-hand corner to the lower right-hand corner.

7. No Correlation (r = 0):

The variables are said to be unrelated when the points are haphazardly scattered over the graph and do not show any specific pattern. Here the correlation is absent and hence r = 0.

Thus, the scatter diagram method is the simplest device for studying the degree of relationship between variables: a dot is plotted for each pair of values. The chart on which the dots are plotted is also called a dotogram.
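The scatter diagram gives a visual impression of r; the value itself can be computed. The sketch below uses the usual product-moment (Karl Pearson) formula on small made-up data sets, so the function name `pearson_r` and all the numbers are illustrative assumptions rather than part of the original text.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # every point on a rising straight line
falling = [10, 8, 6, 4, 2]   # every point on a falling straight line
noisy = [2, 5, 4, 9, 8]      # rising tendency with some scatter

print(pearson_r(x, rising))           # perfect positive correlation
print(pearson_r(x, falling))          # perfect negative correlation
print(round(pearson_r(x, noisy), 3))  # high, but below +1
```

The first two data sets correspond to cases 1 and 2 above; the third falls within a narrow rising band, as in case 3.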

Methods of Studying Correlation

The Correlation is a statistical tool used to measure the relationship between two or more variables, i.e. the degree to which the variables are associated with each other, such that the change in one is accompanied by the change in another.

The correlation is said to be linear when the change in one variable tends to bear a constant ratio to the change in the other variable. The correlation is non-linear (curvilinear) when the ratio of the change in one variable to the change in the other is not constant.

These figures clearly show the difference between the linear and non-linear correlation. To determine whether the relationship among the variables is linear or non-linear, and the extent to which they are correlated, the following methods are commonly used:

  1. Scatter Diagram Method
  2. Karl Pearson’s Coefficient of Correlation
  3. Spearman’s Rank Correlation Coefficient; and
  4. Method of Least Squares

Among these, the first method, i.e. the scatter diagram method, is based on the study of graphs, while the rest are mathematical methods that use formulae to calculate the degree of correlation between the variables. The researcher may apply any of these methods, depending on the nature of the variables whose association is being studied.
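As an illustration of one of the mathematical methods, here is a minimal sketch of Spearman's rank correlation using the textbook formula rho = 1 – 6Σd²/(n(n² – 1)), where d is the difference between the ranks of each pair. It assumes there are no tied values; the helper names and the data are hypothetical.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward. Assumes no tied values."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman_rho(xs, ys):
    """Spearman's rank correlation for untied data."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))

x = [1, 2, 3, 4, 5]
cubed = [v ** 3 for v in x]      # non-linear, but always rising
reversed_y = [10, 8, 7, 3, 1]    # always falling

print(spearman_rho(x, cubed))       # 1.0
print(spearman_rho(x, reversed_y))  # -1.0
```

Because the method works on ranks, a curvilinear relationship that always rises still gives a coefficient of +1, which is one reason to choose between the methods according to the nature of the data.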

Positive and Negative Correlation

Correlation can be defined as a statistical tool that describes the relationship between two variables. For example, correlation may be used to describe the relationship between the price of a good and its quantity demanded. It explains how two variables are related but does not explain any cause-effect relation. It only gives an understanding of the direction and intensity of the relation between two variables. Correlation can be of the following types:

A) Positive Correlation

A correlation in the same direction is called a positive correlation. If one variable increases the other also increases and when one variable decreases the other also decreases. For example, the length of an iron bar will increase as the temperature increases.

Two variables are positively correlated when they move together in the same direction. In economics, quantity supplied increases as the price increases. This is because sellers find it profitable to sell when the prices are high, so they will sell more. Thus, we can call price and quantity supplied to be positively correlated. This is also called the law of supply.

B) Negative Correlation

Correlation in the opposite direction is called a negative correlation. Here if one variable increases the other decreases and vice versa. For example, the volume of gas will decrease as the pressure increases, or the demand for a particular commodity increases as the price of such commodity decreases.

Two variables are negatively correlated if they move in opposite directions. For instance, as the price of a good increases, the quantity demanded declines, because the good has become more expensive than it was before the price rose. Thus, we can say that price and quantity demanded are negatively correlated. Note that this is the famous law of demand.

C) No Correlation or Zero Correlation

If there is no relationship between the two variables, such that a change in the value of one variable is not accompanied by any consistent change in the other, it is called no correlation or zero correlation.

Simple, Partial and Multiple Correlation

Whether the correlation is simple, partial or multiple depends on the number of variables studied. The correlation is said to be simple when only two variables are studied. The correlation is either multiple or partial when three or more variables are studied. The correlation is said to be multiple when three or more variables are studied simultaneously. For example, if we want to study the relationship between the yield of wheat per acre and both the amount of fertilizer used and the rainfall, then it is a problem of multiple correlation.

In the case of partial correlation, we study more than two variables but consider only two of them as influencing each other, with the effect of the other influencing variables held constant. In the above example, if we study the relationship between the yield and the fertilizer used during periods when a certain average temperature existed, then it is a problem of partial correlation.
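The text does not give a formula for partial correlation. One common textbook expression for three variables is r_xy.z = (r_xy – r_xz·r_yz) / √((1 – r_xz²)(1 – r_yz²)), where z is the variable held constant. The sketch below applies it to hypothetical coefficients; both the function name and the numbers are assumptions for illustration only.

```python
import math

def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y with z held constant."""
    numerator = r_xy - r_xz * r_yz
    denominator = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    return numerator / denominator

# Hypothetical simple correlations: x = yield, y = fertilizer, z = temperature.
r_yield_fert = 0.8
r_yield_temp = 0.5
r_fert_temp = 0.5

print(round(partial_correlation(r_yield_fert, r_yield_temp, r_fert_temp), 3))
```

With these made-up inputs the partial coefficient comes out lower than the simple coefficient of 0.8, because part of the observed association is shared with the third variable.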

Meaning of Correlation, Importance

Correlation, in the finance and investment industries, is a statistic that measures the degree to which two securities move in relation to each other. Correlations are used in advanced portfolio management and are computed as the correlation coefficient, whose value must fall between -1.0 and +1.0.

A perfect positive correlation means that the correlation coefficient is exactly 1. This implies that as one security moves, either up or down, the other security moves in lockstep, in the same direction. A perfect negative correlation means that two assets move in opposite directions, while a zero correlation implies no relationship at all.

For example, large-cap mutual funds generally have a high positive correlation to the Standard and Poor’s (S&P) 500 Index – very close to 1. Small-cap stocks have a positive correlation to that same index, but it is not as high – generally around 0.8.

However, put option prices and their underlying stock prices will tend to have a negative correlation. As the stock price increases, the put option prices go down. This is a direct and high-magnitude negative correlation.

  • Correlation is a statistic that measures the degree to which two variables move in relation to each other.
  • In finance, correlation can measure how a stock moves relative to a benchmark index; it is one ingredient of measures such as beta.
  • Correlation measures association, but does not tell you if x causes y or vice versa, or if the association is caused by some third (perhaps unseen) factor.

Importance of correlation Analysis

Correlation is very important in the fields of psychology and education as a measure of the relationship between test scores and other measures of performance. With its help, it is possible to form a reasonable idea of a person's working capacity and to learn about the various qualities of an individual.

Once the correlation between two or more qualities of an individual has been found, it can also be used to offer vocational guidance. Correlation is likewise helpful in providing educational guidance to students choosing their subjects of study.

Correlation Statistics and Investing

The correlation between two variables is particularly helpful when investing in the financial markets. For example, a correlation can be helpful in determining how well a mutual fund performs relative to its benchmark index, or another fund or asset class. By adding a low or negatively correlated mutual fund to an existing portfolio, the investor gains diversification benefits.

In other words, investors can use negatively-correlated assets or securities to hedge their portfolio and reduce market risk due to volatility or wild price fluctuations. Many investors hedge the price risk of a portfolio, which effectively reduces any capital gains or losses because they want the dividend income or yield from the stock or security.

Correlation statistics also allow investors to determine when the correlation between two variables changes. For example, bank stocks typically have a highly positive correlation to interest rates, since loan rates are often calculated based on market interest rates. If the stock price of a bank is falling while interest rates are rising, investors can glean that something is askew. If the stock prices of similar banks in the sector are rising, investors can conclude that the decline in that bank's stock is not due to interest rates. Instead, the poorly performing bank is likely dealing with an internal, fundamental issue.

Co-efficient of Variation

The coefficient of variation (CV) is a statistical measure of the dispersion of data points in a data series around the mean. The coefficient of variation represents the ratio of the standard deviation to the mean, and it is a useful statistic for comparing the degree of variation from one data series to another, even if the means are drastically different from one another.

The formula for the coefficient of variation is:

CV = σ / μ

where σ is the standard deviation and μ is the mean.

The coefficient of variation shows the extent of variability of the data in a sample in relation to the mean of the population. In finance, the coefficient of variation allows investors to determine how much volatility, or risk, is assumed in comparison to the amount of return expected from investments. The lower the ratio of standard deviation to mean return, the better the risk-return trade-off. Note that if the expected return in the denominator is negative or zero, the coefficient of variation could be misleading.
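A minimal sketch of the ratio of standard deviation to mean, comparing two series whose means differ widely. The return figures are made up, and the use of the population standard deviation (`statistics.pstdev`) is an assumption; a sample standard deviation could be used instead.

```python
import statistics

def coefficient_of_variation(values):
    """Return sigma / mu, using the population standard deviation."""
    mu = statistics.mean(values)
    if mu == 0:
        # The ratio is undefined here, matching the caveat about a zero mean.
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.pstdev(values) / mu

returns_a = [4, 6, 8]      # hypothetical returns (%), mean 6
returns_b = [10, 20, 30]   # hypothetical returns (%), mean 20

cv_a = coefficient_of_variation(returns_a)
cv_b = coefficient_of_variation(returns_b)
print(round(cv_a, 3), round(cv_b, 3))
```

Series B has the higher mean return, but it also has the higher ratio of volatility to return, so on this measure alone series A offers the better risk-return trade-off.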

The coefficient of variation is helpful when using the risk/reward ratio to select investments. For example, an investor who is risk-averse may want to consider assets with a historically low degree of volatility and a high degree of return, in relation to the overall market or its industry. Conversely, risk-seeking investors may look to invest in assets with a historically high degree of volatility.

While most often used to analyze dispersion around the mean, quartile, quintile, or decile CVs can also be used to understand variation around the median or 10th percentile, for example.

Mean Deviation and Standard Deviation

Mean Deviation

Mean deviation is a measure of dispersion that indicates the average of the absolute differences between each data point and the mean (or median) of the dataset. It provides an overall sense of how much the values deviate from the central value. To calculate mean deviation, the absolute differences between each data point and the central measure are summed and then divided by the number of observations. Unlike variance, mean deviation is expressed in the same units as the data and is less sensitive to extreme outliers.

The basic formula for finding out mean deviation is :

Mean Deviation = Sum of absolute values of deviations from ‘a’ ÷ The number of observations

where ‘a’ is the chosen central value (the mean or the median).
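The formula can be sketched directly in code. The data set is hypothetical; the `center` argument plays the role of the central value in the formula.

```python
import statistics

def mean_deviation(values, center):
    """Average absolute deviation of the values from a chosen central value."""
    return sum(abs(v - center) for v in values) / len(values)

data = [1, 2, 3, 4, 10]
about_mean = mean_deviation(data, statistics.mean(data))      # deviations from 4
about_median = mean_deviation(data, statistics.median(data))  # deviations from 3

print(about_mean, about_median)  # 2.4 2.2
```

For this data the deviation about the median is the smaller of the two.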

Standard Deviation

Standard deviation is a widely used measure of dispersion that indicates the typical amount by which data points deviate from the mean. It is calculated by first finding the variance, which is the average of the squared deviations, and then taking the square root of the variance. Standard deviation provides a more interpretable measure of spread, as it is in the same units as the original data. A higher standard deviation indicates greater variability, while a lower value indicates that the data points lie closer to the mean, that is, less spread and greater consistency.

Standard deviation is usually represented by σ. It uses the arithmetic mean of the distribution as the reference point and measures the deviation of all the data values from this mean.

Therefore, we define the formula for the standard deviation of the distribution of a variable X with n data points x1, x2, ..., xn and arithmetic mean x̄ as:

σ = √[ Σ (xi – x̄)² / n ]
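Following the description above (variance as the average of squared deviations, then its square root), a minimal sketch might look like this. It divides by n, which gives the population standard deviation; the data set is illustrative.

```python
import math
import statistics

def population_sd(values):
    """Square root of the average squared deviation from the mean."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return math.sqrt(variance)

data = [2, 4, 4, 4, 5, 5, 7, 9]   # mean 5, squared deviations sum to 32
sd = population_sd(data)
print(sd)  # 2.0

# Cross-check against the standard library's population standard deviation.
print(math.isclose(sd, statistics.pstdev(data)))
```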

Quartile Deviation

The Quartile Deviation is a simple way to estimate the spread of a distribution about a measure of its central tendency (usually the median). It gives you an idea of the range within which the central 50% of your sample data lies. Based on the quartile deviation, the Coefficient of Quartile Deviation can be defined, which makes it easy to compare the spread of two or more different distributions. Since both of these topics are based on the concept of quartiles, we’ll first understand how to calculate the quartiles of a dataset before working with the direct formulae.

Quartiles

A median divides a given dataset (which is already sorted) into two equal halves. Similarly, the quartiles divide a given dataset into four equal parts. Logically, therefore, there are three quartiles for a given distribution, and the second quartile is equal to the median itself. We’ll deal with the other two quartiles in this section.

  • The first quartile, also called the lower quartile or the 25th percentile and denoted by Q1, corresponds to the value that lies halfway between the median and the lowest value in the distribution (when it is sorted in ascending order). Hence, it marks the point below which the lowest 25% of the data lies.
  • Similarly, the third quartile, also called the upper quartile or the 75th percentile and denoted by Q3, corresponds to the value that lies halfway between the median and the highest value in the distribution (when it is sorted in ascending order). It therefore marks the point below which 75% of the data lies, and above which the top 25% lies.

For a better understanding, look at the representation below for a Gaussian Distribution:

The Quartile Deviation

Formally, the Quartile Deviation is equal to half of the Inter-Quartile Range, and thus we can write it as:

Qd=(Q3–Q1)/2

Therefore, we also call it the Semi Inter-Quartile Range.

  • The Quartile Deviation doesn’t take into account the extreme points of the distribution. Thus, the dispersion or the spread of only the central 50% data is considered.
  • If the scale of the data is changed, the Qd also changes in the same ratio.
  • It is the best measure of dispersion for open-ended distributions (those whose extreme classes are open-ended).
  • Also, it is less affected by sampling fluctuations in the dataset as compared to the range (another measure of dispersion).
  • Since it is solely dependent on the central values in the distribution, if in any experiment, these values are abnormal or inaccurate, the result would be affected drastically.

The Coefficient of Quartile Deviation

Based on the quartiles, a relative measure of dispersion, known as the Coefficient of Quartile Deviation, can be defined for any distribution. It is formally defined as:

Coefficient of Quartile Deviation = {(Q3–Q1)/(Q3+Q1)}×100

Since it involves a ratio of two quantities of the same dimensions, it is unit-less. Thus, it can act as a suitable parameter for comparing two or more different datasets which may or may not involve quantities with the same dimensions.
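A short sketch of both quantities. Several conventions exist for locating quartiles in a finite data set; this example assumes the "inclusive" method of Python's `statistics.quantiles`, and other conventions can give slightly different Q1 and Q3. The data are made up.

```python
import statistics

def quartile_measures(values):
    """Return Q1, Q3, the quartile deviation and its coefficient (in %)."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    qd = (q3 - q1) / 2                          # semi inter-quartile range
    coefficient = (q3 - q1) / (q3 + q1) * 100   # unit-less, as a percentage
    return q1, q3, qd, coefficient

data = [2, 4, 6, 8, 10, 12, 14, 16, 18]
q1, q3, qd, coeff = quartile_measures(data)
print(q1, q3, qd, round(coeff, 1))
```

Here Q1 = 6 and Q3 = 14, so Qd = (14 – 6)/2 = 4 and the coefficient is (8/20) × 100 = 40%.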

Range

The Range of a distribution gives a measure of the width (or the spread) of the data values of the corresponding random variable. For example, suppose there are two random variables X and Y, where X corresponds to the age of human beings and Y corresponds to the age of turtles. We know from general knowledge that the range of the variable corresponding to the age of turtles should be larger.

The typical lifespan of humans is 50-60 years, while that of turtles is about 150-200 years. The values taken by the random variable Y are therefore spread out from 0 to 250 or more, while those of X have a smaller range. Thus, qualitatively you’ve already understood what the Range of a distribution means. The mathematical formula for the same is given as:

Range=L–S

where L is the largest (maximum) value attained by the random variable under consideration and S is the smallest (minimum) value.

Properties

  • The Range of a given distribution has the same units as the data points.
  • If a random variable is transformed into a new random variable by a change of scale and a shift of origin as –

Y = aX + b

where Y – the new random variable, X – the original random variable and a,b – constants. Then the ranges of X and Y can be related as –

RY = |a|RX

Clearly, the shift in origin doesn’t affect the shape of the distribution, and therefore its spread (or the width) remains unchanged. Only the scaling factor is important.

  • For a grouped class distribution, the Range is defined as the difference between the two extreme class boundaries.
  • A better measure of the spread of a distribution is the Coefficient of Range, given by:

Coefficient of Range (expressed as a percentage) = {(L–S)/(L+S)}×100

Here we take the ratio between the Range and the sum of the two extreme values. Since it is a ratio, it is dimensionless, and one can therefore use it to compare the spreads of two or more different distributions.

  • The Range is an absolute measure of dispersion of a distribution, while the Coefficient of Range is a relative measure of dispersion.
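The properties above can be checked numerically. This sketch computes the Range and the Coefficient of Range for a small hypothetical data set and then applies a transformation Y = aX + b to confirm that the range scales by |a|; the constants a = -2 and b = 5 are arbitrary choices.

```python
def value_range(values):
    """Range = L - S, the largest value minus the smallest."""
    return max(values) - min(values)

def coefficient_of_range(values):
    """(L - S) / (L + S) x 100, a unit-less relative measure."""
    largest, smallest = max(values), min(values)
    return (largest - smallest) / (largest + smallest) * 100

x = [3, 7, 12, 20]
a, b = -2, 5
y = [a * v + b for v in x]   # change of scale and shift of origin

print(value_range(x))                     # 17
print(value_range(y))                     # 34, i.e. |a| times the range of x
print(round(coefficient_of_range(x), 2))  # 73.91
```

The shift b drops out of the range entirely; only the magnitude of the scale factor matters.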

Because it considers only the end-points of a distribution, the Range gives no information about the shape of the distribution curve between the extreme points. Thus, we must move on to better measures of dispersion. One such quantity is the Mean Deviation.

Median Characteristics, Applications and Limitations

Median is a measure of central tendency that represents the middle value of an ordered dataset, dividing it into two equal halves. If the dataset has an odd number of values, the median is the middle value. If the dataset has an even number, it is the average of the two middle values. The median is less affected by outliers, making it useful for skewed data or non-uniform distributions.

Example:

The marks of nine students in a geography test that had a maximum possible mark of 50 are given below:

     47     35     37     32     38     39     36     34     35

Find the median of this set of data values.

Solution:

Arrange the data values in order from the lowest value to the highest value:

    32     34     35     35     36     37     38     39     47

The fifth data value, 36, is the middle value in this arrangement, so the median is 36.
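The same procedure (sort, then take the middle value or the average of the two middle values) can be sketched as a small function. The odd-count case uses the marks from the example; the even-count case simply drops the first mark to illustrate averaging.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

marks = [47, 35, 37, 32, 38, 39, 36, 34, 35]
print(median(marks))      # 36
print(median(marks[1:]))  # 35.5 (eight values remain after dropping the 47)
```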

Characteristics of Median:

  1. Middle Value of Data

The median divides a dataset into two equal halves, with 50% of the values lying below it and 50% above it. It is determined by arranging data in ascending or descending order.

  2. Resistant to Outliers

The median is not influenced by extreme values or outliers. This makes it a more robust measure for datasets with significant variability or skewness.

  3. Applicable to Ordinal and Quantitative Data

The median can be calculated for ordinal data (where data can be ranked) and quantitative data. It is not suitable for nominal data, as there is no inherent order.

  4. Unique Value

For any given dataset, the median is always unique and provides a single central value, ensuring consistency in its interpretation.

  5. Requires Data Sorting

The calculation of the median necessitates ordering the data values. Without arranging the data, the median cannot be identified.

  6. Effective for Skewed Distributions

In skewed datasets, the median represents the center better than the mean, as it is far less affected by the skewness.

  7. Not Affected by Sample Size

The median’s calculation is straightforward and remains valid regardless of the sample size, as long as the data is properly ordered.

Applications of Median:

  1. Income and Wealth Distribution

In economics and social studies, the median is used to analyze income and wealth distributions. For example, the median income indicates the income level at which half the population earns less and half earns more. It is more accurate than the mean in scenarios with extreme disparities, such as high-income earners skewing the average.

  2. Real Estate Market Analysis

Median is commonly applied in the real estate industry to determine the central value of property prices. Median house prices are preferred over averages because they are less affected by outliers, such as extremely high or low-priced properties.

  3. Educational Assessments

In education, the median is used to evaluate student performance. For example, the median test score helps identify the middle-performing student, providing a fair representation when the scores are unevenly distributed.

  4. Medical and Health Statistics

Median is often employed in health sciences to summarize data such as median survival rates or recovery times. These metrics are crucial when the data includes extreme cases or a non-symmetric distribution.

  5. Demographic Studies

Median age, household size, and other demographic measures are widely used in population studies. These metrics provide insights into the central characteristics of populations while avoiding distortion by extremes.

  6. Transportation Planning

In transportation and traffic analysis, the median is used to determine the typical travel time or commute duration. It offers a realistic measure when the data includes unusually long or short travel times.

Demerits or Limitations of Median:

  1. The median is not much affected even when extreme items are very large; for this very reason, it is sometimes not representative of the series.
  2. It is affected much more by fluctuations of sampling than the arithmetic mean (A.M.).
  3. The median cannot be used for further algebraic treatment. Unlike the A.M., it cannot be used to find the total of the terms, nor can the median of combined groups be found from the medians of the individual groups.
  4. In a continuous series it has to be interpolated. We can find its true value only if the frequencies are uniformly spread over the whole class interval in which the median lies.
  5. If the number of observations is even, we can only estimate it, as the A.M. of the two middle terms is taken as the median.

Mode, Characteristics, Applications and Limitations

Mode is a measure of central tendency that identifies the most frequently occurring value or values in a dataset. Unlike the mean or median, the mode can be used for both numerical and categorical data. A dataset may have one mode (unimodal), more than one mode (bimodal or multimodal), or no mode at all if no value repeats. The mode is particularly useful for understanding trends in categorical data, such as the most popular product, common response, or frequent event, and is less sensitive to outliers compared to other central tendency measures.

Examples:

For example, in the following list of numbers, 16 is the mode since it appears more times than any other number in the set:

  • 3, 3, 6, 9, 16, 16, 16, 27, 27, 37, 48

A set of numbers can have more than one mode (this is known as bimodal if there are 2 modes) if there are multiple numbers that occur with equal frequency, and more times than the others in the set.

  • 3, 3, 3, 9, 16, 16, 16, 27, 37, 48

In the above example, both the number 3 and the number 16 are modes as they each occur three times and no other number occurs more than that.

If no number in a set of numbers occurs more than once, that set has no mode:

  • 3, 6, 9, 16, 27, 37, 48
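The three cases above can be reproduced with a small helper. This sketch follows the convention used in the text, where a set in which no value repeats has no mode, and so returns an empty list in that case; the function name `modes` is an assumption.

```python
from collections import Counter

def modes(values):
    """Return every most-frequent value, or [] if no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([3, 3, 6, 9, 16, 16, 16, 27, 27, 37, 48]))  # [16]
print(modes([3, 3, 3, 9, 16, 16, 16, 27, 37, 48]))      # [3, 16]
print(modes([3, 6, 9, 16, 27, 37, 48]))                 # []
```

Note that the standard library's `statistics.multimode` treats the last case differently: when every value occurs once, it returns all of them rather than an empty list.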

Characteristics of Mode:

  • Can Be Used for Qualitative and Quantitative Data

Mode can be applied to both qualitative (categorical) and quantitative data. For example, in market research, the mode can identify the most common product color or customer preference.

  • Not Affected by Outliers

The mode is not influenced by extreme values or outliers in a dataset. For instance, in a dataset of salaries where most values are clustered around a certain range but a few extreme salaries exist, the mode will still reflect the most frequent salary, making it a useful measure when dealing with skewed data or anomalies.

  • May Have Multiple Values

A dataset may have more than one mode. If there are two values that occur with the same highest frequency, the dataset is considered bimodal. If there are more than two, it is multimodal. In such cases, the mode provides insight into multiple frequent occurrences within the dataset, unlike the mean or median, which offer a single value.

  • Can Be Uniquely Defined or Undefined

In some datasets, there may be no mode if all values occur with equal frequency. For example, in a dataset where every value appears only once, the mode is undefined. Conversely, in datasets with a clear most frequent value, the mode is uniquely defined.

  • Easy to Calculate

The mode is simple to compute. It only requires identifying the value that appears most frequently in the dataset. No complex formulas or data manipulations are needed, making it a straightforward measure for quick analysis.

  • Useful for Categorical Data

The mode is especially useful for categorical data where numerical calculations do not apply. For instance, in surveys where respondents choose their favorite color, the mode will show the most popular choice, providing valuable insights in marketing or social studies.

Applications of Mode:

  1. Market Research

In market research, the mode is used to identify the most popular product, service, or customer preference. For example, if a survey is conducted to determine consumers’ favorite brands, the mode will highlight the brand chosen most frequently, helping businesses focus on popular trends.

  2. Fashion and Retail Industry

The mode is widely used in the fashion and retail sectors to determine popular product styles, colors, or sizes. For example, if a clothing store wants to know the most commonly bought color of a particular item, the mode will provide the answer, guiding inventory decisions and promotional strategies.

  3. Educational Testing

In educational assessments, the mode can be used to determine the most common score or grade achieved by students in a test or examination. This helps educators identify common performance trends and understand the difficulty level of the assessment.

  4. Health and Medical Statistics

In healthcare, the mode is used to find the most common age group, symptom, or diagnosis within a population. For example, in a study of common diseases, the mode can reveal the most frequently occurring disease or the most prevalent age group affected, providing insights into public health needs.

  5. Consumer Behavior Analysis

In consumer behavior studies, the mode is used to determine the most frequently chosen option in surveys and polls. For instance, it can highlight the most common reasons for customer dissatisfaction or preferences regarding product features, aiding companies in product development and customer service strategies.

  6. Sports Statistics

In sports analytics, the mode is used to identify the most frequent performance metric. For example, the mode can be applied to identify the most common score in a set of matches or the most frequent outcome of a particular game, assisting coaches and analysts in understanding patterns in performance.

Advantages:

  • It is easy to understand and simple to calculate.
  • It is not affected by extremely large or small values.
  • It can be located just by inspection in un-grouped data and discrete frequency distribution.
  • It can be useful for qualitative data.
  • It can be computed in an open-end frequency table.
  • It can be located graphically.

Disadvantages:

  • It is not always well defined.
  • It is not based on all the values.
  • It is stable only for a large number of values, so it may not be well defined if the data consist of a small number of values.
  • It is not capable of further mathematical treatment.
  • Sometimes the data has one or more than one mode, and sometimes the data has no mode at all.