Chain Base Index Numbers

Under the fixed base methods, the base remains the same throughout the series. But as time passes, some items may be added to the series while others may be deleted. It therefore becomes difficult to compare current conditions with those of a past period. In such a situation, changing the base period is more appropriate. The Chain Index Numbers method is one such method.

Under this method, firstly we express the figures for each year as a percentage of the preceding year. These are known as Link Relatives. We then need to chain them together by successive multiplication to form a chain index.

Thus, unlike fixed base methods, in this method, the base year changes every year. Hence, for the year 2001, it will be 2000, for 2002 it will be 2001, and so on. Let us now study this method step by step.

Steps in the construction of Chain Index Numbers

  1. Calculate the link relatives by expressing the figures as the percentage of the preceding year. Thus,

Link Relatives of current year = (price of current year/price of previous year) × 100

  2. Calculate the chain index by applying the following formula:

Chain Index of current year = (Link relative of current year × Chain index of previous year) / 100
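The two steps can be sketched in Python as follows. This is only an illustration: the prices are made-up figures, the function names link_relatives and chain_indices are hypothetical, and the first year is assumed to start the chain at 100.

```python
def link_relatives(prices):
    """Each year's price as a percentage of the preceding year's price."""
    return [100.0 * cur / prev for prev, cur in zip(prices, prices[1:])]


def chain_indices(prices):
    """Chain the link relatives together by successive multiplication.

    The first year has no preceding year, so its index is set to 100.
    """
    chain = [100.0]
    for relative in link_relatives(prices):
        chain.append(relative * chain[-1] / 100.0)
    return chain


# Illustrative prices for four consecutive years (assumed data, not from the text).
prices = [50, 60, 66, 55]
relatives = link_relatives(prices)
chain = chain_indices(prices)
```

For a single price series like this one, the chained values coincide with the fixed base indices (the last value is 55/50 × 100 = 110). The two methods differ only when several items are averaged or the basket changes.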

Advantages of Chain Index Numbers Method

  1. This method allows the addition or introduction of the new items in the series and also the deletion of obsolete items.
  2. In an organization, management usually compares the current period with the period immediately preceding it rather than any other period in the past. In this method, the base year changes every year and thus it becomes more useful to the management.

Disadvantages of Chain Index Numbers Method

  1. Under this method, if the data for any one of the years is not available, then we cannot compute the chain index numbers for the subsequent periods. This is because the link relatives that depend on the missing year cannot be calculated.
  2. If an error occurs in the calculation of any of the link relatives, that error gets compounded and all the subsequent chain indices will also be incorrect. Thus, the entire series will give a misrepresented picture.

Determination of Season

Time series datasets can contain a seasonal component.

This is a cycle that repeats over time, such as monthly or yearly. This repeating cycle may obscure the signal that we wish to model when forecasting, and in turn may provide a strong signal to our predictive models.

The key topics are:

  • The definition of seasonality in time series and the opportunity it provides for forecasting with machine learning methods.
  • How to use the difference method to create a seasonally adjusted time series of daily temperature data.
  • How to model the seasonal component directly and explicitly subtract it from observations.

Seasonality in Time Series

Time series data may contain seasonal variation.

Seasonal variation, or seasonality, refers to cycles that repeat regularly over time.

A repeating pattern within each year is known as seasonal variation, although the term is applied more generally to repeating patterns within any fixed period.

(Source: Introductory Time Series with R)

A cycle structure in a time series may or may not be seasonal. If it consistently repeats at the same frequency, it is seasonal, otherwise it is not seasonal and is called a cycle.

Benefits to Machine Learning

Understanding the seasonal component in time series can improve the performance of modeling with machine learning.

This can happen in two main ways:

  • Clearer Signal: Identifying and removing the seasonal component from the time series can result in a clearer relationship between input and output variables.
  • More Information: Additional information about the seasonal component of the time series can provide new information to improve model performance.

Both approaches may be useful on a project. Modeling seasonality and removing it from the time series may occur during data cleaning and preparation.

Extracting seasonal information and providing it as input features, either directly or in summary form, may occur during feature extraction and feature engineering activities.

Types of Seasonality

There are many types of seasonality; for example:

  • Time of Day.
  • Daily.
  • Weekly.
  • Monthly.
  • Yearly.

As such, identifying whether there is a seasonality component in your time series problem is subjective.

The simplest approach to determining if there is an aspect of seasonality is to plot and review your data, perhaps at different scales and with the addition of trend lines.

Removing Seasonality

Once seasonality is identified, it can be modeled.

The model of seasonality can be removed from the time series. This process is called Seasonal Adjustment, or Deseasonalizing.

A time series where the seasonal component has been removed is called seasonal stationary. A time series with a clear seasonal component is referred to as non-stationary.
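As a minimal sketch of the difference method mentioned earlier, each observation can be adjusted by subtracting the value observed one full season before it. The data below is synthetic (a yearly sine cycle plus a small drift), the period of 365 days ignores leap years, and the function name seasonal_difference is a hypothetical helper, not part of any library.

```python
import math


def seasonal_difference(series, period):
    """Subtract the value observed one full season earlier.

    The first `period` observations have no earlier counterpart,
    so the result is shorter than the input by `period` values.
    """
    return [series[i] - series[i - period] for i in range(period, len(series))]


# Synthetic "daily temperature": a yearly sine cycle plus a slow upward drift.
period = 365
series = [
    10.0 + 8.0 * math.sin(2 * math.pi * day / period) + 0.001 * day
    for day in range(3 * period)
]
adjusted = seasonal_difference(series, period)
```

In this constructed case the seasonal cycle cancels exactly and only the yearly drift (about 0.365) remains. Real temperature data would leave noise behind as well.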

There are sophisticated methods to study and extract seasonality from time series in the field of Time Series Analysis. As we are primarily interested in predictive modeling and time series forecasting, we are limited to methods that can be developed on historical data and available when making predictions on new data.

Two such methods, differencing and modeling the seasonal component directly, are commonly illustrated on the classical meteorological problem of daily temperatures, which has a strong additive seasonal component.

Index Numbers

The value of money does not remain constant over time. It rises or falls and is inversely related to the changes in the price level. A rise in the price level means a fall in the value of money and a fall in the price level means a rise in the value of money. Thus, changes in the value of money are reflected by the changes in the general level of prices over a period of time. Changes in the general level of prices can be measured by a statistical device known as ‘index number.’

An index number is a technique of measuring changes in a variable or group of variables with respect to time, geographical location or other characteristics. There can be various types of index numbers, but, in the present context, we are concerned with price index numbers, which measure changes in the general price level (or in the value of money) over a period of time.

A price index number indicates the average of changes in the prices of representative commodities at one time in comparison with that at some other time taken as the base period. According to L.V. Lester, “An index number of prices is a figure showing the height of average prices at one time relative to their height at some other time which is taken as the base period.”

Features of Index Numbers:

The following are the main features of index numbers:

(i) Index numbers are a special type of average. Whereas mean, median and mode measure the absolute changes and are used to compare only those series which are expressed in the same units, the technique of index numbers is used to measure the relative changes in the level of a phenomenon where the measurement of absolute change is not possible and the series are expressed in different types of items.

(ii) Index numbers are meant to study the changes in the effects of such factors which cannot be measured directly. For example, the general price level is an imaginary concept and is not capable of direct measurement. But, through the technique of index numbers, it is possible to have an idea of relative changes in the general level of prices by measuring relative changes in the price level of different commodities.

(iii) The technique of index numbers measures changes in one variable or in a group of related variables. For example, one variable can be the price of wheat, and a group of variables can be the price of sugar, the price of milk and the price of rice.

(iv) The technique of index numbers is used to compare the levels of a phenomenon on a certain date with its level on some previous date (e.g., the price level in 1980 as compared to that in 1960 taken as the base year) or the levels of a phenomenon at different places on the same date (e.g., the price level in India in 1980 in comparison with that in other countries in 1980).

Steps or Problems in the Construction of Price Index Numbers:

The construction of the price index numbers involves the following steps or problems:

  1. Selection of Base Year:

The first step or the problem in preparing the index numbers is the selection of the base year. The base year is defined as that year with reference to which the price changes in other years are compared and expressed as percentages. The base year should be a normal year.

In other words, it should be free from abnormal conditions like wars, famines, floods, political instability, etc. The base year can be selected in two ways: (a) through the fixed base method, in which the base year remains fixed; and (b) through the chain base method, in which the base year goes on changing, e.g., for 1980 the base year will be 1979, for 1979 it will be 1978, and so on.

  2. Selection of Commodities:

The second problem in the construction of index numbers is the selection of the commodities. Since all commodities cannot be included, only representative commodities should be selected keeping in view the purpose and type of the index number.

In selecting items, the following points are to be kept in mind:

(a) The items should be representative of the tastes, habits and customs of the people.

(b) Items should be recognizable.

(c) Items should be stable in quality over two different periods and places.

(d) The economic and social importance of various items should be considered.

(e) The items should be fairly large in number.

(f) All those varieties of a commodity which are in common use and are stable in character should be included.

  3. Collection of Prices:

After selecting the commodities, the next problem is regarding the collection of their prices:

(a) From where the prices are to be collected;

(b) Whether to choose wholesale prices or retail prices;

(c) Whether to include taxes in the prices or not etc.

While collecting prices, the following points are to be noted:

(a) Prices are to be collected from those places where a particular commodity is traded in large quantities.

(b) Published information regarding the prices should also be utilised.

(c) In selecting individuals and institutions who would supply price quotations, care should be taken that they are not biased.

(d) Selection of wholesale or retail prices depends upon the type of index number to be prepared. Wholesale prices are used in the construction of general price index and retail prices are used in the construction of cost-of-living index number.

(e) Prices collected from various places should be averaged.

  4. Selection of Average:

Since index numbers are a specialised average, the fourth problem is to choose a suitable average. Theoretically, the geometric mean is the best for this purpose. But, in practice, the arithmetic mean is used because it is easier to follow.

  5. Selection of Weights:

Generally, all the commodities included in the construction of index numbers are not of equal importance. Therefore, if the index numbers are to be representative, proper weights should be assigned to the commodities according to their relative importance.

For example, the prices of books will be given more weightage while preparing the cost-of-living index for teachers than while preparing the cost-of-living index for the workers. Weights should be unbiased and be rationally and not arbitrarily selected.

  6. Purpose of Index Numbers:

The most important consideration in the construction of the index numbers is the objective of the index numbers. All other problems or steps are to be viewed in the light of the purpose for which a particular index number is to be prepared. Since different index numbers are prepared for specific purposes and no single index number is an ‘all purpose’ index number, it is important to be clear about the purpose of the index number before its construction.

Least Square Method in Time Series

In time series analysis we come across many variables, several of which depend on others, and it is often necessary to find a relationship between two or more of them. The least squares method finds the best fit for a set of data points by minimizing the sum of the squared residuals, that is, the squared deviations of the points from the fitted curve. Applied to time series data, it gives the trend line of best fit. It is one of the most widely used methods in time series analysis.

Method of Least Squares 

Each point on the fitted curve represents the relationship between a known independent variable and an unknown dependent variable.

In general, the least squares method fits a straight line through the given points; this is known as the method of linear or ordinary least squares. The resulting line is termed the line of best fit: the sum of the squares of the distances of the points from it is minimized.

Equations with certain parameters usually represent the results in this method. The method of least squares actually defines the solution for the minimization of the sum of squares of deviations or the errors in the result of each equation.

The least squares method is used mostly for data fitting. The best fit result minimizes the sum of squared errors or residuals which are said to be the differences between the observed or experimental value and corresponding fitted value given in the model. There are two basic kinds of the least squares methods – ordinary or linear least squares and nonlinear least squares.

Mathematical Representation

It is a mathematical method that gives a fitted trend line for the set of data such that the following two conditions are satisfied.

  1. The sum of the deviations of the actual values of Y and the computed values of Y is zero.
  2. The sum of the squares of the deviations of the actual values and the computed values is least.

This method gives the line which is the line of best fit. This method is applicable to give results either to fit a straight line trend or a parabolic trend.

The method of least squares as studied in time series analysis is used to find the trend line of best fit to a time series data.

Secular Trend Line

The secular trend line (Y) is defined by the following equation:

Y = a + b X

Where, Y = predicted value of the dependent variable

a = Y-axis intercept i.e. the height of the line above origin (when X = 0, Y = a)

b = slope of the line (the rate of change in Y for a given change in X)

When b is positive the slope is upwards, when b is negative, the slope is downwards

X = independent variable (in this case it is time)

To estimate the constants a and b, the following two equations have to be solved simultaneously:

ΣY = na + bΣX

ΣXY = aΣX + bΣX²

To simplify the calculations, the midpoint of the time series is taken as the origin. The negative values of X in the first half of the series then balance out the positive values in the second half, so that ΣX = 0. In this case, the two normal equations reduce to:

ΣY = na

ΣXY = bΣX²

The values of a and b can then be calculated directly:

Since ΣY = na,

a = ΣY / n

Since ΣXY = bΣX²,

b = ΣXY / ΣX²

Example

Fit a straight line trend on the following data using the Least Squares Method.

Period (year) 1996 1997 1998 1999 2000 2001 2002 2003 2004
Y 4 7 7 8 9 11 13 14 17

Solution:

There are 9 observations in total, so the origin is taken at the middle year, 2000, for which X is assumed to be 0.

Period (Year)    Y     X     XY    X²    Remark
1996             4    -4    -16    16    Negative region
1997             7    -3    -21     9
1998             7    -2    -14     4
1999             8    -1     -8     1
2000             9     0      0     0    Origin
2001            11     1     11     1    Positive region
2002            13     2     26     4
2003            14     3     42     9
2004            17     4     68    16
Total (Σ)   ΣY = 90   ΣX = 0   ΣXY = 88   ΣX² = 60

From the table, n = 9, ΣY = 90, ΣX = 0, ΣXY = 88 and ΣX² = 60.

Substituting these values in the two given equations,

a = 90 / 9 = 10
b = 88 / 60 ≈ 1.47
The trend equation is: Y = 10 + 1.47 X
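The arithmetic of this example can be checked with a short Python sketch. It uses the data given above and the simplified formulas for the case ΣX = 0; the variable names and the trend helper are illustrative choices.

```python
years = [1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004]
y = [4, 7, 7, 8, 9, 11, 13, 14, 17]

# Odd number of observations: take the middle year as origin so that sum(x) == 0.
origin = years[len(years) // 2]
x = [year - origin for year in years]

n = len(y)
sum_y = sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi * xi for xi in x)

a = sum_y / n        # intercept: the mean of Y
b = sum_xy / sum_x2  # slope: change in Y per year


def trend(year):
    """Trend value for a given calendar year."""
    return a + b * (year - origin)
```

Here a evaluates to 10 and b to about 1.47, matching the hand calculation. Calling trend(2005) extends the fitted line one year past the data, which is the usual way such a trend equation is used for projection.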

Moving Average Method

While watching the news you might have noticed the reporter saying that the temperature of a particular city or a country has broken a record. The rainfall of some state or country has set a new bar. How can they know about it? What are the measures that they have taken and studied to say so? These are the time-series data. You all are familiar with time-series data and the various components of the time series.

A Trend in a Time Series

The variations in a time series are broadly classified into three categories: long-term fluctuations, short-term or periodic fluctuations, and random variations. A long-term variation or a trend shows the general tendency of the data to increase or decrease during a long period of time. The variation may be gradual but it is inevitably present.

Analysis of Time Series

Suppose you have a time series data. What will you do with it? How can you calculate the effect of each component for the resulting variations in it? The main problems in the analysis of time series are

  • To identify the components whose combined net effect produces the movement of a time series, and
  • To isolate, study, analyze and measure each component independently by making others constant.

Measurement of Trend by the Method of Moving Average

This method uses the concept of ironing out the fluctuations of the data by taking means. It measures the trend by eliminating the changes or the variations by means of a moving average. The simplest mean used for the measurement of a trend is the arithmetic mean (average).

Moving Average

The moving average of period (extent) m is a series of successive averages of m terms at a time. The averages are calculated over m consecutive data values, starting in turn from the first, second, third term, and so on.

In other words, the first average is the mean of the first m terms. The second average is the mean of the m terms starting from the second data up to (m + 1)th term. Similarly, the third average is the mean of the m terms from the third to (m + 2) th term and so on.

If the extent or the period, m is odd i.e., m is of the form (2k + 1), the moving average is placed against the mid-value of the time interval it covers, i.e., t = k + 1. On the other hand, if m is even i.e., m = 2k, it is placed between the two middle values of the time interval it covers, i.e., t = k and t = k + 1.

When the period of the moving average is even, then we need to synchronize the moving average with the original time period. It is done by centering the moving averages i.e., by taking the average of the two successive moving averages.
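The placement rules for odd and even periods can be sketched as follows. The function names are hypothetical, positions are 0-based indices into the data, and the sample values simply reuse the Y series from the least squares example.

```python
def moving_average(values, m):
    """Successive means of m consecutive terms."""
    return [sum(values[i:i + m]) / m for i in range(len(values) - m + 1)]


def centered_moving_average(values, m):
    """Moving average aligned with the original time periods.

    Returns (position, average) pairs, where position is the 0-based
    index of the observation the average is placed against.
    """
    averages = moving_average(values, m)
    if m % 2 == 1:
        # Odd period: each average sits against the middle term of its window.
        return [(i + m // 2, avg) for i, avg in enumerate(averages)]
    # Even period: average two successive moving averages to centre them.
    centered = [(a + b) / 2 for a, b in zip(averages, averages[1:])]
    return [(i + m // 2, avg) for i, avg in enumerate(centered)]


values = [4, 7, 7, 8, 9, 11, 13, 14, 17]
three_term = centered_moving_average(values, 3)
four_term = centered_moving_average(values, 4)
```

With 9 observations, the 3-term average yields 7 trend values and the centred 4-term average only 5, which illustrates that the method gives no trend values for the terms at either end of the series.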

Drawbacks of Moving Average

  • The main problem is to determine the extent of the moving average which completely eliminates the oscillatory fluctuations.
  • This method assumes that the trend is linear but it is not always the case.
  • It does not provide the trend values for all the terms.
  • This method cannot be used for forecasting future trend which is the main objective of the time series analysis.

Base Shifting, Splicing and Deflating

Base Shifting

For a variety of reasons, it frequently becomes necessary to change the reference base of an index number series from one time to another without returning to the original raw data and recomputing the entire series. This change of reference base period is usually referred to as “shifting the base”. There are two important reasons for shifting the base:

  1. The previous base has become too old and is almost useless for purposes of comparison. By shifting the base, it is possible to state the series in terms of a more recent time period.
  2. It may be desired to compare several index number series which have been computed on different base periods; particularly if the several series are to be shown on the same graph, it may be desirable for them to have the same base period. This may necessitate a shift in the base period.

When the base period is to be changed, one possibility is to recompute all index numbers using the new base period. A simpler approximate method is to divide all index numbers for the various years corresponding to the old base period by the index number corresponding to the new base period, expressing the results as percentages. These results represent the new index numbers, the index number for the new base period being 100.

Mathematically speaking, this method is strictly applicable only if the index numbers satisfy the circular test. However, for many types of index numbers the method, fortunately, yields results which in practice are close enough to those which would be obtained theoretically.
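The approximate method amounts to a single division per year, as in this sketch. The index values are assumed figures for illustration and shift_base is a hypothetical helper name.

```python
def shift_base(indices, new_base_year):
    """Re-express an index series so that new_base_year equals 100.

    `indices` maps year -> index number on the old base.
    """
    divisor = indices[new_base_year]
    return {year: 100.0 * value / divisor for year, value in indices.items()}


# Illustrative series with 2000 = 100 (assumed figures).
old_series = {2000: 100.0, 2001: 110.0, 2002: 125.0, 2003: 150.0}
new_series = shift_base(old_series, 2002)
```

After the shift, 2002 stands at 100 and the other years at 80, 88 and 120, so the ratios between any two years are unchanged.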

Splicing

Splicing is a technique where we link the two or more index number series which contain the same items and a common overlapping year but with different base year to form a continuous series. It may be forward splicing or backward splicing. We can further understand this with the help of the table given below:

Splicing    Index number of old series                  Index number of new series
Forward     (100 / overlapping index number of          No change
            old series) × given index number
            of old series
Backward    No change                                   (Overlapping index number of old
                                                        series / 100) × given index number
                                                        of new series
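The two rows of the table can be sketched in Python as below. The direction labels follow the table as given here, the new series is assumed to equal 100 in the overlapping year, and the figures and the splice function are illustrative assumptions.

```python
def splice(old_series, new_series, overlap_year, direction):
    """Join two index series that share overlap_year.

    direction follows the labels used in the table above:
      "forward"  -> old series rescaled by 100 / old index at overlap
      "backward" -> new series rescaled by old index at overlap / 100
    Assumes the new series equals 100 in the overlap year.
    """
    old_at_overlap = old_series[overlap_year]
    if direction == "forward":
        rescaled = {yr: v * 100.0 / old_at_overlap for yr, v in old_series.items()}
        return {**rescaled, **new_series}
    if direction == "backward":
        rescaled = {yr: v * old_at_overlap / 100.0 for yr, v in new_series.items()}
        return {**old_series, **rescaled}
    raise ValueError("direction must be 'forward' or 'backward'")


old = {2000: 100.0, 2001: 120.0, 2002: 150.0}  # base 2000 (assumed figures)
new = {2002: 100.0, 2003: 110.0, 2004: 132.0}  # base 2002 (assumed figures)
forward = splice(old, new, 2002, "forward")
backward = splice(old, new, 2002, "backward")
```

Either way the result is one continuous series for 2000 to 2004; the only difference is whether it is expressed on the base of the new series or of the old one.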

Deflating

It refers to the correction for price changes in money wages or money income series.

Inflation adjustment or deflation is the process of removing the effect of price inflation from data. It makes sense to adjust only data that is currency denominated in this way. Examples of such data are weekly wages, the interest rate on your deposits, or the price of a 5kg bag of Red Delicious apples in Kashmir. If you are dealing with a currency denominated time series, deflating it will extinguish the fraction of the up-down movement in it that was a consequence of general inflationary pressure.

Real Wage = (Money Wage / Price Index) × 100

Real Wage Index Number = (Index of Money Wage / Price Index) × 100
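A small sketch of deflating a wage series, using assumed wages and price index values (base year 2019 = 100); the names deflate, money_wage and price_index are illustrative.

```python
def deflate(money_value, price_index):
    """Remove the effect of price changes: (money value / price index) * 100."""
    return money_value / price_index * 100.0


# Assumed figures for illustration only.
money_wage = {2019: 20000.0, 2020: 22000.0, 2021: 23100.0}
price_index = {2019: 100.0, 2020: 110.0, 2021: 120.0}

real_wage = {year: deflate(money_wage[year], price_index[year]) for year in money_wage}

# Real wage index with the first year as base (= 100).
base = real_wage[2019]
real_wage_index = {year: 100.0 * value / base for year, value in real_wage.items()}
```

In this made-up series the money wage rises by 15.5% between 2019 and 2021, yet the real wage index falls to 96.25 because prices rose by 20% over the same period.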

Unweighted, Weighted Aggregate Method

To measure the growth and progress of an economy, economists and scientists use many statistical tools. One very important such tool is the index number. Index numbers help reveal the trends and tendencies of the economy and also help in the formulation of economic policies and laws.

There are broadly three types of index numbers: price index numbers, value index numbers, and quantity index numbers.

Very simply put, index numbers help us observe the change in some quantity that we cannot otherwise easily observe or measure. For example, we cannot directly measure the growth of business activity in an economy. However, we can study the changes in factors that influence this business activity.

So an index number is a tool to measure the change in a variable quantity that has happened over a defined period of time. Such changes are not directly measurable, so index numbers are expressed as percentages which show the relative change in the quantity.

Quantity Index Numbers

Now we will specifically understand what are quantity index numbers. Quantity index numbers measure the change in the quantity or volume of goods sold, consumed or produced during a given time period. Hence it is a measure of relative changes over a period of time in the quantities of a particular set of goods.

Just like price index numbers and value index numbers, there are also two types of quantity index numbers, namely

  • Unweighted Quantity Indices
  • Weighted Quantity Indices

Let us take a look at the various methods, formulas, and examples of both these types of quantity index numbers.

Unweighted Index: Simple Aggregate Method

Here we make a simple and direct comparison of the aggregate quantities of the current year with those of the base year. We express this index number as a percentage. No weights are assigned, so it is the simplest calculation. The formula is as follows,

Q01 = (ΣQ1/ΣQ0) × 100

where Q1 is the quantity of the current year and Q0 is the quantity of the base year.

Unweighted Index: Simple Average of Quantity Method

In this method, we first express the quantity of each item in the current year as a percentage of its quantity in the base year; these percentages are the quantity relatives. Then, to obtain the index number, we average the relatives. So the formula under this method is as follows,

Q01 = Σ[(Q1/Q0) × 100] ÷ N

where N is the total number of items
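Both unweighted indices can be sketched in a few lines. The quantities are assumed figures, and the second function averages the per-item quantity relatives, which is the usual reading of the simple average method.

```python
def simple_aggregate(q0, q1):
    """Sum of current-year quantities as a percentage of the base-year sum."""
    return 100.0 * sum(q1) / sum(q0)


def simple_average_of_relatives(q0, q1):
    """Mean of the per-item quantity relatives (each expressed as a percentage)."""
    relatives = [100.0 * cur / base for base, cur in zip(q0, q1)]
    return sum(relatives) / len(relatives)


# Assumed quantities of three items in the base year and the current year.
q0 = [10, 20, 40]
q1 = [12, 30, 40]

aggregate_index = simple_aggregate(q0, q1)
average_index = simple_average_of_relatives(q0, q1)
```

On this data the aggregate index is about 117.1 while the average of relatives is about 123.3. The two differ because the aggregate method is dominated by the items with the largest quantities, whereas the average treats every item's percentage change equally.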

Weighted Index: Weighted Aggregative Method

There are several methods for calculating this index number. We will take a look at some of the most important ones.

1) Laspeyres Method

In this method, the base price is taken as the weight. We only use the price of the base year (P0), not the current year. The formula is as follows,

Q01= (ΣQ1P0/ΣQ0P0) × 100

2) Paasche’s Method

Here, the current year price (P1) of the commodity is taken as the weight.

Q01 = (ΣQ1P1/ΣQ0P1) × 100

3) Drobisch and Bowley’s Method

This method takes the arithmetic mean of the Laspeyres and Paasche indices.

Q01 = [(ΣQ1P0/ΣQ0P0) + (ΣQ1P1/ΣQ0P1)] ÷ 2 × 100
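The three weighted aggregative indices can be compared on a small data set. In this sketch Laspeyres weights quantities by base-year prices, Paasche by current-year prices (as the description of the method states), and the third index is the arithmetic mean of the two. All figures and function names are assumptions for illustration.

```python
def laspeyres_quantity(q0, q1, p0):
    """Quantity index weighted by base-year prices."""
    return 100.0 * sum(q * p for q, p in zip(q1, p0)) / sum(q * p for q, p in zip(q0, p0))


def paasche_quantity(q0, q1, p1):
    """Quantity index weighted by current-year prices."""
    return 100.0 * sum(q * p for q, p in zip(q1, p1)) / sum(q * p for q, p in zip(q0, p1))


def drobisch_bowley_quantity(q0, q1, p0, p1):
    """Arithmetic mean of the Laspeyres and Paasche quantity indices."""
    return (laspeyres_quantity(q0, q1, p0) + paasche_quantity(q0, q1, p1)) / 2


# Assumed base-year and current-year quantities and prices for two items.
q0, q1 = [10, 20], [12, 25]
p0, p1 = [5, 2], [6, 3]

laspeyres = laspeyres_quantity(q0, q1, p0)
paasche = paasche_quantity(q0, q1, p1)
drobisch_bowley = drobisch_bowley_quantity(q0, q1, p0, p1)
```

Here the Laspeyres index is about 122.2 and the Paasche index 122.5, so their mean lies between the two, at roughly 122.4.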

Weighted Index: Weighted Average of Relative Method

In this method, we use the weighted arithmetic mean of the quantity relatives, with base-year values as weights. The formula is as follows,

Q01 = ΣQV / ΣV

where

Q = (q1/q0) × 100

and

V = q0p0
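A sketch of this calculation follows, taking Q as the quantity relative of each item (q1/q0 × 100) and V as its base-year value (q0 × p0). The data is assumed and the function name is illustrative.

```python
def weighted_average_of_relatives(q0, q1, p0):
    """Weighted mean of quantity relatives, weighted by base-year values."""
    relatives = [100.0 * cur / base for base, cur in zip(q0, q1)]  # Q for each item
    weights = [base * price for base, price in zip(q0, p0)]        # V = q0 * p0
    return sum(q * v for q, v in zip(relatives, weights)) / sum(weights)


# Assumed quantities and base-year prices for two items.
q0, q1 = [10, 20], [12, 25]
p0 = [5, 2]

index = weighted_average_of_relatives(q0, q1, p0)
```

Because each weight q0p0 cancels the q0 in its relative, this index works out to the same value as the Laspeyres quantity index on the same data (110/90 × 100, about 122.2).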

Cost of Living Index Number

Uses of cost of living index number:

(i) It is used in wage negotiations, dearness allowance, bonus etc., to the workers.

(ii) The cost of living index number measures the change in the retail prices of a specified quantity of goods and services.

(iii) It is also useful to the government in framing policies relating to wages.

(iv) It is used as a measure of change in the purchasing power of money and real income.

The cost-of-living index, or general index, shows the difference in living costs between cities. The cost of living in the base city is always expressed as 100. The cost of living in the destination is then indexed against this number. So to take a simple example, if London is the base (100) and New York is the destination, and the New York index is 120, then New York is 20% more expensive than London. Similarly, if London is the base and Budapest is the destination, and the Budapest index is 80, then the cost of living in Budapest is 80% of London’s.

What’s the methodology behind the index?

The cost-of-living index expresses the difference in the cost of living between any two cities in the survey. How is this index calculated?

Using exactly the same price data, but different methods of calculation, a number of different people could come up with a number of markedly different indices. The challenge, therefore, when seeking to construct an index is to know which method is best for the problem at hand and to represent equitably (in one figure) the general trend of price differences in separate locations. To illustrate this point, let us take a simple price survey comparing two fictional cities, “Mumbai” and “Delhi.”

  Mumbai  Delhi 
Bread (1kg)  1.00  1.25 
Potatoes (1kg)  3.00  2.00 
Coffee (1kg)  2.50  1.75 
Sugar (1kg)  1.00  1.75 
TOTAL  7.50  6.75 

Assuming we give equal weight to each of the products, which of the two towns deserves the higher cost of living index number? The answer is: it all depends on how the calculation is made.

1) Mumbai is more expensive if we simply add up the prices of the four items in the index and compare the two cities on that basis.

2) Delhi, however, is more expensive when we use Mumbai as a base city and calculate an index based on the average of relative prices in the two cities:

  Mumbai  Delhi 
Bread  100  125 
Potatoes  100  67 
Coffee  100  70 
Sugar  100  175 
Index  100  109 

However, if the same calculation is done with Delhi serving as a base city, Mumbai becomes the more expensive city:

  Delhi  Mumbai 
Bread  100  80 
Potatoes  100  150 
Coffee  100  143 
Sugar  100  57 
Index  100  107.50 

Thus with the standard price-relatives calculation we can end up in the paradoxical situation where each city is more expensive than the other.

3) Using a different method, both Delhi and Mumbai would have the same index number, i.e. 100, and neither would be considered more expensive than the other. Such a calculation would be made according to a well-established statistical formula that takes prices in both cities, makes an average of them, and uses this average as the basis for the index comparison.

This formula, adopted by the Economist Intelligence Unit (EIU) for its indices, has some distinct advantages over the standard price-relatives calculation described in Step 2 above. With the EIU formula, for example, the paradoxical situation of the two cities being more expensive than each other cannot arise: if city A = 100 and city B = 110, then this relationship is maintained, even if city B is used as a base (when B = 100 then A = 91). In other words, the EIU indices are reversible. This property ensures that the cost of living allowances established with the aid of the indices are consistent, in that executives transferred from city A to B can be dealt with on the same footing as those transferred from city B to A.

In addition, the indices are nearly circular. This means that the relationship between any three cities is maintained regardless of which of the cities is used as a base with which to compare the other two. This logical inter-relationship is important in assuring equitable cost of living compensation as executives are transferred from location to location.

The index formula. The index is based on the arithmetic mean of price levels in the two selected cities. In order to calculate the index for the two hypothetical cities examined above, we must first calculate the average price of each item:

  Mumbai  Delhi  Average price 
Bread  1.00  1.25  1.125 
Potatoes  3.00  2.00  2.500 
Coffee  2.50  1.75  2.125 
Sugar  1.00  1.75  1.375 

Next we compare prices in each town to these average prices:

  Average  Mumbai  Delhi 
Bread  100  89  111 
Potatoes  100  120  80 
Coffee  100  118  82 
Sugar  100  73  127 
General Index  100  100  100 

As we can see the relationship between Mumbai and Delhi prices remains intact: bread is still 25% more expensive in Delhi, potatoes are still 50% more expensive in Mumbai. If we want to compare Mumbai as a base city to Delhi, we must divide Delhi’s index by that of Mumbai and multiply by 100. The result is 100. If we reverse the operation and use Delhi as base, the result is also 100. The two cities are equally expensive.
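The calculations in this section can be reproduced with the sketch below, which uses the survey prices given above and equal weights. It follows the description in the text (average of price relatives, then comparison against the two-city average price); it is not claimed to be the exact formula any organisation uses, and the variable names are illustrative.

```python
prices = {
    # item: (Mumbai price, Delhi price), taken from the survey table above
    "Bread": (1.00, 1.25),
    "Potatoes": (3.00, 2.00),
    "Coffee": (2.50, 1.75),
    "Sugar": (1.00, 1.75),
}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Method 2: average of price relatives against a base city.
delhi_on_mumbai = mean(100.0 * d / m for m, d in prices.values())
mumbai_on_delhi = mean(100.0 * m / d for m, d in prices.values())

# Method 3: compare each city with the average of the two cities' prices.
mumbai_on_average = mean(100.0 * m / ((m + d) / 2) for m, d in prices.values())
delhi_on_average = mean(100.0 * d / ((m + d) / 2) for m, d in prices.values())

# Index of one city with the other as base, derived from the method 3 figures.
delhi_vs_mumbai = 100.0 * delhi_on_average / mumbai_on_average
mumbai_vs_delhi = 100.0 * mumbai_on_average / delhi_on_average
```

Method 2 gives about 109 for Delhi on a Mumbai base and 107.5 for Mumbai on a Delhi base, reproducing the paradox. Under method 3 the unrounded general indices are roughly 99.8 for Mumbai and 100.2 for Delhi, which round to the 100 shown in the table, and the two derived indices are exact reciprocals of each other (their product is 10,000), which is the reversibility property.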

Range and Coefficient of Range

The range is a measure of dispersion that represents the difference between the highest and lowest values in a dataset. It provides a simple way to understand the spread of data. While easy to calculate, the range is sensitive to outliers and does not provide information about the distribution of values between the extremes.

The range of a distribution gives a measure of the width (or the spread) of the data values of the corresponding random variable. For example, if there are two random variables X and Y such that X corresponds to the age of human beings and Y corresponds to the age of turtles, we know from our general knowledge that the range of the variable corresponding to the age of turtles should be larger.

Since the average lifespan of humans is 50-60 years, while that of turtles is about 150-200 years, the values taken by the random variable Y are spread out from 0 to 250 and above, while those of X have a smaller range. Thus, qualitatively you’ve already understood what the Range of a distribution means. The mathematical formula for the same is given as:

Range = L – S

where

L: The largest/maximum value attained by the random variable under consideration

S: The smallest/minimum value.

Properties

  • The Range of a given distribution has the same units as the data points.
  • If a random variable is transformed into a new random variable by a change of scale and a shift of origin as:

Y = aX + b

where

Y: the new random variable

X: the original random variable

a,b: constants.

Then the ranges of X and Y can be related as:

R_Y = |a| × R_X

where R_X and R_Y denote the ranges of X and Y respectively.

Clearly, the shift in origin doesn’t affect the shape of the distribution, and therefore its spread (or the width) remains unchanged. Only the scaling factor is important.

  • For a grouped class distribution, the Range is defined as the difference between the two extreme class boundaries.
  • A better measure of the spread of a distribution is the Coefficient of Range, given by:

Coefficient of Range (expressed as a percentage) = {(L – S)/(L + S)} × 100

Clearly, we need to take the ratio between the Range and the total (combined) extent of the distribution. Besides, since it is a ratio, it is dimensionless, and one can therefore use it to compare the spreads of two or more different distributions as well.

  • The range is an absolute measure of Dispersion of a distribution while the Coefficient of Range is a relative measure of dispersion.
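The properties listed above can be illustrated with a small sketch. It assumes positive data so that L + S is not zero; the function names, the sample values and the constants a and b are hypothetical.

```python
def value_range(values):
    """Range = L - S."""
    return max(values) - min(values)


def coefficient_of_range(values):
    """(L - S) / (L + S) * 100; assumes L + S is not zero."""
    largest, smallest = max(values), min(values)
    return (largest - smallest) / (largest + smallest) * 100


x = [10, 20, 30, 40, 50]  # hypothetical observations of X
a, b = -3, 7  # change of scale and shift of origin
y = [a * v + b for v in x]  # Y = aX + b

print(value_range(x))  # 40
print(value_range(y))  # 120, i.e. |a| times the range of X

# Changing only the scale (for example the unit of measurement) rescales
# the range but leaves the coefficient of range essentially unchanged.
x_rescaled = [2 * v for v in x]
print(value_range(x_rescaled))  # 80
print(coefficient_of_range(x), coefficient_of_range(x_rescaled))
```

The shift b drops out of the range, while the coefficient of range stays the same under a pure change of scale, which is what makes it usable as a relative measure.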

Because it considers only the end-points of a distribution, the Range never gives us any information about the shape of the distribution curve between the extreme points. Thus, we must move on to better measures of dispersion, such as the Mean Deviation.

Interquartile range (IQR)

The interquartile range is the middle half of the data. To visualize it, think about the median value that splits the dataset in half. Similarly, you can divide the data into quarters. Statisticians refer to these quarters as quartiles and denote them from low to high as Q1, Q2, Q3, and Q4. The lowest quartile (Q1) contains the quarter of the dataset with the smallest values. The upper quartile (Q4) contains the quarter of the dataset with the highest values. The interquartile range is the middle half of the data that is in between the upper and lower quartiles. In other words, the interquartile range includes the 50% of data points that fall in Q2 and Q3.

The IQR is the red area in the graph below.

The interquartile range is a robust measure of variability in a similar manner that the median is a robust measure of central tendency. Neither measure is influenced dramatically by outliers because they don’t depend on every value. Additionally, the interquartile range is excellent for skewed distributions, just like the median. As you’ll learn, when you have a normal distribution, the standard deviation tells you the percentage of observations that fall specific distances from the mean. However, this doesn’t work for skewed distributions, and the IQR is a great alternative.

I’ve divided the dataset below into quartiles. The interquartile range (IQR) extends from the low end of Q2 to the upper limit of Q3. For this dataset, the interquartile range runs from 21 to 39.
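To illustrate the robustness described above, the sketch below compares the range and the IQR of a small hypothetical dataset before and after its largest value is replaced by an extreme one. It uses `statistics.quantiles` from the Python standard library (Python 3.8+) with the "inclusive" method; this is one of several quartile conventions, and other conventions can give slightly different values.

```python
from statistics import quantiles


def iqr(values):
    """Width of the middle half of the data, Q3 - Q1.

    Here Q1 and Q3 are the 25th and 75th percentile cut points.
    """
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


data = [4, 7, 9, 12, 15, 18, 21, 30, 45]  # hypothetical sample
with_outlier = data[:-1] + [450]  # replace the largest value

print(max(data) - min(data), iqr(data))
print(max(with_outlier) - min(with_outlier), iqr(with_outlier))
```

The range jumps from 41 to 446 while the IQR stays at 12, because the extreme value never enters the calculation of the cut points.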

Quartiles, Quartile Deviation and Quartile co-efficient

The Quartile Deviation is a simple way to estimate the spread of a distribution about a measure of its central tendency (usually the median). So, it gives you an idea about the range within which the central 50% of your sample data lies. Consequently, based on the quartile deviation, the Coefficient of Quartile Deviation can be defined, which makes it easy to compare the spread of two or more different distributions. Since both of these topics are based on the concept of quartiles, we’ll first understand how to calculate the quartiles of a dataset before working with the direct formulae.

Quartiles

A median divides a given dataset (which is already sorted) into two equal halves. Similarly, the quartiles divide a given dataset into four equal parts. Therefore, logically there should be three quartiles for a given distribution, but if you think about it, the second quartile is equal to the median itself! We’ll deal with the other two quartiles in this section.

  • The first quartile or the lower quartile or the 25th percentile, also denoted by Q1, corresponds to the value that lies halfway between the median and the lowest value in the distribution (when it is already sorted in ascending order). Hence, it marks the region which encloses the first 25% of the data.
  • Similarly, the third quartile or the upper quartile or the 75th percentile, also denoted by Q3, corresponds to the value that lies halfway between the median and the highest value in the distribution (when it is already sorted in ascending order). It, therefore, marks the region which encloses the first 75% of the data, leaving the last 25% above it.
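The three quartiles can be computed as in the sketch below, which also checks that the second quartile coincides with the median. It relies on `statistics.quantiles` (Python 3.8+) with the "inclusive" method, which interpolates linearly between neighbouring observations; textbooks use several slightly different conventions, so treat the exact values as dependent on that choice. The dataset is hypothetical.

```python
from statistics import median, quantiles

data = [16, 2, 10, 6, 14, 4, 12, 8]  # hypothetical, deliberately unsorted

# quantiles() sorts the data internally and returns the three cut points
# that split it into four equal parts.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

print(q1, q2, q3)
print(q2 == median(data))  # the second quartile is the median
```

With this convention the cut points for the sample are 5.5, 9.0 and 12.5.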

For a better understanding, look at the representation below for a Gaussian Distribution:

The Quartile Deviation

Formally, the Quartile Deviation is equal to half of the Inter-Quartile Range, and thus we can write it as:

Qd=(Q3–Q1)/2

Therefore, we also call it the Semi Inter-Quartile Range.

  • The Quartile Deviation doesn’t take into account the extreme points of the distribution. Thus, the dispersion or the spread of only the central 50% data is considered.
  • If the scale of the data is changed, the Qd also changes in the same ratio.
  • It is the best measure of dispersion for open-ended distributions (those whose extreme classes are open-ended).
  • Also, it is less affected by sampling fluctuations in the dataset as compared to the range (another measure of dispersion).
  • Since it is solely dependent on the central values in the distribution, if in any experiment, these values are abnormal or inaccurate, the result would be affected drastically.
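A minimal sketch of the formula Qd = (Q3 – Q1)/2 follows, together with a check of the scaling property from the list above. The quartiles are taken from `statistics.quantiles` with the "inclusive" method, which is an assumed convention; the function name and the sample are hypothetical.

```python
from statistics import quantiles


def quartile_deviation(values):
    """Semi inter-quartile range, (Q3 - Q1) / 2."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return (q3 - q1) / 2


sample = [11, 14, 17, 20, 23, 26, 29, 32, 35]  # hypothetical data
rescaled = [3 * v for v in sample]  # change of scale by a factor of 3

qd = quartile_deviation(sample)
qd_rescaled = quartile_deviation(rescaled)

print(qd, qd_rescaled)  # the second value is 3 times the first
```

Here Q1 = 17 and Q3 = 29, so Qd = 6, and multiplying every observation by 3 gives Qd = 18.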

The Coefficient of Quartile Deviation

Based on the quartiles, a relative measure of dispersion, known as the Coefficient of Quartile Deviation, can be defined for any distribution. It is formally defined as:

Coefficient of Quartile Deviation = {(Q3–Q1)/(Q3+Q1)}×100

Since it involves a ratio of two quantities of the same dimensions, it is unit-less. Thus, it can act as a suitable parameter for comparing two or more different datasets which may or may not involve quantities with the same dimensions.
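Because the coefficient is unit-less, it can be used to compare datasets measured in different units, as the sketch below suggests. The heights and weights are made-up values, and the quartiles again come from `statistics.quantiles` with the "inclusive" method as an assumed convention.

```python
from statistics import quantiles


def coefficient_of_quartile_deviation(values):
    """(Q3 - Q1) / (Q3 + Q1) * 100; assumes Q3 + Q1 is not zero."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return (q3 - q1) / (q3 + q1) * 100


heights_cm = [150, 155, 160, 165, 170, 175, 180, 185, 190]  # hypothetical
weights_kg = [45, 50, 55, 60, 65, 70, 75, 80, 85]  # hypothetical

cqd_heights = coefficient_of_quartile_deviation(heights_cm)
cqd_weights = coefficient_of_quartile_deviation(weights_kg)

print(round(cqd_heights, 2), round(cqd_weights, 2))
```

Both samples have the same quartile deviation (10 in their own units), yet the coefficient is larger for the weights, showing that they are relatively more dispersed.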

So, now let’s go through the solved examples below to get a better idea of how to apply these concepts to various distributions.
