Mean Deviation about the Mean, Coefficient of Mean Deviation

To understand the dispersion of data around a measure of central tendency, we can use the mean deviation. It comes as an improvement over the range. It measures the deviations of the observations from a chosen value, which is generally the mean or the median. Although the mean deviation about the mode can also be calculated, the mean deviations about the mean and about the median are the ones used most frequently.

Note that the deviation of an observation x from a value a is d = x − a. To find the mean deviation we need to take the mean of these deviations. However, when a is taken as the mean, the deviations are both negative and positive, since the mean is the central value.

This further means that when we sum up these deviations to find their average, the sum essentially vanishes (about the mean it is exactly zero). To resolve this problem we use the absolute values, i.e. the magnitudes, of the deviations. The basic formula for the mean deviation is:

Mean deviation = Sum of absolute values of deviations from ‘a’ ÷ Number of observations
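To make the formula concrete, here is a small sketch in Python (the data values are invented) that computes the mean deviation about both the mean and the median:

```python
# A minimal sketch (invented data): mean deviation about the mean and about the median.
data = [4, 7, 9, 10, 15]

def mean_deviation(values, about):
    """Average of the absolute deviations of `values` from the point `about`."""
    return sum(abs(x - about) for x in values) / len(values)

mean = sum(data) / len(data)            # 9.0
median = sorted(data)[len(data) // 2]   # 9 (middle value of an odd-length list)
print(mean_deviation(data, mean))       # 2.8
print(mean_deviation(data, median))     # 2.8
```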

Coefficient of Mean Deviation:

It is calculated to compare the data of two series. The coefficient of mean deviation is obtained by dividing the mean deviation by the average from which the deviations were taken: if the deviations are taken from the mean, we divide by the mean; if from the mode, by the mode; and if from the median, by the median.
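For example, a minimal self-contained sketch (Python, invented numbers) of the coefficient of mean deviation about the mean:

```python
# Coefficient of mean deviation about the mean = mean deviation / mean (invented data).
data = [4, 7, 9, 10, 15]
mean = sum(data) / len(data)                                   # 9.0
md_about_mean = sum(abs(x - mean) for x in data) / len(data)   # 2.8
coefficient = md_about_mean / mean                             # ~0.31
print(coefficient)
```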

  1. For Discrete Series:

M.D. = ∑f·dy / N, where N = ∑f

Here dy is the deviation of the variable from X (the mean), M (the median) or Z (the mode), ignoring signs (i.e. taking all deviations as positive).

Steps to Calculate:

  1. Take X, M or Z series as desired.
  2. Take deviations ignoring signs.
  3. Multiply dy by respective f; get ∑fdy
  4. Use the following formula

M.D. = ∑fdy/N

(Note: if the value of X, M or Z is in decimal fractions, it is better to use the direct method to get the result easily.)

When the mean, median or mode is in fractions, the direct formula is applied.
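As a worked illustration of the steps above, the sketch below (Python, with an invented frequency table) computes the mean deviation about the median for a discrete series:

```python
# A sketch with an invented frequency table: mean deviation about the median
# for a discrete series, using M.D. = sum(f * dy) / N.
x = [2, 4, 6, 8, 10]     # values of the variable
f = [1, 4, 6, 4, 1]      # frequencies; N = sum(f) = 16

N = sum(f)
median = 6               # the (N/2)-th item falls in the x = 6 group (from cumulative frequencies)
dy = [abs(xi - median) for xi in x]               # deviations, ignoring signs
md = sum(fi * di for fi, di in zip(f, dy)) / N    # = 24 / 16 = 1.5
print(md)
```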

  2. For Continuous Series:

For a continuous series also:

M.D. = ∑f·dy / N

where dy is the deviation of the class mid-points from X, M or Z, ignoring signs.

Standard Deviation with Coefficient of Variation

As the name suggests, this quantity is a standard measure of the deviation of the entire data in any distribution. It is usually represented by s (for a sample) or σ (for a population). It uses the arithmetic mean of the distribution as the reference point and summarizes the deviation of all the data values from this mean.

Therefore, we define the standard deviation of the distribution of a variable X with n data points x1, x2, …, xn and mean x̄ as:

σ = √[ ∑(xᵢ − x̄)² / n ]

(for a sample, n − 1 is used in the denominator instead of n).

Variance

Another statistical term that is related to the distribution is the variance, which is the standard deviation squared (variance = SD²). Because the deviations from the mean are squared before they are averaged, the variance is always non-negative; squaring removes the problem of signs that arises with the raw deviations. One common application of the variance is its use in the F-test to compare the variances of two methods and determine whether there is a statistically significant difference in the imprecision between the methods.
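As an illustration of the F-test idea mentioned above, the sketch below compares the variances of two sets of replicate results; the values are invented and scipy is assumed to be available:

```python
# Sketch of an F-ratio for comparing the imprecision of two methods (invented data).
import statistics
from scipy import stats

method_a = [100, 102, 98, 101, 99, 103, 97, 100]
method_b = [100, 105, 95, 104, 96, 106, 94, 100]

var_a = statistics.variance(method_a)   # sample variance = SD squared
var_b = statistics.variance(method_b)

F = max(var_a, var_b) / min(var_a, var_b)       # larger variance in the numerator
dfn = dfd = len(method_a) - 1                   # both methods have 8 replicates here
p_value = 2 * stats.f.sf(F, dfn, dfd)           # two-sided p-value
print(F, p_value)
```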

In many applications, however, the SD is often preferred because it is expressed in the same concentration units as the data. Using the SD, it is possible to predict the range of control values that should be observed if the method remains stable. As discussed in an earlier lesson, laboratorians often use the SD to impose “gates” on the expected normal distribution of control values.

Coefficient of Variation

Another way to describe the variation of a test is to calculate the coefficient of variation, or CV. The CV expresses the variation as a percentage of the mean and is calculated as follows:

CV% = (SD / X̄) × 100
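A minimal sketch of this calculation in Python (the replicate values are invented):

```python
# CV% = (SD / mean) * 100 for a set of replicate results (invented data).
import statistics

values = [98, 101, 100, 103, 97, 102, 99, 100]
sd = statistics.stdev(values)       # sample standard deviation
xbar = statistics.mean(values)
cv_percent = (sd / xbar) * 100
print(round(cv_percent, 1))
```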

In the laboratory, the CV is preferred when the SD increases in proportion to concentration. For example, the data from a replication experiment may show an SD of 4 units at a concentration of 100 units and an SD of 8 units at a concentration of 200 units. The CVs are 4.0% at both levels and the CV is more useful than the SD for describing method performance at concentrations in between. However, not all tests will demonstrate imprecision that is constant in terms of CV. For some tests, the SD may be constant over the analytical range.

The CV also provides a general “feeling” about the performance of a method. CVs of 5% or less generally give us a feeling of good method performance, whereas CVs of 10% and higher sound bad. However, you should look carefully at the mean value before judging a CV. At very low concentrations, the CV may be high and at high concentrations the CV may be low. For example, a bilirubin test with an SD of 0.1 mg/dL at a mean value of 0.5 mg/dL has a CV of 20%, whereas an SD of 1.0 mg/dL at a concentration of 20 mg/dL corresponds to a CV of 5.0%.

Skewness

Skewness, in statistics, is the degree of distortion from the symmetrical bell curve, or normal distribution, in a set of data. Skewness can be negative, positive, zero or undefined. A normal distribution has a skew of zero, while a lognormal distribution, for example, would exhibit some degree of right-skew.

The three probability distributions sketched in the figure below show increasing levels of right (or positive) skewness. Distributions can also be left (negative) skewed. Skewness is used along with kurtosis to better judge the likelihood of events falling in the tails of a probability distribution.

(Figure: distributions with increasing right skewness)

  • Skewness, in statistics, is the degree of distortion from the symmetrical bell curve in a probability distribution.
  • Distributions can exhibit right (positive) skewness or left (negative) skewness to varying degrees.
  • Investors note skewness when judging a return distribution because it, like kurtosis, considers the extremes of the data set rather than focusing solely on the average.

Broadly speaking, there are two types of skewness:

(1) Positive skewness

(2) Negative skewness

Positive skewness

A series is said to have positive skewness when the following characteristics are noticed:

  • Mean > Median > Mode.
  • The right tail of the curve is longer than its left tail, when the data are plotted through a histogram, or a frequency polygon.
  • The formula of Skewness and its coefficient give positive figures.

Negative skewness

A series is said to have negative skewness when the following characteristics are noticed:

  • Mode > Median > Mean.
  • The left tail of the curve is longer than the right tail, when the data are plotted through a histogram, or a frequency polygon.
  • The formula of skewness and its coefficient give negative figures.

Thus, a statistical distribution may be of three types, viz.

  • Symmetric
  • Positively skewed
  • Negatively skewed

Skewness Coefficient

  1. Pearson’s Coefficient of Skewness #1 uses the mode. The formula is:
    Sk₁ = (x̄ − Mo) / s
    Where x̄ = the mean, Mo = the mode and s = the standard deviation for the sample.
  2. Pearson’s Coefficient of Skewness #2 uses the median. The formula is:
    Sk₂ = 3(x̄ − Md) / s
    Where x̄ = the mean, Md = the median and s = the standard deviation for the sample.
    It is generally used when you don’t know the mode.
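The following sketch (Python standard library, invented data) computes both coefficients:

```python
# Sketch of Pearson's two skewness coefficients (invented right-skewed data).
import statistics

data = [2, 3, 3, 4, 4, 4, 5, 6, 9]
xbar = statistics.mean(data)
md = statistics.median(data)
mo = statistics.mode(data)
s = statistics.stdev(data)

sk1 = (xbar - mo) / s        # coefficient #1: uses the mode
sk2 = 3 * (xbar - md) / s    # coefficient #2: uses the median
print(sk1, sk2)              # both positive here (right-skewed data)
```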

Kurtosis

Kurtosis is a statistical measure that defines how heavily the tails of a distribution differ from the tails of a normal distribution. In other words, kurtosis identifies whether the tails of a given distribution contain extreme values.

Along with skewness, kurtosis is an important descriptive statistic of data distribution. However, the two concepts must not be confused with each other. Skewness essentially measures the symmetry of the distribution while kurtosis determines the heaviness of the distribution tails.

In finance, kurtosis is used as a measure of financial risk. A large kurtosis is associated with a high level of risk of an investment because it indicates that there are high probabilities of extremely large and extremely small returns. On the other hand, a small kurtosis signals a moderate level of risk because the probabilities of extreme returns are relatively low.

Excess Kurtosis

Excess kurtosis is a metric that compares the kurtosis of a distribution against the kurtosis of a normal distribution. The kurtosis of a normal distribution equals 3. Therefore, the excess kurtosis is found using the formula below:

Excess Kurtosis = Kurtosis – 3
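A short sketch of kurtosis and excess kurtosis, assuming scipy and NumPy are available (the simulated "returns" are invented):

```python
# Kurtosis and excess kurtosis for simulated returns (invented data).
import numpy as np
from scipy.stats import kurtosis

returns = np.random.default_rng(0).normal(size=10_000)

excess = kurtosis(returns, fisher=True)    # excess kurtosis; close to 0 for a normal sample
plain = kurtosis(returns, fisher=False)    # kurtosis proper; close to 3 for a normal sample
print(excess, plain)                       # plain - excess = 3
```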

Types of Kurtosis

The types of kurtosis are determined by the excess kurtosis of a particular distribution. The excess kurtosis can take positive or negative values, as well as values close to zero.

1. Mesokurtic

Data that follows a mesokurtic distribution shows an excess kurtosis of zero or close to zero. It means that if the data follows a normal distribution, it follows a mesokurtic distribution.

2. Leptokurtic

A leptokurtic distribution has positive excess kurtosis. It shows heavy tails on either side, indicating the presence of large outliers. In finance, a leptokurtic distribution shows that the investment returns may be prone to extreme values on either side. Therefore, an investment whose returns follow a leptokurtic distribution is considered risky.

3. Platykurtic

A platykurtic distribution shows a negative excess kurtosis. It has flat, thin tails, indicating that large outliers are scarce. In the finance context, a platykurtic distribution of investment returns is desirable for investors because there is only a small probability that the investment would experience extreme returns.

Karl Pearson and Rank Correlation

Karl Pearson Coefficient of Correlation (also called the Pearson correlation coefficient or Pearson’s r) is a measure of the strength and direction of the linear relationship between two variables. It ranges from -1 to +1, where +1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship. The formula for Pearson’s r is calculated by dividing the covariance of the two variables by the product of their standard deviations. It is widely used in statistics to analyze the degree of correlation between paired data.

The following are the main properties of correlation.

  1. Coefficient of Correlation lies between -1 and +1:

The coefficient of correlation cannot take a value less than −1 or more than +1. Symbolically,

−1 ≤ r ≤ +1, or |r| ≤ 1.

  2. Coefficient of Correlation is independent of Change of Origin:

This property reveals that if we subtract any constant from all the values of X and Y, it will not affect the coefficient of correlation.

  3. Coefficient of Correlation possesses the property of symmetry:

The degree of relationship between the two variables is symmetric, as shown below:

rXY = rYX

  4. Coefficient of Correlation is independent of Change of Scale:

This property reveals that if we divide or multiply all the values of X and Y by some positive constant, it will not affect the coefficient of correlation.

  5. Coefficient of correlation measures only linear correlation between X and Y.
  6. If two variables X and Y are independent, the coefficient of correlation between them will be zero.

Karl Pearson’s Coefficient of Correlation is a widely used mathematical method wherein a numerical expression is used to calculate the degree and direction of the relationship between linearly related variables.

Pearson’s method, popularly known as the Pearsonian Coefficient of Correlation, is the most extensively used quantitative method in practice. The coefficient of correlation is denoted by “r”.

If the relationship between two variables X and Y is to be ascertained, then the following formula is used:

r = Cov(X, Y) / (σX · σY) = ∑(X − X̄)(Y − Ȳ) / √[ ∑(X − X̄)² · ∑(Y − Ȳ)² ]
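A small sketch of this calculation (Python with NumPy; the paired values are invented):

```python
# Pearson's r as covariance divided by the product of the standard deviations.
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

r = np.cov(x, y, ddof=1)[0, 1] / (x.std(ddof=1) * y.std(ddof=1))
print(r)
print(np.corrcoef(x, y)[0, 1])   # the same value from NumPy's built-in
```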

Properties of Coefficient of Correlation

  • The value of the coefficient of correlation (r) always lies between −1 and +1. Such as:

    r=+1, perfect positive correlation

    r=-1, perfect negative correlation

    r=0, no correlation

  • The coefficient of correlation is independent of the origin and scale. By origin, it means that adding or subtracting any constant from the given values of X and Y leaves the value of “r” unchanged. By scale, it means that there is no effect on the value of “r” if the values of X and Y are divided or multiplied by any positive constant.
  • The coefficient of correlation is the geometric mean of the two regression coefficients. Symbolically, it is represented as: r = ±√(bXY × bYX)
  • The coefficient of correlation is “zero” when the variables X and Y are independent. However, the converse is not true.

Assumptions of Karl Pearson’s Coefficient of Correlation

  1. The relationship between the variables is “Linear”, which means when the two variables are plotted, a straight line is formed by the points plotted.
  2. There are a large number of independent causes that affect the variables under study so as to form a Normal Distribution. Such as, variables like price, demand, supply, etc. are affected by such factors that the normal distribution is formed.
  3. The variables are independent of each other.                                     

Note: The coefficient of correlation measures not only the magnitude of correlation but also tells the direction. Such as, r = -0.67, which shows correlation is negative because the sign is “-“ and the magnitude is 0.67.

Spearman Rank Correlation

Spearman rank correlation is a non-parametric test that is used to measure the degree of association between two variables.  The Spearman rank correlation test does not carry any assumptions about the distribution of the data and is the appropriate correlation analysis when the variables are measured on a scale that is at least ordinal.

The Spearman correlation between two variables is equal to the Pearson correlation between the rank values of those two variables; while Pearson’s correlation assesses linear relationships, Spearman’s correlation assesses monotonic relationships (whether linear or not). If there are no repeated data values, a perfect Spearman correlation of +1 or −1 occurs when each of the variables is a perfect monotone function of the other.

Intuitively, the Spearman correlation between two variables will be high when observations have a similar (or identical for a correlation of 1) rank (i.e. relative position label of the observations within the variable: 1st, 2nd, 3rd, etc.) between the two variables, and low when observations have a dissimilar (or fully opposed for a correlation of −1) rank between the two variables.

The following formula is used to calculate the Spearman rank correlation:

ρ = 1 − (6 ∑dᵢ²) / (n(n² − 1))

where:

ρ = Spearman rank correlation

dᵢ = the difference between the ranks of corresponding values

n = number of observations
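A minimal sketch that applies this formula and checks it against scipy’s built-in (the data are invented and contain no tied ranks):

```python
# Spearman's rho from the rank-difference formula, compared with scipy's spearmanr.
from scipy.stats import rankdata, spearmanr

x = [35, 23, 47, 17, 10, 43, 9, 6, 28]
y = [30, 33, 45, 23, 8, 49, 12, 4, 31]

rx, ry = rankdata(x), rankdata(y)
d = rx - ry
n = len(x)
rho_formula = 1 - 6 * sum(di ** 2 for di in d) / (n * (n ** 2 - 1))
rho_scipy, _ = spearmanr(x, y)
print(rho_formula, rho_scipy)    # identical here because there are no ties
```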

Assumptions

The assumptions of the Spearman correlation are that data must be at least ordinal and the scores on one variable must be monotonically related to the other variable.

Least Square Method

The least squares method is the process of finding the best-fitting curve, or line of best fit, for a set of data points by minimizing the sum of the squares of the offsets (residuals) of the points from the curve. In the process of finding the relation between two variables, the trend of outcomes is estimated quantitatively; this process is termed regression analysis. Curve fitting is an approach to regression analysis, and least squares is the method of fitting an equation that approximates the curve to the given raw data.

It is quite obvious that the fitting of curves for a particular data set is not always unique. Thus, it is required to find a curve having minimal deviation from all the measured data points. This is known as the best-fitting curve and is found by using the least-squares method.

Least Square Method

The least-squares method is a crucial statistical method that is used to find a regression line, or best-fit line, for a given pattern of data. The method is described by an equation with specific parameters. It is widely used in estimation and regression analysis. In regression analysis, this method is a standard approach for approximating the solution of sets of equations having more equations than unknowns.

The method of least squares defines the solution as the one that minimizes the sum of the squares of the deviations, or errors, in the result of each equation. The sum of squares of errors, in turn, is what quantifies the variation in the observed data.

The least-squares method is often applied in data fitting. The best-fit result minimizes the sum of squared errors, or residuals, which are the differences between an observed or experimental value and the corresponding fitted value given by the model.

There are two basic categories of least-squares problems:

  • Ordinary or linear least squares
  • Nonlinear least squares

These categories depend upon whether the residuals are linear or nonlinear in the unknowns. Linear problems are often seen in regression analysis in statistics. Non-linear problems, on the other hand, are generally solved by iterative refinement, in which the model is approximated by a linear one at each iteration.

Least Square Method Graph

In linear regression, the line of best fit is a straight line drawn through the scatter of the data points.

The residuals, or offsets, of each data point from the line are what the method minimizes. In common practice the vertical offsets from the line (or from the polynomial, surface, hyperplane, etc.) are minimized, rather than the perpendicular offsets.

Least Square Method Formula

The least-squares method states that the curve that best fits a given set of observations is the curve having the minimum sum of squared residuals (or deviations, or errors) from the given data points. Let us assume that the given data points are (x1, y1), (x2, y2), (x3, y3), …, (xn, yn), in which all x’s are values of the independent variable and all y’s are values of the dependent one. Also, suppose that f(x) is the fitting curve and that d represents the error, or deviation, at each given point.

Now, we can write:

d1 = y1 − f(x1)

d2 = y2 − f(x2)

d3 = y3 − f(x3)

…..

dn = yn – f(xn)

The least-squares principle states that the best-fitting curve is the one for which the sum of the squares of all the deviations from the given values is a minimum, i.e.:

S = d₁² + d₂² + … + dₙ² = ∑ [yᵢ − f(xᵢ)]² = minimum
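A minimal sketch of fitting a straight line y = a + b·x by this criterion (Python with NumPy; the data points are invented, and the closed-form slope and intercept come from the normal equations):

```python
# Fitting a straight line by minimizing the sum of squared residuals (invented points).
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9])

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)  # slope
a = y.mean() - b * x.mean()                                                # intercept

residuals = y - (a + b * x)
print(a, b, np.sum(residuals ** 2))   # the minimized sum of squared deviations
print(np.polyfit(x, y, 1))            # [b, a] from NumPy's own least-squares fit
```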

Limitations for Least-Square Method

The least-squares method is a very beneficial method of curve fitting. Despite many benefits, it has a few shortcomings too. One of the main limitations is discussed here.

In regression analysis, which utilizes the least-squares method for curve fitting, it is implicitly assumed that the errors in the independent variable are negligible or zero. When the errors in the independent variable are not negligible, the model suffers from measurement error. In that situation the least-squares estimates can be biased, so parameter estimates, confidence intervals and hypothesis tests based on them may no longer be reliable.

Secondary Data: Merits, Limitations, Sources

Secondary data is data that has already been collected by, and is readily available from, other sources. Such data is cheaper and more quickly obtainable than primary data, and may also be available when primary data cannot be obtained at all.

Advantages of Secondary data

  1. It is economical. It saves efforts and expenses.
  2. It is time saving.
  3. It helps to make primary data collection more specific since with the help of secondary data, we are able to make out what are the gaps and deficiencies and what additional information needs to be collected.
  4. It helps to improve the understanding of the problem.
  5. It provides a basis for comparison for the data that is collected by the researcher.

Disadvantages of Secondary Data

  1. Secondary data seldom fits exactly into the framework of the marketing research problem. Reasons for this are:
  • Unit of secondary data collection: Suppose you want information on disposable income, but the data is available only on gross income. The information may not be the same as you require.
  • Class boundaries may be different even when the units are the same. For example:

    Before 5 years: 2500-5000, 5001-7500, 7500-10000
    After 5 years: 5000-6000, 6001-7000, 7001-10000

  In such cases the data collected earlier is of no use to you.
  2. The accuracy of secondary data is not known.
  3. The data may be outdated.

Evaluation of Secondary Data

Because of the above-mentioned disadvantages of secondary data, it must be evaluated before use. Evaluation means that the following four requirements must be satisfied:

  1. Availability: It has to be seen whether the kind of data you want is available or not. If it is not available, then you have to go for primary data.
  2. Relevance: It should be meeting the requirements of the problem. For this we have two criteria:
    1. Units of measurement should be the same.
    2. Concepts used must be same and currency of data should not be outdated.
  3. Accuracy: In order to find out how accurate the data is, the following points must be considered:
  • Specification and methodology used
  • Margin of error should be examined
  • The dependability of the source must be seen.

4. Sufficiency: Adequate data should be available.

Robert W Joselyn has classified the above discussion into eight steps. These eight steps are sub classified into three categories. He has given a detailed procedure for evaluating secondary data.

  • Applicability of research objective.
  • Cost of acquisition.
  • Accuracy of data.

Data: Relevance of data in Current scenario

Every day, we create roughly 2.5 quintillion bytes of data. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. This data is big data. What has also changed in the last decade is that we now have the means to sift through these 2.5 quintillion bytes of data in a reasonable amount of time. All these changes have major implications for organizations today.

In organizations, analytics enables professionals to convert extensive data and statistical and quantitative analysis into powerful insights that can drive efficient decisions.

Therefore with analytics, organizations can now base their decisions and strategies on data rather than on gut feelings. Moreover, with the rate at which this data can be analyzed, organizations are able to keep tabs on the customer trends in near real time. As a result effectiveness of a strategy can be determined almost immediately. Thus with powerful insights, analytics promises reduced costs and increased profits.
The analytics Industry is one of the fastest growing in modern times with it poised to become a $50 billion market by 2017. With this sudden surge in the analytics industry, there is a tremendous increase in the demand for analytics expertise across all domains, throughout all major organizations across the globe. It has been predicted that by 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.
IBM’s recent study revealed that “83% of Business Leaders listed Business Analytics as the top priority in their business priority list.”
Deloitte has mentioned in its study that: Decision makers who can leverage everyday data & information into actionable insights for the growth of their organization by taking reliable decisions, will find themselves in a much better position to achieve strategic growth in their career.

There is an information overload in today’s world and data analytics helps to cut out the clutter to help businesses make safe and smart choices.

A recent report by Nucleus Research found that companies realize a return of USD10.66 for every dollar they invest in analytics.

In the developed economies of Europe, government administrators could save more than €100 billion ($149 billion) in operational efficiency improvements alone by using big data, not including using big data to reduce fraud and errors and boost the collection of tax revenues. Thus big data courses in India are going to be essential in a few years.

There is a saying, “Today, data is the new oil.” Data in today’s business and technology world is absolutely crucial. Big Data technologies and initiatives are rising to analyze this data for insights that can help in making strategic decisions. The concept evolved at the beginning of the 21st century, and every technology giant is now making use of Big Data technologies. Big Data refers to vast data sets that may be structured or unstructured. A massive amount of data is produced every day by businesses and users alike. Big data analytics is the process of examining large data sets to find the underlying insights and patterns. The data analytics field is absolutely vast.

Big Data Analytics is indeed a revolution in the field of information technology. The use of data analytics by companies is increasing day by day. The primary focus of the companies is on their customers, hence this field is flourishing in the area of B2C applications. There are three divisions of big data analytics: prescriptive analytics, predictive analytics and descriptive analytics. There are four different perspectives that explain why big data analytics is so important. They are

  • Data Science Perspective
  • Business Perspective
  • Real Time Usability Perspective
  • Job Market Perspective

Big Data Analytics & Data Science

Big data analytics involves the use of advanced analytics techniques and tools on data obtained from different sources and of different sizes. Big Data has the properties of high variety, volume and velocity. The data sets are typically retrieved from online networks, web pages, audio and video devices, social media, logs and many other sources.

It involves the use of techniques like machine learning, data mining, natural language processing & statistics. The data is extracted, prepared & blended to provide analysis for the businesses.

Benefits of Big Data Analytics

Due to the enormous growth in the field of Big Data Analytics, it is extensively used in multiple industries such as:

  • Banking
  • Healthcare
  • Energy
  • Technology
  • Consumer
  • Manufacturing

The importance of big data analytics leads to intense competition and increased demand for big data professionals. Data Science and Analytics is an evolving field with huge potential. Data analytics helps in analyzing the value chain of a business and gaining insights. The use of analytics can enhance the industry knowledge of the analysts. Data analytics experts give organizations a chance to learn about opportunities for the business.

Types of Data: Primary & Secondary

Primary data

Primary data are original observations collected by the researcher or his agent for the first time for any investigation and used by them in the statistical analysis.

Primary data is an important type of data: it is the collection of data from first-hand information, gathered by an organization for its own purposes. Such primary data is mostly pure and original.

Primary data can be collected through three different methods:

  • Data Collection through Investigation:

In this method, trained investigators are employed to collect the data. They use tools such as interviews to collect the information from individual persons.

  • Personal Investigation Methods:

The researchers, or the data collectors themselves, conduct the survey and collect the data. The data gathered in this way is more accurate and original. However, this method is practical only for small data-collection exercises, not for large data projects.

  • Data Collection through Telephones:

The researcher uses telephones or mobile phones to collect the information or data. This is a very quick process for data collection, but the information collected may not always be accurate and true.

Secondary data

Secondary data is the other type of data: it is collected from second-hand information. The data has already been collected by someone else for some other purpose and is now available for the present study. Such secondary data is often less relevant and less pure or original.

Primary Data: Census vs Samples

In Statistics, the basis of all statistical calculations or interpretation lies in the collection of data. There are numerous methods of data collection. In this lesson, we shall focus on two primary methods and understand the difference between them. Both are suitable in different cases and the knowledge of these methods is important to understand when to apply which method. These two methods are the Census method and Sampling method.

Census Method

Census method is the method of statistical enumeration where all members of the population are studied. A population refers to the set of all observations under concern. For example, if you want to carry out a survey to find out students’ feedback about the facilities of your school, all the students of your school would form a part of the ‘population’ for your study.

At a more realistic level, a country wants to maintain information and records about all households. It can collect this information by surveying all households in the country using the census method.

In our country, the Government conducts the Census of India every ten years. The Census appropriates information from households regarding their incomes, the earning members, the total number of children, members of the family, etc. This method must take into account all the units. It cannot leave out anyone in collecting data. Once collected, the Census of India reveals demographic information such as birth rates, death rates, total population, population growth rate of our country, etc. The last census was conducted in the year 2011.

Sampling Method

Like we have studied, the population contains units with some similar characteristics on the basis of which they are grouped together for the study. In the case of the Census of India, for example, the common characteristic was that all units are Indian nationals. But it is not always practical to collect information from all the units of the population.

It is a time-consuming and costly method. Thus, an easy way out would be to collect information from some representative group from the population and then make observations accordingly. This representative group which contains some units from the whole population is called the sample.

The first most important step in selecting a sample is to determine the population. Once the population is identified, a sample must be selected. A good sample is one which is:

  • Small in size.
  • It provides adequate information about the whole population.
  • It takes less time to collect and is less costly.

In the case of our previous example, you could choose students from your class to be the representative sample out of the population (all students in the school). However, there must be some rationale behind choosing the sample. If you think your class comprises a set of students who will give unbiased opinions and feedback, or if you think your class contains students from different backgrounds whose responses would be relevant to your study, you should choose them as your sample. Otherwise, it is better to choose another sample which might be more representative.

Again, realistically, the government wants estimates on the average income of the Indian household. It is difficult and time-consuming to study all households. The government can simply choose, say, 50 households from each state of the country and calculate the average of that to arrive at an estimate. This estimate is not necessarily the actual figure that would be arrived at if all units of the population underwent study. But it approximately gives an idea of what the figure might look like.

Difference between Census and Sample Surveys

  • Definition: A census studies all the units or members of a population; a sample survey studies only a representative group of the population, not all its members.
  • Calculation: A census is total/complete; a sample survey is partial.
  • Time involved: A census is a time-consuming process; a sample survey is quicker.
  • Cost involved: A census is a costly method; a sample survey is relatively inexpensive.
  • Accuracy: Census results are accurate because each member is surveyed, so the error is negligible; sample results are relatively less accurate because items are left out, so the error is larger.
  • Reliability: A census is highly reliable; a sample survey is less reliable.
  • Error: A census has no sampling error; in a sample survey, the smaller the sample size, the larger the error.
  • Suitability: The census method is suited for heterogeneous data; the sample method is suited for homogeneous data.