Data preparation & preliminary analysis

09/05/2020 By indiafreenotes

Data preparation is the process of cleaning and transforming raw data prior to processing and analysis. It is an important preliminary step that often involves reformatting data, correcting errors, and combining data sets to enrich the data.

Data preparation is often a lengthy undertaking for data professionals or business users, but it is essential for putting data in context, turning it into insights, and eliminating bias that results from poor data quality.

For example, the data preparation process usually includes standardizing data formats, enriching source data, and/or removing outliers.

Benefits of Data Preparation

76% of data scientists say that data preparation is the worst part of their job, but efficient, accurate business decisions can only be made with clean data. Data preparation helps:

  • Fix errors quickly: Data preparation helps catch errors before processing. After data has been removed from its original source, these errors become more difficult to understand and correct.
  • Produce top-quality data: Cleaning and reformatting datasets ensures that all data used in analysis will be high quality.
  • Make better business decisions: Higher-quality data that can be processed and analyzed more quickly leads to more timely, better-informed business decisions.

Additionally, as data and data processes move to the cloud, data preparation moves with it for even greater benefits, such as:

  • Superior scalability: Cloud data preparation can grow at the pace of the business. Enterprises don’t have to worry about the underlying infrastructure or try to anticipate how it will evolve.
  • Future proof: Cloud data preparation upgrades automatically, so new capabilities and bug fixes are available as soon as they are released. This allows organizations to stay ahead of the innovation curve without delays and added costs.
  • Accelerated data usage and collaboration: Doing data prep in the cloud means it is always on, doesn’t require any technical installation, and lets teams collaborate on the work for faster results.

Additionally, a good, cloud-native data preparation tool will offer other benefits (such as an intuitive, simple-to-use GUI) for easier and more efficient preparation.

Data Preparation Steps

The specifics of the data preparation process vary by industry, organization and need, but the framework remains largely the same.

1. Gather data

The data preparation process begins with finding the right data. This can come from an existing data catalog or be added ad hoc.

2. Discover and assess data

After collecting the data, it is important to discover each dataset. This step is about getting to know the data and understanding what has to be done before the data becomes useful in a particular context.

3. Cleanse and validate data

Cleaning up the data is traditionally the most time consuming part of the data preparation process, but it’s crucial for removing faulty data and filling in gaps. Important tasks here include:

  • Removing extraneous data and outliers.
  • Filling in missing values.
  • Conforming data to a standardized pattern.
  • Masking private or sensitive data entries.

Once data has been cleansed, it must be validated by testing for errors in the data preparation process up to this point. Often, an error in the system will become apparent during this step and will need to be resolved before moving forward.
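
For teams scripting this step rather than using a dedicated preparation tool, a minimal pandas sketch of the cleansing and validation tasks above is shown below. The column names (age, country, email), the valid ranges, and the standardization rules are illustrative assumptions, not part of any particular dataset.

```python
import pandas as pd
import numpy as np

# Illustrative raw data; column names and values are assumed for the example.
df = pd.DataFrame({
    "age": [21, 22, 250, np.nan, 24],                         # 250 is an obvious entry error
    "country": ["usa", "USA", "U.S.A.", "France", "france"],  # inconsistent formatting
    "email": ["a@x.com", "b@x.com", "c@x.com", "d@x.com", "e@x.com"],
})

# Remove extraneous data and outliers: drop rows with impossible ages.
df = df[df["age"].isna() | df["age"].between(0, 120)]

# Fill in missing values: here, a simple median substitution.
df["age"] = df["age"].fillna(df["age"].median())

# Conform data to a standardized pattern: normalize country labels.
df["country"] = df["country"].str.upper().str.replace(".", "", regex=False)

# Mask private or sensitive entries: hide the local part of each e-mail address.
df["email"] = df["email"].str.replace(r"^[^@]+", "***", regex=True)

# Validate: simple assertions that catch errors introduced up to this point.
assert df["age"].between(0, 120).all()
assert df["email"].str.startswith("***").all()
print(df)
```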

4. Transform and enrich data

Transforming data is the process of updating the format or value entries in order to reach a well-defined outcome, or to make the data more easily understood by a wider audience. Enriching data refers to adding and connecting data with other related information to provide deeper insights.
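
As a concrete (and entirely hypothetical) illustration, transformation might mean recoding a raw date into an academic term, while enrichment might mean joining survey responses to a reference table about programs. The responses and programs DataFrames below are invented for this sketch.

```python
import pandas as pd

# Hypothetical survey responses and a reference table describing programs.
responses = pd.DataFrame({
    "student_id": [1, 2, 3],
    "program_id": ["P10", "P11", "P10"],
    "start_date": ["2020-01-15", "2020-08-20", "2020-01-30"],
})
programs = pd.DataFrame({
    "program_id": ["P10", "P11"],
    "region": ["Europe", "Asia"],
    "duration_weeks": [12, 6],
})

# Transform: derive a coarser, more readable value from the raw start date.
responses["start_term"] = (pd.to_datetime(responses["start_date"])
                             .dt.month.map(lambda m: "Spring" if m < 6 else "Fall"))

# Enrich: connect each response to related program information.
enriched = responses.merge(programs, on="program_id", how="left")
print(enriched)
```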

5. Store data

Once prepared, the data can be stored or channeled into a third-party application, such as a business intelligence tool, clearing the way for processing and analysis to take place.

Preliminary Steps in Quantitative Data Analysis

After collecting and before analyzing survey data, we recommend closely examining the data set to ensure the accuracy and representativeness of the information and the integrity of subsequent analyses. Data conditioning involves attending to detailed components of both an actual data set and the particular analytic techniques chosen to examine the data. This often requires more time and attention to detail than either the data collection or the subsequent analytic procedures. Though data conditioning can be a time-intensive step, carefully executing these practices allows one to responsibly proceed with accurately analyzing, interpreting, and reporting quantitative data. In addition, it offers a more fine-grained picture of the study abroad student sample, which can be quite informative even before more focused statistical analyses are begun.

Data Accuracy

The initial step in data conditioning attends to the issue of accurate data entry. This step requires an examination of how data have been entered (or uploaded) into a data file and a consideration of issues that could yield inaccurate analyses. Comparing the actual obtained data to the final data file is an essential step; however, the size of the sample under study affects the method by which this is typically executed. Tabachnick and Fidell (2013) outlined several components to consider in ensuring data accuracy; for example, with small data sets, careful proofreading of all variable values is recommended, but for larger data sets, analyzing particular descriptive statistics and graphic representations of variables is typically more efficient in ensuring appropriate variable value ranges (e.g., possible minimum and maximum values).

Analyzing descriptive statistics of variables differs depending on the types of variables examined (i.e., categorical or continuous variables). Categorical variables consist of data that are grouped into discrete categories: either nominal classifications devoid of any particular order or ordinal classifications that have a meaningful ranked order. For example, the location of a study abroad program (e.g., Asia, Europe, or South America) is a nominal variable, whereas asking participants to rate their responses along a Likert-type rating scale (e.g., 1 = strongly disagree to 5 = strongly agree, or 1 = poor to 7 = excellent) yields ordinal variables. Though Likert-type scale responses are technically categorical, they are often treated as continuous variables in data conditioning and later analyses. Continuous variables take on numeric values within a defined range and have equal intervals between data points (e.g., a student’s age or number of months immersed in a host country).

To check data accuracy for categorical variables, evaluators and researchers must examine the frequencies of responses in each possible category. For example, the frequency function in SPSS will display tables that include the number and percentage of responses in each of a variable’s categories, as well as the number of valid and missing values (after opening SPSS and loading your data file, follow these SPSS menu choices: Analyze > Descriptive Statistics > Frequencies). Various types of charts can also be generated through the same SPSS navigation menu to graphically display frequencies, including bar charts, pie charts, and histograms. In looking at the frequency tables, several questions are helpful to ask. Are any values out of the range of the numbered categories (e.g., there are three categories of study abroad program types arbitrarily numbered 1 through 3, but the frequency chart or table indicates number categories beyond these three values)? Finding such nonexistent categories easily brings data-entry errors to light. What do the frequencies suggest? How many responses are in each category? Which categories contain the lowest and highest numbers of responses? What are the implications of low or high frequencies in particular categories?
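
For readers working outside SPSS, the same frequency check can be scripted; the sketch below uses pandas with a hypothetical program_type variable that should contain only the codes 1 through 3.

```python
import pandas as pd

# Hypothetical categorical variable: program types coded 1-3, plus one bad entry (7).
program_type = pd.Series([1, 2, 2, 3, 1, 7, 2], name="program_type")

# Counts and percentages per category, the rough equivalent of
# SPSS Analyze > Descriptive Statistics > Frequencies.
freq = pd.DataFrame({
    "count": program_type.value_counts().sort_index(),
    "percent": (program_type.value_counts(normalize=True).sort_index() * 100).round(1),
})
print(freq)

# Flag values outside the documented 1-3 coding scheme (likely data-entry errors).
print(program_type[~program_type.isin([1, 2, 3])])
```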

To examine data accuracy for continuous variables (including Likert-type scales), we must analyze other descriptive statistics beyond frequencies. For instance, we often analyze the mean values (the averages) and dispersion (i.e., ranges and minimum-maximum values) of the continuous variables in SPSS (follow these SPSS menu choices: Analyze > Descriptive Statistics > Descriptives > Options) to answer important questions about the accuracy of the data. Do all of the values fall within the range of possible scores? If not, this points to data-entry errors. Do the mean values for the variables make sense based on what is already known about the population under study? The dispersion of a variable is also important to examine, particularly if there are any out-of-range values (i.e., below the minimum or beyond the maximum possible values). In addition, the standard deviation (the amount of variation from the mean) is important to consider, as it indicates how closely values are dispersed around the sample’s mean. A low standard deviation suggests that scores are generally clustered around the mean with little variation, making the likelihood of finding differences across the sample relatively small. Conversely, a high standard deviation indicates that the sample’s scores are dispersed across a wider range, suggesting a greater likelihood of differences in scores within the sample.
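
A comparable check can be scripted for continuous variables; in the sketch below, the age column and the Likert-type item gpi_item (assumed to range from 1 to 5) are invented for illustration.

```python
import pandas as pd

# Hypothetical continuous and Likert-type data; 9 in gpi_item is out of range.
df = pd.DataFrame({
    "age":      [20, 21, 22, 19, 55, 21],
    "gpi_item": [4, 5, 3, 2, 9, 4],   # valid range assumed to be 1-5
})

# Means, standard deviations, minimums, and maximums, similar to
# SPSS Analyze > Descriptive Statistics > Descriptives > Options.
print(df.describe().loc[["mean", "std", "min", "max"]])

# Out-of-range values for the Likert-type item point to data-entry errors.
print(df[~df["gpi_item"].between(1, 5)])
```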

Finally, it is important to ensure that missing data are properly entered and coded in the data file. Data are missing from data files for several reasons, and these must be identified for accurate analyses and reporting. Participants, for instance, may choose not to answer particular questions on a survey, whereas others may have inadvertently skipped several questions or run out of time to complete the survey, leaving some answers blank. Finally, the nature of some survey questions may require participants to legitimately skip particular questions or blocks of questions. In SPSS, missing values are indicated by either an asterisk or the absence of any values. A more thorough discussion of missing data is found later.
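
If sentinel codes (for example, 999 for "no answer") were used during data entry, they should be recoded as genuine missing values before any statistics are computed. A small sketch, with the column and the 999 code assumed for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical item in which 999 was entered whenever a participant skipped the question.
satisfaction = pd.Series([4, 5, 999, 3, 999, 2], name="satisfaction")

# Recode the sentinel value as a true missing value so that later
# analyses do not treat 999 as real data.
satisfaction = satisfaction.replace(999, np.nan)

print(satisfaction.isna().sum(), "missing responses")
print(satisfaction.mean())   # computed over non-missing values only
```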

Participant Response Rates

Once the data are checked for accuracy, response rates must be carefully examined to understand the representativeness of the sample. For several reasons, it is often not possible to survey, interview, or otherwise investigate every individual from a population of interest. Comparing the sample participants to the larger overall population of interest, examining how representative the sample is, and discussing any significant distinctions between the two are critical steps before findings can be understood and applied more broadly. Furthermore, external validity, which considers the generalizability of one’s findings (the extent to which findings generalize beyond the current sample to the overall population under study), is an important aim of quantitative inquiry.

It is essential to know and report a participant response rate by determining the total number of individuals invited to participate in a study and those who actually participated. This is a simple proportion to calculate by dividing those who participated by the total invited, although it is important to take into account those who never received the initial invitation because of invalid e-mail addresses or returned mail. Beyond understanding response rates, it is necessary to consider how representative a sample is relative to the overall population of interest. How many and what types of individuals compose the overall population under study, and how does this compare to your final sample? Is the sample representative of important demographics of the total population, including race, ethnicity, gender, age, and other salient characteristics? Are there over- or underrepresented groups in your sample? What are the implications of these disparities? If these data are not readily accessible, campus institutional research or enrollment management areas can typically provide assistance in obtaining population data. Although beyond the scope of this chapter, weighting techniques can also be applied to correct for nonresponse biases (see NSSE, 2014).
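
The arithmetic itself is simple; the snippet below just makes the adjustment for undeliverable invitations explicit, with all counts invented for illustration.

```python
# Hypothetical counts for a survey administration.
invited = 1200        # total invitations sent
undeliverable = 80    # bounced e-mails or returned mail
responded = 430       # completed surveys

# Adjust the denominator for invitations that never reached anyone.
eligible = invited - undeliverable
response_rate = responded / eligible

print(f"Response rate: {response_rate:.1%}")   # 430 / 1120 = 38.4%
```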

Missing Data

The issue of missing data is one of the most prevalent quandaries in quantitative research and assessment efforts. In an extended discussion on the implications of and strategies for handling missing data, Tabachnick and Fidell (2013) stated that it is essential to first determine the severity of any missing data, particularly the patterns of missing data, the amount of data missing, and the reasons why the data may be missing. In quantitative research, missing data are often categorized as MCAR (missing completely at random), MAR (missing at random, which constitutes ignorable nonresponses), and MNAR (missing not at random, which constitutes nonignorable nonresponses) (Little, Jorgensen, Lang, & Moore, 2014). Randomly scattered missing values are less serious than nonrandom missing values, as the latter can affect the generalizability of results.

We can distinguish random from nonrandom missing data by testing for patterns in the missing data. Tabachnick and Fidell (2013) recommended two ways to test for this: First, one can construct a new variable that represents cases with missing and nonmissing values for an independent variable (e.g., a new variable could be created and coded as 0 = missing and 1 = not missing) and then test for mean differences on a continuous outcome measure between the groups using an independent-samples t-test (follow these SPSS menu choices: Analyze > Compare Means > Independent Samples t-Test). We can then examine the SPSS output and determine whether the two means differ significantly. The second strategy Tabachnick and Fidell (2013) outlined is SPSS’s missing value analysis (follow these SPSS menu choices: Analyze > Missing Value Analysis), which highlights the numbers and patterns of missing values by providing statistics including frequencies of missing values, t-tests, and missing patterns.
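
Outside SPSS, the first of these checks can be approximated with a missingness indicator and an independent-samples t-test; the columns months_abroad and gpa in the sketch below are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical data: months_abroad has missing values; gpa is a continuous outcome.
df = pd.DataFrame({
    "months_abroad": [3, np.nan, 6, 12, np.nan, 4, 5, np.nan, 8, 6],
    "gpa":           [3.2, 2.9, 3.5, 3.8, 3.0, 3.1, 3.4, 2.8, 3.6, 3.3],
})

# New indicator variable, coded as described above: 0 = missing, 1 = not missing.
df["abroad_reported"] = df["months_abroad"].notna().astype(int)

# Independent-samples t-test on the outcome, comparing the two groups.
missing_group = df.loc[df["abroad_reported"] == 0, "gpa"]
complete_group = df.loc[df["abroad_reported"] == 1, "gpa"]
t_stat, p_value = stats.ttest_ind(complete_group, missing_group, equal_var=False)

# A significant difference suggests a nonrandom pattern of missingness
# with respect to this outcome; a non-significant one does not rule it out.
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```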

Once the missing data patterns have been identified, there are a few different approaches, with differing implications, to handling missing data; they emphasize either excluding or substituting missing values. Excluding cases (participants) with missing data from analyses is a reasonable option if there is a random pattern of missing values, very few participants have missing data, the participants are missing data on different variables, and it appears that the missing cases represent a random subsample of the aggregate sample (Tabachnick & Fidell, 2013). By default, cases with missing values are usually excluded from most analyses in SPSS based on a listwise deletion technique. Although this is an acceptable approach, provided that the previous points are considered, excluding cases with extensive missing values (over 10% in most cases) can compromise the external validity of the results.

Tabachnick and Fidell (2013) recommended a number of different imputation or substitution approaches to use if a variable is missing extensive data yet is important to the analysis: First, one can use prior knowledge to replace missing values with an informed estimate if the sample is large and the number of missing values is small. For instance, if, given experience or expertise in a field, one is confident that the missing values would equate to the median, mean, or most frequent response, it is acceptable to substitute those values and note the reasons for doing so. Second, one can transform an ordinal or continuous variable into a dichotomous variable (e.g., participated or did not participate in study abroad; low or high engagement) and predict into which category to place the missing case. For longitudinal data, one can use the last observed value to fill in missing data, but this assumes that there was no change over time. Third, one can substitute missing values by inserting an overall sample mean or a subsample mean defined by a particular grouping variable. Finally, one can utilize a regression-based technique on those cases with complete data to generate an equation that estimates missing values for incomplete cases. In the long run, effective methods of reducing missing data may focus on well-constructed surveys that students are less likely to leave incomplete, along with clear encouragement for students to leave no answers blank as they work through the questions.
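
A minimal sketch of two of these substitution strategies (mean substitution and a regression-based estimate) is shown below using scikit-learn; the columns engagement and gpi_score are hypothetical, and the choice between strategies should follow the considerations above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical data with missing values on two related items.
df = pd.DataFrame({
    "engagement": [3.0, 4.0, np.nan, 5.0, 2.0, np.nan, 4.0],
    "gpi_score":  [3.1, 3.9, 3.5, 4.8, 2.2, 3.0, np.nan],
})

# Mean substitution: replace each missing value with its column's sample mean.
mean_imputed = pd.DataFrame(
    SimpleImputer(strategy="mean").fit_transform(df), columns=df.columns)

# Regression-based imputation: predict each missing value from the other
# variables, using the cases with complete data to fit the model.
reg_imputed = pd.DataFrame(
    IterativeImputer(random_state=0).fit_transform(df), columns=df.columns)

print(mean_imputed.round(2))
print(reg_imputed.round(2))
```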

For those interested in a much more in-depth discussion of missing data analysis, see Enders (2010) for quite thorough overviews and methods of different techniques to handle various types of missing data.

Detecting Outliers (Extreme Values)

Occasionally, outliers or extreme, unexpected values surface in the data and must be addressed, especially with small sample sizes. Participants can randomly respond to questions or represent genuinely rare cases, so it is often helpful to examine the other items attached to a particular participant to see a fuller picture and possibly explain any extreme values. Univariate outliers (an extreme value on one variable) and multivariate outliers (an unusual combination of scores on two or more variables) distort sample statistics (i.e., can lead to either stating there is a relationship or effect when there is not one or failing to detect a relationship or effect when there is one) and interfere with generalizability.

Tabachnick and Fidell (2013) discussed several reasons for outliers: First, incorrect data entry can produce incorrect values, some of which may be outliers (e.g., accidentally typing a value of 22 instead of 2). Second, failure to specify missing-value codes for data that should be read as real data can also produce outliers. Third, an outlier could come from outside of the population from which we wish to sample; we should delete these cases once they are detected, as they are not relevant to our analyses. Finally, an outlier could come from the population of interest, but the distribution of the variable has more extreme values than expected in a normal distribution. In this final case, we can retain these outliers but change the value on the variable so that the outlier’s impact on the analyses is attenuated. Given the more advanced nature of identifying and handling multivariate outliers, we recommend referring to Tabachnick and Fidell (2013) for a more extended discussion.
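
One common scripted check for univariate outliers is the boxplot (interquartile-range) rule sketched below; the months_abroad column and the 1.5 × IQR cutoff are illustrative assumptions, and multivariate outliers call for the more advanced techniques referenced above.

```python
import pandas as pd

# Hypothetical continuous variable with one suspicious value (48 months).
months_abroad = pd.Series([4, 6, 5, 3, 6, 48, 5, 4], name="months_abroad")

# Boxplot-style (IQR) rule: flag values beyond 1.5 * IQR outside the quartiles.
q1, q3 = months_abroad.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print("Flagged univariate outliers:")
print(months_abroad[(months_abroad < lower) | (months_abroad > upper)])

# If a genuine-but-extreme case is retained (the final situation described above),
# its value can be pulled in toward the fence to attenuate its influence.
attenuated = months_abroad.clip(lower=lower, upper=upper)
print(attenuated)
```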

Looking for Correlations Among Variables

Data conditioning also involves examining the degree to which continuous variables (including Likert-type scales) are correlated, or related, to one another. Note that correlations are not viable using categorical data, as the numerical values of those variables are not meaningful (the numerical values solely serve to categorize data into discrete groups). When examining correlations between continuous variables, correlation coefficients in SPSS will indicate the direction and strength of the correlation between the variables. Correlation coefficients are reported as values between -1.0 and +1.0. (Note: A positive relationship indicates that as one variable increases or decreases, the other variable moves in the same direction; a negative relationship indicates that as one variable increases or decreases, the other variable moves in the opposite direction.) To examine the correlations among all of the continuous variables in a data set, we can produce a correlation matrix in SPSS (follow these SPSS menu choices: Analyze > Correlate > Bivariate), which is simply a table that displays the correlation coefficients for the specified variables so that we can determine the direction (positive or negative) and degree to which they are related to each other. The closer correlation coefficients are to -1.0 or +1.0, the stronger the negative or positive relationships; the closer these values are to zero, the weaker the relationships.
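
Outside SPSS, the same matrix can be produced in a few lines; the three columns in the sketch below are hypothetical continuous or Likert-type variables.

```python
import pandas as pd

# Hypothetical continuous / Likert-type variables.
df = pd.DataFrame({
    "multicultural_courses": [0, 2, 1, 4, 3, 2, 5, 1],
    "global_awareness":      [2, 3, 3, 5, 4, 3, 5, 2],
    "months_abroad":         [0, 6, 3, 12, 6, 4, 9, 2],
})

# Pearson correlation matrix: each cell is a coefficient between -1.0 and +1.0.
print(df.corr(method="pearson").round(2))
```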

For example, using responses from two survey items found on the GPI, we are interested in understanding the relationship between the number of multicultural courses taken at college and the degree to which students felt informed of current issues that impact international relations. Intuitively, it might seem that there could be a relationship between these two items, but whether this is statistically significant, and if so, how strong the relationship is, will be useful to understand. Using the SPSS navigation described earlier, we ran a bivariate (two-variable) correlation on these two items and found a correlation coefficient of 0.058 that was statistically significant. This value indicates that there is a statistically significant and positive (the correlation coefficient was greater than zero) relationship between these variables; in other words, as students complete more multicultural courses, their understanding of current global issues also increases. This correlation coefficient also illustrates, though, that although statistically significant, the relationship is weak, as the value is very close to zero at 0.058. In this case, our intuition was correct in that these GPI items are, indeed, related, but the weak relationship between them is not that meaningful.
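
Because the strength of a correlation and its statistical significance are separate questions, it helps to obtain the p-value alongside the coefficient. The sketch below uses scipy with two invented series standing in for the GPI items; the values (and therefore the result) are not the ones reported above.

```python
import pandas as pd
from scipy import stats

# Hypothetical stand-ins for the two GPI items discussed above.
courses  = pd.Series([0, 2, 1, 4, 3, 2, 5, 1])   # multicultural courses taken
informed = pd.Series([2, 3, 3, 5, 4, 3, 5, 2])   # feeling informed about global issues

# Pearson correlation coefficient and its p-value.
r, p = stats.pearsonr(courses, informed)
print(f"r = {r:.3f}, p = {p:.3f}")

# A coefficient close to zero can still be statistically significant in a large
# sample (as with the 0.058 reported above) while remaining substantively weak.
```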

Of particular concern in the data conditioning stage for multivariate analyses is when two or more variables are strongly correlated with each other. For instance, problems can occur when independent variables are highly correlated with each other in the same multivariate model, which may lead to unstable findings, larger standard errors, and a reduced likelihood of statistical significance (see Grimm & Yarnold, 1995, for an expanded discussion of multicollinearity issues). As such, it is important to examine a correlation matrix prior to engaging in multivariate analyses.