Concept Testing
Concept testing (to be distinguished from pre-test markets and test markets which may be used at a later stage of product development research) is the process of using surveys (and sometimes qualitative methods) to evaluate consumer acceptance of a new product idea prior to the introduction of a product to the market. It is important not to confuse concept testing with advertising testing, brand testing and packaging testing, as is sometimes done. Concept testing focuses on the basic product idea, without the embellishments and puffery inherent in advertising.
It is important that the instruments (questionnaires) used to test the product are themselves of high quality. Otherwise, results drawn from the survey data may be biased by measurement error, which makes the design of the testing procedure more complex. Empirical tests provide insight into the quality of the questionnaire. This can be done by:
- Conducting cognitive interviewing. By asking a fraction of potential respondents about their interpretation of the questions and their use of the questionnaire, a researcher can verify that respondents understand the questions as intended.
- Carrying out a small pretest of the questionnaire, using a small subset of target respondents. Results can inform a researcher of errors such as missing questions, or logical and procedural errors.
- Estimating the measurement quality of the questions. This can be done for instance using test-retest, quasi-simplex, or multitrait-multimethod models.
- Predicting the measurement quality of the question. This can be done using the software Survey Quality Predictor (SQP).
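The test-retest approach mentioned above can be illustrated with a short sketch. The idea is to administer the same question to the same respondents in two waves and use the correlation between waves as a reliability estimate; the respondent scores below are invented for illustration.

```python
# Illustrative sketch (invented data): estimating test-retest reliability
# for one survey question. Respondents answer the same 1-5 agreement item
# in two waves; the Pearson correlation between waves approximates the
# question's reliability.

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Made-up wave-1 and wave-2 scores for ten respondents.
wave1 = [4, 5, 3, 2, 4, 5, 1, 3, 4, 2]
wave2 = [4, 4, 3, 2, 5, 5, 2, 3, 4, 1]

reliability = pearson_r(wave1, wave2)
print(f"Test-retest reliability estimate: {reliability:.2f}")
```

A correlation near 1.0 suggests a stable, reliable question; low correlations flag items whose wording may need revision before fielding the full concept test.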
Concept testing begins with the concept generation stage of the new product development (NPD) process. Concept generation can take many forms. Sometimes concepts are generated incidentally, as the result of technological advances. At other times concept generation is deliberate: examples include brainstorming sessions, problem detection surveys and qualitative research. While qualitative research can provide insights into the range of reactions consumers may have, it cannot provide an indication of the likely success of the new concept; this is better left to quantitative concept-test surveys.
In the early stages of concept testing, a large field of alternative concepts might exist, requiring concept-screening surveys. Concept-screening surveys provide a quick means of narrowing the field of options; however, they provide little depth of insight and cannot be compared to a normative database, due to interactions between concepts. For greater insight, and to reach decisions on whether or not to pursue further product development, monadic concept-testing surveys must be conducted.
Frequently concept testing surveys are described as either monadic, sequential monadic or comparative. The terms mainly refer to how the concepts are displayed:
1.) Monadic. The concept is evaluated in isolation.
2.) Sequential monadic. Multiple concepts are evaluated in sequence (often randomized order).
3.) Comparative. Concepts are shown next to each other.
4.) Proto-monadic. Concepts are first shown in sequence, and then next to each other.
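The "often randomized order" noted for sequential monadic tests can be sketched in a few lines. This is an illustration, not a procedure from the text: each respondent sees the concepts in an independently shuffled order, so that order effects average out across the sample.

```python
import random

# Illustrative sketch: assigning each respondent a randomized
# presentation order of concepts for a sequential monadic test,
# so position effects average out across respondents.

concepts = ["Concept A", "Concept B", "Concept C"]

def presentation_order(concepts, rng=random):
    """Return a random presentation order of the concepts for one respondent."""
    order = list(concepts)
    rng.shuffle(order)  # in-place Fisher-Yates shuffle
    return order

for respondent_id in range(3):
    print(respondent_id, presentation_order(concepts))
```

In practice, research firms may use balanced designs (e.g. Latin squares) rather than pure random shuffles, so that each concept appears equally often in each position.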
“Monadic testing is the recommended method for most concept testing. Interaction effects and biases are avoided. Results from one test can be compared to results from previous monadic tests. A normative database can be constructed.” However, each method has its specific uses, and the choice depends on the research objectives. The decision as to which method to use is best left to experienced research professionals, as there are numerous implications for how the results are interpreted.
Copy Testing
Copy testing is a specialized field of marketing research that determines an advertisement’s effectiveness based on consumer responses, feedback, and behavior. Also known as pre-testing, it might address all media channels including television, print, radio, outdoor signage, internet, and social media.
Automated copy testing is a specialized type of digital marketing research related to digital advertising. It involves using software to deploy copy variations of digital advertisements to a live environment and collecting data from real users. These automated copy tests generally use a Z-test to determine the statistical significance of the results. If a specific ad variation outperforms the baseline in the copy test, to a desired level of statistical significance, the new copy variation should be adopted by the marketer.
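The Z-test described above can be sketched as a two-proportion z-test on click-through rates. The sample figures below are invented for illustration; real copy tests would also fix sample sizes in advance and may correct for multiple comparisons.

```python
import math

# Illustrative sketch (invented numbers): a two-proportion z-test
# comparing the click-through rate of a copy variation against the
# baseline ad, as an automated copy test might do.

def two_proportion_z(clicks_a, n_a, clicks_b, n_b):
    """Return (z statistic, two-sided p-value) for two conversion rates."""
    p_a = clicks_a / n_a
    p_b = clicks_b / n_b
    # Pooled proportion under the null hypothesis of equal rates.
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF (via erf).
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Baseline: 200 clicks in 10,000 impressions; variation: 260 in 10,000.
z, p = two_proportion_z(200, 10_000, 260, 10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
if p < 0.05:
    print("Variation outperforms baseline at the 5% significance level.")
```

With these made-up figures the variation's lift is significant at the 5% level, so the marketer would switch to the new copy.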
Features
In 1982, a consortium of 21 leading advertising agencies (including N. W. Ayer, D’Arcy, Grey, McCann Erickson, Needham Harper & Steers, Ogilvy & Mather, J. Walter Thompson, and Young & Rubicam) released a public document laying out the PACT (Positioning Advertising Copy Testing) principles that constitute a good copy testing system. PACT states that a good copy testing system must meet the following criteria:
- Provides measurements which are relevant to the objectives of the advertising.
- Requires agreement about how the results will be used in advance of each specific test.
- Provides multiple measurements, because single measurements are generally inadequate to assess the performance of an advertisement.
- Based on a model of human response to communications – the reception of a stimulus, the comprehension of the stimulus, and the response to the stimulus.
- Allows for consideration of whether the advertising stimulus should be exposed more than once.
- Recognizes that the more finished a piece of copy is, the more soundly it can be evaluated and requires, as a minimum, that alternative executions be tested in the same degree of finish.
- Provides controls to avoid the biasing effects of the exposure context.
- Takes into account basic considerations of sample definition.
- Demonstrates reliability and validity.
Types of copy testing measurements
Recall
The predominant copy testing measure of the 1950s and 1960s, Burke’s Day-After Recall (DAR) was interpreted to measure an ad’s ability to “break through” into the mind of the consumer and register a message from the brand in long-term memory. Once this measure was adopted by Procter and Gamble, it became a research staple.
In the 1970s, 1980s, and 1990s, validation efforts found no link between recall scores and actual sales (Adams & Blair; Blair; Blair & Kuse; Blair & Rabuck; Jones; Jones & Blair; MASB; Mondello; Stewart). For example, Procter and Gamble reviewed 10 years’ worth of split-cable tests (100 in total) and found no significant relationship between recall scores and sales (Young, pp. 3–30). In addition, Leonard Lodish of the Wharton School conducted an even more extensive review of test market results and also failed to find a relationship between recall and sales.
The 1970s also saw a re-examination of the “breakthrough” measure. As a result, an important distinction was made between the attention-getting power of the creative execution and how well “branded” the ad was. Thus, the separate measures of attention and branding were born.
Persuasion
In the 1970s and 1980s, after DAR was determined to be a poor predictor of sales, the research industry began to depend on a measure of persuasion as a more accurate predictor of sales. This shift was led, in part, by researcher Horace Schwerin, who pointed out: “the obvious truth is that a claim can be well remembered but completely unimportant to the prospective buyer of the product; the solution the marketer offers is addressed to the wrong need”. As with DAR, it was Procter and Gamble’s acceptance of the ARS Persuasion measure (also known as brand preference) that made it an industry standard. Recall scores were still provided in copy testing reports, with the understanding that persuasion was the measure that mattered.
Diagnostic
The main purpose of diagnostic measures is optimization. Understanding diagnostic measures can help advertisers identify creative opportunities to improve executions.
Non-Verbal
Non-verbal measures were developed in response to the belief that much of a commercial’s effect, e.g. its emotional impact, may be difficult for respondents to put into words or to rate on verbal scale statements. In fact, many believe the commercial’s effects may operate below the level of consciousness. According to researcher Chuck Young, “There is something in the lovely sounds of our favorite music that we cannot verbalize and it moves us in ways we cannot express”.
In the 1970s, researchers sought to measure these non-verbal measures biologically by tracking brain wave activities as respondents watched commercials (Krugman). Others experimented with galvanic skin response, voice pitch analysis, and eye-tracking. These efforts were not popularly adopted, in part because of the limitations of the technology as well as the poor cost-effectiveness of what was widely perceived as academic, not actionable research.
In the early 1980s, the analytical perspective shifted from thinking of a commercial as the fundamental unit of measurement, to be rated in its entirety, to thinking of it as a structured flow of experience. This gave rise to experimentation with moment-by-moment systems. The most popular of these was the dial meter, which required respondents to turn a dial toward one end of a scale or the other to reflect their opinion of what was on screen at that moment.
More recently, research companies have started to use psychological tests, such as the Stroop effect, to measure the emotional impact of copy. These techniques exploit the notion that viewers do not know why they react to a product, image, or ad in a certain way (or that they reacted at all) because such reactions occur outside of awareness, through changes in networks of thoughts, ideas, and images.