False Positives in the Test for Identifying Random Respondents

In an earlier post, Bryan Orme described how to use the Lighthouse Studio data generator to identify “bad” (i.e. random) respondents in a choice experiment: https://www.linkedin.com/pulse/identifying-consistency-cutoffs-identify-bad-respondents-orme/. Using the method Bryan describes, you generate random respondents and then measure how well their choice data fits their utility models, using HB estimation and a fit statistic called root likelihood (RLH). If you set the cutoff at the RLH that exceeds the RLHs of 95% of the random respondents, you can identify 95% of random respondents and potentially remove them from your data set. The remaining 5% of random responders pass the test and are false negatives (i.e. the test has, by design, a 5% false negative rate).
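For readers who want to see the mechanics of the cutoff, here is a minimal Python sketch. It is not Lighthouse Studio's code. It assumes RLH is computed as the geometric mean of the probabilities the fitted model assigns to each alternative a respondent actually chose, and the function names are hypothetical.

```python
import numpy as np


def root_likelihood(chosen_probs):
    """Geometric mean of the predicted probabilities of the chosen alternatives.

    chosen_probs: one probability per choice task, each in (0, 1].
    """
    p = np.asarray(chosen_probs, dtype=float)
    return float(np.exp(np.mean(np.log(p))))


def rlh_cutoff(random_rlhs, percentile=95.0):
    """RLH value that the given percentage of random respondents fall below."""
    return float(np.percentile(random_rlhs, percentile))


def flag_as_random(rlhs, cutoff):
    """Boolean array: True where a respondent's RLH is below the cutoff."""
    return np.asarray(rlhs, dtype=float) < cutoff
```

In use, you would compute `rlh_cutoff` from the RLHs of the generated random respondents and then pass the real respondents' RLHs to `flag_as_random`.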

But the false negative rate is only half of what we want to know about a diagnostic test. False positives also have a cost, because we may end up discarding valid respondents with low RLHs. Thus, the other thing we need to know about this test is its false positive rate.

Using a bit of fancy Perl code, we can use the Lighthouse Studio data generator to create non-random responders. Specifically, we can create robotic respondents programmed to choose according to the random utility model (RUM). For example, let’s say we’ve collected MaxDiff data from 400 human respondents. To make our robots realistic, we program them to choose using the utilities from the human respondents. RUM holds that human respondents choose with some error in their choice process, which the logit model assumes to be Gumbel distributed. Therefore, when we simulate the robots’ choices, we add to their utilities a random error component drawn from the Gumbel distribution (to make their choices consistent with the logit model). In this way we create MaxDiff choice data that we know conforms to RUM, so the robots are not merely random responders. We then estimate HB utilities for the robots, and any robot whose RLH falls below the original cutoff is misclassified as a random responder; in other words, it is a false positive.
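The choice rule described above can be sketched in a few lines. This is an illustration in Python, not the Perl code used in the Lighthouse Studio data generator, and the function names are hypothetical. For the MaxDiff "worst" pick it assumes one common approach: choose among the remaining items by adding a fresh Gumbel draw to the negated utilities. The analysis in this post may have handled the worst choice differently.

```python
import numpy as np


def simulate_choice(utilities, rng):
    """Pick the alternative with the highest utility plus Gumbel error."""
    u = np.asarray(utilities, dtype=float)
    noise = rng.gumbel(loc=0.0, scale=1.0, size=u.shape)
    return int(np.argmax(u + noise))


def simulate_maxdiff_task(utilities, rng):
    """Return (best, worst) item positions for one MaxDiff set.

    Assumption: worst is chosen among the remaining items by maximizing
    negated utility plus an independent Gumbel draw.
    """
    u = np.asarray(utilities, dtype=float)
    best = simulate_choice(u, rng)
    remaining = [i for i in range(len(u)) if i != best]
    worst = remaining[simulate_choice(-u[remaining], rng)]
    return best, worst


rng = np.random.default_rng(seed=1)
# Hypothetical utilities for the five items shown in one set.
best, worst = simulate_maxdiff_task([2.0, 0.5, -1.0, 0.0, 1.0], rng)
```

Because the maximum of utility plus standard Gumbel error follows the logit choice probabilities, repeated calls to `simulate_choice` pick each alternative with its logit probability.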

My colleague Cameron Halverson and I will run this analysis for 10 or so data sets at next year’s Turbo Choice Modeling event. For now, here are the results of the false positives analysis for two studies.

First up is a MaxDiff study of a few hundred respondents wherein we measured 38 items in 23 sets of quints. Using Bryan’s cutoff method, we find that 95% of random respondents have RLH fit statistics below 0.264. After programming our robots to use the human respondents’ measured utilities to make RUM choices (perturbed by Gumbel-distributed error), we find that 7.88% of our robots have RLHs below the cutoff. So, our false positive rate is under 8%.

Next is a CBC study with several hundred respondents, featuring 8 attributes and a total of 26 levels. We asked each respondent 10 choice sets of triples. Here the 95% RLH cutoff from the randomly responding robots is 0.549, and we misclassify 6.3% of our RUM-choosing robots as random responders. Given their low RLHs, these robots likely had unusually large Gumbel error draws, making for more random-looking choices than those of the other robots.

What does this mean in an actual empirical study? Imagine a study with n=500 in which 20% of the respondents (100) are truly random and 80% (400) maximize utility as in RUM. If we generate random-responding robots to find the 95th percentile cutoff, then we’ll correctly throw out 95 of the 100 random responders. If the false positive rate for real responders is about 6%, as in the CBC study reported above, then we’ll incorrectly throw out 24 of the 400 real respondents.
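The arithmetic in this example is easy to rerun with your own rates. The short Python sketch below uses the numbers from the paragraph above; the function name and dictionary keys are hypothetical.

```python
def cleaning_outcome(n_total, share_random, detection_rate, false_positive_rate):
    """Counts of respondents removed and kept under an RLH cutoff rule."""
    n_random = round(n_total * share_random)
    n_valid = n_total - n_random
    random_removed = round(n_random * detection_rate)
    valid_removed = round(n_valid * false_positive_rate)
    flagged = random_removed + valid_removed
    return {
        "random_removed": random_removed,
        "valid_removed": valid_removed,
        "flagged": flagged,
        # Share of removed respondents who really were random.
        "share_of_flagged_truly_random": random_removed / flagged,
        "random_remaining": n_random - random_removed,
        "valid_remaining": n_valid - valid_removed,
    }


result = cleaning_outcome(
    n_total=500, share_random=0.20, detection_rate=0.95, false_positive_rate=0.06
)
```

With these inputs, 119 respondents are removed in total (95 random plus 24 valid), so roughly four in five of the removed respondents really were random, and the cleaned sample keeps 376 valid and 5 random respondents.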

Importantly, the sparser our data, the higher the bar that separates random from non-random respondents and the higher our false positive rate. For example, the MaxDiff study above with 23 choice sets allows each respondent to see each item 3 times (our standard recommendation for HB analysis of MaxDiff data). If we show each item only twice per respondent (16 choice sets instead of 23), the cutoff for identifying 95% of random respondents rises to 0.285 and we mistakenly classify 11% of our valid robotic RUM responders as random (i.e. our false positive rate rises from about 8% to 11%). Similarly, if we make the design even sparser (just 8 choice sets, so that each item appears once per respondent), then our cutoff is 0.351 and we end up with 19% false positives.

It looks like we can use the 95% cutoff rule to remove random responders from our CBC and MaxDiff studies without discarding too many non-random responders. Be careful with sparse data, however, lest an excessive false positive rate cause you to remove too many valid respondents.