A Discrete Choice Take on Halloween

Last updated: 07 Jan 2019

Checking my Twitter feed the other day (@meganpeitz), I couldn’t help but get excited when I saw @FiveThirtyEight tweeting about the Ultimate Halloween Candy Power Rankings - a list of 80+ treats along with a Win Percentage. Being the discrete choice fan that I am, I dug a bit further in the hopes of finding an underlying experimental design. Unfortunately, I didn’t. And although @WaltHickey, the author, does mention that this is “not a scientific survey”, I saw it as an opportunity to make more folks aware of the power of discrete choice, specifically Best-Worst Scaling or MaxDiff.

What is MaxDiff? MaxDiff is an approach for measuring consumer preference for a list of items. Items could include messages, benefits, images, product names, claims, brands, features, packaging options, and more! In this case – we have 80+ different options a trick-or-treater could receive in their Halloween bag. (Being the Bayesian I am, I used Hickey’s item list as a prior, borrowing ~90% of his items for my experiment. See the full article from Hickey here.)

In Hickey’s experiment, he used a paired comparison design. This means that each person saw two items at a time.


But for 80 items, there are 3,160 (80 choose 2) possible paired comparisons! Hickey reports that, on average, 32 questions were completed per person, with a median of 11, a very small proportion of the total number of pairs. Now, Hickey and FiveThirtyEight have a much larger following than I do, so he was able to collect answers for 269,000 paired comparisons from over 8,000 different IP addresses. Because of that, his results are probably pretty darn good. But can we create an experiment that captures more information per screen and requires a smaller sample? The answer is yes.
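To see where that 3,160 comes from, the number of distinct pairs among 80 items is just "80 choose 2":

```python
from math import comb

# Number of distinct paired comparisons among 80 candies: "80 choose 2"
n_items = 80
n_pairs = comb(n_items, 2)
print(n_pairs)  # 3160
```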

Instead of asking about items in pairs, we can present multiple items at a time and ask the respondent to tell us which item is 'Best' and which is 'Worst'. This is called Best-Worst Scaling, or MaxDiff. In our survey, if we asked about 5 items at a time, then from just two clicks we would learn about 7 of the 10 possible paired comparisons!


From this respondent's answers, we could conclude that:

  • Reese’s Peanut Butter Cup > Tootsie Pop
  • Reese’s Peanut Butter Cup > Jawbusters
  • Reese’s Peanut Butter Cup > Almond Joy
  • Reese’s Peanut Butter Cup > Milky Way Midnight


  • Jawbusters > Tootsie Pop
  • Almond Joy > Tootsie Pop
  • Milky Way Midnight > Tootsie Pop

We just don’t know…

  • Jawbusters ??? Almond Joy
  • Jawbusters ??? Milky Way Midnight
  • Almond Joy ??? Milky Way Midnight
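The logic above can be sketched in a few lines: given one Best-Worst question (using the five items from the example), the Best pick beats everything shown, and everything shown beats the Worst pick.

```python
from itertools import combinations

# Pairs implied by a single Best-Worst question showing 5 items
items = ["Reese's Peanut Butter Cup", "Tootsie Pop", "Jawbusters",
         "Almond Joy", "Milky Way Midnight"]
best, worst = "Reese's Peanut Butter Cup", "Tootsie Pop"

# Best beats everything else shown; everything else shown beats Worst.
known = {(best, other) for other in items if other != best}
known |= {(other, worst) for other in items if other not in (best, worst)}

all_pairs = list(combinations(items, 2))
print(len(known), "of", len(all_pairs))  # 7 of 10
```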

Asking about 3 to 5 items at a time in a Best-Worst experiment captures more information more quickly than asking about just 2 items. Therefore,

Recommendation #1 – Use Best-Worst Scaling instead of Paired Comparisons


For analysis, Hickey created a Win Percentage, computed as the number of times Candy X won divided by the number of times Candy X was available to be chosen. Reese’s Peanut Butter Cups take the #1 spot with an 84% win rate. The first non-chocolate item, Starburst, comes in at rank #13 with a 67% win rate, and the worst-performing item, Good & Plenty, wins only 22% of the time.
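The counting can be sketched as follows. The (wins, appearances) records below are invented for illustration, chosen so the resulting percentages match the ones reported above:

```python
# Hickey-style "Win Percentage": wins divided by times shown.
# These counts are made up for illustration only.
records = {
    "Reese's Peanut Butter Cup": (84, 100),
    "Starburst": (67, 100),
    "Good & Plenty": (22, 100),
}
win_pct = {item: wins / shown for item, (wins, shown) in records.items()}
for item, pct in sorted(win_pct.items(), key=lambda kv: -kv[1]):
    print(f"{item}: {pct:.0%}")
```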

Using these results, we can only say things like, “Overall, Reese’s Peanut Butter Cups are most preferred.” But we can’t say much, if anything, about the individual respondent, or about potential segments that might exist in the marketplace. Think, for example, of the people who don’t like chocolate, or the people who are allergic to peanut butter. A Reese’s Peanut Butter Cup is certainly not going to make them very happy. And what if we didn’t have access to a super large sample? We might not have shown enough non-chocolate items to the folks who don’t like chocolate, or we might have shown too many peanut butter products to the folks who are allergic to peanut butter!

Recommendation #2 – Build a model instead of using counts

When we have a properly designed experiment and enough observations of every item per individual (1 observation per item is called a sparse design), we can actually create a model that predicts how each individual would choose in every match-up, regardless of whether they saw that match-up in their experiment or not! Let me explain further.


If we show every item to each individual at least once, then we will have enough information to run a hierarchical Bayesian (HB) regression*. This model borrows information from other respondents to ‘fill in the gaps’ for the match-ups that a given respondent didn’t see. The resulting model produces a score for every item for every individual. Better yet, the scores are ratio-scaled, so an item with a score of 10 is twice as preferred as one with a score of 5!
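One common way to get ratio-scaled scores from a logit-based model is to exponentiate the raw utilities and normalize them so they sum to 100. This is a sketch of that rescaling only, not of HB estimation itself, and the utilities below are invented:

```python
import math

# Invented raw logit utilities for a few candies (illustration only).
utilities = {"Reese's": 1.8, "Twix": 1.1, "Candy Corn": 0.2,
             "Good & Plenty": -1.5}

# Exponentiate, then normalize to sum to 100: a ratio scale, where a
# score of 10 indicates twice the preference of a score of 5.
exp_u = {k: math.exp(v) for k, v in utilities.items()}
total = sum(exp_u.values())
scores = {k: 100 * v / total for k, v in exp_u.items()}
```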

From there, we can find out how people would choose in a specific scenario. We can do this at the aggregate level (across all respondents), like Hickey’s results, and, more importantly, down to the individual.


Since we have individual-level data, we can look and see whether specific segments are interested in one candy versus another. We can even run a TURF (Total Unduplicated Reach and Frequency) analysis on the MaxDiff scores, and find out that we should actually buy Reese’s Peanut Butter Cups, Twix, and Candy Corn to appeal to the majority of trick-or-treaters. Using MaxDiff, your house will be remembered as the one that made everyone happy on Halloween night! (Unless of course one of your neighbors is offering King Size candy bars...)
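The core of a TURF analysis can be sketched with a brute-force search: a respondent is "reached" if the offered set contains at least one candy they like. The preference data below is made up for illustration:

```python
from itertools import combinations

# Invented per-respondent "acceptable candy" sets (illustration only).
likes = [
    {"Reese's", "Twix"},
    {"Candy Corn"},
    {"Twix", "Candy Corn"},
    {"Reese's"},
    {"Snickers"},
]
candies = set().union(*likes)

# Find the 3-candy set that reaches the most respondents: a respondent
# counts as reached if the set overlaps their acceptable candies.
best_set, best_reach = None, -1
for combo in combinations(sorted(candies), 3):
    reach = sum(1 for person in likes if person & set(combo))
    if reach > best_reach:
        best_set, best_reach = set(combo), reach
print(best_set, best_reach)
```

Real TURF tools search over MaxDiff scores with thresholds rather than hard like/dislike sets, but the reach-maximization idea is the same.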

We collected some data using a MaxDiff survey.  Here are the results.

P.S. Let’s not forget, MaxDiff is also better than ratings data. You get greater discrimination among items, greater discrimination between respondents on each item, no scale-use bias, and it looks great on mobile devices! Read more about MaxDiff and how it compares to ratings data in Bryan Orme’s article, How Good Is Best-Worst Scaling? Or check out this quick video on What Can MaxDiff Do For You?

* A general rule of thumb is to show each item at least 2 times, preferably 3 times, per respondent to obtain stable HB scores. However, recent papers from Serpetti et al. and Chrzan & Peitz suggest that sparse designs (1 exposure per item) can do very well when modeling at the individual level. Should your list of items exceed 100, though, we are unsure whether the sparse approach is scalable, and a Bandit MaxDiff approach may be a better option.