# Comparing Latent Gold and R’s mclust Package for Model-Based Clustering for Segmentation

Last updated: 29 Aug 2023

## Introduction

During a recent project I noticed that when I tried to create clusters based on a set of continuous variables, Latent Gold (hereafter “LG”) and the R package mclust gave me different solutions. That part wasn’t too surprising, because prior research comparing the two packages had already identified this possibility (Haughton et al. 2009). What did surprise me, and what Haughton et al. didn’t report, was that the two programs identified different numbers of segments (seven in the case of Latent Gold and three in the case of mclust).

To dig more deeply into this difference, and to learn which program gives more accurate results, I conducted a small test (and it is a small test; a larger and more systematic test might be a nice paper for someone to propose presenting at an upcoming conference, hint hint). As described below, I generated nine data sets according to an orthogonal experimental design. Data sets differed in terms of their number of segments, the distance between segments and the number of dimensions in the cluster space. After segmenting the data sets with LG’s cluster routine and mclust’s Mclust model-based clustering routine in turn, we’ll see how well the two perform in terms of (a) identifying the correct number of segments and (b) putting the right respondents in the right segments.

## Data Generation

Qiu and Joe (2020) provide the clusterGeneration package for constructing artificial data sets with known segment structures based on a variety of user-specifiable inputs. I used a 9-run fractional factorial design to systematically vary three aspects of the data sets, at three levels each:

• Number of segments: 3, 5 or 7
• Cluster separation:
  • Close (setting a control parameter called sepVal to 0.01)
  • Separated (sepVal = 0.21)
  • Well-separated (sepVal = 0.34)
• Number of dimensions in the space: 4, 8 or 12

The 9 experimental conditions were

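| Condition | Segments | Separation | Dimensions |
|-----------|----------|------------|------------|
| 1 | 3 | 0.01 | 4 |
| 2 | 5 | 0.21 | 12 |
| 3 | 7 | 0.34 | 8 |
| 4 | 3 | 0.34 | 12 |
| 5 | 5 | 0.01 | 8 |
| 6 | 7 | 0.21 | 4 |
| 7 | 3 | 0.21 | 8 |
| 8 | 5 | 0.01 | 4 |
| 9 | 7 | 0.34 | 12 |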

Finally, all nine data sets used segments of random sizes.

After creating these nine data sets, I used both LG and mclust to form segments.

## Analysis

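For both mclust and LG clustering I selected the model that BIC favors, as LG documentation advises and as mclust does by default. In LG's convention that is the model with the lowest BIC; mclust reports BIC with the opposite sign, so its default picks the highest value.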
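To make the selection rule concrete, here is a minimal Python sketch of choosing the number of segments by BIC, written in the lower-is-better form. The function names and the log-likelihoods and parameter counts are made up for illustration; they are not output from LG or mclust and not results from this study.

```python
from math import log


def bic(log_likelihood, n_params, n_obs):
    """BIC in the 'lower is better' form: -2 * logL + k * ln(n)."""
    return -2.0 * log_likelihood + n_params * log(n_obs)


def pick_by_bic(candidates, n_obs):
    """Return the segment count with the lowest BIC, plus all scores.

    candidates maps number of segments -> (log_likelihood, n_params).
    """
    scores = {g: bic(ll, k, n_obs) for g, (ll, k) in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical fit summaries, for illustration only.
hypothetical_fits = {
    2: (-1250.0, 9),
    3: (-1100.0, 14),
    4: (-1095.0, 19),
}

best_g, scores = pick_by_bic(hypothetical_fits, n_obs=500)
print(best_g, scores)
```

In this toy example the four-segment model fits slightly better than the three-segment one, but not by enough to pay for its extra parameters, so the penalty term tips the choice to three segments.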

Results for the nine data sets were as follows, in terms of the Adjusted Rand Index (ARI). Columns 2 and 3 below show the ARI of the mclust and LG solutions, respectively, compared to known segment membership (“na” indicates that the method failed to identify the correct number of segments). The fourth column reports the ARI of the two solutions compared to one another.

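| Condition | ARI mclust | ARI LG | ARI (mclust, LG) |
|-----------|------------|--------|------------------|
| 1 | 0.926 | na | - |
| 2 | 0.988 | 0.988 | 1 |
| 3 | 1 | 1 | 1 |
| 4 | 1 | 1 | 1 |
| 5 | 0.901 | 0.893 | 0.951 |
| 6 | 0.988 | 0.935 | 0.996 |
| 7 | 0.994 | 0.994 | 1 |
| 8 | 1 | 1 | 1 |
| 9 | 0.877 | 0.882 | 0.989 |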

LG identifies the correct number of segments in all but one of the conditions (condition 1), and mclust does so in all nine. Both methods achieve high agreement with the known segment structure, with an ARI of at least 0.877, and often exactly 1, whenever they identify the correct number of segments. The two methods also agree closely with each other about the observations that they assign to each segment, with an ARI of at least 0.951 between their solutions.
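For readers who have not worked with the ARI, the sketch below shows how it can be computed from two sets of segment labels, using the standard pair-counting formula. It is a dependency-free Python illustration, not the code used for the table above; the function name and the toy label vectors are my own. R users can get the same quantity from `adjustedRandIndex()` in mclust.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same observations."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two observations")

    # Contingency table cells and its row/column margins.
    cell_counts = Counter(zip(labels_a, labels_b))
    a_counts = Counter(labels_a)
    b_counts = Counter(labels_b)

    # Numbers of pairs placed together in both, in A, and in B.
    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_a = sum(comb(c, 2) for c in a_counts.values())
    sum_b = sum(comb(c, 2) for c in b_counts.values())

    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2        # agreement if partitions match
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings are one big segment
    return (sum_cells - expected) / (max_index - expected)


# Toy labels, for illustration only.
truth = [0, 0, 1, 1, 2, 2]
relabeled = [2, 2, 0, 0, 1, 1]  # same partition, different segment names
print(adjusted_rand_index(truth, relabeled))  # 1.0
```

Because the index depends only on which observations are grouped together, it is unaffected by how each program happens to number its segments, which is what makes it suitable for comparing LG and mclust solutions to each other and to the known structure.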

## Conclusions/Recommendations

Based on this small-scale experiment, the two methods perform similarly. This suggests that the conflicting solutions that motivated this investigation may be an uncommon situation. It also suggests that analysts can use either method with some comfort. Those preferring the free R package and working in the powerful R environment get good results that largely agree with the commercial Latent Gold package. Those wanting one easy-to-use software package to do all their latent class and model-based clustering (whether with categorical or continuous or mixed variable types) can rest easy knowing that they’re not giving up quality to gain ease-of-use.

### Future Research

As noted, this is a very small study, perhaps little more than a proof-of-concept for a more in-depth study someone else might want to pursue.  Some avenues for improvement I might suggest are:

• Though I created one data set per experimental condition, a better practice for a larger scale study would be to replicate the analysis 10+ times in each condition.
• The results table above suggests that both methods struggle more when the number of dimensions is small and when clusters are closer. A future study might focus more on these harder cases to give both methods a more challenging stress test.
• I used the clusterGeneration package of Qiu and Joe (2020), which draws data points from multivariate normal distributions. I’ve just become aware of a more recent entry in the effort to create realistic artificial cluster structures, Fachada and de Andrade (2023), whose Clugen package does not rely on the multivariate normal for building segments. Clugen also allows for segments with different spatial orientations. It’s possible that these two aspects might help distinguish the performance of LG and mclust better than I was able to do in this small test.

## References

Fachada N, de Andrade D (2023). “Generating multidimensional clusters with support lines.” Knowledge-Based Systems, 277, 110836. doi:10.1016/j.knosys.2023.110836.

Haughton, D., P. Legrand, and S. Woolford (2009) “Review of three latent class cluster analysis packages: Latent Gold, poLCA and mclust.” The American Statistician, 63(1), 81-91. doi:10.1198/tast.2009.0016.

Qiu, W. & H. Joe (2020) “clusterGeneration: random cluster generation (with specified degree of separation).” https://cran.r-project.org/web/packages/clusterGeneration/clusterGeneration.pdf.

Scrucca L., Fop M., Murphy T. B. and Raftery A. E. (2016) mclust 5: clustering, classification and density estimation using Gaussian finite mixture models, The R Journal, 8/1, pp. 205-233. https://journal.r-project.org/archive/2016/RJ-2016-021/RJ-2016-021.pdf

Statistical Innovations Inc. (2023). LatentGOLD® version 6.0.0. [computer software]. https://www.statisticalinnovations.com/