I have created a CBC that uses only pictures and no verbal descriptions. My master's thesis applies CBC in an unusual field, namely student schedules. Since there are many dependencies that would lead to many prohibitions, I allow small violations of the attribute levels (which are defined identically for all days) on some days, which might cause problems with the goodness of fit. I have 15 tasks (10 random, 2 reliability holdouts, 3 validity holdouts), and each task compares two concepts (no None option). I have 5 attributes with 2 levels each. When I run HB, I get the following average goodness-of-fit values:

Pct-Cert: 0.487

RLH: 0.701

(153 participants)

Question 1: With two concepts, the chance-level RLH is 0.5, so 0.7 seems acceptable. However, in one study Orme suggested that Pct-Cert should be at least 0.6, so should I assume there are problems with my study design?
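As a side note, for two-concept tasks without a None option the two fit statistics are tied together: if Pct-Cert is defined as 1 - LL/LL0 and RLH is the geometric mean likelihood, then Pct-Cert = 1 - ln(RLH)/ln(0.5). A quick sanity check against the reported value pairs (this is only a consistency check under those standard definitions, not Sawtooth's code):

```python
import math

def pct_cert_from_rlh(rlh, n_alts=2):
    """Percent certainty implied by an average RLH, assuming
    Pct-Cert = 1 - ln(RLH) / ln(chance), with chance = 1 / n_alts."""
    return 1 - math.log(rlh) / math.log(1.0 / n_alts)

# (RLH, reported Pct-Cert) pairs from the runs described above
for rlh, reported in [(0.701, 0.487), (0.721, 0.528),
                      (0.733, 0.552), (0.747, 0.578)]:
    implied = pct_cert_from_rlh(rlh)
    print(f"RLH {rlh:.3f} -> implied Pct-Cert {implied:.3f} "
          f"(reported {reported:.3f})")
```

The implied and reported values agree to about three decimals, so the two numbers carry essentially the same information here.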

I therefore removed unreliable participants based on the holdout tasks, and the values improved as follows:

Pct-Cert: 0.528

RLH: 0.721

(124 participants)

Since this is still not satisfactory, I also eliminated participants who took less than 2 minutes (which is appropriate for my study, as only images are compared). The values improved further:

Pct-Cert: 0.552

RLH: 0.733

(111 participants)

I repeated the same with a 3-minute cutoff:

Pct-Cert: 0.578

RLH: 0.747

(99 participants)

I also tested the orthogonality of my design with the preliminary counting test and the logit efficiency test and got optimal values. However, an aggregate logit test showed that two attributes have standard errors of about 0.6.

Question 2: Can I conclude from this that my images may not be representative of the levels (this is certainly the case because, as mentioned above, violations were allowed for some images), and that my deviation approach is therefore causing problems?

Question 3: What are optimal values for Avg. Variance and RMS? Are they as important as Pct-Certainty?

Question 4: In the next step I'll calculate the hit rate. Am I right to assume that based on my findings, the hit rate might be low?
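For validity holdouts with two concepts, the hit rate is just the share of (respondent, task) pairs where the concept with the higher total utility is the one actually chosen. A minimal sketch with made-up data (all names and numbers are placeholders, not from the study):

```python
import random

random.seed(0)

def hit_rate(utils, holdout_designs, choices):
    """Share of (respondent, task) pairs where the concept with the
    higher summed utility matches the observed choice (ties -> concept 0)."""
    hits = total = 0
    for r, beta in enumerate(utils):
        for t, (c0, c1) in enumerate(holdout_designs):
            u0 = sum(b * x for b, x in zip(beta, c0))
            u1 = sum(b * x for b, x in zip(beta, c1))
            pred = 0 if u0 >= u1 else 1
            hits += int(pred == choices[r][t])
            total += 1
    return hits / total

# Toy data: 10 respondents, 5 effects-coded (+/-1) attributes, 3 holdout tasks
utils = [[random.gauss(0, 1) for _ in range(5)] for _ in range(10)]
designs = [tuple(tuple(random.choice([-1, 1]) for _ in range(5))
                 for _ in range(2))
           for _ in range(3)]
# Choices constructed to be perfectly consistent with the utilities
choices = [[0 if sum(b * x for b, x in zip(beta, c0))
                 >= sum(b * x for b, x in zip(beta, c1)) else 1
            for (c0, c1) in designs]
           for beta in utils]
print(hit_rate(utils, designs, choices))  # -> 1.0
```

With real data you would plug in the individual HB utilities and the observed holdout choices; a hit rate near 0.5 would mean the model predicts the holdouts no better than chance.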

Thank you very much!

I need to remove 29 of 153 respondents who are not reliable, which is already 19%.

Step 1: To clean respondents with a low RLH, I first need to run an HB with the reliable data, right? Then I export the utility file and look at the individual RLH values of the reliable respondents.

Step 2: Based on this, I remove all respondents who have an RLH < 0.579 (https://www.sawtoothsoftware.com/help/lighthouse-studio/manual/hid_web_maxdiff_badrespondents.html) and a total time < 2 min.

Step 3: Then I run an HB again with the reliable respondents who have a good RLH and took enough time.

Step 4: Based on the new RLHs from step 3, I can calculate the average RLH and average Pct-Certainty as you described in another forum post. Or do I instead need to use the "old" RLHs from step 1?
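The filtering in step 2 can be sketched in a few lines. This assumes the exported utility file is read into rows with an RLH and an interview time per respondent; the field names and all numbers below are my own placeholders, not Sawtooth's export format:

```python
# Hypothetical rows from the individual utility export (placeholder data)
respondents = [
    {"id": 1, "rlh": 0.82, "time_min": 4.0},
    {"id": 2, "rlh": 0.41, "time_min": 1.5},
    {"id": 3, "rlh": 0.66, "time_min": 3.2},
    {"id": 4, "rlh": 0.55, "time_min": 1.8},
    {"id": 5, "rlh": 0.91, "time_min": 6.0},
]

RLH_CUTOFF = 0.579   # cutoff from the Sawtooth "bad respondents" article
TIME_CUTOFF = 2.0    # minutes

# Keep only respondents passing BOTH criteria; flip the boolean if you
# intend to drop only those failing both at once.
kept = [r["id"] for r in respondents
        if r["rlh"] >= RLH_CUTOFF and r["time_min"] >= TIME_CUTOFF]
print(kept)  # -> [1, 3, 5]
```

One thing worth deciding explicitly: "RLH < 0.579 and time < 2 min" can mean dropping respondents who fail either criterion or only those who fail both, and the resulting sample sizes differ.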

After conducting steps 1 and 2, I have eliminated the 29 unreliable respondents plus 7 additional respondents with RLH < 0.579 and time < 2 min. In total I have excluded 23.5%.

Is this approach ok?

I have two prohibitions, so the number of possible profiles decreased from 32 to 30. The preliminary counting analysis still showed excellent results for frequencies and simulated standard errors (I compared it to the same design without prohibitions, and the outputs are the same). However, I used pictures instead of verbal descriptions and allowed some "deviations" of attribute levels in order to avoid further prohibitions: I would normally have needed 10 prohibitions, but because deviations of attribute levels are permitted to a certain degree, I could reduce that number to two. These deviations might have caused the higher standard errors, because the pictures are not 100% representative of the attribute levels. Is this right?
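For the profile count: with 5 two-level attributes there are 2^5 = 32 full profiles, and going from 32 to 30 implies each prohibition rules out exactly one complete level combination. A sketch of that count (the two prohibited combinations below are made-up placeholders, since the actual ones are not stated here):

```python
from itertools import product

# All full profiles of 5 attributes with 2 levels each (levels coded 0/1)
profiles = list(product([0, 1], repeat=5))
print(len(profiles))  # -> 32

# Hypothetical prohibitions: each forbids one complete level combination.
prohibited = {(0, 0, 1, 1, 0), (1, 1, 0, 0, 1)}

allowed = [p for p in profiles if p not in prohibited]
print(len(allowed))  # -> 30
```

Note that a prohibition between single levels of just two attributes would instead remove 2^3 = 8 of the 32 profiles, so the 32 → 30 count only works out if the prohibitions pin down all five attributes.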

Thank you so much!