Editor's Note: This is a re-post with modified content to correct math errors. Thanks to those who were kind enough to point them out.
If you are like most Requesters, you spend as little time as possible qualifying your Workers. You probably send your HITs to all Workers with a 90% approval rate. But approval rate and accuracy are two different things, as an experiment we recently ran with a Requester demonstrates:
We sent the HITs through the system multiple times, varying only the Qualifications. Half the HITs went to pre-qualified Photo Moderation Masters. We sent the other half to Workers with a 95% approval rate. Each HIT asked Workers to determine whether the photo met certain guidelines (inappropriate content and the number of people in the photo).
Masters' answers were accurate 99% of the time. Workers who completed the other group of HITs (Workers with a 95% approval rate) were 90% accurate. Masters were also much more consistent: their accuracy ranged from 95% to 100%, meaning the “worst” Master was still 95% accurate. Workers in the broader group ranged from 68% to 100% accurate.
Less accurate Workers will have a lot more disagreements. Two Workers who are each 68% accurate will agree on the correct answer less than 50% of the time, agree on a wrong answer about 10% of the time, and disagree with each other about 44% of the time*. If you’re like most Requesters, you will ask more Workers to complete the same HITs in order to resolve these disagreements. This will increase the time and money you spend. Avoid this complexity and rework by using Workers who will give you accurate results, either by using Masters or by creating your own Qualifications based on Worker accuracy. You’ll get MUCH better accuracy.
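For readers who want to check the arithmetic, here is a minimal sketch of the calculation behind those percentages. It assumes exactly two Workers answer the same two-option HIT, that both have the same accuracy, and that their errors are independent. The function name `pairwise_agreement` and the labels are made up for this illustration and are not part of any Mechanical Turk API.

```python
def pairwise_agreement(accuracy):
    """Outcome probabilities for two Workers answering the same two-option HIT.

    Assumption: both Workers share the same accuracy and err independently.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    error = 1.0 - accuracy
    return {
        # Both Workers pick the correct option.
        "agree_correct": accuracy * accuracy,
        # Both Workers pick the (single) wrong option.
        "agree_wrong": error * error,
        # One is right and the other is wrong, in either order.
        "disagree": 2.0 * accuracy * error,
    }


# Accuracy figures taken from the experiment described in this post.
for label, accuracy in [("least accurate Worker, broad group", 0.68),
                        ("average Master", 0.99)]:
    outcome = pairwise_agreement(accuracy)
    print(label, {name: round(p, 4) for name, p in outcome.items()})
```

Under these assumptions, a pair of 68%-accurate Workers agrees on the correct answer about 46% of the time, agrees on the wrong answer about 10% of the time, and disagrees about 44% of the time. A pair of 99%-accurate Masters disagrees only about 2% of the time.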
P.S. Note that while I talk about using plurality for judging accuracy, I do not recommend using plurality to determine which assignments to approve or reject. More on why in a future post…
* This assumes Worker errors are independent and identically distributed, and that each question has only 2 options. With more options, inaccurate Workers are more likely to simply disagree than to agree on an incorrect option.