One of the challenges our Requesters face is evaluating submitted work in an easy and efficient manner. Some Requesters manage the quality of the work they receive by using Master Workers for their HITs, or by adding Qualifications that allow only their preferred Workers to work on them. This keeps the results they receive within their quality expectations. But what about cases where the HIT is open to a broader audience? It can be challenging to decide whether the work is usable. Several techniques can be very helpful for evaluating submitted work – e.g. Known Answers and Plurality.
This blog is not a detailed analysis of the effectiveness of those techniques – it is an introduction to a tool that makes them easy for a Requester to use. The tool is provided as a set of macros in a spreadsheet, which allows Requesters to use it with output from the Requester User Interface (RUI) as well as the Command Line Tools (CLT). It was built as a quick solution to answer our own questions about large batches of HITs we ran. It evolved from ongoing analysis of our own batches rather than a dedicated product development effort, and was useful enough that we thought we should share it with our Requesters under the Amazon Software License. It is not a supported product – the intent is to provide a starting point for Requesters to evaluate their HITs.
First let’s start with an overview of some techniques to help evaluate work:
Known Answers – Requesters can embed known good answers within their HITs to enable evaluation of work. For example, suppose a Requester is publishing HITs to categorize products, and some of the products are known to belong to a specific category – e.g. a tuba belongs in the ‘Musical Instruments’ category. When publishing the HITs, the Requester blends these known answers into the groups of HITs published. When a Worker gets the known answers correct, it increases the Requester’s confidence that the other answers provided by that Worker are also going to be useful. Known answers can be embedded within each HIT, or scattered across HITs.
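To make the idea concrete, here is a small sketch of scoring Workers against known answers. This is illustrative only – the tuple layout, HIT IDs, and category names below are assumptions for the example, not the tool’s actual data format:

```python
# Illustrative Known Answers scoring: the (worker_id, hit_id, answer)
# tuple layout and the sample IDs are assumptions for this sketch.
from collections import defaultdict

def known_answer_accuracy(assignments, known_answers):
    """Return each Worker's accuracy on the HITs that have known answers.

    assignments: iterable of (worker_id, hit_id, answer) tuples
    known_answers: dict mapping hit_id -> the known correct answer
    """
    correct = defaultdict(int)
    seen = defaultdict(int)
    for worker_id, hit_id, answer in assignments:
        if hit_id in known_answers:          # only score the known HITs
            seen[worker_id] += 1
            if answer == known_answers[hit_id]:
                correct[worker_id] += 1
    return {w: correct[w] / seen[w] for w in seen}

assignments = [
    ("W1", "hit-tuba", "Musical Instruments"),
    ("W1", "hit-123", "Toys"),
    ("W2", "hit-tuba", "Sports"),
]
known = {"hit-tuba": "Musical Instruments"}
print(known_answer_accuracy(assignments, known))  # {'W1': 1.0, 'W2': 0.0}
```

A Worker who scores well on the known HITs gives you more confidence in their answers on the unknown ones – that is the whole technique.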
Plurality – Requesters can ask the same question of multiple Workers and see whether they agree or disagree (see the related blog). This is done by creating multiple assignments for each HIT. If a sufficient percentage of Workers are in agreement, then the answer could be assumed to be usable. Alternatively, a HIT can be considered usable once a certain number of Workers agree upon an answer. Plurality can be a useful tool to determine answer usefulness, but we don’t recommend using it for approvals/rejections for the reasons outlined in the related blog.
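A plurality check over one HIT’s assignments can be sketched in a few lines. The 70% agreement threshold here is an arbitrary assumption for the example, not a recommendation:

```python
# Hypothetical plurality check for a single HIT's answers. The 70%
# threshold is an arbitrary assumption for illustration.
from collections import Counter

def plurality(answers, threshold=0.7):
    """Return (winning_answer, agreement); winning_answer is None
    when agreement falls below the threshold."""
    top_answer, votes = Counter(answers).most_common(1)[0]
    agreement = votes / len(answers)
    return (top_answer if agreement >= threshold else None, agreement)

# Three of four Workers agree, so the answer is considered usable:
print(plurality(["Tuba", "Tuba", "Tuba", "Horn"]))  # ('Tuba', 0.75)
```

The alternative mentioned above – accepting a HIT once a fixed number of Workers agree – would simply compare `votes` against a count instead of a percentage.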
The Excel Tool
The linked Excel tool provides a set of macros/actions that work with data from the RUI or CLT. The tool starts with the HIT information you’ve downloaded from the RUI or CLT, analyzes the data using your configuration for known answers and plurality, and then outputs the results of that analysis by Worker, HIT, and Assignment into another file. Here is a picture of what it does:
There is a three-step process to evaluate work using these macros:
1) Select & read the output file that contains the HIT Answers you are looking to evaluate:
- This is usually a Batch Results file downloaded from the RUI, or a *.results file generated by the getResults command in the CLT
2) Set the parameters for evaluating the work, including providing known answers and recommended actions
- This allows you to set parameters such as the threshold percentage of known answers a Worker must get correct before you consider their answers useful
- You can also provide any known answers so that they can be used to help evaluate the answers
3) Run the evaluation
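Outside of Excel, the same three steps can be sketched in a few lines of Python. This is a hypothetical sketch rather than the tool’s actual logic – the column names (`WorkerId`, `HITId`, `Answer.category`) and the 80% threshold are assumptions you would adjust to match your own results file:

```python
# Hypothetical sketch of the three steps; column names and the 80%
# accuracy threshold are assumptions, not the tool's real configuration.
import csv
from collections import defaultdict

KNOWN = {"HIT123": "Musical Instruments"}   # step 2: your known answers
THRESHOLD = 0.8                             # step 2: required accuracy

def evaluate(path):
    """Mark each Worker's output as useful (True) or not, based on how
    often they matched the known answers."""
    results = defaultdict(list)
    with open(path, newline="") as f:       # step 1: read the results file
        for row in csv.DictReader(f):
            if row["HITId"] in KNOWN:
                results[row["WorkerId"]].append(
                    row["Answer.category"] == KNOWN[row["HITId"]])
    # step 3: run the evaluation against the threshold
    return {w: sum(ok) / len(ok) >= THRESHOLD for w, ok in results.items()}
```

A plurality pass would work the same way, grouping rows by `HITId` instead of `WorkerId` and counting agreement across assignments.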
Try it with some result files from the RUI or getResults, and let us know what you think. We’ll do a walk-through of the details in a follow-up post.
Caveats: the tool is not an MTurk dev project – it is a collection of macros from a few of our non-developers, and that will show in the code. There are many other known techniques and algorithms that could be used – they are left as an exercise for the reader :). It has only been tested on CSVs of up to ~25 MB – larger files will need a different solution.