I'm pulling some data from Amazon Mechanical Turk and saving it in a MongoDB collection.
I have multiple workers repeat each task, as a little redundancy helps me check the quality of the work.
Every time I pull data from Amazon using the boto AWS Python interface, I obtain a file containing all the completed HITs, which I want to insert into the collection.
Here is the document I want to insert into the collection:
    mongo_doc = {
        'subj_id': data['subj_id'],
        'img_id': trial['img_id'],
        'data_list': trial['data_list'],
        'worker_id': worker_id,
        'worker_exp': worker_exp,
        'assignment_id': ass_id,
    }
- img_id is an identifier of an image from a database of images.
- subj_id is an identifier of a person in that image (there might be multiple per image).
- data_list is the data I obtain from the AMT workers.
- worker_id, worker_exp, assignment_id are variables about the AMT worker and assignment.
Successive pulls using boto will contain the same data, but I don't want to have duplicate documents in my collection.
I am aware of two possible solutions, but neither works exactly for me:
- I could search for the document in the collection and insert it only if not present. But this would have a very high computational cost. 
- I can use upsert to make sure that a document is inserted only if a certain key value is not already present. But every individual key can be duplicated, since the task is repeated by multiple workers.
NOTE on the second option:
 - subj_id, img_id, data_list can be duplicated since different workers annotate the same subject and image, and could give the same data.
 - worker_id, worker_exp, assignment_id can be duplicated since a worker annotates multiple images within the same assignment.
 - The only unique thing is the combination of all these fields.
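
To make the last point concrete, here is a minimal sketch of how "unique on the combination of all fields" could be expressed as a single key. The helper names `doc_fingerprint` and `insert_if_new` are hypothetical, and the sketch assumes that every value in the document is JSON-serializable and that `collection` is a pymongo 3+ `Collection` exposing `update_one`; with an older pymongo the call would look different.

```python
import hashlib
import json


def doc_fingerprint(doc):
    """Return a deterministic SHA-1 hex digest computed over all fields of doc."""
    # sort_keys makes the serialization independent of dict key order;
    # default=str is a fallback for values json cannot encode natively.
    canonical = json.dumps(doc, sort_keys=True, default=str)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def insert_if_new(collection, doc):
    """Upsert doc keyed by its fingerprint (hypothetical helper).

    Assumes `collection` is a pymongo 3+ Collection, or any object with a
    compatible update_one(filter, update, upsert=...) method.
    """
    key = doc_fingerprint(doc)
    # $setOnInsert only writes the fields when the upsert creates a new
    # document; an existing document with the same _id is left untouched.
    return collection.update_one(
        {"_id": key},
        {"$setOnInsert": doc},
        upsert=True,
    )
```

Because the fingerprint is used as `_id`, the lookup goes through the index MongoDB always keeps on `_id`, so repeated pulls of the same HIT would map to the same key instead of scanning the collection. Two documents that differ in any field, such as `worker_id`, would get different keys.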
Is there a way I can insert the mongo_doc only if it was not inserted previously?
 
     
    