2009Jan22
Maureen showed the equivalent of last week's demo, but with real data from the Specify db
Decided to meet weekly 9:30 am Thursday alternating between HUH and UMB
To discuss:
What should fuzzy matches be when searching for duplicates?
James: What should we put fuzzy matches on and how?
Bob: where is mapping done?
Consensus: some such things should be provided by the network, e.g. date matching
Paul: What errors to tolerate? How many "possible" duplicates to offer? What diverges among dups? Catalog numbers and species id diverge, but collector, collector number, and date should agree. A strategy: ask an initial question - collector name + number, exact matches. If any results come back, group and test them against additional data. If no results come back, expand the question to fuzzy matching of collector name and number plus other information coming in from data capture.
James: try matches on few fields first. Increase fields if no good match. Hierarchy of fields on which to match.
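A minimal Python sketch of the tiered approach Paul and James describe: ask the exact question first, and widen to fuzzy matching only if nothing comes back. The record layout (dicts with "collector" and "number" keys), the function names, the 0.8 threshold, and the use of difflib as the similarity measure are all assumptions for illustration; none of them were decided at the meeting.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0..1 similarity ratio between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_candidates(query, records, threshold=0.8):
    """Return (tier, candidates) for a query record.

    Tier 1: exact match on collector name AND collector number.
    Tier 2: only if tier 1 is empty, fuzzy match on both fields.
    The threshold is a hypothetical value, not an agreed one.
    """
    exact = [
        r for r in records
        if r["collector"] == query["collector"]
        and r["number"] == query["number"]
    ]
    if exact:
        return "exact", exact

    fuzzy = [
        r for r in records
        if similarity(r["collector"], query["collector"]) >= threshold
        and similarity(r["number"], query["number"]) >= threshold
    ]
    return "fuzzy", fuzzy
```

Returning the tier alongside the candidates lets the caller decide how much further testing is needed: exact hits would still be grouped and checked against additional data, while fuzzy hits would bring in the other fields from data capture, which this sketch leaves out.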
Zhimin: suggest that provider offers confidence of claim to be offering a duplicate.
James: Remember that there needs to be a way to confirm/reject/annotate claims of a duplicate.
Bob: Matching what?
Consensus: we need to know which ABCD(?) fields should be identified as requiring some methodologies for fuzzy matching and some notion of confidence in the application of those methodologies.
Collector number - exact match http://www.bgbm.org/tdwg/codata/schema/ABCD_2.06/HTML/ABCD_2.06.html#element_CollectorsFieldNumber_Link03185058
Collector name - exact up to data entry standards http://www.bgbm.org/tdwg/codata/schema/ABCD_2.06/HTML/ABCD_2.06.html#element_GatheringAgent_Link03174CE0
Collector name AND Collector number
Date of collection - http://www.bgbm.org/tdwg/codata/schema/ABCD_2.06/HTML/ABCD_2.06.html#element_DateTime_Link03174F90
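One possible reading of "exact up to data entry standards" is to normalize values before comparing them exactly, so that differences in case, punctuation, accents, and spacing do not block a match. The Python sketch below illustrates that reading only; the function names and the specific normalization rules are assumptions, not something agreed in the meeting.

```python
import re
import unicodedata


def normalize_collector(name):
    """Fold case, accents, punctuation, and spacing in a collector name."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    unaccented = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Replace punctuation with spaces, then collapse runs of whitespace.
    no_punct = re.sub(r"[^\w\s]", " ", unaccented)
    return " ".join(no_punct.lower().split())


def normalize_number(number):
    """Remove whitespace and upper-case any letter suffix, e.g. '1234 a' -> '1234A'."""
    return re.sub(r"\s+", "", number).upper()


def same_collector_and_number(a, b):
    """Exact match on collector name AND collector number after normalization."""
    return (
        normalize_collector(a["collector"]) == normalize_collector(b["collector"])
        and normalize_number(a["number"]) == normalize_number(b["number"])
    )
```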
Zhimin: place a dissimilarity measure on each field and a weight for that field
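Zhimin's suggestion could look roughly like the Python sketch below: each field gets its own dissimilarity measure (0 for identical, 1 for completely different) and a weight, and the record-level score is the weighted average. The field names, the weights, and the choice of measures are hypothetical placeholders; the notes do not fix any of them.

```python
from difflib import SequenceMatcher


def string_dissimilarity(a, b):
    """0.0 for identical strings, approaching 1.0 as they diverge."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def exact_dissimilarity(a, b):
    """0.0 if the values are equal, 1.0 otherwise."""
    return 0.0 if a == b else 1.0


# field name -> (dissimilarity measure, weight); all values are illustrative.
FIELDS = {
    "collector": (string_dissimilarity, 0.4),
    "number": (exact_dissimilarity, 0.4),
    "date": (exact_dissimilarity, 0.2),
}


def record_dissimilarity(a, b, fields=FIELDS):
    """Weighted average of per-field dissimilarities, in the range 0..1."""
    total_weight = sum(weight for _, weight in fields.values())
    score = sum(
        weight * measure(a[name], b[name])
        for name, (measure, weight) in fields.items()
    )
    return score / total_weight
```

A provider could report 1 minus this score as its confidence that two records are duplicates, which would tie in with the earlier suggestion that providers state their confidence in a claim.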
The test Flickr EOL group quality control tool is up at http://www.aa3sd.net/qc_test/ and the Flickr group has been made aware of it. Over the first week, the percent of images in the group pool with non-compliant licenses dropped from 25% to just under 10%. Potential for automated push of qc back to group members, but only on a per-image basis.
For 2009Jan29 at UMB todo:
- Maureen and Zhimin will demonstrate the network with nodes at HUH and UMB.
- Maureen will demo UI for query generation
- Paul and James will rationalize these notes and make simple cases of what should be the hierarchy of fields (Concepts) upon which to do fuzzy match.
- Zhimin to look for date recognition algorithms and named entity recognition algorithms that could be used for matches