Maureen showed the equivalent of last week's demo, but with real data from the Specify db

Decided to meet weekly 9:30 am Thursday alternating between HUH and UMB

To discuss:

What should fuzzy matches be when searching for duplicates?

James: What should we put fuzzy matches on and how?

Bob: where is mapping done?

Consensus: some such things should be provided by the network, e.g. date matching

Paul: what errors to tolerate? How many "possible" duplicates to offer? What diverges among dups? Catalog numbers and species id can differ, but collector, collector number, and date should agree. A strategy: ask an initial question (collector name + number, exact matches). If any results come back, group and test them against additional data. If no results come back, expand the question to fuzzy matching of collector name and number plus other information coming in from data capture.
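A minimal Python sketch of the two-stage question Paul describes: exact match first, fuzzy match only if nothing comes back. The record keys (`collector`, `number`), the 0.85 threshold, and the use of `difflib.SequenceMatcher` as the fuzzy measure are illustrative assumptions, not anything decided at the meeting.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity in [0, 1] (assumed measure, not agreed on)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_candidates(query, records, threshold=0.85):
    """Return (stage, matches): exact matches if any, otherwise fuzzy ones."""
    # Stage 1: exact match on collector name + collector number.
    exact = [
        r for r in records
        if r["collector"] == query["collector"] and r["number"] == query["number"]
    ]
    if exact:
        return "exact", exact

    # Stage 2: nothing came back, so expand to fuzzy matching on both fields.
    fuzzy = [
        r for r in records
        if similarity(r["collector"], query["collector"]) >= threshold
        and similarity(r["number"], query["number"]) >= threshold
    ]
    return "fuzzy", fuzzy
```

Grouping the exact results and testing them against additional data is left out here; it would sit between the two stages.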

James: try matches on few fields first. Increase fields if no good match. Hierarchy of fields on which to match.

Zhimin: suggest that provider offers confidence of claim to be offering a duplicate.

James: Remember that there needs to be a way to confirm/reject/annotate claims of a duplicate.

Bob: Matching what?

Consensus: we need to know which ABCD(?) fields should be identified as requiring some methodologies for fuzzy matching and some notion of confidence in the application of those methodologies.

Collector number - exact match http://www.bgbm.org/tdwg/codata/schema/ABCD_2.06/HTML/ABCD_2.06.html#element_CollectorsFieldNumber_Link03185058

Collector name - exact up to data entry standards http://www.bgbm.org/tdwg/codata/schema/ABCD_2.06/HTML/ABCD_2.06.html#element_GatheringAgent_Link03174CE0

Collector name AND Collector number

Date of collection - http://www.bgbm.org/tdwg/codata/schema/ABCD_2.06/HTML/ABCD_2.06.html#element_DateTime_Link03174F90
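The per-field rules listed above could be sketched as small comparison functions. This is only an illustration: reading "exact up to data entry standards" as "ignore case, punctuation, and spacing" is an assumption, and so is the idea that dates arrive as ISO 8601 strings (the notes do not yet say how dates should be compared).

```python
import re
from datetime import date


def normalize_name(name):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", name.lower())
    return " ".join(cleaned.split())


def same_collector_number(a, b):
    """Collector number: exact match, ignoring surrounding whitespace."""
    return a.strip() == b.strip()


def same_collector_name(a, b):
    """Collector name: exact match after the assumed normalization."""
    return normalize_name(a) == normalize_name(b)


def same_collector_and_number(name_a, number_a, name_b, number_b):
    """Collector name AND collector number must both match."""
    return same_collector_name(name_a, name_b) and same_collector_number(number_a, number_b)


def same_date(a, b):
    """Date of collection: compare parsed ISO dates (assumed input format)."""
    return date.fromisoformat(a) == date.fromisoformat(b)
```

Note that this normalization does not reconcile name order, so "Gray, A." and "A. Gray" still count as different.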

Zhimin: place a dissimilarity measure on each field and a weight for that field
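Zhimin's suggestion can be sketched as a weighted average of per-field dissimilarities. The field names, the choice of measures, and the weights (0.4, 0.4, 0.2) below are placeholders for illustration only; none of them were agreed on.

```python
from difflib import SequenceMatcher


def string_dissimilarity(a, b):
    """0.0 for identical strings, approaching 1.0 as they diverge."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def exact_dissimilarity(a, b):
    """0.0 if the values are equal, 1.0 otherwise."""
    return 0.0 if a == b else 1.0


# field name -> (dissimilarity measure, weight); all values are hypothetical
FIELDS = {
    "collector": (string_dissimilarity, 0.4),
    "number": (exact_dissimilarity, 0.4),
    "date": (exact_dissimilarity, 0.2),
}


def record_dissimilarity(a, b, fields=FIELDS):
    """Weighted mean of per-field dissimilarities, in [0, 1]."""
    total_weight = sum(weight for _, weight in fields.values())
    score = sum(
        weight * measure(a[name], b[name])
        for name, (measure, weight) in fields.items()
    )
    return score / total_weight
```

One possible reading of the provider-side confidence discussed earlier would be `1 - record_dissimilarity(a, b)`, though that mapping is also an assumption.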

The test Flickr EOL group quality control tool is up at http://www.aa3sd.net/qc_test/ and the Flickr group has been made aware of it. Over the first week, the percent of images in the group pool with non-compliant licenses dropped from 25% to just under 10%. Potential for automated push of qc back to group members, but only on a per-image basis.

For 2009Jan29 at UMB todo:

  • Maureen and Zhimin will demonstrate network with nodes at HUH and UMB.
  • Maureen will demo UI for query generation
  • Paul and James will rationalize these notes and make simple cases of what should be the hierarchy of fields (Concepts) upon which to do fuzzy match.
  • Zhimin to look for date recognition algorithms and named entity recognition algorithms that could be used for matches