Etherpad for meeting notes: http://firuta.huh.harvard.edu:9000/FP-2013Oct09
- Summary of Friday Tech Call.
- Report: Progress on Date Validation.
- On the table with Kansas: Using Species Analyst services for occurrence QC. Given an occurrence with a taxon name and a georeference, what would we like in a service that checks whether this occurrence is an outlier against a species distribution model for that species?
- Discussion: Approaches to repeated QC requests for same records.
- Report: Progress on unpack of current copy of GBIF cache.
- Discussion: Duplicate finding (Find_Duplicates use case): state of old code, Lucene, etc.
- SCAN TCN Support
- Report: Sanity checking; further assessment of why there are fewer new annotations than new records in omoccurdeterminations.
- Report: MCZbase Driver
- NEVP TCN Support
- Report: Preparations to update Annotation Processor Deployment at UNH.
- Report: Specify-HUH driver
- FP Infrastructure
- Report: FP Node Refactoring
- Report: Annotation Processor Enhancements
- InvertENet TCN Proposal
- Further discussion: Duplicate finding (Find_Duplicates use case): state of old code, Lucene, etc.
- Discussion: NEVP New Occurrence record ingest process from Specify instances.
For Future meetings
- dwcFP and DarwinCore RDF guide - feedback.
- Prospective meetings, development targets.
- Jim Beach visited on Tuesday morning. Discussed Specify at HUH, and also touched on FP issues. Some work is progressing on Symbiota ingest of data exported from Specify. Mentioned OAI/PMH harvesting, AnnotationProcessor/Driver, and FP-Lite. Species Analyst was also mentioned, putting on the table a QC service that checks occurrence point data against environmental models for species distribution.
Present: Maureen, Chuck, James, Jim, Bertram, Sven, Tianhong
1. Touch base. 2. Tech discussion.
Where is duplicate finding launched?
Paul notes: See the use case documentation Find_Duplicates. (1) Duplicate detection is a background process run on harvest of the data (directly from participating Specify instances in ApplePie). (2) Duplicates are found by collector name, collector number, and date collected in native data entry applications (Specify Workbench, Specify CollectionObject forms, Symbiota data transcription forms, the HUH rapid data entry application, etc.).
- annotation processor should have a process for duplicate detection like other workflows
Paul notes: No: The annotation processor is not a data entry application. We did a prototype demonstration of duplicate finding there, but duplicate finding must be implemented in native data entry environments.
- the analysis would compare each record in a collection with the gbif cache and return a report on matches
Paul notes: No. The GBIF cache is not our primary source for duplicate data, it is first a test platform, and second a possible supplement to the primary data that we harvest directly from Specify instances that are participating in ApplePie (the NEVP TCN network).
- use Scatter Gather Reconcile?
Paul notes: SGR uses a copy of the GBIF cache. It suffers from arbitrarily fuzzed and redacted data on endangered species, and from arbitrarily poor mappings to people's actual data sets. We need to implement the OAI/PMH harvesting mechanism with both provider and consumer on the client side, pushing harvests out to the Access point across network firewalls to obtain detailed occurrence records directly from participating Specify instances in ApplePie. We need to implement, test, and deploy this mechanism soon.
- might want to use the UI design
- the extra feature SGR doesn't have is the "consensus" record
Paul notes: An extra feature. SGR also uses a service based on the GBIF cache.
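The OAI/PMH harvesting mechanism Paul describes above could be sketched, on the consumer side, roughly as follows. This is a minimal sketch only: the base URL, the metadata prefix, and the paging helper are illustrative assumptions, not the deployed ApplePie configuration.

```python
# Minimal OAI-PMH 2.0 ListRecords consumer sketch. Endpoint details are
# illustrative assumptions; only the protocol verbs/elements are standard.
import urllib.request
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def build_list_records_url(base_url, metadata_prefix, resumption_token=None):
    """Compose a ListRecords request URL per the OAI-PMH 2.0 spec."""
    if resumption_token:
        return f"{base_url}?verb=ListRecords&resumptionToken={resumption_token}"
    return f"{base_url}?verb=ListRecords&metadataPrefix={metadata_prefix}"

def parse_list_records(xml_text):
    """Return (record identifiers, resumptionToken or None) from one response."""
    root = ET.fromstring(xml_text)
    ids = [h.findtext(f"{OAI_NS}identifier")
           for h in root.iter(f"{OAI_NS}header")]
    token_el = root.find(f".//{OAI_NS}resumptionToken")
    token = token_el.text if token_el is not None and token_el.text else None
    return ids, token

def harvest_all(base_url, metadata_prefix):
    """Page through ListRecords until the server reports no more tokens."""
    token, ids = None, []
    while True:
        url = build_list_records_url(base_url, metadata_prefix, token)
        with urllib.request.urlopen(url) as resp:
            page_ids, token = parse_list_records(resp.read())
        ids.extend(page_ids)
        if token is None:
            return ids
```

Running the provider on the client side and pushing the harvested payload out to the Access point, as proposed, would sit on top of this same request/parse cycle.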
- analysis of existing SGR code in Specify -- Maureen
- have a look at MarcXimiL to see if anything there applies. http://marcximil.sourceforge.net/ -- Chuck
- functional definition of "duplicate" -- comparing two Simple Darwin Core records, how do we know whether the two are duplicates of each other?
- how to merge a set of > 2 duplicates into a consensus record -- Maureen
- how to compose and format a report that advises the user of duplicates found during the duplicate detection analysis
- how to manage use of the gbif cache
  - What input format and platform? How do we keep it current? Currently: zipped SQL insert statements loaded into MySQL, downloaded from GBIF; two versions exist (one a month old, one several months old).
  - How do we make it usable and useful?
  - Where is the data physically hosted? A Florida host?
Paul notes: The GBIF cache, as noted above, is a test bed for us. The real knowledge store that we will use for duplicate detection will need to be raw occurrence data harvested directly from ApplePie participants. Data harvested from the CNH Symbiota instance might supplement this, and the GBIF cache (in its current snapshot) might supplement this. We don't expect to update the GBIF cache beyond our current copy.
- Do we validate GBIF data before matching? More generally, how do we validate a consensus record? How do we integrate validation?
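The functional-definition question above (when are two Simple Darwin Core records duplicates of each other?) could start from the collector name / collector number / date heuristic already noted in these minutes. The sketch below is one possible operationalization: the Darwin Core field choices follow that heuristic, but the similarity measure and threshold are illustrative assumptions.

```python
# Sketch of a pairwise duplicate test for two Simple Darwin Core records,
# using the collector name / collector number / date heuristic from the
# notes. The 0.85 name-similarity threshold is an illustrative assumption.
from difflib import SequenceMatcher

def normalize(value):
    """Lowercase and collapse whitespace so formatting differences don't count."""
    return " ".join((value or "").lower().split())

def is_duplicate(rec_a, rec_b, name_threshold=0.85):
    """Candidate duplicates: recordNumber and eventDate match exactly,
    and recordedBy is similar above the threshold."""
    if normalize(rec_a.get("recordNumber")) != normalize(rec_b.get("recordNumber")):
        return False
    if normalize(rec_a.get("eventDate")) != normalize(rec_b.get("eventDate")):
        return False
    similarity = SequenceMatcher(
        None,
        normalize(rec_a.get("recordedBy")),
        normalize(rec_b.get("recordedBy")),
    ).ratio()
    return similarity >= name_threshold
```

A real implementation would also need to handle collector-name variants (initials vs. full names, "Darwin, C." vs. "C. Darwin"), which a simple string ratio only partially captures.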
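For the consensus-record task above, one simple starting point is a per-field majority vote across the duplicate set. This is only a sketch under an assumed merge rule (ties broken by the first value seen); the actual merge policy would need curatorial review, as the validation questions above suggest.

```python
# Sketch of merging a set of candidate-duplicate records into a consensus
# record by per-field majority vote. The merge rule is an assumption.
from collections import Counter

def consensus_record(records):
    """Build a consensus record from two or more candidate duplicates."""
    fields = {f for rec in records for f in rec}
    consensus = {}
    for field in sorted(fields):
        values = [rec[field] for rec in records if rec.get(field)]
        if not values:
            continue
        # Counter preserves first-insertion order on ties (Python 3.7+),
        # so the first-seen value wins when counts are equal.
        consensus[field] = Counter(values).most_common(1)[0][0]
    return consensus
```

A report to the user (the next task in the list) could then show, per field, which source values were outvoted, so the consensus is auditable rather than silent.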
Links & References:
Questions: - What was the Python code that Tianhong may have been looking at?