2013Oct23

From Filtered Push Wiki


Etherpad for meeting notes: http://firuta.huh.harvard.edu:9000/FP-2013Oct23

Agenda

Non-Tech

  • InvertENet TCN Proposal
  • TDWG

Tech

  • Summary of Friday Tech Call.
  • Kepler
    • Report: Progress on Date Validation.
    • Report: Akka integration.
  • Analysis
    • Discussion: Approaches to repeated QC requests for same records. How are we going to do this?
    • Report: Progress on implementation of OAI/PMH harvesting through firewalls.
    • Report: State of investigation of duplicates in GBIF cache.
    • Discussion: Duplicate Finding Find_Duplicates.
  • SCAN TCN Support
    • Report: Sanity Checking, further assessment of lower number of new annotations as opposed to new records in omoccurdeterminations.
    • Report: MCZbase Driver
    • Note: Progress statistics
  • NEVP TCN Support
    • Report: Preparations to update Annotation Processor Deployment at UNH.
    • Report: Specify-HUH driver
  • FP Infrastructure
    • Report: FP Node Refactoring

Reports

  • Paul:
    • Got Chuck set up to work on HUH-rapid application.
    • Pinged Jim Beach about using a Species Analyst service for occurrence QC.
    • Discussed an FP-Scan hackathon visit with Ed, have floated the idea with Nico.
  • Chuck:
    • Emailed with Tianhong about DB schema.
    • With other folks, clarified goals for duplicate detection. Wrote up a new spec, and started working. ActuallyFindingDuplicates2

Notes

Present: Bob, Chuck, Paul, Maureen, Jim, Tianhong, James.

Non-Tech

  • InvertENet TCN Proposal (ADBC deadline: Nov 13)

All in hand on this end.

  • TDWG

Tech

  • Summary of Friday Tech Call.

Maureen: Discussed IDCC (northern CA, in Feb). Bertram is going to draft (1) an abstract and (2) a proposal for a workshop.

Maureen: Got an overview of a new conceptual model for curation workflows from Tianhong, and an overview of David's work on integrating Camel as part of the message-driven refactoring of Triage/Job Planner/Job Runner.

Chuck: Started sorting out duplicate-finding issues with Tianhong. The main function of duplicates is with data entry tools; need to see if we have requirements for other uses of duplicates.

Bob: Starting to look at Lucene clustering.

Chuck: How do these clusters get used?

Bob: When entering data, they provide for return of clusters of duplicates.

Chuck: Is there anything else we need to present duplicates for? Can we articulate anything else that we need duplicates for?

  • Kepler
    • Report: Progress on Date Validation.

Mostly done with our part (problems 1, 7, 8); what about problems 2, 5, 6 (http://wiki.filteredpush.org/wiki/Use_Case_Scenarios#Collecting_Event_Date_Validation)? Although some of the parsing is done manually, special cases may need additional code.

Bob: One of the things that is interesting about this is that all the date validation is about validation errors that are detectable without correlation with other data, but many of these cases involve correlations, like the case of correlated endangered species locality data. For example, a sequence of dates and collector numbers where an element is out of sequence.

Paul: State of flowering time validator?

James: We have a number of FNA volumes that are finely parsed (about half of the Flora of North America). Should have flowering times available for all species in these volumes?

Paul: Can you talk with Hong about a productized service?

James: Yes, can talk with Hong (and Joel).

Maureen: Can we just get a copy of the data and source code?

James: Yes, that's probably what we'll need to do: extract the data and serve it up ourselves. FNA may not be able to do this for another year or so.

Validation approaches and next steps:

  • Lifespan-based: Paul will provide HUH botanist data to Tianhong and set up a service.
  • Geographically-based: Internal correlation (with GBIF cache?). Chuck and Tianhong to discuss this.
  • Collector number-based: Internal correlation (with GBIF cache?). Chuck and Tianhong to discuss this.
  • Phenology-based: FNA flowering service/data. James to start conversation with Hong.
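
As an illustration of the correlated case Bob describes (a sequence of dates and collector numbers where one element is out of sequence), here is a minimal Python sketch. It is not the project's validator: the function name `out_of_sequence`, the (collector number, date) record shape, and the rule "flag a record only when its two neighbours agree with each other but not with it" are all assumptions made for this example.

```python
from datetime import date


def out_of_sequence(records):
    """Flag records whose date falls outside the dates of both neighbours.

    `records` holds (collector_number, collecting_date) pairs for a single
    collector (hypothetical shape, chosen for this sketch).
    """
    # Order by collector number, which is assumed to increase over time.
    ordered = sorted(records, key=lambda r: r[0])
    flagged = []
    # First and last records are skipped: with only one neighbour we
    # cannot tell which of the two records is the wrong one.
    for i in range(1, len(ordered) - 1):
        before = ordered[i - 1][1]
        current = ordered[i][1]
        after = ordered[i + 1][1]
        # Neighbours are consistent with each other, but this record is not.
        if before <= after and not (before <= current <= after):
            flagged.append(ordered[i])
    return flagged


# Made-up example data: number 103 carries a year that does not fit.
records = [
    (101, date(1902, 5, 1)),
    (102, date(1902, 5, 3)),
    (103, date(1920, 5, 4)),
    (104, date(1902, 5, 6)),
    (105, date(1902, 5, 7)),
]
suspects = out_of_sequence(records)
```

The check only reports candidates for review. Collector numbers are not always assigned in strict date order, so a flagged record is a prompt for human inspection rather than a confirmed error.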

    • Report: Akka integration.

Tianhong: Had a meeting with David; working towards a message-driven implementation for both Akka and Kepler.

  • Analysis
    • Discussion: Approaches to repeated QC requests for same records. How are we going to do this?
    • Report: Progress on implementation of OAI/PMH harvesting through firewalls.

Maureen: David is going to describe a message for this function.

    • Report: State of investigation of duplicates in GBIF cache.

Chuck: Have a wiki page up describing results of this: http://wiki.filteredpush.org/wiki/ActuallyFindingDuplicates2

Chuck: Improving data capture rate with duplicate finding can be done without our having to reify duplicate sets.

James: Suppose there are 5 specimens in 5 collections, and all have been digitized, and three of them have diverging identification histories.

Bob: Offhand, that seems the most frequent scenario of the challenge that Chuck gave. New determinations are cases where we are likely to want to propagate the determinations to related duplicates.

Chuck: Are these new determinations always the right thing that all users should go with? Is it important to track the provenance (that the determination was made on a different specimen in the duplicate set)?

James: Provenance very important.

Paul: New determinations not necessarily correct or applicable.

Chuck: Big discrepancies may be things that point out the need for expert human intervention.

Bob: There's one provenance issue that we know about (about aggregators): there are duplicated data records that have different paths into the aggregator's data sets.

James: How about zoological/paleontological duplicated locality data?

Chuck: Some people put more effort into the same locality string.

Paul:

  • Problem 1: Present data to people faster than they can type it.
  • Problem 2: Pass annotations to members of duplicate set.
  • Problem 3: Quality control internal consistency within a duplicate set.

Maureen: Given two harvested digital records, there are at least two questions: (1) do the records represent the same specimen (the same plant, the same collecting event), and (2) do the records represent the same database record, that is, do the records have the same authoritative owner? In the second case we don't really care unless they are different and we don't know which bits to use (if an aggregator added or subtracted something).
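
To make Paul's Problem 3 and James's scenario of diverging identification histories concrete, here is a minimal Python sketch. It groups records that appear to share a collecting event and reports the sets whose determinations disagree. It is not the approach specified in ActuallyFindingDuplicates2: the matching rule (same collector, collector number, and date after light normalization), the function names, and the example data are assumptions. The field names are borrowed from Darwin Core terms.

```python
from collections import defaultdict


def duplicate_key(record):
    """Build a grouping key for one collecting event (assumed matching rule)."""
    # Lower-case and collapse whitespace so trivial variants still match.
    collector = " ".join(record["recordedBy"].lower().split())
    number = record["recordNumber"].strip().lower()
    return (collector, number, record["eventDate"])


def divergent_sets(records):
    """Return {key: sorted names} for duplicate sets with differing names."""
    groups = defaultdict(list)
    for record in records:
        groups[duplicate_key(record)].append(record)
    result = {}
    for key, members in groups.items():
        names = {m["scientificName"] for m in members}
        # Only sets with more than one member and more than one name differ.
        if len(members) > 1 and len(names) > 1:
            result[key] = sorted(names)
    return result


# Made-up example: three records of one gathering, one with another name.
records = [
    {"recordedBy": "J. Smith", "recordNumber": "1234",
     "eventDate": "1885-06-12", "scientificName": "Carex aurea"},
    {"recordedBy": "j.  smith ", "recordNumber": " 1234",
     "eventDate": "1885-06-12", "scientificName": "Carex garberi"},
    {"recordedBy": "J. Smith", "recordNumber": "1234",
     "eventDate": "1885-06-12", "scientificName": "Carex aurea"},
    {"recordedBy": "K. Jones", "recordNumber": "77",
     "eventDate": "1890-07-01", "scientificName": "Quercus alba"},
]
divergent = divergent_sets(records)
```

In line with the discussion above, the sketch only surfaces discrepancies and keeps every record intact. It does not pick a winning determination, since new determinations are not necessarily correct and provenance needs to be preserved.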

  • SCAN TCN Support
    • Report: Sanity Checking, further assessment of lower number of new annotations as opposed to new records in omoccurdeterminations.
    • Report: MCZbase Driver

Maureen: Pending

    • Note: Progress statistics
  • NEVP TCN Support
    • Report: Preparations to update Annotation Processor Deployment at UNH.

Maureen: Date not set yet.

    • Report: Specify-HUH driver

Maureen: Checked Python Specify Web app over HUH Schema, doesn't work out of the box.

  • FP Infrastructure
    • Report: FP Node Refactoring

For Friday:

  • Bob not available
  • Services for Tianhong

For Wednesday:

  • Bob, Paul, and James at TDWG