2014Dec02

From Filtered Push Wiki


Etherpad for meeting notes: http://firuta.huh.harvard.edu:9000/FP-2014Dec02

Agenda

Non-Tech

  • Publications
    • Progress
      • Paul/James: Collection Objects
      • Bob: Refactoring Dup finding cluster analysis
      • Tianhong - Easy-Cure paper

Tech

  • QC for SCAN
    • QC report to Tim?
    • Feedback from Neil/Paul Heinrich (none yet)
    • Feedback from Nico (none yet)
  • QC work
    • Adding agent authority file to Symbiota - harvest to solr index - use in actor.
    • JSON to XLS into Kurator
  • Bob: Duplicate finding cluster analysis

Reports

  • Paul
    • More work on Agents in Symbiota. Working through Lichen and Bryophyte natural language processing code that uses omcollectors to refactor it for agents. Evaluating two alternatives for handling collectors - either omoccurrences-agents or omoccurrences-omcollectors-agents.
    • Provided description of Symbiota's SpecProcessorNEVP-OmOccurrences to David as a possible off-the-shelf option for ingesting data from crowdsourcing applications into Symbiota (and Specify-HUH), with easy reuse for other applications.
    • NEVP ingest running again on Specify-HUH, along with image processing code that checks for multiple barcodes in image and fixes collection object - item - preparation relationships.
    • Provided some comments to Tianhong on Easy-Cure drafts.

Notes

Present: Bertram, Bob, James, Tianhong, Paul, Tim

Non-Tech

  • Publications
    • Progress
      • Paul/James: Collection Objects

James: Bit of work last week, will get back to it this week.

Paul: Hardest case: Collection with moss, lichen, bark in packet, with slide made from the moss.

      • Bob: Refactoring Dup finding cluster analysis

Bob: Nothing this week

      • Tianhong - Easy-Cure paper

Tianhong: Submitted on Sunday night, thanks to everyone for the comments. Response expected in late January.

      • Larger FP manuscript(s) outline started by Bob?

James: Reminder - Bob started a google doc - please comment on/add to list.

Paul: Please remind people and share link again in an email. Link is https://docs.google.com/document/d/1FyTIbaIRIzw3uizxs5HEBOgcoxfF4A07g26KK_xrEYk/edit Ask for invitation if can't open it....

Tech

  • QC for SCAN
    • QC report to Tim?

Tim: Haven't gotten a report yet.

Tianhong: David was waiting for a rerun on DMNS; that rerun has now been done.

Paul: This one is reasonably recent: http://symbiota2.acis.ufl.edu/fp/ASUHIC.xls

    • Feedback from Neil/Paul Heinrich (none yet)
    • Feedback from Nico (none yet)

Bertram: Got needed info from David, need to ping Nico. Spreadsheet: http://symbiota2.acis.ufl.edu/fp/ASUHIC.xls

Specific questions for Nico:

  • Do you understand the data quality assertions being made (color coding, details)? Do you "understand" what this is saying?
  • Can you check the statements about scientific names? Are we saying the right things here?

  • QC work
    • Adding agent authority file to Symbiota - harvest to solr index - use in actor.

Paul: Progressing here, working on implications of changes for rest of Symbiota - particularly Lichen/Bryophyte TCN Natural Language Processing code. Close, but not quite ready to point the rest of the Symbiota developers to a change set that they can test.

    • JSON to XLS into Kurator

Tim: Yes, found the code that David pointed us at a few weeks ago. Would be good to have this in the next sprint. Would be good to have some sort of test, at least by eye, that the results look the same.

    • dwcArcReader: requirements and extensibility

Tianhong: Evaluating the GBIF Code:

  1. HIT: https://code.google.com/p/gbif-indexingtoolkit/, robust, with lots of capabilities, but too large and with dependency issues (at least from what I saw)
  2. dwca-reader: https://github.com/gbif/dwca-reader, not really a library, but can be refactored for reuse; more lightweight and easily buildable, with enough entry points, I think

But what do we need exactly: read the core records and metadata? Do we need to extend later, e.g. to handle extensions?

Paul: Let's say for now:

  1. Does this DwCArchive file contain an occurrence core?
  2. If yes, extract for analysis.

Later we can look at analysis of other types of core documents (e.g. taxa) and of star schema data - but that will probably come after DarwinCore Archives are made obsolete by the W3C structured CSV initiatives.

Tianhong: so reading the core document is good for now?

Paul: Yes.
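
The two steps Paul lists above (check for an occurrence core, then extract it) can be sketched roughly as follows. This is only an illustrative sketch in Python using the standard library, not the GBIF dwca-reader code under evaluation. The function name read_occurrence_core is hypothetical, and the fallback values for missing meta.xml attributes are assumptions based on the Darwin Core text guidelines. Extensions (the star schema) are deliberately ignored.

```python
import csv
import io
import xml.etree.ElementTree as ET
import zipfile

# Namespace of the meta.xml descriptor and the rowType URI of an occurrence core.
DWC_TEXT_NS = "{http://rs.tdwg.org/dwc/text/}"
OCCURRENCE_ROWTYPE = "http://rs.tdwg.org/dwc/terms/Occurrence"


def read_occurrence_core(archive):
    """Hypothetical helper: return the core records as a list of dicts,
    or None if the archive's core is not an occurrence core.

    `archive` is a path or a binary file-like object holding the zip file.
    """
    with zipfile.ZipFile(archive) as zf:
        meta = ET.fromstring(zf.read("meta.xml"))
        core = meta.find(DWC_TEXT_NS + "core")

        # Step 1: does this archive contain an occurrence core?
        if core is None or core.get("rowType") != OCCURRENCE_ROWTYPE:
            return None

        # Step 2: extract the core records for analysis.
        location = core.find(
            DWC_TEXT_NS + "files/" + DWC_TEXT_NS + "location"
        ).text.strip()
        # Assumed defaults when an attribute is absent from meta.xml.
        delimiter = core.get("fieldsTerminatedBy", ",").replace("\\t", "\t")
        quote = core.get("fieldsEnclosedBy", '"')
        skip = int(core.get("ignoreHeaderLines", "0"))
        encoding = core.get("encoding", "UTF-8")

        # Map column index -> short term name (last segment of the term URI).
        columns = {}
        id_element = core.find(DWC_TEXT_NS + "id")
        if id_element is not None:
            columns[int(id_element.get("index"))] = "id"
        for field in core.findall(DWC_TEXT_NS + "field"):
            if field.get("index") is not None:
                columns[int(field.get("index"))] = field.get("term").rsplit("/", 1)[-1]

        text = zf.read(location).decode(encoding)
        source = io.StringIO(text, newline="")
        if quote:
            reader = csv.reader(source, delimiter=delimiter, quotechar=quote)
        else:
            reader = csv.reader(source, delimiter=delimiter, quoting=csv.QUOTE_NONE)

        rows = list(reader)[skip:]
        return [
            {name: row[i] for i, name in columns.items() if i < len(row)}
            for row in rows
            if row  # skip blank lines
        ]
```

A caller would treat a None result as "not an occurrence core, skip this archive" and pass any returned list of records on to the analysis.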

  • Bob: Duplicate finding cluster analysis

Bob: It is difficult to characterize what sorts of false positives to expect from different quality control analyses. Where does the expectation of false positives come from?


  • Bertram: dwc-validator code available now (had sent email to kurator-staff)

-> Should we check this out?

Paul: Yes, we should check it out - good source for what tests GBIF wants to perform - likely data type validation and adherence to controlled vocabularies.