2014Nov25

From Filtered Push Wiki


Etherpad for meeting notes: http://firuta.huh.harvard.edu:9000/FP-2014Nov25

Agenda

Non-Tech

  • LepNet TCN
  • Publications
    • Progress
      • Paul/James: Collection Objects
      • Bob: Refactoring Dup finding cluster analysis
      • Tianhong - Easy-Cure paper
  • Preparations for iDigBio Hackathon

Tech

  • QC for SCAN
    • Feedback from Neil/Paul Heinrich
    • Feedback from Nico
  • QC work
    • Adding agent authority file to Symbiota - harvest to solr index - use in actor.
    • Is JSON to XLS with color coding ready to productize?
  • Firuta
    • State of deployed apps
  • Deployments
    • Access point updates
    • Deploy and re-run harvest of occurrence records
    • Status of fp2 and fp3
  • Annotation Processor

Reports

  • Paul
    • Got comments on DarwinCore RDF guide back to Steve, along with set of crafted examples.
    • More updates to Symbiota branch for agents, reworked to use Agent__ instead of Om___ table names, retaining omcollectors to link omoccurrences to agents.

Notes

FilteredPush Team Meeting 2014 Nov 25

Present: Bertram, Jim, Tianhong, Paul, David, Bob, James

  • LepNet TCN

Jim: Proposal went in. Neil Cobb, Northern Arizona U, is lead PI. (He's also the lead PI of SCAN.) "Lep" = Lepidoptera (butterflies and moths).

  • Publications
    • Progress
      • Paul/James: Collection Objects

James: High on list, should work on this week.

      • Bob: Refactoring Dupe finding cluster analysis

Bob: Some work since TDWG, want to try with more cores. Two problems to solve with tinkering: (1) computational complexity, to be solved with parallelization; (2) identifying metrics for fields, so we can move towards testing whether things that should be similar actually are.

Bertram: Common pattern for finding similarity across data sets: How to schedule workflows that involve "one vs. many" records questions.

Paul: Different types of processing:

  1. one record self-consistent
  2. one record vs. remote data
  3. data across records consistent within a dataset
  4. data within a data set consistent with remote data - external collections

Each with different implications for handling by a workflow engine.
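One way to picture why the four types matter to a workflow engine is to attach scheduling hints to each. The following is only a sketch: the names `CheckScope`, `SchedulingHints`, and `hints_for` are hypothetical, and the two hints (per-record independence, need for network access) are an assumed reading of the list above, not something the team specified.

```python
from dataclasses import dataclass
from enum import Enum


class CheckScope(Enum):
    """The four processing types from the list above (names are hypothetical)."""
    RECORD_SELF = 1        # one record, self-consistent
    RECORD_VS_REMOTE = 2   # one record vs. remote data
    DATASET_INTERNAL = 3   # data across records consistent within a dataset
    DATASET_VS_REMOTE = 4  # data within a dataset vs. external collections


@dataclass(frozen=True)
class SchedulingHints:
    per_record: bool     # True: each record can be checked independently
    needs_network: bool  # True: the check consults a remote source


def hints_for(scope: CheckScope) -> SchedulingHints:
    """Assumed mapping from a check's scope to what a scheduler needs to know."""
    per_record = scope in (CheckScope.RECORD_SELF, CheckScope.RECORD_VS_REMOTE)
    needs_network = scope in (
        CheckScope.RECORD_VS_REMOTE,
        CheckScope.DATASET_VS_REMOTE,
    )
    return SchedulingHints(per_record=per_record, needs_network=needs_network)


if __name__ == "__main__":
    for scope in CheckScope:
        print(scope.name, hints_for(scope))
```

Under this reading, types 1 and 2 can be streamed record by record, while types 3 and 4 need the whole data set available before a verdict on any single record is possible.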

Bob: datamining community around Mahout is abandoning Hadoop/MapReduce.

Bertram: Question: why the move away and where-to instead?

      • Tianhong - Easy-Cure paper

- Deadline end of this week, Sunday (SIGMOD demo track)

  • Preparations for iDigBio Hackathon

David: Should be all set, need to look at Kurator 0.1 release - still need to look at documentation.

Bertram: Documentation seems to be high priority, with added functionality as lower priority.

David: Concur, documentation at highest priority, other thing they are looking at is annotations.

Paul: Probably the highest-priority functionality is the annotation generation actor: an actor within the workflow that takes the data quality assertions, wraps them up as annotations, and either writes them to the file system or sends them to an FP access point. Kurator-69

Tech

  • QC for SCAN
    • Feedback from Neil/Paul Heinrich
    • Feedback from Nico

David: Haven't heard back from Nico or Paul. Should also generate a result spreadsheet for Paula Cushing (DMNS); Tianhong has run the workflow, per Neil's request.

Bertram: please forward me the mail to Nico.

  • QC work
    • Adding agent authority file to Symbiota - harvest to solr index - use in actor.

Paul: Close to something that can go back into trunk in Symbiota.

Paul: Looking for where in the Kuration code the Harvard Botanist service is invoked.

Tianhong: In the date validation actor service in FP-KurationServices, a function acts as a wrapper that invokes a URL call to check the lifespan of a collector.
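The lifespan check Tianhong describes can be sketched roughly as follows. This is not the FP-KurationServices code: the function name, the `lookup` parameter (standing in for the URL call to the botanist service), and the sample collector are all hypothetical, and the comparison is simplified to whole years.

```python
from datetime import date
from typing import Callable, Optional, Tuple

# (birth year, death year); either may be unknown.
Lifespan = Tuple[Optional[int], Optional[int]]


def collected_within_lifespan(
    collector: str,
    collected: date,
    lookup: Callable[[str], Optional[Lifespan]],
) -> Optional[bool]:
    """Return True/False when the date can be checked, None when it cannot.

    `lookup` is a stand-in for the remote service call; injecting it keeps
    the sketch runnable without network access.
    """
    lifespan = lookup(collector)
    if lifespan is None:
        return None  # collector not found in the authority source
    born, died = lifespan
    if born is None and died is None:
        return None  # nothing to compare against
    if born is not None and collected.year < born:
        return False
    if died is not None and collected.year > died:
        return False
    return True


def fake_lookup(name: str) -> Optional[Lifespan]:
    """Invented data for illustration only."""
    known = {"A. Collector": (1850, 1920)}
    return known.get(name)


if __name__ == "__main__":
    print(collected_within_lifespan("A. Collector", date(1900, 5, 1), fake_lookup))
    print(collected_within_lifespan("A. Collector", date(1930, 5, 1), fake_lookup))
```

Returning None for unknown collectors keeps "could not be checked" distinct from "failed the check", which matters when the result is later reported as a quality assertion.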

    • Is JSON to XLS with color coding ready to productize?

David: I think so. It runs as a command line utility, and probably just needs a README file to describe the command line functionality. There are Javadoc comments and a class to demonstrate usage.

James: Add a separate sheet with an explanation of the content.

David: That should be easy to add.

Jim: Like that idea as well.

Paul: Two distinct roles for spreadsheet, each with its own focus (and explanation): (1) Displaying full data set tagged with potential quality issues that may make some records unfit for purpose for research users; and (2) Displaying actionable items for data curators.
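The two roles, plus the explanation sheet James suggested, could be derived from the same JSON roughly as below. This is a sketch under assumptions: the JSON layout, the `status` values, and the color mapping are invented for illustration, and the code stops at building the rows for each sheet rather than writing an actual XLS file.

```python
import json
from typing import Dict, List

# Hypothetical mapping from a QC status to a cell color.
STATUS_COLORS = {"ok": "green", "suspect": "yellow", "invalid": "red"}


def build_sheets(qc_json: str) -> Dict[str, List[dict]]:
    """Split QC results into a full view, a curator view, and a legend."""
    records = json.loads(qc_json)
    # Role 1: every record, tagged with a color for its status.
    full = [
        {**rec, "color": STATUS_COLORS.get(rec["status"], "grey")}
        for rec in records
    ]
    # Role 2: only the records a curator may need to act on.
    actionable = [row for row in full if row["status"] != "ok"]
    # Explanation sheet: what each color means.
    legend = [
        {"color": color, "meaning": status}
        for status, color in STATUS_COLORS.items()
    ]
    return {"full": full, "actionable": actionable, "legend": legend}


SAMPLE = json.dumps([
    {"id": "rec-1", "status": "ok"},
    {"id": "rec-2", "status": "invalid"},
])

if __name__ == "__main__":
    sheets = build_sheets(SAMPLE)
    for name, rows in sheets.items():
        print(name, rows)
```

Keeping the curator view as a filter over the full view means both sheets stay consistent with a single color legend.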

Carry remainder over to next week.