
Etherpad for meeting notes: http://firuta.huh.harvard.edu:9000/FP-2013Sep04


  • Kepler
    • Demo: run of current Kepler taxon name cleaning workflow on SCAN data, evaluate scaling issues.
  • NEVP TCN Support
    • Annotation Processor hardening
    • UNH site for AnnotationProcessor deployment, visit (Aug 5).
    • Status update (production deployment target date 2013 Aug 15).
    • Specify-HUH driver
  • MCZbase Driver
  • Issue Tracking: Revisit Mantis-BT? See: CodeHosting and 2011Feb01


Next Week

  • Look at duplicate finding (Lucene etc).
  • SCAN TCN Support.
  • NEVP New Occurrence record ingest process

For Future meetings

  • dwcFP and DarwinCore RDF guide - feedback.
  • Prospective meetings, development targets.
  • Burndown
  • Duplicate Finding Find_Duplicates. State of old code.


  • Paul
    • Tiny bit of work with Bob on finalizing figures etc. for Annotation paper.
    • Did some more categorization of uncategorized wiki pages and files, and a bit of wiki cleanup.
  • Chuck
    • Setting up fine-grained (User/Trusted/Manager/Super/Developer) access controls on AP.
    • Reading through the paper and writing up some comments.
  • Tianhong
    • Prepared the demo for this meeting; recognized some issues with the actors (not related to the demo for now) and is dealing with them.
    • Went through the logic of the new scientificNameValidator actor, expressing it in a more straightforward manner.
  • Maureen
    • Wrote script to automate harvesting occurrence and taxon data into Mongo/Fuseki for NEVP and SCAN
    • Still remaining: create a cron job, configure Apache to allow access to the OAI provider only from localhost, and test on fp1.
    • Kepler-Akka: met with David, Tianhong, and Sven.
      • David will adapt FP-Analysis project to use web service instead of EJB for wrapping Kepler/Akka
      • Sven will create a project for shared Kepler/Akka actor code in FP-Tools, and another top-level project for branching Akka


FilteredPush Team Meeting 2013 Sept 04

Present: David, Maureen, Paul, Bob, Sven, Tianhong, Bertram.

  • Kepler
    • Demo: run of current Kepler taxon name cleaning workflow on SCAN data, evaluate scaling issues.


Tianhong: Running workflow with GeoReference validation actor and ScientificName validator actor. SpecimenCurationNoGUI 15-6

Query                      Runtime        Time per record
5 records (year: 1898)     25.9 sec       5.18 sec/record
34 records (year: 1942)    1 min 34 sec   2.76 sec/record
34 records (year: 1942)    1 min 45 sec   3.09 sec/record
146 records (year: 1953)   6 min 4 sec    2.49 sec/record

Bertram: Relatively long time for a small set of records; do we know where the time is spent?

Tianhong: Bottleneck is service invocations.

Tianhong: Substantial variability in run times for the same record; not much gain on repeat. There is a cost for startup and a cost for displaying the spreadsheet.

Maureen: What is the user looking for right off?

Paul: Good to add a simple overview of the provenance - how many records, how many valid, etc.
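The simple overview Paul suggests could be sketched as below. The record structure (dicts with a "status" field) is an assumption for illustration, not the actual FilteredPush provenance schema.

```python
# Sketch of a simple provenance overview: how many records were processed
# and how many fell into each validation outcome. The "status" field and
# the example records are hypothetical.
from collections import Counter

def summarize(provenance_records):
    """Return a short per-status summary of a QC run."""
    counts = Counter(r["status"] for r in provenance_records)
    lines = [f"{len(provenance_records)} records processed"]
    for status, n in counts.most_common():
        lines.append(f"  {n} {status}")
    return "\n".join(lines)

records = [
    {"id": 1, "status": "valid"},
    {"id": 2, "status": "valid"},
    {"id": 3, "status": "corrected"},
    {"id": 4, "status": "unresolved"},
]
print(summarize(records))
```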

Bertram: Two components: reading information from the provenance records to display, and visualizing them. We had some documentation of this.

Tianhong: Documentation going to the spreadsheet view.

Bob: With regard to timing/delays/long processes: what can the user reasonably do something about on the spot, and what do they need to file a report about and then get on with their work?

Paul: Back-of-the-envelope calculation: 55 hours for 100,000 records.
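Paul's figure checks out against the large-batch rate from the demo (roughly 2 sec/record):

```python
# Back-of-envelope check: at ~2 s/record (the large-batch rate from the
# demo table above), 100,000 records take roughly 55 hours serially.
seconds_per_record = 2.0
n_records = 100_000
hours = seconds_per_record * n_records / 3600
print(f"{hours:.1f} hours")  # ~55.6 hours
```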

Bertram: Some gain from parallelizing.
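Since Tianhong identifies external service invocations as the bottleneck, the work is I/O-bound and a thread pool would give real gains. A minimal sketch, where `validate_record` is a hypothetical stand-in for the GeoReference/ScientificName service calls:

```python
# Sketch of parallelizing per-record service calls with a thread pool.
# validate_record simulates a slow remote service invocation; it is not
# the actual actor code.
from concurrent.futures import ThreadPoolExecutor
import time

def validate_record(record_id):
    time.sleep(0.05)  # stand-in for a remote validation service call
    return {"id": record_id, "status": "valid"}

records = list(range(20))

start = time.time()
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(validate_record, records))
parallel = time.time() - start  # well under the ~1 s a serial run would take

print(f"{len(results)} records in {parallel:.2f}s")
```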

Bertram: Need to handle failure conditions, restart failed job in middle.

Maureen: Do the analysis when records are harvested; return the QC results when the QC analysis is invoked (Paul: that is, when the data are requested).

Paul: Two tracks: first, work on improving the efficiency of the workflows; second, only run an analysis once for a record. Agenda items for Friday.

Bob: But more and more data becomes available all the time; we will miss knowing about that, e.g. when GeoLocate has better data.

Maureen: That's why GeoLocate should have some kind of RSS feed (roughly equivalent to OAI) that we tap into; a PubSubHubbub kind of thing. We hear from them that "Pretoria" is different, we look up records related to "Pretoria," and update them with new locality values and provenance data.
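The feed-driven flow Maureen describes could be sketched as a notification handler: on word that a locality has improved upstream, find affected records and patch them with the new values plus provenance. All names and structures here are illustrative, not FilteredPush APIs:

```python
# Sketch of handling one PubSubHubbub-style notification: an upstream
# source (e.g. GeoLocate) reports improved data for a locality, and we
# update matching records with new coordinates and a provenance note.
# The record store and field names are hypothetical.
import datetime

store = [
    {"id": 1, "locality": "Pretoria", "coords": None},
    {"id": 2, "locality": "Cambridge", "coords": None},
]

def on_locality_update(locality, new_coords, source):
    """Patch records matching the changed locality; return updated ids."""
    updated = []
    for rec in store:
        if rec["locality"] == locality:
            rec["coords"] = new_coords
            rec["provenance"] = {
                "source": source,
                "date": datetime.date.today().isoformat(),
            }
            updated.append(rec["id"])
    return updated

changed = on_locality_update("Pretoria", (-25.75, 28.19), "GeoLocate")
```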

Discussion: Clear need to launch workflows as a consequence of changes - to data in network or to external triggers.

David: How many workflows? A fixed set?

Paul: Yes probably a small fixed set.

  • NEVP TCN Support
    • Annotation Processor hardening
    • UNH site for AnnotationProcessor deployment, visit (Aug 5).

David: Testing FP3 now; Chuck's most recent changes aren't ready yet.

Paul: Will make for a good test of updating a remote site.

    • Status update (production deployment target date 2013 Aug 15).

Maureen: Worked on automating the harvests. Harvesting exactly what is in the Symbiota databases at this point. Mongo data store has whatever is in Symbiota. Raises concerns about access to data which should be redacted - we need to be very clear about policy and enforcement mechanisms.

    • Specify-HUH driver

Maureen: Haven't set up yet.

  • MCZbase Driver

Maureen: No work this week.

Paul: To think about and explore, we'll discuss next week.

Bob: Paul and Bob to make inquiries tomorrow.

    • Bob: Anyone want to be coauthors on my SKOS TDWG submissions?

Bob: Need to know soon if anyone does.

    • Abstracts

Paul: Will work on more this afternoon.

  • Annotation Paper

Bob: Worked with Paul on redoing the figures; got answers back from production staff about how to do cross-referencing of supplemental materials. Plan to upload material to production within a day or two; estimating about a month from there to the paper appearing.