2013Sep11

From Filtered Push Wiki


Etherpad for meeting notes: http://firuta.huh.harvard.edu:9000/FP-2013Sep11

Agenda

  • Kepler
    • Report: Progress on Scaling
  • Discussion: Duplicate finding (Lucene etc).
  • SCAN TCN Support
    • Report: Deployment
    • Report: MCZbase Driver
  • NEVP TCN Support
    • Report from UNH site for AnnotationProcessor deployment, visit (Aug 5).
    • Discussion: NEVP New Occurrence record ingest process
    • Report: Node Status update (production deployment target date 2013 Aug 15).
    • Report: Specify-HUH driver
  • FP Infrastructure
    • Report: FP Node Refactoring
    • Report: Annotation Processor hardening
  • Discussion: Issue Tracking: Revisit Mantis-BT? See: CodeHosting and 2011Feb01

Non-Tech

Next Week

  • Discussion: Approaches to repeated QC requests for same records.
  • Discussion: Date Validation (QC for core of taxon name, georeference, date collected).

Week After Next

  • Duplicate Finding Find_Duplicates. State of old code, Lucene, etc.
  • Report: W3C RDF Validation Workshop, report of FP participation.

For Future meetings

  • dwcFP and DarwinCore RDF guide - feedback.
  • Prospective meetings, development targets.
  • Burndown

Reports

  • Paul
    • Tiny bit of work with Bob on finalizing figures etc. for Annotation paper.
    • Got TDWG abstract "Semantic matching of interests to annotations with SPARQL queries on reasoned triples" for the semantic symposium submitted.

Notes

  • Kepler
    • Report: Progress on Scaling

Bertram: The goal from Friday was to produce detailed observations of the performance of the Kepler COMAD/Kuration workflow and the Akka workflow.

Tianhong: Kepler is about 5-6 times slower than Akka.

Sven: Observing frequency and timing of service invocations.

  • Questions from Friday meeting (benchmarking questions):
  • how much speedup do you get as you add parallelization?

Akka seems to be 4-5x faster for the overall runtime.

Sven: The sweet spot for parallelization of service invocations seems to be about 4 to 6 parallel service invocations.
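A minimal sketch of how the speedup question could be measured, assuming a thread pool that caps the number of service calls in flight. This is not the Akka or Kepler workflow code: `fake_service_call`, `run_batch`, and `sweep` are hypothetical names, and the latency is simulated with a sleep, so the printed timings say nothing about the real services.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_service_call(record):
    """Hypothetical stand-in for a remote validation service."""
    time.sleep(0.01)  # simulated network + service latency (assumption)
    return record


def run_batch(call, records, workers):
    """Apply call() to every record with at most `workers` calls in flight.

    Returns (results, elapsed_seconds); results keep the input order.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(call, records))
    return results, time.perf_counter() - start


def sweep(call, records, worker_counts):
    """Measure wall-clock time for each parallelism level."""
    return {w: run_batch(call, records, w)[1] for w in worker_counts}


if __name__ == "__main__":
    records = list(range(24))
    timings = sweep(fake_service_call, records, [1, 2, 4, 6, 8])
    baseline = timings[1]
    for workers, elapsed in timings.items():
        print(f"{workers} workers: {elapsed:.3f} s, speedup {baseline / elapsed:.1f}x")
```

With a real service in place of the sleep, the speedup curve would be expected to flatten once the remote side or the network saturates, which is one way a sweet spot like the one reported could show up.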

  • how variable are the service response times?

Tianhong: Not answered yet for Kepler.

Sven's Akka workflow:

Sven: Have some numbers on service response times:

ScientificNameValidatorInvocation, raw output with fused columns: 342 329716.5877192982463280307.742851047325 (mean about 716.6 ms, SD about 307.7 ms)

One test (query with about 500 records of SCAN data to analyze):

Actor                               Mean time, ms   SD, ms
akka.fp.MongoDBReader               2053
FloweringTimeValidatorInvocation    0.2
GEORefValidatorInvocation           198             155
ScientificNameValidatorInvocation   716             307
MongoDBWriter                       66
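A sketch of how per-actor figures like those above could be derived from raw timing samples, assuming each invocation is logged as an (actor, elapsed ms) pair. The function name `summarize` and the sample values are made up for illustration; they are not the measurements from the test run.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize(samples):
    """Group (actor, elapsed_ms) pairs and return {actor: (n, mean, sd)}.

    sd is the sample standard deviation, or None when an actor has
    fewer than two samples (it is undefined in that case).
    """
    by_actor = defaultdict(list)
    for actor, elapsed_ms in samples:
        by_actor[actor].append(elapsed_ms)
    return {
        actor: (len(v), mean(v), stdev(v) if len(v) > 1 else None)
        for actor, v in by_actor.items()
    }


if __name__ == "__main__":
    # Illustrative values only (assumption), not real measurements.
    log = [
        ("GEORefValidatorInvocation", 120.0),
        ("GEORefValidatorInvocation", 310.0),
        ("MongoDBWriter", 66.0),
    ]
    for actor, (n, m, sd) in summarize(log).items():
        sd_text = "n/a" if sd is None else f"{sd:.1f}"
        print(f"{actor}: n={n}, mean={m:.1f} ms, SD={sd_text}")
```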
  • what happens when you “loosen the brakes” on Kepler parallelization

Bertram: What happens when the COMAD single object at a time constraints are relaxed?

Tianhong: need to benchmark the non-COMAD version

  • Q: What are the overhead times for Kepler/Akka startup and data fetching, and how do they scale?
    • milliseconds for Akka
    • much more for Kepler!
    • About 2 sec data load time from mongodb in one test.

Comparison of service invocations in Akka vs Kepler allows us to quality control the curation workflow executions themselves.

Bob: What determines the difference in time from one run to the next for, say, scientificnamevalidation?

A: combination of Rx runtime + network latency

Sven: Response times on the remote service are highly variable. Moving it closer/caching should help significantly.
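A minimal sketch of the caching idea, assuming validation results depend only on the input string and can safely be reused within a run. `remote_validate` is a hypothetical stand-in for the remote service (its "validation" is a placeholder), and `cached_validate` is an illustrative wrapper, not part of the existing workflows.

```python
import time
from functools import lru_cache

# Counts how often the (simulated) remote service is actually hit.
CALLS = {"remote": 0}


def remote_validate(name):
    """Hypothetical slow remote call; the result here is a placeholder."""
    CALLS["remote"] += 1
    time.sleep(0.01)  # simulated network latency (assumption)
    return name.strip().capitalize()


@lru_cache(maxsize=10_000)
def cached_validate(name):
    """Return the remote result, reusing it for repeated inputs."""
    return remote_validate(name)


if __name__ == "__main__":
    names = ["quercus alba", "acer rubrum", "quercus alba", "quercus alba"]
    for n in names:
        cached_validate(n)
    print(f"{len(names)} lookups, {CALLS['remote']} remote calls")
    print(cached_validate.cache_info())
```

A cache like this only helps when the same input recurs, so the gain depends on how often names repeat in the records being analyzed.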

Continue on Friday, examine differences between Kepler and Akka in more detail.

  • Discussion: Duplicate finding (Lucene etc). -> week after next.
  • SCAN TCN Support
    • Report: Deployment

David: Last night switched the SCAN Symbiota instance into production FilteredPush mode, working with the FP2 node. Did some testing; it looks good.

    • Report: MCZbase Driver

Maureen: No work in last week.

  • NEVP TCN Support
    • Report from UNH site for AnnotationProcessor deployment, visit (Sept 5).

David: Configuration is probably the largest issue; a lot of configuration needs to be done to get the annotation processor to work. Working on refactoring this to simplify the deployment. We've mostly been deploying to development environments and had lots of prerequisites to install in order to deploy. The goal is to reduce this to a deployable war file.

Maureen:

      • Specify Driver needs more work.
      • There's lots and lots of config files, lots of places for things to be misconfigured.
      • Lots of work ongoing on Annotation Processor, too
      • Have compiled a longer list of questions to be asked of IT and curatorial staff at site before a visit for deployment
    • Discussion: NEVP New Occurrence record ingest process
    • Report: Node Status update (production deployment target date 2013 Aug 15).
    • Report: Specify-HUH driver

Maureen: working on it, analyzing the workbench code so that she can use the existing "upload" functionality.

  • FP Infrastructure
    • Report: FP Node Refactoring

David is working on message-driven refactoring, and also on knowledge interfaces.

    • Report: Annotation Processor hardening

Chuck: Mostly now working on local data source configuration.

Non-Tech

  • InvertENet TCN Proposal

Resubmission planned, considering including FP again. Similar FP setup to SCAN, but with Specify instances.

Discussion: We are onboard for resubmitting.

Bertram: Have provenance workflows abstract in.

Paul: Have semantic abstract in.

Bob: Have submitted two abstracts, one a report on values/properties with SKOS, the other for semantics.

  • Annotation Paper

Bob: Waiting on production staff to reply to two specific questions about formatting (indentation in the turtle examples). Everything staged ready to upload to their production system, pending these answers. No author proofs after that point.

  • W3C RDF Validation Workshop, report of FP participation: moved from next week to the week after, since Bob is in Davis next week.