2013Sep18

From FilteredPush


Etherpad for meeting notes: http://firuta.huh.harvard.edu:9000/FP-2013Sep18

Agenda

  • Annotation Paper
  • Kepler
    • Report: Progress on Scaling
  • Discussion: Date Validation (QC for core of taxon name, georeference, date collected).
  • SCAN TCN Support
    • Report: Deployment
    • Report: MCZbase Driver
  • NEVP TCN Support
    • Report from UNH site for AnnotationProcessor deployment, visit (Aug 5).
    • Discussion: NEVP New Occurrence record ingest process
    • Report: Node Status update (production deployment target date 2013 Aug 15).
    • Report: Specify-HUH driver
  • FP Infrastructure
    • Report: FP Node Refactoring
    • Report: Annotation Processor Enhancements
  • Discussion: Approaches to repeated QC requests for same records.
  • Discussion: Issue Tracking: Revisit Mantis-BT? See: CodeHosting and 2011Feb01

Non-Tech

Next Week

  • Duplicate Finding Find_Duplicates. State of old code, Lucene, etc.
  • Report: W3C RDF Validation Workshop, report of FP participation.

For Future meetings

  • dwcFP and DarwinCore RDF guide - feedback.
  • Prospective meetings, development targets.
  • Burndown

Notes

FilteredPush Team meeting 2013 Sept 18

Present: David, Chuck, Maureen, Paul, James, Jim, Tianhong, Sven, Bertram, Bob

Agenda

  • Annotation Paper

Bob: In press, gone over to production. The main issue that took time to resolve was formatting of indented text (code/n3 snippets).

  • Kepler
    • Report: Progress on Scaling

Bertram: Workflow execution can be accelerated by parallel service invocation. Tianhong and Sven both created benchmark tests. Sven produced a graph of multiple service invocations.
Sven: One service benefited from multiple service invocation up to about 4-5 parallel invocations, another didn't.
Tianhong: Monitored service invocation time in Kepler. The IPNI service shows a great deal of variability in invocation times. Graph: for 1957 and 1959 queries on SCAN data, invocation over time against invocation time for IPNI and Geolocate. The IPNI service varies from around 700ms to around 5000ms, with a skewed distribution: peak around 700ms, most data around 700ms. Geolocate is much more consistent in response times, around 200 to 300ms.
Bob: What are the graphs showing us?
Discussion: The services differ in response times; it is not clear what the basis is.
Next steps (Kepler & Akka):
  - some additional benchmark plots
  - implement parallel processing for Kepler (multi-instance actor!?), Tianhong
  - caching (proxy server): work in progress (Sven) using PHP + MySQL; will work for both Kepler and Akka (by design)
  - fixing bug(s) in MongoDB reader (time-out issue)
  - migrate all actors and workflows to use generic service/curation API
[* Aside: added some of the parallel processing issues to TDWG abstract (Anton's session)]
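
A minimal sketch of bounded parallel service invocation with per-call timing, written in Python purely for illustration; it is not the Kepler or Akka implementation. The names `timed_call`, `invoke_parallel`, `summarize`, and the stand-in `fake_service` are assumptions, and the default of 4 workers only echoes the 4-5 parallel invocations reported above.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import median


def timed_call(service, query):
    """Invoke service(query) and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = service(query)
    return result, (time.perf_counter() - start) * 1000.0


def invoke_parallel(service, queries, max_workers=4):
    """Invoke the service for every query, at most max_workers at a time.

    Results are returned in the same order as the queries.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda q: timed_call(service, q), queries))


def summarize(timings_ms):
    """Minimum, median and maximum of a list of invocation times."""
    return {
        "min": min(timings_ms),
        "median": median(timings_ms),
        "max": max(timings_ms),
    }


def fake_service(query):
    # Hypothetical stand-in for a remote lookup (assumption, not a real client).
    time.sleep(0.01)
    return query.upper()


if __name__ == "__main__":
    outcomes = invoke_parallel(fake_service, ["quercus alba", "acer rubrum"])
    print([result for result, _ in outcomes])
    print(summarize([ms for _, ms in outcomes]))
```

Comparing the median with the maximum is a quick way to see a skewed distribution like the one described for IPNI. Varying `max_workers` per service would show where extra parallelism stops helping.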

  • Discussion: Date Validation (QC for core of taxon name, georeference, date collected).

Paul: Time to start thinking about validating date collected.
Bob: Dates come along in arbitrary formats. What are the other questions?
Maureen: There are date parsing libraries available.
Bob: Is the problem the same as finding duplicates, dirty data that needs to be put into a consistent format?
Paul: Perhaps two questions: is the format correct, and is the date information correct.
Bob: First, the format problem: can the date be put into rdf/ISO date format. Second, can the semantics of the date be validated.
James: One quality problem is standardization and completeness. The second problem is inference: check the date value against other information (in Kazakhstan one day, in Canada the next; number 52 in North Carolina, 53 in Canada, 54 in North Carolina).
Bob: One validation rule is based on correlation with collector/locality/collectornumber. Obvious for collector: were things collected before the collector was born.
James: Simple services over Harvard botanist data.
James: Collect dates for collectors, do statistics, provide a service to check for outliers.
Bob: Certain things can't happen before a particular name has been assigned. A date collected of 1893 when the label name was published only in 1983 suggests an issue.
James: Phenology: flowering time.
Chuck: How often do we know the date to the exact day, how often to year/month, how often to just the year?
Bob: There are also practices for assigning values when the day was unknown (first of month, multiples of 5). Can we assume that there are dates that reference the Julian calendar?
Chuck: From Russia before the revolution.
Maureen: Like lat/long: if it doesn't make sense, try the other.
Paul: Stake in the ground for two fitness-for-purpose cases: (1) change in distribution over time needs a validated year; (2) change in phenology needs a validated year/month/day.
James: Check Arthur Chapman's GBIF data quality paper.
Paul: Two things to do: Display workflow metadata in annotation processor. Davis to examine Lei's actors for date validation (flowering time actor and collector track actor).
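
The two questions raised above (can the value be put into ISO form, and is it plausible against other information) could be sketched roughly as follows. This is an illustrative Python sketch, not Lei's actors or any existing FP service. The function names, the parameters `collector_born`, `collector_died`, `name_published`, and the restriction to year-first numeric input are all assumptions. Anything the sketch cannot parse returns `None` rather than a guess.

```python
import re
from datetime import date

# Year-first patterns only (an assumption); each yields a precision label.
_PATTERNS = [
    (re.compile(r"^(\d{4})-(\d{1,2})-(\d{1,2})$"), "day"),
    (re.compile(r"^(\d{4})-(\d{1,2})$"), "month"),
    (re.compile(r"^(\d{4})$"), "year"),
]


def normalize_date(raw):
    """Return (iso_string, precision), or None if the value cannot be parsed."""
    text = raw.strip().replace("/", "-")
    for pattern, precision in _PATTERNS:
        match = pattern.match(text)
        if not match:
            continue
        parts = [int(p) for p in match.groups()]
        try:
            # Pad a missing month/day with 1 only to test that the date exists.
            date(*(parts + [1, 1])[:3])
        except ValueError:
            return None
        iso = "-".join([f"{parts[0]:04d}"] + [f"{p:02d}" for p in parts[1:]])
        return iso, precision
    return None


def semantic_issues(iso, collector_born=None, collector_died=None,
                    name_published=None):
    """List reasons the collecting year looks suspect; empty if none found."""
    year = int(iso[:4])
    issues = []
    if collector_born is not None and year < collector_born:
        issues.append("collected before collector was born")
    if collector_died is not None and year > collector_died:
        issues.append("collected after collector died")
    if name_published is not None and year < name_published:
        issues.append("collected before label name was published")
    return issues


if __name__ == "__main__":
    print(normalize_date("1893/7/4"))
    print(semantic_issues("1893-07-04", name_published=1983))
```

The precision label ("year", "month", "day") maps onto the two fitness-for-purpose cases: distribution over time needs only a validated year, phenology needs the full day. The issues are flags for review, not proof of an error.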

  • SCAN TCN Support
    • Report: Deployment

David: Haven't heard any responses since the deployment went live, so everything seems to be working. Bob: Follow up to check. Paul: Still need to populate specialist records for Nico and other testers.

    • Report: MCZbase Driver

Maureen: Have a schedule now for the driver; have about 1 week's worth of things to do in Mantis before then.

  • NEVP TCN Support
    • Discussion: NEVP New Occurrence record ingest process
    • Report: Node Status update (production deployment target date 2013 Aug 15).

David: FP3 node is ready for deployment. Maureen: Data harvested, but not yet doing incremental harvesting. Still to do: harvest from Specify instances beyond firewalls.

    • Report: Specify-HUH driver

Maureen: In process. Ready to set up tests in parallel.

  • FP Infrastructure
    • Report: FP Node Refactoring

David: Finished implementing knowledge APIs for data stores (document, triplestore, transient object store (messages, results)); working on tests and documentation. Next step is to start working on jobs as message driven beans (injecting new data store implementations). Looking at whether we can use elements of JMS for the access point (the ActiveMQ implementation provides enterprise integration patterns or spring xml, storage of jms messages in persistent queues, plentiful clients available). Current pattern is messaging endpoint; there are other logical ones to use.
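
The idea of a message-driven job with an injected data store can be illustrated with a small sketch. It is in Python with an in-process queue standing in for a JMS queue, so it is not the FP node code, which would use Java message driven beans. Every name here (`DocumentStore`, `InMemoryDocumentStore`, `IngestJob`, `drain`, and the `id`/`body` message keys) is hypothetical.

```python
import queue
from typing import Protocol


class DocumentStore(Protocol):
    """Hypothetical data store interface that a job depends on."""

    def put(self, key: str, document: dict) -> None: ...

    def get(self, key: str) -> dict: ...


class InMemoryDocumentStore:
    """One possible implementation; others could be swapped in unchanged."""

    def __init__(self):
        self._docs = {}

    def put(self, key, document):
        self._docs[key] = document

    def get(self, key):
        return self._docs[key]


class IngestJob:
    """Handles one message at a time, writing to whatever store was injected."""

    def __init__(self, store: DocumentStore):
        self.store = store

    def on_message(self, message: dict) -> None:
        self.store.put(message["id"], message["body"])


def drain(inbox: queue.Queue, job: IngestJob) -> int:
    """Deliver every queued message to the job; return how many were handled."""
    handled = 0
    while True:
        try:
            message = inbox.get_nowait()
        except queue.Empty:
            return handled
        job.on_message(message)
        handled += 1


if __name__ == "__main__":
    inbox = queue.Queue()
    inbox.put({"id": "a1", "body": {"status": "new"}})
    store = InMemoryDocumentStore()
    print(drain(inbox, IngestJob(store)), store.get("a1"))
```

Because the job sees only the store interface, a document store, triplestore, or transient object store implementation can be injected without changing the job.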

    • Report: Annotation Processor Enhancements

Chuck: Working on implementation of db administrator/user roles. The admin can create local datasources and (in progress) select a datasource for a user; users can edit their other properties, but are assigned their datasource.

  • Discussion: Approaches to repeated QC requests for same records.

Paul: For Friday.

Non-Tech

  • InvertENet TCN Proposal--Do we know the NSF submission deadline? Says October 18th on the NSF site... Thus Oct 11 for Harvard.

Paul: In my court, need to get back to them.

James submitted the Workflow abstract for Anton's session. James: (1) Bertram's workflow & provenance abstract (Paul or James), (2) FP Workflow for Anton's session (James), (3) FP semantics abstract (Paul). (not matching numbers in worksheet) Bob: Controlled vocabulary for attribute values: Bob (4), and SKOS abstracts: Bob (5). Paul: James, confirm new time for annotation session with Joel? James: Joel can arrange to not conflict with any annotation relevant documentation. Paul: Will confirm change with Gail.