2013Oct02



Etherpad for meeting notes: http://firuta.huh.harvard.edu:9000/FP-2013Oct02

Agenda

Non-Tech

  • InvertENet TCN Proposal
  • TDWG

Tech

  • Summary of Friday Tech Call.
  • Kepler
    • Report: Status of Date Validation.
  • Discussion: Approaches to repeated QC requests for same records.
  • SCAN TCN Support
    • Report: Sanity Checking
    • Report: MCZbase Driver
  • NEVP TCN Support
    • Report: Preparations to update Annotation Processor Deployment at UNH.
    • Report: Specify-HUH driver
    • Discussion: NEVP New Occurrence record ingest process from Specify instances.
  • FP Infrastructure
    • Report: FP Node Refactoring
    • Report: Annotation Processor Enhancements

Next Week

  • Discussion: Duplicate finding (Find_Duplicates). State of old code, Lucene, etc.

For Future meetings

  • dwcFP and DarwinCore RDF guide - feedback.
  • Prospective meetings, development targets.
  • Burndown


Reports

  • Chuck
    • Misc code cleanup: getUser(); password vs. passwordHash; anonymous inner classes for converters.
    • Misc UI tweaks: Favicon; page titles
    • Get password reset/creation working, and use same validation logic everywhere. Managers pick user from menu.
    • Show all DC metadata for workflows: just a grid right now, but if some fields will always be longer, and others might be missing entirely, could be tweaked.
    • Record original values of name / affiliation to prevent priv-escalation-by-social-engineering.
    • Next up: Make password reset more secure / idiomatic.
  • Maureen
    • still working on Specify Driver, but she promises it will be worth it
    • finished the piece that does the mapping to workbench object (equivalent of importing csv file), almost done with upload

Notes

FilteredPush Team Meeting 2013 Oct 02

Present: Bertram, Tianhong, Sven, Maureen, Chuck, James, Bob, Paul, David, Jim

Non-Tech

  • InvertENet TCN Proposal

Paul: Sent revised scope of work to Petra.

https://www.nsf.gov/outage.html may become relevant.

  • TDWG

Paul: Logistics in progress...

Bob: Gail looking for a paragraph on interest groups.

Bertram: Flight booked, still need to register/lodging.

  • Summary of Friday Tech Call.

Maureen: Brief call, three items:

    • ClientIdentity: this was a placeholder for interacting with DataOne, but since we're not interacting with DataOne, we'll just keep it as mainly Annotation Processor user metadata (also for use later by James).

    • Where to put non-JavaSOA code: will go into some other top-level project in the repository; FP-SpecifyDriver will remove its dependency on FP-Core.

    • Update on Kepler performance research
  • Kepler

Tianhong: Have been looking at cache for performance improvement of network calls. Could get about a 30% improvement in some circumstances.

Sven: A webserver on localhost caching remote calls improves performance, but there is still overhead for webservice calls to this cache that degrades performance. Moving away from a webservice as the caching mechanism to a local database/persistence store should improve performance further. Maintaining currency of the cache is, of course, also an issue.
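A minimal sketch of the local persistence store idea, assuming SQLite as the store and a simple maximum-age rule for keeping the cache current. The class name `LocalLookupCache`, the `fetch` callback, and the one-day default age are illustrative assumptions, not part of the Kepler work reported here.

```python
import sqlite3
import time


class LocalLookupCache:
    """Persistent cache for remote service responses, keyed by request string.

    Hypothetical sketch: names and the max-age policy are assumptions.
    """

    def __init__(self, path=":memory:", max_age_seconds=86400, clock=time.time):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS cache "
            "(key TEXT PRIMARY KEY, value TEXT, fetched REAL)"
        )
        self._max_age = max_age_seconds
        self._clock = clock

    def get(self, key, fetch):
        """Return the cached value, calling fetch(key) on a miss or a stale entry."""
        row = self._db.execute(
            "SELECT value, fetched FROM cache WHERE key = ?", (key,)
        ).fetchone()
        now = self._clock()
        if row is not None and now - row[1] <= self._max_age:
            return row[0]  # fresh hit: no remote call, no webservice overhead
        value = fetch(key)  # miss or stale: go to the remote service
        self._db.execute(
            "INSERT OR REPLACE INTO cache (key, value, fetched) VALUES (?, ?, ?)",
            (key, value, now),
        )
        self._db.commit()
        return value
```

Passing a file path instead of `":memory:"` keeps entries across runs; the maximum age is the only currency control in this sketch, so a real deployment would still need a policy for invalidating entries when the remote authority changes.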

    • Report: Status of Date Validation.

Tianhong: Formatting of dates is one part of problem, validation of parsed dates is a second part of the problem.

Bertram: There should be existing libraries that we can use for date validation.

Paul: XML records with XSD.date; is this date in the record "correct"?

Bertram: Based on what info can we determine that? The record itself probably cannot establish its own validity.

Paul: need to consult other sources of data

Maureen:

  1. Problem 1: can we obtain a date from the record? What if we only have a verbatim date? What if the "original" Darwin Core eventDate field contained a different format than specified in the standard?

Bob: Two problems here? Second is perhaps just testing with a validator.

Bob: Outside which range is this an error?

  2. Problem 2: is the date a valid date according to some calendar, or is it Feb. 31, e.g.? What is the actual precision of the date? Databases often use timestamp fields to record this data, even when all that is known is a year.

Bob: This is validating on a parser.

  3. Problem 3: is the date likely to be the date on which the specimen was collected? We need to consider geographic context, the collector agent's birth/death dates, and collected dates.

Paul: Note that QC test on HUH data shows 6988 records where the date collected falls outside the birth/death dates for the collector. (could also be a problem matching to correct collector, or the collector's dates have the wrong type-- there's birth/death, flourished, and collected types)
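The three problems above can be separated in code. The following is a rough sketch only, assuming ISO 8601 style strings (`YYYY`, `YYYY-MM`, `YYYY-MM-DD`) and using the collector's birth and death years as the single plausibility heuristic; the function names are hypothetical, and the sketch does not cover verbatim dates, geographic context, or flourished/collected date types.

```python
import re
from datetime import date

# Accepts YYYY, YYYY-MM, or YYYY-MM-DD; anything else counts as unparsed.
ISO_DATE = re.compile(r"^(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?$")


def parse_event_date(text):
    """Problems 1 and 2: return (year, month, day), with None for unknown
    parts so that precision is kept, or None if the text is not in the
    expected format or is not a real calendar date (e.g. Feb. 31)."""
    match = ISO_DATE.match(text.strip())
    if match is None:
        return None
    year, month, day = (int(p) if p else None for p in match.groups())
    try:
        # Fill unknown parts with 1 only to test calendar validity.
        date(year, month or 1, day or 1)
    except ValueError:
        return None
    return (year, month, day)


def within_collector_lifespan(parsed, birth_year, death_year):
    """Problem 3, one heuristic only: does the collecting year fall
    between the collector's birth and death years (inclusive)?"""
    return birth_year <= parsed[0] <= death_year
```

For example, `parse_event_date("1890")` keeps year-only precision as `(1890, None, None)` rather than inventing a month and day, while `parse_event_date("1890-02-31")` returns `None`. A failed lifespan check flags a record for review; as noted above, the cause may be a wrong collector match rather than a wrong date.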

Paul: Lei's code looks for correlations in space and time - collector tracks with outliers.

Chuck: There is a whole universe of things that we could try. Do we have any heuristics for which things to try in order to maximize the number of records that we could clean up?

James: I need to write up a wiki page on the use cases, and point at some resources (of date data).

David: We may be able to approach the in-date-range sort of issues with SPARQL queries, appropriately parameterized.

Bob: A key kind of annotation is of a data set identifying a systematic error.

David: We have some of Lei's rules concerning inconsistency and systematic error to revisit.

  • Discussion: Approaches to repeated QC requests for same records.
  • SCAN TCN Support
    • Report: Sanity Checking

David: Ran a sanity check query again this morning, but new determinations from Symbiota in the last week haven't gone in as annotations.

Confirmed that no value for date identified was reason why some of last week's data hadn't gone in as annotations.

Nico: Also tried with Ed Gilbert today; the process seemed to work within Symbiota but not on FP3, and Symbiota showed an error message to that effect. Will be in touch soon.

Paul: Two workflow processes, transcription more relaxed than new determination.

David: Are there actions that we should take in Symbiota when requirements we have aren't met?

Paul: If new determination by user, and element we need is missing, and user asserts submit to FP, we should warn the user.
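A small sketch of the warning Paul describes, assuming the check runs before submission to FP. The workflow keys and the `REQUIRED_FIELDS` table are hypothetical; the only requirement taken from these notes is that a new determination needs a date identified (written here as the Darwin Core term `dateIdentified`), while transcription is more relaxed.

```python
# Hypothetical per-workflow requirements; only dateIdentified comes from the notes.
REQUIRED_FIELDS = {
    "transcription": set(),
    "new_determination": {"dateIdentified"},
}


def missing_required_fields(workflow, record):
    """Return the sorted names of required fields that are absent or blank,
    so the caller can warn the user before submitting to FP."""
    required = REQUIRED_FIELDS[workflow]
    return sorted(
        name for name in required
        if not str(record.get(name) or "").strip()
    )
```

A caller would show a warning when the returned list is non-empty, for example when `missing_required_fields("new_determination", record)` reports `["dateIdentified"]`, instead of letting the record silently fail to become an annotation.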

    • Report: MCZbase Driver: still awaiting Maureen to finish Specify Driver.
  • NEVP TCN Support
    • Report: Preparations to update Annotation Processor Deployment at UNH: Lots of work has been done on all components: Annotation Processor, network configuration, and driver. We need to test the install on Windows again before going back, and we need to finish and test the driver. Maureen will email Chris and Janet today and again next week; we'll shoot for another installation session in two weeks.
    • Report: Specify-HUH driver: Mapping to workbench object (equivalent of importing csv feature) done, upload step is almost done

Maureen: Two main pieces: in Specify you can import xls or csv, creating a workbench object, then make a mapping, then upload into actual database objects.

Maureen: The driver was too fragile; this work reuses existing workbench ingest functionality to make the Specify driver implementations easier to maintain.

Maureen: Creating a set of files that can be added as a jar to the Specify folder; the differences between Specify and Specify-HUH will be present in different deployment artifacts for these. This moves the code into the Specify codebase, so we don't need to import all the Specify dependencies.

    • Discussion: NEVP New Occurrence record ingest process from Specify instances.
  • FP Infrastructure
    • Report: FP Node Refactoring

David: Finished Javadocs and tests for knowledge, checked in to FP-Common, a place where we could move things out of FP-JAVASOA. Creating message-driven beans to replace jobs, calling these services. Got OpenEJB working with JUnit tests for these, so they don't need to be deployed in a container for testing. For now, have the access point launch the JMS message to trigger services, as an intermediate step to making the access point asynchronous.

    • Report: Annotation Processor Enhancements

Chuck: Fixing and unifying password reset mechanisms. Various UI enhancements.

Friday agenda:

  • next steps with Annotation Processor
  • date validation issues
  • reconciling all the different darwin core usages within FP

Bob: We should look at how many DarwinCore representations we have.