2014Jul16

From FilteredPush


Etherpad for meeting notes: http://firuta.huh.harvard.edu:9000/FP-2014Jul16

Agenda

Non-Tech

  • Publications
  • Kurator
  • James (probably won't make meeting): TDWG Symposium
  • InvertEBase - request for images, requirements process, startup organization?
  • (next week: iDigBio crowdsourcing deployment)

Tech

  • Status of going live with Morphbank integration
  • David: Status of FP2 services
  • QC for SCAN
    • Run on full NAU dataset, send report to Neil
      • GBIF checklist bank and CoL services.
      • Collector names and dates of birth (solr index)
      • Adding agent authority file to Symbiota
    • Run on full MCZ SCAN dataset, send report to Linda Ford
      • Need to update harvest (14k records in MongoDB on FP2, 23k in Symbiota4)
  • Metrics for SCAN
  • Updating Roadmap
  • Upcoming work
    • NEVP
    • InvertEBase

Reports

  • Paul
    • Discussed open NEVP data flow issues with Patrick. Fixed bugs in error and process handling in NEVP ingests into Symbiota that were causing new occurrence annotation xml documents to be processed more than once and not get moved to finished location even on successful ingest of all data.
    • Note from Petra, looking for images for InvertENet for initial announcement of project.

Notes

FilteredPush Team meeting 2014 July 16

Present: David, Paul, Bob, Tianhong, Jim, James.


  • Publications

Bob: Seems like what the broader scientific community is concerned with first is getting previously undigitized specimens digitized, so a focus on how we put stuff into existing specimen management systems has the largest immediate appeal.

Jim: The publication with the greatest impact/significance would not report new software but what the software has accomplished. X number of records cleaned up, or X amount of time saved, would be something compelling to report.

Paul: Have a story about distribution and data flow in NEVP, with data coming from the conveyor belts as new occurrence annotation documents to both the Symbiota portal and Specify-HUH.

  • Kurator

Jim: We (including Bertram) have sent everything to NSF, waiting on them now. How early in the fall do the gatherings need to get organized?

Paul: Team kickoff meeting perhaps in early November. Initial workshop would need more coordination (including with iDigBio), might not be plausible until early next year. Good to discuss as soon as Bertram has finished his move - lots of logistics to work out.

Paul: Bob also raised code management site for Kurator work.

Bob: Where are we managing the Kuration code now?

Paul: Kepler repository for Kepler actors. FP Sourceforge repository for FP QC class library and FP Sourceforge repository for Akka workflows.

Paul: an item for discussion when Kurator starts spinning up.

  • James (probably won't make meeting): TDWG Symposium

James: no news; meeting website and tentative program to be released shortly along with call for abstracts.

  • InvertEBase - request for images, requirements process, startup organization?

Jim: Message from Petra, seems in similar state as Kurator, pending official approval.

Tech

  • Status of going live with Morphbank integration

David: Need to get Michael to deploy our prerequisites (for client helper). Have been testing on the VM that matches their setup (e.g. apache/ubuntu versions). Ready to coordinate deployment with them now.

Paul: In discussions with Ed about providing image delivery URIs from Symbiota that conform to Greg's requirements for including an image in Morphbank - in order to allow Morphbank to also serve records for all of the images in SCAN.

Bob: We should start writing this paper now.

  • David: Status of FP2 services

David: In a position to turn back on (after heap space issue with Fedora), need to monitor closely after that (and schedule a weekly update to deployment after that). Check annotations through Fedora admin panel (check daily). Some concern with issues not preventing people from submitting new determinations to Symbiota (gracefully failing there). Needs some close monitoring once turned back on.

David: There may be ways of hooking icinga monitoring to the tomcat JVM; there is also support for monitoring through camel (through a web interface as well). It is difficult right now to check the state of the entire system.

David: Harvesting is deployed but not scheduled; need to automate.

  • QC for SCAN

David: JSON looks fine. Need to look at the spreadsheet code to handle large data sets (e.g. too many rows for a spreadsheet). Also want to refactor code to make adding columns easier. Want to separate the code that creates the table views from the code that creates the output document itself (to support creation of both .xls and web presentation more easily).
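One way to handle the row-limit concern is to split the report across several sheets. The sketch below is only an illustration, not the project's spreadsheet code: the function name `split_into_sheets` and the sheet naming are assumptions. The 65,536-row figure is the per-sheet limit of the legacy .xls format.

```python
from itertools import islice

XLS_MAX_ROWS = 65536  # per-sheet row limit of the legacy .xls format


def split_into_sheets(header, rows, max_rows=XLS_MAX_ROWS):
    """Yield (sheet_name, sheet_rows) pairs; every sheet starts with the header.

    Hypothetical helper: it only partitions rows and does not write a file.
    An empty input yields no sheets at all.
    """
    data_per_sheet = max_rows - 1  # reserve one row for the header
    if data_per_sheet < 1:
        raise ValueError("max_rows must leave room for at least one data row")
    iterator = iter(rows)
    index = 1
    while True:
        chunk = list(islice(iterator, data_per_sheet))
        if not chunk:
            break
        yield f"Sheet{index}", [header] + chunk
        index += 1


# Small demonstration with an artificially low limit of 5 rows per sheet.
demo_rows = ([i, "CORRECT"] for i in range(10))
sheets = list(split_into_sheets(["id", "status"], demo_rows, max_rows=5))
```

Because the helper only produces lists of rows, the same partitions could feed either an .xls writer or a web table view, which fits the separation of table views from output documents described above.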

Paul: Two potential issues in spreadsheet: (1) Actor details in spreadsheet don't include the comment. (2) Actor details (in spreadsheet) don't contain information to link a row back to the analysis summary row (e.g. darwincore triplet plus occurrenceid).

Tianhong: Do the record identifiers need to be added to the JSON?

David: No, can extract from the JSON structure, just a display issue.

Tianhong: Don't have record identifier, no information about the original record in the spreadsheet.

David: Not included in spreadsheet, could use OAI ID, query mongo, and insert as separate sheet.

Paul: Example rows :

Both have the georeference actor asserting unable to determine validity; one has a coordinate, the other doesn't.

NAU | NAUF | NAUF4 A0003146 | 33.6 | -111.1 | Tenebrionidae | Steriphanus subopacus | Casey | United States of America | Arizona | Sierra Ancha exp sta. | 1958-09-27 | CORRECT | UNABLE_DETERMINE_VALIDITY | UNABLE_DETERMINE_VALIDITY
NAU | NAUF | NAUF4A0022498 | | | Scarabaeidae | Maladera castanea | (Arrow, 1913) | USA | Maryland | Beltsville Md | 1958-07-11 | CORRECT | CURATED | UNABLE_DETERMINE_VALIDITY

Bob: Valid and correct are two different assertions.

David: Unable to curate?

Tianhong: Distinction is present in the markers, just not showing it.

David: In the validation state?

Tianhong: If prerequisites aren't met then it won't show in the actor details.

Tianhong: Coordinate missing: Unable to determine validity

David: Empty georeference and empty textual data both unable to determine validity.

Tianhong: Yes, distinguished by the actor detail comments.
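The point that two different missing-input cases share one status and differ only in the comment can be sketched as follows. This is one reading of the exchange above, not the georeference actor's code: the function name, the `Assertion` type, and the comment wording are all hypothetical. Only the status value comes from the example rows.

```python
from typing import NamedTuple, Optional

UNABLE = "UNABLE_DETERMINE_VALIDITY"  # status value taken from the example rows


class Assertion(NamedTuple):
    status: str
    comment: str  # comment wording below is hypothetical


def missing_input_assertion(lat, lon, locality) -> Optional[Assertion]:
    """Return an assertion when an input is missing, else None.

    Covers only the two missing-input cases; a record with all inputs
    present may still end up UNABLE_DETERMINE_VALIDITY in a real check.
    """
    if lat is None or lon is None:
        return Assertion(UNABLE, "coordinate missing")
    if not (locality or "").strip():
        return Assertion(UNABLE, "textual locality data missing")
    return None  # inputs present; the actual validation would run next


no_coordinate = missing_input_assertion(None, None, "Beltsville Md")
no_text = missing_input_assertion(33.6, -111.1, "")
```

Both results carry the same status, so a report that shows only the status column cannot tell them apart; the comment column is what separates them.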

    • Run on full NAU dataset, send report to Neil

Paul: Tianhong can run now and send result to David to format as spreadsheet.

      • GBIF checklist bank and CoL services.

Paul: For tomorrow, let's look at some specific instances of names in this example sheet and see if the services and taxon name validation actor are doing what we expect.

      • Collector names and dates of birth (solr index)

David: Service is up; last week we decided that we need better handling of the name searches. Reading up on configuration of solr and how to edit the schema; need to set up a tokenizer and filters for the name field. Tokenizing, allowing solr to rearrange tokens, and removing punctuation are all things that other people are doing to deal with names. It is all configuration; will have it up to test next week.

Bob: Service interface that we'd like to make public?

David: This service is up.

Bob: Plazi is digitizing treatments, could use service for QC.
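The name handling David describes (tokenizing, ignoring token order, removing punctuation) is planned as solr configuration. The Python sketch below is not that configuration; it only illustrates the matching idea under those three steps. The function `name_key` and the sample name strings are assumptions made for illustration.

```python
import string
import unicodedata


def name_key(name):
    """Return an order-insensitive, punctuation-free key for a personal name."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace punctuation with spaces so "Casey,T.L." still splits into tokens.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    tokens = stripped.translate(table).lower().split()
    # Sorting makes "surname, initials" and "initials surname" compare equal.
    return " ".join(sorted(tokens))


# Illustrative inputs: two orderings of the same hypothetical collector name.
key_a = name_key("Casey, T. L.")
key_b = name_key("T. L. Casey")
```

Sorting tokens is a blunt choice: it also makes names that merely share the same tokens compare equal, and it splits hyphenated or apostrophized surnames into separate tokens, so a real index would likely need more careful filters.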

      • Adding agent authority file to Symbiota

Paul: No progress here yet.

    • Run on full MCZ SCAN dataset, send report to Linda Ford

Paul: Tianhong can also run analysis here and get to David.

      • Need to update harvest (14k records in MongoDB on FP2, 23k in Symbiota4)

David: Maureen provided instructions (running harvest script), will run again, then set up to run daily.

  • Metrics for SCAN

David: Have a page ready for the stats, need to do more testing, have code integrated, but have some bugs to resolve, then can show Ed/Nico/Neil on Symbiota 2 to get feedback on where to integrate.

  • Updating Roadmap
  • Upcoming work
    • NEVP

Paul: Next step for NEVP is to get the new occurrence annotation documents from sites other than Harvard into people's specify databases, either by transform to csv files to load into the workbench, or refactoring the specify-huh ingest code to handle trunk specify (perhaps in a language other than PHP).

David: With RDF Beans and Jackson, we can transform the new occurrence annotation documents to csv and prefix some fields; should have simple Darwin Core classes that should work for this.

Paul: An example document from UNH is at http://portal.neherbaria.org/portal/uploads/images/nha/orig_xml/orig_1403812440.rdfSpecimen_2014-06-26-152559.xml and needs a transform to CSV to load into the Specify workbench (this is a location on the filesystem that we could monitor with camel).

David: Will need to add audubon core classes to the model.
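The CSV step of the transform discussed above might look roughly like the sketch below. It assumes the annotation document has already been parsed into flat dictionaries of term names and values; the parsing itself, the prefix string, the choice of which fields get prefixed, and the sample records are all made up for illustration. The approach David proposes uses Java libraries, so this Python version only shows the shape of the output.

```python
import csv
import io


def records_to_csv(records, prefix="", prefixed_fields=()):
    """Write a list of dict records as CSV text, prefixing the listed columns.

    `records` must be a list (it is traversed twice).
    """
    # Collect columns in first-seen order so the output is stable.
    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)

    def label(key):
        return prefix + key if key in prefixed_fields else key

    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([label(key) for key in columns])
    for record in records:
        # Terms absent from a record become empty cells.
        writer.writerow([record.get(key, "") for key in columns])
    return buffer.getvalue()


# Made-up sample records; the "dwc_" prefix is a hypothetical choice.
example = [
    {"catalogNumber": "A1", "scientificName": "Maladera castanea"},
    {"catalogNumber": "A2", "country": "USA"},
]
output = records_to_csv(example, prefix="dwc_", prefixed_fields={"scientificName"})
```

The column names a workbench upload actually expects would have to come from the target Specify configuration, which these notes do not specify.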

    • InvertEBase

Paul: Nothing yet. Plan is for SCAN and InvertEBase to share the FP2 instance.

David: Annotations to the same place seems straightforward. Question is about harvests (occurrences and taxon trees), and analysis. Named graphs should work to keep taxon trees separated, just need to specify this in queries. Analysis probably needs separate analyses for each. If there are performance issues, increasing the load will increase those.

Bob: Need to keep the iDigBio crowdsourcing on the agenda.

David: Greg would like to host his own FP node for the crowdsourcing.

Bob: Need to keep on our list.

David: Michael is getting a VM set up for the node; the main challenge is making our deployment into tomcat more user friendly, so we can give him a war file to deploy.

For Thursday: Examine instances of scientific names in test spreadsheet results, back trace against service calls.