2014Aug06

From FilteredPush


Etherpad for meeting notes: http://firuta.huh.harvard.edu:9000/FP-2014Aug06

Agenda

Non-Tech

  • Publications
  • Kurator
  • James: TDWG Symposium
  • InvertEBase
  • Possible firuta server move.

Tech

  • QC for SCAN
    • David: Status of updates to Occurrence and Taxon harvests, (new data, missing collection code from some data).
    • Run on full NAU dataset (using entomologists list from solr), send report to Neil
      • Tianhong: Status of workflow and this run.
      • David: Alternative report just listing actionable items.
  • QC work to do
    • Tianhong: Preparation of jar for workflow runnable by Bertram
    • Adding agent authority file to Symbiota - harvest to solr index - use in actor.
  • DarwinCore issues 204-226 under discussion: https://code.google.com/p/darwincore/issues/list?sort=-id
  • Status of going live with Morphbank integration
  • iDigBio crowdsourcing deployment
  • David: Update on current Status of FP2 and SCAN
  • Metrics for SCAN: http://symbiota2.acis.ufl.edu/symbiota/scan/scan_reports.html
  • Updating Roadmap
  • Upcoming work
    • NEVP
      • Node configuration, symbiota integration
      • Annotation processor
    • InvertEBase
  • Java 8

Reports

  • Paul
    • Per Alex, installed current java version on fp1.acis, fp2.acis, fp3.acis.

Notes

FilteredPush Team Meeting 2014 Aug 06

Present: Jim, Tianhong, David, Paul, Bertram, James.

Agenda

Non-Tech

  • Publications

Paul: Gen is interested in pursuing complex botanical objects.

James: I will start to write these use cases up toward a manuscript beginning next week, also for use by DINA. Happy to have Gen join in.

David: Have ideas about configuration at deployment time. Have redone the client helper (access point next) using Spring and Camel, and have been able to move all the routing code into a Spring configuration. This enables the build to work with Tomcat, GlassFish, and embedded Jetty. Substantial improvement in deployment: a properties file contains the Spring configuration parameters for the connection, and the Spring configuration can then define the behaviors.

Bertram: Working with Tianhong on the IDCC paper, perhaps ready to share by end of week.

Tianhong: Yes, should be ready to share by then.

  • Kurator

Bertram: Have three postdoc slots (two on other funding: a 2-year DataONE provenance postdoc and a 1-year "BCC" pilot project with archeologists from ASU; focus: data integration). Have an unofficial advertisement, e.g., via DBWORLD, to generate interest (then fill the actual opening quickly). Will emphasize "development-orientation" (not a theory postdoc; hands-on development is expected).

  • James: TDWG Symposium

James: Nothing new. Adding a Brazilian talk to the symposium. Expect registration to open next week.

Bertram: Do we need to advertise/invite people to the wf session?

James: No, pretty much full.

  • InvertEBase

Jim: No further word from NSF; we're still waiting for the final award letter. Will FP be sending anyone to this fall's iDigBio meeting through our InvertEBase connection? InvertEBase is planning a meeting in tandem; Petra should have a slot for us. Logical for me (Jim) to go?

Consensus: Yes, very logical.

James: Note that the iDigBio summit overlaps with TDWG, unfortunately...

Paul: Will be sending David through NEVP (I'll be at TDWG).

  • Possible firuta server move.

Paul: No date yet.

Tech

  • QC for SCAN
    • David: Status of updates to Occurrence and Taxon harvests, (new data, missing collection code from some data).

David: Started working through Maureen's Harvester code. Don't see the Perl scripts for the JSON generation checked in; they may be on her desktop machine from the recent run. Scripts for the Mongo and Mulgara loads are there. Need to add some more logging: the harvest is running too long, perhaps because it starts from too old a date. Able to get at the OAI-PMH endpoints on symbiota4. The harvest to Mongo was last done by harvesting to a JSON file on the filesystem and then loading that file into Mongo. There is debugging to do, and the coupling of the process makes it difficult at present. The missing collection code should be simple to add to the query.

    • Run on full NAU dataset (using entomologists list from solr), send report to Neil
      • Tianhong: Status of workflow and this run.

Tianhong: Still debugging the workflow.

Paul: We'd like to get a test result set to Neil as soon as we can. The key item identified on the akka side was using the entomologist service on fp2, instead of the Harvard botanist list, to check dates.

      • David: Alternative report just listing actionable items.

David: Started; refactoring to allow this. Separating the sheets for actor details so that one actor is reported on one sheet. The current code builds each sheet row by row, so it needs to be refactored (out of two nested loops) to support the new cases. Will need to stream process the data, as data sets will get too large to handle in memory; thus need to have a model of the structure of the sheets.

Paul: Two use cases: the researcher wants all rows in the data set; the data curator wants a limited report of actionable items (data problems that they can do something about).

Bertram: Thanks!
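A rough sketch of the streaming idea discussed for the alternative report: rows are consumed one at a time, each actor gets its own sheet, and a flag switches between the researcher report (all rows) and the curator report (actionable rows only). The `Row` fields, the `SheetFactory` interface, and the comma-separated output are hypothetical stand-ins for illustration, not the project's report code, and field values are not escaped.

```java
import java.io.PrintWriter;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not the project's report code.
class StreamingReport {

    // Assumed row model: the actor that made the assertion, the record it
    // concerns, a comment, and whether a curator could act on it.
    static final class Row {
        final String actor;
        final String recordId;
        final String comment;
        final boolean actionable;

        Row(String actor, String recordId, String comment, boolean actionable) {
            this.actor = actor;
            this.recordId = recordId;
            this.comment = comment;
            this.actionable = actionable;
        }
    }

    // Opens the output for one actor's sheet (a file, a spreadsheet tab, ...).
    interface SheetFactory {
        PrintWriter open(String actor);
    }

    // Consumes rows one at a time, so memory use grows with the number of
    // actors rather than the number of rows. Returns the count written.
    static int write(Iterator<Row> rows, SheetFactory sheets, boolean actionableOnly) {
        Map<String, PrintWriter> open = new LinkedHashMap<String, PrintWriter>();
        int written = 0;
        while (rows.hasNext()) {
            Row row = rows.next();
            if (actionableOnly && !row.actionable) {
                continue; // curator report: skip rows nobody can act on
            }
            PrintWriter sheet = open.get(row.actor);
            if (sheet == null) {
                sheet = sheets.open(row.actor);
                sheet.print("recordId,comment\n");
                open.put(row.actor, sheet);
            }
            sheet.print(row.recordId + "," + row.comment + "\n");
            written++;
        }
        for (PrintWriter sheet : open.values()) {
            sheet.flush(); // closing the sheets is left to the caller
        }
        return written;
    }

    public static void main(String[] args) {
        Iterator<Row> rows = Arrays.asList(
                new Row("DateValidator", "r1", "event date before collector birth", true),
                new Row("DateValidator", "r2", "event date consistent", false)).iterator();
        SheetFactory toStdout = new SheetFactory() {
            public PrintWriter open(String actor) {
                return new PrintWriter(System.out);
            }
        };
        write(rows, toStdout, true);
    }
}
```

Passing `false` for `actionableOnly` gives the researcher report from the same loop, which is the main point of pulling the row logic out of the nested loops.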

  • QC work to do
    • Tianhong: Preparation of jar for workflow runnable by Bertram

Tianhong: Two jars on fp1 (/home/thsong/Akka_Artifacts/): one with mongoReader and mongoWriter, one with CSVReader and JsonWriter, along with a readme file.

Paul: Bertram, have you gotten access to those?

Bertram: My typical username: ludaesch (8 characters!)

Paul: No fp1 account.

David: Will create one, add it to the filteredpush group, and create a share.

Tianhong: Jars take a very long time to terminate (on the order of 30 minutes); may be issues with the JVM.

Bertram: Have you profiled the execution? Try to find out where the time goes! Reminiscent of the discussion while doing the benchmarking, where the services sometimes had long timeouts that slowed down the execution. May be simple tasks waiting for a timeout, rather than something doing intense computation.

Tianhong: May be the configuration of the akka system, and the way in which the actor informs akka that it has completed all of its work.

Profiling/debugging suggestions:

  • "Poor man's" approach (print statements)
  • Use the "top" command at the command line

Tianhong: The system is *waiting* (is it "hung"? Why / how?)
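One way to apply the "poor man's" approach to the slow termination, sketched with the standard library only: time a stage with `System.nanoTime()`, then list the non-daemon threads that are still alive, since a JVM does not exit on its own while any non-daemon thread is running. The class and method names are hypothetical, and whether leftover threads are what delays these jars is an open question, not a finding.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical diagnostic helper; uses only java.lang and java.util.
class ExitDiagnostics {

    // Names of live non-daemon threads other than the calling thread.
    // Any thread listed here keeps the JVM from exiting on its own.
    static List<String> liveNonDaemonThreads() {
        List<String> names = new ArrayList<String>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.isAlive() && !t.isDaemon() && t != Thread.currentThread()) {
                names.add(t.getName());
            }
        }
        Collections.sort(names);
        return names;
    }

    // Runs a task and returns the elapsed wall-clock time in milliseconds.
    static long timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1000000L;
    }

    public static void main(String[] args) {
        long ms = timeMillis(new Runnable() {
            public void run() {
                // The workflow invocation would go here (assumption).
            }
        });
        System.err.println("stage took " + ms + " ms");
        System.err.println("non-daemon threads still alive: " + liveNonDaemonThreads());
    }
}
```

Calling `liveNonDaemonThreads()` as the last statement of the workflow's `main` would show which threads, if any, are still running when the work is supposedly done.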

    • Adding agent authority file to Symbiota - harvest to solr index - use in actor.

Paul: Haven't gotten to this yet.

Tianhong: Working with the downstream index. Integrated with the dateValidator test on the MCZ dataset: 17627 records, 1102 distinct names, 69 have matches. Most of them are last names only; only two "meaningful" matches: J. Paulus, P. Leonard. Similar on NAU/1966: 1081 records, 51 distinct names, 5 have matches.

David: Which query are you using?

Tianhong: namePre

David: With trailing ~4?

David: Did the data supply only the last name (in the record), or was the match just a last name?

Tianhong: Matches that are made are where the data in the record are just a last name or just one initial with a last name.

David: This sounds like we aren't constructing the query correctly. Let's check to see if the known matching collectors that we put into the index are found.
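To make the query-construction question concrete, here is a minimal sketch of building a quoted phrase query with a trailing slop value in Lucene/Solr query syntax. The notes do not record the actual query, so reading `~4` as phrase slop is an assumption, as are the class and method names; only the field name `namePre` comes from the discussion.

```java
// Hypothetical sketch of query-string construction; not the actor's code.
class CollectorQuery {

    // Escapes the two characters that are special inside a quoted phrase.
    static String escapePhrase(String text) {
        return text.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Builds field:"name"~slop, a phrase query that allows the terms of the
    // name to be up to `slop` positions apart.
    static String phraseWithSlop(String field, String name, int slop) {
        return field + ":\"" + escapePhrase(name.trim()) + "\"~" + slop;
    }

    public static void main(String[] args) {
        // Known collectors from the discussion, used as a sanity check.
        String[] known = { "J. Paulus", "P. Leonard" };
        for (String name : known) {
            System.out.println(phraseWithSlop("namePre", name, 4));
        }
    }
}
```

Running the generated strings for collectors known to be in the index, as David suggests, would show whether the full name or only the last name is reaching the index.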

  • Status of going live with Morphbank integration

David: Sent mail to Michael asking about his availability to deploy the war file and what Tomcat/Java version he needs. Also updating documentation for the client helper/Spring deployment. Before merging back into Morphbank trunk, adding a configuration switch to Morphbank to enable/disable annotation submission.

  • iDigBio crowdsourcing deployment

David: Haven't gotten to that yet. Need to make sure that the client helper deployment is smooth; also need to finish refactoring the access point with Camel and Spring. It should then be deployable as a war file.

  • David: Update on current Status of FP2 and SCAN

David: Monitoring; everything seems up, but not seeing activity using the latest access point, Mongo, Fedora. Harvester still in progress. Redeploying today. Will then redeploy the new client helper on symbiota4, coordinating with Ed (along with a Symbiota update). Minor issue with the query for annotation conversations: results aren't coming back, probably a minor change to a namespace not reflected in the query.

  • Metrics for SCAN: http://symbiota2.acis.ufl.edu/symbiota/scan/scan_reports.html

David: Haven't heard back from Neil or Ed. Ready to link from a page, just need to know which one.

  • Updating Roadmap
  • Upcoming work
    • NEVP

David: What goals for the iDigBio summit in October?

Paul: At least a node on fp3 connected to the cnh portal of symbiota4. Would be nice to have the annotation processor to show. David should be able to get a tagged version of the annotation processor without driver to show with just a couple of days' worth of work.

David: Main work to do for deployment is the access point work (the refactoring that is in progress).

      • Node configuration, symbiota integration
      • Annotation processor
    • InvertEBase

Paul: Symbiota instance, probably on symbiota4, connected to a node on fp2 shared with SCAN.

David: Multiple client helpers submitting to the same node seems to work well with Camel. Testing this with Morphbank; xmldsig might be an issue.

  • Java 8

Paul: Time to start thinking about testing with Java 8. EOL for Java 7 is targeted for April of next year.

David: How about the minimum version to build for? There are nice features in 7, and we are currently building for 6.

Bertram: Saw some discussion here in the Kepler/Ptolemy community; concerns about Java 8 breaking elements of the workflow engine.

David: Multi-catch for exceptions makes code much easier to read. Makes for a nice gain in 1.7.

David: Morphbank is using Java 1.6.
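For reference, the multi-catch feature mentioned here lets one `catch` clause handle several unrelated exception types; it needs Java 7 or later, so it would not compile at a Java 6 source level. The class and method below are made up for illustration.

```java
// Illustrative only: requires source level 1.7 or later.
class MultiCatchDemo {

    // Parses an integer, returning a fallback for null or malformed input.
    static int parseOrDefault(String text, int fallback) {
        try {
            return Integer.parseInt(text.trim());
        } catch (NumberFormatException | NullPointerException e) {
            // With Java 6 this needs two catch blocks with duplicated bodies.
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseOrDefault(" 1966 ", -1));
        System.out.println(parseOrDefault("n/a", -1));
        System.out.println(parseOrDefault(null, -1));
    }
}
```

The alternatives in a multi-catch must not be subclasses of one another, which is why the two types here are siblings rather than parent and child.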

  • For Tomorrow:
    • Investigate querying of entomologist names in solr.
    • Bertram to look at akka workflow jars.
    • Minimum Java version and testing with Java 8.