2014Sep10


Etherpad for meeting notes: http://firuta.huh.harvard.edu:9000/FP-2014Sep10

Agenda

Non-Tech

  • Will need to schedule a new meeting date/time.
  • Publications
    • Progress
      • Tianhong: Workflows
      • James: Collection Objects
      • Bob: Refactoring Dup finding cluster analysis
  • TDWG Symposium
    • Abstracts, Registrations.
  • W3C Survey on greater public participation

Tech

  • Firuta server move.
  • Status of FP2
  • Mongo cleanup on FP2
  • QC for SCAN
    • Run on full NAU dataset (using entomologists list from solr), send report to Neil
  • QC work
    • Tianhong/Bertram: Report on running workflow from command line.
    • Adding agent authority file to Symbiota - harvest to solr index - use in actor.
  • DarwinCore issues 204-226 under discussion: https://code.google.com/p/darwincore/issues/list?sort=-id

Reports

  • Paul
    • No firm date on a possible Firuta move. David, Anne Marie, and I should meet next week to discuss details.
  • Bertram
    • Faculty meeting conflict with this meeting time.

Notes

FilteredPush Team Meeting 2014 Sept 10

Present: Tianhong, Bob, David, Paul, James.

Non-Tech

  • Will need to schedule a new meeting date/time.
  • Publications
    • Progress
      • Tianhong: Workflows: IDCC Paper

Tianhong: A few issues left; will submit by the end of today.

      • James: Collection Objects

James: Have a draft, need more feedback from Paul. Agenda items for Monday Dina meeting (James and Paul remote participation): Core of information model, embedding annotation ingestion.

Paul: Need draft by later today to get any work on it before Monday.

James: Sent link.

      • Bob: Will start putting some stakes in the ground in a Google doc on lessons learned.
  • TDWG:

James: Activities: Dina workshop Friday Oct 24th, Identifiers meeting Sat Oct 25 (meeting registration includes Sunday bus from Stockholm). Meeting Oct 27-31.

    • Abstracts (Due Sept 25th)
      1. FP Kurator - Bertram as presenter.
      2. NEVP/MCZ digitization processing - Paul as presenter.
      3. QC Experiences with cluster analysis FP-DataEntry tool SOLR and regex contrasted with data mining approach - Bob as presenter. Docs on FP-DataEntry at http://sourceforge.net/p/filteredpush/svn/HEAD/tree/trunk/FP-DataEntry/
    • Registrations (by Sept 29):

James is registered.

  • W3C Survey on greater public participation

Bob: Note this survey.

Tech

  • Firuta server move.

Paul and David need to meet with Anne Marie sometime next week to discuss what needs to be done and dates.

  • FP Client

David: Refactoring FP client to use latest camel, should be able to check in later in the day tomorrow.

Bob: Looking to check this out; plan to look at it around 3PM and get the annotation processor running.

  • Status of FP2

David: Currently the access point, but not the latest camel client helper, is deployed and working for SCAN. Utilities for managing analysis results are deployed on FP2. Limited activity, but not seeing any issues. Harvester still needs to be updated and a cron job set up.

  • Mongo cleanup on FP2

David: Datastore was using about 40GB. Cleaned out many of the old data sets. Reloaded the current harvest using the command line tools from Maureen. Tianhong has an updated analysis result in there as well. Total space use is about 4GB.

Paul: Model for the mongo data was (1) annotations (small set), (2) a current copy of harvested data (not multiple snapshots), and (3) a current copy of analysis results.

Tianhong: We haven't run the analysis on the whole harvested data set yet. Runtime issue.

Paul: Still at the state of separate collections in mongo for each analysis result.

  • Bob: Refactoring Dup finding cluster analysis

Bob: Biggest current issue is that a different vectorization is required for each data set (they have different attributes with different significance, covering different pieces of the whole Darwin Core). Four different data sets, thus need four different vectorizations.
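
A minimal sketch of the per-data-set vectorization Bob describes, assuming each data set is given its own list of Darwin Core terms and weights. The data set names, the choice of terms, the weights, and the `vectorize` helper are all hypothetical illustrations, not the actual cluster analysis code.

```python
# Hypothetical per-data-set specs: each data set selects its own Darwin Core
# terms and weights (names and numbers are illustrative assumptions only).
VECTOR_SPECS = {
    "dataset_a": [("recordedBy", 2.0), ("eventDate", 1.0), ("locality", 1.0)],
    "dataset_b": [("scientificName", 2.0), ("recordedBy", 1.0)],
}


def vectorize(record, dataset):
    """Return a weighted feature dict for one record using its data set's spec."""
    features = {}
    for term, weight in VECTOR_SPECS[dataset]:
        value = record.get(term, "").strip().lower()
        if value:
            # One feature per (term, normalized value) pair; empty terms are skipped.
            features[f"{term}={value}"] = weight
    return features


record = {
    "recordedBy": "A. Gray",
    "eventDate": "1880-05-01",
    "scientificName": "Quercus alba",
}
vec_a = vectorize(record, "dataset_a")
vec_b = vectorize(record, "dataset_b")
```

Keeping the specs in one table would mean that a fifth data set needs only a new entry rather than new code, though whether that fits the real refactoring is an open question.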

  • QC for SCAN
    • Run on full NAU dataset (using entomologists list from solr), send report to Neil

David: Still seeing heap space issues when post-processing from Mongo into a spreadsheet; this appears to relate to the behavior of the cursor in the post-processing code while reading data from Mongo. The data set is available in Mongo, but post-processing to a spreadsheet isn't completing.
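
If the heap growth comes from holding the whole result set in memory (an assumption; the cause was not confirmed in the meeting), one option is to write each document as it is read from the cursor. The sketch below uses a plain generator as a stand-in for a database cursor and CSV as a stand-in for the spreadsheet; `write_rows` and `fake_cursor` are hypothetical names, not part of the post-processing code.

```python
import csv
import io


def write_rows(cursor, out, fields):
    """Stream documents from an iterable cursor to CSV, one row at a time."""
    writer = csv.writer(out)
    writer.writerow(fields)
    count = 0
    for doc in cursor:  # iterate lazily; never materialize with list(cursor)
        writer.writerow([doc.get(f, "") for f in fields])
        count += 1
    return count


def fake_cursor(n):
    """Stand-in for a database cursor: yields documents lazily."""
    for i in range(n):
        yield {"id": i, "status": "ok"}


buffer = io.StringIO()
written = write_rows(fake_cursor(3), buffer, ["id", "status"])
```

Memory use then depends on the size of one document rather than on the size of the data set.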

    • Long runtime issue

Tianhong: NAU data set (around 300,000 records) takes about six hours. Web service calls are the time-consuming element, with some taking a long time to respond (the issue appears to be the response time of the service rather than network time). This is without caching results; can try with caching re-enabled to see if this improves, but the sense is that it won't be a significant improvement, given the random nature of the long responses.

Tianhong: Am able to graph distribution of response times for each web service.
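
As an illustration of the kind of summary behind such a graph, the sketch below groups response times into fixed-width buckets per service. The service names and timings are made-up sample values, and `histogram` is a hypothetical helper, not the code Tianhong used.

```python
from collections import Counter, defaultdict


def bucket(seconds, width=1.0):
    """Return the lower edge of the fixed-width bucket containing `seconds`."""
    return int(seconds // width) * width


def histogram(timings, width=1.0):
    """Map service name -> Counter of bucket lower edge -> number of calls."""
    result = defaultdict(Counter)
    for service, seconds in timings:
        result[service][bucket(seconds, width)] += 1
    return result


# Hypothetical (service, seconds) samples, not measured values.
timings = [
    ("service_a", 0.2),
    ("service_a", 0.7),
    ("service_a", 12.4),
    ("service_b", 1.5),
]
hist = histogram(timings)
```

A long tail, such as the single 12-second call for `service_a` here, would show up as sparsely populated high buckets.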

Bob: If the issue is waiting for responses from services, then caches should improve the throughput (and we could conceivably call predictively for things that we think we will want, in order to cache them).
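
A minimal sketch of the caching idea, assuming lookups are keyed on the exact input string. `slow_service` is a hypothetical stand-in for a remote web service call, and the names are sample values.

```python
from functools import lru_cache

calls = []  # records every time the (simulated) remote service is actually hit


def slow_service(name):
    """Hypothetical stand-in for a remote validation web service."""
    calls.append(name)
    return name.strip().title()


@lru_cache(maxsize=None)
def cached_lookup(name):
    """Return the service result, calling the service once per distinct input."""
    return slow_service(name)


names = ["quercus alba", "acer rubrum", "quercus alba", "quercus alba"]
results = [cached_lookup(n) for n in names]
```

The gain is proportional to how often values repeat in the data set: four lookups here cost two service calls, but a data set of mostly distinct values would see little benefit.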

Tianhong: Akka is parallelizing into multiple threads; each actor has about 8 threads, with a router that directs work to the actor threads.

Bob: Is the multithreading accomplishing multiple parallel network requests to services?

Tianhong: Not sure.
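
One way to answer Bob's question empirically is to count how many service calls are in flight at the same moment. The sketch below shows the idea with a plain Python thread pool rather than Akka; the pool size, the `fake_service_call` function, and the barrier are all assumptions made for illustration. In the real workflow the counter would wrap the actual service call.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

WORKERS = 4  # hypothetical pool size, not the actual Akka configuration
lock = threading.Lock()
barrier = threading.Barrier(WORKERS)
in_flight = 0
peak = 0


def fake_service_call(item):
    """Stand-in for a web service call that records how many calls overlap."""
    global in_flight, peak
    with lock:
        in_flight += 1
        peak = max(peak, in_flight)
    # Every call waits here until WORKERS calls are in flight at once;
    # the timeout raises an error if the calls were actually serialized.
    barrier.wait(timeout=5)
    with lock:
        in_flight -= 1
    return item * 2


with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    results = list(pool.map(fake_service_call, range(WORKERS)))
```

A peak of 1 in a real run would indicate that requests are effectively serialized despite the multiple threads.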

Tianhong: Also seeing that the incoming data from Mongo is entering rapidly, but the handoff of this incoming data to routers (within Akka) appears to be introducing random delay times.

  • QC work
    • Tianhong/Bertram: Report on running workflow from command line.

Tianhong: Got Bertram to the point where he could take json from file system, run workflow from the command line, obtain json on the file system, and post process into spreadsheet.

    • Adding agent authority file to Symbiota - harvest to solr index - use in actor.
    • IDCC paper prototype

Tianhong: First version done; will keep improving the UI, integration, and optimization. Have something to show people.

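  • DarwinCore issues 204-226 under discussion: https://code.google.com/p/darwincore/issues/list?sort=-id

Bob: That seems to have died down.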

James: Yes, activity has died down on this discussion.