2014Sep24

From Filtered Push Wiki


Etherpad for meeting notes: http://firuta.huh.harvard.edu:9000/FP-2014Sep24

Agenda

Non-Tech

  • Publications
    • Progress
      • Paul: Collection Objects
      • Bob: Refactoring Dup finding cluster analysis
  • TDWG Symposium
    • Abstracts, Registrations.
  • QC work to Kurator

Tech

  • Firuta server move.
    • Cleanup of artifacts in Archiva
    • Update deployed apps in Tomcat and Apache (Symbiota, Morphbank, Annotation Processor)
  • Status of Mongo on FP2
  • QC for SCAN
    • Run on full NAU dataset, report to Neil (and Tim)
  • QC work
    • Tianhong/Bertram: More on running workflows.
    • Adding agent authority file to Symbiota - harvest to solr index - use in actor.
  • Deployments
    • Access point updates
    • Bringing Annotation Processor up-to-date
    • Deploy and re-run harvest of occurrence records
    • Status of fp2 and fp3
  • SVN re-organization/cleanup

Reports

  • Paul
    • TDWG AIG abstract submitted
    • Drafts of TDWG abstracts circulated
    • More work on ComplexObjects paper
  • David
    • Worked with Tianhong to resolve result set size issues
    • Generated spreadsheet for NAU, sent to Neil for review
    • Coordinated with Alex to increase space on fp2
    • Initial cleanup on Firuta: removed Glassfish installation, made a fresh Symbiota check-out, cleared old Morphbank, removed sparqlpush

Notes

FilteredPush Team Meeting 2014 Sept 24

Present: Bertram, Jim, Tianhong, Paul, David, Bob, (Tim joining at end).

Agenda

Non-Tech

  • Publications
    • Progress
      • Paul: Collection Objects

Paul: Another draft.

      • Bob: Refactoring Dup finding cluster analysis

Bob: Working on cluster analysis, haven't gotten to drafting yet.

  • TDWG Symposium
    • Abstracts, Registrations.

Paul: Drafts of all three abstracts circulating. Due tomorrow. Bertram: Are symposia scheduled yet? Want to make travel arrangements. Jim: Bob and Paul to coordinate travel plans with Melissa.

  • QC work to Kurator

Paul: Time to move most of the discussion of data quality workflows into the Kurator calls. Bertram: Concur, we are starting with the FilteredPush workflows in Kurator, makes practical sense.

  • Meeting time

Paul: Possible slots for the room, Tuesday and Friday at 2 to 3 eastern time. Jim: Either possible this semester. Paul: Tentatively, 2 PM eastern. I'll check with James.

Tech

  • Firuta server move.

Paul: Met with David and Anne Marie, starting to clean up Firuta. David: Have compiled a list of the services on Firuta, need to put it on a wiki page. Have been cleaning up old deployed applications. Archiva is now a dependency for Tim, so we need to notify him if there is downtime.

  • QC for SCAN
    • Run on full NAU dataset, report to Neil (and Tim)

Paul: Tianhong has run the workflow on the NAU data, David has post-processed it, and the spreadsheet has gone to Neil for comments. He's planning on working through the spreadsheet and may call tomorrow to work on it. David: Disk space on FP2 increased, ready to run QC workflow on full SCAN data set. Tianhong: Expect the run will take at least overnight. David shouldn't interfere with anything as long as the NAU5All collection isn't replaced. Expect about 10 GB needed for results, seeing about 2.5 KB per record for result sets. Paul: Go ahead and start analysis on full SCAN data set.
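The storage estimate in these notes is simple arithmetic, and a minimal sketch makes it easy to redo for a different record count. The figures of 2.5 KB per record and 10 GB come from the notes. The function names, the use of decimal units (1 KB = 1000 bytes), and the one-million-record example are assumptions for illustration only, since the notes do not state the size of the full SCAN data set.

```python
# Back-of-envelope sizing for QC workflow result sets.
# Assumption: decimal units (1 KB = 1000 bytes, 1 GB = 1000**3 bytes).
BYTES_PER_KB = 1000
BYTES_PER_GB = 1000 ** 3


def estimated_result_bytes(record_count, kb_per_record=2.5):
    """Estimate result-set storage in bytes for a given number of records."""
    return record_count * kb_per_record * BYTES_PER_KB


def records_within_budget(budget_gb, kb_per_record=2.5):
    """Return how many records fit in a storage budget given in GB."""
    return int(budget_gb * BYTES_PER_GB // (kb_per_record * BYTES_PER_KB))


if __name__ == "__main__":
    # How many records a 10 GB allowance covers at 2.5 KB per record.
    print(records_within_budget(10))
    # Storage in GB for a hypothetical one million records.
    print(estimated_result_bytes(1_000_000) / BYTES_PER_GB)
```

Under these assumptions, 10 GB corresponds to roughly four million records at 2.5 KB each.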

  • QC work
    • Tianhong/Bertram: More on running workflows.

Bertram: Got workflow to run, ran into issues with Mongo interaction. Would like to see how to get from where we are to something that runs more seamlessly from the command line: a sequence of commands that are independent of each other (load data from CSV into Mongo, load data from Mongo into workflow, run this actor, write to Mongo, ...). Small steps that can easily be chained together. Where we are now: limited level of control, limited ability to chain actions together from the command line. Bertram: Previously, the sweet spot for workflow processing was around 5 parallel service invocations. Now, processing time per record is increasing.
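One way to picture the chain of small, independent steps Bertram describes is the sketch below. It is not the project's code: every function name is hypothetical, a plain Python list stands in for a Mongo collection so the example is self-contained, and the "actor" is a made-up whitespace-trimming step rather than one of the real QC actors. It also assumes well-formed CSV rows.

```python
import csv
import io


def load_csv(text):
    """Step: parse CSV text into a list of record dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def trim_whitespace_actor(records):
    """Step: hypothetical actor that strips whitespace from every value."""
    return [{k: (v or "").strip() for k, v in r.items()} for r in records]


def store(collection, records):
    """Step: stand-in for a write to Mongo; appends copies to a list."""
    collection.extend(dict(r) for r in records)
    return collection


def chain(data, *steps):
    """Run independent steps in order, feeding each output to the next."""
    for step in steps:
        data = step(data)
    return data


if __name__ == "__main__":
    raw = "scientificName,recordedBy\n Quercus alba , A. Gray \n"
    results = []  # stand-in for a Mongo collection
    chain(raw, load_csv, trim_whitespace_actor,
          lambda recs: store(results, recs))
    print(results)
```

Because each step only takes records in and hands records out, any step could in principle be swapped for a separate command-line tool without changing the others, which is the property asked for in the discussion.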

  • Bertram: DarwinCore Attribute names: where? (so many links :)

http://rs.tdwg.org/dwc/terms/guides/text/index.htm Bob: Machine readable or human readable? Bertram: Human readable. Bob, Paul: This one: http://rs.tdwg.org/dwc/terms/index.htm

    • Adding agent authority file to Symbiota - harvest to solr index - use in actor.

Paul: Not done yet.

  • Deployments
    • Access point updates

David: Working on updating and deploying to FP2 and FP3, using the camel/spring framework and bringing that into production. Probably deploy next week.

    • Bringing Annotation Processor up-to-date

David: Began looking at annotation processor - looks like place to start is the state of the driver code and if it can be brought into use for NEVP.

    • Deploy and re-run harvest of occurrence records

David: Have Maureen's code and have started into it, about two thirds of the way to working. Need to get the harvest script for JSON running; ingest into Mongo is working. The bit that needs work is the Perl script for OAI/PMH harvesting to JSON.

    • Status of fp2 and fp3

David: FP2 is functioning using the current client helper and older access point; update next week. Mongo issues seem resolved, more space added. FP3 needs updating, targeted for next week.

  • Status of Mongo on FP2

David: Cleaned up, removed unneeded collections, ran repair database to defragment Mongo's file system. May be advantageous to do regularly, particularly if we remove collections.

  • SVN re-organization/cleanup

David: Time to start reorganizing the repository again. Need to tag old artifacts and trim them from trunk, to shorten the list to the things we currently need. Will put up a wiki page to summarize. Paul: Can add to existing page.

  • Clustering duplicates

Bob: Vectorization on NEVP done and run, not seeing better performance. Looking at trying on 10 cores on the Amazon cloud to see if it speeds up there.

  • Akka

Tim joining briefly to work through issues getting the Akka workflow to run.