2014Feb26

From FilteredPush


Etherpad for meeting notes: http://firuta.huh.harvard.edu:9000/FP-2014Feb26

Agenda

Non-Tech

  • Davis: NCE
  • James: TDWG session
  • Bob: Finishing the SemanticMediaWiki as FP client deliverable.

Tech

  • Report from Friday call
  • Analysis
    • Report Tianhong: Progress on updated Kepler Kuration release
    • Report Tianhong: Akka Actors.
    • Report Bob: Progress on Duplicate Finding data mining.
    • Report Chuck: Duplicate detection UI.
    • Discussion: Firuta for duplicate detection rollout?
  • Nodes
    • Report David: Progress on updating annotation processor for Camel infrastructure.
    • Report Maureen: Ingest progress.
  • NEVP
  • SCAN
  • Driver
    • Report Maureen: Status of rollback to last working driver version.

Reports

  • Paul
    • Several more rounds of testing, bugfixing, and cleanup on NEVP ingest tool for Specify-HUH.
    • Deployed, from Chuck's instructions and with a little help, the back-end parts of the duplicate detection tool FP-DataEntry, configured with Exsiccatae data and ready for use with the HUH rapid data entry tool.
  • Chuck
    • Big refactoring of FP-DataEntry for run-time configurability: Originally, the idea was that different variants might provide their own implementations of Term and Tuple, and consistency would be assured by the test suite. It turns out that I was editing those pretty often, and since they are just data structures and don't specify behavior, I couldn't really justify the hassle of the extend-the-code model. So: Everything else is run-time configured, I'm 1/2-way with Term, and Tuple still needs to be done.
    • "Remote"/"Local" don't make sense any more: Right now, my preference would be to change it to "Backend"/"Frontend". Thoughts?
    • The Solr schema also needs to be specified via the command line, since it is linked to the terms, but I want to think about how that could be done: There's a lot of boilerplate in any schema file which I would like to factor out.
    • Several other questions floating in my head: Do we need anything else to get the exsiccati system into production? Priorities: XML configuration? Faceting? Tomcat-ization? Demos?
  • Jim
    • No further word on request submitted to NSF for second no-cost extension.

Notes

Present: Maureen, Chuck, Bob, Jim, James, Paul

Non-Tech

  • Davis: NCE
  • James: TDWG session

James: No action needed yet.

  • Bob: Finishing the SemanticMediaWiki as FP client deliverable.

Bob: Propose working with Joel Sachs. He's been working with SMW on the Berlin Biowiki Farm (which has hosted wiki.filteredpush.org for about a year now, along with terms.gbif.org and others which might find interesting applications for OA annotations). Ditto for things like Wikispecies (https://species.wikimedia.org).

Discussion: Fits the criteria for expending effort at this point: it produces a deliverable in the project and provides support for others, not just an intellectual exercise. James, Bob, Paul to coordinate planning.

Tech

  • Report from Friday call
  • Analysis

Davis folks are at IDCC meeting today.

    • Report Tianhong: Progress on updated Kepler Kuration release
    • Report Tianhong: Akka Actors.
    • Report Bob: Progress on Duplicate Finding data mining.

Bob: Experimenting with http://pkghosh.wordpress.com/2013/09/09/identifying-duplicate-records-with-fuzzy-matching/ Waiting to see if iDigBio can supply a multi-node Hadoop client; iPlant does not or cannot. Meanwhile, a modest-size input (10K specimens) throws an Exception on a single node, so I am reading pkghosh's code... Hadoop should not behave differently on a single node except for performance, so I have to solve the Exception issue, which is probably an input data format error not properly defended against in the code.

    • Report Chuck: Duplicate detection UI.

Discussion: Threatened/endangered species: try USDA PLANTS flags first, then local flags within Symbiota data. Exclude all locality data for all taxa that anyone has flagged.
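
The exclusion rule above can be sketched as follows. This is only an illustration, not the FP-DataEntry implementation: the function name, the Darwin Core style field names, and the idea that flags arrive as sets of scientific names are all assumptions.

```python
# Hypothetical field names (Darwin Core style); the real record layout may differ.
LOCALITY_FIELDS = ("locality", "decimalLatitude", "decimalLongitude")


def redact_flagged(records, usda_flagged, local_flagged):
    """Return copies of records with locality fields dropped for flagged taxa.

    A taxon counts as flagged if it appears in either flag source
    ("taxa that anyone has flagged"). Names are compared case-insensitively.
    """
    flagged = {name.lower() for name in usda_flagged}
    flagged |= {name.lower() for name in local_flagged}

    redacted = []
    for record in records:
        name = record.get("scientificName", "").lower()
        if name in flagged:
            # Build a new dict so the caller's record is left untouched.
            record = {k: v for k, v in record.items() if k not in LOCALITY_FIELDS}
        redacted.append(record)
    return redacted
```

Taking the union of the two flag sources errs on the side of hiding too much rather than too little, which matches the "anyone has flagged" wording.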

Timeline: Target getting into production for HUH-rapid this Friday, collect bugs next week. Work in parallel on integration into the dina-specify web application. Look for other possibilities for generalization and deployment in the duplicate detection space.

  • Nodes
    • Report David: Progress on updating annotation processor for Camel infrastructure.

David: Need to connect to the messenger bean; looking at decoupling from old FP-core and coupling to new FP-core elements. Most of the work is probably related to how we are doing configuration. Lots of coupling between the annotation processor and projects other than FP-Core.

    • Report Maureen: Ingest progress.

Maureen: Bulk load of taxon data from the two taxon authorities in SCAN into two named graphs.

David: How do we phrase the SPARQL query to use these: do we pick one, use both, or need to support n named graphs? Currently working with the merger of the two.

Maureen: We can expect inconsistencies between the two (or n), but we don't care.
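
One way the n-named-graphs option could be phrased is sketched below, using SPARQL's standard `FROM NAMED` and `GRAPH` constructs. The function name, the example graph URIs, and the use of `dwc:scientificName` as the predicate are assumptions for illustration; the actual vocabulary of the ingested taxon data is not recorded in these notes.

```python
def build_taxon_query(graph_uris, scientific_name):
    """Build a SPARQL query that looks up a name in any number of named graphs.

    Each graph is listed with FROM NAMED, and GRAPH ?g reports which
    authority each match came from, so inconsistent answers stay visible.
    """
    from_named = "\n".join(f"FROM NAMED <{uri}>" for uri in graph_uris)
    # Escape backslashes and double quotes so the name is a valid literal.
    literal = scientific_name.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX dwc: <http://rs.tdwg.org/dwc/terms/>\n"
        "SELECT ?g ?taxon\n"
        f"{from_named}\n"
        "WHERE {\n"
        "  GRAPH ?g {\n"
        f'    ?taxon dwc:scientificName "{literal}" .\n'
        "  }\n"
        "}"
    )


# Hypothetical graph URIs, one per taxon authority.
print(build_taxon_query(
    ["http://example.org/graph/authorityA", "http://example.org/graph/authorityB"],
    "Quercus alba",
))
```

Picking one authority is then the special case of passing a single graph URI, and the query text does not otherwise change as more graphs are added.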


Paul: Also have the first batch ingested from there into Specify-HUH http://kiki.huh.harvard.edu/databases/image_search.php?mode=qc&batchid=3217

Maureen: Started looking at how to connect in a more sane way; lots of mapping is happening in the annotation processor that should happen in the driver. Looking at reusing Chuck's FP-DataEntry code to refactor the old driver code.

Discussion: Approaches to the driver and getting annotation processing working again: tackle in parallel. David to work on getting the existing annotation processor/driver from the SPNHC demo or late summer last year working again with the new infrastructure. Maureen and Chuck to look at a new approach with a simpler framework.

    • Report Jim: No further word on request submitted to NSF for second no-cost extension.
    • ScratchPads lead Dimitri at NHM is interested in FP API

James: Talked with him at Phenotype RCN meeting.

James: Interest at phenotype RCN in OA for genomic annotation.

For Friday: