2014Feb26
Etherpad for meeting notes: http://firuta.huh.harvard.edu:9000/FP-2014Feb26
Agenda
Non-Tech
- Davis: NCE
- James: TDWG session
- Bob: Finishing the SemanticMediaWiki as FP client deliverable.
Tech
- Report from Friday call
- Analysis
- Report Tianhong: Progress on updated Kepler Kuration release
- Report Tianhong: Akka Actors.
- Report Bob: Progress on Duplicate Finding data mining.
- Report Chuck: Duplicate detection UI.
- Discussion: Firuta for duplicate detection rollout?
- Nodes
- Report David: Progress on updating annotation processor for Camel infrastructure.
- Report Maureen: Ingest progress.
- NEVP
- Report Paul: CNH/NEVP ingests live on Symbiota, e.g. http://portal.neherbaria.org/portal/collections/individual/index.php?occid=589301&clid=0. About 16,000 records ingested.
- SCAN
- Report Paul: Ed has taken hackathon code live: http://symbiota1.acis.ufl.edu/scan/portal/imagelib/imgsearch.php
- Driver
- Report Maureen: Status of rollback to last working driver version.
Reports
- Paul
- Several more rounds of testing, bugfixing, and cleanup on NEVP ingest tool for Specify-HUH.
- Deployed, from Chuck's instructions and with a little help, the back-end parts of the duplicate detection tool FP-DataEntry, configured with Exsiccatae data and ready for use with the HUH rapid data entry tool.
- Chuck
- Big refactoring of FP-DataEntry for run-time configurability: Originally, the idea was that different variants might provide their own implementations of Term and Tuple, and consistency would be assured by the test suite. It turns out that I was editing those pretty often, and since they are just data structures and don't specify behavior, I couldn't really justify the hassle of the extend-the-code model. So: Everything else is run-time configured, I'm 1/2-way with Term, and Tuple still needs to be done.
- "Remote"/"Local" don't make sense any more: Right now, my preference would be to change it to "Backend"/"Frontend". Thoughts?
- The solr schema also needs to be specified via the command line, since it is linked to the terms, but I want to think about how that could be done: There's a lot of boilerplate in any schema file which I would like to factor out.
- Several other questions floating in my head: Do we need anything else to get the exsiccati system into production? Priorities: XML configuration? or faceting? or Tomcat-ization? or demos?
- Jim
- No further word on request submitted to NSF for second no-cost extension.
Notes
Present: Maureen, Chuck, Bob, Jim, James, Paul
Non-Tech
- Davis: NCE
- James: TDWG session
James: No action needed yet.
- Bob: Finishing the SemanticMediaWiki as FP client deliverable.
Bob: Propose working with Joel Sachs. He's been working with SMW on the Berlin Biowiki Farm, which has hosted wiki.filteredpush.org for about a year now, along with terms.gbif.org and others that might find interesting applications for OA annotations. Ditto for things like Wikispecies (https://species.wikimedia.org).
Discussion: Fits criteria for expending effort at this point: produces a deliverable in the project and provides support for others, not just an intellectual exercise. James, Bob, Paul to coordinate planning.
Tech
- Report from Friday call
- Analysis
Davis folks are at IDCC meeting today.
- Report Tianhong: Progress on updated Kepler Kuration release
- Report Tianhong: Akka Actors.
- Report Bob: Progress on Duplicate Finding data mining.
Bob: Experimenting with http://pkghosh.wordpress.com/2013/09/09/identifying-duplicate-records-with-fuzzy-matching/ Waiting to see if iDigBio can supply a multi-node Hadoop client; iPlant does not or cannot. Meanwhile, a modest-size input (10K specimens) throws an Exception on a single node, so I am reading pkghosh's code. Hadoop should not behave differently on a single node except for performance, so I have to solve the Exception issue, which is probably an input data format error not properly defended against in the code.
- Report Chuck: Duplicate detection UI.
Discussion: Threatened/endangered species - try USDA Plants flags, then local flags within Symbiota data. Exclude all locality data for all taxa that anyone has flagged.
Timeline: Target getting into production for HUH-rapid this Friday, collect bugs next week. Work in parallel on integration into the dina-specify web application. Look for other possibilities for generalization and deployment in the duplicate detection space.
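Illustration (not from the meeting): a minimal single-machine sketch of the kind of fuzzy matching discussed under the duplicate-finding report above. It is not the Hadoop-based approach from the linked post; it only shows the idea of scoring record pairs by field similarity. The field names, the threshold, and the sample records are all made up for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical field names and threshold, chosen only for illustration.
FIELDS = ("collector", "number", "date", "locality")
THRESHOLD = 0.85


def similarity(a, b):
    """Mean per-field string similarity (0.0 to 1.0) between two records."""
    scores = [
        SequenceMatcher(None, a.get(f, "").lower(), b.get(f, "").lower()).ratio()
        for f in FIELDS
    ]
    return sum(scores) / len(scores)


def candidate_duplicates(records, threshold=THRESHOLD):
    """Return (id, id, score) for every record pair scoring at or above threshold."""
    pairs = []
    for a, b in combinations(records, 2):
        score = similarity(a, b)
        if score >= threshold:
            pairs.append((a["id"], b["id"], round(score, 3)))
    return pairs


# Made-up sample records; A1 and B7 differ only in how the locality is written.
records = [
    {"id": "A1", "collector": "J. Smith", "number": "123",
     "date": "1880-06-01", "locality": "Mount Washington, New Hampshire"},
    {"id": "B7", "collector": "J. Smith", "number": "123",
     "date": "1880-06-01", "locality": "Mt. Washington, New Hampshire"},
    {"id": "C3", "collector": "R. Jones", "number": "9041",
     "date": "1902-08-15", "locality": "Gaspe Peninsula, Quebec"},
]

for pair in candidate_duplicates(records):
    print(pair)
```

Comparing every pair is quadratic in the number of records, so this only suits small samples; it may still be useful for checking input formats before a cluster run.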
- Nodes
- Report David: Progress on updating annotation processor for Camel infrastructure.
David: Need to connect to the messenger bean; looking at decoupling from old FP-core and coupling to new FP-core elements. Most of the work is probably related to how we are doing configuration. Lots of coupling between the annotation processor and projects other than FP-Core.
- Report Maureen: Ingest progress.
Maureen: Bulk load of taxon data from the two taxon authorities in SCAN into two named graphs.
David: How do we phrase the SPARQL query to use these: do we pick one, use both, or need to support n named graphs? Currently working with the merger of the two.
Maureen: We can expect inconsistencies between the two (or n), but we don't care.
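Illustration (not from the meeting): one possible way to phrase a query over n named graphs, sketched as a small Python helper that builds the SPARQL text. It uses the SPARQL 1.1 VALUES and GRAPH keywords so the same query shape works for one graph, both, or n. The graph URIs are placeholders, and the use of dwc:scientificName as the predicate is an assumption about how the taxon data was loaded.

```python
def taxon_query(name, graph_uris):
    """Build a SPARQL query matching a taxon name in any of the given named graphs."""
    if not graph_uris:
        raise ValueError("at least one named graph is required")
    values = " ".join(f"<{uri}>" for uri in graph_uris)
    # Escape backslashes and double quotes so the name is a valid SPARQL literal.
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    return f"""
PREFIX dwc: <http://rs.tdwg.org/dwc/terms/>
SELECT ?g ?taxon WHERE {{
  VALUES ?g {{ {values} }}
  GRAPH ?g {{ ?taxon dwc:scientificName "{escaped}" . }}
}}
""".strip()


# Placeholder graph URIs, one per taxon authority.
print(taxon_query("Quercus alba", [
    "http://example.org/graph/authorityA",
    "http://example.org/graph/authorityB",
]))
```

Because ?g is returned with each match, a caller can see which authority a result came from, so inconsistencies between the graphs stay visible rather than being merged away.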
- NEVP
- Report Paul: CNH/NEVP ingests live on Symbiota, e.g. http://portal.neherbaria.org/portal/collections/individual/index.php?occid=589301&clid=0. About 16,000 records ingested.
Paul: Also have the first batch ingested from there into Specify-HUH http://kiki.huh.harvard.edu/databases/image_search.php?mode=qc&batchid=3217
- SCAN
- Report Paul: Ed has taken hackathon code live: http://symbiota1.acis.ufl.edu/scan/portal/imagelib/imgsearch.php
- Driver
- Report Maureen: Status of rollback to last working driver version.
Maureen: Started looking at how to connect in a more sane way; lots of mapping happening in the annotation processor that should happen in the driver. Looking at reusing Chuck's FP-DataEntry code to refactor old driver code.
Discussion: Approaches to the driver and getting annotation processing working again: tackle in parallel. David working on getting the existing annotation processor/driver from the SPNHC demo or late summer last year working again with the new infrastructure. Maureen and Chuck to look at a new approach to a simpler framework.
- Report Jim: No further word on request submitted to NSF for second no-cost extension.
- ScratchPads lead Dimitri at NHM is interested in FP API
James: Talked with him at Phenotype RCN meeting.
James: Interest at phenotype RCN in OA for genomic annotation.
For Friday:
- At what point in the harvest-to-delivery pipeline do we exclude endangered species data, or put access controls on it, and how do we include USDA Plants data for flags of threatened/endangered status? James can help broker getting the USDA-Plants data as I do not think they have any services... See: http://plants.usda.gov/threat.html but the output is an HTML page only! Probably available in direct download: http://plants.usda.gov/dl_all.html [I checked and it is only a limited list; that download doesn't include the status]. See also query results, and http://plants.usda.gov/java/threat [that looks similar and useful; scrape?]. There's a direct download available: http://plants.usda.gov/java/downloadData?fileName=plnt12963.txt [great! did not see that button, that should be easily parsable]
- code review
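Illustration for the Friday discussion (not from the meeting): a minimal sketch of the rule "exclude all locality data for all taxa that anyone has flagged", written as a filter that could sit at whichever pipeline stage is chosen. The set of locality fields, the record layout, and the taxon names are assumptions made up for the example; parsing of the USDA Plants download is not shown because its format was not discussed.

```python
# Assumed set of locality-bearing fields (Darwin Core term names).
LOCALITY_FIELDS = ("locality", "decimalLatitude", "decimalLongitude")


def flagged_taxa(*flag_sources):
    """Union of the taxon names flagged by any source, normalised to lower case."""
    flagged = set()
    for source in flag_sources:
        flagged.update(name.strip().lower() for name in source)
    return flagged


def redact(record, flagged):
    """Return the record without locality fields if its taxon is flagged."""
    name = record.get("scientificName", "").strip().lower()
    if name not in flagged:
        return record
    return {k: v for k, v in record.items() if k not in LOCALITY_FIELDS}


# Made-up flag lists standing in for USDA Plants flags and local Symbiota flags.
usda_flags = ["Exemplum rarum"]
local_flags = ["exemplum alpinum "]
flagged = flagged_taxa(usda_flags, local_flags)

record = {
    "scientificName": "Exemplum rarum",
    "locality": "hypothetical bog",
    "decimalLatitude": "44.0",
    "decimalLongitude": "-71.0",
    "recordedBy": "J. Smith",
}
print(redact(record, flagged))
```

Taking the union of all flag sources matches the "anyone has flagged" wording; a stricter or looser policy would only change flagged_taxa.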