2014Apr16



Etherpad for meeting notes: http://firuta.huh.harvard.edu:9000/FP-2014Apr16

Agenda

Non-Tech

  • James: TDWG session
  • SPNHC (April 25)
    • Abstract Preparation
  • InvertEBase
  • iDigBio - proposed integration visit from Greg R.

Tech

  • Report from Thursday call
  • Nodes
    • Report David: Status of deployments on FP2 and FP3.
    • Report Maureen: Ingest progress.
    • Report David: Morphbank status.
  • Driver
    • Report David: Status of integration of last working driver version.
    • Report Maureen: Status of driver - current annotation processor integration.
  • SCAN
    • Report David/Tianhong: Akka integration in FP2
    • Test of Akka workflow with SCAN data.
    • Query for harvested data and analysis results.
    • Display of annotation on interests.
  • NEVP
    • Report David: Progress on updating deployment.
  • Analysis
    • Tianhong: Progress on data cleaning.
    • Report Bob: Progress on Duplicate Finding data mining.
    • Report Chuck: Duplicate detection integration into Specify workbench/Dina-Specify
  • SemanticMediaWiki as FP Client, review of SMW use cases.
  • For Thursday:

Reports

  • Jim: No word from NSF regarding our request for a second NCE for FP, and no word from Petra (Field Museum) regarding when we should upload budget, etc., for InvertEBase proposal.
  • Chuck:
    • Data Entry Plugin can now work with Solr over HTTP. (With some caveats.)
    • Optional "confirm" button.
    • Demo integration with Yale's application.
    • Formalizing the configuration schema: with more options (some mutually incompatible), I can't just point at one example and say that's how it works. A DTD helps a little, but I think I need a real schema.
  • Maureen
    • began work on ingest for NEVP into prod Mongo
    • continued work refactoring the AnnotationProcessor to remove OAuth authentication and streamline the "map to local" process

Notes

FilteredPush Team Meeting, 2014 Apr 16. Present: David, Chuck, Paul, Maureen, Jim, Tianhong.

Non-Tech

  • SPNHC (April 25)
    • Abstract Preparation

Paul: In my court, drafts not done yet.

  • InvertEBase

Jim: Nothing new yet.

  • iDigBio - proposed integration visit from Greg R.

Paul: Dates set with Greg: April 29 and 30.

Tech

  • Report from Thursday call

Maureen: We mainly discussed which combination of Solr and Mongo to use now that we have the additional use case of Chuck's data entry tool. Decision: build Solr indexes for the data entry tool, leave everything else as is using MongoDB.

Jim: Request: add text to the FP Wiki describing use in the wild, pointing to the SCAN/NEVP instances as appropriate.

  • Nodes
    • Report David: Status of deployments on FP2 and FP3.

David: On both, Node infrastructure is deployed. SCAN Symbiota (symbiota4) is pointed at FP2; currently working to insert, view, and respond to annotations. NEVP on symbiota4 is being configured to use FP3; not done yet.

    • Report Maureen: Ingest progress.

Maureen: Snapshot of NEVP data; harvesting occurrence data. Expect to have that and the occurrence data for NEVP done this afternoon. Harvesting to files on a workstation; can move on to ingest into Mongo/Mulgara/Solr. Need a script to ingest into Solr, which needs to include removing endangered species data.

Paul: Target deployment is FP3.

    • Report David: Morphbank status.

David: Working on client integration in Morphbank; expect to have the new determination form running by end of week. Working on a client helper route to link NEVP and SCAN, allowing view of annotations in either network from Morphbank, and submission to both. Production Morphbank will need a client helper installation and domain configuration. Check in to trunk or a branch?

Paul: Coordinate the checkin and client helper/config with Deb Paul.

  • Driver
    • Report David: Status of integration of last working driver version.

David: Pending refactoring.

    • Report Maureen: Status of driver - current annotation processor integration.

Maureen: Work on the annotation processor, taking account of the Camel integration, is in progress.

  • Three elements:
    • New client helper/camel integration (done).
    • UI working with new network changes (maureen, in progress).
    • UI working with new driver (to do).
  • SCAN
    • Report David/Tianhong: Akka integration in FP2

David: Not running yet. Working on a local machine; working through some Maven issues.

Tianhong: Maven issues working with the local repository (the .m2 index?). Not looking at the local repository for the files.

Maureen: Maven offline with mvn -o.

David: Has yet to build successfully, so the transitive dependencies aren't there. FP-Core has a dependency on Fedora (fcrepo), which has a dependency on Sesame, whose repository is no longer online. Thus we need to get these either into the local repository or into Archiva, overriding the Fedora pom's repository declarations.

Chuck: Best to put them into the Archiva installation on firuta.
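The usual workaround for a dead upstream repository is to declare the team's own repository manager ahead of it in the parent pom, so Maven resolves the Sesame artifacts from Archiva instead. A minimal sketch; the repository id and URL below are placeholders, not the real firuta deployment:

```xml
<!-- Hypothetical fragment for the FP-Core parent pom: resolve artifacts
     from the local Archiva instance rather than the defunct Sesame
     repository inherited from the Fedora pom. id/URL are placeholders. -->
<repositories>
  <repository>
    <id>fp-archiva</id>
    <name>FilteredPush Archiva (placeholder URL)</name>
    <url>http://firuta.huh.harvard.edu/archiva/repository/internal/</url>
    <releases><enabled>true</enabled></releases>
    <snapshots><enabled>false</enabled></snapshots>
  </repository>
</repositories>
```

Alternatively, `mvn install:install-file` can place the missing jars directly into the local ~/.m2 repository so offline (`mvn -o`) builds succeed on one machine.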

    • Test of Akka workflow with SCAN data.

Tianhong: Two sets of data in fp1:

  • ASU1905: {institutionCode:"ASU",year:"1905"}, 5 records
  • ASU1950: {institutionCode:"ASU",year:"1950"}, 96 records

Problems:

1. GBIF ChecklistBank is down.
   Paul: As noted in emails: check with Markus; we can escalate if he doesn't respond.
2. Can't find lifespans of collectors (a handful of collectors) in the Harvard List of Botanists.
   Paul: We could build a list for SCAN - perhaps add it as a Symbiota feature.
   Maureen: There are existing lists of entomologists we could use (see below).
3. Can't construct SciName from atomic fields. Most of the records don't have atomic fields, just scientific name and scientific name authorship.
   Paul: Existing parser code: http://tools.gbif.org/nameparser/api.do https://code.google.com/p/taxon-name-processing/wiki/NameParsing https://gbif-ecat.googlecode.com/svn/name-parser/tags/name-parser-2.2/
   Tianhong: I used gbif name-parser 2.2.
4. GeoLocate query has issues.
   Tianhong: New error message; problem parsing the response. Perhaps the response from their web service changed.

Todo: a set of records that is more representative?

Maureen: Lists of taxonomists:

  • http://en.wikipedia.org/wiki/List_of_entomologists
  • Directories of Entomologists by Taxonomic Group: http://www.ent.iastate.edu/list/directory/149/vid/4
  • World Taxonomist Database: http://www.eti.uva.nl/tools/wtd.php
  • List of other Taxonomist Databases: http://www.eti.uva.nl/tools/links/taxonomist.php
  • More lists of taxonomists: https://www.cbd.int/gti/expertise.shtml
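Problem 3 above amounts to a fallback rule: use the atomic fields when present, otherwise fall back to the verbatim scientific name plus authorship. A minimal Python sketch, using Darwin Core field names; the function is illustrative, not part of the workflow code:

```python
def compose_sci_name(record):
    """Assemble a scientific name from Darwin Core atomic fields,
    falling back to the verbatim scientificName (+ authorship) when
    the atomic fields are absent, as in most of the ASU records."""
    atoms = [record.get("genus"), record.get("specificEpithet"),
             record.get("infraspecificEpithet")]
    atoms = [a for a in atoms if a]  # drop missing parts
    if len(atoms) >= 2:              # need at least genus + epithet
        return " ".join(atoms)
    name = record.get("scientificName", "")
    author = record.get("scientificNameAuthorship", "")
    # append authorship only when it isn't already part of the name
    if author and author not in name:
        return (name + " " + author).strip()
    return name

# A record with only verbatim fields, like most of the ASU records:
print(compose_sci_name({"scientificName": "Apis mellifera",
                        "scientificNameAuthorship": "Linnaeus, 1758"}))
# Apis mellifera Linnaeus, 1758
```

A real name parser (such as the gbif name-parser 2.2 Tianhong used) handles hybrid markers, ranks, and embedded authorship far better than this sketch.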

    • Query for harvested data and analysis results.

David: Rewriting the query so that Tianhong doesn't need to rewrite the actors. Which ID will we be using?

Maureen: "oaiid" is going into mongo.

Paul: So long as we have an ID we agree on, present in the harvested-data MongoDB, loaded into the workflow, and passed through into the workflow results, then we can write that ID into the query to get the data plus results. "oaiid" works for this.

Tianhong: New results to check on FP1 are ASU1905 and ASU1950. Do we need results other than the summary?

David: Three collections: validated records with summary, provenance trace, and exceptions. Summary is what we need to display the desired result.

Tianhong: The provenance trace isn't present in the Akka workflow results.
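The scheme Paul describes is effectively a join on the shared ID between the harvested-data collection and the workflow-results collection. A pure-Python illustration, with lists of dicts standing in for the Mongo collections; field names other than "oaiid" are hypothetical:

```python
def join_on_oaiid(harvested, summaries):
    """Pair each harvested record with its workflow summary via the
    shared "oaiid" key, mirroring the query over the two collections.
    harvested/summaries stand in for the MongoDB collections."""
    by_id = {s["oaiid"]: s for s in summaries}
    return [
        {"record": rec, "summary": by_id.get(rec["oaiid"])}
        for rec in harvested
    ]

harvested = [{"oaiid": "urn:asu:1", "institutionCode": "ASU", "year": "1905"}]
summaries = [{"oaiid": "urn:asu:1", "validation": "curated"}]
joined = join_on_oaiid(harvested, summaries)
print(joined[0]["summary"]["validation"])  # prints "curated"
```

The key point is the one Paul makes: any ID works, provided it survives the whole path from harvest through the workflow into the results.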

    • Display of annotation on interests.

David: Working, but some issues with query - seeing too many results and duplicates. Need to work on query. Also need to test against new harvested taxon records. Also need to deploy these changes for NEVP.

  • NEVP
    • Report David: Progress on updating deployment.

Paul: See above notes.

David: Now keeping the SCAN and NEVP deployments synced.

(Have read up to here. — Nico)

  • Analysis
    • Tianhong: Progress on data cleaning.

Tianhong: Working on learning how to use Solr.

Chuck: Can you use the same schema that I sent around in the email? The example has a Solr demonstration; can you work with that? Best to use a single schema and add indexes as needed.

Tianhong: Will review. Only need a very few fields.

    • Report Chuck: Duplicate detection integration into Specify workbench/Dina-Specify

Chuck: Defined schema and index types for the herbarium duplicate search. With embedded Solr, multiple values for a particular field go in as lists in the JSON sent to Solr; one-field-to-one-field matching is working, but indexing tuples is more complex. Thus latitude and longitude aren't linked - linking them would take more Solr configuration, and it's unclear how much work. Some value in this for the lat/long case and for geographic/taxonomic hierarchies.

Paul: Path: backend, then test integration with Patrick's Yale application, then look at integration with Specify workbench/Specify web (working with fields in the demo).
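The lat/long linking problem Chuck describes is the classic multivalued-field issue: two parallel multivalued fields lose track of which latitude goes with which longitude. A small Python illustration of the failure and of the common workaround of indexing each pair as one combined value; the field names are hypothetical, not Chuck's actual schema:

```python
# Two separate multivalued fields: the pairing between entries is lost,
# so a search can match a latitude from one locality against a
# longitude from another.
doc_separate = {"decimalLatitude": [42.4, 33.3],
                "decimalLongitude": [-71.1, -111.9]}

def matches_separate(doc, lat, lon):
    return lat in doc["decimalLatitude"] and lon in doc["decimalLongitude"]

# One combined field: each entry keeps its lat/lon pair together.
doc_combined = {"latlon": ["42.4,-71.1", "33.3,-111.9"]}

def matches_combined(doc, lat, lon):
    return f"{lat},{lon}" in doc["latlon"]

# Cross-pairing: separate fields "match" a locality that never existed...
print(matches_separate(doc_separate, 42.4, -111.9))  # True (false positive)
# ...while the combined field correctly rejects it.
print(matches_combined(doc_combined, 42.4, -111.9))  # False
```

The same pairing trick generalizes to the geographic/taxonomic hierarchies mentioned above (e.g. indexing a full path as one value), at the cost of extra schema configuration.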

  • SemanticMediaWiki as FP Client, review of SMW use cases.
  • For Thursday (10AM Pacific, 1PM Eastern):
    • review Chuck's solr schema
    • touch base with David and Tianhong on Maven build issues
    • look at Tianhong's two datasets, ASU 1905 and ASU 1950