2014Apr09

From FilteredPush


Etherpad for meeting notes: http://firuta.huh.harvard.edu:9000/FP-2014Apr09

Agenda

Non-Tech

  • James: TDWG session
  • SPNHC (April 25)
    • Abstract Preparation (Patrick good with NEVP)
  • InvertEBase
  • iDigBio - proposed integration visit from Greg R.

Tech

  • Report from Friday call
  • Nodes
    • Report David: Status of deployments on FP2 and FP3.
    • Report Maureen: Ingest progress.
    • Report David: Morphbank status.
  • Driver
    • Report David: Status of integration of last working driver version.
    • Report Maureen: Status of driver - current annotation processor integration.
  • SCAN
    • Report David/Tianhong: Akka integration in FP2
    • Test of Akka workflow with SCAN data.
    • Query for harvested data and analysis results.
    • Display of annotation on interests.
  • NEVP
    • Report David: Progress on updating deployment.
  • Analysis
    • Tianhong: Progress on cleaning data with data.
    • Report Bob: Progress on Duplicate Finding data mining.
    • Report Chuck: Duplicate detection integration into Specify workbench/Dina-Specify
  • SemanticMediaWiki as FP Client
    • Bob: SMW use cases.
  • For Friday

Reports

  • Paul
    • Backported fixes from the NEVP Specify-HUH ingest into the NEVP ingest for Symbiota to address an issue that Patrick saw. It turned out to be an additional issue (not easily fixed in software) of correctly splitting out the first collector for Symbiota when the collector list is comma separated; this is now addressed in the data entry protocol.
    • Modified NEVP Specify-HUH ingest to be able to use a list of corrections to botanist names (provided by Walter) prior to ingest. Testing this now.
  • Chuck
    • Experiment with live API data sources for plugin
    • Resizable iframe
    • Fix CollectionSpace bug by running $.change() on every field, if $.change is defined
    • Fix 404 for config.xml
    • Lots of demos: features / integrations (CollectionSpace, ArchivesSpace, Omeka)
    • Publicity: code4lib, mcn-l, lists for the different applications. No particular response: I probably should have made the posts more effusive.
    • Contacts: a few emails back and forth with Berkeley folks, but nothing more in my court at the moment; feedback from Patrick Sweeney at Yale this morning, which I should respond to.
  • Jim
    • Latest word from Petra (Field Museum) is that it may be another several weeks before NSF asks us to upload revised budget and related materials for the InvertEBase proposal.
    • No further word from NSF regarding our request for a second NCE for the current FP grant.
  • Maureen
    • Imported SCAN prod occurrence data from symbiota4 into MongoDB on fp1. Working on getting taxon data into Mulgara on fp1.

Notes

  • James: TDWG session

No updates yet.

  • SPNHC (April 25)
    • Abstract Preparation (Patrick good with NEVP)

Paul: Ran the idea of dataflow in NEVP past Patrick; he likes the idea.

James: Leaving Saturday, gone for three weeks; reachable by email, preferably Gmail.

Paul: Will start drafting up abstracts.

  • InvertEBase

Jim: Message from Petra, still on hold with NSF, not ready to finalize ADBC funding yet, might still be weeks.

  • iDigBio - proposed integration visit from Greg R.

Paul: We haven't heard back from Greg yet.

Bob: I'll ping him. -- Done.

  • Report from Friday call

Maureen: Reviewed David's UI work in Symbiota providing views into the network. Changed date/time of the meeting to Thursday at 1 PM Eastern, 10 AM Pacific. Reviewed work not yet done for SCAN and prioritized it.

  • Nodes
    • Report David: Status of deployments on FP2 and FP3.
    • Report Maureen: Ingest progress.
    • Report David: Morphbank status.

David: Working on installing Morphbank on development machine. Have current dump, working through configuration.

  • Driver
    • Report David: Status of integration of last working driver version.
    • Report Maureen: Status of driver - current annotation processor integration.
  • SCAN
    • Report David/Tianhong: Akka integration in FP2

David: Have been working through dependency issues that cropped up last week, and on getting FP Core running on Tianhong's side. One of the transitive dependencies has a repository that is unavailable (Sesame; can be solved by importing the JAR and POM into a local repository).

Maureen: Encountered same issue.

Tianhong: Code is there, just need the package build.

    • Test of Akka workflow with SCAN data.

Maureen: Did a harvest of SCAN data into FP1; the JSON files are available, and the data has been loaded into Mongo on FP1. Note: the OAI ID is the only field with an index in Mongo. For querying, indexes on the targeted fields (e.g. collection code) will be needed and will substantially speed up queries.

Paul: For the test, we can use "ASU" as the collection code.
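
A minimal sketch of the indexing point above, under stated assumptions: the harvested records carry a field named `collectionCode` (the notes only say "collection code", so the field name is a guess), and `collection` is any object exposing pymongo-style `create_index` and `find` methods. No server, database, or collection name is assumed.

```python
ASCENDING = 1  # same value as pymongo.ASCENDING


def ensure_field_indexes(collection, fields):
    """Create one ascending single-field index per field name.

    Returns whatever create_index reports for each field
    (pymongo returns the index name).
    """
    return [collection.create_index([(field, ASCENDING)]) for field in fields]


def find_by_collection_code(collection, code, field="collectionCode"):
    """Query records by collection code; `field` is an assumed field name."""
    return collection.find({field: code})
```

With pymongo this might be called as `ensure_field_indexes(occurrences, ["collectionCode"])` followed by `find_by_collection_code(occurrences, "ASU")`, where `occurrences` is a hypothetical handle to the harvested SCAN collection.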

    • Query for harvested data and analysis results.
    • Display of annotation on interests.
  • NEVP
    • Report David: Progress on updating deployment.
  • Analysis
    • Tianhong: Progress on cleaning data with data.

Tianhong: For date validation, thinking of three steps: (1) check for internal inconsistency (implemented); (2) for each record, query Mongo for similar data (implemented, but with performance issues; caching may help); (3) use data mining to produce clusters. Not sure how the third would work with the current actor and workflow.

Maureen: For indexing see: http://docs.mongodb.org/manual/reference/method/db.collection.ensureIndex/

Paul: Maureen to work with Tianhong on working out appropriate indexes.

Bob: Vectorize dates by taking the Unix timestamp in milliseconds and normalizing by the number of seconds in 400 years (2 lines of code with the Joda-Time package).
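
The normalization Bob describes could look roughly like the following. This is a sketch with the Python standard library rather than Joda-Time, and it assumes the timestamp is taken in seconds (not milliseconds) so that the units match the divisor; the function name is hypothetical.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
# A full 400-year Gregorian cycle is exactly 146,097 days.
SECONDS_IN_400_YEARS = 146097 * 86400


def vectorize_date(year, month, day):
    """Map a calendar date to a float: seconds since the Unix epoch,
    divided by the number of seconds in 400 years."""
    moment = datetime(year, month, day, tzinfo=timezone.utc)
    return (moment - EPOCH).total_seconds() / SECONDS_IN_400_YEARS
```

Dates before 1970, which are common for collecting events, come out negative; under this scaling one day corresponds to a step of about 6.8e-6.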

Bob: The approach of adding the new record and reclustering is probably feasible with small datasets (tens of thousands of records). Discussion of approaches to (3): possibly cluster a body of existing data on collector name/collector number/date collected, export the min/max of each of these vectors for each cluster into a db, then take a new record, vectorize it, query the db for all nearby clusters, return those records, and do (2) on the subset of data returned.
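
One way to picture the min/max lookup discussed for (3), as a sketch only: each cluster is summarized by per-dimension minimum and maximum values, and a new record's vector is matched against every box widened by a tolerance. The in-memory dict stands in for the db mentioned above, and the function names, the `tolerance` parameter, and the way collector name or number would be turned into numbers are all assumptions not settled in the discussion.

```python
def bounding_box(vectors):
    """Return (mins, maxs), one value per dimension, for a cluster's vectors."""
    columns = list(zip(*vectors))
    return [min(c) for c in columns], [max(c) for c in columns]


def nearby_clusters(boxes, vector, tolerance=0.0):
    """Return ids of clusters whose box, widened by `tolerance` on every
    side, contains `vector`."""
    hits = []
    for cluster_id, (mins, maxs) in boxes.items():
        if all(lo - tolerance <= x <= hi + tolerance
               for lo, x, hi in zip(mins, vector, maxs)):
            hits.append(cluster_id)
    return hits
```

The records belonging to the returned clusters would then be the subset on which step (2) is run.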

Tianhong: What about the initial, say 1 million records, what can we say about the validity of those records?

    • Report Bob: Progress on Duplicate Finding data mining.

Bob: Can write out what the clusters are now. Mahout makes it possible to tune the small number of parameters that the clustering depends on; haven't done this yet. Eyeballing the data, seeing lots of false positives, so tuning is needed (or the metrics used aren't appropriate, and appropriate ones aren't easily chosen).

    • Report Chuck: Duplicate detection integration into Specify workbench/Dina-Specify

Chuck: Integration with the Dina-Specify (Specify web interface) is working; waiting to hear from Jim Beach whether we can include the demo link in the documentation.

Chuck: No progress on workbench yet - have to get the Specify build working.

    • Report Chuck: Other integrations

Chuck: Good progress and feedback on the CollectionSpace integration and on integration with the data entry application at Yale (working with Patrick). Looking at refactoring the Solr index portion to be able to query other (existing) database sources without needing to build a Solr index.

  • SemanticMediaWiki as FP Client
    • Bob: SMW use cases.

Bob: Described some use case scenarios on the wiki, and sent out an email. Please examine them and add more.

  • For Thursday
    • QC on SCAN Data.
    • Look at modifications to Mongo ingest for NEVP to build searchable dataset for duplicate detection.
    • Look at use cases for Mongo. Should we use Solr instead?

Maureen: Flag for sensitive data in mongo: { accessRights:"sensitive" }
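
A small sketch of how the flag might be used when querying, assuming records are marked exactly as shown above and that unflagged records simply lack the field or carry another value. The helper name is hypothetical; it only builds a Mongo filter document and does not talk to a server.

```python
def public_only(query=None):
    """Return a copy of a Mongo filter that also excludes records
    flagged with accessRights == "sensitive"."""
    combined = dict(query or {})
    combined["accessRights"] = {"$ne": "sensitive"}
    return combined
```

In MongoDB, `$ne` also matches documents that do not contain the field at all, so records without an `accessRights` value would still be returned by the combined filter.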