2014Mar05

From FilteredPush


Etherpad for meeting notes: http://firuta.huh.harvard.edu:9000/FP-2014Mar05

Agenda

Non-Tech

  • Davis: NCE
  • James: TDWG session
  • James: Progress on Finishing the SemanticMediaWiki as FP client deliverable.
  • Jim: 1) No further word regarding second NCE request submitted to NSF last month. 2) Encouraging signs that InvertEBase proposal may be funded beginning July 1st. NSF is requesting cuts to budget; FP requested ca. $75K.

Tech

  • Report from Friday call
  • Analysis
    • Report Bob: Progress on Duplicate Finding data mining.
    • Report Paul: Duplicate detection UI.
  • Nodes
    • Discussion: Current and target states of nodes.
    • Report Maureen: Ingest progress.
    • Report David: Morphbank status.
  • NEVP
    • Report Paul: CNH/NEVP ingests live on Symbiota, e.g.
    • Discussion: List of annotation types/sources that we need to be able to ingest.
  • Driver
    • Report David: Status of integration of last working driver version.
    • Report Maureen: Status of new driver approach.

Reports

  • Paul
    • More testing and tweaking on NEVP new occurrence -> Specify6 ingest, reducing number of duplicate agents created on ingest.
    • FP-DataEntry duplicate finding tool is now live in production in the HUH rapid data entry application, populated with a set of exsiccati records.
  • Chuck
    • Unglamorous but really useful changes to the way you set up FP-DataEntry. The work was done in several passes, but the end state is that the whole* configuration is now specified in an XML file. The Java class that mirrors that XML file implements four interfaces, one for each of the modules, so usually you'll run all four with the same config file; but if you wanted to have two different instances of the front-end, for example, you could just pass along a subset of the configuration. Not sure if that description makes sense.
    • *The Solr configuration files are still separate: I want to dump them into the same big XML file, and I want to change the way the structures are represented in XML, so that things that depend on each other will be more obvious, instead of causing errors at run time.
    • I disabled some of the tests that were particularly about testing the configuration, so I need to get them up and running again, but I think it's in a pretty good place.
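
A minimal sketch of the configuration layout Chuck describes, written in Python as a stand-in for the Java class: one object mirrors the whole XML file and satisfies a separate interface for each module. The four module names, the XML element names, and every setting shown here are hypothetical; the notes do not list them.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Protocol


# Hypothetical per-module views of the configuration (assumed names).
class HarvesterConfig(Protocol):
    source_url: str


class IndexerConfig(Protocol):
    index_dir: str


class BackEndConfig(Protocol):
    back_end_port: int


class FrontEndConfig(Protocol):
    front_end_title: str


@dataclass
class FullConfig:
    """Mirrors the whole XML file, so it satisfies all four views above."""
    source_url: str
    index_dir: str
    back_end_port: int
    front_end_title: str


@dataclass
class FrontEndOnly:
    """A subset of the configuration, enough for a second front-end instance."""
    front_end_title: str


def _required(root: ET.Element, path: str) -> str:
    # Fail while loading rather than later at run time.
    text = root.findtext(path)
    if text is None or not text.strip():
        raise ValueError(f"missing configuration value: {path}")
    return text.strip()


def load_config(xml_text: str) -> FullConfig:
    root = ET.fromstring(xml_text.strip())
    return FullConfig(
        source_url=_required(root, "harvester/sourceUrl"),
        index_dir=_required(root, "indexer/indexDir"),
        back_end_port=int(_required(root, "backEnd/port")),
        front_end_title=_required(root, "frontEnd/title"),
    )


def describe_front_end(cfg: FrontEndConfig) -> str:
    # The front end only reads the settings declared in its own view.
    return f"front end titled {cfg.front_end_title!r}"


SAMPLE_XML = """
<fpDataEntry>
  <harvester><sourceUrl>http://example.org/oai</sourceUrl></harvester>
  <indexer><indexDir>/tmp/index</indexDir></indexer>
  <backEnd><port>8080</port></backEnd>
  <frontEnd><title>Rapid data entry</title></frontEnd>
</fpDataEntry>
"""
```

Passing `load_config(SAMPLE_XML)` to every module corresponds to the usual case of one shared config file; passing a `FrontEndOnly` to `describe_front_end` corresponds to running an extra front-end instance from a subset of the configuration.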

Notes

FilteredPush Team Meeting 2014 Mar 5

Present: Jim, Bob, James, Tianhong, David, Maureen, Paul

Non-Tech

  • Davis: NCE

Jim: On this end, everything has gone through for the Davis NCE for this year.

Paul: We'll confirm with Bertram next week.

  • James: TDWG session

James: No updates.

James: Call for SPNHC abstracts should be soon.

  • James: Progress on Finishing the SemanticMediaWiki as FP client deliverable.

James: The message to Bob, Joel, Paul, and James hasn't been set in motion yet. Have talked with Joel, and this should work for him. Need to nail down a real time and get started on the paperwork. Will talk with Joel after this meeting and try to get details nailed down.

Paul: Then when we know costs, I'll need to talk with Kristin.

  • Jim: 1) No further word regarding second NCE request submitted to NSF last month. 2) Encouraging signs that InvertEBase proposal may be funded beginning July 1st. NSF is requesting cuts to budget; FP requested ca. $75K. Start date (if approved) would be July 1. Modest FP role, similar to SCAN TCN.

Tech

  • Report from Friday call

Maureen: We discussed how to keep locality data private for occurrence records with determinations of names of endangered taxa. Our solution will be to simply require that users be authenticated to the network before they can see any data. The only way to see data aside from that is through the Data Entry Tool, which uses an index built on harvested data that may contain sensitive records. The solution to that case is to require that the OAI provider software at each data source include a Dublin Core <accessRights>sensitive</accessRights> element. We will address how to implement privacy in the harvest-index-deploy process when we work out specifications for the Data Entry Tool in general.
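
One way the `accessRights` flag could be used at indexing time is sketched below, assuming the harvester hands over one XML element per record and that flagged records are simply left out of the Data Entry Tool index. The `records`/`record` wrapper, the `id` attribute, and the decision to drop rather than mask flagged records are assumptions for illustration; the notes leave the actual implementation to the later specification work.

```python
import xml.etree.ElementTree as ET
from typing import List


def local_name(tag: str) -> str:
    # ElementTree writes namespaced tags as "{uri}name"; keep only "name".
    return tag.rsplit("}", 1)[-1]


def is_sensitive(record: ET.Element) -> bool:
    """True if the record carries an accessRights element with value 'sensitive'."""
    for element in record.iter():
        if local_name(element.tag) == "accessRights":
            if (element.text or "").strip().lower() == "sensitive":
                return True
    return False


def indexable_records(harvest_xml: str) -> List[ET.Element]:
    """Return only the harvested records that are not flagged as sensitive."""
    root = ET.fromstring(harvest_xml.strip())
    return [record for record in root if not is_sensitive(record)]


# Hypothetical harvested batch: record "a" is flagged, record "b" is not.
SAMPLE_HARVEST = """
<records xmlns:dcterms="http://purl.org/dc/terms/">
  <record id="a">
    <dcterms:accessRights>sensitive</dcterms:accessRights>
  </record>
  <record id="b">
    <dcterms:accessRights>public</dcterms:accessRights>
  </record>
</records>
"""

kept_ids = [record.get("id") for record in indexable_records(SAMPLE_HARVEST)]
```

Matching on the local element name keeps the check independent of which namespace prefix a given OAI provider uses.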

  • Analysis
    • Report Bob: Progress on Duplicate Finding data mining.

Bob: Still soliciting Hadoop clusters while running naive approach on single core. Mahout Canopy Clustering runs fast enough that it is worth writing human-readable output to see whether it gives reasonable clusters; if not, can try tuning parameters.
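
For readers unfamiliar with canopy clustering, the following is a toy sketch of the general technique with a human-readable dump of the result. It is not Mahout's implementation and does not use the project's data: the 2-D points, the Euclidean distance, and the threshold values are assumptions for illustration. The parameters to tune are the loose threshold `t1` and the tight threshold `t2`.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]
Canopy = Tuple[Point, List[Point]]


def canopy_clusters(points: Sequence[Point], t1: float, t2: float) -> List[Canopy]:
    """Group points into possibly overlapping canopies (requires t1 > t2)."""
    if t1 <= t2:
        raise ValueError("t1 (loose) must be greater than t2 (tight)")
    remaining = list(points)
    canopies: List[Canopy] = []
    while remaining:
        center = remaining.pop(0)
        members = [center]
        still_candidates = []
        for point in remaining:
            distance = math.dist(center, point)
            if distance < t1:
                members.append(point)           # within the loose threshold
            if distance >= t2:
                still_candidates.append(point)  # may still seed or join other canopies
        remaining = still_candidates
        canopies.append((center, members))
    return canopies


def describe(canopies: Sequence[Canopy]) -> List[str]:
    """One readable line per canopy, for eyeballing whether clusters look sensible."""
    return [
        f"canopy {i}: center={center} size={len(members)} members={members}"
        for i, (center, members) in enumerate(canopies)
    ]


SAMPLE_POINTS = [(0.0, 0.0), (0.5, 0.0), (1.5, 0.0), (10.0, 10.0), (10.2, 10.0)]
SAMPLE_CANOPIES = canopy_clusters(SAMPLE_POINTS, t1=2.0, t2=1.0)

for line in describe(SAMPLE_CANOPIES):
    print(line)
```

Points closer than `t2` to a center are consumed by that canopy, while points between `t2` and `t1` join it but remain available to seed or join later canopies, which is why canopies can overlap.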

    • Report Paul: Duplicate detection UI.

Paul: In production use in HUH with the Specify-HUH rapid data entry application, using a set of exsiccati records.

Maureen: Haven't had a chance to work further on integration into the Dina-Specify web application.

  • Nodes
    • Discussion: Current and target states of nodes.

Paul: Walked through overview on whiteboard.

    • Report Maureen: Ingest progress.

SCAN data loaded into Mulgara on fp1. Two named graphs, one for ITIS taxonomy and one for the default taxonomy.

Paul: Anything for NEVP yet?

Maureen: Not yet. Where to?

Paul: FP3.

    • Report David: Morphbank status.

David: Still waiting on a current dump from the developers.

David: We need some overlapping SCAN records.

Paul: We can have Nico put some images from SCAN in Morphbank.

  • NEVP
    • Report Paul: CNH/NEVP ingests live in Specify-HUH.

Paul: First two batches are live in Specify-HUH; turning on the rest is imminent.

    • Discussion: List of annotation types/sources that we need to be able to ingest.

SCAN

  1. New Determination (both editing and transcribing motivations)
  2. Update Georeference
  3. Solve With More Data
  4. Other annotations coming off Akka workflow: Update Event?

Small set to ingest into MCZbase, could have special purpose tool.

NEVP

  1. New Occurrence
  2. New Determination
  3. Update Georeference
  4. Solve With More Data
  5. Other annotations coming off Akka workflow: Update Event?
  6. Update Occurrence (harvest from OmOccurEdits)
  7. Transcribe Habitat
  8. Assert Phenological State (Flowering, Fruiting, etc.).

Lots more to ingest into Specify and Specify-HUH in NEVP

  • Driver
    • Report David: Status of integration of last working driver version.

Maureen: David can use what is in trunk right now.

David: Updating the annotation processor to use new FP core, removed mapper bean. Testing now, not into the driver yet.

Bob: How does yesterday's conversation about embedding hints in the RDF of the classes used for mapping affect this?

David: This is the consuming side of those hints. Without them, how does the consumer know how to map?

    • Report Maureen: Status of new driver approach.

Maureen: On hold while rolling back to an old version. Since we had split off and used a branch at one point, and I had been doing my work in a different project, we were essentially mostly rolled back already. I patched up the "current" driver and we are now using that.

David and I have been reworking the annotation processor, removing the Specify-specific dependencies. David removed some duplicated code for representing annotations. I am working on removing code that duplicates the driver's own mapping from flat DwC to Specify.

Bob: Bob and Paul are going to submit a position paper to a pending W3C workshop on OA about data annotation.

James: Have been talking with Greg and Deb about Morphbank. Current stable PHP version is what is going forward for deployments. Greg very interested in getting the annotation integration working.

    • For Friday:

Break down whiteboard diagram into an ordered list of tasks.