2013Oct16



Etherpad for meeting notes: http://firuta.huh.harvard.edu:9000/FP-2013Oct16

Agenda

Non-Tech

  • InvertENet TCN Proposal
  • TDWG

Tech

  • Summary of Friday Tech Call.
  • Kepler
    • Report: Progress on Date Validation.
    • On table with Kansas: Using Species Analyst services for occurrence QC. We have an occurrence with a taxon name and a georeference: what would we like in a service to check if this occurrence is an outlier for a species distribution model for this species?
  • Analysis
    • Discussion: Approaches to repeated QC requests for same records. How are we going to do this?
    • Report: Progress on implementation of OAI/PMH harvesting through firewalls.
    • Report: State of investigation of duplicates in GBIF cache.
    • Discussion: Duplicate Finding Find_Duplicates. Identify targets for native data entry applications into which we will embed duplicate finding functionality.
  • SCAN TCN Support
    • Report: Sanity Checking, further assessment of lower number of new annotations as opposed to new records in omoccurdeterminations.
    • Report: MCZbase Driver
  • NEVP TCN Support
    • Report: Preparations to update Annotation Processor Deployment at UNH.
    • Report: Specify-HUH driver
  • FP Infrastructure
    • Report: FP Node Refactoring

Reports

  • Paul:
    • Revised InvertENet FP budget, budget justification, and supporting docs went to Kristin, were updated to current numbers, and have since gone to Petra, all staged and ready to go when the submission process reopens.
  • Chuck:
    • GbifExperiments: GBIF cache is installed. Making subsamples and indexes of the data. Right now, trying to see what the precision is like if we were to identify duplicates on first collector name + collector id. (It would also be nice to know recall, but I'm not sure how to do that without actually having found all the duplicates. Chicken and egg.)
  • Maureen:
    • More UI extraction for the driver. Also built an occurrence index and a botanist index for Solr/Lucene based on HUH data.
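
Sketch for Chuck's precision question above: one way to estimate precision for a "first collector name + collector id" key is to group a hand-labeled sample by that key and count how many same-key pairs are true duplicates. The field names (`collector`, `collector_id`, `true_set`), the `;` separator between collectors, and the sample records are all assumptions for illustration, not the GBIF cache schema.

```python
from collections import defaultdict
from itertools import combinations

def blocking_key(record):
    """Key on first collector name + collector id (hypothetical field names)."""
    first_collector = record["collector"].split(";")[0].strip().lower()
    return (first_collector, record["collector_id"].strip())

def pair_precision(records):
    """Fraction of same-key record pairs that are labeled as true duplicates."""
    groups = defaultdict(list)
    for rec in records:
        groups[blocking_key(rec)].append(rec)
    candidate = correct = 0
    for group in groups.values():
        for a, b in combinations(group, 2):
            candidate += 1
            # true_set is a hand-assigned duplicate-set label (assumed to exist).
            if a["true_set"] == b["true_set"]:
                correct += 1
    return correct / candidate if candidate else None

# Made-up sample records, for illustration only.
sample = [
    {"collector": "A. Gray; J. Torrey", "collector_id": "123", "true_set": 1},
    {"collector": "a. gray", "collector_id": "123", "true_set": 1},
    {"collector": "A. Gray", "collector_id": "123", "true_set": 2},
    {"collector": "C. Wright", "collector_id": "77", "true_set": 3},
]
precision = pair_precision(sample)
```

Recall is not computed here: it would need the full set of true duplicate pairs, which is the chicken-and-egg problem noted in the report.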

Notes

FilteredPush Team Meeting 2013 Oct 16

Present: Bertram, James, Tianhong, Jim, David, Chuck, Paul, Maureen.

Agenda

Non-Tech

  • InvertENet TCN Proposal

Paul: Budget and justification and supporting paperwork ready and circulated to Petra.

  • TDWG

Bertram: Putting together a PowerPoint talk. Will have it together several days ahead of time. Perhaps combine for DataONE. Provenance and workflows. Will add voiceover. Will share through Dropbox (and a copy of an existing broad talk on provenance/workflows).

James: Need a quick walkthrough about Kurator talk: FP overview, analysis, give some examples of problems in data, talk about classes/categories of problems. Two choices: a video of screen captures, or a series of screen captures expanding on the SPNHC demo. Then talk a little bit about the technology.

Bertram: Share also on Dropbox.

Paul: Will need to put together the FilteredPush semantic talk.

Paul: Annotations interest group will need a brief summary of OA.

Bob: Probably just need to go over the slides we gave in the rollout. Will circulate.

Paul: Good to have someone from Berlin present briefly in the annotation interest group on what they are doing with OA.

[BL: what day is that?] = Tuesday 11-12:30PM

BL: thanks (will miss it :-( )

Bob: Will invite Walter and Lutz.

Bob: We should all be conversant with BCO (Biodiversity Collections Ontology): http://www.obofoundry.org/cgi-bin/detail.cgi?id=BCO

Bertram: IDCC deadline extended: http://www.dcc.ac.uk/events/idcc14 Deadline: Oct 28th. Meeting on 24-27 February 2014. Three categories might be interesting:

  1) full research paper (probably too tight)
  2) practice paper (1000 word abstract first; then full paper later)
  3) workshop proposal: maybe organize a biodiversity data curation workshop there?

Good opportunity/excuse to go to San Francisco/California :-) I suggest we go for (2), and maybe (3) if there's interest.

Bob: 4) maybe a data annotation submission?

Paul: Brief discussion for Friday.

Tech

  • Summary of Friday Tech Call.

Maureen: Kansas has a separate jar for SGR, just querying a Lucene index on a copy of the GBIF cache. Chuck has a copy of the GBIF cache ready for query.

David: Camel will allow us to replace Triage/JobPlanner/JobRunner with configurable routes for messages within the system. Camel can provide a message bus that is slightly lighter weight than an enterprise service bus.

  • Kepler
    • Report: Progress on Date Validation.

Tianhong: Have actor and service for date internal consistency within record (problems 1, 7, 8 from James' list). Use ISO 8601 as intermediate format? i.e. YYYY-MM-DD

Bertram: Also trying to set up a high level modeling language for describing curation workflows (in a research direction).

Bertram: Overview of curation workflow in a new conceptual model/language on agenda for Friday.

Tianhong: Will send out some further questions on email.
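
A minimal sketch of the kind of internal-consistency check discussed here, using ISO 8601 (YYYY-MM-DD) as the intermediate format. This is not the Kepler actor itself. The field names (`eventDate`, `year`, `month`, `day`) are assumed for illustration, and the sketch does not claim to cover problems 1, 7, 8 from James' list.

```python
from datetime import date

def to_iso(year, month, day):
    """Return YYYY-MM-DD if the parts form a real calendar date, else None."""
    try:
        return date(int(year), int(month), int(day)).isoformat()
    except (TypeError, ValueError):
        # Missing parts, non-numeric parts, or impossible dates (e.g. Feb 30).
        return None

def date_is_consistent(record):
    """True if the atomic year/month/day fields agree with the ISO eventDate."""
    from_parts = to_iso(record.get("year"), record.get("month"), record.get("day"))
    return from_parts is not None and from_parts == record.get("eventDate")
```

For example, a record with eventDate "2013-10-16" and year/month/day of 2013/10/16 passes, while the same record with day 17 is flagged as inconsistent.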

    • On table with Kansas (Jim Beach): Using Species Analyst services for occurrence QC. We have an occurrence with a taxon name and a georeference: what would we like in a service to check if this occurrence is an outlier for a species distribution model for this species?

Paul: We provide a taxon and georeference (from an actor in a workflow), they provide a service that asserts whether this fits inside a predicted species distribution or not.

Bob: Does it matter if they have environmental layers pertaining to the date collected?

James: Probably, yes. Question worth asking.

Paul: Something to pursue with Kansas?

James: Yes, if not too much work.

  • Analysis
    • Discussion: Approaches to repeated QC requests for same records. How are we going to do this?

David: Sven suggested that we keep count of how many people express an interest that covers some record, and/or a count of how many times a query has requested some annotation.

Maureen: Problem to be solved? Only run a QC analysis on a record once within a certain time window.

Chuck: Prioritizing records to be fixed. Older records will be weighted more heavily than newer records, as more queries will have been launched against them; may need to compensate for that.

Bob: Formulate a SPARQL query.

James: Prioritization should come from the person who asked the question. Perhaps allow users to flag records as having priority. Also have workflows that need human intervention; might have priority there.

Paul: We need to start from the UI researchers use to obtain data.

David: Query and Analysis jobs are similar. Could create triples and associate them with Fedora PIDs.

Paul: Something to start on as soon as David has the message driven architecture in place.
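
One possible shape for the two ideas above (count requests per record; run QC on a record at most once per time window), as a sketch only. The class name, the 30-day default window, and the in-memory dictionaries are assumptions; the discussion suggests the real counts might instead live as triples associated with Fedora PIDs.

```python
from datetime import datetime, timedelta

class QCRequestTracker:
    """Counts QC requests per record and limits reruns to one per time window."""

    def __init__(self, window=timedelta(days=30)):
        self.window = window        # assumed default; not decided in the meeting
        self.request_counts = {}    # record id -> number of requests seen
        self.last_run = {}          # record id -> time QC last actually ran

    def request(self, record_id, now):
        """Register a request; return True if QC should actually run now."""
        self.request_counts[record_id] = self.request_counts.get(record_id, 0) + 1
        last = self.last_run.get(record_id)
        if last is not None and now - last < self.window:
            return False            # still inside the window, skip the rerun
        self.last_run[record_id] = now
        return True

    def by_priority(self):
        """Record ids ordered by request count, most requested first."""
        return sorted(self.request_counts, key=self.request_counts.get, reverse=True)
```

The raw count used in `by_priority` has the bias Chuck describes: older records accumulate more requests. This sketch does not compensate for that, and it does not model user-assigned priority flags.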

    • Report: Progress on implementation of OAI/PMH harvesting through firewalls.

Maureen: Need an FP message to provide harvests for the network.

David: Will make a job and message type for this.

Maureen: Harvester/processing code lives in the network. It can call up a provider directly, or it can be given a harvest from a provider that got wrapped up in an FP message and handed to an access point.
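
A rough sketch of the two paths Maureen describes. `list_records_url` builds a standard OAI-PMH ListRecords request for the direct-call path. `wrap_harvest` is purely hypothetical: the FP message and job types for harvests had not been defined at this meeting, so the dictionary keys are placeholders, not the FP message format.

```python
from urllib.parse import urlencode

def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL for calling a provider directly."""
    if resumption_token:
        # In OAI-PMH a resumption token is sent alone with the verb.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"

def wrap_harvest(provider_id, oai_xml):
    """Package a harvest made behind a firewall for hand-off to an access point.

    Hypothetical payload shape; the real FP message type is still to be defined.
    """
    return {"type": "harvest", "provider": provider_id, "body": oai_xml}
```

In the firewalled case the provider would run the ListRecords request locally and push the wrapped result out, so the network-side processing code sees the same OAI-PMH XML either way.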

    • Report: State of investigation of duplicates in GBIF cache.

For Friday and next week.

    • Discussion: Duplicate Finding Find_Duplicates. Identify targets for native data entry applications into which we will embed duplicate finding functionality.

For next week.

  • SCAN TCN Support
    • Report: Sanity Checking, further assessment of lower number of new annotations as opposed to new records in omoccurdeterminations.

David: Need to add a transcribing annotation when data is coming from an image. Added support for the transcribing motivation to message generation; not yet plugged into Symbiota.

Nico: Wondering if David Lowery can visit Ed/myself at ASU before end of year for a 2-3 day SCAN/FP user interface optimization session. Could talk about options while at TDWG.

Jim: Paul, give Nico a call and figure out details of what he's looking for.

    • Report: MCZbase Driver

Maureen: No work yet.

  • NEVP TCN Support
    • Report: Preparations to update Annotation Processor Deployment at UNH.

Maureen: Haven't picked a date yet.

    • Report: Specify-HUH driver

Maureen: Still working on refactoring using workbench backend.

  • FP Infrastructure
    • Report: FP Node Refactoring

For Friday:

  • IDCC item above.
  • Overview of kuration workflow in the new conceptual model, with a focus on what we plan to do with this "new" thing :)
  • What Chuck would like to do with consensus records