2014Feb12

Etherpad for meeting notes: http://firuta.huh.harvard.edu:9000/FP-2014Feb12

Agenda

Non-Tech

  • Davis burndown - anything else to confirm?
  • James: TDWG session
  • Burndown

Tech

  • Report from Friday call
  • Note from Chuck: OCR Hackathon results
  • Analysis
    • Report Tianhong: Progress on updated Kepler Kuration release
    • Report Tianhong: Akka Actors.
    • Report Bob: Progress on Duplicate Finding data mining
    • Report Chuck: Duplicate detection UI.
  • Nodes
    • Report Maureen: Ingest progress.
  • NEVP
  • Driver
    • Discussion: Driver

Reports

  • Paul
    • More testing of NEVP ingest for Symbiota and Specify-HUH. Fixed minor bugs found in testing.
    • Wrote code to link records of multiple barcodes from the same sheet by finding similar images, checking them for barcodes, and then editing the Specify-HUH preparation-fragment links (see the image-similarity sketch after this list).
  • Jim
    • Request for second no-cost extension of main FP award submitted to NSF on Feb 11th.
  • Chuck
    • A little bit stalled on the Exsiccati title search. (See longer emails.) Decided that I might do some fuzzing at search and query time, perhaps with a custom analyser, and, to compensate for that, a faceted interface that improves precision would be good.
    • Housekeeping: schema validation; CORS rather than JSONP; better online docs; etc.
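
A minimal sketch of one way the "find similar images" step in Paul's barcode-linking report above could work, using a simple average hash and Hamming distance. The approach and all names here are illustrative assumptions, not the actual FilteredPush code, and the barcode check itself (e.g. with a library such as ZXing) is left as a comment.

    import javax.imageio.ImageIO;
    import java.awt.Graphics2D;
    import java.awt.image.BufferedImage;
    import java.io.File;
    import java.io.IOException;

    public class SimilarImageSketch {

        // Scale to 8x8 grayscale and set one hash bit per pixel at or above
        // the mean brightness (the "average hash" or aHash technique).
        static long averageHash(File imageFile) throws IOException {
            BufferedImage src = ImageIO.read(imageFile);
            BufferedImage small = new BufferedImage(8, 8, BufferedImage.TYPE_BYTE_GRAY);
            Graphics2D g = small.createGraphics();
            g.drawImage(src, 0, 0, 8, 8, null);
            g.dispose();
            int[] pixels = new int[64];
            int sum = 0;
            for (int y = 0; y < 8; y++) {
                for (int x = 0; x < 8; x++) {
                    pixels[y * 8 + x] = small.getRaster().getSample(x, y, 0);
                    sum += pixels[y * 8 + x];
                }
            }
            int mean = sum / 64;
            long hash = 0L;
            for (int i = 0; i < 64; i++) {
                if (pixels[i] >= mean) { hash |= (1L << i); }
            }
            return hash;
        }

        // Hamming distance between two hashes; a small distance suggests the
        // two images are of the same (or a very similar) sheet.
        static int distance(long h1, long h2) {
            return Long.bitCount(h1 ^ h2);
        }

        public static void main(String[] args) throws IOException {
            long h1 = averageHash(new File(args[0]));
            long h2 = averageHash(new File(args[1]));
            if (distance(h1, h2) <= 5) {  // threshold chosen arbitrarily for illustration
                // Candidate pair: decode barcodes on both images (e.g. with ZXing),
                // and if multiple barcodes map to the same sheet, adjust the
                // Specify-HUH preparation-fragment links accordingly.
                System.out.println("Candidate similar pair: " + args[0] + " " + args[1]);
            }
        }
    }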

Notes

FilteredPush Team Meeting 2014 Feb 12

Present: Bertram, Tianhong, Bob, Chuck, Paul, Maureen, James, Jim.

Agenda

Non-Tech

  • Davis burndown - anything else to confirm?

Bertram: Award folks on both sides seem to be getting things lined up. Jim: Yes, Harvard folks sent the NCE paperwork to UC Davis. Jim: Also, the NCE request from Harvard is going in to NSF; it needs approval from NSF, and it is not clear how long this will take.

  • James: TDWG session

James: Nothing further at this point.

  • Burndown

Jim: Nothing further at this point. Paul: Folks with copies of the Temp/LHT postings, please get comments back to me, so we can have that text ready if we need them.

Tech

  • Report from Friday call

Bob: Discussed issues about ingest of taxonomic hierarchies and reasoning. Is it good/bad/doesn't matter to memorialize inferred triples? (See the sketch below.) Paul: Had Ed in on the call discussing the SCAN hackathon rollout - blocking issues all dealt with, Ed is deploying a test copy for Nico/Neil to evaluate, then aiming to integrate into trunk this week.
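
A minimal sketch, using Apache Jena's RDFS reasoner, of what "memorializing inferred triples" amounts to for a taxonomic hierarchy; the example vocabulary is invented, and the package names are from current Jena (the 2014-era packages were under com.hp.hpl.jena).

    import org.apache.jena.rdf.model.InfModel;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.rdf.model.Resource;
    import org.apache.jena.vocabulary.RDFS;

    public class MemorializeInference {
        public static void main(String[] args) {
            String ns = "http://example.org/taxon/";
            Model asserted = ModelFactory.createDefaultModel();
            Resource aster = asserted.createResource(ns + "Aster");
            Resource asteraceae = asserted.createResource(ns + "Asteraceae");
            Resource plantae = asserted.createResource(ns + "Plantae");
            // Asserted hierarchy: genus under family, family under kingdom.
            asserted.add(aster, RDFS.subClassOf, asteraceae);
            asserted.add(asteraceae, RDFS.subClassOf, plantae);

            // RDFS reasoning infers, e.g., Aster rdfs:subClassOf Plantae.
            InfModel reasoned = ModelFactory.createRDFSModel(asserted);

            // To "memorialize" the inferences, copy asserted plus inferred
            // statements into a plain model and persist that: queries no longer
            // need a reasoner, but retracting an asserted triple no longer
            // retracts its consequences.
            Model memorialized = ModelFactory.createDefaultModel();
            memorialized.add(reasoned);
            memorialized.write(System.out, "N-TRIPLE");
        }
    }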

  • Note from Chuck: OCR Hackathon results

Chuck: Code developed in the hackathon is now in production use in CalBug on Notes from Nature.

  • Analysis
    • Report Tianhong: Progress on updated Kepler Kuration release

Tianhong: Have merged into the repository, built, and ready to release the package - a new version of the Kuration module, as a new module on the current release.

    • Report Tianhong: Akka Actors.

Tianhong: Stable code for the DateValidation actor at this point. Bob: Joda-Time? Bertram: Link in notes. Tianhong: Looked at some. Bob: Joda-Time has some lax parsers that might be useful (and may report on what went wrong in the parsing); also a Chronology object - the ISO8601 chronology, but others as well (see the sketch below). Bertram: Would like Tianhong to wrap up the new release of Kepler Kuration and a release of Akka for curation, then look at case studies of workflow efficiency to prepare for the qualifying exam (PhD research).
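
A minimal sketch, assuming the Akka 2.x Java API and Joda-Time, of the shape a date-validation actor could take; this is not the actual DateValidation actor, and the formatter choices are illustrative. Joda-Time's parsers throw IllegalArgumentException with a message saying where parsing failed, which is the "report on what went wrong" behavior mentioned above.

    import akka.actor.UntypedActor;
    import org.joda.time.LocalDate;
    import org.joda.time.format.DateTimeFormat;
    import org.joda.time.format.DateTimeFormatter;
    import org.joda.time.format.ISODateTimeFormat;

    public class DateValidationSketch extends UntypedActor {

        // Try ISO 8601 first, then a couple of common legacy patterns.
        private static final DateTimeFormatter[] FORMATTERS = {
            ISODateTimeFormat.dateParser(),
            DateTimeFormat.forPattern("MM/dd/yyyy"),
            DateTimeFormat.forPattern("dd-MMM-yyyy")
        };

        @Override
        public void onReceive(Object message) {
            if (message instanceof String) {
                String raw = (String) message;
                String problem = null;
                for (DateTimeFormatter f : FORMATTERS) {
                    try {
                        LocalDate parsed = f.parseLocalDate(raw);
                        // Reply with the validated date in ISO form.
                        getSender().tell(parsed.toString(), getSelf());
                        return;
                    } catch (IllegalArgumentException e) {
                        problem = e.getMessage();  // Joda reports where parsing failed
                    }
                }
                getSender().tell("UNPARSEABLE: " + problem, getSelf());
            } else {
                unhandled(message);
            }
        }
    }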

    • Report Bob: Progress on Duplicate Finding data mining

Bob: One person in the text mining community around Hadoop pointed out that database duplicates can be found with a simple method that is amenable to parallelisation and map-reduce: take a bunch of records, put them into arbitrary buckets, and, given a metric for distance between two data records, set a threshold on distance to identify candidate duplicate records. A map-reduce application compares within buckets in separate threads. In testing, taking a couple of seconds per comparison - slow on a single-core machine... Conceptually, much like a recommender system ("if you like x, you'll also like y") - lots of work on this with Mahout, etc., and a good model for duplicate finding. Also working on cluster analysis, which is freer of the semantics, and string edit distances work well there. Feeling like this is leading in the direction of more research on relating semantic and logical reasoning systems. Paul: A pragmatic approach to start with, on harvest. Maureen: In a step-down pipeline from harvest... Paul: Examine harvested records for collector name, collector number (not 1-9), and date collected; if present, query for matches in existing data (fuzzy on collector name), then assert relationships of potential duplicates on matches, then express queries on annotations related to my collection code plus its duplicates to follow these relationships. Bob: Good to memorialize the oddities we are seeing in duplicate finding right now. Chuck: In the test suite, have some cases - good to extend that. (A sketch of the bucketed comparison follows below.)
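
A sketch of the bucket-and-compare idea Bob describes: put records into buckets, then compare only within buckets using a distance metric and a threshold. The record fields, blocking key (collector number), metric, and threshold are illustrative assumptions; in a map-reduce setting the bucket key would be the map output key, so each reducer performs the within-bucket comparisons.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class BucketedDuplicateSketch {

        record Occurrence(String id, String collector, String collectorNumber, String date) {}

        // Plain Levenshtein edit distance, adequate as a first-pass string metric.
        static int editDistance(String a, String b) {
            int[][] d = new int[a.length() + 1][b.length() + 1];
            for (int i = 0; i <= a.length(); i++) d[i][0] = i;
            for (int j = 0; j <= b.length(); j++) d[0][j] = j;
            for (int i = 1; i <= a.length(); i++)
                for (int j = 1; j <= b.length(); j++)
                    d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                            d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
            return d[a.length()][b.length()];
        }

        // Record distance: collector-name edit distance when collector number and
        // date agree exactly; otherwise the pair is not considered at all.
        static int distance(Occurrence x, Occurrence y) {
            if (x.collectorNumber().equals(y.collectorNumber()) && x.date().equals(y.date())) {
                return editDistance(x.collector(), y.collector());
            }
            return Integer.MAX_VALUE;
        }

        public static void main(String[] args) {
            List<Occurrence> records = List.of(
                    new Occurrence("a1", "A. Gray", "1023", "1843-06-01"),
                    new Occurrence("b7", "A Gray", "1023", "1843-06-01"),
                    new Occurrence("c3", "C. Darwin", "88", "1835-09-17"));
            // "Map" phase: assign each record to a bucket by a blocking key
            // (here, collector number; in Hadoop this would be the map output key).
            Map<String, List<Occurrence>> buckets = new HashMap<>();
            for (Occurrence r : records) {
                buckets.computeIfAbsent(r.collectorNumber(), k -> new ArrayList<>()).add(r);
            }
            // "Reduce" phase: pairwise comparisons within each bucket only.
            int threshold = 3;  // arbitrary illustration value
            for (List<Occurrence> bucket : buckets.values()) {
                for (int i = 0; i < bucket.size(); i++) {
                    for (int j = i + 1; j < bucket.size(); j++) {
                        if (distance(bucket.get(i), bucket.get(j)) <= threshold) {
                            System.out.println("Candidate duplicates: "
                                    + bucket.get(i).id() + " / " + bucket.get(j).id());
                        }
                    }
                }
            }
        }
    }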

    • Report Chuck: Duplicate detection UI.

Chuck: Did some code cleanup. Working on Exsiccati for rapid data entry; there isn't an exact match of names in the HUH data set and the list of names (of Exsiccati on the Lichen/Bryophyte portal) - need some mechanism for matching close matches of publication names and publication abbreviations. Either match exactly the strings we have or match more generally - perhaps some faceting refining the search for possible close matches between the local authority file and data from duplicates. Discussion of faceting: search results in the duplicate finding tool may be underspecified; could facet (e.g. Lichens of Canada [v1, v2], no. 42, one from Alberta, one from Ontario) and use faceting to help users narrow down to a tighter specification of the duplicate data (see the query sketch below). Paul: Time to run the current UI past some users for feedback, then tackle faceting. Paul: How to handle different data sources? Chuck: One instance of the duplicate finding web UI configured to a Solr instance for each data source.
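
A sketch (with invented field names and core URL) of the kind of query this faceting discussion points at: fuzzy matching at query time to tolerate variant exsiccati titles and abbreviations, plus facet counts a user could click to tighten an underspecified match; assumes the SolrJ 4.x-era HttpSolrServer client.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.FacetField;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class ExsiccatiSearchSketch {
        public static void main(String[] args) throws SolrServerException {
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/duplicates");

            // Fuzzy terms (edit distance 1) tolerate minor title variants.
            SolrQuery q = new SolrQuery("exsiccati_title:(lichens~1 canada~1)");
            q.setFacet(true);
            // Facets a user could click to tighten an underspecified match.
            q.addFacetField("exsiccati_title_exact", "exsiccati_number", "state_province");
            q.setFacetMinCount(1);

            QueryResponse rsp = solr.query(q);
            for (FacetField field : rsp.getFacetFields()) {
                System.out.println(field.getName());
                for (FacetField.Count c : field.getValues()) {
                    System.out.println("  " + c.getName() + " (" + c.getCount() + ")");
                }
            }
        }
    }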

  • Nodes
    • Report Maureen: Ingest progress.

Maureen: Have the code for doing harvesting committed; need to add a mechanism to extend that pipeline (downstream after harvest, adding additional information, e.g. higher taxon links). Need a different way of implementing the OAI provider: currently using a view as the basis for implementing records, but need better performance - export the data and have PHP code assemble the data objects, rather than a view in the database. Haven't approached harvesting across firewalls yet, but have a plan - a harvest class that writes to files, with a consumer of this that can push to the network (see the sketch below).
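
A minimal sketch (all names invented) of the harvest design described here: the harvest loop writes records through a consumer interface, so a file-writing consumer can be used now and later swapped for one that pushes across the firewall.

    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.List;

    interface HarvestConsumer {
        void accept(String record) throws IOException;
        void close() throws IOException;
    }

    // Writes harvested records to a local file; a separate network-pushing
    // consumer could implement the same interface later.
    class FileHarvestConsumer implements HarvestConsumer {
        private final BufferedWriter out;

        FileHarvestConsumer(Path file) throws IOException {
            this.out = Files.newBufferedWriter(file);
        }

        public void accept(String record) throws IOException {
            out.write(record);
            out.newLine();
        }

        public void close() throws IOException {
            out.close();
        }
    }

    public class HarvestSketch {
        // The harvest loop only knows the consumer interface, not the destination.
        static void harvest(List<String> records, HarvestConsumer consumer) throws IOException {
            for (String record : records) {
                // Downstream pipeline steps (e.g. adding higher-taxon links) would
                // transform the record here before it is handed to the consumer.
                consumer.accept(record);
            }
            consumer.close();
        }

        public static void main(String[] args) throws IOException {
            Path outFile = Paths.get("harvest-batch-001.txt");
            harvest(List.of("rec1", "rec2"), new FileHarvestConsumer(outFile));
        }
    }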

    • Report Maureen: Duplicate finding integration into specify web.

Maureen: Figured out how to get the specify web development environment working and how to add a plugin. Working on understanding the data object model that specify web is using; then can integrate into the form.

  • NEVP

Paul: Looking at similar images to find candidates for tests for barcodes.

  • Driver
    • Discussion: Driver

For Friday:

  • Integrating data mining approaches with the data cleaning workflows
  • Harvester pipeline for adding inferred triples: taxon data, collector name/number duplicate matching