2014Feb05


Etherpad for meeting notes: http://firuta.huh.harvard.edu:9000/FP-2014Feb05

Agenda

Non-Tech

  • Davis burndown rate - any questions from Bertram
  • James: TDWG session
  • Burndown

Tech

  • Report from Friday call
  • Analysis
    • Report Tianhong: Progress on updated Kepler Kuration release
    • Report Tianhong: Current state of Akka work.
    • Report: Progress towards deploying Akka for QC in SCAN and NEVP nodes.
    • Report Bob: Progress on Duplicate Finding data mining
    • Report Chuck: Duplicate detection UI.
    • Discussion: Duplicate detection
  • NEVP
    • Report Paul: Testing, Symbiota Ingest and Alternate NEVP->Specify-HUH ingest tool.
  • Driver
    • Discussion: Driver

Reports

  • Chuck
    • Current:
      • Paul gave me a dump from the Lichen Portal, so I'm trying to get the tool to work for searching exsiccati titles and numbers. Imports work, but I haven't gotten the Solr config right yet.
    • Completed:
      • To support that, got generalized tab/csv import working from the CLI.
      • Hostname configurable on startup: so, conceivably we could have an instance running as a daemon on a server.
      • Testing more robust: try a range of ports rather than just a single hard-coded one.
      • Got my head around testing asynchronous code with QUnit, and all of the supporting packages are now tested.
      • Used Selenium to run the QUnit tests, so if there are JS test failures, they will get flagged in the JUnit run.
    • Actually being used:
      • Submitted patch to QUnit-HTML-diff plugin: Maintainer seemed to like it.
      • Got a thank-you note from the CalBug guy: the December hackathon code has been made part of their workflow, and works pretty well.
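The "try a range of ports" idea from Chuck's Completed list can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the tool's actual code; the function name `first_free_port` and the port numbers are hypothetical.

```python
import socket
from typing import Optional


def first_free_port(host: str, start: int, end: int) -> Optional[int]:
    """Return the first port in [start, end] that can be bound on host, or None.

    Hypothetical helper: probes each port by binding a throwaway socket.
    """
    for port in range(start, end + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                # Port is in use or not permitted; try the next one.
                continue
            return port
    return None


if __name__ == "__main__":
    # Host and range are illustrative; both would come from configuration.
    print(first_free_port("127.0.0.1", 49152, 49251))
```

The probe socket is closed before the port is returned, so another process could still claim the port in between. A test harness would normally just retry on the next port if the real server fails to start.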

Notes

FilteredPush team meeting 2014 February 5

Present: Tianhong, Bertram, Bob, Jim, James, Paul, and Chuck

Agenda

Non-Tech

  • Meeting time?

- OK for Davis (and ETC) to move back 30 min to 11:30 AM Pacific, 2:30 PM Eastern.

  • Davis burndown rate - any questions from Bertram:

- Nope, other than that awards @ ucdavis.edu probably needs an NCE (no-cost extension) letter for 2013-2014.

  • James: TDWG session

James: Symposium for workflows submitted, and receipt acknowledged.

  • Burndown

Paul: Have circulated drafts of two temp positions for comment.

Jim: Framing a second no-cost extension, which will need to go to NSF. Suggest holding on temp/LHT positions until we get an answer from NSF.

Tech

  • Report from Friday call

Maureen: Lots of discussion of harvesting into Mulgara and of maintenance, in particular using a separate named graph for harvests of the taxon tree from the one used for annotations.

  • Analysis
    • Report Tianhong: Progress on updated Kepler Kuration release

Tianhong: Found how to put new actors into the left menu (adding a .kar file) for the development version.

Bertram: And for release?

Paul: Ant tasks?

Bertram: Relatively small Kepler crew here now; need to reach out to some of the UCSD developers to inquire about their release process for bioKepler.

Paul: See ant targets (ant help in build-area/):

    [exec]  make-debian-package         Builds a Debian package.
    [exec]  make-installers             Builds installers for Linux, Mac OS X, Windows
    [exec]  make-kars                   Management task to induce the creation of kars from the build system.
    [exec]  make-linux-installer        Builds an installer for Linux
    [exec]  make-mac-installer          Builds an .app for Mac OS X

Unclear where binaries are put for distribution.

    • Report Tianhong: Current state of Akka work.

Tianhong: Akka-Camel integration; documentation on the website isn't wholly clear; testing.

Tianhong: The current date-collected actor is working but not finished. Flow chart needs more work. A main question is how to use data mining to resolve dates.

Bertram: As we said last week, this is very open ended; need to focus on a core functionality.

Tianhong: Currently two steps: (1) Is the date valid? Working. (2) Is the date within the lifespan of the collector? Also working.

Bertram: Is this a unitary actor, or can these be used separately?

Tianhong: Internals are more than two steps, not cleanly separated.

Bob: The Java package JodaTime may be relevant.

Tianhong: Will need a data set to carry out the data mining on?

Bob: Working from a view of the BBG (Brooklyn Botanic Garden) data set [that we used in the prototype for duplicate finding, regional NE US duplicates across many institutions] as a flat csv file, just for convenience in loading into Mahout.
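As a rough illustration of the two checks discussed above (is the date valid, and does it fall within the collector's lifespan), here is a minimal Python sketch in which each step can be called on its own. It is not the Akka actor itself. The function names, the ISO `YYYY-MM-DD` input format, and the example birth and death dates are all assumptions made for illustration.

```python
from datetime import date
from typing import Optional


def parse_iso_date(text: str) -> Optional[date]:
    """Step 1: return a date if text is a valid ISO calendar date, else None.

    Assumes YYYY-MM-DD input; real specimen data would need more formats.
    """
    try:
        return date.fromisoformat(text.strip())
    except ValueError:
        return None


def collected_within_lifespan(collected: date, born: date, died: date) -> bool:
    """Step 2: True if the collecting date lies within the collector's lifespan."""
    return born <= collected <= died


def check_date_collected(text: str, born: date, died: date) -> str:
    """Run both steps in order and report the first problem found."""
    collected = parse_iso_date(text)
    if collected is None:
        return "invalid date"
    if not collected_within_lifespan(collected, born, died):
        return "outside collector lifespan"
    return "ok"


if __name__ == "__main__":
    # Hypothetical collector lifespan, for illustration only.
    born, died = date(1850, 1, 1), date(1920, 12, 31)
    for value in ("1890-06-01", "1890-02-30", "1930-06-01"):
        print(value, "->", check_date_collected(value, born, died))
```

Keeping the two steps as separate functions is one way to address the question of whether the checks can be used independently; the combined function only sequences them.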

    • Report: Progress towards deploying Akka for QC in SCAN and NEVP nodes.

Bertram: Discuss on Friday call? Might help more?

    • Report Bob: Progress on Duplicate Finding data mining

Bob: Very large cost for having higher dimensionality (year, month, and day as separate variables has a much higher cost than a single date: quadratic or worse in the number of dimensions you are mining on).

Bob: Working on trying to agglomerate the date fields into a single variable. Collector number is very sparse. Then, collector name (concatenated list) and locality are variously less sparse, but still very sparse data. Initial tests are finding a small number of clusters with fewer than 10 members, and one large cluster with 70k+ members, thus working on adding date collected as an additional dimension.

Bob: Under the assumption that differences in duplicates are transcription errors, string edit distance seems a reasonable measure.

Chuck: Trying at this point to find clusters that actually are clusters; strict criteria make sense at this point to reduce false positives.

Bob: The Mahout book has a lot of practice in it, and the approach of starting with a few parameters and slowly adding until reasonable clusters emerge seems standard.

Chuck: More generally, the assumption that duplicates will differ by transcription errors seems false, since how data go into different database field structures varies, but string edit distance is a good conservative starting point.

James: Reminded of the sliding scale for clustering in the prototype, where users controlled what showed up in the results.
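To make the string edit distance measure concrete, here is a minimal Python sketch of Levenshtein distance with a strict cutoff. It is not the Mahout-based work described above. The function names, the `max_distance` threshold, and the sample locality strings are assumptions for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum single-character inserts, deletes,
    and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def likely_duplicates(a: str, b: str, max_distance: int = 2) -> bool:
    """Strict criterion: treat two field values as candidate duplicates only
    if they differ by at most max_distance edits (case-insensitive)."""
    return edit_distance(a.lower(), b.lower()) <= max_distance


if __name__ == "__main__":
    # Sample strings are made up for illustration.
    print(likely_duplicates("Mt. Washington", "Mt Washington"))
    print(likely_duplicates("Mt. Washington", "Mount Katahdin"))
```

A small `max_distance` keeps the criterion strict, which fits the goal of reducing false positives. Exposing it as a parameter would also allow something like the user-controlled sliding scale mentioned for the prototype.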

    • Report Chuck: Duplicate detection UI.

Chuck: Got a dump of Exsiccati data from the Lichen Portal; read it in and am testing on it. Added unit tests to the JavaScript, and added Selenium testing of the JavaScript to the JUnit tests.

    • Discussion: Duplicate detection
  • NEVP
    • Report Paul: Testing, Symbiota Ingest and Alternate NEVP->Specify-HUH ingest tool.
  • Driver
    • Discussion: Driver

For Friday:

  • duplicate detection
  • driver