ActuallyFindingDuplicates2

From Filtered Push Wiki

FP-DataEntry proceeded well past this initial spec. This document is now of historical interest only.


Things that are out of scope

These have come up earlier, and I want to be clear that they are not in this plan:

  • Special handling for OCR.
  • Any notion of annotating or managing the duplicates.
  • Actually seeing duplicate records completely.

TODO: We also want these things, but right now they seem to be orthogonal to the data-entry problem.

  • "Pass annotations to members of duplicate set."
    • Once a collection has used an existing record as a basis for its own, it would like to be notified if there are updates in the future (redeterminations, etc.).
    • In this story does the origin of the record matter, or would we want the same kind of notification even if the record did not originate with FP?
    • Will redeterminations, for example, always originate in FP and only later prompt an update in the collection DB? Wouldn't we be just as interested in changes that happened first in the collection DB? Will successive edits always be sent back up to FP?
    • approaches:
      • At the time the record is copied to the collection DB for data entry, also copy it to FP. One problem: if later edits are made in the data-entry application, FP won't necessarily see them.
      • Or, store provenance information in the record in the local DB, so that it reaches FP when the record is reharvested. Problem: this means trusting application databases to keep this kind of metadata available in a machine-readable form.
      • Or, with each data entry event, add to a list of queries that watch for changes: every day, ask "Hey, we have Smith's specimen 1234 (and Baker 5678, and ...): any updates?"
  • "Quality control internal consistency within a duplicate set."
    • I'm not clear on what this means: for me, the duplicate sets are tools for a job. What's the job?
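
For the third approach under "Pass annotations" above (a list of standing queries checked every day), a minimal Python sketch might look like the following. The names WatchKey, WatchList, and the fetch callback are hypothetical. Nothing in this plan defines how the remote lookup would work, so it is passed in as a function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WatchKey:
    """Identifies a specimen by collector name and collector number."""
    collector: str
    number: str


class WatchList:
    """Remembers the last record seen for each watched specimen."""

    def __init__(self):
        self._last_seen = {}  # WatchKey -> last known record (dict)

    def add(self, key, record):
        # Called once per data entry event.
        self._last_seen[key] = dict(record)

    def poll(self, fetch):
        """Ask fetch(key) for each watched specimen; report changed fields.

        fetch is an assumed callable returning the current record dict,
        or None if the remote side has nothing for that key.
        Returns {key: {field: (old_value, new_value)}}.
        """
        updates = {}
        for key, old in self._last_seen.items():
            new = fetch(key)
            if new is None:
                continue
            changed = {
                f: (old.get(f), new.get(f))
                for f in set(old) | set(new)
                if old.get(f) != new.get(f)
            }
            if changed:
                updates[key] = changed
                self._last_seen[key] = dict(new)
        return updates
```

A daily job would call poll() once and hand any non-empty result to whatever notification mechanism the collection prefers.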

Full plan

Two components:

  • remote server
    • Has a subset of GBIF digested, plus FP data. (Privilege FP data, because it is less likely to have the geography fuzzed.)
    • Requires login.
    • Responds to fielded DwC GET queries over HTTPS with DwC JSONP responses.
  • localhost client/server
    • It is a client of the remote server, and server to locally running applications.
    • JS:
      • Offers a faceted search interface: Collector name and ID are the most important, but it will also be possible to restrict by taxonomy or geography.
      • Sends DwC JSONP GET queries to the remote server
      • "Best" (= most complete) response is presented in a series of selectors.
      • User can choose options for each field [and edit by hand there, rather than requiring further edits to be done in the application?]
      • With each edit, the local server is POSTed with the current state.
    • Java:
      • Responds to an initial GET which sets up the AJAXy application.
      • Stores the state changes that come via POSTs from the JS.
      • Responds to GETs from the data entry application with DwC JSON representing the current state.
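
The query and selection steps above can be sketched roughly as follows. This is Python for illustration only (the plan puts this logic in JS), and it makes several assumptions: the function names, the `callback` parameter name, and the reading of "most complete" as "most non-blank fields" are mine, not part of the spec.

```python
from urllib.parse import urlencode


def build_query_url(base, callback, **fields):
    """Build a fielded GET URL; blank restrictions are dropped."""
    params = {k: v for k, v in fields.items() if v}
    params["callback"] = callback  # assumed JSONP callback parameter name
    return base + "?" + urlencode(sorted(params.items()))


def _is_filled(value):
    return value is not None and bool(str(value).strip())


def completeness(record):
    """Count the fields that carry a non-blank value."""
    return sum(1 for v in record.values() if _is_filled(v))


def best_record(records):
    """Return the most complete record, or None if there are no candidates."""
    return max(records, key=completeness, default=None)


def field_options(records):
    """Map each field to its distinct non-blank values, in first-seen order.

    These lists would populate the per-field selectors.
    """
    options = {}
    for record in records:
        for field, value in record.items():
            if not _is_filled(value):
                continue
            seen = options.setdefault(field, [])
            if value not in seen:
                seen.append(value)
    return options
```

With two candidate records for the same collector number, best_record() supplies the default value for each selector and field_options() supplies the alternatives the user can pick from.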

Questions

  • How is the tool deployed to the user's machine? Will a jar that they run from the command line be OK?
  • How is it configured on the user's machine? Go to localhost:8080, and from there indicate the remote host and port? Hardcode the port number for the local httpd, with an override on the command line when starting the jar?
  • Is it OK if there's no persistence of settings between runs of the client/server? Or do we write to a settings file in some conventional location?
  • Who sets up and maintains the logins on the server? Do we need a UI for that?
  • How does login to server work?
  • How do taxonomy or geography restrictions work? Could it just be text that does a free text search? or does it need to be hierarchical pulldowns? (I don't see much to be gained from pulldowns.)
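
One possible answer to the configuration question, sketched in Python rather than as a Java jar: a hardcoded default port with command-line overrides. The flag names and the remote defaults here are assumptions for illustration. Only the local port 8080 comes from this page.

```python
import argparse

DEFAULT_LOCAL_PORT = 8080  # the local port mentioned in this spec


def parse_args(argv=None):
    """Parse startup options; every flag name here is hypothetical."""
    parser = argparse.ArgumentParser(
        description="Local client/server for finding duplicate records"
    )
    parser.add_argument(
        "--port", type=int, default=DEFAULT_LOCAL_PORT,
        help="port for the local httpd",
    )
    parser.add_argument(
        "--remote-host", default="localhost",
        help="host of the remote server (assumed default)",
    )
    parser.add_argument(
        "--remote-port", type=int, default=443,
        help="port of the remote server (assumed default for HTTPS)",
    )
    return parser.parse_args(argv)
```

This keeps settings out of any file, so nothing persists between runs unless the user repeats the flags.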

Immediate plan

  • Just use GBIF data. (Subset to be specified by Michaela.)
  • Don't worry about login between the client/server and the remote server, since there is no sensitive data.
  • Install both the client/server and the server on Firuta. Note that the client/server has no sessioning and only global state, so it will get confused if more than one person tries to use it at once.
  • Modify the Rapid Data Entry Tool to hit the local client/server when you hit a button.

Immediate, immediate plan

For development, run the client/server and the server on localhost, and point it at some slice of the GBIF cache.

Alternatives

  • It could just be a remote server?
    • All the application plugins would need to be configurable, instead of just pointing to the client/server on localhost:8080, and since the plugin is now the client, it would need provisions for login.
    • And we don't want to rewrite the whole client for each application environment.
  • The client/server could be less AJAXy?
    • I think it's actually a good thing to split it up: the client/server Java code only worries about GETs from the applications and POSTs from the JS, and the JS POSTs to the Java and GETs from the remote.
  • Could use solr/lucene?
    • Right now, I think all our queries will just be left-substring (prefix) matches, and MySQL suffices for that. Free text searches would bring in too much junk.
  • Could use our XML schemas?
    • For a limited subset of DwC, I see no advantage in bringing in XML, and it would make the JS harder.
  • Worry about security on the local client/server?
    • We're just going to assume that you're running this on localhost, and the port is firewalled from the outside world. If we need security on the client/server, then we'll need to find some way to share the session with the data entry application, and that's just going to get hard.
  • Use something other than Java for the client/server?
    • The server of the client/server is going to be really, really simple, and it really could be in any language... I think it just comes down to what will be easiest to deploy, and that might be Java? (Given free rein, I might do it in Python.)
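
To give a sense of how simple the server half of the client/server could be, here is a minimal Python sketch using only the standard library. It assumes the state is a single JSON object that each POST replaces wholesale and each GET returns as-is. The class and function names are hypothetical, and there is deliberately no sessioning, matching the global-state caveat in the immediate plan.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class RecordState:
    """Holds the one record currently being edited (global, no sessions)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._record = {}

    def replace(self, raw_body):
        """Replace the state with the JSON object in raw_body (bytes)."""
        record = json.loads(raw_body or b"{}")
        if not isinstance(record, dict):
            raise ValueError("expected a JSON object")
        with self._lock:
            self._record = record

    def snapshot(self):
        with self._lock:
            return dict(self._record)


STATE = RecordState()


class StateHandler(BaseHTTPRequestHandler):
    def _send_json(self, status, payload):
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        # Data entry application asks for the current state.
        self._send_json(200, STATE.snapshot())

    def do_POST(self):
        # JS front end reports the state after each edit.
        length = int(self.headers.get("Content-Length", 0))
        try:
            STATE.replace(self.rfile.read(length))
        except ValueError as exc:  # includes malformed JSON
            self._send_json(400, {"error": str(exc)})
            return
        self._send_json(200, {"ok": True})


def make_server(port=8080):
    # Bind to loopback only, in line with the localhost assumption above.
    return ThreadingHTTPServer(("127.0.0.1", port), StateHandler)
```

Calling make_server().serve_forever() would start it. A Java version would have the same shape: one shared state object, one GET handler, one POST handler.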