2009Mar19

From FilteredPush

Zhimin has a fuzzy match implementation running on local machines, using open source code.

Motif confusion matrices as paradigm for fuzzy match interface.

Z: there is literature on using unsupervised learning to match up record references across two databases.

  • What can we say by 2009 ABI (June?):
    • Paul prototype can find dups
    • Maureen: can adopt IPT for local to global map
    • Caching
      • diagram is only how to construct. How to find: first(?) local, then global, then construct. Now have only last? "Think globally, first act locally"
    • Zhimin and team: specify an Analysis interface to be a parameter to the Dispatcher? (Where should Analysis be invoked? API Dispatcher?)
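A minimal sketch of the "first local, then global, then construct" lookup order from the Caching item above. The function name `tgal_lookup` and its parameters are hypothetical and not part of any FP API; the global lookup and the constructor are passed in as plain callables so the ordering is the only thing illustrated.

```python
from typing import Callable, Dict, Optional, Tuple


def tgal_lookup(
    key: str,
    local_cache: Dict[str, object],
    global_lookup: Callable[[str], Optional[object]],
    construct: Callable[[str], object],
) -> Tuple[object, str]:
    """Return (value, source), trying local, then global, then construct."""
    # 1. Act locally first: answer from the local cache if we can.
    if key in local_cache:
        return local_cache[key], "local"

    # 2. Think globally: ask the network; remember the answer locally.
    value = global_lookup(key)
    if value is not None:
        local_cache[key] = value
        return value, "global"

    # 3. Nothing found anywhere: construct the record and cache it.
    value = construct(key)
    local_cache[key] = value
    return value, "constructed"
```

With an empty local cache, the first request for a key known only to the network is answered from "global", and a repeat of the same request is answered from "local".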

By Sunday evening: make the "Think globally, act locally" (TGAL) stuff more explicit

By two weeks hence, have diagram fixed to address the above. Then June ABI task is: show a caching TGAL

Component diagram rendering scribbles from blackboard about local/global/construct and questions about where analysis is invoked (adding notifiable as question about pub-sub)


James: Main use case requires fuzzy match on Collector/CollectorName
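As one possible starting point for fuzzy matching on collector names, here is a sketch using Levenshtein edit distance normalized by string length. This is an assumed choice of metric, not the one used in Zhimin's implementation; the function names, the 0.8 threshold, and the sample names are illustrative only.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; 1.0 means identical."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def fuzzy_collectors(query: str, names: List[str],
                     threshold: float = 0.8) -> List[str]:
    """Return candidate names at or above the threshold, best first."""
    scored = [(similarity(query, name), name) for name in names]
    return [name for score, name in sorted(scored, reverse=True)
            if score >= threshold]
```

For example, `fuzzy_collectors("Asa Gray", ["Asa Grey", "John Torrey"])` keeps only "Asa Grey", which differs from the query by a single substitution.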

1. Make a demo with not all nodes equal

2. Fast local cache

3. Queries to run faster

    • use a local service that is always running, with cached classes from previous Hadoop jobs. The Hadoop task only sends it arguments and processes its results.

4. Demo with several different use cases (adding demonstration of Make Annotation use case)

  • "Real time" answers for simple case, using query cache locally
  • Fuzzy match (implicit, not defined by GUI)
  • Demo not tied to Specify, or at least not tied to a fixed Specify mapping
    • Excel as an ODBC source? (Perhaps not, flat file cases covered by Specify Workbench) Simple relational Microsoft Access DB? Maybe NY data?
  • Receiving annotation pushed in to local client from network
    • need GUI for creating annotation
    • need GUI for receiving annotation
    • need to think about implementing pub-sub

Investigate IIPT ("Inverse Internet Publishing Toolkit") for ingestion (possibly as a Make Annotation injection client, possibly for mapper code).

Performance with Hadoop was a key issue Bob observed in giving the GBIF demo. Hadoop currently appears to add the overhead of starting a JVM and loading classes for each message launched into the network. One option: investigate altering Hadoop startup behavior to reduce that overhead. Another option: put more business logic into processing messages before handing them off to Hadoop. Discussion of the second option led to a suggestion to elaborate the communication component from Prototype_design#Revisiting_the_design by adding a dispatcher able to choose between analytical engines: a lightweight engine, and Hadoop as a heavyweight engine. "Get results fast from the local cache" queries would be passed to the lightweight engine, which would query a fast local datastore.

Architecture with "always running" dispatcher that can take messages from Hadoop and run classes already known.

1. A "find dups" query is launched from a Specify node by a call to the FP API.

2. The FP API sends the query to the dispatcher.

3. The dispatcher sends the query to a lightweight engine that simply queries the local cache for results, and returns them, if found, to the dispatcher and thence to the Specify node.

4. The dispatcher also sends the query to a heavyweight engine.

5. The heavyweight engine does analysis to determine what other nodes might be interested or ought to be queried, performs the queries, adds results to the local cache, and maybe dispatches notifications to interested subscribers (or maybe the dispatcher does after the heavyweight engine returns results).
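The five steps above can be sketched roughly as follows. All class and method names here are hypothetical, the heavyweight engine is a plain in-process stand-in for Hadoop, and the calls are synchronous for brevity (in the proposed design the heavyweight work would presumably run in the background). Notification is done by the dispatcher here, which is only one of the two options left open in step 5.

```python
from typing import Callable, Dict, List, Optional

NodeQuery = Callable[[str], List[str]]


class LightweightEngine:
    """Step 3: only looks in the fast local cache."""

    def __init__(self, cache: Dict[str, List[str]]) -> None:
        self.cache = cache

    def run(self, query: str) -> Optional[List[str]]:
        return self.cache.get(query)


class HeavyweightEngine:
    """Step 5: queries other nodes and adds results to the local cache."""

    def __init__(self, cache: Dict[str, List[str]],
                 remote_nodes: Dict[str, NodeQuery]) -> None:
        self.cache = cache
        self.remote_nodes = remote_nodes

    def run(self, query: str) -> List[str]:
        results: List[str] = []
        # Stand-in for the analysis of which nodes ought to be queried:
        # this sketch simply asks every known node.
        for node_query in self.remote_nodes.values():
            results.extend(node_query(query))
        self.cache[query] = results
        return results


class Dispatcher:
    """Steps 2 to 5: always running, chooses between the engines."""

    def __init__(self, light: LightweightEngine,
                 heavy: HeavyweightEngine) -> None:
        self.light = light
        self.heavy = heavy
        self.subscribers: List[Callable[[str, List[str]], None]] = []

    def subscribe(self, callback: Callable[[str, List[str]], None]) -> None:
        self.subscribers.append(callback)

    def dispatch(self, query: str) -> Optional[List[str]]:
        fast = self.light.run(query)   # step 3: cached answer, if any
        full = self.heavy.run(query)   # steps 4 and 5: network query
        for notify in self.subscribers:
            notify(query, full)        # step 5: tell interested subscribers
        return fast                    # returned to the Specify node
```

In this sketch the first "find dups" call for a given query returns nothing from the cache but still triggers a notification carrying the network results; a second identical call is answered immediately from the cache.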

Whiteboard image with expanded component diagram showing proposed dispatcher and lightweight engine, along with sequence diagram for find duplicates message processing by the dispatcher.
Component diagram with dispatcher and lightweight engine


Sequence diagram:

Dispatcher messages.png

For next week: 2009Mar19

  • Zhimin will create and time a Hadoop job that does nothing but report a timestamp, so that we can learn how much of Hadoop job execution time is overhead.
  • Paul will upload his picture of the architecture-with-dispatcher diagram.
  • James to consider distance metrics.
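The timing idea in Zhimin's task can be illustrated without Hadoop. The sketch below launches a fresh Python interpreter as a stand-in for a do-nothing job that only reports a timestamp, then compares that timestamp with the launch time. It does not use any Hadoop API, and the function name `time_noop_job` is hypothetical; the same launch, report, and subtract pattern is what an actual Hadoop measurement would presumably follow.

```python
import subprocess
import sys
import time
from typing import Dict, List


def time_noop_job(runs: int = 3) -> List[Dict[str, float]]:
    """Launch a job that only prints a timestamp; measure the overhead."""
    measurements = []
    for _ in range(runs):
        launched = time.time()
        # Stand-in for the job: a new process whose only work is to
        # report the wall-clock time at which it started running.
        completed = subprocess.run(
            [sys.executable, "-c", "import time; print(time.time())"],
            capture_output=True,
            text=True,
            check=True,
        )
        finished = time.time()
        reported = float(completed.stdout.strip())
        measurements.append({
            "startup": reported - launched,  # launch until job code runs
            "total": finished - launched,    # launch until results return
        })
    return measurements
```

Since the job itself does no work, essentially all of "total" is overhead, and "startup" shows how much of it is spent before the job's own code begins to run.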