From FilteredPush


  • Are messages sufficient to determine who holds the primary duplicate?
  • Since no record is guaranteed to claim to be primary at all, some reasoning may be needed.
  • A message must be able to carry "I assert that X is primary holder" and perhaps "here's why".
  • ontology??? FPOntology
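One way to picture the assertion message discussed above is a small record plus a naive tally. This is only a sketch: the class `PrimaryAssertion`, its field names, and the majority-vote rule in `most_asserted_holder` are all hypothetical and were not decided in the meeting.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class PrimaryAssertion:
    """Hypothetical payload for 'I assert that X is primary holder'."""
    asserted_by: str      # node making the claim
    duplicate_set: str    # identifier for the set of duplicate records
    primary_holder: str   # X, the holder asserted to be primary
    reason: str = ""      # optional "here's why"


def most_asserted_holder(assertions: Iterable[PrimaryAssertion],
                         duplicate_set: str) -> Optional[str]:
    """Return the holder named by the most assertions, or None if no data or a tie."""
    votes = Counter(a.primary_holder for a in assertions
                    if a.duplicate_set == duplicate_set)
    ranked = votes.most_common(2)
    if not ranked:
        return None  # nobody has asserted anything about this set
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie: the messages alone are not sufficient
    return ranked[0][0]
```

A tie or an empty result is exactly the case where messages alone do not settle the question and further reasoning would be needed.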

Maureen: Specify is talking about using IPT

Zhimin has simplified regular expression fuzzy match code, not tested yet.

Zhimin suggests interest metrics, e.g. phylogenetic distance, rather than syntactic fuzzy match. Geocoded data lends itself to a natural distance measure, the shortest great-circle path; non-geocoded data does not.
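For the geocoded case, the great-circle distance can be sketched with the haversine formula. The function name `great_circle_km` and the spherical-Earth radius constant are assumptions for illustration; nothing here was chosen in the meeting.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; assumes a spherical Earth


def great_circle_km(lat1, lon1, lat2, lon2):
    """Shortest great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    # Haversine formula
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to guard against floating-point values slightly above 1
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))
```

Two records could then be treated as candidate duplicates when their collecting localities fall within some threshold distance; the threshold itself would have to be chosen per use case.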


  • What can we say by 2009 ABI (June?):
    • Paul prototype can find dups
    • Maureen: can adopt IPT for local to global map

1. Make a demo with not all nodes equal

2. Fast local cache

3. Queries to run faster

    • use a local service that is always running, with classes cached from previous Hadoop jobs. The Hadoop task only sends it arguments and processes its results.

4. Demo with several different use cases (adding demonstration of Make Annotation use case)

  • "Real time" answers for simple case, using query cache locally
  • Fuzzy match (implicit, not defined by GUI)
  • Demo not tied to Specify, or at least not tied to a fixed Specify mapping
    • Excel as an ODBC source? (Perhaps not, flat file cases covered by Specify Workbench) Simple relational Microsoft Access DB? Maybe NY data?
  • Receiving annotation pushed in to local client from network
    • need GUI for creating annotation
    • need GUI for receiving annotation
    • need to think about implementing pub-sub

Investigate IIPT ("Inverse Internet Publishing Toolkit") for ingestion (possibly a Make Annotation injection client, possibly mapper code).

Performance with Hadoop was a key issue Bob observed while giving the GBIF demo. Hadoop currently appears to add the overhead of starting a JVM and loading classes for each message launched into the network. One option: investigate altering Hadoop startup behavior to reduce this overhead. Another option: put more business logic into processing messages before handing them off to Hadoop. Discussion of the second option led to a suggestion to elaborate the communication component from Prototype_design#Revisiting_the_design by adding a dispatcher able to choose between analytical engines: a lightweight engine, and Hadoop as a heavyweight engine. "Get results fast from the local cache" queries would be passed to the lightweight engine, which would query a fast local datastore.

Architecture with "always running" dispatcher that can take messages from Hadoop and run classes already known.

1. A "find dups" query is launched from a Specify node by a call to the FP API.

2. The FP API sends the query to the dispatcher.

3. The dispatcher sends the query to a lightweight engine that simply queries the local cache for results, and returns them, if found, to the dispatcher and thence to the Specify node.

4. The dispatcher also sends the query to a heavyweight engine.

5. The heavyweight engine does analysis to determine what other nodes might be interested or ought to be queried, performs the queries, adds results to the local cache, and maybe dispatches notifications to interested subscribers (or maybe the dispatcher does after the heavyweight engine returns results).
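The five steps above can be sketched as follows. All class and method names (`Dispatcher`, `LightweightEngine`, `HeavyweightEngine`, `find_dups`, `remote_lookup`) are hypothetical, the cache is a plain in-memory dict, and the heavyweight engine is called synchronously for simplicity, whereas the proposal implies it would run in the background (e.g. as a Hadoop job).

```python
from typing import Callable, Dict, List, Optional


class LightweightEngine:
    """Answers only from the local cache; never contacts other nodes."""

    def __init__(self, cache: Dict[str, List[str]]):
        self.cache = cache

    def query(self, key: str) -> Optional[List[str]]:
        return self.cache.get(key)  # None when nothing is cached


class HeavyweightEngine:
    """Stand-in for the Hadoop-backed engine: queries other nodes, fills the cache."""

    def __init__(self, cache: Dict[str, List[str]],
                 remote_lookup: Callable[[str], List[str]]):
        self.cache = cache
        self.remote_lookup = remote_lookup  # hypothetical network query

    def analyze(self, key: str) -> List[str]:
        results = self.remote_lookup(key)
        self.cache[key] = results  # later queries become fast
        return results


class Dispatcher:
    def __init__(self, light: LightweightEngine, heavy: HeavyweightEngine):
        self.light = light
        self.heavy = heavy
        self.subscribers: List[Callable[[str, List[str]], None]] = []

    def find_dups(self, key: str) -> Optional[List[str]]:
        fast = self.light.query(key)     # step 3: cached answer, if any
        fresh = self.heavy.analyze(key)  # steps 4-5: full analysis
        for notify in self.subscribers:  # here the dispatcher does the notifying
            notify(key, fresh)
        return fast                      # the caller gets the fast answer
```

In this sketch the first query for a given key returns nothing from the cache, while a repeat of the same query is answered from the cache that the heavyweight engine filled.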

Whiteboard image: expanded component diagram showing the proposed dispatcher and lightweight engine, along with a sequence diagram for find-duplicates message processing by the dispatcher.

Sequence diagram (source BOUML model at File:Component3.zip): Dispatcher messages.png

For next week: 2009Mar19

  • Zhimin will create and time a Hadoop job that does nothing but report a timestamp, so that we can learn how much of Hadoop job execution time is overhead.
  • Paul will upload his picture of the architecture-with-dispatcher diagram.
  • James to consider distance metrics.
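The timing idea in the first action item can be illustrated without Hadoop: launch a process that does nothing but print a timestamp, and measure the wall-clock time around it. This is only an analogy. It times Python interpreter startup, not a Hadoop job or JVM, and the function name `noop_job_overhead_seconds` is hypothetical.

```python
import subprocess
import sys
import time


def noop_job_overhead_seconds(runs: int = 3) -> float:
    """Average wall-clock seconds to launch a process that only prints a timestamp."""
    # The child does no real work, so nearly all measured time is launch overhead.
    cmd = [sys.executable, "-c", "import time; print(time.time())"]
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, capture_output=True)
        total += time.perf_counter() - start
    return total / runs
```

The same pattern, wrapping the submission of a do-nothing job in a timer, would apply to the real Hadoop measurement.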