2008Apr17

From Filtered Push Wiki

Zhimin proposed the Apache Hadoop MapReduce framework as a mechanism to present local agents with a single, integrated view of all the duplicates of a given record or set of records. Before next week, we will build a toy implementation at UMB using toy data that Maureen and Paul will produce (available at: Test_Data_Sets). The first case will be four identical copies of a single set of data. We'll deploy them in four independent MySQL databases and see what is needed to answer a query like "Find all the dups of X".
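
As a rough illustration of the toy case, the sketch below imitates the map and reduce steps in plain Python rather than on Hadoop itself. It assumes a hypothetical `specimen` table with a shared `dup_key` column, and uses in-memory SQLite databases as stand-ins for the four MySQL instances; none of these names come from the actual test data.

```python
import sqlite3
from collections import defaultdict

# Assumption: every database has a `specimen` table whose `dup_key`
# column (e.g. collector + collection number) identifies duplicates.


def make_db(owner, rows):
    """Create an in-memory database standing in for one MySQL instance."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE specimen (dup_key TEXT, taxon TEXT)")
    conn.executemany("INSERT INTO specimen VALUES (?, ?)", rows)
    conn.commit()
    return owner, conn


def map_phase(owner, conn):
    """Map: emit (dup_key, (owner, taxon)) for every record in one database."""
    for dup_key, taxon in conn.execute("SELECT dup_key, taxon FROM specimen"):
        yield dup_key, (owner, taxon)


def reduce_phase(pairs):
    """Reduce: collect all emitted values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def find_dups(databases, key):
    """Answer 'Find all the dups of X' across all databases."""
    pairs = (pair for owner, conn in databases
             for pair in map_phase(owner, conn))
    return reduce_phase(pairs).get(key, [])


# Four identical copies of a single (made-up) data set.
rows = [("Smith 1234", "Quercus alba"), ("Jones 77", "Acer rubrum")]
databases = [make_db(f"db{i}", rows) for i in range(1, 5)]

dups = find_dups(databases, "Smith 1234")
print(dups)
```

With four identical copies, the query returns one entry per database, each tagged with the database it came from. A real deployment would need to decide how `dup_key` is derived when the copies are not identical.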

Proposed minimal working prototype for July 12, expressed as queries:

1. Give me all the duplicates of <list>
2. Launch annotation pushes of <list>
3. Accept/reject annotation pushes of <list>



Todo:

Refactor Paul's grand use case diagram (Image:Collection_manager_too_late_at_night.png)

Specify the interface for queries 1, 2, and 3 for Kansas

The Hadoop model implies(?) a need to extend the schema of client specimen management systems with an attribute that identifies whose record is in view, since a view will include all the duplicate records delivered by the FPNET.
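
One way to picture such an attribute is sketched below. The record type, its fields, and the `owner` attribute are all hypothetical names chosen for illustration; they are not taken from any existing client schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpecimenRecord:
    dup_key: str  # assumption: shared key identifying a duplicate set
    taxon: str
    owner: str    # hypothetical added attribute: whose record this is


def split_by_owner(records, local_owner):
    """Separate the client's own records from duplicates delivered by the network."""
    local = [r for r in records if r.owner == local_owner]
    remote = [r for r in records if r.owner != local_owner]
    return local, remote


# Made-up view containing one local record and two delivered duplicates.
view = [
    SpecimenRecord("Smith 1234", "Quercus alba", "UMB"),
    SpecimenRecord("Smith 1234", "Quercus alba", "Kansas"),
    SpecimenRecord("Smith 1234", "Quercus alba", "Harvard"),
]
local, remote = split_by_owner(view, "UMB")
print(len(local), len(remote))
```

Without an attribute of this kind, a client looking at the integrated view could not tell its own record apart from the duplicates held elsewhere.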

Might need to restrict which meanings of "duplicate" and "annotation" we support in the prototype.

Time to start trying for an abstract model of duplicate? of annotation?

(Zhimin: If Hadoop MapReduce is a good platform, models of duplicate or annotation could include requests to execute a computation.)


Links:

Hadoop: http://hadoop.apache.org/core/ http://en.wikipedia.org/wiki/Hadoop

MapReduce: http://labs.google.com/papers/mapreduce.html http://en.wikipedia.org/wiki/MapReduce

HBase: http://wiki.apache.org/hadoop/Hbase