Zhimin gave a demo of Apache Hadoop as a possible platform for prototyping Filtered Push.
Demonstrated that a DataNode is able to query a database for data as part of a job.
Hadoop seems a good platform for distributing pieces of work.
The JobTracker does distribution; the JobClient frames the question.
Bob's question: can Hadoop handle notification of return of partial results?
Frame the question as a query [loosely]:
SELECT Barcode, Acronym, Collector_Number, Collector FROM [db]* WHERE genus = 'Rubus' GROUP BY Collector_Number, Collector;
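A sketch of how this query might decompose into a map and a reduce step, in the Hadoop style: the map filters on genus and keys records by (collector_number, collector); the reduce collects the specimens in each group. This is an in-memory simulation for discussion, not Hadoop code; the sample rows and field names are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical specimen rows; field names follow the query above.
rows = [
    {"barcode": "B001", "acronym": "HUH", "collector_number": "123",
     "collector": "Smith", "genus": "Rubus"},
    {"barcode": "B002", "acronym": "HUH", "collector_number": "123",
     "collector": "Smith", "genus": "Rubus"},
    {"barcode": "B003", "acronym": "NY", "collector_number": "45",
     "collector": "Jones", "genus": "Prunus"},
]

def map_phase(row):
    # Filter (WHERE genus = 'Rubus') and emit a (group key, value) pair.
    if row["genus"] == "Rubus":
        yield ((row["collector_number"], row["collector"]),
               (row["barcode"], row["acronym"]))

def reduce_phase(key, values):
    # GROUP BY Collector_Number, Collector: gather the matching specimens.
    return {"collector_number": key[0], "collector": key[1],
            "specimens": list(values)}

# Shuffle step: group mapper output by key, as Hadoop does between map and reduce.
groups = defaultdict(list)
for row in rows:
    for key, value in map_phase(row):
        groups[key].append(value)

results = [reduce_phase(k, v) for k, v in groups.items()]
```

In a real deployment the map would run at each DataNode (querying its local database, as in the demo) and only the keyed pairs would travel over the network.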
Clearly this needs a JobTracker able to report, in a manner defined by a per-job policy, the status of progress in the job - and to make decisions about when to cancel jobs or parts of jobs (unresponsive network nodes).
The abstraction of a JobTracker being able to track pieces of a job appears important to FilteredPush. Agents in the network need to take action on the basis of returned results, so the JobTracker needs to be able to continue tracking a job as partial results are returned.
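The per-piece tracking and cancellation policy described above might look something like the following sketch. All class and method names here are assumptions for discussion, not Hadoop's actual JobTracker API; the timeout stands in for a per-job policy on unresponsive nodes.

```python
import time
from dataclasses import dataclass, field

@dataclass
class PieceStatus:
    # One piece of a job, assigned to one network node.
    node: str
    done: bool = False
    last_heartbeat: float = field(default_factory=time.time)

class JobStatusTracker:
    """Hypothetical tracker for pieces of a distributed job."""

    def __init__(self, nodes, timeout=60.0):
        self.pieces = {n: PieceStatus(node=n) for n in nodes}
        self.timeout = timeout  # policy knob: when a node counts as unresponsive
        self.partial_results = []

    def report(self, node, result):
        # A node returns its piece: record the partial result immediately.
        piece = self.pieces[node]
        piece.last_heartbeat = time.time()
        piece.done = True
        self.partial_results.append(result)

    def progress(self):
        # Fraction of pieces complete, reportable to agents at any time.
        done = sum(1 for p in self.pieces.values() if p.done)
        return done / len(self.pieces)

    def unresponsive(self, now=None):
        # Candidates for cancellation under the job's timeout policy.
        now = time.time() if now is None else now
        return [p.node for p in self.pieces.values()
                if not p.done and now - p.last_heartbeat > self.timeout]
```

The point of the sketch is that partial results accumulate and are queryable while other pieces are still outstanding, rather than only at job completion.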
Interesting use case: rows become available at data source nodes while a long-running job is in progress.
The JobClient interfaces seem too Hadoop-specific - perhaps generalize to "Prepare Job". The JobTracker needs a method to report partial results.
Towards a Workbench interface - one possibility: the Workbench hands JobClient->Prepare a SQL query, or the Workbench hands the JobClient a WHERE clause and a Reduce. The JobTracker returns an XML-formatted result set.
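The generalized "Prepare Job" idea could be sketched roughly as below: a prepare call that accepts either a full SQL query or a WHERE clause plus a Reduce, returning a job identifier, and a serializer that renders a (possibly partial) result set as XML for the Workbench. Every name here is an assumption for illustration, not an actual Filtered Push or Hadoop interface.

```python
import uuid
import xml.etree.ElementTree as ET

class JobClient:
    """Hypothetical generalized JobClient with a Prepare entry point."""

    def __init__(self):
        self.jobs = {}

    def prepare(self, sql=None, where=None, reduce_fn=None):
        # Accept either a SQL query or a WHERE clause plus a Reduce.
        if sql is None and (where is None or reduce_fn is None):
            raise ValueError("need a SQL query, or a WHERE clause and a Reduce")
        job_id = str(uuid.uuid4())
        self.jobs[job_id] = {"sql": sql, "where": where,
                             "reduce": reduce_fn, "results": []}
        return job_id

def results_as_xml(job_id, rows):
    # Serialize a (possibly partial) result set as XML for the Workbench.
    root = ET.Element("resultSet", attrib={"jobId": job_id})
    for row in rows:
        rec = ET.SubElement(root, "record")
        for name, value in row.items():
            ET.SubElement(rec, name).text = str(value)
    return ET.tostring(root, encoding="unicode")
```

Returning a job identifier from prepare, rather than blocking on the result, is what makes the later polling for partial results possible.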
Query agents like the Workbench need at least to be able to ask whether there are updates to a network query (known by an identifier) that is not yet known to be complete.
The result set from a query could live in a local network node, be updated from the running job, and be queried for individual records by the Workbench (as in the original model); or the result set from the network query could be passed on to the Workbench with periodic updates from the network (until the job is done), with the Workbench handling questions about individual records internally.
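A minimal sketch of the polling protocol implied above: the Workbench asks, by query identifier, for records it has not yet seen, along with a completion flag. The node-side store and all names are assumptions for illustration; either placement of the result set (local node or Workbench) could sit behind this same interface.

```python
class NetworkQueryStore:
    """Hypothetical node-side store for the result set of a network query."""

    def __init__(self):
        self.records = {}   # query_id -> list of records received so far
        self.complete = {}  # query_id -> whether the job has finished

    def append(self, query_id, record, done=False):
        # Called as the running job delivers rows for this query.
        self.records.setdefault(query_id, []).append(record)
        self.complete[query_id] = done

    def updates_since(self, query_id, seen):
        # Return records beyond the 'seen' count, plus completion status.
        rows = self.records.get(query_id, [])
        return rows[seen:], self.complete.get(query_id, False)

# Workbench-side usage: poll by identifier until the job reports complete.
store = NetworkQueryStore()
store.append("query-42", {"barcode": "B001"})
store.append("query-42", {"barcode": "B002"}, done=True)

seen = 0
new, done = store.updates_since("query-42", seen)
seen += len(new)
```

Keeping only a seen-count on the Workbench side keeps the agent stateless with respect to the result set itself, which fits the case where new rows arrive at source nodes mid-job.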
Next meeting in third week of May, probably May 19th or 20th.