Maureen demonstrated the Specify UI for FP find-duplicate queries by collector name and number. The Form View of the Workbench was used to launch a query, and the results were displayed in a results-chooser dialog. (Screenshots later.) There is a race condition in obtaining a single file for Hadoop results when more than one query is launched nearly simultaneously; Zhimin will fix that for the demo.
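One common way to avoid this kind of collision is to give every query its own results file and to publish it with an atomic rename. The sketch below illustrates the idea only; it is not necessarily the fix Zhimin will apply, and the function names and file-name pattern are hypothetical.

```python
import os
import tempfile
import uuid


def result_path_for_query(results_dir, query_id=None):
    """Return a results path unique to one query (hypothetical naming scheme)."""
    query_id = query_id or uuid.uuid4().hex
    return os.path.join(results_dir, f"results-{query_id}.txt")


def write_results(results_dir, lines):
    """Write one query's results without clobbering another query's file."""
    path = result_path_for_query(results_dir)
    # Write to a temporary file first, then rename, so a reader never
    # sees a half-written results file.
    fd, tmp_path = tempfile.mkstemp(dir=results_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write("\n".join(lines))
    os.replace(tmp_path, path)
    return path
```

Two queries launched at nearly the same time then write to different paths, so neither can pick up the other's output.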
We discussed caching general query results on the local node because of the overhead associated with network latency and with launching Hadoop jobs.
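A minimal sketch of such a local cache, assuming results can be keyed by collector name and number and that a fixed expiry time is acceptable. The class name, the `run_query` callable, and the five-minute default are all assumptions made for illustration.

```python
import time


class QueryCache:
    """Cache query results on the local node for a limited time."""

    def __init__(self, run_query, ttl_seconds=300.0, clock=time.monotonic):
        self._run_query = run_query  # callable that performs the network query
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}           # key -> (time stored, results)

    def get(self, collector_name, collector_number):
        # Normalize the key so trivial differences in case or spacing still hit.
        key = (collector_name.strip().lower(), collector_number.strip())
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        # Miss or expired entry: pay the network and job-launch cost once.
        results = self._run_query(collector_name, collector_number)
        self._entries[key] = (now, results)
        return results
```

Repeated lookups for the same collector name and number within the expiry window are answered locally, without launching another job.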
We discussed the generalization of map-reduce algorithms and whether we need messages that define algorithms. We concluded that for the prototype the algorithms will be implemented as network internals, and that we need messages to discover and choose from the existing algorithms. Somewhere in Specify we need to include a means to set the preferred default algorithm.
Zhimin mentioned that the algorithms for map-reduce could be Java classes. Eventually it would be nice to have messages that could define algorithms that are not tied to such a specific implementation.
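A rough sketch of what discover-and-choose messages might look like, written in Python for brevity even though the prototype algorithms may be Java classes. The message types (`list_algorithms`, `run`), the field names, and the registry class are hypothetical; no message format has been agreed.

```python
class AlgorithmRegistry:
    """Holds the algorithms built into the network and answers messages about them."""

    def __init__(self):
        self._algorithms = {}  # name -> (description, callable)
        self._default = None

    def register(self, name, description, func, default=False):
        self._algorithms[name] = (description, func)
        if default or self._default is None:
            self._default = name

    def handle(self, message):
        kind = message.get("type")
        if kind == "list_algorithms":
            # Discovery: report what exists and which one is the default.
            listing = [
                {"name": name, "description": desc}
                for name, (desc, _) in sorted(self._algorithms.items())
            ]
            return {"type": "algorithm_list", "algorithms": listing,
                    "default": self._default}
        if kind == "run":
            # Choice: use the requested algorithm, else fall back to the default.
            name = message.get("algorithm") or self._default
            if name not in self._algorithms:
                return {"type": "error", "reason": f"unknown algorithm: {name}"}
            _, func = self._algorithms[name]
            matches = func(message["query"], message["candidates"])
            return {"type": "result", "algorithm": name, "matches": matches}
        return {"type": "error", "reason": f"unknown message type: {kind}"}


def exact_match(query, candidates):
    """Trivial placeholder algorithm: exact string equality."""
    return [c for c in candidates if c == query]
```

The algorithms stay internal to the network; a client only names one, or omits the name and gets the default.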
We discussed fuzzy matching approaches for mappings, and Bob mentioned using a probability matrix of likely misspellings for matching collector names. We should explore Open Office's auto-correct lists.
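One way to use a matrix of likely misspellings is as substitution costs in an edit distance, so that probable confusions cost less than arbitrary ones. The sketch below assumes the matrix is a mapping from (intended letter, typed letter) to a probability and converts each probability p to a cost of 1 - p; that conversion, the function names, and the example probability are illustrative assumptions, not measured values.

```python
def costs_from_probabilities(probabilities):
    """Turn misspelling probabilities into substitution costs (cost = 1 - p)."""
    return {pair: 1.0 - p for pair, p in probabilities.items()}


def weighted_edit_distance(a, b, sub_cost, default_sub=1.0, indel=1.0):
    """Edit distance in which likely letter confusions are cheap substitutions."""
    prev = [j * indel for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [i * indel]
        for j, cb in enumerate(b, 1):
            if ca == cb:
                substitution = 0.0
            else:
                substitution = sub_cost.get((ca, cb), default_sub)
            cur.append(min(prev[j] + indel,              # delete ca
                           cur[j - 1] + indel,           # insert cb
                           prev[j - 1] + substitution))  # substitute ca -> cb
        prev = cur
    return prev[-1]


# Illustrative value only: 'i' typed as 'y' is assumed to be a common slip.
example_costs = costs_from_probabilities({("i", "y"): 0.8})
```

With these example costs, "smith" is closer to "smyth" than to "smath", so the likelier misspelling ranks higher among candidate collector names.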
Paul suggested that the caching mechanism on the local nodes should have an adaptive learning flavor. For example, a Specify Workbench user might begin a session by declaring an intent to work with taxon "t", but the user might also just enter a collector name and collector number and the caching mechanism would infer the taxa of interest based on the next queries entered and the results accepted.
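A very small sketch of the inference step, assuming each accepted result carries a taxon name and that simple counting is enough to rank the taxa of interest. The class, the weight given to a declared taxon, and the prefetch limit are hypothetical.

```python
from collections import Counter


class TaxonInterestTracker:
    """Guess which taxa a session cares about from the results the user accepts."""

    def __init__(self, declared_taxon=None, declared_weight=3):
        self._counts = Counter()
        if declared_taxon:
            # A declared intent counts as several accepted results up front.
            self._counts[declared_taxon] += declared_weight

    def record_accepted(self, taxon):
        """Call whenever the user accepts a result belonging to this taxon."""
        self._counts[taxon] += 1

    def taxa_to_prefetch(self, limit=2):
        """Return the taxa most worth caching next, most frequent first."""
        return [taxon for taxon, _ in self._counts.most_common(limit)]
```

The caching mechanism could ask the tracker which taxa to warm up after each accepted result, whether or not the user declared an intent at the start of the session.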
To do for 2009Feb12:
Zhimin will fix the file name problem with the Hadoop interface and, if time allows, work on the fuzzy matching algorithm
Maureen will get the infrastructure in place for Bob's demo in Denmark by early next week
Paul will get Zhimin a copy of the collectors
Paul and Maureen will encourage James to get the data set from Brooklyn