2010Nov30

User: David Lowery | User: Bob Morris | User: Zhimin Wang | User: BertramLudaescher | User: Lei Dou



Reports

  1. UCDavis (Lei and Bertram)
    1. Working on the Kepler curation package, especially Google account authorization using OAuth, so that users are not asked to hand out their credentials (username and password); see the sketch after this list.
    2. Built a draft of the FilteredPush website based on Google Sites [1].
  2. Zhimin: with Bob, working on the details of the use case analysis.
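
A minimal sketch of the OAuth 2.0 authorization-code flow mentioned above, using plain java.net HTTP against Google's legacy OAuth endpoints. The client id/secret, scope, and class name are illustrative only, not taken from the actual Kepler curation package:

 import java.io.BufferedReader;
 import java.io.InputStreamReader;
 import java.io.OutputStream;
 import java.net.HttpURLConnection;
 import java.net.URL;
 import java.net.URLEncoder;
 
 /**
  * Minimal OAuth 2.0 authorization-code flow: the user grants access in a
  * browser and we exchange the resulting code for a token, so the
  * application never sees the user's Google password.
  */
 public class OAuthSketch {
     // Hypothetical client registration; real values would come from the
     // Google API console for the curation package.
     static final String CLIENT_ID = "example-client-id";
     static final String CLIENT_SECRET = "example-client-secret";
     static final String REDIRECT_URI = "urn:ietf:wg:oauth:2.0:oob"; // out-of-band: user pastes the code
     static final String AUTH_URL = "https://accounts.google.com/o/oauth2/auth";
     static final String TOKEN_URL = "https://accounts.google.com/o/oauth2/token";
 
     public static void main(String[] args) throws Exception {
         // Step 1: send the user to the consent page in a browser.
         String consent = AUTH_URL + "?response_type=code"
                 + "&client_id=" + URLEncoder.encode(CLIENT_ID, "UTF-8")
                 + "&redirect_uri=" + URLEncoder.encode(REDIRECT_URI, "UTF-8")
                 + "&scope=" + URLEncoder.encode("https://www.googleapis.com/auth/userinfo.email", "UTF-8");
         System.out.println("Open this URL, grant access, and paste the code:\n" + consent);
 
         // Step 2: read the one-time authorization code typed by the user.
         BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
         String code = in.readLine().trim();
 
         // Step 3: exchange the code for an access token; no password changes hands.
         String body = "grant_type=authorization_code"
                 + "&code=" + URLEncoder.encode(code, "UTF-8")
                 + "&client_id=" + URLEncoder.encode(CLIENT_ID, "UTF-8")
                 + "&client_secret=" + URLEncoder.encode(CLIENT_SECRET, "UTF-8")
                 + "&redirect_uri=" + URLEncoder.encode(REDIRECT_URI, "UTF-8");
         HttpURLConnection conn = (HttpURLConnection) new URL(TOKEN_URL).openConnection();
         conn.setDoOutput(true);
         conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
         OutputStream out = conn.getOutputStream();
         out.write(body.getBytes("UTF-8"));
         out.close();
 
         // The JSON response carries the access (and refresh) token.
         BufferedReader resp = new BufferedReader(new InputStreamReader(conn.getInputStream()));
         String line;
         while ((line = resp.readLine()) != null) System.out.println(line);
     }
 }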

Agenda

Discussion

http://firuta.huh.harvard.edu:9000/yr6blbD0Dz Participants: Bob, Zhimin, Lei, Bertram, ... ?

FPush architecture snapshot: http://www.etaxonomy.org/mw/FP2_Design


Is this our current thinking? If not, what is it? ;-)

http://www.etaxonomy.org/mw/File:TDWG2010_5minuteFP.pdf --> page 2

http://www.etaxonomy.org/mw/File:GSA_Morris_287-8_with_notes.pdf --> pages 25, 30, 31

Architecture components?

Analysis Server


Slide 25 of GSA_Morris_287-8_with_notes.pdf:

Consider a case where a client asks for data meeting a certain condition expressed, say, in a record-level metadata vocabulary like Darwin Core. The Triage component might determine that some or all of that data can be provided from a Global Knowledge store known to the entire network. It would make up and launch a query that retrieves that data. Then, using resources Triage held (or acquired) pending an answer---e.g. the identity of a message queue for answering the client---Triage would return the data to the client.

Some typical metadata catalogs:
- KNB / Metacat?
- SRB (iRODS?) / MCAT?
- GBIF cache?
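
A minimal sketch of that Triage decision, assuming the simple case of a query answerable from the Global Knowledge store. The GlobalKnowledgeStore, MessageQueue, and Triage names and interfaces are hypothetical, not the actual FPush API:

 import java.util.List;
 
 /** Hypothetical collaborator interfaces; names are illustrative only. */
 interface GlobalKnowledgeStore {
     boolean canAnswer(String dwcQuery);       // e.g. all referenced DwC terms are indexed
     List<String> retrieve(String dwcQuery);   // records meeting the condition
 }
 
 interface MessageQueue {
     void send(List<String> records);          // reply channel held for the client
 }
 
 /** Triage: answer from the global store if possible, else hand off to analysis. */
 public class Triage {
     private final GlobalKnowledgeStore store;
 
     public Triage(GlobalKnowledgeStore store) { this.store = store; }
 
     public void handle(String dwcQuery, MessageQueue replyQueue) {
         if (store.canAnswer(dwcQuery)) {
             // Simple case: make up and launch the query, return the data on
             // the message queue that was held pending an answer.
             replyQueue.send(store.retrieve(dwcQuery));
         } else {
             // Difficult case: route to an Analysis engine (see below).
             dispatchToAnalysis(dwcQuery, replyQueue);
         }
     }
 
     private void dispatchToAnalysis(String dwcQuery, MessageQueue replyQueue) {
         // Placeholder: hand the request to e.g. the taxonomic integrator.
     }
 }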

FPush-1 prototype:
- Brooklyn Botanical Garden (w/ lots of duplicate instances etc.)
- trivial Triage (because there was just one backend!?)
- single Analysis engine (SQL + M-Tree-based fuzzy match)

Global Names Architecture (GNA.org)

See http://community.gbif.org/pg/site_news/10700


The more difficult case is if Triage determines that there is no simple answer to the client's search request, but rather that further analysis has to be done. Example: the client's metadata specifies a certain taxonomy and DwC values for a taxon, but some data holders do not use that taxonomy, so no data would meet the query. Here an Analysis engine---call it the taxonomic integrator---would use some Thau-like process to mediate between the different taxonomies and offer data that meets this integrated view. The taxonomic integrator Analysis Engine is then responsible for packaging all those records and arranging for their return to the client as in the paragraph above.
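
A toy sketch of the name-mediation idea, assuming a simple articulation table between taxonomies. All names here are illustrative; a real Thau-like mediator reasons over concept relations (congruent, broader, narrower), which this does not attempt:

 import java.util.*;
 
 /**
  * Toy stand-in for a Thau-like taxonomy mediator: expand a query taxon
  * into the names that other taxonomies use for (part of) the same
  * concept, then query each holder with the name it understands.
  */
 public class TaxonomicIntegrator {
     // Hypothetical articulation table: query-taxonomy name ->
     // equivalent names in the data holders' taxonomies.
     private final Map<String, Set<String>> articulations = new HashMap<>();
 
     public void addArticulation(String queryName, String holderName) {
         articulations.computeIfAbsent(queryName, k -> new HashSet<>()).add(holderName);
     }
 
     /** All names to try against the holders, including the original. */
     public Set<String> expand(String queryName) {
         Set<String> names = new HashSet<>();
         names.add(queryName);
         names.addAll(articulations.getOrDefault(queryName, Collections.emptySet()));
         return names;
     }
 
     public static void main(String[] args) {
         TaxonomicIntegrator ti = new TaxonomicIntegrator();
         // A holder may file the same concept under a different name.
         ti.addArticulation("Aster novae-angliae", "Symphyotrichum novae-angliae");
         System.out.println(ti.expand("Aster novae-angliae"));
     }
 }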

Analysis engines might be complex compositions---a perfect use for Kepler to hide the details of such Analysis from the client.

What about the complexity (long duration) of searches that involve reasoning, distributed catalogs, etc.?

Sometimes offline answers are OK.

FPush-2 system:
- Which KOS resources do we use? (was: Brooklyn Botanical in FPush-1) --> check w/ James & Paul --> six institutions expose data as DwC to get started
- Which search and analysis functions do we target first?

Fast+Slow model of query-answering:
- DataWarehouse + mediation approach!? (see the sketch below)
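
One way to read the Fast+Slow model, sketched under the assumption of a warehouse cache for the fast path and an asynchronous mediation job (with a ticket for offline delivery) for the slow path; all names are illustrative:

 import java.util.*;
 import java.util.concurrent.*;
 
 /**
  * Fast+Slow query answering: answer immediately from a data-warehouse
  * cache when possible; otherwise queue a (long-running) mediation job
  * and hand back a ticket so the answer can be delivered offline.
  */
 public class FastSlowAnswering {
     private final Map<String, List<String>> warehouse = new HashMap<>();  // precomputed answers
     private final ExecutorService slowPath = Executors.newSingleThreadExecutor();
     private final Map<String, Future<List<String>>> tickets = new ConcurrentHashMap<>();
 
     /** Fast path: cache hit returns at once; miss falls through to mediation. */
     public Object ask(String query) {
         List<String> cached = warehouse.get(query);
         if (cached != null) return cached;                // fast: warehouse answer
         String ticket = UUID.randomUUID().toString();
         tickets.put(ticket, slowPath.submit(() -> mediate(query)));
         return ticket;                                    // slow: answer arrives later
     }
 
     /** Later: redeem the ticket once the slow path has finished. */
     public List<String> redeem(String ticket) throws Exception {
         return tickets.get(ticket).get();
     }
 
     private List<String> mediate(String query) {
         // Placeholder for reasoning / distributed-catalog mediation.
         return Collections.singletonList("mediated answer for: " + query);
     }
 }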

Possible Action Items:
- Double-check w/ James & Paul re. (initial) target institutions
- Look at the GBIF cache view of these institutions: will this be enough info on those collections? If not, go to the source. -> problem of botanical duplicates (multiple copies/parts of the original plant specimen are catalogued and sent to different [museum] collections)
- Need a good example for using a Kepler wf as an analysis engine (see the sketch at the end of these notes), e.g.:
  -- periodically executing a Kepler wf to (re-)analyse (new) collections, clustering to identify possible duplicates
  -- NSF ADBC initiative to digitize biological collections

http://www.nsf.gov/pubs/2010/nsf10603/nsf10603.htm 
(community effort)

--> continuous query / analysis wf to notify of new duplicate candidates etc
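
As a stand-in for the FPush-1 fuzzy matcher (which used SQL + an M-Tree index), here is a toy edit-distance pass over collector/number/date strings that a periodically executed Kepler workflow step could wrap. The records and threshold are made up for illustration:

 import java.util.*;
 
 /**
  * Toy duplicate-candidate detector: records whose collector+number+date
  * strings are within a small edit distance are flagged as possible
  * duplicates (e.g. parts of one plant specimen sent to different herbaria).
  */
 public class DuplicateCandidates {
 
     static int editDistance(String a, String b) {
         int[][] d = new int[a.length() + 1][b.length() + 1];
         for (int i = 0; i <= a.length(); i++) d[i][0] = i;
         for (int j = 0; j <= b.length(); j++) d[0][j] = j;
         for (int i = 1; i <= a.length(); i++)
             for (int j = 1; j <= b.length(); j++)
                 d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                         d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
         return d[a.length()][b.length()];
     }
 
     /** Pairs within the threshold are candidate duplicates for curation. */
     public static List<String[]> candidates(List<String> records, int threshold) {
         List<String[]> pairs = new ArrayList<>();
         for (int i = 0; i < records.size(); i++)
             for (int j = i + 1; j < records.size(); j++)
                 if (editDistance(records.get(i), records.get(j)) <= threshold)
                     pairs.add(new String[]{records.get(i), records.get(j)});
         return pairs;
     }
 
     public static void main(String[] args) {
         List<String> recs = Arrays.asList("Smith 1234 1987-05-01",
                                           "Smith 1234 1987-5-1",
                                           "Jones 99 1990-08-12");
         for (String[] p : candidates(recs, 3))
             System.out.println("candidate duplicate: " + p[0] + " <-> " + p[1]);
     }
 }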