2011Jan25

From Filtered Push Wiki


Etherpad for compiling meeting notes: http://firuta.huh.harvard.edu:9000/FP-2011Jan25



User: David Lowery | User: Bob Morris | User: Zhimin Wang | User: BertramLudaescher | User: James Hanken | User: Paul J. Morris | User: Greg Riccardi | User: Chris Jordan



Reports

  1. Lei and Bertram (UC Davis)
    1. Developed an e-bird duplicate detection demo workflow that uses the clustering actor to find duplicate bird observation events occurring close together in time and place.
    2. Finished documenting the actors and workflows in the kepler/curation package, which is ready for release. (This package will be enriched continually as our understanding of data curation problems and use cases deepens.)
    3. Reading a paper on lenses for bidirectional data transformation.
  2. Zhimin
    1. Added autocomplete to the field-based search interface of the demo.
    2. Added general search to the demo.
  3. Paul
    1. Extracted 800 random MCZ fish records, provided them to Bertram.
    2. Of possible interest is part of the documentation (dealing with annotations) that James and I developed in Australia for IPNI2: http://www.aa3sd.net/IPNI2/index.php?title=Category:Editorial_Process_Use_Cases
    3. Produced an alternative view of the model that Zhimin, Bob and I have been working on ( Image:3nodes.png and Image:NetworkComponents.png ), with proposals to illustrate points for discussion in: Image:Elaborated component model of a three node network.png, Image:Annnotation Sequence Proposed in network.png and Image:Sequence diagram asking for response.png. These go along with a proposed List of Component Responsibilities
  4. Bob
    1. Examined Paul's new UML diagrams
    2. Re-examined AO and entered email dialog with Paolo Ciccarese
    3. Studying the Open Archival Information System for another project, but began thinking it might lead to a good set of design and deployment documents for persistent annotation stores associated with an FP deployment. "The Open Archival Information System (OAIS) reference model identifies the components of a comprehensive system to manage digital content. OAIS is an international standard (ISO 14721:2003) that all managed repositories are encouraged to adopt. It describes the roles of people involved in the archival endeavor and defines the major aspects of a well-functioning archive." From the Inter-university Consortium for Political and Social Research. It's not a platform; it is a standard for specifying how you are managing digital resources for persistence. There are a number of articles, tutorials, and analyses in the Digital Library community. I am just at the beginning...

Agenda

  1. Morphbank.
  2. Zhimin, demonstration of general search in the web application.
  3. Nature of queries on the network: what kinds of query does the network support? SPARQL, TAPIR, or a TAPIR-like schema?
  4. Progress on architecture.
  5. Second project programmer position.

Meeting Notes

Attending: Zhimin Wang, Bob Morris, Jim Hanken, Paul Morris, David Lowery, Greg Riccardi, Chris Jordan.

Agenda

1) Morphbank

Greg: A collaboration meeting for Morphbank/Morphster/Specify is scheduled for later this week and will discuss transporting annotations. Wanted an update on the annotation ontology and on the transport and distribution capabilities of Filtered Push.

Bob: See: http://etaxonomy.org/mw/AnnotationOntology http://etaxonomy.org/svn/FP/FP-Network/trunk/design/ontologies/ao/aod.owl Starting a discussion with Paolo. In our case we are annotating a piece of data that comes along with the annotation, not an external data object. We are modeling this as minor subclasses of his classes, and are still verifying whether these are the appropriate extensions.

Greg: Transport as XML? Bob: Standard OWL description as RDF/XML.

Greg: Where in the process of development for infrastructure are you?

Bob: Producing UML diagrams, particularly sequence diagrams for a specific use case, informing requirements.

Paul: Ending a cycle of design work, about to start production of a three node network.

Discussion: Initial step of point to point transport of annotation documents, later adding in injection of messages containing annotations to FP network, and use of FP network for queries to discover annotations. Scenario: view specimen data in Specify, pull related annotations from FP network.

2) Zhimin, demonstration of general search in the web application.

Desktop share via webex.

Basic search: Search on "Oak", 10,000 matching results, 10 per page, view record details, then able to annotate individual fields for records.

Find Duplicates Search: Collectorname (now has autocomplete), can use exact or fuzzy match.

Two styles of search.

Jim: Find records for a particular species

Basic Search on: "Thorius"

Example of an annotation: Latitude 100.

Jim: An attribute for who made the annotation. There is no immediate way to say who is right or wrong. Different field collectors could go to the same place and report different elevations. We can't immediately identify who is right, only present the conflicting data.

Bob: The annotation ontology includes a basis for asserting why a value was proposed in an annotation.

Bob: In our heads, but not in the design documents yet: How something is synthesised could involve policy as well as who made the annotation.

Jim: When people are using data in publications (in taxonomy), they go back to the original data, and are responsible for assessing its validity, with cautions about using subsequently added/inferred data.

Bob: Annotations allow tracing the sequence of opinions. Ontology needs to be expressive enough to query these sorts of traces on the data.
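The idea of tracing a sequence of opinions, each carrying who made it and on what basis, can be illustrated with a minimal sketch. This is not the annotation ontology itself: the class, its field names, and the example values are all hypothetical and invented for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Annotation:
    """One opinion about one field of one record (all names are hypothetical)."""
    record_id: str
    field_name: str
    proposed_value: str
    annotator: str   # who made the annotation
    basis: str       # why the value was proposed
    sequence: int    # order in which the opinion was recorded


def opinion_trace(annotations: List[Annotation], record_id: str,
                  field_name: str) -> List[Annotation]:
    """Return every opinion on one field, oldest first, without picking a winner."""
    relevant = [a for a in annotations
                if a.record_id == record_id and a.field_name == field_name]
    return sorted(relevant, key=lambda a: a.sequence)


# Invented example data: two collectors report different elevations.
annotations = [
    Annotation("rec-1", "elevation", "2100 m", "collector B", "altimeter reading", 2),
    Annotation("rec-1", "elevation", "1950 m", "collector A", "topographic map", 1),
    Annotation("rec-1", "latitude", "17.2", "collector A", "GPS", 3),
]

for a in opinion_trace(annotations, "rec-1", "elevation"):
    print(a.sequence, a.annotator, a.proposed_value, "(" + a.basis + ")")
```

The sketch deliberately presents the conflicting values side by side rather than resolving them, in line with Jim's point that the network cannot immediately decide who is right.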

Bob: There is no such thing as the use of data just by humans, except when working solely on paper or clay or stone. There is always software involved...

Jim: How would people access this?

Bob: We could enable MCZ programmers to have a user interface that would allow people to create annotations of MCZ specimen data, and then inject these into the FP network.

3) Nature of queries on the network: what kinds of query does the network support? SPARQL, TAPIR, or a TAPIR-like schema?

Paul: Currently supporting several sorts of queries: Solr, SQL, exact match, fuzzy match...

Bob: Two separate things: what is the query, and how is the query agent hoping to influence the way the query is answered? This is partly about capability discovery. Perhaps there is a metaquery level at which clients can discover capabilities, or else someone has to discover capabilities.

Take fuzzy matching as an example, say fuzzy match by algorithm xyz. Let the query agent specify: "I wish you would make a fuzzy match using xyz, falling back to another algorithm if necessary, but at least tell me how you did it." This is similar to the question of who is going to do the data integration.
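The fuzzy match example (request a preferred algorithm, accept a fallback, and be told which one was actually used) might look like the following minimal sketch. The algorithm names, the `SUPPORTED` table, and the `answer` function are hypothetical; the fuzzy matcher uses Python's standard `difflib` purely as a stand-in for whatever algorithm a node really offers.

```python
import difflib
from typing import Callable, Dict, List, Tuple

Matcher = Callable[[str, List[str]], List[str]]


def exact_match(term: str, candidates: List[str]) -> List[str]:
    return [c for c in candidates if c == term]


def ratio_match(term: str, candidates: List[str]) -> List[str]:
    # get_close_matches ranks candidates by SequenceMatcher similarity.
    return difflib.get_close_matches(term, candidates, n=5, cutoff=0.8)


# Algorithms this hypothetical node advertises (its "capabilities").
SUPPORTED: Dict[str, Matcher] = {
    "exact": exact_match,
    "difflib-ratio": ratio_match,
}


def answer(term: str, candidates: List[str],
           preferred: List[str]) -> Tuple[str, List[str]]:
    """Use the first preferred algorithm the node supports.

    Falls back to exact matching if none is supported, and always
    reports the name of the algorithm that was actually used.
    """
    for name in preferred:
        if name in SUPPORTED:
            return name, SUPPORTED[name](term, candidates)
    return "exact", exact_match(term, candidates)


collectors = ["Morris, P. J.", "Morris, P.J.", "Hanken, J."]
# The client asks for "xyz" first; this node does not support it.
used, matches = answer("Morris, P. J.", collectors, ["xyz", "difflib-ratio"])
print(used, matches)
```

Returning the algorithm name alongside the results is the "at least tell me how you did it" part; the `SUPPORTED` table is one possible shape for what a capability-discovery query could expose.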

Problems that triage needs to manage - finding the relevant specialist for dealing with results.

Zhimin: Two issues: the semantics of the query (my question) and requests for tuning the query (Quality of Service).

Requirement for FP Query messages to specify 1) query, 2) quality of service parameters (performance, algorithms, integration, format, constraint systems).
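As a rough illustration of this requirement, a query message could carry the query itself plus a separate block of quality of service parameters. The sketch below is only one possible shape: every class name, field name, and example value is an assumption made for illustration, not part of any FP design document.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class QualityOfService:
    """Hypothetical tuning requests that accompany a query."""
    preferred_algorithms: List[str] = field(default_factory=list)  # in preference order
    response_format: Optional[str] = None      # e.g. "rdf/xml"
    integrate_results: bool = False            # ask the network to integrate if it can
    max_seconds: Optional[float] = None        # performance hint
    constraint_language: Optional[str] = None  # constraint system for the response


@dataclass
class FPQueryMessage:
    """Hypothetical message: 1) the query, 2) quality of service parameters."""
    query: str
    query_language: str
    qos: QualityOfService = field(default_factory=QualityOfService)

    def validate(self) -> None:
        if not self.query.strip():
            raise ValueError("query must not be empty")
        if self.qos.max_seconds is not None and self.qos.max_seconds <= 0:
            raise ValueError("max_seconds must be positive")


msg = FPQueryMessage(
    query='collector ~ "Smith"',        # invented query syntax
    query_language="fp-field-search",   # invented language name
    qos=QualityOfService(preferred_algorithms=["xyz", "exact"],
                         response_format="rdf/xml"),
)
msg.validate()
print(msg)
```

Keeping the quality of service block separate from the query text means a node that ignores the tuning requests can still answer the question itself.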

Bob: Example: Agent specifies constraint language for response as integrated object (result set in a schema using these terms, here's a document, fill in the blanks).

List of specific things: Formats. Client: "I require". Client: "Show me what you have and I'll choose" [impulse buying]. Fuzzy match, exact match.

4) Progress on architecture.

Bob: Comment on Paul's sequence diagram about the 3 node query (that one isn't mine; it is the one Zhimin has been working on, reflecting our discussions).

a. Note asks "How does client know where to get return (in the asynchronous query case). Is it hidden behind the API?" I say: Yes and no. For lightweight clients there should be a simple API for subscription, configurable for (but independent of) the messaging system. But the messaging system should also be exposed to clients that wish to hit the messaging system directly. This means that the messaging system will have to support the access control system used by the network. Possibly we make the API a deployment parameter, with the first version being exactly the JMS API. I take your question to be: "Should we have our own API that abstracts lots of messaging APIs and remain future proof by only having to write wrappers?" I suggest that we declare the API to be JMS and, after the initial release, see how much of it we are really using. That subset is possibly our wrapper...

The trouble here is the requirement that client to network traffic be encrypted. We have also discussed message queues being per user, with access authorized for that user only.

We can look at: http://www.ibm.com/developerworks/xml/tutorials/x-secmes1/

Discussion of requirements for data security.

Key issue, access of message queues by unauthorized person.
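The per-user queue idea can be sketched in a few lines. This is an in-memory toy under stated assumptions, not JMS and not a proposed implementation: the class names are hypothetical, authentication is assumed to have happened elsewhere, and encryption on the wire is out of scope.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class QueueAccessError(PermissionError):
    """Raised when a user tries to read a queue they do not own."""


class PerUserQueues:
    """Hypothetical broker that keeps one message queue per user."""

    def __init__(self) -> None:
        self._queues: Dict[str, Deque[str]] = defaultdict(deque)

    def deliver(self, owner: str, message: str) -> None:
        # The network places a response on the owner's queue.
        self._queues[owner].append(message)

    def receive(self, authenticated_user: str, owner: str) -> str:
        # Only the owner of a queue may read from it.
        if authenticated_user != owner:
            raise QueueAccessError(
                f"{authenticated_user} may not read the queue of {owner}")
        # Raises IndexError if the owner's queue is empty.
        return self._queues[owner].popleft()


broker = PerUserQueues()
broker.deliver("alice", "query result 1")
print(broker.receive("alice", "alice"))
```

In a real deployment the check would be delegated to whatever access control the messaging system and the network share, which is why the messaging system has to support the network's access control if it is exposed directly to clients.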

Goals: 1) Don't interfere with security mechanisms supported in the JMS model (granularity of defense of rss feeds, email, wire, end to end encryption on wire). 2) Don't interfere with authentication mechanisms for asking questions.

b. Why do you specify all data in motion be encrypted? Is privacy a generic requirement here, or are you trying to get at something else? It is a generic requirement. We know that a subset of the data will be locality information for endangered species, for which we have said that only participants in the network will be allowed access, thus there is a requirement that all data in motion be encrypted.

c. Even in the predecessor diagram, I wondered what exactly we mean by returning query-plan objects. Is this something done routinely in the DBMS community? Does it extend to SPARQL? Where can I learn about it?

Zhimin: In demo, chain of commands.

Bob: The diagram suspends disbelief about whether, if you have a plan, you can execute it. How general are the tools for generating and executing query plans? Are they tied to relational databases?

Zhimin: Similar issues for SPARQL queries. The query includes a plan (e.g. joins); the physical/optimiser plan specifies ordering, ....

Bob: Discuss with Bertram.

d. Note asks: "What if UMB has both local data and GBIF cache copy. Who has to integrate it?" My thought is that it shouldn't matter, though "I can integrate" should maybe be a communicable ability of a node. If a query node can integrate things on its own, the query could signal that capability, perhaps with "but I wish you would integrate if you can". If it can't, and data comes from several nodes, then the query node needs an integration service somewhere anyway. I think the physical plan, logical plan, and integration should be delegated to a service (variously Query service or PullService).
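The decision about who integrates can be written down as a small rule. The sketch below is one reading of the reasoning above, under the assumption that the query carries two flags ("I can integrate" and "I wish you would integrate if you can") and that the network knows whether an integration service is reachable. The names `Integrator` and `choose_integrator` are hypothetical.

```python
from enum import Enum


class Integrator(Enum):
    CLIENT = "client"    # the query node integrates the results itself
    SERVICE = "service"  # a network service (e.g. a query or pull service) integrates


def choose_integrator(client_can_integrate: bool,
                      wishes_network_integration: bool,
                      service_available: bool) -> Integrator:
    """Decide who integrates results that come from several sources."""
    # Use the service when the client asks for it or cannot integrate itself.
    if service_available and (wishes_network_integration or not client_can_integrate):
        return Integrator.SERVICE
    # Otherwise the client integrates, if it is able to.
    if client_can_integrate:
        return Integrator.CLIENT
    raise LookupError("no party is able to integrate the results")


# "I can integrate, but I wish you would integrate if you can."
print(choose_integrator(True, True, service_available=True))
print(choose_integrator(True, True, service_available=False))
```

The error case corresponds to a query node that cannot integrate when no service exists, which is the situation in which an integration service is needed somewhere anyway.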

5) Second project programmer position.

Jim: Add an REU supplement.