2011 May 19: Meeting at UC Davis
- Mapping Problem
- Vision for workflows
- Review Sequence Diagrams
- Review Use Case Diagrams?
- Difference between the FilteredPush Network Software and an Instance of a FilteredPush network.
- Shall we have a name for the instance we will create?
- Who is going to work on which parts of the instance or the framework?
- Strawman list of terms for network (annotations, specimen data)
- Priorities and timelines
Filtered Push Team Meeting, 2011 May 19, UC Davis
Present: Maureen, Bob, Lei, Tim, Bertram, Paul, James.
Name the instance of the FP network we will create. Apple Core Network?
Strawman list of terms for network (terms used in the message bodies: annotations, specimen data)
Need human readable list and definitions.
- set of things to transport in find-duplicates: this will be discussed during the SPNHC meeting, reviewed later, and made available.
- Start with AppleCore defined terms.
- core classes of annotations, e.g. what goes in a new-determination annotation, and what goes in a new-georeference annotation
- taxon name (is the name atomic?), authorship as a separate concept, do we need date determined...
- new georeference annotation.
- duplicate set member annotation.
- Nomenclatural change annotation.
what is the determination in Darwin Core? what goes in each of the dwc fields is not clearly defined. what goes in internal databases is represented differently than what is displayed on the web. the set of terms needs to be a closed set.
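A closed set of terms could be enforced mechanically at message ingest; a minimal sketch in Python, where the term names are illustrative placeholders and not an agreed vocabulary:

```python
# Hypothetical closed vocabulary; the real list would come out of the
# AppleCore/Darwin Core term discussion, not this sketch.
ALLOWED_TERMS = {
    "scientificName", "scientificNameAuthorship", "dateIdentified",
    "identifiedBy", "decimalLatitude", "decimalLongitude",
}

def unknown_terms(record):
    """Return any field names in a record that fall outside the closed set."""
    return set(record) - ALLOWED_TERMS

print(unknown_terms({"scientificName": "Quercus lobata", "color": "green"}))
# -> {'color'}
```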
who owns digital annotations? who guarantees that the annotation information will persist? the network has the responsibility of remembering what annotations have been made and what actions have been taken on those annotations. the network is the place you can discover the annotation has been made, whether or not any addressee ever took action based on the annotation.
one of the important functions of the FP network is to wrap the annotation knowledge in a message that is put into the network. there can be no limitation on the length of time annotations will persist.
FP is not like email, where messages have specific addressees; in FP the addressee is "to whom it may concern." FP analyzes each message to determine who is to be notified of new annotation messages.
what about authentication of annotators, or more generally, message originators? in the prototype, the authentication of individual users would be the responsibility of the clients. however, we now have use cases that require the network to authenticate users. perhaps solved by having a separate system do the authentication, where authentication tokens would be exchanged.
how does our model compare to eBird, where it is basically an honor system for recording observations? what about crowd-sourcing, how would those use cases factor in? if we don't have a closed system, but rather still track who is doing what, we can then use that information in quality control applications. on the persistence of annotations: in different implementations of FP, there may be different requirements for the lifetime of annotations.
it is possible that the network could be so successful that the volume of messages in the network becomes a problem for presentation. is this only a client issue, or is there some responsibility on the network side? perhaps the filtering could be delegated to a workflow and distributed?
Another Use Case: Delegate message filtering to network. A client generates a message telling the network to take a particular action on a particular class of messages. Potentially a workflow.
Review Sequence Diagrams
Review Use Case Diagrams
- whom to thank: NSF, with grant numbers; GeoLocate, IPNI, GNI, FNA, GBIF, Google, Kepler; grant number should be included for FNA; Google trademark notice
- what is the takeaway message? this is our outreach component.
- what is our goal for next year at the same meeting, specifically, concretely, at each step of the process?
Vision for workflows
- be able to give people the same story from the Kepler and FP viewpoints
Difference between the FilteredPush Network Software and an Instance of a FilteredPush network.
- Who is going to work on which parts of the instance or the framework?
- what is our strategy in research vs. development
Priorities and timelines
Paul finishes up his review of a talk he gave for AGU: how does industry think about management of quality control? determine standards for data, analyze fitness of data by those standards, find data needing attention, fix those data, repeat. how do the natural sciences manage the quality control process? an example: transposed lat/long passed from data providers to GBIF. this could be addressed with an annotation on a whole dataset. parties interested in notification of that annotation might be those who have used the data.
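The transposed lat/long case lends itself to a simple mechanical check; a sketch (not project code) of the kind of test a quality-control step might run:

```python
def possibly_transposed(lat, lon):
    """Flag a coordinate pair whose latitude is out of range but which
    would be valid with lat/long swapped -- the GBIF-style error discussed."""
    valid = lambda la, lo: -90 <= la <= 90 and -180 <= lo <= 180
    return (not valid(lat, lon)) and valid(lon, lat)

print(possibly_transposed(-122.3, 37.8))  # True: the pair looks swapped
print(possibly_transposed(37.8, -122.3))  # False: plausible as given
```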
a class of more subtle errors are those related to collection dates and days of the month. for example, a common practice among entomologists is to generalize the collection day of month to the nearest multiple of five. an annotation advising of this might describe what purposes the data is fit for, or unfit for. (discussion on message injection policies: would we want every possible data correction annotation? perhaps there is more than one level of policy for messages; maybe this could be modeled by e.g. XACML.)
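The date-generalization practice above suggests a crude fitness heuristic; a sketch, with the 0.8 threshold chosen arbitrarily for illustration:

```python
def day_rounding_suspect(days, threshold=0.8):
    """Heuristic for the entomologists' practice described above: if a
    large share of collection days-of-month are multiples of five, the
    dates may have been generalized and be unfit for fine-grained use."""
    share = sum(1 for d in days if d % 5 == 0) / len(days)
    return share >= threshold

print(day_rounding_suspect([5, 10, 15, 20, 25, 30, 12]))  # True (6 of 7)
print(day_rounding_suspect([1, 2, 3, 4, 6]))              # False
```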
where in the architecture is the filtering and triggering of filtering happening? we need to discuss this.
will there be artificial intelligence pieces of FP? there has been no suggestion of that so far. if we allow all annotations, we might have the problem of duplicate annotations, in the sense that there would be lots and lots of messages that say exactly the same thing. how to address this? keep them all? let clients sort them out? maybe let's see the first ten million annotations and then see what is needed. signal to noise ratio is important to consider. we could build a great system, but if no one sees value in using it, the system is not so great. we need to think about data warehousing.
Bertram's challenge: top seven fields found in an annotation and a record in Apple Core. For the case of finding duplicates.
- 0. GUID, if present
- 1. Catalog No.
- 2. Institution Name
- 3. Collection Code
- 4. Collector
- 5. Collector No.
- 6. Date Collected
- 7./8. Verbatim Locality / Original Determination
- 9. Authorship
- 10. Lat/Long
- 11. GPS Datum
- 12. Lat/Long Uncertainty
- 13. Phenology
- 14. Sex
- 15. Habit/Behavior
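The top-ranked record fields above could seed a find-duplicates matching key; a hypothetical sketch using assumed field names, not the agreed Apple Core terms:

```python
def duplicate_key(rec):
    """Build a normalized key from collector, collector number, and date
    collected -- three of the highest-ranked fields in the list above."""
    return (
        rec.get("collector", "").strip().lower(),
        rec.get("collectorNumber", "").strip(),
        rec.get("dateCollected", ""),
    )

a = {"collector": "D. Douglas", "collectorNumber": "123", "dateCollected": "1830-05-19"}
b = {"collector": "d. douglas ", "collectorNumber": "123", "dateCollected": "1830-05-19"}
print(duplicate_key(a) == duplicate_key(b))  # True: candidate duplicate pair
```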
- 0. GUID for the annotation
- 1. About/Subject referenced by GUID; there may be multiple subjects
(are annotations about records? yes, but, most often it is expected that annotations will be about _consensus_ records. James proposes the network host the consensus records.)
(Definition of a duplicate set: distinct records about botanical duplicates; as opposed to what we'll call replicate sets, which are electronic records describing the botanical specimens)
- 2. About/Subject Darwin Core triplet; there may be multiple subjects
- 3. Annotator (on the table: a text string or a DataONE authentication token)
- 4. Timestamp
- 5. Expectation (e.g. "insert" or "update" to a database; the expectation of an error correction message might be an update, and the expectation of a new determination message might be an insert; another expectation is "group" for a duplicate assertion)
- 6. Evidence
- 7. Payload (i.e. patch)
(in the case of identifying a duplicate set, the payload would be the entirety of the consensus record)
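The annotation fields 0-7 above might be sketched as a record like this; the field names and types are assumptions for illustration, not the agreed schema:

```python
import datetime
import uuid
from dataclasses import dataclass, field

@dataclass
class Annotation:
    """Sketch of the annotation fields listed above (names illustrative)."""
    subjects: list    # GUIDs and/or Darwin Core triplets; may be multiple
    annotator: str    # text string or authentication token
    expectation: str  # e.g. "insert", "update", or "group"
    evidence: str
    payload: dict     # domain-specific patch; Darwin Core for Apple Core
    guid: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: str = field(
        default_factory=lambda: datetime.datetime.utcnow().isoformat())

ann = Annotation(
    subjects=["urn:uuid:consensus-record-1"],
    annotator="example annotator",
    expectation="update",
    evidence="specimen label image",
    payload={"decimalLatitude": 38.54},
)
print(ann.expectation)  # update
```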
(it is false to say "everything is an annotation." all interactions with the access points are through messages. one class of message is "inject annotation;" another is "query," a subclass of which is "find duplicates;" another is "subscribe" to a topic for information; "check for messages," relevant to a previous subscribe; "check for results," which has more to do with a previous asynchronous query message, can be thought of as dereferencing a result handle)
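The message classes named in the parenthetical above could be sketched as a type hierarchy; class names here are illustrative, not project identifiers:

```python
# "Everything is a message": one class per interaction with an access point.
class Message: ...
class InjectAnnotation(Message): ...
class Query(Message): ...
class FindDuplicates(Query): ...       # a subclass of query, per the notes
class Subscribe(Message): ...
class CheckForMessages(Message): ...   # relevant to a previous Subscribe
class CheckForResults(Message): ...    # dereferences an async query's result handle

print(issubclass(FindDuplicates, Query))  # True
```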
The intent is that the definition of an annotation is domain-general; payloads will be domain-specific. the Apple Core network will have payloads in Darwin Core.
12:00 Paul shows an example of an annotation.
2:00 Goals for one year: Specify client demonstrating find duplicates on a network of three nodes, with scalability to 2000 nodes. This divides into two parts: the Specify client, and the node software and its message injection rate.
By TDWG, demonstrate an annotation; have the ontology and the system in which to put the annotation; we can ingest an AO annotation. the main work here is on AO, not the system. a data provider is someone who has a mapper. a node is an access point that understands how to communicate with other nodes, potentially has messaging capability. does not need to be a data provider. we need to have a strategy for load testing.
Lei and Tim can work on the mapper. Input to mapper is annotation messages. Output is SQL.
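A toy version of such a mapper, with assumed table and column names; a real mapper would emit parameterized queries rather than interpolated strings:

```python
def payload_to_sql(table, payload, guid):
    """Turn an 'update' annotation payload into SQL targeting a record
    by GUID -- the annotation-messages-in, SQL-out shape described above."""
    sets = ", ".join(f"{k} = '{v}'" for k, v in payload.items())
    return f"UPDATE {table} SET {sets} WHERE guid = '{guid}';"

print(payload_to_sql("occurrence", {"decimalLatitude": 38.54}, "abc-123"))
# UPDATE occurrence SET decimalLatitude = '38.54' WHERE guid = 'abc-123';
```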