
Etherpad for meeting notes: http://firuta.huh.harvard.edu:9000/FP-2011Jun14


  • Annual Report
  • Progress on mapping, understanding mapping problem
  • What is the question to which peer-to-peer is the answer?
  • Schedule for technical group to review architecture.


  • Zhimin: Working on changing old implementation into pipeline structure.
  • James:
    • Working on the AppleCore documentation of Darwin Core. Peter Desmet and I will focus on this work over the summer with help from other enthusiasts. Hope to have something to show by TDWG.
    • Phenotype RCN in Boulder, CO was quite interesting, especially as it was hosted by Google. Had a good chat with Jamie Taylor of FreeBase about possibilities for annotation storage and potentially even being a specialized FP node. Made better connection with the Plant Ontology group. Discussed AO and its extensibility with the Gene Ontology representatives as well as general case of sharing annotations.
    • Consortium of Northeast Herbaria meeting had a one day workshop on Symbiota presented by its creator, Ed Gilbert. Attendees seemed fairly happy with the way the portal software worked. Symbiota has been adopted as the portal for CNH. Symbiota will get some more support from the Lichen and Bryophyte ADBC theme funding which will have a great need for more sophisticated duplicate detection based on use of exsiccatae. We should consider collaborating with them (FP is mentioned in the ADBC theme proposal and is considered infrastructure for the Hub).
  • Paul:
    • Also contributing to the AppleCore documentation of Darwin Core. Have been working through mapping HUH-Specify fields onto Darwin Core with AppleCore guidance, and noting issues along the way.
    • Wrote up the User_Scenarios_for_Client_and_Mapper_Design document.
  • Lei: Working on the curation package: generalizing actors by extracting their common functions, and reorganizing directories and files after changing and adding classes.
  • Tim: Installed SQL Server in preparation for developing a NS (not Specify) database.
  • Maureen: Got sparqlPush installed and running; still investigating how it works.


Filtered Push Team meeting 2011 June 14 (1PM Eastern/10AM Pacific)

Present: Bob, Zhimin, Paul, Maureen, Jim, James, Lei, Tim

Agenda:

  • Annual Report
  • Progress on mapping, understanding mapping problem
  • Curation Package
  • What is the question to which peer-to-peer is the answer?
  • Schedule for technical group to review architecture.
  • What is the question to which the global cache is the answer (or local cache)?


(Maureen is taking notes)

Cache: what is it? what are we talking about?

There is the concept of the global knowledge store, which comes from the idea that annotations must be kept somewhere to fulfill the requirement that they can be queried and retrieved later.

Are we storing records for data providers?

Bob: for AppleCore, there will be participants with large specimen databases that at any given time, and possibly for long stretches of time, will be offline. The justification for caching there is hosting offline records, with a quality of service equal to that of the host providers being online. Not sure we need that.

Note that consensus records may be dynamic, in that consensus may change if participating records are not available at the time the consensus was requested.

(we will come back to this discussion later)

Annual Report:

Bob added some things, and James added some things after that. We've put yearly headers in the reports so people can see what we've done by year. Are we missing any publications or presentations? We need to get those in. We haven't yet got Paul's items in from his Google doc. After the meeting the Harvard folks will walk through the document and be done this afternoon.

Bertram is away but there are things he will need to look at.

Year two allocation won't be given out until the yearly report is submitted and approved.

On Friday, Jim will create a pdf of the report as it exists then and will circulate it; after responses, we will try to get it submitted some time next week.


Lei is working on Kuration package.

Tim is working on the Not-Specify database example. He installed SQL Server and set up a minimal schema. Bertram and Lei talked to local herbarium people, who might be willing to contribute their database.

Paul: There are good embedded tools for loading MS Access into MS SQL Server.

Bob: Should be simple to connect to Access if it can be done with MS SQL Server.

The group agrees this is a good direction, especially checking that Access and SQL Server work very similarly.

Annotation Ontology examples? http://etaxonomy.org/ontologies/ao/ (examples 1-5 are complete, example 6 still needs work)

How far is Zhimin along in class description? We have architecture for messages.

Zhimin thinks the network should not care about message format. We have to have something on the application level that supports query.

He has marked out a starting point in the documentation about how to implement query in different languages, e.g. SPARQL and TAPIR.

Nodes have to be able to respond to queries. Do they need to map annotations to something else? There is probably not a need in the network client to map terms in the ontology to anything locally.

We need a discussion, given those annotation examples, on how to extract the pieces that need to be mapped from the contents of the annotation messages. Which API is going to hand them over, for example as Darwin Core?

If someone is viewing incoming annotations to their databases, they will get the annotations from the network. They need the annotation for the user to act on it.

----AnnotatesData (dwc)
----HasTopic (dwc)

The annotation is the part of the message which is of interest to the end-user client.

The annotation is about some data object. A standard case for AppleCore is that AnnotatesData contains an RDF representation of the Darwin Core triple {institution, collection, catalog number}. Alternatively, AnnotatesData may simply contain an LSID.

The annotation has a topic. That is the payload. It is likely to be a set of dwc terms: ScientificName, DateDetermined, and so on. For a determination, the topic may be the tuple (Determiner, ScientificName, Specimen). The topic tuple is completely unspecified. According to the ontology, there MAY be a topic, but not necessarily. The contents of the topic are entirely a function of the particular instance of the message.
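As a rough illustration of the parts described above, the sketch below models an annotation as a plain Python structure. The class name `Annotation`, its field names, the helper `is_determination`, and all sample values are hypothetical. The real messages are RDF following the Annotation Ontology, and nothing here reflects a decided format. The `dwc:` keys are Darwin Core term names used only as dictionary keys.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Annotation:
    """Hypothetical in-memory view of an annotation message (not the real RDF)."""

    # Identifies the data object being annotated, e.g. the Darwin Core triple.
    annotates_data: Dict[str, str]
    # Optional payload: per the ontology a topic MAY be present.
    has_topic: Optional[Dict[str, str]] = None
    # What the annotator expects the recipient to do, e.g. "insert" or "update".
    has_expectation: Optional[str] = None


def is_determination(annotation: Annotation) -> bool:
    """Assumed rule: a determination names both a determiner and a scientific name."""
    topic = annotation.has_topic or {}
    return "dwc:identifiedBy" in topic and "dwc:scientificName" in topic


# Made-up sample values, for illustration only.
example = Annotation(
    annotates_data={
        "dwc:institutionCode": "EX",
        "dwc:collectionCode": "Herb",
        "dwc:catalogNumber": "000123",
    },
    has_topic={
        "dwc:scientificName": "Quercus alba L.",
        "dwc:identifiedBy": "A. Botanist",
    },
    has_expectation="insert",
)
```

Because the topic is optional and unspecified, any consumer of such a structure would have to inspect the topic contents per message, as `is_determination` does here.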

Example of an annotation: an identification is being made. The hasTopic is (some representation of the specimen). The annotatesData is a set of dwc triples (date, scientific name). The hasExpectation is an object from the Annotation Ontology (e.g. insert).

Whiteboard diagram: http://etaxonomy.org/mw/File:Whiteboard_diagram_2011Jun14.png

The Mapper may be invoked on the contents of hasTopic to identify the local database object, depending on the contents of hasExpectation.

Example of an annotation (this is example 1): your country name is misspelled. The Expectation is: update.

There are two times a Mapper may be invoked: 1) to understand incoming messages, and 2) to make changes to the local database.
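The two uses can be sketched as follows, using the misspelled-country update as the case. This is a minimal sketch under stated assumptions, not a design: the table name `specimen`, the column names, the `TERM_TO_COLUMN` mapping, and both function names are hypothetical, and a real Mapper would need to handle multiple tables and richer terms.

```python
from typing import Dict, List, Tuple

# Hypothetical local schema: one table, with Darwin Core terms mapped to columns.
LOCAL_TABLE = "specimen"
TERM_TO_COLUMN: Dict[str, str] = {
    "dwc:catalogNumber": "barcode",
    "dwc:country": "country_name",
}


def map_terms(dwc_values: Dict[str, str]) -> Dict[str, str]:
    """Use 1: translate incoming Darwin Core terms into local column names.

    Raises KeyError for a term that has no local mapping.
    """
    return {TERM_TO_COLUMN[term]: value for term, value in dwc_values.items()}


def build_update(target: Dict[str, str], changes: Dict[str, str]) -> Tuple[str, List[str]]:
    """Use 2: build a parameterized UPDATE statement for the local database.

    `target` identifies the record and `changes` carries the new values,
    both keyed by Darwin Core term.
    """
    set_cols = map_terms(changes)
    where_cols = map_terms(target)
    set_sql = ", ".join(f"{col} = ?" for col in set_cols)
    where_sql = " AND ".join(f"{col} = ?" for col in where_cols)
    sql = f"UPDATE {LOCAL_TABLE} SET {set_sql} WHERE {where_sql}"
    params = list(set_cols.values()) + list(where_cols.values())
    return sql, params
```

Keeping the statement parameterized means the annotation's values are never spliced into the SQL text, which matters when the values arrive from the network.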

What scaffolding is needed to support testing the Mapper?

For example, some non-Mapper pieces of software have to be able to extract the SQL from a SQL query message, and add some information about namespaces. Those two things are handed to the Mapper.
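A minimal sketch of that scaffolding might look like the following. The message layout (a dict with `type`, `query`, and `namespaces` keys), the function names, and the stub-mapper calling convention are all assumptions made for illustration, since no message format has been fixed. The `dwc` namespace URI is the published Darwin Core terms namespace.

```python
from typing import Any, Callable, Dict, Tuple

# Namespaces the scaffolding adds when the message does not supply them.
DEFAULT_NAMESPACES: Dict[str, str] = {"dwc": "http://rs.tdwg.org/dwc/terms/"}


def prepare_mapper_input(message: Dict[str, Any]) -> Tuple[str, Dict[str, str]]:
    """Extract the query text from a message and attach namespace information."""
    if message.get("type") != "sql-query":
        raise ValueError("not a SQL query message")
    query = str(message["query"]).strip()
    namespaces = dict(DEFAULT_NAMESPACES)
    # Namespaces declared in the message override the defaults.
    namespaces.update(message.get("namespaces", {}))
    return query, namespaces


def run_through_mapper(
    message: Dict[str, Any],
    mapper: Callable[[str, Dict[str, str]], Any],
) -> Any:
    """Hand the two extracted pieces to whatever Mapper is under test."""
    query, namespaces = prepare_mapper_input(message)
    return mapper(query, namespaces)
```

A tester could pass a recording stub as `mapper` to check exactly what the Mapper would receive, which is one way of pinning down where its boundary lies.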

Since there is nothing written for the software that goes between the Client API and the local database, of which the Mapper will be a part, it is left up to the Mapper tester to determine the boundaries of the Mapper.

Note on the AO examples: one has to do with an annotation on Flickr, the rest of them would be good targets for the Mapper.

Curation package:

What determines when a Kepler curation package is done, from the non-Davis point of view? The curation package will work on non-demo data sets with some subset of services, including testing taxon names using GNI and IPNI. (Are all my names in IPNI, if not, which ones are they supposed to be?). Also georeferencing, and rendering the provenance graph.

Bob suggests that another case might be for people with a non-Specify specimen management system, that they can use the workflow. A lesser goal would be to support the same thing for people with no specimen management system.

We should set up the workflow within AppleCore as an analysis engine. That goal needs less productization than the other user scenarios, because the user is AppleCore.

If someone saw the demo and then downloaded the curation package, what is the risk that they would be disappointed by the lack of functionality?

Let's think about other users, such as ecologists, who would not need to do curation work but would want to be able to use other features, such as the consensus record.

We can't provide the curation package now with the most "fun" data cleaning pieces because we don't have the data sets in the services, such as FNA. But we can do the taxon name and georeferencing services and provenance graphs. With enough documentation to string those working pieces together, it is worth releasing. As to the documentation, we will need some language from the community to go along with the functional documentation.

A suggestion: all of the actors from the demo are in the workflow, but only the actors that work are in the library.

Should we distribute the means to run the demo? The data used is probably OK. It might be helpful to non-SPNHC audiences to be able to work with all the various actors, to see how they could be strung together.

When we release whatever we release, it would be nice to have a means to track feedback.

Lei will remove the FP Network communication pieces of the demo, because she was using a temporary FP API. A suggestion: disable those pieces by configuration rather than removing them completely.

Schedule for technical architecture discussion:

How about next Wednesday? Done.

The FP network implementation can no longer be called AppleCore. New name? We will work on it.