2010 All Hands Meeting



October 25-26th, 2010

Homework (for invitees to complete before the meeting)

  1. A one paragraph use case scenario involving FilteredPush based on their project;
  2. A one paragraph statement of how they see their project interacting with ours.

Agenda

Sunday evening

Most participants arrive, dinner with those who arrive in time.

Monday

Monday Morning

Webex: https://uoa.webex.com/mw0306lb/mywebex/default.do?siteurl=uoa

Develop Morning Meeting Notes on http://firuta.huh.harvard.edu:9000/FP-AllHands-2010Oct25-AM

9 AM Introductions with brief description of related project(s)

10 AM-11 AM What is Filtered Push? Outcomes of prototyping grant including web client demo. What is FPCQC and plans for implementation?

  1. Brief presentation by Paul using an updated version of the 2009 AGU talk.
  2. Brief demonstration of Kepler (20 min)
    1. Bertram: High level overview
    2. Lei: Demonstration
  3. Brief demonstration of Web Client
  4. Brief overview of Annotation Ontology by Bob.


11 AM-12:30 PM Discussion of core FP use cases.

Meeting Notes: Filtered Push All Hands Meeting, 2010 Oct 25, Morning Sessions

Webex https://uoa.webex.com/mw0306lb/mywebex/default.do?siteurl=uoa

Twitter tag: #nsfp ( goo.gl/2RGl )

Agenda 9 AM Introductions with brief description of related project(s)

James: Welcome. Participants:

  1. James Macklin, Director of Collections and Informatics, HUH.
  2. Paul Morris, HUH/MCZ
  3. Maureen Kelly, University of Florida
  4. Dave Vieglais: University of Kansas and DataONE.
  5. Tim Noble: University of Kansas, Specify.
  6. Bertram Ludaescher: UC Davis. Kepler Project
  7. Lei Dou: UC Davis, Kepler.
  8. Brendan Haley: Museum of Comparative Zoology
  9. John Deck: University of California, Berkeley: BiSciCol ("bicycle") project.
  10. Greg Riccardi: Florida State University, Morphbank.
  11. Mark Schildhauer: NCEAS, SONet: Scientific Observations Network. SemTools.
  12. Andreas Müller: Berlin Botanical Garden. EDIT
  13. Bob Morris: UMass Boston, Filtered Push.
  14. Aaron Steele ( goo.gl/WsbI ): Information Architect at UC Berkeley
  15. Anne Marie Countie, HUH.
  16. Zhimin Wang, HUH, FilteredPush
  17. Chinua Iloabachie, UMass Boston.
  18. Tom Orrell: National Museum of Natural History, BiSciCol.
  19. Dan Stanzione: University of Texas Advanced Computing Center: iPlant
  20. James Cuff: Harvard University

10 AM-11AM What is Filtered Push?

James: Finishing up prototype, beginning work on production Filtered Push. How do we connect to all of you? This morning: Summarize current state and goals of FP. This afternoon: Who is bringing what to the table? Key piece of today is to identify use cases: Do we understand the use cases for FP, and leading from this, for tomorrow, the synergies with partners and other members of the community. Goal for tomorrow is to understand how we can work together in a sustainable way: understand the puzzle pieces, how they fit together, and how we can continue to communicate, leverage funds, and make progress.

News: James will be leaving Harvard for a position at Agriculture Canada, as lead researcher in biodiversity informatics. Currently negotiating with James Hanken, director of the MCZ, for a continued base of FP at Harvard; official announcement perhaps later this week.

Outcomes of prototyping grant including web client demo. What is FPCQC and plans for implementation?

(1) Paul: Overview of Filtered Push

Presentation.

(2) Brief demonstration of Kepler (20 min)

  (a) Bertram: High level overview 
  
  Overview of scientific workflows.  Cyberinfrastructure "Upperware".  A scientist wants to assemble data analyses, visualizations, etc.  Workflows aim to capture how scientists access, transform, analyze, and visualize their data.  Value added: automation of routine tasks (sometimes with a human embedded in a workflow); easier to reuse and interpret than scripts; act as high-level documentation of a scientific protocol; archive and share with others (e.g. myExperiment); can offer parallel execution.  Workflows can record the provenance, processing history, and data lineage of data used in analyses.  Example workflows: phylogenetic analysis of sequence data.  Tap a meteorological data source, tap into an R script to perform linear regression, plot results.  pPod demonstration for AToL.  Metagenomics: a biologist working with a computer scientist to develop a workflow, trying to move towards biologists alone being able to assemble complex workflows (WATERS).  Data streaming, querying data streams from sensors - continuously compute a value, e.g. heating degree days from temperature sensors.
  

Terminology from Ptolemy: Actors, Channels (data flows), Ports. Directors (MoCs, models of computation: a scheduler, how the graph is executed; what is the paradigm for thinking about processing the problem). Assembly line metaphor: the complete data collection as a whole is passed through a set of workers, each of which performs some process and moves the data collection on through the process.

Kepler: SEEK, SDM and others as grass roots, expanded to many partners, added a glue project: Kepler core - easily extensible by third parties. Example third-party extension by WEKA. SciencePipes - a browser-based front end over Kepler.

Provenance: ClimateGate as an illustration of the need for better standards to document the data lineage and provenance of data leading to conclusions. Publications contain references to data sets and references to data manipulation tools - good to move this towards electronic pointers to the actual data sets and documentation of how the data was manipulated. Kepler/pPOD provenance browser: documentation of data sources, transformations of the data, and other actions in the workflow. Other uses: result validation, debugging, reproducibility, ...

COMAD model - data streams passing through actors, actor can pick up particular elements from the stream, invoke a black box on the items, and then reinject them into the stream.
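A minimal sketch (in Python, not Kepler code) of this assembly-line/COMAD idea, with all names hypothetical:

  # A token stream flows through actors; each actor picks out the items
  # it knows how to handle, applies a black-box function, and reinjects
  # the results; everything else passes through untouched.
  def actor(stream, matches, process):
      for token in stream:
          yield process(token) if matches(token) else token

  # Hypothetical example: normalize collector names in specimen records.
  records = [
      {"collector": "a. gray", "catalog": 1},
      "unrelated token",
      {"collector": "c. wright", "catalog": 2},
  ]
  pipeline = actor(
      iter(records),
      matches=lambda t: isinstance(t, dict) and "collector" in t,
      process=lambda t: {**t, "collector": t["collector"].title()},
  )
  for token in pipeline:
      print(token)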

  (b) Lei: Demonstration 
  Technology demonstration:  Duplicates as something we want to fuse.  
  Not a canonical FP use case.  
  
  Overview:  Data set from the Brooklyn Botanic Garden (contains inherent duplicates as captured from multiple herbaria).  Find duplicates via a chunker, fuse the duplicate records, pass on to a human curator. 
  
  Kepler chains everything together, the data stream as a conveyor belt with actors listening, one actor being a human notified by email; the human curator edits the fused record, confirms, and reinjects it for aggregation.  Workflow easily modified on the fly (inserting a new actor into the pipe, replacing one actor with a more capable (fuzzy matching) one).  The provenance browser re-traces, in detail, the events involved in the aggregation.  
  
  ((Lei: digital and in person, A duplicate...))
  
  New items for Kepler: Workflow pauses for human interaction (through email notification and Google Doc interaction).  
  • Question: Greg R. Is there potential for recording the human workflows?
  • Bertram: Yes, there is a challenge of keeping track of breadcrumbs across multiple different systems. A provenance record repository could store "tell us what you did" records.
  • Bob M. Scientific publications contain Methods sections, which contain the sort of data that Greg is pointing at. Mining methods sections within disciplines might ...
  • James C. Removing directors, moving towards a lambda calculus method à la Taverna?
  • Bertram: No. The scientific community still needs to frame the nature of directors. The jury is still out on what the right computational model is, and it may vary; event driven directors may be needed.
  • Mark S. Going back to human workflows - an editor might need to confirm the actions taken by human workflow agents before they can be fused.
  • Bob M. We haven't said too much in FP yet about the provenance of the workflow itself. Case: an error exists in an implementation of a workflow step; this might need to generate annotations related to analyses run with prior versions of the implementation.
  • Bertram: Data provenance as one notion. Workflow evolution provenance (visible in Lei's demonstration with the replacement of an actor). The University of Utah VisTrails system describes/monitors the evolution of a workflow. Impending discussions of the meaning of "Provenance" in workflows.
  • Mark S. Tagging workflows useful.
  • Bob M. Detail: versions of software used. Example: recent NSF workshop on Dimensions of Biodiversity run by Dave V., Mark S. also at meeting. Mention made of an implementation error in BLAST where multiple sequences run simultaneously come up with wrong answers.
  • Dave V. Decisions made driving policy where output from faulty tools is used mindlessly.

(3) Zhimin: Demonstration of Web Client

Demonstration of a subset of the FP prototype functionality.

Zhimin logs in as "James". Begins a find-duplicates task: has a specimen, enters the collector name and a collector number; increasing the fuzzy matching factor returns what looks like a duplicate pair - same place, same id, fuzzily different collector name. Tags the pair as a duplicate set. Examines details of one of the records: several annotations exist for the duplicate set. Facet view of the annotators of this record. Two annotations of this record by J. Macklin. Current status of the record visible as a synthetic view of the record, with, in this case, the most recent annotation winning.

James has preferences: expressed interests in datasource=A, datasource=GH, taxon starts with "Rubus". James submits an annotation of a BKL specimen. Logout. Login as Paul. Paul expresses interest in datasource="BKL". Paul can see the incoming annotation in his inbox. Annotation details visible.

A user can run a search for records, add them to a drop box, and export as a spreadsheet. [Then available for import into their local database].

Search includes search terms, fuzzy matching (two different engines), and faceted search.

  • Aaron: Locality of cache? In client or on server?
  • Paul: Could be either. FP is agnostic. FP client library intends to include a cache that can be local.
  • James C. How is currency of local cache maintained?
  • Paul: It isn't; the cache is replaced in full as work interests change. Find duplicates doesn't need ...
  • Bertram: Shadow database sitting near the production database, where a DBA could decide, on a record level, which annotations to accept. See a new annotation in context, and accept from there.
  • Brendan: More complex than a DBA; needs tools for the individual data curators.
  • James M. Yes that is the filter - perhaps a person can keep up with the incoming data flow.
  • John D. Are all the relationships analytical? This identifier is related to this other identifier? Or is it all fuzzy?
  • James M. There are both.
  • Bob M. Relationships are also supported by the annotation ontology we are going to discuss tomorrow: an ontology that expresses the relationships is supported by the annotation ontology on the table.
  • Zhimin W. We aren't enforcing any GUID resolution at any time.
  • James C. Authentication/Authorization? Gorilla in the room?
  • Paul M. Explicitly not included in the prototype: trust of clients, not their users. Use cases for production need authentication of individual users.
  • Mark S. DataONE?
  • Dave V. DataONE intends to provide identifiers associated with data objects, and a federated identity scheme. Projects like FilteredPush could leverage these two capabilities of DataONE.
  • James M. Trust as a broader issue - notoriously problematic issue for scientists in the domain. Filters (people) able to make their own trust decisions.
  • Greg R. Identification in a different sense: someone looks at a specimen and asserts a new identification. What is being annotated? The image? The specimen? The taxon? The linkages between the objects need to be clear: this identification applied to this digital object, which relates to other aspects of this other digital object. Who gets notified is a matter of inference from following these links.
  • Bertram L. Big inference fan, but: Hopefully able to work with graph based algorithms that don't need large scale inference.
  • Greg R. We are most interested in semantics on a human scale, thus simple algorithms about provenance make sense. We are interested in saying that things can be related by humans, not necessarily that they are identical.

11 AM-12:30 PM Discussion of core FP use cases.

What are we missing?

  • Mark S. Is knowledge the mother of all aggregators?
  • James M. Goal is for the knowledge store to be a store of annotations, not the primary data.
  • James C. With great success comes great computer failure. If successful, will need significant resources to manage and analyse annotations.
  • James M. Researchers will want to be able to mine the large scale annotations. How we would query and display important.
  • Brendan H: Routing on large collections is a non-trivial problem - some annotations are on objects shared across the institution ...
  • Greg: Say no.
  • Paul M. Ignore, reject, reject for cause, accept, accept for cause.
  • Bertram L. Ignored is trackable.
  • Bob M. Might be good to say "this is an annotation I'd like to hear more about." Another case we are exploring is encryption of part or all of annotations.
  • Andreas M.: All collections have curators?
  • James M. No, curation levels vary. On the encryption story, might be useful to annotate at dataset level.
  • Bob M. Andreas' use case: Hello, Hello, is anyone there?

12:30 PM to 2 PM Lunch

Monday Afternoon

Webex: https://uoa.webex.com/mw0306lb/mywebex/default.do?siteurl=uoa

Develop Afternoon Meeting Notes on http://firuta.huh.harvard.edu:9000/FP-AllHands-2010Oct25-PM

2 PM- 3:30 PM What is everyone else doing that is related to annotation. Short (10 min.) presentations.


Filtered Push All Hands Meeting 2010 Oct 25 Afternoon Session Notes

Agenda 2 PM- 3:30 PM What is everyone else doing that is related to annotation. Short (10 min.) presentations.

James: Introduction. Joining in: Chris Jordan: Texas, TACC, iPlant. Co-lead on the iPlant data integration genotype-to-phenotype project. Large scale data integration. Data management and collections at TACC, providing database and image stores for natural history collections. Perceives lots of interesting applications of FP, particularly in sets of natural history collections.

Aaron Steele: SilverLining, exploring cloud computing solutions to biodiversity data storage and aggregation. Desire for data to be synchronized in close to real time. Desire to push changes from sources to aggregators rapidly. Examining PubSubHubbub as a technology. Presentation here on how annotations might move through PubSubHubbub.

Publishers don't send messages to particular subscribers. Subscribers express an interest. This decoupling allows scaling. Large queues of notifications have the potential to starve subscribers with as few as a few thousand messages - the fan-out problem.

FP messaging capability perceived as publishing data quality annotations, with subscribers to those annotations.

PubSubHubbub protocol - REST API, open protocol, Atom and RSS standards. Change updates are pushed. The publisher provides a feed. The publisher produces an update. A subscriber expresses an interest. The publisher informs the hub of the change; the hub obtains updates, determines subscribers, and queues up messages to those subscribers. The hub can also poll the publisher's feed to check for new data. Scaling happens in the hub. The protocol is easy for subscribers and publishers. Three roles: publisher, subscriber, hub; no native hub-to-hub relationship.
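As a sketch, the two HTTP calls at the heart of the PubSubHubbub 0.3 protocol look roughly like this; the parameter names come from the published spec, while the feed and callback URLs are hypothetical:

  import requests

  HUB = "https://pubsubhubbub.appspot.com/"

  # A subscriber asks the hub to deliver updates for a topic (feed) to its
  # callback URL; the hub verifies the callback with a GET carrying a
  # hub.challenge value that the callback must echo back.
  requests.post(HUB, data={
      "hub.mode": "subscribe",
      "hub.topic": "http://example.org/fp-annotations.atom",
      "hub.callback": "http://example.org/my-endpoint",
      "hub.verify": "sync",
  })

  # A publisher pings the hub after updating its feed; the hub fetches the
  # feed, determines the subscribers, and fans the new entries out to them.
  requests.post(HUB, data={
      "hub.mode": "publish",
      "hub.url": "http://example.org/fp-annotations.atom",
  })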

Bob M. Scaling question: lots of subscription queues with a relatively small number of clients? An annotation will probably go to more than one queue, and may expand the number of queues. Aaron: Need to reconcile the PubSubHubbub design, in which a subscriber must explicitly subscribe to a queue, with the FP concept of the network deciding who is interested in a particular message.

Free open-source hub running at pubsubhubbub.appspot.com

Some alternatives: Jabber. More complex. Base profile is XMPP. Needs XML and base64 encoding. Google Feed API v2 with push: a JavaScript library. Plays well with PubSubHubbub. google.feeds.push.feeds.... A means of rapidly pushing annotations to browser endpoints. All good tools, perhaps or perhaps not the right tools for the job.

James: Several kinds of users, for some of whom (e.g. researchers) real-time updates are likely very relevant, for others perhaps not. Aaron: Valuable that updates are pushed out (both to publishers and subscribers, e.g. sensitive data).

Greg Riccardi: Specify, Morphbank, Morphster Projects

An image from a publication, a description (ventral, rounded, mandible, and lots of other text that can be treated with ontologies), a data matrix of characters and taxa (which has supporting information) - all this a snapshot in time. Was it correct when written? Is it correct in the current state of affairs? Does clicking on a cell in the data matrix produce a different answer now than it did at the time of publication? Publications thus as a work process.

Annotate images to express character states. Annotate to extend ontologies. Illustrating ontologies (morphster)/Ontologizing images (morphbank). Share online amongst collaborators.

Specify collections management tool. Add capability of pushing data to a Morphbank instance; provide an ontology based annotation tool for images. Make sure that all the data is shared. Lots of data integration issues, part way there, about another year. Goal of creating open APIs, thus moving beyond just the three instances. Atlas of Living Australia interested in advancing the Morphbank UI to include work processes and other features.

John D. Flickr is another way to store images on the web, what is the strategy for morphbank?

Greg R. Have built, but not deployed an application to push images from Morphbank to Flickr, with terms added as flickr tags. Haven't explored moving from Flickr to Morphbank.

Ontobrowser with Morphbank. Trying two strategies: Ontobrowser asks Morphbank for instances of terms; Morphbank exports keyword/image tables to Ontobrowser.

Chris J. Following up on flickr question. Seems a broader question. Morphbank is currently referencing images stored at TACC, morphbank seems to have a good generalization for where images are stored. Metadata another issue.

John D. Multiple image storage systems; in general the metadata seems to be the issue. Chris J. Different systems with different goals; good for them to be able to interact at the level of data, allowing aggregation. Greg R. ALA perceiving Morphbank as a good platform for well curated images with lots of good metadata.

Dave V. Identifiers for both images and metadata? Greg R. Yes, a specimen has one identifier, each image another. ALA case extending to images of drawers of insects. Dave V. Sustainability model for Morphbank? Greg R. Freeing it in the hope that people will find it useful, with each participant providing storage space for their images, plus extra storage to provide redundancy (with redundant storage elsewhere being used to host redundant copies of their images). Chris J. Don't know anyone who has a long term sustainability plan that doesn't require recurring funding.

John Deck: Moorea Biocode, leading up to the BiSciCol Biocode project, starting with the collecting event, moving to specimens, tissue samples, DNA extraction, then into a LIMS for visualizing sequences.

Lots of challenges: spreadsheets and databases being maintained independently, data flowing in many directions. GUIDs needed in many places.

Hub. Tracks relationships. Just the core data. Object GUID. Relation. Related GUID. Any object could have a long list of related objects.

Spokes. Contain published data - resolve the GUIDs. Spokes have GUIDs, date last updated, and data.
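A minimal sketch of that hub data model; the table, relation names, and GUIDs are hypothetical, not BiSciCol's actual schema:

  import sqlite3

  # The hub stores nothing but (object GUID, relation, related GUID) rows;
  # the spokes resolve the GUIDs to actual data.
  db = sqlite3.connect(":memory:")
  db.execute("""CREATE TABLE relations (
                    guid         TEXT NOT NULL,
                    relation     TEXT NOT NULL,
                    related_guid TEXT NOT NULL)""")
  db.executemany("INSERT INTO relations VALUES (?, ?, ?)", [
      ("urn:event:42",   "yielded",     "urn:specimen:7"),
      ("urn:specimen:7", "sampledAs",   "urn:tissue:19"),
      ("urn:tissue:19",  "extractedAs", "urn:dna:3"),
  ])

  # Any object can have a long list of related objects:
  for row in db.execute(
          "SELECT relation, related_guid FROM relations WHERE guid = ?",
          ("urn:specimen:7",)):
      print(row)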

James M. Lots of commonalities. BiSciCol seems like a general filter.

Mark S. BiSciCol seems very concerned with tracking material through processing.

Andreas M. EDIT: This sounds like a derivation event.

John D. We would like to keep the information derived from each of the derived objects in sync.

James M. Sounds like a tracking workflow.

Mark S. Annotation seems to be things that people say about your primary data, outside the scope of the schema you use for your primary data. Annotation perhaps as change management.

Bob M. Is the data curation the place to draw the line? The data curator has the authority to say which data they are going to expose to the public, independent of how they are tracking the change history. This marks a clean separation between who is providing data and who is saying things about the data. This might be the place to draw a line.

Greg R. Annotations are represented by events - person, timestamp, purpose, etc.

John D. Is the collection of a tissue sample an annotation?

Greg R. The annotation schema allows it to be expressed as such.

Bob M. An annotation is a sort of metadata, but metadata that isn't offered by the data curator as such. Software agents might also generate annotations. Bob M: proposed aphorism, or definition, or .. An annotation is a candidate metadatum


Andreas Müller Three projects: Duplicate detection; BioCASE annotation system; EDIT annotations.

Duplicate detection, Synthesis I, 2007. Goal to find duplicates as a user (at the end point); search done on the GBIF index. Adapted the FEBRL record linkage software. Developed a prototype. Considered both physical and digital duplicates, leaving it up to the user to decide what they wanted. Use case: find physical duplicates, e.g. syntypes. Use case: find additional information about an existing specimen: digital duplicates. The tool is, in essence, a fuzzy multiparameter search. Able to be used as a filter in data aggregation to avoid the creation of duplicate records. Web interface for multiparameter search.

Problems: complexity. Binary comparison of pairs of records doesn't scale. Need to use an algorithm for determining which pairs to compare, e.g. a blocking algorithm. Adopted a record linkage approach. Able to easily produce parameters and conditions for duplicate detection.
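A minimal sketch of the blocking idea; the fields and the blocking key are hypothetical:

  from collections import defaultdict
  from itertools import combinations

  # Blocking: instead of comparing every pair of records (quadratic),
  # group records by a cheap key and only compare pairs within a block.
  def blocking_key(rec):
      # e.g. first three letters of collector name + collector number
      return (rec["collector"][:3].lower(), rec["collnum"])

  def candidate_pairs(records):
      blocks = defaultdict(list)
      for rec in records:
          blocks[blocking_key(rec)].append(rec)
      for block in blocks.values():
          # the expensive fuzzy comparison is now limited to these pairs
          yield from combinations(block, 2)

  records = [
      {"collector": "Macklin", "collnum": "123", "inst": "GH"},
      {"collector": "Maclin",  "collnum": "123", "inst": "BKL"},
      {"collector": "Gray",    "collnum": "99",  "inst": "A"},
  ]
  for a, b in candidate_pairs(records):
      print(a["inst"], "vs", b["inst"])   # -> GH vs BKL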


BioCASE annotation system. Annotation system for the BioCASE portal. Intended as a general service, able to work with different data collection standards (e.g. ABCD). Able to annotate arbitrary data elements (examples: specimen, author, first name of author). Feedback system to data curators.

Implementation: Annotations stored in XML ABCD records. Annotations became collection records themselves. Used Subversion for managing the change tracking of the annotated records.

John D. Annotations are new XML documents? How Subversion?

Andreas: The original specimen record document becomes modified as a new version as it is annotated. URLs point separately to different versions of the document as it changes (with rendering produced by stylesheets).

Authentication system overlaid on the BioCASE portal.

Annotation currently generated by direct editing of the XML schema containing the data record. No GUI for annotating data yet.

System also allows free text comments on xml documents.

Architecture: central annotation server. JDBC connections out to data providers. Need the provider to be online to obtain the current document to annotate. Data providers often do not produce valid Darwin Core documents. Lots of scaling issues.

Need observed in user community to annotate data sets and collections of records.

Annotations pushed to data curators by email.

Mapping a Darwin Core record to a highly atomized relational database is hard. Very clear need for a reverse wrapper.

Problem: Changing identifiers.

EDIT Annotations. Similar problem in annotation of taxonomic data. Multiple plant taxonomy databases in Europe, need to track data provenance as data travels amongst taxonomic databases and checklist databases and other user communities.

Similar issues about how to transport annotations and what to annotate, only worse, taxa aren't fixed in the way that specimens are.

GUID strategy: simple rules to determine whether a change to a taxon name will result in a GUID change or not. Checklists and taxon specific data sets tend to have sporadic updates, but continual addition of annotations.

Bertram L. Question: Are there available software products from these projects?

Andreas: In principle, yes, all available as open source. Almost all EDIT software is open source, BioCASE open source.

Bertram L. Reverse wrapper, interesting use case: export data for cleanup, import back. In general, insoluble in database theory (the view-update problem). But, in some cases, known to be soluble; if the export language is simple, then it is likely to be reimportable, and provenance tracking may help this process.

Andreas: Working for a couple of years on mapping database schemas onto ABCD. Possible to define bidirectional mappings for one use case; bidirectional mapping in general is very hard.

James M. Specify includes a mapping tool to import the flat Workbench spreadsheet into the Specify schema.

Dave Vieglais

DataONE: Data Observation Network for Earth

NSF DataNet program, reusing existing data, focus on climate science and biodiversity science.

Scientist's perspective: How do I preserve my data? Tools? Credit? Data management plan?

Infrastructure: Building on existing cyberinfrastructure. Lots of good repositories, standards, protocols, etc. Leverage as much as possible. Create some new cyberinfrastructure, mostly as glue. Support communities of practice.

About 200 people actively involved in the project. Two main thrusts: cyberinfrastructure, and outreach (also developing science use cases to guide decisions about cyberinfrastructure).

Objectives: support the full lifecycle of scientific data management. Intended as core national infrastructure. Heavy emphasis on testing, stability, and reliability (but able to adapt to technology changes).

Example science use case: niche modeling. eBird data, land cover, meteorology, and remote sensing data as inputs to a niche model; output of model results - e.g. indigo bunting seasonal occurrences. Input to State of the Birds.

Components: Member nodes, e.g. the NBII metadata clearinghouse; the Dryad repository for publications; Fedora Commons. Member nodes maintain operations as they were, exposing a couple of simple new APIs. Two basic approaches to the APIs: extend an existing implementation to add the APIs, or add wrappers.

Coordinating nodes. Being developed based on sets of existing tools. Metadata catalog for data across all the member nodes. Manage data replication. Can direct the member nodes to replicate data between each other. Provides for long term content preservation and availability. Can't effectively do this with petabyte scale image repositories. Each coordinating node is a replicate of the others. Focus of implementation is here.

Investigator Toolkit. Mix of low level libraries and higher level tools, e.g. R plugins. Web interface for search at coordinating nodes (based on Mercury). DataONE drive as a FUSE mount on *nix, able to mount the network as a read-only filesystem.

How does it work? An investigator with data and some metadata. This content is given to a member node using its native mechanisms. From this, system metadata and metadata are created. Coordinating nodes pull this new metadata and science data and replicate them amongst themselves. Coordinating nodes replicate the data between member nodes until that data type's persistence requirements are met. The investigator publishes on this data, using an identifier. Other researchers find this identifier and can query the system for the data. Other researchers can also use this identifier to provide annotations of the data or metadata. The annotations can be passed on to the original researcher and other interested parties.

Every object in the DataONE infrastructure has a unique identifier (scheme agnostic).

Multiple global candidate nodes. Lots of interest.

Very strong need emerging for computational resources close to the data; investigating approaches with TeraGrid.

Providing federated identity system. CILogon and InCommon as likely technologies.

Public release targeted for the end of 2011. Replication, basic set of nodes, federated authentication.

Identifiers for all objects. Pointers to reliably available documents. Federated identity, authentication, and authorization. Extensible search and discovery (based on Mercury). Content persistence. LOCKSS principle.

Bob M. How orthogonal are these services? Suppose someone was going to build a FP network, and they thought federated identity important, and deployed nodes that delegated the federated identity management to DataONE?

Dave V. No requirement to implement all the APIs to participate. Other than replication, which has a particular set of required APIs, most are orthogonal. We would like to suggest to people that if you add these APIs to your systems, then people will be able to build lots of interesting things on top of them. An added draw is the ability to improve availability and persistence.

Andreas M. Similar to LifeWatch project in Europe.

Dave V. There are interactions between LifeWatch and DataONE; DataONE right now focused on ...

Cyndy Parr. Lifewatch has to do more with analysis tools for biodiversity

Cyndy Parr. EOL is aiming for an inference layer on top of the raw occurrence data, thus any mechanisms for improving the quality of the data, or for providing summaries, would be very helpful. EOL is developing mechanisms for collecting annotations. EOL would look at FP as a potential source of understanding the complexities of how to collect and manage annotations.

Break.

3:30 PM Plan: Breakout on scenario writing within a domain (groups of 3). Examples of scenarios: Use_Case_Scenarios#Scenarios_for_taxonomist_who_can_edit_specimen_records. Continued instead with discussion around a presentation by Mark Schildhauer.

Dave V. SharedCopy.com nice web page annotation tool, nice to examine.

Mark S. Trying to do some very detailed semantics and interpretation - the insect is on the leaf that is inside the plot. The weight is of the insect that was eating the leaf. Many of the semantics are only understood (and easily understood) by the domain experts. In scope?

Bob M. Yes. Paul M. Yes.

Mark Schildhauer: Semantic annotation in the SONet and SemTools projects. Effectively a report from TDWG.

Nature of scientific data sets: often in tables. Rows represent records, columns typed attributes.

Materialized scientific data set, denormalized view, special meaning for researcher.

Researchers have their own data, often in spreadsheets. Groups of scientists with complementary data usually assemble it into a single table for analysis: more rows, complementary columns (nothing about soil chemistry in this set, but present in that one).

To allow scientists to proceed, provide metadata about the data sets. EML, branded as ecological metadata, is primarily a means of describing the structure of datasets, types of columns, etc. Lots and lots of very idiosyncratic data. Need for some standard ways of describing particular types of data. Ontologies appear as an approach here. Work coming out of SEEK. Hierarchical structure of concepts, cardinalities, relationships, etc. Data described by metadata (perhaps it would be nicer to start with ontologies...); link the metadata to ontological descriptions, thus semantic metadata.

The fundamental thing that allows joins in composite materialized data views is observations. No longer a discipline specific term. Observations: standardized forms of measurements with units, protocols for the collection of the measurements. Multiple disciplines converging on this same abstraction level of observation and measurement. OBOE, the Extensible Observation Ontology - a basic template for linking domain terms to table formats. Can interrelate values within a tuple.

Annotations give metadata attributes semantic meaning with respect to an ontology. Enable structured search.

Stack (top to bottom):

  Domain-specific ontology (neither mature nor fully developed)
  OBOE
  Semantic annotation
  Structural metadata
  Data

Metadata searches very imprecise and misleading.

Annotation here done in an XML schema. Discussion of whether OBOE, in OWL, should extend to incorporate this.


OBOE: an Observation consists of one or more Measurements, which are of Characteristics; Measurements have values, based on a standard. An Observation has a context, with context relationships. Converging on the OGC observation-measurement ontology; SONet activity led to OGC extending its ontology to incorporate contextual relationships. The oceanography community, the biodiversity community, etc., are all converging on a similar set of concepts.

Annotations link data sets to domain ontologies, in the context of OBOE.

Data set:

  yr    spec  spp   dbh
  2007  1     piru  35.8
  2008  1     piru  36.2
  2008  2     abba  33.2

basic idea: go row by row through the data set, generating triples to external terms.

What is each column, and what are the relationships between data values?
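A rough sketch of that row-by-row triple generation, assuming hypothetical URIs and column-to-term bindings; the property names only approximate OBOE's:

  # Each row becomes an observation whose measurements are typed by terms
  # from a domain ontology.
  rows = [
      {"yr": 2007, "spec": 1, "spp": "piru", "dbh": 35.8},
      {"yr": 2008, "spec": 1, "spp": "piru", "dbh": 36.2},
      {"yr": 2008, "spec": 2, "spp": "abba", "dbh": 33.2},
  ]
  # the semantic annotation: column -> ontology term (hypothetical URIs)
  bindings = {"spp": "ex:TaxonName", "dbh": "ex:DiameterAtBreastHeight"}

  triples = []
  for i, row in enumerate(rows):
      obs = "ex:obs%d" % i
      triples.append((obs, "rdf:type", "oboe:Observation"))
      for col, characteristic in bindings.items():
          m = "%s/measurement_%s" % (obs, col)
          triples.append((obs, "oboe:hasMeasurement", m))
          triples.append((m, "oboe:ofCharacteristic", characteristic))
          triples.append((m, "oboe:hasValue", row[col]))

  for t in triples:
      print(t)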

Bob M.: Principal goal for the current grant: make a network that will be able to distribute annotations. Secondary goal to do this in a domain independent manner that is consistent with the goals expressed here. We don't want to be in the business of deciding how domain scientists can ontologize their data. We thus have an important use case of being able to transport your annotations. We don't succeed if we aren't able to transport your annotations. We also don't want to pretend to be the only ones in the business of distributing annotations. Goal to be able to build bridges between different people's understandings of annotations, and transport annotations where we are simply brokers for the transport and discovery of annotations. Important here is the ability to annotate annotations. An OBOE annotation doesn't appear to be itself an observation, thus perhaps not able to be annotated itself within OBOE.

Mark S. Potentially OBOE could be extended to manage the concerns of annotations, or perhaps an annotation ontology could be extended to handle OBOE.

Bertram L. High level: my understanding is that for FP the annotation ontology talks about actionable things, "the intent is to correct missing data", but not described to the detail of OBOE. OBOE deeply describes, and appears orthogonal.

Bob M. We certainly don't want to handle the business of OBOE, but we are going to need to understand OBOE descriptors to do analysis in the domain.

Bertram L. Top of the head scenario: a data set. Someone provides an annotation describing the header of the dataset in OBOE in detail. Enriched metadata submitted to the FP network. Perhaps the person who created the semantically weakly described data set submits it, and a response comes back from others as a full fledged set of OBOE metadata.

Mark S. Different scientists could have different interpretations of the same data set. Specific leaf area has a specific protocol. One researcher might assert that a column contains specific leaf area; another might look at the data and assert that this couldn't possibly be true.

Bob M. OBOE actively used by the community, reasoning possible on it, very nice to be able to get annotations back to it. Scenario: what instrument? Calibration requirements. Data set described by OBOE, with an assertion that a column is a measurement by a particular instrument. The data set contains no metadata about the calibration of the instrument. The data set can be annotated that it contains no indication of the calibration of this instrument, and thus might be suspect. Seems a very natural collaboration.

Dave V. Please let us know about additional requirements for APIs to handle annotations.



6:30 PM Dinner at nearby restaurant.

Tuesday

Webex: https://uoa.webex.com/mw0306lb/mywebex/default.do?siteurl=uoa

Tuesday Morning

Meeting notes on: http://firuta.huh.harvard.edu:9000/FP-AllHands-2010Oct26-AM

Agenda

9 AM Annotation Task Group report: thoughts; Bob's ontology and Paul's layered diagram.

10 AM Quality control scenarios.

9 AM Annotation Task Group report: thoughts; Bob's ontology and Paul's layered diagram.

James: Continuing with annotations.

Discussion of end user communities:

Initial use cases coming out of collections management. Now expanding user community

Bertram: Danger of working with just one narrow user community. Likewise a danger of not being focused. A small set of people who are prototypical for a set of user communities can provide very good advice on requirements and design choices.

Introductions: James Macklin; Paul Morris; Dave Vieglais; Tim Noble; Bertram Ludaescher; Lei Dou; Maureen Kelly; Paolo Ciccarese, Mass General Hospital/Harvard Medical School, W3C group working on annotations, AO; Andreas Müller; Brendan Haley; John Deck; Greg Riccardi; Mark Schildhauer; Alex Dusenbery, UMass Boston; Zhimin Wang; Chinua Iloabachie; Jonathan Rees, Creative Commons, Science Commons.

Bob Morris: FP Annotation Ontology

Bob M. A stake-in-the-ground annotation ontology from FP. Data centric. Examining the W3C AO, we suspect that the FP work is a minor extension of the AO, thus ...

Paolo C. AO derived from Annotea, Annotea from 2001, updating for current web. Retaining Annotea concept of being able to annotate anything. Current focus on annotating things that have resolvable identifiers. Some interesting things in FP, as current focus is on documents and images, and FP adds some additional areas of use.

Bob M. Best outcome if FP needs are met by minor extensions of AO, and we don't need to build our own in FP and can contribute to the larger project.

View of the OBOE annotation ontology Mark S. described yesterday is of a vocabulary that is of use in the community, and annotations in FP should be able to carry information in this vocabulary.

See: http://www.etaxonomy.org/mw/AnnotationOntology [Latex/PDF version of the instance (Bob's example): https://docs.google.com/fileview?id=0B_5CaPcogJCkZDY4NmJlMTItNmVjOS00ODg5LThjOGItMjNmYzNlNjZlMzVm&hl=en&authkey=CNDvrfYD]

An annotation itself needs an identifier so that it can be annotated. Who or what agent created the annotation? What is the subject of the annotation? On the table is something more than rdf:about. Might be possible to do everything without the elaborations described here. The subject needs to be able to be described in the chosen vocabulary of the annotator. This is the function of the interpretableObject: the communities' vocabularies are the things that describe the content of annotations. In this example, a small piece of the data is being annotated. Data sets, records, and fractions of records are possible subjects of annotation.

We think that the community includes why they did this as a part of the communication of an annotation, thus the motivation. Do the interests of the annotator correspond to the interests of the data curator?

The original data holder needs to know what the annotator hopes will happen to their annotation. Thus the expectation.

Each of subject, content, and motivation is an interpretable object, which carries a namespace and the data. Might not be needed, as it might be possible with namespace management. The namespace might describe an opaque object that can't be queried into in RDF. A trivial example is a need for encryption of portions of the annotation.

Interpretable object - term to define how the payload is to be interpreted, plus a payload.

Jonathan: Is the namespace a MIME type? Paolo: This sounds like the wrapper is ...

Paul: Concrete example. Annotation

   Subject
       Interpretable object
            Content: XML fragment of Darwin Core collection code, institution code, catalog number, without namespace.
           Interpretation URI
               MIME type not sufficient.
               MIME type text/xml plus the Darwin Core namespace describes how to interpret.
               

Bob: Another example, annotation carrying encrypted data. More an enabler than a requirer.
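For concreteness, a rough sketch of Paul's example above as a nested structure (purely illustrative, not a normative FP serialization; the identifier, URIs, and payload values are hypothetical). Bob's encrypted case would simply swap in an opaque payload with its own interpretation URI:

  # Every part of the annotation is an interpretable object carrying an
  # interpretation URI and a payload.
  annotation = {
      "id": "urn:uuid:example-annotation-1",   # annotations are annotatable
      "annotator": "urn:agent:jmacklin",
      "subject": {
          "interpretationURI": "http://example.org/text-xml+dwc",
          "payload": "<dwc:institutionCode>GH</dwc:institutionCode>"
                     "<dwc:collectionCode>GH</dwc:collectionCode>"
                     "<dwc:catalogNumber>00012345</dwc:catalogNumber>",
      },
      "content": {
          "interpretationURI": "http://example.org/text-xml+dwc",
          "payload": "<dwc:genus>Magnolia</dwc:genus>",
      },
      "motivation": {
          "interpretationURI": "http://example.org/fp-motivations",
          "payload": "transcription error: record does not match sheet",
      },
      "expectation": "update",   # vs. append, cluster
  }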

Bertram L. At TDWG some impatience about the ontology. Where is this going? Because we are so broad in annotations, because it is so abstract, it is hard to make clear technology decisions without understanding specific use cases.

Bob: Quality control is an important use case, and the example is of a specific use of an annotation. The example might address this concern.

Example. Annotation by James Macklin. Subject: an instance of DWCOwlFragment (defined as a subclass of interpretableObject, has an interpretationURI, an OWL representation of Darwin Core (there isn't one, but Bob made one here)). An interpreting agent can expect to be able to find terms from this ontology in the data. The interpretable object might not be needed; might be just namespace management. The problems of determining what to do and what is being annotated are the consumer's problem.

Motivation, another interpretable object, here an expression that James found a transcription error, the database record not corresponding to the text found on the herbarium sheet.

The annotator is part of a community. Moving the annotations around is supposed to help that community. Thus the annotator should be able to express their intent in creating the annotation. Expectation.

Paul: Expectation values on table are update, append, cluster. Mark S. Expression of the intent of an annotation being that some automated process should examine and act on the annotation.

Paolo: AO tracks provenance and who said what and why; trust and actions are left to the consumers. "This is right", "this is wrong" are data curation; an annotation can have data curation tokens, and chains of these tokens. A network of trust can examine these chains of tokens. Erratum as a type of annotation: this chunk of text should be replaced by this replacement.

Bob M. Best practice for database manager of course to add changes to change history, but provide a current view of the record.

Paolo: People interested in using AO for wikis, linear story, set of annotations can be viewed easily by an editor to make decisions about changes, provenance and meaning of annotations in context are key here, but need service to layer the annotation on top of the database.

Bob M. Interesting example: wikis tend to track change histories, but users seem to tend to ignore the distinctions between edits to the current page, the talk page, and the change history. It is in the psyche of wikis to understand the distinction between the document and its changes.

Bob M. Back to example

Mark S. Is there an element that points to the framework that is expected to handle the interpretation of a particular interpretable object?

Bob M. Has interpretation -- "hasInterpretationURI".

Bob M. FP is very much about the transport of annotations. In an instance of an FP network, if annotations are expressed in vocabularies known to the domain, then they can be processed and handled by the instance of the network.

A couple of years ago when we looked at Annotea, it looked very document centric, AO now seems much more general.

Paolo: Annotea model was linking an annotation to something, to a resource, hopefully with a URI. Annotea intended to annotate resources. AO is extending that. We don't have a single mechanism for identifying any particular resource on the web. AO introduced the concept of a selector to bridge the world of annotations and the real world of resources, including data. Current work heavily on documents and images, but very interested in FP's examinations of annotation of data and other use cases, should be able to make AO more complete.

Bob: We should probably focus on understanding the selector. Paolo: The selector allows pointing to a fragment of a thing. Can point to a resource, selector allows identifying a portion of a resource.
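A loose sketch of the selector idea (field names are illustrative, not the normative AO terms; the resource URI is hypothetical):

  # The annotation's target names a resource plus a selector that isolates
  # the fragment of interest; re-anchoring uses the surrounding context.
  target = {
      "source": "http://example.org/specimens/BKL000123",
      "selector": {
          "type": "text-quote",
          "exact": "Magnalia",      # the erroneous fragment being annotated
          "prefix": "Genus: ",      # context used to re-locate the selection
          "suffix": " grandiflora", # if the document shifts
      },
  }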

Bob: A fragment may not have any global identifier. A perfectly reasonable action by a data curator on receipt of the example is to generalize from the received annotation to see that "Magnalia" is an error that occurs in many other records.

Mark S. More concrete about the interpretable object: many of the interpretation terms are going to be drawn from ontologies. To interpret what the annotation means you need to go to the ontology and see if the statement is consistent with its broader context. The content of the annotation is referring to a term in an ontology; by asserting this you are potentially saying a lot -- e.g. in terms of associated properties.

Paolo: In looking at a piece of a document you can express that this piece is this term in this ontology. Find a particular protein mentioned in a paper; a broader or narrower term in the ontology can be linked to the term found in the paper. People annotating documents from multiple ontologies can produce views of links between ontologies. Going the other direction from trying to analyze tagging clouds: exploiting users who are very willing to explicitly specify the nature of a term or an image found in a document. Point to an image, a section of an image.

Bob: Bit order is usually important in documents, and less so in data. Hopefully AO avoids making this distinction. Documents are tightly ordered.

Greg R. And hard to track changes to documents.

Paolo: AO doesn't simply track position in the document, but context in the document as well. An annotation may become unlinked if the document changes enough (and the annotator can be notified). The selector is important here, and probably particularly so for data.

Dave V. Selector, subclassed appropriately seems very generic and able to address both documents and data sets.

Paolo: Might need to create your own selector, and if general enough contribute back to AO.

Bob M. I'm not taking a strong position that there is an important distinction. Documents tend to have bit order important, but that isn't absolute, and born digital documents are close to data (xsd).

Greg R. AO: how to annotate a region of interest in a complex image?

Paolo: Raster images, identify rectangle. In svg, can identify actual portions of images (e.g. polygon). In charts, haven't specified yet (if not svg, might be just a portion of the raster).

Greg R. Workflows, and images thereof; might be spreadsheets embedded in an image.

Paolo: Hard to go from renderings of images to renderings of the actual data. Would be nice to communicate to see if the requirements are similar and can inform a common ontology.

Mark S. Use cases for AO?

Paolo: The pharmacology industry trying to interpret documents in the literature. A neuroscience project trying to locate references to specific antibodies and antigens present in the literature. Others working on interpretation of medical images. Others working on text mining and trying to export data in ways that can be understood by interpretations of the content.

Bob. On our side we need to provide more examples to AO, and we can see if they can be accommodated, or what extensions they might need. Ecology a rich area to look at.

Mark S. Hard to enable a scientist to annotate if drawing terms from a complex or voluminous ontology.

Paolo: Demonstration of alpha version of application using AO. Tool that can load an online document, e.g. an article from pubmed central. Document seen in tool as it would be seen in your browser. User can run text mining services, or directly select a piece of text. Selected piece of text brings up a UI to tie the chunk of text to a term in an ontology.

Mark S. When you highlighted that text, it ran a search on the ontology and pulled up potentially related terms?

Paolo: Can look at synonyms in ontologies, or precompute hierarchies; the term is the APP protein, bring up synonyms and higher/lower terms, e.g. isoforms.

James: As this is a document, how do you define an area?

Paolo: Looking at defining high level and medium granularity definitions of the rhetorical content, the sections of the document (e.g. define the methods and materials section - it would be nice to have publishers have these sections already marked up). Text mining improved by understanding this context.

Now in annotated document, able to view terms annotated in document. Able to create sets of annotations, group annotations by criteria - e.g. topic, e.g. provenance.

Selected set: proteins, created by me, currently set as private.

Identifying scientific discourse in the literature. A larger selection of text able to be marked as a claim/hypothesis/question. Can collect the context from several places in the paper. References to proteins in text grouped into the context of a claim about those proteins; evidence can be linked to that claim. Evidence can be marked as supporting/contradicting/etc.; evidence here can be out in data. Working with myExperiment: claim, data, workflow process involved in building the claim. Can graph the discourse structure - claims, supporting evidence, contradictory evidence, related articles, processes, comments, etc. Easy to visualize contradictions in the literature. Curators tag claims, etc., leading to the graphs.

Underlying technologies: RDF and ontologies. Iframe in browser, documents proxied.

Able to add notes, errata. Curator of document can curate these.

Link out to text mining services. Pick a document, link out to NCBO, obtain text mining terms; these results can be annotated. Document sent, 69 terms recognized in the document. Results failed to recognize APP as amyloid precursor protein. Able to annotate/improve these results - this term is too broad, this term is accurate, this term is inaccurate. Terms came back: neuron death, cell death; neuron death annotated as accurate, cell death annotated as too broad. Able to statistically examine the corpus of annotations made by multiple people.

Mechanism for editors in a community to lock a set of annotations, allowing just further comments or the creation of a branch curated by others (providing for credit for an edited view of the annotations).

Mark S. Back end, knowledge base?

Paolo: Storage in a database, structures tightly follow AO, export as RDF. Performance and code maintenance issues with using a triple store.

Mark S. Open source?

Paolo: It will be, probably with a non-commercial license, needs funding stream, interest seems mostly from private industry and foundations. Needs lots of process for release.

Dave V. What service infrastructure is required?

Paolo: Connected to science commons, working on an ontological broker, given a piece of text try to understand it as an ontology. Issue with ontologies, often not rich enough terms. Broker able to extend ontologies. Proxy service. Text mining service. Image processing service.

Dave V. Annotation data stored in the ontologies?

Paolo: Stored in the AO structure in a database, and a triple store; queryable in the triple store.

Mark S. The SWAN ontology covers items like: hypothesis, consistent evidence, etc.?

Paolo: Claims, evidence, etc., yes. AO is domain agnostic. W3C task group working on ontologies for scientific experiments; involved in that group.

Bertram: Also provenance incubator group.

Paolo: participating in those calls, but using PAV provenance authoring versioning ontology developed for SWAN. Curation very important for provenance. Trying to create an ecosystem of ontologies that can be used together.

Bertram: Notion of provenance, data lineage, broader, narrower?

Paolo: PAV not the same thing as data lineage. What was created by whom and when, and, if by a software agent, what that agent was. These ontology terms were associated with this document by this text mining tool at this point in time. Not enamored of the term Creator; the creator of a document might not be the author. Curator a valuable term in this area. Author and creator different.

Bob: Reason on difference?

Paolo: Mostly for human consumption and searching.

Bob: Author a notion in copyright. Something that is derived from a copyrighted work, retains this lineage.

Jonathan: Author different from copyright holder. Copyright chain in derivative works.

Paolo: We are interested in tracking change history. Creating timeline of resource, rather than tracking authorship in legal terms.

Mark S. Paolo's work very relevant. Lots of very interesting things. What are the possibilities for sharing between his project and AO?

Paolo: I like simple steps. Suggested next step of examples of annotation of data, examine how they can be used with AO. Mostly a catalyst for the ontologies.

Paul: Definite similarities between Paolo's demonstration and ALA's goals with annotations (provide Paolo with details of ALA annotations).

Bob: Recurring processes?

Paolo: Haven't thought much about that. Very important piece is capturing data from specialists for an organization.

Paul: Image of annotation from talk.

Bob: Keep coming back to the encryption case.


10 AM Quality control scenarios.

Bob: We need to collect more stories from communities outside of specimens.

James: Workshop elaborating this further would be a very good thing.  

Bob: Question to start, probably for Dave V. Does DataONE have specific QC APIs?

Dave V. Have discussed. At this point, relying on existing QC processes of member nodes. Later on, the idea of sharing QC processes. At this point will work with any data that has a resolvable identifier and an XML metadata descriptor (that in turn conforms to an XML schema).

James: the GBIF approach. Real questions are probably with the ecological community that is likely to use the data, and what are their needs.

Bob: Let me ask Mark, what concerns do people express about EML data sets.

Mark S. People express concerns. Hasn't been a focus. Moving to ontologies, we are hopefully moving to tighter control of the provided metadata. By constraining the choices of scientists, we aren't letting them use terms in idiosyncratic ways. Producing a dialectical process of helping the scientists understand the meaning of the terms they are using in their metadata.

We are also taking assertions from the metadata, e.g. data type assertions. You said this column was alphanumeric; summarizing the data, it is all numeric (or vice versa). As metadata is more carefully described, it is possible to feed back to the researcher whether the summaries of the data fit with the metadata.
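A toy sketch of that feedback loop, with hypothetical column names and values:

  # Compare a column's declared type in the metadata with a summary of
  # the actual values.
  def looks_numeric(values):
      return all(str(v).replace(".", "", 1).lstrip("-").isdigit()
                 for v in values)

  declared = {"dbh": "alphanumeric"}    # what the metadata asserts
  data = {"dbh": [35.8, 36.2, 33.2]}    # what the data contains

  for column, decl in declared.items():
      observed = "numeric" if looks_numeric(data[column]) else "alphanumeric"
      if observed != decl:
          print("%s: declared %s, but values look %s" %
                (column, decl, observed))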

Dave V. In DataONE, we are doing a very low level of bit checking: is the content consistent with the original - checksums and means for quality controlling the replication processes.

Mark S. We are expecting our ontologies to provide quality control mechanisms. Assertion in a data set: parasite and host related. The ontology specifies parasite-host relationships. If the ontology and the data instance disagree, it might be an error or might be an interesting scientific observation.
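A toy sketch of this ontology-as-QC idea; the "ontology" here is reduced to a lookup table, and the species pairs are illustrative:

  # A data instance that disagrees with the known parasite-host pairs is
  # flagged as either an error or a potentially interesting observation.
  known_hosts = {
      "Cuscuta gronovii": {"Impatiens capensis", "Solidago canadensis"},
  }

  def check(parasite, host):
      if parasite not in known_hosts:
          return "parasite unknown to the ontology"
      if host in known_hosts[parasite]:
          return "consistent"
      return "inconsistent: an error, or a new scientific observation?"

  print(check("Cuscuta gronovii", "Quercus rubra"))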

James: Greg, images? What data quality issues?

Greg R. We see metadata correction as an obvious thing to do. Assert that this field should have a different value. In the lifecycle of images, would like to be able to assert that images haven't been manipulated (or have). This bird is flying against the sky rather than a blue sheet.

Johnathan: The clock being wrong on the camera.

Mark S. For a taxonomic entity, we could imagine large sets of character properties. Similar kind of quality test: a trait in the ontology used isn't found in the organism.

Greg: Absolutely, the right terms allow testing for consistency. A caterpillar on a leaf has a definable ecological relation as well as taxonomic - can be tested against ontologies for consistency.

James: When you have an image: what kinds of things on images are quality issues?

Mark S. When metadata is inconsistent with known relationships.

James M. When you have two annotations that are in conflict.

Greg: Possibility of course, of image analysis by automated tools, and comparison with conclusions made by people viewing the images. Example of images of drawers of insects, find spaces between bugs, this bit of the image is a bug in tray 2. Human can correct the region of interest identified by the automated tool of this bug.

Bob: One of the things Mark said made me wonder: is it important/possible to distinguish errors in the data from errors in the knowledge base? Example: monarch butterflies have a specific host plant. Image of a larva on a plant claimed to be not the same plant.

Greg: Express inconsistency, but not which is wrong.

Bob: The domain provides idea of consistency.

Greg: Idea of contradiction.

Bob: Same as Paolo's mechanisms.

Paolo: Labeling consistent/inconsistent.

Jonathan: Here, reasoning on the knowledge base and identifying inconsistencies with it in the data automatically.

James: When you have a taxon with a known range, and an occurrence that is an outlier, the outlier might be important or might be an error.

John: Also need to examine the error radius. A large error radius may be correct, but it distorts species range expectations. So the system is working correctly and the range itself is OK, but the issue is a field that we may not be thinking about. Similarly, in laboratory information systems, there are quality scores for process steps (e.g. binning). If a result doesn't match the morphological identification, go back to the process step QC results and examine those as a process control. The issue may have been with the primer selection, and these need to be re-run.
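A small sketch of the error-radius point; all thresholds and values are hypothetical:

  # An occurrence can fall "inside" an expected range only because its
  # coordinate uncertainty is huge; flag the uncertainty field rather
  # than the coordinates themselves.
  RANGE_RADIUS_KM = 500       # expected range around a species' centroid
  MAX_USABLE_ERROR_KM = 10    # fitness-for-use threshold for this analysis

  record = {"distance_from_centroid_km": 480, "coord_error_km": 200}

  in_range = record["distance_from_centroid_km"] <= RANGE_RADIUS_KM
  usable = record["coord_error_km"] <= MAX_USABLE_ERROR_KM
  print("in range:", in_range, "| usable for range analysis:", usable)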

James: In continuous quality control, look for conflicts among assertions made by the data.

Paolo: In SWAN we started with accept/reject, right/wrong. It was a disaster; this is a very harsh judgement in science. We moved to consistent/inconsistent, and moved to humans to evaluate in context. There are things, however, that have a very complex context, e.g. two groups coming up with different results from their research. There is lots of context to understanding these differences. The very important bit is identifying inconsistencies.

Mark S. Are you really having these problems? Assertions about brain cells found in the heart.

Ah, different scales. Annotations of research papers are at a much higher level of generalization than individual data elements.

Jonathan: Two different models: modeling formally, then software can detect inconsistencies. At the level of annotating small blocks of text with terms, very similar to annotating data elements. Annotating higher-level blocks of text is different.

Paolo: Annotating images, how much time do you expect a user to spend on an image?

James: Yes. Time might vary substantially depending on the task.

Greg: Someone interested in making a taxonomic determination from the image might spend a substantial amount of time examining it. Others classifying parts of the image might spend much less time.

Expectation of identifying features within the image.

Mark S. Largely for querying. Find all images of butterfly larvae in tropical dry forests.

Paolo: Our users thought that it would take too much time to assert this level of detail for the use of the data.

James:

Greg: We might want to look at the logs: at 10 am they said it was a milkweed, and before this they spent lots of time zooming in, examining the fine details. Is there a long input to the process of saying this is a monarch butterfly?

Paolo: I see an explosion of relationships needed to define this.

Greg: The intent of the users is to fill in matrices of characters by taxa, with images showing characters as supporting data. We would like the users to be more explicit about what they are seeing - a big problem. We should be harnessing audio, or audio and eye tracking. With lots of digitizing from images, we should be looking at text extraction from audio.

James: Our challenge is how to provide the terms to ask the rich queries, and identify what level of annotation will be supported for the goals.

Greg: Which herbarium sheets are flowering? Correlate with date.

James: A set of fields with phenological values. A data set here with some fitness for use; we would like to run it through a workflow to assess whether this data is fit for some sort of phenological analysis.

Greg: If I've annotated the herbarium image with a region of interest around the flowers, that makes for an easier pass.

Mark S. Another use case: whales. We can do individual identifications; lots of photos in the field, trying to identify from the image. Photo, coordinates, time, individual identification - people are willing to put in lots of time on this.

Greg: And they want to check against the ships' logs, etc.

Paul: Put a phrase on the table: cleaning data with data.

Greg: For publication, the image in the publication is the one with maximum impact; Morphbank started from the inability to publish a richer set of images. Zooming browsers for images are critical.

James: The data, what has been done in the processes, and the metadata of those processes all feed into the assessment of quality.

John: Crossing service boundaries is an issue here. The process chain is important.

Mark S. Lots of what we are talking about is annotating primary data. Paolo's example is perhaps annotating subsequent interpretations.

Paolo: Many different dimensions in the examples here - lots of complexity, and many case-by-case, domain-specific dimensions. Some aspects are more stable (e.g. geolocation). This flower is red: simple. A bee on the flower adds more complexity; the bee doing this on the flower, more so. A rapidly growing set of ontologies.

Mark S. Somewhat like the task of the encyclopedists in the 1700s.

Paolo: Ontologies work with a snapshot of something that is predefined; if the space is entirely open, it becomes very much more complex to handle. Approach in small steps.


Paul: Three categories of ways of identifying inconsistencies in the discussion today (a toy sketch follows the list):

  1. Conflict in assertions made by people.
  2. Conflict between assertions made by people and knowledge bases.
  3. Conflict between data sets.
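A purely illustrative sketch of the three categories: classify a detected conflict by the kinds of the two disagreeing sources. The record structure is hypothetical.

    def classify_conflict(source_a, source_b):
        """Tag a conflict with one of the three categories from the discussion."""
        kinds = {source_a["kind"], source_b["kind"]}
        if kinds == {"person"}:
            return "1) conflict between assertions made by people"
        if kinds == {"person", "knowledge_base"}:
            return "2) conflict between a person's assertion and a knowledge base"
        if kinds == {"dataset"}:
            return "3) conflict between data sets"
        return "unclassified conflict"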

Mark S. TNRS seems a leading use case for this group: a mechanism for communicating a sanitized list of taxon names, a checklist, back to the name service.

Bertram: GBIF is in the business of validating the records they are harvesting. How does this overlap with the iPlant TNRS?

James: Brad at TNRS is talking to GBIF about relations between TNRS and GNA. Taxon name resolution is one of the biggest problems our users face.
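The checklist round trip Mark describes might look like the sketch below: submit raw names, receive proposed corrections, and hand the sanitized list back. The endpoint URL and response fields are hypothetical placeholders, not the actual iPlant TNRS API.

    import json
    import urllib.request

    def resolve_names(names, endpoint="https://example.org/tnrs/resolve"):
        """POST raw taxon names; return {submitted_name: accepted_name} proposals."""
        payload = json.dumps({"names": names}).encode("utf-8")
        req = urllib.request.Request(
            endpoint, data=payload, headers={"Content-Type": "application/json"}
        )
        with urllib.request.urlopen(req) as resp:
            matches = json.load(resp)["matches"]
        return {m["submitted"]: m["accepted"] for m in matches}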


Tuesday Afternoon

Meeting Notes developed on: http://firuta.huh.harvard.edu:9000/FP-AllHands-2010Oct26-PM

12:30 PM Lunch

2:00 PM Requirements for the network: What can others provide? e.g. DataONE and authentication.

3:00 PM What collaborations and at what level make sense? How do we maintain communication and a coordinated effort?

4:00 PM Summary


2:00 PM Requirements for the network: What can others provide? e.g. DataONE and authentication.

Joining on Phone: Chris Jordan, TACC

James: This afternoon: focusing on the collaborative parts of this.

Bob M: Dave V. and I have agreed on immediate collaboration with DataONE. How do we use their services? We should make a little test of their APIs: we register our central data store with one of their coordinating nodes, then we turn off our central data store, and we shouldn't notice a difference. Thus, how do we become a contributing node in DataONE (implementing their APIs)?

Zhimin: At what level is this, the database level?

Dave V. The simplest way to think of DataONE is as a key-value store, sort of like a very simple BigTable: one column for the key, one for the value. Everything going into DataONE needs to have a unique identifier; we could just use a GUID generator.
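Dave's mental model reduces to something like the sketch below, where an in-memory dict stands in for the network and uuid4 is one possible GUID generator:

    import uuid

    store = {}

    def put(value):
        """Store a value under a freshly minted GUID and return the key."""
        key = str(uuid.uuid4())
        store[key] = value
        return key

    def get(key):
        """Fetch the value for a key, or None if it is absent."""
        return store.get(key)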

Bob declares that Zhimin agrees. We'll look at the DataONE APIs.

Dave V. End of November is probably a reasonable time frame.

discussion....

James: Who are the players here with storage in DataONE?

Mark S. Metacat. We have our own, more specific take on annotations.

Paul M. Bob M. asserted last night that an annotation is concerned with proposed metadata. FP annotations seem orthogonal to OBOE: an FP annotation can be thought of as an assertion that could contain an OBOE annotation describing the structure of a data set, asserting that a data set has a particular structure.

Mark S. Not orthogonal; both are concerned with secondary statements independent of the integrity of the data object itself. Perhaps we can wrap OBOE annotations in FP annotation assertions.
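A minimal sketch of the wrapping idea: an FP annotation whose body carries an OBOE-style description of a data set's structure. Every field name below is illustrative; neither project's actual schema is being quoted.

    fp_annotation = {
        "type": "FilteredPushAnnotation",
        "target": "urn:example:dataset:42",  # hypothetical data set identifier
        "assertion": "the target data set has the structure described in the body",
        "body": {
            "type": "OBOEObservation",  # illustrative stand-in for an OBOE document
            "entity": "Tree",
            "characteristic": "diameter at breast height",
            "standard": "centimeters",
        },
        "annotator": "curator@example.org",
    }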

Dave V. Does an annotation apply to more than one data object?

Bob: Yes it can.

Mark S. Lots of complementarity. The observational ontology is capable of handling lots of scales of observation, but going beyond the atomic seems to be stretching it. We haven't looked substantively into the data curation process much, and the filter portion of Filtered Push is very interesting as a data curation process.

Mark S. The observational ontology has means of trying to determine if two instances are the same. Replicates, duplicates.

James: The herbarium community pretends not to distinguish replicates and duplicates. It calls them all duplicates...

John D. Tying in separate observations of environmental parameters might provide for interesting conclusions.

Mark S. How your data should be interpreted by someone who doesn't understand its semantic structure is something we are concerned with.

Greg R. An annotation carries a package. The network understands the content of the package. Annotations, say, concerned with measurements of trees at chest height. A service point should be able to understand these keywords.

Mark S. Ontology and semantics.

Greg R. Values are present; an API to extract keywords.

Dave V. A selector for annotations.

Greg R. An annotation might need to contain a term with a set of keywords provided by the annotator. Define an API for every type of message: I can submit a message, and the API will tell me which characteristics to add to the message.
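Greg's filtering idea might be sketched as keyword routing: the annotator attaches keywords, and the network delivers the annotation to parties whose registered interests intersect them. All names and structures here are hypothetical.

    # Hypothetical registry of subscribers and their keyword interests.
    subscriptions = {
        "morphbank": {"image", "region-of-interest"},
        "herbarium-curator": {"Rubus", "phenology"},
    }

    def route(annotation):
        """Return subscribers whose interests intersect the annotation's keywords."""
        keywords = set(annotation.get("keywords", []))
        return [name for name, interests in subscriptions.items() if interests & keywords]

    # e.g. route({"keywords": ["Rubus", "flowering"]}) -> ["herbarium-curator"]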

Bob M. JPEG 2000, as of a few years ago, has the ability to carry all the metadata in the image itself. But in an ideal world where queries cost nothing, querying can return the context for an annotation and thus who would be interested.

Aaron: Having a hard time understanding the concept of annotations being pushed to me without me explicitly subscribing.

James: There are two sides: one, an explicit expression of interest - I'm researching X, I'm the curator for data Y. The other case is duplicates, where links between objects end up dropping them into your interests.

Bob M. Could be semantic as well.

James: EOL's concept of aggregated data curators is analogous, by expressing an interest, in say Rubus, I'm becoming a data curator for global knowledge of Rubus.

James: Returning to collaborations, particularly cyberinfrastructure.

Chris J. We can certainly discuss the resources available from TACC, we don't expect this to be heavyweight with regards to our resources.

Bob: Specify an action plan to form answers for this problem. Start with Metacat: find how many data sets are involved, what kind of query rates they see. What happens if 20 times as many people as this would like to annotate it, say, three times a month?

Mark S. Flattered, but the core constituency is Darwin Core; strategically, a lightweight implementation on Darwin Core data seems a better direction.

Paul: Yes, a very short-term plan.

Bob: Our concern is that we need to understand other communities better, and we have a desire to examine broader issues.

Mark S. Then there is OGC and its broader concerns, e.g. from the broader earth observation community.

Greg R. Moving forward and having successes is important. Because you understand an annotation carrying a Darwin Core package, you can filter. For Morphbank and others, we need point-to-point communications, rather than using a distribution network as transport. Specify can put objects in Morphbank, ask Morphbank with identifiers, and get back the images, metadata, and annotations. It is thus important for vocabularies other than Darwin Core to work, and we thus need a means of exposing mechanisms for filtering.

Greg R. Morphbank/Specify/Morphster will move forward by using the vocabulary of annotations, wrapped within a point-to-point pull transport mechanism: an annotation pull service working point to point. Send an annotation or request an annotation.
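A point-to-point pull, as opposed to network distribution, might look like the sketch below: ask a peer directly for the annotations it holds on one object identifier. The endpoint path, query parameter, and response shape are hypothetical.

    import json
    import urllib.parse
    import urllib.request

    def pull_annotations(peer_url, object_id):
        """GET the annotations a peer holds for a single object identifier."""
        url = peer_url + "/annotations?target=" + urllib.parse.quote(object_id)
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)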

Bertram: Consolidating. A series of use cases. At one end of the spectrum: specimen collections, Steve Kelling's eBird observation data. Then the Specify-Morphbank-Morphster interaction, with richer content. Then ecology, with very different kinds of data.

Can we stagger the approach, working from things we understand very well towards things that we don't understand? I very much like the idea of linking to these ecological data. It might lead to a good use case: looking at OBOE documents and describing changes to those documents.


Dave V. It would be very useful for collaborators to have a monthly/quarterly report on progress: a list of milestones, and progress towards each.

3:00 PM What collaborations and at what level make sense? How do we maintain communication and a coordinated effort?


Discussion of collaborations with coordinators.


4:00 PM Summary

Bertram: Strawman cartoon of collaboration:

  1. Shared use case library.
  2. Joint software development.
  3. Hackathons (joint developers' meetings to jointly develop services).
  4. Participating in meetings.
  5. Common architecture.

James: We strayed from the agenda, but benefited from straying. Need to target some more use case development for ecology and taxonomy. We talked a lot about annotations and got some very good examples; likely a very good road forward in collaboration with AO. The core team meets tomorrow, and will review, categorize, and develop a revised timetable, particularly for interactions with collaborators.

Perhaps it would be good to hold open meetings with collaborators attending remotely and briefly presenting status.



6 PM Dinner at nearby restaurant

Wednesday

Core Team reviews requirements elicited from Monday and Tuesday. Core Team reviews workplan.