2010Oct05

Attending: Bill Michener, Paul J. Morris, James Macklin, Bob Morris, Zhimin Wang, Bertram Ludaescher, Lei Dou, Dave Vieglais


Agenda

  1. Discussion with DataOne, Bill Michener and/or designee.
  2. Discussion of the W3C Semantic Web Health Care and Life Sciences (HCLS) Interest Group's Annotea extension, the AO - Annotation Ontology http://code.google.com/p/annotation-ontology/wiki/Homepage
  3. Quick 5-minute review by Bob of the AnnotationOntology example.
  4. Quick review of Kepler/WebClient demonstration.
  5. Discussion, arising from previous two quick reviews, of a short term road map for API development.

Reports


TDWG 2010

See: TDWG2010 Annotations IG Sessions

See: 2010Oct02 Meeting with Apiary team.

UC Davis (Lei, Bertram)

  1. Extended the "SearchSpecimen" workflow to allow users to identify duplicates and send the annotation back to the network.
  2. Developed an ActiveMQ message provider and consumer to simulate the use case: the provider pushes a message requesting execution of a specific workflow with specific parameters; once the consumer receives the message, it automatically starts Kepler and executes the workflow as requested.
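The provider/consumer pattern in item 2 can be sketched as follows. This is a minimal in-process illustration of the message shape (workflow name plus parameters serialized as JSON), not actual ActiveMQ code: a real deployment would replace the in-process queue with a broker connection, and the workflow name and parameter values here are hypothetical.

```python
import json
import queue

# Stand-in for an ActiveMQ destination; a real deployment would use a
# broker connection rather than an in-process queue.
broker = queue.Queue()

def provider_push(workflow, parameters):
    """Provider side: publish a message requesting a workflow execution."""
    message = json.dumps({"workflow": workflow, "parameters": parameters})
    broker.put(message)

def consumer_poll():
    """Consumer side: receive one message and parse the request.

    In the real use case the consumer would start Kepler and execute the
    named workflow with the given parameters; here it just returns them.
    """
    message = json.loads(broker.get())
    return message["workflow"], message["parameters"]

# Hypothetical request: run the SearchSpecimen workflow with two parameters.
provider_push("SearchSpecimen", {"collector": "Darwin", "barcode": "00012345"})
workflow, params = consumer_poll()
```

The JSON envelope keeps the provider and consumer decoupled: either side can be replaced (or moved onto a real broker) without changing the message contract.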

HUH

  1. Paul: Etherpad installed on http://firuta.huh.harvard.edu:9000/FP-2010Oct05, currently visible from inside Herbarium subnet.


Discussion


Presentation on DataOne by Bill Michener and Dave Vieglais

Data challenges at multiple scales. 90% of scientists' time is spent managing the data, 10% on productive analysis/visualization; the goal is to reverse this. NSF data management requirements. Cyberinfrastructure for DataOne: build on existing cyberinfrastructure, create new cyberinfrastructure, support communities of practice. Full life cycle of scientific data management. Needs reliable, stable infrastructure - an archival system. Somewhat in conflict with adapting to changing technologies driven by the scientific community's needs. Leverage existing technologies.

Three major components to DataOne: MemberNodes (existing data repositories, with an API added to them which exposes a minimal set of methods for DataOne interaction; very diverse content and technologies). Coordinating nodes (manage all the data, index the data (scientific metadata and system metadata), provide a catalog of information, manage replicates of data, ensuring enough copies of metadata and data are distributed across the MemberNodes). Investigator Toolkit (suite of tools): Java and Python libraries for interaction with DataOne, extensions/plugins for multiple other tools, e.g. R (exists), Specify, Kepler (early targets).

Current prototype: three member nodes and three initial coordinating nodes. Multiple additional candidate member nodes, starting to come online in about 6 months. Additional worldwide candidate member nodes. Target public release near the end of next year. Adding more nodes and functionality after that.

Investigator Toolkit: build on existing functionality. DataOne provides a virtual data store to multiple existing tools. Supporting the scientific data lifecycle. Java and Python libraries.

Search and retrieval based on Mercury.

Investigator Toolkit: System integration: DataOneDrive mounts the DataOne system for direct access from the file system, for Linux and OSX. Initially a read-only FS.

Data and scientific metadata are added to a member node. System metadata is added by the member node. Coordinating nodes detect the new content; system and scientific metadata are pulled to the Coordinating nodes and replicated (Coordinating nodes are replicates of each other - no single point of failure). Coordinating nodes notice that there is only one copy of the data and direct another MemberNode to replicate the data from the first MemberNode. Researchers publish with reference to an identifier for the data. Other users can resolve the identifier from the Coordinating nodes, and can annotate the data/scientific metadata by providing the annotation and identifier to a CoordinatingNode.
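The add-detect-replicate-resolve flow above can be modeled in a few lines. This is an illustrative sketch only: the node names, identifier, and function names are hypothetical, not the actual DataONE API.

```python
# Minimal model of the flow described above: an object is added to one
# member node; the coordinating node indexes it and, seeing only one
# copy, directs a second member node to replicate it; anyone can then
# resolve the identifier to the holding nodes.

member_nodes = {"MN-A": {}, "MN-B": {}}
catalog = {}  # coordinating-node index: identifier -> list of holding nodes

def add_object(node, identifier, data, metadata):
    """Add data plus scientific metadata to a member node."""
    member_nodes[node][identifier] = {"data": data, "metadata": metadata}

def synchronize():
    """Coordinating node: pull metadata, then ensure at least 2 replicas."""
    for node, holdings in member_nodes.items():
        for identifier in holdings:
            catalog.setdefault(identifier, [])
            if node not in catalog[identifier]:
                catalog[identifier].append(node)
    for identifier, holders in catalog.items():
        if len(holders) < 2:
            source = holders[0]
            target = next(n for n in member_nodes if n not in holders)
            member_nodes[target][identifier] = member_nodes[source][identifier]
            holders.append(target)

def resolve(identifier):
    """Any client can resolve an identifier to the nodes holding it."""
    return catalog[identifier]

# Hypothetical identifier and content:
add_object("MN-A", "doi:10.0000/example.1", b"...", {"title": "Bird counts"})
synchronize()
```

After `synchronize()`, `resolve("doi:10.0000/example.1")` reports both nodes, which is the property the coordinating nodes exist to guarantee.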

Development timeline. Initial public release near end of 2011. Core functionality (replication, synchronization, search/annotate content), emphasis on reliability and stability. Through 2014, incremental increases in functionality (search, workflows, visualization, data manipulation). Progressive integration with DataNet. Capacity building to 1 Petabyte by 2014.

Approach to Community Engagement: working groups with small (<15 person) focused teams. Wide set of stakeholders (compiled by the sociocultural working group). Usability and Assessment working group; results of a survey of scientists are being prepared for publication. How would users interact with DataOne? Project planning, research activity, publication. Similar for scientists and resource managers. User communities under examination: scientists, data librarians, ecological modelers, resource managers, citizen scientists. Analysis of sustainability and governance - multiple funding sources. Training, outreach, and education for community engagement. Informatics training at the 2010 ESA meeting. Community engagement and education including best-practice documents (including how to provide dataset metadata). Example of a large-scale data integration and analysis project: what are the patterns and processes involved in continental-scale bird migration?

Potential DataOne - FilteredPush project collaboration opportunities.

Questions

Bob: APIs for common data sharing. People might not have thought out how their applications might interact with the DataOne APIs; e.g., Specify might think of a high-level DataOne wrapper for access. Is the model flat-out access to the data and metadata of member nodes, or access at a less granular level?

Dave: Granularity? Yes, an interesting issue. For most repositories thus far, the data set itself is the granule. Metadata describes the datasets; access retrieves a dataset. Specify is an interesting example: the entire database might be a data object, or a single specimen record might be a data object. APIs mirror the dataset level of granularity, while recognizing the complexity of domain-specific tools and datasets. Current model probably not

Bob: From discussion in SensorNetwork ontology earlier today, potential missing stakeholders: Lawyers. Metadata might not be freely available.

Dave: Access control is a major topic. Just concluded a workshop on technologies for federated identity systems. E.g., NASA and Specify have very different rules about access, requirements to enable access, and auditing. Potential FP interaction area.

James: The Annotation Working Group is a likely area for collaboration (had an all-day session at TDWG2010 TDWG2010_Annotations_IG_Sessions). Question: target tight integration, or higher-level standards?

Bob: DataOne network architecture - able to support virtual network overlays. FilteredPush involves not just the original data holders receiving annotations, but all interested parties getting them. The high-level model is a peer-to-peer network with a publication/subscription capability for identifying interests, receiving notifications, and allowing client-side filtering of those annotations. We have a good view of what we would like the overlay network to accomplish, but suspect that there are better ways to spend our time than implementing it ourselves.

Dave: Three core capabilities: (1) Identifiers for everything. Persistent, resolvable, type agnostic (delegation to other services). (2) Federated Identity. Each individual may have multiple accounts for access, but need to be consistent in knowing who is who. (3) Extensible search mechanisms. Anticipate that there will be other projects that emerge that are able to take advantage of this core infrastructure.

Bob: Steve Baskauf, SERNEC data manager, gave a talk at TDWG 2010; he wrote GuidOMatic, which allows SERNEC clientele to upload spreadsheets and generates LSIDs from them. TDWG TAG group. Slides appear to be uploaded to TDWG: http://www.tdwg.org/fileadmin/2010conference/slides/Baskauf_full-implementation-of-guids.ppt

Zhimin: Need discovery, need broadcast of notification of annotations.

Bob: The requirements might not be peer to peer. We don't intend to be the emperor of distribution of annotations, but that's a separate issue. We'd like people to talk about an instance of a FilteredPush network, one instance out of many.

Dave: Sustainability approach?

Bob: Trying to separate sustainability of annotations from sustainability of network.

James: Can we work with DataOne to help with the sustainability of the annotation store?

Dave: We were hoping that you would be the annotation store. Definite parallels between FP and portions of DataOne. Good to keep investigating which portions we can pawn off on each other. Plenty of room for discussion.

James: ByCycLe is also funded. Lots of potential for working together. Biggest philosophical difference: annotations there are addressed to recipients, with networks of identified objects being the recipients of annotations, versus FP distributing annotations to any interested parties.


Animation for DataONE data contribution: http://willmorris.net/dataone/videos/add/03/


Bertram: Dave's comment suggests a possible split of roles re. annotations: perhaps FPush does standards and development; DataONE does storage.

Bertram: Given different missions of the projects, FP could rely on DataOne for particular node capabilities. Potential split of roles as above. Do we need to build our own overlay network for FP?

Bob: We are using ActiveMQ as the messaging implementation for FP.

Bertram: Need to document available tools. Sharing best practice. We need to understand what standards/technologies/projects to build upon.

Bob: Hey! DataOne as a sandbox! (Bob's interpretation of Bertram's comments.) Likely to be good sandbox uses of DataOne. Agile network-building tool, good for evaluating network requirements.

Dave: Sounds out of scope, but interesting. Worth bringing up to the community engagement group.

Bertram: Another collaboration opportunity: provenance of data products, and data lineage (DataONE has a working group on that). Might be very good grounds for collaboration.

Bob: From the point of view of the annotation ontology, any community should be able to use any vocabularies they wish for an annotation. An informal contract within a community: even if not using formal semantics, they still believe in and operate on shared vocabularies for data integration. This flexibility should be supported within annotations. W3C groups are working on annotation ontologies; the FP annotation ontology goes further in allowing the vocabulary to be specified at the time an instance document is constructed. The W3C effort was evaluated about a year ago, along with Annotea; both seem too strongly focused on annotating documents for the scope of the FP annotation ontology.
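Bob's point about per-instance vocabularies can be made concrete with a small sketch. The envelope structure and field names below are hypothetical, not the actual FP annotation ontology; the Darwin Core namespace is used as an example of a community-chosen vocabulary.

```python
def make_annotation(annotation_id, target, vocabulary, body, annotator):
    """Build an annotation whose body vocabulary is declared per instance.

    The envelope (id, target, annotator) is fixed; the body uses whatever
    vocabulary the annotating community declares when the instance
    document is constructed.
    """
    return {
        "id": annotation_id,
        "target": target,          # identifier of the annotated object
        "vocabulary": vocabulary,  # chosen by the community, per instance
        "body": body,              # terms drawn from that vocabulary
        "annotator": annotator,
    }

# A specimen-data community might declare Darwin Core as its vocabulary
# (identifiers here are illustrative):
note = make_annotation(
    "urn:example:annotation:1",
    target="urn:example:specimen:42",
    vocabulary="http://rs.tdwg.org/dwc/terms/",
    body={"scientificName": "Quercus alba"},
    annotator="urn:example:person:bob",
)

# Annotations are identified objects too, so they can themselves be
# annotated - the reply targets the first annotation's identifier:
reply = make_annotation(
    "urn:example:annotation:2",
    target=note["id"],
    vocabulary="http://rs.tdwg.org/dwc/terms/",
    body={"identificationRemarks": "agreed"},
    annotator="urn:example:person:james",
)
```

Because the vocabulary is named inside each instance, two communities can exchange annotations over the same envelope without agreeing on a single shared body schema.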

Bertram: Useful exercise, look through existing annotation efforts, evaluate their uses, purposes, and look for features to incorporate or grounds for extension.

Bob: The W3C annotation ontology is converging with ours, except: the content of an annotation should be in any vocabulary the annotator wishes; an annotation should be a kind of object that can also be annotated; and (separating transport from semantics) the ontology should be free of any transportation implications (probably so).

James: The bioinformatics community talks about annotations a lot. We may be building a more general thing. Generalization might be a very good thing for their community. Lots of annotation of gene sequences. How do we avoid confusion?

Bertram: There is a case for particular fields for particular use cases - e.g., duplicate detection, or "you need to fix your name, the nomenclature is wrong" - since processing needs particular corresponding fields.

Discussion of Mapping

Bertram: How do we transport annotations from the network into a database?

Paul: Discussion of the FP mapping tool concept. Level 1: one field to many fields. Level 2: one record to many tables.
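The two mapping levels Paul describes can be sketched as follows. This is a hypothetical illustration: the field and table names are invented for the example and are not Specify's actual schema.

```python
# Level 1: one annotation field maps onto several target fields
# (here, splitting a collector string into name parts).
# Level 2: one annotation record maps onto several target tables.

def level1_map(annotation):
    """One field -> many fields: split a collector string into parts."""
    first, last = annotation["collector"].split(" ", 1)
    return {"collector_first_name": first, "collector_last_name": last}

def level2_map(annotation):
    """One record -> many tables: route fields to target tables."""
    return {
        "agent": level1_map(annotation),
        "collecting_event": {"date": annotation["date"]},
        "locality": {"locality_name": annotation["locality"]},
    }

# Hypothetical incoming annotation content:
rows = level2_map({
    "collector": "Asa Gray",
    "date": "1857-06-01",
    "locality": "Cambridge, MA",
})
```

Level 2 composes Level 1: routing a record to tables may require splitting or merging fields along the way, which is the part James notes the Specify workbench mapper does not yet handle.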

Bertram: Install a Specify database. Build an example of an annotation and look at the issues of bringing this annotation into the database. Move the annotation into a local cache, provide a hypothetical update, roll back if not desired. Build a specific example of this.

James: KU has an (out of date) schema map on the web. The Specify workbench is a very useful aspect of Specify. The workbench is a potential place to stage the hypothetical update; when the user is happy, import into Specify. The workbench has a good schema mapping tool to map workbench fields onto Specify fields (handles multiplicity of tables, but not field splitting/merging). Easy-to-use tool.

Lei: Examining Specify is a good way to help get a handle on the requirements.

Action Items

HUH: Describe the requirements and architectural vision of an overlay network from FilteredPush.

DataOne: Add a brief description of DataOne use cases related to annotations to the FilteredPush wiki Use_Case_Scenarios or Use_cases.

HUH: Provide more examples of annotations, their uses and their mapping onto the proposed annotation ontology.

Bertram: Compile list of other annotation vocabularies and their uses on FP wiki.

James: Invite a member of the bioinformatics community to the FP all-hands meeting to describe biomedical uses of annotation ontologies.

UCD: Install Specify and the workbench, examine some sample data. (list here)

Items for Later

  • HUH: Friday, review Chinua's latest demo (done briefly two Fridays ago); needs presentation again to the group.
  • GBIF cache use, authorizations, etc.
  • Maven (HUH only)
  • Functional and non-functional requirements for Chinua for Web Client development of GBIF Cache (HUH only)
  • Use cases from Kepler (all)
  • Start security requirements for Kepler workflows and clients. HUH and Davis should provide scenarios and use cases on wiki beforehand.
  • Discussion of bibliographic Kepler Record Fuse use case.

See also ToDo