2010Oct19

From FilteredPush


User: David Lowery | User: Paul J. Morris | User: Bob Morris | User: Zhimin Wang | User: BertramLudaescher | User: Lei Dou



Reports

  1. Paul: Got the GBIF cache dump loaded into a MySQL database, with some tweaking: set the MySQL temp directory outside of the small nosuid noexec /tmp partition, and set MySQL parameters to allow a larger memory allocation for index generation, all to prevent MySQL from switching to the inefficient, resource-limited method for index creation (Repair with keycache). The my.cnf parameters are key_buffer=2G, myisam_max_sort_file_size=800G, myisam_max_extra_sort_filesize=800G, myisam_sort_buffer_size=5G. Total size is about 168GB, with lots of tables; it looks like it is optimized for serving up the portal. The two core tables of interest are raw_occurrence_record and occurrence_record. Indexed raw_occurrence_record, which took about 11 hours; 1600 distinct values. About 40 million specimen records, plus 27 million records where raw_occurrence_record.basis_of_record is null. Haven't found collector_number/field_number yet; presumably it is there somewhere, as it is present in GBIF search results.
  2. UCD (Lei, Bertram): Worked on modeling a “Find Duplicates” workflow that identifies duplicate sets automatically and then presents the results to a specialist for curation.

http://www.youtube.com/watch?v=NPFvfJwPPzI
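
As a rough illustration of the "Find Duplicates" idea (not the Kepler workflow shown in the video), the sketch below groups records that share a collector and collector number and returns the groups a specialist would then review. It assumes the records arrive as CSV with the Darwin Core fields recordedBy and recordNumber; the choice of those two fields as the matching key, the normalization rule, and the sample rows are all assumptions made for the example.

```python
import csv
import io
from collections import defaultdict


def normalize(value):
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(value.lower().split())


def find_duplicate_sets(rows, key_fields=("recordedBy", "recordNumber")):
    """Group records by a normalized key and keep groups with more than one record.

    key_fields is an assumed matching key, not one taken from the notes.
    """
    groups = defaultdict(list)
    for row in rows:
        key = tuple(normalize(row.get(name) or "") for name in key_fields)
        if all(key):  # skip records that lack any part of the key
            groups[key].append(row)
    return {key: members for key, members in groups.items() if len(members) > 1}


# Hypothetical sample data, invented for this sketch.
SAMPLE_CSV = """catalogNumber,institutionCode,recordedBy,recordNumber
A-1,HERB1,J. Smith,1234
B-7,HERB2,j.  smith,1234
C-3,HERB3,A. Jones,77
D-9,HERB1,A. Jones,
"""

duplicate_sets = find_duplicate_sets(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Present each candidate set for curation.
for key, members in duplicate_sets.items():
    print(key, [m["catalogNumber"] for m in members])
```

Exact matching on a normalized key is only the simplest possible rule; a real workflow would likely need fuzzier comparison of collector names and dates before handing sets to a curator.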

  3. Zhimin: After an initial investigation of two data sharing systems, Orchestra and HePToX, came away with the following inconclusive impressions:
    • Orchestra:
      1. update log is at the database level (too fine-grained for us)
      2. focus on local materialized views realized by data exchange (very helpful for us on the push story)
      3. query based on schema graph with provenance calculated (seems not to be in the code base)
      4. distributed storage for the update log (not reflected in the demo)
      5. provenance implementation is very interesting
      6. source code is available under the Apache License 2.0
    • HePToX:
      1. tree (XML) based mapping
      2. supports mapping between data and metadata
      3. focus on querying

Further research on both projects looks worthwhile.

Agenda

  1. Preparations for All Hands Meeting
  2. Brief Report by Bob on the meeting he was at last week.
  3. 10 minute discussion of annotation ontology, raising the question of appliesTo links between interpretable objects.
  4. Kepler Workflow Engine in Analysis?

Discussion

Built on http://firuta.huh.harvard.edu:9000/FP-2010Oct19

Filtered Push Team Meeting Notes 2010 Oct 19

1. Preparations needed for All Team Meeting?

From UCD (http://www.youtube.com/watch?v=NPFvfJwPPzI): perhaps have a Darwin Core example ready, to fuse or annotate duplicate records. A proof-of-concept demonstration would be a good showcase of the technology's possibilities. Need a set of, say, 10 to 20 duplicate records (CSV export of Brooklyn data).

Logistics of getting from Airport.

Could use a list of arrival times to circulate to people who want to travel from airport to hotel together.

HUH, Zhimin to have Web client demo on his machine.

2. Bob was at a meeting in Madison last week for the new NSF program, Dimensions of Biodiversity (http://www.nsf.gov/pubs/2010/nsf10548/nsf10548.htm), on new directions (not necessarily new to biology) for biodiversity data. NSF wanted a group of the usual suspects and not so usual suspects to advise on the needs for cyberinfrastructure 3, 5 and 7 years out. The meeting will produce a report. There appeared to be a consensus that many of the requirements are already in play by DataONE and other data network players. Annotations were on lots of people's lips. FilteredPush seems to be a strong brand. Genomics people also have a (well known) concept of annotation of data. There was a general understanding that phylogenetics is important.

3. Bob, brief discussion of annotation ontology. We met with Paolo C[] at lunch today. It looks like what we have been doing with our annotation ontology may fit well with what they are doing. He expressed that while most of his examples are document centric, the intent is more general. We will provide some use cases of annotation of data from our perspective, which may inform extensions of their AO. They are also interested in hypothesis management. We expressed that it would be beneficial to be able to extend their ontology for TDWG.

Our annotation ontology has the interpretable objects, which we feel are needed for most classes of biodiversity annotations (where the content still tends to be expressed in XML schema). All of these are defined as related to annotations without limitations on cardinality. If an annotation has more than one content and more than one expectation, then there is an issue of which expectation applies to which content. Bob is investigating models for handling this and developing examples. During this investigation, Bob ran into the European NeOn project, which has a discussion of ontology design patterns: http://ontologydesignpatterns.org/wiki/Main_Page
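
To make the cardinality issue concrete, here is a minimal sketch of one possible model, in which each expectation carries an explicit appliesTo-style link to the contents it concerns. This is not the FilteredPush annotation ontology or the AO; the class names, the applies_to field, and the example values are all hypothetical and serve only to show how the ambiguity disappears once the link is explicit.

```python
from dataclasses import dataclass, field


@dataclass
class Content:
    content_id: str
    body: str


@dataclass
class Expectation:
    action: str
    applies_to: list  # content_ids this expectation refers to (hypothetical link)


@dataclass
class Annotation:
    contents: list = field(default_factory=list)
    expectations: list = field(default_factory=list)

    def expectations_for(self, content_id):
        """Return the actions expected for one particular content."""
        return [e.action for e in self.expectations if content_id in e.applies_to]

    def unlinked_expectations(self):
        """Return actions not tied to any content of this annotation."""
        known = {c.content_id for c in self.contents}
        return [e.action for e in self.expectations if not set(e.applies_to) & known]


# Hypothetical annotation with two contents and two expectations.
annotation = Annotation(
    contents=[
        Content("c1", "new determination"),
        Content("c2", "corrected georeference"),
    ],
    expectations=[
        Expectation("update record", applies_to=["c1"]),
        Expectation("rerun Workflow <W>", applies_to=["c2"]),
    ],
)

print(annotation.expectations_for("c1"))
print(annotation.expectations_for("c2"))
print(annotation.unlinked_expectations())
```

Without the applies_to link, the annotation above would hold two contents and two expectations with no way to tell which pairs belong together, which is the problem described in the notes.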

4. Workflows in the back end. Kepler is being run from the command line. Bob: BTW, one of the typical "Expectations" of an Annotation will be "rerun Workflow <W>".

Discussion: Zhimin and Lei raised the question of dynamic workflows. Lei: What happens when you design a workflow, but the situation changes so that unexpected things happen within that flow? Lei proposes dynamic changes to workflows; it is beneficial to change workflows over time as facts and conclusions change. Need some good scenarios for this. Bob: If someone reruns a workflow that has changed, they need to know the relationship between the old and new versions; that sounds like provenance. Bob and Zhimin are reading a paper from the provenance group at Penn (one member, Todd Green, is now at UCD), which has very interesting papers on the algebra of provenance. Zhimin: In the dynamic case, does the workflow manage itself, rather than being replaced/managed by an external actor? Perhaps a workflow configuration that can be triggered by external events to change itself. Clarification: After some correspondence between Zhimin and Lei after the meeting, it is clear that Lei's argument about the dynamic feature applies generally to any workflow system, which can easily be tuned to adapt to requirement changes, etc.

[BL: yes, TJ is down the hall at CS; we have a Reading Group together] [further talking points: -- myExperiment annotations -- Kepler semantic types/annotations]


Arrivals: Bertram, Lei: BOS Sunday, October 24, 2010 06:19 PM EDT

Bob: Do the Kepler and Taverna projects have some contact about interoperability? Bertram: Yes. Paolo Missier (was at U Manchester w/ Carole Goble/Taverna team, moving to U Newcastle). Bob: We had lunch with another Paolo today, with regard to the w3 AO OWL Annotation Ontology; he is also very interested in provenance. We've invited him to the FP All Hands Meeting; he will probably be there on Tuesday morning. Add a brief discussion of Taverna-Kepler interaction to the agenda for the All Hands Meeting.

Action Items

  1. HUH to provide a small set of duplicate records (as a CSV file, Darwin Core fields) to UCD.
  2. HUH to provide feedback on Lei's video of the workflow (soon).
  3. HUH to try to set up the GBIF cache, or a portion thereof, on firuta before Friday.

Items for Later

  • HUH: Chinua's latest demo needs presentation again to group.
  • GBIF cache use, authorizations, etc.
  • Maven (HUH only)
  • Functional and non-functional requirements for Chinua for Web Client development of GBIF Cache (HUH only)
  • Use cases from Kepler (all)
  • Start security requirements for Kepler workflows and clients. HUH and Davis should provide scenarios and use cases on wiki beforehand.
  • Discussion of bibliographic Kepler Record Fuse use case.

See also ToDo