
Etherpad for meeting notes: http://firuta.huh.harvard.edu:9000/FP-2011May10

User: David Lowery | User: Zhimin Wang | User: BertramLudaescher | User: James Macklin | User: Lei Dou | User: Tim McPhillips | User: James Hanken | User: Paul J. Morris | User: Maureen Kelly


  • Welcome for Maureen
  • Agenda for Meeting at UC Davis
  • Two week coding target
  • Duplicate-Detection using Map-Reduce project (ECS-265) started.
  • Role of Ontologies and Semantics in the project architecture.
  • Review Web Presence Requirements
  • NSF annual project report


  • Paul
    • Maureen is here. I've given her a high level overview of the current project.
    • I've started filling in project milestones in the dotProject application on sourceforge.
    • Progress is being made with setting up storage for the cache behind the web client in the research computing infrastructure.
  • Zhimin:
    • Prepare data for demo
    • Help Lei use the API for querying gbif cache
    • Interface design
  • Lei:
    • Working on completing the actor functions in the SPNHC demo and making them more stable, e.g. using cached files instead of accessing the real services, to avoid exceptions caused by an unstable internet connection or service.
    • Working with Zhimin to query from FPush network
    • Add an animated GIF in the workflow to indicate opportunities where the workflow could interact with the FPush network to do better curation.
  • Bob
  • Tim:
    • Incorporated changes to web presence requirements.
    • Clarified a number of key points about FP with Paul.

Meeting Notes

Filtered Push Team Meeting 2011 May 10

Present: Paul, Bertram, Jim, Lei, Tim, Bob, Zhimin, Maureen, David, Johnathan Rees.


  • Welcome for Maureen
  • Agenda for Meeting at UC Davis

Bob: Topic of opportunities for Kepler as a FP network client.

Paul: Leads naturally to client API

Zhimin: Not just calls, but also the messages. Need to settle on the nature of the messages. For example: Query, Annotation, Notification.

Bob: In the spirit of what Tim asked, we have already extracted what messages are needed to support functionality.

Zhimin: Also flow. Need strategy for moving forward with the content of the messages. How to transport is the job of the network. Understanding the message is job of the data provider and end user.
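
As a rough illustration of the three message kinds Zhimin names (Query, Annotation, Notification), here is a minimal Python sketch. All class, field, and function names are hypothetical; these notes do not define a message format. It only shows the split described above: the network looks at an envelope to transport the message, while the payload is left for the data provider and end user to interpret.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict


class MessageKind(Enum):
    """The three example message kinds mentioned in the discussion."""
    QUERY = "query"
    ANNOTATION = "annotation"
    NOTIFICATION = "notification"


@dataclass(frozen=True)
class Message:
    """Hypothetical envelope: kind and sender are all the network reads."""
    kind: MessageKind
    sender: str
    # The payload is opaque to the network; only the data provider
    # or end user is expected to understand it.
    payload: Dict[str, Any] = field(default_factory=dict)


def route(message: Message) -> str:
    """Transport-side decision that inspects the envelope, never the payload."""
    return f"{message.kind.value} from {message.sender}"
```

A client such as a Kepler workflow could then build `Message(MessageKind.ANNOTATION, "kepler-client", {...})` and hand it to the network without the network needing to know the payload vocabulary.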

Jim: Dates for UC Davis?

18th-24th, James just 18th-22nd.

Jim: Available some of that time to skype in to work on project report.

  • Agenda Items:
    • Kepler as client
    • Messages
    • API
    • Requirements
    • Project Report.

Need time for Wed 25th meeting? Tentatively: 9AM for two hours.

  • Agenda
    • Assigning tasks for NSF report
    • Sustainability
  • Two week coding target

Tim: Chain of driving requirements: end user --> workflow designer --> engine developer. Experience from another, very productive project (Stanford): start with putting pieces into production, then add features, driven by upstream users: the workflow developer drives the engine developer; the end user drives the workflow developer.

Paul: Start with a 3-node functioning network.

Tim: What are the functions?

Paul: We've got a diagram, with specific desired functionality...

Bob: Web client against previous network.

Paul: Good grasp of

Bob: It would be a good goal for shortly after SPNHC to take the Kepler workflow demonstrated at SPNHC to be a client that launches stuff into a working network, and these results can be seen in the web client (we put in several opportunities for annotation injection). Need to see what we need to do to provide those interfaces. Likewise allow the web client to launch annotations into the new architecture.

Zhimin: Two driving forces: One from end user of system. Other from data provider and service provider, need means for them to plug into the network. The latter is driver for network. On Tim's side, Lei provides feedback on how client is interacting.

Bob: The institutionally friendly data provider APIs don't need to be done to test the rest of the network architecture. We can have a naive data provider for the 3-node network. Simple static queries to clients.

Paul: Concerned about a focus on end users at this point - We've been driven by that for a couple years, and haven't produced network software - and have a good understanding of a set of end-user requirements.

Maureen: Perhaps we can start with a set of administrative tools - basic tools for determining what nodes are on the network.

Bob: Something from the client side launches a workflow which does something and launches some annotations, and indicates that if some new data becomes available then the workflow is supposed to re-run itself with the new data and notify the original launcher of the workflow if the results changed. Our high-level architecture makes this relatively easy to do. Lots of use of the network is needed for that.

Jim: Good for everyone to be clear about who is doing what. We are meeting in CA in a couple weeks, good to have a target for that.

Bob: My vision of next week (Bob, Paul, James): we'll try to make it clearer what all the red squiggles meant on the screenshot of Lei's workflow - opportunities for launching messages. Role of Kepler as a client. Kepler as a network resource for other kinds of clients.

Paul: Two key requirements for network: Scalability, no single point of failure.

Bertram: One story is of a query finding records at other member nodes.

Bob: The collecting event outlier detection story is a good target - this data came from somewhere else because of the network. That's a very high target for next week. How can we specify what is success for clients when a workflow is launched as a consequence of an annotation? Often end users may wish to have the workflow magically run in the network, rather than locally - the clientele has very little local IT support in many cases. Let the internet be the constriction point, rather than the lack of IT support.

Paul: A target for the UC Davis meeting may be elaborating requirements. Clear target in the shorter term to formalize more of the project management targets we have on paper and whiteboard.

Tim: Let's do some hypothesis testing. Hard to see where the hypotheses are. Run the risk of developing software that we aren't going to need. Trying to reverse engineer from the design documents to whether the users are going to be able to do what they wish to do.

Bob: I remain confident that looking at the red lines will give us answers to Tim's questions.

Bertram: Would the GBIF cache (or multiple collections databases) and cleaning up its data be a good testing ground?

Tim: Clone data and test against it. Don't need to keep the results.

Paul: Clear target for next couple of weeks is to have SPNHC demo solid.

Lei: In feature freeze, working on improving stability.

Bob: Are you going to rely on live internet for the demo?

Lei: Yes, needs Google. Can use a cache of the query to each network service. Yes, needs interaction with the FP node at Harvard.

Paul: Agenda for this Friday: Trace communications through the demonstration and with the FP node.
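
The per-service query cache Lei describes could look something like the following Python sketch. It is an assumption-laden illustration, not the demo's actual code: `query_with_cache`, the `fetch` callable, and the JSON-file layout are all hypothetical. A stored answer is used when one exists, so a flaky connection or service cannot raise an exception mid-demo; the live service is only contacted on a cache miss.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def query_with_cache(query: str,
                     fetch: Callable[[str], dict],
                     cache_dir: Path) -> dict:
    """Return the cached answer for `query` if present, else call `fetch`.

    `fetch` stands in for a call to a real network service (hypothetical).
    """
    # Hash the query so any query string maps to a safe file name.
    name = hashlib.sha256(query.encode("utf-8")).hexdigest() + ".json"
    cache_file = cache_dir / name

    if cache_file.exists():
        # Cache hit: no network access, so no network exceptions.
        return json.loads(cache_file.read_text(encoding="utf-8"))

    # Cache miss: ask the live service and remember the answer.
    result = fetch(query)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(result), encoding="utf-8")
    return result
```

Warming the cache once before the demo, while the services are reachable, means every later run with the same queries is served from disk.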

  • Duplicate-Detection using Map-Reduce project (ECS-265) started.

  • Last week: (some!) students installing/starting to use Hadoop and Disco (Python based), http://discoproject.org/
  • This week: first experiments with real data. Any real data sets we could use for this? Need two sets: (i) a small "test set", (ii) the real "challenge set".

Relatively little literature on using Hadoop for duplicate detection, targeting that with a class. Parallelizing duplicate detection with map-reduce. Installing Disco on the cluster, Hadoop already there. Looking for concrete data sets.
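
To make the idea of map-reduce duplicate detection concrete, here is a small self-contained Python sketch. It does not use the Hadoop or Disco APIs; the shuffle step is simulated in memory, and the record fields (`id`, `collector`, `number`) and the blocking key are assumptions chosen for illustration, not the class project's design. The map step emits a normalized key per record, and the reduce step reports any key shared by more than one record as a candidate duplicate group.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Dict[str, str]


def normalize(text: str) -> str:
    """Lowercase and keep only letters and digits so trivial variants collide."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def map_record(record: Record) -> Iterator[Tuple[str, Record]]:
    """Map: emit (blocking key, record). The key choice is an assumption."""
    key = normalize(record["collector"]) + "|" + normalize(record["number"])
    yield key, record


def reduce_group(key: str, records: List[Record]) -> Iterator[Tuple[str, List[str]]]:
    """Reduce: a key seen on more than one record is a candidate duplicate set."""
    if len(records) > 1:
        yield key, sorted(r["id"] for r in records)


def find_duplicates(records: Iterable[Record]) -> Dict[str, List[str]]:
    """Run map, an in-memory shuffle, and reduce over all records."""
    groups: Dict[str, List[Record]] = defaultdict(list)
    for record in records:
        for key, value in map_record(record):
            groups[key].append(value)  # shuffle: group values by key

    duplicates: Dict[str, List[str]] = {}
    for key, values in groups.items():
        for out_key, ids in reduce_group(key, values):
            duplicates[out_key] = ids
    return duplicates
```

Exact-key grouping like this only catches duplicates that normalize to the same string; a real run on the "challenge set" would likely need fuzzier comparison inside each group.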

Bob: FPush MapReduce paper: Filtered-Push: A Map-Reduce Platform for Collaborative Taxonomic Data Management

Jim: Access to MCZbase data fine.

Paul: We can get Bertram a set of collector/agent data from MCZbase and botanical duplicate data in the BBG dataset.

Jim: If at SPNHC a question is asked: when can we have this? What will be our response? What would be a realistic answer for us to give?

Lei: Workflow part - curation package, soon.

Bertram: Kepler 2 release very close - would make sense to have a module with workflows and actors packaged as a module release following the demo.

Lei: Workflow logic complete, but heavily dependent on services from external providers - particularly in the case of the Flora of North America service from Lei - not yet a supported service.

  • Role of Ontologies and Semantics in the project architecture.

Bob: The fundamental problem that we need to address with messages is that the descriptions of data have to be in vocabularies that are those of the data itself, not in vocabularies that we specify. The point of using RDF or OWL ontologies is that you can standardize the bits that are about annotations, and leave the payloads in the vocabularies of the data. The real point is to be able to carry and query other people's annotations.

Paul: Is that the answer to: what is the question to which ontologies in Filtered Push are the answer?

Bob: Integration of data to which we don't control the vocabularies.

Johnathan: RDF is the answer to the failure of the approach of XML schema based integration.

  • Review Web Presence Requirements

Tim: Have taken the comments and reflected them on the main page.

Paul: Ready to review and set timelines/targets/goals next week.

  • Anything else:

Jim: Sent out email about the project report. Please send me one or two sentences describing your role in the project.

Jim: Able to visit UC Davis in afternoon of Monday the 23rd.

  • NSF annual project report