2011Jun07


Etherpad for meeting notes: http://firuta.huh.harvard.edu:9000/FP-2011Jun07


Agenda

  • Mapper (including stub needs for tests)
  • Annual Report Progress
  • Discussions with Specify team
  • UMLSource discussion

Reports

  • Zhimin: working on generating code from UML diagrams in Bouml
  • Paul: mapping HUH Specify onto TDWG Darwin Core per AppleCore guidelines, and in the process commenting on the AppleCore guidelines and GBIF vocabularies.
  • Bob: minor cleanup of the SourceForge SVN repo; scrubbed the FP SVN presence from UMB machines; in the UMLSource discussion, discussing UML as a source for class descriptions.
  • Maureen: compiled a list of features added to Specify as part of the HUH migration to Specify, as an exercise in re-acquainting herself with Specify, and investigated using FreePastry as a substrate for building FP into a peer-to-peer network. She now believes that FreePastry is a lower-level framework than needed and is looking at how PubSubHubbub and SparqlPush might be used.
  • Tim: discussed with Bertram and Lei the various roles Kepler (or other scientific workflow systems) could play in FPush. Posted a proposed initial work plan for Mapper explorations: MapperWorkPlan.
  • Lei: working on the Kuration package release: adding more exception-handling code and making the actor configurable so that different services can be used for one function, e.g., GeoLocate or BioGeomancer for GeoreferenceValidation (see the sketch after this list).
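
A minimal sketch of how such a configurable actor might look in Java; the interface, class, and method names here are hypothetical illustrations, not the actual Kuration code.

    // GeoreferenceValidationService.java
    /** Hypothetical abstraction: GeoLocate and BioGeomancer would each
     *  be wrapped in an implementation of this one interface. */
    public interface GeoreferenceValidationService {
        /** Returns true if the coordinates are plausible for the locality. */
        boolean validate(String locality, double latitude, double longitude);
    }

    // GeoreferenceValidationActor.java
    /** The actor receives the service as configuration, so a workflow
     *  can switch services without changing the actor's code. */
    public class GeoreferenceValidationActor {
        private final GeoreferenceValidationService service;

        public GeoreferenceValidationActor(GeoreferenceValidationService service) {
            this.service = service; // e.g. a GeoLocate- or BioGeomancer-backed implementation
        }

        public boolean fire(String locality, double lat, double lon) {
            return service.validate(locality, lat, lon);
        }
    }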

Notes

Filtered Push Team Meeting Notes 2011 June 7

Present: Bob, Zhimin, Chinua, David, Paul, Maureen, Jim, Lei, Tim.

Agenda:

Maureen will be taking notes.

Starting with annual reports:

Anyone who wants to provide material for the report should either send it to Jim or post it to the GoogleDocs version. Please have changes in within the next week.

Jim will send a message to Tim and Bertram about accessing the report.

Where are we in discussions with Specify?

Bob reports that Rod continues to seem eager to work with us. There was some concern that there may have been another copy of the repository (other than Kansas' SourceForge version) available over the web, but that turned out not to be a concern: it was a copy of a very early version of the Specify codebase from SourceForge (rev. 1 or 2). Bob reiterated to Rod that if we had changes to be made, we would go through Rod to get changes into the Specify trunk.

Paul reports that in the last month or so David has been doing other work on the Specify base that may be relevant for Filtered Push and the mapper, but those changes are being stored elsewhere. New Specify incidents should go into the new Trac on Specify's SourceForge site.

Continuing a discussion at Davis about how to use the Specify code base: we need our repository's code to use APIs to include code from other projects, rather than keeping duplicate copies of other projects' code. In order to accommodate some changes being considered regarding object persistence, we will probably need further discussions with Rod about developing that API. Using APIs to separate code concerns will prevent changes in either codebase from breaking the other.

Bob asks: why would they agree to this?

Paul replies that there is a specification for applying changes to the Specify database; one route is to do SQL directly, the other is to go through the application and use its business rules.

Bob says the requirement is that we have the ability to create a table in the database, not do SQL on existing tables. Paul replies that when new determination annotations come in, e.g., we would like to be able to modify the records in question. We don't modify their data model and don't modify existing objects, but we need to insert new records.

Bob says that what this boils down to is an API method that allows us to get a new Determination object.
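
A minimal sketch of what such an API method might look like; the interface and method names below are hypothetical illustrations, not Specify's actual API, which would need to be worked out with Rod.

    /**
     * Hypothetical facade the FP plugin might ask Specify to expose.
     * Changes route through Specify's business rules, never through
     * raw SQL against existing tables.
     */
    public interface SpecifyAnnotationGateway {

        /**
         * Insert a new Determination for an existing CollectionObject.
         * Existing records and the data model are left untouched; only
         * a new record is added.
         *
         * @return the identifier of the newly persisted Determination
         */
        long addDetermination(long collectionObjectId, String taxonName,
                              String determinerName);
    }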

Zhimin says this all depends on what layer you want access to: the database or the objects.

Bob says we need a clear user scenario and should present it to Rod for his ideas on the subject. He cautions against giving Rod solutions rather than questions.

Tim agrees that we should have Rod's input, we should make sure that our plan is OK with him.

Bob points out that we also want to have very little impact on future required maintenance from the Specify perspective.

Bob asks if the Specify Workbench can already do this, change existing determinations. Maureen and Paul think not. Workbench objects exist outside the main data model, and there is a mechanism for importing the workbench rows into the main data model. Paul points out that asking for a method call that does the edit of a determination is a scalable approach. Bob asks if we have requirements to do this on other types of objects, aside from determinations.

Zhimin thinks we probably don't need Specify to support provenance tracking.

The mapper:

There is a link in Tim's report section about the mapper; that page does not yet reflect conversations about how a scientific workflow would enter the discussion. Lei is working on setting up a Specify installation for testing with an application with business rules. They propose setting up another database, not Specify, that also holds collections management data for comparison. They need a mock FP network to provide messages to these systems and to which the systems would respond. They also need a skeleton Specify FP plugin and some piece of software that functions as the "cache." This would all serve as a harness for the mapper.
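
As a rough sketch of the shape of that harness, assuming hypothetical names for the mock network's operations:

    /**
     * Hypothetical interface for the mock FP network used as a test
     * harness around the mapper; all names are illustrative only.
     */
    public interface MockFPNetwork {

        /** Deliver an annotation message to a registered client, e.g.
         *  the skeleton Specify FP plugin or the non-Specify database. */
        void send(String clientId, String message);

        /** Return the client's last response so tests can make
         *  assertions about how each system reacted. */
        String lastResponseFrom(String clientId);
    }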

A plausible alternative, if Oracle is available, is Arctos.

Also, there's the simple example database model I set up for Bertram earlier this year.

For example, one test scenario might be a long-running service that requests the mapper to provide records from a database meeting certain requirements. Another case is using duplicates to do data entry.

Bob says that as to non-Specify data sources, there are at least two OWL representations of Darwin Core. (There's also an application out there that does SPARQL to SQL. Maureen points out a project called SparqlPush; see the link in her report.) Many people use these for observation data rather than specimens, though. Bob thinks it's probably not difficult to load our sample data into a triplestore using one of those two ontologies.
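
As a small illustration of what that loading involves, here is a sketch using the Jena API with real Darwin Core term URIs; the record values, file name, and class name are made up for the example, and a production load would use one of the OWL representations Bob mentions.

    import java.io.FileOutputStream;

    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.rdf.model.ModelFactory;
    import com.hp.hpl.jena.rdf.model.Property;
    import com.hp.hpl.jena.rdf.model.Resource;

    /** Sketch: express one sample occurrence as triples with Jena. */
    public class TriplestoreLoadSketch {
        public static void main(String[] args) throws Exception {
            String dwc = "http://rs.tdwg.org/dwc/terms/"; // Darwin Core terms namespace
            Model model = ModelFactory.createDefaultModel();
            model.setNsPrefix("dwc", dwc);

            // One sample occurrence record (values are invented)
            Resource occ = model.createResource("http://example.org/occurrence/1");
            Property sciName = model.createProperty(dwc, "scientificName");
            Property catNum = model.createProperty(dwc, "catalogNumber");
            occ.addProperty(sciName, "Quercus alba");
            occ.addProperty(catNum, "00001");

            // Serialize; a real run would load this into a triplestore
            model.write(new FileOutputStream("sample.rdf"), "RDF/XML");
        }
    }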

Tim points out the importance of having simple sandboxes to try out technologies, in order to really understand them.

Lei asks if Maureen or Rod have any pointers about where to get started with making modifications to Specify, for accessing the data model. Maureen has one link from Specify's FAQ: http://specifysoftware.org/faq/11 that addresses how to create a plugin.

Zhimin has been investigating automation tools for generating code from Bouml. There are options for making a round trip (UML to Java to UML) without losing comments, given proper configuration of the tools. Some conventions must also be followed concerning where to put comments, as illustrated below.
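
One hypothetical example of the kind of convention involved: keeping comments in Javadoc attached to declarations, which round-trip generators can typically map to and from UML description fields, rather than in free-floating comments inside method bodies.

    /**
     * Class comment: would live in the UML class's description field.
     */
    public class AnnotationMessage {

        /** Attribute comment: maps to the UML attribute's description. */
        private String subjectRecordId;

        /**
         * Operation comment: maps to the UML operation's description.
         * Comments placed inside the body below could be dropped on
         * regeneration, hence the convention.
         */
        public String getSubjectRecordId() {
            return subjectRecordId;
        }
    }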

Do we want to have the master copy of the FP code be UML objects, or Java?

Bouml is a well-written, single-author GUI tool. Its only documentation is in English. It does not appear to have a command-line interface that could be leveraged with Ant to manage the intricacies of the Java-UML round-trip process.

Bob asks what is the impact on Davis if the Harvard people choose to generate Java from UML?

Tim would like to know what problem this approach solves. Does anyone have experience with the process in development? Does everything that could be represented with UML then need to be "programmed" with UML?

Zhimin proposes only putting the high-level type system into UML. It would be helpful if we had a unified high-level type system for doing coordinated development.

Bob says that it will be possible to ensure that method signatures will be correct in the sequence diagrams with this approach.

Is the problem to be solved this: 1) keeping the UML design documentation synced with the code base, and 2) preventing typos from happening in interactions across APIs?

Paul has tried round-tripping between entity-relationship diagrams and database schemas, with mixed results. A limitation of the tools, rather than the approach, Zhimin asks? Yes.

Bob knows an expert in model-driven architecture; he will get this person's take. Zhimin also will get some advice.

Tim points out that this approach is somewhat in conflict with agile approaches, which are more test-driven; we should ask which approach meets our development requirements and would be appropriate in our context. It would be nice to have a coherent set of development practices that we all use.

Paul mentions that his experience with test-driven development has generally resulted in more robust code. He has some concerns about the Bouml developer having announced he would not be further developing that project, aside from some bug fixes.

Bob notes one advantage of the UML approach: it prevents you from making public methods that do not appear in the model.

Maureen: Is it possible to solve this problem with appropriate unit tests?

Zhimin: Unit tests act like a post-verification. The UML diagram gives an overview of expected behavior. The two work together.

Tim: Some of us write the tests first. On one project that uses tests extensively, we can easily go in and make rapid, large-scale refactorings without worrying whether things will break, because the tests will tell you if something broke. If instead you are depending on models to determine whether something will work, you aren't getting the same speed or safety. The question is who is using the documentation. Keeping the documentation consistent with the code is good, but who will be using it?
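
A small example of the kind of test-first contract Tim describes, using JUnit 4; the DeterminationHistory class is a made-up stand-in for illustration.

    import static org.junit.Assert.assertEquals;

    import java.util.ArrayList;
    import java.util.List;

    import org.junit.Test;

    public class DeterminationHistoryTest {

        /** Minimal hypothetical class under test: determinations are
         *  appended and the latest one is considered current. */
        static class DeterminationHistory {
            private final List<String> names = new ArrayList<String>();
            void add(String taxonName) { names.add(taxonName); }
            String current() { return names.get(names.size() - 1); }
            int size() { return names.size(); }
        }

        /** Behavior pinned down before refactoring: after a new
         *  determination is ingested, it becomes current and the
         *  previous one is retained. */
        @Test
        public void newDeterminationBecomesCurrent() {
            DeterminationHistory history = new DeterminationHistory();
            history.add("Quercus alba");
            history.add("Quercus rubra");
            assertEquals("Quercus rubra", history.current());
            assertEquals(2, history.size());
        }
    }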

Zhimin: Tests can help you find problems, but in my view they are not as good at letting you avoid them; the model helps here, in understanding the system and logically verifying it.

Tim: An example?

Zhimin: The type system, like a compiler: you know from the model that you are invoking the right method and getting the right return. That's a simple example.

Tim: Compilers tell us when the types are wrong...

Zhimin: The type system prevents you from doing something wrong; the tests help you find logical errors, but they don't cover all the errors.

Paul: What your tests do for you depends on how well you write them and whether you have covered all possibilities for input. Unless you have thought out all your boundary cases, your tests might not help with cases encountered in the real world. It takes discipline in test design and requires defensive programming.

Zhimin: There are cases where you just can't cover all the possible branches of program flow.

Bob: Has some misgivings about rigorous agile programming in this particular project, based on experience that some of the methodologies require more frequent access to the real end users than we have in the schedule of the grant. There is real risk in having only two end users, Paul and James; they may not be adequate representation of the user base. Bob's experience is that the approach works best in one of two cases: the project is small, or the language is Lisp. His bias from experience with large systems, he says, is that more classical approaches take longer but are more robust.

Paul: One place we need more documentation is between the network system and the clients. We will need decent documentation for sysadmins and programmers who deploy or extend the system; for these groups robustness is a primary requirement, but for the end users it is flexibility. This is a source of tension, in that different parts of the project have different development requirements. It makes sense to keep exploring this in the way Bob and Zhimin have described, documenting the high-level pieces and interactions within the network rather than going all the way out into the implementation details. We also need to build good test harnesses within the network.

Bob: As to "does it matter if some parts of the project use a code-generation approach and others don't?": there are different extensibility requirements in different parts of the project. In the AppleCore implementation, we are going to depend on other people's implementations of pieces of the project, whereas the mapper has a very well-defined problem to be solved, and the only extensibility has to do with schema agnosticism (kinds of data stores). Nothing can happen without the mapper. There may be a strong argument for getting it done quickly and then making it robust. On the network side, it is more important to make it robustly extensible so others can more easily and confidently contribute. Bouml doesn't require much from a developer who is acquainted with UML.

Tim: As to talking about how to solve problems, it is good to agree on the definition and scope of the problem, and then give the developer room to explore solutions. An example of how to get the tool set up and working is very helpful for adopting a tool or approach. It is a good way to see, at a pragmatic level, how much the tool is going to help vs. how much it gets in the way.

Bob: you won't get sequence diagrams out of code.

Next week, we meet at 1:00 Eastern, 10:00 Pacific.

Paul: To do: create a user scenario document. Specify: look for duplicates, find a duplicate, ingest the annotation. See User Scenarios for Client and Mapper Design.

Maureen: To do: document what approaches we are adopting in various parts of the project and why, and what the various parts of the project are.