Report: Evaluation of FP prototype technologies
Evaluation of technologies used in the FilteredPush Prototype
JavaServer Faces (JSF) is a standard Java web application framework for developing server-side user interfaces for Java EE applications. Its key features include a server-side event model, state management, and JavaBeans with dependency injection. JSF 2.0 is well regarded and compares favorably with other frameworks such as Struts, Spring, and Seam.
JSF 2.0 does not require configuration files. This, together with the availability of open-source rich component suites such as RichFaces and PrimeFaces, makes JSF well suited to our development needs.
Hadoop is a Java-based open-source software system that includes a distributed file system and a map-reduce engine. It is inspired by the map-reduce system and the file system used at Google.
Hadoop's file system (HDFS) manages files in large blocks (normally measured in megabytes), and each block is replicated in several copies on different machines. In this way it achieves high availability of data. The map-reduce engine is most useful for computation jobs that can be decomposed into independent subtasks based on data partitions. In that case, the job tracker node of a Hadoop system generates subtasks based on the data partitions and dispatches each of them to a machine that is close to the data and available for new assignments. After the subtasks have finished, the job tracker starts the reduce phase by assigning some machines to combine the finished local results (this could also be another round of map). Note that during execution, all intermediate results and the final result are managed in Hadoop's file system.
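The decomposition described above can be sketched in plain Java. This is only an in-memory illustration of the map and reduce phases, not the Hadoop API: the class name MapReduceSketch, its methods, and the example task (counting how often each collector name occurs across data partitions) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory sketch of the map and reduce phases; not the Hadoop API. */
final class MapReduceSketch {

    /** Map phase for one data partition: count each record value locally. */
    static Map<String, Integer> mapPartition(List<String> partition) {
        Map<String, Integer> local = new TreeMap<>();
        for (String record : partition) {
            local.merge(record, 1, Integer::sum);
        }
        return local;
    }

    /** Reduce phase: combine the finished local results into one total. */
    static Map<String, Integer> reduce(List<Map<String, Integer>> localResults) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> local : localResults) {
            local.forEach((key, count) -> total.merge(key, count, Integer::sum));
        }
        return total;
    }

    /**
     * Runs one subtask per partition, then combines the results.
     * In Hadoop each subtask would be dispatched to a machine close to its
     * partition; here they simply run one after another in the same process.
     */
    static Map<String, Integer> run(List<List<String>> partitions) {
        List<Map<String, Integer>> localResults = new ArrayList<>();
        for (List<String> partition : partitions) {
            localResults.add(mapPartition(partition));
        }
        return reduce(localResults);
    }
}
```

Because each call to mapPartition reads only its own partition, the subtasks are independent of each other, which is the property that lets a job tracker assign them to different machines.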
A computation job for Hadoop normally contains a data part and a program part. The data part is a reference to some storage (normally HDFS) from which the Hadoop nodes load the resources needed to establish their local execution contexts. The program part is a compiled Java package that is loaded and executed in the local context of each node assigned to the job. Because dynamic class loading is needed on each node, setting up the execution normally takes a considerable amount of time. Hadoop is therefore better suited to jobs that involve huge amounts of data and long execution times than to small jobs, for which the setup time is a significant portion of the total execution time. This explains why most uses of Hadoop are for index generation and data-intensive analysis.
Hadoop does not support direct communication between task nodes, which are the nodes other than the job tracker. In addition, because of the dynamic nature of task assignment, the nodes that take a job are opaque to the end user. There is some low-level communication among HDFS nodes to manage replication, but it is not exposed in a manner that we could exploit for communication. The configuration of the Hadoop name node and job tracker also requires multiple open ports for Hadoop process communication, which has proven difficult to configure in networks where firewalls are present. In general, Hadoop is not a good candidate for a communication platform.
- Indexing for fuzzy match
Because each system may have its own spelling conventions for collector and place names, and input errors in specimen records are inevitable, fuzzy matching is an important method for enabling our system to identify possible duplicate records for a specimen. The problem with applying common database index techniques such as the B-tree to speed up fuzzy query processing is that the 'similarity' relationship cannot be decomposed into linear relations. To pre-index fuzzy entries, we instead use a metric-based index structure called the 'M-Tree', which divides the search space into subregions based on the triangle inequality of the distance measure. The consequence is that the similarity measure between two entries has to be a true distance (a metric). In this prototype, we use the edit distance, which measures the minimal number of editing operations (remove, insert, replace) needed to transform one string into the other.
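As an illustration, the edit distance and the kind of triangle-inequality test that a metric index relies on can be sketched as follows. This is a minimal sketch and not the prototype's code: the class name EditDistanceSketch and the method canPrune, with its pivot, coveringRadius, and range parameters, are hypothetical names chosen here to show the idea behind pruning a subregion.

```java
/** Hypothetical sketch: edit distance plus a triangle-inequality pruning test. */
final class EditDistanceSketch {

    /**
     * Minimal number of single-character remove, insert, and replace
     * operations needed to transform a into b (Levenshtein distance).
     */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning the empty prefix of a into b[0..j) takes j inserts
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into the empty string takes i removals
            for (int j = 1; j <= b.length(); j++) {
                int replaceCost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insert, remove
                        prev[j - 1] + replaceCost);             // replace or keep
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[b.length()];
    }

    /**
     * Pruning test for a subregion summarized by a pivot entry and a covering
     * radius (every entry in the region is within coveringRadius of the pivot).
     * By the triangle inequality, if d(query, pivot) > coveringRadius + range,
     * no entry in the region can be within range of the query.
     */
    static boolean canPrune(String query, String pivot, int coveringRadius, int range) {
        return distance(query, pivot) > coveringRadius + range;
    }
}
```

For example, distance("kitten", "sitting") is 3 (two replacements and one insertion). The pruning test is only valid because the edit distance satisfies the triangle inequality, which is why the similarity measure has to be a true metric.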
Fuzzy matching as a capability is tightly tied to the data being indexed. This poses a challenge for offering fuzzy matching as a service over distributed data. Fuzzy matching may be needed both for network knowledge and for local caches.
To implement the notification function of our system, a publish-subscribe (pub-sub) platform has to be in place to manage the message queues of users. Because we use Java as our primary development language, we need a server that complies with the JMS (Java Message Service) standard. ActiveMQ from the Apache Foundation is attractive to us because it is an open-source project with full compliance with the JMS 1.1 specification, and it scales well through its clustering feature and flexible persistent storage setup (supporting any database).
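The pub-sub pattern with one message queue per user can be sketched in plain Java as follows. This is an in-memory illustration of the pattern only; it does not use the JMS API or ActiveMQ, and the class PubSubSketch and its methods are hypothetical. In a JMS deployment, topics and subscribers provided by the message broker would play these roles.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/** Hypothetical in-memory sketch of pub-sub with per-user queues; not JMS. */
final class PubSubSketch {
    // Topic name -> users subscribed to that topic.
    private final Map<String, List<String>> subscribers = new HashMap<>();
    // User name -> messages waiting for that user, oldest first.
    private final Map<String, Queue<String>> queues = new HashMap<>();

    /** Registers a user's interest in a topic and creates the user's queue if needed. */
    void subscribe(String user, String topic) {
        List<String> users = subscribers.computeIfAbsent(topic, t -> new ArrayList<>());
        if (!users.contains(user)) {
            users.add(user);
        }
        queues.computeIfAbsent(user, u -> new ArrayDeque<>());
    }

    /** Copies the message into the queue of every user subscribed to the topic. */
    void publish(String topic, String message) {
        for (String user : subscribers.getOrDefault(topic, Collections.<String>emptyList())) {
            queues.get(user).add(message);
        }
    }

    /** Returns and removes the user's oldest pending message, or null if there is none. */
    String poll(String user) {
        Queue<String> queue = queues.get(user);
        return queue == null ? null : queue.poll();
    }
}
```

The publisher only names a topic and never addresses users directly, so users can subscribe or stay offline without the publisher changing. A real broker adds what this sketch omits, such as persistence, delivery across machines, and concurrent access.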
JiBX is an XML-to-Java binding tool that allows you to generate an XML schema from existing code, or to start with a schema and generate Java code. JiBX provides very high performance in comparison to other Java data binding tools. Performance comparisons between JiBX and other binding frameworks can be found at https://bindmark.dev.java.net/old-index.html; note, however, that this evaluation was last performed in 2005. JiBX also contains a binding runtime component that allows you to marshal and unmarshal at runtime. We are currently using JiBX for the FP project; however, descriptions of other similar frameworks that were considered are presented at the link below.