Installing and Running Specify-fp

From Filtered Push Wiki


This document is for project use only, but will evolve into installation instructions when we distribute. Documentation by User:Maureen_Kelly.

Prerequisites

  • MySQL
  • Java 6

Configure MySQL

If the Specify installation is to be a Filtered Push node, remote access must be enabled in my.cnf (in /etc/mysql/): make sure the following line is commented out, and restart mysqld if changes were necessary (sudo /etc/init.d/mysql restart):

#bind-address           = 127.0.0.1
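Once mysqld has restarted, a quick way to confirm that MySQL is accepting remote connections is to check which address it is listening on (this assumes the default MySQL port, 3306):

```shell
localhost> netstat -tln | grep 3306
```

If the listing shows 0.0.0.0:3306 rather than 127.0.0.1:3306, remote access is enabled.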

Create a database named "testfish", create a user, and assign privileges. You will also have to grant select privileges to the user from each of the remote hosts in the network. Get the values of 'user' and 'password' from Maureen; you will need a particular MySQL user set up in order to make Specify work with the SQL dump (which includes a test data set and some Workbench settings for the demo).

localhost> mysql -u root -p mysql
mysql> create database testfish;
mysql> grant all on testfish.* to 'user'@'localhost' identified by 'password';
mysql> grant select on testfish.* to 'user'@'webprojects.huh.harvard.edu' identified by 'password';
mysql> grant select on testfish.* to 'user'@'umbfp.cs.umb.edu' identified by 'password';
mysql> flush privileges;
mysql> quit;
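To double-check that the remote grants took effect, you can list the privileges MySQL has recorded for each host (substituting the real user name you were given):

```shell
localhost> mysql -u root -p -e "SHOW GRANTS FOR 'user'@'webprojects.huh.harvard.edu'"
localhost> mysql -u root -p -e "SHOW GRANTS FOR 'user'@'umbfp.cs.umb.edu'"
```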

Unzip the SQL dump in the demo directory and import it. For 'user', substitute the same user you created in the previous step:

localhost> cd demo; gunzip fp_specify_a.sql.gz
localhost> mysql -u user -p testfish < fp_specify_a.sql
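To verify the import, list the tables in testfish; any non-empty listing of Specify schema tables means the dump loaded:

```shell
localhost> mysql -u user -p testfish -e "SHOW TABLES" | head
```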

Unpack Specify

Unpack specify.tar. This will create a directory named "specify". I will refer to this directory as $SPECIFY_ROOT from now on.
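If you want your shell to match the transcripts below, you can set $SPECIFY_ROOT as an environment variable after unpacking (this is just a convenience; the variable name is our own):

```shell
localhost> tar -xf specify.tar
localhost> cd specify
localhost> export SPECIFY_ROOT=$(pwd)
```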

To test that Specify finds Java correctly, we will start it but not log in. This is very important at this point: we have not yet configured Specify, and logging in would complicate that. Before you start it, on Ubuntu you will want to turn off Compiz (right-click the desktop and choose "Change Desktop Background" -> "Visual Effects" -> "None"); otherwise you might get blank dialog windows and other very bad GUI effects.

Navigate to $SPECIFY_ROOT and run specify.sh:

localhost> cd $SPECIFY_ROOT
localhost> ./specify.sh

REMEMBER: IF THE LOGIN SCREEN APPEARS, DO NOT LOG IN OR OTHERWISE CLICK ANYTHING ON THE SCREEN AT THIS POINT, EXCEPT CANCEL

There may be Java errors in the shell window where you started Specify, but if you see the Specify login window, you have tested your Specify sufficiently at this point. Do not log in. Instead, press Cancel, and proceed with the next step.

If you do not see a login screen, we haven't started to think about that yet.  :-)


Add user.properties to Specify

I have provided a copy of my user.properties file. Put it in $SPECIFY_ROOT/Specify/. There will be some errors when you start Specify, because user.properties stores some paths that won't exist in your installation, but Specify recovers. Using my file gives you a working user/password configuration and prevents Specify from trying to create the database the first time it is run.

Running Specify

If you are running Ubuntu and have not already done so, turn off Compiz (right-click the desktop and choose "Visual Effects", then "None"); otherwise you might get blank dialog windows and other very bad GUI effects.

Navigate to $SPECIFY_ROOT and run specify.sh:

localhost> cd $SPECIFY_ROOT
localhost> ./specify.sh

To log in, use the username and password from Maureen. You can also specify which database and host to use here, but my user.properties file has pre-selected "testfish" and "localhost," which is what we want.

If a Specify instance was previously started and not shut down properly, you may see an "Already Logged IN" message. You can press "Override" and continue.

Click on "More Information". If you correctly copied user.properties to $SPECIFY_ROOT/Specify you should see that the Database is "testfish" and the Server is "localhost". Be sure that the Username and Password at the top of the login screen are those you have been issued, and click Login.

Now you should have a Specify screen whose WorkBench DataSets menu lists several datasets, one of which should be named fp_test_dup_set_a_cleaned. (If it is missing, user.properties may fail to mention it.) Click on it. Specify should now show a dataset with many duplicate records in it; that is, many pairs of records that have the same Collector and Collector Code, albeit with other differences. If you click on the wrong data set, not to worry: Specify will open a window whose left pane also shows the available datasets. Once you see this dataset, you are ready to move to the next step, installing and configuring Hadoop and the FP network resting on it.

Install and Configure Hadoop

Installing and Running Hadoop-fp

Unpack Filtered Push: h.jar

Unpack filtered-push.tgz. This will create a directory named "filtered-push". Inside this directory is a file called h.jar that contains the Filtered Push API for making queries to the hadoop network. The API requires a hadoop configuration file named hadoop-site.xml. This is where the scope of hadoop queries is defined (which nodes will be included in the queries).

Unfortunately "hadoop-site.xml" is also the name of the configuration file read by the hadoop node-running code, which we installed previously in Installing and Running Hadoop-fp. The configurations made in that file are mainly for database connections. These two files have different purposes and should remain separate.

You will need to extract hadoop-site.xml from h.jar, make some modifications, and update h.jar.

localhost> tar -xzf filtered-push.tgz
localhost> cd filtered-push
localhost> jar -xf h.jar hadoop-site.xml
localhost> nano hadoop-site.xml

Four properties are required to be set in hadoop-site.xml.

  • fs.default.name: This is required for communication with the hadoop network. This must match the property of the same name in one of the hadoop master nodes' hadoop-site.xml configuration files.
 <name>fs.default.name</name>
 <value>hdfs://localhost:8002</value>
 
  • mapred.job.tracker: This is required for communication with the hadoop network. This must match the property of the same name in one of the hadoop master nodes' hadoop-site.xml configuration files.
 <name>mapred.job.tracker</name>
 <value>localhost:8001</value>
 
  • filteredpush.scope: This defines the set of database connections that are queried. Connection names should be separated by commas (no whitespace) and for each name there must be properties in the hadoop slave nodes' hadoop-site.xml configuration files for filteredpush.connection.$name, filteredpush.connection.$name.dbusr, and filteredpush.connection.$name.passwd.
 <name>filteredpush.scope</name>
 <value>localhost</value>
 
  • filteredpush.jar: This defines the path to the jar containing the hadoop map-reduce code that the master nodes distribute to the slave nodes; this code defines the operations to be performed in the hadoop job generated by the filtered push query.
 <name>filteredpush.jar</name>
 <value>$SPECIFY_ROOT/Specify/libs/h.jar</value>
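Putting the four properties together, a minimal complete hadoop-site.xml would look like the sketch below. The values are the illustrative ones from this page, not real settings, and you should replace $SPECIFY_ROOT with the actual absolute path, since hadoop will not expand shell variables. The heredoc just writes the file so you can compare it with the one extracted from h.jar:

```shell
# Write an example hadoop-site.xml containing the four required properties.
# The values below are the examples from this page; substitute your own.
cat > hadoop-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8002</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8001</value>
  </property>
  <property>
    <name>filteredpush.scope</name>
    <value>localhost</value>
  </property>
  <property>
    <name>filteredpush.jar</name>
    <value>$SPECIFY_ROOT/Specify/libs/h.jar</value>
  </property>
</configuration>
EOF
```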
 

Put your changed hadoop-site.xml back into the jar:

 localhost> jar -uf h.jar hadoop-site.xml
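You can confirm that the jar now contains your edited file by listing its contents in verbose mode, which prints modification dates:

```shell
localhost> jar -tvf h.jar | grep hadoop-site.xml
```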

Add the Filtered Push Jars to Specify

Now you need to put h.jar where Specify can find it when it is called:

localhost> cp h.jar $SPECIFY_ROOT/Specify/libs/

You also need to add some jars required by h.jar. You don't want to copy all of them, because Specify already has some of them, possibly by different names/versions, and we don't want the classloader to get confused:

localhost> cp lib/commons-cli-2.0-SNAPSHOT.jar $SPECIFY_ROOT/Specify/libs/
localhost> cp lib/hadoop* $SPECIFY_ROOT/Specify/libs/
localhost> cp lib/hbase* $SPECIFY_ROOT/Specify/libs/
localhost> cp lib/{message,xbean,xmltypes}.jar $SPECIFY_ROOT/Specify/libs/


I need to put instructions here on the Filtered Push functionality for the demonstration. Note that in order to actually use the Filtered Push functionality, the hadoop network has to be running! See Installing and Running Hadoop-fp for how to do that.

Setting the node name

This is not the best place to put it, but for the time being the node name reported by Specify to its Filtered-Push results chooser (shown in the workbench row data at the top left of the dialog) is set in resources_en.properties, as FilteredPush.NodeName=HUH. I did not think to change that when I created the distribution files, so if you want something else displayed, you will have to change that file and put it back into specify.jar, in $SPECIFY_ROOT/Specify/libs.
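A sketch of that edit, assuming resources_en.properties sits at the top level of specify.jar (if it lives in a package subdirectory, extract and update it with that path instead), and using MYNODE as a stand-in for the name you want displayed:

```shell
localhost> cd $SPECIFY_ROOT/Specify/libs
localhost> jar -xf specify.jar resources_en.properties
localhost> sed -i 's/^FilteredPush.NodeName=HUH$/FilteredPush.NodeName=MYNODE/' resources_en.properties
localhost> jar -uf specify.jar resources_en.properties
```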

Node names reported by hadoop are the nodes' host names, which display in the Filtered-Push results chooser results list.