Installing and Running the Prototype
This document is of historical interest only. It describes the procedure for running the Hadoop-based prototype FilteredPush system developed under NSF DBI #0646266, and it may work with code from around revision 658.
You probably want the Current Developer Documentation instead.
- 1 Prerequisites
- 2 Installing Software
- 3 SVN checkout via Subclipse
- 4 Configuring the MySQL Datasets
- 5 Configure Triageconf.xml
- 6 Generating The WAR File
- 7 Configuring Hadoop
- 8 Starting and Stopping Hadoop
Create FP User
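1. From the command line, use the adduser command to create a new user "fp". On Ubuntu, adduser also prompts for the new account's password; if you use useradd instead, set the password afterwards with passwd.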
sudo adduser fp
2. Once you have created the account, you must add it to the sudoers file. Using the sudo visudo command, add fp ALL=(ALL) ALL just below the line that reads root ALL=(ALL) ALL. When you have finished, press Ctrl+X and then Y to save (these keys apply when visudo opens the nano editor).
3. Log in as fp once you have finished creating the account and giving it sudo privileges. The remainder of the setup will be done within the fp account you've just created.
Download and Install Software
6. Lastly, you will need to download the latest versions of the JDK, MySQL, Tomcat, ActiveMQ, and Hadoop. A list of everything you need can be found below:
- Eclipse - http://eclipse.org/downloads/
- Subversion - http://subversion.apache.org/packages.html
- JDK - http://java.sun.com/javase/downloads/index.jsp
- MySQL - http://dev.mysql.com/downloads/mysql/
- Tomcat - http://www.gossipcheck.com/mirrors/apache/tomcat/
- ActiveMQ - http://activemq.apache.org/download.html
- Solr - http://lucene.apache.org/solr/
- Hadoop - http://mirror.atlanticmetro.net/apache/hadoop/core/
If you are using Ubuntu Linux or another distribution with a package manager, you may use it to install the software prerequisites instead of following the instructions for manual installation. If you do choose to use a package manager, skip to Configuring the MySQL Datasets.
Installing the JDK
1. Move the .bin file to the location you wish to install the JDK to and change to that directory before proceeding.
sudo mv jdk-6u17-linux-i586.bin /usr/local
cd /usr/local
2. Check the permissions, run the installation, and lastly create a symbolic link via the ln command if you wish. You may also remove the .bin file if you no longer need it.
sudo chmod +x jdk-6u17-linux-i586.bin
sudo ./jdk-6u17-linux-i586.bin
sudo ln -s /usr/local/jdk1.6.0_17 java
sudo rm jdk-6u17-linux-i586.bin
3. Edit the file .bashrc under your home directory and add the following lines (replace the value of JAVA_HOME with the location your JDK extracted to) so that Java can be found from the command-line.
vi ~/.bashrc
export JAVA_HOME=/usr/local/java/
export PATH=$JAVA_HOME/bin:$PATH
4. To update your terminal session and confirm that the environment variables are configured properly type the following:
source ~/.bashrc
java -version
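As a quick sanity check on the two exports above, a small shell function can confirm that JAVA_HOME points at a directory containing an executable bin/java. This is only a sketch: the function name check_java_home is made up for illustration and is not part of the prototype.

```shell
# Hypothetical helper (not part of the prototype): succeeds only if
# the given directory contains an executable bin/java.
check_java_home() {
    dir="${1%/}"   # drop one trailing slash, as in /usr/local/java/
    if [ -x "$dir/bin/java" ]; then
        echo "ok: $dir/bin/java"
    else
        echo "missing: $dir/bin/java" >&2
        return 1
    fi
}

# Usage, after 'source ~/.bashrc':
#   check_java_home "$JAVA_HOME" && java -version
```

If the function reports a missing file, the most likely cause is that the symbolic link from step 2 was created in a different directory than the one JAVA_HOME names.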
Installing MySQL
NOTE: If installing in Ubuntu, use the Synaptic Package Manager to completely remove the mysql-common and libmysqlclient packages first, to avoid conflicts.
1. Add the mysql user and group via the following commands in a terminal:
sudo groupadd mysql
sudo useradd -g mysql mysql
2. Change the current directory to point to where you would like to install to and extract MySQL via the tar command. After extracting create a symbolic link with the ln command and change directory to that location.
cd /usr/local
sudo tar -zxvf ~/mysql-5.1.41-linux-i686-glibc23.tar.gz
sudo ln -s /usr/local/mysql-5.1.41-linux-i686-glibc23 mysql
cd mysql
3. Run the following commands to configure ownership and execute the script that configures the database.
sudo chown -R mysql .
sudo chgrp -R mysql .
sudo scripts/mysql_install_db --user=mysql
sudo chown -R root .
sudo chown -R mysql data
4. Start the server with the following command.
sudo bin/mysqld_safe --user=mysql &
6. In another terminal window, the next command will allow you to configure a root password:
cd /usr/local/mysql
bin/mysqladmin -u root password 'password'
7. Login as root to the mysql client with the following command:
bin/mysql -uroot -ppassword
8. As root, create the fp user for accessing the datasets:
create user fp identified by 'password';
Installing Tomcat
1. Change the current directory to the location where you want to install Tomcat, extract it via the tar command, and create a symbolic link to the extracted directory.
cd /usr/local
sudo tar -zxvf ~/apache-tomcat-6.0.20.tar.gz
sudo ln -s /usr/local/apache-tomcat-6.0.20 tomcat
2. Edit the file .bashrc under your home directory and add the following line (replace the value of CATALINA_HOME with the location you extracted to).
vi ~/.bashrc
export CATALINA_HOME=/usr/local/tomcat
3. Change the current directory to point to /usr/local/tomcat/bin and execute the catalina.sh script. After the server starts up, test to confirm that it is running by typing http://localhost:8080/ into your browser.
sudo -s
cd /usr/local/tomcat/bin
./catalina.sh start
4. To stop tomcat use the following commands:
sudo -s
./catalina.sh stop
Installing ActiveMQ
1. Change directory to the location you wish to install activemq to, extract the files to this location, and create a symbolic link to the directory:
cd /usr/local
sudo tar zxvf ~/apache-activemq-5.3.0-bin.tar.gz
sudo ln -s /usr/local/apache-activemq-5.3.0 activemq
2. Change directory to the newly created link. Make sure the permissions allow you to access the directory and run the startup script in bin. Once this is configured, start activemq:
cd activemq
sudo chown -R fp .
sudo chmod 755 bin/activemq
bin/activemq
3. Type the following URL in your browser to confirm that activemq is up and running: http://localhost:8161/admin/
Installing Solr
Solr is used in the prototype as the indexing engine for the general search page, auto-completion, and fuzzy match. The following instructions deploy it as a web application in tomcat:
- Download the solr distribution from the solr web site (http://lucene.apache.org/solr/) as instructed there. After decompressing the file into a folder, you will find a sub-directory named 'dist', which contains a file called apache-solr-x.x.x.war. This is the file that you will put into the webapps directory of a tomcat installation.
- Now we need to tell the solr webapp where its data and configuration directories are. Several methods can be found online; the one I found effective is to append the following to JAVA_OPTS:
-Dsolr.solr.home=(path to the solr home) -Dsolr.data.dir=(path to the solr data)
The first option tells solr where to find its configuration files, and the second tells it where to find the data directory (for indices). You can use 'example/solr/' in the decompressed folder as the starting point for your configuration. The main configuration files are schema.xml and solrconfig.xml under the conf folder.
More to add after the configuration file is checked in somewhere. --Zhimin
Installing Hadoop
NOTE: If the ssh client and server are not installed by default on your distribution, you must install them on the master and all slaves. In Ubuntu you may use the following command:
sudo apt-get install ssh
1. Create public key-based authentication for ssh connections. The master node needs to be able to ssh to each of the slave nodes non-interactively in order to start the slave node processes; after that, communication occurs over hadoop-specific ports. Note that a master can also be a slave, and that authentication is used in this case as well, even though the "remote" host is local. (mmk-- I don't know if symmetric authentication is required, i.e. the slaves ssh'ing to the master, but I created the keys for my setup that way, too.) Create public keys on each local host and append them to the authorized_keys list on each remote host:
localhost> cd /home/fp
localhost> ssh-keygen
localhost> scp ~/.ssh/id_rsa.pub fp@remote-host:
localhost> ssh fp@remote-host
remotehost> cat ~/id_rsa.pub >> ~/.ssh/authorized_keys
remotehost> exit
Verify that non-interactive authentication is working. If not, there might be a directory permissions problem (google for help on that, I can't remember the specifics). Don't forget to turn the sshd service on!
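On the permissions problem mentioned above: OpenSSH's sshd, with its default StrictModes setting, ignores authorized_keys when the .ssh directory or the key file is writable by other users. The sketch below tightens those permissions and then tests the login with BatchMode=yes, which makes ssh fail rather than fall back to a password prompt. Both function names are invented for illustration, and remote-host is the same placeholder used in the commands above.

```shell
# Hypothetical helper: tighten permissions on an ssh directory so that
# sshd (with its default StrictModes setting) will accept the keys in it.
fix_ssh_perms() {
    ssh_dir="${1:-$HOME/.ssh}"
    chmod 700 "$ssh_dir" || return 1
    if [ -f "$ssh_dir/authorized_keys" ]; then
        chmod 600 "$ssh_dir/authorized_keys"
    fi
}

# Hypothetical helper: succeeds only if login works without any prompt.
# BatchMode=yes makes ssh fail instead of asking for a password.
check_noninteractive_ssh() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}

# Usage on each node (remote-host is a placeholder, as above):
#   fix_ssh_perms
#   check_noninteractive_ssh fp@remote-host && echo "key login works"
```

If the check still fails after fixing permissions, the home directory itself may be group- or world-writable, which StrictModes also rejects.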
2. Unpack hadoop-0.20.1.tar.gz in the home directory of each node's fp user. This will create a directory named hadoop-0.20.1.
cd ~
tar zxvf ~/Downloads/hadoop-0.20.1.tar.gz
SVN checkout via Subclipse
Configuring the MySQL Datasets
1. Log in to mysql as root, create three databases (testfish, testfrog, testfern), and grant all privileges for the localhost. Additionally, you must grant select privileges to all the remote hosts that are running on the network. In the example below, 'user'@'localhost' is the fp user at the localhost machine, and 'user'@'webprojects.huh.harvard.edu' and 'user'@'umbfp.cs.umb.edu' are the two remote hosts. Adjust these commands as necessary to suit your needs. The -p option with no value makes the client prompt for the root password.
mysql -u root -p
2. Run the following commands for each new database you wish to create:
mysql> create database testfish;
mysql> grant all on testfish.* to 'user'@'localhost' identified by 'password';
mysql> grant select on testfish.* to 'user'@'webprojects.huh.harvard.edu' identified by 'password';
mysql> grant select on testfish.* to 'user'@'umbfp.cs.umb.edu' identified by 'password';
3. After you have created the databases, flush privileges and quit.
mysql> flush privileges;
mysql> quit;
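Because the same four statements are repeated for testfish, testfrog, and testfern, they can be generated rather than typed three times. The following is a minimal sketch, not something shipped with the prototype: the function names fp_db_sql and fp_all_db_sql are hypothetical, and the user, password, and host names are the placeholders used in the steps above.

```shell
# Hypothetical helper: print the create/grant statements for one database.
# Arguments: database name, user, password, then zero or more remote hosts
# that should get read-only (select) access.
fp_db_sql() {
    db="$1"; user="$2"; pass="$3"
    shift 3
    echo "create database $db;"
    echo "grant all on $db.* to '$user'@'localhost' identified by '$pass';"
    for host in "$@"; do
        echo "grant select on $db.* to '$user'@'$host' identified by '$pass';"
    done
}

# Print the statements for all three datasets, then flush privileges.
# Arguments: user, password, then the remote hosts.
fp_all_db_sql() {
    for dataset in testfish testfrog testfern; do
        fp_db_sql "$dataset" "$@"
    done
    echo "flush privileges;"
}

# Usage (review the output first, then pipe it to the client):
#   fp_all_db_sql user password webprojects.huh.harvard.edu umbfp.cs.umb.edu | mysql -u root -p
```

Printing the SQL instead of executing it directly keeps the step reviewable; nothing touches the server until the output is piped to mysql.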
Import SQL Dumps
4. Import the SQL dumps from the fp repository. For user, substitute the same user as in the database creation in the previous step:
gunzip fp_specify_a.sql.gz
gunzip fp_specify_b.sql.gz
gunzip fp_specify_c.sql.gz
mysql -uroot -ppassword testfrog < /mnt/fp/fp_specify_a.sql
mysql -uroot -ppassword testfish < /mnt/fp/fp_specify_b.sql
mysql -uroot -ppassword testfern < /mnt/fp/fp_specify_c.sql
Configure Triageconf.xml
The triageconf.xml file should be located in the src directory of the FP-Web project folder. The database user and password must be included, as well as the location of the indexing files. Using the file below as a reference, change the following areas to match your specific configuration:
File:Triageconf.xml (Note, this file is also available via svn checkout in the src directory of the FP-Web project)
Specify the database user, password, and connection URL where you see the following:
<property name="user" value="fpuser"/>
<property name="password" value="umb2fp"/>
<property name="connURL" value="jdbc:mysql://localhost:3306/bbg"/>
Lastly, change the following lines to reference the location of the index files:
<property name="firstNameIndexFile" value="/home/wangzm/workspace/first_names.ind"></property>
<property name="lastNameIndexFile" value="/home/wangzm/workspace/last_names.ind"></property>
Generating The WAR File
1. Add the repository http://gentoo.cs.umb.edu/svn/FilteredPush to eclipse and check out the trunk and FilteredPushWeb directories.
2. Right click FilteredPushWeb in the Project Explorer and export as a WAR file.
3. Move this file to the tomcat webapps directory (for example /usr/local/tomcat/webapps). When you start tomcat, the file will be unpacked automatically.
Configuring Hadoop
1. Make the following changes to these files located in conf within the hadoop directory:
- hadoop-env.sh: Set JAVA_HOME (e.g. JAVA_HOME=/usr/lib/jvm/java-6-openjdk) on each node.
- masters: On the master node only, add one line containing the full host name of the master node (e.g. umbfp.cs.umb.edu).
- slaves: On the master node only, add a line for each slave node containing the slave's full host name.
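The masters and slaves files are plain lists of host names, one per line, so they can be written with a few lines of shell. The sketch below assumes the conf directory from the hadoop-0.20.1 unpacking step; the function name write_cluster_files is hypothetical, and the host names in the usage line are placeholders to replace with the nodes in your own cluster.

```shell
# Hypothetical helper: write conf/masters and conf/slaves on the master node.
# Arguments: hadoop conf directory, master host name, then one or more slaves.
write_cluster_files() {
    conf_dir="$1"; master="$2"
    shift 2
    echo "$master" > "$conf_dir/masters"
    # One slave host name per line.
    printf '%s\n' "$@" > "$conf_dir/slaves"
}

# Usage (placeholder host names; a master can also be listed as a slave):
#   write_cluster_files ~/hadoop-0.20.1/conf master.example.org \
#       master.example.org slave1.example.org
```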
2. Every node should have a nearly identical copy of hadoop-site.xml. The only value that should differ is local.host. The following configuration is hadoop specific.
- hadoop.tmp.dir: this defaults to /tmp, but the Hadoop documentation suggests that in all but the most trivial (single-node) installations it should be set to some other value. We use "/home/fp/tmp". This directory needs to exist on each node.
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/fp/tmp</value>
</property>
- fs.default.name: this represents a connection string for datanodes (slaves) to contact the master node.
<property>
  <name>fs.default.name</name>
  <value>hdfs://[hostname]:8002</value>
</property>
- mapred.job.tracker: this represents a connection string for tasktracker nodes (slaves) to contact the master node.
<property>
  <name>mapred.job.tracker</name>
  <value>[hostname]:8001</value>
</property>
- dfs.replication: the number of datanodes each file block is copied to. I'm a little fuzzy on how to tune this, but I've had good luck with the value 2.
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
3. Each node must be able to contact the others to process queries. For the moment, this means that each node needs to be able to connect to each Specify database. There should be a stanza in hadoop-site.xml like the following for each node. This is the filtered push configuration.
<property>
  <name>filteredpush.connection.[hostname].dbusr</name>
  <value>sqluser</value>
</property>
<property>
  <name>filteredpush.connection.[hostname].passwd</name>
  <value>password</value>
</property>
<property>
  <name>filteredpush.connection.[hostname]</name>
  <value>jdbc:mysql://[dbhostname][:optional port]/[dbname]</value>
</property>
- local.host: the full host name of the node
<property>
  <name>local.host</name>
  <value>[hostname]</value>
</property>
An example hadoop-site.xml configuration file: File:Hadoop-site.xml
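Since every node's hadoop-site.xml is identical apart from local.host, one way to keep the copies in sync is to generate the file from the two host names. The sketch below wraps the hadoop-specific properties listed above in the <configuration> root element that Hadoop configuration files use. The function names are hypothetical, it reads [hostname] in fs.default.name and mapred.job.tracker as the master's host name, and it leaves out the filteredpush.connection stanzas, which would still have to be added for each node. Values are written as-is, with no XML escaping.

```shell
# Hypothetical helpers; names are illustrative only.
# Print one <property> element.
hadoop_property() {
    printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' "$1" "$2"
}

# Print a hadoop-site.xml for one node.
# Arguments: this node's full host name, the master's full host name.
hadoop_site_xml() {
    node="$1"; master="$2"
    echo '<?xml version="1.0"?>'
    echo '<configuration>'
    hadoop_property hadoop.tmp.dir /home/fp/tmp
    hadoop_property fs.default.name "hdfs://$master:8002"
    hadoop_property mapred.job.tracker "$master:8001"
    hadoop_property dfs.replication 2
    hadoop_property local.host "$node"
    echo '</configuration>'
}

# Usage (placeholder host names):
#   hadoop_site_xml node1.example.org master.example.org > conf/hadoop-site.xml
```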
Format the Namenode
4. Execute the following commands on the master node to start the dfs daemon and create the hadoop file system:
localhost> cd $HADOOP_HOME
localhost> bin/start-dfs.sh
localhost> bin/hadoop namenode -format
localhost> bin/stop-dfs.sh
5. Specify will probably be running as a different user than fp. To allow hadoop queries to be processed when launched by Specify, the hadoop file system permissions must be altered (see http://hadoop.apache.org/core/docs/current/hdfs_permissions_guide.html). On the master node, run the following commands:
localhost> cd $HADOOP_HOME
localhost> bin/start-all.sh
localhost> bin/hadoop dfs -chmod -R 777 /
localhost> bin/stop-all.sh
There are other shell-like commands that can be run through hadoop to see what's going on with its file system. They are documented here: http://hadoop.apache.org/core/docs/current/hdfs_shell.html
- for future, we could consider putting any Specify user that wants to run the fpnet client into the fp group and have everything authenticated to that group. This deserves discussion (later).--Bob Morris 22:06, 15 February 2009 (UTC)
Starting and Stopping Hadoop
1. To start the master node and all the slaves, run the following commands on the master:
localhost> cd $HADOOP_HOME
localhost> bin/start-all.sh
2. To ask hadoop which processes are running on a local node, go to $HADOOP_HOME and run:
3. On each of the slaves, you should see a datanode process and a tasktracker process. On the master, you should also see namenode and jobtracker processes, and possibly a secondarynamenode process. If you don't, check the logs ($HADOOP_HOME/logs). You will get almost no indication from standard out on the master if anything goes wrong.
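The command for step 2 did not survive on this page. One tool that lists running Java processes by class name is the JDK's jps, and the sketch below assumes that tool; treat it as a stand-in rather than the original instruction. The helper reads jps-style output and reports which expected daemons are absent, using the class names NameNode, SecondaryNameNode, JobTracker, DataNode, and TaskTracker. The function name check_daemons is made up for illustration.

```shell
# Hypothetical helper: read jps-style output on stdin and report any of the
# named daemons that are missing. Returns non-zero if one is missing.
check_daemons() {
    listing=$(cat)
    missing=0
    for name in "$@"; do
        # jps prints "<pid> <ClassName>"; match the class name as a whole
        # word so that NameNode does not match SecondaryNameNode.
        if ! printf '%s\n' "$listing" | grep -qw "$name"; then
            echo "not running: $name"
            missing=1
        fi
    done
    return $missing
}

# Usage (assuming the JDK's jps tool is on the PATH):
#   jps | check_daemons DataNode TaskTracker         # on a slave
#   jps | check_daemons NameNode JobTracker          # on the master
```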
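4. To stop hadoop, on the master node go to $HADOOP_HOME and run the same stop script used in the namenode setup above:
localhost> cd $HADOOP_HOME
localhost> bin/stop-all.sh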