Installing and Running Hadoop-fp

For Prototype

This document is for project use only, but will evolve into installation instructions when we distribute. Documentation by User:David Lowery and User:Maureen_Kelly.

Prerequisites


Create FP User

1. From the command line, use the adduser command to create a new user "fp" (it will prompt you to set a password).

 sudo adduser fp

2. Once you have created the account, you must add it to the sudoers file. Using the sudo visudo command, add fp ALL=(ALL) ALL just below the line that reads root ALL=(ALL) ALL. When you have finished, press Ctrl+X and then Y to save.

 sudo visudo
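
After the edit, the relevant portion of the sudoers file should look like this:

 root    ALL=(ALL) ALL
 fp      ALL=(ALL) ALL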

3. Log in as fp once you have finished creating the account and giving it sudo privileges. The remainder of the setup will be done within the fp account you've just created.

SVN Check-out

4. Check out the Filtered Push code in Eclipse from the repository located at the following URL: http://gentoo.cs.umb.edu/svn/FilteredPush. You will need to check out the current trunk as well as the FilteredPushWeb directory.

5. You will also need the three test data sets (fp_specify_a.sql, fp_specify_b.sql, and fp_specify_c.sql), which can be obtained from the folder "test/data" in the Filtered Push repository.
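
If you prefer the command line, the data sets can also be exported with the svn client; the path below assumes they live under trunk/test/data (adjust to match the actual repository layout):

 svn export http://gentoo.cs.umb.edu/svn/FilteredPush/trunk/test/data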

Download Software

6. Lastly, you will need to download the JDK, MySQL, Tomcat, ActiveMQ, and Hadoop. The versions used in these instructions are:

  • JDK 6 Update 17 (jdk-6u17-linux-i586.bin)
  • MySQL 5.1.41 (mysql-5.1.41-linux-i686-glibc23.tar.gz)
  • Tomcat 6.0.20 (apache-tomcat-6.0.20.tar.gz)
  • ActiveMQ 5.3.0 (apache-activemq-5.3.0-bin.tar.gz)
  • Hadoop 0.20.1 (hadoop-0.20.1.tar.gz)

Installing the JDK

1. Move the .bin file to the location you wish to install the JDK to and change to that directory before proceeding.

 sudo mv jdk-6u17-linux-i586.bin /usr/local
 cd /usr/local

2. Make the installer executable, run the installation, and optionally create a symbolic link via the ln command. (You may also remove the .bin file if you no longer need it.)

 chmod +x jdk-6u17-linux-i586.bin 
 sudo ./jdk-6u17-linux-i586.bin
 sudo ln -s /usr/local/jdk1.6.0_17 java
 sudo rm jdk-6u17-linux-i586.bin

3. Edit the file .bashrc under your home directory and add the following lines (replace the value of JAVA_HOME with the location the JDK was extracted to) so that Java can be found from the command line.

 vi ~/.bashrc
 export JAVA_HOME=/usr/local/java/
 export PATH=$JAVA_HOME/bin:$PATH

4. Close the current terminal window and open a new one so that the changes are applied. Type java -version to confirm that the environment variables are configured properly.


Installing MySQL

Note: If installing on Ubuntu, be sure to completely remove the mysql-common and libmysqlclient packages from the Synaptic Package Manager to avoid conflicts.

1. Add the mysql user and group via the following commands in a terminal:

 sudo groupadd mysql
 sudo useradd -g mysql mysql

2. Change to the directory where you would like to install MySQL and extract it via the tar command. After extracting, create a symbolic link with the ln command and change into that directory.

 cd /usr/local
 sudo tar -zxvf ~/mysql-5.1.41-linux-i686-glibc23.tar.gz
 sudo ln -s /usr/local/mysql-5.1.41-linux-i686-glibc23 mysql
 cd mysql

3. Change directory to /etc and remove (or move) the mysql directory if it exists.

 cd /etc
 sudo rm -R mysql

4. Change back to the MySQL directory, then run the following commands to configure ownership and execute the script that initializes the database.

 cd /usr/local/mysql
 sudo chown -R mysql .
 sudo chgrp -R mysql .
 sudo scripts/mysql_install_db --user=mysql --basedir=.
 sudo chown -R root .
 sudo chown -R mysql data

5. Start the server with the following command.

 sudo bin/mysqld_safe --user=mysql &

6. In another terminal window, the following commands will allow you to configure a root password:

 cd /usr/local/mysql
 bin/mysqladmin -u root password 'password'

7. Log in to the mysql client as root with the following command:

 bin/mysql -uroot -ppassword


Configuring the MySQL Datasets


Create Databases

1. Log in to mysql as root, create three databases (testfish, testfrog, testfern), and grant all privileges for localhost. Additionally, you must grant select privileges to all the remote hosts running on the network. In the example below, 'user'@'localhost' is the fp user at the localhost machine, and 'user'@'webprojects.huh.harvard.edu' and 'user'@'umbfp.cs.umb.edu' are the two remote hosts. Adjust these commands as necessary to suit your needs.

 mysql -u root -p

2. Run the following commands for each new database you wish to create:

 mysql> create database testfish;
 mysql> grant all on testfish.* to 'user'@'localhost' identified by 'password';
 mysql> grant select on testfish.* to 'user'@'webprojects.huh.harvard.edu' identified by 'password';
 mysql> grant select on testfish.* to 'user'@'umbfp.cs.umb.edu' identified by 'password';

3. After you have created the databases, flush privileges and quit.

 mysql> flush privileges;
 mysql> quit;


Import SQL Dumps

4. Import the SQL dumps from the fp repository. For user, substitute the same user as in the database creation in the previous step:

 gunzip fp_specify_a.sql.gz
 gunzip fp_specify_b.sql.gz
 gunzip fp_specify_c.sql.gz
 mysql -u user -p testfish < fp_specify_a.sql
 mysql -u user -p testfrog < fp_specify_b.sql
 mysql -u user -p testfern < fp_specify_c.sql


Installing Tomcat

1. Change to the directory where you would like to install Tomcat, extract it via the tar command, and create a symbolic link.

 cd /usr/local
 sudo tar -zxvf ~/apache-tomcat-6.0.20.tar.gz
 sudo ln -s /usr/local/apache-tomcat-6.0.20 tomcat

2. Edit the file .bashrc under your home directory and add the following line (replace the value of CATALINA_HOME with the location you extracted to):

 vi ~/.bashrc
 export CATALINA_HOME=/usr/local/tomcat

3. Change to /usr/local/tomcat/bin and execute the catalina.sh script. After the server starts up, confirm that it is running by typing http://localhost:8080/ into your browser.

 cd /usr/local/tomcat/bin
 sudo ./catalina.sh start


Generating The WAR File

1. Add the repository http://gentoo.cs.umb.edu/svn/FilteredPush to Eclipse and check out the trunk and FilteredPushWeb directories.

2. Right click FilteredPushWeb in the Project Explorer and export as a WAR file.

3. Move this file to the Tomcat webapps directory (for example /usr/local/tomcat/webapps). When you start Tomcat, this file will automatically unpack.
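
For example, assuming the exported file is named FilteredPushWeb.war (the name depends on what you chose during export) and sits in your home directory:

 sudo cp ~/FilteredPushWeb.war /usr/local/tomcat/webapps/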


Installing ActiveMQ

1. Change to the directory where you wish to install ActiveMQ, extract the files to this location, and create a symbolic link to the directory:

 cd /usr/local
 sudo tar zxvf ~/apache-activemq-5.3.0-bin.tar.gz
 sudo ln -s /usr/local/apache-activemq-5.3.0 activemq

2. Change into the newly created link. Make sure the permissions allow you to run the startup script in bin and to access the directory; once this is configured, start ActiveMQ:

 cd activemq
 sudo chown -R fp .
 sudo chmod 755 bin/activemq
 bin/activemq

3. Type the following URL into your browser to confirm that ActiveMQ is up and running: http://localhost:8161/admin/


Installing Hadoop

1. Set up public key-based authentication for ssh connections. The master node needs to be able to ssh to each of the slave nodes non-interactively in order to start the slave node processes. Communication occurs over hadoop-specific ports after that. Note that a master can also be a slave, and that this authentication is used in that case as well, even though the "remote" host is local. (mmk -- I don't know if symmetric authentication is required, i.e. the slaves ssh'ing to the master, but I created the keys for my setup that way, too.) Create public keys on each local host and concatenate them to the end of each remote authorized_keys list:

 localhost> cd /home/fp
 localhost> ssh-keygen
 localhost> scp ~/.ssh/id_rsa.pub  fp@remote-host:
 localhost> ssh fp@remote-host
 remotehost> cat ~/id_rsa.pub >> ~/.ssh/authorized_keys
 remotehost> exit

Verify that non-interactive authentication is working. If not, there might be a directory permissions problem (google for help on that, I can't remember the specifics). Don't forget to turn the sshd service on!
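
If key-based login still prompts for a password, the most common cause is over-permissive permissions on the remote .ssh directory; tightening them usually fixes it:

 remotehost> chmod 700 ~/.ssh
 remotehost> chmod 600 ~/.ssh/authorized_keys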

2. Unpack hadoop-0.20.1.tar.gz in the home directory of each node's fp user. This will create a directory named hadoop-0.20.1.

 cd ~
 tar zxvf ~/Downloads/hadoop-0.20.1.tar.gz


Configuring Hadoop


Environment Configuration

1. Make the following changes to these files, located in the conf directory within the hadoop directory (example contents are shown after the list):

  • hadoop-env.sh: Set JAVA_HOME (e.g. JAVA_HOME=/usr/lib/jvm/java-6-openjdk) on each node.
  • masters: On the master node only, add one line containing the full host name of the master node (e.g. umbfp.cs.umb.edu).
  • slaves: On the master node only, add a line for each slave node containing the slave's full host name.
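
For example, on a master node named umbfp.cs.umb.edu with a single hypothetical slave named slave1.cs.umb.edu (substitute your own host names), the files would contain:

 hadoop-env.sh (every node):
   export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
 masters (master node only):
   umbfp.cs.umb.edu
 slaves (master node only, one host name per line; the master appears here because it also acts as a slave):
   umbfp.cs.umb.edu
   slave1.cs.umb.edu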

hadoop-site.xml

2. Every node should have a nearly identical copy of hadoop-site.xml; the only value that should differ is local.host. The following configuration is Hadoop-specific.

  • hadoop.tmp.dir: this defaults to /tmp, but the Hadoop documentation suggests that in all but the most trivial (single-node) installations it should be set to some other value. We use "/home/fp/tmp". This directory needs to exist on each node.
 <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/fp/tmp</value>
  </property>
 
  • fs.default.name: this represents a connection string for datanodes (slaves) to contact the master node.
 <property>
    <name>fs.default.name</name>
    <value>hdfs://[hostname]:8002</value>
  </property>
 
  • mapred.job.tracker: this represents a connection string for tasktracker nodes (slaves) to contact the master node.
 <property>
    <name>mapred.job.tracker</name>
    <value>[hostname]:8001</value>
 </property>
 
  • dfs.replication: the number of copies HDFS keeps of each block. I'm a little fuzzy on this, but I've had good luck with the value 2.
 <property>
    <name>dfs.replication</name>
    <value>2</value>
 </property>
 

3. Each node must be able to contact the others to process queries. For the moment, this means that each node needs to be able to connect to each Specify database. There should be a stanza like the following in hadoop-site.xml for each node. This is the Filtered Push-specific configuration.

 <property>
    <name>filteredpush.connection.[hostname].dbusr</name>
    <value>sqluser</value>
 </property>
 <property>
    <name>filteredpush.connection.[hostname].passwd</name>
    <value>password</value>
 </property>  
 <property>
    <name>filteredpush.connection.[hostname]</name>
    <value>jdbc:mysql://[dbhostname][:optional port]/[dbname]</value>
 </property>
 
  • local.host: the full host name of the node
 <property>
    <name>local.host</name>
    <value>[hostname]</value>
 </property>
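
Putting the pieces together, a complete hadoop-site.xml for one node might look like the sketch below. The host name, port numbers, database name, and credentials are placeholder values taken from the examples above; substitute your own, and repeat the three filteredpush.connection.* properties for every node in the network.

 <?xml version="1.0"?>
 <configuration>
   <property>
     <name>hadoop.tmp.dir</name>
     <value>/home/fp/tmp</value>
   </property>
   <property>
     <name>fs.default.name</name>
     <value>hdfs://umbfp.cs.umb.edu:8002</value>
   </property>
   <property>
     <name>mapred.job.tracker</name>
     <value>umbfp.cs.umb.edu:8001</value>
   </property>
   <property>
     <name>dfs.replication</name>
     <value>2</value>
   </property>
   <property>
     <name>filteredpush.connection.umbfp.cs.umb.edu.dbusr</name>
     <value>sqluser</value>
   </property>
   <property>
     <name>filteredpush.connection.umbfp.cs.umb.edu.passwd</name>
     <value>password</value>
   </property>
   <property>
     <name>filteredpush.connection.umbfp.cs.umb.edu</name>
     <value>jdbc:mysql://umbfp.cs.umb.edu:3306/testfish</value>
   </property>
   <property>
     <name>local.host</name>
     <value>umbfp.cs.umb.edu</value>
   </property>
 </configuration>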
 


Format the Namenode

4. Execute the following commands on the master node to start the dfs daemon and create the hadoop file system:

localhost> cd $HADOOP_HOME
localhost> bin/start-dfs.sh
localhost> bin/hadoop namenode -format
localhost> bin/stop-dfs.sh

5. Specify will probably be running as a different user than fp. To allow hadoop queries to be processed when launched by Specify, the hadoop file system permissions must be altered (see http://hadoop.apache.org/core/docs/current/hdfs_permissions_guide.html). On the master node, run the following commands:

localhost> cd $HADOOP_HOME
localhost> bin/start-all.sh
localhost> bin/hadoop dfs -chmod -R 777 /
localhost> bin/stop-all.sh

There are other shell-like commands that can be run through hadoop to see what's going on with its file system. They are documented here: http://hadoop.apache.org/core/docs/current/hdfs_shell.html
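
For example, to list the top level of the hadoop file system while the daemons are running:

localhost> cd $HADOOP_HOME
localhost> bin/hadoop dfs -ls /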

For the future, we could consider putting any Specify user that wants to run the fpnet client into the fp group and have everything authenticated to that group. This deserves discussion (later). --Bob Morris 22:06, 15 February 2009 (UTC)


Starting and Stopping Hadoop

1. To start the master node and all the slaves, run the following commands on the master:

localhost> cd $HADOOP_HOME
localhost> bin/start-all.sh

2. To see which hadoop processes are running on a local node, go to $HADOOP_HOME and run:

localhost> jps

3. On each of the slaves, you should see a datanode process and a tasktracker process. On the master, you should also see namenode and jobtracker processes, and possibly a secondarynamenode process. If you don't, check the logs ($HADOOP_HOME/logs). You will get almost no indication from standard out on the master if anything goes wrong.
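
For illustration, the output on a master node that is also acting as a slave might look something like this (the process IDs will differ):

localhost> jps
12305 NameNode
12412 DataNode
12531 SecondaryNameNode
12608 JobTracker
12714 TaskTracker
12820 Jps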

4. To stop hadoop, on the master node go to $HADOOP_HOME and run:

localhost> bin/stop-all.sh