Installing and Running Hadoop-fp
- 1 Prerequisites
- 2 Installing the JDK
- 3 Installing MySQL
- 4 Configuring the MySQL Datasets
- 5 Installing Tomcat
- 6 Generating The WAR File
- 7 Installing ActiveMQ
- 8 Installing Hadoop
- 9 Configuring Hadoop
- 10 Starting and Stopping Hadoop
Prerequisites
Create FP User
1. From the command line, use the adduser command (and passwd, if adduser does not prompt you for a password) to create a new user "fp".
sudo adduser fp
2. Once you have created the account, you must add it to the sudoers file. Run sudo visudo and add the line fp ALL=(ALL) ALL just below the line that reads root ALL=(ALL) ALL. When you have finished, press Ctrl+X and then Y to save (this assumes visudo opens the nano editor).
3. Log in as fp once you have finished creating the account and giving it sudo privileges. The remainder of the fp setup will be done within the fp account you've just created.
4. Check out the Filtered Push code in Eclipse from the repository located at the following URL: http://gentoo.cs.umb.edu/svn/FilteredPush. You will need to check out the current trunk as well as the FilteredPushWeb directory.
5. You will also need the three test data sets (fp_specify_a.sql, fp_specify_b.sql, and fp_specify_c.sql), which can be obtained from the folder "test/data" in the Filtered Push repository.
6. Lastly, you will need to download the latest versions of the JDK, MySQL, Tomcat, ActiveMQ, and Hadoop. Everything you need is listed below:
- JDK - http://java.sun.com/javase/downloads/index.jsp
- MySQL - http://dev.mysql.com/downloads/mysql/
- Tomcat - http://www.gossipcheck.com/mirrors/apache/tomcat/
- ActiveMQ - http://activemq.apache.org/download.html
- Hadoop - http://mirror.atlanticmetro.net/apache/hadoop/core/
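Before moving on, it can help to confirm that every archive used later in this guide has been downloaded. The sketch below assumes the files sit in a single directory (~/Downloads by default, which is an assumption) and uses the file names that appear in the installation commands of this guide; adjust them if you downloaded different versions. The function name missing_downloads is made up for illustration.

```shell
#!/bin/sh
# Print the name of every expected archive that is missing from a directory.
# The file names are the versions used in this guide; change them to match
# whatever you actually downloaded.
missing_downloads() {
    dir="${1:-$HOME/Downloads}"   # assumed download location
    for f in \
        jdk-6u17-linux-i586.bin \
        mysql-5.1.41-linux-i686-glibc23.tar.gz \
        apache-tomcat-6.0.20.tar.gz \
        apache-activemq-5.3.0-bin.tar.gz \
        hadoop-0.20.1.tar.gz
    do
        [ -f "$dir/$f" ] || echo "$f"
    done
}
```

Running missing_downloads ~/Downloads prints nothing when all five files are present.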
Installing the JDK
1. Move the .bin file to the location you wish to install the JDK to and change to that directory before proceeding.
sudo mv jdk-6u17-linux-i586.bin /usr/local
cd /usr/local
2. Make the installer executable, run the installation, and, if you wish, create a symbolic link with the ln command. You may also remove the .bin file once you no longer need it.
sudo chmod +x jdk-6u17-linux-i586.bin
sudo ./jdk-6u17-linux-i586.bin
sudo ln -s /usr/local/jdk1.6.0_17 java
sudo rm jdk-6u17-linux-i586.bin
3. Edit the file .bashrc under your home directory and add the following lines (replace the value of JAVA_HOME with the location your JDK extracted to) so that Java can be found from the command-line.
vi ~/.bashrc
export JAVA_HOME=/usr/local/java/
export PATH=$JAVA_HOME/bin:$PATH
4. Close the current terminal window and open a new one so that the changes are applied. Type java -version to confirm that the environment variables are configured properly.
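In addition to java -version, the JAVA_HOME value itself can be checked, since Hadoop reads it later. The following is a minimal sketch; check_java_home is a hypothetical helper name, and the only thing it tests is that an executable bin/java exists under the directory.

```shell
#!/bin/sh
# Return 0 if the given directory looks like a usable JDK home, i.e. it
# contains an executable bin/java; print a diagnostic and return 1 otherwise.
# With no argument, the current value of JAVA_HOME is checked.
check_java_home() {
    home="${1:-${JAVA_HOME:-}}"
    if [ -z "$home" ]; then
        echo "JAVA_HOME is not set" >&2
        return 1
    fi
    if [ ! -x "$home/bin/java" ]; then
        echo "no executable found at $home/bin/java" >&2
        return 1
    fi
    echo "JAVA_HOME looks usable: $home"
}
```

Call check_java_home with no arguments in a new terminal to test the value exported from .bashrc.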
Installing MySQL
Note: If installing on Ubuntu, use Synaptic Package Manager to completely remove the mysql-common and libmysqlclient packages first, to avoid conflicts.
1. Add the mysql user and group via the following commands in a terminal:
sudo groupadd mysql
sudo useradd -g mysql mysql
2. Change the current directory to point to where you would like to install to and extract MySQL via the tar command. After extracting create a symbolic link with the ln command and change directory to that location.
cd /usr/local
sudo tar -zxvf ~/mysql-5.1.41-linux-i686-glibc23.tar.gz
sudo ln -s /usr/local/mysql-5.1.41-linux-i686-glibc23 mysql
cd mysql
3. Remove (or move) the /etc/mysql directory if it exists.
sudo rm -R /etc/mysql
4. From /usr/local/mysql, run the following commands to configure ownership and execute the script that configures the database. Make sure you are in the MySQL directory first, because the ownership commands act on the current directory.
cd /usr/local/mysql
sudo chown -R mysql .
sudo chgrp -R mysql .
sudo scripts/mysql_install_db --user=mysql --basedir=.
sudo chown -R root .
sudo chown -R mysql data
5. Start the server with the following command.
sudo bin/mysqld_safe --user=mysql &
6. In another terminal window, the next command will allow you to configure a root password:
cd /usr/local/mysql
bin/mysqladmin -u root password 'password'
7. Login as root to the mysql client with the following command:
bin/mysql -uroot -ppassword
Configuring the MySQL Datasets
1. Log in to mysql as root, create three databases (testfish, testfrog, testfern) and grant all privileges on them for the localhost. Additionally, you must grant select privileges to all the remote hosts that are running on the network. In the example below, 'user'@'localhost' is the fp user on the local machine, and 'user'@'webprojects.huh.harvard.edu' and 'user'@'umbfp.cs.umb.edu' are the two remote hosts. Adjust these commands as necessary to suit your needs.
mysql -u root -p
2. Run the following commands for each new database you wish to create:
mysql> create database testfish;
mysql> grant all on testfish.* to 'user'@'localhost' identified by 'password';
mysql> grant select on testfish.* to 'user'@'webprojects.huh.harvard.edu' identified by 'password';
mysql> grant select on testfish.* to 'user'@'umbfp.cs.umb.edu' identified by 'password';
3. After you have created the databases, flush privileges and quit.
mysql> flush privileges;
mysql> quit;
Import SQL Dumps
4. Import the SQL dumps from the fp repository. For user, substitute the same user as in the database creation in the previous step:
gunzip fp_specify_a.sql.gz
gunzip fp_specify_b.sql.gz
gunzip fp_specify_c.sql.gz
mysql -u user -p testfish < fp_specify_a.sql
mysql -u user -p testfrog < fp_specify_b.sql
mysql -u user -p testfern < fp_specify_c.sql
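The create and grant statements in steps 2 and 3 are identical for all three databases apart from the database name, so they can be generated rather than typed. The sketch below prints the statements for testfish, testfrog and testfern; fp_grant_sql is a hypothetical helper name, and the user, password and host values are placeholders exactly as in the steps above. It does no quoting, so a password containing a single quote would break the generated SQL.

```shell
#!/bin/sh
# Emit the CREATE DATABASE and GRANT statements from step 2 for each of the
# three test databases, followed by the FLUSH PRIVILEGES from step 3.
# Arguments: user, password, then any number of remote hosts that should
# receive SELECT access.
fp_grant_sql() {
    user="$1"; pass="$2"; shift 2
    for db in testfish testfrog testfern; do
        echo "create database $db;"
        echo "grant all on $db.* to '$user'@'localhost' identified by '$pass';"
        for host in "$@"; do
            echo "grant select on $db.* to '$user'@'$host' identified by '$pass';"
        done
    done
    echo "flush privileges;"
}
```

The output can be reviewed first and then piped into the client, for example fp_grant_sql user password webprojects.huh.harvard.edu umbfp.cs.umb.edu | mysql -u root -p.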
Installing Tomcat
1. Change the current directory to the location where you want to install Tomcat, extract it with the tar command, and create a symbolic link to the extracted directory.
cd /usr/local
sudo tar -zxvf ~/apache-tomcat-6.0.20.tar.gz
sudo ln -s /usr/local/apache-tomcat-6.0.20 tomcat
2. Edit the file .bashrc under your home directory and add the following line (replace the value of CATALINA_HOME with the location you extracted to).
vi ~/.bashrc
export CATALINA_HOME=/usr/local/tomcat
3. Change the current directory to /usr/local/tomcat/bin and execute the catalina.sh script. After the server starts up, confirm that it is running by opening http://localhost:8080/ in your browser.
cd /usr/local/tomcat/bin
sudo ./catalina.sh start
Generating The WAR File
1. Add the repository http://gentoo.cs.umb.edu/svn/FilteredPush to Eclipse and check out the trunk and FilteredPushWeb directories.
2. Right-click FilteredPushWeb in the Project Explorer and export it as a WAR file.
3. Move this file to the Tomcat webapps directory (for example /usr/local/tomcat/webapps). When you start Tomcat, this file will unpack automatically.
Installing ActiveMQ
1. Change directory to the location where you wish to install activemq, extract the files there, and create a symbolic link to the extracted directory:
cd /usr/local
sudo tar zxvf ~/apache-activemq-5.3.0-bin.tar.gz
sudo ln -s /usr/local/apache-activemq-5.3.0 activemq
2. Change directory to the newly created link. Make sure the permissions allow you to access the directory and run the startup script in bin, then start activemq:
cd activemq
sudo chown -R fp .
sudo chmod 755 bin/activemq
bin/activemq
3. Open the following URL in your browser to confirm that activemq is up and running: http://localhost:8161/admin/
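On a machine without a browser, the Tomcat and ActiveMQ checks can be approximated from the command line by testing whether anything is listening on the two ports used in the URLs of this guide (8080 and 8161). This is only a sketch: it relies on bash's /dev/tcp redirection, the function names are made up, and an open port shows that something is listening, not that the web application behind it is healthy.

```shell
#!/bin/bash
# Return 0 if a TCP connection to host:port can be opened, non-zero otherwise.
# Relies on bash's /dev/tcp redirection, so it needs bash rather than plain sh.
port_open() {
    local host="$1" port="$2"
    (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null
}

# Report the state of the two web consoles started in this guide:
# 8080 is the Tomcat port and 8161 the ActiveMQ admin port.
check_consoles() {
    local entry name port
    for entry in tomcat:8080 activemq:8161; do
        name="${entry%%:*}"
        port="${entry##*:}"
        if port_open localhost "$port"; then
            echo "$name: listening on $port"
        else
            echo "$name: nothing listening on $port"
        fi
    done
}
```

Run check_consoles after starting both services; each line should report "listening".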
Installing Hadoop
1. Set up public-key authentication for ssh. The master node must be able to ssh to each of the slave nodes non-interactively in order to start the slave node processes; after that, communication occurs over hadoop-specific ports. Note that a master can also be a slave, and key authentication is used in that case as well, even though the "remote" host is local. (mmk: I don't know whether symmetric authentication is required, i.e. the slaves ssh'ing to the master, but I created the keys for my setup that way too.) Create a key pair on each local host and append its public key to the authorized_keys file on each remote host:
localhost> cd /home/fp
localhost> ssh-keygen
localhost> scp ~/.ssh/id_rsa.pub fp@remote-host:
localhost> ssh fp@remote-host
remotehost> cat ~/id_rsa.pub >> ~/.ssh/authorized_keys
remotehost> exit
Verify that non-interactive authentication is working. If it is not, the cause is often a permissions problem: by default sshd ignores authorized_keys when the home directory, ~/.ssh, or the file itself is writable by other users. Don't forget to turn the sshd service on!
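If key authentication fails because of permissions, tightening them on the remote account usually resolves it. The sketch below is one way to do that; fix_ssh_perms is a hypothetical helper name, and the modes 700 and 600 are the conventional choices rather than the only ones sshd accepts.

```shell
#!/bin/sh
# Tighten permissions on an account's ssh files so that sshd, with its default
# StrictModes setting, will accept authorized_keys.
# Argument: the account's home directory (defaults to $HOME).
fix_ssh_perms() {
    home="${1:-$HOME}"
    chmod go-w "$home" &&
    chmod 700 "$home/.ssh" &&
    chmod 600 "$home/.ssh/authorized_keys"
}
```

After running it on the remote host, a command such as ssh -o BatchMode=yes fp@remote-host true tests the login from the master; with BatchMode=yes, ssh fails instead of prompting for a password, so a zero exit status means the non-interactive login works.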
2. Unpack hadoop-0.20.1.tar.gz in the home directory of each node's fp user. This will create a directory named hadoop-0.20.1.
cd ~
tar zxvf ~/Downloads/hadoop-0.20.1.tar.gz
Configuring Hadoop
1. Make the following changes to these files, located in the conf directory within the hadoop directory:
- hadoop-env.sh: Set JAVA_HOME (e.g. JAVA_HOME=/usr/lib/jvm/java-6-openjdk) on each node.
- masters: On the master node only, add one line containing the full host name of the master node (e.g. umbfp.cs.umb.edu).
- slaves: On the master node only, add a line for each slave node containing the slave's full host name.
2. Every node should have a nearly identical copy of hadoop-site.xml. The only value that should differ is local.host. The following configuration is hadoop specific.
- hadoop.tmp.dir: this defaults to a directory under /tmp, but the hadoop documentation suggests that in all but the most trivial (single-node) installations it should be set to some other value. We use "/home/fp/tmp". This directory needs to exist on each node.
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/fp/tmp</value>
</property>
- fs.default.name: this represents a connection string for datanodes (slaves) to contact the master node.
<property>
  <name>fs.default.name</name>
  <value>hdfs://[hostname]:8002</value>
</property>
- mapred.job.tracker: this represents a connection string for tasktracker nodes (slaves) to contact the master node.
<property>
  <name>mapred.job.tracker</name>
  <value>[hostname]:8001</value>
</property>
- dfs.replication: the number of datanodes on which each block of a file is stored. Hadoop's default is 3; I've had good luck with the value 2.
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
3. Each node must be able to contact the others to process queries. For the moment, this means that each node needs to be able to connect to each Specify database. There should be a stanza in hadoop-site.xml like the following for each node. This is the filtered push configuration.
<property>
  <name>filteredpush.connection.[hostname].dbusr</name>
  <value>sqluser</value>
</property>
<property>
  <name>filteredpush.connection.[hostname].passwd</name>
  <value>password</value>
</property>
<property>
  <name>filteredpush.connection.[hostname]</name>
  <value>jdbc:mysql://[dbhostname][:optional port]/[dbname]</value>
</property>
- local.host: the full host name of the node
<property>
  <name>local.host</name>
  <value>[hostname]</value>
</property>
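Because hadoop-site.xml is nearly identical on every node, it can be generated from a small script instead of being edited by hand on each machine. The following sketch writes the hadoop-specific properties described above inside the configuration root element that Hadoop configuration files use. The function name write_hadoop_site and its argument order are made up for illustration, and the filteredpush.connection stanzas are deliberately left out because they depend on your databases.

```shell
#!/bin/sh
# Write a hadoop-site.xml containing the hadoop-specific properties described
# above. Arguments: master host name, this node's host name, output file.
# Only local.host differs between nodes. The filteredpush.connection.*
# stanzas are not generated here and still need to be added for each node.
write_hadoop_site() {
    master="$1"; node="$2"; out="$3"
    cat > "$out" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/fp/tmp</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://$master:8002</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>$master:8001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>local.host</name>
    <value>$node</value>
  </property>
</configuration>
EOF
}
```

For example, write_hadoop_site umbfp.cs.umb.edu "$(hostname -f)" conf/hadoop-site.xml run from the hadoop directory on a node writes that node's copy, assuming hostname -f returns the node's full host name.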
Format the Namenode
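4. Execute the following commands on the master node to format the namenode, which creates the hadoop file system, and then start the dfs daemons once and stop them again. Format the namenode before the daemons are started for the first time; formatting again later erases the existing file system metadata. Here and below, $HADOOP_HOME is the hadoop-0.20.1 directory unpacked earlier (for example /home/fp/hadoop-0.20.1).
localhost> cd $HADOOP_HOME
localhost> bin/hadoop namenode -format
localhost> bin/start-dfs.sh
localhost> bin/stop-dfs.sh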
5. Specify will probably be running as a different user than fp. To allow hadoop queries to be processed when launched by Specify, the hadoop file system permissions must be altered (see http://hadoop.apache.org/core/docs/current/hdfs_permissions_guide.html). On the master node, run the following commands:
localhost> cd $HADOOP_HOME
localhost> bin/start-all.sh
localhost> bin/hadoop dfs -chmod -R 777 /
localhost> bin/stop-all.sh
There are other shell-like commands that can be run through hadoop to see what's going on with its file system. They are documented here: http://hadoop.apache.org/core/docs/current/hdfs_shell.html
- for future, we could consider putting any Specify user that wants to run the fpnet client into the fp group and have everything authenticated to that group. This deserves discussion (later).--Bob Morris 22:06, 15 February 2009 (UTC)
Starting and Stopping Hadoop
1. To start the master node and all the slaves, run the following commands on the master:
localhost> cd $HADOOP_HOME
localhost> bin/start-all.sh
2. To ask hadoop which processes are running on a local node, go to $HADOOP_HOME and run:
3. On each of the slaves, you should see a datanode process and a tasktracker process. On the master, you should also see namenode and jobtracker processes, and possibly a secondarynamenode process. If you don't, check the logs ($HADOOP_HOME/logs). You will get almost no indication from standard out on the master if anything goes wrong.
4. To stop hadoop, on the master node go to $HADOOP_HOME and run:
localhost> cd $HADOOP_HOME
localhost> bin/stop-all.sh
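One way to carry out the process check in steps 2 and 3 is with jps, the Java process lister that ships with the JDK. Using jps here is an assumption about tooling, not necessarily the command this guide originally intended. The sketch below reads jps-style output and prints any expected daemon that is absent; missing_daemons is a hypothetical helper name, and the class names NameNode, JobTracker, DataNode and TaskTracker are the daemon names jps reports for this generation of Hadoop.

```shell
#!/bin/sh
# Read jps-style output ("<pid> <ClassName>" per line) on stdin and print the
# name of every hadoop daemon expected for the given role that is absent.
# Role is "master" or "slave"; a master that is also a slave runs both sets,
# so check it once with each role.
missing_daemons() {
    case "$1" in
        master) expected="NameNode JobTracker" ;;
        slave)  expected="DataNode TaskTracker" ;;
        *) echo "usage: missing_daemons master|slave" >&2; return 2 ;;
    esac
    running=$(awk '{print $2}')
    for d in $expected; do
        # -x matches whole lines, so SecondaryNameNode does not count as NameNode
        echo "$running" | grep -qx "$d" || echo "$d"
    done
}
```

On a slave, jps | missing_daemons slave prints nothing when both daemons are up; anything it does print is a cue to look in $HADOOP_HOME/logs.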