My instructions for installing Hadoop assume that you have already followed my instructions for installing HDFS. Go follow those instructions and then return here.
With HDFS installed, we now need to configure Hadoop's YARN and MapReduce services.
We need to create additional user accounts for mapred and yarn. You will need to create passwords for each account and keep track of them separately.
sudo adduser --ingroup hadoop mapred
sudo adduser --ingroup hadoop yarn
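To confirm that both accounts exist and landed in the hadoop group, you can inspect them with the standard id command:
id mapred
id yarn
Each should report hadoop as the account's primary group.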
We need to create an RSA key pair for the yarn account. The private and public keys are stored on the master node. The public key for this account must then be copied to the corresponding location on each of the worker nodes.
On the master node, run the following commands:
- Switch to the yarn user:
su - yarn
ssh-keygen -t rsa -f ~/.ssh/id_rsa
- Enter a passphrase for the private key and record it in a safe location.
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
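If key-based logins are refused later, a common cause is overly permissive file modes; sshd expects the .ssh directory and the authorized_keys file to be private to their owner:
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys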
We need to copy the public key from the master to each of the worker nodes. We will use ssh-copy-id for this and simply enter the password that you created for the yarn account earlier. (If you have several workers, a loop shortcut appears after the steps below.)
On the master node, run the following commands FOR EACH WORKER NODE:
su - yarn
ssh-copy-id yarn@worker01
- Say 'yes' if you are asked to accept the fingerprint of the worker01 machine
- Enter the password for the yarn account on worker01 if asked
- Test the ability to login to the worker01 machine:
ssh worker01
- You will be asked for your passphrase. Type it and you will be logged in on the remote machine.
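As noted above, if you have more than a couple of workers, a small shell loop saves some typing (the hostnames here are examples; substitute your own):
for host in worker01 worker02 worker03; do ssh-copy-id yarn@$host; done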
Login to the yarn account and edit the .profile to make use of keychain. (This assumes the keychain utility is installed on the master node.)
- Switch to the yarn user:
su - yarn
- Edit the profile:
vi .profile
- Add the following lines to the end of the .profile file:
/usr/bin/keychain $HOME/.ssh/id_rsa
source $HOME/.keychain/$HOSTNAME-sh
- Log out and log back in so the new profile runs:
exit
su - yarn
- Keychain will ask you to enter your passphrase.
- Now if you type ssh worker01, you should be logged in without having to type your passphrase.
- You will no longer need to enter your passphrase until your virtual machine is rebooted.
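To confirm that keychain handed the key to your SSH agent, you can list the keys the agent currently holds:
ssh-add -l
You should see the fingerprint of the id_rsa key you generated earlier.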
REMINDER: This page assumes you have followed my instructions for installing HDFS. There is a section on that page called Configuring the Cluster that has critical info about core-site.xml and other configuration files that Hadoop needs to operate.
Edit the file /usr/local/src/hadoop-2.7.1/etc/hadoop/yarn-site.xml on all nodes of the cluster such that the configuration tag now looks like this:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>master:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>master:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>master:8088</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>master:8031</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>master:8033</value>
  </property>
</configuration>
This configures the ports for various parts of the YARN infrastructure. All of these components will run on the master node.
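Rather than hand-editing yarn-site.xml on every node, one option is to edit it once on the master and push it to each worker with scp (this sketch assumes the Hadoop config directory on the workers is writable by the yarn account; adjust the user or permissions to match your setup):
scp /usr/local/src/hadoop-2.7.1/etc/hadoop/yarn-site.xml yarn@worker01:/usr/local/src/hadoop-2.7.1/etc/hadoop/
Repeat for each worker node.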
Edit the file /usr/local/src/hadoop-2.7.1/etc/hadoop/mapred-site.xml on the master node such that the configuration tag looks like the one below.
Note: first, you have to create this file. Execute these commands in the /usr/local/src/hadoop-2.7.1/etc/hadoop directory:
sudo cp mapred-site.xml.template mapred-site.xml
sudo chown hadoop:hadoop mapred-site.xml
sudo vi mapred-site.xml
Then you can edit the file to contain this information:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
This tells MapReduce jobs to run on the cluster; otherwise, the default is to run locally on the client machine, which defeats the purpose of Hadoop.
On the master node, execute these commands:
su - yarn
start-yarn.sh
This command should launch a resource manager and a node manager on the master node and a node manager on each worker node. To verify, run the jps command as the yarn user on each node in your cluster.
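As a rough guide, running jps as the yarn user on the master should produce something like this (the process IDs are illustrative and will differ on your machines):
12001 ResourceManager
12187 NodeManager
On each worker, you should see only a NodeManager entry.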
MapReduce has one server process that needs to be started on the master node.
First, specify a new location for the history daemon's log files.
- Edit /usr/local/src/hadoop-2.7.1/etc/hadoop/mapred-env.sh
- Add this line:
export HADOOP_MAPRED_LOG_DIR="/home/mapred/logs"
- Make sure to create that directory:
su - mapred
mkdir logs
chmod 777 logs
Second, make sure that the mapred account is part of the HDFS supergroup:
sudo addgroup supergroup
sudo usermod -a -G supergroup mapred
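You can confirm that the membership took effect with the standard groups command:
groups mapred
The output should include supergroup.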
Finally, run these commands to launch the history daemon:
su - mapred
mr-jobhistory-daemon.sh start historyserver
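With YARN and the history server running, you can submit a small test job using the examples jar that ships with the Hadoop 2.7.1 tarball (run it as a user that can write to HDFS, such as the hadoop account from the HDFS guide; the jar path assumes the install location used throughout these instructions):
yarn jar /usr/local/src/hadoop-2.7.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar pi 2 5
If the job completes, it should also appear in the history server's web UI, which by default listens on port 19888 of the master node.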
If you need to shut down your Hadoop cluster, perform the following steps on the master node.
su - mapred
mr-jobhistory-daemon.sh stop historyserver
su - yarn
stop-yarn.sh
- Head over to my HDFS page for instructions on shutting down the HDFS daemons.
- There is no step 6.
At this point, you are done! Your Hadoop cluster is now up and running! Congrats!
While it's true that your cluster is up and running, you still have work to do. I did not configure any of Hadoop's memory-related settings: how much memory each server takes, how much memory MapReduce jobs can consume, and so on. Take a look at Hadoop: The Definitive Guide, 4th Edition or Hadoop-related websites online for more information.