Install Apache Hadoop 2.9.2 on Ubuntu 18.04 | Step by Step Guide | Big Data



Prerequisite

  • Ubuntu 18.04 Operating System on Oracle VirtualBox
  • Good internet connection on your system
  • Good to have a laptop/desktop with 8GB RAM, 50 to 100 GB free space in HDD (Hard Disk Drive), any good processor

Introduction to Apache Hadoop

The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.


Download Apache Hadoop

Log in to the Ubuntu VM and open a Terminal window (a command-line shell, similar to the Windows command prompt)









Install Java

Get the repo file from this URL: https://archive.cloudera.com/cm6/6.3.0/ubuntu1804/apt/cloudera-manager.list

Download the repo file using the wget command in the Terminal

wget https://archive.cloudera.com/cm6/6.3.0/ubuntu1804/apt/cloudera-manager.list



Copy the Repo file into /etc/apt/sources.list.d/ location using below command

sudo cp cloudera-manager.list /etc/apt/sources.list.d/


Download and Import the repository signing GPG key

sudo wget https://archive.cloudera.com/cm6/6.3.0/ubuntu1804/apt/archive.key

sudo apt-key add archive.key


Update your system package index by running below command

sudo apt-get update





Install Java using the below command

sudo apt-get install oracle-j2sdk1.8







Java will be installed at this location: /usr/lib/jvm/java-8-oracle-cloudera

ls /usr/lib/jvm/java-8-oracle-cloudera

Set JAVA_HOME in the file(with location): ~/.bashrc using below command

sudo nano ~/.bashrc


Paste the below lines

export JAVA_HOME=/usr/lib/jvm/java-8-oracle-cloudera
export PATH=$PATH:$JAVA_HOME/bin


Run the below command to refresh the .bashrc file

source ~/.bashrc
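The two export lines above can also be appended to the rc file non-interactively instead of pasting them in nano. A minimal sketch, using a temporary file so the real ~/.bashrc is untouched (on the VM you would append to ~/.bashrc itself):

```shell
# Append the JAVA_HOME exports to an rc file without opening an editor.
# rcfile is a temp stand-in here; on the VM use ~/.bashrc instead.
rcfile=$(mktemp)
cat >> "$rcfile" <<'EOF'
export JAVA_HOME=/usr/lib/jvm/java-8-oracle-cloudera
export PATH=$PATH:$JAVA_HOME/bin
EOF
# Confirm the lines landed in the file
grep 'JAVA_HOME' "$rcfile"
```

The quoted heredoc delimiter ('EOF') keeps $PATH and $JAVA_HOME literal so they are expanded when the file is sourced, not when it is written.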


Run the below command to check the version of the Java installed

java -version




Install OpenSSH server & client using the below command

sudo apt-get install openssh-server openssh-client



Type "Y" to continue



Create the SSH key for passwordless login (press Enter when it asks you for a filename to save the key)

ssh-keygen -t rsa -P ""

Copy the generated ssh key to authorized keys

cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
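The two steps above can be sketched end to end. This version runs against a temporary directory so the real ~/.ssh is untouched (on the VM, omit -f and let ssh-keygen use the default ~/.ssh/id_rsa path); the chmod is an extra precaution, since sshd rejects authorized_keys files that are group- or world-writable:

```shell
# Generate a passwordless RSA key pair and authorize it, in a temp dir.
sshdir=$(mktemp -d)
ssh-keygen -t rsa -P "" -f "$sshdir/id_rsa" -q
# Append the public key to authorized_keys (same as the cat step above)
cat "$sshdir/id_rsa.pub" >> "$sshdir/authorized_keys"
# sshd refuses keys files with loose permissions
chmod 600 "$sshdir/authorized_keys"
ls "$sshdir"
```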




Connect localhost using OpenSSH by running below command

ssh localhost (type "yes" if any prompt comes up)


Exit from the ssh connection using the below command

exit


Open Apache Hadoop official website and click on the "Download" button

URL: https://hadoop.apache.org


Click on the binary download link for version 2.9.2

URL: https://hadoop.apache.org/releases.html





https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.9.2/hadoop-2.9.2.tar.gz

Click on the link below "We suggest the following mirror site for your download:"

http://apachemirror.wuchna.com/hadoop/common/hadoop-2.9.2/hadoop-2.9.2.tar.gz





Right-click the downloaded file and click on "Open Containing Folder"; it will point to /tmp/..

Copy the downloaded Hadoop binary file into /home/dmadmin/datamaking/softwares manually





Install Apache Hadoop

Navigate to the location /home/dmadmin/datamaking/softwares using below command

cd datamaking/softwares/


Change the binary file permission using below command

sudo chmod a+x hadoop-2.9.2.tar.gz







Extract binary file

sudo tar -xzvf hadoop-2.9.2.tar.gz








Hadoop Installation Path(location) will be: /home/dmadmin/datamaking/softwares/hadoop-2.9.2

Run the below command

cd hadoop-2.9.2/


Add the HADOOP_HOME and JAVA_HOME paths in the bash file (.bashrc)

sudo nano ~/.bashrc


Add the below Hadoop path information into .bashrc file


# HADOOP VARIABLES SETTINGS START HERE

export JAVA_HOME=/usr/lib/jvm/java-8-oracle-cloudera
export HADOOP_INSTALL=/home/dmadmin/datamaking/softwares/hadoop-2.9.2
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_COMMON_LIB_NATIVE_DIR"

# HADOOP VARIABLES SETTINGS END HERE




Run the below command to refresh bash file (.bashrc)

source ~/.bashrc
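After sourcing .bashrc, it is worth confirming that both the bin and sbin directories are actually on PATH, since the hadoop command and the start-*.sh scripts resolve from there. A minimal sketch (the exports mirror what .bashrc sets, so the check is self-contained):

```shell
# Stand-in for sourcing ~/.bashrc: set the same variables directly.
export HADOOP_INSTALL=/home/dmadmin/datamaking/softwares/hadoop-2.9.2
export PATH=$PATH:$HADOOP_INSTALL/bin:$HADOOP_INSTALL/sbin

# Both directories should now appear in PATH.
echo "$PATH" | grep -qF "$HADOOP_INSTALL/bin"  && echo "bin on PATH"
echo "$PATH" | grep -qF "$HADOOP_INSTALL/sbin" && echo "sbin on PATH"
```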





Find the host name by running below command

hostname -f


Edit the /etc/hosts file using the below command

sudo nano /etc/hosts


From this,

127.0.0.1 localhost
127.0.1.1 datamaking

To this,

127.0.0.1 localhost
127.0.0.1 datamaking
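The same hosts-file change can be made with sed instead of nano. A sketch against a temp copy (on the VM the equivalent would be sudo sed -i applied to /etc/hosts itself):

```shell
# Rewrite the 127.0.1.1 entry to 127.0.0.1, as in the manual edit above.
hosts=$(mktemp)
printf '127.0.0.1 localhost\n127.0.1.1 datamaking\n' > "$hosts"
sed -i 's/^127\.0\.1\.1/127.0.0.1/' "$hosts"
cat "$hosts"
```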





Create or Modify Hadoop configuration files

Now edit the configuration files in /home/dmadmin/datamaking/softwares/hadoop-2.9.2/etc/hadoop directory.

Create masters file and edit as follows,

cd /home/dmadmin/datamaking/softwares/hadoop-2.9.2/etc/hadoop



sudo nano masters



Add the hostname in the masters file as shown below

datamaking




Edit slaves file as follows,

sudo nano slaves


Add the hostname in the slaves file as shown below

datamaking


Edit core-site.xml as follows,

sudo nano core-site.xml


Add the property in the core-site.xml as shown below

    <property>
            <name>hadoop.tmp.dir</name>
            <value>/home/dmadmin/datamaking/softwares/hadoop_data/tmp</value>
            <description>Parent directory for other temporary directories.</description>
    </property>

    <property>
            <name>fs.defaultFS</name>
            <value>hdfs://datamaking:9000</value>
            <description>The name of the default file system.</description>
    </property>






Run the below command to create the directory

sudo mkdir -p /home/dmadmin/datamaking/softwares/hadoop_data/tmp


Edit hdfs-site.xml as follows,


sudo nano hdfs-site.xml

Add the property in the hdfs-site.xml as shown below

    <property>
            <name>dfs.namenode.name.dir</name>
            <value>/home/dmadmin/datamaking/softwares/hadoop_data/namenode</value>
    </property>

    <property>
            <name>dfs.datanode.data.dir</name>
            <value>/home/dmadmin/datamaking/softwares/hadoop_data/datanode</value>
    </property>

    <property>
            <name>dfs.replication</name>
            <value>1</value>
    </property>




Run the below command to create directory for namenode

sudo mkdir -p /home/dmadmin/datamaking/softwares/hadoop_data/namenode

Run the below command to create directory for datanode

sudo mkdir -p /home/dmadmin/datamaking/softwares/hadoop_data/datanode

Run the below command to provide permission for the directories/sub-directories of hadoop_data

sudo chown -R dmadmin:dmadmin /home/dmadmin/datamaking/softwares/hadoop_data
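The three mkdir steps above can be collapsed into one command. A sketch under a temporary root so it needs no sudo (on the VM the root is /home/dmadmin/datamaking/softwares, followed by the chown shown above):

```shell
# Create the tmp, namenode, and datanode directories in one mkdir -p call.
root=$(mktemp -d)
mkdir -p "$root/hadoop_data/tmp" \
         "$root/hadoop_data/namenode" \
         "$root/hadoop_data/datanode"
ls "$root/hadoop_data"
```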




Copy mapred-site.xml from the template file in the configuration folder and then edit mapred-site.xml as follows,

sudo cp mapred-site.xml.template mapred-site.xml


sudo nano mapred-site.xml


Add the property in the mapred-site.xml as shown below

    <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
    </property>




Edit yarn-site.xml as follows,

sudo nano yarn-site.xml




Add the property in the yarn-site.xml as shown below

    <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
    </property>
    <property>
            <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
            <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
      <name>yarn.resourcemanager.webapp.address</name>
      <value>datamaking:8088</value>
    </property>



Run the below command to check the Hadoop version

hadoop version


Format the namenode using the below command [this is a one-time activity in the Apache Hadoop cluster setup]

hdfs namenode -format



Navigate to the Hadoop configuration folder/directory using below command

cd /home/dmadmin/datamaking/softwares/hadoop-2.9.2/etc/hadoop

Open the Hadoop environment file (hadoop-env.sh) to set the JAVA_HOME path

sudo nano hadoop-env.sh

Set JAVA_HOME in hadoop-env.sh as below

export JAVA_HOME=/usr/lib/jvm/java-8-oracle-cloudera






Run the below command to provide permission for the directories/sub-directories of softwares

sudo chown -R dmadmin:dmadmin /home/dmadmin/datamaking/softwares


Run the below command to start all the Hadoop daemons

start-all.sh

or

start-dfs.sh

start-yarn.sh





Run the below command to check that the required Hadoop components/processes have started

jps
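On a single-node setup, jps should list NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager. A minimal sketch that checks for all five (the jps output below is a sample for illustration; on the VM, replace it with jps_output=$(jps)):

```shell
# Sample jps output; the PIDs are made up for illustration.
jps_output='2101 NameNode
2290 DataNode
2512 SecondaryNameNode
2703 ResourceManager
2891 NodeManager
3010 Jps'

# Verify each expected daemon appears in the output.
missing=0
for d in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
  echo "$jps_output" | grep -q "$d" || { echo "missing: $d"; missing=1; }
done
[ "$missing" -eq 0 ] && echo "all Hadoop daemons running"
```

If any daemon is missing, check its log under $HADOOP_INSTALL/logs before moving on to the Web UIs.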


Check the NameNode Web UI using below URL

NameNode Web UI: http://datamaking:50070




Check the YARN Web UI using below URL

YARN Web UI: http://datamaking:8088




Summary

We have learned how to install Apache Hadoop 2.9.2 on Ubuntu 18.04 and verify that all the Hadoop processes and Web UIs are running properly.

Please share your feedback and suggestions on this blog post.

Happy Learning !!!

