
Setting Up a Hadoop Pseudo-Distributed Environment

 


Hadoop developers usually test their scripts and code on a pseudo-distributed environment(also known as a single node setup), which is a virtual machine that runs all of the Hadoop daemons simultaneously on a single machine. This allows you to quickly write scripts and test them on limited data sets without having to connect to a remote cluster or pay the expense of EC2. If you're learning Hadoop, you'll probably also want to set up a pseudo-distributed environment to facilitate your understanding of the various Hadoop daemons.

These instructions will help you install a pseudo-distributed environment with Hadoop 2.5.2 on Ubuntu 14.04.

 

Quick Start

There are a couple of options that will allow you to quickly get up and running if you are not familiar with systems administration on Linux or do not wish to work through the process of installing Hadoop yourself. District Data Labs has provided a Virtual Machine Disk (VMDK) configured exactly as the instructions below describe, available for you to download directly. You can then use this VMDK in the virtualization software of your choice (e.g. VirtualBox or VMware Fusion). Alternatively, both Hortonworks and Cloudera supply virtual machines for quick download. Be aware that if you do use a Cloudera or Hortonworks distribution, the environment may be subtly different from the one described below.

Click here to download the VMDK we have put together.

If you are using the VMDK supplied by District Data Labs, log in to the machine using the username and password as follows:

username: student
password: password

If you're brave enough to set up the environment yourself, go ahead and move to the next section!

 

Setting up Linux

Before you can get started installing Hadoop, you'll need to have a Linux environment configured and ready to use. These instructions assume that you can get an Ubuntu 14.04 distribution installed on the machine of your choice, either in a dual-boot configuration or using a virtual machine. Whether you use Ubuntu Server or Ubuntu Desktop is left to your preference, since you'll need to be comfortable working with the command line either way. Personally, I prefer to use Ubuntu Server, since it's more lightweight, and to SSH into it from my host operating system.

Base Environment: Ubuntu x64 Desktop 14.04 LTS

Make sure your system is fully up-to-date and has the required packages installed by running the following commands:

~$ sudo apt-get update && sudo apt-get upgrade
~$ sudo apt-get install build-essential ssh lzop git rsync curl
~$ sudo apt-get install python-dev python-setuptools
~$ sudo apt-get install libcurl4-openssl-dev
~$ sudo easy_install pip
~$ sudo pip install virtualenv virtualenvwrapper python-dateutil

 

Creating a Hadoop User

In order to secure our Hadoop services, we will make sure that Hadoop is run as a Hadoop-specific user and group. This user would be able to initiate SSH connections to other nodes in a cluster, but not have administrative access to do damage to the operating system upon which the service was running. Implementing Linux permissions also helps secure HDFS and is the start of preparing a secure computing cluster.

This tutorial is not meant for operational implementation. However, as a data scientist, these permissions may save you some headache in the long run, so it is helpful to have the permissions in place on your development environment. This will also ensure that the Hadoop installation is separate from other software applications and will help organize the maintenance of the machine.

Create the hadoop user and group, then add the student user to the Hadoop group:

~$ sudo addgroup hadoop
~$ sudo useradd -m -g hadoop hadoop
~$ sudo usermod -a -G hadoop student

Once you have logged out and logged back in (or restarted the machine), you should be able to see that you've been added to the hadoop group by issuing the groups command. Note that the -m flag creates a home directory for the hadoop user and the -g flag sets its primary group.
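
A quick way to confirm the membership is the groups command; a sketch of the expected output is below (the exact list of groups will vary with your installation, but hadoop should appear at the end):

~$ groups
student adm cdrom sudo dip plugdev lpadmin sambashare hadoop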

 

Configuring SSH

SSH is required and must be installed on your system to use Hadoop (and to better manage the virtual environment, especially if you're using a headless Ubuntu). Generate some ssh keys for the Hadoop user by issuing the following commands:

~$ sudo su hadoop
~$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub. [... snip ...]

Simply hit enter at all the prompts to accept the defaults and to create a key that does not require a password to authenticate (this is required for Hadoop). In order to allow the key to be used to SSH into the box, copy the public key to the authorized_keys file with the following command:

~$ cat /home/hadoop/.ssh/id_rsa.pub >> /home/hadoop/.ssh/authorized_keys
~$ chmod 600 /home/hadoop/.ssh/authorized_keys

You should be able to download this key and use it to SSH into the Ubuntu environment. To test the SSH key, issue the following command:

~$ ssh -l hadoop localhost

If this completes successfully without asking you for a password, then you have successfully configured SSH for Hadoop. Exit the SSH session by typing exit; you should be returned to the hadoop user's shell. Exit the hadoop user by typing exit again, and you should now be in a terminal window that says student@ubuntu.

 

Installing Java

Hadoop and most of the Hadoop ecosystem require Java to run. Hadoop requires Oracle Java™ 1.6.x or greater, and the project used to recommend particular versions of Java™ to use with Hadoop; it now maintains a list of the various JDKs that work well with Hadoop. Ubuntu does not carry the Oracle JDK in its repositories because it is proprietary code, so instead we will install OpenJDK. For more information on supported Java™ versions, see Hadoop Java Versions, and for information about installing different versions on Ubuntu, please see Installing Java on Ubuntu.

~$ sudo apt-get install openjdk-7-*

Do a quick check to ensure the right version of Java™ is installed:

~$ java -version
java version "1.7.0_65"
OpenJDK Runtime Environment (IcedTea 2.5.3) (7u71-2.5.3-0ubuntu0.14.04.1)
OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)

Hadoop is currently built and tested on both OpenJDK and Oracle's JDK/JRE.
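
If you want to confirm where the JDK actually landed before setting JAVA_HOME later on, resolving the java binary's symlink is a quick check (a sketch; on a stock Ubuntu 14.04 x64 install the path should resolve under /usr/lib/jvm/java-7-openjdk-amd64):

~$ readlink -f $(which java)
/usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java

The directory above jre/bin/java is the value we will use for JAVA_HOME in the environment and configuration steps below.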

 

Disabling IPv6

It has been reported for a while now that Hadoop running on Ubuntu has a conflict with IPv6, and ever since Hadoop 0.20, Ubuntu users have been disabling IPv6 on their clustered boxes. It is unclear whether or not this is still a bug in the latest versions of Hadoop, however in a single-node or pseudo-distributed environment we will have no need for IPv6, so it is best to simply disable it and not worry about any potential problems.

Edit the /etc/sysctl.conf file by executing the following command:

~$ gksu gedit /etc/sysctl.conf

Then add the following lines to the end of the file:

# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

For this change to take effect, reboot your computer. Once it has rebooted, check the status with the following command:

~$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6

If the output is 0, then IPv6 is enabled. If it is 1, then we have successfully disabled IPv6.
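
If you would rather not reboot immediately, reloading /etc/sysctl.conf should apply the same settings right away (a sketch; a full reboot is still the safest way to confirm the change persists):

~$ sudo sysctl -p
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1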

 

Installing Hadoop

To get Hadoop, you'll need to download the release of your choice from one of the Apache Download Mirrors. These instructions will download the current stable version of Hadoop with YARN at the time of this writing, Hadoop 2.5.2.

After you've selected a mirror, type the following command into a Terminal window, replacing http://apache.mirror.com/hadoop-2.5.2/ with the mirror URL that you selected and that is best for your region:

~/Downloads$ curl -O http://apache.mirror.com/hadoop-2.5.2/hadoop-2.5.2.tar.gz

You can verify the download by ensuring that the md5sum matches the one published at the mirror:

~/Downloads$ md5sum hadoop-2.5.2.tar.gz
74a7581893a8224540a9417a4c2630da  hadoop-2.5.2.tar.gz

Of course, you can use any mechanism you wish to download Hadoop - wget or a browser will work just fine.

 

Unpacking

After obtaining the compressed tarball, the next step is to unpack it. You can use an Archive Manager or simply follow the instructions below. The most significant decision you have to make is where to unpack Hadoop to.

The Linux operating system depends upon a hierarchical directory structure to function. At the root, many directories that you've heard of have specific purposes:

  • /etc is used to store configuration files
  • /home is used to store user specific files
  • /bin and /sbin include programs that are vital for the OS
  • /usr/sbin are for programs that are not vital but are system wide
  • /usr/local is for locally installed programs
  • /var is used for program data including caches and logs

You can read more about these directories in this Stack Exchange post.

Good places to move Hadoop to are the /opt and /srv directories.

  • /opt contains non-packaged programs, usually source. A lot of developers stick their code there for deployments.
  • The /srv directory stands for services. Hadoop, HBase, Hive and others run as services on your machine, so this seems like a great place to put things, and it's a standard location that's easy to get to. So let's stick everything there!

Enter the following commands:

~/Downloads$ tar -xzf hadoop-2.5.2.tar.gz
~/Downloads$ sudo mv hadoop-2.5.2 /srv/
~/Downloads$ sudo chown -R hadoop:hadoop /srv/hadoop-2.5.2
~/Downloads$ sudo chmod g+w -R /srv/hadoop-2.5.2
~/Downloads$ sudo ln -s /srv/hadoop-2.5.2 /srv/hadoop

These commands unpack Hadoop, move it to the service directory where we will keep all of our Hadoop and cluster services, and then set permissions. Finally, we create a symlink to the version of Hadoop that we would like to use; this will make it easy to upgrade our Hadoop distribution in the future.
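
For example, a future upgrade would only require unpacking the new release next to the old one and re-pointing the symlink. The sketch below uses a hypothetical hadoop-2.6.0 release purely for illustration:

~/Downloads$ sudo mv hadoop-2.6.0 /srv/
~/Downloads$ sudo chown -R hadoop:hadoop /srv/hadoop-2.6.0
~/Downloads$ sudo ln -sfn /srv/hadoop-2.6.0 /srv/hadoop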

 

Environment

In order to ensure everything executes correctly, we are going to set some environment variables so that Hadoop executes in its correct context. Enter the following command to open the hadoop user's profile in a text editor and change its environment variables.

/srv$ gksu gedit /home/hadoop/.bashrc

Add the following lines to this file:

# Set the Hadoop Related Environment variables
export HADOOP_HOME=/srv/hadoop
export PATH=$PATH:$HADOOP_HOME/bin

# Set the JAVA_HOME
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

We'll also add some convenience functionality to the student user environment. Open the student user bash profile file with the following command:

~$ gedit ~/.profile

Add the following contents to that file:

# Set the Hadoop Related Environment variables
export HADOOP_HOME=/srv/hadoop
export HADOOP_STREAMING=$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.5.2.jar
export PATH=$PATH:$HADOOP_HOME/bin

# Set the JAVA_HOME
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

# Helpful Aliases
alias ..="cd .."
alias ...="cd ../.."
alias hfs="hadoop fs"
alias hls="hfs -ls"

These simple aliases may save you a lot of typing in the long run! Feel free to add any other helpers that you think might be useful in your development work.
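
Once the daemons are up and your profile has been sourced, the aliases expand as you would expect; a sketch is below (data.csv is a hypothetical local file used only for illustration):

~$ hls /user/student                  # same as: hadoop fs -ls /user/student
~$ hfs -put data.csv /user/student/   # same as: hadoop fs -put data.csv /user/student/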

Check that your environment configuration has worked by running a Hadoop command:

~$ hadoop version
Hadoop 2.5.2
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r cc72e9b000545b86b75a61f4835eb86d57bfafc0
Compiled by jenkins on 2014-11-14T23:45Z
Compiled with protoc 2.5.0
From source with checksum df7537a4faa4658983d397abf4514320
This command was run using /srv/hadoop-2.5.2/share/hadoop/common/hadoop-common-2.5.2.jar

If that ran with no errors and displayed an output similar to the one above, then everything has been configured correctly up to this point.

 

Hadoop Configuration

The penultimate step in setting up Hadoop as a pseudo-distributed node is to edit the configuration files for the Hadoop environment, the MapReduce site, the HDFS site, and the YARN site.

Edit the hadoop-env.sh file by entering the following on the command line.

~$ gedit $HADOOP_HOME/etc/hadoop/hadoop-env.sh

The most important part of this configuration is to change the following line:

# The java implementation to use.
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

Next, edit the core site configuration file:

~$ gedit $HADOOP_HOME/etc/hadoop/core-site.xml

Replace the <configuration></configuration> with the following:

<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/var/app/hadoop/data</value>
    </property>
</configuration>

Edit the MapReduce site configuration by first copying the template and then opening the file for editing:

~$ cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template \
      $HADOOP_HOME/etc/hadoop/mapred-site.xml
~$ gedit $HADOOP_HOME/etc/hadoop/mapred-site.xml

Replace the <configuration></configuration> with the following:

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

Now edit the HDFS site configuration by editing the following file:

~$ gedit $HADOOP_HOME/etc/hadoop/hdfs-site.xml

Replace the <configuration></configuration> with the following:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

Finally, edit the YARN site configuration file:

~$ gedit $HADOOP_HOME/etc/hadoop/yarn-site.xml

And update the configuration as follows:

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>localhost:8025</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>localhost:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>localhost:8050</value>
    </property>
</configuration>

With these files edited, Hadoop should be fully configured as a pseudo-distributed environment.
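
As an optional sanity check, the hdfs getconf command can echo back the values Hadoop resolves from these files (a sketch; note that fs.default.name is the deprecated alias for fs.defaultFS, so both refer to the same setting):

~$ hdfs getconf -confKey fs.defaultFS
hdfs://localhost:9000
~$ hdfs getconf -confKey dfs.replication
1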

 

Formatting the Namenode

The final step before we can turn Hadoop on is to format the namenode. The namenode is in charge of HDFS, the distributed file system. The namenode on this machine is going to keep its files in the /var/app/hadoop/data directory. We need to initialize this directory and then format the namenode to properly use it.

~$ sudo mkdir -p /var/app/hadoop/data
~$ sudo chown hadoop:hadoop -R /var/app/hadoop
~$ sudo su hadoop
~$ hadoop namenode -format

You should see a bunch of Java messages scrolling down the page if the namenode has executed successfully. There should be directories inside of the /var/app/hadoop/data directory, including a dfs directory. If that is what you see, then Hadoop should be all set up and ready to use!
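
If you want to double-check, a freshly formatted namenode typically leaves a dfs/name/current directory containing an initial fsimage and a VERSION file (a sketch; the exact file names and transaction ids may differ on your machine):

~$ ls /var/app/hadoop/data/dfs/name/current
fsimage_0000000000000000000  fsimage_0000000000000000000.md5  seen_txid  VERSION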

 

Starting Hadoop

At this point we can start and run our Hadoop daemons. When you formatted the namenode, you switched to being the hadoop user with the sudo su hadoop command. If you're still that user, go ahead and execute the following commands:

~$ $HADOOP_HOME/sbin/start-dfs.sh
~$ $HADOOP_HOME/sbin/start-yarn.sh

The daemons should start up and issue messages about where they are logging to and other important information. If you are asked to confirm the host's authenticity when SSH connects, type yes at the prompt. You can see the processes that are running via the jps command:

~$ jps
4801 Jps
4468 ResourceManager
4583 NodeManager
4012 NameNode
4318 SecondaryNameNode
4150 DataNode

If the processes are not running, then something has gone wrong. You can also access the Hadoop cluster administration site by opening a browser and pointing it to http://localhost:8088. This should bring up a page with the Hadoop logo and a table of applications.
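
Besides the ResourceManager page on port 8088, the NameNode also serves a web UI, by default on port 50070 in Hadoop 2.x. A quick headless check of both from the terminal might look like the following sketch (an HTTP 200 response means the daemon is answering):

~$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8088
200
~$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:50070
200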

To wrap up the configuration, prepare a space on HDFS for our student account to store data and to run analytical jobs on:

~$ hadoop fs -mkdir -p /user/student
~$ hadoop fs -chown student:student /user/student

You can now exit from the hadoop user's shell with the exit command.
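
Back in the student shell, a short round trip through HDFS is a reasonable way to confirm that the permissions on /user/student are correct (a sketch; /etc/hosts is just a convenient small file to copy):

~$ hadoop fs -put /etc/hosts /user/student/hosts
~$ hadoop fs -ls /user/student
~$ hadoop fs -rm /user/student/hosts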

 

Restarting Hadoop

If you reboot your machine, the Hadoop daemons will stop running and will not automatically be restarted. If you are attempting to run a Hadoop command and you get a "connection refused" message, it is likely because the daemons are not running. You can check this by issuing the jps command as sudo:

~$ sudo jps

To restart Hadoop in the case that it shuts down, issue the following commands:

~$ sudo -H -u hadoop $HADOOP_HOME/sbin/start-dfs.sh
~$ sudo -H -u hadoop $HADOOP_HOME/sbin/start-yarn.sh

The processes should start up again as the dedicated hadoop user and you'll be back on your way!
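
Conversely, if you want to shut the daemons down cleanly before a reboot, the matching stop scripts live alongside the start scripts (a sketch):

~$ sudo -H -u hadoop $HADOOP_HOME/sbin/stop-yarn.sh
~$ sudo -H -u hadoop $HADOOP_HOME/sbin/stop-dfs.sh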

 

Installing Hive

For the most part, installing services on Hadoop (e.g. Hive, HBase, or others) will consist of the following in the environment we have set up:

  1. Download the release tarball of the service
  2. Unpack the release to /srv/ and create a symlink from the release to a simple name
  3. Configure environment variables with the new paths
  4. Configure the service to run in pseudo-distributed mode

Hive also follows this pattern. Find the Hive release you wish to download from the Apache Hive downloads page. At the time of this writing, Hive release 0.14.0 is current. Once you have selected a mirror, download the apache-hive-0.14.0-bin.tar.gz file to your downloads directory. Then issue the following commands in the terminal to unpack it:

~$ tar -xzf apache-hive-0.14.0-bin.tar.gz
~$ sudo mv apache-hive-0.14.0-bin /srv
~$ sudo chown -R hadoop:hadoop /srv/apache-hive-0.14.0-bin
~$ sudo ln -s /srv/apache-hive-0.14.0-bin /srv/hive

Edit your ~/.profile with these environment variables by adding the following to the bottom of the .profile:

# Configure Hive environment
export HIVE_HOME=/srv/hive
export PATH=$PATH:$HIVE_HOME/bin

No other configuration for Hive is required, although you can find other configuration details in $HIVE_HOME/conf, including the Hive environment shell file and the Hive site configuration XML.
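
With the Hadoop daemons running and your .profile re-sourced, a short Hive session makes a reasonable smoke test. Depending on your setup, Hive may first need its default warehouse directories on HDFS to exist and be group-writable; the paths below are Hive's defaults and the whole sequence is only a sketch:

~$ sudo -H -u hadoop $HADOOP_HOME/bin/hadoop fs -mkdir -p /tmp /user/hive/warehouse
~$ sudo -H -u hadoop $HADOOP_HOME/bin/hadoop fs -chmod -R g+w /tmp /user/hive/warehouse
~$ hive
hive> SHOW DATABASES;
OK
default
hive> exit;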

 

Installing Spark

Installing Spark is also pretty straightforward, and we'll install it similarly to how we installed Hive. Find the Spark release you wish to download from the Apache Spark downloads page. The Spark release at the time of this writing is 1.1.0. You should choose the package type "Pre-built for Hadoop 2.4" and the download type "Direct Download". Then unpack it as follows:

~$ tar -xzf spark-1.1.0-bin-hadoop2.4.tgz
~$ sudo mv spark-1.1.0-bin-hadoop2.4 /srv
~$ sudo chown -R hadoop:hadoop /srv/spark-1.1.0-bin-hadoop2.4
~$ sudo ln -s /srv/spark-1.1.0-bin-hadoop2.4 /srv/spark

Edit your ~/.profile with the following environment variables at the bottom of the file:

# Configure Spark environment
export SPARK_HOME=/srv/spark
export PATH=$SPARK_HOME/bin:$PATH

After you source your .profile or restart your terminal, you should be able to run a pyspark interpreter locally. You can now use pyspark and spark-submit commands to run Spark jobs.
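
For a quick check that Spark works in your environment, start pyspark and run a trivial job; the sc SparkContext is created for you when the shell starts (a sketch that counts the even numbers in 0-999):

~$ pyspark
>>> sc.parallelize(range(1000)).filter(lambda x: x % 2 == 0).count()
500
>>> exit()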

 

Conclusion

 

At this point you should have a fully configured pseudo-distributed Hadoop environment ready for development on Ubuntu, with HDFS, MapReduce on YARN, Hive, and Spark all ready to go, as well as a simple methodology for installing other services.
