
Install Hadoop Using ESXi

Building a Hadoop Distributed Environment with Virtual Machines.

To simulate a Hadoop distributed environment with ESXi, follow the instructions below:

  1. Download and install the latest version of ESXi on your machine.
  2. Create multiple virtual machines (VMs) on the ESXi host, each with its own IP address.
  3. Install the desired operating system (OS) on each VM.
  4. Configure the network settings for each VM to be on the same subnet.
  5. Install Hadoop on each VM.
  6. Configure the Hadoop settings on each VM to work together as a distributed environment.

By following these steps, you can simulate a Hadoop distributed environment on a single ESXi host. Happy Hadooping!

Preparation

  • Three virtual machines (Ubuntu 20.04 Server is used in this post), listed below.
  • To keep the IPs from changing, set a static IP on each machine.
  • Install Java on each VM; check the Hadoop website for the supported Java versions.
  • It is enough that java -version prints normal output; I did not specifically set JAVA_HOME or other environment variables at this stage.
  • Set up passwordless SSH access between the machines (see the sketch after the host list below).
  • Download and install Hadoop (unpacking the archive is enough). You can install it on one machine first and distribute it to the others once configuration is done.
  • Configure selectively according to the official documentation (it lists many options; anything you don't set falls back to a default, so there is no need to configure everything up front).
192.168.3.254 hadoop-node1
192.168.3.250 hadoop-node2
192.168.3.253 hadoop-node3
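
For the passwordless-SSH step, here is a minimal sketch, assuming the same user account (aguo, the user that appears in the paths later in this post) exists on all three nodes and the host entries above are already in place:

# on the node you will start the cluster from: generate a key pair (skip if one exists)
ssh-keygen -t rsa -b 4096 -N "" -f ~/.ssh/id_rsa

# copy the public key to every node, including the local one
for host in hadoop-node1 hadoop-node2 hadoop-node3; do
    ssh-copy-id "aguo@$host"
done

# verify: this should log in without prompting for a password
ssh aguo@hadoop-node2 hostname

Repeat from each node if you want full mesh access; at a minimum, the node that runs the start scripts must be able to reach every other node without a password.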

To install Java on Ubuntu, you can use the following command:

sudo apt-get update
sudo apt-get install openjdk-8-jdk

To sync files to another Linux machine, you can use the rsync command. First, run the following on the source machine:

rsync -avz /path/to/local/file user@remote_machine:/path/to/remote/directory

Replace /path/to/local/file with the path of the local file you want to sync, user with the username on the remote machine, remote_machine with the remote machine's IP address or domain name, and /path/to/remote/directory with the path of the remote directory to sync to. After running this command, you will be prompted for the remote machine's password.

To sync an entire directory, replace the file path with a directory path, for example:

rsync -avz /path/to/local/directory/ user@remote_machine:/path/to/remote/directory/

Note that rsync will not delete files on the remote machine that have been deleted on the source. To keep the files on the target machine fully in sync with the source, it is best to use the --delete option. For example:

rsync -avz --delete /path/to/local/directory/ user@remote_machine:/path/to/remote/directory/
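
As a concrete example, once Hadoop on hadoop-node1 is configured, the whole directory could be pushed to the other nodes like this (the aguo user and the ~/hadoop-3.3.4 path are taken from the core-site.xml shown later; adjust them to your layout):

rsync -avz --delete ~/hadoop-3.3.4/ aguo@hadoop-node2:~/hadoop-3.3.4/
rsync -avz --delete ~/hadoop-3.3.4/ aguo@hadoop-node3:~/hadoop-3.3.4/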

To unpack a .tar.gz file, use the following command:

tar -xzvf file.tar.gz

Replace file.tar.gz with the name of the file you want to unpack. This will extract the contents of the file to the current directory.
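
For instance, for the Hadoop 3.3.4 release used later in this post (the Apache archive URL here is an assumption about where you download it from; any official mirror works):

wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz
tar -xzvf hadoop-3.3.4.tar.gz    # unpacks into ./hadoop-3.3.4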

To set the JAVA_HOME environment variable, follow these steps:

  1. Determine the path to your Java installation. This will typically be /usr/lib/jvm/java-version on Linux or C:\Program Files\Java\jdk-version on Windows.

  2. Open your bash shell configuration file. This will typically be ~/.bashrc on Linux or ~/.bash_profile on macOS.

  3. Add the following line to the end of the file, replacing /path/to/java with the path to your Java installation:

    export JAVA_HOME=/path/to/java
    
  4. Save and close the file.

  5. Run the following command to reload your shell configuration:

    source ~/.bashrc
    

    or

    source ~/.bash_profile
    

    depending on which file you modified.

Your JAVA_HOME environment variable is now set and should be available in any new terminal sessions.

To determine the path to your Java installation on Ubuntu, you can run the following command in a terminal:

readlink -f $(which java)

This will output the absolute path to the java executable on your system, typically something like /usr/lib/jvm/java-version/bin/java (or .../jre/bin/java for Java 8). Strip the trailing bin/java part to get the directory to use as JAVA_HOME.

Configuring Environment of Hadoop Daemons

Here, following the official documentation, I configured the etc/hadoop/hadoop-env.sh file:

1. export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

This path can be obtained with readlink -f $(which java), which prints the absolute path. Once JAVA_HOME is set here, there is no need to set it again in yarn-env.sh and similar files; they inherit it from here by default.

2. I configured two directories:

HADOOP_PID_DIR - The directory where the daemons' process id files are stored.
HADOOP_LOG_DIR - The directory where the daemons' log files are stored. Log files are automatically created if they don't exist.

export HADOOP_LOG_DIR=/var/log/hadoop

export HADOOP_PID_DIR=/var/run/hadoop

Remember to create these directories first and set their permissions.
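
For example, to create them up front (assuming you have sudo access and use the paths from the exports above):

sudo mkdir -p /var/run/hadoop /var/log/hadoop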

Set the ownership and permissions on the directories so that only the Hadoop user (or users) can write to them. For example, you might use the following commands:

chown hadoop:hadoop /var/run/hadoop
chmod 700 /var/run/hadoop

chown hadoop:hadoop /var/log/hadoop
chmod 700 /var/log/hadoop

Replace hadoop with the name of the user (or users) who will be running the Hadoop daemons.

3. Configure the Hadoop environment variables. Many tutorials online add them to files such as /etc/profile or ~/.bashrc, but the official documentation suggests creating a script under the /etc/profile.d/ directory instead (which has its advantages):

It’s generally recommended to use ~/.bashrc for user-specific settings, and to add system-wide settings to /etc/profile.d instead of directly editing /etc/profile. This is because /etc/profile is a system file that may be overwritten during an update, while /etc/profile.d is intended for user-defined scripts that are sourced by /etc/profile.

Here I follow the official approach:

To configure HADOOP_HOME in the system-wide shell environment configuration, we can create a script inside /etc/profile.d/. Create a file called hadoop.sh with the following contents:

export HADOOP_HOME=/path/to/hadoop
export PATH=$PATH:$HADOOP_HOME/bin

Replace /path/to/hadoop with the path to your Hadoop installation and save the file. Scripts in /etc/profile.d/ are sourced by login shells rather than executed, so the execute bit is not strictly required, but it does no harm:

chmod +x /etc/profile.d/hadoop.sh

The HADOOP_HOME environment variable will now be set for all users on the system.
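
To pick up the new variables in the current shell without logging out and back in, you can source the script and run a quick sanity check (assuming $HADOOP_HOME/bin contains a working installation):

source /etc/profile.d/hadoop.sh
hadoop version    # should print the Hadoop release, e.g. 3.3.4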

Configuring the Hadoop Daemons

etc/hadoop/core-site.xml

<!-- Put site-specific property overrides in this file. -->

<configuration>
        <property>
                <name>fs.defaultFS</name>
                <value>hdfs://hadoop-node1:9000</value>
        </property>
        <property>
                <name>hadoop.tmp.dir</name>
                <value>/home/aguo/hadoop-3.3.4/tmp</value>
        </property>
</configuration>

The official guide does not show a setting for the hadoop.tmp.dir temporary directory; if you don't specify it, a default directory is used:

The default value of hadoop.tmp.dir is /tmp/hadoop-${user.name}, i.e. a Hadoop-specific subdirectory under the system temporary directory. Many other paths are derived from it; for example, the default NameNode and DataNode storage directories live under ${hadoop.tmp.dir}/dfs.

So even if you don't explicitly set hadoop.tmp.dir in core-site.xml, data will end up under /tmp/hadoop-${user.name} unless you override it.

However, it is still recommended to set hadoop.tmp.dir explicitly to a dedicated directory separate from the system temporary directory: /tmp is often cleared on reboot, and keeping Hadoop's data out of it avoids surprises and conflicts.

etc/hadoop/hdfs-site.xml

The official guide lists separate NameNode and DataNode settings; I only specified the value of dfs.replication, as in the sketch below.
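
A minimal hdfs-site.xml along those lines could look like the following (the value 3 is the default and matches the three-node cluster here, but the post does not say which replication factor was actually chosen, so treat it as an assumption):

<configuration>
        <property>
                <name>dfs.replication</name>
                <value>3</value>
        </property>
</configuration>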

The dfs.replication property is used to configure the Hadoop Distributed File System (HDFS). This property determines the default replication factor for HDFS, which is the number of copies of each block of data that are stored across the cluster.

By default, this property is set to 3, which means that each block is replicated to three different nodes in the cluster. However, you can change this value to a different integer if you want to use a different replication factor. This property can also be overridden on a per-file basis if you want to use a different replication factor for specific files.

It’s generally a good practice to set this property explicitly in the hdfs-site.xml file, as it allows you to specify the replication factor that is appropriate for your specific use case.

For example, if you have a small cluster with limited storage space, you may want to set the replication factor to 2 or even 1 to conserve storage. On the other hand, if you have a large cluster with plenty of storage space, you may want to set the replication factor to a higher value to improve data availability and fault tolerance.

etc/hadoop/yarn-site.xml

I only set yarn.resourcemanager.hostname. For reference, these are the related ResourceManager address properties:

<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>hostname</value>
</property>
<property>
  <name>yarn.resourcemanager.webapp.address</name>
  <value>hostname:port</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>hostname:port</value>
</property>
<property>
  <name>yarn.resourcemanager.resource-tracker.address</name>
  <value>hostname:port</value>
</property>
<property>
  <name>yarn.resourcemanager.admin.address</name>
  <value>hostname:port</value>
</property>
<property>
  <name>yarn.resourcemanager.address</name>
  <value>hostname:port</value>
</property>

The defaults for these properties are as follows:

yarn.resourcemanager.hostname: 0.0.0.0
yarn.resourcemanager.address: ${yarn.resourcemanager.hostname}:8032
yarn.resourcemanager.scheduler.address: ${yarn.resourcemanager.hostname}:8030
yarn.resourcemanager.resource-tracker.address: ${yarn.resourcemanager.hostname}:8031
yarn.resourcemanager.admin.address: ${yarn.resourcemanager.hostname}:8033
yarn.resourcemanager.webapp.address: ${yarn.resourcemanager.hostname}:8088

Because the individual addresses default to ${yarn.resourcemanager.hostname} plus a fixed port, setting yarn.resourcemanager.hostname alone is enough.
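
So the yarn-site.xml used here can be as small as the following (hadoop-node1 as the ResourceManager host is an assumption; the post does not say which node runs the ResourceManager):

<configuration>
        <property>
                <name>yarn.resourcemanager.hostname</name>
                <value>hadoop-node1</value>
        </property>
</configuration>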

etc/hadoop/mapred-site.xml

I set mapreduce.framework.name:

<!-- Put site-specific property overrides in this file. -->

<configuration>
        <property>
                <name>mapreduce.framework.name</name>
                <value>yarn</value>
        </property>
</configuration>

Slaves File

To modify the etc/hadoop/workers file to add or remove nodes in a Hadoop cluster, follow these steps:

  1. Add or remove the hostnames or IP addresses of the machines you want to add or remove, one per line. For example:

    hadoop-node1
    hadoop-node2
    hadoop-node3
    

    If you want to remove a machine from the cluster, simply delete its hostname or IP address from the file.

  2. Save and close the file.

  3. The workers file is only read by the cluster start/stop scripts (start-dfs.sh, start-yarn.sh, start-all.sh and their stop counterparts) to decide which hosts to launch daemons on, so restarting the affected daemons is enough for the change to take effect; DataNodes and NodeManagers register with the master when they start.

  4. If you additionally restrict which hosts may join HDFS using the NameNode's include/exclude files (dfs.hosts / dfs.hosts.exclude), run the following command on the master node after editing them so the NameNode re-reads the lists:

    $ hdfs dfsadmin -refreshNodes

    Note that this command must be run as the HDFS superuser (the user that runs the NameNode).

Problems

  1. After finishing the configuration above, I started the cluster with sbin/start-all.sh and the output looked normal, but the web UIs at http://<NameNode hostname>:9870 and http://<ResourceManager hostname>:8088 showed only one node, even though running jps on each node showed all the expected processes.

    It finally turned out that the cause was a duplicated hostname in the hosts entries added while setting up passwordless SSH.

    The original hosts file looked like this:

    127.0.0.1 localhost
    127.0.1.1 hadoop-node1
    
    # The following lines are desirable for IPv6 capable hosts
    ::1     ip6-localhost ip6-loopback
    fe00::0 ip6-localnet
    ff00::0 ip6-mcastprefix
    ff02::1 ip6-allnodes
    ff02::2 ip6-allrouters

    For convenience, I added an entry for each machine, so the file became:

    127.0.0.1 localhost
    127.0.1.1 hadoop-node1
    
    # The following lines are desirable for IPv6 capable hosts
    ::1     ip6-localhost ip6-loopback
    fe00::0 ip6-localnet
    ff00::0 ip6-mcastprefix
    ff02::1 ip6-allnodes
    ff02::2 ip6-allrouters
    
    192.168.1.1 hadoop-node1
    192.168.1.2 hadoop-node2
    192.168.1.3 hadoop-node3

    Rather than editing each machine by hand, I simply distributed this file to the other two machines. The problem here is:

    127.0.1.1 hadoop-node1
    192.168.1.1 hadoop-node1

    In /etc/hosts, a hostname should map to a single IP. If the same hostname is given multiple IPs in the file, the resolver normally uses the first matching entry, so these two lines conflict and the 127.0.1.1 entry wins.

    In the /etc/hosts file in Linux, the 127.0.1.1 IP address is often used to map a hostname to the loopback interface, which allows services running on the local machine to be accessed using a hostname instead of an IP address.

    This is commonly used in Ubuntu and other Debian-based distributions, where the /etc/hosts file contains a line like the following:

    127.0.1.1 hostname
    

    Here, hostname is the name of the local machine. This line maps the hostname to the loopback interface, allowing services and applications running on the local machine to be accessed using the hostname hostname instead of the IP address 127.0.0.1.

    After I commented out the 127.0.1.1 hadoop-node1 entry and restarted the cluster, everything worked.

  2. On a machine outside the cluster, uploading files through the web interface failed. Another symptom of the same problem was that none of Hadoop's web interfaces could be reached via hostname:port. Checking the browser's network tab showed cross-origin errors.

    • Added the host entries to that machine's hosts file; on Windows you also need to run ipconfig /flushdns afterwards.
    • Configured cross-origin access in core-site.xml (see the sketch after this list). At that point the Windows machine worked: files uploaded normally and both ip:port and hostname:port were reachable.
    • The macOS machine still reported cross-origin errors. Since the failing requests targeted hostname:port addresses, the problem still looked like hostname resolution. However, Chrome's chrome://net-internals/#dns page showed the hostname resolving correctly to the IP, and ping hostname in the terminal resolved correctly too.
    • Checking other network-related processes turned up clash. After killing it, the macOS machine worked as well. Looking at clash's configuration, its built-in DNS service was enabled; after turning that off, everything worked.
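
    The post does not show exactly which cross-origin settings were added to core-site.xml. For reference, here is a sketch using Hadoop's HTTP cross-origin filter (to the best of my knowledge these are the standard property names, but verify them against core-default.xml for your release; the wildcard origin is only sensible for a test cluster):

    <!-- added to core-site.xml on the nodes serving the web UIs -->
    <property>
            <name>hadoop.http.filter.initializers</name>
            <value>org.apache.hadoop.security.HttpCrossOriginFilterInitializer</value>
    </property>
    <property>
            <name>hadoop.http.cross-origin.enabled</name>
            <value>true</value>
    </property>
    <property>
            <name>hadoop.http.cross-origin.allowed-origins</name>
            <value>*</value>
    </property>

    Restart the affected daemons after changing core-site.xml for the filter to take effect.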