### Environment Preparation
OS: `CentOS 7`
Software:
- `hadoop`: `2.7.7`

Servers:
`Hadoop Master`: `172.16.0.3 (master)` `NameNode` `SecondaryNameNode` `ResourceManager` `DataNode` `NodeManager`
`Hadoop Slave` : `172.16.0.4 (slave1)` `DataNode` `NodeManager`
`Hadoop Slave` : `172.16.0.5 (slave2)` `DataNode` `NodeManager`
`Hadoop Slave` : `172.16.0.6 (slave3)` `DataNode` `NodeManager`
`Hadoop Slave` : `172.16.0.7 (slave4)` `DataNode` `NodeManager`
 
### Initial Setup
#### Configure hostname resolution
`All hosts`
```bash
cat >> /etc/hosts << EOF
172.16.0.3 master
172.16.0.4 slave1
172.16.0.5 slave2
172.16.0.6 slave3
172.16.0.7 slave4
EOF
```
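To confirm that resolution works, a quick check can be run on any host (hostnames as configured above):
```bash
# each name should resolve to the address listed in /etc/hosts
for h in master slave1 slave2 slave3 slave4; do
    getent hosts $h
done
```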
#### Generate an SSH key and enable passwordless login to the slaves
Note: run this step after the `hadoop` user has been created (see "Create user" below), since the key must belong to that user.
`master`
```bash
su - hadoop
ssh-keygen -t rsa
ssh-copy-id slave1
ssh-copy-id slave2
ssh-copy-id slave3
ssh-copy-id slave4
```
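Before continuing, it is worth confirming that passwordless login actually works; the loop below (same hostnames as above) should print each slave's hostname without prompting for a password:
```bash
# BatchMode fails fast instead of prompting if key auth is not set up
for h in slave1 slave2 slave3 slave4; do
    ssh -o BatchMode=yes $h hostname
done
```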
#### Download and install Java
Download page: https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
`All hosts`
```bash
rpm -ivh jdk-8u221-linux-x64.rpm
```
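Verify the installation:
```bash
java -version   # should report java version "1.8.0_221"
```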
 
### Install the Hadoop Cluster
#### Create user
`All hosts`
```bash
useradd -d /opt/hadoop hadoop
echo "password"|passwd --stdin hadoop #免交互设置用户密码
```
#### Download Hadoop
`master`
```bash
curl -O http://apache.javapipe.com/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz
tar xfz hadoop-2.7.7.tar.gz
cp -rf hadoop-2.7.7/* /opt/hadoop/
chown -R hadoop:hadoop /opt/hadoop/
```
#### Configure environment variables
`master`
```bash
su - hadoop
cat >> .bash_profile << EOF
## JAVA env variables
export JAVA_HOME=/usr/java/default
export PATH=\$PATH:\$JAVA_HOME/bin
export CLASSPATH=.:\$JAVA_HOME/jre/lib:\$JAVA_HOME/lib:\$JAVA_HOME/lib/tools.jar
## HADOOP env variables
export HADOOP_HOME=/opt/hadoop
export HADOOP_COMMON_HOME=\$HADOOP_HOME
export HADOOP_HDFS_HOME=\$HADOOP_HOME
export HADOOP_MAPRED_HOME=\$HADOOP_HOME
export HADOOP_YARN_HOME=\$HADOOP_HOME
export HADOOP_OPTS="-Djava.library.path=\$HADOOP_HOME/lib/native"
export HADOOP_COMMON_LIB_NATIVE_DIR=\$HADOOP_HOME/lib/native
export PATH=\$PATH:\$HADOOP_HOME/sbin:\$HADOOP_HOME/bin
EOF
source .bash_profile
```
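A quick check that the new variables took effect:
```bash
echo $HADOOP_HOME   # expect /opt/hadoop
hadoop version      # should print Hadoop 2.7.7
```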
 
### Configure the Hadoop Cluster
#### Edit core-site.xml
`master`
```bash
su - hadoop
vi etc/hadoop/core-site.xml
```
```xml
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:9000/</value>
    </property>
</configuration>
```
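With the environment variables from the previous step loaded, the effective value can be verified with `hdfs getconf`:
```bash
hdfs getconf -confKey fs.defaultFS   # expect hdfs://master:9000/
```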
#### Edit hdfs-site.xml
`master`
```bash
vi etc/hadoop/hdfs-site.xml
```
```xml
<configuration>
    <!-- dfs.data.dir/dfs.name.dir are deprecated in Hadoop 2.x; use the current names -->
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///opt/volume/datanode</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///opt/volume/namenode</value>
    </property>
</configuration>
```
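Note that these directories are not created automatically. They must exist and be owned by the `hadoop` user on every host (run as root; paths match the values in hdfs-site.xml above):
```bash
mkdir -p /opt/volume/namenode /opt/volume/datanode
chown -R hadoop:hadoop /opt/volume
```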
#### Edit mapred-site.xml
`master`
```bash
cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml # the file does not exist by default
vi etc/hadoop/mapred-site.xml
```
```xml
<configuration>
    <!-- the obsolete mapred.job.tracker setting is not needed: there is no JobTracker under YARN -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
```
#### Edit yarn-site.xml
`master`
```bash
vi etc/hadoop/yarn-site.xml
```
```xml
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>master</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>${yarn.resourcemanager.hostname}:8032</value>
    </property>
    <property>
        <name>yarn.resourcemanager.bind-host</name>
        <value>0.0.0.0</value>
    </property>
</configuration>
```
#### Edit hadoop-env.sh
`master`
```bash
vi etc/hadoop/hadoop-env.sh
```
```bash
export JAVA_HOME=/usr/java/default/
```
#### Edit masters
`master`
```bash
cat > etc/hadoop/masters << EOF
master
EOF
```
#### Edit slaves
`master`
```bash
cat > etc/hadoop/slaves << EOF
master
slave1
slave2
slave3
slave4
EOF
```
 
#### Copy Hadoop to the slave nodes
`master`
```bash
su - hadoop
scp -r * slave1:/opt/hadoop/
scp -r * slave2:/opt/hadoop/
scp -r * slave3:/opt/hadoop/
scp -r * slave4:/opt/hadoop/
```
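A quick spot check that the files arrived (the configuration file is an arbitrary choice here):
```bash
for h in slave1 slave2 slave3 slave4; do
    ssh $h "ls /opt/hadoop/etc/hadoop/core-site.xml"
done
```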
 
### Format the NameNode
`master`
```bash
su - hadoop
hdfs namenode -format
```
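If the format succeeds, the metadata directory configured in hdfs-site.xml is populated:
```bash
ls /opt/volume/namenode/current/   # expect fsimage_* and VERSION files
```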
 
### Start and Stop the Cluster
`master`
```bash
start-all.sh # start the Hadoop cluster
stop-all.sh # stop the Hadoop cluster
```
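`start-all.sh`/`stop-all.sh` are deprecated in Hadoop 2.x; the equivalent recommended form is to manage HDFS and YARN separately:
```bash
start-dfs.sh && start-yarn.sh   # start HDFS first, then YARN
stop-yarn.sh && stop-dfs.sh     # stop in reverse order
```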
 
### Monitor Processes
`master`
```bash
jps
```
```
21078 Jps
3922 ResourceManager
4050 NodeManager
3431 NameNode
3577 DataNode
3755 SecondaryNameNode
```
`slave nodes`
```bash
jps
```
```
7517 Jps
21298 DataNode
21422 NodeManager
```
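Beyond `jps`, HDFS and YARN can report cluster health directly; with all five nodes up, the commands below should show five live DataNodes and five running NodeManagers:
```bash
hdfs dfsadmin -report   # live DataNodes and capacity
yarn node -list         # registered NodeManagers
```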
 
### Test the HDFS Cluster
```bash
hdfs dfs -mkdir /my_storage # create a directory
hdfs dfs -put LICENSE.txt /my_storage # upload a file
hdfs dfs -cat /my_storage/LICENSE.txt # view the file
hdfs dfs -ls /my_storage/
hdfs dfs -get /my_storage/ ./ # download the directory
```
 
### Monitor Cluster Services
`master`
```bash
http://master:50070
```
#### Browse the HDFS filesystem
```bash
http://master:50070/explorer.html
```
#### Cluster and application information
```bash
http://master:8088
```
#### NodeManager information
```bash
http://master:8042
```
 
### Start on Boot
`master`
```bash
vi /etc/rc.local
```
```bash
su - hadoop -c "/opt/hadoop/sbin/start-all.sh"
```
```bash
chmod +x /etc/rc.d/rc.local
systemctl enable rc-local
systemctl start rc-local
```
 
### Running MapReduce with Python
`Note: compute the maximum temperature for each year from 1901 to 1909 in the NOAA data. In each record, characters 15-18 hold the year, characters 87-91 the temperature, and character 92 a quality code. The mapper turns each input line into a "year temperature" record (e.g. 1901 +0056); the reducer takes the mapper output and emits the maximum temperature for each year.`
Mapper program
```bash
cat mapper_noaa.py
```
```python
#!/usr/bin/env python
import sys
import re

# accept only records whose quality code is 0, 1, 4, 5 or 9
pattern = re.compile(r'[01459]')

for line in sys.stdin:
    year, temperature, q = line[15:19], int(line[87:92]), line[92:93]
    # +9999 marks a missing temperature reading
    if pattern.match(q) and temperature != 9999:
        print("{0}\t{1}".format(year, temperature))
```
Reducer program
```bash
cat reducer_noaa.py
```
```python
#!/usr/bin/env python
import sys

current_year = None
current_temp_max = None

for line in sys.stdin:
    # input arrives sorted by key, as "year<TAB>temperature"
    year, temperature = line.strip().split('\t')
    try:
        temperature = int(temperature)
    except ValueError:
        continue
    if current_year == year:
        if current_temp_max < temperature:
            current_temp_max = temperature
    else:
        # new year: emit the maximum for the previous one
        if current_year:
            print("{0} {1}".format(current_year, current_temp_max))
        current_year = year
        current_temp_max = temperature

# emit the final year
if current_year:
    print("{0} {1}".format(current_year, current_temp_max))
```
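Before submitting to the cluster, the pair can be smoke-tested locally; the `sort` in the pipeline below stands in for the shuffle phase of the streaming job (use `zcat` instead of `cat` if the yearly files are still gzipped):
```bash
chmod +x mapper_noaa.py reducer_noaa.py   # streaming runs them as executables
cat noaa/1901/* | ./mapper_noaa.py | sort | ./reducer_noaa.py
```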
#### Download the data
```bash
ftp://ftp.ncdc.noaa.gov/pub/data/noaa/ # place each year's downloaded data in a noaa directory
```
#### Upload the data to HDFS
```bash
su - hadoop
hdfs dfs -mkdir /test/ # create the test directory
hdfs dfs -copyFromLocal noaa /test/noaa # noaa holds the downloaded weather data
```
#### Run MapReduce
```bash
su - hadoop
hadoop jar /opt/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.7.jar \
    -file ./mapper_noaa.py -file ./reducer_noaa.py \
    -mapper ./mapper_noaa.py -reducer ./reducer_noaa.py \
    -input /test/noaa/190[0-9]/ \
    -output /test/noaa_1901_1909_results
```
#### View the results
```bash
hdfs dfs -cat /test/noaa_1901_1909_results/part-00000
```
```
1901 317
1902 244
1903 289
1904 256
1905 283
1906 294
1907 283
1908 289
1909 278
```
`Note: temperatures are stored multiplied by 10, so the 1901 maximum temperature is 31.7°C.`