HDFS Replica Placement Policy and File Read/Write Flow

Posted by Jackson on 2017-08-20

HDFS replica placement policy, file read/write flow, PID files, common commands, disk check, and data balancing.


HDFS Replica Placement Policy

HDFS keeps three replicas by default, configured in hdfs-site.xml via the dfs.replication parameter.

First replica:
If the uploading client runs on a DataNode, place the replica on that node;
otherwise, randomly pick a node whose disk is not too slow and whose CPU is not too busy.

Second replica:
Placed on a node in a different rack from the first replica.

Third replica:
Placed on a different node in the same rack as the second replica.

CDH uses a single virtual (default) rack, and in production we generally do not change CDH's rack configuration.
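
To see the policy in action, you can check the effective replication factor and where a file's blocks actually landed (a minimal sketch; /tmp/test.txt is just an example path):

# Print the replication factor the client configuration will use
hdfs getconf -confKey dfs.replication

# Change the replication factor of an existing file and wait for it to take effect
hdfs dfs -setrep -w 3 /tmp/test.txt

# Show the DataNodes holding each block of the file
hdfs fsck /tmp/test.txt -files -blocks -locations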


HDFS File Read/Write Flow

Two core objects: FSDataOutputStream and FSDataInputStream.

File write flow

(Figure: HDFS file write flow)
1 The client calls FileSystem.create(filePath), which talks to the NameNode over RPC to check whether the file already exists and whether the client has permission to create it. If the check fails, an error is returned; if it succeeds, a new file entry is created (not yet associated with any block) and an FSDataOutputStream object is returned.

2 The client calls write() on the FSDataOutputStream. The first replica of the first block is written to the first DataNode; once written, the data is forwarded to the second DataNode, and once the second replica is written it is forwarded to the third DataNode. When the third replica is written, the third DataNode returns an ack packet to the second DataNode; the second DataNode, having received that ack and written successfully itself, returns an ack packet to the first DataNode; the first DataNode, having received that ack and written successfully itself, returns an ack packet to the FSDataOutputStream, marking all three replicas of the first block as written. The remaining blocks are written in the same way.

3 When all data has been written,
the client calls FSDataOutputStream.close() to close the output stream.

4 Finally, the client calls FileSystem.complete() to tell the NameNode that the file was written successfully.
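
From the shell, the simplest way to exercise this write pipeline is an upload (a minimal sketch; the file names are just examples):

# Upload a local file; the client streams it block by block through the DataNode pipeline described above
hdfs dfs -put test.txt /tmp/test.txt

# Confirm the file and its size on HDFS
hdfs dfs -ls /tmp/test.txt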


File read flow

(Figure: HDFS file read flow)

1 The client calls FileSystem.open(filePath), which talks to the NameNode over RPC;
the NameNode returns part or all of the file's block list,
and the client gets back an FSDataInputStream object.

2 The client calls read() on the FSDataInputStream:

a. It reads the first block from the nearest DataNode; when the read completes, a checksum verification is performed.
If it passes, the connection to that DataNode is closed. If it fails,
the failed block + DataNode combination is recorded and will not be read again;
the client then reads that block from its second DataNode location.

b. It then connects to the nearest DataNode for the second block, reads and verifies it, and closes the connection.

c. If the block list is exhausted but the file is not finished,
FileSystem fetches the next batch of blocks for the file from the NameNode.
(To the client this looks like one continuous data stream; the block boundaries are completely transparent.)

3 The client calls FSDataInputStream.close() to close the input stream.
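
From the shell, the same read path is exercised by streaming or downloading a file (a minimal sketch; the paths are just examples):

# Stream the file to stdout; the client reads block after block from the nearest DataNodes
hdfs dfs -cat /tmp/test.txt

# Copy it back to the local filesystem, verifying checksums along the way
hdfs dfs -get /tmp/test.txt ./test-copy.txt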


Hadoop PID

Look at the PID-related settings in hadoop-env.sh:

[hadoop@bigdata01 hadoop]$ cat hadoop-env.sh |grep -C 10 PID
# export HADOOP_MOVER_OPTS=""
###
# Advanced Users Only!
###
# The directory where pid files are stored. /tmp by default.
# NOTE: this should be set to a directory that can only be written to by
# the user that will run the hadoop daemons. Otherwise there is the
# potential for a symlink attack.
export HADOOP_PID_DIR=${HADOOP_PID_DIR}
export HADOOP_SECURE_DN_PID_DIR=${HADOOP_PID_DIR}
# A string representing this instance of hadoop. $USER by default.
export HADOOP_IDENT_STRING=$USER

To configure the PID directory, set the following two parameters in hadoop-env.sh:

export HADOOP_PID_DIR=/home/hadoop/tmp
export HADOOP_SECURE_DN_PID_DIR=/home/hadoop/tmp

And in yarn-env.sh:

export YARN_PID_DIR=/home/hadoop/tmp

After restarting the daemons, the PID files show up in the configured directory:
[hadoop@bigdata01 ~]$ cd /home/hadoop/tmp/
[hadoop@bigdata01 tmp]$ ll
drwxrwxr-x 5 hadoop hadoop 51 Dec 1 23:38 dfs
-rw-rw-r-- 1 hadoop hadoop 6 Dec 5 22:03 hadoop-hadoop-datanode.pid
-rw-rw-r-- 1 hadoop hadoop 6 Dec 5 22:03 hadoop-hadoop-namenode.pid
-rw-rw-r-- 1 hadoop hadoop 6 Dec 5 22:03 hadoop-hadoop-secondarynamenode.pid
drwxr-xr-x 5 hadoop hadoop 57 Dec 5 22:06 nm-local-dir
-rw-rw-r-- 1 hadoop hadoop 42 Dec 2 12:36 test.txt
-rw-rw-r-- 1 hadoop hadoop 6 Dec 5 22:06 yarn-hadoop-nodemanager.pid
-rw-rw-r-- 1 hadoop hadoop 6 Dec 5 22:06 yarn-hadoop-resourcemanager.pid
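
A quick sanity check that a PID file still points at a live daemon (a minimal sketch; the NameNode PID file is used as an example):

# The file contains nothing but the daemon's process id
cat /home/hadoop/tmp/hadoop-hadoop-namenode.pid

# Confirm that process is actually running
ps -p $(cat /home/hadoop/tmp/hadoop-hadoop-namenode.pid)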

Hadoop Common Commands

hadoop fs==> hdfs dfs 
[-cat [-ignoreCrc] <src> ...]
[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
[-chown [-R] [OWNER][:[GROUP]] PATH...]
[-ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u] [<path> ...]]
[-put [-f] [-p] [-l] <localsrc> ... <dst>]
[-get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-copyFromLocal [-f] [-p] [-l] <localsrc> ... <dst>]
[-copyToLocal [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-rm [-f] [-r|-R] [-skipTrash] <src> ...]

In production, check whether the trash mechanism is enabled. CDH enables it by default and keeps deleted files for seven days before purging them automatically.
The relevant parameter:
fs.trash.interval = 10080 (minutes, i.e. seven days)
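
A quick way to confirm the setting and see the trash in action (a minimal sketch; the paths are examples and the trash location depends on the user running the command):

# Print the effective trash interval, in minutes (0 means trash is disabled)
hdfs getconf -confKey fs.trash.interval

# With trash enabled, rm only moves the file into the user's .Trash directory
hdfs dfs -rm /tmp/test.txt
hdfs dfs -ls /user/hadoop/.Trash/Current/tmp

# -skipTrash deletes immediately and cannot be undone
hdfs dfs -rm -skipTrash /tmp/test.txt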

Safe mode:

View the command help:

[hadoop@bigdata01 hadoop]$ hdfs dfsadmin
Usage: hdfs dfsadmin
Note: Administrative commands can only be run as the HDFS superuser.
[-report [-live] [-dead] [-decommissioning]]
[-safemode <enter | leave | get | wait>]

Use hdfs dfsadmin -safemode enter to enter safe mode and hdfs dfsadmin -safemode leave to leave it.

While in safe mode, HDFS can be read but not written.
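
For example (a minimal sketch):

# Check the current state
hdfs dfsadmin -safemode get

# Manually enter safe mode; reads still work, writes are rejected
hdfs dfsadmin -safemode enter

# Leave safe mode again
hdfs dfsadmin -safemode leave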


HDFS Disk Check (fsck)

hdfs fsck /

[hadoop@bigdata01 hadoop]$ hdfs fsck
Usage: DFSck <path> [-list-corruptfileblocks | [-move | -delete | -openforwrite] [-files [-blocks [-locations | -racks]]]] [-maintenance]
<path> start checking from this path
-move move corrupted files to /lost+found
-delete delete corrupted files
-files print out files being checked
-openforwrite print out files opened for write
-includeSnapshots include snapshot data if the given path indicates a snapshottable directory or there are snapshottable directories under it
-list-corruptfileblocks print out list of missing blocks and files they belong to
-blocks print out block report
-locations print out locations for every block
-racks print out network topology for data-node locations
-maintenance print out maintenance state node details
-blockId print out which file this blockId belongs to, locations (nodes, racks) of this block, and other diagnostics info (under replicated, corrupted or not, etc)

Run the command:

[hadoop@bigdata01 hadoop]$ hdfs fsck /
19/12/05 22:25:44 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Connecting to namenode via http://bigdata01:50070/fsck?ugi=hadoop&path=%2F
FSCK started by hadoop (auth:SIMPLE) from /192.168.52.50 for path / at Thu Dec 05 22:25:45 CST 2019
.........Status: HEALTHY
Total size: 515494717 B
Total dirs: 13
Total files: 9
Total symlinks: 0
Total blocks (validated): 11 (avg. block size 46863156 B)
Minimally replicated blocks: 11 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 1
Average block replication: 1.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 1
Number of racks: 1
FSCK ended at Thu Dec 05 22:25:45 CST 2019 in 16 milliseconds
The filesystem under path '/' is HEALTHY

In the output above, the two fields to watch are:

Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
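
If either counter is non-zero, fsck can also point at the affected files (a minimal sketch using flags from the help output above):

# List missing/corrupt blocks together with the files they belong to
hdfs fsck / -list-corruptfileblocks

# Walk every file and block; under-replicated blocks are flagged in the per-file output
hdfs fsck / -files -blocks -locations | grep -i "under replicated"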


HDFS Data Balancing

Balancing data across DataNodes

Looking at the start-balancer.sh script, it ultimately runs the "$bin"/hdfs start balancer command:

[hadoop@bigdata01 sbin]$ pwd
/home/hadoop/app/hadoop/sbin
[hadoop@bigdata01 sbin]$ cat start-balancer.sh
#!/usr/bin/env bash
"$HADOOP_PREFIX"/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script "$bin"/hdfs start balancer $@

Run the command:

[hadoop@bigdata01 logs]$ sh /home/hadoop/app/hadoop/sbin/start-balancer.sh 
starting balancer, logging to /home/hadoop/app/hadoop-2.6.0-cdh5.16.2/logs/hadoop-hadoop-balancer-bigdata01.out
Time Stamp Iteration# Bytes Already Moved Bytes Left To Move Bytes Being Moved

Then check the corresponding log file:

[hadoop@bigdata01 logs]$ more hadoop-hadoop-balancer-bigdata01.log
2019-12-05 22:34:50,542 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: namenodes = [hdfs://bigdata01:9000]
2019-12-05 22:34:50,544 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: parameters = Balancer.Parameters [BalancingPolicy.Node, threshold = 10.0, max idle iteration = 5, #excluded nodes = 0, #included nodes = 0, #source nodes = 0, run during upgrade = false]
2019-12-05 22:34:50,544 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: included nodes = []
2019-12-05 22:34:50,544 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: excluded nodes = []
2019-12-05 22:34:50,544 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: source nodes = []
2019-12-05 22:34:50,671 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-12-05 22:34:51,708 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: dfs.balancer.movedWinWidth = 5400000 (default=5400000)
2019-12-05 22:34:51,709 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: dfs.balancer.moverThreads = 1000 (default=1000)
2019-12-05 22:34:51,709 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: dfs.balancer.dispatcherThreads = 200 (default=200)
2019-12-05 22:34:51,709 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: dfs.datanode.balance.max.concurrent.moves = 50 (default=50)
2019-12-05 22:34:51,714 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: dfs.balancer.max-size-to-move = 10737418240 (default=10737418240)
2019-12-05 22:34:51,726 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/192.168.52.50:50010
2019-12-05 22:34:51,727 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: 0 over-utilized: []
2019-12-05 22:34:51,728 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: 0 underutilized: []

Note the parameter threshold = 10.0.
It means the balancer keeps every DataNode's disk usage within this percentage of the cluster average. For example, with three machines whose disk usage is 90%, 60% and 80%:
(90 + 60 + 80) / 3 ≈ 76.7 (cluster average)
90 - 76.7 ≈ +13.3 (over-utilized, outside the ±10 band)
60 - 76.7 ≈ -16.7 (under-utilized, outside the ±10 band)
80 - 76.7 ≈ +3.3 (within the band)
The balancer moves blocks until the difference between every node's disk usage and the cluster average is smaller than the threshold.

The dfs.datanode.balance.bandwidthPerSec parameter (e.g. 30m) limits the bandwidth each DataNode may use for balancing.

In production it is recommended to run the balancer from a daily scheduled script, as sketched below.
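
A minimal sketch of such a job; the script path, log path, bandwidth and threshold values are assumptions to adapt to your cluster:

#!/bin/bash
# /home/hadoop/scripts/daily_balance.sh (hypothetical path)
# Cap balancer traffic at roughly 30 MB/s per DataNode (example value)
hdfs dfsadmin -setBalancerBandwidth 31457280
# Run the balancer with the 10% threshold seen in the log above
/home/hadoop/app/hadoop/sbin/start-balancer.sh -threshold 10

# crontab entry: run every day at 01:00
# 0 1 * * * /home/hadoop/scripts/daily_balance.sh >> /home/hadoop/logs/balance.log 2>&1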

Balancing data across the multiple disks of a single DataNode

Official documentation: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSDiskbalancer.html

df -h
/data01 90%
/data02 60%
/data03 80%
/data04 0%

This feature must be enabled first: dfs.disk.balancer.enabled must be set to true in hdfs-site.xml.

Execution steps:
1.hdfs diskbalancer -plan bigdata01      (generates bigdata01.plan.json)
2.hdfs diskbalancer -execute bigdata01.plan.json
3.hdfs diskbalancer -query bigdata01

When to run it:
1.When a new disk is added to the node
2.When monitoring shows a disk has less than 10% free space

Configuring multiple disk directories on a DataNode:
dfs.datanode.data.dir = /data01,/data02,/data03 (comma-separated)
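
Both settings can be double-checked from the shell (a minimal sketch; getconf reads the values from the local hdfs-site.xml):

# Should print true once the disk balancer is enabled
hdfs getconf -confKey dfs.disk.balancer.enabled

# Prints the comma-separated list of data directories the DataNode writes to
hdfs getconf -confKey dfs.datanode.data.dir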

Why mount multiple physical disk directories on a production DataNode:
1.To spread I/O across disks for more efficient reads and writes
2.To plan 2-3 years of storage capacity up front and avoid the maintenance work of adding disks later