Spark RDD: Concepts, Properties, and How They Appear in the Source Code

Posted by Jackson on 2017-10-13

The RDD source on GitHub:

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala

/**
* A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable,
* partitioned collection of elements that can be operated on in parallel. This class contains the
* basic operations available on all RDDs, such as `map`, `filter`, and `persist`. In addition,
* [[org.apache.spark.rdd.PairRDDFunctions]] contains operations available only on RDDs of key-value
* pairs, such as `groupByKey` and `join`;
* [[org.apache.spark.rdd.DoubleRDDFunctions]] contains operations available only on RDDs of
* Doubles; and
* [[org.apache.spark.rdd.SequenceFileRDDFunctions]] contains operations available on RDDs that
* can be saved as SequenceFiles.
* All operations are automatically available on any RDD of the right type (e.g. RDD[(Int, Int)])
* through implicit.
*
* Internally, each RDD is characterized by five main properties:
*
* - A list of partitions
* - A function for computing each split
* - A list of dependencies on other RDDs
* - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
* - Optionally, a list of preferred locations to compute each split on (e.g. block locations for
* an HDFS file)
*
* All of the scheduling and execution in Spark is done based on these methods, allowing each RDD
* to implement its own way of computing itself. Indeed, users can implement custom RDDs (e.g. for
* reading data from a new storage system) by overriding these functions. Please refer to the
* <a href="http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf">Spark paper</a>
* for more details on RDD internals.
*/

A Resilient Distributed Dataset (RDD), the basic abstraction in Spark.

Resilient: refers to the computation. Distributed computation can tolerate failures: if a node goes down or an intermediate result is lost, the missing data can be recomputed.
Distributed: the data is stored across nodes on different machines, and the code runs on multiple nodes.

Represents an immutable: once created, an RDD cannot be modified; applying operations to it produces new RDDs instead.
partitioned collection of elements: the elements are stored as partitions, similar to blocks on HDFS.
that can be operated on in parallel: the partitions can be processed in parallel, as the example below shows.

RDDA: (1,2,3,4,5,6,7,8,9)        operation: + 1
  PartitionA: (1,2,3) + 1
  PartitionB: (4,5,6) + 1
  PartitionC: (7,8,9) + 1
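
A minimal sketch of the same idea with the RDD API, assuming a SparkContext named sc (for example in spark-shell); the values and partition count are illustrative:

// Build an RDD of 1..9 with 3 partitions and add 1 to every element.
val rddA = sc.parallelize(1 to 9, numSlices = 3)   // three partitions, like A/B/C above
val rddB = rddA.map(_ + 1)                         // a new, immutable RDD; rddA is unchanged
rddB.collect()                                     // Array(2, 3, 4, 5, 6, 7, 8, 9, 10)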

The definition of RDD:

abstract class RDD[T: ClassTag](
    @transient private var _sc: SparkContext,
    @transient private var deps: Seq[Dependency[_]]
  ) extends Serializable with Logging
  • Abstract class: RDD always has concrete implementations; in practice you work directly with its subclasses (a short sketch follows this list).
  • Serializable: RDD instances can be serialized, which matters for performance when they are shipped between nodes.
  • T, a type parameter: an RDD can hold elements of any type.
  • SparkContext: every RDD is created within, and holds a reference to, a SparkContext.
  • @transient: these fields are excluded from serialization.
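
To illustrate the "abstract class" point (a hedged sketch, assuming a SparkContext sc; the file path is hypothetical): RDD itself is never instantiated directly; factory methods on SparkContext return concrete subclasses.

val lines = sc.textFile("hdfs:///tmp/input.txt")   // backed by a HadoopRDD / MapPartitionsRDD
val nums  = sc.parallelize(1 to 10)                // a ParallelCollectionRDD
println(nums.getClass.getSimpleName)               // prints the concrete subclass name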

The five main properties of an RDD

Internally, each RDD is characterized by five main properties:

  • A list of partitions: an RDD is made up of many partitions.
  • A function for computing each split: an operation on an RDD is in fact applied to every partition of that RDD.
  • A list of dependencies on other RDDs: RDDs depend on one another, e.g. RDDA ==> RDDB ==> RDDC ==> RDDE.
  • Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned).
  • Optionally, a list of preferred locations to compute each split on (e.g. block locations for
    an HDFS file): data locality. (A sketch showing how to observe all five through the public API follows this list.)
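
A rough way to observe these five properties through the public RDD API, assuming a SparkContext sc; the sizes and keys are only illustrative:

val rdd = sc.parallelize(1 to 100, 4).map(x => (x % 10, x)).reduceByKey(_ + _)

rdd.partitions.length                      // 1) the list of partitions
rdd.dependencies                           // 3) dependencies on the parent RDD
rdd.partitioner                            // 4) Some(HashPartitioner(4)) for this key-value RDD
rdd.preferredLocations(rdd.partitions(0))  // 5) preferred locations (often empty when running locally)
rdd.collect()                              // 2) compute is invoked per partition by the action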

How the five properties map to the RDD source code

1. def compute(split: Partition, context: TaskContext): Iterator[T]
Computing an RDD really means computing its partitions; the method takes a Partition and a TaskContext as its parameters.
This corresponds to the second property.

2. protected def getPartitions: Array[Partition]
An RDD consists of a set of partitions; the return type is an array whose element type is Partition.
This corresponds to the first property.

3. protected def getDependencies: Seq[Dependency[_]] = deps
Returns the dependencies on parent RDDs, which are built up as transformations are applied.
This corresponds to the third property.

4. protected def getPreferredLocations(split: Partition): Seq[String] = Nil
Returns the preferred locations for computing a split, i.e. data locality.
This corresponds to the fifth property.

5. @transient val partitioner: Option[Partitioner] = None
The optional Partitioner for key-value RDDs.
This corresponds to the fourth property.
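
To make the mapping concrete, here is a hypothetical, minimal custom RDD (the class name and behaviour are invented for illustration); only the two mandatory members, getPartitions and compute, are implemented:

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Each of the n partitions produces a single element: its own index.
class IndexRDD(sc: SparkContext, n: Int) extends RDD[Int](sc, Nil) {

  private case class Part(index: Int) extends Partition

  // property 1: the list of partitions
  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](n)(i => Part(i))

  // property 2: how to compute one split
  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    Iterator(split.index)
}

// new IndexRDD(sc, 3).collect()   // Array(0, 1, 2)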

Note: in Spark, each partition is executed by one task, so an RDD with N partitions is processed by N tasks.
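
A rough check of the partition-to-task relationship from the shell, assuming a SparkContext sc:

val rdd = sc.parallelize(1 to 100, 8)
rdd.getNumPartitions   // 8
rdd.count()            // the resulting stage launches 8 tasks, one per partition (visible in the Spark UI)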
