Spark RDD: Concepts, Properties, and How They Appear in the Source Code

Posted by Jackson on 2017-10-13

The RDD source on GitHub:

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala

/**
* A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable,
* partitioned collection of elements that can be operated on in parallel. This class contains the
* basic operations available on all RDDs, such as `map`, `filter`, and `persist`. In addition,
* [[org.apache.spark.rdd.PairRDDFunctions]] contains operations available only on RDDs of key-value
* pairs, such as `groupByKey` and `join`;
* [[org.apache.spark.rdd.DoubleRDDFunctions]] contains operations available only on RDDs of
* Doubles; and
* [[org.apache.spark.rdd.SequenceFileRDDFunctions]] contains operations available on RDDs that
* can be saved as SequenceFiles.
* All operations are automatically available on any RDD of the right type (e.g. RDD[(Int, Int)])
* through implicit.
*
* Internally, each RDD is characterized by five main properties:
*
* - A list of partitions
* - A function for computing each split
* - A list of dependencies on other RDDs
* - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
* - Optionally, a list of preferred locations to compute each split on (e.g. block locations for
* an HDFS file)
*
* All of the scheduling and execution in Spark is done based on these methods, allowing each RDD
* to implement its own way of computing itself. Indeed, users can implement custom RDDs (e.g. for
* reading data from a new storage system) by overriding these functions. Please refer to the
* <a href="http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf">Spark paper</a>
* for more details on RDD internals.
*/

A Resilient Distributed Dataset (RDD), the basic abstraction in Spark.

Resilient: refers to the computation. Distributed computation can tolerate failures: if a node goes down or an intermediate result is lost, the missing data can be recomputed.
Distributed: the data is stored across nodes on different machines, and the code runs on multiple nodes.

Represents an immutable: once created, an RDD cannot be modified; applying operations to it produces new RDDs instead.
partitioned collection of elements: the elements are stored as partitions, similar to blocks on HDFS.
that can be operated on in parallel: the partitions can be processed in parallel, as the example below shows.

RDDA: (1,2,3,4,5,6,7,8,9)        operation: + 1
  PartitionA: (1,2,3) + 1
  PartitionB: (4,5,6) + 1
  PartitionC: (7,8,9) + 1
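
A minimal sketch of the same idea with the RDD API, assuming a SparkContext named sc (for example in spark-shell); the values and partition count are illustrative:

// Build an RDD of 1..9 with 3 partitions and add 1 to every element.
val rddA = sc.parallelize(1 to 9, numSlices = 3)   // three partitions, like A/B/C above
val rddB = rddA.map(_ + 1)                         // a new, immutable RDD; rddA is unchanged
rddB.collect()                                     // Array(2, 3, 4, 5, 6, 7, 8, 9, 10)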

The definition of RDD:

abstract class RDD[T: ClassTag](
    @transient private var _sc: SparkContext,
    @transient private var deps: Seq[Dependency[_]]
  ) extends Serializable with Logging
  • Abstract class: RDD always has concrete implementations; in practice you work directly with its subclasses (a short sketch follows this list).
  • Serializable: RDD instances can be serialized, which matters for performance when they are shipped between nodes.
  • T, a type parameter: an RDD can hold elements of any type.
  • SparkContext: every RDD is created within, and holds a reference to, a SparkContext.
  • @transient: these fields are excluded from serialization.
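
To illustrate the "abstract class" point (a hedged sketch, assuming a SparkContext sc; the file path is hypothetical): RDD itself is never instantiated directly; factory methods on SparkContext return concrete subclasses.

val lines = sc.textFile("hdfs:///tmp/input.txt")   // backed by a HadoopRDD / MapPartitionsRDD
val nums  = sc.parallelize(1 to 10)                // a ParallelCollectionRDD
println(nums.getClass.getSimpleName)               // prints the concrete subclass name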

The five main properties of an RDD

Internally, each RDD is characterized by five main properties:

  • A list of partitions: an RDD is made up of many partitions.
  • A function for computing each split: an operation on an RDD is in fact applied to every partition of that RDD.
  • A list of dependencies on other RDDs: RDDs depend on one another, e.g. RDDA ==> RDDB ==> RDDC ==> RDDE.
  • Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned).
  • Optionally, a list of preferred locations to compute each split on (e.g. block locations for
    an HDFS file): data locality. (A sketch showing how to observe all five through the public API follows this list.)
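
A rough way to observe these five properties through the public RDD API, assuming a SparkContext sc; the sizes and keys are only illustrative:

val rdd = sc.parallelize(1 to 100, 4).map(x => (x % 10, x)).reduceByKey(_ + _)

rdd.partitions.length                      // 1) the list of partitions
rdd.dependencies                           // 3) dependencies on the parent RDD
rdd.partitioner                            // 4) Some(HashPartitioner(4)) for this key-value RDD
rdd.preferredLocations(rdd.partitions(0))  // 5) preferred locations (often empty when running locally)
rdd.collect()                              // 2) compute is invoked per partition by the action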

How the five properties map to the RDD source code

1. def compute(split: Partition, context: TaskContext): Iterator[T]
Computing an RDD really means computing its partitions; the method takes a Partition and a TaskContext as its parameters.
This corresponds to the second property.

2. protected def getPartitions: Array[Partition]
An RDD consists of a set of partitions; the return type is an array whose element type is Partition.
This corresponds to the first property.

3. protected def getDependencies: Seq[Dependency[_]] = deps
Returns the dependencies on parent RDDs, which are built up as transformations are applied.
This corresponds to the third property.

4. protected def getPreferredLocations(split: Partition): Seq[String] = Nil
Returns the preferred locations for computing a split, i.e. data locality.
This corresponds to the fifth property.

5. @transient val partitioner: Option[Partitioner] = None
The optional Partitioner for key-value RDDs.
This corresponds to the fourth property.
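
To make the mapping concrete, here is a hypothetical, minimal custom RDD (the class name and behaviour are invented for illustration); only the two mandatory members, getPartitions and compute, are implemented:

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Each of the n partitions produces a single element: its own index.
class IndexRDD(sc: SparkContext, n: Int) extends RDD[Int](sc, Nil) {

  private case class Part(index: Int) extends Partition

  // property 1: the list of partitions
  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](n)(i => Part(i))

  // property 2: how to compute one split
  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    Iterator(split.index)
}

// new IndexRDD(sc, 3).collect()   // Array(0, 1, 2)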

Note: in Spark, each partition is executed by one task, so an RDD with N partitions is processed by N tasks.
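
A rough check of the partition-to-task relationship from the shell, assuming a SparkContext sc:

val rdd = sc.parallelize(1 to 100, 8)
rdd.getNumPartitions   // 8
rdd.count()            // the resulting stage launches 8 tasks, one per partition (visible in the Spark UI)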
