/**
 * A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable,
 * partitioned collection of elements that can be operated on in parallel. This class contains the
 * basic operations available on all RDDs, such as `map`, `filter`, and `persist`. In addition,
 * [[org.apache.spark.rdd.PairRDDFunctions]] contains operations available only on RDDs of key-value
 * pairs, such as `groupByKey` and `join`;
 * [[org.apache.spark.rdd.DoubleRDDFunctions]] contains operations available only on RDDs of
 * Doubles; and
 * [[org.apache.spark.rdd.SequenceFileRDDFunctions]] contains operations available on RDDs that
 * can be saved as SequenceFiles.
 * All operations are automatically available on any RDD of the right type (e.g. RDD[(Int, Int)])
 * through implicit.
 *
 * Internally, each RDD is characterized by five main properties:
 *
 *  - A list of partitions
 *  - A function for computing each split
 *  - A list of dependencies on other RDDs
 *  - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
 *  - Optionally, a list of preferred locations to compute each split on (e.g. block locations for
 *    an HDFS file)
 *
 * All of the scheduling and execution in Spark is done based on these methods, allowing each RDD
 * to implement its own way of computing itself. Indeed, users can implement custom RDDs (e.g. for
 * reading data from a new storage system) by overriding these functions. Please refer to the
 * <a href="http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf">Spark paper</a>
 * for more details on RDD internals.
 */
A Resilient Distributed Dataset (RDD), the basic abstraction in Spark.
Represents an immutable, partitioned collection of elements that can be operated on in parallel.
Immutable: once an RDD has been created it never changes; operating on it produces a new RDD.
Partitioned: the elements are stored in partitions, similar to blocks on HDFS.
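The two notes above can be checked with a small local job. This is only a sketch: it assumes Spark is on the classpath and runs with a `local[2]` master, and the object and method names (`ImmutabilityDemo`, `doubled`) are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ImmutabilityDemo {
  // A transformation returns a new RDD; the input RDD is left untouched.
  def doubled(numbers: RDD[Int]): RDD[Int] = numbers.map(_ * 2)

  def main(args: Array[String]): Unit = {
    // Assumption: a local master with 2 threads, just for experimenting.
    val conf = new SparkConf().setAppName("rdd-immutability").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Ask for 4 partitions explicitly so the partitioning is visible.
      val numbers = sc.parallelize(1 to 10, numSlices = 4)
      val twice = doubled(numbers)

      println(numbers.getNumPartitions)          // number of partitions of the source RDD
      println(numbers.collect().mkString(","))   // the original elements, unchanged by map
      println(twice.collect().mkString(","))     // the elements of the new RDD
    } finally {
      sc.stop()
    }
  }
}
```

After `map`, `numbers` still holds 1 to 10; the doubled values exist only in the RDD returned by the transformation.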
abstract class RDD[T: ClassTag](
    @transient private var _sc: SparkContext,
    @transient private var deps: Seq[Dependency[_]]
  ) extends Serializable with Logging
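Abstract class: RDD must have concrete implementations; in practice you work with its subclasses.
Serializable: RDDs can be serialized, so serialization performance is a concern.
T: a type parameter (generic); an RDD can hold elements of many different types.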
SparkContext
@transient
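To see what `@transient` on a constructor field means for a `Serializable` class, here is a small sketch that uses plain Java serialization and no Spark at all. `DriverHandle` and `ToyDataset` are hypothetical stand-ins (for a driver-only object such as `SparkContext`, and for an RDD-like class); they are not Spark types.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for a driver-only resource; deliberately NOT Serializable.
class DriverHandle(val name: String)

// Hypothetical stand-in for an RDD-like class: the handle is marked @transient,
// so it is skipped when the object is serialized.
class ToyDataset(@transient private var handle: DriverHandle, val id: Int) extends Serializable {
  def hasHandle: Boolean = handle != null
}

object TransientDemo {
  // Serialize a value to bytes and read it back, as if it were shipped to another JVM.
  def roundTrip[A <: Serializable](value: A): A = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    try in.readObject().asInstanceOf[A] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val original = new ToyDataset(new DriverHandle("driver"), 7)
    val copy = roundTrip(original)
    println(original.hasHandle) // true: the original still has its handle
    println(copy.hasHandle)     // false: the transient field was not serialized
    println(copy.id)            // 7: ordinary fields survive the round trip
  }
}
```

Without `@transient`, `writeObject` would fail with a `NotSerializableException`, because `DriverHandle` is not serializable.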
The five main properties of an RDD:
Internally, each RDD is characterized by five main properties:
A list of partitions: an RDD is made up of many partitions.
A function for computing each split: an operation on an RDD is really applied to every partition of that RDD.
A list of dependencies on other RDDs: RDDs depend on one another, e.g. RDDA ==> RDDB ==> RDDC ==> RDDE.
Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file): data locality.
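The five properties map onto members that a custom RDD can override. The following is a minimal sketch, assuming Spark is on the classpath and a `local[2]` master; `RangePartition`, `ToyRangeRDD` and `ToyRangeDemo` are hypothetical names invented here, not part of Spark.

```scala
import org.apache.spark.{Partition, SparkConf, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition type: each partition covers the half-open range [start, end).
class RangePartition(val index: Int, val start: Int, val end: Int) extends Partition

// Hypothetical custom RDD that produces the integers [0, total) in `slices` partitions.
// Property 3: passing Nil means this RDD has no dependencies on other RDDs.
class ToyRangeRDD(sc: SparkContext, total: Int, slices: Int) extends RDD[Int](sc, Nil) {

  // Property 1: the list of partitions.
  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](slices) { i =>
      new RangePartition(i, i * total / slices, (i + 1) * total / slices)
    }

  // Property 2: the function that computes one split.
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val p = split.asInstanceOf[RangePartition]
    (p.start until p.end).iterator
  }

  // Property 4 (optional): `partitioner` keeps its default, None, since this is not a key-value RDD.
  // Property 5 (optional): preferred locations; this toy source has none.
  override protected def getPreferredLocations(split: Partition): Seq[String] = Nil
}

object ToyRangeDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("toy-rdd").setMaster("local[2]"))
    try {
      val source = new ToyRangeRDD(sc, total = 10, slices = 3)
      val evens = source.filter(_ % 2 == 0)

      println(source.partitions.length)                       // property 1
      println(evens.dependencies.head.rdd eq source)          // property 3: evens depends on source
      println(evens.partitioner)                              // property 4
      println(source.preferredLocations(source.partitions(0))) // property 5
      println(evens.collect().mkString(","))                  // property 2 runs once per partition
    } finally {
      sc.stop()
    }
  }
}
```

The scheduler only ever talks to the RDD through these members, which is why a new data source can be plugged in by subclassing `RDD` and overriding them.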