Apache Spark is an open-source cluster computing system that aims to make data analytics fast, both to run and to write.
When called on a dataset of (K, V) pairs, groupByKey([numTasks]) returns a dataset of (K, Iterable<V>) pairs.
Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, reduceByKey or combineByKey will usually perform much better, because they combine values within each partition before shuffling data across the network.
Note: By default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional numTasks argument to set a different number of tasks.
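As a rough illustration of groupByKey, the sketch below assumes a spark-shell session, where `sc` is already defined. The dataset and the names `scores`, `grouped` and `groupedMap` are made up for the example.

```scala
// Assumes spark-shell, where `sc` (the SparkContext) is predefined.
// The sample data is hypothetical.
val scores = sc.parallelize(List(("James", 1), ("Wade", 2), ("James", 3)))

// groupByKey gathers every value that shares a key into one Iterable.
val grouped = scores.groupByKey()            // RDD[(String, Iterable[Int])]

// Bring the result to the driver as a local Map; the values are sorted
// because the order inside each group is not guaranteed.
val groupedMap = grouped.mapValues(_.toList.sorted).collectAsMap()
// groupedMap("James") == List(1, 3)
// groupedMap("Wade")  == List(2)
```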
When called on a dataset of (K, V) pairs, reduceByKey(func, [numTasks]) returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V. As with groupByKey, the number of reduce tasks is configurable through an optional second argument.
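A minimal reduceByKey sketch, again assuming spark-shell with `sc` available. The data and variable names are hypothetical.

```scala
// Assumes spark-shell, where `sc` is predefined. Sample data is hypothetical.
val sales = sc.parallelize(List(("James", 1), ("Wade", 2), ("James", 3), ("Wade", 5)))

// Sum the values for each key; the function has type (Int, Int) => Int.
val totals = sales.reduceByKey(_ + _).collectAsMap()
// totals("James") == 4
// totals("Wade")  == 7

// The optional second argument sets the number of reduce tasks,
// which is also the number of partitions in the result.
val totalsInTwoTasks = sales.reduceByKey(_ + _, 2)
// totalsInTwoTasks.partitions.length == 2
```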
9. union(otherDataset)
Return a new dataset that contains the union of the elements in the source dataset and the argument.
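The following sketch shows union on two small RDDs, assuming spark-shell with `sc` available. Note that union keeps duplicates; calling distinct() afterwards removes them.

```scala
// Assumes spark-shell, where `sc` is predefined. Sample data is hypothetical.
val leftRdd = sc.parallelize(List(1, 2, 3))
val rightRdd = sc.parallelize(List(3, 4))

// union concatenates the two datasets without removing duplicates.
val combined = leftRdd.union(rightRdd).collect().sorted
// combined: Array(1, 2, 3, 3, 4)

// distinct() drops the repeated 3.
val deduplicated = leftRdd.union(rightRdd).distinct().collect().sorted
// deduplicated: Array(1, 2, 3, 4)
```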
10. join(otherDataset, [numTasks])
When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.
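val rddtest1 = sc.parallelize(List(("James", 1), ("Wade", 2), ("Paul", 3)))
val rddtest2 = sc.parallelize(List(("James", 4), ("Wade", 5)))
// Inner join: yields (James,(1,4)) and (Wade,(2,5)); "Paul" has no match in rddtest2 and is dropped.
rddtest1.join(rddtest2).collect()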
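To show how an outer join differs from the inner join, here is a small leftOuterJoin sketch. It assumes spark-shell with `sc` available and rebuilds the same sample data under the hypothetical names `players` and `points`.

```scala
// Assumes spark-shell, where `sc` is predefined.
val players = sc.parallelize(List(("James", 1), ("Wade", 2), ("Paul", 3)))
val points = sc.parallelize(List(("James", 4), ("Wade", 5)))

// leftOuterJoin keeps every key from the left side; the right-hand value
// is wrapped in an Option and is None when there is no match.
val leftJoined = players.leftOuterJoin(points).collectAsMap()
// leftJoined("James") == (1, Some(4))
// leftJoined("Wade")  == (2, Some(5))
// leftJoined("Paul")  == (3, None)
```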
When called on a dataset of (K, V) pairs where K implements Ordered, sortByKey([ascending], [numTasks]) returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.
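A short sortByKey sketch, assuming spark-shell with `sc` available; the data is hypothetical. String keys sort lexicographically.

```scala
// Assumes spark-shell, where `sc` is predefined. Sample data is hypothetical.
val unsorted = sc.parallelize(List(("Wade", 2), ("James", 1), ("Paul", 3)))

// Ascending order is the default.
val ascendingPairs = unsorted.sortByKey().collect()
// Array((James,1), (Paul,3), (Wade,2))

// Pass ascending = false for descending order.
val descendingPairs = unsorted.sortByKey(ascending = false).collect()
// Array((Wade,2), (Paul,3), (James,1))
```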
13. count()
Return the number of elements in the dataset.
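For example, assuming spark-shell with `sc` available, count() can be called after a filter. It is an action, so it triggers the computation and returns a Long to the driver.

```scala
// Assumes spark-shell, where `sc` is predefined.
val numbers = sc.parallelize(1 to 100)

// Keep the even numbers, then count them.
val evenCount = numbers.filter(_ % 2 == 0).count()
// evenCount == 50L
```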
14. collect()
Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
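In other words, collect() gathers the distributed RDD into a local Scala Array on a single machine, so the result must be small enough to fit in the driver's memory.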
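A minimal sketch of collect() after a filter, assuming spark-shell with `sc` available; the data is hypothetical. When the dataset may be large, take(n) is a safer way to look at a few elements.

```scala
// Assumes spark-shell, where `sc` is predefined. Sample data is hypothetical.
val words = sc.parallelize(List("spark", "rdd", "scala", "hdfs"))

// Filter down to a small subset first, then bring it to the driver.
val shortWords: Array[String] = words.filter(_.length <= 4).collect().sorted
// shortWords: Array(hdfs, rdd)

// take(n) returns at most n elements instead of the whole dataset.
val firstTwo = words.take(2)
// firstTwo.length == 2
```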
15. countByKey()
Only available on RDDs of type (K, V). Returns a hashmap of (K, Long) pairs with the count of each key; in the Scala API the result is a Map[K, Long] held on the driver.
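A small countByKey sketch, assuming spark-shell with `sc` available; the data is hypothetical. The values are ignored and only the occurrences of each key are counted.

```scala
// Assumes spark-shell, where `sc` is predefined. Sample data is hypothetical.
val visits = sc.parallelize(List(("James", 1), ("Wade", 2), ("James", 3)))

// countByKey is an action: it returns a local Map to the driver.
val keyCounts = visits.countByKey()          // scala.collection.Map[String, Long]
// keyCounts("James") == 2L
// keyCounts("Wade")  == 1L
```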
16. lookup(key: K)
Only available on RDDs of type (K, V). Returns the list of values in the RDD for the given key.
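As a sketch of lookup, assuming spark-shell with `sc` available and a hypothetical pair RDD: the call returns a Seq of every value stored under the key, or an empty Seq if the key is absent.

```scala
// Assumes spark-shell, where `sc` is predefined. Sample data is hypothetical.
val ages = sc.parallelize(List(("James", 1), ("Wade", 2), ("James", 3)))

// All values stored under "James", sorted for a stable view.
val jamesValues = ages.lookup("James").sorted
// jamesValues == Seq(1, 3)

// A key that does not occur gives an empty Seq rather than an error.
val paulValues = ages.lookup("Paul")
// paulValues.isEmpty == true
```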
17. reduce(func)
Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
val reduceRdd = sc.parallelize(List(1, 2, 3, 4, 5))
reduceRdd.reduce(_ + _)  // 15
18. saveAsTextFile(path)
Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.
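The sketch below writes a small RDD to a temporary local directory and reads it back. It assumes spark-shell with `sc` available; the directory name and data are hypothetical. The target directory must not already exist, which is why the example writes to a fresh sub-path.

```scala
import java.nio.file.Files

// Assumes spark-shell, where `sc` is predefined.
// Build a file: URI for a sub-directory that does not exist yet.
val outputDir = Files.createTempDirectory("spark-demo").resolve("out").toUri.toString

// Two partitions, so Spark writes part-00000 and part-00001 into outputDir.
val pairsToSave = sc.parallelize(List(("James", 1), ("Wade", 2)), 2)
pairsToSave.saveAsTextFile(outputDir)

// Each element was written with toString, one element per line.
val savedLines = sc.textFile(outputDir).collect().sorted
// savedLines: Array("(James,1)", "(Wade,2)")
```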