Getting to Know Spark
Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.
7. groupByKey([numTasks])
When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or combineByKey will yield much better performance.
Note: By default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional numTasks argument to set a different number of tasks.
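A minimal sketch, runnable in spark-shell where sc is the predefined SparkContext (the variable names and data here are illustrative only):

val groupRdd = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))
groupRdd.groupByKey().collect()
// Array((a,CompactBuffer(1, 3)), (b,CompactBuffer(2))), key order may vary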
8. reduceByKey(func, [numTasks])
When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
Merges the values of the same key using the function func.
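For comparison with groupByKey, a sketch on the same illustrative data; reduceByKey combines values per key before shuffling, which is why the note above recommends it for aggregations:

val sumRdd = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))
sumRdd.reduceByKey(_ + _).collect()
// Array((a,4), (b,2)), key order may vary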
9. union(otherDataset)
Return a new dataset that contains the union of the elements in the source dataset and the argument.
Merges two RDDs into one; the element types of the two RDDs must be the same.
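For illustration (the values are made up; note that union keeps duplicates rather than removing them):

val unionRdd1 = sc.parallelize(List(1, 2, 3))
val unionRdd2 = sc.parallelize(List(3, 4, 5))
unionRdd1.union(unionRdd2).collect()
// Array(1, 2, 3, 3, 4, 5)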
10. join(otherDataset, [numTasks])
When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.
val rddtest1 = sc.parallelize(List(("James", 1), ("Wade", 2), ("Paul", 3)))
val rddtest2 = sc.parallelize(List(("James", 4), ("Wade", 5)))
rddtest1.join(rddtest2).collect()
// Array((James,(1,4)), (Wade,(2,5))); "Paul" is dropped because join is an inner join
12. sortByKey([ascending], [numTasks])
When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.
Sorts by key, ascending by default; passing false as the ascending argument sorts in descending order.
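A small sketch reusing player data like the join example above:

val sortRdd = sc.parallelize(List(("Wade", 2), ("James", 1), ("Paul", 3)))
sortRdd.sortByKey().collect()       // Array((James,1), (Paul,3), (Wade,2))
sortRdd.sortByKey(false).collect()  // Array((Wade,2), (Paul,3), (James,1))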
13. count()
Return the number of elements in the dataset.
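For example:

sc.parallelize(List(1, 2, 3, 4, 5)).count()  // returns 5 (a Long)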
14. collect()
Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
Returns the distributed RDD as a single Scala Array on the driver; the result must be small enough to fit in the driver's memory.
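A typical filter-then-collect sketch (the predicate here is illustrative):

val filtered = sc.parallelize(1 to 100).filter(_ % 10 == 0)
filtered.collect()
// Array(10, 20, 30, 40, 50, 60, 70, 80, 90, 100)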
15. countByKey()
Only available on RDDs of type (K, V). Returns a hashmap of (K, Int) pairs with the count of each key.
For RDDs of key-value pairs; returns the number of elements for each key.
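For example, on the same illustrative data as the groupByKey sketch:

val countRdd = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))
countRdd.countByKey()
// Map(a -> 2, b -> 1), returned as a Map on the driver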
16. lookup(key: K)
For RDDs of key-value pairs; returns all values associated with the given key.
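For illustration (the data is made up):

val lookupRdd = sc.parallelize(List(("James", 1), ("James", 4), ("Wade", 2)))
lookupRdd.lookup("James")
// returns Seq(1, 4)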
17. reduce(func)
Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
Combines the elements of the RDD pairwise with func until a single result remains.
val reduceRdd = sc.parallelize(List(1,2,3,4,5))
reduceRdd.reduce(_ + _)
The result is 15, the sum of all the elements.
18. saveAsTextFile(path)
Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.
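A sketch; the output path below is hypothetical, and the target directory must not already exist:

val saveRdd = sc.parallelize(List(1, 2, 3, 4, 5))
saveRdd.saveAsTextFile("/tmp/spark-output")
// writes one part-NNNNN text file per partition into /tmp/spark-output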