Posted by langke93 on 2018-4-16 11:30:05

Spark Performance Tuning: reduceByKey vs. groupByKey

For example, the following two lines produce the same word counts:
val counts = pairs.reduceByKey(_ + _)
val counts = pairs.groupByKey().map(wordcounts => (wordcounts._1, wordcounts._2.sum))
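
For completeness, here is a minimal runnable sketch of both versions. The SparkContext setup, the object name ReduceVsGroup, and the sample data are assumptions added for illustration; they are not from the original post.

import org.apache.spark.{SparkConf, SparkContext}

object ReduceVsGroup {
  def main(args: Array[String]): Unit = {
    // Assumed local setup for experimentation (not from the original post)
    val conf = new SparkConf().setAppName("ReduceVsGroup").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Hypothetical sample data standing in for `pairs`
    val pairs = sc.parallelize(Seq(("hello", 1), ("world", 1), ("hello", 1)))

    // reduceByKey: values are combined locally on each partition before the shuffle
    val countsReduce = pairs.reduceByKey(_ + _)

    // groupByKey: every (word, 1) pair is shuffled, then summed on the reduce side
    val countsGroup = pairs.groupByKey().map(wc => (wc._1, wc._2.sum))

    countsReduce.collect().foreach(println) // (hello,2), (world,1)
    countsGroup.collect().foreach(println)  // same result, more shuffle traffic

    sc.stop()
  }
}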

Whenever reduceByKey can express the computation, prefer it over groupByKey: reduceByKey first runs a local combine on the map side, which can dramatically reduce the amount of data shuffled to the reduce side and therefore the network overhead. Fall back to groupByKey().map() only when the aggregation cannot be expressed with reduceByKey, as in the sketch below.
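
One hypothetical case where reduceByKey does not fit (an illustration added here, not from the original post): a per-key median needs every value for the key at once, so it cannot be written as an associative, pairwise reduce.

// Assumes the same pairs: RDD[(String, Int)] as in the example above
val medians = pairs.groupByKey().map { case (key, values) =>
  val sorted = values.toSeq.sorted
  (key, sorted(sorted.size / 2)) // middle element of the sorted values
}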

The diagram below (image not preserved in this archive) illustrates val counts = pairs.groupByKey().map(wordcounts => (wordcounts._1, wordcounts._2.sum)). In this version, every (word, 1) pair is shuffled across the network to the partition that owns its key; no map-side combining takes place, so the entire dataset crosses the network before the values are summed.

The diagram below (image not preserved in this archive) illustrates val counts = pairs.reduceByKey(_ + _). Here Spark first applies the reduce function as a local combiner within each map-side partition, so at most one partially aggregated (word, count) pair per key per partition is shuffled; the reduce side then merges these partial counts into the final result.
