Posted by linux_oracle on 2020-12-4 14:39:32

Summary of Spark Operators

This post was last edited by linux_oracle on 2020-12-4 15:52.

I. Creating RDDs

1. Creating from a collection

1.1 parallelize

scala> var rdd = sc.parallelize(1 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:21
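parallelize also accepts an optional second argument that sets the number of partitions for the resulting RDD. A minimal sketch, assuming the same spark-shell session with sc available:

scala> val rdd2 = sc.parallelize(1 to 10, 3)   // explicitly request 3 partitions
scala> rdd2.partitions.size
res0: Int = 3

If the partition count is omitted, Spark falls back to the default parallelism of the cluster (spark.default.parallelism).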
1.2 makeRDD

scala> var collect = Seq((1 to 10, Seq("slave007.lxw1234.com","slave002.lxw1234.com")),
     | (11 to 15, Seq("slave013.lxw1234.com","slave015.lxw1234.com")))

scala> var rdd = sc.makeRDD(collect)
rdd: org.apache.spark.rdd.RDD[scala.collection.immutable.Range.Inclusive] = ParallelCollectionRDD[1] at makeRDD at <console>:23
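Note what this overload of makeRDD does: each element of the input Seq is a pair of (data, preferred hosts), so Spark records the given hostnames as the preferred locations of the corresponding partition. This can be checked in the same session (the hosts shown are the ones supplied above):

scala> rdd.preferredLocations(rdd.partitions(0))   // hosts given for the first element
res0: Seq[String] = List(slave007.lxw1234.com, slave002.lxw1234.com)

The one-argument form makeRDD(seq) behaves the same as parallelize and records no location preferences.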

2. Creating RDDs from external storage
2.1 textFile
scala> var rdd = sc.textFile("hdfs:///tmp/lxw1234/1.txt")

2.2 Creating from other HDFS file formats

hadoopFile, sequenceFile, objectFile, newAPIHadoopFile
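These operators read other Hadoop storage formats. A brief sketch of typical usage (the paths are illustrative, assuming a spark-shell session with sc and HDFS access):

scala> // objectFile round-trips an RDD saved with saveAsObjectFile
scala> sc.parallelize(1 to 10).saveAsObjectFile("hdfs:///tmp/lxw1234/obj")
scala> val objs = sc.objectFile[Int]("hdfs:///tmp/lxw1234/obj")

scala> // sequenceFile reads Hadoop SequenceFiles as key/value pairs
scala> val pairs = sc.sequenceFile[String, Int]("hdfs:///tmp/lxw1234/seq")

scala> // hadoopFile / newAPIHadoopFile take explicit InputFormat and key/value classes
scala> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
scala> import org.apache.hadoop.io.{LongWritable, Text}
scala> val hf = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///tmp/lxw1234/1.txt")

hadoopFile uses the old org.apache.hadoop.mapred API, while newAPIHadoopFile uses the newer org.apache.hadoop.mapreduce API; otherwise they serve the same purpose.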

