Last edited by linux_oracle on 2020-12-4 15:52
一. Creating RDDs
1. Creating an RDD from a collection
1.1 parallelize
scala> var rdd = sc.parallelize(1 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:21
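parallelize also accepts an optional second argument, numSlices, which sets the number of partitions the collection is split into (it defaults to spark.default.parallelism). A minimal spark-shell sketch; the RDD id and <console> line number will differ in your session:

scala> var rdd = sc.parallelize(1 to 10, 4)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:21

scala> rdd.partitions.size
res0: Int = 4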
1.2 makeRDD
scala> var collect = Seq((1 to 10, Seq("slave007.lxw1234.com","slave002.lxw1234.com")),
     | (11 to 15, Seq("slave013.lxw1234.com","slave015.lxw1234.com")))

scala> var rdd = sc.makeRDD(collect)
rdd: org.apache.spark.rdd.RDD[scala.collection.immutable.Range.Inclusive] = ParallelCollectionRDD[6] at makeRDD at <console>:23
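The overload of makeRDD used above takes a Seq[(T, Seq[String])]: each element's second component is the list of preferred hosts for that element's partition, which can be inspected with preferredLocations. A sketch continuing the session above (the resN numbers are illustrative):

scala> rdd.partitions.size
res1: Int = 2

scala> rdd.preferredLocations(rdd.partitions(0))
res2: Seq[String] = List(slave007.lxw1234.com, slave002.lxw1234.com)

This is the main difference from parallelize: the one-argument form of makeRDD behaves just like parallelize, while this form additionally records data-locality hints for the scheduler.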
2. Creating an RDD from external storage
2.1 textFile
scala> var rdd = sc.textFile("hdfs:///tmp/lxw1234/1.txt")
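textFile also takes an optional minPartitions argument, and the path may be a directory, a glob pattern, or a compressed file. A sketch reusing the HDFS path above (output omitted, since it depends on the file's contents):

scala> var rdd = sc.textFile("hdfs:///tmp/lxw1234/1.txt", 4)
rdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21

scala> rdd.count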
2.2 从其他HDFS文件格式创建
hadoopFile, sequenceFile, objectFile, newAPIHadoopFile
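Of these, objectFile is the simplest round trip: saveAsObjectFile writes an RDD as serialized objects, and sc.objectFile reads it back. A sketch, assuming a writable HDFS directory (the /tmp/lxw1234/obj path here is hypothetical, modeled on the path style above):

scala> sc.parallelize(1 to 10).saveAsObjectFile("hdfs:///tmp/lxw1234/obj")

scala> var rdd = sc.objectFile[Int]("hdfs:///tmp/lxw1234/obj")
rdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[3] at objectFile at <console>:21

sequenceFile, hadoopFile, and newAPIHadoopFile work analogously for Hadoop SequenceFiles and arbitrary InputFormats (the latter two take the InputFormat plus key and value classes as type or value parameters).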