Aggregate using the provided functions. When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral “zero” value. Allows an aggregated value type that is different than the input value type, while avoiding unnecessary allocations. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
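For illustration only, a minimal Scala sketch (assumed example, not from the original post; `sc` is an existing SparkContext): computing a per-key average with aggregateByKey, where the aggregated type (sum, count) differs from the input value type.

// Assumed example: average per key with aggregateByKey.
// zeroValue = (0, 0); seqOp folds one value into the (sum, count) accumulator;
// combOp merges accumulators produced on different partitions.
val pairs = sc.parallelize(Seq(("a", 1), ("a", 3), ("b", 5)))

val sumCount = pairs.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),
  (x, y)   => (x._1 + y._1, x._2 + y._2)
)

val avgByKey = sumCount.mapValues { case (sum, count) => sum.toDouble / count }
avgByKey.collect().foreach(println)   // (a,2.0), (b,5.0)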
cartesian(otherDataset)
Cartesian product. When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements).
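A small Scala sketch of cartesian (assumed example; `sc` is an existing SparkContext):

val letters = sc.parallelize(Seq("x", "y"))    // RDD[String], type T
val numbers = sc.parallelize(Seq(1, 2, 3))     // RDD[Int], type U
val crossed = letters.cartesian(numbers)       // RDD[(String, Int)], 2 * 3 = 6 pairs
crossed.collect().foreach(println)             // (x,1), (x,2), (x,3), (y,1), (y,2), (y,3)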
pipe(command,[envVars])
Pipe each partition of the RDD through a shell command, e.g. a Perl or bash script. RDD elements are written to the process’s stdin and lines output to its stdout are returned as an RDD of strings.
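A minimal Scala sketch (assumed example; `sc` is an existing SparkContext and the `tr` command is assumed to be on the worker PATH):

val words = sc.parallelize(Seq("spark", "tachyon", "hadoop"))
// Each element is written to tr's stdin as a line; each stdout line
// comes back as one String element of the result RDD.
val upper = words.pipe(Seq("tr", "a-z", "A-Z"))
upper.collect().foreach(println)   // SPARK, TACHYON, HADOOP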
coalesce(numPartitions)
Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.
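A minimal sketch (assumed example) of the typical pattern: shrinking the partition count after a selective filter so the small remaining dataset is not spread across many near-empty tasks.

val big      = sc.parallelize(1 to 1000000, 100)   // 100 partitions
val filtered = big.filter(_ % 1000 == 0)           // only ~1000 elements remain
val compact  = filtered.coalesce(4)                // 100 -> 4 partitions, no full shuffle
println(compact.partitions.length)                 // 4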
repartition(numPartitions)
Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.
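A minimal sketch (assumed example); because repartition always shuffles, it can increase the partition count as well as decrease it:

val skewed     = sc.parallelize(1 to 10000, 2)   // 2 partitions
val rebalanced = skewed.repartition(8)           // full shuffle, 2 -> 8 partitions
println(rebalanced.partitions.length)            // 8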
Store RDD in serialized format in Tachyon. Compared to MEMORY_ONLY_SER, OFF_HEAP reduces garbage collection overhead and allows executors to be smaller and to share a pool of memory, making it attractive in environments with large heaps or multiple concurrent applications. Furthermore, as the RDDs reside in Tachyon, the crash of an executor does not lead to losing the in-memory cache. In this mode, the memory in Tachyon is discardable. Thus, Tachyon does not attempt to reconstruct a block that it evicts from memory.
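A minimal sketch (assumed example) of persisting off-heap; note that in the Spark 1.x versions this text describes, OFF_HEAP additionally requires a running Tachyon deployment that the cluster is configured to use.

import org.apache.spark.storage.StorageLevel

val data = sc.textFile("hdfs:///path/to/input")   // hypothetical input path
data.persist(StorageLevel.OFF_HEAP)               // blocks stored serialized in Tachyon
println(data.count())                              // first action materializes the cache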
15/01/08 15:19:58 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkDriver@n5.china.com:20545]
15/01/08 15:19:58 INFO Utils: Successfully started service 'sparkDriver' on port 20545.
15/01/08 15:19:58 INFO SparkEnv: Registering MapOutputTracker
15/01/08 15:19:58 INFO SparkEnv: Registering BlockManagerMaster
15/01/08 15:19:58 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20150108151958-cadb
15/01/08 15:19:58 INFO Utils: Successfully started service 'Connection manager for block manager' on port 63094.
15/01/08 15:19:58 INFO ConnectionManager: Bound socket to port 63094 with id = ConnectionManagerId(n5.china.com,63094)
15/01/08 15:19:58 INFO MemoryStore: MemoryStore started with capacity 265.4 MB
15/01/08 15:19:58 INFO BlockManagerMaster: Trying to register BlockManager
15/01/08 15:19:58 INFO BlockManagerMasterActor: Registering block manager n5.china.com:63094 with 265.4 MB RAM
15/01/08 15:19:58 INFO BlockManagerMaster: Registered BlockManager
15/01/08 15:19:58 INFO HttpFileServer: HTTP File server directory is /tmp/spark-3b796d21-db73-4aad-84bc-23cd0ce3fab7
15/01/08 15:19:58 INFO HttpServer: Starting HTTP Server
15/01/08 15:19:58 INFO Utils: Successfully started service 'HTTP file server' on port 38860.
15/01/08 15:19:58 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/01/08 15:19:58 INFO SparkUI: Started SparkUI at http://n5.china.com:4040
15/01/08 15:19:59 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/01/08 15:19:59 INFO SparkContext: Added JAR file:/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/jars/spark-examples-1.1.0-cdh5.2.0-hadoop2.5.0-cdh5.2.0.jar at http://10.1.0.182:38860/jars/spa ... p2.5.0-cdh5.2.0.jar with timestamp 1420701599601
15/01/08 15:19:59 INFO AppClient$ClientActor: Connecting to master spark://10.1.0.141:7077...
15/01/08 15:19:59 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
15/01/08 15:19:59 INFO SparkContext: Starting job: reduce at SparkPi.scala:35
15/01/08 15:19:59 INFO DAGScheduler: Got job 0 (reduce at SparkPi.scala:35) with 1000 output partitions (allowLocal=false)
15/01/08 15:19:59 INFO DAGScheduler: Final stage: Stage 0(reduce at SparkPi.scala:35)
15/01/08 15:19:59 INFO DAGScheduler: Parents of final stage: List()
15/01/08 15:19:59 INFO DAGScheduler: Missing parents: List()
15/01/08 15:19:59 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[1] at map at SparkPi.scala:31), which has no missing parents
15/01/08 15:20:00 INFO MemoryStore: ensureFreeSpace(1728) called with curMem=0, maxMem=278302556
15/01/08 15:20:00 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1728.0 B, free 265.4 MB)
15/01/08 15:20:00 INFO MemoryStore: ensureFreeSpace(1125) called with curMem=1728, maxMem=278302556
15/01/08 15:20:00 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1125.0 B, free 265.4 MB)
15/01/08 15:20:00 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on n5.china.com:63094 (size: 1125.0 B, free: 265.4 MB)
15/01/08 15:20:00 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
15/01/08 15:20:00 INFO DAGScheduler: Submitting 1000 missing tasks from Stage 0 (MappedRDD[1] at map at SparkPi.scala:31)
15/01/08 15:20:00 INFO TaskSchedulerImpl: Adding task set 0.0 with 1000 tasks
15/01/08 15:20:15 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
15/01/08 15:20:19 INFO AppClient$ClientActor: Connecting to master spark://10.1.0.141:7077...
15/01/08 15:20:30 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
15/01/08 15:20:39 INFO AppClient$ClientActor: Connecting to master spark://10.1.0.141:7077...
15/01/08 15:20:45 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
15/01/08 15:20:59 ERROR SparkDeploySchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
15/01/08 15:20:59 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/01/08 15:20:59 INFO TaskSchedulerImpl: Cancelling stage 0
15/01/08 15:20:59 INFO DAGScheduler: Failed to run reduce at SparkPi.scala:35
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: All masters are unresponsive! Giving up.
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
[root@n5 bin]#