Today I deployed Spark on YARN through Cloudera. When I checked Spark's status, I found that all the Gateway roles were grey and showed a status of "Not applicable". I searched online, but nobody seemed to say clearly whether this means the installation succeeded or failed:
I tried launching spark-shell, and it starts fine.
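As a quick way to confirm the install actually works end to end (not just that the shell starts), a trivial job can be run inside spark-shell. A minimal sketch, assuming the shell is launched with --master yarn-client:
[mw_shl_code=scala,true]// inside spark-shell (launched with: spark-shell --master yarn-client)
// if YARN executors register and run tasks, this returns 1000
sc.parallelize(1 to 1000).map(_ * 2).count()[/mw_shl_code]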
So I installed IDEA with the Scala plugin, installed Scala 2.10.6 on Windows, and wrote a WordCount program to submit from inside IDEA:
[mw_shl_code=scala,true]import org.apache.spark.{SparkConf, SparkContext}

object test {
  def main(args: Array[String]): Unit = {
    // run as the hdfs user so the job can read the input path on HDFS
    System.setProperty("HADOOP_USER_NAME", "hdfs")

    val conf = new SparkConf().setMaster("yarn-client").setAppName("spark_test")
    // ship the application jar built by IDEA to the cluster
    conf.setJars(List("F:\\workspace\\hadoop-mr\\TestMr\\out\\artifacts\\sparktest_jar\\sparktest.jar"))
    // point executors at the Spark assembly already uploaded to HDFS
    conf.set("spark.yarn.jar", "hdfs://hadoopnamenode1:8020/user/spark/sparkjar/spark-assembly-1.6.0-cdh5.8.2-hadoop2.6.0-cdh5.8.2.jar")

    val sc = new SparkContext(conf)
    val lines = sc.textFile("hdfs://hadoopnamenode1:8020/DCBD/data/ic/201609", 1)
    val words = lines.flatMap(line => line.split(" "))
    val pairs = words.map(word => (word, 1))
    val wordCounts = pairs.reduceByKey(_ + _)
    wordCounts.foreach(wordCount => println(wordCount._1 + " appeared " + wordCount._2 + " times."))
  }
}[/mw_shl_code]
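One caveat about the program itself: in yarn-client mode the closure passed to wordCounts.foreach runs on the executors, so its println output lands in the executor logs rather than the IDEA console. A minimal sketch of bringing the results back to the driver first (fine here only because a word count result is small):
[mw_shl_code=scala,true]// collect() pulls the results to the driver, so println is visible locally
wordCounts.collect().foreach { case (word, count) =>
  println(word + " appeared " + count + " times.")
}[/mw_shl_code]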
When I ran it, the logs showed that Spark treated my local Windows machine as the driver:
INFO util.Utils (Logging.scala:logInfo(58)) - Successfully started service 'sparkDriver' on port 58400.
But after running for a few seconds, it started printing errors repeatedly:
2016-11-17 15:19:16,088 INFO cluster.YarnClientSchedulerBackend (Logging.scala:logInfo(58)) - Registered executor NettyRpcEndpointRef(null) (HadoopDatanode3:59789) with ID 2
2016-11-17 15:19:16,138 WARN net.ScriptBasedMapping (ScriptBasedMapping.java:runResolveCommand(254)) - Exception running /etc/hadoop/conf.cloudera.yarn/topology.py 172.16.1.145
java.io.IOException: Cannot run program "/etc/hadoop/conf.cloudera.yarn/topology.py" (in directory "F:\workspace\spark\sparktest"): CreateProcess error=2, The system cannot find the file specified.
Clearly the error is thrown while an executor is being registered: running "/etc/hadoop/conf.cloudera.yarn/topology.py" for 172.16.1.145 fails. But that path definitely exists on 172.16.1.145, and the hdfs user can run the script manually, so it should not be a permissions problem (every user has read and execute permission on it).
The log also says in directory "F:\workspace\spark\sparktest", which is the path of my IDEA project. My suspicion: the program is looking for the file it needs to execute inside my project directory, where of course it will never be found. I checked the source of ProcessBuilder.java:
[mw_shl_code=java,true]throw new IOException(
        "Cannot run program \"" + prog + "\""
            + (dir == null ? "" : " (in directory \"" + dir + "\")")
            + exceptionInfo,
        cause);[/mw_shl_code]
[mw_shl_code=java,true]/**
 * Returns this process builder's working directory.
 *
 * Subprocesses subsequently started by this object's {@link
 * #start()} method will use this as their working directory.
 * The returned value may be {@code null} -- this means to use
 * the working directory of the current Java process, usually the
 * directory named by the system property {@code user.dir},
 * as the working directory of the child process.
 *
 * @return this process builder's working directory
 */
public File directory() {
    return directory;
}[/mw_shl_code]
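A tiny sketch of the behaviour this javadoc describes: when no working directory has been set, directory() returns null and the child process inherits the JVM's user.dir, which under IDEA is the project root:
[mw_shl_code=scala,true]object WorkingDirDemo {
  def main(args: Array[String]): Unit = {
    val pb = new java.lang.ProcessBuilder("hostname")
    println(pb.directory())                  // null => child inherits user.dir
    println(System.getProperty("user.dir"))  // under IDEA: the project root, e.g. F:\workspace\spark\sparktest
  }
}[/mw_shl_code]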
I tried calling System.setProperty to change user.dir, but then the program fails right at startup, complaining that the path is illegal.
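Rather than touching user.dir, one direction that might work (a hedged sketch, not verified on this CDH 5.8.2 setup) is to stop the driver-side Hadoop client from invoking the topology script at all: the rack-resolution script named by net.topology.script.file.name is executed on the driver, and spark.hadoop.* properties are copied into the Hadoop Configuration that Spark builds there, so blanking the property should make ScriptBasedMapping fall back to the default rack:
[mw_shl_code=scala,true]val conf = new SparkConf()
  .setMaster("yarn-client")
  .setAppName("spark_test")
  // assumption: a blank value makes the driver skip topology.py and
  // assign the default rack to every registering executor
  .set("spark.hadoop.net.topology.script.file.name", "")[/mw_shl_code]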
To sum up: for those of you who also develop Spark in Scala under IDEA and submit jobs to the cluster straight from it, how do you do the submission?