My problem is not with reading: after processing the data in Hive I save it as a CSV file. The saved file displays fine when I open it in IDEA, but it is garbled when opened in Excel.
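(Note on the symptom above: a file that looks fine in IDEA but garbled in Excel is usually valid UTF-8 that Excel decodes with the local ANSI code page, GBK on Chinese Windows, because there is no BOM. Below is a minimal, hypothetical sketch of the write side, assuming Spark 3.x where the CSV writer honors the `encoding` option; the Hive table name and output path are placeholders, not the asker's actual job.)

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-encoding-demo").getOrCreate()
// Placeholder: read whatever result was produced by the Hive processing step.
val df = spark.table("some_hive_table")

df.coalesce(1)                       // single output file, easier to open in Excel
  .write
  .option("header", "true")
  .option("encoding", "GBK")         // write GBK so Excel's ANSI fallback decodes it correctly
  .mode("overwrite")
  .csv("E:\\newcode\\MyFirstProject\\data\\output_gbk")
```

An alternative that keeps the data in UTF-8 is to prepend a UTF-8 BOM to the generated file, which also lets Excel detect the encoding correctly.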
spark.read.option("header","true").csv(path)读取CSV文件或spark.read.textFile(path)读取文本文件,由于这2个方法默认是UTF-8编码,如果源数据带有GBK或GB2312的编码,就会出现Spark读取文本的中文乱码。 如果读取的是文本文件,可以用下面的方法来解决 import org.apache.hadoop.io.{LongWritable, Text} import org.apache.hadoop.mapred.TextInputFormat val path = "E:\\newcode\\MyFirstProject\\data\\stockearn" val inputRdd = spark.sparkContext.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text]).map( pair => new String(pair._2.getBytes, 0, pair._2.getLength, "GBK")) 更多可参考 def readCSV(spark:SparkSession,headerSchema:String,mySchema: ArrayBuffer[String],code:String,file:String) ={ val rddArr:RDD[Array[String]] = spark.sparkContext.hadoopFile(file, classOf[TextInputFormat], classOf[LongWritable], classOf[Text]).map( pair => new String(pair._2.getBytes, 0, pair._2.getLength, code)) //处理同一个单元格 同时出现 引号 逗号串列问题 切割 .map(_.trim.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)",-1)) val fieldArr = rddArr.first() //Row.fromSeq(_) 如果只是 map(Row(_)),会导致 spark.createDataFrame(rddRow,schema)错误 val rddRow = rddArr.filter(!_.reduce(_+_).equals(fieldArr.reduce(_+_))).map(Row.fromSeq(_)) val schemaList = ArrayBuffer[StructField]() if("TRUE".equals(headerSchema)){ for(i <- 0 until fieldArr.length){ println("fieldArr(i)=" + fieldArr(i)) schemaList.append(StructField(mySchema(i),DataTypes.StringType)) } }else{ for(i <- 0 until fieldArr.length){ schemaList.append(StructField(s"_c$i",DataTypes.StringType)) println("fieldArr(i)=" + fieldArr(i)) } } val schema = StructType(schemaList) spark.createDataFrame(rddRow,schema) } |