liangxiaoxia 2017-06-19 06:20 采纳率: 0%
浏览 4237

spark sql中处理json嵌套数组的方法

各位大神刚开始学spark sql想处理json数据,一般的json数据没问题,但是当json串中有json嵌套数组时,就不太清楚怎样获取这个数据里每一项的数据,请各位指点。格式如
{"name":"Yin","address":[{"city":"Columbus","state":"防守打法"},{"city":"Columbus2","state":"防守打法"}]}我想获取address中的每一个数据项,应该怎么弄比较好?谢谢!

  • 写回答

2条回答 默认 最新

  • liangxiaoxia 2017-06-19 06:59
    关注

    别沉了,补个代码
    val conf =new SparkConf().setAppName("test2").setMaster("local")
    val sc =new SparkContext(conf)
    var sqlContext= new SQLContext(sc)

        var anotherPeopleRDD=sc.parallelize("""{"name":"Yin","address":[{"city":"Columbus","state":"http://www.tom.com"},{"city":"Columbus2","state":"http://www.tom.com.cn"}]}"""::Nil)
    

    sqlContext.udf.register("testaddress",(arr:Array[Row])=>{
    })

    val anotherPeople = sqlContext.jsonRDD(anotherPeopleRDD)
    anotherPeople.printSchema
    anotherPeople.registerTempTable("people")
    val pairs = sqlContext.sql("select testaddress(address) from people ")
    pairs.collect()

        我想用udf 但是总报错:
        java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to [Lorg.apache.spark.sql.Row;
    at com.dwnews.DMP.ETL.service.FileFilterService$$anonfun$test2$1.apply(FileFilterService.scala:100)
    at org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:75)
    at org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:74)
    at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:964)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown Source)
    at org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$2.apply(basicOperators.scala:55)
    at org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$2.apply(basicOperators.scala:53)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at scala.collection.AbstractIterator.to(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:909)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:909)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
    
    评论

报告相同问题?

悬赏问题

  • ¥15 python:excel数据写入多个对应word文档
  • ¥60 全一数分解素因子和素数循环节位数
  • ¥15 ffmpeg如何安装到虚拟环境
  • ¥188 寻找能做王者评分提取的
  • ¥15 matlab用simulink求解一个二阶微分方程,要求截图
  • ¥30 乘子法解约束最优化问题的matlab代码文件,最好有matlab代码文件
  • ¥15 写论文,需要数据支撑
  • ¥15 identifier of an instance of 类 was altered from xx to xx错误
  • ¥100 反编译微信小游戏求指导
  • ¥15 docker模式webrtc-streamer 无法播放公网rtsp