Hello, I need to do a time-series aggregation in Spark. Specifically, I have a DataFrame of car GPS data already sorted by vehicle id and time, and I need to split the rows where the `col_sev` column equals 1 into separate trips. The data before processing looks like this:
val taxiRaw = spark.sparkContext.textFile("E:/demo_20180821.dat")
import spark.implicits._
val safeParse = Parse.safe(Parse.parseRecords)
val taxiParsed = taxiRaw.map(safeParse)
val taxiGood = taxiParsed.map(_.left.get).toDS
..... various data-cleaning steps
val taxiClean = ....toDF()
taxiClean.show(100)
+-------+-------------------+---------+----------+-------+
|col_car| col_dttim| col_lat| col_lon|col_sev|
+-------+-------------------+---------+----------+-------+
| ID1|2018-08-21 09:29:57| 39.88922| 116.347| 0|
| ID1|2018-08-21 09:30:59| 39.88968|116.346998| 0|
| ID1|2018-08-21 09:31:30| 39.89037|116.346978| 1|
| ID1|2018-08-21 09:31:39|39.890758|116.346947| 1|
| ID1|2018-08-21 09:33:05|39.895908| 116.34676| 1|
| ID1|2018-08-21 09:33:45|39.896063|116.346745| 1|
| ID1|2018-08-21 09:34:05| 39.89609|116.346735| 1|
| ID1|2018-08-21 09:34:31|39.896453| 116.34672| 1|
| ID1|2018-08-21 09:35:41|39.897587|116.346692| 1|
| ID1|2018-08-21 09:37:15|39.898068|116.346638| 1|
| ID1|2018-08-21 09:37:35|39.898123|116.346603| 1|
| ID1|2018-08-21 09:38:35|39.898462|116.346615| 1|
| ID1|2018-08-21 09:38:56|39.898408|116.346615| 1|
| ID1|2018-08-21 09:39:05|39.898382|116.346622| 1|
| ID1|2018-08-21 09:39:48|39.898593| 116.34664| 1|
| ID1|2018-08-21 09:40:18|39.899062|116.346658| 1|
| ID1|2018-08-21 09:40:28|39.899055|116.346662| 1|
| ID1|2018-08-21 09:40:48|39.899097|116.346635| 1|
| ID1|2018-08-21 09:41:20|39.899847|116.346443| 1|
| ID1|2018-08-21 09:44:40| 39.89988|116.345462| 1|
| ID1|2018-08-21 09:49:02|39.901228|116.343818| 1|
| ID1|2018-08-21 09:52:07| 39.90414|116.337148| 1|
| ID1|2018-08-21 09:52:59|39.905652|116.337548| 1|
| ID1|2018-08-21 09:56:58| 39.91248|116.339273| 1|
| ID1|2018-08-21 09:58:11|39.912655|116.342495| 0|
| ID1|2018-08-21 09:58:38|39.912698|116.343038| 0|
| ID1|2018-08-21 09:59:12|39.914267|116.343198| 0|
| ID1|2018-08-21 10:00:40|39.917063|116.342582| 0|
| ID1|2018-08-21 10:01:12| 39.91744|116.341958| 0|
After this there are many more vehicles and different service states.

I need to reorganize it into:
+-------+------+-------------------+---------+----------+-------------------+---------+----------+
|col_car|tripID|          Starttime|start_lat| start_lon|            Endtime|  end_lat|   end_lon|
+-------+------+-------------------+---------+----------+-------------------+---------+----------+
|    ID1|     2|2018-08-21 09:31:30| 39.89037|116.346978|2018-08-21 09:56:58| 39.91248|116.339273|
+-------+------+-------------------+---------+----------+-------------------+---------+----------+
Spark doesn't seem to provide an operator for this kind of variable-length merging, and groupBy can't do it either, because there is no key to group on.
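This is the classic "gaps and islands" problem, and Spark's window functions can handle it without a pre-existing key: flag every row whose `col_sev` differs from the previous row of the same vehicle, take a running sum of the flags to get a per-vehicle segment id, then groupBy that id. A minimal sketch — the local SparkSession and the inlined sample rows are stand-ins for your `taxiClean`; column names follow the output above:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("trip-split").getOrCreate()
import spark.implicits._

// A few sample rows in the shape of taxiClean (abbreviated from the question).
val taxiClean = Seq(
  ("ID1", "2018-08-21 09:29:57", 39.88922,  116.347,    0),
  ("ID1", "2018-08-21 09:30:59", 39.88968,  116.346998, 0),
  ("ID1", "2018-08-21 09:31:30", 39.89037,  116.346978, 1),
  ("ID1", "2018-08-21 09:56:58", 39.91248,  116.339273, 1),
  ("ID1", "2018-08-21 09:58:11", 39.912655, 116.342495, 0)
).toDF("col_car", "col_dttim", "col_lat", "col_lon", "col_sev")

// Per-vehicle window, ordered by time ("yyyy-MM-dd HH:mm:ss" strings sort
// chronologically; cast to timestamp if your column has a different type).
val byCar = Window.partitionBy($"col_car").orderBy($"col_dttim")

// Flag the first row of every run: col_sev differs from the previous row.
val flagged = taxiClean.withColumn("changed",
  when(lag($"col_sev", 1).over(byCar).isNull ||
       lag($"col_sev", 1).over(byCar) =!= $"col_sev", 1).otherwise(0))

// A running sum of the flags numbers the runs: this is the per-vehicle trip id.
val segmented = flagged.withColumn("tripID", sum($"changed").over(byCar))

// Keep only occupied segments and pull the first/last row per trip with the
// min/max-of-struct trick (struct ordering is lexicographic, time field first).
val trips = segmented.filter($"col_sev" === 1)
  .groupBy($"col_car", $"tripID")
  .agg(
    min(struct($"col_dttim", $"col_lat", $"col_lon")).as("s"),
    max(struct($"col_dttim", $"col_lat", $"col_lon")).as("e"))
  .select($"col_car", $"tripID",
    $"s.col_dttim".as("Starttime"), $"s.col_lat".as("start_lat"), $"s.col_lon".as("start_lon"),
    $"e.col_dttim".as("Endtime"),   $"e.col_lat".as("end_lat"),   $"e.col_lon".as("end_lon"))

trips.show(false)
```

On the sample above, the running sum assigns the idle run `tripID = 1`, the occupied run `tripID = 2` (starting 09:31:30, ending 09:56:58), and the next idle run `tripID = 3`, so the occupied trip matches the desired output row. If you want trips numbered 1, 2, 3, … counting only occupied runs, divide by two: `(tripID + 1) / 2` after the filter.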