假如我有两个hive表,结构分别为
| id| title|average_sentiment|
| id| title|price| average_sentiment|
,我该如何使用pyspark,执行以下逻辑?
若表一中有一行id1,表二中没有相应行id2=id1,取表一id1哪行。
若表二中有一行id2,表一中没有相应行id1=id2,取表二id2哪行。
若表一中有一行id1,表二中存在id2=id1,那一行的average_sentiment字段和再除2
在spark中编写代码以及结果如下:
相关代码:
sqlstr = ("SELECT COALESCE(t1.id, t2.id) AS id,"
" COALESCE(t1.title, t2.title) AS title, "
"CASE WHEN t1.id IS NOT NULL AND t2.id IS NOT NULL "
"THEN (t1.average_sentiment + t2.average_sentiment) / 2 "
"WHEN t1.id IS NOT NULL"
" THEN t1.average_sentiment "
"ELSE t2.average_sentiment END AS average_sentiment"
" FROM cjw_data.qvnasentiment t1 "
"FULL OUTER JOIN cjw_data.xiechengsentiment t2 "
"ON t1.id = t2.id;")
# sqlstr = ("SELECT * FROM cjw_data.qvnasentiment LIMIT 5;")
df = spark.sql(sqlstr).limit(5)
df.show(5)
print("共有",df.count(), "行数据")
df1 = spark.table("cjw_data.qvnasentiment")
df2 = spark.table("cjw_data.xiechengsentiment")
df1 = df1.drop("price")
df1.show(2)
df2.show(2)
result = df1.join(df2, on="id", how="outer")
print(result.count())
result.show(5)
结果截图:
为什么df1.join后和之前sql查询的结果id都是null?