掌控自己就能掌控世界 2019-07-15 16:13 采纳率: 0%
浏览 2091

大表 left join 小表时产生数据倾斜怎么处理?

drop table if exists tableS;

create table tableS as rcfile as
select t.order_app_tp_cd as site_id,
count(distinct t.order_id) cnt_order_all,

count(distinct case when t.order_date = '${hivevar:statis_date}' then t.order_id else null end) as cnt_order,
sum(case when t.order_date = '${hivevar:statis_date}' then t.pay_amt else 0.0 end) as amt_order,

count(distinct case when t.pay_date = '${hivevar:statis_date}'
and t.status = '30'
then t.order_id else null end) as cnt_eff,

sum(case when t.pay_date = '${hivevar:statis_date}'
and t.status in ('30','50','60','70')
then t.pay_amt else 0.0 end) as amt_eff,

sum(case when t.pay_date = '${hivevar:statis_date}'
and t.status in ('50','60','70')
then t.pay_amt else 0.0 end) as amt_back,

count(distinct case when t.pay_date = '${hivevar:statis_date}'
and t.status = '30' then t.member_id
else null end) as cnt_buyer,

count(distinct case when c.member_id is not null
and t.pay_date = '${hivevar:statis_date}'
and t.status = '30' then t.member_id
else null end) as cnt_buyer_new,

count(distinct case when t.pay_date between date_sub(from_unixtime(to_unix_timestamp('${hivevar:statis_date}', 'yyyyMMdd'), 'yyyy-MM-dd'),14) and '${hivevar:statis_date}'
and t.status='30'
then t.order_id else null end) as cnt_eff_15,
sum(case when t.pay_date between date_sub(from_unixtime(to_unix_timestamp('${hivevar:statis_date}', 'yyyyMMdd'), 'yyyy-MM-dd'),14) and '${hivevar:statis_date}'
and t.status in('30','50','70')
then t.pay_amt else 0.0 end) as amt_eff_15,
count(distinct case when t.pay_date between date_sub(from_unixtime(to_unix_timestamp('${hivevar:statis_date}', 'yyyyMMdd'), 'yyyy-MM-dd'),14) and '${hivevar:statis_date}'
and t.status='30'
then t.member_id else null end) as cnt_buyer_15,
count(distinct case when t.pay_date between date_sub(from_unixtime(to_unix_timestamp('${hivevar:statis_date}', 'yyyyMMdd'), 'yyyy-MM-dd'),29) and '${hivevar:statis_date}'
and t.status='30'
then t.order_id else null end) as cnt_eff_30,
sum(case when t.pay_date between date_sub(from_unixtime(to_unix_timestamp('${hivevar:statis_date}', 'yyyyMMdd'), 'yyyy-MM-dd'),29) and '${hivevar:statis_date}'
and t.status in('30','50','70')
then t.pay_amt else 0.0 end) as amt_eff_30,
count(distinct case when t.pay_date between date_sub(from_unixtime(to_unix_timestamp('${hivevar:statis_date}', 'yyyyMMdd'), 'yyyy-MM-dd'),29) and '${hivevar:statis_date}'
and t.status='30'
then t.member_id else null end) as cnt_buyer_30
from (select *
from tableA t
where statis_date between date_sub(from_unixtime(to_unix_timestamp('${hivevar:statis_date}', 'yyyyMMdd'), 'yyyy-MM-dd'),29) and '${hivevar:statis_date}'
and order_apply_tp = 4
and order_dtl_apply_cd = '-1'
and order_source = 'B2C'
and gmv_Flg = 1
and chnl_cd = '50') t
left join (select c1.member_id
from tableB c1
where site_id not in ('PC','WAP','APP')
and first_pay_time = '${hivevar:statis_date}'
group by c1.member_id) c
on t.member_id = c.member_id
join (select name,code
from tableC
where statis_date between date_sub(from_unixtime(to_unix_timestamp('${hivevar:statis_date}', 'yyyyMMdd'), 'yyyy-MM-dd'),29) and '${hivevar:statis_date}'
and status = 0) td
on td.code = t.order_app_tp_cd
group by t.order_app_tp_cd;
其中tableA是大表,tableB是小表,tableC是小表
运行后发现GC运行很频繁,应该是产生了数据倾斜?请问各位大佬怎么解决?

  • 写回答

1条回答 默认 最新

  • 憧憬blog 2023-03-15 05:43
    关注

    针对大表 left join 小表时产生数据倾斜的情况,可以尝试以下解决方案:

    1. 增加并行度:对于大表,可以尝试增加它的并行度,比如增加 mapreduce 的 map 数量或 Hive 的 reducer 数量。同时,可以增加小表的并行度,比如使用 distribute by 分发数据到不同的 reduce 任务处理。

    2. 使用随机数:在大表的 join key 中增加随机数列,然后再使用 distribute by 将数据分发到不同的 reduce 任务。这种方法可以避免数据倾斜,但可能会影响查询结果的准确性。

    3. 使用 pre-aggregation:对于小表,可以尝试在查询之前使用 pre-aggregation 对数据进行聚合,然后将结果和大表进行 join。这样可以减小小表和大表 join 时的数据量,减少数据倾斜的概率。

    4. 调整 join 顺序:如果可能的话,可以调整大表和小表的 join 顺序,将小表作为驱动表进行 join。这种方式可以减少大表中需要 join 的数据量。

    以上是一些常见的解决方案,具体方法需要根据实际情况进行调整。另外,出现频繁的 GC 可能也是由于 JVM 内存空间不足造成的,可以考虑增加内存大小或者调整 JVM 参数。

    评论

报告相同问题?

悬赏问题

  • ¥20 5037端口被adb自己占了
  • ¥15 python:excel数据写入多个对应word文档
  • ¥60 全一数分解素因子和素数循环节位数
  • ¥15 ffmpeg如何安装到虚拟环境
  • ¥188 寻找能做王者评分提取的
  • ¥15 matlab用simulink求解一个二阶微分方程,要求截图
  • ¥30 乘子法解约束最优化问题的matlab代码文件,最好有matlab代码文件
  • ¥15 写论文,需要数据支撑
  • ¥15 identifier of an instance of 类 was altered from xx to xx错误
  • ¥100 反编译微信小游戏求指导