Optimizing an offline Spark job on a Hadoop EMR cluster
In production, two tables are joined: dwd_dsp_bid_basic_log_d and dwd_dsp_cps_bid_log_d. Each reads 7 days of partitions, with a total input size of roughly 500 GB. The join key is event_id + adv_id, and both tables are partitioned by ds and hh. The SQL already applies partition pruning first to fetch only the 7-day range, and uses distribute by event_id, adv_id to optimize the distribution of the join keys.
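The described pattern might look roughly like the sketch below. Only the table names, join keys, partition column ds, and hivevars come from the post; the column list, the output table name (guessed from the SQL file name), and the overall statement shape are assumptions:

```
-- Hypothetical sketch of the described job; columns and the target
-- table are placeholders, not confirmed by the original post.
INSERT OVERWRITE TABLE dwd_dsp_basic_and_cps_d PARTITION (ds = '${yesterday}')
SELECT
  b.event_id,
  b.adv_id
  /* , other business columns from both sides */
FROM (
  SELECT event_id, adv_id /* , ... */
  FROM dwd_dsp_bid_basic_log_d
  WHERE ds BETWEEN '${seven_days_ago}' AND '${yesterday}'  -- partition pruning
  DISTRIBUTE BY event_id, adv_id
) b
JOIN (
  SELECT event_id, adv_id /* , ... */
  FROM dwd_dsp_cps_bid_log_d
  WHERE ds BETWEEN '${seven_days_ago}' AND '${yesterday}'  -- partition pruning
  DISTRIBUTE BY event_id, adv_id
) c
ON b.event_id = c.event_id
AND b.adv_id = c.adv_id;
```

One caveat worth checking: in Spark SQL, DISTRIBUTE BY inside each subquery adds its own shuffle but does not by itself let the subsequent sort-merge join skip its exchange, so it may be paying an extra shuffle rather than saving one.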
The Spark job configuration is shown below. The intent is to cap the job's resource usage and lower cluster cost, but under this configuration execution is very slow and the job runs far too long. What SQL-level or Spark-level optimizations could make the job run faster within this same resource budget?
```
spark-sql --master yarn \
--name "dwd_dsp_basic_and_cps_d" \
--conf spark.shuffle.service.enabled=true \
--conf spark.sql.adaptive.enabled=true \
--conf spark.yarn.executor.lostCheckTimeout=60s \
--conf spark.yarn.shuffle.stopOnFailure=true \
--conf spark.excludeOnFailure.enabled=true \
--conf spark.excludeOnFailure.task.maxTaskAttemptsPerExecutor=2 \
--conf spark.excludeOnFailure.timeout=10m \
--conf spark.excludeOnFailure.defaultExecutorExcludeDuration=10m \
--conf spark.sql.shuffle.partitions=1000 \
--conf spark.executor.memory=3g \
--conf spark.executor.cores=1 \
--conf spark.shuffle.io.connectionTimeout=60s \
--conf spark.shuffle.io.maxRetries=10 \
--conf spark.shuffle.io.retryWait=30s \
--conf spark.network.timeout=300s \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.executorIdleTimeout=120s \
--conf spark.dynamicAllocation.maxExecutors=50 \
--conf spark.dynamicAllocation.minExecutors=20 \
--conf spark.executor.launchTimeout=120s \
--conf spark.task.maxFailures=4 \
--conf spark.executor.heartbeatInterval=60s \
--conf spark.hadoop.yarn.timeline-service.enabled=true \
--conf spark.executor.extraJavaOptions="
-Djava.net.preferIPv6Addresses=false
-verbose:gc
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:OnOutOfMemoryError='kill -9 %p'
-XX:+IgnoreUnrecognizedVMOptions
--add-opens=java.base/java.lang=ALL-UNNAMED
--add-opens=java.base/java.lang.invoke=ALL-UNNAMED
--add-opens=java.base/java.lang.reflect=ALL-UNNAMED
--add-opens=java.base/java.io=ALL-UNNAMED
--add-opens=java.base/java.net=ALL-UNNAMED
--add-opens=java.base/java.nio=ALL-UNNAMED
--add-opens=java.base/java.util=ALL-UNNAMED
--add-opens=java.base/java.util.concurrent=ALL-UNNAMED
--add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED
--add-opens=java.base/sun.nio.ch=ALL-UNNAMED
--add-opens=java.base/sun.nio.cs=ALL-UNNAMED
--add-opens=java.base/sun.security.action=ALL-UNNAMED
--add-opens=java.base/sun.util.calendar=ALL-UNNAMED
--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED
-Djdk.reflect.useDirectMethodHandle=false
-XX:+UseG1GC
-XX:MaxGCPauseMillis=200
-XX:G1HeapRegionSize=8m
-XX:InitiatingHeapOccupancyPercent=45" \
--hivevar seven_days_ago=${seven_days_ago} \
--hivevar yesterday=${yesterday} \
-f dwd_dsp_basic_and_cps_d.sql
```
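As one possible starting point for experimentation (not a verified fix): with 500 GB of shuffle input, spark.sql.shuffle.partitions=1000 yields ~500 MB per partition, which is heavy for 3 GB single-core executors. A common adjustment is to repackage roughly the same resource ceiling (about 50 cores / 150 GB) into fewer, larger multi-core executors, and let AQE coalesce small partitions and split skewed ones. The specific numbers below are assumptions to be tuned, not recommendations measured against this workload:

```
# Hypothetical alternative sizing under roughly the same resource cap;
# all values below are illustrative and should be benchmarked.
--conf spark.executor.cores=4 \
--conf spark.executor.memory=12g \
--conf spark.dynamicAllocation.maxExecutors=12 \
--conf spark.sql.adaptive.coalescePartitions.enabled=true \
--conf spark.sql.adaptive.skewJoin.enabled=true \
--conf spark.sql.adaptive.advisoryPartitionSizeInBytes=128m \
```

Multi-core executors also let tasks share one shuffle service connection pool and one JVM heap, which tends to reduce the shuffle-fetch and GC overhead that single-core 3 GB executors suffer on large joins.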