The table train is stored as TEXTFILE and has 37912917 rows.
The table train_orc has the same schema and exactly the same data as train, but is stored as ORC.
Now I run the same statement against both tables and compare the execution plans.
Against the ORC table:
explain select avg(userid) from train_orc;
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: train_orc
            Statistics: Num rows: 37912917 Data size: 26425303149 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: userid (type: string)
              outputColumnNames: userid
              Statistics: Num rows: 37912917 Data size: 26425303149 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: avg(userid)
                mode: hash
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 256 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  sort order:
                  Statistics: Num rows: 1 Data size: 256 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col0 (type: struct<count:bigint,sum:double,input:string>)
In the plan for the ORC table, Statistics: Num rows: 37912917 shows the full row count of the table, and it is produced instantly.
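One way to see which numbers the metastore actually holds for each table is DESCRIBE FORMATTED. This is a minimal sketch, assuming both tables live in the current database; the Table Parameters section of the output normally lists entries such as numRows, rawDataSize and totalSize when those statistics have been recorded.

```sql
-- Show table metadata, including any stored basic statistics
-- (look for numRows, rawDataSize and totalSize under "Table Parameters")
DESCRIBE FORMATTED train_orc;
DESCRIBE FORMATTED train;
```

Comparing the stored numRows of the two tables with the Num rows values in the plans should show whether the plan is reading a recorded count or estimating one.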
Against the TEXTFILE table:
explain select avg(userid) from train;
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: train
            Statistics: Num rows: 13381927 Data size: 1338192768 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: userid (type: string)
              outputColumnNames: userid
              Statistics: Num rows: 13381927 Data size: 1338192768 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: avg(userid)
                mode: hash
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 256 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  sort order:
                  Statistics: Num rows: 1 Data size: 256 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col0 (type: struct<count:bigint,sum:double,input:string>)
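Here the reported row count is 13381927, which does not match the actual table (37912917 rows), and the query is also slower. Where does the figure 13381927 come from? The count in the plan is clearly wrong, yet the query result, although slow to arrive, is correct. Could someone please explain this? Thanks.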
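A possible lead, not a confirmed explanation: in both plans, Data size divided by Num rows is a suspiciously round number (26425303149 / 37912917 is exactly 697, and 1338192768 / 13381927 is about 100). That would be consistent with one of the two values being derived from the other through a per-row size estimate rather than being counted. The sketch below redoes that arithmetic in Hive and then recomputes basic statistics for the text table. It assumes a Hive version that allows SELECT without FROM (0.13 or later) and that the planner picks up the refreshed statistics; neither assumption has been checked against this cluster.

```sql
-- Bytes per row implied by the Statistics line of each plan
SELECT 26425303149 / 37912917;   -- ORC plan
SELECT 1338192768 / 13381927;    -- TEXTFILE plan

-- Scan the text table and record its actual row count in the metastore
ANALYZE TABLE train COMPUTE STATISTICS;

-- Re-check the estimate after the statistics have been refreshed
EXPLAIN SELECT avg(userid) FROM train;
```

If Num rows for train changes to 37912917 after the ANALYZE step, the earlier figure was an estimate based on file size and not a stored count.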