Create the testlzo table:
CREATE EXTERNAL TABLE `testlzo`(
xxxx)
row format delimited
fields terminated by '|'
STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
LOCATION '/nginx/testlzo';
Run the query:
select count(*) from testlzo;
Before building the index:
$ hadoop fs -du -s -h /nginx/testlzo/*
427.8 M 1.3 G /nginx/testlzo/123.lzo
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
1679356
Build the index:
hadoop jar hadoop/share/hadoop/common/hadoop-lzo-0.4.20.jar com.hadoop.compression.lzo.DistributedLzoIndexer /nginx/testlzo/123.lzo
$ hadoop fs -du -s -h /nginx/testlzo/*
427.8 M 1.3 G /nginx/testlzo/123.lzo
32.0 K 96.0 K /nginx/testlzo/123.lzo.index
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
1679759
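As you can see, Hive did not split the LZO file, and it even counted the index file as input: the mapper count went from 1 to 2 and the row count rose from 1679356 to 1679759 (403 extra rows). What is going on here?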
Moreover, outside Hive (here with the MapReduce wordcount example), the file can be split:
hadoop jar hadoop-mapreduce-examples-2.6.0-cdh5.4.5.jar wordcount -Dmapreduce.job.inputformat.class=com.hadoop.mapreduce.LzoTextInputFormat /nginx/testlzo/123.lzo /tmp/2019081401
Running this command reports 4 splits, so the file really is splittable. Why does Hive not split it? The Hive version is 2.3.5.
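One thing that may be worth checking (an assumption, not verified on this cluster): Hive's default value for hive.input.format is CombineHiveInputFormat, and with that setting Hive is commonly reported to treat the .index file as ordinary data and to leave the .lzo file unsplit. A minimal sketch that switches the session to HiveInputFormat before rerunning the same query:

```sql
-- Assumption: the session is currently using the default CombineHiveInputFormat.
-- HiveInputFormat delegates split computation to the table's declared INPUTFORMAT
-- (here com.hadoop.mapred.DeprecatedLzoTextInputFormat).
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

-- Rerun the same query and compare the reported number of mappers and the count.
select count(*) from testlzo;
```

If this setting is the cause, the job information line should show more than one mapper for 123.lzo, and the count should no longer include rows read from 123.lzo.index.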