Help needed
My environment is Spark 2.1 + HDP 2.6, running Spark on YARN. I'm using PySpark with Python 3.5.
When I run something that uses distinct(), for example:
user_data = sc.textFile("/testdata/u.user")
user_fields = user_data.map(lambda line: line.split("|"))
num_genders = user_fields.map(lambda fields: fields[2]).distinct().count()
it throws the following exception:
File "/data/opt/hadoop-2.6.0/tmp/nm-local-dir/usercache/jsdxadm/appcache/application_1494985561557_0001/container_1494985561557_0001_01_000002/pyspark.zip/pyspark/rdd.py", line 72, in portable_hash
raise Exception("Randomness of hash of string should be disabled via PYTHONHASHSEED")
Exception: Randomness of hash of string should be disabled via PYTHONHASHSEED
Looking at the source, this check on Python 3 seems to have been added because of the hash-randomization security feature:
if sys.version >= '3.3' and 'PYTHONHASHSEED' not in os.environ:
    raise Exception("Randomness of hash of string should be disabled via PYTHONHASHSEED")
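For context on why that check exists: since Python 3.3, str hashing is salted with a per-process random seed, so the same key can hash to different values on different executors, which would break hash-partitioned operations like distinct(). A minimal local sketch of the behavior (plain Python, no Spark needed):

```python
import os
import subprocess
import sys

def hash_in_fresh_interpreter(value, seed=None):
    """Return hash(value) as computed by a brand-new Python process."""
    env = {k: v for k, v in os.environ.items() if k != "PYTHONHASHSEED"}
    if seed is not None:
        env["PYTHONHASHSEED"] = seed
    out = subprocess.run(
        [sys.executable, "-c", f"print(hash({value!r}))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(out.stdout)

# With PYTHONHASHSEED=0 the string hash is reproducible across processes,
# which is what Spark's portable_hash needs so every worker partitions
# the same key to the same place.
assert hash_in_fresh_interpreter("spark", seed="0") == hash_in_fresh_interpreter("spark", seed="0")

# Without the variable, each interpreter picks a random seed, so the same
# string almost always hashes differently in different processes.
print(hash_in_fresh_interpreter("spark"), hash_in_fresh_interpreter("spark"))
```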
I followed advice found online and tried two approaches; neither works. Has anyone hit this and can tell me how you solved it?
1. echo "export PYTHONHASHSEED=0" >> /root/.bashrc
2. spark.yarn.appMasterEnv.PYTHONHASHSEED="XXXX"
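For reference, attempt 2 was passed as a Spark conf. A sketch of the spark-submit form (the job name and the value 0 are placeholders; the spark.executorEnv line is Spark's documented way of setting an environment variable on executors, which I have seen suggested for this error but have not verified myself):

```shell
# Placeholder values; your_job.py is a hypothetical script name.
spark-submit \
  --master yarn \
  --conf spark.yarn.appMasterEnv.PYTHONHASHSEED=0 \
  --conf spark.executorEnv.PYTHONHASHSEED=0 \
  your_job.py
```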