I need to replace every emoji in a result set with an empty string, but I found that some non-English letters get replaced as well.
scala> val df = Seq(
| (8, "bat★☆😂 😆 ⛱ ✨🚣♂⛷🏂❤🤍🪵֎۩ᴥ★ Lôa Créole♥"),
| (64, "bb")
| ).toDF("number", "word")
df: org.apache.spark.sql.DataFrame = [number: int, word: string]
scala> df.show(false)
+------+------------------------------------------------+
|number|word |
+------+------------------------------------------------+
|8 |bat★☆😂 😆 ⛱ ✨🚣♂️⛷🏂❤️🤍🪵֎۩ᴥ★ Lôa Créole♥|
|64 |bb |
+------+------------------------------------------------+
I found this expression online:
regexp_replace(df("word"), """[^ 'a-zA-Z0-9,.?!]""","")
scala> df.select($"number", $"word", regexp_replace(df("word"), """[^ 'a-zA-Z0-9,.?!]""","").alias("word_revised")).show(false)
+------+------------------------------------------------+---------------+
|number|word |word_revised |
+------+------------------------------------------------+---------------+
|8 |bat★☆😂 😆 ⛱ ✨🚣♂️⛷🏂❤️🤍🪵֎۩ᴥ★ Lôa Créole♥|bat La Crole|
|64 |bb |bb |
+------+------------------------------------------------+---------------+
scala>
ô and é are the two characters I want to keep. How can I improve this?
Thanks.
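One direction that might work, as a sketch rather than a confirmed fix: the class `a-zA-Z` only covers ASCII letters, so accented letters fall into the negated set and are removed. Spark's `regexp_replace` uses Java regular expressions, so the Unicode property classes `\p{L}` (any letter) and `\p{N}` (any digit) could be swapped in. The snippet below uses plain `String.replaceAll` so it runs without Spark; the object name `KeepAccentedLetters`, the helper `strip`, and the shortened sample string are illustrative assumptions, not part of the code above.

```scala
object KeepAccentedLetters {
  // Pattern from the question: a-zA-Z matches ASCII letters only,
  // so ô and é are treated as "everything else" and removed.
  val asciiOnly: String = """[^ 'a-zA-Z0-9,.?!]"""

  // Assumed alternative: \p{L} is any Unicode letter, \p{N} any Unicode digit.
  val unicodeLetters: String = """[^ '\p{L}\p{N},.?!]"""

  // Hypothetical helper: delete every character the pattern matches.
  def strip(s: String, pattern: String): String = s.replaceAll(pattern, "")

  def main(args: Array[String]): Unit = {
    // Shortened sample, rendered as: bat★ Lôa Créole♥
    val sample = "bat\u2605 L\u00f4a Cr\u00e9ole\u2665"
    println(strip(sample, asciiOnly))      // accented letters are dropped
    println(strip(sample, unicodeLetters)) // accented letters are kept
  }
}
```

In Spark the same pattern string would go into the existing call, e.g. `regexp_replace(df("word"), """[^ '\p{L}\p{N},.?!]""", "")`. One caveat: Unicode classifies some decorative-looking characters (ᴥ, for example) as letters, so those may survive this pattern and need to be excluded separately.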