Efficient way to remove all non-alphanumeric characters from large text

I need to process large volumes of text, and one of the steps is to remove all non-alphanumeric characters. I'm trying to find an efficient way to do it.

So far I have two functions:

func stripMap(str, chr string) string {
    return strings.Map(func(r rune) rune {
        if strings.IndexRune(chr, r) < 0 {
            return r
        }
        return -1
    }, str)
}

With this version I have to pass in a string containing every non-alphanumeric character that should be stripped.

And a plain old regex:

func stripRegex(in string) string {
    reg, _ := regexp.Compile("[^a-zA-Z0-9 ]+")
    return reg.ReplaceAllString(in, "")
}

The regex version seems to be much slower:

BenchmarkStripMap-8      30000     37907 ns/op     8192 B/op     2 allocs/op
BenchmarkStripRegex-8    10000    131449 ns/op    57552 B/op    35 allocs/op

I'm looking for suggestions. Is there a better way to do this, or a way to improve the functions above?


2 Answers

douguan3470 2019-01-31 13:51
Because every surviving rune is less than utf8.RuneSelf (that is, plain ASCII), this problem can be solved by operating on bytes. Every byte of a multi-byte UTF-8 sequence is at least utf8.RuneSelf, so if a byte is not in [a-zA-Z0-9 ], then that byte is part of a rune to be removed.

func strip(s string) string {
    var result strings.Builder
    for i := 0; i < len(s); i++ {
        b := s[i]
        if ('a' <= b && b <= 'z') ||
            ('A' <= b && b <= 'Z') ||
            ('0' <= b && b <= '9') ||
            b == ' ' {
            result.WriteByte(b)
        }
    }
    return result.String()
}

A variation on this function is to preallocate the result by calling result.Grow:

func strip(s string) string {
    var result strings.Builder
    result.Grow(len(s))
    ...

This ensures that the function makes one memory allocation, but that memory allocation may be significantly larger than needed if the ratio of surviving runes to source runes is low.
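For reference, the elided variation might be written out in full roughly as follows. This is only a sketch: the name stripGrow is made up here to keep it distinct from strip, and the loop body simply repeats the byte test from the function above.

```go
package main

import "strings"

// stripGrow is a hypothetical name for the preallocating variation.
// It keeps only ASCII letters, digits and spaces.
func stripGrow(s string) string {
	var result strings.Builder
	// The output can never be longer than the input, so reserving len(s)
	// bytes up front means the builder does not need to grow again.
	result.Grow(len(s))
	for i := 0; i < len(s); i++ {
		b := s[i]
		if ('a' <= b && b <= 'z') ||
			('A' <= b && b <= 'Z') ||
			('0' <= b && b <= '9') ||
			b == ' ' {
			result.WriteByte(b)
		}
	}
	return result.String()
}
```

If most of the input is expected to be removed, reserving a smaller amount than len(s) is a possible middle ground, at the cost of occasional regrowth.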

The strip function in this answer is written to work with string argument and result types because those are the types used in the question.

If the application is working with a []byte source text and that source text can be modified, then it will be more efficient to update the []byte in place. To do this, copy the surviving bytes to the beginning of the slice and reslice when done. This avoids memory allocations and the overhead of strings.Builder. This variation is similar to one in peterSO's answer to this question.

func strip(s []byte) []byte {
    n := 0
    for _, b := range s {
        if ('a' <= b && b <= 'z') ||
            ('A' <= b && b <= 'Z') ||
            ('0' <= b && b <= '9') ||
            b == ' ' {
            s[n] = b
            n++
        }
    }
    return s[:n]
}
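One consequence of the in-place version is worth spelling out: the returned slice shares its backing array with the argument, so the caller's original slice is overwritten. The sketch below repeats the loop under the hypothetical name stripInPlace and adds a hypothetical wrapper, stripCopy, for callers that need the source left intact.

```go
package main

// stripInPlace repeats the in-place loop above under a hypothetical name.
// The write index n never passes the read index, so no unread byte is lost.
func stripInPlace(s []byte) []byte {
	n := 0
	for _, b := range s {
		if ('a' <= b && b <= 'z') ||
			('A' <= b && b <= 'Z') ||
			('0' <= b && b <= '9') ||
			b == ' ' {
			s[n] = b
			n++
		}
	}
	return s[:n]
}

// stripCopy is a hypothetical wrapper: it filters a copy of src, trading
// one allocation for leaving the caller's data untouched.
func stripCopy(src []byte) []byte {
	dup := make([]byte, len(src))
	copy(dup, src)
	return stripInPlace(dup)
}
```

For example, given buf := []byte("a-b-c"), the call stripInPlace(buf) returns "abc", and buf itself then holds "abc-c": the first three bytes were overwritten and the tail is stale. stripCopy(buf) would return "abc" and leave buf as "a-b-c".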

Depending on actual data used, one of the approaches in this answer may be faster than the approaches in the question.
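To check this against real data, something like the following could be used to time candidates on a representative input. It is a sketch under assumptions: stripBytes, stripRegexOnce and measure are hypothetical names, and compiling the question's pattern once at package level (rather than on every call) is a tweak added here, not something benchmarked in the question.

```go
package main

import (
	"regexp"
	"strings"
	"testing"
)

// nonAlnum holds the question's pattern, compiled a single time.
var nonAlnum = regexp.MustCompile("[^a-zA-Z0-9 ]+")

// stripRegexOnce is the regex approach without the per-call compile.
func stripRegexOnce(in string) string {
	return nonAlnum.ReplaceAllString(in, "")
}

// stripBytes is the byte-wise strings.Builder approach from this answer.
func stripBytes(s string) string {
	var result strings.Builder
	for i := 0; i < len(s); i++ {
		b := s[i]
		if ('a' <= b && b <= 'z') ||
			('A' <= b && b <= 'Z') ||
			('0' <= b && b <= '9') ||
			b == ' ' {
			result.WriteByte(b)
		}
	}
	return result.String()
}

// measure runs f on input under testing.Benchmark, which works outside
// "go test", and reports timing together with allocation counts.
func measure(f func(string) string, input string) testing.BenchmarkResult {
	return testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			f(input)
		}
	})
}
```

Printing measure(stripBytes, text) and measure(stripRegexOnce, text) for a sample text gives ns/op figures that can be compared directly; the result's MemString method adds the B/op and allocs/op columns. The numbers depend heavily on the input, in particular on how much of it is removed.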
This answer was selected by the asker as the best answer.
