Efficient way to remove all non-alphanumeric characters from large text

I need to process large volumes of text, and one of the steps is to remove all non-alphanumeric characters. I'm trying to find an efficient way to do it.

So far I have two functions:

func stripMap(str, chr string) string {
    return strings.Map(func(r rune) rune {
        if strings.IndexRune(chr, r) < 0 {
            return r
        }
        return -1
    }, str)
}

Here I actually have to pass in a string containing every non-alphanumeric character I want removed.
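One way to avoid listing the unwanted characters is to test each rune against the set to keep instead. The following is a minimal sketch, assuming the kept set is ASCII letters, ASCII digits, and the space character (the same class the regex version uses); isKept and stripMapKeep are hypothetical names, not part of the original code.

```go
package main

import (
	"fmt"
	"strings"
)

// isKept reports whether r is an ASCII letter, an ASCII digit, or a space,
// mirroring the character class [a-zA-Z0-9 ].
func isKept(r rune) bool {
	return ('a' <= r && r <= 'z') ||
		('A' <= r && r <= 'Z') ||
		('0' <= r && r <= '9') ||
		r == ' '
}

// stripMapKeep (hypothetical name) drops every rune for which isKept is false.
// Returning a negative rune from the mapping function tells strings.Map to
// omit that rune from the result.
func stripMapKeep(s string) string {
	return strings.Map(func(r rune) rune {
		if isKept(r) {
			return r
		}
		return -1
	}, s)
}

func main() {
	fmt.Println(stripMapKeep("Hello, 世界! 123")) // prints "Hello  123" (two spaces)
}
```

This removes the need for strings.IndexRune on every rune, though whether it is faster on real data would have to be measured.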

And a plain old regex:

func stripRegex(in string) string {
    reg, _ := regexp.Compile("[^a-zA-Z0-9 ]+")
    return reg.ReplaceAllString(in, "")
}

The regex version seems to be much slower:

BenchmarkStripMap-8      30000     37907 ns/op     8192 B/op     2 allocs/op
BenchmarkStripRegex-8    10000    131449 ns/op    57552 B/op    35 allocs/op
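Note that stripRegex compiles the pattern on every call, so the regex figure above includes compilation work. A sketch that compiles the pattern once at package level follows; nonAlnum and stripRegexOnce are hypothetical names, and whether this closes the gap with strings.Map has not been measured here.

```go
package main

import (
	"fmt"
	"regexp"
)

// nonAlnum is compiled once at package initialization. MustCompile panics on
// an invalid pattern, which is reasonable for a fixed pattern like this one.
var nonAlnum = regexp.MustCompile("[^a-zA-Z0-9 ]+")

// stripRegexOnce (hypothetical name) reuses the precompiled pattern instead of
// calling regexp.Compile on every invocation.
func stripRegexOnce(in string) string {
	return nonAlnum.ReplaceAllString(in, "")
}

func main() {
	fmt.Println(stripRegexOnce("a-b_c d!")) // prints "abc d"
}
```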

Looking for suggestions. Any other better way to do it? Improve the above?

douguan3470 2019-01-31 13:51
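Because every surviving rune is less than utf8.RuneSelf, each one is encoded as a single byte, while every byte of a multi-byte UTF-8 sequence is utf8.RuneSelf or greater. This problem can therefore be solved by operating on bytes: if a byte is not in [a-zA-Z0-9 ], then the byte is part of a rune to be removed.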

func strip(s string) string {
    var result strings.Builder
    for i := 0; i < len(s); i++ {
        b := s[i]
        if ('a' <= b && b <= 'z') ||
            ('A' <= b && b <= 'Z') ||
            ('0' <= b && b <= '9') ||
            b == ' ' {
            result.WriteByte(b)
        }
    }
    return result.String()
}

A variation on this function is to preallocate the result by calling result.Grow:

func strip(s string) string {
    var result strings.Builder
    result.Grow(len(s))
    ...

This ensures that the function makes one memory allocation, but that memory allocation may be significantly larger than needed if the ratio of surviving runes to source runes is low.
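For reference, here is the preallocating variation written out in full. This is a sketch that assumes the elided part is the same loop as in the first version; stripGrow is a hypothetical name used only to keep it distinct.

```go
package main

import (
	"fmt"
	"strings"
)

// stripGrow (hypothetical name) keeps ASCII letters, digits, and spaces.
// It assumes the loop body is identical to the non-preallocating version.
func stripGrow(s string) string {
	var result strings.Builder
	// Reserve room for the worst case, where every byte survives.
	result.Grow(len(s))
	for i := 0; i < len(s); i++ {
		b := s[i]
		if ('a' <= b && b <= 'z') ||
			('A' <= b && b <= 'Z') ||
			('0' <= b && b <= '9') ||
			b == ' ' {
			result.WriteByte(b)
		}
	}
	return result.String()
}

func main() {
	fmt.Println(stripGrow("Hello, 世界! 123")) // prints "Hello  123" (two spaces)
}
```

In this example the source is 20 bytes long but only 10 bytes survive, which illustrates the over-allocation trade-off described above.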

The strip function in this answer is written to work with string argument and result types because those are the types used in the question.

If the application is working with a []byte source text and that source text can be modified, then it will be more efficient to update the []byte in place. To do this, copy the surviving bytes to the beginning of the slice and reslice when done. This avoids memory allocations and the overhead of strings.Builder. This variation is similar to one in peterSO's answer to this question.

func strip(s []byte) []byte {
    n := 0
    for _, b := range s {
        if ('a' <= b && b <= 'z') ||
            ('A' <= b && b <= 'Z') ||
            ('0' <= b && b <= '9') ||
            b == ' ' {
            s[n] = b
            n++
        }
    }
    return s[:n]
}
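One consequence of the in-place version is that the returned slice shares its backing array with the argument, so the caller's original data is overwritten. The sketch below illustrates this; stripInPlace is a hypothetical name for the same function, and the sample inputs are made up for the demonstration.

```go
package main

import "fmt"

// stripInPlace (hypothetical name) compacts the kept bytes to the front of s
// and returns the shortened slice. It modifies the contents of s.
func stripInPlace(s []byte) []byte {
	n := 0
	for _, b := range s {
		if ('a' <= b && b <= 'z') ||
			('A' <= b && b <= 'Z') ||
			('0' <= b && b <= '9') ||
			b == ' ' {
			s[n] = b
			n++
		}
	}
	return s[:n]
}

func main() {
	src := []byte("a-b-c")
	out := stripInPlace(src)
	fmt.Printf("%q\n", out) // prints "abc"
	fmt.Printf("%q\n", src) // prints "abc-c": the source has been overwritten

	// If the original must be preserved, strip a copy instead.
	orig := []byte("x!y")
	kept := stripInPlace(append([]byte(nil), orig...))
	fmt.Printf("%q %q\n", orig, kept) // prints "x!y" "xy"
}
```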

Depending on actual data used, one of the approaches in this answer may be faster than the approaches in the question.
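To check this against your own data, a small harness along the following lines may help. It is only a sketch: the input string is made up, the function names are hypothetical, and it uses testing.Benchmark so that it can run as an ordinary program rather than through go test. The reported numbers will vary by machine and input.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// keep reports whether b is an ASCII letter, digit, or space.
func keep(b byte) bool {
	return ('a' <= b && b <= 'z') || ('A' <= b && b <= 'Z') ||
		('0' <= b && b <= '9') || b == ' '
}

// stripBuilder is the string-based approach using strings.Builder.
func stripBuilder(s string) string {
	var result strings.Builder
	for i := 0; i < len(s); i++ {
		if keep(s[i]) {
			result.WriteByte(s[i])
		}
	}
	return result.String()
}

// stripBytes is the in-place []byte approach.
func stripBytes(s []byte) []byte {
	n := 0
	for _, b := range s {
		if keep(b) {
			s[n] = b
			n++
		}
	}
	return s[:n]
}

// Sinks keep the results reachable so the calls are not optimized away.
var (
	sinkString string
	sinkBytes  []byte
)

func main() {
	// Hypothetical input; substitute a sample of the real text.
	input := strings.Repeat("Hello, World! 123 ", 500)

	builder := testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			sinkString = stripBuilder(input)
		}
	})

	src := []byte(input)
	buf := make([]byte, len(src))
	inPlace := testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			copy(buf, src) // restore the input; this copy is included in the timing
			sinkBytes = stripBytes(buf)
		}
	})

	fmt.Println("builder: ", builder, builder.MemString())
	fmt.Println("in place:", inPlace, inPlace.MemString())
}
```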
This answer was selected as the best answer by the asker.
