donglaoping9702 2019-01-31 13:13

An efficient way to remove all non-alphanumeric characters from large text

I need to process large volumes of text, and one of the steps is to remove all non-alphanumeric characters. I'm trying to find an efficient way to do it.

So far I have two functions:

func stripMap(str, chr string) string {
    return strings.Map(func(r rune) rune {
        if strings.IndexRune(chr, r) < 0 {
            return r
        }
        return -1
    }, str)
}

Here I have to pass in a string containing every non-alphanumeric character that should be removed.

And a plain old regex:

func stripRegex(in string) string {
    reg, _ := regexp.Compile("[^a-zA-Z0-9 ]+")
    return reg.ReplaceAllString(in, "")
}

The regex one seems to be much slower:

BenchmarkStripMap-8        30000         37907 ns/op        8192 B/op          2 allocs/op
BenchmarkStripRegex-8      10000        131449 ns/op       57552 B/op         35 allocs/op

Looking for suggestions. Is there a better way to do it, or a way to improve the above?

2 answers

  • douguan3470 2019-01-31 13:51

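    Because every surviving rune is less than utf8.RuneSelf (that is, it is a single-byte ASCII character), this problem can be solved by operating on bytes. In UTF-8, every byte of a multi-byte rune is at least utf8.RuneSelf, so any byte that does not match [a-zA-Z0-9 ] belongs to a rune to be removed.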

    func strip(s string) string {
        var result strings.Builder
        for i := 0; i < len(s); i++ {
            b := s[i]
            if ('a' <= b && b <= 'z') ||
                ('A' <= b && b <= 'Z') ||
                ('0' <= b && b <= '9') ||
                b == ' ' {
                result.WriteByte(b)
            }
        }
        return result.String()
    }
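
The byte-level reasoning can be checked with a short program. This is a minimal sketch: the helper names isKept and allHighBytes and the sample runes are assumptions made for illustration, not part of the answer's code.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isKept reports whether b is an ASCII letter, digit, or space.
// (Hypothetical helper; it mirrors the condition used in strip.)
func isKept(b byte) bool {
	return ('a' <= b && b <= 'z') ||
		('A' <= b && b <= 'Z') ||
		('0' <= b && b <= '9') ||
		b == ' '
}

// allHighBytes reports whether every byte in the UTF-8 encoding of r
// is >= utf8.RuneSelf (0x80). (Hypothetical helper.)
func allHighBytes(r rune) bool {
	var buf [utf8.UTFMax]byte
	n := utf8.EncodeRune(buf[:], r)
	for _, b := range buf[:n] {
		if b < utf8.RuneSelf {
			return false
		}
	}
	return true
}

func main() {
	// Sample non-ASCII runes with 2-, 3-, and 4-byte encodings.
	for _, r := range "é世🙂" {
		fmt.Printf("%q: every byte >= 0x80: %v\n", r, allHighBytes(r))
	}
	// No byte at or above utf8.RuneSelf passes the keep test.
	for b := int(utf8.RuneSelf); b <= 0xFF; b++ {
		if isKept(byte(b)) {
			panic("unexpected: high byte matched the keep set")
		}
	}
}
```

Since no byte of a multi-byte sequence can pass the keep test, dropping bytes one at a time never leaves a partial rune behind.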
    

    A variation on this function is to preallocate the result by calling result.Grow:

    func strip(s string) string {
        var result strings.Builder
        result.Grow(len(s))
        ...
    

    This ensures that the function makes one memory allocation, but that memory allocation may be significantly larger than needed if the ratio of surviving runes to source runes is low.
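
For reference, the preallocating variant written out in full might look like the following. This is a sketch that assumes the elided body is the same loop as in the first strip function; the name stripGrow and the sample input are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// stripGrow keeps ASCII letters, digits, and spaces, and drops all other bytes.
// Hypothetical name; the loop is assumed to match the first strip function.
func stripGrow(s string) string {
	var result strings.Builder
	// Reserve the worst case up front: every byte of s survives.
	result.Grow(len(s))
	for i := 0; i < len(s); i++ {
		b := s[i]
		if ('a' <= b && b <= 'z') ||
			('A' <= b && b <= 'Z') ||
			('0' <= b && b <= '9') ||
			b == ' ' {
			result.WriteByte(b)
		}
	}
	return result.String()
}

func main() {
	fmt.Printf("%q\n", stripGrow("Hello, 世界! 123"))
}
```

The trade-off is the one described above: Grow(len(s)) sizes the buffer for the input, not for the output, so text that is mostly stripped leaves most of the buffer unused.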

    The strip function in this answer is written to work with string argument and result types because those are the types used in the question.

    If the application is working with a []byte source text and that source text can be modified, then it will be more efficient to update the []byte in place. To do this, copy the surviving bytes to the beginning of the slice and reslice when done. This avoids memory allocations and the overhead of strings.Builder. This variation is similar to one in peterSO's answer to this question.

    func strip(s []byte) []byte {
        n := 0
        for _, b := range s {
            if ('a' <= b && b <= 'z') ||
                ('A' <= b && b <= 'Z') ||
                ('0' <= b && b <= '9') ||
                b == ' ' {
                s[n] = b
                n++
            }
        }
        return s[:n]
    }
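
One consequence of the in-place version worth seeing concretely: the returned slice shares its backing array with the argument, so the caller's original bytes are overwritten. A small usage sketch follows; the name stripInPlace and the sample inputs are assumptions for illustration.

```go
package main

import "fmt"

// stripInPlace repeats the in-place function under a hypothetical name
// so that this example is self-contained.
func stripInPlace(s []byte) []byte {
	n := 0
	for _, b := range s {
		if ('a' <= b && b <= 'z') ||
			('A' <= b && b <= 'Z') ||
			('0' <= b && b <= '9') ||
			b == ' ' {
			s[n] = b // n never runs ahead of the read position
			n++
		}
	}
	return s[:n]
}

func main() {
	src := []byte("a-b_c 1!2")
	out := stripInPlace(src)
	fmt.Printf("out = %q\n", out)
	fmt.Printf("src = %q\n", src) // the front of src now holds the result

	// If the original must be preserved, strip a copy instead.
	orig := []byte("x.y")
	kept := stripInPlace(append([]byte(nil), orig...))
	fmt.Printf("orig = %q, kept = %q\n", orig, kept)
}
```

Copying first costs the allocation that the in-place version avoids, so it only pays off when the source really is disposable.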
    

    Depending on the actual data, one of the approaches in this answer may be faster than the approaches in the question, so benchmark with input that is representative of the real text before choosing.
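
A comparison on one's own data could be set up roughly as follows, using testing.Benchmark from an ordinary program. This is a sketch under stated assumptions: the names stripBytes, stripRegexOnce, and measure are hypothetical, the input is a placeholder, and the regex is compiled once at package level, unlike stripRegex in the question, which compiles it on every call.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"testing"
)

// Compiled once, outside the timed loop (an assumption of this sketch).
var nonAlnum = regexp.MustCompile("[^a-zA-Z0-9 ]+")

// stripRegexOnce is a hypothetical variant of the question's stripRegex.
func stripRegexOnce(in string) string {
	return nonAlnum.ReplaceAllString(in, "")
}

// stripBytes is the byte-wise approach with a preallocated builder.
func stripBytes(s string) string {
	var result strings.Builder
	result.Grow(len(s))
	for i := 0; i < len(s); i++ {
		b := s[i]
		if ('a' <= b && b <= 'z') || ('A' <= b && b <= 'Z') ||
			('0' <= b && b <= '9') || b == ' ' {
			result.WriteByte(b)
		}
	}
	return result.String()
}

// measure times f on input and records allocations.
func measure(f func(string) string, input string) testing.BenchmarkResult {
	return testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			f(input)
		}
	})
}

func main() {
	// Placeholder input; substitute a sample of the real text.
	input := strings.Repeat("Hello, 世界! 123 ", 512)

	// Confirm the two approaches agree before comparing their speed.
	if stripBytes(input) != stripRegexOnce(input) {
		panic("implementations disagree")
	}
	for name, f := range map[string]func(string) string{
		"bytes": stripBytes,
		"regex": stripRegexOnce,
	} {
		r := measure(f, input)
		fmt.Println(name, r.String(), r.MemString())
	}
}
```

The numbers depend heavily on the input: its length, how much of it is non-ASCII, and what fraction of bytes survives.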

    This answer was selected by the asker as the best answer.