donglaoping9702 2019-01-31 13:13

An efficient way to remove all non-alphanumeric characters from large text

I need to process large volumes of text, and one of the steps is to remove all non-alphanumeric characters. I'm trying to find an efficient way to do it.

So far I have two functions:

// stripMap returns str with every rune that appears in chr removed.
func stripMap(str, chr string) string {
    return strings.Map(func(r rune) rune {
        if strings.IndexRune(chr, r) < 0 {
            return r
        }
        return -1
    }, str)
}

The drawback is that I have to pass in a string containing every non-alphanumeric character I want removed.

And a plain old regex:

func stripRegex(in string) string {
    reg, _ := regexp.Compile("[^a-zA-Z0-9 ]+")
    return reg.ReplaceAllString(in, "")
}

The regex one seems to be much slower:

BenchmarkStripMap-8      30000     37907 ns/op     8192 B/op     2 allocs/op
BenchmarkStripRegex-8    10000    131449 ns/op    57552 B/op    35 allocs/op

I'm looking for suggestions. Is there a better way to do this, or a way to improve the functions above?


2 answers

  • douguan3470 2019-01-31 13:51

    Because every surviving rune is less than utf8.RuneSelf (0x80), this problem can be solved by operating on bytes. In UTF-8, every byte of a multi-byte encoding is utf8.RuneSelf or greater, so if a byte is not in [a-zA-Z0-9 ], then that byte is part of a rune to be removed.

    // strip returns s with every byte outside [a-zA-Z0-9 ] removed.
    func strip(s string) string {
        var result strings.Builder
        for i := 0; i < len(s); i++ {
            b := s[i]
            if ('a' <= b && b <= 'z') ||
                ('A' <= b && b <= 'Z') ||
                ('0' <= b && b <= '9') ||
                b == ' ' {
                result.WriteByte(b)
            }
        }
        return result.String()
    }
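
    As a quick check of the byte-level reasoning, the sketch below runs the same loop on input that contains multi-byte UTF-8 characters. The sample string and the main function are illustrative and not part of the question. Both bytes of é and of ö are at least utf8.RuneSelf, so they fail every range test and are dropped along with the punctuation.

```go
package main

import (
	"fmt"
	"strings"
)

// strip keeps only ASCII letters, digits and spaces, working byte by byte.
func strip(s string) string {
	var result strings.Builder
	for i := 0; i < len(s); i++ {
		b := s[i]
		if ('a' <= b && b <= 'z') ||
			('A' <= b && b <= 'Z') ||
			('0' <= b && b <= '9') ||
			b == ' ' {
			result.WriteByte(b)
		}
	}
	return result.String()
}

func main() {
	// Hypothetical sample input: é and ö take two bytes each in UTF-8.
	fmt.Printf("%q\n", strip("héllo, wörld 123!")) // "hllo wrld 123"
}
```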
    

    A variation on this function is to preallocate the result by calling result.Grow:

    func strip(s string) string {
        var result strings.Builder
        result.Grow(len(s))
        ...
    

    This ensures that the function makes one memory allocation, but that memory allocation may be significantly larger than needed if the ratio of surviving runes to source runes is low.
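
    Written out in full, the preallocating variation might look like the sketch below. The loop body is assumed to be the same as in the first strip function, since the snippet above elides it, and the name stripGrow is only used here to tell the two versions apart.

```go
package main

import (
	"fmt"
	"strings"
)

// stripGrow is a hypothetical name for the preallocating variation.
// It reserves len(s) bytes up front, so the builder never has to regrow.
func stripGrow(s string) string {
	var result strings.Builder
	result.Grow(len(s)) // worst case: every byte survives
	for i := 0; i < len(s); i++ {
		b := s[i]
		// Assumed to be the same filter as the non-preallocating version.
		if ('a' <= b && b <= 'z') ||
			('A' <= b && b <= 'Z') ||
			('0' <= b && b <= '9') ||
			b == ' ' {
			result.WriteByte(b)
		}
	}
	return result.String()
}

func main() {
	// Illustrative input: 8 bytes are reserved, 5 are used.
	fmt.Printf("%q\n", stripGrow("a-b_c 1!")) // "abc 1"
}
```

    For input that is mostly punctuation or non-ASCII text, most of the reserved capacity goes unused, which is the trade-off described above.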

    The strip function in this answer is written to work with string argument and result types because those are the types used in the question.

    If the application is working with a []byte source text and that source text can be modified, then it will be more efficient to update the []byte in place. To do this, copy the surviving bytes to the beginning of the slice and reslice when done. This avoids memory allocations and the overhead of strings.Builder. This variation is similar to one in peterSO's answer to this question.

    // strip filters s in place and returns the kept prefix, which shares s's backing array.
    func strip(s []byte) []byte {
        n := 0
        for _, b := range s {
            if ('a' <= b && b <= 'z') ||
                ('A' <= b && b <= 'Z') ||
                ('0' <= b && b <= '9') ||
                b == ' ' {
                s[n] = b
                n++
            }
        }
        return s[:n]
    }
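
    One consequence of the in-place version is worth seeing in a small example: the returned slice shares its backing array with the argument, so the caller's original slice is overwritten. The sample input and the main function below are illustrative only.

```go
package main

import "fmt"

// strip compacts the surviving bytes to the front of s and reslices.
func strip(s []byte) []byte {
	n := 0
	for _, b := range s {
		if ('a' <= b && b <= 'z') ||
			('A' <= b && b <= 'Z') ||
			('0' <= b && b <= '9') ||
			b == ' ' {
			s[n] = b
			n++
		}
	}
	return s[:n]
}

func main() {
	buf := []byte("a-b-c!") // hypothetical sample input
	out := strip(buf)

	fmt.Printf("%q\n", out)        // "abc"
	fmt.Printf("%q\n", buf)        // "abc-c!": the front of buf was overwritten
	fmt.Println(&out[0] == &buf[0]) // true: same backing array
}
```

    If the original text is still needed afterwards, the caller has to copy it before calling strip, which gives back the allocation this version was meant to avoid.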
    

    Depending on actual data used, one of the approaches in this answer may be faster than the approaches in the question.
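
    Since the outcome depends on the data, it is worth measuring on a representative sample. The sketch below is one possible way to do that with testing.Benchmark from an ordinary program. The names stripBytes and nonAlnum and the sample input are assumptions made for this example. The regex is compiled once at package level, which differs from the question's stripRegex, so the comparison measures matching rather than compilation.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"testing"
)

// nonAlnum is compiled once, unlike in the question's stripRegex.
var nonAlnum = regexp.MustCompile("[^a-zA-Z0-9 ]+")

func stripRegex(in string) string {
	return nonAlnum.ReplaceAllString(in, "")
}

// stripBytes is the byte loop from this answer with preallocation.
func stripBytes(s string) string {
	var result strings.Builder
	result.Grow(len(s))
	for i := 0; i < len(s); i++ {
		b := s[i]
		if ('a' <= b && b <= 'z') ||
			('A' <= b && b <= 'Z') ||
			('0' <= b && b <= '9') ||
			b == ' ' {
			result.WriteByte(b)
		}
	}
	return result.String()
}

func main() {
	// Hypothetical input; replace with a sample of the real text.
	input := strings.Repeat("Hello, World! 123 (test) ", 500)

	cases := []struct {
		name string
		fn   func(string) string
	}{
		{"regex", stripRegex},
		{"bytes", stripBytes},
	}
	for _, c := range cases {
		r := testing.Benchmark(func(b *testing.B) {
			b.ReportAllocs()
			for i := 0; i < b.N; i++ {
				c.fn(input)
			}
		})
		fmt.Printf("%-6s %s %s\n", c.name, r.String(), r.MemString())
	}
}
```

    The same functions can instead be placed in a _test.go file and run with go test -bench, which is how the figures in the question appear to have been produced.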

    This answer was selected by the asker as the best answer.