Go strings.Contains() 2x slower than Python3?

I am converting a text pattern scanner from Python 3 to Go 1.10, and was surprised to find that the Go version is actually 2 times slower. Profiling shows that the culprit is strings.Contains(). See the simple benchmarks below. Did I miss anything? Could you recommend a faster pattern search algorithm for Go that would perform better in this case? I'm not concerned about startup time, since the same pattern will be used to scan millions of files.

Py3 benchmark:

import time
import re

RUNS = 10000

if __name__ == '__main__':
    with open('data.php') as fh:
        testString = fh.read()

    def do():
        return "576ad4f370014dfb1d0f17b0e6855f22" in testString

    start = time.time()
    for i in range(RUNS):
        _ = do()
    duration = time.time() - start
    print("Python: %.2fs" % duration)

Go1.10 benchmark:

package main

import (
    "fmt"
    "io/ioutil"
    "log"
    "strings"
    "time"
)

const (
    runs = 10000
)

func main() {
    fname := "data.php"
    testdata := readFile(fname)
    needle := "576ad4f370014dfb1d0f17b0e6855f22"
    start := time.Now()

    for i := 0; i < runs; i++ {
        _ = strings.Contains(testdata, needle)

    }
    duration := time.Since(start)
    fmt.Printf("Go: %.2fs\n", duration.Seconds())
}

func readFile(fname string) string {
    data, err := ioutil.ReadFile(fname)
    if err != nil {
        log.Fatal(err)
    }
    return string(data)
}

data.php is a 528KB file that can be found here.

Output:

Go:     1.98s
Python: 0.84s

douren8379 2018-08-04 19:47
I've done more benchmarking with various string search implementations that I found on Wikipedia, such as:

https://github.com/cloudflare/ahocorasick

https://github.com/cubicdaiya/bms

https://github.com/kkdai/kmp

https://github.com/paddie/gokmp

https://github.com/hillu/go-yara (Yara seems to implement Aho & Corasick under the hood).

Benchmark results (code here):

BenchmarkStringsContains-4     10000      208055 ns/op
BenchmarkBMSSearch-4            1000     1856732 ns/op
BenchmarkPaddieKMP-4            2000     1069495 ns/op
BenchmarkKkdaiKMP-4             1000     1440147 ns/op
BenchmarkAhocorasick-4          2000      935885 ns/op
BenchmarkYara-4                 1000     1237887 ns/op

Then, I benchmarked my practical use case of testing about 1100 signatures (100 regex, 1000 literals) against a 500KB no-match file, for both the native (strings.Contains and regexp) and the C-based Yara implementations:

BenchmarkScanNative-4      2    824328504 ns/op
BenchmarkScanYara-4      300      5338861 ns/op

Even though C calls in Go are supposedly expensive, for these "heavy" operations the gain is remarkable. Side observation: it takes Yara just 5 times as much CPU time to match 1100 signatures instead of 1.
This answer was selected as the best answer by the asker.
