Go strings.Contains() 2x slower than Python3?

I am converting a text pattern scanner from Python 3 to Go 1.10, and I'm surprised that the Go version is actually 2 times slower. Profiling shows that the culprit is strings.Contains(). See the simple benchmarks below. Did I miss anything? Could you recommend a faster pattern search algorithm for Go that would perform better in this case? I'm not bothered about startup time; the same pattern will be used to scan millions of files.

Py3 benchmark:

import time
import re

RUNS = 10000

if __name__ == '__main__':
    with open('data.php') as fh:
        testString = fh.read()

    def do():
        return "576ad4f370014dfb1d0f17b0e6855f22" in testString

    start = time.time()
    for i in range(RUNS):
        _ = do()
    duration = time.time() - start
    print("Python: %.2fs" % duration)

Go1.10 benchmark:

package main

import (
    "fmt"
    "io/ioutil"
    "log"
    "strings"
    "time"
)

const (
    runs = 10000
)

func main() {
    fname := "data.php"
    testdata := readFile(fname)
    needle := "576ad4f370014dfb1d0f17b0e6855f22"
    start := time.Now()

    for i := 0; i < runs; i++ {
        _ = strings.Contains(testdata, needle)

    }
    duration := time.Now().Sub(start)
    fmt.Printf("Go: %.2fs\n", duration.Seconds())
}

func readFile(fname string) string {
    data, err := ioutil.ReadFile(fname)
    if err != nil {
        log.Fatal(err)
    }
    return string(data)
}

data.php is a 528KB file that can be found here.

Output:

Go:     1.98s
Python: 0.84s

Answer by douren8379 (2018-08-04 19:47):

I've done more benchmarking with various string search implementations that I found on Wikipedia, such as:

https://github.com/cloudflare/ahocorasick

https://github.com/cubicdaiya/bms

https://github.com/kkdai/kmp

https://github.com/paddie/gokmp

https://github.com/hillu/go-yara (Yara seems to implement Aho & Corasick under the hood).
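To illustrate the kind of algorithm these packages implement, here is a minimal sketch of Boyer-Moore-Horspool, a simplified Boyer-Moore variant that preprocesses the pattern once and then reuses the shift table for every text. It is not the code of any of the libraries listed above; the names horspool, newHorspool and index are made up for this example.

```go
package main

import "fmt"

// horspool is a hypothetical helper type: a pattern plus its
// precomputed bad-character shift table.
type horspool struct {
	pattern string
	shift   [256]int
}

// newHorspool does the one-time preprocessing for a pattern.
func newHorspool(pattern string) *horspool {
	h := &horspool{pattern: pattern}
	m := len(pattern)
	// Bytes that do not occur in the pattern allow a full-length shift.
	for i := range h.shift {
		h.shift[i] = m
	}
	// For every byte except the last, record its distance to the pattern end.
	for i := 0; i < m-1; i++ {
		h.shift[pattern[i]] = m - 1 - i
	}
	return h
}

// index returns the byte offset of the first occurrence of the pattern
// in text, or -1 if there is none.
func (h *horspool) index(text string) int {
	m, n := len(h.pattern), len(text)
	if m == 0 {
		return 0
	}
	for pos := 0; pos+m <= n; {
		// Compare the current window right to left.
		j := m - 1
		for j >= 0 && text[pos+j] == h.pattern[j] {
			j--
		}
		if j < 0 {
			return pos
		}
		// Shift according to the last byte of the current window.
		pos += h.shift[text[pos+m-1]]
	}
	return -1
}

func main() {
	needle := "576ad4f370014dfb1d0f17b0e6855f22"
	h := newHorspool(needle) // built once, reused for every file
	texts := []string{"<?php echo 1; ?>", "<?php /* " + needle + " */ ?>"}
	for _, text := range texts {
		fmt.Println(h.index(text) >= 0)
	}
}
```

Whether a hand-written search like this beats strings.Contains depends on the data and the Go version, so it should be measured rather than assumed.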

Benchmark results (code here):

BenchmarkStringsContains-4     10000      208055 ns/op
BenchmarkBMSSearch-4            1000     1856732 ns/op
BenchmarkPaddieKMP-4            2000     1069495 ns/op
BenchmarkKkdaiKMP-4             1000     1440147 ns/op
BenchmarkAhocorasick-4          2000      935885 ns/op
BenchmarkYara-4                 1000     1237887 ns/op
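The linked benchmark code is not reproduced here. As a rough idea of how a number like the strings.Contains figure can be obtained, the sketch below uses testing.Benchmark from a plain program. The synthetic haystack produced by buildHaystack is an assumption standing in for data.php, and measure is a made-up helper name, so the absolute timings will not match the ones above.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// sink keeps the result alive so the search cannot be optimized away.
var sink bool

// buildHaystack returns at least size bytes of filler text that does not
// contain the needle (a stand-in for data.php, which is not available here).
func buildHaystack(size int) string {
	line := "<?php echo 'hello world'; ?>\n"
	return strings.Repeat(line, size/len(line)+1)
}

// measure reports the average time per strings.Contains call in ns/op.
func measure(haystack, needle string) int64 {
	res := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink = strings.Contains(haystack, needle)
		}
	})
	return res.NsPerOp()
}

func main() {
	haystack := buildHaystack(528 * 1024) // roughly the size of data.php
	needle := "576ad4f370014dfb1d0f17b0e6855f22"
	fmt.Printf("strings.Contains: %d ns/op\n", measure(haystack, needle))
}
```

In a real project the same loop would normally live in a BenchmarkXxx function in a _test.go file and be run with go test -bench.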

Then, I benchmarked my practical use case of testing about 1100 signatures (100 regex, 1000 literals) against a 500KB no-match file, for both the native (strings.Contains and regexp) and the C-based Yara implementations:

BenchmarkScanNative-4       2     824328504 ns/op
BenchmarkScanYara-4       300       5338861 ns/op

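Even though C calls in Go are supposedly expensive, the gain in these "heavy" operations is remarkable: the Yara scan is roughly 150 times faster than the native one (about 5.3 ms vs. 824 ms per op). Side observation: it takes Yara only about 4.3 times as much CPU time to match 1100 signatures as it does to match 1 (about 5.34 ms vs. 1.24 ms).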
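For context, the native approach mentioned above can be sketched roughly as follows: every literal gets its own strings.Contains pass and every regex its own regexp match, so on a no-match file the cost grows with the number of signatures. This is a guess at the general shape, not the benchmarked code; signatures, newSignatures and firstMatch are hypothetical names.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// signatures is a hypothetical container for the two signature kinds.
type signatures struct {
	literals []string
	patterns []*regexp.Regexp
}

// newSignatures compiles the regex sources once, up front.
func newSignatures(literals, regexes []string) (*signatures, error) {
	s := &signatures{literals: literals}
	for _, src := range regexes {
		re, err := regexp.Compile(src)
		if err != nil {
			return nil, err
		}
		s.patterns = append(s.patterns, re)
	}
	return s, nil
}

// firstMatch returns the first signature found in data. Each signature
// triggers a separate scan of data, so a file with no match pays for all.
func (s *signatures) firstMatch(data string) (string, bool) {
	for _, lit := range s.literals {
		if strings.Contains(data, lit) {
			return lit, true
		}
	}
	for _, re := range s.patterns {
		if re.MatchString(data) {
			return re.String(), true
		}
	}
	return "", false
}

func main() {
	s, err := newSignatures(
		[]string{"576ad4f370014dfb1d0f17b0e6855f22"},
		[]string{`eval\s*\(`},
	)
	if err != nil {
		panic(err)
	}
	name, ok := s.firstMatch("<?php eval ($payload); ?>")
	fmt.Println(name, ok)
}
```

A multi-pattern matcher such as the Aho-Corasick automaton that Yara seems to use instead walks the data once for all literals, which fits the observation that 1100 signatures cost only a few times more than 1.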
This answer was selected by the asker as the best answer.
