dongzhi1822 2018-07-31 11:01

Go strings.Contains() 2x slower than Python3?

I am converting a text pattern scanner from Python3 to Go1.10, and was surprised to find that the Go version is actually 2 times slower. Profiling shows that the culprit is strings.Contains(). See the simple benchmarks below. Did I miss anything? Can you recommend a faster pattern search algorithm for Go that would perform better in this case? Startup time does not matter, because the same pattern will be used to scan millions of files.

Py3 benchmark:

import time
import re

RUNS = 10000

if __name__ == '__main__':
    with open('data.php') as fh:
        testString = fh.read()

    def do():
        return "576ad4f370014dfb1d0f17b0e6855f22" in testString

    start = time.time()
    for i in range(RUNS):
        _ = do()
    duration = time.time() - start
    print("Python: %.2fs" % duration)

Go1.10 benchmark:

package main

import (
    "fmt"
    "io/ioutil"
    "log"
    "strings"
    "time"
)

const (
    runs = 10000
)

func main() {
    fname := "data.php"
    testdata := readFile(fname)
    needle := "576ad4f370014dfb1d0f17b0e6855f22"
    start := time.Now()

    for i := 0; i < runs; i++ {
        _ = strings.Contains(testdata, needle)
    }
    duration := time.Since(start)
    fmt.Printf("Go: %.2fs\n", duration.Seconds())
}

func readFile(fname string) string {
    data, err := ioutil.ReadFile(fname)
    if err != nil {
        log.Fatal(err)
    }
    return string(data)
}

data.php is a 528KB file that can be found here.

Output:

Go:     1.98s
Python: 0.84s

  • douren8379 2018-08-04 19:47

    I've done more benchmarking with various string search implementations that I found on Wikipedia. Each benchmark below is named after the implementation it exercises.

    Benchmark results (code here):

    BenchmarkStringsContains-4         10000        208055 ns/op
    BenchmarkBMSSearch-4                1000       1856732 ns/op
    BenchmarkPaddieKMP-4                2000       1069495 ns/op
    BenchmarkKkdaiKMP-4                 1000       1440147 ns/op
    BenchmarkAhocorasick-4              2000        935885 ns/op
    BenchmarkYara-4                     1000       1237887 ns/op
    

    Then, I benchmarked my practical use case of testing about 1100 signatures (100 regex, 1000 literals) against a 500KB no-match file, for both the native (strings.Contains and regexp) and the C-based Yara implementations:

    BenchmarkScanNative-4              2     824328504 ns/op
    BenchmarkScanYara-4              300       5338861 ns/op
    

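    Even though C calls in Go are supposedly expensive, the gain in these "heavy" operations is remarkable: the Yara scan is roughly 150 times faster than the native one (about 5.3 ms versus 824 ms per scan). Side observation: it takes Yara less than 5 times as much CPU time to match 1100 signatures as it takes to match 1.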
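
To illustrate why the native scan scales poorly, here is a minimal sketch of what a strings.Contains plus regexp scanner might look like. It is an assumption about the general shape, not the answer's actual code: the `signatures` type, `newSignatures`, `scan`, and the sample patterns are all hypothetical. Each signature makes its own pass over the input, so a no-match file costs about one full scan per signature, whereas a multi-pattern engine such as Yara can match many signatures in far fewer passes.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// signatures holds hypothetical literal and regex patterns.
type signatures struct {
	literals []string
	regexes  []*regexp.Regexp
}

// newSignatures compiles the regex patterns once, up front, so the cost is
// paid at startup and not per file.
func newSignatures(literals, patterns []string) *signatures {
	s := &signatures{literals: literals}
	for _, p := range patterns {
		s.regexes = append(s.regexes, regexp.MustCompile(p))
	}
	return s
}

// scan returns the first matching signature, or "" if nothing matches.
// Every signature is checked with a separate pass over data.
func (s *signatures) scan(data string) string {
	for _, lit := range s.literals {
		if strings.Contains(data, lit) {
			return lit
		}
	}
	for _, re := range s.regexes {
		if re.MatchString(data) {
			return re.String()
		}
	}
	return ""
}

func main() {
	sigs := newSignatures(
		[]string{"576ad4f370014dfb1d0f17b0e6855f22", "eval(base64_decode("},
		[]string{`(?i)gzinflate\s*\(`},
	)
	fmt.Printf("%q\n", sigs.scan("<?php echo 'clean file'; ?>"))
	fmt.Printf("%q\n", sigs.scan("<?php eval(base64_decode($x)); ?>"))
}
```

With 1100 signatures and a 500KB no-match file, a loop of this shape reads the input about 1100 times, which is consistent with the large gap between the two benchmark lines above.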
