du512053619 2017-12-28 02:32
浏览 604

更快的sqlite3查询吗? 我需要尽快处理100万以上的行

What's the fastest way to read a sqlite3 table in golang?

package main

import (
    "fmt"
    "database/sql"
    _ "github.com/mattn/go-sqlite3"
    "log"
    "time"
)

func main() {
    start := time.Now()

    db, err := sql.Open("sqlite3", "/Users/robertking/go/src/bitbucket.org/thematicanalysis/optimization_test/robs.db")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()
    rows, err := db.Query("select * from data")
    if err != nil {
        log.Fatal(err)
    }
    defer rows.Close()

    for rows.Next() {
    }
    err = rows.Err()
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(time.Since(start))
}

This takes 8 seconds in Go because .Next is slow. In python a fetchall takes only 4 seconds! I'm rewriting in GO to gain performance not lose performance.

Here is the python code, I couldn't find an equivalent of fetchall in go:

import time

start = time.time()
import sqlite3
conn = sqlite3.connect('/Users/robertking/go/src/bitbucket.org/thematicanalysis/optimization_test/robs.db')
c = conn.cursor()
c.execute("SELECT * FROM data")
x = c.fetchall()
print time.time() - start

Edit: adding bounty. I'm reading the data in go, python and C, here are results. Dont want to use C, but will stick with python if GO isnt faster.:

py: 2.45s
go: 2.13s (using github.com/mxk/go-sqlite/sqlite3 instead of github.com/mattn/go-sqlite3)
c:  0.32s

I feel like go should be closer to the c side of thing? anyone know how to make it faster? is it possible to avoid mutex with readonly mode?

edit:

It seems like all the sqlite3 implementations are slow (too much reflection and too many cgo calls for conversions). So i'll have to just write my own interface.

Here's the schema:

CREATE TABLE mytable
(
  c0   REAL,
  c1   INTEGER,
  c15  TEXT,
  c16  TEXT,
  c17  TEXT,
  c18  TEXT,
  c19  TEXT,
  c47  TEXT,
  c74  REAL DEFAULT 0,
  c77  TEXT,
  c101 TEXT,
  c103 TEXT,
  c108 TEXT,
  c110 TEXT,
  c125 TEXT,
  c126 TEXT,
  c127 REAL DEFAULT 0,
  x    INTEGER
    PRIMARY KEY
);

and the query is dynamic but usually something like this:

SELECT c77,c77,c125,c126,c127,c74 from mytable

edit:

looks like i'll fork the sqlite3 implementation and make some methods that focus on performance,

this is an example of some code which is much faster:.

package main


/*
 #cgo LDFLAGS: -l sqlite3

#include "sqlite3.h"
*/
import "C"

import (
    //"database/sql"
    "log"
    "reflect"
    "unsafe"
)

type Row struct {
    v77 string
    v125 string
    v126 string
    v127 float64
    v74 float64
}

// cStr returns a pointer to the first byte in s.
func cStr(s string) *C.char {
    h := (*reflect.StringHeader)(unsafe.Pointer(&s))
    return (*C.char)(unsafe.Pointer(h.Data))
}

func main() {
    getDataFromSqlite()
}

func getDataFromSqlite() {
    var db *C.sqlite3
    name := "../data_dbs/all_columns.db"
    rc := C.sqlite3_open_v2(cStr(name+"\x00"), &db, C.SQLITE_OPEN_READONLY, nil)

  var stmt *C.sqlite3_stmt;
  rc = C.sqlite3_prepare_v2(db, cStr("SELECT c77,c125,c126,c127,c74 from data\x00"), C.int(-1), &stmt, nil);
  rc = C.sqlite3_reset(stmt);

    var result C.double
    result = 0.0
    rc = C.sqlite3_step(stmt)
    for rc == C.SQLITE_ROW {
    C.GoString((*C.char)(unsafe.Pointer(C.sqlite3_column_text(stmt, 0))))
    C.GoString((*C.char)(unsafe.Pointer(C.sqlite3_column_text(stmt, 1))))
    C.GoString((*C.char)(unsafe.Pointer(C.sqlite3_column_text(stmt, 2))))
    C.sqlite3_column_double(stmt, 3)
    result += C.sqlite3_column_double(stmt, 4)
        rc = C.sqlite3_step(stmt)
  }
    log.Println(result)
}
  • 写回答

2条回答 默认 最新

  • douyu4535 2017-12-31 14:42
    关注

    Introduction

    My assumption was that we have a problem with how the performance is measured here, so I wrote a little Go program to generate records and save them into a SQLite database as well as a Python and Go implementation of a little task to do on those records.

    You can find the according repository at https://github.com/mwmahlberg/sqlite3perf

    The data model

    The records generated consist of

    The table's schema is relatively simple:

    sqlite> .schema
    CREATE TABLE bench (ID int PRIMARY KEY ASC, rand TEXT, hash TEXT);
    

    First I generated 1.5M records and vacuumed the sqlite database afterwards with

    $ ./sqlite3perf generate -r 1500000 -v
    

    Next I called the Go implementation against those 1.5M records. Both the Go as well as the Python implementation basically do the same simple task:

    1. Read all entries from the database.
    2. For each row, decode the random value from hex, then create a SHA256 hex from the result.
    3. Compare the generated SHA256 hex string against the one stored in the database
    4. If they match, continue, otherwise break.

    Assumptions

    My assumption explicitly was that Python did some type of lazy loading and/or possibly even execution of the SQL query.

    The results

    Go implementation

    $ ./sqlite3perf bench
    2017/12/31 15:21:48 bench called
    2017/12/31 15:21:48 Time after query: 4.824009ms
    2017/12/31 15:21:48 Beginning loop
    2017/12/31 15:21:48 Acessing the first result set 
        ID 0,
        rand: 6a8a4ad02e5e872a,
        hash: 571f1053a7c2aaa56e5c076e69389deb4db46cc08f5518c66a4bc593e62b9aa4
    took 548.32µs
    2017/12/31 15:21:50 641,664 rows processed
    2017/12/31 15:21:52 1,325,186 rows processed
    2017/12/31 15:21:53 1,500,000 rows processed
    2017/12/31 15:21:53 Finished loop after 4.519083493s
    2017/12/31 15:21:53 Average 3.015µs per record, 4.523936078s overall
    

    Note the values for "time after query" ( the time the query command took to return) and the time it took to access the first result set after the iteration over the result set was started.

    Python implementation

    $ python bench.py 
    12/31/2017 15:25:41 Starting up
    12/31/2017 15:25:41 Time after query: 1874µs
    12/31/2017 15:25:41 Beginning loop
    12/31/2017 15:25:44 Accessing first result set
        ID: 0
        rand: 6a8a4ad02e5e872a
        hash: 571f1053a7c2aaa56e5c076e69389deb4db46cc08f5518c66a4bc593e62b9aa4
    took 2.719312 s
    12/31/2017 15:25:50 Finished loop after 9.147431s
    12/31/2017 15:25:50 Average: 6.098µs per record, 0:00:09.149522 overall
    

    Again, note the value for "time after query" and the time it took to access the first result set.

    Summary

    It took the Go implementation quite a while to return after the SELECT query was send, while Python seemed to be blazing fast in comparison. However, from the time it took to actually access the first result set, we can see that the Go implementation is more than 500 times faster to actually access the first result set (5.372329ms vs 2719.312ms) and about double as fast for the task at hand as the Python implementation.

    Notes

    • In order to prove the assumption that Python actually does lazy loading on the result set, each and every row and column had to be accessed in order to make sure that Python is forced to actually read the value from the database.
    • I chose a hashing task because presumably the implementation of SHA256 is highly optimised in both languages.

    Conclusion

    Python does seem to do lazy loading of result sets and possibly does not even execute a query unless the according result set is actually accessed. In this simulated scenario, mattn's SQLite driver for Go outperforms Python's by between roughly 100% and orders of magnitude, depending on what you want to do.

    Edit: So in order to have a fast processing, implement your task in Go. While it takes longer to send the actual query, accessing the individual rows of the result set is by far faster. I'd suggest starting out with a small subset of your data, say 50k records. Then, to further improve your code, use profiling to identify your bottlenecks. Depending on what you want to do during processing, pipelines for example might help, but how to improve the processing speed of the task at hand is difficult to say without actual code or a thorough description.

    评论

报告相同问题?

悬赏问题

  • ¥100 Jenkins自动化部署—悬赏100元
  • ¥15 关于#python#的问题:求帮写python代码
  • ¥20 MATLAB画图图形出现上下震荡的线条
  • ¥15 关于#windows#的问题:怎么用WIN 11系统的电脑 克隆WIN NT3.51-4.0系统的硬盘
  • ¥15 perl MISA分析p3_in脚本出错
  • ¥15 k8s部署jupyterlab,jupyterlab保存不了文件
  • ¥15 ubuntu虚拟机打包apk错误
  • ¥199 rust编程架构设计的方案 有偿
  • ¥15 回答4f系统的像差计算
  • ¥15 java如何提取出pdf里的文字?