doujingao6210
2014-10-12 09:48 阅读 531
已采纳

为什么以下golang程序会抛出运行时内存不足错误?

This program is supposed to read a file consisting of pairs of ints (one pair per line) and remove duplicate pairs. While it works on small files, it throws a runtime error on huge files (say a file of 1.5 GB). Initially, I thought that it is the map data structure which is causing this, but even after commenting it out, it still runs out of memory. Any ideas why this is happening? How to rectify it? Here's a data file on which it runs out of memory: http://snap.stanford.edu/data/com-Orkut.html

package main

import (
    "bufio"
    "fmt"
    "os"
    "strconv"
    "strings"
)

// main reads a file of comma-separated integer pairs (one pair per line),
// optionally skips a number of leading lines, normalizes each pair so the
// smaller value comes first, and prints all pairs.
//
// Usage: undup <file> <lines-to-skip>
//
// NOTE(review): every pair is accumulated in memory before printing; for
// very large inputs this slice is the dominant memory cost.
func main() {
    file, err := os.Open(os.Args[1])
    if err != nil {
        panic(err.Error())
    }
    defer file.Close()
    type Edge struct {
        u, v int
    }
    //seen := make(map[Edge]bool)
    edges := []Edge{}
    scanner := bufio.NewScanner(file)

    // Skip the requested number of header lines.
    // BUG FIX: the Atoi error was silently discarded; a non-numeric
    // argument used to skip zero lines without any diagnostic.
    skip, err := strconv.Atoi(os.Args[2])
    if err != nil {
        panic(err.Error())
    }
    for i := skip; i > 0; i-- {
        scanner.Scan()
    }

    for scanner.Scan() {
        str := scanner.Text()
        edge := strings.Split(str, ",")
        // BUG FIX: a line without a comma used to panic with an
        // index-out-of-range on edge[1]; malformed lines are now skipped.
        if len(edge) < 2 {
            continue
        }
        u, err := strconv.Atoi(edge[0])
        if err != nil {
            continue // non-numeric field (e.g. a comment line)
        }
        v, err := strconv.Atoi(edge[1])
        if err != nil {
            continue
        }
        // Canonicalize so (u,v) and (v,u) map to the same key.
        var key Edge
        if u < v {
            key = Edge{u, v}
        } else {
            key = Edge{v, u}
        }
        //if seen[key] {
        //  continue
        //}
        //seen[key] = true
        edges = append(edges, key)
    }
    // BUG FIX: scanner errors (e.g. a line exceeding the buffer) were
    // silently swallowed, truncating the output.
    if err := scanner.Err(); err != nil {
        panic(err.Error())
    }
    for _, e := range edges {
        s := strconv.Itoa(e.u) + "," + strconv.Itoa(e.v)
        fmt.Println(s)
    }
}

A sample input is given below. The program can be run as follows (where the last input says how many lines to skip). go run undup.go a.txt 1

# 3072441,117185083
1,2
1,3
1,4
1,5
1,6
1,7
1,8
  • 点赞
  • 写回答
  • 关注问题
  • 收藏
  • 复制链接分享

2条回答 默认 最新

  • 已采纳
    douyue2313 douyue2313 2014-10-12 15:09

    I looked at this file: com-orkut.ungraph.txt and it contains 117,185,082 lines. The way your data is structured, that's at least 16 bytes per line. (Edge is two 64bit ints) That alone is 1.7GB. I have had this problem in the past, and it can be a tricky one. Are you trying to solve this for a specific use case (the file in question) or the general case?

    In the specific case there are a few things about the data you could leverage: (1) the keys are sorted, (2) it looks like it stores every connection twice, and (3) the numbers don't seem huge. Here are a couple ideas:

    1. If you use a smaller type for the key you will use less memory. Try a uint32.

    2. You could stream (without using a map) the keys to another file by simply seeing if the 2nd column is greater than the first:

      if u < v {
          // write the key to another file
      } else {
          // skip it because v will eventually show v -> u
      }
      

    For the general case there are a couple strategies you could use:

    1. If the order of the resulting list doesn't matter: Use an on-disk hash table to store the map. There are a bunch of these: leveldb, sqlite, tokyo tyrant, ... A really nice one for go is bolt.

      In your for loop you would just check to see if a bucket contains the given key. (You can convert the ints into byte slices using encoding/binary) If it does, just skip it and continue. You will need to move the second for loop processing step into the first for loop so that you don't have to store all the keys.

    2. If the order of the resulting list does matter (and you can't guarantee the input is in order): You can also use an on-disk hash table, but it needs to be sorted. Bolt is sorted so that will work. Add all the keys to it, then traverse it in the second loop.

    Here is an example: (this program will take a while to run with 100 million records)

    package main
    
    import (
        "bufio"
        "encoding/binary"
        "fmt"
        "github.com/boltdb/bolt"
        "os"
        "strconv"
        "strings"
    )
    
    // Edge is an undirected graph edge identified by its two endpoints.
    type Edge struct {
        u, v int
    }

    // FromKey decodes a 16-byte big-endian key back into an Edge.
    func FromKey(bs []byte) Edge {
        first := binary.BigEndian.Uint64(bs[:8])
        second := binary.BigEndian.Uint64(bs[8:])
        return Edge{u: int(first), v: int(second)}
    }

    // Key encodes the edge as a fixed 16-byte big-endian value, suitable
    // for use as a sorted on-disk database key.
    func (e Edge) Key() [16]byte {
        var buf [16]byte
        binary.BigEndian.PutUint64(buf[:8], uint64(e.u))
        binary.BigEndian.PutUint64(buf[8:], uint64(e.v))
        return buf
    }
    
    // main streams tab-separated (u,v) pairs into an on-disk bolt bucket
    // keyed by the canonical big-endian encoding of each edge, so
    // duplicates collapse in the store itself and memory use stays
    // bounded by the batch size. It then prints every unique edge by
    // iterating the bucket (bolt iterates keys in sorted order).
    //
    // Usage: prog <file> <lines-to-skip>
    func main() {
        file, err := os.Open(os.Args[1])
        if err != nil {
            panic(err.Error())
        }
        defer file.Close()

        scanner := bufio.NewScanner(file)

        // Skip the requested number of header lines.
        // BUG FIX: the Atoi error was silently discarded.
        skip, err := strconv.Atoi(os.Args[2])
        if err != nil {
            panic(err.Error())
        }
        for i := skip; i > 0; i-- {
            scanner.Scan()
        }

        // BUG FIX: the open error was ignored (db, _ :=); a failed open
        // used to nil-pointer panic on the first db.Update instead of
        // reporting the real cause.
        db, err := bolt.Open("ex.db", 0777, nil)
        if err != nil {
            panic(err.Error())
        }
        defer db.Close()

        bucketName := []byte("edges")
        if err := db.Update(func(tx *bolt.Tx) error {
            _, err := tx.CreateBucketIfNotExists(bucketName)
            return err
        }); err != nil {
            panic(err.Error())
        }

        // Buffer edges and write them in batches to amortize bolt's
        // one-fsync-per-transaction cost.
        batchSize := 10000
        total := 0
        batch := make([]Edge, 0, batchSize)
        writeBatch := func() {
            total += len(batch)
            fmt.Println("write batch. total:", total)
            err := db.Update(func(tx *bolt.Tx) error {
                bucket := tx.Bucket(bucketName)
                for _, edge := range batch {
                    key := edge.Key()
                    // Value is nil: only key presence matters.
                    if err := bucket.Put(key[:], nil); err != nil {
                        return err
                    }
                }
                return nil
            })
            if err != nil {
                panic(err.Error())
            }
        }

        for scanner.Scan() {
            str := scanner.Text()
            edge := strings.Split(str, "\t")
            // BUG FIX: a line without a tab used to panic on edge[1];
            // malformed or comment lines are now skipped.
            if len(edge) < 2 {
                continue
            }
            u, err := strconv.Atoi(edge[0])
            if err != nil {
                continue
            }
            v, err := strconv.Atoi(edge[1])
            if err != nil {
                continue
            }
            // Canonical order (min, max) so (u,v) and (v,u) share a key.
            var key Edge
            if u < v {
                key = Edge{u, v}
            } else {
                key = Edge{v, u}
            }
            batch = append(batch, key)
            if len(batch) == batchSize {
                writeBatch()
                // Reset length, keep capacity.
                batch = batch[:0]
            }
        }
        // BUG FIX: scanner errors were never checked.
        if err := scanner.Err(); err != nil {
            panic(err.Error())
        }
        // Write anything leftover.
        writeBatch()

        if err := db.View(func(tx *bolt.Tx) error {
            return tx.Bucket(bucketName).ForEach(func(k, v []byte) error {
                fmt.Println(FromKey(k))
                return nil
            })
        }); err != nil {
            panic(err.Error())
        }
    }
    
    点赞 评论 复制链接分享
  • dpo15099 dpo15099 2014-10-14 05:40

    You are squandering memory. Here's how to rectify it.

    You give the sample input a.txt, 48 bytes.

    # 3072441,117185083
    1,2
    1,3
    1,4
    1,5
    

    On http://snap.stanford.edu/data/com-Orkut.html, I found http://snap.stanford.edu/data/bigdata/communities/com-orkut.ungraph.txt.gz, 1.8 GB uncompressed, 117,185,083 edges.

    # Undirected graph: ../../data/output/orkut.txt
    # Orkut
    # Nodes: 3072441 Edges: 117185083
    # FromNodeId    ToNodeId
    1   2
    1   3
    1   4
    1   5
    

    On http://socialnetworks.mpi-sws.org/data-imc2007.html, I found http://socialnetworks.mpi-sws.mpg.de/data/orkut-links.txt.gz, 3.4 GB uncompressed, 223,534,301 edges.

    1   2
    1   3
    1   4
    1   5
    

    Since they are similar, one program can handle all formats.

    Your Edge type is

    type Edge struct {
        u, v int
    }
    

    which is 16 bytes on a 64-bit architecture.

    Use

    type Edge struct {
        U, V uint32
    }
    

    which is 8 bytes, it is adequate.

    If the capacity of a slice is not large enough to fit the additional values, append allocates a new, sufficiently large underlying array that fits both the existing slice elements and the additional values. Otherwise, append re-uses the underlying array. For a large slice, the new array is 1.25 times the size of the old array. While the old array is being copied to the new array, 1 + 1.25 = 2.25 times the memory for the old array is required. Therefore, allocate the underlying array so that all values fit.

    make(T, n) initializes a map of type T with initial space for n elements. Provide a value for n to limit the cost of reorganization and fragmentation as elements are added. Hashing functions are often imperfect, which leads to wasted space. Eliminate the map as it's unnecessary. To eliminate duplicates, sort the slice in place and move the unique elements down.

    A string is immutable, therefore a new string is allocated for scanner.Text() to convert from a byte slice buffer. To parse numbers we use strconv. To minimize temporary allocations, use scanner.Bytes() and adapt strconv.ParseUint to accept a byte array argument (bytconv).

    For example,

    orkut.go

    package main
    
    import (
        "bufio"
        "bytes"
        "errors"
        "fmt"
        "os"
        "runtime"
        "sort"
        "strconv"
    )
    
    // Edge is an undirected edge stored as (U, V); uint32 endpoints halve
    // the footprint of the 64-bit int version.
    type Edge struct {
        U, V uint32
    }

    // String renders the edge as "U,V".
    func (e Edge) String() string {
        return fmt.Sprintf("%d,%d", e.U, e.V)
    }

    // ByKey sorts edges lexicographically by (U, V).
    type ByKey []Edge

    func (a ByKey) Len() int      { return len(a) }
    func (a ByKey) Swap(i, j int) { a[i], a[j] = a[j], a[i] }
    func (a ByKey) Less(i, j int) bool {
        x, y := a[i], a[j]
        if x.U != y.U {
            return x.U < y.U
        }
        return x.V < y.V
    }
    
    // countEdges scans the stream for an edge count. A header line of the
    // form "# Nodes: N Edges: M" or "# N,M" yields the declared count and
    // stops the scan; otherwise every non-comment line is counted as one
    // edge. The matched header (if any) and the count are echoed to
    // stdout.
    func countEdges(scanner *bufio.Scanner) int {
        var nNodes, nEdges int
        for scanner.Scan() {
            line := string(scanner.Bytes())
            if len(line) == 0 || line[0] != '#' {
                // Data line: count it.
                nEdges++
                continue
            }
            // Comment line: try both known header formats.
            matched := false
            if n, err := fmt.Sscanf(line, "# Nodes: %d Edges: %d", &nNodes, &nEdges); err == nil && n == 2 {
                matched = true
            } else if n, err := fmt.Sscanf(line, "# %d,%d", &nNodes, &nEdges); err == nil && n == 2 {
                matched = true
            }
            if !matched {
                continue
            }
            fmt.Println(line)
            break
        }
        if err := scanner.Err(); err != nil {
            panic(err.Error())
        }
        fmt.Println(nEdges)
        return nEdges
    }
    
    // loadEdges reads the named edge-list file and returns its edges,
    // canonicalized (U <= V), sorted, and with duplicates removed.
    //
    // The file is scanned twice: once to learn the edge count so the
    // slice can be allocated exactly once (avoiding append's growth
    // copies), then again to parse the edges. Fields may be tab- or
    // comma-separated; lines starting with '#' are skipped.
    func loadEdges(filename string) []Edge {
        file, err := os.Open(filename)
        if err != nil {
            panic(err.Error())
        }
        defer file.Close()

        scanner := bufio.NewScanner(file)
        nEdges := countEdges(scanner)
        edges := make([]Edge, 0, nEdges)

        // Rewind for the second pass.
        // BUG FIX: the original panicked with err.Error() when
        // offset != 0 even though err could be nil there, which would
        // itself crash with a nil dereference instead of a useful
        // message. A successful absolute seek to 0 always returns
        // offset 0, so checking err alone is sufficient.
        if _, err := file.Seek(0, os.SEEK_SET); err != nil {
            panic(err.Error())
        }

        // Separator detection: try tab first, fall back to comma and
        // stay there (the two known formats never mix separators).
        var sep byte = '\t'
        scanner = bufio.NewScanner(file)
        for scanner.Scan() {
            line := scanner.Bytes()
            if len(line) > 0 && line[0] == '#' {
                continue
            }
            i := bytes.IndexByte(line, sep)
            if i < 0 || i+1 >= len(line) {
                sep = ','
                i = bytes.IndexByte(line, sep)
                if i < 0 || i+1 >= len(line) {
                    panic("Invalid line format: " + string(line))
                }
            }
            // ParseUint works on the raw scanner bytes, avoiding a
            // string allocation per field.
            u, err := ParseUint(line[:i], 10, 32)
            if err != nil {
                panic(err.Error())
            }
            v, err := ParseUint(line[i+1:], 10, 32)
            if err != nil {
                panic(err.Error())
            }
            // Canonical order: smaller endpoint first.
            if u > v {
                u, v = v, u
            }
            edges = append(edges, Edge{uint32(u), uint32(v)})
        }
        if err := scanner.Err(); err != nil {
            panic(err.Error())
        }

        if len(edges) <= 1 {
            return edges
        }
        // Sort, then compact duplicates in place: j trails i and only
        // advances when a new distinct value appears.
        sort.Sort(ByKey(edges))
        j := 0
        for i := 1; i < len(edges); i++ {
            if edges[i] != edges[j] {
                j++
                edges[j] = edges[i]
            }
        }
        return edges[:j+1]
    }
    
    func main() {
        if len(os.Args) <= 1 {
            err := errors.New("Missing file name")
            panic(err.Error())
        }
        filename := os.Args[1]
        fmt.Println(filename)
        edges := loadEdges(filename)
    
        var ms runtime.MemStats
        runtime.ReadMemStats(&ms)
        fmt.Println(ms.Alloc, ms.TotalAlloc, ms.Sys, ms.Mallocs, ms.Frees)
        fmt.Println(len(edges), cap(edges))
        for i, e := range edges {
            fmt.Println(e)
            if i >= 10 {
                break
            }
        }
    }
    
    // bytconv from strconv

    // cutoff64 returns the first number n such that n*base >= 1<<64,
    // i.e. the smallest accumulator value at which another multiply by
    // base would overflow a uint64.
    func cutoff64(base int) uint64 {
        if base < 2 {
            return 0
        }
        return (1<<64-1)/uint64(base) + 1
    }

    // ParseUint is like strconv.ParseUint but accepts a byte slice,
    // avoiding the string allocation scanner.Text() would force on every
    // line. base 0 means "infer from prefix": 0x... is hex, 0... is
    // octal, otherwise decimal. bitSize caps the result (0 means the
    // platform int size). On failure it returns a *strconv.NumError
    // wrapping strconv.ErrSyntax or strconv.ErrRange.
    func ParseUint(s []byte, base int, bitSize int) (n uint64, err error) {
        var cutoff, maxVal uint64

        if bitSize == 0 {
            bitSize = int(strconv.IntSize)
        }

        s0 := s // remember the original input for error reporting
        switch {
        case len(s) < 1:
            err = strconv.ErrSyntax
            goto Error

        case 2 <= base && base <= 36:
            // valid base; nothing to do

        case base == 0:
            // Look for octal, hex prefix.
            switch {
            case s[0] == '0' && len(s) > 1 && (s[1] == 'x' || s[1] == 'X'):
                base = 16
                s = s[2:]
                if len(s) < 1 {
                    err = strconv.ErrSyntax
                    goto Error
                }
            case s[0] == '0':
                base = 8
            default:
                base = 10
            }

        default:
            err = errors.New("invalid base " + strconv.Itoa(base))
            goto Error
        }

        n = 0
        cutoff = cutoff64(base)
        maxVal = 1<<uint(bitSize) - 1

        for i := 0; i < len(s); i++ {
            var v byte
            d := s[i]
            switch {
            case '0' <= d && d <= '9':
                v = d - '0'
            case 'a' <= d && d <= 'z':
                v = d - 'a' + 10
            case 'A' <= d && d <= 'Z':
                v = d - 'A' + 10
            default:
                n = 0
                err = strconv.ErrSyntax
                goto Error
            }
            if int(v) >= base {
                n = 0
                err = strconv.ErrSyntax
                goto Error
            }

            if n >= cutoff {
                // n*base overflows
                n = 1<<64 - 1
                err = strconv.ErrRange
                goto Error
            }
            n *= uint64(base)

            n1 := n + uint64(v)
            if n1 < n || n1 > maxVal {
                // n+v overflows
                n = 1<<64 - 1
                err = strconv.ErrRange
                goto Error
            }
            n = n1
        }

        return n, nil

    Error:
        // FIX (go vet): keyed fields in the composite literal, so the
        // code is robust to field reordering in strconv.NumError.
        return n, &strconv.NumError{Func: "ParseUint", Num: string(s0), Err: err}
    }
    

    Output:

    $ go build orkut.go
    $ time ./orkut ~/release-orkut-links.txt
    /home/peter/release-orkut-links.txt
    223534301
    1788305680 1788327856 1904683256 135 50
    117185083 223534301
    1,2
    1,3
    1,4
    1,5
    1,6
    1,7
    1,8
    1,9
    1,10
    1,11
    1,12
    real    2m53.203s
    user    2m51.584s
    sys 0m1.628s
    $
    

    The orkut.go program with the release-orkut-links.txt file (3,372,855,860 (3.4 GB) bytes with 223,534,301 edges) uses about 1.8 GiB of memory. After eliminating duplicates, 117,185,083 unique edges remain. This matches the 117,185,083 unique edge com-orkut.ungraph.txt file.

    With 8 GB of memory on your machine, you can load much larger files.

    点赞 评论 复制链接分享

相关推荐