dongyun51582 2014-08-03 18:00
浏览 472
已采纳

缓冲的golang通道丢失数据

I am trying to parse a huge Wiktionary dump using a goroutine, and am encountering a strange bug where the channel that the goroutine is reading from seems to be losing and corrupting data every time the channel blocks.

func main() {
    inFile, err := os.Open(*srcFile)
    if err != nil {
        log.LogErrorf("Error opening dump: %v", err)
        return
    }
    defer inFile.Close()

    var wg sync.WaitGroup
    input := make(chan []byte, 51)


    go func() {
        wg.Add(1)
        for line := range input {
            log.Printf("Bytes: %s", line)
            // process the line
        }
        wg.Done()
    }()

    scanner := bufio.NewScanner(inFile)
    count := 0
    for scanner.Scan() {
        count++
        log.Printf("Scanned: %d", count)
        if err := scanner.Err(); err != nil {
            log.LogErrorf("Error scanning: %v", err)
        }
        newestBytes := scanner.Bytes()
        log.Printf("Bytes: %s", newestBytes)
        input <- newestBytes
    }
    close(input)
    wg.Wait()
}

When I run this, I get the correct output. In particular, note lines 51 and 52.

2014/08/03 17:49:25 Scanned: 42
2014/08/03 17:49:25 Bytes:       <namespace key="115" case="case-sensitive">Citations talk</namespace>
2014/08/03 17:49:25 Scanned: 43
2014/08/03 17:49:25 Bytes:       <namespace key="116" case="case-sensitive">Sign gloss</namespace>
2014/08/03 17:49:25 Scanned: 44
2014/08/03 17:49:25 Bytes:       <namespace key="117" case="case-sensitive">Sign gloss talk</namespace>
2014/08/03 17:49:25 Scanned: 45
2014/08/03 17:49:25 Bytes:       <namespace key="828" case="case-sensitive">Module</namespace>
2014/08/03 17:49:25 Scanned: 46
2014/08/03 17:49:25 Bytes:       <namespace key="829" case="case-sensitive">Module talk</namespace>
2014/08/03 17:49:25 Scanned: 47
2014/08/03 17:49:25 Bytes:     </namespaces>
2014/08/03 17:49:25 Scanned: 48
2014/08/03 17:49:25 Bytes:   </siteinfo>
2014/08/03 17:49:25 Scanned: 49
2014/08/03 17:49:25 Bytes:   <page>
2014/08/03 17:49:25 Scanned: 50
2014/08/03 17:49:25 Bytes:     <title>Wiktionary:Welcome, newcomers</title>
2014/08/03 17:49:25 Scanned: 51
2014/08/03 17:49:25 Bytes:     <ns>4</ns>
2014/08/03 17:49:25 Scanned: 52
2014/08/03 17:49:25 Bytes:     <id>6</id>
2014/08/03 17:49:25 Scanned: 53
2014/08/03 17:49:25 Bytes:     <restrictions>edit=autoconfirmed:move=sysop</restrictions>
2014/08/03 17:49:25 Scanned: 54
2014/08/03 17:49:25 Bytes:     <revision>
2014/08/03 17:49:25 Scanned: 55
2014/08/03 17:49:25 Bytes:       <id>24557508</id>
2014/08/03 17:49:25 Scanned: 56
2014/08/03 17:49:25 Bytes:       <parentid>19020708</parentid>
2014/08/03 17:49:25 Scanned: 57
2014/08/03 17:49:25 Bytes:       <timestamp>2013-12-30T13:50:49Z</timestamp>
2014/08/03 17:49:25 Scanned: 58
2014/08/03 17:49:25 Bytes:       <contributor>
2014/08/03 17:49:25 Scanned: 59

Yet when I print line instead (what the goroutine is receiving), I get the output below. After line 51, the channel blocks and main scans and passes 51 more values to the channel. However, the next line that the goroutine reads is incorrect, and more than that, it is clearly malformed.

Bytes:       <namespace key="828" case="case-sensitive">Module</namespace>
2014/08/03 17:40:52 Bytes:       <namespace key="829" case="case-sensitive">Module talk</namespace>
2014/08/03 17:40:52 Bytes:     </namespaces>
2014/08/03 17:40:52 Bytes:   </siteinfo>
2014/08/03 17:40:52 Bytes:   <page>
2014/08/03 17:40:52 Bytes:     <title>Wiktionary:Welcome, newcomers</title>
2014/08/03 17:40:52 Scanned: 52
2014/08/03 17:40:52 Scanned: 53
2014/08/03 17:40:52 Scanned: 54
2014/08/03 17:40:52 Scanned: 55
2014/08/03 17:40:52 Scanned: 56
2014/08/03 17:40:52 Scanned: 57
2014/08/03 17:40:52 Scanned: 58
2014/08/03 17:40:52 Scanned: 59
2014/08/03 17:40:52 Scanned: 60
2014/08/03 17:40:52 Scanned: 61
2014/08/03 17:40:52 Scanned: 62
2014/08/03 17:40:52 Scanned: 63
2014/08/03 17:40:52 Scanned: 64
2014/08/03 17:40:52 Scanned: 65
2014/08/03 17:40:52 Scanned: 66
2014/08/03 17:40:52 Scanned: 67
2014/08/03 17:40:52 Scanned: 68
2014/08/03 17:40:52 Scanned: 69
2014/08/03 17:40:52 Scanned: 70
2014/08/03 17:40:52 Scanned: 71
2014/08/03 17:40:52 Scanned: 72
2014/08/03 17:40:52 Scanned: 73
2014/08/03 17:40:52 Scanned: 74
2014/08/03 17:40:52 Scanned: 75
2014/08/03 17:40:52 Scanned: 76
2014/08/03 17:40:52 Scanned: 77
2014/08/03 17:40:52 Scanned: 78
2014/08/03 17:40:52 Scanned: 79
2014/08/03 17:40:52 Scanned: 80
2014/08/03 17:40:52 Scanned: 81
2014/08/03 17:40:52 Scanned: 82
2014/08/03 17:40:52 Scanned: 83
2014/08/03 17:40:52 Scanned: 84
2014/08/03 17:40:52 Scanned: 85
2014/08/03 17:40:52 Scanned: 86
2014/08/03 17:40:52 Scanned: 87
2014/08/03 17:40:52 Scanned: 88
2014/08/03 17:40:52 Scanned: 89
2014/08/03 17:40:52 Scanned: 90
2014/08/03 17:40:52 Scanned: 91
2014/08/03 17:40:52 Scanned: 92
2014/08/03 17:40:52 Scanned: 93
2014/08/03 17:40:52 Scanned: 94
2014/08/03 17:40:52 Scanned: 95
2014/08/03 17:40:52 Scanned: 96
2014/08/03 17:40:52 Scanned: 97
2014/08/03 17:40:52 Scanned: 98
2014/08/03 17:40:52 Scanned: 99
2014/08/03 17:40:52 Scanned: 100
2014/08/03 17:40:52 Scanned: 101
2014/08/03 17:40:52 Scanned: 102
2014/08/03 17:40:52 Bytes: nd other refer
2014/08/03 17:40:52 Bytes: nce and instru
2014/08/03 17:40:52 Bytes: tional materials. It stipulates that any copy of the material,
2014/08/03 17:40:52 Bytes: even if modifi
2014/08/03 17:40:52 Bytes: d, carry the same licen
2014/08/03 17:40:52 Bytes: e. Those copies may be sold but, if
2014/08/03 17:40:52 Bytes: produced in quantity, have to be made available i
2014/08/03 17:40:52 Bytes:  a format which fac
2014/08/03 17:40:52 Bytes: litates further editing. 

I have tried to reproduce this in the Go playground but I have been unsuccessful - it seems like this is something to do with the way slices are passed in channels.

  • 写回答

1条回答 默认 最新

  • dongsong1911 2014-08-03 18:34
    关注

    The function Scanner.Bytes may return the same slice used internally by the scanner.

    func (s *Scanner) Bytes() []byte
    

    Bytes returns the most recent token generated by a call to Scan. The underlying array may point to data that will be overwritten by a subsequent call to Scan. It does no allocation.

    As per documentation, this slice may be overwritten by subsequent calls to Scanner.Scan. Since your code does not ensure that this slice is not used after the next call to Scanner.Scan (and in fact your code produces lines and consumes them asynchonously), it may contain garbage at the point where you're trying to use it.

    Explicitly copy the slice to make sure that the data is not being overwritten by subsequent calls to Scanner.Scan.

    input <- append(nil, newestBytes...)
    
    本回答被题主选为最佳回答 , 对您是否有帮助呢?
    评论

报告相同问题?

悬赏问题

  • ¥20 matlab计算中误差
  • ¥15 对于相关问题的求解与代码
  • ¥15 ubuntu子系统密码忘记
  • ¥15 信号傅里叶变换在matlab上遇到的小问题请求帮助
  • ¥15 保护模式-系统加载-段寄存器
  • ¥15 电脑桌面设定一个区域禁止鼠标操作
  • ¥15 求NPF226060磁芯的详细资料
  • ¥15 使用R语言marginaleffects包进行边际效应图绘制
  • ¥20 usb设备兼容性问题
  • ¥15 错误(10048): “调用exui内部功能”库命令的参数“参数4”不能接受空数据。怎么解决啊