douyan2680 2014-07-23 20:06
Views: 80
Accepted

Strange behavior of bufio.Scanner when reading a file line by line

I use bufio.Scanner to read a file line by line into the variable wordlist (a [][]byte).

This is the code (tested with Go 1.1 and 1.3):

package main

import (
    "bufio"
    "fmt"
    "log"
    "os"
)

func main() {
    fle, err := os.Open("words.txt")
    if err != nil {
        log.Fatal(err)
    }
    defer fle.Close()

    scanner := bufio.NewScanner(fle)

    n := 1000
    dCnt := 5
    var wordlist [][]byte

    for scanner.Scan() {
        if len(wordlist) == n {
            break
        }
        word := scanner.Bytes()
        for ii := 0; ii < len(wordlist); ii++ {
            if string(word) == string(wordlist[ii]) {
                log.Println(ii, string(word), string(wordlist[ii]))
                log.Println(len(wordlist), "double")

                dCnt--
                if dCnt == 0 {
                    for i, v := range wordlist {
                        fmt.Println(i, string(v))
                    }
                    log.Fatal("double")
                }
            }
        }
        wordlist = append(wordlist, word)
    }
    if err := scanner.Err(); err != nil {
        log.Fatal(err)
    }
}

words.txt is a file of 5040 lines, each a permutation of the sequence "abcdefg":

line 1 .. 
abcdefg
abcdegf
abcdfeg
abcdfge
..
line 510 ..
afcdbge
afcdebg
afcdegb
afcdgbe
afcdgeb
.. line 5040

It was generated by this small Python 2 script:

from itertools import permutations as perm
c = "abcdefg"
p = perm(c, len(c))
with file('words.txt','wb') as outFle:
    for i in xrange(5040):
        n = ''.join(p.next())
        print >> outFle, n

The problem is that after running the Go program above, wordlist contains the following:

index string(wordlist[])

0 afcdebg      <-- this is line 513 of words.txt
1 afcdegb
2 afcdgbe
3 afcdgeb
...
510 bdefcag
511 bdefcga
512 afcdebg    <-- this is the beginning of a repetition of lines 513 .. 1024 of words.txt
513 afcdegb
514 afcdgbe 

Instead, wordlist should contain the first 1000 lines of words.txt.

Any ideas?

The answer was given by Daniel Darabos (see below).

Changing

word := scanner.Bytes()

to

word := scanner.Text()

did the job.

(Thanks for your help!)

1 answer

  • douke7274 2014-07-23 20:52

    The documentation of Scanner.Bytes says:

    The underlying array may point to data that will be overwritten by a subsequent call to Scan.

    So if you save the returned slice, you can expect to see its contents change. This wreaks havoc in your application. Better to not save the returned slice!
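To see the effect in isolation, here is a minimal sketch. It keeps the slice returned by Bytes for the first line, scans the rest of the input, and then compares what the saved slice holds with a string snapshot taken at the start. The function name firstLineBeforeAndAfter and the numbered input lines are made up for this illustration. Whether the saved slice actually changes depends on the input size and on the Scanner's internal buffering, so no particular output is promised for it.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// firstLineBeforeAndAfter scans all of input. It returns the first line as it
// looked right after the first Scan (snapshotted into a string), and as the
// slice saved from Bytes looks once scanning has finished.
func firstLineBeforeAndAfter(input string) (before, after string) {
	scanner := bufio.NewScanner(strings.NewReader(input))
	var saved []byte
	first := true
	for scanner.Scan() {
		if first {
			saved = scanner.Bytes() // still refers to the Scanner's own buffer
			before = string(saved)  // converting to a string copies the bytes
			first = false
		}
	}
	return before, string(saved)
}

func main() {
	// Hypothetical input: 1024 lines of 8 bytes each (7 digits plus a newline),
	// chosen to be larger than what the Scanner is likely to buffer at once.
	var sb strings.Builder
	for i := 0; i < 1024; i++ {
		fmt.Fprintf(&sb, "%07d\n", i)
	}
	before, after := firstLineBeforeAndAfter(sb.String())
	fmt.Println("first line when read:      ", before)
	fmt.Println("saved slice after scanning:", after)
}
```

If the two printed values differ, the saved slice has been overwritten by a later Scan, which is the behavior the documentation warns about.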

    A nice solution is to build a string from the bytes:

    word := string(scanner.Bytes())
    

    Then you can work with strings everywhere and the code becomes more pleasant.
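As a rough sketch of what "strings everywhere" could look like for the program in the question: the helper below collects up to a given number of lines with scanner.Text() and counts repeated lines with a map instead of a nested loop. The names readWords, sampleReader and sample are assumptions made for this example; a real program would pass the file returned by os.Open.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"log"
	"strings"
)

// sample stands in for the first few lines of words.txt.
const sample = "abcdefg\nabcdegf\nabcdfeg\nabcdfge\n"

// sampleReader returns a reader over sample; a real program would pass the
// *os.File returned by os.Open instead.
func sampleReader() io.Reader { return strings.NewReader(sample) }

// readWords returns up to limit lines from r as strings, along with the
// number of lines that repeated an earlier line.
func readWords(r io.Reader, limit int) ([]string, int, error) {
	scanner := bufio.NewScanner(r)
	seen := make(map[string]bool)
	var words []string
	dups := 0
	for len(words) < limit && scanner.Scan() {
		// Text returns a newly allocated string, so later Scans cannot change it.
		word := scanner.Text()
		if seen[word] {
			dups++
		}
		seen[word] = true
		words = append(words, word)
	}
	return words, dups, scanner.Err()
}

func main() {
	words, dups, err := readWords(sampleReader(), 1000)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(len(words), "words read,", dups, "duplicates")
}
```

Because strings are comparable and usable as map keys, the duplicate check no longer needs the string(...) conversions from the original loop.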

    What is going on?

    Why does Scanner.Bytes hate me? The answer is also in the documentation:

    It does no allocation.

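    This makes the Scanner nicely efficient. What you see is consistent with the Scanner reading into a single internal buffer and reusing it: each line of words.txt takes 8 bytes (7 letters plus a newline), so 512 lines fill exactly 4096 bytes, which I believe is the buffer's initial size. Once the buffer is refilled, the slices you saved earlier point at the new data.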

    This is not a problem in applications where you do not need to keep references to the lines. (For example, a grep-like program only looks at each line once.) Often you parse the line and store the parsed result rather than the raw bytes. But if you want to store the raw byte data, you are responsible for copying it out of the Scanner.
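If the raw bytes are what you want to keep, one way to copy them out is sketched below. The helper names readLinesCopied and numberedLines are hypothetical, and the numbered lines merely stand in for words.txt; the only technique being shown is appending the result of Bytes to a fresh slice before storing it.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"log"
	"strings"
)

// numberedLines returns a reader over n hypothetical 8-byte lines
// ("0000000\n", "0000001\n", ...), standing in for words.txt.
func numberedLines(n int) io.Reader {
	var sb strings.Builder
	for i := 0; i < n; i++ {
		fmt.Fprintf(&sb, "%07d\n", i)
	}
	return strings.NewReader(sb.String())
}

// readLinesCopied returns every line of r as a separately allocated []byte.
func readLinesCopied(r io.Reader) ([][]byte, error) {
	scanner := bufio.NewScanner(r)
	var lines [][]byte
	for scanner.Scan() {
		// Appending to an empty slice copies the bytes into fresh storage,
		// so the stored line no longer refers to the Scanner's buffer.
		line := append([]byte{}, scanner.Bytes()...)
		lines = append(lines, line)
	}
	return lines, scanner.Err()
}

func main() {
	lines, err := readLinesCopied(numberedLines(1024))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(len(lines), "lines; first:", string(lines[0]),
		"last:", string(lines[len(lines)-1]))
}
```

The cost is one allocation per line, which is exactly what Bytes avoids when you do not need to keep the data.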

    This may be a hassle, but while you can implement the convenient behavior on top of the inconvenient one, it would be impossible to implement the efficient behavior on top of the inefficient one.


    Also, a simpler (Python 2) script for generating the input:

    import itertools
    for p in itertools.permutations('abcdefg'):
      print ''.join(p)
    