doufeng9567 2017-02-20 14:37
Views: 232
Accepted

How to decompress (inflate) a PDF stream

I'm working with the 2016-W4 PDF, which has two large streams (pages 1 and 2) along with a number of other objects and smaller streams. I'm trying to decompress (inflate) the streams so I can work with the source data, but I'm struggling: all I get are "corrupt input" and "invalid checksum" errors.

I've written a test script to help debug, and have pulled out smaller streams from the file to test with.

Here are 2 streams from the original pdf, along with their length objects:

stream 1:

149 0 obj
<< /Length 150 0 R /Filter /FlateDecode /Type /XObject /Subtype /Form /FormType
1 /BBox [0 0 8 8] /Resources 151 0 R >>
stream
x+TT(T0B ,JUWÈS0Ð37±402V(NFJSþ¶
«
endstream
endobj
150 0 obj
42
endobj

stream 2

142 0 obj
<< /Length 143 0 R /Filter /FlateDecode /Type /XObject /Subtype /Form /FormType
1 /BBox [0 0 0 0] /Resources 144 0 R >>
stream
x+Tçã
endstream
endobj
143 0 obj
11
endobj

I copied just the stream contents into new files within Vim (excluding the carriage returns after stream and before endstream).

I've tried both:

  • compress/flate (rfc-1951) – (removing the first 2 bytes (CMF, FLG))
  • compress/zlib (rfc-1950)

I've converted the streams to []byte for the below:

package main

import (
    "bytes"
    "compress/flate"
    "compress/zlib"
    "fmt"
    "io"
    "os"
)

var (
    flateReaderFn = func(r io.Reader) (io.ReadCloser, error) { return flate.NewReader(r), nil }
    zlibReaderFn  = func(r io.Reader) (io.ReadCloser, error) { return zlib.NewReader(r) }
)

func deflate(b []byte, skip, length int, newReader func(io.Reader) (io.ReadCloser, error)) {
    // rfc-1950
    // --------
    //   First 2 bytes
    //   [120, 1] - CMF, FLG
    //
    //   CMF: 120
    //     0111 1000
    //     ↑    ↑
    //     |    CM(8) = deflate compression method
    //     CINFO(7)   = 32k LZ77 window size
    //
    //   FLG: 1
    //     0001 ← FCHECK
    //            (CMF*256 + FLG) % 31 == 0
    //             120 * 256 + 1 = 30721
    //                             30721 % 31 == 0

    stream := bytes.NewReader(b[skip:length])
    r, err := newReader(stream)
    if err != nil {
        fmt.Println("\nfailed to create reader,", err)
        return
    }

    n, err := io.Copy(os.Stdout, r)
    if err != nil {
        if n > 0 {
            fmt.Print("\n")
        }
        fmt.Println("\nfailed to write contents from reader,", err)
        return
    }
    fmt.Printf("%d bytes written\n", n)
    r.Close()
}

func main() {
    //readerFn, skip := flateReaderFn, 2 // compress/flate RFC-1951, ignore first 2 bytes
    readerFn, skip := zlibReaderFn, 0 // compress/zlib RFC-1950, ignore nothing

    //                                                                                                ⤹ This is where the error occurs: `flate: corrupt input before offset 19`.
    stream1 := []byte{120, 1, 43, 84, 8, 84, 40, 84, 48, 0, 66, 11, 32, 44, 74, 85, 8, 87, 195, 136, 83, 48, 195, 144, 51, 55, 194, 177, 52, 48, 50, 86, 40, 78, 70, 194, 150, 74, 83, 8, 4, 0, 195, 190, 194, 182, 10, 194, 171, 10}
    stream2 := []byte{120, 1, 43, 84, 8, 4, 0, 1, 195, 167, 0, 195, 163, 10}

    fmt.Println("----------------------------------------\nStream 1:")
    deflate(stream1, skip, 42, readerFn) // flate: corrupt input before offset 19

    fmt.Println("----------------------------------------\nStream 2:")
    deflate(stream2, skip, 11, readerFn) // invalid checksum
}

I'm sure I'm doing something wrong somewhere, I just can't quite see it.

(The pdf does open in a viewer)

2 answers

  • dongwen5351 2017-02-21 12:23

    Binary data should never be copied out of, or saved from, a text editor. There may be cases where this succeeds, which only adds fuel to the fire.

    Your data that you eventually "mined out" from the PDF is most likely not identical to the actual data that is in the PDF. You should take the data from a hex editor (e.g. try hecate for something new), or write a simple app that saves it (which strictly handles the file as binary).
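
A minimal sketch of such a "simple app" in Go, reading the file strictly as bytes. The helper name `inflateFirstStream` is hypothetical, and the sketch assumes the first stream in the file uses /FlateDecode without a predictor and that the literal text "endstream" does not occur inside the compressed data. It is not a PDF parser.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"errors"
	"fmt"
	"io"
	"os"
)

// inflateFirstStream finds the first "stream" ... "endstream" pair in pdf and
// inflates the bytes in between as zlib (RFC 1950) data.
// Hypothetical helper: assumes a /FlateDecode stream and does no real parsing.
func inflateFirstStream(pdf []byte) ([]byte, error) {
	start := bytes.Index(pdf, []byte("stream"))
	if start < 0 {
		return nil, errors.New("no stream keyword found")
	}
	start += len("stream")
	// The keyword is followed by CRLF or LF; that end-of-line is not stream data.
	if bytes.HasPrefix(pdf[start:], []byte("\r\n")) {
		start += 2
	} else if bytes.HasPrefix(pdf[start:], []byte("\n")) {
		start++
	}
	end := bytes.Index(pdf[start:], []byte("endstream"))
	if end < 0 {
		return nil, errors.New("no endstream keyword found")
	}
	// An end-of-line before "endstream" is harmless here: the zlib reader
	// stops after the Adler-32 trailer and ignores what follows.
	zr, err := zlib.NewReader(bytes.NewReader(pdf[start : start+end]))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: inflate <file.pdf>")
		return
	}
	pdf, err := os.ReadFile(os.Args[1]) // raw bytes, no text decoding
	if err != nil {
		fmt.Println("failed to read file,", err)
		return
	}
	out, err := inflateFirstStream(pdf)
	if err != nil {
		fmt.Println("failed to inflate stream,", err)
		return
	}
	os.Stdout.Write(out)
}
```

Because the /Length values in the question are indirect references (e.g. `150 0 R`), the sketch sidesteps them and relies on the zlib reader knowing where the compressed data ends.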

    Hint #1:

    The binary data is displayed spread across multiple lines. Binary data has no notion of carriage returns or line breaks; those are textual controls. If line breaks appear, the editor interpreted the data as text, so some codes / characters were "consumed" to start a new line. Multiple sequences may be interpreted as the same newline (e.g. \n, \r\n). By excluding them you have already lost data; by including them you may already have a different sequence. And if the data was interpreted and displayed as text, more problems may arise: there are other control characters, and some characters may not appear at all when displayed.
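
The byte slices in the question show one such distortion: every byte above 0x7F appears as a two-byte UTF-8 sequence (for example `195, 167` where a single 0xE7 would be expected). The sketch below assumes, and this is only an assumption, that the editor read the bytes as Latin-1 and saved them as UTF-8, and reverses that mapping for the second pasted stream. The helper names `undoUTF8` and `inflate` are hypothetical.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"fmt"
	"io"
	"unicode/utf8"
)

// undoUTF8 maps each UTF-8 encoded code point back to one byte.
// Assumption: the original bytes were read as Latin-1 and re-saved as UTF-8.
func undoUTF8(pasted []byte) ([]byte, error) {
	out := make([]byte, 0, len(pasted))
	for len(pasted) > 0 {
		r, size := utf8.DecodeRune(pasted)
		if (r == utf8.RuneError && size == 1) || r > 0xFF {
			return nil, fmt.Errorf("byte sequence does not fit the Latin-1 assumption")
		}
		out = append(out, byte(r))
		pasted = pasted[size:]
	}
	return out, nil
}

// inflate decodes raw as zlib (RFC 1950) data, verifying the Adler-32 trailer.
func inflate(raw []byte) (string, error) {
	zr, err := zlib.NewReader(bytes.NewReader(raw))
	if err != nil {
		return "", err
	}
	defer zr.Close()
	out, err := io.ReadAll(zr)
	return string(out), err
}

func main() {
	// stream2 exactly as pasted in the question (14 bytes).
	pasted := []byte{120, 1, 43, 84, 8, 4, 0, 1, 195, 167, 0, 195, 163, 10}
	raw, err := undoUTF8(pasted)
	if err != nil {
		fmt.Println("failed to undo encoding,", err)
		return
	}
	raw = raw[:11] // /Length from object 143; drops the trailing newline
	out, err := inflate(raw)
	if err != nil {
		fmt.Println("failed to inflate,", err)
		return
	}
	fmt.Printf("%d raw bytes, inflated: %q\n", len(raw), out)
}
```

Under that assumption the 14 pasted bytes shrink to the 11 declared by /Length and decode to `q Q` with a valid checksum. The same mapping turns the first pasted stream into 42 bytes plus a trailing newline, which matches its /Length as well.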

    Hint #2:

    When flateReaderFn is used, decoding the 2nd example succeeds (completes without an error). This means "you were barking up the right tree", but the success depends on what the actual data is and to what extent it was "distorted" by the text editor.
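
One way to see why the raw flate reader succeeds where the zlib reader reports an invalid checksum: raw DEFLATE (RFC 1951) has no trailer, while zlib (RFC 1950) appends a big-endian Adler-32 of the uncompressed data. The sketch below inflates the second pasted stream after skipping the two header bytes, then compares the checksum of the output with the four bytes that follow the compressed data. The helper name `rawInflate` is hypothetical.

```go
package main

import (
	"bytes"
	"compress/flate"
	"encoding/binary"
	"fmt"
	"hash/adler32"
	"io"
)

// rawInflate decodes b as raw DEFLATE and also returns the input bytes
// that follow the compressed data (where zlib expects its Adler-32 trailer).
func rawInflate(b []byte) (out, rest []byte, err error) {
	br := bytes.NewReader(b) // implements io.ByteReader, so flate does not read ahead
	fr := flate.NewReader(br)
	defer fr.Close()
	out, err = io.ReadAll(fr)
	rest = b[len(b)-br.Len():]
	return out, rest, err
}

func main() {
	// stream2 exactly as pasted in the question.
	pasted := []byte{120, 1, 43, 84, 8, 4, 0, 1, 195, 167, 0, 195, 163, 10}

	// Skip CMF and FLG, stop at /Length (11), as the question's script does.
	out, rest, err := rawInflate(pasted[2:11])
	if err != nil {
		fmt.Println("failed to inflate,", err)
		return
	}
	fmt.Printf("inflated:               %q\n", out)
	fmt.Printf("Adler-32 of the output: %08x\n", adler32.Checksum(out))
	if len(rest) >= 4 {
		fmt.Printf("trailer in pasted data: %08x\n", binary.BigEndian.Uint32(rest[:4]))
	}
}
```

The compressed data itself contains no byte above 0x7F, so it survived the editor; only the trailer was altered, and the raw flate reader never looks at it.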

    Selected by the asker as the best answer.