2020-12-09 13:10

Decrypting HUGE files

Here is my current use case: huge files (a GB or more) must be PGP-decrypted and immediately re-encrypted. Currently I use subprocess to call GnuPG and pipe its output into my Python process for re-encryption.

GnuPG decrypts block by block, so whatever arrives on the pipe is fed to a generator for re-encryption.

I therefore stay within memory bounds, even for huge files, and decryption and re-encryption run in parallel.
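A minimal sketch of that pipeline, using `cat` as a stand-in for the real `gpg --decrypt ...` command (the command, chunk size, and `stream_decrypted` name are all illustrative, not part of any library):

```python
import subprocess

def stream_decrypted(cmd, chunk_size=1 << 16):
    """Spawn a subprocess and yield its stdout in fixed-size chunks."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    try:
        while True:
            chunk = proc.stdout.read(chunk_size)
            if not chunk:
                break
            yield chunk  # each block goes straight to the re-encryption stage
    finally:
        proc.stdout.close()
        proc.wait()

# Example wiring, with 'cat' standing in for ['gpg', '--decrypt', 'big.gpg']:
# for chunk in stream_decrypted(['cat', 'big.gpg']):
#     reencrypt(chunk)
```

Only one chunk is held in memory at a time, so peak usage is bounded by the chunk size regardless of the file size.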

As far as I understand, I cannot do that with PGPy, since it stores the message content internally in a bytearray. Moreover, for huge files, decryption blows up the memory.

It's a show-stopper for me at this point. Would it be possible to update the bytearray(os.path.getsize(filepath)) into a generator?

We could then have key.decrypt(msg, sink=None-or-file-like-obj-or-pipe): if sink is None, use the old bytearray; otherwise, stream into the file-like object or pipe through its read/write API.
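The proposed dispatch could look roughly like this. This is only a sketch of the suggestion: `_decrypt_to_bytes` and `_decrypt_stream` are hypothetical helper names, not existing PGPy API.

```python
def decrypt(self, message, sink=None, chunk_size=1 << 20):
    """Hypothetical signature: decrypt into memory, or stream into a sink.

    _decrypt_to_bytes and _decrypt_stream are placeholder names for the
    internals that would have to exist to support this.
    """
    if sink is None:
        # current behaviour: accumulate the whole plaintext in memory
        return self._decrypt_to_bytes(message)
    # streaming behaviour: write decrypted blocks as they are produced
    for block in self._decrypt_stream(message, chunk_size):
        sink.write(block)
```

With sink=None the API stays backward compatible; passing a file object or pipe keeps memory usage bounded by the chunk size.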

I'm not that new to Python, but the library is fairly complex, so you would have a better chance of fixing this issue than I would.




  • weixin_39904809 (4 months ago)
    def chunker(stream, chunk_size=None):
        """Lazy function (generator) to read a stream one chunk at a time."""
        if not chunk_size:
            chunk_size = 1 << 26  # 2**26 bytes, about 67 MB
        assert chunk_size >= 16
        yield chunk_size  # first value tells the consumer the chunk size
        while True:
            data = stream.read(chunk_size)
            if not data:
                return  # no more data
            yield data

    and then

    def decrypt_engine(key, passphrase):
        '''Coroutine: receives blocks of ciphertext via send() and yields the decrypted output (using PGP).'''
        assert isinstance(key, PGPKey)
        print('Starting the (de)cipher engine')  # or log
        with key.unlock(passphrase):  # raises an exception on a wrong passphrase
            cipherchunk = yield  # prime with next() before sending data
            while True:
                # hypothetical: PGPy's decrypt currently takes a whole PGPMessage,
                # not a raw chunk -- that is exactly what this issue asks for
                cipherchunk = yield key.decrypt(cipherchunk)

    Now the chunker can feed the decrypt_engine...

    Is that what you have in mind? Does it help?
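The wiring between the two generators would look something like this. A toy engine that upper-cases each chunk stands in for the PGP step, and io.BytesIO stands in for the file stream; everything here is illustrative:

```python
import io

def chunker(stream, chunk_size=4096):
    """Lazy generator that reads a stream one chunk at a time."""
    while True:
        data = stream.read(chunk_size)
        if not data:
            return
        yield data

def engine():
    """Coroutine: receives a chunk via send(), yields the transformed chunk."""
    chunk = yield                    # priming next() runs to here
    while True:
        chunk = yield chunk.upper()  # stand-in for key.decrypt(chunk)

eng = engine()
next(eng)  # prime the coroutine so it is waiting at the first yield
out = b''.join(eng.send(c) for c in chunker(io.BytesIO(b'abcdef'), 2))
# out == b'ABCDEF'
```

The priming call (`next(eng)`) is easy to forget; without it, the first `send()` raises a TypeError because the coroutine has not yet reached its first yield.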

  • weixin_39904809 (4 months ago)

    I have not read the PGP file format, so I don't know where to cut the stream or which chunks to send to the decrypt_engine. But you probably have that off the top of your head, right?

  • weixin_39825105 (4 months ago)

    The problem in PGPy is that the entire message content is stored in memory as a bytearray. This simply doesn't work if the file is larger than the available memory. One possible solution is to use a memory map and let the kernel handle memory allocation.


    The needed changes are very minor:

    diff --git a/pgpy/types.py b/pgpy/types.py
    index b4a3d71..c13c65b 100644
    --- a/pgpy/types.py
    +++ b/pgpy/types.py
    @@ -13,6 +13,7 @@ import os
     import re
     import warnings
     import weakref
    +import mmap
     from enum import EnumMeta
     from enum import IntEnum
    @@ -185,8 +186,8 @@ class Armorable(six.with_metaclass(abc.ABCMeta)):
         def from_file(cls, filename):
             with open(filename, 'rb') as file:
                 obj = cls()
    -            data = bytearray(os.path.getsize(filename))
    -            file.readinto(data)
    +            m = mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ)
    +            data = bytearray(m)
             po = obj.parse(data)

    With this it’s possible to decrypt as big files as you want:

    [tornroos:~/elixir/pgp]$ ls -lh big_test_file.gpg 
    -rw-r--r--  1 tornroos  staff    32G Sep 18 14:59 big_test_file.gpg
    [tornroos:~/elixir/pgp]$ cat decrypt.py 
    import pgpy
    TEST_FILE = 'big_test_file.gpg'
    PRIVATE_KEY = 'private.key'
    PASSPHRASE = 'foobar'
    key, _ = pgpy.PGPKey.from_file(PRIVATE_KEY)
    with key.unlock(PASSPHRASE):
        message = pgpy.PGPMessage.from_file(TEST_FILE)
        decrypted_message = key.decrypt(message).message.decode("utf-8")
    [tornroos:~/elixir/pgp]$ /usr/bin/time -l /usr/local/bin/python3 decrypt.py 
          414.64 real        24.66 user       110.94 sys
    7013019648  maximum resident set size
             0  average shared memory size
             0  average unshared data size
             0  average unshared stack size
      19225541  page reclaims
       8393470  page faults
             0  swaps
             1  block input operations
            20  block output operations
             0  messages sent
             0  messages received
             0  signals received
        103799  voluntary context switches
       1795588  involuntary context switches

    Without the fix, this sample program simply dies when it runs out of memory. Please merge this into the next release of PGPy.
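    The core pattern from the diff above can be exercised standalone like this (a small sketch; note that bytearray(m) still makes an in-memory copy of the mapped bytes at that point, whereas slicing the map reads pages on demand):

```python
import mmap
import os
import tempfile

# Create a small file to map (stand-in for a multi-GB .gpg file).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(b'hello world' * 100)

with open(path, 'rb') as f:
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    head = m[:5]          # slicing pages data in lazily
    data = bytearray(m)   # this copies the whole mapped region into RAM
    m.close()

os.unlink(path)
# head == b'hello'; len(data) == 1100
```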

  • weixin_39526651 (4 months ago)

    I'm not merging that without testing it thoroughly, and I'm not convinced that change alone is actually useful for anyone, because all it does is move the point of running out of memory to sometime during packet parsing, rather than up front when the buffer is allocated.

    I'd much rather fail fast in that case, because it's absolutely going to fail anyway if the message is actually that large.

  • weixin_39526651 (4 months ago)

    The real tricky part of this is not so much "how do we read a file that is too big to fit into memory" but "how do we provide access to an encrypted blob that is too big to fit into memory, and still allow meaningful access to its contents in a way that does not simply result in blowing up memory at a later point rather than an earlier one"
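    One direction that avoids the later blow-up is to never hand the caller a bytes object at all: spool decrypted blocks to disk and return a readable file handle instead. This is only a sketch of that shape, not PGPy API; decrypt_to_tempfile and its input are illustrative:

```python
import tempfile

def decrypt_to_tempfile(blocks):
    """Spool an iterable of (decrypted) blocks to a temporary file and
    return a readable file object, so the caller never holds the full
    plaintext in memory at once."""
    out = tempfile.TemporaryFile()
    for block in blocks:
        out.write(block)
    out.seek(0)
    return out

# fake "decrypted blocks" for illustration
f = decrypt_to_tempfile(iter([b'chunk1', b'chunk2']))
# f.read() == b'chunk1chunk2'
```

    The trade-off is that "meaningful access" becomes file-like (read/seek) rather than bytes-like, which is exactly the API question raised above.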

  • weixin_39904809 (4 months ago)

    Any progress on this issue? Is the streaming support, as discussed above, implemented?

    I would like to avoid writing my own library just for this particular case when so much work has already been done here!

  • weixin_39526651 (4 months ago)

    I promise I'm working on this; my available time for working on PGPy is just limited right now, and I'm trying to wrap up the 0.4.4 bugfix release first.

  • weixin_39888412 (4 months ago)

    Any news on this?

  • weixin_39942995 (4 months ago)

    Did you find a solution?

  • weixin_39526651 (4 months ago)

    So, this is essentially a +1 for #139, which also depends on completing #95.

    Would it be possible to update the bytearray(os.path.getsize(filepath)) into a generator?

    That is actually similar to what I have in mind for supporting streaming crypto, and I have a rough idea of how to implement it. I just need to make some time to sit down and figure out the details. I want to get this functionality in place for 0.5.0.
