weixin_39723102
2020-12-02 06:25

Memory Leak / Caching without limits

When running tasks, each action that requires access to the pickled Ensembl DB data (e.g. looking up a new gene) adds entries to an in-memory cache, causing memory usage to skyrocket. The memory is not released even after deleting all references to pyensembl (e.g. deleting the EnsemblRelease variable).

This question originates from the open-source project: openvax/pyensembl


9 replies

  • weixin_39627390 5 months ago

    Hi ,

    There's a lot of potentially over-eager caching going on in PyEnsembl:

    1) Did you create the EnsemblRelease instance yourself, or are you using something like pyensembl.ensembl_grch38 or cached_release? In the latter case the instance will persist in a global dictionary.

    2) Any call to a method that returns all of some feature (either across the whole reference or on a single contig), such as gene_names or transcript_ids, will cache its results (see: https://github.com/hammerlab/pyensembl/blob/master/pyensembl/genome.py#L365). This caching might be unnecessary and I'm open to changing it.
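That kind of per-instance, unbounded caching can be sketched minimally as follows (the class and method names here are illustrative stand-ins, not PyEnsembl's actual internals): every distinct query adds an entry to a dictionary that is never evicted, so memory grows with the number of distinct queries.

```python
class Genome:
    """Sketch of an object that caches every feature query it answers."""

    def __init__(self):
        self._cache = {}  # grows without bound; one entry per distinct query

    def gene_names(self, contig=None):
        key = ("gene_names", contig)
        if key not in self._cache:
            # the expensive query against the Ensembl database would go here
            self._cache[key] = self._query_gene_names(contig)
        return self._cache[key]

    def _query_gene_names(self, contig):
        # stand-in for the real DB lookup
        return ["TP53", "BRCA1"] if contig is None else []
```

The second call with the same arguments is answered from `_cache`, but nothing ever removes entries, which is the growth pattern described above.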

    Can you tell me more about what you're doing and how much memory consumption you encounter?

  • weixin_39723102 5 months ago

    I created the EnsemblRelease variable myself. I'm using it in combination with Varcode, and basically memory usage goes up as each new gene is accessed:

    
    import varcode
    from pyensembl import EnsemblRelease

    es = EnsemblRelease(75)
    for chrom, pos, ref, alt in variants:  # an iterable of (chrom, pos, ref, alt) tuples
        variant = varcode.Variant(chrom, pos, ref, alt, ensembl=es)
        effect_on_gene = variant.effects()
        # do something with the effects
    

    After about 100 genes or so I'm seeing memory usage of 8 GB; I usually run out of available memory after about 300 genes (24 GB on my virtual machine).

  • weixin_39723102 5 months ago

    I think it may have to do with the memoized properties in transcript.py and genes.py (the caching you mentioned), but this is a complete guess based only on the cProfile timings and the memory spike when accessing variant.effects(). It looks like memoize caches an object for each new transcript retrieval but never forgets it, because the cache lives in a module-level variable inside memoized_property.py rather than in pyensembl itself. A fix would be to offer a monkeypatch for memoized_property.py that lets you set a maximum cache size or register a dealloc function.
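The bounded-cache idea proposed above could be sketched like this (illustrative only; memoized_property's real internals may differ): a decorator backed by an LRU that evicts the least-recently-used entry once a size limit is reached, so memory stays bounded no matter how many transcripts are visited.

```python
from collections import OrderedDict
from functools import wraps


def bounded_memoize(maxsize=128):
    """Memoize a function of hashable args, evicting least-recently-used entries."""
    def decorator(fn):
        cache = OrderedDict()

        @wraps(fn)
        def wrapper(*args):
            if args in cache:
                cache.move_to_end(args)  # mark as most recently used
                return cache[args]
            result = fn(*args)
            cache[args] = result
            if len(cache) > maxsize:
                cache.popitem(last=False)  # drop the oldest entry
            return result
        return wrapper
    return decorator


@bounded_memoize(maxsize=2)
def square(x):
    return x * x
```

Python's standard library offers essentially this via `functools.lru_cache(maxsize=...)`, which would be the natural drop-in for a fix.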

  • weixin_39627390 5 months ago

    Property memoization is definitely a memory leak, but I'm surprised by the magnitude. How many variants are you getting effects for?

    I see >1 GB of memory usage after about 9k variants. I got rid of some of the memoization in Varcode and now reach >1 GB only after about 14k variants.

    I'll file another issue under Varcode and try to trim the memory usage of both libraries.

  • weixin_39723102 5 months ago

    I am trying to calculate all of the possible mutations across the entire transcriptome, so not your typical use case, I guess (I am calculating the effects of 143 million variants on 20.5k genes). I was able to get all of the jobs to complete (only one of them using more than 8 GB of memory) by splitting them up 100 genes at a time on our cluster, hah.
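That batching strategy can be sketched as follows (a hypothetical helper, not the actual cluster script): split the gene list into fixed-size chunks and submit each chunk as its own job, so the memoization caches are discarded when each job's process exits.

```python
def chunked(items, size):
    """Yield successive fixed-size chunks from a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


# Stand-in for the 20.5k gene IDs; each chunk becomes one cluster job,
# bounding per-job memory growth from the unbounded caches.
genes = [f"gene_{i}" for i in range(250)]
jobs = list(chunked(genes, 100))
```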

  • weixin_39627390 5 months ago

    Ah! That's definitely a use-case I didn't consider!

    I'm working on a moderate decrease in memory consumption in Varcode (https://github.com/hammerlab/varcode/pull/157); I'll look through PyEnsembl next.

  • weixin_39723102 5 months ago

    Thanks!

  • weixin_39627390 5 months ago

    Hey , I trimmed some of the memoization in Varcode and PyEnsembl. If you update those two libraries, I'm curious to see how long it takes before you run out of memory. We're still doing an unbounded amount of memoization, but hopefully leaking at a slower rate.

  • weixin_39627390 5 months ago

    Since memory hasn't seemed like a problem for a while, closing this issue. Let me know if it still grows out of control.

