2020-12-30 09:59

Memory issues with vg mpmap

So I've been attempting to run the latest version of vg (v1.6.0-654-g1c819fff-t142-run) against my dataset and found that it takes up an unreasonable amount of memory during a run of vg mpmap. This didn't use to happen with previous versions of vg.

Using the vg container version v1.6.0-621-g766ecbcb-t141-run (vg commit hash 766ecbcbbe9d685c7cee908b9ef498f6cacdacdb) the run of vg mpmap only took up a maximum of about 74 GB of memory per subset read dataset.

When attempting to use vg container version v1.6.0-654-g1c819fff-t142-run (vg commit hash 1c819fffaefdc3ae7573f998cb6e27f0bac52600) the exact same runs take up more than 200 GB of memory. Memory during this run grows in the following intervals during a vg mpmap run:

1 min: 37 GB
2 min: 59.3 GB
3 min: 71.8 GB
7 min: 80.4 GB
23 min: 151.3 GB
33 min: 185.4 GB
36 min FAILURE: 198.4 GB


  • 点赞
  • 写回答
  • 关注问题
  • 收藏
  • 复制链接分享
  • 邀请回答


  • weixin_39687881 weixin_39687881 4月前

    This looks a lot like a memory leak to me; memory usage increases quickly to ~70-80 GB, and then slowly and linearly from there until the process is killed.

    Relevant changes that were made over that commit range include:

    • Jordan's new parallel driver code to let the input thread work
    • Changes to MEM finding
    • Changes to surject
    点赞 评论 复制链接分享
  • weixin_39625586 weixin_39625586 4月前

    I can confirm that memory utilization appears to increase linearly after around the 70-80 GB mark. I've rerun the same command on a large 1 TB memory node and it appears to still be processing since I've last checked (3:31:43 elapsed wall time with memory utilization at 682 GB now). Normally it would take only about 1 hour and 30 minutes to mpmap the same chunked fastq.

    Do you think these memory leaks might have been resolved since the 1c819ff commit? Should I create a new docker container using the latest commit of vg and test that?

    点赞 评论 复制链接分享
  • weixin_39687881 weixin_39687881 4月前

    I doubt we've fixed it, but you could try.

    Can you share the exact vg mpmap command you are using? Even without the reads themselves, I want to be able to replicate exactly what you are doing (including the types of indexes and input and output formats in use).

    点赞 评论 复制链接分享
  • weixin_39625586 weixin_39625586 4月前

    Sure, it's pretty basic. No particular parameter tuning. It uses vg container quay.io/vgteam/vg:v1.6.0-621-g766ecbcb-t141-run

    vg mpmap -S -f /home/markellocj/split_fastqs_[SAMPLE#]/fq_chunk_1.part.aa.fq.gz -f /home/markellocj/split_fastqs_[SAMPLE#]/fq_chunk_2.part.aa.fq.gz -x /mnt/1kg_ref_hs37d5.xg -g /mnt/1kg_ref_hs37d5.gcsa -t 32
    点赞 评论 复制链接分享
  • weixin_39625586 weixin_39625586 4月前

    I feel like this memory leak should have been captured by the unit tests since it appears that the size of the read input data doesn't matter. It just looks like mpmap hangs after a certain point.

    点赞 评论 复制链接分享
  • weixin_39625586 weixin_39625586 4月前

    I should also mention the input gzipped fastq file was 747MB and 797MB respectively.

    点赞 评论 复制链接分享
  • weixin_39687881 weixin_39687881 4月前

    I'm not able to replicate the memory usage growth using the vg mpmap command you gave and substituting my own indexes and pair of FASTQs (for chr21), using commit 1c819fff.

    I did poke around with valgrind and find that the FASTQ input code mismatches new[] and delete, which I'm pretty sure is undefined behavior, but as far as I can tell it has always done that. I just made a PR to fix it; I suppose you could give that a try.

    If there's a way I could get ahold of the offending reads and index files, that might help. This might be a problem that really has to do with extremely large MEM sets, somehow, and only shows up on the whole-genome index.

    点赞 评论 复制链接分享
  • weixin_39687881 weixin_39687881 4月前

    If the -ltcmalloc_minimal we are linking in supports it (?) and if you can make Singularity do it, you can try turning on tcmalloc's leak checker as described in https://gperftools.github.io/gperftools/heap_checker.html

    Basically you set HEAPCHECK=normal as an environment variable for the program, and it will complain if anything leaks when the program ends. You may have to cut down your read set first to avoid an abnormal termination due to running out of memory.

    点赞 评论 复制链接分享
  • weixin_39540315 weixin_39540315 4月前

    Hey guys, sorry to be taking so long to get to this. What's the current status, did your idea fix it ?

    点赞 评论 复制链接分享
  • weixin_39524183 weixin_39524183 4月前

    I'm curious if this is also affecting vg map, so I've done some digging with valgrind. For some reason I'm getting a hang under valgrind with this:

    vg construct -r 1mb1kgp/z.fa -v 1mb1kgp/z.vcf.gz -m 32 -p >z.vg
    vg index -g z.gcsa -k 16 -p z.vg
    vg sim -x z.xg -l 100 -n 1000 -e 0.01 -i 0.005 -a >z.sim
    valgrind --tool=memcheck vg mpmap -x z.xg -g z.gcsa -S -G /dev/null
    点赞 评论 复制链接分享
  • weixin_39687881 weixin_39687881 4月前

    OK, I've concluded that this isn't a memory leak after all. The mapper (or at least one thread of it) hits a particular read in the input data and hangs there, consuming all the memory trying to handle that read, until it crashes. I'm not observing this with simulated reads because (I presume) the simulator doesn't generate any true negative reads.

    When running on a single thread and watching the output with pv I can see it stop producing alignments and spin on a single read, consuming more and more memory.

    The important part of the stack trace seems to be:

    #4  0x0000000000ed9695 in gcsa::GCSA::locate (this=0x7fff17b9b240, range=..., 
        results=std::vector of length 555538341, capacity 1073741824 = {...}, append=append=false, sort=sort=true)
        at gcsa.cpp:813
    #5  0x0000000000b99d7e in vg::BaseMapper::find_mems_deep (this=this=0x7fff17b9b030, seq_begin=67 'C', seq_end=0 '\000', 
        longest_lcp=: 9.8813129168249309e-324, fraction_filtered=: 3.783407534486875e-313, 
        max_mem_length=0, min_mem_length=1, reseed_length=28, use_lcp_reseed_heuristic=false, use_diff_based_fast_reseed=true, 
        include_parent_in_sub_mem_count=true, record_max_lcp=false, reseed_below=0) at src/mapper.cpp:604

    Note how GCSA is locating into a vector with ~556 million results in it. We're locating a MEM with way too many hits.

    I think the causative change was the MEM finder un-limiting. See this commit where the locate where we hang is being made no longer conditional on the MEM having a tractable number of hits: https://github.com/vgteam/vg/commit/ed9e3a41226817fb969a618dd26fe8855d9e4f9d#diff-b0f0db38cdc1912c42330b797d80a428L596

    点赞 评论 复制链接分享
  • weixin_39687881 weixin_39687881 4月前

    OK I have a patch that should at least work around this, by protecting all the locate calls I could find with a max hit count of 16384, distinct from the existing hit_max. I've tested it and I am no longer seeing unbounded memory growth on real reads against the whole genome index.

    Is your upcoming work on the MEM limits going to address this same underlying issue?

    点赞 评论 复制链接分享