weixin_39639568
2020-12-03 03:11

Reduce volume memory usage

Reduce memory usage when computing variance normalization on volume files.

This question originates from the open source project: Washington-University/HCPpipelines

92 replies

  • weixin_39784195 5 months ago

    /usr/bin/time should measure the true memory peak, while other resource tracking may be polling the process, so if the peak was extremely short...

    I would have expected compiled matlab to be no worse than interpreted; there doesn't seem to be any good reason their compiler should be dumber (though maybe the compiler exists for bad reasons in the first place...).

  • weixin_39639568 5 months ago

    I agree we should add an echo statement and retry.

  • weixin_39637233 5 months ago

    I ran the 2nd pass VN stage two more times using the compiled matlab, just to see if my previously reported result (with the inconsistent reported peak memory: https://github.com/Washington-University/HCPpipelines/pull/168#issuecomment-598931130) was repeatable, and in each case the reported mem value from TORQUE this time matched the maxresident value reported by /usr/bin/time on the functionhighpass... call:

    
    Resources: cput=02:19:43,vmem=43890700kb,walltime=02:24:14,mem=37181008kb,energy_used=0
    
    1153.64user 6028.97system 2:24:11elapsed 83%CPU (0avgtext+0avgdata 38652224maxresident)k
    1865120inputs+26811104outputs (4323major+31476466minor)pagefaults 0swaps
    
    Resources: cput=06:36:33,vmem=43824136kb,walltime=06:40:50,mem=37187800kb,energy_used=0
    
    1183.44user 16265.38system 6:40:45elapsed 72%CPU (0avgtext+0avgdata 38761520maxresident)k
    1900048inputs+23573592outputs (9039major+19870045minor)pagefaults 0swaps
    

    Of note, in each of these other two instances, the duration of the job on the node was much longer than the earlier test (which took only 33 min), so the longer duration for the latest two tests may have given the TORQUE polling process the opportunity to catch the memory peak (consistent with 's hypothesis).

    I'll ask to do a recompile after I add an fprintf to the matlab code, just to be 100% sure that the recompile is hitting the intended code. But at this point, it sure looks like compiled matlab is using memory that interpreted matlab is optimized to not need.

  • weixin_39637233 5 months ago

    I have confirmed that the expected output of the fprintf statements that I added is showing up in the log files when using compiled matlab mode for the latest compile. But the memory profile is as before. I'm therefore concluding that the improved (lower) memory profile that was achieved using interpreted matlab is unfortunately not propagated into the compiled matlab.

    We do still have a small reduction (~10% in compiled matlab mode in the 2nd pass VN stage), and of course the more memory-efficient code is obviously beneficial for interpreted matlab. So, I suggest we merge this PR and tag a new release (v4.2) that we can use in the QuNex container for ICAFIX.

  • weixin_39784195 5 months ago

    I tossed an extra half gigabyte allocation at the very start of icaDim.m, and the memory didn't seem to increase, but adding it at the start of functionhighpass... did. Time to look elsewhere...

  • weixin_39639568 5 months ago

    Nevertheless, it might be worth building this version and running it on a full dataset. I don't feel like the tests Tim has run have been concordant with the results on the full dataset.

  • weixin_39784195 5 months ago

    Culprit found: std() (at least in its std(..., [], 2) form that is used everywhere; I didn't test anything else) is very memory-inefficient for some reason (it is a .m file, and its internal memory usage seems to hit double or more the size of the input), in addition to being a much more involved computation than we really need for "nonzero change". max(...) - min(...) uses almost no memory beyond the variable involved.
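
    For reference, a minimal sketch of the two masking approaches under discussion (variable names are illustrative, not the actual icaDim.m code):

    % data: voxels x timepoints, single precision
    % std along dim 2 builds large temporaries (roughly 2x the input):
    mask_std = std(data, [], 2) > 0;
    % max - min answers the same "did this voxel ever change" question
    % with almost no memory beyond the input itself:
    mask_range = (max(data, [], 2) - min(data, [], 2)) > 0;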

  • weixin_39637233 5 months ago

    Please take a look at d122f4c

  • weixin_39637233 5 months ago

    Don't know if range is any more (or less?) memory efficient, but that is a convenient alternative to max(...) - min(...)

  • weixin_39784195 5 months ago

    It is a .m file that calls min - max, looks promising.

    Can I send you my version when I'm done with it, so you can figure out which of your committed changes stepped on places I edited?

  • weixin_39784195 5 months ago

    With min - max, single run:

    
    151.83user 42.56system 4:06.02elapsed 79%CPU (0avgtext+0avgdata 1572384maxresident)k
    97816inputs+2750048outputs (485major+25932034minor)pagefaults 0swaps
    

    I'll switch to range and report the full results for that.

  • weixin_39637233 5 months ago

    Sure. Somewhat relatedly, do we even need to compute a mask within icaDim.m now that we are masking the data prior to the call to icaDim.m itself? Seems like that bit of unnecessary duplication could be removed, unless icaDim.m is being used as a "stand-alone" function outside of functionhighpass...

  • weixin_39639568 5 months ago

    icaDim.m is a standalone function used for many things.

  • weixin_39784195 5 months ago

    I have edits to icaDim that check if the mask will remove anything and, if not, avoid using it, which should reduce the memory usage for this case (avoiding making a "masked" copy, since the input doesn't actually count against memory usage until it is modified) without increasing it for other cases.
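
    Something like the following pattern (an illustrative sketch, not the actual committed edits) captures the idea; because of matlab's copy-on-write, the masked copy is only materialized when the mask actually drops voxels:

    mask = range(data, 2) > 0;    % voxels with any nonzero change
    if all(mask)
        masked = data;            % plain reference: no second copy is materialized
    else
        masked = data(mask, :);   % only here does a real copy get made
    end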

  • weixin_39639568 5 months ago

    Mike's version appears to have 2 extra copies in certain spots relative to mine (early in the loop, data and data_vn both exist at points where they are not needed, since they will be overwritten or can be regenerated from other variables).

  • weixin_39784195 5 months ago

    My version, with range and masking-bypass:

    First pass:

    
    156.81user 38.57system 3:20.25elapsed 97%CPU (0avgtext+0avgdata 1577072maxresident)k
    171952inputs+2750040outputs (420major+25930748minor)pagefaults 0swaps
    173.58user 46.03system 4:54.48elapsed 74%CPU (0avgtext+0avgdata 1492044maxresident)k
    503368inputs+2750040outputs (1831major+28643473minor)pagefaults 0swaps
    

    Second run had more major pagefaults, so ignore the memory difference between the two.

    Concatenated:

    
    108.39user 25.36system 4:28.90elapsed 49%CPU (0avgtext+0avgdata 2687976maxresident)k
    234824inputs+2487416outputs (1138major+15806211minor)pagefaults 0swaps
    

    Some major pagefaults, so expect another 100MB or so...I still have /usr/bin/time on melodic, etc, so I can compare to those numbers. We used std in the fix_3_clean changes too, so that needs editing.

    On another note, does functionhighpass... really need to spit out mountains of numbers? It makes it slightly more difficult to find the timing results in the output.

  • weixin_39639568 5 months ago

    The numbers tell you if the golden search is behaving sensibly.

  • weixin_39784195 5 months ago

    Do you mean if the golden mean search has a bug in it, or do you mean if there isn't anything particularly good for the search to find? If it is only for debugging, then it can be commented out when not debugging.

  • weixin_39639568 5 months ago

    If there is something wrong with the data, the search can end up pinned at its floor or ceiling, which is bad.

  • weixin_39784195 5 months ago

    That sounds like you can just print the end result and the initial endpoints, and that will tell you whether what it found was an initial endpoint.

  • weixin_39784195 5 months ago

    So, for my 2-run test, the concatenated highpass is now only barely above the melodic memory usage, and fix_3_clean (fix -f) is now in line with fix -c:

    melodic:

    
    659.51user 6.62system 21:21.15elapsed 51%CPU (0avgtext+0avgdata 2523728maxresident)k
    38480inputs+923472outputs (204major+5845224minor)pagefaults 0swaps
    

    fix -f (LOTS of major pagefaults, so this number could be wrong, but it mostly matches a much-earlier run of fix -f):

    
    872.93user 70.86system 56:32.20elapsed 27%CPU (0avgtext+0avgdata 2381880maxresident)k
    3677648inputs+35472outputs (25280major+18657473minor)pagefaults 0swaps
    

    fix -c:

    
    29.95user 0.90system 0:39.67elapsed 77%CPU (0avgtext+0avgdata 2952384maxresident)k
    2261832inputs+40outputs (73major+766193minor)pagefaults 0swaps
    

    fix -a:

    
    11.04user 4.06system 1:40.66elapsed 15%CPU (0avgtext+0avgdata 2884456maxresident)k
    154048inputs+2748856outputs (324major+1970588minor)pagefaults 0swaps
    

    So, my versions appear to peak at only slightly more memory than melodic (which we can't edit easily). Emailing them to Mike for change resolution.

  • weixin_39637233 5 months ago

    Ok, see latest commits, which roll in Tim's changes. I restored Origdata as the input variable name, but use (and reuse) data for the rest of the code. That seems clean to me.

    : Given that icaDim isn't actually the memory limiting step, and given that we need to return to using the non-VN'ed data over and over, it doesn't seem worth it to me to bother with deleting data and recomputing it on the fly from the SVD within the while loop. Besides, in this version, the full time series of data_trend is only formed at the front end for the DEMDT==1 condition (which is the only condition for which we actually need to preserve that data throughout the while looping), so for the DEMDT=={0,-1} conditions, this version is actually a wash memory-wise with your earlier 6edc725 version.

    Update: I realized that data can/should nonetheless be deleted once we are actually done with it (outside the while loop), so I just added that.

  • weixin_39637233 5 months ago

    : If you could run this version, that would be appreciated!

  • weixin_39784195 5 months ago

    For f1b2c5f: First pass:

    
    152.06user 48.16system 5:57.56elapsed 55%CPU (0avgtext+0avgdata 1574704maxresident)k
    301160inputs+2750040outputs (1034major+27044101minor)pagefaults 0swaps
    211.18user 58.82system 2:33.85elapsed 175%CPU (0avgtext+0avgdata 1575388maxresident)k
    26400inputs+2750040outputs (145major+38181696minor)pagefaults 0swaps
    

    Concatenated VN:

    
    90.43user 34.90system 1:54.81elapsed 109%CPU (0avgtext+0avgdata 2751528maxresident)k
    327000inputs+2487416outputs (1050major+14559735minor)pagefaults 0swaps
    
  • weixin_39639568 5 months ago

    Let's get this merged, compiled, and running on the CHPC.

  • weixin_39637233 5 months ago

    : Do you have scripts in place to quickly test all three DEMDT conditions? Not for memory, but just that they run successfully?

  • weixin_39639568 5 months ago

    It runs in all three conditions.

  • weixin_39637233 5 months ago

    Fabulous! : Anything else you want to change (or are still reviewing), before I request this gets recompiled?

  • weixin_39784195 5 months ago

    The excessive amount of text output isn't that important, and is easy enough to comment out when it gets annoying. I'd say go ahead.

  • weixin_39637233 5 months ago

    ARGH!!! There is no doubt that we've improved the memory utilization of interpreted matlab. We've been assuming that any such gains there would readily translate to the compiled matlab as well. But, what if that isn't actually the case?!?

    I just ran a test of the compiled matlab, and the results of /usr/bin/time on the 1st pass VN from the latest compile (25f3e1f) are no different than what we had from the earlier compile (624c8d1) from the code that added the masking to functionhighpass... but accomplished it via std rather than range. Which is to say, the maxresident values were ~ 4.4 x the size of the input (uncompressed) volume time series, rather than ~ 2.5 x as achieved in his testing with interpreted matlab following the switch to 'range'.

    I've checked with and he is confident that the recompile was from the tip of this PR (although to be absolutely 100% sure, it might be worth adding an fprintf to the code, recompiling, and confirming that we see the expected output in the log).

    The results above would be consistent with the hypothesis that in compiled matlab, the range operation is basically the same as std in terms of memory.

  • weixin_39637233 5 months ago

    Ok, another wrinkle. The 2nd pass VN stage just completed from my compiled matlab run, and the results are even more confusing (to me at least).

    The output of the /usr/bin/time that I prepended to the functionhighpass... call was:

    
    1315.55user 117.50system 32:46.33elapsed 72%CPU (0avgtext+0avgdata 38761832maxresident)k
    1338648inputs+22320072outputs (832major+6527444minor)pagefaults 0swaps
    

    i.e., 39 GB, which represents no improvement relative to the version that created the spatial mask using std. However, the reported Resource usage on CHPC for the overall job was:

    
    Resources: cput=00:31:43,vmem=29324464kb,walltime=00:32:49,mem=23844428kb,energy_used=0
    

    i.e., mem of only 24 GB. This is the biggest discrepancy between the Resources mem value and the /usr/bin/time maxresident value that I've seen. Any insight as to what might be going on?

  • weixin_39637233 5 months ago

    That's a bit disappointing, but thanks for taking another look. Would it perhaps be worthwhile, as the next level of granularity, to check the memory separately for the fix -f, fix -c, and fix -a stages of fix, just to confirm that the peak is indeed in the cleaning (fix -a) stage?

  • weixin_39784195 5 months ago

    That is what I checked.

  • weixin_39637233 5 months ago

    What were the numbers on each stage? I'm just curious now.

  • weixin_39639568 5 months ago

    Per the Slack, "everything is sorted out," and it does not appear that the fix_3_clean modifications are saving us much, so I think we should merge this, tag, send to the QuNex team, and move on at this point.

  • weixin_39637233 5 months ago

    Jure has it worked out in his dev version, which is encouraging, but Z is responsible for building the official containers, and we haven't gotten a status report from her.

    There is no point in merging this until we get the compiled matlab included as part of it (and it might as well be part of this same PR to keep all the components together). I'll work with to get that bit accomplished.

  • weixin_39784195 5 months ago

    The VN passes over the single runs:

    
    130.60user 54.53system 1:20.49elapsed 230%CPU (0avgtext+0avgdata 2812932maxresident)k
    64inputs+2750040outputs (0major+26841977minor)pagefaults 0swaps
    
    189.44user 76.19system 1:27.03elapsed 305%CPU (0avgtext+0avgdata 2813364maxresident)k
    64inputs+2750040outputs (0major+37487392minor)pagefaults 0swaps
    

    Concatenated VN pass:

    
    89.94user 32.65system 1:08.38elapsed 179%CPU (0avgtext+0avgdata 5219612maxresident)k
    72inputs+2487416outputs (0major+16975865minor)pagefaults 0swaps
    

    Melodic:

    
    584.06user 6.13system 9:57.63elapsed 98%CPU (0avgtext+0avgdata 2523808maxresident)k
    606624inputs+922688outputs (0major+5697683minor)pagefaults 0swaps
    

    fix -f:

    
    711.63user 41.25system 13:38.07elapsed 92%CPU (0avgtext+0avgdata 2387808maxresident)k
    2136inputs+35472outputs (0major+18424559minor)pagefaults 0swaps
    

    fix -c:

    
    30.28user 10.64system 0:28.17elapsed 145%CPU (0avgtext+0avgdata 2952392maxresident)k
    16inputs+56outputs (0major+766178minor)pagefaults 0swaps
    

    fix -a:

    
    9.74user 6.13system 0:54.68elapsed 29%CPU (0avgtext+0avgdata 5351828maxresident)k
    72inputs+2748856outputs (0major+2901407minor)pagefaults 0swaps
    
  • weixin_39637233 5 months ago

    So, I ran a test of the newly compiled version of functionhighpassandvariancenormalize at the scale of a fully concatenated HCA subject, and the result at full scale was frankly disappointing.

    First, on a positive and confirmatory note, the memory usage of the "1st pass" functionhigh... calls to the individual runs was reduced 25%, exactly consistent with 's testing in interpreted matlab mode (above: https://github.com/Washington-University/HCPpipelines/pull/168#issuecomment-594908252).

    But, for whatever reason, that memory reduction didn't scale up to the "2nd pass" functionhigh... call (for the VN on the concatenated run). (Incidentally, I don't see a "pre-change" memory value for the concatenated VN stage of 's simple 2-run MR-FIX test, so it's possible that my findings would be borne out even for just a 2-run test. Might be worth running that test to compare.)

    2nd pass VN, Pre-changes (~ 42 GB):

    
    Resources:      cput=01:00:49,vmem=57187948kb,walltime=01:06:13,mem=41378956kb,energy_used=0
    Nodes:              node11
    
    1101.01user 2110.48system 1:06:10elapsed 80%CPU (0avgtext+0avgdata 42412500maxresident)k
    

    2nd pass VN, Post-changes (~ 38 GB):

    
    Resources:      cput=01:42:54,vmem=43890684kb,walltime=01:46:06,mem=37642540kb,energy_used=0
    Nodes:              node17
    
    1712.31user 4460.70system 1:46:04elapsed 96%CPU (0avgtext+0avgdata 38753080maxresident)k
    

    On the full concatenated HCA data, the fix stage is taking 30 GB, so unfortunately, we don't appear to have gotten the 2nd pass VN stage down to the same memory usage as fix (which we would have if we achieved a 25% reduction on the 2nd pass VN stage).

  • weixin_39639568 5 months ago

    Well the VMEM is down by about that much, which is what will matter for the CHPC RAM limits.

  • weixin_39637233 5 months ago

    No, it is the mem that matters for RAM limits on CHPC2, per Malcolm himself.

  • weixin_39639568 5 months ago

    So you believe that the vmem is never allocated? Those nodes don't have swap.

  • weixin_39637233 5 months ago

    I don't know what vmem exactly represents or how it is computed under the queuing system on CHPC2, esp. since the nodes indeed don't have swap space (or only 3-4 GB of it). Malcolm told me that vmem is "essentially ignored" as regards the queuing system memory monitoring. He described it to me as "the memory the application could use if everything were allocated all at once."

    It is perhaps interesting that the difference between vmem and mem is much smaller after the changes than before. And that the vmem value is reduced about 25%. But what we actually need to request memory against is the mem value.

    To be clear, I'm not proposing that we not go ahead and roll these changes into the master. But we unfortunately haven't gained ourselves much in terms of actual ability to lower the requested mem value.

  • weixin_39639568 5 months ago

    Well, given the differences, I expect we don't fully understand the distinction between mem and vmem, as that change suggests we did reduce usage by the expected amount.

  • weixin_39637233 5 months ago

    All I can say is that as a practical matter, we didn't gain ourselves much.

    If we wanted to continue with this optimization, we'd need to hit icaDim.m, which based on a quick look has a number of memory inefficiencies (i.e., it appears to unnecessarily duplicate the "original" data in data and Origdata, and it keeps data_detrend after it is no longer needed, i.e., after data_detrend_vn is computed). I didn't initially understand why the memory usage was still so high -- but I see now it's due to how icaDim.m is written. (The multiple copies of the time series add up, even after the spatial masking has been applied.)

  • weixin_39639568 5 months ago

    I have modified icaDim to be more economical about memory usage and tested the function itself in Matlab with the different demean/detrend options.

  • weixin_39637233 5 months ago

    I took a stab at a more memory efficient icaDim.m. Will email for now, to keep it out of the PR for the time being.

  • weixin_39784195 5 months ago

    I pulled Matt's changes, and on the same test setup (100307, EMOTION LR and RL):

    First filtering:

    
    145.42user 39.66system 1:50.42elapsed 167%CPU (0avgtext+0avgdata 2808072maxresident)k
    14104inputs+2750040outputs (101major+26617504minor)pagefaults 0swaps
    203.51user 56.17system 1:36.42elapsed 269%CPU (0avgtext+0avgdata 2808804maxresident)k
    72inputs+2750040outputs (0major+37162454minor)pagefaults 0swaps
    

    Concatenated filtering:

    
    100.46user 23.56system 1:08.71elapsed 180%CPU (0avgtext+0avgdata 5214928maxresident)k
    336inputs+2487416outputs (1major+16649390minor)pagefaults 0swaps
    

    So, only a small change (~5MB).

  • weixin_39639568 5 months ago

    Tim's results don't make a lot of sense to me. There originally were 5 copies of the data inside icaDim, and now there should be a maximum of 3. Is Matlab simply referencing variables that contain the same information instead of allocating them?

  • weixin_39637233 5 months ago

    In the version that I just emailed, I actually got it down to just two copies of the data in the DEMDT~=-1 conditions (all you need is the detrended data, which I store in data, and then there is the data_vn copy within the while loop; one extra copy needs to be preserved for the DEMDT=-1 condition).

    Although, I share your confusion about why this isn't showing up in the values. The full, unmasked concatenated volume file is only 1.27 GB. So, how/where is it that we are still needing 5.2 GB...?

  • weixin_39639568 5 months ago

    DEMDT=-1 is only for MIGP data, so it isn't relevant here. I don't see a huge value to differential memory usage for the options so that is why I didn't do that.

  • weixin_39637233 5 months ago

    Regarding your question, if you do a simple B = A, matlab doesn't actually instantiate B in memory yet. But if you start assigning new values to B, then it needs to instantiate a copy in memory.

  • weixin_39784195 5 months ago

    I pulled Matt's changes again:

    First filtering:

    
    149.10user 48.02system 8:06.34elapsed 40%CPU (0avgtext+0avgdata 2808832maxresident)k
    24336inputs+2750040outputs (153major+27994182minor)pagefaults 0swaps
    173.16user 89.46system 5:48.51elapsed 75%CPU (0avgtext+0avgdata 2804916maxresident)k
    707048inputs+2750096outputs (3709major+39118064minor)pagefaults 0swaps
    

    Concatenated filtering:

    
    92.07user 42.04system 1:53.23elapsed 118%CPU (0avgtext+0avgdata 5218592maxresident)k
    60504inputs+2487416outputs (290major+16650239minor)pagefaults 0swaps
    

    Edit: replaced with low-pagefaults test, virtually no change from previous edit.

  • weixin_39784195 5 months ago

    And again:

    First filtering:

    
    159.39user 40.57system 1:55.11elapsed 173%CPU (0avgtext+0avgdata 2812448maxresident)k
    162128inputs+2750040outputs (383major+26839321minor)pagefaults 0swaps
    157.41user 58.14system 1:38.73elapsed 218%CPU (0avgtext+0avgdata 2813420maxresident)k
    64inputs+2750040outputs (0major+29550200minor)pagefaults 0swaps
    

    Concatenated filtering:

    
    101.21user 35.80system 1:19.57elapsed 172%CPU (0avgtext+0avgdata 5220272maxresident)k
    336inputs+2487416outputs (1major+17288230minor)pagefaults 0swaps
    

    So, somehow, that made it worse, back to the original numbers. As Mike said, an exact copy that you don't do any assignment to doesn't count as memory usage:

    
    test = ones(100000000, 1); % 800MB
    test2 = test; % still 800MB
    test2(2) = 1; % 1,600MB
    
  • weixin_39639568 5 months ago

    That doesn't make much sense.

  • weixin_39784195 5 months ago

    So, another thought occurred to me: since we verified that IO does cause a change in measured memory, and since I am measuring this on a loaded system that may occasionally use nearly all the RAM, the OS may be reclaiming the IO pages (or even swapping out part of the test process), which would reduce the measured memory. I will attempt to confirm that the previous commit reproducibly reduced memory usage, though to be completely certain we would need to run it on an otherwise idle system.

  • weixin_39639568 5 months ago

    How many copies do you think it uses for a heavily repmatted line like this?

    Out.data(OrigdataMask,:) = data_trend + (((u*diag(Out.EigSAdj)*v')' + repmat(mean_data_detrend_vn,size_data_detrend_vnOne,1)) .* repmat(Out.noise_unst_std,1,size_data_detrend_vnTwo));

  • weixin_39784195 5 months ago

    Looking back at the results, major page faults are larger with anomalous memory reductions, and it would make sense for swapping to be counted as a major page fault. I will keep an eye on those, and edit the worst of the previous results with better runs.

  • weixin_39784195 5 months ago

    Each repmat almost certainly counts as a full matrix, as if you'd assigned each repmat to a separate variable (two identical repmats would become two separate variables), used the variables in the line instead, and then cleared those variables.

    Edit: to answer your question literally, I count 5 separate matrices the size of a full timeseries in that line before generating any intermediate computation results (though the left side is presumably a zeros(), which hopefully doesn't count for memory use). This would actually reduce it to 3 copies at a time, plus the Out.data:

    
    temp_reps = repmat(mean_data_detrend_vn,size_data_detrend_vnOne,1) .* repmat(Out.noise_unst_std,1,size_data_detrend_vnTwo);
    temp_reps = (u*diag(Out.EigSAdj)*v')' + temp_reps;
    temp_reps = data_trend + temp_reps;
    Out.data(OrigdataMask,:) = temp_reps;
    clear temp_reps;
    

    Edit 2: I suppose matlab might be smart enough to clear the two repmat temporaries after multiplying them, and evaluate and drop the addition temporaries serially, but that would still likely peak at 4 copies (because of the order of additions - putting the multiply first would give it the opportunity to make it only 3), while my version should peak at 3.

  • weixin_39639568 5 months ago

    I broke up such statements and fixed a bug where I was randomly removing the global signal and adding it back. It didn't make much difference to the Wishart Filtered data, but shouldn't be done.

  • weixin_39784195 5 months ago

    Once more, from 6edc725:

    First filtering:

    
    154.44user 43.28system 5:40.76elapsed 58%CPU (0avgtext+0avgdata 2810056maxresident)k
    247272inputs+2750040outputs (1339major+28200216minor)pagefaults 0swaps
    186.05user 85.89system 9:38.19elapsed 47%CPU (0avgtext+0avgdata 2811980maxresident)k
    330984inputs+2750040outputs (2190major+31024365minor)pagefaults 0swaps
    

    Concatenated filtering:

    
    114.97user 23.59system 1:32.39elapsed 149%CPU (0avgtext+0avgdata 5218340maxresident)k
    6336inputs+2487416outputs (18major+17286986minor)pagefaults 0swaps
    

    So, still not improved.

    From some quick testing, a .* b + c is more memory-efficient than c + a .* b, and a + b + c uses the same memory as a + b, so matlab is using some reasonably smart tricks.
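
    A rough way to probe claims like this (a toy sketch; run each variant in a fresh MATLAB under /usr/bin/time and compare maxresident):

    % order_test.m -- hypothetical probe script, not part of the pipelines
    a = rand(2e8, 1); b = rand(2e8, 1); c = rand(2e8, 1);  % ~1.6 GB each
    d = a .* b + c;      % variant 1: multiply first
    % d = c + a .* b;    % variant 2: swap in for a second run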

  • weixin_39639568 5 months ago

    I can run the basic run to completion tests; however, I don't think it is necessary to test the memory reduction (and I don't have a way to do that)--it will be reduced considerably.

  • weixin_39639568 5 months ago

    The code runs through the VN function for hp=0 and hp=2000. hp=-1 apparently is not allowed.

  • weixin_39637233 5 months ago

    That's good. hp=-1 should be able to run successfully through functionhighpassandvariancenormalize.m itself. But it isn't a supported option in hcp_fix_multi_run.

  • weixin_39784195 5 months ago

    Sticking a /usr/bin/time in front of the matlab call would give you some memory numbers to work with, though I seem to recall they may be scaled strangely (so, should not be compared against things like htop).

  • weixin_39637233 5 months ago

    ?? /usr/bin/time reports the time used by a command, not the memory...

  • weixin_39784195 5 months ago
    
    tim:~$ /usr/bin/time ls
    ...
    0.00user 0.00system 0:00.00elapsed 71%CPU (0avgtext+0avgdata 1088maxresident)k
    0inputs+0outputs (0major+342minor)pagefaults 0swaps
    

    "maxresident" refers to memory. In posix mode, it will only report time, so you might need to give it some options in your environment.

  • weixin_39637233 5 months ago

    This is interesting (from CHPC):

    
    [mharms fixTesting]$ time ls
    ...
    real    0m0.016s
    user    0m0.000s
    sys 0m0.010s
    [mharms fixTesting]$ /usr/bin/time ls
    ...
    0.00user 0.01system 0:00.02elapsed 52%CPU (0avgtext+0avgdata 920maxresident)k
    0inputs+0outputs (0major+278minor)pagefaults 0swaps
    

    The man page for time reports that

    
    Note: some shells (e.g., bash(1)) have a built-in time command that provides less  function-
           ality  than the command described here.  To access the real command, you may need to specify
           its pathname (something like /usr/bin/time).
    

    But, then, I'd expect that which time would report that time is a "built-in", but that is not the case:

    
    [mharms fixTesting]$ which time
    /usr/bin/time
    

    I guess that isn't how which functions under bash?

  • weixin_39784195 5 months ago
    
    tim:~$ type time
    time is a shell keyword
    tim:~$ type which
    which is /usr/bin/which
    tim:~$ type type
    type is a shell builtin
    
  • weixin_39637233 5 months ago

    Any particular reason bash doesn't report "built-ins" via which, since most people are (I suspect) much more familiar with which?

  • weixin_39637233 5 months ago

    Esp. considering that which does report on aliases...

  • weixin_39784195 5 months ago

    which doesn't tell me about aliases:

    
    tim:~$ which ls
    /bin/ls
    tim:~$ type ls
    ls is aliased to `ls --color=auto'
    

    As I edited my reply to show, which is not a builtin (at least not on my system), so it doesn't know about the internals of bash.

  • weixin_39784195 5 months ago

    And here is your answer on the chpc:

    
    [... ~]$ type which
    which is aliased to `alias | /usr/bin/which --tty-only --read-alias --show-dot --show-tilde'
    [... ~]$ type type
    type is a shell builtin
    

    From here via /etc/profile:

    
    [... ~]$ cat /etc/profile.d/which2.sh
    # Initialization script for bash and sh
    
    # export AFS if you are in AFS environment
    alias which='alias | /usr/bin/which --tty-only --read-alias --show-dot --show-tilde'
    
  • weixin_39637233 5 months ago

    I see that my comment about which was tripped up by the fact that the system itself aliases which as part of a default set of aliases (to include the --read-alias flag):

    
    [mharms ~]$ which which
    alias which='alias | /usr/bin/which --tty-only --read-alias --show-dot --show-tilde'
        /usr/bin/which
    [mharms ~]$ which rm
    alias rm='rm -i'
        /bin/rm
    [mharms ~]$ /usr/bin/which rm
    /bin/rm
    

    I see that you just made the same discovery...

  • weixin_39784195 5 months ago

    With my change to delete(), it appears to complete successfully (with hp=2000, so it should hit that code).

  • weixin_39784195 5 months ago

    From the master commit this PR started from, here are the outputs of /usr/bin/time on just the functionhighpass... lines:

    
    148.10user 61.93system 1:18.69elapsed 266%CPU (0avgtext+0avgdata 3794780maxresident)k
    72inputs+2750040outputs (0major+30161541minor)pagefaults 0swaps
    
    110.79user 41.44system 0:49.44elapsed 307%CPU (0avgtext+0avgdata 3721664maxresident)k
    40inputs+1246304outputs (0major+20672548minor)pagefaults 0swaps
    

    Before changing single(zeros(...)):

    
    190.61user 77.33system 1:26.93elapsed 308%CPU (0avgtext+0avgdata 2812540maxresident)k
    72inputs+2750040outputs (0major+38105443minor)pagefaults 0swaps
    
    104.78user 43.89system 0:44.69elapsed 332%CPU (0avgtext+0avgdata 2737868maxresident)k
    40inputs+1246304outputs (0major+19979866minor)pagefaults 0swaps
    

    So, those edits saved a fair amount of memory used by matlab in that function. Strange that it took more time, maybe from the stdev computation? After zeros(..., 'single'):

    
    190.40user 77.16system 1:25.92elapsed 311%CPU (0avgtext+0avgdata 2812632maxresident)k
    72inputs+2750040outputs (0major+37486119minor)pagefaults 0swaps
    
    107.54user 39.42system 0:44.10elapsed 333%CPU (0avgtext+0avgdata 2737484maxresident)k
    32inputs+1246304outputs (0major+19668406minor)pagefaults 0swaps
    

    So, not really any change to the maximum there.

  • weixin_39784195 5 months ago

    Is the official fsl read_avw updated to return single when it makes sense? If so, then at some point we can probably stop wrapping single around it, specifically for files written by the pipelines. Considering that eliminating the extra double temporary of zeros didn't touch the memory usage according to /usr/bin/time (assuming matlab isn't smart enough to auto-convert single(zeros(...))), I don't think we need to worry much.
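
    For clarity, the distinction in question (a toy illustration; note that, per the zeros() behavior discussed below, the untouched double temporary may never actually become resident, which would explain why /usr/bin/time saw no difference):

    x = single(zeros(91, 109, 91, 176));    % nominally builds a double temporary, then converts
    y = zeros(91, 109, 91, 176, 'single');  % allocates single directly, no double temporary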

  • weixin_39637233 5 months ago

    Cool. Thanks for running /usr/bin/time in your testing. For reference, I'm guessing those are the outputs from the "1st pass VN" round on the individual runs, for a test instance in which you had two runs, and used hp=0? (If you are curious, might be interesting to compare to hp=2000 as well).

    Approx. how many time points were in each of the two runs?

    Here is read_avw_img.m as of FSL 6.0.1:

    
    function img = read_avw_img(filename);
    %  [img] = READ_AVW_IMG(filename)
    %
    %  Read in an Analyze (or nifti .img) file into either a 3D or 4D
    %   array (depending on the header information)
    %  Ouput coordinates for all dimensions start at 1 rather than 0
    %  Note: automatically detects char, short, long or double formats
    %  
    %  See also: READ_AVW, READ_AVW_HDR, SAVE_AVW, SAVE_AVW_HDR, SAVE_AVW_IMG
    
    fnimg=strcat(filename,'.img');
    
    [dims,scales,bpp,endian,datatype] = read_avw_hdr(filename);
    fp=fopen(fnimg,'r',endian);
    if (datatype==4),
      dat=fread(fp,'short=>float32');
    elseif (datatype==2),
      dat=fread(fp,'uint8=>float32');
    elseif (datatype==8),
      dat=fread(fp,'int=>float32');
    elseif (datatype==64),
      dat=fread(fp,'double=>double');
    elseif (datatype==16),
       dat=fread(fp,'float32=>float32');
    end
    fclose(fp);
    ...
    
  • weixin_39784195 5 months ago

    No, those were the first and second calls to functionhighpass... in a single subject (100307), single call of hcp_fix_multi_run, hp=0. It was run on tfMRI_EMOTION_LR only (so, as if running single-run fix, basically). This is just what Matt was using for testing that it didn't crash.

  • weixin_39637233 5 months ago

    Interesting. That run is 176 time points, so 91 * 109 * 91 * 176 * 4 ~ 635e6 bytes for a single copy of an uncompressed volume file. While we have clearly reduced the maxresident memory (by ~ 25%), it is still ~ 4.4 x a single uncompressed volume. Any thoughts on why it is still so high...?

  • weixin_39784195 5 months ago

    From previous experience with /usr/bin/time, I don't trust it to report correct absolute usage numbers, but rather be off by some factor. I'd need to play around with it to determine what the factor is again.

  • weixin_39784195 5 months ago

    Testing with a toy example that just uses calloc to allocate and zero memory, /usr/bin/time tracked well with the amount allocated, so that seems accurate, at least for trivial cases. Loading gui, interactive matlab and closing it without doing anything used 600MB resident, or 250MB when running with -nojvm -nodisplay -nodesktop -nosplash, as in FIX's call_matlab.sh with the settings file I used.

    However, matlab does something funny with an array of all zeros, allocating a vector of 1 billion with zeros() didn't change the memory usage, while the same allocation with ones() resulted in 8GB (and took longer). Modifying one element in the zeros() result had very little effect. Setting a random 40,000 of the indexes to a nonzero constant increased the memory usage by ~150MB, which to me suggests the memory is allocated (or initialized) in chunks as needed, rather than using a classical sparse implementation (which would be slow) - amusingly, storing values of zero to the zero vector also increased the memory usage. zeros() may allocate (reserve) the memory, but it doesn't store to it, instead there is likely a second, much smaller array where each element says "are any indices in this range nonzero", thus avoiding the need to actually zero allocated memory (and uninitialized memory allocations don't use physical resources until you store values to them). From some quick testing, I suspect the granularity is 4KB pages (which I think matches the usual linux/x86 page size), which would be 500 consecutive doubles or 1000 singles.
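
    To make the observed behavior concrete (an interactive sketch; watch RES in top while stepping through the lines):

    z = zeros(1e9, 1);   % resident memory barely moves: pages reserved, never written
    o = ones(1e9, 1);    % ~8 GB resident: every page actually gets stored to
    z(12345) = 7;        % faults in roughly one 4 KB page of the zeros array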

    On to IO testing: loading a single 1MB cifti file has to load and interpret the ciftiopen function, and do a few other things, which brings the maxresident to 271.4MB. Loading the same cifti file a second time (overwriting the variable used by the first) increases this to 274.6MB, and a third time is 277.9MB. Adding a clear to explicitly delete the gifti object in between brought this down to 274.8MB for three loads, but still continued to increase for a fourth clear+load, to 275.6MB.

    So, it sounds like maxresident should be considered to be the peak of the sum of at least memory usage, matlab loading itself, and the cumulative amount of data read in, at least when reading a few small files with ciftiopen. Not sure if writing files is also counted for resident memory. I did find out earlier that /usr/bin/time's reporting includes child processes (so, including the wb_command launched by ciftiopen, but since wb_command terminates, the wb_command IO and memory usage shouldn't add up across multiple uses, the way matlab's reading of the gifti appears to).

  • weixin_39637233 5 months ago

    Interesting. I noticed the same thing about zeros() vs ones() while just monitoring some operations in an interactive matlab using just the "RES" output from top (with the reporting delay set to just 0.1 sec to capture the rapid memory changes).

    Using that same monitoring approach, the "RES" memory reported for matlab isn't itself impacted by the running of save_avw, which just calls save_avw_img, which in turn just uses a simple fwrite to the opened file pointer. But save_avw operates by creating a {.hdr, .img} pair first, which at the end gets converted to the user default format using call_fsl to invoke fslmaths, which of course consumes memory (outside of matlab) equal to the size of the uncompressed volume as part of converting to a compressed volume.

  • weixin_39637233 5 months ago

    Which is to say, that if we were really trying to reduce peak memory to its bare minimum, we would need to set it up such that save_avw left the data as a {.hdr,.img} pair initially; delete/clear the variable from matlab after it has been saved to disk; and then convert/compress the {.hdr,.img} pair to .nii.gz. (And for that to reduce peak memory, we would be relying on the OS reclaiming the memory right away after it is cleared in matlab).
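
    A sketch of that ordering (assuming FSL's save_avw_hdr/save_avw_img helpers, which save_avw calls internally, keep save_avw's argument order; fname and vsize here are placeholders):

    save_avw_hdr(ctsfull, fname, 'f', vsize);  % write fname.hdr
    save_avw_img(ctsfull, fname, 'f');         % write fname.img (uncompressed)
    clear ctsfull;                             % release the matlab copy before compressing
    call_fsl(['fslmaths ' fname ' ' fname]);   % only now convert the pair to the default .nii.gz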

  • weixin_39784195 5 months ago

    Unless the inner functions use save_avw, we probably don't need to worry about improving that, as it shouldn't be responsible for the peak memory usage. I don't think the IO impact on resident size should be considered a real problem, compression utilities operate on a data stream, and will read small chunks of a file at a time, never actually allocating much memory. If the unallocated memory on the system is small, I think how it works is that it will generally reclaim the sections of memory that are internally mapped during file IO before trying to swap out much application memory (but unfortunately linux generally treats these as very similar when reporting things).

  • weixin_39637233 5 months ago

    Maybe that's how compression utilities generally operate, but that doesn't appear to be what happens if the conversion is accomplished via an I/O operation using fslmaths (as save_avw does), because as I was monitoring top, fslmaths peaked at basically a full volume of memory.

    What do you mean about not worrying unless the "inner functions" use save_avw? (i.e., what qualifies as an "inner function"?) I'm pretty sure (per my testing above) that the current implementation is not memory efficient. Whether we want to mess with that is a different issue...

  • weixin_39639568 5 months ago

    Do you guys know what step is currently the peak user of memory after these changes? As far as I know this is currently holding up processing, so there is a tradeoff between finalizing sICA+FIX and saving processing time. I think we should rapidly bring these changes to a conclusion and get sICA+FIX processing running to ensure we are able to release everything we want to, when we want to.

  • weixin_39784195 5 months ago

    Ah, it hadn't clicked that they were using fsl to convert - I should have known from the separate hdr/img pair.

    By inner function, I mean the actual computation stuff that allocates more than just the memory for the input/output files. If all the computation intermediates have been cleared from matlab's memory before the outputs are written, it doesn't matter if save_avw uses some extra memory.

  • weixin_39637233 5 months ago

    Well, we can't clear ctsfull within matlab until after it is written to disk, and due to the manner in which save_avw is written, that basically means you need to keep ctsfull in memory in matlab while fslmaths proceeds to ramp up to the same (additional) amount of memory just to convert the {.hdr,.img} pair to .nii.gz. A very inefficient implementation if you are actually trying to minimize the memory footprint.

    If this allows us to run more jobs simultaneously, then some (relatively minor) delay in launching what will be 1700+ jobs (that each will take over 8 hours) may be well worth it. Besides, Z hasn't reported yet on her progress on getting all the pieces/dependencies in place in the container, so it's not like this is ready to go...

    I think the next immediate step is for to hopefully show empirically that my changes to fix_3_clean reduce memory there as well, and then we can take that proposed change to Oxford.

  • weixin_39784195 5 months ago

    When I tested the fix_3_clean changes, they didn't make much difference. I am currently measuring the usage for many steps, fix as a whole, etc to make sure we aren't missing something obvious.

  • weixin_39639568 5 months ago

    I will not have time to mess with this any more, so if you wish to make further adjustments, you can do this after it is merged or we can stick with what we have that uses more memory. This approach minimizes the memory footprint as much as is easily possible regarding the zeros in the NIFTI volume. One clears the full array immediately after using it so as to never have that in memory when doing other steps that may temporarily increase memory usage over the amount needed when the full array is being used (e.g. icaDim). This was tested with both detrending and hp2000, but is anticipated to only be used with detrending given current recommendations.

  • weixin_39639568 5 months ago

    The above changes are fine with me. Were they tested with hp=0 and hp=2000 (and now I guess hp=-1)?

  • weixin_39637233 5 months ago

    FYI: The push I just forced just corrects the variable indentation. I was hoping that you could quickly test this with what you already have set up, since I'm not really set up to test under interpreted matlab.

    Also, it's still unclear whether you have empirically tested the memory reduction, which I would like established before I work with to get these changes into the compiled matlab. I wonder if the matlab profiler would make it easy to assess the memory usage...
