weixin_39614834
2020-12-02 02:01

3D MaxPooling forward is much slower than backward when the input format is ncdhw

Hi, I want to use 3D MaxPooling, but I find that the forward pass is much slower than the backward pass (about 7x) when I set the input format to NCDHW. Can you give me some advice? Thanks!

Environment

Intel MKL-DNN includes hardware-specific optimizations and may behave differently depending on the compiler and build environment. Include the following information to help reproduce the issue:

  • CPU device:


Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                44
On-line CPU(s) list:   0-43
Thread(s) per core:    1
Core(s) per socket:    22
Socket(s):             2
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
Stepping:              1
CPU MHz:               1200.117
CPU max MHz:           3600.0000
CPU min MHz:           1200.0000
BogoMIPS:              4389.86
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              56320K
NUMA node0 CPU(s):     0-43
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts
  • OS version: Linux
  • Compiler version: gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16)
  • CMake version: 2.8.12.2

Example code


#include <iostream>
#include <cstdlib>
#include <sstream>
#include <ctime>   // clock(), CLOCKS_PER_SEC
#include <cmath>   // sinf()
#include <mkldnn.hpp>

using namespace mkldnn;

void run(){
    int batch = 12;
    int channel = 192;
    int depth = 34;
    int height = 30;
    int width = 30;
    clock_t t1,t2,t3;
    auto cpu_engine = engine(engine::cpu, 0);
    auto data_t = memory::data_type::f32;
    //auto format = memory::format::nCdhw16c;

    std::vector<float> net_src(batch * channel * depth * height * width);
    std::vector<float> net_dst(batch * channel * 32 * 28 * 28);

    memory::dims pool_src_tz = { batch, channel, depth, height, width };
    memory::dims pool_dst_tz = { batch, channel, 32, 28, 28  };
    memory::dims pool_kernel = { 3, 3, 3 };
    memory::dims pool_strides = { 1, 1, 1 };
    memory::dims pool_padding = { 0, 0, 0 };

    /* create memory for pool src and dst data in the user (ncdhw) format */
    auto pool_user_src_memory = memory(
            { { { pool_src_tz }, memory::data_type::f32, memory::format::ncdhw },
              cpu_engine },
            net_src.data());

    auto pool_user_dst_memory = memory(
            { { { pool_dst_tz }, memory::data_type::f32, memory::format::ncdhw},
              cpu_engine },
            net_dst.data());

    /* create memory descriptors: src in ncdhw, dst in format any so the
     * primitive can choose its preferred layout */
    auto pool_src_md = memory::desc({ pool_src_tz }, memory::data_type::f32,
                                    memory::format::ncdhw);
    auto pool_dst_md = memory::desc({ pool_dst_tz }, memory::data_type::f32,
                                    memory::format::any);

    auto pool_desc = pooling_forward::desc(
            prop_kind::forward, pooling_max,
            pool_src_md, pool_dst_md,
            pool_strides, pool_kernel, pool_padding, pool_padding,
            padding_kind::zero);
    auto pool_pd = pooling_forward::primitive_desc(pool_desc, cpu_engine);

    auto pool_dst_memory = pool_user_dst_memory;
    bool reorder_pool_dst = false;
    primitive pool_reorder_dst;
    if (memory::primitive_desc(pool_pd.dst_primitive_desc())
        != pool_user_dst_memory.get_primitive_desc()) {
        pool_dst_memory = memory(pool_pd.dst_primitive_desc());
        pool_reorder_dst = reorder(pool_dst_memory, pool_user_dst_memory);
        reorder_pool_dst = true;
    }


    /* create pooling workspace memory (needed for max pooling backward) */
    auto pool_workspace_memory = memory(pool_pd.workspace_primitive_desc());

    /* finally create a pooling primitive */
    auto pool = pooling_forward(pool_pd, pool_user_src_memory, pool_dst_memory,
                                pool_workspace_memory);

    std::vector<primitive> net_fwd;
    net_fwd.push_back(pool);
    if (reorder_pool_dst)
        net_fwd.push_back(pool_reorder_dst);


    std::vector<float> net_diff_dst(batch * channel * 32 * 28 * 28);

    for (size_t i = 0; i < net_diff_dst.size(); ++i)
        net_diff_dst[i] = sinf((float)i);


    /* create memory for user diff dst data */
    auto pool_user_diff_dst_memory = memory(
            { { { pool_dst_tz }, memory::data_type::f32, memory::format::ncdhw },
              cpu_engine },
            net_diff_dst.data());

    /* Backward pooling */
    /* create memory descriptors for pooling */
    auto pool_diff_src_md = pool_user_src_memory.get_primitive_desc().desc();
    auto pool_diff_dst_md = pool_dst_memory.get_primitive_desc().desc();

    /* create backward pooling descriptor */
    auto pool_bwd_desc = pooling_backward::desc(
            pooling_max, pool_diff_src_md, pool_diff_dst_md, pool_strides,
            pool_kernel, pool_padding, pool_padding, padding_kind::zero);
    /* backward primitive descriptor needs to hint forward descriptor */
    auto pool_bwd_pd = pooling_backward::primitive_desc(pool_bwd_desc,
                                                        cpu_engine, pool_pd);

    /* create reorder primitive between user diff dst and pool diff dst
     * if required */
    auto pool_diff_dst_memory = pool_user_diff_dst_memory;
    primitive pool_reorder_diff_dst;
    bool reorder_pool_diff_dst = false;
    if (memory::primitive_desc(pool_dst_memory.get_primitive_desc())
        != pool_user_diff_dst_memory.get_primitive_desc()) {
        pool_diff_dst_memory = memory(pool_dst_memory.get_primitive_desc());
        pool_reorder_diff_dst
                = reorder(pool_user_diff_dst_memory, pool_diff_dst_memory);
        reorder_pool_diff_dst = true;
    }

    /* create memory primitive for pool diff src */
    auto pool_diff_src_memory = memory(pool_bwd_pd.diff_src_primitive_desc());

    /* finally create backward pooling primitive */
    auto pool_bwd
            = pooling_backward(pool_bwd_pd, pool_diff_dst_memory,
                               pool_workspace_memory, pool_diff_src_memory);

    std::vector<primitive> net_bwd;
    /* the diff dst reorder must run before the backward primitive consumes it */
    if (reorder_pool_diff_dst)
        net_bwd.push_back(pool_reorder_diff_dst);
    net_bwd.push_back(pool_bwd);

    float fwd_time = 0.0;
    float bwd_time = 0.0;
    //warm up
    for(int i=0; i<100;i++) { 
        stream(stream::kind::eager).submit(net_fwd).wait();
        stream(stream::kind::eager).submit(net_bwd).wait();
    }

    for(int i=0; i<100;i++) { 
        t1=clock();
        stream(stream::kind::eager).submit(net_fwd).wait();
        t2=clock();
        fwd_time +=(t2-t1);
        stream(stream::kind::eager).submit(net_bwd).wait();
        t3=clock();
        bwd_time +=(t3-t2);
    }
    printf("The fwd_time=%f ms\n",fwd_time/100*1000/CLOCKS_PER_SEC);
    printf("The bwd_time=%f ms\n",bwd_time/100*1000/CLOCKS_PER_SEC);
}

int main(int argc, char **argv)
{
    try
    {
        run();
        std::cout << "passed" << std::endl;
    }
    catch (error &e)
    {
        std::cerr << "status: " << e.status << std::endl;
        std::cerr << "message: " << e.message << std::endl;
    }
    return 0;
}
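Note on the timing methodology: clock() measures CPU time, which on Linux is summed across all threads, so with OMP_NUM_THREADS=44 the absolute times below overstate wall-clock time. The forward/backward ratio is unaffected, since both passes are measured the same way. A wall-clock variant of the timing loop would look like this (a sketch; requires #include <chrono>):

auto start = std::chrono::steady_clock::now();
for (int i = 0; i < 100; i++)
    stream(stream::kind::eager).submit(net_fwd).wait();
auto end = std::chrono::steady_clock::now();
double fwd_ms
        = std::chrono::duration<double, std::milli>(end - start).count() / 100;
printf("The fwd_time=%f ms\n", fwd_ms);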

Actual behavior

Running the example code produces the following output:


### using OMP_NUM_THREADS=44
### using KMP_AFFINITY=granularity=fine,compact,1,0

### using KMP_BLOCKTIME=1
The fwd_time=8721.600586 ms
The bwd_time=1206.200073 ms
passed

This question originates from the open-source project: oneapi-src/oneDNN


8 replies

  • weixin_39633276 · 5 months ago

    Hi, this is expected behavior: the forward max pooling algorithm performs more operations than the backward one (proportional to the kernel size), so it is not fair to compare the forward and backward passes in terms of computation time.

    Best regards, Anton
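    To illustrate the asymmetry Anton describes (a 1D scalar sketch for illustration only, not MKL-DNN's actual implementation; requires <vector> and <cstddef>): with stride 1 and no padding, forward max pooling reads all k elements under the window for each output, while backward scatters a single gradient per output using the argmax saved in the workspace. For a 3x3x3 kernel that is 27 reads versus 1 write per output element.

    /* forward: k reads + compares per output element */
    void max_pool_fwd_1d(const std::vector<float> &src, std::vector<float> &dst,
                         std::vector<size_t> &argmax, size_t k) {
        for (size_t o = 0; o < dst.size(); ++o) {
            size_t best = o;                  /* window starts at o */
            for (size_t j = 1; j < k; ++j)
                if (src[o + j] > src[best]) best = o + j;
            dst[o] = src[best];
            argmax[o] = best;                 /* saved in the workspace */
        }
    }

    /* backward: one scatter per output element, thanks to the workspace
     * (diff_src must be zero-initialized) */
    void max_pool_bwd_1d(const std::vector<float> &diff_dst,
                         const std::vector<size_t> &argmax,
                         std::vector<float> &diff_src) {
        for (size_t o = 0; o < diff_dst.size(); ++o)
            diff_src[argmax[o]] += diff_dst[o];
    }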

  • weixin_39614834 · 5 months ago

    Yes, I know. I have integrated MKL-DNN into PyTorch and profiled a 3D deep learning model on both CPU and GPU. 3D MaxPooling on CPU is far slower than on GPU: the MaxPooling forward pass on CPU (which calls MKL-DNN in PyTorch) is about 11x slower than on GPU (which calls cuDNN in PyTorch), while the backward pass on CPU is only about 3x slower than on GPU. On GPU, MaxPooling forward is only about 2x slower than backward, but on CPU it is about 8x slower. So I think the gap between MaxPooling forward and backward in MKL-DNN is too big. This 3D model contains a large number of 3D MaxPooling operations, which creates a large gap between the CPU and GPU devices. Could the 3D MaxPooling forward performance be improved to narrow the gap between forward and backward? That would also improve overall CPU performance. Thanks!

  • weixin_39633276 · 5 months ago

    XiaobingSuper, there are two things:

    1. MKL-DNN is highly optimized for blocked formats (nCdhw16c or nCdhw8c). So is it possible to run the whole model (or at least most of it) in the nCdhw16c format? The best case is one reorder at the very beginning and one at the very end; a sketch follows below.

    2. If the first option does not work for you, could you give us more details about the model, so we can understand what we should optimize and what your performance goal is? For example, if the pooling operations have no padding and unit strides, the current algorithm could be optimized further.
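    A minimal sketch of that recipe in the v0.x API used in the example above (assuming the same shapes; channel = 192 here is divisible by 16, as nCdhw16c expects):

    /* one reorder up front: user ncdhw -> blocked nCdhw16c */
    auto blk_src_md = memory::desc({ pool_src_tz }, memory::data_type::f32,
                                   memory::format::nCdhw16c);
    auto blk_src_memory = memory(memory::primitive_desc(blk_src_md, cpu_engine));
    auto reorder_src = reorder(pool_user_src_memory, blk_src_memory);

    /* build the pooling on the blocked src; dst stays format::any so the
     * primitive picks a matching blocked layout */
    auto blk_pool_desc = pooling_forward::desc(
            prop_kind::forward, pooling_max, blk_src_md,
            memory::desc({ pool_dst_tz }, memory::data_type::f32,
                         memory::format::any),
            pool_strides, pool_kernel, pool_padding, pool_padding,
            padding_kind::zero);
    auto blk_pool_pd = pooling_forward::primitive_desc(blk_pool_desc, cpu_engine);
    auto blk_dst_memory = memory(blk_pool_pd.dst_primitive_desc());

    /* ... run the whole (blocked) topology here ... */

    /* one reorder at the very end: blocked dst -> user ncdhw */
    auto reorder_dst = reorder(blk_dst_memory, pool_user_dst_memory);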

  • weixin_39978749 · 5 months ago

    Could you please share the log with MKLDNN_VERBOSE=1?
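    For reference, the verbose log is captured by setting the environment variable when running the test binary (the binary name below is illustrative):

    MKLDNN_VERBOSE=1 ./pooling_test

    Each executed primitive then prints a line including the implementation name, memory formats, and execution time.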

  • weixin_39966225 · 5 months ago

    In case you need some code help, an example of layout propagation can be found here: https://github.com/intel/mkl-dnn/blob/master/examples/simple_net.cpp#L125

    There, the relu descriptor queries the dst memory of the previous conv operation.

    The recommendation above would then translate to: reorder to the blocked format at the beginning, propagate the format through the entire topology, and reorder back to the user format at the end.
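    In the v0.x API used above, the propagation step looks roughly like this (a sketch; prev_pd stands for the primitive_desc of whichever layer feeds the pooling and is hypothetical here):

    /* inherit whatever blocked layout the previous primitive chose for its dst */
    auto pool_src_md = prev_pd.dst_primitive_desc().desc();

    /* keep dst as format::any so this primitive can also pick a blocked layout */
    auto pool_dst_md = memory::desc({ pool_dst_tz }, memory::data_type::f32,
                                    memory::format::any);

    auto pool_desc = pooling_forward::desc(
            prop_kind::forward, pooling_max, pool_src_md, pool_dst_md,
            pool_strides, pool_kernel, pool_padding, pool_padding,
            padding_kind::zero);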

  • weixin_39614834 · 5 months ago

    Thanks -greeneltch-intel and all, I will try it.

  • weixin_39614834 · 5 months ago

    The first method given above works for me, thanks!

  • weixin_39978749 · 5 months ago

    Closing as the issue is resolved.

