weixin_39562752
2021-01-03 09:16

Inference Time Issue

Hi, when I was trying to generate the .engine file, the error below occurred: init plugin proto: ./yolov3_lite_mod.prototxt caffemodel: ./yolov3_lite_mod.caffemodel

Begin parsing model... could not parse layer type Upsample ERROR: ssd_error_log: Fail to parse Segmentation fault (core dumped)

I have checked the Upsample layer, and the example you provided in the README works fine. Could you please look into this problem? Thanks!!

This question originates from the open source project: lewes6369/TensorRT-Yolov3


34 replies

  • weixin_39562752 4 months ago

    Also, the resize mode was "INTER_CUBIC". I replaced it with "INTER_LINEAR", and there is no significant change in speed.

  • weixin_39828102 4 months ago

    Okay... that is not good news. Does the same problem occur with the original darknet model?

  • weixin_39828102 4 months ago

    I think I made a mistake in the preprocessing: the vector mean_values = {0.5,0.5,0.5} may need to be changed to {1.0,1.0,1.0}.

  • weixin_39562752 4 months ago

    In your training prototxt, the mean_value is set to 127.5 and the normalization scale is 0.007843, so in demo.sh I set the mean value to 0.5 and the scale to 0.007843. What do you mean by changing it to {1.0, 1.0, 1.0}? UPDATE: Or do you mean changing the mean value I set in main.cpp in this project?

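    For reference, a minimal sketch of how the two mean/scale conventions relate (hypothetical helper functions, not code from either repo; which form main.cpp expects depends on whether it divides the pixel by 255 first):

    ```cpp
    // Caffe transform_param style: y = (x - mean_value) * scale, with x in [0, 255].
    float caffeStyle(float x)      { return (x - 127.5f) * 0.007843f; }   // roughly [-1, 1]

    // If the pipeline divides by 255 first, the equivalent mean is 0.5,
    // but the scale must then become 2.0 (not 0.007843) to reach the same range:
    float normalizedStyle(float x) { return (x / 255.0f - 0.5f) * 2.0f; } // roughly [-1, 1]

    // Mixing the conventions, e.g. mean 0.5 with scale 0.007843 on a [0, 255] pixel,
    // lands in a very different range and will hurt accuracy.
    ```
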
  • weixin_39828102 4 months ago

    That mistake was mine, but it is off topic; I will update my project soon.

  • weixin_39562752 4 months ago

    I found something strange: I used FIT_LARGE_SIZE_AND_PAD for training and ssd_detect.cpp for detection, but ssd_detect uses WARP for the image resize, and the performance is good. In this project the image resize mode is letterbox (i.e. FIT_LARGE_SIZE_AND_PAD), and the performance is not as good as with WARP in Caffe. I changed the resize mode in main.cpp to WARP, and the detection results are much better than with letterbox.

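    For reference, a minimal sketch of the two resize modes being compared (plain OpenCV, not the exact code from either repo); detections are only reliable when the inference-time mode matches the one used in training:

    ```cpp
    #include <algorithm>
    #include <opencv2/opencv.hpp>

    // WARP: stretch the image directly to the network input size (aspect ratio changes).
    cv::Mat warpResize(const cv::Mat& img, int netW, int netH) {
        cv::Mat out;
        cv::resize(img, out, cv::Size(netW, netH), 0, 0, cv::INTER_LINEAR);
        return out;
    }

    // Letterbox (FIT_LARGE_SIZE_AND_PAD): keep the aspect ratio and pad the borders.
    cv::Mat letterboxResize(const cv::Mat& img, int netW, int netH) {
        float scale = std::min(netW / (float)img.cols, netH / (float)img.rows);
        int newW = (int)(img.cols * scale);
        int newH = (int)(img.rows * scale);
        cv::Mat resized;
        cv::resize(img, resized, cv::Size(newW, newH), 0, 0, cv::INTER_LINEAR);
        cv::Mat out(netH, netW, img.type(), cv::Scalar::all(128));  // gray padding
        resized.copyTo(out(cv::Rect((netW - newW) / 2, (netH - newH) / 2, newW, newH)));
        return out;
    }
    ```
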
  • weixin_39828102 4 months ago
    1. I put letterbox resize in example\yolo\yolo_detect.cpp; just set resize mode = 1 and you can compare the results of both projects.
    2. Or you can retrain the models with resize mode WARP; I recommend this.

  • weixin_39856269 4 months ago

    Hi, I checked the prototxt you posted, and it looks right. I do not have the device at present; I will check it on the device later this week. According to the recent discussion, it seems the preprocessing of the model trained in that project differs from the darknet one. Can you copy the preprocessing code from the training step into main.cpp and have a try? I will also update some Caffe input config later. As another option, you can try the model converted from official darknet in my project and compare the results in both projects.

  • weixin_39856269 4 months ago

    Hi, have you solved it? I checked the prototxts one more time. In Caffe there are some DepthwiseConvolution layers that are plain Convolution layers in your TensorRT prototxt. They behave differently, and that should affect the accuracy.

  • weixin_39771260 4 months ago

    I am trying to use the trt.prototxt, yoloconfig.h and caffemodel shared above to create a TRT engine for mobilenet_yolov3_lite, but I am getting the errors below.

    1. While using TensorRT 4.0.1.6:

    ERROR: conv1/dw: group count must divide input channel count runYolov3: helpers.cpp:56: nvinfer1::DimsCHW nvinfer1::getCHW(const nvinfer1::Dims&): Assertion `d.nbDims >= 3' failed. Aborted (core dumped)

    2. While using TensorRT 5.0.2.6:

    ERROR: conv1/dw: group count must divide input channel count runYolov3: ../common/enginehelper.h:89: nvinfer1::DimsCHW enginehelper::getCHW(const nvinfer1::Dims&): Assertion `d.nbDims >= 3' failed. Aborted (core dumped)

    Kindly share your comments.

  • weixin_39715538 4 months ago

    "Hi, make sure the layer names in your prototxt are the same as those written in PluginFactory.h, namely 'layer86-upsample' and 'layer98-upsample'"

    Where is PluginFactory.h?

  • weixin_39856269 4 months ago

    Hi, make sure the layer names in your prototxt are the same as those written in PluginFactory.h, namely "layer86-upsample" and "layer98-upsample".

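    For reference, a minimal sketch of the kind of name check a caffe-parser plugin factory performs (illustrative only; the real PluginFactory.h in this repo may differ). If the prototxt renames the upsample layers, the check returns false, the stock parser then hits the unknown "Upsample" type, and parsing fails as in the original report:

    ```cpp
    #include <cstring>

    bool isUpsamplePluginLayer(const char* layerName) {
        static const char* kKnownUpsampleLayers[] = {"layer86-upsample", "layer98-upsample"};
        for (const char* name : kKnownUpsampleLayers)
            if (std::strcmp(layerName, name) == 0) return true;  // handled by the Upsample plugin
        return false;                                            // left to the stock caffe parser
    }
    ```
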
  • weixin_39856269 4 months ago

    Hi, do you test with the same structure? Maybe you should check whether the yolo config is the same. The run prints the time cost; can you put the output log here? And could you share your GPU device type? In my testing, FP32 precision is the same as in Caffe.

  • weixin_39856269 4 months ago

    And could you post the prototxt and the YoloConfigs info here?

  • weixin_39562752 4 months ago

    In both the Caffe and TRT tests I used 416*416 input and the same anchor boxes, and the same confi_thre, weights, etc. And in TRT inference, FP32 mode is faster than FP16 mode. :(

  • weixin_39856269 4 months ago

    The proto layer names seem different from mine. Do the leaky relu layer names also match the names in PluginFactory.h? Did you convert your darknet model with the repo https://github.com/ChenYingpeng/caffe-yolov3 ? Does your device support fast fp16?

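    For reference, a minimal sketch of that last check against the TensorRT 5.x builder API (TensorRT 4 used setHalf2Mode instead); on a device without fast fp16 support, an fp16 engine can be no faster, or even slower, than fp32:

    ```cpp
    #include <NvInfer.h>

    // Enable fp16 only when the GPU actually has fast fp16 support.
    void enableFp16IfSupported(nvinfer1::IBuilder* builder) {
        if (builder->platformHasFastFp16()) {
            builder->setFp16Mode(true);   // TensorRT 5.x API
        }
    }
    ```
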
  • weixin_39562752 4 months ago

    I test my model on a TX2 (which supports fp16). The model was designed and modified from yolov3-lite (detection for one class, https://github.com/eric612/MobileNet-YOLO.git). I only use ReLU instead of leaky ReLU, so I thought leaky ReLU would not be needed in PluginFactory.h.

  • weixin_39856269 4 months ago

    I tested the 416 model on a GTX 1060. No layer takes more than 1 ms. Did you modify your yolo model? In the official model there are 76 conv layers, but the caffemodel you used seems to have only 24. Can you share some details?

  • weixin_39856269 4 months ago

    OK. I will give the git repo a try.

  • weixin_39856269 4 months ago

    I tried the prototxt but failed. It is the 320 size, not the 416. And in my code the "DepthwiseConvolution" is not implemented by a plugin yet. How did you process the model?

  • weixin_39856269 4 months ago

    The prototxt seems to be the training prototxt. The engine is tied to the GPU, so I can't run it directly.

    I tested the model on a GTX 1060. In TensorRT, fp32 costs 10.069 ms and int8 costs 8.936 ms. In Caffe it costs 49.968 ms, so Caffe is much slower than TensorRT.

    Please check the time cost on the TX2.

  • weixin_39856269 4 months ago

    Make sure that nothing else affects the inference run.

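    For reference, a minimal measurement sketch (doInference is a placeholder for whatever runs the TensorRT context; this is not code from the repo): warm up first and synchronize before reading the clock, otherwise startup cost and still-pending GPU work distort the numbers:

    ```cpp
    #include <chrono>
    #include <cstdio>
    #include <functional>
    #include <cuda_runtime.h>

    void timeInference(const std::function<void()>& doInference, int warmup = 10, int runs = 100) {
        for (int i = 0; i < warmup; ++i) doInference();   // warm-up: clocks, lazy allocations
        cudaDeviceSynchronize();                          // do not time work queued earlier

        auto t0 = std::chrono::high_resolution_clock::now();
        for (int i = 0; i < runs; ++i) doInference();
        cudaDeviceSynchronize();                          // wait for all launched GPU work
        auto t1 = std::chrono::high_resolution_clock::now();

        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count() / runs;
        std::printf("average inference time: %.3f ms\n", ms);
    }
    ```
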
  • weixin_39562752 4 months ago

    Thanks for your advice. The result on the TX2 is still strange; I will test my model on a desktop GPU first and then report the log to you.

  • weixin_39562752 4 months ago

    Hi, the inference time is reasonable now. There is another question: the accuracy of my model in FP32 and FP16 mode is much worse than in Caffe. I don't think the FP32 accuracy should drop so much. What do you think is the cause? Thanks!

  • weixin_39828102 4 months ago

    Or you could try the non-merged batchnorm model first?

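    For reference, a minimal per-output-channel sketch of what "merging batchnorm" means (standard Conv + BatchNorm + Scale folding, not code from either repo); if the merge script and the inference setup disagree about eps or about Caffe's BatchNorm scale_factor, the merged model can lose accuracy even in FP32:

    ```cpp
    #include <cmath>
    #include <vector>

    void mergeBnIntoConv(std::vector<float>& convWeight,   // weights of one output channel
                         float& convBias,                  // bias of that channel
                         float bnMean, float bnVar,        // BatchNorm running statistics
                         float gamma, float beta,          // Scale layer parameters
                         float eps = 1e-5f) {
        const float s = gamma / std::sqrt(bnVar + eps);
        for (float& w : convWeight) w *= s;                // W' = W * gamma / sqrt(var + eps)
        convBias = (convBias - bnMean) * s + beta;         // b' = (b - mean) * s + beta
    }
    ```
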
  • weixin_39562752 4 months ago

    Thanks for your advice! I will have a try and report the result to you. The merged model works fine in Caffe, but the performance decreases dramatically in TRT (FP32 and FP16). I hope the non-merged model shows a good result.

  • weixin_39856269 4 months ago

    Hi, so what was the reason for the time problem?

    For the accuracy problem, I tested several different TRT models, and fp32 matches Caffe. Did you run the yolo layer by adding a Caffe layer, or do you only run the yolo and detectionOut in custom CPU code? Can you share the prototxt used by TRT and the prototxt used by Caffe, together with YoloConfigs.h and the repo commit id? I will check it in the repo. Thanks.

  • weixin_39828102 4 months ago

    Can I ask about the fps of this model on the TX2? I'm very interested in it.

  • weixin_39828102 4 months ago

    Thanks for your reply. For comparison with SSD on the TX2, see the links below: https://github.com/chenzhi1992/TensorRT-SSD https://github.com/Robert-JunWang/Pelee/issues/43

    • Pelee+SSD can run at 70+ fps on the TX2, which is very cool.
    • So currently I am trying to find the time-consuming components. Thanks again for your reply.

  • weixin_39562752 4 months ago

    Hi, I have tested MobileNet-SSD on the TX2 too, with 300*300 input; it runs at about 30 FPS on a webcam. They may calculate FPS as 1/(inference time), which looks higher. A lot of the time is spent on image transfer and processing. NVIDIA has proposed DeepStream for video acceleration; I think it is very helpful for real-time detection.

  • weixin_39828102 4 months ago

    Maybe it is caused by different resize modes, like "FIT_LARGE_SIZE_AND_PAD" or "WARP"; I'm not sure.

  • weixin_39562752 4 months ago

    I used the "FIT_LARGE_SIZE_AND_PAD" mode in training.

  • weixin_39828102 4 months ago

    I searched the resize code in main.cpp. I guess it uses warp resize; can you write out the image and check whether it has a border? And I don't understand why there is crop code there.

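    For reference, a small debugging sketch for that check (the names here are placeholders, not from main.cpp): dump the image that is actually fed to the network; a letterboxed input shows gray or black bars on two sides, while a warp-resized input fills the whole canvas:

    ```cpp
    #include <opencv2/opencv.hpp>

    void dumpPreprocessed(const cv::Mat& netInput) {
        cv::Mat vis;
        if (netInput.depth() == CV_32F)
            netInput.convertTo(vis, CV_8U, 255.0);   // assumes float data roughly in [0, 1]
        else
            vis = netInput;
        cv::imwrite("net_input_debug.png", vis);
    }
    ```
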
  • weixin_39828102 4 months ago

    I also see that the interpolation method of the resize uses "INTER_CUBIC". It may cost computing time.

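    For reference, a minimal sketch of the suggested change (generic OpenCV, not the exact line in main.cpp); bicubic interpolation does more work per pixel than bilinear, although the replies above suggest the end-to-end difference was small here:

    ```cpp
    #include <opencv2/opencv.hpp>

    cv::Mat resizeForNet(const cv::Mat& img, int netW, int netH) {
        cv::Mat out;
        // cv::INTER_CUBIC -> cv::INTER_LINEAR (cv::INTER_NEAREST is cheaper still)
        cv::resize(img, out, cv::Size(netW, netH), 0, 0, cv::INTER_LINEAR);
        return out;
    }
    ```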
