weixin_39613291
2020-12-09 13:18

Aetros using TensorFlow exclusively as backend?

Are you planning to use TensorFlow as a backend exclusively?

I ask because when I run a training with aetros on my server, using Keras with the TensorFlow backend, Theano AND TensorFlow seem to run in parallel, which leads to this crash:


Found 21 classes, 3963 images (3168 in training [augmented], 795 in validation). Read all images into memory from /home/ubuntu/aetros-cli-data/datasets/arnauddelaunay/dataset/fashion/datasets_downloads
trainer.input_shape = []
trainer.classes = ["Class 15", "Class 1", "Class 14", "Class 18", "Class 10", "Class 19", "Class 4", "Class 17", "Class 5", "Class 6", "Class 11", "Class 0", "Class 3", "Class 9", "Class 13", "Class 20", "Class 16", "Class 12", "Class 2", "Class 8", "Class 7"]
Possible data keys 'arnauddelaunay/dataset/fashion'
Training status changed to CONSTRUCT   
F tensorflow/stream_executor/cuda/cuda_driver.cc:316] current context was not created by the StreamExecutor cuda_driver API: 0x3e13e20; a CUDA runtime call was likely performed without using a StreamExecutor context
Aborted (core dumped)

On this issue, they say it may come from running both Theano and TensorFlow on the GPU.

By the way, it works fine when I run with Keras using Theano as backend.
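
The crash above is consistent with two GPU frameworks being initialized in the same Python process. A minimal sketch for checking that, assuming only that Theano and TensorFlow are imported under their usual top-level module names `theano` and `tensorflow`. The helper name `loaded_gpu_frameworks` is hypothetical and not part of aetros-cli:

```python
import sys

# Top-level module names of the two Keras backends discussed in this issue.
BACKEND_MODULES = ("theano", "tensorflow")


def loaded_gpu_frameworks(modules=None):
    """Return the backend frameworks already imported in this process."""
    modules = sys.modules if modules is None else modules
    return [name for name in BACKEND_MODULES if name in modules]


if __name__ == "__main__":
    found = loaded_gpu_frameworks()
    if len(found) > 1:
        print("Both backends are loaded in one process: " + ", ".join(found))
    else:
        print("Loaded backends: " + (", ".join(found) or "none"))
```

Calling this just before the model is constructed would show whether something imported Theano even though Keras reports the TensorFlow backend.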

This question comes from the open-source project: aetros/aetros-cli


11 answers

  • weixin_39613291 5 months ago

    Ah yes, you're right. I was at 90% zoom in Chrome.

    90%: capture-3, 100%: capture-2

  • weixin_39604189 5 months ago

    good :)

  • weixin_39604189 5 months ago

    Looks a bit weird. :D Which browser are you using? My bar looks like:

    screen shot 2016-08-23 at 16 24 20

  • weixin_39613291 5 months ago

    Looks fine to me :) capture-1

  • weixin_39613291 5 months ago

    Yes, it's working, thanks!

    I'm still having some issues with logging, though:

    • GPU usage is not reported in the online trainer (capture), although I know the job is running on the GPU on my machine when I check nvidia-smi:

    
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID  Type  Process name                               Usage      |
    |=============================================================================|
    |    0     18311    C   /usr/bin/python                               3845MiB |
    +-----------------------------------------------------------------------------+
    
    • If I activate insights (with the flag --insights), it crashes:
    
    Training status changed to TRAINING 
    Epoch 1: loss=0.338704, acc=0.897200, val_loss=0.099598, val_acc=0.969100
    Crashed ...
    ERROR:root:Traceback (most recent call last):
      File "/usr/local/lib/python2.7/dist-packages/aetros-0.3.4-py2.7.egg/aetros/starter.py", line 93, in start
        network.job_start(job_model, trainer, keras_logger, general_logger)
      File "/usr/local/lib/python2.7/dist-packages/aetros-0.3.4-py2.7.egg/aetros/network.py", line 121, in job_start
        model_provider.train(trainer, model, data_train, data_validation)
      File "/home/ubuntu/aetros-cli/aetros-cli-data/networks/arnauddelaunay/digit-convolution/emP4X0Pz2/model_provider.py", line 166, in train
        callbacks=trainer.callbacks
      File "/usr/local/lib/python2.7/dist-packages/Keras-1.0.7-py2.7.egg/keras/engine/training.py", line 1107, in fit
        callback_metrics=callback_metrics)
      File "/usr/local/lib/python2.7/dist-packages/Keras-1.0.7-py2.7.egg/keras/engine/training.py", line 845, in _fit_loop
        callbacks.on_epoch_end(epoch, epoch_logs)
      File "/usr/local/lib/python2.7/dist-packages/Keras-1.0.7-py2.7.egg/keras/callbacks.py", line 40, in on_epoch_end
        callback.on_epoch_end(epoch, logs)
      File "/usr/local/lib/python2.7/dist-packages/aetros-0.3.4-py2.7.egg/aetros/KerasLogger.py", line 215, in on_epoch_end
        images = self.build_insight_images()
      File "/usr/local/lib/python2.7/dist-packages/aetros-0.3.4-py2.7.egg/aetros/KerasLogger.py", line 308, in build_insight_images
        data = layer.W.get_value()
    AttributeError: 'Variable' object has no attribute 'get_value'
    
    Sending last (6) monitoring information to server ... 
    out.
    Traceback (most recent call last):
      File "/usr/local/bin/aetros", line 9, in <module>
        load_entry_point('aetros==0.3.4', 'console_scripts', 'aetros')()
      File "/usr/local/lib/python2.7/dist-packages/aetros-0.3.4-py2.7.egg/aetros/__init__.py", line 79, in main
        return command.main(cmd_args)
      File "/usr/local/lib/python2.7/dist-packages/aetros-0.3.4-py2.7.egg/aetros/commands/StartCommand.py", line 45, in main
        start(parsed_args.network_name, dataset_id=parsed_args.dataset, insights=parsed_args.insights, insights_sample_path=parsed_args.insights_sample)
      File "/usr/local/lib/python2.7/dist-packages/aetros-0.3.4-py2.7.egg/aetros/starter.py", line 131, in start
        raise e
    AttributeError: 'Variable' object has no attribute 'get_value'
    Exception AttributeError: AttributeError("'NoneType' object has no attribute 'raise_exception_on_not_ok_status'",) in <bound method session.__del__ of object at>> ignored
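
The `AttributeError` in the traceback above comes from calling `get_value()` on a weight variable: Theano shared variables provide that method, TensorFlow variables do not. A hedged sketch of a backend-neutral read follows. `read_weight` and the two stand-in classes are hypothetical and only illustrate the idea; in real Keras code the fallback role could be filled by `keras.backend.get_value`, or the whole lookup replaced by `layer.get_weights()`. This is not necessarily how the fix in aetros-cli was made.

```python
def read_weight(variable, fallback_eval):
    """Return the value of a weight variable without assuming a backend.

    variable      -- object that may or may not provide get_value()
    fallback_eval -- callable used when get_value() is missing; in Keras this
                     could be keras.backend.get_value (an assumption here)
    """
    getter = getattr(variable, "get_value", None)
    if callable(getter):
        return getter()
    return fallback_eval(variable)


class TheanoLikeVariable(object):
    """Stand-in for a Theano shared variable (hypothetical)."""

    def __init__(self, value):
        self._value = value

    def get_value(self):
        return self._value


class TensorFlowLikeVariable(object):
    """Stand-in for a TensorFlow variable, which has no get_value()."""

    def __init__(self, value):
        self.stored = value


if __name__ == "__main__":
    print(read_weight(TheanoLikeVariable([1.0, 2.0]), None))
    print(read_weight(TensorFlowLikeVariable([3.0]), lambda v: v.stored))
```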
  • weixin_39604189 5 months ago

    I fixed the second issue. TensorFlow is not yet completely supported; for example, I have not yet found a way to get GPU information using TensorFlow.
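
One framework-independent way to obtain GPU memory figures is to ask `nvidia-smi` directly, the same tool used in the report above. The following is only a sketch under that assumption, not what aetros-cli does; the function names `parse_gpu_memory` and `query_gpu_memory` are hypothetical.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' MiB pairs, one GPU per line, into tuples of ints."""
    result = []
    for line in csv_text.strip().splitlines():
        used, total = [int(field.strip()) for field in line.split(",")]
        result.append((used, total))
    return result


def query_gpu_memory():
    """Ask nvidia-smi for per-GPU memory usage (needs the NVIDIA driver tools)."""
    output = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=memory.used,memory.total",
        "--format=csv,noheader,nounits",
    ])
    return parse_gpu_memory(output.decode("utf-8"))
```

Some devices may report non-numeric placeholders for these fields, in which case `int()` raises `ValueError`; a caller would need to handle that case.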

  • weixin_39604189 5 months ago

    Btw, have you zoomed in your browser? That accuracy bar looks weird :P

  • weixin_39604189 5 months ago

    Actually, when you switch Keras to TensorFlow, things should work; we do not force the usage of Theano anywhere. Maybe you hit a bug, but I can't think of a place where we use Theano directly. During the open beta, we only test on Theano, though. Which network did you want to train?
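
For reference, Keras picks its backend from the `KERAS_BACKEND` environment variable if set, otherwise from the `backend` field of `~/.keras/keras.json`. A small sketch that reports which backend is configured, without importing Keras; the function name `configured_keras_backend` is hypothetical:

```python
import json
import os


def configured_keras_backend(config_path=None, environ=None):
    """Return the backend name Keras is configured to use, or None if unset."""
    environ = os.environ if environ is None else environ
    # The environment variable takes precedence over the config file.
    if environ.get("KERAS_BACKEND"):
        return environ["KERAS_BACKEND"]
    if config_path is None:
        config_path = os.path.join(os.path.expanduser("~"), ".keras", "keras.json")
    if os.path.exists(config_path):
        with open(config_path) as handle:
            return json.load(handle).get("backend")
    return None


if __name__ == "__main__":
    print(configured_keras_backend())
```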

  • weixin_39613291 5 months ago

    I tried several networks, but even with the default digit-convolution, I hit the same bug:

    
    ubuntu-172-31-5-17:~/aetros$ API_KEY='xxxxxxxxx' aetros start arnauddelaunay/digit-convolution --insights --gpu --tf
    ...
    Training '1xv2YRNRD' created and started. Open http://aetros.com/trainer/app?training=1xv2YRNRD to monitor the training.
    start network ...
    Using TensorFlow backend.
    I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
    I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
    I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
    I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
    I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
    Using gpu device 0: GRID K520 (CNMeM is disabled, cuDNN 5005)
    0.08GB GPU memory used of 4.00GB
    Setup training
    Start training
    Training status changed to STARTING 
    Imported model_provider in /home/ubuntu/aetros/aetros-cli-data/networks/arnauddelaunay/digit-convolution/1xv2YRNRD/model_provider.py 
    Training status changed to LOAD DATA 
    Imported dataset provider in /home/ubuntu/aetros/aetros-cli-data/networks/arnauddelaunay/digit-convolution/1xv2YRNRD/datasets/arnauddelaunay__dataset__mnist-digits.py 
    ('X_train shape:', (60000, 1, 28, 28))
    (60000, 'train samples')
    (10000, 'test samples')
    trainer.input_shape = []
    trainer.classes = []
    Possible data keys 'arnauddelaunay/dataset/mnist-digits'
    Training status changed to CONSTRUCT 
    F tensorflow/stream_executor/cuda/cuda_driver.cc:316] current context was not created by the StreamExecutor cuda_driver API: 0x455f1f0; a CUDA runtime call was likely performed without using a StreamExecutor context
    Aborted (core dumped)
    
  • weixin_39604189 5 months ago

    Alright, thanks for reporting! I'll look into this very soon.

  • weixin_39604189 5 months ago

    Fixed in master. Can you try to test it please?
