Hello. I'm sorry to disturb you again after so long.
When I train with the parameters defined in your paper, my final results differ from the ones reported there. As shown below, after 1000 epochs my **total reward** plateaus at -100 and **slowdown** plateaus at 2.7.

### my result

### your result (Figure 6 in the paper)

I haven't changed any other code. Why are my results worse? I can think of the following possible reasons:

- I did not use supervised learning; I started reinforcement learning directly. I would also like to ask whether supervised pretraining (imitation learning) only speeds up convergence, or whether it also improves the final result.
- A different Theano version (I use 1.0.1).
- Different parameter settings. My parameter.py follows Section 4.1 (DeepRM) in the paper; my main settings are as follows:

```
self.num_epochs = 1000 # number of training epochs
self.simu_len = 50 # length of the busy cycle that repeats itself
self.num_ex = 100 # number of sequences
self.output_freq = 100 # interval for output and store parameters
self.num_seq_per_batch = 20 # number of sequences to compute baseline
self.episode_max_length = 1000 # enforcing an artificial terminal
self.num_res = 2 # number of resources in the system
self.num_nw = 10 # maximum allowed number of work in the queue
self.time_horizon = 20 # number of time steps in the graph
self.max_job_len = 15 # maximum duration of new jobs
self.res_slot = 10 # maximum number of available resource slots
self.max_job_size = 10 # maximum resource request of new work
self.backlog_size = 60 # backlog queue size
self.max_track_since_new = 10 # track how many time steps since last new jobs
self.job_num_cap = 40 # maximum number of distinct colors in current work graph
self.new_job_rate = 0.7 # lambda in new job arrival Poisson Process
self.discount = 1
```
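To make my imitation-learning question concrete, here is a minimal sketch of what I mean by supervised pretraining before policy-gradient training. Everything in it is invented for illustration (the linear softmax policy, the synthetic states, and the teacher are not from your code): the policy is first fit to a teacher's actions by cross-entropy, and the resulting weights would then be the starting point for REINFORCE updates instead of a random initialization.

```python
import numpy as np

# Hypothetical illustration (not the DeepRM code): warm-starting a softmax
# policy by supervised learning on a teacher before policy-gradient training.
rng = np.random.default_rng(0)
n_features, n_actions = 8, 4

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Synthetic states and a deterministic teacher policy to imitate
# (standing in for e.g. an SJF heuristic in the scheduling setting).
states = rng.normal(size=(512, n_features))
teacher_W = rng.normal(size=(n_features, n_actions))
teacher_actions = (states @ teacher_W).argmax(axis=1)

# Supervised (imitation) phase: minimize cross-entropy to the teacher.
W = np.zeros((n_features, n_actions))
lr = 0.5
for _ in range(200):
    probs = softmax(states @ W)                       # (N, A) action probabilities
    onehot = np.eye(n_actions)[teacher_actions]       # teacher labels
    grad = states.T @ (probs - onehot) / len(states)  # d(cross-entropy)/dW
    W -= lr * grad

accuracy = (softmax(states @ W).argmax(axis=1) == teacher_actions).mean()
print(f"imitation accuracy: {accuracy:.2f}")
# These weights W would then seed the RL phase, so the agent starts near the
# teacher's policy rather than exploring from scratch.
```

My question is whether this warm start only shortens the random-exploration phase at the beginning of training, or whether without it the policy can also converge to a worse final reward/slowdown.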

I hope you can help. Thank you.