Hello. I'm sorry to disturb you again after so long. When I train with the parameters defined in your paper, my final results differ from those reported in the paper. As shown below, after 1000 epochs my total reward plateaus at about -100, and slowdown plateaus at about 2.7.
Your result (Figure 6 in the paper):
I haven't changed any other code. Why are my results worse? I can think of the following possible reasons:

- I did not use supervised learning; I started reinforcement learning directly. Related question: does supervised pretraining (imitation learning) only speed up convergence, or does it also improve the final result?
- The Theano version is different (I use 1.0.1).
- Different parameter settings. My parameters.py follows Section 4.1 (DeepRM) in the paper; my main settings are as follows:
```python
self.num_epochs = 1000           # number of training epochs
self.simu_len = 50               # length of the busy cycle that repeats itself
self.num_ex = 100                # number of sequences
self.output_freq = 100           # interval for output and store parameters
self.num_seq_per_batch = 20      # number of sequences to compute baseline
self.episode_max_length = 1000   # enforcing an artificial terminal
self.num_res = 2                 # number of resources in the system
self.num_nw = 10                 # maximum allowed number of jobs in the queue
self.time_horizon = 20           # number of time steps in the graph
self.max_job_len = 15            # maximum duration of new jobs
self.res_slot = 10               # maximum number of available resource slots
self.max_job_size = 10           # maximum resource request of new work
self.backlog_size = 60           # backlog queue size
self.max_track_since_new = 10    # track how many time steps since last new jobs
self.job_num_cap = 40            # maximum number of distinct colors in current work graph
self.new_job_rate = 0.7          # lambda in new job arrival Poisson process
self.discount = 1
```
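To make the imitation-learning question above concrete, here is a minimal sketch of the kind of supervised warm start I have in mind: fit a softmax policy by cross-entropy to actions chosen by a heuristic teacher (e.g. SJF), then use the fitted weights to initialize the RL policy. This is only an illustration; the names (`teacher`, `n_actions`, etc.) and the plain NumPy logistic regression are my own stand-ins, not the DeepRM code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_features, n_actions = 500, 8, 4

# Toy states and a stand-in "teacher": in the real setting these would be
# DeepRM state images and the action a heuristic scheduler (e.g. SJF) picks.
X = rng.normal(size=(n_states, n_features))
true_W = rng.normal(size=(n_features, n_actions))
teacher = np.argmax(X @ true_W, axis=1)   # hypothetical heuristic labels

# Behavioral cloning: softmax policy trained by cross-entropy on the
# teacher's actions (plain gradient descent for simplicity).
W = np.zeros((n_features, n_actions))
lr = 0.5
for _ in range(300):
    logits = X @ W
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    onehot = np.eye(n_actions)[teacher]
    grad = X.T @ (probs - onehot) / n_states          # cross-entropy gradient
    W -= lr * grad

# W would then initialize the policy network before REINFORCE training;
# agreement with the teacher indicates the warm start is meaningful.
accuracy = float((np.argmax(X @ W, axis=1) == teacher).mean())
```

My understanding of the paper is that this pretraining mainly changes where RL starts from, so it should speed convergence; whether it also changes the final reward is exactly what I would like to confirm.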
I hope you can help. Thank you.