Reinforcement Learning Agents

DQN

DQN implementation with target network and epsilon-greedy exploration. Follows the Stable Baselines 3 implementation. To reuse trained models, you can make use of the save and load functions. To adapt the policy and value network structure, specify the layer and activation parameters in your train config or change the constants in this file.
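
A minimal usage sketch based on the signatures documented below; the environment, the config contents, and the file path are placeholders, not values defined by this module:

    # Hypothetical usage sketch; only the documented signatures are assumed to exist.
    from agents.reinforcement_learning.dqn import DQN

    env = ...     # pre-generated, gym-based scheduling environment (placeholder)
    config = {}   # dictionary with the DQN attributes described under __init__ (placeholder)

    agent = DQN(env, config=config)
    agent.learn(total_instances=100, total_timesteps=10_000)
    agent.save("models/dqn_run")                        # path is illustrative
    agent = DQN.load("models/dqn_run", config=config)   # load expects the path without .pkl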

class agents.reinforcement_learning.dqn.MemoryBuffer(buffer_size: int, batch_size: int, obs_dim: int, obs_type: type, action_type: type)

Bases: object

Handles episode data collection and sample generation

Parameters
  • buffer_size – Buffer size

  • batch_size – Size for batches to be generated

  • obs_dim – Size of the observation to be stored in the buffer

  • obs_type – Type of the observation to be stored in the buffer

  • action_type – Type of the action to be stored in the buffer

__init__(buffer_size: int, batch_size: int, obs_dim: int, obs_type: type, action_type: type)
store_memory(obs, action, reward, done, new_obs) None

Appends all data from the recent step

Parameters
  • obs – Observation at the beginning of the step

  • action – Index of the selected action

  • reward – Reward the env returned in this step

  • done – True if the episode ended in this step

  • new_obs – Observation after the step

Returns

None

get_samples() Tuple

Generates random samples from the stored data

Returns

batch_size samples from the buffer, e.g. the obs, actions, …, new_obs stored at step 21
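
A standalone sketch of the ring-buffer storage and random sampling described above; it is a simplified stand-in under assumed semantics, not the actual MemoryBuffer code:

    import numpy as np

    class SketchMemoryBuffer:
        """Simplified replay buffer: fixed-size arrays, oldest entries overwritten first."""

        def __init__(self, buffer_size, batch_size, obs_dim, obs_type=np.float32, action_type=np.int64):
            self.buffer_size, self.batch_size = buffer_size, batch_size
            self.obs = np.zeros((buffer_size, obs_dim), dtype=obs_type)
            self.new_obs = np.zeros((buffer_size, obs_dim), dtype=obs_type)
            self.actions = np.zeros(buffer_size, dtype=action_type)
            self.rewards = np.zeros(buffer_size, dtype=np.float32)
            self.dones = np.zeros(buffer_size, dtype=np.bool_)
            self.pos, self.full = 0, False

        def store_memory(self, obs, action, reward, done, new_obs):
            # write the transition at the current position and advance (ring buffer)
            self.obs[self.pos], self.new_obs[self.pos] = obs, new_obs
            self.actions[self.pos], self.rewards[self.pos], self.dones[self.pos] = action, reward, done
            self.pos = (self.pos + 1) % self.buffer_size
            self.full = self.full or self.pos == 0

        def get_samples(self):
            # draw batch_size random transitions from the filled part of the buffer
            upper = self.buffer_size if self.full else self.pos
            idx = np.random.randint(0, upper, size=self.batch_size)
            return self.obs[idx], self.actions[idx], self.rewards[idx], self.dones[idx], self.new_obs[idx]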

class agents.reinforcement_learning.dqn.Policy(obs_dim: int, action_dim: int, learning_rate: float, hidden_layers: List[int], activation: str)

Bases: Module

Network structure used for both the Q network and the target network

Parameters
  • obs_dim – Observation size to determine input dimension

  • action_dim – Number of actions to determine output size

  • learning_rate – Learning rate for the network

  • hidden_layers – List of hidden layer sizes (int)

  • activation – String naming activation function for hidden layers

__init__(obs_dim: int, action_dim: int, learning_rate: float, hidden_layers: List[int], activation: str)

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(obs)

forward pass through the Q-network

class agents.reinforcement_learning.dqn.DQN(env, config: dict, logger: Optional[Logger] = None)

Bases: object

DQN implementation with target network and epsilon-greedy exploration. Follows the Stable Baselines 3 implementation.

__init__(env, config: dict, logger: Optional[Logger] = None)
batch_size: Number of samples that are chosen and passed through the net per update
gradient_steps: Number of updates per training
train_freq: Environment steps between two trainings
buffer_size: Size of the memory buffer = max number of rollouts that can be stored before the oldest are deleted
target_net_update: Number of steps between target_net_updates
training_starts (learning_starts): number of steps after which training can start for the first time
initial_eps: Initial epsilon value
final_eps: Final epsilon value
fraction_eps: Once the progress of learn (fraction of total_timesteps completed) exceeds fraction_eps, epsilon is set to the final_eps value,
e.g. 5/100 total_timesteps done -> progress = 0.05; if progress > fraction_eps -> eps = final_eps (see the sketch after the parameter list below)
max_grad_norm: Value to clip the policy update of the q_net
Parameters
  • env – Pre-generated, gym-based environment. If no env is passed (env = None), the DQN agent can only be used for evaluation (action prediction)

  • config – Dictionary with parameters to specify DQN attributes

  • logger – Logger
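
A sketch of the epsilon schedule described above; the linear decay before the fraction_eps threshold is an assumption in line with the Stable Baselines 3 linear schedule:

    def epsilon_schedule(progress, initial_eps=1.0, final_eps=0.05, fraction_eps=0.1):
        """progress = timesteps_done / total_timesteps, a value in [0, 1]."""
        if progress > fraction_eps:
            return final_eps
        # linear decay from initial_eps to final_eps over the first fraction_eps of training
        return initial_eps + (progress / fraction_eps) * (final_eps - initial_eps)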

save(file: str) None

Save model as pickle file

Parameters

file – Path under which the file will be saved

Returns

None

classmethod load(file: str, config: dict, logger: Optional[Logger] = None)

Creates a DQN object according to the parameters saved in file.pkl

Parameters
  • file – Path and filename (without .pkl) of your saved model pickle file

  • config – Dictionary with parameters to specify DQN attributes

  • logger – Logger

Returns

DQN object

get_action(obs: ndarray) int

Random action or greedy action according to the policy, depending on epsilon

Returns

action index
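
A standalone sketch of epsilon-greedy selection (not the actual get_action code); q_net stands for any callable that maps an observation to a vector of Q-values:

    import numpy as np
    import torch

    def epsilon_greedy_action(q_net, obs, eps, n_actions):
        if np.random.rand() < eps:
            return int(np.random.randint(n_actions))     # explore: random action index
        with torch.no_grad():
            q_values = q_net(torch.as_tensor(obs, dtype=torch.float32))
        return int(torch.argmax(q_values).item())        # exploit: greedy action index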

predict(observation: ndarray, action_mask: ndarray = array([1.]), deterministic: bool = True, state=None) Tuple

Action prediction for testing

Parameters
  • observation – Current observation of the environment

  • action_mask – Mask of actions which can logically be taken. NOTE: currently not implemented!

  • deterministic – Set to True to force a deterministic prediction

  • state – The last states (used in rnn policies)

Returns

Predicted action and next state (used in rnn policies)

train() None

Trains Q-network and Target-Network

Returns

None
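
A standalone sketch of one SB3-style DQN update consistent with the attributes described above; the Huber loss and the exact tensor handling are assumptions:

    import torch
    import torch.nn.functional as F

    def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99, max_grad_norm=10.0):
        obs, actions, rewards, dones, new_obs = batch    # torch tensors with leading batch dimension
        with torch.no_grad():
            # TD target computed with the frozen target network
            next_q = target_net(new_obs).max(dim=1).values
            target = rewards + gamma * (1.0 - dones.float()) * next_q
        q_taken = q_net(obs).gather(1, actions.long().unsqueeze(1)).squeeze(1)
        loss = F.smooth_l1_loss(q_taken, target)
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(q_net.parameters(), max_grad_norm)  # max_grad_norm clipping
        optimizer.step()
        # the target network itself is refreshed periodically, e.g. via
        # target_net.load_state_dict(q_net.state_dict())
        return loss.item()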

on_step(total_timesteps)

Tracks and checks several per-step conditions, e.g. whether a target-network or epsilon update is necessary

learn(total_instances: int, total_timesteps: int, intermediate_test=None) None

Learn over n problem instances or n timesteps (environment steps), stopping as soon as the first of the two limits is reached. One learning iteration consists of collecting rollouts and training the networks on the rollout data.

Parameters
  • total_instances – Instance limit

  • total_timesteps – Timestep limit

  • intermediate_test – (IntermediateTest) intermediate test object. Must be created before.

PPO

PPO implementation inspired by the Stable Baselines 3 implementation. To reuse trained models, you can make use of the save and load functions. To adapt the policy and value network structure, specify the policy and value layer and activation parameters in your train config or change the constants in this file.

class agents.reinforcement_learning.ppo.RolloutBuffer(buffer_size: int, batch_size: int)

Bases: object

Handles episode data collection and batch generation

Parameters
  • buffer_size – Buffer size

  • batch_size – Size for batches to be generated

__init__(buffer_size: int, batch_size: int)
generate_batches() Tuple

Generates batches from the stored data

Returns

batches: lists of indices into the rollout data, shuffled and grouped into lists of length batch_size, e.g. [[0, 34, 1, 768, … (length: batch_size)], […], … (number of lists: len(rollout_data) / batch_size)]
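
A standalone sketch of the index batching described above:

    import numpy as np

    def generate_index_batches(n_stored, batch_size):
        indices = np.random.permutation(n_stored)   # shuffle all rollout indices
        # split into consecutive chunks of length batch_size (the last chunk may be shorter)
        return [indices[start:start + batch_size] for start in range(0, n_stored, batch_size)]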

compute_advantages_and_returns(last_value, gamma, gae_lambda) None

Computes advantage values and returns for all stored episodes.

Parameters
  • last_value – Value from the next step to calculate the advantage for the last episode in the buffer

  • gamma – Discount factor for the advantage calculation

  • gae_lambda – Smoothing parameter for the advantage calculation

Returns

None
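
A standalone sketch of a standard GAE computation using the parameter names above (last_value, gamma, gae_lambda); the library's exact handling of episode boundaries may differ:

    import numpy as np

    def compute_advantages_and_returns(rewards, values, dones, last_value, gamma=0.99, gae_lambda=0.95):
        advantages = np.zeros(len(rewards), dtype=np.float32)
        gae = 0.0
        for t in reversed(range(len(rewards))):
            next_value = last_value if t == len(rewards) - 1 else values[t + 1]
            non_terminal = 1.0 - float(dones[t])                      # cut the bootstrap at episode ends
            delta = rewards[t] + gamma * next_value * non_terminal - values[t]
            gae = delta + gamma * gae_lambda * non_terminal * gae
            advantages[t] = gae
        returns = advantages + np.asarray(values, dtype=np.float32)   # returns = advantages + value baseline
        return advantages, returns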

store_memory(observation: ndarray, action: int, prob: float, value: float, reward: Any, done: bool) None

Appends all data from the recent step

Parameters
  • observation – Observation at the beginning of the step

  • action – Index of the selected action

  • prob – Probability of the selected action (output from the policy_net)

  • value – Baseline value that the value_net estimated from this step onwards according to the observation (output from the value_net)

  • reward – Reward the env returned in this step

  • done – True if the episode ended in this step

Returns

None

reset() None

Resets all buffer lists

Returns

None

class agents.reinforcement_learning.ppo.PolicyNetwork(input_dim: int, n_actions: int, learning_rate: float, hidden_layers: List[int], activation: str)

Bases: Module

Policy Network for the agent

Parameters
  • input_dim – Observation size to determine input dimension

  • n_actions – Number of actions to determine output size

  • learning_rate – Learning rate for the network

  • hidden_layers – List of hidden layer sizes (int)

  • activation – String naming activation function for hidden layers

__init__(input_dim: int, n_actions: int, learning_rate: float, hidden_layers: List[int], activation: str)

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(observation)

forward function

class agents.reinforcement_learning.ppo.ValueNetwork(input_dim: int, learning_rate: float, hidden_layers: List[int], activation: str)

Bases: Module

Value Network for the agent

Parameters
  • input_dim – Observation size to determine input dimension

  • learning_rate – Learning rate for the network

  • hidden_layers – List of hidden layer sizes (int)

  • activation – String naming activation function for hidden layers

__init__(input_dim: int, learning_rate: float, hidden_layers: List[int], activation: str)

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(observation)

forward function

class agents.reinforcement_learning.ppo.PPO(env, config: dict, logger: Optional[Logger] = None)

Bases: object

PPO Agent class

__init__(env, config: dict, logger: Optional[Logger] = None)
gamma: Discount factor for the advantage calculation
learning_rate: Learning rate for both, policy_net and value_net
gae_lambda: Smoothing parameter for the advantage calculation
clip_range: Limitation for the ratio between old and new policy
batch_size: Size of the batches sampled from the buffer and fed into the networks during training
n_epochs: Number of repetitions for each training iteration
rollout_steps: Step interval within which the update is performed. Has to be a multiple of batch_size
classmethod load(file: str, config: dict, logger: Optional[Logger] = None)

Creates a PPO object according to the parameters saved in file.pkl

Parameters
  • file – Path and filename (without .pkl) of your saved model pickle file

  • config – Dictionary with parameters to specify PPO attributes

  • logger – Logger

Returns

PPO object

save(file: str) None

Save model as pickle file

Parameters

file – Path under which the file will be saved

Returns

None

forward(observation: ndarray, **kwargs) Tuple

Predicts an action according to the current policy based on the observation, together with the value estimate for the next state

Parameters
  • observation – Current observation of the environment

  • kwargs – Used to accept, but ignore, action masks passed from the environment.

Returns

Predicted action, probability for this action, and predicted value for the next state

predict(observation: ndarray, deterministic: bool = True, state=None, **kwargs) Tuple

Action prediction for testing

Parameters
  • observation – Current observation of the environment

  • deterministic – Set to True to force a deterministic prediction

  • state – The last states (used in rnn policies)

  • kwargs – Used to accept, but ignore, action masks passed from the environment.

Returns

Predicted action and next state (used in rnn policies)

train() None

Trains the policy and value networks

Returns

None
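
A standalone sketch of the clipped surrogate policy loss this training step is built around; clip_range corresponds to the attribute described in __init__, and the value and entropy terms are omitted for brevity:

    import torch

    def ppo_policy_loss(new_log_probs, old_log_probs, advantages, clip_range=0.2):
        ratio = torch.exp(new_log_probs - old_log_probs)                           # pi_new / pi_old
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantages
        return -torch.min(unclipped, clipped).mean()                               # maximize the clipped surrogate

    # the value network is typically trained alongside with an MSE loss against the computed returns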

learn(total_instances: int, total_timesteps: int, intermediate_test=None) None

Learn over n environment instances or n timesteps, stopping as soon as the first of the two limits is reached. One learning iteration consists of collecting rollouts and training the networks.

Parameters
  • total_instances – Instance limit

  • total_timesteps – Timestep limit

  • intermediate_test – (IntermediateTest) intermediate test object. Must be created before.

agents.reinforcement_learning.ppo.explained_variance(y_pred: ndarray, y_true: ndarray) ndarray

From Stable Baselines. Computes the fraction of variance that y_pred explains about y_true. Returns 1 - Var[y_true - y_pred] / Var[y_true]

Interpretation:

ev = 0 => might as well have predicted zero
ev = 1 => perfect prediction
ev < 0 => worse than just predicting zero

Parameters
  • y_pred – the prediction

  • y_true – the expected value

Returns

explained variance of ypred and y
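
A direct sketch of the formula above:

    import numpy as np

    def explained_variance(y_pred, y_true):
        var_y = np.var(y_true)
        # undefined when y_true has zero variance
        return np.nan if var_y == 0 else 1.0 - np.var(y_true - y_pred) / var_y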

PPO_masked

PPO implementation with action masking, following the Stable Baselines 3 implementation. To reuse trained models, you can make use of the save and load functions.

class agents.reinforcement_learning.ppo_masked.RolloutBuffer(buffer_size: int, batch_size: int)

Bases: object

Handles episode data collection and batch generation

Parameters
  • buffer_size – Buffer size

  • batch_size – Size for batches to be generated

__init__(buffer_size: int, batch_size: int)
generate_batches() Tuple

Generates batches from the stored data

Returns

batches: lists of indices into the rollout data, shuffled and grouped into lists of length batch_size, e.g. [[0, 34, 1, 768, … (length: batch_size)], […], … (number of lists: len(rollout_data) / batch_size)]

compute_advantages_and_returns(last_value, gamma, gae_lambda) None

Computes advantage values and returns for all stored episodes.

Parameters
  • last_value – Value from the next step to calculate the advantage for the last episode in the buffer

  • gamma – Discount factor for the advantage calculation

  • gae_lambda – Smoothing parameter for the advantage calculation

Returns

None

store_memory(observation: ndarray, action: int, prob: float, value: float, reward: Any, done: bool, action_mask: ndarray) None

Appends all data from the recent step

Parameters
  • observation – Observation at the beginning of the step

  • action – Index of the selected action

  • prob – Probability of the selected action (output from the policy_net)

  • value – Baseline value that the value_net estimated from this step onwards according to the observation (output from the value_net)

  • reward – Reward the env returned in this step

  • done – True if the episode ended in this step

  • action_mask – One hot vector with ones for all possible actions

Returns

None

reset() None

Resets all buffer lists

Returns

None

class agents.reinforcement_learning.ppo_masked.PolicyNetwork(input_dim: int, n_actions: int, learning_rate: float, hidden_layers: List[int], activation: str)

Bases: Module

Policy Network for the agent

Parameters
  • input_dim – Observation size to determine input dimension

  • n_actions – Number of actions to determine output size

  • learning_rate – Learning rate for the network

  • hidden_layers – List of hidden layer sizes (int)

  • activation – String naming activation function for hidden layers

__init__(input_dim: int, n_actions: int, learning_rate: float, hidden_layers: List[int], activation: str)

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(observation, action_mask)

forward through the actor network

class agents.reinforcement_learning.ppo_masked.ValueNetwork(input_dim: int, learning_rate: float, hidden_layers: List[int], activation: str)

Bases: Module

Value Network for the agent

Parameters
  • input_dim – Observation size to determine input dimension

  • learning_rate – Learning rate for the network

  • hidden_layers – List of hidden layer sizes (int)

  • activation – String naming activation function for hidden layers

__init__(input_dim: int, learning_rate: float, hidden_layers: List[int], activation: str)

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(observation)

forward through the value network

class agents.reinforcement_learning.ppo_masked.MaskedPPO(env, config: dict, logger: Optional[Logger] = None)

Bases: object

__init__(env, config: dict, logger: Optional[Logger] = None)
gamma: Discount factor for the advantage calculation
learning_rate: Learning rate for both, policy_net and value_net
gae_lambda: Smoothing parameter for the advantage calculation
clip_range: Limitation for the ratio between old and new policy
batch_size: Size of the batches sampled from the buffer and fed into the networks during training
n_epochs: Number of repetitions for each training iteration
rollout_steps: Step interval within which the update is performed. Has to be a multiple of batch_size
classmethod load(file: str, config: dict, logger: Optional[Logger] = None)

Creates a MaskedPPO object according to the parameters saved in file.pkl

Parameters
  • file – Path and filename (without .pkl) of your saved model pickle file

  • config – Dictionary with parameters to specify MaskedPPO attributes

  • logger – Logger

Returns

MaskedPPO object

save(file: str) None

Save model as pickle file

Parameters

file – Path under which the file will be saved

Returns

None

forward(observation: ndarray, action_mask: ndarray) Tuple

Predicts an action according to the current policy, based on the observation and the action_mask, together with the value estimate for the next state

Parameters
  • observation – Current observation of the environment

  • action_mask – One hot vector with ones for all possible actions

Returns

Predicted action, probability for this action, and predicted value for the next state
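
A standalone sketch of how an action mask is typically applied to the policy logits before sampling; the actual network code may differ:

    import torch

    def masked_action_distribution(logits, action_mask):
        # action_mask: 1 for allowed actions, 0 for forbidden ones
        mask = torch.as_tensor(action_mask, dtype=torch.bool)
        masked_logits = logits.masked_fill(~mask, float("-inf"))  # forbidden actions get zero probability
        return torch.distributions.Categorical(logits=masked_logits)

    # usage: dist = masked_action_distribution(logits, mask)
    #        action = dist.sample(); log_prob = dist.log_prob(action)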

predict(observation: ndarray, action_mask: ndarray, deterministic: bool = True, state=None) Tuple

Action prediction for testing

Parameters
  • observation – Current observation of the environment

  • action_mask – One hot vector with ones for all possible actions

  • deterministic – Set to True to force a deterministic prediction

  • state – The last states (used in rnn policies)

Returns

Predicted action and next state (used in rnn policies)

train() None

Trains the policy and value networks

Returns

None

learn(total_instances: int, total_timesteps: int, intermediate_test=None) None

Learn over n environment instances or n timesteps, stopping as soon as the first of the two limits is reached. One learning iteration consists of collecting rollouts and training the networks.

Parameters
  • total_instances – Instance limit

  • total_timesteps – Timestep limit

  • intermediate_test – (IntermediateTest) intermediate test object. Must be created before.

agents.reinforcement_learning.ppo_masked.explained_variance(y_pred: ndarray, y_true: ndarray) ndarray

From Stable Baselines. Computes the fraction of variance that y_pred explains about y_true. Returns 1 - Var[y_true - y_pred] / Var[y_true]

Interpretation:

ev = 0 => might as well have predicted zero
ev = 1 => perfect prediction
ev < 0 => worse than just predicting zero

Parameters
  • y_pred – the prediction

  • y_true – the expected value

Returns

explained variance of ypred and y

Reinforcement Learning Functions

Training Functions

This file provides functions to train an agent on a scheduling-problem environment. By default, the trained model will be evaluated on the test data after training, by running the test_model_and_heuristic function from test.py.

Using this file requires a training config. For example, you have to specify the algorithm used for the training.

There are several constants in this file which you can change to adapt the training process.

agents.train.final_evaluation(config: dict, data_test: List[List[Task]], logger: Logger)

Evaluates the trained model and logs the results

Parameters
  • config – Training config

  • data_test – Dataset with instances to be used for the test

  • logger – Logger object

Returns

None

agents.train.training(config: dict, data_train: List[List[Task]], data_val: List[List[Task]], logger: Logger) None

Handles the actual training process, including creating the environment, the agent and the intermediate_test object; then the agent's learning process is started

Parameters
  • config – Training config

  • data_train – Dataset with instances to be used for the training

  • data_val – Dataset with instances to be used for the evaluation

  • logger – Logger object used for the whole training process, including evaluation and testing

Returns

None

agents.train.main(config_file_name: Optional[dict] = None, external_config: Optional[dict] = None) None

Main function to train an agent in a scheduling-problem environment.

Parameters
  • config_file_name – path to the training config you want to use for training (relative path from config/ folder)

  • external_config – dictionary that can be passed to overwrite the config file elements

Returns

None
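
A hypothetical invocation; the config file name and the overridden key are illustrative, not files or keys shipped with the project:

    from agents import train

    # config_file_name is a path relative to the config/ folder (name here is illustrative);
    # external_config overwrites individual config elements (the key shown is an assumption)
    train.main(config_file_name="training/ppo_example.yaml",
               external_config={"seed": 0})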

agents.train.get_perser_args()

Get arguments from command line

Testing Functions

This file provides the test_model function to evaluate an agent or a heuristic on a set of instances. Furthermore, test_model_and_heuristics can be used to evaluate an agent and all heuristics specified in the TEST_HEURISTICS constant on the same set of instances.

Using this file requires a testing config. For example, it is necessary to specify the name of the model you want to test.

Running this file will automatically call test_model_and_heuristics. You can adapt the heuristics used for testing in the TEST_HEURISTICS constant. An empty list is admissible.

When running the file from a console you can use --plot-ganttchart to show the generated gantt_chart figures.

agents.test.get_action(env, model, heuristic_id: str, heuristic_agent: Optional[HeuristicSelectionAgent]) Tuple[int, str]

This function determines the next action according to the input model or heuristic

Parameters
  • env – Environment object

  • model – Model object. E.g. PPO object

  • heuristic_id – Heuristic identifier. Can be None

  • heuristic_agent – HeuristicSelectionAgent object. Can be None

Returns

ID of the selected action

agents.test.run_episode(env, model, heuristic_id: Optional[str], handler: EvaluationHandler) None

This function executes one testing episode

Parameters
  • env – Environment object

  • model – Model object. E.g. PPO object

  • heuristic_id – Heuristic identifier. Can be None

  • handler – EvaluationHandler object

Returns

None

agents.test.test_solver(config: Dict, data_test: List[List[Task]], logger: Logger) Dict

This function uses the OR solver to schedule the instances given in data_test.

Parameters
  • config – Testing config

  • data_test – Data containing problem instances used for testing

  • logger – Logger object

Returns

Evaluation metrics

agents.test.log_results(plot_logger: Logger, inter_test_idx: Optional[int], heuristic: str, env, handler: EvaluationHandler) None

Calls the logger object to save the test results from this episode as a table (e.g. makespan mean, gantt chart)

Parameters
  • plot_logger – Logger object

  • inter_test_idx – Index of current test. Can be None

  • heuristic – Heuristic identifier. Can be None

  • env – Environment object

  • handler – EvaluationHandler object

Returns

None

agents.test.test_model(env_config: Dict, data: List[List[Task]], logger: Logger, plot: Optional[bool] = None, log_episode: Optional[bool] = None, model=None, heuristic_id: Optional[str] = None, intermediate_test_idx=None) dict

This function tests a model in the passed environment on all problem instances passed as data and returns an evaluation summary

Parameters
  • env_config – Environment config

  • data – Data containing problem instances used for testing

  • logger – Logger object

  • plot – Plot a gantt chart of all tests

  • log_episode – If true, calls the log function to log episode results as table

  • model – {None, StableBaselines Model}

  • heuristic_id – ID that identifies the used heuristic

  • intermediate_test_idx – Step number after which the test is performed. Is used to annotate the log

Returns

evaluation metrics

agents.test.test_model_and_heuristic(config: dict, model, data_test: List[List[Task]], logger: Logger, plot_ganttchart: bool = False, log_episode: bool = False) dict

Tests the model and the configured heuristics len(data_test) times (once per instance) and returns the results

Parameters
  • config – Testing config

  • model – Model to be tested. E.g. PPO object

  • data_test – Dataset with instances to be used for the test

  • logger – Logger object

  • plot_ganttchart – Plot a gantt chart of all tests

  • log_episode – If true, calls the log function to log episode results as table

Returns

Dict with evaluation_result dicts for the agent and all heuristics which were tested

agents.test.get_perser_args()
agents.test.main(external_config=None)

Util functions

This file provides the IntermediateTest class, which is used to run an intermediate test on the current model policy. If the most recent model achieves the best result so far, it is saved as the new current optimum.

class agents.intermediate_test.IntermediateTest(env_config: dict, n_test_steps: int, data: List[List[Task]], logger: Logger)

Bases: object

This object is used to run an intermediate test on the current model policy. If the most recent model achieves the best result so far, it is saved as the new current optimum.

Parameters
  • env_config – Config used to initialize the environment for training

  • n_test_steps – Number of environment steps between intermediate tests

  • data – Dataset with instances to be used for the intermediate test

  • logger – Logger object

__init__(env_config: dict, n_test_steps: int, data: List[List[Task]], logger: Logger)
on_step(num_timesteps: int, instances: int, model) None

This function is called by the environment at each step. Every n_test_steps environment steps it runs an intermediate test

Parameters
  • num_timesteps – Number of steps that have been already run by the environment

  • instances – Number of instances that have been already run by the environment

  • model – Model with the current policy. E.g. PPO object

Returns

None

This file provides utility functions to load configs, data and agents according to the config. It is used in training and testing.

TIMESTAMP: str: timestamp of the training run, used for the creation of a unique model name

AGENT_DICT: dict[str, str]: This dictionary is used to map algorithm identifiers (keys) to their actual class names (values).

E.g. to use the MaskedPPO class, you can use ppo as algorithm in the config.

If you add new algorithms, you can extend this dictionary to assign your algorithm class a short identifier.
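
An illustrative shape for such a mapping; the actual entries of AGENT_DICT may differ (per the note above, the ppo identifier maps to the MaskedPPO class):

    # illustrative only; the real AGENT_DICT lives in train_test_utility_functions
    AGENT_DICT = {
        "ppo": "MaskedPPO",   # identifier used in the config -> class name
        "dqn": "DQN",
    }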

agents.train_test_utility_functions.load_config(config_path, external_config) dict

Uses the ConfigHandler routines to load the config according to the path

Parameters
  • config_path – Path to the config to be loaded

  • external_config – Config dict

Returns

Config

agents.train_test_utility_functions.load_data(config: dict) List[List[Task]]

Uses the DataHandler routines to load the dataset specified in the config

Parameters

config – Config dict which specifies a dataset

Returns

Dataset (List of instances)

agents.train_test_utility_functions.complete_config(config: dict) dict

If optional parameters have not been defined in the configuration, this function adds default values. Also creates missing directories, if necessary.

Parameters

config – config file

Returns

completed config file

agents.train_test_utility_functions.get_agent_param_from_config(config: dict) str

Checks whether the config contains a TRAIN or TEST algorithm parameter and gets the corresponding algorithm class string from the config

Parameters

config – Config for training or testing

Returns

Agent type string (e.g. ‘ppo’)

agents.train_test_utility_functions.get_agent_class_from_config(config: dict) Any

Determines and loads the correct agent class type according to the config

Parameters

config – Training config

Returns

Agent class type which can be called