Reinforcement Learning Agents

DQN

DQN implementation with target network and epsilon-greedy exploration. Follows the Stable Baselines 3 implementation. To reuse trained models, you can make use of the save and load functions. To adapt the policy and value network structure, specify the layer and activation parameters in your train config or change the constants in this file.
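
A minimal usage sketch based on the signatures documented below; the environment, the config contents, and the file path are placeholders, not values defined by this module:

    # Hypothetical usage sketch; only the documented signatures are assumed to exist.
    from agents.reinforcement_learning.dqn import DQN

    env = ...     # pre-generated, gym-based scheduling environment (placeholder)
    config = {}   # dictionary with the DQN attributes described under __init__ (placeholder)

    agent = DQN(env, config=config)
    agent.learn(total_instances=100, total_timesteps=10_000)
    agent.save("models/dqn_run")                        # path is illustrative
    agent = DQN.load("models/dqn_run", config=config)   # load expects the path without .pkl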

class agents.reinforcement_learning.dqn.MemoryBuffer(buffer_size: int, batch_size: int, obs_dim: int, obs_type: type, action_type: type)

Bases: object

Handles episode data collection and sample generation

Parameters
  • buffer_size – Buffer size

  • batch_size – Size for batches to be generated

  • obs_dim – Size of the observation to be stored in the buffer

  • obs_type – Type of the observation to be stored in the buffer

  • action_type – Type of the action to be stored in the buffer

__init__(buffer_size: int, batch_size: int, obs_dim: int, obs_type: type, action_type: type)
store_memory(obs, action, reward, done, new_obs) None

Appends all data from the recent step

Parameters
  • obs – Observation at the beginning of the step

  • action – Index of the selected action

  • reward – Reward the env returned in this step

  • done – True if the episode ended in this step

  • new_obs – Observation after the step

Returns

None

get_samples() Tuple

Generates random samples from the stored data

Returns

batch_size samples from the buffer, e.g. the obs, actions, …, new_obs stored at step 21
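
A standalone sketch of the ring-buffer storage and random sampling described above; it is a simplified stand-in under assumed semantics, not the actual MemoryBuffer code:

    import numpy as np

    class SketchMemoryBuffer:
        """Simplified replay buffer: fixed-size arrays, oldest entries overwritten first."""

        def __init__(self, buffer_size, batch_size, obs_dim, obs_type=np.float32, action_type=np.int64):
            self.buffer_size, self.batch_size = buffer_size, batch_size
            self.obs = np.zeros((buffer_size, obs_dim), dtype=obs_type)
            self.new_obs = np.zeros((buffer_size, obs_dim), dtype=obs_type)
            self.actions = np.zeros(buffer_size, dtype=action_type)
            self.rewards = np.zeros(buffer_size, dtype=np.float32)
            self.dones = np.zeros(buffer_size, dtype=np.bool_)
            self.pos, self.full = 0, False

        def store_memory(self, obs, action, reward, done, new_obs):
            # write the transition at the current position and advance (ring buffer)
            self.obs[self.pos], self.new_obs[self.pos] = obs, new_obs
            self.actions[self.pos], self.rewards[self.pos], self.dones[self.pos] = action, reward, done
            self.pos = (self.pos + 1) % self.buffer_size
            self.full = self.full or self.pos == 0

        def get_samples(self):
            # draw batch_size random transitions from the filled part of the buffer
            upper = self.buffer_size if self.full else self.pos
            idx = np.random.randint(0, upper, size=self.batch_size)
            return self.obs[idx], self.actions[idx], self.rewards[idx], self.dones[idx], self.new_obs[idx]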

class agents.reinforcement_learning.dqn.Policy(obs_dim: int, action_dim: int, learning_rate: float, hidden_layers: List[int], activation: str)

Bases: Module

Network structure used for both the Q network and the target network

Parameters
  • obs_dim – Observation size to determine input dimension

  • action_dim – Number of actions to determine output size

  • learning_rate – Learning rate for the network

  • hidden_layers – List of hidden layer sizes (int)

  • activation – String naming activation function for hidden layers

__init__(obs_dim: int, action_dim: int, learning_rate: float, hidden_layers: List[int], activation: str)

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(obs)

forward pass through the Q-network

class agents.reinforcement_learning.dqn.DQN(env, config: dict, logger: Optional[Logger] = None)

Bases: object

DQN implementation with target network and epsilon-greedy exploration. Follows the Stable Baselines 3 implementation.

__init__(env, config: dict, logger: Optional[Logger] = None)
batch_size: Number of samples that are chosen and passed through the net per update
gradient_steps: Number of updates per training
train_freq: Environment steps between two trainings
buffer_size: Size of the memory buffer = max number of rollouts that can be stored before the oldest are deleted
target_net_update: Number of steps between target_net_updates
training_starts (learning_starts): number of steps after which training can start for the first time
initial_eps: Initial epsilon value
final_eps: Final epsilon value
fraction_eps: Once the progress of learn (fraction of total_timesteps completed) exceeds fraction_eps, epsilon is set to the final_eps value,
e.g. 5/100 total_timesteps done -> progress = 0.05; if progress > fraction_eps -> eps = final_eps (see the sketch after the parameter list below)
max_grad_norm: Value to clip the policy update of the q_net
Parameters
  • env – Pre-generated, gym-based environment. If no env is passed (env = None), the DQN agent can only be used for evaluation (action prediction)

  • config – Dictionary with parameters to specify DQN attributes

  • logger – Logger
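
A sketch of the epsilon schedule described above; the linear decay before the fraction_eps threshold is an assumption in line with the Stable Baselines 3 linear schedule:

    def epsilon_schedule(progress, initial_eps=1.0, final_eps=0.05, fraction_eps=0.1):
        """progress = timesteps_done / total_timesteps, a value in [0, 1]."""
        if progress > fraction_eps:
            return final_eps
        # linear decay from initial_eps to final_eps over the first fraction_eps of training
        return initial_eps + (progress / fraction_eps) * (final_eps - initial_eps)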

save(file: str) None

Save model as pickle file

Parameters

file – Path under which the file will be saved

Returns

None

classmethod load(file: str, config: dict, logger: Optional[Logger] = None)

Creates a DQN object according to the parameters saved in file.pkl

Parameters
  • file – Path and filename (without .pkl) of your saved model pickle file

  • config – Dictionary with parameters to specify DQN attributes

  • logger – Logger

Returns

DQN object

get_action(obs: ndarray) int

Random action or greedy action according to the policy, depending on epsilon

Returns

action index
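
A standalone sketch of epsilon-greedy selection (not the actual get_action code); q_net stands for any callable that maps an observation to a vector of Q-values:

    import numpy as np
    import torch

    def epsilon_greedy_action(q_net, obs, eps, n_actions):
        if np.random.rand() < eps:
            return int(np.random.randint(n_actions))     # explore: random action index
        with torch.no_grad():
            q_values = q_net(torch.as_tensor(obs, dtype=torch.float32))
        return int(torch.argmax(q_values).item())        # exploit: greedy action index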

predict(observation: ndarray, action_mask: ndarray = array([1.]), deterministic: bool = True, state=None) Tuple

Action prediction for testing

Parameters
  • observation – Current observation of the environment

  • action_mask – Mask of actions which can logically be taken. NOTE: currently not implemented!

  • deterministic – Set to True to force a deterministic prediction

  • state – The last states (used in rnn policies)

Returns

Predicted action and next state (used in rnn policies)

train() None

Trains Q-network and Target-Network

Returns

None
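
A standalone sketch of one SB3-style DQN update consistent with the attributes described above; the Huber loss and the exact tensor handling are assumptions:

    import torch
    import torch.nn.functional as F

    def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99, max_grad_norm=10.0):
        obs, actions, rewards, dones, new_obs = batch    # torch tensors with leading batch dimension
        with torch.no_grad():
            # TD target computed with the frozen target network
            next_q = target_net(new_obs).max(dim=1).values
            target = rewards + gamma * (1.0 - dones.float()) * next_q
        q_taken = q_net(obs).gather(1, actions.long().unsqueeze(1)).squeeze(1)
        loss = F.smooth_l1_loss(q_taken, target)
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(q_net.parameters(), max_grad_norm)  # max_grad_norm clipping
        optimizer.step()
        # the target network itself is refreshed periodically, e.g. via
        # target_net.load_state_dict(q_net.state_dict())
        return loss.item()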

on_step(total_timesteps)

Tracks and checks several per-step conditions, e.g. whether a target-network or epsilon update is necessary

learn(total_instances: int, total_timesteps: int, intermediate_test=None) None

Learn over n problem instances or n timesteps (environment steps), stopping as soon as the first of the two limits is reached. One learning iteration consists of collecting rollouts and training the networks on the rollout data.

Parameters
  • total_instances – Instance limit

  • total_timesteps – Timestep limit

  • intermediate_test – (IntermediateTest) intermediate test object. Must be created before.

PPO

PPO implementation inspired by the Stable Baselines 3 implementation. To reuse trained models, you can make use of the save and load functions. To adapt the policy and value network structure, specify the policy and value layer and activation parameters in your train config or change the constants in this file.

class agents.reinforcement_learning.ppo.RolloutBuffer(buffer_size: int, batch_size: int)

Bases: object

Handles episode data collection and batch generation

Parameters
  • buffer_size – Buffer size

  • batch_size – Size for batches to be generated

__init__(buffer_size: int, batch_size: int)
generate_batches() Tuple

Generates batches from the stored data

Returns

batches: lists of indices into the rollout data, shuffled and grouped into lists of length batch_size, e.g. [[0, 34, 1, 768, … (length: batch_size)], […], … (number of lists: len(rollout_data) / batch_size)]
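
A standalone sketch of the index batching described above:

    import numpy as np

    def generate_index_batches(n_stored, batch_size):
        indices = np.random.permutation(n_stored)   # shuffle all rollout indices
        # split into consecutive chunks of length batch_size (the last chunk may be shorter)
        return [indices[start:start + batch_size] for start in range(0, n_stored, batch_size)]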

compute_advantages_and_returns(last_value, gamma, gae_lambda) None

Computes advantage values and returns for all stored episodes.

Parameters
  • last_value – Value from the next step to calculate the advantage for the last episode in the buffer

  • gamma – Discount factor for the advantage calculation

  • gae_lambda – Smoothing parameter for the advantage calculation

Returns

None
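
A standalone sketch of a standard GAE computation using the parameter names above (last_value, gamma, gae_lambda); the library's exact handling of episode boundaries may differ:

    import numpy as np

    def compute_advantages_and_returns(rewards, values, dones, last_value, gamma=0.99, gae_lambda=0.95):
        advantages = np.zeros(len(rewards), dtype=np.float32)
        gae = 0.0
        for t in reversed(range(len(rewards))):
            next_value = last_value if t == len(rewards) - 1 else values[t + 1]
            non_terminal = 1.0 - float(dones[t])                      # cut the bootstrap at episode ends
            delta = rewards[t] + gamma * next_value * non_terminal - values[t]
            gae = delta + gamma * gae_lambda * non_terminal * gae
            advantages[t] = gae
        returns = advantages + np.asarray(values, dtype=np.float32)   # returns = advantages + value baseline
        return advantages, returns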

store_memory(observation: ndarray, action: int, prob: float, value: float, reward: Any, done: bool) None

Appends all data from the recent step

Parameters
  • observation – Observation at the beginning of the step

  • action – Index of the selected action

  • prob – Probability of the selected action (output from the policy_net)

  • value – Baseline value that the value_net estimated from this step onwards according to the observation (output from the value_net)

  • reward – Reward the env returned in this step

  • done – True if the episode ended in this step

Returns

None

reset() None

Resets all buffer lists

Returns

None

class agents.reinforcement_learning.ppo.PolicyNetwork(input_dim: int, n_actions: int, learning_rate: float, hidden_layers: List[int], activation: str)

Bases: Module

Policy Network for the agent

Parameters
  • input_dim – Observation size to determine input dimension

  • n_actions – Number of actions to determine output size

  • learning_rate – Learning rate for the network

  • hidden_layers – List of hidden layer sizes (int)

  • activation – String naming activation function for hidden layers

__init__(input_dim: int, n_actions: int, learning_rate: float, hidden_layers: List[int], activation: str)

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(observation)

forward function

class agents.reinforcement_learning.ppo.ValueNetwork(input_dim: int, learning_rate: float, hidden_layers: List[int], activation: str)

Bases: Module

Value Network for the agent

Parameters
  • input_dim – Observation size to determine input dimension

  • learning_rate – Learning rate for the network

  • hidden_layers – List of hidden layer sizes (int)

  • activation – String naming activation function for hidden layers

__init__(input_dim: int, learning_rate: float, hidden_layers: List[int], activation: str)

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(observation)

forward function

class agents.reinforcement_learning.ppo.PPO(env, config: dict, logger: Optional[Logger] = None)

Bases: object

PPO Agent class

__init__(env, config: dict, logger: Optional[Logger] = None)
gamma: Discount factor for the advantage calculation
learning_rate: Learning rate for both, policy_net and value_net
gae_lambda: Smoothing parameter for the advantage calculation
clip_range: Limitation for the ratio between old and new policy
batch_size: Size of the batches sampled from the buffer and fed into the networks during training
n_epochs: Number of repetitions for each training iteration
rollout_steps: Step interval within which the update is performed. Has to be a multiple of batch_size
classmethod load(file: str, config: dict, logger: Optional[Logger] = None)

Creates a PPO object according to the parameters saved in file.pkl

Parameters
  • file – Path and filename (without .pkl) of your saved model pickle file

  • config – Dictionary with parameters to specify PPO attributes

  • logger – Logger

Returns

PPO object

save(file: str) None

Save model as pickle file

Parameters

file – Path under which the file will be saved

Returns

None

forward(observation: ndarray, **kwargs) Tuple

Predicts an action according to the current policy based on the observation, together with the value estimate for the next state

Parameters
  • observation – Current observation of the environment

  • kwargs – Used to accept, but ignore, action masks passed from the environment.

Returns

Predicted action, probability for this action, and predicted value for the next state

predict(observation: ndarray, deterministic: bool = True, state=None, **kwargs) Tuple

Action prediction for testing

Parameters
  • observation – Current observation of the environment

  • deterministic – Set to True to force a deterministic prediction

  • state – The last states (used in rnn policies)

  • kwargs – Used to accept, but ignore, action masks passed from the environment.

Returns

Predicted action and next state (used in rnn policies)

train() None

Trains the policy and value networks

Returns

None
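
A standalone sketch of the clipped surrogate policy loss this training step is built around; clip_range corresponds to the attribute described in __init__, and the value and entropy terms are omitted for brevity:

    import torch

    def ppo_policy_loss(new_log_probs, old_log_probs, advantages, clip_range=0.2):
        ratio = torch.exp(new_log_probs - old_log_probs)                           # pi_new / pi_old
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantages
        return -torch.min(unclipped, clipped).mean()                               # maximize the clipped surrogate

    # the value network is typically trained alongside with an MSE loss against the computed returns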

learn(total_instances: int, total_timesteps: int, intermediate_test=None) None

Learn over n environment instances or n timesteps, stopping as soon as the first of the two limits is reached. One learning iteration consists of collecting rollouts and training the networks.

Parameters
  • total_instances – Instance limit

  • total_timesteps – Timestep limit

  • intermediate_test – (IntermediateTest) intermediate test object. Must be created before.

agents.reinforcement_learning.ppo.explained_variance(y_pred: ndarray, y_true: ndarray) ndarray

From Stable Baselines. Computes the fraction of variance that y_pred explains about y_true. Returns 1 - Var[y_true - y_pred] / Var[y_true]

Interpretation:

ev = 0 => might as well have predicted zero
ev = 1 => perfect prediction
ev < 0 => worse than just predicting zero

Parameters
  • y_pred – the prediction

  • y_true – the expected value

Returns

explained variance of ypred and y
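
A direct sketch of the formula above:

    import numpy as np

    def explained_variance(y_pred, y_true):
        var_y = np.var(y_true)
        # undefined when y_true has zero variance
        return np.nan if var_y == 0 else 1.0 - np.var(y_true - y_pred) / var_y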

PPO_masked

PPO implementation with action masking, following the Stable Baselines 3 implementation. To reuse trained models, you can make use of the save and load functions.

class agents.reinforcement_learning.ppo_masked.RolloutBuffer(buffer_size: int, batch_size: int)

Bases: object

Handles episode data collection and batch generation

Parameters
  • buffer_size – Buffer size

  • batch_size – Size for batches to be generated

__init__(buffer_size: int, batch_size: int)
generate_batches() Tuple

Generates batches from the stored data

Returns

batches: lists of indices into the rollout data, shuffled and grouped into lists of length batch_size, e.g. [[0, 34, 1, 768, … (length: batch_size)], […], … (number of lists: len(rollout_data) / batch_size)]

compute_advantages_and_returns(last_value, gamma, gae_lambda) None

Computes advantage values and returns for all stored episodes.

Parameters
  • last_value – Value from the next step to calculate the advantage for the last episode in the buffer

  • gamma – Discount factor for the advantage calculation

  • gae_lambda – Smoothing parameter for the advantage calculation

Returns

None

store_memory(observation: ndarray, action: int, prob: float, value: float, reward: Any, done: bool, action_mask: ndarray) None

Appends all data from the recent step

Parameters
  • observation – Observation at the beginning of the step

  • action – Index of the selected action

  • prob – Probability of the selected action (output from the policy_net)

  • value – Baseline value that the value_net estimated from this step onwards according to the observation (output from the value_net)

  • reward – Reward the env returned in this step

  • done – True if the episode ended in this step

  • action_mask – One hot vector with ones for all possible actions

Returns

None

reset() None

Resets all buffer lists

Returns

None

class agents.reinforcement_learning.ppo_masked.PolicyNetwork(input_dim: int, n_actions: int, learning_rate: float, hidden_layers: List[int], activation: str)

Bases: Module

Policy Network for the agent

Parameters
  • input_dim – Observation size to determine input dimension

  • n_actions – Number of actions to determine output size

  • learning_rate – Learning rate for the network

  • hidden_layers – List of hidden layer sizes (int)

  • activation – String naming activation function for hidden layers

__init__(input_dim: int, n_actions: int, learning_rate: float, hidden_layers: List[int], activation: str)

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(observation, action_mask)

forward through the actor network

class agents.reinforcement_learning.ppo_masked.ValueNetwork(input_dim: int, learning_rate: float, hidden_layers: List[int], activation: str)

Bases: Module

Value Network for the agent

Parameters
  • input_dim – Observation size to determine input dimension

  • learning_rate – Learning rate for the network

  • hidden_layers – List of hidden layer sizes (int)

  • activation – String naming activation function for hidden layers

__init__(input_dim: int, learning_rate: float, hidden_layers: List[int], activation: str)

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(observation)

forward through the value network

class agents.reinforcement_learning.ppo_masked.MaskedPPO(env, config: dict, logger: Optional[Logger] = None)

Bases: object

__init__(env, config: dict, logger: Optional[Logger] = None)
gamma: Discount factor for the advantage calculation
learning_rate: Learning rate for both, policy_net and value_net
gae_lambda: Smoothing parameter for the advantage calculation
clip_range: Limitation for the ratio between old and new policy
batch_size: Size of the batches sampled from the buffer and fed into the networks during training
n_epochs: Number of repetitions for each training iteration
rollout_steps: Step interval within which the update is performed. Has to be a multiple of batch_size
classmethod load(file: str, config: dict, logger: Optional[Logger] = None)

Creates a MaskedPPO object according to the parameters saved in file.pkl

Parameters
  • file – Path and filename (without .pkl) of your saved model pickle file

  • config – Dictionary with parameters to specify MaskedPPO attributes

  • logger – Logger

Returns

MaskedPPO object

save(file: str) None

Save model as pickle file

Parameters

file – Path under which the file will be saved

Returns

None

forward(observation: ndarray, action_mask: ndarray) Tuple

Predicts an action according to the current policy, based on the observation and the action_mask, together with the value estimate for the next state

Parameters
  • observation – Current observation of the environment

  • action_mask – One hot vector with ones for all possible actions

Returns

Predicted action, probability for this action, and predicted value for the next state
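
A standalone sketch of how an action mask is typically applied to the policy logits before sampling; the actual network code may differ:

    import torch

    def masked_action_distribution(logits, action_mask):
        # action_mask: 1 for allowed actions, 0 for forbidden ones
        mask = torch.as_tensor(action_mask, dtype=torch.bool)
        masked_logits = logits.masked_fill(~mask, float("-inf"))  # forbidden actions get zero probability
        return torch.distributions.Categorical(logits=masked_logits)

    # usage: dist = masked_action_distribution(logits, mask)
    #        action = dist.sample(); log_prob = dist.log_prob(action)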

predict(observation: ndarray, action_mask: ndarray, deterministic: bool = True, state=None) Tuple

Action prediction for testing

Parameters
  • observation – Current observation of the environment

  • action_mask – One hot vector with ones for all possible actions

  • deterministic – Set to True to force a deterministic prediction

  • state – The last states (used in rnn policies)

Returns

Predicted action and next state (used in rnn policies)

train() None

Trains the policy and value networks

Returns

None

learn(total_instances: int, total_timesteps: int, intermediate_test=None) None

Learn over n environment instances or n timesteps, stopping as soon as the first of the two limits is reached. One learning iteration consists of collecting rollouts and training the networks.

Parameters
  • total_instances – Instance limit

  • total_timesteps – Timestep limit

  • intermediate_test – (IntermediateTest) intermediate test object. Must be created before.

agents.reinforcement_learning.ppo_masked.explained_variance(y_pred: ndarray, y_true: ndarray) ndarray

From Stable Baselines. Computes the fraction of variance that y_pred explains about y_true. Returns 1 - Var[y_true - y_pred] / Var[y_true]

Interpretation:

ev = 0 => might as well have predicted zero
ev = 1 => perfect prediction
ev < 0 => worse than just predicting zero

Parameters
  • y_pred – the prediction

  • y_true – the expected value

Returns

explained variance of ypred and y

Reinforcement Learning Functions

Training Functions

This file provides functions to train an agent on a scheduling-problem environment. By default, the trained model will be evaluated on the test data after training, by running the test_model_and_heuristic function from test.py.

Using this file requires a training config. For example, you have to specify the algorithm used for the training.

There are several constants in this file which you can change to adapt the training process.

agents.train.final_evaluation(config: dict, data_test: List[List[Task]], logger: Logger)

Evaluates the trained model and logs the results

Parameters
  • config – Training config

  • data_test – Dataset with instances to be used for the test

  • logger – Logger object

Returns

None

agents.train.training(config: dict, data_train: List[List[Task]], data_val: List[List[Task]], logger: Logger) None

Handles the actual training process, including creating the environment, the agent and the intermediate_test object; then the agent's learning process is started

Parameters
  • config – Training config

  • data_train – Dataset with instances to be used for the training

  • data_val – Dataset with instances to be used for the evaluation

  • logger – Logger object used for the whole training process, including evaluation and testing

Returns

None

agents.train.main(config_file_name: Optional[dict] = None, external_config: Optional[dict] = None) None

Main function to train an agent in a scheduling-problem environment.

Parameters
  • config_file_name – path to the training config you want to use for training (relative path from config/ folder)

  • external_config – dictionary that can be passed to overwrite the config file elements

Returns

None
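
A hypothetical invocation; the config file name and the overridden key are illustrative, not files or keys shipped with the project:

    from agents import train

    # config_file_name is a path relative to the config/ folder (name here is illustrative);
    # external_config overwrites individual config elements (the key shown is an assumption)
    train.main(config_file_name="training/ppo_example.yaml",
               external_config={"seed": 0})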

agents.train.get_perser_args()

Get arguments from command line

Testing Functions

This file provides the test_model function to evaluate an agent or a heuristic on a set of instances. Furthermore, test_model_and_heuristics can be used to evaluate an agent and all heuristics specified in the TEST_HEURISTICS constant on the same set of instances.

Using this file requires a testing config. For example, it is necessary to specify the name of the model you want to test.

Running this file will automatically call test_model_and_heuristics. You can adapt the heuristics used for testing in the TEST_HEURISTICS constant. An empty list is admissible.

When running the file from a console you can use --plot-ganttchart to show the generated gantt_chart figures.

agents.test.get_action(env, model, heuristic_id: str, heuristic_agent: Optional[HeuristicSelectionAgent]) Tuple[int, str]

This function determines the next action according to the input model or heuristic

Parameters
  • env – Environment object

  • model – Model object. E.g. PPO object

  • heuristic_id – Heuristic identifier. Can be None

  • heuristic_agent – HeuristicSelectionAgent object. Can be None

Returns

ID of the selected action

agents.test.run_episode(env, model, heuristic_id: Optional[str], handler: EvaluationHandler) None

This function executes one testing episode

Parameters
  • env – Environment object

  • model – Model object. E.g. PPO object

  • heuristic_id – Heuristic identifier. Can be None

  • handler – EvaluationHandler object

Returns

None

agents.test.test_solver(config: Dict, data_test: List[List[Task]], logger: Logger) Dict

This function uses the OR solver to schedule the instances given in data_test.

Parameters
  • config – Testing config

  • data_test – Data containing problem instances used for testing

  • logger – Logger object

Returns

Evaluation metrics

agents.test.log_results(plot_logger: Logger, inter_test_idx: Optional[int], heuristic: str, env, handler: EvaluationHandler) None

Calls the logger object to save the test results from this episode as a table (e.g. makespan mean, gantt chart)

Parameters
  • plot_logger – Logger object

  • inter_test_idx – Index of current test. Can be None

  • heuristic – Heuristic identifier. Can be None

  • env – Environment object

  • handler – EvaluationHandler object

Returns

None

agents.test.test_model(env_config: Dict, data: List[List[Task]], logger: Logger, plot: Optional[bool] = None, log_episode: Optional[bool] = None, model=None, heuristic_id: Optional[str] = None, intermediate_test_idx=None) dict

This function tests a model in the passed environment on all problem instances passed as data and returns an evaluation summary

Parameters
  • env_config – Environment config

  • data – Data containing problem instances used for testing

  • logger – Logger object

  • plot – Plot a gantt chart of all tests

  • log_episode – If true, calls the log function to log episode results as table

  • model – {None, StableBaselines Model}

  • heuristic_id – ID that identifies the used heuristic

  • intermediate_test_idx – Step number after which the test is performed. Is used to annotate the log

Returns

evaluation metrics

agents.test.test_model_and_heuristic(config: dict, model, data_test: List[List[Task]], logger: Logger, plot_ganttchart: bool = False, log_episode: bool = False) dict

Tests the model and the configured heuristics len(data_test) times (once per instance) and returns the results

Parameters
  • config – Testing config

  • model – Model to be tested. E.g. PPO object

  • data_test – Dataset with instances to be used for the test

  • logger – Logger object

  • plot_ganttchart – Plot a gantt chart of all tests

  • log_episode – If true, calls the log function to log episode results as table

Returns

Dict with evaluation_result dicts for the agent and all heuristics which were tested

agents.test.get_perser_args()
agents.test.main(external_config=None)

Util functions

This file provides the IntermediateTest class, which is used to run an intermediate test on the current model policy. If the most recent model achieves the best result so far, it is saved as the new current optimum.

class agents.intermediate_test.IntermediateTest(env_config: dict, n_test_steps: int, data: List[List[Task]], logger: Logger)

Bases: object

This object is used to run an intermediate test on the current model policy. If the most recent model achieves the best result so far, it is saved as the new current optimum.

Parameters
  • env_config – Config used to initialize the environment for training

  • n_test_steps – Number of environment steps between intermediate tests

  • data – Dataset with instances to be used for the intermediate test

  • logger – Logger object

__init__(env_config: dict, n_test_steps: int, data: List[List[Task]], logger: Logger)
on_step(num_timesteps: int, instances: int, model) None

This function is called by the environment at each step. Every n_test_steps environment steps it runs an intermediate test

Parameters
  • num_timesteps – Number of steps that have been already run by the environment

  • instances – Number of instances that have been already run by the environment

  • model – Model with the current policy. E.g. PPO object

Returns

None

This file provides utility functions to load configs, data and agents according to the config. It is used in training and testing.

TIMESTAMP: str: timestamp of the training run, used for the creation of a unique model name

AGENT_DICT: dict[str, str]: This dictionary is used to map algorithm identifiers (keys) to their actual class names (values).

E.g. to use the MaskedPPO class, you can use ppo as algorithm in the config.

If you add new algorithms, you can extend this dictionary to assign your algorithm class a short identifier.
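
An illustrative shape for such a mapping; the actual entries of AGENT_DICT may differ (per the note above, the ppo identifier maps to the MaskedPPO class):

    # illustrative only; the real AGENT_DICT lives in train_test_utility_functions
    AGENT_DICT = {
        "ppo": "MaskedPPO",   # identifier used in the config -> class name
        "dqn": "DQN",
    }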

agents.train_test_utility_functions.load_config(config_path, external_config) dict

Uses the ConfigHandler routines to load the config according to the path

Parameters
  • config_path – Path to the config to be loaded

  • external_config – Config dict

Returns

Config

agents.train_test_utility_functions.load_data(config: dict) List[List[Task]]

Uses the DataHandler routines to load the dataset specified in the config

Parameters

config – Config dict which specifies a dataset

Returns

Dataset (List of instances)

agents.train_test_utility_functions.complete_config(config: dict) dict

If optional parameters have not been defined in the configuration, this function adds default values. Also creates missing directories, if necessary.

Parameters

config – config file

Returns

completed config file

agents.train_test_utility_functions.get_agent_param_from_config(config: dict) str

Checks whether the config contains a TRAIN or TEST algorithm parameter and gets the corresponding algorithm class string from the config

Parameters

config – Config for training or testing

Returns

Agent type string (e.g. ‘ppo’)

agents.train_test_utility_functions.get_agent_class_from_config(config: dict) Any

Determines and loads the correct agent class type according to the config

Parameters

config – Training config

Returns

Agent class type which can be called