Reinforcement Learning Agents
DQN
DQN implementation with a target network and epsilon-greedy exploration, following the Stable Baselines 3 implementation. To reuse trained models, you can make use of the save and load functions. To adapt the network structure, specify the layer and activation parameters in your train config or change the constants in this file.
- class agents.reinforcement_learning.dqn.MemoryBuffer(buffer_size: int, batch_size: int, obs_dim: int, obs_type: type, action_type: type)
Bases:
object
Handles episode data collection and sample generation
- Parameters
buffer_size – Buffer size
batch_size – Size for batches to be generated
obs_dim – Size of the observation to be stored in the buffer
obs_type – Type of the observation to be stored in the buffer
action_type – Type of the action to be stored in the buffer
- __init__(buffer_size: int, batch_size: int, obs_dim: int, obs_type: type, action_type: type)
- store_memory(obs, action, reward, done, new_obs) None
Appends all data from the most recent step
- Parameters
obs – Observation at the beginning of the step
action – Index of the selected action
reward – Reward the env returned in this step
done – True if the episode ended in this step
new_obs – Observation after the step
- Returns
None
- get_samples() Tuple
Generates random samples from the stored data
- Returns
batch_size samples from the buffer, e.g. obs, actions, …, new_obs from step 21
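A minimal usage sketch of the buffer API documented above; the dimensions and transition values are illustrative placeholders, and the exact tuple layout returned by get_samples follows the docstring:

```python
import numpy as np

# Illustrative buffer: observations of size 4, float observations, integer actions
buffer = MemoryBuffer(buffer_size=10_000, batch_size=32,
                      obs_dim=4, obs_type=np.float32, action_type=np.int64)

# Store one (dummy) transition per environment step
obs = np.zeros(4, dtype=np.float32)
new_obs = np.ones(4, dtype=np.float32)
buffer.store_memory(obs, action=1, reward=0.5, done=False, new_obs=new_obs)

# Once enough steps are stored, draw a random training batch,
# e.g. (obs, actions, rewards, dones, new_obs) arrays of length batch_size
samples = buffer.get_samples()
```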
- class agents.reinforcement_learning.dqn.Policy(obs_dim: int, action_dim: int, learning_rate: float, hidden_layers: List[int], activation: str)
Bases:
Module
Network structure used for both the Q network and the target network
- Parameters
obs_dim – Observation size to determine input dimension
action_dim – Number of actions to determine output size
learning_rate – Learning rate for the network
hidden_layers – List of hidden layer sizes (int)
activation – String naming activation function for hidden layers
- __init__(obs_dim: int, action_dim: int, learning_rate: float, hidden_layers: List[int], activation: str)
Initializes internal Module state, shared by both nn.Module and ScriptModule.
- forward(obs)
Forward pass through the Q-network
- class agents.reinforcement_learning.dqn.DQN(env, config: dict, logger: Optional[Logger] = None)
Bases:
object
DQN implementation with a target network and epsilon-greedy exploration, following the Stable Baselines 3 implementation.
- __init__(env, config: dict, logger: Optional[Logger] = None)
- batch_size: Number of samples that are chosen and passed through the net per update
- gradient_steps: Number of updates per training
- train_freq: Environment steps between two trainings
- buffer_size: Size of the memory buffer = max number of rollouts that can be stored before the oldest are deleted
- target_net_update: Number of steps between target_net updates
- training_starts (= learning_starts): Steps after which training can start for the first time
- initial_eps: Initial epsilon value
- final_eps: Final epsilon value
- fraction_eps: If the percentage progress of learn exceeds fraction_eps, epsilon takes the final_eps value, e.g. 50/100 total_timesteps done -> progress = 0.5 > fraction_eps -> eps = final_eps
- max_grad_norm: Value to clip the policy update of the q_net
- Parameters
env – Pre-generated, gym-based environment. If no env is passed (env = None), the DQN agent can only be used for evaluation (action prediction)
config – Dictionary with parameters to specify DQN attributes
logger – Logger
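A hedged example of a config fragment covering the attributes above; the key names mirror the list in __init__, while the network-structure keys follow the module intro and may be spelled differently in your config schema:

```python
dqn_config = {
    "batch_size": 64,             # samples per gradient update
    "gradient_steps": 1,          # updates per training call
    "train_freq": 4,              # environment steps between two trainings
    "buffer_size": 100_000,       # memory buffer capacity
    "target_net_update": 1_000,   # steps between target network updates
    "training_starts": 1_000,     # steps before training starts for the first time
    "initial_eps": 1.0,
    "final_eps": 0.05,
    "fraction_eps": 0.1,          # progress threshold after which eps = final_eps
    "max_grad_norm": 10.0,
    "layer": [64, 64],            # hidden layer sizes (see module intro)
    "activation": "ReLU",         # hidden activation name (see module intro)
}
agent = DQN(env, config=dqn_config)  # env: pre-generated, gym-based environment
```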
- save(file: str) None
Saves the model as a pickle file
- Parameters
file – Path under which the file will be saved
- Returns
None
- classmethod load(file: str, config: dict, logger: Optional[Logger] = None)
Creates a DQN object according to the parameters saved in file.pkl
- Parameters
file – Path and filename (without .pkl) of your saved model pickle file
config – Dictionary with parameters to specify DQN attributes
logger – Logger
- Returns
DQN object
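Saving and restoring a trained agent then looks like this (paths are illustrative; load expects the path without the .pkl extension):

```python
agent.save("models/dqn_run_01")  # persists the model as a pickle file

# Later: rebuild the agent from the saved parameters (note: no .pkl suffix)
restored = DQN.load("models/dqn_run_01", config=dqn_config)
```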
- get_action(obs: ndarray) int
Returns a random action or the action according to the policy, depending on epsilon (epsilon-greedy)
- Returns
action index
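The epsilon-greedy rule described here can be sketched as follows; this is a generic illustration, not the literal implementation:

```python
import random
import torch

def epsilon_greedy_action(q_net, obs, epsilon: float, action_dim: int) -> int:
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(action_dim)
    with torch.no_grad():
        q_values = q_net(torch.as_tensor(obs, dtype=torch.float32))
    return int(q_values.argmax().item())
```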
- predict(observation: ndarray, action_mask: ndarray = array([1.]), deterministic: bool = True, state=None) Tuple
Action prediction for testing
- Parameters
observation – Current observation of the environment
action_mask – Mask of actions which can logically be taken. NOTE: currently not implemented!
deterministic – Set True, to force a deterministic prediction
state – The last states (used in rnn policies)
- Returns
Predicted action and next state (used in rnn policies)
- train() None
Trains Q-network and Target-Network
- Returns
None
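Conceptually, one gradient step follows the standard DQN target used by Stable Baselines 3; a sketch under those conventions (tensor shapes and helper names are assumptions, not the literal code):

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma, max_grad_norm):
    obs, actions, rewards, dones, new_obs = batch
    # Q(s, a) for the actions actually taken
    q_values = q_net(obs).gather(1, actions.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bootstrapped target r + gamma * max_a' Q_target(s', a'), zeroed at episode ends
        next_q = target_net(new_obs).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones.float()) * next_q
    loss = F.smooth_l1_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    # Corresponds to the max_grad_norm config attribute
    torch.nn.utils.clip_grad_norm_(q_net.parameters(), max_grad_norm)
    optimizer.step()
    return loss.item()
```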
- on_step(total_timesteps)
Tracks and checks several conditions, e.g. whether a q_target_net or epsilon update is necessary
- learn(total_instances: int, total_timesteps: int, intermediate_test=None) None
Learn over n problem instances or n timesteps (environment steps). Breaks depending on which condition is met first. One learning iteration consists of collecting rollouts and training the networks on the rollout data
- Parameters
total_instances – Instance limit
total_timesteps – Timestep limit
intermediate_test – (IntermediateTest) intermediate test object. Must be created before.
PPO
PPO implementation inspired by the Stable Baselines 3 implementation. To reuse trained models, you can make use of the save and load functions. To adapt the policy and value network structure, specify the policy and value layer and activation parameters in your train config or change the constants in this file.
- class agents.reinforcement_learning.ppo.RolloutBuffer(buffer_size: int, batch_size: int)
Bases:
object
Handles episode data collection and batch generation
- Parameters
buffer_size – Buffer size
batch_size – Size for batches to be generated
- __init__(buffer_size: int, batch_size: int)
- generate_batches() Tuple
Generates batches from the stored data
- Returns
batches: Lists with all indices from the rollout_data, shuffled and split into lists of batch_size, e.g. [[0, 34, 1, 768, … (len: batch_size)], […], … (len: len(rollout_data) / batch_size)]
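The index batching described in the return value maps to a few lines of numpy (illustrative):

```python
import numpy as np

def make_index_batches(n_stored: int, batch_size: int) -> list:
    """Shuffle all rollout indices, then split them into lists of batch_size."""
    indices = np.random.permutation(n_stored)
    return [indices[i:i + batch_size] for i in range(0, n_stored, batch_size)]

# e.g. 8 stored steps with batch_size 4 -> two shuffled index lists of length 4
print(make_index_batches(8, 4))
```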
- compute_advantages_and_returns(last_value, gamma, gae_lambda) None
Computes advantage values and returns for all stored episodes.
- Parameters
last_value – Value from the next step to calculate the advantage for the last episode in the buffer
gamma – Discount factor for the advantage calculation
gae_lambda – Smoothing parameter for the advantage calculation
- Returns
None
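The computation corresponds to generalized advantage estimation (GAE); a self-contained sketch of the backward recursion, assuming per-step lists of rewards, values, and done flags:

```python
import numpy as np

def compute_gae(rewards, values, dones, last_value, gamma, gae_lambda):
    """delta_t = r_t + gamma * V(s_{t+1}) - V(s_t); advantages via the GAE recursion."""
    advantages = np.zeros(len(rewards), dtype=np.float32)
    gae, next_value = 0.0, last_value
    for t in reversed(range(len(rewards))):
        non_terminal = 1.0 - float(dones[t])
        delta = rewards[t] + gamma * next_value * non_terminal - values[t]
        gae = delta + gamma * gae_lambda * non_terminal * gae
        advantages[t] = gae
        next_value = values[t]
    returns = advantages + np.asarray(values, dtype=np.float32)  # targets for the value_net
    return advantages, returns
```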
- store_memory(observation: ndarray, action: int, prob: float, value: float, reward: Any, done: bool) None
Appends all data from the most recent step
- Parameters
observation – Observation at the beginning of the step
action – Index of the selected action
prob – Probability of the selected action (output from the policy_net)
value – Baseline value that the value_net estimated from this step onwards, according to the observation (output from the value_net)
reward – Reward the env returned in this step
done – True if the episode ended in this step
- Returns
None
- reset() None
Resets all buffer lists
- Returns
None
- class agents.reinforcement_learning.ppo.PolicyNetwork(input_dim: int, n_actions: int, learning_rate: float, hidden_layers: List[int], activation: str)
Bases:
Module
Policy Network for the agent
- Parameters
input_dim – Observation size to determine input dimension
n_actions – Number of actions to determine output size
learning_rate – Learning rate for the network
hidden_layers – List of hidden layer sizes (int)
activation – String naming activation function for hidden layers
- __init__(input_dim: int, n_actions: int, learning_rate: float, hidden_layers: List[int], activation: str)
Initializes internal Module state, shared by both nn.Module and ScriptModule.
- forward(observation)
Forward pass through the policy network
- class agents.reinforcement_learning.ppo.ValueNetwork(input_dim: int, learning_rate: float, hidden_layers: List[int], activation: str)
Bases:
Module
Value Network for the agent
- Parameters
input_dim – Observation size to determine input dimension
learning_rate – Learning rate for the network
hidden_layers – List of hidden layer sizes (int)
activation – String naming activation function for hidden layers
- __init__(input_dim: int, learning_rate: float, hidden_layers: List[int], activation: str)
Initializes internal Module state, shared by both nn.Module and ScriptModule.
- forward(observation)
Forward pass through the value network
- class agents.reinforcement_learning.ppo.PPO(env, config: dict, logger: Optional[Logger] = None)
Bases:
object
PPO Agent class
- __init__(env, config: dict, logger: Optional[Logger] = None)
- gamma: Discount factor for the advantage calculation
- learning_rate: Learning rate for both policy_net and value_net
- gae_lambda: Smoothing parameter for the advantage calculation
- clip_range: Limitation for the ratio between old and new policy
- batch_size: Size of the batches sampled from the buffer and fed into the nets during training
- n_epochs: Number of repetitions for each training iteration
- rollout_steps: Step interval within which the update is performed. Has to be a multiple of batch_size
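A hedged example config fragment for these attributes (key names mirror the list above; exact spelling may differ in your config schema):

```python
ppo_config = {
    "gamma": 0.99,
    "learning_rate": 3e-4,
    "gae_lambda": 0.95,
    "clip_range": 0.2,
    "batch_size": 64,
    "n_epochs": 10,
    "rollout_steps": 1024,  # must be a multiple of batch_size
}
agent = PPO(env, config=ppo_config)  # env: pre-generated, gym-based environment
```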
- classmethod load(file: str, config: dict, logger: Optional[Logger] = None)
Creates a PPO object according to the parameters saved in file.pkl
- Parameters
file – Path and filename (without .pkl) of your saved model pickle file
config – Dictionary with parameters to specify PPO attributes
logger – Logger
- Returns
PPO object
- save(file: str) None
Saves the model as a pickle file
- Parameters
file – Path under which the file will be saved
- Returns
None
- forward(observation: ndarray, **kwargs) Tuple
Predicts an action according to the current policy based on the observation, along with the value for the next state
- Parameters
observation – Current observation of the environment
kwargs – Used to accept, but ignore, action masks passed from the environment.
- Returns
Predicted action, probability for this action, and predicted value for the next state
- predict(observation: ndarray, deterministic: bool = True, state=None, **kwargs) Tuple
Action prediction for testing
- Parameters
observation – Current observation of the environment
deterministic – Set True to force a deterministic prediction
state – The last states (used in rnn policies)
kwargs – Used to accept but ignore passing actions masks from the environment.
- Returns
Predicted action and next state (used in rnn policies)
- train() None
Trains the policy and value networks
- Returns
None
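The policy update uses the standard PPO clipped surrogate objective; a sketch of the core loss (illustrative, not the literal implementation):

```python
import torch

def ppo_policy_loss(new_log_probs, old_log_probs, advantages, clip_range):
    """Clip the probability ratio so the new policy cannot move too far from the old one."""
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantages
    return -torch.min(unclipped, clipped).mean()
```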
- learn(total_instances: int, total_timesteps: int, intermediate_test=None) None
Learn over n environment instances or n timesteps. Breaks depending on which condition is met first. One learning iteration consists of collecting rollouts and training the networks
- Parameters
total_instances – Instance limit
total_timesteps – Timestep limit
intermediate_test – (IntermediateTest) intermediate test object. Must be created before.
- agents.reinforcement_learning.ppo.explained_variance(y_pred: ndarray, y_true: ndarray) ndarray
From Stable Baselines. Computes the fraction of variance that y_pred explains about y_true. Returns 1 - Var[y_true - y_pred] / Var[y_true]
- Interpretation:
ev = 0 => might as well have predicted zero
ev = 1 => perfect prediction
ev < 0 => worse than just predicting zero
- Parameters
y_pred – the prediction
y_true – the expected value
- Returns
Explained variance of y_pred with respect to y_true
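The formula maps directly to numpy; a sketch matching the docstring:

```python
import numpy as np

def explained_variance(y_pred: np.ndarray, y_true: np.ndarray) -> float:
    """1 - Var[y_true - y_pred] / Var[y_true]; see the interpretation above."""
    var_y = np.var(y_true)
    return np.nan if var_y == 0 else float(1 - np.var(y_true - y_pred) / var_y)
```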
PPO_masked
PPO implementation with an action mask, following the Stable Baselines 3 implementation. To reuse trained models, you can make use of the save and load functions
- class agents.reinforcement_learning.ppo_masked.RolloutBuffer(buffer_size: int, batch_size: int)
Bases:
object
Handles episode data collection and batch generation
- Parameters
buffer_size – Buffer size
batch_size – Size for batches to be generated
- __init__(buffer_size: int, batch_size: int)
- generate_batches() Tuple
Generates batches from the stored data
- Returns
batches: Lists with all indices from the rollout_data, shuffled and split into lists of batch_size, e.g. [[0, 34, 1, 768, … (len: batch_size)], […], … (len: len(rollout_data) / batch_size)]
- compute_advantages_and_returns(last_value, gamma, gae_lambda) None
Computes advantage values and returns for all stored episodes.
- Parameters
last_value – Value from the next step to calculate the advantage for the last episode in the buffer
gamma – Discount factor for the advantage calculation
gae_lambda – Smoothing parameter for the advantage calculation
- Returns
None
- store_memory(observation: ndarray, action: int, prob: float, value: float, reward: Any, done: bool, action_mask: ndarray) None
Appends all data from the most recent step
- Parameters
observation – Observation at the beginning of the step
action – Index of the selected action
prob – Probability of the selected action (output from the policy_net)
value – Baseline value that the value_net estimated from this step onwards, according to the observation (output from the value_net)
reward – Reward the env returned in this step
done – True if the episode ended in this step
action_mask – Binary vector with ones for all currently possible actions
- Returns
None
- reset() None
Resets all buffer lists
- Returns
None
- class agents.reinforcement_learning.ppo_masked.PolicyNetwork(input_dim: int, n_actions: int, learning_rate: float, hidden_layers: List[int], activation: str)
Bases:
Module
Policy Network for the agent
- Parameters
input_dim – Observation size to determine input dimension
n_actions – Number of actions to determine output size
learning_rate – Learning rate for the network
hidden_layers – List of hidden layer sizes (int)
activation – String naming activation function for hidden layers
- __init__(input_dim: int, n_actions: int, learning_rate: float, hidden_layers: List[int], activation: str)
Initializes internal Module state, shared by both nn.Module and ScriptModule.
- forward(observation, action_mask)
Forward pass through the actor network, applying the action mask
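A common way to realize this masked forward pass is to set the logits of invalid actions to -inf before the softmax, so they receive zero probability; a sketch (the actual network layout may differ):

```python
import torch

def masked_distribution(logits: torch.Tensor, action_mask) -> torch.distributions.Categorical:
    """Invalid actions (mask == 0) get logit -inf and therefore probability 0."""
    mask = torch.as_tensor(action_mask, dtype=torch.bool)
    masked_logits = logits.masked_fill(~mask, float("-inf"))
    return torch.distributions.Categorical(logits=masked_logits)
```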
- class agents.reinforcement_learning.ppo_masked.ValueNetwork(input_dim: int, learning_rate: float, hidden_layers: List[int], activation: str)
Bases:
Module
Value Network for the agent
- Parameters
input_dim – Observation size to determine input dimension
learning_rate – Learning rate for the network
hidden_layers – List of hidden layer sizes (int)
activation – String naming activation function for hidden layers
- __init__(input_dim: int, learning_rate: float, hidden_layers: List[int], activation: str)
Initializes internal Module state, shared by both nn.Module and ScriptModule.
- forward(observation)
Forward pass through the value network
- class agents.reinforcement_learning.ppo_masked.MaskedPPO(env, config: dict, logger: Optional[Logger] = None)
Bases:
object
- __init__(env, config: dict, logger: Optional[Logger] = None)
- gamma: Discount factor for the advantage calculation
- learning_rate: Learning rate for both policy_net and value_net
- gae_lambda: Smoothing parameter for the advantage calculation
- clip_range: Limitation for the ratio between old and new policy
- batch_size: Size of the batches sampled from the buffer and fed into the nets during training
- n_epochs: Number of repetitions for each training iteration
- rollout_steps: Step interval within which the update is performed. Has to be a multiple of batch_size
- classmethod load(file: str, config: dict, logger: Optional[Logger] = None)
Creates a MaskedPPO object according to the parameters saved in file.pkl
- Parameters
file – Path and filename (without .pkl) of your saved model pickle file
config – Dictionary with parameters to specify MaskedPPO attributes
logger – Logger
- Returns
MaskedPPO object
- save(file: str) None
Saves the model as a pickle file
- Parameters
file – Path under which the file will be saved
- Returns
None
- forward(observation: ndarray, action_mask: ndarray) Tuple
Predicts an action according to the current policy, based on the observation and the action_mask, along with the value for the next state
- Parameters
observation – Current observation of the environment
action_mask – Binary vector with ones for all currently possible actions
- Returns
Predicted action, probability for this action, and predicted value for the next state
- predict(observation: ndarray, action_mask: ndarray, deterministic: bool = True, state=None) Tuple
Action prediction for testing
- Parameters
observation – Current observation of the environment
action_mask – Binary vector with ones for all currently possible actions
deterministic – Set True to force a deterministic prediction
state – The last states (used in rnn policies)
- Returns
Predicted action and next state (used in rnn policies)
- train() None
Trains the policy and value networks
- Returns
None
- learn(total_instances: int, total_timesteps: int, intermediate_test=None) None
Learn over n environment instances or n timesteps. Breaks depending on which condition is met first. One learning iteration consists of collecting rollouts and training the networks
- Parameters
total_instances – Instance limit
total_timesteps – Timestep limit
intermediate_test – (IntermediateTest) intermediate test object. Must be created before.
- agents.reinforcement_learning.ppo_masked.explained_variance(y_pred: ndarray, y_true: ndarray) ndarray
From Stable Baselines. Computes the fraction of variance that y_pred explains about y_true. Returns 1 - Var[y_true - y_pred] / Var[y_true]
- Interpretation:
ev = 0 => might as well have predicted zero
ev = 1 => perfect prediction
ev < 0 => worse than just predicting zero
- Parameters
y_pred – the prediction
y_true – the expected value
- Returns
Explained variance of y_pred with respect to y_true
Reinforcement Learning Functions
Training Functions
This file provides functions to train an agent on a scheduling-problem environment. By default, the trained model will be evaluated on the test data after training, by running the test_model_and_heuristic function from test.py.
Using this file requires a training config. For example, you have to specify the algorithm used for the training.
There are several constants, which you can change to adapt the training process:
- agents.train.final_evaluation(config: dict, data_test: List[List[Task]], logger: Logger)
Evaluates the trained model and logs the results
- Parameters
config – Training config
data_test – Dataset with instances to be used for the test
logger – Logger object
- Returns
None
- agents.train.training(config: dict, data_train: List[List[Task]], data_val: List[List[Task]], logger: Logger) None
Handles the actual training process, including creating the environment, agent, and intermediate_test object. Then the agent learning process is started
- Parameters
config – Training config
data_train – Dataset with instances to be used for the training
data_val – Dataset with instances to be used for the evaluation
logger – Logger object used for the whole training process, including evaluation and testing
- Returns
None
- agents.train.main(config_file_name: Optional[str] = None, external_config: Optional[dict] = None) None
Main function to train an agent in a scheduling-problem environment.
- Parameters
config_file_name – path to the training config you want to use for training (relative path from config/ folder)
external_config – dictionary that can be passed to overwrite the config file elements
- Returns
None
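A minimal invocation sketch; the config file name and the overwritten key are illustrative placeholders:

```python
from agents import train

# Path is resolved relative to the config/ folder; single entries can be
# overwritten via external_config
train.main(config_file_name="training/example_ppo_config.yaml",
           external_config={"seed": 42})
```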
- agents.train.get_perser_args()
Get arguments from command line
Testing Functions
This file provides the test_model function to evaluate an agent or a heuristic on a set of instances. Furthermore, test_model_and_heuristic can be used to evaluate an agent and all heuristics specified in the TEST_HEURISTICS constant on the same set of instances.
Using this file requires a testing config. For example, it is necessary to specify the name of the model you want to test.
Running this file will automatically call test_model_and_heuristic. You can adapt the heuristics used for testing in the TEST_HEURISTICS constant. An empty list is admissible.
When running the file from a console, you can use --plot-ganttchart to show the generated gantt_chart figures.
- agents.test.get_action(env, model, heuristic_id: str, heuristic_agent: Optional[HeuristicSelectionAgent]) Tuple[int, str]
This function determines the next action according to the input model or heuristic
- Parameters
env – Environment object
model – Model object. E.g. PPO object
heuristic_id – Heuristic identifier. Can be None
heuristic_agent – HeuristicSelectionAgent object. Can be None
- Returns
ID of the selected action
- agents.test.run_episode(env, model, heuristic_id: Optional[str], handler: EvaluationHandler) None
This function executes one testing episode
- Parameters
env – Environment object
model – Model object. E.g. PPO object
heuristic_id – Heuristic identifier. Can be None
handler – EvaluationHandler object
- Returns
None
- agents.test.test_solver(config: Dict, data_test: List[List[Task]], logger: Logger) Dict
This function uses the OR solver to schedule the instances given in data_test.
- Parameters
config – Testing config
data_test – Data containing problem instances used for testing
logger – Logger object
- Returns
Evaluation metrics
- agents.test.log_results(plot_logger: Logger, inter_test_idx: Optional[int], heuristic: str, env, handler: EvaluationHandler) None
Calls the logger object to save the test results from this episode as a table (e.g. makespan mean, gantt chart)
- Parameters
plot_logger – Logger object
inter_test_idx – Index of current test. Can be None
heuristic – Heuristic identifier. Can be None
env – Environment object
handler – EvaluationHandler object
- Returns
None
- agents.test.test_model(env_config: Dict, data: List[List[Task]], logger: Logger, plot: Optional[bool] = None, log_episode: Optional[bool] = None, model=None, heuristic_id: Optional[str] = None, intermediate_test_idx=None) dict
This function tests a model in the passed environment on all problem instances passed as data and returns an evaluation summary
- Parameters
env_config – Environment config
data – Data containing problem instances used for testing
logger – Logger object
plot – Plot a gantt chart of all tests
log_episode – If true, calls the log function to log episode results as table
model – {None, StableBaselines Model}
heuristic_id – ID that identifies the used heuristic
intermediate_test_idx – Step number after which the test is performed. Is used to annotate the log
- Returns
evaluation metrics
- agents.test.test_model_and_heuristic(config: dict, model, data_test: List[List[Task]], logger: Logger, plot_ganttchart: bool = False, log_episode: bool = False) dict
Tests the model and all specified heuristics len(data_test) times and returns the results
- Parameters
config – Testing config
model – Model to be tested. E.g. PPO object
data_test – Dataset with instances to be used for the test
logger – Logger object
plot_ganttchart – Plot a gantt chart of all tests
log_episode – If true, calls the log function to log episode results as table
- Returns
Dict with evaluation_result dicts for the agent and all heuristics which were tested
- agents.test.get_perser_args()
Get arguments from command line
- agents.test.main(external_config=None)
Util functions
This file provides the IntermediateTest class which is used to run an intermediate test on the current model policy. If the recent model achieves the best result so far, it is saved as the new current optimum
- class agents.intermediate_test.IntermediateTest(env_config: dict, n_test_steps: int, data: List[List[Task]], logger: Logger)
Bases:
object
This object is used to run an intermediate test on the current model policy. If the recent model achieves the best result so far, it is saved as the new current optimum
- Parameters
env_config – Config used to initialize the environment for training
n_test_steps – Number of environment steps between intermediate tests
data – Dataset with instances to be used for the intermediate test
logger – Logger object
- __init__(env_config: dict, n_test_steps: int, data: List[List[Task]], logger: Logger)
- on_step(num_timesteps: int, instances: int, model) None
This function is called by the environment during each step. According to n_test_steps, the function runs an intermediate test
- Parameters
num_timesteps – Number of steps that have been already run by the environment
instances – Number of instances that have been already run by the environment
model – Model with the current policy. E.g. PPO object
- Returns
None
This file provides utility functions to load configs, data and agents according to the config. It is used in training and testing.
- TIMESTAMP (str): Timestamp of the training run, used for the creation of a unique model name
- AGENT_DICT (dict[str, str]): Maps algorithm identifiers (keys) to their actual class names (values). E.g. to use the MaskedPPO class, you can use ppo as algorithm in the config. If you add new algorithms, you can extend this dictionary to assign your algorithm class a short identifier.
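An illustration of what such a mapping could look like (entries are examples; per the note above, the ppo identifier resolves to the MaskedPPO class):

```python
AGENT_DICT = {
    "ppo": "MaskedPPO",  # short identifier -> class name
    "dqn": "DQN",
}
```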
- agents.train_test_utility_functions.load_config(config_path, external_config) dict
Uses the ConfigHandler routines to load the config according to the path
- Parameters
config_path – Path to the config to be loaded
external_config – Config dict
- Returns
Config
- agents.train_test_utility_functions.load_data(config: dict) List[List[Task]]
Uses the DataHandler routines to load the dataset specified in the config
- Parameters
config – Config dict which specifies a dataset
- Returns
Dataset (List of instances)
- agents.train_test_utility_functions.complete_config(config: dict) dict
If optional parameters have not been defined in the configuration, this function adds default values. Also creates missing directories, if necessary.
- Parameters
config – config file
- Returns
completed config file
- agents.train_test_utility_functions.get_agent_param_from_config(config: dict) str
Checks whether the config has a TRAIN or TEST algorithm param and gets the corresponding class string for the algorithm from the config
- Parameters
config – Config for training or testing
- Returns
Agent type string (e.g. ‘ppo’)
- agents.train_test_utility_functions.get_agent_class_from_config(config: dict) Any
Determines and loads the correct agent class type according to the config
- Parameters
config – Training config
- Returns
Agent class type which can be called