14. Deep Reinforcement Learning Integration
Sinergym provides features to utilize Deep Reinforcement Learning algorithms from Stable Baselines 3. However, Sinergym is also compatible with any algorithm that operates with the Gymnasium interface.
To this end, Sinergym refines and extends Callbacks, a set of functions called at specific stages of the training procedure. Callbacks give access to the internal state of the RL model during training, enabling monitoring, auto-saving, model manipulation, progress bars, and more. Sinergym's callbacks inherit from Stable Baselines 3 and can be found in sinergym/sinergym/utils/callbacks.py.
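As an illustration of how callbacks plug into training, the sketch below defines a trivial custom callback with Stable Baselines 3 and passes it to a PPO agent trained on a Sinergym environment. The environment ID and the callback's behaviour are only examples and are not part of Sinergym's own callback collection:

import gymnasium as gym
import sinergym  # registers the Sinergym environments
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import BaseCallback

class PrintRewardCallback(BaseCallback):
    # Toy callback: print the collected rewards every 1000 training steps.
    def _on_step(self) -> bool:
        if self.n_calls % 1000 == 0:
            print(f"step={self.num_timesteps}, rewards={self.locals.get('rewards')}")
        return True  # returning False would stop training early

env = gym.make('Eplus-5zone-hot-continuous-v1')  # example environment ID
model = PPO('MlpPolicy', env)
model.learn(total_timesteps=10_000, callback=PrintRewardCallback())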
14.1. Evaluation Callback
An enhanced callback, named LoggerEvalCallback, is used for evaluating the model versions obtained during the training process with Sinergym. It stores the best model obtained, not necessarily the final one from the training process. This callback inherits from Stable Baselines 3's EventCallback.
This callback functions similarly to EvalCallback from Stable Baselines 3 but includes numerous enhancements and specific adaptations for Sinergym and for logging relevant information during the process.
It requires that the evaluation environment be previously wrapped by a child class of BaseLoggerWrapper. This is essential for the callback to access the logger's methods and attributes and to log information correctly. Additionally, this callback saves the best model and evaluation summaries (CSV file) in a folder named evaluation within the training environment's output.
To log all this data to the Weights and Biases platform, the training environment must be previously wrapped with the WandbLoggerWrapper class (see Logger Wrappers), so that evaluation data is recorded there as well. Wrapping the evaluation environment is not necessary unless detailed monitoring of those episodes is desired.
The data logged on the platform, in the Evaluations section, will depend on the specific logger wrapper used and its episode summary. Thus, to obtain new metrics, the logger wrapper must be modified, not the callback. Additionally, this callback overwrites specific metrics with those of the best model obtained during the training process, so that the best model's metrics are preserved.
The number of episodes run in each evaluation and their frequency can be configured, and metrics from the underlying logger can be excluded if desired. Moreover, if observation normalization is applied, the callback automatically copies the calibration parameters from the training environment to the evaluation environment.
More episodes lead to more accurate averages of the reward-based indicators, providing a more realistic assessment of the current model’s performance. However, this increases the time required. To see an example, visit Training a model.
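A rough sketch of how this callback might be wired into training is shown below. The wrapper and constructor arguments used here (LoggerWrapper as a child of BaseLoggerWrapper, and the eval_env, train_env, n_eval_episodes and eval_freq_episodes arguments) are assumptions for illustration; check sinergym/utils/callbacks.py and the Logger Wrappers section for the exact names in your Sinergym version:

import gymnasium as gym
import sinergym
from stable_baselines3 import PPO
from sinergym.utils.wrappers import LoggerWrapper      # assumed BaseLoggerWrapper child
from sinergym.utils.callbacks import LoggerEvalCallback

# Training and evaluation environments, both wrapped with a logger (example environment ID)
train_env = LoggerWrapper(gym.make('Eplus-5zone-hot-continuous-v1'))
eval_env = LoggerWrapper(gym.make('Eplus-5zone-hot-continuous-v1'))

# Assumed constructor arguments; the real signature may differ
eval_callback = LoggerEvalCallback(
    eval_env=eval_env,
    train_env=train_env,
    n_eval_episodes=2,      # episodes per evaluation
    eval_freq_episodes=5,   # evaluate every 5 training episodes
)

model = PPO('MlpPolicy', train_env)
model.learn(total_timesteps=100_000, callback=eval_callback)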
14.2. Usage
14.2.1. Model Training
Leverage this functionality for your experiments using the script sinergym/scripts/train/train_agent.py.
Key considerations for your experiments:
Models are built using an algorithm constructor, each with its own specific parameters. Defaults are used if none are defined.
If you wrap the environment with normalization, models will train in these normalized spaces.
Concatenate callbacks in a CallbackList instance from Stable Baselines 3.
The neural network begins training upon executing the model.learn() method, where you specify timesteps, callbacks, and log_interval.
Curriculum Learning can be implemented by adding a model field with a valid model path. The script will load and train the model.
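A minimal sketch combining these considerations, assuming the NormalizeObservation wrapper from Sinergym's wrapper collection and an example environment ID, could look like this:

import gymnasium as gym
import sinergym
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import CallbackList
from sinergym.utils.wrappers import NormalizeObservation  # assumed wrapper name

# With a normalization wrapper, the model trains in the normalized spaces
env = NormalizeObservation(gym.make('Eplus-5zone-hot-continuous-v1'))

# Algorithm constructor with its own (here default) parameters
model = PPO('MlpPolicy', env, verbose=1)

# Concatenate all callbacks (evaluation, logging, etc.) in a single CallbackList
callbacks = CallbackList([])  # e.g. [eval_callback, ...]

# Training starts here; timesteps, callbacks and log_interval are given to learn()
model.learn(total_timesteps=100_000, callback=callbacks, log_interval=1)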
The train_agent.py script requires a single parameter, -conf, a string indicating the JSON file containing all experiment details. Refer to the JSON structure in sinergym/scripts/train/train_agent_PPO.json:
Mandatory parameters: environment, episodes, algorithm (and any non-default algorithm parameters).
Optional parameters: environment parameters (overwrite the defaults if specified), seed, pre-training model to load, experiment ID, wrappers (in order), training evaluation, and cloud options.
Field names must match the example, or the experiment will fail.
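For instance, using the paths referenced above, the example experiment could be launched with: python sinergym/scripts/train/train_agent.py -conf sinergym/scripts/train/train_agent_PPO.json (adjust the paths to your working directory if needed).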
The script performs the following:
Names the experiment in the format:
<algorithm>-<environment_name>-episodes<episodes_int>-seed<seed_value>(<experiment_date>)
Sets environment parameters if specified.
Applies specified wrappers from JSON.
Saves all the experiment's hyperparameters in WandB if a session is detected.
Defines model algorithm with specified hyperparameters.
Calculates the total training timesteps from the number of episodes (see the note after this list).
Sets up evaluation callback if specified.
Trains the model in the environment.
If remote storage is specified, saves all outputs in a Google Cloud Storage bucket. If WandB is specified, saves all outputs as a WandB run artifact.
Auto-deletes the remote container in Google Cloud Platform if the auto-delete parameter is specified.
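As a rough note on the timestep calculation, the total is the number of episodes multiplied by the timesteps in one episode, which in turn depends on the run period and the simulation timestep size of the chosen environment. The figures below (a one-year run period at 4 timesteps per hour) are only an assumed example:

episodes = 5
timesteps_per_hour = 4                # assumed simulation configuration
hours_per_episode = 365 * 24          # assumed one-year run period
total_timesteps = episodes * hours_per_episode * timesteps_per_hour  # 175,200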
14.2.2. Model Loading
Use the script sinergym/scripts/eval/load_agent.py to load and evaluate or execute a previously trained model.
The load_agent.py script requires a single parameter, -conf, a string indicating the JSON file containing all evaluation details. Refer to the JSON structure in sinergym/scripts/eval/load_agent_example.json:
Mandatory parameters: environment, episodes, algorithm (name only), and the model to load, either local (model) or remote (wandb_model).
Optional parameters: environment parameters (overwrite the defaults if specified), experiment ID, wrappers (in order), and cloud options.
The script loads the model and predicts actions from states over the specified episodes. The information is collected and sent to remote storage (like WandB) if specified, otherwise it remains in local memory.
The model field in the JSON can be a local path to the model, a bucket URL in the form gs://, or a WandB artifact path for stored models.
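For reference, the evaluation loop performed by such a script is conceptually similar to the sketch below. The environment ID and model path are placeholders, and the loaded model must be compatible with the environment's observation and action spaces (including any wrappers used during training):

import gymnasium as gym
import sinergym
from stable_baselines3 import PPO

env = gym.make('Eplus-5zone-hot-continuous-v1')  # example environment ID
model = PPO.load('/path/to/trained_model.zip')   # placeholder local path

for _ in range(3):  # number of evaluation episodes
    obs, info = env.reset()
    terminated = truncated = False
    while not (terminated or truncated):
        action, _states = model.predict(obs, deterministic=True)
        obs, reward, terminated, truncated, info = env.step(action)
env.close()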
Note
This project is a work in progress. Direct support for additional algorithms is planned for the future!