#######################################
Deep Reinforcement Learning Integration
#######################################
*Sinergym* provides some facilities in order to use **Deep Reinforcement Learning algorithms**
from `Stable Baselines 3 `__.
The algorithms currently tested with *Sinergym* are the following:

+--------------------------------------------------------+
| Stable Baselines 3:                                     |
+-----------+----------+------------+--------------------+
| Algorithm | Discrete | Continuous | Type               |
+===========+==========+============+====================+
| PPO       | YES      | YES        | OnPolicyAlgorithm  |
+-----------+----------+------------+--------------------+
| A2C       | YES      | YES        | OnPolicyAlgorithm  |
+-----------+----------+------------+--------------------+
| DQN       | YES      | NO         | OffPolicyAlgorithm |
+-----------+----------+------------+--------------------+
| DDPG      | NO       | YES        | OffPolicyAlgorithm |
+-----------+----------+------------+--------------------+
| SAC       | NO       | YES        | OffPolicyAlgorithm |
+-----------+----------+------------+--------------------+
| TD3       | NO       | YES        | OffPolicyAlgorithm |
+-----------+----------+------------+--------------------+

The ``Type`` column is specified because of its importance for the
*Stable Baselines callback* functionality explained below.
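
As a minimal sketch of how these algorithms are instantiated (the environment IDs below are
illustrative; any *Sinergym* environment with a matching action space works), the usual
Stable Baselines 3 constructors are used:

.. code:: python

    import gym
    import sinergym
    from stable_baselines3 import DQN, SAC

    # Discrete action space: PPO, A2C and DQN can be used.
    discrete_env = gym.make('Eplus-5Zone-hot-discrete-v1')
    dqn_model = DQN('MlpPolicy', discrete_env, verbose=1)

    # Continuous action space: PPO, A2C, DDPG, SAC and TD3 can be used.
    continuous_env = gym.make('Eplus-5Zone-hot-continuous-v1')
    sac_model = SAC('MlpPolicy', continuous_env, verbose=1)
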
****************
DRL Logger
****************

`Callbacks `__
are a set of functions that will be called at given **stages of the training procedure**.
You can use callbacks to access the internal state of the RL model **during training**.
They allow monitoring, auto saving, model manipulation, progress bars, etc.
This structure allows us to define a custom logger for DRL executions. Our objective
is to **log all information about our custom environment** specifically.
Therefore, `sinergym/sinergym/utils/callbacks.py `__
was created for this purpose. Each algorithm differs slightly in
how information is extracted, and the ``LoggerCallback`` implementation
deals with those subtleties.

.. note:: You can specify whether you want the *Sinergym* logger (see :ref:`Logger`) to record
          simulation interactions during training at the same time, using the
          ``sinergym_logger`` attribute in the constructor.

This callback derives from the Stable Baselines 3 ``BaseCallback`` and uses
`Tensorboard `__ in the
background at the same time. With *Tensorboard*, it is possible to visualize all the DRL
training in real time and to compare different executions. This is an example:

.. image:: /_static/tensorboard_example.png
:width: 800
:alt: Tensorboard example
:align: center

Some metrics are present for some algorithms and not for others, and vice versa.
The difference between ``OnPolicyAlgorithms`` and ``OffPolicyAlgorithms`` is important here
(a minimal usage sketch is shown after this list):

* **OnPolicyAlgorithms** can be recorded at **each timestep**; the ``log_interval`` passed to the
  learn process specifies the **step frequency** of the log.

* **OffPolicyAlgorithms** can be recorded at **each episode**. Consequently, the ``log_interval``
  passed to the learn process specifies the **episode frequency** of the log, not the step frequency.
  Some values, such as actions and observations, are generated at each timestep.
  Thus, Off-Policy algorithms record a **mean value** over the whole episode instead
  of step-by-step values (see the ``LoggerCallback`` class implementation).
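
As a hedged sketch of this (the environment ID, number of timesteps and hyperparameters are
illustrative), the ``LoggerCallback`` is simply passed to the algorithm's ``learn()`` method
together with ``log_interval``:

.. code:: python

    import gym
    import sinergym
    from stable_baselines3 import PPO
    from sinergym.utils.callbacks import LoggerCallback

    env = gym.make('Eplus-5Zone-hot-continuous-v1')

    # LoggerCallback records the custom Sinergym information in Tensorboard;
    # sinergym_logger=True also keeps the Sinergym logger active during training.
    callback = LoggerCallback(sinergym_logger=True)

    # PPO is an On-Policy algorithm, so log_interval is interpreted as a step frequency.
    # For an Off-Policy algorithm (e.g. DQN or SAC), it would be an episode frequency.
    model = PPO('MlpPolicy', env, verbose=1)
    model.learn(total_timesteps=35040, callback=callback, log_interval=1)
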
~~~~~~~~~~~~~~~~~~~~~~
Tensorboard structure
~~~~~~~~~~~~~~~~~~~~~~
The main structure *Sinergym* records in *Tensorboard* is the following
(a sketch of how to enable this output is shown at the end of this section):

* **action**: This section contains the action values during training. When the algorithm
  is On-Policy, **action_simulation** appears too. This is because, in continuous environments,
  algorithms have their own output, clipped with the gym action space, and that output
  is then parsed to the simulation action space (see the :ref:`Observation/action spaces` note box).

* **episode**: All information about entire episodes is stored here.
  It is equivalent to ``progress.csv`` in the *Sinergym logger*
  (see the *Sinergym* :ref:`Output format` section):

  - *comfort_violation_time(%)*: Percentage of the episode simulation time
    in which the temperature has been out of the comfort temperature ranges.
  - *cumulative_comfort_penalty*: Sum of comfort penalties (reward component)
    during the whole episode.
  - *cumulative_power*: Sum of power consumption during the whole episode.
  - *cumulative_power_penalty*: Sum of power penalties (reward component)
    during the whole episode.
  - *cumulative_reward*: Sum of rewards during the whole episode.
  - *ep_length*: Timesteps executed in each episode.
  - *mean_comfort_penalty*: Mean comfort penalty per step in the episode.
  - *mean_power*: Mean power consumption per step in the episode.
  - *mean_power_penalty*: Mean power penalty per step in the episode.
  - *mean_reward*: Mean reward obtained per step in the episode.

* **observation**: All observation values during the simulation are recorded here.
  These values depend on the environment being simulated
  (see the :ref:`Observation/action spaces` section).

* **normalized_observation** (optional): This section appears only when the environment
  has been **wrapped with normalization** (see the :ref:`Wrappers` section). The model
  trains with the normalized values, and both the original and the normalized
  observations are recorded.

* **rollout**: Algorithm metrics logged by **Stable Baselines by default**. For example,
  DQN has ``exploration_rate``, a value that doesn't appear in other algorithms.

* **time**: Monitoring of the execution time.

* **train**: Specific neural network information for each algorithm,
  provided by **Stable Baselines**, as with *rollout*.

.. note:: Evaluation of the models can be recorded too, by adding an ``EvalLoggerCallback``
          to the model's ``learn()`` method.
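
As a minimal sketch of enabling this output (the log directory and environment ID are illustrative),
it is enough to pass a ``tensorboard_log`` directory to the algorithm constructor and attach the
``LoggerCallback``; Stable Baselines 3 then writes its default sections (*rollout*, *time*, *train*)
and the callback adds the *Sinergym*-specific ones (*action*, *episode*, *observation*, ...):

.. code:: python

    import gym
    import sinergym
    from stable_baselines3 import PPO
    from sinergym.utils.callbacks import LoggerCallback

    env = gym.make('Eplus-5Zone-hot-continuous-v1')

    # The tensorboard_log directory is where every section described above is written.
    model = PPO('MlpPolicy', env, verbose=1, tensorboard_log='./tensorboard_log/')
    model.learn(total_timesteps=35040,
                callback=LoggerCallback(sinergym_logger=True),
                log_interval=1)

The recorded runs can then be visualized with ``tensorboard --logdir ./tensorboard_log/``.
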
**********
How to use
**********

You can run your own experiments and benefit from this functionality.
`sinergym/scripts/DRL_battery.py `__
is an example of how to use it. You can run ``DRL_battery.py`` directly from
your local computer, specifying the ``--tensorboard`` flag in the execution.

The most **important information** you must keep in mind when you try
your own experiments is the following (a minimal sketch is shown after this list):

* The model is constructed with an algorithm constructor.
  Each algorithm has its own **particular parameters**.

* If you wrap the environment with normalization, the model
  will **train** with those **normalized** values.

* Callbacks can be **concatenated** in a ``CallbackList``
  instance from Stable Baselines 3.

* The neural network will not train until you call the
  ``model.learn()`` method. This is where you
  specify the training ``timesteps``, ``callbacks`` and ``log_interval``,
  as discussed above for the algorithm types (On- and Off-Policy).

* ``DRL_battery.py`` requires some **extra arguments** to be
  executed, such as ``-env`` and ``-ep``.

* You can apply **Curriculum Learning**: you only have to
  add the ``--model`` field with a valid model path, and the script
  will load that model and continue training it.
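
A hedged sketch of these steps is shown below (the environment ID, wrapper arguments,
hyperparameters and paths are illustrative; ``DRL_battery.py`` handles all of this through its
command-line arguments):

.. code:: python

    import gym
    import sinergym
    from stable_baselines3 import PPO
    from stable_baselines3.common.callbacks import CallbackList
    from sinergym.utils.callbacks import LoggerCallback
    from sinergym.utils.wrappers import NormalizeObservation

    # 1. Create the environment and (optionally) wrap it with normalization,
    #    so that the model trains with normalized values.
    env = gym.make('Eplus-5Zone-hot-continuous-v1')
    env = NormalizeObservation(env)  # constructor arguments may vary, see the Wrappers section

    # 2. Build the model with its particular parameters and a Tensorboard log directory.
    model = PPO('MlpPolicy', env, learning_rate=3e-4, verbose=1,
                tensorboard_log='./tensorboard_log/')

    # 3. Concatenate callbacks (an evaluation callback could be appended here too).
    callbacks = CallbackList([LoggerCallback(sinergym_logger=True)])

    # 4. Nothing trains until learn() is called; timesteps, callbacks and
    #    log_interval are specified here.
    model.learn(total_timesteps=2 * 35040, callback=callbacks, log_interval=1)

    # 5. Save the resulting model. A saved model can later be reloaded to continue
    #    training, which is what the --model field does for Curriculum Learning.
    model.save('ppo_sinergym_model')
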
****************
Mlflow
****************

Our scripts for running DRL with *Sinergym* environments use
`Mlflow `__ in order to **track experiments**
and record them methodically. We recommend using it.
You can start a local server with the information stored during the
battery of experiments, such as the start and end date of the execution,
hyperparameters, duration, etc.
Here is an example:

.. image:: /_static/mlflow_example.png
    :width: 800
    :alt: Mlflow example
    :align: center
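
As an illustrative sketch of this kind of tracking (the experiment and run names, parameters and
metric below are made up; ``DRL_battery.py`` records similar information automatically), the
standard Mlflow API can be used as follows:

.. code:: python

    import mlflow

    mlflow.set_experiment('sinergym-drl')
    with mlflow.start_run(run_name='PPO-experiment'):
        # Hyperparameters and experiment metadata.
        mlflow.log_param('algorithm', 'PPO')
        mlflow.log_param('learning_rate', 3e-4)
        mlflow.log_param('episodes', 10)
        # ... training would happen here ...
        # Final results of the execution.
        mlflow.log_metric('mean_reward', -0.5)

The local tracking server with these runs can then be started with ``mlflow ui``.
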
.. note:: For information about how to use *Tensorboard* and *Mlflow* within a Cloud
          Computing paradigm, see :ref:`Remote Tensorboard log` and
          :ref:`Mlflow tracking server set up`.

.. note:: *This is a work-in-progress project. Direct support for other
          algorithms is planned for the future!*