#######
Rewards
#######

.. |br| raw:: html

    <br>
Defining a reward function is essential in reinforcement learning. As such, *Sinergym* provides the option to use pre-implemented reward functions or to define custom ones (refer to the section below).

*Sinergym*'s predefined reward functions are designed as **multi-objective**, incorporating both *energy consumption* and *thermal discomfort*, which are normalized and combined with different weights. These rewards are **always negative**, meaning that optimal behavior results in a cumulative reward of 0. Separate temperature comfort ranges are defined for the summer and winter periods. The weights assigned to each term in the reward function allow the importance of each aspect to be adjusted when the environment is evaluated.

The core concept of the reward system in *Sinergym* is captured by the following equation:

.. math:: r_t = - \omega \ \lambda_P \ P_t - (1 - \omega) \ \lambda_T \ (|T_t - T_{up}| + |T_t - T_{low}|)

Where: |br|
:math:`P_t` represents power consumption, |br|
:math:`T_t` is the current indoor temperature, |br|
:math:`T_{up}` and :math:`T_{low}` are the upper and lower comfort range limits, respectively, |br|
:math:`\omega` is the weight assigned to power consumption, and consequently, :math:`1 - \omega` represents the comfort weight, |br|
:math:`\lambda_P` and :math:`\lambda_T` are scaling constants for the consumption and comfort penalties, respectively.

.. warning:: The constants :math:`\lambda_P` and :math:`\lambda_T` are configured to create a proportional relationship between the energy and comfort penalties, calibrating their magnitudes. When working with different buildings, it is crucial to adjust these constants so that both reward components remain of a similar magnitude.

Different types of reward functions are available, depending on the specific details considered:

- ``LinearReward`` implements a **linear reward** function, where discomfort is calculated as the absolute difference between the current temperature and the comfort range.

- ``ExpReward`` is similar to the linear reward, but discomfort is calculated using the **exponential difference** between the current temperature and the comfort range, resulting in a higher penalty for larger deviations from the target temperatures.

- ``HourlyLinearReward`` adjusts the weight assigned to discomfort based on the **hour of the day**, giving more importance to energy consumption outside working hours.

- ``NormalizedLinearReward`` normalizes the reward components based on the maximum energy and comfort penalties, which adapt as the simulation progresses. With this reward, the :math:`\lambda_P` and :math:`\lambda_T` constants are not required to calibrate the two magnitudes.

  .. warning:: This reward function is not very accurate at the beginning of the simulation, while the normalization maxima are still being estimated, so use it with care.

These reward functions have parameters in their constructors, whose values may vary depending on the building used or other factors. By default, all environments use ``LinearReward`` with default parameters for each building. To change this, refer to the example in :ref:`Adding a new reward`.

.. warning:: When specifying a reward other than the environment ID's default with `gym.make`, it is crucial to set any `reward_kwargs` that are required and therefore have no default value.

***************
Reward terms
***************

By default, reward functions return the **reward scalar value** together with the **terms** used in its calculation. The values of these terms depend on the specific reward function used, and they are automatically added to the environment's info dictionary.
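The snippet below is a minimal sketch of how these terms can be inspected. It assumes the Gymnasium API (five-value ``step`` return) and reuses the environment ID that appears later in this section; the exact keys present in ``info`` depend on the reward function and the *Sinergym* version, so adapt it as needed:

.. code:: python

    import gymnasium as gym

    import sinergym  # registers the Eplus-* environments on import

    env = gym.make('Eplus-discrete-stochastic-mixed-v1')
    obs, info = env.reset()

    # Take one random action and inspect the reward and its terms.
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
    print(reward)  # scalar reward value
    print(info)    # includes the individual reward terms for this step

    env.close()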
The structure typically matches the diagram below:

.. image:: /_static/reward_terms.png
    :scale: 70 %
    :alt: Reward terms
    :align: center

***************
Custom Rewards
***************

Defining custom reward functions is also straightforward. For instance, a reward signal that always returns -1 can be implemented as shown:

.. code:: python

    from sinergym.utils.rewards import BaseReward

    class CustomReward(BaseReward):
        """Naive reward function."""

        def __init__(self, env):
            super(CustomReward, self).__init__(env)

        def __call__(self, obs_dict):
            return -1.0, {}

    env = gym.make('Eplus-discrete-stochastic-mixed-v1', reward=CustomReward)

For more advanced reward functions, we recommend inheriting from our main class, ``LinearReward``, and overriding the relevant methods (see the sketch at the end of this section). Our reward functions simplify observation processing to extract consumption and comfort violation data, from which absolute penalty values are calculated. Weighted reward terms are then computed from these penalties and summed.

.. image:: /_static/reward_structure.png
    :scale: 70 %
    :alt: Reward steps structure
    :align: center

Because each of these steps is modularized, you can quickly and easily modify specific aspects of the reward to create a new one, as demonstrated by our *exponential function reward version*, for example.

*More reward functions will be included in the future, so stay tuned!*
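As a rough sketch of this modular approach, a custom reward could reuse the linear reward's weighting and energy penalty and override only the comfort penalty step. Note that ``_get_comfort_penalty`` and its signature are hypothetical names used here for illustration only, not *Sinergym*'s actual API; check the ``LinearReward`` source of your version for the real method to override:

.. code:: python

    from sinergym.utils.rewards import LinearReward

    class QuadraticComfortReward(LinearReward):
        """Illustrative sketch: keep LinearReward's energy penalty and
        weighting, but square the temperature deviation so that larger
        comfort violations are penalized more heavily.

        NOTE: ``_get_comfort_penalty`` is a hypothetical hook; override
        whichever method computes the comfort penalty in your version.
        """

        def _get_comfort_penalty(self, temperature, comfort_range):
            low, up = comfort_range
            if temperature < low:
                return (low - temperature) ** 2
            if temperature > up:
                return (temperature - up) ** 2
            return 0.0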