❓ Question
During training, I wrap my custom Gymnasium environment in an EvalCallback to record the performance of my agent when actions are decided deterministically:
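A minimal sketch of that setup (the original code is not shown, so `MyCustomEnv`, PPO, and all paths/hyperparameters below are placeholders, not the author's actual code):

```python
# Sketch only: MyCustomEnv, PPO, and all paths/hyperparameters are
# placeholders standing in for the setup described above.
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import EvalCallback
from stable_baselines3.common.monitor import Monitor

train_env = Monitor(MyCustomEnv())
eval_env = Monitor(MyCustomEnv())
eval_env.reset(seed=0)

eval_callback = EvalCallback(
    eval_env,
    best_model_save_path="./logs/",  # best_model.zip is written here
    eval_freq=10_000,
    n_eval_episodes=5,
    deterministic=True,              # greedy actions during evaluation
)

model = PPO("MlpPolicy", train_env)
model.learn(total_timesteps=1_000_000, callback=eval_callback)
```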
During training, the `eval/mean_reward` converges to approximately -10.0, so I had a look at the `_on_step` method of `EvalCallback` to reproduce this score and visualise what exactly the agent has learned:
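The following is roughly what `EvalCallback._on_step` does internally, re-run by hand on the saved model (again a sketch, reusing the placeholder `MyCustomEnv` and paths from above):

```python
# Sketch: reproduce the evaluation that EvalCallback performs internally,
# using the placeholder MyCustomEnv and paths from the previous snippet.
import numpy as np
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor

model = PPO.load("./logs/best_model.zip")

eval_env = Monitor(MyCustomEnv())
eval_env.reset(seed=0)

episode_rewards, episode_lengths = evaluate_policy(
    model,
    eval_env,
    n_eval_episodes=5,
    deterministic=True,
    return_episode_rewards=True,
)
mean_reward = np.mean(episode_rewards)
print(mean_reward)
```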
I have triple-checked that the model being loaded is the same one saved by `EvalCallback`, that the same `deterministic` and `return_episode_rewards` flags are set, and even that the seed for both environments is the same. But still:
`print(mean_reward)` -> -500.0
This is so far off the `mean_reward` evaluated during training that something must be wrong; the gap cannot simply be attributed to stochasticity in the environment and normal deviation from the mean.
I have tried everything I could think of and I can't seem to figure out where this difference comes from. Would this indicate that something in my custom environment is causing the discrepancy, or am I missing a crucial detail?
Checklist
- [x] I have checked that there is no similar issue in the repo
Given that the discrepancy is so large, it does suggest there might be an issue with your custom environment or the way it's being handled during evaluation.
- Ensure that the environment is being reset correctly before each evaluation episode. Any residual state from previous episodes could affect the evaluation.
- Verify that the action and observation spaces are identical between the training and evaluation environments. Any differences could lead to unexpected behavior.
- Double-check the reward calculation logic in your custom environment. Ensure that it is consistent and correctly implemented in both training and evaluation modes.
- Make sure that any randomness in your environment (e.g., initial states, stochastic transitions) is controlled or eliminated during evaluation to ensure deterministic behavior.
- If you're using any wrappers in your evaluation environment, ensure they are identical to those used during training (see the sketch after this list). Even subtle differences can lead to significant discrepancies.
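As a quick sanity check along those lines (a sketch, reusing the placeholder `MyCustomEnv` and wrapper stack from the snippets above; adapt it to your actual environment), you can compare the spaces and wrapper chains of the two environments directly:

```python
# Sanity-check sketch (placeholder MyCustomEnv; adapt to your wrapper stack):
# confirm the training and evaluation environments expose identical spaces
# and identical wrapper chains before comparing rewards.
from stable_baselines3.common.monitor import Monitor

train_env = Monitor(MyCustomEnv())
eval_env = Monitor(MyCustomEnv())

# Spaces must match exactly, including dtypes and bounds.
assert train_env.observation_space == eval_env.observation_space
assert train_env.action_space == eval_env.action_space

# Walk the wrapper chains and confirm they are identical.
def wrapper_chain(env):
    names = [type(env).__name__]
    while hasattr(env, "env"):
        env = env.env
        names.append(type(env).__name__)
    return names

assert wrapper_chain(train_env) == wrapper_chain(eval_env)

# Seed and reset both environments the same way before comparing rollouts.
train_obs, _ = train_env.reset(seed=0)
eval_obs, _ = eval_env.reset(seed=0)
```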