
[Question] Cannot reproduce results of "EvalCallback" gathered during training. #2036

Open
felix-basiliskroko opened this issue Nov 7, 2024 · 2 comments
Labels
duplicate (This issue or pull request already exists) · question (Further information is requested) · RTFM (Answer is the documentation)

Comments

@felix-basiliskroko

❓ Question

During training I attach an EvalCallback to my custom Gymnasium environment to record the agent's performance when actions are chosen deterministically:

from stable_baselines3.common.callbacks import EvalCallback
from stable_baselines3.common.env_util import make_vec_env

eval_env = make_vec_env(env_id=env_id, seed=42)
eval_callback = EvalCallback(eval_env, best_model_save_path=f"./{check_root_dir}/{run}/{mod}",
                             log_path=f"./{check_root_dir}/{run}/{mod}", eval_freq=20_000,
                             deterministic=True, render=False, n_eval_episodes=10)

...

model.learn(total_timesteps=2_000_000, callback=eval_callback)
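
With best_model_save_path set, EvalCallback writes the best-scoring checkpoint as best_model.zip inside that directory; this is the file loaded in the reproduction snippet below:

model_path = f"./{check_root_dir}/{run}/{mod}/best_model.zip"  # hypothetical path to the saved best model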

During training, the eval/mean_reward converges to approximately -10.0, so I had a look at the _on_step method of EvalCallback to reproduce this score and visualise what exactly the agent has learned:

import numpy as np

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy

vec_env = make_vec_env(env_id=env_id, n_envs=1, seed=42)
model = PPO("MultiInputPolicy", env=vec_env)
model.load(model_path, deterministic=deterministic)
episode_rewards, _ = evaluate_policy(model, vec_env, n_eval_episodes=10, render=False, deterministic=True, return_episode_rewards=True)
mean_reward = np.mean(episode_rewards)

I have triple-checked that the model being loaded is the same one saved by EvalCallback, that the same deterministic and return_episode_rewards flags are set, and even that the seed for both environments is the same. But still:

print(mean_reward) -> -500.0

This is so far off the mean_reward evaluated during training that something must be wrong; the gap cannot simply be attributed to stochasticity in the environment or normal deviation from the mean.

I have tried everything I could think of and can't figure out where this difference comes from. Would this indicate that something in my custom environment is causing the discrepancy, or am I missing a crucial detail?


felix-basiliskroko added the question label on Nov 7, 2024
@amabilee

amabilee commented Nov 7, 2024

Hey there!

Given that the discrepancy is so large, it does suggest there might be an issue with your custom environment or the way it's being handled during evaluation.

  1. Ensure that the environment is being reset correctly before each evaluation episode. Any residual state from previous episodes could affect the evaluation.
  2. Verify that the action and observation spaces are identical between the training and evaluation environments. Any differences could lead to unexpected behavior.
  3. Double-check the reward calculation logic in your custom environment. Ensure that it's consistent and correctly implemented in both training and evaluation modes.
  4. Make sure that any randomness in your environment (e.g., initial states, stochastic transitions) is controlled or eliminated during evaluation to ensure deterministic behavior.
  5. If you're using any wrappers in your evaluation environment, ensure they are identical to those used during training. Even subtle differences can lead to significant discrepancies. (A short sketch of a few of these checks follows below.)
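
A minimal sketch of checks 1, 2, and 4, assuming a hypothetical registered env id and the same make_vec_env call as in the question:

import numpy as np

from stable_baselines3.common.env_util import make_vec_env

env_id = "YourCustomEnv-v0"  # hypothetical: your registered custom env id

# Two identically constructed, identically seeded envs.
env_a = make_vec_env(env_id=env_id, n_envs=1, seed=42)
env_b = make_vec_env(env_id=env_id, n_envs=1, seed=42)

# Check 2: observation and action spaces must match exactly.
assert env_a.observation_space == env_b.observation_space
assert env_a.action_space == env_b.action_space

# Checks 1 and 4: with the same seed, two fresh resets should agree
# (with MultiInputPolicy the observation is a dict of batched arrays).
obs_a, obs_b = env_a.reset(), env_b.reset()
for key in obs_a:
    assert np.allclose(obs_a[key], obs_b[key]), f"reset() differs in '{key}'"

If any of these assertions fail, the discrepancy comes from the environment itself rather than from EvalCallback.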

araffin added the duplicate and RTFM labels on Nov 7, 2024
@araffin
Member

araffin commented Nov 7, 2024

Duplicate of #928 (comment) and others
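
One likely culprit, given the snippet in the question: load is a classmethod that returns a new model, so model.load(model_path, ...) does not load weights into the existing instance, and evaluate_policy then scores the freshly initialized (untrained) policy. A minimal sketch of the intended pattern, reusing the hypothetical env_id and model_path from the question:

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy

vec_env = make_vec_env(env_id=env_id, n_envs=1, seed=42)
model = PPO.load(model_path, env=vec_env)  # classmethod: assign the return value
episode_rewards, _ = evaluate_policy(model, vec_env, n_eval_episodes=10,
                                     deterministic=True, return_episode_rewards=True)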
