[Train] Ray Train should support AWS trainium instances #33504
Labels
- amz
- enhancement: Request for new feature and/or capability
- P2: Important issue, but not time-critical
- train: Ray Train Related Issue
Comments
gilvikra added the enhancement (Request for new feature and/or capability) and triage (Needs triage: priority, bug/not-bug, owning component) labels on Mar 21, 2023
gjoliver changed the title from "[Train]" to "[Train] Ray Train should support AWS trainium instances" on Mar 21, 2023
gjoliver added the P2 (Important issue, but not time-critical) label and removed the triage label on Mar 21, 2023
+1. Given the shortage of GPUs in the industry, it would be beneficial for us to have Ray tested and supported on AWS Trainium, to unblock LLM use cases.
anyscalesam added the triage and train (Ray Train Related Issue) labels on Apr 2, 2024
@woshiyyya can you take a look? I'm adding triage as well in case we want to punt this to the next on-call rotation.
@anyscalesam OK. Will take a look at the CI issue of #39130.
anyscalesam removed the triage label on May 15, 2024
Description
I would like AWS Trainium instances, which require the "xla" torch backend, to be supported with Ray.
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/tutorials/training/distributed_data_parallel.html#neur[…]rial
There is a great push towards Trainium, and right now Ray does not seem to support it natively the way it supports CPUs and GPUs.
Use case
Use of AWS Trainium chips for efficient, performant, cost-effective distributed training on top of Ray.
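For context, the linked Neuron tutorial runs DDP over the "xla" process-group backend rather than "nccl" or "gloo", which is the piece Ray Train would need to wire up per worker. A minimal sketch of that pattern, assuming a Trainium (trn1) instance with torch-neuronx/torch-xla installed; the model and data here are placeholders, not part of the tutorial:

```python
import torch
import torch.distributed as dist
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_backend  # registers the "xla" process-group backend

# This is the backend Trainium requires; Ray Train currently sets up
# "nccl"/"gloo" for GPU/CPU workers instead.
dist.init_process_group("xla")
device = xm.xla_device()  # a NeuronCore, exposed as an XLA device

model = torch.nn.Linear(16, 4).to(device)  # placeholder model
ddp_model = torch.nn.parallel.DistributedDataParallel(model)
optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

for step in range(10):
    optimizer.zero_grad()
    inputs = torch.randn(8, 16, device=device)  # placeholder batch
    loss = ddp_model(inputs).sum()
    loss.backward()
    optimizer.step()
    xm.mark_step()  # flush the lazily traced graph to the XLA/Neuron compiler
```

Native support would mean Ray Train performing this backend selection and device placement for each worker automatically, as it already does for CUDA devices.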