[Train] Ray Train should support AWS trainium instances #33504

Open
gilvikra opened this issue Mar 21, 2023 · 4 comments
Labels
amz · enhancement (Request for new feature and/or capability) · P2 (Important issue, but not time-critical) · train (Ray Train Related Issue)

Comments

@gilvikra

Description

I would like Ray to support AWS Trainium instances, which require the "xla" torch backend.

https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/tutorials/training/distributed_data_parallel.html#neur[…]rial

There is a strong push toward Trainium, and right now Ray does not seem to support it natively the way it supports CPUs and GPUs.

Use case

Using AWS Trainium chips for efficient, performant, cost-effective distributed training on top of Ray.

@gilvikra gilvikra added enhancement Request for new feature and/or capability triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Mar 21, 2023
@gjoliver gjoliver changed the title [Train] [Train] Ray Train should support AWS trainium instances Mar 21, 2023
@gjoliver gjoliver added P2 Important issue, but not time-critical and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Mar 21, 2023
@swaroopch

+1. Given the shortage of GPUs in the industry, it would be beneficial for us to have Ray tested and supported on AWS Trainium, to unblock LLM use cases.

@pdames
Member

pdames commented Aug 15, 2023

Follow-up issue: #38473. This improves the maintainability of #37998 by removing the need to continuously update a hard-coded dictionary of EC2 instance types to neuron core counts.
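
To illustrate the maintainability concern, here is a minimal sketch (not the code from either PR) that contrasts a hard-coded table with a count supplied at runtime. The names `STATIC_NEURON_CORES`, `resolve_neuron_cores`, and the `query` callable are hypothetical, and the table values are illustrative only.

```python
from typing import Callable, Dict, Optional

# Hypothetical hard-coded table: every new EC2 instance type needs a code change.
# Values are illustrative, not copied from Ray's source.
STATIC_NEURON_CORES: Dict[str, int] = {
    "trn1.2xlarge": 2,
    "trn1.32xlarge": 32,
}


def resolve_neuron_cores(
    instance_type: str,
    query: Optional[Callable[[str], Optional[int]]] = None,
) -> int:
    """Return the neuron core count for an instance type.

    `query` stands in for any runtime source of truth (for example a cloud
    API call or a local device probe). It is tried first; the static table
    is used only as a fallback.
    """
    if query is not None:
        count = query(instance_type)
        if count is not None:
            return count
    if instance_type in STATIC_NEURON_CORES:
        return STATIC_NEURON_CORES[instance_type]
    raise KeyError(f"unknown neuron core count for {instance_type!r}")
```

With a runtime `query` in place, a newly released instance type resolves without touching the table; without one, only the listed types work.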

@anyscalesam
Contributor

@woshiyyya can you take a look? I'm adding triage as well in case we want to punt this to the next on-call rotation.

@woshiyyya
Member

woshiyyya commented Apr 2, 2024

@anyscalesam OK. Will take a look at the CI issue of #39130.

@anyscalesam anyscalesam removed the triage Needs triage (eg: priority, bug/not-bug, and owning component) label May 15, 2024

7 participants