Context
Talking with the @mapbox/ml-club today, it sounds like running training on multiple hosts is still unexplored, and it could ease some of the difficulties involved in running single hosts for days on end.
Thoughts
I'm not sure if this belongs in ecs-watchbot or in ecs-api, but it looks like https://github.com/uber/horovod is a potential way to try out distributing a machine learning system across multiple hosts.
The hosts connect to each other over TCP, so maybe ENIs and named DNS records/service discovery would help here.
cc/ @mapbox/ml-club @mapbox/platform
This would be really cool to explore -- but it might be worth waiting until ECS rolls out their upcoming service discovery system. From the sound of it, that system will make it far easier to manage the IP addresses, DNS entries, and health checks that are usually needed for this kind of cross-node communication.
Our last communication with the team put the launch of this feature in late Feb / early March.
From the API standpoint, I think it'd make sense to have service discovery be an option during template creation. Then within the watchbot listen code, we could poll the Route53 record for the service and internally keep the list of IPs or IP:port combos of all of the containers in the service. Then, we could inject this list into the worker as a comma-separated environment variable.
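To make that flow concrete, here is a rough sketch of the three steps: resolve the service's DNS name, build the IP:port list, and hand it to the worker as a comma-separated environment variable. It is written in Python purely for illustration and is not watchbot code. The variable name `PEER_ADDRESSES`, the function names, and the hostname in the usage note are all made up, and it uses a plain DNS lookup where the real thing might query the Route53 API directly.

```python
import os
import socket
from typing import Callable, Dict, List, Optional, Tuple


def resolve_peers(hostname: str, port: int,
                  resolver: Callable = socket.getaddrinfo) -> List[str]:
    """Resolve a service DNS name to a sorted, de-duplicated IP:port list.

    `resolver` defaults to a plain DNS lookup; it is injectable so the
    function can be exercised without network access.
    """
    infos = resolver(hostname, port, socket.AF_INET, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for IPv4, sockaddr is (ip, port).
    ips = sorted({info[4][0] for info in infos})
    return [f"{ip}:{port}" for ip in ips]


def worker_env(peers: List[str],
               base_env: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with the peer list injected.

    PEER_ADDRESSES is a hypothetical variable name, not an existing
    watchbot convention.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["PEER_ADDRESSES"] = ",".join(peers)
    return env


def parse_peers(value: str) -> List[Tuple[str, int]]:
    """Worker side: turn the comma-separated variable back into (ip, port) pairs."""
    pairs = []
    for item in value.split(","):
        if not item:
            continue  # tolerate an empty variable or a trailing comma
        host, _, port = item.rpartition(":")
        pairs.append((host, int(port)))
    return pairs
```

The listener could call `resolve_peers("training.example.internal", 12345)` on each polling interval and pass `worker_env(...)` to the process it spawns; the worker would then call `parse_peers(os.environ["PEER_ADDRESSES"])`. Sorting the addresses keeps the list stable between polls, which makes it easy to detect when membership has actually changed.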