
citc-watchdog service repeatedly restarting #41

Open
willfurnass opened this issue Jul 8, 2021 · 2 comments
Labels
AWS, bug (Something isn't working)

Comments


willfurnass commented Jul 8, 2021

Jul  8 19:10:38 mgmt watchdog[328075]:  File "/opt/cloud_sdk/bin/watchdog", line 8, in <module>
Jul  8 19:10:38 mgmt watchdog[328075]:    sys.exit(main())
Jul  8 19:10:38 mgmt watchdog[328075]:  File "/opt/cloud_sdk/lib64/python3.8/site-packages/citc/watchdog.py", line 89, in main
Jul  8 19:10:38 mgmt watchdog[328075]:    cloud_nodes = utils.get_cloud_nodes()
Jul  8 19:10:38 mgmt watchdog[328075]:  File "/opt/cloud_sdk/lib64/python3.8/site-packages/citc/utils.py", line 27, in get_cloud_nodes
Jul  8 19:10:38 mgmt watchdog[328075]:    cloud_nodes = aws.AwsNode.all(ec2, nodespace)
Jul  8 19:10:38 mgmt watchdog[328075]:  File "/opt/cloud_sdk/lib64/python3.8/site-packages/citc/aws.py", line 93, in all
Jul  8 19:10:38 mgmt watchdog[328075]:    return [cls.from_response(instance) for instance in instances]
Jul  8 19:10:38 mgmt watchdog[328075]:  File "/opt/cloud_sdk/lib64/python3.8/site-packages/citc/aws.py", line 93, in <listcomp>
Jul  8 19:10:38 mgmt watchdog[328075]:    return [cls.from_response(instance) for instance in instances]
Jul  8 19:10:38 mgmt watchdog[328075]:  File "/opt/cloud_sdk/lib64/python3.8/site-packages/citc/aws.py", line 76, in from_response
Jul  8 19:10:38 mgmt watchdog[328075]:    ip = response["PrivateIpAddress"]
Jul  8 19:10:38 mgmt watchdog[328075]: KeyError: 'PrivateIpAddress'

That response dictionary doesn't contain a PrivateIpAddress key; the dictionary is as follows:

{'AmiLaunchIndex': 0, 'ImageId': 'ami-035ed6bae06963c37', 'InstanceId': 'i-0dbee8bd641226ae8', 'InstanceType': 't3.large', 'KeyName': 'ec2-user-ephemeron', 'LaunchTime': datetime.datetime(2021, 7, 8, 18, 12, 56, tzinfo=tzlocal()), 'Monitoring': {'State': 'disabled'}, 'Placement': {'AvailabilityZone': 'eu-west-1a', 'GroupName': '', 'Tenancy': 'default'}, 'PrivateDnsName': '', 'ProductCodes': [], 'PublicDnsName': '', 'State': {'Code': 48, 'Name': 'terminated'}, 'StateTransitionReason': 'User initiated (2021-07-08 18:35:05 GMT)', 'Architecture': 'x86_64', 'BlockDeviceMappings': [], 'ClientToken': '314e0316-4fad-4d66-9b0b-918590eab1de', 'EbsOptimized': False, 'EnaSupport': True, 'Hypervisor': 'xen', 'NetworkInterfaces': [], 'RootDeviceName': '/dev/sda1', 'RootDeviceType': 'ebs', 'SecurityGroups': [], 'StateReason': {'Code': 'Client.UserInitiatedShutdown', 'Message': 'Client.UserInitiatedShutdown: User initiated shutdown'}, 'Tags': [{'Key': 'Name', 'Value': 'ephemeron-t3-large-0003'}, {'Key': 'type', 'Value': 'compute'}, {'Key': 'cluster', 'Value': 'ephemeron'}], 'VirtualizationType': 'hvm', 'CpuOptions': {'CoreCount': 1, 'ThreadsPerCore': 2}, 'CapacityReservationSpecification': {'CapacityReservationPreference': 'open'}, 'HibernationOptions': {'Configured': False}, 'MetadataOptions': {'State': 'pending', 'HttpTokens': 'optional', 'HttpPutResponseHopLimit': 1, 'HttpEndpoint': 'enabled'}, 'EnclaveOptions': {'Enabled': False}}

@milliams Any thoughts on this? Could this cause problems? Wanting to use CITC for teaching next week :)

(EDIT: line numbers for aws.py in the backtrace are slightly out due to some print calls I've added)
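A minimal sketch of one way the lookup could tolerate this, assuming the only problem is that a terminated instance comes back without a `PrivateIpAddress` key, as in the dictionary above. The helper names `private_ip_or_none`, `is_terminated` and `live_instances` are hypothetical and are not part of citc; the sample dictionary is trimmed from the response quoted above.

```python
from typing import List, Optional


def private_ip_or_none(response: dict) -> Optional[str]:
    """Return the private IP if the response has one, else None (no KeyError)."""
    return response.get("PrivateIpAddress")


def is_terminated(response: dict) -> bool:
    """True when the instance's reported state name is 'terminated'."""
    return response.get("State", {}).get("Name") == "terminated"


def live_instances(responses: List[dict]) -> List[dict]:
    """Drop terminated instances before any further parsing."""
    return [r for r in responses if not is_terminated(r)]


# Trimmed copy of the response from this issue (most keys omitted).
terminated = {
    "InstanceId": "i-0dbee8bd641226ae8",
    "State": {"Code": 48, "Name": "terminated"},
    "NetworkInterfaces": [],
}
# Hypothetical running instance, for contrast only.
running = {
    "InstanceId": "i-hypothetical",
    "State": {"Code": 16, "Name": "running"},
    "PrivateIpAddress": "10.0.0.5",
}

kept = live_instances([terminated, running])
```

Either approach on its own would avoid the `KeyError`: skipping terminated instances, or treating the address as optional.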

milliams added the bug (Something isn't working) and AWS labels on Jul 9, 2021
milliams (Member) commented Jul 9, 2021

This should not cause any issues. The watchdog's job is to reconcile the state between Slurm and AWS. Currently it only keeps track of things and has not yet learned to correct any issues. You are safe to disable and stop the service.
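As a rough illustration of what "reconciling" two views of the cluster can mean, here is a sketch that compares node names known to the scheduler with node names known to the cloud provider. It is not the watchdog's actual code: the function `reconcile` and the node name `ephemeron-t3-large-0001` are hypothetical, and only `ephemeron-t3-large-0003` comes from this issue.

```python
from typing import Dict, Iterable, List


def reconcile(slurm_nodes: Iterable[str], cloud_nodes: Iterable[str]) -> Dict[str, List[str]]:
    """Report node names that appear on only one side."""
    slurm = set(slurm_nodes)
    cloud = set(cloud_nodes)
    return {
        # The scheduler thinks these exist, but no cloud instance was found.
        "only_in_slurm": sorted(slurm - cloud),
        # Cloud instances the scheduler does not know about.
        "only_in_cloud": sorted(cloud - slurm),
    }


report = reconcile(
    slurm_nodes=["ephemeron-t3-large-0001"],
    cloud_nodes=["ephemeron-t3-large-0001", "ephemeron-t3-large-0003"],
)
```

A tracking-only watchdog would just log such a report; correcting the mismatches would be a separate step.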

Are you able to submit jobs and have them start as you expect? If so then this problem can be ignored. If not, then this points towards an issue. It may be that you have some VMs that it's finding, trying to track and failing.

As to the cause of the problem: it seems that when talking to the API, the watchdog is not getting back one of the fields that it expects. I will look into this later.
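One way to narrow this down is to check each response for the fields the parser relies on and report which are absent. This is only a sketch: the `REQUIRED_FIELDS` tuple is an assumption about what the parser needs (the traceback confirms only `PrivateIpAddress`), and `missing_fields` is a hypothetical helper.

```python
from typing import List, Sequence

# Assumed set of required keys; only PrivateIpAddress is confirmed by the traceback.
REQUIRED_FIELDS = ("InstanceId", "PrivateIpAddress", "Tags")


def missing_fields(response: dict, required: Sequence[str] = REQUIRED_FIELDS) -> List[str]:
    """Return the required keys that the response does not contain."""
    return [name for name in required if name not in response]


# Trimmed copy of the response from this issue.
response = {
    "InstanceId": "i-0dbee8bd641226ae8",
    "State": {"Code": 48, "Name": "terminated"},
    "Tags": [{"Key": "Name", "Value": "ephemeron-t3-large-0003"}],
}

absent = missing_fields(response)
```

Logging the missing keys together with the instance state would show whether the gap only occurs for terminated instances.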

willfurnass (Author) commented

Thanks Matt. I'll disable the svc for now to keep the system log cleaner. I have been having some issues starting nodes, which I thought could be related to this but have just realised I've hit my AWS instance limit for my chosen instance type. Doh!

Looks like the API response doesn't contain any NetworkInterfaces info (the list is empty), which seems odd.
