citc-watchdog service repeatedly restarting #41

willfurnass · 2021-07-08T19:13:44Z

Jul  8 19:10:38 mgmt watchdog[328075]:  File "/opt/cloud_sdk/bin/watchdog", line 8, in <module>
Jul  8 19:10:38 mgmt watchdog[328075]:    sys.exit(main())
Jul  8 19:10:38 mgmt watchdog[328075]:  File "/opt/cloud_sdk/lib64/python3.8/site-packages/citc/watchdog.py", line 89, in main
Jul  8 19:10:38 mgmt watchdog[328075]:    cloud_nodes = utils.get_cloud_nodes()
Jul  8 19:10:38 mgmt watchdog[328075]:  File "/opt/cloud_sdk/lib64/python3.8/site-packages/citc/utils.py", line 27, in get_cloud_nodes
Jul  8 19:10:38 mgmt watchdog[328075]:    cloud_nodes = aws.AwsNode.all(ec2, nodespace)
Jul  8 19:10:38 mgmt watchdog[328075]:  File "/opt/cloud_sdk/lib64/python3.8/site-packages/citc/aws.py", line 93, in all
Jul  8 19:10:38 mgmt watchdog[328075]:    return [cls.from_response(instance) for instance in instances]
Jul  8 19:10:38 mgmt watchdog[328075]:  File "/opt/cloud_sdk/lib64/python3.8/site-packages/citc/aws.py", line 93, in <listcomp>
Jul  8 19:10:38 mgmt watchdog[328075]:    return [cls.from_response(instance) for instance in instances]
Jul  8 19:10:38 mgmt watchdog[328075]:  File "/opt/cloud_sdk/lib64/python3.8/site-packages/citc/aws.py", line 76, in from_response
Jul  8 19:10:38 mgmt watchdog[328075]:    ip = response["PrivateIpAddress"]
Jul  8 19:10:38 mgmt watchdog[328075]: KeyError: 'PrivateIpAddress'

The response dictionary passed to from_response doesn't contain a PrivateIpAddress key; the dictionary is as follows:

{'AmiLaunchIndex': 0, 'ImageId': 'ami-035ed6bae06963c37', 'InstanceId': 'i-0dbee8bd641226ae8', 'InstanceType': 't3.large', 'KeyName': 'ec2-user-ephemeron', 'LaunchTime': datetime.datetime(2021, 7, 8, 18, 12, 56, tzinfo=tzlocal()), 'Monitoring': {'State': 'disabled'}, 'Placement': {'AvailabilityZone': 'eu-west-1a', 'GroupName': '', 'Tenancy': 'default'}, 'PrivateDnsName': '', 'ProductCodes': [], 'PublicDnsName': '', 'State': {'Code': 48, 'Name': 'terminated'}, 'StateTransitionReason': 'User initiated (2021-07-08 18:35:05 GMT)', 'Architecture': 'x86_64', 'BlockDeviceMappings': [], 'ClientToken': '314e0316-4fad-4d66-9b0b-918590eab1de', 'EbsOptimized': False, 'EnaSupport': True, 'Hypervisor': 'xen', 'NetworkInterfaces': [], 'RootDeviceName': '/dev/sda1', 'RootDeviceType': 'ebs', 'SecurityGroups': [], 'StateReason': {'Code': 'Client.UserInitiatedShutdown', 'Message': 'Client.UserInitiatedShutdown: User initiated shutdown'}, 'Tags': [{'Key': 'Name', 'Value': 'ephemeron-t3-large-0003'}, {'Key': 'type', 'Value': 'compute'}, {'Key': 'cluster', 'Value': 'ephemeron'}], 'VirtualizationType': 'hvm', 'CpuOptions': {'CoreCount': 1, 'ThreadsPerCore': 2}, 'CapacityReservationSpecification': {'CapacityReservationPreference': 'open'}, 'HibernationOptions': {'Configured': False}, 'MetadataOptions': {'State': 'pending', 'HttpTokens': 'optional', 'HttpPutResponseHopLimit': 1, 'HttpEndpoint': 'enabled'}, 'EnclaveOptions': {'Enabled': False}}

@milliams Any thoughts on this? Could it cause problems? I'm hoping to use CITC for teaching next week :)

(EDIT: line numbers for aws.py in the backtrace are slightly out due to some print calls I've added)
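For illustration, the failure mode can be reproduced without AWS: the instance in the dictionary above is in the `terminated` state and carries no `PrivateIpAddress`, so indexing with `[...]` raises `KeyError`. Below is a minimal sketch of a tolerant lookup. The helper name `private_ip_or_none` and the `running` example (including its IP address) are hypothetical and not part of citc; this is not necessarily how the project handles it.

```python
from typing import Any, Dict, Optional


def private_ip_or_none(instance: Dict[str, Any]) -> Optional[str]:
    """Return the private IP if the API included one, otherwise None.

    Hypothetical helper for illustration; not part of citc.
    """
    # dict.get returns None for a missing key instead of raising KeyError.
    return instance.get("PrivateIpAddress")


# Trimmed copy of the response from this issue (terminated instance).
terminated = {
    "InstanceId": "i-0dbee8bd641226ae8",
    "State": {"Code": 48, "Name": "terminated"},
    "NetworkInterfaces": [],
}

# Made-up running instance for comparison; the ID and IP are assumptions.
running = {
    "InstanceId": "i-hypothetical",
    "State": {"Code": 16, "Name": "running"},
    "PrivateIpAddress": "10.0.0.12",
}

for inst in (terminated, running):
    print(inst["InstanceId"], inst["State"]["Name"], private_ip_or_none(inst))
```

Whether `None` is an acceptable value downstream depends on how the rest of the watchdog uses the address, which this thread does not show.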


milliams · 2021-07-09T12:45:23Z

This should not cause any issues. The watchdog's job is to reconcile the state between Slurm and AWS. Currently it only keeps track of things and has not yet learned to correct any issues. You are safe to disable and stop the service.

Are you able to submit jobs and have them start as you expect? If so, this problem can be ignored. If not, that points towards an issue: it may be that the watchdog is finding some VMs, trying to track them, and failing.

As to the cause of the problem: it seems that when talking to the API, the watchdog is not getting back one of the fields that it expects. I will look into this later.

willfurnass · 2021-07-09T13:13:46Z

Thanks Matt. I'll disable the svc for now to keep the system log cleaner. I have been having some issues starting nodes, which I thought could be related to this but have just realised I've hit my AWS instance limit for my chosen instance type. Doh!

Looks like the API response doesn't contain any NetworkInterface info, which seems odd.
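One possible workaround, sketched here under the assumption that the missing networking fields are a consequence of the instance being `terminated` (the `State` in the response suggests this, but the thread does not confirm it), is to drop such instances before parsing them. The names `SKIPPED_STATES` and `trackable` and the second example instance are hypothetical, not citc code.

```python
from typing import Any, Dict, Iterable, List

# Assumption: only terminated instances lack networking details.
SKIPPED_STATES = frozenset({"terminated"})


def trackable(instances: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only instances whose state is not in SKIPPED_STATES."""
    return [
        inst
        for inst in instances
        if inst.get("State", {}).get("Name") not in SKIPPED_STATES
    ]


instances = [
    # Trimmed copy of the instance reported in this issue.
    {"InstanceId": "i-0dbee8bd641226ae8",
     "State": {"Code": 48, "Name": "terminated"}},
    # Made-up running instance; the ID and IP are assumptions.
    {"InstanceId": "i-hypothetical",
     "State": {"Code": 16, "Name": "running"},
     "PrivateIpAddress": "10.0.0.12"},
]

kept = trackable(instances)
print([inst["InstanceId"] for inst in kept])
```

The same effect could probably be had server-side with the EC2 `instance-state-name` filter on `describe_instances`, which would avoid fetching terminated instances at all.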

milliams added the bug and AWS labels · Jul 9, 2021
