Castellum's API looks like a conventional OpenStack REST API.
- Castellum's service URL can be found in the Keystone service catalog under the service type
castellum
. - All endpoints require a Keystone token to be present in the
X-Auth-Token
header. Only Keystone v3 is supported. - All timestamps are formatted as UNIX timestamps, i.e. seconds since the UNIX epoch.
- All percent values are floating-point numbers, although we only show integers in the examples below.
This document uses the terminology defined in the README.md.
- GET /v1/projects/:id
- GET /v1/projects/:id/resources/:type
- PUT /v1/projects/:id/resources/:type
- DELETE /v1/projects/:id/resources/:type
- GET /v1/projects/:id/assets/:type
- GET /v1/projects/:id/assets/:type/:id
- POST /v1/projects/:id/assets/:type/:id/error-resolved
- GET /v1/projects/:id/resources/:type/operations/pending
- GET /v1/projects/:id/resources/:type/operations/recently-failed
- GET /v1/projects/:id/resources/:type/operations/recently-succeeded
- GET /v1/operations/pending
- GET /v1/operations/recently-failed
- GET /v1/operations/recently-succeeded
- GET /v1/admin/resource-scrape-errors
- GET /v1/admin/asset-scrape-errors
- GET /v1/admin/asset-resize-errors
Shows information about which resources are configured for this project.
Returns 200
and a JSON response body like this:
{
"resources": {
"nfs-shares": {
"checked": {
"error": "cannot connect to OpenStack"
},
"asset_count": 42,
"config": { ... },
"low_threshold": {
"usage_percent": 20,
"delay_seconds": 3600
},
"high_threshold": {
"usage_percent": 80,
"delay_seconds": 1800
},
"critical_threshold": {
"usage_percent": 95
},
"size_constraints": {
"minimum": 10,
"maximum": 2000,
},
"size_steps": {
"percent": 20
}
},
...
}
}
The following fields may be returned:
Field | Type | Explanation |
---|---|---|
resources.$type |
object | Configuration for a project resource. Resources will only be shown when a) autoscaling is enabled for them and b) the requester has sufficient permissions to read them. |
resources.$type.checked.error |
string | Readonly. If the last attempt to scan this resource for new or deleted assets failed, this field contains the error message that was returned from the backend. |
resources.$type.asset_count |
integer | Readonly. The number of assets in this resource. |
resources.$type.config |
object or null | Type-specific configuration for this resource. Most resources don't take configuration here, in which case this field will be missing. Refer to the asset manager's documentation for whether a resource accepts or requires configuration. |
resources.$type.low_threshold resources.$type.high_threshold resources.$type.critical_threshold |
object | Configuration for thresholds that trigger an automated resize operation. Any of these may be missing if the threshold in question has not been enabled. |
resources.$type.low_threshold.usage_percent resources.$type.high_threshold.usage_percent resources.$type.critical_threshold.usage_percent |
float or object | Automated operations will be triggered when usage crosses these thresholds, i.e. usage <= threshold for the low threshold and usage >= threshold for the high and critical thresholds. |
resources.$type.low_threshold.delay_seconds resources.$type.high_threshold.delay_seconds |
integer | How long usage must cross the threshold before the operation is confirmed. Critical operations don't have a delay; they are always confirmed immediately. |
resources.$type.size_constraints.minimum resources.$type.size_constraints.maximum |
integer | If set, resize operations will only be scheduled when the target size fits into these constraints. |
resources.$type.size_constraints.minimum_free |
integer | If set, downsize operations will be inhibited and upsize operations will be scheduled to ensure that size - absoluteUsage is always >= this value. |
resources.$type.size_constraints.minimum_free_is_critical |
boolean | When true, upsize operations forced by violating the minimum free space constraint will be confirmed without delay. |
resources.$type.size_steps.percent |
float | Step size for percentage-step resizing. See below for details. |
resources.$type.size_steps.single |
boolean | When true, use single-step resizing. See below for details. |
There are two mutually-exclusive ways in which resize steps (the size change in a resize operation) can be calculated.
- Percentage-step resizing: Enabled by setting the
size_steps.percent
attribute on the resource to a number. For each resize operation, the size change is that many percent of the previous size, except when constraints permit only a partial step. Exceptions:- When the critical threshold has been crossed, percentage-step resizing will take multiple steps at once if this is necessary to leave the critical threshold.
- Single-step resizing: Enabled by setting
size_steps.single
to true. For each resize operation, the size change is the smallest step that moves usage back into normal areas, except when constraints permit only a partial step.- When the critical threshold has been crossed, and a high threshold is also configured, single-step resizing will calculate a new size that also leaves the high threshold.
Single-step resizing is a good idea when usage changes infrequently, but possibly in large steps at once (e.g. for project quota). But when usage changes constantly (e.g. for an NFS share that gets written to constantly), single-step resizing could lead to a fast succession of tiny size changes instead of a single large step. In these cases, percentage-step resizing is recommended.
Most resources only have a single usage metric, so all fields called usage_percent
will show a single floating-point number value. However, some resources can have multiple usage metrics. For example, a group of servers has both CPU usage and RAM usage. In such a case, fields called usage_percent
will show an object mapping usage metrics to their respective usage values:
{
"in_a_regular_resource": {
"usage_percent": 26.5
},
"in_a_multi_usage_resource": {
"usage_percent": {
"cpu": 26.5,
"ram": 89.4
}
}
}
This is why usage_percent
fields are listed throughout this spec with a data type of "float or object".
Shows information about an individual project resource.
Returns 404
if autoscaling is not enabled for this resource.
Otherwise returns 200
and a JSON response body like this:
{
"checked": {
"error": "cannot connect to OpenStack"
},
"low_threshold": {
"usage_percent": 20,
"delay_seconds": 3600
},
"high_threshold": {
"usage_percent": 80,
"delay_seconds": 1800
},
"critical_threshold": {
"usage_percent": 95
},
"size_constraints": {
"minimum": 10,
"maximum": 2000,
},
"size_steps": {
"percent": 20
}
}
The response body is always identical to the resources.$type
subobject
in the response of GET /v1/projects/:id
. See above for explanations of all
fields.
Enables autoscaling on the specified project resource. The request body must be a JSON document following the same schema as the response from the corresponding GET endpoint, except that the following fields may not be present:
checked
asset_count
Returns 202 and an empty response body on success.
Disables autoscaling on the specified project resource. This also deletes the operation logs for all assets in this project resource. Returns 204 and an empty response body on success.
Shows a list of all known assets in a project resource.
Returns 404
if autoscaling is not enabled for this resource.
Otherwise returns 200
and a JSON response body like this:
{
"assets": [
{
"id": "2535fd62-c30a-4241-8c67-a12c4fba98ad",
"size": 100,
"usage_percent": 42,
"stale": true
},
{
"id": "acc137e0-ac0f-43c4-a3de-ba728c0091fd",
"size": 1000,
"usage_percent": 91,
"checked": {
"error": "cannot connect to OpenStack"
},
"stale": false
},
{
"id": "b73894be-bcbc-44f0-b7e2-29b758c06ce9",
"size": 20,
"min_size": 10,
"max_size": 20480,
"usage_percent": 60,
"stale": false
}
]
}
For each asset, the following fields may be returned:
Field | Type | Explanation |
---|---|---|
id |
string | UUID of asset. |
size |
integer | Size of asset. The unit depends on the asset type. Refer to the asset managers' documentation for more information. |
usage_percent |
float or object | Usage of asset as percentage of size. When the asset has multiple usage types (e.g. instances have both CPU usage and RAM usage), usually the higher value is reported here. |
min_size max_size |
integer | Size value that the asset may not undercut or exceed, respectively. (Assets that cross these boundaries will be automatically upsized or downsized, respectively.) These fields are only shown when there are hidden size constraints on the infrastructure level that the usage_percent number cannot adequately communicate on its own. |
checked.error |
string | If the last attempt by Castellum to retrieve the size and usage of the asset failed, this field contains the error message that was returned from the backend. |
stale |
bool | This flag is set by Castellum after a resize operation to indicate that the reported size and usage are probably not accurate anymore. Will be cleared by the next scrape. |
When no scrape ever succeeded (e.g. because the asset is in an error state since creation), the fields size
and
usage_percent
will be missing. The checked.error
field will always be present in this case.
Shows information about a certain asset.
Returns 404
if the asset does not exist in the selected project, or if autoscaling is not
enabled for the selected project resource.
Otherwise returns 200
and a JSON response body like this:
{
"id": "acc137e0-ac0f-43c4-a3de-ba728c0091fd",
"size": 1000,
"usage_percent": 91,
"checked": {
"error": "cannot connect to OpenStack"
},
"stale": false,
"pending_operation": {
"state": "greenlit",
"reason": "high",
"old_size": 1000,
"new_size": 1200,
"created": {
"at": 1557122400,
"usage_percent": 81
},
"confirmed": {
"at": 1557126000
},
"greenlit": {
"at": 1557129600,
"by_user": "980a0405-c6d2-410a-bca1-a3b109aaabc0"
}
},
"finished_operations": [
{
"state": "cancelled",
"reason": "high",
"old_size": 1000,
"new_size": 1200,
"created": {
"at": 1557036000,
"usage_percent": 83
},
"finished": {
"at": 1557037800
}
},
...
]
}
Most fields on the top level have the same meaning as for GET /v1/projects/:id/assets/:type
.
The following additional fields may be returned:
Field | Type | Explanation |
---|---|---|
pending_operation |
object | Information about an automated resize operation that is currently in progress. If there is no resize operation ongoing, this field will be omitted. |
finished_operations |
array of objects | Information about earlier automated resize operations. This field is only shown on request because it may be quite large. Add the query parameter ?history to see it. |
finished_operations[].finished.at |
timestamp | When the operation entered its final state. |
finished_operations[].finished.error |
string | The backend error that caused this operation to fail. Only present when state is failed or errored . |
The following fields may be returned for each operation, both below pending_operation
and below finished_operations[]
:
Field | Type | Explanation |
---|---|---|
.state |
string | The current state of this operation. For pending operations, this is one of "created", "confirmed" or "greenlit". For finished operations, this is one of "cancelled", "succeeded", "failed" or "errored". See README.md for details. |
.reason |
string | One of "low", "high" or "critical". Identifies which threshold being crossed triggered this operation. |
.old_size |
integer | The asset's size before this resize operation. |
.new_size |
integer | The (projected) asset's size after the successful completion of the resize operation. |
.created.at |
timestamp | When Castellum first observed the asset's usage crossing a threshold. |
.created.usage_percent |
float or object | The asset's usage at that time. |
.confirmed.at |
timestamp | When Castellum confirmed that usage had crossed the threshold for at least the required delay. When reason is critical , this timestamp will be identical to .created.at . For operations in state created , this field is not shown. For operations in state cancelled , this field may or may not be shown. |
.greenlit.at |
timestamp | When a user permitted this operation to go ahead. For operations not subject to operator approval, this is equal to .greenlit.at . For operations in states created , confirmed or cancelled , this field is not shown. As an exception, for operations in state confirmed , this timestamp may be in the future, which means that the operation will automatically move into state greenlit at that point in time. |
.greenlit.by_user |
string | The UUID of the user that greenlit this operation. For operations in states created or confirmed , this field is not shown. For operations not subject to operator approval, this field is not shown. |
The previous table contains a lot of rules like "this field is not shown for operations in state X". When this is confusing to you, have a look at the state machine diagram in README.md. The reason why many fields are optional is that they only have values when the respective state was entered in the operation's lifecycle.
This endpoint allows manually marking the latest errored operation of a specified asset as resolved. This will remove the "resize errored" alert associated with the asset which otherwise only disappears if a later operation on the same asset succeeds.
Returns 404
if the project or asset is not found.
Returns 409
if the last operation of the asset is not errored
.
Otherwise returns 200
.
For backwards-compatibility, these three endpoints are respectively synonymous to:
GET /v1/operations/pending?project=$id&asset-type=$type
GET /v1/operations/recently-failed?project=$id&asset-type=$type
GET /v1/operations/recently-succeeded?project=$id&asset-type=$type
Shows information about all pending operations for assets in resources accessible to the authenticated user.
The following query parameters can be given to filter the result:
- When
project
is given, it is interpreted as a project ID. Only resources in that project will be considered. - When
domain
is given, it is interpreted as a domain ID. Only resources in project in that domain will be considered. - When
asset-type
is given, only resources with this asset type are considered.
Returns 404
if autoscaling is not enabled for any resource matching the query.
Otherwise returns 200
and a JSON response body like this:
{
"pending_operations": [
{
"project_id": "0181e612-fcad-438d-a1a4-2a21fc0a2442",
"asset_type": "nfs-shares",
"asset_id": "acc137e0-ac0f-43c4-a3de-ba728c0091fd",
"state": "greenlit",
"reason": "high",
"old_size": 1000,
"new_size": 1200,
"created": {
"at": 1557122400,
"usage_percent": 81
},
"confirmed": {
"at": 1557126000
},
"greenlit": {
"at": 1557129600,
"by_user": "980a0405-c6d2-410a-bca1-a3b109aaabc0"
}
},
...
]
}
Each pending operation has the same format as the .pending_operation
field in the JSON response returned by
GET /v1/projects/:id/assets/:type/:id
(see above), except for the following additional fields:
asset_id
indicates which asset is being worked on.project_id
andasset_type
identify the resource to which this asset belongs.
For each asset, at most one pending operation will be listed.
Shows information about all operations on assets in resources accessible to the authenticated user that recently failed. This is intended to give operators a view of all assets in a resource where manual intervention may be required because autoscaling is not working properly. Recent failure is defined as follows:
- There is an operation in state "failed" or "errored".
- There is no newer operation for the same asset which finished after the failed operation.
- The asset is still eligible for resizing for the same reason as stated in the failed operation.
Point 2 ensures that we don't see failures where the next attempt succeeded. Point 3 ensures that we don't see failures for assets where the usage returned to normal levels in the meantime.
The query parameters domain
, project
and asset-type
are recognized with the same semantics as for
GET /v1/operations/pending
.
Returns 404
if autoscaling is not enabled for this resource.
Otherwise returns 200
and a JSON response body like this:
{
"recently_failed_operations": [
{
"project_id": "0181e612-fcad-438d-a1a4-2a21fc0a2442",
"asset_type": "nfs-shares",
"asset_id": "acc137e0-ac0f-43c4-a3de-ba728c0091fd",
"state": "failed",
"reason": "high",
"old_size": 1000,
"new_size": 1200,
"created": {
"at": 1557122400,
"usage_percent": 81
},
"confirmed": {
"at": 1557126000
},
"greenlit": {
"at": 1557129600,
"by_user": "980a0405-c6d2-410a-bca1-a3b109aaabc0"
},
"finished": {
"at": 1557137800,
"error": "datacenter is on fire"
}
},
...
]
}
Each failed operation has the same format as the entries in the .finished_operations
field in the JSON response
returned by GET /v1/projects/:id/assets/:type/:id
(see above), except for the following additional fields:
asset_id
indicates which asset is being worked on.project_id
andasset_type
identify the resource to which this asset belongs.
For each asset, at most one failed or errored operation will be listed (the most recent one).
Shows information about all operations on assets in resources accessible to the authenticated user that recently succeeded, that is, all operations in state "succeeded" where there is no newer operation in state "succeeded", "failed" or "errored" for the same asset.
Returns 404
if autoscaling is not enabled for this resource.
Otherwise returns 200
and a JSON response body looking like that from the recently-failed
endpoint above, except that
recently_failed_operations
is called recently_succeeded_operations
.
The query parameters domain
, project
and asset-type
are recognized with the same semantics as for
GET /v1/operations/pending
. The following additional query parameters can be given to filter the result further:
- When
max-age
is given, only those operations will be shown that finished afternow - max_age
. The value must be an integer followed by one of the unitsm
(minute),h
(hour) ord
(day), e.g.12h
or7d
. The default value is1d
.
Shows information about resource scrape errors. This is intended to give operators a view of all scrape errors across all resources.
Returns 200
on success and a JSON response body like this:
{
"resource_scrape_errors": [
{
"asset_type": "nfs-shares",
"checked": {
"error": "cannot connect to OpenStack"
},
"domain_id": "481b2af2-d816-4453-8743-a05382e7d1ce",
"project_id": "0181e612-fcad-438d-a1a4-2a21fc0a2442"
},
{
"asset_type": "foo",
"checked": {
"error": "datacenter is on fire"
},
"domain_id": "481b2af2-d816-4453-8743-a05382e7d1ce",
"project_id": "89b76fc7-78fa-454c-b23b-674bd7589390"
}
]
}
Most fields on the top level have the same meaning as for GET /v1/projects/:id/resources/:type
(see above), except for the following
additional fields:
asset_type
indicates which type of assets belong to this resource.project_id
anddomain_id
identify the resource.project_id
is only shown for non-domain resources.
For each resource, at most one error will be listed (the most recent one).
Shows information about asset scrape errors. This is intended to give operators a view of scrape errors for all assets across all resources.
Returns 200
on success and a JSON response body like this:
{
"asset_scrape_errors": [
{
"asset_id": "c991eb08-e14e-4559-94d6-c9c390c18776",
"asset_type": "nfs-shares",
"checked": {
"error": "cannot connect to OpenStack"
},
"domain_id": "481b2af2-d816-4453-8743-a05382e7d1ce",
"project_id": "89b76fc7-78fa-454c-b23b-674bd7589390"
}
]
}
Most fields on the top level have the same meaning as for GET /v1/projects/:id/assets/:type
(see above), except for the following additional
fields:
asset_id
identifies the concerning asset.asset_type
,project_id
anddomain_id
identify the resource to which this asset belongs.project_id
is only shown for non-domain resources.
For each asset, at most one error will be listed (the most recent one).
Shows information about asset resize errors. This is intended to give operators a view of resize errors for all assets across all resources.
Returns 200
on success and a JSON response body like this:
{
"asset_resize_errors": [
{
"asset_id": "c991eb08-e14e-4559-94d6-c9c390c18776",
"asset_type": "nfs-shares",
"domain_id": "481b2af2-d816-4453-8743-a05382e7d1ce",
"finished": {
"at": 1557144789,
"error": "datacenter is on fire"
},
"new_size": 1025,
"old_size": 1024,
"project_id": "89b76fc7-78fa-454c-b23b-674bd7589390"
}
]
}
Most fields on the top level have the same meaning as for GET /v1/projects/:id/assets/:type/:id
(see above), except for the following
additional fields:
asset_id
identifies the concerning asset.asset_type
,project_id
anddomain_id
identify the resource to which this asset belongs.project_id
is only shown for non-domain resources.
For each asset, at most one error will be listed (the most recent one).