- Attach stack name to CPU, Memory and Worker errors alarms
- Port watchbot scaling logic from 9.x to scale down based on total messages (visible and not visible)
- Expose ephemeralStorageGiB param to customize disk size up to 200 GB (default 20 GB)
- Bump aws-cdk-lib and cdk-monitoring-constructs to update type definitions
- Add a CloudFormation output for the watchbot SNS topic
- Fix monitoring resource naming conflicts when used in multiple stacks
- Add support for providing a custom cluster, enabling usage in default account
- This library is now distributed as a CDK construct with almost identical functionality. Refer to UPGRADING_TO_V10 for migration details.
- Bug fix: TotalMessagesLambda now works as expected; watchbot stacks now scale down when there are no tasks in the queue
- v9.0.0 binaries removed from S3 given the cost associated with the bug in the v9.0.0 release
- Move from aws-sdk v2 to v3
- Breaking change: use Node 18 for lambdas and binaries going forward
- Fix tests that were broken in the 8.0.0 release for some reason
- Force artifacts to use Node 14 now that it defaults to 16
- Update @mapbox/watchbot-progress to address vulnerabilities
- Add support for JSON structured logging via the 'structuredLogging' boolean option (defaults to off)
- Breaking change: use Node 14 for lambdas and binaries going forward
- Breaking change: use Node 12 for lambdas and binaries going forward
- Fix bug with `watchbot-dead-letter` command's fetching of recent logs (#350)
- Fix dashboard widget for showing running/desired/pending tasks (#351)
- Fix issue causing task scaledown on deployment (#349)
- Adds `fargatePublicIp` option. Its default value is `'DISABLED'`, so default behavior is no different than in 6.0.0, but now the property is adjustable.
- Adds the `capacity`, `fargateSecurityGroups`, and `fargateSubnets` options, which can be used to run tasks on Fargate or Fargate Spot capacity instead of EC2, which was the only option before and remains the default.
  - Switching `capacity` values can be disruptive. Switching between `EC2` and `FARGATE` or between `FARGATE` and `FARGATE_SPOT` will cause the ECS service to be replaced during the CloudFormation update: a new service will be created, then the old service will be deleted. You cannot switch directly between `EC2` and `FARGATE_SPOT` without deleting and re-creating the CloudFormation stack, but you can make the transition through a multi-step deployment of updates: first change `EC2` to `FARGATE`, then change `FARGATE` to `FARGATE_SPOT`, or vice versa.
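
The multi-step transition between `capacity` values can be sketched as a walk along a chain. This is only an illustration: `CAPACITY_CHAIN` and `capacityUpdatePath` are hypothetical names, not part of watchbot, and each adjacent pair in the returned path stands for one CloudFormation update.

```javascript
// Hypothetical helper, not part of watchbot. In-place updates are only
// possible between neighbours in this chain: EC2 <-> FARGATE <-> FARGATE_SPOT.
const CAPACITY_CHAIN = ['EC2', 'FARGATE', 'FARGATE_SPOT'];

// Returns the ordered list of capacity values to deploy, starting at `from`
// and ending at `to`. Each consecutive pair is one stack update.
function capacityUpdatePath(from, to) {
  const start = CAPACITY_CHAIN.indexOf(from);
  const end = CAPACITY_CHAIN.indexOf(to);
  if (start === -1 || end === -1) {
    throw new Error(`Unknown capacity value: ${start === -1 ? from : to}`);
  }
  const step = start <= end ? 1 : -1;
  const path = [];
  for (let i = start; i !== end + step; i += step) {
    path.push(CAPACITY_CHAIN[i]);
  }
  return path;
}

console.log(capacityUpdatePath('EC2', 'FARGATE_SPOT').join(' -> '));
// prints "EC2 -> FARGATE -> FARGATE_SPOT"
```
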
- If you set the option `reservations.cpu` lower than `128`, that will no longer be raised up to `128` in the output template. This was never done before if your `reservations.cpu` value was a CloudFormation intrinsic function.
- Sets `PropagateTags` to `TASK_DEFINITION` on the ECS Service. If you are on the old ARN format and the AWS account you are in has opted into the new format, this version moves you to the new ARN format by replacing your current service with a new one.
  - ECS Service
    - Old: arn:aws:ecs:region:account-id:service/service-name
    - New: arn:aws:ecs:region:account-id:service/cluster-name/service-name
  - ECS Task
    - Old: arn:aws:ecs:region:account-id:task/task-id
    - New: arn:aws:ecs:region:account-id:task/cluster-name/task-id
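
For tooling that stored old-format ARNs, the difference between the two formats can be illustrated with a small string rewrite. This is a sketch only: `toNewArnFormat` is a hypothetical helper, not something watchbot provides, and it assumes you already know the cluster name.

```javascript
// Hypothetical helper, not part of watchbot: rewrites an old-format ECS
// service or task ARN into the new format by inserting the cluster name.
function toNewArnFormat(arn, clusterName) {
  // Old format has exactly one path segment after "service/" or "task/".
  const match = /^(arn:[^:]+:ecs:[^:]*:[^:]*:(?:service|task))\/([^/]+)$/.exec(arn);
  if (!match) {
    // Already in the new format, or not a service/task ARN: leave untouched.
    return arn;
  }
  return `${match[1]}/${clusterName}/${match[2]}`;
}

console.log(toNewArnFormat('arn:aws:ecs:us-east-1:123456789012:service/my-service', 'my-cluster'));
// prints "arn:aws:ecs:us-east-1:123456789012:service/my-cluster/my-service"
```
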
- Using `InChina` instead of `NotInChina` CloudFormation condition: #329
- Fixes for alpine binary: https://github.com/mapbox/ecs-watchbot/pull/328/files
- Use native code pipeline with alpine-specific target instead of all targets
- Upgrades from Node 8 to Node 10: #325
- Applies lint fixes and updates dependencies by several major versions: #325
- Adds us-west-1 support: #323
- Adds cn-northwest-1 support: #322
- Metrics: Adds (approximate) response duration custom metric.
- FIX: Missing properties/type key in CW dashboard change.
- Dashboard: Add queue oldest-message wait time, worker duration to CW dashboard
- Bump js-yaml from 3.12.0 to 3.13.1 for security reasons
- Small Readme improvements (thanks @nickcordella and @ScottBrenner)
- Changes the Lambda functions a watchbot stack creates to use the node 8 runtime
- Modifies CloudWatch alarm names to include the AWS region.
- Make `Family` property optional in docs
- Upgrade @mapbox/watchbot-progress dependencies
- Fixes a regression from v4.13.0 that resulted in an invalid IAM role. #297
- Fixes behavior when a worker exits with code `3`: now a notification will be triggered, as the documentation states.
- Adds support for first-in-first-out (FIFO) SQS queues. #279
- Add `options.deadletterAlarm` (default=true) to allow disabling the alarm resource for dead letter queue messages #288
- Re-introduce `WorkerDuration` and `MessageReceives` metrics (removed since v4)
- Remove CPU Alarm: #282
- Minimum CPU value for watchbot container is now 128
- Create a new metric of `TotalMessages` to prevent accidental scaledown: #267
- Hardcode messageTimeout: #264
- Now builds binaries for alpine linux: #266
- Only add the /tmp mount if it isn't already there: #262
- Create binaries when tags are added manually too: #259
- Add reduce mode functionality to version 4: #221
- Fix scaling compatibility with cn-north-1 #257
- Compatibility with cn-north-1 #251
- Custom CloudFormation resource for watchbot service scaling. Allows maxSize to be parameterized within a template: #249
- Fix undefined this.message within setInterval: #250
- Add a code-pipeline stack for auto-generating watchbot binaries: #235
- Add dead letter queue: #220
- Prefix the dashboard names: #245
- Prefix the alarm names: #244
- Modify logging to prefix all worker logs with `[worker]`: #225
- Add maxJobDuration and a heartbeat for message timeout: #230
- See 4.5
- Allow writable file system: #239
- Remove node 8 engine requirement: #237
- Only expose `./lib/template` through `index.ts` so people can run node 6 locally: #236
- Change `fresh` mode to `writableFilesystem` mode: #234
- Add CPUUtilization and MemoryUtilization alarms: #231
- Remove watchbot-log binary: #227
- Use stackName in the `Name` property of the ContainerDefinition: #226
- Add cloudwatch dashboard: #222
- Major revamp of watchbot internals. (refs #184). The system now:
  - Relies on an ECS service for scaling
  - Provides users metrics on cpu and memory utilization of all containers
  - Re-uses the same containers to process multiple jobs, reducing overhead
- Clearer error messages from the CLI tool for bad user input.
- Adds a log message if the watcher receives an SQS message that it has already launched a task for, and is still waiting to learn whether that task succeeded or failed.
- Upon receiving a duplicate message, the watcher checks if the in-flight task is in `PENDING` state. If so, it stops the task and returns the message to SQS for a retry.
- Fixes `DeadLetterAlarm` thresholding: changes `ComparisonOperator` from `GreaterThanThreshold` to `GreaterThanOrEqualToThreshold` so that the alarm is triggered when a single message is sent to the DeadLetterQueue.
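
Why the operator matters can be shown with a simplified model of the comparison. This is a sketch, not CloudWatch's implementation: the `breaches` helper is hypothetical, and the threshold of 1 is an assumption, since the entry above does not state the alarm's actual threshold.

```javascript
// Simplified, hypothetical model of how an alarm compares a datapoint
// against its threshold for the two operators named above.
const comparisons = {
  GreaterThanThreshold: (value, threshold) => value > threshold,
  GreaterThanOrEqualToThreshold: (value, threshold) => value >= threshold
};

function breaches(operator, value, threshold) {
  const compare = comparisons[operator];
  if (!compare) throw new Error(`Unsupported operator: ${operator}`);
  return compare(value, threshold);
}

// Assumed threshold of 1, with a single message in the dead letter queue.
const threshold = 1;
const messagesInQueue = 1;
console.log(breaches('GreaterThanThreshold', messagesInQueue, threshold));
// prints false: one message does not trip the alarm
console.log(breaches('GreaterThanOrEqualToThreshold', messagesInQueue, threshold));
// prints true: one message trips the alarm
```
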
- Makes `EvaluationPeriods` for `FailedWorkerPlacementAlarm` customizable
- Adds `.ref.notificationTopic` to the output from `watchbot.template()`
- Adjusts watcher permissions on RunTask so that it can only launch its own Worker tasks.
- Adds a configuration option to specify `placementConstraints` of watchbot's task definitions
- Adds a configuration option to specify a `Family` property of watchbot's task definitions
- Adjust CloudWatch Event Rule names to allow stacks to include multiple sets of watchbot resources
- Adjusts log group names to allow stacks to include multiple sets of watchbot resources
  - LogGroup names are now `${stack-name}-${region}-${prefix}`, where `prefix` defaults to `watchbot` if not otherwise specified.
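
The naming pattern can be illustrated with plain strings. This is a sketch only: `logGroupName` is a hypothetical helper, and the real template presumably assembles the name from CloudFormation references rather than literal values.

```javascript
// Hypothetical illustration of the LogGroup naming pattern described above.
// `prefix` falls back to "watchbot" when it is not specified.
function logGroupName(stackName, region, prefix = 'watchbot') {
  return `${stackName}-${region}-${prefix}`;
}

console.log(logGroupName('my-stack', 'us-east-1'));
// prints "my-stack-us-east-1-watchbot"
console.log(logGroupName('my-stack', 'us-east-1', 'resizer'));
// prints "my-stack-us-east-1-resizer"
```
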
- BREAKING changes to the format with which CloudWatch LogGroups and streams are named. These should be considered breaking changes because upgrading a stack from v2.x to v3.x in-place will result in CloudFormation conflicts. Circumvent the conflicts by manually deleting the existing log group before running the CloudFormation update.
  - LogGroup names are now `${stack-name}-${region}`
  - Streams are now prefixed with `${service-version}` (a GitSha in most cases)
- More permissive engines.node
- Fixes a regression in 2.5.0, allowing watcher containers to launch workers with new family names.
- Task definitions created by Watchbot's `.template(options)` function will now use `options.service` as the task definition's family.
- Upgrade node.js runtime to 4.3 for webhook function
- Add quotes around `$@` operator in the watchbot-progress.sh script to preserve spaces in metadata arguments #142
- Add metric for the amount of time the task spent in `PENDING` state.
- find watchbot-progress's path using `require.resolve` to work with Yarn's flat dependency tree #131
- set ulimit to 10240 in the container definition
- always uses exponential backoff when returning work messages to SQS
- fixes error handling for `Cannot*ContainerError` no-op
- stale messages in the TaskEventQueue will be dropped after 20 minutes
- watcher runs on ubuntu 16.04 LTS
- `CannotStartContainerError`, `CannotPullContainerError` and `DockerTimeoutError` errors do not cause notifications when AlarmOnEveryError is set
- Removes `-event-target` from the ID of the cloudwatch events filter to make it shorter. refs #119
- fixes a bug in the changelog
- consolidates CLI commands into a single `watchbot` command
- adds a CLI command for interacting with the dead letter queue. Note that you cannot use the CLI unless you're working with a 2.1.0+ stack.
- fixes a bug that wouldn't have allowed you to disable exponential backoff
- returns `task.container[n].reason` as `reason` when task finishes, if available
- adds a second SQS queue used for the watcher's internal tracking of CloudWatch task state-change events
- adds ephemeral, or non-persistent, volume compatibility (see AWS's task data volume documentation)
- adds mount point object compatibility for cloudfriend operators, and any other operators that use semicolons and commas
- adds a `worker-capacity` script to estimate how many additional worker tasks can be placed in your service's cluster at its current capacity
- adds CloudWatch metrics for worker errors (non-zero exit codes), failed worker container placement, worker duration, watcher concurrency, and message receive counts
- adds an alarm for number of worker errors in 60s, configurable through `watchbot.template(options).errorThreshold`. Defaults to alarming after 10 failures per minute.
- drops polling of DescribeTasks API to learn when workers are completed
- BREAKING removes cluster resource polling - workers will try to be placed and fail instead of avoiding placement attempts
- BREAKING by default, watchbot no longer sends notification emails each time a worker errors. You can opt-in to this behavior by setting `watchbot.template(options).alarmOnEachFailure: true`.
- BREAKING no longer sends notifications on error interacting with SQS. Instead watchbot silently proceeds.
- BREAKING watcher log format has changed. Now watcher logs print JSON objects
- BREAKING removes `.notifyAfterRetries` option
- BREAKING removes `.backoff` option. Workers are always retried with exponential backoff
- BREAKING adds a dead letter queue. Messages received more than 14 times by a watcher container will be sent to this queue. Any visible messages in this queue will trip an alarm.
- adds `options.reservation.softMemory` which allows the caller to set up a soft memory reservation on worker tasks
- bump watchbot-progress to v1.1.1, handles a bug in checking part status on a completed job
- move to @mapbox/watchbot, use MemoryReservation soft limit for the Watcher task
- update and switch to namespaced package for `@mapbox/watchbot-progress`
- reimplement and fix `NotifyAfterRetries` as a watcher environment variable
- fix a bug where `NotifyAfterRetries` was still expected in watcher container environment
- adds duration (in seconds) to watcher log output when tasks complete
- fix bug with `NotifyAfterRetries` where the environment variable was set in the watcher container, not the worker.
- adds `options.privileged` parameter to watchbot's template
- Adds `.ref.queueName` to the output from `watchbot.template()`
- Clarifies watcher log messages conveying outcome when tasks finish
- Fixes a bug where task launching could fail due to a `startedBy` name longer than 36 characters
- Adds support for us-east-2 (Ohio)
- Allows `options.logAggregationFunction` to reference a potentially empty stack parameter
- Adds event emitter to signal when cluster instances have been identified
- Adds error emitter to signal when there are no cluster instances
- Adds readCapacityUnits & writeCapacityUnits configurable watchbot.template option params
- Adds error handling for log line >50kb edge case
- Exposes notifyAfterRetry concept to retry jobs before sending alarms
- Adds pagination for describeContainerInstances
- Adds watchbot-progress dependency
- Adds support for ap-* regions by adding regional mapping for worker/watcher images assuming ecs-conex is doing your image packaging.
- Fix bug where watchbot would not retry running a task if it encountered a RESOURCE:CPU constraint error.
- Breaking requires KMS key under the CF export `cloudformation-kms-production` to grant worker tasks permission to decrypt secure environment variables. See README and https://github.com/mapbox/cloudformation-kms, https://github.com/mapbox/decrypt-kms-env.
- Fix potential race condition when creating `LogForwarding`
- Adds EcsWatchbotVersion to template Metadata
- allow `workers` and `backoff` to be a ref
- adds `options.debugLogs` to enable verbose logging
- adds log stream prefix to organize worker/watcher logs better
- fix for worker role in reduce mode
- fixes a bug that could produce an invalid template if no memory reservation is specified. New default memory is 64MB
- fixes a bug that limited a watcher to maintaining at most 100 concurrent workers
- adds `reduce` option to `watchbot.template()` for tracking map-reduce operations
- adds example recipes for workers using `reduce` mode
- Breaking changes the `startedBy` attribute of worker tasks to the stack's name
- fixes a bug where `options.command` would break the watcher
- adds `.ref.queueUrl` and `.ref.queueArn` references to object returned by `watchbot.template()`
- automatically provide workers with permission to publish to watchbot's SNS topic
- adds `watchbot.logStream`, a node.js writable stream for prefixing logs
- Breaking changes the name of the SQS queue, making it a bit easier to find in the console
- Breaking switch to TaskRole instead of grafting permissions onto a predefined role
- fixes a template generation bug for callers that do not use mount points
- adds `logAggregationFunction` argument to watchbot.template
- allow caller to set container CMD
- template validation, cleanups, default watchbot version
- overhauls template building process, providing scripts that expose Watchbot's resources as JavaScript objects
- container logs are sent from Docker to CloudWatch Logs instead of syslog
- a watchbot stack creates its own CloudWatch LogGroup and sends all container logs to it
- on task failure, reads recent container logs from CloudWatch and includes them in notifications
- adds helper functions to run as part of the worker which help generate homogeneous, searchable log output
- silences `[status]` log messages unless logLevel is set to `debug`
- improved message body in notifications sent when tasks fail
- logs are sent to syslog instead of to a file assumed to be mounted from the host machine
- new template builder arguments to only include certain resources (e.g. webhooks) if you ask for them
- watcher pays attention to cluster resource reservation, avoids polling the queue when the cluster is fully utilized, and retries runTask requests if a request fails due to lack of memory.
- template sets up watcher permissions such that updates to the worker's task definition will not lead to permissions failures in the midst of a deploy
- watcher logs include message subject and body
- gracefully return messages to the queue if the ECS API fails to run a task
- handle situations where a single watcher receives the same message twice
- adjust alarm description in CloudFormation template
- First sketch of Watchbot on ECS