From 3608115b9dcc52da3f2c5d0e0f9e145d0ef21dcf Mon Sep 17 00:00:00 2001
From: Viren Nadkarni
Date: Wed, 20 Nov 2024 14:50:06 +0530
Subject: [PATCH] Update Chaos Engineering docs

---
 .../en/tutorials/simulating-outages/index.md  |   1 -
 .../fault-injection-service/index.md          | 614 +-----------------
 2 files changed, 2 insertions(+), 613 deletions(-)

diff --git a/content/en/tutorials/simulating-outages/index.md b/content/en/tutorials/simulating-outages/index.md
index c0c280cf3f..46740b4c80 100644
--- a/content/en/tutorials/simulating-outages/index.md
+++ b/content/en/tutorials/simulating-outages/index.md
@@ -141,7 +141,6 @@ Message sent to queue.
 If we review the logs, it will show that the `DynamoDbException` has been managed effectively.

 ```text
-2023-11-06T22:21:40.789 DEBUG --- [ asgi_gw_2] l.services.fis.handler : FIS handler called with configs: {'dynamodb': {None: [(100, 'DynamoDbException', '500')]}}
 2023-11-06T22:21:40.789 INFO --- [ asgi_gw_2] localstack.request.aws : AWS dynamodb.PutItem => 500 (DynamoDbException)
 2023-11-06T22:21:40.834 DEBUG --- [ asgi_gw_4] l.services.sns.publisher : Topic 'arn:aws:sns:us-east-1:000000000000:ProductEventsTopic' publishing '5520d37a-fc21-4a73-b1bf-f9b9afce5908' to subscribed 'arn:aws:sqs:us-east-1:000000000000:ProductEventsQueue' with protocol 'sqs' (subscription 'arn:aws:sns:us-east-1:000000000000:ProductEventsTopic:0a4abf8c-744a-404a-9ff9-f132e25d1b30')
diff --git a/content/en/user-guide/chaos-engineering/fault-injection-service/index.md b/content/en/user-guide/chaos-engineering/fault-injection-service/index.md
index 47881246bc..3796a79e9f 100644
--- a/content/en/user-guide/chaos-engineering/fault-injection-service/index.md
+++ b/content/en/user-guide/chaos-engineering/fault-injection-service/index.md
@@ -7,142 +7,21 @@ aliases:
   - /tutorials/fault-injection-service-experiments/
 ---

-## Introduction
-
-The [Fault Injection Service](https://aws.amazon.com/fis/) is a fully managed service by AWS designed to help you improve the resilience of your applications by simulating real-world outages and operational issues.
+The [Fault Injection Service (FIS)](https://aws.amazon.com/fis/) is a fully managed service by AWS designed to help you improve the resilience of your applications by simulating real-world outages and operational issues.
 This service allows you to conduct controlled experiments on your AWS infrastructure, injecting faults and observing how your system responds under various conditions.
 By using the Fault Injection Service, you can identify weaknesses, test recovery procedures, and ensure that your applications can withstand unexpected disruptions.
 This proactive approach to reliability engineering enables you to enhance system robustness, minimize downtime, and maintain a high level of service availability for your users.

-To see the FIS in action within a more complex application stack, please refer to the [Chaos Engineering Tutorials]({{< ref "tutorials" >}}).
-
 {{< alert title="Note">}}
 Fault Injection Service emulation is available as part of the LocalStack Enterprise plan.
 If you'd like to try it out, please [contact us](https://www.localstack.cloud/demo) to request access.
 {{< /alert >}}

-## Prerequisites
-
-The prerequisites for this guide are:
-
-- LocalStack Pro with [LocalStack CLI](https://docs.localstack.cloud/getting-started/installation/#localstack-cli) & [LocalStack Auth Token](https://docs.localstack.cloud/getting-started/auth-token/)
-- [AWS CLI](https://docs.localstack.cloud/user-guide/integrations/aws-cli/) with the [`awslocal` wrapper](https://docs.localstack.cloud/user-guide/integrations/aws-cli/#localstack-aws-cli-awslocal)
-- [Docker](https://docs.docker.com/get-docker/) and [Docker Compose](https://docs.docker.com/compose/install/)
-
-Ensure that you set the Auth Token as an environment variable before beginning:
-
-{{< command >}}
-$ LOCALSTACK_AUTH_TOKEN=
-$ localstack start
-{{< /command >}}
-
-## Getting Started
-
 {{< callout "tip" >}}
-For more information on LocalStack FIS, please refer to the [FIS service docs]({{< ref "user-guide/aws/fis" >}}).
+For more information, please refer to the [FIS service docs]({{< ref "user-guide/aws/fis" >}}).
 {{< /callout >}}

-This guide is created with users who are new to FIS in mind, and assumes basic knowledge of the AWS CLI and our `awslocal` wrapper script.
-
-The following demo will depict constructing various FIS experiments designed to trigger different types of failures in a DynamoDB service.
-
-Let's create a simple DynamoDB table called `Students` in the `us-east-1` region.
-
-{{< command >}}
-$ awslocal dynamodb create-table \
-    --table-name Students \
-    --attribute-definitions AttributeName=id,AttributeType=S \
-    --key-schema AttributeName=id,KeyType=HASH \
-    --billing-mode PAY_PER_REQUEST \
-    --region us-east-1
-
-{
-    "TableDescription": {
-        "AttributeDefinitions": [
-            {
-                "AttributeName": "id",
-                "AttributeType": "S"
-            }
-        ],
-        "TableName": "Students",
-        "KeySchema": [
-            {
-                "AttributeName": "id",
-                "KeyType": "HASH"
-            }
-        ],
-        "TableStatus": "ACTIVE",
-        "CreationDateTime": 1710945576.193,
-        "ProvisionedThroughput": {
-            "LastIncreaseDateTime": 0.0,
-            "LastDecreaseDateTime": 0.0,
-            "NumberOfDecreasesToday": 0,
-            "ReadCapacityUnits": 0,
-            "WriteCapacityUnits": 0
-        },
-        "TableSizeBytes": 0,
-        "ItemCount": 0,
-        "TableArn": "arn:aws:dynamodb:us-east-1:000000000000:table/Students",
-        "TableId": "c9ae13b6-ecf1-42f2-8d69-0e14d65a4dc3",
-        "BillingModeSummary": {
-            "BillingMode": "PAY_PER_REQUEST",
-            "LastUpdateToPayPerRequestDateTime": 1710945576.193
-        }
-    }
-}
-
-{{< /command >}}
-
-The newly created table has two items added:
-
-{{< command >}}
-$ awslocal dynamodb put-item --table-name Students --region us-east-1 --item '{
-    "id": {"S": "1216"},
-    "first name": {"S": "Liam"},
-    "last name": {"S": "Davis"},
-    "year": {"S": "Junior"},
-    "enrolment date": {"S": "2023-03-19"}
-    }'
-
-$ awslocal dynamodb put-item --table-name Students --region us-east-1 --item '{
-    "id": {"S": "1748"},
-    "first name": {"S": "John"},
-    "last name": {"S": "Doe"},
-    "year": {"S": "Senior"},
-    "enrolment date": {"S": "2022-03-19"}
-    }'
-{{< /command >}}
-
-And then we can look up one of the students by ID, also using the awslocal CLI:
-
-{{< command >}}
-$ awslocal dynamodb get-item --table-name Students --key '{"id": {"S": "1216"}}'
-
-{
-    "Item": {
-        "id": {
-            "S": "1216"
-        },
-        "last name": {
-            "S": "Davis"
-        },
-        "enrolment date": {
-            "S": "2023-03-19"
-        },
-        "first name": {
-            "S": "Liam"
-        },
-        "year": {
-            "S": "Junior"
-        }
-    }
-}
-
-{{< /command >}}
-
-## Key concepts of FIS
-
 Some of the most important concepts associated with a FIS experiment are:

 **1.
@@ -168,492 +47,3 @@ These are necessary for AWS FIS to perform actions on your behalf, like injectin
 **6.
 Experiment Execution**: When you start an experiment, AWS FIS executes the actions defined in the experiment template against the specified targets, adhering to any defined stop conditions.
 The execution process is logged, and detailed information about the experiment's progress and outcome is provided.
-
-## Examples
-
-### Service Unavailability
-
-{{< callout "warning" >}}
-The `localstack:generic:api-error` action is deprecated and marked for removal.
-Please use the [Chaos API]({{< ref "chaos-api" >}}) to achieve the same effect.
-{{< /callout >}}
-
-In a file called `dynamodb-experiment.json` let's define a FIS experiment that causes all calls to the `GetItem` API of the DynamoDB service to return a 503 `Service Unavailable` response.
-This failure will happen 100% of the times the method is called.
-
-```json
-{
-    "actions": {
-        "Test disruption": {
-            "actionId": "localstack:generic:api-error",
-            "parameters": {
-                "service": "dynamodb",
-                "operation": "GetItem",
-                "percentage": "100",
-                "exception": "Service Unavailable",
-                "errorCode": "503"
-            }
-        }
-    },
-    "description": "Template for interfering with the DynamoDB service",
-    "stopConditions": [{
-        "source": "none"
-    }],
-    "roleArn": "arn:aws:iam:000000000000:role/ExperimentRole"
-}
-```
-
-And create the experiment:
-
-{{< command >}}
-$ awslocal fis create-experiment-template --cli-input-json file://dynamodb-experiment.json
-
-{
-    "experimentTemplate": {
-        "id": "547ec5c3-5ca1-4227-9b9d-a737223d1d42",
-        "description": "Template for interfering with the DynamoDB service",
-        "actions": {
-            "Test disruption": {
-                "actionId": "localstack:generic:api-error",
-                "parameters": {
-                    "service": "dynamodb",
-                    "operation": "GetItem",
-                    "percentage": "100",
-                    "exception": "DynamoDbException",
-                    "errorCode": "500"
-                }
-            }
-        },
-        "stopConditions": [
-            {
-                "source": "none"
-            }
-        ],
-        "creationTime": 1710948862.04738,
-        "lastUpdateTime": 1710948862.04738,
-        "roleArn": "arn:aws:iam:000000000000:role/ExperimentRole"
-    }
-}
-
-{{< /command >}}
-
-The experiment needs to be started in order to be running:
-
-{{< command >}}
-$ awslocal fis start-experiment --experiment-template-id 547ec5c3-5ca1-4227-9b9d-a737223d1d42
-
-{
-    "experiment": {
-        "id": "1a01327a-79d5-4202-8132-e56e55c9391b",
-        "experimentTemplateId": "547ec5c3-5ca1-4227-9b9d-a737223d1d42",
-        "roleArn": "arn:aws:iam:000000000000:role/ExperimentRole",
-        "state": {
-            "status": "running"
-        },
-        "actions": {
-            "Test disruption": {
-                "actionId": "localstack:generic:api-error",
-                "parameters": {
-                    "service": "dynamodb",
-                    "operation": "GetItem",
-                    "percentage": "100",
-                    "exception": "DynamoDbException",
-                    "errorCode": "500"
-                }
-            }
-        },
-        "stopConditions": [
-            {
-                "source": "none"
-            }
-        ],
-        "creationTime": 1710949720.491161,
-        "startTime": 1710949720.491161
-    }
-}
-
-{{< /command >}}
-
-The LocalStack logs are confirming the experiment related activity:
-
-```bash
-2024-03-20T15:34:22.048 INFO --- [ asgi_gw_0] localstack.request.aws : AWS fis.CreateExperimentTemplate => 200
-2024-03-20T15:48:40.492 INFO --- [ asgi_gw_0] localstack.request.aws : AWS fis.StartExperiment => 200
-```
-
-Let's see it in action:
-
-{{< command >}}
-$ awslocal dynamodb get-item --table-name Students --key '{"id": {"S": "1216"}}'
-
-An error occurred (DynamoDbException) when calling the GetItem operation (reached max retries: 9): Failing as per Fault Injection Simulator configuration
-
-{{< /command >}}
-
-The logs now show several attempts of performing the `GetItem` operation, before returning the error message:
-
-```text
-2024-03-20T15:54:16.630 INFO --- [ asgi_gw_0] localstack.request.aws : AWS dynamodb.GetItem => 500 (DynamoDbException)
-2024-03-20T15:54:16.707 INFO --- [ asgi_gw_1] localstack.request.aws : AWS dynamodb.GetItem => 500 (DynamoDbException)
-2024-03-20T15:54:16.825 INFO --- [ asgi_gw_0] localstack.request.aws : AWS dynamodb.GetItem => 500 (DynamoDbException)
-2024-03-20T15:54:17.040 INFO --- [ asgi_gw_1] localstack.request.aws : AWS dynamodb.GetItem => 500 (DynamoDbException)
-2024-03-20T15:54:17.476 INFO --- [ asgi_gw_0] localstack.request.aws : AWS dynamodb.GetItem => 500 (DynamoDbException)
-2024-03-20T15:54:18.301 INFO --- [ asgi_gw_1] localstack.request.aws : AWS dynamodb.GetItem => 500 (DynamoDbException)
-2024-03-20T15:54:19.925 INFO --- [ asgi_gw_0] localstack.request.aws : AWS dynamodb.GetItem => 500 (DynamoDbException)
-2024-03-20T15:54:23.141 INFO --- [ asgi_gw_0] localstack.request.aws : AWS dynamodb.GetItem => 500 (DynamoDbException)
-2024-03-20T15:54:29.559 INFO --- [ asgi_gw_1] localstack.request.aws : AWS dynamodb.GetItem => 500 (DynamoDbException)
-2024-03-20T15:54:42.381 INFO --- [ asgi_gw_1] localstack.request.aws : AWS dynamodb.GetItem => 500 (DynamoDbException)
-```
-
-However, the `PutItem` and other operations are still working as expected:
-
-{{< command >}}
-$ awslocal dynamodb put-item --table-name Students --region us-east-1 --item '{
-    "id": {"S": "9865"},
-    "first name": {"S": "Jenny"},
-    "last name": {"S": "Jones"},
-    "year": {"S": "Sophomore"},
-    "enrolment date": {"S": "2021-03-19"}
-    }'
-
-2024-03-20T16:00:27.958 INFO --- [ asgi_gw_0] localstack.request.aws : AWS dynamodb.PutItem => 200
-
-{{< /command >}}
-
-Finally, the experiment can be stopped using the experiment's ID with the following command:
-
-```bash
-awslocal fis stop-experiment --id 1a01327a-79d5-4202-8132-e56e55c9391b
-```
-
-### Region Unavailability
-
-{{< callout "warning" >}}
-The `localstack:generic:api-error` action is deprecated and marked for removal.
-Please use the [Chaos API]({{< ref "chaos-api" >}}) to achieve the same effect.
-{{< /callout >}}
-
-This sort of experiment involves disabling entire regions to simulate regional outages and failovers.
-Let's see what that would look like, in a separate file, `regional-experiment.json`:
-
-```json
-{
-    "actions": {
-        "regionUnavailable-us-east-1": {
-            "actionId": "localstack:generic:api-error",
-            "parameters": {
-                "region": "us-east-1",
-                "errorCode": "503"
-            }
-        },
-        "regionUnavailable-us-west-2": {
-            "actionId": "localstack:generic:api-error",
-            "parameters": {
-                "region": "us-west-2",
-                "errorCode": "503"
-            }
-        }
-    },
-    "description": "Template for internal server error for regions us-west-2, us-east-1",
-    "stopConditions": [{
-        "source": "none"
-    }],
-    "roleArn": "arn:aws:iam:000000000000:role/ExperimentRole"
-}
-```
-
-This template defines actions to simulate internal server errors (HTTP 503) in both `us-east-1` and `us-west-2` regions, without specific stop conditions.
-These outages will affect all the resources within the regions.
-
-The experiment is created and started:
-
-{{< command >}}
-$ awslocal fis create-experiment-template --cli-input-json file://regional-experiment.json
-
-{
-    "experimentTemplate": {
-        "id": "19bec43e-9cb4-4bb8-9509-bf71c6e313c4",
-        "description": "Template for internal server error for regions us-west-2, us-east-1",
-        "actions": {
-            "regionUnavailable-us-east-1": {
-                "actionId": "localstack:generic:api-error",
-                "parameters": {
-                    "region": "us-east-1",
-                    "errorCode": "503"
-                }
-            },
-            "regionUnavailable-us-west-2": {
-                "actionId": "localstack:generic:api-error",
-                "parameters": {
-                    "region": "us-west-2",
-                    "errorCode": "503"
-                }
-            }
-        },
-        "stopConditions": [
-            {
-                "source": "none"
-            }
-        ],
-        "creationTime": 1710951319.333033,
-        "lastUpdateTime": 1710951319.333033,
-        "roleArn": "arn:aws:iam:000000000000:role/ExperimentRole"
-    }
-}
-
-{{< /command >}}
-
-{{< command >}}
-$ awslocal fis start-experiment --experiment-template-id 19bec43e-9cb4-4bb8-9509-bf71c6e313c4
-
-{
-    "experiment": {
-        "id": "1a650841-bc81-4b4b-9317-6ec52df51c1d",
-        "experimentTemplateId": "19bec43e-9cb4-4bb8-9509-bf71c6e313c4",
-        "roleArn": "arn:aws:iam:000000000000:role/ExperimentRole",
-        "state": {
-            "status": "running"
-        },
-        "actions": {
-            "regionUnavailable-us-east-1": {
-                "actionId": "localstack:generic:api-error",
-                "parameters": {
-                    "region": "us-east-1",
-                    "errorCode": "503"
-                }
-            },
-            "regionUnavailable-us-west-2": {
-                "actionId": "localstack:generic:api-error",
-                "parameters": {
-                    "region": "us-west-2",
-                    "errorCode": "503"
-                }
-            }
-        },
-        "stopConditions": [
-            {
-                "source": "none"
-            }
-        ],
-        "creationTime": 1710951725.475228,
-        "startTime": 1710951725.475228
-    }
-}
-
-{{< /command >}}
-
-The previously created table in the `us-east-1` region, is not reachable, but this time a different error is thrown, the one that we defined in the latest experiment template:
-
-{{< command >}}
-$ awslocal dynamodb get-item --table-name Students --region us-east-1 --key '{"id": {"S": "1216"}}'
-
-An error occurred (InternalError) when calling the GetItem operation (reached max retries: 9): Failing as per Fault Injection Simulator configuration
-
-{{< /command >}}
-
-However, the `eu-central-1` region is unaffected, and a new table can be created and used in that area.
-
-{{< command >}}
-$ awslocal dynamodb create-table \
-    --table-name Students \
-    --attribute-definitions AttributeName=id,AttributeType=S \
-    --key-schema AttributeName=id,KeyType=HASH \
-    --billing-mode PAY_PER_REQUEST \
-    --region eu-central-1
-
-{
-    "TableDescription": {
-        "AttributeDefinitions": [
-            {
-                "AttributeName": "id",
-                "AttributeType": "S"
-            }
-        ],
-        "TableName": "Students",
-        "KeySchema": [
-            {
-                "AttributeName": "id",
-                "KeyType": "HASH"
-            }
-        ],
-        "TableStatus": "ACTIVE",
-        "CreationDateTime": 1710952212.617,
-        "ProvisionedThroughput": {
-            "LastIncreaseDateTime": 0.0,
-            "LastDecreaseDateTime": 0.0,
-            "NumberOfDecreasesToday": 0,
-            "ReadCapacityUnits": 0,
-            "WriteCapacityUnits": 0
-        },
-        "TableSizeBytes": 0,
-        "ItemCount": 0,
-        "TableArn": "arn:aws:dynamodb:eu-central-1:000000000000:table/Students",
-        "TableId": "917f7df1-0050-433a-8647-427f072e7409",
-        "BillingModeSummary": {
-            "BillingMode": "PAY_PER_REQUEST",
-            "LastUpdateToPayPerRequestDateTime": 1710952212.617
-        }
-    }
-}
-
-{{< /command >}}
-
-```bash
-awslocal dynamodb put-item --table-name Students --region eu-central-1 --item '{
-    "id": {"S": "1111"},
-    "first name": {"S": "Alice"},
-    "last name": {"S": "Simpson"},
-    "year": {"S": "Freshman"},
-    "enrolment date": {"S": "2020-03-19"}
-    }'
-
-2024-03-20T16:34:57.164 INFO --- [ asgi_gw_0] localstack.request.aws : AWS dynamodb.PutItem => 200
-```
-
-Just as with the earlier experiment, this one should be stopped by running the following command:
-
-```bash
-awslocal fis stop-experiment --id e49283c1-c2e0-492b-b69f-9fbd710bc1e3
-```
-
-### Service Latency
-
-{{< callout "warning" >}}
-The `localstack:generic:latency` action is deprecated and marked for removal.
-Please use the [Chaos API]({{< ref "chaos-api" >}}) to achieve the same effect.
-{{< /callout >}}
-
-Let's now add some latency to our DynamoDB API calls.
-First the definition of a new experiment template in another file, `latency-experiment.json`:
-
-```json
-{
-    "actions": {
-        "latency": {
-            "actionId": "localstack:generic:latency",
-            "parameters": {
-                "region": "us-east-1",
-                "latencyMilliseconds": 5000
-            }
-        }
-    },
-    "description": "template for testing delays in DynamoDB API calls",
-    "stopConditions": [],
-    "roleArn": "arn:aws:iam:000000000000:role/ExperimentRole"
-}
-```
-
-After saving the file, we can create and start the experiment:
-
-{{< command >}}
-$ awslocal fis create-experiment-template --cli-input-json file://latency-experiment.json
-
-{
-    "experimentTemplate": {
-        "id": "1f6e0ce8-57ed-4987-a7e5-b85ba3f5b933",
-        "description": "template for testing delays in DynamoDB API calls",
-        "actions": {
-            "latency": {
-                "actionId": "localstack:generic:latency",
-                "parameters": {
-                    "service": "dynamodb",
-                    "region": "us-east-1",
-                    "latencyMilliseconds": "5000"
-                }
-            }
-        },
-        "stopConditions": [],
-        "creationTime": 1711024346.972359,
-        "lastUpdateTime": 1711024346.972359,
-        "roleArn": "arn:aws:iam:000000000000:role/ExperimentRole"
-    }
-}
-
-{{< /command >}}
-
-{{< command >}}
-$ awslocal fis start-experiment --experiment-template-id 1f6e0ce8-57ed-4987-a7e5-b85ba3f5b933
-
-{
-    "experiment": {
-        "id": "dd598567-56e6-4d00-9ef5-15c7e90e7851",
-        "experimentTemplateId": "1f6e0ce8-57ed-4987-a7e5-b85ba3f5b933",
-        "roleArn": "arn:aws:iam:000000000000:role/ExperimentRole",
-        "state": {
-            "status": "running"
-        },
-        "actions": {
-            "latency": {
-                "actionId": "localstack:generic:latency",
-                "parameters": {
-                    "service": "dynamodb",
-                    "region": "us-east-1",
-                    "latencyMilliseconds": "5000"
-                }
-            }
-        },
-        "stopConditions": [],
-        "creationTime": 1711024425.844646,
-        "startTime": 1711024425.844646
-    }
-}
-
-{{< /command >}}
-
-This FIS experiment introduces a delay of 5 seconds to all DynamoDB API calls within the `us-east-1` region.
-Tables located in the `eu-central-1` region, or any other service, remain unaffected.
-To extend the latency effect to a regional level, the specific service constraint can be omitted, thereby applying the latency to all resources within the selected region.
-
-As always, remember to stop your experiment, so it does not cause unexpected issues down the line:
-
-```bash
-awslocal fis stop-experiment --id dd598567-56e6-4d00-9ef5-15c7e90e7851
-```
-
-Remember to replace the IDs with your own corresponding values.
-
-### Experiment overview
-
-If you want to keep track of all your experiments and make sure nothing is running in the background to hinder any other work, you can get an overview by using the command:
-
-{{< command >}}
-$ awslocal fis list-experiments
-
-{
-    "experiments": [
-        {
-            "id": "1a01327a-79d5-4202-8132-e56e55c9391b",
-            "experimentTemplateId": "547ec5c3-5ca1-4227-9b9d-a737223d1d42",
-            "state": {
-                "status": "stopped"
-            },
-            "creationTime": 1710949720.491161
-        },
-        {
-            "id": "1a650841-bc81-4b4b-9317-6ec52df51c1d",
-            "experimentTemplateId": "19bec43e-9cb4-4bb8-9509-bf71c6e313c4",
-            "state": {
-                "status": "stopped"
-            },
-            "creationTime": 1710951725.475228
-        },
-        {
-            "id": "e49283c1-c2e0-492b-b69f-9fbd710bc1e3",
-            "experimentTemplateId": "19bec43e-9cb4-4bb8-9509-bf71c6e313c4",
-            "state": {
-                "status": "stopped"
-            },
-            "creationTime": 1710951872.250418
-        },
-        {
-            "id": "dd598567-56e6-4d00-9ef5-15c7e90e7851",
-            "experimentTemplateId": "1f6e0ce8-57ed-4987-a7e5-b85ba3f5b933",
-            "state": {
-                "status": "running"
-            },
-            "creationTime": 1711024425.844646
-        }
-    ]
-}
-
-{{< /command >}}