Commit 4c9276a

Implement scenario 07 combined (#210)
Co-authored-by: Claude <noreply@anthropic.com>
1 parent b0087c7 commit 4c9276a

4 files changed: +831 -2 lines changed

.github/workflows/nightly_tests.yaml

Lines changed: 8 additions & 0 deletions
@@ -491,6 +491,14 @@ jobs:
         continue-on-error: false
         timeout-minutes: 2
 
+      - name: Run Chaos Test Scenario 07 (Combined Failures)
+        run: |
+          set -o pipefail
+          mkdir -p /tmp/teranode-test-results
+          go test -v ./test/chaos -run TestScenario07 2>&1 | tee /tmp/teranode-test-results/chaostest-scenario-07-results.txt
+        continue-on-error: false
+        timeout-minutes: 2
+
       - name: Upload chaostest results
         if: always()
         uses: actions/upload-artifact@ea165f8d65b6e75b540449e92b4886f43607fa02 # v4
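
The `set -o pipefail` line in the new step matters because `go test` is piped into `tee`. Without it, the pipeline's exit status is that of `tee`, so a failing test run would still leave the step green. A minimal sketch of the difference follows; the function names are illustrative and not part of the workflow, and `false` stands in for a failing `go test`.

```bash
#!/usr/bin/env bash
# Illustrative only: `false` stands in for a failing `go test` run.

# Print the exit status of `false | tee` with pipefail off (the shell default).
status_without_pipefail() {
    ( set +o pipefail; false | tee /dev/null; echo "$?" )
}

# Print the exit status of the same pipeline with pipefail on, as in the step.
status_with_pipefail() {
    ( set -o pipefail; false | tee /dev/null; echo "$?" )
}

status_without_pipefail   # prints 0: tee succeeded, so the failure is hidden
status_with_pipefail      # prints 1: the failure of `false` is propagated
```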

test/chaos/README.md

Lines changed: 121 additions & 2 deletions
@@ -53,6 +53,9 @@ go test -v ./test/chaos/...
 
 # Scenario 6: Slow Close Connections (Slicer)
 ./test/chaos/run_scenario_06.sh
+
+# Scenario 7: Combined Failures (DB + Kafka)
+./test/chaos/run_scenario_07.sh
 ```
 
 The helper scripts will:
@@ -82,6 +85,9 @@ go test -v ./test/chaos -run TestScenario05
 
 # Scenario 6: Slow Close Connections (Slicer)
 go test -v ./test/chaos -run TestScenario06
+
+# Scenario 7: Combined Failures (DB + Kafka)
+go test -v ./test/chaos -run TestScenario07
 ```
 
 ### Run in Verbose Mode
@@ -409,6 +415,116 @@ go test -v ./test/chaos -run TestScenario06_KafkaSlowClose
 
 **Combined scenario duration:** ~55 seconds
 
+### Scenario 7: Combined Failures (3 variants)
+**File:** `scenario_07_combined_failures_test.go`
+
+Tests system behavior when multiple dependencies fail simultaneously or in sequence. This simulates realistic infrastructure-wide issues like datacenter problems, network partitions, or cascading failures.
+
+#### Variant A: Simultaneous Complete Failure
+**Test:** `TestScenario07_SimultaneousFailure`
+
+**What it tests:**
+- System behavior when both PostgreSQL AND Kafka fail at the same time
+- Failure detection when multiple dependencies down
+- Graceful degradation (errors, not crashes)
+- Simultaneous recovery of both services
+- Data consistency after dual failure
+
+**How to run:**
+```bash
+# Using helper script
+./test/chaos/run_scenario_07.sh
+
+# Using go test directly
+go test -v ./test/chaos -run TestScenario07_SimultaneousFailure
+```
+
+**Test phases:**
+1. Establish baseline with both services healthy
+2. Disable both PostgreSQL and Kafka simultaneously (complete failure)
+3. Test behavior during simultaneous outage
+4. Restore both services simultaneously
+5. Verify recovery and data consistency
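
Phases 2 and 4 above can be reproduced by hand through the Toxiproxy HTTP API. The sketch below reuses the endpoints and proxy names (`postgres`, `kafka`) that appear in `run_scenario_07.sh`; the function names are hypothetical, and the Go test itself may drive Toxiproxy differently.

```bash
#!/usr/bin/env bash
# Sketch only: URLs and proxy names are taken from run_scenario_07.sh;
# the helper names below are made up for illustration.
TOXIPROXY_POSTGRES_URL="${TOXIPROXY_POSTGRES_URL:-http://localhost:8474}"
TOXIPROXY_KAFKA_URL="${TOXIPROXY_KAFKA_URL:-http://localhost:8475}"

# set_proxy_enabled <api-url> <proxy-name> <true|false>
# Updates the proxy's "enabled" field; -f makes curl fail on HTTP errors.
set_proxy_enabled() {
    curl -s -f -X POST "$1/proxies/$2" \
        -H "Content-Type: application/json" \
        -d "{\"enabled\": $3}" > /dev/null
}

# Phase 2: take both dependencies down, one request right after the other.
disable_both() {
    set_proxy_enabled "$TOXIPROXY_POSTGRES_URL" postgres false
    set_proxy_enabled "$TOXIPROXY_KAFKA_URL" kafka false
}

# Phase 4: bring both dependencies back.
enable_both() {
    set_proxy_enabled "$TOXIPROXY_POSTGRES_URL" postgres true
    set_proxy_enabled "$TOXIPROXY_KAFKA_URL" kafka true
}
```

Calling `disable_both`, exercising the system, and then calling `enable_both` mirrors phases 2 to 4. The two requests are issued sequentially, so "simultaneous" here means within milliseconds of each other.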
+
+**Expected results:**
+- ✅ Baseline: Both services healthy and functional
+- ✅ Simultaneous failure: Both fail quickly and cleanly (no hangs)
+- ✅ During outage: Errors returned promptly (not timeouts or crashes)
+- ✅ Recovery: Both services restored successfully
+- ✅ Consistency: No data corruption from dual failure
+
+**Test duration:** ~10 seconds
+
+#### Variant B: Simultaneous Latency
+**Test:** `TestScenario07_SimultaneousLatency`
+
+**What it tests:**
+- System behavior when both PostgreSQL AND Kafka become slow simultaneously
+- Performance degradation when multiple dependencies affected
+- System remains functional despite infrastructure-wide slowdown
+- Recovery when latency removed from both
+
+**How to run:**
+```bash
+# Using helper script
+./test/chaos/run_scenario_07.sh
+
+# Using go test directly
+go test -v ./test/chaos -run TestScenario07_SimultaneousLatency
+```
+
+**Test phases:**
+1. Measure baseline performance (both services fast)
+2. Inject 500ms latency to both services simultaneously
+3. Test performance under simultaneous latency
+4. Remove latency and verify recovery
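
The latency injection in phase 2 and its removal in phase 4 correspond to creating and deleting a `latency` toxic on each proxy. A minimal sketch, assuming the Toxiproxy endpoints and proxy names from `run_scenario_07.sh`; the toxic name `latency_<proxy>` and the function names are hypothetical choices made for this example.

```bash
#!/usr/bin/env bash
# Sketch only: the toxic name "latency_<proxy>" is an assumption, not taken
# from the test code.

# add_latency <api-url> <proxy-name> <latency-ms>
# Creates a latency toxic on the downstream (server-to-client) direction.
add_latency() {
    curl -s -f -X POST "$1/proxies/$2/toxics" \
        -H "Content-Type: application/json" \
        -d "{\"name\": \"latency_$2\", \"type\": \"latency\", \"stream\": \"downstream\", \"attributes\": {\"latency\": $3}}" > /dev/null
}

# remove_latency <api-url> <proxy-name>
# Deletes the toxic created by add_latency.
remove_latency() {
    curl -s -f -X DELETE "$1/proxies/$2/toxics/latency_$2" > /dev/null
}
```

For example, `add_latency http://localhost:8474 postgres 500` followed by `add_latency http://localhost:8475 kafka 500` would slow both services by 500 ms, and the matching `remove_latency` calls would restore baseline behavior.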
+
+**Expected results:**
+- ✅ Baseline: Fast operations on both services
+- ✅ With latency: Both services slower but still functional
+- ✅ Operations complete successfully despite 500ms delay
+- ✅ No cascading timeouts or failures
+- ✅ Recovery: Performance returns to baseline levels
+
+**Test duration:** ~15 seconds
+
+#### Variant C: Staggered Recovery
+**Test:** `TestScenario07_StaggeredRecovery`
+
+**What it tests:**
+- System behavior when services recover at different times
+- Partial functionality when one service up, one down
+- No cascading failures during staggered recovery
+- Data consistency with asynchronous recovery
+
+**How to run:**
+```bash
+# Using helper script
+./test/chaos/run_scenario_07.sh
+
+# Using go test directly
+go test -v ./test/chaos -run TestScenario07_StaggeredRecovery
+```
+
+**Test phases:**
+1. Disable both PostgreSQL and Kafka simultaneously
+2. Restore PostgreSQL first (Kafka still down)
+3. Verify PostgreSQL works while Kafka remains down
+4. Wait 3 seconds, then restore Kafka
+5. Verify both services healthy
+6. Confirm data consistency after staggered recovery
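
The intermediate state checked in phase 3 (PostgreSQL up, Kafka down) can be observed from outside with the same `nc -z` probe the helper script uses. The sketch below assumes the proxied ports from `run_scenario_07.sh` (15432 and 19092) and assumes that a disabled Toxiproxy proxy stops accepting connections; the function names are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: ports come from run_scenario_07.sh; helper names are made up.
POSTGRES_TOXI_PORT=15432
KAFKA_TOXI_PORT=19092

# port_open <port>: succeeds if something accepts TCP connections on localhost:<port>.
port_open() { nc -z localhost "$1" 2>/dev/null; }

# staggered_state: print which proxied services are reachable right now.
staggered_state() {
    local pg=down kafka=down
    if port_open "$POSTGRES_TOXI_PORT"; then pg=up; fi
    if port_open "$KAFKA_TOXI_PORT"; then kafka=up; fi
    echo "postgres=$pg kafka=$kafka"
}
```

Under those assumptions, `staggered_state` would report `postgres=up kafka=down` between phases 2 and 4, and `postgres=up kafka=up` after phase 4.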
+
+**Expected results:**
+- ✅ Both services fail cleanly when disabled
+- ✅ PostgreSQL recovers independently while Kafka down
+- ✅ System handles partial recovery gracefully
+- ✅ Kafka recovers after delay with no issues
+- ✅ No data corruption from staggered recovery pattern
+
+**Test duration:** ~10 seconds
+
+**Combined scenario duration:** ~35 seconds
+
 ## Test Structure
 
 Each chaos test follows this pattern:
@@ -581,7 +697,8 @@ Chaos tests take longer than unit tests:
 - Scenario 4C (Load Under Failures): ~28 seconds (load testing under failures)
 - Scenario 5 (Bandwidth Constraints): ~4.4 seconds (database + Kafka bandwidth tests)
 - Scenario 6 (Slow Close Connections): ~55 seconds (slicer toxic tests)
-- Full suite: ~12-13 minutes (with all scenarios)
+- Scenario 7 (Combined Failures): ~35 seconds (simultaneous and staggered failures)
+- Full suite: ~13-14 minutes (with all scenarios)
 
 ## Troubleshooting
 
@@ -655,4 +772,6 @@ curl -X POST http://localhost:8474/reset
 - [x] Scenario 4C: Load Under Failures ✅ **Implemented**
 - [x] Scenario 5: Bandwidth Constraints ✅ **Implemented**
 - [x] Scenario 6: Slow Close Connections (Slicer toxic) ✅ **Implemented**
-- [ ] Scenario 7: Combined Failures (DB + Kafka simultaneously)
+- [x] Scenario 7: Combined Failures (DB + Kafka simultaneously) ✅ **Implemented**
+
+All planned chaos test scenarios have been implemented!

test/chaos/run_scenario_07.sh

Lines changed: 119 additions & 0 deletions
@@ -0,0 +1,119 @@
+#!/bin/bash
+
+# Chaos Test Scenario 07: Combined Failures (DB + Kafka simultaneously)
+# This script runs the combined failures chaos test with proper setup validation
+
+set -e
+
+# Colors for output
+RED='\033[0;31m'
+GREEN='\033[0;32m'
+YELLOW='\033[1;33m'
+NC='\033[0m' # No Color
+
+# Configuration
+TOXIPROXY_POSTGRES_URL="http://localhost:8474"
+TOXIPROXY_KAFKA_URL="http://localhost:8475"
+POSTGRES_DIRECT_URL="localhost:5432"
+POSTGRES_TOXI_URL="localhost:15432"
+KAFKA_DIRECT_URL="localhost:9092"
+KAFKA_TOXI_URL="localhost:19092"
+COMPOSE_FILE="../compose/docker-compose-ss.yml"
+
+echo -e "${GREEN}[INFO]${NC} Checking if toxiproxy services are running..."
+
+# Check if toxiproxy-postgres is available
+if curl -s -f "${TOXIPROXY_POSTGRES_URL}/version" > /dev/null 2>&1; then
+    echo -e "${GREEN}[INFO]${NC} Toxiproxy for PostgreSQL is available at ${TOXIPROXY_POSTGRES_URL}"
+else
+    echo -e "${RED}[ERROR]${NC} Toxiproxy for PostgreSQL is not accessible at ${TOXIPROXY_POSTGRES_URL}"
+    echo -e "${GREEN}[INFO]${NC} Check with: docker ps | grep toxiproxy-postgres"
+    exit 1
+fi
+
+# Check if toxiproxy-kafka is available
+if curl -s -f "${TOXIPROXY_KAFKA_URL}/version" > /dev/null 2>&1; then
+    echo -e "${GREEN}[INFO]${NC} Toxiproxy for Kafka is available at ${TOXIPROXY_KAFKA_URL}"
+else
+    echo -e "${RED}[ERROR]${NC} Toxiproxy for Kafka is not accessible at ${TOXIPROXY_KAFKA_URL}"
+    echo -e "${GREEN}[INFO]${NC} Check with: docker ps | grep toxiproxy-kafka"
+    exit 1
+fi
+
+# Check if PostgreSQL is accessible
+echo -e "${GREEN}[INFO]${NC} Checking PostgreSQL..."
+if nc -z localhost 5432 2>/dev/null; then
+    echo -e "${GREEN}[INFO]${NC} PostgreSQL is accessible"
+else
+    echo -e "${RED}[ERROR]${NC} PostgreSQL is not accessible on ${POSTGRES_DIRECT_URL}"
+    echo -e "${GREEN}[INFO]${NC} Check with: docker ps | grep postgres"
+    exit 1
+fi
+
+# Check if PostgreSQL through toxiproxy is accessible
+echo -e "${GREEN}[INFO]${NC} Checking PostgreSQL through toxiproxy..."
+if nc -z localhost 15432 2>/dev/null; then
+    echo -e "${GREEN}[INFO]${NC} PostgreSQL through toxiproxy is accessible"
+else
+    echo -e "${RED}[ERROR]${NC} PostgreSQL through toxiproxy is not accessible on ${POSTGRES_TOXI_URL}"
+    exit 1
+fi
+
+# Check if Kafka is accessible
+echo -e "${GREEN}[INFO]${NC} Checking Kafka..."
+if nc -z localhost 9092 2>/dev/null; then
+    echo -e "${GREEN}[INFO]${NC} Kafka is accessible"
+else
+    echo -e "${RED}[ERROR]${NC} Kafka is not accessible on ${KAFKA_DIRECT_URL}"
+    echo -e "${GREEN}[INFO]${NC} Check with: docker ps | grep kafka"
+    exit 1
+fi
+
+# Check if Kafka through toxiproxy is accessible
+echo -e "${GREEN}[INFO]${NC} Checking Kafka through toxiproxy..."
+if nc -z localhost 19092 2>/dev/null; then
+    echo -e "${GREEN}[INFO]${NC} Kafka through toxiproxy is accessible"
+else
+    echo -e "${RED}[ERROR]${NC} Kafka through toxiproxy is not accessible on ${KAFKA_TOXI_URL}"
+    exit 1
+fi
+
+# Reset toxiproxy services to clean state
+echo -e "${GREEN}[INFO]${NC} Resetting toxiproxy services to clean state..."
+curl -s -X DELETE "${TOXIPROXY_POSTGRES_URL}/proxies/postgres/toxics" > /dev/null || true
+curl -s -X DELETE "${TOXIPROXY_KAFKA_URL}/proxies/kafka/toxics" > /dev/null || true
+curl -s -X POST "${TOXIPROXY_POSTGRES_URL}/proxies/postgres" -H "Content-Type: application/json" -d '{"enabled": true}' > /dev/null || true
+curl -s -X POST "${TOXIPROXY_KAFKA_URL}/proxies/kafka" -H "Content-Type: application/json" -d '{"enabled": true}' > /dev/null || true
+
+echo ""
+echo -e "${GREEN}[INFO]${NC} =================================================="
+echo -e "${GREEN}[INFO]${NC} Running Scenario 7: Combined Failures"
+echo -e "${GREEN}[INFO]${NC} =================================================="
+echo ""
+
+# Run the test. Capture the exit code with `||` so that `set -e` does not
+# abort the script before the result banner and cleanup when the test fails.
+TEST_EXIT_CODE=0
+go test -v ./test/chaos -run TestScenario07 || TEST_EXIT_CODE=$?
+
+echo ""
+if [ "$TEST_EXIT_CODE" -eq 0 ]; then
+    echo -e "${GREEN}[INFO]${NC} =================================================="
+    echo -e "${GREEN}[INFO]${NC} ✅ Test PASSED"
+    echo -e "${GREEN}[INFO]${NC} =================================================="
+else
+    echo -e "${RED}[ERROR]${NC} =================================================="
+    echo -e "${RED}[ERROR]${NC} ❌ Test FAILED"
+    echo -e "${RED}[ERROR]${NC} =================================================="
+fi
+echo ""
+
+# Cleanup: Reset toxiproxy to clean state
+echo -e "${GREEN}[INFO]${NC} Cleaning up toxiproxy state..."
+curl -s -X DELETE "${TOXIPROXY_POSTGRES_URL}/proxies/postgres/toxics" > /dev/null || true
+curl -s -X DELETE "${TOXIPROXY_KAFKA_URL}/proxies/kafka/toxics" > /dev/null || true
+curl -s -X POST "${TOXIPROXY_POSTGRES_URL}/proxies/postgres" -H "Content-Type: application/json" -d '{"enabled": true}' > /dev/null || true
+curl -s -X POST "${TOXIPROXY_KAFKA_URL}/proxies/kafka" -H "Content-Type: application/json" -d '{"enabled": true}' > /dev/null || true
+echo -e "${GREEN}[INFO]${NC} Cleanup complete"
+
+exit $TEST_EXIT_CODE
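
A note on exit-code handling in scripts like this one: under `set -e`, a command that fails as a plain statement ends the script on the spot, so its status has to be captured as part of a `||` list if later cleanup is meant to run. A minimal sketch of that pattern follows; `capture_status` is a hypothetical helper, not part of the script.

```bash
#!/usr/bin/env bash
# capture_status <command...>: run a command with `set -e` active and print
# its exit status instead of aborting. The body runs in a subshell so the
# `set -e` does not leak into the caller.
capture_status() (
    set -e
    rc=0
    # A command on the left side of `||` is exempt from `set -e`.
    "$@" || rc=$?
    echo "$rc"
)

capture_status true    # prints 0
capture_status false   # prints 1
```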
