Commit 5cf04ed

fix: Creating a proper RBAC to AWS role mapping to be able to access K8s APIs (#16)
1 parent 75c94cb commit 5cf04ed

6 files changed: 162 additions & 102 deletions


README.md

Lines changed: 22 additions & 8 deletions
````diff
@@ -1,18 +1,19 @@
 # terraform-aws-recycle-eks
 
 This module creates a terraform module to recycle EKS worker nodes. The high level functionalities are explained below,
-- Use a lamdba to take an instance id as an input, to put it in standby state. Using autoscaling api to automatically add a new instance to the group while putting the old instance to standby state. The old instance will get into "Standby" state only when the new instance is in fully "Inservice" state
+- Creates a step-function that will consist of 4 lambdas. This step function will handle the transfer of inputs across the lambda functions.
+- The first lambda takes an instance id as an input, to put it in standby state. Using autoscaling api to automatically add a new instance to the group while putting the old instance to standby state. The old instance will get into "Standby" state only when the new instance is in fully "Inservice" state
 - Taint this "Standby" node in EKS using K8S API in Lambda to prevent new pods from getting scheduled into this node
 - Periodically use K8S API check for status of “stateful” pods on that node based on the label selector provided. Another Lambda will do that
-- Once all stateful pods have completed on the node, use K8S API in another Lambda to drain the standby node
-- Once the number of running pod reached 0, shut down that standby instance using AWS SDK.
-- We are not termnating the node, only shutting it down, hust in case. In future releases, we will be start terminating the nodes
+- Once all stateful pods have completed on the node, i.e number of running pod reached 0, shut down that standby instance using AWS SDK via lambda. We are not terminating the node, only shutting it down, just in case. In future releases, we will be start terminating the nodes
+
 
 ## TODO:
 - Check for new node in service before proceeding to put the existing node in standby state. Right now we are putting a sleep of 300 sec.
-- Stop using anonymous role and find a way to map the role with a proper user
-- get_bearer_token() function used in all lambda. Refactor the code to use as a Python module.
+- Refactor the code to use as a common module for getting the access token.
 - Better logging and exception handling
+- Make use of namespace input while selecting the pods. Currently it checks for pods in all namespaces.
+- Find a terraform way to edit configmap/aws-auth, this step is still manual to make this module work.
 
 There are two main components:
 
@@ -22,10 +23,9 @@ There are two main components:
 
 ## Usage
 
-**Set up all supported AWS / Datadog integrations**
 
 ```
-module "recycl-eks-worker-npde" {
+module "recycl-eks-worker-node" {
   source = "git::git@github.com:scribd/terraform-aws-recycle-eks.git"
   name = "string"
   tags = {
@@ -35,9 +35,20 @@ module "recycl-eks-worker-npde" {
   vpc_subnet_ids = ["subnet-12345678", "subnet-87654321"]
   vpc_security_group_ids = ["sg-12345678"]
   aws_region = "us-east-2"
+  namespace = "your pod namespace" # As of now it is just a place holder we check for all namespaces now
 
 }
+
+```
+After running the module, Run `kubectl edit -n kube-system configmap/aws-auth` and add the following:
 ```
+mapRoles: |
+# ...
+  - rolearn: <IAM role for the lamda execution>
+    username: lambda
+
+```
+You can get IAM role for the lamda execution from the output variable of "lambda_exec_arn" in this module
 
 ## Running of step function
 
@@ -52,6 +63,9 @@ Step function takes an json input
 This label selector will be the identifier on which the step function will wait and rest all pods will be ignored.
 
 ```
+## Sample Output of a step function
+
+![](images/Step-Function-sample-output.png)
 
 ## Development
 
````
Binary image file added (183 KB); presumably images/Step-Function-sample-output.png referenced in the README above.
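The "Running of step function" section of the README is only partially visible in this diff. The Lambda handler changed in this commit reads `region`, `cluster_name`, `node_name`, `instance_id`, and `label_selector` from its event, so kicking off the recycle step function from Python would presumably look something like the sketch below; the state machine ARN and all input values are placeholders, not taken from this commit.

```python
import json

import boto3

# Hypothetical input for the recycle step function. The keys mirror what the
# Lambda handler in this commit reads from its event; every value is a placeholder.
execution_input = {
    "region": "us-east-2",
    "cluster_name": "my-eks-cluster",
    "node_name": "ip-10-0-1-23.us-east-2.compute.internal",
    "instance_id": "i-0123456789abcdef0",
    "label_selector": "app=my-stateful-app",
}

sfn = boto3.client("stepfunctions", region_name="us-east-2")
response = sfn.start_execution(
    # Placeholder ARN; use the state machine created by this module.
    stateMachineArn="arn:aws:states:us-east-2:123456789012:stateMachine:recycle-eks-worker-node",
    input=json.dumps(execution_input),
)
print(response["executionArn"])
```

Per the README, only pods matching `label_selector` are waited on; all other pods on the node are ignored.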

lambdas/checkNodesForRunningPods.py

Lines changed: 64 additions & 47 deletions
````diff
@@ -5,10 +5,11 @@
 import os.path
 import base64
 import logging
+import re
 import yaml
 import boto3
+import kubernetes as k8s
 from botocore.signers import RequestSigner
-from kubernetes import client, config
 
 logger = logging.getLogger(__name__)
 logger.setLevel(logging.DEBUG)
@@ -17,53 +18,59 @@
 MIRROR_POD_ANNOTATION_KEY = "kubernetes.io/config.mirror"
 CONTROLLER_KIND_DAEMON_SET = "DaemonSet"
 
-def get_bearer_token(cluster_id, region):
-    ''' create the bearer token
-    '''
-    if not os.path.exists(KUBE_FILEPATH):
-        kube_content = dict()
-        # Get data from EKS API
-        eks_api = boto3.client('eks',region_name=region)
-        cluster_info = eks_api.describe_cluster(name=cluster_id)
-        certificate = cluster_info['cluster']['certificateAuthority']['data']
-        endpoint = cluster_info['cluster']['endpoint']
-        # Generating kubeconfig
-        kube_content = dict()
-        kube_content['apiVersion'] = 'v1'
-        kube_content['clusters'] = [
+def create_kube_config(eks, cluster_name):
+    """Creates the Kubernetes config file required when instantiating the API client."""
+    cluster_info = eks.describe_cluster(name=cluster_name)['cluster']
+    certificate = cluster_info['certificateAuthority']['data']
+    endpoint = cluster_info['endpoint']
+
+    kube_config = {
+        'apiVersion': 'v1',
+        'clusters': [
         {
-        'cluster':
-          {
-          'server': endpoint,
-          'certificate-authority-data': certificate
-          },
-        'name':cluster_id
-        }]
+            'cluster':
+            {
+                'server': endpoint,
+                'certificate-authority-data': certificate
+            },
+            'name': 'k8s'
 
-        kube_content['contexts'] = [
+        }],
+        'contexts': [
         {
-        'context':
-          {
-          'cluster':cluster_id,
-          'user':'aws'
-          },
-        'name':'aws'
+            'context':
+            {
+                'cluster': 'k8s',
+                'user': 'aws'
+            },
+            'name': 'aws'
+        }],
+        'current-context': 'aws',
+        'Kind': 'config',
+        'users': [
+        {
+            'name': 'aws',
+            'user': 'lambda'
         }]
-        kube_content['current-context'] = 'aws'
-        kube_content['Kind'] = 'config'
-        kube_content['users'] = [
-        {
-        'name':'aws',
-        'user':'lambda'
-        }]
-        # Write kubeconfig
-        with open(KUBE_FILEPATH, 'w') as outfile:
-            yaml.dump(kube_content, outfile, default_flow_style=False)
+    }
+
+    with open(KUBE_FILEPATH, 'w') as kube_file_content:
+        yaml.dump(kube_config, kube_file_content, default_flow_style=False)
+
+
+def get_bearer_token(cluster, region):
+    """Creates the authentication to token required by AWS IAM Authenticator. This is
+    done by creating a base64 encoded string which represents a HTTP call to the STS
+    GetCallerIdentity Query Request
+    (https://docs.aws.amazon.com/STS/latest/APIReference/API_GetCallerIdentity.html).
+    The AWS IAM Authenticator decodes the base64 string and makes the request on behalf of the user.
+    """
     STS_TOKEN_EXPIRES_IN = 60
     session = boto3.session.Session()
 
     client = session.client('sts', region_name=region)
     service_id = client.meta.service_model.service_id
+
     signer = RequestSigner(
         service_id,
         region,
@@ -72,23 +79,29 @@ def get_bearer_token(cluster_id, region):
         session.get_credentials(),
         session.events
     )
+
     params = {
         'method': 'GET',
         'url': 'https://sts.{}.amazonaws.com/?Action=GetCallerIdentity&Version=2011-06-15'.format(region),
         'body': {},
         'headers': {
-            'x-k8s-aws-id': cluster_id
+            'x-k8s-aws-id': cluster
         },
         'context': {}
     }
+
     signed_url = signer.generate_presigned_url(
         params,
         region_name=region,
         expires_in=STS_TOKEN_EXPIRES_IN,
         operation_name=''
     )
+
     base64_url = base64.urlsafe_b64encode(signed_url.encode('utf-8')).decode('utf-8')
 
+    # need to remove base64 encoding padding:
+    # https://github.com/kubernetes-sigs/aws-iam-authenticator/issues/202
+    return 'k8s-aws-v1.' + re.sub(r'=*', '', base64_url)
 
 def get_evictable_pods(api, node_name,label_selector):
     '''
@@ -112,20 +125,24 @@ def handler(event, context):
     Lambda handler, this function will call the
     private functions to get the running pod count based on the label selector provided
     '''
+    eks = boto3.client('eks', region_name=event['region'])
+    #loading Kube Config
+    if not os.path.exists(KUBE_FILEPATH):
+        create_kube_config(eks, event['cluster_name'])
+    k8s.config.load_kube_config(KUBE_FILEPATH)
+    configuration = k8s.client.Configuration()
+    #getting the auth token
     token = get_bearer_token(event['cluster_name'],event['region'])
-    # Configure
-    config.load_kube_config(KUBE_FILEPATH)
-    configuration = client.Configuration()
     configuration.api_key['authorization'] = token
     configuration.api_key_prefix['authorization'] = 'Bearer'
     # API
-    api = client.ApiClient(configuration)
-    v1 = client.CoreV1Api(api)
+    api = k8s.client.ApiClient(configuration)
+    core_v1_api = k8s.client.CoreV1Api(api)
 
     # Get all the pods
-    runningPodCount=count_running_pods(v1,node_name=event['node_name'],
+    running_pod_count=count_running_pods(core_v1_api,node_name=event['node_name'],
                                        label_selector=event['label_selector'])
     output_json = {"region": event['region'], "node_name" : event['node_name'] ,
                    "instance_id" : event['instance_id'], "cluster_name": event['cluster_name'],
-                   "activePodCount": runningPodCount}
+                   "activePodCount": running_pod_count}
     return output_json
````
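As background on what the new `get_bearer_token` returns: the token is the `k8s-aws-v1.` prefix plus a base64url-encoded, padding-stripped presigned STS `GetCallerIdentity` URL. aws-iam-authenticator on the cluster decodes it and resolves it to the Lambda's IAM role, which the `mapRoles` entry added in the README then maps to the `lambda` user. Below is a minimal sketch of reversing that encoding, which can help when debugging the mapping; the helper is illustrative only and not part of this commit.

```python
import base64


def decode_eks_token(token: str) -> str:
    """Recover the presigned STS GetCallerIdentity URL embedded in a token
    produced by get_bearer_token(). Illustrative helper, not part of this commit."""
    assert token.startswith("k8s-aws-v1.")
    body = token[len("k8s-aws-v1."):]
    body += "=" * (-len(body) % 4)  # restore the base64 padding that get_bearer_token strips
    return base64.urlsafe_b64decode(body.encode("utf-8")).decode("utf-8")


# Example usage (assuming the lambda module above is importable):
# token = get_bearer_token("my-eks-cluster", "us-east-2")
# print(decode_eks_token(token))
```

If the decoded URL resolves to the expected role but the Kubernetes API still returns 401/403, the aws-auth mapping or the RBAC bindings are the more likely culprit than the token itself.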

0 commit comments
