Commit f9154cd

committed
revamp readme
1 parent 659ef34 commit f9154cd

File tree

1 file changed

+191
-24
lines changed

readme.md

Lines changed: 191 additions & 24 deletions
Original file line number | Diff line number | Diff line change
@@ -1,48 +1,215 @@
11
# Query data in S3 Bucket with Amazon Athena, Glue Catalog & CloudFormation
22

3-
| Key | Value |
4-
| ------------ | ------------------------------------------------------------------------------------- |
5-
| Environment | <img src="https://img.shields.io/badge/LocalStack-deploys-4D29B4.svg?logo="> <img src="https://img.shields.io/badge/AWS-deploys-F29100.svg?logo=amazon"> |
6-
| Services | Glue, Athena, S3, CloudFormation |
7-
| Integrations | CloudFormation |
8-
| Categories | Big Data |
9-
| Level | Intermediate |
10-
| GitHub | [Repository link](https://github.com/aws-samples/query-data-in-s3-with-amazon-athena-and-aws-sdk-for-dotnet) |
3+
| Key | Value |
4+
| ------------ | --------------------------------------------------------------------------------- |
5+
| Environment | LocalStack, AWS |
6+
| Services | S3, Athena, Glue, CloudFormation |
7+
| Integrations | AWS CLI, CloudFormation |
8+
| Categories | Big Data, Analytics, Data Lake |
9+
| Level | Intermediate |
10+
| Use Case | Resource Browsers, Big Data Testing |
11+
| GitHub | [Repository link](https://github.com/localstack/sample-query-data-s3-athena-glue) |
1112

1213
## Introduction
1314

14-
The Query data in S3 Bucket application sample demonstrates how you can leverage Amazon Athena to run standard SQL to analyze a large amount of data in Amazon S3 buckets. The application sample fetches COVID-19 data from the [Registry of Open Data on AWS](https://registry.opendata.aws/) and allows you to run Athena SQL queries using the [LocalStack Web Application](https://app.localstack.cloud) to list the results from Athena Database/Tables. Users can deploy this application setup via Glue Catalog on AWS & LocalStack using CloudFormation with no changes. To test this application sample, we will demonstrate how you use LocalStack to deploy the infrastructure on your developer machine and your CI environment and run queries against the deployed resources on LocalStack's [Athena Resource Browser](https://app.localstack.cloud/resources/athena/sql).
15+
This sample demonstrates how to build a data analytics pipeline with Amazon Athena, S3, and the Glue Data Catalog to query datasets stored in a data lake. Starting with raw COVID-19 datasets from the [Registry of Open Data on AWS](https://registry.opendata.aws/), you'll deploy an analytics infrastructure that lets you run standard SQL queries against structured data in S3 buckets. To test the sample, you'll use LocalStack to deploy the infrastructure on your developer machine and validate big data workflows locally. The demo also showcases LocalStack's [Resource Browser capabilities](https://docs.localstack.cloud/aws/capabilities/web-app/resource-browser/) for exploring Athena databases and running interactive SQL queries without the cost and complexity of AWS infrastructure.
1516

16-
## Architecture diagram
17+
:::note
18+
- Initial service startup may take several minutes for dependency installation
19+
- Query performance is optimized for development testing, not production-scale analytics
20+
- Dataset size is limited to sample COVID-19 data for demonstration purposes
21+
:::
22+
23+
## Architecture
1724

1825
The following diagram shows the architecture that this sample application builds and deploys:
1926

2027
![Architecture diagram to showcase how we can query data in S3 Bucket with Amazon Athena, Glue Catalog deployed using CloudFormation over LocalStack](./images/architecture.png)
2128

22-
We are using the following AWS services and their features to build our infrastructure:
23-
24-
- [S3](https://docs.localstack.cloud/user-guide/aws/s3/) to store the datasets and the results of the Athena SQL queries.
25-
- [Glue](https://docs.localstack.cloud/user-guide/aws/glue/) Data Catalog to set up the definitions for that data and create the database & tables.
26-
- [Athena](https://docs.localstack.cloud/user-guide/aws/athena/) as a serverless interactive query service to query the data in the AWS COVID-19 data lake.
27-
- [CloudFormation](https://docs.localstack.cloud/user-guide/aws/cloudformation/) as an Infrastructure-as-Code (IaC) framework to create our stack, which includes the `covid-19` database in our Data Catalog.
29+
- [S3 Buckets](https://docs.localstack.cloud/aws/services/s3/) for storing COVID-19 datasets and Athena query results
30+
- [Glue Data Catalog](https://docs.localstack.cloud/aws/services/glue/) for metadata management and schema definitions
31+
- [Athena](https://docs.localstack.cloud/aws/services/athena/) serverless query service for interactive SQL analytics
32+
- [CloudFormation](https://docs.localstack.cloud/aws/services/cloudformation/) for Infrastructure as Code deployment
33+
- Multiple data sources: hospital beds, vaccine distribution, and aggregated case data
2834

2935
## Prerequisites
3036

31-
- LocalStack Pro with the [`localstack` CLI](https://docs.localstack.cloud/getting-started/installation/#localstack-cli).
32-
- [AWS CLI](https://docs.localstack.cloud/user-guide/integrations/aws-cli/) with the [`awslocal` wrapper](https://docs.localstack.cloud/user-guide/integrations/aws-cli/#localstack-aws-cli-awslocal).
37+
- [`localstack` CLI](https://docs.localstack.cloud/getting-started/installation/#localstack-cli) with a [`LOCALSTACK_AUTH_TOKEN`](https://docs.localstack.cloud/getting-started/auth-token/)
38+
- [AWS CLI](https://docs.localstack.cloud/user-guide/integrations/aws-cli/) with the [`awslocal` wrapper](https://docs.localstack.cloud/user-guide/integrations/aws-cli/#localstack-aws-cli-awslocal)
39+
- [`make`](https://www.gnu.org/software/make/) (**optional**, but recommended for running the sample application)
40+
41+
:::note
42+
This sample uses Athena and the Glue Data Catalog, which require various dependencies to be lazily downloaded and installed at runtime. This increases the processing time on the first load. To mitigate this, you can pull the Big Data Mono container image, which has the default dependencies pre-installed.
43+
```shell
44+
docker pull localstack/localstack-pro:latest-bigdata
45+
```
46+
Start the container with the `IMAGE_NAME=localstack/localstack-pro:latest-bigdata` configuration variable set to use the pre-installed dependencies.
47+
:::
48+
49+
## Installation
50+
51+
To run the sample application, you only need a local copy of the repository.
52+
53+
First, clone the repository:
54+
55+
```shell
56+
git clone https://github.com/localstack/sample-query-data-s3-athena-glue.git
57+
```
58+
59+
Then, navigate to the project directory:
3360

34-
We are using Athena & Glue Data Catalog in our sample application. These services are available in a BigData Mono container which installs dependencies directly into the LocalStack (`localstack-main`) container. While launching these services for the first time, the BigData Mono container will download the required dependencies (Hadoop, Hive, Presto, etc.) and install them into the LocalStack container. This process may take a few minutes.
61+
```shell
62+
cd sample-query-data-s3-athena-glue
63+
```
64+
65+
No additional installation steps are required as the sample uses CloudFormation templates and AWS CLI commands.
3566

36-
To circumvent this, you can pull the `localstack/localstack-pro:2.0.0-bigdata` Mono container image with pre-installed default dependencies. You can launch the container with the LocalStack CLI or via [Docker](https://docs.localstack.cloud/getting-started/installation/#docker)/[Docker Compose](https://docs.localstack.cloud/getting-started/installation/#docker-compose).
67+
## Deployment
3768

3869
Start LocalStack Pro with the `LOCALSTACK_AUTH_TOKEN` pre-configured:
3970

40-
```sh
41-
export LOCALSTACK_AUTH_TOKEN=<your-auth-token>
42-
localstack start
71+
```shell
72+
localstack auth set-token <LOCALSTACK_AUTH_TOKEN>
73+
IMAGE_NAME=localstack/localstack-pro:latest-bigdata localstack start
74+
```
75+
76+
To deploy the sample application infrastructure, run the following command:
77+
78+
```shell
79+
make deploy
80+
```
81+
82+
Alternatively, you can deploy manually step-by-step.
83+
84+
### Create S3 bucket and upload data
85+
86+
```shell
87+
awslocal s3 mb s3://covid19-lake
88+
awslocal s3 cp cloudformation-templates/CovidLakeStack.template.json s3://covid19-lake/cfn/CovidLakeStack.template.json
89+
awslocal s3 sync ./covid19-lake-data/ s3://covid19-lake/
90+
```
91+
92+
### Deploy CloudFormation stack
93+
94+
```shell
95+
awslocal cloudformation create-stack --stack-name covid-lake-stack --template-url https://covid19-lake.s3.us-east-2.amazonaws.com/cfn/CovidLakeStack.template.json
96+
```
97+
98+
### Verify deployment
99+
100+
```shell
101+
awslocal cloudformation describe-stacks --stack-name covid-lake-stack | grep StackStatus
102+
```
103+
104+
Wait for `CREATE_COMPLETE` status before proceeding.
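
Rather than re-running the `grep` by hand, the wait can be scripted. The following is a minimal sketch, assuming `awslocal` is on your `PATH`; the function name `wait_for_stack` and its retry defaults are illustrative and not part of the sample repository.

```shell
# Sketch: poll the stack status until it settles.
# `wait_for_stack` and its defaults are hypothetical helpers, not part of the sample.
wait_for_stack() {
  stack_name="${1:-covid-lake-stack}"
  max_attempts="${2:-30}"   # assumed default: 30 attempts
  delay="${3:-5}"           # assumed default: 5 seconds between attempts
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # Ask CloudFormation for the status string only.
    status=$(awslocal cloudformation describe-stacks \
      --stack-name "$stack_name" \
      --query 'Stacks[0].StackStatus' --output text 2>/dev/null)
    case "$status" in
      CREATE_COMPLETE) echo "$status"; return 0 ;;
      *FAILED|*ROLLBACK*) echo "$status" >&2; return 1 ;;
    esac
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  echo "timed out waiting for $stack_name" >&2
  return 1
}
```

Call it as `wait_for_stack covid-lake-stack`. The AWS CLI also ships a built-in waiter, `awslocal cloudformation wait stack-create-complete --stack-name covid-lake-stack`, which may be enough if you do not need custom timeouts.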
105+
106+
## Testing
107+
108+
After deployment, you can test the analytics pipeline using the LocalStack Web Application's Athena SQL viewer at [https://app.localstack.cloud/inst/default/resources/athena/sql](https://app.localstack.cloud/inst/default/resources/athena/sql).
109+
110+
### Query Examples
111+
112+
Run queries against the `covid-19` database in the Glue Data Catalog:
113+
114+
#### Hospital beds data
115+
116+
```sql
117+
SELECT * FROM covid_19.hospital_beds LIMIT 10
118+
```
119+
120+
// Space for screenshot: hospital-beds-per-us-state-athena-query.png
121+
122+
#### Aggregated COVID data by states
123+
124+
```sql
125+
SELECT * FROM covid_19.enigma_aggregation_us_states
126+
```
127+
128+
// Space for screenshot: agreggated-covid-test-data-cases-athena-query.png
129+
130+
#### Moderna vaccine distribution
131+
132+
```sql
133+
SELECT * FROM covid_19.cdc_moderna_vaccine_distribution
134+
```
135+
136+
// Space for screenshot: moderna-vaccine-allocations-athena-query.png
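
The same queries can likely be run from the command line instead of the Resource Browser. The sketch below assumes `awslocal` is on your `PATH`; the helper name `run_athena_query`, the `athena-results/` prefix, and the retry limit are illustrative choices, not something the sample defines.

```shell
# Sketch: submit one Athena query through the CLI and print its result set.
# `run_athena_query` and the results prefix are hypothetical, not part of the sample.
run_athena_query() {
  query="$1"
  output_location="${2:-s3://covid19-lake/athena-results/}"  # assumed prefix

  # Submit the query and keep its execution ID.
  query_id=$(awslocal athena start-query-execution \
    --query-string "$query" \
    --result-configuration "OutputLocation=$output_location" \
    --query 'QueryExecutionId' --output text) || return 1

  # Poll until the query leaves the QUEUED/RUNNING states (assumed cap: 60 tries).
  attempt=1
  while [ "$attempt" -le 60 ]; do
    state=$(awslocal athena get-query-execution \
      --query-execution-id "$query_id" \
      --query 'QueryExecution.Status.State' --output text) || return 1
    case "$state" in
      SUCCEEDED)
        awslocal athena get-query-results --query-execution-id "$query_id"
        return $? ;;
      FAILED|CANCELLED)
        echo "query $query_id ended in state $state" >&2
        return 1 ;;
    esac
    attempt=$((attempt + 1))
    sleep 2
  done
  echo "query $query_id did not finish in time" >&2
  return 1
}
```

For example, `run_athena_query 'SELECT * FROM covid_19.hospital_beds LIMIT 10'` would submit the first query above and print the JSON result set once it succeeds.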
137+
138+
#### Integration tests
139+
140+
You can also run automated integration tests:
141+
142+
```shell
143+
make test
43144
```
44145

45-
> If you prefer running LocalStack in detached mode, you can add the `-d` flag to the `localstack start` command, and use Docker Desktop to view the logs.
146+
## Use Cases
147+
148+
### Resource Browsers
149+
150+
In this sample, LocalStack's Resource Browser provides a web-based interface for interacting with Athena and Glue services without requiring additional tooling or AWS console access.
151+
152+
The [Resource Browser](https://app.localstack.cloud/inst/default/resources/athena/sql) allows you to:
153+
154+
- Browse Glue Data Catalog databases and tables through the left navigation panel
155+
- Execute SQL queries directly in the browser with syntax highlighting and result formatting
156+
- View query execution history and rerun previous queries for iterative development
157+
158+
This approach eliminates the need to install and configure local SQL clients or connect to remote AWS services during development.
159+
160+
### Big Data Testing
161+
162+
This sample includes patterns for testing big data workflows locally before deploying to production environments. LocalStack enables comprehensive validation of big data components without cloud infrastructure costs.
163+
164+
Key testing scenarios include:
165+
166+
- Schema and Metadata Testing:
167+
  - Validate Glue Data Catalog table definitions and column mappings
168+
  - Test partitioning strategies and data formats (JSON, CSV, Parquet)
169+
  - Verify CloudFormation template creates correct database and table structures
170+
- Query Testing:
171+
  - Execute representative SQL queries against sample datasets
172+
  - Validate query execution plans and optimization strategies
173+
  - Test different table join patterns and aggregation logic
174+
- Integration Testing:
175+
  - End-to-end validation from S3 data ingestion through Athena query execution
176+
  - Verify S3 bucket policies and access patterns work correctly
177+
178+
LocalStack's isolated environment ensures tests don't interfere with production data while providing realistic AWS service behavior for comprehensive validation.
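
As one illustration of the schema checks listed above, the sketch below asserts that the Glue database contains the tables the example queries rely on. It assumes `awslocal` is on your `PATH`; `check_glue_tables` is a hypothetical helper, and the database name `covid-19` is taken from this readme's text, so adjust it if your catalog reports a different name.

```shell
# Sketch: verify that expected tables exist in a Glue database.
# `check_glue_tables` is a hypothetical helper, not part of the sample.
check_glue_tables() {
  database="$1"; shift
  # --output text returns the table names separated by whitespace.
  tables=$(awslocal glue get-tables --database-name "$database" \
    --query 'TableList[].Name' --output text) || return 1
  missing=0
  for expected in "$@"; do
    found=0
    for name in $tables; do
      if [ "$name" = "$expected" ]; then found=1; fi
    done
    if [ "$found" -eq 0 ]; then
      echo "missing table: $expected" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A call such as `check_glue_tables covid-19 hospital_beds enigma_aggregation_us_states cdc_moderna_vaccine_distribution` returns a non-zero exit code when any table is absent, which makes it easy to wire into a CI step.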
179+
180+
## Troubleshooting
181+
182+
| Issue | Resolution |
183+
|-------|-----------|
184+
| Big Data services take a long time to start | Use the pre-built `localstack-pro:latest-bigdata` Docker image to avoid dependency installation |
185+
| CloudFormation stack creation fails | Verify that the S3 bucket exists and the template is uploaded before creating the stack |
186+
| Athena queries return no results | Check Glue Data Catalog tables are created and S3 data is properly uploaded |
187+
| Resource Browser not loading | Ensure LocalStack is running and the stack has been created successfully |
188+
| Query execution timeouts | Reduce query complexity for development testing and review the LocalStack logs for any errors |
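
Several of the checks in this table can be scripted. The following is a rough sketch, assuming LocalStack listens on its default edge port `4566` and that `curl` and `awslocal` are installed; the helper names `check` and `diagnose` are illustrative only.

```shell
# Sketch: run a few of the troubleshooting checks in one pass.
# `check` and `diagnose` are hypothetical helpers, not part of the sample.
check() {
  label="$1"; shift
  # Run the remaining arguments as a command and report ok/FAIL.
  if "$@" >/dev/null 2>&1; then
    echo "ok   $label"
  else
    echo "FAIL $label"
    return 1
  fi
}

diagnose() {
  failures=0
  check "LocalStack is reachable" \
    curl -sf http://localhost:4566/_localstack/health || failures=$((failures + 1))
  check "bucket covid19-lake exists" \
    awslocal s3api head-bucket --bucket covid19-lake || failures=$((failures + 1))
  check "template is uploaded" \
    awslocal s3api head-object --bucket covid19-lake \
      --key cfn/CovidLakeStack.template.json || failures=$((failures + 1))
  check "stack covid-lake-stack exists" \
    awslocal cloudformation describe-stacks \
      --stack-name covid-lake-stack || failures=$((failures + 1))
  # Succeed only when every check passed.
  [ "$failures" -eq 0 ]
}
```

Running `diagnose` prints one line per check and exits non-zero if any of them fails, which narrows down which row of the table applies.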
189+
190+
## Summary
191+
192+
This sample application demonstrates how to build, deploy, and test a complete big data analytics pipeline using AWS services and LocalStack. It showcases the following patterns:
193+
194+
- Deploying scalable data lake architectures using S3, Athena, and Glue Data Catalog with CloudFormation
195+
- Running interactive SQL analytics against large datasets stored in S3 buckets
196+
- Using LocalStack's Resource Browser for intuitive data exploration and query development
197+
- Implementing comprehensive testing strategies for big data workflows in local environments
198+
- Leveraging AWS parity to ensure consistent behavior between development and production environments
199+
- Managing metadata and schema evolution through Glue Data Catalog integration
200+
201+
The application provides a foundation for understanding enterprise data analytics patterns and building cost-effective development workflows for AWS big data services.
202+
203+
## Learn More
204+
205+
- [LocalStack Athena Documentation](https://docs.localstack.cloud/aws/services/athena/)
206+
- [LocalStack Glue Data Catalog](https://docs.localstack.cloud/aws/services/glue/)
207+
- [LocalStack Resource Browser](https://docs.localstack.cloud/user-guide/web-application/resource-browser/)
208+
- [AWS Data Lake Architecture](https://aws.amazon.com/big-data/datalakes-and-analytics/)
209+
- [COVID-19 Data Lake Blog Post](https://aws.amazon.com/blogs/big-data/a-public-data-lake-for-analysis-of-covid-19-data/)
210+
211+
212+
46213

47214
## Instructions
48215
