|
1 | 1 | # Query data in S3 Bucket with Amazon Athena, Glue Catalog & CloudFormation |
2 | 2 |
|
3 | | -| Key | Value | |
4 | | -| ------------ | ------------------------------------------------------------------------------------- | |
5 | | -| Environment | <img src="https://img.shields.io/badge/LocalStack-deploys-4D29B4.svg?logo="> <img src="https://img.shields.io/badge/AWS-deploys-F29100.svg?logo=amazon"> | |
6 | | -| Services | Glue, Athena, S3, CloudFormation | |
7 | | -| Integrations | CloudFormation | |
8 | | -| Categories | Big Data | |
9 | | -| Level | Intermediate | |
10 | | -| GitHub | [Repository link](https://github.com/aws-samples/query-data-in-s3-with-amazon-athena-and-aws-sdk-for-dotnet) | |
| 3 | +| Key | Value | |
| 4 | +| ------------ | --------------------------------------------------------------------------------- | |
| 5 | +| Environment | LocalStack, AWS | |
| 6 | +| Services | S3, Athena, Glue, CloudFormation | |
| 7 | +| Integrations | AWS CLI, CloudFormation | |
| 8 | +| Categories | Big Data, Analytics, Data Lake | |
| 9 | +| Level | Intermediate | |
| 10 | +| Use Case | Resource Browsers, Big Data Testing | |
| 11 | +| GitHub | [Repository link](https://github.com/localstack/sample-query-data-s3-athena-glue) | |
11 | 12 |
|
12 | 13 | ## Introduction |
13 | 14 |
|
14 | | -The Query data in S3 Bucket application sample demonstrates how you can leverage Amazon Athena to run standard SQL to analyze a large amount of data in Amazon S3 buckets. The application sample fetches COVID-19 data from the [Registry of Open Data on AWS](https://registry.opendata.aws/) and allows you to run Athena SQL queries using the [LocalStack Web Application](https://app.localstack.cloud) to list the results from Athena Database/Tables. Users can deploy this application setup via Glue Catalog on AWS & LocalStack using CloudFormation with no changes. To test this application sample, we will demonstrate how you use LocalStack to deploy the infrastructure on your developer machine and your CI environment and run queries against the deployed resources on LocalStack's [Athena Resource Browser](https://app.localstack.cloud/resources/athena/sql). |
| 15 | +This sample demonstrates how to build a comprehensive data analytics pipeline using Amazon Athena, S3, and Glue Catalog to query large datasets stored in a data lake. Starting with raw COVID-19 datasets from the [Registry of Open Data on AWS](https://registry.opendata.aws/), you'll deploy a complete analytics infrastructure that enables running standard SQL queries against structured data in S3 buckets. To test this application sample, we will demonstrate how to use LocalStack to deploy the infrastructure on your developer machine and validate big data workflows locally. The demo showcases LocalStack's [Resource Browser capabilities](https://docs.localstack.cloud/aws/capabilities/web-app/resource-browser/) for exploring Athena databases and running interactive SQL queries without the cost and complexity of AWS infrastructure. |
15 | 16 |
|
16 | | -## Architecture diagram |
| 17 | +:::note |
| 18 | +- Initial service startup may take several minutes for dependency installation |
| 19 | +- Query performance is optimized for development testing, not production-scale analytics |
| 20 | +- Dataset size is limited to sample COVID-19 data for demonstration purposes |
| 21 | +::: |
| 22 | + |
| 23 | +## Architecture |
17 | 24 |
|
18 | 25 | The following diagram shows the architecture that this sample application builds and deploys: |
19 | 26 |
|
20 | 27 |  |
21 | 28 |
|
22 | | -We are using the following AWS services and their features to build our infrastructure: |
23 | | - |
24 | | -- [S3](https://docs.localstack.cloud/user-guide/aws/s3/) to store the datasets and the results of the Athena SQL queries. |
25 | | -- [Glue](https://docs.localstack.cloud/user-guide/aws/glue/) Data Catalog to set up the definitions for that data and create the database & tables. |
26 | | -- [Athena](https://docs.localstack.cloud/user-guide/aws/athena/) as a serverless interactive query service to query the data in the AWS COVID-19 data lake. |
27 | | -- [CloudFormation](https://docs.localstack.cloud/user-guide/aws/cloudformation/) as an Infrastructure-as-Code (IaC) framework to create our stack, which includes the `covid-19` database in our Data Catalog. |
| 29 | +- [S3 Buckets](https://docs.localstack.cloud/aws/services/s3/) for storing COVID-19 datasets and Athena query results |
| 30 | +- [Glue Data Catalog](https://docs.localstack.cloud/aws/services/glue/) for metadata management and schema definitions |
| 31 | +- [Athena](https://docs.localstack.cloud/aws/services/athena/) serverless query service for interactive SQL analytics |
| 32 | +- [CloudFormation](https://docs.localstack.cloud/aws/services/cloudformation/) for Infrastructure as Code deployment |
| 33 | +- Multiple data sources: hospital beds, vaccine distribution, and aggregated case data |
28 | 34 |
|
29 | 35 | ## Prerequisites |
30 | 36 |
|
31 | | -- LocalStack Pro with the [`localstack` CLI](https://docs.localstack.cloud/getting-started/installation/#localstack-cli). |
32 | | -- [AWS CLI](https://docs.localstack.cloud/user-guide/integrations/aws-cli/) with the [`awslocal` wrapper](https://docs.localstack.cloud/user-guide/integrations/aws-cli/#localstack-aws-cli-awslocal). |
| 37 | +- [`localstack` CLI](https://docs.localstack.cloud/getting-started/installation/#localstack-cli) with a [`LOCALSTACK_AUTH_TOKEN`](https://docs.localstack.cloud/getting-started/auth-token/) |
| 38 | +- [AWS CLI](https://docs.localstack.cloud/user-guide/integrations/aws-cli/) with the [`awslocal` wrapper](https://docs.localstack.cloud/user-guide/integrations/aws-cli/#localstack-aws-cli-awslocal) |
| 39 | +- [`make`](https://www.gnu.org/software/make/) (**optional**, but recommended for running the sample application) |
| 40 | + |
| 41 | +:::note |
| 42 | +This sample uses Athena and the Glue Data Catalog, which lazily download and install various dependencies at runtime, increasing the processing time on the first load. To avoid this, you can pull the Big Data Mono container image, which ships with the default dependencies pre-installed. |
| 43 | +```shell |
| 44 | +docker pull localstack/localstack-pro:latest-bigdata |
| 45 | +``` |
| 46 | +Start the container with the `IMAGE_NAME=localstack/localstack-pro:latest-bigdata` configuration variable to use the pre-installed dependencies. |
| 47 | +::: |
| 48 | + |
| 49 | +## Installation |
| 50 | + |
| 51 | +To run the sample application, you need a local copy of the repository. |
| 52 | + |
| 53 | +First, clone the repository: |
| 54 | + |
| 55 | +```shell |
| 56 | +git clone https://github.com/localstack/sample-query-data-s3-athena-glue.git |
| 57 | +``` |
| 58 | + |
| 59 | +Then, navigate to the project directory: |
33 | 60 |
|
34 | | -We are using Athena & Glue Data Catalog in our sample application. These services are available in a BigData Mono container which installs dependencies directly into the LocalStack (`localstack-main`) container. While launching these services for the first time, the BigData Mono container will download the required dependencies (Hadoop, Hive, Presto, etc.) and install them into the LocalStack container. This process may take a few minutes. |
| 61 | +```shell |
| 62 | +cd sample-query-data-s3-athena-glue |
| 63 | +``` |
| 64 | + |
| 65 | +No additional installation steps are required as the sample uses CloudFormation templates and AWS CLI commands. |
35 | 66 |
|
36 | | -To circumvent this, you can pull the `localstack/localstack-pro:2.0.0-bigdata` Mono container image with pre-installed default dependencies. You can launch the container with the LocalStack CLI or via [Docker](https://docs.localstack.cloud/getting-started/installation/#docker)/[Docker Compose](https://docs.localstack.cloud/getting-started/installation/#docker-compose). |
| 67 | +## Deployment |
37 | 68 |
|
38 | 69 | Start LocalStack Pro with the `LOCALSTACK_AUTH_TOKEN` pre-configured: |
39 | 70 |
|
40 | | -```sh |
41 | | -export LOCALSTACK_AUTH_TOKEN=<your-auth-token> |
42 | | -localstack start |
| 71 | +```shell |
| 72 | +localstack auth set-token <LOCALSTACK_AUTH_TOKEN> |
| 73 | +IMAGE_NAME=localstack/localstack-pro:latest-bigdata localstack start |
| 74 | +``` |
| 75 | + |
| 76 | +To deploy the sample application infrastructure, run the following command: |
| 77 | + |
| 78 | +```shell |
| 79 | +make deploy |
| 80 | +``` |
| 81 | + |
| 82 | +Alternatively, you can deploy manually step-by-step. |
| 83 | + |
| 84 | +### Create S3 bucket and upload data |
| 85 | + |
| 86 | +```shell |
| 87 | +awslocal s3 mb s3://covid19-lake |
| 88 | +awslocal s3 cp cloudformation-templates/CovidLakeStack.template.json s3://covid19-lake/cfn/CovidLakeStack.template.json |
| 89 | +awslocal s3 sync ./covid19-lake-data/ s3://covid19-lake/ |
| 90 | +``` |
| 91 | + |
| 92 | +### Deploy CloudFormation stack |
| 93 | + |
| 94 | +```shell |
| 95 | +awslocal cloudformation create-stack --stack-name covid-lake-stack --template-url https://covid19-lake.s3.us-east-2.amazonaws.com/cfn/CovidLakeStack.template.json |
| 96 | +``` |
| 97 | + |
| 98 | +### Verify deployment |
| 99 | + |
| 100 | +```shell |
| 101 | +awslocal cloudformation describe-stacks --stack-name covid-lake-stack | grep StackStatus |
| 102 | +``` |
| 103 | + |
| 104 | +Wait for `CREATE_COMPLETE` status before proceeding. |
| 105 | + |
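Rather than re-running the status check by hand, the wait can be scripted. The following is a minimal sketch, not part of the sample repository: `wait_for_stack` is a hypothetical helper, and the retry count and delay are assumed defaults that you may need to tune.

```shell
# wait_for_stack STACK_NAME [MAX_ATTEMPTS] [DELAY_SECONDS]
# Hypothetical helper: polls the stack status until it reaches CREATE_COMPLETE.
wait_for_stack() {
  stack_name="$1"
  max_attempts="${2:-30}"   # assumed retry budget
  delay="${3:-5}"           # assumed pause between polls, in seconds
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # Ask the CLI for just the status string instead of grepping the JSON
    status=$(awslocal cloudformation describe-stacks \
      --stack-name "$stack_name" \
      --query 'Stacks[0].StackStatus' --output text)
    case "$status" in
      CREATE_COMPLETE) echo "$status"; return 0 ;;
      CREATE_IN_PROGRESS) sleep "$delay" ;;
      *) echo "unexpected status: $status" >&2; return 1 ;;
    esac
    attempt=$((attempt + 1))
  done
  echo "timed out waiting for $stack_name" >&2
  return 1
}
```

Call it as `wait_for_stack covid-lake-stack`. The AWS CLI also provides a built-in waiter, `awslocal cloudformation wait stack-create-complete --stack-name covid-lake-stack`, which may be sufficient if you do not need custom timeouts.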
| 106 | +## Testing |
| 107 | + |
| 108 | +After deployment, you can test the analytics pipeline using the LocalStack Web Application's Athena SQL viewer at [https://app.localstack.cloud/inst/default/resources/athena/sql](https://app.localstack.cloud/inst/default/resources/athena/sql). |
| 109 | + |
| 110 | +### Query Examples |
| 111 | + |
| 112 | +Run queries against the `covid-19` database in the Glue Data Catalog: |
| 113 | + |
| 114 | +#### Hospital beds data |
| 115 | + |
| 116 | +```sql |
| 117 | +SELECT * FROM covid_19.hospital_beds LIMIT 10 |
| 118 | +``` |
| 119 | + |
| 120 | +// Space for screenshot: hospital-beds-per-us-state-athena-query.png |
| 121 | + |
| 122 | +#### Aggregated COVID data by states |
| 123 | + |
| 124 | +```sql |
| 125 | +SELECT * FROM covid_19.enigma_aggregation_us_states |
| 126 | +``` |
| 127 | + |
| 128 | +// Space for screenshot: agreggated-covid-test-data-cases-athena-query.png |
| 129 | + |
| 130 | +#### Moderna vaccine distribution |
| 131 | + |
| 132 | +```sql |
| 133 | +SELECT * FROM covid_19.cdc_moderna_vaccine_distribution |
| 134 | +``` |
| 135 | + |
| 136 | +// Space for screenshot: moderna-vaccine-allocations-athena-query.png |
| 137 | + |
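The queries above can also be run from a terminal instead of the Web Application. The sketch below is an illustration under assumptions: `run_athena_query` is a hypothetical helper, and the result location `s3://covid19-lake/athena-results/` is an assumed prefix that the sample itself does not define.

```shell
# run_athena_query "SQL" [S3_OUTPUT_LOCATION]
# Hypothetical helper: starts a query, waits for it to finish, prints the results.
run_athena_query() {
  query="$1"
  output_location="${2:-s3://covid19-lake/athena-results/}"  # assumed prefix
  query_id=$(awslocal athena start-query-execution \
    --query-string "$query" \
    --result-configuration "OutputLocation=$output_location" \
    --query 'QueryExecutionId' --output text) || return 1

  attempt=1
  while [ "$attempt" -le 60 ]; do   # assumed upper bound on polls
    state=$(awslocal athena get-query-execution \
      --query-execution-id "$query_id" \
      --query 'QueryExecution.Status.State' --output text) || return 1
    case "$state" in
      SUCCEEDED)
        awslocal athena get-query-results \
          --query-execution-id "$query_id" --output table
        return $? ;;
      QUEUED|RUNNING) sleep 2 ;;
      *) echo "query $query_id ended in state: $state" >&2; return 1 ;;
    esac
    attempt=$((attempt + 1))
  done
  echo "timed out waiting for query $query_id" >&2
  return 1
}
```

For example, `run_athena_query "SELECT * FROM covid_19.hospital_beds LIMIT 10"` would print the first rows of the hospital beds table once the query succeeds.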
| 138 | +#### Integration tests |
| 139 | + |
| 140 | +You can also run automated integration tests: |
| 141 | + |
| 142 | +```shell |
| 143 | +make test |
43 | 144 | ``` |
44 | 145 |
|
45 | | -> If you prefer running LocalStack in detached mode, you can add the `-d` flag to the `localstack start` command, and use Docker Desktop to view the logs. |
| 146 | +## Use Cases |
| 147 | + |
| 148 | +### Resource Browsers |
| 149 | + |
| 150 | +In this sample, LocalStack's Resource Browser provides a web-based interface for interacting with Athena and Glue services without requiring additional tooling or AWS console access. |
| 151 | + |
| 152 | +The [Resource Browser](https://app.localstack.cloud/inst/default/resources/athena/sql) allows you to: |
| 153 | + |
| 154 | +- Browse Glue Data Catalog databases and tables through the left navigation panel |
| 155 | +- Execute SQL queries directly in the browser with syntax highlighting and result formatting |
| 156 | +- View query execution history and rerun previous queries for iterative development |
| 157 | + |
| 158 | +This approach eliminates the need to install and configure local SQL clients or connect to remote AWS services during development. |
| 159 | + |
| 160 | +### Big Data Testing |
| 161 | + |
| 162 | +This sample includes patterns for testing big data workflows locally before deploying to production environments. LocalStack enables comprehensive validation of big data components without cloud infrastructure costs. |
| 163 | + |
| 164 | +Key testing scenarios include: |
| 165 | + |
| 166 | +- Schema and Metadata Testing: |
| 167 | + - Validate Glue Data Catalog table definitions and column mappings |
| 168 | + - Test partitioning strategies and data formats (JSON, CSV, Parquet) |
| 169 | + - Verify CloudFormation template creates correct database and table structures |
| 170 | +- Query Testing: |
| 171 | + - Execute representative SQL queries against sample datasets |
| 172 | + - Validate query execution plans and optimization strategies |
| 173 | + - Test different table join patterns and aggregation logic |
| 174 | +- Integration Testing: |
| 175 | + - End-to-end validation from S3 data ingestion through Athena query execution |
| 176 | + - Verify S3 bucket policies and access patterns work correctly |
| 177 | + |
| 178 | +LocalStack's isolated environment ensures tests don't interfere with production data while providing realistic AWS service behavior for comprehensive validation. |
| 179 | + |
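As one illustration of the schema and metadata checks listed above, the following sketch verifies that the expected tables exist in the Glue Data Catalog. `check_glue_tables` is a hypothetical helper, not something the sample ships. The database name `covid-19` and the table names are taken from the text and queries in this README; adjust them if your deployed stack differs.

```shell
# check_glue_tables DATABASE TABLE [TABLE...]
# Hypothetical helper: fails if any expected table is missing from the catalog.
check_glue_tables() {
  database="$1"
  shift
  actual=$(awslocal glue get-tables --database-name "$database" \
    --query 'TableList[].Name' --output text) || return 1
  missing=0
  for expected in "$@"; do
    # Text output is whitespace-separated; put one name per line and match whole lines
    if ! printf '%s' "$actual" | tr '\t ' '\n\n' | grep -qx -- "$expected"; then
      echo "missing table: $expected" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A call such as `check_glue_tables covid-19 hospital_beds enigma_aggregation_us_states cdc_moderna_vaccine_distribution` returns a non-zero exit code when a table is absent, which makes it usable as a step in a CI job.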
| 180 | +## Troubleshooting |
| 181 | + |
| 182 | +| Issue | Resolution | |
| 183 | +|-------|-----------| |
| 184 | +| Big Data services take a long time to start | Use the pre-built `localstack/localstack-pro:latest-bigdata` Docker image to avoid installing dependencies at runtime |
| 185 | +| CloudFormation stack creation fails | Verify that the S3 bucket exists and the template is uploaded before creating the stack |
| 186 | +| Athena queries return no results | Check that the Glue Data Catalog tables are created and the S3 data is uploaded |
| 187 | +| Resource Browser not loading | Ensure LocalStack is running and the stack has been created successfully |
| 188 | +| Query execution timeouts | Reduce query complexity for development testing and review the LocalStack logs for errors |
| 189 | + |
| 190 | +## Summary |
| 191 | + |
| 192 | +This sample application demonstrates how to build, deploy, and test a complete big data analytics pipeline using AWS services and LocalStack. It showcases the following patterns: |
| 193 | + |
| 194 | +- Deploying scalable data lake architectures using S3, Athena, and Glue Data Catalog with CloudFormation |
| 195 | +- Running interactive SQL analytics against large datasets stored in S3 buckets |
| 196 | +- Using LocalStack's Resource Browser for intuitive data exploration and query development |
| 197 | +- Implementing comprehensive testing strategies for big data workflows in local environments |
| 198 | +- Leveraging AWS parity to ensure consistent behavior between development and production environments |
| 199 | +- Managing metadata and schema evolution through Glue Data Catalog integration |
| 200 | + |
| 201 | +The application provides a foundation for understanding enterprise data analytics patterns and building cost-effective development workflows for AWS big data services. |
| 202 | + |
| 203 | +## Learn More |
| 204 | + |
| 205 | +- [LocalStack Athena Documentation](https://docs.localstack.cloud/aws/services/athena/) |
| 206 | +- [LocalStack Glue Data Catalog](https://docs.localstack.cloud/aws/services/glue/) |
| 207 | +- [LocalStack Resource Browser](https://docs.localstack.cloud/user-guide/web-application/resource-browser/) |
| 208 | +- [AWS Data Lake Architecture](https://aws.amazon.com/big-data/datalakes-and-analytics/) |
| 209 | +- [COVID-19 Data Lake Blog Post](https://aws.amazon.com/blogs/big-data/a-public-data-lake-for-analysis-of-covid-19-data/) |
| 210 | + |
| 211 | + |
| 212 | + |
46 | 213 |
|
47 | 214 | ## Instructions |
48 | 215 |
|
|