
Commit 959a549

Folder Structure
1 parent: bc6cd71


13 files changed: +175 -4 lines


.gitignore

Lines changed: 3 additions & 0 deletions
@@ -29,3 +29,6 @@
 terraform/.terraform/providers/registry.terraform.io/hashicorp/random/3.6.3/linux_amd64/LICENSE.txt
 jobs/__pycache__/udf_utils.cpython-312.pyc
 jobs/__pycache__/config.cpython-312.pyc
+
+notes.bash
+/py.py

README.md

Lines changed: 96 additions & 0 deletions
@@ -0,0 +1,96 @@
# Real-Time Streaming Unstructured Data Platform

## Overview
This platform ingests and processes **unstructured data in real time**. It brings together tools and services for data ingestion, processing, storage, orchestration, security, CI/CD, monitoring, and logging, with scalability, reliability, and security as design goals.
## Architecture

### Core Components

1. **Data Ingestion:**
   - **Apache Kafka**: Real-time data streaming and ingestion from various unstructured data sources (a minimal producer sketch follows this list).
   - **Apache Flume** or **Logstash**: Collect, aggregate, and transport data into Kafka for processing.
   - **Filebeat**: Lightweight shipper for forwarding and centralizing log data.
   - **Kinesis** (optional): Real-time data ingestion from AWS services.
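A minimal producer sketch using the `kafka-python` client, assuming a hypothetical broker at `localhost:9092` and a `raw-events` topic (neither is defined in this repo):

```python
# Sketch: push unstructured JSON events into Kafka with kafka-python.
# Broker address, topic name, and payload are placeholder assumptions.
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

producer.send("raw-events", {"source": "filebeat", "payload": "free-form text ..."})
producer.flush()  # block until outstanding messages are delivered
```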
2. **Data Processing:**
   - **Apache Spark**: Real-time analytics and transformation with Spark Streaming (see the sketch after this list).
   - **Apache Flink**: Stream processing for low-latency, high-throughput event-driven applications.
   - **Apache Beam**: Unified stream and batch processing that can run on Spark or Flink runners.
   - **TensorFlow / PyTorch**: Processing and analyzing unstructured data with machine-learning models.
   - **PrestoDB**: Distributed SQL query engine for querying large datasets.
   - **Apache NiFi**: Automating the flow of data between systems.
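A minimal Spark Structured Streaming sketch that consumes the hypothetical `raw-events` topic from the ingestion example; it assumes the `spark-sql-kafka` connector package is on the classpath:

```python
# Sketch: read a Kafka topic with Spark Structured Streaming and echo
# the raw string values to the console sink.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("stream-sketch").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
    .option("subscribe", "raw-events")                    # placeholder topic
    .load()
    .select(col("value").cast("string").alias("raw"))
)

query = events.writeStream.format("console").outputMode("append").start()
query.awaitTermination()
```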
3. **Data Storage:**
   - **NoSQL databases** (e.g., **MongoDB**, **Cassandra**, **Couchbase**): Scalable storage of unstructured data.
   - **Hadoop HDFS**: Distributed file system for large-scale data storage.
   - **Amazon S3**: Cloud object storage for scalable data-lake solutions (a small upload sketch follows this list).
   - **Data lakes**: Table formats such as **Delta Lake**, **Iceberg**, or **Apache Hudi** for handling structured and unstructured data together.
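A minimal `boto3` sketch for landing a processed batch in S3; the bucket, key, and local path are placeholders, and credentials are assumed to come from the standard AWS credential chain:

```python
# Sketch: upload a locally produced file to S3 with boto3.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="/tmp/batch-0001.parquet",  # produced by an upstream job
    Bucket="my-data-lake",               # placeholder bucket
    Key="raw/batch-0001.parquet",
)
```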
4. **CI/CD Pipeline:**
   - **Jenkins**: Automated build and deployment pipelines for continuous integration and delivery.
   - **GitLab CI / GitHub Actions**: Alternatives for managing CI/CD pipelines.
   - **Docker**: Containerizes microservices and platform components for portability and scalability.
   - **Kubernetes**: Orchestrates and scales containerized workloads.
5. **Orchestration & Workflow Automation:**
   - **Apache Airflow**: Scheduling, managing, and automating workflows such as ETL pipelines and batch processing (a minimal DAG sketch follows this list).
   - **Dagster**: Data orchestrator that manages the lifecycle of data-processing tasks.
   - **Argo Workflows**: Kubernetes-native workflow management for automating tasks across a cluster.
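A minimal Airflow DAG sketch with a single daily task; the DAG id, schedule, and callable are illustrative, not part of this repo:

```python
# Sketch: one daily Airflow task that would trigger a batch job.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def run_batch_job():
    print("kick off the Spark batch here")


with DAG(
    dag_id="daily_batch_sketch",  # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="run_batch", python_callable=run_batch_job)
```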
6. **Monitoring & Logging:**
   - **Prometheus**: Real-time monitoring and alerting for platform components (an instrumentation sketch follows this list).
   - **Grafana**: Visualization of system metrics, logs, and real-time performance data.
   - **ELK Stack** (Elasticsearch, Logstash, Kibana): Centralized logging and log analysis.
   - **Datadog**: Cloud monitoring platform for tracking infrastructure performance and health.
   - **New Relic**: Application performance management and monitoring.
   - **Sentry**: Real-time error tracking and monitoring.
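A minimal `prometheus_client` sketch that exposes a counter on `:8000/metrics` for Prometheus to scrape; the metric name and port are illustrative:

```python
# Sketch: expose a processed-events counter for scraping.
import time

from prometheus_client import Counter, start_http_server

EVENTS = Counter("events_processed_total", "Events processed by the pipeline")

start_http_server(8000)  # serves /metrics
while True:
    EVENTS.inc()         # in real code, increment inside the processing loop
    time.sleep(1)
```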
7. **Security:**
   - **OAuth 2.0** or **JWT**: Secure authentication and authorization of users and services (a token sketch follows this list).
   - **Kerberos**: Authentication system for securing communication between services.
   - **Vault by HashiCorp**: Secure management of secrets and sensitive configuration.
   - **Encryption**: TLS for data in transit and AES for data at rest.
   - **Role-Based Access Control (RBAC)**: Permissions managed by user role.
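A minimal JWT sketch using the `PyJWT` library; the secret and claims are placeholders, and a production service would fetch the signing key from Vault rather than hard-coding it:

```python
# Sketch: issue and verify a short-lived service token with PyJWT.
import datetime

import jwt

SECRET = "change-me"  # placeholder; fetch from a secret store in practice

token = jwt.encode(
    {
        "sub": "ingestion-service",
        "exp": datetime.datetime.utcnow() + datetime.timedelta(minutes=15),
    },
    SECRET,
    algorithm="HS256",
)

claims = jwt.decode(token, SECRET, algorithms=["HS256"])  # raises if invalid or expired
print(claims["sub"])
```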
8. **Data Quality & Validation:**
   - **Great Expectations**: Validating, documenting, and profiling data (a small validation sketch follows this list).
   - **Deequ**: A library for defining data-quality constraints in Spark.
   - **DataHub**: Data governance, metadata management, and lineage tracking.
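A small validation sketch using Great Expectations' classic pandas API (available in pre-1.0 releases; newer versions use a different entry point), with an illustrative column:

```python
# Sketch: assert a column has no nulls with the classic GE pandas API.
import great_expectations as ge
import pandas as pd

df = ge.from_pandas(pd.DataFrame({"user_id": [1, 2, None]}))

result = df.expect_column_values_to_not_be_null("user_id")
print(result.success)  # False here: one user_id is null
```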
9. **Machine Learning & AI:**
   - **MLflow**: Managing the machine-learning lifecycle, including experiment tracking and model management (a tracking sketch follows this list).
   - **TensorFlow Serving**: Serving machine-learning models in production.
   - **Kubeflow**: Kubernetes-native solution for deploying, monitoring, and managing ML models at scale.
   - **PyTorch**: Building deep-learning models on unstructured data such as images and text.
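A minimal MLflow tracking sketch; the parameter and metric values are dummies:

```python
# Sketch: record one training run's params and metrics with MLflow.
import mlflow

with mlflow.start_run(run_name="sketch"):
    mlflow.log_param("learning_rate", 0.001)
    mlflow.log_metric("val_accuracy", 0.91)
```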
10. **Data Integration:**
    - **Talend**: Data integration and ETL platform.
    - **Fivetran**: Managed ETL service that loads data sources into data lakes or warehouses.
    - **Apache Camel**: Framework for integrating data across systems and formats.
11. **Web & API Layer:**
    - **FastAPI**: High-performance web framework for building the platform's APIs (a minimal endpoint sketch follows this list).
    - **Flask**: Lightweight web framework for handling HTTP requests and serving data.
    - **GraphQL**: API query language for flexible, real-time data fetching.
    - **RESTful APIs**: Standard endpoints for interacting with platform services.
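A minimal FastAPI sketch with a health check and an ingestion endpoint; the paths and payload model are illustrative:

```python
# Sketch: health and ingestion endpoints for the API layer.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class Event(BaseModel):
    source: str
    payload: str


@app.get("/health")
def health():
    return {"status": "ok"}


@app.post("/events")
def ingest(event: Event):
    # in real code, forward the event to Kafka here
    return {"accepted": True, "source": event.source}
```

Run locally with `uvicorn main:app --reload` (assuming the module is named `main`).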
12. **Containerization & Orchestration:**
    - **Docker Compose**: Defines and runs multi-container Docker applications.
    - **Kubernetes**: Deploys, scales, and manages containerized applications across a cluster.
    - **Helm**: Kubernetes package management for straightforward application deployment.
## Features

- **Real-Time Data Processing**: Process and analyze unstructured data in real time with Kafka, Spark, and Flink.
- **Scalable Architecture**: Kafka, Spark, and the other components scale horizontally to handle large data volumes.
- **Automated CI/CD Pipeline**: Jenkins, Docker, and Kubernetes provide a seamless build, test, and deployment process.
- **Data Security**: Encryption, authentication, and access control via OAuth, Kerberos, and Vault.
- **Advanced Orchestration**: Apache Airflow, Dagster, and Argo manage complex workflows and data pipelines.
- **Comprehensive Monitoring**: Prometheus, Grafana, and the ELK Stack cover real-time monitoring and log analysis.
- **Data Quality**: Great Expectations and Deequ enforce data validation and quality.
- **Machine Learning Integration**: Integrates with TensorFlow and PyTorch for processing unstructured data.
## Development Status
- The platform is in active development; data ingestion, real-time processing, the CI/CD pipeline, and core security are already implemented.
- Ongoing work covers scaling components, hardening security, integrating more data sources, and extending orchestration.

docker/docker-compose.yaml

Lines changed: 11 additions & 3 deletions
@@ -28,7 +28,7 @@ services:
       - SPARK_MODE=worker
       - SPARK_MASTER_URL=spark://spark-master:7077
       - SPARK_WORKER_CORES=2
-      - SPARK_WORKER_MEMORY=1G
+      - SPARK_WORKER_MEMORY=2G
     depends_on:
       - spark-master
     networks:
@@ -41,8 +41,8 @@ services:
     environment:
       - SPARK_MODE=worker
       - SPARK_MASTER_URL=spark://spark-master:7077
-      - SPARK_WORKER_CORES=2
-      - SPARK_WORKER_MEMORY=1G
+      - SPARK_WORKER_CORES=2   # upgrade needed; check logs and KPIs
+      - SPARK_WORKER_MEMORY=2G  # upgrade needed; check logs and KPIs
     depends_on:
       - spark-master
     networks:
@@ -77,6 +77,14 @@ services:
     networks:
       - spark-network
     command: spark-submit --master spark://spark-master:7077 /opt/bitnami/spark/jobs/main.py
+
+  postgres:
+    image: postgres:13
+    environment:
+      POSTGRES_USER: spark
+      POSTGRES_PASSWORD: spark
+      POSTGRES_DB: sparkdb
+    networks:
+      - spark-network

 networks:
   spark-network:
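The new `postgres` service is reachable from other containers on `spark-network` by its service name. A minimal connectivity sketch, assuming the `psycopg2` driver (which this commit does not add) and the credentials from the compose file:

```python
# Sketch: connect to the postgres service from a container on spark-network.
import psycopg2

conn = psycopg2.connect(
    host="postgres",   # the compose service name doubles as the hostname
    dbname="sparkdb",
    user="spark",
    password="spark",
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT version()")
    print(cur.fetchone()[0])
conn.close()
```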

jobs/main.py

Lines changed: 27 additions & 1 deletion
@@ -118,4 +118,30 @@ def define_udfs():


     simple_query.awaitTermination()
-    spark.stop()
+    spark.stop()

(The removed and re-added `spark.stop()` line differs only in whitespace as rendered here; the remaining 26 additions are blank lines appended at the end of the file.)

kubernetes/namespaces/cd.yaml

Lines changed: 4 additions & 0 deletions

@@ -0,0 +1,4 @@
+apiVersion: v1
+kind: Namespace
+metadata:
+  name: cd

kubernetes/namespaces/ci.yaml

Lines changed: 4 additions & 0 deletions

@@ -0,0 +1,4 @@
+apiVersion: v1
+kind: Namespace
+metadata:
+  name: ci

kubernetes/namespaces/data-ingestion.yaml (filename inferred; header missing in the original view)

Lines changed: 4 additions & 0 deletions

@@ -0,0 +1,4 @@
+apiVersion: v1
+kind: Namespace
+metadata:
+  name: data-ingestion

kubernetes/namespaces/ml-model.yaml (filename inferred; header missing in the original view)

Lines changed: 4 additions & 0 deletions

@@ -0,0 +1,4 @@
+apiVersion: v1
+kind: Namespace
+metadata:
+  name: ml-model

kubernetes/namespaces/monitoring.yaml (filename inferred; header missing in the original view)

Lines changed: 4 additions & 0 deletions

@@ -0,0 +1,4 @@
+apiVersion: v1
+kind: Namespace
+metadata:
+  name: monitoring
