
Commit a2aa498

Merge pull request #53 from Data-Bishop/dev
Added necessary documentation.
2 parents 9fe34db + 5b26029 commit a2aa498

5 files changed: +485 -0 lines changed


docs/airflow_documentation.md

Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@
# **Airflow Documentation**

## **1. Purpose**
Apache Airflow is used to orchestrate workflows for data generation and processing. It schedules and monitors tasks, ensuring that workflows are executed in the correct order and at the right time.

---

## **2. Components**

### **2.1 Airflow Deployment**
- **Deployment**: Airflow is deployed on an EC2 instance using Docker Compose.
- **Services**:
  - **Webserver**: Provides the Airflow UI, accessible on port `8080`.
  - **Scheduler**: Manages the execution of tasks in the DAGs.
  - **Worker**: Executes tasks in parallel using Celery.
  - **Postgres**: Stores metadata for Airflow.
  - **Redis**: Acts as the Celery broker for task distribution.

### **2.2 DAGs**
- **Location**: DAGs are stored in the `orchestration/dags/` directory.
- **Key DAGs**:
  - `data_generation_dag.py`: Generates synthetic data using Spark on EMR.
  - `data_processing_dag.py`: Processes raw data into structured formats using Spark on EMR.

### **2.3 Configuration**
- **Docker Compose**:
  - The `docker-compose.yml` file defines the services and their dependencies.
- **Dependencies**:
  - Python dependencies are listed in `requirements.txt`.
  - Custom configurations are stored in `config/config.py` (an illustrative sketch follows this list).
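
The contents of `config/config.py` are not reproduced in this document. Purely as an illustration, a module like the following could centralize the values the DAGs rely on; every name and value here is an assumption, not the project's actual configuration:

```python
# Illustrative sketch only -- the real config/config.py may differ entirely.
# Centralizes values referenced by the DAGs (buckets, prefixes, schedules, EMR settings).

AWS_REGION = "us-east-1"                    # assumed region
DATA_BUCKET = "builditall-client-data"      # raw/ and processed/ data (from the docs)
LOGS_BUCKET = "builditall-logs"             # airflow/ and emr/ logs (from the docs)

RAW_PREFIX = "raw/"
PROCESSED_PREFIX = "processed/"
SCRIPTS_PREFIX = "scripts/"

GENERATION_SCHEDULE = "0 9 * * *"           # daily at 9:00 AM
PROCESSING_SCHEDULE = "0 18 * * *"          # daily at 6:00 PM

EMR_RELEASE_LABEL = "emr-6.15.0"            # assumed EMR release
EMR_INSTANCE_TYPE = "m5.xlarge"
```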

---

## **3. Workflow**

### **3.1 Data Generation DAG**
- **Trigger**: Runs daily at `9:00 AM`.
- **Steps** (a hedged sketch of the DAG follows this list):
  1. Creates an EMR cluster.
  2. Submits the `data_generator.py` Spark job to generate synthetic data.
  3. Saves the generated data to the `raw/` folder in the S3 bucket.
  4. Terminates the EMR cluster.
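
The DAG code itself is not reproduced in these docs. A minimal sketch of how these four steps could be wired together with the Amazon provider's EMR operators is shown below; apart from the operator classes, everything (cluster definition, script path, IDs, schedule) follows the conventions described in these docs but is an assumption:

```python
# Minimal sketch only -- the real data_generation_dag.py may be structured differently.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import (
    EmrAddStepsOperator,
    EmrCreateJobFlowOperator,
    EmrTerminateJobFlowOperator,
)
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

JOB_FLOW_OVERRIDES = {"Name": "data-generation-cluster"}  # cluster config abbreviated
SPARK_STEP = [{
    "Name": "generate_synthetic_data",
    "ActionOnFailure": "TERMINATE_CLUSTER",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": ["spark-submit",
                 "s3://builditall-client-data/scripts/data_generator.py"],
    },
}]

with DAG(
    dag_id="data_generation",
    start_date=datetime(2025, 1, 1),
    schedule_interval="0 9 * * *",   # daily at 9:00 AM
    catchup=False,
) as dag:
    create_cluster = EmrCreateJobFlowOperator(
        task_id="create_emr_cluster", job_flow_overrides=JOB_FLOW_OVERRIDES
    )
    add_step = EmrAddStepsOperator(
        task_id="submit_spark_job",
        job_flow_id=create_cluster.output,
        steps=SPARK_STEP,
    )
    wait_for_step = EmrStepSensor(
        task_id="wait_for_spark_job",
        job_flow_id=create_cluster.output,
        step_id="{{ task_instance.xcom_pull(task_ids='submit_spark_job')[0] }}",
    )
    terminate_cluster = EmrTerminateJobFlowOperator(
        task_id="terminate_emr_cluster", job_flow_id=create_cluster.output
    )

    create_cluster >> add_step >> wait_for_step >> terminate_cluster
```

The data processing DAG described next follows the same pattern, swapping in `data_processor.py` and the `6:00 PM` schedule.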

### **3.2 Data Processing DAG**
- **Trigger**: Runs daily at `6:00 PM`.
- **Steps**:
  1. Creates an EMR cluster.
  2. Submits the `data_processor.py` Spark job to process raw data into structured formats.
  3. Saves the processed data to the `processed/` folder in the S3 bucket.
  4. Terminates the EMR cluster.

---

## **4. Access**

### **4.1 Airflow UI**
- **Access Method**:
  - Use SSH port forwarding or AWS SSM to access the Airflow UI.
  - Navigate to `http://localhost:8080` in your browser.

### **4.2 Logs**
- **Location**:
  - Airflow logs are stored in the `builditall-logs/airflow/` folder in the S3 bucket.

---

docs/architecture.md

Lines changed: 152 additions & 0 deletions
@@ -0,0 +1,152 @@
# **Architecture Documentation**

## **1. Purpose**
The purpose of this architecture is to build a **data platform** that supports:
- **Synthetic Data Generation**: Using a Spark job to generate synthetic Parquet datasets.
- **Data Processing**: Processing raw generated Parquet datasets using Spark on Amazon EMR.
- **Orchestration**: Managing workflows and scheduling tasks using Apache Airflow.
- **Storage**: Storing raw, processed, orchestration-related, log, and other data in Amazon S3.
- **Networking**: Ensuring secure communication between components using a VPC with public and private subnets.
- **Access Control**: Using IAM roles and policies to enforce least-privilege access to AWS resources.

---

## **2. Components**

### **Compute: EMR**
- **Purpose**: Amazon EMR is used to run Spark jobs for data generation and processing.
- **Configuration** (a hedged cluster definition sketch follows this list):
  - **Cluster**: Configured with one master node (`m5.xlarge`) and one core node (`m5.xlarge`).
  - **Applications**: Spark is installed on the cluster.
  - **Bootstrap Actions**: A bootstrap script (`bootstrap.sh`) installs Spark job dependencies such as Python libraries (e.g., `faker`).
- **Security Groups**:
  - Master and slave nodes have security groups allowing internal communication and SSH access.
- **IAM Roles**:
  - `EMR_DefaultRole`: Allows the cluster to interact with AWS services.
  - `EMR_EC2_DefaultRole`: Grants EC2 instances in the cluster access to S3 and other resources.
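
For illustration, a cluster definition matching this description (one `m5.xlarge` master, one `m5.xlarge` core, Spark, the `bootstrap.sh` action, logs to `builditall-logs/emr/`) could be passed to the EMR `RunJobFlow` API or to Airflow's `EmrCreateJobFlowOperator` as something like the following. The keys are standard `RunJobFlow` parameters; the specific values (name, release label, bootstrap script path) are assumptions:

```python
# Hedged sketch of an EMR RunJobFlow configuration matching the description above.
JOB_FLOW_OVERRIDES = {
    "Name": "builditall-spark-cluster",            # assumed cluster name
    "ReleaseLabel": "emr-6.15.0",                  # assumed EMR release
    "Applications": [{"Name": "Spark"}],
    "LogUri": "s3://builditall-logs/emr/",
    "Instances": {
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False,
    },
    "BootstrapActions": [
        {"Name": "install-dependencies",
         "ScriptBootstrapAction": {
             "Path": "s3://builditall-client-data/scripts/bootstrap.sh"}},  # assumed path
    ],
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}
```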

---

### **Orchestration: Airflow**
- **Purpose**: Apache Airflow is used to orchestrate workflows for data generation and processing.
- **Configuration**:
  - **Deployment**: Airflow is deployed on an EC2 instance using Docker Compose.
  - **Components**:
    - **Webserver**: Accessible on port `8080`.
    - **Scheduler**: Manages task execution.
    - **Worker**: Executes tasks in parallel.
    - **Postgres**: Used as the metadata database.
    - **Redis**: Used as the Celery broker.
  - **DAGs**:
    - `data_generation_dag.py`: Generates synthetic Parquet datasets using Spark on EMR.
    - `data_processing_dag.py`: Processes raw Parquet datasets using Spark on EMR.
- **Security**:
  - Airflow's EC2 instance is in a private subnet.
  - Port forwarding or SSM is used to access the Airflow UI securely.

---

### **Storage: S3**
- **Purpose**: Amazon S3 is used to store raw, processed, and orchestration-related data.
- **Buckets**:
  - `builditall-client-data`:
    - **Folders**:
      - `raw/`: Stores raw data generated by Spark jobs.
      - `processed/`: Stores processed data.
      - `scripts/`: Stores Spark job scripts and other scripts.
  - `builditall-airflow`:
    - The Dockerfile, Docker Compose file, and Airflow setup scripts are stored in the root of this bucket.
    - **Folders**:
      - `dags/`: Stores Airflow DAGs, email notification scripts, and utility scripts.
      - `requirements/`: Stores Python dependencies for Airflow.
  - `builditall-logs`:
    - **Folders**:
      - `airflow/`: Stores Airflow logs.
      - `emr/`: Stores EMR logs.
  - `builditall-tfstate`:
    - Stores Terraform backend and state configuration.
- **Bucket Policies**:
  - Grant access to specific IAM roles (e.g., Airflow and EMR roles) for reading and writing data.

---

### **Networking**
- **VPC**:
  - A custom VPC is created with the following:
    - **Public Subnets (2)**: For the bastion host and NAT gateway.
    - **Private Subnets (2)**: For the Airflow and EMR instances.
- **Routing**:
  - Public subnets have an internet gateway for outbound traffic.
  - Private subnets use a NAT gateway for internet access.
- **Security Groups**:
  - **Bastion Host**:
    - Allows SSH access from a specific IP range (`allowed_ip`).
  - **Airflow**:
    - Allows access to port `8080` for the Airflow UI (restricted to the VPC CIDR).
    - Allows SSH access from the bastion host.
  - **EMR**:
    - Master and slave nodes allow internal communication and SSH access.

---

### **IAM Roles**
- **Airflow Role**:
  - Grants access to:
    - S3 buckets (`builditall-airflow`, `builditall-client-data`, `builditall-logs`).
    - EMR clusters for job submission and monitoring.
- **EMR Roles**:
  - `EMR_DefaultRole`: Grants the EMR cluster access to AWS services.
  - `EMR_EC2_DefaultRole`: Grants EC2 instances in the cluster access to S3 and other resources.
- **Bastion Role**:
  - Grants access to SSM for secure session management.

---

### **Secrets Management**
- **Purpose**: AWS Secrets Manager is used to securely store sensitive variable values required by Terraform.
- **Configuration**:
  - A secret named `builditall-secrets` is created in AWS Secrets Manager.
  - The secret contains key-value pairs for sensitive variables such as:
    - `aws_region`, `project_name`, `data_bucket_name`, `airflow_bucket_name`, `logs_bucket_name`, `ami_id`, `key_pair_name`, `allowed_ip`, and `vpc_cidr`.
  - Terraform dynamically fetches these values using the `aws_secretsmanager_secret` and `aws_secretsmanager_secret_version` data sources (see the illustration below for checking the secret outside Terraform).
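
Terraform consumes the secret through its data sources. As a quick way to confirm that the secret exists and parses as JSON outside Terraform, a small boto3 call like the following could fetch and decode it (illustration only; the region shown is an assumption):

```python
# Illustration only: fetch and decode the builditall-secrets secret with boto3.
import json

import boto3

client = boto3.client("secretsmanager", region_name="us-east-1")  # assumed region
response = client.get_secret_value(SecretId="builditall-secrets")
secrets = json.loads(response["SecretString"])

print(secrets["data_bucket_name"])  # e.g. the bucket used for client data
```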
---

## **3. Workflow**

### **Step 1: Data Generation**
1. **Trigger**:
   - The `data_generation_dag.py` DAG runs daily at `9:00 AM`.
2. **Workflow**:
   - Creates an EMR cluster.
   - Submits a Spark job (`data_generator.py`) to generate synthetic data.
   - Saves the generated Parquet dataset to the run-date subfolder (e.g., `2025-04-26`) in the `raw/` folder of the `builditall-client-data` S3 bucket.
   - Terminates the EMR cluster after the job completes.

---

### **Step 2: Data Processing**
1. **Trigger**:
   - The `data_processing_dag.py` DAG runs daily at `6:00 PM`.
2. **Workflow**:
   - Creates an EMR cluster.
   - Submits a Spark job (`data_processor.py`) to process the raw Parquet datasets generated that day by the data generation workflow.
   - Saves the processed data to the run-date subfolder (e.g., `2025-04-26`) in the `processed/` folder of the `builditall-client-data` S3 bucket.
   - Terminates the EMR cluster after the job completes.

---

### **Step 3: Orchestration**
1. **Airflow**:
   - Manages the scheduling and execution of the `data_generation` and `data_processing` DAGs.
   - Sends email notifications on task success or failure using the `email_alert.py` script.

---

### **Step 4: Access**
1. **Airflow UI**:
   - Accessed via port forwarding through the bastion host or using SSM.
2. **Bastion Host**:
   - Used to SSH into private instances (e.g., Airflow).

---

docs/codebase_and_ci_cd.md

Lines changed: 84 additions & 0 deletions
@@ -0,0 +1,84 @@
# **Codebase Documentation**

## **1. Repository Structure**
The repository is organized into directories and files that represent the different components of the **BuildItAll Data Platform**. Below is an overview of the structure:

BuildItAll_Data_Platform/
├── .github/
│   └── workflows/                 # GitHub Actions workflows for CI/CD
│       ├── ci.yml                 # Continuous Integration workflow
│       ├── cd.yml                 # Continuous Deployment workflow
├── docs/                          # Documentation files
├── infrastructure/                # Terraform configuration for AWS resources
│   ├── bootstrap/                 # Bootstrap resources for Terraform backend
│   │   ├── main.tf                # S3 bucket and DynamoDB table for state storage
│   │   ├── provider.tf            # AWS provider configuration
│   │   ├── variables.tf           # Input variables for bootstrap
│   │   ├── outputs.tf             # Outputs for bootstrap resources
│   │   ├── terraform.tfvars       # Default variable values for bootstrap
│   ├── modules/                   # Terraform modules for reusable components
│   │   ├── vpc/                   # VPC module
│   │   ├── s3/                    # S3 buckets module
│   │   ├── emr/                   # EMR cluster IAM roles and security groups
│   │   ├── airflow_ec2/           # Airflow EC2 instance module
│   │   ├── bastion/               # Bastion host module
│   ├── main.tf                    # Root module for Terraform
│   ├── provider.tf                # AWS provider configuration
│   ├── variables.tf               # Input variables for Terraform
│   ├── outputs.tf                 # Outputs for Terraform resources
├── orchestration/                 # Airflow orchestration setup
│   ├── dags/                      # Airflow DAGs
│   │   ├── data_generation_dag.py # DAG for data generation
│   │   ├── data_processing_dag.py # DAG for data processing
│   │   ├── notification/          # Notification scripts
│   │   │   └── email_alert.py     # Email notifications for task success/failure
│   │   ├── config/                # Configuration files
│   │   │   └── config.py          # Airflow configuration variables
│   ├── setup.sh                   # Script to set up Airflow dependencies
│   ├── start-airflow.sh           # Script to start Airflow services
│   ├── requirements.txt           # Python dependencies for Airflow
│   ├── docker-compose.yml         # Docker Compose configuration for Airflow
│   ├── Dockerfile                 # Custom Dockerfile for Airflow
├── spark_jobs/                    # Spark job scripts
│   ├── data_generator.py          # Spark job for synthetic data generation
│   ├── data_processor.py          # Spark job for data processing
│   ├── requirements.txt           # Python dependencies for Spark jobs
│   ├── bootstrap.sh               # Bootstrap script for EMR cluster
├── .gitignore                     # Git ignore rules

---

## **2. CI/CD**

### **2.1 Continuous Integration (CI)**
The CI pipeline ensures code quality and validates Terraform configurations. It is defined in `.github/workflows/ci.yml`.

#### **CI Workflow Steps**
1. **Terraform Validation**:
   - Validates the Terraform configuration files.
   - Ensures proper formatting using `terraform fmt`.
   - Runs `terraform validate` to check for syntax errors.
2. **Python Linting**:
   - Uses `isort` to check import sorting in Python files.
   - Uses `flake8` to enforce Python code style and linting rules.

#### **Trigger**
- Runs on every pull request to the `dev` or `main` branches.

---

### **2.2 Continuous Deployment (CD)**
The CD pipeline automates the deployment of Terraform infrastructure and uploads necessary files to S3. It is defined in `.github/workflows/cd.yml`.

#### **CD Workflow Steps**
1. **Terraform Deployment**:
   - Initializes the Terraform backend.
   - Runs `terraform plan` to generate an execution plan.
   - Applies the Terraform configuration to provision or update AWS resources.
2. **File Upload to S3**:
   - Uploads Airflow setup scripts, DAGs, and Spark job scripts to the appropriate S3 buckets.

#### **Trigger**
- Runs on every push to the `main` branch.

---

docs/spark_jobs.md

Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
# **Spark Jobs Documentation**

## **1. Purpose**
Spark jobs are used for data generation and processing. They run on Amazon EMR clusters and interact with S3 for input and output data.

---

## **2. Spark Jobs**

### **2.1 Data Generator**
- **Script**: `data_generator.py`
- **Purpose**: Generates synthetic data and saves it to the `raw/` folder in the S3 bucket (a hedged sketch follows this list).
- **Inputs**:
  - Configuration parameters (e.g., number of records, schema).
- **Outputs**:
  - Synthetic data in Parquet format.
- **Execution**:
  - Submitted to the EMR cluster by the `data_generation_dag.py` DAG.
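
The script itself is not included in this document. As a rough sketch, assuming `faker` for fake records and an output path following the bucket layout described in these docs, the job could look like this (the schema, record count, and path are assumptions):

```python
# Rough sketch of a synthetic-data generator; the real data_generator.py may differ.
from datetime import date

from faker import Faker
from pyspark.sql import SparkSession

NUM_RECORDS = 1000                                                  # assumed record count
OUTPUT_PATH = f"s3://builditall-client-data/raw/{date.today()}/"    # run-date subfolder

spark = SparkSession.builder.appName("data_generator").getOrCreate()
fake = Faker()

# Build synthetic rows on the driver, then turn them into a DataFrame.
rows = [
    (fake.uuid4(), fake.name(), fake.email(), fake.date_this_year().isoformat())
    for _ in range(NUM_RECORDS)
]
df = spark.createDataFrame(rows, ["id", "name", "email", "signup_date"])

df.write.mode("overwrite").parquet(OUTPUT_PATH)
spark.stop()
```

In practice the schema and record count would come from the configuration parameters mentioned above rather than being hard-coded.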

### **2.2 Data Processor**
- **Script**: `data_processor.py`
- **Purpose**: Processes raw data and saves it to the `processed/` folder in the S3 bucket (a hedged sketch follows this list).
- **Inputs**:
  - Raw data from the `raw/` folder in the S3 bucket.
- **Outputs**:
  - Processed data in Parquet format.
- **Execution**:
  - Submitted to the EMR cluster by the `data_processing_dag.py` DAG.
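
As with the generator, only a hedged sketch is given here; the actual transformations in `data_processor.py` are not documented, so the example below simply reads the day's raw Parquet, applies a placeholder transformation, and writes the result (column names carry over from the generator sketch above and are assumptions):

```python
# Hedged sketch of the processing job; the real transformations are not documented.
from datetime import date

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

RUN_DATE = date.today().isoformat()
INPUT_PATH = f"s3://builditall-client-data/raw/{RUN_DATE}/"
OUTPUT_PATH = f"s3://builditall-client-data/processed/{RUN_DATE}/"

spark = SparkSession.builder.appName("data_processor").getOrCreate()

raw_df = spark.read.parquet(INPUT_PATH)

# Placeholder transformation: deduplicate and normalize a text column.
processed_df = (
    raw_df.dropDuplicates(["id"])
          .withColumn("email", F.lower(F.col("email")))
)

processed_df.write.mode("overwrite").parquet(OUTPUT_PATH)
spark.stop()
```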
---

## **3. Configuration**
- **Dependencies**:
  - Python dependencies for Spark jobs are listed in `spark_jobs/requirements.txt`.
- **Bootstrap Script**:
  - `bootstrap.sh` installs required dependencies on EMR nodes.

---

## **4. Workflow**

### **4.1 Data Generation**
1. The `data_generator.py` script is submitted to the EMR cluster.
2. The script generates synthetic data and saves it to the `raw/` folder in the S3 bucket.

### **4.2 Data Processing**
1. The `data_processor.py` script is submitted to the EMR cluster.
2. The script processes raw data and saves it to the `processed/` folder in the S3 bucket.

---

## **5. Logs**
- **Location**:
  - Spark job logs are stored in the `builditall-logs/emr/` folder in the S3 bucket.
- **Access**:
  - Logs can be accessed via the EMR console or directly from the S3 bucket.

---

## **6. Notifications**
- Email notifications are sent for task successes and failures (a hedged callback sketch follows this list).
- This is handled by the `email_alert.py` script in the `notification/` subfolder of the `dags/` folder.
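
The contents of `email_alert.py` are not shown in this document. A minimal sketch of how such success/failure callbacks could be implemented with Airflow's built-in `send_email` helper is given below; the recipient address and message wording are assumptions:

```python
# Minimal sketch of email_alert.py-style callbacks; the real script may differ.
from airflow.utils.email import send_email

ALERT_RECIPIENTS = ["data-team@example.com"]  # assumed recipients


def task_success_alert(context):
    """Send an email when a task succeeds (used as on_success_callback)."""
    ti = context["task_instance"]
    send_email(
        to=ALERT_RECIPIENTS,
        subject=f"[Airflow] Success: {ti.dag_id}.{ti.task_id}",
        html_content=f"Task {ti.task_id} in DAG {ti.dag_id} succeeded.",
    )


def task_failure_alert(context):
    """Send an email when a task fails (used as on_failure_callback)."""
    ti = context["task_instance"]
    send_email(
        to=ALERT_RECIPIENTS,
        subject=f"[Airflow] Failure: {ti.dag_id}.{ti.task_id}",
        html_content=f"Task {ti.task_id} in DAG {ti.dag_id} failed.",
    )
```

These callbacks would typically be wired into the DAGs through `default_args` (for example, `on_failure_callback=task_failure_alert`).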
