# **Architecture Documentation**

## **1. Purpose**
The purpose of this architecture is to build a **data platform** that supports:
- **Synthetic Data Generation**: Using a Spark job to generate synthetic Parquet datasets.
- **Data Processing**: Processing the raw generated Parquet datasets using Spark on Amazon EMR.
- **Orchestration**: Managing workflows and scheduling tasks using Apache Airflow.
- **Storage**: Storing raw data, processed data, orchestration assets, logs, and other artifacts in Amazon S3.
- **Networking**: Ensuring secure communication between components using a VPC with public and private subnets.
- **Access Control**: Using IAM roles and policies to enforce least-privilege access to AWS resources.

---

## **2. Content**

### **Compute: EMR**
- **Purpose**: Amazon EMR is used to run Spark jobs for data generation and processing.
- **Configuration** (a sketch of the cluster definition follows this list):
  - **Cluster**: Configured with one master node (`m5.xlarge`) and one core node (`m5.xlarge`).
  - **Applications**: Spark is installed on the cluster.
  - **Bootstrap Actions**: A bootstrap script (`bootstrap.sh`) installs Spark job dependencies, such as Python libraries (e.g., `faker`).
  - **Security Groups**:
    - Master and core nodes have security groups allowing internal communication and SSH access.
  - **IAM Roles**:
    - `EMR_DefaultRole`: Allows the cluster to interact with AWS services.
    - `EMR_EC2_DefaultRole`: Grants EC2 instances in the cluster access to S3 and other resources.
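
The configuration above maps naturally onto the job-flow structure accepted by boto3's `run_job_flow` and by Airflow's EMR operators. The dictionary below is a hedged sketch of that mapping, not the project's actual definition: the EMR release label, the bootstrap script location, and the log prefix are assumptions.

```python
# Hypothetical cluster definition mirroring the configuration described above.
# Release label, bootstrap script path, and log URI are assumptions.
JOB_FLOW_OVERRIDES = {
    "Name": "builditall-emr-cluster",
    "ReleaseLabel": "emr-6.15.0",           # assumed EMR release
    "Applications": [{"Name": "Spark"}],     # Spark installed on the cluster
    "LogUri": "s3://builditall-logs/emr/",   # EMR logs bucket/folder
    "BootstrapActions": [
        {
            "Name": "install-python-deps",
            "ScriptBootstrapAction": {
                # assumed location of bootstrap.sh
                "Path": "s3://builditall-client-data/scripts/bootstrap.sh",
            },
        }
    ],
    "Instances": {
        "InstanceGroups": [
            {
                "Name": "Primary",
                "InstanceRole": "MASTER",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 1,
            },
            {
                "Name": "Core",
                "InstanceRole": "CORE",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 1,
            },
        ],
        "KeepJobFlowAliveWhenNoSteps": True,  # cluster stays up until explicitly terminated
        "TerminationProtected": False,
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",    # instance profile for the EC2 nodes
    "ServiceRole": "EMR_DefaultRole",        # service role for the cluster itself
}
```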

---

### **Orchestration: Airflow**
- **Purpose**: Apache Airflow is used to orchestrate workflows for data generation and processing.
- **Configuration**:
  - **Deployment**: Airflow is deployed on an EC2 instance using Docker Compose.
  - **Components**:
    - **Webserver**: Accessible on port `8080`.
    - **Scheduler**: Manages task execution.
    - **Worker**: Executes tasks in parallel.
    - **Postgres**: Used as the metadata database.
    - **Redis**: Used as the Celery broker.
  - **DAGs** (both follow the create, submit, wait, terminate pattern sketched after this list):
    - `data_generation_dag.py`: Generates synthetic Parquet datasets using Spark on EMR.
    - `data_processing_dag.py`: Processes raw Parquet datasets using Spark on EMR.
  - **Security**:
    - Airflow's EC2 instance is in a private subnet.
    - Port forwarding or SSM is used to access the Airflow UI securely.
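
A minimal sketch of how such a DAG could be wired with the Amazon provider's EMR operators is shown below. It assumes the `apache-airflow-providers-amazon` package is installed; the schedule expression, trimmed cluster overrides, and step arguments are placeholders rather than the repository's actual DAG code.

```python
# Hypothetical DAG skeleton; not the repository's actual data_generation_dag.py.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import (
    EmrAddStepsOperator,
    EmrCreateJobFlowOperator,
    EmrTerminateJobFlowOperator,
)
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

JOB_FLOW_OVERRIDES = {
    # Trimmed; see the cluster definition sketch in the Compute section.
    "Name": "builditall-data-generation",
    "ReleaseLabel": "emr-6.15.0",
    "Applications": [{"Name": "Spark"}],
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

SPARK_STEPS = [
    {
        "Name": "generate_synthetic_data",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "s3://builditall-client-data/scripts/data_generator.py",
                "{{ ds }}",  # run date, used as the output subfolder
            ],
        },
    }
]

with DAG(
    dag_id="data_generation",
    start_date=datetime(2025, 1, 1),
    schedule_interval="0 9 * * *",  # daily at 9:00 AM
    catchup=False,
) as dag:
    create_cluster = EmrCreateJobFlowOperator(
        task_id="create_emr_cluster",
        job_flow_overrides=JOB_FLOW_OVERRIDES,
    )
    submit_job = EmrAddStepsOperator(
        task_id="submit_spark_job",
        job_flow_id="{{ task_instance.xcom_pull(task_ids='create_emr_cluster', key='return_value') }}",
        steps=SPARK_STEPS,
    )
    wait_for_job = EmrStepSensor(
        task_id="wait_for_spark_job",
        job_flow_id="{{ task_instance.xcom_pull(task_ids='create_emr_cluster', key='return_value') }}",
        step_id="{{ task_instance.xcom_pull(task_ids='submit_spark_job', key='return_value')[0] }}",
    )
    terminate_cluster = EmrTerminateJobFlowOperator(
        task_id="terminate_emr_cluster",
        job_flow_id="{{ task_instance.xcom_pull(task_ids='create_emr_cluster', key='return_value') }}",
        trigger_rule="all_done",  # terminate even if the Spark step failed
    )

    create_cluster >> submit_job >> wait_for_job >> terminate_cluster
```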

---

### **Storage: S3**
- **Purpose**: Amazon S3 is used to store raw, processed, and orchestration-related data.
- **Buckets**:
  - `builditall-client-data`:
    - **Folders**:
      - `raw/`: Stores raw data generated by Spark jobs.
      - `processed/`: Stores processed data.
      - `scripts/`: Stores Spark job scripts and other scripts.
  - `builditall-airflow`:
    - The Dockerfile, Docker Compose file, and Airflow setup scripts are stored at the root of this bucket.
    - **Folders**:
      - `dags/`: Stores the Airflow DAGs, the email notification script, and utility scripts.
      - `requirements/`: Stores Python dependencies for Airflow.
  - `builditall-logs`:
    - **Folders**:
      - `airflow/`: Stores Airflow logs.
      - `emr/`: Stores EMR logs.
  - `builditall-tfstate`:
    - Stores the Terraform state (remote backend configuration).
- **Bucket Policies**:
  - Grant access to specific IAM roles (e.g., the Airflow and EMR roles) for reading and writing data.
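
Datasets land under date-partitioned prefixes in `builditall-client-data` (see the workflow section below). As a rough illustration, one run's raw output could be inspected with boto3 as follows; the run date shown is only an example value.

```python
# Illustrative only: list one run date's raw Parquet files.
# Assumes AWS credentials with read access to the bucket are configured.
import boto3

s3 = boto3.client("s3")
response = s3.list_objects_v2(
    Bucket="builditall-client-data",
    Prefix="raw/2025-04-26/",  # example run-date subfolder
)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```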

---

### **Networking**
- **VPC**:
  - A custom VPC is created with the following:
    - **Public Subnets (2)**: For the bastion host and NAT gateway.
    - **Private Subnets (2)**: For the Airflow and EMR instances.
  - **Routing**:
    - Public subnets route to an internet gateway.
    - Private subnets route outbound internet traffic through a NAT gateway.
- **Security Groups**:
  - **Bastion Host**:
    - Allows SSH access from a specific IP range (`allowed_ip`).
  - **Airflow**:
    - Allows access to port `8080` for the Airflow UI (restricted to the VPC CIDR).
    - Allows SSH access from the bastion host.
  - **EMR**:
    - Master and core nodes allow internal communication and SSH access.

---

### **IAM Roles**
- **Airflow Role**:
  - Grants access to:
    - S3 buckets (`builditall-airflow`, `builditall-client-data`, `builditall-logs`).
    - EMR clusters for job submission and monitoring.
- **EMR Roles**:
  - `EMR_DefaultRole`: Grants the EMR cluster access to AWS services.
  - `EMR_EC2_DefaultRole`: Grants EC2 instances in the cluster access to S3 and other resources.
- **Bastion Role**:
  - Grants access to SSM for secure session management.

---

### **Secrets Management**
- **Purpose**: AWS Secrets Manager is used to securely store sensitive variable values required by Terraform.
- **Configuration**:
  - A secret named `builditall-secrets` is created in AWS Secrets Manager.
  - The secret contains key-value pairs for sensitive variables such as:
    - `aws_region`, `project_name`, `data_bucket_name`, `airflow_bucket_name`, `logs_bucket_name`, `ami_id`, `key_pair_name`, `allowed_ip`, and `vpc_cidr`.
  - Terraform dynamically fetches these values using the `aws_secretsmanager_secret` and `aws_secretsmanager_secret_version` data sources.
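
Outside Terraform, the same secret payload (a JSON object holding the key-value pairs listed above) can be inspected with boto3, for example to verify that every expected key is present before a deployment. The sketch below is illustrative only; the region and the printed keys are just examples.

```python
# Illustrative only: read the builditall-secrets payload that Terraform consumes.
# Assumes credentials with secretsmanager:GetSecretValue on the secret; the region is an example.
import json

import boto3

client = boto3.client("secretsmanager", region_name="us-east-1")
secret = client.get_secret_value(SecretId="builditall-secrets")
config = json.loads(secret["SecretString"])

# e.g. confirm the networking inputs Terraform will receive
print(config["vpc_cidr"], config["allowed_ip"])
```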

---

## **3. Workflow**

### **Step 1: Data Generation**
1. **Trigger**:
   - The `data_generation_dag.py` DAG runs daily at `9:00 AM`.
2. **Workflow**:
   - Creates an EMR cluster.
   - Submits a Spark job (`data_generator.py`) to generate synthetic data (a minimal sketch of such a job follows this list).
   - Saves the generated Parquet dataset to a run-date subfolder (e.g., `2025-04-26`) under the `raw/` folder of the `builditall-client-data` S3 bucket.
   - Terminates the EMR cluster after the job completes.
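
The following is a rough sketch of what a generator job like `data_generator.py` could look like, not the actual script: the schema, row count, and argument handling are assumptions; only the output location follows the layout described above.

```python
# Hypothetical synthetic-data generator; schema and row count are illustrative.
import sys
from datetime import date

from faker import Faker  # installed on the cluster by bootstrap.sh
from pyspark.sql import SparkSession


def main(output_path: str, num_rows: int = 1_000) -> None:
    spark = SparkSession.builder.appName("data_generator").getOrCreate()
    fake = Faker()

    # Build synthetic rows on the driver, then distribute them as a DataFrame.
    rows = [
        (i, fake.name(), fake.email(), fake.date_this_year().isoformat())
        for i in range(num_rows)
    ]
    df = spark.createDataFrame(rows, ["id", "name", "email", "signup_date"])

    df.write.mode("overwrite").parquet(output_path)
    spark.stop()


if __name__ == "__main__":
    # The DAG passes the run date so output lands in raw/<run-date>/.
    run_date = sys.argv[1] if len(sys.argv) > 1 else date.today().isoformat()
    main(f"s3://builditall-client-data/raw/{run_date}/")
```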

---

### **Step 2: Data Processing**
1. **Trigger**:
   - The `data_processing_dag.py` DAG runs daily at `6:00 PM`.
2. **Workflow**:
   - Creates an EMR cluster.
   - Submits a Spark job (`data_processor.py`) to process the raw Parquet files generated that day by the data generation workflow (a minimal sketch follows this list).
   - Saves the processed data to a run-date subfolder (e.g., `2025-04-26`) under the `processed/` folder of the `builditall-client-data` S3 bucket.
   - Terminates the EMR cluster after the job completes.
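
A corresponding sketch of a processing job such as `data_processor.py` is shown below. The transformation itself (deduplication plus a derived column) is purely illustrative and assumes the columns from the generator sketch; only the input and output paths follow the documented layout.

```python
# Hypothetical processing job; the transformation is illustrative.
import sys
from datetime import date

from pyspark.sql import SparkSession
from pyspark.sql import functions as F


def main(run_date: str) -> None:
    spark = SparkSession.builder.appName("data_processor").getOrCreate()

    raw_path = f"s3://builditall-client-data/raw/{run_date}/"
    processed_path = f"s3://builditall-client-data/processed/{run_date}/"

    df = spark.read.parquet(raw_path)

    # Example transformation: drop duplicate ids and derive the email domain.
    processed = (
        df.dropDuplicates(["id"])
          .withColumn("email_domain", F.split(F.col("email"), "@").getItem(1))
    )

    processed.write.mode("overwrite").parquet(processed_path)
    spark.stop()


if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else date.today().isoformat())
```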

---

### **Step 3: Orchestration**
1. **Airflow**:
   - Manages the scheduling and execution of the `data_generation` and `data_processing` DAGs.
   - Sends email notifications on task success or failure using the `email_alert.py` script.
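
Airflow can deliver such notifications through task callbacks. The sketch below shows one way `email_alert.py` might expose them, assuming Airflow's SMTP connection is configured; the function names and recipient address are placeholders, not the actual script's contents.

```python
# Hypothetical callback helpers; recipient address and wording are placeholders.
from airflow.utils.email import send_email


def notify_failure(context):
    ti = context["task_instance"]
    send_email(
        to="data-team@example.com",
        subject=f"[Airflow] FAILED: {ti.dag_id}.{ti.task_id}",
        html_content=f"Task failed for run date {context['ds']}. Logs: {ti.log_url}",
    )


def notify_success(context):
    ti = context["task_instance"]
    send_email(
        to="data-team@example.com",
        subject=f"[Airflow] SUCCESS: {ti.dag_id}.{ti.task_id}",
        html_content=f"Task succeeded for run date {context['ds']}.",
    )
```

The DAGs would then typically reference these functions via `on_failure_callback` / `on_success_callback` in their `default_args`.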

---

### **Step 4: Access**
1. **Airflow UI**:
   - Accessed via port forwarding through the bastion host or using SSM.
2. **Bastion Host**:
   - Used to SSH into private instances (e.g., Airflow).

---