# **Architecture Documentation**

## **1. Purpose**
The purpose of this architecture is to build a **data platform** that supports:
- **Synthetic Data Generation**: Using a Spark job to generate synthetic Parquet datasets.
- **Data Processing**: Processing the raw generated Parquet datasets using Spark on Amazon EMR.
- **Orchestration**: Managing workflows and scheduling tasks using Apache Airflow.
- **Storage**: Storing raw data, processed data, orchestration assets, logs, and other artifacts in Amazon S3.
- **Networking**: Ensuring secure communication between components using a VPC with public and private subnets.
- **Access Control**: Using IAM roles and policies to enforce least-privilege access to AWS resources.

---

## **2. Content**

### **Compute: EMR**
- **Purpose**: Amazon EMR is used to run Spark jobs for data generation and processing.
- **Configuration** (a sketch of the cluster definition follows this list):
  - **Cluster**: Configured with one master node (`m5.xlarge`) and one core node (`m5.xlarge`).
  - **Applications**: Spark is installed on the cluster.
  - **Bootstrap Actions**: A bootstrap script (`bootstrap.sh`) installs Spark job dependencies, such as Python libraries (e.g., `faker`).
  - **Security Groups**:
    - Master and core nodes have security groups allowing internal communication and SSH access.
  - **IAM Roles**:
    - `EMR_DefaultRole`: Allows the cluster to interact with AWS services.
    - `EMR_EC2_DefaultRole`: Grants EC2 instances in the cluster access to S3 and other resources.
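
The configuration above maps naturally onto the job-flow structure accepted by boto3's `run_job_flow` and by Airflow's EMR operators. The dictionary below is a hedged sketch of that mapping, not the project's actual definition: the EMR release label, the bootstrap script location, and the log prefix are assumptions.

```python
# Hypothetical cluster definition mirroring the configuration described above.
# Release label, bootstrap script path, and log URI are assumptions.
JOB_FLOW_OVERRIDES = {
    "Name": "builditall-emr-cluster",
    "ReleaseLabel": "emr-6.15.0",           # assumed EMR release
    "Applications": [{"Name": "Spark"}],     # Spark installed on the cluster
    "LogUri": "s3://builditall-logs/emr/",   # EMR logs bucket/folder
    "BootstrapActions": [
        {
            "Name": "install-python-deps",
            "ScriptBootstrapAction": {
                # assumed location of bootstrap.sh
                "Path": "s3://builditall-client-data/scripts/bootstrap.sh",
            },
        }
    ],
    "Instances": {
        "InstanceGroups": [
            {
                "Name": "Primary",
                "InstanceRole": "MASTER",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 1,
            },
            {
                "Name": "Core",
                "InstanceRole": "CORE",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 1,
            },
        ],
        "KeepJobFlowAliveWhenNoSteps": True,  # cluster stays up until explicitly terminated
        "TerminationProtected": False,
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",    # instance profile for the EC2 nodes
    "ServiceRole": "EMR_DefaultRole",        # service role for the cluster itself
}
```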

---

### **Orchestration: Airflow**
- **Purpose**: Apache Airflow is used to orchestrate workflows for data generation and processing.
- **Configuration**:
  - **Deployment**: Airflow is deployed on an EC2 instance using Docker Compose.
  - **Components**:
    - **Webserver**: Accessible on port `8080`.
    - **Scheduler**: Manages task execution.
    - **Worker**: Executes tasks in parallel.
    - **Postgres**: Used as the metadata database.
    - **Redis**: Used as the Celery broker.
  - **DAGs** (both follow the create, submit, wait, terminate pattern sketched after this list):
    - `data_generation_dag.py`: Generates synthetic Parquet datasets using Spark on EMR.
    - `data_processing_dag.py`: Processes raw Parquet datasets using Spark on EMR.
  - **Security**:
    - Airflow's EC2 instance is in a private subnet.
    - Port forwarding or SSM is used to access the Airflow UI securely.
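
A minimal sketch of how such a DAG could be wired with the Amazon provider's EMR operators is shown below. It assumes the `apache-airflow-providers-amazon` package is installed; the schedule expression, trimmed cluster overrides, and step arguments are placeholders rather than the repository's actual DAG code.

```python
# Hypothetical DAG skeleton; not the repository's actual data_generation_dag.py.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import (
    EmrAddStepsOperator,
    EmrCreateJobFlowOperator,
    EmrTerminateJobFlowOperator,
)
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

JOB_FLOW_OVERRIDES = {
    # Trimmed; see the cluster definition sketch in the Compute section.
    "Name": "builditall-data-generation",
    "ReleaseLabel": "emr-6.15.0",
    "Applications": [{"Name": "Spark"}],
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

SPARK_STEPS = [
    {
        "Name": "generate_synthetic_data",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "s3://builditall-client-data/scripts/data_generator.py",
                "{{ ds }}",  # run date, used as the output subfolder
            ],
        },
    }
]

with DAG(
    dag_id="data_generation",
    start_date=datetime(2025, 1, 1),
    schedule_interval="0 9 * * *",  # daily at 9:00 AM
    catchup=False,
) as dag:
    create_cluster = EmrCreateJobFlowOperator(
        task_id="create_emr_cluster",
        job_flow_overrides=JOB_FLOW_OVERRIDES,
    )
    submit_job = EmrAddStepsOperator(
        task_id="submit_spark_job",
        job_flow_id="{{ task_instance.xcom_pull(task_ids='create_emr_cluster', key='return_value') }}",
        steps=SPARK_STEPS,
    )
    wait_for_job = EmrStepSensor(
        task_id="wait_for_spark_job",
        job_flow_id="{{ task_instance.xcom_pull(task_ids='create_emr_cluster', key='return_value') }}",
        step_id="{{ task_instance.xcom_pull(task_ids='submit_spark_job', key='return_value')[0] }}",
    )
    terminate_cluster = EmrTerminateJobFlowOperator(
        task_id="terminate_emr_cluster",
        job_flow_id="{{ task_instance.xcom_pull(task_ids='create_emr_cluster', key='return_value') }}",
        trigger_rule="all_done",  # terminate even if the Spark step failed
    )

    create_cluster >> submit_job >> wait_for_job >> terminate_cluster
```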

---

### **Storage: S3**
- **Purpose**: Amazon S3 is used to store raw, processed, and orchestration-related data.
- **Buckets**:
  - `builditall-client-data`:
    - **Folders**:
      - `raw/`: Stores raw data generated by Spark jobs.
      - `processed/`: Stores processed data.
      - `scripts/`: Stores Spark job scripts and other scripts.
  - `builditall-airflow`:
    - The Dockerfile, Docker Compose file, and Airflow setup scripts are stored at the root of this bucket.
    - **Folders**:
      - `dags/`: Stores the Airflow DAGs, the email notification script, and utility scripts.
      - `requirements/`: Stores Python dependencies for Airflow.
  - `builditall-logs`:
    - **Folders**:
      - `airflow/`: Stores Airflow logs.
      - `emr/`: Stores EMR logs.
  - `builditall-tfstate`:
    - Stores the Terraform state (remote backend configuration).
- **Bucket Policies**:
  - Grant access to specific IAM roles (e.g., the Airflow and EMR roles) for reading and writing data.
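
Datasets land under date-partitioned prefixes in `builditall-client-data` (see the workflow section below). As a rough illustration, one run's raw output could be inspected with boto3 as follows; the run date shown is only an example value.

```python
# Illustrative only: list one run date's raw Parquet files.
# Assumes AWS credentials with read access to the bucket are configured.
import boto3

s3 = boto3.client("s3")
response = s3.list_objects_v2(
    Bucket="builditall-client-data",
    Prefix="raw/2025-04-26/",  # example run-date subfolder
)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```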

---

### **Networking**
- **VPC**:
  - A custom VPC is created with the following:
    - **Public Subnets (2)**: For the bastion host and NAT gateway.
    - **Private Subnets (2)**: For the Airflow and EMR instances.
  - **Routing**:
    - Public subnets route to an internet gateway.
    - Private subnets route outbound internet traffic through a NAT gateway.
- **Security Groups**:
  - **Bastion Host**:
    - Allows SSH access from a specific IP range (`allowed_ip`).
  - **Airflow**:
    - Allows access to port `8080` for the Airflow UI (restricted to the VPC CIDR).
    - Allows SSH access from the bastion host.
  - **EMR**:
    - Master and core nodes allow internal communication and SSH access.

---

### **IAM Roles**
- **Airflow Role**:
  - Grants access to:
    - S3 buckets (`builditall-airflow`, `builditall-client-data`, `builditall-logs`).
    - EMR clusters for job submission and monitoring.
- **EMR Roles**:
  - `EMR_DefaultRole`: Grants the EMR cluster access to AWS services.
  - `EMR_EC2_DefaultRole`: Grants EC2 instances in the cluster access to S3 and other resources.
- **Bastion Role**:
  - Grants access to SSM for secure session management.

---

### **Secrets Management**
- **Purpose**: AWS Secrets Manager is used to securely store sensitive variable values required by Terraform.
- **Configuration**:
  - A secret named `builditall-secrets` is created in AWS Secrets Manager.
  - The secret contains key-value pairs for sensitive variables such as:
    - `aws_region`, `project_name`, `data_bucket_name`, `airflow_bucket_name`, `logs_bucket_name`, `ami_id`, `key_pair_name`, `allowed_ip`, and `vpc_cidr`.
  - Terraform dynamically fetches these values using the `aws_secretsmanager_secret` and `aws_secretsmanager_secret_version` data sources.
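
Outside Terraform, the same secret payload (a JSON object holding the key-value pairs listed above) can be inspected with boto3, for example to verify that every expected key is present before a deployment. The sketch below is illustrative only; the region and the printed keys are just examples.

```python
# Illustrative only: read the builditall-secrets payload that Terraform consumes.
# Assumes credentials with secretsmanager:GetSecretValue on the secret; the region is an example.
import json

import boto3

client = boto3.client("secretsmanager", region_name="us-east-1")
secret = client.get_secret_value(SecretId="builditall-secrets")
config = json.loads(secret["SecretString"])

# e.g. confirm the networking inputs Terraform will receive
print(config["vpc_cidr"], config["allowed_ip"])
```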

---

## **3. Workflow**

### **Step 1: Data Generation**
1. **Trigger**:
   - The `data_generation_dag.py` DAG runs daily at `9:00 AM`.
2. **Workflow**:
   - Creates an EMR cluster.
   - Submits a Spark job (`data_generator.py`) to generate synthetic data (a minimal sketch of such a job follows this list).
   - Saves the generated Parquet dataset to a run-date subfolder (e.g., `2025-04-26`) under the `raw/` folder of the `builditall-client-data` S3 bucket.
   - Terminates the EMR cluster after the job completes.
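
The following is a rough sketch of what a generator job like `data_generator.py` could look like, not the actual script: the schema, row count, and argument handling are assumptions; only the output location follows the layout described above.

```python
# Hypothetical synthetic-data generator; schema and row count are illustrative.
import sys
from datetime import date

from faker import Faker  # installed on the cluster by bootstrap.sh
from pyspark.sql import SparkSession


def main(output_path: str, num_rows: int = 1_000) -> None:
    spark = SparkSession.builder.appName("data_generator").getOrCreate()
    fake = Faker()

    # Build synthetic rows on the driver, then distribute them as a DataFrame.
    rows = [
        (i, fake.name(), fake.email(), fake.date_this_year().isoformat())
        for i in range(num_rows)
    ]
    df = spark.createDataFrame(rows, ["id", "name", "email", "signup_date"])

    df.write.mode("overwrite").parquet(output_path)
    spark.stop()


if __name__ == "__main__":
    # The DAG passes the run date so output lands in raw/<run-date>/.
    run_date = sys.argv[1] if len(sys.argv) > 1 else date.today().isoformat()
    main(f"s3://builditall-client-data/raw/{run_date}/")
```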

---

### **Step 2: Data Processing**
1. **Trigger**:
   - The `data_processing_dag.py` DAG runs daily at `6:00 PM`.
2. **Workflow**:
   - Creates an EMR cluster.
   - Submits a Spark job (`data_processor.py`) to process the raw Parquet files generated that day by the data generation workflow (a minimal sketch follows this list).
   - Saves the processed data to a run-date subfolder (e.g., `2025-04-26`) under the `processed/` folder of the `builditall-client-data` S3 bucket.
   - Terminates the EMR cluster after the job completes.
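
A corresponding sketch of a processing job such as `data_processor.py` is shown below. The transformation itself (deduplication plus a derived column) is purely illustrative and assumes the columns from the generator sketch; only the input and output paths follow the documented layout.

```python
# Hypothetical processing job; the transformation is illustrative.
import sys
from datetime import date

from pyspark.sql import SparkSession
from pyspark.sql import functions as F


def main(run_date: str) -> None:
    spark = SparkSession.builder.appName("data_processor").getOrCreate()

    raw_path = f"s3://builditall-client-data/raw/{run_date}/"
    processed_path = f"s3://builditall-client-data/processed/{run_date}/"

    df = spark.read.parquet(raw_path)

    # Example transformation: drop duplicate ids and derive the email domain.
    processed = (
        df.dropDuplicates(["id"])
          .withColumn("email_domain", F.split(F.col("email"), "@").getItem(1))
    )

    processed.write.mode("overwrite").parquet(processed_path)
    spark.stop()


if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else date.today().isoformat())
```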

---

### **Step 3: Orchestration**
1. **Airflow**:
   - Manages the scheduling and execution of the `data_generation` and `data_processing` DAGs.
   - Sends email notifications on task success or failure using the `email_alert.py` script.
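
Airflow can deliver such notifications through task callbacks. The sketch below shows one way `email_alert.py` might expose them, assuming Airflow's SMTP connection is configured; the function names and recipient address are placeholders, not the actual script's contents.

```python
# Hypothetical callback helpers; recipient address and wording are placeholders.
from airflow.utils.email import send_email


def notify_failure(context):
    ti = context["task_instance"]
    send_email(
        to="data-team@example.com",
        subject=f"[Airflow] FAILED: {ti.dag_id}.{ti.task_id}",
        html_content=f"Task failed for run date {context['ds']}. Logs: {ti.log_url}",
    )


def notify_success(context):
    ti = context["task_instance"]
    send_email(
        to="data-team@example.com",
        subject=f"[Airflow] SUCCESS: {ti.dag_id}.{ti.task_id}",
        html_content=f"Task succeeded for run date {context['ds']}.",
    )
```

The DAGs would then typically reference these functions via `on_failure_callback` / `on_success_callback` in their `default_args`.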

---

### **Step 4: Access**
1. **Airflow UI**:
   - Accessed via port forwarding through the bastion host or using SSM.
2. **Bastion Host**:
   - Used to SSH into private instances (e.g., Airflow).

---