Q64_SAM: Deterministic Execution Time Prediction for Spark Applications

This repository provides a reproducible Jupyter Notebook implementation of the Static Allocation Model (SAM), a graph-based deterministic model for predicting the execution time of Spark applications, applied to TPC-DS Query-64 under static allocation settings.

The model is based on prior peer-reviewed work:

Hina Tariq and Olivia Das (2022)
A Deterministic Model to Predict Execution Time of Spark Applications
Computer Performance Engineering (EPEW 2022), LNCS, vol 13659.
📄 https://doi.org/10.1007/978-3-031-25049-1_11

The current implementation corresponds to the Query-64 validation case and reflects the extended formulation presented in the journal version of the paper.


📘 Overview

The notebook simulates the execution of TPC-DS Query 64 on a Spark cluster using a deterministic stage-wise simulation approach. It uses a flattened DAG of Spark stages and assigns tasks based on Spark’s round-robin task scheduling, simulating execution on a fixed number of cores.

  • Models execution using static resource allocation (executor cores).
  • Supports parallel execution of independent stages.
  • Estimates stage-level execution using input-task duration relationships.
  • Produces predictions of total execution time within ~3% of measured values for large input sizes.

🔍 Key Features

  • Graph-based simulation: Represents the DAG of stages with explicit dependencies.
  • Task-level scheduling: Simulates task execution using round-robin scheduling across cores.
  • Stage parallelism: Enables parallel simulation of independent stages that do not share shuffle/output dependencies.
  • Real-world validation: Predicted results closely match Spark execution on Google Cloud.
  • Input: All stage/task parameters are defined within the notebook.
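The core mechanics — round-robin task packing within a stage, and stage-level parallelism governed by the DAG — can be sketched roughly as follows. This is an illustrative simplification under the README's contention-free assumption, not the notebook's actual code; the stage data and the `round_robin_stage_time` / `simulate_dag` names are hypothetical:

```python
def round_robin_stage_time(task_durations, cores):
    """Assign tasks to cores in round-robin order; the stage
    finishes when the busiest core finishes."""
    busy = [0.0] * cores
    for i, d in enumerate(task_durations):
        busy[i % cores] += d
    return max(busy)

def simulate_dag(stages, deps, cores=8):
    """stages: {name: [task durations]}; deps: {name: [predecessors]}.
    A stage starts once all its predecessors finish, so independent
    stages overlap. Returns total simulated execution time."""
    finish = {}
    remaining = dict(stages)
    while remaining:
        for name in list(remaining):
            preds = deps.get(name, [])
            if all(p in finish for p in preds):
                start = max((finish[p] for p in preds), default=0.0)
                finish[name] = start + round_robin_stage_time(
                    remaining.pop(name), cores)
    return max(finish.values())

# Toy DAG: s0 and s1 are independent, s2 waits for both.
stages = {"s0": [4.0, 4.0, 4.0], "s1": [2.0] * 8, "s2": [3.0, 3.0]}
deps = {"s2": ["s0", "s1"]}
print(simulate_dag(stages, deps, cores=2))  # → 11.0
```

With 2 cores, s0 takes 8 s (one core runs two 4 s tasks), s1 also takes 8 s, and s2 starts at t = 8 and finishes at t = 11.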

🔬 Methodology

  • Flattens the Spark DAG to remove job-level hierarchy.
  • Accepts stage metadata (number of tasks, task duration, dependencies) hardcoded for 20GB and 100GB inputs.
  • Interpolates (or, for sizes beyond the measured 100GB point, extrapolates) stage parameters for unseen input sizes such as 200GB.
  • Simulates task execution with 8 cores using a round-robin allocation policy.
  • Aggregates per-stage completion time, considering parallel stages when dependencies allow.
  • Accounts for warm-up overhead while assuming contention-free execution.
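The notebook does not expose its interpolation formula here, but a linear fit through the two measured sizes is the simplest consistent reading. A minimal sketch, assuming a linear relationship and hypothetical per-task durations:

```python
def extrapolate_linear(x0, y0, x1, y1, x):
    """Linear fit through two measured points (e.g., 20 GB and 100 GB),
    evaluated at an unseen input size (e.g., 200 GB)."""
    slope = (y1 - y0) / (x1 - x0)
    return y0 + slope * (x - x0)

# Hypothetical per-task durations (s) for one stage at 20 GB and 100 GB.
d20, d100 = 1.2, 4.0
d200 = extrapolate_linear(20, d20, 100, d100, 200)
print(round(d200, 2))  # → 7.5
```

The same fit can be applied to task counts or any other per-stage parameter that scales with input size.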

🧪 Validation

The model has been validated using Spark history logs for Query 64 on a 200GB input size (Google Cloud Dataproc):

| Measured Time (s) | Predicted Time (s) | Error (%) |
|-------------------|--------------------|-----------|
| 491.71            | 477.71             | 2.85      |
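The reported error is the absolute relative error against the measured run; using the numbers above:

```python
measured, predicted = 491.71, 477.71
error_pct = abs(measured - predicted) / measured * 100
print(f"{error_pct:.2f}%")  # → 2.85%
```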

🔁 Reproducibility

The notebook is fully self-contained: all model parameters correspond to the experimental setup described in the associated journal paper, and running the cells end-to-end reproduces the Query-64 prediction reported above.

📂 Files

  • Q64_SAM.ipynb – Main Jupyter Notebook implementing the simulation and prediction model.

▶️ How to Run

  1. Clone the repository: git clone https://github.com/hinzy97/Q64_SAM_ExecutionTimeModel

  2. Open the notebook: jupyter notebook Q64_SAM.ipynb

  3. Run all cells in order. No external Spark cluster is required; the notebook is a deterministic simulation.


📝 Citation

If you use this repository or model in your work, please cite:

@InProceedings{10.1007/978-3-031-25049-1_11,
  author    = "Tariq, Hina and Das, Olivia",
  editor    = "Gilly, Katja and Thomas, Nigel",
  title     = "A Deterministic Model to Predict Execution Time of Spark Applications",
  booktitle = "Computer Performance Engineering",
  year      = "2023",
  publisher = "Springer International Publishing",
  address   = "Cham",
  pages     = "167--181",
  abstract  = "This work proposes a graph-based, deterministic analytical model that predicts the execution time of spark applications. It conceptualizes the structure of the spark application as a monolithic Directed Acyclic Graph (DAG) of stages capturing the precedence relationship among all the stages of the application. The model processes every stage of the DAG using a graph traversal algorithm, combined with a fixed scheduling policy of the spark platform in context (spark platform refers to the cloud that hosts the spark cluster). We validate our model against the measured execution time obtained by running a big data query (Query-64 of TPC-DS benchmark) that involves parallel execution of a large number of stages. The query is executed on the spark cluster of Google Cloud. Our model resulted in an execution time that is at 2.85{\%} error in comparison to the measured execution time.",
  isbn      = "978-3-031-25049-1"
}

🔧 Future Work

  • Dynamic Executor Allocation: Extend the model to support runtime changes in resource availability, reflecting Spark’s dynamic allocation mode.
  • Integration with Spark Logs: Automate parameter extraction by parsing Spark UI logs (JSON/History Server).
  • Multiple Job Support: Extend to handle multiple Spark jobs or job groups with independent DAGs.
  • Visualization Enhancements: Add DAG and Gantt chart-style visualizations for better understanding of parallelism and stage behavior.
  • ML-based Comparison: Benchmark SAM's predictions against data-driven models using regression, decision trees, random forests, and neural networks.
