Course
data-engineering-zoomcamp
Question
Why does writing a Spark DataFrame after repartitioning create multiple parquet files instead of a single file?
Answer
Spark processes data in partitions. When a DataFrame is written to disk, each partition is written as a separate output file.
For example:
trips.repartition(4).write.parquet("output/")
This writes output/ as a directory containing four part-*.parquet files (plus a _SUCCESS marker), because the DataFrame now has four partitions and Spark writes one file per partition.
Writing one file per partition lets the executors write in parallel, which improves performance on large datasets. If you really need a single file, reduce the DataFrame to one partition first with coalesce(1) or repartition(1), at the cost of losing write parallelism.
Checklist