Description
Course
data-engineering-zoomcamp
Question
When counting records in the yellow_tripdata_2023-11 dataset using Spark, the total number of rows is around 3.3 million, but the quiz options contain much smaller values such as:
62,610
102,340
162,604
225,768
Why don't the counts match the dataset results?
Answer
The quiz question refers to partitioned output files, not the full dataset.
In the homework workflow, the dataset is typically written using something like:
df.repartition(4).write.parquet("output/")
With repartition(4), Spark splits the dataset into four partitions and writes one Parquet file per partition.
The question asks for the number of records inside one of those partition files, not the total number of rows in the dataset.
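The difference between a per-file count and the total can be illustrated without Spark. The sketch below is only a toy model: it assumes rows are dealt to partitions in simple round-robin order, which approximates the roughly even split that repartition(n) produces, and the function name simulate_repartition and the 10-row dataset are hypothetical, not part of the homework.

```python
def simulate_repartition(num_rows: int, num_partitions: int) -> list[int]:
    """Deal row indices to partitions round-robin and return rows per partition.

    Toy stand-in for df.repartition(n); real Spark is not involved.
    """
    counts = [0] * num_partitions
    for row_index in range(num_rows):
        counts[row_index % num_partitions] += 1
    return counts


# Hypothetical toy dataset: 10 rows split into 4 partitions.
per_partition = simulate_repartition(10, 4)
total = sum(per_partition)

for partition_id, row_count in enumerate(per_partition):
    print(f"part-{partition_id}: {row_count} rows")
print(f"total: {total} rows")
```

Each simulated partition holds only a fraction of the rows, while the sum across partitions equals the full dataset count. That is the same relationship as between a single output file's count and the count of the whole DataFrame.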
To check this, you can read the output directory and count records per file:
from pyspark.sql.functions import input_file_name
spark.read.parquet("output/").groupBy(input_file_name()).count().show(truncate=False)
Each row of that output is the record count for one partition file; these per-file counts are the values to compare against the quiz options.
Checklist
- I have searched existing FAQs and this question is not already answered
- The answer provides accurate, helpful information
- I have included any relevant code examples or links