[FAQ] Record counts in Homework 6 do not match the quiz options #232

@AsherJD-io

Course

data-engineering-zoomcamp

Question

When counting records in the yellow_tripdata_2023-11 dataset using Spark, the total number of rows is around 3.3 million, but the quiz options contain much smaller values such as:

62,610

102,340

162,604

225,768

Why don't the counts match the dataset results?

Answer

The quiz question refers to partitioned output files, not the full dataset.

In the homework workflow, the dataset is typically written using something like:

df.repartition(4).write.parquet("output/")

Spark splits the dataset into four partitions and writes each one as a separate Parquet file, so the output directory holds several part files rather than a single file.

The question asks for the number of records inside one of those partition files, not the total number of rows in the dataset.
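To make the difference between the total and a per-file count concrete, here is a minimal pure-Python sketch of an idealized even split. The function name `idealized_partition_sizes` and the toy numbers are hypothetical and used only for illustration; real Spark output files are usually close to this even split but not necessarily identical.

```python
def idealized_partition_sizes(total_rows: int, num_partitions: int) -> list[int]:
    """Split total_rows as evenly as possible across num_partitions.

    This is an idealization for illustration, not Spark's actual algorithm.
    """
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    base, extra = divmod(total_rows, num_partitions)
    # The first `extra` partitions receive one additional row each.
    return [base + 1 if i < extra else base for i in range(num_partitions)]


# Toy example: 10 rows written as 4 files.
sizes = idealized_partition_sizes(total_rows=10, num_partitions=4)
print(sizes)       # [3, 3, 2, 2]
print(sum(sizes))  # 10
```

Each entry is the row count of one file, and only their sum equals the dataset total.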

To check this, you can read the output directory and count records per file:

from pyspark.sql.functions import input_file_name
spark.read.parquet("output/").groupBy(input_file_name()).count().show(truncate=False)

The counts from those partitions correspond to the quiz options.

Checklist

  • I have searched existing FAQs and this question is not already answered
  • The answer provides accurate, helpful information
  • I have included any relevant code examples or links
