[FAQ] Record counts in Homework 6 do not match the quiz options

### Course

data-engineering-zoomcamp

### Question

Counting records in the `yellow_tripdata_2023-11` dataset with Spark gives about 3.3 million rows, but the quiz options are much smaller values:

- 62,610
- 102,340
- 162,604
- 225,768

Why don't the counts match the dataset results?

### Answer

The quiz question refers to partitioned output files, not the full dataset.

In the homework workflow, the dataset is typically written using something like:

```
df.repartition(4).write.parquet("output/")
```
With `repartition(4)`, Spark splits the dataset into four partitions and writes each one as a separate Parquet file in `output/`.

The question asks for the number of records inside one of those partition files, not the total number of rows in the dataset.
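The difference between a per-file count and the total count can be sketched in plain Python. This is only an illustration, not Spark itself: `split_round_robin` is a hypothetical helper, the 1,000 integers stand in for the trip records, and the round-robin dealing only approximates how an even repartition spreads rows across output files.

```python
# Plain-Python illustration (not Spark): deal rows into buckets one at a time,
# roughly the way an even repartition spreads records across output files.

def split_round_robin(rows, num_partitions):
    """Distribute rows across num_partitions buckets in round-robin order."""
    partitions = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        partitions[i % num_partitions].append(row)
    return partitions

rows = list(range(1_000))                # hypothetical stand-in for the trip records
partitions = split_round_robin(rows, 4)  # 4 mirrors repartition(4)
per_file_counts = [len(p) for p in partitions]

print(per_file_counts)       # records in each simulated output file
print(sum(per_file_counts))  # adds back up to the total row count
```

Each simulated file holds about a quarter of the rows, and the per-file counts always sum to the total.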

To check this, you can read the output directory and count records per file:

```
from pyspark.sql.functions import input_file_name
spark.read.parquet("output/").groupBy(input_file_name()).count().show(truncate=False)
```
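Compare the per-file counts with the quiz options. Because `repartition(4)` distributes rows roughly evenly, about 3.3 million rows would give each file around 825,000 records; if none of the counts match, re-read the question to see whether it restricts the count to a subset of the data.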

### Checklist

- [x] I have searched existing FAQs and this question is not already answered
- [x] The answer provides accurate, helpful information
- [x] I have included any relevant code examples or links
