Course
data-engineering-zoomcamp
Question
Why does the PyFlink streaming job fail with a JSON deserialization error when consuming records from the Kafka/Redpanda topic?
Answer
The Green Taxi dataset contains NaN values in some numeric fields (for example passenger_count). When the producer serializes rows to JSON and sends them to the Kafka topic, those NaN values appear in the payload. Flink’s JSON parser does not accept NaN because it is not valid JSON, so the Kafka source fails during deserialization and the job keeps restarting.
Fix: clean the dataset before producing events by replacing NaN values with null or a valid number before serialization.
Checklist
Course
data-engineering-zoomcamp
Question
Why does the PyFlink streaming job fail with a JSON deserialization error when consuming records from the Kafka/Redpanda topic?
Answer
The Green Taxi dataset contains NaN values in some numeric fields (for example passenger_count). When the producer serializes rows to JSON and sends them to the Kafka topic, those NaN values appear in the payload. Flink’s JSON parser does not accept NaN because it is not valid JSON, so the Kafka source fails during deserialization and the job keeps restarting.
Fix: clean the dataset before producing events by replacing NaN values with null or a valid number before serialization.
Checklist