Course
data-engineering-zoomcamp
Question
How do you use Spark SQL to query data in Spark?
Answer
Spark SQL allows you to run SQL queries on Spark DataFrames by registering them as temporary views.
This makes it easy to analyze and transform data using familiar SQL syntax.
Step 1: Create a Temporary View
# Register a DataFrame as a temporary view
df.createOrReplaceTempView('trips')
This registers the DataFrame as a temporary view named trips, which can be queried with SQL for the duration of the Spark session.
Step 2: Execute a SQL Query
result = spark.sql("""
SELECT
DATE(pickup_datetime) AS date,
COUNT(*) AS trip_count,
AVG(trip_distance) AS avg_distance,
SUM(total_amount) AS total_revenue
FROM trips
WHERE passenger_count > 0
GROUP BY DATE(pickup_datetime)
ORDER BY date
""")This query performs filtering, aggregation, and ordering using standard SQL syntax.
Step 3: View the Results
result.show()
The query result is returned as a DataFrame and can be further processed using Spark APIs.
Checklist
- I have searched existing FAQs and this question is not already answered
- The answer provides accurate, helpful information
- I have included any relevant code examples or links