[FAQ] Using Spark SQL with DataFrames #95

@kkh1902

Description

Course

data-engineering-zoomcamp

Question

How do you use Spark SQL to query data in Spark?

Answer

Spark SQL allows you to run SQL queries on Spark DataFrames by registering them as temporary views.
This makes it easy to analyze and transform data using familiar SQL syntax.

Step 1: Create a Temporary View

# Register the DataFrame as a temporary view named 'trips'
df.createOrReplaceTempView('trips')

This registers the DataFrame as a view named trips, which can be queried with SQL for the lifetime of the current Spark session.

Step 2: Execute a SQL Query

result = spark.sql("""
    SELECT
        DATE(pickup_datetime) AS date,
        COUNT(*) AS trip_count,
        AVG(trip_distance) AS avg_distance,
        SUM(total_amount) AS total_revenue
    FROM trips
    WHERE passenger_count > 0
    GROUP BY DATE(pickup_datetime)
    ORDER BY date
""")

This query performs filtering, aggregation, and ordering using standard SQL syntax.

Step 3: View the Results

result.show()

The query result is returned as a DataFrame and can be further processed using Spark APIs.

Checklist

  • I have searched existing FAQs and this question is not already answered
  • The answer provides accurate, helpful information
  • I have included any relevant code examples or links
