221 changes: 139 additions & 82 deletions docs/integrations/data-ingestion/aws-glue/index.md
@@ -16,6 +16,8 @@
import TabItem from '@theme/TabItem';
import notebook_connections_config from '@site/static/images/integrations/data-ingestion/aws-glue/notebook-connections-config.png';
import dependent_jars_path_option from '@site/static/images/integrations/data-ingestion/aws-glue/dependent_jars_path_option.png';
import marketplace_usage_instructions from '@site/static/images/integrations/data-ingestion/aws-glue/marketplace-usage-instructions.png';
import glue_studio_visual_editor from '@site/static/images/integrations/data-ingestion/aws-glue/glue-studio-visual-editor.png';
import ClickHouseSupportedBadge from '@theme/badges/ClickHouseSupported';

# Integrating Amazon Glue with ClickHouse and Spark
@@ -40,21 +42,25 @@
Ensure your Glue job’s IAM role has the necessary permissions, as described in the minimum privileges [guide](https://docs.aws.amazon.com/glue/latest/dg/getting-started-min-privs-job.html#getting-started-min-privs-connectors).

3. <h3 id="activate-the-connector">Activate the Connector & Create a Connection</h3>
You can activate the connector and create a connection directly by clicking [this link](https://console.aws.amazon.com/gluestudio/home#/connector/add-connection?connectorName="ClickHouse%20AWS%20Glue%20Connector"&connectorType="Spark"&connectorUrl=https://709825985650.dkr.ecr.us-east-1.amazonaws.com/clickhouse/clickhouse-glue:1.0.0&connectorClassName="com.clickhouse.spark.ClickHouseCatalog"), which opens the Glue connection creation page with key fields pre-filled. Give the connection a name, and press create (no need to provide the ClickHouse connection details at this stage).
After subscribing, select the Glue version that matches your job requirements. In the **Additional details** section, under **Usage instructions**, click the link to **Open Glue Studio - Add ClickHouse connector**. This opens the Glue connection creation page with key fields pre-filled. Give the connection a name and press create (no need to provide the ClickHouse connection details at this stage).
Contributor

The link we had here was meant to help users to create a connection easily. If you remove that, please elaborate on how to create this connection.

I took the "new user path", and after subscribing, I don't see any option for selecting a Glue version. Have you validated these steps?

  • I do see an option to select a glue version and a connector version - how should a user proceed from that point?
  • I don't see ClickHouse as an option when I try to create a connection:
(two screenshots attached)

Contributor Author

I removed the link because we now have multiple Glue options, so a single link no longer covers them.

As you mentioned, the Glue version is chosen in the drop-down.

After subscribing and selecting the Glue version, click the link in the Additional details section, under Usage instructions.

> I don't see ClickHouse as an option when I try to create a connection

You are supposed to click the link on the Glue page, which opens a create-connection page.

(screenshot attached)


<Image img={marketplace_usage_instructions} size='md' alt='AWS Marketplace usage instructions for ClickHouse Glue connector' />

4. <h3 id="use-in-glue-job">Use in Glue Job</h3>
In your Glue job, select the `Job details` tab, and expand the `Advanced properties` window. Under the `Connections` section, select the connection you just created. The connector automatically injects the required JARs into the job runtime.

<Image img={notebook_connections_config} size='md' alt='Glue Notebook connections config' force='true' />

:::note
The JARs used in the Glue connector are built for `Spark 3.3`, `Scala 2`, and `Python 3`. Make sure to select these versions when configuring your Glue job.
Make sure to select the connector version that matches your Glue job configuration:
- **Glue 4**: Spark 3.3, Scala 2, Python 3
- **Glue 5**: Spark 3.5, Scala 2, Python 3
:::

</TabItem>
<TabItem value="Manual Installation" label="Manual Installation">
To add the required JARs manually, follow these steps:
1. Upload the following jars to an S3 bucket - `clickhouse-jdbc-0.6.X-all.jar` and `clickhouse-spark-runtime-3.X_2.X-0.8.X.jar`.
1. Upload the latest Spark connector JAR (`clickhouse-spark-runtime-3.X_2.X-0.10.X.jar`) to an S3 bucket.

Check notice on line 63 in docs/integrations/data-ingestion/aws-glue/index.md (GitHub Actions / vale, rule ClickHouse.Uppercase):
Suggestion: Instead of uppercase for 'JAR', use lowercase or backticks (`) if possible. Otherwise, ask a Technical Writer to add this word or acronym to the rule's exception list.

2. Make sure the Glue job has access to this bucket.
3. Under the `Job details` tab, scroll down, expand the `Advanced properties` drop-down, and enter the JAR path in `Dependent JARs path`:

@@ -63,72 +69,122 @@
</TabItem>
</Tabs>

## Using AWS Secrets Manager for credentials {#secrets-manager}

Rather than hardcoding your ClickHouse user and password in the job, store them in [AWS Secrets Manager](https://docs.aws.amazon.com/secretsmanager/) and reference the secret from your Glue connection or job script. At runtime, Glue fetches the secret and merges its key-value pairs into the connector's connection options.

### Create the secret {#create-secret}

In AWS Secrets Manager, create a secret of type **Other type of secret** with key-value pairs whose keys match the connector's option names:

| Key | Value |
|---|---|
| `user` | your ClickHouse username |
| `password` | your ClickHouse password |

Any key you put in the secret is forwarded to the connector, so you can also store `host`, `database`, or any other option there if you'd like to keep them out of code.
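
If you prefer to create the secret from a script rather than the console, the following is a minimal sketch. It assumes `boto3` is installed and AWS credentials are configured. The helper names `build_secret_string` and `create_clickhouse_secret` are hypothetical and exist only for this illustration. The payload is a JSON object whose keys match the connector option names listed above.

```python
import json


def build_secret_string(user: str, password: str, **extra_options: str) -> str:
    """Serialize connector options as the JSON key-value payload of a secret."""
    payload = {"user": user, "password": password}
    # Optional extra keys such as host or database are stored alongside the credentials.
    payload.update(extra_options)
    return json.dumps(payload)


def create_clickhouse_secret(name: str, secret_string: str) -> None:
    """Create the secret in AWS Secrets Manager (requires AWS credentials; not called here)."""
    import boto3  # imported lazily so build_secret_string works without boto3

    client = boto3.client("secretsmanager")
    client.create_secret(Name=name, SecretString=secret_string)


# Placeholder values; replace them with your own before creating the secret.
secret_string = build_secret_string("default", "<your-password>", database="default")
```

Calling `create_clickhouse_secret("clickhouse/glue/credentials", secret_string)` would then store the payload under the secret name used in the examples on this page.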

### Reference the secret {#reference-secret}

There are two ways to wire the secret into a job.

**Option 1: attach it to the Glue connection.** When creating or editing the ClickHouse connection in Glue Studio, set the **AWS secret** field to the secret's name. Any job that uses this connection resolves the secret automatically — no code changes needed.

**Option 2: pass `secretId` in connection options.** Use this when the secret isn't attached to the connection. Add `secretId` alongside `connectionName`:

<Tabs>
<TabItem value="Python" label="Python" default>

```python
source = glueContext.create_dynamic_frame.from_options(
    connection_type="marketplace.spark",
    connection_options={
        "connectionName": "<your-connection-name>",
        "secretId": "clickhouse/glue/credentials",
        "database": "default",
        "table": "example_table"
    },
    transformation_ctx="clickhouse_source"
)
```

</TabItem>
<TabItem value="Scala" label="Scala">

```scala
val source = glueContext.getSource(
  connectionType = "marketplace.spark",
  connectionOptions = JsonOptions(Map(
    "connectionName" -> "<your-connection-name>",
    "secretId" -> "clickhouse/glue/credentials",
    "database" -> "default",
    "table" -> "example_table"
  )),
  transformationContext = "clickhouseSource"
)
```

</TabItem>
</Tabs>

The secret's `user` and `password` keys are merged into the connector options at runtime, so you never need to read them in your script.

## Examples {#example}

The examples below use `marketplace.spark` and reference the connector by `connectionName`. If you installed the connector manually (Manual Installation tab), use `connection_type="custom.spark"` and pass `className`, `host`, `http_port`, `user`, and `password` directly in the options instead.
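
For the manual-installation path, the options could be assembled roughly as in the sketch below. This is an illustration under assumptions, not a verified configuration: the `className` value is taken from the connector class name that appears elsewhere on this page, the `database` and `table` keys are carried over from the marketplace examples, port `8443` is only an example value, and the helper `build_custom_spark_options` is hypothetical.

```python
from typing import Dict

# Assumed class name for the manual (custom.spark) path, copied from the
# connector class name shown elsewhere on this page.
CLICKHOUSE_CLASS_NAME = "com.clickhouse.spark.ClickHouseCatalog"


def build_custom_spark_options(
    host: str,
    http_port: int,
    user: str,
    password: str,
    database: str = "default",
    table: str = "example_table",
) -> Dict[str, str]:
    """Return connection options for connection_type="custom.spark"."""
    return {
        "className": CLICKHOUSE_CLASS_NAME,
        "host": host,
        "http_port": str(http_port),
        "user": user,
        "password": password,
        # "database" and "table" mirror the marketplace examples (assumed).
        "database": database,
        "table": table,
    }


options = build_custom_spark_options(
    host="<your-clickhouse-host>",
    http_port=8443,  # example value only
    user="default",
    password="<your-password>",
)

# Inside a Glue job, the options would be passed like this:
# source = glueContext.create_dynamic_frame.from_options(
#     connection_type="custom.spark",
#     connection_options=options,
#     transformation_ctx="clickhouse_source",
# )
```

Keeping the option assembly in one function makes it easier to swap the hardcoded password for a value read from a secret later.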

If you attached an AWS secret to the connection itself (Option 1 in [Using AWS Secrets Manager for credentials](#secrets-manager)), drop `secretId` from the options — Glue resolves credentials from the connection automatically.

<Tabs>
<TabItem value="Scala" label="Scala" default>
<TabItem value="Visual Editor" label="Visual Editor" default>

You can use the ClickHouse connector as either a source or a target in the Glue Studio visual editor. Simply drag the ClickHouse Spark Connector component onto the canvas and connect it to your data pipeline.

<Image img={glue_studio_visual_editor} size='md' alt='Glue Studio visual editor with ClickHouse connector' />

</TabItem>
<TabItem value="Scala" label="Scala">

```java
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.clickhouseScala.Native.NativeSparkRead.spark
import org.apache.spark.sql.SparkSession

import com.amazonaws.services.glue.util.{GlueArgParser, Job, JsonOptions}
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

object ClickHouseGlueExample {
def main(sysArgs: Array[String]) {
def main(sysArgs: Array[String]): Unit = {
val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)

val sparkSession: SparkSession = SparkSession.builder
.config("spark.sql.catalog.clickhouse", "com.clickhouse.spark.ClickHouseCatalog")
.config("spark.sql.catalog.clickhouse.host", "<your-clickhouse-host>")
.config("spark.sql.catalog.clickhouse.protocol", "https")
.config("spark.sql.catalog.clickhouse.http_port", "<your-clickhouse-port>")
.config("spark.sql.catalog.clickhouse.user", "default")
.config("spark.sql.catalog.clickhouse.password", "<your-password>")
.config("spark.sql.catalog.clickhouse.database", "default")
// for ClickHouse cloud
.config("spark.sql.catalog.clickhouse.option.ssl", "true")
.config("spark.sql.catalog.clickhouse.option.ssl_mode", "NONE")
.getOrCreate

val glueContext = new GlueContext(sparkSession.sparkContext)
val sc = new SparkContext()
val glueContext = new GlueContext(sc)
Job.init(args("JOB_NAME"), glueContext, args.asJava)
import sparkSession.implicits._

val url = "s3://{path_to_cell_tower_data}/cell_towers.csv.gz"

val schema = StructType(Seq(
StructField("radio", StringType, nullable = false),
StructField("mcc", IntegerType, nullable = false),
StructField("net", IntegerType, nullable = false),
StructField("area", IntegerType, nullable = false),
StructField("cell", LongType, nullable = false),
StructField("unit", IntegerType, nullable = false),
StructField("lon", DoubleType, nullable = false),
StructField("lat", DoubleType, nullable = false),
StructField("range", IntegerType, nullable = false),
StructField("samples", IntegerType, nullable = false),
StructField("changeable", IntegerType, nullable = false),
StructField("created", TimestampType, nullable = false),
StructField("updated", TimestampType, nullable = false),
StructField("averageSignal", IntegerType, nullable = false)
))

val df = sparkSession.read
.option("header", "true")
.schema(schema)
.csv(url)
val readOptions = JsonOptions(Map(
"connectionName" -> "<your-connection-name>",
"secretId" -> "clickhouse/glue/credentials",
"database" -> "default",
"table" -> "example_table"
))

// Write to ClickHouse
df.writeTo("clickhouse.default.cell_towers").append()
val source = glueContext.getSource(
connectionType = "marketplace.spark",
connectionOptions = readOptions,
transformationContext = "clickhouseSource"
)
val dyf = source.getDynamicFrame()

val writeOptions = JsonOptions(Map(
"connectionName" -> "<your-connection-name>",
"secretId" -> "clickhouse/glue/credentials",
"database" -> "default",
"table" -> "target_table"
))

glueContext.getSink(
connectionType = "marketplace.spark",
connectionOptions = writeOptions
).writeDynamicFrame(dyf)

// Read from ClickHouse
val dfRead = spark.sql("select * from clickhouse.default.cell_towers")
Job.commit()
}
}
@@ -139,47 +195,48 @@

```python
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql import Row


## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
logger = glueContext.get_logger()
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

spark.conf.set("spark.sql.catalog.clickhouse", "com.clickhouse.spark.ClickHouseCatalog")
spark.conf.set("spark.sql.catalog.clickhouse.host", "<your-clickhouse-host>")
spark.conf.set("spark.sql.catalog.clickhouse.protocol", "https")
spark.conf.set("spark.sql.catalog.clickhouse.http_port", "<your-clickhouse-port>")
spark.conf.set("spark.sql.catalog.clickhouse.user", "default")
spark.conf.set("spark.sql.catalog.clickhouse.password", "<your-password>")
spark.conf.set("spark.sql.catalog.clickhouse.database", "default")
spark.conf.set("spark.clickhouse.write.format", "json")
spark.conf.set("spark.clickhouse.read.format", "arrow")
# for ClickHouse cloud
spark.conf.set("spark.sql.catalog.clickhouse.option.ssl", "true")
spark.conf.set("spark.sql.catalog.clickhouse.option.ssl_mode", "NONE")

# Create DataFrame
data = [Row(id=11, name="John"), Row(id=12, name="Doe")]
df = spark.createDataFrame(data)

# Write DataFrame to ClickHouse
df.writeTo("clickhouse.default.example_table").append()

# Read DataFrame from ClickHouse
df_read = spark.sql("select * from clickhouse.default.example_table")
logger.info(str(df.take(10)))
read_options = {
"connectionName": "<your-connection-name>",
"secretId": "clickhouse/glue/credentials",
"database": "default",
"table": "example_table"
}

source = glueContext.create_dynamic_frame.from_options(
connection_type="marketplace.spark",
connection_options=read_options,
transformation_ctx="clickhouse_source"
)
dyf = source

logger.info(f"Read {dyf.count()} rows from ClickHouse")

write_options = {
"connectionName": "<your-connection-name>",
"secretId": "clickhouse/glue/credentials",
"database": "default",
"table": "target_table"
}

glueContext.write_dynamic_frame.from_options(
frame=dyf,
connection_type="marketplace.spark",
connection_options=write_options,
transformation_ctx="clickhouse_sink"
)

job.commit()
```
Contributor

We should update these instructions, I don't see an option to Continue to Configure after subscribing.

1 change: 1 addition & 0 deletions styles/ClickHouse/Headings.yml
@@ -96,6 +96,7 @@ exceptions:
- Linux
- MERGETREE
- "AWS Marketplace"
- "AWS Secrets Manager"
- "Azure Marketplace"
- "GCP Marketplace"
- Microsoft