Rest Catalog and Writing data to Minio Raises OSError: When initiating multiple part upload #974

@Al-Moatasem

Apache Iceberg version

0.6.1 (latest release)

Please describe the bug 🐞

Hi,

I am trying to use the REST catalog and write data to MinIO. The script I am using can communicate with MinIO (it creates the metadata.json file under the metadata directory). However, when it tries to write the data with table.append(df), it raises OSError: When initiating multiple part upload for key 'poc_new/coordinates/data/00000-0-f27b7921-a6d7-4c7e-b034-2d12221e5054.parquet' in bucket 'warehouse': AWS Error NETWORK_CONNECTION during CreateMultipartUpload operation: Encountered network error when sending http request

This is the Docker Compose file that I use:

version: '3'
services:
  rest:
    image: tabulario/iceberg-rest:1.5.0
    container_name: iceberg-rest
    ports:
      - 8181:8181
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
      - CATALOG_WAREHOUSE=s3://warehouse/
      - CATALOG_IO__IMPL=org.apache.iceberg.aws.s3.S3FileIO
      - CATALOG_S3_ENDPOINT=http://minio:9000
    networks:
      iceberg-rest:


  minio:
    image: minio/minio:RELEASE.2024-05-10T01-41-38Z
    container_name: minio
    environment:
      - MINIO_ROOT_USER=admin
      - MINIO_ROOT_PASSWORD=password
      - MINIO_DOMAIN=minio

    ports:
      - 9001:9001
      - 9000:9000
    command: [ "server", "/data", "--console-address", ":9001" ]
    networks:
      iceberg-rest:
        aliases:
          - warehouse.minio 

  mc:
    depends_on:
      - minio
    image: minio/mc:RELEASE.2024-05-09T17-04-24Z
    container_name: mc
    entrypoint: |
      /bin/sh -c "
        until (/usr/bin/mc config host add minio http://minio:9000 admin password)
        do
          echo '...waiting...' && sleep 1;
        done;
        /usr/bin/mc rm -r --force minio/warehouse;
        /usr/bin/mc mb minio/warehouse;
        /usr/bin/mc policy set public minio/warehouse;
        tail -f /dev/null
      "
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
    networks:
      iceberg-rest:


networks:
  iceberg-rest:

And this is the script file:

import pyarrow as pa
from pyiceberg.catalog import load_rest
from pyiceberg.exceptions import NamespaceAlreadyExistsError, TableAlreadyExistsError

catalog = load_rest(
    name="rest",
    conf={
        "uri": "http://localhost:8181/",
    },
)


namespace = "poc_new"
try:
    catalog.create_namespace(namespace)
except NamespaceAlreadyExistsError as e:
    pass


df = pa.Table.from_pylist(
    [
        {"lat": 52.371807, "long": 4.896029},
        {"lat": 52.387386, "long": 4.646219},
        {"lat": 52.078663, "long": 4.288788},
    ],
)
schema = df.schema

table_name = "coordinates"
table_identifier = f"{namespace}.{table_name}"
try:
    table = catalog.create_table(
        identifier=table_identifier,
        schema=schema,
    )
except TableAlreadyExistsError as e:
    pass

table = catalog.load_table(table_identifier)
table.append(df)
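
Note that the `conf` above only sets `uri`, so any S3 settings used on the client side come from the REST catalog's response or from defaults. For comparison, here is a minimal sketch of passing the S3 properties explicitly. It assumes MinIO is reachable from the host at http://localhost:9000 (the port published in the compose file) and reuses the credentials from the compose file. The helper name `build_rest_conf` is hypothetical, and I have not confirmed that this resolves the error.

```python
def build_rest_conf(
    catalog_uri: str = "http://localhost:8181/",
    # Assumption: MinIO's port 9000 is published to the host by the compose file.
    s3_endpoint: str = "http://localhost:9000",
    access_key: str = "admin",
    secret_key: str = "password",
) -> dict:
    """Build REST catalog properties that include client-side S3 settings."""
    return {
        "uri": catalog_uri,
        # PyIceberg FileIO properties for S3-compatible storage.
        "s3.endpoint": s3_endpoint,
        "s3.access-key-id": access_key,
        "s3.secret-access-key": secret_key,
    }


# Usage (requires pyiceberg and the running compose stack):
#   from pyiceberg.catalog import load_rest
#   catalog = load_rest(name="rest", conf=build_rest_conf())
```

Keeping the properties in one function makes it easy to switch the endpoint between `localhost` (script on the host) and `minio` (script inside the Docker network).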

The Traceback

Traceback (most recent call last):
  File "d:\flink_iceberg\poc_01_iceberg_rest.py", line 40, in <module>
    table.append(df)
  File "D:\flink_iceberg\.venv2\Lib\site-packages\pyiceberg\table\__init__.py", line 1068, in append
    for data_file in data_files:
  File "D:\flink_iceberg\.venv2\Lib\site-packages\pyiceberg\table\__init__.py", line 2423, in _dataframe_to_data_files
    yield from write_file(table, iter([WriteTask(write_uuid, next(counter), df)]))
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\flink_iceberg\.venv2\Lib\site-packages\pyiceberg\io\pyarrow.py", line 1726, in write_file
    with fo.create(overwrite=True) as fos:
         ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\flink_iceberg\.venv2\Lib\site-packages\pyiceberg\io\pyarrow.py", line 299, in create
    output_file = self._filesystem.open_output_stream(self._path, buffer_size=self._buffer_size)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow\_fs.pyx", line 868, in pyarrow._fs.FileSystem.open_output_stream
  File "pyarrow\error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow\error.pxi", line 115, in pyarrow.lib.check_status
OSError: When initiating multiple part upload for key 'poc_new/coordinates/data/00000-0-efc0be57-453d-442d-af13-2e0b2382a53d.parquet' in bucket 'warehouse': AWS Error NETWORK_CONNECTION during CreateMultipartUpload operation: Encountered network error when sending http request

In MinIO, the metadata directory is created and contains the metadata.json file, but there is no data directory.
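
Since the error is reported as NETWORK_CONNECTION, one way to narrow it down is a plain TCP probe from the machine that runs the script. This is only a diagnostic sketch: the helper names `can_connect` and `probe` are hypothetical, and the host/port pairs in the usage comment are taken from the compose file. The name `minio` is a Docker network name, so it may not resolve from the Windows host.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


def probe(endpoints):
    """Map each 'host:port' string to whether it is reachable."""
    return {f"{host}:{port}": can_connect(host, port) for host, port in endpoints}


# Usage, run on the host where the script fails:
#   print(probe([("localhost", 8181), ("localhost", 9000), ("minio", 9000)]))
```

If `localhost:9000` is reachable but `minio:9000` is not, the client is probably being pointed at an endpoint that only exists inside the Docker network.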

Also, this is the requirements.txt file

annotated-types==0.7.0
apache-beam==2.48.0
apache-flink==1.19.1
apache-flink-libraries==1.19.1
avro-python3==1.10.2
certifi==2024.7.4
charset-normalizer==3.3.2
click==8.1.7
cloudpickle==2.2.1
colorama==0.4.6
confluent-kafka==2.5.0
crcmod==1.7
dill==0.3.1.1
dnspython==2.6.1
docopt==0.6.2
duckdb==0.9.2
duckdb_engine==0.13.0
Faker==26.0.0
fastavro==1.9.5
fasteners==0.19
fsspec==2023.12.2
greenlet==3.0.3
grpcio==1.65.1
hdfs==2.7.3
httplib2==0.22.0
idna==3.7
kafka-python==2.0.2
markdown-it-py==3.0.0
mdurl==0.1.2
mmhash3==3.0.1
numpy==1.24.4
objsize==0.6.1
orjson==3.10.6
packaging==24.1
pandas==2.2.2
polars==1.2.1
proto-plus==1.24.0
protobuf==4.23.4
py4j==0.10.9.7
pyarrow==11.0.0
pydantic==2.8.2
pydantic-settings==2.3.4
pydantic_core==2.20.1
pydot==1.4.2
Pygments==2.18.0
pyiceberg==0.6.1
pymongo==4.8.0
pyparsing==3.1.2
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
pytz==2024.1
regex==2024.7.24
requests==2.32.3
rich==13.7.1
ruamel.yaml==0.18.6
ruamel.yaml.clib==0.2.8
six==1.16.0
sortedcontainers==2.4.0
SQLAlchemy==2.0.31
strictyaml==1.7.3
typing_extensions==4.12.2
tzdata==2024.1
urllib3==2.2.2
zstandard==0.23.0

I checked this Slack thread about the same issue, but it doesn't contain a fix for my case.
OS: Windows 10

Environment variables containing "aws" in the three containers:

iceberg-rest container

iceberg@ce79d3f11b5f:/usr/lib/iceberg-rest$ env | grep -i aws
AWS_REGION=us-east-1
CATALOG_IO__IMPL=org.apache.iceberg.aws.s3.S3FileIO
AWS_SECRET_ACCESS_KEY=password
AWS_ACCESS_KEY_ID=admin

minio container: it doesn't have any environment variables containing "aws"

mc container

AWS_REGION=us-east-1
AWS_SECRET_ACCESS_KEY=password
AWS_ACCESS_KEY_ID=admin
