feat(files): Support upload of files larger than 5GB in all CDF cloud environments (#2566)
Conversation
Code Review
This pull request implements multipart upload functionality for the Files API, allowing large files to be uploaded in parallel parts. The changes include logic for calculating part sizes, managing upload sessions, and updated unit tests. Feedback identifies several necessary improvements: using a semaphore to limit concurrent uploads and prevent memory exhaustion, moving blocking file I/O operations to a thread pool, and enforcing a 1TB limit on file sizes. Additionally, corrections are needed for a broken docstring link and the use of Any type hints in test fixtures to comply with the project's typing standards.
Codecov Report ❌ Patch coverage is
Additional details and impacted files:

```
@@           Coverage Diff           @@
##           master    #2566   +/-   ##
=======================================
  Coverage   93.47%   93.47%
=======================================
  Files         480      480
  Lines       48318    48383     +65
=======================================
+ Hits        45164    45227     +63
- Misses       3154     3156      +2
```
… improving file size detection for AsyncIterable
…g of concurrent uploads
/gemini review
Code Review
This pull request implements multipart upload support for the Files API, introducing concurrency control via semaphores and logic for calculating optimal part sizes. It also updates AsyncFileChunker to support offset-based reading and includes comprehensive unit tests for the new multipart flow. Feedback was provided regarding a type ignore marker that should be removed once related type hints are updated to maintain the repository's strong typing standards.
```python
    source_modified_time=source_modified_time,
    security_categories=security_categories,
)

from cognite.client import global_config
```
I'm seeing a few lazily evaluated imports; could we keep these at the top of the file, or is there something preventing that?
I also noticed that during development, as I was following a pattern from other code that used global_config.
If the same line is moved to the top of the file along with the other imports, it leads to a circular import error:

```
ImportError: cannot import name 'global_config' from partially initialized module 'cognite.client' (most likely due to a circular import)
```
I saw that `from cognite.client import global_config` is used in a lot of places within functions in the SDK, so I assumed that was the practice for avoiding circular imports.
But I see that using the more specific `from cognite.client.config import global_config` (adding `.config`) solves the circular import problem.
Moved the import to the top of the file in later commits.
haakonvt left a comment
Excellent changes, a great QOL improvement! Consider whether we can reuse `execute_async_tasks_with_fail_fast` or `execute_async_tasks` from `cognite/client/utils/_concurrency.py`. A few other suggestions/remarks:
```python
MIN_MULTIPART_SIZE = 5 * 1024 * 1024  # 5 MiB
MAX_MULTIPART_SIZE = 4000 * 1024 * 1024  # 4000 MiB
DEFAULT_MULTIPART_SIZE = 50 * 1024 * 1024  # 50 MiB
MAX_MULTIPART_PARTS = 250
```
Let's add "FILE" into the constant names to make them self-explanatory
Fixed in newer commits
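For illustration, the part-size logic these constants imply could be sketched roughly as follows. This is a hypothetical `calculate_part_size` helper, not the PR's actual implementation: it grows the part size from the default just enough to keep the part count within the limit, and rejects files above the hard ceiling.

```python
import math

MIN_MULTIPART_SIZE = 5 * 1024 * 1024  # 5 MiB
MAX_MULTIPART_SIZE = 4000 * 1024 * 1024  # 4000 MiB
DEFAULT_MULTIPART_SIZE = 50 * 1024 * 1024  # 50 MiB
MAX_MULTIPART_PARTS = 250


def calculate_part_size(file_size: int) -> tuple[int, int]:
    """Pick a part size within [MIN, MAX] so the file fits in at most MAX_MULTIPART_PARTS parts."""
    if file_size > MAX_MULTIPART_SIZE * MAX_MULTIPART_PARTS:  # ~977 GiB hard ceiling
        raise ValueError("File too large for multipart upload")
    part_size = DEFAULT_MULTIPART_SIZE
    if math.ceil(file_size / part_size) > MAX_MULTIPART_PARTS:
        # Grow the part size just enough to stay within the part-count limit.
        part_size = math.ceil(file_size / MAX_MULTIPART_PARTS)
    part_size = max(MIN_MULTIPART_SIZE, min(part_size, MAX_MULTIPART_SIZE))
    num_parts = max(1, math.ceil(file_size / part_size))
    return part_size, num_parts
```

With these constants the maximum uploadable size comes out to 250 × 4000 MiB, roughly 977 GiB, which lines up with the "about 1000 GiB" limit mentioned in the PR description.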
```diff
     instance_id: NodeId | None = None,
 ) -> FileMetadata:
-    """`Upload a file content <https://api-docs.cognite.com/20230101/tag/Files/operation/getUploadLink>`_
+    """`Upload file content <https://api-docs.cognite.com/20230101/tag/Files/operation/getMultiPartUploadLink>`_
```
Are we sure we want to point this away from the "old" get upload link? The multi-part is a bit more complex.
With the changes in this PR, the `upload()` function uses the `files/multiuploadlink` (and `files/completemultipartupload`) endpoints, so it seems more correct to reference those endpoints in the REST API documentation.
The API docs there contain further details about the file size restrictions, note that multipart upload provides a uniform API across all CDF cloud environments, and describe the steps needed to use the multipart upload endpoints.
However, all those details are not that relevant for users of `upload()`, as the function takes care of the part size calculation, uploading parts in parallel, and completing the operation with the `completemultipartupload` call.
```python
from cognite.client import global_config

upload_semaphore = asyncio.Semaphore(global_config.concurrency_settings.general.write)
```
The docs for multi-part say:

> The parts can optionally be uploaded in parallel, preferably on a subset of parts at a time, for example maximum 3 concurrent PUT operations.
Should we introduce a separate concurrency_settings for file upload/download concurrency?
That could be relevant, yes, but then probably in a separate PR?
Or should the httpx resource limits be used for that?
https://www.python-httpx.org/advanced/resource-limits/
I think it could also be relevant to support network bandwidth throttling for file uploads (and downloads, and possibly other data).
Throttling could perhaps be implemented with existing async or httpx extensions.
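As a rough illustration of the bandwidth-throttling idea discussed above (not part of this PR; the class name, API, and rate numbers are made up), a simple token-bucket limiter awaited before each chunk upload could look like this:

```python
import asyncio
import time


class ByteRateLimiter:
    """Token bucket: callers await acquire(n) before sending n bytes, capping throughput at `rate` B/s."""

    def __init__(self, rate: float) -> None:
        self.rate = rate
        self._allowance = rate  # start with a full bucket
        self._last = time.monotonic()
        self._lock = asyncio.Lock()

    async def acquire(self, n_bytes: int) -> None:
        async with self._lock:
            now = time.monotonic()
            # Refill the bucket for the time elapsed since the last call.
            self._allowance = min(self.rate, self._allowance + (now - self._last) * self.rate)
            self._last = now
            if n_bytes > self._allowance:
                # Not enough budget: sleep until the deficit is paid off.
                await asyncio.sleep((n_bytes - self._allowance) / self.rate)
                self._allowance = 0.0
            else:
                self._allowance -= n_bytes


async def demo() -> float:
    limiter = ByteRateLimiter(rate=10_000)  # 10 kB/s
    start = time.monotonic()
    await limiter.acquire(10_000)  # fits in the initial bucket, returns immediately
    await limiter.acquire(5_000)   # bucket is empty, sleeps about 0.5 s
    return time.monotonic() - start
```

Running `asyncio.run(demo())` should take roughly half a second, showing how per-chunk acquisition shapes the effective upload rate.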
```python
self._file_handle = file_handle
self._chunk_size = global_config.file_upload_chunk_size or self.CHUNK_SIZE
self._remaining = size
self.size = size  # exposed so prepare_content_for_upload can set Content-Length
```
Exposing `size` is a great addition.
```diff
         file_size = len(content)
     case AsyncIterable():
-        file_size = None
+        file_size = getattr(content, "size", None)
```
Suggestion:

```python
case AsyncFileChunker():
    file_size = content.size
case AsyncIterable():
    file_size = None
```

Fixed in later commits.
```
Creates files in files API with metadata and uploads file content.

If path is a directory, this method will upload all files in that directory. Use `recursive=True` for subdirectories as well.
```
Suggestion:

```diff
-If path is a directory, this method will upload all files in that directory. Use `recursive=True` for subdirectories as well.
+Tip:
+    If path is a directory, this method will upload all files in that directory. Use `recursive=True` for subdirectories as well.
```

or `Note:`
```python
async with session:

    async def upload_part(part_no: int) -> None:
        async with upload_semaphore:
```
Don't need this semaphore, already being auto-used.
Without the semaphore, the `_upload_file_part` calls get out of control, running more calls than `concurrency_limit`.
I added more test cases checking for this; see the new `test_upload_content_semaphore_limits_concurrent_parts` and `test_upload_semaphore_limits_concurrent_parts` in `test_files.py`.
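The behaviour those tests assert can be demonstrated with a self-contained sketch (illustrative only, not the SDK code): a counter tracks the peak number of simultaneously running `upload_part` coroutines, and the semaphore keeps it at or below the limit.

```python
import asyncio


async def run_upload(n_parts: int, concurrency_limit: int) -> int:
    """Run n_parts fake part-uploads under a semaphore; return the peak concurrency observed."""
    upload_semaphore = asyncio.Semaphore(concurrency_limit)
    running = 0
    peak = 0

    async def upload_part(part_no: int) -> None:
        nonlocal running, peak
        async with upload_semaphore:
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0.01)  # stand-in for the PUT of one part
            running -= 1

    await asyncio.gather(*(upload_part(i) for i in range(n_parts)))
    return peak


peak = asyncio.run(run_upload(n_parts=20, concurrency_limit=3))
assert peak <= 3  # without the semaphore, all 20 parts would run at once
```

Removing the `async with upload_semaphore:` line makes the peak jump to `n_parts`, which is exactly the failure mode the review thread describes.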
```python
read_size = min(part_size, file_size - offset)
with path.open("rb") as fh:
    return await self.upload_bytes(fh, overwrite=overwrite, **file_metadata.dump(camel_case=False))
    await session.upload_part_async(part_no, AsyncFileChunker(fh, offset=offset, size=read_size))
```
I guess the actual file read is blocking, but I think we should just accept that.
Yes, the synchronous `path.open` is still used.
mmap or asyncio could be used, but I doubt it would really improve things.
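The offset/size bookkeeping behind that snippet can be sketched as a small helper (hypothetical name, mirroring the `read_size = min(part_size, file_size - offset)` expression in the diff above):

```python
def iter_parts(file_size: int, part_size: int):
    """Yield (part_no, offset, read_size) triples covering the whole file without overlap."""
    part_no, offset = 0, 0
    while offset < file_size:
        read_size = min(part_size, file_size - offset)  # last part may be smaller
        yield part_no, offset, read_size
        part_no += 1
        offset += read_size


parts = list(iter_parts(file_size=105, part_size=50))
# Three parts: two full 50-byte parts and a final 5-byte remainder.
assert parts == [(0, 0, 50), (1, 50, 50), (2, 100, 5)]
```

Each part can then be read independently from its `offset`, which is what makes the parallel PUTs possible.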
Changed to import `global_config` from `cognite.client.config` instead of the public export in `cognite.client`, to break the circular import.
Proves that `upload_semaphore` is needed
Description
The SDK functions `upload()` (for single-file or directory upload) and `upload_content()` (for CogniteFile-based files) are currently limited to uploading files smaller than 5 GB in AWS and Azure CDF clusters.

With the changes in this PR, the `upload()` and `upload_content()` functions raise this limit to support upload of files up to (about) 1000 GiB in size, as documented in the Files API docs (https://api-docs.cognite.com/20230101/tag/Files/operation/initMultiPartUpload). Internally, the functions now use the multipart upload functions. Multipart upload enables upload of files larger than 5 GB in all CDF cloud environments (Azure, AWS, and Google). In addition, upload performance can be increased, especially for single large files.

With this change, client scripts or notebooks using `upload()` and `upload_content()` will benefit implicitly and do not have to be ported to the (more low-level) multipart upload functions in the SDK.

Note that the other, non-file-related upload functions are not updated, as they are meant for streaming data, where data is not readable in parallel from different offsets and the size is not generally known up front.
A semaphore was added to ensure that the async parallel uploads adhere to the concurrency setting in the configuration; without it, uploading directories with many files can fail with "too many open files".
To test the functionality, a custom Python test script was created to measure upload performance for a few large files as well as many small files. The main test case uploads a folder with 4 files, where 2 of the files are large (above 5 GB).
In total, 30703.83 MB of data is uploaded per test run.
The file upload script mainly calls `client.files.upload(DIRECTORY_NAME, recursive=True)`.
With the currently latest released 8.0.7 version, `upload()` of large files works in CDF Google clusters, but fails in AWS and Azure clusters.
With the fixes in this PR, the script succeeds in all 3 CDF cloud environments.
The upload script achieves a high upload rate on all the cloud environments with the 1000 Mbps Ethernet connection I am using.
Comparison of old (SDK 8.0.7) and new (with this PR) behaviour and performance for upload of a folder with a few large files (results table not included here).
Checklist:
If a new method has been added, it should be referenced in `cognite.rst` in order to generate docs based on its docstring.