fix: drop latin-1 decode in source title and userIds by kyteinsky · Pull Request #306 · nextcloud/context_chat_backend

kyteinsky · 2026-05-28T09:34:27Z

fixing this issue:

2026-05-28T07:53:19+0000: [ERROR|utils]: original traceback of embed_sources (PID 22239, exitcode: 0): Traceback (most recent call last):
  File "/app/context_chat_backend/utils.py", line 138, in exception_wrap
    value = None if fun is None else fun(*args, **kwargs)
                                     ^^^^^^^^^^^^^^^^^^^^
  File "/app/context_chat_backend/chain/ingest/injest.py", line 407, in embed_sources
    'source_ids': [
                  ^
  File "/app/context_chat_backend/chain/ingest/injest.py", line 408, in <listcomp>
    f'{source.reference} ({_decode_latin_1(source.title)})'
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/context_chat_backend/chain/ingest/injest.py", line 395, in _decode_latin_1
    return s.encode('latin-1').decode('utf-8')
           ^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2013' in position 27: ordinal not in range(256)
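
The failure can be reproduced in isolation. This is a minimal sketch, assuming a hypothetical title containing an en dash (U+2013), which latin-1 cannot represent. `decode_latin_1` is a stand-in that mirrors only the expression shown in the traceback, not the project's full helper:

```python
def decode_latin_1(s: str) -> str:
    # Mirrors the expression from the traceback (stand-in, not the project's helper).
    return s.encode('latin-1').decode('utf-8')


# Hypothetical title; '\u2013' (en dash) is outside latin-1's 0-255 range.
title = 'Report 2020\u20132021.pdf'

error = None
try:
    decode_latin_1(title)
except UnicodeEncodeError as exc:
    error = exc  # keep a reference; `exc` is unbound after the except block

print(type(error).__name__, error.reason, error.start)
```

The encode step fails before the UTF-8 decode is ever reached, which matches the traceback pointing at `s.encode('latin-1')`.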

marcelklehr · 2026-05-28T10:43:13Z

@@ -393,8 +393,8 @@ def _process_sources(
 def _decode_latin_1(s: str) -> str:
 	try:
 		return s.encode('latin-1').decode('utf-8')


So we encode a string in latin-1 and decode using utf-8 again. Why do we do that?

I had forgotten too 🙈
Blaming the lines points to #71, the issue the conversion was originally added for.
It was probably related to the title being transported from context_chat (PHP) to the backend in HTTP headers, which mostly support only latin-1. Tracing back the sources for that, these two links turn up:
python/cpython#105505 (comment)
https://stackoverflow.com/questions/4400678/what-character-encoding-should-i-use-for-a-http-header/4410331#4410331

Now that the indexing direction has been reversed, that limitation no longer applies, so the conversion can be dropped.
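
To illustrate why the round trip existed and why it breaks now, here is a sketch. It assumes the title once arrived as UTF-8 bytes that were read as latin-1 somewhere on the header path; that transport detail is an assumption based on the comment above, not something verified against #71. The old conversion undoes exactly that misreading, but it raises on a string that was never mangled:

```python
original = 'files/河马.pdf'

# Assumed header-era behaviour: UTF-8 bytes interpreted as latin-1 (mojibake).
mangled = original.encode('utf-8').decode('latin-1')

# The old conversion reverses that misinterpretation.
restored = mangled.encode('latin-1').decode('utf-8')

# On a correctly decoded string, the same conversion cannot even encode.
try:
    original.encode('latin-1')
    clean_ok = True
except UnicodeEncodeError:
    clean_ok = False

print(restored == original, clean_ok)
```

So the conversion only makes sense while the input is known to be mangled. Once titles arrive as proper Unicode, it is at best a no-op for ASCII and otherwise an error.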

thanks for the enquiry :)

after dropping the conversion, it seems to work fine:

ccb=# select cmetadata from langchain_pg_embedding ;
                                                   cmetadata
----------------------------------------------------------------------------------------------------------------
 {"type": "application/pdf", "title": "files/河马.pdf", "source": "files__default: 19455", "start_index": 0}
 {"type": "application/pdf", "title": "files/河马.pdf", "source": "files__default: 19455", "start_index": 1365}
 {"type": "application/pdf", "title": "files/河马.pdf", "source": "files__default: 19455", "start_index": 2907}
 {"type": "application/pdf", "title": "files/河马.pdf", "source": "files__default: 19455", "start_index": 4138}
 {"type": "application/pdf", "title": "files/河马.pdf", "source": "files__default: 19455", "start_index": 5963}
 {"type": "application/pdf", "title": "files/河马.pdf", "source": "files__default: 19455", "start_index": 7259}
(6 rows)
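
For reference, a rough sketch of what the list comprehension from the traceback might look like with the conversion dropped. `Source` is a hypothetical stand-in with only the two fields visible in the traceback, and the values are taken from the query output above. The actual change in the PR may differ:

```python
from dataclasses import dataclass


@dataclass
class Source:
    # Hypothetical stand-in with only the two fields seen in the traceback.
    reference: str
    title: str


sources = [Source(reference='files__default: 19455', title='files/河马.pdf')]

# Same f-string shape as in the traceback, minus the _decode_latin_1 call.
source_ids = [f'{source.reference} ({source.title})' for source in sources]
print(source_ids)
```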

Signed-off-by: kyteinsky <kyteinsky@gmail.com>

kyteinsky requested a review from marcelklehr as a code owner May 28, 2026 09:34

kyteinsky force-pushed the fix/catch-unicode-encode-error branch from 40e1d41 to 2e2584c Compare May 28, 2026 09:34

marcelklehr reviewed May 28, 2026


kyteinsky force-pushed the fix/catch-unicode-encode-error branch from 2e2584c to 48090b2 Compare May 28, 2026 11:10

kyteinsky changed the title ~~fix: catch UnicodeEncodeError in source title and userIds~~ fix: drop latin-1 decode in source title and userIds May 28, 2026

marcelklehr approved these changes May 28, 2026


kyteinsky merged commit fb25e86 into master May 28, 2026
14 checks passed

kyteinsky deleted the fix/catch-unicode-encode-error branch May 28, 2026 12:36
