fix: drop latin-1 decode in source title and userIds #306

Merged
kyteinsky merged 1 commit into master from fix/catch-unicode-encode-error
May 28, 2026
@kyteinsky
Contributor

This fixes the following error:

2026-05-28T07:53:19+0000: [ERROR|utils]: original traceback of embed_sources (PID 22239, exitcode: 0): Traceback (most recent call last):
  File "/app/context_chat_backend/utils.py", line 138, in exception_wrap
    value = None if fun is None else fun(*args, **kwargs)
                                     ^^^^^^^^^^^^^^^^^^^^
  File "/app/context_chat_backend/chain/ingest/injest.py", line 407, in embed_sources
    'source_ids': [
                  ^
  File "/app/context_chat_backend/chain/ingest/injest.py", line 408, in <listcomp>
    f'{source.reference} ({_decode_latin_1(source.title)})'
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/context_chat_backend/chain/ingest/injest.py", line 395, in _decode_latin_1
    return s.encode('latin-1').decode('utf-8')
           ^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2013' in position 27: ordinal not in range(256)
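
For context, the failure can be reproduced in isolation. The sketch below copies the one-line conversion shown at line 395 of the traceback and feeds it a title containing an en dash (U+2013). The sample title is made up for illustration; the real title from the log is not known.

```python
def _decode_latin_1(s: str) -> str:
    # Same expression as the failing line in the traceback above.
    return s.encode('latin-1').decode('utf-8')


# Hypothetical title; any code point above U+00FF triggers the error.
title = 'report 2025\u20132026.pdf'

error = None
try:
    _decode_latin_1(title)
except UnicodeEncodeError as e:
    error = e
    print(f'{type(e).__name__}: {e}')
```

latin-1 only covers code points 0 to 255, so the `encode` step fails before the UTF-8 decode is ever reached.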

@kyteinsky kyteinsky requested a review from marcelklehr as a code owner May 28, 2026 09:34
@kyteinsky kyteinsky force-pushed the fix/catch-unicode-encode-error branch from 40e1d41 to 2e2584c Compare May 28, 2026 09:34
@@ -393,8 +393,8 @@ def _process_sources(
def _decode_latin_1(s: str) -> str:
    try:
        return s.encode('latin-1').decode('utf-8')
Member

So we encode a string in latin-1 and decode using utf-8 again. Why do we do that?

Contributor Author

I had forgotten too 🙈
git blame on these lines points to #71, the issue this conversion was originally added to fix.
It was probably needed because the title was sent from context_chat (PHP) to the backend in HTTP headers, which mostly support only latin-1. Looking for the sources again, these two links come up:
python/cpython#105505 (comment)
https://stackoverflow.com/questions/4400678/what-character-encoding-should-i-use-for-a-http-header/4410331#4410331

Now that the indexing direction has been reversed, that limitation no longer applies, so the conversion can be dropped.

thanks for the enquiry :)
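
To illustrate the explanation above: when UTF-8 bytes are decoded as latin-1 somewhere in transit (as many HTTP stacks do for header values), the receiver sees mojibake, and encoding back to latin-1 recovers the original bytes. This is only a sketch of that mechanism with a made-up title, not the actual transport code of context_chat or the backend.

```python
original = 'files/河马.pdf'  # hypothetical title for illustration

# Sender puts UTF-8 bytes on the wire.
wire_bytes = original.encode('utf-8')

# A header parser that assumes latin-1 maps each byte to one character.
received = wire_bytes.decode('latin-1')

# The old conversion undoes that step and restores the real title.
restored = received.encode('latin-1').decode('utf-8')
print(restored == original)

# A title that arrives already correctly decoded cannot take the same path.
already_ok_fails = False
try:
    original.encode('latin-1')
except UnicodeEncodeError:
    already_ok_fails = True
print(already_ok_fails)
```

The conversion is therefore only valid while the title actually arrives as latin-1 mojibake. Once the title arrives as a proper Unicode string, applying it either raises or corrupts the text.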

Contributor Author

after dropping the conversion, it seems to work fine:

ccb=# select cmetadata from langchain_pg_embedding ;
                                                   cmetadata                                                    
----------------------------------------------------------------------------------------------------------------
 {"type": "application/pdf", "title": "files/河马.pdf", "source": "files__default: 19455", "start_index": 0}
 {"type": "application/pdf", "title": "files/河马.pdf", "source": "files__default: 19455", "start_index": 1365}
 {"type": "application/pdf", "title": "files/河马.pdf", "source": "files__default: 19455", "start_index": 2907}
 {"type": "application/pdf", "title": "files/河马.pdf", "source": "files__default: 19455", "start_index": 4138}
 {"type": "application/pdf", "title": "files/河马.pdf", "source": "files__default: 19455", "start_index": 5963}
 {"type": "application/pdf", "title": "files/河马.pdf", "source": "files__default: 19455", "start_index": 7259}
(6 rows)

git blame on these lines points to #71, the issue this conversion was originally added to fix.
It was probably needed because the title was sent from context_chat (PHP) to the backend in HTTP headers, which mostly support only latin-1. Looking for the sources again, these two links come up:
python/cpython#105505 (comment)
https://stackoverflow.com/questions/4400678/what-character-encoding-should-i-use-for-a-http-header/4410331#4410331

Now that the indexing direction has been reversed, that limitation no longer applies, so the conversion can be dropped.

Signed-off-by: kyteinsky <kyteinsky@gmail.com>
@kyteinsky kyteinsky force-pushed the fix/catch-unicode-encode-error branch from 2e2584c to 48090b2 Compare May 28, 2026 11:10
@kyteinsky kyteinsky changed the title fix: catch UnicodeEncodeError in source title and userIds fix: drop latin-1 decode in source title and userIds May 28, 2026
@kyteinsky kyteinsky merged commit fb25e86 into master May 28, 2026
14 checks passed
@kyteinsky kyteinsky deleted the fix/catch-unicode-encode-error branch May 28, 2026 12:36