fix: drop latin-1 decode in source title and userIds #306
40e1d41 to 2e2584c
@@ -393,8 +393,8 @@ def _process_sources(
def _decode_latin_1(s: str) -> str:
    try:
        return s.encode('latin-1').decode('utf-8')
So we encode a string in latin-1 and decode using utf-8 again. Why do we do that?
I had forgotten too 🙈
Blaming the lines points to the issue this fix was made for: #71
It was probably related to the title being transported from context_chat (PHP) to the backend in the headers, which mostly only support latin-1. Tracing back the sources for that, these two links pop up:
python/cpython#105505 (comment)
https://stackoverflow.com/questions/4400678/what-character-encoding-should-i-use-for-a-http-header/4410331#4410331
Now that we don't have that limitation after the reversal of the indexing direction, the conversion can be dropped.
Thanks for the enquiry :)
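For anyone else wondering what the round trip does, here is a minimal sketch, not the actual backend code. It assumes the title's UTF-8 bytes were read as latin-1 somewhere in transit; the name `decode_latin_1_sketch` and the `except` clause are assumptions, since the diff above only shows the `try` branch.

```python
def decode_latin_1_sketch(s: str) -> str:
    """Hypothetical stand-in for _decode_latin_1: undo a UTF-8-read-as-latin-1 mix-up."""
    try:
        # Recover the raw bytes (latin-1 maps code points 0-255 one-to-one to bytes),
        # then interpret those bytes as the UTF-8 they originally were.
        return s.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Assumed fallback: the string was not mangled this way, keep it as-is.
        return s


title = 'files/河马.pdf'

# Simulate a latin-1-only transport: every UTF-8 byte becomes one latin-1 character.
mangled = title.encode('utf-8').decode('latin-1')
assert mangled != title

# The round trip restores the original title.
assert decode_latin_1_sketch(mangled) == title

# A title that already arrives as proper Unicode cannot be encoded as latin-1,
# so the sketch returns it unchanged.
assert decode_latin_1_sketch(title) == title
```

Once titles arrive as proper Unicode, the conversion has nothing left to repair, and it could in principle misread a legitimate title that happens to consist only of latin-1 characters forming valid UTF-8 bytes. That is one more reason dropping it looks safe.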
after dropping the conversion, it seems to work fine:
ccb=# select cmetadata from langchain_pg_embedding ;
cmetadata
----------------------------------------------------------------------------------------------------------------
{"type": "application/pdf", "title": "files/河马.pdf", "source": "files__default: 19455", "start_index": 0}
{"type": "application/pdf", "title": "files/河马.pdf", "source": "files__default: 19455", "start_index": 1365}
{"type": "application/pdf", "title": "files/河马.pdf", "source": "files__default: 19455", "start_index": 2907}
{"type": "application/pdf", "title": "files/河马.pdf", "source": "files__default: 19455", "start_index": 4138}
{"type": "application/pdf", "title": "files/河马.pdf", "source": "files__default: 19455", "start_index": 5963}
{"type": "application/pdf", "title": "files/河马.pdf", "source": "files__default: 19455", "start_index": 7259}
(6 rows)
blaming the lines points to this issue for which the fix was made: #71
and it was probably related to the issue that the title was transported from context_chat PHP to backend in the headers which only support latin-1 mostly, finding back the sources for it, these two links pop up:
python/cpython#105505 (comment)
https://stackoverflow.com/questions/4400678/what-character-encoding-should-i-use-for-a-http-header/4410331#4410331
now that we don't have that limitation after reversal of the indexing direction, it can be dropped.

Signed-off-by: kyteinsky <kyteinsky@gmail.com>
2e2584c to 48090b2
fixing this issue: