Skip to content

Latest commit

 

History

History
350 lines (208 loc) · 5.32 KB

File metadata and controls

350 lines (208 loc) · 5.32 KB

character encoding q&a

q: can you help with autodetect the encoding?

Yes — but with an important caveat:

Encoding autodetection is heuristic, not guaranteed.

For many byte sequences, multiple encodings are valid. The detector guesses based on:

  • byte patterns,
  • language statistics,
  • invalid byte combinations,
  • punctuation usage,
  • frequency analysis.

Still, you can often get very good results.

The easy cases

Some encodings are easy to detect because they have distinctive byte rules.

UTF-8

UTF-8

UTF-8 has strict multibyte patterns.

Invalid sequences are easy to spot.

Example:

  • bytes like C3 A9 strongly suggest UTF-8 (é)
  • random legacy encodings often fail UTF-8 validation

So:

  1. Try UTF-8 first.
  2. If valid and text looks reasonable, it's probably UTF-8.

This works surprisingly well.


UTF-16

UTF-16

Often detectable from:

  • BOM markers:

    • FF FE (little-endian)
    • FE FF (big-endian)
  • lots of zero bytes


The hard cases

Single-byte encodings are much harder.

Example:

Byte ISO-8859-1 ISO-8859-2
E9 é é
F6 ö ö

Many bytes overlap.

So detectors rely heavily on language analysis.


The main strategy used in practice

A robust detector usually does:

Step 1 — BOM detection

Look for:

  • UTF-8 BOM
  • UTF-16 BOM
  • UTF-32 BOM

If found:

  • done.

Step 2 — UTF-8 validation

Try decoding as UTF-8.

If:

  • valid,
  • and contains plausible text,

then choose UTF-8.

Modern data is overwhelmingly UTF-8.


Step 3 — Statistical detection for legacy encodings

If UTF-8 fails:

  • examine byte frequencies,
  • language-specific letters,
  • punctuation patterns,
  • invalid/control byte usage.

Examples:

  • smart quotes (0x91–0x94) strongly suggest Windows-1252
  • Polish letters suggest Windows-1250 or ISO-8859-2

Very important distinction:

“valid” vs “correct”

Example:

Bytes:

43 61 66 E9

Could mean:

  • Windows-1252 → Café
  • ISO-8859-1 → Café

Both decode perfectly.

Impossible to distinguish from bytes alone.


Common practical heuristics

Heuristic: Windows-1252 vs ISO-8859-1

This is classic.

If bytes 80–9F appear:

  • ISO-8859-1 treats them as control chars
  • Windows-1252 treats them as punctuation/symbols

So if you see:

  • curly quotes,
  • euro sign,
  • em dash,

it's probably Windows-1252.

Example:

93 Hello 94

likely means:

  • “Hello”
  • therefore Windows-1252

not ISO-8859-1.


Heuristic: Central European text

If text contains:

  • ł
  • č
  • ř
  • ő

then likely:

  • Windows-1250
  • ISO-8859-2

rather than Western European encodings.


Good libraries

Python

Modern replacement for chardet.

Classic detector.

Example:

from charset_normalizer import from_bytes

data = open("file.txt", "rb").read()

result = from_bytes(data).best()

print(result.encoding)

ICU

ICU Project

Industrial-strength internationalization library with charset detection.

Used in many serious systems.


Mozilla universal charset detector

Historical basis for many detectors.

Used by:

  • Firefox,
  • chardet,
  • others.

A practical modern approach

If you're building your own system:

1. BOM?
2. Valid UTF-8?
3. Try UTF-16 heuristics
4. Run statistical detector
5. Fall back to Windows-1252

Why Windows-1252 fallback?

Because tons of old “ANSI” text on Windows is actually 1252.


The browser reality

Modern browsers effectively do:

  • BOM
  • HTTP header
  • HTML meta charset
  • UTF-8 checks
  • heuristic detection
  • Windows compatibility hacks

because the web accumulated decades of malformed text.


One especially nasty problem:

mojibake chains

Example:

UTF-8 bytes for:

é

are:

C3 A9

If interpreted as Windows-1252:

  • C3 → Ã
  • A9 → ©

Result:

é

Classic mojibake.

A good detector may recognize this pattern and infer:

“This was probably UTF-8 decoded incorrectly as Windows-1252.”

That becomes a repair problem, not just detection.


Realistic expectation

Autodetection confidence:

Situation Reliability
UTF-8 vs invalid bytes excellent
UTF-16 BOM excellent
Windows-1252 with smart quotes good
Polish text 1250 vs 8859-2 decent
Short ASCII-only text impossible
Small samples poor

If text is only:

Hello world

then:

  • UTF-8
  • 8859-1
  • 8859-2
  • 1250
  • 1252

all produce identical output.

No detector can know.