parse_from_bytes silently mis-decodes 8bit UTF-8 bodies as Latin-1 via raw-unicode-escape #152

@ymyke


Summary

For messages parsed via parse_from_bytes (or MailParser.from_bytes), text body parts whose Content-Transfer-Encoding is 8bit, 7bit, or absent are mis-decoded: UTF-8 byte sequences are passed through raw-unicode-escape, which interprets them as Latin-1. The result is silent mojibake — no exception is raised, and the declared charset=utf-8 from Content-Type is ignored.

Example: — (em-dash, U+2014, UTF-8 \xe2\x80\x94) becomes the three-codepoint sequence â\x80\x94.

This is related to but not fixed by #97 / #125. PR #125 added a try/except UnicodeDecodeError fallback to ported_string(payload, encoding=charset), but the except branch only fires on malformed \u escapes (the original #97 crash). Valid UTF-8 byte sequences decode "successfully" under raw-unicode-escape and never trigger the fallback — they are silently corrupted instead.
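The two behaviors described above can be checked with the standard library alone. This is a minimal sketch, not mail-parser code: the variable names are illustrative, and the b"C:\\users" input is only an assumed stand-in for a #97-style body.

```python
# UTF-8 encoding of U+2014 (em dash).
utf8_bytes = "\u2014".encode("utf-8")
assert utf8_bytes == b"\xe2\x80\x94"

# raw-unicode-escape maps every non-escape byte to the code point with the
# same value (the same mapping Latin-1 uses), so valid UTF-8 never raises.
mojibake = utf8_bytes.decode("raw-unicode-escape")
same_as_latin1 = mojibake == utf8_bytes.decode("latin-1")
print(repr(mojibake))  # 'â\x80\x94'

# A backslash-u that is not followed by four hex digits is the only kind of
# input that raises, which is the case the try/except from PR #125 covers.
try:
    b"C:\\users".decode("raw-unicode-escape")
    raised = False
except UnicodeDecodeError:
    raised = True
print(raised)  # True
```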

Steps to Reproduce

Minimal .eml (utf8.eml):

From: a@example.com
To: b@example.com
Subject: probe
Content-Type: text/plain; charset=utf-8
MIME-Version: 1.0

Hello — world

Script:

import mailparser

m_bytes  = mailparser.parse_from_bytes(open("utf8.eml", "rb").read())
m_string = mailparser.parse_from_string(open("utf8.eml", encoding="utf-8").read())

print("from_bytes :", repr(m_bytes.text_plain[0]))
print("from_string:", repr(m_string.text_plain[0]))

Output:

from_bytes : 'Hello â\x80\x94 world\n'
from_string: 'Hello — world\n'

Expected Behavior

Both parse_from_bytes and parse_from_string should produce 'Hello — world\n', honoring the part's declared charset=utf-8.

Root Cause

src/mailparser/core.py:471-475 (4.2.1 / HEAD):

if not cte or cte in ["7bit", "8bit"]:
    try:
        payload = payload.decode("raw-unicode-escape")
    except UnicodeDecodeError:
        payload = ported_string(payload, encoding=charset)
else:
    payload = ported_string(payload, encoding=charset)

The two code paths give payload very different bytes before this block:

  • email.message_from_bytes(...).get_payload(decode=True) returns the raw body bytes (e.g. b'\xe2\x80\x94' for —). raw-unicode-escape decodes these as Latin-1 (no exception raised) → mojibake.
  • email.message_from_string(...).get_payload(decode=True) returns ASCII-escaped bytes (e.g. b'\\u2014'). raw-unicode-escape decodes the escape correctly → correct character.

So from_string works by accident: it relies on the stdlib's internal escape-encoding to produce input that raw-unicode-escape happens to handle. from_bytes exposes the underlying logic flaw.
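The difference between the two stdlib paths can be observed without mail-parser installed. The sketch below rebuilds the utf8.eml message as a bytes literal (the RAW name and the variable names are illustrative) and applies the same raw-unicode-escape decode to both payloads.

```python
import email

# Same message as utf8.eml, written out as bytes.
RAW = (
    b"From: a@example.com\n"
    b"To: b@example.com\n"
    b"Subject: probe\n"
    b"Content-Type: text/plain; charset=utf-8\n"
    b"MIME-Version: 1.0\n"
    b"\n"
    b"Hello \xe2\x80\x94 world\n"
)

# Bytes path: the stdlib hands back the body bytes unchanged.
payload_from_bytes = email.message_from_bytes(RAW).get_payload(decode=True)

# String path: the body is already str, so the stdlib re-encodes it and
# falls back to raw-unicode-escape for the non-ASCII character.
payload_from_string = email.message_from_string(
    RAW.decode("utf-8")
).get_payload(decode=True)

print(payload_from_bytes)   # b'Hello \xe2\x80\x94 world\n'
print(payload_from_string)  # b'Hello \\u2014 world\n'

# Applying the decode used in core.py to each payload:
decoded_bytes_path = payload_from_bytes.decode("raw-unicode-escape")
decoded_string_path = payload_from_string.decode("raw-unicode-escape")
print(repr(decoded_bytes_path))   # mojibake
print(repr(decoded_string_path))  # correct text
```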

Why the Current Logic Is Inverted

The comment block above this code (lines 459-465) explains the intent: when get_payload(decode=True) returns bytes that Python "broke" by mishandling the encoding, the author reinterprets them via raw-unicode-escape. But the correct approach is the opposite: the part's Content-Type already declares the body's charset, so decode with that charset directly. raw-unicode-escape should never be the primary decoder for body bytes.

Suggested Fix

Decode with the declared charset directly:

if not cte or cte in ["7bit", "8bit"]:
    try:
        payload = payload.decode(charset)
    except (UnicodeDecodeError, LookupError):
        # Fallback for legacy "\uXXXX-in-body" cases (#97); won't silently
        # mis-interpret UTF-8 because that bytes path is handled above.
        payload = payload.decode("raw-unicode-escape", errors="replace")
else:
    payload = ported_string(payload, encoding=charset)

This:

  • Honors Content-Type: charset=... for the case it was declared for.
  • Keeps the raw-unicode-escape path available for messages like those in #97 (UnicodeDecodeError when parsing email with "\u" in its body), where \u literals appear in 8bit bodies labeled as ASCII/UTF-8.
  • Preserves behavior for non-7bit/8bit CTEs (base64, quoted-printable) which already go through ported_string(..., encoding=charset).
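To make the charset-first order concrete, here is a standalone sketch of the 7bit/8bit branch. decode_text_payload is a hypothetical helper, not a function in mail-parser, and the "utf-8" default is an assumption about what the caller passes as charset.

```python
def decode_text_payload(payload: bytes, charset: str = "utf-8") -> str:
    """Hypothetical helper mirroring the suggested 7bit/8bit branch."""
    try:
        # Primary path: trust the charset declared in Content-Type.
        return payload.decode(charset)
    except (UnicodeDecodeError, LookupError):
        # Fallback: bytes invalid in the declared charset, or a charset
        # name Python does not know.
        return payload.decode("raw-unicode-escape", errors="replace")


# Valid UTF-8 is decoded as UTF-8 instead of Latin-1.
utf8_result = decode_text_payload(b"Hello \xe2\x80\x94 world\n", "utf-8")

# A lone 0xE9 is not valid UTF-8, so the fallback path is taken.
fallback_result = decode_text_payload(b"caf\xe9\n", "utf-8")

# An unknown charset name raises LookupError and also falls back.
unknown_result = decode_text_payload(b"plain\n", "x-no-such-charset")

# Escaped ASCII input is valid UTF-8, so the escape stays literal text.
escaped_result = decode_text_payload(b"Hello \\u2014 world\n", "utf-8")
```

One thing this sketch makes visible: if the string path still hands this branch escaped bytes such as b'\\u2014' (as described under Root Cause), charset-first decoding leaves the escape as literal text. That path may need separate handling, and a from_bytes versus from_string equality test would surface it.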

A regression test loading the minimal .eml above via both parse_from_bytes and parse_from_string and asserting equality would catch any future re-inversion.

Environment

  • mail-parser: 4.1.2 (also verified in 4.2.1 / HEAD)
  • Python: 3.11
  • OS: Linux

Workaround

In our subclass we override from_bytes to decode bytes via UTF-8 and delegate to from_string, which sidesteps the broken branch. Happy to open a PR with the upstream fix if there's interest.
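A function-level variant of this workaround might look like the sketch below. parse_bytes_via_string is a hypothetical name, and the parser is passed in as a callable so the example runs with the standard library alone; email.message_from_string stands in for mailparser.parse_from_string. It assumes the whole raw message is valid UTF-8.

```python
import email
from typing import Callable, TypeVar

T = TypeVar("T")


def parse_bytes_via_string(raw: bytes, parse_from_string: Callable[[str], T]) -> T:
    """Hypothetical helper: decode the raw message as UTF-8, then delegate."""
    # Strict decoding: non-UTF-8 input raises UnicodeDecodeError instead of
    # being silently corrupted.
    text = raw.decode("utf-8")
    return parse_from_string(text)


RAW = b"Content-Type: text/plain; charset=utf-8\n\nHello \xe2\x80\x94 world\n"

# Stand-in parser; with mail-parser installed the second argument would be
# mailparser.parse_from_string.
msg = parse_bytes_via_string(RAW, email.message_from_string)

# The string path yields escaped bytes, which raw-unicode-escape decodes
# correctly.
body = msg.get_payload(decode=True).decode("raw-unicode-escape")
print(repr(body))
```

Because the whole message is decoded with one codec, this only suits mail known to be UTF-8 throughout; multipart messages whose parts declare different charsets would need the upstream fix instead.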
