parse_from_bytes silently mis-decodes 8bit UTF-8 bodies as Latin-1 via raw-unicode-escape #152

## Summary

For messages parsed via `parse_from_bytes` (or `MailParser.from_bytes`), text body parts whose Content-Transfer-Encoding is `8bit`, `7bit`, or absent are mis-decoded: UTF-8 byte sequences are passed through `raw-unicode-escape`, which interprets them as Latin-1. The result is silent mojibake — no exception is raised, and the declared `charset=utf-8` from Content-Type is ignored.

Example: `—` (em-dash, U+2014, UTF-8 `\xe2\x80\x94`) becomes the three-codepoint sequence `â\x80\x94`.
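The corruption can be reproduced with the codec alone, without mailparser. The following is a minimal sketch using only Python's built-in codecs; the variable names are illustrative.

```python
# UTF-8 bytes for U+2014, written as an escape to keep the source ASCII.
raw = "\u2014".encode("utf-8")

wrong = raw.decode("raw-unicode-escape")  # each byte becomes one code point
right = raw.decode("utf-8")               # the three bytes become one code point

print(raw)                     # b'\xe2\x80\x94'
print(repr(wrong))             # 'â\x80\x94'
print(len(wrong), len(right))  # 3 1

# With no "\u" escapes present, raw-unicode-escape behaves like Latin-1,
# so the damage is reversible when the original bytes were valid UTF-8.
assert wrong == raw.decode("latin-1")
assert wrong.encode("latin-1").decode("utf-8") == right
```

The last assertion also suggests a way to repair already-corrupted text after the fact, provided the body really was UTF-8.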

This is related to but **not** fixed by #97 / #125. PR #125 added a `try/except UnicodeDecodeError` fallback to `ported_string(payload, encoding=charset)`, but the except branch only fires on malformed `\u` escapes (the original #97 crash). Valid UTF-8 byte sequences decode "successfully" under `raw-unicode-escape` and never trigger the fallback — they are silently corrupted instead.

## Steps to Reproduce

Minimal `.eml` (`utf8.eml`):

```
From: a@example.com
To: b@example.com
Subject: probe
Content-Type: text/plain; charset=utf-8
MIME-Version: 1.0

Hello — world
```

```python
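import mailparser

# Read the same file once as bytes and once as text. The explicit encoding
# keeps the text read independent of the locale's default.
with open("utf8.eml", "rb") as fh:
    m_bytes = mailparser.parse_from_bytes(fh.read())
with open("utf8.eml", encoding="utf-8") as fh:
    m_string = mailparser.parse_from_string(fh.read())

print("from_bytes :", repr(m_bytes.text_plain[0]))
print("from_string:", repr(m_string.text_plain[0]))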
```

Output:

```
from_bytes : 'Hello â\x80\x94 world\n'
from_string: 'Hello — world\n'
```

## Expected Behavior

Both `parse_from_bytes` and `parse_from_string` should produce `'Hello — world\n'`, honoring the part's declared `charset=utf-8`.

## Root Cause

`src/mailparser/core.py:471-475` (4.2.1 / HEAD):

```python
if not cte or cte in ["7bit", "8bit"]:
    try:
        payload = payload.decode("raw-unicode-escape")
    except UnicodeDecodeError:
        payload = ported_string(payload, encoding=charset)
else:
    payload = ported_string(payload, encoding=charset)
```

The two code paths give `payload` very different bytes before this block:

- `email.message_from_bytes(...).get_payload(decode=True)` returns the **raw body bytes** (e.g. `b'\xe2\x80\x94'` for `—`). `raw-unicode-escape` decodes these as Latin-1 (no exception raised) → mojibake.
- `email.message_from_string(...).get_payload(decode=True)` returns **ASCII-escaped bytes** (e.g. `b'\\u2014'`). `raw-unicode-escape` decodes the `\u2014` escape back to U+2014, so the text comes out correct.

So `from_string` works by accident: it relies on the stdlib's internal escape-encoding to produce input that `raw-unicode-escape` happens to handle. `from_bytes` exposes the underlying logic flaw.
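The difference between the two stdlib entry points can be checked in isolation. The sketch below uses only the `email` package (no mailparser) and inlines a trimmed copy of the message from Steps to Reproduce. The values in the comments are what the analysis above predicts on CPython 3.11; other versions have not been checked here.

```python
import email

# Trimmed copy of utf8.eml; the body holds the UTF-8 bytes for U+2014.
RAW = (
    b"From: a@example.com\n"
    b"Content-Type: text/plain; charset=utf-8\n"
    b"MIME-Version: 1.0\n"
    b"\n"
    b"Hello \xe2\x80\x94 world\n"
)

payload_from_bytes = email.message_from_bytes(RAW).get_payload(decode=True)
payload_from_string = email.message_from_string(
    RAW.decode("utf-8")
).get_payload(decode=True)

print(payload_from_bytes)   # b'Hello \xe2\x80\x94 world\n'
print(payload_from_string)  # b'Hello \\u2014 world\n'

# Feeding both through the codec used in core.py shows why only one survives.
text_from_bytes = payload_from_bytes.decode("raw-unicode-escape")
text_from_string = payload_from_string.decode("raw-unicode-escape")
print(text_from_bytes == "Hello \u2014 world\n")   # False
print(text_from_string == "Hello \u2014 world\n")  # True
```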

## Why the Current Logic Is Inverted

The comment block above this code (lines 459-465) explains the intent: when `get_payload(decode=True)` returns bytes that Python "broke" by mishandling the encoding, the author reinterprets them via `raw-unicode-escape`. But the correct approach is the opposite: the part's `Content-Type` already declares the body's charset, so decode with that charset directly. `raw-unicode-escape` should never be the primary decoder for body bytes.

## Suggested Fix

Decode with the declared charset directly:

```python
if not cte or cte in ["7bit", "8bit"]:
    try:
        payload = payload.decode(charset)
    except (UnicodeDecodeError, LookupError):
        # Fallback for legacy "\uXXXX-in-body" cases (#97); won't silently
        # mis-interpret UTF-8 because that bytes path is handled above.
        payload = payload.decode("raw-unicode-escape", errors="replace")
else:
    payload = ported_string(payload, encoding=charset)
```

This:
- Honors the charset declared in `Content-Type` whenever the body bytes are valid in that charset.
- Keeps the `raw-unicode-escape` path available for messages where #97-style `\u` literals appear in 8bit bodies labeled as ASCII/UTF-8.
- Preserves behavior for non-`7bit`/`8bit` CTEs (`base64`, `quoted-printable`) which already go through `ported_string(..., encoding=charset)`.
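
To make the proposed behavior easy to test outside mailparser, the decode step can be pulled into a standalone function. This is a sketch only: `decode_body` is a hypothetical name, not part of mailparser's API, and defaulting to UTF-8 when no charset is declared is an assumption.

```python
from typing import Optional


def decode_body(payload: bytes, charset: Optional[str]) -> str:
    """Hypothetical helper mirroring the suggested 7bit/8bit branch."""
    try:
        # Assumption: fall back to UTF-8 when no charset was declared.
        return payload.decode(charset or "utf-8")
    except (UnicodeDecodeError, LookupError):
        # Bytes invalid in the declared charset, or an unknown charset name.
        return payload.decode("raw-unicode-escape", errors="replace")


# Valid UTF-8 is decoded with the declared charset.
assert decode_body(b"Hello \xe2\x80\x94 world\n", "utf-8") == "Hello \u2014 world\n"
# Bytes that are not valid UTF-8 take the fallback path.
assert decode_body(b"caf\xe9", "utf-8") == "caf\xe9"
# An unknown charset name raises LookupError and also takes the fallback path.
assert decode_body(b"plain", "x-unknown-charset") == "plain"
# ASCII-escaped bytes are valid UTF-8, so the escape stays literal text.
assert decode_body(b"\\u2014", "utf-8") == "\\u2014"
```

The last assertion is worth noting: in this sketch, ASCII-escaped input such as `b'\\u2014'` decodes cleanly under the declared charset and never reaches the fallback. Whether that affects the `parse_from_string` path in mailparser itself would need to be confirmed against the real code.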

A regression test loading the minimal `.eml` above via both `parse_from_bytes` and `parse_from_string` and asserting equality would catch any future re-inversion.
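
One possible shape for that test is sketched below. `check_parsers_agree` and its parameters are hypothetical names; the check takes the two parsing functions as arguments so that it can be exercised without mailparser installed, and the intended mailparser wiring is shown as a comment.

```python
from typing import Callable

# The message from "Steps to Reproduce"; U+2014 is written as an escape.
UTF8_EML = (
    "From: a@example.com\n"
    "To: b@example.com\n"
    "Subject: probe\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "MIME-Version: 1.0\n"
    "\n"
    "Hello \u2014 world\n"
)
EXPECTED = "Hello \u2014 world\n"


def check_parsers_agree(
    body_from_bytes: Callable[[bytes], str],
    body_from_string: Callable[[str], str],
) -> None:
    """Hypothetical check: both entry points must yield the same, correct text."""
    got_bytes = body_from_bytes(UTF8_EML.encode("utf-8"))
    got_string = body_from_string(UTF8_EML)
    assert got_bytes == EXPECTED, f"from_bytes gave {got_bytes!r}"
    assert got_string == EXPECTED, f"from_string gave {got_string!r}"


# Intended wiring (requires mailparser, so left as a comment here):
# import mailparser
# check_parsers_agree(
#     lambda b: mailparser.parse_from_bytes(b).text_plain[0],
#     lambda s: mailparser.parse_from_string(s).text_plain[0],
# )
```

Asserting against a fixed expected string, rather than only comparing the two results to each other, also catches the case where both paths regress in the same way.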

## Environment

- mail-parser: 4.1.2 (also verified in 4.2.1 / HEAD)
- Python: 3.11
- OS: Linux

## Workaround

In our subclass we override `from_bytes` to decode bytes via UTF-8 and delegate to `from_string`, which sidesteps the broken branch. Happy to open a PR with the upstream fix if there's interest.
