fix: add explicit encoding="utf-8" to .txt read in _validators.py#3288

Open
Ghraven wants to merge 1 commit into
openai:mainfrom
Ghraven:fix/validators-txt-encoding

@Ghraven Ghraven commented May 20, 2026

What

src/openai/lib/_validators.py reads a user-supplied .txt fine-tuning file in text mode without an explicit encoding:

with open(fname, "r") as f:
    content = f.read()

In text mode, open() called without an encoding argument falls back to the locale's preferred encoding (locale.getpreferredencoding(False)), unless Python's UTF-8 mode is enabled. On Windows that is typically cp1252, not UTF-8.
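For reviewers who want to check what their own machine would pick, a minimal sketch follows. It only inspects the interpreter; the variable names are illustrative and not part of _validators.py.

```python
import locale
import sys

# Encoding that open() falls back to in text mode when no encoding= is given
# and UTF-8 mode is off. Passing False avoids calling setlocale() as a side effect.
default_encoding = locale.getpreferredencoding(False)

# True when Python runs in UTF-8 mode (-X utf8 or PYTHONUTF8=1), in which case
# text-mode open() defaults to UTF-8 regardless of the locale.
utf8_mode = bool(sys.flags.utf8_mode)

print(f"default text encoding: {default_encoding}")
print(f"UTF-8 mode enabled:    {utf8_mode}")
```

On a typical Western-locale Windows install without UTF-8 mode, the first line is expected to report cp1252; on most Linux and macOS setups it reports UTF-8.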

Why it matters

.txt datasets frequently contain non-ASCII characters (smart quotes, accented text, CJK, emoji). On a default-cp1252 platform, reading a valid UTF-8 file raises UnicodeDecodeError or silently corrupts characters.

Before / After

# Before - platform-dependent
with open(fname, "r") as f:
    content = f.read()
# After - consistent across platforms
with open(fname, "r", encoding="utf-8") as f:
    content = f.read()

How I verified

Reproduced the failure on a simulated cp1252 default: the explicit utf-8 read preserves content, while a cp1252 read of the same valid UTF-8 file raises UnicodeDecodeError. With encoding="utf-8" the read succeeds regardless of platform. No behavior change on systems that already default to UTF-8.
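A minimal sketch of that kind of check, for anyone who wants to repeat it. It simulates the cp1252 default by passing encoding="cp1252" explicitly rather than changing the locale; the helper names (write_utf8, read_text) and the sample strings are hypothetical and not taken from the PR or from _validators.py.

```python
import os
import tempfile


def write_utf8(text):
    """Write text to a temporary .txt file as UTF-8 and return its path."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w", encoding="utf-8", newline="") as f:
        f.write(text)
    return path


def read_text(path, encoding):
    """Read the whole file in text mode with the given encoding."""
    with open(path, "r", encoding=encoding) as f:
        return f.read()


accented = "caf\u00e9"           # UTF-8 bytes C3 A9 are both mapped in cp1252
quoted = "\u201cquoted\u201d"    # closing quote ends in byte 0x9D, unmapped in cp1252

path_a = write_utf8(accented)
path_b = write_utf8(quoted)
try:
    # Explicit UTF-8 read preserves the content in both cases.
    utf8_a = read_text(path_a, "utf-8")
    utf8_b = read_text(path_b, "utf-8")

    # Simulated cp1252 default, case 1: decoding succeeds but yields mojibake.
    cp1252_a = read_text(path_a, "cp1252")

    # Simulated cp1252 default, case 2: an unmapped byte raises UnicodeDecodeError.
    try:
        read_text(path_b, "cp1252")
        cp1252_b_error = None
    except UnicodeDecodeError as exc:
        cp1252_b_error = exc
finally:
    os.remove(path_a)
    os.remove(path_b)

print(repr(utf8_a), repr(cp1252_a))
print(repr(utf8_b), type(cp1252_b_error).__name__)
```

The two samples cover both failure modes named above: the accented text decodes without error but comes back corrupted, while the smart quote triggers the exception.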

Scope

Single-line change in src/openai/lib/, a hand-maintained (non-generated) path per CONTRIBUTING. Happy to adjust if you would prefer a different approach.

The text-mode open() for user-supplied .txt fine-tuning files relied on
the platform default encoding (cp1252 on Windows), which raises
UnicodeDecodeError or corrupts non-ASCII content on valid UTF-8 files.
Pinning encoding="utf-8" makes the read consistent across platforms.
@Ghraven Ghraven requested a review from a team as a code owner May 20, 2026 18:49