GitHub - pragyamv/DataDojo: An OpenEnv based RL environment that allows agents to learn to clean datasets across 3 levels of difficulties.

DataDojo: The Autonomous Data Cleaning Benchmark

DataDojo is a containerized reinforcement learning environment designed to evaluate the reasoning and data-wrangling capabilities of AI agents. It provides a standardized "gym" where LLMs interact with corrupted datasets to reach a clean "reference" state through autonomous decision-making. :contentReference[oaicite:0]{index=0}

The Architecture

The system operates on a dual-component architecture, or simply "The Twins," ensuring complete separation between data generation and corruption logic. :contentReference[oaicite:1]{index=1}

Genesis Engine

The source of truth. It procedurally generates a randomized "master dataset" from a library of domain skeletons. The datasets have varying numbers of columns, row counts, and data types on every episode — producing a perfectly clean reference state that the agent must restore. :contentReference[oaicite:2]{index=2}

Ruiner Engine

Systematically injects "dirt" into the master dataset by introducing:

Missing values
Duplicate rows
Regex-defying string corruptions
Inconsistent categorical casing

:contentReference[oaicite:3]{index=3}

The agent's success is measured by its ability to reverse the Ruiner's chaos and restore the dataset to the Genesis standard. :contentReference[oaicite:4]{index=4}

Task Levels

DataDojo evaluates agents across three increasing levels of corruption intensity. :contentReference[oaicite:5]{index=5}

1. Easy

Duplicate removal
Dropping a fully empty column
Filling missing values (NaNs)

:contentReference[oaicite:6]{index=6}

2. Medium

Includes all Easy challenges, plus:

Cleaning numerical columns corrupted with currency-style string formatting
Converting dtype: object → float/int
Using the sequence:
- STRIP_CHAR
- TYPE_CAST

:contentReference[oaicite:7]{index=7}

3. Hard

Includes all Medium challenges, plus:

Detecting inconsistent string casing
Standardizing categorical column values

:contentReference[oaicite:8]{index=8}

Environment Logic

Actions & Observations

Observations

At each step, the agent receives:

Current dataset state
Column schema
NaN counts per column
Sample rows
EDA tool-call results
Detailed reward breakdown from the previous action

This provides sufficient context for reasoning about the next step. :contentReference[oaicite:9]{index=9}

Actions

Agents interact through DataFrame-manipulation tool calls. Most actions require specifying the exact column to operate on.

Available tools include:

DROP_DUPLICATES
DROP_COLUMN
FILL_NA
STRIP_CHAR
TYPE_CAST
LOWERCASE
GET_VALUE_COUNTS
MAP_VALUES

:contentReference[oaicite:10]{index=10}

Grading & Subtle Logic

The reward per step is a tanh-normalized sum of rewards and penalties, producing a score in [-1, 1], which is then remapped to [0, 1]. :contentReference[oaicite:11]{index=11}

Reward Shaping

Error Reduction Reward

The primary learning signal. Reward is proportional to how many errors the action eliminates relative to the episode's starting error count.

Tracked errors include:

NaNs
Duplicates
Value mismatches against the master dataset

:contentReference[oaicite:12]{index=12}

The "One Free Drop"

The first DROP_COLUMN call has no penalty.
Every additional column drop incurs an action penalty.
Dropping a non-empty column incurs a severe penalty.

This prevents reward hacking through indiscriminate column deletion. :contentReference[oaicite:13]{index=13}

DROP_DUPLICATES Spam Penalty

Using DROP_DUPLICATES more than once per episode incurs a penalty to discourage repetitive fallback behavior. :contentReference[oaicite:14]{index=14}

Invalid Column Penalty

Referencing a non-existent column is penalized, forcing agents to ground actions in the observed schema rather than hallucinating column names. :contentReference[oaicite:15]{index=15}

Datatype Mismatch Penalty

If a column remains in the wrong datatype relative to the master dataset after an operation, a penalty is applied. This encourages the correct sequence:

STRIP_CHAR
TYPE_CAST

:contentReference[oaicite:16]{index=16}

Action Dependencies

Agents must learn valid operation orderings.

Example:

TYPE_CAST to float fails if symbols like $ or .. have not first been removed using STRIP_CHAR.

The environment reports failed actions, and the agent must recover autonomously. :contentReference[oaicite:17]{index=17}

A Note on Difficulty

DataDojo is intentionally difficult. Every episode generates a randomized dataset with:

Different schemas
Different column names
Different row counts
Different datatypes

This prevents memorization and forces genuine reasoning over observed data. :contentReference[oaicite:18]{index=18}

Baseline Results

During development, the baseline agent used was:

Qwen2.5-72B-Instruct

A 72B-parameter open-source instruction-tuned model. :contentReference[oaicite:19]{index=19}

Observed behavior:

Easy Level

The model can usually:

Identify and drop the empty column

But often struggles to complete:

NaN filling
Duplicate removal

within the 10-step action budget. :contentReference[oaicite:20]{index=20}

Medium & Hard Levels

The model struggles with multi-step dependency chains such as:

Identifying the corrupted column
Applying STRIP_CHAR
Performing TYPE_CAST
Tracking already-completed operations

This limitation is intentional. A benchmark that frontier models solve trivially offers little training value.

DataDojo is designed to sit at the frontier of current LLM tool-use and multi-step reasoning capabilities, leaving meaningful headroom for RL-trained or fine-tuned agents to demonstrate measurable improvement over zero-shot baselines.

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
server		server
Dockerfile.txt		Dockerfile.txt
LICENSE		LICENSE
README.md		README.md
__init__.py		__init__.py
client.py		client.py
dockerignore.txt		dockerignore.txt
gitignore.txt		gitignore.txt
inference.py		inference.py
models.py		models.py
openenv.yaml		openenv.yaml
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DataDojo: The Autonomous Data Cleaning Benchmark

The Architecture

Genesis Engine

Ruiner Engine

Task Levels

1. Easy

2. Medium

3. Hard

Environment Logic

Actions & Observations

Observations

Actions

Grading & Subtle Logic

Reward Shaping

Error Reduction Reward

The "One Free Drop"

DROP_DUPLICATES Spam Penalty

Invalid Column Penalty

Datatype Mismatch Penalty

Action Dependencies

A Note on Difficulty

Baseline Results

Easy Level

Medium & Hard Levels

About

Uh oh!

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

DataDojo: The Autonomous Data Cleaning Benchmark

The Architecture

Genesis Engine

Ruiner Engine

Task Levels

1. Easy

2. Medium

3. Hard

Environment Logic

Actions & Observations

Observations

Actions

Grading & Subtle Logic

Reward Shaping

Error Reduction Reward

The "One Free Drop"

DROP_DUPLICATES Spam Penalty

Invalid Column Penalty

Datatype Mismatch Penalty

Action Dependencies

A Note on Difficulty

Baseline Results

Easy Level

Medium & Hard Levels

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages