From 3d1769b4378d6bcd05ae29ee63a32c2cba89f9e5 Mon Sep 17 00:00:00 2001
From: maskedsyntax
Date: Mon, 16 Mar 2026 18:05:40 +0530
Subject: [PATCH] docs: sync README and website docs with current feature set
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Add missing checks: datetime_future_dates, datetime_gaps,
  datetime_monotonicity, normality, variance_homogeneity,
  low_mutual_information
- Fix incorrect check names in website docs (many_zeros → high_zero_counts,
  target_leakage → target_leakage_patterns,
  missing_values → dataset_missingness)
- Document --config flag across all CLI commands (scan, details, report)
- Add config file loading section to Python API docs
- Add dedicated Configuration section to website with YAML/TOML/JSON examples
- Add --comparison flag to scan/details CLI reference on website
- Expand Available Checks from ~12 to all 30 checks across both docs
---
 README.md                        | 36 ++++++++++++-
 web/src/routes/docs/+page.svelte | 89 ++++++++++++++++++++++++++++----
 2 files changed, 115 insertions(+), 10 deletions(-)

diff --git a/README.md b/README.md
index c3b8577..608d81c 100644
--- a/README.md
+++ b/README.md
@@ -96,8 +96,10 @@ hashprep scan dataset.csv
 - `--json`: Output in JSON format
 - `--target COLUMN`: Specify target column for ML-specific checks
 - `--checks CHECKS`: Run specific checks (comma-separated)
+- `--comparison FILE`: Compare with another dataset for drift detection
 - `--sample-size N`: Limit analysis to N rows
 - `--no-sample`: Disable automatic sampling
+- `--config FILE`: Load thresholds from a YAML/TOML/JSON config file
 
 **Example:**
 ```bash
@@ -114,7 +116,7 @@ Get comprehensive details about all detected issues.
 hashprep details dataset.csv
 ```
 
-**Options:** Same as `scan` command
+**Options:** Same as `scan` command (including `--config`)
 
 **Example:**
 ```bash
@@ -138,6 +140,7 @@ hashprep report dataset.csv --format html --theme minimal
 - `--comparison FILE`: Compare with another dataset for drift detection
 - `--sample-size N`: Limit analysis to N rows
 - `--no-sample`: Disable automatic sampling
+- `--config FILE`: Load thresholds from a YAML/TOML/JSON config file
 
 **Examples:**
 ```bash
@@ -190,6 +193,12 @@ hashprep version
 - `constant_length` - String columns with constant character length
 - `extreme_text_lengths` - Text columns with extreme value lengths
 - `datetime_skew` - Datetime columns concentrated in one period
+- `datetime_future_dates` - Datetime columns with values in the future
+- `datetime_gaps` - Anomalous gaps in datetime sequences
+- `datetime_monotonicity` - Non-monotonic datetime columns
+- `normality` - Non-normal numeric distributions (Shapiro-Wilk / D'Agostino-Pearson)
+- `variance_homogeneity` - Unequal variances across target groups (Levene's test, requires --target)
+- `low_mutual_information` - Features with near-zero mutual information with the target (requires --target)
 - `empty_dataset` - Empty or all-missing datasets
 
 ---
@@ -340,6 +349,31 @@ with open('pipeline.py', 'w') as f:
     f.write(pipeline_code)
 ```
 
+#### Load Config from File
+```python
+from hashprep.utils.config_loader import load_config
+from hashprep import DatasetAnalyzer
+
+# Load thresholds from YAML, TOML, or JSON
+config = load_config("hashprep.yaml")  # or .toml / .json
+
+analyzer = DatasetAnalyzer(df, config=config)
+summary = analyzer.analyze()
+```
+
+Example `hashprep.yaml`:
+```yaml
+missing_values:
+  warning: 0.3
+  critical: 0.6
+outliers:
+  z_score: 3.5
+statistical_tests:
+  normality_p_value: 0.01
+```
+
+Only the keys you specify are overridden; all others fall back to defaults.
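That "only the keys you specify are overridden" behavior amounts to a recursive merge of the user config over the defaults. The following is a rough illustration of those semantics with placeholder default values, not HashPrep's actual loader code:

```python
# Sketch of the override semantics described above. NOT HashPrep's
# actual loader; the DEFAULTS values here are placeholders chosen
# for illustration only.
DEFAULTS = {
    "missing_values": {"warning": 0.5, "critical": 0.8},
    "outliers": {"z_score": 3.0},
    "statistical_tests": {"normality_p_value": 0.05},
}

def merge_config(defaults: dict, overrides: dict) -> dict:
    """Recursively overlay user-specified keys onto defaults."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged

# Overriding only missing_values.warning leaves every other key intact.
config = merge_config(DEFAULTS, {"missing_values": {"warning": 0.3}})
```

Here `config["missing_values"]["warning"]` becomes 0.3 while `critical`, `z_score`, and `normality_p_value` keep their defaults.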
+
 #### Custom Sampling
 ```python
 from hashprep.utils.sampling import SamplingConfig

diff --git a/web/src/routes/docs/+page.svelte b/web/src/routes/docs/+page.svelte
index 5566976..13a93a2 100644
--- a/web/src/routes/docs/+page.svelte
+++ b/web/src/routes/docs/+page.svelte
@@ -33,6 +33,7 @@
     CLI reference
     Python API
     Available checks
+    Configuration
     Contributing
@@ -192,9 +193,12 @@ generate_report(
@@ -206,7 +210,7 @@ generate_report(

 hashprep details dataset.csv --target Survived
 
-    Accepts the same options as scan and is best used when you are actively debugging a dataset
+    Accepts the same options as scan (including --comparison and --config) and is best used when you are actively debugging a dataset
     or deciding which columns to drop or transform.
 
@@ -224,6 +228,7 @@ generate_report(
     • --with-code — write companion _fixes.py and _pipeline.py files
     • --comparison FILE — compare two datasets for drift (train vs test, etc.)
     • --sample-size N / --no-sample — control automatic sampling
+    • --config FILE — load thresholds from a YAML, TOML, or JSON file
@@ -329,6 +334,7 @@ pipeline_code = builder.generate_pipeline_code()
 )
 summary = analyzer.analyze()
+
@@ -338,33 +344,98 @@ summary = analyzer.analyze()
 
     Data quality
 
-    • missing_values — overall missingness patterns
+    • dataset_missingness — overall missing data patterns
     • high_missing_values — columns with heavy missingness
+    • missing_patterns — correlated missing value patterns
     • duplicates — duplicate rows
+    • empty_columns — completely empty columns
+    • empty_dataset — empty or all-missing datasets
     • single_value_columns — near-constant features
+    • mixed_data_types — columns with mixed data types
 
-    Distribution
+    Distribution & statistics
 
-    • outliers — IQR-based outlier detection
-    • high_cardinality — categorical columns with too many uniques
+    • outliers — z-score outlier detection
+    • skewness — highly skewed numeric distributions
+    • high_cardinality — too many unique categorical values
     • uniform_distribution — uniformly distributed numeric columns
-    • many_zeros — features dominated by zeros
+    • unique_values — columns where >95% values are unique
+    • high_zero_counts — features dominated by zeros
+    • infinite_values — columns containing infinite values
+    • constant_length — string columns with constant character length
+    • extreme_text_lengths — text columns with extreme value lengths
+    • normality — non-normal distributions (Shapiro-Wilk / D’Agostino-Pearson)
+    • variance_homogeneity — unequal variances across target groups (Levene’s test)
 
     ML-specific
 
     • class_imbalance — target imbalance (requires --target)
-    • feature_correlation — highly correlated features
-    • target_leakage — features leaking target information
-    • dataset_drift — drift between train / test datasets
+    • feature_correlation — highly correlated numeric features
+    • categorical_correlation — highly associated categorical features
+    • mixed_correlation — numeric-categorical associations
+    • data_leakage — columns identical to target
+    • target_leakage_patterns — features leaking target information
+    • low_mutual_information — near-zero MI with target (requires --target)
+    • dataset_drift — drift between train / test datasets (requires --comparison)
+    • datetime_skew — datetime columns concentrated in one period
+    • datetime_future_dates — datetime values in the future
+    • datetime_gaps — anomalous gaps in datetime sequences
+    • datetime_monotonicity — non-monotonic datetime columns
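The outliers entry above moves from IQR-based to z-score detection. As a rough sketch of the idea behind a z-score check (not HashPrep's implementation; the 3.0 cutoff is a placeholder, which the docs show is configurable via `outliers.z_score`):

```python
# Illustrative z-score outlier flagging. A sketch of the technique
# named by the `outliers` check, not HashPrep's actual code.
from statistics import mean, stdev

def zscore_outliers(values, threshold=3.0):
    """Return indices of values whose |z-score| exceeds the threshold."""
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]

data = [10, 11, 9, 10, 12, 11, 10, 9, 11, 10, 100]
print(zscore_outliers(data))  # → [10]: only the extreme value 100 is flagged
```

Raising the threshold (as in the `z_score: 3.5` config example) makes the check more permissive.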
+
+Configuration
+
+Every detection threshold in HashPrep has a sensible default. You can override any of them at runtime by
+providing a config file — no code changes required.
+
+Supported formats
+
+Config files can be written in YAML (.yaml / .yml), TOML (.toml),
+or JSON (.json). Only the keys you specify are changed; everything else falls back to
+defaults.
+
+# hashprep.yaml
+missing_values:
+  warning: 0.3
+  critical: 0.6
+outliers:
+  z_score: 3.5
+statistical_tests:
+  normality_p_value: 0.01
+
+Via CLI
+
+hashprep scan dataset.csv \
+  --config hashprep.yaml
+
+hashprep report dataset.csv \
+  --format html \
+  --config hashprep.toml
+
+Via Python
+
+from hashprep.utils.config_loader import load_config
+from hashprep import DatasetAnalyzer
+
+config = load_config("hashprep.yaml")
+analyzer = DatasetAnalyzer(df, config=config)
+summary = analyzer.analyze()
+
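The Supported formats note lists TOML and JSON alongside YAML, but only a YAML sample appears in the docs. Assuming the key names simply mirror the YAML example (an assumption, since no TOML sample is given), the same overrides might look like:

```toml
# hashprep.toml (hypothetical; keys assumed to mirror hashprep.yaml)
[missing_values]
warning = 0.3
critical = 0.6

[outliers]
z_score = 3.5

[statistical_tests]
normality_p_value = 0.01
```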

    Contributing