From 3d1769b4378d6bcd05ae29ee63a32c2cba89f9e5 Mon Sep 17 00:00:00 2001
From: maskedsyntax
Date: Mon, 16 Mar 2026 18:05:40 +0530
Subject: [PATCH] docs: sync README and website docs with current feature set
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Add missing checks: datetime_future_dates, datetime_gaps,
  datetime_monotonicity, normality, variance_homogeneity,
  low_mutual_information
- Fix incorrect check names in website docs (many_zeros → high_zero_counts,
  target_leakage → target_leakage_patterns,
  missing_values → dataset_missingness)
- Document --config flag across all CLI commands (scan, details, report)
- Add config file loading section to Python API docs
- Add dedicated Configuration section to website with YAML/TOML/JSON examples
- Add --comparison flag to scan/details CLI reference on website
- Expand Available Checks from ~12 to all 30 checks across both docs
---
 README.md                        | 36 ++++++++++++-
 web/src/routes/docs/+page.svelte | 89 ++++++++++++++++++++++++++++----
 2 files changed, 115 insertions(+), 10 deletions(-)

diff --git a/README.md b/README.md
index c3b8577..608d81c 100644
--- a/README.md
+++ b/README.md
@@ -96,8 +96,10 @@ hashprep scan dataset.csv
 - `--json`: Output in JSON format
 - `--target COLUMN`: Specify target column for ML-specific checks
 - `--checks CHECKS`: Run specific checks (comma-separated)
+- `--comparison FILE`: Compare with another dataset for drift detection
 - `--sample-size N`: Limit analysis to N rows
 - `--no-sample`: Disable automatic sampling
+- `--config FILE`: Load thresholds from a YAML/TOML/JSON config file
 
 **Example:**
 ```bash
@@ -114,7 +116,7 @@ Get comprehensive details about all detected issues.
 hashprep details dataset.csv
 ```
 
-**Options:** Same as `scan` command
+**Options:** Same as `scan` command (including `--config`)
 
 **Example:**
 ```bash
@@ -138,6 +140,7 @@ hashprep report dataset.csv --format html --theme minimal
 - `--comparison FILE`: Compare with another dataset for drift detection
 - `--sample-size N`: Limit analysis to N rows
 - `--no-sample`: Disable automatic sampling
+- `--config FILE`: Load thresholds from a YAML/TOML/JSON config file
 
 **Examples:**
 ```bash
@@ -190,6 +193,12 @@ hashprep version
 - `constant_length` - String columns with constant character length
 - `extreme_text_lengths` - Text columns with extreme value lengths
 - `datetime_skew` - Datetime columns concentrated in one period
+- `datetime_future_dates` - Datetime columns with values in the future
+- `datetime_gaps` - Anomalous gaps in datetime sequences
+- `datetime_monotonicity` - Non-monotonic datetime columns
+- `normality` - Non-normal numeric distributions (Shapiro-Wilk / D'Agostino-Pearson)
+- `variance_homogeneity` - Unequal variances across target groups (Levene's test, requires --target)
+- `low_mutual_information` - Features with near-zero mutual information with the target (requires --target)
 - `empty_dataset` - Empty or all-missing datasets
 
 ---
@@ -340,6 +349,31 @@ with open('pipeline.py', 'w') as f:
     f.write(pipeline_code)
 ```
 
+#### Load Config from File
+```python
+from hashprep.utils.config_loader import load_config
+from hashprep import DatasetAnalyzer
+
+# Load thresholds from YAML, TOML, or JSON
+config = load_config("hashprep.yaml")  # or .toml / .json
+
+analyzer = DatasetAnalyzer(df, config=config)
+summary = analyzer.analyze()
+```
+
+Example `hashprep.yaml`:
+```yaml
+missing_values:
+  warning: 0.3
+  critical: 0.6
+outliers:
+  z_score: 3.5
+statistical_tests:
+  normality_p_value: 0.01
+```
+
+Only the keys you specify are overridden; all others fall back to defaults.
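That "only the keys you specify are overridden" behavior amounts to a recursive merge of the user config over the defaults. The following is a rough illustration of those semantics with placeholder default values, not HashPrep's actual loader code:

```python
# Sketch of the override semantics described above. NOT HashPrep's
# actual loader; the DEFAULTS values here are placeholders chosen
# for illustration only.
DEFAULTS = {
    "missing_values": {"warning": 0.5, "critical": 0.8},
    "outliers": {"z_score": 3.0},
    "statistical_tests": {"normality_p_value": 0.05},
}

def merge_config(defaults: dict, overrides: dict) -> dict:
    """Recursively overlay user-specified keys onto defaults."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged

# Overriding only missing_values.warning leaves every other key intact.
config = merge_config(DEFAULTS, {"missing_values": {"warning": 0.3}})
```

Here `config["missing_values"]["warning"]` becomes 0.3 while `critical`, `z_score`, and `normality_p_value` keep their defaults.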
+
 #### Custom Sampling
 ```python
 from hashprep.utils.sampling import SamplingConfig

diff --git a/web/src/routes/docs/+page.svelte b/web/src/routes/docs/+page.svelte
index 5566976..13a93a2 100644
--- a/web/src/routes/docs/+page.svelte
+++ b/web/src/routes/docs/+page.svelte
@@ -33,6 +33,7 @@
     CLI reference
     Python API
     Available checks
+    Configuration
     Contributing
@@ -192,9 +193,12 @@ generate_report(
@@ -206,7 +210,7 @@ generate_report(

 hashprep details dataset.csv --target Survived
 
-    Accepts the same options as scan and is best used when you are actively debugging a dataset
+    Accepts the same options as scan (including --comparison and --config) and is best used when you are actively debugging a dataset
     or deciding which columns to drop or transform.
 
@@ -224,6 +228,7 @@ generate_report(
     • --with-code — write companion _fixes.py and _pipeline.py files
     • --comparison FILE — compare two datasets for drift (train vs test, etc.)
     • --sample-size N / --no-sample — control automatic sampling
+    • --config FILE — load thresholds from a YAML, TOML, or JSON file
@@ -329,6 +334,7 @@ pipeline_code = builder.generate_pipeline_code()
 )
 summary = analyzer.analyze()
+
@@ -338,33 +344,98 @@ summary = analyzer.analyze()
 
     Data quality
 
-    • missing_values — overall missingness patterns
+    • dataset_missingness — overall missing data patterns
     • high_missing_values — columns with heavy missingness
+    • missing_patterns — correlated missing value patterns
     • duplicates — duplicate rows
+    • empty_columns — completely empty columns
+    • empty_dataset — empty or all-missing datasets
     • single_value_columns — near-constant features
+    • mixed_data_types — columns with mixed data types
 
-    Distribution
+    Distribution & statistics
 
-    • outliers — IQR-based outlier detection
-    • high_cardinality — categorical columns with too many uniques
+    • outliers — z-score outlier detection
+    • skewness — highly skewed numeric distributions
+    • high_cardinality — too many unique categorical values
     • uniform_distribution — uniformly distributed numeric columns
-    • many_zeros — features dominated by zeros
+    • unique_values — columns where >95% values are unique
+    • high_zero_counts — features dominated by zeros
+    • infinite_values — columns containing infinite values
+    • constant_length — string columns with constant character length
+    • extreme_text_lengths — text columns with extreme value lengths
+    • normality — non-normal distributions (Shapiro-Wilk / D’Agostino-Pearson)
+    • variance_homogeneity — unequal variances across target groups (Levene’s test)
 
     ML-specific
 
     • class_imbalance — target imbalance (requires --target)
-    • feature_correlation — highly correlated features
-    • target_leakage — features leaking target information
-    • dataset_drift — drift between train / test datasets
+    • feature_correlation — highly correlated numeric features
+    • categorical_correlation — highly associated categorical features
+    • mixed_correlation — numeric-categorical associations
+    • data_leakage — columns identical to target
+    • target_leakage_patterns — features leaking target information
+    • low_mutual_information — near-zero MI with target (requires --target)
+    • dataset_drift — drift between train / test datasets (requires --comparison)
+    • datetime_skew — datetime columns concentrated in one period
+    • datetime_future_dates — datetime values in the future
+    • datetime_gaps — anomalous gaps in datetime sequences
+    • datetime_monotonicity — non-monotonic datetime columns
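The outliers entry above moves from IQR-based to z-score detection. As a rough sketch of the idea behind a z-score check (not HashPrep's implementation; the 3.0 cutoff is a placeholder, which the docs show is configurable via `outliers.z_score`):

```python
# Illustrative z-score outlier flagging. A sketch of the technique
# named by the `outliers` check, not HashPrep's actual code.
from statistics import mean, stdev

def zscore_outliers(values, threshold=3.0):
    """Return indices of values whose |z-score| exceeds the threshold."""
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]

data = [10, 11, 9, 10, 12, 11, 10, 9, 11, 10, 100]
print(zscore_outliers(data))  # → [10]: only the extreme value 100 is flagged
```

Raising the threshold (as in the `z_score: 3.5` config example) makes the check more permissive.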
+
+Configuration
+
+Every detection threshold in HashPrep has a sensible default. You can override any of them at runtime by
+providing a config file — no code changes required.
+
+Supported formats
+
+Config files can be written in YAML (.yaml / .yml), TOML (.toml),
+or JSON (.json). Only the keys you specify are changed; everything else falls back to
+defaults.
+
+# hashprep.yaml
+missing_values:
+  warning: 0.3
+  critical: 0.6
+outliers:
+  z_score: 3.5
+statistical_tests:
+  normality_p_value: 0.01
+
+Via CLI
+
+hashprep scan dataset.csv \
+  --config hashprep.yaml
+
+hashprep report dataset.csv \
+  --format html \
+  --config hashprep.toml
+
+Via Python
+
+from hashprep.utils.config_loader import load_config
+from hashprep import DatasetAnalyzer
+
+config = load_config("hashprep.yaml")
+analyzer = DatasetAnalyzer(df, config=config)
+summary = analyzer.analyze()
+
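The Supported formats note lists TOML and JSON alongside YAML, but only a YAML sample appears in the docs. Assuming the key names simply mirror the YAML example (an assumption, since no TOML sample is given), the same overrides might look like:

```toml
# hashprep.toml (hypothetical; keys assumed to mirror hashprep.yaml)
[missing_values]
warning = 0.3
critical = 0.6

[outliers]
z_score = 3.5

[statistical_tests]
normality_p_value = 0.01
```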

    Contributing