diff --git a/README.md b/README.md
index c3b8577..608d81c 100644
--- a/README.md
+++ b/README.md
@@ -96,8 +96,10 @@ hashprep scan dataset.csv
 - `--json`: Output in JSON format
 - `--target COLUMN`: Specify target column for ML-specific checks
 - `--checks CHECKS`: Run specific checks (comma-separated)
+- `--comparison FILE`: Compare with another dataset for drift detection
 - `--sample-size N`: Limit analysis to N rows
 - `--no-sample`: Disable automatic sampling
+- `--config FILE`: Load thresholds from a YAML/TOML/JSON config file
 
 **Example:**
 ```bash
@@ -114,7 +116,7 @@ Get comprehensive details about all detected issues.
 hashprep details dataset.csv
 ```
 
-**Options:** Same as `scan` command
+**Options:** Same as `scan` command (including `--config`)
 
 **Example:**
 ```bash
@@ -138,6 +140,7 @@ hashprep report dataset.csv --format html --theme minimal
 - `--comparison FILE`: Compare with another dataset for drift detection
 - `--sample-size N`: Limit analysis to N rows
 - `--no-sample`: Disable automatic sampling
+- `--config FILE`: Load thresholds from a YAML/TOML/JSON config file
 
 **Examples:**
 ```bash
@@ -190,6 +193,12 @@ hashprep version
 - `constant_length` - String columns with constant character length
 - `extreme_text_lengths` - Text columns with extreme value lengths
 - `datetime_skew` - Datetime columns concentrated in one period
+- `datetime_future_dates` - Datetime columns with values in the future
+- `datetime_gaps` - Anomalous gaps in datetime sequences
+- `datetime_monotonicity` - Non-monotonic datetime columns
+- `normality` - Non-normal numeric distributions (Shapiro-Wilk / D'Agostino-Pearson)
+- `variance_homogeneity` - Unequal variances across target groups (Levene's test, requires --target)
+- `low_mutual_information` - Features with near-zero mutual information with the target (requires --target)
 - `empty_dataset` - Empty or all-missing datasets
 
 ---
@@ -340,6 +349,31 @@ with open('pipeline.py', 'w') as f:
     f.write(pipeline_code)
 ```
 
+#### Load Config from File
+```python
+from hashprep.utils.config_loader import load_config
+from hashprep import DatasetAnalyzer
+
+# Load thresholds from YAML, TOML, or JSON
+config = load_config("hashprep.yaml")  # or .toml / .json
+
+analyzer = DatasetAnalyzer(df, config=config)
+summary = analyzer.analyze()
+```
+
+Example `hashprep.yaml`:
+```yaml
+missing_values:
+  warning: 0.3
+  critical: 0.6
+outliers:
+  z_score: 3.5
+statistical_tests:
+  normality_p_value: 0.01
+```
+
+Only the keys you specify are overridden; all others fall back to defaults.
+
 #### Custom Sampling
 ```python
 from hashprep.utils.sampling import SamplingConfig
diff --git a/web/src/routes/docs/+page.svelte b/web/src/routes/docs/+page.svelte
index 5566976..13a93a2 100644
--- a/web/src/routes/docs/+page.svelte
+++ b/web/src/routes/docs/+page.svelte
@@ -33,6 +33,7 @@
   CLI reference
   Python API
   Available checks
+  Configuration
   Contributing
@@ -192,9 +193,12 @@ generate_report(
@@ -206,7 +210,7 @@ generate_report(

hashprep details dataset.csv --target Survived

- Accepts the same options as scan and is best used when you are actively debugging a dataset or deciding which columns to drop or transform.
+ Accepts the same options as scan (including --comparison and --config) and is best used when you are actively debugging a dataset or deciding which columns to drop or transform.

@@ -224,6 +228,7 @@ generate_report(
  • --with-code — write companion _fixes.py and _pipeline.py files
  • --comparison FILE — compare two datasets for drift (train vs test, etc.)
  • --sample-size N / --no-sample — control automatic sampling
+  • --config FILE — load thresholds from a YAML, TOML, or JSON file

@@ -329,6 +334,7 @@ pipeline_code = builder.generate_pipeline_code()
 )
 summary = analyzer.analyze()
+
    @@ -338,33 +344,98 @@ summary = analyzer.analyze()

    Data quality

-   Distribution
+   Distribution & statistics

    ML-specific

+   Configuration

+ Every detection threshold in HashPrep has a sensible default. You can override any of them at runtime by providing a config file — no code changes required.
    +

    Supported formats

+ Config files can be written in YAML (.yaml / .yml), TOML (.toml), or JSON (.json). Only the keys you specify are changed; everything else falls back to defaults.

+# hashprep.yaml
    +missing_values:
    +  warning: 0.3
    +  critical: 0.6
    +outliers:
    +  z_score: 3.5
    +statistical_tests:
    +  normality_p_value: 0.01
    +mutual_info:
    +  low_mi_warning: 0.05
    +
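The "only the keys you specify are changed" behavior described above amounts to a recursive dictionary merge of user overrides onto defaults. A minimal stdlib sketch of that semantics — the `DEFAULTS` table and `merge` helper are illustrative stand-ins, not HashPrep's actual internals:

```python
import json

# Illustrative defaults -- NOT HashPrep's real threshold table.
DEFAULTS = {
    "missing_values": {"warning": 0.5, "critical": 0.8},
    "outliers": {"z_score": 3.0},
    "statistical_tests": {"normality_p_value": 0.05},
}

def merge(defaults: dict, overrides: dict) -> dict:
    """Recursively overlay user-specified keys on top of defaults."""
    out = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = merge(out[key], value)
        else:
            out[key] = value
    return out

# A user config touching only two thresholds (JSON for a dependency-free
# example; YAML and TOML parse to the same nested-dict shape).
user_cfg = json.loads('{"missing_values": {"warning": 0.3}, "outliers": {"z_score": 3.5}}')
cfg = merge(DEFAULTS, user_cfg)
# Untouched keys keep their defaults, e.g. cfg["missing_values"]["critical"] == 0.8
```

Note the recursion: a shallow `dict.update` would replace the whole `missing_values` sub-dict and silently drop the `critical` default.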
    + +
    +
    +

    Via CLI

+hashprep scan dataset.csv \
    +  --config hashprep.yaml
    +
    +hashprep report dataset.csv \
    +  --format html \
    +  --config hashprep.toml
    +
    +
    +

    Via Python

+from hashprep.utils.config_loader import load_config
    +from hashprep import DatasetAnalyzer
    +
    +config = load_config("hashprep.yaml")
    +analyzer = DatasetAnalyzer(df, config=config)
    +summary = analyzer.analyze()
    +
    +
    +
    +

    Contributing