diff --git a/README.md b/README.md
index c3b8577..608d81c 100644
--- a/README.md
+++ b/README.md
@@ -96,8 +96,10 @@ hashprep scan dataset.csv
 - `--json`: Output in JSON format
 - `--target COLUMN`: Specify target column for ML-specific checks
 - `--checks CHECKS`: Run specific checks (comma-separated)
+- `--comparison FILE`: Compare with another dataset for drift detection
 - `--sample-size N`: Limit analysis to N rows
 - `--no-sample`: Disable automatic sampling
+- `--config FILE`: Load thresholds from a YAML/TOML/JSON config file
 
 **Example:**
 ```bash
@@ -114,7 +116,7 @@ Get comprehensive details about all detected issues.
 hashprep details dataset.csv
 ```
 
-**Options:** Same as `scan` command
+**Options:** Same as `scan` command (including `--config`)
 
 **Example:**
 ```bash
@@ -138,6 +140,7 @@ hashprep report dataset.csv --format html --theme minimal
 - `--comparison FILE`: Compare with another dataset for drift detection
 - `--sample-size N`: Limit analysis to N rows
 - `--no-sample`: Disable automatic sampling
+- `--config FILE`: Load thresholds from a YAML/TOML/JSON config file
 
 **Examples:**
 ```bash
@@ -190,6 +193,12 @@ hashprep version
 - `constant_length` - String columns with constant character length
 - `extreme_text_lengths` - Text columns with extreme value lengths
 - `datetime_skew` - Datetime columns concentrated in one period
+- `datetime_future_dates` - Datetime columns with values in the future
+- `datetime_gaps` - Anomalous gaps in datetime sequences
+- `datetime_monotonicity` - Non-monotonic datetime columns
+- `normality` - Non-normal numeric distributions (Shapiro-Wilk / D'Agostino-Pearson)
+- `variance_homogeneity` - Unequal variances across target groups (Levene's test, requires --target)
+- `low_mutual_information` - Features with near-zero mutual information with the target (requires --target)
 - `empty_dataset` - Empty or all-missing datasets
 
 ---
@@ -340,6 +349,31 @@ with open('pipeline.py', 'w') as f:
     f.write(pipeline_code)
 ```
 
+#### Load Config from File
+```python
+from hashprep.utils.config_loader import load_config
+from hashprep import DatasetAnalyzer
+
+# Load thresholds from YAML, TOML, or JSON
+config = load_config("hashprep.yaml")  # or .toml / .json
+
+analyzer = DatasetAnalyzer(df, config=config)
+summary = analyzer.analyze()
+```
+
+Example `hashprep.yaml`:
+```yaml
+missing_values:
+  warning: 0.3
+  critical: 0.6
+outliers:
+  z_score: 3.5
+statistical_tests:
+  normality_p_value: 0.01
+```
+
+Only the keys you specify are overridden; all others fall back to defaults.
+
 #### Custom Sampling
 ```python
 from hashprep.utils.sampling import SamplingConfig
diff --git a/web/src/routes/docs/+page.svelte b/web/src/routes/docs/+page.svelte
index 5566976..13a93a2 100644
--- a/web/src/routes/docs/+page.svelte
+++ b/web/src/routes/docs/+page.svelte
@@ -33,6 +33,7 @@
 CLI reference
 Python API
 Available checks
+Configuration
 Contributing
 
@@ -192,9 +193,12 @@ generate_report(
--target COLUMN — target column for ML checks (class imbalance, leakage, etc.)
--checks LIST — comma-separated list of checks to run
--comparison FILE — compare with another dataset for drift detection
--critical-only — hide warnings and show only critical issues
--json — emit JSON instead of human-readable text
--quiet — minimal output, useful in CI
--sample-size N / --no-sample — control automatic sampling
--config FILE — load thresholds from a YAML, TOML, or JSON file
hashprep details dataset.csv --target Survived
- Accepts the same options as scan and is best used when you are actively debugging a dataset
+ Accepts the same options as scan (including --comparison and --config) and is best used when you are actively debugging a dataset
or deciding which columns to drop or transform.
--with-code — write companion _fixes.py and _pipeline.py files
--comparison FILE — compare two datasets for drift (train vs test, etc.)
--sample-size N / --no-sample — control automatic sampling
--config FILE — load thresholds from a YAML, TOML, or JSON file
missing_values — overall missingness patterns
dataset_missingness — overall missing data patterns
high_missing_values — columns with heavy missingness
missing_patterns — correlated missing value patterns
duplicates — duplicate rows
empty_columns — completely empty columns
empty_dataset — empty or all-missing datasets
single_value_columns — near-constant features
mixed_data_types — columns with mixed data types
outliers — IQR-based outlier detection
high_cardinality — categorical columns with too many uniques
outliers — z-score outlier detection
skewness — highly skewed numeric distributions
high_cardinality — too many unique categorical values
uniform_distribution — uniformly distributed numeric columns
many_zeros — features dominated by zeros
unique_values — columns where >95% values are unique
high_zero_counts — features dominated by zeros
infinite_values — columns containing infinite values
constant_length — string columns with constant character length
extreme_text_lengths — text columns with extreme value lengths
normality — non-normal distributions (Shapiro-Wilk / D'Agostino-Pearson)
variance_homogeneity — unequal variances across target groups (Levene's test)
class_imbalance — target imbalance (requires --target)
feature_correlation — highly correlated features
target_leakage — features leaking target information
dataset_drift — drift between train / test datasets
feature_correlation — highly correlated numeric features
categorical_correlation — highly associated categorical features
mixed_correlation — numeric-categorical associations
data_leakage — columns identical to target
target_leakage_patterns — features leaking target information
low_mutual_information — near-zero MI with target (requires --target)
dataset_drift — drift between train / test datasets (requires --comparison)
datetime_skew — datetime columns concentrated in one period
datetime_future_dates — datetime values in the future
datetime_gaps — anomalous gaps in datetime sequences
datetime_monotonicity — non-monotonic datetime columns
+Every detection threshold in HashPrep has a sensible default. You can override any of them at runtime by
+providing a config file — no code changes required.
+
+
+Config files can be written in YAML (.yaml / .yml), TOML (.toml),
+ or JSON (.json). Only the keys you specify are changed; everything else falls back to
+ defaults.
+# hashprep.yaml
+missing_values:
+ warning: 0.3
+ critical: 0.6
+outliers:
+ z_score: 3.5
+statistical_tests:
+ normality_p_value: 0.01
+mutual_info:
+ low_mi_warning: 0.05
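Since the loader also accepts TOML, the same overrides could be written in that format. The sketch below simply transliterates the YAML example above; the key names come from that example, and a hypothetical `hashprep.toml` filename is assumed:

```toml
# hashprep.toml — the same threshold overrides as the YAML example
[missing_values]
warning = 0.3
critical = 0.6

[outliers]
z_score = 3.5

[statistical_tests]
normality_p_value = 0.01

[mutual_info]
low_mi_warning = 0.05
```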
+hashprep scan dataset.csv \
+ --config hashprep.yaml
+
+hashprep report dataset.csv \
+ --format html \
+ --config hashprep.toml
+from hashprep.utils.config_loader import load_config
+from hashprep import DatasetAnalyzer
+
+config = load_config("hashprep.yaml")
+analyzer = DatasetAnalyzer(df, config=config)
+summary = analyzer.analyze()
+
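The "only the keys you specify are changed" behaviour described above amounts to recursively overlaying the user's config on built-in defaults. The snippet below is a minimal sketch of that merge idea, not HashPrep's actual `load_config` implementation; the default values shown are illustrative assumptions, not the library's real thresholds:

```python
# Sketch of "user keys override, everything else falls back to defaults".
# NOTE: illustrative only — not HashPrep's real loader or real default values.
DEFAULTS = {
    "missing_values": {"warning": 0.5, "critical": 0.8},
    "outliers": {"z_score": 3.0},
    "statistical_tests": {"normality_p_value": 0.05},
}

def merge_config(defaults: dict, overrides: dict) -> dict:
    """Recursively overlay `overrides` on `defaults` without mutating either."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged

# A user config that only overrides two thresholds:
user = {"missing_values": {"warning": 0.3}, "outliers": {"z_score": 3.5}}
config = merge_config(DEFAULTS, user)
print(config["missing_values"])     # "warning" overridden, "critical" kept
print(config["statistical_tests"])  # untouched section falls back to defaults
```

A recursive merge (rather than a shallow `dict.update`) is what lets a user set `missing_values.warning` without clobbering the sibling `missing_values.critical` default.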