Added new checkfunctions from 1.4 to model#141
Conversation
Bumped version Signed-off-by: Mike Fuller <mike@finops.org>
Signed-off-by: Mike Fuller <mike@finops.org>
Signed-off-by: Mike Fuller <mike@finops.org>
Matt-Cowsert
left a comment
There was a problem hiding this comment.
Thanks for putting this together. A few notes below.
The main item is CheckColumnComparison: as written it flags rows where either column is null, which disagrees with the convention CheckSameValue already establishes (skip the row when either side is null). Worth aligning those two before merge. The rest are smaller refinements.
I've also drafted a fuller PR description below so the title doesn't carry the whole story by itself. Feel free to tweak.
Proposed PR description
## Summary
Adds the check functions needed for FOCUS 1.4 requirements modeling, simplifies the FormatJSON check, and tightens datetime parsing in the parquet loader.
## New check functions
- `TypeJSON`: column must be of type JSON
- `CheckJSONSchema`: JSON values must conform to a JSON Schema referenced by `SchemaId`
- `CheckRegexMatch`: column must match a regex pattern
- `CheckStringEndsWith`: column must end with a specific value
- `CheckGreaterThanValue`: column must be greater than a threshold
- `CheckLessOrEqualThanValue`: column must be less than or equal to a threshold
- `CheckColumnComparison`: two columns compared with a configurable `Comparator` (=, !=, <>, >, >=, <, <=)
- `CheckNoDuplicates`: column must contain no duplicate values
## Other changes
- Rewrote `FormatJSON` to use `json_valid(CAST(col AS VARCHAR))` instead of the previous `TRY_CAST + typeof` logic. The predicate now also applies the row-filter condition.
- `parquet_data_loader.py` strategy 6 (auto-format inference) now requires every value to convert successfully, matching strategies 1 to 5. Previously it accepted partial success. Users may see more columns dropped if their data has mixed formats.
- Added `jsonschema ^4.25.1` as a runtime dependency for `CheckJSONSchema`.
- Bumped version to 2.2.0.
Closes #140.
Signed-off-by: Mike Fuller <mike@finops.org>Additional notes
Consolidating threshold checks
CheckGreaterThanGenerator (line 1856), CheckLessOrEqualGenerator (line 1907), and the existing CheckGreaterOrEqualGenerator (line 1790) are nearly identical, differing only in the operator. CheckColumnComparisonGenerator (line 1960) already accepts the operator as a parameter, so the same approach could collapse these three into one. Not in scope for this PR, just flagging for a future cleanup.
Test coverage
The new tests cover happy paths only. Worth adding:
CheckColumnComparisonwith one or both columns null (catches the null-handling item below)CheckStringEndsWithwith multi-byte charactersCheckJSONSchemawith malformed paths and withjsonschemaunavailableCheckNoDuplicatescombined with a row-filter condition
| msg_sql = message.replace("'", "''") | ||
|
|
||
| pass_predicate = f"{col_a} IS NOT NULL AND {col_b} IS NOT NULL AND {col_a} {comparator} {col_b}" | ||
| condition = f"NOT ({pass_predicate})" |
There was a problem hiding this comment.
| condition = f"NOT ({pass_predicate})" | |
| condition = f"{col_a} IS NOT NULL AND {col_b} IS NOT NULL AND NOT ({col_a} {comparator} {col_b})" |
The current condition expands to A IS NULL OR B IS NULL OR NOT(A op B), which flags rows where either column is null. CheckSameValueGenerator (line 1590) uses the opposite convention: skip those rows, only flag rows where both columns have values and the comparison fails. Aligning here keeps the two functions consistent.
| col_a = self.params.ColumnAName | ||
| col_b = self.params.ColumnBName | ||
| comparator = self.params.Comparator | ||
| condition = f"NOT ({col_a} IS NOT NULL AND {col_b} IS NOT NULL AND {col_a} {comparator} {col_b})" |
There was a problem hiding this comment.
| condition = f"NOT ({col_a} IS NOT NULL AND {col_b} IS NOT NULL AND {col_a} {comparator} {col_b})" | |
| condition = f"{col_a} IS NOT NULL AND {col_b} IS NOT NULL AND NOT ({col_a} {comparator} {col_b})" |
Same null-handling fix in get_sample_sql so the sample violation output doesn't include rows that weren't actually flagged.
| message = self.errorMessage or f"{col} {keyword} end with '{value}'." | ||
| msg_sql = message.replace("'", "''") | ||
|
|
||
| condition = f"{col} IS NOT NULL AND RIGHT(CAST({col} AS VARCHAR), {value_len}) != '{value_sql}'" |
There was a problem hiding this comment.
| condition = f"{col} IS NOT NULL AND RIGHT(CAST({col} AS VARCHAR), {value_len}) != '{value_sql}'" | |
| condition = f"{col} IS NOT NULL AND NOT ends_with(CAST({col} AS VARCHAR), '{value_sql}')" |
DuckDB has a built-in ends_with(string, suffix). Using it removes the manual length calculation. After this and the matching changes on lines 1561 and 1573, value_len = len(str(value)) on line 1542 (and the one in get_sample_sql) can be deleted.
| FROM invalid | ||
| """ | ||
|
|
||
| predicate_sql = f"{col} IS NOT NULL AND RIGHT(CAST({col} AS VARCHAR), {value_len}) = '{value_sql}'" |
There was a problem hiding this comment.
| predicate_sql = f"{col} IS NOT NULL AND RIGHT(CAST({col} AS VARCHAR), {value_len}) = '{value_sql}'" | |
| predicate_sql = f"{col} IS NOT NULL AND ends_with(CAST({col} AS VARCHAR), '{value_sql}')" |
| value_sql = str(value).replace("'", "''") | ||
| value_len = len(str(value)) | ||
|
|
||
| condition = f"{col} IS NOT NULL AND RIGHT(CAST({col} AS VARCHAR), {value_len}) != '{value_sql}'" |
There was a problem hiding this comment.
| condition = f"{col} IS NOT NULL AND RIGHT(CAST({col} AS VARCHAR), {value_len}) != '{value_sql}'" | |
| condition = f"{col} IS NOT NULL AND NOT ends_with(CAST({col} AS VARCHAR), '{value_sql}')" |
| Draft202012Validator.check_schema(schema) | ||
| validator = Draft202012Validator(schema) | ||
| table_name = getattr(self.params, "table_name", "focus_data") | ||
| sql = query.replace("{{table_name}}", table_name) |
There was a problem hiding this comment.
| sql = query.replace("{{table_name}}", table_name) | |
| sql = query.replace("{table_name}", table_name) |
The f-string that builds query on line 1224 uses {{table_name}} to escape a single brace, so query only contains {table_name}. This line looks for {{table_name}} (double-braced) and never matches anything. Line 1240 below does the actual replacement.
| f"{col} IS NOT NULL " | ||
| f"AND (TRY_CAST({col} AS JSON) IS NOT NULL " | ||
| f"OR (typeof({col}) = 'VARCHAR' AND json_valid({col}::TEXT)))" | ||
| predicate_sql = self._apply_condition( |
There was a problem hiding this comment.
The rewritten FormatJSONGenerator wraps its predicate in _apply_condition. None of the other check functions in the file (CheckSameValueGenerator line 1590, CheckGreaterOrEqualGenerator line 1790, ColumnByColumnEqualsColumnValueGenerator line 1746, others) do this. test_format_json_generator_applies_row_condition_to_predicate shows it's deliberate, but the codebase now has two patterns. Worth either applying the new pattern to the other check functions or leaving a short note here explaining why FormatJSON is the exception.
Co-authored-by: Matt Cowsert <matthew@finops.org>
Summary
Adds the check functions needed for FOCUS 1.4 requirements modeling, simplifies the FormatJSON check, and tightens datetime parsing in the parquet loader.
New check functions
TypeJSON: column must be of type JSONCheckJSONSchema: JSON values must conform to a JSON Schema referenced bySchemaIdCheckRegexMatch: column must match a regex patternCheckStringEndsWith: column must end with a specific valueCheckGreaterThanValue: column must be greater than a thresholdCheckLessOrEqualThanValue: column must be less than or equal to a thresholdCheckColumnComparison: two columns compared with a configurableComparator(=, !=, <>, >, >=, <, <=)CheckNoDuplicates: column must contain no duplicate valuesOther changes
FormatJSONto usejson_valid(CAST(col AS VARCHAR))instead of the previousTRY_CAST + typeoflogic. The predicate now also applies the row-filter condition.parquet_data_loader.pystrategy 6 (auto-format inference) now requires every value to convert successfully, matching strategies 1 to 5. Previously it accepted partial success. Users may see more columns dropped if their data has mixed formats.jsonschema ^4.25.1as a runtime dependency forCheckJSONSchema.Closes #140.
Signed-off-by: Mike Fuller mike@finops.org