Skip to content

Added new checkfunctions from 1.4 to model#141

Open
mike-finopsorg wants to merge 4 commits into
devfrom
140-update-focus-validator-to-support-focus-14-requirements-modelling
Open

Added new checkfunctions from 1.4 to model#141
mike-finopsorg wants to merge 4 commits into
devfrom
140-update-focus-validator-to-support-focus-14-requirements-modelling

Conversation

@mike-finopsorg
Copy link
Copy Markdown
Contributor

@mike-finopsorg mike-finopsorg commented May 14, 2026

Summary

Adds the check functions needed for FOCUS 1.4 requirements modeling, simplifies the FormatJSON check, and tightens datetime parsing in the parquet loader.

New check functions

  • TypeJSON: column must be of type JSON
  • CheckJSONSchema: JSON values must conform to a JSON Schema referenced by SchemaId
  • CheckRegexMatch: column must match a regex pattern
  • CheckStringEndsWith: column must end with a specific value
  • CheckGreaterThanValue: column must be greater than a threshold
  • CheckLessOrEqualThanValue: column must be less than or equal to a threshold
  • CheckColumnComparison: two columns compared with a configurable Comparator (=, !=, <>, >, >=, <, <=)
  • CheckNoDuplicates: column must contain no duplicate values

Other changes

  • Rewrote FormatJSON to use json_valid(CAST(col AS VARCHAR)) instead of the previous TRY_CAST + typeof logic. The predicate now also applies the row-filter condition.
  • parquet_data_loader.py strategy 6 (auto-format inference) now requires every value to convert successfully, matching strategies 1 to 5. Previously it accepted partial success. Users may see more columns dropped if their data has mixed formats.
  • Added jsonschema ^4.25.1 as a runtime dependency for CheckJSONSchema.
  • Bumped version to 2.2.0.

Closes #140.

Signed-off-by: Mike Fuller mike@finops.org

Bumped version
Signed-off-by: Mike Fuller <mike@finops.org>
@mike-finopsorg mike-finopsorg self-assigned this May 14, 2026
Signed-off-by: Mike Fuller <mike@finops.org>
Signed-off-by: Mike Fuller <mike@finops.org>
Copy link
Copy Markdown
Collaborator

@Matt-Cowsert Matt-Cowsert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for putting this together. A few notes below.

The main item is CheckColumnComparison: as written it flags rows where either column is null, which disagrees with the convention CheckSameValue already establishes (skip the row when either side is null). Worth aligning those two before merge. The rest are smaller refinements.

I've also drafted a fuller PR description below so the title doesn't carry the whole story by itself. Feel free to tweak.


Proposed PR description

## Summary
Adds the check functions needed for FOCUS 1.4 requirements modeling, simplifies the FormatJSON check, and tightens datetime parsing in the parquet loader.

## New check functions
- `TypeJSON`: column must be of type JSON
- `CheckJSONSchema`: JSON values must conform to a JSON Schema referenced by `SchemaId`
- `CheckRegexMatch`: column must match a regex pattern
- `CheckStringEndsWith`: column must end with a specific value
- `CheckGreaterThanValue`: column must be greater than a threshold
- `CheckLessOrEqualThanValue`: column must be less than or equal to a threshold
- `CheckColumnComparison`: two columns compared with a configurable `Comparator` (=, !=, <>, >, >=, <, <=)
- `CheckNoDuplicates`: column must contain no duplicate values

## Other changes
- Rewrote `FormatJSON` to use `json_valid(CAST(col AS VARCHAR))` instead of the previous `TRY_CAST + typeof` logic. The predicate now also applies the row-filter condition.
- `parquet_data_loader.py` strategy 6 (auto-format inference) now requires every value to convert successfully, matching strategies 1 to 5. Previously it accepted partial success. Users may see more columns dropped if their data has mixed formats.
- Added `jsonschema ^4.25.1` as a runtime dependency for `CheckJSONSchema`.
- Bumped version to 2.2.0.

Closes #140.

Signed-off-by: Mike Fuller <mike@finops.org>

Additional notes

Consolidating threshold checks

CheckGreaterThanGenerator (line 1856), CheckLessOrEqualGenerator (line 1907), and the existing CheckGreaterOrEqualGenerator (line 1790) are nearly identical, differing only in the operator. CheckColumnComparisonGenerator (line 1960) already accepts the operator as a parameter, so the same approach could collapse these three into one. Not in scope for this PR, just flagging for a future cleanup.

Test coverage

The new tests cover happy paths only. Worth adding:

  • CheckColumnComparison with one or both columns null (catches the null-handling item below)
  • CheckStringEndsWith with multi-byte characters
  • CheckJSONSchema with malformed paths and with jsonschema unavailable
  • CheckNoDuplicates combined with a row-filter condition

msg_sql = message.replace("'", "''")

pass_predicate = f"{col_a} IS NOT NULL AND {col_b} IS NOT NULL AND {col_a} {comparator} {col_b}"
condition = f"NOT ({pass_predicate})"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
condition = f"NOT ({pass_predicate})"
condition = f"{col_a} IS NOT NULL AND {col_b} IS NOT NULL AND NOT ({col_a} {comparator} {col_b})"

The current condition expands to A IS NULL OR B IS NULL OR NOT(A op B), which flags rows where either column is null. CheckSameValueGenerator (line 1590) uses the opposite convention: skip those rows, only flag rows where both columns have values and the comparison fails. Aligning here keeps the two functions consistent.

col_a = self.params.ColumnAName
col_b = self.params.ColumnBName
comparator = self.params.Comparator
condition = f"NOT ({col_a} IS NOT NULL AND {col_b} IS NOT NULL AND {col_a} {comparator} {col_b})"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
condition = f"NOT ({col_a} IS NOT NULL AND {col_b} IS NOT NULL AND {col_a} {comparator} {col_b})"
condition = f"{col_a} IS NOT NULL AND {col_b} IS NOT NULL AND NOT ({col_a} {comparator} {col_b})"

Same null-handling fix in get_sample_sql so the sample violation output doesn't include rows that weren't actually flagged.

message = self.errorMessage or f"{col} {keyword} end with '{value}'."
msg_sql = message.replace("'", "''")

condition = f"{col} IS NOT NULL AND RIGHT(CAST({col} AS VARCHAR), {value_len}) != '{value_sql}'"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
condition = f"{col} IS NOT NULL AND RIGHT(CAST({col} AS VARCHAR), {value_len}) != '{value_sql}'"
condition = f"{col} IS NOT NULL AND NOT ends_with(CAST({col} AS VARCHAR), '{value_sql}')"

DuckDB has a built-in ends_with(string, suffix). Using it removes the manual length calculation. After this and the matching changes on lines 1561 and 1573, value_len = len(str(value)) on line 1542 (and the one in get_sample_sql) can be deleted.

FROM invalid
"""

predicate_sql = f"{col} IS NOT NULL AND RIGHT(CAST({col} AS VARCHAR), {value_len}) = '{value_sql}'"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
predicate_sql = f"{col} IS NOT NULL AND RIGHT(CAST({col} AS VARCHAR), {value_len}) = '{value_sql}'"
predicate_sql = f"{col} IS NOT NULL AND ends_with(CAST({col} AS VARCHAR), '{value_sql}')"

value_sql = str(value).replace("'", "''")
value_len = len(str(value))

condition = f"{col} IS NOT NULL AND RIGHT(CAST({col} AS VARCHAR), {value_len}) != '{value_sql}'"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
condition = f"{col} IS NOT NULL AND RIGHT(CAST({col} AS VARCHAR), {value_len}) != '{value_sql}'"
condition = f"{col} IS NOT NULL AND NOT ends_with(CAST({col} AS VARCHAR), '{value_sql}')"

Draft202012Validator.check_schema(schema)
validator = Draft202012Validator(schema)
table_name = getattr(self.params, "table_name", "focus_data")
sql = query.replace("{{table_name}}", table_name)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
sql = query.replace("{{table_name}}", table_name)
sql = query.replace("{table_name}", table_name)

The f-string that builds query on line 1224 uses {{table_name}} to escape a single brace, so query only contains {table_name}. This line looks for {{table_name}} (double-braced) and never matches anything. Line 1240 below does the actual replacement.

f"{col} IS NOT NULL "
f"AND (TRY_CAST({col} AS JSON) IS NOT NULL "
f"OR (typeof({col}) = 'VARCHAR' AND json_valid({col}::TEXT)))"
predicate_sql = self._apply_condition(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rewritten FormatJSONGenerator wraps its predicate in _apply_condition. None of the other check functions in the file (CheckSameValueGenerator line 1590, CheckGreaterOrEqualGenerator line 1790, ColumnByColumnEqualsColumnValueGenerator line 1746, others) do this. test_format_json_generator_applies_row_condition_to_predicate shows it's deliberate, but the codebase now has two patterns. Worth either applying the new pattern to the other check functions or leaving a short note here explaining why FormatJSON is the exception.

Comment thread focus_validator/config_objects/focus_to_duckdb_converter.py
Co-authored-by: Matt Cowsert <matthew@finops.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants