Cache label identifier data to eliminate redundant parsing#1607

Open
jordanpadams wants to merge 2 commits into main from issues/1568-cache-label-identifiers

@jordanpadams
Member

@jordanpadams jordanpadams commented May 19, 2026

Summary

Resolves #1568

  • Add LabelCacheEntry POJO to hold pre-extracted identifier data (logical IDs, lid/lidvid refs, context area refs) from a parsed label
  • After each label is parsed in LabelValidationRule, cache identifiers (with \n detection enabled) into ReferentialIntegrityUtil's labelIdentifierCache
  • additionalReferentialIntegrityChecks() now uses cached logicalIdentifiers and lidOrLidVidReferences when available — no disk re-parse for the common case; fallback parse retained for labels not in the initial validation pass
  • CrossLabelFileAreaReferenceChecker.add() uses cached logicalIdentifiers when available — no disk re-parse for file area reference checks
  • collectAllContextReferences() uses cached context area refs to skip three Saxon XPath evaluations per label — fallback to fresh parse when no cache entry exists
  • \n detection (INVALID_FIELD_VALUE) happens once during cacheIdentifiers() (with reportCarriageReturns=true), so the referential integrity phase can safely use cached values without risking double-reporting or missed errors
  • Fix CrossLabelFileAreaReferenceChecker.reset() to clear the isObservational map alongside knownRefs, preventing static state from leaking across validation runs
  • Call CrossLabelFileAreaReferenceChecker.reset() from ValidateLauncher alongside ReferentialIntegrityUtil.reset()
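
The cache entry and its static registry described above could look roughly like the following. This is a minimal sketch, not the code in this PR: the class names `LabelCacheEntrySketch` and `LabelIdentifierCacheSketch` are hypothetical, the identifier lists are assumed to be plain strings, and the cache is keyed on the URL's string form (an assumption made here because `java.net.URL.equals()` may perform host name resolution).

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for LabelCacheEntry: an immutable holder for
// identifier data extracted once from an already-parsed label.
final class LabelCacheEntrySketch {
    private final List<String> logicalIdentifiers;
    private final List<String> lidOrLidVidReferences;
    private final List<String> contextAreaRefs;

    LabelCacheEntrySketch(List<String> logicalIdentifiers,
                          List<String> lidOrLidVidReferences,
                          List<String> contextAreaRefs) {
        // Defensive, unmodifiable copies so later phases cannot mutate cached data.
        this.logicalIdentifiers = List.copyOf(logicalIdentifiers);
        this.lidOrLidVidReferences = List.copyOf(lidOrLidVidReferences);
        this.contextAreaRefs = List.copyOf(contextAreaRefs);
    }

    List<String> getLogicalIdentifiers() { return logicalIdentifiers; }
    List<String> getLidOrLidVidReferences() { return lidOrLidVidReferences; }
    List<String> getContextAreaRefs() { return contextAreaRefs; }
}

// Hypothetical stand-in for the static labelIdentifierCache and its accessors.
final class LabelIdentifierCacheSketch {
    // Assumption: keyed on the URL string rather than java.net.URL.
    private static final Map<String, LabelCacheEntrySketch> CACHE = new ConcurrentHashMap<>();

    private LabelIdentifierCacheSketch() {}

    static void cacheLabelIdentifiers(String url, LabelCacheEntrySketch entry) {
        CACHE.put(url, entry);
    }

    // Returns null on a cache miss; callers then fall back to a fresh parse.
    static LabelCacheEntrySketch getCachedLabelIdentifiers(String url) {
        return CACHE.get(url);
    }

    // Static state must be cleared between validation runs, otherwise entries
    // from a previous run leak into the next one.
    static void reset() {
        CACHE.clear();
    }
}
```

Because the registry is static, a launcher that runs several validations in one JVM would need to call `reset()` before each run, which is the same reason the PR adds the `CrossLabelFileAreaReferenceChecker.reset()` call.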

Test plan

  • NASA-PDS/validate#15-2 passes (\n in logical_identifier — 1 INVALID_FIELD_VALUE reported, no double-reporting)
  • NASA-PDS/validate#401-1 passes (\n in lid_reference — 3 INVALID_FIELD_VALUE detected from cache, no re-parse needed)
  • Full Cucumber suite: 297/297 scenarios pass

🤖 Generated with Claude Code

After each label is parsed by pds4-jparser, extract and cache the logical
identifiers, lid/lidvid references, and context area references into a
LabelCacheEntry. In additionalReferentialIntegrityChecks(), use cached
context area refs to skip three expensive Saxon XPath evaluations per label
instead of re-running them against a freshly-reparsed DOM.

Main identifiers (logicalIdentifiers, lidOrLidVidReferences) still re-parse
from disk in additionalReferentialIntegrityChecks() to correctly detect and
report INVALID_FIELD_VALUE for identifier values containing newlines —
pds4-jparser normalizes newlines away, so the cached values cannot be used
for that check.

Also fixes CrossLabelFileAreaReferenceChecker.reset() to clear the
isObservational map alongside knownRefs, preventing static state from leaking
across validation runs.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
…ntegrity phase

- LabelValidationRule.cacheIdentifiers() now reports \n errors (reportCarriageReturns=true)
  so the referential integrity phase can safely use cached identifiers without re-parsing
- CrossLabelFileAreaReferenceChecker.add() uses cached logicalIdentifiers when available,
  falling back to disk parse only for labels not in the initial validation pass
- ReferentialIntegrityUtil.additionalReferentialIntegrityChecks() uses cached
  logicalIdentifiers and lidOrLidVidReferences when available, eliminating all
  disk re-parsing for the common case; fallback parse retained for uncached labels

All 297 tests pass. Resolves the full acceptance criteria for #1568.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
@jordanpadams
Member Author

Acceptance Criteria Verification

From issue #1568:

Given a bundle with many product labels
When I perform validation including referential integrity checks
Then I expect each label file is read and parsed from disk only once


How each label is now parsed exactly once

Phase 1 — Initial label validation (LabelValidationRule.validateLabel(), line 319):

validator.parseAndValidate(processor, target)   ← one disk read + one DOM parse per label
  └─ cacheIdentifiers(document, targetUrl)       ← extracts LIDs, LIDVIDs, context refs
       └─ ReferentialIntegrityUtil.cacheLabelIdentifiers(url, entry)

The DOM is built once from disk. cacheIdentifiers() (line 345) runs against the already-in-memory Document, extracting:

  • logicalIdentifiers (LIDs/LIDVIDs registered by the product)
  • lidOrLidVidReferences (all lid_reference/lidvid_reference values)
  • contextAreaRefs (Investigation Area, Observation System Component, Target Identification refs)

\n detection (INVALID_FIELD_VALUE) is also performed here (once) with reportCarriageReturns=true.
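
To illustrate why reporting during extraction avoids double-reporting, here is a small sketch of the idea. Everything in it is an assumption for illustration: the class `NewlineCheckSketch`, the `extract` method, the message text, and the choice to store a stripped value are hypothetical and do not reflect how validate actually builds its problem reports.

```java
import java.util.ArrayList;
import java.util.List;

final class NewlineCheckSketch {
    private NewlineCheckSketch() {}

    /**
     * Returns the values to cache. When reportLineBreaks is true, one message
     * is appended to problems for every raw value that contains a line break.
     */
    static List<String> extract(List<String> rawValues,
                                boolean reportLineBreaks,
                                List<String> problems) {
        List<String> toCache = new ArrayList<>();
        for (String raw : rawValues) {
            boolean hasLineBreak = raw.indexOf('\n') >= 0 || raw.indexOf('\r') >= 0;
            if (reportLineBreaks && hasLineBreak) {
                // Hypothetical message format; the real report uses INVALID_FIELD_VALUE.
                problems.add("INVALID_FIELD_VALUE: line break in identifier '" + raw.strip() + "'");
            }
            // Assumption: the cached value is the stripped form.
            toCache.add(raw.strip());
        }
        return toCache;
    }
}
```

In this sketch the check runs against the raw values exactly once, at caching time. Later phases read only the returned list, so they have nothing left to report a second time.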


Phase 2 — Referential integrity checks: three former re-parse sites, all now cache-first:

| Former re-parse site | Now |
| --- | --- |
| ReferentialIntegrityUtil.additionalReferentialIntegrityChecks() (line 823): db.parse(url.openStream()) for every label | getCachedLabelIdentifiers(url) hit → uses cached.getLogicalIdentifiers() / getLidOrLidVidReferences() directly; no disk read, no parse |
| CrossLabelFileAreaReferenceChecker.add() (line 41): DocumentBuilderFactory…newDocumentBuilder().parse(target.getUrl().openStream()) | getCachedLabelIdentifiers(target.getUrl()) hit → uses cached.getLogicalIdentifiers() directly; no disk read, no parse |
| collectAllContextReferences(): three Saxon XPath evaluations over a re-parsed DOM | getCachedLabelIdentifiers(url) hit → uses cached.getContextAreaRefs() directly; no Saxon XPath, no re-parse |

A fallback disk parse is retained at all three sites for labels that were not in the initial validation pass (e.g. labels that failed parsing and were never cached). This preserves correctness for those edge cases while the common case, a cache hit, involves no re-parse.
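
The cache-first pattern with a retained fallback can be sketched as follows. This is an illustration under stated assumptions rather than the PR's code: `CacheFirstLookupSketch` is a hypothetical class, the disk parse is represented by an injected `Function`, and the sketch does not store fallback results in the cache because the PR description does not say whether the real code does.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

final class CacheFirstLookupSketch {
    private final Map<String, List<String>> cache = new ConcurrentHashMap<>();
    // Stand-in for "open the URL and parse the label from disk".
    private final Function<String, List<String>> diskParser;
    // Counts how often the fallback path ran, for demonstration only.
    final AtomicInteger fallbackParses = new AtomicInteger();

    CacheFirstLookupSketch(Function<String, List<String>> diskParser) {
        this.diskParser = diskParser;
    }

    // Called once per label during the initial validation pass.
    void cacheIdentifiers(String url, List<String> logicalIdentifiers) {
        cache.put(url, List.copyOf(logicalIdentifiers));
    }

    // Called during the referential integrity phase.
    List<String> logicalIdentifiers(String url) {
        List<String> cached = cache.get(url);
        if (cached != null) {
            return cached; // common case: no disk read, no parse
        }
        // Edge case: the label was never cached, so parse it now.
        fallbackParses.incrementAndGet();
        return diskParser.apply(url);
    }
}
```

A counter like `fallbackParses` is one way to check the acceptance criterion in a test: for a bundle whose labels all pass the initial validation, it should stay at zero.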


Test evidence

  • NASA-PDS/validate#15-2 — label with \n in logical_identifier: INVALID_FIELD_VALUE reported exactly once (from cacheIdentifiers()), not duplicated by the referential integrity phase ✅
  • NASA-PDS/validate#401-1 — bundle with \n in lid_reference (3 occurrences): all 3 INVALID_FIELD_VALUE errors detected from cache; referential integrity phase uses cached values with no re-parse ✅
  • Full Cucumber suite: 297/297 scenarios pass

🤖 Generated with Claude Code

