
Fix hash_file return type and wire file_hasher through converters#74

Merged
eywalker merged 6 commits into dev from claude/consolidate-file-hashing-JBRIT on Mar 5, 2026
Merged

Fix hash_file return type and wire file_hasher through converters#74
eywalker merged 6 commits intodevfrom
claude/consolidate-file-hashing-JBRIT

Conversation

@eywalker (Contributor) commented Mar 5, 2026

Summary

This PR fixes a pre-existing bug where hash_utils.hash_file() returned raw bytes instead of a ContentHash object, and ensures the file hasher is properly wired through the semantic type system so that file content hashing is consistent across all code paths.

Key Changes

  • Fixed hash_file() return type: Changed hash_utils.hash_file() to return ContentHash instead of raw bytes. All algorithm branches (sha256, md5, xxh64, crc32, hash_path) now wrap their digest in a ContentHash object with the algorithm name as the method.

  • Wired file_hasher through PathStructConverter: Updated PathStructConverter.__init__() to accept a FileContentHasherProtocol dependency and implemented hash_struct_dict() method to hash file content using the injected file hasher.

  • Fixed CachedFileHasher cache key and round-trip:

    • Cache key now includes file mtime_ns and size to properly invalidate on content changes
    • Cache stores the full ContentHash.to_string() format ("method:hex_digest")
    • Cache retrieval uses ContentHash.from_string() to properly restore the object
  • Updated context configuration: Modified v0.1.json to instantiate BasicFileHasher and inject it into PathStructConverter via dependency reference.

  • Added comprehensive test coverage:

    • test_file_hashers.py: Tests verifying hash_file() returns ContentHash, BasicFileHasher behavior, and CachedFileHasher caching with proper round-tripping
    • test_file_hashing_consistency.py: Integration tests ensuring file hashing is consistent across both the Arrow hasher path (via PathStructConverter) and semantic hasher path (via PathContentHandler)
    • Updated test_path_struct_converter.py: Enabled previously commented-out tests and added new tests for hash_struct_dict() behavior

Implementation Details

  • ContentHash now serves as the canonical return type for all file hashing operations, ensuring type safety and consistency
  • The cache key includes file metadata (mtime_ns, size) to automatically invalidate when file content changes
  • Both SemanticArrowHasher and BaseSemanticHasher now use the same FileContentHasherProtocol instance, guaranteeing identical hashes for identical file content regardless of entry point

https://claude.ai/code/session_01PBeavr3pTFu5sPhWUxNx2r

claude added 6 commits March 5, 2026 11:17

https://claude.ai/code/session_01PBeavr3pTFu5sPhWUxNx2r

1. Inject file_hasher into PathStructConverter so both the Arrow hasher path (via SemanticHashingVisitor → hash_struct_dict) and the semantic hasher path (via PathContentHandler) delegate to the same FileContentHasherProtocol instance.

   Changes:
   - PathStructConverter now requires file_hasher in __init__
   - Implement hash_struct_dict() delegating to the injected file_hasher
   - Reorder v0.1.json so file_hasher is instantiated before semantic_registry (ref resolution is sequential)
   - Add file_hasher ref to PathStructConverter config in v0.1.json
   - Update versioned_hashers.py factory to pass BasicFileHasher
   - Update and uncomment PathStructConverter tests
   - Add integration tests for cross-path hash consistency

2. …ct_dict

   BasicFileHasher.hash_file() returns raw bytes despite the protocol declaring ContentHash. Handle both cases defensively in PathStructConverter.hash_struct_dict() and in the integration tests.

3. hash_utils.hash_file() declared -> bytes, but the protocol and all callers expected ContentHash. It now properly returns ContentHash(method=algorithm, digest=...).

   - Update hash_utils.hash_file to return ContentHash
   - Fix CachedFileHasher cache hit/miss to use .digest
   - Update generate_file_hashes.py to use .to_hex()
   - Remove defensive hasattr workarounds added in the previous commit

4. Use ContentHash.to_string() / from_string() so the cached value preserves both the method name and digest bytes, instead of losing the method on cache hit.

5. Add tests that would have caught the pre-existing bugs:
   - hash_file() returning raw bytes instead of ContentHash
   - CachedFileHasher losing the method name on cache hit
   - hash_path algorithm failing on Path objects (missing str())

   Test coverage includes:
   - hash_utils.hash_file returns ContentHash for all algorithms
   - BasicFileHasher return type, determinism, content sensitivity
   - CachedFileHasher: cache miss delegates, cache hit preserves method and digest, cache stores the full to_string() format, clear_cache forces rehash, stale cache behavior documented

   Also fixes the hash_path algorithm to str()-ify Path before encoding.

6. Include st_mtime_ns and st_size in the cache key so that modified files automatically produce a cache miss instead of returning stale hashes.

   Update tests to verify the new auto-invalidation behavior, and fix a flaky test by using different-length content to guarantee size changes.
Copilot AI review requested due to automatic review settings March 5, 2026 11:56
@codecov-commenter commented Mar 5, 2026

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 87.09677% with 4 lines in your changes missing coverage. Please review.

Files with missing lines                                   Patch %   Lines
src/orcapod/hashing/versioned_hashers.py                     0.00%   3 Missing ⚠️
...capod/semantic_types/semantic_struct_converters.py       92.85%   1 Missing ⚠️


@eywalker eywalker merged commit 2b5e526 into dev Mar 5, 2026
6 checks passed
Copilot AI left a comment (Contributor)
Pull request overview

This PR standardizes file hashing around ContentHash, fixes the hash_utils.hash_file() return type, and wires a shared file-hasher dependency through the semantic type system so file content hashing is consistent across Arrow-table and semantic-object hashing paths.

Changes:

  • Update hash_utils.hash_file() and file hasher implementations to return ContentHash (with method + digest), including cache round-tripping via ContentHash.to_string()/from_string().
  • Inject a FileContentHasherProtocol into PathStructConverter and implement hash_struct_dict() to hash file content for path semantic structs.
  • Add/enable tests covering return types, caching behavior, and cross-path hashing consistency; update context JSON wiring for the file hasher.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

Summary per file:

  • src/orcapod/hashing/hash_utils.py: Changes hash_file() to return ContentHash and wraps all algorithm branches accordingly.
  • src/orcapod/hashing/file_hashers.py: Updates CachedFileHasher to include mtime/size in the cache key and to store/restore full ContentHash strings.
  • src/orcapod/semantic_types/semantic_struct_converters.py: Injects the file hasher into PathStructConverter and adds content-based hash_struct_dict().
  • src/orcapod/hashing/versioned_hashers.py: Wires a BasicFileHasher into PathStructConverter for the versioned Arrow hasher factory.
  • src/orcapod/contexts/data/v0.1.json: Updates context configuration to instantiate and inject the file_hasher into PathStructConverter.
  • tests/test_hashing/test_file_hashers.py: Adds unit tests for hash_file(), BasicFileHasher, and CachedFileHasher behavior/round-tripping.
  • tests/test_hashing/test_file_hashing_consistency.py: Adds integration tests ensuring Arrow-path and semantic-path hashing match for identical file content.
  • tests/test_semantic_types/test_path_struct_converter.py: Enables/expands tests for PathStructConverter.hash_struct_dict() behavior.
  • tests/test_hashing/generate_file_hashes.py: Updates the sample-hash generation script to serialize hashes via ContentHash.to_hex().


Comment on lines 59 to +66
  if not Path(file_path).is_file():
      raise FileNotFoundError(f"The file {file_path} does not exist")

  # Hash the path string itself rather than file content
  if algorithm == "hash_path":
      hasher = hashlib.sha256()
-     hasher.update(file_path.encode("utf-8"))
-     return hasher.digest()
+     hasher.update(str(file_path).encode("utf-8"))
+     return ContentHash(method=algorithm, digest=hasher.digest())
Copilot AI commented Mar 5, 2026

hash_file() checks Path(file_path).is_file() before handling algorithm=='hash_path'. That makes the 'hash_path' mode unusable for non-existent paths (even though it’s supposed to hash the path string itself) and it also raises a misleading FileNotFoundError for directories. Consider moving the 'hash_path' branch before the is_file() check, and splitting validation so directories raise IsADirectoryError (or a clearer error) while missing paths raise FileNotFoundError.

  if path.is_dir():
      raise IsADirectoryError(f"Path is a directory: {path}")

  content_hash = self._file_hasher.hash_file(path)
Copilot AI commented Mar 5, 2026

PathStructConverter.hash_struct_dict() delegates to an injected file_hasher, but the prefix formatting is hard-coded to 'sha256' via _format_hash_string() and the docstring says 'path:sha256:...'. If a non-sha256 file_hasher is injected, add_prefix will lie about the algorithm. Either enforce/validate that file_hasher is sha256 here, or build the prefix from the returned ContentHash.method (and adjust _format_hash_string accordingly).

Suggested change:

  content_hash = self._file_hasher.hash_file(path)
+ if add_prefix and content_hash.method != "sha256":
+     raise ValueError(
+         f"add_prefix=True requires a SHA-256 file hasher, but got '{content_hash.method}'"
+     )

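An alternative to the validation suggested above is to derive the prefix from the returned hash itself, so the label can never disagree with the algorithm that produced the digest. A sketch: the "path:<method>:<hex>" layout follows the docstring quoted in the comment, while the function name and signature are hypothetical stand-ins for _format_hash_string().

```python
# Hypothetical helper: build the prefix from ContentHash.method instead of
# hard-coding "sha256", so a non-default hasher cannot mislabel its hashes.
def format_path_hash(content_hash, add_prefix: bool = True) -> str:
    hex_digest = content_hash.digest.hex()
    return f"path:{content_hash.method}:{hex_digest}" if add_prefix else hex_digest
```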
Comment on lines +40 to +48
+ stat = os.stat(file_path)
+ cache_key = f"file:{file_path}:{stat.st_mtime_ns}:{stat.st_size}"
  cached_value = self.string_cacher.get_cached(cache_key)
  if cached_value is not None:
-     return bytes.fromhex(cached_value)
+     return ContentHash.from_string(cached_value)

- value = self.file_hasher.hash_file(file_path)
- self.string_cacher.set_cached(cache_key, value.hex())
- return value
+ result = self.file_hasher.hash_file(file_path)
+ self.string_cacher.set_cached(cache_key, result.to_string())
+ return result
Copilot AI commented Mar 5, 2026

CachedFileHasher cache key does not include the hashing algorithm / hasher identity, so two CachedFileHasher instances with different algorithms but the same backing StringCacher can return an incorrect cached result (e.g., md5 hasher receiving a sha256 ContentHash). Include something like the underlying hasher's algorithm/method (or a stable hasher_id) in the cache_key to prevent cross-algorithm collisions.

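One way to implement the suggestion is to qualify the cache key with the algorithm name. A sketch under assumptions: the key layout mirrors the diff above, and the algorithm string is presumed obtainable from the wrapped hasher (e.g. the algorithm passed to BasicFileHasher in the test below); orcapod's actual attribute name may differ.

```python
import os


def make_cache_key(file_path: str, algorithm: str) -> str:
    # Including the algorithm prevents two CachedFileHashers that share a
    # StringCacher from returning each other's results for the same file.
    stat = os.stat(file_path)
    return f"file:{algorithm}:{file_path}:{stat.st_mtime_ns}:{stat.st_size}"
```

With this key, a sha256-backed and an md5-backed hasher sharing one cacher write to distinct entries, and the cross-algorithm collision below disappears.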
Comment on lines +221 to +243
def test_different_algorithms_share_cache_key(self, tmp_path):
    """Two CachedFileHashers with different algorithms but the same
    path+mtime+size share the same cache key, so the second hasher gets
    the first's cached result.

    This documents a known limitation: the cache key does not include
    the algorithm.
    """
    f = tmp_path / "file.txt"
    f.write_text("test content")

    cacher = InMemoryCacher()

    sha_inner = BasicFileHasher(algorithm="sha256")
    sha_cached = CachedFileHasher(file_hasher=sha_inner, string_cacher=cacher)
    sha_result = sha_cached.hash_file(f)

    md5_inner = BasicFileHasher(algorithm="md5")
    md5_cached = CachedFileHasher(file_hasher=md5_inner, string_cacher=cacher)
    md5_result = md5_cached.hash_file(f)

    # Same cache key (same file, same mtime+size), so md5 gets the sha256 result
    assert md5_result.method == "sha256"
    assert md5_result.digest == sha_result.digest

Copilot AI commented Mar 5, 2026

The cache-key collision across different algorithms is being asserted as expected behavior here, but it’s a correctness bug: a CachedFileHasher configured for md5 should never return a sha256 ContentHash due to a shared cache. Once the cache key includes the algorithm/hasher identity, update this test to assert that the two hashers produce different methods/digests and do not share entries.
