
Fix hash_file return type and wire file_hasher through converters#74

Merged
eywalker merged 6 commits into dev from claude/consolidate-file-hashing-JBRIT on Mar 5, 2026
Merged

Fix hash_file return type and wire file_hasher through converters#74
eywalker merged 6 commits intodevfrom
claude/consolidate-file-hashing-JBRIT

Conversation

@eywalker (Contributor) commented Mar 5, 2026

Summary

This PR fixes a pre-existing bug where hash_utils.hash_file() returned raw bytes instead of a ContentHash object, and ensures the file hasher is properly wired through the semantic type system so that file content hashing is consistent across all code paths.

Key Changes

  • Fixed hash_file() return type: Changed hash_utils.hash_file() to return ContentHash instead of raw bytes. All algorithm branches (sha256, md5, xxh64, crc32, hash_path) now wrap their digest in a ContentHash object with the algorithm name as the method.

  • Wired file_hasher through PathStructConverter: Updated PathStructConverter.__init__() to accept a FileContentHasherProtocol dependency and implemented hash_struct_dict() method to hash file content using the injected file hasher.

  • Fixed CachedFileHasher cache key and round-trip:

    • Cache key now includes file mtime_ns and size to properly invalidate on content changes
    • Cache stores the full ContentHash.to_string() format ("method:hex_digest")
    • Cache retrieval uses ContentHash.from_string() to properly restore the object
  • Updated context configuration: Modified v0.1.json to instantiate BasicFileHasher and inject it into PathStructConverter via dependency reference.

  • Added comprehensive test coverage:

    • test_file_hashers.py: Tests verifying hash_file() returns ContentHash, BasicFileHasher behavior, and CachedFileHasher caching with proper round-tripping
    • test_file_hashing_consistency.py: Integration tests ensuring file hashing is consistent across both the Arrow hasher path (via PathStructConverter) and semantic hasher path (via PathContentHandler)
    • Updated test_path_struct_converter.py: Enabled previously commented-out tests and added new tests for hash_struct_dict() behavior

Implementation Details

  • ContentHash now serves as the canonical return type for all file hashing operations, ensuring type safety and consistency
  • The cache key includes file metadata (mtime_ns, size) to automatically invalidate when file content changes
  • Both SemanticArrowHasher and BaseSemanticHasher now use the same FileContentHasherProtocol instance, guaranteeing identical hashes for identical file content regardless of entry point

https://claude.ai/code/session_01PBeavr3pTFu5sPhWUxNx2r

claude added 6 commits March 5, 2026 11:17

https://claude.ai/code/session_01PBeavr3pTFu5sPhWUxNx2r

1. Inject file_hasher into PathStructConverter so both the Arrow hasher path (via SemanticHashingVisitor → hash_struct_dict) and the semantic hasher path (via PathContentHandler) delegate to the same FileContentHasherProtocol instance.

   Changes:
   - PathStructConverter now requires file_hasher in __init__
   - Implement hash_struct_dict() delegating to the injected file_hasher
   - Reorder v0.1.json so file_hasher is instantiated before semantic_registry (ref resolution is sequential)
   - Add file_hasher ref to PathStructConverter config in v0.1.json
   - Update versioned_hashers.py factory to pass BasicFileHasher
   - Update and uncomment PathStructConverter tests
   - Add integration tests for cross-path hash consistency

2. …ct_dict

   BasicFileHasher.hash_file() returns raw bytes despite the protocol declaring ContentHash. Handle both cases defensively in PathStructConverter.hash_struct_dict() and in the integration tests.

3. hash_utils.hash_file() declared -> bytes, but the protocol and all callers expected ContentHash. It now properly returns ContentHash(method=algorithm, digest=...).

   - Update hash_utils.hash_file to return ContentHash
   - Fix CachedFileHasher cache hit/miss to use .digest
   - Update generate_file_hashes.py to use .to_hex()
   - Remove defensive hasattr workarounds added in the previous commit

4. Use ContentHash.to_string() / from_string() so the cached value preserves both the method name and digest bytes, instead of losing the method on cache hit.

5. Add tests that would have caught the pre-existing bugs:
   - hash_file() returning raw bytes instead of ContentHash
   - CachedFileHasher losing the method name on cache hit
   - hash_path algorithm failing on Path objects (missing str())

   Test coverage includes:
   - hash_utils.hash_file returns ContentHash for all algorithms
   - BasicFileHasher return type, determinism, content sensitivity
   - CachedFileHasher: cache miss delegates, cache hit preserves method and digest, cache stores the full to_string() format, clear_cache forces rehash, stale cache behavior documented

   Also fixes the hash_path algorithm to str()-ify Path before encoding.

6. Include st_mtime_ns and st_size in the cache key so that modified files automatically produce a cache miss instead of returning stale hashes.

   Update tests to verify the new auto-invalidation behavior, and fix a flaky test by using different-length content to guarantee size changes.
Copilot AI review requested due to automatic review settings March 5, 2026 11:56
@codecov-commenter commented Mar 5, 2026

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 87.09677% with 4 lines in your changes missing coverage. Please review.

Files with missing lines                                   Patch %   Lines
src/orcapod/hashing/versioned_hashers.py                     0.00%   3 Missing ⚠️
...capod/semantic_types/semantic_struct_converters.py       92.85%   1 Missing ⚠️


@eywalker eywalker merged commit 2b5e526 into dev Mar 5, 2026
6 checks passed
Copilot AI left a comment (Contributor)
Pull request overview

This PR standardizes file hashing around ContentHash, fixes the hash_utils.hash_file() return type, and wires a shared file-hasher dependency through the semantic type system so file content hashing is consistent across Arrow-table and semantic-object hashing paths.

Changes:

  • Update hash_utils.hash_file() and file hasher implementations to return ContentHash (with method + digest), including cache round-tripping via ContentHash.to_string()/from_string().
  • Inject a FileContentHasherProtocol into PathStructConverter and implement hash_struct_dict() to hash file content for path semantic structs.
  • Add/enable tests covering return types, caching behavior, and cross-path hashing consistency; update context JSON wiring for the file hasher.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

Summary per file:

  • src/orcapod/hashing/hash_utils.py: Changes hash_file() to return ContentHash and wraps all algorithm branches accordingly.
  • src/orcapod/hashing/file_hashers.py: Updates CachedFileHasher to include mtime/size in the cache key and to store/restore full ContentHash strings.
  • src/orcapod/semantic_types/semantic_struct_converters.py: Injects the file hasher into PathStructConverter and adds content-based hash_struct_dict().
  • src/orcapod/hashing/versioned_hashers.py: Wires a BasicFileHasher into PathStructConverter for the versioned Arrow hasher factory.
  • src/orcapod/contexts/data/v0.1.json: Updates context configuration to instantiate and inject the file_hasher into PathStructConverter.
  • tests/test_hashing/test_file_hashers.py: Adds unit tests for hash_file(), BasicFileHasher, and CachedFileHasher behavior/round-tripping.
  • tests/test_hashing/test_file_hashing_consistency.py: Adds integration tests ensuring Arrow-path and semantic-path hashing match for identical file content.
  • tests/test_semantic_types/test_path_struct_converter.py: Enables/expands tests for PathStructConverter.hash_struct_dict() behavior.
  • tests/test_hashing/generate_file_hashes.py: Updates the sample-hash generation script to serialize hashes via ContentHash.to_hex().


Comment on lines 59 to +66
  if not Path(file_path).is_file():
      raise FileNotFoundError(f"The file {file_path} does not exist")

  # Hash the path string itself rather than file content
  if algorithm == "hash_path":
      hasher = hashlib.sha256()
-     hasher.update(file_path.encode("utf-8"))
-     return hasher.digest()
+     hasher.update(str(file_path).encode("utf-8"))
+     return ContentHash(method=algorithm, digest=hasher.digest())
Copilot AI commented Mar 5, 2026

hash_file() checks Path(file_path).is_file() before handling algorithm=='hash_path'. That makes the 'hash_path' mode unusable for non-existent paths (even though it’s supposed to hash the path string itself) and it also raises a misleading FileNotFoundError for directories. Consider moving the 'hash_path' branch before the is_file() check, and splitting validation so directories raise IsADirectoryError (or a clearer error) while missing paths raise FileNotFoundError.

  if path.is_dir():
      raise IsADirectoryError(f"Path is a directory: {path}")

  content_hash = self._file_hasher.hash_file(path)
Copilot AI commented Mar 5, 2026

PathStructConverter.hash_struct_dict() delegates to an injected file_hasher, but the prefix formatting is hard-coded to 'sha256' via _format_hash_string() and the docstring says 'path:sha256:...'. If a non-sha256 file_hasher is injected, add_prefix will lie about the algorithm. Either enforce/validate that file_hasher is sha256 here, or build the prefix from the returned ContentHash.method (and adjust _format_hash_string accordingly).

Suggested change:

  content_hash = self._file_hasher.hash_file(path)
+ if add_prefix and content_hash.method != "sha256":
+     raise ValueError(
+         f"add_prefix=True requires a SHA-256 file hasher, but got '{content_hash.method}'"
+     )

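An alternative to the validation suggested above is to derive the prefix from the returned hash itself, so the label can never disagree with the algorithm that produced the digest. A sketch: the "path:<method>:<hex>" layout follows the docstring quoted in the comment, while the function name and signature are hypothetical stand-ins for _format_hash_string().

```python
# Hypothetical helper: build the prefix from ContentHash.method instead of
# hard-coding "sha256", so a non-default hasher cannot mislabel its hashes.
def format_path_hash(content_hash, add_prefix: bool = True) -> str:
    hex_digest = content_hash.digest.hex()
    return f"path:{content_hash.method}:{hex_digest}" if add_prefix else hex_digest
```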
Comment on lines +40 to +48
+ stat = os.stat(file_path)
+ cache_key = f"file:{file_path}:{stat.st_mtime_ns}:{stat.st_size}"
  cached_value = self.string_cacher.get_cached(cache_key)
  if cached_value is not None:
-     return bytes.fromhex(cached_value)
+     return ContentHash.from_string(cached_value)

- value = self.file_hasher.hash_file(file_path)
- self.string_cacher.set_cached(cache_key, value.hex())
- return value
+ result = self.file_hasher.hash_file(file_path)
+ self.string_cacher.set_cached(cache_key, result.to_string())
+ return result
Copilot AI commented Mar 5, 2026

CachedFileHasher cache key does not include the hashing algorithm / hasher identity, so two CachedFileHasher instances with different algorithms but the same backing StringCacher can return an incorrect cached result (e.g., md5 hasher receiving a sha256 ContentHash). Include something like the underlying hasher's algorithm/method (or a stable hasher_id) in the cache_key to prevent cross-algorithm collisions.

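One way to implement the suggestion is to qualify the cache key with the algorithm name. A sketch under assumptions: the key layout mirrors the diff above, and the algorithm string is presumed obtainable from the wrapped hasher (e.g. the algorithm passed to BasicFileHasher in the test below); orcapod's actual attribute name may differ.

```python
import os


def make_cache_key(file_path: str, algorithm: str) -> str:
    # Including the algorithm prevents two CachedFileHashers that share a
    # StringCacher from returning each other's results for the same file.
    stat = os.stat(file_path)
    return f"file:{algorithm}:{file_path}:{stat.st_mtime_ns}:{stat.st_size}"
```

With this key, a sha256-backed and an md5-backed hasher sharing one cacher write to distinct entries, and the cross-algorithm collision below disappears.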
Comment on lines +221 to +243
def test_different_algorithms_share_cache_key(self, tmp_path):
    """Two CachedFileHashers with different algorithms but the same
    path+mtime+size share the same cache key, so the second hasher gets
    the first's cached result.

    This documents a known limitation: the cache key does not include
    the algorithm.
    """
    f = tmp_path / "file.txt"
    f.write_text("test content")

    cacher = InMemoryCacher()

    sha_inner = BasicFileHasher(algorithm="sha256")
    sha_cached = CachedFileHasher(file_hasher=sha_inner, string_cacher=cacher)
    sha_result = sha_cached.hash_file(f)

    md5_inner = BasicFileHasher(algorithm="md5")
    md5_cached = CachedFileHasher(file_hasher=md5_inner, string_cacher=cacher)
    md5_result = md5_cached.hash_file(f)

    # Same cache key (same file, same mtime+size), so md5 gets the sha256 result
    assert md5_result.method == "sha256"
    assert md5_result.digest == sha_result.digest

Copilot AI commented Mar 5, 2026

The cache-key collision across different algorithms is being asserted as expected behavior here, but it’s a correctness bug: a CachedFileHasher configured for md5 should never return a sha256 ContentHash due to a shared cache. Once the cache key includes the algorithm/hasher identity, update this test to assert that the two hashers produce different methods/digests and do not share entries.
