Conversation
|
@staru09 is attempting to deploy a commit to the Moss Team Team on Vercel. A member of the Team first needs to authorize it. |
|
@HarshaNalluru , would you please be able review ? |
There was a problem hiding this comment.
Pull request overview
Adds a Haystack cookbook integration example so Moss can be used as a lightweight document store + retriever in Haystack RAG-style pipelines.
Changes:
- Introduces a
MossDocumentStore+MossRetrieverhelper implementation for Haystack usage. - Adds runnable example and a live-platform test script for end-to-end verification.
- Documents installation, setup, and usage in a new cookbook README.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| examples/cookbook/haystack/moss_haystack.py | Implements Haystack-facing document store/retriever wrappers plus metadata + filter conversion helpers |
| examples/cookbook/haystack/example_usage.py | Runnable demo that creates an index, writes sample docs from faqs.json, queries, and deletes the index |
| examples/cookbook/haystack/test_live.py | Live test script that exercises write/count/upsert/filter/query/delete/cleanup against the Moss platform |
| examples/cookbook/haystack/README.md | Setup and usage documentation for the new cookbook example |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| """Moss document store for Haystack. | ||
|
|
||
| Provides write_documents, count_documents, delete_documents, | ||
| and filter_documents for use in Haystack pipelines. |
There was a problem hiding this comment.
MossDocumentStore is presented as a Haystack document store, but it does not implement/extend Haystack’s DocumentStore type (it’s a plain class). This makes it incompatible with any Haystack components/utilities that expect a DocumentStore instance and undermines the “document store integration” deliverable. Consider subclassing/implementing haystack.document_stores.types.DocumentStore (and required methods) or clearly renaming/documenting this as a minimal helper not meant to satisfy the DocumentStore contract.
| """Moss document store for Haystack. | |
| Provides write_documents, count_documents, delete_documents, | |
| and filter_documents for use in Haystack pipelines. | |
| """Minimal Moss-backed helper for Haystack-style document operations. | |
| This is **not** a full implementation of ``haystack.document_stores.types.DocumentStore``; | |
| it only provides a small subset of common methods (such as ``write_documents``, | |
| ``count_documents``, ``delete_documents``, and ``filter_documents``) for use in | |
| custom Haystack components or pipelines that interact with Moss directly. |
| def _serialize_metadata(meta: Optional[dict]) -> Optional[dict]: | ||
| """Convert arbitrary-typed metadata to Moss string-only metadata.""" | ||
| if not meta: | ||
| return None | ||
| result = {} | ||
| needs_json = False | ||
| for k, v in meta.items(): | ||
| if isinstance(v, str): | ||
| result[k] = v | ||
| else: | ||
| result[k] = json.dumps(v) | ||
| needs_json = True | ||
| if needs_json: | ||
| result["__moss_json_meta__"] = "true" | ||
| return result |
There was a problem hiding this comment.
_serialize_metadata injects a reserved key __moss_json_meta__ into user metadata when any value is JSON-encoded. If the input Document.meta already contains this key, the value will be silently overwritten, losing user data and potentially breaking deserialization. Use a collision-safe approach (e.g., namespace under a single reserved sub-dict, or detect/escape pre-existing keys) to avoid overwriting user-provided metadata.
There was a problem hiding this comment.
It would be an edge case but covered this one
| if policy == DuplicatePolicy.SKIP: | ||
| existing = _run_async(self.client.get_docs(self.index_name)) | ||
| existing_ids = {doc.id for doc in existing} | ||
| docs = [d for d in docs if d.id not in existing_ids] |
There was a problem hiding this comment.
write_documents(..., policy=DuplicatePolicy.SKIP) calls get_docs() with no options and builds a full existing_ids set. This forces a full index scan/download of all documents just to dedupe IDs, which can be very expensive for non-trivial indexes. If Moss supports fetching by IDs or listing only IDs, use that; otherwise consider documenting that SKIP is O(N) or dropping/limiting SKIP support in this example.
| def filter_documents(self, filters: Optional[dict] = None) -> list[Document]: | ||
| """Retrieve documents, optionally filtered. | ||
|
|
||
| Without filters: fetches all documents. | ||
| With filters: loads index and runs a broad query with filters. | ||
| """ | ||
| if filters is None: | ||
| docs = _run_async(self.client.get_docs(self.index_name)) | ||
| return [_moss_doc_to_haystack(doc) for doc in docs] | ||
|
|
||
| self._ensure_loaded() | ||
| moss_filters = _convert_haystack_filters(filters) | ||
| results = _run_async( | ||
| self.client.query( | ||
| self.index_name, | ||
| "", | ||
| QueryOptions(top_k=1000, alpha=self.alpha, filter=moss_filters), | ||
| ) | ||
| ) | ||
| return [_moss_doc_to_haystack(doc, score=doc.score) for doc in results.docs] |
There was a problem hiding this comment.
filter_documents(filters=...) uses query() with an empty query string and a hard-coded top_k=1000. This cannot guarantee returning all documents that match the filter (anything beyond the top_k cutoff is dropped, and empty-query semantics may be undefined), so callers may get incomplete results. Consider implementing filtered retrieval by fetching all docs and filtering client-side, adding an explicit top_k/limit parameter, or raising NotImplementedError for filtered listing until a scan-by-filter API exists.
|
|
||
| | File | Description | | ||
| |------|-------------| | ||
| | `moss_haystack.py` | MossDocumentStore + MossRetriever — copy this into your project | | ||
| | `example_usage.py` | Runnable demo with FAQ data | |
There was a problem hiding this comment.
The issue/PR description indicates a Haystack integration under integrations/haystack/ (and a MossDocumentStore/MossRetriever intended for reuse), but this PR only adds a cookbook example that instructs users to copy moss_haystack.py into their project. Either move the integration to the expected integrations/haystack/ location (and package it), or update the PR/issue linkage/docs so it’s clear this is an example-only snippet rather than a shipped integration.
There was a problem hiding this comment.
updated with verbose description
| def report(name, success, detail=""): | ||
| global passed, failed | ||
| status = "PASS" if success else "FAIL" | ||
| if success: | ||
| passed += 1 | ||
| else: | ||
| failed += 1 | ||
| msg = f" [{status}] {name}" | ||
| if detail: | ||
| msg += f" -- {detail}" | ||
| print(msg) | ||
|
|
||
|
|
||
| def run_tests(): | ||
| if not PROJECT_ID or not PROJECT_KEY: | ||
| print("ERROR: MOSS_PROJECT_ID and MOSS_PROJECT_KEY must be set.") | ||
| sys.exit(1) |
There was a problem hiding this comment.
test_live.py is a standalone script with manual PASS/FAIL reporting and hard sys.exit usage, so it won’t be discovered/executed by the repo’s existing Python test runners (CI currently runs unittest for the langchain cookbook). If this PR is meant to satisfy the checklist item “added tests”, consider converting these checks into unittest/pytest tests that can be run in CI and auto-skip when MOSS_PROJECT_ID/KEY are not set.
| PROJECT_ID = os.getenv("MOSS_PROJECT_ID") | ||
| PROJECT_KEY = os.getenv("MOSS_PROJECT_KEY") | ||
| TEST_INDEX = "haystack-live-test" | ||
|
|
There was a problem hiding this comment.
TEST_INDEX is a fixed name and the script deletes the index during cleanup. If a user already has an index with this name (or multiple developers run the script concurrently), this can overwrite/delete real data. Prefer generating a unique index name per run (timestamp/UUID) and printing it for debugging, similar to examples/python/comprehensive_sample.py’s timestamped index naming.
| INDEX_NAME = "haystack-faq-demo" | ||
|
|
There was a problem hiding this comment.
INDEX_NAME is hard-coded and the script unconditionally deletes that index at the end. If the user already has an index with the same name, running this example will delete it. Consider generating a unique index name per run (timestamp/UUID) or adding a clear guard/confirmation around deletion.
There was a problem hiding this comment.
Can we ignore this since it's just an example index name?
|
Is this ready for review ? were the copilot comments resolved ? |
628291f to
cbcef90
Compare
yes please review now |
| # Load FAQ data | ||
| faqs_path = os.path.join( | ||
| os.path.dirname(__file__), "..", "..", "python", "faqs.json" | ||
| ) |
There was a problem hiding this comment.
It seems to work on my local, is it throwing an error in your machine?
| store.load_index() | ||
|
|
||
| # Create retriever and search | ||
| retriever = MossRetriever(document_store=store, top_k=3) |
There was a problem hiding this comment.
This looks complicated, can we do MossRetriever.load_index and MossRetriever.run
Keep the implementation as simple as posssible
| print("\n--- Retrieval: 'payment methods' ---") | ||
| result = retriever.run(query="What payment methods do you accept?") | ||
| for doc in result["documents"]: | ||
| print(f" [{doc.score:.2f}] {doc.content[:100]}...") |
There was a problem hiding this comment.
what is the role of haystack here ? how is it adding value ?
| # --- Filter Conversion (for MossRetriever server-side queries) --- | ||
|
|
||
|
|
||
| def _convert_haystack_filters(filters: dict) -> dict: |
There was a problem hiding this comment.
lets remove the filters for now, it maybe too complicated for now
cbcef90 to
64de14e
Compare
64de14e to
5b8ce80
Compare
| options = ( | ||
| MutationOptions(upsert=True) | ||
| if policy == DuplicatePolicy.OVERWRITE | ||
| else None | ||
| ) | ||
| _run_async(self.client.add_docs(self.index_name, docs, options)) |
There was a problem hiding this comment.
🚩 DuplicatePolicy.NONE behavior depends on Moss backend
When policy is DuplicatePolicy.NONE (which per Haystack's contract should raise on duplicate document IDs), the code at lines 184-189 sets options = None (not OVERWRITE, not SKIP) and calls add_docs without upsert. Whether this correctly raises an error on duplicates depends entirely on the Moss backend's default behavior for add_docs without mutation options. If the backend silently ignores or overwrites duplicates, this would violate the Haystack DocumentStore protocol. Not flagged as a bug because the Moss SDK's default behavior is not determinable from this codebase alone.
Was this helpful? React with 👍 or 👎 to provide feedback.
There was a problem hiding this comment.
🚩 Live tests all skip without credentials — no mock-based unit tests
Unlike the langchain cookbook which has mock-based unit tests in examples/cookbook/langchain/test_integration.py that run without credentials, the haystack cookbook only has test_live.py with live integration tests that are skipped when MOSS_PROJECT_ID/MOSS_PROJECT_KEY are not set. This means CI without credentials would run zero tests for this code. The metadata serialization round-trip logic in particular would benefit from unit tests with mocks, especially given the edge cases in sentinel handling. CONTRIBUTING.md requires 'add tests' for new code — live tests satisfy the letter of that rule, but mock tests would provide better CI coverage.
Was this helpful? React with 👍 or 👎 to provide feedback.
362920b to
2abff20
Compare
| if isinstance(v, str): | ||
| result[k] = v |
There was a problem hiding this comment.
🟡 Metadata round-trip corrupts string values that start with the typed prefix
_serialize_metadata stores plain strings as-is (line 47), but _deserialize_metadata (line 64) treats any string starting with __moss_typed__: as a JSON-encoded non-string value. If a user stores a string metadata value like "__moss_typed__:42", on serialization it's stored unchanged, but on deserialization the prefix is stripped and json.loads("42") returns the integer 42 — silently corrupting the value from str to int. The fix is to also encode string values that happen to start with the prefix, e.g., wrapping them with json.dumps so they survive the round-trip.
| if isinstance(v, str): | |
| result[k] = v | |
| if isinstance(v, str) and not v.startswith(_MOSS_TYPED_PREFIX): | |
| result[k] = v |
Was this helpful? React with 👍 or 👎 to provide feedback.
There was a problem hiding this comment.
it's an edge case, can ignore this IMO
a69858b to
036fb11
Compare
| ersion = 1 | ||
| revision = 1 |
There was a problem hiding this comment.
Why was uv lock changed ?
There was a problem hiding this comment.
I think it was pushed by mistake
removed it
| @@ -0,0 +1,140 @@ | |||
| # Haystack + Moss Cookbook Example | |||
|
|
|||
| Use [Moss](https://moss.dev) as a document store and retriever in [Haystack](https://haystack.deepset.ai/) RAG pipelines. Moss provides sub-10ms semantic search, Haystack orchestrates the retrieval-to-generation pipeline. | |||
There was a problem hiding this comment.
can you please say "Use Moss as realtime semantic search in ...."
| ## Installation | ||
|
|
||
| ```bash | ||
| pip install haystack-ai moss python-dotenv |
There was a problem hiding this comment.
can you please add pyproject.toml
a5b176b to
bad8119
Compare
| self._index_loaded = False | ||
|
|
||
| def filter_documents(self, filters: Optional[dict] = None) -> list[Document]: | ||
| raise NotImplementedError("filter_documents is not supported.") |
There was a problem hiding this comment.
MossDocumentStore subclasses haystack.document_stores.types.DocumentStore, which requires filter_documents as part of the protocol. Raising NotImplementedError here breaks that contract — any Haystack component or utility that probes the store (e.g. DocumentWriter with duplicate detection, evaluation helpers, writers that call filter_documents(filters=None) to enumerate existing docs) will crash.
Pick one of:
- Implement it — at minimum, pass through to
client.get_docs(...)whenfilters is None, and raise a clearer error only when non-trivial filters are passed. Supporting Moss metadata filters ($eq,$and,$in) viaQueryOptions.filteris a natural follow-up. - Stop subclassing
DocumentStore— downgrade to a plain helper class (as the earlier Copilot review comment originally suggested) and update the README/docstring to say this is not a fullDocumentStoreimplementation.
Having it both ways — declaring the interface but refusing to honor it — is the worst outcome because it fails only at pipeline-run time with a confusing stack trace.
Pull Request Checklist
Please ensure that your PR meets the following requirements:
Description
Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context.
Fixes #79
Type of Change