Skip to content

Haystack integration#89

Open
staru09 wants to merge 4 commits intousemoss:mainfrom
staru09:haystack_integration
Open

Haystack integration#89
staru09 wants to merge 4 commits intousemoss:mainfrom
staru09:haystack_integration

Conversation

@staru09
Copy link
Copy Markdown
Contributor

@staru09 staru09 commented Mar 29, 2026

Pull Request Checklist

Please ensure that your PR meets the following requirements:

  • I have read the CONTRIBUTING guide.
  • I have updated the documentation (if applicable).
  • My code follows the style guidelines of this project.
  • I have performed a self-review of my own code.
  • I have added tests that prove my fix is effective or that my feature works.
  • New and existing unit tests pass locally with my changes.

Description

Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context.

Fixes #79

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Open with Devin

@vercel
Copy link
Copy Markdown
Contributor

vercel bot commented Mar 29, 2026

@staru09 is attempting to deploy a commit to the Moss Team Team on Vercel.

A member of the Team first needs to authorize it.

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Mar 29, 2026

CLA assistant check
All committers have signed the CLA.

@yatharthk2
Copy link
Copy Markdown
Collaborator

@HarshaNalluru , would you please be able review ?

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a Haystack cookbook integration example so Moss can be used as a lightweight document store + retriever in Haystack RAG-style pipelines.

Changes:

  • Introduces a MossDocumentStore + MossRetriever helper implementation for Haystack usage.
  • Adds runnable example and a live-platform test script for end-to-end verification.
  • Documents installation, setup, and usage in a new cookbook README.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.

File Description
examples/cookbook/haystack/moss_haystack.py Implements Haystack-facing document store/retriever wrappers plus metadata + filter conversion helpers
examples/cookbook/haystack/example_usage.py Runnable demo that creates an index, writes sample docs from faqs.json, queries, and deletes the index
examples/cookbook/haystack/test_live.py Live test script that exercises write/count/upsert/filter/query/delete/cleanup against the Moss platform
examples/cookbook/haystack/README.md Setup and usage documentation for the new cookbook example

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +88 to +91
"""Moss document store for Haystack.

Provides write_documents, count_documents, delete_documents,
and filter_documents for use in Haystack pipelines.
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MossDocumentStore is presented as a Haystack document store, but it does not implement/extend Haystack’s DocumentStore type (it’s a plain class). This makes it incompatible with any Haystack components/utilities that expect a DocumentStore instance and undermines the “document store integration” deliverable. Consider subclassing/implementing haystack.document_stores.types.DocumentStore (and required methods) or clearly renaming/documenting this as a minimal helper not meant to satisfy the DocumentStore contract.

Suggested change
"""Moss document store for Haystack.
Provides write_documents, count_documents, delete_documents,
and filter_documents for use in Haystack pipelines.
"""Minimal Moss-backed helper for Haystack-style document operations.
This is **not** a full implementation of ``haystack.document_stores.types.DocumentStore``;
it only provides a small subset of common methods (such as ``write_documents``,
``count_documents``, ``delete_documents``, and ``filter_documents``) for use in
custom Haystack components or pipelines that interact with Moss directly.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed this one

Comment on lines +31 to +45
def _serialize_metadata(meta: Optional[dict]) -> Optional[dict]:
"""Convert arbitrary-typed metadata to Moss string-only metadata."""
if not meta:
return None
result = {}
needs_json = False
for k, v in meta.items():
if isinstance(v, str):
result[k] = v
else:
result[k] = json.dumps(v)
needs_json = True
if needs_json:
result["__moss_json_meta__"] = "true"
return result
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_serialize_metadata injects a reserved key __moss_json_meta__ into user metadata when any value is JSON-encoded. If the input Document.meta already contains this key, the value will be silently overwritten, losing user data and potentially breaking deserialization. Use a collision-safe approach (e.g., namespace under a single reserved sub-dict, or detect/escape pre-existing keys) to avoid overwriting user-provided metadata.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be an edge case but covered this one

Comment on lines +146 to +149
if policy == DuplicatePolicy.SKIP:
existing = _run_async(self.client.get_docs(self.index_name))
existing_ids = {doc.id for doc in existing}
docs = [d for d in docs if d.id not in existing_ids]
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

write_documents(..., policy=DuplicatePolicy.SKIP) calls get_docs() with no options and builds a full existing_ids set. This forces a full index scan/download of all documents just to dedupe IDs, which can be very expensive for non-trivial indexes. If Moss supports fetching by IDs or listing only IDs, use that; otherwise consider documenting that SKIP is O(N) or dropping/limiting SKIP support in this example.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed this one

Comment on lines +165 to +184
def filter_documents(self, filters: Optional[dict] = None) -> list[Document]:
"""Retrieve documents, optionally filtered.

Without filters: fetches all documents.
With filters: loads index and runs a broad query with filters.
"""
if filters is None:
docs = _run_async(self.client.get_docs(self.index_name))
return [_moss_doc_to_haystack(doc) for doc in docs]

self._ensure_loaded()
moss_filters = _convert_haystack_filters(filters)
results = _run_async(
self.client.query(
self.index_name,
"",
QueryOptions(top_k=1000, alpha=self.alpha, filter=moss_filters),
)
)
return [_moss_doc_to_haystack(doc, score=doc.score) for doc in results.docs]
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

filter_documents(filters=...) uses query() with an empty query string and a hard-coded top_k=1000. This cannot guarantee returning all documents that match the filter (anything beyond the top_k cutoff is dropped, and empty-query semantics may be undefined), so callers may get incomplete results. Consider implementing filtered retrieval by fetching all docs and filtering client-side, adding an explicit top_k/limit parameter, or raising NotImplementedError for filtered listing until a scan-by-filter API exists.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to top_k

Comment thread examples/cookbook/haystack/README.md Outdated
Comment on lines +112 to +116

| File | Description |
|------|-------------|
| `moss_haystack.py` | MossDocumentStore + MossRetriever — copy this into your project |
| `example_usage.py` | Runnable demo with FAQ data |
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue/PR description indicates a Haystack integration under integrations/haystack/ (and a MossDocumentStore/MossRetriever intended for reuse), but this PR only adds a cookbook example that instructs users to copy moss_haystack.py into their project. Either move the integration to the expected integrations/haystack/ location (and package it), or update the PR/issue linkage/docs so it’s clear this is an example-only snippet rather than a shipped integration.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated with verbose description

Comment thread examples/cookbook/haystack/test_live.py Outdated
Comment on lines +22 to +38
def report(name, success, detail=""):
global passed, failed
status = "PASS" if success else "FAIL"
if success:
passed += 1
else:
failed += 1
msg = f" [{status}] {name}"
if detail:
msg += f" -- {detail}"
print(msg)


def run_tests():
if not PROJECT_ID or not PROJECT_KEY:
print("ERROR: MOSS_PROJECT_ID and MOSS_PROJECT_KEY must be set.")
sys.exit(1)
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_live.py is a standalone script with manual PASS/FAIL reporting and hard sys.exit usage, so it won’t be discovered/executed by the repo’s existing Python test runners (CI currently runs unittest for the langchain cookbook). If this PR is meant to satisfy the checklist item “added tests”, consider converting these checks into unittest/pytest tests that can be run in CI and auto-skip when MOSS_PROJECT_ID/KEY are not set.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines +14 to +17
PROJECT_ID = os.getenv("MOSS_PROJECT_ID")
PROJECT_KEY = os.getenv("MOSS_PROJECT_KEY")
TEST_INDEX = "haystack-live-test"

Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TEST_INDEX is a fixed name and the script deletes the index during cleanup. If a user already has an index with this name (or multiple developers run the script concurrently), this can overwrite/delete real data. Prefer generating a unique index name per run (timestamp/UUID) and printing it for debugging, similar to examples/python/comprehensive_sample.py’s timestamped index naming.

Copilot uses AI. Check for mistakes.
Comment on lines +14 to +15
INDEX_NAME = "haystack-faq-demo"

Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

INDEX_NAME is hard-coded and the script unconditionally deletes that index at the end. If the user already has an index with the same name, running this example will delete it. Consider generating a unique index name per run (timestamp/UUID) or adding a clear guard/confirmation around deletion.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we ignore this since it's just an example index name?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes please ignore

@yatharthk2
Copy link
Copy Markdown
Collaborator

Is this ready for review ? were the copilot comments resolved ?

@staru09 staru09 force-pushed the haystack_integration branch from 628291f to cbcef90 Compare April 2, 2026 07:50
@staru09
Copy link
Copy Markdown
Contributor Author

staru09 commented Apr 2, 2026

Is this ready for review ? were the copilot comments resolved ?

yes please review now

# Load FAQ data
faqs_path = os.path.join(
os.path.dirname(__file__), "..", "..", "python", "faqs.json"
)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this path wont work

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to work on my local, is it throwing an error in your machine?

store.load_index()

# Create retriever and search
retriever = MossRetriever(document_store=store, top_k=3)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks complicated, can we do MossRetriever.load_index and MossRetriever.run

Keep the implementation as simple as posssible

print("\n--- Retrieval: 'payment methods' ---")
result = retriever.run(query="What payment methods do you accept?")
for doc in result["documents"]:
print(f" [{doc.score:.2f}] {doc.content[:100]}...")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the role of haystack here ? how is it adding value ?

# --- Filter Conversion (for MossRetriever server-side queries) ---


def _convert_haystack_filters(filters: dict) -> dict:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets remove the filters for now, it maybe too complicated for now

@staru09 staru09 force-pushed the haystack_integration branch from cbcef90 to 64de14e Compare April 7, 2026 10:15
devin-ai-integration[bot]

This comment was marked as resolved.

@staru09 staru09 force-pushed the haystack_integration branch from 64de14e to 5b8ce80 Compare April 8, 2026 08:40
Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 2 new potential issues.

View 4 additional findings in Devin Review.

Open in Devin Review

Comment thread examples/cookbook/haystack/moss_haystack.py
Comment on lines +184 to +189
options = (
MutationOptions(upsert=True)
if policy == DuplicatePolicy.OVERWRITE
else None
)
_run_async(self.client.add_docs(self.index_name, docs, options))
Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration bot Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚩 DuplicatePolicy.NONE behavior depends on Moss backend

When policy is DuplicatePolicy.NONE (which per Haystack's contract should raise on duplicate document IDs), the code at lines 184-189 sets options = None (not OVERWRITE, not SKIP) and calls add_docs without upsert. Whether this correctly raises an error on duplicates depends entirely on the Moss backend's default behavior for add_docs without mutation options. If the backend silently ignores or overwrites duplicates, this would violate the Haystack DocumentStore protocol. Not flagged as a bug because the Moss SDK's default behavior is not determinable from this codebase alone.

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 3 new potential issues.

View 3 additional findings in Devin Review.

Open in Devin Review

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚩 Live tests all skip without credentials — no mock-based unit tests

Unlike the langchain cookbook which has mock-based unit tests in examples/cookbook/langchain/test_integration.py that run without credentials, the haystack cookbook only has test_live.py with live integration tests that are skipped when MOSS_PROJECT_ID/MOSS_PROJECT_KEY are not set. This means CI without credentials would run zero tests for this code. The metadata serialization round-trip logic in particular would benefit from unit tests with mocks, especially given the edge cases in sentinel handling. CONTRIBUTING.md requires 'add tests' for new code — live tests satisfy the letter of that rule, but mock tests would provide better CI coverage.

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Comment thread examples/cookbook/haystack/moss_haystack.py
Comment thread examples/cookbook/haystack/moss_haystack.py
@staru09 staru09 force-pushed the haystack_integration branch from 362920b to 2abff20 Compare April 8, 2026 14:59
Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 1 new potential issue.

View 5 additional findings in Devin Review.

Open in Devin Review

Comment on lines +46 to +47
if isinstance(v, str):
result[k] = v
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Metadata round-trip corrupts string values that start with the typed prefix

_serialize_metadata stores plain strings as-is (line 47), but _deserialize_metadata (line 64) treats any string starting with __moss_typed__: as a JSON-encoded non-string value. If a user stores a string metadata value like "__moss_typed__:42", on serialization it's stored unchanged, but on deserialization the prefix is stripped and json.loads("42") returns the integer 42 — silently corrupting the value from str to int. The fix is to also encode string values that happen to start with the prefix, e.g., wrapping them with json.dumps so they survive the round-trip.

Suggested change
if isinstance(v, str):
result[k] = v
if isinstance(v, str) and not v.startswith(_MOSS_TYPED_PREFIX):
result[k] = v
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's an edge case, can ignore this IMO

devin-ai-integration[bot]

This comment was marked as resolved.

@staru09 staru09 force-pushed the haystack_integration branch from a69858b to 036fb11 Compare April 8, 2026 16:03
devin-ai-integration[bot]

This comment was marked as resolved.

Comment thread sdks/python/sdk/uv.lock Outdated
Comment on lines +1 to +2
ersion = 1
revision = 1
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was uv lock changed ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it was pushed by mistake
removed it

Comment thread examples/cookbook/haystack/README.md Outdated
@@ -0,0 +1,140 @@
# Haystack + Moss Cookbook Example

Use [Moss](https://moss.dev) as a document store and retriever in [Haystack](https://haystack.deepset.ai/) RAG pipelines. Moss provides sub-10ms semantic search, Haystack orchestrates the retrieval-to-generation pipeline.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please say "Use Moss as realtime semantic search in ...."

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

## Installation

```bash
pip install haystack-ai moss python-dotenv
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please add pyproject.toml

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@staru09 staru09 force-pushed the haystack_integration branch from a5b176b to bad8119 Compare April 13, 2026 10:42
self._index_loaded = False

def filter_documents(self, filters: Optional[dict] = None) -> list[Document]:
raise NotImplementedError("filter_documents is not supported.")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MossDocumentStore subclasses haystack.document_stores.types.DocumentStore, which requires filter_documents as part of the protocol. Raising NotImplementedError here breaks that contract — any Haystack component or utility that probes the store (e.g. DocumentWriter with duplicate detection, evaluation helpers, writers that call filter_documents(filters=None) to enumerate existing docs) will crash.

Pick one of:

  1. Implement it — at minimum, pass through to client.get_docs(...) when filters is None, and raise a clearer error only when non-trivial filters are passed. Supporting Moss metadata filters ($eq, $and, $in) via QueryOptions.filter is a natural follow-up.
  2. Stop subclassing DocumentStore — downgrade to a plain helper class (as the earlier Copilot review comment originally suggested) and update the README/docstring to say this is not a full DocumentStore implementation.

Having it both ways — declaring the interface but refusing to honor it — is the worst outcome because it fails only at pipeline-run time with a confusing stack trace.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Framework Integration: Haystack — Document store and retriever integration

4 participants