Document-IQ is a cloud-native document intelligence platform designed to process, analyze, and query enterprise documents using AI, distributed systems, and event-driven microservices.
The platform provides a complete pipeline for:
Document Upload → OCR → Classification → Layout Understanding → Knowledge Retrieval → AI Chat
It demonstrates real-world AI system design, including:
- Event-driven microservices
- ML pipelines and model registry
- Retrieval Augmented Generation (RAG)
- Multi-tenant SaaS architecture
- Observability and distributed tracing
Document-IQ is built as a set of distributed microservices connected through Kafka event streams. Key design principles:
- Loose coupling via Kafka
- Independent microservices
- Async document processing
- Horizontal scalability
The system processes documents through multiple stages.
Multi-tenant isolation is provided through:
- Organization-based workspaces
- Role-based access control (Admin / Member)
- Secure document isolation
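The isolation rules above can be illustrated with a small access check. This is a minimal sketch, not the platform's actual code: the `User`, `Document`, `can_read`, and `can_delete` names are hypothetical, and the rule that only admins may delete is an assumed policy chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    ADMIN = "admin"
    MEMBER = "member"


@dataclass(frozen=True)
class User:
    user_id: str
    org_id: str
    role: Role


@dataclass(frozen=True)
class Document:
    doc_id: str
    org_id: str


def can_read(user: User, doc: Document) -> bool:
    """A user may only read documents owned by their own organization."""
    return user.org_id == doc.org_id


def can_delete(user: User, doc: Document) -> bool:
    """Assumed policy: only admins of the owning organization may delete."""
    return can_read(user, doc) and user.role is Role.ADMIN


alice = User("u1", "org-a", Role.ADMIN)
carol = User("u3", "org-a", Role.MEMBER)
bob = User("u2", "org-b", Role.MEMBER)
report = Document("d1", "org-a")

print(can_read(alice, report))    # True: same organization
print(can_read(bob, report))      # False: different organization
print(can_delete(carol, report))  # False: member, not admin
```

Checking the organization first and the role second keeps tenant isolation independent of whatever role model is layered on top.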
The system uses Kafka for asynchronous processing.
Benefits:
- High throughput
- Fault isolation
- Service decoupling
- Scalability
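To make the decoupling concrete, here is a toy, in-process stand-in for topic-based messaging. It is only a sketch of the pattern and does not use Kafka itself: the `InMemoryBroker` class, the topic names, and both workers are hypothetical.

```python
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, List, Tuple

Event = dict
Handler = Callable[[Event], None]


class InMemoryBroker:
    """Toy stand-in for Kafka: named topics, each with subscribed handlers."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = defaultdict(list)
        self._pending: Deque[Tuple[str, Event]] = deque()

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._handlers[topic].append(handler)

    def publish(self, topic: str, event: Event) -> None:
        # Publishing only enqueues; a producer never calls a consumer directly.
        self._pending.append((topic, event))

    def run_until_idle(self) -> None:
        # Deliver queued events in order, including events published by handlers.
        while self._pending:
            topic, event = self._pending.popleft()
            for handler in self._handlers[topic]:
                handler(event)


broker = InMemoryBroker()
completed: List[Event] = []


def ocr_worker(event: Event) -> None:
    # Hypothetical OCR stage: attach extracted text, then emit the next event.
    broker.publish("ocr.completed", {**event, "text": "Invoice 42\nTotal: 10"})


def classification_worker(event: Event) -> None:
    # Hypothetical classifier: a keyword rule standing in for a real model.
    label = "invoice" if "invoice" in event["text"].lower() else "other"
    broker.publish("classification.completed", {**event, "label": label})


broker.subscribe("document.uploaded", ocr_worker)
broker.subscribe("ocr.completed", classification_worker)
broker.subscribe("classification.completed", completed.append)

broker.publish("document.uploaded", {"doc_id": "d1", "org_id": "org-a"})
broker.run_until_idle()
print(completed[0]["label"])  # invoice
```

Each worker knows only the topic it reads and the topic it writes, so a stage can be replaced or scaled without touching its neighbours.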
Document-IQ integrates machine learning and deep learning models.
The classification engine classifies document types using engineered text features.
Example features:
- token count
- line count
- average line length
- table detection
These features form a feature contract used across training and inference.
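A feature contract of this kind could look roughly like the sketch below. The feature names mirror the list above, but `FEATURE_NAMES`, `extract_features`, and the table heuristic (two or more lines containing a pipe, a tab, or a wide gap) are assumptions made for illustration, not the project's actual definitions.

```python
import re
from typing import Dict

# Hypothetical contract: feature names and order shared by training and inference.
FEATURE_NAMES = ("token_count", "line_count", "avg_line_length", "has_table")

# Assumed heuristic: a pipe, a tab, or a run of 3+ spaces between words looks tabular.
_TABLE_ROW = re.compile(r"\t|\||\S {3,}\S")


def extract_features(text: str) -> Dict[str, float]:
    lines = [line for line in text.splitlines() if line.strip()]
    line_count = len(lines)
    avg_line_length = sum(len(line) for line in lines) / line_count if line_count else 0.0
    # Require at least two tabular-looking lines before flagging a table.
    table_rows = sum(1 for line in lines if _TABLE_ROW.search(line))
    features = {
        "token_count": float(len(text.split())),
        "line_count": float(line_count),
        "avg_line_length": avg_line_length,
        "has_table": 1.0 if table_rows >= 2 else 0.0,
    }
    # Fail fast if the produced features drift from the contract.
    if tuple(features) != FEATURE_NAMES:
        raise ValueError("feature contract violated")
    return features


sample = "Invoice 42\nItem | Qty\nPen | 3"
print(extract_features(sample))
```

Because training and inference both call the same function and check against the same name tuple, a renamed or reordered feature fails loudly instead of silently degrading predictions.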
The layout engine detects document structure such as:
- Header
- Text
- Table
- Footer
This enables layout-aware document understanding.
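One way downstream code might consume these labels is sketched below. The `Region` structure, its `top` coordinate, and the choice to drop headers and footers from the body text are all hypothetical; the actual output format of the layout engine is not described here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class RegionType(Enum):
    HEADER = "header"
    TEXT = "text"
    TABLE = "table"
    FOOTER = "footer"


@dataclass(frozen=True)
class Region:
    kind: RegionType
    top: int   # hypothetical pixel offset from the top of the page
    text: str


def reading_order(regions: List[Region]) -> List[Region]:
    """Order regions top to bottom, as a reader would encounter them."""
    return sorted(regions, key=lambda r: r.top)


def body_text(regions: List[Region]) -> str:
    """Join only TEXT and TABLE regions, dropping headers and footers."""
    keep = {RegionType.TEXT, RegionType.TABLE}
    return "\n".join(r.text for r in reading_order(regions) if r.kind in keep)


page = [
    Region(RegionType.FOOTER, 950, "Page 1 of 3"),
    Region(RegionType.TEXT, 120, "Payment is due in 30 days."),
    Region(RegionType.HEADER, 10, "ACME Corp"),
    Region(RegionType.TABLE, 400, "Item | Qty"),
]
print(body_text(page))
```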
Users can query documents using natural language.
Example questions:
- "Summarize this document"
- "What dates are mentioned?"
- "List key action items"
Pipeline:
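In general terms, a RAG flow retrieves the passages most relevant to the question and passes them to a language model as context. The sketch below shows only that general shape and is not the rag-engine implementation: it ranks chunks with a simple bag-of-words cosine similarity in place of learned embeddings, and it stops at building the prompt rather than calling a model. All function names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: List[str], k: int = 2) -> List[str]:
    """Rank chunks by bag-of-words cosine similarity to the question."""
    q = Counter(tokenize(question))
    ranked = sorted(chunks, key=lambda c: cosine(q, Counter(tokenize(c))), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context: List[str]) -> str:
    """Assemble the text that would be sent to a language model."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"


chunks = [
    "The contract was signed on 12 March 2024.",
    "Payment is due within 30 days of the invoice date.",
    "The supplier is ACME Corp.",
]
question = "When is payment due?"
top = retrieve(question, chunks, k=1)
print(build_prompt(question, top))
```

Restricting the prompt to retrieved context is what ties answers to the uploaded documents rather than to the model's general knowledge.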
Document-IQ includes production-grade observability.
This enables:
- distributed tracing
- centralized logging
- system debugging
- performance monitoring
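The idea that ties tracing and centralized logging together is a shared trace id carried from service to service. The sketch below illustrates that idea with the standard library only; it is not the OpenTelemetry API, and the `handle_event` function, the `trace_id` header name, and the JSON log shape are assumptions.

```python
import json
import logging
import uuid
from contextvars import ContextVar
from typing import Dict

# Hypothetical context variable holding the trace id of the event being handled.
current_trace_id: ContextVar[str] = ContextVar("current_trace_id", default="")


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, tagged with the active trace id."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "service": record.name,
            "message": record.getMessage(),
            "trace_id": current_trace_id.get(),
        })


def handle_event(service: str, headers: Dict[str, str]) -> Dict[str, str]:
    # Reuse the incoming trace id, or start a new trace at the edge of the system.
    trace_id = headers.get("trace_id") or uuid.uuid4().hex
    token = current_trace_id.set(trace_id)
    try:
        logging.getLogger(service).info("processing event")
    finally:
        current_trace_id.reset(token)
    # Outgoing events carry the same id so the next service joins the trace.
    return {"trace_id": trace_id}


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

headers = handle_event("ocr-adapter", {})
headers = handle_event("classification-engine", headers)
```

Both log lines share one `trace_id`, which is what lets a log query and a trace view be joined on the same request.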
.
├── document-iq/
│   ├── document-iq-core
│   │     Shared schemas, configs, ML utilities
│   │
│   ├── document-iq-ml-pipeline
│   │     Classical ML training pipelines
│   │
│   ├── document-iq-dl-pipeline
│   │     Deep learning layout training
│   │
│   └── document-iq-platform
│       ├── components
│       │   ├── account-component
│       │   ├── application-component
│       │   ├── ingestion-worker
│       │   ├── ocr-adapter
│       │   ├── classification-engine
│       │   ├── layout-engine
│       │   ├── rag-engine
│       │   └── aggregator
│       │
│       ├── gateways
│       │   └── ui-bff
│       │
│       ├── shared
│       │   └── platform shared code
│       │
│       ├── ui-portal
│       │     React frontend
│       │
│       └── docker-compose.yml
│
└── infra
    └── terraform
This separation allows:
- ML pipelines to evolve independently
- platform services to scale independently
- infrastructure to be managed separately
Technology stack:
- Python
- FastAPI
- Kafka
- PostgreSQL
- Redis
- Scikit-learn
- PyTorch
- MLflow
- Feature contracts
- Docker
- Docker Compose
- Terraform
- OpenTelemetry
- Loki
- Promtail
- Tempo
- Grafana
- React
- React Router
- Context API
The platform includes ML lifecycle management.
Capabilities:
- experiment tracking
- model versioning
- production model registry
Example model loading:
import mlflow.pyfunc

def load_production_model(model_name: str):
    # Resolve whichever version is in the registry's "Production" stage.
    model_uri = f"models:/{model_name}/Production"
    return mlflow.pyfunc.load_model(model_uri)
This ensures consistent model deployment across services.
git clone https://github.com/PranavTupe2000/document-iq
cd document-iq
cd document-iq/document-iq-platform
docker compose up --build
This starts:
- Kafka
- Microservices
- Database
- Observability stack
- UI Portal
1️⃣ Create organization
2️⃣ Log in to the portal
3️⃣ Upload document
Processing pipeline:
Upload
↓
OCR
↓
Classification
↓
Layout Detection
↓
Aggregation
↓
Stored in platform
↓
Query via AI Chat
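The flow above can be modelled as an ordered set of stages that a document's status moves through. This is a sketch under the assumption that the stages run strictly one after another; the `Stage` enum and `next_stage` helper are hypothetical and not taken from the codebase.

```python
from enum import Enum
from typing import List, Optional


class Stage(Enum):
    UPLOADED = 1
    OCR = 2
    CLASSIFICATION = 3
    LAYOUT_DETECTION = 4
    AGGREGATION = 5
    STORED = 6


def next_stage(stage: Stage) -> Optional[Stage]:
    """Return the stage that follows, or None once the document is stored."""
    members = list(Stage)
    index = members.index(stage)
    return members[index + 1] if index + 1 < len(members) else None


stage: Optional[Stage] = Stage.UPLOADED
path: List[str] = []
while stage is not None:
    path.append(stage.name)
    stage = next_stage(stage)
print(" -> ".join(path))
```

Once a document reaches `STORED`, it is available for querying via AI Chat.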
Document-IQ demonstrates real-world production engineering skills including:
- Distributed systems design
- Event-driven architecture
- Microservices orchestration
- ML lifecycle management
- AI system deployment
- Observability and monitoring
- Multi-tenant SaaS systems
This type of system is similar to platforms such as:
- AWS Textract pipelines
- Google Document AI
- enterprise knowledge platforms
Planned enhancements:
- Transformer-based document classification
- LayoutLM for document understanding
- Vector database integration
- Kubernetes deployment
- autoscaling pipelines
- streaming ingestion
Pranav Tupe
Software Engineer | AI Systems | Distributed Systems
GitHub
https://github.com/PranavTupe2000