This project is a comprehensive implementation of an embeddings-based search engine designed for a code search task. It uses the CoSQA dataset from Hugging Face to retrieve relevant code snippets based on natural language queries.
The project covers the complete end-to-end model lifecycle:
- Implementation (Part 1): Building a baseline search engine using a pre-trained sentence-transformers model, FAISS for vector indexing, and a FastAPI server exposing the /search API.
- Evaluation (Part 2): Measuring the baseline engine's performance on the CoSQA validation set using standard ranking metrics: Recall@10, MRR@10, and NDCG@10.
- Fine-Tuning (Part 3): Improving search quality by fine-tuning the base model on the CoSQA training set and demonstrating a measurable improvement on the same metrics.
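To build intuition for what the baseline in Part 1 does: FAISS's inner-product search over normalized embeddings is equivalent to ranking documents by cosine similarity to the query vector. The following is a stdlib-only sketch of that core idea with toy 2-D vectors, not the project's actual `build_index.py` (which uses real embeddings and a FAISS index):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def search(query_vec, doc_vecs, k=10):
    """Return indices of the k documents most similar to the query."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Toy example: three "document" embeddings, one "query" embedding.
docs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
print(search([1.0, 0.1], docs, k=2))  # → [0, 1]
```

In the real system, the vectors come from the sentence-transformers encoder and FAISS performs this search efficiently over the whole corpus.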
```
├── data/                    # Directory for storing FAISS indexes and metadata
│   ├── base_index.bin
│   ├── base_meta.pkl
│   ├── finetuned_index.bin
│   └── finetuned_meta.pkl
│
├── model_checkpoints/       # Directory for saving the fine-tuned model
│   └── finetuned/
│       ├── ...
│       └── loss_history.pkl # Contains mean batch losses for intermediate training steps
│
├── src/                     # Source code for core modules
│   ├── metrics.py           # Implementation of Recall@k, MRR@k, NDCG@k
│   └── utils.py             # Data loading utilities for the CoSQA dataset
│
├── api.py                   # FastAPI server to host the /search endpoint
├── build_index.py           # Script to encode documents and build a FAISS index
├── evaluate.py              # Script to evaluate the search engine on CoSQA
├── train.py                 # Script to fine-tune the embedding model
├── .gitignore
├── README.md                # This file
├── report.ipynb             # The main Jupyter Notebook with all analysis and findings
└── requirements.txt         # Python dependencies
```

Follow these steps to set up your environment and install the required dependencies.
- Clone the repository

  Clone this project to your local machine:

  ```shell
  git clone https://github.com/allesgrau/ML-for-context-task.git
  cd ML-for-context-task
  ```

- Create a virtual environment

  It is highly recommended to use a separate virtual environment to manage dependencies. Create the environment, activate it, and install the dependencies.

  Windows:

  ```shell
  python -m venv venv
  .\venv\Scripts\activate
  pip install -r requirements.txt
  ```

  macOS/Linux:

  ```shell
  python -m venv venv
  source venv/bin/activate
  pip install -r requirements.txt
  ```

- Explore the project
There are two main ways to run and explore this project:
- (Recommended) Use the interactive Jupyter Notebook for a full walkthrough.
- Run the individual Python scripts (train, build_index, evaluate, api) from your terminal.
This README covers the first option: exploring the project through the interactive notebook.
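For intuition about what `evaluate.py` and `src/metrics.py` report, here is a minimal stdlib-only sketch of the three ranking metrics for the common CoSQA setup of one relevant code snippet per query. This is an illustrative sketch, not the project's actual implementation:

```python
import math

def recall_at_k(ranked_ids, relevant_id, k=10):
    """1.0 if the relevant document appears in the top k, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

def mrr_at_k(ranked_ids, relevant_id, k=10):
    """Reciprocal of the rank of the relevant document (0.0 if outside top k)."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_id, k=10):
    """With a single relevant document the ideal DCG is 1, so NDCG = 1/log2(rank+1)."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == relevant_id:
            return 1.0 / math.log2(rank + 1)
    return 0.0

ranked = ["d3", "d7", "d1"]          # engine's ranking for one query
print(mrr_at_k(ranked, "d7"))        # relevant doc at rank 2 → 0.5
```

In evaluation, each metric is computed per query and then averaged over the validation set.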
The report.ipynb file is a comprehensive, interactive report that walks through every part of the task. It contains all the code, explanations, and results.
- Ensure your virtual environment is activated and dependencies are installed.
- Launch Jupyter Notebook or Jupyter Lab.
- Open report.ipynb.
- You can read the full report and execute the cells sequentially to see the search engine in action, run evaluations, and understand the fine-tuning process.
Note: The report.ipynb notebook contains code for all parts of the project, including a cell for training the model (in the Part 3 – Fine-tuning section). Please skip this training cell. This process is very time-consuming, and the model has already been pre-trained by me. The final, fine-tuned model is saved in the model_checkpoints/finetuned directory. Subsequent cells in the notebook are set up to load this ready-to-use model for evaluation.
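If you instead run the scripts and start the FastAPI server (`api.py`), you can query the /search endpoint over HTTP. The sketch below builds a request URL with the standard library; the host/port and the `query`/`k` parameter names are assumptions for illustration, so check `api.py` for the parameters the endpoint actually accepts:

```python
from urllib.parse import urlencode

API_BASE = "http://127.0.0.1:8000"  # assumed default uvicorn host/port

def build_search_url(base, query, k=10):
    # "query" and "k" are hypothetical parameter names -- check api.py
    # for the ones the endpoint actually accepts.
    return f"{base}/search?{urlencode({'query': query, 'k': k})}"

url = build_search_url(API_BASE, "read a csv file", k=5)
print(url)
# With the server running, fetch results with e.g.:
#   import json, urllib.request
#   results = json.load(urllib.request.urlopen(url))
```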