This project is a comprehensive implementation of an embeddings-based search engine designed for a code search task. It uses the CoSQA dataset from Hugging Face to retrieve relevant code snippets based on natural language queries.
The project covers the complete end-to-end model lifecycle:
- Implementation (Part 1): Building a baseline search engine using a pre-trained sentence-transformers model, FAISS for vector indexing, and a FastAPI server exposing the /search API.
- Evaluation (Part 2): Measuring the baseline engine's performance on the CoSQA validation set using standard ranking metrics: Recall@10, MRR@10, and NDCG@10.
- Fine-Tuning (Part 3): Improving search quality by fine-tuning the base model on the CoSQA training set and demonstrating a measurable improvement on the same metrics.
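To build intuition for what the baseline in Part 1 does: FAISS's inner-product search over normalized embeddings is equivalent to ranking documents by cosine similarity to the query vector. The following is a stdlib-only sketch of that core idea with toy 2-D vectors, not the project's actual `build_index.py` (which uses real embeddings and a FAISS index):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def search(query_vec, doc_vecs, k=10):
    """Return indices of the k documents most similar to the query."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Toy example: three "document" embeddings, one "query" embedding.
docs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
print(search([1.0, 0.1], docs, k=2))  # → [0, 1]
```

In the real system, the vectors come from the sentence-transformers encoder and FAISS performs this search efficiently over the whole corpus.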
```
├── data/                    # Directory for storing FAISS indexes and metadata
│   ├── base_index.bin
│   ├── base_meta.pkl
│   ├── finetuned_index.bin
│   └── finetuned_meta.pkl
│
├── model_checkpoints/       # Directory for saving the fine-tuned model
│   └── finetuned/
│       ├── ...
│       └── loss_history.pkl # Contains mean batch losses for intermediate training steps
│
├── src/                     # Source code for core modules
│   ├── metrics.py           # Implementation of Recall@k, MRR@k, NDCG@k
│   └── utils.py             # Data loading utilities for the CoSQA dataset
│
├── api.py                   # FastAPI server to host the /search endpoint
├── build_index.py           # Script to encode documents and build a FAISS index
├── evaluate.py              # Script to evaluate the search engine on CoSQA
├── train.py                 # Script to fine-tune the embedding model
├── .gitignore
├── README.md                # This file
├── report.ipynb             # The main Jupyter Notebook with all analysis and findings
└── requirements.txt         # Python dependencies
```

Follow these steps to set up your environment and install the required dependencies.
- Clone the repository

  Clone this project to your local machine:

  ```shell
  git clone https://github.com/allesgrau/ML-for-context-task.git
  cd ML-for-context-task
  ```

- Create a virtual environment

  It is highly recommended to use a separate virtual environment to manage dependencies. Create the environment, activate it, and install the dependencies.

  Windows:

  ```shell
  python -m venv venv
  .\venv\Scripts\activate
  pip install -r requirements.txt
  ```

  macOS/Linux:

  ```shell
  python -m venv venv
  source venv/bin/activate
  pip install -r requirements.txt
  ```

- Explore the project
There are two main ways to run and explore this project:
- (Recommended) Use the interactive Jupyter Notebook for a full walkthrough.
- Run the individual Python scripts (train, build_index, evaluate, api) from your terminal.
This README covers the first option: exploring the project through the interactive notebook.
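For intuition about what `evaluate.py` and `src/metrics.py` report, here is a minimal stdlib-only sketch of the three ranking metrics for the common CoSQA setup of one relevant code snippet per query. This is an illustrative sketch, not the project's actual implementation:

```python
import math

def recall_at_k(ranked_ids, relevant_id, k=10):
    """1.0 if the relevant document appears in the top k, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

def mrr_at_k(ranked_ids, relevant_id, k=10):
    """Reciprocal of the rank of the relevant document (0.0 if outside top k)."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_id, k=10):
    """With a single relevant document the ideal DCG is 1, so NDCG = 1/log2(rank+1)."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == relevant_id:
            return 1.0 / math.log2(rank + 1)
    return 0.0

ranked = ["d3", "d7", "d1"]          # engine's ranking for one query
print(mrr_at_k(ranked, "d7"))        # relevant doc at rank 2 → 0.5
```

In evaluation, each metric is computed per query and then averaged over the validation set.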
The report.ipynb file is a comprehensive, interactive report that walks through every part of the task. It contains all the code, explanations, and results.
- Ensure your virtual environment is activated and dependencies are installed.
- Launch Jupyter Notebook or Jupyter Lab.
- Open report.ipynb.
- You can read the full report and execute the cells sequentially to see the search engine in action, run evaluations, and understand the fine-tuning process.
Note: The report.ipynb notebook contains code for all parts of the project, including a cell for training the model (in the Part 3 – Fine-tuning section). Please skip this training cell. This process is very time-consuming, and the model has already been pre-trained by me. The final, fine-tuned model is saved in the model_checkpoints/finetuned directory. Subsequent cells in the notebook are set up to load this ready-to-use model for evaluation.
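If you instead run the scripts and start the FastAPI server (`api.py`), you can query the /search endpoint over HTTP. The sketch below builds a request URL with the standard library; the host/port and the `query`/`k` parameter names are assumptions for illustration, so check `api.py` for the parameters the endpoint actually accepts:

```python
from urllib.parse import urlencode

API_BASE = "http://127.0.0.1:8000"  # assumed default uvicorn host/port

def build_search_url(base, query, k=10):
    # "query" and "k" are hypothetical parameter names -- check api.py
    # for the ones the endpoint actually accepts.
    return f"{base}/search?{urlencode({'query': query, 'k': k})}"

url = build_search_url(API_BASE, "read a csv file", k=5)
print(url)
# With the server running, fetch results with e.g.:
#   import json, urllib.request
#   results = json.load(urllib.request.urlopen(url))
```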