- Open PRs against the
developmentbranch. Create your feature/fix branch fromdevelopment, push it, and targetdevelopmentin the PR so release branches (master/main) stay clean. - If an urgent hotfix lands on
master/main, mirror it back intodevelopmentright away to keep the branches aligned.
- Tie any pipeline change to the artifacts it produces. Common touchpoints:
Corpus.extract()writes source PDFs underdownloads/and a manifest atdownload_results/download_results.parquet(fields likeneeds_ocr).Corpus.clean()emitsmarkdown/andclean_markdown/, keeping.processing_state.pklplusproblematic_files/andtimeout_files/subfolders.Corpus.ocr()andCorpus.section()populatejson/(Docling JSON, formula index, metrics) andsections/sections_for_annotation.parquet.
- When relocating outputs or adding new ones, update assertions in
tests/test_pipeline_smoke.pyand the folder references indocs/pipeline.mdso the layout stays discoverable.
- Avoid large refactors or sweeping interface changes; aim for narrowly scoped PRs and discuss big shifts before starting.