Fix slow /api/v1/jobs endpoint by optimizing RunDao queries #3091

Open
jni-bot wants to merge 1 commit into MarquezProject:main from jni-bot:bug/fix-slow-findByLatestJob-query
jni-bot commented Feb 12, 2026

Summary

Fixes #2987 — the /api/v1/jobs endpoint takes 7+ minutes per job at scale due to an unoptimized query in RunDao.findByLatestJob.

Root cause: the WHERE clause in findByLatestJob joins runs_view with jobs_view to find matching run UUIDs. PostgreSQL therefore builds the full runs_view (~2.3M rows) and scans the entire run_facets table (~21M rows) and dataset_facets table (~22 GB) in the LEFT JOIN subqueries before it applies the filter, even though only 1 to 10 runs are needed.

Changes

RunDao.java:

  • New findCurrentRunByJob method: uses jobs.current_run_uuid (added in #2929, "Add col current_run_uuid to jobs", but not used here until now) for a direct indexed UUID lookup. Returns only the single latest run.
  • Optimized findByLatestJob method — replaces the runs_view JOIN jobs_view WHERE clause with a filter on runs.job_uuid which hits the existing composite index runs_job_uuid ON runs(job_uuid, transitioned_at DESC). Includes symlink resolution for aliased jobs.
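
To make the two access paths concrete, here is a minimal in-memory sketch. It is not the DAO code from this PR: the class and field names are hypothetical, lookups are keyed by job UUID for brevity, and the symlinkTargetUuid pointer is an assumption about how aliased jobs are represented. The two maps stand in for the primary-key index on runs and the runs_job_uuid composite index.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.UUID;

// In-memory stand-in for the jobs and runs tables. This is NOT the Marquez DAO code.
class RunLookupSketch {
  // Hypothetical row type: only columns named in the PR text are modelled,
  // plus an assumed symlink pointer for aliased jobs.
  static final class Job {
    final UUID uuid;
    final UUID currentRunUuid;    // jobs.current_run_uuid, null if the job has no run
    final UUID symlinkTargetUuid; // assumed alias pointer, null for a non-aliased job

    Job(UUID uuid, UUID currentRunUuid, UUID symlinkTargetUuid) {
      this.uuid = uuid;
      this.currentRunUuid = currentRunUuid;
      this.symlinkTargetUuid = symlinkTargetUuid;
    }
  }

  static final class Run {
    final UUID uuid;
    final UUID jobUuid;
    final Instant transitionedAt;

    Run(UUID uuid, UUID jobUuid, Instant transitionedAt) {
      this.uuid = uuid;
      this.jobUuid = jobUuid;
      this.transitionedAt = transitionedAt;
    }
  }

  private final Map<UUID, Job> jobs = new HashMap<>();
  private final Map<UUID, Run> runsByUuid = new HashMap<>();      // plays the primary-key index
  private final Map<UUID, List<Run>> runsByJob = new HashMap<>(); // plays the runs_job_uuid index

  void addJob(Job job) {
    jobs.put(job.uuid, job);
  }

  void addRun(Run run) {
    runsByUuid.put(run.uuid, run);
    runsByJob.computeIfAbsent(run.jobUuid, k -> new ArrayList<>()).add(run);
  }

  // Jobs-list path: one keyed lookup through current_run_uuid, at most one run.
  Optional<Run> findCurrentRunByJob(UUID jobUuid) {
    Job job = jobs.get(jobUuid);
    if (job == null || job.currentRunUuid == null) {
      return Optional.empty();
    }
    return Optional.ofNullable(runsByUuid.get(job.currentRunUuid));
  }

  // Job-detail path: resolve aliases, then read only that job's runs, newest first.
  List<Run> findByLatestJob(UUID jobUuid, int limit) {
    Job job = jobs.get(jobUuid);
    // Follow alias pointers; assumes the symlink chain has no cycles.
    while (job != null && job.symlinkTargetUuid != null) {
      job = jobs.get(job.symlinkTargetUuid);
    }
    if (job == null) {
      return new ArrayList<>();
    }
    List<Run> runs = new ArrayList<>(runsByJob.getOrDefault(job.uuid, new ArrayList<>()));
    runs.sort(Comparator.comparing((Run r) -> r.transitionedAt).reversed());
    int end = Math.max(0, Math.min(limit, runs.size()));
    return new ArrayList<>(runs.subList(0, end));
  }

  public static void main(String[] args) {
    RunLookupSketch db = new RunLookupSketch();
    UUID jobId = UUID.randomUUID();
    UUID aliasId = UUID.randomUUID();
    Run older = new Run(UUID.randomUUID(), jobId, Instant.parse("2026-01-01T00:00:00Z"));
    Run newer = new Run(UUID.randomUUID(), jobId, Instant.parse("2026-02-01T00:00:00Z"));
    db.addJob(new Job(jobId, newer.uuid, null));
    db.addJob(new Job(aliasId, null, jobId)); // alias that points at jobId
    db.addRun(older);
    db.addRun(newer);
    System.out.println(db.findCurrentRunByJob(jobId).isPresent());
    System.out.println(db.findByLatestJob(aliasId, 10).size());
  }
}
```

Neither method touches runs that belong to other jobs, which is the property the rewritten queries rely on.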

JobDao.java:

  • findAllWithRun (jobs list page) now calls findCurrentRunByJob with limit=1 instead of findByLatestJob with limit=10. This is the hot path — called once per job on every list page load.
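
Because the lookup runs once per job, its cost multiplies by the page size. The arithmetic below is a rough sketch under stated assumptions: the page size of 25 is hypothetical, and the time estimate assumes the per-job lookups run one after another, which the PR does not state. The 7 minutes per lookup is the figure reported in the summary.

```java
// Back-of-the-envelope fan-out for one jobs list page. Names are hypothetical.
class ListPageFanOut {
  // Upper bound on run rows requested for one page: one lookup per job.
  static long maxRunRows(int jobsOnPage, int limitPerJob) {
    return (long) jobsOnPage * limitPerJob;
  }

  // Total time if the per-job lookups run sequentially (an assumption).
  static long sequentialSeconds(int jobsOnPage, long secondsPerLookup) {
    return jobsOnPage * secondsPerLookup;
  }

  public static void main(String[] args) {
    int jobsOnPage = 25; // hypothetical page size
    System.out.println(maxRunRows(jobsOnPage, 10)); // old path, limit=10
    System.out.println(maxRunRows(jobsOnPage, 1));  // new path, limit=1
    System.out.println(sequentialSeconds(jobsOnPage, 7 * 60) / 60 + " min");
  }
}
```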

Why this works

| Path | Before | After |
| --- | --- | --- |
| Jobs list (per job) | Full runs_view/jobs_view join scan → 7 min | current_run_uuid indexed lookup → ms |
| Job detail | Full runs_view/jobs_view join scan → 7 min | job_uuid composite index scan → seconds |
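
One way to check these plan shapes on a real database is PostgreSQL's EXPLAIN (ANALYZE, BUFFERS). The helper below is a sketch and not part of this PR: the class and method names are hypothetical, it uses plain JDBC rather than the project's DAO layer, and the query in main is an illustrative guess that uses only the columns named above. Note that ANALYZE actually executes the query, so it should be run against a copy of the data if the old query is being measured.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for inspecting query plans; not part of Marquez.
class ExplainSketch {
  // Prefix a SELECT with PostgreSQL's EXPLAIN options. ANALYZE executes the query.
  static String explainAnalyze(String selectSql) {
    return "EXPLAIN (ANALYZE, BUFFERS) " + selectSql.trim();
  }

  // Run the EXPLAIN and collect the plan, which PostgreSQL returns one text line per row.
  static List<String> planLines(Connection conn, String selectSql) throws SQLException {
    List<String> lines = new ArrayList<>();
    try (Statement st = conn.createStatement();
        ResultSet rs = st.executeQuery(explainAnalyze(selectSql))) {
      while (rs.next()) {
        lines.add(rs.getString(1));
      }
    }
    return lines;
  }

  // Substring check: true if any plan line shows a sequential scan on the table.
  static boolean hasSeqScanOn(List<String> plan, String table) {
    for (String line : plan) {
      if (line.contains("Seq Scan on " + table + " ")) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // Illustrative query shape only; the placeholder UUID is not real data.
    String sql =
        "SELECT * FROM runs WHERE job_uuid = '00000000-0000-0000-0000-000000000000' "
            + "ORDER BY transitioned_at DESC LIMIT 10";
    System.out.println(explainAnalyze(sql));
  }
}
```

If the optimization holds, the plan for the new query should show no sequential scan on run_facets and should use the runs_job_uuid index for the runs lookup.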

Test plan

  • RunDaoTest.testFindByLatestJob — passes (validates symlink resolution)
  • RunDaoTest full suite — all 6 tests pass
  • JobDaoTest full suite — all 8 tests pass
  • spotlessApply — clean

…roject#2987)

The findByLatestJob query was scanning millions of rows in run_facets
and dataset_facets by joining runs_view with jobs_view in the WHERE
clause. This caused 7+ minute query times per job, making the jobs
list page unusable at scale.

Two changes:
- Add findCurrentRunByJob using jobs.current_run_uuid (from MarquezProject#2929)
  for the jobs list page, reducing it to a single indexed UUID lookup
- Optimize findByLatestJob to filter on runs.job_uuid (indexed) with
  symlink resolution, instead of the expensive runs_view/jobs_view join

Closes MarquezProject#2987

Signed-off-by: jni-bot <jni-bot@users.noreply.github.com>

boring-cyborg bot added the api (API layer changes) label Feb 12, 2026

boring-cyborg bot commented Feb 12, 2026

Thanks for opening your first pull request in the Marquez project! Please check out our contributing guidelines (https://github.com/MarquezProject/marquez/blob/main/CONTRIBUTING.md).

codecov bot commented Feb 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 81.18%. Comparing base (a89b89c) to head (c9062b2).

Additional details and impacted files
@@            Coverage Diff            @@
##               main    #3091   +/-   ##
=========================================
  Coverage     81.18%   81.18%           
  Complexity     1506     1506           
=========================================
  Files           268      268           
  Lines          7356     7356           
  Branches        325      325           
=========================================
  Hits           5972     5972           
  Misses         1226     1226           
  Partials        158      158           

Development

Successfully merging this pull request may close these issues.

Very slow /api/v1/jobs endpoint after upgrading to 0.50.0
