
fix(experiments): DB connection leak — ExperimentsAPIImpl.listActive() has no connection lifecycle management on background thread call paths #34831

@spbolton


Problem Statement

Database connections are leaking from ExperimentsAPIImpl.listActive() and ExperimentsFactoryImpl.listActive() when called from background threads. Production pg_stat_activity (coop-prod, 2026-03-02) shows 7–9 orphaned connections per pod with last queries:

  • SELECT experiment.* FROM experiment WHERE status NOT IN ($1, $2) and page_id = $3
  • select * from multi_tree where child = $1 and variant_id = 'DEFAULT'
  • COMMIT (transaction completed but connection never returned to pool)

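These connections have been idle for 2–16+ hours, far beyond HikariCP's maxLifetime. HikariCP retires a connection after maxLifetime only once it is back in the pool (an in-use connection is never retired), so connections idle this long can only be ones that were checked out and never returned to the pool.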
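The maxLifetime argument can be illustrated with a toy model. This is a sketch under simplifying assumptions, not HikariCP's implementation: `ToyPool` and its methods are hypothetical, timestamps are plain integers (minutes) instead of a real clock, and the only rule encoded is that a connection is retired after maxLifetime only while it sits idle in the pool.

```java
import java.util.ArrayList;
import java.util.List;

// Toy pool model (hypothetical, NOT HikariCP's implementation). Timestamps are plain
// integers (minutes) so the behavior is deterministic.
final class ToyPool {
    static final class Conn {
        final int id;
        final long createdAt;
        boolean inUse;
        boolean retired;

        Conn(int id, long createdAt) {
            this.id = id;
            this.createdAt = createdAt;
        }
    }

    private final long maxLifetime;
    private final List<Conn> connections = new ArrayList<>();
    private int nextId = 1;

    ToyPool(long maxLifetime) {
        this.maxLifetime = maxLifetime;
    }

    // Hands out an idle connection if one exists, otherwise opens a new one.
    Conn borrow(long now) {
        for (Conn c : connections) {
            if (!c.inUse && !c.retired) {
                c.inUse = true;
                return c;
            }
        }
        Conn c = new Conn(nextId++, now);
        c.inUse = true;
        connections.add(c);
        return c;
    }

    void giveBack(Conn c) {
        c.inUse = false;
    }

    // Housekeeping: only connections sitting idle in the pool are eligible for retirement;
    // a checked-out connection is skipped no matter how old it is.
    void housekeep(long now) {
        for (Conn c : connections) {
            if (!c.inUse && !c.retired && now - c.createdAt >= maxLifetime) {
                c.retired = true;
            }
        }
    }

    long liveConnections() {
        return connections.stream().filter(c -> !c.retired).count();
    }
}
```

With maxLifetime = 30 and a housekeeping pass at minute 960 (16 hours), a connection that was given back is retired, while one that was borrowed and never returned survives every pass, just as the leaked sessions do.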

Impact: ~24 permanently leaked connections across 3 coop-prod pods. The count accumulates throughout business hours and resets only on pod restart.

Steps to Reproduce

  1. Have a dotCMS instance with active A/B experiments configured
  2. Trigger content operations (save, publish, delete) that invoke background reindexing (ReindexThread) or CDI async events
  3. Query pg_stat_activity on the database:
SELECT pid, state,
       EXTRACT(EPOCH FROM (now() - state_change))::int AS idle_secs,
       left(query, 80) AS last_query
FROM pg_stat_activity
WHERE datname = current_database()
  AND state = 'idle'
  AND EXTRACT(EPOCH FROM (now() - state_change)) > 120
ORDER BY idle_secs DESC;
  4. Observe connections whose last query is SELECT experiment.* or select * from multi_tree where child = $1, idle for hours
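To triage the output of the query in step 3, rows can be filtered for the two leak signatures listed in this issue. A minimal sketch, assuming the rows have already been fetched into a hypothetical `ActivityRow` record (pid, idle_secs, last_query); `LeakSuspects` is also illustrative. COMMIT rows are deliberately not matched, because COMMIT alone does not identify the code path.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Hypothetical row type mirroring the columns returned by the pg_stat_activity query.
record ActivityRow(int pid, int idleSecs, String lastQuery) {}

final class LeakSuspects {
    // Last-query prefixes observed on the leaked connections (taken from this issue).
    private static final List<String> SUSPECT_PREFIXES = List.of(
            "select experiment.* from experiment",
            "select * from multi_tree where child =");

    private LeakSuspects() {}

    // Keeps rows idle longer than thresholdSecs whose last query matches a known signature.
    static List<ActivityRow> filter(List<ActivityRow> rows, int thresholdSecs) {
        return rows.stream()
                .filter(r -> r.idleSecs() > thresholdSecs)
                .filter(r -> {
                    String q = r.lastQuery().trim().toLowerCase(Locale.ROOT);
                    return SUSPECT_PREFIXES.stream().anyMatch(q::startsWith);
                })
                .collect(Collectors.toList());
    }
}
```

The prefixes are short enough to survive the left(query, 80) truncation in the diagnostic query.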

Root Cause

ExperimentsFactoryImpl.listActive() uses new DotConnect() directly with no @CloseDBIfOpened annotation and no LocalTransaction.wrapReturn() or DbConnectionFactory.wrapConnection() wrapper:

// ExperimentsFactoryImpl.java:187
@Override
public final Collection<Experiment> listActive(final String pageIdentifier) throws DotDataException {
    final List<Map<String, Object>> results = new DotConnect()
            .setSQL(ACTIVE_EXPERIMENTS_BY_PAGE)
            .addParam(AbstractExperiment.Status.ENDED.toString())
            .addParam(AbstractExperiment.Status.ARCHIVED.toString())
            .addParam(pageIdentifier)
            .loadObjectResults();
    return TransformerLocator.createExperimentTransformer(results).list;
}

CMSFilter.doFilter() has a finally { DbConnectionFactory.closeSilently(); } block, which cleans up at the end of every HTTP request. However, listActive() is also called from background threads that bypass CMSFilter:

  • MultiTreeAPIImpl.refreshPageInCache() → getPageVariants() → listActive() (line 996), triggered from content save/delete/publish operations via ReindexThread, CDI async events, and scheduler tasks
  • MultiTreeCache.getVariants() → listActive() (line 107), on the cache population path

On background threads there is no outer scope to close the connection.
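The leak mechanism can be reproduced in miniature. This is a toy model, not dotCMS code: a hypothetical `ThreadConnections` class stands in for the thread-bound connection holder, `httpRequest()` mimics the finally cleanup in CMSFilter, and the executor thread plays the role of ReindexThread or an async event worker.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Toy thread-bound connection holder (hypothetical, NOT dotCMS's DbConnectionFactory).
final class ThreadConnections {
    static final AtomicInteger checkedOut = new AtomicInteger();
    private static final ThreadLocal<Object> CURRENT = new ThreadLocal<>();

    private ThreadConnections() {}

    // Borrows a "connection" for the current thread on first use, as a query would.
    static Object get() {
        Object c = CURRENT.get();
        if (c == null) {
            c = new Object();
            CURRENT.set(c);
            checkedOut.incrementAndGet();
        }
        return c;
    }

    // Returns the current thread's connection to the pool, if it holds one.
    static void closeSilently() {
        if (CURRENT.get() != null) {
            CURRENT.remove();
            checkedOut.decrementAndGet();
        }
    }
}

final class LeakDemo {
    private LeakDemo() {}

    // Like the unwrapped factory method: uses a connection and never releases it.
    static void listActive() {
        ThreadConnections.get();
    }

    // Like CMSFilter.doFilter(): the outer scope releases the connection in finally.
    static void httpRequest() {
        try {
            listActive();
        } finally {
            ThreadConnections.closeSilently();
        }
    }

    // Runs one request-scoped call and one background call; returns connections still out.
    static int run() {
        httpRequest();
        ExecutorService background = Executors.newSingleThreadExecutor();
        background.submit(LeakDemo::listActive); // no outer scope on this thread
        background.shutdown();
        try {
            if (!background.awaitTermination(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("background task did not finish");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return ThreadConnections.checkedOut.get();
    }
}
```

`LeakDemo.run()` returns 1: the request-scoped call is cleaned up, while the background call leaves its connection checked out even after its thread has finished.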

Note on @CloseDBIfOpened: This annotation only takes effect through ByteBuddy class instrumentation. Self-invocations through this. within the same class, and calls through CDI Weld proxies, bypass the interceptor. Calling wrapConnection() inside the factory method itself is more reliable.
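The proxy half of this note can be demonstrated with java.lang.reflect.Proxy. This sketch only models proxy-based interception (as with CDI Weld proxies); it does not model ByteBuddy instrumentation. `ExperimentLookup`, `ExperimentLookupImpl`, and `InterceptDemo` are hypothetical names, and the handler merely records which calls it saw, standing in for an interceptor such as the one behind @CloseDBIfOpened.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

interface ExperimentLookup {
    String listActive(String pageId);

    String refresh(String pageId);
}

final class ExperimentLookupImpl implements ExperimentLookup {
    @Override
    public String listActive(String pageId) {
        return "experiments:" + pageId;
    }

    // Self-invocation: "this" is the raw target, not the proxy, so no interceptor runs.
    @Override
    public String refresh(String pageId) {
        return this.listActive(pageId);
    }
}

final class InterceptDemo {
    // Names of methods that passed through the proxy's handler.
    static final List<String> intercepted = new ArrayList<>();

    private InterceptDemo() {}

    static ExperimentLookup proxied(ExperimentLookup target) {
        InvocationHandler handler = (proxy, method, args) -> {
            intercepted.add(method.getName()); // stand-in for "open/close DB" advice
            return method.invoke(target, args);
        };
        return (ExperimentLookup) Proxy.newProxyInstance(
                ExperimentLookup.class.getClassLoader(),
                new Class<?>[] {ExperimentLookup.class},
                handler);
    }
}
```

Calling refresh() through the proxy records only "refresh"; the inner this.listActive() call goes straight to the target object, so any interceptor logic attached to listActive() never runs.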

Proposed Fix

Wrap the DotConnect call in ExperimentsFactoryImpl.listActive() with DbConnectionFactory.wrapConnection():

@Override
public final Collection<Experiment> listActive(final String pageIdentifier) throws DotDataException {
    return DbConnectionFactory.wrapConnection(() -> {
        final List<Map<String, Object>> results = new DotConnect()
                .setSQL(ACTIVE_EXPERIMENTS_BY_PAGE)
                .addParam(AbstractExperiment.Status.ENDED.toString())
                .addParam(AbstractExperiment.Status.ARCHIVED.toString())
                .addParam(pageIdentifier)
                .loadObjectResults();
        return TransformerLocator.createExperimentTransformer(results).list;
    });
}

This follows the same pattern introduced in DBMetricType.getValue() (PR #34490).
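The intended semantics of such a wrapper can be sketched as "close only the connection you opened". This is an assumption about how DbConnectionFactory.wrapConnection() behaves, not its actual source; `ScopedConnections` and its methods are hypothetical toy stand-ins, and the demo is single-threaded.

```java
import java.util.function.Supplier;

// Toy thread-bound connection holder (hypothetical, NOT dotCMS's DbConnectionFactory).
final class ScopedConnections {
    private static final ThreadLocal<Object> CURRENT = new ThreadLocal<>();
    private static int leases = 0; // connections currently checked out (single-threaded demo)

    private ScopedConnections() {}

    static boolean connectionExists() {
        return CURRENT.get() != null;
    }

    static int checkedOut() {
        return leases;
    }

    // Borrows a connection for this thread on first use, as a DotConnect query would.
    static Object get() {
        if (CURRENT.get() == null) {
            CURRENT.set(new Object());
            leases++;
        }
        return CURRENT.get();
    }

    static void closeSilently() {
        if (CURRENT.get() != null) {
            CURRENT.remove();
            leases--;
        }
    }

    // Hypothetical wrapper: closes the connection only if this call is the one that opened it,
    // so an enclosing scope (e.g. an HTTP request) keeps ownership of its own connection.
    static <T> T wrapConnection(Supplier<T> body) {
        final boolean ownsConnection = !connectionExists();
        try {
            return body.get();
        } finally {
            if (ownsConnection) {
                closeSilently();
            }
        }
    }
}
```

Called with no open connection (the background-thread case), the wrapper releases what it borrowed; called inside an outer scope that already holds a connection (the HTTP case), it leaves that connection for the outer finally block to close.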

Acceptance Criteria

  • ExperimentsFactoryImpl.listActive() wraps its DotConnect query in DbConnectionFactory.wrapConnection()
  • No orphaned connections from SELECT experiment.* or multi_tree appear in pg_stat_activity after background content operations
  • Integration test verifies connection is returned to pool after listActive() is called from a non-HTTP thread context
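For the integration-test criterion, the key detail is that connection holders are thread-bound, so the check must run on the same worker thread that called listActive(). A sketch of a helper for such a test, assuming a caller-supplied probe; with dotCMS one might pass something like DbConnectionFactory::connectionExists, but that method reference is an assumption here, and `BackgroundProbe` is hypothetical.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.BooleanSupplier;

final class BackgroundProbe {
    private BackgroundProbe() {}

    // Runs the action on a fresh worker thread (no HTTP filter around it), then evaluates
    // the probe on that SAME thread, because connection holders are thread-bound.
    // Returns true if a connection is still bound after the action finished (a leak).
    static boolean leavesConnectionBound(Runnable action, BooleanSupplier connectionExists) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            Future<Boolean> result = worker.submit(() -> {
                action.run();
                return connectionExists.getAsBoolean();
            });
            return result.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            worker.shutdownNow();
        }
    }
}
```

An integration test would pass a lambda calling the real listActive() as the action and assert that the helper returns false.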

dotCMS Version

Production k8s-frankfurt-prod-1 (coop-prod). Confirmed 2026-03-02.

Severity

High - Major functionality broken


Metadata


Assignees: No one assigned
Status: Done
Milestone: No milestone
Relationships: None yet
Development: No branches or pull requests
