Adjust database connection pool and timeout configurations#34941
Conversation
…aults (#34921)

The CONNECTION_DB_MAX_WAIT constant in DataSourceStrategyProvider.java used "DB_MAXWAIT" while setenv.sh exports "DB_MAX_WAIT", causing the maxLifetime setting to never be applied. This resulted in HikariCP falling back to a 60s default, causing excessive connection churn under load.

Changes:
- Fix constant from "DB_MAXWAIT" to "DB_MAX_WAIT" to match setenv.sh
- Increase DB_MAX_WAIT default from 900000ms (15m) to 1800000ms (30m)
- Reduce DB_MIN_IDLE default from 3 to 1 connection
- Increase DB_CONNECTION_TIMEOUT default from 5000ms to 30000ms (30s)

https://claude.ai/code/session_01QWBxEhHLYeQKZNQAqawGuM
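The failure mode described above (a constant that names an environment variable nobody exports) can be sketched with a toy lookup. This is a hypothetical illustration, not the dotCMS code; the real logic lives in DataSourceStrategyProvider.java:

```java
import java.util.Map;

public class MaxWaitLookup {
    // Before the fix the constant read "DB_MAXWAIT", which setenv.sh never
    // exports, so the lookup fell through to the code default of 60s.
    static long maxLifetime(Map<String, String> env) {
        String value = env.get("DB_MAX_WAIT"); // corrected constant name
        return value != null ? Long.parseLong(value) : 60_000L; // 60s fallback
    }

    public static void main(String[] args) {
        // setenv.sh exports DB_MAX_WAIT, so the configured value is picked up:
        System.out.println(maxLifetime(Map.of("DB_MAX_WAIT", "1800000"))); // 1800000
        // With the old misspelled constant, the same export was invisible and
        // every pool silently ran with the 60s fallback:
        System.out.println(maxLifetime(Map.of("DB_MAXWAIT", "1800000"))); // 60000
    }
}
```

The silent fallback is what made this hard to spot: nothing failed, the pool just recycled connections far more aggressively than configured.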
The env var name fix (DB_MAXWAIT → DB_MAX_WAIT) changes the effective maxLifetime from 60s (code default when var wasn't found) to 1800s (30min from setenv.sh). Combined with the CI test pool limit of only 15 connections, this causes pool exhaustion in AI embeddings tests. Set DB_MAX_WAIT=120000 (2min) in both pom.xml and docker-compose.yml to keep test pool behavior stable with the constrained connection limit. https://claude.ai/code/session_013tzbmrxwwC4FozZ5P4wP7B
Apply the same DB_MAX_WAIT=120000 (2min) override to e2e-java, e2e-node, dotcms-ui-e2e, and karate test modules. All use DB_MAX_TOTAL=15 and would be affected by the effective maxLifetime change from 60s to 30min. https://claude.ai/code/session_013tzbmrxwwC4FozZ5P4wP7B
runSQL() used getPGVectorConnection(), which calls PGvector.addVectorType() for DDL operations like CREATE EXTENSION and CREATE TABLE. When called before the pgvector extension exists, addVectorType queries pg_type and caches OID=0 for the "vector" type. This stale cached value persists for the connection's lifetime in HikariCP, causing "Unknown type vector" errors when that connection is later used for queries with PGvector parameters (e.g. countEmbeddings).

The bug became more likely to manifest after the DB_MAX_WAIT fix because:
- DB_MIN_IDLE=1 (fewer pool connections = higher reuse of poisoned conn)
- Longer maxLifetime = poisoned connection stays in pool longer

Fix: use a plain connection (without addVectorType) for DDL operations, which don't need the vector type registered.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
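The poisoning mechanism is a per-connection cache that is populated once and never invalidated. The following is a standalone illustration of that pattern (not dotCMS or pgvector-java code; the class and OID value are invented for the demo):

```java
import java.util.HashMap;
import java.util.Map;

public class OidCacheDemo {
    // Stands in for a pooled JDBC connection with a per-connection type cache.
    static class Conn {
        private final Map<String, Integer> oidCache = new HashMap<>();
        boolean extensionInstalled = false;

        // Mimics the addVectorType() behavior described above: look up the
        // type's OID once and cache whatever came back, even a miss (0).
        void registerVectorType() {
            oidCache.computeIfAbsent("vector",
                    t -> extensionInstalled ? 16_385 : 0); // 16385 is a made-up OID
        }

        boolean canBindVector() {
            return oidCache.getOrDefault("vector", 0) != 0;
        }
    }

    public static void main(String[] args) {
        Conn conn = new Conn();
        conn.registerVectorType();      // runs before CREATE EXTENSION: caches OID=0
        conn.extensionInstalled = true; // DDL later succeeds on the same connection
        conn.registerVectorType();      // cache hit: the stale OID=0 survives
        System.out.println(conn.canBindVector()); // false -> "Unknown type vector"
    }
}
```

Because the pool hands the same physical connection back out, the stale entry outlives the request that created it, which is why routing DDL through a plain connection (no type registration) removes the hazard.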
The previous value of 120000ms (2 min) was less than the default idleTimeout of 300000ms (5 min), causing HikariCP to log: "idleTimeout is close to or more than maxLifetime, disabling it." Bumping to 600000ms (10 min) ensures maxLifetime > idleTimeout.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
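The ordering constraint in that commit is simple arithmetic, but easy to violate when tuning one knob at a time. A minimal check, using the values from the message above (the helper name is invented):

```java
public class LifetimeCheck {
    // Idle eviction only does anything when connections can outlive the
    // idle threshold, i.e. maxLifetime must exceed idleTimeout.
    static boolean idleTimeoutEffective(long maxLifetimeMs, long idleTimeoutMs) {
        return maxLifetimeMs > idleTimeoutMs;
    }

    public static void main(String[] args) {
        long idleTimeoutMs = 300_000L; // 5 min default, per the commit message
        System.out.println(idleTimeoutEffective(120_000L, idleTimeoutMs)); // false: old 2 min value
        System.out.println(idleTimeoutEffective(600_000L, idleTimeoutMs)); // true: bumped 10 min value
    }
}
```

When the check fails, HikariCP does not error out; it disables idleTimeout and logs the warning quoted above, so the misconfiguration only shows up in logs.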
-# Max Connection Lifetime 15m
-export DB_MAX_WAIT=${DB_MAX_WAIT:-"900000"}
+# Max Connection Lifetime 30m
+export DB_MAX_WAIT=${DB_MAX_WAIT:-"1800000"}
I do not understand this - if dotCMS is waiting 30s for a db connection, dotCMS is literally dead in the water. It might almost be better to fail fast, e.g. lower it to 5 seconds, instead of queueing up a bunch of waiting connections, no?
There is a broader discussion to be had on this; 30s was not chosen at random, and it ties into the changes to max wait and max lifetime. A new connection in our infrastructure can take "a few seconds" even when things are running well, and Hikari chose 30s as its default based on real-world usage in cloud environments. There is a broader question, though, of how this relates to request timeouts and how we handle resilience in general. I have a full discussion here: https://docs.google.com/document/d/1zvUhexNfryQ8GTMX-PrAGqH8_9JWqFFt3hgD7NowvTY/edit?usp=sharing
I think that 5s may be too small from what I have seen. To balance the existing architecture, and in combination with a properly sized pool, I would be fine starting with 10 or 15s. The key is really to monitor properly so we know. We should be moving away from the current pattern where a request from the app triggers a new physical connection almost every time; that in itself should reduce delays and load on the backend db.
Proposed Changes
- Increase DB_MAX_WAIT from 900000ms (15m) to 1800000ms (30m) to allow longer connection lifetime
- Reduce DB_MIN_IDLE from 3 to 1 to reduce minimum idle connections in the pool
- Increase DB_CONNECTION_TIMEOUT from 5000ms (5s) to 30000ms (30s) for new connection attempts
- Rename DB_MAXWAIT → DB_MAX_WAIT in DataSourceStrategyProvider to match the actual environment variable used in setenv.sh

Checklist
Additional Info
The configuration changes optimize database connection pool behavior by allowing longer connection lifetimes and more lenient timeout windows, while reducing the minimum idle connection overhead. The variable name fix ensures the code correctly references the environment variable defined in the shell configuration.
https://claude.ai/code/session_01QWBxEhHLYeQKZNQAqawGuM
This PR fixes: #34921