feat(storage): migrate all data mounts to versioned 16/main subdirectories #1649
Draft
marceloneppel wants to merge 19 commits into
…ories
Introduce versioned path constants (POSTGRESQL_DATA_DIR, ARCHIVE_DATA_DIR,
LOGS_DATA_DIR, TEMP_DATA_DIR = <storage-root>/16/main) alongside the existing
storage-root constants. All internal path references are updated to use the
versioned paths.
On snap refresh, _ensure_storage_layout() performs a one-time, idempotent
migration of existing files from each storage root into the versioned
subdirectory, recording completion via a `storage_layout_migrated` flag in
the peer-relation databag. _repair_pg_wal_symlink() updates the pg_wal
symlink to point at the new WAL path after migration.
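
As a rough illustration of the idempotent move and symlink repair described above, here is a minimal Python sketch. The function names `migrate_storage_root` and `repair_symlink`, and the "skip anything already present" policy, are assumptions made for illustration. They are not the charm's actual `_ensure_storage_layout()` or `_repair_pg_wal_symlink()` code, and the sketch leaves out the peer-databag flag.

```python
import os
import shutil
from pathlib import Path

# Versioned subdirectory named in the commit message.
VERSIONED_SUBDIR = Path("16") / "main"


def migrate_storage_root(storage_root: Path) -> Path:
    """Move every entry under storage_root into storage_root/16/main.

    Safe to call repeatedly: entries that were already moved are skipped.
    """
    target = storage_root / VERSIONED_SUBDIR
    target.mkdir(parents=True, exist_ok=True)
    top_level = VERSIONED_SUBDIR.parts[0]  # "16"
    for entry in list(storage_root.iterdir()):
        if entry.name == top_level:
            continue  # never move the versioned tree into itself
        destination = target / entry.name
        if destination.exists() or destination.is_symlink():
            continue  # assumed policy: keep what is already in place
        shutil.move(str(entry), str(destination))
    return target


def repair_symlink(link: Path, new_target: Path) -> None:
    """Point an existing symlink at new_target (assumes link is a symlink)."""
    if link.is_symlink() and Path(os.readlink(link)) == new_target:
        return  # already correct, nothing to do
    tmp = link.with_name(link.name + ".tmp")
    if tmp.is_symlink() or tmp.exists():
        tmp.unlink()
    tmp.symlink_to(new_target)
    os.replace(tmp, link)  # rename over the stale link in one step
```

Running `migrate_storage_root` a second time is a no-op because only the `16` directory remains at the top level, which is what makes the operation safe to retry after an interrupted refresh.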
_reconcile_storage_permissions() runs unconditionally on every refresh to
heal storage root permissions (0755) and pg_wal symlink ownership
(_daemon_:_daemon_), fixing units migrated by earlier builds.
On the primary, _ensure_temp_tablespace_location() migrates the temp
tablespace location in the PostgreSQL catalog. The connection requires
autocommit=True since DROP/CREATE TABLESPACE cannot run inside a transaction
block. _clear_pg_version_dirs() purges PG_* version directories from the
target path before CREATE TABLESPACE to prevent ObjectInUse errors.
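
The sequence in this paragraph (autocommit, drop, purge PG_* directories, create) can be sketched as follows. This is an illustrative outline only: `move_temp_tablespace` is a hypothetical name, the tablespace name `temp` is an assumption, and `connection` is assumed to be a psycopg2-style connection exposing an `autocommit` attribute and `cursor()`.

```python
import re
import shutil
from pathlib import Path

# PostgreSQL creates PG_<version>_<catalog version> directories inside a
# tablespace location.
PG_VERSION_DIR = re.compile(r"^PG_\d+_\d+$")


def clear_pg_version_dirs(location: Path) -> list[str]:
    """Remove PG_* version directories under location; return their names."""
    removed: list[str] = []
    if not location.is_dir():
        return removed
    for entry in sorted(location.iterdir()):
        is_real_dir = entry.is_dir() and not entry.is_symlink()
        if is_real_dir and PG_VERSION_DIR.match(entry.name):
            shutil.rmtree(entry)
            removed.append(entry.name)
    return removed


def move_temp_tablespace(connection, location: Path, name: str = "temp") -> None:
    """Drop and recreate the tablespace at location.

    `name` must be a trusted constant because it is interpolated into the SQL.
    """
    # DROP/CREATE TABLESPACE cannot run inside a transaction block.
    connection.autocommit = True
    cursor = connection.cursor()
    try:
        cursor.execute(f'DROP TABLESPACE IF EXISTS "{name}"')
        # Purge leftovers so CREATE TABLESPACE does not find the directory in use.
        clear_pg_version_dirs(location)
        cursor.execute(f'CREATE TABLESPACE "{name}" LOCATION %s', (str(location),))
    finally:
        cursor.close()
```

Only directories matching the PG_* pattern are removed, so unrelated content under the target path is left alone.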
patroni.yml.j2 now uses a {{ wal_dir }} template variable instead of the
hardcoded storage-root WAL path.
Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
After the PR introduced versioned storage paths (16/main), three bugs prevented temp tablespaces from recovering after a reboot:
1. TEMP_DATA_DIR was not recreated on tmpfs after the migration flag was already set. Add an else-branch in _ensure_storage_layout that calls mkdir(parents=True, exist_ok=True) without setting owner/mode. The resulting root-owned directory triggers set_up_database's permission check, which calls _handle_temp_tablespace_on_reboot to reinitialise the tablespace directory structure for the tmpfs case.
2. As a belt-and-suspenders fallback, _ensure_temp_tablespace_location now detects an empty TEMP_DATA_DIR (no PG_<ver>_<catver>/ subdir) and drops then recreates the tablespace so PostgreSQL rebuilds the internal directory structure.
3. _ensure_temp_tablespace_location_if_primary raised PostgreSQLUndefinedHostError when primary_endpoint was None during early startup. Return True early when the endpoint is not yet available.
The persistent storage integration test assertion for the library's "Fixed permissions" log message is softened to a warning: that message only appears when permissions needed fixing, but units initialised by _migrate_storage_mount already have correct owner/mode so the fix path is not triggered.
Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
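
The three checks in this commit message can be pictured with the small Python sketch below. The helper names `ensure_temp_dir`, `tablespace_needs_rebuild` and `should_skip_migration` are hypothetical and only mirror the described behaviour. They are not the charm's methods.

```python
from pathlib import Path
from typing import Optional


def ensure_temp_dir(temp_data_dir: Path) -> bool:
    """Recreate the directory if a tmpfs wipe removed it.

    Owner and mode are deliberately left untouched, as in the commit message,
    so a later permission check can notice the freshly created directory.
    Returns True when the directory had to be created.
    """
    existed = temp_data_dir.is_dir()
    temp_data_dir.mkdir(parents=True, exist_ok=True)
    return not existed


def tablespace_needs_rebuild(temp_data_dir: Path) -> bool:
    """True when no PG_<ver>_<catver> subdirectory exists in the location."""
    return not any(
        entry.is_dir() and entry.name.startswith("PG_")
        for entry in temp_data_dir.iterdir()
    )


def should_skip_migration(primary_endpoint: Optional[str]) -> bool:
    """Skip (and retry later) while the primary endpoint is not yet known."""
    return primary_endpoint is None
```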
…_correct
After the versioned storage layout migration (PR #1649), PostgreSQL's data_directory is POSTGRESQL_DATA_DIR (.../postgresql/16/main) rather than the Juju storage mount point STORAGE_PATH (.../postgresql). Update the assertion accordingly.
Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
…is active
The DROP/CREATE TABLESPACE during the temp tablespace location migration generates WAL that is streamed to the standby cluster. If the standby has not yet been upgraded to the versioned storage layout, it lacks the TEMP_DATA_DIR directory, causing PostgreSQL to crash with "FATAL: directory does not exist" during WAL replay, leaving all standby members stuck in "start failed" state.
Defer the migration by returning early from _ensure_temp_tablespace_location_if_primary() when cross-cluster async replication is active. The _on_update_status and _on_peer_relation_changed hooks will retry it after the standby has been upgraded too.
Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
…al snap daemon for safe rollback
Context: The initial versioned storage layout (c2f8d9b) was one-way only; migration ran in the snap post-refresh hook and had no rollback path. These changes re-architect it: the charm (unconfined) handles forward migration before the snap refresh, while a new migrate-data daemon in the snap handles both forward (belt-and-suspenders) and reverse migration by reading the Patroni YAML for direction detection. This enables clean rollback from versioned paths to root layout when reverting to 16/stable.
Includes: simplified _ensure_storage_layout() in charm, _migrate_storage_roots_to_versioned(), removed obsoleted migration code/constants, bumped snap revisions to 310/311, fixed 16/ parent dir chown, updated unit tests, added IPv6 URL handling and increased timeouts in integration tests.
Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
…rage-layout Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
Move forward and reverse storage layout migration from the charm into the snap's post-refresh and pre-refresh hooks. The charm now only ensures the ephemeral temp tablespace directory exists, which may live on a tmpfs mount wiped on reboot.
Data migration direction is determined by the snap hooks reading the already-rendered Patroni YAML as ground truth: versioned paths (16/main) trigger forward migration, root paths trigger reverse. This makes rollbacks from a versioned-storage charm to an older root-storage charm transparent.
Pin snap revisions: amd64 313, aarch64/arm64 314.
Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
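
To make the direction rule concrete, here is a hedged Python sketch of the decision. It assumes the hooks inspect the `data_dir` key of the rendered Patroni YAML. The function name `migration_direction`, the line-based parsing, and the example paths are illustrative assumptions, and the real snap hooks are not necessarily written in Python.

```python
import re
from typing import Optional

# Versioned layout suffix from the commit message.
VERSIONED_SUFFIX = "/16/main"

# Matches a "data_dir: <path>" line at any indentation level.
DATA_DIR_LINE = re.compile(r"^\s*data_dir:\s*(?P<path>\S+)\s*$", re.MULTILINE)


def migration_direction(patroni_yaml_text: str) -> Optional[str]:
    """Return "forward", "reverse", or None when data_dir is absent."""
    match = DATA_DIR_LINE.search(patroni_yaml_text)
    if match is None:
        return None
    # Drop optional YAML quotes and any trailing slash before comparing.
    path = match.group("path").strip("'\"").rstrip("/")
    return "forward" if path.endswith(VERSIONED_SUFFIX) else "reverse"
```

For example, a rendered file containing `data_dir: /example/postgresql/16/main` yields `"forward"`, while `data_dir: /example/postgresql` yields `"reverse"`. Using the rendered file as the single source of truth means the hook needs no extra state to know which charm revision is in control.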
Measurements from 16/stable snap refreshes showed that 5-minute timeouts are too tight for the charm_refresh flow. Snap download and install takes ~2.5 min per unit, and database initialization can stall briefly during rolling restarts. Bump the blocked-wait, force-refresh-start, and agents-idle timeouts to 10 minutes, and resume-refresh to 15 minutes, matching the empirically observed timing.
Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
…rage-layout Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
…dler
Replace _ensure_temp_tablespace_location and _ensure_temp_tablespace_location_if_primary with _migrate_temp_tablespace_location, a one-shot handler that only handles the old-to-versioned-path migration (Scenario C). Other scenarios (missing tablespace, tmpfs wipe) are already covered by set_up_database. Drop calls from _on_peer_relation_changed and _on_update_status since migration is a one-time upgrade event.
Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
Update docstrings in _migrate_temp_tablespace_location and _ensure_storage_layout to note that the reverse catalog migration for the temp tablespace is handled by the snap's pre-refresh hook during rollback, not by the charm itself. The charm's existing forward-only migration remains unchanged.
Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
…rage-layout Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
Snap rev 323 had an unbound SNAP_CURRENT variable in migrate-data.sh that silently crashed the data migration daemon. Rev 329 is built from the corrected source that uses SNAP_DATA.
Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
…rage-layout Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
The previous arm64 snap revision (322) had the SNAP_CURRENT unbound variable bug in migrate-data.sh, causing upgrade failures on arm64.
Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
…y shutdown test
The 180s timeout was insufficient on CI where a new replica joining after a force-destroyed primary can take longer to bootstrap via basebackup and register in the Raft cluster. Increased to 600s.
Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
…rage-layout Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
…cher network recovery Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
Issue
Solution
Introduce versioned path constants (POSTGRESQL_DATA_DIR, ARCHIVE_DATA_DIR, LOGS_DATA_DIR, TEMP_DATA_DIR = <storage-root>/16/main) alongside the existing storage-root constants. All internal path references are updated to use the versioned paths.
On snap refresh, _ensure_storage_layout() performs a one-time, idempotent migration of existing files from each storage root into the versioned subdirectory, recording completion via a `storage_layout_migrated` flag in the peer-relation databag. _repair_pg_wal_symlink() updates the pg_wal symlink to point at the new WAL path after migration.
_reconcile_storage_permissions() runs unconditionally on every refresh to heal storage root permissions (0755) and pg_wal symlink ownership (_daemon_:_daemon_), fixing units migrated by earlier builds.
On the primary, _ensure_temp_tablespace_location() migrates the temp tablespace location in the PostgreSQL catalog. The connection requires autocommit=True since DROP/CREATE TABLESPACE cannot run inside a transaction block. _clear_pg_version_dirs() purges PG_* version directories from the target path before CREATE TABLESPACE to prevent ObjectInUse errors.
patroni.yml.j2 now uses a {{ wal_dir }} template variable instead of the hardcoded storage-root WAL path.
Checklist