feat(storage): migrate all data mounts to versioned 16/main subdirectories #1649

Draft
marceloneppel wants to merge 19 commits into 16/edge from feat/versioned-storage-layout

Conversation

@marceloneppel
Member

Issue

Solution

Introduce versioned path constants (POSTGRESQL_DATA_DIR, ARCHIVE_DATA_DIR, LOGS_DATA_DIR, TEMP_DATA_DIR = <storage-root>/16/main) alongside the existing storage-root constants. All internal path references are updated to use the versioned paths.
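As a rough illustration of how such constants could be derived, here is a minimal sketch. The `/example/...` roots and the `ARCHIVE_PATH`, `LOGS_PATH`, `TEMP_PATH` and `VERSIONED_SUFFIX` names are assumptions made for the example; the charm's real mount points are not listed in this description.

```python
from pathlib import PurePosixPath

# Hypothetical storage roots, used only for illustration.
STORAGE_PATH = PurePosixPath("/example/postgresql")
ARCHIVE_PATH = PurePosixPath("/example/archive")
LOGS_PATH = PurePosixPath("/example/logs")
TEMP_PATH = PurePosixPath("/example/temp")

# Every versioned directory is <storage-root>/16/main.
VERSIONED_SUFFIX = PurePosixPath("16/main")

POSTGRESQL_DATA_DIR = STORAGE_PATH / VERSIONED_SUFFIX
ARCHIVE_DATA_DIR = ARCHIVE_PATH / VERSIONED_SUFFIX
LOGS_DATA_DIR = LOGS_PATH / VERSIONED_SUFFIX
TEMP_DATA_DIR = TEMP_PATH / VERSIONED_SUFFIX
```

Keeping the storage-root constants next to the versioned ones lets code that needs the mount point itself (for example permission fixes) keep working unchanged.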

On snap refresh, _ensure_storage_layout() performs a one-time, idempotent migration of existing files from each storage root into the versioned subdirectory, recording completion via a storage_layout_migrated flag in the peer-relation databag. _repair_pg_wal_symlink() updates the pg_wal symlink to point at the new WAL path after migration.
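The one-time, idempotent move can be pictured roughly as follows. This is a simplified sketch, not the code in this PR: the helper names, the plain `dict` standing in for the peer-relation databag, and the "skip what already exists" conflict policy are all assumptions.

```python
import shutil
import tempfile
from pathlib import Path

MIGRATED_FLAG = "storage_layout_migrated"


def migrate_root_to_versioned(root: Path, versioned: Path) -> int:
    """Move entries under root into versioned; return how many were moved."""
    versioned.mkdir(parents=True, exist_ok=True)
    # Top-level name that contains the versioned tree (for example "16").
    skip = versioned.relative_to(root).parts[0]
    moved = 0
    for entry in sorted(root.iterdir()):
        target = versioned / entry.name
        if entry.name == skip or target.exists() or target.is_symlink():
            continue  # the versioned tree itself, or already migrated
        shutil.move(str(entry), str(target))
        moved += 1
    return moved


def ensure_storage_layout(pairs, databag: dict) -> None:
    """Run the migration once, then record completion in the databag."""
    if databag.get(MIGRATED_FLAG) == "true":
        return
    for root, versioned in pairs:
        migrate_root_to_versioned(root, versioned)
    databag[MIGRATED_FLAG] = "true"


# Demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "base").mkdir()
    (root / "PG_VERSION").write_text("16\n")
    versioned = root / "16" / "main"
    databag = {}
    ensure_storage_layout([(root, versioned)], databag)
    top_level = sorted(p.name for p in root.iterdir())
    migrated = sorted(p.name for p in versioned.iterdir())
    second_run_moved = migrate_root_to_versioned(root, versioned)
```

Running the move a second time finds nothing left to relocate, which is what makes it safe to repeat even without the flag.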

_reconcile_storage_permissions() runs unconditionally on every refresh to heal storage root permissions (0755) and pg_wal symlink ownership (_daemon_:_daemon_), fixing units migrated by earlier builds.
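A permission reconciliation of this kind might look like the sketch below. The function signature is an assumption; the real implementation would presumably resolve the snap daemon user's uid and gid, whereas the demonstration uses the current user so it can run without root privileges.

```python
import os
import tempfile
from pathlib import Path


def reconcile_storage_permissions(root: Path, wal_link: Path, uid: int, gid: int) -> None:
    """Set root to 0755 and the symlink itself (not its target) to uid:gid."""
    if (root.stat().st_mode & 0o777) != 0o755:
        os.chmod(root, 0o755)
    if wal_link.is_symlink():
        link_stat = os.lstat(wal_link)
        if (link_stat.st_uid, link_stat.st_gid) != (uid, gid):
            # lchown changes the link itself; chown would follow it.
            os.lchown(wal_link, uid, gid)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "data"
    root.mkdir(mode=0o700)
    wal_target = Path(tmp) / "wal"
    wal_target.mkdir()
    wal_link = root / "pg_wal"
    wal_link.symlink_to(wal_target)
    # Stand-in owner: the current user instead of the snap daemon user.
    for _ in range(2):  # the second pass should change nothing
        reconcile_storage_permissions(root, wal_link, os.getuid(), os.getgid())
    root_mode = oct(root.stat().st_mode & 0o777)
```

Because each step first checks the current state, calling it on every refresh is harmless for units that are already correct.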

On the primary, _ensure_temp_tablespace_location() migrates the temp tablespace location in the PostgreSQL catalog. The connection requires autocommit=True since DROP/CREATE TABLESPACE cannot run inside a transaction block. _clear_pg_version_dirs() purges PG_* version directories from the target path before CREATE TABLESPACE to prevent ObjectInUse errors.
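The two pieces described here can be sketched as below, assuming a psycopg2-style connection (an `autocommit` attribute and client-side `%s` parameter binding). The function names, the tablespace name passed by the caller, and the placeholder directory name are illustrative only.

```python
import shutil
import tempfile
from pathlib import Path


def clear_pg_version_dirs(location: Path) -> list[str]:
    """Remove leftover PG_* directories so CREATE TABLESPACE finds a clean target."""
    removed = []
    for entry in sorted(location.glob("PG_*")):
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
            removed.append(entry.name)
    return removed


def move_temp_tablespace(connection, name: str, location: Path) -> None:
    """Drop and recreate a tablespace at location (psycopg2-style connection)."""
    # Tablespace DDL is rejected inside a transaction block, so autocommit is
    # enabled and each statement is sent on its own.
    connection.autocommit = True
    cursor = connection.cursor()
    try:
        # name is assumed to be a trusted constant, not user input.
        cursor.execute(f'DROP TABLESPACE IF EXISTS "{name}"')
        clear_pg_version_dirs(location)
        cursor.execute(f'CREATE TABLESPACE "{name}" LOCATION %s', (str(location),))
    finally:
        cursor.close()


# Demonstration of the directory purge only (no database needed).
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp)
    (target / "PG_16_123456789").mkdir()  # placeholder PG_<ver>_<catver> name
    (target / "keep.txt").write_text("unrelated file\n")
    removed_names = clear_pg_version_dirs(target)
    remaining = sorted(p.name for p in target.iterdir())
```

Sending DROP and CREATE as separate `execute` calls matters: several statements in one query string are treated by the server as an implicit transaction block.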

patroni.yml.j2 now uses a {{ wal_dir }} template variable instead of the hardcoded storage-root WAL path.

Checklist

  • I have added or updated any relevant documentation.
  • I have cleaned any remaining cloud resources from my accounts.

…ories

Introduce versioned path constants (POSTGRESQL_DATA_DIR, ARCHIVE_DATA_DIR,
LOGS_DATA_DIR, TEMP_DATA_DIR = <storage-root>/16/main) alongside the existing
storage-root constants. All internal path references are updated to use the
versioned paths.

On snap refresh, _ensure_storage_layout() performs a one-time, idempotent
migration of existing files from each storage root into the versioned
subdirectory, recording completion via a `storage_layout_migrated` flag in
the peer-relation databag. _repair_pg_wal_symlink() updates the pg_wal
symlink to point at the new WAL path after migration.

_reconcile_storage_permissions() runs unconditionally on every refresh to
heal storage root permissions (0755) and pg_wal symlink ownership
(_daemon_:_daemon_), fixing units migrated by earlier builds.

On the primary, _ensure_temp_tablespace_location() migrates the temp
tablespace location in the PostgreSQL catalog. The connection requires
autocommit=True since DROP/CREATE TABLESPACE cannot run inside a transaction
block. _clear_pg_version_dirs() purges PG_* version directories from the
target path before CREATE TABLESPACE to prevent ObjectInUse errors.

patroni.yml.j2 now uses a {{ wal_dir }} template variable instead of the
hardcoded storage-root WAL path.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
@github-actions (bot) added the "Libraries: OK" label (the charm libs used are OK and in-sync) on Apr 24, 2026
After the PR introduced versioned storage paths (16/main), three bugs
prevented temp tablespaces from recovering after a reboot:

1. TEMP_DATA_DIR was not recreated on tmpfs after the migration flag was
   already set. Add an else-branch in _ensure_storage_layout that calls
   mkdir(parents=True, exist_ok=True) without setting owner/mode. The
   resulting root-owned directory triggers set_up_database's permission
   check, which calls _handle_temp_tablespace_on_reboot to reinitialise
   the tablespace directory structure for the tmpfs case.

2. As a belt-and-suspenders fallback, _ensure_temp_tablespace_location
   now detects an empty TEMP_DATA_DIR (no PG_<ver>_<catver>/ subdir) and
   drops then recreates the tablespace so PostgreSQL rebuilds the internal
   directory structure.

3. _ensure_temp_tablespace_location_if_primary raised
   PostgreSQLUndefinedHostError when primary_endpoint was None during
   early startup. Return True early when the endpoint is not yet available.
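The three fixes above can be condensed into a small decision sketch. The function names and the string return values are hypothetical, and `PG_16_123456789` is a placeholder for a real PG_<ver>_<catver> directory name.

```python
import tempfile
from pathlib import Path
from typing import Optional


def has_tablespace_structure(temp_dir: Path) -> bool:
    """True when PostgreSQL's PG_<ver>_<catver> subdirectory is present."""
    return any(p.is_dir() for p in temp_dir.glob("PG_*"))


def plan_temp_tablespace_action(temp_dir: Path, primary_endpoint: Optional[str]) -> str:
    # Fix 1: always make sure the directory exists (tmpfs may have been wiped).
    temp_dir.mkdir(parents=True, exist_ok=True)
    # Fix 3: too early in startup to reach the primary; succeed without acting.
    if primary_endpoint is None:
        return "skip"
    # Fix 2: an empty directory means the tablespace must be dropped and recreated.
    if not has_tablespace_structure(temp_dir):
        return "recreate"
    return "noop"


with tempfile.TemporaryDirectory() as tmp:
    temp_dir = Path(tmp) / "16" / "main"
    early_startup = plan_temp_tablespace_action(temp_dir, None)
    after_reboot = plan_temp_tablespace_action(temp_dir, "10.0.0.1")
    (temp_dir / "PG_16_123456789").mkdir()
    healthy = plan_temp_tablespace_action(temp_dir, "10.0.0.1")
```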

The persistent storage integration test assertion for the library's
"Fixed permissions" log message is softened to a warning: that message
only appears when permissions needed fixing, but units initialised by
_migrate_storage_mount already have correct owner/mode so the fix path
is not triggered.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
…_correct

After the versioned storage layout migration (PR #1649), PostgreSQL's
data_directory is POSTGRESQL_DATA_DIR (.../postgresql/16/main) rather
than the Juju storage mount point STORAGE_PATH (.../postgresql).
Update the assertion accordingly.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
…is active

The DROP/CREATE TABLESPACE during the temp tablespace location migration
generates WAL that is streamed to the standby cluster.  If the standby
has not yet been upgraded to the versioned storage layout, it lacks the
TEMP_DATA_DIR directory, causing PostgreSQL to crash with "FATAL: directory
does not exist" during WAL replay — leaving all standby members stuck in
"start failed" state.

Defer the migration by returning early from
_ensure_temp_tablespace_location_if_primary() when cross-cluster async
replication is active.  The _on_update_status and _on_peer_relation_changed
hooks will retry it after the standby has been upgraded too.
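The deferral amounts to a guard in front of the migration, roughly as in this sketch. The function and parameter names are hypothetical; only the guard is shown, not how the charm detects an active async replication relation.

```python
from typing import Callable


def migrate_if_safe(
    is_primary: bool,
    async_replication_active: bool,
    migrate: Callable[[], None],
) -> bool:
    """Return True when the migration ran, False when it was deferred."""
    if not is_primary or async_replication_active:
        # Deferred: a later hook invocation is expected to try again.
        return False
    migrate()
    return True


calls = []
deferred = migrate_if_safe(True, True, lambda: calls.append("migrated"))
ran = migrate_if_safe(True, False, lambda: calls.append("migrated"))
```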

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
…al snap daemon for safe rollback

Context: The initial versioned storage layout (c2f8d9b) was one-way only — migration ran in the snap post-refresh hook and had no
rollback path. These changes re-architect it: the charm (unconfined) handles forward migration before the snap refresh, while a new
migrate-data daemon in the snap handles both forward (belt-and-suspenders) and reverse migration by reading the Patroni YAML for
direction detection. Enables clean rollback from versioned paths to root layout when reverting to 16/stable.

Includes: simplified _ensure_storage_layout() in charm, _migrate_storage_roots_to_versioned(), removed obsoleted migration
code/constants, bumped snap revisions to 310/311, fixed 16/ parent dir chown, updated unit tests, added IPv6 URL handling and
increased timeouts in integration tests.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
…rage-layout

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
Move forward and reverse storage layout migration from the charm into
the snap's post-refresh and pre-refresh hooks.  The charm now only
ensures the ephemeral temp tablespace directory exists, which may
live on a tmpfs mount wiped on reboot.

Data migration direction is determined by the snap hooks reading the
already-rendered Patroni YAML as ground truth — versioned paths
(16/main) trigger forward migration, root paths trigger reverse.
This makes rollbacks from a versioned-storage charm to an older
root-storage charm transparent.
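The direction check can be illustrated in Python as follows. This is a sketch of the idea only, not the snap hook itself: it assumes the rendered file contains a single Patroni `data_dir:` entry, and the `/example/...` paths are made up.

```python
import re


def detect_migration_direction(patroni_yaml_text: str) -> str:
    """Return "forward", "reverse", or "unknown" based on the rendered data_dir."""
    match = re.search(r"^\s*data_dir:\s*(\S+)\s*$", patroni_yaml_text, re.MULTILINE)
    if match is None:
        return "unknown"
    data_dir = match.group(1).strip("'\"").rstrip("/")
    # Versioned paths end in 16/main; anything else is treated as the root layout.
    return "forward" if data_dir.endswith("/16/main") else "reverse"


versioned_config = "postgresql:\n  data_dir: /example/postgresql/16/main\n"
root_config = "postgresql:\n  data_dir: /example/postgresql\n"

forward = detect_migration_direction(versioned_config)
reverse = detect_migration_direction(root_config)
unknown = detect_migration_direction("postgresql: {}\n")
```

Using the already-rendered configuration as ground truth means the hooks need no extra state to tell an upgrade from a rollback.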

Pin snap revisions: amd64 313, aarch64/arm64 314.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
Measurements from 16/stable snap refreshes showed that 5-minute
timeouts are too tight for the charm_refresh flow. Snap download and
install takes ~2.5 min per unit, and database initialization can stall
briefly during rolling restarts. Bump the blocked-wait, force-refresh-
start, and agents-idle timeouts to 10 minutes, and resume-refresh to
15 minutes, matching the empirically observed timing.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
…rage-layout

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
…dler

Replace _ensure_temp_tablespace_location and _ensure_temp_tablespace_location_if_primary with _migrate_temp_tablespace_location, a
one-shot handler that only handles the old-to-versioned-path migration (Scenario C). Other scenarios (missing tablespace, tmpfs
wipe) are already covered by set_up_database. Drop calls from _on_peer_relation_changed and _on_update_status since migration is a
one-time upgrade event.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
Update docstrings in _migrate_temp_tablespace_location and
_ensure_storage_layout to note that the reverse catalog migration
for the temp tablespace is handled by the snap's pre-refresh hook
during rollback, not by the charm itself.  The charm's existing
forward-only migration remains unchanged.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
…rage-layout

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
@github-actions (bot) added the "Libraries: Out of sync" label (the charm libs used are out-of-sync) and removed the "Libraries: OK" label (the charm libs used are OK and in-sync) on May 19, 2026
Snap rev 323 had an unbound SNAP_CURRENT variable in migrate-data.sh
that silently crashed the data migration daemon. Rev 329 is built from
the corrected source that uses SNAP_DATA.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
…rage-layout

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
The previous arm64 snap revision (322) had the SNAP_CURRENT unbound
variable bug in migrate-data.sh, causing upgrade failures on arm64.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
…y shutdown test

The 180s timeout was insufficient on CI where a new replica joining
after a force-destroyed primary can take longer to bootstrap via
basebackup and register in the Raft cluster. Increased to 600s.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
…rage-layout

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
…cher network recovery

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
Labels

Libraries: Out of sync The charm libs used are out-of-sync
