Claude/fix utc tzdata timestamp [ENG-225] #75
Conversation
Pull request overview
This PR fixes two bugs: (1) a `ZoneInfoNotFoundError` when converting UTC-aware PyArrow timestamps on minimal systems lacking a timezone database, and (2) internal meta columns (e.g. `__computed`, `__pod_ts`) leaking into downstream streams, which caused `DuplicateError` in multi-input Join operations.
Changes:
- Added `tzdata>=2024.1` as a runtime dependency to ensure timezone data is always available for PyArrow UTC timestamp conversion.
- Added `ColumnConfig.meta` filtering to `FunctionNode.as_table()`, so meta columns are properly excluded (or selectively included) when building output tables.
- Changed `iter_packets()` and `async_execute()` to drop all meta-prefixed columns (not just `__input_packet_hash`) from cached-results tables before constructing replay streams.
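The tzdata fix can be illustrated with the standard-library `zoneinfo` lookup that fails on minimal systems; a minimal sketch, assuming PyArrow's UTC timestamp conversion consults the same IANA database:

```python
# On a minimal image with no /usr/share/zoneinfo, this lookup raises
# ZoneInfoNotFoundError unless the `tzdata` PyPI package is installed;
# adding tzdata>=2024.1 as a dependency makes the lookup self-contained.
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError

try:
    utc = ZoneInfo("UTC")
    print(str(utc))  # the key round-trips as "UTC"
except ZoneInfoNotFoundError:
    print("no timezone database found -- add the `tzdata` package")
```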
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `pyproject.toml` | Added `tzdata>=2024.1` runtime dependency for portable timezone support |
| `src/orcapod/core/nodes/function_node.py` | Added meta column filtering in `as_table()`, broadened meta column stripping in `iter_packets()` and `async_execute()` |
```python
# Collection[str]: keep only meta columns matching the specified prefixes
drop_columns.extend(
    c
    for c in self._cached_output_table.column_names
    if c.startswith(constants.META_PREFIX)
    and not any(c.startswith(p) for p in column_config.meta)
```
The `Collection[str]` branch does not normalize user-supplied prefixes by prepending `constants.META_PREFIX` when needed. In contrast, `datagram.py` (lines 483–488) normalizes prefixes so that e.g. `meta=["pipeline"]` correctly matches `"__pipeline"`.
As written, if a user passes `meta=["pipeline"]`, the `c.startswith(p)` check compares column names (which start with `"__"`) against the bare prefix `"pipeline"`, so no columns will match and all meta columns will be dropped.
The prefixes in `column_config.meta` should be normalized the same way `datagram.py` does it: prepend `constants.META_PREFIX` to any prefix that doesn't already start with it.
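A hedged sketch of the normalization the comment asks for (`META_PREFIX` and the helper name are assumptions for illustration, mirroring what `datagram.py` reportedly does):

```python
META_PREFIX = "__"  # assumed value of constants.META_PREFIX

def normalize_meta_prefixes(prefixes):
    # Prepend META_PREFIX to any prefix that doesn't already start with it
    return [p if p.startswith(META_PREFIX) else f"{META_PREFIX}{p}" for p in prefixes]

# A bare prefix never matches a dunder-prefixed column name...
assert not "__pipeline_run".startswith("pipeline")
# ...but a normalized one does:
assert "__pipeline_run".startswith(normalize_meta_prefixes(["pipeline"])[0])
```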
Suggested change:

```python
# Collection[str]: keep only meta columns matching the specified prefixes.
# Normalize user-supplied prefixes to ensure they include META_PREFIX,
# so that e.g. "pipeline" matches "__pipeline".
meta_prefixes = [
    p
    if p.startswith(constants.META_PREFIX)
    else f"{constants.META_PREFIX}{p}"
    for p in column_config.meta
]
drop_columns.extend(
    c
    for c in self._cached_output_table.column_names
    if c.startswith(constants.META_PREFIX)
    and not any(c.startswith(p) for p in meta_prefixes)
```
```python
if not column_config.meta:
    drop_columns.extend(
        c
        for c in self._cached_output_table.column_names
        if c.startswith(constants.META_PREFIX)
    )
```
There are no tests covering the new meta filtering logic added to `as_table`. The existing test file (`tests/test_core/function_pod/test_function_pod_node.py`) has thorough tests for `get_all_records` meta column behavior (e.g. `TestGetAllRecordsMetaColumns`) but none for `as_table` with `meta=False`, `meta=True`, or `meta=Collection[str]`. Similarly, the change in `iter_packets` and `async_execute` to drop all meta-prefixed columns (instead of just the hash column) is not covered by any test verifying that cached replay streams exclude meta columns. Adding tests for these paths would help prevent regressions.
```python
drop_columns.extend(f"{constants.SOURCE_PREFIX}{c}" for c in self.keys()[1])
if not column_config.context:
    drop_columns.append(constants.CONTEXT_KEY)
if not column_config.meta:
```
We shouldn't be blindly dropping metadata -- it needs to be maintained. The comment suggests this was done to handle the column-name collision when joining 3 or more streams. What's happening is that when a join finds two streams have a colliding metadata column, it renames the second column by appending `_right`. But when the next join with the 3rd stream hits the same name collision (imagine all 3 streams had `__meta_x`), the first join produced `__meta_x` and `__meta_x_right`, so when the third stream's `__meta_x` collides with the `__meta_x` from the first stream, the rename to `__meta_x_right` collides again. The proper solution is a stream-index-based suffix (instead of the current `_right`), yielding e.g. `__meta_x`, `__meta_x_1`, `__meta_x_2`.
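The index-based suffix idea can be sketched as follows (a hypothetical helper for illustration, not the actual orcapod or Polars rename logic):

```python
# Colliding columns get the joining stream's index as a suffix, so the
# same base name from a third (or nth) stream can never re-collide the
# way a fixed "_right" suffix does.
def resolve_collision(name: str, existing: set[str], stream_index: int) -> str:
    return name if name not in existing else f"{name}_{stream_index}"

seen = {"__meta_x"}                              # columns from stream 0
second = resolve_collision("__meta_x", seen, 1)  # renamed to "__meta_x_1"
seen.add(second)
third = resolve_collision("__meta_x", seen, 2)   # "__meta_x_2" -- no collision
seen.add(third)
assert sorted(seen) == ["__meta_x", "__meta_x_1", "__meta_x_2"]
```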
Codecov Report
❌ Patch coverage is
eywalker left a comment
Turns out the very fact meta was trying to be dropped is already a bug...
src/orcapod/core/operators/join.py
Outdated
```python
table = stream.as_table(all_info=True)
```
Bug Log
Missing tzdata dependency causes UTC timestamp conversion failure
Problem: Converting PyArrow `timestamp("us", tz="UTC")` columns to Python objects via `to_pylist()` or `as_py()` failed with `ZoneInfoNotFoundError: 'No time zone found with key UTC'` on minimal systems lacking a timezone database. This affected any pipeline that read back cached results from the result database, since the `POD_TIMESTAMP` column is stored as a UTC-aware timestamp.

Fix: Added `tzdata>=2024.1` as a runtime dependency in `orcapod-python/pyproject.toml`. The `tzdata` package ships the full IANA timezone database as a Python package, making timezone lookups self-contained and independent of the host system's timezone data.

Internal meta columns leak into downstream streams, breaking multi-input Join
Problem: When a `Join` operator with 3+ inputs was run, a `polars.exceptions.DuplicateError: column with name '__computed_right' already exists` error was raised. The root cause was that `FunctionNode.as_table()` did not respect the `ColumnConfig.meta` flag: it built its output table using `packet.as_dict(all_info=True)` (which includes internal meta columns such as `__computed`, `__pod_ts`, `__pod_id_*`) but never filtered them out when `meta=False` (the default). As a result, these internal columns leaked into the tables passed to `Join.static_process()`. Polars' join renamed duplicates with a `_right` suffix, but on the third join iteration the renamed column already existed, causing a collision. A secondary instance of the same issue existed in `FunctionNode.iter_packets()` and `async_execute()`, where only the input hash column (`__input_packet_hash`) was stripped from the cached-results table before constructing a replay `ArrowTableStream`, leaving other meta columns in the stream.

Fix: In `src/orcapod/core/nodes/function_node.py` (`as_table`): Added `ColumnConfig.meta` filtering to the drop-columns logic, consistent with how `system_tags`, `source`, and `context` are already filtered. Meta columns (those starting with `constants.META_PREFIX = "__"`) are now excluded when `meta=False` and selectively filtered when `meta` is a `Collection[str]`. In `iter_packets` and `async_execute`: Changed `existing.drop([hash_col])` to drop all meta-prefixed columns, completing what the existing comment ("strip the meta column") intended.
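A minimal sketch of the `as_table()` meta handling described above, with `META_PREFIX` and the helper name assumed for illustration (the real code operates on a PyArrow table's `column_names`):

```python
META_PREFIX = "__"  # assumed value of constants.META_PREFIX

def columns_to_drop(column_names, meta):
    # meta=True keeps all meta columns; meta=False drops them all;
    # a collection of prefixes keeps only matching meta columns,
    # normalizing bare prefixes so "pipeline" matches "__pipeline".
    if meta is True:
        return []
    meta_cols = [c for c in column_names if c.startswith(META_PREFIX)]
    if not meta:
        return meta_cols
    prefixes = [p if p.startswith(META_PREFIX) else f"{META_PREFIX}{p}" for p in meta]
    return [c for c in meta_cols if not any(c.startswith(p) for p in prefixes)]

cols = ["x", "__computed", "__pod_ts", "__pipeline_run"]
assert columns_to_drop(cols, False) == ["__computed", "__pod_ts", "__pipeline_run"]
assert columns_to_drop(cols, True) == []
assert columns_to_drop(cols, ["pipeline"]) == ["__computed", "__pod_ts"]
```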