
Claude/fix utc tzdata timestamp [ENG-225] #75

Merged
eywalker merged 5 commits into walkerlab:dev from brian-arnold:claude/fix-utc-tzdata-timestamp
Mar 6, 2026

Conversation

@brian-arnold
Collaborator

Bug Log

Missing tzdata dependency causes UTC timestamp conversion failure

  • Description: On systems without a system-level timezone database (e.g. minimal Linux environments, stripped Docker images), PyArrow failed to convert timestamp("us", tz="UTC") columns to Python objects via to_pylist() or as_py(). The error was ZoneInfoNotFoundError: 'No time zone found with key UTC'. This affected any pipeline that read back cached results from the result database, since the POD_TIMESTAMP column is stored as a UTC-aware timestamp.
  • Fix: Added tzdata>=2024.1 as a runtime dependency in orcapod-python/pyproject.toml. The tzdata package ships the full IANA timezone database as a Python package, making timezone lookups self-contained and independent of the host system's timezone data.
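The failing lookup can be reproduced with the standard library alone. This sketch (illustrative, not code from the PR) shows the `zoneinfo` resolution that PyArrow relies on; with the `tzdata` package installed, it succeeds even on hosts with no system timezone database:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# On a stripped image without /usr/share/zoneinfo, this line raises
# ZoneInfoNotFoundError unless the tzdata package provides the IANA data.
utc = ZoneInfo("UTC")

# A UTC-aware timestamp like the POD_TIMESTAMP column round-trips cleanly.
ts = datetime(2026, 3, 6, 12, 0, tzinfo=utc)
print(ts.isoformat())  # → 2026-03-06T12:00:00+00:00
```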

Internal meta columns leak into downstream streams, breaking multi-input Join

  • Description: When a pipeline containing a Join operator with 3+ inputs was run, a polars.exceptions.DuplicateError: column with name '__computed_right' already exists error was raised. The root cause was that FunctionNode.as_table() did not respect the ColumnConfig.meta flag — it built its output table using packet.as_dict(all_info=True) (which includes internal meta columns such as __computed, __pod_ts, __pod_id_*) but never filtered them out when meta=False (the default). As a result, these internal columns leaked into the tables passed to Join.static_process(). Polars' join renamed duplicates with a _right suffix, but on the third join iteration the renamed column already existed, causing a collision. A secondary instance of the same issue existed in FunctionNode.iter_packets() and async_execute(), where only the input hash column (__input_packet_hash) was stripped from the cached-results table before constructing a replay ArrowTableStream, leaving other meta columns in the stream.
  • Fix:
    • src/orcapod/core/nodes/function_node.py (as_table): Added ColumnConfig.meta filtering to the drop-columns logic, consistent with how system_tags, source, and context are already filtered. Meta columns (those starting with constants.META_PREFIX = "__") are now excluded when meta=False and selectively filtered when meta is a Collection[str].
    • src/orcapod/core/nodes/function_node.py (iter_packets, async_execute): Changed existing.drop([hash_col]) to drop all meta-prefixed columns, completing what the existing comment ("strip the meta column") intended.
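The filtering rule described above can be sketched as a standalone helper (hypothetical, not orcapod's actual implementation; `META_PREFIX` mirrors `constants.META_PREFIX`, and the prefix normalization follows the review suggestion so that `meta=["pipeline"]` matches `"__pipeline"`):

```python
META_PREFIX = "__"  # stands in for constants.META_PREFIX

def meta_drop_columns(column_names, meta):
    """Return the columns to drop for a meta setting of True, False, or a prefix list."""
    if meta is True:
        return []  # keep all meta columns
    if not meta:   # False or empty collection: drop every meta column
        return [c for c in column_names if c.startswith(META_PREFIX)]
    # Collection[str]: keep only meta columns matching a requested prefix,
    # normalizing bare prefixes so "pipeline" matches "__pipeline".
    prefixes = [p if p.startswith(META_PREFIX) else f"{META_PREFIX}{p}" for p in meta]
    return [
        c for c in column_names
        if c.startswith(META_PREFIX) and not any(c.startswith(p) for p in prefixes)
    ]

cols = ["x", "__computed", "__pod_ts", "__pipeline_id"]
print(meta_drop_columns(cols, False))         # drops all three meta columns
print(meta_drop_columns(cols, ["pipeline"]))  # keeps __pipeline_id
```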

Contributor

Copilot AI left a comment


Pull request overview

This PR fixes two bugs: (1) a ZoneInfoNotFoundError when converting UTC-aware PyArrow timestamps on minimal systems lacking a timezone database, and (2) internal meta columns (e.g. __computed, __pod_ts) leaking into downstream streams, which caused DuplicateError in multi-input Join operations.

Changes:

  • Added tzdata>=2024.1 as a runtime dependency to ensure timezone data is always available for PyArrow UTC timestamp conversion.
  • Added ColumnConfig.meta filtering to FunctionNode.as_table(), so meta columns are properly excluded (or selectively included) when building output tables.
  • Changed iter_packets() and async_execute() to drop all meta-prefixed columns (not just __input_packet_hash) from cached-results tables before constructing replay streams.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| pyproject.toml | Added tzdata>=2024.1 runtime dependency for portable timezone support |
| src/orcapod/core/nodes/function_node.py | Added meta column filtering in as_table(), broadened meta column stripping in iter_packets() and async_execute() |


Comment on lines +319 to +324
```python
# Collection[str]: keep only meta columns matching the specified prefixes
drop_columns.extend(
    c
    for c in self._cached_output_table.column_names
    if c.startswith(constants.META_PREFIX)
    and not any(c.startswith(p) for p in column_config.meta)
```

Copilot AI Mar 5, 2026


The Collection[str] branch does not normalize user-supplied prefixes by prepending constants.META_PREFIX when needed. In contrast, datagram.py (line 483–488) normalizes prefixes so that e.g. meta=["pipeline"] correctly matches "__pipeline".

As written, if a user passes meta=["pipeline"], the c.startswith(p) check compares column names (which start with "__") against the bare prefix "pipeline", so no columns will match and all meta columns will be dropped.

The prefixes in column_config.meta should be normalized the same way datagram.py does it: prepend constants.META_PREFIX to any prefix that doesn't already start with it.

Suggested change

```python
# Collection[str]: keep only meta columns matching the specified prefixes
drop_columns.extend(
    c
    for c in self._cached_output_table.column_names
    if c.startswith(constants.META_PREFIX)
    and not any(c.startswith(p) for p in column_config.meta)
```

```python
# Collection[str]: keep only meta columns matching the specified prefixes.
# Normalize user-supplied prefixes to ensure they include META_PREFIX,
# so that e.g. "pipeline" matches "__pipeline".
meta_prefixes = [
    p
    if p.startswith(constants.META_PREFIX)
    else f"{constants.META_PREFIX}{p}"
    for p in column_config.meta
]
drop_columns.extend(
    c
    for c in self._cached_output_table.column_names
    if c.startswith(constants.META_PREFIX)
    and not any(c.startswith(p) for p in meta_prefixes)
```

Comment on lines +312 to +317
```python
if not column_config.meta:
    drop_columns.extend(
        c
        for c in self._cached_output_table.column_names
        if c.startswith(constants.META_PREFIX)
    )
```

Copilot AI Mar 5, 2026


There are no tests covering the new meta filtering logic added to as_table. The existing test file (tests/test_core/function_pod/test_function_pod_node.py) has thorough tests for get_all_records meta column behavior (e.g. TestGetAllRecordsMetaColumns) but none for as_table with meta=False, meta=True, or meta=Collection[str]. Similarly, the change in iter_packets and async_execute to drop all meta-prefixed columns (instead of just the hash column) is not covered by any test verifying that cached replay streams exclude meta columns. Adding tests for these paths would help prevent regressions.

```python
drop_columns.extend(f"{constants.SOURCE_PREFIX}{c}" for c in self.keys()[1])
if not column_config.context:
    drop_columns.append(constants.CONTEXT_KEY)
if not column_config.meta:
```
Contributor


We shouldn't be blindly dropping metadata -- this needs to be maintained. The comment suggests that this was done to handle the column name collision when joining 3 or more streams. When a join encounters two streams with a colliding metadata column, it renames the second column by appending _right to it. However, when the next join with the 3rd stream occurs and the same name collides (e.g. imagine all 3 streams had __meta_x), the first join produced __meta_x and __meta_x_right; when the third stream's __meta_x collides with the __meta_x from the first stream, the join again tries to rename it to __meta_x_right, and this again collides. The proper solution is a stream-index-based suffix (instead of the _right currently used), so that the columns become __meta_x, __meta_x_1, __meta_x_2, for example.

@brian-arnold brian-arnold changed the title Claude/fix utc tzdata timestamp Claude/fix utc tzdata timestamp [ENG-225] Mar 6, 2026
@codecov-commenter

codecov-commenter commented Mar 6, 2026


Codecov Report

❌ Patch coverage is 78.94737% with 4 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/orcapod/core/nodes/function_node.py | 50.00% | 2 Missing ⚠️ |
| src/orcapod/core/operators/join.py | 86.66% | 2 Missing ⚠️ |


Contributor

@eywalker left a comment


Turns out the very fact meta was trying to be dropped is already a bug...

Contributor


Suggested change

```python
table = stream.as_table(all_info=True)
```

@eywalker eywalker merged commit 109b733 into walkerlab:dev Mar 6, 2026
4 checks passed