FEAT: Add arrow fetch support by ffelixg · Pull Request #354 · microsoft/mssql-python

ffelixg · 2025-11-30T21:00:35Z

Work Item / Issue Reference

GitHub Issue: #130

Summary

Hey, you mentioned in issue #130 that you were willing to consider community contributions for adding Apache Arrow support, so here you go. I have focused only on fetching data into Arrow structures from the Database.

The Function signatures I chose are:

arrow_batch(chunk_size=10000): Fetch a single pyarrow.RecordBatch, base for the other two methods.
arrow(chunk_size=10000): Fetches the entire result set as a single pyarrow.Table.
arrow_reader(chunk_size=10000): Returns a pyarrow.RecordBatchReader for streaming results without loading the entire dataset into RAM.

Using fetch_arrow... instead of just arrow... could also be a good option, but I think the terse version is not too ambiguous.

Technical details

I am not very familiar with C++, but I did have some prior practice for this task from implementing my own ODBC driver in Zig (a very good language for projects like this!). The implementation is written almost entirely in C++ in the FetchArrowBatch_wrap function, which produces PyCapsules that are then consumed by arrow_batch and turned into actual arrow objects.

The function itself is very large. I'm sure it could be factored in a better way, even sharing some code with the other methods of fetching, but my goal was to keep the whole thing as straight forward as possible.

I have also implemented my own loop for SQLGetData for Lob-Columns. Unlike with the python fetch methods, I don't use the result directly, but instead copy it into the same buffer I would use for the case with bound columns. Maybe that's an abstraction that would make sense for that case as well.

Notes on data types

I noticed that you use SQL_C_TYPE_TIME for time(x) columns. The arrow fetch does the same, but I think it would be better to use SQL_C_SS_TIME2, since that supports fractional seconds.

Datetimeoffset is a bit tricky, since SQL Server stores timezone information alongside each cell, while arrow tables expect a fixed timezone for the entire column. I don't really see any solution other than converting everything to UTC and returning a UTC column, so that's what I did.

SQL_C_CHAR columns get copied directly into arrow utf8 arrays. Maybe some encoding options would be useful.

Performance

I think the main performance win to be gained is not interacting with any Python data structures in the hot path. That is satisfied. Further optimizations, which I did not make are:

Releasing the GIL for the entire fetch loop
Sharing the bound fetch buffer across repeated fetch calls
Improve the hot loop switching

Instead of looping over rows and columns and then switching on the data type for each cell, you could

Put the row loop inside each switch case (fastest I think, but would bloat the code a lot more)
Use function pointers like you recently did for python fetching (has overhead because of the indirect function call I think, also code is more scattered)
Replace both loops and the switch with computed gotos. That's what I opted for in my ODBC driver (the Zig equivalent is a labeled switch) and I am quite happy with how it came out. Performance seems very good and it allows you to abstract the fetching process on a row by row basis. I don't know how well that would translate to C++.

Overall the arrow performance seems not too far off from what I achieved with zodbc.

ffelixg · 2025-11-30T21:03:39Z

@microsoft-github-policy-service agree

Copilot

Pull request overview

This PR adds Apache Arrow fetch support to the mssql-python driver, enabling efficient columnar data retrieval from SQL Server. The implementation provides three new cursor methods (arrow_batch(), arrow(), and arrow_reader()) that convert result sets into Apache Arrow data structures using the Arrow C Data Interface, bypassing Python object creation in the hot path for improved performance.

Key changes:

Implemented Arrow fetch functionality in C++ that directly converts ODBC result sets to Arrow format
Added three Python API methods for different Arrow data consumption patterns (single batch, full table, streaming reader)
Added comprehensive test coverage for various data types, LOB columns, and edge cases

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 9 comments.

File	Description
mssql_python/pybind/ddbc_bindings.cpp	Core C++ implementation: Added `FetchArrowBatch_wrap()` function with Arrow C Data Interface structures, column buffer management, data type conversion logic, and memory management for Arrow structures
mssql_python/cursor.py	Python API layer: Added `arrow_batch()`, `arrow()`, and `arrow_reader()` methods that wrap the C++ bindings and handle pyarrow imports
tests/test_004_cursor.py	Comprehensive test suite covering wide tables, LOB columns, individual data types, empty result sets, datetime handling, and batch operations
requirements.txt	Added pyarrow as a dependency for development and testing

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

mssql_python/pybind/ddbc_bindings.cpp

mssql_python/cursor.py

mssql_python/pybind/ddbc_bindings.cpp

sumitmsft · 2025-12-01T11:14:32Z

Work Item / Issue Reference

GitHub Issue: #130

Summary

Hey, you mentioned in issue #130 that you were willing to consider community contributions for adding Apache Arrow support, so here you go. I have focused only on fetching data into Arrow structures from the Database.

The Function signatures I chose are:

arrow_batch(chunk_size=10000): Fetch a single pyarrow.RecordBatch, base for the other two methods.

arrow(chunk_size=10000): Fetches the entire result set as a single pyarrow.Table.

arrow_reader(chunk_size=10000): Returns a pyarrow.RecordBatchReader for streaming results without loading the entire dataset into RAM.

Using fetch_arrow... instead of just arrow... could also be a good option, but I think the terse version is not too ambiguous.

Technical details

I am not very familiar with C++, but I did have some prior practice for this task from implementing my own ODBC driver in Zig (a very good language for projects like this!). The implementation is written almost entirely in C++ in the FetchArrowBatch_wrap function, which produces PyCapsules that are then consumed by arrow_batch and turned into actual arrow objects.

The function itself is very large. I'm sure it could be factored in a better way, even sharing some code with the other methods of fetching, but my goal was to keep the whole thing as straight forward as possible.

I have also implemented my own loop for SQLGetData for Lob-Columns. Unlike with the python fetch methods, I don't use the result directly, but instead copy it into the same buffer I would use for the case with bound columns. Maybe that's an abstraction that would make sense for that case as well.

Notes on data types

I noticed that you use SQL_C_TYPE_TIME for time(x) columns. The arrow fetch does the same, but I think it would be better to use SQL_C_SS_TIME2, since that supports fractional seconds.

Datetimeoffset is a bit tricky, since SQL Server stores timezone information alongside each cell, while arrow tables expect a fixed timezone for the entire column. I don't really see any solution other than converting everything to UTC and returning a UTC column, so that's what I did.

SQL_C_CHAR columns get copied directly into arrow utf8 arrays. Maybe some encoding options would be useful.

Performance

I think the main performance win to be gained is not interacting with any Python data structures in the hot path. That is satisfied. Further optimizations, which I did not make are:

Releasing the GIL for the entire fetch loop

Sharing the bound fetch buffer across repeated fetch calls

Improve the hot loop switching

Instead of looping over rows and columns and then switching on the data type for each cell, you could

Put the row loop inside each switch case (fastest I think, but would bloat the code a lot more)

Use function pointers like you recently did for python fetching (has overhead because of the indirect function call I think, also code is more scattered)

Replace both loops and the switch with computed gotos. That's what I opted for in my ODBC driver (the Zig equivalent is a labeled switch) and I am quite happy with how it came out. Performance seems very good and it allows you to abstract the fetching process on a row by row basis. I don't know how well that would translate to C++.

Overall the arrow performance seems not too far off from what I achieved with zodbc.

Hi @ffelixg

Thanks for raising this PR. Please allow us time to review and share our comments.

Appreciate your diligence in strengthening this project.

Sumit

mssql_python/pybind/ddbc_bindings.cpp

sumitmsft · 2025-12-04T11:32:31Z

Hello @ffelixg

Me and my team are in the process of reviewing your PR. While we are getting started, it would be great to have some preliminary information from you on the following items:

Have you created any design document for this feature (high\low level)? Could you please attach it here or share it with us at the below mentioned email id?
What is your motivation to bring the support for Arrow in mssql-python? Could you help us understand the use case(s) you're trying to address?
Is there a way to connect with you over Microsoft Teams call, so that we can closely work on this feature together? You can reach out to us at mssql-python@microsoft.com with your contact details and consent to connect with you.

Regards,
Sumit

ffelixg · 2025-12-04T20:31:24Z

Hello @sumitmsft,

I'm happy to hear that.

I don't have any design document beyond what I wrote in the PR description. Are there any areas in particular you would like me to provide more information on?
I assume the motivation is mostly in line with what most arrow users like about arrow. Mainly I believe that arrow is the correct format for anything that is using batches of data and has C-Extensions for both producer and consumer. For example arrow gives you great interop with things like duckdb, polars, pandas on the analytics/ML side. Also I want python to be the obvious one stop shop for ETL workloads and for that, plain python types don't work well both for performance and reliability. You still have plenty of situations though where you want to fetch one result set with python types and the next with arrow types, so it has to be in the same driver as well.
Yes, for sure. I have sent you an Email.

Regards,
Felix

bewithgaurav · 2025-12-08T11:53:57Z

/azp run

mssql_python/pybind/ddbc_bindings.cpp

azure-pipelines · 2025-12-08T11:54:09Z

Azure Pipelines successfully started running 1 pipeline(s).

bewithgaurav

@ffelixg - Thanks for the contribution! :)
Before we get started on this PR - there are a few dev build workflows we need to fix.
Could you please take a look at the Azure DevOps Workflows which are failing? (goes by the check MSSQL-Python-PR-Validation):

Build issues on windows (ref)
A few tests failing on MacOS (ref)

ffelixg · 2025-12-09T21:52:05Z

Hey,

the Windows issue was due to me using a 128 bit integer type which isn't supported by MSVC. To address that, I've added a custom 128 bit int type implemented as two 64 bit ints with some overloaded operators. I'm not super happy about that, it seems to me that using an existing library for this type of thing would be better. If you prefer to use a library, I'd leave the choice of which one to use up to you though.

The 128 bit type is only needed for decimals, so an alternative solution would be to use the numeric struct instead of strings for fetching. That one has near identical bit-representation compared to arrow and wouldn't require a lot of modification. But that's a change that would affect fetching to Python objects as well, since the two paths should probably stay in sync.

The MacOS issue seems to be due to std::mktime failing. I've added an implementation of days_from_civil to eliminate that call. I think a newer version of c++ would include that function. CPython also has an implementation for that in _datetimemodule.c, but it sadly isn't directly accessible, only when going through python objects, which would of course be slow.

I noticed some CI errors related to zip not taking strict=True, which only applies for Python 3.9 and below. I know you don't officially support 3.9, but if you're testing it and it works otherwise, I could write that differently I guess.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

bewithgaurav · 2026-01-07T05:10:11Z

/azp run

azure-pipelines · 2026-01-07T05:10:23Z

Azure Pipelines successfully started running 1 pipeline(s).

gargsaumya

This is a really solid contribution. The Arrow C Data Interface implementation is well structured, the PyCapsule ownership chain is handled correctly with RAII and proper try/catch guards, and the test coverage is comprehensive across a wide range of SQL types.
I’ve added a few non-blocking suggestions for improvement.

Thanks, @ffelixg , for the great work on this! 🎉

mssql_python/pybind/ddbc_bindings.cpp

mssql_python/cursor.py

gargsaumya · 2026-02-09T04:28:17Z

requirements.txt

 coverage
 unittest-xml-reporting
 psutil
+pyarrow


The Python code already handles ImportError gracefully with a helpful message. Should this be under an optional extras group in setup.py (e.g., pip install mssql-python[arrow]) instead of being a hard requirement for all users?
For the test file it's fine, but this requirements.txt drives the project's general dependencies, not just test deps.

@sumitmsft @bewithgaurav thoughts?

Yes, the idea is 100% that pyarrow is not a required dependency for most of mssql_python. From my understanding, the requirements.txt is actually only used by CI and the actual dependencies are given by the install_requires parameter for setup, correct?. PyTest etc are also listed in requirements.txt. Supporting the mssql-python[arrow] syntax is a good thing though, I've added it under the extras_require parameter.

mssql_python/pybind/ddbc_bindings.cpp

ffelixg · 2026-02-09T21:49:30Z

Thank you for the kind words and for the careful review @gargsaumya! I think they are all great points and I have made changes to address each one.

mssql_python/cursor.py

sumitmsft · 2026-03-18T18:35:22Z

mssql_python/pybind/ddbc_bindings.cpp

+                            target_vec->resize(target_vec->size() * 2);
+                        }
+
+                        std::memcpy(&(*target_vec)[start], &buffers.charBuffers[idxCol][idxRowSql * fetchBufferSize], dataLen);


There are multiple memcpy calls that were flagged by the code scanning tools. While you noted these are unavoidable for this type of data manipulation, several of the memcpy calls copy data from ODBC driver buffers without explicit pre-validation of the dataLen indicator against the destination buffer capacity.

Can we add assertions or checks before memcpy calls to validate that dataLen does not exceed the allocated buffer size?

In the example you quoted (and a few other places), the assertion I would write would be:

assert(target_vec->size() >= start + dataLen);

The while loop in the line above already has

while (target_vec->size() < start + dataLen)

as the loop condition, so we already ensured that there is space for memcpy. If you think the assertion helps readability, I can add it.

For name and format string, we know the exact string length before allocating and copying, so the buffer fits exactly and is allocated in the preceding line. Maybe there is a more elegant way to do it, but the bounds checks would be a bit redundant there as well.

sumitmsft · 2026-03-18T19:40:25Z

mssql_python/pybind/ddbc_bindings.cpp

+    assert(fetchSize == 0 || arrowBatchSize % fetchSize == 0);
+    assert(fetchSize <= arrowBatchSize);
+
+    while (idxRowArrow < arrowBatchSize) {


numRowsFetched is a stack-local variable whose address is handed to the ODBC driver. The cleanup that resets SQL_ATTR_ROWS_FETCHED_PTR to NULL and SQL_ATTR_ROW_ARRAY_SIZE back to 1 only runs on the normal exit path. All the early return ret statements inside the fetch loop skip this cleanup entirely.

After an early exit, numRowsFetched is destroyed, but the driver still holds its address. The next SQLFetch on the same hStmt writes to invalid stack memory, undefined behavior (silent corruption, random crashes). SQL_ATTR_ROW_ARRAY_SIZE also stays elevated, breaking subsequent non-Arrow fetches.

I suggest using RAII guard: to ensure stmt attributes are always reset, even on early return or exception.

Then the manual cleanup block near the end (// Reset attributes before returning...) can be removed, the destructor handles all exit paths automatically.

I agree with what you're saying and I've implemented the RAII guard. The same reasoning applies to SQLFreeStmt_ptr(hStmt, SQL_UNBIND), so I've included that as well. As of now, all the fetch functions freshly configure these variables when they're called, so I don't think this could have lead to an actual issue.

Note that I've simply copied that part from the other fetch functions like fetchmany, so maybe we should have the same update there?

mssql_python/pybind/ddbc_bindings.cpp

sumitmsft · 2026-03-18T20:24:25Z

mssql_python/pybind/ddbc_bindings.cpp

+        std::memset(arrowColumnProducer->valid.get(), 0xFF, (arrowBatchSize + 7) / 8);
+    }
+
+    if (fetchSize > 1) {


The fetch size selection loop iterates from min(64, arrowBatchSize) down to 1, looking for a divisor of arrowBatchSize. While correct, if arrowBatchSize is prime and > 64, fetchSize will always be 1, leading to extremely slow row-by-row fetching. For example, arrowBatchSize=8191 (prime) would result in fetchSize=1.

Can we take different approach, say, find the largest divisor ≤ 64 of arrowBatchSize, but if none exists > 1, use a reasonable non-divisor size and handle the remainder differently, or document this behavior so users pick good batch sizes.

I've made the change to where the fetch size is potentially updated for the final call. I originally wanted to avoid that for simplicity, but it actually turned out simpler than the gcd calculation I think. The call to update this size should be trivial performance wise.

In an optimal world, I think we would have a shared buffer for the entire cursor, which is used by all fetch functions. That way we could always fetch the same sized batch and - if one fetch call doesn't fully consume it - the next fetch call will start where the previous one left off. For fetchone this would also open the door to using bound columns, resulting in big performance gains. For bigger batch sizes it would still simplify logic.

The difference in performance between using SQLGetData and SQLBindCol with size 1 is bigger than between SQLBindCol with size 1 and SQLBindCol with larger sizes from what I can tell.

mssql_python/pybind/ddbc_bindings.cpp

sumitmsft · 2026-03-19T08:52:55Z

The new arrow_batch(), arrow(), and arrow_reader() methods should be added to the type stub file mssql_python/mssql_python.pyi because without stub entries, IDEs won't offer autocompletion and type checkers will report errors when users call these methods. Something like:

    # Arrow Extension Methods (requires pyarrow)
    def arrow_batch(self, batch_size: int = 8192) -> "pyarrow.RecordBatch": ...
    def arrow(self, batch_size: int = 8192) -> "pyarrow.Table": ...
    def arrow_reader(self, batch_size: int = 8192) -> "pyarrow.RecordBatchReader": ...

sumitmsft · 2026-03-19T08:54:30Z

@ffelixg I have put in some of my review comments. Request you to look at them. Most of them are good to have - so they are no blocking issues.

…n RAII guard

ffelixg · 2026-03-22T20:20:10Z

Thanks for the review! I've addressed your comments and added the stubs. I can confirm that it fixed complaints from ty. I totally missed the pyi file, because somehow mypy also seems to be looking at the definition inside cursor.py, so it didn't throw any errors.

Copilot AI review requested due to automatic review settings November 30, 2025 21:00

Copilot started reviewing on behalf of ffelixg November 30, 2025 21:01 View session

Copilot finished reviewing on behalf of ffelixg November 30, 2025 21:04

Copilot AI reviewed Nov 30, 2025

View reviewed changes

github-advanced-security bot found potential problems Dec 1, 2025

View reviewed changes

sumitmsft requested review from bewithgaurav, gargsaumya and sumitmsft December 1, 2025 11:12

sumitmsft self-assigned this Dec 1, 2025

sumitmsft added the enhancement New feature or request label Dec 1, 2025

sumitmsft added inADO under development community PR or Issue raised by community members labels Dec 1, 2025

github-advanced-security bot found potential problems Dec 2, 2025

View reviewed changes

mssql_python/pybind/ddbc_bindings.cpp Fixed Show fixed Hide fixed

mssql_python/pybind/ddbc_bindings.cpp Fixed Show fixed Hide fixed

mssql_python/pybind/ddbc_bindings.cpp Fixed Show fixed Hide fixed

github-advanced-security bot found potential problems Dec 8, 2025

View reviewed changes

bewithgaurav requested changes Dec 9, 2025

View reviewed changes

ffelixg and others added 7 commits December 28, 2025 16:40

Add arrow fetch support

a7e220f

Copilot suggestion: Fix typo

68482fb

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Copilot suggestion: Fix missing buffer resize

3645578

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Copilot suggestion: Initialize bool value buffer

9c8c3e8

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Add test for long data

5267a33

Copilot suggestion: Uppercase uuids

b81f245

Copilot suggestion: use new for batch schema format/name

532672c

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Merge branch 'main' into arrow_fetch

f0bca7b

ffelixg requested a review from bewithgaurav January 12, 2026 17:30

gargsaumya reviewed Feb 9, 2026

View reviewed changes

ffelixg added 7 commits February 9, 2026 18:59

Merge remote-tracking branch 'origin/main' into arrow_fetch

ef91fd9

Fix SQL_REAL handling

9edb495

Fix rownumber -> behaves like fetchmany

3b0cfc7

Check for closed cursor before pyarrow import

ba172f0

Add pyarrow as an optional dependency

554b93b

Switch to large_string/large_binary for variable length data

c0cee4a

Move tests to separate file as requested

4b8d7b5

ffelixg added 2 commits March 4, 2026 19:40

Merge remote-tracking branch 'origin/main' into arrow_fetch

042605e

Unbind & test in accordance with microsoft#441

e997c9b

sumitmsft reviewed Mar 18, 2026

View reviewed changes

mssql_python/cursor.py Outdated Show resolved Hide resolved

sumitmsft reviewed Mar 18, 2026

View reviewed changes

mssql_python/pybind/ddbc_bindings.cpp Outdated Show resolved Hide resolved

sumitmsft reviewed Mar 18, 2026

View reviewed changes

ffelixg added 3 commits March 18, 2026 22:58

Add UDT support like microsoft#423

97959ce

Update bound varchar buffer size according to microsoft#444

8b8d77c

Factor pyarrow import out into _check_closed

6a3190e

sumitmsft reviewed Mar 19, 2026

View reviewed changes

mssql_python/pybind/ddbc_bindings.cpp Outdated Show resolved Hide resolved

ffelixg added 3 commits March 22, 2026 17:04

Add arrow fetch methods to stubs

e2d9a9f

Update GetDataVar error messages

966ad97

Update ROW_ARRAY_SIZE for final batch instead of finding gcd & wrap i…

ec11800

…n RAII guard

Conversation

ffelixg commented Nov 30, 2025

Work Item / Issue Reference

Summary

Technical details

Notes on data types

Performance

Uh oh!

ffelixg commented Nov 30, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sumitmsft commented Dec 1, 2025

Work Item / Issue Reference

Summary

Technical details

Notes on data types

Performance

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sumitmsft commented Dec 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ffelixg commented Dec 4, 2025

Uh oh!

bewithgaurav commented Dec 8, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

azure-pipelines bot commented Dec 8, 2025

Uh oh!

bewithgaurav left a comment

Choose a reason for hiding this comment

Uh oh!

ffelixg commented Dec 9, 2025

Uh oh!

bewithgaurav commented Jan 7, 2026

Uh oh!

azure-pipelines bot commented Jan 7, 2026

Uh oh!

gargsaumya left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ffelixg commented Feb 9, 2026

Uh oh!

sumitmsft commented Dec 4, 2025 •

edited

Loading

sumitmsft commented Mar 19, 2026 •

edited

Loading