PERF: Handle utf16 strings using simdutf and std::u16string #526
I have been evaluating the external library simdutf as a high performance replacement for utf16 -> utf8 conversions, i.e. the functions SQLWCHARToWString, WideToUTF8 and Utf8ToWString. Rather than only using it for the arrow fetch path, I have been trying to make the switch for every location where one of these three functions is used, as the applications follow similar patterns. I didn't use simdutf in every case, another way to eliminate these function calls was to use std::u16string instead of std::wstring when passing strings to/from python. I think this avoids the whole issue where wchars are defined as 32 bit on some OSes but SQLWCHARs are always 16 bit. This brings arrow performance on linux for nvarchars in line with what it should be. If the std::u16string type works as well as I hope it does (will have to see what CI says about mac), there are some more spots where it could be used, for example to replace |
Pull request overview
This PR introduces a higher-performance and more platform-consistent UTF-16LE → UTF-8 conversion path in the pybind ODBC layer by adopting simdutf and shifting several wide-string interfaces from std::wstring to std::u16string (to avoid wchar_t width differences across OSes).
Changes:
- Add simdutf (via find_package or FetchContent) and use it for UTF-16LE → UTF-8 conversions in diagnostics and data fetch paths.
- Replace a number of std::wstring usages with std::u16string for connection strings, queries, and SQL_C_WCHAR parameter buffers.
- Remove legacy SQLWCHAR→wstring conversion utilities that are no longer used.
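To make the conversion step concrete, here is a minimal scalar sketch of UTF-16 → UTF-8 over a std::u16string. It is not the PR's code: `utf16ToUtf8Scalar` is a hypothetical name, it uses only the standard library (no simdutf), it assumes the code units are already in host byte order, and it replaces unpaired surrogates with U+FFFD, which may differ from what the PR's helper does on invalid input.

```cpp
#include <cstddef>
#include <string>

// Hypothetical scalar stand-in for a simdutf-backed helper.
// Input: UTF-16 code units in host byte order. Output: UTF-8 bytes.
// Unpaired surrogates are replaced with U+FFFD (an assumption of this sketch).
std::string utf16ToUtf8Scalar(const std::u16string& in) {
    std::string out;
    out.reserve(in.size() * 3);  // worst case is 3 UTF-8 bytes per UTF-16 unit
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: combine with the following low surrogate if present.
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // stray low surrogate
        }
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

Because char16_t is 16 bits wide on every platform, the same loop works unchanged on Windows, Linux and macOS, which is the portability argument for moving off std::wstring.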
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| mssql_python/pybind/CMakeLists.txt | Adds simdutf dependency resolution and links it into the ddbc_bindings module. |
| mssql_python/pybind/ddbc_bindings.h | Introduces UTF-16 helpers (dupeSqlWCharAsUtf16Le, utf16LeToUtf8Alloc) and changes ErrorInfo to store UTF-8 std::string. |
| mssql_python/pybind/ddbc_bindings.cpp | Switches parameter binding, diagnostics, query execution, and fetch conversions to UTF-16 + simdutf. |
| mssql_python/pybind/connection/connection.h | Updates connection APIs/state to store connection strings as std::u16string. |
| mssql_python/pybind/connection/connection.cpp | Uses UTF-16 connection string/query handling and returns UTF-8 error messages directly. |
| mssql_python/pybind/connection/connection_pool.h | Updates pooling key types and APIs to use std::u16string. |
| mssql_python/pybind/connection/connection_pool.cpp | Implements pooling with std::u16string connection string keys. |
| mssql_python/pybind/unix_utils.h | Removes the SQLWCHARToWString declaration. |
| mssql_python/pybind/unix_utils.cpp | Removes the SQLWCHARToWString implementation. |
Comments suppressed due to low confidence (1)
mssql_python/pybind/ddbc_bindings.cpp:583
- In the SQL_C_WCHAR non-DAE path, the bound buffer is a std::u16string but the bind call uses data() + SQL_NTS and sets bufferLength to size() * sizeof(SQLWCHAR). For SQL_NTS, BufferLength should include the null terminator, and the pointer should be a null-terminated buffer (prefer c_str()). As written, drivers that validate BufferLength may treat this as truncated or read past the provided length.
Suggested fix: use sqlwcharBuffer->c_str() and set bufferLength to (sqlwcharBuffer->size() + 1) * sizeof(SQLWCHAR), or alternatively set *strLenOrIndPtr to the explicit byte length (excluding terminator) and keep BufferLength consistent.
std::u16string* sqlwcharBuffer = AllocateParamBuffer<std::u16string>(
paramBuffers, param.cast<std::u16string>());
LOG("BindParameters: param[%d] SQL_C_WCHAR - String "
"length=%zu characters, buffer=%zu bytes",
paramIndex, sqlwcharBuffer->size(), sqlwcharBuffer->size() * sizeof(SQLWCHAR));
dataPtr = sqlwcharBuffer->data();
bufferLength = sqlwcharBuffer->size() * sizeof(SQLWCHAR);
strLenOrIndPtr = AllocateParamBuffer<SQLLEN>(paramBuffers);
*strLenOrIndPtr = SQL_NTS;
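As an illustration of the first suggested fix, the sketch below computes the pointer and byte length for a null-terminated UTF-16 buffer. It deliberately avoids the ODBC headers so it compiles on its own: `WideParamBinding` and `describeNullTerminated` are hypothetical names, and char16_t stands in for SQLWCHAR on the assumption that both are 16 bits wide.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for the (pointer, BufferLength) pair handed to the
// bind call; the real code uses ODBC types such as SQLPOINTER and SQLLEN.
struct WideParamBinding {
    const char16_t* data;      // null-terminated UTF-16 buffer
    std::size_t bufferLength;  // size in bytes, terminator included
};

WideParamBinding describeNullTerminated(const std::u16string& s) {
    // c_str() points at size() code units followed by u'\0', so the
    // terminator is counted in the byte length reported alongside SQL_NTS.
    return {s.c_str(), (s.size() + 1) * sizeof(char16_t)};
}
```

For a three-character string this yields 8 bytes rather than 6. Since C++11, data() and c_str() both return a null-terminated buffer, so the length calculation is the part of the suggestion that changes behavior.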
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
📊 Code Coverage Report
Diff Coverage
Diff: main...HEAD, staged and unstaged changes
Summary
mssql_python/pybind/ddbc_bindings.cpp
Lines 1572-1580
1572 LOG("SQLCheckError: Checking ODBC errors - handleType=%d, retcode=%d", handleType, retcode);
1573 ErrorInfo errorInfo;
1574 if (retcode == SQL_INVALID_HANDLE) {
1575 LOG("SQLCheckError: SQL_INVALID_HANDLE detected - handle is invalid");
! 1576 errorInfo.ddbcErrorMsg = "Invalid handle!";
1577 return errorInfo;
1578 }
1579 assert(handle != 0);
1580 SQLHANDLE rawHandle = handle->get();
Lines 1651-1659
1651 return records;
1652 }
1653
1654 // Wrap SQLExecDirect
! 1655 SQLRETURN SQLExecDirect_wrap(SqlHandlePtr StatementHandle, const std::u16string& Query) {
1656 LOG("SQLExecDirect: Executing query directly - statement_handle=%p, "
1657 "query_length=%zu chars",
1658 (void*)StatementHandle->get(), Query.length());
1659 if (!SQLExecDirect_ptr) {
Lines 1668-1676
1668 SQLSetStmtAttr_ptr(StatementHandle->get(), SQL_ATTR_CONCURRENCY,
1669 (SQLPOINTER)SQL_CONCUR_READ_ONLY, 0);
1670 }
1671
! 1672 SQLWCHAR* queryPtr = reinterpretU16stringAsSqlWChar(Query);
1673 SQLRETURN ret;
1674 {
1675 // Release the GIL during the blocking ODBC call so that other Python
1676 // threads (e.g. asyncio event loop, heartbeat threads) can run while
Lines 3892-3900
3892 sizeof(DateTimeOffset) * fetchSize,
3893 buffers.indicators[col - 1].data());
3894 break;
3895 default:
! 3896 std::string columnName = columnMeta["ColumnName"].cast<std::string>();
3897 std::ostringstream errorString;
3898 errorString << "Unsupported data type for column - " << columnName.c_str()
3899 << ", Type - " << dataType << ", column ID - " << col;
3900 LOG("SQLBindColums: %s", errorString.str().c_str());
Lines 3901-3909
3901 ThrowStdException(errorString.str());
3902 break;
3903 }
3904 if (!SQL_SUCCEEDED(ret)) {
! 3905 std::string columnName = columnMeta["ColumnName"].cast<std::string>();
3906 std::ostringstream errorString;
3907 errorString << "Failed to bind column - " << columnName.c_str() << ", Type - "
3908 << dataType << ", column ID - " << col;
3909 LOG("SQLBindColums: %s", errorString.str().c_str());
Lines 4248-4256
4248 break;
4249 }
4250 default: {
4251 const auto& columnMeta = columnNames[col - 1].cast<py::dict>();
! 4252 std::string columnName = columnMeta["ColumnName"].cast<std::string>();
4253 std::ostringstream errorString;
4254 errorString << "Unsupported data type for column - " << columnName.c_str()
4255 << ", Type - " << dataType << ", column ID - " << col;
4256 LOG("FetchBatchData: %s", errorString.str().c_str());
Lines 4350-4358
4350 case SQL_SS_TIMESTAMPOFFSET:
4351 rowSize += sizeof(DateTimeOffset);
4352 break;
4353 default:
! 4354 std::string columnName = columnMeta["ColumnName"].cast<std::string>();
4355 std::ostringstream errorString;
4356 errorString << "Unsupported data type for column - " << columnName.c_str()
4357 << ", Type - " << dataType << ", column ID - " << col;
4358 LOG("calculateRowSize: %s", errorString.str().c_str());
mssql_python/pybind/utf_utils.h
📋 Files Needing Attention
📉 Files with overall lowest coverage
mssql_python.pybind.build._deps.simdutf-src.src.haswell.implementation.cpp: 0.4%
mssql_python.pybind.build._deps.simdutf-src.src.implementation.cpp: 6.7%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.implementation.h: 10.4%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.scalar.utf16_to_utf8.utf16_to_utf8.h: 25.3%
mssql_python.pybind.logger_bridge.cpp: 59.2%
mssql_python.pybind.ddbc_bindings.h: 59.7%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.internal.isadetection.h: 65.3%
mssql_python.row.py: 70.5%
mssql_python.pybind.logger_bridge.hpp: 70.8%
mssql_python.pybind.ddbc_bindings.cpp: 74.2%
|
|
Only issue seems to be that some Linux CI containers don't have git. I'm trying to fetch simdutf via url instead. |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
@gargsaumya The second CI error was due to the ubuntu image using an older version of cmake. Should be fine now I hope. I have also replaced the remaining std::wstring occurrences with std::u16string and eliminated helper functions as well as platform specific code accordingly. Let me know if you are happy with this holistic update to utf16 string handling or if you would prefer to keep it contained to the arrow fetch path. |
|
Thanks for the update @ffelixg. I actually prefer this and it LGTM overall. But I’ll still take a closer look at the changes and get back to you. |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
I think that this latest CI failure was due to a flaky test. The same test / platform passed in a past CI run and I don't think my changes affected that test. Could you rerun that one @gargsaumya? |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
gargsaumya
left a comment
Really appreciate the effort here, this is a very meaningful contribution @ffelixg. I've reviewed the PR and left 2 inline comments, please take a look!
bewithgaurav
left a comment
one real bug (diag message overflow, stack memory leaking into exception text on long error messages, repro inline), one accidental README typo, one perf nit on the new helper. only the diag bug actually blocks, the other two can ride along or land separately.
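The inline repro is not part of this page, so the following is only a guess at the general shape of the diag bug, not the actual fix. In ODBC, SQLGetDiagRec reports the full message length through TextLengthPtr even when the text was truncated to fit the caller's buffer, so using that length unclamped reads past what the driver wrote. A minimal sketch of the clamp, with the hypothetical helper `takeWrittenMessage` and plain C++ types in place of the ODBC ones:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Copies only the UTF-16 units that can actually be in the diagnostic buffer.
// `reportedLen` models the driver-reported message length, which may exceed
// the buffer capacity when the message was truncated.
std::u16string takeWrittenMessage(const char16_t* buffer,
                                  std::size_t capacityUnits,
                                  long reportedLen) {
    if (capacityUnits == 0 || reportedLen <= 0) {
        return {};
    }
    // One unit of the buffer is reserved for the null terminator.
    const std::size_t usable = capacityUnits - 1;
    const std::size_t len =
        std::min(static_cast<std::size_t>(reportedLen), usable);
    return std::u16string(buffer, len);
}
```

With a 4-unit buffer and a reported length of 10, this returns at most 3 units instead of reading 7 units of unrelated stack memory.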
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Login failures and other connection-time errors raised by the C++ pybind layer were surfacing as plain RuntimeError instead of a mssql_python exception, making it impossible for callers to catch them via the DB-API 2.0 exception hierarchy. Connection::checkError() now embeds the SQLSTATE code in the thrown message (SQLSTATE:XXXXX:<msg>) so the Python layer can map it to the correct exception class via sqlstate_to_exception() -- consistent with how cursor-level errors are already handled in helpers.py. The new _raise_connection_error() helper is applied to all four connection operations that go through checkError(): connect, commit, rollback, and set_autocommit. Note: when PR #526 (simdutf) merges, the two WideToUTF8() calls in connection.cpp::checkError() will need updating to utf16LeToUtf8Alloc(). Fixes #532
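The SQLSTATE:XXXXX:<msg> convention described above can be sketched as a pair of small functions. In the referenced change the parsing side lives in Python (sqlstate_to_exception() in helpers.py); the C++ version below is only an illustration of the message format, and `embedSqlState` and `splitSqlState` are hypothetical names that do not come from the repository.

```cpp
#include <cstddef>
#include <string>

// Builds the "SQLSTATE:XXXXX:<msg>" form described in the commit message.
std::string embedSqlState(const std::string& sqlState, const std::string& msg) {
    return "SQLSTATE:" + sqlState + ":" + msg;
}

// Splits the text back into its SQLSTATE code and message.
// Returns false (leaving the outputs untouched) if the prefix is missing.
bool splitSqlState(const std::string& text, std::string& state, std::string& msg) {
    static const std::string prefix = "SQLSTATE:";
    const std::size_t stateLen = 5;  // SQLSTATE codes are five characters long
    if (text.compare(0, prefix.size(), prefix) != 0) {
        return false;
    }
    const std::size_t sep = prefix.size() + stateLen;
    if (text.size() <= sep || text[sep] != ':') {
        return false;
    }
    state = text.substr(prefix.size(), stateLen);
    msg = text.substr(sep + 1);
    return true;
}
```

Messages without the prefix are rejected rather than guessed at, so a caller can fall back to a generic error type for them.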
|
@ffelixg - thanks for your responses - requesting you to please resolve conflicts and we'll do a last set of review |
|
@bewithgaurav Done, I also noticed that a couple |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
PR #526 changed ErrorInfo to return std::string directly instead of std::wstring, and removed WideToUTF8() function. Updated checkError() to use direct assignment since err.sqlState and err.ddbcErrorMsg are now already std::string. This fixes compilation errors after rebase on main.
Replace manual LCOV_EXCL_LINE markers with cleaner built-in lcov filtering.
This approach uses lcov's native exclusion mechanism and is more maintainable.
Changes:
- Add eng/scripts/join_logs_for_coverage.py to join multi-line LOG calls during coverage builds
- Modify build.sh to temporarily join LOG statements in codecov mode with automatic restore
- Use lcov --rc lcov_excl_line='\bLOG[A-Z_]*\s*\(' to exclude LOG macros from coverage
- Add llvm-cov ignore pattern for build/_deps/ (vendored simdutf sources from PR #526)
- Add lcov --remove for build/_deps/ as defense-in-depth (from PR #579)
- Update .gitignore to exclude local development scripts
Benefits:
- No source code clutter (600+ markers not needed)
- Catches all LOG variants (LOG_ERROR, LOG_WARNING, etc.)
- Excludes vendored third-party dependencies from coverage metrics
- Cleaner, more maintainable approach using lcov native features
- Source files remain unchanged in repository
Addresses review feedback from @bewithgaurav on PR #556
Includes changes from PR #579 to fix simdutf coverage pollution
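To see which lines the exclusion pattern from this commit message would catch, here is a small check using std::regex. lcov evaluates the pattern with Perl regular expressions, while this sketch uses the ECMAScript grammar of std::regex; the two agree for the constructs used here (\b, \s, character classes), but that equivalence is an assumption of the sketch, and `isLogLine` is a hypothetical helper.

```cpp
#include <regex>
#include <string>

// True if the line contains a LOG-style macro call matching the pattern
// quoted in the commit message: \bLOG[A-Z_]*\s*\(
bool isLogLine(const std::string& line) {
    static const std::regex pattern(R"(\bLOG[A-Z_]*\s*\()");
    return std::regex_search(line, pattern);
}
```

The word boundary keeps identifiers such as CATALOG( from matching, while LOG(, LOG_ERROR( and LOG_WARNING ( all do. The pattern is applied per line, which is presumably why the build joins multi-line LOG calls first.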
### Work Item / Issue Reference
> [AB#45159](https://sqlclientdrivers.visualstudio.com/mssql-python/_sprints/taskboard/mssql-python%20Team/mssql-python/Rubidium/May%202026?workitem=45159)

-------------------------------------------------------------------

### Summary

**Enhancements**
- #548 — manylinux_2_28 build targets for RHEL 8 / glibc 2.28
- #542 — macOS universal2 wheel for Python 3.10
- #526 — UTF-16 string handling via simdutf
- #528 — Optimized execute() hot path
- #567 — Azure Linux installation docs

**Bug Fixes**
- #562 — Login failures now raise mssql_python exception instead of RuntimeError
- #568 — GIL released during blocking SQLSetConnectAttr calls
- #541 — GIL released during blocking ODBC statement/fetch/transaction calls
- #560 — executemany RuntimeError when decimals change signs
- #495 — Inconsistent CP1252 VARCHAR retrieval Windows vs Linux
- #559 — BulkCopy empty string in NVARCHAR(MAX)/VARCHAR(MAX) (via mssql_py_core 0.1.4)