arrow_odbc package
Module contents
- class arrow_odbc.BatchReader(reader: BatchReaderRaii)[source]
Bases:
objectIterates over Arrow batches from an ODBC data source
- into_pyarrow_record_batch_reader()[source]
Converts the
arrow-odbcBatchReaderinto apyarrowRecordBatchReader. This method fully passes ownership to the new reader and leavesselfempty.arrow-odbc``s BatchReader interface offers some functionality specific to ODBC datasources. E.g. the ability to move to the next result set of a stored procedure. You may not need this extra functionality and would rather like to integrate the ``BatchReaderwith other libraries like e.g. DuckDB. In order to do this you can use this method to convert thearrow-odbcBatchReader into apyarrowRecordBatchReader.
- more_results(batch_size: int = 65535, max_bytes_per_batch: int = 536870912, max_text_size: int | None = None, max_binary_size: int | None = None, falliable_allocations: bool = False, schema: Schema | None = None, map_schema: Callable[[Schema], Schema] | None = None, fetch_concurrently: bool = True, payload_text_encoding: TextEncoding = TextEncoding.AUTO) bool[source]
Move the reader to the next result set returned by the data source.
A datasource may return multiple results if multiple SQL statements are executed in a single query or a stored procedure is called. This method closes the current cursor and moves it to the next result set. You may move to the next result set without extracting the current one.
Example:
from arrow_odbc import read_arrow_batches_from_odbc connection_string= "Driver={ODBC Driver 18 for SQL Server};" "Server=localhost;" "TrustServerCertificate=yes;" reader = read_arrow_batches_from_odbc( query=f"SELECT * FROM MyTable; SELECT * FROM OtherTable;", connection_string=connection_string, batch_size=1000, user="SA", password="My@Test@Password", ) # Process first result for batch in reader: # Process arrow batches df = batch.to_pandas() # ... reader.more_results() # Process second result for batch in reader: # Process arrow batches df = batch.to_pandas() # ...
- Parameters:
batch_size – The maximum number rows within each batch. Batch size can be individually choosen for each result set. The maximum number of rows can be less if the upper bound defined by
max_bytes_per_batchis lower.max_bytes_per_batch – An upper limit for the total size (all columns) of the buffer used to transit data from the ODBC driver to the application. Please note that memory consumption of this buffer is determined not by the actual values, but by the maximum possible length of an indiviual row times the number of rows it can hold. Both
batch_sizeand this parameter define upper bounds for the same buffer. Which ever bound is lower is used to determine the buffer size.max_text_size – In order for fast bulk fetching to work, arrow-odbc needs to know the size of the largest possible field in each column. It will do so itself automatically by considering the schema information. However, trouble arises if the schema contains unbounded variadic fields like VARCHAR(MAX) which can hold really large values. These have a very high upper element size, if any. In order to work with such schemas we need a limit, of what the an upper bound of the actual values in the column is, as opposed to the what the largest value is the column could theoretically store. There is no need for this to be precise, but just knowing that a value would never exceed 4KiB rather than 2GiB is enough to allow for tremendous efficiency gains. The size of the text is specified in UTF-8 encoded bytes if using a narrow encoding (typically all non-windows systems) and in UTF-16 encoded pairs of bytes on systems using a wide encoding (typically windows). This means about the size in letters, yet if you are using a lot of emojis or other special characters this number might need to be larger.
max_binary_size – An upper limit for the size of buffers bound to variadic binary columns of the data source. This limit does not (directly) apply to the size of the created arrow buffers, but rather applies to the buffers used for the data in transit. Use this option if you have e.g. VARBINARY(MAX) fields in your next batch.
falliable_allocations – If
Truean recoverable error is raised in case there is not enough memory to allocate the buffers. This option may incurr a performance penalty which scales with the batch size parameter (but not with the amount of actual data in the source). In case you can test your query against the schema you can safely set this toFalse. The required memory will not depend on the amount of data in the data source. Default isTruethough, safety first.fetch_concurrently – Trade memory for speed. Allocates another transit buffer and use it to fetch row set groups (aka. batches) from the ODBC data source in a dedicated system thread, while the main thread converts the previous batch to arrow arrays and executes the application logic. The transit buffer may be the biggest part of the required memory so if
Truearrow-odbcconsumes almost two times the memory as compared to false. On the flipsite the next batch can be fetched from the database immediatly without waiting for the application logic to return control.payload_text_encoding – Controls the encoding used for transferring text data from the ODBC data source to the application. The resulting Arrow arrays will still be UTF-8 encoded. If you see garbage characters or invalid UTF-8 errors in non-windows systems, you may want to set the encoding to
TextEncoding.Utf16. On windows systems you may want to set this toTextEncoding::Utf8to gain performance benefits, after you have verified that your system locale is set to UTF-8.
- Returns:
Truein case there is another result set.Falsein case that the last result set has been processed.
- class arrow_odbc.BatchReaderProtocol(*args, **kwargs)[source]
Bases:
ProtocolAnything that exposes an Arrow schema and iterates over record batches.
Both
pyarrow.RecordBatchReaderandarrow_odbc.BatchReadersatisfy this structurally — no inheritance required.- property schema: Schema
- class arrow_odbc.BatchWriter(handle: _CDataBase)[source]
Bases:
objectWrites arrow batches to a database table.
- classmethod from_connection(connection_handle: _CDataBase, reader: BatchReaderProtocol, chunk_size: int, table: str)[source]
Create a
BatchWriterfrom a connection handle and aRecordBatchReaderorBatchReader.This is a low-level constructor. Use :meth:
Connection.insert_into_tableinstead.
- class arrow_odbc.Connection(handle: _CDataBase)[source]
Bases:
objectA strong reference to an ODBC connection.
- commit() None[source]
Commit the current transaction. Behavior is only defined in manual commit mode, which can be set by setting
autocommittoFalsewhen creating the connection.
- classmethod enable_connection_pooling() None[source]
Activates the connection pooling of the ODBC driver manager for the entire process. Best called before creating the ODBC environment, i.e. before you the first connection is opend with arrow-odbc. This is useful in scenarios there you frequently read or write rows and the overhead of creating a connection for each query is significant.
Example:
from arrow_odbc import Connection # Let the ODBC driver manager take care of connection pooling for us Connection.enable_connection_pooling() # Create the first connection after Connection pooling is enabled connection_string= "Driver={ODBC Driver 18 for SQL Server};" "Server=localhost;" "TrustServerCertificate=yes;" connection = connect( connection_string=connection_string, user="SA", password="My@Test@Password" )
- execute(query: str, parameters: Sequence[str | None] | None = None, query_timeout_sec: int | None = None, payload_text_encoding: TextEncoding = TextEncoding.AUTO) None[source]
Execute a SQL statement which does not return a result set, e.g.
INSERT,UPDATE,DELETEor DDL likeCREATE TABLE. Any result set the statement might produce is discarded.Example:
from arrow_odbc import connect connection = connect( connection_string=connection_string, user="SA", password="My@Test@Password", ) connection.execute("CREATE TABLE MyTable (a INTEGER);") connection.execute("INSERT INTO MyTable (a) VALUES (?);", parameters=["42"])
- Parameters:
query – The SQL statement to execute.
parameters – ODBC allows you to use a question mark as placeholder marker (
?) for positional parameters. This argument takes a list of parameters those number must match the number of placeholders in the SQL statement. Currently all parameters are passed as VARCHAR strings. You can useNoneto passNULL.query_timeout_sec – Use this to limit the time the query is allowed to take, before responding to the application. The driver may replace the number of seconds you provide with a minimum or maximum value. You can specify
0to deactivate the timeout, this is the default. For this to work the driver must support this feature. E.g. PostgreSQL and Microsoft SQL Server do, but SQLite or MariaDB do not.payload_text_encoding – Controls the encoding used for the string parameters bound to the query. If you see garbage characters or invalid UTF-8 errors in non-windows systems, you may want to set the encoding to
TextEncoding.Utf16. On windows systems you may want to set this toTextEncoding::Utf8to gain performance benefits, after you have verified that your system locale is set to UTF-8.
- from_table_to_db(source: Table, target: str, chunk_size: int = 1000)[source]
Reads an arrow table and inserts its contents into a relational table on the database.
This is a convinience wrapper around
insert_into_tablewhich converts an arrow table into a record batch reader for you.Example:
from arrow_odbc import connect import pyarrow as pa import pandas def dataframe_to_table(df): table = pa.Table.from_pandas(df) connection = connect( connection_string=connection_string, user="SA", password="My@Test@Password" ) connection.from_table_to_db( source=table, target="MyTable", chunk_size=1000 )
- Parameters:
source – PyArrow table with content to be inserted into the target table on the database. Each column of the table must correspond to a column in the target table with identical name.
target – Name of the database table to insert into.
chunk_size – Number of records to insert in each roundtrip to the database. The number will be automatically reduced to the number of rows, if the table is small, in order to save memory.
- insert_into_table(reader: BatchReaderProtocol, table: str, chunk_size: int)[source]
Consume the batches in the reader and insert them into a table on the database.
Example:
from arrow_odbc import connect import pyarrow as pa import pandas def dataframe_to_table(df): table = pa.Table.from_pandas(df) reader = pa.RecordBatchReader.from_batches(table.schema, table.to_batches()) connectiion = connect( connection_string=connection_string, user="SA", password="My@Test@Password" ) connection.insert_into_table( table="MyTable", reader=reader, chunk_size=1000, )
- Parameters:
reader – Reader is used to iterate over record batches. It must expose a schema attribute, referencing an Arrow schema. Each field in the schema must correspond to a column in the table with identical name. The iterator must yield individual arrow tables
table – Name of a database table to insert into. Used to generate the insert statement for the bulk writer.
chunk_size – Number of records to insert in each roundtrip to the database. Independent of batch size (i.e. number of rows in an individual record batch).
- read_arrow_batches(query: str, batch_size: int = 65535, parameters: Sequence[str | None] | None = None, max_bytes_per_batch: int | None = 536870912, max_text_size: int | None = None, max_binary_size: int | None = None, falliable_allocations: bool = False, schema: Schema | None = None, map_schema: Callable[[Schema], Schema] | None = None, fetch_concurrently: bool = True, query_timeout_sec: int | None = None, payload_text_encoding: TextEncoding = TextEncoding.AUTO) BatchReader[source]
Execute the query and read the result as an iterator over Arrow batches.
Example:
from arrow_odbc import connect # Connect to the data source connection_string= "Driver={ODBC Driver 18 for SQL Server};" "Server=localhost;" "TrustServerCertificate=yes;" connection = connect( connection_string=connection_string, user="SA", password="My@Test@Password" ) # Execute query and create reader reader = connection.read_arrow_batches( query=f"SELECT * FROM MyTable WHERE a=?", batch_size=1000, parameters=["I'm a positional query parameter"], ) # Process results for batch in reader: # Process arrow batches df = batch.to_pandas() # ...
- Parameters:
connection – An ODBC connection created with
connect.query – The SQL statement yielding the result set which is converted into arrow record batches.
batch_size – The maximum number rows within each batch. The maximum number of rows can be less if the upper bound defined by
max_bytes_per_batchis lower.parameters – ODBC allows you to use a question mark as placeholder marker (
?) for positional parameters. This argument takes a list of parameters those number must match the number of placholders in the SQL statement. Using this instead of literals helps you avoid SQL injections or may otherwise simplify your code. Currently all parameters are passed as VARCHAR strings. You can use None to pass NULL.max_bytes_per_batch – An upper limit for the total size (all columns) of the buffer used to transit data from the ODBC driver to the application. Please note that memory consumption of this buffer is determined not by the actual values, but by the maximum possible length of an individual row times the number of rows it can hold. Both
batch_sizeand this parameter define upper bounds for the same buffer. Which ever bound is lower is used to determine the buffer size.max_text_size – In order for fast bulk fetching to work, arrow-odbc needs to know the size of the largest possible field in each column. It will do so itself automatically by considering the schema information. However, trouble arises if the schema contains unbounded variadic fields like VARCHAR(MAX) which can hold really large values. These have a very high upper element size, if any. In order to work with such schemas we need a limit, of what the an upper bound of the actual values in the column is, as opposed to the what the largest value is the column could theoretically store. There is no need for this to be precise, but just knowing that a value would never exceed 4KiB rather than 2GiB is enough to allow for tremendous efficiency gains. The size of the text is specified in UTF-8 encoded bytes if using a narrow encoding (typically all non-windows systems) and in UTF-16 encoded pairs of bytes on systems using a wide encoding (typically windows). This means about the size in letters, yet if you are using a lot of emojis or other special characters this number might need to be larger.
max_binary_size – An upper limit for the size of buffers bound to variadic binary columns of the data source. This limit does not (directly) apply to the size of the created arrow buffers, but rather applies to the buffers used for the data in transit. Use this option if you have e.g. VARBINARY(MAX) fields in your database schema. In such a case without an upper limit, the ODBC driver of your data source is asked for the maximum size of an element, and is likely to answer with either 0 or a value which is way larger than any actual entry in the column. If you can not adapt your database schema, this limit might be what you are looking for. This is the maximum size in bytes of the binary column.
falliable_allocations – If
Truean recoverable error is raised in case there is not enough memory to allocate the buffers. This option may incurr a performance penalty which scales with the batch size parameter (but not with the amount of actual data in the source). In case you can test your query against the schema you can safely set this toFalse. The required memory will not depend on the amount of data in the data source. Default isTruethough, safety first.schema – Allows you to overwrite the automatically detected schema with one supplied by the application. Reasons for doing so include domain knowledge you have about the data which is not reflected in the schema information. E.g. you happen to know a field of timestamps contains strictly dates. Another reason could be that for certain usecases another it can make sense to decide the type based on what you want to do with it, rather than its source. E.g. if you simply want to put everything into a CSV file it can make perfect sense to fetch everything as string independent of its source type.
map_schema – Allows you to provide a custom schema based on the schema inferred from the metainformation of the query. This would allow you to e.g. map every column type to string or replace any float32 with a float64, or anything else you might want to customize, for various reasons while still staying generic over the input schema. If both
map_schemaandschemaare specifiedmap_schematakes priority.fetch_concurrently – Trade memory for speed. Allocates another transit buffer and use it to fetch row set groups (aka. batches) from the ODBC data source in a dedicated system thread, while the main thread converts the previous batch to arrow arrays and executes the application logic. The transit buffer may be the biggest part of the required memory so if
Truearrow-odbcconsumes almost two times the memory as compared to false. On the flipsite the next batch can be fetched from the database immediatly without waiting for the application logic to return control.query_timeout_sec – Use this to limit the time the query is allowed to take, before responding with data to the application. The driver may replace the number of seconds you provide with a minimum or maximum value. You can specify
0, to deactivate the timeout, this is the default. For this to work the driver must support this feature. E.g. PostgreSQL, and Microsoft SQL Server do, but SQLite or MariaDB do not.payload_text_encoding – Controls the encoding used for transferring text data from the ODBC data source to the application. The resulting Arrow arrays will still be UTF-8 encoded. If you see garbage characters or invalid UTF-8 errors in non-windows systems, you may want to set the encoding to
TextEncoding.Utf16. On windows systems you may want to set this toTextEncoding::Utf8to gain performance benefits, after you have verified that your system locale is set to UTF-8.
- Returns:
A
BatchReaderis returned, which implements the iterator protocol and iterates over individual arrow batches.
- exception arrow_odbc.Error(handle: _CDataBase)[source]
Bases:
ExceptionAn error emmited by the arrow-odbc-py bindings.
- class arrow_odbc.TextEncoding(value)[source]
Bases:
EnumText encoding used for the payload of text columns, to transfer data from the data source to the application.
Autoevaluates to Utf16 on windows and Self::Utf8 on other systems. We do this, because most systems e.g. MacOs and Linux use UTF-8 as their default encoding, while windows may still use a Latin1 or some other extended ASCII as their narrow encoding. On the other hand many Posix drivers are lacking in their support for wide function calls and UTF-16. So usingUtf16on windows andUtf8everythere else is a good starting point.Utf8use narrow characters (one byte) to encode text in payloads. ODBC lets the client choose the encoding which should be based on the system local. This is often not what is actually happening though. If we use narrow encoding, we assume the text to be UTF-8 and error if we find that not to be the case.Utf16use wide characters (two bytes) to encode text in payloads. ODBC defines the encoding to be always UTF-16.- AUTO = 0
- UTF16 = 2
- UTF8 = 1
- arrow_odbc.connect(connection_string: str, user: str | None = None, password: str | None = None, login_timeout_sec: int | None = None, packet_size: int | None = None, autocommit: bool = True) Connection[source]
Opens a connection to an ODBC data source.
In case you want to use connection pooling, call
enable_odbc_connection_pooling()before calling this function.Example:
from arrow_odbc import connect, enable_odbc_connection_pooling # Let the ODBC driver manager take care of connection pooling for us enable_odbc_connection_pooling() # Connect to the data source connection_string= "Driver={ODBC Driver 18 for SQL Server};" "Server=localhost;" "TrustServerCertificate=yes;" connection = connect( connection_string=connection_string, user="SA", password="My@Test@Password" )
- Parameters:
connection_string – ODBC Connection string used to connect to the data source. To find a connection string for your data source try https://www.connectionstrings.com/.
user – Allows for specifying the user seperatly from the connection string if it is not already part of it. The value will eventually be escaped and attached to the connection string as UID.
password – Allows for specifying the password seperatly from the connection string if it is not already part of it. The value will eventually be escaped and attached to the connection string as PWD.
login_timeout_sec – Number of seconds to wait for a login request to complete before returning to the application. The default is driver-dependent. If
0, the timeout is disabled and a connection attempt will wait indefinitely. If the specified timeout exceeds the maximum login timeout in the data source, the driver substitutes that value and uses that instead.packet_size – Specifying the network packet size in bytes. Many ODBC drivers do not support this option. If the specified size exceeds the maximum packet size or is smaller than the minimum packet size, the driver substitutes that value and returns SQLSTATE 01S02 (Option value changed).You may want to enable logging to standard error using
log_to_stderr.autocommit – If
Truethe connection is set to autocommit mode, which means that each individual statement is committed immediately after it is executed. This is the default for ODBC connections, but some drivers may choose to use manual commit mode by default. IfFalsethe connection is set to manual commit mode. In manual commit mode you need to explicitly call commit() on the connection after executing a statement to make the changes visible to other connections. If you do not do so, your changes will not be visible to other connections and will be rolled back when the connection is closed. Setting this parameter toTrueensures that you do not have to worry about this and that your changes are always visible immediately. Setting it toFalseallows you to execute multiple inserts and queries in the same transaction. Insert performance might also differ based on commit mode.
- Returns:
A
Connectionis returned.
- arrow_odbc.enable_odbc_connection_pooling()[source]
Activates the connection pooling of the ODBC driver manager for the entire process. Best called before creating the ODBC environment, i.e. before you first insert or read rows with arrow-odbc. This is useful in scenarios there you frequently read or write rows and the overhead of creating a connection for each query is significant.
Example:
from arrow_odbc import connect, enable_odbc_connection_pooling # Let the ODBC driver manager take care of connection pooling for us enable_odbc_connection_pooling() # Connect to the data source connection_string= "Driver={ODBC Driver 18 for SQL Server};" "Server=localhost;" "TrustServerCertificate=yes;" connection = connect( connection_string=connection_string, user="SA", password="My@Test@Password" )
- arrow_odbc.from_table_to_db(source: Table, target: str, connection_string: str, chunk_size: int = 1000, user: str | None = None, password: str | None = None, login_timeout_sec: int | None = None)[source]
Reads an arrow table and inserts its contents into a relational table on the database.
This is a convinience wrapper around
insert_into_tablewhich converts an arrow table into a record batch reader for you.Example:
from arrow_odbc import from_table_to_db import pyarrow as pa import pandas def dataframe_to_table(df): table = pa.Table.from_pandas(df) from_table_to_db( source=table connection_string=connection_string, user="SA", password="My@Test@Password", chunk_size=1000, table="MyTable", )
- Parameters:
source – PyArrow table with content to be inserted into the target table on the database. Each column of the table must correspond to a column in the target table with identical name. Each field in the schema must correspond to a column in the table with identical name. The iterator must yield individual arrow tables. In case a column name contains a non-alphanumeric character different from
@,$,#, or_, the name will be escaped using double quotes. However if the name already is already escaped, i.e. it is enclosed in either rectangular brackets ([, ]), double quotes (”) or backticks (`), it will be used as is. This is to allow for users to apply their own quoting if needed, e.g. in case reserved keywords are used as column names.target – Name of the database table to insert into.
connection_string – ODBC Connection string used to connect to the data source. To find a connection string for your data source try https://www.connectionstrings.com/.
chunk_size – Number of records to insert in each roundtrip to the database. The number will be automatically reduced to the number of rows, if the table is small, in order to save memory.
user – Allows for specifying the user seperatly from the connection string if it is not already part of it. The value will eventually be escaped and attached to the connection string as UID.
password – Allows for specifying the password seperatly from the connection string if it is not already part of it. The value will eventually be escaped and attached to the connection string as PWD.
login_timeout_sec – Number of seconds to wait for a login request to complete before returning to the application. The default is driver-dependent. If
0, the timeout is disabled and a connection attempt will wait indefinitely. If the specified timeout exceeds the maximum login timeout in the data source, the driver substitutes that value and uses that instead.
- arrow_odbc.insert_into_table(reader: BatchReaderProtocol, chunk_size: int, table: str, connection_string: str, user: str | None = None, password: str | None = None, login_timeout_sec: int | None = None, packet_size: int | None = None)[source]
Consume the batches in the reader and insert them into a table on the database.
Example:
from arrow_odbc import insert_into_table import pyarrow as pa import pandas def dataframe_to_table(df): table = pa.Table.from_pandas(df) reader = pa.RecordBatchReader.from_batches(table.schema, table.to_batches()) insert_into_table( connection_string=connection_string, user="SA", password="My@Test@Password", chunk_size=1000, table="MyTable", reader=reader, )
- Parameters:
reader – Reader is used to iterate over record batches. It must expose a schema attribute, referencing an Arrow schema. Each field in the schema must correspond to a column in the table with identical name. The iterator must yield individual arrow tables. In case a column name contains a non-alphanumeric character different from
@,$,#, or_, the name will be escaped using double quotes. However if the name already is already escaped, i.e. it is enclosed in either rectangular brackets ([, ]), double quotes (”) or backticks (`), it will be used as is. This is to allow for users to apply their own quoting if needed, e.g. in case reserved keywords are used as column names.chunk_size – Number of records to insert in each roundtrip to the database. Independent of batch size (i.e. number of rows in an individual record batch).
table – Name of a database table to insert into. Used to generate the insert statement for the bulk writer.
connection_string – ODBC Connection string used to connect to the data source. To find a connection string for your data source try https://www.connectionstrings.com/.
user – Allows for specifying the user seperatly from the connection string if it is not already part of it. The value will eventually be escaped and attached to the connection string as UID.
password – Allows for specifying the password seperatly from the connection string if it is not already part of it. The value will eventually be escaped and attached to the connection string as PWD.
login_timeout_sec – Number of seconds to wait for a login request to complete before returning to the application. The default is driver-dependent. If
0, the timeout is disabled and a connection attempt will wait indefinitely. If the specified timeout exceeds the maximum login timeout in the data source, the driver substitutes that value and uses that instead.packet_size – Specifying the network packet size in bytes. Many ODBC drivers do not support this option. If the specified size exceeds the maximum packet size or is smaller than the minimum packet size, the driver substitutes that value and returns SQLSTATE 01S02 (Option value changed).You may want to enable logging to standard error using
log_to_stderr.
- arrow_odbc.log_to_stderr(level: int = 1)[source]
Activate logging from native code directly to standard error. In particular these logs contain diagnostic information emitted by ODBC. Call this method only once in your application
- Parameters:
level –
Specifies the log level with which the standard error logger in rust is initialized.
0 - Error
1 - Warning,
2 - Info
3 - Debug
4 - Trace
All diagnostics emitted by ODBC are usually warning. In case of an exeception multiple records with severity error could also be emitted.
- arrow_odbc.read_arrow_batches_from_odbc(query: str, connection_string: str, batch_size: int = 65535, user: str | None = None, password: str | None = None, parameters: Sequence[str | None] | None = None, max_bytes_per_batch: int | None = 536870912, max_text_size: int | None = None, max_binary_size: int | None = None, falliable_allocations: bool = False, login_timeout_sec: int | None = None, packet_size: int | None = None, schema: Schema | None = None, map_schema: Callable[[Schema], Schema] | None = None, fetch_concurrently: bool = True, query_timeout_sec: int | None = None, payload_text_encoding: TextEncoding = TextEncoding.AUTO) BatchReader[source]
Execute the query and read the result as an iterator over Arrow batches.
Example:
from arrow_odbc import read_arrow_batches_from_odbc connection_string= "Driver={ODBC Driver 18 for SQL Server};" "Server=localhost;" "TrustServerCertificate=yes;" reader = read_arrow_batches_from_odbc( query=f"SELECT * FROM MyTable WHERE a=?", connection_string=connection_string, batch_size=1000, parameters=["I'm a positional query parameter"], user="SA", password="My@Test@Password", ) for batch in reader: # Process arrow batches df = batch.to_pandas() # ...
- Parameters:
query – The SQL statement yielding the result set which is converted into arrow record batches.
batch_size – The maximum number rows within each batch. The maximum number of rows can be less if the upper bound defined by
max_bytes_per_batchis lower.connection_string – ODBC Connection string used to connect to the data source. To find a connection string for your data source try https://www.connectionstrings.com/.
user – Allows for specifying the user seperatly from the connection string if it is not already part of it. The value will eventually be escaped and attached to the connection string as UID.
password – Allows for specifying the password seperatly from the connection string if it is not already part of it. The value will eventually be escaped and attached to the connection string as PWD.
parameters – ODBC allows you to use a question mark as placeholder marker (
?) for positional parameters. This argument takes a list of parameters those number must match the number of placholders in the SQL statement. Using this instead of literals helps you avoid SQL injections or may otherwise simplify your code. Currently all parameters are passed as VARCHAR strings. You can use None to pass NULL.max_bytes_per_batch – An upper limit for the total size (all columns) of the buffer used to transit data from the ODBC driver to the application. Please note that memory consumption of this buffer is determined not by the actual values, but by the maximum possible length of an indiviual row times the number of rows it can hold. Both
batch_sizeand this parameter define upper bounds for the same buffer. Which ever bound is lower is used to determine the buffer size.max_text_size – In order for fast bulk fetching to work, arrow-odbc needs to know the size of the largest possible field in each column. It will do so itself automatically by considering the schema information. However, trouble arises if the schema contains unbounded variadic fields like VARCHAR(MAX) which can hold really large values. These have a very high upper element size, if any. In order to work with such schemas we need a limit, of what the an upper bound of the actual values in the column is, as opposed to the what the largest value is the column could theoretically store. There is no need for this to be precise, but just knowing that a value would never exceed 4KiB rather than 2GiB is enough to allow for tremendous efficiency gains. The size of the text is specified in UTF-8 encoded bytes if using a narrow encoding (typically all non-windows systems) and in UTF-16 encoded pairs of bytes on systems using a wide encoding (typically windows). This means about the size in letters, yet if you are using a lot of emojis or other special characters this number might need to be larger.
max_binary_size – An upper limit for the size of buffers bound to variadic binary columns of the data source. This limit does not (directly) apply to the size of the created arrow buffers, but rather applies to the buffers used for the data in transit. Use this option if you have e.g. VARBINARY(MAX) fields in your database schema. In such a case without an upper limit, the ODBC driver of your data source is asked for the maximum size of an element, and is likely to answer with either 0 or a value which is way larger than any actual entry in the column. If you can not adapt your database schema, this limit might be what you are looking for. This is the maximum size in bytes of the binary column.
falliable_allocations – If
Truean recoverable error is raised in case there is not enough memory to allocate the buffers. This option may incurr a performance penalty which scales with the batch size parameter (but not with the amount of actual data in the source). In case you can test your query against the schema you can safely set this toFalse. The required memory will not depend on the amount of data in the data source. Default isTruethough, safety first.login_timeout_sec – Number of seconds to wait for a login request to complete before returning to the application. The default is driver-dependent. If
0, the timeout is disabled and a connection attempt will wait indefinitely. If the specified timeout exceeds the maximum login timeout in the data source, the driver substitutes that value and uses that instead.packet_size – Specifying the network packet size in bytes. Many ODBC drivers do not support this option. If the specified size exceeds the maximum packet size or is smaller than the minimum packet size, the driver substitutes that value and returns SQLSTATE 01S02 (Option value changed).You may want to enable logging to standard error using
log_to_stderr.schema – Allows you to overwrite the automatically detected schema with one supplied by the application. Reasons for doing so include domain knowledge you have about the data which is not reflected in the schema information. E.g. you happen to know a field of timestamps contains strictly dates. Another reason could be that for certain usecases another it can make sense to decide the type based on what you want to do with it, rather than its source. E.g. if you simply want to put everything into a CSV file it can make perfect sense to fetch everything as string independent of its source type.
map_schema – Allows you to provide a custom schema based on the schema inferred from the metainformation of the query. This would allow you to e.g. map every column type to string or replace any float32 with a float64, or anything else you might want to customize, for various reasons while still staying generic over the input schema. If both
map_schemaandschemaare specifiedmap_schematakes priority.fetch_concurrently – Trade memory for speed. Allocates another transit buffer and use it to fetch row set groups (aka. batches) from the ODBC data source in a dedicated system thread, while the main thread converts the previous batch to arrow arrays and executes the application logic. The transit buffer may be the biggest part of the required memory so if
Truearrow-odbcconsumes almost two times the memory as compared to false. On the flipsite the next batch can be fetched from the database immediatly without waiting for the application logic to return control.query_timeout_sec – Use this to limit the time the query is allowed to take, before responding with data to the application. The driver may replace the number of seconds you provide with a minimum or maximum value. You can specify
0, to deactivate the timeout, this is the default. For this to work the driver must support this feature. E.g. PostgreSQL, and Microsoft SQL Server do, but SQLite or MariaDB do not.payload_text_encoding – Controls the encoding used for transferring text data from the ODBC data source to the application. The resulting Arrow arrays will still be UTF-8 encoded. If you see garbage characters or invalid UTF-8 errors in non-windows systems, you may want to set the encoding to
TextEncoding.Utf16. On windows systems you may want to set this toTextEncoding::Utf8to gain performance benefits, after you have verified that your system locale is set to UTF-8.
- Returns:
A
BatchReaderis returned, which implements the iterator protocol and iterates over individual arrow batches.