arrow_odbc package

Module contents

class arrow_odbc.BatchReader(reader: _BatchReaderRaii)[source]

Bases: object

Iterates over Arrow batches from an ODBC data source

fetch_concurrently()[source]

Allocate another transit buffer and use it to fetch row set groups (aka. batches) from the ODBC data source in a dedicated system thread, while the main thread converts the previous batch to arrow arrays and executes the application logic.

If you extract more than one result set from the cursor, you need to call this method for each result set you want to fetch concurrently. This design makes it possible to skip result sets without worrying that the fetching thread has already started fetching row groups from a result set you intended to skip.

Calling this method on an already concurrent reader has no effect.

Example:

from arrow_odbc import read_arrow_batches_from_odbc

connection_string="Driver={ODBC Driver 17 for SQL Server};Server=localhost;"

reader = read_arrow_batches_from_odbc(
    query=f"SELECT * FROM MyTable,
    connection_string=connection_string,
    batch_size=1000,
    user="SA",
    password="My@Test@Password",
)
# Trade memory for speed. For the price of an additional transit buffer and a native
# system thread we fetch batches now concurrent to our application logic.
reader.fetch_concurrently()

for batch in reader:
    # Process arrow batches
    df = batch.to_pandas()
    # ...
into_pyarrow_record_batch_reader()[source]

Converts the arrow-odbc BatchReader into a pyarrow RecordBatchReader. This method fully passes ownership to the new reader and leaves self empty.

arrow-odbc's BatchReader interface offers some functionality specific to ODBC data sources, e.g. the ability to move to the next result set of a stored procedure. You may not need this extra functionality and would rather integrate the BatchReader with other libraries such as DuckDB. In order to do this you can use this method to convert the arrow-odbc BatchReader into a pyarrow RecordBatchReader.
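A minimal sketch of such a conversion (query and connection string are placeholders in the style of the other examples in this module); the resulting pyarrow RecordBatchReader can then be consumed by plain pyarrow or handed to any Arrow-aware library:

from arrow_odbc import read_arrow_batches_from_odbc

connection_string="Driver={ODBC Driver 17 for SQL Server};Server=localhost;"

reader = read_arrow_batches_from_odbc(
    query="SELECT * FROM MyTable",
    connection_string=connection_string,
)

# Ownership moves to the pyarrow reader; `reader` is left empty afterwards.
record_batch_reader = reader.into_pyarrow_record_batch_reader()

# Consume it with plain pyarrow, e.g. materialize everything into a table.
table = record_batch_reader.read_all()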

more_results(batch_size: int = 65535, max_bytes_per_batch: int = 536870912, max_text_size: int | None = None, max_binary_size: int | None = None, falliable_allocations: bool = False, schema: Schema | None = None, map_schema: Callable[[Schema], Schema] | None = None) bool[source]

Move the reader to the next result set returned by the data source.

A datasource may return multiple results if multiple SQL statements are executed in a single query or a stored procedure is called. This method closes the current cursor and moves it to the next result set. You may move to the next result set without extracting the current one.

Example:

from arrow_odbc import read_arrow_batches_from_odbc

connection_string="Driver={ODBC Driver 17 for SQL Server};Server=localhost;"
reader = read_arrow_batches_from_odbc(
    query=f"SELECT * FROM MyTable; SELECT * FROM OtherTable;",
    connection_string=connection_string,
    batch_size=1000,
    user="SA",
    password="My@Test@Password",
)

# Process first result
for batch in reader:
    # Process arrow batches
    df = batch.to_pandas()
    # ...

reader.more_results()

# Process second result
for batch in reader:
    # Process arrow batches
    df = batch.to_pandas()
    # ...
Parameters:
  • batch_size – The maximum number of rows within each batch. The batch size can be chosen individually for each result set. The actual number of rows can be less if the upper bound defined by max_bytes_per_batch is lower.

  • max_bytes_per_batch – An upper limit for the total size (all columns) of the buffer used to transit data from the ODBC driver to the application. Please note that the memory consumption of this buffer is determined not by the actual values, but by the maximum possible length of an individual row times the number of rows it can hold. Both batch_size and this parameter define upper bounds for the same buffer. Whichever bound is lower determines the buffer size.

  • max_text_size – An upper limit for the size of buffers bound to variadic text columns of the data source. This limit does not (directly) apply to the size of the created arrow buffers, but rather to the buffers used for the data in transit. Use this option if you have e.g. VARCHAR(MAX) fields in your database schema. For such columns without an upper limit, the ODBC driver of your data source is asked for the maximum size of an element and is likely to answer with either 0 or a value far larger than any actual entry in the column. If you cannot adapt your database schema, this limit might be what you are looking for. On Windows systems the size is measured in 16-bit units, as Windows utilizes a UTF-16 encoding, so this translates roughly to the size in letters. On non-Windows systems this is the size in bytes and the data source is assumed to utilize a UTF-8 encoding. None means no upper limit is set and the maximum element size reported by ODBC is used to determine buffer sizes. Lower values result in better memory utilization and can significantly lower the number of bytes needed per row. Higher values allow larger values to go through without truncation.

  • max_binary_size – An upper limit for the size of buffers bound to variadic binary columns of the data source. This limit does not (directly) apply to the size of the created arrow buffers, but rather to the buffers used for the data in transit. Use this option if you have e.g. VARBINARY(MAX) fields in your database schema.

  • falliable_allocations – If True, a recoverable error is raised in case there is not enough memory to allocate the buffers. This option may incur a performance penalty which scales with the batch size parameter (but not with the amount of actual data in the source). If you can test your query against the schema, you can safely set this to False; the required memory does not depend on the amount of data in the data source. The default is False.

Returns:

True in case there is another result set. False in case the last result set has been processed.
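As a sketch, the boolean return value can drive a loop that drains every result set of the reader created in the example above:

# Process each result set in turn until none are left.
while True:
    for batch in reader:
        df = batch.to_pandas()
        # ...
    if not reader.more_results():
        break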

class arrow_odbc.BatchWriter(handle)[source]

Bases: object

Writes arrow batches to a database table.

flush()[source]

Inserts the remaining rows of the last chunk into the database.

write_batch(batch)[source]

Fills the internal buffers of the writer with data from the batch. Every time they are full, the data is sent to the database. To make sure all the data is sent, flush must be called.
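A minimal sketch of the write/flush cycle, assuming writer is an already constructed BatchWriter and table is a pyarrow.Table (how the writer is obtained is outside the scope of this class reference):

# Fill the writer's buffers batch by batch; full buffers are sent automatically.
for batch in table.to_batches(max_chunksize=1000):
    writer.write_batch(batch)

# Send whatever is still buffered from the last chunk.
writer.flush()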

exception arrow_odbc.Error(handle)[source]

Bases: Exception

An error emitted by the arrow-odbc-py bindings.

message() str[source]

A string describing the error.
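A minimal sketch of catching the exception around a read, assuming connection_string is defined as in the other examples and the call fails (e.g. because the table does not exist):

from arrow_odbc import Error, read_arrow_batches_from_odbc

try:
    reader = read_arrow_batches_from_odbc(
        query="SELECT * FROM MyTable",
        connection_string=connection_string,
    )
except Error as exc:
    # message() returns a human readable description of the failure.
    print(exc.message())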

arrow_odbc.enable_odbc_connection_pooling()[source]

Activates connection pooling in the ODBC driver manager for the entire process. Best called before creating the ODBC environment, i.e. before you first insert or read rows with arrow-odbc. This is useful in scenarios where you frequently read or write rows and the overhead of creating a connection for each query is significant.
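A minimal sketch (connection string and query are placeholders as in the other examples):

from arrow_odbc import enable_odbc_connection_pooling, read_arrow_batches_from_odbc

connection_string="Driver={ODBC Driver 17 for SQL Server};Server=localhost;"

# Activate pooling once, before the first query creates the ODBC environment.
enable_odbc_connection_pooling()

reader = read_arrow_batches_from_odbc(
    query="SELECT * FROM MyTable",
    connection_string=connection_string,
)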

arrow_odbc.from_table_to_db(source: Any, target: str, connection_string: str, chunk_size: int = 1000, user: str | None = None, password: str | None = None, login_timeout_sec: int | None = None)[source]

Reads an arrow table and inserts its contents into a relational table on the database.

This is a convenience wrapper around insert_into_table which converts an arrow table into a record batch reader for you.

Example:

from arrow_odbc import from_table_to_db
import pyarrow as pa
import pandas


def dataframe_to_table(df):
    table = pa.Table.from_pandas(df)
    from_table_to_db(
        source=table,
        target="MyTable",
        connection_string=connection_string,
        user="SA",
        password="My@Test@Password",
        chunk_size=1000,
    )
Parameters:
  • source – PyArrow table with content to be inserted into the target table on the database. Each column of the table must correspond to a column in the target table with identical name.

  • target – Name of the database table to insert into.

  • connection_string – ODBC Connection string used to connect to the data source. To find a connection string for your data source try https://www.connectionstrings.com/.

  • chunk_size – Number of records to insert in each roundtrip to the database. The number will be automatically reduced to the number of rows, if the table is small, in order to save memory.

  • user – Allows for specifying the user separately from the connection string if it is not already part of it. The value will eventually be escaped and attached to the connection string as UID.

  • password – Allows for specifying the password separately from the connection string if it is not already part of it. The value will eventually be escaped and attached to the connection string as PWD.

  • login_timeout_sec – Number of seconds to wait for a login request to complete before returning to the application. The default is driver-dependent. If 0, the timeout is disabled and a connection attempt will wait indefinitely. If the specified timeout exceeds the maximum login timeout in the data source, the driver substitutes that value and uses that instead.

arrow_odbc.insert_into_table(reader: Any, chunk_size: int, table: str, connection_string: str, user: str | None = None, password: str | None = None, login_timeout_sec: int | None = None, packet_size: int | None = None)[source]

Consume the batches in the reader and insert them into a table on the database.

Example:

from arrow_odbc import insert_into_table
import pyarrow as pa
import pandas


def dataframe_to_table(df):
    table = pa.Table.from_pandas(df)
    reader = pa.RecordBatchReader.from_batches(table.schema, table.to_batches())
    insert_into_table(
        connection_string=connection_string,
        user="SA",
        password="My@Test@Password",
        chunk_size=1000,
        table="MyTable",
        reader=reader,
    )
Parameters:
  • reader – Reader is used to iterate over record batches. It must expose a schema attribute referencing an Arrow schema. Each field in the schema must correspond to a column in the table with an identical name. The iterator must yield individual arrow record batches.

  • chunk_size – Number of records to insert in each roundtrip to the database. Independent of batch size (i.e. number of rows in an individual record batch).

  • table – Name of a database table to insert into. Used to generate the insert statement for the bulk writer.

  • connection_string – ODBC Connection string used to connect to the data source. To find a connection string for your data source try https://www.connectionstrings.com/.

  • user – Allows for specifying the user separately from the connection string if it is not already part of it. The value will eventually be escaped and attached to the connection string as UID.

  • password – Allows for specifying the password separately from the connection string if it is not already part of it. The value will eventually be escaped and attached to the connection string as PWD.

  • login_timeout_sec – Number of seconds to wait for a login request to complete before returning to the application. The default is driver-dependent. If 0, the timeout is disabled and a connection attempt will wait indefinitely. If the specified timeout exceeds the maximum login timeout in the data source, the driver substitutes that value and uses that instead.

  • packet_size – Specifies the network packet size in bytes. Many ODBC drivers do not support this option. If the specified size exceeds the maximum packet size or is smaller than the minimum packet size, the driver substitutes that value and returns SQLSTATE 01S02 (Option value changed). You may want to enable logging to standard error using log_to_stderr.

arrow_odbc.log_to_stderr(level: int = 1)[source]

Activate logging from native code directly to standard error. In particular, these logs contain diagnostic information emitted by ODBC. Call this method only once in your application.

Parameters:

level

Specifies the log level with which the standard error logger in Rust is initialized.

  • 0 - Error

  • 1 - Warning

  • 2 - Info

  • 3 - Debug

  • 4 - Trace

All diagnostics emitted by ODBC usually have warning severity. In case of an exception, multiple records with severity error may also be emitted.
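A minimal sketch, called once at application startup before any query:

from arrow_odbc import log_to_stderr

# Forward ODBC diagnostics to standard error; level 2 corresponds to Info.
log_to_stderr(level=2)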

arrow_odbc.read_arrow_batches_from_odbc(query: str, connection_string: str, batch_size: int = 65535, user: str | None = None, password: str | None = None, parameters: List[str | None] | None = None, max_bytes_per_batch: int | None = 536870912, max_text_size: int | None = None, max_binary_size: int | None = None, falliable_allocations: bool = False, login_timeout_sec: int | None = None, packet_size: int | None = None, schema: Schema | None = None, map_schema: Callable[[Schema], Schema] | None = None) BatchReader[source]

Execute the query and read the result as an iterator over Arrow batches.

Example:

from arrow_odbc import read_arrow_batches_from_odbc

connection_string="Driver={ODBC Driver 17 for SQL Server};Server=localhost;"

reader = read_arrow_batches_from_odbc(
    query=f"SELECT * FROM MyTable WHERE a=?",
    connection_string=connection_string,
    batch_size=1000,
    parameters=["I'm a positional query parameter"],
    user="SA",
    password="My@Test@Password",
)

for batch in reader:
    # Process arrow batches
    df = batch.to_pandas()
    # ...
Parameters:
  • query – The SQL statement yielding the result set which is converted into arrow record batches.

  • batch_size – The maximum number of rows within each batch. The actual number of rows can be less if the upper bound defined by max_bytes_per_batch is lower.

  • connection_string – ODBC Connection string used to connect to the data source. To find a connection string for your data source try https://www.connectionstrings.com/.

  • user – Allows for specifying the user separately from the connection string if it is not already part of it. The value will eventually be escaped and attached to the connection string as UID.

  • password – Allows for specifying the password separately from the connection string if it is not already part of it. The value will eventually be escaped and attached to the connection string as PWD.

  • parameters – ODBC allows you to use a question mark (?) as a placeholder marker for positional parameters. This argument takes a list of parameters whose number must match the number of placeholders in the SQL statement. Using this instead of literals helps you avoid SQL injections and may otherwise simplify your code. Currently all parameters are passed as VARCHAR strings. You can use None to pass NULL.

  • max_bytes_per_batch – An upper limit for the total size (all columns) of the buffer used to transit data from the ODBC driver to the application. Please note that the memory consumption of this buffer is determined not by the actual values, but by the maximum possible length of an individual row times the number of rows it can hold. Both batch_size and this parameter define upper bounds for the same buffer. Whichever bound is lower determines the buffer size.

  • max_text_size – An upper limit for the size of buffers bound to variadic text columns of the data source. This limit does not (directly) apply to the size of the created arrow buffers, but rather to the buffers used for the data in transit. Use this option if you have e.g. VARCHAR(MAX) fields in your database schema. For such columns without an upper limit, the ODBC driver of your data source is asked for the maximum size of an element and is likely to answer with either 0 or a value far larger than any actual entry in the column. If you cannot adapt your database schema, this limit might be what you are looking for. On Windows systems the size is measured in 16-bit units, as Windows utilizes a UTF-16 encoding, so this translates roughly to the size in letters. On non-Windows systems this is the size in bytes and the data source is assumed to utilize a UTF-8 encoding. None means no upper limit is set and the maximum element size reported by ODBC is used to determine buffer sizes. Lower values result in better memory utilization and can significantly lower the number of bytes needed per row. Higher values allow larger values to go through without truncation.

  • max_binary_size – An upper limit for the size of buffers bound to variadic binary columns of the data source. This limit does not (directly) apply to the size of the created arrow buffers, but rather to the buffers used for the data in transit. Use this option if you have e.g. VARBINARY(MAX) fields in your database schema. For such columns without an upper limit, the ODBC driver of your data source is asked for the maximum size of an element and is likely to answer with either 0 or a value far larger than any actual entry in the column. If you cannot adapt your database schema, this limit might be what you are looking for. This is the maximum size in bytes of the binary column.

  • falliable_allocations – If True, a recoverable error is raised in case there is not enough memory to allocate the buffers. This option may incur a performance penalty which scales with the batch size parameter (but not with the amount of actual data in the source). If you can test your query against the schema, you can safely set this to False; the required memory does not depend on the amount of data in the data source. The default is False.

  • login_timeout_sec – Number of seconds to wait for a login request to complete before returning to the application. The default is driver-dependent. If 0, the timeout is disabled and a connection attempt will wait indefinitely. If the specified timeout exceeds the maximum login timeout in the data source, the driver substitutes that value and uses that instead.

  • packet_size – Specifies the network packet size in bytes. Many ODBC drivers do not support this option. If the specified size exceeds the maximum packet size or is smaller than the minimum packet size, the driver substitutes that value and returns SQLSTATE 01S02 (Option value changed). You may want to enable logging to standard error using log_to_stderr.

  • schema – Allows you to overwrite the automatically detected schema with one supplied by the application. Reasons for doing so include domain knowledge you have about the data which is not reflected in the schema information, e.g. you happen to know that a field of timestamps contains strictly dates. Another reason could be that for certain use cases it can make sense to decide the type based on what you want to do with the data, rather than on its source. E.g. if you simply want to put everything into a CSV file, it can make perfect sense to fetch everything as string, independent of its source type.

  • map_schema – Allows you to provide a custom schema based on the schema inferred from the metainformation of the query. This allows you to e.g. map every column type to string, replace any float32 with a float64, or apply any other customization, while still staying generic over the input schema. If both map_schema and schema are specified, map_schema takes priority. See the sketch after the Returns section below.

Returns:

A BatchReader is returned, which implements the iterator protocol and iterates over individual arrow batches.
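A minimal sketch of map_schema (connection string and query are placeholders as in the example above; the mapping function all_strings is hypothetical and simply requests every column as string):

import pyarrow as pa

from arrow_odbc import read_arrow_batches_from_odbc

connection_string="Driver={ODBC Driver 17 for SQL Server};Server=localhost;"


def all_strings(schema: pa.Schema) -> pa.Schema:
    # Keep column names and nullability, but fetch every column as string.
    return pa.schema([pa.field(f.name, pa.string(), f.nullable) for f in schema])


reader = read_arrow_batches_from_odbc(
    query="SELECT * FROM MyTable",
    connection_string=connection_string,
    map_schema=all_strings,
)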