Multimodal Data (Blobs)

LanceDB handles multimodal data—images, audio, video, and PDF files—natively by storing the raw bytes in a binary column alongside your vectors and metadata. This approach simplifies your data infrastructure by keeping the raw assets and their embeddings in the same database, eliminating the need for separate object storage for many use cases. This guide demonstrates how to ingest, store, and retrieve image data using standard binary columns, and also introduces the Lance Blob API for optimized handling of larger multimodal files.

Store binary data

To store binary data, define a binary Arrow field in your schema (pa.binary() in Python, Binary in TypeScript, and DataType::Binary in Rust).

1. Setup and imports

First, import the libraries for LanceDB and Arrow in your SDK (in Python, these are lancedb and pyarrow).

2. Prepare data

For this example, we’ll create some dummy in-memory images. In a real application, you would read these from files or an API. The key is to convert your data (image, audio, etc.) into a raw bytes object.
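As a minimal sketch of this step, the snippet below builds two solid-color PNG images in memory using only the Python standard library and places their raw bytes in a list of records. The helper `make_dummy_png`, the field names (`id`, `filename`, `vector`, `image`), and the 4-dimensional placeholder vectors are assumptions made for illustration, not part of the LanceDB API.

```python
import struct
import zlib


def _png_chunk(tag: bytes, payload: bytes) -> bytes:
    """Frame one PNG chunk: length, tag, payload, then CRC-32 of tag + payload."""
    crc = zlib.crc32(tag + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + tag + payload + struct.pack(">I", crc)


def make_dummy_png(width: int, height: int, rgb: tuple) -> bytes:
    """Return the bytes of a solid-color 8-bit RGB PNG (hypothetical helper)."""
    # IHDR: width, height, bit depth 8, color type 2 (RGB), then three zero flags.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    # Each scanline is a filter byte (0 = none) followed by `width` RGB triples.
    row = b"\x00" + bytes(rgb) * width
    idat = zlib.compress(row * height)
    return (
        b"\x89PNG\r\n\x1a\n"
        + _png_chunk(b"IHDR", ihdr)
        + _png_chunk(b"IDAT", idat)
        + _png_chunk(b"IEND", b"")
    )


# Placeholder vectors stand in for real embeddings.
data = [
    {"id": 1, "filename": "red.png", "vector": [0.1, 0.2, 0.3, 0.4],
     "image": make_dummy_png(8, 8, (255, 0, 0))},
    {"id": 2, "filename": "blue.png", "vector": [0.9, 0.8, 0.7, 0.6],
     "image": make_dummy_png(8, 8, (0, 0, 255))},
]
```

For files on disk, `pathlib.Path(path).read_bytes()` yields the same kind of `bytes` object.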

3. Define the schema

When creating the table, it is highly recommended to define the schema explicitly. This ensures that your binary data is correctly interpreted as a binary type by Arrow/LanceDB and not as a generic string or list.

4. Ingest data

Now, create the table using the data and the defined schema.

Retrieve and use blobs

When you search your LanceDB table, you can retrieve the binary column just like any other metadata.

Convert bytes back to objects

Once you have the bytes back from the search result, you can decode them into the original format (for example, an image object or audio buffer).
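In practice you would likely hand the bytes to an image or audio library by wrapping them in `io.BytesIO`. To stay dependency-free, the sketch below applies the same idea with the standard library only: it builds a tiny 2x1 PNG, then decodes its dimensions and pixel values from the bytes. The helper `read_png_chunks` is hypothetical and handles only simple, unfiltered PNGs like the one constructed here.

```python
import io
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def _chunk(tag: bytes, payload: bytes) -> bytes:
    """Frame one PNG chunk (used here only to build sample bytes)."""
    crc = zlib.crc32(tag + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + tag + payload + struct.pack(">I", crc)


def read_png_chunks(png_bytes: bytes) -> dict:
    """Walk the chunk list and return payloads keyed by chunk tag (hypothetical helper)."""
    stream = io.BytesIO(png_bytes)  # file-like view over the stored bytes
    if stream.read(8) != PNG_SIGNATURE:
        raise ValueError("not a PNG payload")
    chunks = {}
    while True:
        header = stream.read(8)
        if len(header) < 8:
            break
        length, tag = struct.unpack(">I4s", header)
        chunks[tag] = chunks.get(tag, b"") + stream.read(length)
        stream.read(4)  # skip the CRC
    return chunks


# Sample payload standing in for bytes returned by a query: one red and one blue pixel.
row = b"\x00" + bytes([255, 0, 0]) + bytes([0, 0, 255])  # filter byte + 2 RGB pixels
image_bytes = (
    PNG_SIGNATURE
    + _chunk(b"IHDR", struct.pack(">IIBBBBB", 2, 1, 8, 2, 0, 0, 0))
    + _chunk(b"IDAT", zlib.compress(row))
    + _chunk(b"IEND", b"")
)

chunks = read_png_chunks(image_bytes)
width, height = struct.unpack(">II", chunks[b"IHDR"][:8])
pixels = zlib.decompress(chunks[b"IDAT"])[1:]  # drop the single row's filter byte
```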

Large Blobs (Blob API)

For larger files like high-resolution images or videos, Lance provides a specialized Blob API. By using a large-binary Arrow type (pa.large_binary() in Python, LargeBinary in TypeScript, and DataType::LargeBinary in Rust) and specific metadata, you enable lazy loading and optimized encoding. This allows you to work with massive datasets without loading all binary data into memory upfront.

1. Define a blob schema

To use the Blob API, you must mark the column with {"lance-encoding:blob": "true"} metadata.

2. Ingest large blobs

You can then ingest data normally, and Lance will handle the optimized storage.

For more advanced usage, including random access and file-like reading of blobs, see the Lance format’s blob API documentation.

3. Convert blob tables to pandas

When you call to_pandas() on a local LanceDB table that contains Blob API columns, the blob_mode argument controls how those columns materialize. This is available in the Python SDK on local tables; remote tables raise NotImplementedError. blob_mode accepts:

"lazy" (default): returns blob columns as lazy BlobFile objects without eagerly materializing their payloads. Use this when you want to stream blob bytes on demand or only inspect a subset of rows. Namespace-backed local tables also use the Lance native blob-aware pandas conversion for lazy blobs; in-memory datasets fall back to the standard PyArrow to_pandas() path.
"bytes": eagerly materializes each blob as bytes. Use this when you need the raw payload in the DataFrame, for example to decode an image or audio clip in-process.
"descriptions": returns blob descriptors (offsets, sizes, and positions) instead of the data itself. Use this when you want to plan I/O without paying the cost of loading every blob.

"bytes" and "descriptions" require a filesystem-backed Lance dataset and are not supported on in-memory tables. Extra keyword arguments are forwarded to the underlying PyArrow / Lance pandas conversion, so you can also pass options like split_blocks or self_destruct.

Query builders also accept blob_mode on their to_pandas() method for plain scan queries (queries without a vector search, full-text search, or order_by clause). Filters, projections, aliases, limit, and offset are all supported and routed through Lance’s native pandas conversion so that lazy and bytes modes work end to end.

Vector, FTS, hybrid, and other non-scan query shapes keep the existing Arrow conversion path and only accept blob_mode="descriptions". Using "lazy" or "bytes" on those queries raises a RuntimeError directing you to use "descriptions" or drop the blob column from the projection.

The same blob_mode argument is available on both sync (table.search(...).to_pandas(blob_mode=...)) and async (await table_async.query()...to_pandas(blob_mode=...)) query builders. Extra PyArrow keyword arguments like split_blocks and self_destruct are still forwarded on query builders.

Other modalities

The pa.binary() and pa.large_binary() types are universal. You can use this same pattern for other types of multimodal data:

Audio: Read .wav or .mp3 files as bytes.
Video: Store video transitions or full clips using the Blob API.
PDFs/Documents: Store the raw file content for document search.
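As a sketch of the audio case using only the standard library, the snippet below writes a short WAV file, reads it back as raw bytes ready for a binary column, and then decodes those bytes again. The file name, record layout, and audio parameters are assumptions for illustration.

```python
import io
import tempfile
import wave
from pathlib import Path

# Write a small mono WAV file to disk so there is something to read.
wav_path = Path(tempfile.mkdtemp()) / "tone.wav"
with wave.open(str(wav_path), "wb") as writer:
    writer.setnchannels(1)      # mono
    writer.setsampwidth(2)      # 16-bit samples
    writer.setframerate(8000)   # 8 kHz
    writer.writeframes(b"\x00\x00" * 800)  # 800 silent frames

# Read the whole file as bytes; this is the value to store in the binary column.
audio_bytes = wav_path.read_bytes()
record = {"id": 1, "filename": wav_path.name, "audio": audio_bytes}

# Later, decode the stored bytes without touching the filesystem.
with wave.open(io.BytesIO(record["audio"]), "rb") as reader:
    duration_s = reader.getnframes() / reader.getframerate()
```

The same read-as-bytes pattern applies to video clips and PDF files.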
