The GPU Case in Private Information Retrieval (PIR)
Exploring how GPUs can accelerate private information retrieval (PIR) by leveraging embarrassingly parallel matrix-vector multiplication, achieving 13-174x speedups over CPU implementations.
What if I asked you to name the top 3 applications of GPUs? Machine learning? Gaming? Crypto mining? Those are all good candidates. Today, I would like to introduce you to one you may not expect - private information retrieval (PIR).
Private information retrieval (PIR) is a mechanism that allows a server to process a client's query without learning what the client asked for. This is achieved by homomorphic encryption - the capability to compute on encrypted data. The client encrypts their query and sends it to the server, which multiplies the full database (the matrix) against the query (the vector).
PIR enables privacy-preserving services like checking if your password was leaked in a data breach, performing DNS lookups, or verifying credentials. The server never learns what you're actually looking up.
Most state-of-the-art mechanisms require the server to touch every database record. Intuitively, this is how the client avoids revealing what it wants to the server: if any record were skipped, the server would learn it wasn't the target. The query specifies an encrypted request which the server processes via homomorphic operations over the full database. As a result, the server's per-query work scales linearly with the size of the database.
The full database must be streamed through the CPU to compute the response to the client's query. In this post, we will make the case for processing the query against the database on a GPU, achieving a 13-174x speedup: 13x over a multi-threaded CPU implementation (Rayon) and 174x over a single-threaded one. See benchmark details.
Breaking Down The Query
But first, let's discuss the PIR query abstraction.
To frame the mechanism, we will use the SimplePIR construction. Many details are omitted for brevity, but you can learn more about it here.
Assume the database is D, a √N × √N matrix where N is the number of elements. For simplicity, assume 1-byte records so that each cell maps to a record. Let q be the query vector and r be the result vector.
Picture the setup: the server holds the database matrix, and the client sends its encrypted query vector.
Remember that the query vector is encrypted, and the result is as well. The client decrypts the returned result, which contains an entire database column, from which they select the desired record.
Key Insight
In the matrix-vector multiplication r = D·q, each row of the database matrix D is multiplied against the query vector q. Row 0's computation doesn't depend on row 1's result, row 1's doesn't depend on row 2's, and so on.
This is the definition of embarrassingly parallel - the work decomposes into thousands of identical, independent operations, which is a perfect fit for a GPU.
CPU Approach
A CPU processes rows one-by-one (or a handful at a time with SIMD). For a database with 1 million rows, that's roughly 1 million sequential operations. Even with modern multi-core CPUs (8-16 cores), you're still looking at ~62K-125K sequential batches (1M ÷ 16 ≈ 62K).
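To make the batching concrete, here is a minimal sketch of the parallel matrix-vector multiply using std threads as a stand-in for Rayon's `par_iter` (the function name, chunking scheme, and toy sizes are mine, not from the benchmark code):

```rust
use std::thread;

// Multiply each row of `db` (flattened, `cols` wide) by `q`, splitting the
// rows into per-thread chunks -- roughly what Rayon does across CPU cores.
fn matvec_parallel(db: &[u8], cols: usize, q: &[u32], threads: usize) -> Vec<u32> {
    let rows = db.len() / cols;
    let chunk_rows = rows.div_ceil(threads);
    let mut out = vec![0u32; rows];
    thread::scope(|s| {
        for (chunk, out_chunk) in db
            .chunks(chunk_rows * cols)
            .zip(out.chunks_mut(chunk_rows))
        {
            s.spawn(move || {
                for (row, o) in chunk.chunks(cols).zip(out_chunk) {
                    // Wrapping arithmetic mirrors the mod-2^32 math PIR uses.
                    *o = row.iter().zip(q).fold(0u32, |acc, (&a, &b)| {
                        acc.wrapping_add((a as u32).wrapping_mul(b))
                    });
                }
            });
        }
    });
    out
}

fn main() {
    let db: Vec<u8> = (0..16).collect(); // toy 4x4 database
    let q = vec![1u32, 0, 0, 0]; // selects column 0
    println!("{:?}", matvec_parallel(&db, 4, &q, 2)); // [0, 4, 8, 12]
}
```

Each thread owns a disjoint slice of the output, so no synchronization is needed beyond joining - the same property the GPU exploits at much larger scale.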
Constraint
With this approach, state-of-the-art PIR constructions are memory-bandwidth constrained. This means the server's throughput is bottlenecked by how fast the memory bus can feed data to the CPU: the database is queued up and streamed through the CPU sequentially.
GPU Approach
A GPU has thousands of cores (e.g., an RTX 4090 has 16,384 CUDA cores). Each core can independently compute one row's dot product with the query vector. Those 1 million rows? The GPU processes them in ~61 parallel batches (1M ÷ 16,384 ≈ 61) instead of ~62K.
Additionally, GPUs benefit from storing database values as single bytes (u8) rather than 32-bit integers, reducing memory traffic by 4x. The byte-to-integer widening happens during computation at negligible cost.
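A sketch of that widening trick (function name mine): the database row stays u8 in memory, and each byte is widened to u32 in registers only at multiply time, so memory traffic is one byte per record instead of four.

```rust
/// Dot product of a u8 database row with a u32 query vector.
/// The u8 -> u32 widening happens in registers during the multiply,
/// so the 4x memory-traffic saving costs essentially nothing.
fn dot_u8(row: &[u8], q: &[u32]) -> u32 {
    row.iter().zip(q).fold(0u32, |acc, (&a, &b)| {
        acc.wrapping_add((a as u32).wrapping_mul(b))
    })
}

fn main() {
    let row: Vec<u8> = vec![1, 2, 3];
    let q: Vec<u32> = vec![10, 20, 30];
    println!("{}", dot_u8(&row, &q)); // 10 + 40 + 90 = 140
}
```

The same pattern applies inside a GPU kernel: loads are u8, accumulation is u32 with wrapping (mod 2^32) semantics.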
Comparison
Our GPU benchmark achieved a 174x speedup over a single-core CPU implementation and 13x over a Rayon-parallelized one. See the details behind the setup, implementation, and results.
Putting It All Together
The SimplePIR design achieves throughput that is memory-bandwidth constrained, at the cost of offline server pre-computation and increased communication size (details omitted for brevity). This trade-off reflects the general PIR trilemma (throughput, communication size, pre-processing).
While SimplePIR demonstrates the GPU opportunity, more recent work addresses other limitations of the construction. YPIR and InsPIRe improve upon SimplePIR by eliminating pre-processing and lowering query and response size. However, they do not improve—and in fact compromise—throughput.
With GPUs, we are empowered to leverage the breakthroughs made in YPIR and InsPIRe while alleviating the critical memory bandwidth throughput bottleneck by 13-174x depending on implementation details.
GPUs transform PIR from a theoretical curiosity into a practical privacy tool.