The Rise of XPUs: Exploring The Hardware Behind Modern AI
A practical guide to the processors powering all things AI.

At the Google I/O 2024 keynote, "AI" was mentioned over 120 times, with a focus on Trillium, their sixth-generation TPU. At the NVIDIA GTC 2026 keynote, GPU was mentioned 23 times! The point is: if you've spent time following tech news or keynotes lately, you've probably heard terms like GPU, TPU, NPU, and, if you're lucky, maybe DPU being thrown around.
Suddenly, the humble CPU we grew up hearing and reading about doesn't seem to be enough. So why does modern AI need an alphabet soup of processors, and what actually sets them apart? Let's take a deeper dive into the world of XPUs.
What do we actually mean by an "X"PU?
An "XPU" isn't one specific chip. It's a catch-all umbrella term for different kinds of processing units, where the "X" changes based on purpose, and while XPU is becoming a common term, "accelerator" or "heterogeneous processing" are also interchangeably used to describe this category of hardware.
Some processing units we all use every day are the Central Processing Unit (CPU) and the Graphics Processing Unit (GPU). Then there are more specialized ones like the Tensor Processing Unit (TPU) from the house of Google, and the Neural Processing Unit (NPU), built to handle complex computation with dramatically lower power consumption, perfect for mobile and wearables.
The reason they exist is simple: different workloads warrant different hardware, because running an operating system and training a billion-parameter model are not the same problem. In today's AI-focused world, modern systems use XPUs together (for example, a laptop with a CPU for general tasks, a GPU for graphics, and an NPU for AI acceleration).
Scalar vs Vector vs Matrix Thinking
One of the easiest ways to understand modern AI hardware is to start thinking in shapes of math. Most computing workloads can be simplified into three categories:
| Paradigm | Description |
|---|---|
| Scalar | Process one value at a time |
| Vector | Process many values together |
| Matrix / tensor | Process large grids of values multiplied repeatedly |
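To make the shapes concrete, here's a tiny sketch (in Python with NumPy, purely as an illustration) of the same three paradigms:

```python
import numpy as np

s = 3.0                          # scalar: a single value
v = np.array([1.0, 2.0, 3.0])    # vector: many values in a row
M = np.ones((3, 3))              # matrix: a grid of values

print(s * 2)    # one multiplication
print(v * 2)    # three multiplications, expressed as one operation
print(M @ M)    # a grid of multiply-adds, repeated for every output cell
```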
As workloads became more data-heavy, the processing unit evolved alongside them.
CPU: Scalar-First Thinking
The CPU was built for flexibility. It excels at sequential logic, branching decisions, operating systems, and tasks where each instruction depends on the last. Think:
```python
if user_clicked_button:
    open_window()
```
CPUs make excellent managers: great at coordinating, making decisions, and juggling varied tasks. But just as a good manager delegates repetitive work rather than doing it all personally, a CPU is inefficient at the repetitive math AI demands.
GPU: Vector Thinking
The GPU became dominant because graphics and AI both involve performing the same operation on many pieces of data at once, and it outperforms a CPU here by following the paradigm of vectorization.
The easiest way to comprehend this is simple vector addition. Consider three lanes of calculations:
Lane 1: 1 + 4 = 5
Lane 2: 2 + 5 = 7
Lane 3: 3 + 6 = 9
Unlike a CPU, which would pick one lane at a time (sequentially) and perform the function, the GPU processes all three lanes at once. In action, thousands of lightweight cores do similar work in parallel, which is exactly what benefits a neural network processing millions of weights, activations, and repeated calculations simultaneously.
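In code, this looks like the difference between a loop and a single array operation. Here's a minimal sketch with NumPy, whose array expressions stand in for what a GPU does across thousands of lanes:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# CPU-style: one lane at a time
for x, y in zip(a, b):
    print(x + y)        # 5, then 7, then 9

# GPU-style: all lanes expressed as one operation
print(a + b)            # [5 7 9]
```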
NPU: Matrix Native
Unlike a GPU, which is comparatively general-purpose, the NPU is built specifically for neural networks, where the core workload is repeated matrix multiplication. If a GPU is great at adding many rows quickly, an NPU is great at combining entire tables of numbers repeatedly.
A simple example is a matrix multiplied by a vector.
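Here's a minimal sketch in NumPy, with made-up numbers, of one matrix-vector multiply:

```python
import numpy as np

W = np.array([[1, 2],
              [3, 4]])     # a tiny, made-up "weight" matrix
x = np.array([5, 6])       # the input vector

y = W @ x                  # [1*5 + 2*6, 3*5 + 4*6]
print(y)                   # [17 39]
```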
Each layer of an AI model repeats this process, with parameters spread across thousands or millions of rows and columns, turning an input into a prediction.
NPUs are optimized to do this style of math with low power usage, fast on-chip memory reuse, and efficient on-device inference, making them a perfect addition to phones, laptops, and even wearables for AI on the edge.
A concept worth mentioning is TOPS, which stands for Tera Operations Per Second. It is a throughput metric used to describe how many AI math operations an NPU can perform each second. Breaking it down further gives a better idea of how powerful an NPU actually is:
| Term | Explanation |
|---|---|
| Tera | Trillion |
| Operations | Individual compute ops (usually multiply, add, or multiply-accumulate) |
| Per Second | Unit of time |
So, 1 TOPS is basically 1,000,000,000,000 operations per second.
Note: Not all operations are equal. TOPS figures usually count INT8 or other low-precision AI operations, and vendors may count different data types, leading to numbers that aren't directly comparable across chips.
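As a back-of-the-envelope sketch (the hardware numbers below are made up, using the common convention that one multiply-accumulate counts as two operations):

```python
# Hypothetical NPU: 4,096 MAC (multiply-accumulate) units at 1.5 GHz
mac_units = 4096
clock_hz = 1.5e9
ops_per_mac = 2                    # one MAC = 1 multiply + 1 add

peak_ops = mac_units * clock_hz * ops_per_mac
print(peak_ops / 1e12, "TOPS")     # 12.288 -> roughly a "12 TOPS" chip
```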
Why CPUs aren't enough for AI
A CPU solves one problem at a time, a GPU solves many similar problems at once, and an NPU solves huge grids of connected (or related) problems efficiently.
So while a CPU can run AI applications, and can in fact power quantized models, it suffers in efficiency, throughput, and scalability compared with hardware literally built for parallel math.
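You can get an unscientific feel for this gap on any machine. Even a CPU's own vectorized instructions (reached through NumPy here) leave one-value-at-a-time code far behind, and dedicated parallel hardware widens the gap further. A quick sketch:

```python
import time
import numpy as np

n = 1_000_000
a = np.random.rand(n)
b = np.random.rand(n)

# Scalar-style: one value at a time
t0 = time.perf_counter()
slow = [a[i] + b[i] for i in range(n)]
print(f"scalar loop: {time.perf_counter() - t0:.3f}s")

# Vectorized: the whole array in one operation
t0 = time.perf_counter()
fast = a + b
print(f"vectorized:  {time.perf_counter() - t0:.5f}s")
```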
Enter Hyper-specialized Chips
Even though GPUs and NPUs are built to accelerate specific forms of math, the next stage of computing goes even further: hardware designed for highly targeted jobs. As AI systems grew larger and more complex, raw compute stopped being the only bottleneck. Training models at scale also required faster data movement, lower latency, efficient networking, and specialized handling for vision-heavy workloads.
This gave rise to a new class of processors: hardware built not to do everything, but to do one thing exceptionally well.
Tensor Processing Unit
TPUs were developed by Google specifically for machine learning, and before we look at what they do, let's talk about tensors.
A tensor is a mathematical container that goes beyond matrices: numbers arranged across any number of dimensions. One intuitive way to visualize a simple tensor is as a stack of matrices. Take this analogy into consideration:
Scalar = one cell [no dimension]
Vector = one row [one dimension: length]
Matrix = full sheet [two dimensions: length x width]
Tensor = stack of sheets [three dimensions (for now): length x width x height]
The need for tensors arises from the fact that larger AI models work with data that naturally has multiple dimensions. Take the example of an image, with height, width, and color channels (RGB): that's a three-dimensional tensor. Now if your model has to deal with a batch of images, there's another dimension for the batch (number of images), making it a four-dimensional or 4D tensor. Tensors really are the mathematical backbone of the LLMs we use today.
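In code, that stacking of dimensions is just the shape of an array. A sketch with a hypothetical batch of 32 RGB images at 224x224 pixels:

```python
import numpy as np

# (batch, height, width, channels) -> a 4D tensor
images = np.zeros((32, 224, 224, 3))
print(images.ndim)     # 4
print(images.shape)    # (32, 224, 224, 3)
```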
Now that tensors are out of the way: TPUs specialize in large-scale, low-latency inference, transformer training, and cloud AI workloads. Google created them because its own AI demand outgrew general-purpose chips while running AI across products like search ranking, translation, image classification, and of course recommendations.
Data Processing Unit
DPUs truly have a niche. They were invented because in modern datacenters, moving data became almost as important as computing on data. As cloud systems scaled, servers had to do more than run apps; they also had to consistently handle work like:
- Encryption / TLS
- Firewalling
- Virtualization
- Telemetry
- And a lot of other networking-focused tasks
This already put immense pressure on expensive CPU cores, which were burning cycles on infrastructure work, and it only got worse when AI apps entered the market and data movement surged. Servers now additionally had to deal with dataset streaming, training traffic, and inference request routing, among much more. Even with GPUs in place, AI workloads often caused bottlenecks.
DPUs offload infrastructure tasks from the CPU and commonly deal with network acceleration, packet processing, encryption and traffic management; repetitive tasks that would stress a CPU.
In the simplest terms, DPUs specialize in optimizing the flow, protection, and delivery of data. They are named Data Processing Units because they process data movement and data services, not just raw math.
Vision/Visual Processing Unit
VPUs are specialized chips designed to accelerate computer vision workloads efficiently, and often at low power. Instead of handling broad computing workloads, they focus on repetitive visual operations such as object detection, face recognition, motion tracking, edge detection, and image enhancement. They were created because running always-on camera and perception workloads on general processors wastes power and can introduce latency.
You'll find VPU-style hardware in drones, robotics platforms, autonomous vehicles, surveillance cameras, and smartphones: anywhere a machine needs to "see" and react quickly. For example, a security camera identifying a person at the door or a drone avoiding obstacles mid-flight benefits from dedicated visual compute.
Standalone VPUs are less commonly marketed because many of their capabilities have been integrated into modern systems-on-chip alongside GPUs and NPUs.
XPUs that aren't even called XPUs
There are certain chips that fit under the "XPU" umbrella even without the name, because "XPU" doesn't describe what a chip is; it describes that it's a specialized processing unit. So here are two more chips performing targeted, accelerated computation.
ASIC: Application-Specific Integrated Circuit
An ASIC is a chip built for one purpose and permanently optimized for that job. Instead of flexibility, it prioritizes maximum performance, power efficiency, and cost at scale.
Many crypto miners, networking chips, and storage controllers are actually ASICs. They are expensive and time-consuming to design, but unbeatable when you need massive volume and highly optimized performance.
FPGA: Field-Programmable Gate Array
An FPGA is a reconfigurable chip whose internal logic can be rewired after manufacturing, allowing engineers to build custom hardware behavior for specific workloads. It sits between software and fixed silicon: more specialized than a CPU, but more flexible than an ASIC.
FPGAs are popular in telecom, robotics, aerospace, finance, and low-latency AI inference where custom pipelines matter. They're especially useful when requirements may change over time or production volumes don't justify designing a custom chip.
What Comes Next
AI workloads are evolving faster than any single processor can keep up with. Instead of replacing CPUs or GPUs, the future is increasingly heterogeneous computing: systems where multiple chips work together. The next breakthrough in AI may not come from one faster chip, but from many smarter ones working in sync.
Ending Notes
The next time someone says "AI runs on GPUs," you'll know that's only part of the story. At the end of the day, AI rewards specialization.
Modern intelligence doesn't run on one processor; it runs on an orchestra of silicon, where every chip has a role to play.

Aaishika S Bhattacharya
Developer Relations practitioner and part-time wanderer. Living at the intersection of community, code, and content; making complex tech feel approachable, observable, and human.