Distributed matrix multiplication matters when the matrices are too large for one machine to hold comfortably in memory. To explain the distributed case clearly, it helps to refresh the basic matrix product first and then look at two different ways of splitting the work.
If matrix multiplication is already fresh in your mind, skip directly to "Why distribute the matrix product?" below.
Refresher: Matrix Product
What A Matrix Is
A matrix is a data structure made of rows and columns. Think of a tic-tac-toe grid filled with numbers, except that the number of rows and columns can be as large as you like.
When a matrix has the same number of rows and columns we say that it is square ... because it looks like a square.
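As a minimal sketch, a matrix can be represented in Python as a list of rows (the variable names here are just illustrative):

```python
# A 2x2 square matrix represented as a list of rows.
A = [
    [1, 2],
    [3, 4],
]

rows = len(A)        # number of rows: 2
cols = len(A[0])     # number of columns: 2
assert rows == cols  # same count in both directions: a square matrix
```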
What The Product Computes
The matrix product is an operation that takes two matrices and returns another matrix as the result. Let's take an example to go through the process.
Let $A$ be a $2 \times 2$ square matrix and $B$ another square matrix of the same size:

$$A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}, \qquad B = \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix}$$

This matrix product will result in another matrix $C = AB$.
In that configuration each element of $C$: $c_{11}$, $c_{12}$, $c_{21}$, and $c_{22}$ is the result of sums and products on the corresponding rows of $A$ and columns of $B$:

$$\begin{aligned} c_{11} &= a_{11}b_{11} + a_{12}b_{21} & c_{12} &= a_{11}b_{12} + a_{12}b_{22} \\ c_{21} &= a_{21}b_{11} + a_{22}b_{21} & c_{22} &= a_{21}b_{12} + a_{22}b_{22} \end{aligned}$$

The overall operation looks like this; circled in red are the elements of the matrices involved in computing the $c_{11}$ entry of the resulting matrix:
The important idea is that each value in the result matrix comes from one row of $A$ interacting with one column of $B$.
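That row-times-column rule translates directly into three nested loops. A minimal Python sketch (the function name `matmul` is just illustrative):

```python
def matmul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):          # pick a row of A
        for j in range(n):      # pick a column of B
            for k in range(n):  # accumulate the row-column products
                C[i][j] += A[i][k] * B[k][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul(A, B))  # [[19, 22], [43, 50]]
```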
Why Distribute The Matrix Product?
This operation appears in many machine learning and numerical workloads. As the amount of data grows, the matrices get large enough that a single processor may no longer be able to store or process them efficiently on its own.
For the sake of simplicity we will consider square matrices, but the same ideas can be generalized to rectangular ones.
Strategy 1: Slice Rows Of A And Columns Of B
The first strategy is to split matrix $A$ into row slices and matrix $B$ into column slices. Each processor receives one slice of each.
Let $p$ be the number of processors we can use.
What One Processor Can Compute
What does processor 1 get in that configuration?
It gets the first row slice of $A$ and the first column slice of $B$. With only those two pieces, it can compute only the top-left corner of $C$.
Following the same logic, every processor computes one block on the diagonal of $C$. At this point we only have a partial result.
Rotate B Through The Ring
To fill in the missing blocks, the processors keep their slice of $A$ fixed and rotate the slices of $B$. On the next iteration, processor 1 still owns row slice 1 of $A$, but now receives column slice 2 of $B$, so it can compute a different block of $C$.
If we imagine the processors arranged in a ring, each processor simply passes its current slice of $B$ to the processor on its right until every processor has seen every slice.
The next iteration fills the red blocks:
Repeat that rotation enough times and the whole result matrix is assembled from those partial computations.
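The whole rotation can be simulated sequentially in a few lines of NumPy. This is a sketch of the idea, not a real message-passing implementation, and the names (`ring_matmul`, `p`, the slicing scheme) are illustrative:

```python
import numpy as np

def ring_matmul(A, B, p):
    """Simulate p processors multiplying A and B by rotating column slices of B."""
    n = A.shape[0]
    s = n // p                                          # slice width (assume p divides n)
    row_slices = [A[i*s:(i+1)*s, :] for i in range(p)]  # fixed: one per processor
    col_slices = [B[:, j*s:(j+1)*s] for j in range(p)]  # these rotate around the ring
    C = np.zeros((n, n))
    for step in range(p):
        for proc in range(p):
            j = (proc + step) % p  # which column slice this processor holds now
            C[proc*s:(proc+1)*s, j*s:(j+1)*s] = row_slices[proc] @ col_slices[j]
    return C

rng = np.random.default_rng(0)
A = rng.random((4, 4))
B = rng.random((4, 4))
assert np.allclose(ring_matmul(A, B, 2), A @ B)
```

At step 0 every processor holds its own column slice, which is exactly why only the diagonal blocks of $C$ get filled first; each later step fills one more off-diagonal band.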
Strategy 2: Split Everything Into Square Submatrices
The previous method suggests a second idea: instead of long row and column slices, split both matrices into smaller square blocks.
What Each Processor Owns
In that configuration, each block of $C$ can be expressed as a sum of products of corresponding submatrices from $A$ and $B$:

$$C_{ij} = \sum_{k} A_{ik} B_{kj}$$

That equation captures the same dependency as before: to compute one destination block, you need a matching run of blocks from one row of $A$ and one column of $B$.
The difference is ownership. Now each processor is responsible for one output block of $C$ and works with smaller chunks instead of whole rows or columns.
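The block formula is easy to check numerically. A small NumPy sketch, using an illustrative $2 \times 2$ grid of blocks and a hypothetical helper `block`:

```python
import numpy as np

rng = np.random.default_rng(1)
n, b = 4, 2                       # matrix size and block size: a 2x2 grid of blocks
A = rng.random((n, n))
B = rng.random((n, n))

def block(M, i, j):
    """Extract the (i, j) submatrix of size b x b."""
    return M[i*b:(i+1)*b, j*b:(j+1)*b]

# C_ij = sum over k of A_ik @ B_kj, here for the top-left block (i = j = 0).
C00 = sum(block(A, 0, k) @ block(B, k, 0) for k in range(n // b))
assert np.allclose(C00, (A @ B)[:b, :b])
```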
Moving A Right And B Down
Take processor 1 in a 2x2 processor grid. To compute its block of $C$, it eventually needs the right set of blocks from both $A$ and $B$:
To make that happen, the processors play the same general game as before: move data around in a structured pattern so that every processor eventually sees what it needs without ever storing the entire input.
In a 2x2 grid, each iteration sends blocks of $A$ one step to the right and blocks of $B$ one step downward. The green arrows show the movement of $A$ and the red arrows show the movement of $B$.
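This block-shifting scheme (essentially Cannon's algorithm) can also be simulated sequentially. A sketch under the assumption that the grid side divides the matrix size; `cannon_matmul` and the local-storage layout are illustrative:

```python
import numpy as np

def cannon_matmul(A, B, p):
    """Simulate a p x p processor grid; A blocks move right, B blocks move down."""
    n = A.shape[0]
    b = n // p  # block size (assume p divides n)

    def blk(M, i, j):
        return M[i*b:(i+1)*b, j*b:(j+1)*b]

    # Initial skew so every processor starts with a matching pair of blocks:
    # processor (i, j) holds A_{i,(i+j) mod p} and B_{(i+j) mod p, j}.
    locA = [[blk(A, i, (i + j) % p) for j in range(p)] for i in range(p)]
    locB = [[blk(B, (i + j) % p, j) for j in range(p)] for i in range(p)]
    C = np.zeros((n, n))

    for _ in range(p):
        # Multiply the local pair and accumulate into the local C block.
        for i in range(p):
            for j in range(p):
                C[i*b:(i+1)*b, j*b:(j+1)*b] += locA[i][j] @ locB[i][j]
        # Shift A blocks one step right and B blocks one step down (wrapping around).
        locA = [[locA[i][(j - 1) % p] for j in range(p)] for i in range(p)]
        locB = [[locB[(i - 1) % p][j] for j in range(p)] for i in range(p)]

    return C

rng = np.random.default_rng(2)
A = rng.random((4, 4))
B = rng.random((4, 4))
assert np.allclose(cannon_matmul(A, B, 2), A @ B)
```

The initial skew is what makes the simple "shift right, shift down" rule work: after it, every processor always holds an $A$ block and a $B$ block that share the same inner index $k$.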
The big pattern is the same in both strategies: each processor keeps one local view of the data, exchanges just enough information with its neighbors, and accumulates one part of the final matrix.