Nicolas Joseph

CTO / Entrepreneur

// ROOT/LOGS/2016-04-03// LOG ENTRY

> Distributed Matrix Product _

A visual walkthrough of distributed matrix multiplication: first a refresher on the matrix product, then two ways to split the work across processors.

// Archived Note // Author // Nicolas Joseph

Distributed matrix multiplication matters when the matrices are too large for one machine to hold comfortably in memory. To explain the distributed case clearly, it helps to refresh the basic matrix product first and then look at two different ways of splitting the work.

If matrix multiplication is already fresh in your mind, skip directly to "Why distribute the matrix product?" below.

Refresher: Matrix Product

What A Matrix Is

A matrix is a data structure made of rows and columns. It is basically a tic-tac-toe grid filled with numbers, except that it can have any number of rows and columns.

When a matrix has the same number of rows and columns, we say that it is square ... because it looks like a square.

What The Product Computes

The matrix product is an operation that takes two matrices and returns another matrix as the result. Let's go through the process with an example.

Let A be a square 2×2 matrix and B another square 2×2 matrix.

This matrix product will result in another 2×2 matrix C.

In that configuration, each element of C (C_{1,1}, C_{1,2}, C_{2,1}, and C_{2,2}) is the result of sums and products over the corresponding rows of A and columns of B:

  • C_{1,1} = A_{1,1} \cdot B_{1,1} + A_{1,2} \cdot B_{2,1} = 1 \cdot 2 + 3 \cdot 3 = 11
  • C_{1,2} = A_{1,1} \cdot B_{1,2} + A_{1,2} \cdot B_{2,2} = 1 \cdot 1 + 3 \cdot 1 = 4
  • C_{2,1} = A_{2,1} \cdot B_{1,1} + A_{2,2} \cdot B_{2,1} = 5 \cdot 2 + 2 \cdot 3 = 16
  • C_{2,2} = A_{2,1} \cdot B_{1,2} + A_{2,2} \cdot B_{2,2} = 5 \cdot 1 + 2 \cdot 1 = 7

The overall operation looks like this; circled in red are the elements of the matrices involved in computing the C_{1,1} = 11 entry of the resulting matrix C:

The important idea is that each value in the result matrix comes from one row of A interacting with one column of B.
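That row-times-column rule can be sketched directly in plain Python (the values below are the ones from the worked example; the function name `matmul` is mine, not from the post):

```python
# Naive matrix product: C[i][j] is the dot product of row i of A
# with column j of B.
def matmul(A, B):
    n, m, p = len(A), len(B[0]), len(B)
    return [[sum(A[i][k] * B[k][j] for k in range(p)) for j in range(m)]
            for i in range(n)]

A = [[1, 3], [5, 2]]
B = [[2, 1], [3, 1]]
C = matmul(A, B)
# C == [[11, 4], [16, 7]], matching the four sums above
```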

Why Distribute The Matrix Product?

This operation appears in many machine learning and numerical workloads. As the amount of data grows, the matrices get large enough that a single processor may no longer be able to store or process them efficiently on its own.

For the sake of simplicity we will consider square matrices, but the same ideas can be generalized to rectangular ones.

Strategy 1: Slice Rows Of A And Columns Of B

The first strategy is to split matrix A into row slices and matrix B into column slices. Each processor receives one slice of each.

Let p be the number of processors we can use.

What One Processor Can Compute

What does processor 1 get in that configuration?

It gets the first row slice of A and the first column slice of B. With only those two pieces, it can compute only the top-left corner of C.

Following the same logic, every processor computes one block on the diagonal of C. At this point we only have a partial result.
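A small sketch makes the limitation concrete. Assume p = 2 processors and 4×4 matrices (my choice of sizes and values, purely for illustration): processor 1 holds only rows 0-1 of A and columns 0-1 of B, so the only entries it can finish are the top-left 2×2 block of C.

```python
A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
B = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

a_slice = A[0:2]                      # row slice owned by processor 1
b_slice = [row[0:2] for row in B]     # column slice owned by processor 1

# The only block processor 1 can fully compute: C[0:2][0:2].
c_block = [[sum(a_slice[i][k] * b_slice[k][j] for k in range(4))
            for j in range(2)] for i in range(2)]
# The other three blocks of C need data this processor does not hold yet.
```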

Rotate B Through The Ring

To fill in the missing blocks, the processors keep their slice of A fixed and rotate the slices of B. On the next iteration, processor 1 still owns row slice 1 of A, but now receives column slice 2 of B, so it can compute a different block of C.

If we imagine the processors arranged in a ring, each processor simply passes its current slice of B to the processor on its right until every processor has seen every slice.

The next iteration fills the red blocks:

Repeat that rotation enough times and the whole result matrix is assembled from those partial computations.
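The whole strategy fits in a short simulation. This is a single-process sketch, not real message passing: the `owner` list stands in for which B slice each processor currently holds, and rotating it stands in for the ring exchange.

```python
# Strategy 1 as a loop: each "processor" r keeps its row slice of A,
# and the column slices of B rotate around the ring. After p steps
# every processor has seen every slice and C is complete.
def ring_matmul(A, B, p):
    n = len(A)
    s = n // p                                     # slice width
    a_slices = [A[r * s:(r + 1) * s] for r in range(p)]
    b_slices = [[row[c * s:(c + 1) * s] for row in B] for c in range(p)]
    C = [[0] * n for _ in range(n)]
    owner = list(range(p))      # owner[r]: which B slice processor r holds
    for _ in range(p):
        for r in range(p):
            c = owner[r]
            for i in range(s):
                for j in range(s):
                    C[r * s + i][c * s + j] = sum(
                        a_slices[r][i][k] * b_slices[c][k][j]
                        for k in range(n))
        owner = owner[-1:] + owner[:-1]   # pass B slices around the ring
    return C

A = [[1, 3], [5, 2]]
B = [[2, 1], [3, 1]]
C = ring_matmul(A, B, 2)
# C == [[11, 4], [16, 7]], the same result as the direct product
```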

Strategy 2: Split Everything Into Square Submatrices

The previous method suggests a second idea: instead of long row and column slices, split both matrices into smaller square blocks.

What Each Processor Owns

In that configuration, each block of C can be expressed as a sum of products of corresponding submatrices from A and B:

C_{1,1} = \sum_{i=1}^{\sqrt{p}} A_{1,i} \cdot B_{i,1}

That equation captures the same dependency as before: to compute one destination block, you need a matching run of blocks from one row of A and one column of B.

The difference is ownership. Now each processor is responsible for one output block of C and works with smaller chunks instead of whole rows or columns.
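The block equation can be checked numerically. Assuming 4×4 matrices split into a 2×2 grid of 2×2 submatrices (so √p = 2; the values and helper names are mine), the top-left block of the full product equals the sum of the two submatrix products:

```python
def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def madd(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def block(M, bi, bj, s):            # s x s submatrix (bi, bj) of M
    return [row[bj * s:(bj + 1) * s] for row in M[bi * s:(bi + 1) * s]]

A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
B = [[1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]]
s = 2

# C_{1,1} = A_{1,1} B_{1,1} + A_{1,2} B_{2,1}, using 0-based block indices:
C11 = madd(matmul(block(A, 0, 0, s), block(B, 0, 0, s)),
           matmul(block(A, 0, 1, s), block(B, 1, 0, s)))
# C11 equals the top-left block of the full product of A and B
```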

Moving A Right And B Down

Take processor 1 in a 2×2 processor grid. To compute its block of C, it eventually needs the right set of blocks from both A and B:

  • C_{1,1} = A_{1,1} \cdot B_{1,1} + A_{1,2} \cdot B_{2,1}
  • C_{1,2} = A_{1,1} \cdot B_{1,2} + A_{1,2} \cdot B_{2,2}
  • C_{2,1} = A_{2,1} \cdot B_{1,1} + A_{2,2} \cdot B_{2,1}
  • C_{2,2} = A_{2,1} \cdot B_{1,2} + A_{2,2} \cdot B_{2,2}

To make that happen, the processors play the same general game as before: move data around in a structured pattern so that every processor eventually sees what it needs without ever storing the entire input.

In a 2×2 grid, each iteration sends blocks of A one step to the right and blocks of B one step downward. The green arrows show the movement of A and the red arrows show the movement of B.
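That shifting pattern can be simulated end to end. The sketch below assumes a 2×2 processor grid with 2×2 blocks (my choice), and follows the classic Cannon-style scheme: an initial skew of the blocks, then repeated multiply-accumulate and shift steps; on a 2×2 torus, shifting right/down is equivalent to the usual left/up formulation.

```python
def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def block(M, bi, bj, s):            # s x s submatrix (bi, bj) of M
    return [row[bj * s:(bj + 1) * s] for row in M[bi * s:(bi + 1) * s]]

A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
B = [[1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]]
q, s = 2, 2                         # grid side (sqrt(p)), block size

# Initial skew: processor (i, j) starts with A block (i, (i+j) % q)
# and B block ((i+j) % q, j).
a = [[block(A, i, (i + j) % q, s) for j in range(q)] for i in range(q)]
b = [[block(B, (i + j) % q, j, s) for j in range(q)] for i in range(q)]
c = [[[[0] * s for _ in range(s)] for _ in range(q)] for _ in range(q)]

for _ in range(q):
    # Each processor multiplies its local blocks and accumulates.
    for i in range(q):
        for j in range(q):
            prod = matmul(a[i][j], b[i][j])
            for x in range(s):
                for y in range(s):
                    c[i][j][x][y] += prod[x][y]
    # Shift A blocks one step right and B blocks one step down,
    # wrapping around the torus.
    a = [[a[i][(j - 1) % q] for j in range(q)] for i in range(q)]
    b = [[b[(i - 1) % q][j] for j in range(q)] for i in range(q)]

# c[i][j] now holds block (i, j) of the full product of A and B.
```

After q iterations each processor has accumulated exactly the submatrix products its output block needs, without ever storing a full input matrix.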

The big pattern is the same in both strategies: each processor keeps one local view of the data, exchanges just enough information with its neighbors, and accumulates one part of the final matrix.