How to: Implement Parallel Matrix Multiplication in F#

Visual Studio 2010

Applies to: Functional Programming

Published: January 2010

Authors: Yin Zhu

Referenced Image

Get this book in Print, PDF, ePub and Kindle at manning.com. Use code “MSDN37b” to save 37%.

Summary: This article shows how to implement a simple parallel matrix multiplication. It uses a task to demonstrate three libraries for parallel programming that can be used from F#.

This topic contains the following sections.

This article is associated with Real World Functional Programming: With Examples in F# and C# by Tomas Petricek with Jon Skeet from Manning Publications (ISBN 9781933988924, copyright Manning Publications 2009, all rights reserved). No part of these chapters may be reproduced, stored in a retrieval system, or transmitted in any form or by any means—electronic, electrostatic, mechanical, photocopying, recording, or otherwise—without the prior written permission of the publisher, except in the case of brief quotations embodied in critical articles or reviews.

The speed of a single conventional CPU core has reached its upper bounds and will likely stay the same for several years. Unless some new technology allows increasing the speed of individual cores again, the speed of single-threaded programs will not increase with newer and better CPUs. Currently, CPUs become more powerful not by increasing the frequency but by adding more cores. This means that the mainstream computing paradigm is gradually shifting to multicore and parallel programming in general.

This article demonstrates several ways to write multithreaded programs in F#. It uses matrix multiplication, which is a very computationally intensive task. Some of the approaches demonstrated in this article are more appropriate for this problem than others. The main purpose of this article is to briefly introduce the following three libraries:

  • F# Asynchronous Workflows—Mainly designed for writing code using nonblocking asynchronous operations (such as working with a network or I/O), but they can be used for basic parallelization as well. Asynchronous workflows are mainly useful for server-side programming or for writing user interfaces.

  • The Task Parallel Library (TPL)—A low-level, high-performance library available in .NET 4.0 that can be used for writing parallel computations. The basic concept of the TPL is a task, which represents a unit of work. The TPL also provides higher-level primitives such as parallel loops.

  • Parallel Sequences (Parallel LINQ)—A higher-level parallel library for querying and processing collections of data. It provides parallel implementations of many complex operations, including filtering, grouping, and joins.

The best choice for implementing a parallel matrix multiplication is the Task Parallel Library (TPL). Matrix multiplication works directly with arrays, which is a relatively low-level task. The next three examples demonstrate some functions from each of the libraries. The article doesn't provide a rigorous performance analysis, but it shows a performance estimate using the #time directive. In practice, performance is very application-specific. The following snippet initializes two matrices that are used by the samples below:

> let rnd = new System.Random()
  let a = Array2D.init 200 200 (fun _ _ -> rnd.NextDouble())
  let b = Array2D.init 200 200 (fun _ _ -> rnd.NextDouble())
  ;;
val rnd : System.Random
val a : float [,] = [| ... |]
val b : float [,] = [| ... |]

> #time;;
(...)

The summary at the end of this article compares the three options and discusses which is more appropriate for specific tasks. It also gives references toarticles that include additional information about individual libraries and resources about agent-based programming, which is an alternative application architecture that is more suitable for writing concurrent applications.

As already mentioned, F# asynchronous workflows are mainly useful for writing computations that involve some I/O operations (such as working with a disk, a network, or a user interface). However, they also provide a function that runs multiple workflows in parallel:

> let matrixMultiplyAsync (a:float[,]) (b:float[,]) =
    let rowsA, colsA = Array2D.length1 a, Array2D.length2 a
    let rowsB, colsB = Array2D.length1 b, Array2D.length2 b
    let result = Array2D.create rowsA colsB 0.0
    [ for i in 0 .. rowsA - 1 ->
        async {
           for j in 0 .. colsB - 1 do
             for k in 0 .. colsA - 1 do
               result.[i,j] <- result.[i,j] + a.[i,k] * b.[k,j]
        } ]
    |> Async.Parallel
    |> Async.RunSynchronously
    |> ignore
    result;;
val matrixMultiplyAsync : float [,] -> float [,] -> float [,]

> matrixMultiplyAsync a b;;
Real: 00:00:03.046, CPU: 00:00:04.446, GC gen0: 399, gen1: 1, gen2: 0
val it : float [,] = [| ... |]

The snippet declares a function that takes two-dimensional arrays a and b as parameters. It assumes that the arrays have the right shape. The snippet first gets the dimensions of the matrices and then implements the multiplication. This is written as three nested loops. However, the first for loop generates asynchronous workflows. A workflow is wrapped in the async { … } block and it contains the remaining two for loops. The symbol -> means that that a workflow is returned as an element of a list. Next, the snippet uses Async.Parallel to combine all of the generated workflows into a single workflow (by executing them in parallel) and then runs the composed workflow using Async.RunSynchronously to wait for the completion. F# asynchronous workflows are mainly useful for writing IO bound computations or for creating efficient agent-based applications. For fine-grained parallelism (like the above example), the following two options are more efficient.

The solution that provides the best performance from the three options shown in this article is based on the .NET TPL. Most of the library's functionality is available in the System.Threading.Tasks namespace. The library is based on types Task and Task<'T>. The former represents an operation that does something and the later represents an operation that eventually calculates a value of type 'T. In addition to these primitives, it is possible to use functions from the Parallel class, such as Parallel.For:

open System.Threading.Tasks

> let matrixMultiplyTasks (a:float[,]) (b:float[,]) =
    let rowsA, colsA = Array2D.length1 a, Array2D.length2 a
    let rowsB, colsB = Array2D.length1 b, Array2D.length2 b
    let result = Array2D.create rowsA colsB 0.0
    Parallel.For(0, rowsA, (fun i->
        for j = 0 to colsB - 1 do
           for k = 0 to colsA - 1 do
              result.[i,j] <- result.[i,j] + a.[i,k] * b.[k,j]))  
    |> ignore
    result;;
val matrixMultiplyTasks : float [,] -> float [,] -> float [,]

> matrixMultiplyTasks a b;;
Real: 00:00:00.061, CPU: 00:00:00.109, GC gen0: 0, gen1: 0, gen2: 0
val it : float [,] = [| ... |]

The structure of the snippet is similar to the previous example, but the parallelization looks slightly different. It uses function Parallel.For, which repesents a parallel for loop that takes three arguments. The first two arguments specify the range for the iteration and the last argument is a function with the body. The function takes the current index as a parameter. The Parallel.For function automatically creates an appropriate number of tasks to efficiently evaluate the body in parallel. The parallel for loop is the most efficient solution for the example demonstrated in this article, but it lacks many features that are available in F# asynchronous workflows (like automatic cancellation support or non-blocking waiting).

The next version of the example uses Parallel LINQ, which is based on the TPL and is also available in .NET 4.0. As discussed in the introduction, the library is mainly useful for more complex data processing, but it also includes a primitive for a simple iteration. F# programmers can access Parallel LINQ using a wrapper that is available as part of the F# PowerPack. The wrapper provides an interface similar to other F# collection types, such as arrays or lists. Most of the funcitonality is available in the PSeq module. In particular, iteration is captured by PSeq.iter:

#r "FSharp.PowerPack.Parallel.Seq.dll"
open Microsoft.FSharp.Collections

let matrixMultiplyPSeq (a:float[,]) (b:float[,]) =
    let rowsA, colsA = Array2D.length1 a, Array2D.length2 a
    let rowsB, colsB = Array2D.length1 b, Array2D.length2 b
    let result = Array2D.create rowsA colsB 0.0
    [ 0 .. rowsA - 1 ] |> PSeq.iter (fun i ->
      for j = 0 to colsB - 1 do
        for k = 0 to colsA - 1 do
          result.[i,j] <- result.[i,j] + a.[i,k] * b.[k,j] )
    result;;
val matrixMultiplyPSeq : float [,] -> float [,] -> float [,]

> matrixMultiplyPSeq a b;;
Real: 00:00:00.087, CPU: 00:00:00.109, GC gen0: 0, gen1: 0, gen2: 0
val it : float [,] = [| ... |]

The function PSeq.iter behaves very similarly to Parallel.For. It takes the input as a list (instead of a range) so the snippet uses a simple comprehension to generate a list of indices. The solution is slightly less efficient than using parallel for loop, but it still gives a very good performance in this example.

The PSeq module is interesting mainly because of other functions. For example, PSeq.map can be used to transform a collection and PSeq.distinct removes duplicates from a collection. These functions are not needed when implementing matrix multiplication, but they are extremely useful for processing large amounts of data.

This article demonstrated several libraries that can be used for writing parallel algorithms. The sample problem that was used in the article can be best solved using the Task Parallel Library (TPL), which provides low-level parallelization primitives based on tasks. Parallel LINQ (or Parallel Sequences in F#) are more suitable for complex data processing, but the matrix multiplication example still gives a good performance. On the other hand, F# asynchronous workflows are not designed for fine-grained concurrency. They are mainly useful for composing computations that run for a longer time, such as processing a server request or performing an I/O operation.

More information about writing server-side programs that use F# asynchronous workflows, about the agent-based architecture, and about the Task Parallel Library as well as Parallel LINQ can be found in the following articles:

This article used matrix multiplication as an example of parallel algorithms. In practice, matrix multiplication (and many other standard tasks) is already implemented by numerous libraries. The following articles introduce several numerical libraries for F# and .NET:

To download the code snippets shown in this article, go to http://code.msdn.microsoft.com/Chapter-4-Numerical-3df3edee

This article is based on Real World Functional Programming: With Examples in F# and C#. Book chapters related to the content of this article are:

  • Book Chapter 12: “Sequence expressions and alternative workflows” contains detailed information on processing in-memory data (such as lists and seq<'T> values) using higher-order functions and F# sequence expressions.

  • Book Chapter 13: “Asynchronous and data-driven programming” explains how asynchronous workflows work and uses them to write an interactive script that downloads a large dataset from the Internet.

  • Book Chapter 14: “Writing parallel functional programs” explains how to use Task Parallel Library to write data-parallel and task-based parallel programs. This approach complements the agent-based parallelism in F#.

Show: