**Note:**

This blog is intended to record what I learned in *Parallel Programming*, rather than to share the course materials. So there are two things that I won't put up: my solutions to the assignments, and topics I have already mastered.

PLEASE NOTIFY ME if I have accidentally broken the Coursera Honor Code.

## Week 1

### Introduction to Parallel Computing

- at the beginning of the 21st century, processor frequency scaling hit the *power wall*.
- parallelism and concurrency are closely related concepts:
  - a parallel program uses parallel hardware to execute computations more quickly. Its main concern is efficiency (or speedup).
  - a concurrent program may or may not execute multiple executions at the same time. Its main concerns are modularity, responsiveness, and maintainability (or convenience).

- parallelism granularity:
  - bit-level parallelism: processing multiple bits of data in parallel.
  - instruction-level parallelism: executing different instructions from the same instruction stream in parallel.
  - task-level parallelism: executing separate instruction streams in parallel.

- different forms of parallel hardware:
  - multi-core processors.
  - symmetric multiprocessors.
  - graphics processing units.
  - field-programmable gate arrays.
  - computer clusters.

### Parallelism on the JVM I

### Parallelism on the JVM II

- a memory model is a set of rules that describes how threads interact when accessing shared memory.
- Java Memory Model – the memory model for the JVM:
  - two threads writing to separate locations in memory do not need synchronization.
  - a thread `X` that calls `join` on another thread `Y` is guaranteed to observe all the writes by thread `Y` after `join` returns.
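
A minimal sketch of the `join` guarantee (my example, not from the course): `join` establishes a happens-before edge, so even the unsynchronized write becomes visible afterwards.

```scala
object JoinVisibility {
  var result = 0 // shared mutable state, deliberately unsynchronized

  def main(args: Array[String]): Unit = {
    val y = new Thread {
      override def run(): Unit = result = 42 // write in thread Y
    }
    y.start()
    y.join()        // happens-before: all of Y's writes are now visible
    println(result) // guaranteed to print 42
  }
}
```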

### Running Computations in Parallel

### How Fast are Parallel Programs?

- how to estimate the performance of parallelism:
  - empirical measurement.
  - asymptotic analysis, which is important for understanding how algorithms scale when inputs get larger or when more hardware parallelism is available.

### Benchmarking Parallel Programs

- when a JVM program starts, it undergoes a period of *warmup*, after which it achieves its maximum performance:
  - first, the program is interpreted.
  - then, parts of the program are compiled into machine code.
  - later, the JVM may choose to apply additional dynamic optimizations.
  - eventually, the program reaches steady state.
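
The course measures steady-state performance with the ScalaMeter library; a minimal sketch, assuming a ScalaMeter dependency is on the classpath (the exact API varies by version):

```scala
import org.scalameter._

object ArrayMapBenchmark extends App {
  // Warmer.Default repeatedly runs the body until running times stabilize,
  // so the measurement reflects steady state rather than warmup.
  val time = withWarmer(new Warmer.Default) measure {
    (0 until 1000000).toArray.map(_ + 1)
  }
  println(s"steady-state running time: $time")
}
```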

## Week 2

### Parallel Sorting

- implement `merge sort` in parallel (a sketch follows).
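
The course version sorts an `Array` in place with an intermediate buffer; here is a simplified list-based sketch, with a hypothetical `parallel` helper built on `Future` standing in for the course's construct:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

object ParMergeSort {
  // Evaluate a and b on two threads and return both results.
  def parallel[A, B](a: => A, b: => B): (A, B) = {
    val fb = Future(b)
    (a, Await.result(fb, Duration.Inf))
  }

  // Sequential merge of two sorted lists.
  def merge(a: List[Int], b: List[Int]): List[Int] = (a, b) match {
    case (Nil, _) => b
    case (_, Nil) => a
    case (x :: xt, y :: yt) =>
      if (x <= y) x :: merge(xt, b) else y :: merge(a, yt)
  }

  // Split, sort the halves in parallel down to `depth`, then merge.
  def sort(xs: List[Int], depth: Int): List[Int] =
    if (xs.length <= 1) xs
    else {
      val (l, r) = xs.splitAt(xs.length / 2)
      val (sl, sr) =
        if (depth > 0) parallel(sort(l, depth - 1), sort(r, depth - 1))
        else (sort(l, 0), sort(r, 0))
      merge(sl, sr)
    }
}
```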

### Data Operations and Parallel Mapping

- operations on `List` are not good for parallel implementation, since splitting a list in half and combining two lists each take linear time.
- this section mainly concentrates on parallel implementations of the `map` operation on `Array` and `Tree` (see the array sketch below).
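
A sketch of the array case, reusing the hypothetical `parallel` helper from the merge-sort sketch; `threshold` is an assumed tuning constant. Writing into a preallocated output array makes the combining step free:

```scala
object ParMap {
  import ParMergeSort.parallel // hypothetical helper from the sketch above

  val threshold = 10000 // assumed sequential cut-off

  // Map f over inp(left until right) into out, splitting the range in half
  // and processing the halves in parallel until the range is small.
  def mapSegPar[A, B](inp: Array[A], left: Int, right: Int,
                      f: A => B, out: Array[B]): Unit =
    if (right - left < threshold) {
      var i = left
      while (i < right) { out(i) = f(inp(i)); i += 1 }
    } else {
      val mid = left + (right - left) / 2
      parallel(mapSegPar(inp, left, mid, f, out),
               mapSegPar(inp, mid, right, f, out))
    }
}
```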

### Parallel Fold (Reduce) Operation

- an operation `f: (A, A) => A` is associative iff for every `x`, `y`, `z`: `f(x, f(y, z)) = f(f(x, y), z)`.
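
A quick way to see why this matters for parallel reduction (my example; note that since Scala 2.13, `.par` lives in the separate scala-parallel-collections module):

```scala
val xs = (1 to 1000).toArray.par
xs.reduce(_ + _) // + is associative: always 500500, however the tree splits
xs.reduce(_ - _) // - is not associative: the result varies from run to run
```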

### Associativity I

- an operation `f: (A, A) => A` is commutative iff for every `x`, `y`: `f(x, y) = f(y, x)`.

- floating point addition is commutative but not associative:

```scala
scala> val e = 1e-200
e: Double = 1.0E-200

scala> val x = 1e200
x: Double = 1.0E200

scala> val mx = -x
mx: Double = -1.0E200

scala> (x + mx) + e
res2: Double = 1.0E-200

scala> x + (mx + e)
res3: Double = 0.0

scala> (x + mx) + e == x + (mx + e)
res4: Boolean = false
```

### Associativity II

- for `E(x, y, z) = f(f(x, y), z)`, we say the arguments of `E` can rotate if `E(x, y, z) = E(y, z, x)`, that is: `f(f(x, y), z) = f(f(y, z), x)`.
- if `f` is commutative and the arguments of `E` can rotate, then `f` is also associative: `f(f(x, y), z) = f(f(y, z), x) = f(x, f(y, z))`, where the second step uses commutativity.

### Parallel Scan (Prefix Sum) Operation

## Week 3

### Data-Parallel Programming

- task-parallel programming: a form of parallelization that distributes execution processes across computing nodes.
- data-parallel programming: a form of parallelization that distributes data across computing nodes.
- a parallel for loop is the simplest form of data-parallel programming:

```scala
for (i <- (0 until xs.length).par) {
  xs(i) = i
}
```

- the goal of a data-parallel scheduler is to efficiently balance the workload across processors without any knowledge of the amount of work `w(i)` in each iteration.

### Data-Parallel Operations I

- most collections in Scala can become data-parallel by appending `.par`.
- the operations `reduceLeft`, `reduceRight`, `scanLeft`, `scanRight`, `foldLeft`, `foldRight` must process the elements sequentially.
- the `fold` operation can process the elements in a reduction tree, so it can execute in parallel.
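
The signatures explain the difference: `foldLeft` takes `(B, A) => B`, which threads an accumulator strictly left to right, while `fold` takes `(A, A) => A`, which permits a reduction tree. A minimal example:

```scala
val xs = (1 to 1000000).toArray.par
xs.foldLeft(0)(_ + _) // accumulator forces sequential evaluation
xs.fold(0)(_ + _)     // may be evaluated as a parallel reduction tree
```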

### Data-Parallel Operations II

- in order for the `fold` operation to work correctly, the following relations must hold:

```scala
f(a, f(b, c)) == f(f(a, b), c)    // associativity
f(z, a) == f(a, z) == a           // neutral element
```

we say that the neutral element `z` and the binary operator `f` must form a `monoid`.

- for `f = math.max`, we have `f(a, f(b, c)) == f(f(a, b), c)` and `f(Int.MinValue, a) == f(a, Int.MinValue) == a`:

```scala
def max(xs: Array[Int]): Int =
  xs.par.fold(Int.MinValue)(math.max)
```

- the `fold` operation can only produce values of the same type as the collection that it is called on.
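
For instance, counting vowels in a character array needs an `Int` result from a `Char` collection, which `fold` cannot express; the parallel collections' `aggregate` method (a standard-library API, not covered in my notes above) lifts the restriction by taking separate sequential and combining operators:

```scala
val chars = Array('E', 'P', 'F', 'L').par
// chars.fold(0)(...) cannot typecheck: 0 is an Int, the elements are Chars
val vowels = chars.aggregate(0)(
  (count, c) => if ("AEIOU".contains(c)) count + 1 else count, // fold a part
  _ + _                                                        // combine parts
)
// vowels == 1
```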

### Scala Parallel Collections

- Scala collections hierarchy:
  - `Traversable[T]` – collection of elements with type `T`, with operations implemented using `foreach`.
  - `Iterable[T]` – collection of elements with type `T`, with operations implemented using `iterator`.
  - `Seq[T]` – an ordered sequence of elements with type `T`.
  - `Set[T]` – a set of elements with type `T` (no duplicates).
  - `Map[K, V]` – a map of keys with type `K` associated with values of type `V` (no duplicate keys).

- rule: never modify a parallel collection on which a data-parallel operation is in progress.
  - never write to a collection that is concurrently traversed.
  - never read from a collection that is concurrently modified.
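
A sketch of the hazard (my example): `.par` on an `Array` wraps the same underlying data, so the commented line races the traversal against the writes. The course notes `scala.collection.concurrent.TrieMap` as an exception to the rule:

```scala
val xs = Array.tabulate(1000)(identity)

// DON'T: xs.par shares the underlying array, so this loop reads and
// writes the collection concurrently; the outcome is arbitrary.
// for (x <- xs.par) xs((x + 1) % xs.length) = x

// concurrent.TrieMap is a documented exception: its snapshot-based
// iteration stays consistent even under concurrent updates.
```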

### Splitters and Combiners

- 4 abstractions for data-parallel collections: `iterators`, `splitters`, `builders`, `combiners`.
- `Iterator`:

```scala
trait Iterator[A] {
  def next(): A
  def hasNext: Boolean
}

def iterator: Iterator[A] // on every collection
```

the `Iterator` contract:
  - `next` can be called only if `hasNext` returns true.
  - after `hasNext` returns false, it will always return false.

- `Splitter`:

```scala
trait Splitter[A] extends Iterator[A] {
  def split: Seq[Splitter[A]]
  def remaining: Int
}

def splitter: Splitter[A] // on every parallel collection
```

the `Splitter` contract:
  - after calling `split`, the original splitter is left in an undefined state.
  - the resulting splitters traverse disjoint subsets of the original splitter.
  - `remaining` is an estimate of the number of remaining elements.
  - `split` is an efficient method – $O(\log n)$ or better.
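
The lectures sketch how parallel operations are built on splitters; here is a paraphrase of `fold`, where `task`/`join` (spawn work, wait for its result) and `threshold` are assumed primitives:

```scala
// Fold over a splitter: small inputs are reduced sequentially; larger
// ones are split, folded in parallel tasks, and the results combined.
def fold[A](s: Splitter[A], z: A)(f: (A, A) => A): A =
  if (s.remaining < threshold) {
    var acc = z
    while (s.hasNext) acc = f(acc, s.next())
    acc
  } else {
    val children = for (child <- s.split) yield task { fold(child, z)(f) }
    children.map(_.join()).foldLeft(z)(f)
  }
```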

- `Builder`:

```scala
trait Builder[A, Repr] {
  def +=(elem: A): Builder[A, Repr]
  def result: Repr
}

def newBuilder: Builder[A, Repr] // on every collection
```

the `Builder` contract:
  - calling `result` returns a collection of type `Repr`, containing the elements that were previously added with `+=`.
  - calling `result` leaves the `Builder` in an undefined state.
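
As an illustration (my sketch, using the simplified `Builder` trait above), a sequential transformer such as `filter` uses a builder like this:

```scala
// Build the filtered collection element by element, then seal it.
def filter[A, Repr](xs: Iterable[A], p: A => Boolean)
                   (newBuilder: => Builder[A, Repr]): Repr = {
  val b = newBuilder
  for (x <- xs) if (p(x)) b += x
  b.result
}
```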

- `Combiner`:

```scala
trait Combiner[A, Repr] extends Builder[A, Repr] {
  def combine(that: Combiner[A, Repr]): Combiner[A, Repr]
}

def newCombiner: Combiner[A, Repr] // on every parallel collection
```

the `Combiner` contract:
  - calling `combine` returns a new combiner that contains the elements of both input combiners.
  - calling `combine` leaves both original combiners in an undefined state.
  - `combine` is an efficient method – $O(\log n)$ or better.
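Putting splitters and combiners together (my sketch, with the same assumed `task`/`join`/`threshold` primitives as the fold sketch above), a parallel transformer such as `filter` looks roughly like this:

```scala
// Each task filters its chunk into a fresh combiner; the per-chunk
// combiners are then merged with the efficient combine operation.
def parFilter[A, Repr](s: Splitter[A], p: A => Boolean)
                      (newCombiner: => Combiner[A, Repr]): Combiner[A, Repr] =
  if (s.remaining < threshold) {
    val c = newCombiner
    while (s.hasNext) { val x = s.next(); if (p(x)) c += x }
    c
  } else {
    val children =
      for (child <- s.split) yield task { parFilter(child, p)(newCombiner) }
    children.map(_.join()).reduce(_ combine _)
  }
```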

## Week 4

### Implementing Combiners

- `Builder` is used in sequential collection methods:

```scala
trait Builder[T, Repr] {
  def +=(elem: T): this.type
  def result: Repr
}
```

- to implement parallel transformer operations, we need an abstraction called a combiner:

```scala
trait Combiner[T, Repr] extends Builder[T, Repr] {
  def combine(that: Combiner[T, Repr]): Combiner[T, Repr]
}
```

- the `combine` operation must be efficient, i.e. execute in $O(\log n + \log m)$ time, where `n` and `m` are the sizes of the two input combiners.
- arrays cannot be efficiently concatenated.
- most data structures do not have an efficient concatenation or union, so providing a `combiner` for the corresponding collections is not straightforward – but it is indeed possible.

### Parallel Two-Phase Construction

- most data structures can be constructed in parallel using two-phase construction.
- the intermediate data structure is a data structure that:
  - has an efficient `combine` method – $O(\log n + \log m)$ or better.
  - has an efficient `+=` method.
  - can be converted to the resulting data structure in $O(n/P)$ time (where P is the number of processors).
- two-phase construction works in a similar way for different data structures: first partition the elements, then construct parts of the final data structure in parallel. For arrays (see the combiner sketch after this list):
  - partition the indices into subintervals.
  - initialize the array in parallel.

- two-phase construction for hash tables:
- partition the hash codes into buckets.
- allocate the table, and map hash codes from different buckets into different regions.

- two-phase construction for search trees:
- partition the elements into non-overlapping intervals according to their ordering.
- construct search trees in parallel, and link non-overlapping trees.

- two-phase construction for spatial data structures:
- spatially partition the elements.
- construct non-overlapping subsets and link them.
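
A sketch of the two-phase array combiner from the lectures (reconstructed from memory, so details may differ): phase one appends into nested buffers, phase two copies into the final array, which the course performs in parallel:

```scala
import scala.collection.mutable.ArrayBuffer
import scala.reflect.ClassTag

class ArrayCombiner[T: ClassTag] {
  private var numElems = 0
  private val buffers = ArrayBuffer(ArrayBuffer[T]())

  // Phase 1: amortized O(1) append into the last nested buffer.
  def +=(x: T): this.type = {
    buffers.last += x
    numElems += 1
    this
  }

  // O(P) combine: move buffer references, never copy elements.
  def combine(that: ArrayCombiner[T]): ArrayCombiner[T] = {
    buffers ++= that.buffers
    numElems += that.numElems
    this
  }

  // Phase 2: copy into the result array (sequential here; the course
  // partitions the index space and copies in parallel).
  def result: Array[T] = {
    val arr = new Array[T](numElems)
    var i = 0
    for (buf <- buffers; x <- buf) { arr(i) = x; i += 1 }
    arr
  }
}
```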

- how can we implement combiners?
  - two-phase construction – the combiner uses an intermediate data structure with an efficient `combine` method to partition the elements. When `result` is called, the final data structure is constructed in parallel from the intermediate data structure.
  - an efficient concatenation or union operation – a preferred way when the resulting data structure allows this.
  - concurrent data structure – different combiners share the same underlying data structure, and rely on synchronization to correctly update the data structure when `+=` is called.

### Conc-tree Data Structure

- Lists are built for sequential computations – they are traversed from left to right.
- Trees allow parallel computations – their subtrees can be traversed in parallel.
- Trees are not good for parallelism unless they are balanced.
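
The lecture's `Conc` data type, reconstructed from memory (its invariants – a `<>` node never contains `Empty`, and sibling levels differ by at most one – keep the tree balanced enough for parallel traversal and $O(\log n)$ concatenation):

```scala
sealed trait Conc[+T] {
  def level: Int // height of the tree
  def size: Int  // number of elements
}

case object Empty extends Conc[Nothing] {
  def level = 0
  def size = 0
}

case class Single[T](x: T) extends Conc[T] {
  def level = 0
  def size = 1
}

// Inner node, written <> in the lectures; the leaves hold the elements.
case class <>[T](left: Conc[T], right: Conc[T]) extends Conc[T] {
  val level = 1 + math.max(left.level, right.level)
  val size = left.size + right.size
}
```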