Chapter 1 Solutions

Ray Ontko's problem solutions for Chapter 1 of Wilkinson, Barry, and Michael Allen, Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers, (Prentice Hall, 1999).

If you're wondering why I do this, see Why Publish Homework Solutions?.

Chapter 1

  1. Well, if there are 10^11 stars in the galaxy, each iteration involves approximately (10^11)^2 = 10^22 pairwise calculations. Multiplying by an estimated 10 floating point operations per calculation, and by 100 iterations, we get 10^25 floating point operations total. If we divide this by 500,000,000 (5*10^8) floating point operations per second, we get 2*10^16 seconds, or about 630 million years.
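
    This arithmetic is easy to drop a power of ten in, so here is a quick Python sanity check (all inputs are the rough figures assumed above, not measured values):

```python
# Rough check of the estimate above; every input is an assumption.
stars = 10**11
calcs_per_iteration = stars**2          # pairwise interactions: ~10^22
flops_per_calc = 10
iterations = 100
total_flops = calcs_per_iteration * flops_per_calc * iterations  # 10^25
rate = 5 * 10**8                        # 500 MFLOPS
seconds = total_flops / rate            # 2 * 10^16 s
years = seconds / (3600 * 24 * 365)
print(f"{seconds:.1e} s = {years:.2e} years")  # about 6.3e8 years
```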

  2. Let's look at a few examples. If n = 3, the torus has a diameter (maximum least distance between nodes) of 2. If n = 4, the torus has a diameter of 4. If n = 5, the torus has a diameter of 4. If n = 6, the torus has a diameter of 6. It appears that for odd n the diameter is d = n-1, and for even n the diameter is d = n; in both cases the longest trip is floor(n/2) hops in each of the two dimensions, i.e., d = 2*floor(n/2).
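
    These diameters can be verified by brute force. The Python sketch below runs a breadth-first search on an n x n torus; since every node of a torus is equivalent by symmetry, a single BFS from one node suffices:

```python
from collections import deque

def torus_diameter(n):
    """Diameter of an n x n torus, found by BFS from one node."""
    def neighbors(r, c):
        return [((r + 1) % n, c), ((r - 1) % n, c),
                (r, (c + 1) % n), (r, (c - 1) % n)]
    dist = {(0, 0): 0}
    queue = deque([(0, 0)])
    while queue:
        node = queue.popleft()
        for nb in neighbors(*node):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return max(dist.values())

for n in range(3, 7):
    print(n, torus_diameter(n))  # 3->2, 4->4, 5->4, 6->6
```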

  3. If the tree is balanced, i.e., the depth of the tree is equal on all branches from the root node, and the depth of the tree is n, then the diameter of the network is the distance from the deepest node on one branch off the root to the deepest node on another branch off the root. That is, d=2*n.

  4. The maximum distance in any dimension is n-1, if n is the number of nodes in that dimension of the mesh. The maximum distance for the d-dimensional mesh, then, is the sum of the nodes in each dimension minus the number of dimensions. If n is the number of elements in each of d dimensions, then the diameter = (n-1)*d.

  5. First, we note that a five-dimensional hypercube which uses the e-cube routing algorithm is a binary hypercube, that is, the cube is 2x2x2x2x2. Therefore each node is represented by an address composed of 5 base-2 digits. The address of node 7, then, is 00111 (7=4+2+1), and node 22 is 10110 (22=16+4+2). Taking the exclusive OR of the two addresses gives 10001 for the routing bits; in other words, the two addresses differ only in the first and last dimensions of the hypercube. To obtain the route, we XOR the left-most non-zero routing bit with the corresponding bit of the current node to obtain the next node, set that routing bit to zero, and repeat until all of the routing bits are zero. The route, then, is from node 7 (00111) to node 23 (10111) to node 22 (10110).
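
    The procedure is easy to express in code. Here is a small Python sketch of e-cube routing (the function name and bit-width parameter are my own choices):

```python
def ecube_route(src, dst, bits=5):
    """E-cube route in a binary hypercube: clear routing bits left to right."""
    route = [src]
    routing = src ^ dst                     # XOR gives the routing bits
    for i in reversed(range(bits)):         # left-most bit first
        if routing & (1 << i):
            route.append(route[-1] ^ (1 << i))
    return route

print([format(n, "05b") for n in ecube_route(7, 22)])
# 7 (00111) -> 23 (10111) -> 22 (10110)
```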

  6. figure tbd. Since there are 4 processors in each of 4 dimensions, we think of the hypercube as a 4x4x4x4 array, so each node may be thought of as having an address comprised of 4 base-4 digits. For example, if the "near" corner has an address of 0000, the "far" corner would have an address of 3333. Two adjacent nodes near the "middle" might have addresses of 1212 and 1112. The total number of nodes would be 4*4*4*4=256.

  7. First, note that a 9 x 9 mesh requires an 8-dimensional binary hypercube in order to implement the approach used on page 17 in the text. To represent the row/column addresses of the mesh, we require the numbers 0 through 8, which in binary representation require 4 bits. So, four bits for the row address and four bits for the column address add up to an 8-bit address.

    Second, note that the row addresses and column addresses are not your standard binary numbering; the numbers use something called a Gray encoding which has the property that adjacent numbers differ by only one bit.

    Row addresses (bottom to top in the mesh):

    8 1100
    7 0100
    6 0101
    5 0111
    4 0110
    3 0010
    2 0011
    1 0001
    0 0000

    Column addresses (left to right):

    0000  0001  0011  0010  0110  0111  0101  0100  1100
       0     1     2     3     4     5     6     7     8
    

    You may find it interesting to note that this embedded mesh is not a torus: the first and last codes in the 9-element sequence (0000 and 1100) differ in two bits, so there is no wraparound link. This technique of embedding a mesh in a hypercube only yields a torus when each dimension is a power of two, i.e., for toruses of size 2x2, 4x4, 4x2, 8x8, 8x4, 8x2, 16x16, 16x8, 16x4, 16x2, etc. Of course, it will work fine for meshes of any size, provided the sum of the number of bits required for the row and column addresses is less than or equal to the dimension of the hypercube.
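
    The reflected Gray code used above has a well-known closed form, g(i) = i XOR (i >> 1), which reproduces the table. A small Python check:

```python
def gray(i):
    """i-th reflected binary Gray code; adjacent codes differ in one bit."""
    return i ^ (i >> 1)

for i in range(9):
    print(i, format(gray(i), "04b"))
# 0 0000, 1 0001, 2 0011, 3 0010, 4 0110, 5 0111, 6 0101, 7 0100, 8 1100

# Adjacent codes in the sequence differ in exactly one bit...
print(all(bin(gray(i) ^ gray(i + 1)).count("1") == 1 for i in range(8)))
# ...but the 9-element sequence does not wrap around (no torus):
print(bin(gray(8) ^ gray(0)).count("1"))  # 2 bits differ
```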

  8. Actually, this is a lot of questions. Here are a few things to note.

    A rough estimate of the average distance in a mesh is r/2 + c/2, where r is the number of rows, and c is the number of columns. Actually, the average is somewhat smaller, since nodes near the middle of the mesh are closer on average to the other nodes than are nodes near the edge: along a single dimension of n nodes, the average distance between two randomly chosen nodes is (n^2 - 1)/(3n), or roughly n/3, so the mesh average is closer to r/3 + c/3.

    In a torus, the average distance is r/4 + c/4, where r is the number of rows, and c is the number of columns.

    The average distance in a hypercube is d/2, where d is the number of dimensions. On average, we expect two nodes to differ in only 1/2 of the bits in their respective addresses.
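
    These averages can be checked by brute force over all ordered pairs of nodes. In the Python sketch below (the helper names are my own), the mesh average comes out near r/3 + c/3, somewhat smaller than r/2 + c/2, and the torus average near r/4 + c/4:

```python
from itertools import product

def avg_mesh(r, c):
    """Average Manhattan distance over all ordered node pairs of an r x c mesh."""
    nodes = list(product(range(r), range(c)))
    dists = [abs(a0 - b0) + abs(a1 - b1)
             for (a0, a1) in nodes for (b0, b1) in nodes]
    return sum(dists) / len(dists)

def avg_torus(r, c):
    """Same, but each dimension wraps around."""
    nodes = list(product(range(r), range(c)))
    def wrap(d, n):
        return min(d % n, -d % n)
    dists = [wrap(a0 - b0, r) + wrap(a1 - b1, c)
             for (a0, a1) in nodes for (b0, b1) in nodes]
    return sum(dists) / len(dists)

r = c = 16
print(avg_mesh(r, c), (r + c) / 3)     # 10.625 vs 10.67: close to (r+c)/3
print(avg_torus(r, c), r / 4 + c / 4)  # exactly r/4 + c/4 for even sizes
```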

  9. Let us assume that for this problem, the algorithm in question is a sorting algorithm in which each node holds a single data point and that upon completion of the algorithm, the data points have migrated amongst the nodes such that they appear in left-to-right order in the tree. In the following, let h be the height of the tree.

    The communication lower bound for a complete binary tree, based on distance, depends on the height of the tree. In the worst case, the tree is exactly out of order: the bottom level on the right needs to trade places with the bottom level on the left, the second level on the right needs to trade places with the second level on the left, and so on. Each of the 2^k nodes at depth k must travel 2k hops to reach its mirror position, so the number of hops required to complete the transfer is 2^h * 2h + 2^(h-1) * 2(h-1) + ... + 2^1 * 2.

    The communication lower bound for a complete binary tree, based on the bisection width, depends on the height of the tree. In the worst case, all of the data in the nodes in the left half of the tree need to move to the right half, and all the data in the nodes on the right half need to move to the left. Since the bisection width of a tree is 1, and the number of nodes on each side of the root is 2^h - 1, the communication lower bound is 2*(2^h - 1) hops. If on average only half of the data need to migrate from one half of the tree to the other, the communication lower bound is 2^h - 1 hops.
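
    One way to tally these worst cases in Python, assuming (as sketched above) 2^k nodes at depth k, each traveling 2k hops to its mirror position:

```python
def distance_bound(h):
    """Sum over depths k = 1..h of (2^k nodes) * (2k hops each)."""
    return sum(2**k * 2 * k for k in range(1, h + 1))

def bisection_bound(h):
    """2^h - 1 items on each side must cross the single bisection link,
    in both directions."""
    return 2 * (2**h - 1)

for h in range(1, 5):
    print(h, distance_bound(h), bisection_bound(h))
```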

  10. If the numbers start out unsorted in a mesh that is sqrt(n) x sqrt(n) in size, then a typical number will have to move half-way across the mesh in each of two dimensions during the sorting process, that is, 2 * sqrt(n) / 2 = sqrt(n) hops.

    In terms of bisection width, the mesh has a bisection width of sqrt(n). If half the data needs to cross the midline of the mesh during the sorting process, the communication lower bound may be calculated as n/2 / sqrt(n) = sqrt(n)/2 steps.

  11. Suppose, for the sake of argument, that the program takes 100 seconds to run on a single machine. Then one machine will execute the 10% of the code that is sequential in 10 seconds, and all ten machines together will execute the 90% that is parallelizable in 9 seconds. During the execution of the program, then, the system delivers 200 MFLOPS for ten seconds plus 2000 MFLOPS for nine seconds, for a total of 20,000 million floating point operations. If we then divide this by the total execution time of 19 seconds, we get an average performance of about 1053 MFLOPS, or slightly better than 50% of the 2000 MFLOPS capacity.
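
    The same arithmetic in Python; the 100-second figure is just a convenient scale, and any value gives the same average rate:

```python
serial_frac, parallel_frac = 0.10, 0.90
machines, per_machine_mflops = 10, 200

t1 = 100.0                                   # single-machine run time (arbitrary)
t_serial = serial_frac * t1                  # 10 s on one machine
t_parallel = parallel_frac * t1 / machines   # 9 s on ten machines
work = (per_machine_mflops * t_serial
        + machines * per_machine_mflops * t_parallel)  # 20,000 MFLOP
avg = work / (t_serial + t_parallel)
print(avg)  # ~1052.6 MFLOPS, about 53% of the 2000 MFLOPS peak
```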

  12. Hmm. The first thing to notice is that 2*n*log(n^0.5) = 2*n*0.5*log(n) = n*log(n). The second is that the n in the number of processors is different from the n in the formula for the number of steps required by the sequential algorithm. For this reason, we will use p instead of n to denote the number of processors.

    According to the definition, a cost-optimal parallel algorithm is one in which the cost on a multiprocessor is proportional to the cost on a sequential processor. In other words, Cp = k*Cs, where k is some constant, Cp is the parallel algorithm cost, and Cs is the sequential algorithm cost.

    According to the definition of speedup, we define S(p^2) to be the speedup using p^2 processors:

    S(p^2) = n*log(n) / x, where x is the number of steps with p^2 processors.

    Note that Cp = Ts * p^2 / S(p^2). Then Cp = Ts * p^2 / (n*log(n)/x).

    Note that Cs = Ts * 1 / 1 = Ts.

    Using the proportionality equation expressed above, we get: Ts * p^2 * x / (n*log(n)) = k * Ts. Solving for x, we get x = k * n * log(n) / p^2.

    The parallel algorithm, then, will be cost-optimal if it requires k*n*log(n)/p^2 or fewer steps.

    (???)
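
    As a numeric sanity check of the bound derived above, the Python sketch below (with an arbitrarily chosen constant k_max = 4, my own assumption) tests whether steps * processors stays within k * n * log(n):

```python
import math

def is_cost_optimal(steps, n, procs, k_max=4):
    """Cost-optimal if steps * procs <= k_max * n * log2(n)."""
    return steps * procs <= k_max * n * math.log2(n)

n, p = 1024, 8
x = n * math.log2(n) / p**2          # the bound derived above, with k = 1
print(is_cost_optimal(x, n, p**2))   # True: cost equals 1 * n log n
print(is_cost_optimal(n, n, p**2))   # False: 64n exceeds 4 * n log2(n) here
```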

Copyright © 1999, Ray Ontko (rayo@ontko.com).
If you're curious about why I copyright, see Peter Suber's Why Copyright Web Documents?.