Login: cs61c-\_\_\_\_ #### Spring 2012 CS61C Final | Your Name:_ | SC | )LU | 101 | | | | | |---------------|-------|-------|------|------|------|-----|--| | Your TA: | Rimas | Scott | Alan | Eric | Paul | lan | | | Login: cs61c- | - | ¥ | | | | | | This exam is worth 110 points, or about 20% of your total course grade. The exam contains 8 questions. This booklet contains 12 numbered pages including the cover page. Put all answers on these pages, please; don't hand in stray pieces of paper. You will receive 5 points for properly filling out your name, TA, and login (login must be filled in properly on every page of the exam). This is the exam's 0<sup>th</sup> question. | Question | MIN | MAR | oints(Minutes) | AVG | STacore | |----------|-----|-----|----------------|------|---------| | 0 | 4 | 5 | 5(0) | 5 | 0, 1 | | 1 | 2 | 15 | 15(10) | 7.7 | 3.0 | | 2 | 0 | 10 | 10(10) | 6.0 | 2,3 | | 3 | 0 | 9 | 10(9) | 5.3 | 1.8 | | 4 | 0 | 14 | 14(11) | 8.2 | 4,0 | | 5 | 0 | 16 | 16(15) | 9.5 | 4.1 | | 6 | 0 | 114 | 14(11) | 7.6 | 3.7 | | 7 | 0 | 10 | 10(9) | 6.6 | 3,2 | | 8 | 0 | 16 | 16(15) | 7.7 | 3.5 | | Total | 11 | 100 | 110(90) | 63.5 | 16.6 | ## Potpourri: Mostly True or False. (ALAN C) The different parts of this question are independent. a. Simplify the following boolean expression. $$\frac{(\overline{a}+b)(\overline{b}+c)(\overline{c}+d)+(\overline{a}+d)}{((a\Rightarrow b)(b\Rightarrow c)(c\Rightarrow d))} \Rightarrow (a\Rightarrow d)$$ 11 1 Full evedit was also given to and, as many students misread the question as (a+b)(b+c)(e+d)+a+d b. Suppose we feed in digits of a number to a finite state machine most significant bit first (e.g. it would see 1 first if we input 1000). Complete the below diagram so that it outputs 1 exactly when the number is divisible by 2, but not by 4. Assume that the machine's starting state is such that it has seen more than four 0's. 1 pt. per start $$\rightarrow$$ 10 0/0 1/0 0/0 01 01 I point was taken for using an extra c. How many unique gates with n input bits and m output bits are there? lpt; mix up m, n d. Circle all modifications to the IEEE754 float standard that would change the proportion of floats with values between -1 and 1. lpt, for Circling each (- changing the exponent bias value) - increasing the number of exponent bits (- removing the implicit leading one) - getting rid of denorm values | Login: cs61c | |--------------| |--------------| e. A program tries to load a word at address X that causes a TLB miss but not a page fault or protection violations. <u>True or False</u> A TLB miss means that the page table does not contain a valid mapping for virtual page corresponding to the address X <u>True or False.</u> There is no need to look up the page table because there is no page fault True or False: The word that the program is trying to load is present in physical memory. 1 ines ## Don't MIPS the Point! ( PAUL ) ``` typedef struct { int val; struct node* next; } node: /* Removes the first node after cur with a value of x and return a pointer to it. You may assume cur is not null.*/ node* removeNext(node *cur, int x) { node* next = cur->next; if (next) { //have a next node? if (\text{next->val} == x) { //remove the next if its val is x cur->next = next->next; } else { //otherwise, keep searching for node to remove return removeNext(next, x); } return next; a) Use the above definition of a linked list node. Fill in the blanks below so that the assembly code does the equivalent of the C removeNext. Assume no delayed branches. removeNext: addiu $sp $sp -4 1 sw $ra 0($sp) IW 200 4(8a beq $v0 $0 done lw $t1 0($v0) bre Dal Stireuse 7 8 done: 10 lw $ra 0($sp) all 3) addiu $sp $sp 4 11 12 #return 13 recurse: 14 move $a0 $v0 #set arguments for recursive call 15 jal removeNext #recurse that removeNext all 5, cept the first node in 16 b) Say we change line 15's jal to j. Which lines, if any, can we remove so that removeNext still works properly without editing any other line? List the line numbers. c)Say we want a function remove which does the same as removeNext, except the first node is also a target for removal, and the return type is void. Fill in the prototype below so that it could work properly. ``` #### On Multiplying Polynomials (OMP) (NAVE) | void multiply1(double *A, double *B, | void multiply2(double *A, double *B, | | |--------------------------------------|--------------------------------------|--| | double *C, int n) { | double *C, int n) { | | | // Outer Loop | // Outer Loop | | | for (int $i=0$ ; $i < n$ ; $i++$ ) | for (int $i=0$ ; $i < 2*n$ ; $i++$ ) | | | // Inner Loop | // Inner Loop | | | for (int $j=0$ ; $j < n$ ; $j++$ ) | for (int $j = MAX(i - n + 1, 0)$ ; | | | C[i+j] += A[i] * B[j]; | j < MIN(i + 1, n); j++) | | | } | C[i] += A[j] * B[i-j]; | | a) Suppose OpenMP pragmas are placed on the outer and inner loops according to the table below. Evaluate if multiply1 and multiply2 are guaranteed to return the correct answer with the given pragma placements. Circle your answer in each of the rightmost six cells. A, B, and C point to non-overlapping memory regions. PERCORRECT STU2 UA | Outer Loop | Inner Loop | multiply1 | multiply2 | |--------------------------|--------------------------|---------------------|---------------------| | <leave empty=""></leave> | parallel for | correct incorrect | (incorrect) | | parallel for | <leave empty=""></leave> | correct (incorrect) | correct Incorrect | | parallel for | parallel for | correct (incorrect) | correct (incorrect) | - The incorrect i - c) Circle all statements that are true for both multiply1 and multiply2 unless otherwise noted. - Threads can see different "i" values even if the parallel for is only on the inner loop. - A #pragma omp barrier on the innermost statement will fix all data races from part (a). - Ignoring correctness, parallelizing both inner & outer loops will give a speedup for large - The synchronization overhead of parallelizing an inner loop with a given number of threads is amortized for large n. 2 POINTS FOR CORRECT ANSWER -1 POINT PER INCORRECT AUSWER MINIMUM SCORE IS O 14 points Total # Additional Logic (IAN) For this problem we will look at a simple carry bit adder. Each adder block is simply a one bit adder. This circuit will add a series of pairs of 4 bit inputs and output 5 bit outputs (instead of having an overflow bit). The inputs will change once each clock cycle. 2 points a) We want to pick the maximum clock frequency such that we can guarantee this circuit works correctly. The adder block has a delay of time a. The registers have a clock-to-q delay of time q, as well as a setup time s, and a hold time s where s and a hold time s where s and Max frequency = $$1/[4\alpha + q + 5]$$ Hz The setup time s, and a noid time n where $\frac{1}{16}$ All times given are in setup time s, and a noid time n where $\frac{1}{16}$ All times given are in setup time s, and a noid time n where $\frac{1}{16}$ All times given are in setup time s, and a noid time n where $\frac{1}{16}$ All times given are in setup time s, and a noid time n where $\frac{1}{16}$ All times given are in setup time s, and a noid time n where $\frac{1}{16}$ All times given are in setup time s, and a noid time n where $\frac{1}{16}$ All times given are in setup time s, and a noid time n where $\frac{1}{16}$ All times given are in setup time s, and a noid time n where $\frac{1}{16}$ All times given are in setup time s, and a noid time n where $\frac{1}{16}$ All times given are in setup time s, and a noid time n where $\frac{1}{16}$ All times given are in setup time s, and a noid time n where $\frac{1}{16}$ All times given are in setup time s, and a noid time n where $\frac{1}{16}$ All times given are in setup time s, and a noid time n where $\frac{1}{16}$ All times given are in setup time s, and a noid time n where $\frac{1}{16}$ All times given are in setup time s, and a noid time n where $\frac{1}{16}$ All times given are in setup time s, and a noid time n where $\frac{1}{16}$ All times given are in setup time s, and a noid time n where $\frac{1}{16}$ All times given are in setup time s, and a noid time n where $\frac{1}{16}$ All times given are in setup time s, and a noid time n where $\frac{1}{16}$ All times given are in setup time s, and a noid time n where $\frac{1}{16}$ All times given are in setup time s, and a noid time n where $\frac{1}{16}$ All times given are in setup time s, and a noid time n where $\frac{1}{16}$ All times given are in setup times and $\frac{1}{16}$ All times given are in setup times and $\frac{1}{16}$ All times given are in setup times and $\frac{1}{16}$ All times given are in setup times and $\frac{1}{16}$ All times given are in setup times given are in setup times and $\frac{1}{16}$ All times given are in setup times and $\frac{1}{16}$ All times given are in setup times and $\frac{1}{1$ 6 points b) Alyssa P Hacker realizes that she can improve this circuit by pipelining it. She starts by inserting a register between the carry out of the second adder and the carry in of the third adder (see next page). Your job is to add additional registers to make this circuit correctly perform addition again. You are to either place a register in each empty space (any box will be taken to be a register) or place a wire across it. Login: cs61c-\_\_\_\_ 2 points c) Now what is the max frequency: . I points for correct answer . I point for dividing the by something other than 2 ] Hz. I point for dividing the whole expression not Max frequency = 1/[ 2 α+ 2+5 just 4x by 2 d) In terms of time, the latency of a single addition has increased / decreased / stayed the same. Circle one. All or nothing 2 points 2 points In terms of additions per time, the throughput of this circuit has increased / decreased / stayed the same. Circle one. All or nothing # Hazardous Question (DAUE) Consider each of the following sets of instructions. Decide the minimum number of stalls needed in each case with and without forwarding. Assume the five stage MIPS pipeline discussed in class and that equality for branches are checked at the Decode stage. Assume delayed branching. Number of stalls required? | _ | Without forwarding | With forwarding | |------------------------------------------------------------------------------------|--------------------|-----------------| | addiu \$t1 \$t0 5<br>addu \$t4 \$t3 \$t1<br>addu \$t0 \$t3 \$t0 | 2 | | | lw \$t0 4(\$s0)<br>addu \$t1 \$t2 \$t3<br>beq \$t0 \$t1 DONE<br>addu \$t5 \$t5 \$5 | 2 | | | lw \$t1 0(\$s0)<br>sw \$t1 0(\$s1) | 2 | 0 | | addu \$t0 \$s0 \$t1<br>lw \$t1 8(\$t0) | 2 | 0 | | 0 | 2 POINTS POR CORRECT ANSWER | | |---|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----| | 0 | 4 POINTS IF ALL 4"WITHOUT 4 POINTS IF ALL 4"WITHOUT 5 ORWARDING" CHOICES WERE 3 STALLS FORWARDING CHOICES WERE 3 STALLS CONFUSION ABOUT WRITEBACK STAGE CONFUSION ABOUT WRITEBACK CYCE) 3 DE CODE IN SAME CLOCK CYCE) | (1 | | | C CONFUSION TO SAME CLOCK CTCO S DE CODE IN SAME CLOCK CTCO | | | 9 | EDRUMEDIAGE HOTEES WHEE | | Counting Cache (SCOTT) (6) A) A word-addressed cache has 8 words of capacity, 2 word blocks, and uses a LRU replacement policy. Assume it started empty and then the warming sequence was executed. For each cache organization given, determine whether each access will cause a hit or a miss. Indicate the result with a "H" or "M" in each box. Warming sequence: 1, 4, 7, 3 | | Query<br>Sequence | 5 | 4 | 10 | 2 | Ò | |-----|--------------------------|---|---|-----|----|---| | (2) | Direct<br>Mapped | Н | Н | · M | М | Н | | (2) | 2-way Set<br>Associative | H | H | Μ | H | Н | | (2) | Fully<br>Associative | Н | Н | M | 14 | M | - B) How many tag (T), index (I), and offset (O) bits does a byte addressed cache have that is 8MB, 4-way set associative, 32B block, and has a 48b address? 2/2 all correct T:1:0= 27 : 16 : 5 1/2 two wrong but adds to 48 0/2 ZZ wrong - (6) C)For each of the following modifications to a set associative cache, indicate how it will affect the different miss rates. Notice that some parameters are held constant. Use (+) for increase, (=) for stays the same, and (-) for decrease. | | Miss Type | Compulsory | Conflict | Capacity | |-----|-------------------------------------------------------------|------------|--------------------|----------| | (2) | Double associativity (capacity and block size constant) | = | - | = | | (2) | Halve block size<br>(associativity & # of<br>sets constant) | + | =<br>- (alternate) | + | | (2) | Double # of sets<br>(block size & capacity<br>constant) | = | + | | # Additional Instruction (RIMAS) We would like to add an instruction to the above unpipelined CPU. The new instruction is addm rd, rs, rt. This instruction is like a normal add but instead of writing the result of adding the contents of the rs and rt registers to a register we write it to a location in memory specified by rd (Note, there is no offset, it is simply written to the address held in rd). $M[rd] \leftarrow R[rs] + R[rt]$ a) Change as *little as possible* in the datapath above (draw your changes right in the figure) to enable addm and list all changes below. Your modification may use muxes, wires, constants, additional read or write ports on the register file, and additional ALU's (again, use as little as you need!). You may not need all boxes, and do not need to mention the addition of the control signal addm, which is 1 iff the executing instruction is addm. | (i) MVX | adr, bus C | - | |---------|---------------------|---| | | data in ALV out | | | | reg out port for Rd | | | (iv) | | | b) We now want to set all the control lines appropriately. List what each signal should be: an intuitive name or $\{0, 1, x - \text{don't care}\}$ . Include a new control signal **addm.** | RegDst | RegWr | nPC_sel | Ext0p | ALUSrc | ALUctr | MemWr | MemtoReg | addm | 1 5 | 0+5 | |-----------|-------|---------|-------|--------|--------|-------|----------|------|-----|-----| | 11/1/11/1 | 0 | PC+4 | × | 0 | add | | <b>×</b> | 1 | ] | | 1 pt deducted for every ~ 2 wrogs ### Bit by Bit (ERIC) In a hamming code, with log(n) extra bits, you can detect and correct any single bit error in a n-bit string. The code requires computing log(n) parity groups p1, p2, p4, p8, etc. where the parity of each group is zero if the entire codeword is correct. If there is an error you can use the non-zero parities to find out which bit is corrupted. | Bit position | | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | | |--------------|-----|------|----|----|----|-------|----|----|----|----|-----|----|----|----|-----|------|-----|-----|-----|-----|-----|--| | Code bits | | р1 | p2 | d1 | p4 | .d2 | d3 | d4 | р8 | d5 | d6 | d7 | d8 | d9 | d10 | d11 | p16 | d12 | d13 | d14 | d15 | | | Parity Group | р1 | х | | Х | | Х | | Х | | Х | | Х | | Х | | Х | | Х | | Х | | | | | | p2 | | p2 | | x x x | Х | | | х | x x | | | Х | Х | K | | Х | Х | | | | | | | p p4 | | Х | Х | Х | Х | | | | | Х | Х | Х | Х | | | | | Х | | | | | p8 | | | | | | | | Х | Х | Х | Х | Х | Х | Х | Х | | | | | | | | | p16 | | | | | | | | | | | | | | | Jiil | Х | Х | Х | Х | Х | | a) Finish the implementation of fix\_single\_error() below. Implement by putting a simple bitwise expression or value in each blank. You are given a single helper function hamming\_parity() that will compute and return the parity of the hamming groups p1, p2, p4, p8 etc. in a 32-bit word, where $p1 \in \{0,1\}$ is stored in the least significant bit, p2 in the second bit, p4 in the third, etc. The input is a pointer to the 'code' byte array of length $\lceil n/8 \rceil$ , which contains a hamming code n bits long. The least significant bit of our hamming data is the least significant bit of code $\lceil 0 \rceil$ , is also p1. | Login: | cs61c- | |--------|--------| |--------|--------| b) Answer the questions about the following (terribly slow) implementation of hamming\_parity(). Assume that the #defined macros are implemented correctly for you using standard C expressions which should look similar to your solution to part (a). ``` // ^'s the kth bit in vector with the bit b (0 or 1) #define XOR_THIS_BIT(vector, k, b) 3 // returns the nth bit counting from ptr (0 or 1) #define GET BIT(ptr, n) // returns 1 if the ith bit is counted in the jth hamming group, else 0 5 6 #define IS HAMMING PARITY BIT(i, j) 7 8 uint32 t hamming parity(uint8 t *code, size t n) { uint32t parity_vector = 0x00000000; size_t logn = 1 + log2(n); 10 11 for (size t i=0; i < n; i++) { 12 #pragma omp parallel for 13 for (size t = 0; j < logn; j++) 14 if (IS HAMMING PARITY BIT(i, j)) 15 XOR THIS BIT(parity vector, j, GET BIT(code, i)); 16 17 return parity vector; 18 ``` i) If you unrolled the loop on line 13, this would help with \_\_\_\_\_\_ hazards. ii) You can eliminate the if statement on line 14-15 in favor of a bitwise \_\_\_&\_ operator. iii) Propose a single line deletion and single line insertion that will greatly speed up hamming\_parity(). Your solution must not cause any data races or other incorrect behavior. Insert after line 10 this: #pragma cmp parallel for reduction (1: party-vector) Tpts 3 pts for parallel for in right place 1 pt for recognizing you need reduction other answers that produced some speedup received 1-3 pts