# **CS61C Summer 2013 Final Exam** | Your Name: | | | | | | | |---------------------|------------|-------|--------|-------|---------|-------| | Your TA (Circle): | Albert | Kevin | Justin | Shaun | Jeffrey | Sagar | | Name of person to y | our LEFT: | _ | | | | | | Name of person to y | our RIGHT: | _ | | | | | This exam is worth 90 points and will count for 26% of your course grade. The exam contains 7 questions on 14 numbered pages. Put all answers in the spaces provided. Some pages are intentionally left blank for scratch space. **Question 0:** You will receive 1 point for properly filling out this page as well your login on every page of the exam. | Question | Points (Minutes) | Score | |----------|------------------|-------| | 0 | 1 (0) | | | 1 | 23 (48) | | | 2 | 9 (20) | | | 3 | 15 (30) | | | 4 | 12 (20) | | | 5 | 17 (36) | | | 6 | 13 (26) | | | Total | 90 (180) | | All the work is my own. I had no prior knowledge of the exam contents nor will I share the contents with others in CS61C who have not taken it yet. | Signature: | | | |------------|--|--| |------------|--|--| ## **Question 1:** Potpourri – Hard to Spell, Nice to Smell (23 points, 48 minutes) ### a) MOESI on Through This Problem Our computer has two cores, each with a 32B direct-mapped cache with 16B blocks using write-back and write-allocate policies. The MOESI protocol is implemented with invalidation of other caches on write, and the caches are empty at the beginning of the program. arr is a block-aligned array of ints. Fill out the status of the blocks in each cache. Indicate any memory locations that are not up-to-date as well. The first two rows have been done for you, using the abbreviations "C" for cache and "B" for block. | After Operation | Core 1 \$ State | Core 2 \$ State | Out-of-date Mem Locations | |----------------------|-----------------|-----------------|---------------------------| | C1: read from arr[0] | B0: Exclusive | B?: Invalid | nono | | CI. read from arr[0] | B?: Invalid | B?: Invalid | none | | C2: write to a [ [ ] | B0: Exclusive | B1: Modified | arr[5] | | C2: write to arr[5] | B?: Invalid | B?: Invalid | arr[3] | | C2: write to arr[2] | | | | | C1: read from arr[3] | | | | ### b) Answer the following questions based on the FSM circuit shown below: i. Draw the FSM state diagram (assume the initial state shown) in the space below: | ii. | Let $t_{ m setup}=t_{ m hold}=50~{ m ps}$ and $t_{ m XOR}=20~{ m ps}$ . If we run this FSM on a 4-GHz processor and the | |-----|-------------------------------------------------------------------------------------------------------------------------| | | input arrives $t_{ m hold}$ after the clock triggers, what are the maximum and minimum $t_{ m clk-to-q}$ for | | | the register to ensure proper functionality? | | N 4: | N / a | |------|-------| | Min: | Max: | | 2) | Implement the following truth | table functions | using only NOR | gates (fewer is better) | |----|-------------------------------|-----------------|----------------|-------------------------| |----|-------------------------------|-----------------|----------------|-------------------------| | F1: | = | F3 | F2 | F1 | В | Α | | |-----|---|----|----|----|---|---|--| | F2: | _ | 1 | 0 | 1 | 0 | 0 | | | | | 0 | 1 | 1 | 1 | 0 | | | F3: | | 1 | 1 | 0 | 0 | 1 | | | | | 1 | 1 | 0 | 1 | 1 | | i. What's the correct data word given the following SEC Hamming code: 0101000? \_\_\_\_ You are analyzing a new error detection/correction scheme for 4 bits of data $(d_1d_2d_3d_4)$ , where you store *two* parity bits $(d_1d_2d_3d_4p_1p_2)$ with $p_1 = xor(d_1,d_2,d_3,d_4)$ and $p_2 = xor(d_1,d_2,d_3,d_4,p_1)$ . For example, if the data bits were 1011, the code word would be 101110. | ii | What fraction | of all | code | words is | Valid? | |-----|---------------|--------|------|----------|--------| | 11. | Wilat Hattion | OI all | coue | worus is | vallu: | | iii. | What is the minimum Hamming distance between valid code words? | | |------|----------------------------------------------------------------|--| | | | | | • | | | |-----|-------------|---------|--------|-----------|--------|---------| | IV. | What is the | maxımum | number | of errors | we can | detect? | | | | | <br>_ | |--|--|--|-------| #### e) There's Loot to be Had... Let's RAID It! You are using RAID with 8 equally-sized disks and block-striping in 32-bit chunks. | • | \ A / l= : = l= | | 11/-1 | | not use? | | |---|-----------------|-------|----------|----------|-----------|--| | | wnich | RAIII | IDVALICE | can voll | אחד ווכבי | | | vinen in the level(3) can you not use: | | |----------------------------------------|--| | | | | ii. | With which viable RAID level(s) can you to store the <i>most</i> data? | |-----|------------------------------------------------------------------------| | | | | III. | With which viable R | (AID level(s) can vou | u store the <i>least</i> data? | |------|---------------------|-----------------------|--------------------------------| | iv. | If using RAID 5, how many disk reads and writes are there for writing 8 bytes of data within the | |-----|--------------------------------------------------------------------------------------------------| | | same stripe? | | Rd: | Wr: | | |-----|-----|--| | | | | ### f) Solve for the *maximum controller overhead* to meet the following specifications: We need disk latency under 18 ms while reading 800 B of data. The hard drive spins at 6000 rev/min with a seek time of 2.5 ms and transfer rate of 80 KB/s (SI prefix). Don't forget units! | <br> | <br> | |------|------| ## PAGE INTENTIONALLY LEFT BLANK ## Question 2: MIPStifying (9 points, 20 minutes) Answer the questions below about the following MIPS function. Answer each part separately, assuming each time that mystery() has not been called yet. ``` mystery: 1 andi $a0, $a0, 3 $t0, $0, 1 2 ori $t0, $t0, 6 3 sll 4 Lbl1: beq $a0, $0, Lb12 $t0, $t0, 5 5 sll addi $a0, $a0, -1 6 7 j Lbl1 $s0, Lb13 8 Lbl2: la 8 $s1, 0($s0) lw 9 add $s1, $s1, $t0 $s1, 0($s0) 10 sw 11 Lbl3: add $v0, $0, $0 12 $ra jr ``` - a) Which instruction (number) gets modified in the above function? - b) Write an equivalent *arithmetic* (not logical) C expression to instruction 1. a0 = \_\_\_\_\_ - c) Which instruction field gets modified when mystery is called with \$a0 = 3? - d) How many times can ${\tt mystery(2)}$ be called before the behavior of ${\tt mystery()}$ changes? \_\_\_\_ e) How many times can mystery(0) be called before the behavior of mystery() changes? \_\_\_\_\_ f) A program calls mystery with the following sequence of arguments: 0, 1, 2, 3, 4, 5. What MIPS instruction gets stored in memory? \_\_\_\_\_ ## **Question 3:** To Be Without Parallel... Means You're Slow (15 points, 30 minutes) #### a) SIMD and OpenMP } Property; Four CEOs are playing the board game Monopoly, where the object of the game is to own properties and gain profits from them. You are in charge of keeping track of the CEOs' finances and wish to parallelize this task. All memory accesses are valid. Assume sizeof(int) = 4. Every round we must give each CEO the amount of money that he has earned from each of his properties. To do this, we add the property's profit into the CEO's balance and then set the profit to zero for the next round using the following function: - i. Perform a 2-fold unrolling of the loop by filling in the blank spaces above. You may edit the looping conditions if you need to (cross out and write in changes). Assume NUMPROPS is a multiple of 2. - ii. \_mm\_loadu\_si128 loads 128-bits of data into a vector. If we wish to use **three** of these instructions to load from props[] every iteration of our loop, how many fold must we unroll the original loop? Answer n if performing an n-fold unrolling. - iii. You slap a #pragma omp parallel for statement on the original for loop. Circle the effect on execution below and provide a one phrase/sentence explanation. No credit without explanation. | Correct; | Correct; | Almost always | Coafoult | |----------|----------|---------------|----------| | Faster | Slower | Incorrect | Segfault | #### **Explanation:** iv. Now you additionally add a #pragma omp critical statement around all writes to balance[]. What is the new effect on execution? No credit without explanation. | Correct; | Correct; | Almost always | Segfault | |----------|----------|---------------|----------| | Faster | Slower | Incorrect | Segrauit | #### **Explanation:** #### b) MapReduce Suppose that given a large social network dataset, you wish to generate recommendations for yourself by looking at the "Likes" of your friends. You wish to exclude your own likes so that the recommendations are useful. Unfortunately, your input dataset consists of everyone on the social network, not just your friends. Assume that you have access to a global person\_id for yourself, YOUR ID. Conceptually, you can think of the map as filtering the overall dataset to just you and your friends, while the reduce will filter out common "Likes" between you and your friends. You have access to the following special methods: Feel free to access members of lists and tuples using array syntax. - i. The input to the map function is (key = person\_id, value = (friend\_list, likes)). How many times will a friendship between two people show up in the input data? - ii. Fill in the MapReduce functions below using Java-like pseudocode: Your output will be of the form: key = (friendpair<sub>i</sub>), value = recommendations ## PAGE INTENTIONALLY LEFT BLANK ## **Question 4:** Off the Beaten Datapath (12 points, 20 minutes) Add the instruction stones (store one smaller) to the single cycle datapath, which stores a 1 at the address of the smaller value between two specified registers. **Ignore pipelining.** a) Write out the assembly syntax and RTL for this instruction. Don't forget about the PC! Syntax: RTL: b) Change as *little as possible* in the datapath above (draw your changes on the figure) to enable stones. List all your changes below (be concise!). Your modification may use MUXes (define what select bits refer to what inputs), wires, constants, and up to one new control signal, but <u>nothing else</u>. You may not need all of the provided boxes. You cannot modify the ALU (there is no min operation). | (i) | | |-------|--| | (ii) | | | (iii) | | | (iv) | | c) We now want to set all the control lines appropriately. List what each signal should be, using 0, 1, X, or an intuitive name. Include any new control signals you added. | RegDst | RegWr | nPC_sel | ExtOp | ALUSrc | ALUctr | MemWr | MemtoReg | | |--------|-------|---------|-------|--------|--------|-------|----------|--| | | | | | | | | | | ## **Question 5:** Tread Carefully, Thread Carefully (17 points, 36 minutes) Examine the function prototype and MIPS implementation below. 6 7 j exit: jr loop \$ra ``` // sets *value = (*value) * 2^pow using shifting instructions int multMemPow2(int *value, unsigned int pow); multMemPow2: $v0, 0($a0) # load value 1 lw 2 loop: beq $a1, $0, exit # exit condition 3 sll $v0, $v0, 1 # multiply by 2 4 addi $a1, $a1, -1 # decrement counter $v0, 0($a0) 5 # store result ``` We are using a 5-stage MIPS pipelined datapath with separate I\$ and D\$ that can read and write to registers in a single cycle. Assume no other optimizations (no forwarding, no branch prediction, etc.). The default behavior is to stall when necessary. Branch checking is done during the Execute stage. For parts (a)-(c), let pow=1. When we ask for clock cycles to execute multMemPow2, we mean from the instruction fetch of lw up to and including the write back of jr. | a) | How many instructions are executed in multMemPow2? | |----|---------------------------------------------------------------------------------------------------------------------------------| | | How many clock cycles does it take to execute multMemPow2? e will assign partial credit based on table on opposite page) | | c) | Consider the following optimizations <i>separately</i> . How many FEWER cycles are taken for the addition of each optimization? | | | i. Forwarding | | | ii. Branch Prediction of always taken | d) Suppose we introduce only jump delay slots and want to move a loop instruction into the new jump delay slot after instruction 6 (j loop). For the following candidate instructions, answer **C** for "changes behavior," **S** for "causes additional stall(s)," or **G** for "good choice": | Instr 3: | Instr 4: | Instr 5: | |----------|----------|-----------| | msu s. | 11130 4. | וווטנו ט. | | Login: | cs61c- | |--------|--------| | | | For the following questions, assume we are executing the multMemPow2 simultaneously on TWO processors in the same shared-memory machine with \*value=1 and pow=2. e) List ALL possible values of \*value after execution: Stanley Stanfurd thinks he can fix the data race problem by replacing the 1w with 11 and sw with sc. Assume sc compares against the value from the last 11 call. f) List ALL possible values of \*value after execution of this new version: You may find the following space useful for scratch work; we will check for assigning partial credit: | | <br> | <br> | | <br> | | <br> | <br> | <br> | | |--|------|------|---|------|--|------|------|------|--| | | | | | | | | | | | | | | | _ | | | _ | | | | ## PAGE INTENTIONALLY LEFT BLANK ## **Question 6:** It's Virtual Insanity! (13 points, 26 minutes) Our 32-bit uniprocessor machine has 1 GiB of RAM with 1 KiB pages, a fully-associative TLB that holds 8 entries and uses LRU, and a direct-mapped, write-back *data* cache with 32 B blocks and 32 slots. The *instruction* cache is 256 B and fully-associative with 32 B blocks. | a) | What is the maximum number of valid entries in the page table for a single process? | Answer in IEC. | |----|-------------------------------------------------------------------------------------|----------------| | | | | b) What is the TLB Reach of our system? Examine the following function. Assume the entire program's code takes the entirety of one page and sizeof(int)=sizeof(int \*)=4. ``` void addConst(int *ptr, char c) { for(int i = 0; 1; i+=4) ptr[i] += c; } ``` c) If ptr[] lives in disk and ptr[0] is page-aligned, what is the TLB hit rate for data accesses only? ``` d) If ptr[] lives in disk and ptr[0] is page-aligned, what fraction of D$ misses are also TLB misses? ``` ``` <del>------</del> ``` e) If ptr[0] is in physical memory, what is the *minimum* value of i that could cause a **page fault**? ``` <del>-----</del> ``` f) If ptr[0] is in physical memory, what is the *minimum* value of i that could cause a **protection** fault? ``` g) If ptr[0] is in physical memory, what is the maximum value of i that causes the first cache miss in the loop? ``` h) If ptr[0] is in physical memory, what is the maximum value of i that causes the first **TLB miss** in the loop? You may leave your answer as a product. \_\_\_\_\_ ## **BACK OF EXAM**