## 台灣聯合大學系統101學年度碩士班招生考試命題紙 科目:計算機系統(計算機組織)(300A) 校系所組:中央大學通訊工程學系(乙組) 交通大學生醫工程研究所(乙組) 清華大學電機工程學系 (丙組) 中央大學電機工程學系(電子組) 交通大學電子研究所(乙A組、乙B組) 清華大學通訊工程研究所 交通大學電控工程研究所(乙組) | Class | CPI on M1 | CPI on M2 | C1 usage | C2 usage | Third-party usage | |-------|-----------|-----------|----------|----------|-------------------| | A | 4 | 2 | 30% | 30% | 50% | | В | 6 | 4 | 50% | 20% | 30% | | С | 8 | 3 | 20% | 50% | 20% | The table also contains a summary of how three different compilers use the instruction set. C1 is a compiler produced by the makers of M1, C2 is the compiler produced by the makers of M2, and the other compiler is a third-party product. Assume that each compiler use the same number of instructions for a given program but that the instruction mix is as described in the table. - (1) [4%] Assume that peak performance is defined as the fastest rate that a machine can execute an instruction sequence chosen to maximize that rate. What are the peak performances of M1 and M2 expressed as instructions per seconds? - (2) [6%] Using the third-party compiler on both M1 and M2, what is the CPI for each machine? What is the execution time for each machine if the number of instructions for the given program is 1,000,000 in this case? What clock rate should M2 have to have the same performance as M1 in this case? - (3) [3%] Using C1 on both M1 and M2, how much faster can the makers of M1 claim that M1 is compared with M2? - (4) [3%] Using C2 on both M2 and M1, how much faster can the makers of M2 claim that M2 is compared with M1? - (5) [2%] If you purchase M1, which compiler would you use? (You must give your reason to support your answer. Simply write down the compiler without justification gets no score.) - (6) [2%] If you purchase M2, which compiler would you use? (You must give your reason to support your answer. Simply write down the compiler without justification gets no score.) - 二. [10%] An array of integers is stored in memory beginning in the address given in register \$50. The array length is provided in register \$s1. The MIPS code for finding the maximum value in the array is shown right. - (1) [6%] Write down the missing parts in (a), (b), (c) to complete the code. - (2) [2%] Suppose we place the code starting at location 0x00000A20 in memory. What is the MIPS machine code for the instruction: j L1 \$t0,0(\$s0) # \$t0=array[0] add \$t1,\$zero,\$zero L1: addi \$t1,\$t1,1 (a) beq sll \$t2,\$t1,2 # \$t2=\$t1\*4 add (b) \$t3,0(\$t2) \$t4,\$t0,\$t3 beq \$t4,\$zero,L1 add \$t0,\$t3,\$zero Exit: ... in hexadecimal representation? Note that the value of the 6-bit jump opcode is 2<sub>ten</sub>. (3) [2%] What is the value of the 16-bit address for the instruction: beq \$t4,\$zero,L1 in decimal number? (beq $\rightarrow$ opcode=4, \$t4 $\rightarrow$ 12) ## 台灣聯合大學系統101學年度碩士班招生考試命題紙 共上頁第一2頁 科目: 計算機系統(計算機組織)(300A) 校系所組:中央大學通訊工程學系(乙組) 交通大學生醫工程研究所(乙組) 中央大學電機工程學系(電子組) 清華大學電機工程學系 (丙組) 交通大學電子研究所(乙A組、乙B組) 清華大學通訊工程研究所 交通大學電控工程研究所 (乙組) [10%] Consider two MIPS instructions: xor (exclusive OR) and nor (not OR). The table shown right defines these operations on a bit-by-bit basis. For each of the following new instructions, use only xor and nor to provide the minimal instruction sequence to accomplish the same thing. | • | Α | В | A xor B | A nor B | |---|---|---|---------|---------| | | 0 | 0 | 0 | 1 | | | 0 | 1 | 1 | 0 | | | 1 | 0 | 1 | 0 | | | 1 | 1 | 0 | 0 | | | | | | | (1) [5%] The instruction not takes the one's complement of the source register and places it in the destination register: not \$t0,\$t1 # \$t0 = not(\$t1) (2) [5%] The instruction swap exchanges two registers. After the instruction is executed, the destination register has the original value of the source register, and the source register has the original value of the destination register: swap \$t0,\$t1 # \$t0 ↔ \$t1 四. [12%] Consider the following four code sequences: i. ADDI R1, R1, #4 LW R2, 7(R1) ii. ADD R3, R1, R2 SW R2, 7(R1) iii. SW R2, 7(R1) SW R3, 200(R7) iii. BE7 iv. BEZ R1, L1 SW R1, 7(R1) For each code sequence, identify each type of dependence that exists and describes what data flow or control structure causes that dependence. 五. [8%] There is a deeply-pipelined processor having the following pipeline stages. | | IF1 | IF2 | ID1 | ID2 | EXE1 | EXE2 | EXE3 | MEM1 | MEM2 | WB | | |---|-----|-----|---------------------------------------|-----|------|----------|------|------|----------|----------|--| | • | | 4 | · · · · · · · · · · · · · · · · · · · | | | <u> </u> | | | <u> </u> | <u> </u> | | The branch target address will be calculated out at ID1-stage (3rd stage) and the branch will be resolved at EXE2-stage (6th stage). Please describe the branch penalties (in cycles) for each strategy to handle control hazard in the following table. Assume that the compiler can always find instructions to fill branch delayed slot. | | Unconditional branch | Conditional branch | | | |-------------------|----------------------|--------------------|---------|--| | | | taken | untaken | | | Predict taken | | | | | | Predict not-taken | | | | | | Delayed branch | | <u> </u> | | | 注:背面有試題 科目: 計算機系統(計算機組織)(300A) 校系所組:中央大學通訊工程學系(乙組) 交通大學生醫工程研究所(乙組) 清華大學電機工程學系(丙組) 中央大學電機工程學系(電子組) 交通大學電子研究所(乙A組、乙B組) 清華大學通訊工程研究所 交通大學電控工程研究所 (乙組) 六. [20%] Consider the following pipelined MIPS datapath with forwarding and stall control. The signals Src1 and Src2 are 2-bit control signals, while all the other control lines are 1-bit signals. Note that \$1 in the given code shown right denotes the register with index 1. Assume the pipeline is initially empty, and all register writes occur at the end of the clock cycle. sll \$2,\$1,2 add \$2,\$2,\$3 lw \$4,0(\$2) add \$5,\$4,\$1 sub \$6,\$4,\$1 - (1) [4%] Assume I-Mem, Mux, ALU, Register files, D-Mem have latencies of 50ps, 10ps, 30ps, 20ps, and 50ps, respectively. Except register files, the latency of all other registers (including PC) is 15ps. Each control unit (Control, Hazard detection, Forwarding) has a latency of 20ps. What is the minimum clock cycle time of this processor? Please explain your answer. - (2) [4%] Assume the given code is executed on this processor. Please draw the simple (traditional) multiple-clock-cycle pipeline execution diagram of the first 8 clock cycles, from beginning including all necessary stalls or nops. - (3) [5%] In (2), what are the values of those MUX selection signals (Stall, Src1, Src2, RegDst, MtoR) at the 5th cycles from beginning? - (4) [3%] In (2), what are the values of those MUX selection signals (Stall, Src1, Src2) at the 6th cycles from beginning? - (5) [4%] If we want to implement the forwarding capability for the store operation immediately after a load instruction, what circuit should be added? Please redraw the datapath including the added parts on your answer sheet. 连背面有試題 ## 台灣聯合大學系統101學年度碩士班招生考試命題紙 頁第 4 頁 科目:計算機系統(計算機組織)(300A) 校系所組:中央大學通訊工程學系(乙組) 交通大學生醫工程研究所(乙組) 中央大學電機工程學系(電子組) 清華大學電機工程學系(丙組) 交通大學電子研究所(乙A組、乙B組) 清華大學通訊工程研究所 交通大學電控工程研究所(乙組) +. [20%] Consider two chip multiprocessors (CMP) P1 and P2 and each processor has dual cores. The CMPs contains L1 and L2 caches on a single chip. The CMP P1 has private L2 cache and the P2 has shared L2 cache. If the two CMPs are evaluated with benchmark A, the resulting L1 and L2 miss rates and hit latency as follows. | CMP | L1 miss rate | L2 miss rate | |-----|--------------|--------------| | P1 | 5% | 2% | | P2 | 5% | 0.2% | | I 1 hit latamay | L2 hit l | Main memory | | |-----------------|---------------|--------------|-------------| | L1 hit latency | Private cache | Shared cache | access time | | 0.6ns | 6ns | 8ns | 160ns | If the two CMPs are evaluated with benchmark B, then the L1 and L2 miss rate and hit latency are as follows. | CMP | L1 miss rate | L2 miss rate | |-----|--------------|--------------| | P1 | 5% | 0.1% | | P2 | 5% | 0.04% | | I 1 hit lataner | L2 hit l | Main memory | | |-----------------|---------------|--------------|-------------| | L1 hit latency | Private cache | Shared cache | access time | | 0.6ns | 6ns | 20ns | 160ns | - (1) [3%] What is the average memory access time (AMAT) for P1 for benchmark A? - (2) [2%] Which processor has better AMAT for benchmark A and why? - (3) [3%] Which L2 cache design has better performance for benchmark A? Please use data to support the answer. - (4) [2%] Which benchmark has better performance for the shared L2 cache design? Please use data to support the answer. - (5) [2%] Which processor is better for multithreaded workloads? - (6) [3%] Assume that the shared cache latency increases linearly with the CMP size and the shared cache miss rate changes inverse proportionally with the CMP size. Which processor has better performance in benchmark A if the number of cores doubles for both CMPs, P1 and P2.? - (7) [2%] With doubled number of cores, can we determine how much more off-chip memory bandwidth is needed to maintain the same level of per-core performance for P1 and P2? - (8) [3%] For benchmark A, assume the base CPI =1 and nonblocking cache is used to improve the concurrent misses (i.e. more than one outstanding miss in parallel) from 1 to 2, how much performance improvement can be achieved for P1 and P2?