



PH.D. DISSERTATION

# AN IMPEDANCE-MATCHED BIDIRECTIONAL MULTI-DROP MEMORY INTERFACE


BY

WOO-YEOL SHIN

February 2013

DEPARTMENT OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE COLLEGE OF ENGINEERING SEOUL NATIONAL UNIVERSITY

### AN IMPEDANCE-MATCHED BIDIRECTIONAL MULTI-DROP MEMORY INTERFACE

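Advisor: Suhwan Kim (김수환)

Submitted as a dissertation for the degree of Doctor of Philosophy in Engineering

February 2013

Graduate School of Seoul National University

Department of Electrical Engineering and Computer Science

Woo-Yeol Shin

The doctoral dissertation of Woo-Yeol Shin is hereby approved. February 2013

> Chair: \_\_\_\_\_(seal) Vice Chair: \_\_\_\_\_(seal) Member: \_\_\_\_\_(seal) Member: \_\_\_\_\_(seal)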

## ABSTRACT

## AN IMPEDANCE-MATCHED BIDIRECTIONAL MULTI-DROP MEMORY INTERFACE

Woo-Yeol Shin Department of Electrical Engineering and Computer Science College of Engineering Seoul National University

In this thesis, an impedance-matched bidirectional multi-drop (IMBM) DQ bus is proposed, together with a 4.8Gb/s memory controller transceiver that supports this bus. Reflective ISI is eliminated at each stub of the IMBM DQ bus by resistive unidirectional impedance matching. The IMBM DQ bus generates no reflections during write operations, and the reflections that are generated during read operations do not reach the memory controller. Therefore, the IMBM DQ bus transmits and receives both read and write signals without reflective ISI. In addition, the IMBM DQ bus is more tolerant of stub-length mismatches than a conventional stub-series terminated logic (SSTL) DQ bus. The proposed DQ bus is applicable to memory systems that require both high-speed operation and high capacity, a combination that neither the conventional multi-drop bus nor the point-to-point bus can provide. Because the IMBM DQ bus attenuates the signal voltage in inverse proportion to the number of modules, a new clocking architecture is necessary to support it. In this thesis, a 4.8Gb/s transceiver is proposed that samples data with a phase-shifted phase-locked loop (PLL) clock instead of the received strobe signal. A prototype memory controller transceiver was designed and fabricated in a 0.13µm CMOS process, and it operates from a 1.2V supply. Its effectiveness was demonstrated in various measurement configurations. At 4.8Gb/s, this transceiver, with a 4-slot, 8-drop IMBM DQ bus, has an eye opening of 0.39UI in TX mode and 0.58UI in RX mode at a BER threshold of 10<sup>-9</sup>, whereas a comparable transceiver with a conventional 4-slot, 8-drop SSTL DQ bus has no timing margin under the same test conditions. The transceiver consumes 14.25mW/Gb/s per DQ in TX mode and 13.69mW/Gb/s per DQ in RX mode.

**Keywords**: Impedance matching, memory controller, memory interface, multi-drop DQ bus, Stub-Series Terminated Logic, transceiver

**Student Number**: 2005-21422

# **CONTENTS**

| Section   | Title                                             |
|-----------|---------------------------------------------------|
|           | ABSTRACT                                          |
|           | CONTENTS                                          |
|           | LIST OF FIGURES                                   |
|           | LIST OF TABLES                                    |
| CHAPTER 1 | INTRODUCTION                                      |
| 1.1       | MOTIVATION                                        |
| 1.2       | THESIS ORGANIZATION                               |
| CHAPTER 2 | INTRODUCTION TO MEMORY INTERFACE                  |
| 2.1       | MEMORY INTERFACE INTRODUCTION                     |
| 2.2       | BUS TOPOLOGY                                      |
| 2.3       | CLOCKING ARCHITECTURE AND CIRCUITS                |
| 2.4       | COMMAND AND ADDRESS ARCHITECTURE                  |
| 2.5       | SIGNALING AND TERMINATION SCHEME                  |
| 2.6       | EQUALIZATION IN MEMORY INTERFACE                  |
| 2.7       | EMERGING TECHNOLOGY                               |
| CHAPTER 3 | IMPEDANCE-MATCHED BIDIRECTIONAL MULTI-DROP DQ BUS |
| 3.1       | IMPEDANCE-MATCHED BIDIRECTIONAL MULTI-DROP DQ BUS |
| 3.2       | OPERATION OF IMBM DQ BUS                          |
| 3.3       | GENERALIZED IMBM DQ BUS                           |
| 3.4       | STEADY-STATE RESISTOR MODEL OF IMBM DQ BUS        |
| CHAPTER 4 | MEMORY CONTROLLER TRANSCEIVER                     |
| 4.1       | MEMORY CONTROLLER TRANSCEIVER ARCHITECTURE        |
| 4.2       | TX CIRCUITS OF THE TRANSCEIVER                    |
| 4.3       | RX CIRCUITS OF THE TRANSCEIVER                    |
| 4.4       | LIMITATION OF THE TRANSCEIVER                     |
| CHAPTER 5 | EXPERIMENTAL RESULTS                              |
| 5.1       | EXPERIMENTAL SETUP                                |
| 5.2       | SINGLE-BIT RESPONSE AND EYE DIAGRAM               |
| 5.3       | BER OF TRANSMITTED SIGNALS (WRITE SIGNALS)        |
| 5.4       | BER OF RECOVERED SIGNALS (READ SIGNALS)           |
| CHAPTER 6 | CONCLUSIONS                                       |
|           | BIBLIOGRAPHY                                      |
|           | ABSTRACT IN KOREAN                                |

# LIST OF FIGURES

| Figure | Caption |
|--------|---------|
| Fig. 1.1.1 | Reflective Inter-Symbol Interference (ISI) caused by conventional multi-drop bus topology |
| Fig. 1.1.2 | Data rate and bus topology trend of the DDRx DRAM interface |
| Fig. 2.1.1 | Typical structure of a memory interface with multi-core CPUs |
| Fig. 2.1.2 | Memory interface and its bandwidth trend |
| Fig. 2.2.1 | Multi-drop bus topology |
| Fig. 2.2.2 | Point-to-point bus topology |
| Fig. 2.2.3 | FBDIMM architecture and AMB |
| Fig. 2.2.4 | Cascading memory architecture |
| Fig. 2.3.1 | Clocking architecture of the DDR3 interface |
| Fig. 2.3.2 | Center-aligned WDQS and edge-aligned RDQS |
| Fig. 2.3.3 | (a) Block diagram and (b) timing diagram of DLL in DDR DRAM |
| Fig. 2.3.4 | Clocking architecture of the GDDR5 interface |
| Fig. 2.3.5 | Clocking architecture of the Terabyte Bandwidth Initiative from Rambus |
| Fig. 2.3.6 | Clocking architecture of mobile XDR |
| Fig. 2.3.7 | Clocking architecture of AMB1 and FBDIMM1 |
| Fig. 2.3.8 | Clocking architecture of AMB and FBDIMM2 |
| Fig. 2.3.9 | CDR architecture of AMB and FBDIMM2 |
| Fig. 2.3.10 | Clocking architecture of SPMT |
| Fig. 2.4.1 | Clock domain of GDDR3 DRAM |
| Fig. 2.4.2 | T-branch network for C/A of the DDR2 interface |
| Fig. 2.4.3 | Fly-by network for C/A of DDR3 interface |
| Fig. 2.4.4 | Write leveling operation in the DDR3 interface |
| Fig. 2.4.5 | Read leveling operation in the DDR3 interface |
| Fig. 2.4.6 | C/A link of the Terabyte Bandwidth Initiative interface |
| Fig. 2.5.1 | Voltage mode driver of the DDR memory interface |
| Fig. 2.5.2 | Voltage mode driver of the mobile XDR memory interface |
| Fig. 2.5.3 | Current mode driver of XDR, FBDIMM, SPMT and M-PHY |
| Fig. 2.5.4 | POD signaling for GDDR 3/4/5 |
| Fig. 2.5.5 | (a) SSTL bus of the DDR memory interface. (b) ODT table for the DDR2 interface |
| Fig. 2.6.1 | Asymmetric equalization scheme in the Terabyte Bandwidth Initiative |
| Fig. 2.7.1 | DDR DRAM with 3D-TSV technology |
| Fig. 3.1.1 | Equivalent stub model of (a) a conventional SSTL DQ bus and (b) the proposed IMBM DQ bus |
| Fig. 3.1.2 | Impedance-matched bidirectional multi-drop (IMBM) DQ bus |
| Fig. 3.1.3 | Stub-Series Terminated Logic (SSTL) DQ bus |
| Fig. 3.2.1 | Write operation of the IMBM DQ bus |
| Fig. 3.2.2 | Read operation of the IMBM DQ bus – from module #0 |
| Fig. 3.2.3 | Read operation of the IMBM DQ bus – from module #1 |
| Fig. 3.2.4 | Read operation of the IMBM DQ bus – from module #3 |
| Fig. 3.3.1 | Generalized IMBM DQ bus ($Z_1 \ge Z_2$) |
| Fig. 3.3.2 | Generalized IMBM DQ bus ($Z_1 \ge Z_2$) with resistor values |
| Fig. 3.3.3 | Generalized IMBM DQ bus ($Z_1 < Z_2$) |
| Fig. 3.3.4 | Generalized IMBM DQ bus ($Z_1 < Z_2$) with resistor values |
| Fig. 3.3.5 | Generalized IMBM DQ bus ($Z_1 \ge Z_2$) with a double-sided module |
| Fig. 3.4.1 | Equivalent stub model and equivalent impedance from the bottom transmission line of (a) a conventional SSTL DQ bus and (b) the proposed IMBM DQ bus |
| Fig. 3.4.2 | Steady-state resistor model of the single-sided SSTL DQ bus without the selective ODT scheme |
| Fig. 3.4.3 | Steady-state resistor model of the single-sided SSTL DQ bus with the selective ODT scheme |
| Fig. 3.4.4 | Steady-state resistor model of the single-sided IMBM DQ bus |
| Fig. 3.4.5 | Steady-state resistor model of the double-sided SSTL DQ bus without the selective ODT scheme |
| Fig. 3.4.6 | Steady-state resistor model of the double-sided SSTL DQ bus with the selective ODT scheme |
| Fig. 3.4.7 | Steady-state resistor model of the double-sided IMBM DQ bus |
| Fig. 3.4.8 | Reciprocity |
| Fig. 4.1.1 | Memory controller transceiver block diagram and its clocking architecture |
| Fig. 4.2.1 | 8:1 serializer with a 4-tap output |
| Fig. 4.2.2 | (a) Five-latch 2:1 serializer and (b) its timing diagram |
| Fig. 4.2.3 | Four-tap half-rate serializer with a differential output |
| Fig. 4.2.4 | Four-tap current-mode driver |
| Fig. 4.2.5 | (a) Overall duty-cycle corrector structure, (b) simple equivalent model of the jth stage of the DCC buffer, and (c) a schematic diagram of the DCC buffer |
| Fig. 4.2.6 | Measured DCC linearity of a 2.4GHz DQS signal |
| Fig. 4.3.1 | Block diagram of the strobe recovery unit (SRU) |
| Fig. 4.3.2 | Schematic diagram of the sampler |
| Fig. 4.3.3 | Schematic diagram of the continuous-time linear equalizer |
| Fig. 4.3.4 | Schematic diagram of the phase interpolator |
| Fig. 4.3.5 | Measured linearity of the phase interpolator |
| Fig. 4.4.1 | Next version plan of the memory controller transceiver – not implemented |
| Fig. 5.1.1 | Die photo of the memory controller transceiver implemented in 0.13µm CMOS |
| Fig. 5.1.2 | Scope of this work |
| Fig. 5.1.3 | Implemented 4-slot 8-drop IMBM DQ bus |
| Fig. 5.1.4 | Implemented 4-slot 8-drop SSTL DQ bus |
| Fig. 5.2.1 | Setup I for measuring the eye diagram and single-bit response |
| Fig. 5.2.2 | Measured 4.8Gb/s single-bit responses of (a) a SSTL DQ bus with a 50Ω load, (b) a SSTL DQ bus with a 50Ω and a 1pF load, (c) an IMBM DQ bus with a 50Ω load and (d) an IMBM DQ bus with a 50Ω and a 1pF load |
| Fig. 5.2.3 | Measured 4.8Gb/s eye diagrams of an unequalized signal (a) #1, (b) #3, (c) #5, and (d) #7; a de-emphasized signal (e) #1, (f) #3, (g) #5, and (h) #7 at the SSTL module with a 50Ω load |
| Fig. 5.2.4 | Measured 4.8Gb/s eye diagrams of an unequalized signal (a) #1, (b) #3, (c) #5, and (d) #7; a de-emphasized signal (e) #1, (f) #3, (g) #5, and (h) #7 at the SSTL module with a 50Ω and a 1pF load |
| Fig. 5.2.5 | Measured 4.8Gb/s eye diagrams of an unequalized signal (a) #1, (b) #3, (c) #5, and (d) #7; a de-emphasized signal (e) #1, (f) #3, (g) #5, and (h) #7 at the IMBM module with a 50Ω load |
| Fig. 5.2.6 | Measured 4.8Gb/s eye diagrams of an unequalized signal (a) #1, (b) #3, (c) #5, and (d) #7; a de-emphasized signal (e) #1, (f) #3, (g) #5, and (h) #7 at the IMBM module with a 50Ω and a 1pF load |
| Fig. 5.2.7 | Measured 4.8Gb/s eye diagram and histogram of a de-emphasized (a) DQ signal and (b) a DQS signal, both on the IMBM DQ bus with a 50Ω and a 1pF load |
| Fig. 5.3.1 | Setup II for measuring the TX BER |
| Fig. 5.3.2 | Bathtub graph based on TX BER measurements of both equalized and unequalized SSTL signals with (a) a 50Ω and (b) a 50Ω and a 1pF load |
| Fig. 5.3.3 | Bathtub graph based on TX BER measurements of both equalized and unequalized IMBM signals with (a) a 50Ω and (b) a 50Ω and a 1pF load |
| Fig. 5.4.1 | Setup III for measuring the RX BER |
| Fig. 5.4.2 | Bathtub graph based on RX BER measurements of unequalized and equalized SSTL with (a) a 50Ω and (b) a 50Ω and a 1pF load |
| Fig. 5.4.3 | Bathtub graph based on RX BER measurements of unequalized and equalized IMBM with (a) a 50Ω and (b) a 50Ω and a 1pF load |
| Fig. 5.4.4 | Measured 4.8Gb/s eye diagram and histogram of a recovered DQ signal with the pattern 10101010, on the IMBM DQ bus |

# LIST OF TABLES

| Table | Title |
|-------|-------|
| Table 2.3.1 | Clocking Architecture and Features (Asymmetric Architecture) |
| Table 2.3.2 | Clocking Architecture and Features (Symmetric Architecture) |
| Table 3.4.1 | Voltage Attenuation Comparison of the Single-sided DQ Bus |
| Table 3.4.2 | Voltage Attenuation Comparison of the Double-sided DQ Bus |
| Table 5.4.1 | Memory Controller Transceiver Summary |
| Table 5.4.2 | Timing Margin Summary of the SSTL DQ Bus |
| Table 5.4.3 | Timing Margin Summary of the IMBM DQ Bus |

### **CHAPTER 1**

## INTRODUCTION

#### **1.1 MOTIVATION**

The combination of rapid increases in both the speed of processors and the capacity of memory modules means that memory interfaces are now required to handle enormous amounts of data during read and write operations. The scaling of CMOS transistors and the development of new I/O circuit technology have allowed the data rate of memory interfaces to reach 16Gb/s per channel [1.1.1] [1.1.2]. However, such high speeds [1.1.1]–[1.1.4], like those achieved by the XDR and GDDR memory interfaces, require the use of a point-to-point bus topology rather than a multi-drop bus topology. A point-to-point bus topology can achieve much higher data rates than a multi-drop bus topology because each stub of a multi-drop bus generates undesired reflections, causing inter-symbol interference (ISI), as shown in Fig. 1.1.1.



Fig. 1.1.1. Reflective Inter-Symbol Interference (ISI) caused by conventional multi-drop bus topology.

Unfortunately, point-to-point channels require too much PCB area to allow their use in high-capacity memory systems such as DRAM modules for personal computers and servers. This is why recent DDR2/3 memory interfaces [1.1.5] use Stub-Series Terminated Logic (SSTL) in spite of the resulting reflection at each stub. Nevertheless, unlike the previous DDR1/2/3 interfaces shown in Fig. 1.1.2, the next-generation DDR memory interface may adopt a point-to-point bus topology because conventional multi-drop channels cannot handle data rates that exceed 2Gb/s [1.1.6]. It is therefore necessary to find a way to increase the data transfer rate while maintaining the multi-drop bus topology, which turns the focus to the ISI caused by reflection. I expect that memory systems with three or four module connectors, and at least six or eight ranks, will be needed in next-generation multi-core PC, server, and workstation architectures. Therefore, a new approach is necessary to support both high data rates and high memory capacities.



Fig. 1.1.2. Data rate and bus topology trend of the DDRx DRAM interface.

Various approaches are being taken to address this problem. The fully buffered DIMM (FBDIMM) with an advanced memory buffer (AMB) [1.1.7] [1.1.8] and the cascading memory architecture [1.1.9] [1.1.10] both have a daisy-chained point-to-point bus topology, which is not affected by the reflection problem. However, FBDIMM has more latency than the multi-drop bus topology. Latency, or fast access time, has priority over data throughput because the CPU waits until the first data arrives. Alternatively, impedance matching by means of a $2Z_0\,\Omega$ transmission line in the last part of a memory module can significantly reduce the reflection in a channel [1.1.11]. However, this sort of matching can only be applied to a two-slot configuration. A configuration method in which the PCB traces are placed at separate locations on the board can reduce the reflection dramatically [1.1.12], but this scheme is also limited to two or three slots. Moreover, an impedance-matching scheme [1.1.13] in which the characteristic impedance of the PCB trace is changed to obtain the proper impedance has a physical drawback: a smaller characteristic impedance requires a wider PCB trace, which in turn requires a larger PCB routing area for a heavily parallel memory interface.

In this thesis, an impedance-matched bidirectional multi-drop (IMBM) DQ bus and memory controller transceiver supporting the IMBM DQ bus are proposed. The IMBM DQ bus achieves data-transfer rates on the order of Gb/s without reflective ISI at each stub while maintaining the advantages of a multi-drop channel, such as minimum latency during read and write operations. The proposed memory controller transceiver offsets the weaknesses of the IMBM DQ bus caused by voltage attenuation from the DQ bus.

#### **1.2 THESIS ORGANIZATION**

This thesis consists of six chapters. Chapter 1 is an introductory chapter which describes the necessity of the IMBM DQ bus. In chapter 2, various types of memory interface architectures and circuits are introduced and classified. In chapter 3, the proposed IMBM DQ bus and its principles are introduced. The advantages of the IMBM DQ bus are discussed, and a generalized IMBM DQ bus is introduced. A steady-state model of the IMBM DQ bus is also discussed. In chapter 4, the memory controller transceiver architecture and the TX and RX circuit details are presented. The measurement setup and experimental results are given in chapter 5. Finally, in chapter 6, the proposed IMBM DQ bus and the transceiver are summarized.

### **CHAPTER 2**

## **INTRODUCTION TO MEMORY INTERFACE**

### 2.1 MEMORY INTERFACE INTRODUCTION

A memory interface is a comprehensive concept that can include everything related to data transmission between a memory controller (host) and a memory module (slave). A memory interface is composed of the bus topology, the clocking architecture, the signal driving method, the termination scheme, the deskew method, the sampling method, the data coding scheme, and other components. Fig. 2.1.1 shows the typical structure of a memory interface with dual-socket, multi-core CPUs, memory controllers, and memory modules. The structure can differ according to the application and system specifications. Commonly used memory interfaces are as follows: Double-Data-Rate (DDR) Dynamic Random Access Memory (DRAM) [2.1.1], Graphic DDR (GDDR) [2.1.2], Low Power DDR (LPDDR) [2.1.3], eXtreme Data Rate (XDR) [2.1.4], mobile XDR [2.1.5], Fully-Buffered Dual-Inline Memory Module (FBDIMM) [2.1.6], Serial-Port Memory Technology (SPMT) [2.1.7], M-PHY [2.1.8] and Wide IO [2.1.9]. These memory interfaces have different characteristics depending on their purpose.



Fig. 2.1.1. Typical structure of a memory interface with multi-core CPUs.

The DDR memory interface is used for personal computer applications. PC applications require moderate bandwidths and capacities. The bandwidth of the DDR2 memory interface can be as high as 800Mb/s per pin, and the total aggregated bandwidth of a DDR2 module can reach 1.6GB/s. Recent DDR3 memory interfaces have a maximum bandwidth of 2133Mb/s per pin. FBDIMM is used for high-capacity server applications. Because the Advanced Memory Buffer (AMB) in the FBDIMM interface gathers and serializes DDR signals, the per-pin bandwidth of FBDIMM is much higher than that of the DDR interface. FBDIMM1 has a bandwidth of 4.8Gb/s per pin, while for FBDIMM2 the bandwidth can reach 9.6Gb/s [1.1.7] [1.1.8]. Graphics memory applications require higher throughput than DRAM for PCs. The pin bandwidth of GDDR5 is 5Gb/s, while that of XDR is 6.4Gb/s. Recent XDR2 interfaces reach 12.8Gb/s. Mobile memory focuses on low power consumption. The bandwidth of LPDDR2 is 1.066Gb/s. Serial memory interfaces such as SPMT and M-PHY have bandwidths of 7.5Gb/s and 5.8Gb/s, respectively. Fig. 2.1.2 shows various memory interfaces and the recent bandwidth trend.



Fig. 2.1.2. Memory interface and its bandwidth trend.

### 2.2 BUS TOPOLOGY

Generally, there are two types of bus topologies for a memory interface: the multi-drop bus and the point-to-point bus [2.2.1], [2.2.2]. The multi-drop bus connects a memory controller to multiple memory modules through a shared bus, as shown in Fig. 2.2.1. The multi-drop bus is used to increase the memory capacity without increasing the number of PCB traces. However, this sharing scheme creates stubs in the bus. Each stub introduces an impedance discontinuity in the transmission line, which causes a reflected wave, as shown in Fig. 1.1.1. The maximum achievable data bandwidth of the multi-drop bus is limited by these reflections. Thus, a multi-drop bus is adopted for memory interfaces which require high capacity with a moderate per-pin bandwidth, such as DDR1, DDR2, and DDR3. To the best of my knowledge, the bus topology of the DDR4 memory interface has not yet been determined.



Fig. 2.2.1. Multi-drop bus topology.
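As a rough illustration of why each stub produces a reflection, the following sketch computes the voltage reflection coefficient seen by a wave arriving at an unmatched T-junction. It assumes ideal lossless lines that all share one characteristic impedance; the 50 Ω value and the function names are illustrative assumptions, not taken from this thesis.

```python
def reflection_coefficient(z_load: float, z0: float) -> float:
    """Voltage reflection coefficient for a line of impedance z0 driving z_load."""
    return (z_load - z0) / (z_load + z0)


def parallel(*impedances: float) -> float:
    """Equivalent impedance of several impedances connected in parallel."""
    return 1.0 / sum(1.0 / z for z in impedances)


Z0 = 50.0  # assumed characteristic impedance in ohms (illustrative value)

# At an unmatched T-junction the incident wave sees the continuing bus
# segment and the stub in parallel.
z_junction = parallel(Z0, Z0)
gamma = reflection_coefficient(z_junction, Z0)
transmitted = 1.0 + gamma  # voltage fraction launched into each outgoing branch

print(f"junction impedance      = {z_junction:.1f} ohm")
print(f"reflection coefficient  = {gamma:.3f}")
print(f"transmitted voltage     = {transmitted:.3f} of the incident wave")
```

Under these assumptions, one third of the incident voltage is reflected with inverted polarity at every junction, which is the kind of reflective ISI that limits the multi-drop bus.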

On the other hand, the point-to-point bus, shown in Fig. 2.2.2, connects a memory controller and a memory module one-to-one. Because there is no stub in the point-to-point bus, reflection does not occur if the source and destination are properly terminated. Thus, the point-to-point channel can transmit data at a higher speed and with better signal integrity than the multi-drop channel. The point-to-point bus topology, however, cannot handle multiple memory modules simultaneously, which limits its use in large-capacity memory systems. Thus, memory interfaces which focus on high-speed data transmission and high data throughput, such as the GDDR, XDR, and SPMT interfaces, adopt the point-to-point bus topology. In the case of a mobile memory interface such as the LPDDR interface, the destination port is not terminated in order to reduce power consumption. This causes a reflected wave on the destination side and degrades the signal integrity. Because an unterminated multi-drop bus would be degraded more severely than an unterminated point-to-point bus, LPDDR adopts the point-to-point bus.



Fig. 2.2.2. Point-to-point bus topology.

To mitigate the limitations of the multi-drop bus, the daisy-chained bus [1.1.7], [1.1.8] and the cascade bus schemes [1.1.9] [1.1.10] have been proposed. Fig. 2.2.3 shows the architecture of the FBDIMM interface, which adopts a daisy-chained bus [2.1.6]. An AMB chip serializes and transmits high-speed data after collecting the DDR signals from the DRAM module [2.2.3]. First, the first AMB receives southbound data from the host controller. If the data is destined for the first AMB, it receives the data and sends it to its own module. If the data should go to another module, the first AMB passes the data to the next AMB. The host and AMBs are connected by point-to-point buses for high-speed operation. Northbound data is transmitted from the memory module to the host controller in the same way. Although the FBDIMM can easily increase the capacity and the number of modules via daisy-chained connections, the daisy chain increases data latency and impairs the system performance. In addition, the concentration of data traffic on the central AMB chip concentrates heat dissipation there and requires a careful cooling scheme and a radiation panel.



Fig. 2.2.3. FBDIMM architecture and AMB [2.1.6].

The cascading memory architecture in Fig. 2.2.4 uses a bus scheme similar to that of FBDIMM, but the cascading architecture uses only a unidirectional rotational (ring) bus and does not use a central buffer chip such as an AMB chip. Although the cascading memory bus can solve the heat concentration problem, it does not reduce the latency.



Fig. 2.2.4. Cascading memory architecture [1.1.9].

### 2.3 CLOCKING ARCHITECTURE AND CIRCUITS

Each memory interface organizes its own clocking architecture to optimize its performance. In this section, several recent clocking architectures and circuits are introduced. Most memory interfaces utilize an asymmetric architecture between the memory controller and the DRAM. This is because the two are fabricated in different processes: the memory controller is implemented in a standard logic CMOS process, whereas the DRAM is implemented in a slow memory process. Most complex functions, such as system clock synthesis, deskewing, phase alignment, training, and optional equalization, are therefore implemented on the memory controller side [1.1.1]. DDRx, GDDRx, LPDDRx, XDRx and mobile XDR adopt the asymmetric architecture.



Fig. 2.3.1. Clocking architecture of the DDR3 interface [2.3.1].

Fig. 2.3.1 shows the clocking architecture of the DDR3 DRAM interface [2.3.1]. Both read and write data operations require phase alignment circuits such as a Delay-Locked Loop (DLL) [2.3.2] or a phase interpolator (PI) [2.3.3], [2.3.4], as shown in Fig. 2.3.1. The phase alignment circuits for both directions are located in the memory controller unit (MCU). DDR3 DRAM does not have a phase alignment circuit for data sampling. The only such circuit in DDR3 DRAM is a DLL, which adjusts the launch time of read data.



Fig. 2.3.2. Center-aligned WDQS and edge-aligned RDQS.

The DDR3 DRAM interface utilizes a source-synchronous clocking scheme using read-strobe (RDQS) and write-strobe (WDQS) signals. The source-synchronous clocking scheme can track high-frequency jitter from the transmitter because the strobe and data signals have the same jitter variation [2.3.5]. In the case of write data transmission, write DQ (WDQ) data is directly sampled by WDQS, which is delivered by an on-chip inverter-based buffer. A phase-shifting circuit is not used in the DRAM for write data sampling. The skew between WDQ and WDQS is compensated by the phase interpolator (PI) and the skip circuit in the MCU. The MCU transmits WDQS after shifting its phase by the required amount. At the sampler of the DRAM, WDQS is center-aligned with WDQ for an optimal timing margin, as shown in Fig. 2.3.2. The skew and the optimum transmit phase are measured during the initial calibration period by sweeping the WDQS phase using the phase interpolator. Read data (RDQ) is also sampled by the RDQS signal in the MCU. Because DDR3 DRAM has no phase-shifting circuit that adjusts the timing between RDQS and RDQ, RDQ and RDQS are transmitted with edge-aligned timing, as shown in Fig. 2.3.2. The RDQS signal is shifted by 90° through a replica DLL (RDLL) and a PI in the MCU. The 90°-shifted RDQS signal is then used for RDQ sampling. In practice, the optimal RDQS shift is measured during the initial calibration, and it may not be precisely 90° because of trace and circuit mismatches. Due to source-synchronous clocking, sampled read and write data is in the strobe clock domain. FIFOs in both the MCU and the DRAM pass the data into the system clock (CLK) domain.
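To make the 90° strobe shift concrete, the short sketch below converts a per-pin data rate into the unit interval and the nominal quarter-cycle strobe delay. The 1600Mb/s rate and the function names are assumptions chosen for illustration; the calibrated delay in a real system would deviate from this nominal value, as noted above.

```python
def unit_interval_ps(data_rate_mbps: float) -> float:
    """Duration of one bit (unit interval) in picoseconds."""
    return 1e12 / (data_rate_mbps * 1e6)


def quarter_cycle_delay_ps(data_rate_mbps: float) -> float:
    """Nominal delay (ps) that shifts a DDR strobe by 90 degrees.

    With double-data-rate signaling the strobe toggles at half the per-pin
    data rate, so one strobe period spans two unit intervals.
    """
    strobe_period_ps = 2.0 * unit_interval_ps(data_rate_mbps)
    return strobe_period_ps / 4.0


RATE_MBPS = 1600.0  # assumed per-pin data rate, for illustration only

ui = unit_interval_ps(RATE_MBPS)
shift = quarter_cycle_delay_ps(RATE_MBPS)
print(f"unit interval       = {ui:.1f} ps")
print(f"90-degree shift     = {shift:.1f} ps")
print(f"shift / UI          = {shift / ui:.2f}")
```

The quarter-cycle shift equals half a unit interval, which is why an edge-aligned RDQS lands in the center of the RDQ eye after the shift.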






Fig. 2.3.3. (a) Block diagram and (b) timing diagram of DLL in DDR DRAM [2.3.1].

Fig. 2.3.3 shows a block diagram and a timing diagram of the DLL, which is the only phase-shifting circuit in DRAM. As noted above, the DLL is used to adjust the launch time of read data. So that the read latency (CAS latency) of the DRAM can be counted as an integer multiple of the clock cycle, the DRAM launches read data with precisely the same phase as the received system clock (Ext. Clock). To achieve this, a DLL with a replica delay, which imitates the sum of the input and output buffer delays, aligns the edges of the output read data with the external clock.
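The lock condition described above can be sketched numerically. The snippet below is a simplified model, not the circuit of Fig. 2.3.3: it assumes the DLL settles where the delay line plus the replica delay equals a whole number of clock cycles, and all delay values and names are hypothetical.

```python
import math
from typing import Tuple


def dll_lock(t_clk_ps: int, d_in_ps: int, d_out_ps: int) -> Tuple[int, int]:
    """Return (delay-line setting in ps, whole cycles N) at lock.

    Assumed lock condition: d_line + (d_in + d_out) == N * t_clk,
    using the smallest N >= 1 that keeps d_line non-negative.
    """
    d_replica = d_in_ps + d_out_ps  # replica imitates input + output buffers
    n_cycles = max(1, math.ceil(d_replica / t_clk_ps))
    d_line = n_cycles * t_clk_ps - d_replica
    return d_line, n_cycles


# Hypothetical values, chosen only for illustration.
T_CLK = 1250          # clock period in ps
D_IN, D_OUT = 300, 500  # input and output buffer delays in ps

d_line, n = dll_lock(T_CLK, D_IN, D_OUT)
total_path = D_IN + d_line + D_OUT  # external clock edge to output data edge

print(f"delay line setting = {d_line} ps, N = {n}")
print(f"total path delay   = {total_path} ps")
print(f"data edge aligned  = {total_path % T_CLK == 0}")
```

Because the total path delay is a whole number of clock periods, the read data edge coincides with an external clock edge, which is what allows the read latency to be counted in integer clock cycles.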



Fig. 2.3.4. Clocking architecture of the GDDR5 interface [2.1.2].

The GDDR5 interface also adopts an asymmetric architecture, as shown in Fig. 2.3.4. Because the data bandwidth of the GDDR5 interface is much higher than that of the DDR interface, GDDR5 uses only a point-to-point bus and various training schemes. During a write operation, GDDR5 uses modified source-synchronous clocking. WCK (Write Clock) replaces the WDQS of the earlier GDDR interfaces. WDQS is a pulsed strobe with a preamble, a burst, and a postamble, whereas WCK is a free-running continuous clock signal. In addition, WCK can be selectively filtered by a PLL in the Synchronous Graphics RAM (SGRAM) depending on the operating conditions. The controller determines whether or not to turn on the PLL. WCK is also used for timing alignment between CK (including CMD/ADD) and WCK. The bang-bang phase detector [2.3.6] in the GDDR5 SGRAM detects the phase difference and generates an EDC signal which contains early/late phase information. During a read operation, GDDR5 uses mesochronous clocking, meaning that a forwarded clock or strobe is not used for read data recovery. As shown in Fig. 2.3.4, a Clock and Data Recovery (CDR) circuit is used at the receiver of the read path in the controller. The CDR continuously tracks timing variations in the read data.



Fig. 2.3.5. Clocking architecture of the Terabyte Bandwidth Initiative from Rambus [1.1.1].

Fig. 2.3.5 shows the clocking architecture of the tentatively named Terabyte Bandwidth Initiative from Rambus, whose bandwidth is 16Gb/s per DQ [1.1.1] [1.1.2]. This interface also adopts an asymmetric architecture between the controller and DRAM. Both write and read data transactions use mesochronous clocking and do not use a pulsed strobe or a forwarded clock. The controller and DRAM only share a reference clock which provides frequency information. As shown in Fig. 2.3.5, no phase-adjusting circuits exist on the DRAM side. The phase mixer in the controller adjusts the phase of both the write and read data. For a proper write operation, the controller transmits optimally phase-shifted write data, which the DRAM receives with its own fixed-phase PLL clock. The DRAM also transmits read data with its own fixed clock. For a proper read operation, the controller shifts the sampling clock of the read data by an optimum phase. The optimum amount of phase shifting is determined during the initial calibration and via periodic calibration in the idle state of the DRAM. Because this architecture uses neither a forwarded clock nor a source-synchronous clocking scheme, proper selection of the interval between periodic phase calibrations is very important. During the initial calibration period, the controller finds the edge positions of the write and read data timing by sweeping the phase mixer. It then sets the operating point of the phase mixer at the center of the edges it has found. During the periodic calibration after normal operation, the controller searches for the drifted edges starting from their previous positions. Doing this reduces the calibration time because the entire phase range does not have to be swept again.
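The edge search of the initial calibration can be illustrated with a short sketch. It is not the Rambus implementation: the 64-code phase mixer, the wrap-around of the codes over one UI, and the `passes(code)` callback (a pass/fail test of a training pattern at one phase code) are all assumptions made for illustration.

```python
def find_eye_center(passes, n_codes=64):
    """Sweep every phase code and return the centre of the longest passing run.

    `passes(code)` is a hypothetical pass/fail test at one phase-mixer code.
    Codes are assumed to wrap around, so code n_codes equals code 0.
    """
    results = [bool(passes(code)) for code in range(n_codes)]
    if all(results):
        return 0  # no edge was found, so any code is acceptable
    if not any(results):
        raise ValueError("no passing phase code found")

    # Start scanning right after a failing code so that a passing run
    # is never split in two by the wrap-around point.
    start = results.index(False) + 1
    best_len, best_begin = 0, None
    run_len, run_begin = 0, None
    for i in range(n_codes):
        code = (start + i) % n_codes
        if results[code]:
            if run_len == 0:
                run_begin = code
            run_len += 1
            if run_len > best_len:
                best_len, best_begin = run_len, run_begin
        else:
            run_len = 0
    # The centre of the passing window is the chosen operating point.
    return (best_begin + (best_len - 1) // 2) % n_codes
```

A periodic calibration could reuse the same idea but test only a few codes around the previously found edges, which is what shortens the calibration time.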



Fig. 2.3.6. Clocking architecture of mobile XDR [2.1.5].

Mobile XDR [2.1.5], whose bandwidth is 4.3Gb/s per DQ, also uses an asymmetric clocking architecture similar to that of GDDR5, as shown in Fig. 2.3.6. During a write operation, it uses the source-synchronous clocking scheme. The forwarded clock, Clk, is used for data sampling in the DRAM. On the other hand, mesochronous clocking is used for read data transactions. There is no strobe or forwarded clock for read data. Like the Terabyte Bandwidth Initiative interface, the DRAM in the mobile XDR interface does not have a timing-adjustment circuit. In addition, the mobile DRAM has no clock-synthesizing PLL, which keeps the DRAM power low; the Clk signal is used as the system clock in the DRAM. Creating a proper transmit phase for write data and a proper sampling phase for read data is the role of the controller. The optimum phases are also found during the calibration period.



Fig. 2.3.7. Clocking architecture of AMB1 and FBDIMM1 [1.1.7].

Unlike the previously mentioned clocking architectures, memory interfaces that adopt the serial link concept use a symmetric clocking architecture. FBDIMM1, FBDIMM2, SPMT, and M-PHY all adopt the symmetric clocking architecture. First, the 4.8Gb/s FBDIMM1 interface uses symmetric mesochronous clocking for both northbound and southbound data transactions [1.1.7]. In Fig. 2.3.7, CDR circuits recover the sampling clock for received data. For a daisy-chained bus, the MUX selects the transmit data between its own DRAM data and the data received from the preceding AMB. To reduce the latency of clock recovery and of clock domain crossing in the FIFO, the FIFO bypass mode can be enabled. In the FIFO bypass mode, the MUX selects the data which bypasses the FIFO, and the recovered clock of the CDR is used for the MUX and TX driver instead of the system clock from its own PLL. Unlike the XDR interface, the CDR in AMB1 continuously tracks the phase information of the received data using a phase detector. In other words, FBDIMM does not use a training period for optimum data sampling, instead using a CDR circuit for continuous jitter tracking.



Fig. 2.3.8. Clocking architecture of AMB and FBDIMM2 [1.1.8].

Fig. 2.3.8 shows the clocking architecture of the 9.6Gb/s FBDIMM2 interface [1.1.8]. Unlike FBDIMM1, FBDIMM2 adopts a symmetric source-synchronous clocking scheme using a forwarded clock. Because the forwarded clock has the same jitter variation as the transmitted data, the receiver of the AMB can recover data with high jitter tolerance. To achieve this, the correlated jitter of the forwarded clock should not be filtered by the CDR circuit in the FBDIMM2 interface. Thus, the jitter-tracking loop of the CDR (phase-locked loop) should have a high bandwidth, as shown in Fig. 2.3.9. On the other hand, the loop bandwidth of the static skew correction loop (phase recovery loop) can be very low. Toifl's dual-loop adjustable PLL [2.3.7] with a forwarded clock reference is a candidate for this function.



Fig. 2.3.9. CDR architecture of AMB and FBDIMM2 [1.1.8].

The SPMT interface shown in Fig. 2.3.10 and the M-PHY interface adopt a serial interface and symmetric mesochronous clocking [2.1.7]. Both interfaces use the symmetric clocking architecture and a common CDR scheme for clocking and data recovery in the receiver. These serial memory interfaces can reduce the pin count and system cost through their use of a serial interface, whereas the overall system latency increases due to the serializing and deserializing operation.



Fig. 2.3.10. Clocking architecture of SPMT [2.1.7].

Clocking architectures and features of various memory interfaces are summarized in Tables 2.3.1 and 2.3.2.

|                       | Asymmetric clocking architecture   |                                                    |                                       |                                               |
|-----------------------|------------------------------------|----------------------------------------------------|---------------------------------------|-----------------------------------------------|
| Memory<br>interface   | DDR3                               | GDDR5                                              | Terabyte<br>Initiative                | Mobile<br>XDR                                 |
| Bandwidth per DQ      | ~2.133Gb/s                         | ~7.5Gb/s                                           | 16Gb/s                                | 4.3Gb/s                                       |
| Bus topology          | Multi-drop<br>(~2 slots)           | Point-to-point                                     | Point-to-point                        | Point-to-point                                |
| Signaling             | Single-ended                       | Single-ended                                       | Differential                          | Differential                                  |
| Duplex<br>scheme (DQ) | Half duplex                        | Half duplex                                        | Half duplex                           | Half duplex                                   |
| Write<br>clocking     | Source<br>synchronous<br>(strobe)  | Source<br>synchronous<br>(forwarded<br>clock)      | Mesochronous                          | Source<br>synchronous<br>(forwarded<br>clock) |
| Read<br>clocking      | Source<br>synchronous<br>(strobe)  | Mesochronous                                       | Mesochronous                          | Mesochronous                                  |
| Write<br>phase shift  | DLL+PI<br>in controller<br>(fixed) | DLL+PI<br>in controller<br>(training)              | PLL+PI<br>in controller<br>(training) | Unknown<br>in controller<br>(training)        |
| Read<br>phase shift   | DLL+PI<br>in controller<br>(fixed) | (Optionally)<br>Continuous<br>CDR<br>in controller | PLL+PI<br>in controller<br>(training) | Unknown<br>in controller<br>(training)        |

Table 2.3.1. Clocking Architecture and Features (Asymmetric Architecture).

|                       | Symmetric clocking architecture |                                               |                            |                            |
|-----------------------|---------------------------------|-----------------------------------------------|----------------------------|----------------------------|
| Memory<br>interface   | FBDIMM1                         | FBDIMM2                                       | SPMT                       | M-PHY                      |
| Bandwidth per DQ      | 4.8Gb/s                         | 9.6Gb/s                                       | 7.5Gb/s                    | 6Gb/s                      |
| Bus topology          | Point-to-point                  | Point-to-point                                | Point-to-point             | Point-to-point             |
| Signaling             | Differential                    | Differential                                  | Differential               | Differential               |
| Duplex<br>scheme (DQ) | Dual simplex                    | Dual simplex                                  | Dual simplex               | Dual simplex               |
| Clocking              | Mesochronous                    | Source<br>synchronous<br>(forwarded<br>clock) | Mesochronous               | Mesochronous               |
| Phase shift           | Continuous<br>CDR<br>in slave   | Dual loop<br>CDR<br>in slave                  | Continuous<br>CDR<br>in RX | Continuous<br>CDR<br>in RX |

Table 2.3.2. Clocking Architecture and Features (Symmetric Architecture).

# 2.4 COMMAND AND ADDRESS ARCHITECTURE

In the conventional DDR memory interface, data is transmitted and received via a source-synchronous scheme. On the DRAM side, write data is sampled by a write strobe signal. Thus, sampled write data is in the strobe clock domain. For high-speed operation, a DQ data line is shared only by the DRAM chips on the same DQ channel (ranks 0, 1, 2 and 3). On the other hand, the command (CMD) and address (ADDR) signals (C/A) are shared by all DRAM chips on every module. Thus, the C/A bandwidth is lower than that of DQ. For synchronized operation among all DRAM chips, C/A is sampled with the system clock (CK) in each DRAM chip. In other words, C/A is in the system clock domain, whereas DQ is in the strobe clock domain. In the controller and DRAM, a FIFO performs the domain crossing from the strobe clock domain to the system clock domain. Fig. 2.4.1 shows an example of the clock domain structure of GDDR3 DRAM [2.4.1]. Each generation of DDRx and GDDRx (up to GDDR4) has a similar clock domain structure containing the strobe clock domain and the system clock domain.



Fig. 2.4.1. Clock domain of GDDR3 DRAM [2.4.1].

Up to the DDR2 memory interface, C/A is transmitted through a T-branch, as shown in Fig. 2.4.2 [2.4.2]. The length of the traces of C/A from the memory controller to each DRAM chip is tightly matched to avoid byte-alignment failure. The T-branch, however, creates impedance discontinuity points and stubs. Due to the stubs and impedance discontinuity, undesired reflection occurs and degrades the maximum data bandwidth of the C/A channel.



Fig. 2.4.2. T-branch network for C/A of the DDR2 interface [2.4.2].

To reduce the number of impedance discontinuities and to loosen the tight length-matching requirements, the DDR3 interface adopts the fly-by C/A bus topology [1.1.5]. As shown in Fig. 2.4.3, C/A signals enter the DIMM and then connect to each memory device sequentially. Thus, the length of the stub is minimized and reflections are suppressed. Although the fly-by C/A network allows a higher C/A bandwidth, it increases the difference in C/A arrival time among the memory chips. The first chip of the C/A network receives a command first, and the last chip receives it last, as shown in Fig. 2.4.3. This skew means that each memory chip transmits or receives data at a different time, leading to C/A-DQ mismatch, byte alignment failure, and synchronization failure.



Fig. 2.4.3. Fly-by network for C/A of DDR3 interface [1.1.5].

To avoid this C/A-DQ mismatch, DDR3 introduces write leveling and read leveling operations. In a write leveling operation, the DDR3 DRAM compares the CK phase and the DQS phase and then returns the phase comparison information to the controller. Because C/A is synchronized with the CK signal and DQ is synchronized with the DQS signal, the mismatch between C/A and DQ can be eliminated if CK and DQS are in phase. To implement this operation, the controller delays the transmit timing of DQS and DQ until CK and DQS have no phase difference, as shown in Fig. 2.4.4. If a DRAM chip receives C/A and CK earlier, the delay of the DQS signal for this chip has a small value. In contrast, the delay of the DQS signal for a DRAM chip which receives C/A later has a larger value.
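The write leveling loop can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the controller logic of any particular device: `sample_ck_with_dqs(step)` is a hypothetical model of the DRAM feedback that returns the CK level captured at the DQS edge for a given DQS delay step, and `make_dram` is a toy model in which the CK period spans 64 delay steps.

```python
def write_level(sample_ck_with_dqs, n_steps=128):
    """Return the smallest DQS delay step at which the feedback flips 0 -> 1.

    The 0 -> 1 flip marks the point where the DQS edge lines up with the
    rising CK edge at the DRAM, i.e. where CK and DQS are in phase.
    """
    previous = sample_ck_with_dqs(0)
    for step in range(1, n_steps):
        current = sample_ck_with_dqs(step)
        if previous == 0 and current == 1:
            return step
        previous = current
    raise RuntimeError("no 0->1 transition found in the delay range")


def make_dram(ck_skew, period=64):
    """Toy DRAM model (assumption): CK arrives `ck_skew` steps late and is
    high during the first half of each period."""
    def sample(step):
        return 1 if (step - ck_skew) % period < period // 2 else 0
    return sample
```

In this toy model a chip that sees CK later (larger `ck_skew`) ends up with a larger DQS delay, which is the behavior described above for the fly-by network.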



Fig. 2.4.4. Write leveling operation in the DDR3 interface [1.1.5].

For the read leveling operation, the controller loads pre-defined data from the Multi-Purpose Register (MPR) of the DRAM chip during the calibration period, as shown in Fig. 2.4.5. If read data from the MPR arrives early, the controller sets a larger delay value of DQS and DQ for this DRAM chip. In the opposite case, the controller sets a smaller delay value during the calibration period.



Fig. 2.4.5. Read leveling operation in the DDR3 interface [1.1.5].

In the DDR memory interface, C/A and DQ cannot be transmitted over the same bus topology because the C/A signal should be shared among all chips, whereas the DQ signal is not shared within the same module. If the C/A signal is duplicated instead of shared, so that every chip has a dedicated C/A channel carrying the same content, the C/A signal can be transmitted in the same manner as the DQ signal [1.1.1]. The Terabyte Bandwidth Initiative and serial memory interfaces use this concept. As shown in Fig. 2.4.6, the Terabyte Bandwidth Initiative configures a C/A link and transmits C/A signals with a high bandwidth.



Fig. 2.4.6. C/A link of the Terabyte Bandwidth Initiative interface [1.1.1].

The FBDIMM interface, which adopts the serial link concept and a daisy-chained bus, transmits C/A by packetizing the C/A data together with the write data, as shown in Fig. 2.2.3. With packetizing, the FBDIMM interface does not require a dedicated C/A channel, but additional circuit overhead is necessary for de-packetizing [2.1.6].

# 2.5 SIGNALING AND TERMINATION SCHEME

As shown in Tables 2.3.1 and 2.3.2, each memory interface adopts a different signaling scheme. Memory interfaces in the DDR family (DDRx, GDDRx, and LPDDRx) use a single-ended half-duplex signaling scheme. Half duplex means that TX and RX share a data bus and that neither side transmits data while the other side is transmitting. Memory interfaces based on serial interface technology, such as FBDIMM, SPMT and M-PHY, use differential dual-simplex signaling. Dual simplex means that the interface has dedicated TX and RX channels.

Each memory interface uses a different type of driver circuit. DDRx and LPDDRx use a push-pull type of voltage mode driver in the transmitter, as shown in Fig. 2.5.1 [2.5.1].



Fig. 2.5.1. Voltage mode driver of the DDR memory interface [2.5.1].

The mobile XDR interface uses an N-over-N type of voltage mode driver, as shown in Fig. 2.5.2 [2.5.2].



Fig. 2.5.2. Voltage mode driver of the mobile XDR memory interface [2.5.2].

The voltage mode driver uses less power than the current mode driver by one quarter or half depending on the termination method. Although the voltage mode driver consumes less power, it requires circuit overhead for impedance control, swing level control and its de-emphasis configuration [2.5.3].



Fig. 2.5.3. Current mode driver of XDR, FBDIMM, SPMT and M-PHY.

XDR, the Terabyte Bandwidth Initiative, FBDIMM, SPMT and the M-PHY interface adopt the differential current mode signaling shown in Fig. 2.5.3. Although the current mode driver consumes more power than the voltage mode driver, it can easily adjust its output voltage swing, termination impedance and de-emphasis configuration [2.5.4].

Each type of memory adopts a different destination-side (receiver-side) termination scheme. First, the LPDDR interface does not terminate at the receiver, in order to save termination power. Thus, the LPDDR interface requires careful source termination and rise time control to suppress reflections. A memory interface which uses current mode signaling adopts a fixed form of source-destination double termination, as shown in Fig. 2.5.3. The GDDR interface adopts the Pseudo Open Drain (POD) signaling scheme shown in Fig. 2.5.4 [2.5.5]. When the transmitter transmits a "High" signal, POD signaling does not consume any current. Thus, the Data Bus Inversion (DBI) and Address Bus Inversion (ABI) schemes, which invert the data polarity when the data has more "Low" than "High" signals, are used in the GDDR5 interface.
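The idea behind bus inversion can be shown with a small sketch. It assumes an 8-bit lane group with one extra flag line and assumes that inversion is triggered when a byte holds more "Low" than "High" bits, so that fewer lines draw termination current under POD signaling; the function names, the threshold, and the flag polarity are illustrative and are not taken from the GDDR5 specification.

```python
def dbi_encode(byte):
    """Return (data, inverted) for one 8-bit word.

    The word is inverted when it contains more 0s ("Low") than 1s ("High"),
    because only "Low" bits draw current in a POD-terminated bus.
    """
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected an 8-bit value")
    zeros = 8 - bin(byte).count("1")
    if zeros > 4:
        return (~byte) & 0xFF, True   # send inverted data and set the flag
    return byte, False


def dbi_decode(data, inverted):
    """Undo the inversion at the receiver using the flag."""
    return (~data) & 0xFF if inverted else data
```

For example, `dbi_encode(0x01)` sends `0xFE` with the flag set, so only one data line is driven "Low" instead of seven.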



Fig. 2.5.4. POD signaling for GDDR 3/4/5 [2.5.5].

The signaling scheme of the DDR interface should be considered together with the bus topology. The DDR interface uses the Stub-Series Terminated Logic (SSTL) bus topology [2.2.2]. In the SSTL DQ bus, reflection is suppressed by inserting a series resistor (Rstub) in each stub, as shown in Fig. 2.5.5 (a). The DDR interface adopts a push-pull type of termination scheme. Unlike other memory interfaces, the DDR memory interface selectively turns on the source and destination termination resistors, as shown in Fig. 2.5.5 (b) [2.5.1]. This scheme is termed the On Die Termination (ODT) control scheme. To enhance the voltage margin of the write and read signals, the termination resistor of the active module is disconnected. When the destination is in an open state, the entire signal is reflected, and thus the destination receives twice the incident voltage. Using this phenomenon, the DDR memory interface controls the termination resistor depending on the operation mode. Although this ODT scheme enhances the voltage margin, the reflected signal on the destination side does not settle during high-speed operation faster than 2Gb/s. Thus, a careful channel design is required when the bus uses the selective ODT scheme.




| Configuration |         | For Write |        |        |         | For Read  |        |        |         |
|---------------|---------|-----------|--------|--------|---------|-----------|--------|--------|---------|
| Slot 1        | Slot 2  | Write to  | Slot 1 | Slot 2 | Chipset | Read from | Slot 1 | Slot 2 | Chipset |
| Present       | Present | Slot 1    | OFF    | ON     | OFF     | Slot 1    | OFF    | ON     | ON      |
| Present       | Present | Slot 2    | ON     | OFF    | OFF     | Slot 2    | ON     | OFF    | ON      |
| Present       | Empty   | Slot 1    | ON     | -      | OFF     | Slot 1    | OFF    | -      | ON      |
| Empty         | Present | Slot 2    | -      | ON     | OFF     | Slot 2    | -      | OFF    | ON      |

(b)

Fig. 2.5.5. (a) SSTL bus of the DDR memory interface. (b) ODT table for the DDR2 interface [2.5.1].

# 2.6 EQUALIZATION IN MEMORY INTERFACE

In the conventional DDR memory interface, a channel equalization circuit which compensates for the frequency-dependent loss in the channel is not used because the Inter-Symbol Interference (ISI) caused by reflection is dominant in the moderate Gb/s bandwidth range. However, as the bandwidth of the memory interface increases, the memory interface typically adopts various equalizing circuits [2.6.1]. FBDIMM2 uses 3-tap TX de-emphasis for equalization to achieve a 9.6Gb/s bandwidth [1.1.8]. An asymmetric equalization technique is proposed in the 16Gb/s Terabyte Bandwidth Initiative interface [1.1.1]. As shown in Fig. 2.6.1, the DRAM chip does not have an equalizing circuit due to the slow memory process. Both the TX de-emphasis equalizing circuit for write data and the RX continuous-time linear equalizer (CTLE) for read data are placed on the controller side.



Fig. 2.6.1. Asymmetric equalization scheme in the Terabyte Bandwidth Initiative [1.1.1].

As the data rate of the GDDR memory interface continues to increase, some GDDR5 vendors have begun to adopt an RX decision feedback equalizer (DFE) circuit in GDDR5 DRAM [1.1.3]. In future memory, more complex and powerful equalization schemes will be necessary due to the increased data rate of the memory interface.

# 2.7 EMERGING TECHNOLOGY

To mitigate the limitations of the previously introduced memory interfaces, various circuit and architecture schemes have emerged. Among clocking circuit techniques, the injection-locked oscillator (ILO) has been introduced for jitter tracking of the forwarded clock [2.7.1]-[2.7.3]. Because the ILO achieves a high jitter tracking bandwidth without a feedback loop, a CDR using an ILO consumes less power and occupies a smaller area. However, an ILO requires complex and precise frequency tuning.

In the bus topology area, the 3D-IC scheme is widely used in DRAM interfaces [2.7.4]-[2.7.6]. By eliminating the off-chip interconnect, which carries a heavy capacitive load and the stubs of a multi-drop connection, 3D-IC can achieve dense data transmission with low power consumption. Through-Silicon Via (TSV) technology is used for the wireline 3D-IC scheme. Fig. 2.7.1 shows a DRAM chip that implements TSV technology.



Fig. 2.7.1. DDR DRAM with 3D-TSV technology [2.7.4].

A wireless 3D-IC scheme using inductive coupling [2.7.7]-[2.7.8] and an optical signaling scheme using fiber interconnect [2.7.9]-[2.7.10] have also been actively researched to replace conventional memory interface technology.

# **CHAPTER 3**

# **IMPEDANCE-MATCHED BIDIRECTIONAL MULTI-DROP DQ BUS**

# 3.1 IMPEDANCE-MATCHED BIDIRECTIONAL MULTI-DROP DQ BUS

When a transmission line has an impedance discontinuity, a reflected wave occurs [2.2.1] [2.2.2]. In the case of a conventional SSTL DQ bus, a series resistor of $Z_0/2$ reduces the ringing and reflection, as shown in Fig. 3.1.1 (a), but this bus cannot suppress reflections entirely when there are more than two slots, as the reflection coefficient of an SSTL DQ bus has a non-zero value. This can be expressed as follows:

$$\Gamma_{SSTL} = \frac{Z_T - Z_0}{Z_T + Z_0} = \frac{\left(Z_0 \parallel \left(Z_0 + \frac{Z_0}{2}\right)\right) - Z_0}{\left(Z_0 \parallel \left(Z_0 + \frac{Z_0}{2}\right)\right) + Z_0} = -\frac{1}{4}$$
(3.1.1)





Fig. 3.1.1. Equivalent stub model of (a) a conventional SSTL DQ bus and (b) the proposed IMBM DQ bus.

Thus, a reflected signal, $I_{refl,SSTL}$, is generated at every stub, and these signals propagate across connectors repeatedly and cause overshooting or overdamped responses depending on the position of the module. As the data rate increases, these reflection waves play a key part in generating ISI in an SSTL DQ bus.

Fig. 3.1.1 (b) shows part of the IMBM DQ bus [3.1.1]. The insertion of two resistors of appropriate values at each stub allows us to match the impedances at the stubs without attempting to alter the characteristic impedances of the PCB traces. Resistors can be introduced within the width of a PCB trace with the use of emerging technologies such as embedded PCB resistors or film resistors. The reflection coefficient at each stub of an impedance-matched DQ bus can be expressed as follows:

$$\Gamma_{IMBM} = \frac{Z_T - Z_0}{Z_T + Z_0} = \frac{\left(\left(Z_0 + \frac{Z_0}{k}\right) || \left(Z_0 + kZ_0\right)\right) - Z_0}{\left(\left(Z_0 + \frac{Z_0}{k}\right) || \left(Z_0 + kZ_0\right)\right) + Z_0} = 0$$
(3.1.2)

This shows that, at least in theory, an IMBM DQ bus does not generate a reflected signal at each stub. The ratio between  $I_{k+1,IMBM}$  and  $I_{k,IMBM}$  can be expressed as follows:

$$I_{k+1,IMBM} : I_{k,IMBM} = \left(Z_0 + kZ_0\right) : \left(Z_0 + \frac{Z_0}{k}\right) = k : 1$$
(3.1.3)

This means that an IMBM DQ bus transmits an incident signal to every module with the same current and allows an identical transfer response regardless of the position of a module.
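Equations (3.1.1) to (3.1.3) can be checked numerically. The sketch below uses exact rational arithmetic and a $Z_0$ of 50 Ω, which is only an illustrative value; the helper names (`parallel`, `gamma`, and so on) are introduced here and are not part of the original work.

```python
from fractions import Fraction


def parallel(a, b):
    """Equivalent impedance of two impedances in parallel."""
    return a * b / (a + b)


def gamma(z_t, z0):
    """Reflection coefficient of a line of impedance z0 loaded by z_t."""
    return (z_t - z0) / (z_t + z0)


def gamma_sstl(z0):
    """Eq. (3.1.1): next line Z0 in parallel with a stub of Z0 plus Z0/2."""
    return gamma(parallel(z0, z0 + z0 / 2), z0)


def gamma_imbm(z0, k):
    """Eq. (3.1.2): the two branches seen at a stub with k slots remaining."""
    branch_a = z0 + z0 / k    # line Z0 in series with a resistor Z0/k
    branch_b = z0 + k * z0    # line Z0 in series with a resistor k*Z0
    return gamma(parallel(branch_a, branch_b), z0)


def current_ratio(z0, k):
    """Eq. (3.1.3): ratio of the branch currents, which reduces to k."""
    return (z0 + k * z0) / (z0 + z0 / k)


z0 = Fraction(50)  # illustrative value only
```

With these definitions, `gamma_sstl(z0)` evaluates to -1/4, while `gamma_imbm(z0, k)` is exactly 0 and `current_ratio(z0, k)` equals `k` for any positive integer `k`.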



Fig. 3.1.2. Impedance-matched bidirectional multi-drop (IMBM) DQ bus.



Fig. 3.1.3. Stub-Series Terminated Logic (SSTL) DQ bus.

Fig. 3.1.2 shows an IMBM DQ bus as the counterpart of the conventional SSTL DQ bus shown in Fig. 3.1.3. Resistors of $Z_0$, $Z_0/2$, $Z_0/3$, $2Z_0$, and $3Z_0$ match the impedances in the left-to-right direction along the upper four transmission lines (TLs). Thus, each memory module receives the same voltage from the memory controller, and the memory controller receives the same voltage from each memory module. To prevent reflections at the ends of the channel, the memory controller and the memory modules both have on-die-termination (ODT) resistors.

# 3.2 OPERATION OF IMBM DQ BUS

Although the impedances are only matched in one direction, the IMBM DQ bus can cancel reflective ISI during the transmission of both write and read data. Because there is no reflection at the right-hand ends of the TLs (TL1, TL2, TL3, and TL4) on the motherboard during a write operation, as shown in Fig. 3.2.1, the data stream being written from the memory controller to memory modules #0, #1, #2, and #3 is transmitted without reflections [3.2.1] [3.2.2].



Fig. 3.2.1. Write operation of the IMBM DQ bus.

In a write operation, the maximum turn-around time ( $T_{turn\_around}$ ) occurs when writing from the controller to memory module #3. This time can be expressed in the equation below (3.2.1).  $T_d$  is the flight time of the signal passing through each TL.

$$T_{turn\_around,WR,\#3} = T_{d,TL1} + T_{d,TL2} + T_{d,TL3} + T_{d,TL4} + T_{d,TL8}$$
(3.2.1)

Because reflection does not occur during a write operation, no additional waiting time for the settling of reflections is needed. Reading is performed in a different manner. A read operation from memory module #0 to the memory controller causes a reflection at the top end of TL5, as shown in Fig. 3.2.2, but this signal is absorbed by the ODT resistor of module #0. The read data is split at the stub, and the desired data signal flows from the right-hand end of TL1 to its left-hand end, eventually reaching the memory controller. Meanwhile, unwanted signals flow from left to right through TL2 and proceed towards memory modules #1, #2, and #3, causing no reflections, and are eventually absorbed by the ODT resistors. Thus, although reflections do occur in an IMBM DQ bus during read operations, no reflective ISI arrives at the controller.



Fig. 3.2.2. Read operation of the IMBM DQ bus - from module #0.

If data flows from memory module #1 to the memory controller during a read operation, then a reflection occurs at the top of TL6, as shown in Fig. 3.2.3; however, this signal is absorbed by the ODT resistor, as in the earlier case. Again, the read data is split at the stub and the desired data flows from right to left through TL2, where there is another reflection. This reflected signal travels rightward through TL2 and is absorbed by the ODT resistors, as in the write operation of the bus. Thus, the desired read signal arrives at the memory controller, while all the reflections are absorbed by the ODT resistors. Read operations from memory modules #2 and #3 behave similarly, and again there is no reflective ISI. During a read operation, the maximum turn-around time occurs when reading from memory module #3 to the controller. In this case, the controller waits until the reflection generated at the left side of TL2 has disappeared from the bus, as shown in Fig. 3.2.4. This turn-around time can be expressed by equation (3.2.2). $T_{d,CMD}$ denotes the flight time of the command signals from the controller to the memory modules.

$$T_{turn\_around,RD,\#3} = T_{d,CMD} + \left(T_{d,TL8} + T_{d,TL4} + T_{d,TL3} + T_{d,TL2}\right) + \max\left\{T_{d,TL1}, \left(T_{d,TL2} + T_{d,TL3} + T_{d,TL4} + T_{d,TL8}\right)\right\}$$
(3.2.2)
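The two worst-case turn-around times can be evaluated with a short sketch. The flight times used in the example are invented values in picoseconds, chosen only to show the arithmetic; the function and parameter names are likewise assumptions. `td_main` lists the flight times of the four motherboard lines (TL1 to TL4) and `td_stub` is the flight time of the stub line of the farthest module.

```python
def write_turnaround_last(td_main, td_stub):
    """Worst-case write: controller to the farthest module.

    The signal crosses every motherboard line and then the module stub.
    """
    return sum(td_main) + td_stub


def read_turnaround_last(td_cmd, td_main, td_stub):
    """Worst-case read: command flight, read data flight, then settling.

    After the read data reaches the left end of TL2, the controller waits
    for the longer of (a) the data crossing TL1 and (b) the reflection
    travelling back along TL2..TL4 and the stub to be absorbed.
    """
    to_first_junction = td_stub + sum(td_main[1:])
    return td_cmd + to_first_junction + max(td_main[0], to_first_junction)


# Hypothetical flight times in picoseconds (illustrative only).
td_main = [200, 100, 100, 100]
td_stub = 50
td_cmd = 600
```

With these numbers the write turn-around time is 550 ps and the read turn-around time is 1300 ps; the read case is dominated by the wait for the reflection to be absorbed at the far end.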

In summary, the IMBM DQ bus generates no reflections during write operations, and the reflections that are generated during read operations do not reach the memory controller. Therefore, the IMBM DQ bus transmits and receives both read and write signals without reflective ISI. In addition, the IMBM DQ bus is tolerant to stub length variation. Because the impedance is matched in the rightward direction, no additional time for the settling of reflections is required, except for residual reflections due to parasitics. In the case of the SSTL DQ bus, the stub resistors only suppress reflections instead of fully eliminating them. Thus, the signal behavior of the SSTL DQ bus depends on the stub length. Both the SSTL DQ bus and the IMBM DQ bus are affected by mismatches in the characteristic impedance and the series resistor values.



Fig. 3.2.3. Read operation of the IMBM DQ bus - from module #1.



Fig. 3.2.4. Read operation of the IMBM DQ bus - from module #3.

# 3.3 GENERALIZED IMBM DQ BUS

In section 3.1, the IMBM DQ bus is introduced, where the characteristic impedance of the memory module's PCB trace and motherboard's PCB trace is identical. Generally, their characteristic impedances can differ. In this section, a generalized IMBM DQ bus will be established and its resistor value will be derived. We call the characteristic impedance of the motherboard's PCB  $Z_1$  and refer to that of the memory module's PCB as  $Z_2$ , as shown in Fig. 3.3.1.



Fig. 3.3.1. Generalized IMBM DQ bus  $(Z_1 \ge Z_2)$ .

#### 3.3.1 $Z_1 \ge Z_2$ CASE

If $Z_1$ is larger than $Z_2$, $Z_{B0}$ can be a series resistor. Because the equivalent impedance seen from the right end of $TL_{T1}$ is $(Z_{B0} + Z_2)$, the resistance of $Z_{B0}$ should be $Z_1 - Z_2$. To satisfy the condition of the IMBM DQ bus, the two equations below should be satisfied for the (N-k-1)th stub. Here, k is the number of remaining slots on the right side, as shown in Fig. 3.1.1.

$$(Z_{Tk} + Z_1) || (Z_{Bk} + Z_2) = Z_1$$
(3.3.1)

$$(Z_{Tk} + Z_1): (Z_{Bk} + Z_2) = 1:k$$
 (3.3.2)

Equation (3.3.1) is the condition for no reflection at the right end of the top TLs ($TL_{T1}$ to $TL_{TN}$), and Equation (3.3.2) is the condition for an equal current distribution among all modules. Equations (3.3.1) and (3.3.2) can be expressed with the variables $Z_{Tk}$ and $Z_{Bk}$:

$$Z_{Tk}(Z_1 - Z_2 - Z_{Bk}) + Z_1^2 = 0$$
(3.3.3)

$$Z_{Bk} = kZ_{Tk} + kZ_1 - Z_2 \tag{3.3.4}$$

After inserting (3.3.4) into (3.3.3) and solving the resulting quadratic equation, the roots for $Z_{Tk}$ are $Z_1/k$ and $-Z_1$. Because we cannot implement a passive negative resistance in the PCB, the resistance values of $Z_{Tk}$ and $Z_{Bk}$ are determined as follows:

$$Z_{Tk} = \frac{Z_1}{k} \tag{3.3.5}$$

$$Z_{Bk} = (k+1)Z_1 - Z_2 \tag{3.3.6}$$

Thus, the generalized IMBM DQ bus in the case of  $(Z_1 \ge Z_2)$  can be described as shown in Fig. 3.3.2.
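The resistor values in (3.3.5) and (3.3.6) can be checked numerically. The following is a minimal sketch, not part of the original design flow: the function names and the impedance values (60 and 40 ohms) are assumptions chosen only for illustration. It verifies that, at every stub, the parallel combination of the top branch $(Z_{Tk} + Z_1)$ and the bottom branch $(Z_{Bk} + Z_2)$ equals $Z_1$, and that the two branch impedances are in the ratio 1:k.

```python
from fractions import Fraction


def parallel(a, b):
    """Equivalent resistance of two resistors in parallel."""
    return a * b / (a + b)


def imbm_resistors_series_case(z1, z2, k):
    """Stub resistors for the Z1 >= Z2 case, following (3.3.5) and (3.3.6)."""
    z_tk = Fraction(z1, k)
    z_bk = (k + 1) * z1 - z2
    return z_tk, z_bk


# Illustrative impedances in ohms (assumed values, not taken from the thesis).
Z1, Z2 = 60, 40

for k in range(1, 8):
    z_tk, z_bk = imbm_resistors_series_case(Z1, Z2, k)
    top = z_tk + Z1       # branch toward the k remaining slots
    bottom = z_bk + Z2    # branch toward the module on this stub
    assert parallel(top, bottom) == Z1   # matched when looking rightward
    assert bottom == k * top             # branch impedances in the ratio 1:k
    print(f"k={k}: Z_Tk={float(z_tk):.2f} ohm, Z_Bk={float(z_bk):.2f} ohm")
```

Exact rational arithmetic (`Fraction`) is used so that the matching condition can be tested with equality rather than with a tolerance.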



Fig. 3.3.2. Generalized IMBM DQ bus  $(Z_1 \ge Z_2)$  with resistor values.

#### 3.3.2 $Z_1 < Z_2$ CASE

If $Z_1$ is smaller than $Z_2$, $Z_{B0}$ cannot be connected in series as shown in Fig. 3.1.11. To match the equivalent impedance of the combination of $Z_{B0}$ and $Z_2$ to $Z_1$, $Z_{B0}$ should be connected in parallel, as shown in Fig. 3.3.3.



Fig. 3.3.3. Generalized IMBM DQ bus  $(Z_1 < Z_2)$ .

To match the impedance, the resistance of  $Z_{B0}$  should take the value below.

$$Z_{B0} = \frac{Z_1 Z_2}{Z_2 - Z_1} \tag{3.3.7}$$

Next, the current required by the last module increases by $I_T$, as shown in the right-hand part of Fig. 3.3.3. Because $Z_{B0}$ is in parallel with $Z_2$, the voltage drops $Z_{B0} I_T$ and $Z_2 I_B$ are equal, so the relationship between $I_B$ and $I_T$ can be expressed as follows:

$$\frac{Z_1 Z_2}{Z_2 - Z_1} : I_B = Z_2 : I_T \tag{3.3.8}$$

$$I_T = \left(\frac{Z_2}{Z_1} - 1\right) I_B \tag{3.3.9}$$

Thus, the current-distribution condition (3.3.2) of the previous case takes on the following form:

$$(Z_{Tk} + Z_1): (Z_{Bk} + Z_2) = 1: \left(k + \left(\frac{Z_2}{Z_1} - 1\right)\right) \tag{3.3.10}$$

This equation can be rewritten as follows:

$$Z_{Bk} = \left(\frac{Z_2}{Z_1} + k - 1\right) Z_{Tk} + (k - 1) Z_1 \tag{3.3.11}$$

Substituting (3.3.11) into (3.3.3) and solving the resulting quadratic equation gives the roots of $Z_{Tk}$:

$$Z_{Tk} = \frac{Z_1^2}{(k-1)Z_1 + Z_2} \quad \text{or} \quad -Z_1 \tag{3.3.12}$$

Because a passive negative resistance cannot be implemented on a PCB, as in the previous case, the resistance values of $Z_{Tk}$ and $Z_{Bk}$ are given by the following equations:

$$Z_{Tk} = \frac{Z_1^2}{(k-1)Z_1 + Z_2} \tag{3.3.13}$$

$$Z_{Bk} = kZ_1 \tag{3.3.14}$$

Thus, the generalized IMBM DQ bus in the case of  $(Z_1 < Z_2)$  can be described as shown in Fig. 3.3.4.
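The algebra behind these values can be written out as a short sketch. The shorthand $a = Z_2/Z_1 + k - 1$ is introduced here only for brevity and does not appear in the original derivation. Starting from the no-reflection condition $Z_{Tk}(Z_1 - Z_2 - Z_{Bk}) + Z_1^2 = 0$ and the current-distribution condition $Z_{Bk} = aZ_{Tk} + (k-1)Z_1$:

$$
\begin{aligned}
0 &= Z_{Tk}\left(Z_1 - Z_2 - aZ_{Tk} - (k-1)Z_1\right) + Z_1^2 \\
  &= -aZ_{Tk}^2 - \left(Z_2 + (k-2)Z_1\right)Z_{Tk} + Z_1^2 \\
  &= -\left(aZ_{Tk} - Z_1\right)\left(Z_{Tk} + Z_1\right)
\end{aligned}
$$

The factorization uses $(a-1)Z_1 = Z_2 + (k-2)Z_1$. The positive root is $Z_{Tk} = Z_1/a = Z_1^2/((k-1)Z_1 + Z_2)$, and substituting it back gives $Z_{Bk} = Z_1 + (k-1)Z_1 = kZ_1$. As a consistency check, setting $Z_2 = Z_1$ yields $Z_{Tk} = Z_1/k$ and $Z_{Bk} = kZ_1$, which agrees with the result $(k+1)Z_1 - Z_2$ of the $Z_1 \ge Z_2$ case.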



Fig. 3.3.4. Generalized IMBM DQ bus  $(Z_1 < Z_2)$  with resistor values.

#### **3.3.3 DOUBLE-SIDED MODULE CASE**

To increase the memory capacity, today's memory modules are built as double-sided DIMMs. Each memory module has memory chips on both its top and bottom sides, connected by through-hole vias. Because two transmission lines branch from each via, as shown in Fig. 3.3.5, the equivalent impedance seen at the via is halved to $Z_2/2$. To cancel the reflection at this via, therefore, $Z_2/2$-ohm resistors should be inserted in the double-sided DIMM, as shown in Fig. 3.3.5.



Fig. 3.3.5. Generalized IMBM DQ bus  $(Z_1 \ge Z_2)$  with a double-sided module.

## 3.4 STEADY-STATE RESISTOR MODEL OF IMBM DQ BUS

To compare the voltage swing levels of the conventional SSTL DQ bus and the IMBM DQ bus, a steady-state resistor model of both types of DQ buses is analyzed in this section. In high-speed data transmission, the transmission lines cause various reflected waves and transient responses until the reflections settle. Therefore, the transient voltage of the transmitted signal can vary with the reflection environment. After the signal settles, however, the steady-state voltage of the transmitted signal converges to a fixed value that is not affected by the reflections. In addition, the steady-state voltage is determined only by the resistor network and not by the characteristics of the transmission lines. In other words, the steady-state voltage value of a resistor network containing transmission lines is identical to the value of the resistor network in which the transmission lines are replaced with simple wires. Thus, we can determine the steady-state voltage value and the voltage loss from the memory controller to the memory module if we model both the SSTL DQ bus and the IMBM DQ bus as resistor networks.

First, single-sided, N-slot, N-drop DQ buses are described, after which double-sided, N-slot, 2N-drop DQ buses are introduced. In the case of the SSTL DQ bus, a selective ODT scheme can be used to enhance the voltage margin, and the SSTL DQ bus with this scheme is also analyzed. The selective ODT scheme is only available in the SSTL DQ bus because the reflection coefficients of the bottom transmission line (TL<sub>bottom</sub>) in the IMBM DQ bus differ from case to case. Figs. 3.4.1 (a) and (b) show the equivalent stub model and the equivalent impedance seen from the bottom transmission line for both types of DQ buses.



Fig. 3.4.1. Equivalent stub model and equivalent impedance from the bottom transmission line of (a) a conventional SSTL DQ bus and (b) the proposed IMBM DQ bus.

The reflection coefficient of the bottom transmission line from the bottom to the stub is as follows:

$$\Gamma_{SSTL,bot} = \frac{Z_{bot} - Z_0}{Z_{bot} + Z_0} = \frac{\left(\frac{Z_0}{2} + (Z_0 \parallel Z_0)\right) - Z_0}{\left(\frac{Z_0}{2} + (Z_0 \parallel Z_0)\right) + Z_0} = 0$$
(3.4.1)

The reflection coefficient of the IMBM DQ bus in the same case can be expressed as follows:

$$\Gamma_{IMBM,bot} = \frac{Z_{bot} - Z_0}{Z_{bot} + Z_0} = \frac{\left(kZ_0 + \left(Z_0 \| \left(\frac{Z_0}{k} + Z_0\right)\right)\right) - Z_0}{\left(kZ_0 + \left(Z_0 \| \left(\frac{Z_0}{k} + Z_0\right)\right)\right) + Z_0} = \left(\frac{k}{k+1}\right)^2 \quad (3.4.2)$$

As shown in equation (3.4.1), the bottom of the SSTL stub is matched, so a signal reflected by an open ODT resistor is not reflected again when it returns to the stub. In the IMBM DQ bus, on the other hand, equation (3.4.2) shows that a signal reflected by an open ODT resistor is reflected again at the stub. As a result, the selective ODT control scheme, which disconnects the ODT resistor of the active module, cannot be used with the IMBM DQ bus. Note, however, that although the signal reflected by the open ODT resistor in the SSTL DQ bus produces no further reflection on the bottom transmission line, it is still reflected by the upper transmission lines and the other stubs. Therefore, during high-speed data transmission, the selective ODT scheme enhances the voltage margin at the cost of signal integrity.
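The two reflection coefficients can be reproduced with a few lines of exact arithmetic. This is an illustrative sketch only: the helper names are hypothetical, and $Z_0$ is normalized to 1. It evaluates the impedance seen from the bottom transmission line for each bus and confirms the closed forms of (3.4.1) and (3.4.2).

```python
from fractions import Fraction


def parallel(a, b):
    """Equivalent resistance of two resistors in parallel."""
    return a * b / (a + b)


def gamma(z_load, z0):
    """Reflection coefficient of a line of impedance z0 terminated by z_load."""
    return (z_load - z0) / (z_load + z0)


def gamma_sstl_bottom(z0=Fraction(1)):
    """SSTL stub seen from the bottom line: Z0/2 in series with Z0 || Z0."""
    return gamma(z0 / 2 + parallel(z0, z0), z0)


def gamma_imbm_bottom(k, z0=Fraction(1)):
    """IMBM stub seen from the bottom line: k*Z0 in series with Z0 || (Z0/k + Z0)."""
    return gamma(k * z0 + parallel(z0, z0 / k + z0), z0)


assert gamma_sstl_bottom() == 0
for k in range(1, 8):
    assert gamma_imbm_bottom(k) == Fraction(k, k + 1) ** 2
    print(f"k={k}: Gamma_IMBM,bot = {gamma_imbm_bottom(k)}")
```

Even for k = 1 a quarter of the returning wave is reflected again, and the fraction grows toward 1 for stubs closer to the memory controller, which illustrates why an open ODT resistor is not acceptable on the IMBM DQ bus.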

#### **3.4.1 SINGLE-SIDED MODULE CASE**

Fig. 3.4.2 shows the steady-state resistor model of the single-sided SSTL DQ bus with all ODT resistors on, that is, without any control of the ODT resistor of the active module. Fig. 3.4.3 shows the same SSTL DQ bus with the selective ODT scheme. With single-sided modules, the number of slots equals the number of drops.



Fig. 3.4.2. Steady-state resistor model of the single-sided SSTL DQ bus without the selective ODT scheme (N: number of slots; N drops).



Fig. 3.4.3. Steady-state resistor model of the single-sided SSTL DQ bus with the selective ODT scheme (N: number of slots; N drops).

For a write operation, the memory controller transmits a signal with a voltage of $V_{MC}$ and the memory module receives the signal with a voltage of $V_{MM}$. N is the number of slots, which equals the number of drops. At node $n_{N-1}$ in Fig. 3.4.2, the equivalent impedance on the right is $(Z_0/2+Z_0)/N$. Thus, the voltage gain of Fig. 3.4.2 is equal to

$$\frac{V_{MM}}{V_{MC}} = \frac{\frac{(3/2)Z_0}{N}}{Z_0 + \frac{(3/2)Z_0}{N}} \frac{Z_0}{(1/2)Z_0 + Z_0} = \frac{2}{2N+3}$$
(3.4.3)

In Fig. 3.4.3, on the other hand, the equivalent impedance at node $n_{N-1}$ is $(Z_0/2+Z_0)/(N-1)$, and no voltage is dropped across the $Z_0/2$-ohm stub resistor of the active module, because its ODT resistor is open and draws no current. Thus, the voltage gain in Fig. 3.4.3 can be expressed as follows:

$$\frac{V_{MM}}{V_{MC}} = \frac{\frac{(3/2)Z_0}{N-1}}{Z_0 + \frac{(3/2)Z_0}{N-1}} = \frac{3}{2N+1}$$
(3.4.4)
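The two SSTL gains can be reproduced by reducing the resistor network step by step. The sketch below is illustrative: the function name is hypothetical, $Z_0$ is normalized to 1, and a driver source resistance of $Z_0$ is assumed, as implied by the first divider in (3.4.3) and (3.4.4).

```python
from fractions import Fraction


def sstl_write_gain(n_slots, selective_odt, z0=Fraction(1)):
    """Steady-state V_MM/V_MC of the single-sided SSTL DQ bus.

    With selective_odt=True, n_slots must be at least 2.
    """
    module = z0 / 2 + z0  # stub resistor Z0/2 in series with the ODT resistor Z0
    if selective_odt:
        # The active module's ODT is open, so it draws no current and sees the
        # full node voltage; only the other N-1 modules load the node.
        bus = module / (n_slots - 1)
        return bus / (z0 + bus)
    # All ODTs on: N modules load the node, and the active module divides the
    # node voltage between its stub resistor and its ODT resistor.
    bus = module / n_slots
    return bus / (z0 + bus) * z0 / (z0 / 2 + z0)


for n in (2, 3, 4):
    all_on = sstl_write_gain(n, selective_odt=False)
    selective = sstl_write_gain(n, selective_odt=True)
    assert all_on == Fraction(2, 2 * n + 3)     # closed form of (3.4.3)
    assert selective == Fraction(3, 2 * n + 1)  # closed form of (3.4.4)
    print(f"N={n}: all ODTs on {float(all_on):.3f}, selective ODT {float(selective):.3f}")
```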

A steady-state resistor model of the single-sided IMBM DQ bus is shown in Fig. 3.4.4. In the IMBM DQ bus, the equivalent impedance seen to the right of each node $n_{1 \sim N-1}$ is always $Z_0$. Using this property, the $V_{MM}$ voltage can be calculated by successive voltage divisions. The voltage gain of the single-sided IMBM DQ bus is given by equation (3.4.5).


Fig. 3.4.4. Steady-state resistor model of the single-sided IMBM DQ bus (N: number of slots; N drops).

$$\begin{aligned}
\frac{V_{MM}}{V_{MC}} &= \left(\frac{Z_0}{Z_0 + Z_0}\right) \left(\prod_{i=N-1}^{k+1} \frac{Z_0}{\frac{Z_0}{i} + Z_0}\right) \left(\frac{Z_0}{kZ_0 + Z_0}\right) \\
&= \frac{1}{2} \left(\prod_{i=N-1}^{k+1} \frac{i}{1+i}\right) \frac{1}{k+1} = \frac{1}{2} \frac{N-1}{N} \frac{N-2}{N-1} \dots \frac{k+1}{k+2} \frac{1}{k+1} = \frac{1}{2N}
\end{aligned} \tag{3.4.5}$$
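The successive voltage divisions can also be carried out numerically on the resistor ladder. The following sketch assumes equal module and motherboard impedances, so that the top and bottom resistors of the stub with i remaining slots are $Z_0/i$ and $iZ_0$, together with a driver source resistance of $Z_0$ and an ODT resistance of $Z_0$. The function name is hypothetical and $Z_0$ is normalized to 1.

```python
from fractions import Fraction


def parallel(a, b):
    """Equivalent resistance of two resistors in parallel."""
    return a * b / (a + b)


def imbm_write_gain(n_slots, k, z0=Fraction(1)):
    """V_MM/V_MC for the module with k slots remaining on its right (0 <= k < N)."""
    # Build the impedance seen looking rightward, starting at the last module,
    # and confirm that every stub node presents exactly Z0.
    z_in = z0
    for i in range(1, n_slots):
        top = z0 / i + z_in      # series top resistor, then the rest of the bus
        bottom = i * z0 + z0     # series bottom resistor, then the module ODT
        z_in = parallel(top, bottom)
        assert z_in == z0
    gain = z_in / (z0 + z_in)            # driver source resistance against the bus
    for i in range(n_slots - 1, k, -1):  # walk from node N-1 down to node k
        gain *= z0 / (z0 / i + z0)
    gain *= z0 / (k * z0 + z0)           # bottom resistor against the module ODT
    return gain


for n in (2, 3, 4):
    gains = [imbm_write_gain(n, k) for k in range(n)]
    assert all(g == Fraction(1, 2 * n) for g in gains)
    print(f"N={n}: gain = {gains[0]} for every module position")
```

The gain is independent of the module position k, which is consistent with the equal current distribution that the resistor values were designed to provide.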

#### 3.4.2 DOUBLE-SIDED MODULE CASE

The steady-state resistor model of the double-sided module can be built in the same way as that of the single-sided module, and the voltage gain can be calculated with the same method. Fig. 3.4.5 and Fig. 3.4.6 show the models of the SSTL DQ bus. The voltage gain of each model is derived in equations (3.4.6) and (3.4.7).



Fig. 3.4.5. Steady-state resistor model of the double-sided SSTL DQ bus without the selective ODT scheme (N: number of slots; 2N drops).

$$\frac{V_{MM}}{V_{MC}} = \frac{\frac{Z_0}{N}}{Z_0 + \frac{Z_0}{N}} \frac{(1/2)Z_0}{(1/2)Z_0 + (1/2)Z_0} = \frac{1}{2(N+1)}$$
(3.4.6)



Fig. 3.4.6. Steady-state resistor model of the double-sided SSTL DQ bus with the selective ODT scheme (N: number of slots; 2N drops).

$$\frac{V_{MM}}{V_{MC}} = \frac{\frac{Z_0}{N-1}}{Z_0 + \frac{Z_0}{N-1}} = \frac{1}{N+1}$$
(3.4.7)

The double-sided IMBM DQ bus can also be modeled as shown in Fig. 3.4.7.

Fig. 3.4.7. Steady-state resistor model of the double-sided IMBM DQ bus (N: number of slots; 2N drops).

The voltage gain of the double-sided IMBM DQ bus is equal to

$$\begin{aligned}
\frac{V_{MM}}{V_{MC}} &= \left(\frac{Z_0}{Z_0 + Z_0}\right) \left(\prod_{i=N-1}^{k+1} \frac{Z_0}{\frac{Z_0}{i} + Z_0}\right) \left(\frac{(1/2)Z_0}{kZ_0 + Z_0}\right) \\
&= \frac{1}{2} \left(\prod_{i=N-1}^{k+1} \frac{i}{1+i}\right) \frac{1}{2(k+1)} = \frac{1}{2} \frac{N-1}{N} \frac{N-2}{N-1} \dots \frac{k+1}{k+2} \frac{1}{2(k+1)} = \frac{1}{4N}
\end{aligned} \tag{3.4.8}$$

#### **3.4.3 VOLTAGE GAIN COMPARISON**

In the previous sections, the voltage gains of both the SSTL DQ bus and the IMBM DQ bus from the memory controller to the memory module $(V_{MM}/V_{MC})$ were derived. The reverse voltage gain $(V_{MC}/V_{MM})$ can be easily obtained using the reciprocity theorem [3.4.1].



Fig. 3.4.8. Reciprocity.

The reciprocity theorem shown in Fig. 3.4.8 reads as follows: "If a voltage source  $E_1$  acting in one branch of a linear passive network causes a current  $I_2$  to flow in another branch of the linear passive network, then the same voltage source  $E_2$  acting in the second branch would cause an identical current  $I_1$  to flow in the first branch (If  $E_1=E_2$ ,  $I_2=I_1$ )." Because the steady-state resistor model of both the SSTL and the IMBM DQ bus consists of only passive resistors, the forward voltage gain ( $V_{MM}/V_{MC}$ ) and reverse voltage gain ( $V_{MC}/V_{MM}$ ) are identical according to this reciprocity theorem. Therefore, we do not have to derive the reverse voltage gain.

The voltage gain formulas show that the voltage gain decreases as the number of slots increases. This trend applies to both DQ buses. Comparing the two, the voltage gain of the SSTL DQ bus is larger than that of the IMBM DQ bus by up to 2~3 times, depending on the number of slots and the ODT scheme. Although the SSTL DQ bus has a higher voltage gain and wastes less power, it cannot be used in a high-speed, high-capacity channel, as mentioned earlier, and using the selective ODT scheme degrades its signal integrity further. In other words, the IMBM DQ bus cancels the reflected wave at every stub at the cost of voltage gain and power consumption. In addition, the maximum number of slots of the IMBM DQ bus is limited by the transmitted voltage and the sensitivity of the receiver circuits. Tables 3.4.1 and 3.4.2 summarize the voltage gains of both DQ buses.

| # of<br>slots    | Single-sided (N-drop) |                       |                       |
|------------------|-----------------------|-----------------------|-----------------------|
|                  | SSTL<br>(active open) | SSTL<br>(all ODTs on) | IMBM<br>(all ODTs on) |
| N                | 3/(2N+1)              | 2/(2N+3)              | 1/(2N)                |
| 2                | 0.6                   | 0.286                 | 0.25                  |
| 3                | 0.429                 | 0.222                 | 0.167                 |
| 4                | 0.333                 | 0.182                 | 0.125                 |
| Signal integrity |                       | -                     | +                     |

Table 3.4.1. Voltage Attenuation Comparison of the Single-sided DQ Bus.

Table 3.4.2. Voltage Attenuation Comparison of the Double-sided DQ Bus.

| # of<br>slots    | Double-sided (2N-drop) |                       |                       |
|------------------|------------------------|-----------------------|-----------------------|
|                  | SSTL<br>(active open)  | SSTL<br>(all ODTs on) | IMBM<br>(all ODTs on) |
| N                | 1/(N+1)                | 1/(2(N+1))            | 1/(4N)                |
| 2                | 0.333                  | 0.167                 | 0.125                 |
| 3                | 0.25                   | 0.125                 | 0.083                 |
| 4                | 0.2                    | 0.1                   | 0.063                 |
| Signal integrity |                        | -                     | +                     |

## **CHAPTER 4**

## **MEMORY CONTROLLER TRANSCEIVER**

#### 4.1 MEMORY CONTROLLER TRANSCEIVER ARCHITECTURE

Fig. 4.1.1 shows the architecture of a memory controller transceiver designed to support the IMBM DQ bus. This transceiver includes four data (DQ) channels, a data strobe (DQS) channel, a PLL, and a clock tree. Each bidirectional DQ channel consists of a transmitter (TX) and a receiver (RX). The TX is made up of a PRBS generator for BER testing, a four-tap 8:1 serializer, and a four-tap current-mode driver that performs de-emphasis. The RX consists of a linear equalizing buffer [4.1.1], a sampler, a 2:8 deserializer, and a PRBS verifier. The PLL and clock tree provide a TX clock for the serializer and a multi-phase clock for the strobe recovery unit (SRU) of the DQS block. The DQS channel generates a strobe signal with the same phase as the DQ write data. This strobe signal is used for timing recovery in the RX of a memory module and is not used for direct sampling in the sampling circuit. Instead, the recovered strobe signal, which is generated from the PLL and the phase interpolator, is used for data sampling. Because the IMBM DQ bus attenuates the voltage of the signals in inverse proportion to the number of modules, direct sampling of data using the received strobe is not advisable due to the large amount of power required to limit the input strobe signal.



Fig. 4.1.1. Memory controller transceiver block diagram and its clocking architecture.

Fig. 4.1.1 also shows the SRU and RX clocking architecture in detail. The SRU has a dual-loop architecture for timing recovery [2.3.3]. In order to generate a sampling clock with the proper phase for every DQ, a phase interpolator and a half-rate bang-bang phase detector (PD) are used in the SRU. The central PLL generates multi-phase clock signals which are delivered by the clock tree to drive the phase interpolators. The SRU loop control block, which is composed of a deserializer, a first-order $\Delta\Sigma$ modulator, and a finite-state machine, receives the early and late information from the PD and then supplies the appropriate up or down control bit to the phase interpolator. The $\Delta\Sigma$ modulator dithers this control bit to prevent the entire strobe recovery loop from entering a limit cycle. The proposed memory controller transceiver does not have to track a frequency offset; this avoids the need for an integral control path in the strobe recovery loop and thus prevents the $\Delta\Sigma$ modulator from causing a stability problem. A duty-cycle corrector (DCC) is also used to adjust the sampling clock, which eliminates the duty-cycle distortion that would otherwise be caused by the clock tree and phase interpolator. All DQ channels must receive the recovered sampling clock with the same phase; therefore, the shorting clock method [4.1.2] is used to reduce the on-chip clock skew. To eliminate skew between each DQ channel and the SRU, the sampling clock signal used in the PD traverses the clock tree in parallel with the recovered clock. This increases the loop latency of the entire SRU and thus the limit cycle, but the first-order $\Delta\Sigma$ modulator in this loop can cope with this extended latency, which therefore has a negligible effect on the jitter of the recovered clock signal.

#### 4.2 TX CIRCUITS OF THE TRANSCEIVER

To compensate for the non-ideal nature of the channels, the memory controller transceiver uses a linear equalizer in RX mode and performs de-emphasis in TX mode. To achieve the latter, an 8:1 serializer with a 4-tap output is implemented, as shown in Fig. 4.2.1. The front 8:4 and 4:2 serializers use the parallel 5-latch 2:1 serializer shown in Fig. 4.2.2 (a), whose timing diagram is shown in Fig. 4.2.2 (b). The final 2:1 serializer, shown in Fig. 4.2.3, has a four-tap output, and the driver has a separate component for each tap. Instead of using an XOR gate [4.2.1], two intermediate multiplexers that select the tap polarity while generating a negative output signal are employed. The differential output signal is driven by the final multiplexers, which are in turn driven by the delayed clock signal CLKD, as the cascaded multiplexers need a timing margin to operate at a high speed. The four-tap serialized signal is then delivered to the four-tap current-mode driver shown in Fig. 4.2.4. The equalization coefficients are controlled by means of the tap weight (TW) control signal.



Fig. 4.2.1. 8:1 serializer with a 4-tap output.



Fig. 4.2.2. (a) Five-latch 2:1 serializer and (b) its timing diagram.



Fig. 4.2.3. Four-tap half-rate serializer with a differential output.



Fig. 4.2.4. Four-tap current-mode driver.

Fig. 4.2.5 shows the proposed duty-cycle corrector (DCC). To reduce its area and complexity, the controlled and uncontrolled transistors in the DCC buffer are separated, as shown in Fig. 4.2.5 (c). An equivalent circuit model of the jth stage of an (m–1)-stage DCC buffer is given in Fig. 4.2.5 (b). The resolution of the DCC in the jth stage can then be expressed as follows:

$$\left(\frac{R_1}{m-j} + R_2\right)C - \left(\left(\frac{R_1}{j} \parallel \frac{R_1}{m-j}\right) + R_2\right)C = \frac{j}{m(m-j)}R_1C \tag{4.2.1}$$

In this design, m is 5 and thus the resolutions of consecutive stages are 1/20, 2/15, 3/10 and 4/5. Because the ratios between successive resolutions are close to 2, the DCC buffer can achieve linearity despite its simple design and small area. Fig. 4.2.6 shows the measured DCC linearity curve for 2.4GHz DQS and DQSB output signals. The DCC can cover a duty cycle ranging from -42% to 38%.
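The quoted resolutions follow directly from the stage model. The short sketch below reads (4.2.1) as the difference between two RC time constants, which is an interpretation assumed here; the function name is hypothetical, and the result is expressed in units of $R_1C$ because $R_2$ cancels in the difference.

```python
from fractions import Fraction


def parallel(a, b):
    """Equivalent resistance of two resistors in parallel."""
    return a * b / (a + b)


def dcc_resolution(j, m):
    """Delay step of the j-th DCC stage in units of R1*C (R2 cancels out)."""
    r_uncontrolled_only = Fraction(1, m - j)                      # R1/(m-j)
    r_both = parallel(Fraction(1, j), Fraction(1, m - j))         # (R1/j) || (R1/(m-j)) = R1/m
    return r_uncontrolled_only - r_both


m = 5
steps = [dcc_resolution(j, m) for j in range(1, m)]
ratios = [float(b / a) for a, b in zip(steps, steps[1:])]
print("resolutions:", [str(s) for s in steps])
print("successive ratios:", [round(r, 2) for r in ratios])
```

For m = 5 the successive ratios lie between 2.25 and about 2.67, so the stage weights are only approximately binary, which is the sense in which the text describes them as close to 2.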






Fig. 4.2.5. (a) Overall duty-cycle corrector structure, (b) simple equivalent model of the jth stage of the DCC buffer, and (c) a schematic diagram of the DCC buffer.



Fig. 4.2.6. Measured DCC linearity of a 2.4GHz DQS signal.

## 4.3 RX CIRCUITS OF THE TRANSCEIVER

Fig. 4.3.1 shows a detailed block diagram of the SRU. Two phase interpolators generate the recovered strobe signal from the PLL signals. An in-phase (I-phase) phase interpolator is used for data sampling and a quadrature-phase (Q-phase) phase interpolator is used for edge sampling and phase detection. Each phase interpolator consists of a CMOS-to-CML (current-mode logic) stage, a phase interpolator core stage operating at CML levels, and a CML-to-CMOS stage. Each phase interpolator core has a left mux and a right mux. For seamless phase switching, the left mux selects among the $0^{\circ}$, $90^{\circ}$, $180^{\circ}$, and $270^{\circ}$ phases of the multi-phase clock, the right mux selects among the $45^{\circ}$, $135^{\circ}$, $225^{\circ}$, and $315^{\circ}$ phases, and the Q-phase phase interpolator selects a phase shifted by $90^{\circ}$ from that of the I-phase phase interpolator.
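One way to picture the two-mux arrangement is as a mapping from a phase code to a pair of mux selections and an interpolation weight. The sketch below is a behavioral illustration only, not the implemented controller: the number of steps per 45-degree segment, the code format, the function names, and the assumption of ideal linear interpolation are all hypothetical. It shows why alternating the phases between the two muxes allows each mux to switch only while its input carries no weight.

```python
STEPS = 16  # interpolation steps per 45-degree segment (assumed value)
LEFT_PHASES = (0, 90, 180, 270)     # inputs of the left mux, in degrees
RIGHT_PHASES = (45, 135, 225, 315)  # inputs of the right mux, in degrees


def pi_settings(code):
    """Return (left_phase, right_phase, right_weight) for a phase code."""
    segment, step = divmod(code % (8 * STEPS), STEPS)
    left = LEFT_PHASES[((segment + 1) // 2) % 4]
    right = RIGHT_PHASES[segment // 2]
    # Even segments sweep from the left phase to the right phase, odd ones back.
    right_weight = step if segment % 2 == 0 else STEPS - step
    return left, right, right_weight


def ideal_phase(code):
    """Output phase in degrees, assuming ideal linear interpolation."""
    left, right, w = pi_settings(code)
    if (left, right) == (0, 315):
        left = 360  # unwrap across the segment from 315 degrees back to 0
    return ((left * (STEPS - w) + right * w) / STEPS) % 360


def q_settings(code):
    """Q-phase interpolator: the same mapping, shifted by 90 degrees."""
    return pi_settings(code + 2 * STEPS)


for code in range(8 * STEPS):
    before, after = pi_settings(code), pi_settings(code + 1)
    if before[0] != after[0]:
        assert after[2] == STEPS  # left mux switches while it carries no weight
    if before[1] != after[1]:
        assert after[2] == 0      # right mux switches while it carries no weight
```

In this model the output phase advances by a constant 45/16 degrees per code across all eight segments, including the wrap from 315 degrees back to 0.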



Fig. 4.3.1. Block diagram of the strobe recovery unit (SRU).

The recovered strobe signal is used for phase detection in the half-rate bang-bang phase detector (BB PD) block after traversing the clock tree and the DCC block. The recovered strobe signal is intentionally routed through the clock tree so that its delay is matched between the DQ and DQS blocks. The BB PD block generates up and down (DN) information, which is then deserialized. The deserialized up and down signals are filtered in the CDR loop control block, as shown in Fig. 4.3.1. The CDR control block, which is composed of a bit-shifting gain block, an accumulator, and an overflow/underflow detection block, is equivalent to a first-order $\Delta\Sigma$ modulator. The filtered up/down signals are delivered to the phase shifting controller, which generates the mux selection signal and the weight value for the phase interpolator. Fig. 4.3.2 shows a schematic diagram of the sampler which is used in the half-rate BB PD and in the RX of each DQ block. Switch transistors are inserted for offset cancellation by steering the output current. To enhance the linearity during the offset cancellation step, NMOS transistors with their gates tied to VDD are stacked with the switches. Fig. 4.3.3 shows the continuous-time linear equalizer (CTLE), whose peak gain is about 6dB at 2.4GHz. This linear equalizer uses a differential capacitance scheme to reduce the occupied area [4.1.1].
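The loop control block described above can be sketched behaviorally as follows. This is a simplified model under stated assumptions, not the implemented logic: the class name, the gain shift, the accumulator width, and the encoding of the deserialized votes as up/down counts are all hypothetical. The residue kept in the accumulator after each overflow or underflow is what gives the block its first-order $\Delta\Sigma$ behavior.

```python
class CdrLoopFilter:
    """Gain shift, accumulator, and overflow/underflow detection (behavioral)."""

    def __init__(self, gain_shift=1, acc_bits=5):
        self.gain_shift = gain_shift  # assumed bit-shifting gain
        self.limit = 1 << acc_bits    # assumed accumulator range
        self.acc = 0

    def update(self, ups, downs):
        """Consume one deserialized word of PD votes; return -1, 0, or +1."""
        self.acc += (ups - downs) << self.gain_shift
        if self.acc >= self.limit:    # overflow: advance the phase by one step
            self.acc -= self.limit    # keep the residue (error feedback)
            return +1
        if self.acc <= -self.limit:   # underflow: retard the phase by one step
            self.acc += self.limit
            return -1
        return 0


# A small, constant early bias produces sparse, evenly spaced phase steps.
loop = CdrLoopFilter()
steps = [loop.update(ups=5, downs=3) for _ in range(16)]
print(steps)
```

With these assumed settings, a net vote of +2 per word is scaled to +4 and produces one phase step every eight words, while balanced votes produce no steps at all.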



Fig. 4.3.2. Schematic diagram of the sampler.



Fig. 4.3.3. Schematic diagram of the continuous-time linear equalizer.

There are two possible types of phase interpolators. In one type, the phase interpolator core and multiplexer are separated, while the other type has an embedded multiplexer [4.3.1]. In the latter type, all of the phases are simultaneously connected to the input pairs of the phase interpolator. This reduces the effect of clock feedthrough on the linearity, but it also increases the capacitive load at the output, which in turn limits the speed of this type of phase interpolator when there are more than eight phases coming from the PLL. In the SRU, therefore, the phase interpolator shown in Fig. 4.3.4 is used. It has a separate multiplexer for each of the eight phases, thus improving the linearity.

Clock feedthrough is normally considered to be the main source of nonlinearity, but it actually enhances the linearity of this design of phase interpolator, as it creates a mid-phase during multi-phase switching. Fig. 4.3.5 shows the linearity curve of the phase interpolator as well as its DNL. This figure shows that mid-phase occurs during phase switching and that the absolute value of the DNL does not exceed 1.



Fig. 4.3.4. Schematic diagram of the phase interpolator.



Fig. 4.3.5. Measured linearity of the phase interpolator.

## 4.4 LIMITATION OF THE TRANSCEIVER

Unfortunately, the proposed memory controller transceiver has some drawbacks. First, the designed transceiver has a low jitter tracking bandwidth (JTB). Because most memory interfaces adopt source-synchronous clocking, a high jitter tracking bandwidth is advantageous for a memory controller transceiver. Generally, a strobe signal is transmitted simultaneously with a data signal, so the strobe signal carries the same jitter variation as the data signal. Ideally, the receiver would achieve a high jitter tolerance if an unprocessed strobe signal were used for data sampling. However, an unprocessed strobe signal cannot be used for data sampling in a multi-channel serial link, because the clock must be buffered to drive multiple receivers and uncorrelated high-frequency jitter must be filtered. To track the correlated jitter between the data and strobe signals while filtering uncorrelated high-frequency jitter, the jitter tracking bandwidth of the receiver should be a few hundred MHz, because recent packages and regulators generate power supply noise generally from 100MHz to 500MHz [2.7.3]. The proposed transceiver, however, adopts the type II DLL [4.4.1], and its jitter tracking bandwidth is under 10MHz. Second, the memory controller needs deskew capability in both the write and read directions [4.4.2]. Because the memory interface has many channels and modules, the PCB trace lengths of the channels can differ by more than 1UI. Moreover, on-chip circuit mismatch and routing mismatch increase the skew. Thus, the skew should be compensated in the memory controller transceiver. In this thesis, the four DQ channels and the DQS channel are carefully laid out and the PCB traces are implemented with high accuracy. However, the next version of this transceiver would be improved with a deskewing circuit.



Fig. 4.4.1. Next version plan of the memory controller transceiver – not implemented.

In the planned architecture of Fig. 4.4.1, the reference DLL and the phase interpolator compensate for the skew of the write signal, while the replica DLL and phase interpolator compensate for the skew of the read signal. Although the TX clock signal runs continuously, the RX strobe signal toggles only when a read command is delivered. Thus, the reference DLL provides the control signal of the replica DLL. Moreover, an injection-locked oscillator (ILO) with a high jitter tracking bandwidth is used in the DQS receiver [2.7.2]. Because the ILO has a high jitter tracking bandwidth, it can track correlated time-varying jitter and can filter out uncorrelated high-frequency jitter. An on-chip transmission line would be necessary if the transceiver requires low power consumption during clock delivery [2.5.3]. Finally, the resistors needed to configure the IMBM DQ bus occupy PCB area and complicate the routing when a differential signaling scheme is used. Thus, optimal system efficiency can be achieved when the transceiver adopts a single-ended signaling scheme.

## CHAPTER 5

# **EXPERIMENTAL RESULTS**

### 5.1 EXPERIMENTAL SETUP

A prototype memory controller transceiver was designed and fabricated using a 0.13$\mu$m CMOS process, and Fig. 5.1.1 is a microphotograph of the die. The transceiver, composed of four DQs and a DQS, occupies an area of 1400$\mu$m × 1200$\mu$m. The transceiver chip was implemented as a part of the I/O circuitry of the memory controller, and a test board with a motherboard and memory module PCBs was implemented, as shown in Fig. 5.1.2. There are no actual memory chips in this setup; the test equipment, including a BERT and an oscilloscope, plays this role. To compare the light and heavy loads imposed by memory chips, inactive modules are modeled in two ways. First, only passive 50$\Omega$ resistors are used for a light load condition [3.1.1]. Next, passive 50$\Omega$ resistors in parallel with 1pF capacitors are used for a heavy load condition.



Fig. 5.1.1. Die photo of the memory controller transceiver implemented in 0.13µm CMOS.



Fig. 5.1.2. Scope of this work.

To facilitate a range of measurements, the chip board and the channel board are implemented separately. Nelco material is used instead of FR4 to reduce insertion losses. MicroTCA [5.1.1] connectors are used on the channel board, as DDR2/3 connectors are not suitable for multi-gigabit-per-second transmission. The chip board has a 2.5-inch trace and the channel board has a trace with a length between 1.43 and 2.29 inches. The spacing of the connectors is 0.43 inches. A block diagram and a photograph of the actual 4-slot 8-drop double-sided IMBM DQ bus are shown in Fig. 5.1.3.



Fig. 5.1.3. Implemented 4-slot 8-drop IMBM DQ bus.

To eliminate the reflections which occur in front of the vias (the through-hole paths to the other surface) in a memory module, $Z_0/2$-ohm resistors are inserted. To compare the channel characteristics of the IMBM DQ bus with those of a conventional SSTL DQ bus, the latter was implemented with similar specifications, as shown in Fig. 5.1.4. For fairness, every memory module in the conventional SSTL DQ bus is assumed to have an ODT resistor.



Fig. 5.1.4. Implemented 4-slot 8-drop SSTL DQ bus.

### 5.2 SINGLE-BIT RESPONSE AND EYE DIAGRAM

The single-bit response and an eye diagram were both measured using an oscilloscope. Because the memory controller transceiver uses a VDD-referenced CML-level signal during transmission and an oscilloscope only provides ground termination, a DC block and ground-terminated channel boards were used, as shown in Fig. 5.2.1.



Fig. 5.2.1. Setup I for measuring the eye diagram and single-bit response.

Figs. 5.2.2 (a) and (b) show the 4.8Gb/s single-bit responses of a conventional SSTL DQ bus. Due to the reflections between connectors, the single-bit response varies greatly between module positions. The signal SSTL #1 through the first module has an overshooting response, and its second and third post-cursors have large negative values. Conversely, signals SSTL #5 and #7 have over-damped responses, as though they were being transmitted through a very lossy channel; the first post-cursor level is almost half the magnitude of the main cursor. The heavy load of 1pF generates capacitive reflections and degrades the signal integrity further, as shown in Fig. 5.2.2 (b). In contrast, the single-bit response of the IMBM DQ bus is identical for all module positions, as shown in Figs. 5.2.2 (c) and (d). The 1/8-scaled single-bit response of a signal passing through the chip board alone is very similar to that of the signals that go through the IMBM DQ bus, suggesting that there is little reflection in the IMBM channel. In the case of the SSTL DQ bus, the timing and voltage margins can be improved when a non-linear equalizer such as a floating-tap DFE is used [5.2.1]. However, a non-linear equalizer increases circuit overhead and complexity. Moreover, a different coefficient setting is required in the memory controller RX for read operations from each memory module.














Fig. 5.2.2. Measured 4.8Gb/s single-bit responses of (a) an SSTL DQ bus with a 50$\Omega$ load, (b) an SSTL DQ bus with a 50$\Omega$ and a 1pF load, (c) an IMBM DQ bus with a 50$\Omega$ load, and (d) an IMBM DQ bus with a 50$\Omega$ and a 1pF load.

To check the timing and voltage margins, eye diagrams were obtained for both the SSTL and IMBM DQ buses. The memory controller transceiver and chip board alone exhibit considerable post-cursor voltage levels, mainly caused by the capacitance of the package and pad. Therefore, the signals are measured both without and with TX de-emphasis. The equalization coefficients were chosen to cancel the ISI that originates not from the reflections at each stub but from the insertion loss of the PCB trace on the chip board and the capacitive loading imposed by both the memory controller transceiver chip and the 1pF loading capacitors. The coefficient of the pre-cursor tap was set to 0, the main cursor tap to 1, the first post-cursor tap to -0.33, and the second post-cursor tap to -0.08.

Figs. 5.2.3 (a)-(d) and Figs. 5.2.4 (a)-(d) show the measured eye diagrams of the unequalized SSTL DQ bus with a 50 $\Omega$ load and with a 50 $\Omega$ and a 1pF load, respectively, for a 4.8Gb/s 2<sup>7</sup>-1 PRBS data pattern. As can be predicted from the single-bit responses of the SSTL and IMBM DQ buses, modules #5 and #7 of the SSTL DQ bus have severely closed eye diagrams, whereas modules #1 and #3 have clean diagrams. With TX de-emphasis, modules #1 and #3 show over-boosted behavior because module #1 has negative post-cursors and #3 has small post-cursors. The eyes for modules #5 and #7 are enlarged, but the timing and voltage margins of the equalized signal are only half of 1UI and half of the signal level, respectively, as shown in Figs. 5.2.3 (e)-(h) and 5.2.4 (e)-(h).

Figs. 5.2.5 and 5.2.6 are eye diagrams of signals transmitted through the IMBM DQ bus. These diagrams are identical for all modules, allowing the memory controller transceiver to use the same equalization coefficients for all modules. With TX equalization, the IMBM DQ bus has a wide-open eye diagram, and its voltage and timing margins are much greater than those of the SSTL DQ bus. Fig. 5.2.7 shows the 4.8Gb/s eye diagram and a histogram of the de-emphasized DQ and DQS signals in the IMBM DQ bus. The DQS signal toggles every cycle and has less timing jitter than the DQ signal, which carries random data. The measured jitter in the DQ signal is 9.21 $ps_{rms}$, and the jitter in the DQS signal is 5.41 $ps_{rms}$.
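As a rough illustration of what the tap values listed above do to a transmitted bit stream, the following sketch applies them as a 4-tap FIR filter to ideal symbols. It is only a behavioral model under stated assumptions: the symbols are mapped to -1/+1, no amplitude normalization is applied, and the function name `de_emphasize` and the edge handling are hypothetical choices made for this example, not a description of the fabricated driver.

```python
# Tap values as listed in the text: pre-cursor, main cursor,
# first post-cursor and second post-cursor.
TAPS = {"pre": 0.0, "main": 1.0, "post1": -0.33, "post2": -0.08}


def de_emphasize(bits, taps=TAPS):
    """Return FIR-shaped levels for a sequence of 0/1 bits.

    Assumptions for this sketch only: bits map to -1/+1 symbols, and bits
    outside the sequence repeat the nearest edge bit.
    """
    sym = [1.0 if b else -1.0 for b in bits]
    n = len(sym)

    def at(i):
        # Clamp the index so the first and last bits extend past the edges.
        return sym[min(max(i, 0), n - 1)]

    return [
        taps["pre"] * at(i + 1)
        + taps["main"] * at(i)
        + taps["post1"] * at(i - 1)
        + taps["post2"] * at(i - 2)
        for i in range(n)
    ]


# A 0 -> 1 transition followed by a run of ones.
levels = de_emphasize([0, 0, 0, 1, 1, 1, 1])
for bit_index, level in enumerate(levels):
    print(bit_index, round(level, 2))
```

In this model the first bit after the transition is driven at the largest level, and the level then decays over the next two bits to its steady-state value, which is the usual shape of a de-emphasized waveform.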












Fig. 5.2.3. Measured 4.8Gb/s eye diagrams of an unequalized signal (a) #1, (b) #3, (c) #5, and (d) #7; a de-emphasized signal (e) #1, (f) #3, (g) #5, and (h) #7 at the SSTL module with a  $50\Omega$  load.






Fig. 5.2.4. Measured 4.8Gb/s eye diagrams of an unequalized signal (a) #1, (b) #3, (c) #5, and (d) #7; a de-emphasized signal (e) #1, (f) #3, (g) #5, and (h) #7 at the SSTL module with a 50 $\Omega$  and a 1pF load.








Fig. 5.2.5. Measured 4.8Gb/s eye diagrams of an unequalized signal (a) #1, (b) #3, (c) #5, and (d) #7; a de-emphasized signal (e) #1, (f) #3, (g) #5, and (h) #7 at the IMBM module with a 50 $\Omega$  load.








Fig. 5.2.6. Measured 4.8Gb/s eye diagrams of an unequalized signal (a) #1, (b) #3, (c) #5, and (d) #7; a de-emphasized signal (e) #1, (f) #3, (g) #5, and (h) #7 at the IMBM module with a 50 $\Omega$ and a 1pF load.


Fig. 5.2.7. Measured 4.8Gb/s eye diagrams and histograms of the de-emphasized (a) DQ signal and (b) DQS signal, both on the IMBM DQ bus with a $50\Omega$ and a 1pF load.

#### 5.3 BER OF TRANSMITTED SIGNALS (WRITE SIGNALS)

A further comparison between the SSTL and IMBM DQ buses was made by measuring the BER of transmitted signals using the measurement setup shown in Fig. 5.3.1. Unlike an oscilloscope, a BERT (Agilent J-BERT N4903A) can provide a VDD-level termination voltage, allowing us to dispense with the DC block and the ground-terminated channel board.

The unequalized bathtub graphs in Figs. 5.3.2 (a) and (b) show that the signals transmitted through SSTL modules #5 and #7 have no timing margin, which is consistent with the single-bit responses and the eye diagrams. The equivalent bathtub curve for the unequalized IMBM DQ bus has a BER which approaches $10^{-7}$ at the optimum sampling point. When the transceiver enables TX de-emphasis, the difference between the SSTL and IMBM results becomes more noticeable. The IMBM DQ bus achieves a BER of $10^{-9}$ with a timing margin of 0.39UI, as shown in Fig. 5.3.3. However, some modules of the SSTL DQ bus do not reach a BER of $10^{-9}$ at any sampling position, as shown in Fig. 5.3.2. Under the heavy load condition shown in Fig. 5.3.2 (b), the BER of module #7 is worse than $10^{-3}$.
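The timing margin quoted from a bathtub graph is the width of the sampling-phase region in which the BER stays at or below the target. A minimal sketch of that reading is given below. The function name `timing_margin_ui` and the sample data are hypothetical and are not measured values from this work; the sketch takes the widest contiguous passing run without interpolating between points, which makes it a conservative estimate.

```python
def timing_margin_ui(phases_ui, bers, threshold=1e-9):
    """Width in UI of the widest contiguous run of sampling phases whose
    BER is at or below `threshold`. Returns 0.0 when no run spans at
    least two points."""
    best = 0.0
    start = None
    for phase, ber in zip(phases_ui, bers):
        if ber <= threshold:
            if start is None:
                start = phase  # first passing point of a new run
            best = max(best, phase - start)
        else:
            start = None  # run broken by a failing point
    return best


# Made-up bathtub samples, one per 0.1UI step (illustrative only).
phases = [i / 10 for i in range(11)]
open_eye = [1e-2, 1e-5, 1e-8, 1e-10, 1e-12, 1e-12, 1e-10, 1e-8, 1e-5, 1e-2, 1e-1]
closed_eye = [1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1e-1]

print(timing_margin_ui(phases, open_eye))
print(timing_margin_ui(phases, closed_eye))
```

A curve that never reaches the threshold, like `closed_eye`, returns a margin of zero, which corresponds to the "no timing margin" cases described above.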



Fig. 5.3.1. Setup II for measuring the TX BER.






Fig. 5.3.2. Bathtub graph based on TX BER measurements of both equalized and unequalized SSTL signals with (a) a  $50\Omega$  and (b) a  $50\Omega$  and a 1pF load.





Fig. 5.3.3. Bathtub graph based on TX BER measurements of both equalized and unequalized IMBM signals with (a) a  $50\Omega$  and (b) a  $50\Omega$  and a 1pF load.

#### 5.4 BER OF RECOVERED SIGNALS (READ SIGNALS)

A memory interface channel is bidirectional. In write operations (the TX mode of the memory controller transceiver), data is transmitted from the chip board to the channel board. During read operations (the RX mode of the memory controller transceiver), data from the channel board is received by the chip board. To verify correct bidirectional operation of the IMBM DQ bus, the RX measurement setup is configured as shown in Fig. 5.4.1. In the RX mode, the transceiver extracts the phase information of every DQ signal with respect to the common DQS signal. Thus, both the DQ and DQS signals must be generated and transmitted simultaneously by the BERT. The Agilent PBERT 81250 is a parallel BERT which can handle both DQ and DQS signals, so this instrument is used to measure the RX BER. By sweeping an intentional skew between the generated DQ and DQS signals, a bathtub graph can be drawn for the memory controller transceiver in RX mode, analogous to that for the TX BER results.



Fig. 5.4.1. Setup III for measuring the RX BER.

Figs. 5.4.2 and 5.4.3 show bathtub graphs based on the measurement of RX BER for unequalized and equalized received signals. In RX mode, a linear equalizer is used, and the BER results generally indicate a larger timing margin than the TX BER results because the PBERT has an ideal wide-bandwidth driver. Fig. 5.4.2 shows that data recovered from the unequalized SSTL DQ bus has many errors, originating in modules #5 and #7, which would not be acceptable during normal operation; this is similar to the TX results. In contrast, the unequalized IMBM DQ bus has a timing margin of 0.52UI at a BER of 10<sup>-9</sup>, as shown in Fig. 5.4.3 (b). When the memory controller transceiver turns on the linear equalizer and boosts the high-frequency gain, the SSTL DQ bus still has no timing margin, but the IMBM DQ bus has a timing margin of 0.58UI under the heavy load condition shown in Fig. 5.4.3 (b).



Fig. 5.4.2. Bathtub graph based on RX BER measurements of unequalized and equalized SSTL with (a) a  $50\Omega$  and (b) a  $50\Omega$  and a 1pF load.



Fig. 5.4.3. Bathtub graph based on RX BER measurements of unequalized and equalized IMBM with (a) a  $50\Omega$  and (b) a  $50\Omega$  and a 1pF load.



Fig. 5.4.4. Measured 4.8Gb/s eye diagram and histogram of a recovered DQ signal with the pattern 10101010..., on the IMBM DQ bus.

To verify the correct operation of the strobe recovery unit, the recovered clock jitter is measured. Because no pin was assigned to the recovered clock, its jitter is measured indirectly by recovering a 1010101010... data pattern. Fig. 5.4.4 shows the 4.8Gb/s eye diagram and a histogram of the recovered DQ signal on the IMBM DQ bus. The measured jitter of the recovered clock is $2.47 \text{ps}_{\text{rms}}$. The hunting jitter caused by the limit cycle is low, and the histogram shows a single Gaussian peak rather than two. The use of a first-order $\Delta\Sigma$ modulator in the bang-bang loop of the strobe recovery unit eliminates the limit cycle.
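For readers unfamiliar with the building block, the sketch below shows a generic first-order $\Delta\Sigma$ modulator in its accumulator (error-feedback) form: it turns a fractional input into a one-bit stream whose average equals the input. This is a textbook form given only as background. How the modulator is connected inside the bang-bang loop of the strobe recovery unit is not shown here, and the function name `first_order_delta_sigma` is a hypothetical choice for this example.

```python
def first_order_delta_sigma(x, n):
    """Generic first-order delta-sigma modulator (accumulator form).

    Takes a constant input 0 <= x < 1 and returns n one-bit outputs whose
    running average tracks x. Textbook sketch, not the thesis circuit.
    """
    acc = 0.0
    out = []
    for _ in range(n):
        acc += x              # integrate the input
        if acc >= 1.0:        # 1-bit quantizer
            out.append(1)
            acc -= 1.0        # feed the quantized value back
        else:
            out.append(0)
    return out


stream = first_order_delta_sigma(0.25, 8)
print(stream, sum(stream) / len(stream))
```

With an input of 0.25, one output in every four is a 1, so the average of the stream matches the input even though each individual output is a coarse one-bit value.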

The memory controller transceiver is split into three different power domains: the I/O circuits (driver and linear EQ buffer); the analog circuits (PRBS generator, serializer, sampler, deserializer, PRBS verifier and clock trees); and the remaining circuits. In TX mode at 4.8Gb/s, the I/O domain consumes 39.5mA/DQ and the analog domain consumes 17.5mA/DQ. Thus, with the 1.2-V supply, the energy efficiency of the transceiver in TX mode is 14.25mW/Gb/s/DQ. In RX mode, the I/O domain consumes 36mA/DQ and the analog domain consumes 18.75mA/DQ, giving an energy efficiency of 13.69mW/Gb/s/DQ. The performance of the transceiver chip and the channel board is summarized in Tables 5.4.1, 5.4.2 and 5.4.3.
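The energy-efficiency figures can be reproduced from the per-DQ currents quoted above. The short calculation below assumes that both the I/O and analog domains run from the 1.2-V supply and that the third power domain is not counted; these assumptions are inferred from the fact that they reproduce the stated numbers. The function name is hypothetical.

```python
def energy_efficiency_mw_per_gbps(io_ma, analog_ma, vdd_v=1.2, rate_gbps=4.8):
    """Per-DQ energy efficiency in mW/Gb/s from per-DQ supply currents.

    Assumes both domains share the same supply voltage `vdd_v`.
    """
    power_mw = (io_ma + analog_ma) * vdd_v  # mA * V = mW
    return power_mw / rate_gbps


tx = energy_efficiency_mw_per_gbps(io_ma=39.5, analog_ma=17.5)
rx = energy_efficiency_mw_per_gbps(io_ma=36.0, analog_ma=18.75)
print(f"TX: {tx:.2f} mW/Gb/s, RX: {rx:.2f} mW/Gb/s")
```

In TX mode this is 57mA × 1.2V = 68.4mW per DQ, and in RX mode 54.75mA × 1.2V = 65.7mW per DQ, each divided by 4.8Gb/s.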

| Item                                    | Value                                            |
|-----------------------------------------|--------------------------------------------------|
| Process                                 | 0.13µm 1P8M CMOS                                 |
| Connector                               | MicroTCA                                         |
| Package                                 | TQFP 100p                                        |
| Data rate                               | 4.8Gb/s                                          |
| Energy efficiency<br>(@4.8Gb/s, per DQ) | 14.25mW/Gb/s (TX mode)<br>13.69mW/Gb/s (RX mode) |

Table 5.4.1. Memory Controller Transceiver Summary.

| DQ bus                                               | SSTL, 4 slots (8 drops) |                           |
|------------------------------------------------------|-------------------------|---------------------------|
| Inactive modules                                     | $50\Omega$ only         | $50\Omega \parallel 1 pF$ |
| TX timing margin @ 10<sup>-9</sup> BER<br>(w/o TxEQ) | Fail                    | Fail                      |
| TX timing margin @ 10<sup>-9</sup> BER<br>(w/ TxEQ)  | Fail                    | Fail                      |
| RX timing margin @ 10<sup>-9</sup> BER<br>(w/o RxEQ) | Fail                    | Fail                      |
| RX timing margin @ 10<sup>-9</sup> BER<br>(w/ RxEQ)  | 0.52 UI                 | Fail                      |

Table 5.4.2. Timing Margin Summary of the SSTL DQ Bus.

| DQ bus                                               | IMBM, 4 slots (8 drops) |                           |
|------------------------------------------------------|-------------------------|---------------------------|
| Inactive modules                                     | $50\Omega$ only         | $50\Omega \parallel 1 pF$ |
| TX timing margin @ 10<sup>-9</sup> BER<br>(w/o TxEQ) | Fail                    | Fail                      |
| TX timing margin @ 10<sup>-9</sup> BER<br>(w/ TxEQ)  | 0.39 UI                 | 0.39 UI                   |
| RX timing margin @ 10<sup>-9</sup> BER<br>(w/o RxEQ) | 0.61 UI                 | 0.52 UI                   |
| RX timing margin @ 10<sup>-9</sup> BER<br>(w/ RxEQ)  | 0.73 UI                 | 0.58 UI                   |

Table 5.4.3. Timing Margin Summary of the IMBM DQ Bus.

### **CHAPTER 6**

## CONCLUSIONS

In this thesis, a new impedance-matched bidirectional multi-drop DQ bus is proposed in which reflections at the stub of each channel are canceled by series resistors between the connectors. Of the recent memory interface schemes introduced and compared in this thesis, none can handle a multi-drop bus with both high-speed data transmission and high capacity. Unlike the conventional DQ bus, the proposed DQ bus cancels reflections unidirectionally: during write operations reflections are eliminated directly, and during read operations reflections occur but never reach the destination port. Thus, the IMBM DQ bus is a unique solution which achieves bidirectional multi-gigabit-per-second data transmission in a multi-drop configuration. The generalized bus and the steady-state resistor model of the IMBM DQ bus are also analyzed in this thesis.

A 4.8Gb/s memory controller transceiver optimized for the IMBM DQ bus and an IMBM channel board were also implemented. Their effective operation was verified by means of TX and RX BER measurements. The measurements show that reflective ISI is indeed eliminated in the IMBM DQ bus and that the signal integrity is much better than that observed in a conventional SSTL DQ bus. By applying TX de-emphasis and RX linear equalization to an 8-drop IMBM DQ bus, a timing margin of at least 0.39UI at a BER of $10^{-9}$ is achieved in both TX and RX modes. The memory controller transceiver was fabricated in a standard 0.13µm CMOS process. It consumes 14.25mW/Gb/s per DQ at 4.8Gb/s in TX mode and 13.69mW/Gb/s per DQ at 4.8Gb/s in RX mode.

# **BIBLIOGRAPHY**

- [1.1.1] H. Lee et al., "A 16 Gb/s/link, 64 GB/s bidirectional asymmetric memory interface," *IEEE J. Solid-State Circuits*, vol. 44, no. 4, pp. 1235-1247, April 2009.
- [1.1.2] J.-H. Chun et al., "A 16Gb/s 65nm CMOS transceiver for a memory interface," *IEEE Asian Solid-State Circuits Conf.*, 2008, pp. 25-28.
- [1.1.3] S.-J. Bae et al., "A 60nm 6Gb/s/pin GDDR5 graphics DRAM with multifaceted clocking and ISI/SSN-reduction techniques," *IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers*, 2008, pp. 278–279.
- [1.1.4] K. Kim, S. Yoon, K. Kwean, D. Kwon, S. Yang, M. Park, Y. Kim and B. Chung, "A 5.2Gb/p/s GDDR5 SDRAM with CML clock distribution network," *Proceedings of the 34th European Solid-State Circuits Conf.*, 2008, pp. 194-197.
- [1.1.5] JEDEC, DDR3 SDRAM standard (JESD 79-3B).
- [1.1.6] J.-H. Kim et al., "Challenges and solutions for next generation main memory systems," *IEEE Conference on Electrical Performance of Electronic Packaging and Systems*, 2009, pp. 93–96.
- [1.1.7] H. Partovi et al., "Data recovery and retiming for the fully buffered DIMM 4.8Gb/s serial links," *IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers*, 2006, pp. 1314–1323.

- [1.1.8] E. Prete, D. Scheideler and A. Sanders, "A 100mW 9.6Gb/s transceiver in 90nm CMOS for next-generation memory interfaces," *IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers*, 2006, pp. 253–262.
- [1.1.9] Z. Gu et al., "Cascading techniques for a high-speed memory interface," *IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers*, 2007, pp. 234–235.
- [1.1.10] H. Chung et al., "Channel BER measurement for a 5.8Gb/s/pin unidirectional differential I/O for DRAM application," *IEEE Asian Solid-State Circuits Conf.*, 2008, pp. 29-32.
- [1.1.11] S.-J. Bae, H.-J. Chi, H.-R. Kim and H.-J. Park, "A 3Gb/s 8b single-ended transceiver for 4-drop DRAM interface with digital calibration of equalization skew and offset coefficients," *IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers*, 2005, pp. 520–521.
- [1.1.12] J. Lee, S. Lee and S. Nam, "Multi-slot main memory system for post DDR3," *IEEE Trans. Circuits and Systems II: Express Briefs*, vol. 57, no. 5, pp. 334–338, May 2010.
- [1.1.13] R. Esper-Chaín, F. Tobajas, O. Tubío, R. Arteaga, V. de Armas, and R. Sarmiento, "A gigabit multidrop serial backplane for high-speed digital systems based on asymmetrical power splitter," *IEEE Trans. Circuits and Systems II: Express Briefs*, vol. 52, no. 1, pp. 5–9, Jan. 2005.
- [2.1.1] JEDEC, DDR3 SDRAM standard (JESD 79-3B).
- [2.1.2] JEDEC, GDDR5 SGRAM standard (JESD212).

- [2.1.3] JEDEC, LPDDR3 standard (JESD209-3).
- [2.1.4] Rambus,  $XDR^{TM}$  architecture.
- [2.1.5] B. Leibowitz et al., "A 4.3 GB/s mobile memory interface with power-efficient bandwidth scaling," *IEEE J. Solid-State Circuits*, vol. 45, no. 4, pp. 889-898, April 2010.
- [2.1.6] JEDEC, FBDIMM: architecture and protocol (JESD206).
- [2.1.7] SPMT, SPMT serial interface specification, Serial Port DRAM (SPDRAM) specification.
- [2.1.8] MIPI Alliance, M-PHY specification v2.0.
- [2.1.9] JEDEC, Wide I/O single data rate (Wide I/O SDR) (JESD229).
- [2.2.1] W. J. Dally and J. Poulton, Digital Systems Engineering, Cambridge Univ. Press, 1998.
- [2.2.2] J. Poulton, "Signaling in high-performance memory systems," *ISSCC Tutorial*, 1999.
- [2.2.3] JEDEC, FBDIMM: advanced memory buffer (AMB) (JESD82-20).
- [2.3.1] Y. Kim, "DRAM interface," IEEK High-Speed Interface Workshop, 2007.
- [2.3.2] N. Kurd, P. Mosalikanti, M. Neidengard, J. Douglas and R. Kumar, "Next-generation Intel Core<sup>™</sup> micro-architecture (Nehalem) clocking," *IEEE J. Solid-State Circuits*, vol. 44, no. 4, pp. 1121-1129, April 2009.

- [2.3.3] S. Sidiropoulos and M. Horowitz, "A semidigital dual delay-locked loop," *IEEE J. Solid-State Circuits*, vol. 32, pp. 1683–1692, Nov. 1997.
- [2.3.4] B. W. Garlepp, K. S. Donnelly, J. Kim, P. S. Chau, J. L. Zerbe, C. Huang, C. V. Tran, C. L. Portmann, D. Stark, Y.-F. Chan, T. H. Lee, and M. A. Horowitz, "A portable digital DLL for high-speed CMOS interface circuits," *IEEE J. Solid-State Circuits*, vol. 34, no. 5, pp. 632–644, May 1999.
- [2.3.5] B. Casper and F. O'Mahony, "Clocking analysis, implementation and measurement techniques for high-speed data links – a tutorial," *IEEE Trans. Circuits and Systems I: Regular Papers*, vol. 56, no. 1, pp. 17–37, Jan. 2009.
- [2.3.6] J. Alexander, "Clock recovery from random binary signals," *Electron. Lett.*, vol. 11, no. 22, pp. 541–542, 1975.
- [2.3.7] T. Toifl, C. Menolfi, P. Buckmann, M. Kossel, T. Morf, R. Reutemann, M. Reugg, M. Schmatz, and J. Weiss, "0.94 ps-rms-jitter 0.016 mm<sup>2</sup> 2.5 GHz multiphase generator PLL with 360° digitally programmable phase shift for 10 Gb/s serial links," *IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers*, 2005, pp. 410–411.
- [2.4.1] JEDEC, GDDR3 SGRAM standard (JESD21-C).
- [2.4.2] JEDEC, DDR2 SDRAM standard (JESD79-2F).
- [2.5.1] C. Yoo et al., "A 1.8 V 700-Mb/s/pin 512-Mb DDR-II SDRAM with on-die termination and off-chip driver calibration," *IEEE J. Solid-State Circuits*, vol. 39, no. 6, pp. 941–951, Jun. 2004.

- [2.5.2] J. Poulton, R. Palmer, A. M. Fuller, T. Greer, J. Eyles, W. J. Dally, and M. Horowitz, "A 14-mW 6.25-Gb/s transceiver in 90-nm CMOS," *IEEE J. Solid-State Circuits*, vol. 42, no. 12, pp. 2745–2757, Dec. 2007.
- [2.5.3] G. Balamurugan et al., "A scalable 5–15 Gbps, 14–75 mW low-power I/O transceiver in 65 nm CMOS," *IEEE J. Solid-State Circuits*, vol. 43, no. 4, pp. 1010–1019, Apr. 2008.
- [2.5.4] F. O'Mahony et al., "A 47×10 Gb/s 1.4 mW/Gb/s parallel interface in 45 nm CMOS," *IEEE J. Solid-State Circuits*, vol. 45, no. 12, pp. 2828–2837, Dec. 2010.
- [2.5.5] S.-J. Bae et al., "An 80 nm 4 Gb/s/pin 32 bit 512 Mb GDDR4 graphics DRAM with low power and low noise data bus inversion," *IEEE J. Solid-State Circuits*, vol. 43, no. 1, pp. 121–131, Jan. 2008.
- [2.6.1] J. G. Proakis, Digital Communications, McGraw-Hill, 2001.
- [2.7.1] F. O'Mahony et al., "A 27 Gb/s forwarded-clock I/O receiver using an injection-locked LC-DCO in 45 nm CMOS," *IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers*, 2010, pp. 452–453.
- [2.7.2] K. Hu et al., "A 0.6 mW/Gb/s, 6.4–7.2 Gb/s serial link receiver using local injection-locked ring oscillators in 90 nm CMOS," *IEEE J. Solid-State Circuits*, vol. 45, no. 4, pp. 899–908, Apr. 2010.
- [2.7.3] M. Hossain and A. C. Carusone, "7.4 Gb/s 6.8 mW source synchronous receiver in 65 nm CMOS," *IEEE J. Solid-State Circuits*, vol. 46, no. 6, pp. 1337–1348, Jun. 2011.

- [2.7.4] U. Kang et al., "8 Gb 3-D DDR3 DRAM using through-silicon-via technology," *IEEE J. Solid-State Circuits*, vol. 45, no. 1, pp. 111–119, Jan. 2010.
- [2.7.5] J.-S. Kim et al., "A 1.2 V 12.8 GB/s 2 Gb mobile wide-I/O DRAM with 4 x 128 I/Os using TSV based stacking," *IEEE J. Solid-State Circuits*, vol. 47, no. 1, pp. 107–116, Jan. 2012.
- [2.7.6] G. V. Plas et al., "Design issues and considerations for low-cost 3-D TSV IC technology," *IEEE J. Solid-State Circuits*, vol. 46, no. 1, pp. 293–307, Jan. 2011.
- [2.7.7] Y. Take, N. Miura and T. Kuroda, "A 30Gb/s/link 2.2 Tb/s/mm<sup>2</sup> inductively-coupled injection-locking CDR for high-speed DRAM interface," *IEEE J. Solid-State Circuits*, vol. 46, no. 11, pp. 2552–2559, Nov. 2011.
- [2.7.8] N. Miura, Y. Kohama, Y. Sugimori, H. Ishikuro, T. Sakurai, and T. Kuroda, "An 11 Gb/s inductive-coupling link with burst transmission," *IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers*, 2008, pp. 298–299.
- [2.7.9] C. L. Schow et al., "Low-power 16x10 Gb/s bi-directional single chip CMOS optical transceivers operating at < 5 mW/Gb/s/link," *IEEE J. Solid-State Circuits*, vol. 44, no. 1, pp. 301-313, Jan. 2009.
- [2.7.10] S. Palermo et al., "A 90 nm CMOS 16 Gb/s transceiver for optical interconnects," *IEEE J. Solid-State Circuits*, vol. 43, no. 5, pp. 1235-1246, May 2008.

- [3.1.1] W.-Y. Shin, G.-M. Hong, H. Lee, J.-D. Han, S. Kim, K.-S. Park, D.-H. Lim, J.-H. Chun, D.-K. Jeong and S. Kim, "A 4.8Gb/s impedance-matched bidirectional multi-drop transceiver for high-capacity memory interface," *IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers*, 2011, pp. 494–495.
- [3.2.1] I.-S. Oh, D.-K. Jeong, S. Kim, W.-Y. Shin and D.-H. Lim, "Impedance matched bi-directional multi drop bus system, memory system using the same and memory module," KOR. Patent 09-43861, Feb. 17, 2010.
- [3.2.2] D.-K. Jeong, S. Kim, W.-Y. Shin, D.-H. Lim and I.-S. Oh, "Bi-directional multidrop bus memory system," U.S. Patent 8,195,855, Jun. 5, 2012.
- [3.4.1] H. Fredriksson and C. Svensson, "Improvement potential and equalization example for multidrop DRAM memory buses," *IEEE Trans. Adv. Packag.*, vol. 32, no. 3, pp. 675-682, Aug. 2009.
- [4.1.1] Y. Moon, G. Ahn, N. Kim and D. Shim, "A quad 6Gb/s multi-rate CMOS transceiver with Tx rise/fall-time control," *IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers*, 2006, pp. 233–242.
- [4.1.2] H. Higashi et al., "A 5-6.4-Gb/s 12-channel transceiver with pre-emphasis and equalization," *IEEE J. Solid-State Circuits*, vol. 40, no. 4, pp. 978–985, April 2005.
- [4.2.1] C. Menolfi et al., "A 16Gb/s source-series terminated transmitter in 65nm CMOS SOI," *IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers*, 2007, pp. 446–447.
- [4.3.1] T. Wu et al., "Clocking circuits for a 16Gb/s memory interface," *Proc. IEEE Custom Integrated Circuit Conf.*, 2008, pp. 435–438.

- [4.4.1] M.-J. E. Lee, W. J. Dally, T. Greer, H.-T. Ng, R. Farjad-Rad, J. Poulton, and R. Senthinathan, "Jitter transfer characteristics of delay-locked loops—Theories and design techniques," *IEEE J. Solid-State Circuits*, vol. 38, no. 4, pp. 614–621, Apr. 2003.
- [4.4.2] E. Yeung and M. A. Horowitz, "A 2.4 Gb/s/pin simultaneous bidirectional parallel link with per-pin skew compensation," *IEEE J. Solid-State Circuits*, vol. 35, no. 11, pp. 1619–1628, Nov. 2000.
- [5.1.1] MicroTCA Standard, http://www.picmg.org/v2internal/microtca.htm.
- [5.2.1] S. Quan et al., "A 1.0625-to-14.025Gb/s multimedia transceiver with full-rate source-series-terminated transmit driver and floating-tap decision-feedback equalizer in 40nm CMOS," *IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers*, 2011, pp. 348–349.

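### KOREAN ABSTRACT

This work proposes an impedance-matched bidirectional multi-drop (IMBM) data bus and a 4.8Gb/s memory controller transceiver that drives this bus. With the IMBM data bus, unidirectional impedance matching is achieved by adding series resistors, which removes the inter-symbol interference caused by the reflections generated at each stub. The IMBM data bus generates no reflections during memory write operations; during read operations reflections are generated, but they do not travel toward the controller. Therefore, the IMBM data bus can transmit and receive signals without reflection-induced inter-symbol interference in both write and read operations. The proposed data bus can be used in high-speed, high-capacity memory interfaces for which conventional multi-drop and point-to-point data buses cannot be used.

Because the proposed IMBM data bus attenuates the amplitude of the transmitted signal in proportion to the number of modules, a new clocking scheme is required to use it. This work proposes a 4.8Gb/s transceiver whose clocking scheme uses a phase-shifted PLL clock instead of sampling the data directly with the strobe signal. The prototype memory controller transceiver was fabricated in a 0.13µm CMOS process and uses a 1.2V supply voltage. Its effectiveness was verified through various measurements. With a 4-slot, 8-drop IMBM data bus at 4.8Gb/s and a BER criterion of 10<sup>-9</sup>, the transceiver has a timing margin of 0.39UI in transmit operation and 0.58UI in receive operation. Under the same measurement conditions, the conventional SSTL data bus did not operate correctly. The designed transceiver has an energy efficiency per data channel of 14.25mW/Gb/s in transmit operation and 13.69mW/Gb/s in receive operation.

**Keywords**: impedance matching, memory controller, memory interface, multi-drop data bus, transceiver

Student number: 2005-21422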