



**Master's Thesis** 

# Design of High Speed PAM-4 Transmitter with Level Mismatch Adjustment for Next-generation Memory Testing

# Design of High-Speed PAM-4 Transmitter for Next-Generation Memory Testing

by

**Minsu Park** 

February, 2023

Department of Electrical and Computer Engineering, College of Engineering, Seoul National University

# Design of High Speed PAM-4 Transmitter with Level Mismatch Adjustment for Next-generation Memory Testing

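Advisor: Professor Deog-Kyoon Jeong

Submitted as a thesis for the degree of Master of Science in Engineering, February 2023

> Department of Electrical and Computer Engineering, Graduate School, Seoul National University, Minsu Park

The Master's thesis of Minsu Park is approved, February 2023

Chair \_\_\_\_\_ (seal) Vice-Chair \_\_\_\_\_ (seal)

Member \_\_\_\_\_ (seal)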

# Design of High Speed PAM-4 Transmitter with Level Mismatch Adjustment for Next-generation Memory Testing

by

Minsu Park

A Thesis Submitted to the Department of Electrical and Computer Engineering in Partial Fulfillment of the Requirements for the Degree of Master of Science

at

#### SEOUL NATIONAL UNIVERSITY

February, 2023

Committee in Charge:

Professor Woo-Seok Choi, Chairman; Professor Deog-Kyoon Jeong, Vice-Chairman; Professor Woogeun Rhee

# Abstract

High memory bandwidth is required by emerging applications that rely on high-performance computing systems, such as machine learning and AI. To satisfy the bandwidth demanded of DRAM, multi-level signaling is being discussed, but applying this technology requires numerous changes in infrastructure. In particular, mass-produced products such as DRAM are evaluated in large quantities in large-scale test facilities. Since DRAM manufacturers already operate large-scale facilities for NRZ (non-return-to-zero) signal evaluation, supporting multi-level signaling would require large-scale changes to these facilities, incurring both time and cost. To solve this problem, a bridge chip that connects the DRAM tester and the DRAM has been proposed: it receives parallel input/output data from low-performance test equipment, converts them into high-speed PAM4 signals, and exchanges them with the DRAM.

In this paper, we propose a transmitter for such a bridge chip. The transmitter, built around a voltage-mode CMOS driver, receives eight parallel 6 Gb/s NRZ signals from the test equipment and transmits the data to the DRAM as a single 48 Gb/s PAM4 signal. It is implemented with a serializer, a 4:1 MUX, and a pre-driver that uses an overdrive scheme, and it is clocked by a four-phase clock generated by an internal ADPLL. In particular, the proposed transmitter can adjust the output levels of the PAM4 signal to evaluate the DRAM receiver, and we propose a calibration that uses this level-adjustment function to optimize the RLM. Each output level is controlled by changing the gate voltage of the final driver of the bridge-chip transmitter.

The bridge transmitter, fabricated in 40 nm CMOS, occupies an area of 0.32 mm², consumes 85.25 mW, operates at 48 Gb/s, and achieves an RLM of 0.99.

Keywords : PAM-4, PAM4-Binary Bridge, Memory Tester

Student Number : 2021-21988

# Contents

- Abstract
- Contents
- List of Figures
- List of Tables
- Chapter 1 Introduction
  - 1.1 Motivation
  - 1.2 Thesis Organization
- Chapter 2 Backgrounds
  - 2.1 Overview
  - 2.2 Basics of DRAM Interface
  - 2.3 Basics of DRAM Driver
  - 2.4 Basics of Multi-level Signaling
- Chapter 3 Design of PAM4 Transmitter for PAM4-Binary Bridge
  - 3.1 Design Consideration
  - 3.2 Overall Architecture
  - 3.3 Circuit Implementation
    - 3.3.1 Clock Tree
    - 3.3.2 Data Align
    - 3.3.3 4:1 MUX
    - 3.3.3 Driver
    - 3.3.4 Calibration
- Chapter 4 Measurement Results
  - 4.1 Chip Photomicrograph
  - 4.2 Measurement Setup
  - 4.3 Measurement Results
  - 4.4 Performance Summary
- Chapter 5 Conclusion
- Bibliography
- Abstract (in Korean)

# **List of Figures**

- Fig. 1.1 The trend of increasing computing core and memory bandwidth
- Fig. 2.1 2.5D/3D system architecture with HBM memory
- Fig. 2.2 Long channel frequency response for a modern server configuration
- Fig. 2.3 Standard single-ended IO interface applied by DRAM
- Fig. 2.4 Current mode driver and voltage mode driver
- Fig. 2.5 Output driver calibration scheme on DRAM
- Fig. 2.6 The structure of the output driver of the conventional DRAM
- Fig. 2.7 Basic eye diagrams of (a) NRZ, (b) PAM-3 and (c) PAM-4 signal
- Fig. 2.8 Lane margining in PCIe 6.0
- Fig. 3.1 Overall architecture of PAM4-binary bridge transmitter
- Fig. 3.2 Block diagram of internal clock generator
- Fig. 3.3 Block diagram of duty cycle corrector
- Fig. 3.4 Timing diagram of duty cycle corrector
- Fig. 3.5 Schematic of clock delay circuit
- Fig. 3.6 Timing diagram of data alignment
- Fig. 3.7 Data alignment block diagram
- Fig. 3.8 Timing diagram of 4:1 MUX
- Fig. 3.9 Schematic of 4:1 MUX
- Fig. 3.10 Block diagram of pre-driver for improving bandwidth of 4:1 MUX
- Fig. 3.11 Eye diagram of the 4:1 MUX output node before and after applying feedback resistor
- Fig. 3.12 Impedance and current prediction according to output level
- Fig. 3.13 Schematic of possible pre-driver
- Fig. 3.14 (a) Proposed pre-driver for controlling the output level (b) simulation result of expected problem of the pre-driver
- Fig. 3.15 (a) Schematic of overdrive scheme (b) simulation result of NODE\_X before and after applying the scheme
- Fig. 3.16 The simulated output eye diagram before and after applying the overdrive scheme
- Fig. 3.17 Overall architecture of the driver
- Fig. 3.18 Overall architecture of calibration logic
- Fig. 3.19 Block diagram of the reference generator
- Fig. 3.20 The flow chart of calibration
- Fig. 3.21 Replica driver of calibration circuit (a) replica drivers that tune the width (b) replica driver that tunes the $V_{GS}$
- Fig. 3.22 (a) Transient simulation result of calibration circuit and (b) simulated output eye diagram of transmitter after the calibration
- Fig. 4.1 Chip photomicrograph
- Fig. 4.2 Measurement setup
- Fig. 4.3 Measured eye diagrams of PAM4 transmitter (a) comparison before and after calibration at 16 Gb/s (b) eye diagram of PAM4 32 Gb/s
- Fig. 4.4 Result of eye level adjustment after calibration
- Fig. 4.5 Power breakdown

# **List of Tables**

- Table 1.1 DRAM bandwidth according to the type of DRAM
- Table 2.1 Simulation result of output duty according to CTRL\_CODE<7:0>
- Table 2.2 Simulation result of clock maximum delay
- Table 2.3 The equation of PAM4 transistor according to output level
- Table 4.1 Performance summary
- Table 4.2 Comparison with other TX chips

# Chapter 1

# Introduction

#### **1.1 Motivation**

Recently, with the development of technologies such as AI and cloud services, the demand for memory bandwidth has been increasing continuously. The "memory wall" problem was first described in the 1990s [1]: the rate of improvement in microprocessor performance far exceeds the rate of improvement in DRAM speed. Moreover, with the recent development of multi-core CPU architectures, core counts are increasing at an unprecedented rate, which dramatically raises the bandwidth requirements of the memory subsystem, as shown in Fig. 1.1 [2].

To increase memory bandwidth, applying PAM4 signaling, which is already widely used in other applications, to the DRAM interface has recently been discussed. A four-level pulse amplitude modulation (PAM4) signal can transfer twice as much information as a non-return-to-zero (NRZ) signal at the same Nyquist frequency.
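The factor of two follows from bits per symbol: PAM4 carries $\log_2 4 = 2$ bits per symbol versus 1 bit for NRZ, so at a fixed data rate it halves the symbol rate and therefore the Nyquist frequency. The short Python sketch below works this out for the 48 Gb/s target of this work; the function name is illustrative only.

```python
import math


def nyquist_ghz(data_rate_gbps: float, levels: int) -> float:
    """Nyquist frequency in GHz: half the symbol rate, where
    symbol rate = data rate / log2(number of levels)."""
    symbol_rate_gbaud = data_rate_gbps / math.log2(levels)
    return symbol_rate_gbaud / 2


for name, levels in (("NRZ", 2), ("PAM4", 4)):
    print(f"{name:4s}: {nyquist_ghz(48, levels):.1f} GHz Nyquist at 48 Gb/s")
```

At 48 Gb/s, the channel must therefore support a 12 GHz Nyquist frequency with PAM4 rather than 24 GHz with NRZ.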

To apply the PAM4 interface to DRAM, test equipment that can evaluate DRAM with a PAM4 interface is required. Since DRAM has used only the NRZ interface so far, replacing the existing test equipment incurs cost, which slows the transition to the PAM4 interface. In this paper, we propose a PAM4-binary bridge chip that converts the NRZ signals of the DRAM tester into the PAM4 signal of the DRAM, focusing in particular on its transmitter.



Fig. 1.1 The trend of increasing computing core and memory bandwidth.

#### **1.2 Thesis Organization**

This thesis is organized as follows. Chapter 2 describes the basics of the DRAM interface and background on transmitters. It then presents four-level pulse amplitude modulation (PAM-4) and explains the requirements of the PAM4 transmitter in the bridge chip.

Chapter 3 presents the implementation of the transmitter for the PAM4-binary bridge. It explains the design considerations and circuit implementation and presents the overall architecture of the transmitter. In particular, it presents the design and operation of the clock tree, data alignment, 4:1 MUX, calibration circuit, and driver.

Chapter 4 describes the measured performance of the transmitter fabricated in 40 nm CMOS technology. The 6 Gb/s NRZ signals from the NRZ tester are converted into a 48 Gb/s PAM4 signal by the bridge transmitter. The chapter also reports chip performance such as power consumption, area, and measured eye diagrams, as well as the calibration results of the driver.

Chapter 5 summarizes the proposed work and concludes this thesis.

# Chapter 2

# Backgrounds

#### **2.1 Overview**

As systems using DRAM advance, DRAM must achieve higher bandwidth. DRAM technology is being developed in two ways to achieve this: the first is to increase the number of data I/O pins, as in HBM, and the second is to increase the data rate per pin, as in DDR and GDDR. Table 1.1 shows the bandwidth, per-pin data rate, and pin count of various types of DRAM.

| Type                     | HBM2  | GDDR5X | GDDR6 | GDDR6X | DDR4   | DDR5   |
|--------------------------|-------|--------|-------|--------|--------|--------|
| # of pins                | 1024  | 16/32  | 16/32 | 16/32  | 4/8/16 | 4/8/16 |
| Data rate (Gb/s per pin) | 1.7   | 11.4   | 14    | 21     | 3.2    | 6.4    |
| Bandwidth (GB/s)         | 217.6 | 45.6   | 56    | 84     | 6.4    | 12.8   |

Table 1.1 DRAM bandwidth according to the type of DRAM
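The Bandwidth row of Table 1.1 is consistent with multiplying the widest pin count of each type by its per-pin data rate and dividing by 8 bits per byte. The Python sketch below reproduces the row under that reading; the dictionary layout and function name are only for illustration.

```python
# Table 1.1 inputs: (widest pin count, per-pin data rate in Gb/s)
DRAM_TYPES = {
    "HBM2": (1024, 1.7),
    "GDDR5X": (32, 11.4),
    "GDDR6": (32, 14.0),
    "GDDR6X": (32, 21.0),
    "DDR4": (16, 3.2),
    "DDR5": (16, 6.4),
}


def bandwidth_gbyte_s(pins: int, rate_gbps: float) -> float:
    """Peak bandwidth in GB/s: pins x per-pin rate (Gb/s) / 8 bits per byte."""
    return pins * rate_gbps / 8


for name, (pins, rate) in DRAM_TYPES.items():
    print(f"{name:7s} {bandwidth_gbyte_s(pins, rate):6.1f} GB/s")
```

The comparison makes the two scaling paths visible: HBM2 reaches its bandwidth through 1024 slow pins, whereas GDDR6X relies on a 21 Gb/s per-pin rate.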

Increasing the number of I/O pins, as in HBM, can achieve high bandwidth. This structure is implemented with a TSV process and a silicon interposer, as shown in Fig. 2.1. The DRAM core dies stacked with TSVs transmit data to the base (bottom) die, and the base die transmits data to the memory controller through the silicon interposer, to which it is connected via micro-bumps (µ-bumps). Because a silicon interposer is used, many more I/Os can be implemented than with conventional PCB wiring.



Fig. 2.1 2.5D/3D system architecture with HBM memory.

Since HBM uses a silicon interposer and TSVs, it is relatively expensive to manufacture, and a single failure incurs a large replacement cost. Therefore, methods that expand bandwidth by increasing the pin speed over conventional PCB wiring continue to be researched. With each generation of DDR and GDDR, the pin speed improves to achieve higher bandwidth. However, as shown in Fig. 2.2, channel loss grows as the per-pin data rate increases, which limits further increases in data rate [3]. To solve this problem, DRAM needs multi-level signaling instead of NRZ signaling. Multi-level signaling is widely studied as a way to improve bandwidth at the same Nyquist frequency, and PAM4 signaling in particular is also being researched for DRAM [4].

When a new scheme is applied to DRAM, the DRAM test infrastructure must be prepared in advance. For example, HBM replaced the metal pads and PCB wiring of conventional DRAM with µ-bumps and a silicon interposer, respectively, and this change also greatly changed how DRAM is tested. DRAM companies therefore proposed various methods to test the new architecture [5]. Just as HBM did when it was first proposed, PAM4 signaling causes major changes to the DRAM interface. DRAM test equipment is built around NRZ signaling and therefore cannot test PAM4 signaling, and replacing NRZ equipment with PAM4-capable equipment would take considerable time and cost. Therefore, a bridge chip that allows PAM4 DRAM to be tested with existing NRZ equipment has been proposed [6][7]. The bridge chip in [6] converts four lanes of 8 Gb/s NRZ data from the tester into one lane of 32 Gb/s PAM4 signal and transmits it to the DRAM. It also receives the 32 Gb/s PAM4 signal from the DRAM, converts it into four lanes of 8 Gb/s NRZ data, and transmits them to the tester. With such a bridge chip, PAM4-interface DRAM can be tested using existing NRZ equipment.



Fig. 2.2 Long channel frequency response for a modern server configuration.

#### **2.2 Basics of DRAM Interface**

Modern DRAM interfaces apply asymmetric termination to reduce the I/O capacitance ($C_{IO}$), thereby improving bandwidth and reducing operating power. Fig. 2.3(a) shows the SSTL scheme used up to DDR3, in which termination is performed with the center-tap method: both NMOS and PMOS are turned on, terminating the line to 0.5VDDQ. SSTL has two disadvantages: a large $C_{IO}$ and DC current flowing during termination [8]. To address both problems, the POD structure, shown in Fig. 2.3(b), was adopted in DDR4 and GDDR6. In the POD structure, the driver's PMOS transistor is removed, so $C_{IO}$ is reduced and no termination current flows.

The termination scheme also changed for mobile DRAM. LPDDR3 used HSUL, which is unterminated and therefore offers low power consumption and a larger swing at low speed, as shown in Fig. 2.3(c). As the required bandwidth increased, LPDDR4 required termination and adopted LVSTL (Fig. 2.3(d)), because LVSTL reduces power consumption. Unlike POD, LVSTL terminates with NMOS only, and because it terminates to ground, it makes VDD scaling easy. LPDDR4X further maximizes the advantages of LVSTL: it uses an N/N driver instead of a CMOS driver, removing the relatively large PMOS and thereby reducing the driver's $C_{IO}$. Moreover, using an NMOS for pull-up allows the interface voltage to be greatly reduced. Despite these advantages, LVSTL has the disadvantage that the receiver input stage must be built with PMOS transistors, because the signal swings close to ground.

DDR5, in which several DRAMs are connected in a multi-drop configuration on a relatively long channel, uses the POD structure, whereas LPDDR5, for which reducing power consumption is important, uses the LVSTL structure, which allows VDDQ scaling. Newly developed DRAM is expected to adopt either POD or LVSTL depending on its channel environment.



Fig. 2.3 Standard single-ended IO interface applied by DRAM.

#### **2.3 Basics of DRAM Driver**

The last stage of the transmitter is a driver that drives the transmission line. The driver can be configured in two ways: as a current-mode driver or as a voltage-mode driver. The current-mode driver, shown in Fig. 2.4(a), produces the output signal with a high output impedance, and source termination is provided by a shunt-connected resistor. However, a current-mode driver consumes more than twice as much current as a voltage-mode driver. Moreover, because a current-mode driver provides differential signaling, it is not suitable for memory applications, which use single-ended signaling for pin efficiency. A voltage-mode driver produces the output signal with a low output impedance, and source termination is provided by a series resistor. Voltage-mode drivers generally consist of a series resistor and a transistor. Because the transistor operates in the linear region and can be approximated by an on-resistance $R_{on}$, the output impedance is $R_{on}+R_s$, so impedance matching requires $Z_0=R_{on}+R_s$. However, since $R_{on}$ is sensitive to PVT variations, a method to adjust $R_{on}$ to the desired value is needed.



Fig. 2.4 Current mode driver and voltage mode driver.

Since DDR3, a function called ZQ calibration has been used to calibrate $R_{on}$. Fig. 2.5 shows the ZQ calibration block diagram of DRAM. With high termination, an external reference resistor ($Z_0$) is connected between the ZQ pin and VSS to calibrate the impedance of the pull-up driver. When ZQ calibration starts, the strength of the pull-up driver is adjusted until the ZQ pin reaches VDDQ/2, and the result is stored in a register (PCODE). After the pull-up driver is calibrated, the impedance of the pull-down driver is calibrated. This step needs a reference resistance ($Z_0$) between VDD and the pull-down driver, and that reference is provided by a replica of the already calibrated pull-up driver. The impedance of the pull-down driver is adjusted until the node between the pull-up and pull-down drivers reaches VDDQ/2, and the result is stored in a register (NCODE). These stored codes are applied to all DQs to match impedances that are sensitive to PVT. To prevent impedance drift due to voltage and temperature changes during operation, the DRAM controller periodically executes calibration by issuing the ZQCS command. (ZQCL is a long, precise calibration command used when the DRAM is initialized; ZQCS is a short calibration command used during operation.)
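The two-step ZQ procedure can be modeled as a pair of code searches. The Python sketch below is a behavioral illustration only: it assumes ideal linear resistor legs, a 1.2 V VDDQ, a 240 Ω external resistor, hypothetical unit-leg resistances, and a simple linear search that stops when a comparator at VDDQ/2 trips. Real DRAMs differ in leg structure, search algorithm, and code width.

```python
VDDQ = 1.2      # assumed supply voltage (V), illustrative only
R_ZQ = 240.0    # assumed external reference resistor on the ZQ pin (ohm)

# Hypothetical unit-leg resistances of the replica pull-up / pull-down arrays;
# `code` legs in parallel give r_unit / code (ideal linear resistors assumed).
R_UNIT_PU = 4000.0
R_UNIT_PD = 3600.0
MAX_CODE = 63


def calibrate_pullup() -> tuple[int, float]:
    """Step 1: pull-up array vs. R_ZQ to VSS; stop when V(ZQ) reaches VDDQ/2."""
    for code in range(1, MAX_CODE + 1):
        r_pu = R_UNIT_PU / code
        v_zq = VDDQ * R_ZQ / (R_ZQ + r_pu)   # resistive divider at the ZQ pin
        if v_zq >= VDDQ / 2:                 # comparator trips once r_pu <= R_ZQ
            return code, r_pu
    return MAX_CODE, R_UNIT_PU / MAX_CODE


def calibrate_pulldown(r_pu: float) -> tuple[int, float]:
    """Step 2: the calibrated pull-up replica is the reference for the pull-down."""
    for code in range(1, MAX_CODE + 1):
        r_pd = R_UNIT_PD / code
        v_mid = VDDQ * r_pd / (r_pu + r_pd)  # node between pull-up and pull-down
        if v_mid <= VDDQ / 2:                # comparator trips once r_pd <= r_pu
            return code, r_pd
    return MAX_CODE, R_UNIT_PD / MAX_CODE


pcode, r_pu = calibrate_pullup()
ncode, r_pd = calibrate_pulldown(r_pu)
print(f"PCODE={pcode} ({r_pu:.1f} ohm), NCODE={ncode} ({r_pd:.1f} ohm)")
```

Because the pull-down is matched against the calibrated pull-up rather than directly against the external resistor, only one external pin is needed; the residual error comes from the finite code step.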

In the PCB environments where DRAM is used, the impedance varies with many factors, such as the permittivity of the PCB, the structure, width, and thickness of the transmission line, and the stack-up. Therefore, DRAM is designed to support a range of impedance settings. Fig. 2.6 shows the structure of the conventional DRAM output driver. It consists of pull-up and pull-down drivers, each with seven legs calibrated to 240 Ω. Using this driver, the DRAM provides termination impedances of 240, 120, 80, 60, 48, 40, and 34 Ω, and output driver impedances of 48, 40, and 34 Ω. DRAM generally uses 40 Ω matching.
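The listed impedances follow from enabling 1 to 7 identical 240 Ω legs in parallel, since n legs give 240/n Ω. A minimal Python sketch:

```python
R_LEG = 240.0  # each leg is calibrated to 240 ohm by ZQ calibration


def parallel_legs(n_legs: int) -> float:
    """Impedance of n identical 240-ohm legs in parallel."""
    return R_LEG / n_legs


print([round(parallel_legs(n), 1) for n in range(1, 8)])
# [240.0, 120.0, 80.0, 60.0, 48.0, 40.0, 34.3]
```

The 34 Ω setting is therefore 240/7 Ω rounded down, and the common 40 Ω matching uses six legs.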



Fig. 2.5 Output driver calibration scheme on DRAM.



Fig. 2.6 The structure of the output driver of the conventional DRAM.

#### **2.4 Basics of Multi-level Signaling**

As mentioned above, PAM4 signaling is widely used to improve pin speed, but it has not yet been applied to DRAM. Since DRAM uses single-ended signaling, both PAM3 and PAM4 are being studied. Fig. 2.7 shows NRZ, PAM3, and PAM4 signals. At the same Nyquist frequency, the data rate increases in the order NRZ, PAM3, PAM4, but the voltage margin decreases in the same order. Therefore, the signaling scheme must be selected to suit the channel environment of the application in which the DRAM is used.
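The trade-off can be quantified for ideal, equally spaced levels: an M-level signal carries at most $\log_2 M$ bits per symbol, while its swing is divided into M-1 stacked eyes, each 1/(M-1) of the full swing. The Python sketch below assumes this idealized model; practical PAM3 encodings, for example mapping 3 bits onto 2 symbols, carry 1.5 bits per symbol, slightly below the $\log_2 3$ bound.

```python
import math


def signaling_tradeoff(levels: int) -> tuple[float, float]:
    """(ideal bits per symbol, eye height as a fraction of full swing) for
    equally spaced levels: the swing is split into (levels - 1) stacked eyes."""
    return math.log2(levels), 1.0 / (levels - 1)


for name, m in (("NRZ", 2), ("PAM3", 3), ("PAM4", 4)):
    bits, eye = signaling_tradeoff(m)
    print(f"{name:4s}: {bits:.2f} bit/symbol, eye height = {eye:.2f} x swing")
```

Under this model PAM4 doubles the bits per symbol of NRZ while cutting each eye to one third of the swing, which is why the choice depends on the channel's loss and noise.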



Fig. 2.7 Basic eye diagrams of (a) NRZ, (b) PAM-3 and (c) PAM-4 signal.

In addition, the linearity of the single-ended voltage-mode drivers used in DRAM requires attention. In a voltage-mode driver, the transistor operates in the triode region, so the relationship between $I_{ds}$ and $V_{ds}$ is given by Equation 1.

$$I_{ds} = \frac{1}{2} k_n \frac{W}{L} \left[ 2 (V_{gs} - V_{th}) V_{ds} - V_{ds}^2 \right] \tag{1}$$

In Equation 1, $V_{ds}$ is set by the output voltage $V_{out}$ (for the pull-down transistor, $V_{ds} = V_{out}$), so in multi-level signaling the $R_{on}$ value varies with the output level. As a result, the spacing between output levels is not equal. In contrast, the current-mode driver shows relatively good linearity. Since the current source of the current-mode driver operates in the saturation region, it has a high output resistance, so only the passive resistance is seen at all output levels and the impedance stays constant. In Chapter 3, we propose a way to obtain linear characteristics while using a voltage-mode driver.
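Dividing $V_{ds}$ by Equation 1 gives $R_{on} = V_{ds}/I_{ds} = 1/[k_n (W/L)(V_{gs}-V_{th}-V_{ds}/2)]$, which grows as $V_{ds}$ rises toward the overdrive voltage. The Python sketch below evaluates this with hypothetical device parameters ($k_n W/L$ = 20 mA/V² and 0.5 V overdrive) to show how much $R_{on}$ drifts across the output range.

```python
K = 20e-3    # k_n * W / L in A/V^2 (hypothetical)
V_OV = 0.5   # overdrive V_gs - V_th in V (hypothetical)


def r_on(v_ds: float) -> float:
    """Large-signal on-resistance V_ds / I_ds from Eq. 1 (triode only: 0 <= V_ds < V_ov)."""
    if not 0.0 <= v_ds < V_OV:
        raise ValueError("Eq. 1 holds only in the triode region")
    if v_ds == 0.0:
        return 1.0 / (K * V_OV)  # limit as V_ds -> 0
    i_ds = 0.5 * K * (2 * V_OV * v_ds - v_ds ** 2)
    return v_ds / i_ds


for v in (0.0, 0.1, 0.2, 0.3, 0.4):
    print(f"V_ds = {v:.1f} V -> R_on = {r_on(v):.1f} ohm")
```

With these values $R_{on}$ rises by about two thirds between $V_{ds}$ = 0 and 0.4 V, so a pull-down leg presents a noticeably different impedance at each PAM4 level.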

In PAM4, the linearity of the data eye is an important indicator of signal integrity, because the smallest of the three eyes determines performance. The ratio of level mismatch (RLM) is a criterion for evaluating the linearity of a PAM4 signal, and it is obtained as follows. First, measure the four output levels ($V_A$ to $V_D$). Next, compute the three eye heights from the measured levels and take the minimum signal level $S_{min}$ (Equation 2). Finally, RLM is defined by Equation 3. PAM4 transmitters are generally required to satisfy RLM > 0.92 [9].

$$S_{min} = \frac{1}{2}\min(V_B - V_A,\ V_C - V_B,\ V_D - V_C) \tag{2}$$

$$R_{LM} = \frac{6 S_{min}}{V_D - V_A} \tag{3}$$
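Equations 2 and 3 translate directly into code. For evenly spaced levels with spacing $d$, $S_{min} = d/2$ and $V_D - V_A = 3d$, so RLM = 1; any compressed eye lowers it. The Python sketch below is an illustrative helper, not the measurement software used in this work, and it assumes the four levels are supplied in any order.

```python
def rlm(levels: list[float]) -> float:
    """RLM per Eq. 2 and Eq. 3 from four measured PAM4 levels V_A < V_B < V_C < V_D."""
    v_a, v_b, v_c, v_d = sorted(levels)
    s_min = 0.5 * min(v_b - v_a, v_c - v_b, v_d - v_c)  # Eq. 2
    return 6 * s_min / (v_d - v_a)                       # Eq. 3


for lv in ([0.0, 0.3, 0.6, 0.9], [0.0, 0.28, 0.6, 0.9], [0.0, 0.25, 0.6, 0.9]):
    value = rlm(lv)
    print(f"{lv} -> RLM = {value:.3f}, meets RLM > 0.92: {value > 0.92}")
```

In these example level sets, shrinking the bottom eye from 0.30 V to 0.28 V still passes the 0.92 criterion, whereas 0.25 V does not.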

The basic function of the bridge-chip transmitter is to convert multiple NRZ signals from the memory tester into a PAM4 signal and output it to the DRAM. When testing a DRAM, both its transmitter and its receiver must be tested. Since the NRZ tester cannot evaluate the PAM4 receiver of the DRAM, the bridge chip must provide this capability. The characteristics of the DRAM receiver can be measured while varying the timing margin and voltage margin of the PAM4 signal. For example, Fig. 2.8 shows the lane margining specification of PCIe 6.0 [10]. Lane margining checks the margin of the input data by measuring it repeatedly at several voltage and timing offsets at the receiver. To verify a receiver in this way, the tester must be able to shift the voltage and timing of the signal. The timing margin of the DRAM receiver can be tested by adjusting the clock applied to the bridge-chip transmitter, which shifts the valid data window, and the voltage margin can be tested by adjusting the output voltage levels of the bridge-chip transmitter. In this paper, we propose a method of adjusting the output levels to test the DRAM receiver.

Since the bridge-chip transmitter uses a single-ended voltage-mode driver, it must both correct the linearity problem to optimize the RLM and adjust its output levels to evaluate the voltage margin of the DRAM receiver.



Fig. 2.8 Lane margining in PCIe 6.0

# Chapter 3

# Design of PAM4 Transmitter for PAM4-Binary Bridge

#### **3.1 Design Consideration**

The transmitter for the bridge chip must generate an accurate PAM4 signal with a high RLM and be able to adjust each output level to test the DRAM receiver. In addition, it should provide the various termination and output impedance values required by the DRAM interface specification. The bridge-chip transmitter was therefore implemented with a CMOS voltage-mode driver, which allows impedance control. DRAM generally uses a CMOS driver composed of six 240 Ω legs, and combinations of these legs provide various termination and output impedances. Using a voltage-mode driver instead of a current-mode driver also reduces power consumption. However, the impedance mismatch and RLM degradation in multi-level signaling caused by transistor nonlinearity must be solved. [11] implemented an N-over-N voltage-mode driver for LPDDR5 with the LVSTL interface and solved this problem by adjusting the number of transistors turned on for each output level (equivalent to adjusting the width). The NMOS transistors of the N-over-N driver operate in the triode region, where $I_{ds}$ and $V_{ds}$ are related by Equation 4; the nonlinear change with the output level, i.e., with $V_{ds}$, is compensated through $W$.

$$I_{ds} = \frac{1}{2} k_n \frac{W}{L} \left[ 2 (V_{gs} - V_{th}) V_{ds} - V_{ds}^2 \right] \tag{4}$$

In the bridge chip, the nonlinearity is instead compensated by adjusting $V_{gs}$ rather than the width. Adjusting $V_{gs}$ provides finer resolution than adjusting the width. This adjustment is used both to improve the RLM and to adjust the output levels for evaluating the voltage margin of the DRAM receiver.
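The resolution argument can be illustrated with Equation 4. Width tuning can only enable an integer number of legs, whereas $V_{gs}$ is a continuous knob: for a fixed leg count $n$, the target resistance is met exactly at $V_{gs} = V_{th} + V_{ds}/2 + 1/(n k R_{target})$. The Python sketch below uses hypothetical device values and ignores the finite step of whatever reference DAC sets $V_{gs}$ in practice; it is not the calibration circuit of this work.

```python
K_LEG = 2e-3    # k_n * W / L of one driver leg, A/V^2 (hypothetical)
V_TH = 0.35     # threshold voltage, V (hypothetical)
V_GS_NOM = 0.9  # nominal gate drive, V (hypothetical)


def r_on(n_legs: int, v_gs: float, v_ds: float) -> float:
    """Triode on-resistance from Eq. 4: 1 / (n * k * (V_gs - V_th - V_ds / 2))."""
    return 1 / (n_legs * K_LEG * (v_gs - V_TH - v_ds / 2))


def legs_for_target(r_target: float, v_ds: float) -> int:
    """Width tuning: nearest integer leg count at the nominal V_gs."""
    return round(1 / (K_LEG * (V_GS_NOM - V_TH - v_ds / 2) * r_target))


def vgs_for_target(r_target: float, v_ds: float, n_legs: int) -> float:
    """V_gs tuning: gate voltage that hits r_target with a fixed leg count."""
    return V_TH + v_ds / 2 + 1 / (n_legs * K_LEG * r_target)


v_ds = 0.25
n = legs_for_target(100.0, v_ds)
print(f"width tuning: {n} legs -> {r_on(n, V_GS_NOM, v_ds):.2f} ohm")
vg = vgs_for_target(100.0, v_ds, 10)
print(f"Vgs tuning:   10 legs at {vg:.3f} V -> {r_on(10, vg, v_ds):.2f} ohm")
```

In this example the best leg count lands about 2% away from the 100 Ω target, while the gate-voltage solution reaches it to within the precision of the voltage setting.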

The maximum clock frequency of the commonly used DRAM tester (T5511) is 4 GHz. With a 4 GHz clock, 8 Gb/s NRZ DDR data can be generated from one DQ pin, so combining two DQ pins yields one 16 Gb/s PAM4 signal. However, the maximum expected speed of next-generation DRAM is 32 Gb/s, and a tester for next-generation DRAM should be faster than the DRAM itself, whereas speeds above 16 Gb/s exceed what the equipment can generate. Therefore, a 48 Gb/s transmitter, faster than the previously proposed bridge chip, was designed. Achieving 48 Gb/s required several changes from [6]. For example, because of speed limits, the clocking scheme was changed from a two-phase clock to a four-phase clock, which required a four-phase clock distribution scheme and a high-speed 4:1 MUX.

#### **3.2 Overall Architecture**

Fig. 3.1 shows the architecture of the bridge's PAM4 transmitter with four adjustable levels, which allows PAM4 DRAM to be evaluated with an NRZ tester. The 6 Gb/s data from the internal parallel PRBS generators or from the eight external DQ pins are split into 4-bit MSB and 4-bit LSB data, and the internal data passing through the 4:1 MUX, whose bandwidth is increased with resistive feedback, are fed to the PAM4 main driver.

Fig. 3.1 Overall architecture of PAM4-Binary bridge transmitter.

The PAM4 main driver processes the data at 24 Gsymbol/s under the control of the four-phase generator. Independently controlling each level of the PAM4 signal is required to evaluate the voltage margin of the memory receiver and also helps improve the RLM of the PAM4 signal. For this purpose, the PAM4 driver has an additional transistor that controls the level at the input node of the main driver. Because the transistors of the PAM4 driver operate in the linear region, the four output voltage levels (drain voltages) change with the gate-source voltage of the MOS transistors. Therefore, a transistor (M2) is added to adjust the driver's input node. The $V_{gs}$ set by this transistor is determined by the VREF\_P\_LSB level obtained through calibration. The four drivers, divided into PMOS/NMOS and LSB/MSB, each require their own VREF signal according to their driving level. Meanwhile, the additional transistors (M1, M2) reduce the slope of the DRV\_LEV\_P node, which lowers the bandwidth. M1 is applied to reduce the impact of M2 during transitions so that the slope of the DRV\_LEV\_P node is sharpened. Figure 2 (bottom) shows the LSB pull-up driver of this circuit. The gate of the LSB pull-up driver is at VSS when driving 1/6 VDD and above VSS when driving 1/2 VDD. In particular, when driving 1/2 VDD, M1 is turned on after the transition of the DRV\_LEV\_P node to guarantee the slope of the output.

#### **3.3 Circuit Implementation**

##### 3.3.1 Clock Tree

The maximum clock frequency of the equipment commonly used for DRAM testing (T5511) is 4 GHz. Since a 4 GHz clock cannot produce a PAM4 signal faster than 16 Gb/s, the bridge chip must generate its high-speed clock internally to produce 48 Gb/s data. In this work, an ADPLL multiplies a 3 GHz reference clock to generate a 6 GHz four-phase clock. The measured jitter is 0.688 ps rms at a 6 GHz output frequency with a 3 GHz reference clock and a 0.9 V supply.
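The 6 GHz four-phase clock follows from rate arithmetic: 48 Gb/s PAM4 is 24 GBaud, and if each of the four phases launches one symbol per clock cycle, each phase runs at 24/4 = 6 GHz. The Python sketch below, with an illustrative helper name, also shows the UI that this rate implies, that a two-phase (half-rate) scheme would need 12 GHz, and that eight 6 Gb/s NRZ inputs supply the 48 Gb/s.

```python
import math


def clock_plan(data_rate_gbps: float, levels: int, n_phases: int, lane_rate_gbps: float):
    """Return (symbol rate in GBaud, UI in ps, per-phase clock in GHz, input lanes)."""
    symbol_rate = data_rate_gbps / math.log2(levels)    # GBaud
    ui_ps = 1e3 / symbol_rate                           # 1 / GBaud = ns, so x1000 -> ps
    phase_clock_ghz = symbol_rate / n_phases            # each phase launches one symbol per cycle
    lanes = math.ceil(data_rate_gbps / lane_rate_gbps)  # parallel NRZ inputs needed
    return symbol_rate, ui_ps, phase_clock_ghz, lanes


baud, ui, f_clk, lanes = clock_plan(48, 4, 4, 6)
print(f"{baud:.0f} GBaud, UI = {ui:.2f} ps, {f_clk:.0f} GHz x 4 phases, {lanes} NRZ lanes")
print(f"two-phase clocking would need {clock_plan(48, 4, 2, 6)[2]:.0f} GHz")
```

The resulting UI of about 41.7 ps is the reference against which the skew-correction range of the clock delay circuit is quoted.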





Fig. 3.2 Block diagram of internal clock generator.

The bridge chip supports two clock paths, selected by the SEL<1> signal. The first generates the clock with the ADPLL. The second bypasses the ADPLL and uses an external clock for debugging; in this debugging mode, a 6 GHz clock must be applied externally. The external clock can be either single-ended or differential, selected by SEL<0>. The four-phase clock derived from either the ADPLL output or the external clock is fed to the phase interpolator (PI), which regenerates it with a new phase so that the clock is synchronized with the data from the eight DQ pads.

Because the final output signal is generated with the four-phase clock, the duty cycle of each phase and the skew between the four phases are critical. Therefore, a DCC block and a delay control block were added for each clock after the four-phase generator. Fig. 3.3 and Fig. 3.4 show the block diagram and timing diagram of the DCC, respectively. The default value of CTRL\_CODE<7:0> is 0F (four highs and four lows), in which case the numbers of turned-on PMOS and NMOS transistors are equal. If the duty cycle of the input clock deviates, it can be corrected by changing the numbers of turned-on PMOS and NMOS transistors. For example, in the timing diagram of Fig. 3.4, the number of highs in CTRL\_CODE<7:0> is greater than the default value of 4. At node ②, the weakened NMOS slows the falling edge, and at node ③, the weakened PMOS slows the rising edge. As a result, after this signal passes through the buffer, the low pulse widens and the high pulse narrows. In this way, the duty cycle of the clock can be adjusted. Table 2.1 shows the simulated clock duty cycle for each CTRL\_CODE<7:0> setting.

Since the output is generated with the four-phase clock, not only the duty cycle of each clock but also the relationship among the four clocks matters. Although the four-phase generator produces accurate 90° spacing, the phases at the output may shift because of clock-tree mismatch, process variation, local variation, and other factors. There must therefore be a way to compensate when the four phases are misaligned for any reason. Fig. 3.5 shows the schematic of the circuit for adjusting four-phase skew. It consists of pass transistors and MOS capacitors, and the delay is adjusted by selecting capacitors through the pass transistors. The capacitor sizes are binary-weighted, each twice the previous one, to provide a range of settings. Since the circuit only needs to adjust the relative timing of the four phases, it is designed to add delay only. Table 2.2 shows the post-layout simulated maximum delay, which corresponds to a correction range of about 10% of 1 UI (41.6 ps).



Fig. 3.3 Block diagram of duty cycle corrector.



Fig. 3.4 Timing diagram of duty cycle corrector.

Table 2.1 Simulation result of output duty according to CTRL\_CODE<7:0>

| # of high in CTRL_CODE<7:0> | Output Duty (%) |
|-----------------------------|-----------------|
| 6 (3F)                      | 40              |
| 5 (1F)                      | 45              |
| 4 (0F)                      | 50              |
| 2 (03)                      | 55              |
| 1 (01)                      | 60              |
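The thermometer-style CTRL\_CODE settings in Table 2.1 can be captured in a small lookup. The sketch below is a minimal illustration, not the control logic used in the chip: it assumes the highs are packed from the LSB (as the listed codes 01, 03, 0F, 1F, 3F suggest) and, hypothetically, that the duty shift of each setting adds linearly to the duty error of the incoming clock. The helper names are illustrative.

```python
# Simulated output duty (%) for a 50% input clock, taken from Table 2.1.
# Keys are the number of highs in CTRL_CODE<7:0>; 3 highs (07) is not tabulated.
SIM_DUTY = {6: 40, 5: 45, 4: 50, 2: 55, 1: 60}


def thermometer_code(n_high: int) -> int:
    """Return CTRL_CODE<7:0> with n_high ones packed from the LSB (4 -> 0x0F)."""
    if not 0 <= n_high <= 8:
        raise ValueError("CTRL_CODE<7:0> has only 8 bits")
    return (1 << n_high) - 1


def pick_code(input_duty: float) -> int:
    """Pick the tabulated setting whose duty shift best cancels the input duty error.

    Hypothetical model: the shift listed in Table 2.1 (relative to the 50% default)
    is assumed to add linearly to the duty error of the incoming clock.
    """
    error = input_duty - 50.0
    best = min(SIM_DUTY, key=lambda n: abs(error + (SIM_DUTY[n] - 50)))
    return thermometer_code(best)


if __name__ == "__main__":
    for duty in (40.0, 50.0, 55.0):
        print(f"input duty {duty}% -> CTRL_CODE = 0x{pick_code(duty):02X}")
```

For a 55% input clock, the sketch selects 0x1F, the setting whose simulated shift is -5%.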



Fig. 3.5 Schematic of clock delay circuit.

Table 2.2 Simulation result of clock maximum delay.

| Process | Maximum delay (ps) |
|---------|--------------------|
| TT      | 4.26               |
| FF      | 3.63               |
| SS      | 5.18               |

#### 3.3.2 Data Align

The bridge transmitter generates its output from parallel 6 Gb/s input data, taken either from eight DQ pins of the DRAM tester or from eight internal PRBS generators. To output 48 Gb/s from eight 6 Gb/s signals, a serializer that converts parallel data into serial data is required. Each of the eight data streams is assigned an LSB/MSB role and an output order, and all are aligned by the synchronized clock from the PI. Because a 4-phase clock is used, BL<0:1> data is synchronized with the 0° clock and BL<2:3> data with the 180° clock. In addition, a 2-tap FFE that compensates the post-cursor is adopted. Since the FFE data must be 90° ahead of the main data, BL<1:2> data is synchronized with the 0° clock. Figures 3.6 and 3.7 show the timing diagram and the block diagram of the serializer, respectively.
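The sketch below is a behavioral model of the serialization and the 2-tap post-cursor FFE, not the circuit itself. It assumes each 6 GHz cycle delivers four PAM4 symbols as (MSB, LSB) pairs, a natural-binary level mapping (the thesis does not say whether Gray coding is used), and a hypothetical tap weight. It also illustrates why the FFE path needs data offset from the main path: the post-cursor term for the first UI of a cycle needs the last symbol of the previous cycle.

```python
from typing import List, Sequence, Tuple

# Assumed natural-binary mapping of (MSB, LSB) to normalized PAM4 levels.
LEVELS = {(0, 0): -3, (0, 1): -1, (1, 0): 1, (1, 1): 3}


def serialize(cycles: Sequence[Sequence[Tuple[int, int]]]) -> List[int]:
    """Flatten quarter-rate cycles of four (MSB, LSB) pairs into a symbol stream."""
    symbols: List[int] = []
    for cycle in cycles:
        if len(cycle) != 4:
            raise ValueError("each quarter-rate cycle carries exactly 4 symbols")
        symbols.extend(LEVELS[pair] for pair in cycle)
    return symbols


def ffe_post_cursor(symbols: Sequence[int], c1: float = 0.25) -> List[float]:
    """2-tap FFE: y[n] = d[n] - c1 * d[n-1] (c1 is a hypothetical tap weight).

    The previous symbol carries across cycle boundaries, so the FFE path needs
    data one UI offset from the main path.
    """
    out: List[float] = []
    prev = 0
    for d in symbols:
        out.append(d - c1 * prev)
        prev = d
    return out


if __name__ == "__main__":
    s = serialize([[(1, 1), (0, 0), (1, 0), (0, 1)], [(1, 1), (1, 1), (0, 0), (0, 0)]])
    print(s)
    print(ffe_post_cursor(s))
```

A real driver would also scale the result to fit the available output swing; the sketch omits that normalization.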



Fig. 3.6 Timing diagram of data alignment.



Fig. 3.7 Data alignment block diagram.

#### 3.3.3 4:1 MUX

For the final MUX, architectures that use quarter-rate (4-phase) clocks and 4:1 MUXes are widely adopted to relax timing constraints and reduce power consumption in the clock path [12]. [12] describes a 4:1 MUX built from three 2:1 MUXes that can be used below 5 GHz. Although this approach is simple and avoids managing four clock phases, its timing margin becomes too tight at our target speed. In this work, four half-frequency clocks and a 4:1 MUX are employed to secure a timing margin of 1 UI on both sides. This clock-to-data margin of more than 1 UI keeps the MUX operational under PVT variation of the serializer's clock-to-output delay (tCO) or of the 1-UI pulse generator. The related timing diagram is shown in Fig. 3.8.
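A behavioral sketch of the quarter-rate selection is shown below. It assumes ideal 50%-duty clocks sampled once per UI and forms each 1-UI pulse as CK\_k AND NOT CK\_(k+1); the actual NAND/NOR gating and the timing margins discussed above are not modeled, and all names are illustrative.

```python
from typing import List


def clock_phases(n_ui: int = 8) -> List[List[int]]:
    """Four quarter-rate clocks sampled once per UI: phase k is high in UIs k and k+1 (mod 4)."""
    return [[1 if (t - k) % 4 in (0, 1) else 0 for t in range(n_ui)] for k in range(4)]


def one_ui_pulses(ck: List[List[int]]) -> List[List[int]]:
    """Pulse k = CK_k AND NOT CK_(k+1): high for exactly one UI per 4-UI period."""
    return [[a & (1 - b) for a, b in zip(ck[k], ck[(k + 1) % 4])] for k in range(4)]


def mux_4to1(pulses: List[List[int]], lanes: List[List[int]]) -> List[int]:
    """In each UI exactly one pulse is high; that lane drives the output (others are high-Z)."""
    out = []
    for t in range(len(pulses[0])):
        active = [k for k in range(4) if pulses[k][t]]
        assert len(active) == 1, "select pulses must not overlap"
        out.append(lanes[active[0]][t // 4])  # each lane holds one symbol per clock cycle
    return out


if __name__ == "__main__":
    p = one_ui_pulses(clock_phases(8))
    print(mux_4to1(p, [[0, 4], [1, 5], [2, 6], [3, 7]]))
```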



Fig. 3.8 Timing diagram of 4:1 MUX.

Fig. 3.9 shows the 4:1 selector, built from four inverters and NAND and NOR gates. A 1-UI pulse generated from two of the 4-phase clocks and the data are applied to the NAND and NOR gates: when the pulse (CK) is high the data is driven to the output, and when it is low the output is in a high-Z state. The 4:1 MUX consists of four such NAND/NOR/INV units, which output their data in sequence. Because no devices are cascaded or stacked, this approach improves output bandwidth. However, the parasitic capacitance of four units connected to one node still limits the bandwidth, which remains insufficient for the target speed. To solve this problem, a resistive-feedback inverter is adopted as the first pre-driver stage after the 4:1 MUX. Fig. 3.10 shows the block diagram of the 4:1 MUX and the resistive-feedback inverter. By the Miller effect, the feedback resistance RF can be split into RF<sub>in</sub> and RF<sub>out</sub>, so the output time constant of the 4:1 MUX is reduced by shunting its output resistance with RF<sub>in</sub> [13]. However, the finite input impedance reduces the MUX output swing to some extent. The following three inverter stages are sized with fan-outs of 1 to 1.5× to restore the signal to CMOS levels and to drive the output driver cells. Fig. 3.11 shows the simulated eye diagrams with and without the 2.7 kΩ feedback resistor. With the resistive-feedback pre-driver, the data-dependent jitter (DDJ) is reduced from 5.1 ps to 2.0 ps.
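The Miller split can be checked numerically. For a resistor R<sub>F</sub> across an inverting stage with small-signal gain -A, the input-side equivalent is R<sub>F</sub>/(1+A) and the output-side equivalent is R<sub>F</sub>·A/(1+A). The sketch uses the 2.7 kΩ value from the text, but the inverter gain, MUX output resistance, and node capacitance are hypothetical placeholders, so the numbers only illustrate the trend.

```python
def miller_split(rf_ohm: float, gain: float) -> tuple:
    """Return (RF_in, RF_out) for a resistor across an inverting stage of gain -gain."""
    rf_in = rf_ohm / (1 + gain)
    rf_out = rf_ohm * gain / (1 + gain)
    return rf_in, rf_out


def parallel(r1: float, r2: float) -> float:
    """Parallel combination of two resistances."""
    return r1 * r2 / (r1 + r2)


def mux_time_constant(r_mux: float, c_node: float, rf_ohm=None, gain=None) -> float:
    """RC time constant at the MUX output node, optionally shunted by RF_in."""
    r = r_mux
    if rf_ohm is not None:
        r = parallel(r_mux, miller_split(rf_ohm, gain)[0])
    return r * c_node


if __name__ == "__main__":
    R_MUX, C_NODE, GAIN = 1e3, 50e-15, 3.0  # hypothetical values
    tau0 = mux_time_constant(R_MUX, C_NODE)
    tau1 = mux_time_constant(R_MUX, C_NODE, rf_ohm=2.7e3, gain=GAIN)
    print(f"tau without RF: {tau0 * 1e12:.1f} ps, with RF: {tau1 * 1e12:.1f} ps")
```

The same shunting that shortens the time constant also forms a divider at the MUX output, which is the swing reduction mentioned in the text.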



Fig. 3.9 Schematic of 4:1 MUX.



Fig. 3.10 Block diagram of pre-driver for improving bandwidth of 4:1 MUX.



Fig. 3.11 Eye diagram of the 4:1 MUX output node before and after applying the feedback resistor.

#### 3.3.4 Driver

The transmitter of the PAM4 binary bridge must be able to control each level of the PAM4 signal while remaining impedance-matched. Fig. 3.12 shows the output impedance of an ideal driver for each output level. The PAM4 driver consists of a PMOS pull-up section and an NMOS pull-down section, each made of three parallel transistor slices of impedance 3Z<sub>0</sub>. VSS termination is used by default in the bridge chip. For example, if the output level is 1/2VDDQ, all three pull-up transistors are turned on and all pull-down transistors are turned off. The pull-up impedance then equals the receiver termination impedance, which forms the 1/2VDDQ output level, and the current I<sub>0</sub> flows through the driver and the termination. To generate the 1/3VDDQ level, two pull-up transistors are turned on to form 1.5Z<sub>0</sub>, and one pull-down transistor is turned on to form 3Z<sub>0</sub>. Together with the termination resistance, these set the output level to 1/3VDDQ, and the output impedance becomes $3Z_0||1.5Z_0=Z_0$. In this case, $8/9I_0$ flows through the pull-up transistors, $2/9I_0$ through the pull-down transistor, and $6/9I_0$ through the termination resistance. Similarly, the 1/6VDDQ level uses a pull-up of 3Z<sub>0</sub> and a pull-down of 1.5Z<sub>0</sub>, as shown in the third panel of Fig. 3.12.



Fig. 3.12 Impedance and current prediction according to output level.
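The levels and currents above follow from treating each slice as an ideal 3Z<sub>0</sub> resistor. The sketch below reproduces them with exact fractions under that ideal-resistor assumption (the same idealization as Fig. 3.12), with currents normalized to I<sub>0</sub> = VDDQ/(2Z<sub>0</sub>); the function name is illustrative.

```python
from fractions import Fraction as F


def driver_state(n_up: int, n_down: int, z0: F = F(1), vddq: F = F(1)):
    """Ideal-resistor model of Fig. 3.12.

    n_up / n_down: number of enabled 3*Z0 pull-up / pull-down slices.
    The receiver terminates to VSS through Z0. Returns
    (Vout / VDDQ, Zout / Z0, I_pullup, I_pulldown, I_term), currents in units of I0.
    """
    if n_up + n_down == 0:
        raise ValueError("at least one slice must be on")
    slice_r = 3 * z0
    g_up = F(n_up) / slice_r    # pull-up conductance
    g_pd = F(n_down) / slice_r  # pull-down conductance
    g_term = 1 / z0             # termination conductance to VSS
    vout = vddq * g_up / (g_up + g_pd + g_term)
    i0 = vddq / (2 * z0)
    zout = 1 / (g_up + g_pd)
    return (
        vout / vddq,
        zout / z0,
        (vddq - vout) * g_up / i0,
        vout * g_pd / i0,
        vout * g_term / i0,
    )


if __name__ == "__main__":
    for n_up, n_down in ((3, 0), (2, 1), (1, 2), (0, 3)):
        print(n_up, n_down, [str(x) for x in driver_state(n_up, n_down)])
```

For (2, 1) this gives V<sub>out</sub> = 1/3 VDDQ, Z<sub>out</sub> = Z<sub>0</sub>, and currents of 8/9, 2/9, and 2/3 (= 6/9) I<sub>0</sub>, matching the text.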

However, when Fig. 3.12 is implemented with real transistors, the driver behaves non-linearly with $V_{ds}$ (the output level) and does not act like the resistors in Fig. 3.12. A voltage-mode PAM4 driver must therefore compensate the RLM degradation caused by transistor non-linearity. In the triode region, the CMOS driver follows equation (4), whose current contains a term quadratic in the output voltage $V_{ds}$; this term causes the non-linearity. To linearize the driver, either the width or $V_{gs}$ in equation (4) can be changed according to $V_{ds}$. [11] improved linearity by changing the width. In this work, the output level also had to be adjustable for DRAM Rx testing in addition to improving RLM, so the output level is adjusted by changing the gate voltage, which provides finer resolution than the width. In Fig. 3.14, the number of turned-on transistors determines the width, and the output level determines $V_d$. The current through the transistors is defined for each output level as shown in Fig. 3.12, and Table 2.3 expresses Fig. 3.12 as equations. Solving these equations gives the correction values α, β, γ, and δ. Because this ideal calculation may not match the actual circuit, a calibration circuit fine-tunes $V_{gs}$ for each output level.

| Output level        | PMOS                                                                                                                   | NMOS                                                                                                                |
|---------------------|------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------|
| $\frac{1}{2}VDDQ/0$ | $I_0 = K_p \frac{3W}{L} [2(\gamma - V_{ddq} - V_{th}) \cdot (-\frac{1}{2} V_{ddq}) - (-\frac{1}{2} V_{ddq})^2]$ | $I_0 = K_n \frac{3W}{L} [2(V_{ddq} - V_{th}) \cdot \frac{1}{2} V_{ddq} - (\frac{1}{2} V_{ddq})^2]$ |
| $\frac{1}{3}VDDQ$   | $\frac{8}{9}I_0 = K_p \frac{2W}{L} [2(\delta - V_{ddq} - V_{th}) \cdot (-\frac{2}{3}V_{ddq}) - (-\frac{2}{3}V_{ddq})^2]$ | $\frac{2}{9}I_0 = K_n \frac{W}{L} [2(V_{ddq} - \alpha - V_{th}) \cdot \frac{1}{3}V_{ddq} - (\frac{1}{3}V_{ddq})^2]$ |
| $\frac{1}{6}VDDQ$   | $\frac{5}{9}I_0 = K_p \frac{W}{L} [2(-V_{ddq} - V_{th}) \cdot (-\frac{5}{6}V_{ddq}) - (-\frac{5}{6}V_{ddq})^2]$ | $\frac{2}{9}I_0 = K_n \frac{2W}{L} [2(V_{ddq} - \beta - V_{th}) \cdot \frac{1}{6}V_{ddq} - (\frac{1}{6}V_{ddq})^2]$ |
| Result              | $\gamma = \frac{1}{6} V_{ddq}, \delta = \frac{1}{12} V_{ddq}$ | $\alpha = \frac{1}{12} V_{ddq}, \beta = \frac{1}{6} V_{ddq}$ |

Table 2.3 Equations of the PAM4 driver transistors for each output level
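Because each equation in Table 2.3 is linear in its unknown gate offset, the correction values can be solved in closed form. The sketch below does this with exact fractions and checks that the results do not depend on V<sub>th</sub> or on the process constants, matching the Result row. It uses the magnitude form of equation (4), I = K·(W/L)·[2(|V<sub>gs</sub>| - |V<sub>th</sub>|)·|V<sub>ds</sub>| - |V<sub>ds</sub>|²], with |V<sub>ds</sub>| = V<sub>out</sub> for the NMOS and VDDQ - V<sub>out</sub> for the PMOS; the V<sub>ddq</sub> and V<sub>th</sub> values in the example are arbitrary placeholders.

```python
from fractions import Fraction as F


def triode_current(k_w, vov, vds):
    """Magnitude form of eq. (4): I = K*(W/L) * [2*(|Vgs|-|Vth|)*|Vds| - |Vds|^2]."""
    return k_w * (2 * vov * vds - vds ** 2)


def gate_offset(i_target, k_w, vddq, vth, vds):
    """Solve I = i_target for g, where |Vgs| = VDDQ - g (I is linear in g)."""
    return vddq - vth - (i_target / k_w + vds ** 2) / (2 * vds)


def solve_offsets(vddq, vthn, vthp, kn=F(1), kp=F(1)):
    """Return (alpha, beta, gamma, delta) as defined in Table 2.3."""
    # NMOS: three slices with full gate drive at |Vds| = VDDQ/2 carry I0.
    i0n = triode_current(3 * kn, vddq - vthn, vddq / 2)
    alpha = gate_offset(F(2, 9) * i0n, kn, vddq, vthn, vddq / 3)      # 1/3 level, 1 slice
    beta = gate_offset(F(2, 9) * i0n, 2 * kn, vddq, vthn, vddq / 6)   # 1/6 level, 2 slices
    # PMOS: |Vds| = VDDQ - Vout. One slice with full gate drive at the 1/6 level
    # carries 5/9 * I0, which fixes I0 for the PMOS equations.
    i0p = triode_current(kp, vddq - vthp, 5 * vddq / 6) * F(9, 5)
    gamma = gate_offset(i0p, 3 * kp, vddq, vthp, vddq / 2)                # 1/2 level
    delta = gate_offset(F(8, 9) * i0p, 2 * kp, vddq, vthp, 2 * vddq / 3)  # 1/3 level
    return alpha, beta, gamma, delta


if __name__ == "__main__":
    V = F(9, 10)  # placeholder VDDQ in volts
    print([str(x / V) for x in solve_offsets(V, F(3, 10), F(7, 20))])
```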



Fig. 3.13 Schematic of possible pre-driver.

There are two ways to provide the gate level of the final driver. The first is to set the supply voltage of the pre-driver to the desired gate level, as shown in Fig. 3.13(a). This method is easy to implement by generating the desired voltage with an LDO. However, generating the four levels ($VDDQ - \alpha$, $VDDQ - \beta$, $\gamma$, $\delta$) requires four separate power lines, each with its own capacitors, which causes area overhead and noise-control problems. In addition, since VDD0 and VDD2 are continuously switched, charge is expected to flow from the higher-voltage supply to the lower one. Because an LDO can only generate a lower voltage from a higher one, an extra mechanism is needed to handle the case where a low supply is pushed above its target, which increases circuit complexity. The second method creates a DC path at the output of the pre-driver to adjust its output level, as shown in Fig. 3.13(b). The desired pre-driver output level is generated by adjusting the strength of the DC path through a reference voltage. Four references are required, but because the reference signals connect only to the gate nodes of the final driver, the capacitance and metal-line overhead are smaller than in the first method. Although the DC path consumes static current, the second method was chosen because the overhead of four power lines was judged to be greater in an actual implementation.

The pre-driver adopted in this work has one problem: the slope when transitioning to $VDDQ - \alpha$ is lower than the slope when transitioning to VDDQ, which reduces the bandwidth of the main driver. NODE\_X in Fig. 3.14(a) outputs $VDDQ - \alpha$ or VDDQ depending on the MSB\_B signal. The resulting eye diagram in Fig. 3.14(b) shows the difference in slope between transitions to $VDDQ - \alpha$ and to VDDQ. To solve this problem, an overdrive scheme was applied to improve the slope of the pre-driver. As shown in Fig. 3.15(a), when outputting $VDDQ - \alpha$, the pre-driver's DC path is turned on only after NODE\_X has swung sufficiently. The resulting waveform is shown in Fig. 3.15(b): although an overshoot appears at the pre-driver output, the slope becomes the same as when transitioning to VDDQ. Fig. 3.16 shows the simulated main-driver output before and after applying the overdrive scheme; the improved output slope secures additional eye margin. Fig. 3.17 shows the overall architecture of the transmitter combining the pre-driver and the main driver.



Fig. 3.14 (a) Proposed pre-driver for controlling the output level (b) simulation result of expected problem of the pre-driver.



Fig. 3.15 (a) Schematic of overdrive scheme (b) simulation result of NODE\_X before and after applying the scheme.



Fig. 3.16 The simulated output eye diagram before and after applying the overdrive scheme.



Fig. 3.17 Overall architecture of the driver.

#### 3.3.5 Calibration

To maximize the RLM of the PAM4 output, an appropriate reference level must be applied to the pre-driver. This work proposes a calibration circuit that finds these reference values. Fig. 3.18 shows the overall architecture of the calibration circuit. It consists of six replica drivers that mimic the driver state (two for tuning the transistor widths and four for the output levels), a comparator that checks whether a replica driver outputs the intended level, and a digital block written in Verilog that searches for the proper reference. As shown in Fig. 3.19, each reference signal is selected by a MUX from the voltages generated by a resistor divider. The VDAC generates four signals, VREF0 to VREF3, that set the output levels, and three reference signals ($^{1}/_{2}VDDQ$, $^{1}/_{3}VDDQ$, $^{1}/_{6}VDDQ$) that serve as targets for calibration. Fig. 3.20 is the calibration flowchart, and Fig. 3.21 shows the concept of the replica driver.

Fig. 3.18 Overall architecture of calibration logic.

Fig. 3.19 Block diagram of the reference generator.

Calibration proceeds in six steps. The first and second steps determine the widths of the NMOS and PMOS. Since PMOS and NMOS strengths shift with PVT variation, their widths are chosen according to the operating conditions. These steps adjust the output impedances of the NMOS and PMOS to Z<sub>0</sub> and 3Z<sub>0</sub>, respectively, with the NMOS gate at VDD and the PMOS gate at VSS. The replica drivers for this stage are configured as in Fig. 3.21(a). The calibration logic adjusts the width until the NMOS replica driver outputs the 1/2 VDDQ level and the PMOS replica driver outputs the 1/6 VDDQ level. The resulting width values become the PCODE and NCODE signals and are distributed to all drivers and replica drivers. Steps 3 to 6 find the appropriate gate input levels. With the width matched to PVT through PCODE and NCODE, the remaining task is to find reference voltages that make the output levels linear. As shown in Fig. 3.21(b), there is one replica driver for each of the four output levels. For example, for the 1/3VDDQ output level, the PMOS and NMOS replicas on the right side of Fig. 3.21(b) are selected and calibrated. For the NMOS, to avoid a replica driver that creates a DC path, the replica is simplified by using a pull-up resistor that produces the same operating point. In the NMOS replica driver at the bottom right of Fig. 3.21(b), the pull-up resistance and the replica termination resistance would form a DC path. The replica termination resistor was therefore removed, and a 6Z<sub>0</sub> pull-up resistor is used so that the NMOS impedance of 3Z<sub>0</sub> yields 1/3VDDQ. The NMOS replica driver for the 1/6 VDDQ level uses the same method: the replica termination resistor is removed and the pull-up is changed to 15/2 Z<sub>0</sub>. Once the reference level has been found for each output level in sequence, calibration ends. All calibration results can be read out through I2C. To adjust the output levels for a DRAM Rx test, the calibration result codes can be read first, modified, and written back through I2C.
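The replica pull-up values follow from a simple divider: with the replica termination removed, an NMOS replica of impedance R<sub>n</sub> and a pull-up R<sub>pu</sub> to VDDQ give V<sub>out</sub> = VDDQ·R<sub>n</sub>/(R<sub>n</sub> + R<sub>pu</sub>), so R<sub>pu</sub> = R<sub>n</sub>·(VDDQ/V<sub>out</sub> - 1). The sketch below checks both cases with exact fractions and confirms that each replica carries the same NMOS current (2/9·I<sub>0</sub>) as the real driver in Fig. 3.12. Like Fig. 3.12, it idealizes the NMOS as a resistor.

```python
from fractions import Fraction as F


def replica_pullup(r_nmos: F, v_target: F) -> F:
    """Pull-up (units of Z0) that sets the replica output to v_target (units of VDDQ)."""
    return r_nmos * (1 / v_target - 1)


def nmos_current(r_nmos: F, v_out: F) -> F:
    """NMOS current in units of I0 = VDDQ / (2 * Z0)."""
    return (v_out / r_nmos) * 2


# 1/3 VDDQ level: one 3*Z0 slice; 1/6 VDDQ level: two slices in parallel (1.5*Z0).
CASES = {"1/3": (F(3), F(1, 3)), "1/6": (F(3, 2), F(1, 6))}

if __name__ == "__main__":
    for name, (r_n, v) in CASES.items():
        print(f"{name} VDDQ: pull-up = {replica_pullup(r_n, v)} Z0, "
              f"I_nmos = {nmos_current(r_n, v)} I0")
```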

Fig. 3.22(a) shows the transient simulation performed to check the calibration operation: in each mode, the replica driver output (black line) converges to the reference signal (blue line). Fig. 3.22(b) shows the simulated output eye diagram after calibration. Before calibration, the upper eyes are typically larger because the on-resistance R<sub>on</sub> varies with $V_{ds}$; after calibration, all three eyes become the same size.



Fig. 3.20 The flow chart of calibration.
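The six-step sequence can be sketched as a controller that sweeps one code at a time until the comparator output flips, then freezes that result before moving on. This is a hedged behavioral sketch, not the Verilog in the chip: the linear sweep, the 64-code range, and the toy comparator functions are all assumptions, and the step names simply follow the text (NCODE, PCODE, then VREF0 to VREF3).

```python
from typing import Callable, Dict, List, Tuple

# Order of the six calibration steps as described in the text.
STEP_NAMES = ("NCODE", "PCODE", "VREF0", "VREF1", "VREF2", "VREF3")

# A comparator model: (code under test, results so far) -> comparator output.
Comparator = Callable[[int, Dict[str, int]], bool]


def sweep_until_flip(compare: Callable[[int], bool], n_codes: int) -> int:
    """Return the first code whose comparator output differs from that of code 0."""
    start = compare(0)
    for code in range(1, n_codes):
        if compare(code) != start:
            return code
    return n_codes - 1  # no flip seen: saturate at the last code


def run_calibration(steps: List[Tuple[str, Comparator]], n_codes: int = 64) -> Dict[str, int]:
    """Run steps strictly in order; later steps can see earlier results (e.g. PCODE/NCODE)."""
    results: Dict[str, int] = {}
    for name, compare in steps:
        results[name] = sweep_until_flip(lambda c: compare(c, results), n_codes)
    return results


if __name__ == "__main__":
    # Toy comparators standing in for "replica output vs. reference" decisions.
    toy = [
        ("NCODE", lambda c, r: c >= 20),
        ("PCODE", lambda c, r: c >= 25),
        ("VREF0", lambda c, r: c + r["NCODE"] >= 40),
    ]
    print(run_calibration(toy))
```

In hardware, each comparator decision would come from comparing a replica output with the 1/2, 1/3, or 1/6 VDDQ reference, and the final codes are the values read out through I2C.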



Fig. 3.21 Replica driver of calibration circuit (a) replica drivers that tune the width (b) replica drivers for each output level.



Fig. 3.22 (a) transient simulation result of calibration circuit and (b) simulated output eye diagram of transmitter after the calibration.

# **Chapter 4**

## **Measurement Results**

# 4.1 Chip Photomicrograph

The prototype chip was fabricated in a TSMC 40 nm CMOS process. The I2C and calibration blocks were written in Verilog, then synthesized and placed and routed; the other blocks were designed manually. Decoupling capacitors were placed on all reference signals, and dummy metal and power capacitance were placed in the space remaining around the power mesh. The area of the PAM4 binary bridge transmitter is 0.32 mm<sup>2</sup>. Fig. 4.1 shows the fabricated PAM4 binary bridge chip.



Fig. 4.1 Chip photomicrograph.

### 4.2 Measurement Setup

To measure the operation of the PAM4 transmitter, the test environment is configured similarly to [6]. Fig. 4.2 shows the test environment of this chip. To set up the bridge chip and run the calibration circuit, the I2C registers are written and read out from a PC. The PC communicates over I2C through an Aardvark adapter, and the code that controls I2C is written in Python. A differential clock is supplied by an Anritsu MP1800A, and a Tektronix MSO73304DX oscilloscope is used to measure the output waveform. The supplies for I2C and the transmitter are provided by an Agilent E3649A DC power supply. Because the Anritsu MP1800A can provide only two NRZ data streams, it cannot supply enough input data for this chip, which requires eight inputs. Therefore, the internal PRBS generator is used for the measurement.
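The read-modify-write flow for adjusting levels after calibration (read the calibration code, change it, and write it back over I2C) can be sketched against an abstract bus. The register addresses, 7-bit device address, code width, and `I2CBus` interface below are hypothetical, and the sketch deliberately does not reproduce the Aardvark adapter's own API; a real script would wrap the adapter's read and write calls behind the same two methods.

```python
from typing import Dict, Protocol, Tuple


class I2CBus(Protocol):
    """Hypothetical minimal bus interface (not a real library API)."""

    def read_reg(self, dev: int, reg: int) -> int: ...

    def write_reg(self, dev: int, reg: int, value: int) -> None: ...


DEV_ADDR = 0x50                                                         # hypothetical
VREF_REG = {"VREF0": 0x10, "VREF1": 0x11, "VREF2": 0x12, "VREF3": 0x13}  # hypothetical
CODE_MAX = 0x3F                                                         # hypothetical 6-bit code


def nudge_level(bus: I2CBus, level: str, delta: int) -> int:
    """Read a calibrated VREF code, offset it by delta (clamped), and write it back."""
    reg = VREF_REG[level]
    code = bus.read_reg(DEV_ADDR, reg)
    new_code = max(0, min(CODE_MAX, code + delta))
    bus.write_reg(DEV_ADDR, reg, new_code)
    return new_code


class FakeBus:
    """In-memory stand-in used to exercise the flow without hardware."""

    def __init__(self) -> None:
        self.regs: Dict[Tuple[int, int], int] = {}

    def read_reg(self, dev: int, reg: int) -> int:
        return self.regs.get((dev, reg), 0)

    def write_reg(self, dev: int, reg: int, value: int) -> None:
        self.regs[(dev, reg)] = value
```

For example, raising the second level for an Rx margin test would correspond to `nudge_level(bus, "VREF1", +2)` under these assumed register names.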



Fig. 4.2 Measurement setup.

### **4.3 Measurement Results**

Fig. 4.3(a) shows the PAM4 data eye measured with the oscilloscope with and without calibration. Before calibration, the upper eye is larger than the lower eye because of the non-linearity; after calibration, all three eyes are the same size. This comparison was measured at 16 Gb/s to check the output levels. Fig. 4.3(b) shows the measurement result of this chip at 48 Gb/s. For DRAM Rx testing, the output levels can be adjusted. Fig. 4.4 shows the measurement result when the second level is raised or lowered: Fig. 4.4(b) is the result of manual adjustment based on the calibration result to improve the RLM, and Fig. 4.4(a) and (c) show the second level raised and lowered, respectively. Table 4.1 summarizes the performance.
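For reference, one widely used RLM definition is the level separation mismatch ratio of IEEE 802.3bs, computed from the four mean levels V<sub>0</sub> < V<sub>1</sub> < V<sub>2</sub> < V<sub>3</sub>. The thesis does not state which definition it uses, so the sketch below is an assumption; it returns 1.0 for equally spaced levels and drops when the middle eye is compressed or expanded relative to the outer ones. The level values in the example are illustrative, not measured data.

```python
def rlm(v0: float, v1: float, v2: float, v3: float) -> float:
    """Level separation mismatch ratio (IEEE 802.3bs-style definition, assumed)."""
    v_mid = (v0 + v3) / 2
    es1 = (v1 - v_mid) / (v0 - v_mid)
    es2 = (v2 - v_mid) / (v3 - v_mid)
    return min(3 * es1, 3 * es2, 2 - 3 * es1, 2 - 3 * es2)


if __name__ == "__main__":
    print(round(rlm(0.0, 1 / 3, 2 / 3, 1.0), 3))  # equally spaced levels
    print(round(rlm(0.0, 0.3, 0.7, 1.0), 3))      # enlarged middle eye
```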



Fig. 4.3 Measured eye-diagrams of PAM4 transmitter (a) comparison before and after calibration at 16 Gb/s (b) eye diagram of PAM4 32 Gb/s.



Fig. 4.4 Result of eye level adjustment after calibration.

Table 4.1 Performance summary.

| Clock<br>Frequency[GHz] | VDD[V] | Power[mW] | Data Rate[Gb/s] | FoM[pJ/bit] |
|-------------------------|--------|-----------|-----------------|-------------|
| 6                       | 0.9    | 85.27     | 48              | 1.78        |

### **4.4 Performance Summary**

The total area of the proposed PAM4-binary bridge transmitter is 0.32 mm<sup>2</sup>. The transmitter operates at 48 Gb/s with PAM4 signaling and consumes 85.25 mW, of which the serializer and driver consume 60.48 mW.
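The figure of merit in Table 4.1 is power divided by data rate; since 1 mW divided by 1 Gb/s equals 1 pJ/bit, the arithmetic is short. The power reported in the text (85.25 mW) and in Table 4.1 (85.27 mW) both round to the same 1.78 pJ/bit.

```python
def fom_pj_per_bit(power_mw: float, rate_gbps: float) -> float:
    """Energy efficiency: mW / (Gb/s) = pJ/bit."""
    return power_mw / rate_gbps


if __name__ == "__main__":
    for p in (85.25, 85.27):
        print(f"{p} mW / 48 Gb/s = {fom_pj_per_bit(p, 48):.2f} pJ/bit")
```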

Fig. 4.5 shows the power consumption of each block. Since this chip was fabricated together with a receiver and other analog blocks, the power breakdown was calculated from post-layout simulation.

With a 6 GHz input clock, data from the internal PRBS generator are output at 48 Gb/s, and the RLM of the PAM4 data eye is 0.99.
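The thesis does not state the order of the internal PRBS generator. As an illustration only, the sketch below implements PRBS7 (the common polynomial x⁷ + x⁶ + 1) with a 7-bit Fibonacci LFSR; any nonzero seed gives a 127-bit maximal-length sequence. How such generators would be assigned to the eight 6 Gb/s lanes is likewise an assumption left open here.

```python
from itertools import islice
from typing import Iterator, Tuple


def _step(state: int) -> Tuple[int, int]:
    """One LFSR shift for x^7 + x^6 + 1: feedback = bit6 XOR bit5."""
    bit = ((state >> 6) ^ (state >> 5)) & 1
    return ((state << 1) | bit) & 0x7F, bit


def prbs7(seed: int = 0x7F) -> Iterator[int]:
    """Yield PRBS7 bits forever starting from a nonzero 7-bit seed."""
    state = seed & 0x7F
    if state == 0:
        raise ValueError("the all-zero state locks up the LFSR")
    while True:
        state, bit = _step(state)
        yield bit


def lfsr_period(seed: int = 0x7F) -> int:
    """Number of shifts until the LFSR state returns to the seed."""
    start = state = seed & 0x7F
    steps = 0
    while True:
        state, _ = _step(state)
        steps += 1
        if state == start:
            return steps


if __name__ == "__main__":
    print("period:", lfsr_period(0x02))
    print("first 16 bits:", list(islice(prbs7(), 16)))
```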

Table 4.2 shows comparison with other PAM4 Tx chips using single-ended signaling.



Fig. 4.5 Power breakdown.

Table 4.2 Comparison with other Tx chips.

|                              | ASSCC'21<br>[7] | TCAS-II'22<br>[6] | JSSC'21<br>[11]         | ISCAS'19<br>[16] | This work    |
|------------------------------|-----------------|-------------------|-------------------------|------------------|--------------|
| Technology                   | 28nm            | 40nm              | 65nm                    | 65nm             | 40nm         |
| Supply(V)                    | 1.2/1           | 1.25/0.9          | 1.0/0.6                 | 1.0              | 0.9          |
| Data rate(Gb/s/pin)          | 24              | 32                | 28                      | 20               | 48           |
| Tx Driver topology           | Voltage-mode    | Current-mode      | Voltage-mode            | Voltage-mode     | Voltage-mode |
| Tx Equalization              | -               | 2-tap FFE         | 2-tap asymmetric<br>FFE | 2-tap FFE        | 2-tap FFE    |
| Signaling type               | PAM4/NRZ        | PAM4/NRZ          | PAM4                    | PAM4/NRZ         | PAM4         |
| Clocking type                | External        | ADPLL             | External                | External         | ADPLL        |
| Tx Driver RLM                | 0.95            | 0.95              | 0.99                    | 0.98             | 0.99         |
| Energy efficiency<br>(pJ/bit) | -              | 3.66*             | 0.65                    | 3.07             | 1.78*        |

\*Includes PLL

# Chapter 5

# Conclusion

This thesis proposes a transmitter for a PAM4-binary bridge that enables PAM4 DRAM to be tested with an NRZ tester. The external input clock is frequency-doubled by an ADPLL and converted into 4-phase clocks, and eight parallel data streams from the input pins or the PRBS generator are aligned with the 4-phase clocks. To test the DRAM receiver, a method to adjust the output levels of the PAM4 data was proposed, together with a calibration scheme that uses this method to improve RLM.

The prototype chip was fabricated in a TSMC 40 nm CMOS process, and the proposed PAM4-binary bridge transmitter occupies 0.32 mm<sup>2</sup>. The transmitter operates at up to 48 Gb/s with PAM4 signaling and consumes 85.25 mW, of which the serializer and driver consume 60.48 mW. Calibration improved the RLM from 0.73 to 0.99, and the FoM is 1.78 pJ/bit.

## **Bibliography**

- [1] W. A. Wulf and S. A. McKee, "Hitting the Memory Wall: Implications of the Obvious," ACM SIGARCH Computer Architecture News, vol. 23, no. 1, pp. 20-24, Mar. 1995.
- [2] Micron.com. [Online] [Accessed on 24th Nov. 2022]. https://www.micron.com/about/blog/2019/june/ddr5-the-next-step-in-system-level-performance
- [3] S. Lehmann and F. Gerfers, "Channel analysis for a 6.4 Gb/s DDR5 data buffer receiver front-end", Proc. 15th IEEE Int. New Circuits Syst. Conf. (NEWCAS), pp. 109-112, Jun. 2017.
- [4] T. M. Hollis et al., "An 8-Gb GDDR6X DRAM achieving 22 Gb/s/pin with single-ended PAM-4 signaling", IEEE J. Solid-State Circuits, vol. 57, no. 1, pp. 224-235, Jan. 2022.
- [5] H. Jun et al., "High-Bandwidth Memory (HBM) Test Challenges and Solutions," in IEEE Design & Test, vol. 34, no. 1, pp. 16-25, Feb. 2017.
- [6] D. Yun et al., "A 32-Gb/s PAM4-Binary Bridge With Sampler Offset Cancellation for Memory Testing," in IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 69, no. 9, pp. 3749-3753, Sept. 2022.
- [7] H. Jin et al., "A 24Gb/s/pin PAM-4 Built Out Tester chip enabling PAM-4 chips test with NRZ interface ATE," 2021 IEEE Asian Solid-State Circuits Conference (A-SSCC), 2021, pp. 1-3.
- [8] 최정환, "고속 DRAM interface" [High-speed DRAM interface], 전자공학회지 [Magazine of the IEIE], vol. 39, no. 7, pp. 20-26, 2012.
- [9] R. Stephens, PAM4: Symbol levels and voltage compression measurements. EDN Network [Online] [Accessed on 24th Nov. 2022]. Available: http://www.edn.com

- [10] PCI Express® Base Specification Revision 6.0 Version 0.7, 2020, [online] Available: https://pcisig.com/specifications.
- [11] Y. -U. Jeong et al., "A 0.64-pJ/Bit 28-Gb/s/Pin High-Linearity Single-Ended PAM-4 Transmitter With an Impedance-Matched Driver and Three-Point ZQ Calibration for Memory Interface," in IEEE Journal of Solid-State Circuits, vol. 56, no. 4, pp. 1278-1287, April 2021.
- [12] B. Razavi, "Design Techniques for High-Speed Wireline Transmitters," in IEEE Open Journal of the Solid-State Circuits Society, vol. 1, pp. 53-66, 2021.
- [13] P. -J. Peng et al., "A 112-Gb/s PAM-4 Voltage-Mode Transmitter With Four-Tap Two-Step FFE and Automatic Phase Alignment Techniques in 40nm CMOS," in IEEE Journal of Solid-State Circuits, vol. 56, no. 7, pp. 2123-2131, July 2021.
- [14] K. Lee et al., "An Adaptive Offset Cancellation Scheme and Shared-Summer Adaptive DFE for 0.068 pJ/b/dB 1.62-to-10 Gb/s Low-Power Receiver in 40 nm CMOS," in IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 68, no. 2, pp. 622-626, Feb. 2021.
- [15] P. -J. Peng, J. -F. Li, L. -Y. Chen and J. Lee, "6.1 A 56Gb/s PAM-4/NRZ transceiver in 40nm CMOS," 2017 IEEE International Solid-State Circuits Conference (ISSCC), 2017, pp. 110-111
- [16] C. Hyun *et al.*, "A 20Gb/s Dual-Mode PAM4/NRZ Single-Ended Transmitter with RLM Compensation," in *IEEE Int. Symposium on Circuits and Systems*. (*ISCAS*), 2019, pp. 1-4.

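# Abstract (Korean)

Because new applications that require high-performance computing systems, such as machine learning and AI, have emerged, high memory bandwidth is required. To meet the high bandwidth demanded of DRAM, the adoption of multi-level signaling is being discussed, but applying this technology requires many changes to the supporting infrastructure. In particular, for mass-produced products such as DRAM, large-scale facilities are built to evaluate them in high volume. Since DRAM manufacturers already have large-scale facilities for NRZ (Non-Return-to-Zero) signal evaluation, supporting multi-level signaling entails large-scale test-facility changes that cost time and money. To solve this problem, a bridge chip connecting the DRAM tester and the DRAM has been proposed; it receives data in parallel from low-performance test equipment, converts it into a high-speed PAM4 signal, and outputs it to the DRAM, and vice versa.

This thesis proposes the transmitter of the bridge chip. The bridge-chip transmitter, built with a voltage-mode CMOS driver, receives in parallel the "8 pins, 6Gb/s" NRZ signals output by the test equipment and transmits the received data to the DRAM as a "1pin, 48Gb/s" PAM4 signal. The transmitter is implemented with a serializer, a 4:1 MUX, and a pre-driver that uses an overdrive scheme. For clocking, it uses 4-phase clocks generated by an internal ADPLL. In particular, the proposed transmitter can adjust the output levels of the PAM4 signal so that the DRAM receiver can be evaluated, and a calibration that optimizes RLM using this level-adjustment function is proposed. Each output level is adjusted by changing the gate voltage level of the final driver of the bridge-chip transmitter. Fabricated in 40nm CMOS, the bridge transmitter occupies 0.32 mm<sup>2</sup>, consumes 85.25 mW, operates at 48 Gb/s, and achieves an RLM of 0.99.

Keywords: 4-level pulse amplitude modulation (PAM-4), PAM-4-binary bridge, DRAM test equipment.

Student Number: 2021-21988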