



**Master's Thesis** 

# Design of High-Speed PAM-4 Receiver with Nonlinearity Compensation for DRAM Test

Design of a High-Speed PAM-4 Receiver Using Nonlinearity Compensation for DRAM Evaluation

by

Kahyun Kim

February, 2023

Department of Electrical and Computer Engineering, College of Engineering, Seoul National University

# Design of High-Speed PAM-4 Receiver with Nonlinearity Compensation for DRAM Test

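Advisor: Deog-Kyoon Jeong

Submitted as a thesis for the degree of Master of Science in Engineering, February 2023

> Department of Electrical and Computer Engineering, Graduate School of Seoul National University, Kahyun Kim

The master's thesis of Kahyun Kim is approved, February 2023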

| Position   | Signature |
|------------|-----------|
| Chair      | (Seal)    |
| Vice-Chair | (Seal)    |
| Member     | (Seal)    |

# Design of High-Speed PAM-4 Receiver with Nonlinearity Compensation for DRAM Test

by

Kahyun Kim

A Thesis Submitted to the Department of Electrical and Computer Engineering in Partial Fulfillment of the Requirements for the Degree of Master of Science at

SEOUL NATIONAL UNIVERSITY

February, 2023

Committee in Charge:

Professor Woo-Seok Choi, Chairman

Professor Deog-Kyoon Jeong, Vice-Chairman

Professor Woogeun Rhee

## Abstract

This thesis proposes a high-speed single-ended PAM-4 receiver with adaptive nonlinearity-compensation techniques for DRAM testing. The receiver incorporates three parallel Cherry-Hooper continuous-time linear equalizers (CTLEs) and a 1-tap, 9-coefficient adaptive decision feedback equalizer (DFE). The CTLEs provide variable gain with offset-canceling calibration. The DFE detects two things: the level separation mismatch ratio (RLM) of the transmitted data, and the nonlinear distortion within the receiver analog front-end (AFE). It compensates for the nonlinearity by simultaneously adapting the nine coefficients of the nonlinearity compensator.

The proposed receiver is fabricated in a 40-nm CMOS technology and occupies 0.236 mm<sup>2</sup>. Measured over a channel with 7-dB loss, the PAM-4 receiver achieves a data rate of 48 Gb/s with a BER below 10<sup>-12</sup>. It achieves an energy efficiency of 2.97 pJ/b and a figure of merit (FOM) of 0.42 pJ/b/dB.

**Keywords**: Four-level pulse amplitude modulation (PAM-4), single-ended receiver (RX), adaptive decision feedback equalizer (DFE), Cherry-Hooper continuous-time linear equalizer (CTLE), level separation mismatch ratio (RLM)

Student Number: 2021-28506

# Contents

| Section |
|---------|
| Abstract |
| Contents |
| List of Figures |
| List of Tables |
| Chapter 1 Introduction |
| 1.1 Motivation |
| 1.2 Thesis Organization |
| Chapter 2 Backgrounds |
| 2.1 Architecture in High-Speed Interface |
| 2.1.1 Serial Link |
| 2.1.2 Multi-Level Pulse-Amplitude Modulation |
| 2.2 Equalizer |
| 2.2.1 Continuous-Time Linear Equalizer |
| 2.2.2 Decision-Feedback Equalizer |
| 2.2.3 Adaptive Equalizer |
| Chapter 3 Design of PAM-4 Receiver with Adaptive Nonlinearity Compensation |
| 3.1 Design Consideration |
| 3.2 Proposed Architecture |
| 3.3 Circuit Implementation |
| 3.3.1 Continuous-Time Linear Equalizer |
| 3.3.2 Nonlinearity-Compensating DFE |
| 3.3.3 Eye-Opening Monitor |
| 3.3.4 Deserializer |
| 3.3.5 Digital-to-Analog Converter |
| Chapter 4 Measurement Results |
| 4.1 Die Photomicrograph |
| 4.2 Measurement Setup |
| 4.3 Measurement Results |

# **List of Figures**

| Figure | Caption |
|--------|---------|
| Fig. 1.1 | Memory test interface between DRAM and ATE |
| Fig. 2.1 | Simplified block diagram of a serial link |
| Fig. 2.2 | (a) Single bit response; (b) Degraded NRZ eye diagram with ISI |
| Fig. 2.3 | (a) Binary encoded PAM-4 signal; (b) PAM-4 eye diagram |
| Fig. 2.4 | Concept of a level separation mismatch ratio |
| Fig. 2.5 | Issues on deciding PAM-4 threshold voltages |
| Fig. 2.6 | (a) Circuit and (b) frequency response of CTLE |
| Fig. 2.7 | DC gain and zero location adjustments of RC-degenerated CTLE |
| Fig. 2.8 | Interpretation of (a) differential input and (b) single-ended input |
| Fig. 2.9 | Practical CTLE with channel length modulation |
| Fig. 2.10 | Block diagram of an n-tap decision feedback equalizer (DFE) |
| Fig. 2.11 | Operation of DFE with single-bit response (SBR) |
| Fig. 2.12 | Structure of a StrongArm latch |
| Fig. 2.13 | Conceptual block diagram of an adaptive equalizer |
| Fig. 3.1 | Overall block diagram of proposed receiver and internal clock path |
| Fig. 3.2 | Circuit implementation and block diagram of Cherry-Hooper CTLE |
| Fig. 3.3 | Swing inequality and DFE linear range |
| Fig. 3.4 | CTLE frequency response post-simulation result |
| Fig. 3.5 | PAM-4 data in 3 CTLEs |
| Fig. 3.6 | DFE circuit implementation |
| Fig. 3.7 | ISI cancellation of conventional DFE without nonlinear distortion |
| Fig. 3.8 | Nonlinearity coefficient approximation |
| Fig. 3.9 | ISI cancellation of conventional DFE with nonlinear distortion |
| Fig. 3.10 | Architecture of 1-tap DFE with 9 nonlinearity coefficients |
| Fig. 3.11 | ISI cancellation of the proposed DFE with nonlinear distortion |
| Fig. 3.12 | Eye diagram with conventional and proposed DFE |
| Fig. 3.13 | Verilog simulation result of 9-coefficient adaptation |
| Fig. 3.14 | Flow chart of count-based EOM |
| Fig. 3.15 | Operation of count-based EOM |
| Fig. 3.16 | Circuit implementation of XOR-EOM |
| Fig. 3.17 | Block diagram of 4:64 deserializer |
| Fig. 3.18 | Circuit implementation of TSPC |
| Fig. 3.19 | Circuit implementation of R-ladder DAC |
| Fig. 4.1 | Die photomicrograph |
| Fig. 4.2 | Power consumption breakdown |
| Fig. 4.3 | Measurement setup |
| Fig. 4.4 | Production of PAM-4 signal |
| Fig. 4.5 | Measured PAM-4 signal by oscilloscope |
| Fig. 4.6 | Measured EOM |
| Fig. 4.7 | Measured BER bathtub |

# **List of Tables**

| Table | Title |
|-------|-------|
| Table 3.1 | Post-layout simulation result of DAC voltage range |
| Table 4.1 | Performance summary and comparison |

## Chapter 1

## Introduction

#### **1.1 Motivation**

The demand for AI and GPU accelerators is increasing rapidly, and single-ended channels must extend their bandwidth in response. Multi-level pulse amplitude modulation (PAM) signaling has been discussed as one way to overcome the bandwidth limitation of conventional non-return-to-zero (NRZ) signaling. Meanwhile, current test equipment cannot keep up with this speed increase, nor does it offer low-cost PAM-4 signaling. One example is the interface between DRAM and automatic test equipment (ATE) (Fig. 1.1). The ADVANTEST T5511 tester is widely used to verify DRAM interface operation. However, it supports only a binary mode, and its maximum clock speed of 4 GHz limits the testable data rate to 8 Gb/s.

New test equipment for PAM-4 signaling is also not expected to become available soon, because a higher data rate inevitably incurs more signal attenuation over the lengthy test cable. Hence, a bridging operation mediates between the single-ended, VSS-terminated high-speed PAM-4 signal from the memory and the low-speed NRZ data sent to the tester [1]. Future high-speed PAM-4 signaling tests therefore clearly need a PAM-4 receiver (RX) that supports up to 48 Gb/s.

Fig. 1.1 Memory test interface between DRAM and ATE

Previous works have employed various schemes for high-speed PAM-4 receivers. One commonly adopted solution is an analog-to-digital converter (ADC)-based receiver. It offers design flexibility and performs most of its high-speed processing in the digital domain, but its energy efficiency is relatively poor, around 10 pJ/bit [2]. Another approach is based on an analog front-end (AFE) with power-efficient analog equalization circuits, but it raises nonlinearity and offset issues.

The level separation mismatch ratio (RLM) is a common criterion for quantifying the nonlinearity of a PAM-4 signal. RLM is often addressed on the transmitter side [3], but much less so on the receiver side. The interface has two primary nonlinearity sources:

- Voltage-mode drivers, a common topology in memory interfaces, match the characteristic impedance of the channel with transistors rather than passive resistors. This makes the interface more vulnerable to distortion caused by impedance mismatch [4].
- Active elements in the receiver front-end add further distortion in transimpedance or bandwidth, as well as transistor mismatch and offset [5].

This thesis describes the design of a 48-Gb/s single-ended PAM-4 receiver with adaptive nonlinearity compensation. The receiver addresses the data nonlinearity caused by both the transmitter driver and the active elements of the receiver.

#### **1.2 Thesis Organization**

This thesis is organized as follows. Chapter 2 provides background on high-speed interfaces and channel equalization. It briefly explains the basic concept of multi-level pulse-amplitude modulation and introduces widely used receiver equalizers, including a method for adaptive equalization.

Chapter 3 proposes a single-ended PAM-4 RX with adaptive nonlinearity compensation. It describes the circuit implementations of the main blocks: the CTLE, DFE, eye-opening monitor (EOM), deserializer (DES), and digital-to-analog converter (DAC). In particular, the CTLE and DFE are designed to adaptively compensate for channel loss and PAM-4 data nonlinearity.

Chapter 4 presents the measurement results of the proposed RX. The chip is fabricated in a 40-nm CMOS technology and tested at a data rate of 48 Gb/s. The measured performance, including BER and eye diagrams, is shown and compared with state-of-the-art receivers.

Chapter 5 summarizes the proposed design and concludes this thesis.

## Chapter 2

### Backgrounds

### 2.1 Architecture in High-Speed Interface

#### 2.1.1 Serial Link

In a high-speed serial interface, data leave a transmitter (TX), pass through a channel, and arrive at a receiver (RX). Fig. 2.1 shows a simplified block diagram of a serial link. For serial communication, a serializer (SER) serializes the data and a deserializer (DES) deserializes them. A practical channel consists of printed circuit boards (PCBs), cables, connectors, vias, and backplanes. Such channels show frequency-dependent loss at high frequencies, mainly due to the skin effect and dielectric loss. An equalizer (EQ) compensates for this low-pass characteristic and extends the bandwidth.

The frequency response of the channel can also be represented as a time-domain response, and the single-bit response (SBR) is one effective way to visualize it. The SBR of the channel, sbr(t), can be written as

$$sbr(t) = h(t) * \phi(t) \tag{2.1}$$

where h(t) is the impulse response of the channel and $\phi(t)$ is the transmitted single-bit pulse. At the receiver side, the continuous-time signal sbr(t) is sampled and can be represented as a discrete-time signal:

$$sbr[n] = sbr(T_0 + nT_b) \tag{2.2}$$

where $T_0$ is the sampling time of the main cursor and $T_b$ is the bit period. The value sbr[0], or $h_0$, is called the main cursor. For negative n, the value sbr[n], or $h_n$, is called a pre-cursor; for positive n, it is called a post-cursor.

Fig. 2.1 Simplified block diagram of a serial link

If the channel is a linear time-invariant (LTI) system, the signal at the RX is a superposition of SBRs spaced one bit period apart. The received signal y(t) is written as

$$y(t) = \sum_{k=-\infty}^{\infty} x[k] \cdot sbr(t - kT_b) \tag{2.3}$$

where x[n] is the transmitted signal, which is +1 or -1 in PAM-2 (NRZ) signaling. Sampling equation (2.3) at $t = T_0 + nT_b$ gives the sampled received signal y[n]:

$$
\begin{aligned}
y[n] = y(T_0 + nT_b) &= \sum_{k=-\infty}^{\infty} x[n-k] \cdot sbr[k] \\
&= x[n] \cdot sbr[0] + \sum_{k \neq 0} x[n-k] \cdot sbr[k] \\
&= x[n] \cdot h_0 + \sum_{k \neq 0} x[n-k] \cdot h_k
\end{aligned} \tag{2.4}
$$

As equation (2.4) shows, y[n] consists of two terms: the desired signal and an unwanted deterministic dispersion, called inter-symbol interference (ISI). Fig. 2.2 shows an example SBR and the corresponding NRZ eye diagram degraded by ISI. Equalizers therefore aim to cancel the ISI and fully recover the transmitted signal to reach the target bit error rate (BER), typically below 10<sup>-12</sup> for NRZ and 10<sup>-9</sup> for PAM-4. With a reasonable design and in the absence of noise, NRZ and PAM-4 eyes are known to start closing at a channel loss of around 10 dB and 4.5 dB, respectively.
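The Python sketch below makes equation (2.4) concrete. It evaluates the sampled response of a short NRZ pattern for a hypothetical discrete SBR with one pre-cursor and two post-cursors; the cursor values are illustrative, not measured. It also computes the worst-case vertical half-opening, $h_0 - \sum_{k \neq 0} |h_k|$. The eye closes once this quantity drops to zero. The function names are assumptions made for illustration.

```python
def received_samples(x, sbr):
    """Sampled RX signal y[n] = sum_k x[n-k] * h_k, as in eq. (2.4).

    x   : list of NRZ symbols (+1 / -1)
    sbr : dict mapping cursor index k -> h_k (k < 0 pre-cursor, k > 0 post-cursor)
    Symbols outside the pattern are treated as 0.
    """
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, h_k in sbr.items():
            if 0 <= n - k < len(x):
                acc += x[n - k] * h_k
        y.append(acc)
    return y


def worst_case_eye(sbr):
    """Worst-case NRZ half-opening: main cursor minus the magnitude of every ISI cursor."""
    return sbr[0] - sum(abs(h_k) for k, h_k in sbr.items() if k != 0)


# Hypothetical SBR samples (illustrative values only).
sbr = {-1: 0.1, 0: 1.0, 1: 0.5, 2: 0.2}
y = received_samples([1, -1, -1, 1], sbr)
eye = worst_case_eye(sbr)
```

A positive `eye` means no noise-free pattern crosses the decision threshold. A negative value means some pattern does, so equalization is required.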

Fig. 2.2 (a) Single bit response; (b) Degraded NRZ eye diagram with ISI

#### 2.1.2 Multi-Level Pulse-Amplitude Modulation

The data bandwidth, or symbol rate, is generally limited by the channel bandwidth and the process technology. Multi-level signaling has been discussed as one way to overcome this limitation. Four-level pulse amplitude modulation (PAM-4) encodes 2 bits into one symbol, so twice as many bits fit into the same unit interval (UI). In a channel with a 12-GHz Nyquist bandwidth, PAM-4 can achieve a data rate of 48 Gb/s (24 Gsymbol/s), while NRZ can transfer only 24 Gb/s.

Fig. 2.3 shows a basic PAM-4 eye diagram with binary coding. Each PAM-4 symbol carries two consecutive data bits. The former is the most significant bit (MSB) and the latter is the least significant bit (LSB).

Fig. 2.3 (a) Binary encoded PAM-4 signal; (b) PAM-4 eye diagram

In equally spaced four-level data, the LSB has half the voltage swing of the MSB. The four levels can therefore be written as x[n] ∈ {±1, ±3}, with the outermost levels normalized to the NRZ swing, and equation (2.4) becomes

$$y[n] = x[n] \cdot \frac{h_0}{3} + \sum_{k \neq 0} x[n-k] \cdot \frac{h_k}{3} \tag{2.5}$$

Here x[n] takes one of the values +3, +1, -1, or -3. These levels can also be denoted by three equivalent codings, which this thesis uses interchangeably:

- binary coding, {00, 01, 10, 11}
- thermometer coding, {000, 001, 011, 111}
- integer coding, {0, 1, 2, 3}

The vertical eye opening of PAM-4 is one third of that of NRZ. The intrinsic signal-to-noise ratio (SNR) loss of the PAM-4 format relative to NRZ is therefore

$$20 \cdot \log_{10}\left(\frac{1}{3}\right) \approx -9.5 \, dB \tag{2.6}$$
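The Python sketch below checks the mappings and figures above. It assumes binary coding with the MSB weighted twice the LSB, so the LSB swing is half the MSB swing as in equation (2.5). It also assumes a symbol rate equal to twice the Nyquist frequency. The helper names are hypothetical.

```python
import math


def pam4_level(msb, lsb):
    """Binary-coded bits -> integer symbol m (0..3) -> level x[n] in {-3, -1, +1, +3}."""
    m = 2 * msb + lsb          # MSB carries twice the weight of the LSB
    return 2 * m - 3


def thermometer(m):
    """Integer symbol m (0..3) -> 3-bit thermometer code, e.g. 2 -> '011'."""
    return "0" * (3 - m) + "1" * m


def bit_rate(nyquist_hz, n_levels):
    """Bit rate when the symbol rate is twice the Nyquist frequency."""
    return 2 * nyquist_hz * math.log2(n_levels)


levels = [pam4_level(msb, lsb) for msb in (0, 1) for lsb in (0, 1)]
codes = [thermometer(m) for m in range(4)]
snr_loss_db = 20 * math.log10(1 / 3)   # eq. (2.6)
```

With these helpers, `bit_rate(12e9, 4)` gives 48 Gb/s and `bit_rate(12e9, 2)` gives 24 Gb/s, matching the 12-GHz example above.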

Nonlinearity also degrades SNR more severely in PAM-4 than in NRZ [4]. One useful indicator of vertical asymmetry is the level separation mismatch ratio (RLM). Using the four voltage levels $\{V_0, V_1, V_2, V_3\}$ shown in Fig. 2.4, the effective symbol separations and the parameter $R_{LM}$ are given by

$$ES_1 = \frac{V_1 - V_{min}}{V_0 - V_{min}} \tag{2.7}$$

$$ES_2 = \frac{V_2 - V_{min}}{V_3 - V_{min}} \tag{2.8}$$

$$V_{min} = \frac{V_0 + V_3}{2} \tag{2.9}$$

$$R_{LM} = \min\left(3 \cdot ES_1,\ 3 \cdot ES_2,\ 2 - 3 \cdot ES_1,\ 2 - 3 \cdot ES_2\right) \tag{2.10}$$

For perfectly symmetric eyes, $ES_1 = ES_2 = \frac{1}{3}$ and $R_{LM} = 1$. The worst eye limits the system margin, so a lower RLM directly degrades BER.
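The Python function below illustrates equations (2.7) to (2.10) by computing $R_{LM}$ from four level voltages. The fourth term of the minimum uses $ES_2$, mirroring the $ES_1$ term. Both level sets are hypothetical examples, and the function name is illustrative.

```python
def rlm(v0, v1, v2, v3):
    """Level separation mismatch ratio from four PAM-4 levels (eqs. 2.7-2.10).

    v0 < v1 < v2 < v3 are the mean voltages of the four symbol levels.
    """
    v_min = (v0 + v3) / 2                      # eq. (2.9), midpoint of the outer levels
    es1 = (v1 - v_min) / (v0 - v_min)          # eq. (2.7)
    es2 = (v2 - v_min) / (v3 - v_min)          # eq. (2.8)
    return min(3 * es1, 3 * es2, 2 - 3 * es1, 2 - 3 * es2)   # eq. (2.10)


ideal = rlm(-3.0, -1.0, 1.0, 3.0)         # equally spaced levels
compressed = rlm(-3.0, -1.2, 1.2, 3.0)    # inner levels pushed outward, outer eyes shrink
```

In the compressed case the outer eyes are 1.8 high against an average of 2, and the function returns $R_{LM} = 0.8$.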

The choice of threshold voltages is another issue, because it affects the horizontal eye width and, in turn, the BER. Three threshold voltages bisect the adjacent pairs among the four voltage levels. If the three eyes are not symmetric, unequally spaced threshold voltages must be considered. In addition, transitions between non-adjacent levels take longer than transitions between adjacent levels, which further narrows the eye. Fig. 2.5 illustrates these issues.



Fig. 2.4 Concept of a level separation mismatch ratio

Meanwhile, PAM-4 in single-ended signaling for memory applications has two main nonlinearity sources.

The first is inherent nonlinearity at the transmitter output, which depends on the TX driver architecture. A current-mode driver with a current-mode logic (CML) structure offers higher linearity, but it consumes twice as much power as needed, since memory interfaces use single-ended signaling for pin efficiency. Memory interface TXs therefore commonly adopt a voltage-mode (VM) driver, also called a source-series terminated (SST) driver. The VM driver shows inferior linearity because of impedance variation in its pull-up and pull-down drivers and impedance mismatch with the channel. Recent LPDDR [6] matches the 50-Ω termination through transistor sizing instead of a passive resistor, which makes the nonlinearity from termination mismatch worse.

The second is nonlinear distortion added by active blocks in the RX. The data path on the receiver side, from the analog front-end to the samplers, consists of equalizers, which are discussed in the next section. Transistor mismatch and offset cause asymmetric and nonlinear transimpedance or frequency-dependent gain, and so do paths segregated by clock phase or data level. PAM-4 also requires a wider dynamic voltage range [7], especially in single-ended signaling, so the transistors suffer more distortion in transimpedance or bandwidth.

Fig. 2.5 Issues on deciding PAM-4 threshold voltages (panel labels: threshold voltages; under assumption of RLM = 1; maximum eye height margin; maximum eye width margin)

Several design techniques have been introduced to deal with the inferior BER of PAM-4 signaling.

One is Gray coding, which encodes consecutive data levels as {00, 01, 11, 10} instead of the binary {00, 01, 10, 11}. An incorrect decision between adjacent levels then causes only one bit error per symbol. Assume that symbol errors occur only between adjacent levels and are equally likely in the three eyes. Binary coding then produces 4/3 bit errors per symbol error on average, so Gray coding reduces the BER by 25%; equivalently, binary coding has a 33% higher BER. In addition, tying the LSB to 0 leaves only the outermost levels (00 and 10), which supports a dual mode with NRZ signaling.

Several compensation schemes have also been proposed to improve RLM in the TX [8]. However, the nonlinearity issue has received much less attention at the RX.
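The Python sketch below tabulates the error counts behind Gray coding. It assumes that each symbol error lands on an adjacent level and that the three eyes fail equally often. Under that assumption, Gray coding produces 3/4 of the bit errors of binary coding. Tying the LSB to 0 leaves only symbols 0 and 3, which is full-swing NRZ. The table and function names are hypothetical.

```python
BINARY = {0: (0, 0), 1: (0, 1), 2: (1, 0), 3: (1, 1)}   # integer symbol m -> (MSB, LSB)
GRAY = {0: (0, 0), 1: (0, 1), 2: (1, 1), 3: (1, 0)}


def adjacent_bit_errors(code):
    """Bit errors when symbol m is mistaken for its neighbour m + 1 (lower, middle, upper eye)."""
    return [sum(a != b for a, b in zip(code[m], code[m + 1])) for m in range(3)]


binary_errors = adjacent_bit_errors(BINARY)
gray_errors = adjacent_bit_errors(GRAY)
# Symbols that remain when the LSB is tied to 0 (NRZ mode).
nrz_symbols_gray = sorted(m for m, (_, lsb) in GRAY.items() if lsb == 0)
```

Binary coding flips both bits at the middle eye (01 to 10), which accounts for its extra bit error.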

### 2.2 Equalizer

#### 2.2.1 Continuous-Time Linear Equalizer

A continuous-time linear equalizer (CTLE) is a linear equalizer typically used in the RX. A linear equalizer compensates for frequency-dependent channel loss with a high-pass transfer function. If the equalizer's transfer function is the inverse of the channel frequency response, their product is flat in the frequency domain. A practical equalizer boosts the signal around the Nyquist frequency so that the overall frequency response is flat up to the Nyquist frequency.

Fig. 2.6 (a) Circuit and (b) frequency response of CTLE

The RC-degenerated CTLE is a conventional active linear equalizer. Fig. 2.6 shows its basic structure and frequency response. Its transfer function is

$$H(s) = \frac{g_m R_D}{1 + \frac{g_m R_S}{2}} \cdot \frac{1 + \frac{s}{\omega_z}}{\left(1 + \frac{s}{\omega_{p1}}\right)\left(1 + \frac{s}{\omega_{p2}}\right)} \tag{2.11}$$

where

$$\omega_z = \frac{1}{R_S C_S}, \quad \omega_{p1} = \frac{1 + \frac{g_m R_S}{2}}{R_S C_S}, \quad \omega_{p2} = \frac{1}{R_D C_P} \tag{2.12}$$

Here $g_m$ is the transconductance of the input transistor, $R_S$ and $C_S$ are the source degeneration resistance and capacitance, $R_D$ is the load resistance, and $C_P$ is the parasitic capacitance at the output node. When $\omega_z$ and $\omega_{p1}$ are set to $\omega_{channel}$ and $\omega_{Nyquist}$, respectively, the circuit provides a maximum boost factor of $1 + \frac{g_m R_S}{2}$ around the Nyquist frequency. Excessive boosting also amplifies noise, so the boost is typically adapted to different channel losses by lowering the dc gain. The dc gain and the placement of the poles and zero are adjusted through $R_S$ and $C_S$, generally under digital control.

Two major design considerations arise here [9].

- The boost factor $1 + \frac{g_m R_S}{2}$ trades off against the dc gain $\frac{g_m R_D}{1 + \frac{g_m R_S}{2}}$. The dc gain must stay around unity to keep the received data swing from being attenuated too much, which limits the boost.
- At high speed, the output pole $\omega_{p2}$ falls below the degeneration pole $\omega_{p1}$ and limits the circuit bandwidth. As $\omega_{p1}$ and $\omega_{p2}$ move closer in the frequency domain, the boost fails to reach its maximum value. A larger $C_S$ increases the spacing between the poles and hence the boost, but at the expense of a lower peaking frequency.

Fig. 2.7 shows how the dc gain and zero location change as $R_S$ and $C_S$ are adjusted when $\omega_{p1} > \omega_{p2}$.
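The bandwidth limitation described above can be seen numerically. The pure-Python sketch below evaluates $|H(j\omega)|$ from equations (2.11) and (2.12) and compares the peak boost for two cases: an output pole $\omega_{p2}$ far above $\omega_{p1}$, and one below it. The device values are hypothetical and are not those of this design. Both cases stay below the ideal $1 + \frac{g_m R_S}{2}$, and the lower output pole costs much more.

```python
import math


def ctle_gain(f_hz, gm, rs, cs, rd, cp):
    """|H(j*2*pi*f)| of the RC-degenerated CTLE, eqs. (2.11)-(2.12)."""
    s = 2j * math.pi * f_hz
    a = 1 + gm * rs / 2                  # degeneration factor = ideal boost
    wz = 1 / (rs * cs)
    wp1 = a / (rs * cs)
    wp2 = 1 / (rd * cp)
    h = (gm * rd / a) * (1 + s / wz) / ((1 + s / wp1) * (1 + s / wp2))
    return abs(h)


def peak_boost(gm, rs, cs, rd, cp):
    """Peak gain over a 1 MHz to ~316 GHz log sweep, divided by the dc gain."""
    freqs = [10 ** (6 + i / 100) for i in range(551)]
    dc = ctle_gain(0.0, gm, rs, cs, rd, cp)
    return max(ctle_gain(f, gm, rs, cs, rd, cp) for f in freqs) / dc


# Hypothetical device values: gm*Rs/2 = 2 (ideal boost 3), dc gain gm*Rd/3 = 1.
p = dict(gm=20e-3, rs=200.0, cs=200e-15, rd=150.0)
ideal_boost = 1 + p["gm"] * p["rs"] / 2
dc_gain = ctle_gain(0.0, cp=10e-15, **p)
boost_fast = peak_boost(cp=10e-15, **p)    # output pole ~106 GHz, far above wp1 (~11.9 GHz)
boost_slow = peak_boost(cp=120e-15, **p)   # output pole ~8.8 GHz, below wp1
```

With these values, `boost_fast` reaches roughly 2.7 against an ideal of 3, and `boost_slow` falls well below 2.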

Various design techniques have been introduced to overcome these issues. One is to cascade multiple CTLE stages, which increases peaking but reduces bandwidth and increases power consumption. Another is inductive peaking. An inductor in series with $R_D$ widens the bandwidth with no extra power, but the bulky on-chip inductor imposes a large area penalty.

Fig. 2.7 DC gain and zero location adjustments of RC-degenerated CTLE

A differential CTLE is also commonly used for single-ended signaling, such as in memory interfaces. The single-ended signal drives one input transistor, and a dc common-mode or reference voltage drives the other. Fig. 2.8 contrasts a differential input with a single-ended input in a differential CTLE. In principle, this arrangement converts the single-ended input into a fully differential output.

With a differential input, each half circuit works as a degenerated common-source stage with an internal virtual ground, seeing $\frac{R_S}{2}$ and $2C_S$ on each side. With a single-ended signal $V_{in}$ and a dc voltage $V_b$, the circuit can no longer be analyzed as a half circuit. Instead, $V_{in}$ sees a common-source stage degenerated by the parallel $R_S$, $C_S$ in series with $\frac{1}{g_m}$, which is the impedance looking into the source of the transistor biased by $V_b$.

To derive $\frac{V_{out+}}{V_{in}}$, the circuit can be decomposed into a source-follower stage cascaded with a common-gate stage, as $\frac{V_{out+}}{V_{in}} = \frac{V_Y}{V_{in}} \cdot \frac{V_{out+}}{V_Y}$. Here $V_Y$ is the voltage at the source of the transistor biased by $V_b$. Assume ideal current sources and no channel-length modulation. The resulting transfer function $H(s) = \frac{V_{out+}-V_{out-}}{V_{in}}$ then matches equation (2.11) apart from the inversion sign. The component ratios $\frac{V_{out+}}{V_{in}}$, $\frac{V_{out-}}{V_{in}}$, and $\frac{V_Y}{V_{in}}$ are

$$\frac{V_{out+}}{V_{in}} = -\frac{V_{out-}}{V_{in}} = -\frac{\frac{g_m R_D}{2}}{1 + \frac{g_m R_S}{2}} \cdot \frac{1 + \frac{s}{\omega_z}}{\left(1 + \frac{s}{\omega_{p1}}\right)\left(1 + \frac{s}{\omega_{p2}}\right)} \tag{2.13}$$

$$\frac{V_Y}{V_{in}} \cong \frac{g_m \cdot \frac{1}{g_m}}{1 + g_m \left(\frac{R_S}{1 + sR_SC_S} + \frac{1}{g_m}\right)} \tag{2.14}$$

where

$$\omega_z = \frac{1}{R_S C_S}, \quad \omega_{p1} = \frac{1 + \frac{g_m R_S}{2}}{R_S C_S}, \quad \omega_{p2} = \frac{1}{R_D C_P} \tag{2.15}$$



Fig. 2.8 Interpretation of (a) differential input and (b) single-ended input
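The single-ended analysis can be checked numerically. The Python sketch below solves the single-ended-driven circuit as a source follower driving a common-gate stage, under the same idealizations: ideal current sources, no $r_O$, and the output pole $\omega_{p2}$ omitted. It then compares $V_{out+}/V_{in}$ with the closed form of equation (2.13) at several frequencies. The device values and function names are hypothetical.

```python
import math


def se_ctle_decomposed(f_hz, gm, rs, cs, rd):
    """Single-ended-driven CTLE solved as a source follower plus a common-gate stage.

    Idealizations as in the text: ideal tail current sources, no channel-length
    modulation, output pole omitted. Returns (V_out+/V_in, V_out-/V_in, V_Y/V_in).
    """
    s = 2j * math.pi * f_hz
    zs = rs / (1 + s * rs * cs)           # R_S || C_S
    i1 = gm / (1 + gm * (zs + 1 / gm))    # signal current through the V_in device and Z_S, per volt
    vy = i1 / gm                          # the V_b-biased device's source looks like 1/g_m
    return -i1 * rd, i1 * rd, vy          # equal and opposite drain currents


def eq_2_13(f_hz, gm, rs, cs, rd):
    """Closed form of eq. (2.13) for V_out+/V_in, output pole omitted."""
    s = 2j * math.pi * f_hz
    a = 1 + gm * rs / 2
    wz, wp1 = 1 / (rs * cs), a / (rs * cs)
    return -(gm * rd / 2) / a * (1 + s / wz) / (1 + s / wp1)


params = dict(gm=20e-3, rs=200.0, cs=200e-15, rd=150.0)   # hypothetical values
rel_errors = []
for f in (0.0, 1e9, 1e10, 5e10):
    v_out_p, v_out_m, v_y = se_ctle_decomposed(f, **params)
    ref = eq_2_13(f, **params)
    rel_errors.append(abs(v_out_p - ref) / abs(ref))
vy_dc = se_ctle_decomposed(0.0, **params)[2]   # eq. (2.14) at dc equals 1 / (2 + g_m R_S)
```

The two forms agree to rounding error, which confirms that the source-follower and common-gate decomposition reproduces equation (2.13).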

Meanwhile, practical transistors exhibit channel-length modulation, which can be modeled as a parallel output resistance $r_O$:

$$r_O = \frac{\partial V_{DS}}{\partial I_D} = \frac{1}{\frac{1}{2}\mu_n C_{ox}\frac{W}{L}(V_{GS} - V_{TH})^2 \cdot \lambda} \approx \frac{1}{\lambda I_D} \tag{2.16}$$

where

$$\lambda \propto \frac{1}{L} \tag{2.17}$$

With process scaling, $r_O$ becomes a finite and considerably small value. A small $r_O$ in parallel with the current sources or the input transistors causes two nonidealities. First, it degrades the gain of the source-follower and common-gate stages. Second, it degrades the boost factor, because the zero shifts to a higher frequency. As a result, a single-ended-input CTLE shows inferior dc gain and boosting around the Nyquist frequency.



Fig. 2.9 Practical CTLE with channel length modulation

#### 2.2.2 Decision-Feedback Equalizer

A decision feedback equalizer (DFE) is a nonlinear equalizer used on the receiver side. Fig. 2.10 shows the block diagram of a typical DFE. Its main functional blocks are a decision block, a feedback filter, and a summer.

The decision block, also called a sampler or comparator, decides whether the received data is logical 0 or 1. This bang-bang behavior is what makes the DFE nonlinear. Each decision is delayed, weighted by a tap coefficient, passed through the feedback filter, and subtracted at the summer. In an n-tap DFE, the n<sup>th</sup> tap multiplies the decision delayed by n UI by the tap coefficient $w_n$. If $w_n$ equals the post-cursor value $h_n$, the summer cancels the corresponding post-cursor ISI, as described in Fig. 2.11.

The CTLE boosts the main cursor and removes long-tail post-cursors upstream; the DFE then eliminates the residual post-cursors. The DFE has an advantage over the CTLE because feeding back hard-decision values does not amplify noise. However, it can suffer from error propagation when a previous decision is incorrect.

Fig. 2.10 Block diagram of an n-tap decision feedback equalizer (DFE)
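The Python sketch below is a minimal behavioral model of the cancellation in Fig. 2.11. It drives a noise-free NRZ sequence through a hypothetical SBR whose post-cursors sum to more than the main cursor, so the eye is closed without equalization. It then applies a 2-tap DFE with $w_n = h_n$. The model ignores timing, noise, and pre-cursors, and the function name is illustrative.

```python
import random


def sign(v):
    return 1 if v >= 0 else -1


def count_errors(h, taps, n_bits=2000, seed=0):
    """Noise-free NRZ link with an optional DFE.

    h    : [h0, h1, ..., hN] main cursor and post-cursors (no pre-cursor)
    taps : DFE weights [w1, ..., wM]; pass [] to disable the DFE
    Symbols and decisions before n = 0 are taken as 0.
    """
    rng = random.Random(seed)
    x = [rng.choice((-1, 1)) for _ in range(n_bits)]
    d = []
    errors = 0
    for n in range(n_bits):
        y = sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)              # channel
        fb = sum(taps[i - 1] * d[n - i] for i in range(1, len(taps) + 1) if n - i >= 0)
        d.append(sign(y - fb))            # summer, then sampler
        errors += int(d[n] != x[n])
    return errors


h = [1.0, 0.6, 0.5]                       # hypothetical SBR: h1 + h2 > h0, eye closed
errors_no_dfe = count_errors(h, [])
errors_dfe = count_errors(h, [0.6, 0.5])  # w_n = h_n
```

Without the DFE, a pattern such as +1 after two -1 symbols lands at -0.1 and is misdetected. With the DFE, each sample reduces to $x[n] \cdot h_0$ and no errors occur.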

One critical design challenge of the DFE is the timing constraint of the feedback loop. Within one UI, three things must happen before the next data is sampled: the sampler produces its decision after the clock-to-Q delay, the feedback settles at the summing node, and the result meets the sampler's setup time. This constraint can be written as

$$t_{c2q} + t_{setup} + t_{settle} < 1 UI \tag{2.18}$$



where $t_{c2q}$, $t_{setup}$, and $t_{settle}$ are the clock-to-Q delay, the setup time of the sampler, and the settling time at the summing node, respectively. A loop-unrolling DFE, also called a speculative DFE, can relax this stringent timing constraint.

Fig. 2.11 Operation of DFE with single-bit response (SBR)
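The Python helper below turns equation (2.18) into a quick budget check. The delay values are purely hypothetical. A loop delay that fits comfortably at 16 Gb/s NRZ can exceed the 41.7-ps UI of 48-Gb/s PAM-4 (24 GBd), which is the situation that motivates loop unrolling. The function name is illustrative.

```python
def dfe_loop_slack_ps(bit_rate_gbps, bits_per_symbol, t_c2q_ps, t_setup_ps, t_settle_ps):
    """Remaining margin of eq. (2.18): 1 UI minus the feedback-loop delay, in ps."""
    ui_ps = 1e3 * bits_per_symbol / bit_rate_gbps
    return ui_ps - (t_c2q_ps + t_setup_ps + t_settle_ps)


# Hypothetical loop delays: 25 ps clock-to-Q, 10 ps setup, 12 ps settling.
slack_pam4_48g = dfe_loop_slack_ps(48, 2, 25, 10, 12)   # 24 GBd, UI = 41.7 ps
slack_nrz_16g = dfe_loop_slack_ps(16, 1, 25, 10, 12)    # 16 GBd, UI = 62.5 ps
```

A negative result means the constraint of equation (2.18) is violated at that rate.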

Among the various sampler topologies, the StrongArm latch is the most widely adopted, for three reasons [10]:

- It consumes zero static power.
- It directly produces rail-to-rail outputs.
- Its input-referred offset arises primarily from a single differential pair.

Fig. 2.12 shows a circuit implementation of the StrongArm latch. Its hard-decision output is convenient for implementing a DFE. Additional input pairs can compensate for the input-referred offset or compare a differential input with a differential reference voltage. Kickback noise from the clocked current source can be troublesome, although the differential architecture partially cancels it.



Fig. 2.12 Structure of a StrongArm latch

#### 2.2.3 Adaptive Equalizer

As mentioned in Section 2.2.2, a DFE fully cancels post-cursor ISI only when the tap coefficient $w_n$ matches the post-cursor value $h_n$. However, the ISI depends on the frequency response of the channel, which the receiver does not know in advance. The equalizer therefore often employs an adaptation method to maximize the eye opening across various channel characteristics.

Fig. 2.13 shows the conceptual block diagram of an adaptive equalizer. In it, x[n], y[n], w[n], r[n], and e[n] denote the transmitted data, the received data, the equalizer weight vector, the equalized received data, and the error, respectively. dLev is the desired data level of the received signal. Comparing r[n] with the dc-scaled transmitted symbol, dLev $\cdot x[n]$, allows both dLev and $w_n$ to be adjusted in the direction that reduces e[n].

Fig. 2.13 Conceptual block diagram of an adaptive equalizer

The least-mean-square (LMS) algorithm is one such error-reduction method. Its update equation can be written as

$$w_i[k+1] = w_i[k] - \frac{\mu}{2} \frac{\partial e^2[n]}{\partial w_i[k]} \tag{2.19}$$

where *i* is the tap index, *k* is the time instant, $\mu$ is the update coefficient, and $e[n] = r[n] - dLev \cdot x[n]$. The gradient in equation (2.19) can be expressed as

$$
\begin{aligned}
\frac{\partial e^{2}[n]}{\partial w_{i}[k]} &= 2 \cdot e[n] \cdot \frac{\partial}{\partial w_{i}[k]} \left(r[n] - dLev \cdot x[n]\right) \\
&= 2 \cdot e[n] \cdot \frac{\partial}{\partial w_{i}[k]} \left( \sum_{l=0}^{\infty} w_{l}[k] \cdot y[n-l] - dLev \cdot x[n] \right) \\
&= 2 \cdot e[n] \cdot y[n-i]
\end{aligned} \tag{2.20}
$$

In short, the LMS algorithm can be expressed as

$$w_i[k+1] = w_i[k] - \mu \cdot e[n] \cdot y[n-i] \tag{2.21}$$

However, e[n] and y[n-i] are analog values, and capturing them requires high-resolution, power-hungry circuits. Measuring a linear difference becomes even harder at high speed. Sign-sign LMS (SS-LMS) is therefore commonly used because its circuit implementation is simple. SS-LMS updates the adaptation coefficients based only on the signs of the data and the error, and comparators can hard-decide both. The update equation of SS-LMS is

$$w_i[k+1] = w_i[k] - \mu \cdot sign(e[n]) \cdot d[n-i] \tag{2.22}$$

where d[n] is the data recovered by hard decision.

In a DFE, SS-LMS can be used to adapt both the data level dLev and the tap coefficient  $w_i$  simultaneously, as follows:

$$dLev[k+1] = dLev[k] + \mu_{dLev} \cdot \mathrm{sign}(e[n]), \qquad \text{if } d[n] = +1$$
(2.23)

$$w_i[k+1] = w_i[k] - \mu_{w_i} \cdot \mathrm{sign}(e[n]), \qquad \text{if } d[n-i] = +1$$
(2.24)

where dLev is the desired data level when the logical value of the received data is "+1" in NRZ signaling. Because  $e[n] = r[n] - dLev \cdot x[n]$ , dLev is raised whenever the sampled level lies above it.
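To make the update rules concrete, the following is a minimal behavioral sketch of SS-LMS adapting dLev and a single NRZ DFE tap. The channel (h0 = 1.0, h1 = 0.4), noise level, step size, and the function name `ss_lms_nrz_dfe` are hypothetical choices for illustration, not values from this design. It assumes a single error sampler at +dLev, so both loops update only when d[n] = +1. Because the feedback is subtracted in this sketch, r[n] = y[n] - w_1 d[n-1], the tap update carries a + sign and w_1 converges toward h_1; with the additive convention implied by (2.20), the sign flips as in (2.24).

```python
import random


def sign(x):
    """Hard decision used by the data and error samplers."""
    return 1 if x >= 0 else -1


def ss_lms_nrz_dfe(h0=1.0, h1=0.4, noise=0.02, mu=1e-3, n_sym=50_000, seed=1):
    """Sign-sign LMS adaptation of dLev and one DFE tap w1 for NRZ data.

    Hypothetical setup: y[n] = h0*x[n] + h1*x[n-1] + noise, the DFE subtracts
    w1*d[n-1], and a single error sampler sits at +dLev, so both loops update
    only when d[n] = +1.
    """
    rng = random.Random(seed)
    d_lev, w1 = 0.5, 0.0
    x_prev, d_prev = 1, 1
    for _ in range(n_sym):
        x = rng.choice((-1, 1))
        y = h0 * x + h1 * x_prev + rng.gauss(0.0, noise)
        r = y - w1 * d_prev            # equalized sample r[n]
        d = sign(r)                    # data sampler decision d[n]
        if d == 1:
            e = sign(r - d_lev)        # error sampler output sign(e[n]), e[n] = r[n] - dLev
            d_lev += mu * e            # raise dLev when the sample sits above it
            w1 += mu * e * d_prev      # '+' because the feedback is subtracted here
        x_prev, d_prev = x, d
    return d_lev, w1


d_lev, w1 = ss_lms_nrz_dfe()
print(round(d_lev, 2), round(w1, 2))   # close to h0 = 1.0 and h1 = 0.4
```

Restricting updates to d[n] = +1 mirrors the conditional form of (2.23) and (2.24); a sampler at -dLev would allow updates on both polarities.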

In PAM-4 signaling, one of the four data levels, generally '+3', is selected for dLev. Recalling equation (2.5), when the four data levels are equally spaced, the contribution of the post-cursor  $h_k$  is  $+h_k$ ,  $+\frac{h_k}{3}$ ,  $-\frac{h_k}{3}$ , or  $-h_k$  when the transmitted symbol x[n-k] is +3, +1, -1, or -3, respectively. A PAM-4 DFE can therefore be realized with three taps, each fed back from one of the three data samplers. In this case, the tap coefficient  $w_k$  is divided into three equal portions,  $\frac{w_k}{3}$ , one per tap, so that the DFE can fully cancel the  $\frac{h_k}{3}$  terms given above. Under the assumption that the PAM-4 signal has three equal, linear data eyes (RLM = 1), a single dLev and identical  $\frac{w_k}{3}$  values work as intended. On the other hand, when RLM < 1 and nonlinearity occurs, the dLev and tap-coefficient adaptation described above is not enough to eliminate the ISI precisely. Chapter 3 covers this in detail.
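The following sketch checks that three taps of weight w/3, driven by the three slicers, reproduce the post-cursor pattern described above, and shows the residual that remains when the middle eye is compressed. The slicer thresholds (-2, 0, +2), h_1 = 0.3, the inner level of ±0.8, and the function names are illustrative assumptions.

```python
# Slicer thresholds midway between the ideal levels -3, -1, +1, +3 (assumed for illustration).
THRESHOLDS = (-2.0, 0.0, 2.0)


def slicer_outputs(level):
    """Three hard decisions (lower, middle, upper eye), each +1 or -1."""
    return [1 if level > t else -1 for t in THRESHOLDS]


def dfe_feedback(prev_level, w):
    """Three feedback taps, each weighted w/3, driven by the three slicer outputs."""
    return sum(w / 3 * d for d in slicer_outputs(prev_level))


def first_post_cursor_isi(prev_level, h1):
    """First post-cursor ISI, normalized so that a '+3' symbol contributes +h1."""
    return h1 * prev_level / 3


h1 = 0.3
for sym in (3, 1, -1, -3):
    # Ideal, equally spaced levels: the feedback matches the ISI (+h1, +h1/3, -h1/3, -h1).
    print(sym, round(first_post_cursor_isi(sym, h1), 4), round(dfe_feedback(sym, h1), 4))

# Compressed middle eye (inner levels at +/-0.8, RLM < 1): a residual ISI remains.
residual = first_post_cursor_isi(0.8, h1) - dfe_feedback(0.8, h1)
print(round(residual, 4))   # -0.02
```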

Furthermore, one major design challenge in PAM-4 signaling is choosing the three data threshold voltages that bisect the upper, middle, and lower eyes of the PAM-4 signal. Previous works have adapted them either independently [11] or under the assumption that they are equally spaced [12]. An optimal threshold voltage can improve the BER and the immunity to clock jitter or voltage noise, although what is optimal differs depending on whether the horizontal eye width or the vertical eye height is prioritized.
# **Chapter 3**

# Design of PAM-4 Receiver with Adaptive Nonlinearity Compensation

### **3.1 Design Consideration**

DRAM data bandwidth keeps increasing; the state-of-the-art GDDR6 operates at up to 18 Gb/s/pin [13]. The prototype chip in this thesis is therefore designed as a PAM-4-binary bridge for next-generation memory interfaces, targeting post-GDDR6 DRAM. The bridge receives a single-ended, VSS-terminated PAM-4 signal of up to 48 Gb/s from the DRAM and delivers the data to the ATE as 6-Gb/s NRZ signals over 8 DQ pins. The major design considerations are as follows. First, the post-GDDR6 memory interface uses single-ended signaling with VSS termination, and the data from the DRAM swing between 0 V and 0.625 V. Because PMOS transistors perform worse than NMOS transistors, one option is to shift the input up into the NMOS operating range with a level shifter. However, a level shifter cannot provide enough bandwidth, so the RX AFE in this thesis uses a PMOS-based CTLE and DFE instead. For the same bandwidth reason, a single-to-differential converter is not adopted; the CTLE itself converts the single-ended signal into differential data.

Second, single-ended PAM-4 input data require three threshold voltages. Using the count-based EOM, the threshold voltages are adapted independently for maximum eye opening.

Third, nonlinearity is compensated at the receiver side. As mentioned in Chapter 2, inherent RLM arises from the DRAM transmitter itself. In the RX datapath as well, active circuits and PVT variation cause mismatch and the resulting nonlinearity in the data. A novel nonlinearity-compensating equalization scheme, implemented in the DFE, is proposed.

### **3.2 Proposed Architecture**

Fig. 3.1 illustrates the overall block diagram of the proposed receiver and its internal clock path. The AFE consists of three parallel CTLEs, a DFE, a 4:64 deserializer, and a VDAC. The digital circuit includes two types of EOM (XOR-based and count-based), threshold voltage calibration, dLev adaptation DLFs, and the proposed nonlinearity-compensating DFE tap-coefficient adaptation DLFs. The internal clock path incorporates an ADPLL, a PI, a DCC, and a 4-phase generator.

The 48-Gb/s PAM-4 data (DQ) from the memory pass through three parallel Cherry-Hooper CTLEs, which cancel the long-tail post-cursor ISI. The three data threshold voltages are calibrated through the count-based EOM. A 1-tap quarter-rate DFE eliminates the first post-cursor using 9-coefficient adaptation instead of a single tap coefficient. In addition to 12 data samplers, 3 error samplers are used to track the dLev voltage of each data eye and to serve as an XOR-based EOM. The optimal data sampling timing is determined through the count-based EOM and set by the PI. After alignment, the decoded data are sent to the ATE as 8 parallel NRZ DQ streams at 6 Gb/s each.

The internal clock is generated by the ADPLL, which converts the 3-GHz DQS from the DRAM into an 8-phase 6-GHz clock. The subsequent PI is controlled by an I2C code. A DCC (duty-cycle corrector) and a 4-phase generator follow, and the 4-phase clock is distributed through the clock tree.
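As a quick arithmetic check of the rates above, the sketch below derives the symbol rate, Nyquist frequency, and clock frequency from the 48-Gb/s PAM-4 input. It assumes a quarter-rate architecture (four clock phases), which is consistent with the 4-phase clock and the 12 data samplers (three eyes times four phases) but is an assumption of this example.

```python
data_rate_gbps = 48.0     # PAM-4 input data rate per DQ
bits_per_symbol = 2       # PAM-4 carries two bits per UI
baud_gbaud = data_rate_gbps / bits_per_symbol      # 24 GBaud (UI of about 41.7 ps)
nyquist_ghz = baud_gbaud / 2                       # 12 GHz
clock_ghz = baud_gbaud / 4                         # quarter-rate: 6 GHz per clock phase
dqs_ghz = 3.0
pll_ratio = clock_ghz / dqs_ghz                    # 3-GHz DQS -> 6-GHz clock: x2
data_samplers = 3 * 4                              # 3 eyes x 4 clock phases
nrz_pins = 8
nrz_rate_gbps = data_rate_gbps / nrz_pins          # per-pin NRZ rate toward the ATE

print(baud_gbaud, nyquist_ghz, clock_ghz, pll_ratio, data_samplers, nrz_rate_gbps)
# 24.0 12.0 6.0 2.0 12 6.0
```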



### **3.3 Circuit Implementation**

#### **3.3.1 Continuous-Time Linear Equalizer**

At the very front of the AFE, three parallel CTLEs receive the PAM-4 data sent from the DRAM at one side of their input transistors. Fig. 3.2 shows the schematic of the employed PMOS-based Cherry-Hooper CTLE and its equivalent block diagram. PMOS input transistors are used to cover the DRAM data level, which ranges from VSS to half VDD. Instead of the conventional RC-degeneration structure, the Cherry-Hooper topology is adopted. The Cherry-Hooper CTLE is composed of three stages. The first stage is a conventional RC-degeneration stage, which provides AC gain peaking around the Nyquist frequency, as mentioned in Section 2.2.1. The second stage is a transconductance (CML) stage, which provides gain from DC to the Nyquist frequency, similar to a variable-gain amplifier (VGA). The third stage is a negative-feedback stage. The resistance intentionally placed in the feedback path introduces an additional pole in the feedback loop, which appears as a zero in the closed-loop response because of the negative feedback. This additional zero increases the peaking gain and extends the bandwidth in the presence of the output pole. Consequently, for an identical output pole, the Cherry-Hooper CTLE offers higher overall gain, wider bandwidth, and larger peaking gain than the RC-degeneration CTLE. This helps overcome the design complexity discussed earlier that arises when a single-ended input is applied in the presence of channel-length modulation.

Fig. 3.2 Circuit implementation and block diagram of Cherry-Hooper CTLE
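The pole-to-zero conversion can be seen in an idealized model, assumed here for illustration only: a forward amplifier with frequency-independent gain $A$ and a feedback path with DC factor $\beta_0$ and a single pole $\omega_p$. These symbols are not taken from Fig. 3.2.

$$H(s) = \frac{A}{1 + A\,\beta(s)}, \qquad \beta(s) = \frac{\beta_0}{1 + s/\omega_p} \;\;\Longrightarrow\;\; H(s) = \frac{A\,(1 + s/\omega_p)}{1 + A\beta_0 + s/\omega_p}$$

The feedback-path pole becomes a closed-loop zero at $\omega_p$, while the closed-loop pole moves out to $(1 + A\beta_0)\,\omega_p$, so the gain rises from $A/(1 + A\beta_0)$ toward $A$ between the two. The poles of the real forward amplifier are ignored in this model.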

The transconductance stage also benefits the following DFE. The DFE summer, placed right after the CTLE, is a CML type; when its input swing exceeds a certain range, linearity breaks down and the DC gain drops. The first CTLE stage produces a pseudo-differential output in which one side swings about twice as much as the other. The second, CML stage redistributes this into nearly equal currents, because its two input transistors share the same source node and current source. This keeps the CTLE output swing within the linear range of the DFE summer. Fig. 3.3 illustrates the swing inequality and the linear range of the DFE summer.

The source-degeneration resistor and capacitor are each digitally controlled by a 3-bit code through I2C. The current sources of the three stages are biased from a single bias pad, and a separate 3-bit code for each stage fine-tunes its bias current. The resistance in the negative feedback path is digitally controlled by 1 bit.

The post-layout simulation of the CTLE frequency response for various R and C values is shown in Fig. 3.4. A selectable peaking gain of 2 to 8 dB is provided around the Nyquist frequency.
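For intuition on how the R and C codes set the peaking, the sketch below evaluates a textbook small-signal model of an RC source-degenerated differential pair, which corresponds only to the first stage of the Cherry-Hooper CTLE. All component values and function names are hypothetical. In this model the ideal peaking, ignoring the output pole, is 20 log10(1 + g_m R_S / 2); with g_m = 10 mS and R_S from 50 Ω to 300 Ω, that bound spans about 1.9 dB to 8.0 dB.

```python
import math


def degenerated_pair_gain_db(f_hz, gm, r_d, r_s, c_s, c_l):
    """Differential gain (dB) of an RC source-degenerated pair at frequency f_hz.

    Idealized model with hypothetical values: R_S || C_S sits between the two
    sources, and R_D || C_L loads each output.
    """
    s = 2j * math.pi * f_hz
    z_s = r_s / (1 + s * r_s * c_s)          # degeneration impedance
    g_m_eff = gm / (1 + gm * z_s / 2)        # each half-circuit sees Z_S / 2
    z_l = r_d / (1 + s * r_d * c_l)          # output load with its pole
    return 20 * math.log10(abs(g_m_eff * z_l))


def peaking_db(gm, r_d, r_s, c_s, c_l, f_max=1e11, n=400):
    """Peak gain minus low-frequency gain over a log-spaced sweep from 1 MHz to f_max."""
    decades = math.log10(f_max) - 6
    freqs = [10 ** (6 + decades * i / (n - 1)) for i in range(n)]
    gains = [degenerated_pair_gain_db(f, gm, r_d, r_s, c_s, c_l) for f in freqs]
    return max(gains) - gains[0]


for r_s in (50.0, 100.0, 200.0, 300.0):
    bound = 20 * math.log10(1 + 10e-3 * r_s / 2)   # no output pole
    real = peaking_db(gm=10e-3, r_d=500.0, r_s=r_s, c_s=150e-15, c_l=20e-15)
    print(f"R_S = {r_s:.0f} ohm: ideal {bound:.1f} dB, with output pole {real:.1f} dB")
```

The gap between the two columns is the peaking lost to the output pole, which is the limitation the Cherry-Hooper feedback stage is meant to relieve.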



Fig. 3.3 Swing inequality and DFE linear range



Fig. 3.4 CTLE frequency response post simulation result

Each of the three threshold voltages is applied to the input transistor of one of the three parallel CTLEs, on the side opposite the received PAM-4 signal, as described in Fig. 3.5. The threshold voltages are independently calibrated through the count-based EOM and generated by the VDAC with a 7-bit control input. Independent calibration reduces the effect of random offsets arising from transistor mismatch and the datapath. In addition, because each CTLE has to handle only one data eye, the design constraints of the PAM-4 RX, such as the linearity requirement and the input linear range of the DFE summer, are greatly relaxed.



Fig. 3.5 PAM-4 data in 3 CTLEs

#### 3.3.2 Nonlinearity-Compensating DFE

Fig. 3.6 shows the circuit implementation of the proposed nonlinearity-compensating 1-tap 9-coefficient adaptive DFE. A direct-feedback DFE has a tighter timing constraint than a speculative DFE, as shown in (2.18). To reduce the feedback time of the tap, the output of the StrongARM latch drives the CML tap directly through only a buffer, without the RS latch that converts RZ data to NRZ data.

Fig. 3.6 DFE circuit implementation

The DFE adopts a shared-summer structure [14]. In a conventional quarter-rate DFE, four summers are needed because the first tap is driven by NRZ data that remain valid for 4 UI. Because RZ data are valid for only 2 UI, the four summers can be merged into two paths, even and odd. Fewer summers mean less parasitic capacitance at the CTLE output node, which relaxes the settling-time constraint. A common-mode feedback transistor, or bleeder, keeps the CM level constant regardless of the tap bias, so that the samplers operate at their optimal point.

The proposed DFE adapts the first-tap coefficient by calculating nine nonlinearity coefficients that compensate for the nonlinearity of the PAM-4 data. First, ideal PAM-4 data can be represented with a thermometer code as follows:

$$V_{ideal_{TX}} = \frac{1}{3}d_h + \frac{1}{3}d_z + \frac{1}{3}d_l$$
(3.1)

The four data levels are represented as  $(d_h, d_z, d_l) = (0,0,0), (0,0,1), (0,1,1), (1,1,1)$, from lowest to highest. Assuming that the pre-cursors are eliminated in the TX and that the long-tail post-cursors other than the first post-cursor  $h_1$  are eliminated by the RX CTLE, the ideal PAM-4 data arriving at the DFE are:

$$V_{ideal_{RX}} = h_0/3 * \{d_h[0] + d_z[0] + d_l[0]\} + h_1/3 * \{d_h[-1] + d_z[-1] + d_l[-1]\}$$
(3.2)

A conventional DFE assumes the ISI model of (3.2), so it can cancel the first post-cursor simply by estimating  $h_0$  and  $h_1$  through an adaptation algorithm, as described in Fig. 3.7.

However, practical PAM-4 data, especially with single-ended signaling, have two major nonlinearity sources, as discussed in Section 2.1.2: the TX driver and the RX AFE. The former, TX RLM, can be represented with the transmitter eye-height coefficients  $a_h, a_z, a_l$  where  $a_h + a_z + a_l = 1$ . The latter, a different transimpedance for each data eye, arises from the different CM levels per data eye and the transistor offset in each CTLE. It can be approximated with the receiver transimpedance parameters  $b_h$ ,  $b_z$ ,  $b_l$ . Fig. 3.8, (3.3), (3.4), and (3.5) describe the nonlinear distortion of the PAM-4 signal with the TX and RX nonlinearity coefficients.  $V_{\text{RX\_input}}$  is the received signal at the input of the RX AFE, and  $V_{\text{RX\_EYEH}}$ ,  $V_{\text{RX\_EYEZ}}$ , and  $V_{\text{RX\_EYEL}}$  are the outputs of the three CTLEs, respectively.

Fig. 3.7 ISI cancellation of conventional DFE without nonlinear distortion

$$V_{\rm TX} = a_h * d_h + a_z * d_z + a_l * d_l \tag{3.3}$$

$$V_{\text{RX\_input}} = a_h * \{h_0 * d_h[0] + h_1 * d_h[-1]\} + a_z * \{h_0 * d_z[0] + h_1 * d_z[-1]\} + a_l * \{h_0 * d_l[0] + h_1 * d_l[-1]\}$$
(3.4)

$$V_{\text{RX}_{\text{EYEH}}} = b_h * V_{RX_{input}}$$

$$V_{\text{RX}_{\text{EYEZ}}} = b_z * V_{RX_{input}}$$

$$V_{\text{RX}_{\text{EYEL}}} = b_l * V_{RX_{input}}$$
(3.5)



Fig. 3.8 Nonlinearity coefficient approximation

Under the nonlinear ISI model described above, a conventional DFE cannot fully cancel the ISI, as described in Fig. 3.9. The impact of the nonlinearity coefficients is shown more clearly in (3.6), (3.7), and (3.8).



Fig. 3.9 ISI cancellation of conventional DFE with nonlinear distortion

$$V_{\text{RX}_{EYEH}} = b_h h_0 * [a_h d_h[0] + a_z d_z[0] + a_l d_l[0]] + b_h h_1 * [a_h d_h[-1] + a_z d_z[-1] + a_l d_l[-1]] = \beta_1 d_h[0] + \beta_2 d_z[0] + \beta_3 d_l[0] + \alpha_1 d_h[-1] + \alpha_2 d_z[-1] + \alpha_3 d_l[-1]$$
(3.6)

$$V_{\text{RX}_{EYEZ}} = b_z h_0 * [a_h d_h[0] + a_z d_z[0] + a_l d_l[0]] + b_z h_1 * [a_h d_h[-1] + a_z d_z[-1] + a_l d_l[-1]] = \beta_4 d_h[0] + \beta_5 d_z[0] + \beta_6 d_l[0] + \alpha_4 d_h[-1] + \alpha_5 d_z[-1] + \alpha_6 d_l[-1]$$
(3.7)

$$V_{\text{RX}_{EYEL}} = b_l h_0 * [a_h d_h[0] + a_z d_z[0] + a_l d_l[0]] + b_l h_1 * [a_h d_h[-1] + a_z d_z[-1] + a_l d_l[-1]] = \beta_7 d_h[0] + \beta_8 d_z[0] + \beta_9 d_l[0] + \alpha_7 d_h[-1] + \alpha_8 d_z[-1] + \alpha_9 d_l[-1]$$
(3.8)

In other words, when the PAM-4 data eyes are distorted by the TX and the RX, the value of the first post-cursor differs depending on the combination of d[n-1] and d[n]. The coefficients  $\alpha_n$ ,  $1 \le n \le 9$ , are redefined nonlinearity coefficients that incorporate the impact of both TX and RX nonlinearity.
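The sketch below builds the coefficient sets of (3.6) to (3.8) as beta[j][k] = b_j h_0 a_k and alpha[j][k] = b_j h_1 a_k, with row j the eye (H, Z, L) and column k the bit (d_h, d_z, d_l), so flattening alpha row by row gives alpha_1 to alpha_9. The distortion values are hypothetical. In the ideal case every alpha_n equals h_1/3, whereas with distortion the same previous symbol produces a different post-cursor at each CTLE output.

```python
def nonlinear_coefficients(a, b, h0, h1):
    """Return (beta, alpha) as 3x3 lists: row j = eye (H, Z, L), column k = bit (d_h, d_z, d_l)."""
    beta = [[bj * h0 * ak for ak in a] for bj in b]
    alpha = [[bj * h1 * ak for ak in a] for bj in b]
    return beta, alpha


def post_cursor(alpha, eye, prev_code):
    """First post-cursor at one CTLE output for the previous thermometer code (d_h, d_z, d_l)."""
    return sum(ak * dk for ak, dk in zip(alpha[eye], prev_code))


# Ideal case: a_h = a_z = a_l = 1/3 and b_h = b_z = b_l = 1, so every alpha_n equals h1/3.
_, alpha_ideal = nonlinear_coefficients((1 / 3, 1 / 3, 1 / 3), (1.0, 1.0, 1.0), h0=1.0, h1=0.3)

# Hypothetical distortion: unequal TX eye heights (summing to 1) and unequal CTLE gains.
_, alpha = nonlinear_coefficients((0.30, 0.33, 0.37), (0.9, 1.0, 1.1), h0=1.0, h1=0.3)
prev = (0, 1, 1)   # previous symbol is the second-highest level
print([round(post_cursor(alpha, eye, prev), 4) for eye in range(3)])   # [0.189, 0.21, 0.231]
```

A conventional DFE with a single coefficient would subtract the same 2h_1/3 = 0.2 at all three outputs, leaving residuals of about -0.011, +0.010, and +0.031 in this example.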



Fig. 3.10 Architecture of 1-tap DFE with 9 nonlinearity coefficients

In the proposed AFE architecture, the three data eyes are handled separately by the three CTLEs, so the nine coefficients can be calculated individually while the three dLevs are adapted simultaneously. The detailed DFE architecture is described in Fig. 3.10. In this architecture, the nine coefficients  $\alpha_n$ , rather than a single  $h_1$ , are adapted through independent DLFs using SS-LMS. The actual post-cursor can be eliminated by calculating the nine nonlinearity coefficients  $\alpha_n$ ,  $1 \le n \le 9$ , together with one data level per data eye,  $\beta_1 + \beta_2 + \beta_3$ ,  $\beta_5 + \beta_6$ , and  $\beta_9$ , as described in Fig. 3.11.

Fig. 3.11 ISI cancellation of the proposed DFE with nonlinear distortion
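To illustrate how the nine coefficients and three dLevs can converge with SS-LMS, the following is a behavioral sketch, not the synthesized DLF logic. Assumptions: decisions equal the transmitted thermometer codes, the error sampler of each eye is evaluated only when the current code equals that eye's reference level ((1,1,1), (0,1,1), (0,0,1)), the DFE subtracts its post-cursor estimate, and all parameter values and names are hypothetical. Because the thermometer bits are 0 or 1, a coefficient moves only when its bit in d[n-1] is 1.

```python
import random

# Thermometer codes (d_h, d_z, d_l) of the four PAM-4 levels, lowest to highest.
CODES = [(0, 0, 0), (0, 0, 1), (0, 1, 1), (1, 1, 1)]
# Reference code at which each eye's error sampler tracks its dLev (eye H, Z, L).
REF = [(1, 1, 1), (0, 1, 1), (0, 0, 1)]


def adapt_nine_coefficients(a, b, h0, h1, mu=2e-4, n_sym=100_000, noise=0.005, seed=0):
    """SS-LMS adaptation of 3 dLevs and 9 post-cursor coefficients (behavioral sketch)."""
    rng = random.Random(seed)
    beta = [[bj * h0 * ak for ak in a] for bj in b]    # main-cursor terms of (3.6)-(3.8)
    alpha = [[bj * h1 * ak for ak in a] for bj in b]   # true alpha_1 ... alpha_9
    c = [[0.0] * 3 for _ in range(3)]                  # adaptive DFE coefficients
    d_lev = [0.0] * 3
    prev = CODES[0]
    for _ in range(n_sym):
        cur = rng.choice(CODES)
        for eye in range(3):
            if cur != REF[eye]:
                continue                               # this eye's error sample is not used
            v = sum(beta[eye][k] * cur[k] + alpha[eye][k] * prev[k] for k in range(3))
            v += rng.gauss(0.0, noise)
            r = v - sum(c[eye][k] * prev[k] for k in range(3))   # subtract estimated post-cursor
            e = 1 if r >= d_lev[eye] else -1                       # error sampler: sign(r - dLev)
            d_lev[eye] += mu * e
            for k in range(3):
                c[eye][k] += mu * e * prev[k]   # only coefficients with d_k[n-1] = 1 move
        prev = cur
    return alpha, c, d_lev


alpha, c, d_lev = adapt_nine_coefficients((0.30, 0.33, 0.37), (0.9, 1.0, 1.1), h0=1.0, h1=0.3)
for eye in range(3):
    print([round(x, 3) for x in alpha[eye]], [round(x, 3) for x in c[eye]], round(d_lev[eye], 3))
```

With these values the three dLevs should settle near beta_1 + beta_2 + beta_3 = 0.9, beta_5 + beta_6 = 0.70, and beta_9 = 0.407, and each row of c near the corresponding row of alpha.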

Fig. 3.12 shows MATLAB-simulated eye diagrams before and after DFE tap adaptation with the conventional and the proposed 9-coefficient adaptation algorithms. Fig. 3.13 shows the nine coefficients converging to different values when nonlinearity occurs, obtained from a Verilog simulation of the synthesized design.

The three error samplers implemented for dLev adaptation are reused for the EOM after DFE adaptation, as discussed in the next section.



Fig. 3.12 Eye diagram with conventional and proposed DFE



Fig. 3.13 Verilog simulation result of 9-coefficient adaptation

#### 3.3.3 Eye-Opening Monitor

The proposed receiver incorporates two types of 2D on-chip eye-opening monitor (EOM): a count-based EOM and an XOR-based EOM. The count-based EOM finds the optimal reference-voltage and timing codes for the VDAC and the PI [1]. Fig. 3.14 and Fig. 3.15 show the flow chart and the operation of the count-based EOM. During EOM training, predefined PRBS-7 data are transmitted from the memory to the bridge receiver in burst mode. Recall that the three data threshold voltages, REFH, REFZ, and REFL, are applied to the three parallel CTLEs and bisect the upper, middle, and lower eyes of the PAM-4 signal. When the data are sampled without error, the samplers following the three CTLEs produce 32, 64, and 96 '1's, respectively, out of 127 bits in a single PRBS-7 cycle. By counting the number of 1s, one can judge whether a given voltage and timing code corresponds to a valid sampling point. The reference voltages are generated by the VDAC with a 7-bit control input, VCW, and the PI generates the clock with a 6-bit control input, PCW.
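The 32/64/96 counts can be reproduced with a short sketch. It assumes that the PAM-4 training pattern is formed by taking consecutive bit pairs of a PRBS-7 sequence (taps 7 and 6, x^7 + x^6 + 1) as (MSB, LSB) with a binary level mapping; how the memory actually generates the pattern is not specified here, and the function names are hypothetical. Over one 127-symbol cycle, level -3 occurs 31 times and each other level 32 times, which yields the expected slicer counts.

```python
def prbs7_bits(n, seed=0x7F):
    """Fibonacci LFSR with taps 7 and 6 (period 127); seed must be nonzero."""
    state, bits = seed, []
    for _ in range(n):
        new = ((state >> 6) ^ (state >> 5)) & 1
        bits.append(new)
        state = ((state << 1) | new) & 0x7F
    return bits


def pam4_symbols(bits):
    """Map consecutive (MSB, LSB) pairs to -3, -1, +1, +3 (binary mapping, assumed)."""
    return [2 * (2 * m + l) - 3 for m, l in zip(bits[0::2], bits[1::2])]


def ones_per_slicer(symbols, thresholds=(2, 0, -2)):
    """Number of '1' decisions of the upper, middle, and lower slicers."""
    return [sum(1 for s in symbols if s > t) for t in thresholds]


symbols = pam4_symbols(prbs7_bits(254))   # 127 PAM-4 symbols = one full PRBS-7 cycle
print(ones_per_slicer(symbols))           # [32, 64, 96]
# A threshold placed above the '+3' level breaks the pattern, flagging an invalid code:
print(ones_per_slicer(symbols, thresholds=(3.5, 0, -2)))   # [0, 64, 96]
```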

The XOR-based EOM scans the 2D data eye and measures the BER. After DFE adaptation, the three error samplers are reused to scan the 2D eye. The outputs of the error samplers are XORed with the three sampled data [15]. If the scanning point of an error sampler lies inside the data eye, its output must coincide with that of the data sampler. By sweeping VCW and PCW, 128 × 64 pixels are obtained to draw a 2D eye-opening map. Fig. 3.16 describes the circuit implementation of the XOR EOM. With the PRBS burst pattern, the BER can also be measured in two dimensions.
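A toy model of one XOR-EOM pixel is sketched below. The eye model (amplitude cos(πt) at phase offset t, plus Gaussian noise) and all parameters are hypothetical; only the XOR comparison between a data sampler at the eye center and an error sampler at the scanned voltage reflects the scheme described above. Pixels inside the eye give zero mismatches, while pixels outside it give roughly 50 %.

```python
import math
import random


def xor_eom_pixel(v, t, n_bits=4000, noise=0.05, seed=0):
    """Fraction of XOR mismatches between data and error samplers at one (v, t) pixel.

    Toy eye model: a +1/-1 bit sampled at phase offset t (UI, -0.5..0.5) has
    amplitude cos(pi*t) plus Gaussian noise. The data sampler slices at 0 and
    the error sampler at the scanned voltage v.
    """
    rng = random.Random(seed)
    amp = math.cos(math.pi * t)
    mismatches = 0
    for _ in range(n_bits):
        sample = rng.choice((-1, 1)) * amp + rng.gauss(0.0, noise)
        data_out = sample > 0.0
        err_out = sample > v
        mismatches += data_out != err_out     # XOR of the two decisions
    return mismatches / n_bits


# Coarse 5 x 5 scan; the receiver itself sweeps 128 VCW x 64 PCW codes.
volts = [-1.2, -0.6, 0.0, 0.6, 1.2]
phases = [-0.4, -0.2, 0.0, 0.2, 0.4]
for v in reversed(volts):
    print(" ".join(f"{xor_eom_pixel(v, t, n_bits=400):.2f}" for t in phases))
```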



Fig. 3.14 Flow chart of Count-based EOM



Fig. 3.15 Operation of Count-based EOM



Fig. 3.16 Circuit implementation of XOR-EOM

#### **3.3.4 Deserializer**

Fig. 3.17 and Fig. 3.18 show the block diagram of the 4:64 deserializer and the circuit implementation of its TSPC flip-flop, respectively. The deserializer is built from true single-phase clock (TSPC) logic DFFs [16], which align the data and error samples with the 6-GHz clock. The deserialized data go to the digital block for DFE adaptation.



Fig. 3.17 Block diagram of 4:64 Deserializer



Fig. 3.18 Circuit implementation of TSPC

#### 3.3.5 Digital-to-Analog Converter

Fig. 3.19 shows the circuit implementation of the digital-to-analog converter (DAC). The DAC uses an R-ladder topology to guarantee monotonicity. The 7-bit control input VCW selects one of 128 equally spaced voltage levels over the output range. The maximum voltage range, and hence the voltage step per LSB, is fine-tuned by five PMOS transistors placed in parallel with resistors and controlled by a 5-bit digital code. Table 3.1 lists the output voltage range and the voltage step per LSB when the PMOS transistors are all turned on and all turned off.



Fig. 3.19 Circuit implementation of R-ladder DAC

|            | Reference range | Step per LSB | Current | Codes |
|------------|-----------------|--------------|---------|-------|
| TM all off | 0 ~ 0.42 V      | ~3.2 mV      | 130 µA  | 128   |
| TM all on  | 0 ~ 0.65 V      | ~5 mV        | 165 µA  | 128   |

 Table 3.1
 Post-layout simulation result of DAC voltage range
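The sketch below models an ideal resistor-string (R-ladder) DAC with 128 unit resistors, assuming the upper end of each range in Table 3.1 is the voltage across the whole string; the function name and mismatch model are hypothetical. It also shows why the topology is monotonic: each tap voltage is a cumulative sum of positive resistor drops, so it increases with the code even under resistor mismatch. The computed LSBs, about 3.28 mV and 5.08 mV, are roughly consistent with the approximate steps in Table 3.1.

```python
import random


def string_dac_taps(v_top, n_codes=128, mismatch=0.0, seed=0):
    """Tap voltages of a resistor-string DAC: n_codes resistors in series from 0 V to v_top.

    Code k selects the node below the k-th resistor, so code 0 -> 0 V and the
    step is about v_top / n_codes. 'mismatch' is a hypothetical relative sigma.
    """
    rng = random.Random(seed)
    res = [max(1e-6, 1.0 + rng.gauss(0.0, mismatch)) for _ in range(n_codes)]
    total, acc, taps = sum(res), 0.0, []
    for r in res:
        taps.append(v_top * acc / total)
        acc += r
    return taps


for v_top in (0.42, 0.65):          # range upper ends from Table 3.1
    taps = string_dac_taps(v_top)
    print(f"{v_top} V range: step {1e3 * (taps[1] - taps[0]):.2f} mV")

noisy = string_dac_taps(0.65, mismatch=0.05)
print(all(b > a for a, b in zip(noisy, noisy[1:])))   # True: every resistor adds a positive drop
```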

# Chapter 4

# **Measurement Results**

### 4.1 Die Photomicrograph

The prototype chip is fabricated in 40-nm CMOS technology, and the die photomicrograph is shown in Fig. 4.1. The active area of the RX, excluding the clocking and NRZ TX circuits, is 0.236 mm<sup>2</sup>. The RX is tested with a 48-Gb/s PAM-4 PRBS-7 pattern, and its energy efficiency is measured as 2.97 pJ/b. Fig. 4.2 shows the power breakdown of the RX operating at 48 Gb/s.
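As an arithmetic cross-check, assuming the energy efficiency is the total RX power divided by the data rate and the FOM is the energy efficiency divided by the channel loss at Nyquist, 2.97 pJ/b at 48 Gb/s corresponds to about 143 mW, and a 7-dB channel gives about 0.42 pJ/b/dB.

```python
energy_pj_per_bit = 2.97     # measured energy efficiency
data_rate_gbps = 48.0        # PAM-4 data rate
channel_loss_db = 7.0        # insertion loss at the 12-GHz Nyquist frequency

power_mw = energy_pj_per_bit * data_rate_gbps          # pJ/b x Gb/s = mW
fom_pj_per_bit_db = energy_pj_per_bit / channel_loss_db

print(f"{power_mw:.1f} mW, {fom_pj_per_bit_db:.2f} pJ/b/dB")   # 142.6 mW, 0.42 pJ/b/dB
```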



Fig. 4.1 Die photomicrograph



Fig. 4.2 Power consumption breakdown

### 4.2 Measurement Setup

The measurement setup is shown in Fig. 4.3. A signal quality analyzer (Anritsu MU1800) serves as both the pattern generator and the error detector. Its bit error rate tester (BERT) produces two binary PRBS streams corresponding to the MSB and the LSB. A passive power combiner (HL9404 BALUN) generates the PAM-4 signal, as shown in Fig. 4.4 [17]: the MSB is applied to one port of the combiner, and the LSB, passed through a 6-dB attenuator, is applied to the other. The channel insertion loss of the SMA cables and the PCB at the 12-GHz Nyquist frequency is measured as about 7 dB. The recovered and deserialized NRZ DQ data are fed back to the error detector to measure the BER. An oscilloscope (Tektronix MSO73304DX) is used to measure the eye diagram of the transmitted PAM-4 input signal, as shown in Fig. 4.5. The DUT configuration is adjusted from an external PC through the I2C protocol.
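The following sketch shows why a 6-dB attenuator on the LSB path yields nearly equally spaced PAM-4 levels at an ideal combiner output. It ignores the combiner's insertion loss and bandwidth, and the RLM formula used (three times the smallest eye divided by the full swing) is one simple definition assumed for this example.

```python
def combiner_levels(msb_amp=1.0, lsb_atten_db=6.0):
    """Output levels of an ideal combiner fed by +/-MSB and a +/-LSB copy attenuated by lsb_atten_db."""
    lsb_amp = msb_amp * 10 ** (-lsb_atten_db / 20)   # 6 dB -> about 0.501 in amplitude
    return sorted(m * msb_amp + l * lsb_amp for m in (-1, 1) for l in (-1, 1))


levels = combiner_levels()
eyes = [hi - lo for lo, hi in zip(levels, levels[1:])]   # lower, middle, upper eye heights
rlm = 3 * min(eyes) / (levels[-1] - levels[0])           # assumed simple RLM definition
print([round(v, 3) for v in levels], round(rlm, 4))
```

Because 6 dB is only approximately a factor of two in amplitude, the middle eye comes out slightly smaller than the outer ones (RLM of about 0.997 in this ideal model).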



Fig. 4.3 Measurement setup



Fig. 4.4 Production of PAM-4 signal



Fig. 4.5 Measured PAM-4 signal by oscilloscope

### **4.3 Measurement Results**

The proposed RX is verified with the measured eye diagram and BER, as shown in Fig. 4.6 and Fig. 4.7. The horizontal eye-opening margin is more than 0.2 UI at BER  $< 10^{-12}$  at 48 Gb/s (12-GHz Nyquist frequency). The bathtub curve shows the measured BER separately for the MSB and the LSB.

Table 4.1 summarizes the performance of the proposed adaptive nonlinearity-compensating RX and compares it with previous state-of-the-art single-ended receivers. By incorporating nonlinearity and offset-mismatch compensation, the proposed receiver achieves the highest data rate among the compared single-ended receivers.



Fig. 4.6 Measured EOM (horizontal axis: sampling clock phase in UI)



Fig. 4.7 Measured BER bathtub

|                               | ASSCC'21<br>[18] | JSSC'21<br>[19] | This work       |
|-------------------------------|------------------|-----------------|-----------------|
| Technology                    | 28nm             | 28nm            | 40nm            |
| Single/Differential           | Single           | Single          | Single          |
| Signaling type                | PAM4/NRZ         | PAM3            | PAM4            |
| Supply (V)                    | 1                | 0.6             | 1.25/0.9        |
| Data rate (Gb/s/pin)          | 24               | 30              | 48              |
| RX equalization               | CTLE, 1-tap DFE  | CTLE, 1-tap DFE | CTLE, 1-tap DFE |
| Channel loss (dB)             | 5 @ 6GHz         | 6.6 @ 10GHz     | 7 @ 12GHz       |
| Energy efficiency<br>(pJ/bit) | -                | 1.11            | 2.97            |
| BER                           | <1E-11           | <1E-12          | <1E-12          |
| Nonlinearity<br>compensation  | No               | No              | Yes             |

 Table 4.1
 Performance summary and comparison

# Chapter 5

# Conclusion

In this work, an adaptive nonlinearity-compensating PAM-4 receiver is proposed for use as a single-ended PAM-4 interface to a low-speed NRZ tester. Through the CTLEs and the 9-coefficient adaptive DFE, the RX receives a 48-Gb/s PAM-4 signal and converts it to 6-Gb/s NRZ DQs. This work employs nine adaptive nonlinearity coefficients to compensate for the PAM-4 data nonlinearity occurring within the TX and the RX. A prototype chip fabricated in 40-nm CMOS technology occupies an active area of 0.236 mm<sup>2</sup>. The proposed PAM-4 RX achieves a BER of less than 10<sup>-12</sup> at 48 Gb/s with an energy efficiency of 2.97 pJ/b. The measured 48-Gb/s data rate is the highest among similar single-ended, VSS-terminated PAM-4 interface implementations.

# **Bibliography**

- [1] D. Yun et al., "A 32-Gb/s PAM4-Binary Bridge With Sampler Offset Cancellation for Memory Testing," in IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 69, no. 9, pp. 3749-3753, Sept. 2022.
- [2] D. Cui et al., "3.2 A 320mW 32Gb/s 8b ADC-based PAM-4 analog front-end with programmable gain control and analog peaking in 28nm CMOS," 2016 IEEE International Solid-State Circuits Conference (ISSCC), 2016.
- [3] Y. -U. Jeong, H. Park, C. Hyun, J. -H. Chae, S. -H. Jeong and S. Kim, "A 0.64pJ/Bit 28-Gb/s/Pin High-Linearity Single-Ended PAM-4 Transmitter With an Impedance-Matched Driver and Three-Point ZQ Calibration for Memory Interface," in IEEE Journal of Solid-State Circuits, vol. 56, no. 4, pp. 1278-1287, April 2021.
- [4] M. Bassi, F. Radice, M. Bruccoleri, S. Erba and A. Mazzanti, "A High-Swing 45 Gb/s Hybrid Voltage and Current-Mode PAM-4 Transmitter in 28 nm CMOS FDSOI," in IEEE Journal of Solid-State Circuits, vol. 51, no. 11, pp. 2702-2715, Nov. 2016.
- [5] Q. Pan, L. Wang, X. Luo and C. P. Yue, "A Low-Power PAM4 Receiver With an Adaptive Variable-Gain Rectifier-Based Decoder," in IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 28, no. 10, pp. 2099-2108, Oct. 2020.

- [6] J. -S. Heo et al., "A 5Gb/s/pin 16Gb LPDDR4/4X Reconfigurable SDRAM with Voltage-High Keeper and a Prediction-based Fast-tracking ZQ Calibration," 2019 Symposium on VLSI Circuits, 2019, pp. C114-C115.
- [7] N. Dikhaminjia et al., "PAM4 signaling considerations for high-speed serial links," 2016 IEEE International Symposium on Electromagnetic Compatibility (EMC), 2016, pp. 906-910.
- [8] C. Hyun, H. Ko, J. -H. Chae, H. Park and S. Kim, "A 20Gb/s Dual-Mode PAM4/NRZ Single-Ended Transmitter with RLM Compensation," 2019 IEEE International Symposium on Circuits and Systems (ISCAS), 2019, pp. 1-4.
- [9] B. Razavi, "The Design of an Equalizer—Part One [The Analog Mind]," in IEEE Solid-State Circuits Magazine, vol. 13, no. 4, pp. 7-160, Fall 2021.
- [10] B. Razavi, "The StrongARM Latch [A Circuit for All Seasons]," in IEEE Solid-State Circuits Magazine, vol. 7, no. 2, pp. 12-17, Spring 2015.
- [11] C. -T. Hung, Y. -P. Huang and W. -Z. Chen, "A 40 Gb/s PAM-4 Receiver with 2-Tap DFE Based on Automatically Non-Even Level Tracking," 2018 IEEE Asian Solid-State Circuits Conference (A-SSCC), 2018, pp. 213-214.
- [12] L. Tang, W. Gai, L. Shi, X. Xiang, K. Sheng and A. He, "A 32Gb/s 133mW PAM-4 transceiver with DFE based on adaptive clock phase and threshold voltage in 65nm CMOS," 2018 IEEE International Solid - State Circuits Conference - (ISSCC), 2018, pp. 114-116.
- [13] Y. -J. Kim et al., "A 16Gb 18Gb/S/pin GDDR6 DRAM with per-bit trainable single-ended DFE and PLL-less clocking," 2018 IEEE International Solid - State Circuits Conference - (ISSCC), 2018, pp. 204-206.

- [14] K. Lee, W. Jung, H. Ju, J. Lee and D. -K. Jeong, "A 48 Gb/s PAM4 receiver with Baud-rate phase-detector for multi-level signal modulation in 40 nm CMOS," 2021 IEEE Asian Solid-State Circuits Conference (A-SSCC), 2021.
- [15] P. -J. Peng, J. -F. Li, L. -Y. Chen and J. Lee, "6.1 A 56Gb/s PAM-4/NRZ transceiver in 40nm CMOS," 2017 IEEE International Solid-State Circuits Conference (ISSCC), 2017, pp. 110-111.
- [16] B. Razavi, "TSPC Logic [A Circuit for All Seasons]," in IEEE Solid-State Circuits Magazine, vol. 8, no. 4, pp. 10-13, 2016.
- [17] S. Roh et al., "A 64-Gb/s PAM-4 Receiver With Transition-Weighted Phase Detector," in IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 69, no. 9, pp. 3704-3708, Sept. 2022.
- [18] H. Jin et al., "A 24Gb/s/pin PAM-4 Built Out Tester chip enabling PAM-4 chips test with NRZ interface ATE," 2021 IEEE Asian Solid-State Circuits Conference (A-SSCC), 2021, pp. 1-3.
- [19] H. Park et al., "30-Gb/s 1.11-pJ/bit Single-Ended PAM-3 Transceiver for High-Speed Memory Links," in IEEE Journal of Solid-State Circuits, vol. 56, no. 2, pp. 581-590.
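## Abstract (Korean Version)

This thesis proposes the design of a high-speed single-ended four-level pulse amplitude modulation receiver that incorporates adaptive nonlinearity compensation techniques for DRAM evaluation. The receiver includes three parallel Cherry-Hooper continuous-time linear equalizers and a 1-tap 9-coefficient decision feedback equalizer. The continuous-time linear equalizers provide variable gain with offset-cancellation calibration. The decision feedback equalizer detects the level separation mismatch ratio of the transmitted data and the nonlinear distortion of the receiver analog front-end. The nonlinearity is compensated by simultaneously adapting the nine coefficients of the nonlinearity compensator.

The proposed receiver was fabricated in a 40-nm CMOS process and occupies 0.236 mm<sup>2</sup>. Measured in a 7-dB loss channel, the four-level pulse amplitude modulation receiver achieved a bit error rate below 10<sup>-12</sup> at 48 Gb/s. The figure of merit was 0.42 pJ/b/dB, and the power efficiency was 2.97 pJ/b.

Keywords: Four-level pulse amplitude modulation (PAM-4), single-ended receiver (RX), adaptive decision feedback equalizer (DFE), Cherry-Hooper continuous-time linear equalizer (CTLE), level separation mismatch ratio (RLM)

Student Number: 2021-28506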