



**Ph.D. Dissertation** 

# Design Techniques for Clock Generation and Recovery in Serial Interface


by

Woosong Jung

August, 2023

Department of Electrical and Computer Engineering, College of Engineering, Seoul National University

# Design Techniques for Clock Generation and Recovery in Serial Interface

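Advisor: Professor Deog-Kyoon Jeong

Submitted as a dissertation for the degree of Doctor of Philosophy in Engineering, August 2023

> Graduate School of Seoul National University, Department of Electrical and Computer Engineering, Woosong Jung

The doctoral dissertation of Woosong Jung is approved, August 2023

| Role          | Name             | Seal   |
|---------------|------------------|--------|
| Chairman      | Jaeha Kim        | (seal) |
| Vice-Chairman | Deog-Kyoon Jeong | (seal) |
| Member        | Yong Moon        | (seal) |
| Member        | Woo-Seok Choi    | (seal) |
| Member        | Kwanseo Park     | (seal) |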

# Design Techniques for Clock Generation and Recovery in Serial Interface

by

Woosong Jung

A Dissertation Submitted to the Department of Electrical and Computer Engineering in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

at

#### SEOUL NATIONAL UNIVERSITY

August, 2023

Committee in Charge:

Professor Jaeha Kim, Chairman

Professor Deog-Kyoon Jeong, Vice-Chairman

Professor Yong Moon

Professor Woo-Seok Choi

Professor Kwanseo Park

## Abstract

This dissertation outlines the clocking system in a serializer/deserializer (SerDes) and the common issues associated with it. It proposes a phase-locked loop (PLL)-based clock driver for clock generation in the transmitter and clock and data recovery (CDR) circuits for the receiver. For frequency synthesis, the thesis proposes an LC resonator with a wide frequency tuning range (FTR), combined with fast frequency acquisition. For reference-less operation, a stochastic-based frequency acquisition scheme is implemented in a Baud-rate CDR. Additionally, this dissertation presents a Baud-rate CDR with a reference clock that supports four-level pulse amplitude modulation (PAM-4) signaling.

First, a digital-PLL (DPLL)-based clock driver with a wide-FTR LC oscillator is presented. The clock driver employs an 8-shaped inductor structure to implement three switchable modes, providing a wide FTR within one compact area. The analysis demonstrates the compact stacked-inductor layout. Furthermore, the clock driver achieves fast frequency acquisition by using a fast Fourier transform (FFT) algorithm, significantly reducing the lock time compared with a conventional PLL that uses a bang-bang phase and frequency detector (BB-PFD) or a time-to-digital converter (TDC). The prototype is fabricated in a 40-nm CMOS technology and verifies low jitter, wide FTR, and fast frequency acquisition. The presented LC oscillator achieves a phase noise of -118.5 dBc/Hz to -124.7 dBc/Hz, corresponding to a figure of merit (FoM) of 173.5 dBc/Hz to 181.5 dBc/Hz and an FoM<sub>T</sub> of 196 dBc/Hz to 204 dBc/Hz. The clock driver generates a clock frequency ranging from 0.82 to 4.1 GHz, achieving an FTR of 133%. It achieves a root mean square (RMS) jitter of 84.64 fs at a 4 GHz output clock frequency, showing an FoM<sub>RMS</sub> of -249.1 dB. Furthermore, the proposed clock driver settles in only 0.99 $\mu$s, whereas conventional techniques require 2.27 ms, thus verifying fast frequency acquisition.

As the second implementation, the dissertation proposes a 14-28 Gb/s reference-less Baud-rate CDR that uses a stochastic phase and frequency detector (PFD). The PFD achieves phase and frequency detection by applying optimum weights obtained through histogram-based correlation of various data patterns. The reference-less Baud-rate CDR utilizes data samples and phase error samples obtained from an integrator. Employing a continuous-time linear equalizer (CTLE), the proposed CDR achieves a data rate of up to 28 Gb/s over a channel with 4.7 dB of loss at the Nyquist frequency. Fabricated in 28-nm CMOS technology, the proposed CDR achieves a bit error rate (BER) of less than 10<sup>-12</sup> and an energy efficiency of 1.06 pJ/b.

The final implementation is a 48 Gb/s PAM-4 receiver with a Baud-rate CDR suitable for multi-level signaling. By deriving the relationship between the vertical eye margin and the ratio of the main cursor to the pre-cursor, the proposed Baud-rate phase detector (BRPD) adjusts the pre-cursor and finds the lock point with the targeted vertical eye opening. Thus, the BRPD offers a unique lock point when used with an adaptive decision feedback equalizer (DFE) that removes the post-cursor $h_1$, whereas the lock point could drift with the conventional Mueller-Müller PD. Furthermore, the merged-summer DFE reduces the input loading of the DFE by using the return-to-zero (RZ) sampler output instead of the conventional non-return-to-zero (NRZ) output, which adds the delay associated with an RS latch. A prototype chip fabricated in 40-nm CMOS technology consists of an analog front end, a phase rotator, a current digital-to-analog converter, and synthesizable digital logic, occupying a total active area of 0.24 mm<sup>2</sup>. The proposed PAM-4 receiver achieves a bit error rate (BER) of less than 10<sup>-11</sup> at 48 Gb/s and offers an energy efficiency of 2.42 pJ/b.

**Keywords:** Fast Fourier Transform (FFT), 8-shaped inductor, wide frequency tuning range, mode switching, phase-locked loop (PLL), clock driver, fast frequency acquisition, Baud-rate, clock and data recovery (CDR), phase and frequency detector (PFD), reference-less, receiver, stochastic, integrator, adaptive equalizer, decision feedback equalizer (DFE), merged-summer, Mueller-Müller PD, PAM-4, phase detector (PD), pre-cursor.

Student Number: 2019-29990

# Contents

- Abstract
- Contents
- List of Figures
- List of Tables
- Chapter 1 Introduction
  - 1.1 Motivation
  - 1.2 Thesis Organization
- Chapter 2 Backgrounds
  - 2.1 Clocking in Serial Interface
  - 2.2 Phase-Locked Loop
    - 2.2.1 PLL Fundamentals
    - 2.2.2 Types of Oscillators
    - 2.2.3 Challenges of Oscillators
  - 2.3 Clock and Data Recovery
    - 2.3.1 Receiver Fundamentals
    - 2.3.2 Types of CDR
    - 2.3.3 Types of PD
- Chapter 3 Fast Locking Wide Tuning Range DPLL-Based Clock Driver
  - 3.1 Overview
  - 3.2 Wide Tuning Range LC Resonator
    - 3.2.1 Compact 8-Shaped Inductor
    - 3.2.2 Transformer-Based Mode-Switching
  - 3.3 FFT-Based Fast Frequency Acquisition
    - 3.3.1 Fast Fourier Transform
    - 3.3.2 Proposed FFT-Based Frequency Tuning
  - 3.4 Circuit Implementation
  - 3.5 Measurement Results
- Chapter 4 Reference-Less Baud-Rate CDR with Stochastic Phase and Frequency Detector
  - 4.1 Overview
  - 4.2 Stochastic Baud-Rate Phase and Frequency Detection
    - 4.2.1 Integrator-Based Baud-Rate Edge Detection Techniques
    - 4.2.2 Methodology of the Stochastic Phase and Frequency Detection
  - 4.3 Circuit Implementation
  - 4.4 Measurement Results
- Chapter 5 PAM-4 Receiver with Pre-Cursor Adjustable Baud-Rate Phase Detector
  - 5.1 Overview
  - 5.2 Proposed Phase Acquisition Technique
    - 5.2.1 Concept of Proposed Baud-Rate Phase Detector
    - 5.2.2 Data Level and DFE Adaptation
    - 5.2.3 Pre-Cursor Adjustable Baud-Rate Phase Detector with Multi-Level Modulation Signaling
  - 5.3 Circuit Implementation
    - 5.3.1 Proposed PAM-4 Receiver Architecture
    - 5.3.2 Proposed Merged-Summer DFE with the Inverter-Based Amplifier
  - 5.4 Measurement Results
- Chapter 6 Conclusions
- Bibliography
- Abstract in Korean

# **List of Figures**

- Fig. 1.1 The data rate and process scaling over time
- Fig. 1.2 The data rate of various I/O standards per lane over time
- Fig. 1.3 The I/O bandwidth progress over time
- Fig. 2.1 The overall architecture of the typical SerDes
- Fig. 2.2 The conventional PLL architecture
- Fig. 2.3 Charge pump and loop filter in the analog PLL
- Fig. 2.4 Loop filter and delta-sigma modulator in the digital PLL
- Fig. 2.5 The BBPD implementation and its output
- Fig. 2.6 The time-to-digital converter implementation and its output
- Fig. 2.7 Schematic of an inverter-based ring oscillator
- Fig. 2.8 Schematic of LC oscillator with (a) NMOS-based negative resistance and (b) CMOS-based negative resistance
- Fig. 2.9 The conventional receiver architecture
- Fig. 2.10 Block diagram of reference CDR architecture: (a) PLL-based CDR and (b) PI-based CDR
- Fig. 2.11 Block diagram of the reference-less CDR structure
- Fig. 2.12 Comparison between 2x oversampling PD and Baud-rate PD
- Fig. 2.13 Hogge phase detector and its gain curve
- Fig. 2.14 Alexander phase detector and its gain curve
- Fig. 2.15 Baud-rate Mueller-Müller phase detector and its block diagram
- Fig. 3.1 Overall characteristic: (a) wide frequency tuning range LC oscillator, (b) fast frequency acquisition, and (c) RJ-reduction clock driver
- Fig. 3.2 8-shaped inductor employed for proposed LC oscillator
- Fig. 3.3 Impedance response of two LC tanks using stacked inductor: (a) octagonal inductor and (b) 8-shaped inductor
- Fig. 3.4 Q vs frequency of the 8-shaped inductors
- Fig. 3.5 Mode-switching in the proposed LC oscillator: (a) single core (Mode1), (b) in-phase coupling (Mode2), and (c) out-of-phase coupling (Mode3)
- Fig. 3.6 Structure of the butterfly cell
- Fig. 3.7 Diagram of the employed 32-point FFT algorithm
- Fig. 3.8 Comparison between the conventional counter-based frequency acquisition and proposed FFT-based frequency acquisition and its lock time depending on the integral gain
- Fig. 3.9 Flow chart of frequency tuning and block diagram of the FFT tuning algorithm
- Fig. 3.10 Procurement of the 32-point FFT input
- Fig. 3.11 The characteristic of the FFT output depending on the correlation between input frequency and sampling frequency
- Fig. 3.12 Matrix calculation to obtain the FFT tuning coefficient
- Fig. 3.13 Overall architecture of the proposed DPLL-based clock driver
- Fig. 3.14 Capacitor DAC structure in the LC oscillator
- Fig. 3.15 The time-to-digital converter structure
- Fig. 3.16 Employed phase and frequency detector architecture
- Fig. 3.17 Chip photograph of the proposed DPLL-based clock driver
- Fig. 3.18 Phase noise of the free-running LC oscillator at each mode
- Fig. 3.19 PN, FoM, and FoM<sub>T</sub> at each mode depending on the frequency
- Fig. 3.20 Measured integrated RMS jitter w/ and w/o random jitter in the reference clock
- Fig. 3.21 Measured jitter reduction vs jitter magnitude at each mode
- Fig. 3.22 Frequency transient curve with the frequency from 3.82 GHz to 4 GHz
- Fig. 3.23 Frequency transient curve with the frequency from 3.82 GHz to 1 GHz
- Fig. 4.1 Sampling type comparison between 2x oversampling CDR and proposed Baud-rate CDR
- Fig. 4.2 Proposed phase detection mechanism
- Fig. 4.3 Phase error decision table
- Fig. 4.4 The structure of the exploited integrator
- Fig. 4.5 The simulation of the integrator
- Fig. 4.6 Pattern acquisition in the proposed Baud-rate CDR
- Fig. 4.7 Obtained pattern (a) when phase is early or late and (b) when ...
- Fig. 4.11 Weight gain curve depending on the frequency difference
- Fig. 4.11 Frequency gain curve depending on the frequency difference
- Fig. 4.12 The weight gain curve depending on the superposition of phase and frequency difference and the determined pattern weight value
- Fig. 4.13 Phase and frequency gain curve when the determined weight is applied to the proposed Baud-rate phase and frequency detector
- Fig. 4.14 Phase and frequency gain curve of the proposed PFD with the input data passed through lossy channel and an equalization
- Fig. 4.15 Overall architecture of the proposed Baud-rate CDR
- Fig. 4.16 The structure of the DCO and its DCR
- Fig. 4.17 The chip photograph and its area and power consumption
- Fig. 4.18 Recovered clock histogram
- Fig. 4.19 The measured jitter histogram at 28 Gb/s with the BER < 10<sup>-12</sup>
- Fig. 5.1 Lock point of conventional MMPD and proposed MMPD with DFE on single-bit response
- Fig. 5.2 Simulated vertical eye opening of the PAM-4 and PAM-2 signal with activated DFE vs (a) sampling time and (b) cursor ratio $M$
- Fig. 5.3 (a) Simulated single-bit responses and (b) cursor ratio on the time with various channel losses
- Fig. 5.4 The eye diagram of the PAM-4 and data histogram for data +3
- Fig. 5.5 PAM-4 DFE adaptation of $w_1$ when $3 \cdot h_0$ and $3 \cdot h_0 + 3 \cdot h_{-1}$ are used as $V_{\text{DLev}}$, illustrating both converging to $h_1$
- Fig. 5.6 Simulated eye diagram with conventional data level adaptation and proposed data level adaptation
- Fig. 5.7 Generating phase error based on the consecutive data (D[n], D[n+1]) = (+3, -3) and the sign of the $V_{\text{Ref,PD}}$
- Fig. 5.8 The flow chart of the proposed MMPD
- Fig. 5.9 Simulated eye diagram of the proposed MMPD under two input data conditions:
- Fig. 5.10 Overall block diagram of the proposed CDR with PA-MMPD and the merged summer DFE
- Fig. 5.11 I-DAC implementation for sampler threshold voltage and mismatch calibration
- Fig. 5.12 Implementation of proposed merged summer DFE
- Fig. 5.13 Timing diagram of proposed merged summer DFE
- Fig. 5.14 Chip photomicrograph of the implemented receiver with the detailed area and ...
- Fig. 5.15 Block diagram of the measurement setup
- Fig. 5.16 Measured DLev code vs sampling time in (a) PAM-4 signaling and (b) PAM-2 signaling
- Fig. 5.17 Measured ratio $M$ from measured cursor values in (a) PAM-4 signaling and (b) PAM-2 signaling
- Fig. 5.18 Measured bathtub curve
- Fig. 5.19 Measured jitter tolerance at BER of 10<sup>-11</sup>
- Fig. 5.20 Measured PAM-4 eye height vs cursor ratio $M$
- Fig. 5.21 Measured PAM-4 eye diagram with BER < 10<sup>-11</sup>

## **List of Tables**

- Table I Comparison between ring oscillator and LC oscillator
- Table II Comparison table and performance summary
- Table III The pattern probability table when phase or frequency is (a) early and (b) late
- Table IV Performance summary and comparison
- Table V Performance summary and comparison with other designs

## Chapter 1

## Introduction

### **1.1 Motivation**

A data center is a physical facility designed to house computer systems and their related components. It plays a critical role in providing IT services and supporting business processes. The rapid growth of data centers and their traffic has led to a corresponding increase in transmission speed [1] - [2]. As illustrated in Fig. 1.1, process technology has scaled down over time while the data rate has steadily increased [3]. Moreover, Fig. 1.2 shows the exponential growth in the per-lane data rates of various I/O standards over time [3].

Peripheral Component Interconnect Express (PCIe) is one of the prominent wireline standards; its data rate has doubled every 3-4 years, and the trend is accelerating, as shown in Fig. 1.3 [4]. Memory interface speed also increases with each generation, together with the number of lanes.

Fig. 1.1 The data rate and process scaling over time

Fig. 1.2 The data rate of various I/O standards per lane over time

Fig. 1.3 The I/O bandwidth progress over time

The demand for data throughput on interfaces is increasing exponentially, leading to the widespread adoption of multilevel signaling such as pulse amplitude modulation (PAM), especially PAM-4, in many standards [5] - [7]. Although PAM signaling can improve the data rate significantly, inter-symbol interference (ISI) and the limited signal-to-noise ratio (SNR) of the channel can severely degrade signal integrity and bit error rate (BER) performance.
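To make this trade-off concrete, the following is a rough, textbook-style estimate rather than a result from this work. It assumes an ideal PAM-$M$ signal with equally spaced levels, the same peak-to-peak swing and noise as two-level (NRZ) signaling, and no ISI. The symbols $b$ (bits per symbol) and $\Delta_{\text{SNR}}$ (eye-height penalty relative to NRZ) are introduced here only for illustration.

$$
b = \log_2 M, \qquad \Delta_{\text{SNR}} = 20\log_{10}(M-1)\ \text{dB}
$$

$$
M = 4:\quad b = 2, \qquad \Delta_{\text{SNR}} = 20\log_{10} 3 \approx 9.5\ \text{dB}
$$

Under these assumptions, PAM-4 doubles the data rate at a given symbol rate, but each of its three eyes is only one third as tall as the NRZ eye, which tightens the voltage margin before ISI is even considered.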

As the data rate increases, the effects of channel loss and noise become more significant, reducing the data's timing margin and making the quality of the clock signals crucial for the transceiver. In the transmitter, the clock signal is generated by a phase-locked loop (PLL), while in the receiver the clock and data recovery (CDR) circuit regenerates the clock signal either with the aid of a reference clock or from the data alone.

This dissertation presents a PLL-based clock driver that overcomes the limitations of conventional LC oscillators, obtaining a wide frequency tuning range while achieving a fast lock time. Additionally, it presents a reference CDR and a reference-less CDR, both based on the Baud-rate structure, which reduces the requirement for multiphase clock signals; the Baud-rate reference CDR additionally enables PAM-4 signaling.

### **1.2 Thesis Organization**

This thesis is organized as follows. Chapter 2 briefly presents the clocking network used in serial interfaces. The clocking mechanism within the interface can be divided into two main components: the PLL and the CDR. The chapter explains the underlying principles of the PLL and the types of oscillators employed. It also covers the fundamentals of the CDR, along with an overview of the various types of CDRs and phase detectors (PDs).

Chapter 3 presents the implementation of a fast-locking, wide-tuning-range digital PLL (DPLL)-based clock driver. The chapter details an LC oscillator with a wide tuning range that overcomes the limitations of conventional LC oscillators. Additionally, it illustrates the methodology used to achieve fast frequency acquisition through the fast Fourier transform (FFT).

Chapter 4 describes the implementation of a 28 Gb/s reference-less Baud-rate CDR. In addition, the chapter presents the stochastic phase and frequency detection technique and the Baud-rate CDR structure realized by incorporating an integrator.

In Chapter 5, a 48 Gb/s PAM-4 receiver is presented. The receiver design utilizes a reference CDR structure and incorporates a Baud-rate PD that determines the locking position by analyzing the main cursor and pre-cursor ratio. Furthermore, the chapter shows the decision feedback equalization (DFE) adaptation methodology in the PAM-4 structure, verified by the simulation results.

Chapter 6 summarizes the proposed works and concludes this thesis.

## Chapter 2

## Backgrounds

### 2.1 Clocking in Serial Interface

Modern wireline communication systems favor serial links over parallel links because of advantages such as reduced pin count and increased throughput. The transmission rate of serial links has risen with the exponential growth in data, and most interface standards have doubled in speed with every generation. For example, the PCIe specification is advancing from Gen 6.0, running at 64 GT/s, to Gen 7.0, running at 128 GT/s [4]. This increase in speed has reduced timing margins and increased the design challenges for timing circuits.

A serial link converts parallel data into serial data for high-speed transmission. As depicted in Fig. 2.1, the parallel data is serialized at the transmitter and sent as serial data through the channel to the receiver. The received data is then deserialized back into parallel form for further processing. These operations are carried out by the serializer and deserializer (SerDes) blocks.

Fig. 2.1 The overall architecture of the typical SerDes

Clocking in SerDes refers to the generation of a clock signal synchronized to a specified data rate, together with the clocking architecture of the transmitter and receiver. The transmitter in a serial link serializes the parallel data with a high-frequency clock that is synthesized on-chip from a low-frequency crystal reference clock. At the receiver, on the other hand, the timing circuit must maintain a well-defined relationship between the received data and the sampling clock, so that the serial data is deserialized as the receiver's clock samples each data symbol at its center.

SerDes can be categorized into various structures based on specific viewpoints. One such viewpoint is the presence or absence of a phase relationship and frequency difference between data and clock, which leads to four synchronization types: synchronous, mesochronous, plesiochronous, and asynchronous [8]. This thesis addresses the plesiochronous type. Additionally, in the multiphase clocking regime, SerDes can be classified into full-rate, half-rate, quarter-rate, and other sub-rate structures, depending on the number of clock phases used for synchronization [9] – [10].

The clock architectures employed in SerDes significantly impact system performance metrics such as eye height, jitter amount, bit error rate (BER), and jitter tolerance. Therefore, it is necessary to employ an appropriate clock architecture for each application based on its specific requirements.

### 2.2 Phase-Locked Loop

#### 2.2.1 PLL Fundamentals

Fig. 2.2 illustrates the block diagram of a conventional PLL. The oscillator output clock signal is divided by a predetermined division ratio and forwarded to the phase and frequency detector (PFD), where the divided clock is compared to an on-chip or off-chip reference clock in terms of phase and frequency. The phase or frequency error information is fed to the loop filter (LF), which modulates the oscillator frequency, constituting a negative feedback loop to generate the clock at the desired frequency. The output frequency is $f_{out}=N \cdot f_{ref}$ in a steady state, where N and $f_{ref}$ represent the division ratio and reference clock frequency, respectively. In an integer PLL, the division ratio is an integer N, whereas in a fractional PLL it becomes a fraction N+$\alpha$. The fractional division ratio enables the PLL to achieve a very fine resolution to meet stringent channel requirements in most



Fig. 2.2 The conventional PLL architecture

wireless applications [11] – [13]. However, it requires more complex hardware than an integer PLL and generates additional noise and spurious tones. These characteristics are unsuitable for SerDes, which requires strict timing margins, although some applications, such as spread spectrum clocking, do require fractional PLLs. Nevertheless, the scope of this dissertation includes only the integer PLL.

PLLs can be classified into three types (analog, digital, and hybrid), and the LF structure depends on the type of PLL. In an analog PLL, the LF is implemented as a passive network consisting of a resistor and capacitor combination. In this case, the LF is preceded by the charge pump (CP). Fig. 2.3 shows the LF of a conventional analog PLL. This PLL is based on a 3rd-order type-2 architecture. The resistor in the LF functions as a proportional path, while the capacitor connected in series with the resistor serves as an integral path. Furthermore, a capacitor connected in parallel is used to suppress the ripple caused by the UP/DN current of the CP. The CP adjusts the frequency of the voltage-controlled oscillator (VCO) by charging or discharging the LF based on the



Fig. 2.3 Charge pump and loop filter in the analog PLL

UP/DN signals from the PD.

On the other hand, in a digital PLL, the resistor and capacitor paths are implemented using a digital loop filter (DLF) [14]. Additionally, a digitally-controlled oscillator (DCO), first introduced by Westlake in 1960 [15], is used as the oscillator in DPLLs. A delta-sigma modulator (DSM, $\Delta\Sigma$) is commonly employed between the DLF and DCO to dither the digital control word, effectively suppressing the quantization noise of the DCO. The structure of a conventional DLF is presented in Fig. 2.4. K<sub>α</sub> represents the gain of the proportional path, and K<sub>β</sub> represents the gain of the integral path. Additionally, an accumulator (ACC) is employed in the integral path of the DLF. The DLF computes the phase and frequency error by processing the UP/DN signals from the PD (PFD). Then, the computed error is transmitted to the DSM for further processing. Finally, the LF of a hybrid PLL combines the LF structures of the analog and digital PLLs. This thesis focuses on the digital PLL.
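The proportional and integral paths of the DLF can be illustrated with a minimal behavioral sketch. The class name, the gain names `k_prop` and `k_int`, and the choice to update the accumulator before forming the output are assumptions made for illustration; they are not taken from the implemented design.

```python
class DigitalLoopFilter:
    """Behavioral proportional-integral loop filter (illustrative only)."""

    def __init__(self, k_prop, k_int):
        self.k_prop = k_prop  # gain of the proportional path
        self.k_int = k_int    # gain of the integral path
        self.acc = 0.0        # accumulator (ACC) state of the integral path

    def update(self, err):
        """Consume one phase-error sample and return the control value."""
        self.acc += err  # the integral path accumulates the error
        return self.k_prop * err + self.k_int * self.acc


# Hypothetical gains and a short bang-bang style error sequence.
dlf = DigitalLoopFilter(k_prop=4.0, k_int=0.5)
outputs = [dlf.update(e) for e in (1, 1, 1, -1)]
print(outputs)
```

With a constant error the integral term ramps while the proportional term stays fixed, which is the behavior the resistor and series capacitor provide in the analog LF.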

DPLLs commonly employ two types of PDs: the bang-bang PD (BBPD) and the time-to-digital converter (TDC). Fig. 2.5 illustrates the implementation and output of the



Fig. 2.4 Loop filter and delta sigma modulator in the digital PLL

BBPD, where the phase error (ERR) produced is either 1 or -1 depending on the phase difference ($\Delta$t) between the reference clock and the divided oscillator clock. The structure of the TDC and its phase difference output can be observed in Fig. 2.6. In 2004, Staszewski [16] introduced a TDC to replace analog PFDs in all-digital PLLs (ADPLLs), which effectively reduced quantization error and resulted in low jitter performance. As shown in Fig. 2.6, the TDC operates by using the divided oscillator clock to sample the reference clock delayed by each inverter stage. Each time the phase difference between the reference clock and the divided oscillator clock grows by one inverter-stage delay, the TDC output, represented in thermometer code, increments by 1 bit.
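The difference between the two detectors can be sketched behaviorally as follows. The function names, the sign convention (positive `delta_t` gives +1), and the time units are assumptions for illustration; the delay-line model ignores metastability and stage mismatch.

```python
def bbpd(delta_t):
    """Bang-bang PD: only the sign of the phase difference survives."""
    return 1 if delta_t > 0 else -1


def tdc_thermometer(delta_t, stage_delay, n_stages):
    """Delay-line TDC model: one more '1' for every stage delay that
    delta_t covers, saturating at the length of the delay line."""
    ones = int(delta_t // stage_delay) if delta_t > 0 else 0
    ones = min(ones, n_stages)
    return [1] * ones + [0] * (n_stages - ones)


# Hypothetical numbers: 10-unit stage delay, 8 stages, 34-unit phase difference.
print(bbpd(34.0))                      # polarity only
print(tdc_thermometer(34.0, 10.0, 8))  # magnitude, quantized to the stage delay
```

The BBPD output is the same for any positive phase difference, whereas the TDC output grows with it until the delay line saturates.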



Fig. 2.5 The BBPD implementation and its output



Fig. 2.6 The time-to-digital converter implementation and its output

#### 2.2.2 Types of Oscillators

There are two main types of oscillators: ring oscillators (RO) and LC oscillators (LCO). An RO consists of a cascade of inverter delay cells whose loop provides a 180-degree phase shift through negative feedback, satisfying Barkhausen's criteria. Fig. 2.7 shows the conventional RO structure. The oscillation frequency of an RO can be expressed as $1/(2 \cdot N \cdot T_D)$, where *N* is the number of stages, and $T_D$ is the propagation delay of each delay cell. The load capacitance is proportional to the product of the inverter width (*W*) and length (*L*), and the oscillation frequency is inversely proportional to it. On the other hand, the driving strength is proportional to *W/L*, and the oscillation frequency is directly proportional to it. Thus, the oscillation frequency is independent of the inverter width, while it increases quadratically as the inverter length is reduced. However, increasing the inverter width reduces flicker noise and improves phase noise (PN) at the cost of higher power consumption. The common techniques for controlling the frequency of the RO are adjustment of the supply voltage [18] and regulation of the inverter current [19].
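The width and length dependence stated above follows from a first-order proportionality argument. As a sketch, assuming the stage delay scales as load capacitance over drive strength (the symbols $C_L$ and $I_D$ are introduced here for illustration only):

$$T_D \propto \frac{C_L}{I_D} \propto \frac{W \cdot L}{W/L} = L^2, \qquad f_{osc} \propto \frac{1}{N \cdot T_D} \propto \frac{1}{N \cdot L^2}$$

Under this assumption, *W* cancels out, and halving *L* raises the oscillation frequency roughly fourfold.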



Fig. 2.7 Schematic of an inverter-based ring oscillator



Fig. 2.8 Schematic of LC oscillator with (a) NMOS-based negative resistance and (b) CMOS-based negative resistance

An LCO is a resonant circuit built around passive elements. Fig. 2.8 illustrates the schematics of two conventional LC oscillators. The angular oscillation frequency of the LCO is determined by the inductance (*L*) and capacitance (*C*) as $\omega = 1/\sqrt{LC}$. However, due to the difficulty in controlling inductance, the frequency is primarily tuned by adjusting capacitance. The quality factor (Q-factor) of an inductor is defined as $L \cdot \omega/R_s$, which is inversely proportional to the series resistance, $R_s$, and determines the gain and the oscillation condition. For oscillation, the NMOS-LCO must satisfy the condition $R_p - \frac{2}{g_m} \ge 0$, where $R_p$ denotes the equivalent parallel resistance of the inductor, $R_p \approx \frac{L^2 \cdot \omega^2}{R_s} \approx Q^2 \cdot R_s$ [20]. In the case of the NMOS-based LCO, as shown in Fig. 2.8(a), the output oscillates around the dc level of $V_{DD}$, which is the supply voltage, whereas in the case of the CMOS-based LCO, as shown in Fig. 2.8(b), it oscillates around the dc level of $V_{DD}/2$. As the swing of the NMOS-LCO is twice that of the CMOS-LCO, PN is improved by 6 dB. However, this improvement comes at the expense of quadrupled power consumption [21]. Additionally, since the NMOS-LCO oscillates around the dc level of $V_{DD}$, there is a reliability issue, and therefore a thick-gate-oxide device is used, increasing the parasitic capacitance that limits the tuning range. In contrast, the CMOS-LCO uses a thin-gate-oxide device, which reduces the parasitic capacitance and thereby offers a wider tuning range.
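The tank relations above can be checked numerically with a short sketch. The component values are hypothetical, the function name is introduced here for illustration, and the conversion from angular frequency to hertz uses the standard factor of $2\pi$.

```python
import math


def lc_tank(l_henry, c_farad, r_series_ohm):
    """First-order LC tank figures from the relations in the text."""
    omega = 1.0 / math.sqrt(l_henry * c_farad)  # angular resonance frequency
    q = omega * l_henry / r_series_ohm          # Q = L * omega / R_s
    r_parallel = q * q * r_series_ohm           # R_p ~= Q^2 * R_s
    return {
        "freq_hz": omega / (2.0 * math.pi),
        "q": q,
        "r_parallel": r_parallel,
        "gm_min": 2.0 / r_parallel,             # start-up: R_p - 2/g_m >= 0
    }


# Hypothetical tank: 1 nH, 1 pF, 2-ohm series resistance.
tank = lc_tank(1e-9, 1e-12, 2.0)
print(tank)
```

For these values the sketch gives a resonance near 5 GHz, a Q of about 15.8, and a minimum transconductance of about 4 mS for the NMOS cross-coupled pair.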

#### 2.2.3 Challenges of Oscillators

The RO and LCO each possess certain advantages and disadvantages. The RO provides a wide frequency tuning range (FTR) and facilitates multi-phase generation. Furthermore, the RO is implemented in a small area without requiring passive devices. However, ROs are vulnerable to supply noise and have inferior PN performance compared to LC oscillators.

On the other hand, the LCO utilizes the resonance of passive components, which makes it resilient to supply noise and gives it superior PN performance. Nevertheless, its reliance on passive components requires a large area and limits its tuning range. Additionally, it is challenging to generate multiple phases using LC oscillators.

This dissertation concentrates on an LCO that possesses a wide FTR while being implemented in a small area, overcoming the limitations of area and tuning range.

|      | Ring Oscillator                                           | LC Oscillator                                                |
|------|-----------------------------------------------------------|--------------------------------------------------------------|
| Pros | Wide tuning range<br>Multi-phase generation<br>Small area | Supply insensitive<br>Superior phase noise                   |
| Cons | Supply sensitive<br>Inferior phase noise                  | Small tuning range<br>Large area<br>Limited phase generation |

Table I Comparison between Ring oscillator and LC oscillator

### **2.3 Clock and Data Recovery**

#### 2.3.1 Receiver Fundamentals

The receiver, based on the plesiochronous clocking architecture as shown in Fig. 2.9, lacks a dedicated clock channel in the transceiver. Instead of a clock forwarding system, the receiver employs a reference clock to generate the sampling clock. However, since the reference clocks in the transmitter and the receiver cannot be perfectly synchronized, a frequency difference arises between the received data and the sampling clock of the receiver, which necessitates a CDR.

Fig. 2.9 illustrates the conventional receiver structure. The equalizer, such as continuous time linear equalizer (CTLE) and DFE, compensates for the channel loss to remove inter-symbol interference (ISI), which allows for accurate sampling of incoming serialized data without errors. The CDR is required to recover the clock signal for synchronization operations such as retiming or deserializing data. The CDR achieves this objective by receiving a reference clock or by extracting the clock signal from the transmitted data.



Fig. 2.9 The conventional receiver architecture

#### 2.3.2 Types of CDR

CDR is categorized into two types: reference CDR and reference-less CDR. The conventional block diagrams of these two types are presented in Fig. 2.10 and Fig. 2.11, respectively. Further, the reference CDR is classified into PLL-based CDR and PI-based CDR. The PLL-based CDR comprises two tracking loops, as illustrated in Fig. 2.10 (a). The first loop receives a reference clock, similar to a conventional PLL, and coarsely tracks the frequency. The second loop, whose clock frequency is set by the first loop, finely tracks the phase by comparing the output clock and the data in the PD. Because the two loops employ nominally identical oscillators that share the frequency control voltage, any mismatch between the two oscillators causes a frequency offset, which degrades the CDR performance.

On the other hand, the PI-based CDR, as shown in Fig. 2.10 (b), recovers the clock utilizing a high-speed reference clock. First, the I/Q generator receives the reference clock and outputs the multi-phase clock to the phase interpolator (PI). Then, the optimal clock phase is obtained through the phase tracking loop by processing the data and the output clock signal from the PI. Compared to the PLL-based CDR, the PI-based CDR requires only one tracking loop, reducing circuit complexity and power consumption. However, it necessitates a high-speed clock signal and more pins, increasing the hardware cost.

As shown in Fig. 2.11, the reference-less CDR consists of a single loop and operates without a reference clock, resulting in a reduction of the required number of pins. Nonetheless, an additional technique for frequency acquisition is necessary alongside the existing phase detector to achieve clock recovery without a reference clock. To ensure that clock recovery occurs within the oscillator's frequency range, the system requires a wide capture range capable of tracking frequencies at various data rates.

Fig. 2.10 Block diagram of reference CDR architecture: (a) PLL-based CDR and (b) PI-based CDR

#### 2.3.3 Types of PD

Conventionally, the PDs used in CDR can be classified into oversampling PDs and Baud-rate PDs, based on the number of samples taken per data symbol. Fig. 2.12 illustrates two commonly used methods, the 2x oversampling and Baud-rate types. The 2x oversampling method requires edge sampling clocks, in contrast to the Baud-rate structure, which samples each data symbol only once. Although the additional sampling of data edges requires twice as many clock phases, increasing the hardware complexity and power consumption, 2x oversampling obtains more information on frequency and phase errors, facilitating better frequency and phase tracking.

Two commonly used 2x oversampling PDs are linear PD and binary PD. The Hogge PD [22] is an example of a linear PD, and its structure and output are shown in Fig. 2.13. The linear PD generates an output voltage that is proportional to the



Fig. 2.12 Comparison between 2x oversampling PD and Baud-rate PD



Fig. 2.14 Alexander phase detector and its gain curve

phase error, and it comprises two D flip-flops (DFFs) and two XOR logic gates. The Hogge PD outputs a reference pulse (REF) with a constant pulse width and an error pulse (ERR) with a pulse width that is proportional to the phase error. If the clock is earlier than the data edge, the pulse width of ERR is less than that of REF, and vice versa if the clock is late.

Fig. 2.14 illustrates the structure and gain curve of the Alexander PD [23], which is an example of a binary PD. Unlike the linear PD, the binary PD outputs only the polarity of the phase error, as shown in its gain curve. The Alexander PD consists of four DFFs and two XOR gates. By detecting the data transitions captured by the four DFFs as phase error, the Alexander PD is implemented in a more straightforward and power-efficient design than the linear PD. Since the clock signal in the Hogge PD and Alexander PD refers to the edge sampling clock phase, the phase error is the difference between the clock and the data edge.
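The binary decision of the Alexander PD can be sketched as a truth-table function on three consecutive samples: the previous data sample, the edge sample, and the current data sample. The function name, the argument names, and the mapping of +1 to "early" are assumptions for illustration; the retiming flip-flops are omitted.

```python
def alexander_pd(d_prev, edge, d_curr):
    """Return +1 (clock early), -1 (clock late), or 0 (no valid transition).

    All inputs are single bits (0 or 1).
    """
    early = edge ^ d_curr  # edge sample still holds the previous bit
    late = d_prev ^ edge   # edge sample already holds the current bit
    if early and not late:
        return 1
    if late and not early:
        return -1
    return 0  # no data transition, or an inconsistent sample pattern


# A 0 -> 1 transition observed with an early and a late edge clock.
print(alexander_pd(0, 0, 1), alexander_pd(0, 1, 1), alexander_pd(1, 1, 1))
```

The two XOR results correspond to the two XOR gates of the detector; only their polarity is passed on, which is why the gain curve is a step rather than a ramp.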

The Baud-rate PD necessitates only one sampling clock per data symbol, which reduces the number of clock phases compared to the 2x oversampling PD. This reduction decreases the cost of clock generation and distribution and can significantly reduce power consumption in clocking. Fig. 2.15 illustrates the operating principle and structure of the Mueller-Müller PD (MMPD) [24], a representative example of a Baud-rate PD. The MMPD achieves locking at the point where $h(\tau_k - T) = h(\tau_k + T)$, where h denotes the single-bit response (SBR), $\tau_k$ is the sampling time, and T represents one data period. If $h(\tau_k - T) < h(\tau_k + T)$, the clock is early, and if $h(\tau_k - T) > h(\tau_k + T)$, it is late. Despite its Baud-rate structure, the MMPD implementation requires significant hardware complexity. Thus, for simplicity, the sign-sign MMPD (SS-MMPD) using two binary samples is widely employed [25] – [27].
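The locking condition can be illustrated with the textbook Mueller-Müller timing function, $e_k = y_k d_{k-1} - y_{k-1} d_k$, whose average is proportional to $h(\tau_k + T) - h(\tau_k - T)$ for random data. The sketch below is a behavioral model only: the triangular single-bit response, the error-free decisions, and all names are assumptions, and it is not the SS-MMPD circuit.

```python
import random


def pulse(t, width=1.5):
    """Hypothetical triangular single-bit response, peak at t = 0 (T = 1)."""
    return max(0.0, 1.0 - abs(t) / width)


def mean_mm_error(tau, n_symbols=20000, seed=1):
    """Average of e_k = y_k*d_(k-1) - y_(k-1)*d_k at sampling offset tau."""
    rng = random.Random(seed)
    data = [rng.choice((-1, 1)) for _ in range(n_symbols)]

    def sample(k):
        # y_k = sum_j d_j * h(tau + (k - j)*T); taps beyond +/-2 are zero here
        return sum(data[j] * pulse(tau + (k - j)) for j in range(k - 2, k + 3))

    total = 0.0
    count = 0
    for k in range(3, n_symbols - 2):
        total += sample(k) * data[k - 1] - sample(k - 1) * data[k]
        count += 1
    return total / count


early = mean_mm_error(-0.2)  # sampling before the pulse peak
late = mean_mm_error(0.2)    # sampling after the pulse peak
print(early, late)
```

With this pulse, an early clock gives $h(\tau_k - T) < h(\tau_k + T)$ and a positive average, while a late clock gives a negative average, consistent with the early and late conditions stated above.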



Fig. 2.15 Baud-rate Mueller-Müller phase detector and its block diagram

## Chapter 3

### **Fast Locking Wide Tuning Range DPLL-based Clock Driver**

### **3.1 Overview**

Owing to its jitter-filtering nature, a PLL often serves as a clock driver in memory systems, filtering out high-frequency noise present in the overall clock distribution path. For LC oscillators, a parallel multi-core topology enhances the output PN but still suffers from a narrow FTR. To overcome this, [28] – [29] employ mode switching, but the negligible equivalent inductance in the odd mode requires excessively large capacitance for low-frequency oscillation. While the coupling method in [30] also gives a wide FTR, the achievable area efficiency is limited due to the spiral inductor structure. In a DPLL, a wide-range TDC achieves fast frequency acquisition but at the expense of a larger area and power. Although gear-shifting [31] – [33] or digital frequency-error recovery [34] may reduce the lock time without much hardware overhead, an effective method for a wide FTR still remains to be sought.

Fig. 3.1 Overall characteristic: (a) Wide frequency tuning range LC oscillator, (b) Fast frequency acquisition, and (c) RJ-reduction clock driver

This work presents a DPLL-based input-jitter-filtering clock driver that achieves 0.99 µs lock time and 133% FTR. Fig. 3.1 shows the overall scheme of the presented clock driver. By using three different modes, the implemented LC oscillator provides an extensive FTR. One standalone LC tank covers the first frequency band (mode1), while two other LC tanks together cover the second and third frequency bands (mode2, mode3), which are selected by switching the polarity of their magnetic coupling. Furthermore, the presented DPLL incorporates an FFT-based frequency acquisition technique, achieving fast frequency lock over a wide FTR. Compared to a conventional DPLL, which incorporates a TDC, the proposed DPLL reduces the lock time significantly by adjusting the frequency control word (FCW) to a point near the target. Serving as a clock driver, the proposed DPLL also reduces high-frequency noise components such as random jitter (RJ), which is generated in the buffer stages.

### **3.2 Wide Tuning Range LC Resonator**

#### **3.2.1** Compact 8-shaped Inductor

Fig. 3.2 illustrates the implemented LC tanks. The inductor L0, drawn with the top metal, covers mode1. Meanwhile, the two magnetically coupled inductors, L1 and L2, are placed under L0, with a symmetric layout. This transformer-based magnetic coupling offers not only a wide FTR but also an adequate equivalent inductance, being suitable for supporting low-frequency oscillation. To minimize the interference between L0 and L1/L2, each inductor adopts the 8-shaped structure.

Compared to the conventional octagonal inductor design, the 8-shaped inductor exhibits a lower Q-factor due to its higher resistance-to-inductance ratio. Despite this, it offers a significant advantage when utilized in a stack because its structure minimizes interference. Fig. 3.3 compares the impedance response of stacked inductors with a conventional octagonal design and with an 8-shaped design. When conventional octagonal inductors are stacked as in Fig. 3.3 (a), the coupling factor between the inductors is significant, resulting in a high impedance peak at other resonant frequencies, which makes it challenging to maintain steady-state oscillation and leads to PN degradation.

Fig. 3.2 8-shaped inductor employed for proposed LC oscillator

Fig. 3.3 Impedance response of two LC tanks using stacked inductor: (a) Octagonal inductor and (b) 8-shaped inductor

However, the 8-shaped inductor has an upper and a lower coil wound in opposite directions, so the magnetic fields generated by the two coils cancel each other out. Consequently, when 8-shaped inductors are stacked as in Fig. 3.3 (b), the coupling is minimal, and the impedance shows only a slight peak at other resonant frequencies. This ensures that the two stacked inductors have little influence on each other, allowing for stable oscillation.

Moreover, the stacked 8-shaped inductors generate a stable and high-performance clock signal compared to stacked octagonal inductors. Fig. 3.4 illustrates the Q-factor of the 8-shaped inductor when it stands alone and when it is arranged in a stacked structure with other inductors. Unlike the octagonal inductor, the 8-shaped structure exhibits small coupling when implemented in a multilayer structure, and the Q-factor degradation is correspondingly small. Moreover, the negative resistance implemented by a thin-oxide



Fig. 3.4 Q vs frequency of the 8-shaped inductors

CMOS differential pair reduces parasitic capacitance as compared to that of the NMOS-only design with thick-oxide devices.

#### 3.2.2 Transformer-based Mode-Switching

The proposed DCO is implemented with mode-switching to enhance the FTR. Three modes are available, as shown in Fig. 3.5. Mode1, covering the frequency range from 0.82 GHz to 1.8 GHz, is achieved by the LC tank employing inductor L0, described in Fig. 3.2 and in Fig. 3.5 (a). In this mode, the active core with L0 using the top metal is turned on while the other active cores are off. The effective inductance in mode1 is $L_{eq} = L_0$.

Mode2 and mode3 are implemented by the transformer-based inductor. In mode2 and mode3, the active cores with L1 and L2 are activated, while the active core with L0 is turned off. The switches connecting the two cores select between mode2 and mode3. In mode2, the coupled magnetic fields are set to be in-phase, adding the mutual inductance, M, to the equivalent inductance, $L_{eq}$. The switches interconnecting the two cores are deactivated, so that the magnetic fields of the two inductors add, as depicted in Fig. 3.5 (b). The effective inductance in mode2 is $L_{eq} = L_1 + M$. Mode2 achieves a frequency range from 1.76 GHz to 2.88 GHz.

In mode3, the magnetic fields are out-of-phase, canceling each other out and giving a lower value of $L_{eq}$. In contrast to mode2, mode3 activates the switches that interconnect the two cores, resulting in a current flow in the direction that cancels the magnetic fields. The resulting magnetic field is illustrated in Fig. 3.5 (c). The effective inductance in mode3 is $L_{eq} = L_1 - M$. Mode3 achieves a frequency range from 2.52 GHz to 4.1 GHz.
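The three band edges quoted above can be combined into a quick consistency check of the overall FTR. The sketch assumes the common definition FTR = 2·(f_max − f_min)/(f_max + f_min); the helper names and the example inductance values are hypothetical.

```python
# Band edges in GHz, as quoted in the text for each mode.
MODES_GHZ = {"mode1": (0.82, 1.8), "mode2": (1.76, 2.88), "mode3": (2.52, 4.1)}


def equivalent_inductance(mode, l0, l1, m):
    """L_eq per mode: L0 alone, or L1 plus/minus the mutual inductance M."""
    if mode == "mode1":
        return l0
    if mode == "mode2":
        return l1 + m  # in-phase coupling
    if mode == "mode3":
        return l1 - m  # out-of-phase coupling
    raise ValueError(mode)


def tuning_range_percent(f_min, f_max):
    """FTR as a percentage of the center frequency (assumed definition)."""
    return 200.0 * (f_max - f_min) / (f_max + f_min)


def bands_overlap(modes):
    """True if each band starts at or below the top of the previous band."""
    bands = sorted(modes.values())
    return all(nxt[0] <= prev[1] for prev, nxt in zip(bands, bands[1:]))


f_min = min(lo for lo, _ in MODES_GHZ.values())
f_max = max(hi for _, hi in MODES_GHZ.values())
ftr = tuning_range_percent(f_min, f_max)
print(round(ftr, 1), bands_overlap(MODES_GHZ))
```

Under this definition the 0.82 GHz to 4.1 GHz span evaluates to about 133%, and adjacent bands overlap, so no frequency gap appears between modes.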



Fig. 3.5 Mode-switching in the proposed LC oscillator: (a) single core (mode1), (b) in-phase coupling (mode2), and (c) out-of-phase coupling (mode3)

The L1 and L2 inductors are magnetically coupled to each other, whereby the single-turn area of one inductor affects the magnetic field of the other. To reduce the interference of the self-induced magnetic field, the two-turn area of each inductor is implemented to be half the size of the single-turn area. This results in the distinctive 8-shaped design, which effectively cancels out the external magnetic field generated by the inductor, improving area efficiency.

### **3.3 FFT-based Fast Frequency Acquisition**

#### 3.3.1 Fast-Fourier Transform

The fundamental principle of Fourier analysis is that all functions are composed of an infinite number of sinusoidal waves. Specifically, our focus lies on discrete functions of finite length. It is possible to represent the function as a sum of sine and cosine waves with different frequencies, and for any such function,  $x_n$ , its frequencies can be entirely represented by another discrete function,  $X_k$ , with an equivalent number of samples. The discrete Fourier transform (DFT) is a widely used transformation technique employed to represent discrete signals in the frequency domain. The N-point DFT is established as

$$X_{k} = \sum_{n=0}^{N-1} x_{n} W_{N}^{nk}, \quad 0 \le k \le N-1, \quad (3.1)$$

where $X_k$ is the frequency-domain sequence, $x_n$ is the time-domain sequence, and N denotes the number of samples in $x_n$. $W_N^{nk}$ is the twiddle factor and is represented as

$$W_N^{nk} = e^{-j2\pi nk/N} = \cos\left(\frac{2\pi nk}{N}\right) - j\sin\left(\frac{2\pi nk}{N}\right). \quad (3.2)$$

The FFT is an efficient algorithm for the computation of the DFT. Through the elimination of redundant calculations, FFT algorithms optimize the DFT. The FFT algorithm follows the divide-and-conquer method: the input is split into two halves, each half undergoes an FFT, and the results are combined to form the overall transform. Cooley and Tukey proposed the FFT algorithm to reduce the computational complexity of the DFT, specifically, from N<sup>2</sup> to N/2·log<sub>2</sub>(N/2) multiplications and from N·(N-1) to N·log<sub>2</sub>N additions [35]. We can rewrite (3.1) as follows:

$$\begin{aligned} X_{k} &= \sum_{n=0}^{N/2-1} x_{n} W_{N}^{nk} + \sum_{n=N/2}^{N-1} x_{n} W_{N}^{nk} \\ &= \sum_{n=0}^{N/2-1} x_{n} W_{N}^{nk} + \sum_{n=0}^{N/2-1} x_{n+N/2} W_{N}^{(n+N/2)k} \\ &= \sum_{n=0}^{N/2-1} x_{n} W_{N}^{nk} + W_{N}^{(N/2)k} \cdot \sum_{n=0}^{N/2-1} x_{n+N/2} W_{N}^{nk}, \quad 0 \le k \le N-1. \quad (3.3) \end{aligned}$$

Using the definition of W<sub>N</sub>, the following equation can be derived as,

$$W_N^{(N/2)k} = e^{-j2\pi(N/2)k/N} = e^{-jk\pi} = (-1)^k. \quad (3.4)$$

With (3.3) and (3.4), we get

$$X_{k} = \sum_{n=0}^{N/2-1} \left(x_{n} + (-1)^{k} \cdot x_{n+N/2}\right) \cdot W_{N}^{nk}, \quad 0 \le k \le N-1, \quad (3.5)$$

where  $X_k$  is the N-point DFT of  $x_n$ . The N-point DFT can be divided into two N/2-point DFTs as follows.

$$\begin{cases} X_{2k} = \sum_{n=0}^{N/2-1} (x_n + x_{n+N/2}) \cdot W_{N/2}^{nk} \\ X_{2k+1} = \sum_{n=0}^{N/2-1} ((x_n - x_{n+N/2}) \cdot W_N^n) \cdot W_{N/2}^{nk} \end{cases}, \quad 0 \le k \le (N/2) - 1, \quad (3.6)$$

Equation (3.6) shows how a complete transform can be divided into two half-transforms. This process can be executed by iteratively employing the butterfly structure depicted in Fig. 3.6, from the N-point FFT down to the 2-point FFT. The recursive use of the butterfly structure simplifies the overall process.
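The split in (3.6) maps directly onto a short recursive routine. The following is a minimal floating-point sketch, assuming a power-of-two length; the function names are introduced here for illustration, and the direct DFT of (3.1) is included only as a reference for comparison. It is not the fixed-point hardware FFT of the prototype.

```python
import cmath


def dft(x):
    """Direct N-point DFT, Eq. (3.1)."""
    n = len(x)
    return [sum(x[i] * cmath.exp(-2j * cmath.pi * i * k / n) for i in range(n))
            for k in range(n)]


def fft_dif(x):
    """Recursive FFT following Eq. (3.6); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    half = n // 2
    # Butterfly: the sums feed the even bins, the twiddled differences the odd bins.
    sums = [x[i] + x[i + half] for i in range(half)]
    diffs = [(x[i] - x[i + half]) * cmath.exp(-2j * cmath.pi * i / n)
             for i in range(half)]
    out = [0j] * n
    out[0::2] = fft_dif(sums)   # X_{2k}
    out[1::2] = fft_dif(diffs)  # X_{2k+1}
    return out


# Arbitrary 32-sample test vector.
x = [((7 * i) % 5) - 2.0 for i in range(32)]
max_err = max(abs(a - b) for a, b in zip(fft_dif(x), dft(x)))
print(max_err)
```

Each recursion level applies N/2 butterflies, so a 32-point transform is built from five levels of the same 2-point cell.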



Fig. 3.6 Structure of the butterfly cell

Fig. 3.7 presents a diagram of the 32-point FFT structure utilized in the proposed DPLL. It can be observed that each stage of the FFT is based on a butterfly cell. The 32-point FFT is decomposed into a 16-point FFT, which is further divided into an 8-point FFT, and eventually a 2-point FFT. Notably, the G(k) and H(k), which are the 16-point and 8-point FFTs respectively, can be expressed as follows:



$$\begin{cases} G_{1,k} = X_{2k} = \sum_{n=0}^{15} g_{1,n} \cdot W_{16}^{nk} \\ G_{2,k} = X_{2k+1} = \sum_{n=0}^{15} (g_{2,n} \cdot W_{32}^{n}) \cdot W_{16}^{nk} \end{cases}, \quad 0 \le k \le 15, \quad (3.7)$$

and

$$\begin{cases} H_{1,k} = G_{1,2k} = \sum_{n=0}^{7} h_{1,n} \cdot W_8^{nk} \\ H_{2,k} = G_{1,2k+1} = \sum_{n=0}^{7} (h_{2,n} \cdot W_{16}^n) \cdot W_8^{nk} \end{cases}, \quad 0 \le k \le 7, \quad (3.8)$$

where $g_{1,n} = x_n + x_{n+16}$, $g_{2,n} = x_n - x_{n+16}$, $h_{1,n} = g_{1,n} + g_{1,n+8}$, and $h_{2,n} = g_{1,n} - g_{1,n+8}$. As shown in (3.7) and (3.8), the 32-point FFT can be divided into 16-point FFTs, then into 8-point FFTs, and finally into 2-point FFTs. In other words, a 32-point FFT can be implemented based on a 2-point FFT cell.

#### 3.3.2 Proposed FFT-based Frequency Tuning



Fig. 3.8 Comparison between the conventional counter-based frequency acquisition and proposed FFT-based frequency acquisition and its lock time depending on the integral gain

Before the proposed FFT-based fast frequency calibration is explained, the conventional and proposed approaches are compared in terms of three key aspects: hardware cost, frequency error, and lock time. Fig. 3.8 shows the comparison table between the conventional counter-based frequency acquisition and the FFT-based frequency acquisition, together with the frequency acquisition time as a function of the integral gain of the DLF. The conventional counter-based method offers a straightforward implementation, whereas the proposed FFT method may require a significant amount of hardware, depending on the number of FFT points. However, the conventional counter-based approach leaves a larger frequency error after acquisition. Consequently, the lock time, which denotes the time required to achieve frequency lock, becomes prolonged. In cases where the integral gain is small, as shown on the right of Fig. 3.8, the conventional counter-based method may fail to satisfy the required lock time.

Fig. 3.9 explains the proposed fast-frequency acquisition scheme. Upon initialization, the reference clock samples the divided oscillator clock, whose frequency is given from a certain frequency control word (FCW<sub>lin</sub>), feeding the output sequence to the 32-point FFT module. An adequate division ratio is employed for the FFT to satisfy the Nyquist theorem, thereby preventing aliasing. Although 32-point FFT does not require high hardware complexity, the raw frequency detection resolution obtained by choosing the dominant bin is insufficient for achieving a fast lock. To mitigate this, we utilize the fact that the profile of the sampled signal is known; given that it is always a rectangular pulse with a duty cycle of 50%, there exists a deterministic set of weight coefficients,  $w_k$ , that relates the FFT output bins to the oscillator frequency. Using this, the coarse frequency calibration detects the oscillator frequency with a much higher effective resolution without significantly increasing the hardware overhead. Then, based on a linear model of the frequency tuning curve given from the simulation, the FCW is updated by

$$FCW_{cal} = FCW_{lin} + (FCW_{lin} + \alpha) \cdot \sum_{k=1}^{16} (m_k \cdot w_k), \quad (3.9)$$

where  $m_k$  is the FFT output bin value. Here,  $\alpha$  is a pre-defined constant that equalizes the full-range ratio of FCW to that of output frequency, i.e.,

$$(FCW_{max} + \alpha)/\alpha = f_{max} / f_{min}, \qquad (3.10)$$





validating the given FCW update for all input frequencies. Since the linearization precedes the calibration, the updated FCW is converted back to the one corresponding to the original mode before being applied to the oscillator. After the coarse FCW calibration by the FFT, the DLF performs fine-tuning through a 7-bit Vernier-delay-line-based TDC having a 2.5 ps resolution. Once the FFT tuning completes, the lock detector in the DLF disables the FFT and adjusts the frequency through the TDC loop until an unlock state is recognized. Overall, in comparison with the conventional TDC-DPLL, which suffers from an excessive hardware cost, the proposed DPLL with the FFT frequency calibration achieves a fast frequency acquisition time while retaining good jitter performance.
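The update rule (3.9) and the constant defined by (3.10) can be sketched in a few lines. The function names and all numerical values are hypothetical; the weighted bin sum is treated as an estimate of the relative frequency error under the linear model in which the output frequency is proportional to FCW + $\alpha$.

```python
def alpha_from_ranges(fcw_max, f_min, f_max):
    """Solve (FCW_max + alpha) / alpha = f_max / f_min for alpha, Eq. (3.10)."""
    return fcw_max / (f_max / f_min - 1.0)


def fcw_update(fcw_lin, alpha, bins, weights):
    """Coarse FCW calibration, Eq. (3.9)."""
    # The weighted sum of the FFT bin values estimates delta_FCW / (FCW + alpha).
    delta = sum(m * w for m, w in zip(bins, weights))
    return fcw_lin + (fcw_lin + alpha) * delta


# Hypothetical range: FCW spans 0..1000 while the frequency spans a 3:1 ratio.
alpha = alpha_from_ranges(fcw_max=1000.0, f_min=1.0, f_max=3.0)
fcw_cal = fcw_update(100.0, alpha, bins=[2.0, 1.0], weights=[0.1, -0.1])
print(alpha, fcw_cal)
```

In this example the weighted sum is 0.1, so the code moves from 100 to 160, which raises FCW + $\alpha$ (and hence the modeled frequency) by 10% in a single step.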



Fig. 3.10 Procurement of the 32-point FFT input

Fig. 3.10 shows the process of acquiring the 32-point FFT input. The input is obtained by sampling the divide-by-8 oscillator clock with the reference clock. The division ratio of 8 was chosen to comply with the Nyquist theorem, taking into account the minimum and maximum oscillator frequencies. The code at the reference clock frequency is referred to as $FCW_{ref}$, while the code at the current frequency of the oscillator is referred to as FCW<sub>in</sub>. The FCW value is determined based on the linearized model of the implemented oscillator, and the difference between FCW<sub>ref</sub> and FCW<sub>in</sub> is defined as $\Delta$FCW.

Fig. 3.11 illustrates the FFT output as a function of the sampling frequency and input frequency. In Fig. 3.11 (a), when the input frequency is k times the sampling frequency divided by N, the output exhibits a skirt shape



Fig. 3.11 The characteristic of the FFT output depending on the correlation between input frequency and sampling frequency

rather than a single peak value because the input signal is a pulse rather than a sine wave. If the input frequency is greater than this, as shown in Fig. 3.11 (b), and less than (k+1/2) times the sampling frequency/N, the magnitude of the $k^{th}$ bin decreases while the magnitude of the $(k+1)^{th}$ bin increases. Fig. 3.11 (c) shows that when the input frequency is (k+1/2) times the sampling frequency/N, the magnitudes of bins k and k+1 are the largest and output the same value. As the input frequency exceeds this and approaches (k+1) times the sampling frequency/N, the magnitudes of bins (k+1) and (k+2) increase while the magnitude of bin k decreases. Thus, depending on the relationship between the input frequency and the sampling frequency, the bin magnitudes interpolate between the two maximum values. By utilizing this interpolation characteristic, the frequency can be calibrated by inferring the relationship between the current oscillator frequency and the input frequency from the FFT outputs.
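The bin behavior described above can be reproduced with a small numerical experiment on an ideal sampled square wave. This is a sketch under simplifying assumptions: a 0/1 waveform with exactly 50% duty, zero initial phase, and a floating-point DFT; the function names are introduced here for illustration.

```python
import cmath

N_POINTS = 32


def sampled_square(bin_position, n=N_POINTS):
    """50%-duty 0/1 clock whose frequency is bin_position * f_s / n."""
    return [1 if (bin_position * i / n) % 1.0 < 0.5 else 0 for i in range(n)]


def bin_magnitudes(x):
    """Magnitudes of DFT bins 0 .. n/2 of a real sequence."""
    n = len(x)
    return [abs(sum(x[i] * cmath.exp(-2j * cmath.pi * i * k / n)
                    for i in range(n)))
            for k in range(n // 2 + 1)]


def largest_bins(x, count):
    """Indices of the `count` largest bins, excluding the DC bin."""
    mags = bin_magnitudes(x)
    ranked = sorted(range(1, len(mags)), key=lambda k: mags[k], reverse=True)
    return ranked[:count]


on_bin = largest_bins(sampled_square(4.0), 1)       # input exactly on bin 4
between = set(largest_bins(sampled_square(4.5), 2)) # input midway between 4 and 5
print(on_bin, between)
```

When the input sits exactly on bin 4, that bin dominates and the odd harmonics form the skirt; when it sits midway, bins 4 and 5 become the two largest, which is the interpolation property the weighted sum in (3.9) exploits.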

Fig. 3.12 presents how to calculate the FFT tuning coefficients using (3.9). To obtain the desired coefficients,  $w_k$, we collect samples of the FFT output magnitudes together with a  $\Delta_k$  value that is determined by dividing  $\Delta$FCW by FCW<sub>in</sub>+$\alpha$. By computing the inverse of the matrix shown in Fig. 3.12 using the collected values, we can obtain

$$\begin{bmatrix} \Delta_{1} \\ \Delta_{2} \\ \Delta_{3} \\ \vdots \\ \Delta_{n} \end{bmatrix} = \begin{bmatrix} m_{1,1} & m_{1,2} & m_{1,3} & \cdots & m_{1,16} \\ m_{2,1} & m_{2,2} & m_{2,3} & \cdots & m_{2,16} \\ m_{3,1} & m_{3,2} & m_{3,3} & \cdots & m_{3,16} \\ \vdots & & & & \vdots \\ m_{n,1} & m_{n,2} & m_{n,3} & \cdots & m_{n,16} \end{bmatrix} \cdot \begin{bmatrix} w_{1} \\ w_{2} \\ w_{3} \\ \vdots \\ w_{16} \end{bmatrix}$$

$$FCW_{cal} = FCW_{in} + (FCW_{in} + \alpha) \cdot \sum_{k} (w_{k} \cdot m_{k})$$

Fig. 3.12 Matrix calculation to obtain the FFT tuning coefficient

the value of  $w_k$ . This method allows for FFT-based fast-frequency acquisition.
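The coefficient extraction can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implemented logic: it uses a square toy system with 3 bins instead of 16, made-up magnitudes, and hypothetical helper names (`solve_linear`, `delta_fcw`). The only relations taken from the text are that the collected $\Delta_k$ values equal the magnitude matrix times the coefficients, and that $\Delta$ is $\Delta$FCW divided by FCW<sub>in</sub>+$\alpha$.

```python
def solve_linear(matrix, rhs):
    """Solve matrix * w = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(map(float, row)) + [float(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for j in range(col, n + 1):
                aug[row][j] -= factor * aug[col][j]
    w = [0.0] * n
    for row in range(n - 1, -1, -1):
        acc = sum(aug[row][j] * w[j] for j in range(row + 1, n))
        w[row] = (aug[row][n] - acc) / aug[row][row]
    return w


def delta_fcw(weights, magnitudes, fcw_in, alpha):
    """Delta_FCW = (FCW_in + alpha) * sum(w_k * m_k), since Delta = Delta_FCW / (FCW_in + alpha)."""
    delta = sum(w * m for w, m in zip(weights, magnitudes))
    return (fcw_in + alpha) * delta


# Toy example: each row of M holds the FFT magnitudes of one calibration sample (made-up numbers).
M = [[4.0, 1.0, 0.5], [1.0, 3.0, 1.0], [0.5, 1.0, 5.0]]
w_true = [0.02, -0.01, 0.03]
deltas = [sum(m * w for m, w in zip(row, w_true)) for row in M]
w_est = solve_linear(M, deltas)
```

When more calibration samples than coefficients are collected, the matrix is no longer square and a least-squares solution would take the place of the plain inverse; that choice is an assumption of this note, not something stated in the text.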

## **3.4 Circuit Implementation**

Fig. 3.13 depicts the proposed DPLL-based clock driver, which features a TDC with FFT-based frequency acquisition and employs three mode-switching LC oscillators to allow for a wide frequency tuning range. The system uses a shift register to obtain a 32-point FFT input and incorporates a lock detector in the DLF to enable FFT-based frequency tuning. The lock detector detects the lock state based on frequency detector outputs such as up and down and decides whether to control the FCW using a conventional TDC-based proportional and integral path or an FFT adjustment.

Moreover, the DLF contains the FCW computation logic that implements frequency calibration by FFT. Before FCW computation, this logic converts the current FCW into an FCW<sub>in</sub> that fits the linearized model. The computation is then performed using the FFT output, after which the FCW<sub>cal</sub> is converted back to the FCW of the original model.

The binary-to-thermometer (B2T) converter receives a 10-bit binary code and outputs a 32x32-bit thermometer code to the LC oscillator. In mode1 and mode2, two different 32x32-bit thermometer codes are employed, while mode2 and mode3 share the same cap bank. Fig. 3.14 illustrates the implementation of the cap bank used in the LC oscillator. The capacitor digital-to-analog converter (DAC) operates on 32-bit row and 32-bit column inputs from the digital logic. In thermometer-based operation, the even and odd rows are designed differently, and one cell is turned on or off for each 1-bit increase or decrease of the code.
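A row/column thermometer decoding of this kind can be sketched as follows. The sketch is hypothetical: the function names, the split of the 10-bit code into a row count and a column count, and the serpentine ordering of odd rows are assumptions chosen so that consecutive codes toggle exactly one cell; the actual cell logic of the cap bank is not given in the text.

```python
ROWS, COLS = 32, 32


def b2t(code):
    """Split a 10-bit code into row and column thermometer words (lists of 0/1)."""
    if not 0 <= code < ROWS * COLS:
        raise ValueError("code must fit in 10 bits")
    full_rows, remainder = divmod(code, COLS)
    row_therm = [1 if r < full_rows else 0 for r in range(ROWS)]
    col_therm = [1 if c < remainder else 0 for c in range(COLS)]
    return row_therm, col_therm


def cell_states(code):
    """Return the set of (row, col) unit cells that are switched on."""
    row_therm, col_therm = b2t(code)
    full_rows = sum(row_therm)
    on = set()
    for r in range(ROWS):
        for c in range(COLS):
            # Assumed serpentine order: odd rows fill from the opposite side,
            # so the next cell to switch is always a physical neighbour.
            idx = c if r % 2 == 0 else COLS - 1 - c
            if r < full_rows or (r == full_rows and col_therm[idx]):
                on.add((r, c))
    return on
```

In this model the number of active cells equals the input code, and a 1-bit change of the code switches a single cell.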







Fig. 3.15 The time-to-digital converter structure

The employed TDC [36] is presented in Fig. 3.15. It is based on a 7-bit Vernier delay-line structure and has a resolution of 2.5 ps, given by  $t_s - t_f$, the delay difference between the slow and fast paths. The capture range is determined as follows.

$$3 \cdot (t_f - t_s) \le Capture \le 3 \cdot (t_s - t_f). \tag{3.11}$$

To detect frequency errors outside the capture range, the PFD generates a down (dn) signal by sampling div<3> with ref<1> and an up signal by sampling ref<3> with div<1>. The implemented structure of the PFD is illustrated in Fig. 3.16. Within the capture range, both ref and div go high before the ref<1> or div<1> signal is activated. As a result, PUP and PDN are reset, which prevents the generation of PFD\_UP and PFD\_DN.



Fig. 3.16 Employed phase and frequency detector architecture

# **3.5 Measurement Results**



Fig. 3.17 Chip photograph of the proposed DPLL-based clock driver

The proposed clock driver, fabricated in 40-nm CMOS, generates a clock frequency ranging from 0.82 GHz to 4.1 GHz (FTR=133%). As shown in the chip photograph in Fig. 3.17, the core size is 540  $\mu$m by 450  $\mu$m, and the total active area



Fig. 3.18 Phase noise of the free-running LC oscillator at each mode



Fig. 3.19 PN and FoM and FoM<sub>T</sub> at each mode depending on the frequency

is 0.3 mm<sup>2</sup>. Implementing a 32-point FFT does not significantly increase hardware complexity.

Fig. 3.18 and Fig. 3.19 present the measured phase noise of the free-running oscillator in the three modes, along with the corresponding FoM and FoM<sub>T</sub> at various frequencies. The phase noise varies from -118.5 dBc/Hz to -124.7 dBc/Hz at 1 MHz offset frequency, achieving FoM and FoM<sub>T</sub> ranging from 173.5 dBc/Hz to 181.5 dBc/Hz and 196 dBc/Hz to 204 dBc/Hz, respectively.



Fig. 3.20 Measured integrated RMS jitter w/ and w/o random jitter in the reference clock



Jitter Reduction = 20log(J<sub>RMS,PLL</sub>/J<sub>RMS,REF</sub>)

Fig. 3.21 Measured jitter reduction vs jitter magnitude at each mode

Further, the integrated root-mean-square (RMS) jitter is shown in Fig. 3.20. In the absence of random jitter (RJ) in the reference clock, the RMS jitter integrated from 1 kHz to 30 MHz was measured to be 84.63 fs. In the presence of a 0.15 unit interval RJ<sub>peak-to-peak</sub> in the reference clock, the RMS jitter integrated over a wider frequency range of 1 kHz to 1 GHz was measured to be 561.7 fs. The plot in Fig. 3.20 shows that the PLL attenuates the high-frequency phase noise of the reference signal at its output.

Jitter reduction as a function of the input jitter magnitude present in the reference is shown in Fig. 3.21. Since RJ introduced by clock buffering typically predominates in the high-frequency region, the jitter reduction is evaluated by comparing the integrated RMS jitter from 1 kHz to 1 GHz. Overall, the proposed clock driver offers good nominal jitter performance and significant RJ reduction under a noisy reference clock.



Fig. 3.22 Frequency transient curve with the frequency from 3.82GHz to 4GHz



Fig. 3.23 Frequency transient curve with the frequency from 3.82GHz to 1GHz

Fig. 3.22 and Fig. 3.23 compare the frequency-locking transients of the proposed FFT-DPLL with those of the conventional TDC-DPLL when the frequency jumps from 3.82 GHz to 4 GHz and from 3.82 GHz to 1 GHz. The frequency acquisition time is significantly reduced regardless of the polarity and magnitude of the frequency jump. With a small frequency jump, the measured settling time is a few hundred nanoseconds. Even with a very large frequency jump from 3.82 GHz to 1 GHz, the DPLL requires only 0.99 µs to settle again, while it takes 2.27 ms without the proposed technique, verifying fast frequency acquisition. Table II summarizes the performance of the proposed clock driver and compares it with other complementary oscillators and fast-locking PLLs. Compared with the oscillators in [28] and [30], the proposed oscillator achieves the widest FTR, showing an outstanding FoM<sub>T</sub>. In addition, the presented work exhibits the shortest frequency tracking time of 0.99  $\mu$s, notwithstanding a large frequency jump. Lastly, this work gives an FoM<sub>RMS</sub> of -249.1 dB, validating the outstanding jitter performance.

| Reference                                   |     | This work                | Oscillator Comparison    |                        | PLL Comparison     |                |                     |
|---------------------------------------------|-----|--------------------------|--------------------------|------------------------|--------------------|----------------|---------------------|
|                                             |     |                          | CICC, 21 [26]            | JSSC, 21 [28]          | JSSC, 20 [30]      | ISSCC, 18 [31] | JSSC, 20 [32]       |
| PLL Architecture                            |     | ADPLL                    | N/A                      | N/A                    | BBPLL              | BBPLL          | BBPLL               |
| Technology                                  |     | 40nm                     | 65nm                     | 40nm                   | 28nm               | 65nm           | 28nm                |
| Type                                        |     | Integer-N                | N/A                      | N/A                    | Integer-N          | Fractional-N   | Fractional-N        |
| Locking Method                              |     | Aux. FFT                 | N/A                      | N/A                    | Aux. BBPD<br>+ GS  | Aux. BBPDs     | Aux. BBPD<br>+ DFER |
| Oscillator Description                      |     | Dual-core<br>Triple-mode | Dual-core<br>Quad-mode   | Quad-core<br>Quad-mode | N/A                | N/A            | N/A                 |
| Ref. Freq. [MHz]                            |     | 1000 ~ 4000              | N/A                      | N/A                    | 216                | 52             | 500                 |
| Out. Freq. [GHz]                            |     | 0.82 ~ 4.1               | 8.2 ~ 21.5               | 18.6 ~ 41.1            | 22.5 ~ 27.7        | 3.7 ~ 4.1      | 12.8 ~ 15.2         |
| Active Area [mm <sup>2</sup> ]              |     | 0.243 / 0.3              | 0.4                      | 0.08                   | 0.017 <sup>†</sup> | 0.5†           | 0.03 <sup>†</sup>   |
| FTR [%]                                     |     | 133                      | 89.6                     | 73.2                   | 20.7               | 10.2           | 17.1                |
| Power [mW]                                  | OSC | 4.3 ~ 11                 | 4 ~ 6                    | 9 ~ 15                 | 13.7               | N/A            | N/A                 |
|                                             | PLL | 17.1                     | N/A                      | N/A                    | 25                 | 5.28           | 19.8                |
| Phase noise [dBc/Hz]<br>@ 1 MHz offset      |     | -118.5~-124.7            | -100 ~ -109 <sup>†</sup> | -108.5 ~ -100.3        | -108.1             | N/A            | N/A                 |
| <sup>1</sup> FoM [dBc/Hz]<br>@ 1 MHz offset |     | 173.5 ~ 181.5            | 177 ~ 181 <sup>†</sup>   | 181.4 ~ 184.4          | -183.8             | N/A            | N/A                 |
| <sup>2</sup> FoM⊤[dB]<br>@ 1 MHz offset     |     | 196 ~ 204                | 196 ~ 201 <sup>†</sup>   | 198.7 ~ 201.7          | -189.8             | N/A            | N/A                 |
| Locking time [µs]                           |     | 0.99                     | N/A                      | N/A                    | 45                 | 5.6            | 18.55               |
| Integration Bandwidth [Hz]                  |     | 1k to 30M                | N/A                      | N/A                    | 10k to 20M         | 1k to 30M      | 1k to 100M          |
| RMS Jitter[fs]                              |     | 84.63                    | N/A                      | N/A                    | 220                | 183            | 66.2                |
| <sup>3</sup> FoM <sub>RMS</sub> [dB]        |     | -249.1                   | N/A                      | N/A                    | -239.2             | -247.5         | -250.6              |
| Reference Spur [dBc]                        |     | -63.1                    | N/A                      | N/A                    | -65                | N/A            | -80.1               |

Table II Comparison table and performance summary

<sup>1</sup>FoM = |PN| + 20·log<sub>10</sub>(f<sub>0</sub>/Δf) - 10·log<sub>10</sub>(Power/1mW)

<sup>2</sup>FoM<sub>T</sub> = FoM + 20·log<sub>10</sub>(FTR/10)

<sup>3</sup>FoM<sub>RMS</sub> = 10·log<sub>10</sub>[(Power/1mW)·(RMS jitter/1s)<sup>2</sup>]

<sup>†</sup> = Estimated from the literature
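The figures of merit in Table II can be cross-checked from the footnote definitions. The short Python sketch below is only a consistency check; the function names are hypothetical, and the FTR is assumed to follow the usual definition of tuning span divided by center frequency, which reproduces the reported 133%.

```python
import math


def ftr_percent(f_min, f_max):
    """Frequency tuning range relative to the center frequency, in percent."""
    return 200.0 * (f_max - f_min) / (f_max + f_min)


def fom_t(fom, ftr):
    """FoM_T = FoM + 20*log10(FTR/10), with FTR in percent."""
    return fom + 20.0 * math.log10(ftr / 10.0)


def fom_rms(power_mw, jitter_s):
    """FoM_RMS = 10*log10[(Power/1 mW) * (RMS jitter/1 s)^2]."""
    return 10.0 * math.log10(power_mw * jitter_s ** 2)


print(round(ftr_percent(0.82, 4.1), 1))      # output range 0.82 GHz to 4.1 GHz
print(round(fom_t(173.5, 133.0), 1))         # lower end of the reported FoM range
print(round(fom_t(181.5, 133.0), 1))         # upper end of the reported FoM range
print(round(fom_rms(17.1, 84.63e-15), 1))    # PLL power 17.1 mW, RMS jitter 84.63 fs
```

Evaluated with the values listed in Table II, these expressions agree with the reported FTR of 133%, FoM<sub>T</sub> of 196 to 204, and FoM<sub>RMS</sub> of -249.1 dB.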

# Chapter 4 Reference-less Baud-rate CDR with Stochastic Phase and Frequency Detector

# 4.1 Overview

As the data rate increases in the wireline system, CDR accounts for a significant portion of the total power consumption. The commonly employed 2x oversampling CDR samples the data more than once per unit interval (UI), necessitating the generation of a data sampling clock and edge sampling clock [37] - [38]. Fig. 4.1 presents the comparison between the 2x oversampling mechanism and the proposed Baud-rate technique. In the 2x oversampling CDR, the data are sampled on the data clock phase (D[n]) and edge clock phase (E[n]). The phase and frequency errors are determined through D[n] and E[n]. This structure requires a double clock phase and more



Fig. 4.1 Sampling type comparison between 2x oversampling CDR and proposed Baud-rate CDR

sophisticated phase error correction, resulting in large power consumption and frequency limitation.

On the other hand, a Baud-rate CDR uses only a data sampling clock, thereby reducing the number of samplers and the complexity of clock distribution to achieve power-efficient receivers [39] – [44]. Whereas a conventional Baud-rate architecture relies solely on the samples taken with the data sampling clock, the proposed Baud-rate CDR presented in Fig. 4.1 also obtains edge data without an edge sampling clock. An integrator is adopted to derive the edge samples, which carry the phase error, without an additional sampling clock. In contrast to [42], phase errors in the proposed CDR are determined by integrating the data once per UI, as depicted in Fig. 4.2. Fig. 4.3 shows the phase error decision: the phase error is classified as early or late depending on the data D[n] and D[n+1], as well as the integrated value.

Also, the proposed CDR is based on reference-less architecture. It is implemented without an external reference clock, which entails a jitter-free clock and an additional



Fig. 4.2 Proposed Phase detection mechanism

| D[n] | D[n+1] | INT | E[n]  |
|------|--------|-----|-------|
| 1    | -1     | 1   | Early |
| 1    | -1     | -1  | Late  |
| -1   | 1      | 1   | Late  |
| -1   | 1      | -1  | Early |

Fig. 4.3 Phase error decision table
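The decision rule of Fig. 4.3 can be written compactly. The following Python sketch is an interpretation rather than the implemented circuit: it takes the rule from the rows with D[n] = 1, where an integrated value of the same sign as D[n] means early, and assumes by symmetry that the same rule holds for the opposite transition. Returning no decision when there is no data transition is also an assumption.

```python
def phase_error(d_n, d_next, integ):
    """Classify one UI as 'Early' or 'Late' from D[n], D[n+1] and the integrator sign.

    All inputs are +1 or -1. Returns None when there is no data transition,
    because the integrated value then carries no timing information (assumed).
    """
    if d_n == d_next:
        return None
    # The integrated value leans toward D[n] when the clock samples early.
    return "Early" if integ == d_n else "Late"


for row in [(1, -1, 1), (1, -1, -1), (-1, 1, 1), (-1, 1, -1)]:
    print(row, phase_error(*row))
```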

pin. Reference-less CDR demands additional frequency detection schemes to extract phase and frequency error information from the input data [39]. Furthermore, reference-less designs constrain the wide-range frequency acquisition. To extend the frequency capture range, many reference-less CDRs necessitate extra frequency detectors (FD) or digital logic, increasing circuit complexity [37] – [42].

This chapter proposes a Baud-rate reference-less CDR architecture that addresses these drawbacks by enabling stochastic phase and frequency detection [37]. In [44], an extra sampler with a distinct threshold voltage level is employed to achieve Baud-rate phase detection in PAM-4 signaling. As a result, the CDR in [44] realizes a Baud-rate PAM-4 CDR utilizing a stochastic pattern distribution. However, the CDR architecture in [44] is only capable of phase detection and thus requires a reference clock for its implementation rather than being a reference-less structure. While this approach represents a notable advancement, it cannot recover the clock from data without a reference clock. In the proposed CDR, the probability histogram of sequential data patterns is obtained to derive optimal weights for phase and frequency detection, facilitating wide-range frequency acquisition.

The remainder of this chapter is organized as follows. Section 4.2 describes the proposed integrator-based stochastic PFD, including the behavior of the integrator and the simulations demonstrating wide-range frequency acquisition. Section 4.3 presents the proposed receiver implementation in detail. Section 4.4 shows the measurement results of the proposed Baud-rate CDR.

## 4.2 Stochastic Baud-rate Phase and Frequency Detection

### 4.2.1 Integrator-based Baud-rate Edge Detection Techniques

The integrator structure for the zero-phase data is depicted in Fig. 4.4. Differential input data is integrated onto a capacitor according to the clock phase, and the phase errors are derived from the integrated value. As the proposed CDR operates with a quadrature architecture, a total of four integrators are exploited. Fig. 4.5 illustrates the simulation results of the integrator's transient behavior. The integrator operates in four different states: INTEGRATION, HOLD, DECISION, and RESET. During the INTEGRATION state, the input data is integrated



Fig. 4.4 The structure of the exploited integrator

into the capacitor. This integration occurs at CLK0. Subsequently, the integrated data is held in the HOLD state at CLK90. The DECISION state, which occurs at CLK180, determines the phase error. Finally, the capacitors are pre-charged during the RESET state at CLK270, preparing the integrator for the following input data. As a result, the phase errors, which serve as the edge data, are generated without an edge sampling clock, achieving the Baud-rate architecture.



Fig. 4.5 The simulation of the integrator

When implementing an integrator, two essential considerations need to be taken into account. The first consideration is determining the amount of charge during the integration and decision states, and the second consideration is deciding the amount of charge required for pre-charge during the reset state. During the integration state (T/4), the voltage of the subsequent sampler is determined by the charge / discharge from the capacitor. The maximum current required for integration is determined by the condition that the integrated voltage in the common mode is greater than the sampler input threshold voltage. This condition can be expressed as follows.

$$i(t) = C \cdot \frac{dV}{dt} \tag{4.1}$$

$$V_{CM} = V_{DD} - \frac{1}{C} \cdot \left(\frac{I}{2}\right) \cdot \left(\frac{T}{4}\right) = V_{DD} - \frac{I \cdot T}{8C} > V_{th} \tag{4.2}$$

Subsequently, to satisfy the pre-charge condition during the reset state, the time to pre-charge needs to be less than T/4 as follows.

$$t = V_{DD} \cdot \frac{C}{I_P} < \frac{T}{4} \tag{4.3}$$

This ensures the capacitor is fully charged before the following integration process begins.
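Conditions (4.2) and (4.3) translate directly into bounds on the two currents. The Python sketch below rearranges them; the function names and all numerical values (20 fF, 1.0 V supply, 0.5 V threshold, and T equal to four UIs at 28 Gb/s) are hypothetical and only illustrate the arithmetic.

```python
def max_integration_current(c_farad, vdd, vth, t_period):
    """Upper bound on I from (4.2): VDD - I*T/(8C) > Vth  =>  I < 8C(VDD - Vth)/T."""
    return 8.0 * c_farad * (vdd - vth) / t_period


def min_precharge_current(c_farad, vdd, t_period):
    """Lower bound on I_P from (4.3): VDD*C/I_P < T/4  =>  I_P > 4*C*VDD/T."""
    return 4.0 * c_farad * vdd / t_period


# Hypothetical example values, not taken from the design.
C = 20e-15          # integration capacitor [F]
VDD, VTH = 1.0, 0.5  # supply and sampler threshold [V]
T = 4 / 28e9        # clock period assumed to span four UIs at 28 Gb/s [s]

i_max = max_integration_current(C, VDD, VTH, T)
ip_min = min_precharge_current(C, VDD, T)
print(f"I < {i_max * 1e3:.2f} mA, I_P > {ip_min * 1e3:.2f} mA")
```

Both bounds scale with C/T, so a smaller capacitor relaxes the pre-charge current requirement but also tightens the allowed integration current for a given threshold.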

### 4.2.2 Methodology of the Stochastic Phase and Frequency Detection

The concept of the proposed stochastic PFD is to establish robust frequency detection in a straightforward way [37] and [45]. Based on [37] and [45], the pattern histogram can be formulated through the sequential pattern (N) consisting of two data samples and one edge sample, ranging from 000 to 111, as shown in Fig. 4.6.  $F_{DIFF}$  is defined as

$$F_{\text{DIFF}} = \frac{f_{\text{data}} - f_{\text{clk}}}{f_{\text{clk}}} \tag{4.4}$$

where  $f_{data}$  is the Nyquist frequency of the data and  $f_{clk}$  is the clock frequency. Similar to  $F_{DIFF}$ , the  $P_{DIFF}$  is defined as the phase deviation from the data center. As shown in Fig. 4.7 and Table III, the probabilities of each pattern corresponding to the phase and



Fig. 4.6 Pattern acquisition in the proposed Baud-rate CDR



Fig. 4.7 Obtained pattern (a) when phase is early or late and (b) when frequency is early or late

frequency difference,  $P_{DIFF}$  and  $F_{DIFF}$ , can be obtained. Fig. 4.7(a) presents the probability distribution of each pattern concerning the phase difference, while Fig. 4.7(b) illustrates the probability distribution of each pattern based on the frequency difference. Table III(a) lists the instances when the phase and frequency are in the early state, while Table III(b) outlines the occurrences when the phase and frequency

| Condition | Parameter | Value | 0 (000) | 1 (001) | 2 (010) | 3 (011) | 4 (100) | 5 (101) | 6 (110) | 7 (111) |
|-----------|-----------|-------|---------|---------|---------|---------|---------|---------|---------|---------|
| P(N\|E) | P<sub>DIFF</sub> (UI) | -0.1 | 0.244 | 0.252 | 0 | 0 | 0 | 0 | 0.252 | 0.252 |
| P(N\|E) | P<sub>DIFF</sub> (UI) | -0.2 | 0.244 | 0.252 | 0 | 0 | 0 | 0 | 0.252 | 0.252 |
| P(N\|E) | P<sub>DIFF</sub> (UI) | -0.3 | 0.244 | 0.252 | 0 | 0 | 0 | 0 | 0.252 | 0.252 |
| P(N\|E) | F<sub>DIFF</sub> (%) | -30 | 0.322 | 0.089 | 0 | 0.089 | 0.089 | 0 | 0.089 | 0.322 |
| P(N\|E) | F<sub>DIFF</sub> (%) | -60 | 0.40 | 0.05 | 0 | 0.05 | 0.05 | 0 | 0.05 | 0.40 |
| P(N\|E) | F<sub>DIFF</sub> (%) | -90 | 0.47 | 0.01 | 0 | 0.016 | 0.016 | 0 | 0.01 | 0.478 |

(a)

| Condition | Parameter | Value | 0 (000) | 1 (001) | 2 (010) | 3 (011) | 4 (100) | 5 (101) | 6 (110) | 7 (111) |
|-----------|-----------|-------|---------|---------|---------|---------|---------|---------|---------|---------|
| P(N\|L) | P<sub>DIFF</sub> (UI) | +0.1 | 0.244 | 0 | 0 | 0.252 | 0.252 | 0 | 0 | 0.252 |
| P(N\|L) | P<sub>DIFF</sub> (UI) | +0.2 | 0.244 | 0 | 0 | 0.252 | 0.252 | 0 | 0 | 0.252 |
| P(N\|L) | P<sub>DIFF</sub> (UI) | +0.3 | 0.244 | 0 | 0 | 0.252 | 0.252 | 0 | 0 | 0.252 |
| P(N\|L) | F<sub>DIFF</sub> (%) | +30 | 0.206 | 0.126 | 0.038 | 0.125 | 0.126 | 0.038 | 0.126 | 0.215 |
| P(N\|L) | F<sub>DIFF</sub> (%) | +60 | 0.17 | 0.126 | 0.075 | 0.126 | 0.126 | 0.075 | 0.126 | 0.176 |
| P(N\|L) | F<sub>DIFF</sub> (%) | +90 | 0.135 | 0.126 | 0.109 | 0.126 | 0.126 | 0.11 | 0.126 | 0.142 |
(b)

Table III The pattern probability table when phase or frequency is (a) early and (b) late

are in the late state. Based on the probabilities of  $P_{DIFF}$  and  $F_{DIFF}$ , the weights of each pattern, meaning the difference between late and early, are determined. The weight for each pattern is calculated as

$$\mathbf{W}_{N} = \mathbf{P}(\mathbf{L} \mid \mathbf{N}) - \mathbf{P}(\mathbf{E} \mid \mathbf{N}) \tag{4.5}$$


where P(L|N) and P(E|N) are the conditional probabilities of late and early, assuming the pattern is N. Also, the conditional probabilities can be re-defined based on Bayes' theorem, which explains the correlation between the given event and its prior event, as

$$P(L(\text{or } E) | N) = \frac{P(N | L(\text{or } E)) \cdot P(L(\text{or } E))}{P(N)} \tag{4.6}$$

where P(N|L(or E)) is the probability of the pattern, assuming the condition is late (early). Based on (4.5) and (4.6), the weight for each pattern ( $W_N$ ) can be established as follows.

$$\begin{aligned} W_{N} &= \frac{P(N|L) \cdot P(L) - P(N|E) \cdot P(E)}{P(N)} \\ &= \frac{P(N|L) \cdot P(L) - P(N|E) \cdot P(E)}{P(N|L) \cdot P(L) + P(N|E) \cdot P(E)} \\ &= \frac{P(N|L) - P(N|E)}{P(N|L) + P(N|E)}. \end{aligned} \tag{4.7}$$
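As a worked example of (4.7), the weights can be evaluated from individual rows of Table III. The Python sketch below is illustrative; the function name is hypothetical, only the ±0.1 UI and ±30% rows are used, and a pattern that never occurs in either condition is assigned a weight of zero, which is an assumption made here to avoid dividing by zero.

```python
def pattern_weights(p_given_late, p_given_early):
    """W_N = (P(N|L) - P(N|E)) / (P(N|L) + P(N|E)) for patterns N = 0..7."""
    weights = []
    for p_l, p_e in zip(p_given_late, p_given_early):
        total = p_l + p_e
        weights.append((p_l - p_e) / total if total > 0 else 0.0)
    return weights


# Rows copied from Table III, patterns 0 (000) to 7 (111).
PHASE_EARLY = [0.244, 0.252, 0, 0, 0, 0, 0.252, 0.252]            # P_DIFF = -0.1 UI
PHASE_LATE = [0.244, 0, 0, 0.252, 0.252, 0, 0, 0.252]             # P_DIFF = +0.1 UI
FREQ_EARLY = [0.322, 0.089, 0, 0.089, 0.089, 0, 0.089, 0.322]     # F_DIFF = -30 %
FREQ_LATE = [0.206, 0.126, 0.038, 0.125, 0.126, 0.038, 0.126, 0.215]  # F_DIFF = +30 %

w_phase = pattern_weights(PHASE_LATE, PHASE_EARLY)
w_freq = pattern_weights(FREQ_LATE, FREQ_EARLY)
print("phase weights    :", [round(w, 3) for w in w_phase])
print("frequency weights:", [round(w, 3) for w in w_freq])
```

For these rows, patterns 1 and 6 receive a weight of -1 from the phase histogram but a positive weight from the frequency histogram, so the two sets of weights pull in opposite directions for those patterns.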

Based on (4.7), the weight gain curve can be drawn as shown in Fig. 4.8. Note that the last equality in (4.7) holds when the prior probabilities are equal, P(L) = P(E). The phase and frequency detection gain curves are obtained by applying the calculated weights. The value of each gain curve is computed as the weighted sum of the pattern probabilities as follows.

$$\begin{aligned} Gain &= P(L) - P(E) \\ &= \sum_{N=0}^{7} \left[ P(L|N) \cdot P(N) - P(E|N) \cdot P(N) \right] \\ &= \sum_{N=0}^{7} W_{N} \cdot P(N) \end{aligned} \tag{4.8}$$

where  $W_N$  and P(N) are the weight and probability for the pattern N. Using (4.8), the phase detection gain curve with  $P_{DIFF}$  varied from -0.5 UI to +0.5 UI is obtained. The resulting gain curve is shown in Fig. 4.9. Since the applied weights carry only phase information, phase detection can be achieved using the weights obtained from the weight gain curve, but frequency detection fails.

To obtain the weights according to the frequency difference, (4.7) is applied in the same way as for the phase difference, and the resulting weight gain curve is shown in Fig. 4.10. With (4.8), the frequency detection gain curve with  $F_{DIFF}$  varied from -90% to 150% is obtained as shown in Fig. 4.11. Since these weights are computed from the frequency difference histogram, they contain only frequency information. As a result, only frequency detection is achieved, and phase detection fails.

Under circumstances of multiple phase and frequency differences, the weights obtained through  $P_{DIFF}$  and  $F_{DIFF}$  enable independent phase and frequency detection. However, they cannot achieve phase detection and frequency detection simultaneously. The weights for N = 1 and N = 6 in Fig. 4.8 and Fig. 4.10 demonstrate this: the weights for these two patterns show opposing tendencies in the two figures, and this inconsistency accounts for the discrepancy. Consequently, separate weights cannot be utilized to accomplish both phase and frequency detection simultaneously.

Fig. 4.8 Weight gain curve depending on the phase difference

Fig. 4.9 Phase gain curve depending on the phase difference

To achieve both phase and frequency detection, it is necessary to obtain a weight that considers both phase error and frequency error information. This can be achieved by combining the weights of each pattern extracted from multiple  $P_{DIFF}$  and  $F_{DIFF}$  values. Fig. 4.12 illustrates the weights determined through the correlation of

Fig. 4.10 Weight gain curve depending on the frequency difference



Fig. 4.11 Frequency gain curve depending on the frequency difference

the weights obtained from the phase difference and frequency difference; Fig. 4.13 then presents the associated phase and frequency gain curves when the derived weights are applied to the integrator-based proposed CDR architecture. Fig. 4.12 represents the weight curves for each pattern, showing their convergence as the frequency difference increases. We determined the weight value for each pattern and presented them in the table at the bottom of Fig. 4.12. The phase gain curve is



Employed weight at each pattern:

| Pattern  | 0 (000) | 1 (001) | 2 (010) | 3 (011) | 4 (100) | 5 (101) | 6 (110) | 7 (111) |
|----------|---------|---------|---------|---------|---------|---------|---------|---------|
| Weight   | -1      | -1      | 3       | 3       | 3       | 3       | -1      | -1      |
| Decision | EARLY   | EARLY   | LATE    | LATE    | LATE    | LATE    | EARLY   | EARLY   |

Fig. 4.12 The weight gain curve depending on the superposition of phase and frequency difference and the determined pattern weight value

depicted for phase deviations ranging from -0.5 UI to +0.5 UI, and the frequency gain curve for frequency differences ranging from -90% to +150%. By employing a Baud-rate architecture, the proposed CDR removes the overhead of the doubled clock phases and their distribution required in a 2x oversampling CDR, while achieving frequency acquisition at data rates from 14 to 28 Gb/s.
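The effect of the combined weights can be checked against the histograms of Table III with the weighted sum of (4.8). The Python sketch below is a consistency check under assumptions: the weights are those listed in Fig. 4.12, the probability rows are the ±0.1 UI and ±30% rows of Table III, the function name is hypothetical, and a negative sum is read as early and a positive sum as late, following the Decision row of Fig. 4.12.

```python
# Weights for patterns 0 (000) to 7 (111), as listed in Fig. 4.12.
WEIGHTS = [-1, -1, 3, 3, 3, 3, -1, -1]


def gain(pattern_probabilities):
    """Weighted sum of pattern probabilities, sum_N W_N * P(N), as in (4.8)."""
    return sum(w * p for w, p in zip(WEIGHTS, pattern_probabilities))


# Rows copied from Table III.
HISTOGRAMS = {
    "phase early (-0.1 UI)": [0.244, 0.252, 0, 0, 0, 0, 0.252, 0.252],
    "phase late (+0.1 UI)": [0.244, 0, 0, 0.252, 0.252, 0, 0, 0.252],
    "freq early (-30 %)": [0.322, 0.089, 0, 0.089, 0.089, 0, 0.089, 0.322],
    "freq late (+30 %)": [0.206, 0.126, 0.038, 0.125, 0.126, 0.038, 0.126, 0.215],
}

for label, probs in HISTOGRAMS.items():
    value = gain(probs)
    print(f"{label}: {value:+.3f} -> {'LATE' if value > 0 else 'EARLY'}")
```

With a single set of weights, the sign of the sum follows the early or late condition for both the phase rows and the frequency rows used here, which is the property the combined weights are intended to provide.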



Fig. 4.13 Phase and frequency gain curve when the determined weight is applied to the proposed Baud-rate phase and frequency detector

To evaluate the performance of the CDR under data transmission through a lossy channel, we considered the susceptibility of the integrator to ISI. Fig. 4.14 presents the phase and frequency detection gain curves of the proposed PFD when the input passes through a lossy channel, with and without sufficient equalization. The same weights as before were maintained, and two cases were compared: one where sufficient equalization was achieved and another where it was not. For this experiment, we used a channel with a loss of 7 dB at the Nyquist frequency. Due to the integrator's vulnerability to ISI, failure to adequately compensate for the channel loss results in suboptimal phase locking and even the failure of frequency detection. Conversely, when the channel loss is effectively compensated, the phase locks at the optimal point and frequency detection becomes feasible.



Fig. 4.14 Phase and frequency gain curves of the proposed PFD with the input data passed through a lossy channel, with and without equalization

### **4.3 Circuit Implementation**

Fig. 4.15 shows the overall block diagram of the proposed Baud-rate CDR. The CDR adopts a quarter-rate architecture. It consists of synthesizable digital logic (SDL) with a pattern detector, a thermometer-based DCO, de-serializers (DES), a BBPD, and an analog front-end (AFE).

The AFE incorporates a 50-ohm termination, a CTLE, 4 integrators, and 8 samplers. The CTLE output is sampled by the 4 data samplers and simultaneously integrated onto the capacitors in the integrators. Since the proposed CDR is based on the quarter-rate architecture, 4-phase clock signals generated by the DCO are employed in the AFE. A four-stage ring structure with a digitally controlled resistor is employed for the DCO, generating the multi-phase clock signals. The integrated data are sampled by the following samplers to form the phase errors. The sampled data and phase errors are deserialized by the 1:8 DES and passed to the SDL.

The SDL comprises a pattern detector, weight multiplier, lock detector, DLF, and binary-to-thermometer converter (B2T). The pattern detector recognizes each pattern within the deserialized data, specifically those ranging from 000 to 111. The weight multiplier applies predetermined values, as illustrated in Fig. 4.12, to the patterns, respectively. These values, designated as  $w_0$ ,  $w_1$ ,  $w_2$ , and  $w_3$ , are assigned based on the pattern recognized by the detector.  $w_0$  corresponds to patterns 000 and 111,  $w_1$  to 001 and 110,  $w_2$  to 010 and 101, and  $w_3$  to 011 and 100. The ratio of these weights,  $w_0:w_1:w_2:w_3$ , is determined through histogram analysis and set to 1:1:3:3,





adjusted within the DLF. The lock detector monitoring the lock state of the system adjusts the integral and proportional gain used in the DLF and the DCO. Before achieving lock, the integral gain is set to a significant value, while the proportional gain is kept small. Once the lock is achieved, the integral gain is reduced, and the proportional gain is increased to facilitate fast lock acquisition.

### **4.4 Measurement Results**

The prototype chip, fabricated in a 28-nm CMOS technology, occupies 0.02 mm<sup>2</sup>, as shown in Fig. 4.17. The proposed CDR consumes a total power of 29.8 mW, achieving an energy efficiency of 1.06 pJ/b at 28 Gb/s. Clock recovery is achieved at data rates from 14 to 28 Gb/s. The measurement was carried out using a signal quality analyzer with a pattern generator and error detector under a 4.7-dB loss caused by the SMA connectors and PCB traces. The recovered clock, shown in Fig. 4.18, is recovered from the input data operating at 28 Gb/s with an RMS jitter of 1.466 ps and

| Block | Description                 | Area (µm<sup>2</sup>) | Power (mW)              |
|-------|-----------------------------|-----------------------|-------------------------|
| A     | DCO                         | 6771                  | 2                       |
| B     | Analog Front-end            | 1366                  | 19.8 (B and C combined) |
| C     | De-serializer               | 1558                  | (included in B)         |
| D     | Synthesizable Digital Logic | 11291                 | 8                       |

Fig. 4.17 The chip photograph and its area and power consumption



Fig. 4.18 Recovered clock histogram

peak-to-peak jitter of 11.6 ps. Fig. 4.19 presents the jitter tolerance (JTOL) curves at a BER of less than  $10^{-12}$. The JTOL curves satisfy the IEEE 802.3 mask by tolerating a jitter amplitude of 0.05 UI at 100 MHz.

Table IV summarizes the proposed CDR and compares it with other CDR designs. The proposed CDR demonstrates Baud-rate reference-less operation over a wide frequency range. The stochastic Baud-rate PFD achieves the best energy efficiency among the compared designs, 1.06 pJ/b, with a simple implementation in a small area.



Fig. 4.19 The measured jitter histogram at 28 Gb/s with the BER  $< 10^{-12}$ 

|                              | JSSC, 21<br>[35] | JSSC, 16<br>[37] | JSSC, 17<br>[36] | JSSC, 22<br>[41] | This<br>work |
|------------------------------|------------------|------------------|------------------|------------------|--------------|
| Technology                   | 65nm             | 28nm             | 28nm             | 40nm             | 28nm         |
| Modulation                   | NRZ              | NRZ              | NRZ              | PAM-4            | NRZ          |
| Samples/UI                   | 4                | 2                | 1                | 1                | 1            |
| Reference                    | No               | No               | No               | Yes              | No           |
| Channel Loss [dB]            | N/A              | 5                | 14.8             | 4                | 4.7          |
| Data Rate [Gb/s]             | 4 - 20           | 7.4–11.5         | 22.5 - 32        | 48               | 14 - 28      |
| Supply Voltage [V]           | 1.2              | 0.9              | 0.9              | 1                | 1.0          |
| Core Area [mm <sup>2</sup> ] | 0.045            | 0.21             | 0.213            | N/A              | 0.02         |
| Power [mW]                   | 37.3             | 22.9             | 102              | 116.3            | 29.8         |
| Energy Efficiency [pJ/b]     | 1.87             | 1.9              | 3.19             | 2.42             | 1.06         |

Table IV Performance summary and comparison

# Chapter 5 PAM-4 Receiver with Pre-Cursor Adjustable Baud-rate Phase Detector

### **5.1 Overview**

Recently, the demand for higher data rates has kept increasing in many wireline communication applications. Since the channel loss also increases rapidly at such high data rates, sophisticated equalization schemes have been proposed to compensate for it. Moreover, multi-level signaling such as four-level pulse amplitude modulation (PAM-4) offers an advantage over two-level signaling (PAM-2) because it provides a doubled data rate with the same channel loss. However, multi-level signaling is highly vulnerable to inter-symbol interference (ISI) due to the reduced vertical eye margin (VEM). Therefore, on the receiver side, the effect of the pre-cursor ISI, which is not removed by a DFE, becomes more significant in multi-level signaling. It is also challenging to implement a feed-forward equalizer (FFE) that compensates for the pre-cursor ISI in the transmitter, because of the reduced signal power [46] - [47].

From another perspective, as the data rate increases, a Baud-rate clock and data recovery (CDR) emerges as a powerful candidate in receiver designs because it reduces the clocking power compared to conventional oversampling CDRs. For example, the phase detectors (PDs) introduced in [37], [48] – [49] are implemented using 2x oversampling, which requires two sampling phases per unit interval (UI) to obtain phase information. Although oversampling CDRs achieve wide-range frequency detection and fast lock acquisition [37], [48], [50], they require additional clock phases, which increases circuit complexity and power consumption.

On the other hand, the Baud-rate CDR, which samples the data only once per UI, reduces the number of required high-speed samplers and simplifies the clock distribution network, enabling a more power-efficient receiver than conventional oversampling CDRs [24] – [26], [39], [42], [51] – [55]. Among the various Baud-rate PDs (BRPDs), the MMPD is primarily used because of its simplicity [24]. For more practical implementation, an SS-MMPD has been presented [25] – [26], [51]. The MMPD locks at the point where the pre-cursor ISI ( $h_{-1}$ ) and the first post-cursor ISI ( $h_1$ ) become equal. However, if the MMPD operates with an adaptive DFE, it locks at the point where  $h_{-1}$  becomes zero because  $h_1$  is removed. This causes the lock point to drift toward where  $h_{-1}=0$  and makes the receiver vulnerable to noise since the vertical eye height is reduced. In [25], an unequalized  $(h_{-1} \neq h_1)$  MM CDR is presented, where a digital offset is added to effectively make  $h_{-1} \neq 0$. However, the PD requires two error samplers and carries the burden of finding an optimal digital offset. Another technique, proposed in [26], is a weight-adjusting MM CDR (WA-MM CDR), which uses only one additional error sampler and updates the CDR only when a data pattern of "0110" is detected. Although the WA-MM CDR is power efficient because it reduces the number of required error samplers, it is vulnerable because of its dependency on the specific data pattern. The CDR proposed in [54] features an MM CDR showing considerable jitter tolerance with a solid lock point, while the pre-cursor and post-cursor are partially removed by the DFE; however, it requires more samplers than other designs. The Baud-rate CDR in [39] is implemented with an additional sampler for frequency detection by examining rising and falling data waveforms. The CDRs mentioned above have been implemented in PAM-2 signaling [24] – [26], [37], [39], [48] – [55] and would require an exponentially increasing number of samplers when implemented in PAM-4 signaling. In [42], the BRPD employs an integrator that performs phase detection based on the integrated voltage over a data transition. However, the integrator-based PD is vulnerable to ISI, which causes the lock point to drift away from the optimum.

To overcome these limitations, this work proposes a BRPD that detects an early or late phase by estimating  $h_{-1}$  [60]. Contrary to the MMPD, the proposed BRPD is free from the drift issue because its lock point is determined by the ratio of  $h_0$  to  $h_{-1}$  with the post-cursors completely removed. Furthermore, since the BRPD locks at the point where  $h_0$  becomes  $M_{ext} \cdot h_{-1}$  with the externally controlled target cursor ratio ( $M_{ext}$ ), a targeted VEM can be achieved even with equalizers that do not effectively remove the pre-cursor ISI. Therefore, the proposed BRPD is suitable for multi-level signaling, where the effect of the pre-cursor is significant. Additionally, the DFE adaptation based on the uneven data level adaptation (UDA) algorithm introduced in [56] is extended to PAM-4 signaling. Contrary to the commonly used sign-sign least mean square (SS-LMS) algorithm, the UDA algorithm provides a more accurate DFE adaptation. The data level is defined based on the data histogram shown in Section 5.2.2, and the DFE removes  $h_1$  more accurately, resolving the ambiguity in the SS-LMS algorithm.

The remainder of this chapter is organized as follows. Section 5.2 describes the proposed BRPD with the analysis of the correlation between  $h_{-1}$  and the VEM, the UDA algorithm, the phase detection of the proposed BRPD, and the DFE tap coefficient adaptation. Section 5.3 presents the circuit implementation of the proposed receiver in detail. Section 5.4 presents the measurement results of the designed receiver. Finally, Chapter 6 concludes the dissertation.

### **5.2 Proposed Phase Acquisition Technique**

#### 5.2.1 Concept of Proposed Baud-rate Phase Detector

Fig. 5.1 shows the single-bit responses (SBRs) and lock points of the conventional MMPD and the proposed PD, both with DFE adaptation. As mentioned earlier, the conventional MMPD with DFE adaptation suffers from a wandering lock point. Many techniques have been presented to resolve this problem [25] - [26], [54], but they introduce data dependency or hardware overhead. To avoid this dilemma, this work proposes a BRPD whose lock point is determined by the cursor ratio of  $h_0$  to  $h_{-1}$ , denoted as M. Thus, the performance property is different



Fig. 5.1 Lock point of conventional MMPD and proposed MMPD with DFE on single-bit response

from the conventional MMPD, in which the lock point is determined where  $h_{-1}=h_1$  on the SBR [24]. Furthermore, since the lock point of the proposed BRPD is independent of  $h_1$ , the impediment caused by using DFE adaptation in the MMPD is resolved.

The proposed BRPD serves three purposes: (1) a receiver that operates without the assistance of a TX FFE, (2) a CDR based on a BRPD with a unique lock point, and (3) a CDR whose lock point is determined using  $h_{-1}$ , because it is difficult for the receiver to remove  $h_{-1}$  completely through the CTLE and DFE once the channel loss exceeds a certain level, even though the post-cursors can be removed entirely.

The operating principle of the proposed BRPD is based on the correlation between the ratio of  $h_0$  to  $h_{-1}$  and the VEM. The eye magnitude of the conventional PAM-2 without an equalizer can be estimated through the SBR as

$$\operatorname{VEM}_{\text{eye}, \text{PAM-2}} = 2 \cdot \left( h_0 - \sum_{k \neq 0}^{\infty} \left| h_k \right| \right)$$
(5.1)

where  $h_k$  is the magnitude of the  $k^{th}$  cursor of the SBR. If all the post-cursors are removed by the equalizers, only pre-cursors remain and the eye magnitude is written as

$$\operatorname{VEM}_{\operatorname{EQ.eye,PAM-2}} = 2 \cdot \left( h_0 - \sum_{k=-\infty}^{-1} \left| h_k \right| \right). \tag{5.2}$$

Although the DFE effectively removes post-cursor ISIs, it cannot equalize the pre-cursor ISIs; hence, the VEM is determined by the remaining pre-cursor ISIs. Extended to PAM-4 signaling, the equalized eye magnitude is derived as



Fig. 5.2 Simulated vertical eye opening of the PAM-4 and PAM-2 signal with activated DFE vs (a) sampling time and (b) cursor ratio M

$$\text{VEM}_{\text{EQ.eye.PAM-4}} = \frac{2}{3} \cdot \left( h_0 - 3 \cdot \sum_{k=-\infty}^{-1} |h_k| \right). \tag{5.3}$$

Based on (5.2) and (5.3), Fig. 5.2(a) shows the VEM versus the sampling time for PAM-4 and PAM-2 signaling. The VEM is considerably diminished when PAM-4 signaling is used, which means that the receiver requires more precise convergence to an optimal lock point to achieve a target BER.

Assuming that the VEM degradation caused by the other pre-cursors except  $h_{-1}$  is negligible, the VEM of (5.3) can be expressed as

$$\operatorname{VEM}_{\operatorname{EQ.eye.PAM-4}} = \frac{2}{3} \cdot h_0 \cdot \left(1 - 3 \cdot \left|\frac{1}{M}\right|\right).$$
(5.4)

where *M* is the ratio of  $h_0$  to  $h_{-1}$ . Based on (5.4), Fig. 5.2(b) illustrates the correlation between the PAM-4 VEM and *M*. As shown in Fig. 5.2(b), the ratio of  $h_0$  and  $h_{-1}$ should be greater than 3 for PAM-4 to secure a non-zero eye-opening. Meanwhile, non-zero  $h_{-1}$  is required for BRPD to avert the possibility of the locking point drift.
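As a quick numerical check of (5.2) to (5.4), the following Python sketch evaluates the equalized eye magnitudes for a few cursor ratios. The function names and the normalized cursor values are assumptions chosen for illustration; they are not taken from the simulated or measured channel.

```python
def vem_pam2(h0, pre_cursors):
    """Equalized PAM-2 eye magnitude per (5.2); post-cursors assumed removed."""
    return 2.0 * (h0 - sum(abs(h) for h in pre_cursors))


def vem_pam4(h0, pre_cursors):
    """Equalized PAM-4 eye magnitude per (5.3); post-cursors assumed removed."""
    return (2.0 / 3.0) * (h0 - 3.0 * sum(abs(h) for h in pre_cursors))


def vem_pam4_from_ratio(h0, m):
    """PAM-4 eye magnitude per (5.4), keeping only h_-1 = h0 / M."""
    return (2.0 / 3.0) * h0 * (1.0 - 3.0 * abs(1.0 / m))


# Hypothetical cursor values, normalized to h0 = 1.
h0 = 1.0
for m in (2.0, 3.0, 4.0, 8.0):
    h_m1 = h0 / m
    print(f"M={m:4.1f}  PAM-2 VEM={vem_pam2(h0, [h_m1]):6.3f}  "
          f"PAM-4 VEM={vem_pam4(h0, [h_m1]):6.3f}")
```

Under these assumptions, the PAM-4 eye closes at M = 3 while the PAM-2 eye remains open, which is the condition M > 3 stated above. A negative value indicates a closed eye.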

Through this analysis, we propose a PD with the targeted ratio  $M_{ext}$  of  $h_0$  to  $h_{-1}$  which is externally adjusted to the channel characteristic. As shown in Fig. 5.1, the lock point of the proposed PD on the SBR is determined as follows

$$h_0 = M_{ext} \cdot h_{-1}. \tag{5.5}$$

The proposed PD has two advantages. First, it is well suited to the BRPD structure and compatible with the adaptive DFE. Thus, the locking point is stationary even if the residual post-cursor ISIs change with the adaptive DFE. Second, the SNR degradation caused by  $h_{-1}$  can be directly manipulated. Since the ratio of  $h_0$  to  $h_{-1}$  can be easily controlled externally,  $h_{-1}$  can be adjusted to an appropriate value according to the channel loss to meet the target BER.

To make the proposed PD operate correctly, the *M* versus sampling time should



Fig. 5.3 (a) Simulated single-bit responses and (b) cursor ratio versus time with various channel losses

exhibit monotonicity over 1 UI. Otherwise, multiple locking points may arise from the concave region. Fig. 5.3 shows (a) the SBR and (b) the value of M obtained through simulation with various channel losses. It is observed that the value of M gradually decreases over the sampling phase, verifying the monotonic characteristic. The phase is detected as early if the ratio is greater than the externally set value  $M_{ext}$ , and as late otherwise.
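The monotonicity requirement and the resulting unique lock point can be illustrated with a toy model. The sketch below assumes a Gaussian-shaped SBR (a stand-in, not the simulated channel of Fig. 5.3), sweeps the sampling phase over one UI, checks that M decreases strictly, and finds the first phase at which M falls to a hypothetical target ratio of 4. All names and parameter values are illustrative assumptions.

```python
import math

UI = 1.0      # unit interval (normalized)
SIGMA = 0.6   # hypothetical pulse width of the toy SBR, in UI


def sbr(t):
    """Toy Gaussian single-bit response centered at t = 0 (an assumption)."""
    return math.exp(-t * t / (2.0 * SIGMA ** 2))


def cursor_ratio(t):
    """M = h0 / h_-1 when the main cursor is sampled at time t."""
    h0 = sbr(t)
    h_m1 = sbr(t - UI)  # value of the next symbol's pulse at the sampling instant
    return h0 / h_m1


def find_lock_phase(m_ext, phases):
    """Return the first phase at which M has dropped to m_ext or below."""
    for t in phases:
        if cursor_ratio(t) <= m_ext:
            return t
    return None


steps = 200
phases = [-0.5 * UI + UI * i / steps for i in range(steps + 1)]
ratios = [cursor_ratio(t) for t in phases]
is_monotonic = all(a > b for a, b in zip(ratios, ratios[1:]))
lock_phase = find_lock_phase(4.0, phases)
print("monotonic:", is_monotonic, " lock phase [UI]:", lock_phase)
```

For this toy pulse the ratio has the closed form M(t) = exp((1 - 2t) / (2 SIGMA^2)), so the lock phase can be cross-checked analytically as t = 0.5 - SIGMA^2 ln(M_ext). Phases before the lock point see M > M_ext (early) and phases after it see M < M_ext (late).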

#### 5.2.2 Data Level and DFE Adaptation

One of the key features of the proposed clock recovery is how it estimates the magnitudes of  $h_0$ ,  $h_{-1}$ , and the cursor ratio M. To estimate the cursor magnitudes, the results of the data level adaptation, which employs the SS-LMS algorithm, are used. This adaptation is intended to determine the magnitude of  $h_0$  and is widely used because of its simplicity [57]. In this work, a new type of data level adaptation, the uneven data level adaptation (UDA) method presented in [56], enables accurate equalizer adaptation when  $h_{-1}$  is present. In the UDA method, the optimal data level for PAM-2 is set to  $h_0+h_{-1}$  rather than  $h_0$ , and its update equation is expressed as

$$V_{Dlev}[n+1] = \begin{cases} V_{Dlev}[n] + 3 \cdot \mu_{Dlev} \cdot E[n] \text{ if } E[n] = +1 \\ V_{Dlev}[n] + 1 \cdot \mu_{Dlev} \cdot E[n] \text{ if } E[n] = -1 \end{cases}$$
(5.6)

where  $\mu_{Dlev}$  is a small update coefficient, and E[n] is the output of the error sampler when data D[n]=+1. Note that  $V_{Dlev}$  is adaptively determined asymmetrically with a 3:1 ratio. Adopting the UDA in [56] to PAM-4 signaling, different data levels can be obtained with different asymmetries. Fig. 5.4 shows the data histogram for +3 at the DFE input, where +3 is defined as the highest data level. The histogram shows four peaks for data +3. The adaptive data level  $V_{Dlev}$  can be located anywhere by changing the up/down coefficients of the SS-LMS, as shown in the table of Fig. 5.4. The  $V_{Dlev}$  update can be expressed as the following equation.



Fig. 5.4 The eye diagram of the PAM-4 and data histogram for data +3

$$V_{Dlev}[n+1] = \begin{cases} V_{Dlev}[n] + \mu_{up} \cdot \mu_{Dlev} \cdot E[n] \text{ if } E[n] = +1 \\ V_{Dlev}[n] + \mu_{dn} \cdot \mu_{Dlev} \cdot E[n] \text{ if } E[n] = -1 \end{cases}$$
(5.7)

where  $\mu_{up}$  and  $\mu_{dn}$  are update coefficients when D[n] = 3. For example,  $V_{Dlev}$  corresponding to  $3 \cdot h_0 + 3 \cdot h_{-1}$  is obtained by the 7:1 coefficient adaptation ratio since the probability of DN at  $3 \cdot h_0 + 3 \cdot h_{-1}$  is 7 times larger than UP. As shown in Fig. 5.4, it is possible to obtain other  $V_{Dlev}$ s by adjusting the coefficients.
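The effect of the coefficient ratio in (5.7) can be reproduced with a small behavioral simulation. The sketch below is a minimal model under stated assumptions: only symbols with D[n] = +3 are simulated, the post-cursors are assumed to be fully removed, the noise is Gaussian, and the cursor values, step size, and function name are hypothetical.

```python
import random


def adapt_dlev(h0, h_m1, mu_up, mu_dn, mu_dlev=1e-4, sigma=0.02,
               n_updates=100_000, seed=1):
    """Sign-sign data level adaptation following (5.7) for D[n] = +3.

    Returns the data level averaged over the second half of the run.
    """
    rng = random.Random(seed)
    levels = (-3, -1, 1, 3)          # PAM-4 symbols for D[n+1]
    v_dlev = 3.0 * h0                # start from the symmetric data level
    acc, count = 0.0, 0
    for n in range(n_updates):
        d_next = rng.choice(levels)
        # Received level for D[n] = +3 with pre-cursor ISI and Gaussian noise
        # (post-cursors assumed removed; the noise model is an assumption).
        y = 3.0 * h0 + h_m1 * d_next + rng.gauss(0.0, sigma)
        e = 1 if y > v_dlev else -1  # error sampler output E[n]
        v_dlev += (mu_up if e > 0 else mu_dn) * mu_dlev * e
        if n >= n_updates // 2:
            acc += v_dlev
            count += 1
    return acc / count


v_top = adapt_dlev(h0=1.0, h_m1=0.1, mu_up=7, mu_dn=1)
print(f"7:1 adaptation settles near {v_top:.3f} (3*h0 + 3*h_-1 = 3.300)")
```

The loop settles where the up and down updates balance, that is, where the probability of E[n] = +1 equals mu_dn / (mu_up + mu_dn). With a 7:1 ratio this is 1/8, which corresponds to the center of the highest of the four histogram peaks.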

The question is whether the DFE adaptation can operate correctly with the asymmetric data level  $3 \cdot h_0 + 3 \cdot h_{-1}$ , and whether it is more suitable than DFE adaptation using the data level  $3 \cdot h_0$  obtained with symmetric update coefficients. Fig. 5.5 shows the comparison of the DFE adaptation using the conventional data level adaptation and the UDA for PAM-4 signaling. Each data level marked as A, B, C, and D in Fig. 5.5 can be calculated by the following equation.



$$y[n] = (h_1 - w_1) \cdot D[n-1] + h_0 \cdot D[n] + h_{-1} \cdot D[n+1].$$
(5.8)

where all cursors other than  $h_{-1}$ ,  $h_0$ , and  $h_1$  are zero and the tap coefficient  $w_1$  of the one-tap DFE is adapted by the SS-LMS algorithm. The signals "UP" and "DN", which adjust the tap coefficient, are calculated using the input of the DFE, D[n-1], and the output of the error sampler,  $E_{Dlev}[n]$ .

Before tap weight  $w_1$  of the 1-tap DFE converges to  $h_1$ , residual post-cursor ISI appears as  $3 \cdot h_1 - 3 \cdot w_1$  for the data +3 received. For the conventional adaptation, the data level has a magnitude of  $3 \cdot h_0$ . In the presence of  $h_{-1}$ , the SS-LMS algorithm cannot locate the exact point where  $3 \cdot h_1 - 3 \cdot w_1$  becomes zero because of the dead zone, and the tap coefficient wanders [56].

On the other hand, for the UDA adaptation,  $V_{Dlev}$  is represented as  $3 \cdot h_0 + 3 \cdot h_{-1}$ . When using the UDA adaptation, the dead zone disappears and the tap coefficient is determined at the fixed point without wandering. Eventually, the UDA guarantees the appropriate tap coefficient  $w_1$  converges to  $h_1$  even if  $h_{-1}$  is non-zero.

The DFE adaptation based on the UDA is shown as follows for the PAM-4 signaling.

$$w[n+1]_k = w[n]_k + \mu_{DFE} \cdot E[n] \cdot \left(\frac{D[n-k]}{|D[n-k]|}\right) \quad \text{for } D[n] = +3, \tag{5.9}$$

where  $w_k$  is the DFE tap coefficient, k is the tap index, and E[n] is the comparison result with the threshold of  $3 \cdot h_0 + 3 \cdot h_{-1}$ .
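A behavioral sketch of the one-tap case of (5.9), combined with the signal model (5.8) and the 7:1 data level update, is given below. It is a simplified model under stated assumptions: the feedback uses the true previous symbol rather than a sampler decision, only symbols with D[n] = +3 are simulated, the noise is Gaussian, and all cursor values and step sizes are hypothetical.

```python
import random


def sign(x):
    """Return +1 for positive inputs and -1 otherwise."""
    return 1 if x > 0 else -1


def adapt_dfe_uda(h0=1.0, h_m1=0.1, h1=0.2, mu_dlev=1e-4, mu_dfe=2e-5,
                  sigma=0.02, n_updates=120_000, seed=7):
    """Jointly adapt V_Dlev (7:1 UDA) and the DFE tap w1 per (5.9)."""
    rng = random.Random(seed)
    levels = (-3, -1, 1, 3)
    v_dlev, w1 = 3.0 * h0, 0.0
    for _ in range(n_updates):
        d_prev = rng.choice(levels)   # D[n-1]
        d_next = rng.choice(levels)   # D[n+1]
        # Eq. (5.8) evaluated for D[n] = +3, plus additive noise (assumption).
        y = (h1 - w1) * d_prev + h0 * 3 + h_m1 * d_next + rng.gauss(0.0, sigma)
        e = sign(y - v_dlev)          # error sampler output E[n]
        v_dlev += (7 if e > 0 else 1) * mu_dlev * e   # UDA data level update
        w1 += mu_dfe * e * sign(d_prev)               # sign-sign tap update
    return v_dlev, w1


v_dlev_final, w1_final = adapt_dfe_uda()
print(f"V_Dlev = {v_dlev_final:.3f}, w1 = {w1_final:.3f}")
```

In this model the tap update has zero mean only when the residual post-cursor h1 - w1 vanishes, so w1 moves toward h1 while the data level moves toward 3·h0 + 3·h-1, even though h-1 is non-zero.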

Fig. 5.6 shows the simulated eye diagrams of conventional DFE adaptation and UDA. Proposed UDA in PAM-4 signaling achieves a 20% enhanced vertical eye margin, demonstrating accurate DFE adaptation.



Fig. 5.6 Simulated eye diagram with conventional data level adaptation and proposed data level adaptation

#### 5.2.3 Pre-cursor Adjustable Baud-rate Phase Detector with Multi-level Modulation Signaling

Fig. 5.7(a) shows the operation of the proposed BRPD. Only the transitions from +3 to -3 are used for phase detection. When the present data is +3, and the following data is -3, the input can be expressed as  $3 \cdot h_0 - 3 \cdot h_{-1}$ . Since  $V_{Dlev}$  is  $3 \cdot h_0 + 3 \cdot h_{-1}$ , the received signal  $V_{RX,in}$  can be expressed as

$$\begin{aligned} V_{RX,in} &= 3 \cdot h_0 - 3 \cdot h_{-1} = (3 \cdot h_0 + 3 \cdot h_{-1}) \cdot \left(\frac{M - 1}{M + 1}\right) \\ &= V_{Dlev} \cdot \left(\frac{M - 1}{M + 1}\right). \end{aligned} \tag{5.10}$$

The proposed BRPD uses a virtual reference data level  $V_{ref,PD}$  generated on chip with  $V_{Dlev}$  and the externally provided ratio of  $h_0$  and  $h_{-1}$ ,  $M_{ext}$ , as follows.

$$V_{ref,PD} = V_{Dlev} \cdot \left(\frac{M_{ext} - 1}{M_{ext} + 1}\right)$$
(5.11)

The phase detection is achieved by comparing the input  $V_{RX,in}$  against  $V_{ref,PD}$ . If the actual ratio *M* is greater than  $M_{ext}$ , the comparison result is +1, and vice versa, as the following equation shows.



Fig. 5.7 Generating phase error based on the consecutive data (D[n], D[n+1]) = (+3, -3) and the sign of the V<sub>ref,PD</sub>

$$\begin{aligned} V_{RX,in} - V_{ref,PD} &= V_{Dlev} \cdot \left[ \left( \frac{M-1}{M+1} \right) - \left( \frac{M_{ext}-1}{M_{ext}+1} \right) \right] \\ &= V_{Dlev} \cdot \frac{2 \cdot (M - M_{ext})}{(M+1) \cdot (M_{ext}+1)}. \end{aligned} \tag{5.12}$$

Thus, the lock point is where *M* equals  $M_{ext}$ , which means  $V_{RX,in}=V_{ref,PD}$ . The phase is early where  $M>M_{ext}$ , equivalent to  $V_{RX,in}>V_{ref,PD}$ . Conversely,  $V_{RX,in}<V_{ref,PD}$  corresponds to a late phase, which is equivalent to  $M<M_{ext}$ .
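The decision rule in (5.10) to (5.12) can be written as a few lines of Python. This is a noiseless sketch with hypothetical cursor values and function names; it only illustrates the sign of the comparison for a (+3, -3) transition.

```python
def v_ref_pd(v_dlev, m_ext):
    """Virtual reference level per (5.11)."""
    return v_dlev * (m_ext - 1.0) / (m_ext + 1.0)


def brpd_decision(h0, h_m1, m_ext):
    """Return 'early', 'late', or 'lock' for a (+3, -3) data transition."""
    v_dlev = 3.0 * h0 + 3.0 * h_m1    # data level obtained by the UDA
    v_rx_in = 3.0 * h0 - 3.0 * h_m1   # input for D[n] = +3, D[n+1] = -3
    diff = v_rx_in - v_ref_pd(v_dlev, m_ext)
    if abs(diff) < 1e-12:
        return "lock"
    return "early" if diff > 0 else "late"


def diff_closed_form(h0, h_m1, m_ext):
    """Right-hand side of (5.12), with M = h0 / h_-1."""
    m = h0 / h_m1
    v_dlev = 3.0 * h0 + 3.0 * h_m1
    return v_dlev * 2.0 * (m - m_ext) / ((m + 1.0) * (m_ext + 1.0))


# Hypothetical cursors with target ratio M_ext = 4.
for h_m1 in (0.2, 0.25, 0.5):
    print(f"h_-1={h_m1:4.2f}  M={1.0 / h_m1:4.1f}  ->", brpd_decision(1.0, h_m1, 4.0))
```

With h0 = 1 and M_ext = 4, a pre-cursor of 0.2 (M = 5) is reported as early, 0.25 (M = 4) as the lock point, and 0.5 (M = 2) as late.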

The flowchart of the proposed BRPD as summarized in Fig. 5.8 illustrates the overall phase detection algorithm. First, the data level is decided through the UDA algorithm. Then, by using the data level,  $V_{ref,PD}$  is calculated with the externally provided  $M_{ext}$ . When data transitions from +3 to -3, phase detection is performed by comparing the input and the virtual reference  $V_{ref,PD}$ , and finally the phase-control



| <i>D</i> [n]     | <i>E<sub>Dlev</sub></i> [n] | V <sub>Dlev</sub> |
|------------------|-----------------------------|-------------------|
| +3               | +1                          | -7                |
| +3               | -1                          | +1                |
| All othe         | er cases                    | Hold              |
| <i>D</i> [n,n+1] | <i>E<sub>PD</sub></i> [n]   | PD <sub>out</sub> |
| +3, -3           | +1                          | Early             |
| +3, -3           | -1                          | Late              |
| All othe         | er cases                    | Hold              |

Fig. 5.8 The flow chart of the proposed MMPD



Fig. 5.9 Simulated eye diagram of the proposed MMPD under two input data conditions:

(a) MSB: PRBS-7, LSB: PRBS-10, (b) MSB: PRBS-31, LSB: PRBS-15

word (PCW) is adjusted for clock recovery. The sampling point is moved through this operation so that the cursor ratio M converges to  $M_{ext}$ .

Fig. 5.9 shows the eye diagram of the proposed MMPD under two input data conditions: (a) MSB: PRBS-7, LSB: PRBS-10, and (b) MSB: PRBS-31, LSB: PRBS-15. Although there is a slight difference between the two input patterns due to the different run lengths, there is no significant difference in the lock position.

### **5.3 Circuit Implementation**

#### 5.3.1 Proposed PAM-4 receiver architecture

The overall circuit implementation of the proposed PAM-4 receiver is shown in Fig. 5.10. The receiver employs a half-rate architecture with a forwarded clocking system. It consists of a synthesizable digital logic (SDL) with the proposed BRPD and an adaptation logic, an analog front-end (AFE), digital-to-analog converters (DACs), deserializers (DESs), an I/Q generator, and a phase rotator (PR).

The AFE includes 50-ohm termination resistors, a DFE summer, and 10 samplers. The DFE is composed of a single-stage amplifier, and the summing is achieved by using an inverter-based amplifier to improve the linearity [58]. The half-rate one-tap DFE is implemented by merging two summers into one [59]. Direct feedback uses the RZ sampled output instead of the conventional NRZ output of an RS latch. The sampling circuits use a Strong-Arm (SA) latch with a pair of differential input transistors to adjust the sampling threshold. The sampling path has three data samplers for a PAM-4 signal ( $V_H$ ,  $V_0$ ,  $V_L$ ) and two error samplers. One of the error samplers plays the dual role of finding the magnitude of the main cursor and the adaptation of DFE coefficients, and the other sampler is for the phase detection explained before. The sampler outputs are deserialized and delivered to the SDL.

The SDL includes a DFE adaptation logic, a UDA logic, a logic for sampling threshold calculation, and a digital loop filter (DLF) for the BRPD and controller. The DFE adaptation logic is based on the SS-LMS algorithm [57]. The proposed




BRPD implemented in the SDL finds the lock point and generates a PCW that controls the PR. The PR controls the sampling clock phases by interpolating the received forwarded clock signals. The sampling threshold calculator sets the thresholds of the three data samplers,  $V_H$ ,  $V_0$ , and  $V_L$ , to  $2 \cdot h_0$ , 0, and  $-2 \cdot h_0$ , respectively; the threshold of one error sampler to  $V_{Dlev}$ , that is,  $3 \cdot h_0 + 3 \cdot h_{-1}$ ; and the threshold of the PD sampler to  $V_{ref,PD}$ , derived from  $V_{Dlev}$  and  $M_{ext}$ . If the sampling threshold were set by a voltage corresponding to the control word, a significant offset might occur due to the nonlinearity of the SA latch. Therefore, the current of the input transistor is adjusted for better linearity of the sampling threshold, and the linear adjustment is obtained by current mirroring.



Fig. 5.11 I-DAC implementation for sampler threshold voltage and mismatch calibration

The proposed I-DAC generates a linear differential current for each threshold control word, as shown in Fig. 5.11. A total of 10 I-DACs are implemented to control the 10 samplers. In each I-DAC, an 8-bit control word sets the sampling threshold, and a 6-bit control word performs mismatch calibration to reduce the effect of the offset caused by random variations in the device parameters. Each I-DAC cell includes a decoder controlled by thermometer-coded row and column addresses for seamless transitions.

#### 5.3.2 Proposed merged-summer DFE with the inverter-based amplifier

The designed AFE is composed of a single-stage amplifier. The AFE amplifier performs the function of the DFE summer and provides a DC gain adjustable by up to 3 dB, as implemented in [58]. In a DFE based on the conventional time-interleaving approach, an independent adaptation loop is necessary to remove the effect of the mismatch. The power consumption and hardware overhead of such an independent adaptation loop are intolerable. To circumvent this overhead, a single DFE summer operating at full rate is proposed instead of several DFE summers.

As shown in Fig. 5.10, an inverter-based DFE is employed to avoid the SNR reduction caused by nonlinearity in PAM-4 signaling. Furthermore, a merged-summer DFE based on the time-interleaving methodology is adopted to distinguish valid data from invalid data while removing ISI as in conventional DFE operation. The detailed schematic of the proposed inverter-based DFE with the merged summer and its timing diagram are shown in Fig. 5.12 and Fig. 5.13. The SA latch output has a decision state and a refresh state. By exploiting the SA latch outputs s, sb, r, and rb, the feedback in the proposed summer is accomplished, as shown in the table in Fig. 5.12. When the clock is high, which is the decision state, the feedback path charges or discharges the summer differentially depending on the valid data; however, when the clock is low, which is referred to as the refresh state, it ceases charging and discharging and maintains the common-mode value for the invalid data. Since the invalid data retains a common-mode state, the valid data in the AFE operating at half rate is not degraded by the invalid data even if the feedback is executed at the same





Fig. 5.13 Timing diagram of proposed merged summer DFE

summer node. Through the proposed merged-summer DFE, the receiver reduces the feedback time, the required number of summers, the adaptation logic, the mismatches between the tap coefficients, and the power consumption.

### **5.4 Measurement Results**

The prototype chip fabricated in a 40-nm CMOS process occupies 0.24 mm<sup>2</sup>, as shown in Fig. 5.14, and consumes 116.3 mW, achieving an energy efficiency of 2.42 pJ/b at 48 Gb/s with PAM-4 signaling. The chip area and power breakdown are shown in Fig. 5.14. Among the sub-blocks, the SDL and the DAC, which controls the sampler threshold, occupy the most area.
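The energy-efficiency figure follows directly from the reported power and data rate, as the short worked calculation below shows:

$$\frac{116.3\ \text{mW}}{48\ \text{Gb/s}} = \frac{116.3 \times 10^{-3}\ \text{J/s}}{48 \times 10^{9}\ \text{b/s}} \approx 2.42\ \text{pJ/b}.$$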



Fig. 5.14 Chip photomicrograph of the implemented receiver with the detailed area and power consumption

The measurement setup is shown in Fig. 5.15. The measurements of the eye diagram, the bathtub curve, and the jitter tolerance are performed automatically through an I2C interface controlled by a Python program. Passive power dividers generate the PAM-4 signals by combining the PRBS-7 MSB and LSB data from the pattern generator. A 6-dB attenuator is applied to the LSB pattern to lower its swing to half that of the MSB pattern. To verify the proposed phase detection and equalizer performance in multi-level signaling, both PAM-2 and PAM-4 signals are used with two channels of different losses, as shown in Fig. 5.15. The measured insertion loss is 19 dB at 12.5 GHz for the PAM-2 channel and 4 dB at 12 GHz for the PAM-4 channel. A 12 GHz clock is generated from the pattern generator and forwarded to the prototype chip.
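The role of the 6-dB attenuator can be checked with a short calculation. The sketch below assumes ideal, lossless combining of two binary streams (the insertion loss of the real power dividers is ignored) and uses hypothetical function names; it shows that attenuating the LSB by 6 dB yields four levels with an outer-to-inner ratio close to 3:1.

```python
def db_to_voltage_ratio(db):
    """Convert a gain in dB to a linear voltage ratio."""
    return 10.0 ** (db / 20.0)


def pam4_levels(msb_swing=1.0, lsb_attenuation_db=6.0):
    """Four levels obtained by summing an MSB stream and an attenuated LSB stream."""
    lsb_swing = msb_swing * db_to_voltage_ratio(-lsb_attenuation_db)
    return sorted(m * msb_swing + l * lsb_swing
                  for m in (-1, 1) for l in (-1, 1))


levels = pam4_levels()
print("levels:", [round(v, 3) for v in levels])
print("outer/inner ratio:", round(levels[3] / levels[2], 3))
```

Because 6 dB corresponds to a voltage ratio of about 0.501 rather than exactly 0.5, the level spacing is only approximately uniform in this idealized model.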

The proposed phase detection technique is verified by measuring Dlevs and the cursor ratio M. Fig. 5.16 shows the measured Dlev code for PAM-4 and PAM-2



Fig. 5.15 Block diagram of the measurement setup

signals at varying sampling clock phases. The Dlevs of the PAM-4 signal at each sampling clock phase are obtained by adjusting the UP/DN ratio according to the data histogram shown in Fig. 5.4 through the UDA algorithm. The Dlevs of the PAM-2 signal are obtained using the histogram mentioned in [56].



Fig. 5.16 Measured Dlev code vs sampling time in (a) PAM-4 signaling and (b) PAM-2 signaling

Fig. 5.17 represents the measured cursor ratio M derived from the measured Dlevs with the proposed BRPD lock point. Lock points of the proposed BRPD are determined by the Dlevs and the externally provided  $M_{ext}$  for both PAM-4 and PAM-2. To verify the performance of the proposed BRPD, the cursor ratio M is measured



Fig. 5.17 Measured ratio M from measured cursor values in (a) PAM-4 signaling and (b) PAM-2 signaling

over varying discrete PCWs, and it is confirmed that the *M* at the lock point is identical to the external  $M_{ext}$  for both PAM-4 and PAM-2 signals. Fig. 5.17 shows that the value of M for PAM-4 is larger than that for PAM-2, meaning the smaller



Fig. 5.18 Measured bathtub curve



Fig. 5.19 Measured jitter tolerance at BER of 10<sup>-11</sup>

 $h_{-1}$  with PAM-4 signaling. Considering the larger channel loss in PAM-2 compared to PAM-4 at Nyquist frequency,  $h_{-1}$  of PAM-4 is smaller than that of PAM-2, which means the larger M, as shown in Fig. 5.17. The pre-cursor proportion can be figured



Fig. 5.20 Measured PAM-4 eye height vs cursor ratio M



Fig. 5.21 Measured PAM-4 eye diagram with BER  $< 10^{-11}$ 

out through the measured *M* at each PCW. Fig. 5.18 shows the measured bathtub curve, and Fig. 5.19 shows the measured jitter tolerance (JTOL) with the BER of less than  $10^{-11}$ . The JTOL curves in Fig. 5.19 satisfy the IEEE 802.3 mask by tolerating a jitter amplitude greater than 0.05 UI<sub>peak-peak</sub> at the jitter frequency of 1 MHz. Fig. 5.20 shows the measured eye height versus the cursor ratio M in PAM-4 signaling, and Fig. 5.21 presents the measured PAM-4 eye diagram with the BER under  $10^{-11}$ . The timing margin in the eye diagram corresponds to the timing margin of the bathtub curve in Fig. 5.18.

Table V presents the performance summary of the proposed receiver and the comparison with other receivers using a BR CDR for clock recovery. This work demonstrates a pre-cursor adjustable CDR operating at the Baud rate. Although a low-loss channel is used in this work without a CTLE, the receiver achieves an energy efficiency of 2.42 pJ/b, which is lower than that of the other PAM-4 receiver in Table V, owing to the merged-summer DFE.

| Reference         | [7]                          | [9]                                 | [12]                       | [14]                           | [15]                     | [13]                     | This work                            |
|-------------------|------------------------------|-------------------------------------|----------------------------|--------------------------------|--------------------------|--------------------------|--------------------------------------|
| Technology        | 28nm                         | 40nm                                | 40nm                       | 65nm                           | 22nm                     | 65nm                     | 40nm                                 |
| Modulation        | NRZ                          | NRZ                                 | NRZ                        | NRZ                            | NRZ                      | PAM-4                    | PAM-4                                |
| PD type           | Baud-rate<br>(Pattern-based) | Baud-rate<br>(Maximum Eye tracking) | Baud-rate<br>(Integrating) | Baud-rate<br>(Slope Detection) | Baud-rate<br>(Bang-Bang) | Baud-rate<br>(Bang-Bang) | Baud-rate<br>(Pre-cursor Adjustable) |
| Data Rate (Gb/s)  | 32                           | 28                                  | 25                         | 60                             | 16                       | 51                       | 48                                   |
| Link Loss (dB)    | 14.8                         | 20                                  |                            |                                | 34                       | ~ 10                     | 4                                    |
| Equalizer         | CTLE,<br>1-tap DFE           | CTLE,<br>2-tap DFE                  |                            | 2-tap TX FFE,<br>CTLE,<br>3-tap DFE | CTLE,<br>8-tap DFE  | 8-dB<br>Pre-emphasis     | 1-tap DFE                            |
| Efficiency (pJ/b) | 3.19                         | 2.02                                | 2.1                        | 2.88                           | 3.7                      | 6.27                     | 2.42                                 |
| Power (mW)        | 102                          | 56.6                                | 52                         | 173                            | 59.7                     | 321                      | 116.3                                |
| BER               | <1E-12                       | <1E-12                              | <1E-12                     | <1E-9                          | <1E-12                   | 3.4E-9                   | <1E-11                               |

Table V Performance summary and comparison with other designs
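The efficiency row of Table V can be cross-checked against the power and data-rate rows, since energy per bit is simply power divided by data rate. The short Python sketch below transcribes those three rows and reports how far each computed value is from the tabulated one. The function and variable names are illustrative and do not come from the dissertation.

```python
# Rows transcribed from Table V:
# (design, data rate [Gb/s], power [mW], reported efficiency [pJ/b]).
TABLE_V = [
    ("This work", 48, 116.3, 2.42),
    ("[13]", 51, 321, 6.27),
    ("[15]", 16, 59.7, 3.7),
    ("[14]", 60, 173, 2.88),
    ("[12]", 25, 52, 2.1),
    ("[9]", 28, 56.6, 2.02),
    ("[7]", 32, 102, 3.19),
]


def energy_per_bit_pj(power_mw, data_rate_gbps):
    """Energy per bit in pJ/b: mW / (Gb/s) = 1e-3 W / 1e9 b/s = 1e-12 J/b."""
    return power_mw / data_rate_gbps


def max_rounding_gap(rows=TABLE_V):
    """Largest |computed - reported| efficiency over all rows, in pJ/b."""
    return max(abs(energy_per_bit_pj(p, r) - e) for _, r, p, e in rows)


if __name__ == "__main__":
    for name, rate, power, reported in TABLE_V:
        computed = energy_per_bit_pj(power, rate)
        print(f"{name:>9}: {computed:.2f} pJ/b (table: {reported})")
    print(f"largest gap: {max_rounding_gap():.3f} pJ/b")
```

The residual gaps are small and consistent with rounding in the reported figures.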

# Chapter 6

## Conclusions

Starting with the basic concepts of the clocking system in SerDes and the related concerns, a DPLL-based clock driver is proposed. The presented clock driver covers output frequencies from 0.82 GHz to 4.1 GHz, achieving an FTR of 133%. A three-mode-switching LC resonator provides the wide FTR together with low phase noise. Two resonators form a transformer whose inductance is adjusted by in-phase and out-of-phase coupling. By employing the 8-shaped inductor structure, three LC resonators are stacked in one area. The implemented mode-switching LC oscillator shows a phase noise of -118.5 dBc/Hz to -124.7 dBc/Hz, achieving FoM and FoM<sub>T</sub> of 173.5 dBc/Hz to 181.5 dBc/Hz and 196 dBc/Hz to 204 dBc/Hz, respectively. In addition, the clock driver achieves an RMS jitter of 84.64 fs at a 4 GHz output clock frequency, corresponding to an FoM<sub>RMS</sub> of -249.1 dB. Moreover, to satisfy the frequency settling time requirement, the clock driver exploits the FFT algorithm to obtain frequency error information. An analysis of the correlation between the FFT outputs and the frequencies of the reference clock and the oscillator clock shows how the FFT algorithm performs frequency calibration. Compared to the conventional DPLL employing a TDC, the proposed clock driver reduces the frequency acquisition time significantly through coarse frequency tuning by a 32-point FFT and fine frequency tuning by a conventional TDC. It shortens the settling time from 2.27 ms without the proposed FFT algorithm to 0.99 µs, verifying fast frequency acquisition.
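Two of the figures above can be reproduced with simple arithmetic, and the idea of coarse frequency estimation from a 32-point transform can be illustrated generically. The sketch below computes the FTR relative to the center frequency and then estimates the frequency of a test tone from the strongest bin of a 32-point DFT. The test tone, the sample rate, and all function names are assumptions made for illustration. The sampling arrangement and calibration logic of the actual clock driver are not reproduced here.

```python
import cmath
import math


def tuning_range_percent(f_min, f_max):
    """Frequency tuning range relative to the center frequency, in percent."""
    return 100.0 * (f_max - f_min) / ((f_max + f_min) / 2.0)


def dft(samples):
    """Direct N-point DFT (O(N^2)); an FFT returns the same values faster."""
    n = len(samples)
    return [
        sum(x * cmath.exp(-2j * math.pi * m * k / n) for k, x in enumerate(samples))
        for m in range(n)
    ]


def coarse_frequency(samples, sample_rate):
    """Frequency of the strongest non-DC bin; resolution is sample_rate / N."""
    n = len(samples)
    spectrum = dft(samples)
    # Real input: only bins 1 .. N/2 - 1 carry independent information.
    peak_bin = max(range(1, n // 2), key=lambda m: abs(spectrum[m]))
    return peak_bin * sample_rate / n


if __name__ == "__main__":
    print(f"FTR = {tuning_range_percent(0.82, 4.1):.1f} %")
    print(f"settling-time ratio = {2.27e-3 / 0.99e-6:.0f}x")
    # Hypothetical test tone: 5 cycles in a 32-sample window.
    tone = [math.cos(2 * math.pi * 5 * k / 32) for k in range(32)]
    print(f"coarse estimate = {coarse_frequency(tone, sample_rate=32.0)}")
```

With only 32 points the estimate is quantized to one bin width, which is why a coarse step of this kind is naturally followed by a finer one, as the fine tuning by the TDC is in the text.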

Next, a 14-28 Gb/s reference-less Baud-rate CDR is described. Its frequency acquisition exploits the stochastic characteristics of random data patterns. Before implementation in the CDR, consecutive 3-bit data patterns are collected at various frequencies and phases and represented as a histogram over the patterns from 000 to 111. The weights are then computed from these probabilities using Bayes' theorem. An analysis of the frequency and phase gain curves of the proposed integrator-based CDR, obtained from the pattern probabilities and the computed weights, demonstrates the robustness and wide capture range of the proposed PFD. The prototype chip, fabricated in 28-nm CMOS technology, validates the functionality and effectiveness of the stochastic methodology, achieving a BER of less than 10<sup>-12</sup> and an energy efficiency of 1.06 pJ/b at 28 Gb/s.
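The step from pattern histograms to weights can be sketched as follows. The example assumes two hypothetical conditions, labeled "fast" and "slow", each with its own histogram of 3-bit patterns. For every pattern it applies Bayes' theorem and uses the difference of the two posteriors as the weight. The condition labels, the histogram counts, and the choice of posterior difference as the weight are illustrative assumptions, not the exact formulation used in the dissertation.

```python
# All eight consecutive 3-bit patterns, "000" through "111".
PATTERNS = [format(i, "03b") for i in range(8)]


def bayes_weights(hist_fast, hist_slow, prior_fast=0.5):
    """Weight per pattern = P(fast | pattern) - P(slow | pattern).

    hist_fast / hist_slow map a pattern string to its observed count under
    the (hypothetical) "fast" and "slow" conditions.
    """
    total_fast = sum(hist_fast.values())
    total_slow = sum(hist_slow.values())
    weights = {}
    for pattern in PATTERNS:
        like_fast = hist_fast.get(pattern, 0) / total_fast  # P(pattern | fast)
        like_slow = hist_slow.get(pattern, 0) / total_slow  # P(pattern | slow)
        evidence = like_fast * prior_fast + like_slow * (1.0 - prior_fast)
        if evidence == 0.0:
            weights[pattern] = 0.0  # pattern never observed: no information
            continue
        post_fast = like_fast * prior_fast / evidence  # Bayes' theorem
        weights[pattern] = 2.0 * post_fast - 1.0  # post_fast - post_slow
    return weights


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    fast = {"010": 30, "101": 30, "000": 20, "111": 20}
    slow = {"010": 10, "101": 10, "000": 40, "111": 40}
    for pattern, weight in bayes_weights(fast, slow).items():
        print(pattern, f"{weight:+.3f}")
```

A pattern that is equally likely under both conditions receives a weight of zero, so only patterns that discriminate between the conditions contribute to the detector output.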

Finally, a 48 Gb/s receiver with a pre-cursor adjustable Baud-rate PD (BRPD) for multi-level signaling is presented. The proposed BRPD achieves phase detection by utilizing the cursor ratio and $V_{Dlev}$ obtained through the UDA algorithm. While the UDA algorithm, with $V_{Dlev}$ obtained by adjusting the up and down coefficients, enhances the DFE adaptation performance in the presence of $h_{-1}$, the proposed BRPD offers a unique lock point when combined with the DFE. Furthermore, the proposed BRPD adjusts the value of $h_{-1}$ according to the externally provided ratio of $h_0$ to $h_{-1}$, which yields a consistent bit error rate. The receiver reduces circuit complexity by using a merged summer in the DFE, which employs the RZ sampler data output instead of the conventional NRZ data output produced by RS latches. The receiver is implemented in 40-nm CMOS technology and achieves an energy efficiency of 2.42 pJ/b at 48 Gb/s with PAM-4 signaling.
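The notion of a lock point set by a target ratio of $h_0$ to $h_{-1}$ can be illustrated with a toy model. The sketch below assumes a hypothetical single-bit pulse response, takes $h_0$ as the pulse value at the sampling phase and $h_{-1}$ as the value one UI earlier, and searches by bisection for the phase at which $h_{-1} = h_0 / \text{ratio}$. The pulse shape, the error definition, and the function names are assumptions chosen for illustration. They do not model the measured channel or the actual detector logic.

```python
import math


def pulse(t_ui):
    """Toy pulse response (hypothetical), peaking with value 1 at t = 1 UI."""
    return t_ui * math.exp(1.0 - t_ui) if t_ui > 0 else 0.0


def phase_error(phase_ui, h0_over_hm1):
    """Positive when the pre-cursor is larger than the target (sampling late)."""
    h0 = pulse(phase_ui)         # main cursor at the sampling phase
    hm1 = pulse(phase_ui - 1.0)  # pre-cursor: pulse value one UI earlier
    return hm1 - h0 / h0_over_hm1


def find_lock_phase(h0_over_hm1, lo=1.0, hi=3.0, iterations=60):
    """Bisection search; assumes the error changes sign once in [lo, hi]."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if phase_error(mid, h0_over_hm1) > 0:
            hi = mid  # pre-cursor too large: sample earlier
        else:
            lo = mid  # pre-cursor too small: sample later
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    target = 4.0  # hypothetical externally provided h0 / h-1 ratio
    phase = find_lock_phase(target)
    print(f"lock phase = {phase:.4f} UI")
    print(f"h0 / h-1 at lock = {pulse(phase) / pulse(phase - 1.0):.3f}")
```

For this toy pulse the error is zero at exactly one phase in the search interval, so the target ratio selects a single lock point. Changing the target ratio moves that lock point and with it the value of $h_{-1}$.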

# **Bibliography**

- [1] R. Balodis and I. Opmane, "Reflections on the History of Computing," *IFIP Advances in Information and Communication Technology*, vol. 387, Springer, Berlin, Heidelberg, pp. 180-203.
- [2] F. Q. Kareem et al., "A survey of optical fiber communications: challenges and processing time influences." Asian Journal of Research in Computer Science, Apr, 2021, pp. 48-58.
- [3] ISSCC, "2022 Press Kit," *Online* (accessed: Dec. 05, 2022), https://www.isscc.org/past-conferences (2022).
- [4] A. Yanes, "Announcing the PCIe 7.0 Specification: Doubling the Data Rate to 128 GT/s for the Next Generation of Computing", PCI-SIG Developers Conference 2022, June, 2022.
- [5] IEEE 802.3 Ethernet Working Group, Online (accessed: Dec. 05, 2022) http://www.ieee802.org/3
- [6] InfiniBand Trade Association, Online (accessed: Dec. 05, 2022) http://www.infinibandta.org
- [7] INCITS Fibre Channel Technical Committee (T11), *online* (accessed: Dec. 05, 2022) http://www.incits.org/committees/t11

- [8] W. J. Dally and J. W. Poulton, "*Digital Systems Engineering*", Cambridge University Press, 1998.
- [9] J. Kim et al., "A 16-to-40Gb/s quarter-rate NRZ/PAM4 dual-mode transmitter in 14nm CMOS," 2015 IEEE International Solid-State Circuits Conference - (ISSCC) Digest of Technical Papers, San Francisco, CA, USA, 2015, pp. 1-3.
- [10] T. Beukema et al., "A 6.4-Gb/s CMOS SerDes core with feed-forward and decision-feedback equalization," in IEEE Journal of Solid-State Circuits, vol. 40, no. 12, pp. 2633-2645, Dec. 2005.
- [11] B. Miller and B. Conley, "A multiple modulator fractional divider," 44th Annual Symposium on Frequency Control, Baltimore, MD, USA, 1990, pp. 559-568.
- [12] T. A. D. Riley, M. A. Copeland and T. A. Kwasniewski, "Delta-sigma modulation in fractional-N frequency synthesis," in IEEE Journal of Solid-State Circuits, vol. 28, no. 5, pp. 553-559, May 1993.
- [13] A. Lacaita, S. Levantino, and C. Samori, "Integrated Frequency Synthesizers for Wireless Systems", Cambridge University Press, 2007.
- [14] S. Gupta, "On Optimum Digital Phase-Locked Loops," in IEEE Transactions on Communication Technology, vol. 16, no. 2, pp. 340-344, April 1968.
- [15] P. Westlake, "Digital Phase Control Techniques," in IRE Transactions on Communications Systems, vol. 8, no. 4, pp. 237-246, December 1960.

- [16] R. B. Staszewski, Chih-Ming Hung, K. Maggio, J. Wallberg, D. Leipold and P. T. Balsara, "All-digital phase-domain TX frequency synthesizer for Bluetooth radios in 0.13-µm CMOS," *IEEE Int. Solid-State Circuits Conference (ISSCC) Dig. Tech. Papers*, Feb. 2004, pp. 272-52.
- [17] A. A. Abidi, "Phase Noise and Jitter in CMOS Ring Oscillators," in IEEE Journal of Solid-State Circuits, vol. 41, no. 8, pp. 1803-1816, Aug. 2006.
- [18] H. Song, D. -S. Kim, D. -H. Oh, S. Kim and D. -K. Jeong, "A 1.0–4.0-Gb/s All-Digital CDR With 1.0-ps Period Resolution DCO and Adaptive Proportional Gain Control," in IEEE Journal of Solid-State Circuits, vol. 46, no. 2, pp. 424-434, Feb. 2011.
- [19] T. Olsson and P. Nilsson, "A digitally controlled PLL for SoC applications," in IEEE Journal of Solid-State Circuits, vol. 39, no. 5, pp. 751-760, May 2004.
- [20] B. Razavi, "Design of Analog CMOS Integrated Circuits", *McGRAW-HILL*, 2015, pp. 618-630.
- [21] D. Murphy and H. Darabi, "A 27-GHz Quad-Core CMOS Oscillator With No Mode Ambiguity," *in IEEE Journal of Solid-State Circuits*, vol. 53, no. 11, pp. 3208-3216, Nov. 2018.
- [22] C. Hogge, "A self correcting clock recovery circuit," in Journal of Lightwave Technology, vol. 3, no. 6, pp. 1312-1314, December 1985.

- [23] J. D. H. Alexander, "Clock recovery from random binary signals," Electronics Letters, vol. 11, no. 22, pp. 541-542, 1975.
- [24] K. Mueller and M. Muller, "Timing Recovery in Digital Synchronous Data Receivers," in IEEE Transactions on Communications, vol. 24, no. 5, pp. 516-531, May 1976.
- [25] R. Dokania et al., "A 5.9pJ/b 10Gb/s serial link with unequalized MM-CDR in 14nm tri-gate CMOS," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, 2015, pp. 1-3. M. Kossel et al., "A 10 Gb/s 8-Tap 6b 2-PAM/4-PAM Tomlinson-Harashima Precoding Transmitter for Future Memory-Link Applications in 22-nm SOI CMOS," *IEEE J. Solid-State Circuits*, vol. 48, no. 12, pp. 3268-3284, Dec. 2013.
- [26] M. -C. Choi, H. -G. Ko, J. Oh, H. -Y. Joo, K. Lee and D. -K. Jeong, "A 0.1pJ/b/dB 28-Gb/s Maximum-Eye Tracking, Weight-Adjusting MM CDR and Adaptive DFE with Single Shared Error Sampler," in *IEEE Symp. VLSI Circuits*, 2020, pp. 1-2.
- [27] F. Spagna et al., "A 78mW 11.8Gb/s serial link transceiver with adaptive RX equalization and Baud-rate CDR in 32nm CMOS," in *IEEE Int. Solid-State Circuits Conf. (ISSCC)*, 2010, pp. 366-367.
- [28] W. Deng et al., "An 8.2-to-21.5 GHz Dual-Core Quad-Mode Orthogonal-Coupled VCO with Concurrently Dual-Output using Parallel 8-Shaped Resonator," 2021 IEEE Custom Integrated Circuits Conference (CICC), Austin, TX, USA, 2021, pp. 1-2.

- [29] L. Fanori, T. Mattsson and P. Andreani, "21.6 A 2.4-to-5.3GHz dual-core CMOS VCO with concentric 8-shaped coils," 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), San Francisco, CA, USA, 2014, pp. 370-371.
- [30] Y. Shu, H. J. Qian and X. Luo, "A 2-D Mode-Switching Quad-Core Oscillator Using E-M Mixed-Coupling Resonance Boosting," in IEEE Journal of Solid-State Circuits, vol. 56, no. 6, pp. 1711-1721, June 2021.
- [31] S. M. Dartizio et al., "A Fractional-N Bang-Bang PLL Based on Type-II Gear Shifting and Adaptive Frequency Switching Achieving 68.6 fs-rms-Total-Integrated-Jitter and 1.56 µs-Locking-Time," in IEEE Journal of Solid-State Circuits, vol. 57, no. 12, pp. 3538-3551, Dec. 2022.
- [32] C. -H. Tsai, Z. Zong, F. Pepe, G. Mangraviti, J. Craninckx and P. Wambacq, "Analysis of a 28-nm CMOS Fast-Lock Bang-Bang Digital PLL With 220fs RMS Jitter for Millimeter-Wave Communication," in IEEE Journal of Solid-State Circuits, vol. 55, no. 7, pp. 1854-1863, July 2020.
- [33] L. Bertulessi, L. Grimaldi, D. Cherniak, C. Samori and S. Levantino, "A low-phase-noise digital bang-bang PLL with fast lock over a wide lock range," 2018 IEEE International Solid - State Circuits Conference - (ISSCC), San Francisco, CA, USA, 2018, pp. 252-254.

- [34] A. Santiccioli et al., "A 66-fs-rms Jitter 12.8-to-15.2-GHz Fractional-N Bang–Bang PLL With Digital Frequency-Error Recovery for Fast Locking," in IEEE Journal of Solid-State Circuits, vol. 55, no. 12, pp. 3349-3361, Dec. 2020.
- [35] Cooley, James W., and John W. Tukey. "An algorithm for the machine calculation of complex Fourier series." Mathematics of computation 19.90 (1965): 297-301.
- [36] R. B. Staszewski, S. Vemulapalli, P. Vallur, J. Wallberg and P. T. Balsara, "1.3 V 20 ps time-to-digital converter for frequency synthesis in 90-nm CMOS," in IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 53, no. 3, pp. 220-224, March 2006.
- [37] K. Park, M. Shim, H. -G. Ko, B. Nikolić and D. -K. Jeong, "Design Techniques for a 6.4–32-Gb/s 0.96-pJ/b Continuous-Rate CDR With Stochastic Frequency–Phase Detector," in IEEE Journal of Solid-State Circuits, vol. 57, no. 2, pp. 573-585, Feb. 2022.
- [38] K. Park et al., "A 4–20-Gb/s 1.87-pJ/b Continuous-Rate Digital CDR Circuit With Unlimited Frequency Acquisition Capability in 65-nm CMOS," in IEEE Journal of Solid-State Circuits, vol. 56, no. 5, pp. 1597-1607, May 2021.
- [39] W. Rahman et al., "A 22.5-to-32-Gb/s 3.2-pJ/b Referenceless Baud-Rate Digital CDR With DFE and CTLE in 28-nm CMOS," in IEEE Journal of Solid-State Circuits, vol. 52, no. 12, pp. 3517-3531, Dec. 2017.

- [40] T. Masuda et al., "A 12 Gb/s 0.9 mW/Gb/s Wide-Bandwidth Injection-Type CDR in 28 nm CMOS With Reference-Free Frequency Capture," in IEEE Journal of Solid-State Circuits, vol. 51, no. 12, pp. 3204-3215, Dec. 2016.
- [41] Y. -S. Lee, W. -H. Ho and W. -Z. Chen, "A 25-Gb/s, 2.1-pJ/bit, Fully Integrated Optical Receiver With a Baud-Rate Clock and Data Recovery," in IEEE Journal of Solid-State Circuits, vol. 54, no. 8, pp. 2243-2254, Aug. 2019.
- [42] N. Qi et al., "A 51Gb/s, 320mW, PAM4 CDR with baud-rate sampling for high-speed optical interconnects," 2017 IEEE Asian Solid-State Circuits Conference (A-SSCC), Seoul, Korea (South), 2017, pp. 89-92.
- [43] K. Lee, W. Jung, H. Ju, J. Lee and D. -K. Jeong, "A 48 Gb/s PAM4 receiver with Baud-rate phase-detector for multi-level signal modulation in 40 nm CMOS," 2021 IEEE Asian Solid-State Circuits Conference (A-SSCC), Busan, Korea, Republic of, 2021, pp. 1-3.
- [44] H. Ju, K. Lee, K. Park, W. Jung and D. -K. Jeong, "Design Techniques for 48-Gb/s 2.4-pJ/b PAM-4 Baud-Rate CDR With Stochastic Phase Detector," in IEEE Journal of Solid-State Circuits, vol. 57, no. 10, pp. 3014-3024, Oct. 2022.

- [45] ISCAS.

- [46] A. A. Hafez, M. Chen and C. K. Yang, "A 32-to-48Gb/s serializing transmitter using multiphase sampling in 65nm CMOS," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2013, pp. 38-39.
- [47] M. Chen and C. K. Yang, "A 50–64 Gb/s Serializing Transmitter With a 4-Tap, LC-Ladder-Filter-Based FFE in 65 nm CMOS Technology," *IEEE J. Solid-State Circuits*, vol. 50, no. 8, pp. 1903-1916, Aug. 2015.
- [48] K. Park, W. Bae, J. Lee, J. Hwang and D. Jeong, "A 6.7–11.2 Gb/s, 2.25 pJ/bit, Single-Loop Referenceless CDR With Multi-Phase, Oversampling PFD in 65-nm CMOS," *IEEE J. Solid-State Circuits*, vol. 53, no. 10, pp. 2982-2993, Oct. 2018.
- [49] D. Yoo, M. Bagherbeik, W. Rahman, A. Sheikholeslami, H. Tamura and T. Shibasaki, "A 30Gb/s 2x Half-Baud-Rate CDR," in *IEEE Custom Integr. Circuits Conf. (CICC)*, 2019, pp. 1-4.
- [50] S. Son, S. Ryu, H. Yeo and J. Kim, "A 2x Blind Oversampling FSE Receiver With Combined Adaptive Equalization and Infinite-Range Timing Recovery," *IEEE J. Solid-State Circuits*, vol. 54, no. 10, pp. 2823-2832, Oct. 2019.
- [51] F. Spagna et al., "A 78mW 11.8Gb/s serial link transceiver with adaptive RX equalization and Baud-rate CDR in 32nm CMOS," in *IEEE Int. Solid-State Circuits Conf. (ISSCC)*, 2010, pp. 366-367.
- [52] Y. Lee, W. Ho and W. Chen, "A 25-Gb/s, 2.1-pJ/bit, Fully Integrated Optical Receiver With a Baud-Rate Clock and Data Recovery," *IEEE J. Solid-State Circuits*, vol. 54, no. 8, pp. 2243-2254, Aug. 2019.

- [53] J. Han, Y. Lu, N. Sutardja, K. Jung and E. Alon, "Design Techniques for a 60 Gb/s 173 mW Wireline Receiver Frontend in 65 nm CMOS Technology," *IEEE J. Solid-State Circuits*, vol. 51, no. 4, pp. 871-880, April 2016.
- [54] P. A. Francese et al., "A 16 Gb/s 3.7 mW/Gb/s 8-Tap DFE Receiver and Baud-Rate CDR With 31 kppm Tracking Bandwidth," *IEEE J. Solid-State Circuits*, vol. 49, no. 11, pp. 2490-2502, Nov. 2014.
- [55] T. Shibasaki et al., "A 56Gb/s NRZ-electrical 247mW/lane serial-link transceiver in 28nm CMOS," in *IEEE Int. Solid-State Circuits Conf. (ISSCC)*, 2016, pp. 64-65.
- [56] J. Lee, K. Lee, H. Kim, B. Kim, K. Park and D. Jeong, "A 0.1-pJ/b/dB 1.62-to-10.8-Gb/s Video Interface Receiver With Jointly Adaptive CTLE and DFE Using Biased Data-Level Reference," *IEEE J. Solid-State Circuits*, vol. 55, no. 8, pp. 2186-2195, Aug. 2020.
- [57] V. Stojanovic et al., "Autonomous dual-mode (PAM2/4) serial link transceiver with adaptive equalization and data recovery," *IEEE J. Solid-State Circuits*, vol. 40, no. 4, pp. 1012-1026, April 2005.
- [58] K. Zheng et al., "An Inverter-Based Analog Front End for a 56 Gb/s PAM4 Wireline Transceiver in 16 nm CMOS," in *IEEE Symp. VLSI Circuits*, 2018, pp. 269-270.

- [59] K. Lee et al., "An Adaptive Offset Cancellation Scheme and Shared-Summer Adaptive DFE for 0.068 pJ/b/dB 1.62-to-10 Gb/s Low-Power Receiver in 40 nm CMOS," *IEEE Trans. Circuits and Sys. II: Exp. Briefs*, vol. 68, no. 2, pp. 622-626, Feb. 2021.
- [60] W. Jung, K. Lee, K. Park, H. Ju, J. Lee and D. -K. Jeong, "A 48 Gb/s PAM-4 Receiver With Pre-Cursor Adjustable Baud-Rate Phase Detector in 40 nm CMOS," in IEEE Journal of Solid-State Circuits, 2022.

# Abstract (Korean)

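This dissertation briefly describes the clocking system in SerDes and its common issues. It proposes a phase-locked loop (PLL)-based clock driver for clock generation in the transmitter and a clock and data recovery (CDR) circuit in the receiver. It proposes a wide frequency tuning range LC resonator for frequency synthesis, which reaches the target frequency quickly. For reference-less operation, a stochastic frequency acquisition scheme is implemented in a Baud-rate CDR. In addition, this dissertation presents a Baud-rate CDR that uses a reference clock and realizes pulse amplitude modulation (PAM)-4 signaling.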

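First, a digital PLL (DPLL)-based clock driver with a wide frequency tuning range (FTR) LC oscillator is introduced. The clock driver adopts an 8-shaped inductor structure to implement three mode switchings for a wide FTR in one compact area. The analysis demonstrates the compact stacked-inductor layout. In addition, the clock driver achieves fast frequency acquisition by using a fast Fourier transform (FFT) algorithm, greatly reducing the lock time compared with a conventional PLL that uses a bang-bang phase and frequency detector (BB-PFD) and a time-to-digital converter (TDC). The implemented clock driver is fabricated in 40-nm CMOS technology, verifying low jitter, wide FTR, and fast frequency acquisition. The presented LC oscillator achieves a phase noise of -118.5 dBc/Hz to -124.7 dBc/Hz, achieving FoM and FoM<sub>T</sub> of 173.5 dBc/Hz to 181.5 dBc/Hz and 196 dBc/Hz to 204 dBc/Hz, respectively. The clock driver generates clock frequencies ranging from 0.82 to 4.1 GHz, achieving an FTR of 133%. It achieves a root mean square (RMS) jitter of 84.64 fs at a 4 GHz output clock frequency, corresponding to an FoM<sub>RMS</sub> of -249.1 dB. Furthermore, the frequency acquisition time, which takes 2.27 ms with the conventional technique, is shortened to 0.99 μs, demonstrating fast frequency acquisition.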

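As the second implementation, this dissertation proposes a 14-28 Gb/s reference-less Baud-rate CDR that uses a stochastic phase and frequency detector (PFD). The PFD, whose optimal weights are obtained by exploiting the histogram-based correlation of various data patterns, provides both phase and frequency detection. The reference-less Baud-rate CDR uses data samples and phase error samples obtained from an integrator. The proposed CDR achieves a data rate of up to 28 Gb/s using a continuous-time linear equalizer (CTLE) over a channel with 4.7 dB loss at the Nyquist frequency. Fabricated in 28-nm CMOS technology, the proposed CDR provides a bit error rate (BER) of less than 10<sup>-12</sup> and an energy efficiency of 1.06 pJ/b.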

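The last implemented circuit is a 48 Gb/s PAM-4 receiver with a Baud-rate CDR architecture suited to multi-level signaling. By deriving the relation between the vertical eye margin and the main-cursor-to-pre-cursor ratio, the proposed Baud-rate phase detector adjusts the pre-cursor and finds the lock point at the target vertical eye opening. The Baud-rate phase detector therefore provides a unique lock point when used with an adaptive decision feedback equalizer (DFE) that removes the post-cursor h<sub>1</sub>. Otherwise, the lock point may drift, as in the conventional Mueller-Müller PD. In addition, the summer in the DFE adopts the RZ sampler output instead of the NRZ output, which incurs the additional delay of the conventional RS latches, thereby reducing the input loading of the DFE. The prototype chip, fabricated in 40-nm CMOS technology, consists of an analog front end, phase rotators, current-mode digital-to-analog converters, and synthesizable digital logic, and occupies a total active area of 0.24 mm<sup>2</sup>. The proposed PAM-4 receiver achieves a BER of less than 10<sup>-11</sup> at 48 Gb/s and an energy efficiency of 2.42 pJ/b.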

**Keywords**: Fast Fourier Transform (FFT), 8-shaped inductor, wide frequency tuning range, mode switching, phase-locked loop (PLL), clock driver, fast frequency acquisition, Baud-rate, clock and data recovery (CDR), phase and frequency detector (PFD), reference-less, receiver, stochastic, integrator, adaptive equalizer, decision feedback equalizer (DFE), merged-summer, Mueller-Müller PD, PAM-4, phase detector (PD), pre-cursor.

**Student Number:** 2019-29990