



Ph.D. Dissertation

# Design of High-Speed Multi-Phase Clock Corrector

by

Jung-Woo Sull

August 2023

Department of Electrical and Computer Engineering, College of Engineering, Seoul National University


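## Design of High-Speed Multi-Phase Clock Corrector

Advisor: 정덕균

Submitted as a dissertation for the degree of Doctor of Philosophy in Engineering

August 2023

Graduate School of Seoul National University

Department of Electrical and Computer Engineering

Jung-Woo Sull (설정우)

The Ph.D. dissertation of Jung-Woo Sull is hereby approved, August 2023.

| Role       | Name | Seal   |
|------------|------|--------|
| Chair      | 김재하 | (seal) |
| Vice Chair | 정덕균 | (seal) |
| Member     | 문용  | (seal) |
| Member     | 최우석 | (seal) |
| Member     | 박관서 | (seal) |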

## Abstract

In this dissertation, an octa-phase clock corrector that performs octa-phase error correction and duty-cycle correction at 8 GHz is proposed and examined with two prototype chips. The first prototype is an 8-GHz Octa-phase Error Corrector (OEC) employing a digital delay-locked loop (DLL) with a coprime phase comparison scheme. To alleviate the timing constraint during phase comparison, clock phases whose spacing, in units of T/8, is coprime to 8 are compared, enabling up to 64-Gb/s link operation. In particular, this prototype uses a 3T/8-spaced clock rather than a T/8-spaced one. In addition, by employing a clock-divided 5-bit selection scheme, a high-speed 8:2 multiplexer (MUX) operates seamlessly without glitches. To minimize mismatch and calibration-induced jitter, a single shared phase comparator and a finite-state machine (FSM) that tracks the minimum total delay are employed. The test chip was fabricated in 40-nm CMOS technology and occupies an active area of $0.0814 \text{ mm}^2$. The core phase calibration loop consumes 10.8 mW at 8 GHz from a 0.9-V supply and achieves a maximum residual phase error of 0.95 ps.

In addition, a second prototype is presented: an 8-GHz octa-phase clock corrector using a shared clock selector-based digital DLL. The corrector comprises two functions: an Octa-phase Error Corrector (OEC) and a Duty-Cycle Corrector (DCC). The phase error is detected via the 3T/8 delay line, and the duty-cycle error is detected by exploiting opposite-polarity edges in a differential clock, without an additional delay line. An Edge Converter (EC) is designed to match the edge propagation delay through an 8:1 MUX and an EC to achieve a high level of accuracy in duty-cycle calibration. Furthermore, to save power and area, a clock selector composed of a MUX and a logic generator is shared between the phase and duty-cycle error detection loops. The prototype chip was fabricated in 40-nm CMOS technology and occupies an active area of $0.047 \text{ mm}^2$. The total calibration power consumption of the corrector is 17.1 mW from a 1.0-V supply.

**keywords**: coprime, digitally controlled delay line (DCDL), digital delay-locked loop (DLL), duty-cycle corrector (DCC), multiplexer (MUX), octa-phase error corrector (OEC)

student number: 2018-24582

# Contents

| Section   | Title                                                                    |
|-----------|--------------------------------------------------------------------------|
|           | ABSTRACT                                                                 |
|           | CONTENTS                                                                 |
|           | LIST OF TABLES                                                           |
|           | LIST OF FIGURES                                                          |
| CHAPTER 1 | INTRODUCTION                                                             |
| 1.1       | MOTIVATION                                                               |
| 1.2       | THESIS ORGANIZATION                                                      |
| CHAPTER 2 | BACKGROUND ON CLOCK CORRECTION SCHEME                                    |
| 2.1       | OVERVIEW                                                                 |
| 2.2       | PRIOR WORKS                                                              |
| 2.2.1     | ANALOG BASED DETECTION CORRECTOR                                         |
| 2.2.2     | STATISTICAL BASED DETECTION CORRECTOR                                    |
| 2.2.3     | DIGITAL BASED DETECTION CORRECTOR                                        |
| 2.3       | BUILDING BLOCKS OF DDLL BASED CLOCK CORRECTOR                            |
| 2.3.1     | DELAY LINE                                                               |
| CHAPTER 3 | DESIGN OF THE PROPOSED OEC                                               |
| 3.1       | OVERVIEW                                                                 |
| 3.2       | PROPOSED COPRIME PHASE CORRECTION                                        |
| 3.3       | OVERALL ARCHITECTURE                                                     |
| 3.4       | CIRCUIT IMPLEMENTATIONS                                                  |
| 3.4.1     | 8:2 MUX AND SELECTION GENERATOR                                          |
| 3.4.2     | DELAY LINE                                                               |
| 3.4.3     | PHASE DETECTOR                                                           |
| 3.4.4     | LOOP FILTER                                                              |
| CHAPTER 4 | DESIGN OF THE PROPOSED CLOCK CORRECTOR                                   |
| 4.1       | OVERVIEW                                                                 |
| 4.2       | PROPOSED CLOCK CORRECTION SCHEME                                         |
| 4.3       | OVERALL ARCHITECTURE                                                     |
| 4.4       | CIRCUIT IMPLEMENTATIONS                                                  |
| 4.4.1     | EDGE CONVERTER                                                           |
| 4.4.2     | LOOP FILTER                                                              |
| 4.4.3     | CLOCK ADJUSTER                                                           |
| CHAPTER 5 | MEASUREMENT RESULTS                                                      |
| 5.1       | OVERVIEW                                                                 |
| 5.2       | AN 8-GHZ OCTA-PHASE ERROR CORRECTOR WITH COPRIME SPACING                 |
| 5.3       | AN 8-GHZ OCTA-PHASE CLOCK CORRECTOR WITH PHASE AND DUTY-CYCLE CORRECTION |
| CHAPTER 6 | CONCLUSION                                                               |
|           | BIBLIOGRAPHY                                                             |
|           | ABSTRACT (IN KOREAN)                                                     |

# **List of Tables**

| Table | Caption                                           |
|-------|---------------------------------------------------|
| 3.1   | Performance comparison with different $M$ values  |
| 5.1   | Performance summary and comparison of prototype 1 |
| 5.2   | Performance summary and comparison of prototype 2 |

# **List of Figures**

| Figure | Caption |
|--------|---------|
| 1.1  | Global data center traffic growth [1] |
| 1.2  | Global application traffic in 2023 [2] |
| 1.3  | Breakdown of internet traffic of video category [2] |
| 1.4  | IEEE P802.3ck Ethernet Task Force for over 200 Gb/s Ethernet spec [2] |
| 1.5  | DRAM bandwidth trend |
| 1.6  | PCIe bandwidth trend [5] |
| 2.1  | Per-lane transfer rate trend of interface standards [6] |
| 2.2  | Clock amplitude reduction (%) with clock period (in FO-4 delays) [7] |
| 2.3  | Per-lane sub-rate clocking transmitter trend |
| 2.4  | Clock distribution network in DRAM [33] |
| 2.5  | Clock distribution network in 4-channel PCIe PMA layer [34] |
| 2.6  | (a) Schematics of 1-UI pulse generator and waveform of 1-UI data. (b) Schematics of 4:1 MUX and output network |
| 2.7  | Simulated results of (a) ideal case, (b) with duty-cycle error case, and (c) with phase error case |
| 2.8  | Circuit diagram of a quadrature corrector [37] |
| 2.9  | Circuit diagram of the phase generator/rotator [38] |
| 2.10 | Simplified distributed DLL-based multi-phase generator |
| 2.11 | Circuit and timing diagram of phase detector implemented in the distributed DLL |
| 2.12 | Circuit diagram of the duty-cycle detector |
| 2.13 | Circuit diagram of a quadrature detector |
| 2.14 | (a) Block diagram and (b) timing diagram of a relaxation oscillator-based duty cycle detector (ICK/IBCK) [45] |
| 2.15 | Block diagram of a quadrature phase detector [45] |
| 2.16 | Uniform asynchronous DCO clock edge density and normalized delay |
| 2.17 | Block diagram of the asynchronous sampling-based measurement circuit in [58] |
| 2.18 | Concept diagram of the asynchronous sampling-based measurement circuit in [61] |
| 2.19 | Block diagram of clock measurement circuit [61] |
| 2.20 | (a) Block diagram of the asynchronous sampling-based calibration scheme in 4:1 MUX domain, and (b) schematic of 4:1 MUX |
| 2.21 | 4:1 MUX output $D_{\text{OUT}}$ with training patterns A and B. Ideal $\text{CK}_{\text{I}}/\text{CK}_{\text{Q}}$ clock alignment and misalignment due to early $\text{CK}_{\text{Q}}$ clock |
| 2.22 | 4:1 MUX output $D_{\text{OUT}}$ with training patterns A and B. Ideal $\text{CK}_{\text{I}}/\text{CK}_{\text{Q}}$ clock alignment and misalignment due to late $\text{CK}_{\text{Q}}$ clock |
| 2.23 | Block diagram of delay-line based TDC |
| 2.24 | Timing diagram of delay-line based TDC |
| 2.25 | Block diagram of Vernier delay-line based TDC |
| 2.26 | Timing diagram of Vernier delay-line based TDC |
| 2.27 | (a) Overall block diagram of the proposed all-digital synchronous DCC and (b) the timing diagram of the interpolator [54] |
| 2.28 | (a) Overall block diagram of the proposed TDC-based clock generator and (b) the timing diagram of the clock generator [54] |
| 2.29 | Overall block diagram of DDLL based QEC [36] |
| 2.30 | Overall block diagram of DDLL based QEC with minimum total delay tracking algorithm [41] |
| 2.31 | Schematic of (a) differential delay cell with R control, (b) differential delay cell with C control, (c) SCI delay cell, (d) CSI delay cell, and (e) supply voltage controlled delay cell |
| 2.32 | Schematic of CSI based delay cell with (a) MOSFET switches and (b) IDAC-based voltage control |
| 3.1  | Comparison between unit phase delay ($T_{period}/8$) and two-stage buffer delay in 40-nm CMOS technology |
| 3.2  | Power consumption comparison of delay generation methods in 40-nm CMOS technology |
| 3.3  | Timing diagram of proposed phase comparison flow for $\text{CK}_0$ and $\text{CK}_1$ where $t_{octa}$ is $T/8$ |
| 3.4  | Overall block diagram of the proposed OEC |
| 3.5  | Timing diagram of the proposed OEC |
| 3.6  | Monte Carlo simulation at various process corners |
| 3.7  | Block diagram of sequential clock selection path |
| 3.8  | Block diagram of (a) conventional 8:1 MUX slice, (b) proposed 8:1 MUX slice |
| 3.9  | (a) Block diagram of clock selection signal generation path when SEL[2:0] changes from 000 to 001 and (b) its timing diagram |
| 3.10 | (a) Block diagram of clock selection signal generation path when SEL[2:0] changes from 001 to 010 and (b) its timing diagram |
| 3.11 | (a) Block diagram of clock selection signal generation path when SEL[2:0] changes from 011 to 100 and (b) its timing diagram |
| 3.12 | (a) Block diagram of clock selection signal generation path when the MUX output changes from $\text{CK}_7$ to $\text{CK}_6$ and (b) its timing diagram |
| 3.13 | (a) Block diagram of clock selection signal generation path when the MUX output changes from $\text{CK}_4$ to $\text{CK}_3$ and (b) its timing diagram |
| 3.14 | Block diagram of divide-by-4 |
| 3.15 | MUX selection logic and truth table |
| 3.16 | (a) Block diagram of the power-up circuit configuration and (b) step sequence |
| 3.17 | Block diagram of four stages of delay line |
| 3.18 | (a) Block diagram of BBPD, and (b) schematic of arbiter |
| 3.19 | (a) Clock propagation path in calibration. (b) The timing diagram of comparison |
| 3.20 | When $phase_{err}$ is "1": (a) clock propagation path in calibration and (b) the timing diagram of comparison |
| 3.21 | When $phase_{err}$ is "0": (a) clock propagation path in calibration and (b) the timing diagram of comparison |
| 3.22 | A flow chart of DLF |
| 4.1  | Example of 1-UI pulse generation with octa-phase clock. (a) Rising-Rising. (b) Rising-Falling |
| 4.2  |  |
| 4.3  | Concept diagram of (a) phase error detection and (b) duty-cycle error detection |
| 4.4  |  |
| 4.5  | Overall block diagram of proposed octa-phase clock corrector |
| 4.6  | Schematic of (a) 8:1 MUX and (b) MUX input configuration at clock correction path |
| 4.7  | Simulation results on rising and falling edges in terms of PN ratio in 40-nm GP CMOS technology at the TT corner |
| 4.8  | Schematic of edge converter |
| 4.9  | (a) Internal clock propagation path in DCC path and (b) Monte Carlo simulation result of each path |
| 4.10 | Simulated result of phase difference behavior when $K_{OEC} = 2^{-3} \cdot K_{DCC}$ |
| 4.11 | Simulated result of phase difference behavior when $K_{OEC} = 2^0 \cdot K_{DCC}$ |
| 4.12 | Simulated result of phase difference behavior when $K_{OEC} = 2^3 \cdot K_{DCC}$ |
| 4.13 | Flow chart of DCC loop filter |
| 4.14 | Block diagrams of (a) clock control cell and (b) delay and duty control unit cell |
| 4.15 | Timing diagram of a malfunctioning case in which the clock control cell adjusts both clock edges to control duty-cycle |
| 4.16 | Post layout simulation result of the duty-cycle adjustment component in the clock control cell |
| 5.1  | Schematic of quadrature divider based eight-phase generator |
| 5.2  | (a) Chip microphotograph of the proposed OEC and (b) calibration power breakdown |
| 5.3  | Measurement setup |
| 5.4  | Measurement setting |
| 5.5  | Measurement results of the (a) delay curve of main DCDLs and (b) clock position plot before and after correction |
| 5.6  | Waveforms of (a) uncorrected eight-phase clock, (b) corrected eight-phase clock, (c) jitter histogram when calibration loop off, and (d) jitter histogram when calibration loop on |
| 5.7  | (a) Chip microphotograph of the proposed clock corrector and (b) calibration power breakdown |
| 5.8  | Schematic of PI unit slice |
| 5.9  | Schematic of output measure network |
| 5.10 | Measurement results of clock control cell, (a) delay control and (b) duty-cycle control |
| 5.11 | Measurement results of (a) uncorrected phase difference, (b) corrected phase difference, (c) uncorrected duty-cycle, and (d) corrected duty-cycle |
| 5.12 | Measurement results of the PI output. (a) DNL before correction and (b) after correction. (c) INL before correction and (d) after correction |
| 5.13 | Waveforms of RMS and peak-to-peak jitter of output clock when (a) calibration loop off and (b) calibration loop on |

### **Chapter 1**

### **INTRODUCTION**

### **1.1 MOTIVATION**

In recent years, data traffic has increased significantly due to the growth of Over-The-Top (OTT) services such as Netflix and YouTube, the proliferation of cloud computing services, and the emergence of Artificial Intelligence (AI) technologies such as Chat Generative Pre-Trained Transformer (ChatGPT). Fig. 1.1 illustrates this surge in data traffic. One of the main contributors to the growth of Internet traffic is the video services sector, which includes TV, video, and streaming downloads. According to Fig. 1.2, video services account for 65.93% of the total Internet traffic volume [2]. A more detailed analysis of Internet traffic in the video domain is presented in Fig. 1.3, which shows that Netflix holds the largest share at 14.92%, followed by YouTube at 11.61%. To manage the vast amount of Internet data generated by OTT services, cloud computing, and AI technologies, system I/O bandwidth must be increased.

Fig. 1.1. Global data center traffic growth [1].

| Rank | 2022 Category     | Total Volume |
|------|-------------------|--------------|
| 1    | Video             | 65.93%       |
| 2    | Marketplace       | 5.83%        |
| 3    | Gaming            | 5.58%        |
| 4    | Social Networking | 5.26%        |
| 5    | Cloud             | 4.98%        |
| 6    | Web Browsing      | 4.63%        |
| 7    | File Sharing      | 3.39%        |
| 8    | Messaging         | 2.30%        |
| 9    | VPN               | 1.13%        |
| 10   | Audio             | 0.95%        |

Fig. 1.2. Global application traffic in 2023 [2].

Fig. 1.3. Breakdown of internet traffic of video category [2].

Significant efforts are being made to develop high-speed electrical interfaces capable of handling large volumes of data traffic in response to the demand for increased I/O bandwidth. Among these efforts, the *IEEE P802.3ck Ethernet Task Force* [3] is actively developing 200 Gb/s and 400 Gb/s Ethernet standards to provide a solid foundation for the next generation of high-speed data communications, as shown in Fig. 1.4. Furthermore, advancements in DRAM technology and interface specifications, such as the Peripheral Component Interconnect Express (PCIe), are helping address the growing demand for increased I/O bandwidth. These advances not only result in faster memory and data transfer rates but also improve overall system performance by optimizing data processing and reducing latency. Fig. 1.5 shows the development trend of DRAM. At Samsung Tech Day 2022, Samsung unveiled its 36 Gb/s GDDR7 memory technology, which uses PAM-3 signaling to achieve high-speed data transfer rates [4]. This memory solution requires a Data Query (DQ) clock of up to 12 Gb/s/pin. Along with DRAM advancements, the PCIe standard is evolving to meet the growing demand for faster data rates. The recently announced PCIe Gen7 specification aims to deliver 128 GT/s [5], requiring a clock frequency of up to 16 GHz when applying a quarter-rate architecture.
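The 16 GHz figure follows from simple rate arithmetic, sketched below in Python. The sketch assumes PAM-4 signaling (2 bits per symbol) for the 128 GT/s link and takes "quarter-rate" to mean four symbols per clock period; neither assumption is spelled out in the text above, and the function name is illustrative only.

```python
def clock_frequency_ghz(data_rate_gbps: float,
                        bits_per_symbol: float,
                        symbols_per_clock_period: int) -> float:
    """Clock frequency (GHz) needed by a sub-rate transmitter.

    symbols_per_clock_period follows an assumed convention:
    1 = full-rate, 2 = half-rate, 4 = quarter-rate, 8 = octal-rate.
    """
    # Symbol (baud) rate: each symbol carries bits_per_symbol bits.
    symbol_rate_gbaud = data_rate_gbps / bits_per_symbol
    # Each clock period serves several symbols in a sub-rate design.
    return symbol_rate_gbaud / symbols_per_clock_period


if __name__ == "__main__":
    # 128 Gb/s per lane, assumed PAM-4, quarter-rate clocking.
    print(clock_frequency_ghz(128, 2, 4))  # 16.0
    # 64 Gb/s, assumed NRZ (1 bit per symbol), octal-rate clocking.
    print(clock_frequency_ghz(64, 1, 8))   # 8.0
```

Under the additional assumption of NRZ data, the second call reproduces the 8-GHz, 64-Gb/s octal-rate operating point quoted in the abstract.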

As I/O bandwidth continues to grow exponentially, roughly doubling every 2-3 years, it will inevitably reach a saturation point. Various strategies are being investigated to avoid this limitation. One is to use higher-order modulation schemes, such as PAM-5, PAM-6, or PAM-8, which increase the data rate without increasing the required bandwidth. Another is to increase the level of parallelism: moving from conventional half-rate or quarter-rate architectures to an octal-rate architecture (eight parallel channels) enables higher data throughput while mitigating the effects of reaching the bandwidth limit.

However, as the level of parallelism increases, so does the need for accurate clock correction and compensation, because the clock distribution network becomes more complex. This is especially important in an octal-rate architecture, where eight parallel channels require precise synchronization and alignment. Any discrepancy among the individual clock signals can degrade system performance. Given these difficulties, this thesis focuses on the investigation and development of an octa-phase clock corrector that provides a robust and reliable method of synchronizing the individual clock signals within an octal-rate system and compensating for their variations.

| Ethernet rate          | 200 Gb/s    | 400 Gb/s     | 800 Gb/s     | 800 Gb/s                 | 800 Gb/s                          | 1.6 Tb/s      | 1.6 Tb/s     |
|------------------------|-------------|--------------|--------------|--------------------------|-----------------------------------|---------------|--------------|
| Assumed signaling rate | 200 Gb/s    | 200 Gb/s     | 100 Gb/s     | 200 Gb/s                 | TBD                               | 100 Gb/s      | 200 Gb/s     |
| SMF 40 km              |             |              |              |                          | Over single SMF in each direction |               |              |
| SMF 10 km              |             |              |              |                          | Over single SMF in each direction |               |              |
| SMF 2 km               | Over 1 pair |              | Over 8 pairs | Over 4 pairs; over 4 λ's |                                   |               | Over 8 pairs |
| SMF 500 m              | Over 1 pair | Over 2 pairs | Over 8 pairs | Over 4 pairs             |                                   |               | Over 8 pairs |
| MMF 100 m              |             |              | Over 8 pairs |                          |                                   |               |              |
| MMF 50 m               |             |              | Over 8 pairs |                          |                                   |               |              |
| Cu cable               | Over 1 pair | Over 2 pairs | Over 8 pairs | Over 4 pairs             |                                   |               | Over 8 pairs |
| Backplane              |             |              | Over 8 lanes |                          |                                   |               |              |
| AUI                    | Over 1 lane | Over 2 lanes | Over 8 lanes | Over 4 lanes             |                                   | Over 16 lanes | Over 8 lanes |

Fig. 1.4. IEEE P802.3ck Ethernet Task Force for over 200 Gb/s Ethernet spec [2].







Fig. 1.5. DRAM bandwidth trend.

Fig. 1.6. PCIe bandwidth trend [5].

### **1.2 THESIS ORGANIZATION**

This thesis is organized as follows. Chapter 2 explains the background of clock correctors. Topologies for detecting clock errors, namely analog-based, statistical-based, and digital-based detection correctors, are reviewed. Delay-line topologies, an important building block of the digital DLL (DDLL), are also discussed.

Chapter 3 describes the proposed octa-phase error corrector (OEC). The overall architecture of the proposed OEC is presented, and the concept of coprime-spacing comparison, which enables high-speed operation by alleviating timing constraints, is explained. The circuit implementation is then described.

Chapter 4 presents the improved octa-phase clock corrector, which combines the OEC with duty-cycle correction (DCC). Building on the operating OEC, the concept of using the phase correction results to perform duty-cycle correction is presented. The overall architecture and circuit implementation are then described. To perform a falling-edge-to-rising-edge comparison, an edge converter is proposed, and its simulation result is included.

Chapter 5 presents the measurement results of the fabricated chips described in Chapters 3 and 4. The power consumption of the proposed clock correctors is measured and compared with prior works. The measurement setup and the on-chip measurement circuits, such as the eight-phase generator and the CML phase interpolator (PI) used to verify the phase-corrected result, are also presented.

Finally, Chapter 6 summarizes the proposed works and concludes this thesis.

### **Chapter 2**

### **BACKGROUND ON CLOCK CORRECTION SCHEME**

### **2.1 OVERVIEW**

The rapid increase in global data traffic has necessitated a corresponding increase in the data bandwidth capacity of wireline interfaces. However, the potential for the growth of these systems is limited by factors such as physical dimensions and bandwidth limitations. As a result, modern interface standards have shifted away from low-speed multi-lane configurations and toward higher per-lane data rates and Pulse Amplitude Modulation with four levels (PAM-4).

As shown in Fig. 2.1 [6], the increased demand for high per-lane data rates manifests itself in a variety of wireline applications, including memory systems, backplanes, rack-to-rack connections, and local area networks (LANs). In particular, the



Fig. 2.1. Per-lane transfer rate trend of interface standards [6].

demanding transfer rates in these applications have nearly doubled every four years, moving from slow Double Data Rate (DDR) DRAM interfaces to fast Common Electrical I/O (CEI) interfaces. However, the operating speed is constrained by inherent limitations of the fabrication process. Fig. 2.2 shows the clock amplitude reduction ratio as the clock frequency increases, implying that the maximum achievable bit rate $R_{b,max}$ of the process is [7],

$$R_{b,max} = \frac{1}{8 \cdot \text{FO-4}} \tag{2.1}$$

where the FO-4 delay for the process can be roughly approximated by [8],

$$\text{FO-4} = \frac{\text{channel length [nm]}}{2} \ \text{ps} \tag{2.2}$$
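As a quick numerical check of (2.1) and (2.2), the following sketch evaluates the rule-of-thumb FO-4 delay and the resulting bit-rate limit for a few channel lengths. The function names and the list of nodes are illustrative assumptions; only the 40-nm entry corresponds to the technology used in this work.

```python
def fo4_delay_ps(channel_length_nm: float) -> float:
    """Rule-of-thumb FO-4 delay from (2.2): channel length in nm divided by 2, in ps."""
    return channel_length_nm / 2.0


def max_bit_rate_gbps(channel_length_nm: float) -> float:
    """Bit-rate limit from (2.1): 1 / (8 * FO-4), converted from 1/ps to Gb/s."""
    return 1e3 / (8.0 * fo4_delay_ps(channel_length_nm))


# Illustrative nodes; 40 nm is the process used for the prototype chips.
for node_nm in (65, 40, 28):
    print(f"{node_nm} nm: FO-4 = {fo4_delay_ps(node_nm):.1f} ps, "
          f"R_b,max = {max_bit_rate_gbps(node_nm):.2f} Gb/s")
```

Under this approximation, a 40-nm process has an FO-4 delay of 20 ps and a limit of 6.25 Gb/s, far below the per-lane rates in Fig. 2.1.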



Fig. 2.2. Clock amplitude reduction (%) with clock period (in FO-4 delays) [7].

To overcome the speed penalty imposed by process limitations, parallelism has been adopted. In Fig. 2.3, the trend of per-lane sub-rate clocking transmitters is plotted [9–31]. To date, the literature has reported half-rate architectures with speeds of up to 112 Gb/s [17] and quarter-rate architectures with speeds of up to 224 Gb/s [26]. Furthermore, the ongoing development of the Ethernet standard at 200 Gb/s and 400 Gb/s [3] calls for further candidates to overcome the speed limit. For example, higher-order modulation, such as PAM-5/6/8, or an octal-rate clocking architecture is being considered. In the GDDR interface, for example, octa-data rate (ODR) is mentioned as a potential candidate [32]. However, previously published papers implementing an octa-data rate system operate only at a few GHz [29–31].

The main design constraint in sub-rate clocking architecture is variation along the clock distribution network. Clock signals may be influenced by factors such as supply and ground noise, imbalanced PMOS and NMOS strengths, and process, voltage, and temperature (PVT) variations as they traverse the extensive clock distribution tree



Fig. 2.3. Per-lane sub-rate clocking transmitter trend.

through numerous clock buffers. The clock distribution networks of a DRAM and of a peripheral component interconnect express (PCIe) physical medium attachment (PMA) layer are presented in Fig. 2.4 [33] and Fig. 2.5 [34], respectively. The skews induced by the distribution network degrade link performance, including the bit-error-rate (BER) performance. For example, in a quarter-rate transmitter, distortion of the clock signals affects the valid data window. Fig. 2.6(a) illustrates the generation of a 1-UI pulse in the 4:1 multiplexing process. The data D<sub>0</sub> is carved into a 1-UI-wide pulse by AND gating D<sub>0</sub>, CLK<sub>90</sub>, and CLK<sub>180</sub>. Since data carving uses the high times of both CLK<sub>90</sub> and CLK<sub>180</sub>, phase and duty-cycle skews directly affect the data window. Simulated results of the 4:1 MUX driver at the output node are shown in Fig. 2.7. With a 7-GHz clock, the achieved eye height is 518 mV and the horizontal opening is 34.7 ps in the ideal case with no phase or duty-cycle skew, as shown in Fig. 2.7(a). Fig. 2.7(b) and (c) show the cases with a duty-cycle error and a phase skew error, respectively. In each case, the vertical eye closes by about 20 mV and the horizontal eye closes by about 2 to 3 ps. Hence, a local clock corrector is essential to compensate for distribution-induced mismatches in the system, and various types of clock correctors have recently been used to perform both duty-cycle and phase compensation [25, 35–58]. The circuits reported in the literature for measuring the quality of high-speed clock signals can be divided into three categories, which are discussed in the following section: 1) analog-based detection and correction circuits, 2) statistical detection circuits, and 3) digital-based detection and correction circuits.
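To make the dependence of the carved data window on clock skew concrete, the sketch below computes the overlap of the high times of CLK<sub>90</sub> and CLK<sub>180</sub> from Fig. 2.6(a). It assumes an idealized model in which the pulse opens at the rising edge of CLK<sub>180</sub> and closes at the falling edge of CLK<sub>90</sub>; the function name and the error values are hypothetical and are not taken from the simulations in Fig. 2.7.

```python
def carved_pulse_width_ps(f_clk_ghz, skew90_ps=0.0, skew180_ps=0.0, duty90=0.5):
    """Overlap of the high times of CLK90 and CLK180, in ps.

    Assumed model: the pulse opens at the rising edge of CLK180 and
    closes at the falling edge of CLK90.
    """
    period = 1e3 / f_clk_ghz                 # clock period in ps
    rise90 = period / 4 + skew90_ps          # rising edge of CLK90
    fall90 = rise90 + duty90 * period        # falling edge of CLK90
    rise180 = period / 2 + skew180_ps        # rising edge of CLK180
    return fall90 - rise180


ideal = carved_pulse_width_ps(7.0)
print(f"ideal:              {ideal:.2f} ps")
print(f"CLK90 2 ps early:   {carved_pulse_width_ps(7.0, skew90_ps=-2.0):.2f} ps")
print(f"CLK90 duty of 48%:  {carved_pulse_width_ps(7.0, duty90=0.48):.2f} ps")
```

In this model, a phase skew or a duty-cycle error on either clock changes the pulse width by the same amount of time, which is why both errors must be corrected.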



Fig. 2.4. Clock distribution network in DRAM [33].



Fig. 2.5. Clock distribution network in 4-channel PCIe PMA layer [34].



Fig. 2.6. (a) Schematics of 1-UI pulse generator and waveform of 1-UI data. (b) Schematics of 4:1 MUX and output network.



Fig. 2.7. Simulated results of (a) ideal case, (b) with duty-cycle error case, and (c) with phase error case.

#### 2.2 PRIOR WORKS

#### 2.2.1 ANALOG BASED DETECTION CORRECTOR

Several multi-phase correction schemes have been proposed in the analog domain [37], [38], [25], [35], [45], [47], [48]. The open-loop scheme based on active polyphase filtering was found to have a relatively low RMS jitter contribution [37, 38]. However,



Fig. 2.8. Circuit diagram of a quadrature corrector [37].



Fig. 2.9. Circuit diagram of the phase generator/rotator [38].



Fig. 2.10. Simplified distributed DLL-based multi-phase generator.



Fig. 2.11. Circuit and timing diagram of phase detector implemented in the distributed DLL.

due to the open-loop nature of the system, a residual phase error remains after error correction and cannot be controlled. In [35], an analog distributed delay-locked loop (DLL) that performs quadrature error correction (QEC) is presented, as shown in Fig. 2.10. It employs a phase detector (PD) consisting of a NOR gate and an integrator to convert the time difference between the clocks into the voltage domain. The circuit diagram of the phase detector and its operating concept are shown in Fig. 2.11. Prior DLLs control every delay line cell with an identical voltage or code [59], [60], which results in a larger resolution step that degrades the DLL clock performance. Adopting a separate PD to control each delay line cell output individually therefore reduces the resolution step and offers a small RMS jitter. However, although the distributed DLL provides a small RMS jitter, the residual phase error is non-negligible due to the mismatch among the error detection circuits in the control loops.

The clock error detection methodology presented in [25], [47] uses a low-pass filter (LPF) and a comparator to perform time-to-voltage conversion and detect the error. For duty-cycle error detection, the average voltage of each clock is extracted by the LPF and compared by the comparator for each differential pair, ($CK_0$ and $CK_{180}$) and ($CK_{90}$ and $CK_{270}$). The circuit diagram of the duty-cycle detector is shown in Fig. 2.12. The high times of the differential clocks $CK_0$ and $CK_{180}$ are filtered into the voltages $V_1$ and $V_2$. The correction loop works to equalize the two voltages, resulting in a differential pair with a 50% duty cycle. Similarly, an additional XOR gate is included in the phase error detection mechanism to generate a pulse representing the quadrature phase difference. For example, as in Fig. 2.13, the phase differences between ($CK_0$ and $CK_{90}$) and ($CK_{90}$ and $CK_{180}$) are converted into pulses by the XOR gates, filtered into the voltages $V_1$ and $V_2$, and then compared by the comparator. However, this type of circuit is sensitive to common-mode errors, comparator offset, and mismatch between parasitic elements such as resistors and capacitors. Therefore, an auto-zero comparator is implemented to minimize the dc offset and improve the detection accuracy.



Fig. 2.12. Circuit diagram of the duty-cycle detector.



Fig. 2.13. Circuit diagram of a quadrature detector.

The phase error corrector with a phase detector based on a relaxation oscillator is presented in [45] and shown in Fig. 2.14 and Fig. 2.15. The relaxation oscillator, which works by charging and discharging a capacitor between two fixed voltages, operates as a phase detector in the frequency domain, requiring only a small number of active transistors and exhibiting good noise performance. The timing diagram of the duty-cycle detector's operation is shown in Fig. 2.14(b). In addition, to measure the quadrature error, a similar circuit is applied with an XOR and an XNOR gate. However, despite its area efficiency and good noise performance, this detector is susceptible to mismatch in the parasitic elements and between the pull-up and pull-down currents.


Fig. 2.14. (a) Block diagram and (b) timing diagram of a relaxation oscillator-based duty cycle detector (ICK/IBCK) [45].



Fig. 2.15. Block diagram of a quadrature phase detector [45].



#### 2.2.2 STATISTICAL BASED DETECTION CORRECTOR

Fig. 2.16. Uniform asynchronous DCO clock edge density and normalized delay.

Several phase and duty-cycle measurement circuits based on asynchronous sampling of high-speed clock signals have been proposed in recent years for clock calibration in high-speed I/O links [56–58,61,62]. The fundamental principle of these studies is the Code Density Test (CDT) statistical technique [63]. This entails sampling the input clock signal(s) with a fully asynchronous random clock, ensuring that the edges of the sampling clock are uniformly spread, and statistically analyzing the sampling results to determine the phase difference between two clock signals or the duty cycle of a single clock signal. For example, as shown in Fig. 2.16 [58], the generated delay $\Delta T$ can be calculated by comparing the number of digitally controlled oscillator (DCO) edges that fall within the $\Delta T$ window with the number of edges over the entire UI of CK. The phase shift is then determined as the normalized delay, which is defined as the ratio of the DCO edges in the $\Delta T$ window to the total number of DCO edges. This can be expressed as,

$$\text{Normalized Delay} = \frac{\Delta T}{T_{CK}} = \frac{\Sigma DCO \text{ edges in } \Delta T \text{ Window}}{\Sigma \text{ All } DCO \text{ edges}} \tag{2.3}$$
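The counting idea behind (2.3) can be illustrated with a short Monte Carlo sketch. It assumes that the asynchronous DCO edges behave as independent, uniformly distributed sampling instants within one CK period, and it uses Python's pseudo-random generator in place of the scrambled DCO; the function name, the window size, and the sample count are hypothetical.

```python
import random


def estimate_normalized_delay(delta_t_ps, t_ck_ps, n_samples=200_000, seed=1):
    """Estimate delta_t / t_ck by counting uniformly spread sampling edges.

    Assumption: each asynchronous DCO edge lands at an independent,
    uniformly distributed position within one CK period.
    """
    rng = random.Random(seed)
    edges_in_window = 0
    for _ in range(n_samples):
        edge = rng.uniform(0.0, t_ck_ps)     # position of one DCO edge in the UI
        if edge < delta_t_ps:                # edge falls inside the delta-T window
            edges_in_window += 1
    return edges_in_window / n_samples       # ratio on the right-hand side of (2.3)


estimate = estimate_normalized_delay(delta_t_ps=25.0, t_ck_ps=125.0)
print(f"estimated normalized delay: {estimate:.4f} (ideal value 0.2)")
```

The statistical error of such an estimate shrinks only with the square root of the number of samples, so a precise measurement requires many samples.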



Fig. 2.17. Block diagram of the asynchronous sampling-based measurement circuit in [58].

The block diagram of the asynchronous sampling-based measurement circuit in [58] is shown in Fig. 2.17. A synchronizer, composed of half-cycle and full-cycle flip-flops (FFs), spans 1.5 clock cycles and can be expanded to more cycles. The properly synchronized samples are compared by an XOR gate, which determines whether a DCO edge occurred. A return-to-zero (RZ) circuit produces a pulse for every DCO edge. To guarantee the randomness of the asynchronous sampling clock, the DCO frequency is modulated by a linear feedback shift register (LFSR) that scrambles the DCO digital code. The frequency scrambler also ensures a uniform distribution of DCO edges relative to the measured clocks.

In [61], both duty-cycle and phase measurements of the quadrature clock are implemented using the asynchronous sampling technique. The concept of the detection method is shown in Fig. 2.18. For duty-cycle detection, the asynchronous DCO samples the differential I/IB and Q/QB clock pairs. It then compares the occurrences of 1's between the clocks of each differential pair to determine the duty-cycle error. This can be expressed as,

$$D_E = 0.5 - \frac{CNT_I}{CNT_I + CNT_{IB}} \tag{2.4}$$

where $D_E$ denotes the duty-cycle error and $CNT_X$ denotes the counted number of 1's sampled from clock X.

Similarly, phase measurement is performed by comparing the occurrences of 1's in the overlap of quadrature-spaced clocks, such as (I and Q) and (Q and IB). This can be expressed as,

$$PH_E = 0.5 - \frac{CNT_{I-Q}}{CNT_{I-Q} + CNT_{Q-IB}} \tag{2.5}$$

where $PH_E$ denotes the phase error and $CNT_{A-B}$ denotes the counted number of 1's in the overlap between clocks A and B. The block diagram of the clock measurement circuit is depicted in Fig. 2.19.
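Equations (2.4) and (2.5) map directly onto two small functions. The sketch below evaluates them for hypothetical counter values chosen only for illustration; the function names are not from [61].

```python
def duty_cycle_error(cnt_i: int, cnt_ib: int) -> float:
    """Duty-cycle error D_E from (2.4)."""
    return 0.5 - cnt_i / (cnt_i + cnt_ib)


def phase_error(cnt_i_q: int, cnt_q_ib: int) -> float:
    """Quadrature phase error PH_E from (2.5)."""
    return 0.5 - cnt_i_q / (cnt_i_q + cnt_q_ib)


# Hypothetical counter values, for illustration only.
print(duty_cycle_error(cnt_i=5200, cnt_ib=4800))   # negative: I is sampled high more often than IB
print(phase_error(cnt_i_q=2400, cnt_q_ib=2600))    # positive: the I/Q overlap is shorter than the Q/IB overlap
```

Both errors are zero when the two counts are equal, which is the condition the calibration loop drives toward.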



Fig. 2.18. Concept diagram of the asynchronous sampling-based measurement circuit in [61].



Fig. 2.19. Block diagram of clock measurement circuit [61].



Fig. 2.20. (a) Block diagram of the asynchronous sampling-based calibration scheme in 4:1 MUX domain, and (b) schematic of 4:1 MUX.

An asynchronous sampling method is also used in [62], but the sampling is performed in the replica 4:1 MUX output domain. In the calibration state, the test patterns (A) 0101 and (B) 1010 are applied to the inputs of the 4:1 MUX, the asynchronous oscillator samples the multiplexed data $D_{OUT}$, and the numbers of captured 1's are compared to detect the phase mismatch between clocks. The overall block diagram of the asynchronous sampling-based calibration scheme in the 4:1 MUX domain is shown in Fig. 2.20(a). The schematic of the 4:1 MUX is shown in Fig. 2.20(b); multiplexing is performed using the rising edges of the quadrature clocks. The serialized data $D_{OUT}$ (A) and (B) are sampled, and the occurrences of 1's are compared. For example, as shown



Fig. 2.21. 4:1 MUX output  $D_{OUT}$  with training patterns A and B. Ideal  $CK_I/CK_Q$  clock alignment and misalignment due to early  $CK_Q$  clock.



Fig. 2.22. 4:1 MUX output  $D_{OUT}$  with training patterns A and B. Ideal  $CK_I/CK_Q$  clock alignment and misalignment due to late  $CK_Q$  clock.

in Fig. 2.21, an ideal phase alignment yields the same number of 1's in $D_{OUT}$ (A) and $D_{OUT}$ (B). However, when $CK_Q$ is early, as shown in Fig. 2.21, the number of 1's in $D_{OUT}$ (B) exceeds the number of 1's in $D_{OUT}$ (A). Conversely, when $CK_Q$ is late, as illustrated in Fig. 2.22, the number of 1's in $D_{OUT}$ (A) exceeds that in $D_{OUT}$ (B).

Because the asynchronous sampling-based technique is implemented in the digital domain, it is robust to PVT variation and scales well with the process. However, it requires an additional power-consuming uncorrelated oscillator and large counters, such as a 24-bit counter for each register, to process the sampled data and ensure stability.

#### 2.2.3 DIGITAL BASED DETECTION CORRECTOR

The time-to-digital converter (TDC) is widely used as a phase detector in phase-locked loops (PLLs), with the basic concept of measuring the time delay between two signal edges [64–67]. In [54], the duty cycle is also corrected using a TDC. A TDC can measure the phase delay between two clocks on the same type of edge (rising or falling), and it can measure the duty cycle of a single clock from the timing between opposite edges, such as the rising and falling edges. Several techniques for measuring delay have been proposed, but the simplest is the delay-line based TDC, which measures the clock by delaying an input clock by n times a fixed delay $T_q$. The delay-line based TDC schematic is shown in Fig. 2.23 and the timing diagram is shown in Fig. 2.24 [64]. Two input clock signals are used as the start and stop signals of the TDC. The phase difference between the two clocks is then detected by sampling the $n \cdot T_q$ delayed start signal with the non-delayed stop signal. The resolution of this technique is limited by the buffer delay, which is two inverter delays in that work. To further improve the resolution, a Vernier delay-line TDC [66] has been proposed, as shown in Fig. 2.25 and Fig. 2.26. Rather than simply comparing the $n \cdot T_q$ delayed start signal with the stop signal, a delay line is also introduced in the stop signal path, which changes the timing resolution from $T_q$ of two inverter buffers to the difference between the buffer delays, $T_s - T_f$, in order to achieve a finer resolution.
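The difference between the two TDC types can be seen in a behavioral sketch. The model below is idealized: it works in integer femtoseconds, ignores flip-flop setup time and stage mismatch, and uses hypothetical stage delays and stage counts that are not taken from [64] or [66].

```python
def delay_line_tdc(dt_fs: int, t_q_fs: int, n_stages: int) -> int:
    """Thermometer count of a delay-line TDC.

    Tap n carries the start edge delayed by n * t_q. The stop edge samples
    every tap, so the count is the number of delayed start edges that
    arrived no later than the stop edge.
    """
    return sum(1 for n in range(1, n_stages + 1) if n * t_q_fs <= dt_fs)


def vernier_tdc(dt_fs: int, t_s_fs: int, t_f_fs: int, n_stages: int) -> int:
    """Thermometer count of a Vernier TDC (start delay t_s > stop delay t_f).

    After n stages the start edge still leads the stop edge only if
    n * (t_s - t_f) <= dt, so the effective resolution is t_s - t_f.
    """
    return sum(1 for n in range(1, n_stages + 1) if n * (t_s_fs - t_f_fs) <= dt_fs)


dt = 47_000  # 47-ps time difference between the start and stop edges (hypothetical)
print(delay_line_tdc(dt, t_q_fs=20_000, n_stages=16))             # 20-ps steps
print(vernier_tdc(dt, t_s_fs=20_000, t_f_fs=15_000, n_stages=16))  # 5-ps steps
```

With these assumed values, the delay-line TDC resolves the input only to the nearest 20 ps, whereas the Vernier TDC resolves it in 5-ps steps at the cost of a second delay line.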



Fig. 2.23. Block diagram of delay-line based TDC.



Fig. 2.24. Timing diagram of delay-line based TDC.



Fig. 2.25. Block diagram of Vernier delay-line based TDC.



Fig. 2.26. Timing diagram of Vernier delay-line based TDC.



Fig. 2.27. (a) Overall block diagram of the proposed all-digital synchronous DCC and (b) the timing diagram of the interpolator [54].

A TDC-based DCC has been presented in [54]. The proposed DCC detects the rising and falling edges of the input clock separately, thus operating without distorting the input duty cycle. The overall block diagram of the proposed DCC is shown in Fig. 2.27. Clocks A and B are generated within the clock generator, utilizing the low-level pulse width of the input clock, $T_L$, as measured by the TDC. The clock generator generates clock B, whose low-level pulse width is similar to that of the input clock. Concurrently, clock A is generated with an inverse duty cycle with respect to clock B. The overall block diagram and timing diagram of the proposed clock generator are shown in Fig. 2.28. By interpolating the generated clocks A and B, a clock CLK<sub>INT</sub> with a 50% duty cycle can be generated, and the equation of the interpolator is given as,



Fig. 2.28. (a) Overall block diagram of the proposed TDC-based clock generator and (b) the timing diagram of the clock generator [54].

$$T_{L,CLK_{INT}} = 0.5T_L + 0.5(T_{CLK} - T_L) = 0.5T_{CLK} \tag{2.6}$$

where $T_{CLK}$ is the period of the input clock and $T_L$ is the low-level width of the input clock, which implies the duty cycle.

To reduce power consumption and area, the 1-bit TDC, known as the binary bang-bang PD (BBPD), has been widely implemented in PLLs and DLLs. The BBPD generates a binary (0 or 1) output that indicates whether a clock signal leads or lags a reference clock signal of identical frequency. A shared feedback-based digital DLL (DDLL) clock phase corrector, shown in Fig. 2.29, employs the power-efficient BBPD and reduces calibration path mismatch [36]. The method sequentially updates four delay cells ($t_Q$, $t_{IB}$, $t_{QB}$, and $t_{quad}$), which are adjusted to equalize the time difference between the rising edges of adjacent clocks to $t_{quad}$. As a result, $t_{quad}$ equals one-quarter of a clock period (T/4), effectively eliminating the four-phase error. However, this approach introduces more RMS jitter, owing to quantization noise and the increased clock path delay caused by locking at a non-optimum point. The use of a single PD for phase detection eliminates PD-induced mismatch. In particular, with a BBPD offset of $t_{os}$, the time differences between all four adjacent clock pairs eventually settle to $t_{quad} + t_{os}$, and the sum of the time differences is equal to the clock period T. This can be expressed as,

$$T_{I\leftrightarrow Q} + T_{Q\leftrightarrow IB} + T_{IB\leftrightarrow QB} + T_{QB\leftrightarrow I} = 4 \cdot (t_{quad} + t_{os}) \tag{2.7}$$

$$4 \cdot (t_{quad} + t_{os}) = T \tag{2.8}$$

$$\therefore t_{quad} = \frac{T}{4} - t_{os} \tag{2.9}$$

Therefore, $t_{quad}$ settles to $T/4 - t_{os}$ while each adjacent clock spacing settles to $t_{quad} + t_{os} = T/4$, which eliminates the effect of the PD offset. However, it should be noted that, unlike the BBPD offset, the mismatch induced by non-common blocks, such as the MUX and peripheral circuits like the gating logic in [41], remains. An improved version of [36] has been proposed in [41] to minimize the calibration jitter contribution by applying a minimum total delay tracking algorithm that controls all delay lines, in contrast to [36], which fixes the I delay line at the mid-point code. It controls all five delay lines ($t_I$, $t_Q$, $t_{IB}$, $t_{QB}$, and $t_{quad}$) with this algorithm to achieve phase correction.
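The offset cancellation expressed by (2.7) to (2.9) can be reproduced with a simple behavioral model. The sketch below is not the implementation of [36]; it assumes an ideal bang-bang update of one code step per comparison, unlimited delay range, no jitter, and a fixed I clock. All times are integers in a common unit (10 fs here), and the period, skews, and offset are hypothetical.

```python
def simulate_shared_bbpd_qec(period, skews, t_os, step=1, rounds=20000):
    """Bang-bang calibration of Q, IB, QB and t_quad with one shared BBPD.

    `skews` holds the initial timing errors of Q, IB and QB relative to
    their ideal positions; the I clock is the fixed reference.
    """
    edges = [0,
             period // 4 + skews[0],
             period // 2 + skews[1],
             3 * period // 4 + skews[2]]       # rising edges of I, Q, IB, QB
    t_quad = period // 4                        # initial quadrature delay
    for _ in range(rounds):
        for k in range(4):                      # I->Q, Q->IB, IB->QB, QB->I
            target = edges[(k + 1) % 4] + (period if k == 3 else 0)
            # Shared BBPD with offset t_os: True if the delayed edge is late.
            late = edges[k] + t_quad + t_os >= target
            if k < 3:
                edges[k + 1] += step if late else -step   # move the next clock
            else:
                t_quad += -step if late else step         # adjust t_quad itself
    spacings = [edges[1] - edges[0], edges[2] - edges[1],
                edges[3] - edges[2], period - (edges[3] - edges[0])]
    return spacings, t_quad


# Hypothetical case: 400-ps period, 5-ps BBPD offset, unit = 10 fs.
spacings, t_quad = simulate_shared_bbpd_qec(period=40000, skews=(300, -200, 250), t_os=500)
print(spacings, t_quad)
```

In this model the four spacings are expected to dither around T/4 (10000 units) while `t_quad` dithers around T/4 minus the offset (9500 units), in line with (2.9). Mismatch in per-path blocks is not modeled and would not be canceled in the same way.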

The previously proposed DDLLs operate at up to a few GHz and target DRAM operation; to apply them to a high-speed link, a technique to alleviate timing constraints is required. They also have the disadvantage of leaving duty-cycle errors uncorrected. Therefore, a circuit that responds to the falling edges of the clock is required in order to detect and correct duty-cycle errors. Each solution is discussed in detail in Chapter 3 and Chapter 4, respectively.



Fig. 2.29. Overall block diagram of the DDLL-based QEC [36].



Fig. 2.30. Overall block diagram of the DDLL-based QEC with the minimum total delay tracking algorithm [41].

#### 2.3 BUILDING BLOCKS OF DDLL BASED CLOCK CORRECTOR

The fundamental components of the proposed DDLL are important in the overall design process, as they directly impact its performance and functionality. These essential building blocks include the delay line, which adjusts the clock timing; the MUX, which selects the input clock; and the BBPD, which contributes to the phase detection and control mechanism.

#### 2.3.1 DELAY LINE

A delay line is composed of variable delay elements that control the rising or falling edges of the clock to adjust the propagation delay, as shown in Fig. 2.31 [68]. The delay of a CMOS gate $t_{d,CMOS}$ can be expressed as [69],

$$t_{d,CMOS} \propto \frac{C_L}{I_D} V_{DD} \tag{2.10}$$

where $C_L$ represents the load capacitance, $I_D$ represents the drain current, and $V_{DD}$ represents the supply voltage. Similarly, the delay of a current-mode logic (CML) gate $t_{d,CML}$ can be expressed as [70],

$$t_{d,CML} \propto C_L R_L \tag{2.11}$$

where $C_L$ represents the load capacitance and $R_L$ represents the load resistance. From (2.10) and (2.11), the propagation delay of the clock can be controlled by varying the RC time constant [71, 72], the output load capacitance [73], the current [74, 75], or the supply voltage [76] of the inverter. In Fig. 2.31(a) and (b), the CML-type differential buffers offer delay control by varying R or C, with low sensitivity of delay to static



Fig. 2.31. Schematic of (a) differential delay cell with R control, (b) differential delay cell with C control, (c) SCI delay cell, (d) CSI delay cell, and (e) supply voltage controlled delay cell.

and dynamic supply variation compared to a CMOS buffer. The most common delay elements employed in multiphase DLLs are the shunt-capacitor inverter (SCI) [31, 45], shown in Fig. 2.31(c), and the current-starved inverter (CSI) [59,75], shown in Fig. 2.31(d). A delay cell based on the SCI adjusts the load capacitance to control the delay. It has the advantage of linear delay control, but it consumes considerable area and power due to the additional load



Fig. 2.32. Schematic of CSI-based delay cell with (a) MOSFET switches and (b) IDAC-based voltage control.

capacitance. The CSI-based delay cell, on the other hand, adjusts the transition time by controlling the drain current. The topologies for controlling the drain current in CSI-based cells are presented in Fig. 2.32. Fig. 2.32(a) shows digitally controlled MOSFET switches sized in a binary fashion. This topology is simple to implement, but it exhibits nonlinear behavior because the parasitic capacitance of the MOSFET switches makes the effective resistance control nonlinear [75]. To overcome the nonlinear behavior, a voltage-controlled CSI delay cell with less parasitic capacitance is implemented, as in Fig. 2.32(b). This topology outperforms MOSFET switch control in terms of jitter performance because it uses a DAC to generate the control voltage, which filters the deterministic jitter caused by digital code dithering. The MOSFET switch-based current DAC (IDAC) mirroring offers low-pass filtering of jitter at the cost of power consumption [75]. An alternative approach is to regulate the supply voltage of the inverters, using a control voltage to set the supply level, as in Fig. 2.31(e). The effective switching resistance of the inverters varies as the supply voltage is adjusted in this way.
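The contrast between capacitance control and current control can be illustrated directly from the first-order model in (2.10). The sketch below sweeps a uniform digital code over the load capacitance (as in an SCI) and over the drain current (as in a CSI). The proportionality constant is taken as 1, all component values are hypothetical, and the parasitic effects described above are ignored.

```python
def cmos_delay(c_load, i_drain, vdd):
    """First-order CMOS gate delay from (2.10), proportionality constant taken as 1."""
    return c_load * vdd / i_drain


VDD = 0.9                       # supply voltage in V (hypothetical)
C0, C_UNIT = 10e-15, 1e-15      # fixed load and per-code capacitor in F (hypothetical)
I_FIXED, I_UNIT = 200e-6, 20e-6  # fixed current and per-code current in A (hypothetical)

codes = list(range(4, 12))
sci = [cmos_delay(C0 + c * C_UNIT, I_FIXED, VDD) for c in codes]  # capacitance swept by code
csi = [cmos_delay(C0, c * I_UNIT, VDD) for c in codes]            # current swept by code

sci_steps = [b - a for a, b in zip(sci, sci[1:])]
csi_steps = [b - a for a, b in zip(csi, csi[1:])]
print("SCI steps [ps]:", [round(s * 1e12, 2) for s in sci_steps])
print("CSI steps [ps]:", [round(s * 1e12, 2) for s in csi_steps])
```

In this simplified model the SCI delay steps are uniform, because the delay is proportional to the capacitance, whereas the CSI delay steps shrink as the code increases, because the delay is inversely proportional to the current.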

### **Chapter 3**

## **DESIGN OF THE PROPOSED OEC**

#### 3.1 OVERVIEW

As the demand for high data bandwidth increases, half-rate and quarter-rate clocking architectures are widely adopted in system design to reduce clock distribution power and to allow adequate timing margin. Although state-of-the-art transceivers have reached data rates of over 200 Gb/s with a quarter-rate architecture [26], [24], [77], [78], an octa-rate clocking architecture is a promising candidate for data rates beyond 200 Gb/s. Since the skew between multiphase clocks can severely degrade bit-error-rate (BER) performance [79] and reduce the eye width [47], a fine and accurate clock phase corrector is required. To reduce phase error, several error detection and correction schemes have been proposed. For



Fig. 3.1. Comparison between unit phase delay  $(T_{period}/8)$  and two-stage buffer delay in 40-nm CMOS technology.

example, detection by analog integrators and comparators [35] offers small RMS jitter. However, the separate phase calibration loops are susceptible to random offset voltages in comparators and mismatches in integrators, which degrades the accuracy of a phase corrector. On the other hand, a digital phase detector-based digital delay-locked loop (DDLL) can alleviate the mismatch issues and additionally eliminate the effect of the mismatch by using a single shared digital feedback loop [36].

In previous works, phase correction by comparing two adjacent clock edges in the shared feedback loop-based DDLL was implemented at low operating speeds of a few GHz [36], [41], [46]. However, as the required clock frequency increases, comparing adjacent clock steps such as T/4 or T/8, where T represents the clock period, makes it more challenging to generate the required delays. The time difference falls below 20 ps, which is the minimum delay that can be generated from a delay line based on a two-stage buffer. Fig. 3.1 shows the unit delay spacing of an octa-phase clock and the two-stage buffer delay at various process corners in 40-nm CMOS technology. To achieve a shorter delay, two approaches can be considered: multi-path delay cells such as skewed inverters [80], [81], and utilizing the difference between two delay lines [26], [82]. Both can be implemented, but at the cost of power consumption. In addition, the tight internal timing margin becomes a problem at high-speed operation due to the sequential clock selection loop. To solve the above issues, this work proposes the coprime phase comparison scheme. Rather than using the phase difference between adjacent clock phases for delay adjustment, the scheme compares two clock phases spaced apart by a number of phases that is coprime to N, where N is the number of phases, such as 3 for a quadrature clock and 3 or 5 for an octa-phase clock. This alleviates the timing constraint and avoids the power consumption of generating a shorter delay with the aforementioned methods as the frequency increases. Fig. 3.2 shows the power consumption comparison between delay generation methods. The resulting neighboring clock phase differences eventually settle to T/N for the N-phase clock, which is the goal of the phase correction. Furthermore, clock-divided selection logic is applied to relax the timing margin in the clock selection loop.

Fig. 3.2. Power consumption comparison of delay generation methods in 40-nm CMOS technology.

#### 3.2 PROPOSED COPRIME PHASE CORRECTION

The phase error correction of the N-phase clock is accomplished by equally spacing adjacent clock phases with a unit time of T/N. Since the setup time of the phase comparison can be longer than the unit time, it is necessary to extend the time distance in comparisons. A comparison interval of M unit times relaxes the setup time but delays settling, since the comparison results are produced M times slower. Since the comparison must cover all the edges with equal frequency, M must be coprime to N. Conventional multiphase clock correction, which compares adjacent clock edges, corresponds to the case of M = 1. With the proposed comparison scheme using $M (\neq 1)$, the sum of time differences, $\Delta T$, of $N_s$ consecutive comparison results can be written as,

$$\Delta T = \left(M \cdot \frac{T}{N}\right) \cdot N_s \tag{3.1}$$

Taking $\Delta T$ modulo T yields,

$$\Delta T \pmod{T} = \left(M \cdot \frac{T}{N}\right) \cdot N_s \pmod{T} = \frac{T}{N} \tag{3.2}$$

when  $M \cdot N_s = k \cdot N + 1$ , where k is an integer, or equivalently

$$(M \cdot N_s) \pmod{N} = 1 \tag{3.3}$$

If N and M are coprime, there is always a solution $N_s$ that satisfies (3.3). For octa-phase correction, where N is 8, the available candidates for M are 1, 3, 5, and 7. The number of samples $N_s$ needed for each candidate to satisfy (3.3) is 1, 3, 5, and 7, respectively. Selecting an M other than 1 alleviates the timing constraint in generating the delay at high frequency. However, it introduces latency that is proportional to the



Fig. 3.3. Timing diagram of proposed phase comparison flow for  $CK_0$  and  $CK_1$  where  $t_{octa}$  is T/8.

comparison interval. Furthermore, the power consumption varies depending on M since the amount of delay produced in the phase comparison path varies. Extending the time interval, which is equivalent to increasing M, requires more delay in the comparison path, resulting in more power consumption.
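The condition in (3.3) can be checked numerically. The sketch below searches for the smallest $N_s$ for every candidate spacing M with N = 8; the function name is hypothetical, and the brute-force search is used only for clarity.

```python
from math import gcd


def samples_needed(m: int, n: int):
    """Smallest N_s >= 1 with (m * N_s) mod n == 1, or None if no solution exists."""
    for n_s in range(1, n + 1):
        if (m * n_s) % n == 1:
            return n_s
    return None


N = 8
for m in range(1, N):
    print(f"M = {m}: coprime to {N}: {gcd(m, N) == 1}, N_s = {samples_needed(m, N)}")
```

Only the spacings that are coprime to 8 (1, 3, 5, and 7) yield a solution. An even spacing never reaches the adjacent phase, which is why M must be coprime to N.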

Considering the trade-off among sufficient timing margin, short latency, and low power consumption, M = 3 is selected in this work. Fig. 3.3 shows the timing diagram of the proposed scheme, where $t_{octa}$ is the desired time difference T/8 between adjacent clock edges. To place adjacent clock phases, for example $CK_0$ and $CK_1$, at the $t_{octa}$ interval, the proposed scheme corrects the three phase differences $CK_0 - CK_3$, $CK_3 - CK_6$, and $CK_6 - CK_1$ to the $3 \cdot t_{octa}$ spacing. As a result, the accumulated time difference from $CK_0$ to $CK_1$ settles to $9 \cdot t_{octa} = T + t_{octa}$, which is equivalent to placing the adjacent edges $t_{octa}$ apart. Likewise, the other adjacent clock phases eventually settle to the $t_{octa}$ spacing.
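The $CK_0$ to $CK_1$ example can be traced step by step. The sketch below walks the chain of 3T/8 comparisons for an 8-GHz clock and accumulates the time difference; the function name is hypothetical, and the walk assumes that the spacing is coprime to the number of phases so that the target is always reached.

```python
def comparison_chain(start: int, target: int, m: int = 3, n: int = 8):
    """Clock indices visited when stepping by m phases from start until target.

    Assumes m is coprime to n, so every phase is reachable.
    """
    chain = [start]
    while chain[-1] != target:
        chain.append((chain[-1] + m) % n)
    return chain


T_PS = 125.0                         # clock period at 8 GHz
t_octa = T_PS / 8                    # desired adjacent spacing
chain = comparison_chain(0, 1)       # CK0 -> CK3 -> CK6 -> CK1
total = (len(chain) - 1) * 3 * t_octa
print(chain, total, total % T_PS)
```

Three comparisons of $3 \cdot t_{octa}$ each accumulate to 140.625 ps, which is one full period plus 15.625 ps, so the $CK_0$ to $CK_1$ spacing is fixed at $t_{octa}$.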

|                  | M = 1        | M = 3                                 | M = 5                                 | M = 7                                 |
|------------------|--------------|---------------------------------------|---------------------------------------|---------------------------------------|
| Settle Time      | $T_{settle}$ | $T_{settle} + \frac{(M-1)T}{8}\alpha$ | $T_{settle} + \frac{(M-1)T}{8}\alpha$ | $T_{settle} + \frac{(M-1)T}{8}\alpha$ |
| Delay [ps]       | 15.625       | 46.875                                | 78.125                                | 109.375                               |
| Normalized Power | 1.3          | 1                                     | 1.3                                   | 1.6                                   |

Table 3.1. Performance comparison with different M values.

\* $T_{settle}$: settling time for M = 1, simulated with an 8-GHz clock

\*\* $\alpha$: the number of clock cycles needed to settle for the M = 1 case

\*\*\* Power consumption normalized to the M = 3 case

#### 3.3 OVERALL ARCHITECTURE

The overall architecture of the proposed OEC is shown in Fig. 3.4. The main blocks of the proposed scheme are eight digitally controlled delay lines (DCDLs) that adjust the phases of the input clocks and the phase calibration loop, which uses a single shared phase comparator to minimize mismatches between phase-detection paths [36]. The phase calibration loop is composed of the 8:2 MUX, an octa-delay line (3T/8), a bang-bang phase detector (BBPD), and a look-up table (LUT)-based digital loop filter (DLF). To alleviate the tight timing margin in the sequential selection loop, divide-by-4 logic is implemented instead of full-rate operation to secure robust operation. In addition, asynchronous clock gating logic is employed to reduce the power consumption by turning off the loop once the steady state is reached.

During the calibration process, two clocks spaced 3T/8 apart, $CK_{MUX0}$ and $CK_{MUX1}$, are sequentially selected by the 8:2 MUX and its count selection logic. $CK_{MUX0}$ is delayed by the 3T/8 delay line, which is adjusted by the control code $C_{OCTA}$. The delayed clock $CK_{MUX0,D}$ is then compared with $CK_{MUX1}$ in the BBPD to determine the phase error polarity. The eight phase detection results are deserialized and passed to the loop filter. The additional SEL[4] signal is also deserialized and passed to the loop filter in order to verify the BBPD comparison sequence. The DLF filters the BBPD outputs and adjusts the codes of the DCDLs, $C_{CK0-CK7}$, and $C_{OCTA}$. Fig. 3.5 shows the timing diagram of the proposed OEC. To eliminate glitches during phase shifting in the MUX, the selection loop rotates in a phase-backward manner.



Fig. 3.4. Overall block diagram of the proposed OEC



Fig. 3.5. Timing diagram of the proposed OEC.



Fig. 3.6. Monte Carlo simulation results at various process corners.

Considering the clock propagation path in the corrector, the possible factors that degrade the calibration accuracy are the offset in the BBPD and mismatches in the gating logic and tri-state inverters of the MUX used in the calibration path. Since the phase calibration loop shares a single PD to determine the error, its static offset is canceled. However, since the gating logic and tri-state inverters are separate blocks in each path, their mismatches degrade the PD accuracy, resulting in an output phase error. The Monte Carlo simulation in Fig. 3.6 shows a standard deviation of 292 fs at the TT corner.

# 3.4 CIRCUIT IMPLEMENTATIONS

#### 3.4.1 8:2 MUX AND SELECTION GENERATOR



Fig. 3.7. Block diagram of sequential clock selection path.

The operation of the sequential clock selection is limited by the selection signal generation delay through the clock selector block, which consists of an 8:2 multiplexer (MUX) and a MUX selection logic generator, as shown in Fig. 3.7. The selection signal generation delay, $t_d$, is the sum of the delays through the MUX, the divider, and the selection generation logic. To ensure that the subsequent clock is selected properly, the selection signal must be generated within a timing margin of 7T/8, where T is the clock period. This can be expressed as,

$$t_d = t_{MUX} + t_{DIV4} + t_{sel\_logic} \tag{3.4}$$

$$t_d < \frac{7T}{8} \tag{3.5}$$



Fig. 3.8. Block diagram of (a) conventional 8:1 MUX slice, (b) proposed 8:1 MUX slice.

However, as the frequency rises, the delay easily exceeds 7T/8, which is 109.375 ps at 8 GHz; as a result, a scheme to alleviate the timing constraint in generating the selection signal is required. Thus, a 5-bit controlled 8:1 MUX is used, and the selection logic runs on a divide-by-4 clock to avoid full-rate operation.
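The timing constraint in (3.4) and (3.5) can be checked numerically with a small Python sketch. The helper names and the example block delays below are hypothetical placeholders chosen for illustration; they are not simulated or measured values from the prototype.

```python
def selection_budget_ps(freq_ghz, fraction=7 / 8):
    """Timing budget for selection signal generation: 7T/8, in picoseconds."""
    period_ps = 1000.0 / freq_ghz
    return fraction * period_ps


def meets_budget(t_mux_ps, t_div4_ps, t_sel_logic_ps, freq_ghz):
    """Evaluate t_d from (3.4) and test it against the bound in (3.5)."""
    t_d = t_mux_ps + t_div4_ps + t_sel_logic_ps
    return t_d < selection_budget_ps(freq_ghz)


if __name__ == "__main__":
    # Hypothetical block delays (ps), for illustration only.
    example = dict(t_mux_ps=30.0, t_div4_ps=50.0, t_sel_logic_ps=40.0)
    for f in (4.0, 8.0):
        print(f, selection_budget_ps(f), meets_budget(freq_ghz=f, **example))
```

With these placeholder delays the same logic fits the budget at 4 GHz but not at 8 GHz, which illustrates why the budget becomes the limiting factor as the clock frequency rises.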

The 8:2 MUX is composed of two identical slices of the 5-bit controlled 8:1 MUX. Each slice of the 8:1 MUX, shown in Fig. 3.8(b), is split into two separately controlled paths in order to secure the timing margin at the internal nodes. The conventional 3-bit controlled, three-stage 8:1 MUX shown in Fig. 3.8(a) has a tight and varying timing margin when the internal nodes switch (i.e., when SEL[2:0] changes from 000 to 001, from 001 to 010, or from 011 to 100). Because of the different clock propagation paths, the clock pulse is swallowed in some transitions, making the frequency of the MUX output unstable. For example, when SEL[2:0] changes from 000 to 001, no switching occurs at the internal nodes; therefore, the net signal generation delay $t_{d,0}$ can be expressed as,



Fig. 3.9. (a) Block diagram of clock selection signal generation path when SEL[2:0] changes 000 to 001 (b) and its timing diagram.

$$t_{d,0} = 2 \cdot t_{inv} + t_{div4} + t_{sel\_logic} \tag{3.6}$$

The delay path is colored red, and the timing diagram of the sequential clock selection is shown in Fig. 3.9. In addition, when SEL[2:0] changes from 001 to 010, switching occurs at the internal node of the second MUX stage, as in Fig. 3.10, and the net delay $t_{d,1}$ can be expressed as,



Fig. 3.10. (a) Block diagram of clock selection signal generation path when SEL[2:0] changes 001 to 010 (b) and its timing diagram.

$$t_{d,1} = 3 \cdot t_{inv} + t_{div4} + t_{sel\_logic} \tag{3.7}$$

Furthermore, the worst-case delay occurs when SEL[2:0] changes from 011 to 100. The generation delay of the clock propagation path, $t_{d,2}$, can be expressed as,

$$t_{d,2} = 4 \cdot t_{inv} + t_{div4} + t_{sel\_logic} \tag{3.8}$$



Fig. 3.11. (a) Block diagram of clock selection signal generation path when SEL[2:0] changes 011 to 100 (b) and its timing diagram.

due to switching occurring in the first stage of the MUX, as in Fig. 3.11. Because the selection signal generation delay $t_d$ varies as in (3.6), (3.7), and (3.8) during sequential clock selection, the timing margin, which tightens as the frequency increases, leads to unstable operation.

However, separately controlling the 8:1 MUX as even and odd paths (PATHe and PATHo in Fig. 3.8(b)) with the 5-bit logic allows the clock to arrive and settle at the final MUX stage before it is selected, ensuring a constant selection signal generation delay $t_{d,const}$ of,

$$t_{d,const} = 2 \cdot t_{inv} + t_{div4} + t_{sel\_logic} \tag{3.9}$$

which equals the minimum delay of the 3-bit controlled 8:1 MUX. Fig. 3.12 and Fig. 3.13 show two different clock selection sequences and their timing diagrams, both with the constant selection signal delay $t_{d,const}$. In both cases, before the MUX output clock changes (CK<sub>7</sub> in PATHo and CK<sub>4</sub> in PATHe for Fig. 3.12(a) and Fig. 3.13(a), respectively), the next output clock (CK<sub>6</sub> in PATHe and CK<sub>3</sub> in PATHo) has already arrived at the final stage of the opposite path, which removes the effect of the different clock propagation paths on the signal generation delay.
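The difference between (3.6) to (3.8) and (3.9) can be summarized with a short Python sketch. The inverter-stage counts come directly from the equations above, while the delay values passed in the example are hypothetical placeholders and the function names are illustrative.

```python
# Inverter stages in the delay path for each SEL[2:0] transition of the
# conventional 3-bit controlled MUX, from (3.6), (3.7), and (3.8).
CONVENTIONAL_INV_STAGES = {
    ("000", "001"): 2,
    ("001", "010"): 3,
    ("011", "100"): 4,
}
# The 5-bit controlled MUX always sees two inverter stages, from (3.9).
PROPOSED_INV_STAGES = 2


def selection_delay(inv_stages, t_inv, t_div4, t_sel_logic):
    """Selection signal generation delay for a given number of stages."""
    return inv_stages * t_inv + t_div4 + t_sel_logic


def conventional_delay_spread(t_inv, t_div4, t_sel_logic):
    """Difference between the worst and best conventional delays."""
    delays = [
        selection_delay(n, t_inv, t_div4, t_sel_logic)
        for n in CONVENTIONAL_INV_STAGES.values()
    ]
    return max(delays) - min(delays)


if __name__ == "__main__":
    # Hypothetical delays in ps, for illustration only.
    t_inv, t_div4, t_sel = 10.0, 45.0, 35.0
    print(conventional_delay_spread(t_inv, t_div4, t_sel))
    print(selection_delay(PROPOSED_INV_STAGES, t_inv, t_div4, t_sel))
```

In this model the conventional MUX has a transition-dependent spread of two inverter delays, whereas the proposed MUX always matches the shortest conventional case.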


Fig. 3.12. (a) Block diagram of clock selection signal generation path when the MUX output changes from  $CK_7$  to  $CK_6$  (b) and its timing diagram.



Fig. 3.13. (a) Block diagram of clock selection signal generation path when the MUX output changes from  $CK_4$  to  $CK_3$  (b) and its timing diagram.



Fig. 3.14. Block diagram of divide-by-4.



Fig. 3.15. MUX selection logic and truth table.

The MUX selection signal generator is composed of a divide-by-4 and counter-based logic. As shown in Fig. 3.14, the divide-by-4 is implemented with two cascaded D flip-flops and a NOR gate for reset. Fig. 3.15 shows the generation of the counter-based MUX selection bits, which are triggered by $CK_{0,DIV4}$, together with the truth table. Because the MUX selection logic uses two independent divide-by-4 units to generate $CK_{0,DIV4}$ and $CK_{1,DIV4}$, an appropriate power-up sequence is essential for stable operation.

To properly select $CK_{MUX0}$ and $CK_{MUX1}$, enable signals and a power-up sequence are needed, as in Fig. 3.16.





#### 3.4.2 DELAY LINE



Fig. 3.17. Block diagram of four stages of delay line.

The digitally controlled delay line (DCDL) is implemented in two types: the octa-delay line and the main delay line. The resolution and the total delay range are the main design constraints. A fine resolution is required because the resolution of the delay lines has a significant impact on the accuracy of the phase correction. Because the resolution of the octa-delay line affects the entire eight-phase output clock, a finer resolution was chosen for it. Furthermore, because the octa-delay line must generate the 3T/8 delay required for phase error detection at the calibration operating speed, its total delay range is also an important constraint.

Both types employ four stages of current-starved inverters to control the delay, two for coarse control and two for fine control. To compensate for the non-linear behavior of a current-starved inverter, a digitally controlled current-mirror type is used to generate the control voltages [75]. Furthermore, a thermometer-based control code is employed to enhance linearity. The octa-delay line is designed to cover the 3T/8 range with 6-bit control, offering a resolution of 200 fs/LSB. The eight main delay lines dynamically cover the T/8 delay with 5-bit control, providing a resolution of 500 fs/LSB.
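As a rough illustration of the numbers above, the Python sketch below converts a control code to a thermometer code and estimates the adjustable span implied by a given bit width and resolution. It assumes a uniform LSB and a plain binary input code; the split into coarse and fine stages is not modeled, and the function names are hypothetical.

```python
def to_thermometer(code, n_bits):
    """Convert a binary control code into a thermometer code.

    Returns a list of 2**n_bits - 1 unit-cell enables, with `code` of them set.
    """
    levels = (1 << n_bits) - 1
    if not 0 <= code <= levels:
        raise ValueError("code out of range")
    return [1] * code + [0] * (levels - code)


def tuning_span_ps(n_bits, resolution_fs):
    """Adjustable delay span (ps), assuming a uniform step per LSB."""
    return ((1 << n_bits) - 1) * resolution_fs / 1000.0


if __name__ == "__main__":
    print(to_thermometer(3, 3))
    print(tuning_span_ps(6, 200))  # octa-delay line: 6 bits, 200 fs/LSB
    print(tuning_span_ps(5, 500))  # main delay line: 5 bits, 500 fs/LSB
```

Under these assumptions the main delay line spans roughly one T/8 interval at 8 GHz (15.625 ps). The span computed for the octa-delay line is only the adjustable part of its delay, not its absolute 3T/8 delay.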

#### 3.4.3 PHASE DETECTOR



Fig. 3.18. (a) Block diagram of BBPD, and (b) schematic of arbiter.

The BBPD is implemented with a sense-amplifier-based arbiter and a flip-flop, as in Fig. 3.18(a). The output of the arbiter is sampled by $CK_{MUX1,DIV4}$ to properly capture the comparison result. The sense-amplifier-based arbiter is implemented as an SR latch based on cross-coupled NAND gates, as shown in Fig. 3.18(b) [83]. It identifies which of the two input signals, I<sub>1</sub> or I<sub>2</sub>, has the earlier rising edge and generates a logical "0" or "1" indicating the phase error information.

#### 3.4.4 LOOP FILTER



Fig. 3.19. (a) Clock propagation path in calibration. (b) The timing diagram of comparison.

The DLF utilizes the minimum total delay tracking FSM [41] with an 8-bit LUT to minimize the calibration-induced jitter. The LUT is generated by processing the clocks' update polarities for the possible 8-bit phase error sequences. During the calibration process, each BBPD result, referred to as phase<sub>err</sub>, indicates the update polarity for the DCDL codes associated with the coprime-spaced clocks and the octa-delay line. To illustrate the data processing, CK<sub>0</sub> and CK<sub>3</sub> are taken as an example. The calibration path that each clock takes is shown in Fig. 3.19(a). CK<sub>0</sub> is subjected to a total delay of $(t_0+t_D)$, while CK<sub>3</sub> is subjected to a total delay of $t_3$ before being compared at the BBPD, where $t_D$ is $3 \cdot t_{octa}$ in this work. If phase<sub>err</sub> is "1", CK<sub>D</sub> precedes CK<sub>3</sub>. CK<sub>D</sub>, on the other hand, lags when phase<sub>err</sub> is "0". To correct the phase error when phase<sub>err</sub> is "1", the delay $t_0$ or $t_D$ can be decreased and $t_3$ can be increased. In this prototype chip, the delay is increased or decreased by increasing or decreasing



Fig. 3.20. When  $phase_{err}$  is "1" (a) Clock propagation path in calibration. (b) The timing diagram of comparison.



Fig. 3.21. When  $phase_{err}$  is "0" (a) Clock propagation path in calibration. (b) The timing diagram of comparison.

the code of the clock's DCDL,  $C_{CK_n}$ . Fig. 3.20 and Fig. 3.21 show the phase error correction polarity.

This 8-bit error sequence is then sorted by pre-defined priority and update polarity



Fig. 3.22. A flow chart of DLF.

flag. When $C_{CK7}$ and $C_{CK3}$ are both candidates, $C_{CK7}$ takes precedence. Furthermore, $C_{OCTA}$ is updated only when all eight BBPD results imply the same update polarity for the octa-delay line. The flow chart of the DLF is shown in Fig. 3.22.
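The unanimity condition for updating $C_{OCTA}$ can be sketched as follows. This Python fragment only models that single rule; the LUT contents, the priority ordering, and the mapping from polarity to code direction are not modeled, and the function name and argument format are hypothetical.

```python
def octa_update_request(octa_polarities):
    """Return the shared update polarity for the octa-delay line, or None.

    `octa_polarities` holds eight 0/1 flags, one per BBPD result, each giving
    the update polarity that result implies for the octa-delay line.
    The octa-delay code is a candidate for update only when all eight agree.
    """
    if len(octa_polarities) != 8:
        raise ValueError("expected eight BBPD-derived polarities")
    first = octa_polarities[0]
    if all(p == first for p in octa_polarities):
        return first
    return None


if __name__ == "__main__":
    print(octa_update_request([1] * 8))
    print(octa_update_request([1, 0, 1, 1, 1, 1, 1, 1]))
```

When the result is `None`, the disagreement among the eight comparisons points to skew between individual clocks rather than a common error in the 3T/8 delay, so in this simplified view only the per-clock DCDL codes would be considered.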

## **Chapter 4**

# **DESIGN OF THE PROPOSED CLOCK CORRECTOR**

## 4.1 OVERVIEW

With the ever-growing volume of data traffic due to advances in information and communication technology such as Artificial Intelligence (AI) and Over-The-Top (OTT) services, higher I/O bandwidth is required. Sub-rate clocking architectures such as half-rate or quarter-rate have been widely used to improve power efficiency and overcome intrinsic process speed limits. The distribution of multiphase clocks is prone to skews caused by process, voltage, and temperature (PVT) variations and layout mismatches, which degrade link performance metrics such as the bit-error rate (BER). For instance, for the transmitter (TX) to generate a unit interval (UI) of data, a combination of the clock's rising and falling edges is used [84], [21]. Additionally, for the ADC-based receiver (RX), the duty cycle is important in the interleaved track-and-hold (T/H) operation [78].

Fig. 4.1. Example of 1-UI pulse generation with octa-phase clock. (a) Rising-Rising. (b) Rising-Falling.

Fig. 4.1 shows two strategies for generating 1-UI pulses at the TX serializer using octa-phase clock edges:

1. Rising edge to rising edge, as in Fig. 4.1(a)
2. Rising edge to falling edge, as in Fig. 4.1(b)

According to [26], the use of opposing edge polarities can average out the random jitter. When both edge polarities are used, correcting both the phase error and the duty-cycle error of the clock becomes more significant.

In Chapter 3, only phase correction was implemented, through rising-edge comparison. While this method is effective for phase correction, it is inadequate for duty-cycle error correction. To implement both phase and duty-cycle error correction in a single shared-PD-based DDLL architecture, jointly operating loops have been presented [43]. Each loop uses a separate clock selector, as shown in Fig. 4.2, which is composed of two MUXes and a logic generator that can be shared to reduce



Fig. 4.2. Block diagram of clock selector.

power consumption and area. To save power and space, one of the two MUXes in each path and the logic generator (shown in blue in Fig. 4.2) are shared.

In this chapter, a shared clock selector-based DDLL that performs both OEC and DCC is implemented. To share one 8:1 MUX and a logic generator between the correction loops, the input connections at the MUXes are arranged according to the calibration loop sequence. For the OEC path, the 3T/8 delay line mitigates the process limit in generating a delay [44]. Furthermore, the edge converter (EC) in the DCC path is optimized to generate equal propagation delays for rising and falling edges in order to achieve an accurate comparison between a falling edge and a rising edge of the differential clock.

## 4.2 PROPOSED CLOCK CORRECTION SCHEME



Fig. 4.3. Concept diagram of (a) phase error detection and (b) duty-cycle error detection.

As indicated in Fig. 4.3, clock correction can be carried out by comparing clock edges. For example, phase error correction can be performed by comparing the clocks' rising edges and placing them at evenly spaced intervals. The comparison proceeds by delaying the preceding clock by the target interval, 3T/8 in this work, where T is the clock period, and then comparing it with the following clock at the BBPD. The duty-cycle error can be removed by aligning one clock's falling edge with the rising edge of its complementary clock. With this approach, the duty-cycle error can be detected without an additional delay line.



Fig. 4.4. Timing diagram of duty-cycle correction.

Phase and duty-cycle errors can be eliminated by jointly operating the two correction loops. In the phase correction loop, all adjacent clock spacings eventually settle to $T_{octa}$, whose value is T/8 [44]. The duty correction loop can achieve a duty cycle of 50% with the aid of the phase correction results. For example, the high time of CK<sub>0</sub> can be expressed as the cumulative sum of the four neighboring phase differences between CK<sub>0</sub> and CK<sub>4</sub>, specifically, $\Delta T_{(CK_0\leftrightarrow CK_1)}$, $\Delta T_{(CK_1\leftrightarrow CK_2)}$, $\Delta T_{(CK_2\leftrightarrow CK_3)}$, and $\Delta T_{(CK_3\leftrightarrow CK_4)}$. This behavior is shown in Fig. 4.4. As phase correction proceeds, each phase difference settles to $T_{octa}$; therefore, the high time of CK<sub>0</sub> becomes $4 \cdot T_{octa}$, which is equivalent to a 50% duty cycle. This sequence can be written as,

$$\Delta T_{CK_{0,High}} = \Delta T_{(CK_{0}\leftrightarrow CK_{1})} + \Delta T_{(CK_{1}\leftrightarrow CK_{2})} + \Delta T_{(CK_{2}\leftrightarrow CK_{3})} + \Delta T_{(CK_{3}\leftrightarrow CK_{4})} = 4 \cdot T_{octa} = \frac{T}{2} \tag{4.1}$$
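The relationship in (4.1) can be illustrated with a short Python sketch. It assumes that the DCC loop has already aligned the falling edge of $CK_n$ with the rising edge of $CK_{n+4}$, so that the high time equals the sum of four adjacent rising-edge spacings. The function names and the skewed example values are hypothetical.

```python
def high_time(spacings, n=0):
    """High time of CK_n as the sum of four adjacent spacings, as in (4.1).

    `spacings[k]` is the rising-edge spacing between CK_k and CK_(k+1 mod 8).
    Assumes the falling edge of CK_n is aligned to the rising edge of CK_(n+4).
    """
    if len(spacings) != 8:
        raise ValueError("expected eight adjacent spacings")
    return sum(spacings[(n + k) % 8] for k in range(4))


def duty_cycle_percent(spacings, n=0):
    """Duty cycle of CK_n, given that the eight spacings sum to the period T."""
    return 100.0 * high_time(spacings, n) / sum(spacings)


if __name__ == "__main__":
    t_octa = 125.0 / 8  # T/8 in ps for an 8-GHz clock
    print(duty_cycle_percent([t_octa] * 8))
    # Hypothetical residual skew: first spacings slightly long, last ones short.
    skewed = [16.0, 16.0, 16.0, 16.0, 15.25, 15.25, 15.25, 15.25]
    print(duty_cycle_percent(skewed))
```

With ideal spacings the duty cycle is exactly 50%, while any residual phase error in the four spacings shows up directly as duty-cycle error. This is why the DCC loop relies on the OEC results.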

### 4.3 OVERALL ARCHITECTURE



Fig. 4.5. Overall block diagram of proposed octa-phase clock corrector.

The overall architecture of the proposed octa-phase clock corrector is depicted in Fig. 4.5. Both the OEC path and the DCC path include a clock selector, which is composed of a dedicated 8:1 MUX, a shared 8:1 MUX, and a shared selection logic generator, followed by a sense-amplifier-based BBPD and a digital loop filter (DLF). Additionally, there is an octa-delay line (3T/8) for the OEC and an edge converter (EC) for the DCC. The MUX is based on three-stage tri-state inverters with a 5-bit control, as shown in Fig. 4.6(a). The clock control cell is composed of a delay adjuster and a duty-cycle adjuster. The DLF modifies the clock control cell code according to the look-up table (LUT).

The clock is sequentially selected by the clock selector, which is composed of MUXes and the selection logic generator, to perform clock correction. For phase comparison, the 8:1 MUXes in the OEC loop have input connections from $CK_0$ to $CK_7$ and from $CK_3$ to $CK_2$. To share the MUX used in the OEC, the 8:1 MUX in the DCC path has an input sequence, $CK_7$ to $CK_0$, that corresponds to that of the OEC loop. The input configurations of the three MUXes are shown in Fig. 4.6(b). The DLF filters the BBPD outputs and adjusts the clock control cells accordingly. To avoid mutual pulling between the two calibration loops, the bandwidth of the OEC loop is set to eight times that of the DCC loop. As mentioned above, the results of the OEC loop also aid the duty-cycle correction.

Fig. 4.6. Schematic of (a) 8:1 MUX and (b) MUX input configuration at clock correction path.

## 4.4 CIRCUIT IMPLEMENTATIONS

#### 4.4.1 EDGE CONVERTER



Fig. 4.7. Simulation results on rising and falling edges in terms of PN ratio in 40-nm GP CMOS technology at the TT corner.

This prototype chip utilizes the BBPD described in Section 3.4.3, which detects the rising-edge transitions of its input signals. Therefore, the EC in the DCC path converts the falling edge of the clock to a rising edge so that the same BBPD as in the OEC can be used. The schematic of the EC is illustrated in Fig. 4.8. The main design difficulty is that the rising-edge and falling-edge delays differ at high frequencies with a PN ratio of 2:1 in CMOS design. Fig. 4.7 depicts the simulated rising-edge and falling-edge delays in TSMC 40-nm GP CMOS technology at the TT corner. These simulation results indicate that the appropriate PN ratio is 1.7:1 rather than 2:1. However, to simplify the design, a PN ratio of 2:1 has been selected. To match the falling edge to rising edge $(T_{F2R})$ delay and the rising edge to rising edge $(T_{R2R})$ delay through an 8:1



Fig. 4.8. Schematic of Edge converter.



Fig. 4.9. (a) Internal clock propagation path in DCC path and (b)Monte Carlo simulation result of each path.

MUX and an EC as shown in Fig. 4.9(a), a skewed inverter is utilized in the EC. The post-layout Monte Carlo simulation (200 runs) of the delay through the MUX and the EC is shown in Fig. 4.9(b). The average delay difference between the two paths is only 0.05 ps, which corresponds to a duty-cycle error of 0.04%.
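The conversion from a path delay mismatch to a duty-cycle error is a simple ratio to the clock period, which the Python sketch below reproduces for the 8-GHz operating point. The function name is illustrative.

```python
def duty_error_percent(delay_mismatch_ps, freq_ghz):
    """Duty-cycle error (%) caused by a delay mismatch between the two paths."""
    period_ps = 1000.0 / freq_ghz
    return 100.0 * delay_mismatch_ps / period_ps


if __name__ == "__main__":
    # 0.05 ps average mismatch at 8 GHz (T = 125 ps)
    print(duty_error_percent(0.05, 8.0))
```

The same ratio shows that each picosecond of mismatch between the $T_{F2R}$ and $T_{R2R}$ paths costs 0.8% of duty cycle at 8 GHz, which is why the EC delay matching matters.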

#### 4.4.2 LOOP FILTER

The proposed architecture includes two control loops, the OEC loop and the DCC loop, and their respective bandwidths must be properly configured to avoid interference between them. In addition, the DCC loop depends on the corrected results provided by the OEC loop. Therefore, the gain of the DCC loop, $K_{DCC}$, must be lower than that of the OEC loop, $K_{OEC}$. The simulated phase difference correction behavior for various values of $K_{DCC}$, when the clock adjuster cell controls both rising and falling edges, is shown in Fig. 4.10, Fig. 4.11, and Fig. 4.12.



Fig. 4.10. Simulated result of phase difference behavior when  $K_{OEC} = 2^{-3} \cdot K_{DCC}$ .

Fig. 4.10 shows the simulated result when $K_{OEC} = 2^{-3} \cdot K_{DCC}$. It illustrates the case where the DCC loop disturbs the OEC loop because the duty cycle is adjusted by altering both the rising and falling edges of the clock. Since the DCC loop modifies the rising edges, the phase difference shows a maximum dither of 8.6 ps in the phase-corrected state. This result highlights the design constraint mentioned at the beginning of this section, namely that the gain of the DCC loop should be lower than that of the OEC loop. Furthermore, Fig. 4.11 and Fig. 4.12 show



Fig. 4.11. Simulated result of phase difference behavior when  $K_{OEC} = 2^0 \cdot K_{DCC}$ .



Fig. 4.12. Simulated result of phase difference behavior when  $K_{OEC} = 2^3 \cdot K_{DCC}$ .

the cases of $K_{OEC} = 2^0 \cdot K_{DCC}$ and $K_{OEC} = 2^3 \cdot K_{DCC}$, where the gain of the DCC loop is equal to or lower than that of the OEC loop. The maximum dither after the phase correction has settled is 1.5 ps and 1.3 ps, respectively. Considering the trade-off between the minimum phase-difference dither and the shortest latency, $K_{OEC}$ is set to eight times $K_{DCC}$, as mentioned in Section 4.3. The flow chart of the OEC loop filter is the same as that of the OEC chip in Fig. 3.22, and the flow chart of the DCC loop filter is shown in Fig. 4.13.



Fig. 4.13. Flow chart of DCC loop filter.

### 4.4.3 CLOCK ADJUSTER



Fig. 4.14. Block diagrams of (a) clock control cell and (b) delay and duty control unit cell.

The clock control cell consists of two parts, a delay controller and a duty controller, as in Fig. 4.14(a). Each controller has two coarse stages and two fine stages. The delay controller utilizes a current-mirror type current-starved inverter. To improve linearity for phase detection accuracy, a thermometer code with 3 coarse bits and 3 fine bits is utilized, offering a resolution of 250 fs per LSB. The duty-cycle adjuster controls the amount of current that is sunk or sourced at the inverter output stages. To avoid interference between the OEC loop and the DCC loop, the DCC part adjusts only the falling edge because it uses phase-corrected information from the OEC. Fig. 4.15 depicts the



Fig. 4.15. Timing diagram of a malfunctioning case in which the clock control cell adjusts both clock edges to control duty-cycle.

potentially malfunctioning case when the DCC loop adjusts both edges of the clock. For instance, consider a case in which the duty cycle of $CLK_0$ is greater than 50% and the phase skew between $CLK_0$ and $CLK_3$ is already well corrected. In this case, the comparison with $CLK_4$ detects that the duty cycle of $CLK_0$ exceeds 50%, and the loop reduces it by moving both the rising and falling edges inward. However, as a result of this adjustment, the interval between the rising edges of $CLK_0$ and $CLK_3$ decreases. To compensate for this discrepancy, the OEC loop keeps increasing the delay of $CLK_3$ to realign the rising edges of the two signals. As the delay continues to increase, the clock control cell code may overflow, or the loop may deviate from the minimum total delay tracked by the FSM employed in [41], ultimately resulting in suboptimal outcomes. Fig. 4.16 illustrates the post-layout simulation results of the duty-cycle control behavior of the clock control cell. The duty cycle is managed by a 6-bit control word with 2 bits for coarse control and 4 bits for fine control.



Fig. 4.16. Post layout simulation result of the duty-cycle adjustment component in the clock control cell.

# Chapter 5

# **MEASUREMENT RESULTS**

## 5.1 OVERVIEW

# 5.2 AN 8-GHZ OCTA-PHASE ERROR CORRECTOR WITH COPRIME SPACING

The prototype chip of the OEC (prototype 1) is fabricated in 40-nm CMOS technology with an active area of 0.0814 mm<sup>2</sup>. The chip die photo is shown in Fig. 5.2(a). For testing purposes, the eight-phase clock is generated by dividing the quadrature clock [19], which is produced by feeding an external differential 16-GHz clock to an internal open-loop quadrature generator [37]. The eight-phase generator is shown in Fig. 5.1. To extend the bandwidth, feedforward inverters have been added to the input



Fig. 5.1. Schematic of quadrature divider based eight-phase generator.

and output nodes of latches [85], [86], [48].

To evaluate the performance of the proposed OEC, programmable delay cells are added to control the initial skew of each path of the eight-phase clock. One of the eight error-corrected output clocks is selected by a MUX for external monitoring. The detailed measurement setup is shown in Fig. 5.3. The phase calibration loop consumes 10.8 mW at a 0.9-V supply, and the power breakdown of the calibration loop is presented in Fig. 5.2(b). The total power consumption of the prototype chip is 30.6 mW, which is reduced to 19.8 mW when the calibration loop is turned off after reaching the steady state. The remaining power is consumed by the eight main DCDLs, the internal clock distribution buffers, and the clock gating circuits. The measured delay curve of the main DCDLs is shown in Fig. 5.5(a), achieving an average resolution of 500 fs per LSB. Also, in Fig. 5.5(b), the measured phase correction behavior for five different cases is plotted relative to the ideal eight-phase clock positions. The phase error is measured in terms of the negative edge. The maximum time error at the corrected output is measured as 0.95 ps, while the maximum correctable input phase






Fig. 5.4. Measurement setting.



Fig. 5.5. Measurement results of the (a) delay curve of main DCDLs and (b) clock position plot before and after correction.

difference is 11.8 ps. Fig. 5.6 shows the selected case from Fig. 5.5(b) measured by an oscilloscope. Fig. 5.6(a) and Fig. 5.6(b) show the uncorrected and corrected octa-phase clock waveforms, respectively. The maximum input phase difference of 11.8 ps, observed for PD5, is corrected to 16.3 ps, leaving a residual phase error of 0.7 ps. The measured RMS and peak-to-peak jitter of the output clock are 1.01 ps<sub>RMS</sub> and 8.0 ps<sub>P-P</sub> when the calibration is on, and 0.99 ps<sub>RMS</sub> and 7.6 ps<sub>P-P</sub> when the calibration is off, as shown in Fig. 5.6(c) and Fig. 5.6(d). Thus, the RMS jitter contributed by the OEC loop is 0.2 ps<sub>RMS</sub>.
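The 0.2 ps<sub>RMS</sub> figure is consistent with a root-difference-of-squares calculation on the two measured jitter values, which the Python sketch below reproduces. It assumes that the jitter added by the calibration loop is uncorrelated with the jitter already present when the calibration is off; the function name is illustrative.

```python
import math


def added_rms_jitter(jitter_on_ps, jitter_off_ps):
    """RMS jitter added by the loop, assuming uncorrelated jitter sources."""
    if jitter_on_ps < jitter_off_ps:
        raise ValueError("jitter with calibration on is below the baseline")
    return math.sqrt(jitter_on_ps ** 2 - jitter_off_ps ** 2)


if __name__ == "__main__":
    # Measured values: 1.01 ps_RMS (calibration on), 0.99 ps_RMS (off)
    print(round(added_rms_jitter(1.01, 0.99), 3))
```

Subtracting the RMS values directly would give only 0.02 ps, so the quadrature subtraction is what yields the reported contribution.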

Table 5.1 compares the performance of the proposed OEC with recently published phase correction designs.





Table 5.1. Performance summary and comparison of prototype 1.

|                                                                              | This Work                       | SSCL'2020 [39] | ISSCC'2020 [41]                | TCAS2'2017 [36] | ISSCC'2009 [38] |
|------------------------------------------------------------------------------|---------------------------------|----------------|--------------------------------|-----------------|-----------------|
| Technology (nm)                                                              | 40                              | 65             | 40                             | 65              | 45              |
| VDD (V)                                                                      | 0.9                             | 1.0            | 1.1                            | 1.0             | 6.0             |
| Architecture                                                                 | Digital DLL                     | SER & PSDL     | Digital DLL                    | Digital DLL     | Open-loop       |
| Frequency (GHz)                                                              | 8                               | 3.2            | 0.8 - 2.3                      | 1.25            | 0.8 - 5         |
| Phase                                                                        | 8                               | 4              | 4                              | 4               | 7               |
| RMS jitter contribution (ps<sub>RMS</sub>) (Normalized RMS jitter contribution\*) | 0.2 (1.6)                       | 0.42 (1.33)    | 0.53 (1.22)                    | 1.73 (2.16)     | 0.48 (2.4)      |
| Calibration loop power (mW)                                                  | 10.8                            | N/A            | 5.12                           | N/A             | 5.4             |
| Power (mW)                                                                   | 30.6 (cal on)<br>19.8 (cal off) | 6.72           | 8.89 (cal on)<br>3.9 (cal off) | 2.27            | 5.4 @ 5 GHz     |
| Area (mm<sup>2</sup>)                                                        | 0.0814                          | 0.01           | 0.0428                         | 0.01            | 0.0035          |
| Correctable phase error range (Normalized correctable phase error range\*\*) | 34.1° (75.5)                    | 19.47° (21.6)  | 84.1° @ 2.3 GHz (93.4)         | 8.7° (9.7)      | N/A             |
| FoM<sub>1</sub>\*\*\*                                                        | 0.056                           | 0.097          | 0.041                          | 0.187           | N/A             |
| FoM<sub>2</sub>\*\*\*\*                                                      | 0.007                           | 0.024          | 0.01                           | 0.047           | N/A             |

\* RMS jitter contribution (ps<sub>RMS</sub>) $\cdot$ clock frequency (GHz)

\*\* Correctable phase error range (°) / 360 (°) $\cdot$ 100

\*\*\* FoM<sub>1</sub> = Power (mW) / clock frequency (GHz) $\cdot$ normalized correctable phase error range

\*\*\*\* FoM<sub>2</sub> = Power (mW) / clock frequency (GHz) $\cdot$ phase $\cdot$ normalized correctable phase error range
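Two of the normalizations used in Table 5.1 can be checked with a short Python sketch: the normalized RMS jitter contribution (jitter multiplied by clock frequency) and the relation between the two figures of merit. The sketch reads FoM<sub>2</sub> as FoM<sub>1</sub> divided by the number of phases, an interpretation of the footnotes that agrees with the tabulated values; the function names are illustrative.

```python
def normalized_jitter(jitter_ps_rms, freq_ghz):
    """RMS jitter contribution normalized by the clock frequency."""
    return jitter_ps_rms * freq_ghz


def fom2_from_fom1(fom1, n_phases):
    """FoM2 taken as FoM1 divided by the phase count (assumed reading)."""
    return fom1 / n_phases


# (label, jitter ps_RMS, frequency GHz, phases, FoM1) taken from Table 5.1
ROWS = [
    ("This Work", 0.2, 8.0, 8, 0.056),
    ("SSCL'2020 [39]", 0.42, 3.2, 4, 0.097),
    ("ISSCC'2020 [41]", 0.53, 2.3, 4, 0.041),
    ("TCAS2'2017 [36]", 1.73, 1.25, 4, 0.187),
]

if __name__ == "__main__":
    for label, jitter, freq, phases, fom1 in ROWS:
        print(label,
              round(normalized_jitter(jitter, freq), 2),
              round(fom2_from_fom1(fom1, phases), 3))
```

The ISSCC'2009 [38] entry is omitted because its figures of merit are listed as N/A.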

# 5.3 AN 8-GHZ OCTA-PHASE CLOCK CORRECTOR WITH PHASE AND DUTY-CYCLE CORRECTION

The proposed octa-phase clock corrector (prototype 2), shown in Fig. 5.7(a), is fabricated in 40-nm CMOS technology. The chip occupies an active area of 0.047 mm<sup>2</sup>. The design operates with an 8-GHz clock at a 1.0-V supply. The power breakdown of the corrector is presented in Fig. 5.7(b). The eight clock control cells consume 19.86 mW and the clock calibration logic consumes 17.1 mW. To minimize mismatch in the output measurement path, as shown in Fig. 5.9, the output clock is selected with an 8:1 MUX and measured through a single shared output clock driver. An additional CML-based eight-phase input phase interpolator (PI) is implemented to demonstrate the performance of the clock corrector. The schematic of the PI unit slice is shown in Fig. 5.8.

The measured delay curve of the clock control cell is shown in Fig. 5.10(a), with a peak-to-peak range of 16 ps and an average resolution of 500 fs per LSB. In addition, the duty-cycle adjustment range is shown in Fig. 5.10(b). The duty-cycle control component has a peak-to-peak control range of 10% and an average resolution of 0.26% per LSB. PD<sub>n</sub> represents the phase difference between CK<sub>n</sub> and CK<sub>n+3</sub>, while D<sub>n</sub> represents the duty cycle of CK<sub>n</sub>. Fig. 5.11(a) and Fig. 5.11(c) show the uncorrected results, and Fig. 5.11(b) and Fig. 5.11(d) show the corrected results. The maximum residual phase error is 0.64 ps and the maximum residual duty-cycle error is 1.1%. The changes in the DNL and INL of the PI caused by the clock calibration are illustrated in Fig. 5.12. Furthermore, the jitter contribution of the corrector is shown in Fig. 5.13. The peak-to-peak jitter increased by about 1.4 ps<sub>P-P</sub>, and the RMS jitter increased by about 0.19 ps<sub>RMS</sub>.

|   | Block                                                | Power (mW) |
|---|------------------------------------------------------|------------|
| A | Clock Selector\*, Octa-Delay line, EC and BBPD x 2   | 15.02      |
| B | 2:16 DES                                             | 1.58       |
| C | DLF                                                  | 0.46       |
| D | Clock Control Cells                                  | 19.86      |
|   | Total Calibration Power\*\* @ 8 GHz                  | 17.1       |

\* 8:1 MUX x 3, MUX selection logic

\*\* Clock Control Cells power not included

Fig. 5.7. (a) Chip microphotograph of the proposed clock corrector and (b) calibration power breakdown.



Fig. 5.8. Schematic of PI unit slice.



Fig. 5.9. Schematic of output measure network.



Fig. 5.10. Measurement results of clock control cell, (a) delay control and (b) duty-cycle control.



Fig. 5.11. Measurement results of (a) uncorrected phase difference, (b) corrected phase difference, (c) uncorrected duty-cycle, and (d) corrected duty-cycle.



Fig. 5.12. Measurement results of the PI output. (a) DNL before correction and (b) after correction. (c) INL before correction and (d) after correction.





Table 5.2. Performance summary and comparison of prototype 2.

|                                            | This Work                         | ESSCIRC'2021 [46] | TVLSI'2019 [45]       |
|--------------------------------------------|-----------------------------------|-------------------|-----------------------|
| Technology (nm)                            | 40                                | 28                | 55                    |
| VDD (V)                                    | 1.0                               | 1.0               | 1.2                   |
| Architecture                               | Digital DLL                       | Digital DLL       | Relaxation Oscillator |
| Frequency (GHz)                            | 8                                 | 0.8 - 3.2         | 1 - 3                 |
| Phase                                      | 8                                 | 4                 | 4                     |
| Calibration disable mode                   | O                                 | X                 | X                     |
| Max. duty-cycle error after correction (%) | 1.1                               | N/A               | 0.8 @ 3 GHz           |
| Max. phase error after correction (ps)     | 0.64                              | < 1.59            | 1.03 @ 3 GHz          |
| Power (mW)                                 | 36.95 (cal on)<br>20.58 (cal off) | 9.8 @ 3.2 GHz     | 2.08                  |
| Area (mm<sup>2</sup>)                      | 0.047                             | 0.01              | 0.003                 |

## **Chapter 6**

## CONCLUSION

In this dissertation, a high-speed octa-phase clock corrector based on a digital delay-locked loop (DDLL), comprising an octa-phase error corrector (OEC) loop and a duty-cycle corrector (DCC) loop, has been proposed. The proposed OEC corrects the 8-GHz clock using a coprime-spacing phase comparison to alleviate timing constraints. To optimize the power consumed in delay generation, a spacing of 3T/8 is selected for phase detection. The OEC chip (prototype 1) occupies an active area of 0.081 mm<sup>2</sup> in 40-nm CMOS technology, and its calibration logic consumes 10.8 mW at a 0.9-V supply while achieving a maximum phase error of 0.7 ps.
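The coprime-spacing idea can be illustrated with a short sketch. The code below is not the circuit described in this dissertation. It assumes a simplified model in which one comparison with spacing s·T/8 relates phase k to phase (k + s) mod 8, and it checks which values of s let a chain of such comparisons reach all eight phases. The function names `spacing_ps`, `comparison_order`, and `covers_all_phases` are hypothetical.

```python
from math import gcd

N_PHASES = 8                     # octa-phase clock
F_CLK_HZ = 8e9                   # 8-GHz clock, as stated in the text
PERIOD_PS = 1e12 / F_CLK_HZ      # clock period T in picoseconds (125 ps)


def spacing_ps(step: int) -> float:
    """Time spacing of a step*T/8 comparison, in picoseconds."""
    return step * PERIOD_PS / N_PHASES


def comparison_order(step: int, n: int = N_PHASES) -> list:
    """Phase indices reached by repeatedly moving `step` phases ahead (mod n).

    Assumed model: one comparison ties phase k to phase (k + step) mod n.
    """
    order, k = [], 0
    for _ in range(n):
        order.append(k)
        k = (k + step) % n
    return order


def covers_all_phases(step: int, n: int = N_PHASES) -> bool:
    """True if the chain of comparisons touches every one of the n phases."""
    return len(set(comparison_order(step, n))) == n


if __name__ == "__main__":
    # Columns: step, coprime to 8?, reaches all phases?, spacing in ps.
    for step in range(1, N_PHASES):
        print(step, gcd(step, N_PHASES) == 1,
              covers_all_phases(step), spacing_ps(step))
```

Under this model, steps of 1, 3, 5, and 7 reach every phase, whereas even steps only reach a subset. At 8 GHz, a 3T/8 spacing corresponds to 46.875 ps, compared with 15.625 ps for T/8, which is where the relaxed timing constraint comes from.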

In addition, a chip that performs both OEC and DCC operation (prototype 2) is proposed. The DCC loop utilizes the OEC correction results to calibrate the duty-cycle error without additional delay lines for edge comparison. To match the rising-edge and falling-edge delays in a design with a 2:1 CMOS PN ratio, an edge converter (EC) with a skewed inverter is implemented. The prototype chip is also implemented in 40-nm CMOS technology with an active area of 0.047 mm<sup>2</sup> and consumes 19.86 mW at a 1.0-V supply. The test chip exhibits a maximum residual phase error of 0.64 ps and a duty-cycle error of 1.1%.
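For a sense of scale, the residual errors quoted above can be related to the 8-GHz clock period. The sketch below assumes that the duty-cycle error is expressed as a percentage of the clock period, and it compares the phase error with the ideal T/8 phase spacing. Both interpretations and the function names are assumptions made for illustration, not statements taken from the measurements.

```python
F_CLK_HZ = 8e9                       # 8-GHz clock, as stated in the text
PERIOD_PS = 1e12 / F_CLK_HZ          # clock period T in picoseconds
IDEAL_SPACING_PS = PERIOD_PS / 8     # ideal octa-phase spacing T/8


def duty_error_to_ps(duty_error_percent: float,
                     period_ps: float = PERIOD_PS) -> float:
    """Convert a duty-cycle error in percent of the period to picoseconds.

    Assumption: the percentage is referred to the full clock period.
    """
    return duty_error_percent / 100.0 * period_ps


def phase_error_vs_spacing(phase_error_ps: float,
                           spacing_ps: float = IDEAL_SPACING_PS) -> float:
    """Phase error as a fraction of the ideal phase spacing."""
    return phase_error_ps / spacing_ps


if __name__ == "__main__":
    print(f"1.1 % duty-cycle error = {duty_error_to_ps(1.1):.3f} ps")
    print(f"0.64 ps phase error = {100 * phase_error_vs_spacing(0.64):.2f} % of T/8")
```

Under these assumptions, a 1.1% duty-cycle error corresponds to about 1.375 ps of pulse-width deviation, and a 0.64-ps phase error is roughly 4.1% of the 15.625-ps ideal spacing.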

## **Bibliography**

- [1] Cisco. "Cisco Global Cloud Index: Forecast and Methodology, 2016–2021". Accessed: Apr. 23, 2023. [Online]. Available: https://www.cisco.com/site/us/en/solutions/index.html
- [2] Sandvine. "Global Internet Phenomena". Accessed: Apr. 23, 2023. Additional data: https://www.statista.com/chart/15692/distribution-of-global-downstreamtraffic/. [Online]. Available: https://www.sandvine.com/phenomena
- [3] IEEE P802.3ck 400Gb/s Ethernet Task Force. Accessed: Mar. 22, 2023. [Online]. Available: http://www.ieee802.org/3/ck/
- [4] Samsung. "Samsung Electronics Envisions Hyper Growth in Memory and Logic Semiconductors through Intensified Industry Collaborations at Samsung Tech Day 2022". Accessed: May. 04, 2023. [Online]. Available: https://news.samsung.com/global/samsung-electronics-envisionshyper-growth-in-memory-and-logic-semiconductors-through-intensifiedindustry-collaborations-at-samsung-tech-day-2022
- [5] PCI-SIG. "PCI-SIG: The organization that defines and develops the PCI Express (PCIe) standard". Accessed: May. 04, 2023. [Online]. Available: https://pcisig.com/

- [6] S. Mirabbasi, L. C. Fujino, and K. C. Smith, "Through the Looking Glass—The 2023 Edition: Trends in solid-state circuits from ISSCC," *IEEE Solid-State Circuits Magazine*, vol. 15, no. 1, pp. 45–62, 2023.
- [7] C.-K. K. Yang, "Design of High-Speed Serial Links in CMOS," Ph.D. dissertation, Stanford University, 1998.
- [8] M. Horowitz, C.-K. K. Yang, and S. Sidiropoulos, "High-speed electrical signaling: Overview and limitations," *IEEE Micro*, vol. 18, no. 1, pp. 12–24, 1998.
- [9] B. Raghavan, D. Cui, U. Singh, H. Maarefi, D. Pi, A. Vasani, Z. Huang, A. Momtaz, and J. Cao, "A sub-2W 39.8-to-44.6Gb/s transmitter and receiver chipset with SFI-5.2 interface in 40nm CMOS," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, 2013, pp. 32–33.
- [10] K. Gopalakrishnan, A. Ren, A. Tan, A. Farhood, A. Tiruvur, B. Helal, C.-F. Loi, C. Jiang, H. Cirit, I. Quek *et al.*, "A 40/50/100Gb/s PAM-4 Ethernet transceiver in 28nm CMOS," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, 2016, pp. 62–63.
- [11] C. Menolfi, T. Toifl, R. Reutemann, M. Ruegg, P. Buchmann, M. Kossel, T. Morf, and M. Schmatz, "A 25Gb/s PAM4 transmitter in 90nm CMOS SOI," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, 2005, pp. 72–73.
- [12] C. Menolfi, T. Toifl, P. Buchmann, M. Kossel, T. Morf, J. Weiss, and M. Schmatz, "A 16Gb/s source-series terminated transmitter in 65nm CMOS SOI," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, 2007, pp. 446–614.
- [13] C. Menolfi, J. Hertle, T. Toifl, T. Morf, D. Gardellini, M. Braendli, P. Buchmann, and M. Kossel, "A 28Gb/s Source-Series Terminated TX in 32nm CMOS SOI," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, 2012, pp. 334–336.

- [14] J. Kim, J.-K. Kim, B.-J. Lee, M.-S. Hwang, H.-R. Lee, S.-H. Lee, N. Kim, D.-K. Jeong, and W. Kim, "Circuit Techniques for a 40Gb/s Transmitter in 0.13μm CMOS," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, 2005, pp. 150–589.
- [15] T. O. Dickson, H. A. Ainspan, and M. Meghelli, "6.5 A 1.8 pJ/b 56Gb/s PAM-4 transmitter with fractionally spaced FFE in 14nm CMOS," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, 2017, pp. 118–119.
- [16] P.-J. Peng, S.-T. Lai, W.-H. Wang, C.-W. Lin, W.-C. Huang, and T. Shih, "6.8 A 100Gb/s NRZ transmitter with 8-Tap FFE using a 7b DAC in 40nm CMOS," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, 2020, pp. 130–132.
- [17] B. Zhang, A. Vasani, A. Sinha, A. Nilchi, H. Tong, L. Rao, K. Khanoyan, H. Hatamkhani, X. Yang, X. Meng *et al.*, "6.1 A 112Gb/s Serial Link Transceiver With 3-tap FFE and 18-tap DFE Receiver for up to 43dB Insertion Loss Channel in 7nm FinFET Technology," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, 2023, pp. 5–7.
- [18] A. Nazemi, K. Hu, B. Catli, D. Cui, U. Singh, T. He, Z. Huang, B. Zhang, A. Momtaz, and J. Cao, "A 36Gb/s PAM4 transmitter using an 8b 18GS/s DAC in 28nm CMOS," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, 2015, pp. 1–3.
- [19] A. A. Hafez, M.-S. Chen, and C.-K. K. Yang, "A 32-48 Gb/s serializing transmitter using multiphase serialization in 65 nm CMOS technology," *IEEE Journal of Solid-State Circuits*, vol. 50, no. 3, pp. 763–775, 2015.

- [20] X. Zheng, C. Zhang, F. Lv, F. Zhao, S. Yue, Z. Wang, F. Li, and Z. Wang, "A 5-50 Gb/s Quarter Rate Transmitter with a 4-Tap Multiple-MUX based FFE in 65 nm CMOS," in *IEEE European Solid State Circuits Conference (ESSCIRC)*, 2016, pp. 305–308.
- [21] Y. Frans, S. McLeod, H. Hedayati, M. Elzeftawi, J. Namkoong, W. Lin, J. Im, P. Upadhyaya, and K. Chang, "A 40-to-64 Gb/s NRZ transmitter with supply-regulated front-end in 16 nm FinFET," *IEEE Journal of Solid-State Circuits*, vol. 51, no. 12, pp. 3167–3177, 2016.
- [22] Z. Toprak-Deniz, J. E. Proesel, J. F. Bulzacchelli, H. A. Ainspan, T. O. Dickson, M. P. Beakes, and M. Meghelli, "A 128-Gb/s 1.3-pJ/b PAM-4 transmitter with reconfigurable 3-tap FFE in 14-nm CMOS," *IEEE Journal of Solid-State Circuits*, vol. 55, no. 1, pp. 19–26, 2019.
- [23] Y. Chang, A. Manian, L. Kong, and B. Razavi, "An 80-Gb/s 44-mW wireline PAM4 transmitter," *IEEE Journal of Solid-State Circuits*, vol. 53, no. 8, pp. 2214–2226, 2018.
- [24] M. Choi, Z. Wang, K. Lee, K. Park, Z. Liu, A. Biswas, J. Han, and E. Alon, "An output-bandwidth-optimized 200Gb/s PAM-4 100Gb/s NRZ transmitter with 5-tap FFE in 28nm CMOS," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, vol. 64, 2021, pp. 128–130.
- [25] P.-J. Peng, Y.-T. Chen, S.-T. Lai, and H.-E. Huang, "A 112-Gb/s PAM-4 voltage-mode transmitter with four-tap two-step FFE and automatic phase alignment techniques in 40-nm CMOS," *IEEE Journal of Solid-State Circuits*, vol. 56, no. 7, pp. 2123–2131, 2020.
- [26] J. Kim, S. Kundu, A. Balankutty, M. Beach, B. C. Kim, S. Kim, Y. Liu, S. K. Murthy, P. Wali, K. Yu *et al.*, "A 224Gb/s DAC-based PAM-4 transmitter with 8-tap FFE in 10nm CMOS," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, vol. 64, 2021, pp. 126–128.

- [27] J. Yang, E. Song, S. Hong, D. Lee, S. Lee, H. Im, T. Shin, and J. Han, "6.8 A 100Gb/s 1.6Vppd PAM-8 Transmitter with High-Swing 3 + 1 Hybrid FFE Taps in 40nm," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, 2023, pp. 122–124.
- [28] Y.-H. Song, H.-W. Yang, H. Li, P. Y. Chiang, and S. Palermo, "26.5 An 8-to-16Gb/s 0.65-to-1.05 pJ/b 2-tap impedance-modulated voltage-mode transmitter with fast power-state transitioning in 65nm CMOS," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, 2014, pp. 446–447.
- [29] C.-K. K. Yang, R. Farjad-Rad, and M. A. Horowitz, "A 0.5-μm CMOS 4.0-Gbit/s serial link transceiver with data recovery using oversampling," *IEEE Journal of Solid-State Circuits*, vol. 33, no. 5, pp. 713–722, 1998.
- [30] H. Lu and C. Su, "A 1.25 to 5Gbps LVDS Transmitter with a Novel Multi-Phase Tree-Type Multiplexer," in *2005 IEEE Asian Solid-State Circuits Conference*, 2005, pp. 389–392.
- [31] W.-S. Choi, G. Shu, M. Talegaonkar, Y. Liu, D. Wei, L. Benini, and P. K. Hanumolu, "A 0.45-to-0.7 V 1-to-6Gb/s 0.29-to-0.58 pJ/b source-synchronous transceiver using automatic phase calibration in 65nm CMOS," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, 2015, pp. 1–3.
- [32] GDDR6 SGRAM Standard: JESD250C. Accessed: Apr. 24, 2023. [Online]. Available: https://www.jedec.org/standards-documents/docs/jesd250c
- [33] C. Kim, H.-W. Lee, and J. Song, *High-Bandwidth Memory Interface*. Heidelberg, Germany: Springer, 2014.

- [34] M.-C. Choi, S. Lee, S. Roh, K. Lee, J. Oh, S. Kim, K. Kim, W.-S. Choi, J. Kim, and D.-K. Jeong, "A 2.5–32 Gb/s Gen 5-PCIe Receiver With Multi-Rate CDR Engine and Hybrid DFE," *IEEE Transactions on Circuits and Systems II: Express Briefs*, vol. 69, no. 6, pp. 2677–2681, 2022.
- [35] K.-J. Hsiao and T.-C. Lee, "A low-jitter 8-to-10GHz distributed DLL for multiple-phase clock generation," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, 2008, pp. 514–632.
- [36] Y. Kim, K. Song, D. Kim, and S. Cho, "A 2.3-mW 0.01-mm2 1.25-GHz quadrature signal corrector with 1.1-ps error for mobile DRAM interface in 65-nm CMOS," *IEEE Transactions on Circuits and Systems II: Express Briefs*, vol. 64, no. 4, pp. 397–401, 2016.
- [37] K.-H. Kim, P. W. Coteus, D. Dreps, S. Kim, S. V. Rylov, and D. J. Friedman, "A 2.6 mW 370MHz-to-2.5 GHz open-loop quadrature clock generator," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, 2008, pp. 458–627.
- [38] K.-H. Kim, D. M. Dreps, F. D. Ferraiolo, P. W. Coteus, S. Kim, S. V. Rylov, and D. J. Friedman, "A 5.4 mW 0.0035 mm<sup>2</sup> 0.48 ps rms-jitter 0.8-to-5GHz non-PLL/DLL all-digital phase generator/rotator in 45nm SOI CMOS," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, 2009, pp. 98–99.
- [39] H. Ko, C. Hyun, J.-H. Chae, G.-M. Hong, and S. Kim, "A 3.2-GHz quadrature error corrector for DRAM transmitters, using replica serializers and pulse-shrinking delay lines," *IEEE Solid-State Circuits Letters*, vol. 3, pp. 38–41, 2020.
- [40] R. Z. Bhatti, M. Denneau, and J. Draper, "Duty cycle measurement and correction using a random sampling technique," in *Proc. IEEE Int. Midwest Symp. Circuits Syst.*, 2005, pp. 1043–1046.

- [41] S. Shin, H.-G. Ko, S. Jang, D. Kim, and D.-K. Jeong, "A 0.8-to-2.3 GHz quadrature error corrector with correctable error range of 101.6 ps using minimum total delay tracking and asynchronous calibration on-off scheme for DRAM interface," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, 2020, pp. 340–342.
- [42] W.-S. Choi, G. Shu, M. Talegaonkar, Y. Liu, D. Wei, L. Benini, and P. K. Hanumolu, "A 0.45–0.7 V 1–6 Gb/s 0.29–0.58 pJ/b source-synchronous transceiver using near-threshold operation," *IEEE Journal of Solid-State Circuits*, vol. 53, no. 3, pp. 884–895, 2018.
- [43] J.-W. Sull, S. Lee, and D.-K. Jeong, "A 10-to-12-GHz Dual Loop Quadrature Clock Corrector in 28-nm CMOS Technology," in *International Technical Conference on Circuits/Systems, Computers and Communications (ITC-CSCC)*, 2022, pp. 571–573.
- [44] J.-W. Sull, S. Shin, J. Oh, K.-H. Lee, J. Kim, J.-H. Park, and D.-K. Jeong, "An 8-GHz Octa-Phase Error Corrector With Coprime Phase Comparison Scheme in 40-nm CMOS," *IEEE Transactions on Circuits and Systems II: Express Briefs*, vol. 69, no. 3, pp. 874–878, 2021.
- [45] J.-H. Chae, H. Ko, J. Park, and S. Kim, "A quadrature clock corrector for DRAM interfaces, with a duty-cycle and quadrature phase detector based on a relaxation oscillator," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 27, no. 4, pp. 978–982, 2019.
- [46] H. Yoon, W. Jung, J. Park, J. Byun, H. Jin, H. Cho, Y. Kim, B. Lim, Y.-C. Cho, Y. Choi *et al.*, "A 3.2-12.8 Gb/s Duty-Cycle Compensating Quadrature Error Corrector for DRAM Interfaces, With Fast Locking and Low Power Characteristics," in *IEEE European Solid State Circuits Conference (ESSCIRC)*, 2021, pp. 463–466.

- [47] Y.-T. Lin, T.-W. Xu, and W.-Z. Chen, "A 50 Gb/s PAM-4 transmitter with feedforward equalizer and background phase error calibration," *IEEE Transactions on Circuits and Systems II: Express Briefs*, vol. 68, no. 8, pp. 2820–2824, 2021.
- [48] A. Varzaghani, B. Bozorgzadeh, J. Lam, A. Goel, X. Yuan, M. Elzeftawi, M. Izad, S. Sarkar, A. Baldisserotto, S.-R. Ryu *et al.*, "A 1-to-112Gb/s DSP-Based Wireline Transceiver with a Flexible Clocking Scheme in 5nm FinFET," in *Symp. VLSI Circuits Dig. Tech. Papers*, 2022, pp. 26–27.
- [49] Y. Jang, S. Bae, and H. Park, "CMOS Digital Duty Cycle Correction Circuit for Multi-Phase Clock," *Electronics Letters*, vol. 39, no. 19, pp. 1383–1384, 2003.
- [50] Y.-J. Min, C.-H. Jeong, K.-Y. Kim, W. H. Choi, J.-P. Son, C. Kim, and S.-W. Kim, "A 0.31–1 GHz fast-corrected duty-cycle corrector with successive approximation register for DDR DRAM applications," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 20, no. 8, pp. 1524–1528, 2011.
- [51] C.-H. Jeong, A. Abdullah, Y.-J. Min, I.-C. Hwang, and S.-W. Kim, "All-digital duty-cycle corrector with a wide duty correction range for DRAM applications," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 24, no. 1, pp. 363–367, 2015.
- [52] K. Agarwal and R. Montoye, "A duty-cycle correction circuit for high-frequency clocks," in *Symp. VLSI Circuits Dig. Tech. Papers*, 2006, pp. 106–107.
- [53] K.-H. Cheng, C.-W. Su, and K.-F. Chang, "A High Linearity, Fast-Locking Pulsewidth Control Loop With Digitally Programmable Duty Cycle Correction for Wide Range Operation," *IEEE Journal of Solid-State Circuits*, vol. 43, no. 2, pp. 399–413, 2008.

- [54] S.-K. Kao and S.-I. Liu, "All-Digital Fast-Locked Synchronous Duty-Cycle Corrector," *IEEE Transactions on Circuits and Systems II: Express Briefs*, vol. 53, no. 12, pp. 1363–1367, 2006.
- [55] M. Zanuso, S. Levantino, C. Samori, and A. Lacaita, "3MHz-BW 3.6GHz Digital Fractional-N PLL with Sub-Gate-Delay TDC, Phase-Interpolation Divider, and Digital Mismatch Cancellation," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, 2010, pp. 476–477.
- [56] R. Z. Bhatti, M. Denneau, and J. Draper, "Phase Measurement and Adjustment of Digital Signals Using Random Sampling Technique," in *IEEE Int. Symp. Circuits and Syst. (ISCAS)*, 2006, pp. 4–pp.
- [57] T.-C. Hsueh, G. Balamurugan, J. Jaussi, S. Hyvonen, J. Kennedy, G. Keskin, T. Musah, S. Shekhar, R. Inti, S. Sen *et al.*, "A 25.6Gb/s Differential and DDR4/GDDR5 Dual-Mode Transmitter with Digital Clock Calibration in 22nm CMOS," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, 2014, pp. 444–445.
- [58] M. Mansuri, B. Casper, and F. O'Mahony, "An On-Die All-Digital Delay Measurement Circuit with 250fs Accuracy," in *Symp. VLSI Circuits Dig. Tech. Papers*, 2012, pp. 98–99.
- [59] Y. Moon, J. Choi, K. Lee, D.-K. Jeong, and M.-K. Kim, "An All-Analog Multiphase Delay-Locked Loop Using a Replica Delay Line for Wide-Range Operation and Low-Jitter Performance," *IEEE Journal of Solid-State Circuits*, vol. 35, no. 3, pp. 377–384, 2000.

- [60] C.-C. Chung and C.-Y. Lee, "A New DLL-Based Approach for All-Digital Multiphase Clock Generation," *IEEE Journal of Solid-State Circuits*, vol. 39, no. 3, pp. 469–475, 2004.
- [61] J. Kim, A. Balankutty, R. K. Dokania, A. Elshazly, H. S. Kim, S. Kundu, D. Shi, S. Weaver, K. Yu, and F. O'Mahony, "A 112 Gb/s PAM-4 56 Gb/s NRZ reconfigurable transmitter with three-tap FFE in 10-nm FinFET," *IEEE Journal of Solid-State Circuits*, vol. 54, no. 1, pp. 29–42, 2018.
- [62] C. F. Poon, W. Zhang, J. Cho, S. Ma, Y. Wang, Y. Cao, A. Laraba, E. Ho, W. Lin, D. Z. Wu *et al.*, "A 1.24-pJ/b 112-Gb/s (870 Gb/s/mm) transceiver for in-package links in 7-nm FinFET," *IEEE Journal of Solid-State Circuits*, vol. 57, no. 4, pp. 1199–1210, 2022.
- [63] J. Doernberg, H.-S. Lee, and D. A. Hodges, "Full-speed testing of A/D converters," *IEEE Journal of Solid-State Circuits*, vol. 19, no. 6, pp. 820–827, 1984.
- [64] T. Rahkonen, J. Kostamovaara, and S. Saynajakangas, "Time interval measurements using integrated tapped CMOS delay lines," in *Proceedings of the 32nd Midwest Symposium on Circuits and Systems*, 1989, pp. 201–205.
- [65] R. B. Staszewski, S. Vemulapalli, P. Vallur, J. Wallberg, and P. T. Balsara, "1.3 V 20 ps time-to-digital converter for frequency synthesis in 90-nm CMOS," *IEEE Transactions on Circuits and Systems II: Express Briefs*, vol. 53, no. 3, pp. 220–224, 2006.
- [66] P. Dudek, S. Szczepanski, and J. V. Hatfield, "A High-Resolution CMOS Time-to-Digital Converter Utilizing a Vernier Delay Line," *IEEE Journal of Solid-State Circuits*, vol. 35, no. 2, pp. 240–247, 2000.

- [67] J.-H. Chae, M. Kim, G.-M. Hong, J. Park, H. Ko, W.-Y. Shin, H. Chi, D.-K. Jeong, and S. Kim, "0.11-2.5 GHz All-digital DLL for Mobile Memory Interface with Phase Sampling Window Adaptation to Reduce Jitter Accumulation," *JSTS: Journal of Semiconductor Technology and Science*, vol. 17, no. 3, pp. 411–424, 2017.
- [68] B. I. Abdulrazzaq, I. Abdul Halin, S. Kawahito, R. M. Sidek, S. Shafie, and N. A. M. Yunus, "A review on high-resolution CMOS delay lines: towards sub-picosecond jitter performance," *SpringerPlus*, vol. 5, pp. 1–32, 2016.
- [69] T. Sakurai, "CMOS Inverter Delay and Other Formulas Using α-power Law MOS Model," in *1988 IEEE International Conference on Computer-Aided Design*, 1988, pp. 74–75.
- [70] W. Bae, "Supply-Scalable High-Speed I/O Interfaces," *Electronics*, vol. 9, no. 8, p. 1315, 2020.
- [71] C.-K. K. Yang *et al.*, "Delay-locked loops-an overview," *Phase-Locking in High-Performance Systems: From Devices to Architectures*, pp. 13–22, 2003.
- [72] H. Kim, H.-S. Oh, W. Jung, Y. Song, J. Oh, and D.-K. Jeong, "A 100MHz-Reference, 8GHz/16GHz, 177fsrms/223fsrms RO-Based IL-ADPLL Incorporating Reference Octupler with Probability-Based Fast Phase-Error Calibration," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, vol. 65, 2022, pp. 1–3.
- [73] F. Baronti, D. Lunardini, R. Roncella, and R. Saletti, "A Self-Calibrating Delay-Locked Delay Line With Shunt-Capacitor Circuit Scheme," *IEEE Journal of Solid-State Circuits*, vol. 39, no. 2, pp. 384–387, 2004.

- [74] D. Jeong, G. Borriello, D. Hodges, and R. Katz, "Design of PLL-based clock generation circuits," *IEEE Journal of Solid-State Circuits*, vol. 22, no. 2, pp. 255–261, 1987.
- [75] M. Maymandi-Nejad and M. Sachdev, "A Monotonic Digitally Controlled Delay Element," *IEEE Journal of Solid-State Circuits*, vol. 40, no. 11, pp. 2212–2219, 2005.
- [76] D.-H. Oh, D.-S. Kim, S. Kim, D.-K. Jeong, and W. Kim, "A 2.8Gb/s All-Digital CDR with a 10b Monotonic DCO," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, 2007, pp. 222–598.
- [77] T. O. Dickson, Z. T. Deniz, M. Cochet, T. J. Beukema, M. Kossel, T. Morf, Y.-H. Choi, P. A. Francese, M. Brändli, C. W. Baks *et al.*, "A 72-GS/s, 8-Bit DAC-Based Wireline Transmitter in 4-nm FinFET CMOS for 200+ Gb/s Serial Links," *IEEE Journal of Solid-State Circuits*, 2022.
- [78] Y. Segal, A. Laufer, A. Khairi, Y. Krupnik, M. Cusmai, I. Levin, A. Gordon, Y. Sabag, V. Rahinski, G. Ori *et al.*, "A 1.41 pJ/b 224Gb/s PAM-4 SerDes receiver with 31dB loss compensation," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, vol. 65, 2022, pp. 114–116.
- [79] B.-J. Yoo, D.-H. Lim, H. Pang, J.-H. Lee, S.-Y. Baek, N. Kim, D.-H. Choi, Y.-H. Choi, H. Yang, T. Yoon *et al.*, "6.4 A 56Gb/s 7.7 mW/Gb/s PAM-4 wireline transceiver in 10nm FinFET using MM-CDR-Based ADC timing skew control and low-power DSP with approximate multiplier," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, 2020, pp. 122–124.
- [80] S.-J. Lee, B. Kim, and K. Lee, "A novel high-speed ring oscillator for multiphase clock generation using negative skewed delay scheme," *IEEE Journal of Solid-State Circuits*, vol. 32, no. 2, pp. 289–291, 1997.

- [81] L. Sun and T. Kwasniewski, "A 1.25-GHz 0.35-μm Monolithic CMOS PLL Based on a Multiphase Ring Oscillator," *IEEE Journal of Solid-State Circuits*, vol. 36, no. 6, pp. 910–916, 2001.
- [82] Z. Wang and P. R. Kinget, "A 65nm CMOS, 3.5-to-11GHz, Less-Than-1.45 LSB-INL pp, 7b Twin Phase Interpolator with a Wideband, Low-Noise Delta Quadrature Delay-Locked Loop for High-Speed Data Links," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, vol. 65, 2022, pp. 292–294.
- [83] A. Samarah and A. C. Carusone, "A Digital Phase-Locked Loop With Calibrated Coarse and Stochastic Fine TDC," *IEEE Journal of Solid-State Circuits*, vol. 48, no. 8, pp. 1829–1841, 2013.
- [84] J. Hwang, H.-S. Choi, H. Do, G.-S. Jeong, D. Koh, K. Park, S. Kim, and D.-K. Jeong, "A 64Gb/s 2.29 pJ/b PAM-4 VCSEL transmitter with 3-tap asymmetric FFE in 65nm CMOS," in *Symp. VLSI Circuits Dig. Tech. Papers*, 2019, pp. C268–C269.
- [85] I.-F. Sun, J. Yin, P.-I. Mak, and R. P. Martins, "A comparative study of 8-phase feedforward-coupling ring VCOs," *IEEE Transactions on Circuits and Systems II: Express Briefs*, vol. 66, no. 4, pp. 527–531, 2018.
- [86] P.-S. Han, "Analysis of Feedforward Ring Oscillators and Its Application to High-Speed Multiphase Clock Generation," Ph.D. dissertation, Yonsei University, 2009.

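초록 (Abstract in Korean)

This dissertation proposes an octa-phase clock corrector that performs octa-phase error correction and duty-cycle correction at 8 GHz, and examines it with two prototype chips. An 8-GHz octa-phase error corrector (OEC) employing a digital delay-locked loop (DLL) with a coprime phase comparison scheme is proposed. To alleviate the timing constraint during phase comparison, clock phases whose spacing is coprime to 8 are utilized, enabling link operation at up to 64 Gb/s. In particular, this prototype applies clocks spaced by 3*T*/8 rather than *T*/8. In addition, by adopting a clock-divided 5-bit selection scheme, the high-speed 8:2 multiplexer operates seamlessly without glitches. To minimize mismatch and calibration-induced jitter, a single shared phase comparator and a finite-state machine (FSM) that tracks the minimum total delay are employed. The test chip is fabricated in 40-nm CMOS technology with an active area of 0.0814 mm<sup>2</sup>. The proposed octa-phase calibration loop consumes 10.8 mW at 8 GHz from a 0.9-V supply and exhibits a maximum residual phase error of 0.95 ps.

In addition, another prototype is presented that includes an 8-GHz octa-phase clock corrector using a shared clock selector-based digital DLL. The corrector can be divided by function into an octa-phase error corrector (OEC) and a duty-cycle corrector (DCC). The phase error is detected through the 3T/8 delay line, and the duty-cycle error is detected by exploiting opposite-polarity edges in a differential clock without an additional delay line. An edge converter (EC) is designed to match the edge propagation delay through the 8:1 MUX and the EC, achieving a high level of accuracy in duty-cycle calibration. Furthermore, to save power and area, a clock selector composed of a multiplexer and a logic generator is shared between the phase and duty-cycle error detection loops. The prototype chip is fabricated in 40-nm CMOS technology and occupies an active area of 0.047 mm<sup>2</sup>. The total calibration power consumption of the corrector is 17.1 mW at a 1.0-V supply.

**Keywords**: coprime, digital delay line, digital delay-locked loop, phase error, duty-cycle error, phase error corrector, duty-cycle error corrector, multiplexer

**Student Number**: 2018-24582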