



#### PH.D. DISSERTATION

### A DESIGN OF CLOCKING SCHEME WITH WIDE-RANGE DUTY-CYCLE CORRECTOR FOR HBM PHYSICAL LAYER

### Design of a Clocking Circuit Including a Duty-Cycle Corrector with a Wide Operating Frequency Range

BY

JAE WOOK KIM

AUGUST 2023

DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING COLLEGE OF ENGINEERING SEOUL NATIONAL UNIVERSITY

### A DESIGN OF CLOCKING SCHEME WITH WIDE-RANGE DUTY-CYCLE CORRECTOR FOR HBM PHYSICAL LAYER

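Design of a Clocking Circuit Including a Duty-Cycle Corrector with a Wide Operating Frequency Range

Advisor: 김수환

Submitted as a dissertation for the degree of Doctor of Philosophy in Engineering

August 2023

Graduate School of Seoul National University

Department of Electrical and Computer Engineering

김재욱

The doctoral dissertation of 김재욱 is hereby approved

August 2023

| Committee Chair | 정덕균 | (seal) |
|-----------------|--------|--------|
| Vice Chair      | 김수환 | (seal) |
| Member          | 김재하 | (seal) |
| Member          | 최우석 | (seal) |
| Member          | 채주형 | (seal) |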

### ABSTRACT

## A DESIGN OF CLOCKING SCHEME WITH WIDE-RANGE DUTY-CYCLE CORRECTOR FOR HBM PHYSICAL LAYER

JAEWOOK KIM DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING COLLEGE OF ENGINEERING SEOUL NATIONAL UNIVERSITY

In various applications, performance depends on the duty cycle of the clock. As more and more applications require higher bandwidth, the demand for HBM is also increasing. Therefore, it is necessary to develop not only the HBM itself but also the physical layer between the controller and the memory. This thesis explains the design of this physical layer.

Duty-cycle distortion can occur under process and voltage variation or when clock signals pass through clock buffers. Various types of duty-cycle correctors (DCCs) have been proposed. For a wide operating range, it is better to compensate using a half-cycle delay line (HCDL) than a phase-interpolation duty-cycle corrector. However, the half-cycle delay line of the traditional edge-combiner-type DCC requires a large area, which makes the DCC unsuitable for applications operating over a wide range of frequencies. The proposed counter-based HCDL reduces silicon cost by repeating delay lines while maintaining the performance of existing DCCs.

In addition, the FSM block is designed to complete training in 34 cycles so that the DCC operates efficiently over a wide operating range. Measurement results using 65 nm CMOS technology show that the duty-cycle error is less than 0.89% over the 20-80% input duty-cycle range for 50-1600 MHz. The DCC consumes 2.11 mW at 1.6 GHz.

Keywords: DRAM interface, HBM physical layer, duty-cycle corrector, HCDL.

Student Number: 2015-22778

# **CONTENTS**

| ABSTRACT           |                                                                     |
|--------------------|---------------------------------------------------------------------|
| CONTENTS           |                                                                     |
| LIST OF FIGURES    |                                                                     |
| LIST OF TABLES     |                                                                     |
| CHAPTER 1          | INTRODUCTION                                                        |
| 1.1                | MOTIVATION                                                          |
| 1.2                | HBM CONTROLLER PHY                                                  |
| 1.3                | DUTY-CYCLE CORRECTOR                                                |
| 1.3.1              | ANALOG DCC                                                          |
| 1.4                | DIGITAL DCC                                                         |
| 1.4.1              | DESIGN CONSIDERATIONS ON DIGITAL DCC                                |
| 1.4.2              | PRIOR WORKS                                                         |
| 1.5                | SUMMARY                                                             |
| 1.6                | THESIS ORGANIZATION                                                 |
| CHAPTER 2          | HBM CLOCKING SCHEME                                                 |
| 2.1                | CHARACTERISTICS OF HBM CLOCKING SCHEME                              |
| 2.2                | CONCEPTUAL ARCHITECTURE OF HBM CONTROLLER PHY                       |
| CHAPTER 3          | TRAINING OPERATION OF THE PROPOSED DCC WITH COUNTER-BASED HCDL      |
| 3.1                | CONCEPTUAL ARCHITECTURE OF THE PROPOSED DCC WITH COUNTER-BASED HCDL |
| 3.2                | TRAINING OPERATION OF THE DCC WITH COUNTER-BASED HCDL               |
| CHAPTER 4          | NORMAL OPERATION OF THE PROPOSED DCC WITH COUNTER-BASED HCDL        |
| 4.1                | DESIGN CONSIDERATION OF THE PROPOSED DCC WITH COUNTER-BASED HCDL    |
| 4.2                | NORMAL OPERATION OF THE PROPOSED DCC WITH COUNTER-BASED HCDL        |
| 4.2.1              | HALF-DELAY MODE                                                     |
| CHAPTER 5          | ARCHITECTURE AND IMPLEMENTATION                                     |
| 5.1                | OVERALL ARCHITECTURE                                                |
| 5.2                | COUNTER-BASED HCDL                                                  |
| 5.3                | CLOCK PATH                                                          |
| 5.4                | EDGE COMBINER                                                       |
| CHAPTER 6          | MEASUREMENTS RESULTS                                                |
| 6.1                | MEASUREMENT SETUP                                                   |
| 6.2                | MEASUREMENT RESULT OF THE PROPOSED DCC WITH COUNTER-BASED HCDL      |
| CHAPTER 7          | CONCLUSION                                                          |
| APPENDIX A         | HBM CONTROLLER PHY                                                  |
| A.1                | CONTROLLER PHY AND DFI SPECIFICATION                                |
| A.2                | ARCHITECTURE OF HBM CONTROLLER PHY                                  |
| A.3                | DESIGN CONSIDERATION OF THE HBM PHYSICAL LAYER                      |
| BIBLIOGRAPHY       |                                                                     |
| ABSTRACT IN KOREAN |                                                                     |

# LIST OF FIGURES

| Figure 1.1.1. Duty-cycle definition |
|-------------------------------------|
| Figure 1.1.2. Growth of DRAM market |
| Figure 1.1.3. DDR per-pin data-rate trends |
| Figure 1.1.4. Use of ADC according to converter resolution and conversion rate |
| Figure 1.1.5. Noise source in applications |
| Figure 1.2.1. Structure of HBM with controller [1.1.1] |
| Figure 1.2.2. Comparison of GDDR5 and HBM |
| Figure 1.2.3. Memory and controller physical layer |
| Figure 1.2.4. Noise factors in clock path |
| Figure 1.3.1. Basic block diagram of a DCC |
| Figure 1.3.1.1. Simplified (a) block diagram of the analog duty-cycle detector and (b) analog duty-cycle adjuster |
| Figure 1.4.1. Basic block diagram of a digital DCC |
| Figure 1.4.1.1. Block diagram of a digital DCC |
| Figure 1.4.2.1. Block diagram of the digital DCC using PI [1.3.1.4] |
| Figure 1.4.2.2. Block diagram of the digital DCC using HCDL [1.3.2.1] |
| Figure 1.4.2.3. Block diagram of TDC [1.3.1.1] |
| Figure 1.4.2.4. Block diagram of digital logic with binary search [1.3.1] |
| Figure 2.1.1. Timing diagrams of SDR and DDR operation |
| Figure 2.1.2. Timing parameters of CK clock frequency [2.1.1] |
| Figure 2.2.1. Clocking scheme for HBM PHY |
| Figure 3.1.1. Block diagram of the conventional DCC with HCDL |
| Figure 3.1.2. Area breakdown of the conventional DCC with HCDL |
| Figure 3.1.3. Conceptual block diagram of counter-based HCDL |
| Figure 3.2.1. (a) Block diagram of the proposed DCC with counter-based HCDL used in (b) training operation |
| Figure 3.2.2. Algorithm of the proposed DCC |
| Figure 3.2.3. $N_{CNT}$ and DCDL training timing diagram of the proposed DCC with counter-based HCDL |
| Figure 4.1.1. Timing diagram of phase interpolating sequence of a conventional DCC |
| Figure 4.1.2. HCDL DCC architecture (a) feedforward and (b) feedback |
| Figure 4.2.1. Block diagram of the proposed DCC with counter-based HCDL used in normal operation |
| Figure 4.2.2. Timing diagram of the proposed DCC with counter-based HCDL in normal operation |
| Figure 4.2.1.1. Conceptual timing diagram of the half-delay mode (a) off (DH=0) and (b) on (DH=1) |
| Figure 5.1.1. Overall architecture of the proposed DCC with counter-based HCDL |
| Figure 5.2.1. Block diagram of (a) coarse delay line, and (b) fine delay line |
| Figure 5.2.2. Post-layout simulation results of (a) coarse delay line, and (b) fine delay line |
| Figure 5.2.3. Counter-based HCDL with DCDL |
| Figure 5.3.1. Block diagram of clock buffer |
| Figure 5.4.1. Block diagram of edge combiner |
| Figure 5.4.2. (a) Block diagram and (b) timing diagram of rising edge detector |
| Figure 5.4.3. (a) Block diagram and (b) timing diagram of falling edge detector |
| Figure 5.4.4. Timing diagram of edge combiner |
| Figure 6.1.1. (a) Die photograph and block description and (b) measurement setup of the prototype DCC |
| Figure 6.2.1. Area breakdown of the proposed DCC |
| Figure 6.2.2. Measured input and output clock waveform of the proposed DCC at minimum frequency (50 MHz) (a) input duty-cycle (20%), (b) input duty-cycle (80%) |
| Figure 6.2.3. Measured input and output clock waveform of the proposed DCC at maximum frequency (1.6 GHz) (a) input duty-cycle (20%), (b) input duty-cycle (80%) |
| Figure 6.2.4. Measured results of the duty-cycle every 100 MHz |
| Figure 6.2.5. Measured results of the output duty-cycle while sweeping input duty-cycle (a) 50 MHz and (b) 1.6 GHz |
| Figure 6.2.6. Jitter histogram of (a) the input clock and (b) the proposed DCC at 1.6 GHz |
| Figure A.1.1. Bandwidth and data-rate changes per pin for GDDR and HBM [2.1.1] |
| Figure A.1.2. Block diagram of HBM PHY [2.1.2] |
| Figure A.1.3. Block diagram of HBM controller PHY |
| Figure A.1.4. Comparison of DFI clock frequency and HBM controller PHY clock frequency |
| Figure A.2.1. HBM2 single channel signal description [2.2.1] |
| Figure A.2.2. HBM2 (a) row commands and (b) column commands truth table |
| Figure A.2.3. Conceptual architecture of HBM PHY |
| Figure A.3.1. HBM PHY (a) normal mode and (b) test mode measurement setup |

# LIST OF TABLES

Table 6.2.1. Performance comparison with other DCC designs.

## **CHAPTER 1**

### INTRODUCTION

#### **1.1 MOTIVATION**



Figure 1.1.1. Duty-cycle definition.

The ratio between the pulse width and the period of a clock signal is called the duty cycle. The duty cycle is commonly expressed as a percentage or a ratio. As a percentage, it may be expressed as:

$$D = \frac{PW}{T} \times 100\% \tag{1.1.1}$$

where D is the duty cycle, PW is the pulse width (pulse active time), and T is the total period of the signal.
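As a minimal sketch of Equation (1.1.1), the following Python function computes the duty cycle in percent. The function name and the example values (a 625 ps period, corresponding to a 1.6 GHz clock) are illustrative assumptions and are not taken from the measurements in this thesis.

```python
def duty_cycle_percent(pulse_width: float, period: float) -> float:
    """Return D = PW / T * 100 (Eq. 1.1.1); both arguments share one time unit."""
    if period <= 0:
        raise ValueError("period must be positive")
    if not 0 <= pulse_width <= period:
        raise ValueError("pulse width must lie between 0 and the period")
    return 100.0 * pulse_width / period


# Illustrative values only: a 1.6 GHz clock has a period of 625 ps.
period_ps = 625.0
print(duty_cycle_percent(312.5, period_ps))  # ideal clock, half the period is high
print(duty_cycle_percent(125.0, period_ps))  # distorted clock with a short high phase
```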



Figure 1.1.2. Growth of DRAM market.

In various applications, performance is affected by the duty cycle of the clock signal. For example, the market for dynamic random access memory (DRAM) is continuously growing. Figure 1.1.2 shows the growth of the DRAM market from 2015 to 2025 (estimated). This demand is expected to keep increasing as the applications of DRAM expand. Figure 1.1.3 shows the per-pin data-rate trend of DDR. As shown in Figure 1.1.3, the data rate increases with each DDR generation. In addition, because DRAM is not used in only one kind of product, wide-range operation must be possible. Double data-rate (DDR) and quad data-rate (QDR) architectures are now common for



Figure 1.1.3. DDR and HBM per-pin data-rate trends

RAM [1.1.1]. Because these architectures lower the internal clock frequency for a given data rate, the power consumption and the timing margin are relaxed. However, they require an accurate clock duty cycle for sampling or transmitting data [1.1.2].

Similar approaches have been taken in analog-to-digital converters (ADCs). Figure 1.1.4 shows the converter resolution and conversion rate of recent ADCs. Pipelined ADCs, for example, operate at sampling speeds from a few MS/s up to 1 GS/s. Double-sampling techniques are used in various ADCs to reduce power consumption [1.1.3]. This technique is also used in pipelined ADCs to share op-amps, which reduces area [1.1.4].



Figure 1.1.4. Use of ADC according to converter resolution and conversion rate

The duty cycle can become distorted inside these applications because of process and voltage variation, and the distortion is likely to be larger for a more complicated clock tree. As shown in Figure 1.1.5, noise on the clock signal may originate from the supply voltage or from mismatch in the clock distribution. As the clock duty distortion affects



Figure 1.1.5. Noise source in applications

circuit performance, it is necessary to compensate for the duty-cycle distortion. For this reason, these applications introduce a duty-cycle corrector (DCC) to keep the duty cycle of the clock signal accurate [1.1.5].

DCCs are divided into analog [1.1.4], [1.1.5] and digital types according to the method of detecting the duty cycle of the signal. In the analog method, the duty cycle is detected using passive devices such as capacitors and resistors. This method has the advantage of detecting the duty cycle finely. However, because the analog duty-cycle detector (DCD) relies on capacitors, it can take a long time to settle and may suffer duty fluctuations due to leakage current [1.1.6]. In a digital DCC, the duty-cycle information is converted into a digital code by the DCD. A digital DCC is therefore preferred in various applications, because it can retain the corrected duty-cycle information through standby mode or power-down mode [1.1.7].

#### **1.2 HBM CONTROLLER PHY**



Figure 1.2.1. Structure of HBM with controller [1.1.1].

High-performance graphics processing units (GPUs), machine learning, high-performance computing (HPC), and artificial intelligence (AI) are considered important trends of the fourth industrial revolution. Consequently, there is an increasing demand for memory, particularly for memory that offers high bandwidth and excellent energy efficiency. In the past, significant scaling has been achieved not only in the CMOS logic process but also in the memory process, resulting in increased integration and reduced unit prices. However, as scaling continues, the size of capacitors capable of storing charge decreases, and the leakage current of access devices increases, making relaxation more challenging. Therefore, it is widely expected that scaling in the memory process will reach its limits sooner than in the logic process.

Steady research and development efforts are underway for high-performance and high-capacity memories such as High Bandwidth Memory (HBM) and Graphics Double Data Rate (GDDR).

Figure 1.2.2 is a table comparing GDDR5 and HBM memory. GDDR5 memory operates with 32-bit input and output per module and is equipped with a minimum of four (128-bit), eight (256-bit), or sixteen (512-bit) modules, depending on the memory controller of the GPU core. Each module requires a 1.5 V supply, so power consumption increases with the number of modules. This is a challenge for GDDR5, because its power consumption does not decrease in step with the falling power consumption of personal computers. Furthermore, GDDR5 memory has reached a physical limitation where it becomes impractical to place on the large PCB substrate area unique to high-end graphics cards.

On the other hand, HBM has been proposed for various applications, including high-performance computing, servers, networks, cache memory, and high-end graphics memory. It comprises a controller, a logic die, and core dies, with the GPU and HBM transmitting and receiving parallel signals on the silicon interposer. HBM can handle 1024 bits per stack module, allowing for up to four stack modules and a total of 4096 bits. With a voltage of 1.3 V, lower than that of GDDR, the power consumption of graphics cards can be reduced by more than three times. Currently, HBM2E technology is being mass-produced as a commercial product, enabling the implementation of terabit-class bandwidth through 1024 parallel I/Os at a speed of 2.4 Gbps per pin. While DDR interfaces do not anticipate significant improvements in bandwidth, HBM interface technology is still in its early stages, suggesting the possibility of further development.
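The terabit-class figure quoted above follows from simple arithmetic on the numbers in the text (1024 parallel I/Os at 2.4 Gbps per pin). The short Python sketch below only reproduces that arithmetic; the function name is an illustrative assumption.

```python
def aggregate_bandwidth_gbps(io_count: int, rate_gbps_per_pin: float) -> float:
    """Aggregate bandwidth of a parallel interface in Gb/s."""
    return io_count * rate_gbps_per_pin


# Values taken from the paragraph above: one HBM2E stack, 1024 I/Os, 2.4 Gbps/pin.
stack_gbps = aggregate_bandwidth_gbps(1024, 2.4)
print(f"{stack_gbps / 1000:.2f} Tb/s per stack")   # terabit-class
print(f"{stack_gbps / 8:.1f} GB/s per stack")      # same figure in bytes
```

With these inputs the aggregate rate is roughly 2.46 Tb/s, or about 307 GB/s, per stack.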



Figure 1.2.2. Comparison of GDDR5 and HBM [1.2.2].

Memory bottlenecks can be effectively addressed by utilizing next-generation high-speed memory technologies such as HBM2 and HBM2E, which are integrated within server artificial intelligence processor chips. By conducting research and development on core technologies alongside technology standardization efforts, securing next-generation high-speed memory interface technology for server artificial intelligence processors at an early stage can yield significant economic benefits.

As the performance of HBM has improved over generations, the development of the corresponding physical layer has become important. A physical layer exists in both the memory and the controller, as shown in Figure 1.2.3. To develop the next-generation high-speed memory interface physical layer necessary for artificial intelligence technology, it is essential to have a comprehensive



Figure 1.2.3. Memory and controller physical layer.

understanding of the structure and conduct high-level functional verification between HBM internal logic dies, core dies, and host processors. Modeling technology is required for high-level function verification, ensuring compatibility with controllers and memories, as well as specific technologies tailored to the HBM physical layer.

Since HBM memory utilizes a double data-rate structure, the duty cycle of the clock becomes crucial during data transmission. With the gradual increase in data-rate per pin in HBM, the occurrence of clock duty cycle distortion within the circuit is also on the rise. As a result, it has become essential to incorporate a circuit within the HBM physical layer that internally compensates for the clock's duty cycle. Consequently, there is a growing trend of introducing duty-cycle correctors within the clock path. As shown in Figure 1.2.4, noise



Figure 1.2.4. Noise factors in clock path.

or power noise caused by clock distribution in the clock path is one of the biggest factors that distort the duty cycle of the clock [1.2.3].

Therefore, in this paper, we will focus on the design of the interface controller physical layer suitable for next-generation memory and explain the DCC technology required for the design.

#### **1.3 DUTY-CYCLE CORRECTOR**



Figure 1.3.1. Basic block diagram of a DCC.

Figure 1.3.1 shows the basic block diagram of a DCC. The DCC is largely divided into two parts, as shown in Figure 1.3.1. In the figure, the clock input is a signal whose duty cycle is distorted. A duty-cycle detector (DCD) detects the distorted duty cycle of the input signal, and a duty-cycle adjuster (DCA) adjusts the duty cycle based on the detector output to compensate for the distortion. The clock output is therefore a signal whose duty cycle has been corrected by the DCA.

DCCs are divided into analog DCCs and digital DCCs according to how the DCD and DCA detect and adjust the duty cycle. As mentioned in chapter 1.1, the digital DCC is preferred in various applications. The first reason is that it is easy to store duty-cycle information because it does not use passive devices. The duty-cycle information can therefore be retained when entering and leaving power-down mode without additional training. In addition, a digital DCD can detect the duty cycle faster than an analog DCD.

Because of these advantages and disadvantages, it is necessary to choose the type of DCC suitable for the application. The following chapters explain the analog DCC and the digital DCC in detail, and why the digital DCC was chosen in this thesis.

#### 1.3.1 ANALOG DCC






Figure 1.3.1.1. Simplified (a) block diagram of the analog duty-cycle detector and (b) analog duty-cycle adjuster [1.3.1.1].

In the case of the analog DCC, passive elements such as resistors and capacitors are required to detect the duty cycle. Figure 1.3.1.1 shows block diagrams of the analog DCD and the analog DCA [1.3.1.1]. Because detection relies on a capacitor, the analog DCC is suitable for wide-range operation, and because there is no quantization error, relatively accurate duty-cycle correction is possible.

However, when an application enters power-down mode to reduce power consumption and the input clock is no longer applied, the duty-cycle information held on passive devices is lost through leakage. The settling time is also longer than that of a digital DCC. Therefore, a digital DCC is advantageous for applications with a power-down mode, such as DRAM [1.3.1.2].

#### **1.4 DIGITAL DCC**



Figure 1.4.1. Basic block diagram of a digital DCC.

Figure 1.4.1 shows the basic block diagram of a digital DCC. As shown in the figure, the digital DCC uses data digitized by the DCD, which is typically implemented with a time-to-digital converter (TDC) or digital logic. The digital DCC has the advantages of a short lock-in time and the ability to store duty-cycle information, so it is readily used in various applications [1.4.1].

#### **1.4.1 DESIGN CONSIDERATIONS ON DIGITAL DCC**



Figure 1.4.1.1. Block diagram of digital DCC.

Figure 1.4.1.1 shows the basic structure of a digital DCC. The digital DCC can be divided into two blocks, the duty-cycle detector (DCD) and the duty-cycle adjuster (DCA).

The digital DCD uses a TDC [1.4.1.1] or digital logic [1.4.1.2] to obtain the duty-cycle information of the current input signal. The TDC has the disadvantage of a large hardware cost, which makes it unsuitable for wide-range operation, where power dissipation must be kept low [1.4.1.3]. Therefore, digital logic is the better choice for wide-range operation.

The digital DCA mostly uses phase interpolation (PI) [1.4.1.4] or a half-cycle delay line (HCDL) [1.4.1.5] to adjust the duty cycle based on the DCD information. PI has a limitation with respect to wide-range operation: PIs designed for high frequencies have a limited interpolation range. An example is shown in the DCA of Figure 1.4.1.1. Interpolation performance is therefore poor at low frequencies and for large duty-cycle offsets, where the duty offset exceeds this range [1.4.1.4]. The HCDL structure is thus more suitable for wide-range operation. However, the area or power cost of the HCDL increases for wide-range operation, so some problems remain to be solved before it can be applied.

This chapter has explained the basic digital DCC. Chapter 1.4.2 introduces some prior works and identifies the limitations of each structure.

#### 1.4.2 PRIOR WORKS



Figure 1.4.2.1. Block diagram of the digital DCC using PI [1.4.1.4].

Figure 1.4.2.1 shows the block diagram of a digital DCC using PI [1.4.1.4]. CLK<sub>IN</sub> is a signal whose duty cycle is distorted, and CLK<sub>INB</sub> is the inverted signal of CLK<sub>IN</sub>. If the rising edges of these two signals are aligned through the DCDL, a clock signal with a compensated duty cycle can be output through the PI, as shown in Figure 1.4.2.1(b). However, the disadvantage of this PI structure is that the performance of the PI degrades as the duty-cycle distortion increases. If the performance of the PI is increased to reduce the uncertainty, the correction range at low frequency is inevitably reduced, so this structure is not suitable as a DCC for wide-range operation.



Figure 1.4.2.2. Block diagram of the digital DCC using HCDL [1.4.2.1].

As shown in Figure 1.4.2.2, another digital DCC method uses an HCDL [1.4.2.1]. In this method, the rising edge of the CLK<sub>IN</sub> signal is detected and passed to the CLK<sub>OUT</sub> signal as its rising edge. In addition, the input signal or the output signal is delayed by half a period, and this delayed edge produces the falling edge of the output signal. In this case, the CLK<sub>OUT</sub> signal has a



Figure 1.4.2.3. Block diagram of TDC [1.4.1.1].

duty-cycle of 50%. Usually, the half-cycle delay is implemented with a digitally controlled delay line (DCDL). Because the DCDL must be able to provide a delay as long as the half cycle of the minimum operating frequency, the share of the total area occupied by the DCDL grows as the minimum frequency is lowered.
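The cost of covering a low minimum frequency with a plain delay line can be estimated with a short calculation. The Python sketch below is only an illustration: the 20 ps unit delay is a hypothetical value chosen for the example, not a figure from this thesis, and the function names are assumptions. It computes the half period at the 50 MHz and 1.6 GHz frequency limits stated in the abstract, the number of unit-delay stages a conventional HCDL would need, and the output duty cycle that results when the falling edge is placed one HCDL delay after the rising edge.

```python
import math


def half_period_ps(freq_mhz: float) -> float:
    """Half of the clock period in picoseconds (1 MHz corresponds to 1e6 ps)."""
    return 0.5 * 1e6 / freq_mhz


def stages_needed(freq_mhz: float, unit_delay_ps: float) -> int:
    """Unit-delay stages a plain delay line needs to span half a cycle."""
    return math.ceil(half_period_ps(freq_mhz) / unit_delay_ps)


def output_duty_percent(hcdl_delay_ps: float, freq_mhz: float) -> float:
    """Duty cycle when the falling edge trails the rising edge by the HCDL delay."""
    period_ps = 1e6 / freq_mhz
    return 100.0 * hcdl_delay_ps / period_ps


UNIT_DELAY_PS = 20.0  # hypothetical per-stage delay, for illustration only

for freq_mhz in (50.0, 1600.0):
    print(freq_mhz, half_period_ps(freq_mhz), stages_needed(freq_mhz, UNIT_DELAY_PS))
```

Under this assumed unit delay, the 50 MHz corner needs a 10 ns half-cycle delay, about 32 times longer than the 312.5 ps needed at 1.6 GHz, which is why the delay line dominates the area of a wide-range HCDL-based DCC.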

In addition, if HCDL is used, DCD is essential because the delay value of this HCDL



Figure 1.4.2.4. Block diagram of digital logic with binary search [1.4.1].

must be changed if the operating frequency changes. Various types of DCDs have been proposed, based on TDCs or on digital logic. With a TDC, fast locking is possible, but the hardware cost is large.

One approach builds the digital logic around a 1-bit DCD to reduce the hardware cost, but in this case the lock-in time grows in proportion to the length of the DCDL [1.4.1.3]. To reduce this lock-in time, a binary-search-like method was proposed, which reduces both the lock-in time and the hardware cost.
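The difference in lock-in time between a step-by-step search and a binary search can be illustrated with a small behavioral model. The following Python sketch is a generic successive-approximation example under stated assumptions, not the FSM of any cited design: the 1-bit detector is modeled as an ideal comparison against a hypothetical target code, and each detector decision is counted as one search step.

```python
def delay_too_long(code: int, target_code: int) -> bool:
    """Idealized 1-bit (bang-bang) decision: is the delay set by `code` too long?"""
    return code > target_code


def linear_search(target_code: int, n_bits: int) -> tuple[int, int]:
    """Increase the code one step at a time; returns (final code, decisions used)."""
    code, steps = 0, 0
    while code < (1 << n_bits) - 1:
        steps += 1
        if delay_too_long(code + 1, target_code):
            break
        code += 1
    return code, steps


def binary_search(target_code: int, n_bits: int) -> tuple[int, int]:
    """Resolve one bit per decision, MSB first; returns (final code, decisions used)."""
    code, steps = 0, 0
    for bit in reversed(range(n_bits)):
        trial = code | (1 << bit)
        steps += 1
        if not delay_too_long(trial, target_code):
            code = trial  # keep the bit when the trial delay is not too long
    return code, steps


# Hypothetical 8-bit delay code with the lock point at code 200.
print(linear_search(200, 8))
print(binary_search(200, 8))
```

In this model the linear search needs a number of decisions that grows with the lock code, and hence with the delay-line length, whereas the binary search always needs one decision per code bit.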

#### **1.5 SUMMARY**

As mentioned in the chapters above, the DCC is utilized in various applications. For instance, it is employed in DRAM, HBM, and DDR applications, which employ double data-rate and quad data-rate structures to achieve high bandwidth. In such cases, adjusting the duty cycle of the clock becomes necessary because the signal integrity of the DQ pin is affected by the clock's duty cycle. Additionally, ADCs employ double-sampling techniques to enhance efficiency, and in structures that share op-amps, the duty cycle of the clock becomes crucial. The operating frequency of DRAMs is at the GHz level, while ADCs mostly use MHz clocks. Therefore, a DCC must operate over a wide frequency range to be compatible with these various applications.

There are two types of DCC: analog and digital. Digital DCC excels in applications like DRAM and ADCs, where efficient operation through power-down mode is essential. Moreover, digital DCC offers the advantage of fast calibration. The use of PI and HCDL as digital DCC structures is common, yet each structure has its own limitations. When PI is used, wide-range operation becomes challenging, while HCDL leads to an increased ratio of DCDL to the total area in wide-range operation, which may pose a burden in certain applications.

To overcome the area drawback of the HCDL-based digital DCC described above, we propose a counter-based HCDL. With the counter-based HCDL, the DCC can cover a wide frequency range while keeping its area requirement small. The operation of the proposed DCC is divided into a training operation and a normal operation. In the training operation, the DCDL and the intrinsic delay are adjusted to satisfy the half-cycle delay by setting the amount of delay in the delay line and the number of counts. To reduce power and area cost, a training method using a bang-bang duty-cycle detector has been proposed [1.5.1]. However, this method has a long lock time when used in wide-range operation, so we apply the binary search method, which helps to minimize the lock time. In normal operation, the trained values are used to correct the duty cycle of the clock signal. The proposed DCC is verified by measurement of prototype chips.
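The idea of reaching a long half-cycle delay by combining a count with a shorter delay line can be sketched numerically. The Python example below is a purely illustrative decomposition under assumed values: the 640 ps delay per count, the 10 ps fine step, and the function name are hypothetical and are not taken from the implementation in this thesis, whose actual training algorithm is described in Chapter 3.

```python
def split_half_cycle(half_period_ps: float,
                     delay_per_count_ps: float,
                     fine_step_ps: float) -> tuple[int, int, float]:
    """Split a target half-cycle delay into a count and a fine delay code.

    Returns (number of counts, fine code, achieved delay in ps).
    All parameters are hypothetical illustration values.
    """
    n_cnt = int(half_period_ps // delay_per_count_ps)       # whole passes
    residual_ps = half_period_ps - n_cnt * delay_per_count_ps
    fine_code = round(residual_ps / fine_step_ps)            # remaining delay
    achieved_ps = n_cnt * delay_per_count_ps + fine_code * fine_step_ps
    return n_cnt, fine_code, achieved_ps


# Half periods at the 50 MHz and 1.6 GHz limits stated in the abstract.
print(split_half_cycle(10000.0, 640.0, 10.0))  # low frequency: many counts
print(split_half_cycle(312.5, 640.0, 10.0))    # high frequency: fine delay only
```

In this sketch the same short delay line serves both frequency corners: at low frequency most of the half-cycle delay comes from the count, while at high frequency the count is zero and only the fine delay is used.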

#### **1.6 THESIS ORGANIZATION**

This thesis is composed of seven chapters. Chapter 1 is an introduction that explains recent trends in DCCs and the HBM physical layer. Chapter 2 describes the HBM clocking scheme. Chapter 3 describes the concept of the proposed DCC with counter-based HCDL and its training operation. Chapter 4 describes the design considerations and the normal operation of the proposed DCC with counter-based HCDL. In chapter 5, the implementation of the proposed DCC with counter-based HCDL is explained. Chapter 6 shows the experimental setup and the measurement results of the proposed DCC with a prototype chip. The conclusion is drawn in chapter 7.

## CHAPTER 2

## **HBM CLOCKING SCHEME**

As discussed in chapter 1.2, the share of HBM in DRAM consumption is gradually increasing because HBM can reach high bandwidth. Therefore, various circuit techniques, such as those for the transmitter and receiver, are required in the DRAM internal circuits. In addition, research on the clocking system is needed, and the unique characteristics of HBM should be reflected in the HBM clocking system. This chapter reviews these characteristics and explains the clocking scheme that applies them.

### 2.1 CHARACTERISTICS OF HBM CLOCKING SCHEME



Figure 2.1.1. Timing diagrams of SDR and DDR operation.

As a characteristic of the HBM clocking system, double data-rate (DDR) was adopted instead of a single data-rate (SDR) method like figure 2.1.1 to increase data-rate. In the case of DDR, data is transmitted using both the rising edge and falling edge of the clock, so the duty-cycle of the clock affects performance.

As the clock frequency inside the chip increases, duty-cycle errors arise from mismatch or noise in the clock distribution. In addition, as the sampling margin continues to decrease, the performance degradation caused by this duty-cycle error has a greater impact. Therefore, a DCC should be used at the end of the HBM clocking scheme to compensate for the distorted duty-cycle.

| Parameter | Symbol | 1.0 Gbps/pin Min | 1.0 Gbps/pin Max | 1.6 Gbps/pin Min | 1.6 Gbps/pin Max | 2.0 Gbps/pin Min | 2.0 Gbps/pin Max | 2.4 Gbps/pin Min | 2.4 Gbps/pin Max | 2.8 Gbps/pin Min | 2.8 Gbps/pin Max | 3.2 Gbps/pin Min | 3.2 Gbps/pin Max | Unit | Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CK Timings | | | | | | | | | | | | | | | |
| CK clock frequency | f<sub>CK</sub> | 50 | 500 | 50 | 800 | 50 | 1000 | 50 | 1200 | 50 | 1400 | 50 | 1600 | MHz | |
| CK clock frequency with bank groups disabled | f<sub>CKBG</sub> | f<sub>CK</sub> (min) | | f<sub>CK</sub> (min) | | f<sub>CK</sub> (min) | | f<sub>CK</sub> (min) | | f<sub>CK</sub> (min) | | f<sub>CK</sub> (min) | | MHz | 4,5 |
| CK clock period | t<sub>CK</sub> | 2.0 | 20 | 1.25 | 20 | 1.0 | 20 | 0.833 | 20 | 0.714 | 20 | 0.625 | 20 | ns | 6 |
| CK clock differential HIGH-level width | t<sub>CH</sub> | 0.47 | 0.53 | 0.47 | 0.53 | 0.47 | 0.53 | 0.47 | 0.53 | 0.47 | 0.53 | 0.47 | 0.53 | t<sub>CK</sub> | |
| CK clock differential LOW-level width | t<sub>CL</sub> | 0.47 | 0.53 | 0.47 | 0.53 | 0.47 | 0.53 | 0.47 | 0.53 | 0.47 | 0.53 | 0.47 | 0.53 | t<sub>CK</sub> | |

Figure 2.1.2. Timing parameters of CK clock frequency [2.1.1].

Figure 2.1.2 shows the clock timing parameters of the HBM for each speed bin. As the table shows, the clock must support a range of frequencies depending on the speed bin. Because the DDR structure is adopted as mentioned earlier, the maximum clock frequency is half the data rate. The minimum frequency is 50 MHz in every speed bin, so the DCC must operate over a wide frequency range. The HBM clocking scheme was designed with these characteristics in mind.
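The relation between data rate and clock frequency in the table can be checked with a few lines of arithmetic. The following is a minimal sketch; the function names are introduced here for illustration only.

```python
# Per-pin data rates of the speed bins in Figure 2.1.2, in Gbps.
SPEED_BINS_GBPS = [1.0, 1.6, 2.0, 2.4, 2.8, 3.2]


def max_clock_mhz(data_rate_gbps: float) -> float:
    """DDR transfers two bits per clock period, so f_CK(max) = data rate / 2."""
    return data_rate_gbps * 1000.0 / 2.0


def min_period_ns(data_rate_gbps: float) -> float:
    """Shortest clock period t_CK(min) = 1 / f_CK(max), in nanoseconds."""
    return 1000.0 / max_clock_mhz(data_rate_gbps)


for rate in SPEED_BINS_GBPS:
    print(f"{rate:.1f} Gbps/pin: f_CK(max) = {max_clock_mhz(rate):.0f} MHz, "
          f"t_CK(min) = {min_period_ns(rate):.3f} ns")
```

The printed values reproduce the Max column of the CK clock frequency row and the Min column of the CK clock period row.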

### 2.2 CONCEPTUAL ARCHITECTURE OF HBM CONTROLLER PHY



Figure 2.2.1. Clocking scheme for HBM PHY.

Figure 2.2.1 shows the clocking scheme for the HBM PHY. To achieve high-bandwidth DRAM, research is required on the physical layer (PHY) between the memory and the controller as well as on the memory itself. The HBM PHY therefore also requires a DCC. The PHY is configured with this in mind as follows. After the clock is generated by a PLL or received from the controller, the clock circuit synchronizes it with the commands through the internal delay line, and the DCC compensates for the duty-cycle distorted by the delay line.

In summary, the DCC was studied for both the HBM memory and the PHY. Its first requirement is wide-range operation, from tens of MHz to the GHz range. Its second requirement is compatibility with memory interfaces.

## CHAPTER 3

## TRAINING OPERATION OF THE PROPOSED DCC WITH COUNTER-BASED HCDL

As discussed in chapter 1.4.1, a digital DCC with HCDL has advantages in applications that improve efficiency through modes such as power-down. However, in applications that require wide-range operation, the cost of the delay line in the DCC increases. In addition, as the delay line grows, the time required for the training operation may also increase. To alleviate these problems, we propose a DCC with counter-based HCDL. With the counter-based HCDL, the DCC covers a wide frequency range while reducing its area requirement. This chapter also introduces an efficient training method.

### 3.1 CONCEPTUAL ARCHITECTURE OF THE PROPOSED DCC WITH COUNTER-BASED HCDL



Figure 3.1.1. Block diagram of the conventional DCC with HCDL.

In a DCC with a conventional HCDL, the HCDL must cover a delay of half the period of the lowest operating frequency. As mentioned before, since each DRAM generation has its own range of operating frequencies, wide-range operation must be possible. ADCs must also be capable of wide-range frequency operation for use in applications such as radar and broadband wireless cables. To use the DCC in such varied applications, it must operate over a wide frequency range, from several MHz to the GHz range.

Figure 3.1.1(a) shows the block diagram and basic operation of the conventional DCC with HCDL [3.1.1]. CLK<sub>IN</sub> represents a duty-cycle distorted clock signal. As shown in figure 3.1.1 (b), the CLK<sub>IN</sub> signal affects only the rising edge timing of the CLK<sub>OUT</sub> signal, so as long as the rising edge of CLK<sub>IN</sub> arrives at the right time, the duty-cycle of CLK<sub>IN</sub> does not affect the operation. If the HCDL is set to a delay equal to the half-period of the operating frequency, CLK<sub>FB</sub> generates a falling edge after the half-period delay. The falling edge of the CLK<sub>FB</sub> signal generates the falling edge of CLK<sub>OUT</sub> through the edge combiner.

Figure 3.1.2. Area breakdown of the conventional DCC with HCDL.

In a DCC using a conventional HCDL, a DCDL that can provide a half-period delay is required. In addition, to leave margin for voltage and process variation, the DCDL must cover more than the nominal maximum delay. Figure 3.1.2 shows the area breakdown of the conventional DCC with an operating frequency of 50 to 1600 MHz. As shown in the figure, the DCDL occupies most of the block because it must cover the half-period at the lowest frequency, which is 10 ns at 50 MHz. This silicon cost is a burden and a disadvantage of supporting low-frequency operation. In addition, as the delay line grows, the training time needed to find the accurate duty-cycle may also increase.
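To give a feel for the size of this delay line, the half-period can be computed at both ends of the frequency range. The sketch below is illustrative only: the 120 ps unit delay is borrowed from the coarse delay line resolution reported in chapter 5.2 and is an assumption for a conventional design, and the function names are hypothetical.

```python
import math


def half_period_ps(freq_mhz: float) -> float:
    """Half of the clock period in picoseconds (period in ps = 1e6 / f in MHz)."""
    return 0.5 * 1e6 / freq_mhz


def unit_cells_needed(freq_mhz: float, unit_delay_ps: float) -> int:
    """Number of unit delay cells a single-pass HCDL needs to reach the half-period."""
    return math.ceil(half_period_ps(freq_mhz) / unit_delay_ps)


UNIT_DELAY_PS = 120.0  # assumed unit delay, for illustration only

for f in (50.0, 1600.0):
    print(f"{f:6.0f} MHz: half-period = {half_period_ps(f):8.1f} ps, "
          f"cells needed = {unit_cells_needed(f, UNIT_DELAY_PS)}")
```

Under this assumption a single-pass delay line needs roughly 28 times as many cells at 50 MHz as at 1600 MHz, before any margin for process and voltage variation is added.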



Figure 3.1.3. Conceptual block diagram of the counter-based HCDL.

To overcome the drawbacks described above, we propose a counter-based HCDL for the digital DCC using HCDL. As shown in figure 3.1.3, the counter-based HCDL consists of a DCDL, a counter, and a 2:1 multiplexer. The DCDL in turn is composed of a coarse delay line (CDL) and a fine delay line (FDL). In brief, covering the entire delay with minimum-delay cells is costly, so the HCDL delay is instead obtained by repeating a shorter delay line a number of times set by the counter. A detailed description of the operation follows in chapter 3.2.

To operate the counter-based HCDL in this manner, the current frequency must be detected. After the frequency is detected, the counter-based HCDL is adjusted through training, before normal operation, to match the half-period of the current operating frequency. Chapter 3.2 explains the training operation, and chapter 4 then explains normal operation using the entire block diagram.

### 3.2 TRAINING OPERATION OF THE DCC WITH COUNTER-BASED HCDL

As described in chapter 3.1, various applications need to operate over a wide frequency range. For wide-range operation, when a start or reset signal arrives, the current operating frequency must be detected before normal operation and the DCC adjusted accordingly. Training is completed by detecting the current operating frequency and adjusting the HCDL to the half-period.

Various digital methods have been introduced to detect the frequency. To reduce power and area cost, a training method using a bang-bang duty-cycle detector was proposed [3.2.1]. However, this method has a long lock time in wide-range operation. A fast-locking method using a TDC has also been introduced [3.2.2]. This method locks in only 15 cycles, but the TDC consumes a large silicon area and a large amount of power. A method that locks the HCDL by adjusting the DCDL one step (1-bit) at a time using a 1-bit detector was also introduced [3.2.3]. In this case, the cost is reduced because the DCDL itself is controlled, but the locking time increases in proportion to the size of the DCDL. To reduce the locking time while minimizing the silicon cost, a binary search method was used for detection [3.2.4]. By controlling the DCDL with binary search, the locking time was reduced to 14 cycles.
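The difference in lock time between stepping the code one LSB at a time and searching it bit by bit can be illustrated with a small model. This is a generic sketch, not the circuits of [3.2.3] or [3.2.4]: the target code stands in for the unknown delay setting, each loop iteration stands in for one comparison, and the function names are hypothetical.

```python
def linear_search_steps(target: int, bits: int) -> int:
    """Step the code up by one LSB per comparison until it reaches the target."""
    code, steps = 0, 0
    while code < target and code < (1 << bits) - 1:
        code += 1
        steps += 1
    return steps


def binary_search_steps(target: int, bits: int) -> tuple[int, int]:
    """Decide one bit per comparison, MSB first; returns (final code, steps)."""
    code, steps = 0, 0
    for bit in reversed(range(bits)):
        trial = code | (1 << bit)
        if trial <= target:      # comparator says the trial code is not too large
            code = trial
        steps += 1
    return code, steps


print(linear_search_steps(200, 8))   # number of comparisons grows with the code
print(binary_search_steps(200, 8))   # number of comparisons equals the bit width
```

For an n-bit code the stepwise search needs up to 2^n - 1 comparisons, whereas the bitwise search always needs n.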

In order to reduce the locking time, the proposed DCC introduces binary search during the training operation. The difference from the previous method is that the $N_{CNT}$ value, which determines the number of times the DCDL is repeated, must be found before the DCDL is adjusted by binary search. Therefore, the variables to be obtained through training are $N_{CNT}$ and $D_{DL}$, the bits that control the DCDL. The remainder of this section explains how the training operation finds these variables.

Figure 3.2.1. (a) Block diagram of the proposed DCC with counter-based HCDL and (b) the scheme used in training operation.

To match the counter-based HCDL to the half-period through training, it is necessary to know the delay components used in normal operation. Figure 3.2.1 (a) shows a block diagram of the proposed DCC with counter-based HCDL. The proposed DCC consists of an edge combiner (EC), a counter-based HCDL, a switch, a divider, a phase detector (PD), and a finite state machine (FSM). The counter-based HCDL consists of a NAND gate, a coarse delay line (CDL), a fine delay line (FDL), a counter (CNT), and a 2:1 multiplexer (MUX). In figure 3.2.1 (a), t<sub>SW</sub>, t<sub>CNT</sub>, t<sub>MUX</sub> and t<sub>EC</sub>, marked in red, are the intrinsic delays of the switch, the CNT, the 2:1 MUX and the EC, respectively. The delay provided by the combined CDL and FDL, together with the intrinsic delays introduced by the other components, must equal half the period of the operating frequency of the DCC. The total delay of the counter-based HCDL, including the intrinsic delays, must therefore satisfy:

$$0.5T_{REF} = t_{EC} + 2N_{CNT}t_{DL} + t_{CNT} + t_{MUX} + t_{SW} \tag{1}$$

The variable $N_{CNT}$ determines the number of times that the delay is repeated, and $t_{DL}$ is the sum of the delays provided by the CDL and FDL. The counter counts the rising edges of $CLK_{DL}$, which is the output signal of the DCDL. Because $CLK_{DL}$ toggles once per pass through the DCDL, each counted rising edge corresponds to two passes, so $t_{DL}$ enters the above equation multiplied by 2. The values of $N_{CNT}$ and $t_{DL}$ required to fulfill (1) are obtained during the training operation.
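Equation (1) can be evaluated numerically to see how the repetition count and the delay line setting trade off. The sketch below is illustrative: the intrinsic delay values in the example are hypothetical placeholders, not measured values from the prototype, and the function names are introduced here.

```python
def half_cycle_delay_ps(n_cnt: float, t_dl_ps: float, t_ec_ps: float,
                        t_cnt_ps: float, t_mux_ps: float, t_sw_ps: float) -> float:
    """Right-hand side of equation (1): total delay of the counter-based HCDL."""
    return t_ec_ps + 2.0 * n_cnt * t_dl_ps + t_cnt_ps + t_mux_ps + t_sw_ps


def required_t_dl_ps(t_ref_ps: float, n_cnt: float, intrinsic_ps: float) -> float:
    """Solve equation (1) for t_DL, with intrinsic = t_EC + t_CNT + t_MUX + t_SW."""
    return (0.5 * t_ref_ps - intrinsic_ps) / (2.0 * n_cnt)


# Hypothetical intrinsic delays (ps), chosen only to make the arithmetic concrete.
T_EC, T_CNT, T_MUX, T_SW = 30.0, 40.0, 20.0, 10.0
intrinsic = T_EC + T_CNT + T_MUX + T_SW

t_ref = 2000.0  # 500 MHz input clock
t_dl = required_t_dl_ps(t_ref, n_cnt=1, intrinsic_ps=intrinsic)
print(t_dl)
print(half_cycle_delay_ps(1, t_dl, T_EC, T_CNT, T_MUX, T_SW))
```

With these placeholder numbers the second value equals half of `t_ref`, confirming that the solved $t_{DL}$ satisfies (1).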

During the training operation, the PD and the FSM determine the count $N_{CNT}$ required to produce the half-period delay, together with the digital codes $D_{CDL}$ and $D_{FDL}$, which control the delays of the coarse and fine delay lines. Since obtaining a half-period pulse width from the duty-cycle distorted input clock is challenging, the divider generates a CLK2 signal that toggles at every rising edge of the input signal. CLK2 therefore has a pulse width of exactly one period regardless of the input duty-cycle. The training operation detects the value $2N_{CNT}$ that matches one period of the input clock, so that when the counter-based HCDL repeats the DCDL $N_{CNT}$ times, the delay is a half-period. In normal operation, the counter counts the rising edges of the CLK<sub>DL</sub> signal and compares the count with $N_{CNT}$. When the count reaches $N_{CNT}$, the counter creates a falling edge in CLK<sub>FB</sub>.

Figure 3.2.1(b) is a block diagram of the proposed DCC with counter-based HCDL that highlights the scheme used in the training operation. Since the duty-cycle of the input clock signal is likely to be distorted, it cannot be used to provide the reference pulse width $0.5T_{REF}$. Therefore, the FSM generates the CLK<sub>FSM</sub> signal by inverting its value at each rising edge of the input clock. As a result, the CLK<sub>FSM</sub> signal has a pulse width of $T_{REF}$, which is one period of the input clock. Based on the delays in the training loop, the expression is as follows:

$$T_{REF} = 4N_{CNT}t_{DL} + t_{CNT} + t_{MUX} + t_{SW} \tag{2}$$

Comparing equation (1) multiplied by 2 with equation (2), the delay components $t_{SW}$, $t_{CNT}$, $t_{MUX}$ and $2t_{EC}$ are missing from the training loop path. To compensate for this mismatch, additional replica delay components, marked in blue in figure 3.2.1, are placed between the FSM and the counter-based HCDL. As a result, variables satisfying equation (1) are obtained in the training operation. After the $2N_{CNT}$ value is obtained from the compensated training loop, the FSM divides it in half and applies it as $N_{CNT}$ in normal operation. When the FSM then applies $t_{DL}$ as it is, equation (1) is satisfied.
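The mismatch can be made explicit by writing the two expressions side by side. The following is a sketch of that arithmetic; $t_{REP}$ is a name introduced here for the total replica delay and does not appear in the original figures.

$$
\begin{aligned}
2 \times (1):\quad T_{REF} &= 2t_{EC} + 4N_{CNT}t_{DL} + 2t_{CNT} + 2t_{MUX} + 2t_{SW} \\
(2):\quad T_{REF} &= 4N_{CNT}t_{DL} + t_{CNT} + t_{MUX} + t_{SW} \\
\text{difference}:\quad t_{REP} &= 2t_{EC} + t_{CNT} + t_{MUX} + t_{SW}
\end{aligned}
$$

Adding a replica delay of $t_{REP}$ to the training loop makes its total delay equal to twice the right-hand side of (1), so locking the loop to $T_{REF}$ is equivalent to locking the normal-operation path to $0.5T_{REF}$.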



Figure 3.2.2. Algorithm of the proposed DCC.

Training is performed first to detect the operating frequency when the DCC is turned on or a reset signal is received. The training algorithm of the proposed DCC is shown in figure 3.2.2. When training begins, the DCDL is reset to the digital code with the maximum delay, and training of $N_{CNT}$, the number of times the delay is repeated, starts. The number of rising edges of CLK<sub>DL</sub> during the pulse width of CLK<sub>FSM</sub> is counted, and the counted value is checked at the falling edge of $CLK_{FSM}$. If $N_{CNT}$ is an odd number other than 1, it is incremented by 1 so that it can be divided in half in normal operation. If the $N_{CNT}$ value is 1, the $D_H$ signal is turned on, so that high-frequency operation is possible in the half-delay mode. After the N<sub>CNT</sub> value has been fixed, DCDL training starts, changing the digital code of the DCDL to determine the value of $t_{DL}$. To reduce the locking time of the DCDL, the binary search method is used for $t_{DL}$ training. DCDL training is divided into two phases: a $t_{DL}$ training phase, which compares the timing, and a ready phase, which determines the bit value. In the $t_{DL}$ training phase, the PD compares $T_{REF}$ with the delay of the training loop after the MSB of the DCDL code is changed to 0. The two delays are compared through the timing of $CLK_{FSM}$ and $CLK_{FB}$, where $CLK_{FSM}$ has the pulse width of $T_{REF}$ and $CLK_{FB}$ is delayed by the training loop. In the ready phase, the bit previously changed to 0 is determined as 0 or 1 according to the following conditions (3) and (4).

$$T_{REF} > 2t_{EC} + 4N_{CNT}t_{DL} + 2t_{CNT} + 2t_{MUX} + 2t_{SW} \tag{3}$$

$$T_{REF} < 2t_{EC} + 4N_{CNT}t_{DL} + 2t_{CNT} + 2t_{MUX} + 2t_{SW} \tag{4}$$

Condition (3) means that CLK<sub>FB</sub> leads CLK<sub>FSM</sub>, and condition (4) means that CLK<sub>FSM</sub> leads. Since the delay of the DCDL should increase when CLK<sub>FB</sub> leads, the bit is set to 1 when the PD generates the UP signal and kept at 0 when the PD generates the DN signal. The $t_{DL}$ training continues until the LSB has been compared. After the training of the LSB is completed, the overall training operation is finished.
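The two training stages can be summarized in a behavioral model. The sketch below rests on several assumptions that are not taken from the thesis: the DCDL is modeled as a uniform 8-bit delay with hypothetical `T_MIN_PS` and `STEP_PS`, the edge counting of the $N_{CNT}$ stage is simplified to a ceiling division, the counter limit of 16 is assumed to apply to the training count, and the cycle count simply follows the 2 + 4-per-bit budget stated in the text. The variable `count` corresponds to $2N_{CNT}$.

```python
import math

BITS = 8            # DCDL code width (3-bit CDL + 5-bit FDL)
T_MIN_PS = 100.0    # hypothetical DCDL delay at code 0
STEP_PS = 4.5       # hypothetical uniform delay step per code
MAX_COUNT = 16      # counter limit, assumed to apply to the training count


def t_dl_ps(code: int) -> float:
    """Idealized, uniform DCDL delay model (an assumption)."""
    return T_MIN_PS + STEP_PS * code


def loop_delay_ps(count: int, code: int, intrinsic_ps: float) -> float:
    """Right-hand side of (3)/(4), with count = 2*N_CNT and
    intrinsic = t_EC + t_CNT + t_MUX + t_SW."""
    return 2.0 * count * t_dl_ps(code) + 2.0 * intrinsic_ps


def train(t_ref_ps: float, intrinsic_ps: float = 0.0):
    """Return (count, code, half_delay_mode, cycles) for one reference period."""
    code = (1 << BITS) - 1                      # start at maximum delay
    span = t_ref_ps - 2.0 * intrinsic_ps
    # Simplified N_CNT training: smallest count whose loop delay covers T_REF.
    count = max(1, math.ceil(span / (2.0 * t_dl_ps(code))))
    half_delay_mode = count == 1                # D_H is turned on
    if count > 1 and count % 2 == 1:
        count += 1                              # make it even so it can be halved
    if count > MAX_COUNT:
        raise ValueError("period too long for the counter range")
    # t_DL training: clear each bit, MSB first, and restore it on UP.
    for bit in reversed(range(BITS)):
        code &= ~(1 << bit)
        if t_ref_ps > loop_delay_ps(count, code, intrinsic_ps):
            code |= 1 << bit                    # condition (3): CLK_FB leads
    cycles = 2 + 4 * BITS
    return count, code, half_delay_mode, cycles


print(train(20000.0))   # 50 MHz reference period
print(train(625.0))     # 1600 MHz reference period
```

In this model the search ends at the smallest code whose loop delay is not shorter than $T_{REF}$, so the residual error is below one code step multiplied by twice the count.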

Measured in input clock cycles, $N_{CNT}$ training requires 2 cycles and DCDL training requires 4 cycles per bit, resulting in a total of 34 cycles for the entire training operation with the 8-bit DCDL code. Figure 3.2.3 shows the timing diagram of the training that obtains $N_{CNT}$ and the value of the MSB. For example, if the counter value is 3, 1 is added to make $N_{CNT}$ an even number. After $N_{CNT}$ training is completed, the $t_{DL}$ training begins. With the MSB changed to 0, the timing of CLK<sub>FSM</sub> and CLK<sub>FB</sub> is compared. In figure 3.2.3, CLK<sub>FB</sub> is leading, so the MSB is fixed to 1.



Figure 3.2.3. N<sub>CNT</sub> and DCDL training timing diagram of the proposed DCC with counter-based HCDL

As shown in figure 3.2.3, in $N_{CNT}$ training the digital code of the DCDL is set to the maximum, the rising edges of the CLK<sub>DL</sub> signal are counted until one period has passed, and the $N_{CNT}$ value is fixed at the falling edge of CLK<sub>FSM</sub>. In the timing diagram, the counted $N_{CNT}$ value is 3, which is odd, so the subsequent $t_{DL}$ training proceeds with the even value 4. $N_{CNT}$ training is completed in 2 cycles of the input clock, after which $t_{DL}$ training starts in order to find the $t_{DL}$ value.

For $t_{DL}$ training, binary search is used as mentioned earlier. When $t_{DL}$ training starts, the MSB of the DCDL code, which was set to the maximum, is changed to 0 as shown in the figure. Since the $N_{CNT}$ value was set to 4, the counter counts the rising edges of $CLK_{DL}$ until the count reaches 4, and the falling edge is output on the $CLK_{FB}$ signal as soon as the count becomes 4. The PD compares the timing of the $CLK_{FB}$ falling edge with that of $CLK_{FSM}$. As shown in the figure, if $CLK_{FB}$ arrives at the PD first, the current training-loop delay is less than one period, so the MSB of the DCDL code changed earlier is set back to 1. The DCC then enters the ready phase, in which this bit is fixed, the count value is reset to 1, and the second bit of the DCDL code is changed to 0. In this way, each bit is adjusted in only 4 cycles, including the $t_{DL}$ training phase and the ready phase. The digital code of the DCDL is adjusted one bit at a time by binary search, and the training operation is finished after the last bit is adjusted.

As a result, the total training consumes 34 cycles: 2 cycles for $N_{CNT}$ training and 4 cycles per bit for $t_{DL}$ training over a total of 8 bits. This is not fast compared to other digital DCCs targeting fast lock, but the increase was accepted for better DCC performance: the training takes more cycles because two cycles of the input clock are used as the reference clock. After the training operation, normal operation starts, and the blocks required only for training, including the FSM, are powered off to reduce the power consumption of the DCC.

This chapter described the training operation required for the digital DCC to operate accurately. The next chapter explains the normal operation of the proposed DCC with counter-based HCDL.

## **CHAPTER 4**

## NORMAL OPERATION OF THE PROPOSED DCC WITH COUNTER-BASED HCDL

To alleviate the issues of lock time and wide-range operating frequency discussed in chapter 1.3.1, a DCC with counter-based HCDL is presented in this thesis. The proposed DCC generates an output signal with a compensated duty-cycle by combining the edge of the input clock signal and the edge of the half-period delayed feedback clock signal.

### 4.1 DESIGN CONSIDERATION OF THE PROPOSED DCC WITH COUNTER-BASED HCDL



Figure 4.1.1. Timing diagram of phase interpolating sequence of a conventional DCC

There are two main types of digital DCC. One uses phase interpolation (PI) and the other uses a half-cycle delay line (HCDL). As mentioned in chapter 1, the DCC using PI is not suitable for wide-range applications, for the following reason.

The conventional DCC using PI [4.1.1] operates as follows. Figure 4.1.1 shows the timing diagram of the conventional PI. The $CLK_{INB}$ signal is the inverted $CLK_{IN}$ signal delayed through the DCDL; the amount of delay is marked in red in the figure. Phase interpolation between the $CLK_{IN}$ and $CLK_{INB}$ signals generates a phase-mixed signal, shown as $PI_{OUT}$ below them. When this signal is buffered, a duty-cycle compensated signal, $DCC_{OUT}$, is produced as the final output. However, the PI method is not suitable for wide-range operation. PIs designed for high frequencies have a limited interpolation range, so interpolation performance is poor at low frequencies and at large duty-cycle offsets that exceed this range [4.1.2].



Figure 4.1.2. HCDL DCC architecture (a) feedforward and (b) feedback

Therefore, a digital DCC using HCDL, which is suitable for wide-range operation, is adopted for normal operation. HCDL-based DCCs are classified as feedforward or feedback depending on where the HCDL is placed, and in the feedforward method the input duty-cycle range is limited [4.1.3]. Therefore, the proposed digital DCC is configured in the feedback method, as shown in figure 4.1.2 (b).

In summary, the overall structure is a DCC using HCDL to satisfy wide-range operation, and the input duty-cycle range is increased by arranging the HCDL in a feedback form. The next section explains the overall structure and normal operation.

### 4.2 NORMAL OPERATION OF THE PROPOSED DCC WITH COUNTER-BASED HCDL



Figure 4.2.1. Block diagram of the proposed DCC with counter-based HCDL used in normal operation.

Figure 4.2.1 shows the block diagram of the proposed DCC with counter-based HCDL, highlighting the blocks used in normal operation. As mentioned in chapter 4.1, a structure is adopted in which CLK<sub>IN</sub> enters the input of the EC and the output of the EC is fed back through the counter-based HCDL into the EC again.

In normal operation, the $D_{TR}$ signal is turned off by the FSM, so the DCC operates in the normal operation loop rather than the training operation loop. The EC generates the rising edge of CLK<sub>OUT</sub> by detecting the rising edge of CLK<sub>IN</sub>, and this edge then enters the DCDL. After repeating the DCDL $N_{CNT}$ times, the counter-based HCDL generates the falling edge of CLK<sub>FB</sub>, and the EC generates the falling edge of CLK<sub>OUT</sub>. When the counter-based HCDL needs to generate a short delay for high-frequency operation, the D<sub>H</sub> signal is turned on. Since the DCDL is then directly connected to CLK<sub>FB</sub> without counting, the counter-based HCDL operates as if N<sub>CNT</sub> were 0.5. Through the MUX and the D<sub>H</sub> signal, the input clock passes through the DCDL only once, enabling high-frequency operation.

As mentioned in chapter 3, the variables $N_{CNT}$ and $t_{DL}$ satisfying equation (1) are obtained in the training operation and applied in normal operation. Training is based on $T_{REF}$, which is one period, so it yields $2N_{CNT}$, twice the number of DCDL repetitions $N_{CNT}$. Therefore, if the obtained $2N_{CNT}$ value is halved and the $t_{DL}$ value of the DCDL is applied as it is, the counter-based HCDL provides a delay of $0.5T_{REF}$, a half-period. The operation of the block diagram can be summarized as follows. The EC detects the rising edge of the CLK<sub>IN</sub> input clock and outputs it, the counter-based HCDL delays it by $0.5T_{REF}$, and the EC detects the resulting falling edge of the CLK<sub>FB</sub> signal and outputs it, generating the duty-cycle compensated output signal CLK<sub>OUT</sub>.



Figure 4.2.2. Timing diagram of the proposed DCC with counter-based HCDL in normal operation

The timing diagram in figure 4.2.2 shows the DCC entering normal operation after LSB training. On entering normal operation, the value of $N_{CNT}$ is converted because the counter-based HCDL must have a delay of $0.5T_{REF}$, not $T_{REF}$. If the $N_{CNT}$ value is an even number, it is divided by 2 and normal operation continues. If the $N_{CNT}$ value is 1, the operating frequency can be increased by reducing the effective $N_{CNT}$ value to 0.5 through the half-delay mode. In the timing diagram, $N_{CNT}$ has a value of 4 in the training operation and changes to 2 in normal operation. The digital codes of the DCDL remain the same.
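The conversion from the trained count to the normal-operation setting can be written as a small function. This is a sketch under the rules stated above; the function names and the tuple return value are choices made here for illustration.

```python
def normal_mode_settings(trained_count: int) -> tuple[float, bool]:
    """Map the count found in training (one full period) to the
    normal-operation repetition count N_CNT and the D_H flag."""
    if trained_count == 1:
        return 0.5, True                 # half-delay mode: DCDL traversed once
    if trained_count < 1 or trained_count % 2 != 0:
        raise ValueError("training is expected to yield 1 or an even count")
    return trained_count / 2, False      # halve the count, keep the DCDL code


def hcdl_delay_ps(n_cnt: float, t_dl_ps: float) -> float:
    """Delay contributed by the repeated DCDL, the 2*N_CNT*t_DL term of (1)."""
    return 2.0 * n_cnt * t_dl_ps


print(normal_mode_settings(4))
print(normal_mode_settings(1))
print(hcdl_delay_ps(0.5, 316.0))
```

The first call corresponds to the timing diagram, where the count of 4 used in training becomes 2 in normal operation; the last call shows that in half-delay mode the DCDL contributes its delay only once.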

### 4.2.1 HALF-DELAY MODE



Figure 4.2.1.1. Conceptual timing diagram of the half-delay mode (a) off (DH=0) and (b) on (DH=1).

The $D_H$ signal of figure 4.2.1.1 is a bit that selects the half-delay mode. In the training operation, when the value of $N_{CNT}$ is 1, $D_H$ is turned on and the DCC enters half-delay mode. Since the DCDL is then directly connected to CLK<sub>FB</sub> without the counter, the counter-based HCDL operates as if $N_{CNT}$ were 0.5. Figure 4.2.1.1(a) is the timing diagram when half-delay mode is off, where the counter compares the count value at the rising edge of the $CLK_{DL}$ signal. Therefore, even when the $N_{CNT}$ value is 1, the signal is delayed twice by the DCDL before the falling edge is output on the $CLK_{FB}$ signal. Figure 4.2.1.1(b) shows the timing diagram when the half-delay mode is on. Through the MUX and the $D_H$ signal, the input clock passes through the DCDL only once, enabling high-frequency operation.

## **CHAPTER 5**

## **ARCHITECTURE AND IMPLEMENTATION**

In chapter 5, we explain the overall architecture and the implementation of the proposed DCC with counter-based HCDL. The DCC is implemented together with a clock buffer that receives a clock signal with a distorted duty-cycle through a limited number of pins.

### 5.1 OVERALL ARCHITECTURE



Figure 5.1.1. Overall architecture of the proposed DCC with counter-based HCDL

Figure 5.1.1 shows the block diagram of the proposed DCC with counter-based HCDL. The overall architecture is composed of a clock buffer, which supplies an external duty-cycle distorted clock signal to the inside of the chip, and the proposed DCC with counter-based HCDL, which compensates the clock signal and allows it to be measured. The external clock signal is supplied to the chip through the PCB, with its duty-cycle adjusted using the data pattern of the external clock source. A clock buffer similar to that shown in [4.1.1] is used; the difference is that it has no resistive feedback path, so the duty-cycle distortion can be observed as it is. Since the external clock must operate from 50 to 1600 MHz, the buffer is designed to receive a high-speed clock signal. The detailed block diagram is given in chapter 5.3.

The proposed DCC is designed to compensate the duty-cycle when the input duty-cycle is distorted within 20% to 80%. To verify that the DCC operates properly, the CLK<sub>IN</sub> signal from the clock buffer can be measured through an output PAD before it enters the proposed DCC, and the CLK<sub>OUT</sub> signal compensated after the training operation can also be measured through a PAD.

### 5.2 COUNTER-BASED HCDL



(a)



(b)

Figure 5.2.1. Block diagram of (a) coarse delay line, and (b) fine delay line.

To compensate for the drawbacks of the conventional DCC mentioned above, the HCDL is constructed as a counter-based HCDL. Since the counter counts the rising edges of the HCDL output, the signal must toggle when passing through the HCDL. Therefore, the CDL is constructed as shown in figure 5.2.1 (a). The CDL is designed to be 3-bit controllable, and one unit of the CDL is composed of an inverter and a switch. When the 3-bit code is converted to a thermometer code, it yields 8 selection bits, so up to 8 stages can be adjusted. Figure 5.2.2 (a) shows post-layout simulation results of the CDL. The results show that the CDL has a resolution of about 120 ps.






Figure 5.2.2. Post-layout simulation results of (a) coarse delay line, and (b) fine delay line.

The resolution of the DCDL determines the accuracy of the DCC. Therefore, to increase the accuracy of the DCC, the FDL provides fine adjustment. The FDL is composed of a PI, as shown in figure 5.2.1 (b). For PI operation, one of the two inputs is delayed by about one unit of the CDL, and the number of enabled tri-state inverters is adjusted according to the selected bits. The PI is designed to be 5-bit controllable and is adjusted through the binary-to-thermometer converter inside the PI. Based on post-layout simulation, the resolution of the FDL is about 4.5 ps.



Figure 5.2.3. Counter-based HCDL with DCDL

Figure 5.2.3 shows the counter-based HCDL built from the DCDL and the counter. The DCDL is 8-bit controllable in total, and the maximum value of the counter is set to 16 to support wide-range frequency operation. The $D_H$ signal is a bit that selects the half-delay mode. In the half-delay mode, the $D_H$ signal of figure 5.2.3 is turned on. Since the DCDL is then directly connected to $CLK_{FB}$ without counting, the counter-based HCDL operates as if $N_{CNT}$ were 0.5. Through the MUX and the $D_H$ signal, the input clock passes through the DCDL only once, enabling high-frequency operation.
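The reported resolutions allow a quick consistency check of the 8-bit DCDL. The sketch below assumes idealized, uniform steps of 120 ps (CDL) and 4.5 ps (FDL) and models only the adjustable part of the delay; the minimum delay of the DCDL is not given in the text and is left out. The function and constant names are hypothetical.

```python
CDL_BITS, CDL_STEP_PS = 3, 120.0   # coarse delay line, about 120 ps per step
FDL_BITS, FDL_STEP_PS = 5, 4.5     # fine delay line, about 4.5 ps per step


def adjustable_delay_ps(cdl_code: int, fdl_code: int) -> float:
    """Delay added on top of the (unspecified) minimum DCDL delay,
    assuming ideal uniform steps."""
    if not 0 <= cdl_code < (1 << CDL_BITS):
        raise ValueError("cdl_code out of range")
    if not 0 <= fdl_code < (1 << FDL_BITS):
        raise ValueError("fdl_code out of range")
    return cdl_code * CDL_STEP_PS + fdl_code * FDL_STEP_PS


# Span of the fine line versus one coarse step.
fdl_span_ps = ((1 << FDL_BITS) - 1) * FDL_STEP_PS
print(fdl_span_ps, fdl_span_ps >= CDL_STEP_PS)

# Total adjustable span of the 8-bit DCDL under these assumptions.
print(adjustable_delay_ps(7, 31))
```

Under these assumptions the fine line spans slightly more than one coarse step, so the combined 3-bit and 5-bit codes leave no gap between adjacent coarse settings.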

### 5.3 CLOCK PATH



Figure 5.3.1. Block diagram of clock buffer.

Figure 5.3.1 shows the block diagram of the clock buffer, which brings the external clock source signals into the chip. The clock path consists of a differential termination, a CML input buffer, AC-coupled inverters, and a phase corrector. The CML input buffer is a two-stage CML buffer, one stage of which has negative capacitive feedback. The input buffer amplifies the high-speed differential clocks (CLK<sub>P</sub>, CLK<sub>N</sub>) and can also receive low-speed differential clocks. The AC-coupled inverters act as a CML-to-CMOS converter. The subsequent blocks adjust the phase, and the CLK<sub>IN</sub> and CLK<sub>INB</sub> signals are finally output. These clock signals carry the duty-cycle distortion applied by the external clock source.

In order to properly check the effect of the DCC, a path was designed through which the CLK<sub>IN</sub> signal entering the DCC input can be measured at an output PAD.
### **5.4 EDGE COMBINER**



Figure 5.4.1. Block diagram of edge combiner.

The block diagram of the edge combiner (EC) is shown in figure 5.4.1. EC consists of rising edge detectors and falling edge detectors, and pMOS and nMOS controlled by the detectors [5.4.1].

Figure 5.4.2 shows the block diagram and timing diagram of the rising edge detector. As shown in figure 5.4.2(b), the input signal $CLK_R$ is inverted with a slight delay to produce $CLK_{RD}$. These two signals pass through the NAND gate. $CLK_{RO}$ goes low at the rising edge of the input signal $CLK_R$ and stays low for the inverter delay. During this time, the pMOS drives the output of the EC to the high level.

The falling edge detector operates similarly to the rising edge detector. Figure 5.4.3 shows the block diagram and timing diagram of the falling edge detector. As shown in Figure 5.4.3(b), the input signal $CLK_F$ is inverted with a slight delay to produce $CLK_{FD}$. In contrast to the rising edge detector, the two signals pass through a NOR gate, so the detector responds to the falling edge of $CLK_F$. $CLK_{FO}$ normally stays low and goes high only for the duration of the inverter delay after the falling edge of $CLK_F$. During this pulse, the nMOS of the EC turns on and the final output changes to the low level. The EC operates correctly when the inserted inverter delay is long enough for the pulses to drive the pMOS and nMOS.

Figure 5.4.2. (a) Block diagram and (b) timing diagram of rising edge detector.

Figure 5.4.3. (a) Block diagram and (b) timing diagram of falling edge detector.

Figure 5.4.4. Timing diagram of edge combiner.

Therefore, if the delay between $\text{CLK}_R$ and $\text{CLK}_F$ equals half of the current clock period, the $\text{CLK}_{\text{OUT}}$ signal has a duty-cycle of 50%.
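The behavior described in this section can be checked with a minimal sampled-time sketch. The NAND-based and NOR-based detectors follow the description above; the remaining choices are assumptions made for illustration: the output node is modeled as holding its value when neither transistor is on, and $CLK_F$ is modeled as the complement of $CLK_R$ delayed by half a period, so that its falling edge arrives half a period after the rising edge of $CLK_R$. All function names are hypothetical.

```python
def make_clock(period, high_samples, n_periods):
    """Sampled clock that is 1 for the first `high_samples` of each period."""
    return [1 if (t % period) < high_samples else 0
            for t in range(period * n_periods)]


def delayed(sig, d):
    """Signal delayed by d samples (circular, since the clock is periodic)."""
    n = len(sig)
    return [sig[(t - d) % n] for t in range(n)]


def edge_combiner(clk_r, clk_f, inv_delay):
    """Behavioral edge combiner: pull up on CLK_R rise, pull down on CLK_F fall."""
    clk_rd = [1 - v for v in delayed(clk_r, inv_delay)]      # delayed inverter
    clk_fd = [1 - v for v in delayed(clk_f, inv_delay)]
    clk_ro = [1 - (a & b) for a, b in zip(clk_r, clk_rd)]    # NAND: low pulse
    clk_fo = [1 - (a | b) for a, b in zip(clk_f, clk_fd)]    # NOR: high pulse
    out, state = [], 0
    for ro, fo in zip(clk_ro, clk_fo):
        if ro == 0:        # pMOS on: output driven high
            state = 1
        elif fo == 1:      # nMOS on: output driven low
            state = 0
        out.append(state)  # otherwise the node holds its value (assumption)
    return out


def corrected_duty(period, high_samples, inv_delay=5, n_periods=4):
    clk_r = make_clock(period, high_samples, n_periods)
    # Assumption: CLK_F is the complement of CLK_R delayed by half a period.
    clk_f = [1 - v for v in delayed(clk_r, period // 2)]
    out = edge_combiner(clk_r, clk_f, inv_delay)
    return sum(out[-period:]) / period   # duty-cycle over the last period


if __name__ == "__main__":
    print(corrected_duty(100, 20), corrected_duty(100, 80))
```

Under these assumptions, input duty-cycles of 20% and 80% both yield a 50% output, because only the rising edge of $CLK_R$ and the falling edge of $CLK_F$ determine the output transitions.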

# **CHAPTER 6**

## **MEASUREMENT RESULTS**

#### 6.1 MEASUREMENT SETUP

Figure 6.1.1(a) shows the die photograph and the block description of the prototype DCC. The proposed DCC is fabricated in a 65 nm CMOS process with a 1 V supply voltage. The total active area is 0.0117 mm<sup>2</sup>, of which the proposed DCC occupies 0.0064 mm<sup>2</sup> and the clock buffer occupies 0.0053 mm<sup>2</sup>.

Figure 6.1.1(b) shows the measurement setup. A power source generates the supply voltage for the chip. The differential clock signals $CLK_P$ and $CLK_N$ are supplied by a clock source (Anritsu MP1800A). The clock source intentionally distorts the duty-cycle of the differential clock signal to verify the performance of our DCC. One of these signals is measured as $CLK_{OUT}$ after it has passed through only the clock buffer, before it enters the DCC. The DCC<sub>OUT</sub> signal is the output of the proposed DCC, and it is measured using an oscilloscope (Tektronix MSO73304DX).






Figure 6.1.1. (a) Die photograph and block description and (b) measurement setup of the prototype DCC.

## 6.2 MEASUREMENT RESULT OF THE PROPOSED DCC WITH COUNTER-BASED HCDL



Figure 6.2.1 Area breakdown of the proposed DCC.

The area breakdown of the DCC is shown in Figure 6.2.1. The total area of the proposed DCC is 0.0064 mm<sup>2</sup>, excluding the clock buffer. The DCDL accounts for 20.4% of the total area, or 29.8% when the counters are included. By setting the maximum value of $N_{CNT}$ to 16, the area taken by the delay line could be reduced by a factor of about 32. In comparison, Figure 3.1.2 in Chapter 3.1 shows that the DCDL occupies about 82.3% of the total area when the conventional HCDL structure is used, so the share is significantly decreased.

Figure 6.2.2 Measured input and output clock waveform of the proposed DCC at minimum frequency (50MHz) (a) input duty-cycle (20%), (b) input duty-cycle (80%).

Figure 6.2.2 shows the clock signals captured with an oscilloscope when the input clock has a duty-cycle of 20% or 80% at the minimum frequency. The upper signal, CLK<sub>OUT</sub>, is the clock before it enters the DCC; it confirms that the external clock source has arrived through the clock buffer. The lower signal, DCC<sub>OUT</sub>, is the clock whose duty-cycle has been compensated by the proposed DCC. The output duty-cycles are 50.1% and 50.3%, respectively, which shows that the duty-cycle is corrected accurately.

Figure 6.2.3 Measured input and output clock waveform of the proposed DCC at maximum frequency (1.6 GHz) (a) input duty-cycle (20%), (b) input duty-cycle (80%).

Figure 6.2.4 Measured results of the duty-cycle every 100MHz.

Figure 6.2.3 shows the oscilloscope measurement at the maximum frequency of 1.6 GHz. Figure 6.2.3(a) and (b) confirm that input clocks with duty-cycles of 20% and 80% are applied correctly. At the maximum frequency, the output duty-cycle is compensated to 50.8% and 50.7%, respectively. These results confirm operation at both the minimum and maximum frequencies.

Next, it is necessary to check whether the DCC also works well in the intermediate frequency band. Figure 6.2.4 shows the measured output clock duty-cycle at every 100 MHz when the input clock duty-cycle is 20% or 80%. The graph shows that the proposed DCC corrects the duty-cycle with an error of 0.89%.

Figure 6.2.5 Measured results of the output duty-cycle while sweeping input duty-cycle (a) 50MHz and (b) 1.6GHz.

Figure 6.2.5 shows the results of sweeping the input duty-cycle from 20% to 80% in steps of 10%, and measuring the output duty-cycle at the minimum and maximum operating frequencies. The graph shows that the DCC can correct duty-cycle errors up to 30%, with a maximum error of only 0.8%, regardless of frequency.




Figure 6.2.6 Jitter histogram of (a) the input clock and (b) the proposed DCC at 1.6 GHz.

Figure 6.2.6 shows the measured jitter of the input clock and the output clock at 1.6 GHz, which is the maximum frequency of the DCC. As shown in Figure 6.2.6(a), the RMS and peak-to-peak jitter of the input clock are measured as 2.13 ps<sub>rms</sub> and 13.2 ps<sub>pp</sub>. The jitter of the output clock increases slightly to 2.23 ps<sub>rms</sub> and 14.8 ps<sub>pp</sub>, respectively, as shown in Figure 6.2.6(b).
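As a rough illustration of what these numbers imply, the RMS jitter added by the DCC can be estimated by subtracting the input jitter in quadrature. This is a sketch under the assumption that the input jitter and the jitter added by the DCC are uncorrelated; the measurement itself does not establish this, and the function name is hypothetical.

```python
import math


def added_rms_jitter_ps(in_rms_ps, out_rms_ps):
    """Estimate added RMS jitter, assuming uncorrelated jitter sources."""
    return math.sqrt(out_rms_ps ** 2 - in_rms_ps ** 2)


if __name__ == "__main__":
    # Measured values from Figure 6.2.6: 2.13 ps (input), 2.23 ps (output).
    print(round(added_rms_jitter_ps(2.13, 2.23), 2))
```

With the measured values, this estimate gives about 0.66 ps<sub>rms</sub> of added jitter.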

Table 6.2.1 compares the performance of our design with that of other DCCs. The analog DCC [6.2.1] has the disadvantage that its lock-in time is on the order of microseconds. In contrast, the proposed DCC locks much faster because it is trained in only 34 cycles. Compared with the other digital DCCs, its lock-in time is about twice as long in cycles because of the count variable. However, the counter enables wide-range operation and reduces cost compared with the other DCCs. In terms of the ratio of maximum to minimum frequency, our circuit offers a wider operating frequency range than the other DCCs, and the counter-based HCDL minimizes the area cost. For a fair performance comparison, the normalized power (NP) is included, following [6.2.2]. The digital DCC [6.2.3] shows good performance in terms of NP, but because of its coarse tuning it is less effective in terms of output duty-cycle error.

| Reference | Process (nm) | Detection Type | Supply Voltage (V) | Frequency (MHz) | Input Duty-Cycle Range (%) | Max. Output Duty-Cycle Error (%) | Lock-in Time | RMS Jitter (ps) | Pk-Pk Jitter (ps) | Power Consumption (mW) | Normalized Power<sup>a</sup> | Area (mm<sup>2</sup>) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ISCAS '21 [6.2.1] | 130 | Analog | 1.2 | 4~100 | 25~75 | 1.5 | 1.4 µs | - | - | 0.05 @100MHz | 2.78 | 0.008 (Simulation) |
| TCASII '18 [1.3.1.4] | 55 | PI | 1 | 333~1000 | 20~80 | 1.4 | 3-5 cycles | - | 12.53 @1000MHz | 2.09 @1000MHz | 3.95 | 0.0187 |
| TVLSI '15 [6.2.2] | 90 | HCDL | 1 | 75~734 | 9~86 | 1.78 | 15 cycles | 4.11 @734MHz | 27.13 @734MHz | 4.59 @734MHz | 7.23 | 0.0289 |
| TVLSI '16 [1.3.1] | 130 | HCDL | 1.2 | 350~1000 | 10~90 | 1.4 | 14 cycles | 2.901 @1000MHz | 20.5 @1000MHz | 5.6 @1000MHz | 3.11 | 0.059 |
| TVLSI '12 [6.2.3] | 180 | HCDL | 1.8 | 400~2000 | 20~80 | 3.5 | - | 3.5 @1000MHz | 28.45 @1000MHz | 1.76 @400MHz, 3.6 @2000MHz | 0.32 | 0.025 |
| This work | 65 | Counter-based HCDL | 1 | 50~1600 | 20~80 | 0.89 | 34 cycles | 2.23 @1600MHz | 14.8 @1600MHz | 2.11 @1600MHz | 2.1 | 0.0064 |

${}^{\mathrm{a}}\mathrm{NP} = P \times \left(\frac{65}{\mathrm{Technology}}\right) \times \left(\frac{1.0}{V_{DD}}\right)^{2} \times \left(\frac{1600}{F_{MAX}}\right)$, with Technology in nm, $V_{DD}$ in V, and $F_{MAX}$ in MHz.

Table 6.2.1. Performance comparison with other DCC designs.
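The normalized power figure of merit can be recomputed from the tabulated values with a short sketch. It assumes that the reference point of the normalization is this work's operating point (65 nm, 1.0 V, 1600 MHz) with the technology expressed in nm, which is the reading that reproduces the NP column for the rows checked below. The function and parameter names are hypothetical.

```python
def normalized_power(power_mw, tech_nm, vdd_v, fmax_mhz):
    """NP = P * (65 / Technology) * (1.0 / VDD)^2 * (1600 / F_MAX).

    Assumes Technology in nm, VDD in V, F_MAX in MHz, and P taken at F_MAX.
    """
    return power_mw * (65.0 / tech_nm) * (1.0 / vdd_v) ** 2 * (1600.0 / fmax_mhz)


if __name__ == "__main__":
    rows = {
        "TCASII '18": (2.09, 55, 1.0, 1000),
        "TVLSI '15": (4.59, 90, 1.0, 734),
        "TVLSI '16": (5.6, 130, 1.2, 1000),
        "TVLSI '12": (3.6, 180, 1.8, 2000),
        "This work": (2.11, 65, 1.0, 1600),
    }
    for name, args in rows.items():
        print(name, round(normalized_power(*args), 2))
```

The normalization scales each design to a common process, supply voltage, and maximum frequency, so that designs reported at different operating points can be compared on power.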

# **CHAPTER 7**

## CONCLUSION

In this thesis, a 50 MHz to 1.6 GHz digital DCC with a counter-based HCDL for an HBM PHY controller has been proposed. As explained in Chapter 1, this work focuses on a DCC for the HBM controller PHY, but the DCC is compatible with various applications. Specifically, it is well suited to applications such as DRAM and ADCs that require wide-range operation. Among the various analog and digital DCCs, digital DCCs fit such applications well. Among digital DCCs, a DCC using an HCDL is advantageous for wide-range operation, but the area of the DCDL has been pointed out as its disadvantage.

By using the counter-based HCDL, the proposed DCC corrects a wide range of input frequencies at a lower silicon cost. In addition, binary search training compensates the duty-cycle accurately at each operating frequency, and the training settles within 34 cycles across the operating frequencies. A prototype implementation of our DCC corrected duty-cycle errors to within 0.89% of 50% over operating frequencies ranging from 50 MHz to 1.6 GHz.

The total area of the proposed DCC is 0.0064 mm<sup>2</sup>, and the counter-based HCDL reduces the proportion of area occupied by the DCDL. The DCDL share, which is approximately 89% in the conventional method, is reduced to 29.8%. Power consumption is highest at the maximum operating frequency: 2.11 mW is consumed at 1.6 GHz.

The jitter histogram shows an RMS jitter of 2.23 ps and a peak-to-peak jitter of 14.8 ps at 1.6 GHz. Table 6.2.1 compares the performance of our design with that of other digital DCCs. By using the counter-based HCDL, our circuit offers a wider operating frequency range than other DCCs at similar or lower cost.

## **APPENDIX A**

# **HBM CONTROLLER PHY**

As discussed in Chapter 1.2, HBM's share of DRAM consumption is gradually increasing because HBM can reach high bandwidth. Accordingly, the memory itself is being developed in various ways [A.1], and this development also requires the development of controllers and memory physical layers. The design of the HBM controller physical layer involves DCC, equalization, and other functions; this appendix introduces an HBM controller physical layer with a focus on the DCC.

### A.1 CONTROLLER PHY AND DFI SPECIFICATION



Figure A.1.1. Bandwidth and Data-rate Changes per Pin for GDDR and HBM [A.1.1].

Figure A.1.1 shows the per-pin data-rate and total bandwidth of HBM and GDDR. As mentioned earlier, because HBM offers a larger bandwidth than GDDR, its use is increasing in fields such as AI and machine learning that require a large amount of data processing. Currently, technology up to HBM2E is mass-produced as a commercial product, and terabit-class bandwidth can be achieved with 1024 parallel I/Os at 2.4 Gbps per pin.
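The terabit-class figure follows directly from the I/O count and the per-pin rate quoted above. The short sketch below works through the arithmetic; the function name is hypothetical, and the calculation ignores any protocol overhead.

```python
def aggregate_bandwidth_gbps(num_io, per_pin_gbps):
    """Raw aggregate bandwidth in Gb/s for parallel I/Os at a per-pin rate."""
    return num_io * per_pin_gbps


if __name__ == "__main__":
    total_gbps = aggregate_bandwidth_gbps(1024, 2.4)
    print(f"{total_gbps:.1f} Gb/s = {total_gbps / 8:.1f} GB/s")
```

For 1024 I/Os at 2.4 Gbps per pin, this gives 2457.6 Gb/s, or about 2.46 Tb/s (307.2 GB/s).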

As the per-pin data-rate of HBM DRAM increases, it is important to develop the PHY accordingly. The HBM PHY can be divided into two types: the PHY of the memory controller and the PHY of the HBM DRAM. Key HBM Gen2 PHY product highlights include support for DRAM stack heights of 2, 4, and 8, a DFI-style interface to the memory controller, 2.5D interposer connections between the PHY and DRAM, and a validated memory controller interface. Figure A.1.2 shows the HBM PHY block diagram from Rambus. As shown in the figure, the PHY must satisfy the DFI-style interface toward the memory controller and the memory specification toward the HBM DRAM.

Figure A.1.2. Block diagram of HBM PHY [A.1.2].

First, the PHY must satisfy the DFI specification for communication with the controller; the block diagram required for this is shown in Figure A.1.3. DFI is a pervasive industry specification that defines an interface protocol between DDR memory controllers and PHYs [A.1.3]. It enables the development of systems-on-chip (SoCs) that support the latest DRAM standards. The DFI 5.0 specification supports a DFI frequency ratio, which determines how the DDR memory controller (MC) encodes the PHY timing information on the phase-specific buses. In Figure A.1.3, the N:1 serializer varies with the DFI frequency ratio. The DFI frequency ratio also affects the clock frequency inside the PHY. In this work, a 2:1 frequency ratio is used, so the clock frequency inside the PHY is twice that obtained with the 1:1 DFI ratio.

Figure A.1.3. Block diagram of HBM controller PHY.



Figure A.1.4. Comparison of DFI clock frequency and HBM controller PHY clock frequency.

### A.2 ARCHITECTURE OF HBM CONTROLLER PHY

| Function               | # uBumps   | Notes                                     |
|------------------------|------------|-------------------------------------------|
| Data                   | 128        | DQ[127:0]                                 |
| Column Command/Address | 8 or 9     | C[7:0] or C[8:0]                          |
| Row Command/Address    | 6 or 7     | R[5:0] or R[6:0]                          |
| DBI                    | 16         | 1 DBI per 8 DQs                           |
| DM                     | 16         | 1 DM per 8 DQs                            |
| PAR                    | 4          | 1 PAR per 32 DQs                          |
| DERR                   | 4          | 1 DERR per 32 DQs                         |
| Strobes                | 16         | 1 RDQS_t/RDQS_c, WDQS_t/WDQS_c per 32 DQs |
| Clock                  | 2          | CK_t/CK_c                                 |
| CKE                    | 1          | CKE                                       |
| AERR                   | 1          | AERR                                      |
| Redundant Data         | 8          | RD[7:0]                                   |
| Redundant Row          | 1          | RR                                        |
| Redundant Column       | 1          | RC                                        |
| Total                  | 212 or 214 |                                           |

Figure A.2.1. HBM2 single channel signal description [A.2.1].

Figure A.2.1 lists the signals required for a single HBM2 channel. This is the number of signals needed to design the single-channel structure of the HBM2 controller-side PHY. Because a system of this size is too large to verify, this work verifies the memory operation with a minimal configuration, namely the basic READ and WRITE operations. The signals required for this are as follows.

The READ (RD) and WRITE (WR) operations require an activate (ACT) command, which activates the DRAM cells at the declared address before the RD or WR command, and a precharge (PRE) command, which deactivates them after the RD or WR. Basic operation can be verified using these four commands. Therefore, the number of signals to

Table 30. Row Commands Truth Table

| FUNCTION | SYMBOL | CLOCK CYCLE | CKE (N-1) | CKE (N) | R0 | R1 | R2 | R3 | R4 | R5 | R6 | NOTES |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Row No Operation | RNOP | Rising | H | H | H | H | H | V | V | V | V | 1,4,7 |
| | | Falling | | | V | V | PAR | V | V | V | V | |
| Activate | ACT | Rising | H | H | L | H | V/SID/SID0 | BA0 | BA1 | BA2 | RA14 | 1,2,3,4,7,8,9,10 |
| | | Falling | | | RA11 | RA12 | PAR | BA4 | RA13 | BA3 | V/SID1 | |
| | | Rising | | | RA5 | RA6 | RA7 | RA8 | RA9 | RA10 | V | |
| | | Falling | | | RA0 | RA1 | PAR | RA2 | RA3 | RA4 | V | |
| Precharge | PRE | Rising | H | H | H | H | L | BA0 | BA1 | BA2 | V/SID1 | 1,3,4,7,8,9 |
| | | Falling | | | V | V/SID/SID0 | PAR | BA4 | L | BA3 | V | |
| Precharge All | PREA | Rising | H | H | H | H | L | V | V | V | V | 1,4,7,8 |
| | | Falling | | | V | V | PAR | BA4 | H | V | V | |
| Single Bank Refresh | REFSB | Rising | H | H | L | L | H | BA0 | BA1 | BA2 | V/SID1 | 1,3,4,7,8,9 |
| | | Falling | | | V | V/SID/SID0 | PAR | BA4 | L | BA3 | V | |
| Refresh | REF | Rising | H | H | L | L | H | V | V | V | V | 1,4,7,8 |
| | | Falling | | | V | V | PAR | BA4 | H | V | V | |
| Power-Down Entry | PDE | Rising | H | L | H | H | H | V | V | V | V | 1,4,6,7 |
| | | Falling | | | V | V | PAR | V | V | V | V | |
| Self Refresh Entry | SRE | Rising | H | L | L | L | H | V | V | V | V | 1,4,6,7 |
| | | Falling | | | V | V | PAR | V | V | V | V | |
| Power-Down & Self Refresh Exit | PDX/SRX | Rising | L | H | H | H | H | V | V | V | V | 1,5,6,7 |
| | | Falling | | | V | V | V | V | V | V | V | |

(a)

Table 31. Column Commands Truth Table (columns: FUNCTION, SYMBOL, CLOCK CYCLE, C0 to C8, NOTES)

(b)

Figure A.2.2. HBM2 (a) row commands and (b) column commands truth table.

verify basic operation can be reduced as follows, according to Figure A.2.2. The row command signals required for the basic operations RD, WR, ACT, and PRE are R0, R1, R2, and R4, and the column command signals are C0, C1, C2, and C3. After a two-cycle command, COL can return to CNOP. However, one of the ROW commands, ACTIVATE, is a four-cycle command. Because the DFI frequency ratio is set to 2:1, as described in Section A.1, the PHY must receive four signals from the controller for each signal it transmits.

Figure A.2.3. Conceptual architecture of HBM PHY.
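The four-to-one relation can be illustrated with a minimal behavioral sketch of the serialization for one output pin. It assumes that the factor of four comes from the 2:1 DFI frequency ratio combined with double-data-rate signaling (two edges per PHY clock cycle), and that the four bits are ordered as phase 0 rising, phase 0 falling, phase 1 rising, phase 1 falling. Both the ordering and the function names are assumptions made for illustration and are not taken from the DFI specification.

```python
def phy_clock_mhz(dfi_clock_mhz, freq_ratio=2):
    """PHY clock frequency for a given DFI clock and DFI frequency ratio."""
    return dfi_clock_mhz * freq_ratio


def serialize(dfi_words, freq_ratio=2, edges_per_cycle=2):
    """Serialize per-DFI-clock words into the bit stream of one output pin.

    Each word carries freq_ratio * edges_per_cycle bits, assumed ordered as
    (phase 0 rising, phase 0 falling, phase 1 rising, phase 1 falling).
    """
    bits_per_word = freq_ratio * edges_per_cycle
    stream = []
    for word in dfi_words:
        if len(word) != bits_per_word:
            raise ValueError(f"expected {bits_per_word} bits per DFI clock")
        stream.extend(word)
    return stream


if __name__ == "__main__":
    # Two DFI clock cycles of input for a single pin.
    print(serialize([(1, 0, 1, 1), (0, 0, 1, 0)]))
    print(phy_clock_mhz(400))
```

In this model, each DFI clock cycle delivers four parallel bits per pin, which the PHY sends out over two of its own clock cycles on both clock edges.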

Therefore, the entire chip architecture was configured as shown in Figure A.2.3. Four blocks each were configured to receive the ROW commands and the COL commands from the controller, a CK block was placed to clock the internal circuits, and one DQ block and one strobe block were placed to verify the RD/WR operation through DQ.

### A.3 DESIGN CONSIDERATION OF THE HBM PHYSICAL LAYER

The following considerations are required to verify the chip configured in Section A.2. First, it is necessary to receive data from the controller and verify whether the data is written (WR) to the memory and read (RD) from the memory correctly. Because it is practically difficult to verify this with an actual memory and controller, a memory emulator and a controller emulator are implemented on an FPGA for verification. Since the data-rate of the FPGA is limited, this verification covers function only. Next, it is necessary to verify whether the same function is capable of high-speed operation. Therefore, a high-speed verification path is required to verify the clock tree and the ROW data inside the chip at a high data-rate.

Figure A.3.1 illustrates the two measurement methods. Firstly, Figure A.3.1(a) depicts the measurement method for the normal mode, where the controller and memory emulator are simulated using Xilinx's VC707 evaluation kit. Considering the operating frequency of this FPGA board, each FPGA input and output are measured at 100 Mbps. The basic operation, using the WR operation as an example, is as follows: ROW and COL commands, along with DQ and DQS signals, are sent from the controller to the controller PHY chip. The data received from this chip is then communicated to the memory PHY side through the PCB. The memory PHY deserializes the data and sends it to the memory emulator side, allowing for verification of successful data writing to the memory emulator through the PC. Similarly, in the RD operation, the HBM PHY's basic operation is confirmed by ensuring that the data is read correctly and delivered to the controller upon sending the command.



Figure A.3.1. HBM PHY (a) normal mode and (b) test mode measurement setup.

Figure A.3.1(b) shows the measurement method used in test mode to demonstrate that the inside of the chip is capable of high-speed operation. Two signals are observed for high-speed operation. First, one of the differential clock signals is brought out through the CKE pin to check whether the internal clock works well. Second, because it is necessary to check whether the command or DQ signals are output properly at high speed with the internal clock, two of the ROW command signals are used to check equalization and crosstalk cancellation.

As mentioned in Chapter 1.3, DCC plays a crucial role in the internal operation of the HBM PHY. One important specification of the chip is the requirement for wide-range operation. As previously discussed, the chip must support operation frequencies in the tens of MHz range to accommodate measurements using FPGA, and for high-speed operation, it should be capable of operating at frequencies in the GHz range. Hence, a DCC capable of wide-range operation is necessary. Additionally, by implementing digital DCC, the memory can store operation frequency information in power-down mode, enabling immediate operation at any time. Lastly, the design has taken into consideration the need to minimize costs in the initial training process through the use of fast-lock mechanisms.

## **BIBLIOGRAPHY**

- [1.1.1] Y.-J. Min et al., "A 0.31–1 GHz fast-corrected duty-cycle corrector with successive approximation register for DDR DRAM applications," in *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 20, no. 8, pp. 1524–1528, Aug. 2012.
- [1.1.2] J.-H. Chae, H. Ko, J. Park, and S. Kim, "A 12.8-Gb/s quarter-rate transmitter using a 4:1 overlapped multiplexing driver combined with an adaptive clock phase aligner," *IEEE Transactions on Circuits and Systems II: Express Briefs*, vol. 66, no. 3, pp. 372–376, Mar. 2019.
- [1.1.3] N. Liu, J. Todsen and D. Chen, "A Low-Power and Area-Efficient Analog Duty Cycle Corrector for ADC's External Clocks," in *IEEE International Symposium on Circuits and Systems (ISCAS)*, 2020, pp. 1-4.
- [1.1.4] T. Miki, T. Morie, T. Ozeki and S. Dosho, "An 11-b 300-MS/s Double-Sampling Pipelined ADC With On-Chip Digital Calibration for Memory Effects," in *IEEE Journal of Solid-State Circuits*, vol. 47, no. 11, pp. 2773-2782, Nov. 2012.
- [1.1.5] DDR5 SDRAM Specification (JESD79-5), JEDEC Standard, JEDEC solid state technology association, Jul. 2020.
- [1.1.6] Y. Qiu, Y. Zeng, and F. Zhang, "1–5 GHz duty-cycle corrector circuit with wide correction range and high precision," in *Electronics Letters (EL)*, vol. 50, no. 11, pp. 792–794, 2014.
- [1.1.7] K. Cheng, C. Su, and K. Chang, "A high linearity, fast-locking pulse width control loop with digitally programmable duty cycle correction for wide range operation," in *IEEE Journal of Solid-State Circuits*, vol. 43, pp. 399–413, 2008.
- [1.2.1] SK Hynix "https://news.skhynix.co.kr/presscenter/developed-the-first-tsvbased-high-speed-memory"
- [1.2.2] AMD "https://www.amd.com/system/files/documents/high-bandwidth-memory-hbm.pdf".

- [1.2.3] J. -H. Chae, H. Ko, J. Park and S. Kim, "A Quadrature Clock Corrector for DRAM Interfaces, With a Duty-Cycle and Quadrature Phase Detector Based on a Relaxation Oscillator," in IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 27, no. 4, pp. 978-982, April 2019.
- [1.3.1.1] I. Raja, G. Banerjee, M. A. Zeidan and J. A. Abraham, "A 0.1–3.5-GHz Duty-Cycle Measurement and Correction Technique in 130-nm CMOS," in *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 24, no. 5, pp. 1975-1983, May 2016.
- [1.3.1.2] C.-C. Chung, D. Sheng, S.-E. Shen, "High-Resolution All-Digital Duty-Cycle Corrector in 65-nm CMOS Technology," in *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 22, no. 5, pp. 1096-1105, May 2014.
- [1.4.1] C.-H. Jeong, A. Abdullah, Y.-J. Min, I.-C. Hwaing, S.-W. Kim, "All-Digital Duty-Cycle Corrector With a Wide Duty Correction Range for DRAM Applications," in *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 24, no. 1, pp. 363-367, Jan. 2016.
- [1.4.1.1] C.-C. Chung, D. Sheng, C.-J. Li, "A Wide-Range Low-Cost All-Digital Duty-Cycle Corrector," in *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 23, no. 11, pp. 2487-2496, Nov. 2015.
- [1.4.1.2] J.-R. Su, T.-W. Liao, and C.-C. Hung, "Delay-line based fast-locking all-digital pulsewidth-control circuit with programmable duty cycle," in *IEEE Asian Solid-State Circuits Conference (A-SSCC)*, Nov. 2012, pp. 305–308.
- [1.4.1.3] J. Sim, H. Park, Y. Kwon, S. Kim and C. Kim, "A 1-3.2 GHz 0.6 mW/GHz Duty-Cycle-Corrector Using Bangbang Duty-Cycle-Detector," in 2021 IEEE International Symposium on Circuits and Systems (ISCAS), 2021, pp. 1-4.
- [1.4.1.4] K.-T. Kang, S.-Y. Kim, S. J. Kim, D. Lee, S.-S. Yoo, and K.-Y. Lee, "A 0.33–1 GHz open-loop duty cycle corrector with digital falling edge modulator," in *IEEE Transactions on Circuits and Systems II: Express Briefs*, vol. 65, no. 12, pp. 1949–1953, Dec. 2018.

- [1.4.1.5] C.-H. Jeong, A. Abdullah, Y.-J. Min, I.-C. Hwaing, S.-W. Kim, "All-Digital Duty-Cycle Corrector With a Wide Duty Correction Range for DRAM Applications," in *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 24, no. 1, pp. 363-367, Jan. 2016.
- [1.4.2.1] J. -H. Lim et al., "A Delay Locked Loop With a Feedback Edge Combiner of Duty-Cycle Corrector With a 20%–80% Input Duty Cycle for SDRAMs," in IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 63, no. 2, pp. 141-145, Feb. 2016.
- [1.5.1] J. Sim, H. Park, Y. Kwon, S. Kim and C. Kim, "A 1-3.2 GHz 0.6 mW/GHz Duty-Cycle-Corrector Using Bangbang Duty-Cycle-Detector," 2021 IEEE International Symposium on Circuits and Systems (ISCAS), Daegu, Korea, 2021.
- [2.1.1] HBM2 SDRAM Specification, Jan. 2020.
- [3.1.1] Wang and Jinn-Shyan Wang, "An all-digital 50% duty-cycle corrector," in 2004 IEEE International Symposium on Circuits and Systems (ISCAS), 2004, pp. II-925.
- [3.2.1] C.-C. Chung, D. Sheng, C.-J. Li, "A Wide-Range Low-Cost All-Digital Duty-Cycle Corrector," in *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 23, no. 11, pp. 2487-2496, Nov. 2015.
- [3.2.2] J. Sim, H. Park, Y. Kwon, S. Kim and C. Kim, "A 1-3.2 GHz 0.6 mW/GHz Duty-Cycle-Corrector Using Bangbang Duty-Cycle-Detector," in 2021 IEEE International Symposium on Circuits and Systems (ISCAS), 2021, pp. 1-4.
- [3.2.3] C.-H. Jeong, A. Abdullah, Y.-J. Min, I.-C. Hwaing, S.-W. Kim, "All-Digital Duty-Cycle Corrector With a Wide Duty Correction Range for DRAM Applications," in *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 24, no. 1, pp. 363-367, Jan. 2016.
- [4.1.1] J. Sim, H. Park, Y. Kwon, S. Kim and C. Kim, "A 1-3.2 GHz 0.6 mW/GHz Duty-Cycle-Corrector Using Bangbang Duty-Cycle-Detector," 2021 IEEE International Symposium on Circuits and Systems (ISCAS), Daegu, Korea, 2021.
- [4.1.2] C. Yoo, C. Jeong and J. Kih, "Open-loop full-digital duty cycle correction circuit," in Electronics Letters (EL), vol. 41, no. 11, pp. 635-636, May 2005.

- [4.1.3] K.-T. Kang, S.-Y. Kim, S. J. Kim, D. Lee, S.-S. Yoo, and K.-Y. Lee, "A 0.33–1 GHz open-loop duty cycle corrector with digital falling edge modulator," in *IEEE Transactions on Circuits and Systems II: Express Briefs*, vol. 65, no. 12, pp. 1949–1953, Dec. 2018.
- [4.1.4] J.-H. Lim et al., "A Delay Locked Loop With a Feedback Edge Combiner of Duty-Cycle Corrector With a 20%–80% Input Duty Cycle for SDRAMs," in *IEEE Transactions on Circuits and Systems II: Express Briefs*, vol. 63, no. 2, pp. 141-145, Feb. 2016.
- [5.1.1] S. Choi, Y.-U. Jeong, J.-H. Chae, S.-H. Jeong and S. Kim, "A Differentiating Receiver With a Transition-Detecting DFE for Dual-Rank Mobile Memory Interface," in IEEE Access, vol. 9, pp. 120285-120296, 2021.
- [5.4.1] J.-H. Lim et al., "A Delay Locked Loop With a Feedback Edge Combiner of Duty-Cycle Corrector With a 20%–80% Input Duty Cycle for SDRAMs," in *IEEE Transactions on Circuits and Systems II: Express Briefs*, vol. 63, no. 2, pp. 141-145, Feb. 2016.
- [6.2.1] N. Liu, J. Todsen and D. Chen, "A Low-Power and Area-Efficient Analog Duty Cycle Corrector for ADC's External Clocks," in *IEEE International Symposium on Circuits and Systems (ISCAS)*, 2020, pp. 1-4.
- [6.2.2] C.-C. Chung, D. Sheng, and C.-J. Li, "A wide-range low-cost all-digital duty-cycle corrector," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 23, no. 11, pp. 2487–2496, Nov. 2015.
- [6.2.3] J. Gu, J. Wu, D. Gu, M. Zhang and L. Shi, "All-digital wide range precharge logic 50% duty cycle corrector," in IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 20, no. 4, pp. 760-764, April 2012.
- [A.1] D. U. Lee et al., "22.3 A 128Gb 8-High 512GB/s HBM2E DRAM with a Pseudo Quarter Bank Structure, Power Dispersion and an Instruction-Based At-Speed PMBIST," 2020 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 2020, pp. 334-336.

- [A.1.1] M.-J. Park et al., "A 192-Gb 12-High 896-GB/s HBM3 DRAM with a TSV Auto-Calibration Scheme and Machine-Learning-Based Layout Optimization," 2022 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 2022, pp. 444-446.
- [A.1.2] RAMBUS "https://www.rambus.com/blogs/the-rambus-hbm-gen2-phy-acloser-look".
- [A.1.3] DFi<sup>TM</sup> Specification document
- [A.2.1] JEDEC Specification of HIGH BANDWIDTH MEMORY (HBM) DRAM2 "https://www.jedec.org/document search?search api views fulltext=HBM".

# Korean Abstract

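In various applications, performance depends on the duty cycle of the clock. As more applications require high bandwidth, demand for HBM as a memory solution is also increasing. It is therefore necessary to develop not only the HBM itself but also the physical layer between the controller and the memory. This dissertation describes the design of that physical layer.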

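Duty-cycle distortion can arise from process and voltage variations, or when a clock signal passes through clock buffers. Various types of duty-cycle correctors have been proposed. Among them, correction based on a half-cycle delay line (HCDL) is better suited to a wide operating range than a phase-interpolation duty-cycle corrector.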

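To compensate for duty-cycle distortion, this work presents a digital duty-cycle corrector (DCC) with a counter-based HCDL. The half-cycle delay line of a conventional edge-combiner-type DCC requires a large area, which makes the DCC unsuitable for applications operating over a wide range of frequencies. The proposed counter-based HCDL reduces silicon cost by reusing the delay line repeatedly while maintaining the performance of existing DCCs.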

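In addition, an FSM block is designed so that training completes in 34 cycles, allowing efficient operation over a wide operating range. Measurement results in a 65nm CMOS technology show a duty-cycle error below 0.89% for an input duty-cycle range of 20-80% over 50-1600MHz. The DCC consumes 2.11mW at 1.6GHz.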

Keywords: memory interface, HBM physical layer, duty-cycle corrector, HCDL

Student Number: 2015-22778