Deep Learning with Privacy Protection via Indirect Data Exchange

장재희


Deep Learning with Privacy Protection via Indirect Data Exchange : 개인 정보 보안을 위한 간접 데이터 교환 기반 딥러닝


Authors: 장재희

Advisor: 윤성로

Issue Date: 2023

Publisher: Seoul National University Graduate School

Keywords: Artificial Intelligence ; Deep Learning ; Homomorphic Encryption ; Federated Learning

Description: Thesis (Ph.D.) -- Seoul National University Graduate School : College of Engineering, Department of Electrical and Computer Engineering, 2023. 2. 윤성로.

Abstract: Artificial intelligence (AI) powered by deep learning has shown remarkable performance in numerous fields. However, although this was made possible by training deep neural networks (DNNs) on large-scale data, the privacy of the data used in deep learning has been largely overlooked. As real-world cases from many deep-learning-based application services show, the privacy violations that occur during model inference and training are very serious. The importance of privacy-preserving deep learning has therefore come to the fore. This dissertation discusses three methodologies for privacy-preserving deep learning: encrypted deep neural network inference, personalized federated learning with heterogeneous models, and vertical federated learning (VFL) with self-supervised learning.
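The first study uses homomorphic encryption (HE) to secure deep neural networks and user data stored on remote servers. Computing on the model and the user data in encrypted form can provide the highest level of privacy protection. Nevertheless, because computing on homomorphically encrypted data is difficult, prior work has addressed only very shallow networks, which greatly limits the range of problems such models can solve. In this study, we use $\texttt{MatHEAAN}$, which extends the CKKS scheme to provide efficient matrix representations, to implement a deep sequential model with a gated recurrent unit, called the $\texttt{MatHEAAN}$-based gated recurrent unit (MatHEGRU). MatHEGRU adopts the \textit{trainable nonlinear function approximation} proposed in this study, which achieves fairly accurate activation values while greatly reducing the circuit depth of activation function computation. Experiments confirm that MatHEGRU achieves near-plaintext predictive performance on real sequence datasets covering sequence modeling, regression, and classification of images and genome sequences.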
The second study strengthens the personalization of federated learning by enabling the training of heterogeneous client models through federated classifier averaging (FedClassAvg). Personalized federated learning allows multiple participating clients to build personalized models while taking part in training in a communication-efficient manner, without disclosing their sensitive data. Nevertheless, most personalized federated learning frameworks have assumed that all clients share the same neural network architecture, and research on personalized client model architectures is lacking. We therefore propose FedClassAvg, a new personalized federated learning method. The main idea of FedClassAvg rests on the observation that a deep neural network for a supervised learning task consists of a feature extractor and classifier layers. FedClassAvg aggregates classifier weights as an agreement on decision boundaries over the feature space, so that clients with non-independent and identically distributed (non-IID) data can learn about scarce labels. It also uses local feature representation learning to stabilize the decision boundaries and to improve the local feature extraction capability of client models. Prior work on heterogeneous models required additional computation for knowledge distillation using extra data and the corresponding model weights. In contrast, FedClassAvg is highly communication-efficient because clients communicate only a few fully connected layers. In addition, FedClassAvg does not require additional optimization problems at the server, such as knowledge distillation, which incur substantial computational overhead. Extensive experiments demonstrate that FedClassAvg outperforms existing state-of-the-art algorithms.
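The third study proposes a new vertical federated learning (VFL) framework without gradient transmission, contributing to the privacy of local client datasets. Vertical federated learning is a federated learning methodology for learning from distributed data features without transferring the data directly. Conventional VFL is designed to split one large model into subnetworks that learn from the aligned data of multiple clients, and it trains by transmitting intermediate computation results, such as logits and gradients, between collaborating parties. However, numerous studies have shown that an attacker can successfully recover a client's local dataset from gradients, which can introduce new privacy violations. Moreover, conventional VFL can train only on aligned client datasets. That is, only part of each client's local dataset can be used, and aligning data across clients in order to learn from more data can incur additional cost and loss of privacy. Furthermore, because many studies have confirmed that deep learning performs poorly on tabular datasets compared with conventional machine learning algorithms, a new algorithm specific to tabular data, where VFL is most needed, is required. We therefore propose vertical federated learning on tabular datasets via local representation distillation (FedTaDR). FedTaDR learns masked feature reconstruction and mask prediction on local datasets to improve the representational capacity of the local encoders. The global server then collects the local encoders' representation vectors for the aligned data and uses a supervised reconstruction network to reconstruct the local representation vectors so that they carry collaborative information for the prediction task. Contrastive learning is also used at this stage to regulate the distances between local representation vectors in the latent space. The global server then sends the reconstructed representation vectors to the clients, and each client fine-tunes its local model using knowledge distillation. Experiments confirm that FedTaDR shows excellent predictive performance in real VFL scenarios such as industrial manufacturing output, distributed automobile sensors, and website click logs.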
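This dissertation proposes deep learning methods that protect privacy through indirect data exchange. The proposed techniques address data privacy issues while minimizing the impact on the model's computational efficiency and predictive effectiveness. Extensive experiments on real data confirm that the proposed methods provide strong inference or training capabilities for deep learning without directly sharing raw data. We expect this dissertation to contribute to the development of AI systems that protect data privacy, so that anyone can use AI with confidence.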
Artificial intelligence systems powered by deep learning have demonstrated remarkable performance in numerous disciplines. However, data privacy in deep learning has been largely overlooked, even though these advances were made possible by deep neural networks (DNNs) trained on enormous amounts of data. In applications of deep learning, serious privacy violations can occur during both model inference and training, as observed in various real-world instances. Privacy-preserving deep learning has therefore grown in importance. This dissertation discusses three themes in privacy-preserving deep learning: encrypted DNN inference, personalized federated learning with heterogeneous client models, and vertical federated learning with self-supervised learning.
The first study uses homomorphic encryption (HE) to secure DNNs and data held on third-party servers. Computing on both the model and the user data in encrypted form offers the highest level of privacy. Nevertheless, the difficulty of computing on homomorphically encrypted data has so far limited the set of available operations and the depth of networks. Using $\texttt{MatHEAAN}$, an extension of the CKKS scheme that provides efficient matrix representations, we implemented a deep sequential model with a gated recurrent unit, called the $\texttt{MatHEAAN}$-based gated recurrent unit (MatHEGRU). MatHEGRU adopts the \textit{trainable nonlinear approximation} proposed in this study, which significantly reduces the circuit depth needed to compute activation functions while achieving fairly precise activations. MatHEGRU demonstrated near-plaintext predictive performance on practical sequence datasets covering sequence modeling, regression, and classification of images and genome sequences.
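As a rough illustration of why activation functions are the hard part under HE: CKKS-style schemes evaluate additions and multiplications, so a nonlinearity such as the sigmoid is typically replaced by a low-degree polynomial. The sketch below is not the dissertation's trainable nonlinear approximation; it only assumes the general idea of a polynomial whose coefficients are learned. The function names, the interval [-4, 4], the degree, and the learning rate are all hypothetical choices, and everything runs on plaintext floats.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def poly_eval(coeffs, t: float) -> float:
    """Horner's rule: uses only additions and multiplications."""
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * t + c
    return acc


def fit_poly_activation(radius=4.0, degree=3, steps=3000, lr=0.5, n_points=81):
    """Learn polynomial coefficients by gradient descent on the mean squared error.

    Inputs are rescaled to t = x / radius in [-1, 1] to keep the fit well conditioned.
    """
    ts = [-1.0 + 2.0 * i / (n_points - 1) for i in range(n_points)]
    targets = [sigmoid(radius * t) for t in ts]
    coeffs = [0.0] * (degree + 1)
    for _ in range(steps):
        grads = [0.0] * (degree + 1)
        for t, y in zip(ts, targets):
            err = poly_eval(coeffs, t) - y
            power = 1.0
            for k in range(degree + 1):
                grads[k] += 2.0 * err * power / n_points
                power *= t
        coeffs = [c - lr * g for c, g in zip(coeffs, grads)]
    return coeffs


RADIUS = 4.0  # hypothetical input range; values outside it are not covered by the fit
coeffs = fit_poly_activation(radius=RADIUS)


def approx_sigmoid(x: float) -> float:
    return poly_eval(coeffs, x / RADIUS)


# Worst-case gap between the polynomial and the true sigmoid on a grid over [-4, 4].
max_err = max(abs(approx_sigmoid(i / 10.0) - sigmoid(i / 10.0)) for i in range(-40, 41))
print(f"max error on [-4, 4]: {max_err:.3f}")
```

A degree-3 polynomial costs only two levels of multiplication, which is the kind of depth saving the paragraph above refers to; the price is an approximation error that grows quickly outside the fitted interval.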
In the second study, we enhance the personalization of federated learning by enabling heterogeneous client model training through federated classifier averaging (FedClassAvg). Personalized federated learning seeks to enable multiple clients to build personalized models while engaging in collaborative training in a communication-efficient manner and without disclosing sensitive data. Nonetheless, most personalized federated learning algorithms assume that clients share the same neural network architecture, and methods for heterogeneous models are understudied. Therefore, we propose a novel personalized federated learning method, FedClassAvg. FedClassAvg is based on the observation that DNNs for supervised learning tasks consist of feature extractor and classifier layers. FedClassAvg aggregates classifier weights as an agreement on decision boundaries over feature spaces, allowing clients with non-independent and identically distributed (non-IID) data to learn about scarce labels. In addition, local feature representation learning is used to stabilize the decision boundaries and improve the local feature extraction capabilities of client models. While prior approaches for heterogeneous models require supplementary data or model weights to perform knowledge transfer, FedClassAvg only requires clients to communicate a few fully connected layers, which is highly communication-efficient. In addition, FedClassAvg does not require additional optimization at the server, such as knowledge transfer, which incurs substantial computational overhead. Extensive experiments show that FedClassAvg outperforms previous state-of-the-art algorithms.
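The aggregation step described above can be sketched as follows. This is a minimal illustration, not the FedClassAvg implementation: it assumes that every client's classifier layer has the same shape even though the feature extractors differ, and it assumes a sample-count-weighted mean in the style of FedAvg, which the abstract does not specify. `ClientUpdate` and `average_classifiers` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List

Matrix = List[List[float]]


@dataclass
class ClientUpdate:
    # Only the classifier layer is shared; feature extractors stay on the client.
    classifier_weights: Matrix
    num_samples: int


def average_classifiers(updates: List[ClientUpdate]) -> Matrix:
    """Weighted element-wise mean of identically shaped classifier matrices."""
    total = sum(u.num_samples for u in updates)
    rows = len(updates[0].classifier_weights)
    cols = len(updates[0].classifier_weights[0])
    averaged = [[0.0] * cols for _ in range(rows)]
    for u in updates:
        share = u.num_samples / total  # assumed weighting rule
        for i in range(rows):
            for j in range(cols):
                averaged[i][j] += share * u.classifier_weights[i][j]
    return averaged


updates = [
    ClientUpdate([[1.0, 0.0], [0.0, 1.0]], num_samples=30),
    ClientUpdate([[3.0, 2.0], [2.0, 3.0]], num_samples=10),
]
global_classifier = average_classifiers(updates)
print(global_classifier)
```

Because only these small matrices travel between clients and server, the communication cost is independent of how large or how different the clients' feature extractors are.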
The third study proposes a novel vertical federated learning (VFL) framework that trains without gradient transmission, which helps preserve the privacy of local client datasets. VFL aims to learn from distributed data features without requiring direct data transfer. Conventional VFL approaches partition a large model into subnetworks and train by transmitting intermediate computation results, such as logits and gradients, between collaborating parties. However, multiple studies have highlighted the vulnerability of this approach, as attackers can potentially reconstruct the local dataset from gradients. Furthermore, conventional VFL training is limited to aligned client datasets, which leaves clients unable to fully leverage their local datasets, and aligning more data across clients incurs additional cost and loss of privacy. To address these issues, we propose Vertical Federated Learning on tabular datasets via Local Representation Distillation (FedTaDR). FedTaDR uses masked feature reconstruction and mask prediction on local datasets to enhance the expressive capacity of local encoders. The global server subsequently reconstructs local representations using a supervised reconstruction network, which conveys collaborative information relevant to the prediction task. Using contrastive learning on the latent space, the server regulates the distance between local representations. Client models are then updated by distilling the reconstructed representations. Through empirical evaluations, we demonstrate that FedTaDR achieves exceptional prediction performance in real-world VFL scenarios, such as industrial production output, distributed automobile sensors, and website click records.
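The two self-supervised pretext tasks named above, masked feature reconstruction and mask prediction, both start from a corrupted copy of each tabular row. The sketch below shows one way to build those training pairs. The corruption rule (replace a masked cell with the same column's value from a randomly chosen row) is an assumption for illustration and is not taken from the dissertation; `corrupt_rows` is a hypothetical helper.

```python
import random
from typing import List, Tuple

Row = List[float]


def corrupt_rows(
    rows: List[Row], mask_prob: float, rng: random.Random
) -> Tuple[List[Row], List[List[int]]]:
    """Return corrupted rows and the binary masks marking which cells were replaced.

    A local encoder would take the corrupted row as input; the mask is the target
    for mask prediction and the original row is the target for reconstruction.
    """
    n_cols = len(rows[0])
    corrupted, masks = [], []
    for row in rows:
        mask = [1 if rng.random() < mask_prob else 0 for _ in range(n_cols)]
        # Assumed corruption rule: borrow the value from the same column of a random row.
        new_row = [rng.choice(rows)[j] if mask[j] else row[j] for j in range(n_cols)]
        corrupted.append(new_row)
        masks.append(mask)
    return corrupted, masks


rows = [
    [0.1, 5.0, 20.0],
    [0.4, 7.0, 35.0],
    [0.9, 6.0, 50.0],
    [0.3, 8.0, 10.0],
]
corrupted, masks = corrupt_rows(rows, mask_prob=0.3, rng=random.Random(0))
for original, noisy, mask in zip(rows, corrupted, masks):
    print(original, "->", noisy, "mask:", mask)
```

Both targets come from the client's own data, so this stage can use every local row, aligned or not, and nothing leaves the client while the encoder is pretrained.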
Throughout this dissertation, we propose methods for deep learning with privacy protection via indirect data exchange. These methods address data privacy issues with minimal impact on the computational efficiency and predictive effectiveness of the model. Substantial empirical evaluations show that the proposed methods provide strong inference or training capabilities for deep learning without requiring the direct sharing of raw data. We anticipate that this dissertation will contribute to the development of private artificial intelligence systems, allowing everyone to use AI applications without fear.

Language: eng

URI: https://hdl.handle.net/10371/203987

https://dcollection.snu.ac.kr/common/orgView/000000176009

Files in This Item:

000000176009.pdf 11.49 MB

Appears in Collections:

College of Engineering/Engineering Practice School (공과대학/대학원)
- Dept. of Electrical and Computer Engineering (전기·정보공학부)
  - Theses (Ph.D. / Sc.D._전기·정보공학부)
