Adaptive Deep Image Signal Processor for Practical Applications

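Abstract (translated from Korean): Cameras have become indispensable to modern life, not only for recording everyday moments but also for social network services, video conferencing, and personal broadcasting. Because camera performance plays an important role in smartphone market share, smartphone manufacturers (e.g., Samsung and Apple) are equipping next-generation smartphones with more and larger camera lenses. In the image pipeline of a digital camera, the image sensor converts light into a digital signal called a raw image, and the image signal processor (ISP) converts the raw image into a human-readable RGB image. The ISP performs a variety of image processing tasks related to recovering images degraded by camera hardware constraints and software transformations (image restoration) and to retouching images into an attractive style (image enhancement).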

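The challenge for an ISP is to design and implement a general model for all image degradations or all image styles. In particular, images lose detail because of various factors such as noise, blur, and compression. A general model that restores every degradation requires a sufficiently large number of model parameters and still does not achieve the best result for each individual degradation. Moreover, what counts as an attractive image style is subjective and varies with factors such as personal experience, atmosphere, and mood. Most state-of-the-art neural network models generate a single-style image for an input image, which limits how well they can satisfy users.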

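To address these challenges, this dissertation proposes adaptive ISPs for practical applications using deep neural networks. Specifically, the three proposed algorithms and their applications are: (1) adaptive data synthesis for camera image denoising, (2) adaptive neural architecture search for controllable image restoration, and (3) adaptive ISP parameter estimation for controllable image enhancement. First, in camera image denoising, pairs of noisy and clean images are difficult to obtain for supervised learning. The proposed method generates pseudo training data that enables supervised learning from noisy test images alone, and trains a general CNN-based denoiser to fit the test images. Second, controllable image restoration is a new image restoration application that generates the results of multiple predetermined restoration tasks for an unknown degradation and lets the user select the desired result. The proposed method automatically finds a CNN architecture that is efficient at generating these multiple restoration results. The searched CNN shares its early layers and adapts the remaining layers to each task. Third, the proposed method learns to generate multiple high-quality image styles to satisfy subjective user preferences. The user selects a style in the learned latent representation, and a neural network converts the style into ISP parameters. To support subjective evaluation of image styles, the proposed method generates diverse, high-quality styles. Style generation is efficient because it uses a plug-and-play ISP.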

The proposed methods achieved meaningful performance improvements on their respective computer vision tasks, and their effectiveness was verified through careful empirical analysis and ablation studies. They also showed excellent image quality improvement and model efficiency on benchmarks widely used for each task, and meaningful performance was confirmed on real images as well.

Cameras have become indispensable in everyday life for recording personal moments, social network services, video meetings, and personal broadcasting. Because camera performance plays a key role in smartphone market share, smartphone manufacturers (e.g., Samsung and Apple) equip each new generation with more and larger camera lenses. In the image pipeline of a digital camera, an image sensor converts light from an object into a digital signal called a raw image, and an Image Signal Processor (ISP) transforms the raw image into a human-readable RGB image. An ISP performs a variety of image processing tasks related to restoration and enhancement: image restoration aims to recover an original image from its corrupted version, and image enhancement aims to retouch images to make them attractive.

The challenge for ISPs lies in designing and implementing a general model that covers all image degradations or all image styles. Specifically, image details are corrupted by many types of degradation, such as noise, blur, and compression. A general model that handles every degradation not only requires a large number of model parameters but is also suboptimal for each individual degradation. Moreover, attractive image styles are subjective and vary with many factors, such as personal experience, atmosphere, and mood. Recent deep enhancement models usually generate a single image style for an input image, which limits their ability to satisfy user preferences.

This dissertation proposes adaptive deep ISPs for practical applications. Specifically, we introduce three adaptation methods for three applications: (1) adaptive data synthesis for camera image denoising, (2) adaptive neural architecture search for controllable image restoration, and (3) adaptive ISP parameter estimation for controllable image enhancement. First, in camera image denoising, it is difficult to obtain noisy-clean image pairs for supervised learning. The proposed method generates low-resolution RGB noisy-clean image pairs from raw-RGB noisy image pairs, which allows accurate training of general CNN-based denoisers. Second, controllable image restoration is a new application for unknown degradation that generates outputs for a set of predetermined restoration tasks and lets users select the result they prefer. The proposed method automatically finds a neural network architecture that is efficient when multiple inferences are needed to generate the different restoration outputs. The searched network shares the early layers and adapts the remaining layers to each task. Third, the proposed method learns a style representation that can generate multiple high-quality image styles to satisfy subjective user preferences. Users select styles in the latent representation, and a neural network decodes the selected style into ISP parameters. Style generation is efficient because it runs through a plug-and-play ISP.
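
The data flow of the third method (a user-selected style code is decoded into ISP parameters, which an ISP then applies) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `IspParams`, `decode_style`, `apply_isp`, the choice of three parameters (gain, gamma, saturation), and the closed-form decoder. In the dissertation the decoder is a learned neural network, and this abstract does not specify its architecture or the actual ISP parameter set.

```python
import math
from dataclasses import dataclass


@dataclass
class IspParams:
    """Hypothetical ISP parameter set (not taken from the dissertation)."""
    gain: float        # exposure multiplier, > 0
    gamma: float       # tone-curve exponent, > 0
    saturation: float  # 0 = grayscale, 1 = unchanged, up to 2 = boosted


def decode_style(z):
    """Map a 3-D style code to ISP parameters.

    Illustrative stand-in for the learned decoder: a fixed formula chosen
    so that the zero code yields the identity parameters.
    """
    z_gain, z_gamma, z_sat = z
    return IspParams(
        gain=math.exp(z_gain),
        gamma=math.exp(z_gamma),
        saturation=1.0 + math.tanh(z_sat),
    )


def clamp01(x):
    return min(max(x, 0.0), 1.0)


def apply_isp(pixel, params):
    """Apply gain, gamma, then saturation to one RGB pixel in [0, 1]."""
    rgb = [clamp01(c * params.gain) ** params.gamma for c in pixel]
    # BT.601 luma weights, used here only to define the gray axis.
    luma = 0.299 * rgb[0] + 0.587 * rgb[1] + 0.114 * rgb[2]
    return tuple(clamp01(luma + params.saturation * (c - luma)) for c in rgb)


if __name__ == "__main__":
    pixel = (0.2, 0.5, 0.8)
    # Each style code plays the role of one point the user picks in latent space.
    for z in [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (0.0, 0.0, -20.0)]:
        print(z, apply_isp(pixel, decode_style(z)))
```

Because only a handful of parameters change between styles, producing another style in this sketch costs one decoder call and one ISP pass, which mirrors the efficiency argument made for the plug-and-play ISP.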

The proposed methods significantly improve performance in each practical computer vision application, and empirical analyses and ablation studies confirm their effectiveness. We demonstrate improved image quality and model efficiency on benchmarks widely used for each task and on real images.

Language: eng

URI: https://hdl.handle.net/10371/193232

https://dcollection.snu.ac.kr/common/orgView/000000176632

Files in This Item:

000000176632.pdf 56.57 MB

Appears in Collections:

College of Engineering/Engineering Practice School (공과대학/대학원)
- Dept. of Electrical and Computer Engineering (전기·정보공학부)
  - Theses (Ph.D. / Sc.D._전기·정보공학부)
