Paper review: Towards a better understanding of reverse-complement equivariance for deep learning models in regulatory genomics

cleanUrl: "reverse-complement-equivariance-review"
description: "A review of the paper Towards a better understanding of reverse-complement equivariance for deep learning models in regulatory genomics."

Summary

Reverse-complement parameter sharing (RCPS) is, in theory, at least as expressive as a conjoined model, but its performance does not improve accordingly, probably because it is harder to optimize.

Abstract

A model that predicts from double-stranded DNA sequence features should, in theory, produce the same prediction whether it is given the forward strand or the reverse strand as input (reverse-complement equivariance). Most standard neural networks do not.
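As a minimal sketch of what this property means, the snippet below reverse-complements a one-hot encoded sequence and compares two toy "models" on both strands. The `ACGT` channel order, the helper names, and both toy functions are assumptions made for illustration; they are not from the paper.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed channel order: A<->T and C<->G sit at mirrored indices

def one_hot(seq: str) -> np.ndarray:
    """Encode a DNA string as a (length, 4) one-hot array."""
    idx = np.array([ALPHABET.index(base) for base in seq])
    return np.eye(4, dtype=np.float32)[idx]

def reverse_complement(x: np.ndarray) -> np.ndarray:
    """Reverse the positions and swap A<->T, C<->G by flipping both axes."""
    return x[::-1, ::-1]

def gc_content(x: np.ndarray) -> float:
    """Toy strand-symmetric model: fraction of C/G bases."""
    return float(x[:, 1:3].sum() / x.shape[0])

def starts_with_purine(x: np.ndarray) -> float:
    """Toy strand-dependent model: 1.0 if the first base is A or G."""
    return float(x[0, 0] + x[0, 2])

x = one_hot("GATAA")
x_rc = reverse_complement(x)  # encodes TTATC

print(gc_content(x) == gc_content(x_rc))                  # same on both strands
print(starts_with_purine(x) == starts_with_purine(x_rc))  # differs between strands
```

With this channel order, the reverse complement is a flip of both array axes, which is why the later sketches can reuse `x[::-1, ::-1]`.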
Two strategies have been proposed to make a model reverse-complement equivariant, but they have not been benchmarked against each other to determine which is better.
- conjoined / “siamese” architecture
- RC parameter sharing (RCPS)
For base-resolution signal prediction, this paper shows that a “post-hoc conjoined” model works well and proposes it as a strong baseline. This is a model that is trained on individual strands and whose predictions for the two strands are aggregated afterwards.
The post-hoc conjoined model outperformed RCPS in most cases and always outperformed the “conjoined-during-training” model, which performs the conjoining during training.
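The aggregation step of post-hoc conjoining can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: `single_strand_model` is a hypothetical stand-in for a trained network, the output is assumed to be a `(length, 2)` profile with `[plus, minus]` strand columns, and averaging is assumed as the aggregation.

```python
import numpy as np

def reverse_complement(x: np.ndarray) -> np.ndarray:
    """Flip positions and channels of a (length, 4) one-hot array (ACGT order assumed)."""
    return x[::-1, ::-1]

def single_strand_model(x: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for a trained single-strand model.

    Maps a (length, 4) one-hot input to a (length, 2) profile [plus, minus].
    The cumulative sum makes the output depend on reading direction,
    so this toy model is deliberately not RC equivariant.
    """
    w = np.array([[0.9, 0.1], [0.2, 0.5], [0.7, 0.3], [0.1, 0.8]])
    return np.cumsum(x @ w, axis=0)

def post_hoc_conjoined(model, x: np.ndarray) -> np.ndarray:
    """Average the forward prediction with the re-oriented prediction on the RC input."""
    fwd = model(x)
    # Flip positions and swap strand columns to map the RC prediction
    # back into the forward orientation before averaging.
    rev = model(reverse_complement(x))[::-1, ::-1]
    return 0.5 * (fwd + rev)

x = np.eye(4)[[2, 0, 3, 0, 0]]  # one-hot for GATAA
y = post_hoc_conjoined(single_strand_model, x)
y_rc = post_hoc_conjoined(single_strand_model, reverse_complement(x))
print(np.allclose(y_rc, y[::-1, ::-1]))  # the merged predictor is RC equivariant
```

The merge needs no retraining: any single-strand model can be wrapped this way at test time.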

→ When building RC-equivariant models, it is advisable to use the post-hoc conjoined model as a reliable baseline and aim to outperform it.

Introduction

Convolutional neural networks (CNNs) are widely used to capture regulatory motifs in DNA sequences, but standard CNNs were developed and refined mainly for computer vision tasks, so they do not account for the complementary base pairing of double-stranded DNA.
- For example, given a TF that binds 5’-GATA-3’, the model should be able to infer from a 3’-CTAT-5’ signal on the reverse strand that 5’-GATA-3’ is present on the forward strand, but this does not happen naturally.
- As a result, the output for a forward sequence often differs substantially from the output for its RC version.
- This holds even when the training data is augmented with reverse-complement sequences, which lowers the reliability of the model.
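The GATA example above can be made concrete with a single idealized convolutional filter. The sketch below is a toy illustration: the filter is hand-built rather than learned, and the sequence, the `ACGT` channel order, and the helper names are assumptions.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed channel order

def one_hot(seq: str) -> np.ndarray:
    """Encode a DNA string as a (length, 4) one-hot array."""
    return np.eye(4)[[ALPHABET.index(base) for base in seq]]

def scan(x: np.ndarray, filt: np.ndarray) -> np.ndarray:
    """Valid cross-correlation of a (length, 4) input with a (width, 4) filter."""
    width = filt.shape[0]
    n_pos = x.shape[0] - width + 1
    return np.array([np.sum(x[i:i + width] * filt) for i in range(n_pos)])

gata_filter = one_hot("GATA")  # idealized filter: a perfect match scores 4
x = one_hot("CCGATACC")        # forward strand containing GATA
x_rc = x[::-1, ::-1]           # reverse complement, GGTATCGG

fwd_hits = scan(x, gata_filter)
rc_hits = scan(x_rc, gata_filter)
print(fwd_hits.max(), rc_hits.max())  # a full match appears only on the forward input
```

The same binding site is present in both inputs, yet the filter reports a perfect match for only one of them. A standard CNN would have to learn a second filter for the other orientation.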
Early deep learning for genomics studies address this problem by using the predictions for both the forward and the reverse sequence. This structure is called a conjoined or “siamese” architecture.
- DeepBind uses the larger of the two strand predictions.
- FactorNet uses the average of the two strand predictions.
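The two merging rules can be sketched for a scalar output as follows. This only illustrates the max and mean rules as described here; it is not the DeepBind or FactorNet code, and `toy_score` is a hypothetical single-strand scorer.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed channel order, so x[::-1, ::-1] is the reverse complement

def one_hot(seq: str) -> np.ndarray:
    """Encode a DNA string as a (length, 4) one-hot array."""
    return np.eye(4)[[ALPHABET.index(base) for base in seq]]

def toy_score(x: np.ndarray) -> float:
    """Hypothetical single-strand scorer: best match to an idealized GATA filter."""
    filt = one_hot("GATA")
    return max(float(np.sum(x[i:i + 4] * filt)) for i in range(x.shape[0] - 3))

def merge_max(pred_fwd: float, pred_rc: float) -> float:
    """Max rule: keep the larger of the two strand predictions."""
    return max(pred_fwd, pred_rc)

def merge_mean(pred_fwd: float, pred_rc: float) -> float:
    """Mean rule: average the two strand predictions."""
    return 0.5 * (pred_fwd + pred_rc)

def conjoined(model, x: np.ndarray, merge) -> float:
    """Score both strands with the same model and merge the two predictions."""
    return merge(model(x), model(x[::-1, ::-1]))

x = one_hot("CCGATACC")
x_rc = x[::-1, ::-1]

print(toy_score(x) == toy_score(x_rc))  # the raw scorer is strand dependent
print(conjoined(toy_score, x, merge_max) == conjoined(toy_score, x_rc, merge_max))
print(conjoined(toy_score, x, merge_mean) == conjoined(toy_score, x_rc, merge_mean))
```

Both rules are symmetric in their two arguments, which is what makes the merged prediction identical for a sequence and its reverse complement.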
In effect, a conjoined model performs “representation merging” over the forward and reverse strand inputs.
Traditionally, a model is called conjoined when representation merging is performed at both training and test time, but a model that skips merging during training and merges only at test time can also be regarded as a conjoined architecture.
- Merging only at test time can be viewed as converting a single-strand model into a conjoined model post hoc (post-hoc conjoining).
Drawback of conjoined architectures. A conjoined architecture imposes RC equivariance at a stage after the convolutional filters have scanned for motifs, so the filters themselves carry the burden of learning the forward motif and the reverse motif separately. Consequently, if a sequence contains some motifs in the forward orientation and others in the reverse orientation, a model that has learned only one orientation of each motif cannot identify all of them.

→ This motivates reverse-complement parameter sharing (RCPS).
RCPS. RCPS trains with pairs of weight-tied filters in which one filter is the other flipped along both the window-length axis and the channel axis.
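A weight-tied filter pair of this kind can be sketched in a few lines. This is a minimal NumPy illustration, assuming `ACGT` channel order and a single filter pair; `scan` and `rcps_layer` are hypothetical names, and a real RCPS layer would also handle biases, multiple filters, and deeper layers.

```python
import numpy as np

def scan(x: np.ndarray, filt: np.ndarray) -> np.ndarray:
    """Valid cross-correlation of a (length, 4) input with a (width, 4) filter."""
    width = filt.shape[0]
    n_pos = x.shape[0] - width + 1
    return np.array([np.sum(x[i:i + width] * filt) for i in range(n_pos)])

def rcps_layer(x: np.ndarray, filt: np.ndarray) -> np.ndarray:
    """Apply a filter and its weight-tied reverse-complement twin.

    The twin is the same weights flipped along the window-length axis and
    the channel axis, so the pair has only one set of free parameters.
    Returns an array of shape (n_positions, 2): [forward filter, RC filter].
    """
    filt_rc = filt[::-1, ::-1]
    return np.stack([scan(x, filt), scan(x, filt_rc)], axis=1)

rng = np.random.default_rng(0)
filt = rng.normal(size=(4, 4))               # arbitrary weights, width 4
x = np.eye(4)[rng.integers(0, 4, size=12)]   # random one-hot sequence, length 12

out = rcps_layer(x, filt)
out_rc = rcps_layer(x[::-1, ::-1], filt)

# Feeding the reverse complement flips the output along positions and
# swaps the two channels, for any choice of filter weights.
print(np.allclose(out_rc, out[::-1, ::-1]))
```

Because the equivariance holds for any weights, it is built into the layer rather than learned, and a motif learned in one orientation is detected in both.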
Studies that use the RCPS idea.
- Brown and Lunter: extended RCPS and, with dropout applied, used it for recombination hotspot detection.
- Bartoszewicz et al.: predicting the pathogenic potential of novel DNA.
- Onimaru et al.: designed a layer with an RCPS-like concept, named Forward and Reverse Sequence Scan (FRSS).