The use of statistics is expanding rapidly inside and outside academia, with unprecedented growth in computationally intensive areas that draw on large datasets from novel sources. How to efficiently and accurately decipher the complex relationships underlying large-scale, real-world data remains an urgent challenge. In recent years, a confluence of advances in machine learning and statistics has fueled a new wave of innovation that is poised to improve data aggregation, estimation, interpretation, and prediction. Many classical areas of statistics are being redefined and reinforced by the rapid growth of machine learning. My general research goal lies in this multidisciplinary area, where I have been devoted to developing practical statistical and machine learning tools that are both significant in statistical theory and applicable to real-world problems. In particular, I have been pursuing this research agenda by exploiting deep generative models to tackle several important statistical problems, such as density estimation, causal inference, and likelihood-free Bayesian inference, with broad applications in computational biology.

My research goal is to develop new theories and methodologies for solving both statistical problems and data science problems in computational biology by building computationally efficient techniques from the machine learning field. I have been developing pioneering frameworks for causal inference (arXiv 2022) and density estimation (PNAS 2021) using deep generative models, and applying the related techniques to various computational biology problems, including genomic studies (NMI 2021, BIB 2022, ISMB/Bioinformatics 2019) and pharmacology studies (ECCB/Bioinformatics 2020, NeurIPS 2020).

Representative works in Statistics

CausalEGM: a general causal inference framework by encoding generative modeling

In this article, we develop a general framework CausalEGM for estimating causal effects by encoding generative modeling, which can be applied in both binary and continuous treatment settings. We establish a bidirectional transformation between the high-dimensional confounders space and a low-dimensional latent space where the density is known. Through this, CausalEGM simultaneously decouples the dependencies of confounders on both treatment and outcome and maps the confounders to the low-dimensional latent space. By conditioning on the low-dimensional latent features, CausalEGM can estimate the causal effect for each individual or the average causal effect within a population. Our theoretical analysis shows that the excess risk for CausalEGM can be bounded through empirical process theory. Under an assumption on encoder-decoder networks, the consistency of the estimate can be guaranteed.

Qiao Liu, Zhongren Chen, Wing Hung Wong

arXiv preprint arXiv:2212.05925, 2022
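
The step of conditioning on low-dimensional latent features and then averaging can be illustrated with a minimal sketch. It is not CausalEGM itself: a PCA projection stands in for the learned encoder, ordinary least squares stands in for the outcome network, and the toy data, `encode`, and `fit_outcome` are all hypothetical names introduced only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, latent_dim, conf_dim = 5000, 2, 10

# Toy data (assumption): 10-dimensional confounders V are driven by a 2-dimensional latent Z.
Z = rng.standard_normal((n, latent_dim))
B = rng.standard_normal((latent_dim, conf_dim))
V = Z @ B + 0.1 * rng.standard_normal((n, conf_dim))
T = rng.binomial(1, 1.0 / (1.0 + np.exp(-2.0 * Z[:, 0])))  # treatment depends on Z
Y = 2.0 * T + 3.0 * Z[:, 0] + 0.5 * rng.standard_normal(n)  # true causal effect is 2

def encode(V, k):
    """Stand-in encoder: project centered confounders onto the top-k principal directions."""
    Vc = V - V.mean(axis=0)
    _, _, Vt = np.linalg.svd(Vc, full_matrices=False)
    return Vc @ Vt[:k].T

def fit_outcome(T, Zhat, Y):
    """Stand-in outcome model: least squares of Y on (treatment, latent features, intercept)."""
    X = np.column_stack([T, Zhat, np.ones(len(T))])
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return lambda t, z: np.column_stack([t, z, np.ones(len(z))]) @ coef

Zhat = encode(V, latent_dim)
f = fit_outcome(T, Zhat, Y)

# Individual effects: contrast the predicted outcomes under T=1 and T=0 at each unit's latent features.
ite = f(np.ones(n), Zhat) - f(np.zeros(n), Zhat)
ate = ite.mean()                                   # average causal effect
naive = Y[T == 1].mean() - Y[T == 0].mean()        # confounded difference in means
print(f"latent-adjusted ATE: {ate:.2f}, naive difference: {naive:.2f}")
```

Because the stand-in outcome model is linear, every individual effect here equals the same coefficient; a nonlinear outcome network would let the per-unit contrasts vary with the latent features.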

Density estimation using deep generative neural networks

Density estimation is among the most fundamental problems in statistics. It is notoriously difficult to estimate the density of high-dimensional data due to the “curse of dimensionality”. Here, we introduce a new general-purpose density estimator based on deep generative neural networks. By modeling data as normally distributed around a manifold of reduced dimension, we show how the power of bidirectional deep generative models can be exploited for explicit evaluation of the data density by either importance sampling or Laplace approximation. Simulation and real data experiments suggest that our method is effective in a wide range of problems. This approach should be helpful in many applications where an accurate density estimator is needed.

Qiao Liu, Jiaze Xu, Rui Jiang, Wing Hung Wong

Proceedings of the National Academy of Sciences (PNAS), 2021, 118(15):e2101344118
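
The importance-sampling route to the density can be sketched on a toy problem. The sketch assumes the model x = G(z) + noise with z standard normal and isotropic Gaussian noise, and it uses a linear map in place of the trained generator so that the exact density is available for comparison. The names `G`, `H`, `A`, `sigma2`, and `prop_var`, and the Gaussian proposal centered at the encoder output, are assumptions made for illustration and are not taken from the paper's implementation.

```python
import numpy as np

def log_normal(x, mean, var):
    """Log density of an isotropic Gaussian N(mean, var * I), evaluated along the last axis."""
    d = x.shape[-1]
    sq = np.sum((x - mean) ** 2, axis=-1)
    return -0.5 * (d * np.log(2 * np.pi * var) + sq / var)

def is_log_density(x, G, H, sigma2, prop_var, n_samples, rng):
    """Importance-sampling estimate of log p(x), where p(x) = E_z[ N(x; G(z), sigma2 * I) ]."""
    center = H(x)                                   # encoder output locates the proposal
    z = center + np.sqrt(prop_var) * rng.standard_normal((n_samples, center.shape[-1]))
    log_w = (log_normal(x, G(z), sigma2)            # log p(x | z)
             + log_normal(z, 0.0, 1.0)              # log p(z), standard normal prior
             - log_normal(z, center, prop_var))     # log q(z), Gaussian proposal
    c = log_w.max()                                 # log-mean-exp for numerical stability
    return c + np.log(np.mean(np.exp(log_w - c)))

rng = np.random.default_rng(0)
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # toy linear "generator" weights (assumption)
sigma2 = 0.25
G = lambda z: z @ A.T                               # maps latent (n, 2) to data space (n, 3)
H = lambda x: np.linalg.pinv(A) @ x                 # toy "encoder": least-squares inverse of G
x = np.array([0.5, -0.3, 0.4])

est = is_log_density(x, G, H, sigma2, prop_var=0.3, n_samples=50000, rng=rng)

# With a linear generator, x is exactly N(0, A A^T + sigma2 * I), which gives a reference value.
cov = A @ A.T + sigma2 * np.eye(3)
_, logdet = np.linalg.slogdet(cov)
exact = -0.5 * (3 * np.log(2 * np.pi) + logdet + x @ np.linalg.solve(cov, x))
print(f"importance sampling: {est:.3f}, exact: {exact:.3f}")
```

The proposal variance is chosen wider than the posterior of z given x so that the importance weights have finite variance; with a nonlinear generator no closed-form reference exists and the estimate would have to be assessed by other means.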


Representative works in Computational Biology

Simultaneous deep generative modeling and clustering of single-cell genomic data

We propose scDEC, a computational tool for single-cell data analysis with deep generative neural networks. scDEC is built on a pair of generative adversarial networks (GANs) and is capable of simultaneously learning the latent representation and inferring the cell labels. scDEC can be used for clustering scATAC-seq and Multiome data.

Qiao Liu, Shengquan Chen, Rui Jiang, Wing Hung Wong

Nature Machine Intelligence, 2021, 3(6):536-544.

Deep generative modeling and clustering of single cell Hi-C data

Deciphering 3D genome conformation is important for understanding gene regulation and cellular function at a spatial level. Here, we proposed scDEC-Hi-C, a new framework for single cell Hi-C analysis with deep generative neural networks.

Qiao Liu, Wanwen Zeng, Wei Zhang, Sicheng Wang, Hongyang Chen, Rui Jiang, Mu Zhou, Shaoting Zhang

Briefings in Bioinformatics, 2022.

hicGAN Infers Super Resolution Hi-C Data with Generative Adversarial Networks

We proposed hicGAN for inferring high resolution Hi-C data from low resolution Hi-C data with generative adversarial networks (GANs). To the best of our knowledge, this is the first study to apply GANs to 3D genome analysis. Pretrained hicGAN models were also provided.

Qiao Liu, Hairong Lv, Rui Jiang

Bioinformatics, 2019, 35(14):i99–i107

Invited talk at ISMB2019

Cancer drug response prediction via a hybrid graph convolutional network

DeepCDR is a hybrid graph convolutional network consisting of a uniform graph convolutional network (UGCN) and multiple subnetworks. Unlike prior studies modeling hand-crafted features of drugs, DeepCDR automatically learns the latent representation of topological structures among atoms and bonds of drugs. DeepCDR achieves state-of-the-art performance in cancer drug response prediction.

Qiao Liu, Zhiqiang Hu, Rui Jiang, Mu Zhou

Bioinformatics, 2020, 36(S2):i911–i918

Invited talk at ECCB2020
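
How a graph convolutional network can learn a drug representation from atoms and bonds, rather than from hand-crafted features, can be illustrated with a generic graph convolution layer. This is a minimal sketch of the widely used symmetric-normalized layer and is not DeepCDR's actual architecture; the three-atom toy molecule, the one-hot atom features, and the random weights are assumptions for illustration only.

```python
import numpy as np

def gcn_layer(adj, feats, weight):
    """One graph convolution: ReLU(D^{-1/2} (A + I) D^{-1/2} H W)."""
    a_hat = adj + np.eye(adj.shape[0])              # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))   # inverse square root of node degrees
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ feats @ weight, 0.0)

def drug_embedding(adj, feats, weights):
    """Stack graph convolutions, then mean-pool the atoms into one vector per molecule."""
    h = feats
    for w in weights:
        h = gcn_layer(adj, h, w)
    return h.mean(axis=0)

# Toy molecule (assumption): heavy atoms of ethanol, C-C-O, with bonds as undirected edges.
adj = np.array([[0.0, 1.0, 0.0],
                [1.0, 0.0, 1.0],
                [0.0, 1.0, 0.0]])
feats = np.array([[1.0, 0.0],    # carbon  (one-hot over [C, O])
                  [1.0, 0.0],    # carbon
                  [0.0, 1.0]])   # oxygen

rng = np.random.default_rng(0)
weights = [rng.standard_normal((2, 8)), rng.standard_normal((8, 4))]  # untrained, random
emb = drug_embedding(adj, feats, weights)
print(emb.shape)
```

Mean pooling makes the molecule-level vector independent of the order in which atoms are listed, which is one reason such layers suit molecular graphs; in a trained model the weights would be learned jointly with the downstream response predictor.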