Welcome to CompGenome!

We work on big-data-driven discovery in cancer biology, Bayesian adaptive designs to empower precision medicine, and innovative computational approaches for cancer diagnosis and prognosis based on tumor heterogeneity. We analyze multi-omics data generated from next-generation sequencing platforms. Our group (PI: Dr. Yuan Ji) consists of scientists with diverse backgrounds spanning cancer biology, clinical trials, computational biology and bioinformatics, and statistics.


The novel coronavirus outbreak, COVID-19, is a pandemic affecting over 200 countries and regions on every continent except Antarctica. Based on public and open-source data, we have been conducting statistical inference to interpret emerging epidemiological data, with the aim of 1) helping raise public awareness of the outbreak and of the self-discipline required to fight it and 2) assisting decision makers in making data-driven decisions. While our research is being prepared and submitted for peer-reviewed publication, we use blog posts to share preliminary results and convey them in a way that is more accessible to the public. We also include our own interpretations and opinions, to share and exchange ideas with our readers.

We have moved the blog to a more stable server, which is accessible here

  • Congratulations to Lin Wei for the paper "TCGA-Assembler 2: Software Pipeline for Retrieval and Processing of TCGA/CPTAC Data". It has been accepted for publication in Bioinformatics. Dec 20, 2017
  • We have extended our LocHap tool for the GBS method. A new paper using this tool has been accepted by G3 (Genes, Genomes, Genetics). May 2, 2017
  • Our mTPI-2 paper has been accepted by Contemporary Clinical Trials. Apr 23, 2017
  • TCGA-Assembler version 2.0.3 has been released. It can now acquire and process TCGA somatic mutation data from the Genomic Data Commons (GDC) and mass spectrometry proteomics data of TCGA samples generated by the Clinical Proteomic Tumor Analysis Consortium (CPTAC). Apr 18, 2017
  • TCGA-Assembler version 2.0.0 has been released and is compatible with the Genomic Data Commons. Dec 22, 2016
  • Congratulations to NextGen-DF on surpassing 300 registered users. Nov 17, 2016

Yuan Ji
Principal Investigator

Yitan Zhu
Research Scientist

Subhajit Sengupta
Postdoc Fellow

Lin Wei
Postdoc Fellow

Shengjie Yang
Postdoc Fellow

Tianjian Zhou
Postdoc Fellow

Jiaying Lyu
Visiting Student
Past Lab Members

TCGA-Assembler is an open-source, freely available tool that automatically downloads, assembles, and processes public data from The Cancer Genome Atlas (TCGA), facilitating downstream analysis by relieving investigators of the burden of data preparation. TCGA-Assembler includes two modules. Module A acquires public TCGA data from the TCGA Data Coordinating Center and assembles individual data files into locally stored data tables. Module B performs various manipulations on the data tables to prepare them for downstream analysis.

The Cancer Genome Atlas (TCGA) provides multimodal, genome-wide measurements of various genomic features, such as DNA methylation, copy number, gene expression, and protein expression, on thousands of matched patient samples across multiple cancer types. Taking advantage of the multimodal structure of TCGA data, we perform a vertical integration of genomic features using a Bayesian graphical model and assemble a large-scale database and information system called Zodiac. Zodiac reports computational results on the genetic interactions between features of 19,304 genes and 186,312,556 gene pairs, from a genome-wide large-scale data analysis. In addition, Zodiac provides a user-friendly interface for visualizing genetic interactions as graphs. Because inference under the Bayesian model is fully probabilistic, the false discovery rates of reported graphs can be controlled using posterior probabilities, which helps ensure the quality of reported interactions. As a unique resource of analytic inferences based on TCGA data, Zodiac might be useful to a variety of cancer researchers. We illustrate the potential impact of Zodiac with a few convincing examples in different fields of research.
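The idea of controlling the false discovery rate with posterior probabilities can be sketched as follows. This is a generic illustration of the standard Bayesian FDR rule, not Zodiac's actual implementation: the function name, the edge labels, and the threshold `alpha` are all hypothetical. Edges are ranked by posterior probability, and the largest set whose average posterior probability of being false stays at or below `alpha` is reported.

```python
def select_by_posterior_fdr(posterior, alpha=0.10):
    """Return the largest top-ranked set of edges with expected FDR <= alpha.

    posterior: dict mapping an edge label to the posterior probability
               that the edge is a true interaction (hypothetical input).
    """
    # Rank edges from most to least probable.
    ranked = sorted(posterior.items(), key=lambda kv: kv[1], reverse=True)
    selected = []
    expected_false = 0.0
    for edge, prob in ranked:
        # Each reported edge contributes (1 - prob) expected false discoveries.
        candidate = expected_false + (1.0 - prob)
        if candidate / (len(selected) + 1) > alpha:
            break  # later edges are less probable, so the average only grows
        expected_false = candidate
        selected.append(edge)
    return selected


# Made-up posterior probabilities for four hypothetical edges.
edges = {"e1": 0.99, "e2": 0.95, "e3": 0.80, "e4": 0.40}
reported = select_by_posterior_fdr(edges, alpha=0.10)
```

Because the edges are sorted, the running average of `1 - prob` never decreases, so stopping at the first violation yields the largest admissible set.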

We introduce local haplotype variants (LHVs) and a computational pipeline, LocHap, for calling LHVs. An LHV refers to a haplotype that manifests more than two alleles in a single human sample and consists of multiple proximal single nucleotide variants (SNVs). Since humans are diploid, having more than two alleles implies somatic mosaicism, i.e., cells are genetically heterogeneous in the sample due to somatic mutations. Using deep DNA-Seq data, we demonstrate by direct observations and rigorous statistical inference the existence of widespread LHVs in human normal tissues and tumors, with higher frequencies of LHVs observed in tumor samples compared to normal samples, and in older healthy individuals compared to younger healthy individuals. LocHap is ultrafast, open-source, and freely available at http://www.compgenome.org/lochap. Recognition of the existence of LHVs in normal and disease samples could fundamentally change our practice in disease diagnosis, association, and prognosis.
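The definition of an LHV can be made concrete with a small sketch. This is not the LocHap algorithm, which relies on statistical inference; it only counts distinct haplotypes across nearby SNV positions and flags a locus when more than two are seen. The read representation, the function names, and the crude `min_reads` support threshold (a stand-in for proper error modeling) are assumptions made for illustration.

```python
from collections import Counter


def count_haplotypes(reads, positions, min_reads=2):
    """Count distinct haplotypes spanning all given SNV positions.

    reads: list of dicts mapping genomic position -> observed base,
           one dict per read or read pair (hypothetical representation).
    Only reads covering every position are counted, and haplotypes seen
    fewer than min_reads times are dropped as likely sequencing noise.
    """
    counts = Counter(
        tuple(read[p] for p in positions)
        for read in reads
        if all(p in read for p in positions)
    )
    return {hap: n for hap, n in counts.items() if n >= min_reads}


def is_lhv_candidate(reads, positions, min_reads=2):
    """A diploid sample should show at most two haplotypes per locus."""
    return len(count_haplotypes(reads, positions, min_reads)) > 2


# Toy data: two SNVs at positions 100 and 130.
toy_reads = (
    [{100: "A", 130: "C"}] * 3
    + [{100: "G", 130: "T"}] * 3
    + [{100: "A", 130: "T"}] * 2   # a third, well-supported haplotype
    + [{100: "G", 130: "C"}]       # single read, treated as noise
    + [{100: "A"}]                 # does not cover both positions
)
flagged = is_lhv_candidate(toy_reads, positions=(100, 130))
```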

For a heterogeneous sample with subclones, DNA nucleotide sequences and copy numbers can differ at multiple loci between subclones. To accurately estimate the number of subclonal variant alleles, we must account for subclonal copy numbers. To our knowledge, there are no existing tools for jointly calling subclonal copy numbers and the corresponding variant allele fractions for the same loci. Most methods either call one of the two features while ignoring the other, or assume that one is known and call the other. The proposed method, BayClone2, an extension of BayClone (Sengupta et al., 2015), provides a Bayesian solution using next-generation sequencing (NGS) data for joint inference on both structural and sequencing variants within a subclone. In addition, BayClone2 estimates the number and cellular fractions of the subclones in the sample, thus providing a complete description of subclonal genetic structure. BayClone2 is implemented as an R package, and the source code is available at http://www.compgenome.org/bayclone2.
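A small arithmetic sketch shows why copy numbers matter when interpreting variant allele fractions. It is not the BayClone2 model; it simply assumes that, at one locus, the expected variant allele fraction is the fraction-weighted number of variant copies divided by the fraction-weighted total copy number. The function name and the example values are hypothetical.

```python
def expected_vaf(fractions, total_copies, variant_copies):
    """Expected variant allele fraction at one locus in a mixed sample.

    fractions:      cellular fraction of each subclone (should sum to 1)
    total_copies:   total copy number of the locus in each subclone
    variant_copies: number of copies carrying the variant in each subclone
    """
    variant = sum(w * z for w, z in zip(fractions, variant_copies))
    total = sum(w * c for w, c in zip(fractions, total_copies))
    return variant / total


# 60% normal cells (2 copies, no variant) and 40% of a subclone
# with a copy-number gain (3 copies, 2 of them carrying the variant).
with_gain = expected_vaf([0.6, 0.4], [2, 3], [0, 2])

# The same variant copies, but wrongly assuming every cell is diploid.
assume_diploid = expected_vaf([0.6, 0.4], [2, 2], [0, 2])
```

In this made-up example the two calculations give one third and 0.4 respectively, so ignoring the copy-number gain would misstate the expected fraction.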

NextGen-DF is a web-based statistical tool for designing phase I dose-finding trials in oncology. It assumes that a fixed number of doses is predetermined and allows users to evaluate three designs by running computer simulations. The three methods are the 3+3 design, the CRM design, and the mTPI design. NextGen-DF provides mTPI designs (Ji et al., 2007, 2010; Ji and Wang, 2013) as the main recommended method, as mTPI has been shown to be safer and more efficient than 3+3, and more intuitive and less burdensome than CRM. NextGen-DF is divided into three main components, as shown in the menu bar: DECISION, SIMULATION, and COMPARISON.
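For readers unfamiliar with mTPI, its core dosing decision can be sketched in a few lines. This is a minimal illustration and not NextGen-DF's code: it assumes a Beta(1, 1) prior on the toxicity probability, an illustrative target of 0.30 with an equivalence interval of plus or minus 0.05, and it omits the safety rule that excludes overly toxic doses. The decision is the interval (under-dosing, proper dosing, over-dosing) with the largest unit probability mass, that is, posterior probability divided by interval length.

```python
from math import comb


def beta_cdf_int(x, a, b):
    """CDF of Beta(a, b) at x for positive integers a and b.

    Uses the identity I_x(a, b) = P(Binomial(a + b - 1, x) >= a),
    which avoids any dependency beyond the standard library.
    """
    m = a + b - 1
    return sum(comb(m, j) * x**j * (1 - x) ** (m - j) for j in range(a, m + 1))


def mtpi_decision(n_treated, n_toxic, target=0.30, eps1=0.05, eps2=0.05):
    """Return 'E' (escalate), 'S' (stay), or 'D' (de-escalate).

    All default parameter values are illustrative assumptions.
    """
    # Posterior under a Beta(1, 1) prior.
    a, b = 1 + n_toxic, 1 + n_treated - n_toxic
    lo, hi = target - eps1, target + eps2
    cdf_lo, cdf_hi = beta_cdf_int(lo, a, b), beta_cdf_int(hi, a, b)
    # Unit probability mass = posterior mass / interval length.
    upm = {
        "E": cdf_lo / lo,
        "S": (cdf_hi - cdf_lo) / (hi - lo),
        "D": (1 - cdf_hi) / (1 - hi),
    }
    return max(upm, key=upm.get)


# Decisions for a cohort of three patients with 0, 1, or 3 toxicities.
decisions = [mtpi_decision(3, x) for x in (0, 1, 3)]
```

Because the decision depends only on the number of patients treated and the number of toxicities at the current dose, all decisions can be tabulated before a trial starts.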

Our Research

The research of our lab spans four areas: [1] Big-data cancer informatics, [2] Bayesian genomics, [3] Bayesian graphical models, and [4] Bayesian adaptive designs for cancer clinical trials. Together, areas [1], [2], and [3] feed into [4], realizing translational research in modern precision medicine from a computational perspective. In particular, we use rigorous and powerful statistical and computational approaches to analyze big and complex genomic and phenotypic data to extract information that can facilitate clinical care of cancer patients. Below is a selected list of papers.

Big-Data Cancer Informatics:
  • Zhu Y, Xu Y, Helseth L, et al., Ji Y*. Zodiac: A Comprehensive Depiction of Genetic Interactions in Cancer by Integrating TCGA Data. Journal of the National Cancer Institute. 107(8), p.djv129, 2015
  • Zhu Y, Qiu P, Ji Y*. TCGA-Assembler: Open-Source Software for Retrieving and Processing TCGA Data. Nature Methods. 11(6):399-400, 2014.
  • Mitra R, Mueller P, Liang S, Xu Y, Ji Y*. Towards Breaking the Histone Code – Bayesian Graphical Models for Histone Modifications. Circulation: Cardiovascular Genetics. 6(4): 419-426, 2013.
  • Y. Zhu, H. Li, W. Guo, K. Drukker, L. Lan, M.L. Giger, Y. Ji*, Deciphering genomic underpinnings of quantitative MRI-based radiomic phenotypes of invasive breast carcinoma, Scientific Reports, vol. 5, Article Number: 17787, 2015
  • H. Li, Y. Zhu, E.S. Burnside, K. Drukker, K.A. Hoadley, C. Fan, S.D. Conzen, L. Lan, M. Zuley, G. Whitman, E.J. Sutton, J.M. Net, M. Ganott, K.R. Brandt, E. Bonaccio, A. Rao, C. Jaffe, E Huang, J.B. Freymann, J. Kirby, E. Morris, C.M. Perou, Y. Ji*, M.L. Giger, MRI radiomics signatures for predicting the risk of breast cancer recurrence as given by research versions of gene assays of MammaPrint, Oncotype DX, and PAM50, Radiology, in press.
  • W. Guo, H. Li, Y. Zhu, L. Lan, S. Yang, K. Drukker, M.L. Giger, Y. Ji*, Prediction of clinical phenotypes in invasive breast carcinomas from the integration of radiomics and genomics data, Journal of Medical Imaging, vol. 2, no. 4, 041007, 2015.
Bayesian Genomics:
  • Xu Y, Mueller P*, Yuan Y, Gulukota K, Ji Y*. MAD Bayes for Tumor Heterogeneity -- Feature Allocation with Exponential Family Sampling. Journal of the American Statistical Association. In press. 2015
  • Lee J, Mueller P*, Ji Y, Gulukota K. A Bayesian Feature Allocation Model for Tumor Heterogeneity. Annals of Applied Statistics. In press. 2015
  • Sengupta S, Gulukota K, Lee J, Mueller, P, Ji Y*. BayClone: Bayesian Nonparametric Inference of Tumor Subclones Using NGS Data. Accepted for Pacific Symposium on Biocomputing. Vol. 20, 2015.
  • Marsh R, Talamonti M, Ji Y. Effect of adjuvant chemotherapy with Fluorouracil plus Folinic Acid or Gemcitabine vs observation on survival in patients with resected periampullary adenocarcinoma. Transl Gastrointest Cancer, Feb 20. 2013. Doi: 10.3978/j.issn.2224-4778.2012.02.03.
  • Lee J, Mueller P, Ji Y*. A nonparametric Bayesian model for local clustering with application to proteomics. Journal of the American Statistical Association. 103(508): 775-788.
  • Hu B, Ji Y*, Xu Y, Ting AH*. Screening for SNPs with Allele-Specific Methylation based on Next-Generation Sequencing Data. Statistics in Biosciences. 5(1):179-197, 2013.
  • Xu Y, Lee J, Yuan Y, Mitra R, Liang S, Mueller P, Ji Y*. Nonparametric Bayesian Bi-Clustering for ChIP-Seq Count Data. Bayesian Analysis. 8(2):1-22, 2013.
  • Yuan Y, Norris C, Xu Y, Tsui KW, Ji Y*, Liang H*. BM-Map: an efficient software package for accurately allocating multireads of RNA-seq data. BMC Genomics. 13(Suppl 8):S9. Epub 12/2012.
  • Ji Y*, Xu Y, Zhang Q, Tsui KW, Yuan Y, Liang S, Liang H*. BM-Map: Bayesian mapping of multireads for next-generation sequencing data. Biometrics. 67(4):1215-24, 12/2011.
  • Baladandayuthapani V, Ji Y, Morris J, Talluri R, Nieto-Barajas L. Bayesian random segmentation models to identify shared copy number aberrations for array CGH data. Journal of the American Statistical Association. 105(492):1358-1375, 12/2010.
  • Ji Y, Yin G, Tsui K-W, Kolonin M, Sun J, Arap W, Pasqualini R, Do K-A. Bayesian mixture models for complex high-dimension count data. Journal of the Royal Statistical Society: Series C (Applied Statistics). 56(2):139-152, 2007.
Bayesian Graphical Models:
  • Mitra R, Mueller P, Ji Y. Bayesian graphical models for differential pathways. Bayesian Analysis. In press. 2015.
  • Yajima M, Telesca D, Ji Y, Mueller P. Differential patterns of interaction and Gaussian graphical models. Biostatistics. In press, 2015.
  • Mitra R, Mueller P*, Liang S, Yue L, Ji Y*. A Bayesian Graphical Model for ChIP-Seq Data on Histone Modifications. Journal of American Statistical Association 108: 69-80, 03/2013.
  • Telesca D*, Mueller P, Kornblau S, Suchard MA, Ji Y*. Modeling protein expression and protein signaling pathways. Journal of American Statistical Association, 107(500):1372-1384, 12/2012.
Bayesian Adaptive Design:
  • Lee, JH, Thall, PF, Ji Y, Mueller, P. Bayesian Dose-Finding in Two Treatment Cycles Based on the Joint Utility of Efficacy and Toxicity. Journal of American Statistical Association. In press, 2015.
  • Pan H, Xie F, Liu P, Xia J*, Ji Y*. A Phase I/II Seamless Dose Escalation/Expansion with Adaptive Randomization Scheme (SEARS). Clinical Trials. 11(1):49-59, 2014.
  • Cai C, Ji Y, Ying Y. A Bayesian Dose-finding Design for Oncology Clinical Trials of Combinational Biological Agents. Journal of Royal Statistical Society, Series C (Applied Statistics). 63(1): 159-173, 2014.
  • Hu B, Bekele BN, Ji Y*. Adaptive dose insertion in early phase clinical trials. Clinical Trials. 10(2):216-224, 2013.
  • Ji Y*, Wang S-J. The mTPI Design: A Safer and More Reliable Method than the 3+3 Design for Practical Phase I Trials. Journal of Clinical Oncology. 31(14):1785-91, 2013.
  • Ji Y, Liu P, Li Y, Bekele BN. A modified toxicity probability interval method for dose-finding trials. Clinical Trials. 7(6):653-63, 12/2010.
  • Ji Y, Bekele BN. Adaptive randomization for multi-arm comparative clinical trials based on joint efficacy/toxicity outcomes. Biometrics. 65(3):876-84, 9/2009.
  • Hu B, Ji Y*, Tsui KW. Bayesian estimation of inverse dose-response. Biometrics. 64(4):1223-30, 12/2008.
  • Ji Y, Li Y, Yin G. Bayesian dose-finding designs for phase I clinical trials. Statistica Sinica. 17:531-47, 2007.
  • Congratulations on our radiogenomics research being published in high-level journals, including Scientific Reports, Cancer, Radiology, Journal of Medical Imaging, and NPJ Breast Cancer. May 3, 2016
  • SCUBA is now a semi-finalist at the Precision Trial Challenge hosted by Harvard Business School (https://openforum.hbs.org/challenge/precision-medicine/announce-semi-finalists). May 1, 2016
  • Mitra R, Mueller P, Ji Y. "Bayesian Multiplicity Control for Graphs" is accepted by Canadian Journal of Statistics. May 1, 2016
  • LocHap version 2.0 is released. See details. Feb 15, 2016
  • Radiogenomics paper is accepted by Scientific Reports. Congratulations to Yitan! Nov 20, 2015
  • NGDF paper is accepted by Contemporary Clinical Trials. Congratulations to Shengjie! Sep 21, 2015
  • Congratulations to Tianjian and Subhajit for the paper "A Bayesian Nonparametric Model for Reconstructing Tumor Subclones Based on Mutation Pairs". It has been accepted by the Pacific Symposium on Biocomputing (PSB) 2016. Sep 16, 2015
  • Congratulations to Subhajit for the paper "Ultra-Fast Local-Haplotype Variant Calling Using Paired-end DNA-Sequencing Data Reveals Somatic Mosaicism in Tumor and Normal Blood Samples". It has been accepted for publication in Nucleic Acids Research. Sep 16, 2015
  • Yuan’s talk at JSM is featured in the press release. Aug 09, 2015
    PHYS, EurekAlert, ScienceDaily, Science Newsline, ASA News, Next Medicine Now, AZ News, QRDRG
  • Congratulations to TCGA-Assembler on surpassing 1,000 registered users. Aug 04, 2015
  • We have launched a blog discussing the tools on our website. The first blog series, written by Yuan Ji, is about Zodiac, our big-data cancer genomics database. Jun 04, 2015
  • Our Zodiac paper is nominated as the featured recommendation article on PubChase. May 09, 2015
  • Zodiac paper is accepted to Journal of the National Cancer Institute. Apr 10, 2015
  • We have updated Zodiac with a new landing page that includes several exciting discoveries. We have also enhanced the user experience in the gene search by linking NCBI gene descriptions directly to the gene names in the search results. Click here for a screenshot. Apr 01, 2015
  • Congratulations to Yanxun who will start her new job as an assistant professor in the Department of Applied Mathematics & Statistics at Johns Hopkins University.
  • Congratulations to Subhajit on getting the travel award for PSB 2015. Nov 13, 2014
  • "Are normal cells genetically identical" has been selected as a big idea and awarded initial seed funding by the University of Chicago BIG IDEAS GENERATOR. Congratulations to Yuan! Aug 2014
  • BayClone paper is accepted to Pacific Symposium on Biocomputing (PSB) 2015 (January 4-8, 2015, the Big Island of Hawaii).
  • SUBA paper is accepted to Statistics in Biosciences.
  • Two Treatment Cycles paper is accepted to Journal of American Statistical Association.
  • SEARS paper is published in Clinical Trials.
  • TCGA-Assembler paper is published in Nature Methods.
  • Abstract for miRNA regulation study in head & neck cancer is accepted by IFHNOS 5th World Congress/AHNS 2014 Annual Meeting.