ZODIAC (manuscript submitted) is an on-line tool for visualizing gene interactions based on statistical analysis of data collected by The Cancer Genome Atlas (TCGA; http://cancergenome.nih.gov). TCGA characterizes genomic features, such as gene copy number, gene expression, protein expression and genome-wide methylation, for thousands of samples from dozens of cancer types. We analyzed this large data set using a class of novel Bayesian Graphical Models (BGM) [1]. We inferred interactions for every possible gene pair chosen from the nearly 20,000 genes for which TCGA presents data.
We developed ZODIAC as an information system to help the research community query our statistical inference from the TCGA data for evidence of interactions between all ~200 million possible pairs of genes at the genetic, epigenetic and protein levels. ZODIAC uses a web-based portal to visualize genomic interactions inferred from TCGA data on 1448 patients from 11 different cancer types. Backed by the full public data assembled by our own data acquisition tool, TCGA-Assembler, ZODIAC focuses on presenting the findings of a large-scale statistical analysis. We performed the gene-pair analysis in parallel on Beagle, a supercomputer with about 17,400 CPUs at the University of Chicago and Argonne National Laboratory, and generated ~200 million individual interaction networks, one for each gene pair.
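The ~200 million figure follows from counting unordered pairs among roughly 20,000 genes. The check below is only a sketch of that arithmetic; the rounded gene count of 20,000 comes from the text above, and the function name is ours.

```python
from math import comb

def unordered_pairs(n_genes: int) -> int:
    """Number of distinct unordered gene pairs among n_genes genes."""
    return comb(n_genes, 2)

# Rounded gene count taken from the text; the exact count is smaller.
print(unordered_pairs(20_000))  # 199990000
```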
The ZODIAC web portal allows researchers to query these pair-wise results using their genes of interest. It displays the analysis results as graphs and tables, in which significant genomic interactions are identified based on false discovery rates (FDR) and posterior probabilities. Users can query using non-standard gene symbols, and the ZODIAC web portal will display choices that match their query. Gene symbols are validated against the current NCBI approved nomenclature.
There are three ways to search:
- Find interactions between a pair of TCGA genes
- Find interactions between a single gene and all other TCGA genes
- Find interactions between multiple genes
For single gene or gene pair searches we may ask you to validate the symbol because we're converting older gene symbols used by TCGA to current NCBI gene names. When you search for a gene, you can enter it in upper or lower case. We'll offer you all choices where the symbol you entered is the official symbol or is a known synonym. For example, entering "MAPK" as the first gene symbol will produce a list of options. Select the gene you want and click Continue to see your results.
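The matching rule described above (case-insensitive, against the official symbol or any known synonym) can be sketched as follows. This is an illustration only, not ZODIAC's implementation: the synonym table, its placeholder gene names, and the function name are all hypothetical.

```python
# Hypothetical synonym table: official symbol -> known synonyms.
# The names are placeholders, not real NCBI or TCGA entries.
GENE_SYNONYMS = {
    "GENEA": ["ALPHA1", "OLDA"],
    "GENEB": ["ALPHA1", "OLDB"],
    "GENEC": ["OLDC"],
}

def candidate_symbols(query, table=GENE_SYNONYMS):
    """Return official symbols whose name or synonym equals the query, ignoring case."""
    q = query.strip().upper()
    return sorted(
        official
        for official, synonyms in table.items()
        if q == official.upper() or q in (s.upper() for s in synonyms)
    )

# An ambiguous synonym yields several choices for the user to pick from.
print(candidate_symbols("alpha1"))  # ['GENEA', 'GENEB']
```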
A pair-wise search will display a brief introduction to the two genes, a graph of the significant interaction network based on an FDR cutoff, and a description of the features (nodes) in the network.
The graph shows interactions with an FDR of less than or equal to 0.01. The left half of the graph shows nodes for the first gene; the right half shows nodes for the second gene. The node symbols indicate the assay platform, i.e. GE for gene expression, PE for protein expression, ME for DNA methylation, and CN for DNA copy number. A green edge indicates a positive interaction and a red edge indicates a negative interaction. The edge width is proportional to the PostMeanBeta value in the table, reflecting the strength of the interaction. Full details of the Bayesian inference, including the statistical significance based on posterior probabilities, are provided in the table below the figure.
By clicking on "Download all results" at the bottom of the page, you can download a zip file containing network pictures, information about the gene features, and a list of inference statistics for all potential interactions between the two genes. Top1.GD.png, Top2.GD.png and Top3.GD.png are pictures showing the three most frequent graphs in the MCMC samples. FDR.GD.png shows the identified significant interactions (FDR <= 0.1). For clarity, edges connecting CN or ME between different genes are not included in FDR.GD.png. NodeInfo.txt describes each node (feature) in the network. EdgeInfo.txt contains inference statistics for all potential interactions between the two genes. It is a tab-delimited table in a plain .txt file, so users can sort the interactions and filter them by FDR values and by posterior mean values of the interaction strength coefficient β to obtain the most significant and strongest interactions.
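Sorting and filtering a tab-delimited table such as EdgeInfo.txt might look like the sketch below. The column name PostMeanBeta appears in the results table described above; the FDR column name, the Node1 and Node2 columns, and the sample rows are assumptions made for illustration, so check the header of the downloaded file before relying on them.

```python
import csv
import io

def strongest_edges(handle, fdr_cutoff=0.1, fdr_col="FDR", beta_col="PostMeanBeta"):
    """Keep rows with FDR <= cutoff, sorted by |posterior mean beta|, largest first."""
    rows = csv.DictReader(handle, delimiter="\t")
    kept = [r for r in rows if float(r[fdr_col]) <= fdr_cutoff]
    return sorted(kept, key=lambda r: abs(float(r[beta_col])), reverse=True)

# Made-up rows standing in for EdgeInfo.txt; the real column layout may differ.
sample = io.StringIO(
    "Node1\tNode2\tPostMeanBeta\tFDR\n"
    "GE_A\tGE_B\t0.42\t0.005\n"
    "ME_A\tGE_A\t-0.77\t0.02\n"
    "CN_B\tGE_B\t0.10\t0.35\n"
)
for row in strongest_edges(sample):
    print(row["Node1"], row["Node2"], row["PostMeanBeta"], row["FDR"])
```

With a real download, the same function can be given a file opened with open("EdgeInfo.txt", newline="") in place of the in-memory sample.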
A single gene (vs. all other genes) search will display information related to interactions between the gene and all other TCGA genes. ZODIAC returns a histogram of the number of significant positive and negative interactions for that gene.
You can download a list of all significant interactions with the queried gene by clicking the "DOWNLOAD" link above the histogram. The list is a tab-delimited table, in a plain .txt file, containing all the inference statistics of the interactions. Users can sort the interactions and filter them by FDR values and by posterior mean values of the interaction strength coefficient β to obtain the most significant and strongest interactions.
Below the histogram are tables summarizing the numbers of positive and negative interactions by feature types, i.e. GE, PE, ME, and CN. You can click on the numbers in each cell of the table to view details of the strongest interactions with specific feature types.
There is also a "DOWNLOAD" link under the tables, which you can use to download all inference statistics of the interactions with specific feature types.
A note about the RNA-Seq data used in preparing ZODIAC:
Some of the TCGA RNA-Seq gene expression values are zero, and these were replaced by the smallest positive RNA-Seq value in each cancer type to allow log transformation of the data. However, genes with a large number of zero RNA-Seq values may introduce bias into the interaction inference results generated by the Bayesian graphical models. Of the 19304 genes in ZODIAC, 2147 have zero RNA-Seq values in more than 50% of the samples. We identify these genes with an asterisk (*) next to the gene symbol in the summary tables and show the following disclaimer:
*: A gene name with * means that the RNA-Seq expression value of the gene is zero in more than 50% of the samples. In our analysis, we used the smallest positive RNA-Seq value in each cancer type to replace the zero values. However, when the majority of the expression values of a gene are zeros, the interaction inference result might be biased. We mark the gene name with a * sign to alert users to the potential bias.
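The zero-substitution and asterisk rules can be sketched as below for the samples of one cancer type. Several details are assumptions rather than statements about the actual pipeline: the text does not say whether the smallest positive value is taken per gene or across all genes, so this sketch takes it across all genes of the cancer type; the logarithm base is not stated, so the natural log is used; and the gene names and expression values are made up.

```python
import math

def smallest_positive(values):
    """Smallest strictly positive value; raises ValueError if there is none."""
    return min(v for v in values if v > 0)

def log_with_floor(values, floor):
    """Replace zeros with `floor`, then take the natural log (base is an assumption)."""
    return [math.log(v if v > 0 else floor) for v in values]

def mostly_zero(values, threshold=0.5):
    """True when more than `threshold` of the values are zero (the asterisk rule)."""
    return sum(1 for v in values if v == 0) / len(values) > threshold

# Made-up expression values for one cancer type: gene -> per-sample RNA-Seq values.
expr = {"GENEA": [0.0, 2.0, 8.0, 0.5], "GENEB": [0.0, 0.0, 0.0, 4.0]}

floor = smallest_positive(v for vals in expr.values() for v in vals)
logged = {gene: log_with_floor(vals, floor) for gene, vals in expr.items()}
flagged = [gene for gene, vals in expr.items() if mostly_zero(vals)]
print(floor, flagged)  # 0.5 ['GENEB']
```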
- Mitra, R., Muller, P., Liang, S., Yue, L. and Ji, Y., 2013. A Bayesian graphical model for ChIP-seq data on histone modifications. Journal of the American Statistical Association, 108(501), pp.69-80.