TCGA-Assembler 2: Software Pipeline for Automatic Retrieval, Processing, and Integration of TCGA/CPTAC Data
TCGA-Assembler 2 is an open-source, freely available tool that automatically downloads, assembles, and processes public data from The Cancer Genome Atlas (TCGA), together with Clinical Proteomic Tumor Analysis Consortium (CPTAC) data for TCGA samples. It facilitates downstream data analysis by relieving investigators of the burden of data preparation. TCGA-Assembler 2 includes two modules. Module A acquires public TCGA data from the Genomic Data Commons (GDC) of the U.S. National Cancer Institute and assembles individual data files into locally stored data tables. It can also acquire mass spectrometry proteomics data of TCGA samples generated by the CPTAC. Module B processes the assembled data to prepare them for downstream analysis. TCGA-Assembler 2 is licensed and distributed under the GPL version 3.
The TCGA-Assembler 2 software package can be downloaded from GitHub at https://github.com/compgenome365/TCGA-Assembler-2
Distribution of TCGA-Assembler Users:
Since its first release in Feb. 2014, TCGA-Assembler has been downloaded and used by 3511 researchers from 69 different countries and regions all over the world.
|Country||Number of Users||Institutions|
|China||1381||Tsinghua University, Fudan University, University of Science and Technology of China, Shanghai Jiao Tong University, Beijing Genomics Institute, Ruijin Hospital, Institute of Biophysics (Chinese Academy of Sciences), Shanghai Institutes for Biological Sciences (Chinese Academy of Sciences), China Medical University, Tongji Medical College, Nanjing Medical University, Wuhan University, West China Hospital, Sun Yat-sen University, Capital Medical University, Harbin Medical University, Huazhong University of Science and Technology, CAS-MPG Partner Institute for Computational Biology, University of Science & Technology Beijing, Shanghai University, Jinan University, Central South University, Guangdong Medical College, Third Military Medical University, Guangxi Medical University, Southeast University, ...|
|United States||1122||Johns Hopkins University, Harvard University, Stanford University, Princeton University, Massachusetts Institute of Technology, National Cancer Institute, Columbia University, Emory University, Yale University, University of Chicago, University of Michigan, Baylor College of Medicine, The University of Texas M.D. Anderson Cancer Center, Dana-Farber Cancer Institute, Mayo Clinic, Cleveland Clinic Foundation, Vanderbilt University, The Scripps Research Institute, Mount Sinai School of Medicine, University of Washington, Northwestern University, University of Pennsylvania, Duke University, Brigham and Women's Hospital, Roswell Park Cancer Institute, Albert Einstein College of Medicine, University of Connecticut Health Center, Boston University, University of California at Los Angeles, Georgetown University, Dartmouth College, University of Pittsburgh, Duke Cancer Institute, The Wistar Institute, Moffitt Cancer Institute, Oak Ridge National Lab, Sanofi, ...|
|Korea, South||105||Seoul National University, Samsung Medical Center, Samsung Biomedical Research Institute, The Armed Forces Medical Research Institute, Hanyang University, SUNY Korea, ...|
|India||83||Assam University, National Institute of Immunology, Institute of Bioinformatics and Applied Biotechnology, Advanced Centre for Treatment Research and Education in Cancer, Centre for Development of Advanced Computing, Datar Genetics Limited, CSIR-Institute of Himalayan Bioresource Technology (IHBT), Positive Bioscience, ...|
|Taiwan||76||National Yang-ming University, Taipei Medical University, Academia Sinica Institute of Biological Chemistry, Tri-Service General Hospital, National Taiwan University, Chang Gung Bioinformatics Center, Wan Fang Hospital, Chang Gung Hospital, ...|
|Canada||68||Mount Sinai Hospital, University Health Network, Ontario Institute for Cancer Research, The Hospital for Sick Children, The INRS-Institut Armand-Frappier Research Centre, Princess Margaret Cancer Centre, McGill University, The University of British Columbia, Ottawa Hospital Research Institute, University of Guelph, ...|
|United Kingdom||66||Oxford University, University of Cambridge, European Bioinformatics Institute, Wellcome Trust Sanger Institute, Institute of Cancer Research, Cancer Research UK, University College London, King's College London, Thomson Reuters, Imperial College London, Queen's University Belfast, University of Dundee, University of Sheffield, Jenner Institute, ...|
|Germany||52||The German Cancer Research Center, Heidelberg University, Technical University of Munich, University Medical Center Hamburg-Eppendorf, University of Münster, German Research Center for Environmental Health, University of Applied Sciences Koblenz, Bayer HealthCare AG, Cellzome AG, University of Kiel, Ludwig Maximilian University of Munich, University of Cologne, ...|
|Italy||49||European Institute of Oncology, University of Pavia, University of Turin, Biomedical University of Rome, University of Rome Tor Vergata, Fondazione Edo ed Elvo Tempia Valenta, Istituto Scientifico Romagnolo per lo Studio e la Cura dei Tumori, Biogem Institute, ...|
|Denmark||40||University of Copenhagen, Aarhus University, ...|
|Spain||39||University of Málaga, Spanish National Cancer Research Center, Centre of Studies and Technical Research, Institute of Predictive and Personalized Medicine of Cancer, Navarra-Biomed, University of Navarra, Center for Research in Environmental Epidemiology, Vall d'Hebron Institute of Oncology, Príncipe Felipe Research Center, National Center for Genome Analysis, ...|
|Japan||38||Kyushu University, Hamamatsu University School of Medicine, ...|
|Australia||38||The University of Sydney, Royal Melbourne Hospital, The University of Queensland, University of Melbourne, The Walter and Eliza Hall Institute of Medical Research, ...|
|France||36||Centre for Cancer Research in Lyon, Centre Léon-Bérard, National Institute of Health and Medical Research, European Institute for Systems Biology and Medicine, ...|
|Netherlands||34||University of Amsterdam, Delft University of Technology, Leiden University Medical Center, Academic Medical Center, Utrecht University, VU University Amsterdam, Radboud university medical center, ...|
|Sweden||29||Chalmers University of Technology, Karolinska Institute, Lund University, Umea University, ...|
|Hong Kong||29||The Chinese University of Hong Kong, The University of Hong Kong, ...|
|Singapore||27||Cancer Science Institute, Duke-NUS Graduate Medical School, Genome Institute of Singapore, National University of Singapore, Bioinformatics Institute (BII), ...|
|Brazil||22||Federal University of Paraiba, Universidade Federal de Mato Grosso do Sul, Universidade de São Paulo, ...|
|Israel||20||Ben-Gurion University of the Negev, Tel Aviv Sourasky Medical Center, Hebrew University of Jerusalem, ...|
|Iran||17||University of Tehran, Iran Blood Transfusion Organization, ...|
|Turkey||16||Bilkent University, Dokuz Eylul University, Koc University, Karadeniz Technical University, ...|
|Finland||15||University of Turku, University of Helsinki, ...|
|Norway||11||University of Bergen, University of Oslo, ...|
|Argentina||11||University of Buenos Aires, Institute of Experimental Medicine and Biology of Cuyo, ...|
|Russia||10||Ulyanovsk State University, ITMO University, ...|
|Austria||9||Medical University of Graz, ...|
|Belgium||9||iTeos Therapeutics, ...|
|Switzerland||8||Novartis AG, University Hospital Zurich, SIB Swiss Institute of Bioinformatics, ...|
|Qatar||6||Carnegie Mellon University in Qatar, Sidra Medical and Research Center|
|Portugal||6||Ipatimup Institute of Molecular Pathology and Immunology of the University of Porto, ...|
|Poland||6||Wroclaw Medical University, Selvita S.A., ...|
|Chile||5||Pontifical Catholic University of Chile, ...|
|Greece||5||University of Patras, Foundation for Research & Technology - Hellas (FORTH), ...|
|Luxembourg||4||Luxembourg Centre for Systems Biomedicine, ...|
|Mexico||4||The National Institute of Genomic Medicine, ...|
|Pakistan||3||COMSATS Institute of Information Technology, ...|
|Hungary||3||Eötvös Loránd University, ...|
|Lithuania||2||Lithuanian University of Health Sciences, ...|
|Romania||2||Solutions of Artificial Intelligence Applications (SAIA), ...|
|Uruguay||1||Biological Research Institute Clemente Estable, ...|
|Czech Republic||1||Czech Technical University in Prague, ...|
|Serbia and Montenegro||1||Seven Bridges Genomics, Inc, ...|
|South Africa||1||University of Cape Town, ...|
|Costa Rica||1||University of Costa Rica, ...|
- Wei, L., Jin, Z., Yang, S., Xu, Y., Zhu, Y. and Ji, Y. "TCGA-Assembler 2: Software Pipeline for Retrieval and Processing of TCGA/CPTAC Data." Bioinformatics (2017). https://doi.org/10.1093/bioinformatics/btx812
- Zhu, Y., Qiu, P. and Ji, Y. "TCGA-assembler: open-source software for retrieving and processing TCGA data." Nature Methods 11(6), 599-600 (2014).
- Version 2.0.6 release - 06/04/2018
- Version 2.0.6 keeps up with the latest changes on the GDC and CPTAC servers.
- Version 2.0.5 release - 08/15/2017
- Bug fixed:
- Because CPTAC recently changed its data file structure, we updated TCGA-Assembler 2 code to make sure the functions in Module A can work correctly.
- curl must be available as a system command for TCGA-Assembler to use. The easiest way to do so is to copy curl.exe from the TCGA-Assembler package to the Windows system directory C:\Windows\System32. You can also download the latest curl executable that supports SSL and SSH and is compatible with your operating system from https://curl.haxx.se/download.html.
- Windows usually limits the length of a file path to 260 characters. TCGA data files usually have long file and folder names, so the downloaded data files may have paths (including both the full directory and the file name) longer than this limit, which causes failures when writing the data files to your local hard disk. If you see the following messages in your R console, they most likely indicate a failure caused by the path length limit: the program cannot save the data files and keeps retrying.
 "metadata file: preparing ..."
 "metadata file: preparing done!"
 "*.tar.gz file: downloading & unzipping ..."
 "cannot open the connection"
 "cannot open the connection"
 "cannot open the connection"
To solve this problem, either put the TCGA-Assembler package in a directory with a short path (such as the root directory C:\) or change the settings of your operating system to allow long file paths. The procedure is specific to your Windows version; please search online for instructions on configuring your Windows version to allow long file paths.
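The two Windows checks described above (curl on the system path, and output paths within the length limit) can be sketched in a few lines. This is a minimal illustration in Python, not part of TCGA-Assembler itself; the helper names `curl_available` and `path_too_long` are hypothetical, and the 260-character threshold is simply the figure quoted in the note.

```python
import os
import shutil

# Path length limit quoted in the note above (assumed, not read from the OS).
WINDOWS_MAX_PATH = 260


def curl_available():
    """Return True if a `curl` executable can be found on the system PATH."""
    return shutil.which("curl") is not None


def path_too_long(directory, file_name, limit=WINDOWS_MAX_PATH):
    """Return True if the absolute path of `file_name` inside `directory`
    would be longer than `limit` characters."""
    full_path = os.path.join(os.path.abspath(directory), file_name)
    return len(full_path) > limit


if __name__ == "__main__":
    print("curl found on PATH:", curl_available())
    # Hypothetical file name, used only to exercise the check.
    print("path too long:", path_too_long(".", "example_data_file.tar.gz"))
```

Running a check like this against the intended save folder before a download gives an early hint that a shorter base directory may be needed.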
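- On certain operating systems, such as Ubuntu, some users encountered the following error message:
Error in function (type, msg, asError = TRUE):
gnutls_handshaked: A TLS fatal alert has been received.
The error is likely due to a missing libcurl4-openssl-dev, which can be installed with the following commands:
$ sudo apt-get remove libcurl4-gnutls-dev
$ sudo apt-get install libcurl4-openssl-dev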
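To see which TLS library a curl installation was built against, one option is to inspect the first line of `curl --version`, which normally names it (for example `OpenSSL/...` or `GnuTLS/...`). The sketch below is an assumption-laden illustration: the function names are hypothetical, the list of libraries is not exhaustive, and it inspects the curl command-line tool only, which may be linked differently from the libcurl build that R uses.

```python
import re
import shutil
import subprocess

# TLS libraries looked for in the banner (illustrative, not exhaustive).
KNOWN_TLS_LIBRARIES = ("OpenSSL", "GnuTLS", "LibreSSL", "NSS")


def tls_backend(version_banner):
    """Return the TLS library named in a `curl --version` banner, or None."""
    pattern = r"\b(" + "|".join(KNOWN_TLS_LIBRARIES) + r")\b"
    match = re.search(pattern, version_banner)
    return match.group(1) if match else None


def curl_version_banner():
    """Return the first line printed by `curl --version`, or None if curl
    is not installed or prints nothing."""
    if shutil.which("curl") is None:
        return None
    result = subprocess.run(
        ["curl", "--version"], capture_output=True, text=True, check=False
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    banner = curl_version_banner()
    print("TLS backend:", tls_backend(banner) if banner else "curl not found")
```

A result of `GnuTLS` would be consistent with the situation described above, although it does not prove that the R session uses the same library.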
- Version 2.0.4 release - 06/29/2017
- Bug fixed:
Because GDC recently changed the column names of the metadata file, we updated TCGA-Assembler 2 code to make sure the functions in Module A can work correctly.
- Version 2.0.3 release - 04/18/2017
- We have implemented two significant improvements in TCGA-Assembler version 2.0.3 compared to previous versions:
- TCGA-Assembler can now acquire and process TCGA somatic mutation data. See function DownloadSomaticMutationData in Module A and function ProcessSomaticMutationData in Module B.
- TCGA-Assembler can now acquire and process mass spectrometry proteomics data of TCGA samples generated by the Clinical Proteomic Tumor Analysis Consortium (CPTAC). The proteomics data include proteome, phosphoproteome, and glycoproteome measurements of breast cancer, ovarian cancer, and colorectal cancer. See function DownloadCPTACData in Module A and function ProcessCPTACData in Module B.
- Version 2.0.2 release - 04/05/2017
- Bug fixed:
Because GDC recently changed the format of the metadata file for biospecimen and clinical data, we updated TCGA-Assembler 2 code so that the DownloadBiospecimenClinicalData function can work correctly.
- Version 2.0.1 release - 01/18/2017
- TCGA-Assembler version 2.0.1 is compatible with Windows, Linux, and Mac systems.
- Version 2.0.0 release - 12/19/2016
- TCGA-Assembler 2 retrieves TCGA public data from the Genomic Data Commons (GDC) of the U.S. National Cancer Institute, using the GDC Application Programming Interfaces (APIs). Unlike TCGA-Assembler 1, TCGA-Assembler 2 does not need to obtain all data file information from the data server first, which simplifies data acquisition.
- TCGA-Assembler 2 is compatible with Linux and Mac OS. We are currently developing a version for Windows users, which will be released soon.
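As an illustration of what acquiring data through the GDC APIs involves, the sketch below builds a query URL for the public GDC `files` endpoint using its JSON filter syntax. It is a minimal Python example under stated assumptions, not the code TCGA-Assembler 2 uses: the function name `build_files_query` is hypothetical, the chosen fields and the data category value are examples only, and the endpoint and category names that TCGA-Assembler 2 actually queries may differ.

```python
import json
from urllib.parse import urlencode

# Public GDC files endpoint (assumed here; TCGA-Assembler 2 may use another).
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_files_query(project_id, data_category, size=10):
    """Build a GDC files-endpoint URL that selects files of one data
    category from one project, for example "TCGA-BRCA"."""
    filters = {
        "op": "and",
        "content": [
            {
                "op": "in",
                "content": {
                    "field": "cases.project.project_id",
                    "value": [project_id],
                },
            },
            {
                "op": "in",
                "content": {"field": "data_category", "value": [data_category]},
            },
        ],
    }
    params = {
        "filters": json.dumps(filters),
        "fields": "file_id,file_name,data_type",
        "format": "JSON",
        "size": str(size),
    }
    return GDC_FILES_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    # Prints the URL only; no network request is made.
    print(build_files_query("TCGA-BRCA", "Simple Nucleotide Variation", size=5))
```

The returned URL can be fetched with any HTTP client; the response lists matching files, whose identifiers can then be passed to the GDC data endpoint for download.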
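Some users encountered a TLS handshake error ("gnutls_handshaked: A TLS fatal alert has been received.") on certain operating systems, such as Ubuntu. The error is likely due to a missing libcurl4-openssl-dev package; removing libcurl4-gnutls-dev and installing libcurl4-openssl-dev with apt-get typically resolves it.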
Summary of TCGA data that can be retrieved using TCGA-Assembler 2 (accurate as of April 24, 2017)
The seven data-type columns give the number of patient samples (including both tumor and normal tissue samples); the last column gives the number of patients with clinical information. A dash indicates that no data are listed.
|Cancer type||Somatic mutation||Copy number alteration||DNA methylation||Gene expression||miRNA expression||Protein expression (RPPA)||Protein expression (iTRAQ)||Number of patients with clinical information|
|Adrenocortical carcinoma (ACC)||92||180||80||79||80||46||-||92|
|Bladder urothelial carcinoma (BLCA)||413||806||434||427||429||344||-||412|
|Breast invasive carcinoma (BRCA)||1092||2207||1228||1215||1189||937||105||1097|
|Cervical squamous cell carcinoma and endocervical adenocarcinoma (CESC)||307||586||312||309||312||173||-||307|
|Colon adenocarcinoma (COAD)||423||920||535||481||455||362||60||459|
|Lymphoid neoplasm diffuse large B-cell lymphoma (DLBC)||48||96||48||48||47||33||-||48|
|Esophageal carcinoma (ESCA)||186||373||202||196||198||126||-||185|
|Glioblastoma multiforme (GBM)||409||1110||437||172||5||244||-||596|
|Head and neck squamous cell carcinoma (HNSC)||528||1090||580||566||569||357||-||528|
|Kidney chromophobe (KICH)||66||132||66||91||91||63||-||113|
|Kidney renal clear cell carcinoma (KIRC)||504||1059||895||607||588||478||-||537|
|Kidney renal papillary cell carcinoma (KIRP)||290||593||342||323||326||216||-||291|
|Acute myeloid leukemia (LAML)||201||392||194||173||188||0||-||200|
|Brain lower grade glioma (LGG)||530||1015||530||530||526||435||-||515|
|Liver hepatocellular carcinoma (LIHC)||377||760||429||423||424||184||-||377|
|Lung adenocarcinoma (LUAD)||572||1095||636||576||561||365||-||522|
|Lung squamous cell carcinoma (LUSC)||493||1035||572||555||523||328||-||504|
|Ovarian serous cystadenocarcinoma (OV)||583||1172||622||577||493||436||174||587|
|Pancreatic adenocarcinoma (PAAD)||186||368||195||183||183||123||-||185|
|Pheochromocytoma and paraganglioma (PCPG)||184||360||187||187||187||82||-||179|
|Prostate adenocarcinoma (PRAD)||499||1031||549||550||547||352||-||500|
|Rectum adenocarcinoma (READ)||157||316||178||173||165||132||30||170|
|Skin cutaneous melanoma (SKCM)||472||938||475||473||452||355||-||470|
|Stomach adenocarcinoma (STAD)||441||906||470||450||477||357||-||443|
|Testicular germ cell tumors (TGCT)||156||304||156||156||156||122||-||134|
|Thyroid carcinoma (THCA)||504||1020||571||572||573||374||-||507|
|Uterine corpus endometrial carcinoma (UCEC)||545||1094||595||253||572||440||-||548|
|Uterine carcinosarcoma (UCS)||57||110||57||57||57||47||-||57|
|Uveal melanoma (UVM)||80||160||80||80||80||12||-||80|