Best practices for the analysis of SARS-CoV-2 data: Genomics, Proteomics, Evolution, and Cheminformatics

Using open source tools and public cyberinfrastructure for transparent, reproducible analyses of viral datasets.

Powered by: usegalaxy.org, usegalaxy.eu, usegalaxy.be, usegalaxy.fr


The goal of this resource is to provide publicly accessible infrastructure and workflows for SARS-CoV-2 data analyses. We currently feature three different types of analyses:


Each analysis section is continuously updated as new data becomes available. The main highlights are:

There are many complete genomes but only a handful of raw sequencing read datasets. We provide lists of raw read accessions for Illumina and ONT. These lists are updated daily. There are 4,899 distinct variable sites showing intra-host variation across 1,093 samples (with frequencies between 5% and 100%) from 28 studies representing 24 geographic locations. Variant lists and VCF files are updated as new data comes in. Intra-host polymorphisms may reveal sites affecting the pathogenicity of the virus.
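As a rough illustration of how intra-host variants in the stated 5% to 100% frequency range could be pulled out of a VCF file and counted as distinct variable sites, here is a minimal sketch. It assumes each record carries the allele frequency in an `AF` key of the INFO column, which may not match the layout of the project's own VCF files. The function names and the example records (positions and frequencies) are hypothetical and made up for demonstration.

```python
from typing import Iterable, Iterator, NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str
    af: float


def filter_by_frequency(
    vcf_lines: Iterable[str], min_af: float = 0.05, max_af: float = 1.0
) -> Iterator[Variant]:
    """Yield variants whose allele frequency lies in [min_af, max_af].

    Assumption: the frequency is stored as AF=<float> in the INFO column.
    """
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        fields = line.split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        chrom, pos, _id, ref, alt, _qual, _filter, info = fields[:8]
        # INFO is a semicolon-separated list of KEY=VALUE pairs and flags.
        info_map = dict(
            item.split("=", 1) for item in info.split(";") if "=" in item
        )
        if "AF" not in info_map:
            continue
        # Take the first value if several alternate alleles are listed.
        af = float(info_map["AF"].split(",")[0])
        if min_af <= af <= max_af:
            yield Variant(chrom, int(pos), ref, alt, af)


def count_variable_sites(variants: Iterable[Variant]) -> int:
    """Count distinct (chromosome, position) pairs among the variants."""
    return len({(v.chrom, v.pos) for v in variants})


# Made-up example records against the SARS-CoV-2 reference accession.
EXAMPLE_VCF = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "NC_045512.2\t100\t.\tA\tG\t.\tPASS\tDP=500;AF=0.12",
    "NC_045512.2\t100\t.\tA\tT\t.\tPASS\tDP=500;AF=0.07",
    "NC_045512.2\t200\t.\tC\tT\t.\tPASS\tDP=800;AF=0.01",
    "NC_045512.2\t300\t.\tG\tA\t.\tPASS\tDP=650;AF=0.98",
]

kept = list(filter_by_frequency(EXAMPLE_VCF))
print(len(kept), "variants kept at", count_variable_sites(kept), "distinct sites")
```

Counting sites as distinct (chromosome, position) pairs means two different alternate alleles at the same position contribute a single variable site.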

Which positions in the SARS-CoV-2 genome may be subject to positive selection (involved in adaptation) or negative selection (conserved during evolution)? We are using comparative evolutionary techniques to run daily analyses that identify potential candidates using genomes from GISAID. At present, ~5 genomic positions may merit further investigation because they may be subject to diversifying positive selection. See live results presented as continuously updated notebooks.

Nonstructural proteins (nsps) vital for the life cycle of SARS-CoV-2 are cleaved from a large precursor (encoded by ORF1ab) by enzymes such as the main protease (Mpro). We performed computational analyses (using protein-ligand docking) to identify potentially inhibitory compounds that can bind to Mpro and could be used to control viral proliferation. This work analyzed over 40,000 compounds considered likely to bind, chosen based on recently published X-ray crystal structures, and identified 500 high-scoring compounds. The workflows used for this analysis, as well as the individual compound lists, are publicly accessible.
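The final selection step, keeping the best-scoring compounds out of a large docked set, can be sketched as follows. This is only an illustration, not the project's workflow: it assumes each compound has a single numeric docking score where higher is better (some docking tools report the opposite convention), and the compound identifiers and scores are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple


def top_scoring(scores: Dict[str, float], n: int) -> List[Tuple[str, float]]:
    """Return the n compounds with the highest scores, best first.

    Assumption: a higher score means a better predicted binder.
    """
    return heapq.nlargest(n, scores.items(), key=lambda item: item[1])


# Hypothetical compound identifiers and scores, for demonstration only.
example_scores = {"cmpd_a": 0.2, "cmpd_b": 0.9, "cmpd_c": 0.5, "cmpd_d": 0.1}
best = top_scoring(example_scores, 2)
print(best)
```

With roughly 40,000 scored compounds, calling `top_scoring(scores, 500)` would return a shortlist of the size described above without sorting the full set.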

Project Video Introduction

The analyses have been performed using the Galaxy platform and open source tools from BioConda. Tools were run using XSEDE resources maintained by the Texas Advanced Computing Center (TACC), Pittsburgh Supercomputing Center (PSC), and Indiana University in the U.S., de.NBI, VSC cloud resources and IFB cluster resources on the European side, STFC-IRIS at the Diamond Light Source, and ARDC cloud resources in Australia.

Galaxy Project   European Galaxy Project   Australian Galaxy Project   bioconda   XSEDE   TACC   de.NBI   ELIXIR   PSC   Indiana University   Galaxy Training Network   Bio Platforms Australia   Australian Research Data Commons   VIB   ELIXIR Belgium   Vlaams Supercomputer Center   EOSC-Life   Datamonkey   IFB   GalaxyP