The goal of this resource is to provide publicly accessible infrastructure and workflows for SARS-CoV-2 data analyses. We currently feature three different types of analyses:
Assembly and intra-host variation
Sites under selection
- Natural Selection Analysis
- Observable Notebooks
Each analysis section is continuously updated as new data becomes available. The main highlights are:
There are many complete genomes but only a handful of raw sequencing read datasets. We provide lists of raw read accessions for Illumina and ONT. These lists are updated daily. There are 4,899 distinct variable sites showing intra-host variation across 1,093 samples (with frequencies between 5% and 100%) from 28 studies representing 24 geographic locations. Variant lists and VCF files are updated as new data comes in. Intra-host polymorphisms may reveal sites affecting the pathogenicity of the virus.
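As a rough illustration of how the count of distinct variable sites within a frequency window could be derived from VCF records, here is a minimal sketch. It assumes the allele frequency is stored in an `AF` key of the INFO column, which depends on the variant caller. The function names and the example records are hypothetical and are not taken from the project's workflows or data.

```python
def parse_info(info):
    """Turn a VCF INFO string such as 'DP=100;AF=0.12' into a dict."""
    fields = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


def variable_sites(vcf_lines, min_af=0.05, max_af=1.0):
    """Return sorted distinct positions with any allele frequency in range.

    Assumption: allele frequency lives in INFO/AF (caller dependent).
    """
    sites = set()
    for line in vcf_lines:
        # Skip blank lines and VCF header lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        pos = int(cols[1])
        info = parse_info(cols[7])
        if "AF" not in info:
            continue
        # AF may hold several comma-separated values for multi-allelic sites.
        afs = [float(x) for x in info["AF"].split(",")]
        if any(min_af <= af <= max_af for af in afs):
            sites.add(pos)
    return sorted(sites)


# Made-up records for illustration only (not real variant calls).
example = [
    "##fileformat=VCFv4.0",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "NC_045512.2\t241\t.\tC\tT\t100\tPASS\tDP=500;AF=0.98",
    "NC_045512.2\t3037\t.\tC\tT\t80\tPASS\tDP=400;AF=0.03",
    "NC_045512.2\t241\t.\tC\tT\t90\tPASS\tDP=300;AF=0.40",
    "NC_045512.2\t14408\t.\tC\tT\t95\tPASS\tDP=450;AF=0.07",
]
print(variable_sites(example))
```

Records from several samples can be concatenated before the call. A position is counted once no matter how many samples carry it, which is what "distinct" means in this sketch.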
Which positions in the SARS-CoV-2 genome may be subject to positive selection (involved in adaptation) or negative selection (conserved during evolution)? We use comparative evolutionary techniques to run daily analyses that identify potential candidates using genomes from GISAID. At present, ~5 genomic positions may merit further investigation because they may be subject to diversifying positive selection. See the live results, presented as continuously updated notebooks.
Nonstructural proteins (nsps) vital for the life cycle of SARS-CoV-2 are cleaved from a large precursor (encoded by ORF1ab) by enzymes such as the main protease (Mpro). We performed computational analyses (using protein-ligand docking) to identify potentially inhibitory compounds that can bind to Mpro and could be used to control viral proliferation. This work analyzed over 40,000 compounds considered likely to bind, chosen based on recently published X-ray crystal structures, and identified 500 high-scoring compounds. The workflows used for this analysis, as well as the individual compound lists, can be accessed here.
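The final selection step, keeping the best-scoring compounds out of a large docked set, can be sketched as follows. This is only an illustration under stated assumptions: it assumes each compound has a single numeric score where higher means better, which is not true of every docking program. The compound identifiers, the scores, and the `top_scoring` helper are hypothetical and are not the project's actual workflow.

```python
import heapq


def top_scoring(scores, n=500):
    """Return the n (compound_id, score) pairs with the highest scores.

    Assumption: a higher score indicates a better predicted binder.
    """
    return heapq.nlargest(n, scores.items(), key=lambda item: item[1])


# Made-up identifiers and scores for illustration only.
scores = {"cmpd_A": 7.2, "cmpd_B": 5.1, "cmpd_C": 8.4, "cmpd_D": 6.0}
print(top_scoring(scores, n=2))
```

For a scoring function where lower values are better (for example, predicted binding energies), `heapq.nsmallest` would be the analogous call.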
Project Video Introduction
The analyses were performed using the Galaxy platform and open-source tools from BioConda. Tools were run using XSEDE resources maintained by the Texas Advanced Computing Center (TACC), Pittsburgh Supercomputing Center (PSC), and Indiana University in the U.S.; de.NBI, VSC cloud resources, and IFB cluster resources in Europe; STFC-IRIS at the Diamond Light Source; and ARDC cloud resources in Australia.