Welcome to bioinformatics and computational biology at Marshall University.
We operate as a research core facility supported by a number of program grants. We provide bioinformatic analysis service for the sequencing services for the Genomics Core Facility, as well as other analysis services and regular training workshops.
Our core facility currently houses a standalone Linux server with 32 dual-core CPUs and 768GB RAM, a compute cluster consisting of six 24-dual core nodes, and a networked data server with 218TB of usable storage. These resources are housed in the Marshall University Research Computing Data Center.
We receive support from the following grants from the National Institute of General Medical Sciences at the National Institutes of Health:
- WV-INBRE (P20GM103434)
- WV-CTSI (U54GM104942)
- Appalachian Center for Cellular transport in Obesity Related Disorders (ACCORD) (P20GM121299)
RNA-seq pipeline description
In the data preprocessing step, raw reads are quality-assessed and trimmed to remove Illumina sequencing adapters and low-confidence reads. Reads that fail to meet a miniumum length (typically 25 bases) are discarded after the trimming step, and quality assessment is performed a second time.
- Alignment-based quantification
- Reads are first aligned to a reference genome. For eukaryotic organisms a spliced aligner, such as HISAT2 must be used in order to account for reads that originated from a mRNA fragment that spans a splice junction.
- The location of aligned reads is then compared to a known annotation for the genome, in order to ascertain from which gene and transcript the read originated. We use the R/Bioconductor package GenomicAlignments to perform this step.
- Alignment-free quantification.
- In alignment-free quantification, we start with a set of known transcript sequences which are indexed by the occurence a short sub-sequences of fixed length K (K-mers). Reads are then assigned to a transcript by comparing them to this index.
- In this pipeline we use Salmon.
In either case, the output from this step can be visualized as a large matrix with one row for each gene
and one column for each sample. The entry in each cell in the matrix is the number of sequencing reads from
that sample that originated from that gene. This is known as the count matrix.
- What is the difference in expression of the gene between the samples in the two groups? We usually express this as a ratio, which you can think of as an estimate of the average expression of the gene in one group of samples divided by the expression of the gene in the other group of samples.
- How confident are we in the estimate of the difference in expression? Here we borrow from traditional statistical hypothesis testing and, loosely speaking, ask the question "how confident are we that the difference in expression is non-zero?".
- The count data for each gene and each sample will depend on the total amount of cDNA sequenced for that particular sample. Consequently, the counts must be normalized by some measure of the total number of reads for the sample.
- Because these counts are sampled from a highly non-normal distribution, simple T-tests are not appropriate. Under fairly reasonable assumptions, the count data will be distributed according to a negative binomial distribution.
- The dispersion (variance) of the counts varies with the total expression level. This happens in a way that exaggerates the fold change, and underestimates the raw p-value, for genes with low expression.
- Because we are testing so many (tens of thousands) of different genes, raw p-values are difficult or impossible to interpret. Even if all the null hypotheses were true (i.e. no genes were differentially expressed between the two groups), a large number of genes would show small p-values. The threshold for the p-value cutoff needs to be reduced in a manner that controls the proportion of false positives among all those genes whose p-value is less than the threshold.
The final step in the pipeline is to move from questions about individual genes to questions about pathways and functions. These questions take the following forms:
- Are there specific molecular functions and biological processes for which a large number of genes are differentially expressed in the experiment?
- Are there known pathways or regulatory networks that are coordinately perturbed in the experiment?
Generally, these questions are more readily addressed by the individual investigator, who has expertise in the biological question of interest, than by the core facility who have expertise in bioinformatics. However, we do provide support for these analyses using Cytoscape and related tools.
RNA-seq pipeline results format
In most expression profiling experiments, the primary result is the identification of genes that are differentially expressed between different experimental conditions. These experimental conditions may be differences in treatment, differences in the source material (e.g. different cell types or tissue), or other differences in phenotype (e.g. differences in outcome status from a treatment or disease).
Most users will want to focus on genes for which they are reasonably confident there is a non-zero change in the expression level; consequently we typically present genes for which the FDR is below some threshold. Typically this threshold is set at 10%. These results are presented as a comma-separated value (CSV) file which can be opened with Excel or other spreadsheet software, or imported into R or other programming environments for more advanced processing. Note that we do not impose cutoffs for fold change, though since the file is organized in decreasing order of fold change, it is straightforward to impose such a cutoff manually.
Structure of the results CSV files
The results of each comparison are presented in a CSV file. This CSV file contains a header row, followed by a row for each gene. Typically only genes which have an FDR below some predetermined threshold (typically 0.1) are included in the file. The rows are ordered by the magnitude of the fold change.
The columns of the CSV file are described below. For most users, the most useful columns will be symbol, effect, log2FoldChange, and padj.
|First Column||The identifier of the gene (usually an Ensembl id)|
|baseMean||The mean raw count (number of sequencing reads mapping to the gene) across all samples|
|log2FoldChange||The base 2 log of the estimated fold change (ratio) of expression between the two groups in the comparison|
|lfcSE||The standard error of the log 2 fold change|
|stat||The statistic computed for the statistical test employed by DESeq2 (usually a negative binomial test)|
|pvalue||The raw p-value for the statistical test (for the null hypothesis that the expression levels are equal in both groups in the comparison). Note that, because of the large number of genes being tested, it is difficult or impossible to interpret raw p-values in this context|
|padj||The Benjamini-Hochberg adjusted p-value, which estimates the false discovery rate associated with this gene. (Specifically, if a threshold were chosen such that this gene were included in the list of differentially expressed genes, but none with higher p-values, it is the estimated proportion of that list which would be false positives.) The CSV file is usually filtered by this value.|
|effect||A "user-friendly" presentation of the fold change. This is the ratio of expression in the group that has the higher expression to the group that has the lower expression, with "Up" meaning that the "treatment" group has higher expression than the control group, and "Down" meaning the treatment group has lower expression than the control group.|
|symbol||The common name for the gene|
|fpkm columns||These columns give the FPKM (fragments per kilobase per million reads sequenced) values for the gene for each of the samples considered in the comparison. These values are the number of reads mapping to the gene, normalized both for the size of the transcript (in kilobases) and for the total number of reads sequenced for the sample. These values are comparable both across samples and across genes|
Some (sometimes many) of the gene symbols will be "NA". This is because the RNA-Seq analysis pipeline assesses all annotated transcripts in an unbiased manner. Many of these transcripts are not protein-coding transcripts (for example, predicted genes or psuedogenes), and as such may not have associated gene names. You can search for the id on the appropriate web site (e.g. ensembl.org for Ensembl ids) to learn more about the gene.
Using Cytoscape to explore RNA-Seq data: How-to
Cytoscape is an open-source network visualization software platform, with widespread usage in the bioinformatics community. It is used for visualizing complex networks (for example, protein-protein interaction networks or gene regulatory networks), and overlaying these networks with data (for example, gene expression data).
Through the WV-INBRE grant, the bioinformatics core facility hosted a Cytoscape workshop in February 2019. Slides from the workshop are available from the Cytoscape tutorials page or via the links below:
We also provide a brief "how-to" for a simple use case further down this page.
- Download and install the latest version of Cytoscape.
- If you want to follow along with our sample data set, download the sample data. This data set is from an experiment in which we grew cells from a mouse breast cancer cell line in media conditioned by adipose tissue from C57BL/6 mice. In one set of samples, the mice were fed a high-fat diet; in the other set of samples they were fed a standard chow diet. The CSV file is generated by our standard pipeline and filtered for a false discovery rate (FDR) of 10%, comparing the high-fat conditioned cells to the lean conditioned cells. These data are generously contributed by Travis Salisbury.
- Launch Cytoscape
- From the menu, choose "Apps" and "App Manager".
- Either using the search bar, or by selecting "All apps" and scrolling through the list, locate "stringApp". Select it and press "Install"
- Similarly, locate and install "clusterMaker2".
- Close the App Manager. From the "App" menu, you should now see that the ClusterMaker and STRING apps are installed.
In the remainder of the workflow, we will show how to use the STRING app to find known protein-protein interactions between a collection of differentially expressed genes to form a network, how to visualize the network with gene expression values overlaid, how to break the network into small, biologically related, components, and how to find functions enriched in those components.
In this section of the workflow, we will create a network using identifiers of genes from an expression profiling experiment and the STRING protein-protein interaction database. If you downloaded our sample data, open it in a spreadsheet (such as Excel). If you are using your own data, you will need, at a minimum, a file containing gene identifiers and some measure of gene expression (preferably logarithms of a fold change).
- Copy the ids for the genes in your differentially expressed gene list. If you are using Excel, you can just select the entire column of ids by clicking on the column header. Then choose "Copy" or press Control-C.
- In Cytoscape, make sure the "Network" tab is selected in the control panel on the left. In the drop-down to the left of the search bar, choose "STRING protein query".
- In the search bar (where it says "Enter one term per line"), paste in the gene ids you copied in the previous step.
- Click on the menu button (three horizontal lines) in the search bar. Be sure to select the organism name (the sample data we provide come from a mouse experiment). Leave the other options as their defaults.
- Finally, press the search button (magnifying glass). Cytoscape will connect to STRING-db, first retrieving annotations associated with the gene ids, and then retrieving protein-protein interactions. On success, Cytoscape will display the results in the table panel, and display a graphical representation of the network in the main part of the display.
At this stage, we have a network derived from our list of genes, and known protein-protein interactions between those genes. Scroll through the columns in the table. Note that:
- The "shared name", which is the default id, is a protein id provided by STRING.
- The "query name" column contains the gene ids we provided to STRING. This will be important in the next step.
In the previous step, we created a network based on a set of differentially expressed genes and known protein-protein interactions between the protein products associated with those genes. In this step, we will add our expression data to the network. This will only add the data to the table; in the next step we will visualize those data in the graphical representation of the network.
- With the STRING network displayed in Cytoscape, choose File -> Import, and choose "Table from File" in the submenu:
- Navigate to the file with your data. If you followed the previous steps in this workflow, you should use the same file from which you copied the gene ids. The minimum requirement here is that your file contains a column with the gene ids used in your network, and a column with expression data values for those genes (preferably log-scaled). Press "Open".
- In the resulting dialog, for "Key Column for Network", select "query term". This specifies the column in the "Node table" in Cytoscape which will be used to identify the genes in the data file. Recall that when we imported the network from STRING, that the "query term" column was created using the ids that we submitted as our STRING protein query.
- Ensure that the column in the preview containing the gene ids in your file is annotated with the key icon. If needed, you can right-click on the column header, and choose the key icon under "meaning".
- Press "OK". No changes will be visible, immediately, but if you scroll right in the table, you should see new columns corresponding to the columns in the data file. If you are using our sample data, or data from the core facility, these columns will include "baseMean", "log2FoldChange", "lfcSE", etc.
We have now added data from our expression profiling experiment to the STRING network we created. In the next step, we will create a custom Cytoscape style to visualize these data.
Cytoscape uses a system of "styles" to control how the data are visualized in a network. These take the form of either fixed values (which can be defined as defaults, or via a "bypass" to apply them to selected genes in the network, or by mapping column values from the table to visual properties in the network.
The network we obtained with the STRING app comes with a sophisticated style that includes images of the protein structure in the nodes. The colors chosen by STRING are random. While this style has some uses (and matches that on the STRING website), it doesn't convey the data we want. We will create a new style for our expression data.
- With the network showing in Cytoscape, select the "Style" tab in the control panel on the left. Press the menu button (three horizontal lines) and choose "Creaste new style". Use an appropriate name for the style. (The style can be reused for multiple networks, so I usually use something generic such as "Expression style".) The genes will all change to a default red circle.
- We will change the color of the nodes in the network (representing proteins) to a color that indicates the log of the fold change of expression for the corresponding genes. The default color (currently red) will be used if no value is available. In the "Properties" table in the style control panel, click on the default ("Def.") color for "Fill Color" (currently red) to bring up a color palette. Choose a light grey color to indicate no fold change value is available. All genes will change to grey.
- Click on the mapping ("Map.") column in the "Fill Color" row of the properties table. Currently there is no mapping. For "Column" click in the "select value" box, and scroll down to find the column that contains the expression data. For data from the core facility, this will be the "log2FoldChange" column.
- Cytoscape supports three types of mapping when mapping table values to style properties.
- A "passthrough" mapping will use the actual value in the column for the value of the style property.
- A "continuous mapping" will treat the column value as a continuous variable, and map it to a value for the style property. For colors, this allows for a color gradient depending on the column value.
- A "discrete mapping" allows for mapping specific nominal values in the column to specific values for the style property.
- Finally, we need to label the genes with their gene name. To do this, click on the mapping (middle) column in the "Label" row in the style properties control panel. For the column name, choose "display name" (this column was created by the STRING app). Since the column contains the actual value we want to use in the label, for mapping type, choose "Passthrough Mapping". If you zoom into a portion of the network, you should now see gene names, along with the colors representing the differential expression.
For large networks, there can be too many genes and interactions for the network to be informative. This situation can be simplified by clustering the network. The basic aim here is to break the network into smaller components, called clusters, with the defining property of a cluster being, loosely speaking, that there are many more interactions within a cluster than there are between clusters. After this, the interactions between the clusters are dropped, allowing the clusters to be analyzed and interpreted independently.
Here we will use the Markov Clustering algorithm to cluster our example network. For networks of fewer than around 100 genes, this step can be skipped.
- From the menu, select "Apps", then "clusterMaker", and then in the "Network Cluster Algorithms" select "MCL Cluster".
- In the MCL cluster dialog box, check the box "Create new clustered network" (near the bottom of the dialog). Initially, you should leave other parameters as their defaults. You might want to experiment with the "granularity" parameter. A higher value of this parameter will tend to cluster the network into more, smaller clusters, while a smaller value will generate fewer, larger clusters.
- Press OK.
- Click on the "Network" tab in the control panel. You will see that a new network has been created, in which the original network has been broken into clusters.
To create a new network containing just one of the clusters, we first need to select all the genes in a cluster.
There are a couple of ways to do this:
In the control panel, click the "Select" tab, press the "+" button to create a new filter, and choose "column filter".
When we ran the Markov Clustering algorithm (MCL) on our network, it introduced a new column in the table called "__mclCluster", which identifies the cluster to which each gene in the network belongs. Choose this column for the selection filter:
Set the range for __mclCluster to be a single value (e.g. "between 1 and 1" to select cluster 1, "between 2 and 2" to select cluster 2, etc.) If "Apply when filter changes" is checked, the selection will immediately update to contain only nodes in the single cluster.
- A quicker, but perhaps less robust, way to select the nodes is to select a single gene in the desired cluster, and then repeatedly select neighbors of that gene using the "First Neighbors of Selected Nodes" button . Keep pressing this button until all the genes in the desired cluster are selected.
- Finally, once all the genes in the desired cluster are selected, press the "New network from selection" button . This will create a new network containing only the selected genes (i.e. only the genes in a single cluster): .
- Repeat these last steps to create networks from as many clusters as you need.
The clusters in this network represent protein products of genes in our differentially expressed gene set, for which there are many known protein-protein interactions among them. Biologically, these tend to represent closely-related proteins from a functional perspective.
In the next step, we'll see how to assess a network for overrepresentation of a biological process, molecular function, or gene pathway. For this clustered network, we'll perform overrepresentation analysis on each cluster independently.
To find overrepresented functions, processes, and pathway in a Cytoscape network, we can use "STRING Enrichment", which was installed as part of the STRING-app. Note this only works on "STRING networks". A network retrieved directly from a STRING query will automatically be a STRING network, but other networks we create (such as networks created from a single cluster after performing MCL clustering, as we did in previous steps) will need to be set as STRING networks.
- If your network is not directly created by the STRING app (e.g. if you followed the previous steps and created network from individual clusters), with your network selected, from the menu choose "Apps", "STRING", and "Set as STRING network".
- From the menu, choose "Apps", "STRING Enrichment", and "Retrieve functional enrichment". Leave the FDR cutoff as its default of 0.05.
This will result in Cytoscape retrieving a list of overrepresented functions and pathways from STRING. These include Gene Ontology terms, KEGG pathways, and Reactome pathways. By clustering a large network prior to performing an overrepresentation analysis, you should find more coherent functions for each of the individual clusters than you would for the entire network. You may also like to filter the pathways to include only certain categories. (Use the filter button, which is the leftmost icon above the enrichment table.)
Information for ACCORD investigators
- Data for the OZR-LZR expression profiling experiment (requires login).