This repository provides a Python implementation of the Bayesian lead-lag
The algorithm makes use of the NumPy and SciPy packages, as well as Matplotlib and NetworkX for data visualization.
A common task in genomic analysis is to identify groups of genes within an organism that exhibit similar expression (i.e., activity) patterns over time. Given time series measurements of gene expression levels, the objective of the LLR2 algorithm is to cluster genes based on their temporal expression dynamics. To achieve this, we quantify the similarity between each pair of genes using the lead-lag
The similarity metric is computed as follows. We assume the expression of a single gene
where
which is linear in the (unknown) constants
Thus, given temporal measurements
The
The auxiliary function computeLLR2Bayes in the file llr2.py uses Bayesian regression to compute the above-mentioned lead-lag
Our full algorithm, which computes the LLR2 between all pairs of genes and stores the values in a matrix, is implemented in the main function LLR2 in the file llr2.py. As input, it takes a list of all
We can use the LLR2 matrix produced by the LLR2 function as a similarity matrix in a hierarchical clustering method to divide the set of genes into groups that exhibit similar temporal dynamics and are likely to be biologically associated with one another.
Below is an abridged example of how to use this software, taken from the file results.py.
First, we read our Data directory. We also define the hours corresponding to the
# Read gene expression data and prior adjacency matrix
geneData = pd.read_csv("Data/geneData.csv", index_col = 0)
priorMatrix = pd.read_csv("Data/priorMatrix.csv", index_col = 0)
# Save gene names and define hours corresponding to each time point
geneNames = geneData.index
hours = [0, 1, 2, 4, 5, 6, 8, 10, 12, 14, 16, 20, 24, 30, 36, 42, 48]
# Reshape gene expression data into a list of lists
geneData = geneData.values.tolist()
Next, we run the LLR2 algorithm. This returns the LLR2Bayes, which we can use for clustering.
LLR2Bayes = LLR2(geneData, hours, bayes = True, priorMatrix = priorMatrix, writeToCSV = True)
We now run the standard hierarchical clustering function from SciPy, first converting our similarities into distances, and then we divide the genes into 16 clusters.
# Perform clustering on condensed distance matrix
hierClust = linkage(squareform(1 - LLR2Bayes), method = "ward")
# Divide genes into clusters
subGroups = fcluster(hierClust, t = 16, criterion = "maxclust")
(clusterNumber, clusterCounts) = np.unique(subGroups, return_counts = True)
Using the function plotGenes defined in plotting.py, we now plot the temporal expression patterns of the genes in the 16 clusters and can see that there are several groups with visually similar trajectories.
# Define plot colors for each cluster
plotColors = ["darkorange", "dodgerblue", "forestgreen", "darkmagenta", "teal", "saddlebrown", "navy", "darkgoldenrod", "blueviolet", "darkred", "olivedrab", "darkslategray", "cornflowerblue", "rosybrown", "cadetblue", "crimson"]
# Set up plot window with 16 subplots
fig, axes = plt.subplots(4, 4, figsize = (12, 9))
# Plot genes in each cluster
for i, ax in enumerate(axes.flatten()):
# Extract gene time series data for current cluster i
genesInCluster = np.where(subGroups == (i + 1))[0]
clusterData = [geneData[i] for i in genesInCluster]
# Set plot title
plotTitle = "Cluster " + str(i + 1) + " (" + str(clusterCounts[i]) + " genes)"
# Draw plot
fig = plotGenes(clusterData, hours, plotColors = [plotColors[i]] * len(genesInCluster), lineOpacity = 0.15, axis = ax, xAxisLabel = "Time", yAxisLabel = "Expression", plotTitle = plotTitle)
# Display plot
plt.tight_layout()
plt.show()
We can also visualize the associations found between temporal patterns as a network. Below, we draw a network of the genes in this dataset by representing each gene as a node, and drawing an edge between two genes if their LLR2 value exceeds 0.9.
# Create adjacency matrix by thresholding R^2 values at 0.9
adjacencyBayes = (LLR2Bayes > 0.9) + 0
# Zero out diagonal to avoid self-loops and construct graph
np.fill_diagonal(adjacencyBayes, 0)
graphBayes = nx.from_numpy_array(adjacencyBayes)
# Draw network figure
plt.figure(figsize = (5, 5))
plt.title("Gene network computed with Bayesian LLR2")
nx.draw(graphBayes, nx.spring_layout(graphBayes, k = 0.05, seed = 123), with_labels = False, node_size = 12, edgecolors = "navy", edge_color = "gray", linewidths = 1, node_color = "orange")
plt.show()
The advantage of our Bayesian regression method is that it makes use of prior information to filter out unlikely and/or spurious associations between genes. The LLR2 function also allows the user to compute the lead-lag
# Run the LLR2 algorithm without Bayesian regression (i.e., using OLS regression)
LLR2OLS = LLR2(geneData, hours, bayes = False, writeToCSV = True)
# Create adjacency matrix by thresholding R^2 values at 0.9
adjacencyOLS = (LLR2OLS > 0.9) + 0
# Zero out diagonal to avoid self-loops and construct graph
np.fill_diagonal(adjacencyOLS, 0)
graphOLS = nx.from_numpy_array(adjacencyOLS)
# Draw network figure
plt.figure(figsize = (5, 4))
plt.title("Gene network computed with OLS LLR2")
nx.draw(graphOLS, nx.spring_layout(graphOLS, k = 0.25, seed = 123), with_labels = False, node_size = 12, edgecolors = "navy", edge_color = "gray", linewidths = 0.7, node_color = "orange", alpha = 0.7)
plt.show()


