By Tim Woelfle, 27/08/2019
This is an exploratory analysis of the openly available brain proteome dataset by Ping et al., described in their 2018 article: Ping, L., Duong, D.M., Yin, L., Gearing, M., et al. (2018) Global quantitative analysis of the human brain proteome in Alzheimer’s and Parkinson’s Disease. Scientific Data. [Online] 5, 180036. Available from: doi:10.1038/sdata.2018.36.
It contains mass-spectrometry-based proteomes from 40 brain tissue donors: 10 controls, 10 Alzheimer's patients (AD), 10 Parkinson's patients (PD) and 10 patients with both (ADPD). Samples were analyzed from the frontal cortex (Brodmann area 9) and the cingulate gyrus (Brodmann area 24). This exploratory analysis focuses on Alzheimer's Disease and the cingulate gyrus but could be repeated in the same way for Parkinson's Disease and/or the frontal cortex. An overview of most of the methods used here is given in this review paper on statistical OMICS analysis (subscription only).
import numpy as np
import pandas as pd
import seaborn as sns
import umap
import statsmodels.formula.api as smf
import statsmodels.stats.multitest as smm
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn import metrics
samples_trait_cingulate.csv is derived from the second sheet "anterior cingulate gyrus" of samples trait.xlsx
samples = pd.read_csv("samples_trait_cingulate.csv")
samples.head()
samples["Condition"] = pd.Series(["AD","AD","Control","Control","PD","PD","ADPD","ADPD"]*5)
pd.crosstab(samples.Condition, samples.Sex, margins=True).plot.bar()
samples.boxplot("Age at Death", by="Condition")
cingulate.csv is derived from the second sheet "anterior cingulate gyrus" of TMT_Summary_Data.xlsx with the first two rows removed
proteome = pd.read_csv("cingulate_.csv").set_index("Accession")
proteome_dict = proteome.loc[:,"Gene":"Description"]
proteome = proteome.loc[:,"AD1":"ADPD10"].transpose()
proteome.head()
A single value of 0 means the protein was identified but not successfully quantified; if a whole batch is 0, one of the GIS (global internal standards) was missing and the measurement is therefore considered unreliable (see sheet "Key" in TMT_Summary_Data.xlsx). There are 5 batches of 8 subjects each. In either case we treat these values as NA.
(proteome.isna() | (proteome == 0)).sum().value_counts().plot.bar()
The barplot shows that most proteins with NA values are missing for at least one whole batch (8/16/24/32/40 subjects), so imputation is risky. We therefore try a complete-case analysis first, which retains nearly 60% of all proteins.
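To check this batch structure directly, a sketch; it assumes the 40 samples come in 5 consecutive batches of 8 matching the Condition pattern above, which is an assumption about the column order, not stated in the data itself:
# Sketch (assumption: 5 consecutive batches of 8 samples): per protein,
# count how many whole batches are missing/zero
batch = pd.Series(np.repeat(np.arange(5), 8), index=proteome.index)
missing = proteome.isna() | (proteome == 0)
missing.groupby(batch).all().sum().value_counts().sort_index()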
(proteome>0).all().sum() / proteome.shape[1]
Normalization to mean 0 and standard deviation 1 per column is important for PCA.
proteome = proteome.loc[:, (proteome>0).all()]
proteome = (proteome - proteome.mean()) / proteome.std()
proteome.head()
Distributions look more or less normal and don't seem to need log-transformation.
proteome.iloc[:,0:16].hist()
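As a numeric complement to the histograms, a quick sketch: per-protein skewness near 0 supports the visual impression that no log-transformation is needed.
# Sketch: distribution of per-protein skewness; values near 0 indicate symmetry
proteome.skew().describe()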
pca_model = PCA()
pca = pca_model.fit_transform(proteome)[:,0:2]
print(pca_model.explained_variance_ratio_[0:2])
pd.Series(np.cumsum(pca_model.explained_variance_ratio_)).plot()
In the PCA plot we can already see that AD+ADPD cluster more to the right whereas Control+PD cluster more to the left. However, these first two components capture only roughly 30% of the variance in the data; let's see whether UMAP produces a different picture.
# can't color categorically with pandas alone, so use seaborn
#pd.DataFrame(pca, columns=["PC1", "PC2"]).plot.scatter(x="PC1", y="PC2")
sns.scatterplot(x="PC1", y="PC2", data=pd.DataFrame(pca, columns=["PC1", "PC2"]),
                hue=samples.Condition)
UMAP is a modern dimensionality reduction technique similar to t-SNE; here it embeds the whole proteome into just two dimensions. See more details here: https://github.com/lmcinnes/umap
Again, AD+ADPD seem to cluster together (top-right) whereas Control+PD cluster more toward the bottom-left.
embedding = umap.UMAP().fit_transform(proteome)
sns.scatterplot("UMAP 1", "UMAP 2", data=pd.DataFrame(embedding, columns=["UMAP 1", "UMAP 2"]),
hue=samples.Condition)
PLS can be considered both a dimensionality reduction method and a supervised learning algorithm. Whereas PCA & UMAP are agnostic of the outcome variable, PLS fits components that maximize the covariance between the input space X (the proteome) and the outcome space Y, here a 40x4 binary (dummy) matrix encoding Control, AD, PD & ADPD.
This plot shows fairly good separability of the four classes when the components are trained on the outcome data, but we see that Alzheimer's (AD/ADPD) separates better from controls than Parkinson's alone (PD) does.
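For reference, a quick look at the first rows of the dummy outcome matrix Y described above:
# Sketch: the 40x4 binary outcome matrix passed to PLS below
pd.get_dummies(samples.Condition).head(8)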
pls = PLSRegression(n_components=2).fit(proteome, pd.get_dummies(samples.Condition))
sns.scatterplot("PLS 1", "PLS 2", data=pd.DataFrame(pls.x_scores_, columns=["PLS 1", "PLS 2"]),
hue=samples.Condition)
This can be considered a "PWAS", a Proteome-Wide Association Study. There are two ways of modelling this: linear regression and logistic regression.
Every protein quantity is tested as the dependent variable against the two binary phenotypes Alzheimer's (including AD and ADPD patients) and Parkinson's (including PD and ADPD patients) in two separate univariate models. Results are saved to AD_univariate_linreg.csv and PD_univariate_linreg.csv.
AD = np.array([1,1,0,0,0,0,1,1]*5) # AD, ADPD: True; Control, PD: False
PD = np.array([0,0,0,0,1,1,1,1]*5) # PD, ADPD: True; Control, AD: False
ADuniv = proteome_dict.loc[proteome.columns, :"Gene"].copy()
PDuniv = proteome_dict.loc[proteome.columns, :"Gene"].copy()
# univariate OLS per protein: protein quantity ~ disease status
for protein in proteome.columns:
    lr = smf.ols("Q('" + protein + "') ~ AD", data=proteome).fit()
    ADuniv.loc[protein, "coef"] = lr.params[1]
    ADuniv.loc[protein, "pval"] = lr.pvalues[1]
    lr = smf.ols("Q('" + protein + "') ~ PD", data=proteome).fit()
    PDuniv.loc[protein, "coef"] = lr.params[1]
    PDuniv.loc[protein, "pval"] = lr.pvalues[1]
# Sort outcome by p-value
ADuniv = ADuniv.sort_values("pval")
PDuniv = PDuniv.sort_values("pval")
ADuniv.to_csv("AD_univariate_linreg.csv")
PDuniv.to_csv("PD_univariate_linreg.csv")
ADuniv.head(20)
132 proteins are significant after Bonferroni correction, and an additional 1491 after FDR correction. The volcano plot looks U-shaped because the input matrix is normalized to mean 0 and std 1. See 2_compare_univariate_results.ipynb and 2_compare_univariate_results.html for interactive volcano plots.
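As a sketch of what the two corrections below do: Bonferroni rejects at p < alpha/m, while Benjamini-Hochberg rejects the k smallest p-values for the largest k with p_(k) <= k/m * alpha. Written out by hand (statsmodels' multipletests, used below, implements the same logic):
# Sketch: the two multiple-testing procedures computed by hand
alpha = 0.05
m = len(ADuniv)
print("Bonferroni threshold:", alpha / m)
p_sorted = np.sort(ADuniv.pval.values)
bh = p_sorted <= np.arange(1, m + 1) / m * alpha
print("Benjamini-Hochberg rejections:", bh.nonzero()[0].max() + 1 if bh.any() else 0)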
ADuniv["Significant"] = "False"
ADuniv.loc[smm.multipletests(ADuniv.pval, method="fdr_bh")[0], "Significant"] = "FDR"
ADuniv.loc[smm.multipletests(ADuniv.pval, method="bonferroni")[0], "Significant"] = "Bonferroni"
ADuniv.loc[:,"Significant"].value_counts()
sns.scatterplot(x=ADuniv.coef, y=-np.log10(ADuniv.pval), hue=ADuniv["Significant"], edgecolor="none", alpha=0.3)
Among the Bonferroni-significant top hits are two hallmark proteins of Alzheimer's Disease: Amyloid beta precursor protein (APP, P05067) and Tau (MAPT, P10636).
Another hallmark protein, alpha-synuclein (SNCA, P37840), is not significant, probably because only its NAC fragment is associated with Alzheimer's.
The most significant protein, however, encoded by the SMOC1 gene (Q9H4F8-2), has only recently been associated with Alzheimer's in a proteomic analysis of cerebrospinal fluid.
ADuniv.loc[("Q9H4F8-2", "P05067", "P10636", "P37840"),:]
For performing logistic regression, we need a binary dependent variable. Thus, this time we use AD/PD status as the dependent (outcome) variable and protein quantity as the independent (input) variable in two separate univariate models per protein.
ADuniv_log = proteome_dict.loc[proteome.columns, :"Gene"].copy()
PDuniv_log = proteome_dict.loc[proteome.columns, :"Gene"].copy()
# univariate logistic regression per protein: disease status ~ protein quantity
for protein in proteome.columns:
    AD_logit = smf.logit("AD ~ Q('" + protein + "')", data=proteome).fit(disp=False, method='lbfgs')
    PD_logit = smf.logit("PD ~ Q('" + protein + "')", data=proteome).fit(disp=False, method='lbfgs')
    ADuniv_log.loc[protein, "coef"] = AD_logit.params[1]
    PDuniv_log.loc[protein, "coef"] = PD_logit.params[1]
    ADuniv_log.loc[protein, "pval"] = AD_logit.pvalues[1]
    PDuniv_log.loc[protein, "pval"] = PD_logit.pvalues[1]
# Sort outcome by p-value
ADuniv_log = ADuniv_log.sort_values("pval")
PDuniv_log = PDuniv_log.sort_values("pval")
ADuniv_log.to_csv("AD_univariate_logreg.csv")
PDuniv_log.to_csv("PD_univariate_logreg.csv")
ADuniv_log.head(20)
No proteins are significant after Bonferroni correction, but 1016 after FDR correction. See 2_compare_univariate_results.ipynb and 2_compare_univariate_results.html for interactive volcano plots.
ADuniv_log["Significant"] = "False"
ADuniv_log.loc[smm.multipletests(ADuniv_log.pval, method="fdr_bh")[0], "Significant"] = "FDR"
ADuniv_log.loc[smm.multipletests(ADuniv_log.pval, method="bonferroni")[0], "Significant"] = "Bonferroni"
ADuniv_log.loc[:,"Significant"].value_counts()
sns.scatterplot(x=ADuniv_log.coef, y=-np.log10(ADuniv_log.pval),
                hue=ADuniv_log["Significant"], edgecolor="none", alpha=.3).set(xlim=(-10,10))
As in the linear regression results, the Alzheimer's hallmark proteins Amyloid beta and Tau are highly significant. However, SMOC1 is not significant this time, despite its huge outlier beta-coefficient of 75 (in fact, it is the only outlier not shown in this volcano plot). Closer examination reveals that this is caused by perfect separation, as demonstrated in the plot below: when a predictor perfectly separates the classes, the maximum-likelihood estimate diverges and logistic regression fails to give meaningful p-values.
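A one-line numeric check of the perfect separation (a sketch; it should print True if every AD/ADPD sample indeed has a higher normalized SMOC1 value than every Control/PD sample):
# Sketch: perfect separation means the minimum of one class exceeds the maximum of the other
print(proteome.loc[AD == 1, "Q9H4F8-2"].min() > proteome.loc[AD == 0, "Q9H4F8-2"].max())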
ADuniv_log.loc[("Q9H4F8-2", "P05067", "P10636", "P37840"),:]
sns.stripplot(x=AD, y=proteome["Q9H4F8-2"], hue=AD)
Comparing the numbers of significant variables already suggests that linear regression is more powerful on this dataset: 132 Bonferroni- and 1623 FDR-significant proteins for linear regression vs. 0 Bonferroni- and 1016 FDR-significant proteins for logistic regression (at the 0.05 level).
A more detailed interactive comparison of the two results is in the accompanying files 2_compare_univariate_results.ipynb and 2_compare_univariate_results.html.
We remove SMOC1 (Q9H4F8-2) because of its perfect separation. LASSO selects a subset of variables as predictors by penalizing the L1 norm of the beta-coefficients. Scikit-learn's LogisticRegressionCV employs cross-validation to identify the optimal lambda (or, equivalently, C = 1/lambda) hyperparameter. Below are the selected proteins and their beta-coefficients. The training set contains 30 samples and the test set 10 samples.
proteome_reduced = proteome.drop("Q9H4F8-2", axis=1)
X_train, X_test, y_train, y_test = train_test_split(proteome_reduced, AD, test_size=0.25, random_state=1337)
model = LogisticRegressionCV(cv=10, solver="liblinear", penalty="l1", random_state=1337).fit(X_train, y_train)
list(zip(proteome_reduced.columns[(model.coef_ != 0)[0]],
         model.coef_[model.coef_ != 0]))
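To see what the cross-validation settled on, a quick sketch (C_ is LogisticRegressionCV's array of chosen inverse regularization strengths, one entry per class):
# Sketch: the cross-validated inverse regularization strength (C = 1/lambda)
print("Chosen C:", model.C_[0])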
# train accuracy + confusion matrix
y_pred_class = model.predict(X_train)
print("Training accuracy:", metrics.accuracy_score(y_train, y_pred_class))
print("Training F1 score:", metrics.f1_score(y_train, y_pred_class))
#print("Log loss:", metrics.log_loss(y_train, y_pred_class))
#print(metrics.confusion_matrix(y_train, y_pred_class).transpose())
# test accuracy + confusion matrix
y_pred_class = model.predict(X_test)
print("Test-set accuracy:", metrics.accuracy_score(y_test, y_pred_class))
print("Test-set F1 score:", metrics.f1_score(y_test, y_pred_class))
#print("Log loss:", metrics.log_loss(y_test, y_pred_class))
#print(metrics.confusion_matrix(y_test, y_pred_class).transpose())