Brain proteome in Alzheimer's Disease (AD)

Comparison of univariate linear and logistic regression

By Tim Woelfle, 27/08/2019

This is an exploratory analysis of the openly available brain proteome dataset by Ping et al., described in their 2018 article: Ping, L., Duong, D.M., Yin, L., Gearing, M., et al. (2018) Global quantitative analysis of the human brain proteome in Alzheimer’s and Parkinson’s Disease. Scientific Data. [Online] 5, 180036. Available from: doi:10.1038/sdata.2018.36.

See 1_analysis.ipynb / 1_analysis.html for an overview of the dataset and the generation of AD_univariate_linreg.csv and AD_univariate_logreg.csv.

This interactive notebook visualizes the differences between univariate linear- and logistic-regression results by comparing their respective beta-coefficients and p-values. The top row of the plots below shows the individual volcano plots, which look narrow because each column of the input matrix (proteomic quantities) has been normalized to mean 0 and standard deviation 1.
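As a minimal sketch of the column-wise normalization mentioned above, the snippet below z-scores a toy samples-by-proteins matrix. The function name `zscore_columns`, the toy values, and the use of the population standard deviation (`ddof=0`) are assumptions for illustration; the actual preprocessing lives in 1_analysis.ipynb and may differ in detail.

```python
import numpy as np

def zscore_columns(matrix):
    """Scale every column to mean 0 and standard deviation 1 (ddof=0, an assumption)."""
    matrix = np.asarray(matrix, dtype=float)
    return (matrix - matrix.mean(axis=0)) / matrix.std(axis=0)

# Toy matrix: 4 samples (rows) x 2 proteins (columns), made-up quantities
raw = np.array([[10.0, 200.0],
                [12.0, 260.0],
                [14.0, 230.0],
                [16.0, 310.0]])
scaled = zscore_columns(raw)

print(scaled.mean(axis=0))  # column means, numerically close to 0
print(scaled.std(axis=0))   # column standard deviations, numerically close to 1
```

After this scaling every beta-coefficient is expressed per standard deviation of the protein quantity, which puts all proteins on one comparable x-axis.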

The color coding indicates FDR-corrected significance at the 0.05 level: blue means significant in both sets, yellow means significant in linear regression only, red means significant in logistic regression only, and green means significant in neither. Linear regression appears to be more powerful for this dataset, identifying many more proteins as significant than logistic regression.
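The four color categories can be sketched as follows. The notebook itself obtains the significance flags from statsmodels (`multipletests` with `method="fdr_bh"`); the NumPy reimplementation of the Benjamini-Hochberg step-up procedure below, the helper names `benjamini_hochberg` and `crosstab_labels`, and all p-values are illustrative assumptions, not the author's code.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return True where the Benjamini-Hochberg adjusted p-value is <= alpha."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale each sorted p-value by m / rank
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value
    adjusted = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.clip(adjusted, 0.0, 1.0)
    reject = np.empty(m, dtype=bool)
    reject[order] = adjusted <= alpha
    return reject

def crosstab_labels(sig_a, sig_b):
    """Combine two boolean arrays into the labels '+|+', '+|-', '-|+' and '-|-'."""
    sig_a = np.asarray(sig_a, dtype=bool)
    sig_b = np.asarray(sig_b, dtype=bool)
    return np.where(sig_a,
                    np.where(sig_b, "+|+", "+|-"),
                    np.where(sig_b, "-|+", "-|-"))

# Made-up p-values: 0.04 is below 0.05 but does not survive the correction
pvals = np.array([0.2, 0.001, 0.5, 0.04, 0.01])
significant = benjamini_hochberg(pvals)
print(significant.tolist())  # [False, True, False, False, True]

labels = crosstab_labels([True, True, False, False], [True, False, True, False])
print(labels.tolist())  # ['+|+', '+|-', '-|+', '-|-']
```

The first symbol of each label refers to the first result set (here linear regression) and the second to the other set, so "+|-" corresponds to "significant in linear regression only".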

Clicking any protein highlights it in all plots. Drag to pan and use Ctrl+mouse wheel to zoom. The interactivity only works if Altair is set up correctly; GitHub shows only a static preview image. If the interactivity doesn't work, try 2_compare_univariate_results.html.

In [2]:
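# nbconvert template for Altair

# Code toggle:
from IPython.display import HTML

HTML('''<script>
code_show = true;
function code_toggle() {
 // Hide or show the notebook's input cells (selector assumed for the classic nbconvert HTML layout)
 if (code_show){
  $('div.input').hide();
 } else {
  $('div.input').show();
 }
 code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
The raw code for this IPython notebook is by default hidden for easier reading.
To toggle on/off the raw code, click <a href="javascript:code_toggle()">here</a>.''')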
In [2]:
import numpy as np
import pandas as pd
import altair as alt
import statsmodels.stats.multitest as smm

#alt.renderers.enable('notebook') # Uncomment this when using Jupyter Notebook instead of Jupyter Lab

# Altair limits datasets to 5000 rows by default; lift the limit so that all proteins can be plotted
alt.data_transformers.enable('default', max_rows=None)

# Expects df1 and df2 to be joinable through their index and to have three columns each: x-axis (e.g. beta-coefficients), y-axis (e.g. p-values) and boolean indicators for cross-tabulation (e.g. significance)
def compareDataframesPlot(df1, df2, df1_name, df2_name):
    df = df1.join(df2, lsuffix=" " + df1_name, rsuffix=" " + df2_name)
    df = df.reset_index()
    # After reset_index: column 0 is the index, columns 1-3 come from df1 and columns 4-6 from df2
    bool1 = df.iloc[:,3]
    bool2 = df.iloc[:,6]
    # Cross-tabulate the two boolean indicators: first symbol is df1, second is df2
    df.loc[(bool1==bool2) &  bool1, "crosstab"] = "+|+"
    df.loc[(bool1==bool2) & ~bool1, "crosstab"] = "-|-"
    df.loc[(bool1!=bool2) &  bool1, "crosstab"] = "+|-"
    df.loc[(bool1!=bool2) & ~bool1, "crosstab"] = "-|+"
    # Clicking a point selects the row with the same index value in every plot
    single_selection = alt.selection_single(empty="none", fields=["index"])
    confusionMatrix = alt.Chart(df).mark_text().encode(
        x = alt.X(df.columns[3], axis=alt.Axis(orient='top')),
        y = df.columns[6],
        color = alt.Color('crosstab', legend=None),
        text = "count()"
    )
    # cx is the column index for the x-axis, cy for the y-axis
    def scatterPlot(cx, cy):
        # Assumed reconstruction: drag to pan and Ctrl+mouse wheel to zoom, matching the usage notes above
        pan_zoom = alt.selection_interval(bind="scales", zoom="wheel![event.ctrlKey]")
        return alt.Chart(df).mark_point().encode(
            x = alt.X(df.columns[cx], scale=alt.Scale(clamp=True)),
            y = alt.Y(df.columns[cy], scale=alt.Scale(clamp=True)),
            color = alt.Color('crosstab', legend=None),
            size = alt.condition(single_selection, alt.value(400), alt.value(20)),
            opacity = alt.condition(single_selection, alt.value(1), alt.value(.5)),
            tooltip = list(df)
        ).add_selection(single_selection, pan_zoom)
    # Layout: confusion matrix on top, then the two volcano plots, then beta vs. beta and p-value vs. p-value
    return (confusionMatrix &
            (scatterPlot(1,2) | scatterPlot(4,5)) &
            (scatterPlot(1,4) | scatterPlot(2,5)))

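# Loads a regression result CSV and returns beta-coefficients, -log10 p-values and FDR significance, indexed by "Accession (Gene)"
def prepareRegResults(path):
    df = pd.read_csv(path)
    df = df.set_index(df["Accession"] + " (" + df["Gene"] + ")").loc[:, ["coef", "pval"]].copy()
    # Benjamini-Hochberg FDR correction; element [0] is the boolean reject array (alpha defaults to 0.05)
    df["signif"] = smm.multipletests(df.pval, method="fdr_bh")[0]
    df["pval"] = -np.log10(df["pval"])
    df.columns = ["beta-coef", "-log10 p-val", "FDR"]
    return df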

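# File names as given in the introduction; adjust the paths if the CSVs live elsewhere
compareDataframesPlot(
    prepareRegResults("AD_univariate_linreg.csv"),
    prepareRegResults("AD_univariate_logreg.csv"),
    "lin reg",
    "log reg")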
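The plotting helper addresses columns by position (`df.iloc[:,3]`, `df.iloc[:,6]`, `scatterPlot(1,2)` and so on), which depends on the layout produced by the join. The sketch below shows that layout on two toy tables; the index labels and all values are made up for illustration, and only the column names and suffixes follow the code above.

```python
import pandas as pd

# Toy stand-ins for the two prepared result tables (values are made up)
cols = ["beta-coef", "-log10 p-val", "FDR"]
index = ["P1 (GENE1)", "P2 (GENE2)"]
lin = pd.DataFrame([[0.8, 4.2, True], [-0.1, 0.3, False]], index=index, columns=cols)
log = pd.DataFrame([[1.5, 2.9, True], [-0.2, 0.2, False]], index=index, columns=cols)

# Same join as in compareDataframesPlot: the suffixes disambiguate the shared column names
joined = lin.join(log, lsuffix=" lin reg", rsuffix=" log reg").reset_index()

for position, name in enumerate(joined.columns):
    print(position, name)
# 0 index
# 1 beta-coef lin reg
# 2 -log10 p-val lin reg
# 3 FDR lin reg
# 4 beta-coef log reg
# 5 -log10 p-val log reg
# 6 FDR log reg
```

Positions 3 and 6 therefore hold the two significance flags, (1, 2) and (4, 5) are the two volcano plots, and (1, 4) and (2, 5) compare the beta-coefficients and the p-values across methods.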