Data Exploration Of Features For Outcome Association In Digital Pathology


In the field of digital pathology, a frequent approach for the creation of image-based biomarkers involves extracting features from scanned pathology slides. These features, which are often related to the morphology or spatial distribution of various tissue or cell types, provide valuable insights into the underlying biology of diseases. In cancer research, it is particularly important to examine how these features correlate with clinical outcomes such as overall survival (OS), progression-free survival (PFS), or other binary outcomes (e.g., response to a specific treatment).

Here we release Python code that can be run in a notebook to facilitate this process. It accepts a pandas DataFrame and generates a one-page summary PDF for analyzing individual features and their potential correlation with clinical outcomes.


We present here the different statistical methods used for the analysis, illustrated with random data. The outcome for this example is the 3-year progression-free survival landmark, but the same approach applies to other binary outcomes.

Importantly, these techniques are tools to begin understanding the underlying data, and thus should not constitute the final analysis. Each dataset is unique, and these methods should be seen as a way to initiate its data exploration.


Histogram

Visualizes the distribution of the feature by grouping data into bins and displaying the frequency of observations within each bin. This helps understand the data’s shape, spread, and central tendency. It can also motivate the selection of thresholds and clarify the functional form of the feature distribution (e.g., bimodal, Gaussian, exponential).
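As a minimal sketch of the binning step, the snippet below builds a hypothetical bimodal feature (the data and variable names are assumptions for illustration, not taken from the released code) and prints a text histogram using NumPy's automatic bin selection.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical bimodal feature: two Gaussian sub-populations (illustrative data only)
feature = np.concatenate([rng.normal(0.0, 1.0, 60), rng.normal(5.0, 1.0, 40)])

# bins="auto" lets NumPy pick the number of bins from the data
counts, edges = np.histogram(feature, bins="auto")

# Print one row per bin: the bin range and a bar whose length is the bin count
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f"[{left:6.2f}, {right:6.2f})  {'#' * count}")
```

The same `edges` array can be passed as the `bins` argument of Matplotlib's `plt.hist` to draw the plot itself.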

Violin plots

Visualizes the feature’s distribution across the two outcome groups, making it easy to judge how well the groups separate and to spot outliers.

Mann-Whitney U test

Mann-Whitney p-value: 0.003

This is a non-parametric statistical test comparing the feature values of the two groups (outcome 0 vs 1). A low p-value suggests that values in one group tend to be larger than in the other, with p<0.05 typically considered significant.
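The test can be run with SciPy's `mannwhitneyu`. The sketch below uses simulated feature values for the two outcome groups; the data and variable names are assumptions for illustration.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# Simulated feature values for the two outcome groups (illustrative data only)
feature_outcome0 = rng.normal(0.0, 1.0, 40)
feature_outcome1 = rng.normal(1.0, 1.0, 40)

# Two-sided test: are values in one group systematically larger than in the other?
u_stat, p_value = mannwhitneyu(feature_outcome0, feature_outcome1, alternative="two-sided")
print(f"Mann-Whitney p-value: {p_value:.3f}")
```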

Waterfall plot

Visualizes the feature value of each patient, sorted by value and color-coded to indicate the two different outcomes. This plot helps identify particular patients who deserve additional scrutiny, for example a “blue” patient on the far left or an orange patient on the far right, and prompts the question “why are these patients being sorted to these locations?”

Scaling and dichotomization

Scaling adjusts the range and distribution of feature values. Standardization, used by default here, transforms data to have zero mean and unit variance, while normalization scales data to a fixed range, typically 0 to 1.

Dichotomization converts a continuous variable into a binary one, here using the median to divide the data into two groups. This is necessary for certain visualizations and analyses and can make results easier to interpret. However, while using the median as a threshold is an unbiased approach, it may not be the most effective choice for every dataset, especially if the outcomes themselves are not balanced. A careful examination of the histogram and of the specific characteristics of each dataset is recommended, since other thresholds may yield more insightful or better-performing results.
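The three operations described above can be sketched in a few lines of NumPy. The feature values are made up, and the choice of the population standard deviation and of a strict "greater than median" rule are assumptions; the released code may differ on both points.

```python
import numpy as np

# Made-up feature values (illustrative only)
feature = np.array([2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 11.0, 30.0])

# Standardization: zero mean and unit variance (population standard deviation assumed)
standardized = (feature - feature.mean()) / feature.std()

# Normalization (min-max): rescale to the range 0 to 1
normalized = (feature - feature.min()) / (feature.max() - feature.min())

# Median dichotomization: 1 if strictly above the median, 0 otherwise
threshold = np.median(feature)
high = (feature > threshold).astype(int)

print(threshold, high)
```

Note that both scaling methods are monotonic, so the median split assigns patients to the same groups before and after scaling.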

Kaplan-Meier survival plots

Plots the probability of survival over time for patients, stratified by feature values. It can be applied to PFS or OS. By default, we show these plots for patients separated into two and three groups of equal size. The separation of the curves, their distance apart, and the time points at which they diverge are key aspects to observe. It is also important to note whether the curves cross each other, as this suggests that the proportional hazards assumption may be violated (see below).
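To make explicit what each curve represents, here is a small hand-written version of the Kaplan-Meier (product-limit) estimator applied to a median split. This is an illustrative sketch on toy data, not the released implementation; in practice the Lifelines library provides a fitter for this.

```python
import numpy as np

def kaplan_meier(time, event):
    """Return the distinct event times and the survival estimate just after each."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    event_times = np.unique(time[event])
    survival, s = [], 1.0
    for t in event_times:
        at_risk = np.sum(time >= t)              # patients still under observation at t
        n_events = np.sum((time == t) & event)   # events occurring exactly at t
        s *= 1.0 - n_events / at_risk            # product-limit update
        survival.append(s)
    return event_times, np.array(survival)

# Toy data: follow-up in months, event=0 means censored (illustrative only)
time = np.array([2, 3, 3, 5, 8, 8, 12, 15])
event = np.array([1, 1, 0, 1, 1, 0, 1, 0])
feature = np.array([0.1, 0.4, 0.2, 0.3, 0.9, 0.6, 0.8, 0.7])

# Stratify by the median of the feature and estimate one curve per group
high = feature > np.median(feature)
curves = {
    "low": kaplan_meier(time[~high], event[~high]),
    "high": kaplan_meier(time[high], event[high]),
}
for label, (t, s) in curves.items():
    print(label, t, np.round(s, 3))
```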

Logrank test

logrank OS p-value: <0.001

Hypothesis test to compare the survival distributions of the two groups dichotomized around the threshold value. A low p-value (p<0.05 typically) suggests that there is a statistically significant difference in survival between the two groups.
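For readers curious about what the test computes, the sketch below implements the two-group logrank statistic by hand: at each event time it compares the observed number of events in one group with the number expected if both groups shared the same hazard. The function name and toy data are assumptions for illustration; Lifelines offers a ready-made version.

```python
import numpy as np
from scipy.stats import chi2

def logrank_p_value(time, event, group):
    """Two-group logrank test; `group` is a boolean flag (e.g. feature above threshold)."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    group = np.asarray(group, dtype=bool)
    observed_minus_expected, variance = 0.0, 0.0
    for t in np.unique(time[event]):
        at_risk = time >= t
        n = at_risk.sum()                          # patients at risk in both groups
        n1 = (at_risk & group).sum()               # patients at risk in the flagged group
        d = ((time == t) & event).sum()            # events at t in both groups
        d1 = ((time == t) & event & group).sum()   # events at t in the flagged group
        observed_minus_expected += d1 - d * n1 / n
        if n > 1:  # the variance term is undefined when a single patient remains
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    statistic = observed_minus_expected ** 2 / variance
    return chi2.sf(statistic, df=1)

# Toy data: the "high" group has all the late events (illustrative only)
time = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
event = [1] * 10
high = [0] * 5 + [1] * 5
p = logrank_p_value(time, event, high)
print(f"logrank p-value: {p:.4f}")
```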

Cox proportional hazards analysis

Cox PFS HR: 0.207 (0.078-0.541)
Cox PFS p-value: 0.002
Cox OS ph assumption p-value: 0.178

Statistical technique to investigate the relationship between patient survival and one or more predictors (here, the feature). The hazard ratio (HR) quantifies the strength of the association, and the p-value assesses its statistical significance. The model assumes that the hazard ratio is constant over time (the proportional hazards assumption). A separate statistical test checks this assumption: a significant p-value suggests that it is violated.

Forest plot

Displays the results of the univariable Cox proportional hazards analyses for the feature, showing hazard ratios and their confidence intervals. An HR greater than 1 indicates increased risk, while a value less than 1 suggests a protective effect of the feature. Watch for cases where the 95% confidence interval (CI) crosses the dashed line at HR = 1: an HR of 1 means the hazard is the same in both groups, so an interval that includes 1 is consistent with no difference between them.

Prediction using logistic regression

AUC logistic regression LOOCV: 0.736

Applies logistic regression for binary classification (outcome) and assesses performance using the area under the receiver operating characteristic curve (AUC). Leave-one-out cross-validation (LOOCV) is used for model validation. This approach provides an evaluation of the feature’s predictive ability. The AUC ranges from 0 to 1, with higher values indicating better model performance. An AUC of 0.5 suggests no discriminative ability, while an AUC close to 1 indicates that the model separates the two outcome groups almost perfectly.
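One way to compute such a score with Scikit-learn is sketched below: each patient is predicted by a model trained on all the others, and a single AUC is computed from the pooled held-out probabilities. The simulated data and the pooling strategy are assumptions; the released code may organize this step differently.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(0)
n = 60

# Simulated binary outcome and a feature shifted by one unit in outcome group 1
outcome = np.repeat([0, 1], n // 2)
feature = rng.normal(loc=outcome * 1.0, scale=1.0).reshape(-1, 1)

# Each patient is predicted by a model trained on the remaining n - 1 patients
probabilities = cross_val_predict(
    LogisticRegression(), feature, outcome, cv=LeaveOneOut(), method="predict_proba"
)[:, 1]

# A single AUC computed from the pooled held-out predictions
auc = roc_auc_score(outcome, probabilities)
print(f"AUC logistic regression LOOCV: {auc:.3f}")
```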


Ensure you have the necessary Python libraries installed: Seaborn, NumPy, Pandas, Matplotlib, SciPy, Lifelines, Scikit-learn, and FPDF. These libraries provide the backbone for data manipulation, statistical analysis, and visualization.

The analyze_feature function conducts the various statistical analyses and visualizations described above on a specified feature in a dataframe and consolidates all results into a PDF file.


dataframe: The dataframe containing data for the analysis. Must include the columns ‘os_months’, ‘os_event’, ‘pfs_months’, and ‘pfs_event’, in addition to the feature and outcome columns.

feature: The name of the feature (column in the dataframe) to analyze.

save_dir_path: The file path to save the resulting plots and PDF file.

scaling_method: Method for scaling the feature (standardize, normalize or None).

outcome: The outcome variable for analysis.
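As a hedged illustration of the expected input, the snippet below builds a random dataframe with the required columns. The feature name `my_feature`, the outcome name `pfs_3y`, and the output directory are hypothetical. The final call is left commented out because the import path of `analyze_feature` is not given here, and the argument values shown are assumptions based on the parameter list above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100

# Random data with the columns listed above (illustrative only)
df = pd.DataFrame({
    "my_feature": rng.normal(size=n),            # hypothetical feature column
    "os_months": rng.exponential(36.0, size=n),
    "os_event": rng.integers(0, 2, size=n),
    "pfs_months": rng.exponential(24.0, size=n),
    "pfs_event": rng.integers(0, 2, size=n),
    "pfs_3y": rng.integers(0, 2, size=n),        # hypothetical binary outcome column
})

# Check that every required column is present before running the analysis
required = {"os_months", "os_event", "pfs_months", "pfs_event", "my_feature", "pfs_3y"}
missing = required - set(df.columns)
assert not missing, f"Missing columns: {missing}"

# Assumed call, using the parameter names listed above:
# analyze_feature(dataframe=df, feature="my_feature", save_dir_path="results/",
#                 scaling_method="standardize", outcome="pfs_3y")
```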

For a practical demonstration, see the notebook example_notebook.ipynb.

Note on author and contributions:
Jonatan Bonjour works in the field of digital pathology in Geneva. This article was written as a combined effort by Jonatan Bonjour and Andrew Janowczyk. JB wrote the post. AJ consulted on content and provided feedback during writing.
