C#

Using Survival Analysis

Zvi Topol

Survival analysis (SA) is a discipline of statistics that focuses on estimating time to events. You would typically apply survival analysis methods to clinical studies to help determine the effectiveness of certain drugs (time to patient death), reliability of software systems (time to failure) and credit analytics (time to loan default).

Pharmaceutical clinical studies involving two groups of patients are an excellent example of how this can work. The control group members are administered a placebo. The test group members receive the experimental drug targeting the disease. Survival analysis methods are then applied to determine whether there’s a statistically significant difference in patient survival between the two groups. The “time to” event in this case is the time from the beginning of the study until patients die.

To expose you to using SA, I’ll cover basic concepts along with a C# implementation of a commonly used estimation method called the Kaplan-Meier estimator. I’ll use a real-world example of estimating the survival probability of mobile applications.

Imagine a software development firm that produces two mobile applications, X and Y, each developed by a separate team. The firm is eager to learn how robust the applications are and to determine whether one is significantly less robust and requires more effort to improve its reliability.

At any given moment, there can be many instances of X and Y alive and running on customer mobile devices, so the event of interest here is a mobile application crash. Longer times to the event indicate the application is more robust, or has better survivability.

In the demo program, I’ll first display the survival data for mobile applications X and Y (see Figure 1). The data shows both applications were run by 10 different users with IDs ranging from zero to nine. In my example, an application can either crash (described by event = app crash in the screenshot) or get closed by the user (described by event = app off). The day on which the event occurs is also recorded.

Figure 1 The Survival Analysis Demo Showing Lifecycle of Mobile Apps

Basic Concepts of SA

The most basic concept in SA is that of the survival function, commonly denoted by S(t). If X is a random variable (a variable whose values are outcomes based on chance) representing the event time, then S(t) = Pr(X > t). In other words, S(t) is the probability of survival past time t, and S(0) is defined to be 1. The survival function is related to the lifetime distribution function, typically denoted by F(t) and defined as F(t) = Pr(X <= t), the probability that the event has happened by time t. Therefore, F(t) = 1 – S(t). The event density function f(t) is then defined to be dF(t)/dt, the first derivative of F(t), if F(t) is differentiable. Therefore, f(t) can be thought of as the rate of the event (per unit of time).

The hazard function, another basic concept, is equal to f(t)/S(t) and is the event rate at time t for individuals who are alive at time t.

You can specify survival functions parametrically, using an explicit function or a family of functions. You can also infer them non-parametrically from existing data, without having a parametrized closed form. A semi-parametric specification, which is a mix between parametric and non-parametric specification, is also possible. The exponential distribution is a simple and popular parametrized function family to describe survival functions due to its appealing mathematical properties.

For example, S(t) = exp(-0.05t) is a survival function from a parametrized exponential distribution, plotted in Figure 2. Survival functions of the form S(t) = exp(-at), where a is a parameter controlling the hazard rate, describe that distribution family. The lifetime distribution function is given by F(t) = 1 – S(t) = 1 – exp(-at). Figure 2 helps us visualize how survival functions behave over time.

Figure 2 How Survival Functions Behave over Time
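As a concrete check on these definitions, the exponential example can be coded directly. This is a small illustrative sketch (the class and method names are mine, and the rate a = 0.05 is taken from the example function); it shows that S(0) = 1 and that the hazard f(t)/S(t) stays constant at a for the exponential family:

```csharp
using System;

public class ExponentialSurvival
{
    public const double A = 0.05; // hazard-rate parameter from the example S(t) = exp(-0.05t)

    // S(t) = Pr(X > t), the survival function
    public static double Survival(double t) => Math.Exp(-A * t);

    // F(t) = Pr(X <= t) = 1 - S(t), the lifetime distribution function
    public static double Lifetime(double t) => 1.0 - Survival(t);

    // f(t) = dF/dt = a * exp(-a t), the event density
    public static double Density(double t) => A * Math.Exp(-A * t);

    public static void Main()
    {
        foreach (var t in new[] { 0.0, 10.0, 20.0, 40.0 })
            Console.WriteLine(
                $"t={t,4}: S={Survival(t):F3}  F={Lifetime(t):F3}  hazard={Density(t) / Survival(t):F3}");
    }
}
```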

Working with a given parametric model, you can use actual data to estimate the model’s parameters. In the case of exponential distribution, it’s the parameter a. One way to do so is to use Maximum Likelihood Estimation (MLE) methods, but that’s another subject entirely.

I’ll focus on implementing a non-parametric estimate for the survival function. That is, I won’t set a predefined model for the survival function and estimate the model parameters. Instead, I’ll derive the survival function directly from the data observed. Before I describe how to do that, I have to explain another important concept of SA called censoring.

Censoring occurs when some observations in the dataset are incomplete. At some point, you’ve lost track of the item observed. In my example, this would mean a mobile application ended its execution without crashing (throwing a fatal exception). The application was gracefully closed by the user. Although there can be other reasons an application ended without crashing, I’ll assume an application either crashes or gets closed by the user.

There are two main flavors of censoring: right censoring and left censoring. Right censoring occurs when the start time is known but the event time is missing. Left censoring occurs when the event time is present but the start time is missing. My example involves right censoring.

Using the Kaplan-Meier Estimator to Estimate the Survival Function

The Kaplan-Meier (KM) estimator is a non-parametric algorithm that estimates the survival function. Deriving the KM estimator entails the use of advanced math, including martingale theory and counting processes, and is beyond the scope of this article. Implementing the KM estimator, however, is straightforward and is based on counts.

Consider computing the KM estimator for the survival of mobile application X. The KM estimator needs to keep track of three different counts:

  1. How many instances of mobile application X are still up and running. This is represented using the variable atRisk in my implementation.
  2. The number of instances that have crashed. This is tracked in the crashed variable.
  3. The number of instances that finished execution gracefully. These are counted using the variable censored.
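The demo encodes each observation in a CrashMetaData class. Its definition isn’t shown in the article; a minimal sketch consistent with the fields used (UserID, CrashTime, Crashed) might look like this:

```csharp
// A minimal sketch of the CrashMetaData class, inferred from its usage;
// the article's actual definition may differ.
public class CrashMetaData
{
    public int UserID { get; set; }     // identifies the application instance
    public int CrashTime { get; set; }  // day on which the event occurred
    public bool Crashed { get; set; }   // true = app crash; false = censored (app off)
}
```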

The following lines of code (for mobile application X) use the CrashMetaData class to encode the survival data represented in Figure 3:

var appX = new CrashMetaData[] {
  new CrashMetaData { UserID = 0, CrashTime = 1,  Crashed = false },
  new CrashMetaData { UserID = 1, CrashTime = 5,  Crashed = true  },
  new CrashMetaData { UserID = 2, CrashTime = 5,  Crashed = false },
  new CrashMetaData { UserID = 3, CrashTime = 8,  Crashed = false },
  new CrashMetaData { UserID = 4, CrashTime = 10, Crashed = false },
  new CrashMetaData { UserID = 5, CrashTime = 12, Crashed = true  },
  new CrashMetaData { UserID = 6, CrashTime = 15, Crashed = false },
  new CrashMetaData { UserID = 7, CrashTime = 18, Crashed = true  },
  new CrashMetaData { UserID = 8, CrashTime = 21, Crashed = false },
  new CrashMetaData { UserID = 9, CrashTime = 22, Crashed = true  }
};

Figure 3 Survival Data of Mobile Application X

UserID | Days | Crashed | Censored
   0   |   1  |         |    X
   1   |   5  |    X    |
   2   |   5  |         |    X
   3   |   8  |         |    X
   4   |  10  |         |    X
   5   |  12  |    X    |
   6   |  15  |         |    X
   7   |  18  |    X    |
   8   |  21  |         |    X
   9   |  22  |    X    |

The survival data contains the event time in days (encoded by CrashTime) and information about whether the event was an application crash or censoring. If Crashed is true, the application crashed; otherwise, the application closed gracefully (in other words, was censored). Additionally, the UserID field tracks the instance of the application.

The KM estimator is implemented in the EstimateKaplanMeier method. It partitions the data into non-overlapping time intervals based on the times to events (in my example, application crashes) and keeps track of the counts in each interval.

It’s important to note that the count of how many applications are still up and running is taken just before the event (this is due to the mathematical formulation of counting processes). So in the first interval in my example, which covers days 0 to 5, nine out of the 10 instances were up and running just before day 5 (one instance finished running on day 1). In the interval up to and including day 5, there was one crash (which defines the interval) and two instances finished (on days 1 and 5). See Figure 4.

Figure 4 Day Intervals Created by the KM Estimator

The KM estimate for the survival function is then the product, over all the intervals up to a given time, of the per-interval survival derived from the counts in the partitions:

1 – (crashed in interval) / (at risk just before the end of the interval)
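The counting logic can be sketched as follows. This is not the article’s EstimateKaplanMeier implementation, but a minimal stand-alone version of the same product over event times, using parallel arrays for the event days and crash flags of application X (instances censored on an event day count as at risk just before that day):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class KaplanMeierSketch
{
    // Event days and crash flags for application X (from Figure 3)
    public static readonly int[] Days = { 1, 5, 5, 8, 10, 12, 15, 18, 21, 22 };
    public static readonly bool[] Crashed =
        { false, true, false, false, false, true, false, true, false, true };

    // Returns the KM survival estimate at each distinct crash time
    public static (int Day, double Survival)[] Estimate(int[] days, bool[] crashed)
    {
        var curve = new List<(int, double)>();
        double s = 1.0; // S(0) = 1
        foreach (var t in days.Where((d, i) => crashed[i]).Distinct().OrderBy(d => d))
        {
            // At risk just before the event: every instance whose event time
            // is >= t, including instances censored on day t
            int atRisk = days.Count(d => d >= t);
            int crashes = days.Where((d, i) => d == t && crashed[i]).Count();
            s *= 1.0 - (double)crashes / atRisk;
            curve.Add((t, s));
        }
        return curve.ToArray();
    }

    public static void Main()
    {
        foreach (var (day, s) in Estimate(Days, Crashed))
            Console.WriteLine($"S({day}) = {s:F3}");
    }
}
```

For application X this produces the steps S(5) ≈ 0.889, S(12) ≈ 0.711, S(18) ≈ 0.474 and S(22) = 0.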

The EstimateKaplanMeier method returns an object of the SurvivalCurve class, which represents the estimated survival function. The output is a step function; each step is the value of the survival function in a corresponding interval (as estimated by the KM estimator). Figure 5 includes part of the Survival Analysis demo program output corresponding to the SurvivalCurve object (for both applications X and Y).

Figure 5 Survival Analysis Demo Output for KM Estimates for Applications X and Y

Figure 6 includes a plot of the step survival function estimated for mobile application X. In the plot, short vertical lines in each step denote multiple occurrences of the crash event during the interval corresponding to the step.

Figure 6 KM Estimate of the Survival Function for Mobile Application X

You can then use the estimate to infer the median survival time, the time by which half of the instances are still alive. This should occur at some point between day 12 (where the survival probability estimate is 0.711 > 0.5) and day 18 (where it is 0.474 < 0.5). There are a few approaches in the SA literature describing how to compute this quantity exactly, because it typically falls between two steps.

I’ll define the median survival time as the minimal survival time for which the survival function falls below 0.5, which for mobile application X results in a median survival time of 18 days. The interpretation is that by day 18, half of the instances of mobile application X have crashed and half are still up and running. My implementation computes this quantity in the GetMedianSurvivalTime method.
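Under that definition, the median is simply the first step at which the estimated survival drops below 0.5. A small sketch (not the article’s GetMedianSurvivalTime; the step values hard-coded here are the KM estimates for application X, approximately 0.889, 0.711, 0.474 and 0):

```csharp
using System;
using System.Linq;

public class MedianSurvival
{
    // KM step function for application X: (day, estimated survival probability)
    public static readonly (int Day, double Survival)[] Curve =
        { (5, 0.889), (12, 0.711), (18, 0.474), (22, 0.0) };

    // Minimal survival time at which the survival function falls below 0.5
    public static int MedianSurvivalTime((int Day, double Survival)[] curve) =>
        curve.First(p => p.Survival < 0.5).Day;

    public static void Main() =>
        Console.WriteLine($"Median survival time: {MedianSurvivalTime(Curve)} days"); // prints 18
}
```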

Another question you can answer using the KM estimates is whether there’s a difference in the survivability of two (or more) applications. One way to approach this is to plot the KM estimates for each application side by side. Figure 7 shows such a plot, comparing the estimated survival functions of applications X and Y.

Figure 7 KM Estimates for Mobile Applications X and Y

The green curve represents the survival function of application X and the blue curve represents the survival function of application Y.

From the plot, you can see the survival function of application X lies above that of application Y throughout. Therefore, you can infer application X has better survivability than application Y and, thus, is more robust.

While visualizing survival functions may help in determining survivability differences, some cases are not as clear-cut. Fortunately, there’s a statistical approach to test for such differences in a formal and rigorous way, called the Log Rank Test. This is an algorithm that tests whether there’s a significant difference between two (or more) survival distributions in a non-parametric way. The SA literature includes a detailed discussion about this and most SA statistical libraries include Log Rank test implementations.
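To make the idea concrete, here’s a sketch of the standard two-sample log-rank statistic. This is not the article’s code; the Obs record and ChiSquare method are my own names, and a production analysis should rely on a vetted statistical library. As a sanity check, it compares application X against an identical copy of itself, where the statistic is exactly zero:

```csharp
using System;
using System.Linq;

public class LogRankSketch
{
    // One observation: event day, and whether the event was a crash (vs. censoring)
    public record Obs(int Day, bool Crashed);

    // Two-sample log-rank chi-square statistic (one degree of freedom)
    public static double ChiSquare(Obs[] g1, Obs[] g2)
    {
        var all = g1.Concat(g2).ToArray();
        double observed = 0, expected = 0, variance = 0;
        foreach (var t in all.Where(o => o.Crashed).Select(o => o.Day).Distinct().OrderBy(d => d))
        {
            int n1 = g1.Count(o => o.Day >= t);               // group-1 at risk just before t
            int n  = all.Count(o => o.Day >= t);              // total at risk just before t
            int d  = all.Count(o => o.Day == t && o.Crashed); // total crashes at t
            int d1 = g1.Count(o => o.Day == t && o.Crashed);  // group-1 crashes at t
            observed += d1;
            expected += (double)d * n1 / n;
            if (n > 1) // hypergeometric variance of d1 at this event time
                variance += (double)d * n1 / n * (1.0 - (double)n1 / n) * (n - d) / (n - 1);
        }
        double z = observed - expected;
        return variance > 0 ? z * z / variance : 0.0;
    }

    public static void Main()
    {
        var x = new[]
        {
            new Obs(1, false),  new Obs(5, true),  new Obs(5, false),  new Obs(8, false),
            new Obs(10, false), new Obs(12, true), new Obs(15, false), new Obs(18, true),
            new Obs(21, false), new Obs(22, true)
        };
        // Comparing a group against an identical copy should show no difference
        Console.WriteLine($"chi-square (X vs. X) = {ChiSquare(x, x):F3}");
    }
}
```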

It’s worth noting there’s another popular algorithm for estimating the survival function non-parametrically, called the Nelson-Aalen (NA) estimator. The NA estimator estimates the cumulative hazard function from survival data. You can then derive the survival function from this estimate using a mathematical formula that ties it to the cumulative hazard function. You can find more details about this estimator in the SA literature.
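A minimal sketch of the NA idea, using the same X data and at-risk convention as before (the names here are mine, not from the article): the cumulative hazard H(t) is the running sum of (events)/(at risk) over the event times, and exp(-H(t)) then yields a survival estimate.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class NelsonAalenSketch
{
    // Event days and crash flags for application X (from Figure 3)
    public static readonly int[] Days = { 1, 5, 5, 8, 10, 12, 15, 18, 21, 22 };
    public static readonly bool[] Crashed =
        { false, true, false, false, false, true, false, true, false, true };

    // Cumulative hazard H(t): running sum of (events at t) / (at risk just before t)
    public static (int Day, double H)[] CumulativeHazard(int[] days, bool[] crashed)
    {
        var result = new List<(int, double)>();
        double h = 0.0;
        foreach (var t in days.Where((d, i) => crashed[i]).Distinct().OrderBy(d => d))
        {
            int atRisk = days.Count(d => d >= t);
            int events = days.Where((d, i) => d == t && crashed[i]).Count();
            h += (double)events / atRisk; // Nelson-Aalen increment
            result.Add((t, h));
        }
        return result.ToArray();
    }

    public static void Main()
    {
        foreach (var (day, h) in CumulativeHazard(Days, Crashed))
            // Survival derived from the cumulative hazard: S(t) = exp(-H(t))
            Console.WriteLine($"day {day,2}: H = {h:F3}, S = exp(-H) = {Math.Exp(-h):F3}");
    }
}
```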

Wrapping Up

I’ve introduced basic concepts and terminology from the statistical branch of survival analysis. I showed how to implement the non-parametric Kaplan-Meier estimator and applied it to an example comparing the robustness of mobile applications. This estimator can help determine whether there’s a difference in the survivability of the two applications. I also mentioned a rigorous statistical test to check for differences, called the Log Rank test. Another quantity I derived using the KM estimator is the median survival time, which also points to survivability differences between applications X and Y.

Finally, I mentioned the Nelson-Aalen estimator as an alternative non-parametric method for estimating the survival function. It doesn’t directly estimate the survival function like the KM estimator, but rather estimates the cumulative hazard function. You can then derive the survival function from the cumulative hazard function.

This only scratches the surface of the rich field of SA, whose applications span areas from medicine to engineering and whose methods and algorithms are implemented in many statistical packages. With the proliferation of mobile applications and Software as a Service enterprise deployments, I anticipate SA methods can play a role in monitoring and improving the quality of such deployments.


Zvi Topol works as a senior scientist in marketing analytics in New York City. He designs and applies non-linear large-scale optimization algorithms and statistical methods to improve marketing planning for big Fortune 500 companies.

Thanks to the following technical expert for reviewing this article: Dr. James McCaffrey (Microsoft Research)

 
