Tools for survival analysis

Survival analysis has long been established in academia, but many data scientists today focus more on neural networks, decision trees, random forests, grid search, KNN and the like. Even so, survival analysis has gained popularity in a range of applications. It is the right tool whenever an analyst is interested not only in whether an event occurs but in when it occurs. In credit risk modeling, for example, it is of particular interest when a loan defaults and, in turn, how default probabilities change over time. In this context, survival analysis can help answer questions like: given that a loan has been alive for a specific time t, what is the probability of default at a later time x? Likewise, when studying customer churn rates, survival analysis can give insights that most other models do not provide. Finally, models like the one proposed by Cox deal with censored observations in an effective manner.
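To make the loan question concrete, here is a minimal sketch of a Kaplan-Meier estimate written in plain Python. The loan durations and default flags are made up for illustration, and the helper names `kaplan_meier` and `survival_at` are my own, not from any library. I also read "default at time x" as "default no later than x", which is an assumption. The conditional probability is then 1 - S(x) / S(t), where S is the estimated survival function.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(time, S(time))] for each distinct time with an observed event."""
    events = Counter(t for t, d in zip(durations, observed) if d)
    exits = Counter(durations)  # events and censorings both leave the risk set
    at_risk = len(durations)
    surv = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d:
            surv *= 1.0 - d / at_risk
            curve.append((t, surv))
        at_risk -= exits[t]
    return curve


def survival_at(curve, t):
    """Step-function lookup: the last estimate at or before t (1.0 before any event)."""
    s = 1.0
    for time, value in curve:
        if time > t:
            break
        s = value
    return s


# Hypothetical loans: months under observation and whether a default
# was observed (1) or the loan was censored, e.g. still running (0).
months = [3, 5, 5, 8, 12, 12, 15, 20, 24, 24]
defaulted = [1, 0, 1, 1, 0, 1, 0, 1, 0, 0]

curve = kaplan_meier(months, defaulted)

t, x = 6, 18  # loan has survived 6 months; will it default by month 18?
p_default = 1.0 - survival_at(curve, x) / survival_at(curve, t)
print(f"P(default by month {x} | alive at month {t}) = {p_default:.3f}")  # 0.286
```

Note that the censored loans still count toward the number at risk until they drop out. That is how censoring is used here instead of being thrown away.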

In this post, I want to illustrate some of the tools around for survival analysis.

 

Microsoft EXCEL

Excel is perhaps the most accessible and versatile tool for anybody working with data. People therefore try to use Excel for almost everything, and a few solutions for survival analysis have come up:

1. https://atlasofscience.org/survival-analysis-using-excel-learn-it-use-it-and-improve-your-work/

2. https://help.xlstat.com/customer/en/portal/articles/2062246-kaplan-meier-survival-analysis-in-excel-tutorial?b_id=9283

However, I do not consider Excel a viable tool, due to its limitations in handling data and in the packages available for it.

STATA

STATA is well documented and very easy to code, which makes it a useful tool for non-developers. It has limited survival analysis functionality built in and even provides a simple GUI.

https://stats.idre.ucla.edu/stata/seminars/stata-survival/

Occasionally, new features are added.

https://www.stata.com/stata-news/news33-1/spotlight-stintreg/

However, it is very slow with large datasets and lacks several key functionalities when compared to R and Python.

R

R is a great, easy-to-code tool for statistics. Terry Therneau's survival package, loaded with library(survival), has seen remarkable development over the years. The estimators are fast, the output is very well formatted, and a wide range of functionality is available.

https://github.com/therneau/survival

Python

For a long time, survival analysis was not a focus for most data scientists using Python. However, in contrast to the other options mentioned here, Python is a full-featured programming language with unparalleled speed and ability to handle data.

In recent years, two packages in particular have developed rapidly:

1. https://pypi.org/project/scikit-survival/

2. https://lifelines.readthedocs.io/

 

Excel: 15%
STATA: 35%
R: 45%
Python: 65%
