Bolstered by improvements in computing power and innovations in key algorithms, machine learning (ML) has been experiencing tremendous growth and expansion in many fields during the last decade. However, adoption of ML algorithms by the insurance industry has been comparatively slow, and the ultimate role of ML in actuarial practice has yet to be determined.
There are several reasons for this slow pace. Commonly cited issues include lack of available computing power, lack of knowledge, regulatory scrutiny, lack of adequate data, privacy challenges, difficulty of interpretation, and communication challenges. Nevertheless, the industry has begun to overcome these challenges, leading to increased use of ML algorithms in some domains.
ML is often discussed as a unified field, though it covers many diverse and technically distinct models and algorithms and draws from many areas such as computer science, statistics, mathematics, and bioinformatics. The methods at the center of ML are united by the concept of using a very flexible and general model, applying that model to some “training” data set, and then adjusting model parameters to find a suitable optimum for a given function (typically, though not always, with the goal of minimizing a “loss function”). Broadly speaking, ML may offer advantages over Generalized Linear Models (GLMs) or other traditional statistical models, both in terms of its ability to automatically capture non-linear relationships in data to produce more accurate models and in terms of its flexibility to take on many different functional forms.
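The core loop described here (apply a flexible model to training data, then adjust its parameters to minimize a loss function) can be sketched in a few lines. The model, data, and learning rate below are invented purely for illustration; a real application would use a far richer model class.

```python
import numpy as np

# Toy illustration of the ML training loop: a deliberately simple model
# (y = w * x), a synthetic training set, and gradient-descent updates
# that minimize a mean-squared-error loss function.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y = 3.0 * x + rng.normal(0, 0.1, 200)    # data generated with "true" slope 3

w = 0.0                                  # initial parameter guess
lr = 0.5                                 # learning rate
for _ in range(200):
    grad = np.mean(2 * (w * x - y) * x)  # d/dw of mean((w*x - y)^2)
    w -= lr * grad                       # step against the gradient

print(round(w, 2))  # recovered slope, close to 3
```

The same loop, with many more parameters and a non-linear model, underlies most of the methods surveyed below.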
Generally, uses for ML may include:
Developing predictive variables to use in other methods (often referred to as “feature engineering” in ML research); this includes clustering as well as development of non-linear transformations or combinations of data elements.
Determining appropriate binning or clustering of variables for use in other models.
Dimensionality reduction for high-dimensional data.
Identifying non-linear relationships between variables and a predictor with minimal modeling assumptions.
Prediction based on sparse data sets.
Development of computationally tractable approximations to intractable traditional models.
In recognition of the potential for ML to provide value to actuarial science, this document seeks to provide a survey of current research in actuarial applications of ML, with a particular focus on applications of ML to Property & Casualty (P&C) insurance. This paper may therefore serve as a guide to individuals and practitioners interested in applying machine learning to a particular area, or it may serve to help researchers identify areas in which additional studies would be of greatest benefit.
2. ENHANCING TRADITIONAL METHODS
In this section, we focus on the use of ML algorithms to enhance traditional models, particularly with respect to clustering and binning of variables. This partial reliance on machine learning has the distinct advantage of reaping many of the predictive benefits of machine learning while retaining many familiar statistical tools that are useful for diagnosing and understanding models.
For example, ML models are often deterministic and non-parametric, so it is not always straightforward to calculate a probability or an information criterion for an ML model. In addition, some popular GLM software packages rely on binning continuous variables rather than modeling them directly. In some cases, it is also preferable to bin continuous variables in order to explore possible non-linear relationships.
To these ends, Henckaerts et al. (2018) use generalized additive models (GAMs) to motivate the binning of continuous variables for use in GLMs. In particular, they begin by developing a model of pure premium using the flexible GAM framework to model spatial and continuous variables. The next step is binning. For spatial effects, they explore four different binning methods and test the impact of different numbers of bins on the goodness of fit of the GAM. For continuous variables, they apply evolutionary trees to test sequential splits of continuous variables, maximizing goodness of fit subject to a constraint that no bin can be too small. The binned categories can then be applied in the context of a more familiar GLM.
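As a much-simplified, hypothetical stand-in for that evolutionary-tree search, a single greedy split of a continuous rating variable can be chosen to maximize the reduction in squared error subject to a minimum bin size. All data below are synthetic, and the variable names are illustrative only.

```python
import numpy as np

def best_split(x, y, min_size):
    """Greedy tree-style cut point: maximize SSE reduction of the response,
    subject to each resulting bin containing at least min_size observations."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    total_sse = np.sum((y - y.mean()) ** 2)
    best_cut, best_gain = None, 0.0
    for i in range(min_size, len(x) - min_size):
        left, right = y[:i], y[i:]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        gain = total_sse - sse
        if gain > best_gain:
            best_cut, best_gain = (x[i - 1] + x[i]) / 2, gain
    return best_cut

rng = np.random.default_rng(1)
age = rng.uniform(18, 80, 500)
# Synthetic claim frequency with a break at age 25 (young drivers riskier).
freq = np.where(age < 25, 0.3, 0.1) + rng.normal(0, 0.02, 500)

cut = best_split(age, freq, min_size=50)
print(round(cut, 1))  # the chosen split lands near age 25
```

Repeating the search within each resulting bin yields a full set of cut points, which can then be fed to a GLM as a categorical rating variable.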
Dai (2018) similarly uses tree-based models to perform spatial clustering for application in a GLM. The paper considers Gradient Boosting Machines (GBM) and random forests as options for clustering using tree-based algorithms. Both GBMs and random forests work by combining the predictions of multiple smaller models; however, GBMs work by iteratively generating many very small (or “weak”) decision trees, whereas random forests work by simultaneously generating many independent, larger decision trees. The author notes that random forests are easier to train and tune, easier to parallelize, and more robust to overfitting, but GBMs tend to outperform them in prediction if carefully tuned. The author also notes that, compared to traditional GLMs, tree-based models have no assumption of model structure, simplify the use of interactions, assist in dealing with missing values, and provide for built-in variable (“feature”) selection.
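The two ensemble ideas just contrasted can be sketched side by side using depth-1 "stumps" as the base learner. Real random forests grow much deeper trees and sample features as well as rows; everything here is a simplified illustration on synthetic data.

```python
import numpy as np

def fit_stump(x, y):
    """Best single-split regression tree over a few candidate thresholds."""
    best, best_sse = None, np.inf
    for t in np.quantile(x, np.linspace(0.1, 0.9, 9)):
        left, right = y[x <= t], y[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse, best = sse, (t, left.mean(), right.mean())
    return best

def predict(stump, x):
    t, left_mean, right_mean = stump
    return np.where(x <= t, left_mean, right_mean)

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 400)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, 400)

# GBM idea: grow small trees sequentially, each fitting current residuals.
boost_pred, lr = np.zeros_like(y), 0.3
for _ in range(100):
    stump = fit_stump(x, y - boost_pred)
    boost_pred += lr * predict(stump, x)

# Random forest idea: grow trees independently on bootstrap resamples
# of the data, then average their predictions.
bag_pred = np.zeros_like(y)
for _ in range(100):
    idx = rng.integers(0, len(x), len(x))
    bag_pred += predict(fit_stump(x[idx], y[idx]), x) / 100

print(np.mean((boost_pred - y) ** 2), np.mean((bag_pred - y) ** 2))
```

With such weak base learners, the sequential residual fitting of boosting reaches a much lower training error than the bagged average, consistent with the trade-offs the author describes.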
Both papers might be described as performing “feature engineering”: using machine learning to develop independent variables (e.g., binned categories or clusters) that can be useful in prediction tasks (e.g., ratemaking or reserving). In such applications, ML algorithms offer the advantage of being extremely flexible and, often, non-parametric. These features assist in revealing unintuitive, non-linear predictive relationships. Such relationships may be missed by traditional GLMs, which typically require the modeler to intentionally choose specific transformations of a variable rather than searching over a broad function space to find the best fit.
3. LIFE INSURANCE
While the insurance industry has been tentative about the adoption of ML, it appears that there has been more adoption of ML within the finance sector. For this reason, many proposed applications of ML within the insurance industry come from life insurance, which has some features in common with finance that may make transfer of these techniques more natural. In some cases, these techniques may also be relevant to certain domains within property and casualty insurance.
One such area is mortality risk. Mortality risk changes over time for a given population, and multiple models exist for projecting mortality risk for individual populations. However, it is reasonable to expect that related populations might have related mortality risks, or that changes affecting one population may be related to changes affecting others in a systematic, if non-linear, way. In some instances, modeling such multi-population mortality risks may present a challenging or intractable optimization problem that requires significant judgment. Neural networks are a natural fit to address these kinds of problems.
Richman and Wüthrich (2019) used neural networks to model general mortality risk. In particular, they explored extending the traditional Lee-Carter mortality model to multiple populations simultaneously by using a deep neural network, coded in R using the Keras package. Several extensions of the traditional model were tested, as well as several variants of the deep neural network architecture. The deep neural network significantly outperformed all other models considered in the paper.
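For context, the single-population Lee-Carter model that the paper extends, log m(x,t) = a_x + b_x k_t, is conventionally fit by a singular value decomposition of the centered log-mortality matrix. The sketch below uses entirely synthetic mortality rates; only the fitting mechanics are the point.

```python
import numpy as np

# Classical Lee-Carter fit via SVD on a synthetic log-mortality surface:
# log m(x,t) = a_x + b_x * k_t, with a_x the average log rate by age,
# and (b_x, k_t) the leading rank-1 term of the centered matrix.
rng = np.random.default_rng(3)
ages, years = 10, 40
a = np.linspace(-8, -2, ages)        # log mortality rises with age
k = -0.05 * np.arange(years)         # mortality improves over calendar time
b = np.full(ages, 1 / ages)
log_m = a[:, None] + b[:, None] * k[None, :] + rng.normal(0, 0.01, (ages, years))

a_hat = log_m.mean(axis=1)                       # a_x: row averages
U, s, Vt = np.linalg.svd(log_m - a_hat[:, None], full_matrices=False)
b_hat = U[:, 0] / U[:, 0].sum()                  # identifiability: sum(b_x) = 1
k_hat = s[0] * Vt[0] * U[:, 0].sum()             # rescale k_t to match

fit = a_hat[:, None] + b_hat[:, None] * k_hat[None, :]
print(np.abs(fit - log_m).max())  # reconstruction error is at the noise level
```

A multi-population extension replaces this rank-1 structure with a neural network that shares representations across populations, which is what the deep models in the paper do.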
For similar reasons, valuing large portfolios of variable annuities is problematic with traditional approaches. Determining the sensitivity of variable annuities to risk factors is time-consuming or even computationally intractable with a traditional Monte Carlo simulation approach. However, machine learning can approximate these computationally expensive methods with a high degree of precision and much greater speed.
To achieve this, Gan (2013) uses a combination of the k-prototypes clustering algorithm and Gaussian process models (also known as “kriging”). The k-prototypes algorithm is similar to the well-known k-means clustering algorithm, but it uses a more general measure of “distance” between two points that can reflect differences in both continuous and categorical variables, making it somewhat more flexible.
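The mixed "distance" at the heart of k-prototypes can be written down directly. The contract fields and the weighting constant gamma below are hypothetical, chosen only to show the mechanics.

```python
import numpy as np

# k-prototypes-style distance between mixed records: squared Euclidean
# distance on the numeric fields plus a weighted count of categorical
# mismatches. gamma balances the two parts (illustrative value).
def mixed_distance(num_a, num_b, cat_a, cat_b, gamma=1.0):
    numeric_part = np.sum((np.asarray(num_a) - np.asarray(num_b)) ** 2)
    categorical_part = sum(ca != cb for ca, cb in zip(cat_a, cat_b))
    return numeric_part + gamma * categorical_part

# Two hypothetical variable-annuity contracts: (age, account value in
# $000s) numeric fields plus (gender, guarantee type) categorical fields.
d = mixed_distance([62, 150.0], [65, 140.0], ["F", "GMDB"], ["F", "GMWB"])
print(d)  # (62-65)^2 + (150-140)^2 = 109 from numerics, plus 1 mismatch = 110.0
```

Substituting this distance into the usual k-means assignment step yields the clustering; representative contracts from each cluster are then valued exactly and interpolated with the Gaussian process model.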
4. SOLVENCY MONITORING
Solvency monitoring is important for insurance companies and insurance regulators to make sure that companies can meet their obligations as they fall due. The National Association of Insurance Commissioners (NAIC) has developed the property-liability Risk-Based Capital (RBC) system to calculate the amount of capital that insurance companies need to hold relative to their retained risk. Similarly, the Solvency II Directive in the European Union requires EU insurers to hold a minimum amount of capital. RBC systems calculate the regulatory capital requirement by assessing risks such as credit risk, underwriting risk, market and operational risk, and allowing for inter-dependency among these risks. Companies with RBC ratios below certain thresholds are subject to different degrees of regulatory intervention.
These solvency capital requirements may be re-framed in the context of machine learning as (linear) decision boundary problems. The amount of capital held by a company relative to its required solvency capital is a one-dimensional decision boundary. Similarly, systems like the Insurance Regulatory Information System (IRIS) might be seen as using a multi-dimensional decision boundary. By leveraging large amounts of data and capturing non-linearities, machine learning methods may be better able to make accurate predictions, even given the high dimensionality of the solvency modeling problem.
Support Vector Machines (SVMs) have been explored as one promising option for binary classification of companies based on solvency. SVMs work by automatically finding a dividing (hyper-)plane that separates solvent companies from insolvent ones. SVMs can have a linear decision boundary (i.e., a line above which companies are solvent and below which they are insolvent), or they can have an effectively nonlinear boundary by mapping inputs into a higher-dimensional feature space using a “kernel method” or “kernel trick” and finding a dividing hyperplane in the higher-dimensional space (which may correspond to a nonlinear boundary in the lower-dimensional space).
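The geometric idea behind the kernel trick can be shown without any SVM machinery: companies separated by a circle in two dimensions cannot be split by any line, but after adding a quadratic feature a simple threshold (a flat boundary in the higher-dimensional space) separates them exactly. The data below are synthetic.

```python
import numpy as np

# "Solvent" companies inside a circle, "insolvent" ones in an outer ring:
# not linearly separable in the original (x1, x2) plane.
rng = np.random.default_rng(4)
radius = np.concatenate([rng.uniform(0, 1, 100), rng.uniform(2, 3, 100)])
angle = rng.uniform(0, 2 * np.pi, 200)
X = np.c_[radius * np.cos(angle), radius * np.sin(angle)]
y = np.concatenate([np.ones(100), -np.ones(100)])   # inner = solvent (+1)

# Map (x1, x2) -> (x1, x2, x1^2 + x2^2): in this feature space a flat
# plane (here a threshold on the new coordinate) does the separation.
z = X[:, 0] ** 2 + X[:, 1] ** 2
pred = np.where(z < 2.0, 1, -1)   # linear boundary in the mapped space
print((pred == y).mean())         # perfect accuracy on this toy data
```

Kernel SVMs achieve the same effect implicitly: the kernel function supplies inner products in the mapped space without ever constructing the extra coordinates.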
Salcedo-Sanz et al. (2004) use SVMs in combination with simulated annealing and Walsh analysis to perform feature selection on a set of 19 financial variables for predicting insurer insolvency. Their analysis suggests that a limited subset of just five financial ratios is sufficient to evaluate insurer solvency. Tian et al. (2019) rely on a different approach to SVMs, instead using a non-kernel fuzzy quadratic surface SVM applied to financial ratios and macroeconomic factors. This “non-kernel” method attempts to overcome limitations of kernel methods, particularly the fact that results of an SVM may be sensitive to the choice of kernel function, and no universal method exists for selecting the best kernel function. This method achieves promising results at the cost of higher dimensionality, which may make the problem computationally expensive depending on the number of variables used.
Random forest algorithms have been used to predict insolvency within the insurance industry (as in Kartasheva and Traskin 2013) and in more general business settings (as in Behr and Weinblat 2017). Compared to SVMs, random forests can handle binary data without additional assumptions, whereas SVMs require the user to specify a notion of “distance” between the binary data points.
Random forests can be designed to provide a probability of insurer default rather than a binary classification of “default” or “healthy.” Random forests can also be used to automatically rank the importance of variables in determining insurer solvency. They have an advantage over statistical approaches like logistic regression in their ability to automatically detect and model highly non-linear relationships. However, random forests may be challenging to interpret, and may produce more variable results, particularly with sparse data sets.
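Permutation importance is one common way such a variable ranking is produced: shuffle one variable and measure how much accuracy drops. The sketch below uses a fixed rule as a stand-in for a fitted forest, and all variable names and data are hypothetical.

```python
import numpy as np

# Permutation-style variable importance. The "model" here is a fixed
# solvency rule (a stand-in for a fitted forest) that, by construction,
# reproduces the labels and ignores the noise variable; this isolates
# the importance mechanic itself.
rng = np.random.default_rng(5)
n = 2000
capital_ratio = rng.uniform(0, 2, n)   # informative variable
noise_var = rng.uniform(0, 2, n)       # uninformative variable
insolvent = (capital_ratio < 1).astype(int)

def model(ratio, other):
    # What a forest might learn: insolvency driven by the capital ratio.
    return (ratio < 1).astype(int)

base_acc = (model(capital_ratio, noise_var) == insolvent).mean()
drop_ratio = base_acc - (model(rng.permutation(capital_ratio), noise_var)
                         == insolvent).mean()
drop_noise = base_acc - (model(capital_ratio, rng.permutation(noise_var))
                         == insolvent).mean()
print(drop_ratio, drop_noise)  # the informative variable shows a large drop
```

Shuffling the variable the rule actually uses destroys roughly half the accuracy, while shuffling the ignored variable changes nothing, which is exactly the ranking signal a forest's importance measure exploits.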
5. LOSS RESERVING
5.1. Individual Claim Reserving
Estimating future claim payments is one of the main tasks performed by actuaries on a daily basis. Such estimates are of high value for the insurance companies because they constitute one of the largest liabilities on the balance sheet. It follows that the accuracy and timing of these figures is of primary concern for all stakeholders.
Traditionally, actuaries employed classical methodologies to perform this task. Such methodologies rely on aggregate data of insurance claims. These approaches have the advantage of relative simplicity, making them easy to communicate to stakeholders. In addition, by aggregating loss information from many claims, these methods may provide more stable results; however, this stability belies the uncertainty inherent in the classical loss projections. Specifically, less mature years may be subject to considerable uncertainty. Accurate estimates of ultimate losses may not be available until months or years after claims are incurred. This contributes to uncertainty in reserve estimates and, consequently, in the company’s profit and available funds. This could lead to delays in important strategic decisions, which could significantly harm profitability and market share.
Machine Learning methods could fill this gap, providing an accurate estimate at a very early stage of the claim indemnification process for individual claims. In addition, ML methods make it possible to take advantage of all available claim information, unlike standard triangle-based methods that only employ information about the timing and amount of claims. Using this additional information can reduce uncertainty in claim estimates, particularly for immature claims where triangle-based methods have comparatively little data on which to base their estimates.
Moreover, ML methodologies are fully flexible, and allow actuaries to consider (almost) any kind of feature information. We are, in fact, not limited to fixed data structures (e.g., triangles, which only provide insights about claim amount, timing, and development). As an example, ML can mine claim description text data to generate new features that can improve model predictiveness.
Another advantage of ML techniques is that such algorithms can operate without extensive user assumptions, as ML techniques can estimate most parameters of interest from the data. In addition, they can update/retrain themselves automatically. ML algorithms can also be deployed in an automatic way in order to achieve an instant estimated ultimate amount when the claim is first reported.
As a result, ML methods can provide considerable savings in terms of both time and money. Processes such as claim triage can be performed automatically and in a fraction of the time. It is also important to note that ML methodologies can adjust and adapt to changes in observations. ML methods can assist in identifying, studying, and reacting to trends more quickly than traditional methods because such trends can be discovered automatically.
Common methods and algorithms that hold promise for individual claim reserving include Neural Networks, Gradient Boosting, and Classification and Regression Trees (CART).
Wüthrich (2016) is a good introduction to individual claims reserving with machine learning. This paper provides a toy example using regression trees, with several simplifications, in order to introduce and prove the concept of ML in individual claim reserving. The author notes that although the paper presents a very simplified model, there would be little difficulty in generalizing it to the full real-world problem.
Jamal et al. (2018) present five different ML techniques applied to forecasting individual claim development. The techniques were implemented in a cascading fashion analogous to triangular reserving methods, and the predictions were compared with results achieved by classical reserving methods. The findings offer a better understanding of the possible complexity of the nature of the claims, point out some weaknesses that traditional methods might have, and indicate a strong potential for machine learning algorithms.
De Virgilis and Cerqueti (2020) analyze a set of fully developed claims in order to predict ultimate cost with the aid of ML techniques. The authors employ several ML methods, highlighting the strengths and weaknesses of each. At a high level, the goal is to predict the ultimate cost of individual claims at the moment they are reported to the insurance company, alongside the intermediate cash flows before the claim is closed. The paper also implements methods to predict claims with no payments (CNP), and a final section addresses predicting the amount of Incurred But Not Yet Reported (IBNYR) claims, a concept often overlooked in the context of individual claim reserving.
5.2. Aggregate Reserving
Notwithstanding the promise of individual claims reserving, machine learning can also improve aggregate reserving methods by incorporating additional claims information and by capturing uncertainty in claim reserves. Machine learning models that have been used for this purpose include neural networks, random forests, gradient boosting machines, boosted Tweedie models, and Gaussian process regression models.
5.2.1. Neural Networks
Neural networks can be implemented to enhance classical methods such as Mack’s chain ladder model and the Over-Dispersed Poisson generalized linear model or to directly model incremental paid losses and outstanding case reserves.
Classical aggregate claims reserving assumes a homogeneous claims portfolio. Wüthrich (2018) starts with Mack’s chain ladder model of cumulative paid claims by accident year and refines it for heterogeneity and individual claims features using a neural network. The neural network incorporates claim information including line of business, labor sector, accident quarter, age, and affected body part. The neural networks employed are relatively simple, containing only one hidden layer, with the specific parameters and architecture differing for each development period. The paper helpfully discusses important data pre-processing steps and methods of tuning hyper-parameters, in this case using out-of-sample validation. The paper demonstrates the promise of ML techniques for individual reserving: the neural networks are more accurate in three of the four lines of business tested.
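The forward pass of such a one-hidden-layer network is compact. The sketch below uses random stand-in weights rather than fitted values, and the five-number feature encoding is hypothetical; only the architecture is the point.

```python
import numpy as np

# One-hidden-layer neural network of the kind described above: claim
# features in, a single regression output per claim out. Weights are
# random placeholders for values a fitting procedure would learn.
rng = np.random.default_rng(6)

def one_hidden_layer(x, W1, b1, w2, b2):
    hidden = np.tanh(x @ W1 + b1)   # hidden layer with tanh activation
    return hidden @ w2 + b2         # linear output unit

n_features, n_hidden = 5, 10        # e.g., encoded LoB, sector, quarter,
                                    # age, body part (hypothetical encoding)
W1 = rng.normal(0, 0.5, (n_features, n_hidden))
b1 = np.zeros(n_hidden)
w2 = rng.normal(0, 0.5, n_hidden)
b2 = 0.0

claim_features = rng.normal(0, 1, (3, n_features))   # three claims
out = one_hidden_layer(claim_features, W1, b1, w2, b2)
print(out.shape)  # one output per claim
```

In the paper's setup, a separate such network (with its own fitted weights) adjusts the chain-ladder development for each development period.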
The model presented by Gabrielli, Richman, and Wüthrich (2019) embeds a cross-classified Over-Dispersed Poisson (ODP) reserving model of incremental paid claims into a neural network architecture using the Keras package in R. This approach goes beyond individual portfolios to learn representations simultaneously over several portfolios. The resulting model for each individual line of business is more accurate in all lines of business, at the price of slightly higher prediction uncertainty as measured by bootstrapping. Multi-triangle representation learning requires an additional line-of-business-dependent embedding layer. The multi-triangle model gives the best results with respect to the true reserves, and prediction uncertainty is reduced relative to the individual line-of-business models.
Kuo (2019) considers another approach: the DeepTriangle model, which combines paid losses and claims outstanding to estimate loss reserves using a deep neural network, also coded using Keras in R. The architecture takes incremental paid losses and outstanding case reserves by accident year as inputs and outputs incremental paid and outstanding amounts for cells in the run-off triangle. Between the input and output vectors lie hidden layers with associated intermediate values, activation functions, biases, and weights, trained with a mean absolute percentage error loss function. Fifty companies are modeled simultaneously on aggregate Schedule P data between 1988 and 1997, covering 10 years of development for four lines of insurance: commercial auto liability, personal auto liability, workers compensation, and other liability.
Validation of the model across lines of business shows it meets or exceeds the predictive accuracy of existing stochastic methods. Out-of-time performance is compared to the Mack chain-ladder method, the bootstrap over-dispersed Poisson method, and a selection of Markov Chain Monte Carlo models. DeepTriangle achieves the lowest mean absolute prediction error and root mean squared prediction error in every line except commercial auto. Future extensions of the model may include outputting distributions of loss reserves rather than point estimates.
5.2.2. Gaussian Process Regression
A common criticism of link ratio and regression based models is that they tend to be heavily parameterized for a problem with few degrees of freedom.
Lally and Hartman (2018) propose a hierarchical Bayesian Gaussian process model, a flexible non-parametric statistical/machine learning method that provides a robust and smooth fit to a wide variety of data types, structures, and distributions. Gaussian process regression smoothly interpolates and extrapolates while capturing uncertainty in predicted data points, based on parameters inferred from the data points’ relationships to one another.
Gaussian process regression has many favorable features: it is non-parametric, it has implementations in many standard software packages (for instance, Stan), it has a probabilistic interpretation, and input warping automates feature engineering. The probabilistic interpretation allows one to develop a posterior predictive distribution for predicted data points and thereby determine the uncertainty in estimated values (or the uncertainty arising from potential measurement error in observed values).
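The core interpolation-with-uncertainty mechanic can be sketched with an RBF kernel and a Gaussian noise term. This omits the hierarchical priors and input warping of the paper's model, and the development-period data below are invented.

```python
import numpy as np

# Minimal Gaussian process regression: posterior mean and variance at
# new points, given noisy observations and an RBF (squared-exponential)
# kernel. Data are illustrative (e.g., scaled cumulative paid losses by
# development period).
def rbf(a, b, length=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)

x_obs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_obs = np.array([1.0, 1.6, 1.9, 2.0, 2.05])
noise = 0.01

K = rbf(x_obs, x_obs) + noise * np.eye(len(x_obs))
x_new = np.array([2.5, 6.0])              # interpolation vs. extrapolation
k_star = rbf(x_new, x_obs)

mean = k_star @ np.linalg.solve(K, y_obs)               # posterior mean
var = 1.0 - np.sum(k_star * np.linalg.solve(K, k_star.T).T, axis=1)
print(mean, var)  # extrapolation carries more uncertainty than interpolation
```

The growth of the posterior variance away from observed points is precisely the "uncertainty in predicted data points" described above, and it is what the posterior predictive distribution quantifies.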
6. INSURANCE FRAUD DETECTION
Fraud is dramatically increasing with the expansion of modern technology and the global superhighways of communication, resulting in the loss of billions of dollars worldwide each year.
Although prevention technologies are the best way of reducing fraud, fraudsters are adaptive and, given time, will usually find ways to circumvent such measures. Methodologies for the detection of fraud are essential if we are to catch fraudsters once fraud prevention has failed.
Statistics and machine learning provide effective technologies for fraud detection and have been applied successfully to detect activities such as money laundering, e-commerce credit card fraud, telecommunication fraud, and computer intrusion, to name but a few.
We describe the tools available for statistical fraud detection and the areas in which fraud detection technologies are most used.
Fraud detection is particularly important in health insurance, as fraud can significantly impact healthcare costs. Bauder and Khoshgoftaar (2017) compare the performance of supervised, unsupervised, and hybrid machine learning approaches to predicting fraud in the 2015 Medicare data. While the conclusion, that supervised learning outperforms the other approaches, is not surprising, labeled fraud datasets may not be widely available at insurance companies, and label quality may be poor. As a result, unsupervised methods may still provide useful information to motivate further investigation. In another paper, Bauder, Herland, and Khoshgoftaar (2019) evaluate the predictive power of ML methods using both separate training and validation data sets and cross-validation, in which a single data set is divided into smaller training and test subsets, and find that the former provides a more realistic picture of real-world model performance. For a review of statistical fraud detection, see (Bolton and Hand 2002); for other applications of ML in health insurance, see (Mehta, Katz, and Jha 2020).
Just as applications of ML in insurance raise regulatory concerns, applications of ML in health care raise ethical issues; these are addressed in (Char, Shah, and Magnus 2018).
7. TELEMATICS
The use of telematics devices has been growing in auto insurance as more carriers introduce telematics products and as more data is collected. Data collected through telematics devices promises to be useful in a variety of insurance applications: pricing, underwriting, claim response and handling, fraud detection, maintenance recommendations, and more.
Pricing is an especially promising application, as telematics variables can accurately capture a driver’s behavior (presumably the proximate cause of many claims). This could potentially replace proxy variables like gender or household size. In reality, many programs have yet to incorporate telematics into their pricing; gathering and analyzing the data has been an important first step. Because raw telematics data must be organized into useful predictors, some early research has focused on using machine learning techniques to extract the most information from this raw data.
There are two main flavors of telematics that have emerged – pay-as-you-drive (PAYD) and pay-how-you-drive (PHYD). The main distinction is that PAYD focuses simply on driving habits (e.g., distance, time of day, location), while PHYD adds information about driving style (e.g., speed, braking). Depending on the specific hardware/software used, the raw data collected varies. In general, it includes basic PAYD information such as number of trips, distance and duration of trips, and time of day. It may also collect data about GPS location, speed, acceleration, and turns every second; or it may gather information on road types and conditions. Insurers are challenged to turn this high-frequency, high-dimensional data into useful covariates for predicting loss costs. Several studies have experimented with different techniques for developing these covariates.
Verbelen, Antonio, and Claeskens (2018) is an early study focused on PAYD variables. The authors compared several approaches for modeling claims frequency. Their goal was to evaluate the predictive power of telematics variables as well as compare traditional (time) and telematics (distance) exposure measures. They created four different datasets using third-party liability data from a Belgian insurer: one with only traditional rating variables (e.g. driver age, gender, postal code, and vehicle age) and time as exposure; one with only telematics data (e.g. yearly distance, number of trips, distance on road types, time of day, and day of week) and distance as exposure; one with a combination of traditional and telematics variables and time as exposure; and one with a combination of traditional and telematics variables and distance as an exposure.
The authors used generalized additive models (GAMs) as their framework; GAMs allow for flexible, non-linear relationships between continuous predictors and the response. Since some of the telematics variables are compositional (i.e., proportions of different categories that sum to 1), the authors developed a novel approach to include these as predictors. Based on the model results, the authors found that including telematics variables improved the model (though time was preferred to distance as the exposure measure); that differences in gender were explained by driving habits (women drove fewer miles per year); and that time of day and type of road were predictive: driving in the evening was riskier, and driving on urban roads or motorways was riskier than on other roads.
A subsequent series of papers investigates PHYD variables and various approaches to corral this high-dimensional data. Wüthrich (2016) started with the idea of visualizing a driver’s style by plotting speed-acceleration (v-a) heatmaps, which show the distribution of time spent at each combination of speed and acceleration. Drivers can be compared by calculating the dissimilarity between their heatmaps (each pixel of a heatmap has a value that can be used to calculate a distance). The author used K-means to categorize similar drivers.
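The heatmap-plus-clustering pipeline can be sketched end to end on synthetic telematics readings, with a short Lloyd's-algorithm K-means loop in place of a library implementation. All driver data below are simulated.

```python
import numpy as np

# Build a v-a heatmap per driver (2-D histogram of speed vs. acceleration)
# and cluster drivers on the flattened heatmaps with K-means (K = 2).
rng = np.random.default_rng(7)

def heatmap(speed, accel):
    h, _, _ = np.histogram2d(speed, accel, bins=(8, 8),
                             range=[[0, 120], [-4, 4]], density=True)
    return h.ravel()

# 20 simulated "calm" drivers (mild acceleration) and 20 "aggressive" ones.
drivers = []
for accel_sd in [0.5] * 20 + [2.0] * 20:
    speed = rng.uniform(0, 120, 1000)
    accel = np.clip(rng.normal(0, accel_sd, 1000), -4, 4)
    drivers.append(heatmap(speed, accel))
X = np.array(drivers)

# Lloyd's algorithm: alternate assignment and centroid updates.
centers = X[[0, -1]].copy()   # initialize from one driver of each style
for _ in range(10):
    dists = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
    labels = np.argmin(dists, axis=1)
    centers = np.array([X[labels == k].mean(0) for k in range(2)])

print(labels)  # the two driving styles fall into separate clusters
```

Each driver then carries a cluster label that can enter a frequency model as a categorical rating variable, which is the construction the next techniques seek to improve upon.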
Because the categorical variable resulting from K-means is less desirable than a low-dimensional continuous variate, Gao, Meng, and Wüthrich (2019) explored two additional techniques for turning the v-a heatmaps into useful predictors, comparing two principal components analysis (PCA) approaches to the original K-means approach. The first approach was singular value decomposition (SVD), which is restricted to linear representations; the second, based on a neural network, gives a non-linear analog to PCA. The authors found that the SVD and NN approaches provided sufficient representations of the information in the v-a heatmaps. The benefit of using these techniques in place of K-means is that continuous predictors require fewer parameters than categorical ones, leading to less over-parameterization, and new data can be simulated from continuous variates.
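The SVD route reduces each heatmap to a handful of continuous scores. The sketch below uses synthetic "calm" and "aggressive" heatmap patterns (invented numbers); only the mechanics of stacking, centering, and projecting are the point.

```python
import numpy as np

# Stack flattened v-a heatmaps as rows, center, and keep the first
# singular vector as a single continuous driving-style variate.
rng = np.random.default_rng(8)
calm = np.zeros(64)
calm[27:37] = 0.1                  # mass concentrated near zero acceleration
aggressive = np.full(64, 1 / 64)   # mass spread over all cells
patterns = [calm] * 20 + [aggressive] * 20
H = np.array(patterns) + rng.normal(0, 0.005, (40, 64))  # driver-level noise

H_centered = H - H.mean(axis=0)
U, s, Vt = np.linalg.svd(H_centered, full_matrices=False)
score = H_centered @ Vt[0]   # first principal-component score per driver

print(score[:5], score[-5:])  # the two styles take opposite signs
```

A single continuous number per driver now separates the two styles and can be fed into a GAM directly; the bottleneck-network variant replaces the linear projection with a learned non-linear one.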
Gao, Meng, and Wüthrich (2019) also compared the predictive power of the three techniques described above (K-means, PCA, and the bottleneck neural network) in predicting claims frequency. The authors used a Poisson Generalized Additive Model (GAM), which allows for non-linear covariate effects. All three approaches improved the out-of-sample deviance of a Poisson GAM predicting claims frequency with initial covariates of driver age and vehicle age, with PCA and the neural network outperforming K-means. The authors believe the PCA and NN approaches were more successful because they are numeric rather than categorical and therefore more granular; they also found that the three approaches accounted for some of the same information. Because most accidents occur at low speeds, the authors focused on low speed intervals and longitudinal (straight-line) acceleration rates. The resulting principal component and bottleneck activation features correspond to safer driving (lower acceleration) at low speeds (5-10 km/h, or about 3-6 mph). For severity modeling, the authors note that high speed intervals and lateral acceleration rates should be investigated.
In this report, we considered applications of ML in different areas of insurance. All of the papers we discuss confirm the strong predictive power of these tools compared with traditional methods. However, some concerns remain among practitioners, including the interpretability of results and regulatory and ethical issues. Although these techniques have attracted the attention of many researchers, there is not much research that addresses these concerns. It is the view of this working party that further work in these areas would facilitate the use of such powerful tools in the insurance industry.
While this paper is the product of a CAS working party, its findings do not represent the official view of the Casualty Actuarial Society. Moreover, while we believe the approaches we describe are very good examples of the use of machine learning techniques in various insurance contexts, we do not claim they are the only acceptable ones.