The debate regarding potential algorithmic bias in the insurance industry has spanned decades but has gained steam in recent years. Most state laws currently define unfair discrimination as “price differentials [that] fail to reflect equitably the differences in expected losses and expenses.” Recent legislation such as Colorado Senate Bill 21-169 has begun to evolve and clarify traditional notions of unfair discrimination by exploring cases where algorithms may indirectly result in higher estimates for classes of individuals protected by state or federal law. Such higher estimates may occur when certain insurance rating variables are distributed differently across, for example, racial and ethnic groups, creating potential for the variables to proxy for protected class membership. Measures exist to adjust for these distributional differences. However, when we apply such measures to facially neutral rating variables that otherwise appear to help explain differences in expected loss or expense, some might argue that we operate against traditional notions of (cost) equity. A simpler way to state this argument is that measures to reduce pricing differentials between classes have the potential to reduce pricing accuracy in some cases. This paper offers an illustrative approach to relate reductive measures to resulting accuracy impacts, based on the nature of biases present in the data.
The crux of our conversation is that not all distributional differences are likely to behave in the same way with respect to reduction techniques. We review three examples of such differences – misalignment, predilection and confusion. To study how each behaves, we simulate several hypothetical datasets in which we introduce different combinations of the three. We then construct predictive models to explain the datasets, some of which apply measures to address distributional differences and others of which do not. Finally, we measure the degree to which accuracy improves or worsens after we apply the measures. By analyzing how our models behave, we can begin to characterize the nature of distributional differences even without clear a priori knowledge of how we simulated them. These characterizations may improve our ability to converse about the types of measures that may be most reasonable, from the perspectives of both accuracy and evolving statute and regulation.
For readers who would like to explore our subject matter further on their own or delve deeper into some of the figures we present, we provide Supplementary Material at the conclusion consisting of the R computer code used to develop our analysis.
Our discussion will largely focus on how disproportionate impact occurs and is managed. The American Academy of Actuaries (AAA) defines disproportionate impact as “a rating tool that results in higher or lower rates, on average, for a protected class, controlling for other distributional differences.” Depending on the distributional differences that we control for in our analysis, disproportionate impact has the potential to focus on the effects of a model – and not, necessarily, on the absence or presence of relationships between the risk conditions in the model and its target. We illustrate this by enumerating three potential influences of disproportionate impact:
Misalignment - we define this as a tendency for certain classes to exhibit higher or lower prevalence of certain risk conditions in the data, where the apparent impact of exhibiting these risk conditions varies depending on the class. Table 1 displays a hypothetical example of misalignment.
Class B receives speeding tickets at about a ¼ higher frequency than Class A, that is, 4% ÷ 33% = 12% is approximately 25% higher than 6% ÷ 67% = 9%. Moreover, drivers with speeding tickets generally exhibit higher loss costs than those without. However, while the loss costs are the same for the ticketed drivers in Classes A and B, the loss costs for drivers without tickets are 6% lower in Class B than in Class A. This may suggest that there is greater policing surrounding speeding in Class B’s communities, while in Class A’s communities there are more drivers who travel at unsafe speeds but are lucky enough not to receive tickets, and these drivers may contribute to a higher mean for non-ticketed drivers in Class A than in Class B. The difference between the category “No” loss costs for Classes A and B suggests the potential for disproportionate impact, because in a regression model both classes’ non-ticketed drivers are likely to be treated in the manner implied by the Total.
Misalignment is an example of “input bias,” which “occurs when inputs are non-representative, lack information, historically biased, or otherwise bad data.[sic]” Other terms in literature to describe phenomena such as misalignment include “negative legacy” and “label bias.”
Predilection - we define this as a tendency for some classes to exhibit higher or lower prevalence of certain risk conditions in the data, where the apparent impact of exhibiting the risk condition itself does not vary materially by class. Table 2 displays a hypothetical example of predilection.
Class A is more prevalent in areas where traffic density is low, and Class B is more prevalent in areas where traffic density is high. Moreover, high traffic density areas generally exhibit higher loss costs than lower traffic density ones. Therefore, Class B exhibits a higher average loss cost than Class A. However, within each traffic density category, the loss costs are identical for Classes A and B. Whether or not we deem a model based on traffic density to have disproportionate impact likely depends on whether or not traffic density is one of the factors for which we control.
Predilection is an example of “training bias,” in which “the output of an algorithm is based on certain learned correlations while a different, and potentially more accurate output, may have been produced had the algorithm considered different or additional information.” Another term used to describe phenomena such as predilection in literature is “algorithmic prejudices.”
Confusion - we define this as essentially the same phenomenon as predilection, except that whatever risk conditions correlate with cost differences are not apparent to the modeler in any of the features in the dataset. For example, an organization may be more likely or able to offer roadside assistance as a value-added service to members of Class A than Class B. The assistance may generally result in less severe accidents. However, the organization may not be aware of its biases and resulting impacts. Therefore, Class A may appear to have inexplicably lower severities than Class B, all other things being equal. In these cases, some of the confusion may find its way into the model through other covariates that are loosely correlated with the latent one(s) and related class. If this does happen, disproportionate impact will likely occur, because the invisibility of the confusing factor in analysis prevents us from controlling for it in our assessment of how rates differ between classes.
In the example above where the organization’s actions contributed to confusion, the confusion may be an example of “programming bias” - which “can develop from an algorithm’s interactions with human users and assimilating to existing or new data.” Another way to describe this situation seen in literature is “selection bias.” In cases where latent covariates are not the result of organizations’ own actions, we would more likely classify it as input or training bias.
If one or more of the influences above are present, this creates the potential for disproportionate impact. We will illustrate how these influences may materialize in practice using simulated data.
Insurance loss costs are frequently modeled using generalized linear models (GLMs) with Tweedie distribution functions and logarithmic link functions. It is convenient to model loss costs using the Tweedie distribution, which is essentially a “Poisson-distributed sum of Gamma distributions.” Loss cost equals frequency (claims per exposure) multiplied by severity (amount of loss per claim), that is, loss per exposure. Therefore, the Tweedie distribution allows us to model two phenomena at once: the Poisson component helps explain frequency, while the Gamma component helps explain severity. The modeler determines the relative influence of each using a parameter (“p”). Logarithmic link functions are convenient for insurance models because they produce a multiplicative rating structure. In order to simulate somewhat realistic datasets, we sample frequency and severity statistics from Poisson and Gamma distributions (respectively) under a variety of circumstances where different combinations of the three influences above are present. The approach we take to do so, outlined in Step 2 of the Supplementary Material, is as follows:
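For readers who prefer a formula to the verbal description, the block below restates the compound Poisson-Gamma construction and its first two moments. It is a standard result, not taken from the paper. The symbols N, Z, k, θ, μ and ψ are our own labels, with ψ denoting the Tweedie dispersion so that it is not confused with the population index φ.

```latex
\begin{aligned}
Y &= \sum_{c=1}^{N} Z_c, \qquad N \sim \mathrm{Poisson}(\lambda), \qquad Z_c \sim \mathrm{Gamma}(\text{shape } k,\ \text{scale } \theta) \\
\mu = \mathrm{E}[Y] &= \lambda\,\mathrm{E}[Z] = \lambda k \theta \\
\mathrm{Var}[Y] &= \lambda\,\mathrm{E}[Z^{2}] = \lambda\,k(k+1)\,\theta^{2} \\
\mathrm{Var}[Y] &= \psi\,\mu^{p}, \qquad p = \frac{k+2}{k+1} \in (1,\,2)
\end{aligned}
```

As a worked value, a Gamma shape of k = 2 corresponds to p = 4/3 ≈ 1.33. Values of p closer to 1 behave more like a pure Poisson (frequency) model, and values closer to 2 behave more like a pure Gamma (severity) model.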
For each of eight simulated populations φ = N through PMC, we begin by setting a random seed (123) and generating observations n = 1 through 150,000. These eight scenarios will be defined by the three influences introduced above using the assumptions that will follow. From a nomenclature perspective, we describe our populations (φ) using acronyms representing the combinations of influences present – that is, predilection (P), misalignment (M), and/or confusion (C). N is the scenario with no influences present, while PMC is the scenario where all three are present.
For each n, and for each of six dimensions i = 0 through 5, we conduct random draws rφ,i,n from the standard uniform distribution. i = 0 informs whether each observation is associated with a protected class, and i = 1 through 5 inform the levels of five different rating variables and related parameters that will be used in the loss simulations. The five hypothetical rating variables are:
i. Geographic territory – high versus moderate versus low vehicle density
ii. Driver age – 25 or under, 26 to 65, or greater than 65
iii. Speeding ticket – whether or not speeding ticket received in past three years
iv. Vehicle weight – light, medium, or heavy
v. Safety class – whether or not policyholder ever completed approved course
Lower random draws generally correspond with less favorable treatment except for i = 2 whose effects will be non-monotonic.
We apply two disturbances to rφ,i,n when i = 1, 3, or 5 based on the value of rφ,0,n for predilection and misalignment respectively:
a. We calculate the experiential draw xφ,i,n as rφ,i,n ÷ (1.00 + δφ,i0), where δφ,i0 is 0 when i0 = 0 (dominant class) and takes on the values in Table 3 when i0 = 1 (protected class).
The higher the value of δφ,i0, the smaller the value of xφ,i,n will be, and the more likely the observation will be to receive simulation parameters that result in higher loss amounts in Step 5 of the Simulation.
b. We calculate the social draw sφ,i,n as xφ,i,n ÷ (1.00 + εφ,i0), where εφ,i0 is 0 when i0 = 0 (dominant class) and takes on the values in Table 4 below when i0 = 1 (protected class).
The higher the value of εφ,i0, the smaller the value of sφ,i,n will be, and the more likely the observation will be to receive labels in Step 4 of the Simulation that are less favorable than corresponding simulation parameters in Step 5.
We rank normalize xφ,i,n and sφ,i,n after each disturbance in order to constrain each within a 0.00 to 1.00 range. When i = 2 or 4, we set sφ,i,n and xφ,i,n (back) equal to rφ,i,n so that our example reflects the typical case in which not all covariates are subject to the influences in our study. For purposes of illustration, we select only a handful of δ and ε parameters that materially affect the influences we describe (and that share the same sign), along with a baseline scenario where neither is present. Readers may experiment with additional combinations at their leisure.
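The two disturbances and the rank normalization above can be sketched in a few lines of Python. This is a minimal illustration, not the Supplementary Material's R code. The δ and ε values, the 33% protected-class share, the sample size, and the rank ÷ (n + 1) normalization convention are all assumptions made only for this example.

```python
import random

def rank_normalize(values):
    """Map values into (0, 1) by rank: smallest -> 1/(n+1), largest -> n/(n+1)."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    out = [0.0] * len(values)
    for rank, k in enumerate(order, start=1):
        out[k] = rank / (len(values) + 1)
    return out

def disturbed_draws(r, protected, delta, epsilon):
    """Return (experiential draws x, social draws s) for one rating variable.

    r         -- raw standard-uniform draws
    protected -- 0/1 class indicators (1 = protected class)
    delta     -- predilection disturbance (hypothetical value)
    epsilon   -- misalignment disturbance (hypothetical value)
    """
    # Step a: shrink the protected class's draws, then rank normalize.
    x = rank_normalize([ri / (1.0 + delta * c) for ri, c in zip(r, protected)])
    # Step b: shrink the experiential draws again, then rank normalize.
    s = rank_normalize([xi / (1.0 + epsilon * c) for xi, c in zip(x, protected)])
    return x, s

def class_mean(values, protected, cls):
    picked = [v for v, c in zip(values, protected) if c == cls]
    return sum(picked) / len(picked)

rng = random.Random(123)
n = 10_000  # smaller than the paper's 150,000, for speed
protected = [1 if rng.random() < 0.33 else 0 for _ in range(n)]  # assumed share
r = [rng.random() for _ in range(n)]
x, s = disturbed_draws(r, protected, delta=0.5, epsilon=0.5)

print("mean experiential draw, dominant :", round(class_mean(x, protected, 0), 3))
print("mean experiential draw, protected:", round(class_mean(x, protected, 1), 3))
print("mean social draw, protected      :", round(class_mean(s, protected, 1), 3))
```

With positive δ the protected class's experiential draws sit lower on average than the dominant class's, and with positive ε its social draws sit lower still, which is the gap between labels and simulation parameters that the text calls misalignment.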
Based on the values of the social draw sφ,i,n, we determine levels of each variable i = 0 through 5 that will be visible to the modeler, using ranges shown in Table 5.
Binary indicators of these levels will be the independent variables in our GLMs.
Based on the values of the experiential draw xφ,i,n, we determine multipliers Fφ,i,n that will disturb each observation’s simulation parameters, using the ranges shown in Table 6.
Table 7 displays the values of the confusion factor αφ.
Note that αφ broadly affects both sociodemographic classes’ simulation parameters irrespective of the experiential and social draws for the rating variables. Similar to δ and ε, we select only a handful of α parameters for purposes of illustration.
For each n, we estimate the number of claims fφ,n by conducting a random draw from the Poisson distribution where λφ,n = 2% • Fφ,1,n • Fφ,2,n • Fφ,3,n.
For each claim cφ,n on each observation (if any), we estimate the loss amount lφ,n,c by conducting a random draw from the Gamma distribution where the scale parameter θφ,n,c = 50,000 • Fφ,0,n • Fφ,4,n • Fφ,5,n. (We use a constant shape parameter of 2.00 for all Gamma simulations regardless of scale parameter.) We sum the losses on each observation, that is, lφ,n = Σcφ,n lφ,n,c. These loss amounts will be the dependent variable for our Tweedie GLMs.
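The frequency and severity steps can be sketched as follows, using only the Python standard library. The base frequency (2%), base scale (50,000), and shape (2.00) come from the text. The multiplier values are hypothetical stand-ins for the Table 6 and Table 7 entries. The Poisson sampler is Knuth's multiplication method, chosen here only because the standard library has no Poisson generator.

```python
import math
import random

def poisson_draw(rng, lam):
    """Knuth's multiplication method; adequate for the small means used here."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def simulate_loss(rng, f_freq, f_sev, base_freq=0.02, base_scale=50_000.0, shape=2.0):
    """Total loss for one observation.

    f_freq -- multipliers F1, F2, F3 applied to the Poisson mean
    f_sev  -- multipliers F0, F4, F5 applied to the Gamma scale
    """
    lam = base_freq * math.prod(f_freq)
    scale = base_scale * math.prod(f_sev)
    n_claims = poisson_draw(rng, lam)
    # random.gammavariate(alpha, beta) takes shape alpha and scale beta.
    return sum(rng.gammavariate(shape, scale) for _ in range(n_claims))

rng = random.Random(123)
f_freq = (1.5, 1.0, 1.2)    # hypothetical F1, F2, F3
f_sev = (1.25, 1.0, 0.9)    # hypothetical F0, F4, F5
losses = [simulate_loss(rng, f_freq, f_sev) for _ in range(200_000)]

mean_loss = sum(losses) / len(losses)
# Theoretical mean: Poisson mean x Gamma shape x Gamma scale
expected = 0.02 * math.prod(f_freq) * 2.0 * 50_000.0 * math.prod(f_sev)
print("simulated mean loss :", round(mean_loss, 1))
print("theoretical mean    :", round(expected, 1))
```

Most observations have zero loss because the Poisson mean is only a few percent, so the simulated mean converges slowly. This is one reason a large sample is helpful in this kind of exercise.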
Note that predilection affects both the dependent and independent variables, misalignment affects only the independent variables, and confusion affects only the dependent variable. We now proceed to train several GLMs on our data in order to study how they behave in the presence of different influences.
In a practical sense we will rarely (most likely never) have the luxury of clear prior knowledge regarding the mathematical processes that generate the datasets we model. However, modeling with synthetic datasets such as the ones produced in the previous section can help provide insight into the processes that may have generated real-world datasets that exhibit similar model behaviors to the “toy” ones. To help relate generative processes to resulting behaviors, we train four different GLMs on each population, the nuances of which we will speak to shortly. Step 3 of the Supplementary Material trains these GLMs on a random 50% of observations. Consistent with the Poisson and Gamma assumptions used to simulate, we utilize GLMs with Tweedie distribution functions and logarithmic link functions. We judgmentally select a p-parameter of 1.3, which signifies that frequency is more dominant than severity in its explanatory power for our problem. After the models converge, Step 4 of the Supplementary Material executes the GLMs, and Steps 5 and 6 develop Gini Indices and Actual versus Expected analysis using the 50% of data we held out to examine how the models perform. Between these models and the appraisals of merit, we will begin to see clear differences emerge in how the different populations behave from a modeling perspective.
We develop four GLMs on each of the eight different populations, as follows:
Sociodemographic GLM – uses only protected class status to predict loss amount per exposure. We clearly would never consider using a GLM such as this outside of a laboratory environment. Its purpose here is to help examine whether the influences we program result in statistically significant relationships between protected class and target.
Baseline GLM – uses only the five rating variables to predict loss amount per exposure. This is the type of GLM we might see used in insurance pricing, particularly when algorithmic bias is of low concern. Its principal purpose for our discussion is to serve as a basis of comparison for the next two models we will describe, which apply simple measures from literature to help manage the potential for disproportionate impact.
Control Variable GLM – uses the five rating variables and protected class status to predict the targets. LaCour-Little and Fortkowsky among others describe similar approaches in their work. We include protected class when training the GLM to siphon signal away from other rating variables that may correlate with protected class. However, we neutralize its effects during deployment and build back any imbalance this creates in aggregate uniformly across dominant and protected classes. The purposes of this GLM and the next one are to evaluate the impact of reductive measures.
Residualized GLM – uses only the five rating variables to predict the target, but before doing so applies an adjustment to each that reduces correlation with protected class. We adjust via a simple approach, described in Berk et al among others, that regresses each predictor against protected class status and uses the residuals rather than the originals as independent variables in the GLM. The residuals (“partials”) are highly correlated with the originals but are uncorrelated with protected class. For example, Table 8 displays the correlations between protected class, the original geographic territory, and the partial geographic territory for φ = PMC.
The partial variables are 95% and 87% correlated with the originals for high and low density geography respectively, but correlations with protected class decrease from 30% and -50% respectively originally to 0% for both after residualizing.
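A minimal sketch of the residualizing step follows. It relies on the fact that an ordinary least squares regression of a predictor on an intercept and a single 0/1 class indicator has the class mean as its fitted value, so the residual is the predictor minus its own class's mean. The toy data and function names are hypothetical and are not the paper's simulated populations.

```python
def residualize(predictor, protected):
    """Residuals from regressing predictor on an intercept and a 0/1 class flag."""
    means = {}
    for cls in (0, 1):
        vals = [p for p, c in zip(predictor, protected) if c == cls]
        means[cls] = sum(vals) / len(vals)
    return [p - means[c] for p, c in zip(predictor, protected)]

def correlation(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

# Hypothetical toy data: a "high density" indicator that is more common
# in the protected class (last four observations).
protected = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
high_density = [1, 0, 0, 0, 0, 0, 1, 1, 1, 0]
partial = residualize(high_density, protected)

print("corr(original, protected):", round(correlation(high_density, protected), 3))
print("corr(partial, protected) :", round(correlation(partial, protected), 3))
print("corr(partial, original)  :", round(correlation(partial, high_density), 3))
```

In this toy case the partial variable keeps most of its correlation with the original indicator while its correlation with protected class falls to zero, which is the same qualitative pattern the correlation table describes.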
Our remaining discussion of the GLMs will focus on three ways we can use the models and their resulting performance to draw distinctions between the sample populations:
Protected class exhibits a statistically significant relationship with loss amount for all φ except N and M. We expect this because we did not program any influences for N and the one we programmed for M disturbed only the rating variables not the target.
Estimates for the scenarios besides φ = N and M also generally align with expectations. We do not have a precise a priori sense of the magnitude of predilection’s impact on the target, since we introduced it as a disturbance to a random draw. However, we do have a clear expectation for confusion (α) because of its direct relationship to the Gamma scale parameter through Fφ,0,n. Specifically, for the scenarios where we program α = 0.5, we expect the impact on the protected class to be (1.00 + α ÷ 2) ÷ (1.00 - α ÷ 4) = 1.25 ÷ 0.875 = 1.43 compared to the dominant class. If we normalize the estimates for φ = P through PMC by the one for influence-free φ = N, we obtain the following (Table 10):
The normalized estimate of 1.43 for C aligns with the expected impact of confusion. The normalized estimates for scenarios involving multiple influences are close to what we would expect by multiplying the individual confusion (C), predilection (P), and (trivial) misalignment (M) estimates by each other. That is, 1.41 (P) • 0.96 (M) = 1.35 ≈ 1.34 for φ = PM; 1.41 (P) • 1.43 (C) = 2.01 for φ = PC; 0.96 (M) • 1.43 (C) = 1.37 for φ = MC; and 1.41 (P) • 0.96 (M) • 1.43 (C) = 1.94 ≈ 1.91 for φ = PMC.
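The arithmetic above is easy to reproduce. The sketch below computes the expected confusion relativity for a given α and multiplies the rounded single-influence estimates quoted in the text. The function and variable names are our own. Because the inputs are rounded to two decimals, a product can differ from the figure quoted in the text in the last digit.

```python
def expected_confusion_relativity(alpha):
    """Expected protected-to-dominant relativity implied by the confusion factor."""
    return (1.0 + alpha / 2.0) / (1.0 - alpha / 4.0)

rel_c = expected_confusion_relativity(0.5)

# Rounded normalized estimates for the single-influence scenarios, as quoted in the text.
single = {"P": 1.41, "M": 0.96, "C": 1.43}

# Multiplicative composition for the multi-influence scenarios.
composed = {
    "PM": single["P"] * single["M"],
    "PC": single["P"] * single["C"],
    "MC": single["M"] * single["C"],
    "PMC": single["P"] * single["M"] * single["C"],
}

print("expected relativity for alpha = 0.5:", round(rel_c, 4))
for scenario, value in composed.items():
    print(scenario, round(value, 2))
```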
Rating factor spreads – Table 11 provides factor estimates for the three GLMs that utilize the rating variables. We limit the table to only variables subjected to our three influences for brevity. For information, we also display the values of Fi used in the simulations as a basis of comparison.
To help visualize the range of predictions associated with the three-level Geography variable, we display the quotient of the values associated with the maximum and minimum levels (Table 12). For example, for φ = N, we estimate spread as 1.366 ÷ 0.464 ≈ 2.947.
In general, the spread behaviors of the control variable and residualized GLMs resemble each other. In this subsection, we will focus on the behaviors of the control variable GLM due to the rather direct relationship between the control variable and rating variable spread. In the next subsection, we will delve into the residualized GLM in greater detail in the context of model performance.
When φ = N and P we tend not to observe material differences in spread between baseline and control variable GLMs. For φ = N, we did not program influences involving protected class, so it is unlikely the control variable would siphon away signal from other variables. For φ = P, although δ creates correlations between protected class and rating variables, the latter explains the target more directly than the former (which is the nature of predilection) and renders the control variable insignificant. Table 13 illustrates this via the protected class’s p-values and factor estimates. The control variable does not exhibit significance for φ = N or P.
The estimate of 0.72 for the control variable when φ = M helps explain why rating variable spreads in the resulting GLM are wider than in the baseline GLM. For example, the geography spread increases by roughly 10%, from 2.251 to 2.473. The spreads for the speeding ticket and safety class variables also increase slightly. Because ε disturbs predictors in a way that by definition does not track with the target when φ = M, the resulting noise depresses the affected rating variables’ spreads in the baseline compared to if noise-free predictors had been available. Once the control variable enters play, it isolates observations causing the depression – that is, protected class ones that ε often artificially tagged with less favorable rating variable levels. The coefficient for the control variable captures some of this depression and the rating variable spreads expand as a result. We observe similar spread expansion for φ = PM and another control variable coefficient (0.84) well below 1.00 – both of which are expected given the control variable’s sensitivity to ε and insensitivity to δ absent α.
We do not observe material differences in spread between the baseline and control variable GLMs for φ = C. The control variable is highly significant and its estimate is consistent with α effects we programmed, that is, 1.36 (C) ÷ 0.95 (N) = our previous expectation of 1.43. However, because protected class is uncorrelated with the rating variables, confusion effects do not infiltrate estimates for the latter in the baseline GLM and there is no extra signal available for the control variable to siphon away.
Scenario φ = PC exhibits material reductions in spread. Similar to φ = C, the control variable’s estimates exhibit significance and 1.44 (PC) ÷ 0.95 (N) = 1.52 does not drift far from 1.43. Because δ correlates with α and both bear on the target, affected rating variables in the baseline GLM absorb some confusion into their estimates. This results in spreads that materially exceed the ones implicit in simulation assumptions. For example, the geographic spread of 3.562 suggested by the baseline GLM is nearly 20% higher than the 1.50 ÷ 0.50 = 3.00 spread between the maximum and minimum geography multipliers applied to the Poisson mean λ in the simulation. When the control variable enters play, it absorbs back some of the confusion effects and the rating variables more closely adhere to simulation assumptions. For example, the control variable compresses the geographic spread from 3.562 to 3.181, within roughly 6% of 3.00.
The static behavior of spreads for φ = MC resembles that of φ = C more than the expansion seen with φ = M. Misalignment dissociates predictors from target by an approximate magnitude of 0.72 when φ = M. Confusion correlates with misalignment but influences target not predictors, and the approximate magnitude of the influence is 1.36 when φ = C. When the two simultaneously disturb predictors and target respectively, this neutralizes the control variable and minimal siphoning occurs.
Compressive spread behavior for φ = PMC resembles that of φ = PC. This extrapolates logically from the control variable being sensitive to δ (φ = PC) but not ε (φ = MC) in cases when α influences are simultaneously present.
Performance of residualized GLM – Gini Indices summarize the skewness of a distribution. The higher the index, the more effectively a model characterizes skewness. Frees et al. observe that, when applied to risk scores such as GLM predictions, the Gini Index is proportional to a correlation between the risk score and an insurer’s out-of-sample profit. Table 14 displays Gini Indices developed from the holdout observations using the various GLMs.
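As a rough illustration of what a Gini Index measures here, the sketch below sorts holdout observations by predicted loss, builds a Lorenz curve of actual losses, and returns one minus twice the area under it. This is a simplified, equal-exposure version written for this example. It is not necessarily the calculation used in the Supplementary Material, and it is not the ordered Lorenz curve with premiums that Frees et al. define.

```python
def gini_index(predicted, actual):
    """Gini index of actual losses when observations are ranked by prediction.

    Assumes one unit of exposure per observation and a positive total loss.
    """
    order = sorted(range(len(predicted)), key=lambda k: predicted[k])
    total = sum(actual)
    n = len(actual)
    cumulative, area = 0.0, 0.0
    for k in order:
        previous = cumulative
        cumulative += actual[k] / total
        # Trapezoid under the Lorenz curve over an exposure step of width 1/n
        area += (previous + cumulative) / 2.0 / n
    return 1.0 - 2.0 * area

# Hypothetical toy holdout: all of the loss falls on one observation.
actual = [0.0, 0.0, 0.0, 10.0]
print(gini_index([1, 2, 3, 4], actual))          # model ranks the loss last (best)
print(gini_index([4, 3, 2, 1], actual))          # model ranks the loss first (worst)
print(gini_index([1, 2, 3, 4], [1, 1, 1, 1]))    # no skewness to capture
```

A model that ranks the riskiest observations highest pushes the Lorenz curve away from the diagonal and raises the index, while a model that ranks them in reverse produces a negative value.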
The control variable GLM generally has minimal impact on Gini Index performance for the populations as a whole compared to the baseline GLM, despite sometimes having material impacts on factor estimates. In contrast, the residualized GLM tends to have material Gini Index impacts on the populations as a whole compared to the baseline and control variable GLMs. Neither the control variable nor residualized GLMs achieve significantly better or worse Gini Indices than the baseline when limiting the population to only the dominant or protected class. This suggests that differences in Gini Index on the broader populations are due to the residualized GLM assigning significantly different predictions to the dominant or protected classes as a whole, rather than it rank ordering exposures more or less effectively within one or the other or both. Table 15, which displays differences between observed versus predicted loss amounts per exposure, corroborates this hypothesis.
We calculate error as the average cohort observation divided by the average cohort prediction less 100%. Positive error suggests under-prediction while negative error suggests over-prediction. The GLMs over-predict the holdout for all models and populations due to distributional differences between training and holdout data. However, we see more significant prediction errors at the class level of aggregation across several of the populations.
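The error definition above translates directly into code. The sketch below computes it for the whole holdout and for each class. The toy numbers and function names are hypothetical, chosen so that the aggregate error is zero while the class-level errors are not.

```python
def prediction_error(observed, predicted):
    """Average observation divided by average prediction, less 100%."""
    mean_observed = sum(observed) / len(observed)
    mean_predicted = sum(predicted) / len(predicted)
    return mean_observed / mean_predicted - 1.0

def error_by_class(observed, predicted, protected):
    """Prediction error for the dominant (0) and protected (1) cohorts."""
    result = {}
    for cls in (0, 1):
        obs = [o for o, c in zip(observed, protected) if c == cls]
        pred = [p for p, c in zip(predicted, protected) if c == cls]
        result[cls] = prediction_error(obs, pred)
    return result

# Hypothetical toy holdout
observed = [100.0, 0.0, 300.0, 0.0]
predicted = [80.0, 80.0, 120.0, 120.0]
protected = [0, 0, 1, 1]

overall = prediction_error(observed, predicted)
by_class = error_by_class(observed, predicted, protected)
print("overall error  :", overall)       # balanced in aggregate
print("dominant error :", by_class[0])   # negative: over-prediction
print("protected error:", by_class[1])   # positive: under-prediction
```

The example shows why class-level aggregation matters: a model can be balanced in total while over-predicting one class and under-predicting the other.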
Residualizing expectedly has minimal impact on Gini Index performance for φ = N and C. For both scenarios, we do not program any influences that correlate rating variables with protected class, therefore residualizing effectively accomplishes nothing besides perhaps injecting a bit of noise into the independent variables used in the GLMs.
In contrast, for φ = P, residualizing leads to material reductions in Gini Index, from 0.357 to 0.346. δ creates correlation between rating variables and protected class. However, unlike with ε (which we will discuss next), the resulting differences in prediction derive from legitimate differences in the target, not from labeling inaccuracies. Residualizing the predictors essentially creates reverse misalignment and results in the GLM significantly under-predicting for the protected class. Adding confusion to predilection for φ = PC exacerbates these Gini Index reductions. The reduction grows from 0.011 for φ = P (0.357 less 0.346) to 0.033 for φ = PC (0.372 less 0.339). Both δ and α tend to result in observably higher targets for the protected class (although the latter is not visible to the modeler via the rating variables), so the new (reverse) misalignment caused by residualizing widens the gap between predicted and observed. Under-prediction on the protected class grows from 26% with the control variable to 56% with residualizing.
The only case where residualizing materially improves Gini Index is φ = M, where the GLM helps correct under-prediction and over-prediction on the dominant and protected classes, respectively. For example, the protected class’s error decreases from -15% in the control variable GLM to 0% in the residualized GLM. This is because removing correlations between protected class and rating variables counteracts ε’s tendency to label the protected class less favorably than simulation parameters warrant. Intriguingly, had we not neutralized protected class when deploying the control variable GLM, it would also have improved Gini Index compared to baseline. As actually deployed, however, it yields minimal improvement.
Adding predilection to misalignment when φ = PM leads to the control variable and residualized GLMs receiving roughly similar Gini Indices of 0.346 and 0.345, respectively. Residualizing worsens Gini by 0.011 when φ = P and improves it by 0.009 when φ = M, and the two roughly cancel when φ = PM. The control GLM’s under-prediction on the dominant class (15%) becomes an over-prediction (-4%) in the residualized GLM, and its over-prediction on the protected class (-4%) becomes an under-prediction (29%). From an accuracy perspective, residualizing essentially addresses the misalignment we programmed, but creates reverse misalignment in its ineffective approach to predilection as discussed for φ = P.
On the other hand, adding confusion to misalignment when φ = MC results in residualizing reducing Gini Index nominally from 0.354 to 0.349. For the baseline, although ε artificially dissociates predictors from target in a way that tends to create inaccurately higher predictions on the protected class, α correspondingly increases the target for the protected class and counteracts ε’s inaccuracy. Therefore, while addressing misalignment ordinarily leads to accuracy improvement, here confusion had already somewhat addressed the misalignment inaccuracy and residualizing does so in a duplicative way. Under-prediction on the protected class grows from 7% with the control variable GLM to 26% with the residualized GLM.
Finally, the Gini Index reduction of 0.036 for φ = PMC (0.378 less 0.342) is the largest of all scenarios. In the presence of confusion, residualizing worsens Gini by 0.033 and 0.006 respectively when predilection or misalignment alone is also present (φ = PC and MC), and the reductions compound when all three are present.
The baseline GLM creates higher predictions for the protected class than the dominant one except when φ = N or C, where either no influences are programmed at all (N) or none correlate with the rating variables (C). For φ ∈ {N, P, MC}, the average predicted differentials track within 10% of observed. For φ = M and PM, average predicted exceeds average observed by 32% and 15% respectively. For φ = C or PC, predicted falls short of observed by 27% and 29% respectively, while for φ = PMC predicted falls 13% short of observed. As we have discussed, misalignment and confusion lead to inaccuracies in the baseline GLM because they act on predictors and target respectively, but neither affects both. In most cases, the control variable approach has nominal impact on the average predicted relativities, whereas residualizing consistently forces average predicted relativities very close to 1.00. Sometimes this ballasting pressure tracks well with the target (φ = M); other times it does not (φ = PC).
Whether or not disproportionate impact exists in each case for the baseline depends on which distributional differences we control for when reaching this assessment. For example, it would likely exist when φ = M and predicted significantly exceeds observed, and would be unlikely to exist when φ = P and predicted is close to observed (assuming we control for all five rating factors). Assessments become more complex when two or more influences are present. For example, confusion has the potential to disguise disproportionate impacts created by misalignment at first glance (φ = MC), because offsetting errors lead to predicted and observed relativities that do not differ significantly. These nuances accentuate the importance of being able to identify and articulate influences present in a modeler’s dataset in order to drive effective conversations regarding the diagnosis and management of disproportionate impact.
Table 17 summarizes how we can generally describe the absence or presence of these different influences via the GLMs described above.
Although we have the benefit of prior knowledge regarding the assumptions we programmed, the nearly mutually exclusive findings above do not necessarily require prior knowledge to hypothesize, in the sense that they follow logically from the influences that created them. In practice, influences may be much more difficult to diagnose due, for example, to subtler manifestations of the influences, the potential for δ, ε, and α to differ in direction and magnitude, and the potential for ordinary volatility to obscure influences. Therefore, we do not suggest that a table such as the one above (or the section as a whole) will generalize well beyond our contrived set of circumstances or yield conclusive evidence of what influences disproportionate impact (if any) in a real-world predictive model. However, we hope the general thought processes help motivate productive discussions regarding the bias dynamics of a dataset and the corresponding reductive measures that may be appropriate.
Some additional considerations around our approach include (but are not limited to):
Market competition – As we note in the introductory section, some argue that bias reduction methods are inconsistent with traditional notions of cost-based pricing. If some marketplace participants use bias reduction and others do not, then those whose approach reduces model accuracy may be subject to adverse selection. For example, for φ = P, residualizing, among other things, reduces the GLM’s ability to respond to the protected class’s predilection to be situated in traffic-dense areas, resulting in a statistically less accurate model. Some may consider traffic density to be an uncontroversial predictor of loss that we should not neutralize, whereas others may feel its correlation with protected class is problematic enough to consider addressing. Marketplace participants who residualize may underestimate traffic-dense areas’ loss costs in their models and therefore face adverse selection. The extent of adverse selection may depend on many factors such as:
Magnitude of accuracy improvements/reductions
Whether or not use of reductive methods is compulsory
Price elasticity of prospective policyholder base
Expense structure of marketplace participants
Non price-based factors (e.g., service) affecting retention
One way to estimate the impacts of adverse selection under various assumptions regarding such variables is to conduct a competitive marketplace simulation. Such an exploration is beyond our scope. For an interesting treatment, see Riahi et al.
Dimensionality – We focus our discussion on regression-based models and reductive measures. However, Berk et al. note that residualizing falters in its ability to mitigate distributional differences when applied in higher dimensional, interactive environments. Our ability to diagnose influences may similarly diminish in such environments. The approaches we illustrate would require significant adaptation to handle tree-based techniques. At a minimum, one could train a surrogate regression model to explore biases – but this only interrogates the problem at the level of information that the surrogate model captures, and not at any levels the trees uniquely interrogate. The preponderance of literature focuses on more elaborate reductive techniques with potentially greater ability to scale to higher dimensional settings. We focused on simpler techniques because their tractability lends itself well to illustrating how different types/influences of bias are likely to manifest.
Prior Information – In practice we will not usually have the luxury of knowing the influences that resulted in our dataset. However, ample resources exist to form a starting point of knowledge from which to iterate. For instance, U.S. Census Bureau data provides a trove of detailed information regarding class breakdowns by geography. Of course, this or any data may have its own biases embedded (specifically, (non)response bias). Technology is also helping refine our view of misalignment and predilection. For example, a recent study using telematics data collected from Lyft drivers illustrated that minority drivers were more likely to receive tickets than white drivers travelling at the same speed. Telematics and novel credit data have the potential to address historical inequities around credit-based insurance scores and other rating variables. However, telematics may present programming biases – for example, some communities may have less access than others to newer vehicles or phones with the technology required to collect driving data. A perfect solution to bias management that addresses all stakeholders’ concerns and interests is unlikely.
Availability – In our analysis, we simulate a binary indicator of whether or not each observation relates to a protected class. In practice, many organizations do not possess this information. Bayesian Improved Surname Geocoding is a simple and accessible method to infer protected class status from other information – however, the result is just an estimate and a non-binary one at that. Organizations’ ability to explore the domain in question will vary based on their circumstances.
Permissibility – Clear consensus has yet to emerge regarding the most acceptable ways to address disproportionate impact. The statutory, regulatory, and judicial landscapes are changing quickly. Analysts will be well-advised to seek counsel to ensure that the reductive measures they are considering, or the lack thereof, are compliant.
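As a rough illustration of the residualizing discussed under Market competition above, the Python sketch below regresses a rating variable on a protected class indicator and keeps only the residual. The variable names (`protected`, `traffic_density`), the simulated effect size of 0.5, and the use of ordinary least squares are assumptions made for illustration; the paper's own implementation is in the R Supplementary Materials and may differ in detail.

```python
import numpy as np

def residualize(x, z):
    """Return the part of rating variable x that is linearly unrelated to z.

    x : 1-D array, a rating variable (here a hypothetical traffic density score)
    z : 1-D array, a 0/1 protected class indicator
    """
    x = np.asarray(x, dtype=float)
    z = np.asarray(z, dtype=float)
    # Design matrix with an intercept and the protected class indicator.
    design = np.column_stack([np.ones_like(z), z])
    # Ordinary least squares fit of x on the indicator.
    coef, *_ = np.linalg.lstsq(design, x, rcond=None)
    # The residual is what remains after removing the fitted group means.
    return x - design @ coef

# Hypothetical data: the protected class tends toward denser traffic (predilection).
rng = np.random.default_rng(0)
n = 10_000
protected = rng.binomial(1, 0.3, size=n)
traffic_density = 1.0 + 0.5 * protected + rng.normal(0.0, 1.0, size=n)

traffic_resid = residualize(traffic_density, protected)

# Correlation with protected class before and after residualizing.
corr_before = np.corrcoef(traffic_density, protected)[0, 1]
corr_after = np.corrcoef(traffic_resid, protected)[0, 1]
print(f"before: {corr_before:.3f}, after: {corr_after:.3f}")
```

Because the residual carries no linear association with the indicator, a GLM trained on it can no longer respond to the protected class's tendency toward traffic-dense areas, which is the accuracy trade-off described for φ = P.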
These considerations are not unique to the specific example we navigated, nor is the specific example we navigated likely to be unique. All we did was fit several Tweedie GLMs to hypothetical data and examine them in the context of biases and reductive methods common in literature. While the problem space in which we conduct our exercise is vast, the exercise itself is simple and illustrative. Not all algorithmic bias is created equal. Different biases exist in different datasets and behave differently when subjected to different reductive measures. It is not a fait accompli that reduction measures will make models less, or more, accurate. If or when they do, this may reveal clues regarding which biases may be at play. These clues may in turn help us better understand and manage input, training, and programming biases rather than simply mitigating their impacts afterwards – leading to more informed algorithmic and policy outcomes.
The author acknowledges Mallika Bender, FCAS for her feedback on an early draft of the paper as well as E-Forum Working Group members for improvements and clarifications suggested during the review process.
NAIC, page 5.
National Law Review.
Laws often also protect national origin, gender, disability, religion, and sexual orientation.
We created these three monikers to describe our special cases of different bias realizations commonly seen in literature. We describe them in detail in the next section.
American Academy of Actuaries, page 31.
Chibanda, page 6.
Serwin and Perkins, page 3.
Datta, as well as Verma et al. page 1.
Serwin and Perkins, page 3.
Serwin and Perkins, page 3.
Verma, page 1.
Goldburd et al., page 22.
Differences in random number generation between different R versions could lead to slightly different results than the ones presented in the paper. We conducted analysis in R 3.5.1.
We choose 150,000 as a reasonably large number that does not excessively burden the author’s laptop.
Steps 2.3.2 and 2.4.2 of Supplementary Materials apply rank normalization.
There will be multiple claims per observation when the Poisson draw is greater than one. See Step 2.8 of Supplementary Materials for details of how we approach this.
Balancing occurs during deployment, in Step 4.2.1 of Supplementary Materials.
Berk et al., page 21.
Residualizing occurs prior to training, in Step 3.3.1 of Supplementary Materials.
Step 3.3.2 of Supplementary Materials develops full correlation matrices across all scenarios.
Step 3 of Supplementary Materials generates estimates and p-values.
See Step 2.6 (last line) and 2.8.1 of Supplementary Materials.
We conduct this calculation by hand using the precise estimates from Step 3.0 above.
These estimates, as well as ones not displayed, derive from Steps 3.1 through 3.3 of Supplementary Materials.
In this table, we normalize Fi by the value associated with the base level. For example, for Speeding Ticket = Yes, 2.00 ÷ 0.89 = 2.25.
Frees et al., page 335.
The Gini Index relates the cumulative percentage of total observations to the cumulative percentage of total target as one increases the prediction. Therefore, it is not sensitive to the scale of either the prediction or the target as long as rank ordering and proportionality are preserved within each.
“Model 2.1” section of Step 5 in Supplementary Materials produces Gini Indices for Control Variable GLM before neutralizing the control variable in deployment.
This calculation occurs at the conclusion of Step 6 in the Supplementary Materials.
In some cases, bias reduction reduces accuracy (φ = P) and in other cases it enhances accuracy (φ = M).
Aggrawal et al.
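The note above on the Gini Index's insensitivity to scale can be checked with a short Python sketch. It assumes a simple Lorenz-curve formulation (observations sorted by increasing prediction, area computed by the trapezoid rule) and simulated data; the function name `gini_index` and the gamma-distributed inputs are hypothetical, and this is not necessarily the exact formulation used in Frees et al. or in the Supplementary Materials.

```python
import numpy as np

def gini_index(prediction, target):
    """Gini Index from a Lorenz curve ordered by increasing prediction."""
    prediction = np.asarray(prediction, dtype=float)
    target = np.asarray(target, dtype=float)
    # Sort observations from lowest to highest prediction.
    order = np.argsort(prediction, kind="stable")
    n = len(target)
    # Cumulative share of observations and of total target, starting at (0, 0).
    cum_obs = np.concatenate([[0.0], np.arange(1, n + 1) / n])
    cum_target = np.concatenate([[0.0], np.cumsum(target[order]) / target.sum()])
    # Trapezoid-rule area under the Lorenz curve.
    widths = cum_obs[1:] - cum_obs[:-1]
    heights = (cum_target[1:] + cum_target[:-1]) / 2.0
    area = np.sum(widths * heights)
    # Twice the area between the line of equality and the Lorenz curve.
    return 1.0 - 2.0 * area

# Hypothetical predictions and noisy, positively related targets.
rng = np.random.default_rng(1)
pred = rng.gamma(shape=2.0, scale=1.0, size=5_000)
obs = pred * rng.gamma(shape=2.0, scale=0.5, size=5_000)

g = gini_index(pred, obs)
# Rescaling either input by a positive constant keeps rank order and proportions.
g_scaled = gini_index(2.0 * pred, 0.4 * obs)
print(f"Gini: {g:.4f}, after rescaling: {g_scaled:.4f}")
```

Only the ordering of the predictions and the shares of the target enter the calculation, which is why multiplying either by a positive constant leaves the index unchanged up to floating-point rounding.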