The term “artificial intelligence” is often applied indiscriminately, at least in popular press articles and many vendor advertisements, to a wide variety of computer models intended for inference, prediction or (more recently) content generation. If intelligence is a way of thinking about data that allows inference and prediction, then our discussions of artificial intelligence often try to equate that intelligence with a particular set of algorithms. This naturally raises the question: Which technology is smartest? Perhaps Google’s deep learning technology will win the market, or OpenAI’s ChatGPT, or something new.
This extends a pattern that is familiar to me, and I think to many other actuaries as well. I find myself looking for the “best” model, or the best few models. One candidate might be a fully Bayesian model using Markov Chain Monte Carlo, in which I could put any desired prior and likelihood distributions. Or perhaps a deep neural net (DNN) would perform better. Large language models (LLMs) such as ChatGPT are being advertised as the right choice for almost any application, and even as showing some capabilities of an “artificial general intelligence” (Bubeck et al. 2024).
Whatever model type might be the latest and greatest, I imagine it will require more computing power than the last one did. And it will also need more detailed data on insured risks or losses. Explaining the new model might be harder than explaining the old one, and this is a consideration in model selection. Most of all, it must give more accurate predictions. The insurer needs the competitive edge this provides. Otherwise, it might be subject to adverse selection and potentially go out of business.
The problem is that finding the best model seems at odds with people’s actual intelligent behavior, which is not monolithic. Instead, people in different situations apply radically different approaches to understanding and solving problems. For example, actuaries estimating future rates think differently than paramedics responding to a disaster, who think differently than factory floor managers, etc. This means their behavior must be understood in the context of their working environment. Despite this, the intended environments of the latest artificial intelligence tools usually get short shrift.
This essay draws attention back to the environment, especially how its tendency to exhibit novel behavior might affect the appropriateness of a model to its intended purpose. Novel behavior is essentially unpredictable. The probability of possible outcomes might change, such as an increased frequency of severe hurricanes. Or there might be profoundly new outcomes, such as the emergence of cybercrime and identity theft. Because the word “intelligence” implies a characteristic of the software itself, and draws attention away from the environment, this essay will largely prefer the word “rational.” Instead of “artificial intelligence,” it will refer to “models.”
What does it mean to rationally use a model? I will first consider the traditional search for the best model in more detail. Then I will expand to considering the environment as well.
This traditional effort to find the “right” model can be set within a larger effort to find a universal (or near-universal) set of rules with which to understand the world around us. It represents an entire way of thinking about rationality which I will call “meditative” because it puts so much focus on finding and using the most appropriate mental rules. Bayesian decision theory is one candidate (Savage 1972; de Finetti 2017). Another one seems to be the algorithms behind DNNs and LLMs, at least in the view of those who claim this technology can be used to make an artificial general intelligence.
This meditative view has strong implications for how we understand the rationality of real people. This is because the candidate approaches for making the “right” decision are very complicated. Bayesian decision theory cannot be understood without advanced mathematical training.[1] DNNs probably cannot be fully interpreted by anybody. Whatever the right decision process, it does not seem to describe how any real person thinks except in some very narrow professional endeavors.
The Biases and Heuristics (BaH) research program has helped us understand that people make many of their decisions using simple rules called heuristics. Daniel Kahneman’s book, Thinking, Fast and Slow, has justifiably become famous and influential, even in more popular culture, and it represents a summary of BaH (Kahneman 2011). Kahneman describes human decision making in terms of “System 1” and “System 2.” System 1 is fast and uses heuristics. System 2 is focused, deliberate, and capable of complex logic or scientific reasoning.
Kahneman gives example after example demonstrating that the heuristics used by System 1 give different answers than the “right ones” according to Bayesian decision theory, classical economic theory, or some similar way of thinking. Because System 1 uses these heuristics to save us time and energy, it leaves all of us biased and therefore less rational. This conclusion is potentially more pessimistic than it seems while reading Thinking, Fast and Slow, perhaps because the book’s tone is so appreciative of people. Referring to Paul Slovic, another researcher in the BaH program, Kahneman writes:
Paul Slovic probably knows more about the peculiarities of human judgment of risk than any other individual. His work offers a picture of Mr. and Ms. Citizen that is far from flattering: guided by emotion rather than by reason, easily swayed by trivial details, and inadequately sensitive to differences between low and negligibly low probabilities (Kahneman 2011).
Formal models can become the solution for our inadequate rationality. The subjective judgement of the underwriter or claims adjuster can give way to more accurate estimates, at least in aggregate, based on credibility theory and one-way analyses. These estimates are improved by multivariate statistical models. These in turn are replaced by regularized or Bayesian statistical learning models.
It is easy to imagine this story continuing into the future. More computing power allows more sophisticated models. They could be Bayesian statistical models, which can be understood by people, or DNNs, which cannot, or some other option. In any case, once the right “meditative” rules are discovered, the computer thinks for us, making up for the shortcomings of our brains. Feeding the computer more detailed data about our system of interest gives us better predictions.
What if the system exhibits novel behavior? Past behavior might continue but with different frequency, such as severe hurricanes becoming less rare. Or there might be entirely new behavior. A truly universal model will be expected to accurately incorporate the new data. A Bayesian posterior probability might take different values, or a DNN’s parameters might change after updated training. The method of thinking has not really changed.
Credibility theory is one way of dealing with new data that is familiar to actuaries (Bühlmann and Gisler 2006). This is sometimes expressed in terms of balancing responsiveness and stability. More or less credibility may be given to new data, perhaps as indicated by different estimated variances. This corresponds to valuing responsiveness or stability more. Note the approach, or theory, is the same regardless of whether a unit of exposure gets more or less credibility.
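To make the balance concrete, here is a minimal sketch of the Bühlmann credibility estimate in standard notation (illustrative only, not a quotation of any particular formula in Bühlmann and Gisler 2006):

$$
\hat{\mu} = Z\,\bar{X} + (1 - Z)\,\mu_0,
\qquad
Z = \frac{n}{n + k},
\qquad
k = \frac{E[\sigma^2(\Theta)]}{\operatorname{Var}[\mu(\Theta)]},
$$

where $\bar{X}$ is the observed mean for the unit of exposure, $\mu_0$ is the collective mean, and $n$ is the number of observations. Noisier within-risk experience increases $k$ and lowers $Z$, which values stability; greater variation between risks lowers $k$ and raises $Z$, which values responsiveness. The theory is the same either way; only the weight $Z$ changes.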
The statistical learning literature has an analogous concept, the variance-bias tradeoff (Hastie, Tibshirani, and Friedman 2009). A more responsive, or flexible, model fit to a small data set will tend to mistake stochastic variation for signal about future behavior; the ensuing (squared) error is called variance. A more stable, or inflexible, model will be unable to reproduce all the system’s behavior no matter how much data it is given; the ensuing error is called statistical bias. There will tend to be less data on more novel behavior. Therefore, flexible models should be used with large data sets, and presumably more stable behavior, while inflexible models should be used with smaller data sets, and presumably more novel behavior. Many modern models try to automatically adjust their flexibility, and complexity, for a given data set. However, this adjustment is approximate, usually must be estimated from the same data set, and may not work well in practice.
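The tradeoff is usually written as a decomposition of expected squared prediction error at a point $x$ (the standard textbook identity, stated here in generic notation):

$$
E\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(E[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{E\Big[\big(\hat{f}(x) - E[\hat{f}(x)]\big)^2\Big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible error}},
$$

where $f$ is the true relationship, $\hat{f}$ is the fitted model, and $\sigma^2$ is the noise variance. Making a model more flexible shrinks the bias term but, with scarce data, inflates the variance term; making it less flexible does the reverse.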
Consider a real example of a potentially unstable environment, meaning one that exhibits novel behavior over time: litigation on property insurance in Florida over the past 10 years or so. First, it should be mentioned that in 2023 the Florida Legislature passed Senate Bill 2-A, which reformed Florida’s legal environment. There is good reason to believe it will reduce litigation and help the market. Historically, though, Florida has had far more litigation than its size warrants. Circa 2022, it accounted for about 81% of the nation’s property insurance litigation while having only 9% of its property insurance claims (Hilton 2023). The detailed causes do not concern us. What is important is that the quantity and outcome of the litigated cases depend on statute and case law. These changed in important ways. For example, prior to 2020, statute required the insurer to pay all legal costs whenever the plaintiff was awarded any damages. House Bill 7065 softened this requirement until it was eliminated in 2023. The effects of these changes on litigation outcomes can be imagined to be very complicated because they potentially impact the incentive of plaintiffs to settle.
Also problematic for modeling outcomes is that litigated cases often take years to resolve. By the time the actual outcomes were available, statute or case law had changed enough to make them potentially irrelevant.
The relevant question is this: could a sophisticated enough model plus detailed data on Florida policyholders, insured risks, plaintiff attorneys, etc., predict future litigated outcomes with reasonable accuracy, even as attorneys, judges, and legislative bodies changed the law? Note the model would not have to predict the occurrence of the law changes themselves, just their effects on the litigated outcomes. The future litigated outcome would usually be the award, if any, given to the plaintiff.
Suppose that, yes, the right model could make those predictions, even if it took detailed data and enormous computing power. This effectively assumes the insurance market is a particular kind of mechanism. Information about the mechanism’s parts is enough to predict the behavior of the whole. The solar system is the mechanistic system par excellence. Three laws of motion, plus one law of gravitation, apply to every solar body, or to every infinitesimally small piece of each solar body. This model, plus enough computational power and measured data, allows the prediction of the solar bodies’ movements with incredibly high precision. The historical success of this model has helped predispose all of us to find mechanistic models like it more compelling (Krüger 1987).
In this scenario, the actuary would not need to take the novelty of law changes very seriously. This is because the model could still predict litigated outcomes. The laws’ effects might look unpredictable to people, but they would not require any change in the type of model. Systems that are mechanistic like the solar system have this kind of static quality to them, even if their parts are moving.
Suppose instead that, no, there is no model that could predict litigated outcomes. It might be hard to imagine how this could be possible: surely, with enough data, a reasonably accurate prediction would become possible. One way this might happen would be if the relationships between the groups involved (policyholders, attorneys, the courts) were as important as the groups themselves. A model might have very detailed data on each policyholder, but if the law changes their relationship to the legal system, the model cannot predict the policyholders’ behavior. Systems like this, in which the relationships between the parts are as important as the parts themselves, are “emergent.”
Psychologist Michael Gazzaniga uses the internet as an example to help understand emergent phenomena (Gazzaniga 2011). Suppose a researcher wishes to understand a message sent from location A to location B over the internet. The design of the internet is such that the message will be split into parts and sent through many different intermediary computers. Monitoring each individual computer will only show part of the message and leave the researcher unable to decipher its meaning. The individual computers must be understood in terms of the whole, namely the sending of the entire message. In our example, a new plaintiff strategy might play a role analogous to the internet message. The new strategy would control behavior in a way that cannot be understood beforehand. Either new data elements will be needed, or existing elements will impact behavior in surprising ways.
Emergent behavior like this requires the actuary to take novelty much more seriously. Explaining the new behavior might require a radically different modeling approach than the old one. An analogous need is demonstrated by scientists in different fields, who must develop unique methods and theories for studying their systems of interest. In his seminal article on emergent behavior in science, P. W. Anderson writes:
[A]t each [new] level of complexity [in nature] new properties appear … Psychology is not applied biology, nor is biology applied chemistry (Anderson 1972).
Whether nature is fundamentally emergent is currently an open question, but not directly relevant here (Gillett 2016). It is enough that, in practice, many systems exhibit a kind of emergence, even if only in appearance to us. This emergence can be responsible for genuinely novel behavior that cannot be predicted by models.
The existence of emergent systems does not necessarily mean a given insurance environment will exhibit novel behavior. This leaves unresolved the question of whether the litigation outcomes in Florida could have been predicted with reasonable accuracy. However, it does establish this possibility, in this example and others. This means the actuary must take the possibility of novel behavior seriously.
The meditative view of rationality is not as helpful when thinking about novelty because it does not emphasize the relationship between thinking and the environment. Under this view, thinking does not radically change when the environment changes. For example, Bayesian decision theory is supposed to be universally valid. While it allows for genuine, irreducible error in our knowledge, it doesn’t have us change our basic way of thinking in response to it.
A different view of rationality that is more helpful for thinking about novelty is presented by the Ecological Rationality (ER) research program, an alternative and rival to the BaH program (Gigerenzer 2010).[2] Led by Gerd Gigerenzer, it presents what I will call a “relational” view of rationality: rational behavior must be evaluated relative to an environment. Environments with more novel behavior require radically different kinds of thinking than do those with less novel behavior. More complicated models may be appropriate to more stable environments. Heuristics like those used by people will tend to be more appropriate in less stable environments that are more likely to exhibit novel behavior.
To understand how less data may lead to better predictions, it can help to start with a concrete example. Gigerenzer (2015) discusses a test given to a group of German college students. They were given pairs of cities, and asked which had the higher population. Surprisingly, the students scored better on questions about American cities even though they were less familiar with them.
When the German students recognized only one of the city names, they tended to guess the one with the recognized name was the bigger one. This strategy is called the “recognition heuristic,” and it turns out to be fairly accurate at guessing the larger of two cities. Using it requires recognizing only one of the cities, which is why the German students scored better when asked about American cities. In that case, they were more likely to recognize only one of the cities.
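The rule is short enough to write down directly. Here is a minimal sketch in Python (the cities and the coin-flip fallback are illustrative, not taken from Gigerenzer’s experiments):

```python
import random

def recognition_heuristic(city_a: str, city_b: str, recognized: set[str]) -> str:
    """Guess which of two cities is larger using recognition alone.

    If exactly one city is recognized, pick it; otherwise fall back to a
    coin flip (a real subject would fall back on other knowledge).
    """
    a_known = city_a in recognized
    b_known = city_b in recognized
    if a_known and not b_known:
        return city_a
    if b_known and not a_known:
        return city_b
    return random.choice([city_a, city_b])

# Hypothetical example: a German student who recognizes Detroit but not Chattanooga.
print(recognition_heuristic("Detroit", "Chattanooga", recognized={"Detroit"}))
```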
Research has shown many other situations in which the recognition heuristic performs well. For example, casual amateur tennis players more accurately predicted Wimbledon winners than did professional tennis players who knew more (Gigerenzer 2015). Computer models may also throw data away. For example, a regularized linear model may ignore most available predictors. Katsikopoulos et al. (2021) gives additional examples of situations in which very simple models perform as well or better than more complicated ones.
It may seem strange to consider the recognition heuristic as rational. It is, after all, just a guess that is based on ignorance in some sense. And it is certainly not a universal strategy. We would expect it to only work in situations in which the rational agent is more likely to recognize the right answer. The ER program’s response to this challenge is that rationality is essentially an educated guess. Like any educated guess, it will only ever work well in certain environments or situations.
One major interest of the ER program is how well a cognitive strategy matches its environment. Here, it draws inspiration from evolutionary biology. Humans have evolved cognitive strategies. These strategies may originally have developed to exploit certain ecological “niches,” but they can now be applied more broadly (Gazzaniga 2011). In one circumstance, a person might use the recognition heuristic; in another, copy whatever other people are already doing; and so on. These different situations do not require a single, universal strategy, like Bayesian decision theory, that is applied differently based on the available data. Instead, different environments require radically different ways of thinking.
This means we should not adopt a single model for making inferences or decisions. Instead, we should have an “adaptive toolbox” of different models, or cognitive strategies, which we selectively apply to different environments.
The ER program is naturally concerned with those characteristics of an environment that make it well suited for a particular kind of model. As already discussed, the stability of the environment is important because instability limits the amount of relevant data available to optimize any model. An environment is stable when the relationship between predictors and outcomes does not change over time, as is the case in a casino, for example. Also important is how many variables are predictive of the targeted outcome, and whether those predictors are highly correlated with each other.
Simpler models like heuristics are often better than more complicated models in more unstable environments. This is because the less flexible heuristic will tend to be more robust to the environment’s changes in the relationship between predictors and outcomes. Heuristics will also tend to be more transparent to people because they more closely match how we tend to think. More complicated models will tend to perform better when there is the data to train them, and the stability for their parameters to stay relatively optimal into the future. This is supported by variance-bias tradeoff theory. One disagreement is over the statistical learning literature’s frequent recommendation of a model that automatically adjusts its flexibility. The argument here is that, in many cases, that adjustable model should be replaced with a heuristic that is inflexible by design.
This leads ER to a different evaluation of human rationality as compared to BaH. Humans are rational. They have an adaptive toolbox of mostly heuristics. And experiments show they tend to rationally select from that toolbox, meaning they tend to select a heuristic that performs well for a given environment and task. That scientists in the BaH program can devise a laboratory experiment in which a given heuristic fails is beside the point, because any real model will fail in certain environments.
So far, the fit between model and environment has been described in actuarial and statistical terms. To build a practical toolbox, it must also be described from a managerial and operational perspective. The author and researcher Amy Edmondson can help with this.
In her book Teaming, Edmondson writes in operational terms about how organizations learn. She implicitly assumes a relational model of rationality (Edmondson 2012). The relationships are between the members of a business team, and between that team and the inherent predictability or controllability of the business operation it manages. Edmondson divides business processes into three broad categories, from most predictable to least predictable.
Routine operations are well understood and controlled. Management tries to make them as efficient as possible using continuous improvement. An example from insurance might be a phone bank that answers routine policyholder or agent questions within a particular amount of time.
Complex operations are a mix of controllable and uncontrollable elements. Management tries to minimize the risk inherent to them. An example from insurance would be some kinds of claims adjusting, in which the risks of improper payment or litigation are minimized through adjusting rules and other procedures.
Innovation operations have highly unpredictable results. Management must seek to learn as much as possible from failure, which may be likely and even desirable. An example from insurance might be the creation of a new insurance product.
Complex and innovation operations will have routine elements. For example, the routine part of underwriting may be an initial review of a submission within 30 days, or the concrete application of underwriting rules, etc. The complex part of underwriting might be developing the underwriting rules to minimize risk, or the application of more complicated rules or judgement by senior underwriters.
I’ll now speculate on what a toolbox of models might look like. This is to give an idea of the sort of thinking that would be involved. There is not enough space to describe each model type, but references are given.
Heuristics include fast and frugal trees, tallying, and many others (Todd and Gigerenzer 2012). They are potentially appropriate in two situations. First, they are easier for people to understand, which matters in certain operational settings. In particular, the routine components of many operations seem designed so that they can be managed using relatively simple heuristics based on a few metrics, often called key performance indicators. Second, heuristics may deal well with the highly uncertain components of complex and innovative operations. For example, triage decisions made after disasters like 9/11 are based on a simple series of yes/no questions (Katsikopoulos et al. 2021).
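A fast and frugal tree is just a short sequence of yes/no questions, any of which can exit with a decision. Here is a minimal sketch for a hypothetical claims triage rule (the questions and the 5,000 threshold are invented for illustration; they are not the 9/11 triage protocol or any company’s actual rules):

```python
def triage_claim(injury_reported: bool, litigation_flagged: bool, amount: float) -> str:
    """Hypothetical fast and frugal tree: each question can exit immediately."""
    if injury_reported:
        return "assign senior adjuster"   # first cue exits on "yes"
    if litigation_flagged:
        return "refer to counsel"         # second cue exits on "yes"
    if amount <= 5_000:
        return "fast-track payment"       # third cue exits on "yes"
    return "standard adjusting queue"     # default exit

print(triage_claim(injury_reported=False, litigation_flagged=False, amount=2_500))
```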
Taking heuristics seriously may also mean that actuaries want to interview underwriters, claims adjusters, special investigators, and others about the rules they use in their jobs, and then test the rules’ efficacy with data. For example, it may be that certain underwriting decisions are better made by these simple rules. This should be established before trying to use more complicated computer models.
More familiar statistical and machine learning models seem well suited to complex and innovation operations in which a medium to large amount of data is available. This includes GLMs (generalized linear models), GAMs (generalized additive models), GBMs (gradient boosting machines), kernel machines, etc. (Hastie, Tibshirani, and Friedman 2009; Bishop 2006). For example, these may all be useful to predict future policy losses. If the available data is not directly relevant, it may be helpful to use a model that allows that data to be supplemented with judgement. For example, a new product that is partially like an existing product may be able to take advantage of the existing product’s data with some adjustments. It seems easier to make those adjustments with GLMs, GAMs, and perhaps some kernel machines than with GBMs or other tree-based models.
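As a minimal sketch of the most familiar case, here is a Poisson frequency GLM with an exposure offset fit with statsmodels (the data frame and column names are hypothetical stand-ins for a real policy-level data set):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical policy-level data: claim counts, earned exposure, and two rating variables.
policies = pd.DataFrame({
    "claims":    [0, 1, 0, 2, 0, 1],
    "exposure":  [1.0, 0.5, 1.0, 1.0, 0.8, 1.0],
    "territory": ["A", "A", "B", "B", "A", "B"],
    "age_group": ["young", "old", "old", "young", "old", "young"],
})

# Poisson frequency model; log(exposure) enters as an offset so the fitted
# coefficients are on a per-unit-of-exposure basis.
results = smf.glm(
    "claims ~ territory + age_group",
    data=policies,
    family=sm.families.Poisson(),
    offset=np.log(policies["exposure"]),
).fit()
print(results.summary())
```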
Deep neural nets (DNNs) seem like a good fit for routine operations. Because these operations are routine, the “right choice” is often obvious to a trained human. For example, a claims adjuster might look for damage in a picture of a roof after a hurricane. This is a very stable environment. A damaged roof today looks similar to a damaged roof 20 years ago. This means there will be very large data sets available for training the deep neural net’s billion-plus parameters, and those values will hopefully stay reasonably optimal over time. Language rules are also stable, and this is another good application for DNNs.
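As a minimal sketch of the kind of model involved, here is a toy convolutional classifier in PyTorch for a binary damaged/undamaged label (the architecture, the 64x64 image size, and the class count are placeholders; a production model would be far larger and trained on a large labeled image set):

```python
import torch
from torch import nn

class RoofDamageClassifier(nn.Module):
    """Toy convolutional network: roof image in, damaged/undamaged logits out."""

    def __init__(self) -> None:
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 16 * 16, 2)  # assumes 64x64 RGB input

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        return self.classifier(x.flatten(start_dim=1))

# One forward pass on a random batch of four 64x64 "images".
model = RoofDamageClassifier()
logits = model(torch.randn(4, 3, 64, 64))
print(logits.shape)  # torch.Size([4, 2])
```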
Games are the most stable of environments; the rules of chess or Go do not change. It should probably be unsurprising that DNNs are extremely good at games. While there are apparently some examples in the physical sciences in which the rules are fixed in ways analogous to games, I cannot think of any in insurance.
Generative models, for both images and language, can be understood as performing a kind of “reverse inference”: they take an inference or summary as input and output a set of probable predictors that would lead to that inference. For example, if the inference were “cat,” then the generated predictors would be the pixels in an image of a cat. These generative models might be useful for generating first drafts of documents in routine, complex or innovative operations.
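The contrast can be sketched with input and output types alone (the function bodies below are placeholders, not calls to any real library):

```python
def classify(pixels: list[float]) -> str:
    """Ordinary inference: many predictors in, one summary out (e.g., 'cat')."""
    raise NotImplementedError  # stands in for a trained discriminative model

def generate(summary: str) -> list[float]:
    """'Reverse inference': one summary in, a plausible set of predictors out
    (e.g., the pixels of an image of a cat)."""
    raise NotImplementedError  # stands in for a trained generative model
```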
Imagine a tool that demonstrates the matching of model to environment: a computerized aid for underwriters of homeowners policies reviewing a home submitted for coverage. The software searches pictures taken inside the home for “red flags” indicating possible causes of future losses. Potential red flags are identified by human underwriters; people are good at causal thinking. Whether these red flags actually increase the propensity for losses is verified using more familiar statistical models; these are appropriate to inference. And deep neural nets are used to identify the red flags in pictures; these are appropriate when there is a large amount of data and clear signal.
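A sketch of how the pieces might be composed (every name here is hypothetical; detect_red_flags stands in for the trained image model and flag_relativity for the fitted statistical model):

```python
def detect_red_flags(photo_bytes: bytes) -> list[str]:
    """Placeholder for a deep neural net that finds underwriter-defined red flags in a photo."""
    raise NotImplementedError

def flag_relativity(flags: list[str]) -> float:
    """Placeholder for a statistical model mapping verified red flags to a loss relativity."""
    raise NotImplementedError

def review_submission(photos: list[bytes]) -> float:
    """Underwriting aid: scan all photos, collect distinct flags, return an indicated relativity."""
    found = {flag for photo in photos for flag in detect_red_flags(photo)}
    return flag_relativity(sorted(found))
```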
There is one last important part of taking novelty seriously: the need to take responsibility for our ability to shape it. Unfortunately, there is only enough space to briefly discuss this issue.
The relational view emphasizes that model selection is a rational choice. Our ability as humans to pick a model, or other cognitive strategy, is an important part of our “intelligence.” Intelligence is a word that now has a wide range of meanings. Contemporary use often associates it with doing well at academics, especially math and science, or with the Intelligence Quotient (IQ) test. This fits well with the “meditative” view of rationality, which emphasizes using complex, step-by-step rules.
An older view of “intelligence” included the capability for aesthetic and moral reasoning. A person could not be intelligent by being only a calculator. This broader view of intelligence is easier to incorporate into a relational view of rationality. Finding a model that does a good job describing an environment does not need to be the only element that makes it rational (though this is certainly the focus of the ER program). Model selection may also be rational because it effects change in ways reflecting particular, intentional values. Under this view, models are, in the end, just tools. Therefore, it should not be surprising that the use of a data-heavy model does not necessarily achieve fairer outcomes. Fairness is a value-laden goal that requires the directed intelligence of a person to achieve.
To summarize, the approach advocated here stresses rationality as a relationship between the rational agent and the environment. To be rational is to make an educated guess well suited to the environment. Different environments should be expected to require radically different models rather than a single kind of model with adaptable flexibility. The environments can be described as insurance operations that are routine, complex, or innovation. Routine operations will tend to have the most available, relevant data, and innovation the least. People are rational and actuaries should take their heuristics seriously. These heuristics may be best suited for innovation operations, or routine operations that have been designed to be managed through a series of metrics. More traditional models may be best suited for complex and innovation operations. DNNs and other very flexible models are probably a good fit for some routine operations. Their applicability to complex and innovation operations seems more suspect. Finally, model selection should be done under the recognition that only people are capable of making scientific, aesthetic and moral judgements to intentionally shape change according to some set of values. The actuary must make sure that any model selection is consistent with their professional and company values.
[1] Stephen Senn argues that perhaps nobody practices fully subjective Bayesian inference and decision making in the sense advocated by de Finetti (Senn 2011).
[2] The earlier criticism, that BaH makes us appear irrational only because it adopts a particular view of rationality, is taken from Gigerenzer (2010). Kahneman (2011) mentions Gigerenzer once, in the footnotes, as a critic. It seems to me that their disagreement is mostly a philosophical one concerning how decisions should be made. However, this philosophical dispute has implications for the kinds of scientific questions that should be asked, so their disagreement takes on a scientific dimension.