Independent Research
Vol. Spring 2023, May 10, 2023 EDT

Credibility Based Smoothing Using Ghost Trend

Joseph Boor
Keywords: data smoothing, ratemaking, relativity, aggregate loss simulation, severity distribution simulation
Boor, Joseph. 2023. “Credibility Based Smoothing Using Ghost Trend.” CAS E-Forum Spring (May).

Abstract

Many actuarial tasks, such as analysis of pure premiums by amount of insurance, require an analysis of data that is split among successive “buckets” along a line. Often, there is also significant randomness in the data. That results in process error volatility that affects the (usually average) values of the data within the buckets, so some smoothing of these values is needed if they are to be truly useful. The “ghost trend” approach allows for a high-quality smoothing of those values. Therefore, it helps to produce smoothed values that serve better as relativity factors, loss distributions for pricing aggregate losses, and the like. An enhanced approach, integrating the ghost trend approach with other smoothing methods, is also provided. That composite approach provides additional flexibility in dealing with large datasets and datasets that are greatly affected by random differences from point to point.

1. INTRODUCTION

Smoothing “bumpy” or “volatile” data is not required in many actuarial analyses. However, it is often needed when rating factors vary by policy limits, amount of insurance, or other characteristics that fall into “buckets” along a line, or when a random sample is used to construct a severity distribution. Typically, the average frequency, severity, pure premium, percentage of sampled values, etc. of all the risks that fall into each bucket is the value assigned to the bucket, and for smoothing purposes the index assigned to each bucket is the midpoint of the range of the amount of insurance, etc. that the bucket covers.

Smoothing is especially relevant when the data in some or all of the buckets do not have adequate credibility. This paper begins with a credibility-based approach to smoothing that recognizes the credibility of individual buckets but still recognizes the tendency of a curve to move in a continuous way and at a continuous rate. Then, it expands the method to provide a broader tool kit for more challenging smoothing situations.

2. THE MODEL

This approach begins with a model that reflects certain assumptions. The situation, in more precise terms, is:

  • The data in each of the buckets has what might be termed process variance around the true, but unknown, values along an underlying curve. The data creates a statistical approximation to that curve which is more accurate at the points/buckets[1] where there is more data (less process variance) and less accurate elsewhere.

  • The underlying curve is assumed to be fairly smooth, so the point-to-point changes on the underlying curve are encouraged, but not forced, to follow the same slope as one moves along the curve.

  • On the other hand, most curves do not fall perfectly along a line, so the point-to-point changes along the curve should have random aspects but still retain a continuous looking shape.

  • Further, few actual curves seen in practice are generated from straight lines or linear relationships, so the trend must be allowed to change as one moves from point to point along the curve.

  • However, one would logically assume that the process errors, the general trend, and the point-to-point trends between adjacent points are all random.

Then, the next step is to develop a model that may be used to estimate the curve by smoothing the data.

3. THE SIMPLER MODEL: CONSTANT UNDERLYING TREND

In this model, the process error variances vary from point to point but are either known or estimated with reasonable accuracy. Under best estimate credibility (see Boor 1992), that process error would be part of a multiplicative inverse of credibility, so points with high process error should receive less weight in deriving the smoothed curve. Then, to illustrate the situation, there would be observed data points S1, S2, …, Sn that differ from the unknown true values p1, p2, …, pn by process errors with variances of σ21, σ22, …, σ2n, respectively.

If one defines each “change” Di+1 as the point-to-point change Di+1 = pi+1−pi, one might view the changes as driven by trend. That is especially true in this case, where the data is evaluated at the points 1, 2, …, n, or in other situations where the indices (the i's) are equally spaced. The expected trend is assumed to be constant, at some slope G. However, since one would not expect the underlying values to lie perfectly on a line, one must allow the actual (and also unknown) changes to vary from period to period around G, with some variance τ2.
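Stated compactly (this is only a restatement of the assumptions above in the paper's notation, not additional structure), the simpler model is

$$ S_i = p_i + \varepsilon_i,\quad \operatorname{Var}(\varepsilon_i)=\sigma_i^2, \qquad D_{i+1} = p_{i+1}-p_i,\quad E[D_{i+1}]=G,\quad \operatorname{Var}(D_{i+1})=\tau^2 . $$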

Table 1 shows an example of the consequent smoothing process. The assumed constant overall trend (not yet the ghost trend), in this case 0.75, is specified in column (7). The process variances are specified along with the observed values, G = 0.75, and τ2 = 0.80. The actual data values input to the process and estimates of their variances around the “true” underlying expected values are included in columns (2) and (3). The fitted curve, the pi's that the process finds, is in column (4). The normalized fit error (the squared difference between the raw data and the pi's, divided by the variance associated with the raw data point) is computed in column (5). The estimate of the “local” trend (using the p's between the (i−1)th and ith steps) is computed in column (6). The constant overall governing trend (in this case, 0.75) is posted in column (7). Lastly, how far the “local trend” has drifted away from that constant trend is computed by taking the squared difference between the local trend and the selected overall trend value, then dividing the result by a common preselected τ2 = 0.8. That result is shown in column (8).

Table 1. Flexible Trend with Fixed Expected Trend and Drift Variance
Reference Values:
A. τ2 = 0.80
B. G = 0.75
Columns:
(1) i: data evaluation point (data)
(2) Si: observed value (data)
(3) σ2i: estimated variance (data)
(4) pi: estimated true value (chosen to minimize C)
(5) Normalized fit error: (pi−Si)2/σ2i, i.e., [((4)−(2))^2]/(3)
(6) Di: estimated local trend, Di = pi−pi−1, i.e., (4) − previous (4)
(7) G: global trend (reference value B above)
(8) Normalized trend drift: (G−Di)2/τ2, i.e., [((7)−(6))^2]/A
Data rows (columns (1)–(8)):
 
1 10 36 8.766 0.045
2 7 25 9.488 0.239 0.723 0.75 0.001
3 13 4 10.291 1.896 0.802 0.75 0.003
4 9 1 10.551 2.237 0.260 0.75 0.300
5 15 4 12.052 2.338 1.501 0.75 0.705
6 10 16 12.963 0.482 0.911 0.75 0.033
7 11 16 14.022 0.472 1.059 0.75 0.120
8 16 1 15.233 1.301 1.211 0.75 0.265
9 15 36 15.830 0.000 0.597 0.75 0.029
10 18 4 16.446 1.881 0.616 0.75 0.023
11 9 36 16.751 0.101 0.305 0.75 0.248
12 6 64 17.228 0.090 0.477 0.75 0.093
13 15 16 17.845 0.004 0.617 0.75 0.022
14 20 4 18.605 0.281 0.760 0.75 0.000
15 14 36 19.085 0.005 0.481 0.75 0.091
16 13 16 19.679 0.940 0.594 0.75 0.031
17 26 16 20.607 1.291 0.928 0.75 0.039
18 28 36 21.265 0.165 0.658 0.75 0.011
19 21 4 21.773 0.482 0.508 0.75 0.073
20 22 4 22.436 1.711 0.663 0.75 0.009
Variance Subtotals 20.447 2.095
C.= Total Variance = 22.542

(The values of the pi's in column (4), shown in gray in the original spreadsheet, were selected by the optimization routine to minimize the total variance cell, C., shown in yellow.)
This is also visualized graphically in Figure 1.

Then the goal is to find the pi's that simultaneously fit the data well and yet provide a smooth curve. However, it is easier to compute numerical measures of “does not fit the data well” and “is not smooth” than to measure the original goals directly. Specifically, the sum of all the entries in column (5), the normalized fit error, represents “does not fit the data well.” The sum of the entries for the drift in the local trend in column (8) represents “is not smooth,” albeit indirectly. Then one seeks to reduce (minimize, speaking in numerical terms) those values.

Essentially, using a computer minimization routine, the process finds the pi points that minimize the standard squared differences (squared differences divided by variances) between those points and the observed data, the Si's. It simultaneously minimizes the standard squared differences between the pi+1−pi's and G as well. First, a cell or variable adding together the subtotals of columns (5) and (8) is included in the chart and highlighted in yellow. Then the value in that yellow target cell is minimized by finding the pi's that create the lowest possible value of the target. For reference, the solver routine in standard spreadsheet software was used to compute the pi's in all of the tables in this article.
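For readers who prefer a programmable statement of the same minimization, the sketch below reproduces the Table 1 calculation using a general-purpose optimizer in place of the spreadsheet solver. It is an illustrative sketch, not the author's implementation; the starting point and the choice of scipy's L-BFGS-B routine are assumptions.

```python
# Minimal sketch of the constant-expected-trend smoother of Table 1.
import numpy as np
from scipy.optimize import minimize

S   = np.array([10, 7, 13, 9, 15, 10, 11, 16, 15, 18,
                9, 6, 15, 20, 14, 13, 26, 28, 21, 22], dtype=float)   # column (2)
var = np.array([36, 25, 4, 1, 4, 16, 16, 1, 36, 4,
                36, 64, 16, 4, 36, 16, 16, 36, 4, 4], dtype=float)    # column (3)
G    = 0.75   # assumed constant expected trend (reference value B)
tau2 = 0.80   # drift variance around G (reference value A)

def objective(p):
    fit_error   = np.sum((p - S) ** 2 / var)     # column (5): normalized fit error
    D           = np.diff(p)                     # column (6): local trends D_i = p_i - p_{i-1}
    trend_drift = np.sum((G - D) ** 2 / tau2)    # column (8): normalized trend drift
    return fit_error + trend_drift               # target cell C

result = minimize(objective, x0=S.copy(), method="L-BFGS-B")
p_smoothed = result.x                            # plays the role of column (4)
```

The resulting p_smoothed values play the role of column (4); in principle any minimizer that can handle roughly twenty free variables would serve.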

In this case, the fitted curve looks like it could reasonably be a smoothed version of the data. However, in this special case, the data does show a steady uptick, mirroring the assumption of constant governing trend. Hence, something similar to a straight line can be an effective smoothing of this particular data.

On the other hand, when the raw data has a ‘hump,’ or other curvature, the results of this smoothing method do not fit the data as well. In Figure 2, all the parameters are exactly the same as they were in Table 1 and Figure 1 (except that the target “C” is minimized by selecting new pi's), but the raw data values are more U-shaped.

Figure 1. Fitted vs. Raw Data
Figure 2. Fitted vs. Raw Data with ‘Hump’

Of course, some of the fit problems may be mitigated by changing the values of the drift parameter τ2 and the expected trend G. One may vary both of those, as well as the various pi's, to produce a better match to the data. In fact, varying those values produces the graph in Figure 3 (reusing the Figure 2 raw data).

However, one may readily see that this effectively eliminates all smoothing. Considering the alternatives, the use of a constant expected trend in the simple model limits its ability to appropriately smooth data with “humps.”

4. INCLUDING THE GHOST TREND

Since the constant underlying trend G limits the ability of the smoothing process in Section 3 to mimic curves, it would be logical to enhance G by allowing it to change as one moves among the data points. Therefore, one would no longer specify a constant expected trend (as Table 1 did), but rather find the expected local trends that best match the data. Of course, there should be some control on the point-to-point changes, or something like Figure 3 will recur. The result is that one will have a set of “nearly invisible” Gi's, governed by a requirement that the difference between each Gi and Gi+1 follow a probability distribution (in this article they are specified to vary with a mean of zero and a constant variance of some δ2). The Di's would continue to vary around a trend, except that now each Di will vary around its individual Gi with mean zero and a variance of some prespecified τ2.

Figure 3. Choosing All Parameters to Minimize Variance

The unobserved, indirectly estimated, and “nearly invisible” Gi's affect the Di's; the Di's help to determine the pi's; and those smooth the Si's, which are the only hard data in the process. So, in a sense, the Gi's are shadows of shadows of the data. It is then logical to describe them as “ghost trend.”

In this case, the total standard squared error to be minimized still includes that of the differences between the pi's and the Si's and that between the Di's and, now, the Gi's. However, it also includes the squared differences (divided by δ2) between each Gi and the Gi−1 that preceded it. Table 2 illustrates this process.
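In symbols (again, only a restatement of the quantity just described), the total to be minimized is

$$ C=\sum_{i=1}^{n}\frac{(p_i-S_i)^2}{\sigma_i^2}+\sum_{i=2}^{n}\frac{(D_i-G_i)^2}{\tau^2}+\sum_{i=3}^{n}\frac{(G_i-G_{i-1})^2}{\delta^2},\qquad D_i=p_i-p_{i-1}. $$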

Table 2. Fit to Data with ‘Hump’ Using Flexible Trend Governed by Ghost Trend
Reference Values:
A. τ2 = 0.8
B. δ2 = 0.0625
Columns:
(1) i: data evaluation point (data)
(2) Si: observed value (data)
(3) σ2i: estimated variance (data)
(4) pi: estimated true value (chosen to minimize C)
(5) Normalized fit error: (pi−Si)2/σ2i
(6) Di: observed step-by-step difference in the p's, Di = pi−pi−1
(7) Gi: ghost trend (chosen to minimize C)
(8) Normalized ghost trend squared difference: (Gi−Gi−1)2/δ2
(9) Normalized squared difference of Di from Gi: (Gi−Di)2/τ2
Data rows (columns (1)–(9)):
 
1 10 36 9.038 0.026
2 7 25 9.677 0.287 0.640 0.661 0.001
3 13 4 10.407 1.681 0.730 0.662 0.000 0.006
4 9 1 10.613 2.600 0.205 0.659 0.000 0.257
5 15 4 12.141 2.044 1.528 0.690 0.016 0.878
6 10 16 13.067 0.588 0.926 0.657 0.018 0.091
7 11 16 14.094 0.598 1.027 0.602 0.048 0.226
8 16 1 15.188 0.660 1.094 0.514 0.123 0.420
9 15 36 15.496 0.007 0.309 0.381 0.284 0.007
10 18 4 15.686 1.338 0.190 0.253 0.261 0.005
11 13 36 15.290 0.146 -0.396 0.132 0.236 0.349
12 17 64 14.864 0.071 -0.426 0.050 0.107 0.283
13 14 16 14.373 0.009 -0.491 0.006 0.032 0.309
14 15 4 13.891 0.307 -0.482 0.000 0.000 0.290
15 13 36 13.191 0.001 -0.701 0.000 0.000 0.613
16 9 16 12.489 0.761 -0.702 0.000 0.000 0.616
17 8 16 11.963 0.981 -0.526 0.000 0.000 0.346
18 10 36 11.634 0.074 -0.329 0.000 0.000 0.135
19 11 4 11.342 0.029 -0.291 0.000 0.000 0.106
20 10 4 11.118 0.313 -0.224 0.000 0.000 0.063
 
Variance Subtotals 12.521 1.125 4.999
C. Total Variance 18.645

To implement the ghost trend process, the chart in Table 2 adds two columns to the chart from Table 1. The constant trend in the previous example is replaced with a column (7) of ghost trend. To keep the ghost trend smooth, a “penalty” column containing the squares of the differences between successive values of the ghost trend is included as column (8). To control the balance among that column, the fit error penalty column (5), and column (9), which penalizes the drift of the actual trends (the Di's) from the “expected” ghost trends (the Gi's), each term in column (8) is divided by the δ2 discussed above.

Further, in this case, the sum of the fit error, the drift penalty, and the ghost trend smoothing penalty must be computed in the target cell/variable to be minimized. So, the total variance in C. below (in yellow) sums all three. The spreadsheet minimization routine was directed to minimize that value by choosing the values of the  pi's and Gi's in columns (4) and (7). The results are shown in Table 2. Of course, the key values are the smoothed values, the  pi's, in column (4).
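A sketch of the same calculation in Python follows; as before, scipy's optimizer stands in for the spreadsheet solver, and the starting values are assumptions.

```python
# Minimal sketch of the ghost-trend smoother (Table 2 layout): the optimizer
# now chooses both the smoothed p_i's and the ghost trends G_i.
import numpy as np
from scipy.optimize import minimize

def ghost_trend_smooth(S, var, tau2, delta2):
    """Minimize fit error + trend drift + ghost-trend roughness."""
    S, var = np.asarray(S, dtype=float), np.asarray(var, dtype=float)
    n = len(S)

    def objective(x):
        p, G = x[:n], x[n:]                                 # G_i pairs with D_i = p_i - p_{i-1}
        D = np.diff(p)
        fit_error    = np.sum((p - S) ** 2 / var)           # column (5)
        trend_drift  = np.sum((D - G) ** 2 / tau2)          # column (9)
        ghost_smooth = np.sum(np.diff(G) ** 2 / delta2)     # column (8)
        return fit_error + trend_drift + ghost_smooth       # target C

    x0 = np.concatenate([S, np.diff(S)])   # start p at the raw data and G at the raw slopes
    res = minimize(objective, x0, method="L-BFGS-B")
    return res.x[:n], res.x[n:]            # smoothed p_i's and fitted ghost trend G_i's

# For the Table 2 example one would call, e.g. (hypothetical variable names):
# p_fit, g_fit = ghost_trend_smooth(S_hump, var_hump, tau2=0.8, delta2=0.0625)
```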

As one may see in Figure 4, this provides a better fit (conforms better) to the last half of the data from Figure 3.

Figure 4. Ghost Trend vs. Fixed Expected Trend
Figure 5. Ghost Trend vs. Fixed Expected Trend on Steep Hump Using δ2 = .0625

Now, in this case, the hump is fairly modest. However, when a more pronounced hump (such as a normal distribution with a low variance) is involved, the difference may be more significant. Figure 5 looks at data with a steeper hump but continues using the existing δ2 of .0625.

Figure 6. Ghost Trend vs. Fixed Expected Trend on Steep Hump Using δ2 = 5

In this case, the fit is improved, but still not that desirable. However, this approach also allows one to vary the “long-term flexibility” δ2. When the δ2 parameter is increased to five in Figure 6, the fit is demonstrably superior to that of the simpler model.

Increasing δ2 results in smoothing with a fairly good fit. This illustrates a key point. Both approaches require selecting two parameters. For the fixed expected trend approach, the trend and the short-term flexibility (or inverse of smoothness) parameter τ2 must be selected. For the ghost trend, τ2 must still be chosen, but one also chooses the long-term flexibility parameter δ2. Therefore, actual implementation may involve judgments of how much smoothness is desired and how much replication of the data, or “fit,” is required.

It would be desirable if some proper optimum set of parameters for smoothing could be identified, but, considering Figure 3, that appears to be impossible. Nevertheless, given proper judgment-based selections of the flexibility parameters, this appears to be a very good, structured tool for smoothing data.

5. AN EXAMPLE: USING AN ENHANCED GHOST TREND ANALYSIS ON VERY CHALLENGING DATA

Sometimes fitting a curve can be difficult even when the ghost trend process is used. For example, in the process of preparing a separate article related to assessing transfer of risk (Boor 2021), it was necessary to use an aggregate loss distribution with the number of claims generated by a Poisson random variable with a mean of five hundred and the severity of each claim following a Pareto distribution with an alpha value of 1.5 and a truncation point of 100,000. The simulation[2] was not overly hard to generate. However, the results of the simulation, graphed in Figure 7 and supported by the data in Table 3, are fairly “bumpy,” even after combining thirty thousand trials into one hundred bands. Due to the large number of bands, only the first and last ten rows of the table are shown.

Table 3.Bucketed Raw Data From Monte Carlo Simulations.
Top of Histogram "Bucket" Number of Claims up to Top of Bucket Number of Claims in Bucket Relative Frequencies
60,000,000 36 36 0.00120
62,000,000 113 77 0.00257
64,000,000 240 127 0.00423
66,000,000 440 200 0.00667
68,000,000 774 334 0.01113
70,000,000 1,265 491 0.01637
72,000,000 1,866 601 0.02003
74,000,000 2,713 847 0.02823
76,000,000 3,780 1,067 0.03557
78,000,000 5,014 1,234 0.04113
…
240,000,000 29,706 6 0.00020
242,000,000 29,708 2 0.00007
244,000,000 29,713 5 0.00017
246,000,000 29,720 7 0.00023
248,000,000 29,729 9 0.00030
250,000,000 29,736 7 0.00023
252,000,000 29,740 4 0.00013
254,000,000 29,743 3 0.00010
256,000,000 29,748 5 0.00017
258,000,000 29,756 8 0.00027
Total Probability in this Range 0.9918
Figure 7. Raw Results of Aggregate Loss Simulation.

For reference, the labels correspond to the top ends of the buckets, each of which has a width of $2 million. However, as one may see, a shift of $1 million to the left would not meaningfully change the appearance of the curve.

(Also, less than 1% of the curve lies in the tail beyond this range. However, due to the skewness of the distribution, graphing that would place most of the attention where the fewest losses are.)
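For readers who wish to reproduce something like this dataset, the following sketch generates an aggregate loss sample of the kind described above. The single-parameter Pareto form, the seed, and the exact bucket edges are assumptions for illustration; the paper's simulation used NTRAND and spreadsheet functions.

```python
# Minimal sketch of the aggregate loss simulation: Poisson(500) claim counts,
# Pareto severities with alpha = 1.5 and an assumed lower bound of 100,000.
import numpy as np

rng = np.random.default_rng(2023)   # assumed seed

n_trials = 30_000
lam      = 500        # Poisson mean claim count
alpha    = 1.5        # Pareto shape
x_min    = 100_000    # assumed interpretation of the 100,000 "truncation point"

aggregates = np.empty(n_trials)
for t in range(n_trials):
    n_claims = rng.poisson(lam)
    u = 1.0 - rng.random(n_claims)               # uniform on (0, 1]
    severities = x_min * u ** (-1.0 / alpha)     # inverse-transform Pareto draws
    aggregates[t] = severities.sum()

# Bucket into $2 million bands with tops from 60 million to 258 million, as in Table 3.
edges = np.arange(58_000_000, 258_000_001, 2_000_000)
counts, _ = np.histogram(aggregates, bins=edges)
rel_freq = counts / n_trials                     # analogous to the last column of Table 3
```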

The next step is, of course, to use the ghost trend approach. The approach is identical to that of Table 2, but with more data and different values for the constants. Note that, as the calculations unfold, the rationale behind the specific constants used will become clear. The resulting graph (including the raw data it began with) is in Figure 8.

Figure 8. Comparison of Results of Initial Ghost Trend-Generated Curve to Raw Data.

As one may see, in this case and with the given assumptions, the ghost trend approach alone does not match this numerous and volatile raw data very well. The associated calculations are shown, again for the first and last ten rows, in Table 4.

Table 4. Ghost Trend Applied to Bucketed Raw Data
Reference Values:
A. τ2 = 0.005
B. δ2 = 0.005
Columns:
(1) Top of each bucket (data)
(2) Si: observed frequency (data)
(3) σ2i: binomial-based estimated variance (data)
(4) pi: estimated true frequency (chosen to minimize C)
(5) Normalized fit error: (pi−Si)2/σ2i
(6) Di: observed step-by-step difference in the p's, Di = pi−pi−1
(7) Gi: ghost trend (chosen to minimize C)
(8) Ghost trend squared difference: (Gi−Gi−1)2 (the 1/δ2 divisor is applied to the column total below)
(9) Squared difference of Di from Gi: (Gi−Di)2 (the 1/τ2 divisor is applied to the column total below)
Data rows (columns (1)–(9)):
 
60,000,000 0.00120 35.664 0.0215 1.154E-05
62,000,000 0.00257 76.178 0.0223 5.116E-06 8.237E-04 8.211E-04 7.082E-12
64,000,000 0.00423 125.434 0.0231 2.848E-06 8.247E-04 8.202E-04 7.622E-13 2.001E-11
66,000,000 0.00667 197.051 0.0240 1.516E-06 8.194E-04 8.153E-04 2.361E-11 1.676E-11
68,000,000 0.01113 327.595 0.0248 5.669E-07 8.088E-04 7.984E-04 2.884E-10 1.097E-10
70,000,000 0.01637 479.036 0.0256 1.761E-07 7.895E-04 7.790E-04 3.738E-10 1.105E-10
72,000,000 0.02003 584.170 0.0263 6.733E-08 7.544E-04 7.548E-04 5.866E-10 1.211E-13
74,000,000 0.02823 816.392 0.0270 1.752E-09 7.327E-04 7.211E-04 1.136E-09 1.337E-10
76,000,000 0.03557 1020.681 0.0277 6.021E-08 6.898E-04 6.806E-04 1.643E-09 8.620E-11
78,000,000 0.04113 1173.618 0.0284 1.389E-07 6.362E-04 6.282E-04 2.741E-09 6.336E-11
…
240,000,000 0.00020 5.950 0.0002 1.327E-10 1.639E-05 1.802E-05 1.950E-11 2.659E-12
242,000,000 0.00007 1.984 0.0002 1.435E-08 7.253E-06 8.141E-06 9.754E-11 7.873E-13
244,000,000 0.00017 4.959 0.0002 7.363E-10 -8.264E-06 -2.726E-06 1.181E-10 3.067E-11
246,000,000 0.00023 6.941 0.0002 8.163E-11 -1.756E-05 -1.790E-05 2.302E-10 1.136E-13
248,000,000 0.00030 8.924 0.0002 1.788E-09 -3.586E-05 -3.131E-05 1.800E-10 2.064E-11
250,000,000 0.00023 6.941 0.0001 1.371E-09 -3.789E-05 -3.332E-05 4.030E-12 2.085E-11
252,000,000 0.00013 3.967 0.0001 1.000E-10 -2.237E-05 -2.033E-05 1.687E-10 4.156E-12
254,000,000 0.00010 2.975 0.0001 1.503E-10 7.733E-06 -2.647E-06 3.128E-10 1.078E-10
256,000,000 0.00017 4.959 0.0001 2.022E-10 1.385E-05 2.570E-06 2.723E-11 1.273E-10
258,000,000 0.00027 7.933 0.0001 1.867E-09 9.954E-06 9.669E-06 5.039E-11 8.119E-14
                 
Column Totals 0.99187 3.027E-05 4.544E-07 1.265E-08
                 
Adjustment Factors 1 2.000E+03 2.000E+03
  (no change) (1/δ2) (1/τ2)
Final Components 3.027E-05 9.089E-04 2.530E-05
 
C. Sum of All Components (Value to Minimize) 9.644E-04

Table 4 partially explains the poor fit in this case. Although the trend values are modified by τ2 and δ2, which may be titrated up and down for the desired degree of “stiffness,” the impact of the fit error column (5) is greatly affected by the variances in column (3). Further, although the sum of column (5) is much larger than that of column (8), the effect of column (8) is increased by an “adjustment factor” (discussed below) of 2,000, so the trend controls in column (8) have roughly the same impact as the fit (or accuracy) controls. So, this may be thought of as a roughly fifty/fifty balance between smoothness and fit accuracy. Note also, for reference, that the variance-based divisors 1/τ2 and 1/δ2 are now applied at the bottom of each column rather than used in the calculation of the individual column entries.

As used above, some additional adjustment factors are also employed. Certainly, one is needed for the fit error column. As it turns out, though, two additional columns are both helpful in controlling the accuracy and smoothness of the fitted curve. First, a review of column (5) will show that the fit errors that enforce accuracy are much lower in the upper end of the range. Therefore, since small changes in the smoothed values there would not generate much change in the total error value at the bottom of the chart, one might argue that there is less emphasis on accuracy in the upper end of the range. However, readers might desire a smooth curve that works in different contexts with different requirements. So, there is now an additional column, with its own adjustment factor, containing a relative fit error: the difference between each raw value and its p value is divided by the raw value before the result is squared and divided by the variance. That gives more weight to the smaller values, for a more consistent fit.

Another adjustment is included in this version. Noticing the sort of “granular bumpiness” (a high degree of small, high-slope oscillations near the tail), a direct control against abrupt changes in the slope is now included in the new column (8). It is simply the squared difference between each slope Di and the previous slope Di−1. This penalizes abrupt changes in the trend/slope. The adjustment factor applied to the total of those values titrates its influence on the fitted curve.
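Putting the pieces together, a sketch of the enhanced objective appears below. The weights correspond to the adjustment factors in Table 5 (1E-22, 1, 1/λ2, 1/δ2, 1/τ2); the slope-change penalty is written as the plain squared difference of successive slopes described in the text, and all parameter values shown are judgmental assumptions.

```python
# Minimal sketch of the enhanced ghost-trend objective (Table 5 layout).
import numpy as np

def enhanced_objective(x, S, var,
                       w_fit=1e-22,     # column (5): absolute fit error, effectively zeroed out
                       w_rel=1.0,       # column (6): relative fit error
                       w_slope=1.0,     # column (8): 1/lambda^2, abrupt slope-change penalty
                       w_ghost=2000.0,  # column (10): 1/delta^2
                       w_drift=2000.0): # column (11): 1/tau^2
    """Value 'D.' of Table 5 for a stacked vector x = (p_1..p_n, G_2..G_n)."""
    n = len(S)
    p, G = x[:n], x[n:]
    D = np.diff(p)                                   # step-by-step differences in the p's

    fit       = np.sum((p - S) ** 2 / var)           # normalized fit error
    rel_fit   = np.sum(((p - S) / S) ** 2 / var)     # normalized relative fit error
    slope_chg = np.sum(np.diff(D) ** 2)              # changes in slope from step to step
    ghost_chg = np.sum(np.diff(G) ** 2)              # changes in the ghost trend
    drift     = np.sum((D - G) ** 2)                 # drift of each D_i away from G_i

    return (w_fit * fit + w_rel * rel_fit + w_slope * slope_chg
            + w_ghost * ghost_chg + w_drift * drift)
```

This function can be passed to the same kind of optimizer used earlier; only the objective changes.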

Those adjustments result in the curve in Figure 9. Note that the general shape is somewhat close to acceptable, but there is still so much bumpiness that, within the scale of the graph, it cannot be distinguished from the raw data.

Figure 9. Comparison of Results of Ghost Trend with Three Adjustments to Raw Data.

That curve is generated by the process in Table 5. Again, in this case, all the normalizing divisions, by τ2, δ2, etc. take place at the bottom of the Table.

Table 5. Enhanced Ghost Trend Applied to Bucketed Raw Data
Reference Values:
A. τ2 = 0.0005
B. δ2 = 0.0005
C. λ2 = 1
Columns:
(1) Top of each bucket (data)
(2) Si: observed frequency (data)
(3) σ2i: binomial-based estimated variance (data)
(4) pi: estimated true frequency (chosen to minimize D)
(5) Normalized fit error: (pi−Si)2/σ2i
(6) Normalized relative fit error: [(pi−Si)/Si]2/σ2i
(7) Di: observed step-by-step difference in the p's, Di = pi−pi−1
(8) Square of the change in slope from step to step: ((7) − previous (7))^2
(9) Gi: ghost trend (chosen to minimize D)
(10) Ghost trend squared difference: (Gi−Gi−1)2
(11) Squared difference of Di from Gi: (Gi−Di)2
(The 1/λ2, 1/δ2, and 1/τ2 divisors for columns (8), (10), and (11) are applied to the column totals in the Adjustment Factors row.)
Data rows (columns (1)–(11)):
 
60,000,000 0.00120 35.664 0.0012 2.787E-18 1.93562E-12
62,000,000 0.00257 76.178 0.0026 1.506E-17 2.28535E-12 1.367E-03 -1.018E-05 1.896E-06
64,000,000 0.00423 125.434 0.0042 4.839E-17 2.6999E-12 1.667E-03 3.238E-02 7.193E-04 5.321E-07 8.974E-07
66,000,000 0.00667 197.051 0.0067 3.864E-18 8.69503E-14 2.433E-03 9.932E-02 -1.062E-05 5.327E-07 5.973E-06
68,000,000 0.01113 327.595 0.0111 9.869E-20 7.96214E-16 4.467E-03 2.072E-01 -1.125E-05 4.011E-13 2.005E-05
70,000,000 0.01637 479.036 0.0164 1.115E-16 4.16301E-13 5.234E-03 2.147E-02 -1.410E-05 8.131E-12 2.754E-05
72,000,000 0.02003 584.170 0.0200 1.496E-17 3.72645E-14 3.666E-03 8.967E-02 -1.742E-05 1.099E-11 1.357E-05
74,000,000 0.02823 816.392 0.0282 3.260E-17 4.09011E-14 8.200E-03 3.057E-01 -2.125E-05 1.468E-11 6.759E-05
76,000,000 0.03557 1020.681 0.0356 6.132E-17 4.84768E-14 7.333E-03 1.119E-02 -1.895E-05 5.296E-12 5.405E-05
78,000,000 0.04113 1173.618 0.0411 1.512E-17 8.93808E-15 5.567E-03 5.799E-02 -1.425E-05 2.204E-11 3.115E-05
…
240,000,000 0.00020 5.950 0.0002 2.490E-17 6.22544E-10 -6.668E-05 2.778E+00 -1.693E-05 1.397E-09 2.474E-09
242,000,000 0.00007 1.984 0.0001 7.705E-17 1.73352E-08 -1.333E-04 9.985E-01 -1.159E-05 2.851E-11 1.481E-08
244,000,000 0.00017 4.959 0.0002 1.709E-23 6.15376E-16 9.999E-05 5.444E+00 -1.232E-05 5.314E-13 1.261E-08
246,000,000 0.00023 6.941 0.0002 2.284E-20 4.19567E-13 6.667E-05 1.111E-01 -1.604E-05 1.382E-11 6.841E-09
248,000,000 0.00030 8.924 0.0003 1.006E-24 1.11793E-17 6.667E-05 1.405E-10 -1.682E-05 6.048E-13 6.970E-09
250,000,000 0.00023 6.941 0.0002 8.766E-22 1.61011E-14 -6.667E-05 4.000E+00 -1.738E-05 3.185E-13 2.429E-09
252,000,000 0.00013 3.967 0.0001 2.141E-19 1.20438E-11 -1.000E-04 2.500E-01 -1.544E-05 3.773E-12 7.150E-09
254,000,000 0.00010 2.975 0.0001 2.233E-22 2.23296E-14 -3.333E-05 4.000E+00 -1.455E-05 7.875E-13 3.527E-10
256,000,000 0.00017 4.959 0.0002 2.246E-24 8.08513E-17 6.667E-05 2.250E+00 -1.554E-05 9.695E-13 6.758E-09
258,000,000 0.00027 7.933 0.0003 7.505E-23 1.05533E-15 1.000E-04 1.111E-01 -2.171E-05 3.805E-11 1.481E-08
 
Column Totals 0.991866667 7.881E-08 3.619E-01 9.455E+03 1.595E-06 4.424E-04
 
Adjustment Factors 1E-22 1 1.000E+00 2.000E+03 2.000E+03
(zeroed out) (no change) (1/λ2) (1/δ2) (1/τ2)
Final Components 0.000E+00 3.619E-01 9.455E+03 3.190E-03 8.848E-01
D. Sum of All Components (Value to Minimize) 9.457E+03

To finish the curve, a different smoothing process, centered five-point averaging, was used. That produced the quite acceptable curve in Figure 10.
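A centered five-point average is simple enough to state directly; the sketch below shows one common form, with the endpoint handling (repeating the edge values) an assumption.

```python
# Minimal sketch of a centered five-point moving average.
import numpy as np

def five_point_average(y):
    """Centered five-point average; the ends are padded by repeating the edge values."""
    y = np.asarray(y, dtype=float)
    padded = np.pad(y, 2, mode="edge")
    return np.convolve(padded, np.ones(5) / 5.0, mode="valid")
```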

Figure 10. Comparison of Results of Five Point Averaged Enhanced Ghost Trend Curve with Raw Data.

For reference, a curve comparing the fit of straight five-point averaging (on the scale above) to that of this process is provided in Figure 11.

Figure 11. Comparison of Results of Five Point Averaged Enhanced Ghost Trend Curve with Raw Data.

As one may see, due to the relatively small changes from point to point and the large volume of points, straight five-point averaging is almost as good as, or as good as, the ghost trend+ process. However, the example still serves well as an illustration of the process, even if it does not show the process to best advantage.

For an example where the ghost trend+ process is clearly superior, one need only look at the same situation, but with 2,000 points in the sample rather than 30,000.

Figure 12. Comparison of Results of Fully Enhanced 2000 Samples Ghost Trend Curve with Five Point Averaged Raw Data, Raw Data, and 30,000 Samples Ghost Trend+ Curve.

There are several curves in the graph, but a few things are visible:

  • The final ghost trend+ curve is smooth and fits the data well;

  • It is also fairly close to the presumably more accurate curve resulting from 30,000 samples of the underlying distribution; and

  • Five-point averaging on this raw data does not result in a smooth curve.

So, one may conclude that this enhanced ghost trend process can be quite useful in the right circumstances.

6. WHAT IF THE DISTANCES BETWEEN THE POINTS VARY FROM POINT TO POINT?

It is fairly common to break down data into categories such as “under $5,000,” “$5,000–$9,999,” “$10,000–$24,999,” “$25,000–$50,000,” and “over $50,000.” In that example, one could attempt to fit a smooth curve to values corresponding to the points 2,500, 7,500, 17,500, 37,500, and 100,000 (making judgmental selections for the points at the bottom and top). Then the spacing between the points is 5,000, 10,000, 20,000, and 62,500. In other words, they are very unequally spaced. One might expect the ghost trend to change a lot more between 37,500 and 100,000 than between 2,500 and 7,500, depending on the appearance of the data that is involved. However, the more important question is how it changes between adjacent intervals. For example, how does it change between the interval from 17,500 to 37,500 and the interval from 37,500 to 100,000?

Since Brownian motion would say that the variance between values is proportional to the distance between the points they correspond to, it seems logical that the value of δ2 be multiplied by the distance between the midpoints of the intervals: (100,000 + 37,500)/2 − (37,500 + 17,500)/2 = 68,750 − 27,500 = 41,250. Thus, the variance between two adjacent Gi's would be proportional to 41,250, in effect 41,250 × δ2. Then, that revised variance would be used in the denominator of the “Normalized Ghost Trend Squared Diff” computations instead of just the overall δ2 associated with this smoothing process. In fact, one of those “scaling” parameters must be used for each Gi. It could also be logical to apply the same scaling within the τ2 terms as well. As one may imagine, the large 41,250 multiplier suggests that a much lower value of δ2 should be used. It also may be useful to visually compare these scaled pi's to those based on equal variances for the changes in ghost trend. When appropriate, this scaling process can be a useful tool.
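As a sketch of that scaling (the function name and the decision to scale only the δ2 term are assumptions), the ghost-trend penalty for unequally spaced points x_1, …, x_n might be computed as:

```python
# Minimal sketch of the distance-scaled ghost-trend penalty for unequal spacing.
import numpy as np

def scaled_ghost_penalty(G, x, delta2):
    """Ghost-trend roughness penalty with the allowed variance scaled by the gap
    between midpoints of adjacent intervals, per the Brownian-motion argument above.
    G holds one ghost trend per point after the first (length len(x) - 1)."""
    x = np.asarray(x, dtype=float)
    midpoints = (x[:-1] + x[1:]) / 2.0   # midpoints of the intervals between points
    gaps = np.diff(midpoints)            # e.g., 68,750 - 27,500 = 41,250 in the example
    return np.sum(np.diff(G) ** 2 / (delta2 * gaps))
```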

7. SUMMARY

Even the simpler approach is of value in smoothing data with a great deal of variance. However, by carefully choosing the parameters in the ghost trend model, one may convert data with a lot of process error into a smooth curve that does a quality job reflecting the data. The enhanced process expands that to cover a wider variety of scenarios.


  1. For reference, data is typically provided in buckets to expedite processing, e.g., “claims on policies with amounts of insurance between $45,000 and $55,000,” but curves are usually fit to points like “$50,000.”

  2. The simulation was done using the NTRAND implementation of the Mersenne Twister and standard spreadsheet functions. The author notes that real world situations would often add parameter variance, but this approach is suitable in context.

Submitted: November 03, 2021 EDT

Accepted: February 26, 2023 EDT

References

Boor, Joseph A. 2021. “Risk Transfer Criteria That Are Not Ad Hoc: A Decrease in the Coefficient of Variation and Cost-Effective Pricing.” Submitted to the CAS E-Forum, Casualty Actuarial Society, November.
