Holmes.Intro

From BattleActs Wiki

Reading: Holmes, T. & Casotto, M., “Penalized Regression and Lasso Credibility,” CAS Monograph #13, November 2024. Chapters 1 and 2.

Synopsis: In this reading we cover a brief refresher of the early content found in the GLM monograph. We explore a couple of scenarios which motivate us to think about the credibility/volume of data underpinning each GLM coefficient. We conclude with a discussion on the limitations of p-values and significance testing plus a high-level review of credibility concepts.

Study Tips

This is a fairly quick reading if you don't stop to explore the equations in the appendix. The CAS doesn't say anything about the appendices not being on the syllabus, so in theory they are fair game. However, given the technical nature of some of the proofs/content, in our opinion you're probably safe to skip the technical details and focus on what the results mean for modeling with a GLM.

Estimated study time: 1 Hour (not including subsequent review time)

BattleTable

Based on past exams, the main things you need to know (in rough order of importance) are:

  • GLM estimates are unstable on segments with low exposures.
  • The five limitations of p-values and significance testing.
  • The three necessary properties required to rigorously incorporate credibility into a multivariate modeling technique.
Questions from the Fall 2019 exam are held out for practice purposes. (They are included in the CAS practice exam.)
Currently no prior exam questions
Full BattleQuiz Excel Files Forum


In Plain English!

By this point in your studies and/or career you should be becoming fairly familiar with generalized linear models (GLMs). If you're not yet well-versed in GLMs we recommend reading our wiki articles on the CAS GLM monograph first.

The key weakness of GLMs which the Holmes & Casotto monograph seeks to highlight and address is that GLMs effectively assume the underlying dataset is 100% credible, even if some segments of the dataset have very little data. While the standard error (or confidence interval) associated with a parameter in a GLM may reflect the uncertainty surrounding the parameter estimate, having more or less confidence in the estimate does nothing to change the actual estimate itself once you've decided to include the parameter in the model. In a traditional GLM, a parameter estimate is not adjusted to reflect the volatility of the estimate. If you subsequently use univariate analyses to alter individual rating variables, then you run the risk of eroding the benefits of the multivariate GLM approach.

Penalized regression enhances the GLM process in a way which allows the modeler to incorporate a measure of the credibility of the underlying data. It shifts the focus from using p-values to decide whether a variable should be included in the model to deciding how much credibility the variable should be given. We begin with a brief review of GLMs.

Generalized Linear Models (GLMs)

As you've seen in GLM.Basics, a Generalized Linear Model consists of:

  1. A target variable, Y, which is a random variable that follows a probability distribution from the exponential family of distributions. This distribution is chosen by selecting a variance function and dispersion parameter.
  2. A linear predictor, [math]\eta = X\beta[/math], where X is the design matrix and [math]\beta[/math] is the coefficient vector.
  3. A monotonic link function, g, such that [math]E[Y] = \mu = g^{-1}(\eta)[/math].

We let Y represent the set of all actual observations, while [math]Y_i[/math] denotes a single actual observation from the dataset. Note that although [math]Y_i[/math] is a single observation, it may have more than one exposure (weight) associated with it. For example, [math]Y_i[/math] may be the observation of all red model year 2023 cars in your book of business during the modeling experience period. Clearly, there's likely to be more than one such vehicle.

Holmes and Casotto define the grand average (grand mean) as [math]\overline{Y}=\displaystyle\frac{1}{n}\sum_{i=1}^nY_i[/math]. They then let [math]y_i = Y_i - \overline{Y}[/math] , i.e. they shift the observations so they are centered around 0.

The characteristics associated with the [math]i[/math]-th observation are encoded in the design matrix, X, which has a row for each observation. The columns of the design matrix reflect the encoding of the various characteristics being considered. For example, a single column may contain the age of the named insured, while three columns may be used to numerically capture a categorical variable with four categories. Alice: "You did remember that the base level doesn't get its own column in the design matrix..."

Feature engineering is the term which describes transforming variables in a way that allows them to be included in the linear predictor. For example, to include a second-order polynomial for VehicleAge in a GLM, the design matrix would include two columns - one for VehicleAge and the other for [math]\textrm{VehicleAge}^2[/math]. One-hot encoding refers to the practice of using multiple dummy variables which take the values of either 1 or 0 to represent the presence or absence of a category. This is used repeatedly to capture categorical variables.
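Here is a minimal sketch of these two feature-engineering steps using pandas. The dataset, column names (VehicleAge, Color), and the choice of "red" as the base level are all made up for illustration:

<pre>
import pandas as pd

# hypothetical policy data with one continuous and one categorical variable
df = pd.DataFrame({
    "VehicleAge": [1, 4, 7, 12],
    "Color": ["red", "blue", "green", "red"],   # base level assumed to be "red"
})

# polynomial term: one column for VehicleAge and one for VehicleAge squared
df["VehicleAge2"] = df["VehicleAge"] ** 2

# one-hot encoding: 0/1 dummy columns, with the base level getting no column
dummies = pd.get_dummies(df["Color"], prefix="Color").astype(int)
dummies = dummies.drop(columns="Color_red")

design = pd.concat([df[["VehicleAge", "VehicleAge2"]], dummies], axis=1)
design.insert(0, "Intercept", 1.0)              # intercept column of ones
print(design)
</pre>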

Fitting a GLM requires specifying the target variable, variance function, dispersion parameter, and link function, and then finding the set of [math]\beta[/math] coefficients which minimizes the difference between the GLM output ([math]g(\mu)=X\beta[/math]) and the actual observed outcomes. This is equivalent to maximizing the likelihood of observing the actual outcomes. The same set of [math]\beta[/math] coefficients is used for all records in the dataset. For technical reasons, Holmes and Casotto target [math]y_i=Y_i - \overline{Y}[/math] rather than [math]Y_i[/math]; this does not affect the optimization problem at hand though.

This is expressed mathematically as:

[math] \begin{align}\hat{\beta}_{GLM} &= {\arg\!\max}_\beta \,\textrm{LogLikelihood}(y, X, \beta) \\ &= {\arg\!\min}_\beta -\textrm{LogLikelihood}(y,X,\beta) \\ &= {\arg\!\min}_\beta NLL(y, X, \beta).\end{align} [/math]

Alice: "– LogLikelihood and NLL are shorthand for the negative of the loglikelihood. I prefer the argmax notation or argmin NLL myself. Also, recall [math]Y=X\beta[/math] is the model output (target variable) and y is the set of actual observations, so this is saying we are finding the [math]\beta[/math] values which maximize the likelihood of getting y for the given design matrix X."

For each observation, i, the linear predictor takes a set of risk characteristics and produces a prediction of the transformed expected risk, [math]g(\mu_i) = \beta_0 + \beta_1\cdot x_{i,1} + \beta_2\cdot x_{i,2} + \ldots + \beta_p\cdot x_{i,p}[/math]. To produce the expected risk, we must invert the link function, g. It's common to use a log (ln) link function because this readily produces a multiplicative rating structure. The key thing is to match the link function with the range of possible outcomes. For example, recall logistic regression requires us to predict a value between 0 and 1. Instead of using the standard log link function we use the logit link function [math]g(\mu) = \ln\left(\displaystyle\frac{\mu}{1-\mu}\right)[/math] because the inverse of the logit function is the logistic function, [math]g^{-1}(\eta) = \displaystyle\frac{1}{1+e^{-\eta}}[/math], which takes any real number and transforms it into one between 0 and 1.
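To see how the choice of link function controls the range of the predictions, here is a small sketch; the linear predictor values are arbitrary illustrative numbers:

<pre>
import numpy as np

def inverse_log_link(eta):
    # log link: g(mu) = ln(mu), so mu = exp(eta) > 0 (multiplicative rating structure)
    return np.exp(eta)

def inverse_logit_link(eta):
    # logit link: g(mu) = ln(mu / (1 - mu)), so mu = 1 / (1 + exp(-eta)) lies in (0, 1)
    return 1.0 / (1.0 + np.exp(-eta))

eta = np.array([-2.0, 0.0, 3.0])      # any real-valued linear predictor
print(inverse_log_link(eta))          # strictly positive predictions
print(inverse_logit_link(eta))        # predictions between 0 and 1
</pre>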

As you saw here, some variables (such as rating territory) or coverage options (limit/deductibles) are best priced using methods other than a GLM. However, we still want to account for their influence in the GLM to reduce the chance of distortions.

Alice: "A classic example would be homeowners pricing in an area which includes a historic district. Historic districts consist of older homes so there would be a strong positive correlation between the age of home and rating territory. A GLM without territory would attempt to pick up the signal in the age of home variable. Yet if we modeled territory in the GLM we would likely introduce too many levels or experience aliasing, potentially introducing instability into our model."

An offset is a way of avoiding this. The variable to be offset is modeled separately using an appropriate process such as a loss elimination ratio analysis. The corresponding relativity, scaled to the linear predictor (e.g. the log of the relativity under a log link), is then added onto the linear predictor for each record in the dataset.

[math]g(\mu_i) = \beta_0 + \beta_1\cdot x_{i,1} + \beta_2\cdot x_{i,2} + \ldots + \beta_p\cdot x_{i,p}+\color{red}{\textrm{Offset}_i}[/math]

More than one risk characteristic may be offset at once with the corresponding relativities combined into a single offset quantity for each record. Click here for a refresher on offsets from the GLM text.
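Below is a minimal sketch of including an offset in a log-link GLM using statsmodels. The data are simulated, and the "territory relativities" are stand-ins for factors priced outside the GLM:

<pre>
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 1000
X = sm.add_constant(rng.normal(size=(n, 2)))          # intercept + two rating variables
territory_rel = rng.choice([0.8, 1.0, 1.3], size=n)   # relativities priced outside the GLM
offset = np.log(territory_rel)                        # log link: add ln(relativity) to the linear predictor
mu = np.exp(X @ np.array([0.1, 0.3, -0.2]) + offset)
y = rng.poisson(mu)

# the offset enters the linear predictor with a fixed coefficient of 1
model = sm.GLM(y, X, family=sm.families.Poisson(), offset=offset)
print(model.fit().params)   # coefficients are fit with territory held fixed via the offset
</pre>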

Alice: "Let's recap what we've learned so far by attempting the following problem which is loosely based on an example from the source material."

Offsets, Linear Predictors, and Rating Tables

Full Credibility Assumption

Let's begin with Holmes and Casotto's thought experiment. You are collecting data on the pure premium of risks with and without a fire extinguisher present. Let's think about the following three scenarios:

Scenario 1
  Category                     Exposures     Loss Dollars     Pure Premium
  With fire extinguisher       1,000,000     100,000,000      100
  Without fire extinguisher    1,000,000     120,000,000      120

Scenario 2
  Category                     Exposures     Loss Dollars     Pure Premium
  With fire extinguisher       1,000,000     100,000,000      100
  Without fire extinguisher        5,000         600,000      120

Scenario 3
  Category                     Exposures     Loss Dollars     Pure Premium
  With fire extinguisher       1,000,000     100,000,000      100
  Without fire extinguisher           10           1,200      120

The pure premium (average loss per exposure) is the same for each category in each scenario. In fact, the "with fire extinguisher" category is identical across the scenarios. However, these univariate analyses do not take into account the volume of data associated with the "without fire extinguisher" category in each scenario. So although each of the scenarios generates the same pure premium relativity of 120 / 100 = 1.200 for the "without fire extinguisher" category, as actuaries, we recognize there is much greater uncertainty associated with the 1.2 derived from scenario 3 than with the 1.2 derived from scenario 1. That is, we would assign less than full credibility to scenarios 2 and 3, leading to a selected relativity of something other than 1.200.

In appendix A.1, Holmes and Casotto build an identity link function GLM for a single categorical rating variable having p levels (including the base level) using a dataset consisting of n observations. They demonstrate that minimizing the negative log-likelihood for the GLM produces the same relativities as a univariate analysis which treats the data as fully credible would (the average of each category).

Key Takeaways:
  • The initial takeaway is that this result is independent of the number of exposures within each rating variable category. That is, the GLM beta coefficients are the same as the full-credibility univariate result for each of the three scenarios above (see the sketch after this list).
  • Holmes and Casotto go on in the appendix to state that this result continues to hold if we use a general link function. Further, they state that if we restrict ourselves to the multivariate situation where all rating variables are discrete (categorical) and encoded using one-hot encoding, then the result also holds.
  • They use these points to underscore that the GLM approach does not take the volume of data into consideration when determining the beta coefficients. Alice: "Although they don't mention it, it's implicit that the GLM may account for the volume of data via the assigned weights. The weights then influence the width of the confidence interval associated with the beta coefficient rather than the beta coefficient itself. In general, the greater the weight, the smaller the confidence interval."
  • Holmes and Casotto conclude: "GLM estimates are unstable on segments with low exposures."
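The sketch below illustrates this takeaway, assuming an identity-link Gaussian GLM with exposures supplied as weights (mirroring the setup of Appendix A.1). In every scenario the fitted intercept is the base-level average of 100 and the dummy coefficient is the +20 differential, regardless of the exposure volume:

<pre>
import numpy as np
import statsmodels.api as sm

# exposures for the "with" and "without" fire extinguisher categories in each scenario
scenarios = {
    "Scenario 1": (1_000_000, 1_000_000),
    "Scenario 2": (1_000_000, 5_000),
    "Scenario 3": (1_000_000, 10),
}

y = np.array([100.0, 120.0])          # observed pure premiums for the two categories
X = np.array([[1.0, 0.0],             # base level: with fire extinguisher
              [1.0, 1.0]])            # dummy = 1: without fire extinguisher

for name, weights in scenarios.items():
    model = sm.GLM(y, X, family=sm.families.Gaussian(), var_weights=np.array(weights))
    fit = model.fit(scale=1.0)        # fix the scale; only the coefficients matter here
    print(name, fit.params)           # [100.0, 20.0] in every scenario
</pre>

The weights change only the confidence intervals around the coefficients, not the coefficients themselves, which is exactly the point of the thought experiment.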

In a univariate analysis we can use a measure of credibility to determine the appropriate relativity as a weighting of the observed data with an appropriate complement. We do not have this luxury in an unpenalized GLM. The only control a modeler has over a variable is whether or not to include it in the model. Thus, we either give full credibility or zero credibility to a segment of the dataset.

p-Values

The output of a GLM consists of a series of coefficients. The null hypothesis is that each coefficient is actually zero, and the p-value reflects the probability of observing a coefficient at least as extreme as the GLM output if the true coefficient were zero. Typically a p-value threshold of 0.05 is used, meaning that if the true coefficient were zero we would incorrectly accept the GLM's non-zero estimate at most 5% of the time. When the p-value is greater than 5% (or our chosen threshold), we say the coefficient is insignificant and usually remove it from the model. Otherwise, the coefficient is considered to be significant and is included in the model with 100% credibility. This is known as significance testing.
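For concreteness, here is a small sketch of significance testing on simulated data. The variable names and the 0.05 threshold are illustrative, and only the first variable carries real signal:

<pre>
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 2000
X = sm.add_constant(rng.normal(size=(n, 3)))     # intercept + three candidate variables
beta_true = np.array([0.0, 0.4, 0.0, 0.0])       # only x1 has a real effect
y = rng.poisson(np.exp(X @ beta_true))

fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
for name, pval in zip(["x1", "x2", "x3"], fit.pvalues[1:]):   # skip the intercept
    verdict = "significant - keep" if pval <= 0.05 else "insignificant - remove"
    print(f"{name}: p-value = {pval:.3f} -> {verdict}")
</pre>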

Question: Briefly describe five limitations of p-values and significance testing.
Solution:
  1. It is binary; we are only answering whether the coefficient is likely to be zero or something more extreme. It does not tell us what the coefficient should actually be.
  2. The traditional significance level of 0.05 is arbitrary. Many sources say it's not appropriate - particularly when you have a considerable number of variables in the model, since by chance alone you may end up retaining several coefficients which should have been deemed insignificant.
  3. The process is iterative because removing or including one coefficient alters the p-values associated with the other variables contained in the model. This can have a domino effect, as other variables which were significant may now become insignificant or vice versa.
  4. Adjustments made for questionable p-values are often based on univariate analyses after modeling. Including such adjustments erodes the benefit of the multivariate nature of a GLM.
  5. The GLM output contains confidence intervals to evaluate coefficient stability but does not provide guidance on how to adjust the output. Holmes and Casotto give the example of a rating variable which is accepted by regulators and the actuarial community as being a strong predictor of loss yet suppose your model produces a result for this variable with a p-value of 0.06. Assuming your threshold is 0.05, would you still want to exclude this variable?

Alice: "This makes significance testing sound terrible. Why on earth would we want to do it?"

We often use p-value testing despite these drawbacks because it's convenient and simple to apply.

A post hoc analysis is one which is performed after modeling.

Question: In the context of (unpenalized) GLMs, what are two possible kinds of post hoc analysis?
Solution:
  1. An analysis informing a subsequent model iteration
  2. An analysis informing selections based on the final modeled coefficients.

When informing subsequent model iterations, using p-values / significance testing is binary - we include a coefficient in the model as is (i.e. giving it full credibility) or we exclude it from the model.

When we make selections for variables based on the final model output it's done on a variable by variable basis. Even if we are using credibility when making these selections, the variable by variable nature of the process erodes the multivariate benefit of the GLM, likely producing a sub-optimal model. A sequence of steps, each of which is individually optimal, may not produce an outcome which is optimal overall.

A Brief Review of Credibility

Credibility is a weighting of different estimates to produce a combined estimate.

[math]\textrm{Estimate} = Z \cdot\mbox{Observed Experience} + (1-Z)\cdot\mbox{Complement of Credibility}[/math]

Here, Z is the credibility factor, which lies between 0 and 1. The credibility factor increases with the number of observations (exposures). This allows us to directly incorporate information about the number of exposures into our estimate.

Classical credibility is defined as [math]Z =\min\left\{1,\sqrt{\frac{N}{N_{full}}}\right\}[/math]. Here, [math]N_{full} = N_{full}(K,P)[/math] is the minimum number of observations required to achieve Z = 1 (full credibility). It is a function of the desired probability, P, that the observations fall within K percent of the true value, where K is our error tolerance.

Bühlmann credibility is defined as [math]Z=\displaystyle\frac{n}{n+k}[/math]. Here, [math]k = \displaystyle\frac{\sigma^2_{PV}}{\tau^2_{HM}}[/math] is the ratio of the expected process variance (within-class variance) to the variance of hypothetical means (between-class variance).
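To make the formulas concrete, here is a minimal sketch computing both credibility factors for exposure volumes mirroring the fire-extinguisher scenarios. The full-credibility standard and the EPV/VHM inputs are made-up illustrative values:

<pre>
import math

def classical_z(n, n_full):
    # classical (limited fluctuation) credibility: Z = min(1, sqrt(n / n_full))
    return min(1.0, math.sqrt(n / n_full))

def buhlmann_z(n, epv, vhm):
    # Buhlmann credibility: Z = n / (n + k), where k = EPV / VHM
    return n / (n + epv / vhm)

# illustrative inputs: N_full = 10,000 exposures, EPV = 500, VHM = 1
for n in (1_000_000, 5_000, 10):
    print(n, round(classical_z(n, n_full=10_000), 3), round(buhlmann_z(n, epv=500.0, vhm=1.0), 3))
</pre>

Notice how Z falls toward 0 as the exposure volume shrinks; this is exactly the behavior we will want a credibility-aware modeling technique to exhibit.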

Question: Briefly describe three necessary properties required to rigorously incorporate credibility into a multivariate modeling technique.
Solution:
  1. Parameter estimation must not rely solely on minimizing the negative log-likelihood (or variance), as doing so assigns 100% credibility to the data.
  2. As the number of observations decreases, estimates should shrink towards the complement of credibility.
  3. The "credibility weighting" of the coefficients must be part of the fitting procedure, not performed afterwards.

In the next wiki article we learn about penalized regression and how it satisfies these criteria.

Full BattleQuiz Excel Files Forum
