FERM - Ch.12: Extreme Value Theory


A Fellowship Forums Wiki Community Post

Online Link to this Reading: N/A

Topics Covered in this Reading:

  • This is a technical reading that discusses two ways to model tail risk (the risk associated with extreme losses)
  1. The Generalized Extreme Value Distribution

    • Here we take the max observation (i.e. biggest loss) from a series of loss distributions/data sets, and combine them to form a single distribution
    • This distribution consists entirely of extreme values - so we are specifically modelling tail risk directly
  2. The Generalized Pareto Distribution

    • Rather than taking the max observation from each data set, we consider all observations that exceed a pre-specified threshold
    • This provides more information than the Generalized Extreme Value Distribution, but can potentially include non-tail observations if the threshold is too low

This is a wiki post, editable by anyone. Feel free to edit and add key points/extra detail above!


I’ve just started this reading and have a few early questions re: the Generalized Extreme Value Distribution section; I was hoping someone could help clarify?

  1. When Gamma > 0, it says we have a Frechet-type GEV distribution, which has a tail that follows a power law. What does it mean for a tail to follow a power law?

  2. When Gamma < 0, we have a Weibull-type GEV distribution, with a tail that falls off so quickly that there is a ‘finite right endpoint’ to the distribution. It says that, given EVT is used when talking about extreme observations, this distribution is of little interest. Can someone expand on why this distribution is of little interest? I don’t quite understand.

  3. They say that a major drawback of the GEV approach is that by only using the largest values in each block of data, it ignores a lot of potentially useful information. I.e. For a block of a thousand observations, 99.9% of the information is discarded. Why is this such a drawback when what we are most concerned about is the data points in the tail, and not the remaining part of the distribution?

Thanks :slight_smile:


I can try and answer 2 and 3. Please let me know if you agree.

  2. With EVT, what we care about is the extreme observations. In other words, we look at the data points/observations in the extreme tail end of the distribution (which are the REALLY BAD scenarios). Because of this, distributions that have really FAT tails are going to have more data points out in the tail and be more useful for EVT. The Weibull-type GEV distribution, on the other hand, does not have a fat tail at all (in fact, the tail drops off very quickly and even has a finite right endpoint), meaning there will be very few extreme observations that we can use for EVT – which is why it’s of little interest.

  3. As mentioned above, for EVT we only care about the extreme observations. Thus, if we take 1000 observations but only 10 of them are considered “extreme”, we are essentially throwing out the remaining 990 observations. The issue is that some of those 990 observations are still quite extreme (although not as extreme as the top 10), and could provide valuable information… but we throw this information out when we ONLY use the largest value in each block of data (and exclude, for example, the second- or third-largest values). That’s why, under the threshold (GPD) approach, there’s a balance between choosing a threshold high enough to make sure we are only considering extreme values, but low enough that we aren’t throwing away a ton of useful information.
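To make the information loss concrete, here is a toy sketch in Python (the loss numbers, block size, and threshold are all made up for illustration, not from the reading): the block-maxima (GEV) approach keeps one point per block, while the threshold (GPD) approach keeps every exceedance.

```python
# Toy comparison (made-up numbers): how much data each approach keeps.
losses = [12, 3, 45, 7, 30, 22, 5, 41, 18, 9, 28, 35]  # hypothetical losses

# GEV / block-maxima approach: keep only the largest loss in each block.
block_size = 4
blocks = [losses[i:i + block_size] for i in range(0, len(losses), block_size)]
block_maxima = [max(b) for b in blocks]

# GPD / threshold approach: keep every loss above a chosen threshold.
threshold = 20
exceedances = [x for x in losses if x > threshold]

print(block_maxima)   # [45, 41, 35] -> 3 points kept
print(exceedances)    # [45, 30, 22, 41, 28, 35] -> 6 points kept
```

With the same data, the threshold approach retains twice as many tail observations here, which is exactly the extra information the GEV approach discards.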

Hope this helps!


For #1, very simply what this means is that the CDF is something like:

F(x) ~ 1 - x^(-a), where a > 0

So what does this mean? It means that to cover more of the CDF, the jumps have to get bigger.

Let’s say that a = 2. Going from x = 2 to x = 3 increases the CDF by 0.13889, which is a big chunk of the CDF. But if you are going from x = 20 to x = 21, this increases the CDF by only 0.00023.
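Those two increments are easy to verify in a couple of lines (a sketch, assuming the simple power-law CDF F(x) = 1 - x^(-a) with a = 2 from above):

```python
# Check the two CDF increments quoted above for F(x) = 1 - x**(-a), a = 2.
a = 2

def F(x):
    return 1 - x ** (-a)

near = F(3) - F(2)    # early in the tail
far = F(21) - F(20)   # deep in the tail

print(round(near, 5))  # 0.13889
print(round(far, 5))   # 0.00023
```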

So what does THIS mean, intuitively? It means that the tail is filling in (approaching 1) very, very slowly. As the numbers get bigger, you need HUGE jumps in x to get a significant change in the CDF.

What this causes, in the real world, is a distribution that models catastrophes very well. Once we get into the tail, small changes in where we are in the CDF are HUGE changes in the value of the loss. So, making up numbers, a 99.0% loss might be $1B, a 99.1% loss might be $10B, and a 99.2% loss might be $100B. You can see here that with each incremental 0.1% increase in the CDF, we see bigger “jumps” in the catastrophe.
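To see the same effect numerically, we can invert the power-law CDF above to get the quantile function x = (1 - q)^(-1/a). The dollar figures in the post are made up, but even this bare-bones sketch shows the jumps widening as we push further into the tail:

```python
# Quantile (inverse CDF) of the power-law distribution F(x) = 1 - x**(-a).
# Illustrative only; the dollar figures in the post above are made up.
a = 2

def quantile(q):
    return (1 - q) ** (-1 / a)

for q in (0.990, 0.991, 0.992):
    print(q, round(quantile(q), 2))   # 10.0, then 10.54, then 11.18
```

Each extra 0.1% of the CDF produces a bigger jump in the loss than the previous 0.1% did, which is the “bigger jumps in the catastrophe” behaviour described above.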


Thank you @MattyM and @PaulP for your responses! :pray: This clears things up!


I am just trying to finish up this chapter and have a couple more questions that I’m hoping @MattyM, @PaulP, or others can help with.

  1. Can someone help explain the empirical mean excess function e(u)? I don’t really understand the author’s description; perhaps someone has a quick example with some numbers behind it that will make this idea more clear?

  2. This pertains to my first question, but can someone help me understand the ideas in these graphs? They say that choosing a threshold u is about not only where e(u) has become linear, but remains so. I am not seeing this idea in the graphs! Please help! :worried:


This function basically involves you setting a value for u that is your “limit”. The function then only considers points above the limit, and gives you the average excess of those values over u. So for example, let’s assume an empirical distribution with the following 11 points:
20, 15, 4, 12, 17, 2, 45, 20, 19, 22, 14

Let me set u to a few different values to determine the empirical mean excess function. Note that this simply boils down to taking the average excess over u for those points that are above u:

e(10) = (10 + 5 + 2 + 7 + 35 + 10 + 9 + 12 + 4)/9 = 94/9 ≈ 10.44
e(15) = (5 + 2 + 30 + 5 + 4 + 7)/6 = 53/6 ≈ 8.83
e(20) = (25 + 2)/2 = 13.5
e(22) = 23/1 = 23

Similar to your previous question 3, the choice of u has to be made so that we capture enough tail data to have some level of confidence in the distribution we fit, but not so much data that we end up modelling the body of the distribution rather than just the tail. One way of determining this u value is by studying the graph of e(u).

If we choose a value of u that is too high, we won’t have a linear relationship; we will have an unpredictable function. That is because if u is very high, only a small number of data points are included in the calculation of e(u), so its value will likely jump around. For example, in the above example, e(20) = 13.5 and only 2 points were used in the calculation. Moving to e(22) gives us e(22) = 23/1 = 23. We see an extreme jump because a small number of data points are used and then fall out of the calculation as u increases.

At the same time, a u that is too small will have the majority of the points used in the calculation, which makes the denominator large. However, many of these points (from the body of the distribution) will have small excesses over u because they are not the right-hand extreme values. So as u increases and some of these points drop out of the calculation, the numerator will decrease a small amount but the denominator will drop rapidly as many points in the body of the distribution fall out. This will likely see e(u) decrease until a “balance” is found between considering only extreme-enough values and not considering too few.

That is where you see the linear increase in the e(u) functions in the graphs. The extreme points after the solid line indicate that there are too few data points at those high levels of u to have a stable relationship. Therefore, the chosen value of u should probably be around the area where the linearly-increasing pattern begins (at the start of the solid line).
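The empirical mean excess function for this sample can be reproduced in a few lines of Python (a sketch; the data are the 11 points listed above):

```python
# Empirical mean excess function e(u): average of (x - u) over points with x > u.
data = [20, 15, 4, 12, 17, 2, 45, 20, 19, 22, 14]

def mean_excess(u, xs):
    excesses = [x - u for x in xs if x > u]
    return sum(excesses) / len(excesses) if excesses else float("nan")

for u in (10, 15, 20, 22):
    print(u, round(mean_excess(u, data), 2))   # 10.44, 8.83, 13.5, 23.0
```

Note how e(u) drifts down as body points drop out, then jumps around once only one or two points remain above u, which is exactly the instability described above.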


I think I am also confused on Emily’s question #2 and the above answer didn’t satisfy me. What is it specifically about those graphs that allows us to choose a u value?


I see it as the area on the graph with the largest value of u where there is still a linear pattern, before the pattern becomes unstable. Sure, there are decreasing linear patterns earlier on, but using a low value for u won’t really focus on modelling the tail. So picking the highest value of u where there is still a stable pattern should be the goal.


If you look at the formula for e(u), notice that it is simply the average excess over u in the tail. The denominator (the indicator sum, I) simply counts the number of observations beyond u. Therefore, e(u) is very similar to a CTE measure, except we’re averaging the excess over u instead of the full tail values.

As u increases, e(u) gives an indication of whether the modelled data is actually from the tail. When moving toward the right tail from the centre of the distribution, e(u) changes sharply. Once far enough into the tail, e(u) settles into an approximately linear pattern. The point where that linear behaviour begins is a good choice for u. Remember, the point of this is to figure out where the tail starts.
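As a sanity check on the “linear in the tail” idea, here is a sketch (my own illustration, not from the reading) using a deterministic grid of Pareto quantiles with a = 2. For that distribution, theory says the mean excess is exactly e(u) = u/(a - 1) = u, i.e. linear (and increasing) in u:

```python
# Sketch (my illustration, not the reading's): for a Pareto tail with
# F(x) = 1 - x**(-a) and a = 2, theory gives mean excess e(u) = u/(a - 1) = u,
# i.e. linear in u. Instead of random draws, we build a deterministic
# "sample" from evenly spaced quantiles of the distribution.
n = 100_000
sample = [((i + 0.5) / n) ** -0.5 for i in range(n)]  # Pareto(a = 2) quantiles

def mean_excess(u, xs):
    excesses = [x - u for x in xs if x > u]
    return sum(excesses) / len(excesses)

for u in (2, 4, 8):
    print(u, round(mean_excess(u, sample), 2))  # roughly 2, 4, 8
```

Once u is genuinely in the Pareto-type tail, e(u) tracks a straight line, which is the stable pattern the mean excess plots are asking you to find.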

In other words, we’re trying to figure out the point where the really nasty outcomes begin. As you move from left to right, imagine encountering a flock of flying swans (one for each “u step”). The tail begins when all the swans are black. As you keep going right, there is one fewer black swan each step (hence the drop in e). :slight_smile: