FERM - Ch.12: Extreme Value Theory



Reading Source: Textbook - Financial Enterprise Risk Management

Topics Covered in this Reading:

  • The Generalized Extreme Value Distribution
  • The Generalized Pareto Distribution


I’ve just started this reading and have a few early questions re: the Generalized Extreme Value Distribution section; I was hoping someone could help clarify?

  1. When Gamma > 0, it says we have a Frechet-type GEV distribution, which has a tail that follows a power law. What does it mean for a tail to follow a power law?

  2. When Gamma < 0, we have a Weibull-type GEV distribution, with a tail that falls off so quickly that there is a ‘finite right endpoint’ to the distribution. It says that, given that EVT is used when talking about extreme observations, this distribution is of little interest. Can someone expand on why this distribution is of little interest? I don’t quite understand.

  3. They say that a major drawback of the GEV approach is that by only using the largest value in each block of data, it ignores a lot of potentially useful information, i.e., for a block of a thousand observations, 99.9% of the information is discarded. Why is this such a drawback when what we are most concerned about is the data points in the tail, and not the remaining part of the distribution?

Thanks :slight_smile:


I can try and answer 2 and 3. Please let me know if you agree.

  2. With EVT, what we care about is the extreme observations. In other words, we look at the data points in the extreme tail of the distribution (which are the REALLY BAD scenarios). Because of this, distributions with really FAT tails put more probability on extreme outcomes and are the ones EVT is most useful for. The Weibull-type distribution, on the other hand, does not have a fat tail at all. Its tail drops off so quickly that it has a finite right endpoint, so it cannot produce the very extreme observations EVT is meant to model, which is why it’s of little interest.

  3. As mentioned above, for EVT we only care about the extreme observations. The GEV approach keeps ONLY the largest value in each block of data, so for a block of 1000 observations we throw out the other 999. The issue is that some of those 999 observations may still be quite extreme (for example, the second or third largest values in the block) and could provide valuable information about the tail, but we discard it. That’s why there’s a balance when choosing a threshold: high enough to make sure we are only considering extreme values, but low enough that we aren’t throwing away a ton of useful information.

Hope this helps!


For #1, very simply what this means is that the CDF is something like:

F(x) ~ 1 - x^(-a), where a > 0

So what does this mean? It means that to cover more of the CDF, the jumps have to get bigger.

Let’s say that a = 2. Going from x = 2 to x = 3 increases the CDF by 0.13889, which is a big chunk of the CDF. But if you are going from x = 20 to x = 21, this increases the CDF by only 0.00023.

So what does THIS mean, intuitively? It means that the tail is going to “complete” very, very slowly. As the numbers get bigger, you need HUGE jumps to get a significant change in the CDF.

What this causes, in the real world, is a distribution that models catastrophes very well. Once we get into the tail, small changes in where we are in the CDF are HUGE changes in the value of the loss. So, making up numbers, a 99.0% loss might be $1B, a 99.1% loss might be $10B, and a 99.2% loss might be $100B. You can see here that with each incremental 0.1% increase in the CDF, we see bigger “jumps” in the catastrophe.
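
To put numbers on the example above, here is a minimal Python sketch. It assumes the tail is exactly F(x) = 1 - x^(-a) for x >= 1, which is a simplification of “something like” a power law, and the function names `power_law_cdf` and `power_law_quantile` are made up for illustration.

```python
def power_law_cdf(x, a=2.0):
    """CDF F(x) = 1 - x^(-a), assumed valid for x >= 1."""
    return 1.0 - x ** (-a)


def power_law_quantile(p, a=2.0):
    """Inverse of the CDF above: the x at which F(x) = p."""
    return (1.0 - p) ** (-1.0 / a)


# Same-sized steps in x cover less and less of the CDF as x grows.
jump_low = power_law_cdf(3) - power_law_cdf(2)     # about 0.13889
jump_high = power_law_cdf(21) - power_law_cdf(20)  # about 0.00023
print(f"F(3) - F(2)   = {jump_low:.5f}")
print(f"F(21) - F(20) = {jump_high:.5f}")

# Flip it around: small steps in probability need ever larger steps in x.
for p in (0.99, 0.999, 0.9999):
    print(f"p = {p}: x = {power_law_quantile(p):.2f}")
```

With a = 2, moving from the 99% point to the 99.99% point takes x from 10 to 100, so a tiny step in the CDF corresponds to a tenfold jump in the loss.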


Thank you @MattyM and @PaulP for your responses! :pray: This clears things up!


I am just trying to finish up this chapter and have a couple more questions that I’m hoping @MattyM, @PaulP, or others can help with.

  1. Can someone help explain the empirical mean excess function e(u)? I don’t really understand the author’s description; perhaps someone has a quick example with some numbers behind it that will make this idea more clear?

  2. This pertains to my first question, but can someone help me understand the ideas in these graphs? They say that choosing a threshold u is about not only where e(u) has become linear, but remains so. I am not seeing this idea in the graphs! Please help! :worried:


This function basically involves you setting a value for u that is your “limit”. The function then only considers points above the limit, and gives you the average of those values. So for example, let’s assume the following empirical distribution with the following n points:
20, 15, 4, 12, 17, 2, 45, 20, 19, 22, 14

Let me set u to a few different values to determine the empirical mean excess loss function:
Note that this simply boils down to taking the average excess value above u for those points that are above u.
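
As a rough sketch of that calculation in Python, using the same data points: the function name `mean_excess` is made up, and it assumes “above u” means strictly greater than u, which is what reproduces the e(20) = 13.5 figure quoted in this thread.

```python
def mean_excess(data, u):
    """Empirical mean excess e(u): average of (x - u) over points with x > u.

    Returns None when no point exceeds u, since e(u) is then undefined.
    """
    excesses = [x - u for x in data if x > u]
    if not excesses:
        return None
    return sum(excesses) / len(excesses)


losses = [20, 15, 4, 12, 17, 2, 45, 20, 19, 22, 14]

for u in (10, 15, 20, 22):
    n_above = sum(1 for x in losses if x > u)
    print(f"u = {u}: {n_above} points above u, e(u) = {mean_excess(losses, u):.2f}")
```

For u = 20, only 45 and 22 are above the limit, so e(20) = (25 + 2) / 2 = 13.5. For u = 22, only 45 is left, so e(22) = 23 / 1 = 23.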

Similar to your previous question 3, u has to be chosen so that we capture enough tail data to have some level of confidence in the distribution we fit, but not so much that we end up modelling the body of the distribution rather than only the tail. One way of determining this u value is by studying the graph of e(u).

If we choose a value of u that is too high, we won’t have a linear relationship; we will have an unpredictable function. That is because when u is very high, only a small number of data points are included in the calculation of e(u), and the value will likely jump around. In the example above, e(20) = 13.5 and only 2 points were used in the calculation. Moving to u = 22 gives e(22) = 23/1 = 23. We see an extreme jump because so few data points are used, and they fall out of the calculation as u increases.

At the same time, a u that is too small will have the majority of the points used in the calculation, which makes the denominator large. However, many of these points (on the left-hand side of the distribution) will have small excesses over u because they are not the right-hand extreme values. As u increases, both the numerator and the denominator fall: every remaining excess shrinks, and many points in the body of the distribution drop out of the calculation. This will likely see e(u) decrease until a “balance” is found between only considering extreme-enough values and not considering too few.

That is where you see the linear increase in the e(u) functions in the graphs. The extreme points after the solid line indicate that there are too few data points at those high levels of u to have a stable relationship. Therefore, the chosen value of u should probably be around the area where the linearly increasing pattern begins (at the start of the solid line).


I think I am also confused on Emily’s question #2 and the above answer didn’t satisfy me. What is it specifically about those graphs that allows us to choose a u value?


I see it as the area on the graph with the largest value of u where there is a linear pattern, before the pattern becomes unstable. Sure, there are decreasing linear patterns earlier on, but using a low value for u won’t really be focusing on modelling the tail. So picking the highest value of u where there is still a stable pattern should be the goal.


If you look at the formula for e(u), notice that it is simply the average excess over u in the tail. The denominator (I) simply counts the number of tail values beyond u. Therefore, e(u) is very similar to a CTE measure, but we’re averaging the excess over u instead of the full tail values.
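
For reference, one common way to write the empirical mean excess function is sketched below. The notation (N observations x_1, ..., x_N and an indicator I) is assumed here and may differ slightly from the textbook's.

```latex
e(u) = \frac{\sum_{n=1}^{N} (x_n - u)\, I(x_n > u)}{\sum_{n=1}^{N} I(x_n > u)},
\qquad
I(x_n > u) =
\begin{cases}
1 & \text{if } x_n > u \\
0 & \text{otherwise}
\end{cases}
```

Adding u back to e(u) gives the plain average of the observations above u, which is the CTE-style quantity.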

As u increases, e(u) gives an indication of whether the modeled data is actually from the tail. When moving toward the right tail from the center of the distribution, e(u) falls sharply. Once far enough into the tail, e(u) begins to decline linearly. The point where the linear decline begins is a good choice for u. Remember, the point of this is to figure out where the tail starts.

In other words, we’re trying to figure out the point where the really nasty outcomes begin. As you are moving from left to right, imagine encountering a flock of flying swans (one for each “u step”). The tail begins when all the swans are black. As you keep going right, there is one fewer black swan (hence the drop in e). :slight_smile:


Will I need to memorize the PDF/CDF for the Generalized Extreme Value Distribution and the Generalized Pareto Distribution? Or is it more likely that it will be provided if needed?


Hi Cody!

I would focus on understanding the concepts rather than memorizing the formulas. However, you should know how to use the formulas should they provide them to you on the exam.

The below question comes from the Spring 2017 exam, and actually provides you with the CDF of the Generalized Pareto Distribution. Therefore, I think it’s probably a safe bet that the formula will be provided if needed :wink:


What do other people think?

One option (if you have time… probably not top priority) would be to memorize H(x). You could then derive the PDF (by taking the derivative with respect to x) and the standardized formulas (by setting alpha to 0 and beta to 1) in the unlikely cases you’d need to.


@Andrew It’s been a little while since I’ve taken derivatives and I’m having a bit of trouble, so I wouldn’t mind some help. Would you be able to show the steps for taking the derivative w.r.t. x of the following function (FERM textbook - formula 12.2 - standardized form of H(x)):

H(x) = \left\{ \begin{array}{ll} e^{-(1+\gamma x)^{-1/\gamma}} & \quad \gamma \neq 0 \\ e^{-e^{-x}} & \quad \gamma = 0 \\ \end{array} \right.

Also, thanks for sharing with me that this forum now uses MathJax, very cool!


No problem! I’ve taken a picture of the step by step derivative calculation. Please let me know if any steps are confusing!
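
In case the picture doesn't load, the calculation can be sketched with the chain rule as follows. This is a reconstruction of the standard derivation rather than a transcription of the picture, and h(x) is just a label used here for the density H'(x).

```latex
% Case \gamma \neq 0: write H(x) = e^{g(x)} with g(x) = -(1+\gamma x)^{-1/\gamma}
g'(x) = -\left(-\tfrac{1}{\gamma}\right)(1+\gamma x)^{-1/\gamma - 1}\cdot\gamma
      = (1+\gamma x)^{-(1+1/\gamma)}

h(x) = H'(x) = g'(x)\, e^{g(x)}
     = (1+\gamma x)^{-(1+1/\gamma)}\, e^{-(1+\gamma x)^{-1/\gamma}}

% Case \gamma = 0: H(x) = e^{g(x)} with g(x) = -e^{-x}, so g'(x) = e^{-x}
h(x) = H'(x) = e^{-x}\, e^{-e^{-x}} = e^{-x - e^{-x}}
```

In both cases the pattern is the same: the derivative of e^{g(x)} is g'(x) e^{g(x)}, so the only real work is differentiating the exponent.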


Hi, can someone explain to me the point of dividing the data into blocks for the return period approach? What does it mean to find the “rate of extreme observations per thousand (%)”?


Hi beyoncepadthai,

We often talk about frequency and severity. Frequency is how often the event occurs, and severity is how severe the event is.

The return period approach deals with modeling the frequency. If we divide our data into blocks of 1000 and set a threshold, then for each block, we can see how many observations were above the threshold. These counts can then be used to model the frequency of an event exceeding the chosen threshold.

For example, assume we had 1500 data points and we break our data into blocks of 100. We want to know the number of data points that exceed 1,000,000. We now have 15 blocks of size 100. Perhaps our sample blocks exceed the threshold with the following frequencies: 1,0,3,2,5,0,1,2,2,1,2,2,0,2,3. We can then use these data points to model the rate per 100 that an event will exceed 1,000,000.
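
A minimal Python sketch of what could be done with those counts is below. The variable names are made up, and treating the simple average as the “rate per 100” is an assumption about the simplest possible frequency model, not necessarily the textbook's exact method.

```python
BLOCK_SIZE = 100

# Number of observations above 1,000,000 in each of the 15 blocks.
exceedances_per_block = [1, 0, 3, 2, 5, 0, 1, 2, 2, 1, 2, 2, 0, 2, 3]

n_blocks = len(exceedances_per_block)
total_exceedances = sum(exceedances_per_block)

# Average number of exceedances per block of 100 observations.
rate_per_block = total_exceedances / n_blocks

# Share of blocks that saw at least one exceedance.
share_of_blocks_hit = sum(1 for c in exceedances_per_block if c > 0) / n_blocks

# Average number of observations between exceedances.
observations_per_exceedance = (n_blocks * BLOCK_SIZE) / total_exceedances

print(f"Exceedances per {BLOCK_SIZE} observations: {rate_per_block:.2f}")
print(f"Blocks with at least one exceedance: {share_of_blocks_hit:.0%}")
print(f"Observations per exceedance: {observations_per_exceedance:.1f}")
```

Here there are 26 exceedances across 15 blocks, so the observed rate is about 1.73 per 100 observations, or roughly one exceedance every 58 observations.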


So any value or the lowest value in the linearly-increasing pattern can be our chosen value u? This indicates a tail pattern after u, right? Thanks.


The lowest possible value of u where the linear pattern starts is the ideal choice.


I see. Thanks a lot.