Graphical Diagnostics for Threshold Selection in Fitting the Generalized Pareto Distribution

When fitting the Generalized Pareto distribution (GPD), selecting an appropriate threshold is important for achieving an effective fit. The main objective of this study is to present five graphical diagnostics for selecting GPD thresholds. A further objective is to compare these graphical methods using goodness-of-fit tests. The maximum likelihood method was used to estimate the shape parameter. Flood data are then used to compare the five graphical diagnostics of threshold selection through the resulting shape parameter estimates. The results show that four graphical diagnostics (threshold choice plot, mean excesses plot, dispersion index plot and quantile-quantile plot) yield the same threshold range; the Hill plot is the exception. The threshold choice plot makes it simple to identify the range of thresholds over which the fit should be stable, whereas the other graphical diagnostics demand too much subjectivity and make it difficult to define a threshold range from the plots. In other words, the GPD model with a high threshold chosen by graphical diagnostics is an acceptable option according to the goodness-of-fit tests. All statistical analyses for the study were performed using the R statistical program.


Introduction
Estimating the probability of rare flood and drought events is an important problem in hydrology and in other fields concerned with high flows, and it requires estimating the parameters of a distribution. A review of flood frequency analysis is found in [1]; see [2] for floods and [3,4] for droughts. For the original theoretical development of the Generalized Pareto Distribution (GPD), see [5,6], and for further developments and applications, see [7]. The Peaks Over Threshold (POT) method has been used in many fields, for example with environmental data such as rainfall and sea levels and for modeling high flows with the GPD; a review is found in [8,9]. The properties of the GPD stated in [10,11] make the distribution a logical candidate for the analysis of extreme events, and practical problems of this kind are addressed in [12].
Fitting the GPD raises the problem of threshold choice. Several procedures for selecting the threshold in POT modeling are given in the literature; for example, [13] suggests the use of the Mean Excess Function (MEF), and [14] outlines two diagnostics for the choice of threshold: the Mean Excesses Plot (MEP) and the Threshold Choice Plot (TCP). The shape and scale parameters of the GPD can be estimated by several methods, such as the maximum likelihood (ML) method, see [13]; methods for estimating the GPD parameters were reviewed in [10], which also established the asymptotic normality and consistency of the ML estimators. Various numerical approaches have been suggested and applied, for example the Square Error Method (SEM) [15], automatic choice using the shape parameter [16], and the Automated Threshold Selection Method (ATSM) [17]. A graphical method is given in [18]. The literature discusses the problem of threshold selection at length, and the best method is still to be found.
The present study is organized as follows. Section 2 describes the theoretical background of the GPD. Fitting the GPD is introduced in Section 3, and model adequacy is treated in Section 4. Section 5 is concerned with diagnostic plots for choosing the best threshold value and introduces five graphical diagnostics. Section 6 describes an application to flood data. Conclusions and future work are given in Section 7.

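Generalized Pareto Distribution (GPD)
The GPD with scale parameter $\sigma > 0$ and shape parameter $\xi$, written GPD($\sigma$, $\xi$), is the distribution of $X = \sigma\,(e^{\xi Y} - 1)/\xi$, where $Y$ is a random variable with the standard exponential distribution. The GPD has the distribution function

$$F(x) = 1 - \left(1 + \frac{\xi x}{\sigma}\right)^{-1/\xi}, \qquad x \ge 0,\ 1 + \frac{\xi x}{\sigma} > 0,$$

and the density (p.d.f.)

$$f(x) = \frac{1}{\sigma}\left(1 + \frac{\xi x}{\sigma}\right)^{-1/\xi - 1}.$$

The p-quantile of the GPD is

$$x_p = \frac{\sigma}{\xi}\left[(1 - p)^{-\xi} - 1\right].$$

If $x_1, \dots, x_n$ are observations from the GPD and the parameters of the distribution are unknown, then the likelihood function of the GPD can be expressed as follows:

$$L(\sigma, \xi) = \prod_{i=1}^{n} \frac{1}{\sigma}\left(1 + \frac{\xi x_i}{\sigma}\right)^{-1/\xi - 1}. \qquad (4)$$

The method of MLE works by finding the parameter values that maximize the likelihood function. Equivalently, the log-likelihood, obtained by taking the logarithm of (4), may be maximized:

$$\ell(\sigma, \xi) = -n \ln \sigma - \left(\frac{1}{\xi} + 1\right) \sum_{i=1}^{n} \ln\left(1 + \frac{\xi x_i}{\sigma}\right). \qquad (5)$$

The MLE has no closed form, so numerical optimization routines (e.g., the Newton-Raphson method) are used to find the MLE of $\sigma$ and $\xi$. For computational details, see [19,20].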
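The quantities used in fitting the GPD can be sketched in a few lines of Python. This is only an illustrative sketch under assumptions: it uses the common parametrization with scale `sigma` and shape `xi` (heavy tail for `xi > 0`), treats the data as excesses over the threshold, and the function names are hypothetical; it is not the R code used in the study.

```python
import math

def gpd_cdf(x, sigma, xi):
    """GPD distribution function for an excess x (assumed parametrization)."""
    if x < 0:
        return 0.0
    if abs(xi) < 1e-12:                 # xi -> 0: exponential distribution
        return 1.0 - math.exp(-x / sigma)
    z = 1.0 + xi * x / sigma
    if z <= 0:                          # beyond the upper end point (xi < 0)
        return 1.0
    return 1.0 - z ** (-1.0 / xi)

def gpd_quantile(p, sigma, xi):
    """p-quantile of the GPD, obtained by inverting gpd_cdf."""
    if abs(xi) < 1e-12:
        return -sigma * math.log(1.0 - p)
    return sigma / xi * ((1.0 - p) ** (-xi) - 1.0)

def gpd_nll(excesses, sigma, xi):
    """Negative log-likelihood; math.inf outside the parameter space."""
    if sigma <= 0:
        return math.inf
    n = len(excesses)
    if abs(xi) < 1e-12:
        return n * math.log(sigma) + sum(excesses) / sigma
    total = 0.0
    for x in excesses:
        z = 1.0 + xi * x / sigma
        if z <= 0:
            return math.inf
        total += math.log(z)
    return n * math.log(sigma) + (1.0 / xi + 1.0) * total
```

A numerical optimizer would minimize `gpd_nll` over `sigma` and `xi`; returning infinity outside the admissible region keeps such a search inside the parameter space.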

Model adequacy
Hypothesis testing can be used to select the best-fitting model for the data over the threshold. We therefore apply two criteria to test whether the GPD is a reasonable model for our flood dataset.

Deviance Test (DT)
The deviance test is a statistical test that provides an objective criterion for selecting among possible models. The null and alternative hypotheses are written as $H_0: \xi = 0$ and $H_1: \xi \neq 0$. This test can be found in [14]. The deviance statistic is given by

$$D = 2\,\{\ell_1(M_1) - \ell_0(M_0)\},$$

where $\ell_0(M_0)$ and $\ell_1(M_1)$ are the maximized values of the log-likelihood under $H_0$ and $H_1$, respectively. The exponential distribution is accepted, that is, $H_0$ is retained, when $D$ does not exceed the appropriate quantile of the $\chi^2_1$ distribution. For details see [21]. Another test that can be performed comes from the Akaike information criterion (AIC).
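As a rough illustration of the deviance test, the sketch below compares an exponential fit (shape fixed at zero) with a full GPD fit. It assumes one degree of freedom and a 5% level (chi-square critical value 3.841459); the function name and the log-likelihood inputs are hypothetical, and the maximized log-likelihoods are taken as already computed.

```python
import math

def deviance_test(loglik_exp, loglik_gpd, critical_value=3.841459):
    """Deviance test of the exponential model (H0) against the GPD (H1).

    loglik_exp, loglik_gpd: maximized log-likelihoods of the two models.
    critical_value: 0.95 quantile of the chi-square distribution, 1 d.f.
    Returns (deviance, p_value, reject_h0).
    """
    deviance = max(2.0 * (loglik_gpd - loglik_exp), 0.0)  # guard tiny negatives
    # For one degree of freedom, P(chi2 > d) = erfc(sqrt(d / 2)).
    p_value = math.erfc(math.sqrt(deviance / 2.0))
    return deviance, p_value, deviance > critical_value
```

With hypothetical log-likelihoods of -100.0 (exponential) and -98.0 (GPD), the deviance is 4.0, which exceeds the critical value, so the exponential model would be rejected in favor of the GPD.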

Akaike information criterion (AIC test)
To compare a number of models, we use the Akaike information criterion (AIC), see [22,23]. The AIC is given by

$$\mathrm{AIC} = 2\,\mathrm{nll} + 2p,$$

where nll is the negative log-likelihood and $p$ is the number of parameters of the distribution. The first term of the AIC measures the goodness of fit and the second term penalizes model complexity. The model with the smallest AIC is preferred. The AIC and DT can be computed with the R packages POT and evd [30,31].
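The AIC comparison can be sketched as follows, assuming the usual form of twice the negative log-likelihood plus twice the number of estimated parameters. The helper names and the numbers in the usage note are hypothetical, not results from the study.

```python
def aic(nll, n_params):
    """Akaike information criterion from a negative log-likelihood."""
    return 2.0 * nll + 2.0 * n_params

def prefer_by_aic(models):
    """Return the name of the model with the smallest AIC.

    models: dict mapping a model name to a (nll, n_params) pair.
    """
    return min(models, key=lambda name: aic(*models[name]))
```

For example, `prefer_by_aic({"exponential": (101.0, 1), "gpd": (100.5, 2)})` selects the exponential model: the small gain in likelihood of the GPD does not offset the penalty for its extra parameter.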

Graphical Diagnostics for Threshold Choice
Graphical diagnostics are needed to select the threshold value. In this section, we introduce five diagnostic plots for selecting a suitable threshold for fitting the GPD. Diagnostics for the choice of threshold were outlined in [14]. Several popular methods exist for choosing a suitable threshold. In the threshold choice (stability) plot, the parameter estimates are plotted against the threshold together with a confidence interval for each estimate (see [14,30,34]), and $u_0$ is selected as the smallest value above which the estimates remain in a stable region. The stability plot shows the points

$$\{(u, \hat{\sigma}^{*}_u) : u < x_{\max}\} \quad \text{and} \quad \{(u, \hat{\xi}_u) : u < x_{\max}\},$$

where $\hat{\sigma}^{*}_u$ and $\hat{\xi}_u$ are the modified scale and shape estimates at threshold $u$, and $x_{\max}$ is the maximum of the observations $X$.
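The reason a stable region is expected can be sketched with the standard threshold-stability property of the GPD. The symbols $\sigma_u$ (scale at threshold $u$), $\xi$ (shape) and $\sigma^{*}$ (modified scale) are notation assumed here for illustration.

```latex
% If the excesses over u_0 follow GPD(\sigma_{u_0}, \xi), then for any u > u_0
% the excesses over u are again GPD with the same shape \xi and scale \sigma_u:
\begin{align*}
\sigma_u   &= \sigma_{u_0} + \xi\,(u - u_0), \\
\sigma^{*} &= \sigma_u - \xi\,u = \sigma_{u_0} - \xi\,u_0 .
\end{align*}
% Hence both \xi and the modified scale \sigma^{*} do not depend on u above u_0.
```

Above a valid threshold, the estimates of both quantities should therefore be roughly constant, up to sampling variability, which is what the stability plot is inspected for.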

Mean Excesses Plot (MEP)
Another graphical method widely used to determine the threshold is the mean excesses plot (MEP), a graphical method for assessing a distribution function $F$. The MEP was described in [15,28,34] and uses the mean excesses of the GPD to select the best threshold. If $X$ follows a GPD with scale $\sigma$ and shape $\xi < 1$, the mean excess over a threshold $u$ is

$$e(u) = E[X - u \mid X > u] = \frac{\sigma + \xi u}{1 - \xi},$$

which is linear in $u$. The MEP plots the sample mean excess against the threshold $u$.
If the MEP shows an upward linear trend over the threshold $u$, the data support a GPD with a long tail ($\xi > 0$). A horizontal MEP (medium tail, $\xi$ near 0) indicates an exponential model, while a downward trend indicates a thin tail ($\xi < 0$); see [16] for a discussion of the properties of this function. In practice, based on a sample of size $n$, if the data follow a GPD above some high threshold, we expect the MEP to be approximately linear in view of Eq-11. One difficulty with the MEP is its large dispersion, especially at high thresholds, which makes it hard to decide whether an observed departure from linearity is genuine. Another graphical plot that can be used to select the threshold is the Hill plot (HP).
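A minimal sketch of how the points of a mean excesses plot could be computed is given below. It assumes the sample mean excess at a threshold is the average of the excesses of all observations strictly above that threshold; the function names are hypothetical.

```python
def mean_excess(data, u):
    """Sample mean excess over threshold u (average of x - u for x > u)."""
    excesses = [x - u for x in data if x > u]
    if not excesses:
        raise ValueError("no observations exceed the threshold")
    return sum(excesses) / len(excesses)

def mean_excess_points(data, thresholds):
    """Points (u, mean excess at u) for thresholds below the sample maximum."""
    x_max = max(data)
    return [(u, mean_excess(data, u)) for u in thresholds if u < x_max]
```

Plotting these points against the threshold and looking for the region where they become roughly linear is the visual step of the diagnostic; the few exceedances at the highest thresholds explain the large dispersion there.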

Hill plot (HP)
The most popular tail index estimator is the Hill estimator [25], which, however, is restricted to the Pareto case $\xi > 0$. The Hill plot (HP) is another approach to estimating the shape of heavy-tailed (positive-shape) distributions. The HP is employed for the following reasons: easy implementation, asymptotic unbiasedness in large samples, and efficiency as an estimator of the shape parameter $\xi$. Let $X_{(1)} \le X_{(2)} \le \dots \le X_{(n)}$ denote the order statistics of the sample; the Hill estimator (HE) is given by the formula

$$\hat{\xi}_{k} = \frac{1}{k} \sum_{i=1}^{k} \ln X_{(n-i+1)} - \ln X_{(n-k)}, \qquad 1 \le k < n.$$

For every choice of $k$ we obtain a different estimate of $\xi$. The threshold $u$ is chosen from the region of the plot where the tail index (TI) estimate remains constant. However, this choice is not always clear: we hope to find a stable region in which the TI estimates do not change as the threshold varies. The results from the HE depend on the TI selection. The HP, studied in [26], is a further diagnostic plot for establishing the TI: it shows the TI estimate against a range of values of $k$, or against the corresponding threshold.
Many authors have studied the statistical behavior and properties of the HE in order to extend it to the general case $\xi \in \mathbb{R}$. Recent generalizations of the HP for shape $\xi \in \mathbb{R}$ are introduced in [26,27]. The authors of [29] extend the idea of Hill to derive a plug-in estimator by applying a hypothesis test to an accumulation of the log-spacings, and further consider a kernel-based goodness-of-fit statistic for the tail fit in the Pareto-type tail case. Another approach that has been considered for threshold selection is the dispersion index plot.
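The Hill estimator and the points of a Hill plot can be sketched as below. The sketch assumes the standard form of the estimator (the mean of the logs of the k largest observations minus the log of the next largest) and strictly positive data; the function names are hypothetical.

```python
import math

def hill_estimator(data, k):
    """Hill estimate of the tail index from the k largest observations."""
    xs = sorted(data, reverse=True)      # xs[0] is the largest observation
    if not 1 <= k < len(xs):
        raise ValueError("k must satisfy 1 <= k < n")
    if xs[k] <= 0:
        raise ValueError("the upper order statistics must be positive")
    log_ref = math.log(xs[k])            # log of the (k+1)-th largest value
    return sum(math.log(xs[i]) - log_ref for i in range(k)) / k

def hill_plot_points(data, k_values):
    """Points (k, Hill estimate) to be plotted against k."""
    return [(k, hill_estimator(data, k)) for k in k_values]
```

The threshold implied by a given k is the (k+1)-th largest observation, so the same points can equally be plotted against the threshold instead of k.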

Dispersion Index Plot (DIP)
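Another plot for determining the threshold, the Dispersion Index Plot (DIP), is presented in [30]. The DIP is especially useful when the excesses over a high threshold are asymptotically approximated by a generalized Pareto distribution and their occurrences by a Poisson process. Let $X$ be a random variable with a Poisson distribution with parameter $\lambda$, that is,

$$P(X = k) = \frac{e^{-\lambda} \lambda^{k}}{k!}, \qquad k = 0, 1, 2, \dots,$$

so that $E[X] = \mathrm{Var}[X] = \lambda$. The dispersion index (DI) is the ratio of the variance to the mean of the number of events, where $\lambda$ is the average number of events; it equals 1 for a Poisson process. Moreover, a confidence interval can be computed for the DI. The last graphical plot, based on the estimation of the model over a range of thresholds, is the Quantile-Quantile Plot (QQP).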
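The dispersion index at a single threshold can be sketched as the sample variance of the yearly exceedance counts divided by their mean. The sketch assumes events are grouped by calendar year; the function names are hypothetical, and this is not the implementation of the R package used in the study.

```python
from statistics import mean, variance

def yearly_counts(event_years, years):
    """Number of threshold exceedances observed in each year of `years`."""
    return [sum(1 for y in event_years if y == year) for year in years]

def dispersion_index(counts):
    """Sample variance of the counts divided by their mean."""
    m = mean(counts)
    if m == 0:
        raise ValueError("no events: the dispersion index is undefined")
    return variance(counts) / m
```

Repeating the computation over a grid of thresholds and plotting the index against the threshold gives the DIP; thresholds where the index stays close to 1 are consistent with Poisson occurrences.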

Quantail Quantail plot (QQ-Plot)
The QQ-plot was introduced in [30] as an alternative to the Hill plot. The estimate of $\xi$ is obtained as the slope of the line fitted to the upper tail of the QQ-plot; this estimator, defined in [15,32], is the least-squares slope through the points $\left(-\ln\frac{i}{k+1},\ \ln X_{(n-i+1)}\right)$, $i = 1, \dots, k$ (Eq. 18). Some features were established by the authors, such as the asymptotic variance of the qq-estimator, weak consistency, and asymptotic normality. When the convergence is not perfect, the main advantage of the qq-estimator over the HE is that the residuals of the plot provide information about the bias in the estimates. The estimator is therefore presented as a reasonable choice when the standard estimator is biased.
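A least-squares version of the qq-estimator can be sketched as follows. It assumes the estimator is the slope of the regression of the logs of the k largest observations on the exponential plotting positions -ln(i/(k+1)); the function name is hypothetical, and the exact form used in the cited references may differ.

```python
import math

def qq_estimator(data, k):
    """Least-squares slope of the upper tail of a Pareto QQ-plot."""
    xs = sorted(data, reverse=True)              # xs[0] is the largest value
    if not 2 <= k <= len(xs):
        raise ValueError("k must satisfy 2 <= k <= n")
    # Plotting position paired with the i-th largest observation.
    x = [-math.log(i / (k + 1.0)) for i in range(1, k + 1)]
    y = [math.log(v) for v in xs[:k]]            # requires positive data
    x_bar = sum(x) / k
    y_bar = sum(y) / k
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    return s_xy / s_xx
```

The residuals of this regression are what can be inspected for signs of bias, which is the advantage over the Hill estimator mentioned above.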

Application on real data
In this study, we use the ardieres data frame, which contains flood discharges of the Ardières river at Beaujeu over a period of 33 years. The data are taken from the POT package in R [30,33] and [35]. This dataset was chosen because it has been used in several applications of extreme value analysis and because other real data on extreme value phenomena were not available. First, extreme events must be extracted from the time series while preserving independence between events, and a suitable threshold must be selected such that the asymptotic approximation in equation (1) is good enough.
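One simple way to extract approximately independent events is to group exceedances that are close in time into clusters and keep only each cluster maximum. The sketch below uses a plain time-gap rule; the function name, the `min_gap` parameter and the rule itself are assumptions for illustration and are not the declustering procedure of the POT package.

```python
def decluster(times, values, u, min_gap):
    """Keep one peak per cluster of exceedances of the threshold u.

    Exceedances whose times differ by at most min_gap from the previous
    exceedance belong to the same cluster; each cluster contributes its
    largest value. Returns a list of (time, value) pairs.
    """
    exceedances = sorted((t, v) for t, v in zip(times, values) if v > u)
    peaks = []
    last_time = None
    for t, v in exceedances:
        if last_time is None or t - last_time > min_gap:
            peaks.append((t, v))         # start a new cluster
        elif v > peaks[-1][1]:
            peaks[-1] = (t, v)           # larger value within the cluster
        last_time = t
    return peaks
```

With times `[1, 2, 3, 10, 11, 20]`, values `[5, 12, 11, 15, 9, 13]`, threshold 10 and a gap of 3 time units, the exceedances at times 2 and 3 form one cluster, and the retained peaks are `(2, 12)`, `(10, 15)` and `(20, 13)`.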

Descriptive Statistic
First, descriptive statistics summarize the characteristics of the dataset and show the behavior of the data; the summary statistics suggest heavy tails. Before comparing distributions, we tested the real data for symmetry (normality) versus asymmetry. For this verification we performed the Shapiro-Wilk test and the Kolmogorov-Smirnov test (ks.test), which confirm that the data do not follow a normal distribution. A positive skewness signifies an asymmetric (non-normal) distribution with a tail extending towards larger positive values. If data are not symmetric, the mean and median are not equal (or similar). If the distribution is skewed to the right, the median is often less than the mean; therefore, there is strong reason to suggest that these data belong to the heavy-tailed GPD (long tail, $\xi > 0$).

Graphical Diagnostics (GD)
To fit the GPD to our data, a threshold must first be chosen. Different graphical diagnostics have been suggested for choosing the threshold. All the graphical diagnostics presented in Section 5 were used, so that the threshold values obtained can be compared. In the MEP we look for approximate linearity while staying within the confidence bounds. Vertical dashed lines mark the selected thresholds on the MEP, since there is some indication that the plot is roughly linear above this threshold, which lies at about the 97th percentile of the data. From Fig-1, the MEP ultimately increases, so the TI of our data is positive. The plot appears linear from threshold $u = 10$ to $u = 13$, but it is not easy to detect which point is an appropriate threshold. Based on the linearity property, all points in (10, 13) may be suitable thresholds.
The Hill plot is shown in the middle-right panel of Fig-1. The vertical dashed lines, in combination with the HE, indicate that the HP is rather stable over a range of thresholds, in keeping with the overall recommendations for threshold selection using the HP. However, the shape parameter estimated in this range is inconsistent with the estimates at the thresholds from the TCP and MEP.
The DIP, shown in the bottom-left panel of Fig-1, does not quite stabilize in any particular region, which makes it extremely difficult to select a specific threshold value from this plot. The closest to a stable region in the DIP is (11, 13).

* denotes the p-value.
Table-2 summarizes the estimates of the shape parameter $\xi$ obtained by MLE at each of the four threshold choices. The results are stable for the TCP, MEP and QQ-plot, but not for the HP. For the comparison between diagnostic plots, and to achieve a good model fit, a suitable threshold was chosen, as is traditional, before fitting the GPD. The estimated shape parameters are very close to one another, with no great difference between them, and the shape parameter was always positive.

Goodness-of-Fit Criteria (DT & AIC)
Goodness of fit was assessed with the DT and AIC. Both criteria are used to determine whether the observed differences are consistent and to find the most appropriate threshold choice for fitting the GPD to the data. Note that the AIC values for the TCP and MEP are lower than those for the HP and QQP in all diagnostic plots. Therefore, the TCP gave the best fit in all cases, and its stability was higher. The estimated shape parameter depends on the chosen threshold. In this case, the data pass both goodness-of-fit tests for all the thresholds considered, and the tests are significant.
Another way of checking the adequacy of the GPD fit is through diagnostic plots: P-P plots, Q-Q plots and return level (RL) plots can be used to select the best GPD fit. All diagnostics seem to indicate a reasonable fit of the GPD to our data. Visually, there is a slight difference in the estimated threshold: four of the graphical diagnostics (TCP, MEP, DI and QQP) share the same threshold range (10, 13), whereas the Hill plot differs from the others. All of them depend on visual inspection to identify the threshold. The level above which linearity is evident might be used as a threshold. The model may, however, provide significantly different outcomes for different threshold values. An important step in any strategy is to select a threshold from the ME plot such that the plot is nearly linear above it. Threshold selection can be difficult, and parameter estimates may be affected by the threshold selected, particularly when real data are analyzed. It is hard to determine the best threshold using this method. Locating a stable zone in the HP is quite difficult, as is dealing with the considerable dispersion in the region of the upper-order statistics in the DI. Problems can occur when there are departures from the GPD. The QQ-plot is an alternative to the HP. Although the HP seems to be less smooth than the QQ-plot, problems can still occur when there are departures from the Pareto distribution.
Some issues remain for consideration. The first is how to choose the threshold value, or the number of upper-order statistics required for the GPD, by automated threshold selection instead of graphical diagnostics, whose interpretation requires prior experience and is subjective. The second is how to modify the methods proposed in the literature, using smoothing and robustifying procedures and other diagnostic tests, in order to select the best threshold for fitting the GPD model. Finally, all the results for graphical diagnostics obtained here for the GPD under linear normalization could be carried over to the nonlinear case, so further practical applications to the GPD under power normalization are recommended.
The three cases of the shape parameter correspond to the usual GPD, a type II GPD, and the exponential distribution, respectively. The GPD can be extended by adding a location parameter $\mu$.

Graphical Diagnostics for Threshold Selection in Fitting the Generalized Pareto. Alaswed. JOPAS Vol.23 No.1 2024 92
numerical procedures, plotting methods and combinations of them. Five diagnostic plots for threshold choice are as follows:
Threshold Choice Plot (TCP)
The stability plot (SP), or threshold choice plot (TCP), is a procedure for selecting a range of thresholds for fitting the GPD. If the GPD is a suitable fit for a high threshold $u_0$, the mean excess is estimated by its empirical mean. The sample ME function is given by:

$$e_n(u) = \frac{\sum_{i=1}^{n} (x_i - u)\, I(x_i > u)}{\sum_{i=1}^{n} I(x_i > u)}.$$

To show the sample ME-plot, we plot the points as follows:

$$\{(u, e_n(u)) : u < x_{\max}\}.$$

Fig-1 shows five graphical diagnostics for selecting the threshold value and helps in choosing where to begin looking at thresholds. In Fig-1, the TCP occupies the top-left and top-right panels, the MEP the middle-left panel, the Hill plot the middle-right panel, the DIP the bottom-left panel and the QQ-plot the bottom-right panel. The plots are not easy to read by eye, and their interpretation often requires a good deal of subjective judgment.

Fig-1: Graphical diagnostics for the threshold choice when fitting the GPD; the vertical line indicates the graphically selected threshold.
The results of the graphical diagnostics are shown in Fig-1. The TCP is shown in the top-left and top-right panels: we look for a horizontal line that cuts through all the confidence bars, and both the modified scale and the shape estimates appear constant (stable) over the range $u = 10$ to $u = 13$. High dispersion is present in the region of the high-order statistics in the DIP. The Q-Q plot of empirical versus GPD model quantiles is shown in the bottom-right panel of Fig-1. It shows a rapid increase up to around the threshold range $u = 10$ to $u = 13$ and then levels off, so a threshold in this range should be chosen. A vertical line marks the location of our threshold.
Fitting the GPD
Based on Fig-1, five different thresholds are chosen, and the GPD shape parameter is estimated at each by MLE. The distributions were fitted at the different threshold values $u$ by MLE, and the resulting estimates and goodness-of-fit tests are presented in Table-2.

Fig-5: Graphical diagnostic plots of threshold selection via Hill=6.
In Fig-2, both the probability plot (top left) and the Q-Q plot (top right) show that the GPD fit is reasonable. The form of the RL curve (bottom left) reflects the positive shape estimate. Finally, the corresponding density estimate (bottom right) seems consistent with the histogram of the data. Consequently, all four diagnostic plots support the GPD model. Fig-2 shows the diagnostic plots at the threshold for which the GPD fits best. Moreover, all the diagnostic plots in Figs 2-5 seem to indicate a reasonable fit of the GPD to our data.
Conclusion
The problem of threshold selection when estimating a long-tailed model is very important in many practical applications. In this study, we have presented five graphical threshold selection methods. Various diagnostic plots for evaluating the GPD fit are commonly used for selecting the threshold; the five plots considered are the stability plot (SP) or threshold choice plot (TCP), the mean excesses plot (MEP), the Hill plot (HP), the dispersion index plot (DIP) and the Q-Q plot (QQP). These graphical methods are diagnostic plots drawn before fitting
... when compared with the other diagnostic plots, which require a great deal of subjective experience to interpret. The results also showed that the Pareto model, with the diagnostic plots for choosing a high threshold, is a suitable choice according to the two tests. All computations in this paper were carried out in R, both to estimate the parameters and to produce the diagnostic plots and the two tests.

Fitting the GPD
After the threshold value is chosen, the GPD parameters are estimated by MLE. Considering that the data
Table-1 shows some important descriptive statistics (min, max, mean, median and skewness). The results in Table-1 show that the sample mean is

Table 1. Summary statistics for the ardieres data

Table 2. Estimated parameter of GPD by MLE for different graphical