Statistical Tests
Combine is a likelihood based statistical tool. That means that it uses the likelihood function to define statistical tests.
Combine provides a number of customization options for each test; as always, it is up to the user to choose an appropriate test and options.
General Framework
Statistical tests
Combine implements a number of different customizable statistical tests. These tests can be used for purposes such as determining the significance of some new physics model over the standard model, setting limits, estimating parameters, and checking goodness of fit.
These tests are all performed on a given model (null hypothesis), and often require additional specification of an alternative model.
The statistical test then typically requires defining some "test statistic", $t$, which is a real-valued function of the observed data:

$$ t(\mathrm{data}) \in \mathbb{R}. $$
For example, in a simple coin-flipping experiment, the number of heads could be used as the test statistic.
The distribution of the test statistic should be estimated under the null hypothesis (and the alternative hypothesis, if applicable).
Then the value of the test statistic on the actual observed data, $t_{\mathrm{obs}}$, is calculated and compared to the expected distribution(s).
This comparison, which depends on the test in question, defines the results of the test, which may be simple binary results (e.g. this model point is rejected at a given confidence level), or continuous (e.g. defining the degree to which the data are considered surprising, given the model). Often, as either a final result or as an intermediate step, the p-value of the observed test statistic under a given hypothesis is calculated.
How p-values are calculated
The distribution of the test statistic, $t$, is estimated under the hypothesis in question, giving a probability distribution $f(t)$.
And the observed value of the test statistic is $t_{\mathrm{obs}}$. The p-value is then the probability, under the given hypothesis, of a test statistic at least as extreme as the one observed; for example, using the right tail:

$$ p = \int_{t_{\mathrm{obs}}}^{\infty} f(t)\,\mathrm{d}t. $$

In some cases, the bounds of the integral may be modified, such as using the left tail, $(-\infty, t_{\mathrm{obs}})$, depending on the details of the test being performed.
The p-values using the left tail and right tail are related to each other via $p_{\mathrm{left}} = 1 - p_{\mathrm{right}}$ (for a continuous test statistic distribution).
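To make this concrete, the coin-flip example above can be worked through numerically. The following sketch is illustrative only (it does not use combine); the flip count and observed number of heads are made up for the example.

```python
# Toy p-value calculation for the coin-flip example: the null hypothesis
# is a fair coin, and the test statistic is the number of heads in N flips.
from scipy.stats import binom

n_flips = 100
p_head = 0.5       # null hypothesis: fair coin
n_heads_obs = 62   # hypothetical observed number of heads

# Right-tail p-value: probability of >= 62 heads under the null.
# binom.sf(k, ...) gives P(X > k), so pass n_heads_obs - 1.
p_right = binom.sf(n_heads_obs - 1, n_flips, p_head)

# Left-tail p-value: probability of <= 62 heads. For a discrete statistic
# the two tails overlap at the observed value, so p_left + p_right exceeds
# 1 by P(X = 62); for a continuous statistic, p_left = 1 - p_right.
p_left = binom.cdf(n_heads_obs, n_flips, p_head)

print(f"right-tail p-value: {p_right:.4f}")  # about 0.01 for this observation
```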
Test Statistics
The test statistic can be any real-valued function of the data. While in principle many valid test statistics can be used, the choice of test statistic is very important, as it influences the power of the statistical test.
By associating a single real value with every observation, the test statistic allows us to recast the question "how likely was this observation?" in the form of a quantitative question about the value of the test statistic. Ideally a good test statistic should return different values for likely outcomes as compared to unlikely outcomes, and the expected distributions under the null and alternative hypotheses should be well separated.
In many situations, extremely useful test statistics, sometimes optimal ones for particular tasks, can be constructed from the likelihood function itself:

$$ t(\mathrm{data}) = f(\mathcal{L}). $$
Even for a given statistical test, several likelihood-based test-statistics may be suitable, and for some tests combine implements multiple test-statistics from which the user can choose.
Tests with Likelihood Ratio Test Statistics
The likelihood function itself often forms a good basis for building test statistics.
Typically the absolute value of the likelihood itself is not very meaningful as it depends on many fixed aspects we are usually not interested in on their own, like the size of the parameter space and the number of observations. However, quantities such as the ratio of the likelihood at two different points in parameter space are very informative about the relative merits of those two models.
The likelihood ratio and likelihood ratio based test statistics
A very useful test statistic is the likelihood ratio of two models:

$$ \Lambda \equiv \frac{\mathcal{L}_{\mathcal{M}}}{\mathcal{L}_{\mathcal{M}'}}. $$
For technical and convenience reasons, often the negative logarithm of the likelihood ratio is used:

$$ t \propto -\log(\Lambda) = \log(\mathcal{L}_{\mathcal{M}'}) - \log(\mathcal{L}_{\mathcal{M}}), $$

with different proportionality constants being most convenient in different circumstances.
The negative sign is used by convention since usually the ratios are constructed so that the larger likelihood value must be in the denominator.
This way, $t$ is non-negative, and larger values of $t$ indicate that the model in the numerator is increasingly disfavoured relative to the model in the denominator.
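As an illustration, the sketch below (not combine code; all yields are hypothetical) computes the negative log likelihood ratio of a background-only model and a signal-plus-background model for a single-bin Poisson counting experiment.

```python
# Negative log likelihood ratio for a single-bin Poisson counting experiment.
from scipy.stats import poisson

n_obs = 12   # hypothetical observed event count
b = 5.0      # expected background yield
s = 7.0      # expected signal yield

# Log-likelihoods of the two models for this observation.
logL_bkg = poisson.logpmf(n_obs, b)      # background-only model
logL_sb = poisson.logpmf(n_obs, s + b)   # signal-plus-background model

# t = -log( L_bkg / L_s+b ); positive when signal+background fits better.
t = -(logL_bkg - logL_sb)
print(f"t = {t:.3f}")
```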
Sets of test statistics
If the parameters of both likelihoods in the ratio are fixed to a single value, then that defines a single test statistic.
Often, however, we are interested in testing "sets" of models, parameterized by some set of values $\vec{\mu}$.
This is important in limit setting for example, where we perform statistical tests to exclude entire ranges of the parameter space.
In these cases, the likelihood ratio (or a function of it) can be used to define a set of test statistics parameterized by the model parameters. For example, a very useful set of test statistics is:
$$ t_{\vec{\mu}} \propto -\log\left(\frac{\mathcal{L}(\vec{\mu})}{\mathcal{L}(\vec{\hat{\mu}})}\right) $$

Where the likelihood parameters in the bottom are fixed to their maximum likelihood values, $\vec{\hat{\mu}}$, while the parameters $\vec{\mu}$ in the numerator are fixed to the particular point being tested; this defines one test statistic for each point $\vec{\mu}$ in the parameter space.
When calculating the p-values for these statistical tests, the p-values are calculated at each point in parameter space using the test statistic for that point.
In other words, the observed and expected distributions of the test statistics are computed separately at each parameter point $\vec{\mu}$ being considered.
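The sketch below illustrates such a parameterized set of test statistics for the same hypothetical single-bin counting experiment; with no nuisance parameters, the maximum likelihood estimate has a closed form.

```python
# Sketch of the parameterized family of test statistics
# t_mu = -log( L(mu) / L(mu_hat) ) for a single-bin Poisson model
# n ~ Pois(mu*s + b) with no nuisance parameters.
import numpy as np
from scipy.stats import poisson

n_obs, s, b = 12, 7.0, 5.0

def logL(mu):
    return poisson.logpmf(n_obs, mu * s + b)

mu_hat = (n_obs - b) / s  # maximum likelihood estimate (closed form here)

for mu in np.linspace(0.0, 3.0, 7):
    t_mu = -(logL(mu) - logL(mu_hat))  # one test statistic per point mu
    print(f"mu = {mu:.2f}: t_mu = {t_mu:.3f}")
```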
Expected distributions of likelihood ratio test statistics
Under appropriate conditions, the distribution of $t_{\vec{\mu}}$ can be approximated analytically, for example via Wilks' theorem, in which case the p-value of the observed test statistic can be calculated from the known form of the expected distribution.
Combine provides asymptotic methods for limit setting, significance tests, and computing confidence intervals, which make use of these approximations for fast calculations.
In the general case, however, the distribution of the test statistic is not known, and it must be estimated. Typically it is estimated by generating many sets of pseudo-data from the model and using the empirical distribution of the test statistic.
Combine also provides methods for limit setting, significance tests, and computing confidence intervals which use pseudodata generation to estimate the expected test-statistic distributions, and therefore don't depend on the asymptotic approximation. Methods are also provided for generating pseudodata without running a particular test, which can be saved and used for estimating expected distributions.
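A minimal sketch of the pseudo-data approach, again for the hypothetical single-bin counting model (not how combine is invoked in practice), might look like:

```python
# Sketch of estimating a test-statistic distribution with pseudo-data
# ("toys") and deriving an empirical p-value from it.
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(seed=1)
s, b, mu_test, n_obs = 7.0, 5.0, 1.0, 12

def t_mu(n, mu):
    # With mu unconstrained, the best-fit expectation mu_hat*s + b equals
    # n itself, so the denominator is the Poisson likelihood at mean n.
    best_fit = np.maximum(n, 1e-9)
    return -(poisson.logpmf(n, mu * s + b) - poisson.logpmf(n, best_fit))

# Generate pseudo-data under the hypothesis being tested (mu = mu_test).
toys = rng.poisson(mu_test * s + b, size=100_000)
t_toys = t_mu(toys, mu_test)

# Empirical right-tail p-value of the observed test statistic.
t_obs = t_mu(n_obs, mu_test)
p_value = np.mean(t_toys >= t_obs)
print(f"t_obs = {t_obs:.3f}, p-value = {p_value:.3f}")
```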
Parameter Estimation using the likelihood ratio
A common use case for likelihood ratios is estimating the values of some parameters, such as the parameters of interest, $\vec{\mu}$.
A confidence region for the parameters $\vec{\mu}$ can be defined by using the profile likelihood ratio test statistic:

$$ t_{\vec{\mu}} = -2\log\left(\frac{\mathcal{L}(\vec{\mu}, \vec{\hat{\nu}}(\vec{\mu}))}{\mathcal{L}(\vec{\hat{\mu}}, \vec{\hat{\nu}})}\right) $$

Where the likelihood in the top is the value of the likelihood at a point $\vec{\mu}$, profiled over the nuisance parameters $\vec{\nu}$, and the likelihood in the bottom is evaluated at the global best fit point $(\vec{\hat{\mu}}, \vec{\hat{\nu}})$.
Then the confidence region can be defined as the region where the p-value of the observed test-statistic is less than the confidence level:

$$ \{ \vec{\mu}_{\mathrm{CL}} \} = \{ \vec{\mu} : p_{\vec{\mu}} \lt \mathrm{CL} \}. $$
This construction will satisfy the frequentist coverage property that the confidence region contains the parameter values used to generate the data in $\mathrm{CL}$ fraction of cases.
In many cases, Wilks' theorem can be used to calculate the p-value, and the criterion on $p_{\vec{\mu}}$ can be converted directly into a criterion on the test statistic itself, $t_{\vec{\mu}} \lt \gamma_{\mathrm{CL}}$, where $\gamma_{\mathrm{CL}}$ is a known function of the confidence level that depends on the dimensionality of the parameter space.
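For a single parameter of interest, $\gamma_{\mathrm{CL}}$ is the corresponding quantile of a $\chi^{2}$ distribution with one degree of freedom. The sketch below illustrates this construction for the hypothetical counting experiment used above.

```python
# Sketch of a Wilks'-theorem-based confidence interval for a single
# parameter of interest: the region where t_mu < gamma_CL, with gamma_CL
# the chi-square quantile for the chosen confidence level.
import numpy as np
from scipy.stats import poisson, chi2

n_obs, s, b = 12, 7.0, 5.0
cl = 0.68

def t_mu(mu):
    mu_hat = (n_obs - b) / s
    # t_mu = -2 log( L(mu) / L(mu_hat) )
    return -2 * (poisson.logpmf(n_obs, mu * s + b)
                 - poisson.logpmf(n_obs, mu_hat * s + b))

gamma_cl = chi2.ppf(cl, df=1)  # threshold for one parameter of interest

mu_scan = np.linspace(0.0, 3.0, 601)
accepted = [mu for mu in mu_scan if t_mu(mu) < gamma_cl]
print(f"{cl:.0%} CL interval: [{min(accepted):.2f}, {max(accepted):.2f}]")
```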
Discoveries using the likelihood ratio
A common method for claiming discovery is based on a likelihood ratio test by showing that the new physics model has a "significantly" larger likelihood than the standard model.
This could be done by using the standard profile likelihood ratio test statistic:

$$ t_{\mathrm{NP}} = -2\log\left(\frac{\mathcal{L}(\mu_{\mathrm{NP}} = 0, \vec{\hat{\nu}}(0))}{\mathcal{L}(\hat{\mu}_{\mathrm{NP}}, \vec{\hat{\nu}})}\right) $$

Where $\mu_{\mathrm{NP}}$ is the strength of the new physics signal. In practice, a modified version of this test statistic is usually used:

$$ q_{0} = \begin{cases} -2\log\left(\frac{\mathcal{L}(\mu_{\mathrm{NP}} = 0, \vec{\hat{\nu}}(0))}{\mathcal{L}(\hat{\mu}_{\mathrm{NP}}, \vec{\hat{\nu}})}\right) & \hat{\mu}_{\mathrm{NP}} \ge 0 \\ 0 & \hat{\mu}_{\mathrm{NP}} \lt 0 \end{cases} $$

which excludes the possibility of claiming discovery when the best fit value of $\mu_{\mathrm{NP}}$ is negative.
As with the likelihood ratio test statistic $t_{\vec{\mu}}$, the distribution of $q_{0}$ is known analytically under appropriate conditions, and can otherwise be estimated from pseudo-experiments.
Once the value $q_{0}^{\mathrm{obs}}$ has been calculated, it can be converted into a p-value under the standard model hypothesis, which is often further expressed as a significance $Z$ using the inverse cumulative distribution of the standard normal, $Z = \Phi^{-1}(1 - p)$.
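As a sketch (hypothetical yields, asymptotic approximation assumed), the discovery test statistic and its conversion to a significance can be computed as follows:

```python
# Sketch of the discovery test statistic q_0 for the counting experiment,
# with the asymptotic conversion to a significance Z = sqrt(q_0)
# (valid for one parameter of interest under Wilks'-type conditions).
import numpy as np
from scipy.stats import poisson, norm

n_obs, s, b = 12, 7.0, 5.0

mu_hat = (n_obs - b) / s
if mu_hat >= 0:
    q0 = -2 * (poisson.logpmf(n_obs, b)
               - poisson.logpmf(n_obs, mu_hat * s + b))
else:
    q0 = 0.0  # no discovery claim for downward fluctuations

Z = np.sqrt(q0)   # asymptotic significance
p = norm.sf(Z)    # corresponding one-sided p-value
print(f"q0 = {q0:.2f}, Z = {Z:.2f}, p = {p:.4f}")
```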
Limit Setting using the likelihood ratio
Various test statistics built from likelihood ratios can be used for limit setting, i.e. excluding some parameter values.
One could set limits on a parameter $\mu$ by using the test statistic $t_{\mu}$ defined above, excluding any value of $\mu$ whose p-value falls below the required threshold.
However, this could "exclude" values of $\mu$ for which the observed data are more signal-like than the model being tested, i.e. where $\hat{\mu} \gt \mu$; when setting upper limits, we typically only want to exclude values of $\mu$ for which the data are less signal-like than the model predicts.
This can be done using a modified test statistic:

$$ q_{\mu} = \begin{cases} -2\log\left(\frac{\mathcal{L}(\mu, \vec{\hat{\nu}}(\mu))}{\mathcal{L}(\hat{\mu}, \vec{\hat{\nu}})}\right) & \hat{\mu} \le \mu \\ 0 & \hat{\mu} \gt \mu \end{cases} $$

However, this can also have undesirable properties when the best fit value, $\hat{\mu}$, is less than 0. In such cases, the denominator can instead be evaluated at $\mu = 0$, defining the test statistic:

$$ \tilde{q}_{\mu} = \begin{cases} -2\log\left(\frac{\mathcal{L}(\mu, \vec{\hat{\nu}}(\mu))}{\mathcal{L}(0, \vec{\hat{\nu}}(0))}\right) & \hat{\mu} \lt 0 \\ -2\log\left(\frac{\mathcal{L}(\mu, \vec{\hat{\nu}}(\mu))}{\mathcal{L}(\hat{\mu}, \vec{\hat{\nu}})}\right) & 0 \le \hat{\mu} \le \mu \\ 0 & \hat{\mu} \gt \mu \end{cases} $$
Which also has a known distribution under appropriate conditions, or can be estimated from pseudo-experiments. One can then set a limit at a given confidence level, $\mathrm{CL}$, by finding the value of $\mu$ for which $p_{\mu} = 1 - \mathrm{CL}$; larger values of $\mu$ typically have smaller p-values and are considered excluded at that confidence level.
However, this procedure is rarely used; in almost every case we use a modified test procedure based on the $\mathrm{CL}_{s}$ criterion, described below.
The CLs criterion
Regardless of which of these test statistics is used, the standard test-methodology has some undesirable properties for limit setting.
Even for an experiment with almost no sensitivity to new physics, 5% of the time the experiment is performed we expect the experimenter to find $p_{\mu} \lt 0.05$ for a given parameter value $\mu$, and therefore to exclude it at the 95% confidence level despite having no sensitivity to it.
In order to avoid such situations, the $\mathrm{CL}_{s}$ criterion was developed. Rather than requiring $p_{\mu} \lt 1 - \mathrm{CL}$ to exclude the parameter value $\mu$, the $\mathrm{CL}_{s}$ criterion requires:

$$ \mathrm{CL}_{s} \equiv \frac{p_{\mu}}{1 - p_{b}} \lt 1 - \mathrm{CL} $$

Where $p_{\mu}$ is the p-value of the observed test statistic under the signal hypothesis with signal strength $\mu$, and $p_{b}$ is the p-value of the same observation under the background-only hypothesis, calculated using the opposite tail of the distribution.
Using the $\mathrm{CL}_{s}$ criterion protects against excluding parameter values to which the experiment is not sensitive: for such values $p_{\mu} \approx 1 - p_{b}$, so $\mathrm{CL}_{s} \approx 1$ and no exclusion is made.
Note that this means that a limit set using the $\mathrm{CL}_{s}$ criterion at a given confidence level will, by construction, over-cover; i.e. the probability of wrongly excluding the true parameter value is less than $1 - \mathrm{CL}$.
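The sketch below illustrates a toy-based $\mathrm{CL}_{s}$ calculation for the hypothetical counting experiment used above; it is a simplified illustration, not combine's implementation.

```python
# Sketch of a toy-based CLs calculation: CLs = p_mu / (1 - p_b), with both
# p-values estimated from pseudo-data, for the single-bin counting model.
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(seed=2)
s, b, n_obs, mu_test = 7.0, 5.0, 4, 1.0

def q_mu(n, mu):
    # One-sided limit-setting statistic: zero when the data are more
    # signal-like than the tested hypothesis (mu_hat > mu).
    n = np.asarray(n, dtype=float)
    mu_hat = (n - b) / s
    q = -2 * (poisson.logpmf(n, mu * s + b)
              - poisson.logpmf(n, np.maximum(n, 1e-9)))
    return np.where(mu_hat > mu, 0.0, q)

q_obs = q_mu(n_obs, mu_test)

toys_sb = rng.poisson(mu_test * s + b, 50_000)  # signal-plus-background toys
toys_b = rng.poisson(b, 50_000)                 # background-only toys

p_mu = np.mean(q_mu(toys_sb, mu_test) >= q_obs)         # right-tail p-value
one_minus_pb = np.mean(q_mu(toys_b, mu_test) >= q_obs)  # = 1 - p_b

cls = p_mu / one_minus_pb
print(f"CLs = {cls:.3f}; mu = {mu_test} excluded at 95% CL: {cls < 0.05}")
```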
Goodness of fit tests using the likelihood ratio
The likelihood ratio can also be used as a measure of goodness of fit for binned data, i.e. a description of how well the data match the model.
A standard likelihood-based measure of the goodness of fit is determined by using the log likelihood ratio, with the likelihood in the denominator coming from the saturated model:

$$ t_{\mathrm{sat}} \propto -\log\left(\frac{\mathcal{L}_{\mathcal{M}}}{\mathcal{L}_{\mathcal{M}_{\mathrm{sat}}}}\right) $$
Here $\mathcal{M}$ is the model being tested, and $\mathcal{M}_{\mathrm{sat}}$ is the saturated model: a model with enough free parameters that its prediction matches the observed data exactly in every bin, representing the best fit that any model could achieve.
This ratio then provides a comparison between how well the actual data are fit as compared to a hypothetical optimal fit.
Unfortunately, the distribution of $t_{\mathrm{sat}}$ is usually not known a priori, and has to be estimated by generating pseudo-data from the model being tested.
Once the distribution is determined, a p-value for the statistic can be derived which indicates the probability of observing data with that quality of fit given the model, and therefore serves as a measure of the goodness of fit.
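A minimal sketch of the saturated test for a hypothetical binned counting model, with the p-value estimated from pseudo-data, might look like:

```python
# Sketch of a saturated goodness-of-fit test for binned Poisson data:
# the saturated "model" predicts exactly the observed count in each bin.
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(seed=3)
expected = np.array([10.0, 8.0, 6.0, 4.0, 2.0])  # hypothetical model prediction
observed = np.array([13, 7, 9, 2, 4])            # hypothetical data

def t_saturated(data, pred):
    # -2 log( L_model / L_saturated ), summed over bins.
    return -2 * np.sum(poisson.logpmf(data, pred)
                       - poisson.logpmf(data, np.maximum(data, 1e-9)))

t_obs = t_saturated(observed, expected)

# The distribution of the statistic is estimated from pseudo-data.
t_toys = np.array([t_saturated(rng.poisson(expected), expected)
                   for _ in range(10_000)])
p_value = np.mean(t_toys >= t_obs)
print(f"t_sat = {t_obs:.2f}, goodness-of-fit p-value = {p_value:.3f}")
```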
Channel Compatibility test using the likelihood ratio
When performing an analysis across many different channels (for example, different Higgs decay modes), it is often interesting to check the level of compatibility of the various channels.
Combine implements a channel compatibility test by considering a model, $\mathcal{M}_{\mathrm{c\text{-}indep}}$, in which the signal strength is allowed to take an independent value in every channel. The test statistic is built from the likelihood ratio between the best fit of the nominal model, with a single common signal strength, and the best fit of the channel-independent model:

$$ t \propto -\log\left(\frac{\mathcal{L}_{\mathcal{M}}(\hat{\mu}, \vec{\hat{\nu}})}{\mathcal{L}_{\mathcal{M}_{\mathrm{c\text{-}indep}}}(\hat{\mu}_{c_{1}}, \hat{\mu}_{c_{2}}, \ldots, \vec{\hat{\nu}})}\right) $$
The distribution of the test statistic is not known a priori, and needs to be calculated by generating pseudo-data samples.
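The following sketch illustrates the idea for a hypothetical two-channel counting experiment; it is a simplified illustration, not combine's implementation.

```python
# Sketch of a channel-compatibility test statistic for two counting
# channels: compare the best fit with one common signal strength to
# the best fit with an independent signal strength per channel.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import poisson

n = np.array([15, 4])     # hypothetical observed counts per channel
s = np.array([6.0, 3.0])  # expected signal yields per channel
b = np.array([8.0, 4.0])  # expected background yields per channel

def nll_common(mu):
    return -np.sum(poisson.logpmf(n, mu * s + b))

# Common-mu fit (numerical); the per-channel fits have closed-form MLEs.
fit = minimize_scalar(nll_common, bounds=(-1.0, 5.0), method="bounded")
mu_hat_indep = (n - b) / s
nll_indep = -np.sum(poisson.logpmf(n, mu_hat_indep * s + b))

# t = -2 log( L(common mu) / L(independent mu's) ); large values
# indicate tension between the channels.
t = 2 * (fit.fun - nll_indep)
print(f"common mu_hat = {fit.x:.2f}, t = {t:.2f}")
```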
Other Statistical Tests
While combine is a likelihood based statistical framework, it does not require that all statistical tests use the likelihood ratio.
Other Goodness of Fit Tests
As well as the saturated goodness of fit test, defined above, combine implements Kolmogorov-Smirnov and Anderson-Darling goodness of fit tests.
For the Kolmogorov-Smirnov (KS) test, the test statistic is the maximum absolute difference between the cumulative distribution functions of the data and the model:

$$ D = \max_{x} \left| F_{\mathcal{M}}(x) - F_{\mathrm{data}}(x) \right| $$
Where $F_{\mathcal{M}}(x)$ and $F_{\mathrm{data}}(x)$ are the cumulative distribution functions (i.e. cumulative sums, for binned data) of the model and the data, respectively.
For the Anderson-Darling (AD) test, the test statistic is based on the integral of the square of the difference between the two cumulative distribution functions. The squared difference is modified by a weighting function which gives more importance to differences in the tails:

$$ A^{2} = \int_{x_{\mathrm{min}}}^{x_{\mathrm{max}}} \frac{\left(F_{\mathcal{M}}(x) - F_{\mathrm{data}}(x)\right)^{2}}{F_{\mathcal{M}}(x)\left(1 - F_{\mathcal{M}}(x)\right)}\,\mathrm{d}F_{\mathcal{M}}(x) $$
Notably, both the Anderson-Darling and Kolmogorov-Smirnov tests rely on the cumulative distribution function. Because the ordering of the different channels of a model is not well defined, these tests are not unambiguously defined over multiple channels.
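As an illustration of the KS statistic for binned data (hypothetical counts, single channel):

```python
# Sketch of the Kolmogorov-Smirnov test statistic for binned data:
# the maximum absolute difference between the normalized cumulative
# sums of the model prediction and the observed counts.
import numpy as np

expected = np.array([10.0, 8.0, 6.0, 4.0, 2.0])  # hypothetical model prediction
observed = np.array([13, 7, 9, 2, 4])            # hypothetical data

# Normalized cumulative distribution functions (cumulative sums).
F_model = np.cumsum(expected) / np.sum(expected)
F_data = np.cumsum(observed) / np.sum(observed)

D = np.max(np.abs(F_model - F_data))
print(f"KS statistic D = {D:.3f}")

# Note: the value of D depends on the bin (and channel) ordering,
# which is why the test is ambiguous across multiple channels.
```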