How to run the tool

The executable combine provided by the package allows to use the Higgs Combination Tool indicating by command line which is the method to use for limit combination and which are user's preferences to run it. To see the entire list of all available options ask for the help:

combine --help

The option -M allows to chose the method used. There are several groups of statistical methods:

The command help is organized into five parts:

Those options reported above are just a sample of all available.The command --help provides documentation of all of them.

Common command line options

There are a number of useful command line options which can be used to alter the model (or parameters of the model) at run These are the most commonly used, generic options,

Warning

Note that the floating/freezing options have a priority ordering from lowest to highest as floatParameters < freezeParameters < freezeNuisanceGroups < floatAllNuisances. Options with higher priority will override those with lower priority.

By default, the dataset used by combine will be the one pointed to in the datacard. You can tell combine to use a different dataset (for example a toy one that you generated) by using the option --dataset. The argument should be rootfile.root:workspace:location or rootfile.root:location. In order to use this option, you must first convert your datacard to a binary workspace and use this binary workspace as the input to the command line.

Generic Minimizer Options

Combine uses its own minimizer class which is used to steer Minuit (via RooMinimizer) named the CascadeMinimizer. This allows for sequential minimization which can help in case a particular setting/algo fails. Also, the CascadeMinimizer knows about extra features of Combine such as discrete nuisance parameters.

All of the fits which are performed in several of the methods available use this minimizer. This means that the fits can be tuned using these common options,

The allowed combinations of minimizer types and minimizer algos are as follows

Minimizer Type Minimizer Algo
Minuit Migrad, Simplex, Combined, Scan
Minuit2 Migrad, Simplex, Combined, Scan
GSLMultiMin ConjugateFR, ConjugatePR, BFGS, BFGS2, SteepestDescent

You can find details about these in the Minuit2 documentation here.

More of these options can be found in the Cascade Minimizer options section when running --help.

Output from combine

Most methods will print the results of the computation to the screen, however, in addition, combine will also produce a root file containing a tree called limit with these results. The name of this file will be of the format,

higgsCombineTest.MethodName.mH$MASS.[word$WORD].root

where $WORD is any user defined keyword from the datacard which has been set to a particular value.

A few command line options of combine can be used to control this output:

The output file will contain a TDirectory named toys, which will be empty if no toys are generated (see below for details) and a TTree called limit with the following branches;

Branch name Type Description
limit Double_t Main result of combine run with method dependent meaning
limitErr Double_t Estimated uncertainty on the result
mh Double_t Value of MH, specified with -m option
iToy Int_t Toy number identifier if running with -t
iSeed Int_t Seed specified with -s
t_cpu Float_t Estimated CPU time for algorithm
t_real Float_t Estimated real time for algorithm
quantileExpected Float_t Quantile identifier for methods which calculated expected (quantiles) and observed results (eg conversions from \Delta\ln L values) with method dependent meaning. Negative values are reserved for entries which do not related to quantiles of a calculation with the default being set to -1 (usually meaning the observed result).

The value of any user defined keyword $WORD which is set using keyword-value described above will also be included as a branch with type string named WORD. The option can be repeated multiple times for multiple keywords.

In some cases, the precise meanings of the branches will depend on the Method being used, which is included in this documentation.

Toy data generation

By default, each of these methods will be run using the observed data as the input. In several cases (as detailed below), it might be useful to run the tool using toy datasets, including Asimov data.

The option -t is used to specify to combine to first generate a toy dataset(s) which will be used in replacement of the real data. There are two versions of this,

Warning

The default values of the nuisance parameters (or any parameter) are used to generate the toy. This means that if, for example, you are using parametric shapes and the parameters inside the workspace are set to arbitrary values, those arbitrary values will be used to generate the toy. This behaviour can be modified through the use of the option --setParameters x=value_x,y=value_y... which will set the values of the parameters (x and y) before toy generation. You can also load a snap-shot from a previous fit to set the nuisances to their post-fit values (see below).

The output file will contain the toys (as RooDataSets for the observables, including global observables) in the toys directory if the option --saveToys is provided. If you include this option, the limit TTree in the output will have an entry corresponding to the state of the POI used for the generation of the toy, with the value of quantileExpected set to -2.

Info

The branches that are created by methods like MultiDimFit will not show the values used to generate the toy. If you also want the TTree to show the values of the POIs used to generate to toy, you should add additional branches using the --trackParameters option as described in the common command line options section above. These branches will behave as expected when adding the option --saveToys.

Asimov datasets

If you are using wither -t -1 or using AsymptoticLimits, combine will calculate results based on an Asimov dataset.

You can also ask combine to use a Pseudo-Asimov dataset, which is created from many weighted unbinned events.

Setting --X-rtd TMCSO_AdaptivePseudoAsimov=\beta with \beta>0 will trigger the internal logic of whether to produce a Pseudo-Asimov dataset. This logic is as follows;

  1. For each observable in your dataset, the number of bins, n_{b} is determined either from the value of RooRealVar::getBins if it exists or assumed to be 100.

  2. If N_{b}=\prod_{b}n_{b}>5000, the number of expected events N_{ev} is determined. Note if you are combining multiple channels, N_{ev} refers to the number of expected events in a single channel, the logic is separate for each channel. If N_{ev}/N_{b}<0.01 then a Pseudo-Asimov dataset is created with the number of events equal to \beta \cdot \mathrm{max}\{100*N_{ev},1000\}. If N_{ev}/N_{b}\geq 0.01 , then a normal Asimov dataset is produced.

  3. If N_{b}\leq 5000 then a normal Asimov dataset will be produced

The production of a Pseudo-Asimov dataset can be forced by using the option --X-rtd TMCSO_PseudoAsimov=X where X>0 will determine the number of weighted events for the Pseudo-Asimov dataset. You should try different values of X since larger values leads to more events in the Pseudo-Asimov dataset resulting in higher precision but in general the fit will be slower.

You can turn off the internal logic by setting --X-rtd TMCSO_AdaptivePseudoAsimov=0 --X-rtd TMCSO_PseudoAsimov=0 thereby forcing histograms to be generated.

Info

If you set --X-rtd TMCSO_PseudoAsimov=X with X>0 and also turn on --X-rtd TMCSO_AdaptivePseudoAsimov=\beta, with \beta>0, the internal logic will be used but this time the default will be to generate Pseudo-Asimov datasets, rather than the normal Asimov ones.

Nuisance parameter generation

The default method of dealing with systematics is to generate random values (around their nominal values, see above) for the nuisance parameters, according to their prior pdfs centred around their default values, before generating the data. The unconstrained nuisance parameters (eg flatParam or rateParam) or those with flat priors are not randomised before the data generation. If you wish to also randomise these parameters, you must declare these as flatParam in your datacard and when running text2workspace you must add the option --X-assign-flatParam-prior in the command line.

The following are options which define how the toys will be generated,

If you are using toysFrequentist, be aware that the values set by --setParameters will be ignored for the toy generation as the post-fit values will instead be used (except for any parameter which is also a parameter of interest). You can override this behaviour and choose the nominal values for toy generation for any parameter by adding the option --bypassFrequentistFit which will skip the initial fit to data or by loading a snapshot (see below).

Warning

The methods such as AsymptoticLimits and HybridNew --LHCmode LHC-limits, the "nominal" nuisance parameter values are taken from fits to the data and are therefore not "blind" to the observed data by default (following the fully frequentist paradigm). See the detailed documentation on these methods for avoiding this and running in a completely "blind" mode.

Generate only

It is also possible to generate the toys first and then feed them to the Methods in combine. This can be done using -M GenerateOnly --saveToys. The toys can then be read and used with the other methods by specifying --toysFile=higgsCombineTest.GenerateOnly... and using the same options for the toy generation.

Warning

Some Methods also use toys within the method itself (eg AsymptoticLimits and HybridNew). For these, you should not specify the toy generation with -t or the options above and instead follow the specific instructions.

Loading snapshots

Snapshots from workspaces can be loaded and used in order to generate toys using the option --snapshotName <name of snapshot>. This will first set the parameters to the values in the snapshot before any other parameter options are set and toys are generated.

See the section on saving post-fit workspaces for creating workspaces with post-fit snapshots from MultiDimFit.

Here are a few examples of calculations with toys from post-fit workspaces using a workspace with r, m_{H} as parameters of interest

combineTool for job submission

For longer tasks which cannot be run locally, several methods in combine can be split to run on the LSF batch or the Grid. The splitting and submission is handled using the combineTool (see this getting started section to get the tool)

Submission to Condor

The syntax for running on condor with the tool is

combineTool.py -M ALGO [options] --job-mode condor --sub-opts='CLASSADS' --task-name NAME [--dry-run]

with options being the usual list of combine options. The help option -h will give a list of both combine and combineTool sets of options. This can be used with several different methods from combine.

The --sub-opts option takes a string with the different ClassAds that you want to set, separated by \n as argument (e.g. '+JobFlavour="espresso"\nRequestCpus=1').

The --dry-run option will show what will be run without actually doing so / submitting the jobs.

For example, to generate toys (eg for use with limit setting) users running on lxplus at CERN the condor mode can be used eg

combineTool.py -d workspace.root -M HybridNew --LHCmode LHC-limits --clsAcc 0  -T 2000 -s -1 --singlePoint 0.2:2.0:0.05 --saveHybridResult -m 125 --job-mode condor --task-name condor-test --sub-opts='+JobFlavour="tomorrow"'

The --singlePoint option is over-ridden so that this will produce a script for each value of the POI in the range 0.2 to 2.0 in steps of 0.05. You can merge multiple points into a script using --merge - e.g adding --merge 10 to the above command will mean that each job contains at most 10 of the values. The scripts are labelled by the --task-name option. These will be submitted directly to condor adding any options in --sub-opts to the condor submit script. Make sure multiple options are separated by \n. The jobs will run and produce output in the current directory.

Below is an example for splitting points in a multi-dimensional likelihood scan.

Splitting jobs for a multi-dimensional likelihood scan

The option --split-points issues the command to split the jobs for MultiDimFit when using --algo grid. The following example will split the jobs such that there are 10 points in each of the jobs, which will be submitted to the 8nh queue.

combineTool.py datacard.txt -M MultiDimFit --algo grid --points 50 --rMin 0 --rMax 1 --job-mode condor --split-points 10 --sub-opts='+JobFlavour="workday"' --task-name mytask -n mytask

Remember, any usual options (such as redefining POIs or freezing parameters) are passed to combine and can be added to the command line for combineTool.

Info

The option -n NAME should be included to avoid overwriting output files as the jobs will be run inside the directory from which the command is issued.

Grid submission with combineTool

For more CPU-intensive tasks, for example determining limits for complex models using toys, it is generally not feasible to compute all the results interactively. Instead, these jobs can be submitted to the Grid.

In this example we will use the HybridNew method of combine to determine an upper limit for a sub-channel of the Run 1 SM H\rightarrow\tau\tau analysis. For full documentation, see the section on computing limits with toys.

With this model it would take too long to find the limit in one go, so instead we create a set of jobs in which each one throws toys and builds up the test statistic distributions for a fixed value of the signal strength. These jobs can then be submitted to a batch system or to the Grid using crab3. From the set of output distributions it is possible to extract the expected and observed limits.

For this we will use combineTool.py

First we need to build a workspace from the H\rightarrow\tau\tau datacard,

$ text2workspace.py data/tutorials/htt/125/htt_mt.txt -m 125
$ mv data/tutorials/htt/125/htt_mt.root ./

To get an idea of the range of signal strength values we will need to build test-statistic distributions for we will first use the AsymptoticLimits method of combine,

$ combine -M Asymptotic htt_mt.root -m 125
 << Combine >>
[...]
 -- AsymptoticLimits (CLs) --
Observed Limit: r < 1.7384
Expected  2.5%: r < 0.4394
Expected 16.0%: r < 0.5971
Expected 50.0%: r < 0.8555
Expected 84.0%: r < 1.2340
Expected 97.5%: r < 1.7200

Based on this, a range of 0.2 to 2.0 should be suitable.

We can use the same command for generating the distribution of test statistics with combineTool. The --singlePoint option is now enhanced to support expressions that generate a set of calls to combine with different values. The accepted syntax is of the form MIN:MAX:STEPSIZE, and multiple comma-separated expressions can be specified.

The script also adds an option --dry-run which will not actually call combine but just prints out the commands that would be run, e.g,

combineTool.py -M HybridNew -d htt_mt.root --LHCmode LHC-limits --singlePoint 0.2:2.0:0.2 -T 2000 -s -1 --saveToys --saveHybridResult -m 125 --dry-run
...
[DRY-RUN]: combine -d htt_mt.root --LHCmode LHC-limits -T 2000 -s -1 --saveToys --saveHybridResult -M HybridNew -m 125 --singlePoint 0.2 -n .Test.POINT.0.2
[DRY-RUN]: combine -d htt_mt.root --LHCmode LHC-limits -T 2000 -s -1 --saveToys --saveHybridResult -M HybridNew -m 125 --singlePoint 0.4 -n .Test.POINT.0.4
[...]
[DRY-RUN]: combine -d htt_mt.root --LHCmode LHC-limits -T 2000 -s -1 --saveToys --saveHybridResult -M HybridNew -m 125 --singlePoint 2.0 -n .Test.POINT.2.0

When the --dry-run option is removed each command will be run in sequence.

Grid submission with crab3

Submission to the grid with crab3 works in a similar way. Before doing so ensure that the crab3 environment has been sourced in addition to the CMSSW environment. We will use the example of generating a grid of test-statistic distributions for limits.

$ cmsenv; source /cvmfs/cms.cern.ch/crab3/crab.sh
$ combineTool.py -d htt_mt.root -M HybridNew --LHCmode LHC-limits --clsAcc 0 -T 2000 -s -1 --singlePoint 0.2:2.0:0.05 --saveToys --saveHybridResult -m 125 --job-mode crab3 --task-name grid-test --custom-crab custom_crab.py

The option --custom-crab should point to a python file python containing a function of the form custom_crab(config) that will be used to modify the default crab configuration. You can use this to set the output site to your local grid site, or modify other options such as the voRole, or the site blacklist/whitelist.

For example

def custom_crab(config):
  print '>> Customising the crab config'
  config.Site.storageSite = 'T2_CH_CERN'
  config.Site.blacklist = ['SOME_SITE', 'SOME_OTHER_SITE']

Again it is possible to use the option --dry-run to see what the complete crab config will look like before actually submitted it.

Once submitted the progress can be monitored using the standard crab commands. When all jobs are completed copy the output from your sites storage element to the local output folder.

$ crab getoutput -d crab_grid-test
# Now we have to un-tar the output files
$ cd crab_grid-test/results/
$ for f in *.tar; do tar xf $f; done
$ mv higgsCombine*.root ../../
$ cd ../../

These output files should be combined with hadd, after which we invoke combine as usual to calculate observed and expected limits from the merged grid as usual.