Building statistical models with Combine
Getting started
To get started, you should have a working setup of Combine; please follow the instructions on the home page. Make sure to use the latest recommended release.
After setting up Combine, you can access the working directory for this tutorial, which contains all of the inputs and scripts needed in this exercise:
cd $CMSSW_BASE/src/HiggsAnalysis/CombinedLimit/
git checkout main
scram b -j 8
cd $CMSSW_BASE/src/HiggsAnalysis/CombinedLimit/data/tutorials/model_building_2024/
Exercise outline
This tutorial focuses on and extends the model building topic. It does not give a full picture of the statistical methods, which are covered extensively in the long exercise and the statistical methods exercise.
1) Building analysis with templates
2) Using keywords
3) Control regions
4) Rate parameters
5) Extra arguments
6) Physics Models
Introduction
The most general definition of the binned model likelihood can be given as

\[
\mathcal{L} = \prod_{c=1}^{N_c} \prod_{b=1}^{N_b^c} \mathrm{Poiss}\left( n_{cb} \,;\, n^\mathrm{exp}_{cb}(\vec{\mu},\vec{\nu}) \right) \cdot \prod_{e=1}^{N_E} p_e\left( y_e \,;\, \nu_e \right),
\]

where \(c\) indexes the channel, \(b\) indexes the histogram bin, and \(e\) indexes the nuisance parameter; \(n_{cb}\) are the observed event counts, \(\vec{\mu}\) are the parameters of interest, \(\vec{\nu}\) are the nuisance parameters, and \(p_e(y_e\,;\,\nu_e)\) is the constraint term for nuisance parameter \(\nu_e\).
The generic model of the expected event count in a given bin, \(n^\mathrm{exp}_{cb}\), implemented in Combine for template-based analyses is given by:

\[
n^\mathrm{exp}_{cb} = \max\left(0,\; \sum_{p} M_{cp}(\vec{\mu})\, N_{cp}(\vec{\nu})\, \omega_{cbp}(\vec{\nu}) + E_{cb}(\vec{\nu}) \right),
\]

where \(p\) indexes the processes, \(M_{cp}(\vec{\mu})\) scales the yield of process \(p\) in channel \(c\) as a function of the parameters of interest, \(N_{cp}(\vec{\nu})\) collects the normalization effects of the nuisance parameters, \(\omega_{cbp}(\vec{\nu})\) is the fraction of events of process \(p\) falling into bin \(b\), and \(E_{cb}(\vec{\nu})\) accounts for background contributions estimated with data-driven methods.
In terms of datacard structure, there are several differences with respect to a counting datacard:
- A new block of lines at the top defining how channels and processes are mapped to the histograms (more than one line can be used)
- In the list of systematic uncertainties, some entries are now marked as shape
The shapes lines have to follow the syntax:
shapes <process_name> <channel_name> <path/to/input_shape.root> <histograms_nominal> <histograms_with_variations>
To start the hands-on for this section:
cd $CMSSW_BASE/src/HiggsAnalysis/CombinedLimit/data/tutorials/model_building_2024/simple_shape
The input shapes for the first example (datacard-2-template-analysis.txt) are located in template-analysis-datacard-input.root, which contains the observed distribution data_obs, the nominal histograms for each process, and the systematic uncertainty templates:
root [0]
Attaching file template-analysis-datacard-input.root as _file0...
(TFile *) 0x556fd5b0fea0
root [1] .ls
TFile** template-analysis-datacard-input.root
TFile* template-analysis-datacard-input.root
KEY: TH1F background;1 Histogram of background__x
KEY: TH1F background_alphaUp;1 Histogram of background__x
KEY: TH1F background_alphaDown;1 Histogram of background__x
KEY: TH1F data_obs;1 Histogram of data_obs__x
KEY: TH1F signal;1 Histogram of signal__x
KEY: TH1F signal_sigmaUp;1 Histogram of signal__x
KEY: TH1F signal_sigmaDown;1 Histogram of signal__x
To define the mapping to the systematic uncertainty templates, the $SYSTEMATIC keyword should be used; it connects the systematic uncertainties marked as shape type with the input shapes.
imax 1
jmax 1
kmax 4
# ---------------
shapes signal ch1 template-analysis-datacard-input.root signal signal_$SYSTEMATIC
shapes background ch1 template-analysis-datacard-input.root background background_$SYSTEMATIC
# ---------------
bin ch1
observation 85
# ------------------------------
bin ch1 ch1
process signal background
process 0 1
rate 24 100
# --------------------------------
lumi lnN 1.1 1.0
bgnorm lnN - 1.3
alpha shape - 1 # uncertainty in the background template.
sigma shape 0.5 - # uncertainty in the signal template.
To simplify the shape mapping lines, the keywords $PROCESS and $CHANNEL can be used. The $PROCESS keyword is associated with the processes listed in the datacard: [signal, background]. It is also possible to use the * wildcard to map multiple processes and/or channels with one line, as shown below.
imax 1
jmax 1
kmax 4
# ---------------
shapes * * template-analysis-datacard-input.root $PROCESS $PROCESS_$SYSTEMATIC
# ---------------
bin ch1
observation 85
# ------------------------------
bin ch1 ch1
process signal background
process 0 1
rate 24 100
# --------------------------------
lumi lnN 1.1 1.0
bgnorm lnN - 1.3
alpha shape - 1 # uncertainty in the background template.
sigma shape 0.5 - # uncertainty in the signal template.
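To check that the template mapping works, one can run any of the Combine methods directly on this card; for example (the choice of method here is just an illustration):

combine -M AsymptoticLimits datacard-2-template-analysis.txt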
If there is more than one category, it can be useful to store the input shapes corresponding to the different regions in separate TDirectories and use the $CHANNEL keyword as shown below:
shapes * * <input-file.root> $CHANNEL/$PROCESS $CHANNEL/$PROCESS_$SYSTEMATIC
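As an illustration of this layout, the following PyROOT sketch (the file, channel, and histogram names are hypothetical and not part of the tutorial inputs) writes one TDirectory per channel, so that the shapes line above can resolve $CHANNEL/$PROCESS and $CHANNEL/$PROCESS_$SYSTEMATIC:

import ROOT

# Hypothetical channels and template names
channels = ["ch1", "ch2"]
templates = ["data_obs", "signal", "background", "background_alphaUp", "background_alphaDown"]

fout = ROOT.TFile("input-shapes.root", "RECREATE")
for channel in channels:
    tdir = fout.mkdir(channel)  # one TDirectory per channel, matched by $CHANNEL
    tdir.cd()
    for name in templates:
        # In a real analysis these histograms are filled from data/simulation;
        # here empty placeholders with the expected $PROCESS[_$SYSTEMATIC] names are written.
        hist = ROOT.TH1F(name, name, 10, 0.0, 100.0)
        hist.Write()
fout.Close()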
Keywords
Go to the datacards location corresponding to this section of the tutorial:
cd $CMSSW_BASE/src/HiggsAnalysis/CombinedLimit/data/tutorials/model_building_2024/keywords
The datacard can also contain the $MASS keyword, which allows a single datacard to be used for various mass points. It is replaced with the value passed to the -m option when running the tool. In addition, user-defined keywords can be used: any occurrence of $WORD in the datacard will be replaced by VALUE when the option --keyword-value WORD=VALUE is included. This option can be repeated multiple times for multiple keywords. For example, the input file htt_input.root used in this section contains separate templates for each mass point:
KEY: TH1D ggH110;1
KEY: TH1D bbH110;1
KEY: TH1D ggH110_CMS_eff_t_mssmHigh_tt_13TeVDown;1
....
KEY: TH1D ggH120;1
KEY: TH1D bbH120;1
KEY: TH1D ggH120_CMS_eff_t_mssmHigh_tt_13TeVDown;1
.....
In the htt_tt_9_13TeV.txt datacard you can find the following lines:
shapes * htt_tt_9_13TeV htt_input.root htt_tt_9_13TeV/$PROCESS htt_tt_9_13TeV/$PROCESS_$SYSTEMATIC
shapes bbH htt_tt_9_13TeV htt_input.root htt_tt_9_13TeV/bbH$MASS htt_tt_9_13TeV/bbH$MASS_$SYSTEMATIC
shapes ggH htt_tt_9_13TeV htt_input.root htt_tt_9_13TeV/ggH$MASS htt_tt_9_13TeV/ggH$MASS_$SYSTEMATIC
The limit for a given mass hypothesis can then be computed by passing the mass value to the -m option:
combine -M AsymptoticLimits -d htt_tt_9_13TeV.txt -m <mass_value>
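In the same spirit, if a datacard contained a user-defined keyword, say $ERA (hypothetical, not present in these cards), its value could be set at runtime:

combine -M AsymptoticLimits -d datacard.txt -m 120 --keyword-value ERA=13TeV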
Simultaneous fit in multiple categories
cd $CMSSW_BASE/src/HiggsAnalysis/CombinedLimit/data/tutorials/model_building_2024/control_regions
To combine the datacards corresponding to various (independent) regions into a single card, one can use combineCards.py.
combineCards.py htt_tt_9_13TeV=htt_tt_9_13TeV.txt htt_tt_8_13TeV=htt_tt_8_13TeV.txt >htt_tt_SRs.txt
The combined card htt_tt_SRs.txt now has two categories:
----------------------------------------------------------------------------------------------------------------------------------
bin htt_tt_9_13TeV htt_tt_8_13TeV
observation 3416 105545
----------------------------------------------------------------------------------------------------------------------------------
Rate parameters
cd $CMSSW_BASE/src/HiggsAnalysis/CombinedLimit/data/tutorials/model_building_2024/control_regions
It is quite common to use data-driven background estimation methods. In Combine one can perform a simultaneous fit of the signal and control regions. This automatically handles the statistical uncertainty due to the number of data events in the control region, correctly accounts for signal contamination in the control region, and allows systematic uncertainties affecting the backgrounds in the control regions to be taken into account properly.
In the working directory for this section you can find the htt_zmm_8_13TeV.txt, htt_zmm_9_13TeV.txt, and htt_ttbar_1_13TeV.txt cards, corresponding to control regions enriched in the ZLL and ttbar processes, in addition to the signal regions from the previous step. Let's combine all of the regions into one datacard:
combineCards.py htt_zmm_9_13TeV=htt_zmm_9_13TeV.txt htt_zmm_8_13TeV=htt_zmm_8_13TeV.txt htt_ttbar_1_13TeV=htt_ttbar_1_13TeV.txt htt_tt_9_13TeV=htt_tt_9_13TeV.txt htt_tt_8_13TeV=htt_tt_8_13TeV.txt >htt_tt_combined.txt
The combined card htt_tt_combined.txt contains the signal and control regions. To allow the rates of the background processes to be corrected from the control regions, we can define common rate parameters, which linearly scale the predicted rates specified in the datacard, using the syntax:
<rate_param_name> rateParam <category> <process> <initial_value> [min_value,max_value]
The following lines define the rate_TT and rate_ZMM rate parameters, scaling the TT, TTT, and ZLL processes in all regions simultaneously:
rate_TT rateParam * TT 1 [0,5]
rate_TT rateParam * TTT 1 [0,5]
rate_ZMM rateParam * ZLL 1 [0,2]
Note that by default rate parameters are freely floating (unconstrained) parameters in Combine. However, it is possible to add a constraint term to the likelihood by adding a param modifier with the same name as the rate parameter:
rate_TT param <mean> <sigma>
As an exercise, add a param constraint for the rate_TT nuisance with mean = 1 and sigma = 1, and check how the uncertainty on the rate_TT parameter changes. What happens if you change the width of the constraint term?
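For instance, a minimal constraint line implementing the task above (mean = 1, sigma = 1) that could be appended to the combined datacard is:

rate_TT param 1 1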
In addition, modifiers that are functions of other parameters can be included using the following syntax:
name rateParam bin process formula args
rate_A rateParam * process_A 1
ratio_BtoA param 1 1
rate_B rateParam * process_B @0*@1 rate_A,ratio_BtoA
Extra arguments
cd $CMSSW_BASE/src/HiggsAnalysis/CombinedLimit/data/tutorials/model_building_2024/PhysicsModels
If one wants to connect different models with common parameters, or just use external functions, it is possible to import parameters defined within external workspaces with extArg:
name extArg rootfile:workspacename
The extArg syntax allows a RooAbsReal object to be imported from an external workspace. This object can be another freely floating parameter, or a function of other parameters. In this section we are going to import RooSpline1D objects which define how the various Higgs production cross sections depend on the Higgs mass (the MH parameter).
The datacard we are going to use in this section, htt_tt_125_8TeV.txt, corresponds to an 8 TeV analysis and a Higgs mass of 125 GeV. To exercise the extArg features, let's rescale the signal templates to the 13 TeV cross section values.
The 13 and 8 TeV cross section predictions from YR4 are stored in $CMSSW_BASE/src/HiggsAnalysis/CombinedLimit/data/lhc-hxswg/sm/ in the sm_yr4_13TeV.root and sm_yr4_8TeV.root files, respectively. Let's inspect the contents of sm_yr4_13TeV.root:
TFile* /afs/cern.ch/work/a/anigamov/rootv630/CMSSW_14_1_0_pre4/src/HiggsAnalysis/CombinedLimit/data/lhc-hxswg/sm/sm_yr4_13TeV.root
KEY: RooWorkspace xs_13TeV;1 xs_13TeV
KEY: TProcessID ProcessID0;1 2fd49e90-f1a0-11e8-9717-b052b8bcbeef
root [2] xs_13TeV->Print()
RooWorkspace(xs_13TeV) xs_13TeV contents
variables
---------
(MH)
functions
--------
RooSpline1D::WH_13TeV[ xvar=MH ] = 1.369
RooSpline1D::WminusH_13TeV[ xvar=MH ] = 0.5313
RooSpline1D::WplusH_13TeV[ xvar=MH ] = 0.838
RooSpline1D::ZH_13TeV[ xvar=MH ] = 0.8824
RooSpline1D::bbH_13TeV[ xvar=MH ] = 0.4863
RooSpline1D::ggH_13TeV[ xvar=MH ] = 48.52
RooSpline1D::ggZH_13TeV[ xvar=MH ] = 0.1227
RooFormulaVar::qqZH_13TeV[ actualVars=(ZH_13TeV,ggZH_13TeV) formula="@0-@1" ] = 0.7597
RooSpline1D::tHW_13TeV[ xvar=MH ] = 0.01517
RooSpline1D::tHq_13TeV[ xvar=MH ] = 0.07714
RooSpline1D::ttH_13TeV[ xvar=MH ] = 0.5065
RooSpline1D::vbfH_13TeV[ xvar=MH ] = 3.779
Each RooSpline1D object (e.g. RooSpline1D::WH_13TeV[ xvar=MH ] = 1.369) contains cross section values interpolated between the various Higgs mass points.
We can import them into our model as shown below
vbfH_13TeV extArg $CMSSW_BASE/src/HiggsAnalysis/CombinedLimit/data/lhc-hxswg/sm/sm_yr4_13TeV.root:xs_13TeV
ggH_13TeV extArg $CMSSW_BASE/src/HiggsAnalysis/CombinedLimit/data/lhc-hxswg/sm/sm_yr4_13TeV.root:xs_13TeV
ZH_13TeV extArg $CMSSW_BASE/src/HiggsAnalysis/CombinedLimit/data/lhc-hxswg/sm/sm_yr4_13TeV.root:xs_13TeV
WH_13TeV extArg $CMSSW_BASE/src/HiggsAnalysis/CombinedLimit/data/lhc-hxswg/sm/sm_yr4_13TeV.root:xs_13TeV
vbfH_8TeV extArg $CMSSW_BASE/src/HiggsAnalysis/CombinedLimit/data/lhc-hxswg/sm/sm_yr4_8TeV.root:xs_8TeV
ggH_8TeV extArg $CMSSW_BASE/src/HiggsAnalysis/CombinedLimit/data/lhc-hxswg/sm/sm_yr4_8TeV.root:xs_8TeV
ZH_8TeV extArg $CMSSW_BASE/src/HiggsAnalysis/CombinedLimit/data/lhc-hxswg/sm/sm_yr4_8TeV.root:xs_8TeV
WH_8TeV extArg $CMSSW_BASE/src/HiggsAnalysis/CombinedLimit/data/lhc-hxswg/sm/sm_yr4_8TeV.root:xs_8TeV
Then we can define rateParam modifiers as functions of the imported extArgs to rescale the signal processes WH, ZH, ggH, and qqH:
WH_8to13TeV rateParam * WH @0/@1 WH_13TeV,WH_8TeV
ZH_8to13TeV rateParam * ZH @0/@1 ZH_13TeV,ZH_8TeV
ggH_8to13TeV rateParam * ggH @0/@1 ggH_13TeV,ggH_8TeV
vbfH_8to13TeV rateParam * qqH @0/@1 vbfH_13TeV,vbfH_8TeV
When running Combine methods (e.g. combine -M MultiDimFit --algo singles) with these parameters, you have to freeze MH, i.e. add the --freezeParameters MH option.
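For example (the workspace name ws_8to13.root is just a choice made for this sketch), one could build the workspace from the modified datacard and run the fit with MH frozen as follows:

text2workspace.py htt_tt_125_8TeV.txt -m 125 -o ws_8to13.root
combine -M MultiDimFit --algo singles -d ws_8to13.root --freezeParameters MH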
Advanced task: rescale the signal templates to the cross-section corresponding to a different MH value (e.g. 120 GeV).
Physics Models
With a physics model one can instruct Combine how to scale the signal (and background) yields with the parameters of the model, i.e. define the term \(M_{cp}(\vec{\mu})\) from the expression for \(n^\mathrm{exp}_{cb}\) given in the Introduction.
In Combine we can use PhysicsModel as a base class.
This class has several useful methods, but the most important ones are doParametersOfInterest(), which defines the parameters and functions of the model, and getYieldScale(self, bin, process), which defines how the expected event yields are scaled with the model parameters.
class PhysicsModelBase(six.with_metaclass(ABCMeta, object)):
    ...
    def doParametersOfInterest(self):
        """Create POI and other parameters, and define the POI set."""

    def getYieldScale(self, bin, process):
        "Return the name of a RooAbsReal to scale this yield by or the two special values 1 and 0 (don't scale, and set to zero)"
        return "r" if self.DC.isSignal[process] else 1
    ...
There are many models available to use for different physics cases, follow the link for more information. In the following sections we will discuss how one can construct custom models.
To use a different physics model instead of the default one, we are going to use the option -P as in
text2workspace.py datacard.txt -P HiggsAnalysis.CombinedLimit.PythonFile:modelName
Default physics model
The default physics model implemented in Combine defines a single POI that linearly scales all signal processes. This model is used by default when running text2workspace.py -m <mass_value> <datacard.txt>.
class PhysicsModel(PhysicsModelBase):
    """Example class with signal strength as only POI"""

    def doParametersOfInterest(self):
        """Create POI and other parameters, and define the POI set."""
        self.modelBuilder.doVar("r[1,0,20]")
        self.modelBuilder.doSet("POI", "r")
        # --- Higgs Mass as other parameter ----
        if self.options.mass != 0:
            if self.modelBuilder.out.var("MH"):
                self.modelBuilder.out.var("MH").removeRange()
                self.modelBuilder.out.var("MH").setVal(self.options.mass)
            else:
                self.modelBuilder.doVar("MH[%g]" % self.options.mass)
Multi Signal model
Combine already contains the model HiggsAnalysis.CombinedLimit.PhysicsModel:multiSignalModel, which can be used to assign different signal strengths to multiple processes in a datacard, configurable from the command line using the mapping --PO 'map=<bin>/<process>:<parameter_name>'. The * wildcard is allowed for the <bin> and <process> entries. The following command assigns the r_ggH signal strength to scale the ggH processes in all regions (bins), and r_qqH to scale the qqH processes:
text2workspace.py -P HiggsAnalysis.CombinedLimit.PhysicsModel:multiSignalModel --PO verbose --PO 'map=.*/ggH:r_ggH[1,-5,5]' --PO 'map=.*/qqH:r_qqH[1,-5,5]' PhysicsModels/htt_tt_125_8TeV.txt -o ws_multiSignal.root -m 125
combine -M MultiDimFit --algo singles -d ws_multiSignal.root -n .multiSignal.
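As a possible follow-up (the choice of flags and output name here is only a suggestion), one could scan r_ggH while profiling the other POI:

combine -M MultiDimFit --algo grid -d ws_multiSignal.root -P r_ggH --floatOtherPOIs 1 -n .r_ggH.scan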
Custom models
Now let's look at an example of how one can construct a model where the two leading signal processes (qqH and ggH) are scaled with a relative fraction parameter f and an overall rate modifier r.
As discussed above, first we have to define the parameters of the model:
def doParametersOfInterest(self):
    """Create POI and other parameters, and define the POI set."""
    self.modelBuilder.doVar("f[0,0,4]")
    self.modelBuilder.doVar("r[1,0,10]")
    self.modelBuilder.factory_("expr::scale_qqH(\"(1-@0)*@1\", f,r)")
    self.modelBuilder.factory_("expr::scale_ggH(\"@0*@1\", f,r)")
    self.modelBuilder.doSet("POI", ",".join(["f", "r"]))
The getYieldScale(self, bin, process) method will scale the qqH process with the scale_qqH object and ggH with scale_ggH.
def getYieldScale(self, bin, process):
    if process == "qqH": return "scale_qqH"
    elif process == "ggH": return "scale_ggH"
    else: return 1
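Putting the two methods together, a minimal sketch of how the models/FractionModel.py module referenced below might be assembled (the class name FractionModel and the instance name Fraction_2signals are taken from the text2workspace.py command; the rest follows the snippets above):

from HiggsAnalysis.CombinedLimit.PhysicsModel import PhysicsModel


class FractionModel(PhysicsModel):
    """Scale ggH by f*r and qqH by (1-f)*r, with POIs f and r."""

    def doParametersOfInterest(self):
        """Create POI and other parameters, and define the POI set."""
        self.modelBuilder.doVar("f[0,0,4]")
        self.modelBuilder.doVar("r[1,0,10]")
        self.modelBuilder.factory_("expr::scale_qqH(\"(1-@0)*@1\", f,r)")
        self.modelBuilder.factory_("expr::scale_ggH(\"@0*@1\", f,r)")
        self.modelBuilder.doSet("POI", ",".join(["f", "r"]))

    def getYieldScale(self, bin, process):
        # Scale the two leading signal processes, leave everything else untouched
        if process == "qqH": return "scale_qqH"
        elif process == "ggH": return "scale_ggH"
        else: return 1


# Instance picked up via -P FractionModel:Fraction_2signals
Fraction_2signals = FractionModel()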
To use this model we should add the directory where the corresponding file is located to the PYTHON3PATH:
export PYTHON3PATH=${PYTHON3PATH}:${PWD}/models
text2workspace.py PhysicsModels/htt_tt_125_8TeV.txt -P FractionModel:Fraction_2signals -m 125 -o ws_fraction.root
One can inspect the created workspace to ensure that the model is correctly implemented
root -l ws_fraction.root
root [1] w->Print()
The created workspace is quite large, but it should contain two new RooFormulaVar objects:
RooFormulaVar::scale_ggH[ actualVars=(f,r) formula="x[0]*x[1]" ] = 0
RooFormulaVar::scale_qqH[ actualVars=(f,r) formula="(1-x[0])*x[1]" ] = 1
ProcessNormalization::n_exp_binhtt_tt_1_8TeV_proc_ggH[ thetaList=(CMS_eff_t_tt_8TeV,CMS_htt_scale_met_8TeV,QCDscale_ggH1in,UEPS,lumi_8TeV,pdf_gg) asymmThetaList=() otherFactorList=(scale_ggH) ] = 0
ProcessNormalization::n_exp_binhtt_tt_1_8TeV_proc_qqH[ thetaList=(CMS_eff_t_tt_8TeV,CMS_htt_scale_met_8TeV,CMS_scale_j_8TeV,QCDscale_qqH,UEPS,lumi_8TeV,pdf_qqbar) asymmThetaList=() otherFactorList=(scale_qqH) ] = 1.4954
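As a quick sanity check (a sketch, assuming the workspace object is called w as in the printout above), one can set the model parameters by hand and evaluate the scaling functions with PyROOT:

import ROOT

f_in = ROOT.TFile.Open("ws_fraction.root")
w = f_in.Get("w")

# With f = 0.25 and r = 2.0 we expect scale_ggH = f*r = 0.5 and scale_qqH = (1-f)*r = 1.5
w.var("f").setVal(0.25)
w.var("r").setVal(2.0)
print("scale_ggH =", w.function("scale_ggH").getVal())
print("scale_qqH =", w.function("scale_qqH").getVal())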
EFT model
One can also define analytical BSM models with Combine. In this section we will extract the confidence interval for one of the SMEFT Wilson coefficients. Without going into details, it can be shown that the cross sections scale quadratically with the dimension-6 SMEFT Wilson coefficients. In this example we will consider the \(c_{Hg}\) operator, which among all the signal processes in the datacard affects only ggH at LO in SMEFT:
\(\sigma(c_{Hg}) = \sigma_{SM} (1 + A c_{Hg} + B c^{2}_{Hg})\), where the A and B coefficients are numbers that can be estimated from simulation.
class SMEFT_chg(PhysicsModel):
    def doParametersOfInterest(self):
        """Create POI and other parameters, and define the POI set."""
        self.modelBuilder.doVar("A[39.54]")
        self.modelBuilder.doVar("B[245.32]")
        self.modelBuilder.out.var("A").setConstant(True)
        self.modelBuilder.out.var("B").setConstant(True)
        self.modelBuilder.doVar("chg[0,-1,1]")
        self.modelBuilder.factory_("expr::ggH_scaling_chg(\"1+@1*@0+@2*@0*@0\", chg, A, B)")
        self.modelBuilder.doSet("POI", ",".join(["chg"]))

    def getYieldScale(self, bin, process):
        if process == "ggH": return "ggH_scaling_chg"
        else: return 1


smeft_chg_tutorial = SMEFT_chg()
To create the workspace using this model, one can run:
export PYTHON3PATH=${PYTHON3PATH}:${PWD}/models
text2workspace.py PhysicsModels/htt_tt_125_8TeV.txt -P EFT_simple:smeft_chg_tutorial -m 125 -o ws_chg.root
Run the likelihood scan for the \(c_{Hg}\) parameter and make the plot:
combine -M MultiDimFit --algo grid -d ws_chg.root --setParameterRanges chg=-0.2,0.2
plot1DScan.py higgsCombineTest.MultiDimFit.mH120.root --POI chg