CombineHarvester
Examples Part I

File: CombineTools/bin/Example1.cpp

In this example we use CombineHarvester to parse an existing datacard and then extract information from it. Open the file above and take a look at the source code. To run the example, first make sure the code has been compiled with scram:

cd $CMSSW_BASE/src
scram b -j4
Example1

Parsing a single card

In the first part we locate and open a single text datacard file:

// Use the CMSSW_BASE environment variable to get the full path to the auxiliaries folder
string in_dir = string(getenv("CMSSW_BASE")) + "/src/auxiliaries/datacards/sm/htt_mt/";
// Create a new CombineHarvester instance
// Call the `ParseDatacard` method specifying the path to the text file and the additional
// metadata we want to associate with it
cmb1.ParseDatacard(in_dir + "htt_mt_6_8TeV-125.txt", "htt", "8TeV", "mt", 6, "125");
// Print the Observation, Process and Systematic entries to the screen
CombineHarvester & PrintObs()
int ParseDatacard(std::string const &filename, std::string const &analysis, std::string const &era, std::string const &channel, int bin_id, std::string const &mass)
CombineHarvester & PrintProcs()
CombineHarvester & PrintSysts()

When parsing a datacard, CombineHarvester breaks down the information it contains into sets of objects, each represented by a C++ class. A ch::Observation object stores the information about the observed data in a single category, and likewise ch::Process stores the information for one expected signal or background process in a category. A ch::Systematic object records the uncertainty value assigned to a particular process from a particular source.

Note
Internally a CombineHarvester contains three vectors, one for each kind of object. It's the job of the ParseDatacard method to build a new set of these objects, extract the mapped histograms, and append them to these vectors. The histogram extraction is done automatically using the mapping rules given in the datacard (the lines starting with shape). Once the mapped ROOT file has been located and opened, the relevant histograms are copied into their corresponding CombineHarvester objects.

Each object class stores a standard set of metadata, designed to aid in the filtering and selection of particular objects within a CombineHarvester, and which in the example above is specified explicitly. The possible metadata is listed in the following table.

name type example value
bin string automatic
process string automatic
analysis string "htt"
era string "8TeV"
channel string "mt"
bin_id int 6
mass string "125"

Of these only bin, process and mass are tracked and used by combine, the others are optional and can be left empty if unneeded. The bin property is used to uniquely label an event category. Along with the process names, this is written directly into the datacard and is extracted automatically. The mass property is an exception: although we typically create a datacard for a particular signal mass hypothesis this information is not recorded in the datacard, but rather is passed to combine as a command line option, e.g. combine -M Asymptotic -m 125 my_datacard.txt.

Warning
If mass is not specified, ParseDatacard is likely to fail as this property is often needed to map signal processes to histograms in the input ROOT file. You can tell if this property is needed by looking for the term $MASS in the shapes rules at the top of the text datacard.

An important concept is that the objects created from a datacard are not explicitly linked - each is completely independent. When it's necessary to determine which objects are related, e.g. if we ask for the total uncertainty on a particular process, CombineHarvester will determine this on-the-fly by matching up ch::Process and ch::Systematic objects that have identical metadata.

The last line of the code will print the information stored about the Observation, Process and Systematic entries that we've created. Like many CombineHarvester methods, these Print functions return a reference to the calling object, meaning they can be chained together to keep the code concise. The output will look like:

--------------------------------------------------------------------------------------------------------
mass   analysis  era    channel  bin                          id  process               rate       shape
--------------------------------------------------------------------------------------------------------
125    htt       8TeV   mt       muTau_vbf_loose              6   data_obs              76         1
--------------------------------------------------------------------------------------------------------
mass   analysis  era    channel  bin                          id  process          sig  rate       shape
--------------------------------------------------------------------------------------------------------
125    htt       8TeV   mt       muTau_vbf_loose              6   ggH              1    1.2171     1
125    htt       8TeV   mt       muTau_vbf_loose              6   qqH              1    3.5739     1
125    htt       8TeV   mt       muTau_vbf_loose              6   VH               1    0.029033   1
125    htt       8TeV   mt       muTau_vbf_loose              6   ZTT              0    50.212     1
125    htt       8TeV   mt       muTau_vbf_loose              6   QCD              0    10.788     1
125    htt       8TeV   mt       muTau_vbf_loose              6   W                0    22.544     1
125    htt       8TeV   mt       muTau_vbf_loose              6   ZL               0    0.23798    1
125    htt       8TeV   mt       muTau_vbf_loose              6   ZJ               0    1.9001     1
125    htt       8TeV   mt       muTau_vbf_loose              6   TT               0    2.0405     1
125    htt       8TeV   mt       muTau_vbf_loose              6   VV               0    0.7037     1
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
mass   analysis  era    channel  bin                          id  process          sig  nuisance                                      type     value         sh_d sh_u
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
125    htt       8TeV   mt       muTau_vbf_loose              6   ggH              1    lumi_8TeV                                     lnN      1.026         0    0
125    htt       8TeV   mt       muTau_vbf_loose              6   qqH              1    lumi_8TeV                                     lnN      1.026         0    0
125    htt       8TeV   mt       muTau_vbf_loose              6   VH               1    lumi_8TeV                                     lnN      1.026         0    0
125    htt       8TeV   mt       muTau_vbf_loose              6   qqH              1    pdf_qqbar                                     lnN      1.036         0    0
125    htt       8TeV   mt       muTau_vbf_loose              6   VH               1    pdf_qqbar                                     lnN      1.04          0    0
125    htt       8TeV   mt       muTau_vbf_loose              6   ggH              1    pdf_gg                                        lnN      1.097         0    0
125    htt       8TeV   mt       muTau_vbf_loose              6   ggH              1    UEPS                                          lnN      0.893         0    0
125    htt       8TeV   mt       muTau_vbf_loose              6   qqH              1    UEPS                                          lnN      0.988         0    0
125    htt       8TeV   mt       muTau_vbf_loose              6   VH               1    UEPS                                          lnN      0.988         0    0
125    htt       8TeV   mt       muTau_vbf_loose              6   ggH              1    QCDscale_ggH2in                               lnN      0.772         0    0
125    htt       8TeV   mt       muTau_vbf_loose              6   qqH              1    QCDscale_qqH                                  lnN      1.018         0    0
125    htt       8TeV   mt       muTau_vbf_loose              6   VH               1    QCDscale_VH                                   lnN      1.04          0    0
125    htt       8TeV   mt       muTau_vbf_loose              6   ggH              1    CMS_scale_t_mutau_8TeV                        shape    0.9513/1.093  1    1
125    htt       8TeV   mt       muTau_vbf_loose              6   qqH              1    CMS_scale_t_mutau_8TeV                        shape    0.9585/1.038  1    1
125    htt       8TeV   mt       muTau_vbf_loose              6   VH               1    CMS_scale_t_mutau_8TeV                        shape    0.9999/1      1    1
125    htt       8TeV   mt       muTau_vbf_loose              6   ZTT              0    CMS_scale_t_mutau_8TeV                        shape    0.9639/1.055  1    1
...

Parsing multiple cards

Next we create another CombineHarvester instance and parse several datacards, this time using a method in which the object metadata is inferred from the datacard filenames:

// Parse a set of datacards into one CombineHarvester instance
for (string bin_id : {"1", "2", "3", "4", "5", "6", "7"}) {
cmb2.ParseDatacard(in_dir + "htt_mt_" + bin_id + "_8TeV-125.txt",
"$ANALYSIS_$CHANNEL_$BINID_$ERA-$MASS.txt");
}

In this ParseDatacard method only two arguments are required: the path to the text datacard and a string containing place-holders for the metadata. Note that there's no obligation to include all five place-holders in this string. Additionally, the place-holders are not restricted to appearing in the filename but may also be included in the preceding directory path, e.g. $MASS/$ANALYSIS_$CHANNEL_$BINID_$ERA.txt is also valid.

Filtering

Now let's pretend we are no longer interested in the first four categories we parsed. To remove objects we use a filter method. There are four generic filter methods: FilterAll, FilterObs, FilterProcs and FilterSysts that each accept a function, or function-type object that must have a single ch::Object, ch::Observation, ch::Process or ch::Systematic pointer argument and return a bool. A true return value indicates that the object should be dropped. The FilterAll method will act on all three object collections whereas the others operate only on their respective collections. It's often convenient to write a small lambda function in-place:

// Filter out all objects that do not have bin_id <= 4
cmb2.FilterAll([](ch::Object const* obj) {
return obj->bin_id() <= 4;
});
CombineHarvester & FilterAll(Function func)
virtual int bin_id() const
Definition: Object.h:35

Alternatively,there are a number of fixed-property filters, in which you need only supply a vector of the object properties you want to keep:

// Another way to achieve the same effect
cmb2.bin_id({5, 6, 7});
// Filter out all objects that have process name "QCD"
cmb2.process({"QCD"}, false);
CombineHarvester & bin_id(std::vector< int > const &vec, bool cond=true)
CombineHarvester & process(std::vector< std::string > const &vec, bool cond=true)

An optional boolean can be supplied as a second argument. When set to false this reverses the logic - objects with a property in the list will be dropped. In the second line we use this to remove all information about the QCD process. The full list of filter methods is found here

Yields, copying and sets

In the final part of this example we take a look at the rate evaluation methods. These calculate the total event yields for either the observed data or the expected processes:

// GetObservedRate() will sum the event yields of all remaining Observation
// objects
cout << "Total observed rate: " << cmb2.GetObservedRate() << "\n";
// We can get the yield for a single bin by first making a shallow copy of the
// current instance with the cp() method, then filtering this copy to leave
// only the bin we're interested in, before finally calling GetObservedRate()
cout << "Single category rate: "
<< cmb2.cp().bin({"muTau_vbf_loose"}).GetObservedRate() << "\n";
// It is also possible to generate sets of object properties. Here we print
// out the expected yield for each process in each bin
for (auto bin : cmb2.bin_set()) {
for (auto proc : cmb2.cp().bin({bin}).process_set()) {
cout << bin << "," << proc << ": "
<< cmb2.cp().bin({bin}).process({proc}).GetRate()
<< "\n";
}
}
std::set< std::string > process_set()
CombineHarvester & bin(std::vector< std::string > const &vec, bool cond=true)
std::set< std::string > bin_set()
CombineHarvester cp()
Creates and returns a shallow copy of the CombineHarvester instance.

Note that these functions are greedy - they will sum the contribution from every available Observation or Process entry. This means in the first line we get the total number of observed events in the three remaining categories. To get the yield for a single category we can prefix the function with a filter method. But here we must be careful, because we don't want to actually remove the information on the other categories permanently, which is what would happen if we just do:

cmb2.bin({"muTau_vbf_loose"}).GetObservedRate();
// cmb2 only contains objects for the "muTau_vbf_loose" category now!

To get around this we first call the cp method on our CombineHarvester instance. This makes a shallow copy of the instance - in this it is only pointers to the contained objects, not the objects themselves, which are copied into a new instance. Such a copy is computationally fast to make, and we are free to filter objects from it without affecting the original instance at all.

Note
Although filtering the objects in a shallow copy has no effect on the object lists in the original instance, both instances do still point to the same objects, so modifying the actual contents via the shallow copy will affect both instances. To create a full CombineHarvester copy, in which the underlying objects are also duplicated, use the deep copy method instead.

The last part of the example code uses the CombineHarvester set-generating methods to conveniently loop through all defined (bin, process) combinations and print out the expected yield. The full list of available set-generating methods can be found here.