Lumberjack: from TTree to histograms

Lumberjack is a tool for processing large amounts of columnar data stored in ROOT TTree objects. For analysis purposes, these data are typically processed to obtain simpler analysis-level objects such as histograms or profile histograms (“profiles” for short).

In principle, this is achievable using the “traditional” interfaces provided by the TTree object directly, such as TTree::Draw(), or by manually looping through the TTree.

However, these approaches have their disadvantages. TTree::Draw() has the inherent limitation of requiring a loop to be executed every time a histogram is filled, while manually implementing the loop implies writing a lot of “boilerplate” C++ and/or Python code, which can be tedious and error-prone.

In addition, both of these approaches often result in generic code being mixed with analysis-specific code and metadata (binnings, threshold values for filters, etc.), which can prove difficult to debug and maintain.

A much more flexible interface is provided by RDataFrame, a recent addition to ROOT, which takes a declarative approach towards processing TTree objects. This means that the entire workflow is specified before looping through the TTree. Once everything is set up, the main TTree loop is executed only once, filling all the requested histograms in parallel. This makes TTree processing with RDataFrames potentially very efficient, especially for workflows requiring a large number of histograms.

Lumberjack is built on top of the RDataFrame interface and aims to provide users with a simple but powerful interface for configuring and running such workflows.

Configuration

A configuration module supplies Lumberjack with all the information it needs to produce the requested outputs. This includes things like:

  • the names of the variables stored in the input TTree

  • how to bin these variables when creating histograms

  • what selection filters to apply to the sample

  • how to split the sample into subsamples

This information is supplied as Python dictionaries or lists, which each configuration module exposes via a series of module-level variables. The content and structure of these variables are covered in the following subsections. The following table provides an overview of these variables and gives a short summary of their contents:

Variable name   Contains

QUANTITIES      names and binnings of quantities to be filled into histograms. Can be branches of the input TTree or expressions involving them.

DEFINES         named expressions involving TTree branches. Can be used as “shortcuts” when defining quantities.

ROOT_MACROS     C++ code passed to the global ROOT interpreter. Functions defined here can be used in expressions when defining quantities.

SELECTIONS      named groups of filter expressions to apply to the TTree before further processing.

SPLITTINGS      named “recipes” for splitting the sample into subsamples based on the value of a variable/expression.

TASKS           named “recipes” for commonly performed tasks. A task consists of one or more global selections, one or more splitting recipes, and a list of requested output quantities.

Note

Variables must be made available at the top level of a configuration module. That is, they must be importable via from Lumberjack.cfg.config_module import VARIABLE

QUANTITIES: what should Lumberjack output contain?

A quantity is a TTree branch (or an expression involving TTree branches) which is meant to be filled into a histogram or profile histogram. Quantities always have an associated binning, indicating the bin structure to be used when creating the corresponding histograms.

Quantities are represented in Lumberjack by their own Python objects containing the following properties:

  • name: (string), uniquely identifies a quantity. The name of output histograms involving this quantity will contain this string

  • expression: (string), optional, formula for calculating the quantity value from TTree branches. If this is not given, a branch called name is assumed to exist and will be used.

  • binning: (list) of numeric values, indicating the bin edges. Must be sorted in ascending order

For more information, consult the API documentation for Quantity.

Every quantity has an associated binning, that is, an array of float values, sorted in ascending order, indicating the bin edges. Values less than the first bin edge or greater than the last bin edge are counted as underflows or overflows in the resulting ROOT histograms.
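The bin lookup described above can be sketched in a few lines of plain Python. This is only an illustration, not Lumberjack or ROOT code: the helper name find_bin is hypothetical, and the numbering (0 for underflow, number of bins + 1 for overflow, lower edge included in a bin, upper edge excluded) is assumed to follow ROOT's usual axis convention.

```python
from bisect import bisect_right


def find_bin(bin_edges, value):
    """Return the bin number for `value` given ascending `bin_edges`.

    Assumed convention (mirroring ROOT axes): 0 is the underflow bin,
    1..N are the regular bins and N+1 (== len(bin_edges)) is the overflow bin.
    Each regular bin includes its lower edge and excludes its upper edge.
    """
    if any(low >= high for low, high in zip(bin_edges, bin_edges[1:])):
        raise ValueError("bin edges must be sorted in ascending order")
    return bisect_right(bin_edges, value)


edges = [0, 5, 10, 100, 5000]    # binning of 'quantityA' from the example below
print(find_bin(edges, -1))       # 0: underflow
print(find_bin(edges, 7.5))      # 2: second bin, [5, 10)
print(find_bin(edges, 6000))     # 5: overflow
```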

Note

If the same TTree branch should be filled into histograms with different binnings, then a separate quantity must be defined for each binning. The expression should be set to the name of the TTree branch.

The configuration variable QUANTITIES is mandatory and contains the definition of all quantities that Lumberjack can work with. It is a Python dictionary.

The top-level keys correspond to the different input types. Different input types (e.g. data for different measurement channels, Monte Carlo simulation samples, etc.) can contain different quantities, so this layer exists to allow users to define input-type specific quantities.

The names of the input types can be chosen freely. For quantities which are defined in all input types, a special input type called global exists. The QUANTITIES dictionary must always contain a global key, even if it is empty.

Each input type maps to an inner dictionary which contains the definitions of quantities available for that input type. This inner dictionary maps the names of quantities to instances of the Quantity class.

Note

In virtually all cases, the name property of a Quantity object should be identical to the dictionary key in QUANTITIES that maps to it. While this is not enforced, it is good practice to ensure that this is always the case. Lumberjack uses the name property when constructing the names of histograms in the output file, and the dictionary key when looking up a quantity requested by the user.

Note

If the same key is present in the global dictionary and the dictionary for a particular input type, then the global quantity definition is replaced by the input-type-specific one.

Below is an example of a valid QUANTITIES definition:

from Lumberjack import Quantity

...

QUANTITIES = {

  # mandatory: these quantities are defined for all input types
  'global': {

    # assuming there is a TTree branch called 'quantityA'
    'quantityA': Quantity(
      name='quantityA',
      binning=[0, 5, 10, 100, 5000]
    ),

    # same TTree branch, different binning
    'quantityA_narrowBins': Quantity(
      name='quantityA_narrowBins',
      expression='quantityA',
      binning=[0, 2.5, 5, 7.5, 10, 50, 100, 300, 500, 5000]
    ),

    # apply an expression to a TTree branch
    'abs_quantityB': Quantity(
      name='abs_quantityB',
      expression='abs(quantityB)',  # expression interpreted by ROOT
      binning=[0, 1, 2, 3]
    ),

  },

  # quantities only defined for a special input type
  'my_special_input_type': {

    # this TTree branch only exists in a special sample type
    'mySpecialQuantity': Quantity(
      name='mySpecialQuantity',
      binning=[0, 5, 10, 100, 5000]
    ),

  }

}
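The override rule for global and input-type-specific quantities can be pictured as a simple dictionary merge. The sketch below is an assumption about the behaviour described in the notes above, not Lumberjack's actual lookup code; the helper quantities_for_input_type is hypothetical, and plain dictionaries stand in for Quantity objects so that the snippet runs without Lumberjack installed.

```python
def quantities_for_input_type(quantities, input_type):
    """Merge the 'global' quantities with those of one input type.

    Input-type-specific entries replace global entries with the same key.
    """
    resolved = dict(quantities['global'])            # copy, keep config intact
    resolved.update(quantities.get(input_type, {}))  # specific entries win
    return resolved


# plain dicts used as stand-ins for Quantity instances (illustration only)
EXAMPLE_QUANTITIES = {
    'global': {
        'quantityA': dict(name='quantityA', binning=[0, 5, 10, 100, 5000]),
    },
    'my_special_input_type': {
        'quantityA': dict(name='quantityA', binning=[0, 50, 5000]),
        'mySpecialQuantity': dict(name='mySpecialQuantity', binning=[0, 5, 10]),
    },
}

resolved = quantities_for_input_type(EXAMPLE_QUANTITIES, 'my_special_input_type')
print(sorted(resolved))                  # ['mySpecialQuantity', 'quantityA']
print(resolved['quantityA']['binning'])  # [0, 50, 5000]
```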

DEFINES: shortcuts for often-used expressions

It can happen that a variable is not directly stored in a TTree branch but has to be calculated for every entry. Often, these expressions are not full-blown quantities themselves (i.e. they do not need to be filled into histograms), but are only used as “shortcuts” when defining quantities.

To avoid specifying these variables in QUANTITIES, which would require specifying an (unneeded) binning, they are specified in a separate configuration variable: DEFINES.

The DEFINES configuration variable is a Python dictionary and has a structure similar to the QUANTITIES variable. The top-level keys must correspond to the different input types defined in the QUANTITIES variable and must map to inner dictionaries specifying the expressions to be defined.

Unlike for quantities, however, there is no dedicated Define object. Instead, an expression is simply defined by adding a key-value pair to the inner dictionary. The key is the name given to the expression and the value is simply the expression string.

The above functionality is provided by ROOT’s RDataFrame interface via the Define call. For each key-value pair, a Define(key, value) call will be issued.

Note

For quantities that have an expression property which does not coincide with their name, a Define call will already be issued when initializing the RDataFrame, so in that case there is no need to add an entry for this in DEFINES.

Here is an example of a DEFINES variable:

DEFINES = {

  'global': {

    # calculate radius as a function of x and y
    'radius': 'TMath::Sqrt(x*x + y*y)',  # can use ROOT's function library

  },

  'special_3d_sample': {

    # calculate radius as a function of x, y and z
    'radius': 'TMath::Sqrt(x*x + y*y + z*z)',  # overrides `global` definition above

  }

}

Note

Since dictionary keys are not ordered, there is no guarantee that the Define calls will be issued in the same order as they appear in the configuration file. An expression should therefore not refer to a name that is defined by another key of the same dictionary.

There is, however, a guarantee that all global defines will be made before the input-type-specific ones, so input-type-specific expressions may contain globally-defined variables.

If ordering is important, it may be possible to use Python OrderedDicts instead of plain dictionaries, but this has not been tested.
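The ordering guarantee described in this note can be illustrated with a small sketch. It is an assumption about how such an ordering could be produced, not Lumberjack's implementation: the helper ordered_defines is hypothetical, the 'phi' entry is added purely for illustration, and the result relies on plain dictionaries preserving insertion order, which is guaranteed from Python 3.7 onwards.

```python
def ordered_defines(defines, input_type):
    """Return (name, expression) pairs in the order they could be defined.

    Global entries come first, input-type-specific entries afterwards.
    A global entry that is overridden by the input type is skipped, so
    that every name is defined exactly once.
    """
    global_defs = defines.get('global', {})
    specific_defs = defines.get(input_type, {})
    calls = [(name, expr) for name, expr in global_defs.items()
             if name not in specific_defs]
    calls.extend(specific_defs.items())
    return calls


EXAMPLE_DEFINES = {
    'global': {
        'radius': 'TMath::Sqrt(x*x + y*y)',
        'phi': 'TMath::ATan2(y, x)',     # hypothetical extra entry
    },
    'special_3d_sample': {
        'radius': 'TMath::Sqrt(x*x + y*y + z*z)',
    },
}

for name, expression in ordered_defines(EXAMPLE_DEFINES, 'special_3d_sample'):
    print(name, '=', expression)
# phi = TMath::ATan2(y, x)
# radius = TMath::Sqrt(x*x + y*y + z*z)
```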

ROOT_MACROS: C++ code to be executed in the ROOT interpreter

The underlying mechanism used by ROOT’s RDataFrame to allow simple strings to be interpreted as expressions works by translating these expressions into C++ code via ROOT’s internal JIT (just-in-time compilation) mechanism.

For simple expressions, this is very convenient, but for cases when more complicated functions of TTree variables need to be evaluated, it is not feasible to provide these as strings. ROOT’s RDataFrame interface, however, allows applying arbitrary C++ functions to TTree data.

The configuration variable ROOT_MACROS can be used for this purpose. It is a string containing C++ code which will be passed as-is to the ROOT global interpreter. Any C++ functions defined in this way can then be used in QUANTITIES and DEFINES just as any other function.

Here is an example which defines a C++ function called isMandelbrot. This uses a while loop to determine whether a point in the complex plane is (roughly) in the Mandelbrot set:

ROOT_MACROS = """

#include <complex>

/* determine whether point belongs to Mandelbrot set */
bool isMandelbrot(const double& realPart, const double& imagPart) {

    std::complex<double> c(realPart, imagPart);
    std::complex<double> z(0, 0);

    unsigned int nIterations = 0;
    while (std::abs(z) < 2.0 && nIterations < 10000) {
        z = z*z + c;
        ++nIterations;
    }

    if (nIterations == 10000)
        return true;
    else
        return false;

}
"""

Note the triple-quoted string, which is needed for multi-line strings in Python. Once defined in ROOT_MACROS as above, the function can be used in DEFINES:

DEFINES = {

  'global': {

    ...

    # check if entry is in the Mandelbrot set using
    # ``TTree`` branches 'realPart' and 'imagPart' as inputs
    'isMandelbrot': 'isMandelbrot(realPart, imagPart)'
  }
  ...
}

Instead of using a triple-quoted string for ROOT_MACROS, all C++ code can be put into a separate file and loaded directly into the variable using the I/O tools provided by Python:

import os

...

# put the contents of a file in ROOT_MACROS
_root_macro_file_path = os.path.join(os.path.dirname(__file__), "root_macros.C")
with open(_root_macro_file_path, 'r') as _root_macro_file:
    ROOT_MACROS = _root_macro_file.read()

SELECTIONS: named groups of cuts to be applied before splitting

An important operation when processing n-tuple data is the application of filters, that is, accepting only those entries for which the n-tuple values satisfy certain conditions, and rejecting the rest.

A typical use for filters in HEP data analyses is in implementing the so-called event selection. For events stored as n-tuples in a TTree, this translates to applying a group of several cuts on TTree branches or functions thereof before further processing.

In Lumberjack, users can define multiple such selections which can be applied to the input TTree.

A selection has a unique name and consists of a list of boolean filter expressions which will be evaluated for every TTree entry. An entry is rejected if any of the expressions evaluates to False (i.e. the expressions are joined via a logical AND).

Selections are defined in the SELECTIONS configuration variable. It is a Python dictionary which maps the selection name to the list of filter expressions.

Here is an example of a SELECTIONS specification:

SELECTIONS = {

  'mainSelection' : [
    'varA > 100',
    'abs(varB) < 2.4',
    'varA/varC < 0.3',
  ],

  'additionalSelection' : [
    'varD == 1',
  ]

}
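Because the expressions of a selection are joined via a logical AND, a selection is equivalent to a single combined filter string. The following sketch shows one way to build such a string. It is only an illustration: the helper combine_filters is hypothetical, and whether Lumberjack applies one combined filter or several consecutive ones is not specified here.

```python
def combine_filters(expressions):
    """Join filter expressions with a logical AND (C++ syntax, as used by ROOT).

    Each expression is wrapped in parentheses so that operator precedence
    inside one expression cannot affect its neighbours.
    """
    if not expressions:
        return 'true'    # an empty selection accepts every entry
    return ' && '.join('({})'.format(expr) for expr in expressions)


main_selection = [
    'varA > 100',
    'abs(varB) < 2.4',
    'varA/varC < 0.3',
]
print(combine_filters(main_selection))
# (varA > 100) && (abs(varB) < 2.4) && (varA/varC < 0.3)
```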

SPLITTINGS: how should the TTree be split?

An important task for analyses consists in splitting a large sample into several subsamples (or “regions”) based on the values of one or more variables.

In Lumberjack, this is specified by the SPLITTINGS configuration variable.

The SPLITTINGS configuration variable consists of a series of nested Python dictionaries, with three levels structured as follows:

  • the top-level dictionary consists of key-value pairs where the key is the name of the splitting and the value (the intermediate dictionary) contains the specification for that particular splitting

  • the intermediate dictionary describes the structure of the cuts to be applied and consists of key-value pairs where the keys are strings containing “cut descriptions” and the values (the innermost dictionaries) contain the cuts that should be applied

  • the innermost dictionaries consist of key-value pairs where the keys are the variables to cut on and the values specify the variable values (or range of values)

In the following example, two splittings of the sample based on the values of variables var_A and sign_B are declared:

SPLITTINGS = {

    # split into multiple regions of A
    'region_A' : {
        'A_from_0_to_1':    dict(var_A=(0, 1)),     # implies "0<=var_A && var_A<1"
        'A_from_1_to_2':    dict(var_A=(1, 2)),
        'A_greater_than_2': dict(var_A=(2, 1000)),
    },

    # split into two parts depending on the sign of B
    'sign_B' : {
        'B_negative': dict(sign_B=-1),
        'B_positive': dict(sign_B=1),
    },

}

Note that any variable appearing as a key in the innermost dictionaries (sign_B and var_A in the above example) must either be a TTree branch or a named expression specified in DEFINES.
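To make the meaning of the innermost dictionaries more concrete, the sketch below turns a cut specification into a filter expression string. The range form follows the comment in the example above ("0<=var_A && var_A<1"); treating a single value as an equality test is an assumption, and both helper names are hypothetical.

```python
def cut_to_expression(variable, spec):
    """Translate one innermost key-value pair into a filter expression.

    A (low, high) pair selects low <= variable < high.
    Any other value is assumed to mean equality.
    """
    if isinstance(spec, (tuple, list)):
        low, high = spec
        return '{low}<={var} && {var}<{high}'.format(low=low, var=variable, high=high)
    return '{var}=={value}'.format(var=variable, value=spec)


def subsample_expression(cuts):
    """Combine all cuts of one subsample with a logical AND."""
    return ' && '.join('({})'.format(cut_to_expression(var, spec))
                       for var, spec in sorted(cuts.items()))


print(cut_to_expression('var_A', (0, 1)))    # 0<=var_A && var_A<1
print(cut_to_expression('sign_B', -1))       # sign_B==-1
print(subsample_expression(dict(var_A=(1, 2), sign_B=1)))
# (sign_B==1) && (1<=var_A && var_A<2)
```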

TASKS: what should be done?

The main unit of work in Lumberjack is the task. A task is like a “blueprint” for a particular workflow and tells Lumberjack which splittings to use when splitting the sample and what analysis-level objects (histograms and/or profiles) should be filled.

When running Lumberjack, each task produces exactly one output ROOT file. The structure of this ROOT file reflects the chosen splitting(s) and analysis-level objects and is described in more detail below.

Tasks are specified in the TASKS configuration variable, which is a Python dictionary that maps the task name to an inner dictionary containing the task specification. The inner dictionary contains the following keys:

  • splittings: a list of strings corresponding to top-level keys of SPLITTINGS, indicating how the sample should be split into subsamples

  • histograms: a list of strings specifying the histograms to be filled for each subsample

  • profiles: a list of strings specifying the profile histograms to be filled for each subsample

The strings given in splittings must be keys of the SPLITTINGS configuration dictionary. If multiple splitting keys are specified, the sample will be split according to the outer product of the corresponding splitting specifications. This means that the sample is first split according to the first splitting key, then each of the subsamples created is split according to the second key, and so on.
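The outer product of several splittings can be enumerated with itertools.product. The sketch below lists the resulting subsamples as slash-separated paths. It is an illustration only: the helper subsample_paths is hypothetical, and the order of the regions within one splitting relies on dictionaries preserving insertion order (Python 3.7 and later).

```python
from itertools import product


def subsample_paths(splittings, keys):
    """List all subsamples obtained by applying the splittings in `keys` in order."""
    region_names = [list(splittings[key]) for key in keys]
    return ['/'.join(combination) for combination in product(*region_names)]


EXAMPLE_SPLITTINGS = {
    'region_A': {
        'A_from_0_to_1':    dict(var_A=(0, 1)),
        'A_from_1_to_2':    dict(var_A=(1, 2)),
        'A_greater_than_2': dict(var_A=(2, 1000)),
    },
    'sign_B': {
        'B_negative': dict(sign_B=-1),
        'B_positive': dict(sign_B=1),
    },
}

paths = subsample_paths(EXAMPLE_SPLITTINGS, ['region_A', 'sign_B'])
print(len(paths))    # 6 subsamples (3 regions of A times 2 signs of B)
print(paths[0])      # A_from_0_to_1/B_negative
```

Reversing the order of the keys yields the same six subsamples, but with the nesting inverted (for example B_negative/A_from_0_to_1).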

The strings given in histograms and profiles specify which quantities should be filled into the object. If multidimensional histograms or profiles are desired, the quantities to be filled on the x, y (and optionally z) axes must be provided and separated by a colon (:).

Note

The quantities for multidimensional histograms and profiles should be given in the order x:y[:z]. This is different from the convention used by ROOT’s TTree::Draw(), which uses y:x.

Weighted histograms can be requested by appending @ to the histogram or profile specification, followed by the variable to be used as a weight.

If one of histograms or profiles is not specified, no objects of that type will be filled. If both are empty, nothing is done.
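The specification strings described above have the general form x[:y[:z]][@weight]. A minimal parser for this format could look as follows. This is a sketch under the stated format only, not Lumberjack's own parser, and the function name parse_spec is hypothetical.

```python
def parse_spec(spec):
    """Split a histogram/profile specification into its parts.

    Returns (quantities, weight), where `quantities` lists the quantities
    in the order x, y, z and `weight` is None for unweighted objects.
    """
    axes_part, separator, weight = spec.partition('@')
    quantities = axes_part.split(':')
    if not 1 <= len(quantities) <= 3 or not all(quantities):
        raise ValueError("expected 'x', 'x:y' or 'x:y:z', got {!r}".format(axes_part))
    if separator and not weight:
        raise ValueError("'@' must be followed by the name of a weight variable")
    return quantities, (weight or None)


print(parse_spec('my_quantity_1'))                # (['my_quantity_1'], None)
print(parse_spec('my_quantity_2@my_weight'))      # (['my_quantity_2'], 'my_weight')
print(parse_spec('my_quantity_1:my_quantity_2'))  # (['my_quantity_1', 'my_quantity_2'], None)
```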

The following example shows how to configure a task which splits the input sample according to the outer product of two splittings and creates several analysis-level objects of different types:

TASKS = {

  ...

    'MyTask' : {

        # split sample into subsamples according to these entries in SPLITTINGS:
        'splittings': [
          'region_A',
          'sign_B'
        ],

        # for each subsample, produce the following histograms
        'histograms': [
            "my_quantity_1",                # 1D histogram
            "my_quantity_2@my_weight",      # 1D histogram with weights
            "my_quantity_1:my_quantity_2"   # 2D histogram ("x:y")
        ],

        # for each subsample, produce the following profiles
        'profiles': [
            "my_quantity_1:my_quantity_2"   # profile histogram ("x:y")
        ]

    },

}

Running Lumberjack from the command line

Once a configuration module is created (as described above), Lumberjack can be run from the command line by executing the lumberjack.py script. The flags passed to the script will determine which configuration module to load, what ROOT file and TTree to use as an input, and various other options.

The command-line interface offers two sub-commands: task and freestyle. The former is used to run tasks, as configured in the TASKS configuration variable, while the latter allows users to specify the run parameters (splittings, requested histograms/profiles) directly on the command-line.

The following lines show an example of how to run the task subcommand:

$> lumberjack.py
      --analysis    "my_analysis"         # name of Python config module
      --input-file  "input_file.root"     # name of ROOT file to use as input
      --input-type  "data"                # type of input (must be a key of
                                          # QUANTITIES and DEFINES)
      --tree        "path/to/ttree"       # path to TTree in ROOT file
      --selections  "my_main_selection"   # name of selection as defined in SELECTIONS
      --jobs        10                    # use 10 threads
      --log                               # create log file
      --progress                          # show progress bar
      task "MyTask" "MyTask2"             # run these tasks
        --output-file-suffix "mySuffix"   # suffix to append to output filename

Multiple tasks can be run during a single Lumberjack run. The output of each task will be stored in its own ROOT file (one file per task). The file names contain the task name and a user-specified suffix (if any). In the above example, two files called MyTask_mySuffix.root and MyTask2_mySuffix.root will be created.

Below, an example is shown for the freestyle subcommand:

$> lumberjack.py
      --analysis    "my_analysis"        # name of Python config module
      --input-file  "input_file.root"    # name of ROOT file to use as input
      --input-type  "data"               # type of input (must be a key of
                                         # QUANTITIES and DEFINES)
      --tree        "path/to/ttree"      # path to TTree in ROOT file
      --selections  "my_main_selection"  # name of selection as defined in SELECTIONS
      --jobs        10                   # use 10 threads
      --log --progress                   # create log file and show progress bar
      freestyle                          # no task, but specify parameters manually
        "my_splitting_level_1"           # split sample using these SPLITTINGS
        "my_splitting_level_2"           # (in this order)
        --histograms  "my_quantity_1"                # 1D histogram
                      "my_quantity_2@my_weight"      # 1D histogram with weight
                      "my_quantity_1:my_quantity_2"  # 2D histogram ("x:y")
        --profiles    "my_quantity_1:my_quantity_2"  # profile histogram ("x:y")
        --output-file "MyOutputFile.root"

Usage instructions can be obtained by running lumberjack.py --help on the command-line. Running lumberjack.py -a "my_analysis" --help will provide help using the information defined in the configuration module my_analysis.

Output

The output ROOT file will contain all requested objects for every combination of splitting regions. This is achieved by placing the objects into a nested directory structure which reflects the splitting specification.

As an example, consider the following splitting specification:

SPLITTINGS = {
    'region_A' : {
        'A_from_0_to_1':    dict(var_A=(0, 1)),     # -> "0<=var_A && var_A<1"
        'A_from_1_to_2':    dict(var_A=(1, 2)),
        'A_greater_than_2': dict(var_A=(2, 1000)),
    },
    'sign_B' : {
        'B_negative': dict(sign_B=-1),
        'B_positive': dict(sign_B=1),
    },
}

The above configuration specifies how to split the sample into different regions, depending on the values of the variables var_A and sign_B.

If Lumberjack is run with the splittings specified as "region_A" "sign_B", this will result in the following directory structure:

(root directory)
 ├── A_from_0_to_1/
 │   ├── B_negative/
 │   └── B_positive/
 ├── A_from_1_to_2/
 │   ├── B_negative/
 │   └── B_positive/
 └── A_greater_than_2/
     ├── B_negative/
     └── B_positive/

Note that the ordering is important. Running with "sign_B" "region_A" instead of "region_A" "sign_B" will result in the following directory structure:

(root directory)
 ├── B_negative/
 │   ├── A_from_0_to_1/
 │   ├── A_from_1_to_2/
 │   └── A_greater_than_2/
 └── B_positive/
     ├── A_from_0_to_1/
     ├── A_from_1_to_2/
     └── A_greater_than_2/

Each “leaf” directory will contain the requested objects and will have the exact same structure, which will depend on the types of objects requested.

To illustrate this, consider the following Lumberjack run. It uses splittings defined above and requests a number of objects of different types:

$> lumberjack.py
      ...
      freestyle
        "region_A"
        "sign_B"
        --histograms "my_quantity_1"                # 1D histogram
                     "my_quantity_2@my_weight"      # 1D histogram with weight
                     "my_quantity_1:my_quantity_2"  # 2D histogram ("x:y")
        --profiles   "my_quantity_1:my_quantity_2"  # profile histogram ("x:y")

This results in the following output structure (only the first “leaf” directory is shown):

(root directory)
 ├── A_from_0_to_1/
 │   ├── B_negative/
 │   │   ├── h_my_quantity_1            # unweighted 1D histogram
 │   │   ├── h_my_quantity_2_my_weight  # weighted 1D histogram
 │   │   └── my_quantity_2/             # subdirectory for objects involving `my_quantity_2`
 │   │       ├── h2d_my_quantity_1      # 2D histogram
 │   │       └── p_my_quantity_1        # profile histogram

Inside the directories which originate from the splitting specification, the output objects are further organized into subdirectories based on their dimensionality:

  • objects with a quantity on the x axis only are placed directly inside the splitting directory

  • objects with an additional quantity on the y axis (e.g. profile histograms or 2D histograms) are placed one level down inside a directory whose name coincides with the y quantity

  • objects with additional quantities on both the y and z axes (e.g. 2D profile histograms or 3D histograms) are placed two levels down inside directories whose names coincide with the z and y quantities, in that order

Note that the object name itself only contains the quantity represented on the x axis (and, for weighted histograms, the name of the weight).
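The naming rules above can be summarized in a short sketch that computes the location of an object inside a splitting directory. It only covers the object types whose name prefixes appear in the example output (h, h2d and p); the prefixes used for 3D histograms and 2D profiles are not shown in this document, so the sketch deliberately rejects them. The helper name output_path is hypothetical.

```python
# name prefixes as they appear in the example output above
_PREFIXES = {
    ('histogram', 1): 'h',
    ('histogram', 2): 'h2d',
    ('profile', 2): 'p',
}


def output_path(kind, spec):
    """Return the path of an object relative to its splitting directory.

    `kind` is 'histogram' or 'profile'; `spec` has the form 'x[:y][@weight]'.
    """
    axes_part, _, weight = spec.partition('@')
    quantities = axes_part.split(':')
    prefix = _PREFIXES.get((kind, len(quantities)))
    if prefix is None:
        raise ValueError('no known prefix for {} with {} quantities'.format(
            kind, len(quantities)))
    # the object name contains the x quantity and, if present, the weight
    name = '_'.join([prefix, quantities[0]] + ([weight] if weight else []))
    # higher axes become subdirectories, outermost first
    subdirectories = list(reversed(quantities[1:]))
    return '/'.join(subdirectories + [name])


print(output_path('histogram', 'my_quantity_1'))                # h_my_quantity_1
print(output_path('histogram', 'my_quantity_2@my_weight'))      # h_my_quantity_2_my_weight
print(output_path('histogram', 'my_quantity_1:my_quantity_2'))  # my_quantity_2/h2d_my_quantity_1
print(output_path('profile', 'my_quantity_1:my_quantity_2'))    # my_quantity_2/p_my_quantity_1
```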