LINQ is amazingly powerful. Version 0.4 of this tool implements only a subset of LINQ’s full power, but it is enough to do most things.

What is LINQ?

If you are familiar with SQL, then LINQ is a lot like SQL, but built into the programming language as a first-class feature rather than written as a string that is parsed by a remote database engine. LINQ also brings functional programming into a procedural language. This turns out to have some unique advantages for working with large datasets and for translating a query into C++ that can run directly against the ROOT TTree.

There are lots of introductions to LINQ. Just search the web and find one that seems to work for you.

All LINQ queries have to start with a data source. In our case the source is a ROOT file, or a set of files, containing a TTree. For example, to get a data source you might use the following code:

            var f = new FileInfo(@"..\..\..\hvsample.root");
            if (!f.Exists)
            {
                Console.WriteLine("could not find btag input files: " + f.FullName);
                return;
            }

            var evts = ROOTTTreeDataModel.QueryableCollectionTree.Create(f);

That last line creates a data source we can now do a query against (evts).

Next we can write the query. Using C# one can write something that looks like this:

            var trackPt = from e in evts
                          from t in e.tracks
                          select t.pt / 1000.0;

This turns trackPt into a sequence of floating point numbers, representing the pT of all tracks in GeV. There is no reason you can’t chain several of these statements together:

            var allTracks = from e in evts
                            from t in e.tracks
                            select t;

            var centralTracks = from t in allTracks
                                where Math.Abs(t.eta) < 1.0
                                select t;

The first query selects all tracks. The second looks at each of those tracks and keeps only the ones with |eta| less than 1.0.
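To see what the two chained queries compute, here is a minimal in-memory sketch that uses plain LINQ to Objects rather than LINQToTTree. The Event and Track classes and the sample values are hypothetical stand-ins for the types generated from a TTree; only the shape of the queries matches the text above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for the classes generated from a TTree.
public class Track
{
    public double pt;   // MeV
    public double eta;
}

public class Event
{
    public List<Track> tracks = new List<Track>();
}

public static class ChainedQueryDemo
{
    // Returns the central tracks (|eta| < 1.0) from all events.
    public static IEnumerable<Track> CentralTracks(IEnumerable<Event> evts)
    {
        var allTracks = from e in evts
                        from t in e.tracks
                        select t;

        var centralTracks = from t in allTracks
                            where Math.Abs(t.eta) < 1.0
                            select t;

        return centralTracks;
    }

    // Two made-up events holding three tracks in total.
    public static List<Event> SampleEvents()
    {
        return new List<Event>
        {
            new Event { tracks = { new Track { pt = 12000.0, eta = 0.3 },
                                   new Track { pt = 45000.0, eta = 2.1 } } },
            new Event { tracks = { new Track { pt = 8000.0, eta = -0.9 } } },
        };
    }

    public static void Main()
    {
        foreach (var t in CentralTracks(SampleEvents()))
        {
            Console.WriteLine("pt = {0} GeV, eta = {1}", t.pt / 1000.0, t.eta);
        }
    }
}
```

Nothing is evaluated until the foreach loop asks for results, which is why the queries can be chained freely.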

Finally, you have to turn the sequence into something, like a plot or a count. There are a number of helper routines in LINQToTTree to help with this. For example, if you wanted to count the number of centralTracks in your whole sample, you could write something like this:

int count = centralTracks.Count();

Or if you wanted to plot the pT of all central tracks, you could write something like the following:

output.Add(centralTracks.Plot(nameAdd + "TracksPt", titleAdd + " Track p_{T}", 100, 0.0, 100.0, t => t.pt / 1000.0));

Here output is the TFile in which the resulting histogram should be stored, and Plot is a helper routine (see below). Note the final lambda function, 't => t.pt / 1000.0'. That function takes as input each object in centralTracks (the track objects) and transforms it into a double that can be plotted (by a call to TH1F::Fill in this case).
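The role of that lambda can be shown with a small in-memory sketch. FillHistogram below is a hypothetical helper written for illustration only. It is not the LINQToTTree Plot method, which produces a ROOT histogram; the sketch just counts entries per bin in an array.

```csharp
using System;
using System.Collections.Generic;

public static class PlotSketch
{
    // Hypothetical in-memory analogue of a 1D plot helper: applies the selector
    // to each item and counts how many values land in each of nbins equal-width
    // bins over [lo, hi). Values outside the range (and NaN) are dropped.
    public static int[] FillHistogram<T>(IEnumerable<T> source, int nbins,
                                         double lo, double hi, Func<T, double> selector)
    {
        var counts = new int[nbins];
        double width = (hi - lo) / nbins;
        foreach (var item in source)
        {
            double x = selector(item);
            if (double.IsNaN(x) || x < lo || x >= hi) continue;
            int bin = (int)((x - lo) / width);
            if (bin >= nbins) bin = nbins - 1; // guard against floating point rounding
            counts[bin]++;
        }
        return counts;
    }

    public static void Main()
    {
        // Track pT values in MeV; the lambda converts them to GeV, as in the text.
        var ptsInMeV = new[] { 500.0, 1500.0, 1700.0, 99000.0, 250000.0 };
        var hist = FillHistogram(ptsInMeV, 100, 0.0, 100.0, pt => pt / 1000.0);
        Console.WriteLine("bin 0: {0}, bin 1: {1}, bin 99: {2}", hist[0], hist[1], hist[99]);
    }
}
```

The selector is the only part that knows about the objects in the sequence, so the same helper works for tracks, jets, or plain numbers.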

LINQ, by the way, is really syntactic sugar. The compiler translates it into a series of library calls. For example, the allTracks and centralTracks queries above can be rewritten as:

var allTracks = evts.SelectMany(e => e.tracks);

var centralTracks = allTracks.Where(t => Math.Abs(t.eta) < 1.0);

There is no difference between these two approaches. Use whatever is simplest to understand.
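The equivalence is easy to check on in-memory data with plain LINQ to Objects. In this sketch the nested array of eta values is made up, and stands in for events that each hold a list of tracks.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SyntaxEquivalenceDemo
{
    // Each inner array holds the eta values of the tracks in one event (made-up data).
    public static readonly double[][] Events =
    {
        new[] { 0.3, 2.1 },
        new[] { -0.9 },
        new[] { 1.5, -0.2, 0.99 },
    };

    // Query syntax, as in the first version in the text.
    public static IEnumerable<double> QuerySyntax()
    {
        var allEtas = from e in Events
                      from eta in e
                      select eta;
        return from eta in allEtas
               where Math.Abs(eta) < 1.0
               select eta;
    }

    // Method syntax: the library calls the compiler generates from the query above.
    public static IEnumerable<double> MethodSyntax()
    {
        return Events.SelectMany(e => e)
                     .Where(eta => Math.Abs(eta) < 1.0);
    }

    public static void Main()
    {
        // Both forms yield the same elements in the same order.
        Console.WriteLine(QuerySyntax().SequenceEqual(MethodSyntax()));
    }
}
```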

Futures

I am ambivalent about futures. They, or something like them, are absolutely necessary, however, when dealing with large data sets. Their huge plus is that they allow you to run a set of queries at once, rather than one at a time. If a query is expensive (like doing delta-R comparisons between jets and tracks) and several queries make use of the same comparisons, futures can save CPU time. They can also save a lot of data movement: if two queries use a similar set of TTree leaves, then combining them means that much less data has to be moved around.

Queuing up a future query is exactly like queuing up a regular query, except you put “Future” in front of the Count or Plot operator. For comparison, counting the central tracks or making a plot the normal way can be written as:

var count = centralTracks.Count();
var plot = centralTracks.Plot("trackPT", "Track pT", 50, 0.0, 100.0, t => t.pt / 1000.0);

The same queries, written as futures and batched together, look like this:

var fcount = centralTracks.FutureCount();
var fplot = centralTracks.FuturePlot("trackPT", "Track pT", 50, 0.0, 100.0, t => t.pt / 1000.0);

var count = fcount.Value;
var plot = fplot.Value;

The first two lines do nothing but queue up the queries for eventual running. Accessing fcount.Value causes the LINQToTTree infrastructure to run both the count and plot queries at the same time (combining the C++ code for the two queries as much as it can).
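The batching idea can be sketched in a few lines of ordinary C#. Everything below (FutureBatch, Future, FutureSum, Passes) is hypothetical and written only to illustrate the concept. It is not the LINQToTTree implementation, which generates and runs C++ instead of looping in memory.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of the idea behind futures; not the LINQToTTree implementation.
public class FutureBatch<T>
{
    private readonly IEnumerable<T> _source;
    private readonly List<Action<T>> _pending = new List<Action<T>>();

    public int Passes { get; private set; }  // number of loops over the data so far

    public FutureBatch(IEnumerable<T> source) { _source = source; }

    // Queue a count; nothing runs yet.
    public Future<int> FutureCount()
    {
        var f = new Future<int>(Run);
        _pending.Add(item => f.Result += 1);
        return f;
    }

    // Queue a sum of selector(item); nothing runs yet.
    public Future<double> FutureSum(Func<T, double> selector)
    {
        var f = new Future<double>(Run);
        _pending.Add(item => f.Result += selector(item));
        return f;
    }

    // Run every queued query in a single pass over the data.
    private void Run()
    {
        if (_pending.Count == 0) return;
        Passes++;
        foreach (var item in _source)
            foreach (var step in _pending)
                step(item);
        _pending.Clear();
    }
}

public class Future<TResult>
{
    private readonly Action _run;
    internal TResult Result;

    internal Future(Action run) { _run = run; }

    // Asking for the value triggers all queries queued in the same batch.
    public TResult Value
    {
        get { _run(); return Result; }
    }
}

public static class FutureDemo
{
    public static void Main()
    {
        var ptsInGeV = new[] { 12.0, 8.0, 45.0 };
        var batch = new FutureBatch<double>(ptsInGeV);

        var fcount = batch.FutureCount();
        var fsum = batch.FutureSum(pt => pt);

        Console.WriteLine(batch.Passes);   // 0: nothing has run yet
        Console.WriteLine(fcount.Value);   // 3: triggers the single pass
        Console.WriteLine(fsum.Value);     // 65: already computed in that pass
        Console.WriteLine(batch.Passes);   // 1
    }
}
```

In this sketch a query queued after the first Value access would trigger a second pass over the data.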

The problem comes when you start to write a large program that generates many hundreds of plots or numbers. You want to save many of these plots to a file and forget about them. But to save a plot to a ROOT file you need the plot itself, not the future. If you ask for the actual plot, all the queued queries will run, and if that wasn’t the last plot you queue, then you’ll run again for the later ones, totally defeating the purpose.

To deal with this there is a fair amount of infrastructure set up around futures. This is the downside: while it works, it makes the code more difficult to understand, and keeping the code easy to understand is one of the primary design goals of this project. See the reference section to learn more about futures. If you are running on a large file, or have a complex query that takes a significant amount of CPU time, then it is necessary to learn this (or come up with another scheme and let me know!).

Supported LINQ Predicates

LINQ defines a large number of predicates, and it is also possible to define custom ones. Only a subset is supported at the moment in LINQToTTree.

  • Where – the filtering operator. Most complex logical operations are supported.
  • Count – Count the number of objects in the stream. This returns an integer that can then be printed out on the console, etc., for quick diagnostics.
  • Aggregate – Used to make plots and the like. While you can use it directly yourself, it is much simpler to use the helper methods described below.
  • Take/Skip – Allows one to skip or take a certain number of objects in a sequence. Perfect for doing something like “1st jet in the event” or “second jet in the event”.
  • Any/All – Test whether at least one element (Any) or every element (All) of a sequence satisfies a condition.
  • Min/Max – Find the min or max value in a sequence.
  • Sum – Add up all the values in a sequence.
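As a quick reminder of what these operators compute, here is a sketch that runs them on an in-memory array with plain LINQ to Objects. The jet pT values are made up, and the code only shows the standard semantics of each operator; it says nothing about how LINQToTTree translates them.

```csharp
using System;
using System.Linq;

public static class OperatorDemo
{
    // Made-up jet pT values for one event, in GeV, highest first.
    public static readonly double[] JetPts = { 80.0, 55.0, 30.0, 12.0 };

    public static void Main()
    {
        // Where + Count: how many jets are above 25 GeV?
        Console.WriteLine(JetPts.Where(pt => pt > 25.0).Count());       // 3

        // Skip/Take: a one-element sequence holding the second jet.
        Console.WriteLine(string.Join(", ", JetPts.Skip(1).Take(1)));   // 55

        // Any/All: at least one jet, or every jet, passes the test.
        Console.WriteLine(JetPts.Any(pt => pt > 70.0));                 // True
        Console.WriteLine(JetPts.All(pt => pt > 20.0));                 // False

        // Min/Max/Sum over the whole sequence.
        Console.WriteLine(JetPts.Min());                                // 12
        Console.WriteLine(JetPts.Max());                                // 80
        Console.WriteLine(JetPts.Sum());                                // 177

        // Aggregate: fold the sequence into one value, here the same count as above.
        Console.WriteLine(JetPts.Aggregate(0, (n, pt) => pt > 25.0 ? n + 1 : n)); // 3
    }
}
```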

A few non-LINQ, custom, operators are also supported:

  • PairWiseAll – Take each element in a sequence and pair it up with every other element in the sequence and apply a test. If the element passes every test then it is marked as good. Returns a sequence of all good elements. This can be used, for example, to find a list of all isolated jets. The input list would be the jets, and the test would be true if the two jets tested had a Delta-R larger than the isolation criteria. Only jets that were isolated would pass every pair-wise test.
  • UniqueCombinations – Returns a Tuple<T,T> (a pair) made up of all pairs of objects from the list. It does not return the symmetric items nor the identity pairs. If you feed it a list of 3 jets, j1, j2, and j3, it will return (j1, j2), (j1, j3), and (j2, j3).
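The behavior described for these two operators can be sketched in memory as follows. These are hypothetical re-implementations based only on the descriptions above; the real operators are translated to C++ by LINQToTTree and their signatures may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PairSketch
{
    // Keeps each element that passes the test against every other element.
    public static IEnumerable<T> PairWiseAll<T>(IList<T> items, Func<T, T, bool> test)
    {
        for (int i = 0; i < items.Count; i++)
        {
            bool good = true;
            for (int j = 0; j < items.Count && good; j++)
            {
                if (i != j && !test(items[i], items[j])) good = false;
            }
            if (good) yield return items[i];
        }
    }

    // Yields each unordered pair once: no (a, a), and no (b, a) after (a, b).
    public static IEnumerable<Tuple<T, T>> UniqueCombinations<T>(IList<T> items)
    {
        for (int i = 0; i < items.Count; i++)
            for (int j = i + 1; j < items.Count; j++)
                yield return Tuple.Create(items[i], items[j]);
    }

    public static void Main()
    {
        // Prints (j1, j2), (j1, j3), (j2, j3), one pair per line.
        var jets = new[] { "j1", "j2", "j3" };
        foreach (var pair in UniqueCombinations(jets))
            Console.WriteLine("({0}, {1})", pair.Item1, pair.Item2);

        // One-dimensional stand-in for Delta-R: positions must differ by more than 0.4.
        var positions = new[] { 0.0, 0.3, 2.0 };
        var isolated = PairWiseAll(positions, (a, b) => Math.Abs(a - b) > 0.4);
        Console.WriteLine(string.Join(", ", isolated));   // 2
    }
}
```

The two objects at 0.0 and 0.3 fail the test against each other, so only the one at 2.0 counts as isolated.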

While it is possible to extend the library with new algorithms, it isn’t trivial (the extension mechanism is exactly what the core library itself uses).

Unsupported Gotchas

This is an incomplete implementation of the LINQ standard. Besides the operators not listed above, there are some other things to keep in mind for v0.4. This list tends to contain things that I’ve run across and would really like to see implemented because they would make my own analysis work easier. As a result, they are generally high on the to-do list.

  • ROOT objects and custom objects sometimes have properties (e.g. GetEntries -> Entries). Don’t use them inside a LINQ expression. They aren’t translated.
  • The First/Last operators are not supported. This is one of two major missing bits of functionality that I wish were working (it requires some serious re-working to make work correctly).
  • The ordering predicates in LINQ (for example, to sort the jets by pT or by closest distance to another object) are not implemented yet. This is the second major feature that I’ve found lacking that does sometimes get in the way.

Helper Methods

To make life simple, a number of helper methods have been written to further cut down on boilerplate code. These are all in the Helpers object in the LINQToTTreeLib namespace.

  • FindAllFiles – Given a base directory and a pattern, returns all files that match that pattern. The result can be passed directly to the data source creation (see above). Use of this is deprecated; use DataSetFinder instead.
  • DataSetFinder – This class parses a text file that specifies the locations and wildcards for files, and builds the list of files to be analyzed.
  • Plot – create a TH1F or a TH2F plot and fill it all in one go.
  • Save – functional implementation that allows saving objects to files.

Last edited Aug 15, 2011 at 5:26 AM by gwatts, version 7
