This is general documentation for DataSetFinder. You can find reference/API documentation here.

I wrote DataSetFinder to get around the problem of having to hard code lists of files (or, eventually, PROOF datasets, or other sources I might think of); I found myself writing this code several times for the various analysis programs I was writing early on. Among the problems I had to address were wildcards at the directory, subdirectory, and file level (all at once), the fact that my data files were located in different places on different machines, and that different collections of datasets could sometimes be used as one.

This class attempts to address most of this. Ultimate flexibility would come from embedding a programming language (like Python) and using it to build these lists. However, I was looking for something a little less flexible and a bit quicker to use: you don’t have to know a programming language to use this. As usual, features were added on an as-needed basis. If there are things missing, feel free to contact me, or add them yourself!

There are two things that you must address. First is the file that contains the dataset specification. The second is the API to extract the appropriate information from the dataset file.

The Dataset File

The dataset file must be called dataset-list.txt; it isn’t currently possible for it to be called anything else. Further, it must be in the working directory that your code is running in. I find it most convenient to add this file to the project that contains my data model files. Under Properties I set it to Copy Always so that an up-to-date version of the file is always in the working directory when I hit build or run.

In general the syntax is very simple. There is no line continuation: each definition must fit entirely on a single line. If a path or filename contains a space, make sure that the complete filename is enclosed in quotes!

A very general example follows:

// Sevenup is the big home machine

machine SEVENUP
{
    macro dsloc = "\\lint-shark\HEP Data\JetBackToBack-v6"

    JetStream = $dsloc\*data11_7TeV*\*.root
}

Comments

Text between a double slash (“//”) and the end of the line is ignored as a comment. Comments can appear anywhere on a line, but note that during parsing each one is replaced in the input stream by an end-of-line, and end-of-lines are significant in this format.
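
To illustrate the rule, here is a minimal Python sketch of that comment handling. It is not DataSetFinder's parser: strip_comments is a hypothetical helper, and it assumes that “//” never appears inside a quoted string, a case the description above does not cover.

```python
def strip_comments(text: str) -> list[str]:
    """Return the lines of text with '//' comments removed.

    Assumption: '//' never occurs inside a quoted string.
    """
    lines = []
    for line in text.splitlines():
        start = line.find("//")
        if start >= 0:
            # Drop the comment; the line break itself is kept because
            # every input line still yields exactly one output line.
            line = line[:start]
        lines.append(line.rstrip())
    return lines


cleaned = strip_comments("// Sevenup is the big home machine\nmachine SEVENUP // big\n{")
```

A comment-only line comes back as an empty line, so the line structure of the file is preserved.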

Machine

The machine definition indicates that the following block of dataset definitions applies only to that machine. As can be seen from the example above, the definition is quite simple: the keyword machine, followed by the machine name and then a set of braces enclosing the definitions.

The machine name must match exactly. By default the machine name is fetched from System.Environment.MachineName. However, it is trivial to override it by setting the global value DatasetFinder.MachineName.

You can also leave out the machine clause (and its braces). In that case the dataset locations are the same no matter the running environment.

Dataset

The dataset line is pretty simple, in its most basic form: <dataset> = <filespec>. If the filespec contains any spaces, remember to surround it by double-quotes! The dataset name must not contain any spaces!

Wildcards can be placed almost anywhere. The algorithm is pretty simple: at each place where there is a wildcard, matching is performed, and then for each match the rest of the filespec is explored. So in the example above, with "*data11_7TeV*\*.root", the wildcard matching will first find all directories containing data11_7TeV and then, in each of those directories, all files ending in .root.
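
The level-by-level matching described above can be sketched in a few lines of Python. This is an illustration only, not the class's actual implementation: expand_filespec is a hypothetical helper, the filespec is assumed to be already split into one pattern per directory level, and the standard fnmatch module stands in for whatever wildcard matcher DataSetFinder uses.

```python
import fnmatch
import os
import tempfile


def expand_filespec(base: str, parts: list[str]) -> list[str]:
    """Resolve one wildcard pattern per directory level, starting at base."""
    if not parts:
        return [base]  # every level matched: base is a result
    if not os.path.isdir(base):
        return []  # patterns remain but there is nothing to descend into
    pattern, rest = parts[0], parts[1:]
    matches = []
    for name in sorted(os.listdir(base)):
        if fnmatch.fnmatch(name, pattern):
            # For each match, explore the rest of the filespec below it.
            matches.extend(expand_filespec(os.path.join(base, name), rest))
    return matches


# Small demonstration on a throwaway directory tree (names are made up).
with tempfile.TemporaryDirectory() as root:
    for folder, filename in [
        ("user.data11_7TeV.A", "a.root"),
        ("user.data11_7TeV.A", "log.txt"),
        ("user.mc11.B", "b.root"),
    ]:
        os.makedirs(os.path.join(root, folder), exist_ok=True)
        open(os.path.join(root, folder, filename), "w").close()
    found = [
        os.path.relpath(path, root)
        for path in expand_filespec(root, ["*data11_7TeV*", "*.root"])
    ]
```

Only a.root in the data11_7TeV directory ends up in found: log.txt fails the file-level pattern and the mc11 directory fails the directory-level one.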

It is possible to "or" two search strings: <dataset> = <filespec1> | <filespec2>. First filespec1 is searched. If no files are found, then filespec2 is searched. This feature was added because I would do testing on my portable on a single file, and then plug in my 1 TB portable drive and want to run from that.
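
The fallback behaviour amounts to "use the first alternative that finds any files". A minimal Python sketch of that idea follows; first_non_empty, the paths, and the fake_disk lookup table are all hypothetical and only stand in for a real wildcard search.

```python
from typing import Callable


def first_non_empty(
    filespecs: list[str], resolve: Callable[[str], list[str]]
) -> list[str]:
    """Return the files of the first filespec that resolves to anything."""
    for spec in filespecs:
        files = resolve(spec)
        if files:
            return files
    return []


# Pretend only the portable drive (E:) is populated on this machine.
fake_disk = {r"E:\data\*.root": [r"E:\data\run1.root", r"E:\data\run2.root"]}

right_hand_side = r"C:\test\single.root | E:\data\*.root"
alternatives = [part.strip() for part in right_hand_side.split("|")]
files = first_non_empty(alternatives, lambda spec: fake_disk.get(spec, []))
```

Here the first filespec finds nothing, so files holds the two entries from the second one. Had the first filespec matched, the second would never have been searched.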

Use the DatasetFinder.FindROOTFilesForDS method to return the wildcard-resolved list of files. This can be passed directly as the first argument of your ROOTLINQ.QueryableCollectionTree.Create method. The HasDS method is also useful for checking ahead of time whether a dataset is defined. I use it so that I can run a single dataset for testing and, when I move the code to the big machine, run on multiple real datasets: the full datasets aren’t defined on the test machine, and HasDS detects that.

There are two things that you might want to do that the current version doesn’t support. First, you can’t search directories of differing or unknown depth. That is, you have to have a \ for each directory level. Second, you can’t specify several different paths for a single search string. Not hard to add, but…

Macros

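Macros are a very simple text replacement facility. A definition starts with the keyword macro, followed by the macro name, an equals sign, and the replacement text. A macro is then referenced by prefixing its name with $, as with $dsloc in the example above. If the replacement text contains spaces, make sure to surround it in double quotes. All macro definitions must precede dataset definitions.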
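
As a rough picture of what the text replacement does, here is a short Python sketch. It is not the library's code: expand_macros is a hypothetical helper, and leaving an undefined $name untouched is an assumption, since the behaviour for undefined macros is not described here.

```python
import re


def expand_macros(text: str, macros: dict[str, str]) -> str:
    """Replace each $name in text with the matching macro text.

    Assumption: a $name with no definition is left unchanged.
    """
    # A function is used as the replacement so that backslashes in the
    # macro text (Windows paths) are inserted literally.
    return re.sub(
        r"\$(\w+)",
        lambda match: macros.get(match.group(1), match.group(0)),
        text,
    )


macros = {"dsloc": r"\\lint-shark\HEP Data\JetBackToBack-v6"}
expanded = expand_macros(r"$dsloc\*data11_7TeV*\*.root", macros)
```

With the definition from the example file, expanded is the full path \\lint-shark\HEP Data\JetBackToBack-v6\*data11_7TeV*\*.root, ready for wildcard matching.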

Tags

Tags are a convenient way to organize more than one dataset. For example, if QCD multi-jet samples are split by hard scatter energy into various samples (e.g. J1, J2, J3, J4, etc.) and you would like to have your code process all of them despite wanting to specify them individually, tags are the way to do it. You can fetch the list of datasets that are marked with a particular set of tags, or get the set of tags associated with a particular dataset.

The tags are associated with datasets by a parenthesis-enclosed, comma-separated list of strings just after the dataset name:

// Sevenup is the big home machine
machine SEVENUP
{
    macro dsloc = "\\lint-shark\HEP Data\JetBackToBack-v6"

    JetStream (Data, Jets) = $dsloc\*data11_7TeV*\*.root
}

The JetStream dataset has two tags associated with it – Data and Jets.
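
To make the line layout concrete, the following Python sketch splits such a line into its name, tags, and filespec. parse_dataset_line is a hypothetical helper written for this illustration, not part of DataSetFinder, and it assumes the dataset name and tags contain no "=" or "(" characters.

```python
def parse_dataset_line(line: str) -> tuple[str, list[str], str]:
    """Split 'Name (Tag1, Tag2) = filespec' into (name, tags, filespec)."""
    # Everything before the first '=' is the name plus the optional tag list.
    left, _, filespec = line.partition("=")
    name, _, tag_text = left.partition("(")
    tag_text = tag_text.strip().rstrip(")")
    tags = [tag.strip() for tag in tag_text.split(",") if tag.strip()]
    return name.strip(), tags, filespec.strip()


name, tags, filespec = parse_dataset_line(
    r"JetStream (Data, Jets) = $dsloc\*data11_7TeV*\*.root"
)
```

For the example line, name is JetStream, tags is the list Data and Jets, and filespec is the unexpanded search string. A line without parentheses simply yields an empty tag list.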

To get a list of tags associated with the dataset JetStream, use the method DataSetFinder.DSTags. To get all datasets that are tagged by a collection of tags, use the method DataSetFinder.DatasetNamesForTag.

Finally, if you want to know all the tags defined in the dataset file, you can use the property DataSetFinder.AllTags.

Last edited May 17, 2012 at 7:38 AM by gwatts, version 4
