So the LHC has been humming along: tiny 'clouds' composed of hundreds of trillions of protons have been whizzing through one another as they circulate the LHC ring in opposite directions, crossing about 40 million times per second. This has been happening more or less right in the centre of ATLAS, spewing out a ton of matter and energy in the process, and – if we're lucky! – some of that matter will briefly be in the form of some of the more exotic building blocks of our universe.
The detector itself has been working beautifully, soaking up as much information about the interactions as possible. And what's more, you've learned how all of the different types of particles interact with the various parts of the detector, so you know how to reconstruct electrons, photons, and so on, and crucially you know how to put those pieces together to reconstruct the heavier objects you're interested in.
So now, after having waited patiently for a few years you've got yourself a big heap of data sitting waiting for you to analyze: you're ready to make a physics measurement!
How does one go about doing that exactly?
Let’s start with the data.
Before we do that though, let's just summarize the two main categories of measurements we could do:
While precision measurements give us a better understanding of some particle or quantity we already know about, performing a search for new physics allows us to look for something new that our current model of how the universe works wouldn't be able to explain. These are then subdivided into several categories, but broadly speaking, those are the main ones.
It's also important to point out that there's of course some degree of overlap between those categories, so the depiction above might be a bit misleading. Deviations from our theoretical predictions could manifest themselves in very subtle ways, and the first hints of something new and exciting could come from observations of such differences between theory and experiment. Indeed one could argue that's the reason we do precision measurements in the first place.
What Exactly Do These Data Look Like?
By data we refer either to the information collected by the detector, or to that produced by our simulated detector – we ultimately have to make comparisons between theory and experiment so we know if we've seen something unexpected. The data can come in several forms, but ultimately they're made accessible to the physicists performing the various analyses.
Data storage, and some of the more technical aspects of how one goes about it, is a topic worthy of an entire section unto itself, but I'm going to really skim over it here since it's not crucial to understanding what we do.
Here’s what you need to know:
For the full set of ATLAS data (as of 2014) you’re looking at something like 140 PB (that's 140 million GB). If you’ve got a few million old PCs lying around in your garage, you could theoretically download the whole lot and have the full collection yourself.
You can probably guess that that's not a clever way of going about things. It's obviously an exaggeration anyway, but even the storage and transfer of small fractions of that full dataset is a huge waste of resources – unfathomably large numbers of electrons and photons flying back and forth, encoded with bits of information for thousands of users worldwide, when a given user is only really interested in a minuscule fraction of that dataset.
There’s a more practical approach: we call it the Worldwide LHC Computing Grid, or just the Grid for short. The basic idea is that you, a user or physicist, write your fancy (or not so fancy) code to do all the calculations you want to do – produce some final plots or, more likely, produce a much slimmer, custom-tailored dataset that’s a tiny fraction of the full thing – and you send that code, from your local machine or wherever you’re working, off to the Grid. There it will likely get split up into a number of sub-jobs, as we call them – a divide-and-conquer approach using huge clusters of individual CPUs.
So the jobs themselves could end up running in Vancouver, or maybe Stanford, New York, Hamburg, or Daejeon just south of Seoul, to name a few (there are roughly 165 sites worldwide). The local machines at whichever site receives your jobs will compile the code, and then execute it exactly as per your instructions. You, the user, can monitor the progress and, ultimately, retrieve the output. This output you can then store at your local site, and you’re ready to run whatever final steps have to be run to make your finished product: make your final plots, get your final numbers, and so on.
So the key point is that for the most part the data themselves remain stationary – it's the smaller bits of code which get transferred back and forth.
What does this code look like? Ok, so this is just to give a flavour, and there are several different programming languages that are used, but this is one example:
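Here's a minimal sketch of the sort of thing such a snippet might look like – the struct and function names here are made up purely for illustration, not real ATLAS code:

```cpp
#include <cmath>
#include <vector>

// Illustrative only -- a toy stand-in for real analysis code.
// A simulated particle carrying four-momentum components (in GeV).
struct Particle {
    double px, py, pz, E;
};

// Transverse momentum (pt): the momentum perpendicular to the beam axis.
double pt(const Particle& p) {
    return std::sqrt(p.px * p.px + p.py * p.py);
}

// Keep only the simulated W bosons above some pt threshold,
// to be used at a later stage in the code.
std::vector<Particle> selectWBosons(const std::vector<Particle>& wBosons,
                                    double ptCut) {
    std::vector<Particle> selected;
    for (const auto& w : wBosons) {
        if (pt(w) > ptCut) selected.push_back(w);
    }
    return selected;
}
```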
It’s just a little snippet from some analysis code, but what you see is that it’s performing operations on some simulated particles (in this case W bosons) to be used at some later stage in the code. That bit of code is written in a language called C++. You can maybe recognize things like energy (E) and transverse momentum (pt) in there. It’s with analysis code such as this that you can manipulate the various objects and fill the distributions that will allow you to make whatever measurement you're trying to make.
[Note to programmers: go easy on me, this isn't meant to dazzle you...]
A Physicist’s Toolkit
One thing you might get your code to do is to count something. That might be, say, the total number of candidate Z bosons you have left in your sample after you've made some event selection cuts (which we'll talk about in Section 9). You're interested in counting them since some theories (beyond the Standard Model) might predict more Z bosons than you would otherwise expect if the Standard Model were all there was to nature.
One way we might go about doing this counting is by making use of something called a histogram. It’s not the only way to go about things by any means, but in certain situations it’s a good way to do it. Actually a histogram is probably one of the most common tools a particle physicist will use.
In a histogram we split up our events into so-called bins, where the bins themselves can be in terms of any number of different quantities. Here let's say for example we're interested in how many electrons were reconstructed in each event – assume that this tells us something interesting if we were to count it. When it's discrete numbers like this of course, it naturally makes sense to fill a histogram; we don't have half electrons, so we really can count: for this particular event did we have 0? 1? 2? What about for the next event?
For other quantities, say, energy for example, you can still split things up into bins, but they'll be in terms of intervals or ranges of values: did the energy come out somewhere between 0 and 20 GeV? Above 20 but under 40? The same as if you were trying to make a histogram of the heights of you and 50 of your friends: you might let each bin cover a 5 cm range. You get the idea.
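To make the binning idea concrete, here's a toy fixed-width histogram class in C++ – just a sketch of the bookkeeping; in practice we'd use a library such as ROOT, whose TH1 classes do all of this and much more:

```cpp
#include <vector>

// A toy fixed-width histogram, just to illustrate binning.
class Histogram {
public:
    Histogram(double lo, double hi, int nBins)
        : lo_(lo), width_((hi - lo) / nBins), counts_(nBins, 0) {}

    // Find which interval the value falls in and bump that bin's count.
    void fill(double value) {
        int bin = static_cast<int>((value - lo_) / width_);
        if (value >= lo_ && bin < static_cast<int>(counts_.size()))
            counts_[bin]++;  // out-of-range values are simply dropped here
    }

    int count(int bin) const { return counts_[bin]; }

private:
    double lo_, width_;
    std::vector<int> counts_;
};
```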
Again, for our example we'll simply be counting the numbers of reconstructed electrons in each event. By the way, by reconstructed electrons here, I refer to objects we've built up from various pieces – in this case measured tracks from the inner part of the detector and energy deposits in the calorimeters. The reconstructed objects pass some criteria we've laid out such that we call them electron candidates.
So let's start counting electrons. We'll do it event by event. Our histograms might then start to take the form of something like this:
Look at that, in one event we reconstructed six objects which passed our definition of what an electron is. Hmm, could be real, could be a fluke.
So we go on like that, counting events. Could be in the millions!
Luckily of course we have computers which can do the counting for us while we go get a coffee (though usually that's when we pick up where we left off on something we're working on in parallel). So then once it's all done looping through the events, we take a sip of coffee, and have a look at the final distribution.
If we had enough statistics, that distribution (another name for a histogram) might then have taken on some meaningful form which gives us a bit of insight which we never would have had just by looking at spreadsheets of thousands of numbers on their own.
If we didn't have enough statistics, the underlying shape might be there, but there would be huge statistical fluctuations in the bin contents that would blur or hide the ultimate shape from us.
Here, look at this: let's take a normal distribution (a 'bell-curve') produced by drawing random numbers and filling a histogram as described. Here by 'random' I don't mean numbers rhymed off the top of your head that sound random. We use something called a random number generator (you can google it if you want) which makes sure the numbers truly are random in the mathematical sense of the word, but here according to some base distribution. We'll fix the binning from the get-go. Then you can start to see how things change as we increase the number (N) of 'events':
Note that the actual range of the vertical axis has to change for each of those cases (since the total number of events in each is different). Actually you might have noticed that I'm leaving the actual numbers off the scales entirely (something they always teach you not to do in math or science class) but I'm doing it to keep things simple given that we're focusing on the shape.
It's important to point out that the 1st distribution above, though it might look strange, isn't wrong even if it doesn't take on the form you expect. One could be tempted to think that the whole thing is wrong until one collects enough points that the shape takes form, but that's not correct – the underlying distribution is always there, and all of the above distributions are acceptable or consistent with a normal bell curve within the expected statistical uncertainties.
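Incidentally, if you want to play this game yourself, the C++ standard library ships with both pieces: a pseudo-random engine and a normal 'base distribution'. Here's a sketch of the exercise (the function name is mine, just for illustration):

```cpp
#include <random>
#include <vector>

// Draw N values from a normal (bell-curve) distribution and bin them
// into nBins equal-width bins spanning [lo, hi); values outside the
// range are simply dropped. The seed is fixed so the 'random' draws
// are reproducible from run to run.
std::vector<int> fillGaussianHistogram(int N, double mean, double sigma,
                                       double lo, double hi, int nBins) {
    std::mt19937 engine(42);
    std::normal_distribution<double> gaus(mean, sigma);
    std::vector<int> counts(nBins, 0);
    const double width = (hi - lo) / nBins;
    for (int i = 0; i < N; ++i) {
        double x = gaus(engine);
        if (x < lo || x >= hi) continue;
        counts[static_cast<int>((x - lo) / width)]++;
    }
    return counts;
}
```

Run it with N = 10, 100, 1000 and so on, and you can watch the bell shape emerge from the noise.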
The above example highlights the importance of collecting as much data as possible: broadly speaking, the more data you collect, the smoother your distributions will be and the easier it is to spot by eye something which deviates from the prediction.
Comparing Data and Simulation
At the end of the day we can get distributions such as those described above in two ways that we care about in experimental particle physics: once using simulated data and once using actual measured data. The simulated data still run through much of the same machinery we use for the measured data, but the simulation part has to happen further upstream.
If we compare data and simulation and the distributions differ (ever so slightly in some cases!) it might be the hint that we're seeing something new and exciting! But even in such cases we still need to make distributions of other quantities which we know and understand – quantities which we've previously seen do agree between data and simulation. That way we know we've done things right (and can believe it when we do see something different in other distributions).
So we want to show these two types of histograms – actual data and simulated data – overlaid on top of one another. One common way we do that in the physics world is by using coloured bars for the simulation, and black markers for the measured data.
Kind of like this:
The black bars or lines sticking out of either end of the circular markers correspond to statistical uncertainties in the measured values. The uncertainties in this case don't mean we did something wrong in terms of counting events in the data. They're an inherent part of any counting experiment – there are simply statistical fluctuations that inherently exist in nature; they're there whether we like it or not. It runs along the same lines as the N = 10, 100, 1000 game we played earlier.
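For a simple counting experiment, the length of those bars is usually taken to be the square root of the number of counted events (the Poisson approximation). A tiny sketch, which also shows why more data helps – the relative uncertainty shrinks as 1/√N:

```cpp
#include <cmath>

// Poisson approximation: a bin containing n counted events is drawn
// with a statistical error bar of sqrt(n).
double statUncertainty(int n) {
    return std::sqrt(static_cast<double>(n));
}

// Relative uncertainty, sqrt(n)/n = 1/sqrt(n): the fractional 'wobble'
// on a bin shrinks as we collect more events.
double relativeUncertainty(int n) {
    return 1.0 / std::sqrt(static_cast<double>(n));
}
```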
It will almost never be the case that all of the black markers will be situated perfectly in the middle of the lines at the top of each of the coloured bars. If the simulation is accurate though, they should be close. How close? The black bars should at least overlap the tops of the columns in a certain high fraction of the cases. If not, there could be a real problem in something we've done in the analysis. Or there could be something new.
Have a look at the made-up distributions below. You'll note I've changed the axis labels to show something different, but I'm omitting actual numbers as I did in the last one just to highlight the concept.
Looks more or less ok, right? Pretty good agreement between data and simulation. Except...maybe that region in the middle, right? Is that just a statistical fluke that the data are all consistently higher than the simulation? We need to quantify the probability for that to occur if it were just a fluctuation.
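Quantifying it properly means computing a p-value, but the back-of-the-envelope version is to ask how many statistical standard deviations the observed count sits away from the expected one:

```cpp
#include <cmath>

// A crude significance estimate: the excess (or deficit) of observed
// events over the expectation, measured in units of the expected
// statistical fluctuation sqrt(nExpected). Real analyses use proper
// p-values, but this is the standard first look.
double naiveSignificance(double nObserved, double nExpected) {
    return (nObserved - nExpected) / std::sqrt(nExpected);
}
```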
Taking things a bit further, actually we often have several types of processes that we're looking at. If we're counting electrons for example, there might be several different types of processes at the LHC that produce electrons. We often split them up in our histograms and stack them on top of one another like this:
Making histograms of several different quantities can show that we understand things even better, and it helps to pinpoint potential problems. The different colours correspond to different simulated datasets altogether. We've run our analysis code on each of them separately, but we put them all on the same plot, together with the measured data, to get the final picture.
Now let's assume we've shown that one region in the middle cannot just be a statistical fluke. Something weird and unexpected is happening there! We'll talk more about that when we talk about discoveries in Section 10.
But for now let's just say that some theorist has a proposal for a brand new theory or extension of the Standard Model which we can also run through the full simulation and see where it comes out. We treat its corresponding simulated dataset (incorporating the new theory) just like we would all of the others. And in this case we assume it doesn't interfere with the Standard Model stuff we already have (which isn't necessarily the case).
Maybe it'll come out like either of the two scenarios below, in which case the newly proposed theory just doesn't cut it:
In the left-hand distribution, you can see that the newly proposed theory predicts extra electrons in the wrong place, so the original issue remains and, what's more, it now predicts an excess of electrons where there isn't one.
In the right-hand distribution, we consider a different scenario: here the data and prediction agreed well enough without needing to add anything new. You'll notice that I've modified the data markers in the central region (where before we had said there was an excess in data) such that they're now sitting snugly around the total prediction from simulation, before adding in the prediction from the new theory. So things look fine with the Standard Model as is! If there had been an excess we should have seen it, but we simply don't. Again, we have to do this in a quantitative way, but you get the basic idea.
Of course just maybe adding in the prediction from the new theory will make things end up looking just right!
Of course the theorists aren't allowed to see the data beforehand – has to be the case, right? But they can certainly make a prediction. So part of the job of an experimental physicist is to rule out certain theories. That helps the theorists not waste their time. And theoretical physicists help experimental physicists by telling them what might be the best ways to look for hints of new physics.
Make no mistake: having a theory agree with the data in all respects is not at all a trivial task. It's not just a matter of having things work out just right for one histogram as I've shown above. It has to be self-consistent across the board! No new fundamental theory nor extension of the Standard Model has managed to do that since the Standard Model's inception. But that doesn't mean we should stop looking. After all, great leaps ahead in theoretical physics have certainly happened before (take relativity and the early quantum theory, for example).
Fitting Your Data to a Model
Somewhat related to histograms are fits which can be performed to distributions. Fits generally involve some sort of optimization – minimizing or maximizing some quantity – which returns a value of some parameter you're trying to measure, or some probability that something is or isn't the case.
So you might go back to that distribution we filled randomly with 1000 'events' earlier, and, assuming you knew the underlying shape should be a normal distribution, you could fit the functional form of a normal distribution to your data if you wanted to measure the average or peak position. The fit would return the value you were interested in knowing, along with a statistical uncertainty on that value.
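In a real analysis the fit would be done by minimizing a chi-square or maximizing a likelihood (ROOT's fitting tools, for example, do this for you), but for a normal distribution the maximum-likelihood estimate of the peak position reduces to something simple: the sample mean, with an uncertainty of σ/√N. A sketch of just that special case:

```cpp
#include <cmath>
#include <vector>

struct FitResult {
    double peak;         // estimated peak (mean) position
    double uncertainty;  // statistical uncertainty on the peak
};

// For normally distributed data, the maximum-likelihood estimate of
// the peak is just the sample mean, and its statistical uncertainty
// is sigma / sqrt(N), with sigma estimated from the sample spread.
FitResult fitGaussianPeak(const std::vector<double>& data) {
    const double n = static_cast<double>(data.size());

    double sum = 0.0;
    for (double x : data) sum += x;
    const double mean = sum / n;

    double ss = 0.0;
    for (double x : data) ss += (x - mean) * (x - mean);
    const double sigma = std::sqrt(ss / (n - 1.0));  // sample std. dev.

    return {mean, sigma / std::sqrt(n)};
}
```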
Fits such as these allow us to measure the masses of particles, to determine the resolution of the various components of our detector, and to set upper and lower limits on quantities associated with new physics scenarios, just to name a few things.
Just for fun, here's a more realistic set of distributions, this time taken from my PhD thesis. You'll see that the style is similar to what we've been talking about above, but things are a little bit more polished. This plot wasn't used to make a measurement, and no fit was performed. It was more of a cross-check. But similarly to how we started out by filling a distribution of the numbers of electrons in each event, this shows the number of reconstructed jets (since it's more relevant to what we were looking for).
A few things should look familiar to you: there's the signal (shown in blue) which is stacked on top of the background (we have only one and it's shown in grey). Then you'll see the familiar black markers showing the measured data which you're used to. And there are actual numbers on this plot, so we're not being sloppy like I was earlier in this section.
Then a few extra embellishments:
There are grey shaded regions which show uncertainty bands (the black bars we talked about are still there, but they correspond to statistical uncertainties on the measured data only). These shaded regions take a few other sources of uncertainty into account. So there are now uncertainties for both the measured data points and the stacked prediction.
You'll also notice the ratio plot at the bottom. That's useful for telling us overall how many data we measured (the number on top) divided by how many we predict for that bin from simulation and otherwise (the number on the bottom). If you see 1's across the board then things are more or less as expected!
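The ratio panel itself is nothing fancy – just a bin-by-bin division of the two histograms. A sketch (with the arbitrary choice here of returning 0 for bins with no predicted events):

```cpp
#include <cstddef>
#include <vector>

// Bin-by-bin ratio of measured data to the total prediction.
// Values near 1.0 across the board mean the simulation describes
// the data well. Bins with zero predicted events get a ratio of 0
// here -- a choice made for this sketch; real plots treat them
// more carefully.
std::vector<double> dataOverPrediction(const std::vector<double>& data,
                                       const std::vector<double>& pred) {
    std::vector<double> ratio(data.size(), 0.0);
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (pred[i] > 0.0) ratio[i] = data[i] / pred[i];
    }
    return ratio;
}
```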
The plot shown doesn't have the most interesting shape, but it all goes together to help make a measurement.
One thing that's neat to think about with this plot is that prior to 1995 we hadn't yet proved the existence of the top quark! If we had somehow been able to reproduce this set of distributions without any such knowledge of the top quark, we'd have had our black data markers sitting way above the prediction (which would be the grey bars only). That would have told us that something significant was missing. That something was obviously (we know now) the contribution from pairs of top quarks! Back in 1995 we were limited by statistics, meaning we had just enough statistics to see concrete evidence that the top quark does exist (as theorists had predicted some 20 years earlier), but nowadays we produce a vast number of top quarks.
We've gone from the discovery of something new to something we now use as a standard candle. "Yesterday's signal is today's background", so the saying goes. Now we can use top quarks to search for even more exotic forms of matter!
That just about wraps it up.
Next up we talk about finding the proverbial needle in the haystack: how to suppress the contributions from backgrounds (the haystack) as much as possible in order to find the signal we're truly interested in (the needle).