GSoC 2024 | Learning Quantum Representations of Classical High-Energy Physics Data with Contrastive Learning
Google Summer of Code @ ML4Sci
This year Google celebrates the 20th anniversary of Google Summer of Code (GSoC), with 1,220 contributors writing code for 195 open-source mentoring organizations. As a GSoC 2024 contributor, I worked in Machine Learning for Science (ML4Sci), an open-source organization that brings together modern machine learning techniques and applies them to cutting-edge problems in STEM. Over the summer, I worked on Quantum Machine Learning applied to High Energy Physics data (QMLHEP), contrastively training models to output embeddings that can be used for downstream tasks like classification.
Here is the code repository: Quantum_SSL_for_HEP_Sanya_Nanda
Project
Learning quantum representations of classical high-energy physics data with contrastive learning
Background
LHC: Large Hadron Collider
At the LHC, scientists are looking into the unknown and probing the most fundamental questions about our Universe, like: "What is the Universe really made of? What forces act within it?". The Large Hadron Collider (LHC) is the world's largest and most powerful particle accelerator, located at CERN in Geneva, Switzerland. It features a 27-kilometer ring of superconducting magnets, designed to accelerate particles to near the speed of light and collide them to explore fundamental physics. The collider has been instrumental in significant discoveries, including the Higgs boson. In total, the LHC hosts seven separate experiments, a few of which can be seen in Figure 1.
CMS: Compact Muon Solenoid
CMS is one of the detectors located at CERN's LHC, designed to investigate a broad range of physics. The CMS detector is massive and has a cylindrical, onion-like structure with multiple layers of detectors. These layers allow CMS to capture detailed photographs of the particle collisions occurring at the LHC. On top of that, CMS measures the properties of well-known particles with unprecedented precision and is on the lookout for completely new and unknown phenomena.
What is High Energy Physics at CMS?
Bending Particles
A powerful solenoid magnet bends the trajectories of charged particles as they fly outwards from the collision point. This helps to identify the charge of the particle and measure its momentum.
Identifying Tracks
The path of the bent charged particles is calculated with very high precision using a silicon tracker consisting of many electronic sensors arranged in concentric layers. When a charged particle flies through the Tracker layer, it interacts electromagnetically with the silicon and produces a hit. These hits are then joined together to identify the track of the traversing particle. The Tracker layer can be seen in Figure 2.
Measuring Energy
The energies of the various particles produced in each collision are crucial to understanding what occurred at the collision point. This information is collected from the two calorimeters in CMS, as marked in Figure 2:
- Electromagnetic Calorimeter (ECAL): the inner of the two layers; it measures the energy of electrons and photons by stopping them completely.
- Hadron Calorimeter (HCAL): the outer layer; it stops hadrons, which are composite particles made up of quarks and gluons that fly through the ECAL.
Data: Quark Gluon Dataset from CMS
The CERN CMS Open Data Portal makes simulated data from experiments available; this was used to derive the Quark-Gluon dataset by S. Gleyzer et al. [1]. The goal of this project is to discriminate between quark-initiated and gluon-initiated jets in this dataset. The dataset consists of 933,206 three-channel images of shape 125x125, with an equal number of quark and gluon jets. The three channels correspond to measurements from the components of the CMS detector discussed above: Track, ECAL and HCAL. The mean over all images of this dataset is depicted in Figures 3-4.
Quarks: Fundamental particles that make up protons and neutrons.
Gluons: Force carrier particles that mediate the strong force between quarks.
More details about this dataset, along with quark-gluon properties and other datasets used in this project, can be found in my Mid Term GSoC Blog.
Data Preprocessing
Data preprocessing is of high significance to ensure good-quality data for model training; in the machine learning cycle, it is one of the most important steps for producing a well-performing model.
For the computer vision models used in the contrastive learning framework, the 3 channels of the Quark-Gluon dataset were first analysed from a physics perspective, as explained above, and then the images were preprocessed as shown in this code. The preprocessing techniques used included color jittering, Gaussian blur and z-scale normalisation. A new channel was introduced by superimposing the preprocessed channels 1-3. Figure 5 is a sample image with the 4 channels and Figure 6 is the overall mean; the 4th channel has a wider mean than the 3rd channel due to the superimposition.
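As a rough sketch of this pipeline (the exact implementation is in the linked code), assuming PyTorch and torchvision; the jitter and blur parameters and the superimposition helper below are illustrative rather than the project's exact values:

```python
import torch
from torchvision import transforms

# Illustrative augmentations: parameter values are placeholders,
# not the exact settings used in the project.
augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.GaussianBlur(kernel_size=3, sigma=(0.1, 1.0)),
])

def z_scale(img: torch.Tensor) -> torch.Tensor:
    """Z-score normalise each channel independently."""
    mean = img.mean(dim=(1, 2), keepdim=True)
    std = img.std(dim=(1, 2), keepdim=True)
    return (img - mean) / (std + 1e-8)

def add_superimposed_channel(img: torch.Tensor) -> torch.Tensor:
    """Append a 4th channel formed by superimposing Track, ECAL and HCAL."""
    combined = img.sum(dim=0, keepdim=True)        # (1, 125, 125)
    return torch.cat([img, combined], dim=0)       # (4, 125, 125)

x = torch.rand(3, 125, 125)                        # one Track/ECAL/HCAL image
x = add_superimposed_channel(z_scale(augment(x)))  # -> (4, 125, 125)
```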
Next, pairs/views were created from the preprocessed data to pass as input to the contrastive learning framework. Figure 7 depicts one such sample.
The logic behind creating views or pairs for contrastive learning centres on positive views. A positive view is created by taking an image and pairing it with an augmented version of itself; this is done for both quark (label 0) and gluon (label 1) samples. A pair built from a sample and its own augmentation is called similar and assigned the new pair label 1, while a pair built from two different samples is called dissimilar and labelled 0. In other words, views from the same sample are positive and views from different samples are negative. While training the model, every pair other than a sample's own views is considered negative. The notion of positive and negative pairs exists only in the loss function; the model itself has no concept of views. It is the loss function that nudges the model towards clustering similar samples together. More on this is detailed in the contrastive learning section.
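A minimal sketch of this pairing logic, assuming an `augment` transform like the one above; `make_pairs` is a hypothetical helper, not the project's exact function:

```python
import random

def make_pairs(images, augment, n_pairs=1000):
    """Build (view1, view2, pair_label) triplets for contrastive training.

    pair_label = 1: both views come from the same sample (positive).
    pair_label = 0: the views come from two different samples (negative).
    """
    pairs = []
    n = len(images)
    for _ in range(n_pairs):
        i = random.randrange(n)
        if random.random() < 0.5:                  # positive pair
            pairs.append((images[i], augment(images[i]), 1))
        else:                                      # negative pair
            j = random.choice([k for k in range(n) if k != i])
            pairs.append((images[i], augment(images[j]), 0))
    return pairs
```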
Similarly, for the graph-based models the data was preprocessed and views were created as explained above. Figure 8 is a sample of views used for graph-based contrastive learning. The weights used in the graph are physics-informed. There were 12,500 graphs, or jets, with 8 features per node depicting physics-based attributes, as follows (a graph-construction sketch follows the list):
- p_T: Transverse Momentum is a measure of the momentum of a particle perpendicular to the beamline or collision axis.
- y: Rapidity is a measure of how the particle's velocity compares to the speed of light in the direction of motion along the beamline.
- phi: Azimuthal Angle is the angle in the transverse plane, ranging from 0 to 2pi. It describes the direction of a particle's momentum perpendicular to the beamline.
- m: Rest mass of the particle.
- E: Total energy of the particle, i.e. its rest energy plus kinetic energy.
- px: The momentum component along the x-axis.
- py: The momentum component along the y-axis.
- pz: The momentum component along the z-axis.
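As a sketch of how one jet could be turned into such a graph, assuming PyTorch Geometric; the k-nearest-neighbour connectivity and the inverse-ΔR edge weights below are a common physics-informed choice, used here for illustration only:

```python
import torch
from torch_geometric.data import Data
from torch_geometric.nn import knn_graph

def jet_to_graph(constituents: torch.Tensor, label: int, k: int = 4) -> Data:
    """constituents: (num_particles, 8) rows of [pT, y, phi, m, E, px, py, pz]."""
    # Connect each particle to its k nearest neighbours in the (y, phi) plane
    # (ignoring the 2*pi wrap-around of phi for brevity).
    pos = constituents[:, 1:3]                 # rapidity and azimuthal angle
    edge_index = knn_graph(pos, k=k)
    # Physics-informed edge weight: inverse angular separation Delta-R.
    src, dst = edge_index
    delta_r = (pos[src] - pos[dst]).norm(dim=1)
    edge_weight = 1.0 / (delta_r + 1e-8)
    return Data(x=constituents, edge_index=edge_index,
                edge_attr=edge_weight.unsqueeze(1),
                y=torch.tensor([label]))
```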
Contrastive Learning
Representation learning involves extracting meaningful, compressed representations of data for various downstream tasks like classification, transfer learning etc. Contrastive learning, a form of representation learning, quantifies similarity or dissimilarity between data elements by contrasting positive/similar and negative/dissimilar pairs in the feature space.
Objective:
The primary objective is to minimize the distance between positive pairs while maximizing the distance between negative pairs in the learned embedding space. This creates a representation where similar data points are clustered together, and dissimilar ones are well-separated. This enables the model to capture the underlying structure and semantic relationships within the data without requiring explicit labels.
Contrastive Loss Functions:
1. Contrastive Pair Loss
The contrastive loss function operates on pairs of samples, encouraging the model to bring similar pairs closer in the embedding space while pushing dissimilar pairs apart.
The contrastive loss for a pair of samples \( (x_1, x_2) \) with label \( y \) is defined as:
\[ \mathcal{L} = y \cdot D^2 + (1 - y) \cdot \max(0, m - D)^2 \]
Where:
- \( D \) is the Euclidean distance between the embeddings of \( x_1 \) and \( x_2 \).
- \( y = 1 \) if \( x_1 \) and \( x_2 \) are similar, else \( y = 0 \).
- \( m \) is the margin that defines the minimum distance for dissimilar pairs.
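A direct PyTorch implementation of this loss might look as follows (the margin value is illustrative):

```python
import torch
import torch.nn.functional as F

def contrastive_pair_loss(z1, z2, y, margin=1.0):
    """Contrastive pair loss. y = 1 for similar pairs, 0 for dissimilar."""
    d = F.pairwise_distance(z1, z2)                        # Euclidean distance D
    loss = y * d.pow(2) + (1 - y) * F.relu(margin - d).pow(2)
    return loss.mean()
```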
2. InfoNCE Loss
InfoNCE is a type of contrastive loss that leverages multiple negative samples within a batch to improve representation learning.
The InfoNCE loss for a positive pair \( (i, j) \) is defined as:
\[ \mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{k \neq i} \exp(\text{sim}(z_i, z_k)/\tau)} \]
Where:
- \( z_i \) and \( z_j \) are the embeddings of the positive pair.
- \( \text{sim}(a, b) \) is a similarity function, typically cosine similarity.
- \( \tau \) is a temperature parameter that scales the logits.
- \( N \) is the batch size, and \( 2N \) accounts for all positive and negative pairs in the batch.
- \( \mathbb{1}_{k \neq i} \) is an indicator function that excludes the positive pair from the denominator.
3. NT-Xent Loss (Normalized Temperature-scaled Cross Entropy Loss)
NT-Xent is a specific formulation of the InfoNCE loss that emphasizes normalization over positive and negative pairs. It is essentially the same as InfoNCE but incorporates batch-wise negatives and ensures symmetry in the loss computation; the formula is the same as above. More about the variety of contrastive loss functions can be found here.
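A compact NT-Xent sketch, assuming a batch of N samples with two augmented views each (the SimCLR-style formulation):

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, tau=0.5):
    """z1, z2: (N, d) projections of the two views of the same N samples."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, d), unit norm
    sim = z @ z.t() / tau                                # cosine similarities
    n = z1.size(0)
    # Mask out self-similarity so it never appears in the denominator.
    sim.fill_diagonal_(float('-inf'))
    # For index i, its positive sits n positions away in the stacked batch.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)                 # -log softmax at positive
```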
Model Architecture:
Following are the four major components that make up the contrastive learning framework [3]:
1. Data Augmentation
It generates multiple views of the same data point, which are treated as positive pairs. The diversity introduced by augmentation helps the model learn invariant features.
2. Encoder Network
The encoder network transforms raw input data into high-dimensional embeddings or feature vectors. Its primary objective is to capture the underlying structure of the data, facilitating effective comparison between data points.
In our case, for the MNIST and Quark-Gluon image datasets, CNN and ResNet encoders were used. For the graph views of the Quark-Gluon data, a GNN and its quantum hybrid were used as the encoder, returning embeddings in a high-dimensional space upon training.
3. Projection Head
The projection head maps the high-dimensional embeddings produced by the encoder into a lower-dimensional space where the contrastive loss is applied. This separation allows the encoder to learn features beneficial for downstream tasks, while the projection head focuses on the contrastive objective. Numerous quantum versions of the projection head were experimented with in this study by introducing quantum layers.
4. Loss Function
The loss function quantifies how well the model distinguishes between positive and negative pairs. It guides the optimization process by providing gradients that adjust the model parameters to improve performance. Some examples used in this project were explained above.
Encoder Networks & Projection Head:
Graph Neural Networks optimise transformations on all attributes of a graph, at the node, edge and global level, while preserving its symmetries. The GNN encoder used has GATConv layers, followed by batch normalization to stabilize training by normalizing the embeddings. Furthermore, residual connections were added between layers to help prevent vanishing gradients and improve training. As the projection head, a combination of mean and max pooling was used to capture a richer set of information from the nodes, improving performance on graph classification tasks.
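A sketch of an encoder in this spirit, assuming PyTorch Geometric; the layer sizes and number of GATConv blocks are illustrative, not the project's exact architecture:

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv, global_mean_pool, global_max_pool

class GNNEncoder(nn.Module):
    def __init__(self, in_dim=8, hidden=64, out_dim=128):
        super().__init__()
        self.conv1 = GATConv(in_dim, hidden)
        self.conv2 = GATConv(hidden, hidden)
        self.bn1 = nn.BatchNorm1d(hidden)
        self.bn2 = nn.BatchNorm1d(hidden)
        self.proj = nn.Linear(2 * hidden, out_dim)   # maps pooled summary to embedding

    def forward(self, x, edge_index, batch):
        h = torch.relu(self.bn1(self.conv1(x, edge_index)))
        h = torch.relu(self.bn2(self.conv2(h, edge_index))) + h   # residual connection
        # Concatenate mean and max pooling for a richer graph-level summary.
        g = torch.cat([global_mean_pool(h, batch),
                       global_max_pool(h, batch)], dim=1)
        return self.proj(g)
```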
The Convolutional Neural Network follows the same framework and is detailed in the Mid Term GSoC Blog. The ResNet used is a further enhancement over the computer vision models above. Experimentation with quantum-based projection heads was conducted thoroughly and is detailed in a section below.
Training the model:
- Data Augmentation: For each input data point, apply different augmentations to create a positive pair.
- Encoding: Pass augmented views through the encoder network to obtain embeddings.
- Projection: Use the projection head to map the embeddings into the latent space where contrastive loss is applied.
- Loss Computation: Calculate the contrastive loss using the positive pair and a set of negative samples.
- Backpropagation: Update the encoder and projection head parameters to minimize the loss.
- Iteration: Repeat the process for multiple epochs until the representations converge.
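Putting these steps together, a minimal training-loop sketch assuming the pair-based setup and the `contrastive_pair_loss` defined earlier; `encoder`, `head` and `loader` are placeholders, not the project's exact objects:

```python
import torch

num_epochs = 50                                   # illustrative value
# encoder, head: nn.Module instances like those sketched above;
# loader yields (view1, view2, pair_label) batches. All are placeholders.
params = list(encoder.parameters()) + list(head.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

for epoch in range(num_epochs):
    for v1, v2, y in loader:
        z1, z2 = head(encoder(v1)), head(encoder(v2))    # embed both views
        loss = contrastive_pair_loss(z1, z2, y.float())  # loss defined earlier
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```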
Model Evaluation
After training was complete, all models were evaluated as detailed in this section. All experiments were tracked using Weights & Biases (wandb). Below, we look at results from the classical GNN encoder network on Quark-Gluon views (GNN wandb report) and the classical CNN encoder on MNIST 3-8 views (CNN wandb report), both using the contrastive pair loss. All other models were evaluated in the same way, and their wandb reports and results are logged in the next section on benchmarking.
Evaluation 1: Learning History
Learning history is logged across the epochs while training the model on train and validation datasets. Figure 9 shows the training and validation learning curve for classical GNN encoder network when running on Quark-Gluon graph views.
Evaluation 2: Test Embeddings Plot using t-SNE
The embeddings projected by the CNN encoder for the MNIST 3-8 dataset are represented in Figure 10. The embeddings live in a higher-dimensional space and were reduced to 3 dimensions in the plot below using the t-SNE dimensionality reduction technique.
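A sketch of the reduction step, assuming the embeddings are collected as a NumPy array and using scikit-learn's TSNE; the placeholder array stands in for real encoder outputs:

```python
import numpy as np
from sklearn.manifold import TSNE

# embeddings: (num_samples, embed_dim) array collected from the encoder.
embeddings = np.random.rand(500, 128)          # placeholder for real embeddings
coords3d = TSNE(n_components=3, perplexity=30).fit_transform(embeddings)
print(coords3d.shape)                          # (500, 3), ready for a 3-D scatter
```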
Evaluation 3: Test Predictions
In Figure 11, a test prediction on a positive sample is plotted along with its high-dimensional embedding vector. Similar values appear at the same positions in the two embeddings, indicating that the two samples of the digit 3 are mapped close to each other. In Figure 12, by contrast, the embeddings of 3 and 8 are clearly distant from each other.
Evaluation 4: Downstream Task - Linear Classification Test
The quality of the generated embeddings can be tested by using them for downstream tasks and evaluating those tasks for effectiveness. For the GNN encoder, a linear classification test is implemented, and its effectiveness is measured using a confusion matrix (Figure 13) and the AUC-ROC curve (Figure 14).
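A sketch of such a linear probe, assuming frozen embeddings and scikit-learn; the arrays below are placeholders for the encoder outputs and labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, confusion_matrix

# Placeholder embeddings/labels; in practice these come from the frozen encoder.
train_emb, train_y = np.random.rand(800, 128), np.random.randint(0, 2, 800)
test_emb, test_y = np.random.rand(200, 128), np.random.randint(0, 2, 200)

clf = LogisticRegression(max_iter=1000).fit(train_emb, train_y)
scores = clf.predict_proba(test_emb)[:, 1]     # probability of class 1 (gluon)
print("AUC:", roc_auc_score(test_y, scores))
print(confusion_matrix(test_y, clf.predict(test_emb)))
```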
High Energy Physics datasets are generally complicated to work with and an AUC above 0.7 is considered good.
In contrast, the CNN encoder on MNIST performs well in fewer epochs due to the simplicity of the dataset: its AUC approaches 1 and the confusion matrix (Figure 15) shows that the model makes almost no mistakes. The same CNN, when applied to the Quark-Gluon images, does not perform well.
Quantum Hybrid Models
Quantum machine learning combines quantum computing and classical machine learning to enhance data processing and model performance by leveraging three main quantum-mechanical principles: interference, superposition and entanglement. Quantum machine learning aims to solve complex problems faster and more efficiently than classical algorithms, since qubits, the quantum analogue of classical bits, can store more information at a given time due to superposition. Following are the quantum components used in the model architecture of this study:
- Quantum Projection Head:
The classical projection head after the encoder layers is replaced with a quantum circuit-based projection head using quantum layers. This potentially captures more complex relationships in the embedding space.
- Quantum Layer:
This is the layer where the parameterized quantum circuit (PQC) is applied using PennyLane. It processes the embeddings from the classical encoder layers.
- Quantum Circuit:
A simple quantum circuit applies parameterized rotations to the qubits and entangling gates between them, after which expectation values are measured.
Parameterized Quantum Circuits (PQCs) are fundamental building blocks in quantum machine learning. PQCs consist of quantum gates with tunable parameters, optimized during ML training. Following is an overview of the primary PQCs used in this project:
Figure 16 shows a quantum circuit with Ry rotations followed by entanglement.
Figure 17 shows a quantum circuit with the angle embedding template from PennyLane followed by entanglement.
Figure 18 shows a quantum circuit with amplitude embedding followed by entanglement.
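As a rough PennyLane sketch of circuits along these lines; the qubit count and layer structure are illustrative, not the exact circuits in the figures:

```python
import pennylane as qml
from pennylane import numpy as np

n_qubits = 4
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def projection_circuit(inputs, weights):
    # Encode classical embedding values as rotation angles (cf. Figure 17).
    qml.AngleEmbedding(inputs, wires=range(n_qubits))
    # Trainable Ry rotations followed by a ring of entangling CNOTs (cf. Figure 16).
    for i in range(n_qubits):
        qml.RY(weights[i], wires=i)
    for i in range(n_qubits):
        qml.CNOT(wires=[i, (i + 1) % n_qubits])
    return [qml.expval(qml.PauliZ(i)) for i in range(n_qubits)]

out = projection_circuit(np.random.rand(n_qubits), np.random.rand(n_qubits))
```

In the hybrid models, a QNode along these lines could be wrapped, for instance with PennyLane's `qml.qnn.TorchLayer`, so that it slots in after the classical encoder as the quantum projection head.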
These are the main circuits used in the experiments; the number of qubits and layers was varied, and some other samples can be found in the Mid Term GSoC Blog.
Quantum Fidelity
Fidelity is the quantum equivalent of a similarity score between two quantum states. It was used in one of the quantum-hybrid experiments, alongside the previously defined loss functions, to observe its effect on the embeddings.
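For pure states, fidelity reduces to the squared overlap \( F = |\langle \psi | \phi \rangle|^2 \); below is a minimal sketch computing it from two PennyLane state vectors (the circuit is illustrative):

```python
import pennylane as qml
import numpy as np

dev = qml.device("default.qubit", wires=2)

@qml.qnode(dev)
def state(angles):
    qml.RY(angles[0], wires=0)
    qml.RY(angles[1], wires=1)
    qml.CNOT(wires=[0, 1])
    return qml.state()                         # full state vector

psi = state(np.array([0.1, 0.4]))
phi = state(np.array([0.2, 0.3]))
fidelity = np.abs(np.vdot(psi, phi)) ** 2      # F = |<psi|phi>|^2, in [0, 1]
print(fidelity)
```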
Benchmarking
Table 1 illustrates the results of the CNN encoder on MNIST and Quark-Gluon image pairs. The first round of experiments was conducted on pairs of MNIST digits that look alike: 0-1, 3-8, 9-6. MNIST was used for quick experimentation and validation of the approach, and the results demonstrated the utility of the framework. Good results were achieved with the classical computer-vision-based contrastive learning model, as shown in Table 1. The respective wandb reports for each run can be viewed for a detailed inspection of the results. The model performed only moderately on the Quark-Gluon data, so the subsequent experiments focused on this dataset to log improvements over this baseline.
| Dataset | Model | Validation Loss | Validation Accuracy | WandB Report |
|---|---|---|---|---|
| 0-1 MNIST | CNN Encoder + contrastive pair | 0.000911 | 0.9997 | Report 1 |
| 3-8 MNIST | CNN Encoder + contrastive pair | 0.004080 | 0.9977 | Report 2 |
| 9-6 MNIST | CNN Encoder + contrastive pair | 0.002580 | 0.9994 | Report 3 |
| Quark-Gluon | CNN Encoder + contrastive pair | 0.4921 | 0.5617 | No Report |
Table 2 shows the results of different classical encoders on the Quark-Gluon dataset. The first row repeats the last row of Table 1 and shows the performance of the CNN encoder. The next row shows the results from ResNet18, which are promising and can be enhanced further. Fewer data samples were used with ResNet, as it is a large model and naturally time-consuming to train. Finally, GNNs were explored and gave good results, with an AUC nearing 0.8, which indicates a good model, especially for an HEP dataset given its inherent complexity. For more details of the GNN encoder run, see Report 4.
| Model | Test Accuracy (%) | AUC |
|---|---|---|
| CNN Encoder | 56.17 | 0.52 |
| ResNet18 Encoder | 60.02 | 0.5416 |
| GNN Encoder | 73.28 | 0.7984 |
Table 3 compares the classical and quantum GNN models on the Quark-Gluon dataset, since the GNN performed best in Table 2, a natural outcome given the sparse, point-cloud nature of the particle data. Quantum hybrids were tried with most of the models, but only the results with the GNN are shown below for easy comparison. The quantum hybrid models achieve performance comparable to the classical model, even though they are currently constrained by factors such as the number of usable qubits. Report 5 shows that the quantum hybrids work better within 10 epochs, while Report 6 shows the classical GNN taking over with more epochs.
Conclusion
It can be concluded that quantum and classical contrastive learning work effectively on MNIST as well as HEP datasets like Quark-Gluon. It is noteworthy that a simple CNN-encoder-based model is almost always correct on the MNIST dataset within a small number of iterations, but the same model does not perform as well on the HEP data. Upgrading the computer vision model to a ResNet18 encoder shows improvement on the HEP dataset. Moreover, implementing GNN-based encoders by converting the HEP particle cloud data to graphs shows considerable improvement in accuracy on the downstream tasks. The quantum hybrid models show comparable, and sometimes slightly better, performance in terms of AUC on the datasets used. In conclusion, all the experiments conducted show the viability of using representation learning, from both classical and quantum perspectives, on HEP datasets to generate embeddings that can be meaningfully used in downstream tasks. Studies like these conducted at ML4Sci help ensure that the LHC is equipped for its next wave of experiments, which will generate huge volumes of data and require ML models to make sense of it all more efficiently. Research of this kind at the LHC deepens our understanding of what exists and can eventually spark new technologies that change the world we live in.
Future Scope
The experiments so far used 12.5k data points, while the complete dataset has 933k. Experiments on the complete Quark-Gluon dataset are therefore required to learn more and observe the effect on the current results. Additionally, the existing models can be tuned further for better performance.
Experimenting with more loss functions and architectures is a crucial next step. Many frameworks from the literature, such as MoCo and BYOL, were not tested. Experiments with a larger variety of quantum circuits from the literature would be beneficial, and trying fully quantum models is a promising next step.
Acknowledgment
I would love to start by acknowledging all the unwavering support showered throughout the program by my mentors and co-mentees. I want to extend my deepest gratitude to my mentors and Professors Sergei Gleyzer, KC Kong, Katia Matcheva, Konstantin Matchev, Myeonghun Park, and Gopal Ramesh Dahale; who have guided me with invaluable insights. Their constant encouragement has been a source of inspiration and motivation for me. I am truly grateful for the time and effort they have invested in nurturing my skills, broadening my horizons and deepening my knowledge.
To my co-mentees, Amey Bhatuse and Duy Do Lee, I deeply appreciate the camaraderie, shared learning, and mutual support we've offered each other. Together, we have navigated challenges, celebrated successes, and grown stronger. To all the other ML4Sci GSoC contributors and their amazing work! It was always a delight to get on a call with everyone and learn from everyone's experiences. For me, the best part was being part of such a dedicated community working towards quantum computing. The global perspective of the team helped me understand different points of view and approaches to solve the problem at hand.
Kudos to the GSoC organizers and leads for such a phenomenal job at managing the program in its entirety and bringing together mentors and mentees for a collaboration of such a huge scale!
References
[1] M. Andrews, J. Alison, S. An, B. Burkle, S. Gleyzer, M. Narain, M. Paulini, B. Poczos, E. Usai, End-to-end jet classification of quarks and gluons with the CMS Open Data, Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, Volume 977, 2020, 164304, ISSN 0168-9002.
[2] You, Y., Chen, T., Sui, Y., Chen, T., Wang, Z. and Shen, Y., 2020. Graph contrastive learning with augmentations. Advances in neural information processing systems, 33, pp.5812-5823.
[3] Le-Khac, P.H., Healy, G. and Smeaton, A.F., 2020. Contrastive representation learning: A framework and review. IEEE Access, 8, pp.193907-193934.
[4] A. Hammad, Kyoungchul Kong, Myeonghun Park and Soyoung Shim, Quantum Metric Learning for New Physics Searches at the LHC, 2023
[5] Jaderberg, B., Anderson, L.W., Xie, W., Albanie, S., Kiffner, M. and Jaksch, D., 2022. Quantum self-supervised learning. Quantum Science and Technology, 7(3), p.035005.
[6] Jaiswal, A., Babu, A.R., Zadeh, M.Z., Banerjee, D. and Makedon, F., 2020. A survey on contrastive self-supervised learning. Technologies, 9(1), p.2.
[7] Liu, Y., Jin, M., Pan, S., Zhou, C., Zheng, Y., Xia, F. and Yu, P.S., 2022. Graph self-supervised learning: A survey. IEEE Transactions on Knowledge and Data Engineering, 35(6), pp.5879-5900.