Traditional deep-learning models such as CNNs and RNNs face significant challenges in representing jets effectively. Image representations often struggle to incorporate particle identity, which limits performance improvements [10]. Similarly, sequence [23] and tree [25] representations impose an artificial ordering on jet particles, which inherently possess no sequential structure. Treating a jet as an unordered collection of its constituent particles leads to a more natural representation: this format not only facilitates the inclusion of particle-specific features but also guarantees permutation invariance. Among models that adopt this perspective, ParticleNet describes jets as "particle clouds," analogous to the point-cloud techniques used for 3D shape analysis in computer vision. ParticleNet uses the DGCNN architecture, whose EdgeConv operations effectively exploit the local spatial structure of particle clouds to achieve notably higher performance.
ParT, a transformer variant based on the Class-Attention in Image Transformers (CaiT) framework [38], integrates interaction variables as a secondary input. The self-attention mechanism of this architecture attends to all positions within the input sequence, capturing long-range dependencies efficiently while remaining invariant to particle order. By modifying the Multi-Head Attention (MHA) mechanism to include particle interaction variables, ParT not only outperforms traditional transformer models, but also sets a new benchmark in jet tagging. These modifications position ParT as the leading model in jet tagging.
Based on the ParT framework, we developed MIParT to enhance the input of interaction data, as depicted in Fig. 1. MIParT adopts the input formats of ParT and processes jet data with two distinct inputs:
Figure 1. (color online) Schematic of the More-Interaction Particle Transformer (MIParT) architecture. The particle features ${\boldsymbol{x}}_1$ are processed sequentially through K MI particle attention blocks and L particle attention blocks. The interaction features ${\boldsymbol{U}}_1$ are first fed to the K MI particle attention blocks, then dimensionally reduced by a 1D pointwise convolution to ${\boldsymbol{U}}_2$, and then fed to the L particle attention blocks. The MIParT architecture ends with the application of the Class-Attention in Image Transformers (CaiT) methodology, which uses a class token ${\boldsymbol{x}}_{\rm{class}}$ to systematically extract and summarize information from ${\boldsymbol{x}}_3$ in the class attention blocks.

● Particle input ${\boldsymbol{x}}_1$: a list of C features per particle, arranged into an array of shape (N, C), where N represents the number of particles within a jet.

● Interaction input ${\boldsymbol{U}}_1$: a matrix of $ C' $ features for each particle pair, formatted as an array of shape (N, N, C').

The particle input is first transformed by a Multilayer Perceptron (MLP) that projects the feature dimension to $ D_1 $, resulting in an array ${\boldsymbol{x}}_1$ of shape $ (N,D_{1}) $. Similarly, the interaction input undergoes pointwise 1D convolution, yielding ${\boldsymbol{U}}_1$ of shape $ (N,N,D_1) $. ${\boldsymbol{x}}_1$ then passes through K MI-particle attention blocks to generate ${\boldsymbol{x}}_2$ of the same shape. In each of these blocks, ${\boldsymbol{U}}_1$ serves as an additional input; it is subsequently reduced by a pointwise 1D convolution to ${\boldsymbol{U}}_2$, of shape $ (N,N,D_2) $.

Following the structural framework of ParT, ${\boldsymbol{x}}_2$ progresses through L particle attention blocks, each augmented with ${\boldsymbol{U}}_2$, to produce ${\boldsymbol{x}}_3$. Subsequently, using the CaiT methodology, a class token ${\boldsymbol{x}}_{\rm{class}}$ systematically extracts and summarizes information from ${\boldsymbol{x}}_3$ in the class attention blocks. Finally, this summarized information forms a single vector that is passed through an MLP and a softmax function to produce the classification scores.
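To make the data flow concrete, the following PyTorch-style sketch traces the tensor shapes through the pipeline described above. The single Linear and Conv1d layers are placeholders for the actual embedding MLP and pointwise convolutions, and the attention blocks are only indicated by comments, so none of the names correspond to the released MIParT code.

```python
import torch
import torch.nn as nn

# Shape-level sketch of the data flow in Fig. 1 (placeholders, not the actual model).
N, C, Cp = 100, 17, 4       # particles per jet, per-particle and per-pair features
D1, D2 = 64, 8              # embedding dimensions

x = torch.randn(1, N, C)        # particle input x1, shape (batch, N, C)
U = torch.randn(1, Cp, N, N)    # interaction input U1, shape (batch, C', N, N)

x1 = nn.Linear(C, D1)(x)                                      # (1, N, D1)
U1 = nn.Conv1d(Cp, D1, 1)(U.flatten(2)).view(1, D1, N, N)     # (1, D1, N, N)
# x1 and U1 pass through K MI-particle attention blocks, giving x2 of shape (1, N, D1);
# U1 is then reduced by a pointwise convolution to U2 for the L particle attention blocks.
U2 = nn.Conv1d(D1, D2, 1)(U1.flatten(2)).view(1, D2, N, N)    # (1, D2, N, N)
# x2 and U2 pass through L particle attention blocks to give x3, which the class
# attention blocks summarize via the class token before the final MLP + softmax.
```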
The Particle Attention Block, a crucial element of the ParT framework, was seamlessly integrated into our MIParT model. The architecture of this block is based on the NormFormer design [39], specifically using Layer Normalization instead of Batch Normalization. Layer Normalization normalizes the features of each sample independently at every layer, enhancing model stability and overall performance across diverse datasets. The architecture of the Particle Attention Block is shown in Fig. 2. Furthermore, in this configuration, the traditional Multi-Head Attention (MHA) is substituted by Particle Multi-Head Attention (P-MHA). This modification incorporates particle interaction features directly into the attention mechanism, enriching the capability of the model to capture complex particle dynamics. The P-MHA mechanism, which is key to the Particle Attention Block, is mathematically expressed as
Figure 2. (color online) Schematic of the More-Interaction Attention (MIA) architecture. The shape of U is (N, N, C), while both the input x and the output x' have shape (N, C). MIA maintains a one-to-one correspondence between the feature dimensions of U and x and the number of MHA heads, all denoted by C.
$ {\rm{P}}\text{-}{\rm{MHA}}(Q, K, V) = {\rm{SoftMax}}\left(\frac{QK^T}{\sqrt{d_k}} + {\boldsymbol{U}}\right)V, $ (1)

where Q, K, and V are the linear projections of the particle embedding x, and U represents the interaction embedding. The dimensions of U are precisely aligned with the attention heads in the MHA mechanism, thereby facilitating the integration of particle interaction features. The specific implementation of P-MHA can be found in Ref. [36]. This integration significantly enhances the ability of the model to capture complex particle interactions, which is crucial in particle physics applications.
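As an illustration of Eq. (1), the following minimal PyTorch sketch adds the interaction embedding U as a bias to the attention logits inside scaled dot-product attention. It is a schematic rendering of the formula, not the P-MHA implementation of Ref. [36]; the (batch, heads, N, d_k) tensor layout is an assumption.

```python
import math
import torch

def p_mha(Q, K, V, U):
    """Sketch of Eq. (1): scaled dot-product attention with the interaction
    embedding U added to the attention logits before the softmax.
    Q, K, V: (batch, heads, N, d_k);  U: (batch, heads, N, N)."""
    d_k = Q.size(-1)
    logits = Q @ K.transpose(-2, -1) / math.sqrt(d_k) + U   # (batch, heads, N, N)
    return torch.softmax(logits, dim=-1) @ V                # (batch, heads, N, d_k)
```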
In the original P-MHA mechanism, the feature dimensions of U align one-to-one with the heads of the MHA, both denoted by C. Increasing the feature dimensions of U therefore requires a proportional increase in the number of attention heads, which significantly increases the complexity of the model. To mitigate this issue, we introduce More-Interaction Attention (MIA) and the MI-Particle Attention Block. These components replace P-MHA with MIA, as shown in Fig. 2 (MIA architecture) and Fig. 3 (MI-Particle Attention Block / Particle Attention Block architecture). The MI-Particle Attention Block incorporates Layer Normalization and the Gaussian Error Linear Unit (GELU) activation function. When the red block in Fig. 3 uses MIA, it forms the MI-Particle Attention Block; when it uses P-MHA, it forms the Particle Attention Block. This approach allows the model to effectively use the interaction inputs without significantly increasing complexity. MIA is calculated using the following formula:
Figure 3. (color online) Schematic of the MI-Particle Attention Block / Particle Attention Block architecture. Here, LN represents Layer Normalization, and GELU represents the Gaussian Error Linear Unit activation function. The block forms the MI-Particle Attention Block when using MIA and the Particle Attention Block when using P-MHA.
$ {\rm{MIA}}({\boldsymbol{U}}, V) = {\rm{SoftMax}}({\boldsymbol{U}})V, $ (2)

where V is a linear projection of the particle embedding x. In MIA, each feature dimension of U and x, as well as each head, is denoted by C, ensuring a one-to-one correspondence.
By increasing the feature dimensions of U, MIA effectively exploits the interaction inputs without significantly increasing the complexity of the model. Moreover, the MI-Particle Attention Block, which incorporates self-attention on x, acts as a supplement placed in front of the Particle Attention Blocks rather than as a replacement for them.
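A minimal sketch of Eq. (2) is shown below: the attention weights are computed solely from the interaction embedding U, and V is a projection of the particle embedding x. The (batch, heads, N, d) layout and the per-head dimension d are illustrative assumptions, not the actual MIParT implementation.

```python
import torch

def mia(U, V):
    """Sketch of Eq. (2): attention weights come solely from the interaction
    embedding U; no query/key projections of the particle embedding are used.
    U: (batch, heads, N, N);  V: (batch, heads, N, d)."""
    weights = torch.softmax(U, dim=-1)    # normalize over the second particle index
    return weights @ V                    # (batch, heads, N, d)
```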
We incorporated the Class Attention Block from the ParT framework, inspired by the CaiT architecture. This block uses a class token ${\boldsymbol{x}}_{\rm{class}}$ to efficiently extract information through attention mechanisms, as depicted in Fig. 4. The Multi-Head Attention inputs are defined as follows:

Figure 4. (color online) Schematic of the Class Attention Block architecture. Here, LN represents Layer Normalization, GELU represents the Gaussian Error Linear Unit activation function, and MHA stands for the Multi-Head Attention block.
$ Q = W_{q}{\boldsymbol{x}}_{\rm{class}}+b_{q}, $ (3)

$ K = W_{k}{\boldsymbol{z}}+b_{k}, $ (4)

$ V = W_{v}{\boldsymbol{z}}+b_{v}, $ (5)

where ${\boldsymbol{z}}=[{\boldsymbol{x}}_{\rm{class}},{\boldsymbol{x}}]$, and W and b represent learnable parameters. Because only the class token forms the query, this design ensures a low computational overhead for the Class Attention mechanism while the concatenated vector ${\boldsymbol{z}}$ supplies the keys and values.

The Class Attention Block significantly enhances feature extraction from the input x by capitalizing on the class token, thereby improving the focus of the model on essential aspects of the data. This enhancement also improves jet classification performance significantly, making the Class Attention Block a crucial component within the ParT framework.
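The following single-head PyTorch sketch illustrates Eqs. (3)-(5): the query is built from the class token only, while the keys and values come from the concatenated vector z. It is a schematic example, not the multi-head implementation used in ParT or MIParT.

```python
import torch
import torch.nn as nn

class ClassAttentionSketch(nn.Module):
    """Single-head sketch of Eqs. (3)-(5): the query comes from the class token
    only, while keys and values are built from z = [x_class, x]."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x_class, x):
        # x_class: (batch, 1, dim), x: (batch, N, dim)
        z = torch.cat([x_class, x], dim=1)               # (batch, N + 1, dim)
        Q, K, V = self.q(x_class), self.k(z), self.v(z)
        attn = torch.softmax(Q @ K.transpose(-2, -1) / K.size(-1) ** 0.5, dim=-1)
        return attn @ V                                   # (batch, 1, dim): updated class token
```

Since Q contains a single row, the attention cost grows only linearly with the number of particles, which is the source of the low overhead noted above.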
The architecture of our MIParT model includes K = 5 MI-particle attention blocks, L = 5 particle attention blocks, and 2 class attention blocks. The choice of these hyperparameters balances complexity and accuracy: accuracy increases with additional layers, but so does complexity, so we limited the total number of attention blocks to ten. The choice of two class attention blocks follows the CaiT framework [38], which recommends this configuration for efficient classification.

For the particle embeddings ${\boldsymbol{x}}_1$, a three-layer Multi-Layer Perceptron (MLP) is used, with the layers containing 128, 512, and 64 neurons, respectively. This configuration yields embeddings with dimensionality $ D_1 = 64 $. The decision to reduce the embedding dimension relative to the ParT model was motivated by the addition of the MIA module; this adjustment keeps the complexity of the model reasonable while maintaining its efficiency, optimizing the trade-off between performance and computational load. Each layer uses GELU as the activation function together with Layer Normalization. Additionally, a three-layer, 64-channel pointwise 1D convolution is used for the interaction embeddings ${\boldsymbol{U}}_1$, performing convolutions only along the feature dimension. The ${\boldsymbol{U}}_1$ embeddings are further processed by a single-layer, 8-channel pointwise 1D convolution to generate ${\boldsymbol{U}}_2$, with dimensionality $ D_2 = 8 $. This choice maintains consistency with the ParT model, ensuring alignment with established architectural standards and facilitating comparative analysis. The MI-particle attention blocks implement MIA with 64 heads, while the P-MHA and class Multi-Head Attention in the particle and class attention blocks use 8 heads each. A dropout rate of 0.1 is applied in all MI-particle and particle attention blocks, whereas the class attention blocks use no dropout.

For very large datasets, increasing the embedding dimension significantly enhances model performance. Therefore, for such datasets, we double the dimension of the particle embeddings to $ D_1 = 128 $. This adjustment is straightforward, requiring only a change of the neuron configuration of the three-layer MLP to 128, 512, and 128. Consequently, the dimensions of x and U in MIA are no longer identical; this discrepancy is acceptable as long as the dimension of x is an integer multiple of the dimension of U. We refer to this modified model as MIParT-Large (MIParT-L).
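A configuration-level sketch of these embedding layers is given below in PyTorch. The widths and channel counts follow the numbers quoted above, but the exact ordering of LayerNorm, Linear, and GELU inside each layer is an assumption rather than a description of the released code.

```python
import torch.nn as nn

D1, D2 = 64, 8   # particle- and pair-embedding dimensions (D1 = 128 for MIParT-L)

def particle_embedding(in_dim, widths=(128, 512, D1)):
    """Three-layer MLP (128, 512, 64 neurons) with LayerNorm + GELU per layer (assumed ordering)."""
    layers = []
    for w in widths:
        layers += [nn.LayerNorm(in_dim), nn.Linear(in_dim, w), nn.GELU()]
        in_dim = w
    return nn.Sequential(*layers)

def pair_embedding(in_dim, channels=(64, 64, 64)):
    """Three-layer, 64-channel pointwise 1D convolution along the feature dimension."""
    layers = []
    for c in channels:
        layers += [nn.Conv1d(in_dim, c, kernel_size=1), nn.GELU()]
        in_dim = c
    return nn.Sequential(*layers)

reduce_to_U2 = nn.Conv1d(D1, D2, kernel_size=1)   # single-layer, 8-channel pointwise conv
```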
We developed the MIParT model using the PyTorch framework [40]. Its implementation is based on $\textsf{Weaver}$¹ and follows the implementation of ParT².

We initially evaluated the MIParT model on two widely used jet tagging benchmark datasets, namely the top tagging [16] and quark-gluon [32] datasets. The model was trained on an NVIDIA RTX 4090 GPU, using a learning rate of 0.001 and a batch size of 256. Training was limited to 15 epochs to prevent overfitting. Both datasets incorporate kinematic variables as particle input features, with particle identification information included only in the quark-gluon dataset. All input features for the two datasets are shown in Table 1.
| Category | Variable | TOP | QG | JC |
|---|---|---|---|---|
| Kinematics | $ \Delta\eta $ | * | * | * |
| | $ \Delta\phi $ | * | * | * |
| | $ \log p_{\rm{T}} $ | * | * | * |
| | $ \log E $ | * | * | * |
| | $ \log {p_{\rm{T}}}/{p_{\rm{T}}{\rm(jet)}} $ | * | * | * |
| | $ \log { E}/{ E{\rm(jet)}} $ | * | * | * |
| | $ \Delta R $ | * | * | * |
| Particle identification | charge | | * | * |
| | Electron | | * | * |
| | Muon | | * | * |
| | Photon | | * | * |
| | Charged Hadron | | * | * |
| | Neutral Hadron | | * | * |
| Trajectory displacement | $ \tanh d_0 $ | | | * |
| | $ \tanh d_z $ | | | * |
| | $ \sigma_{d_0} $ | | | * |
| | $ \sigma_{d_z} $ | | | * |

Table 1. Summary of kinematic and particle identification variables included in the top tagging (TOP), quark-gluon (QG), and JetClass (JC) datasets. Variables present in each dataset are indicated by a star symbol (*). The table includes seven kinematic variables describing the physical characteristics of particles relative to the jet axis, six particle identification variables categorizing particles by type and charge, and four trajectory displacement features, which provide detailed information on particle trajectories.
We then pre-trained our larger model variant, MIParT-L, on the JetClass dataset containing 100M samples [36]. This model was pre-trained on dual NVIDIA RTX 3090 GPUs using a learning rate of 0.0008 and a batch size of 384, with pre-training limited to 50 epochs to avoid overfitting. After pre-training, MIParT-L was fine-tuned on the top tagging and quark-gluon datasets. Note that the pre-training on the JetClass dataset used only kinematic features when the target was the top tagging dataset, whereas both kinematic and particle identification features were used when the target was the quark-gluon dataset.
For fine-tuning, we replaced the last MLP for classification with a newly initialized MLP having two output nodes. All weights were then fine-tuned across the datasets for 20 epochs. We used a learning rate of 0.00016 for the pre-trained weights and 0.008 for the new MLP.
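The fine-tuning recipe can be expressed as two optimizer parameter groups, as in the hypothetical sketch below. `backbone` and `new_head` are placeholder modules standing in for the pre-trained MIParT-L body and the re-initialized classification MLP, and the choice of AdamW is illustrative, since the text does not specify the optimizer.

```python
import torch
import torch.nn as nn

# Sketch of the fine-tuning setup: pre-trained weights and the newly initialized
# two-node classification MLP are trained with different learning rates.
backbone = nn.Sequential(nn.Linear(128, 128), nn.GELU())  # placeholder for pre-trained blocks
new_head = nn.Linear(128, 2)                              # re-initialized MLP, two output nodes

optimizer = torch.optim.AdamW([
    {"params": backbone.parameters(), "lr": 0.00016},  # pre-trained weights
    {"params": new_head.parameters(), "lr": 0.008},    # new classification MLP
])
# All weights are then fine-tuned for 20 epochs on the target dataset.
```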
The seven kinematic input features are as follows:

● $ \Delta\eta $: difference in pseudorapidity $ \eta $ between the particle and jet axis;
● $ \Delta\phi $: difference in azimuthal angle $ \phi $ between the particle and jet axis;
● $ \log p_{\rm{T}} $: logarithm of the particle transverse momentum $ p_{\rm{T}} $;
● $ \log E $: logarithm of the particle energy;
● $ \log {p_{\rm{T}}}/{p_{\rm{T}}{\rm(jet)}} $: logarithm of the particle $ p_{\rm{T}} $ relative to the jet $ p_{\rm{T}} $;
● $ \log {E}/{E{\rm(jet)}} $: logarithm of the particle energy relative to the jet energy;
● $ \Delta R $: angular separation between the particle and jet axis.

The six particle identification features are as follows:

● "Charge": electric charge of the particle;
● "Electron": whether the particle is an electron;
● "Muon": whether the particle is a muon;
● "Photon": whether the particle is a photon;
● "Charged Hadron": whether the particle is a charged hadron;
● "Neutral Hadron": whether the particle is a neutral hadron.

The four trajectory displacement features in the JetClass dataset are as follows:

● $ \tanh d_0 $: hyperbolic tangent of the transverse impact parameter value;
● $ \tanh d_z $: hyperbolic tangent of the longitudinal impact parameter value;
● $ \sigma_{d_0} $: error of the measured transverse impact parameter;
● $ \sigma_{d_z} $: error of the measured longitudinal impact parameter.
For particle interaction features, we consider four logarithmic characteristics $ (\ln \Delta, \ln k_T, \ln z, \ln m^2) $ derived from the energy-momentum four-vector $ p = (E, p_x, p_y, p_z) $ [41]. These features are defined as follows:

$ \Delta= \sqrt{(y_a -y_b)^2 +(\phi_a -\phi_b)^2}, $ (6)

$ k_T = {\min}(p_{{\rm{T}},a},p_{{\rm{T}},b})\Delta, $ (7)

$ z ={\min}(p_{{\rm{T}},a},p_{{\rm{T}},b})/(p_{{\rm{T}},a} + p_{{\rm{T}},b}), $ (8)

$ m^2 =(E_a+E_b)^2 - |{{\boldsymbol{p}}_a+{\boldsymbol{p}}_b}|^2 \, , $ (9)

where $ y_i $ is the rapidity, $ \phi_i $ is the azimuthal angle, $ p_{{\rm{T}},i} $ is the transverse momentum, and ${\boldsymbol{p}}_i$ is the momentum three-vector of the particle $ i=a,b $. The motivation for selecting these variables comes from their widespread adoption in several advanced neural networks [34, 36].
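For reference, a NumPy sketch of Eqs. (6)-(9) for a single particle pair is shown below; wrapping the azimuthal difference into (−π, π] is a standard convention assumed here, and the function name is purely illustrative.

```python
import numpy as np

def pair_features(p_a, p_b):
    """Interaction features of Eqs. (6)-(9) for two particles given as
    (E, px, py, pz) arrays; returns (ln Delta, ln kT, ln z, ln m^2)."""
    def pt_y_phi(p):
        E, px, py, pz = p
        pt = np.hypot(px, py)                          # transverse momentum
        y = 0.5 * np.log((E + pz) / (E - pz))          # rapidity
        phi = np.arctan2(py, px)                       # azimuthal angle
        return pt, y, phi

    pt_a, y_a, phi_a = pt_y_phi(np.asarray(p_a, dtype=float))
    pt_b, y_b, phi_b = pt_y_phi(np.asarray(p_b, dtype=float))

    dphi = np.arctan2(np.sin(phi_a - phi_b), np.cos(phi_a - phi_b))  # wrap to (-pi, pi]
    delta = np.hypot(y_a - y_b, dphi)                  # Eq. (6)
    kt = min(pt_a, pt_b) * delta                       # Eq. (7)
    z = min(pt_a, pt_b) / (pt_a + pt_b)                # Eq. (8)
    p_sum = np.asarray(p_a, dtype=float) + np.asarray(p_b, dtype=float)
    m2 = p_sum[0] ** 2 - np.sum(p_sum[1:] ** 2)        # Eq. (9)
    return np.log(delta), np.log(kt), np.log(z), np.log(m2)
```

In practice, these features are computed for every particle pair in a jet and stacked into the (N, N, C') interaction input.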
To evaluate the performance of the MIParT model, we conducted comparative evaluations with several popular models using the top tagging and quark-gluon datasets. Our evaluation focused on several key metrics:

● Accuracy: This metric quantifies the proportion of correct predictions made by the model, including both true positive and true negative identifications. Mathematically, accuracy is defined as follows:
$ {\rm{Accuracy}} = \frac{TP+TN}{TP+TN+FN+FP} \, , $ (10)

where TP stands for true positives, TN stands for true negatives, FN stands for false negatives, and FP stands for false positives.
● AUC (Area Under the Curve): AUC is a comprehensive metric to assess model performance across all classification thresholds. It is derived from the Receiver Operating Characteristic (ROC) curve, which plots the true positive rate (sensitivity) against the false positive rate (1 − specificity) for various thresholds, illustrating the trade-off between sensitivity and specificity. AUC values range from 0.5, which indicates no discriminatory ability (equivalent to random guessing), to 1.0, which represents perfect discrimination between classes.
● Background Rejection at a Certain Signal Efficiency, RejX%: This metric calculates the inverse of the false positive rate (FPR) when the true positive rate (TPR) is fixed at a certain percentage, commonly referred to as RejX%. It is mathematically expressed as follows:
$ {\rm{Rej}}_{X{\text{%}}} = \frac{1}{\rm{FPR}}\bigg|_{{\rm{TPR}} = X{\text{%}}}. $ (11)

For example, a Rej30% value of 2500 indicates that at a TPR of 30%, the inverse of the FPR is 2500. This implies a single false positive per 2500 negative instances, highlighting the exceptional specificity and minimal error rate of the model at this level.
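A simple way to compute this metric from classifier scores is sketched below. Interpolating the ROC curve at the requested signal efficiency is an assumption of this sketch; published results may instead quote the value at the nearest threshold.

```python
import numpy as np
from sklearn.metrics import roc_curve

def background_rejection(y_true, y_score, signal_eff):
    """Rej_X% of Eq. (11): inverse of the false-positive rate at a fixed
    true-positive rate (signal efficiency), e.g. signal_eff = 0.3 for Rej30%."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    fpr_at_eff = np.interp(signal_eff, tpr, fpr)   # tpr from roc_curve is non-decreasing
    return 1.0 / fpr_at_eff
```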
Top tagging is a critical task in jet tagging and is often used in searches for new physics at the LHC. For this study, we used a top tagging dataset [16] consisting of 2M jets, with $ t \to bqq' $ as the signal and q/g jets as the background. This dataset provides only the energy-momentum four-vectors (kinematic features) of each particle.

Figure 5 shows the performance of our MIParT model compared with other popular models on the top tagging dataset. The MIParT model achieves accuracy and AUC metrics nearly identical to those of LorentzNet [42], and its Rej50% and Rej30% metrics are comparable to those of LorentzNet within uncertainties. Note that a series of Lorentz-equivariant methods demonstrate performance similar to that of LorentzNet, such as Clifford Group Equivariant Neural Networks (CGENN) [43], the permutation equivariant and Lorentz invariant or covariant aggregator network (PELICAN) [44], and Lorentz-Equivariant Geometric Algebra Transformers (L-GATr) [45]. Moreover, MIParT, LorentzNet, and the other Lorentz-equivariant models significantly outperform the remaining models, including the Particle Flow Network (PFN) [32], the Particle-level Convolutional Neural Network (P-CNN) [34], ParticleNet [34], the Point Cloud Transformer (PCT) [46], and ParT [36]; the corresponding metrics are extracted from published results. For the fine-tuned MIParT-L model pre-trained on the 100M JetClass dataset, a 39% enhancement in background rejection performance was achieved, comparable to that of fine-tuned ParT. Detailed comparison results are presented in Table 2. The MIParT model significantly outperforms ParT on the top tagging benchmark, with approximately 25% better background rejection at a 30% signal efficiency. Among the evaluated models, MIParT, along with LorentzNet and the other Lorentz-equivariant models, ranks in the top tier, exhibiting robustness and high performance.
Figure 5. (color online) Comparison of MIParT performance metrics with those of other models on the top tagging dataset. This figure shows the Accuracy, AUC, Rej50%, and Rej30% metrics for the MIParT model alongside those of the Particle Flow Network (PFN) [32], Particle-level Convolutional Neural Network (P-CNN), Point Cloud Transformer (PCT) [46], Clifford Group Equivariant Neural Networks (CGENN) [43], permutation equivariant and Lorentz invariant or covariant aggregator network (PELICAN) [44], Lorentz-Equivariant Geometric Algebra Transformers (L-GATr) [45], LorentzNet [42], ParticleNet [34], and ParT [36]. Metrics of other models are extracted from published results. Detailed outcomes are provided in Table 2. Bars without slashes indicate original models without fine-tuning, while bars with slashes indicate models with fine-tuning. The gray dashed line represents the results for MIParT, whereas the red dashed line represents the results for fine-tuned MIParT-L (MIParT-L f.t.).
| Model | Accuracy | AUC | Rej50% | Rej30% |
|---|---|---|---|---|
| PFN | — | 0.9819 | 247±3 | 888±17 |
| P-CNN | 0.930 | 0.9803 | 201±4 | 759±24 |
| PCT | 0.940 | 0.9855 | 392±7 | 1533±101 |
| CGENN | 0.942 | 0.9869 | 500 | 2172 |
| PELICAN | 0.9426 | 0.9870 | — | — |
| L-GATr | 0.9417 | 0.9868 | 548±26 | 2148±106 |
| LorentzNet | 0.942 | 0.9868 | 498±18 | 2195±173 |
| ParticleNet | 0.940 | 0.9858 | 397±7 | 1615±93 |
| ParT | 0.940 | 0.9858 | 413±16 | 1602±81 |
| MIParT (ours) | 0.942 | 0.9868 | 505±8 | 2010±97 |
| ParT f.t. | 0.944 | 0.9877 | 691±15 | 2766±130 |
| MIParT-L f.t. (ours) | 0.944 | 0.9878 | 640±10 | 2789±133 |

Table 2. Performance comparison of various models on the top tagging dataset. This table lists the results for the MIParT model alongside those of other prominent models such as the Particle Flow Network (PFN) [32], Particle-level Convolutional Neural Network (P-CNN), Point Cloud Transformer (PCT) [46], Clifford Group Equivariant Neural Networks (CGENN) [43], permutation equivariant and Lorentz invariant or covariant aggregator network (PELICAN) [44], Lorentz-Equivariant Geometric Algebra Transformers (L-GATr) [45], LorentzNet [42], ParticleNet [34], and ParT [36]. Metrics of other models are extracted from published results. The results for the fine-tuned version of our model, MIParT-L f.t., are shown at the bottom of the table for comparison with those of the fine-tuned ParT model, that is, ParT f.t.
Quark-gluon tagging is another crucial jet tagging task. Unlike the top tagging dataset, the quark-gluon dataset [32] includes not only the kinematic features of each particle, but also particle identification information. This dataset allows for a more detailed categorization of particles, including specific distinctions among electrically charged and neutral hadrons, such as pions, kaons, and protons. Additionally, similar to the top tagging dataset, the quark-gluon dataset contains 2M jets, with quarks and gluons designated as the signal and background, respectively.
Figure 6 shows the performance of our MIParT model compared with other popular models on the quark-gluon dataset. On this dataset, the MIParT model significantly outperforms LorentzNet and the other models across all metrics, including accuracy, AUC, Rej50%, and Rej30%. Only the ParT model approaches the performance of our model in some metrics, but MIParT still maintains an overall lead over ParT. Compared with other models, such as PFN [32], ABCNet [47], and PCT [46], MIParT demonstrates a substantial lead; the corresponding metrics are extracted from published results. For the fine-tuned MIParT-L model pre-trained on the 100M JetClass dataset, a 6% enhancement in background rejection performance is achieved, outperforming fine-tuned ParT. Detailed comparison results on the quark-gluon dataset are presented in Table 3. MIParT achieves the best performance across all evaluation metrics, improving background rejection power by approximately 3% compared with ParT. At the same time, the background rejection of the fine-tuned MIParT-L model is improved by approximately 2% compared with that of fine-tuned ParT.
Figure 6. (color online) Comparison of MIParT performance metrics with those of other models on the quark-gluon dataset. This figure shows the Accuracy, AUC, Rej50%, and Rej30% metrics for the MIParT model alongside those of the Particle Flow Network (PFN), attention-based Cloud Net (ABCNet) [47], Point Cloud Transformer (PCT) [46], LorentzNet [42], and ParT [36]. Metrics of other models are extracted from published results. Detailed outcomes are provided in Table 3. Bars without slashes indicate original models without fine-tuning, while bars with slashes indicate models with fine-tuning. The gray dashed line indicates the results for MIParT, whereas the red dashed line shows the results for fine-tuned MIParT-L (MIParT-L f.t.).
| Model | Accuracy | AUC | Rej50% | Rej30% |
|---|---|---|---|---|
| PFN | — | 0.9052 | 37.4±0.7 | — |
| ABCNet | 0.840 | 0.9126 | 42.6±0.4 | 118.4±1.5 |
| PCT | 0.841 | 0.9140 | 43.2±0.7 | 118.0±2.2 |
| LorentzNet | 0.844 | 0.9156 | 42.4±0.4 | 110.2±1.3 |
| ParT | 0.849 | 0.9203 | 47.9±0.5 | 129.5±0.9 |
| MIParT (ours) | 0.851 | 0.9215 | 49.3±0.4 | 133.9±1.4 |
| ParT f.t. | 0.852 | 0.9230 | 50.6±0.2 | 138.7±1.3 |
| MIParT-L f.t. (ours) | 0.853 | 0.9237 | 51.9±0.5 | 141.4±1.5 |

Table 3. Performance comparison of various models on the quark-gluon dataset. This table lists the results for the MIParT model along with other significant models, including the Particle Flow Network (PFN) [32], attention-based Cloud Net (ABCNet) [47], Point Cloud Transformer (PCT) [46], LorentzNet [42], and ParT [36]. Metrics of other models are extracted from published results. The fine-tuned version of our model, MIParT-L f.t., is shown at the bottom of the table for comparison with the fine-tuned ParT model, that is, ParT f.t.
Given that MIParT shares many components with ParT and differs only in the addition of the MIA blocks, the comparative results between these two models highlight the effectiveness of the MIA block. Specifically, MIParT consists of five MIA blocks, five particle attention blocks, and two class attention blocks, whereas ParT consists of eight particle attention blocks and two class attention blocks. Thus, from the results tested on the top tagging and quark-gluon datasets, it is evident that MIParT outperforms ParT, stressing the significant role played by the MIA block. Furthermore, the effectiveness of the particle attention blocks was already established in the ParT paper [36], and the impact of the class attention blocks was tested in the CaiT framework [38].
Regarding the impact of hyperparameter choices on model performance, we found that MIParT is not overly sensitive to hyperparameter settings, but is more influenced by the overall network architecture. In particular, increasing the number of MIA blocks and particle attention blocks generally leads to better performance at the cost of increased complexity. Architectural modifications show that placing MIA blocks before particle attention blocks is optimal. Placing MIA blocks after particle attention blocks or alternating them significantly reduces effectiveness, sometimes to the point of performing worse than ParT. We consider that MIA blocks function similarly to embeddings, allowing better integration of interaction information into the jets for improved information fusion and classification.
Table 4 presents the parameters, FLOPs (floating-point operations), and accuracy of various models on the top tagging and quark-gluon datasets. Parameters denote the number of trainable elements within a model, which indicates its capacity to learn; more parameters generally imply a more complex model. FLOPs measure the computational cost of processing data through the model. Reducing the number of parameters typically reduces the FLOPs, simplifying the model and making it more computationally efficient.
| Model | Accuracy (TOP) | Accuracy (QG) | Params | FLOPs |
|---|---|---|---|---|
| PFN | — | — | 86.1k | 4.62M |
| P-CNN | 0.930 | — | 354k | 15.5M |
| ParticleNet | 0.940 | — | 370k | 540M |
| ParT | 0.940 | 0.849 | 2.14M | 340M |
| MIParT (ours) | 0.942 | 0.851 | 720.9k | 180M |
| MIParT-L f.t. (ours) | 0.944 | 0.853 | 2.38M | 368M |

Table 4. Parameters, FLOPs, and accuracy for various models on the top tagging (TOP) and quark-gluon (QG) datasets. Parameters refer to the number of trainable elements within a model, while FLOPs (floating-point operations) measure the computational complexity involved in processing data through the model.
However, reducing the number of parameters to reduce FLOPs usually results in lower accuracy. In contrast, our MIParT model has only 30% of the parameters and 53% of the FLOPs of the ParT model, significantly reducing model complexity. Despite this reduction, there is no compromise in accuracy; in fact, accuracy improves on both top tagging and quark-gluon datasets. For the fine-tuned version of MIParT-L, the parameters and FLOPs are comparable to those of the ParT model, but with a slight improvement in accuracy.
Table 5 presents a performance comparison of various models on different sizes of the JetClass dataset. We display the results for the MIParT-L model alongside those of ParticleNet [34] and ParT [36] across the 2M, 10M, and 100M JetClass datasets. Note that the performance of the models improves as the dataset size increases. Specifically, MIParT-L and ParT exhibit nearly identical effectiveness on very large datasets, outperforming ParticleNet. In addition, our evaluation on the JetClass dataset serves to test the ability of MIParT to generalize across different classification tasks. The JetClass dataset represents a more complex classification challenge, which includes identifying Higgs boson decays to charm quarks. Our MIParT model shows remarkable stability on this task, highlighting its generalization capabilities.
| Model | Accuracy (all classes) | AUC (all classes) | $ H\to b \bar{b} $ Rej50% | $ H\to c \bar{c} $ Rej50% | $ H\to g g $ Rej50% | $ H\to 4 q $ Rej50% | $ H\to \ell \nu q q' $ Rej99% | $ t\to b q q' $ Rej50% | $ t\to b \ell \nu $ Rej99.5% | $ W\to q q' $ Rej50% | $ Z\to q q' $ Rej50% |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ParticleNet (2 M) | 0.828 | 0.9820 | 5540 | 1681 | 90 | 662 | 1654 | 4049 | 4673 | 260 | 215 |
| ParticleNet (10 M) | 0.837 | 0.9837 | 5848 | 2070 | 96 | 770 | 2350 | 5495 | 6803 | 307 | 253 |
| **ParticleNet (100 M)** | 0.844 | 0.9849 | 7634 | 2475 | 104 | 954 | 3339 | 10526 | 11173 | 347 | 283 |
| ParT (2 M) | 0.836 | 0.9834 | 5587 | 1982 | 93 | 761 | 1609 | 6061 | 4474 | 307 | 236 |
| ParT (10 M) | 0.850 | 0.9860 | 8734 | 3040 | 110 | 1274 | 3257 | 12579 | 8969 | 431 | 324 |
| **ParT (100 M)** | 0.861 | 0.9877 | 10638 | 4149 | 123 | 1864 | 5479 | 32787 | 15873 | 543 | 402 |
| MIParT-L (2 M) | 0.837 | 0.9836 | 5495 | 1940 | 95 | 819 | 1778 | 6192 | 4515 | 311 | 242 |
| MIParT-L (10 M) | 0.850 | 0.9861 | 8000 | 3003 | 112 | 1281 | 3650 | 16529 | 9852 | 440 | 336 |
| **MIParT-L (100 M)** | 0.861 | 0.9878 | 10753 | 4202 | 123 | 1927 | 5450 | 31250 | 16807 | 542 | 402 |

Table 5. Performance comparison of various models on different sizes of the JetClass dataset. This table lists the results for the MIParT-L model alongside ParticleNet [34] and ParT [36] across the 2M, 10M, and 100M JetClass datasets. Metrics of other models are extracted from published results. Models trained using the full 100M training dataset are highlighted in bold text.
Here, we discuss the improvements attributed to pre-training on the JetClass dataset and the subsequent performance gains observed on the top tagging and quark-gluon datasets. These three jet tagging tasks differ in their objectives: the JetClass dataset focuses on identifying Lorentz-boosted W, Z, and Higgs bosons and top quarks, the top tagging dataset aims to identify top quarks, and the quark-gluon dataset aims to distinguish between quark and gluon jets. The improvements across such diverse tasks suggest that MIParT learned more generalized jet properties during the pre-training phase. These characteristics are effectively transferable to other tasks, demonstrating the robustness of the model and its adaptability to different jet identification challenges. This capability highlights the potential of pre-trained models to improve performance in a wide range of applications by capturing and exploiting general features applicable to multiple scenarios.
Regarding the interpretability of MIParT, it is important to acknowledge that, as a transformer-based neural network, its interpretability remains limited, as is the case for many neural networks currently in use. Despite these challenges, the CMS collaboration has successfully used the graph neural network ParticleNet [34], another model that lacks full interpretability, to search for Higgs boson decays to charm quarks [48]. This success underscores that the lack of interpretability does not prevent the use of neural network models in particle physics experiments. In fact, ParticleNet, which functions as a non-interpretable "black box" model, is already playing a significant role in particle physics experiments, demonstrating that the non-interpretable nature of these models should not be a barrier to their use in advancing scientific discovery.
Jet tagging with more-interaction particle transformer
- Received Date: 2024-08-14
- Available Online: 2025-01-15
Abstract: In this paper, we introduce the More-Interaction Particle Transformer (MIParT), a novel deep-learning neural network designed for jet tagging. This framework incorporates our own design, the More-Interaction Attention (MIA) mechanism, which increases the dimensionality of particle interaction embeddings. We tested MIParT using the top tagging and quark-gluon datasets. Our results show that MIParT not only matches the accuracy and AUC of LorentzNet and a series of Lorentz-equivariant methods, but also significantly outperforms the ParT model in background rejection. Specifically, it improves background rejection by approximately 25% with a signal efficiency of 30% on the top tagging dataset and by 3% on the quark-gluon dataset. Additionally, MIParT requires only 30% of the parameters and 53% of the computational complexity needed by ParT, proving that high performance can be achieved with reduced model complexity. For very large datasets, we double the dimension of particle embeddings, referring to this variant as MIParT-Large (MIParT-L). We found that MIParT-L can further capitalize on the knowledge from large datasets. From a model pre-trained on the 100M JetClass dataset, the background rejection performance of fine-tuned MIParT-L improves by 39% on the top tagging dataset and by 6% on the quark-gluon dataset, surpassing that of fine-tuned ParT. Specifically, the background rejection of fine-tuned MIParT-L improves by an additional 2% compared to that of fine-tuned ParT. These results suggest that MIParT has the potential to increase the efficiency of benchmarks for jet tagging and event identification in particle physics.