As experiments grow more complex, the demand for efficient access to conditions data has increased. To address this, the HEP Software Foundation (HSF) proposed a reference architecture, NopayloadDB, which stores metadata and file URLs instead of payloads. However, NopayloadDB lacks a centralized logging subsystem. To address this limitation, this project proposes an intelligent logging pipeline integrated with NopayloadDB. The pipeline combines advanced log aggregation, scalable storage, and deep learning-based anomaly detection to reduce downtime and improve operations. The result is enhanced reliability, maintainability, and scalability of conditions database services in modern HEP experiments.
The project extended NopayloadDB, the HSF reference conditions database [1], by introducing a centralized and intelligent logging pipeline. The main goals were centralized log aggregation, structured parsing and storage for easier querying, and sequence-based anomaly detection with DeepLog. The pipeline also aimed to support real-time monitoring and diagnostics for different stakeholders, to detect issues before they escalate, and to provide insights for tuning system parameters. The design emphasized scalability, modularity, and compatibility with OpenShift deployments.
I am passionate about contributing to systems that are accessible and exist in both open-source and open-science settings. My background in distributed systems, cloud computing, and data-intensive applications aligns closely with this project. I was excited to contribute skills that help build a scalable and intelligent system and to share the results with the broader community. This opportunity also allowed me to deepen my expertise in log analysis and machine learning while advancing my commitment to keeping technology and scientific knowledge openly available and accessible.
The logging pipeline is deployed on Minikube and is built around three containerized modules: Process, Monitor, and Predict. Each component, shown in Figure 1, is described below.
Figure 1: Intelligent Logging Pipeline
In the Process module, NopayloadDB generates log data from both the Django application and the PostgreSQL database. This data is collected, filtered, and parsed by Fluent Bit, and the processed logs are published to a Kafka topic. Alloy acts as a forwarding agent, consuming the structured logs from the Kafka topic and forwarding them to Loki.
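For illustration, the sketch below shows the kind of Kafka-to-Loki hand-off that Alloy performs, written as a small Python consumer against Loki's standard push API. The broker address, topic name, and stream labels are assumptions for the example, not the project's actual configuration.

```python
import time
import requests
from kafka import KafkaConsumer

# Illustrative sketch of the Kafka -> Loki hand-off that Alloy performs in the
# pipeline; broker, topic, and label names are assumptions, not project settings.
consumer = KafkaConsumer("nopayloaddb-logs", bootstrap_servers="kafka:9092")
LOKI_PUSH_URL = "http://loki:3100/loki/api/v1/push"

for message in consumer:
    line = message.value.decode("utf-8")
    payload = {
        "streams": [{
            "stream": {"job": "nopayloaddb", "source": "fluent-bit"},
            # Loki expects [<unix time in nanoseconds as string>, <log line>]
            "values": [[str(time.time_ns()), line]],
        }]
    }
    requests.post(LOKI_PUSH_URL, json=payload)  # push one entry to Loki
```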
The Monitor module focuses on scalable storage and visualization. Loki is a distributed, horizontally scalable log aggregation system composed of several key components [2]. The Distributor receives logs from Alloy, validates them, and routes them to Ingesters while balancing the load. The Ingester temporarily stores logs in memory, compresses them, and forwards them to long-term storage in MinIO. The Querier retrieves the required logs from MinIO and forwards them to Drain3 for prediction and to Grafana for visualization. Figure 2 shows the flow of data within the customized Loki architecture.
Figure 2: Customized Loki
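As a hedged illustration of how downstream consumers retrieve logs through the Querier, the snippet below calls Loki's query_range HTTP endpoint; the Loki address and LogQL selector are assumptions for the example, not the project's deployment values.

```python
import requests

# Minimal sketch: pull recent NopayloadDB logs back out of Loki via its HTTP API.
resp = requests.get(
    "http://loki:3100/loki/api/v1/query_range",
    params={"query": '{job="nopayloaddb"}', "limit": 100},
)
for stream in resp.json()["data"]["result"]:
    for timestamp, line in stream["values"]:
        print(timestamp, line)  # each entry is (nanosecond timestamp, log line)
```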
The Predict module adds intelligence to the pipeline. Drain3 parses raw logs into structured template IDs, which are written in sequence to Redis. The template ID sequences are then processed by DeepLog for sequence-based anomaly detection, and any detected anomalies are visualized in Grafana for debugging and monitoring.
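A minimal sketch of this parsing step is shown below, assuming the drain3 and redis Python packages; the Redis host and list key are illustrative, not the project's actual settings.

```python
import redis
from drain3 import TemplateMiner

# Hedged sketch: convert raw log lines into Drain3 template IDs and append them
# to a Redis list that DeepLog later reads as its input sequence.
r = redis.Redis(host="redis", port=6379)
miner = TemplateMiner()

def handle_log_line(line: str) -> None:
    result = miner.add_log_message(line)          # mine or update the matching template
    template_id = result["cluster_id"]            # numeric ID of the log template
    r.rpush("deeplog:template_ids", template_id)  # extend the sequence for DeepLog

handle_log_line("User alice uploaded file payload_42.db")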
DeepLog is an LSTM-based model that learns log patterns from normal execution [3]. It flags an anomaly when incoming log patterns deviate from those learned during normal execution. Over time, the model adapts to new log patterns and constructs workflows from the underlying system log, so that once an anomaly is detected, users can diagnose it and perform root cause analysis effectively. The main DeepLog configuration parameter is top-k: the number of most likely next events that the model treats as "normal". If k is set to 2, the two events with the highest predicted probabilities form the top-k set of most probable next events.
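A minimal sketch of such a next-event model is shown below, assuming PyTorch; the class name and layer sizes are illustrative and not the configuration used in this project.

```python
import torch
import torch.nn as nn

class NextTemplateLSTM(nn.Module):
    """Sketch in the spirit of DeepLog: predict the next log-template ID
    from a window of previous template IDs."""
    def __init__(self, num_templates: int, embed_dim: int = 32,
                 hidden_dim: int = 64, num_layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(num_templates, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_templates)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, window) of template IDs; output: logits over the next ID
        out, _ = self.lstm(self.embed(x))
        return self.fc(out[:, -1, :])
```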
Suppose a log system has several events, each represented by a unique ID. The model takes a sequence of past events and predicts what the next event is likely to be. For example, Table 1 shows unique IDs for the set of events of a user uploading a file. From the past events, DeepLog predicts the upcoming event by assigning a probability to each unique event. Note that DeepLog consumes this set of events in an arbitrary (random) order.
| Unique ID | Event | Probability |
|-----------|-------------|-------------|
| 0 | Login | 0.7 |
| 1 | Upload File | 0.4 |
| 2 | Select File | 0.6 |
| 3 | Logout | 0.25 |
| 4 | Submit File | 0.3 |

Table 1: Set of Events
Here the model considers "Login" the most likely next event, followed by "Select File", then "Upload File", and so on. Hence, the predicted sequence is [Login, Select File, Upload File, Submit File, Logout], or [0, 2, 1, 4, 3] in terms of unique IDs. With k=2, the model's top 2 predictions are [Login, Select File], while the true event is Upload File. Since the true event does not appear in the top 2 predictions, this case is flagged as an anomaly. With k=3, the top 3 predictions are [Login, Select File, Upload File], and the true event Upload File is included, so it is considered normal. In practice, the model checks whether the true event ID appears within the top-k predicted IDs: if it does not, the sequence is labelled as an anomaly; otherwise, it is treated as normal.
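This top-k rule can be written directly in code. The sketch below uses the probabilities from Table 1 as stand-in model scores and flags an anomaly exactly when the true event ID is missing from the top-k predictions; the function name and tensor values are illustrative only.

```python
import torch

def is_anomalous(scores: torch.Tensor, true_id: int, k: int) -> bool:
    """DeepLog-style check: anomalous if the true next template ID is not
    among the model's top-k most probable predictions."""
    topk_ids = torch.topk(scores, k).indices.tolist()
    return true_id not in topk_ids

# Scores per ID from Table 1: Login, Upload File, Select File, Logout, Submit File
scores = torch.tensor([0.7, 0.4, 0.6, 0.25, 0.3])
print(is_anomalous(scores, true_id=1, k=2))  # True  -> Upload File not in top-2: anomaly
print(is_anomalous(scores, true_id=1, k=3))  # False -> Upload File in top-3: normal
```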
The intelligent logging pipeline demonstrated log collection and aggregation for Kubernetes-based clusters. Heterogeneous logs were parsed and formatted into structured sequences, and DeepLog was integrated into the pipeline, showing the feasibility of automated real-time monitoring and anomaly detection. Grafana dashboards provided tailored access for different user roles.
This research establishes a baseline for how system observability and diagnostics can benefit from artificial intelligence. It will also benefit the open-source community, scientific research, and enterprise applications. From the experiment's point of view, it supports more reliable and reproducible physics results and enables HEP experiments to allocate resources efficiently based on the insights the system provides. It also paves the way for applying these techniques beyond HEP, for example in large-scale cloud applications and enterprise systems.
I enjoyed my time working with my mentors Ruslan, Michel, and John. This project was the first time I contributed to CERN and BNL to such an extent, and it gave me a sense of accomplishment in my professional career. The consistent feedback on both the project and the publication helped me greatly in shaping the work. My mentors opened a path for me to present the project on a larger scale, increasing its potential to be adopted by other experiments. I am happy to have been mentored by such experienced and knowledgeable professionals.
[1] R. Mashinistov, L. Gerlach, P. Laycock, A. Formica, G. Govi, and C. Pinkenburg, “The HSF Conditions Database Reference Implementation,” EPJ Web of Conferences, vol. 295, p. 01051, Jan. 2024, doi: https://doi.org/10.1051/epjconf/202429501051.
[2] “Grafana Loki | Grafana Loki documentation,” Grafana Labs, 2025. https://grafana.com/docs/loki/latest (accessed Jul. 30, 2025).
[3] M. Du, F. Li, G. Zheng, and V. Srikumar, “DeepLog: Anomaly Detection and Diagnosis from System Logs through Deep Learning,” in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, Oct. 2017, doi: https://doi.org/10.1145/3133956.3134015.