As China’s first digital-only bank, WeBank has grown rapidly in recent years and now serves hundreds of millions of active users. An IT system failure would therefore degrade the experience of a very large number of users.
To ensure system stability and improve key indicators such as MTTR (Mean Time To Repair) and MTBF (Mean Time Between Failures), the O&M team has invested for a long time in the RCA (root cause analysis) project, which aims to detect anomalies quickly and perform effective RCA so that systems recover swiftly. The accuracy of the RCA module currently holds steady at around 80%.
Root cause location is driven by data analysis, so data accuracy and integrity are crucial to the RCA module. The O&M team started the AIOps project on top of the solid data provided by the second-generation maintenance systems.
CMDB stores full-cycle information about configuration items and their relationships. It records these relationships from the bottom layer to the top layer, and the RCA module can access them for correlation analysis.
The maintenance logs include WEMQ logs, business operation logs, and application logs. WEMQ is a reliable message bus developed by WeBank that lets systems work together through messaging. Business operation logs record formatted product and business information. Application logs contain the stack traces of abnormal events. As shown in the following figure, we can generate a transaction tree by analyzing the WEMQ message log.
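The idea of turning message logs into a transaction tree can be sketched as follows. This is a minimal illustration, not WeBank's implementation: the real WEMQ log schema is not described here, so the fields `msg_id`, `parent_id`, `subsystem`, and `cost_ms` are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical message-log records for one transaction.
# Field names are assumptions; the real WEMQ log format may differ.
records = [
    {"msg_id": "m1", "parent_id": None, "subsystem": "gateway", "cost_ms": 120},
    {"msg_id": "m2", "parent_id": "m1", "subsystem": "account", "cost_ms": 30},
    {"msg_id": "m3", "parent_id": "m1", "subsystem": "payment", "cost_ms": 80},
    {"msg_id": "m4", "parent_id": "m3", "subsystem": "ledger", "cost_ms": 60},
]


def build_transaction_tree(records):
    """Return (roots, children); children maps a message id to its child records."""
    children = defaultdict(list)
    roots = []
    for rec in records:
        if rec["parent_id"] is None:
            roots.append(rec)
        else:
            children[rec["parent_id"]].append(rec)
    return roots, children


def render(node, children, depth=0):
    """Return one indented line per call in the tree, in depth-first order."""
    lines = ["  " * depth + f'{node["subsystem"]} ({node["cost_ms"]} ms)']
    for child in children[node["msg_id"]]:
        lines.extend(render(child, children, depth + 1))
    return lines


roots, children = build_transaction_tree(records)
tree_lines = render(roots[0], children)
print("\n".join(tree_lines))
```

Each indented line is one hop in the transaction, so a slow or failing branch can be read directly off the tree.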
IMS is the cornerstone of the O&M team's work because it collects indicator data from IT systems. This monitoring system also supplies the RCA module with real-time time-series data, that is, indicator data and alerts from third-party systems.
AOMP is the change management system. It records change data, which the RCA module combines with other data when conducting RCA.
Techniques Applied in RCA
We conducted intensive research before choosing the technology for the RCA module. Although deep learning and machine learning deliver excellent anomaly detection results, we decided to adopt an expert system for the RCA module because we do not have enough abnormal data. An expert system is also more explainable than machine learning: it can show how the root cause was derived, which is closer to the way people reason.
The Expert System and Knowledge Graph
Initially, RCA was built with Drools, a business rules management system. Experts continually improved and enriched the Drools rules so that the RCA module could find root causes. However, this approach had two drawbacks. The first was a lack of data transparency. We solved it by storing abnormal data in a graph database inside the RCA module; each module shares its otherwise private data through the graph database, which makes the data easy to access. The second was rule maintenance: although Drools decouples knowledge from code, keeping a vast number of rules up to date is difficult. To reduce that burden significantly, we created deduction models for different anomaly types, which find the root cause in the graph database.
The Design of RCA
RCA design consists of building the knowledge graph, applying analysis methods such as deductive and inductive reasoning, and applying expert rules to the knowledge graph. The design of an anomaly's knowledge graph is therefore crucial to RCA design.
The Design of Knowledge Graph for Anomalies
The RCA knowledge graph contains dynamic data, such as business operation logs and alerts, and static data, such as configuration data from CMDB. Together they form a complete knowledge graph of an anomaly. As the following figure shows, the knowledge graph is a DAG (directed acyclic graph) that the RCA module analyzes from left to right to deduce the root cause.
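A left-to-right deduction over such a DAG can be sketched with a plain adjacency dictionary standing in for the graph database. The node names and edge layout below are hypothetical and only illustrate the shape of the traversal, not the production schema.

```python
# Hypothetical anomaly graph: anomaly -> logs -> subsystem -> alerts/changes.
# A plain dict is used here in place of a real graph database.
edges = {
    "anomaly:latency_spike": ["log:business", "log:wemq"],
    "log:business": ["subsystem:APS"],
    "log:wemq": ["subsystem:APS"],
    "subsystem:APS": ["alert:cpu_high", "change:release"],
    "alert:cpu_high": [],
    "change:release": [],
}


def candidate_causes(graph, start):
    """Follow edges from the anomaly node and return the leaf nodes reached."""
    seen, leaves, stack = set(), [], [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # a node reachable by several paths is visited once
        seen.add(node)
        successors = graph.get(node, [])
        if not successors:
            leaves.append(node)
        # reversed() preserves the left-to-right order of each adjacency list
        stack.extend(reversed(successors))
    return leaves


causes = candidate_causes(edges, "anomaly:latency_spike")
print(causes)
```

The leaves are only candidates; in the approach described here, expert rules and deduction models decide which of them is the actual root cause.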
The RCA process can be divided into three stages: information collection, root cause analysis, and root cause location. In the first stage, the RCA module collects comprehensive data to build the knowledge graph, including:
1. Events, which contain information about anomalies such as start and end times.
2. The four key performance indicators (transaction volume, business success rate, system success rate, and latency) that we use to detect whether an anomaly has occurred.
a. Transaction volume is the number of transactions per unit time.
b. The business success rate measures the number of successfully processed business requests per unit time. It decreases when a business request fails. A business failure is a failure that conforms to business logic, for example, an incorrect verification code.
c. The system success rate measures the number of transactions per unit time that complete without a system failure. It decreases when system failures occur, for example, an abnormal connection to the database.
d. Latency is the time taken to complete a transaction, measured over each unit of time.
3. A UUID (universally unique identifier), which is generated every time a user operates the product and is used to query business logs and WEMQ logs.
4. Business logs, which each system prints in a standard format, including specific business parameters, subsystem, host, DCN (Data Center Node), and so on.
5. WEMQ logs, which include host data and time-consumption data and are used to generate the complete transaction trace.
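As an illustration of the four indicators in item 2, the sketch below computes them for one time window. The record fields and status values are hypothetical. It also assumes that the success rates are expressed as a fraction of the window's transactions and that latency is reported as a mean; the text above does not fix either choice.

```python
from statistics import mean

# Hypothetical per-transaction records for a single time window.
window = [
    {"uuid": "u1", "status": "ok", "latency_ms": 100},
    {"uuid": "u2", "status": "ok", "latency_ms": 140},
    {"uuid": "u3", "status": "business_fail", "latency_ms": 90},
    {"uuid": "u4", "status": "system_fail", "latency_ms": 470},
]


def four_kpis(transactions):
    """Compute volume, both success rates, and mean latency for one window."""
    volume = len(transactions)
    if volume == 0:
        # No traffic: the rates and latency are undefined for this window.
        return {
            "volume": 0,
            "business_success_rate": None,
            "system_success_rate": None,
            "avg_latency_ms": None,
        }
    business_failures = sum(t["status"] == "business_fail" for t in transactions)
    system_failures = sum(t["status"] == "system_fail" for t in transactions)
    return {
        "volume": volume,
        # Assumption: each rate is the share of transactions without that failure type.
        "business_success_rate": 1 - business_failures / volume,
        "system_success_rate": 1 - system_failures / volume,
        "avg_latency_ms": mean(t["latency_ms"] for t in transactions),
    }


kpis = four_kpis(window)
print(kpis)
```

Comparing these values against a baseline for the same time of day is one simple way to decide whether a window is anomalous.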
In the root cause analysis stage, the RCA module analyzes the transaction logs to locate the anomaly’s root cause. If an anomaly comes with a large number of error logs, and the analysis shows that these logs are concentrated in one subsystem, that subsystem is taken to be the root cause of the anomaly.
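The concentration check can be sketched as a simple count per subsystem. The log fields and the `min_share` threshold are assumptions chosen for illustration; the actual deduction models are not specified in this article.

```python
from collections import Counter

# Hypothetical error-log entries collected for one anomaly.
error_logs = [
    {"uuid": "u1", "subsystem": "APS"},
    {"uuid": "u2", "subsystem": "APS"},
    {"uuid": "u3", "subsystem": "APS"},
    {"uuid": "u4", "subsystem": "gateway"},
]


def suspect_subsystem(logs, min_share=0.6):
    """Return (subsystem, share) if one subsystem holds at least min_share of the errors."""
    counts = Counter(entry["subsystem"] for entry in logs)
    if not counts:
        return None
    name, hits = counts.most_common(1)[0]
    share = hits / len(logs)
    # min_share is an illustrative threshold, not a value from the source.
    return (name, share) if share >= min_share else None


suspect = suspect_subsystem(error_logs)
print(suspect)
```

When no subsystem reaches the threshold, the function returns `None`, which signals that the errors are too scattered for this heuristic alone.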
In the root cause location stage, the RCA module refines the description of the root cause using the change data and alert data related to this subsystem.
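One way to picture this last stage is to attach the changes and alerts that fall close to the anomaly's start time to the suspected subsystem. The records, field names, and the 30-minute window below are all hypothetical; this is a sketch of the idea, not the module's actual logic.

```python
from datetime import datetime, timedelta

# Hypothetical anomaly start time, change records, and alert records.
anomaly_start = datetime(2021, 1, 1, 10, 0)
changes = [
    {"subsystem": "APS", "action": "release", "time": datetime(2021, 1, 1, 9, 55)},
    {"subsystem": "gateway", "action": "config update", "time": datetime(2021, 1, 1, 9, 58)},
    {"subsystem": "APS", "action": "release", "time": datetime(2021, 1, 1, 6, 0)},
]
alerts = [
    {"subsystem": "APS", "title": "response time too high", "time": datetime(2021, 1, 1, 10, 1)},
]


def related(records, subsystem, start, window=timedelta(minutes=30)):
    """Keep records of the given subsystem that lie within `window` of the anomaly start."""
    return [
        r for r in records
        if r["subsystem"] == subsystem and abs(r["time"] - start) <= window
    ]


def describe_root_cause(subsystem, start, changes, alerts):
    """Build a one-line description enriched with nearby changes and alerts."""
    parts = [f"Suspected subsystem: {subsystem}"]
    for c in related(changes, subsystem, start):
        parts.append(f"change '{c['action']}' at {c['time']:%H:%M}")
    for a in related(alerts, subsystem, start):
        parts.append(f"alert '{a['title']}' at {a['time']:%H:%M}")
    return "; ".join(parts)


summary = describe_root_cause("APS", anomaly_start, changes, alerts)
print(summary)
```

The earlier release at 06:00 is ignored because it falls outside the window, which keeps the description focused on events that plausibly relate to the anomaly.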
Production Case Study
The RCA process is divided into three phases.
Phase 1 — Information collection
In the first phase, the RCA module creates a knowledge graph for the anomaly from information such as business operation logs, alerts, change data, and configuration data.
The knowledge graph is made up of the indicator information, business flow logs, WEMQ logs, and alerts related to this anomaly.
Phase 2- Root Cause Analysis (Locate the abnormal subsystem)
In the root cause analysis phase, the RCA module applies deduction models to find the abnormal subsystem. As shown in the following figure, data such as the IP and DCN (Data Center Node) of the abnormal subsystem is extracted in the knowledge graph.
Phase 3 — Root Cause Location
In phase 3, the RCA module applies domain knowledge to the data from the previous step to find the root cause. In this case, the release of the APS caused the anomaly: the APS was in its release stage and produced all of the time-consuming log entries. After these three steps, ‘Knowing’ found the root cause of the anomaly.
Analyzing and consolidating data from business logs plays a significant role in finding the root cause of an anomaly. The RCA system builds on domain knowledge, graph database techniques, and the knowledge graph, and its performance can be improved further with credible historical anomaly data, which helps us gain more domain knowledge.
Chinese author: Jinzan Ye
Translator: Danny Chen, Sookie Tao
Editors: WeBank AIOps Team