AIOps Series IV: Generating Transaction Tree with Transaction Message
WeBank launched its root cause analysis (RCA) project after the accuracy of anomaly detection had improved significantly. Defining the basis for locating and reasoning about root causes has always been a challenge in RCA. By analyzing historical cases, we found that the transaction trace is the beacon that guides us to the root of a problem. All of WeBank’s transactions communicate via the message bus, which makes it possible to generate a transaction tree for every transaction.
WeBank offers a financial-grade message bus solution built on its self-developed WEMQ (WeBank Message Queue). Each transaction is assigned a unique business sequence number and communicates through WEMQ, which records the original message log, including the sender, receiver, log point, sending time, receiving time, business sequence number, message content, and so on. Based on a set of pre-configured rules and these message logs, WeBank AIOps employs an algorithm to generate a unique transaction tree for each transaction. The transaction tree can then be used for alert analysis, RCA, and other functions.
Process the original message log
As illustrated in Figure 1, the rule engine processes the original message log into message pairs. Each message pair consists of a request message and a response message, which together form a single call. Enriching a message pair with CMDB (Configuration Management Database) data generates one or more tree nodes, one for each receiver. The discrete tree nodes are then connected into multiple tree chains according to their upstream and downstream nodes. The tree chains are consolidated into one transaction tree based on specific rules. After adjustments such as vertical and horizontal merging, the transaction tree is finalized. Figure 2 shows an example.
Acquire the original message list for a sequence number
First, we need to acquire all message logs of the transaction. In our experience, most transactions complete within three minutes. Message logs are stored in Redis after pre-processing, and three minutes later the logs that share a sequence number are aggregated into a complete transaction log. Each transaction is assigned a unique sequence number, under which there are several original messages. In addition, each call within a transaction has a unique id that identifies the two subsystems involved. For example, subsystem A sends a request to subsystem B, and B responds to subsystem A.
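The aggregation step can be sketched as follows. This is a minimal illustration and not WeBank's implementation: an in-memory dictionary stands in for Redis, and the RawMessage fields, the TransactionBuffer class, and its method names are all assumptions made for the example. Only the three-minute window and the grouping by sequence number come from the text.

```python
from collections import defaultdict
from dataclasses import dataclass

AGGREGATION_WINDOW_SECONDS = 180  # "most transactions complete within three minutes"


@dataclass(frozen=True)
class RawMessage:
    # Hypothetical record layout, loosely based on the fields WEMQ is said to log.
    seq_no: str       # business sequence number of the transaction
    unique_id: str    # identifies one call between two subsystems
    log_point: str    # where the log line was recorded
    sender: str
    receiver: str
    timestamp: float  # seconds since epoch


class TransactionBuffer:
    """In-memory stand-in (assumption) for the Redis store mentioned in the article."""

    def __init__(self, window=AGGREGATION_WINDOW_SECONDS):
        self.window = window
        self._messages = defaultdict(list)  # seq_no -> messages received so far
        self._first_seen = {}               # seq_no -> timestamp of first message added

    def add(self, msg):
        self._first_seen.setdefault(msg.seq_no, msg.timestamp)
        self._messages[msg.seq_no].append(msg)

    def pop_complete(self, now):
        """Return {seq_no: time-sorted messages} for transactions whose window has elapsed."""
        ready = [s for s, t in self._first_seen.items() if now - t >= self.window]
        result = {}
        for seq_no in ready:
            result[seq_no] = sorted(self._messages.pop(seq_no), key=lambda m: m.timestamp)
            del self._first_seen[seq_no]
        return result
```

Calling pop_complete periodically with the current time hands each transaction to the next stage once its window has passed; messages that arrive later than that would need separate handling, which this sketch does not cover.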
Convert original message into message pair
A call between two subsystems consists of two messages: one request and one response. Once the complete transaction log has been acquired, original messages with the same unique id are merged into a message pair, which represents a call between two subsystems. A message pair has a unified format filled with information extracted from the logs.
Messages with the same unique id are grouped by log point, and the messages from the same log point are merged into a single message. A message pair is then initialized: the confidence level of each log point is calculated from historical data, and the rule engine extracts information from the log points into req_message and rsp_message.
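A rough sketch of this grouping and merging is shown below. The confidence table, the log point names, the "kind" field that marks a record as request or response, and the rule "take the log point with the highest confidence" are all assumptions standing in for the rule engine, which the article does not describe in detail.

```python
from collections import defaultdict

# Hypothetical confidence per log point; the article derives these from historical data.
LOG_POINT_CONFIDENCE = {"sender_out": 0.9, "receiver_in": 0.8, "bus_in": 0.7}


def merge_same_log_point(records):
    """Merge records from one log point into a single dict; later non-empty values win."""
    merged = {}
    for rec in records:
        merged.update({k: v for k, v in rec.items() if v is not None})
    return merged


def build_message_pairs(messages):
    """Group dict records by unique id and log point, then build one pair per unique id.

    Each record is assumed to carry 'unique_id', 'log_point' and 'kind' ('req' or 'rsp').
    """
    by_call = defaultdict(lambda: defaultdict(list))
    for m in messages:
        by_call[m["unique_id"]][(m["kind"], m["log_point"])].append(m)

    pairs = {}
    for uid, groups in by_call.items():
        pair = {"unique_id": uid, "req_message": None, "rsp_message": None}
        for kind in ("req", "rsp"):
            candidates = [
                (LOG_POINT_CONFIDENCE.get(log_point, 0.0), merge_same_log_point(recs))
                for (k, log_point), recs in groups.items()
                if k == kind
            ]
            if candidates:
                # Stand-in rule: trust the log point with the highest confidence.
                pair[kind + "_message"] = max(candidates, key=lambda c: c[0])[1]
        pairs[uid] = pair
    return pairs
```

A real rule engine would likely combine fields from several log points rather than pick a single winner; the single-winner rule is used here only to keep the example short.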
The information in the CMDB can supplement the message pair’s missing attributes during preprocessing. The CMDB also holds the complete correspondence between subsystems and services, from which we can get all possible downstream nodes. According to the rsp_message in the message pair and the rule engine, we transform the message pair into tree nodes. Depending on the responders, multiple tree nodes can be generated, one node per responder.
A tree node cannot be created if no subsystem in the CMDB provides a service that matches the service in the message pair. Such a message pair is referred to as an isolated message pair.
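The conversion from message pair to tree nodes might look like the sketch below. The CMDB extract, the service names, the TreeNode fields, and the "responders" list inside rsp_message are hypothetical. The fallback to every provider listed in the CMDB when no responder was observed is also an assumption, based on the remark that the CMDB yields all possible downstream nodes.

```python
from dataclasses import dataclass

# Hypothetical CMDB extract (assumption): service id -> subsystems that provide it.
CMDB_PROVIDERS = {"svc.pay": ["B"], "svc.query": ["C", "D"]}


@dataclass(frozen=True)
class TreeNode:
    unique_id: str
    sender: str
    receiver: str
    service: str


def pair_to_nodes(pair, providers=CMDB_PROVIDERS):
    """Return (tree_nodes, is_isolated) for one message pair."""
    req = pair["req_message"]
    rsp = pair.get("rsp_message") or {}
    known = providers.get(req["service"], [])
    if not known:
        # No subsystem in the CMDB provides this service: isolated message pair.
        return [], True
    # Keep only responders that the CMDB confirms as providers of the service.
    responders = [r for r in rsp.get("responders", []) if r in known]
    # Assumption: with no observed responder, every known provider is a candidate.
    receivers = responders or known
    nodes = [TreeNode(pair["unique_id"], req["sender"], r, req["service"]) for r in receivers]
    return nodes, False
```

Returning the isolation flag alongside the nodes lets a caller count isolated message pairs per transaction without a second CMDB lookup.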
Convert tree node into tree chain
So far, the original messages of a transaction have been transformed into multiple tree nodes; the next step is to connect them into tree chains according to specific rules. First, select a root node and search for downstream nodes to connect. The subsystem id of the root node is recorded in the transaction sequence number. All nodes are then traversed: if the sender attributes of a node (such as subsystem, IP, etc.) are the same as the receiver attributes of the root node, the node is spliced behind the root node to form a chain. These steps are repeated for the remaining nodes, attaching each node that meets the condition to the leaf node of the matching chain.
The discrete tree nodes are thus converted into several chains, and a tree node may appear in multiple chains. If the subsystem id of the root node cannot be obtained from the sequence number, the algorithm takes the subsystem ids of the first three nodes sorted by time and tries each one in turn as the root node to generate chains. If one attempt merges all nodes into chains, that solution is applied. If not, the solution with the fewest isolated nodes is selected.
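The chain-building rules can be sketched as below. The Node fields and function names are hypothetical, only the subsystem name is compared (the article also mentions attributes such as IP), and treating any node sent by the root subsystem as a starting point is an assumption.

```python
from collections import namedtuple

# Hypothetical node layout; only the subsystem names are matched in this sketch.
Node = namedtuple("Node", "unique_id sender receiver req_time")


def build_chains(nodes, root_subsystem):
    """Return (chains, isolated): all root-to-leaf chains plus nodes on no chain."""
    chains = []

    def extend(chain):
        leaf = chain[-1]
        # A node follows the leaf when its sender matches the leaf's receiver.
        followers = [n for n in nodes if n.sender == leaf.receiver and n not in chain]
        if not followers:
            chains.append(chain)
            return
        for n in followers:
            extend(chain + [n])

    for start in [n for n in nodes if n.sender == root_subsystem]:
        extend([start])

    used = set()
    for chain in chains:
        used.update(chain)
    isolated = [n for n in nodes if n not in used]
    return chains, isolated


def choose_root(nodes, root_from_seq_no=None):
    """Use the root from the sequence number, or try the senders of the first three nodes."""
    if root_from_seq_no is not None:
        chains, isolated = build_chains(nodes, root_from_seq_no)
        return root_from_seq_no, chains, isolated
    best = None
    for cand in sorted(nodes, key=lambda n: n.req_time)[:3]:
        chains, isolated = build_chains(nodes, cand.sender)
        if best is None or len(isolated) < len(best[2]):
            best = (cand.sender, chains, isolated)
        if not isolated:
            break  # every node was placed on a chain; accept this root
    return best
```

Because a node may follow several leaves, the same node can end up in more than one chain, which matches the observation above. The recursion enumerates every path, so a production version would probably need additional pruning, for example by time order.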
Merge chains into transaction tree
Pre-processing each tree chain is the last step before generating the transaction tree. If the receiver shares the same attributes as the responder, a subsystem is invoking itself, and a vertical merge is performed. If the upstream and downstream attributes of different chains are entirely consistent, the chains share the same parent chain, and a horizontal merge is performed. Merging according to these rules yields the original transaction tree. The child nodes at the same level are then sorted in ascending order of time to form the final transaction tree.
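One way to picture the two merges is the sketch below, which is an interpretation and not the production algorithm. Horizontal merging is modeled as inserting chains into a shared prefix tree keyed by a hypothetical unique_id, and vertical merging is modeled as folding a self-invoking node (sender equal to receiver) into its parent. The Node layout and function names are assumptions.

```python
from collections import namedtuple

# Hypothetical node layout, used only for this illustration.
Node = namedtuple("Node", "unique_id sender receiver req_time")


def merge_chains(chains):
    """Merge chains (lists of Node) into a nested tree of {'node', 'children'} dicts."""
    root = {"node": None, "children": {}}
    for chain in chains:
        cursor = root
        for node in chain:
            if node.sender == node.receiver and cursor is not root:
                # Vertical merge (assumption): a self-invocation folds into its parent.
                continue
            # Horizontal merge: chains with the same prefix reuse the same branch.
            cursor = cursor["children"].setdefault(
                node.unique_id, {"node": node, "children": {}}
            )
    return root


def outline(tree, depth=0):
    """Flatten the tree into (depth, unique_id) rows, siblings in ascending time order."""
    rows = []
    for child in sorted(tree["children"].values(), key=lambda c: c["node"].req_time):
        rows.append((depth, child["node"].unique_id))
        rows.extend(outline(child, depth + 1))
    return rows


u1 = Node("u1", "A", "B", 1.0)
u2 = Node("u2", "B", "C", 3.0)
u3 = Node("u3", "B", "D", 2.0)
u5 = Node("u5", "B", "B", 1.5)  # subsystem B invoking itself
print(outline(merge_chains([[u1, u2], [u1, u5, u3]])))
# prints [(0, 'u1'), (1, 'u3'), (1, 'u2')]
```

Sorting is applied only when the tree is read, so chains can be merged in any order without affecting the final layout.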
We have described in detail how to generate a transaction tree from the original message logs in WEMQ and the information in the CMDB. We store the original message logs and the transaction trees and display them in the interface, so that the Ops team can search for specific transactions and get a clear view of the overall process. From the transaction tree we can obtain multi-dimensional information about subsystems, alerts, logs, and so on, which is visualized for detailed checks of transaction status.
The transaction trees contain a lot of valuable information. Each transaction is assigned a unique tree key based on its transaction tree. All transactions can be classified and analyzed to obtain the operating status of the entire bank’s systems. In addition, we use algorithms to identify the product to which each transaction belongs. We can then draw the “transaction forest” of an entire product scenario to get a clear picture of the whole product framework. The transaction forest is a large database on which we can perform data mining, and RCA is one of its application scenarios. The next article will introduce how to use the transaction tree to locate an anomaly’s root cause.
Chinese author: Guofeng Wang
Translator: Tony Su, Linda Lin
Editors: WeBank AIOps Team