Wei Lu · Yuqing Zhang · Weiping Wen · Hanbing Yan · Chao Li (Eds.)

Communications in Computer and Information Science 1699

# **Cyber Security**

19th China Annual Conference, CNCERT 2022 Beijing, China, August 16–17, 2022 Revised Selected Papers

## **Communications in Computer and Information Science 1699**

Editorial Board Members

Joaquim Filipe *Polytechnic Institute of Setúbal, Setúbal, Portugal*

Ashish Ghosh *Indian Statistical Institute, Kolkata, India*

Raquel Oliveira Prates *Federal University of Minas Gerais (UFMG), Belo Horizonte, Brazil*

Lizhu Zhou *Tsinghua University, Beijing, China*

More information about this series at https://link.springer.com/bookseries/7899


*Editors*

Wei Lu, CNCERT, Beijing, China

Yuqing Zhang, University of Chinese Academy of Sciences, Beijing, China

Weiping Wen, Peking University, Beijing, China

Hanbing Yan, CNCERT, Beijing, China

Chao Li, CNCERT, Beijing, China

ISSN 1865-0929 ISSN 1865-0937 (electronic) Communications in Computer and Information Science ISBN 978-981-19-8284-2 ISBN 978-981-19-8285-9 (eBook) https://doi.org/10.1007/978-981-19-8285-9

© The Editor(s) (if applicable) and The Author(s) 2022. This book is an open access publication.

**Open Access** This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

## **Preface**

The China Cyber Security Annual Conference is the annual event of the National Computer Network Emergency Response Technical Team/Coordination Center of China (hereinafter referred to as CNCERT/CC). Since 2004, CNCERT/CC has successfully held 18 China Cyber Security Annual Conferences. As an important bridge for technical and service exchange on cyber security affairs among industry, academics, and practitioners, the conference has played an active role in safeguarding cyber security and raising social awareness.

Founded in August 2001, CNCERT/CC is a non-governmental non-profit cyber security technical center and the key coordination team for China's cyber security emergency response community. As the national CERT of China, CNCERT/CC strives to improve the nation's cyber security posture and safeguard the security of critical information infrastructure. CNCERT/CC leads efforts to prevent, detect, alert, coordinate, and handle cyber security threats and incidents, in line with the guiding principle of "proactive prevention, timely detection, prompt response, and maximized recovery".

This year, the China Cyber Security Annual Conference was held online from August 16 to 17, 2022, on the theme of "Jointly Safeguarding Digital Information Infrastructure" as the 19th event in the series. The conference featured one main session and six subsessions. The mission was not only to provide a platform for sharing newly emerging trends and concerns in cyber security and for discussing countermeasures or approaches to deal with them, but also to find ways to join hands in managing threats and challenges to digital information infrastructure. The online event received over 5.8 million visits. Please refer to the following URL for more information: http://conf.cert.org.cn.

We announced our call for papers on our official website, after which 64 submissions were received by the deadline from authors with a wide range of affiliations, including universities, research institutions, telecom operators, companies, financial institutions, and NGOs. After receiving all submissions, we randomly assigned five papers to every reviewer, and every paper was reviewed by three reviewers in a single-blind manner. All submissions were assessed based on their credibility of innovation, contribution, reference value, significance of research, language quality, and originality. We adopted a thorough and competitive reviewing and selection process which took place in two rounds. In the first round we invited the reviewers to conduct an initial review. Based on the comments received, 34 papers passed and the authors of these 34 pre-accepted papers made modifications accordingly. In the second round the modified papers were reviewed again. Finally, 17 out of the total 64 submissions stood out and were accepted. The acceptance rate was 26.56%.

The 17 papers contained in these proceedings cover a wide range of cyber-related topics, including network intrusion detection, cloud network, data security, cryptocurrency, vulnerabilities, mobile Internet security, threat intelligence, and webpage tampering detection.


We hereby would like to sincerely thank all the authors for their participation, and our thanks also go to the Program Committee for their considerable efforts and dedication in helping us solicit and select the papers of quality and creativity.

Lastly, we humbly hope that these proceedings of CNCERT 2022 will shed some light for all readers in their forthcoming research and exploration in their respective fields.

October 2022 Wei Lu Yuqing Zhang Weiping Wen Hanbing Yan Chao Li

## **Organization**

## **General Chair**

Wei Lu CNCERT/CC, China

## **Organizing Chair**

Hanbing Yan CNCERT/CC, China

## **Publications Chairs**


## **Program Committee Chairs**


## **Program Committee**


- Xinhui Han, Peking University, China
- Guojun Peng, Wuhan University, China
- Xueying Li, Topsec, China
- Yaniv David, Technion, Israel
- Chao Li, CNCERT/CC, China
- Li Ding, CNCERT/CC, China
- Huaping Cao, CNCERT/CC, China
- Ruiguang Li, CNCERT/CC, China

- Min Yang, Fudan University, China
- Bo Lang, Beihang University, China
- Christopher Kruegel, University of California, USA
- Yang Zhang, CISPA Helmholtz Center for Information Security, Germany
- Guoai Xu, Beijing University of Posts and Telecommunications, China
- Stevens Le Blond, Max Planck Institute for Software Systems, Germany
- Siri Bromander, University of Oslo, Norway
- Chao Zhang, Tsinghua University, China
- Zoubin Ghahramani, University of Cambridge, UK
- Dawn Song, University of California, Berkeley, USA
- Kangjie Lu, University of Minnesota, USA
- Senlin Luo, Beijing Institute of Technology, China
- Meng Xu, Georgia Institute of Technology, USA
- Wenling Wu, Institute of Software, Chinese Academy of Sciences, China
- Yu Zhou, CNCERT/CC, China

## **Contents**

#### **Data Security**



#### **Information Security**


Research on Information Security Asset Value Assessment Methodology *Xueqin Yang, Peng Yang, and Honggang Lin*

#### **Vulnerabilities**


#### **Mobile Internet**


#### **Traffic Analysis**


#### **Threat Intelligence**


#### **Text Recognition**


## **Data Security**

## **An Intelligent Data Flow Security Strategy Model of Cloud-Network Integration**

Nishui Cai(✉), Zhuxiang Deng, and Hao Wang

Telecom Park, China Telecom Research Institute, Shanghai 201315, China cainishui@chinatelecom.cn

**Abstract.** The security of business data flows in cloud-network integration is mainly reflected in the business deployment stage and the online service stage. First, this paper analyzes the technology trend of digital platforms for cloud-network integration business systems and puts forward an intelligent data flow security strategy model of cloud-network integration, comprising an expert rule judgment system for simple cloud scenarios and an AI algorithm application model for complex cloud scenarios. Then, based on this security policy model of intelligent data flow, the paper studies a hierarchical linkage cloud-network integration security operation system and a scenario-based risk monitoring capability system for personal privacy data protection. Finally, the paper points out that cloud-network integration intelligent data flow security strategies based on AI algorithms need further study.

**Keywords:** Cloud-network integration · Digital operation platform · Hierarchical linkage · Security operation · Data classification · Intelligent data flow · Security strategy · AI algorithm

## **1 Introduction**

The so-called "cloud-network integration" refers to the integration of the cloud (cloud computing) and the network (the communication network): the network is the foundation, the cloud is the core, and the network moves with the cloud. Cloud-network integration is China's digital economy development strategy and enterprise digital transformation strategy with Chinese characteristics [1]. Within it, cloud-network integration is the foundation, cloud-network security is the support, the digital platform is the hub, and scientific and technological innovation is the core.

At this stage, the main problems of cloud-network operation support are that the support systems are too scattered, the BMO data of cloud-network operation is not fully connected, data has not clearly improved cloud-network operation efficiency, and AI is not yet widely applied to intelligent cloud-network operation [2]. The common goal is to establish an AI-enabled digital platform, fully understand the needs of customers, implement data-based decisions, quickly provide digital business service capability and an efficiently responding operation system, and adapt to the rapid development of industrial digitization.

The new generation cloud-network operation business system should have key technologies such as digital twinning of cloud-network resources [3], decoupled acquisition and control of atomic capabilities, big data and AI enabling, and cloud-network integration security operation. These correspond to the resource center, the acquisition and control center, the big data and AI center, and the cloud-network security operation center of the system.


## **2 Intelligent Data Flow Security Strategy Model of Cloud-Network Integration**

The security policy model of intelligent data flow, as shown in Fig. 1, can call different data flow intelligent models according to different scenario applications, such as the security protection inter layer linkage strategy and control rules of "network moves with cloud and cloud moves with data", and automatically divide new specific security area boundaries and security levels according to the security linkage between different layers.

	- Data to flow: data capacity, data classification, protection requirements, etc.
	- Data flow analysis of simple cloud scenario: including reasoning and judgment based on boundary constraints and expert rules, data flow strategy and multi scheme decision-making selection, cloud-network characteristic capacity, protection level, unit energy consumption of equipment, etc.
	- Application scenario: when the data flow changes in a single cloud or two clouds, the rule-based intelligent model is preferred.

**Fig. 1.** Security policy model of intelligent data flow.

	- Database: including cloud feature database, hierarchical security component database and data flow case database.
	- Intelligent model: including expert rule base and AI algorithm base. The expert rule base is divided into single feature rules and compound feature rules. The AI algorithm base includes, for example:
		- a. Partition clustering: K-means, K-medoids
		- b. Hierarchical clustering: BIRCH, CURE
		- c. Cluster density: dbcsi, scan
		- d. Grid clustering: STING, CLIQUE
		- e. Mixed clustering: Gaussian mixture model, CLIQUE
	- Self-learning of the intelligent model: the "data flow scheme" output by each execution of the strategy model is saved to the data flow case base, and the latest case base is called regularly to train the AI algorithms, so that their model parameters are updated in time.
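
The choice between the two model families and the self-learning loop described in the list above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the names `select_model` and `CaseBase`, the use of "at most two clouds" as the boundary of the simple cloud scenario, and the retraining interval are all hypothetical.

```python
from dataclasses import dataclass, field


def select_model(num_clouds: int) -> str:
    """Pick the model family for a data flow request.

    Assumption: flows touching one or two clouds count as the simple cloud
    scenario and go to the expert rule base; anything larger goes to the
    AI algorithm base.
    """
    if num_clouds < 1:
        raise ValueError("a data flow must involve at least one cloud")
    return "expert_rules" if num_clouds <= 2 else "ai_algorithm"


@dataclass
class CaseBase:
    """Hypothetical data flow case base used for self-learning."""
    retrain_every: int = 3  # assumed retraining interval, counted in cases
    cases: list = field(default_factory=list)
    retrain_count: int = 0

    def record(self, request: dict, scheme: dict) -> bool:
        """Store one executed data flow scheme; return True if a retrain is due."""
        self.cases.append((request, scheme))
        if len(self.cases) % self.retrain_every == 0:
            self.retrain_count += 1  # placeholder for actual model training
            return True
        return False
```

In this sketch the retraining step is only counted, since the text does not say which training procedure is used.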

## **3 Hierarchical Linkage Cloud-Network Integration Security Operation System Based on the Security Policy Model of Intelligent Data Flow**

Intelligent operation is a new digital operation capability and will be a necessary capability for enterprise digital transformation [7]. At present, intelligent operation needs to evolve gradually from single-scenario intelligence to global intelligence.

The cloud-network integration business system is characterized by "the network follows the cloud and the cloud follows the data", and its security domain units require protection that is classified by layer, domain and level. Based on industry research experience in network and information security strategy, this paper proposes a hierarchical linkage cloud-network integration security operation strategy to meet the security operation requirements of the cloud-network integration business system.

#### **3.1 Cloud-Network Security Protection System with Layers, Regions and Levels**

According to the Chinese national standard GB/T 22239-2019 (basic requirements for network security classification protection of information security technology), the security equipment and security components distributed in the network are classified by layer as "network, cloud, application, data and terminal", so as to realize hierarchical decoupling, flexible orchestration and open atomic capabilities for cloud security resources. The hierarchical security capability components ("network, cloud, application, data, terminal") of the cloud-network integration business system are listed in Table 1.

The security capability components of each layer are as follows:


#### **3.2 Hierarchical Linkage Cloud-Network Integration Security Operation System Based on the Security Policy Model of Intelligent Data Flow**

The hierarchical linkage cloud-network integration security operation system based on the security policy model of intelligent data flow is shown in Fig. 2.

The core modules include: cloud-network integration security policy management point, hierarchical and domain security policy blockchain, "hierarchical linkage" security policy, and hierarchical and domain security control.

– Cloud integrated security policy management point: the security administrator configures the security policy in the cloud integrated security domain unit through the security policy management point, including the setting of security parameters, unified security marks for subjects and objects, authorization of subjects, configuration of trusted authentication policies, etc.

**Table 1.** List of hierarchical security capability components of cloud-network system.

**Fig. 2.** Cloud-network integration security operation system with hierarchical linkage based on the security policy model of intelligent data flow.


security linkage rules between different layers according to the security protection inter layer linkage policy and control rules of "the network moves with the cloud and the cloud moves with the data", and automatically divide the new specific security area boundary and security level.

– Hierarchical and sub domain hierarchical security control: implement security policies and security control rules hierarchically, carry out "network cloud application data terminal" hierarchical and sub domain security level protection according to the security level of cloud-network integrated security domain unit, and automatically control the security equipment or security components distributed in the network.

#### **3.3 Feasibility Verification of Cloud-Network Integration "Intelligent Data Flow" Security Strategy**

#### **Verification Flow Chart**

The hierarchical linkage cloud-network integration security operation flow chart is shown in Fig. 3.

**Fig. 3.** Hierarchical linkage cloud-network integration security operation flow chart.

Following Fig. 3, the simulation below verifies the "intelligent data flow" security strategy of cloud-network integration. Since there are only two clouds in the application scenario, the expert rule model for the simple cloud scenario is called in the simulation process.

#### **Application Case of Hierarchical Rules**

Through the security policy management point, the security administrator configures the security policy in the cloud-network integrated security domain unit, including the setting of security parameters, unified security marks (*Se\_Token*) for subjects and objects, boundary range of security area (*Zone\_defense*), authorization of subjects, configuration of trusted authentication strategy, etc.

*Security\_Policy*{*Se\_Token*(*Subjects*, *Objects*), *Zone\_defense*}

*S0: Cloud-Network Characteristics and Initialization Security Policy Parameters of Layered and Domain*

1. S01 Existing Cloud Feature "Layered Linkage" Security Policy Rule Base

$$SPRB(zone_0,\ zone_1,\ zone_2,\ \ldots)$$


$$SP_0\{ST(Subj_0,\ Obj_0),\ Zone_0\}$$




*S1: Cloud-Network Integration Security Policy Configuration*

$$SP\{ST(Subj,\ Obj),\ Zone\}$$


*S2: Hierarchical and Domain-Based Security Policy Blockchain*

Timely release to the security policy execution-point of each layer through the blockchain.

*S3: "Layered Linkage" Security Policy Model*

1. *S31: Implementation Points of Security Policies at All Levels*

Receive the security policy issued by the security policy blockchain, and call the S32 layered linkage security policy rules to calculate the minimum protection area (*MinZone*) and security protection level (*MaxST*), forming the adjusted overall requirements of cloud-network security protection. Then determine the security policy of this layer, query the security capability components of the relevant security protection levels of this layer according to the cloud-network integration layered protection security capability component system diagram and the security protection level, and issue the relevant security policy adjustment instructions at all levels.

2. S32: "Layered Linkage" Security Policy Rule Base

$$SPRB(zone_0,\ zone_1,\ zone_2,\ \ldots)$$

According to the security protection linkage strategy and control rules of "the network moves with the cloud and the cloud moves with the data", the security linkage rules between different layers can be called to automatically divide the new specific security area boundary and security level, as given in Eq. (1).

$$\begin{array}{l}SP = SP + SP_0\\SP = \{MaxST(Subj_0 + Subj,\ Obj_0 + Obj),\ MinZone\}\end{array} \tag{1}$$

#### a) Operation 1: Add a Protection Object

Scheme 1: C1 first and then C2, and determine the minimum protection area (*MinZone*) according to the capacity of the protected object:

$$400\,\text{GB} + 150\,\text{GB} = 550\,\text{GB}$$

Determine the safety protection level (*MaxST*) according to the highest level of the protected object:

Adjust the security protection level of C1, C2 and corresponding network boundary to level 3.


Similarly, scheme 2, C2 first and then C1… and so on. (See Table 3).
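
Equation (1) and Scheme 1 above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the authors' rule base: the `SecurityPolicy` structure and the `merge` function are hypothetical, the capacities of 400 GB and 150 GB come from Scheme 1, and the individual protection levels of C1 and C2 (2 and 3) are invented so that the merged level is 3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SecurityPolicy:
    """Illustrative stand-in for SP{ST(Subj, Obj), Zone}."""
    subjects: frozenset
    objects: frozenset
    level: int     # security protection level (ST)
    zone_gb: int   # capacity the protection area must cover


def merge(existing: SecurityPolicy, new: SecurityPolicy) -> SecurityPolicy:
    """Combine two policies in the spirit of Eq. (1).

    MaxST: the merged level is the highest level of any protected object.
    MinZone: the merged area is the smallest one that holds both capacities,
    taken here as their sum, as in Scheme 1.
    """
    return SecurityPolicy(
        subjects=existing.subjects | new.subjects,
        objects=existing.objects | new.objects,
        level=max(existing.level, new.level),
        zone_gb=existing.zone_gb + new.zone_gb,
    )


# Capacities follow Scheme 1; the per-cloud levels 2 and 3 are hypothetical.
c1 = SecurityPolicy(frozenset({"admin"}), frozenset({"C1"}), level=2, zone_gb=400)
c2 = SecurityPolicy(frozenset({"admin"}), frozenset({"C2"}), level=3, zone_gb=150)
merged = merge(c1, c2)
print(merged.level, merged.zone_gb)
```

Because both the maximum and the sum are order-independent, this sketch gives the same merged policy whether C1 or C2 is added first; the schemes in the paper may differ in intermediate steps that are not modeled here.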


**Table 3.** Cloud-network integration "intelligent data flow" process table (operation 1)

#### b) Operation 2: Reduce Protected Objects

Similarly…… (See Table 4).

#### *S4: Hierarchical Security Control*

After each security operation policy adjustment operation, immediately receive and execute the security policy adjustment instructions of each layer, query the corresponding security capability components in the hierarchical security capability component list of cloud-network integration business system according to Table 1, and automatically control the security equipment or security components distributed in the cloud-network integration system, that is, the network, cloud, application, data, terminal to load and reinforce the corresponding level of safety protection equipment and application safety components respectively.

By adding and reducing protection objects and protection requirements, the protection strategies of "network, cloud, application, data and terminal" of the cloud system have been adjusted automatically and implemented through the hierarchical security control points.


**Table 4.** Cloud-network integration "intelligent data flow" process table (operation 2)

#### **3.3.1 Application Case of Intelligent Multi-cloud Resource Scheduling**

*Feature Selection in the Sample Space of Resource Scheduling AI Algorithm in Multicloud Scenarios*

General principles to be followed:


The above principles can be used as a basis for evaluating the importance of feature parameters when AI algorithm selects feature space. In order to better reflect the principles of resource scheduling in a multi-cloud scenario, the operation log of each multicloud resource scheduling is generalized. Each scheduling operation is taken as a feature sequence, and features with strong correlation are selected for vector representation, which is stored in the case database as a case training set.

The names of the characteristic variables that conform to the above scheduling principles are generalized as <c1\_free\_space>, <c2\_free\_space>, <c1\_safety\_level>, <c2\_safety\_level>, <c1\_unit\_price> and <c2\_unit\_price>, which represent the space utilization, security level and unit price (an economic indicator) of each cloud, together with <demand\_cloud\_space> and <demand\_safety\_level>, which represent the resource scheduling requirements. Because <c1\_safety\_level>, <c2\_safety\_level>, <c1\_unit\_price> and <c2\_unit\_price> are relatively stable during resource scheduling, they are not considered for the time being. Only the important features are used here to form the sample feature space.

#### *Multi-cloud Resource Scheduling Method Based on KNN Algorithm*

The advantage of the KNN algorithm is that it can handle both classification and regression problems, while offering strong robustness to interference and high accuracy. Its low efficiency can be mitigated by controlling the size of the sample set as it is updated, which makes it well suited to the size of the operation log of multi-cloud resource scheduling.

Now only the features *<* c1\_ free\_ space *>*, *<* c2\_ free\_ space *>* in the feature space are taken, assuming that the unknown samples are serialized as follows:

(*demand\_cloud\_space*, *c1\_free\_space*, *c2\_free\_space*) = (100, 350, 200), taking *k* = 3.

Query the training samples in Table 5 and calculate the nearest neighbor distances. The samples with ID6, ID7 and ID8 are the k nearest neighbors; ID6 and ID7 belong to class 2 and ID8 belongs to class 1. The unknown sample is therefore classified as class 2, with the corresponding policy\_scheme (50, 50), in which cloud1 and cloud2 each schedule 50 GB of resource space. (See Table 5).
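
The classification step above can be sketched as a plain k-nearest-neighbor vote. The training rows below are hypothetical and do not reproduce Table 5; they are chosen only so that the query (100, 350, 200) has two class 2 neighbors and one class 1 neighbor, as in the text. The mapping of class 2 to the scheme (50, 50) comes from the text, while the other schemes and the function name `knn_classify` are placeholders.

```python
import math
from collections import Counter

# Hypothetical rows: (demand_cloud_space, c1_free_space, c2_free_space) -> class.
TRAINING = [
    ((100, 340, 210), 2),
    ((110, 360, 190), 2),
    ((90, 330, 220), 1),
    ((300, 100, 400), 3),
    ((50, 500, 50), 1),
    ((200, 200, 200), 1),
]

# Class 2 -> (50, 50) is taken from the text; the other schemes are placeholders.
POLICY_SCHEME = {1: (100, 0), 2: (50, 50), 3: (0, 100)}


def knn_classify(query, training, k=3):
    """Return the majority class among the k training rows closest to query."""
    nearest = sorted(training, key=lambda row: math.dist(query, row[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


label = knn_classify((100, 350, 200), TRAINING, k=3)
scheme = POLICY_SCHEME[label]
print(label, scheme)
```

Euclidean distance on raw values is used here for simplicity; with features on different scales, normalizing them first would be a reasonable refinement.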


**Table 5.** Training sample set

## **4 Risk Monitoring Capability System for Personal Privacy Data Protection by Scenario System Based on the Security Policy Model of Intelligent Data Flow**

The effective methods for monitoring the personal information protection risk of the business system are as follows:


In this way, it not only solves the problem of visually displaying the key nodes of user personal information protection in the business system, but also meets the accuracy requirements of the risk monitoring model of each key node, thus improving the accuracy and efficiency of risk monitoring for user personal information protection.

#### **4.1 List of Basic Risk Models for Rule-Based Personal Privacy Protection**

The basic risk models for rule-based personal privacy protection are shown in Table 6. The basic risk models can be divided into five categories:



**Table 6.** List of risk models for rule-based personal privacy protection.



#### **4.2 Risk Identification of Personal Privacy Data Protection in Complex Scenarios**

#### **Risk Identification of Personal Privacy Data Protection in Complex Scenarios Based on Rules**

Batch information export is an important and complex scenario for personal privacy data protection. Here, it is simply divided into two stages, authentication and authorization followed by information export, as shown in Fig. 4.

#### **Batch Information Export Scenario Risk Monitoring Process**

Step 1: according to the management requirements, establish the management requirements feature matrix for the batch information export scenario.

Step 2: data identification and analysis, that is, access monitoring business system scenarios, mirror business system scenarios, user access traffic data, batch export related multi log multi-dimensional data modeling, including approval, bank mode, permission range and other data for feature rule modeling.

Step 3: risk identification based on rule model, extract and identify models according to key data requirements through protocol analysis and request data analysis, including data type, regular expression, eigenvalue matching, rule matching, behavior matching, etc.

**Fig. 4.** Batch information export.

Step 4: scenario-based risk identification with AI models. Compare and analyze the features of the various scenarios using AI and big data technology, and identify risk scenarios with the AI model feature matching algorithm for batch export. Using UEBA user behavior analysis technology and the behavior baseline obtained from big data statistical analysis, judge whether the behavior is an abnormal batch information export, and identify the corresponding risks.

Step 5: analyze the authentication model, approve the score scenario information, and judge the compliance of scenario behavior - compare and identify the access behavior and risk of batch exported user information.

Step 6: optimize the AI algorithm model to realize self-learning. Based on AI technologies such as machine learning and NLP, iteratively optimize the AI algorithm model, strategy and feature base for batch information export.
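
Steps 3 and 4 can be illustrated with a small sketch that combines regular expression matching with a statistical behavior baseline. Everything in it is an assumption made for illustration: the two patterns, the function names, and the use of a z-score with a threshold of three standard deviations are not taken from the paper, which does not specify its rule set or its baseline statistic.

```python
import re
import statistics

# Hypothetical patterns for personal data in an export payload.
RULES = {
    "mobile_number": re.compile(r"\b1[3-9]\d{9}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def match_rules(payload: str) -> dict:
    """Step 3 style check: count rule hits per data type in the payload."""
    return {name: len(rx.findall(payload)) for name, rx in RULES.items()}


def is_abnormal_export(history: list, current: int, threshold: float = 3.0) -> bool:
    """Step 4 style check: flag an export volume far above the user's baseline.

    Uses a z-score against past export sizes (at least two are needed);
    the threshold of 3 standard deviations is an assumed default.
    """
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return current > mean
    return (current - mean) / stdev > threshold
```

A monitoring pipeline along these lines would first run `match_rules` on each export request and then pass the number of exported records to `is_abnormal_export` together with that user's export history.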

## **5 Conclusion**

The security domain unit of cloud-network integration business system has the characteristics of "network cloud application data terminal" layered and sub-domain hierarchical protection and "network moves with cloud and cloud moves with data". This paper puts forward the intelligent data flow security strategy model of cloud-network integration, including expert rule judgment system of simple cloud scene and AI algorithm application model of complex cloud scene, which can be applied to hierarchical linkage cloud-network integration security operation system and risk monitoring capability system for personal privacy data protection by scenario system. With the acceleration of enterprise digital transformation and the massive growth of cloud-network integration services, AI algorithm application model of complex cloud scene is an important content of in-depth research in the field of intelligent security operation of cloud-network integration in the next stage.

## **References**



## **Applying a Random Forest Approach to Imbalanced Dataset on Network Monitoring Analysis**

Qian Chen<sup>1,2</sup>(✉), Xing Zhang<sup>1,2</sup>, Ying Wang<sup>1,2</sup>, Zhijia Zhai<sup>1,2</sup>, and Fen Yang<sup>1,2</sup>

<sup>1</sup> China Electronics Cyberspace Great Wall Limited Company, Beijing 102209, China cbxhch@163.com

<sup>2</sup> National Engineering Laboratory for Big Data Collaborative Security Technology, Beijing 102209, China

**Abstract.** With the rapid growth of big data technology and the continuous development of information technology in recent years, the significance of network security monitoring has increased steadily. As one of the major means of securing the system environment, organizations use various monitoring devices to govern the use of networks, hardware and applications. Meanwhile, these devices constantly produce massive and redundant data, which poses a serious problem for analysts and scientists who want to extract useful information from them, and even affects the accuracy and efficiency of the monitoring systems. In this paper, we employ the random forest algorithm and propose an ensemble learning model for certain scenarios with fixed data features. We use a preprocessing method to balance positive and negative samples, and then use 6 different intrusion detection systems as weak classifiers, which satisfy the rules of "partial sampling" and "partial features selection" of ensemble learning. Finally, we test three combination strategies, namely relative majority voting, weighted voting and stacking, to combine the predictions. Experiments show that stacking performs better than the other two, with a recall of 98.25% and a precision of 47.91%.

**Keywords:** Random Forest · Network Security · Monitoring and Analysis · Ensemble Learning · Imbalanced Classification

## **1 Introduction**

In recent years, network security monitoring has developed rapidly and played a significant role in network security. Network security monitoring is the prerequisite of a protected and functional network system. In the context of big data, network monitoring data are produced and altered endlessly. Network monitoring systems not only need to recognize the traditional risks such as spiders, port scanning, webshell, injection attack, advanced persistent threat, and phishing mail, but also have to discover the emerging risks such as privacy disclosure, information leakage, data theft, etc. In order to solve these problems, it is necessary to integrate the strengths of multiple security systems and platforms, which include internet probe, situation awareness system, internet management system, terminal detection system, database protection system, and so forth. However, these systems and platforms are mostly self-contained, which have fuzzy boundaries and duplicated functions. If their advantages can be combined and weaknesses can be complemented in one mechanism, it will reduce unneeded human labor and increase the overall efficiency.

Ensemble learning is well suited to this problem. Each intrusion detection system can be treated as a weak classifier that distinguishes normal from intrusive data. Integrating several weak classifiers produces a strong classifier with more precise results and higher effectiveness.

In 2001, Giacinto et al. began to address intrusion detection problems with ensemble learning [1]. In 2008, Giacinto et al. proposed an ensemble learning method that can detect unknown types of intrusion [2]. The random forest algorithm has been widely used and proven effective in intrusion detection ensembles. For webshell detection, a common process is to extract syntax features from PHP code through text analysis and then build the detection model [3, 4]. Because a webshell has both behavioral features and static text features, a stronger feature combination can be built by merging the two [5, 6]. Another method combines random forest with deep learning to build a network intrusion detection model based on deep random forest, which can handle larger and more complex datasets [7].

Researchers in intrusion detection often choose public datasets, such as NSL-KDD, ISC2012, ADFA13, and DARPA98, or public repositories such as GitHub. Most public datasets are cleaned and balanced, with a suitable ratio of normal to intrusive data, which makes them convenient for algorithm research. But these datasets are outdated and cannot reflect the newest trends in intrusion detection. For specific scenarios, it is necessary to collect specific data and construct a specialized dataset [8].

In this paper, we use a dataset obtained by recall sampling of desensitized real data, which is heavily imbalanced. The primary machine learning solutions for adjusting the ratio of different samples in an imbalanced dataset are undersampling and oversampling. In addition, adjusting the model quality metrics for different categories can mitigate model failure caused by data imbalance [9, 10].

## **2 Network Security Monitoring and Random Forest**

#### **2.1 Network Security Monitoring**

Network security monitoring is a technology that collects and analyzes attack alarms to improve the response to network intrusions. To analyze network traffic, a replica of the network flow is generally exported via a private network switch and analyzed on a dedicated server. By using data collection, data transmission, and data presentation tools to analyze network traffic, flow information such as sessions, transactions, statistics, metadata, and alert data can be extracted. By analyzing the various types of monitoring data, digital threats and intruders can be controlled to ensure network security.

**Fig. 1.** Network security monitoring process diagram

Figure 1 shows the process diagram of network security monitoring. Network monitoring analysis usually relies on the analytical skills of the monitoring staff. They are in charge of extracting information from thousands of alarm records, analyzing and determining the misreporting rate, threat level and hazard level of each alarm, and implementing appropriate responses. In addition to analytical skill, monitoring staff also need a thorough understanding of the network environment in their specific field, including but not limited to business data patterns and asset locations. They must identify and respond to intrusions promptly among numerous alarms in a complicated environment, and keep tracking subsequent events and potential risks.

Time is the most important factor in safeguarding the network system. On the one hand, misreporting (false alarms) can lead to serious failures because the monitoring staff are unable to deal with intrusions in time. On the other hand, underreporting (missed intrusions) can cause riskier situations that are hard to predict. Therefore, with the development of network security in recent years, monitoring systems have become more and more comprehensive. With the arrival of the big data era and the improvement of computing power, the quantity and repetitiveness of security data have increased tremendously. New security risks, especially data security risks, have arisen. These factors all challenge real-time monitoring.

Because each intrusion detection system has its own technical advantages, combining these systems is fairly complex. In practice, administrators must patrol all intrusion detection systems at the same time during monitoring. In addition to intrusion detection tools, monitoring staff should be able to operate other systems flexibly, including the asset mapping system, log audit system, host scanning system, security disposal tools, security filing platform, external intelligence platform, tracking and recording platform, and so on. This complex environment and its complicated functions also challenge real-time monitoring.

As for the alarm data itself, in practical monitoring most alarms are misreports. Among the alarms, most intrusions are crawlers, port scanning or vulnerability detection, which are highly repetitive and pose a low threat. Truly threatening intrusions are difficult to discover at first because they are hidden in a large amount of worthless data.

To face these challenges, a common solution is to build network security policies. But policies are static and fixed; they are slow to adapt to changes in the network environment and cannot simply be applied to all services and systems.

Based on the above conditions, we propose a new solution that reduces the amount of data and increases the efficiency of security devices by further screening and classifying the alarm data.

#### **2.2 Random Forest**

Machine learning (ML) has made great achievements in automated classification tasks in recent years, and ensemble learning is one of its popular fields. By training multiple weak classifiers and combining them into a strong classifier, ensemble learning solves a classification problem jointly. Generally speaking, the classifier generated by ensemble learning is more precise than any of the weak classifiers. Sampling methods such as boosting and bagging are commonly used in ensemble learning. As combination strategies, besides averaging and voting methods such as relative majority voting, stacking is also used, which combines models by training a further learner on their outputs. Random forest is an important and widely used ensemble learning method.

Random forest is an ensemble classifier that extends bagging and consists of many decision trees. The predictive output of the classifier is obtained by combining the classifications of the individual decision trees. On top of bagging, random forest introduces random feature selection. In other words, a feature subset is selected at random before classification, and the classification task is then conducted on that subset.
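The two sources of randomness described above (bootstrap sampling of rows, then a random feature subset per tree) can be sketched as follows. This is a minimal illustration rather than the implementation used in this paper: the one-split "stump" stands in for a full decision tree, and all function names and the parameters `n_trees` and `k` are assumptions made for the example.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Bagging step: draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(n_features, k, rng):
    """Random feature selection: pick k distinct feature indices."""
    return sorted(rng.sample(range(n_features), k))

def train_stump(rows, features):
    """Fit a one-split tree restricted to the allowed features (toy stand-in for a decision tree)."""
    best = None
    for f in features:
        for threshold in sorted({x[f] for x, _ in rows}):
            left = [y for x, y in rows if x[f] <= threshold]
            right = [y for x, y in rows if x[f] > threshold]
            if not left or not right:
                continue
            l_label = Counter(left).most_common(1)[0][0]
            r_label = Counter(right).most_common(1)[0][0]
            errors = sum(y != l_label for y in left) + sum(y != r_label for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, threshold, l_label, r_label)
    if best is None:
        # No valid split exists: fall back to the majority label of the sample.
        label = Counter(y for _, y in rows).most_common(1)[0][0]
        return lambda x: label
    _, f, t, l_label, r_label = best
    return lambda x: l_label if x[f] <= t else r_label

def train_forest(rows, n_trees=25, k=2, seed=0):
    """Each tree sees its own bootstrap sample and its own feature subset."""
    rng = random.Random(seed)
    n_features = len(rows[0][0])
    forest = []
    for _ in range(n_trees):
        sample = bootstrap_sample(rows, rng)
        features = random_feature_subset(n_features, k, rng)
        forest.append(train_stump(sample, features))
    return forest

def forest_predict(forest, x):
    """Combine the trees by relative majority vote."""
    votes = Counter(tree(x) for tree in forest)
    return votes.most_common(1)[0][0]
```

Each entry of `rows` is a `(feature_tuple, label)` pair. Restricting every tree to `k` features is what keeps the trees from all relying on the same dominant feature.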

From the perspective of machine learning, intrusion detection can be treated as a classification task. Intrusion detection systems transform raw data into structured data tables using data representation tools, then analyze the data and classify it by attack features and attack types. Each intrusion detection system can thus be treated as a weak classifier. Because the classification result of each weak classifier is inaccurate, we can build an integrated classifier using ensemble learning to improve precision.

Further study of the data features of intrusion detection in monitoring analysis shows that each intrusion detection system can identify only part of the attack features, because different systems come from different manufacturers and target different application scenarios. In addition, the network traffic capture method can be considered a special form of bagging, because each intrusion detection system is deployed at a different position in the network and captures incomplete, overlapping network traffic. Therefore, when we use an individual intrusion detection system as a classifier, it naturally satisfies the two properties of "partial sampling" and "partial feature selection". Following the idea of random forest, we use ensemble learning to conduct network monitoring analysis across multiple intrusion detection systems.

Compared with other types of analysis in network security, monitoring analysis has higher requirements on timeliness. Analysis with large-scale neural networks requires expensive equipment to keep the computation efficient. Some algorithms, such as K-NN and SVM, are only applicable to small-scale datasets. With random forest, the computation cost is on the same level as that of an IDS. Considering timeliness, cost, efficiency, and dataset scale, the random forest method is the best choice in practice for large-scale network monitoring analysis.

#### **2.3 Imbalanced Learning and Cost-Sensitive Learning**

From the perspective of machine learning, monitoring data are typically class-imbalanced and cost-sensitive. The vast majority of network traffic comes from normal network services; only a small part comes from intrusions.

In this paper, alarm data are regarded as the positive class and normal traffic as the negative class. Sampling the network traffic without any data processing, we found that the ratio of alarm data to service data reached as much as $1:10^6$. This serious imbalance between positive and negative data biases the classification algorithm toward the negative class.

Monitoring data are important to network security, and different kinds of misclassification have different consequences: misreporting may not lead to direct consequences, but underreporting may leave security vulnerabilities in actual monitoring. In terms of the confusion matrix shown in Table 1, the impact of underreporting is far greater than that of misreporting.


**Table 1.** Classification result confusion matrix

Class imbalance can be addressed by data-level methods and algorithm-level methods. Data-level methods mainly include oversampling, undersampling and composite sampling. The disadvantage of undersampling is that it may lose information, while the disadvantage of oversampling is that it may cause overfitting. Algorithm-level methods mainly modify an existing algorithm so that it pays more attention to the minority class.

In this paper, we mainly use undersampling for the monitoring datasets, in two ways. First, we limit the recall channel to increase the proportion of positive samples while keeping the sampling as comprehensive as possible. Second, based on the monitoring data themselves, we screen the data by filtering whitelist data, removing data containing important business features, and removing normal network traffic with the help of an external intelligence base. After undersampling, the ratio of positive to negative classes is within 1:100.

To account for the cost of underreporting, we increased the weight of underreporting in the learning process, that is, we raised its penalty.
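One common way to raise the penalty for underreporting is to weight the positive class more heavily in the training loss. The sketch below shows this for a log-loss; the paper does not state which loss or which weight was used, so both the function and the value `fn_weight=5.0` are assumptions for illustration only.

```python
import math

def weighted_log_loss(y_true, p_pred, fn_weight=5.0):
    """Log-loss in which errors on positive samples (potential underreports) cost fn_weight times more.

    y_true: iterable of 0/1 labels (1 = intrusion).
    p_pred: predicted probability of the positive class for each sample.
    fn_weight: hypothetical penalty factor for the positive class.
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # clamp to avoid log(0)
        if y == 1:
            total -= fn_weight * math.log(p)   # a missed intrusion is penalised more
        else:
            total -= math.log(1.0 - p)         # a false alarm keeps unit weight
    return total / len(y_true)
```

With such a weighting, a classifier that assigns probability 0.1 to a real intrusion is punished five times harder than one that assigns probability 0.9 to normal traffic.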

## **3 Application of Random Forest Algorithm**

#### **3.1 Experiment Description**

In this paper, service data (TN) far outnumber the other classes. To avoid their disturbance, we use precision, recall and F-score to evaluate classifier performance. Precision is defined as $P = \frac{TP}{TP+FP}$ and recall as $R = \frac{TP}{TP+FN}$.

As in Table 1, FN is the number of underreports and FP the number of misreports. Considering the importance of underreporting, we increase the weight of FN and define the F-score as $\frac{(a^2+1)PR}{a^2P+R}$, with $a = 2$.
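The three metrics can be computed directly from the confusion-matrix counts. The following is a small sketch of the definitions above; the function name and the zero-division handling are choices made for this example.

```python
def precision_recall_fscore(tp, fp, fn, a=2.0):
    """Return (precision, recall, F-score) with F = (a^2 + 1) * P * R / (a^2 * P + R).

    tp, fp, fn: counts of true positives, misreports and underreports.
    a: weight of recall relative to precision (a = 2 in this paper).
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = a * a * precision + recall
    fscore = (a * a + 1) * precision * recall / denom if denom else 0.0
    return precision, recall, fscore
```

Because TN does not appear in any of the three formulas, the very large number of normal service records cannot inflate the scores.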

In this paper, we use three different combination strategies to combine classifiers, including relative majority voting, weighted voting and stacking.

For weak classifiers $h_1, h_2, \ldots, h_6$ and the set of category tags $\{c_1, c_2, \ldots, c_6\}$, we express the prediction output of $h_i$ on a sample $\mathbf{x}$ as a 6-dimensional vector $\left(h_i^1(\mathbf{x}), h_i^2(\mathbf{x}), \ldots, h_i^6(\mathbf{x})\right)$, where $h_i^j(\mathbf{x})$ is the output of $h_i$ on category tag $c_j$.

Relative majority voting:

$$H(\mathbf{x}) = c_{\underset{j}{\arg\max} \sum_{i=1}^{6} h_i^j(\mathbf{x})} \tag{1}$$

Weighted voting ($w_i$ is the weight of $h_i$):

$$H(\mathbf{x}) = c_{\underset{j}{\arg\max} \sum_{i=1}^{6} w_i h_i^j(\mathbf{x})} \tag{2}$$

Stacking: the outputs of the initial learners on the initial dataset form a new dataset, called the secondary training set; secondary learners are then trained on it using cross-validation.
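The two voting rules can be written down in a few lines. In this sketch, `outputs[i][j]` plays the role of the output of weak classifier $i$ on category $j$, the functions return the index of the winning category, and ties go to the lowest index (a choice made here; the paper does not specify tie-breaking).

```python
def relative_majority_vote(outputs):
    """Return the category index j that maximises the sum over classifiers of outputs[i][j]."""
    n_categories = len(outputs[0])
    totals = [sum(h[j] for h in outputs) for j in range(n_categories)]
    return max(range(n_categories), key=totals.__getitem__)

def weighted_vote(outputs, weights):
    """Same as relative majority voting, but classifier i contributes with weight weights[i]."""
    n_categories = len(outputs[0])
    totals = [
        sum(w * h[j] for w, h in zip(weights, outputs))
        for j in range(n_categories)
    ]
    return max(range(n_categories), key=totals.__getitem__)
```

For example, with two categories and six classifiers of which four vote for category 0 and two for category 1, relative majority voting returns 0, while weights that favour the two dissenting classifiers can turn the result to 1.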

#### **3.2 Data Sampling and Preprocessing**

We select part of the network traffic through the recall channel for analysis. To ensure that the data are comprehensive, we sample from the complete time period to form the dataset. Table 2 shows the basic features of the dataset:


**Table 2.** Features of sampling data

The dataset in Table 2 is sampled proportionally from the full period based on the above features. Over a complete period of one week, we observed that during working hours the network traffic is large and consists mainly of internal business data, while at night and on holidays the traffic is relatively small and consists mainly of external network access. After the dataset is formed, 30 data features are extracted from it in combination with each intrusion detection device.

Before preprocessing, the ratio of positive to negative samples reaches 1:3322. Classifying directly on the dataset of Table 2 would bias the model toward the negative class and prevent correct classification.

Therefore, we clean the dataset in Table 2 by filtering the whitelist, clearing data already analyzed under the security policy, removing business characteristic data, and analyzing the remainder in combination with the external intelligence base. Table 3 shows the features of the dataset after this preprocessing:


**Table 3.** Features of preprocessed data characteristics

After the above preprocessing, the ratio of positive to negative classes falls from 1:3322 in the dataset of Table 2 to nearly 1:69 in Table 3. The following analysis is based on the dataset of Table 3.
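The rule-based cleaning described in this section can be pictured as a filter over labelled records followed by a ratio check. The sketch below is only illustrative: the record fields `label`, `src_ip` and `dst_service`, the two rule sets, and the decision to always keep positive records are all assumptions, not details given in this paper.

```python
def preprocess(records, whitelist_ips, business_services):
    """Drop negative records explained by a whitelist or by known business traffic.

    records: list of dicts with hypothetical keys "label" (1 = alarm), "src_ip", "dst_service".
    whitelist_ips: set of trusted source addresses (hypothetical rule).
    business_services: set of services whose traffic counts as normal business data (hypothetical rule).
    """
    kept = []
    for record in records:
        if record["label"] == 1:
            kept.append(record)  # assumption: positive samples are never filtered out
        elif record["src_ip"] in whitelist_ips:
            continue             # whitelist filtering
        elif record["dst_service"] in business_services:
            continue             # business characteristic data
        else:
            kept.append(record)
    return kept

def negatives_per_positive(records):
    """Number of negative records per positive record (n means a ratio of 1:n)."""
    positives = sum(1 for r in records if r["label"] == 1)
    negatives = len(records) - positives
    return negatives / positives if positives else float("inf")
```

Calling `negatives_per_positive` before and after `preprocess` gives the kind of before/after ratio reported for Tables 2 and 3.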

#### **3.3 Classifier Analysis**

In combination with the intrusion detection equipment, six weak classifiers are extracted from the dataset. By manually analyzing and labeling the alarms that each classifier marks as positive, we obtain the actual performance and classification ability of each weak classifier. Details are shown in Table 4:

Further analysis based on the data in Table 4:


**Table 4.** Features of weak classifiers


#### **3.4 Combination Strategy**

We randomly divided the sample data into two subsets: a training dataset and a testing dataset. 70% of the total sample is used as training data to determine the optimal model parameters, and the remaining 30% is used as testing data to evaluate predictive performance. We use three combination strategies for model training: relative majority voting, weighted voting and stacking. Table 5 shows the classification results under the different combination strategies.
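A stacking setup over fixed weak classifiers can be sketched as a 70/30 split followed by a secondary learner trained on the weak classifiers' outputs. The paper does not name its secondary learner, so the naive Bayes model below is a stand-in chosen because it has a closed form; the cross-validation step is omitted, and all function names are hypothetical.

```python
import math
import random

def split_70_30(rows, seed=0):
    """Shuffle and split rows into 70% training and 30% testing data."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = len(shuffled) * 7 // 10
    return shuffled[:cut], shuffled[cut:]

def train_meta_learner(rows):
    """Fit a naive Bayes secondary learner on (weak_outputs, label) pairs.

    weak_outputs is a tuple of 0/1 alarms, one per weak classifier.
    """
    n = len(rows[0][0])
    alarm_counts = {0: [0] * n, 1: [0] * n}
    class_totals = {0: 0, 1: 0}
    for outputs, label in rows:
        class_totals[label] += 1
        for i, alarm in enumerate(outputs):
            alarm_counts[label][i] += alarm
    model = {}
    for label in (0, 1):
        # Laplace smoothing keeps every probability strictly between 0 and 1.
        prior = (class_totals[label] + 1) / (len(rows) + 2)
        rates = [(alarm_counts[label][i] + 1) / (class_totals[label] + 2) for i in range(n)]
        model[label] = (math.log(prior), rates)
    return model

def stack_predict(model, outputs):
    """Predict 1 (intrusion) when the positive class has the higher log-posterior."""
    def score(label):
        log_prior, rates = model[label]
        return log_prior + sum(
            math.log(rate if alarm else 1.0 - rate)
            for alarm, rate in zip(outputs, rates)
        )
    return 1 if score(1) > score(0) else 0
```

In a toy case where one classifier is reliable and the other five alarm on everything, majority voting would always report an intrusion, whereas a secondary learner of this kind can learn to follow the reliable classifier.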

**Table 5.** Classification result on different combination strategies

The recall rate in Table 5 reflects the number of underreports produced by each combination strategy. In the dataset of this paper, 1% of recall corresponds to about 40 underreports. Table 5 therefore shows that relative majority voting and weighted voting mask part of the feature recognition ability of individual classifiers, which damages the model when the number of weak classifiers is limited. Compared with these two methods, stacking achieves a higher recall. In specific scenarios where data features are relatively fixed, stacking can find the correct classification even when the weak classifiers conflict with each other.
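As a rough check on the scale involved, the statement that 1% of recall corresponds to about 40 underreports can be turned into an estimate of the number of positive samples. The figures below are only arithmetic on values quoted in this paper (98.25% is the stacking recall reported in the abstract), not additional measurements:

$$TP + FN \approx \frac{40}{0.01} = 4000, \qquad FN_{\text{stacking}} \approx (1 - 0.9825) \times 4000 = 70$$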

## **4 Conclusion**


The data application detection in this paper is mainly used for off-line analysis. For real-time monitoring and analysis, how to perform real-time analysis through a stream processing engine, and how efficient and effective such detection would be, remain subjects for further study and improvement.

## **References**

1. Giacinto, G., Roli, F.: Design of effective neural network ensembles for image classification purposes. Image Vis. Comput. **19**(9/10), 699–707 (2001)



## **Brief Analysis for Network Security Issues in Mega-Projects Approved for Data Clusters**

Shizhan Lan<sup>1,2(B)</sup> and Jing Huang<sup>3</sup>

<sup>1</sup> School of Software Engineering, South China University of Technology, Guangzhou 510006, China

lanshizhan@gx.chinamobile.com

<sup>2</sup> China Mobile Guangxi Branch Co., Ltd., Nanning 530012, China

<sup>3</sup> EVERSEC (Beijing) Technology Co., Ltd., Beijing 100191, China

**Abstract.** Network security is an important guarantee for Mega-projects approved for data clusters, whose network security awareness, monitoring, early warning, disposal and evaluation capabilities need to be improved comprehensively. This paper analyzes the network security issues in Mega-projects approved for data clusters along the dimensions of computing facility security, network facility security, combination and scheduling security, network operation service security, data security, and network situation awareness. It sets up gradually evolving atomic security capabilities for building a ubiquitous security network computing brain. Data assets are identified both actively and passively and sorted out through in-depth scanning and information completion; preset templates can be formed from AI (artificial intelligence) models, regular expression matching, keywords, combination rules, etc.; and data are classified and graded according to their sensitivity and displayed visually in the form of charts. The resulting multi-layer architecture includes the collaborative scheduling of computing networks on the control side, the perception of network convergence on the data side, and the management and scheduling of computing resources on the service side. It realizes interaction and supervision over the whole process, all elements and the whole industry chain of computing scheduling; provides security perception, monitoring, early warning, disposal and evaluation functions; and improves security perception and linkage monitoring across data centers and clusters. A coordinated threat handling capability is built up gradually.

**Keywords:** Mega-projects approved for data clusters · Network security · Network situation awareness · Network security computing brain · Atomic security capability · AI model

## **1 Introduction**

"Mega-projects approved for data clusters" refers to building a new computing power network system that integrates data centers, cloud computing and big data [1], so as to guide computing power demand in the east to the west in an orderly way. According to the demand for computing power, China promotes the echelon layout and overall development of data centers from east to west and accelerates the gradual and rapid iteration of "Mega-projects approved for data clusters". To comprehensively boost the development of new data centers, build an intelligent computing ecosystem with new data centers as the core, and give full play to the enabling and driving role of the digital economy, the Ministry of Industry and Information Technology has formulated and issued the Three-Year Action Plan for the Development of New Data Centers (2021–2023) [2], and makes every effort to promote the "Mega-projects approved for data clusters" project.

Network security is the premise for the development of "Mega-projects approved for data clusters". The project urgently needs to improve its network security perception, monitoring, early warning, disposal and evaluation capabilities in an all-round way, raise the level of security protection for data resources throughout their life cycle, improve computing power security monitoring and scientific scheduling, and respond to network attacks by shifting from static analysis to dynamic perception, from after-the-fact disposal to prior prevention, and from single-point prevention and control to global joint prevention.

This work is oriented to "Mega-projects approved for data clusters" and addresses scenarios of massive data processing and scientific computing, as well as the training and inference of artificial intelligence models under "East digital West training". It aims to promote the successful implementation of the "Mega-projects approved for data clusters" project, accelerate the transformation of data centers, and provide new momentum for high-quality economic and social development.

## **2 General Analysis of Computing Power Security**

For the construction of the security system of the "Mega-projects approved for data clusters" project, it is necessary to refine the security assurance objectives; clarify the access standards for security technical means such as security situation monitoring, traffic protection and threat disposal; deepen policy reform measures and major engineering suggestions for data resource protection and for computing resource monitoring and scheduling; and promote the application security of data resource circulation. As the core task of the construction and application of "Mega-projects approved for data clusters", network security focuses on building a multi-level collaborative supervision platform and monitoring system for basic networks, data centers, data center clusters, cloud platforms and application enterprises, and on improving the ability of the project to serve economic operation monitoring and industrial digital transformation monitoring.

The architecture of the computing power network consists of three layers: computing power infrastructure, arrangement management and operation service. The infrastructure layer combines computing infrastructure and network infrastructure into a new integrated computing network infrastructure, building a flexible and agile computing base and a fully connected intelligent network at the cloud edge. The arrangement management layer realizes unified arrangement and intelligence of the computing network by building the computing network brain. The operation service layer creates a new operation service system and business model by using technologies such as computing power trading, multidimensional dimension and computing power grid connection. In this architecture, security runs through the whole process, and improving endogenous security capability has become an important development goal. This paper analyzes the relevant network security issues along the above dimensions.

The overall goal is to build a network security value system for "Mega-projects approved for data clusters" and provide refined, ubiquitous and original twin security services. This means building a ubiquitous security computing network brain, providing a synchronous "pay as you go" security experience, transforming application-based security into task-based security, and realizing a differentiated security experience with a more refined process. It also means realizing a near-source defense mode based on the twin computing power mode, as well as a super-edge plus near-source-side defense mode (Fig. 1).

**Fig. 1.** Ubiquitous security computing network brain

With the rapid development of computing network technology and its continuous integration with Internet+, the industrial Internet, big data, cloud computing and other new technologies, more and more information assets provide services with the help of Internet technology. At the level of micro security capability implementation, gradually evolving atomic security capabilities are built to protect network security in all dimensions (Table 1).




**Table 1.** (*continued*)

## **3 Computing Facility Security**

Computing infrastructure includes cloud computing, edge computing and end computing. While providing powerful computing support for upper-tier applications, it also faces many risks. Comprehensive, systematic and three-dimensional protection is needed for cloud computing, edge computing and end computing.

In cloud computing, security protection should cover the physical layer, virtualization, business, data, operation and maintenance management, etc. In edge computing, it should cover network services, the hardware environment, virtualization, the edge computing platform, applications, capacity opening, management, data, etc.

In end computing, it should cover the physical layer, virtualization, applications, capacity opening, management, data, etc.

At the same time, the interconnection of cloud, edge and end must also be protected, by means including identity authentication, traffic monitoring and audit, interface control, and security situation monitoring.

The basic system planning of computing network security should be carried out in parallel. Based on the independent collaborative evolution stage of the computing power network, the construction of basic atomic capabilities should be strengthened. Standardization of computing security should begin, with standardized interfaces and access criteria formulated to solve the problems of the computing network's own security and interoperability. To meet customers' personalized and distributed computing power needs, technical pre-research and pilot demonstrations should be conducted, and decentralized security identification and security slicing technologies adopted so that security capability matches the distribution of computing power. The application of dynamic intelligent network slicing should also be researched to ensure differentiated network service capability.

## **4 Network Facility Security**

SRv6 (Segment Routing over IPv6) simplifies the set of network protocols, has good scalability and programmability, can meet the diversified needs of more new services, provides high reliability, and has good application prospects in cloud services. SRv6 and the new generation of SD-WAN (software-defined wide area network) are the core technologies for the convergence of computing and networking. A networking scheme that combines the two can link the backbone network with enterprise sites and realize the interconnection and perception of computing power. Deterministic network technology provides quality-of-service guarantees for new services that need ultra-large bandwidth, ultra-low delay and ultra-high reliability. However, the complex network environment, fuzzy security boundaries and high sensitivity to delay also bring new security challenges.

Traditional security solutions lack the scalability and programmability of SRv6 and the performance, flexibility and interconnection required for SD-WAN connections. Atomic security capabilities can support flexibility, interconnection, scalability and programmability, sense changes in edge connections, and provide consistent policy enforcement. A policy can isolate users, applications, workflows, or data based on many parameters to provide security over the entire transaction path. Traffic can be forced to follow specific behaviors, or be isolated to specific users or destinations, to ensure that policies are applied and enforced consistently.
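To make the idea of parameter-based isolation concrete, the following sketch evaluates a flow against an ordered rule list, where the first matching rule decides the action. It is purely illustrative: the `Flow` and `Rule` types, the three matching parameters, the action names and the default-deny behaviour are all assumptions and do not describe any specific SRv6 or SD-WAN product.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Flow:
    user: str
    application: str
    destination: str

@dataclass(frozen=True)
class Rule:
    action: str                         # hypothetical actions: "allow", "isolate" or "deny"
    user: Optional[str] = None          # None matches any value
    application: Optional[str] = None
    destination: Optional[str] = None

    def matches(self, flow: Flow) -> bool:
        """A rule matches when every parameter it specifies equals the flow's value."""
        pairs = (
            (self.user, flow.user),
            (self.application, flow.application),
            (self.destination, flow.destination),
        )
        return all(want is None or want == got for want, got in pairs)

def decide(rules, flow, default="deny"):
    """Apply the first matching rule; fall back to the default action otherwise."""
    for rule in rules:
        if rule.matches(flow):
            return rule.action
    return default
```

Evaluating the same ordered rule list at every enforcement point is one simple way to obtain the consistent policy application described above.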

## **5 Arranging and Scheduling Security**

Facing a highly complex computing network environment, the arrangement management layer cooperatively schedules the resources of each domain of the computing network according to diversified and customized computing power requirements. It perceives and coordinates computing power users, computing tasks, network resources and computing power resources. Arrangement management must be able to control the security of computing power and solve the problem of computing power abuse. Such abuse includes illegal mining, brute-force cracking and other acts, which not only encroach on computing power resources but may also use computing power to launch attacks. Based on the self-adaptation mode of the computing power network, a north-south linkage between security services and computing power should be established to promote the scheduling of security computing power. Because heterogeneous computing power nodes are introduced rather than all nodes being self-built, the identity and trust problems of computing power nodes must be solved, and technologies such as differential privacy and homomorphic encryption should be researched and verified for the interaction between algorithms and computing power. Pre-research on node collaboration is also needed: computing power is spread over multiple nodes, so the nodes need a synchronization mechanism and an adaptive, self-organizing architecture. An "edge by edge collaboration" mechanism is used for local exchange of capability and performance information.

## **6 Operation Service Security**

Operation service security mainly ensures the security of computing network services, including identity security, operation security and integrated application security. Identity security ensures that the identities of computing nodes and users in the computing power network can be identified and verified. Operation security provides secure transactions, security monitoring, security audit and similar functions. Integrated application security provides flexible, dynamic and end-to-end business security for differentiated application scenarios such as digital life, intelligent production and digital society.

## **7 Data Security**

Data security [3] runs through all levels of the computing power network. It mainly includes data asset identification, data security protection, data flow security, computing security, and East digital West training, and it ensures that data remain in effective and legitimate use throughout their life cycle.

#### **Data Asset Identification**

Data asset identification combines active and passive discovery to find assets including servers, relational databases, non-relational databases, interfaces, etc., and fills in data asset attributes through information completion and in-depth scanning. From the perspective of data assets, data is obtained from SMC/SMP and from data resource scanning and discovery, and is then classified and graded for management. The classification and grading list management function mainly includes the data classification and grading list, the important data list and the sensitive data list. It supports real-time display of classification and grading information along different dimensions, sorting of identified asset data, classification and grading mapping of data according to data sensitivity, visual display in the form of charts, and controllable storage of warm and cold data.

Preset templates can be built from the data classification and grading rules of countries, industries or enterprises, using AI models, regular-expression matching, keywords, combination rules, etc. Classification and grading templates can also be configured according to the needs of the current business.
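As a minimal sketch of what a rule-based classification and grading template might look like, the following Python snippet combines regular-expression rules and a keyword rule. The rule set, the category names and the numeric grade scale are hypothetical and serve only as an illustration; they are not taken from any national, industry or enterprise standard.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One entry of a hypothetical classification and grading template."""
    category: str  # data class, e.g. "contact"
    grade: int     # sensitivity grade; higher means more sensitive (assumed scale)
    pattern: str   # regular expression (a keyword rule is just a literal pattern)


# Illustrative template only; a real template would follow the applicable rules.
TEMPLATE = [
    Rule("contact", 2, r"\b1[3-9]\d{9}\b"),               # 11-digit mobile number shape
    Rule("contact", 2, r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),  # e-mail address shape
    Rule("credential", 3, r"(?i)\b(password|secret)\b"),  # keyword rule
]


def classify(text: str) -> tuple[str, int]:
    """Return the (category, grade) of the highest-grade rule that matches."""
    best = ("public", 0)  # default when no rule matches
    for rule in TEMPLATE:
        if re.search(rule.pattern, text) and rule.grade > best[1]:
            best = (rule.category, rule.grade)
    return best
```

Taking the highest matching grade is one possible combination rule; a business-specific template could just as well require several rules to match before a grade is assigned.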

#### **East Digital West Training**

The data value evolution path takes knowledge as its core. Driven by technology, it develops artificial intelligence model training and inference, and constructs the overall technical framework of "East digital West training". AI has become a foundation of new infrastructure technology, accelerating the deployment of artificial intelligence.

AI modeling differs from data development: it has no hierarchical modeling restrictions, it opens up data reading and warehousing, and it supports free modeling. In addition to the basic data processing components, it has a rich set of built-in machine learning algorithms and supports user-defined processing components to help mine the value of data in depth.

#### **Data Flow Security**

Data flow involves data aggregation, data transmission between providers and users, and the use of data beyond the control of its owners. Data in flow therefore faces greater security risks, including disclosure of personal information, exposure to attack and leakage, and illegal over-collection, analysis and abuse of data. During the data flow process, the data shall be identified; the data flow node, operation, flow direction and other information shall be recorded; and a unified cross-domain and cross-system data flow identifier shall be established, so that the direction of data flow can be controlled and the flow itself can be perceived. To monitor the flow of data in real time, it is necessary to strengthen network security monitoring through technical means, especially automated security monitoring, and to comprehensively monitor and analyze the data sharing platform and system through traffic, logs, configuration files, etc. This facilitates early warning and collaborative defense against network security events, and improves overall security situation awareness, security decision-making and other capabilities.
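The recording requirement described above can be sketched as a small append-only log. This is an illustration under stated assumptions rather than a prescribed design: the class name `FlowLog`, the record fields, the node names in the usage example, and the use of a truncated SHA-256 content hash as the unified cross-domain identifier are all hypothetical choices.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class FlowLog:
    """Append-only record of where an identified data item has travelled."""
    records: list = field(default_factory=list)

    @staticmethod
    def flow_id(content: bytes) -> str:
        # Assumption: a content hash acts as the domain-independent identifier.
        return hashlib.sha256(content).hexdigest()[:16]

    def record(self, content: bytes, node: str, operation: str, target: str) -> str:
        """Log one hop (node, operation, flow direction) and return the data's id."""
        fid = self.flow_id(content)
        self.records.append({"id": fid, "node": node, "op": operation, "to": target})
        return fid

    def trace(self, fid: str) -> list:
        """Return the ordered hops of one data item as (node, operation, target)."""
        return [(r["node"], r["op"], r["to"]) for r in self.records if r["id"] == fid]


log = FlowLog()
fid = log.record(b"customer-table-v1", "east-dc", "share", "west-dc")
log.record(b"customer-table-v1", "west-dc", "train", "west-dc")
print(log.trace(fid))
```

Because the identifier is derived from the content rather than from any one system, two domains that log the same data item independently would arrive at the same id, which is what makes a cross-system trace possible in this sketch.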

## **8 Situational Awareness**

Situational awareness integrates detection, early warning, response and disposal functions, and is the security brain of the active defense system. It plans the security capability of the integrated computing service system of "Mega-projects approved for data clusters"; integrates existing data center security data, the interoperability monitoring platform and supporting business systems; and builds a computing security perception and monitoring platform at the data center level, the data center cluster level and the industry-wide level. This platform realizes interaction and supervision across the whole process, all elements and the whole industry chain of computing scheduling, and provides security system [4] functions such as perception, monitoring, early warning, disposal and evaluation. It improves security awareness and linkage monitoring capability across data centers and data center clusters, and gradually builds a coordinated threat handling capability.

#### **8.1 Situation Awareness of Network Security Quality Based on Data Network Collaboration**

The aim is to improve the monitoring system for computing network governance, optimize the network architecture and traffic routing of data centers in the eastern and western regions, promote quality monitoring of data network collaboration, promote the networking of edge data centers, and continuously improve the network capacity of data centers (Fig. 2).

**Fig. 2.** Situation awareness of network security quality based on data network collaboration

#### **8.2 Situation Awareness of Computing Capacity Security Improvement Evaluation**

By using the network to drive the cloud and to strengthen computing, the value of computing power is enhanced on the basis of the computing power network, and this enhancement is safeguarded by computing power security. The development of the computing power network is driven by multiple demands, the implementation of ubiquitous computing power is supported by multiple technologies, and the security value of computing power is enhanced by multi-dimensional security situational awareness (Fig. 3).

**Fig. 3.** Situation awareness of computing power security improvement evaluation

#### **8.3 Situation Awareness of Industry Chain Security Enhancement Assessment**

This dimension accelerates key technology and product innovation at the software layer, such as operation security management for new data centers, and at the platform layer, such as cloud-native and cloud-edge integrated security, and improves software and hardware synergy. It also establishes and improves the security standard system for new data centers, draws the security map of the whole industry chain of new data centers, promotes the completion of key links, and carries out security capability evaluation of new data centers (Fig. 4).

**Fig. 4.** Industry chain security enhancement assessment situation awareness

#### **8.4 Situation Awareness of Green Low Carbon Assessment**

The continuous deepening of the national "double carbon" strategy has put forward higher requirements for the green and low-carbon level of the data center industry, and energy efficiency indicators such as PUE and CUE are more strictly restricted. "Mega-projects approved for data clusters" is a powerful driver for data centers to achieve "carbon peak and carbon neutrality". The collection and evaluation of the energy consumption saved after "Mega-projects approved for data clusters" can serve as one of the dimensions for evaluating its situation awareness. To achieve the "double carbon" goal quickly, implement the Notice of the Ministry of Industry and Information Technology on Printing and Distributing the Three-Year Action Plan for the Development of New Data Centers (2021–2023), and optimize the green development of the data center industry chain, it is necessary to establish and improve the green data center standard system (Fig. 5).

**Fig. 5.** Green low carbon assessment situational awareness

#### **8.5 Situation Awareness of Security Assurance Assessment**

The key supporting construction scheme of "Mega-projects approved for data clusters" clearly emphasizes that, from data risk identification and protection and data security compliance assessment through to data encryption protection and related technical monitoring, it is necessary to "synchronously plan, construct and use" security technical measures to ensure business stability and data security (Fig. 6).

**Fig. 6.** Security assurance assessment situation awareness

#### **8.6 Threat Collaborative Disposal Scenario**

Based on the security monitoring means of new data centers, and facing the network threat collaborative disposal scenario of "Mega-projects approved for data clusters" [5], closed-loop disposal and collaborative linkage of threat disposal are carried out. A network security risk case base, emergency drill scenario base, emergency disposal plan base, emergency disposal expert base and emergency response tool set base are accumulated in order to shift threat disposal toward early risk warning and prevention, to improve the scientific rigor, accuracy and timeliness of threat disposal, and to strengthen capacity building for coordinated disposal.

## **9 Conclusion**

The "Mega-projects approved for data clusters" project realizes "the network moves with the cloud, and the cloud moves with the needs", forming a multi-layer architecture system including the computer network collaborative scheduling of the control plane, the network fusion perception and management of the data plane, and the arrangement of computing resources of the service plane.

The new architecture, new technologies and new services of the "Mega-projects approved for data clusters" network may introduce new security risks, which need to be addressed by a new security mechanism adapted to them. The infrastructure layer faces potentially complex network risks and computing power node security risks. The scheduling management layer involves scheduling security risks and loss of control over computing power use. The operation service layer faces problems such as the access of malicious nodes, untrusted transactions and insecure applications. In addition, there may be data security risks such as uncontrollable data flow in the "Mega-projects approved for data clusters" network, so data security needs to be strengthened through an integrated, whole-process trusted mechanism.

This paper comprehensively analyzes the network security problems in "Mega-projects approved for data clusters" from the aspects of computing power facility security, network facility security, scheduling security, operation service security, data security, situation awareness and so on. It proposes building a network-based ubiquitous endogenous security system with a ubiquitous security computing brain as the core, atomic security capabilities as the foothold, and intelligent orchestration as the link.

Guided by security applications, oriented to the new business mode of cluster scheduling, and combined with the existing traffic protection and security monitoring means in the data center, the system focuses on realizing assessment systems for data network collaboration, including network security quality situational awareness, computing power security improvement assessment situational awareness, industry chain security enhancement assessment situational awareness, green low-carbon assessment situational awareness and security assurance assessment situational awareness.

It then promotes the construction of traffic protection security measures for the network convergence, transmission, storage and integrated application links of data center cluster nodes and, in combination with active detection, implements the construction of security situation capability and builds the capability for collaborative threat disposal.

## **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## Considerations on Evaluation of Practical Cloud Data Protection

Rui Mei<sup>1,2</sup>, Han-Bing Yan<sup>3(B)</sup>, Yongqiang He<sup>4</sup>, Qinqin Wang<sup>1,2</sup>, Shengqiang Zhu<sup>5</sup>, and Weiping Wen<sup>4</sup>

<sup>1</sup> Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China

<sup>2</sup> School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100049, China

<sup>3</sup> National Computer Network Emergency Response Technical Team/Coordination Center of China (CNCERT/CC), Beijing 100029, China

yhb@cert.org.cn

<sup>4</sup> Peking University, Beijing 102600, China

<sup>5</sup> PingAn Cloud, PingAn Insurance Group of China Ltd., Shenzhen 518023, China

Abstract. With the continuous progress of enterprises' digital transformation, business-driven cloud computing has seen tremendous growth. The security community has proposed a large body of technical mechanisms, operational processes, and practical solutions to achieve cloud security. In addition, diverse jurisdictions impose regulatory requirements on data protection to mitigate possible risks such as unauthorized access, data leakage, and disclosure of sensitive and private information. In view of this, several practical standards, frameworks, and industry best practices have been proposed to evaluate and improve the protection level of cloud data. However, few evaluation models can conduct a comprehensive quantitative evaluation of cloud data protection that covers security, privacy, and even ethical considerations. In this paper, we first comprehensively review cloud data security and privacy issues, including ethical concerns, which we regard as a specific type of risk caused by human factors and which relate to acting honorably, honestly, justly, and legally, as well as to due diligence and due care. Then, we propose a novel evaluation model for cloud data protection that can quantitatively assess the protection level. Finally, a parallel evaluation comparing manual assessment by experts with our evaluation model shows that the results of our model are consistent with the conclusions of the manual evaluation.

Keywords: Cloud data protection *·* Evaluation model *·* Security *·* Privacy *·* Ethics

## 1 Introduction

With the rapid improvement of cloud computing, the cloud offers flexible and affordable software, platforms, infrastructure, and storage available to organizations across all industries. Faced with limited budgets and increasing growth demands, cloud computing presents an opportunity for organizations to reduce costs, increase flexibility, and improve IT capability [14]. Despite the rapid adoption of cloud computing, security and privacy remain key issues for the security community [20,25]. Although cloud service providers like Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP) continue to expand security services to protect their evolving cloud platforms, security and privacy are ever-lasting considerations while migrating traditional IT to cloud [10].

Cyberspace is not peaceful: both external advanced persistent threats (APTs) and insider attacks continue to occur. Since cloud environments often host a variety of tenants and their vast amounts of valuable data, cloud platforms are also targeted by cyber threat actors [32–34]. In external APTs, attackers are always looking for new attack surfaces in the cloud to bypass existing security controls [41]. Insider attacks are often carried out by disgruntled employees who have limited authorized access and intentionally exfiltrate sensitive data or escalate privileges. Because it stems from the human factor, this is an ethical issue rather than a purely technical one.

The implementation of cloud migration by enterprises means losing physical control of systems and data, thus it requires an assessment method to evaluate the protection level of cloud environments, including cloud data. Although many standards, frameworks, and best practices have been proposed by the security community and industry, there is rarely a comprehensive evaluation model that can quantitatively analyze the score of the protection level of cloud data that fully considers security, privacy, and ethical issues. In summary, this paper makes the following contributions:


## 2 Overview of Cloud Data Protection

This section introduces the methodology related to cloud data protection. We leverage the Data States Model and the Cloud Data Lifecycle Model to summarize major security and privacy controls from a top-level perspective. Further considerations on fine-grained controls, including technical measures, operational policies, and legal & regulatory compliance, are discussed in the rest of the paper.

Fig. 1. The data states model in IT systems.

#### 2.1 Data State Model

Data, as a type of critical asset, exists in one of three states both on-premises and in the cloud: *at rest*, *in transit*, and *in use* [26]. Regardless of the state of the data, IT systems should implement appropriate controls to protect the data and mitigate security and privacy risks [43]. Figure 1 shows the Data State Model, including the three states of data and the transformations between them. Data in use can be converted to both the in-transit state and the at-rest state; however, data at rest cannot be changed directly to the in-transit state, and vice versa. It is worth noting that this characteristic of conversion between data states depends on the classical Von Neumann architecture, which is still the dominant one worldwide. Other computing architectures, e.g., quantum computing, are out of our scope.
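The transition rule of the Data State Model can be written down as a small lookup, sketched below in Python. The text only states the transitions out of the in-use state and the absence of a direct edge between at rest and in transit; the edges leading back into the in-use state are an assumption made here so that the model is usable, and the names `DataState`, `can_transition` and `path` are illustrative.

```python
from enum import Enum


class DataState(Enum):
    AT_REST = "at rest"
    IN_TRANSIT = "in transit"
    IN_USE = "in use"


# Direct transitions; the two edges back into IN_USE are an assumption.
ALLOWED = {
    (DataState.IN_USE, DataState.AT_REST),
    (DataState.IN_USE, DataState.IN_TRANSIT),
    (DataState.AT_REST, DataState.IN_USE),
    (DataState.IN_TRANSIT, DataState.IN_USE),
}


def can_transition(src: DataState, dst: DataState) -> bool:
    """True if data can move directly from src to dst."""
    return (src, dst) in ALLOWED


def path(src: DataState, dst: DataState) -> list:
    """Shortest sequence of states from src to dst under the model."""
    if src == dst:
        return [src]
    if can_transition(src, dst):
        return [src, dst]
    # The only indirect pairs are at rest <-> in transit, which pass through IN_USE.
    return [src, DataState.IN_USE, dst]
```

For example, sending a stored file over the network corresponds to `path(DataState.AT_REST, DataState.IN_TRANSIT)`, which passes through the in-use state because the data must first be loaded into memory.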

Data in Use refers to any data in main memory or other caches while an application is using it. Due to the multitasking and concurrent features of modern information systems, it is important to ensure that only authorized access to data in use is possible. Process isolation built into the Operating System (OS) and application-level sandboxes are the primary controls for data in memory and cache. However, emerging attack vectors often try to bypass existing security mechanisms through vulnerability exploitation or advanced impersonation techniques. To this end, some research attempts to leverage homomorphic encryption [1], which limits the risk of data leakage because memory never holds unencrypted data.

Data at Rest, also known as data on storage, is any data stored on media such as hard drives, external USB drives, network attached storage (NAS), and storage area networks (SAN). The major risks it faces include data exfiltration, integrity breaches, and unavailability (e.g., Denial of Service, DoS). Strong symmetric encryption is the key control for the security and privacy of data at rest. In addition, as a compensating control, data redundancy can provide high availability (HA) of data. Furthermore, strict authentication and authorization controls [30] can also help prevent unauthorized access.


Table 1. Cloud data lifecycle and representative controls

Data in Transit, also called data in motion, refers to any data transmitted over a network. The exchange of data between information infrastructures almost entirely depends on the transmission network in cyberspace. In particular, unlike the traditional on-premises model using the internal local networks, if enterprises migrate their IT systems to the cloud, all data access will be transferred over the Internet. Therefore, data in transit is more likely to be the target of cyber attacks than the other two data states. Leveraging a combination of symmetric and asymmetric encryption can protect data in transit generally.

#### 2.2 Cloud Data Lifecycle

Data in the cloud is constantly being created, stored, used, and transmitted, and once the data is no longer valuable, it needs to be destroyed. Unlike other valuable physical assets, the value of data is time-sensitive and context-specific, which makes data protection complex. The Cloud Data Lifecycle Model provides a generic approach to identifying the broad categories of risks facing the data and the associated security or privacy controls. This allows us to consider threats, vulnerabilities, and risks of cloud data at a higher level of abstraction without getting bogged down in the concrete details of a specific organization.

Table 1 illustrates each phase of this model and the corresponding representative controls. Note that the cloud data lifecycle is not always iterative, nor is it strictly linear; data can even be in multiple phases simultaneously. As an example, data being shared may be used and stored at the same time if co-workers collaborate in a Software as a Service (SaaS) app. Furthermore, data in a given phase can also exist in multiple states. Regardless, data should be protected at every stage with security and privacy controls commensurate with its value [18].


#### 2.3 Shared Responsibility Model

Cloud computing is a business-driven computing model rather than a technology-driven one, thus the interests of cloud service providers (CSPs) and cloud service customers (CSCs) are not always aligned. CSCs want maximum computing capabilities at the lowest cost. On the other hand, CSPs want to provide as few services as possible while maximizing profits. We do not review the cloud computing reference model here, as it is clearly defined in ISO/IEC 17789 [24]. Fortunately, despite the adversarial relationship between the two sides, their interests in security and privacy converge. One example is that a data breach of a CSC caused by vulnerabilities in the infrastructure of a CSP will cause both parties to suffer brand and reputation damage and lower profits, and even to face ongoing lawsuits.

Table 2. The shared responsibility model

The Cloud Shared Responsibility Model [16] clarifies the responsibilities of both the CSPs and CSCs for defense-in-depth of the cloud architecture. Table 2 shows the details of this model, where the rows and columns represent the layers of the cloud architecture and the cloud service models, respectively. In Table 2, cells marked C, S, and P indicate that the responsibility lies with the CSC, with both parties (shared), and with the CSP, respectively. It is worth mentioning that although we only list the three most common cloud service models here, namely Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS), which are also defined in ISO/IEC 17789 [24], other service models have similar shared responsibility models. We particularly note that, regardless of the service model, the responsibility for data access security lies with the CSC. This means that the ultimate responsibility for any data breach should be borne by the CSC. Of course, the CSC also has the right to seek compensation from the CSP. Following the elementary principle of "layered defenses" in information security, both CSCs and CSPs need to implement security and privacy controls at different layers to protect data. In short, this paper does not attempt to distinguish which party is responsible for each implemented security control, since this does not affect our evaluation of the security, privacy, and ethics of controls.

## 3 Techniques, Operations, and Compliance

Industry practice over the past decade shows that a large body of proven approaches, mechanisms, and tools from traditional IT has been adopted to build the foundation of cloud computing. Therefore, the cloud and traditional IT share most of the security controls used to secure systems and data. These controls usually cover three aspects, namely techniques, operational activities, and compliance. We will discuss security and privacy controls for cloud data protection, along with the possible risks and how to mitigate them. Furthermore, we also discuss ethical considerations, which are in essence also a risk and relate to acting honorably, honestly, justly, and legally, as well as to due diligence and due care.

The rest of this section will discuss cloud-specific key security and privacy controls involving three aspects i.e. technical mechanisms, operational policies, and legal compliance.

#### 3.1 Technological Mechanisms

Unlike on-premises environments, the cloud environment can rarely protect data by implementing strong access controls at clear boundaries, thus encryption is the primary option for protecting data. Cloud computing is usually multi-tenant, and even with a private cloud deployment model there are conflicts of interest among different departments within the same organization. Therefore, data obfuscation is an important security and privacy control. In addition, virtualization techniques and the corresponding security controls, as elementary cloud infrastructure, face several cloud-specific risks.

#### *1) Encryption and Key Management*

It should come as no surprise that cloud computing depends deeply on encryption: no matter what state the data is in, it is impossible to use cloud computing securely without encryption technology. Due to the criticality of encryption, organizations should concentrate their efforts on correctly implementing and deploying cryptographic systems, and key management is the area of greatest concern. If an organization uses multiple CSPs or intends to retain physical control over cryptographic keys, one solution is to escrow keys within the organization, but this requires additional infrastructure and personnel. Another way is to escrow keys to a third party, such as the prevailing Cloud Access Security Broker (CASB) [3], which is a service that provides key management and unified cloud data access control.

Despite all the efforts to encrypt data in the cloud, some risks remain that force us to strike a balance. First, encryption can be done at different layers and granularities, such as volume-level, object-level, file-level, application-level, and so forth [42]. For performance reasons, it is difficult to implement strong encryption at all layers. As an example, even if volume-level encryption is applied to a volume attached to a virtual machine (VM) instance, the data is still vulnerable if an attacker gains access to the VM instance. Second, IT administrators or security staff may need to access other personnel's cryptographic keys for key recovery or other reasons. If a disgruntled employee obtains such a key, the risk of unauthorized access increases. This is also an ethical issue. Third, although it is not good practice, CSCs sometimes, for technical or budgetary reasons, escrow keys to the same CSP that stores the organization's data. This dependency risk is also termed *Lock-In*, which will be discussed in Sect. 3.2. Finally, because legal and regulatory requirements mandate specific encryption algorithms or methods, a security gap arises between jurisdictions when data is exchanged across borders. This issue will be detailed in Sect. 3.3.

Fig. 2. Overview of tokenization mechanism.

#### *2) Data Obfuscation and De-identification*

For security and privacy, practical cloud data protection needs to obscure sensitive data or use a representation of that data instead. Masking is an elementary data hiding technique (e.g., showing only the last four digits of a credit card number). Similar techniques include *randomization*, which replaces part of the data with random characters, and *shuffling*, which swaps values among different records within the same dataset. Tokenization, illustrated in Fig. 2, is another privacy protection technique, in which a nonsensitive tag called a token is created as a substitute to be used in place of sensitive data. An implementation of tokenization typically consists of two databases, one storing the actual sensitive data and the other storing the tokens corresponding to each data entry. A user who needs to access data first obtains nonsensitive tokens, and then a strong access control mechanism such as Identity and Access Management (IAM) [15] decides whether this user can access the corresponding sensitive data entries. Anonymization is the primary de-identification technique when the data contains Personally Identifiable Information (PII). This process includes removing *direct identifiers*, e.g., names and bank accounts, and *indirect identifiers*, which are often statistical or demographic information that can be combined to infer PII, e.g., age and shopping history [6,8,31,37].
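The following Python sketch illustrates masking, shuffling and tokenization in their simplest form. It is a toy under stated assumptions, not a production design: the in-memory `TokenVault` collapses the two databases into one protected mapping, its `grant` method stands in for a real IAM decision, and all names are hypothetical.

```python
import random
import secrets


def mask(value: str, visible: int = 4) -> str:
    """Masking: hide everything except the last `visible` characters."""
    hidden = max(len(value) - visible, 0)
    return "*" * hidden + value[hidden:]


def shuffle_column(values: list, seed: int = 0) -> list:
    """Shuffling: reassign values among the records of the same dataset."""
    shuffled = list(values)
    random.Random(seed).shuffle(shuffled)
    return shuffled


class TokenVault:
    """Tokenization: applications hold tokens; the vault holds the real values."""

    def __init__(self):
        self._vault = {}          # token -> sensitive value (the protected store)
        self._authorized = set()  # stand-in for an IAM decision (assumption)

    def tokenize(self, value: str) -> str:
        token = "tok_" + secrets.token_hex(8)  # random, carries no information
        self._vault[token] = value
        return token

    def grant(self, user: str) -> None:
        self._authorized.add(user)

    def detokenize(self, user: str, token: str) -> str:
        if user not in self._authorized:
            raise PermissionError(f"{user} may not read sensitive data")
        return self._vault[token]
```

The token is generated randomly rather than derived from the value, so holding a token reveals nothing about the underlying data; access to the original depends entirely on the authorization check.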

There are three major risks in implementing the aforementioned security and privacy controls. First, the above techniques perform well on structured data but may present problems on unstructured data, which could be located on any media. Although data loss prevention (DLP) and continuous monitoring are the available solutions, they are still not enough to address the challenge of discovering sensitive data. Second, tokenization depends on the access control mechanism, so it inherits all the risks of that mechanism, among which the human factor is always the most significant. This is also an ethical dilemma. Last, although most privacy regulations require data anonymization or de-identification for any PII use outside of live production environments, identifying indirect identifiers remains a hard problem: there is a lack of effective rules for deciding whether a piece of information that seems innocuous can be combined with other information to infer PII.

#### *3) Virtualization*

As is well known, virtualization technology is the cornerstone of cloud computing and enables the cloud's widely valued on-demand services and resource pooling. In a sense, virtualization is also a security control that achieves access control through the isolation of diverse layers. Despite all the convenience, several risks need to be considered when protecting cloud data in practice using virtualization. First, since the hypervisor that manages VM instances is the critical component of a virtualization solution, it is an attractive target. Compromising a VM instance only results in a data breach within that VM guest, thus threat actors may instead attempt to compromise the hypervisor. Because the hypervisor acts as the interface and controller between the virtualized instances and the host resources, exploiting the hypervisor can affect the security of all VM guests [7]. Another risk is guest escape. Weakly designed or configured VM instances or hypervisors may allow users to break restrictions and leave their own VM instances to gain unauthorized access. There are two kinds of guest escape: lateral movement, i.e., unauthorized access from one VM guest to another, and vertical movement, i.e., a user obtaining host machine permissions from a VM guest. The second is more harmful to cloud data protection. Finally, since the cloud environment is multi-tenant, we have to deal with data seizure issues. Legal activity may result in law enforcement agencies or plaintiffs' attorneys seizing or inspecting a host machine that runs hundreds of VM instances belonging to different CSCs, even if a given organization is not the target. Both the security community and the judicial community still need to make great efforts to cope with this problem [39].

#### 3.2 Operational Policies

While technical controls lay the foundation for mitigating cloud risks, security operations in the cloud provide ongoing security and privacy assurance. This section discusses several key controls in security operations. For reasons of space, we do not go into every detail here but focus on analyzing the main policies of security operations.

#### *1) Data Classification*

Data identification and classification are the foundation of cloud data protection. All implemented technical and administrative controls determine the level of protection based on the classification of data. Since the organization constantly creates data in its operations, classification is an ongoing operational process, also known as "Data Discovery" [17]. Typical approaches to data discovery are label-based, metadata-based, and content-based. Whether the data is created on-premises or in the cloud, a useful aid to data classification is DLP, a technology system designed to identify, inventory, and control the use of data that an organization deems sensitive. DLP does so regardless of whether the data is employees' personal data, such as web browsing history or pending resignation letters. Data discovery can therefore be a "double-edged sword" that raises privacy and ethical concerns.

#### *2) DRM/IRM*

In the cloud, data is out of the organization's physical hold, so compensating controls are needed to protect it throughout its lifecycle, especially during the use and share phases. DRM, also known as IRM and introduced in Sect. 2.2, is an ideal mechanism [23,27]. DRM/IRM usually has the following advantages: (1) *persistent protection*, which follows the information it protects regardless of where it is located; (2) *dynamic policy control*, which allows data owners to modify access control lists (ACLs) and permissions for the protected data under their control; (3) *remote rights revocation*, which allows the data owner to revoke permissions at any time; (4) *continuous auditing*, which allows comprehensive monitoring of the access history.

Despite these advantages, leveraging DRM/IRM in the cloud still faces challenges. One is *replication restrictions*. DRM/IRM governs permissions for replication and sharing, but administration in the cloud environment often requires creating, shutting down, moving, and backing up VM instances, which conflicts with DRM/IRM policies. The other is *jurisdictional conflicts*. The blurred physical boundaries of cloud computing lead to transborder flows of large amounts of data, some of which may even fall out of the owner's control, and these flows trigger regulatory restrictions in different jurisdictions.

#### *3) Continuous Monitoring*

Automated or continuous monitoring and reporting is an important mechanism by which cloud computing achieves its self-service capability. The monitored objects mainly include: (1) the *physical environment*, involving the temperature, moderation, and so forth of the data center; (2) the *host level*, including the performance and event tracing of the operating system, middleware, and applications; (3) the *network level*, covering the various network components: not only hardware and software but also cabling, Software Defined Network (SDN), and the control plane.

Continuous monitoring can improve performance and enhance security; however, it can also raise privacy and ethical issues. For example, the CSP collects the operating system's event tracing log through an agent installed in the VM instance, so as to obtain the VM guest status and perform anomaly detection. Although the system event log does not contain direct identifiers, i.e., PII, the user's behavioral characteristics can still be inferred from system events. Such audit data can thus be used for precision marketing and, in the worst case, obtained by cyber threat actors who study user behavior in order to prepare suitable attack vectors.

#### 3.3 Legal and Regulatory Compliance - LRC

Although cloud computing exists to drive business improvement, many of its features, such as decentralization and multi-tenancy, make it difficult to comply with existing data privacy protection laws and other regulations.

#### *1) eDiscovery*

eDiscovery refers to the process of identifying and obtaining electronic evidence for either prosecutorial or litigation purposes. Since cloud computing is often multi-tenant, it is more difficult to find data owned by one CSC without intruding on data from other CSCs that may reside on the same storage volume, drive, or physical machine. In addition, from a judicial point of view, all evidence needs to be tracked and monitored from the time it is recognized as evidence and acquired for that purpose, which is called the *chain of custody*. The design of cloud computing, however, may dynamically reallocate resources in the same storage location to other tenants, which conflicts with this judicial principle. Thus, when creating security and privacy policies for maintaining a chain of custody or conducting activities that require the preservation and monitoring of evidence, we need to comply with the applicable regulations.

#### *2) Diverse Jurisdictions*

Many of the difficulties in complying with the laws and regulations that apply to cloud computing stem from its design. Cloud resources are often geographically dispersed, across county, state, and even international borders. As mentioned earlier, the transborder transfer of data is the main obstacle to legal and regulatory compliance in the cloud. Compliance governance must take all applicable laws and regulations into account in order to operate reasonably, with an understanding of the legal risks and liabilities in the cloud.

## 4 Empirical Evaluation Model

This section will detail our proposed evaluation model for cloud data protection. By analyzing the aforementioned factors that affect cloud data security and privacy, we present a novel algorithm to quantitatively calculate the score of cloud data protection in a specific organization, and show its overall protection level.

#### 4.1 Important Factors

We consider four factors to be important when assessing the protection level of cloud data in an organization.

– Intra-phase. As mentioned in Sect. 2.2, a variety of security and privacy controls are implemented at each phase of the cloud data lifecycle. As Table 1 shows, even a single phase may contain multiple data states, which means that more controls must be implemented so that data in each state is protected. Hence, the more data states a phase has, the more assessment it requires. Furthermore, data in transit is generally more vulnerable to compromise than data in storage and at rest, because data in transit on public carriers is beyond the confines of the cloud itself. Based on this intuition, we assign a higher weight in our evaluation model to phases that have multiple states or contain the transit state.


Hence, to conduct a comprehensive evaluation of cloud data protection, we need to first calculate the scores for intra-phase, inter-phase, operations and compliance, respectively.

#### 4.2 Quantitative Analysis

Table 3. Intra-phase weight rating scale

Table 4. Severity levels of possible risks rating scale

First, to calculate the intra-phase score, we define weights based on the data states present within a phase, as shown in Table 3. We prioritize the three data states (in transit, at rest, and in use) according to the likelihood of risk discussed in Sect. 4.1, and construct a binary truth table. The last column of Table 3 lists the weight value generated for each phase by the data states it contains, with the binary representation in parentheses. We further use the intra-phase weight to compute the protection score, which is defined as:

$$IntraS = \frac{1}{6} \times \left(\sum_{i=1}^{6} w_i \times \frac{1}{\max(r_i)}\right) \tag{1}$$

where *w<sub>i</sub>* is the weight of phase *i* of the cloud data lifecycle defined in Table 3, and *r<sub>i</sub>* denotes the severity levels of the possible risks in phase *i*, which can be found in industry standards or best practices. Since the possible risks within a phase with diverse data states are usually technical issues, we identify the risks and assign their severity levels based on the Common Attack Pattern Enumeration and Classification (CAPEC) list defined by US-CERT and DHS in collaboration with MITRE [35]. For quantitative calculation, we map each severity level to a numerical value using the conversion table of the Common Vulnerability Scoring System (CVSS) [19], shown in Table 4. We then select the highest value among the identified risks of each phase and obtain the intra-phase score.
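The intra-phase formula can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `intra_score` is ours, and the weights and severity values in the example are hypothetical placeholders, not the entries of Table 3 or Table 4.

```python
from typing import Sequence


def intra_score(weights: Sequence[float],
                risks: Sequence[Sequence[float]]) -> float:
    """Eq. (1): average over the phases of w_i / max(r_i).

    weights[i] is the intra-phase weight of phase i, and risks[i] holds the
    numeric severity values of the risks identified in that phase.
    """
    if len(weights) != len(risks):
        raise ValueError("expected one weight and one risk list per phase")
    # Only the most severe risk of each phase enters the score.
    terms = [w / max(r) for w, r in zip(weights, risks)]
    return sum(terms) / len(terms)


# Hypothetical inputs for the six lifecycle phases (not taken from the paper).
example_weights = [1.0, 2.0, 1.0, 3.0, 2.0, 1.0]
example_risks = [[4.0], [6.5, 3.0], [2.0], [8.0, 5.0], [5.0], [7.0]]
example_intra = intra_score(example_weights, example_risks)
```

Because each phase contributes `w_i / max(r_i)`, a single high-severity risk lowers the contribution of its phase no matter how many low-severity risks accompany it.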

Next, we define the formula for calculating the score of protection level of inter-phase.

$$InterS = \frac{1}{6} \times \left(\sum_{i=1}^{6} \frac{1}{(\max(r_i))^{w_i'}}\right) \tag{2}$$

where *w'<sub>i</sub> = (10 + i)/10* for phase *i* of the cloud data lifecycle, which gives the later phases the slightly higher weights mentioned in Sect. 4.1. As in the intra-phase formula, *r<sub>i</sub>* denotes the possible risks, here those arising during the whole lifecycle of cloud data. We use CAPEC and the Cloud Controls Matrix (CCM) published by the Cloud Security Alliance (CSA) [9] to identify more possible risks. Note that our evaluation model is customizable: the risk identification model and the severity levels can be replaced with others specified by experts. The risk with the highest severity value is thus taken to represent the most critical risk in each phase, and the per-phase terms are summed to give the inter-phase score.
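A minimal Python sketch of the inter-phase formula follows. The names `phase_exponent` and `inter_score` are ours, and the sketch assumes that every phase has at least one identified risk with a severity value greater than zero.

```python
from typing import Sequence


def phase_exponent(i: int) -> float:
    """Exponent w'_i = (10 + i) / 10 for the 1-based phase index i."""
    return (10 + i) / 10


def inter_score(risks: Sequence[Sequence[float]]) -> float:
    """Eq. (2): average over the phases of 1 / max(r_i) ** w'_i."""
    total = 0.0
    for i, phase_risks in enumerate(risks, start=1):
        # Later phases get a larger exponent, so for severities above 1
        # the same severity is penalized more heavily late in the lifecycle.
        total += 1.0 / (max(phase_risks) ** phase_exponent(i))
    return total / len(risks)
```

With six phases the exponents run from 1.1 to 1.6, so a risk of a given severity above 1 in the last phase contributes less to the score than the same risk in the first phase.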

Then, to calculate lifecycle operations score, we define the formula as:

$$OpS = \frac{1}{\max(r)}\tag{3}$$

where *r* is the possible operations risk, e.g., data breaches. Similarly, the risk of the highest severity value is the representation of the protection level of operations.

Similar to *OpS*, the compliance score is defined as:

$$\text{ComS} = \frac{1}{\max(r)}\tag{4}$$

where *r* is the possible compliance risk, determined by the organization's geographic location, jurisdiction, and industry. For example, a bank located in the EU needs to comply with GDPR [36] and PCI DSS [11].

Last, we give the overall formula to compute the protection level score of cloud data in an organization, which is defined as:

$$S = \alpha \times IntraS + \beta \times InterS + \gamma \times OpS + \delta \times ComS \tag{5}$$

where the parameters α, β, γ, and δ can be configured based on the expert knowledge and specific application scenarios. Based on our empirical experience, the default values of those parameters are 0.2, 0.2, 0.3, and 0.3 respectively.
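The combination step can be sketched as follows. The function names are ours; the default parameter values are the ones stated above, while any sub-scores passed in are whatever the assessor has computed for the organization.

```python
from typing import Sequence


def single_risk_score(risks: Sequence[float]) -> float:
    """Eqs. (3) and (4): 1 / max(r) over the operations or compliance risks."""
    return 1.0 / max(risks)


def overall_score(intra: float, inter: float, ops: float, com: float,
                  alpha: float = 0.2, beta: float = 0.2,
                  gamma: float = 0.3, delta: float = 0.3) -> float:
    """Eq. (5): weighted sum of the four sub-scores."""
    return alpha * intra + beta * inter + gamma * ops + delta * com
```

Since the default parameters sum to 1, the overall score stays within the range spanned by the four sub-scores; an expert who changes the parameters may want to preserve that property.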

#### 4.3 Case Study

We apply the protection score presented above to assess a financial enterprise that prefers to remain anonymous, and the result agrees with a parallel manual evaluation by experts. Table 5 shows the four sub-scores, which reflect the four evaluation factors described above. For the intra-phase score, we obtained the highest severity value in the *SHARE* stage, because the target of evaluation has weak access control for user PII. For the inter-phase score, we identified the highest severity value in the transition from the *ARCHIVE* stage to the *DESTROY* stage, owing to the lack of an approved and unified mechanism for destroying data that no longer needs to be retained. With the default parameters given in Sect. 4.2, the overall protection score of the target's cloud data is 0.46. We mapped this score to the numeric intervals listed in Table 4, which indicates that the overall protection level is medium. This is consistent with the manual qualitative assessment by another team.


Table 5. Case study scores.

## 5 Related Work

#### 5.1 Cloud Security Assessment

Traditional information system risk assessment mechanisms remain effective for cloud computing environments. However, as a popular computing architecture, the cloud computing environment has some aspects that set its risk assessment apart from that of other IT systems. First, the cloud environment involves more entities, including CSPs, CSCs, cloud users, cloud auditors, cloud carriers, and so forth. These stakeholders bring more challenges to cloud security assessment. Second, the technology stack of the cloud computing architecture is more complex, and the evaluation targets include components owned and used by multiple parties, such as the physical environment, virtualization, and applications. In addition, compliance in the cloud environment is also an important aspect of cloud risk assessment [9].

#### 5.2 Data Security and Privacy

Data security, privacy, and ethics have come to be widely considered in the security community and the legal profession. Whether security and privacy controls are incorporated throughout the data lifecycle is a critical observation for data security assessment. Data classification and the access control matrix are important data risk assessment tools as well as data security controls. Moreover, continuous monitoring is also used to assess whether the exchange of data violates the organization's data security policy [12–14,26,43].

## 6 Conclusion

In this paper, we comprehensively review each aspect of cloud data protection, including security, privacy, and ethical considerations. To evaluate an organization's cloud data protection level, we propose an empirical model that calculates a protection score based on the four factors we consider important. The novel algorithm we present can improve the automation of evaluation and the credibility of its results. However, our evaluation model is still semi-automatic: it needs experts to identify risks and conduct other manual activities. Removing this dependency is the goal of our further research.

Acknowledgements. The authors would like to thank the anonymous reviewers for their valuable comments and suggestions. This work is partially supported by CNCERT/CC and (ISC)<sup>2</sup> Chengdu Chapter.

## References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

**Anomaly Detection**

## A Self-supervised Adversarial Learning Approach for Network Intrusion Detection System

Lirui Deng<sup>1</sup>, Youjian Zhao<sup>1,2(B)</sup>, and Heng Bao<sup>3</sup>

<sup>1</sup> Department of Computer Science and Technology, Tsinghua University, Beijing, China dlr18@mails.tsinghua.edu.cn, zhaoyoujian@tsinghua.edu.cn

<sup>2</sup> Zhongguancun Laboratory, Beijing, China

<sup>3</sup> School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China baoheng@iie.ac.cn

Abstract. The network intrusion detection system (NIDS) plays an essential role in network security. Although many data-driven approaches from the field of machine learning have been proposed to increase the efficacy of NIDSs, they still suffer from extreme data imbalance, and the performance of existing algorithms depends heavily on the training datasets. To counter the class-imbalance problem in network intrusion detection, models need to capture more representative clues within the same category instead of learning from the classification loss alone. In this paper, we propose a self-supervised adversarial learning approach for intrusion detection, which utilizes instance-level discrimination for better representation learning and employs an adversarial-perturbation-style data augmentation to improve the robustness of NIDSs on rarely seen attack types. State-of-the-art results were achieved on multiple frequently used datasets, and experiments conducted in a cross-dataset setting demonstrated good generalization ability.

Keywords: Network intrusion detection *·* Self-supervised learning *·* Adversarial learning

## 1 Introduction

While the advent of the Internet has brought immense convenience to our daily lives in recent decades, it has also unavoidably introduced many new challenges. As people now spend more time in cyberspace than in the real world, whether living or working, attacks on network activities that use various intrusion techniques to prey on private or confidential corporate information have never stopped. In response, the intrusion detection system (IDS), which safeguards the integrity and availability of key assets, has always been a hot research topic in the computer and network security community. In contrast to host-based IDSs, which are deployed on end users' systems, a network intrusion detection system (NIDS) is primarily characterized as a solution inside the data transfer pipeline between computers that monitors network traffic and raises alerts, or even takes active response measures, when malicious behavior is spotted [4]. Apart from some NIDSs designed for specific network environments [16,19,24], such as Hadoop-based platforms or particular cloud systems, most general NIDS studies [10,14,33,38] were performed on network intrusion detection datasets to demonstrate and compare their effectiveness and generalization ability in a data-driven fashion.

Among the limitations of existing algorithms, data imbalance between classes, especially the lack of data in rarely seen attack categories, is one of the most challenging problems. It is also a very common phenomenon in network intrusion detection datasets, given the difficulty of data collection or generation. Benign traffic undoubtedly makes up the majority of Internet data transfer, and malicious network activity is by nature disguised. Because the performance of most traditional ML-based methods declines significantly when learning from imbalanced data, a large body of research tries to address this problem with various approaches [5,8,20,28,34,36,38]. Recently, contrastive learning has drawn a lot of attention for its impressive performance improvements [27,35] in computer vision (CV) and natural language processing (NLP). Besides supervised contrastive learning, the instance-level discrimination framework in a self-supervised fashion has also shown promising results in few-shot classification [21] and is quickly being adopted in NIDS research [22].

Inspired by the success of contrastive learning and adversarial learning in CV and NLP, in this paper we propose a self-supervised adversarial learning (SSAL) approach for network intrusion detection. The main contributions of this paper are as follows:


## 2 Related Work

In this section we summarize the algorithms and research work related to this study.

#### 2.1 Network Intrusion Detection System

Data-driven methods have been developed and deployed for NIDSs for more than two decades [9]. To achieve an effective NIDS, the research community has proposed various methods, including both machine learning (ML) and deep learning (DL) techniques.

Traditional machine learning algorithms such as KNN, PCA, SVM, and tree-based models have all been applied to intrusion detection and are often used as baselines for a particular improved module. For example, Gao et al. [11] used classification and regression trees (CARTs) on the NSL-KDD dataset with an ensemble scheme in which multiple trees were trained on adjusted samplings. Karatas et al. [17] addressed the dataset imbalance problem by reducing the imbalance ratio using the Synthetic Minority Oversampling Technique (SMOTE), and used different ML algorithms as baselines for a cross comparison that shows improved detection ability for minority-class attacks.

Recent studies suggest that DL algorithms for NIDSs perform much better than ML-based methods. RNNs and autoencoders [1] were identified as the most frequently used models for NIDSs in past decades. Regarding data imbalance, Yu et al. proposed a CNN-based few-shot learning model to improve detection reliability for network attack categories with few samples. Manocchio proposed FlowGAN [23], which utilizes generative models for data augmentation. However, most DL schemes are more complex and require more extensive computing resources than ML-based methods.

#### 2.2 NIDS Datasets

High-quality datasets are required to fully evaluate the performance of intrusion detection systems. Many datasets have been published in recent years that contain representative network flow data with different kinds of preprocessing; they are provided mainly in three categories of formats.

Packet Based Data. The most original and commonly used format is packet-based data captured in pcap format, which contains the payload. Early NIDS datasets do not provide packet-based data because it takes too much storage space, but more recently published datasets such as CIC-IDS-2017/2018, UNSW-NB15 and LITNET-2020 [7] tend to provide both pcap files and flow-based features to ease comparison between different NIDS methods.

Flow Based Data. Flow-based data is much more condensed than packet-based data. It aims to describe the behavior of a whole network connection session by aggregating all packets that share the same properties within a time window. Commonly used flow-based formats include NetFlow [6], OpenFlow [25] and NFStream [2]. CICFlowMeter (formerly known as ISCXFlowMeter [32]) is another important network flow generator, which converts pcap files into more than 80 network flow features. It was published by the Canadian Institute for Cybersecurity and is therefore used by both CIC-IDS-2017 and CIC-IDS-2018.

Other Data. This category covers all datasets that are neither purely packet-based nor flow-based. For example, KDD CUP 1999 [18] contains host-based attributes such as the number of failed logins, which can only be obtained above the network interface. As a consequence, each dataset in this category has its own set of attributes, and such datasets cannot be unified with one another.

#### 2.3 Contrastive Learning

Contrastive learning techniques have been widely used in metric learning, for example through the triplet loss [30] and the contrastive loss [13]. In recent self-supervised approaches, contrastive learning mostly shares the core idea of minimizing some kind of contrastive loss (e.g., NCE [12], InfoNCE [27]) evaluated on pairs of data augmentations. Typically, augmentations are obtained by data transformation (e.g., rotation, cropping, and color jittering in CV, or masking in NLP), but using "adversarial augmentations" as challenging training pairs that maximize the contrastive loss has shown more robustness in a recent study [15].

## 3 Approach

In this section, we explain the main algorithms of our proposed self-supervised adversarial learning framework for network intrusion detection under data imbalance.

Fig. 1. Preprocessing pipeline from PCAP files to flow-based feature vectors

#### 3.1 Data Preprocessing

To build a comparable cross-dataset evaluation process, we adopt the commonly used datasets UNSW-NB15, CIC-IDS-2017 and CIC-IDS-2018, as they not only contain a wide range of attack scenarios but also provide the original pcap files, which can easily be processed into a unified feature set. The CIC-IDS-2017 dataset is made up of 5 days of network traffic with 7 different network attacks, which amounts to 51 GB of data. The benign traffic was generated with a profile system to protect user privacy. The dataset provides both network traffic (pcap files) and event logs for attack labels on each machine. The CIC-IDS-2018 dataset was also created with CICFlowMeter, but with both benign and malicious profile systems, and has more than 400 GB of pcap data spanning 17 days. UNSW-NB15 was released in 2015 by the Australian Centre for Cyber Security (ACCS) and contains a total of 100 GB of pcap files, consisting of 2,218,761 (87.35%) benign flows and 321,283 (12.65%) attack flows.

After obtaining the original PCAP files, we follow the setting from [29] and take 43 extended feature dimensions from the latest *NetFlow version 9 flow-record format* [6] for flow-based feature extraction (the full feature set can be obtained from [29]). NetFlow was proposed by Cisco and has become one of the most commonly used flow-based formats for recording network traffic. A network flow is an aggregation of a sequence of packets in a continuous session (a TCP connection by default) with the same source IP, source port, destination IP, destination port, and transport protocol. The distribution of our processed unified dataset is shown in Table 1.


Table 1. Distribution of Unified Dataset

Separating session streams can be tricky, since streams identified by the quintuple alone may be inaccurate and contain too many packets. Inspired by [37], in addition to following TCP handshake flags, we further segment streams with a timeout mechanism that cuts idle streams into more pieces with a periodic reset. The procedure for generating NIDS datasets with a unified feature set is shown in Fig. 1.
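The quintuple grouping and the timeout split can be illustrated with a short Python sketch. It is a simplification under our own assumptions: the `Packet` record, the function `split_flows`, and the 60-second default timeout are hypothetical, the key is unidirectional, and TCP handshake flags and the periodic reset are not modeled.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple

# Source IP, source port, destination IP, destination port, protocol.
FlowKey = Tuple[str, int, str, int, str]


@dataclass
class Packet:
    ts: float        # capture timestamp in seconds
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    proto: str
    size: int        # packet length in bytes


def split_flows(packets: Iterable[Packet],
                idle_timeout: float = 60.0) -> List[List[Packet]]:
    """Group packets by quintuple and cut a flow after an idle gap."""
    open_flows: Dict[FlowKey, List[Packet]] = {}
    finished: List[List[Packet]] = []
    for pkt in sorted(packets, key=lambda p: p.ts):
        key = (pkt.src_ip, pkt.src_port, pkt.dst_ip, pkt.dst_port, pkt.proto)
        current = open_flows.get(key)
        # Close the flow if it has been idle for longer than the timeout.
        if current is not None and pkt.ts - current[-1].ts > idle_timeout:
            finished.append(current)
            current = None
        if current is None:
            current = []
            open_flows[key] = current
        current.append(pkt)
    # Flows that are still open at the end of the capture are emitted too.
    finished.extend(open_flows.values())
    return finished
```

Each returned list of packets would then be summarized into one feature vector; the feature computation itself depends on the NetFlow fields chosen in [29] and is not shown.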

Fig. 2. Self-supervised Adversarial Learning vs. Vanilla Contrastive Learning

In self-supervised contrastive learning (CL), the dataset *D* = {**x**<sub>i</sub>}<sub>i=1</sub><sup>N</sup> is unlabeled, and each example **x**<sub>i</sub> from a mini-batch is paired either with a positive sample **x**′<sub>i</sub> obtained by transformations *T* or with a negative sample **x**<sub>j</sub> / **x**′<sub>j</sub>, *j* ≠ *i*. CL seeks to learn an invariant representation of **x**<sub>i</sub> by minimizing the distance between positive samples, with the loss defined as:

$$\mathcal{L}_{\rm CL} = -\log \frac{\exp(\text{sim}(\mathbf{x}_i, \mathbf{x}_j))}{\sum \exp(\text{sim}(\mathbf{x}_i, \mathbf{x}_k))} \tag{1}$$

Since Chen et al. demonstrate in SimCLR [3] that a temperature parameter τ and a non-linear projector *g* after the backbone network are crucial to the performance of self-supervised CL, we adopt the SimCLR loss *L*<sub>SimCLR</sub> as the base setting of SSAL:

$$\begin{aligned} \mathcal{L}_{\text{SimCLR}}(\mathbf{x}_i, \mathbf{x}_j) &= -\log \frac{\exp(\text{sim}(\mathbf{z}_i, \mathbf{z}_j)/\tau)}{\sum_{k=1}^{2N} \exp(\text{sim}(\mathbf{z}_i, \mathbf{z}_k)/\tau)}, \\ \text{where} \quad \mathbf{h}_i &= f(\mathbf{x}_i), \quad \mathbf{h}_j = f(\mathbf{x}_j), \\ \text{and} \quad \mathbf{z}_i &= g(\mathbf{h}_i), \quad \mathbf{z}_j = g(\mathbf{h}_j) \end{aligned} \tag{2}$$
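A dependency-free Python sketch of this loss for one positive pair is given below. Two points are our assumptions rather than statements of the text: sim is taken to be cosine similarity, as in SimCLR, and the term k = i is left out of the denominator, also following SimCLR. The name `simclr_loss` and the default temperature are illustrative, and the projected vectors are assumed to be non-zero.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def simclr_loss(z: Sequence[Sequence[float]], i: int, j: int,
                tau: float = 0.5) -> float:
    """Loss of the positive pair (i, j) among the 2N projected vectors z."""
    # Similarities of z_i to every other vector, scaled by the temperature.
    logits = [cosine(z[i], z[k]) / tau for k in range(len(z)) if k != i]
    positive = cosine(z[i], z[j]) / tau
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - positive
```

The loss is small when `z[i]` is close to its positive `z[j]` and far from the remaining vectors, which is the invariance the pre-training stage aims for.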

Adversarial Attack. The design of the positive and negative sampling strategy is key to the performance of CL models, and the robustness of a model largely depends on the difficulty of the proposed sample pairs. As opposed to vanilla contrastive learning, self-supervised adversarial learning leverages adversarial augmentation to ease hard sample mining. Taking an L∞-norm attack as an example, the perturbation is defined as:

$$\epsilon = \operatorname*{arg\,max}_{\|\epsilon\|_{\infty}} \mathcal{L}_{\text{SimCLR}}(\mathbf{x}_i, \mathbf{x}_i + \epsilon) \tag{3}$$

With perturbations within a certain radius that lead to the most diverse positive pairs, we obtain an adversarial training scheme that alternates between encouraging the learning algorithm to produce a more invariant representation by updating the parameter θ and then finding ε again under the updated θ. This pipeline is described in Fig. 2 (Fig. 3).
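The text does not state how the maximization over ε is solved. One common choice, used here purely as an illustration, is projected sign-gradient ascent inside the L∞ ball. In the sketch below the function `linf_attack`, the `radius` argument, the step-size rule, and the finite-difference gradient (standing in for backpropagation through the network) are all our assumptions.

```python
from typing import Callable, List, Sequence


def linf_attack(loss_fn: Callable[[List[float]], float],
                x: Sequence[float],
                radius: float,
                steps: int = 10,
                h: float = 1e-4) -> List[float]:
    """Search for eps with max|eps_d| <= radius that increases loss_fn(x + eps)."""
    step_size = 2.5 * radius / steps
    eps = [0.0] * len(x)
    for _ in range(steps):
        point = [xi + e for xi, e in zip(x, eps)]
        grad = []
        for d in range(len(x)):
            plus, minus = list(point), list(point)
            plus[d] += h
            minus[d] -= h
            # Central finite difference approximates the partial derivative.
            grad.append((loss_fn(plus) - loss_fn(minus)) / (2 * h))
        for d, g in enumerate(grad):
            step = step_size if g > 0 else -step_size if g < 0 else 0.0
            # Ascend along the gradient sign, then project back into the ball.
            eps[d] = max(-radius, min(radius, eps[d] + step))
    return eps
```

In the training loop, `loss_fn` would evaluate the contrastive loss between the clean sample and its perturbed copy under the current θ, so the perturbation is recomputed after every parameter update.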

Fig. 3. Framework of proposed 2-stage SSAL NIDS training process

#### 3.3 Classifier Fine-Tune

With SSAL we can already pre-train the model in an adversarial fashion without any class labels, but without class annotations the pre-trained model cannot be used directly for class-level classification.

Therefore we freeze the parameter θ of the pre-trained model *f* and replace the projector head *g* with a non-linear classifier ψ. Training is conducted under the standard multi-class single-label setting:

$$\begin{aligned} \mathbf{z}_i &= \psi(f(\mathbf{x}_i)), \quad \text{for } i = 1, 2, \dots, N\\ p_{i,c} &= \sigma(z_{i,c}) = \frac{e^{z_{i,c}}}{\sum_{j=1}^{M} e^{z_{i,j}}}, \quad \text{for } c = 1, 2, \dots, M \end{aligned} \tag{4}$$

with cross entropy loss:

$$\mathcal{L}_{ce}(\mathbf{x}_i, \mathbf{l}_i) = -\sum_{c=1}^{M} y_{i,c} \log(p_{i,c}) \tag{5}$$
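The softmax and the cross-entropy loss above can be written in plain Python as follows. The sketch assumes that the label is given as a class index, so that the one-hot vector reduces the sum to a single term; the function names are ours.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Eq. (4): class probabilities from the classifier outputs z_i."""
    m = max(logits)                      # subtract the maximum for stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: Sequence[float], label: int) -> float:
    """Eq. (5) for a one-hot label: minus the log-probability of the true class."""
    return -math.log(softmax(logits)[label])
```

For two classes with equal logits the probabilities are 0.5 each and the loss equals log 2, which is the value of an uninformed classifier.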

The full process of proposed 2-stage SSAL for NIDS is shown in Algorithm 1.

#### Algorithm 1: Self-supervised adversarial learning for NIDS

```
Stage 1: SSAL pre-train
  input : dataset D = {x_i}, i = 1..N
  output: model f
  initialize model f with parameter θ and projector g
  repeat
    for all x_i ∈ minibatch B do
      generate ε = arg max_{||ε||∞} L_SimCLR(x_i, x_i + ε)
      θ' = θ + ∇_x L_SimCLR(x_i, x_i + ε)
    end
  until epoch N is reached or L ≤ δ1

Stage 2: Classifier fine-tune
  input : labeled dataset D = {x_i, l_i}, i = 1..N; model f with parameter θ
  output: model f and classifier ψ
  initialize classifier ψ with parameter ρ; freeze θ
  repeat
    for all x_i ∈ minibatch B do
      ρ' = ρ + ∇_x L_ce(x_i, l_i)
    end
  until epoch N is reached or L ≤ δ2
```
## 4 Experiment Results

Metric and Implementation. The evaluation compares classifier performance using various classification metrics. The intrusion detection datasets we evaluate on contain several attack categories, so the task can be treated as either a binary or a multi-class classification problem. For the binary classification scenario, the basic terms used in the evaluation are as follows:

$$Accuracy(ACC) = \frac{TP + TN}{TP + FP + TN + FN},$$

$$DetectionRate(DR) = \frac{TP}{TP + FN}, \quad a.k.a.\ Recall,$$

$$Precision = \frac{TP}{TP + FP},\tag{6}$$

$$F1Score = \frac{2 \times Precision \times Recall}{Precision + Recall}.$$

where TP stands for the number of true positive samples, FN for the number of false negatives, and so forth.
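The four binary metrics can be computed directly from the confusion counts, as in the following sketch. The function name, the dictionary keys, and the convention of returning 0.0 when a denominator is zero are our choices and are not specified in the text.

```python
from typing import Dict


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Accuracy, detection rate (recall), precision and F1 from Eq. (6)."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    detection_rate = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + detection_rate
    f1 = 2 * precision * detection_rate / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "detection_rate": detection_rate,
        "precision": precision,
        "f1": f1,
    }
```

On imbalanced traffic the accuracy can stay high even when many attacks are missed, which is why the detection rate and F1 are reported alongside it.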

For the multi-class classification setting with more detailed labels of attack types, a weighted average of the above metrics is adopted, taking into account the proportion of each label in the dataset. To achieve a fair evaluation, five cross-validation splits are conducted and the mean is reported.
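A support-weighted average can be sketched as below, assuming that the per-class metric values and the per-class sample counts are already available. The class names and numbers in the usage note are hypothetical.

```python
from typing import Dict


def weighted_average(per_class: Dict[str, float],
                     support: Dict[str, int]) -> float:
    """Average a per-class metric, weighting each class by its sample count."""
    total = sum(support.values())
    return sum(per_class[c] * support[c] for c in support) / total


def macro_average(per_class: Dict[str, float]) -> float:
    """Unweighted mean over classes, shown only for contrast."""
    return sum(per_class.values()) / len(per_class)
```

With a hypothetical split of 90 benign and 10 attack flows and per-class scores of 1.0 and 0.5, the weighted average is 0.95 while the macro average is 0.75, so rare classes influence the weighted figure only in proportion to their share of the data.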

Evaluation on Unified Feature Dataset. With the unified feature set built on the pre-processed UNSW-NB15 and CIC-IDS-2017/2018 datasets described in Sect. 3.1, we conduct an evaluation across multiple datasets. For comparison, we implemented a simple MLP and the Extra Trees model from [29] as baseline models. As Table 2 shows, our SSAL method achieved outstanding results on all three datasets and exceeds previous work on most metrics.


Table 2. Performance on unified dataset

Table 3 presents the detailed detection results for the different attack classes on the merged NIDS dataset. With the same backbone (Multi-Layer Perceptron), the performance of the model with SSAL pre-training improved considerably on rarely seen attack data.

Table 3. Detailed performance of different classes on unified dataset (ACC).


Further Ablation. To further demonstrate the benefit of our proposed method, we compare different backbone networks in ablation studies on the SSAL modules. We first take two frequently used backbones, MLP and CNN, and plug in SSAL pre-training for representation learning. The evaluation results in Table 4 indicate that SSAL can effectively enhance the ability of network intrusion detection systems. As for feature extraction, Table 5 shows the results of different classifiers when SSAL is used as a feature extractor. We first pre-train on all unlabeled training data with SSAL for feature extraction, then freeze the network parameters and use an SVM or k-NN as the classifier to check the representational ability of the SSAL model.
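The feature-extractor protocol can be illustrated with a small k-NN classifier that operates on precomputed embeddings. The sketch assumes that the frozen encoder has already mapped each flow to a feature vector; the function name, the Euclidean distance, and the default k are our choices.

```python
import math
from collections import Counter
from typing import Sequence


def knn_predict(train_feats: Sequence[Sequence[float]],
                train_labels: Sequence[str],
                query: Sequence[float],
                k: int = 3) -> str:
    """Majority vote among the k training embeddings closest to the query."""
    order = sorted(range(len(train_feats)),
                   key=lambda i: math.dist(train_feats[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]
```

Because the classifier has no trainable representation of its own, its accuracy reflects how well the pre-trained embeddings already separate the classes.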


Table 4. Performance with different backbones (ACC).



## 5 Conclusion and Discussions

In this paper, we tackle the data imbalance problem in network intrusion detection with adversarial-style data augmentation and self-supervised contrastive representation learning. More specifically, we propose a self-supervised adversarial learning approach to enhance representation learning in deep-learning-based NIDS, which uses an instance-wise attack to yield a robust model by suppressing its adversarial vulnerability against perturbed samples. State-of-the-art performance was achieved on commonly used datasets, and experiments on multiple datasets show the improvement of the proposed learning framework over the vanilla DL approach with the same backbones.

In addition to these conclusions, some work remains for the future. Although we, among other researchers, have made a lot of effort on data imbalance in network intrusion detection, there are still gaps to be filled before a robust and applicable NIDS is reached. For instance, the results of our method on different feature sets show a noticeable performance gap. We believe that further improving the representation of network flow data with a standard and comprehensive behavior feature set is key to better data-driven NIDS solutions. We also look forward to exploring a universal end-to-end approach for a more generalized NIDS, which could greatly reduce the difficulty of system deployment.

## References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Anomaly Detection of E-commerce Econnoisseur Based on User Behavior**

Yangyu Long, Wei Zhao, Jilong Yang, Jincheng Deng(B), and Fangming Liu

Beijing Knownsec Information Technology Co., Ltd., Beijing 100097, China {longyy3,zhaow,yangjl,dengjc,liufm}@knownsec.com, jcdeng\_209@163.com

**Abstract.** Econnoisseur refers to users who obtain high returns from the Internet at low cost. Identifying econnoisseurs is of great significance for platforms to reduce unnecessary losses. At present, econnoisseurs are mainly intercepted by rules. This approach fails when a new way of exploiting deals appears, and it responds with a certain lag. This paper identifies econnoisseurs among the e-commerce website visitors recorded by the Knownsec Security Intelligence Brain. First, we find that the precision and recall of Isolation Forest are better than those of Local Outlier Factor and DBSCAN in econnoisseur detection. Second, we merge the similar URLs visited by users with Bi-directional Long Short-Term Memory (BiLSTM) and then use the merged data in the Isolation Forest model. The improved Isolation Forest model based on BiLSTM further improves the detection ability. Practical case studies show that this method is valid and provides a reference for the detection of econnoisseurs.

**Keywords:** Econnoisseur · Isolated Forest · Local Outlier Factor · Anomaly detection · Bidirectional Long Short Term Memory · User behavior

## **1 Introduction**

In recent years, the traffic dividend of e-commerce platforms has peaked. Each platform runs marketing activities to win customers and improve user stickiness. This process has also given birth to econnoisseurs, who exploit vulnerabilities in platform activities for profit.

According to the Analysis Report on the Application of Digital Financial Anti-Fraud Technology (2021), released by the Institute of Cloud Computing and Big Data of the China Academy of Information and Communications Technology and the ICBC Security Attack and Defense Laboratory, the total loss caused by black-industry fraud (see Fig. 1) is increasing every year and is expected to reach 710 billion yuan in 2022, with econnoisseurs accounting for a large proportion. In 2019, econnoisseurs took tens of millions of yuan from Pinduoduo within a few hours because of an expired-coupon bug on the platform. In 2021, a wrongly configured coupon at Jingdong Mall was discovered and spread by econnoisseurs, resulting in a direct loss of nearly 70 million yuan. On the one hand, econnoisseurs damage the interests of ordinary users; on the other hand, they greatly reduce the effect of the company's marketing activities, so their governance is urgent.

**Fig. 1.** Fraud losses and forecasts as a percentage of GDP

## **2 Related Work**

The application of anomaly detection in the field of cyber security focuses on APT detection, intrusion detection and so on.

An APT attack is a hidden and persistent network intrusion process that carries out advanced persistent threats against specific targets. Bohara [1] compared various unsupervised algorithms such as K-means for APT detection and found that they can detect infected hosts. Zhong Yao [2] performed anomaly detection on traffic log data based on Isolation Forest and found that it has a certain detection ability against APT attacks and can mark suspected infected hosts.

An intrusion detection system detects intruders in a network. The detection methods can be divided into supervised and unsupervised machine learning algorithms. References [3–6] mainly use supervised algorithms such as Naive Bayes, Bayesian Networks, Hidden Markov Models, and ensemble learning for intrusion detection. These methods require a large number of labeled samples, which are insufficient in many scenarios. In [7–11], unsupervised algorithms such as K-means clustering, hierarchical clustering and DBSCAN are used for intrusion detection and show good detection ability.

At present, research on econnoisseur detection is still at the theoretical stage, and engineering practice mainly relies on traditional threshold setting or rule-based interception. Yuan Dandan [12] used a community discovery algorithm to group econnoisseurs with similar characteristics. This method fails when the characteristics of the econnoisseurs change.

The challenges of e-commerce econnoisseur identification are as follows:

1. Rule omission. At present, most platforms intercept econnoisseurs based on rules, so detections are missed when econnoisseur behavior changes.
2. Insufficient sample labels. New econnoisseurs appear every day, and manual labeling carries a large labor cost.
3. Model detection lag. As econnoisseurs iteratively update their methods, existing rules and models lag behind, so the model must be iterated and maintained periodically.

## **3 Theoretical Basis**

Anomaly detection can be divided into supervised and unsupervised anomaly detection. Since econnoisseurs emerge in real time and practical applications mainly have unlabeled sample data, this paper uses unsupervised anomaly detection. Considering that detection methods differ in applicability across scenarios and data, this paper compares three commonly used anomaly detection schemes on their ability to identify econnoisseurs in e-commerce website log data.

#### **3.1 Isolated Forest (IForest)**

The Isolation Forest model was first proposed by Zhou Zhihua's team and Fei Tony Liu of Monash University [13] as an ensemble learning method. It has been used in industrial anomaly detection because of its high accuracy and linear time complexity. The model rests on two assumptions: 1. Abnormal data differ from normal data. 2. The proportion of abnormal data is relatively small. Both assumptions are consistent with econnoisseur detection. The Isolation Forest algorithm cuts the data space with random hyperplanes. Each cut divides the data into two subspaces, and cutting continues until each subspace contains only one sample point or the tree reaches the given height.

An Isolation Forest is obtained by first training a set of isolation trees. After that, the anomaly score *s* of each test sample is calculated and compared with a given threshold to decide whether the sample is abnormal.

After training, the anomaly score of a test sample is evaluated from the generated isolation trees. Since an isolation tree has the same structure as a binary search tree (BST), the average path length of an isolation tree is estimated in the same way as the average path length of an unsuccessful search in a BST.

$$c(n) = 2H(n-1) - (2(n-1)/n) \tag{1}$$

$$H(n) = \ln(n) + 0.5772156649\tag{2}$$

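Formula (1) gives the average path depth of an isolation tree built from n samples. It is used to normalize the depth of a sample in the isolation trees, so that the anomaly score of the test sample x is given by Formula (3), where E(h(x)) in Formula (4) is the mean path length of x over the t isolation trees. Scores close to 1 indicate anomalies, while scores well below 0.5 indicate normal samples.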

$$s(\mathbf{x}, n) = 2^{-\frac{E(h(\mathbf{x}))}{c(n)}} \tag{3}$$

$$E(h(\mathbf{x})) = \sum_{l=1}^{t} h_l(\mathbf{x})/t \tag{4}$$

**Algorithm 2**. Calculate the sample anomaly score

**Input:** *x* - an instance, *Forest* - Isolated Forest

**Output:** anomaly score *s*

1. Initialize tree depth *h*(*x*) = [ ]
2. *for* *i* = 0 *to* *t* *do*
3. extract the *i*-th Isolated Tree *iTree*<sub>*i*</sub>, initialize tree height *e* = 0
4. *if* *iTree*<sub>*i*</sub> is an external node
5. …, where *c*(·) is defined in Equation (1)
6. *end if*
7. …
8. *if* …
9. …
10. *else*
11. …
12. *end if*
13. *end for*
14. calculate anomaly score *s*(*x*, *n*), where *s*(·) is defined in Equation (3)
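The scoring procedure of Formulas (1) to (4) can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the dictionary-based tree layout, the helper names, the axis-parallel random cuts, the handling of `c(n)` for `n <= 2`, and the synthetic Gaussian data are all assumptions made for illustration.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def c(n):
    """Average path length of an unsuccessful BST search, Formula (1)."""
    if n > 2:
        return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n
    return 1.0 if n == 2 else 0.0  # small-n convention assumed in this sketch


def build_tree(points, depth, max_depth, rng):
    """Recursively split on a random feature and a random cut value."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}  # external node
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:  # simplification: stop if the chosen feature is constant
        return {"size": len(points)}
    cut = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < cut]
    right = [p for p in points if p[dim] >= cut]
    return {"dim": dim, "cut": cut,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}


def path_length(x, node):
    """h(x): edges traversed plus the c(size) adjustment at the external node."""
    e = 0
    while "dim" in node:
        node = node["left"] if x[node["dim"]] < node["cut"] else node["right"]
        e += 1
    return e + c(node["size"])


def anomaly_score(x, forest, n):
    """s(x, n) = 2 ** (-E(h(x)) / c(n)), Formulas (3) and (4)."""
    mean_h = sum(path_length(x, tree) for tree in forest) / len(forest)
    return 2.0 ** (-mean_h / c(n))


rng = random.Random(0)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(256)]
max_depth = math.ceil(math.log2(len(data)))
forest = [build_tree(data, 0, max_depth, rng) for _ in range(100)]
inlier_score = anomaly_score((0.0, 0.0), forest, len(data))
outlier_score = anomaly_score((8.0, 8.0), forest, len(data))
print(inlier_score, outlier_score)
```

The point far from the training cloud is isolated after only a few cuts, so its mean path length is short and its score is closer to 1 than that of the central point.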

#### **3.2 Local Outlier Factor (LOF)**

The Local Outlier Factor [14] is a model that determines whether the sample points are abnormal based on the density. Its core concept is that the density of abnormal points is smaller than that of other points. Before introducing the Local Outlier Factor model, we need to understand some basic concepts.

**Definition 1:** (k-distance). For point *p*, sort the distances between *p* and all other points in ascending order. The *k*-th smallest distance is the *k-distance* of *p*. If point *o* is the *k*-th closest point to *p*, then the distance *d*(*p*, *o*) is the *k*-distance of *p*, i.e.

$$d_k(p) = k\text{-}distance(p) = d(p, o)\tag{5}$$

**Definition 2:** (k-distance neighborhood). Draw a circle with point *p* as the center and its *k*-distance as the radius. The points other than *p* within this circle form the *k*-distance neighborhood of *p*, i.e.

$$N_k(p) = \{\, o' \neq p \mid d(p, o') \le d_k(p) \,\} \tag{6}$$

**Definition 3:** (reachability distance). The reachability distance from point *p* to point *o* is the larger of the *k*-distance of *o* and the actual distance between *p* and *o*:

$$\operatorname{reach\_dist}_k(p, o) = \max\{d_k(o),\ d(p, o)\}\tag{7}$$

**Definition 4:** (local reachability density). The reciprocal of the average reachability distance from point *p* to the points in its neighborhood is the local reachability density of *p*, defined as

$$lrd_k(p) = \frac{1}{\dfrac{\sum_{o \in N_k(p)} \operatorname{reach\_dist}_k(p, o)}{|N_k(p)|}} \tag{8}$$

**Definition 5:** (local outlier factor). The mean of the local reachability densities of the points in the neighborhood of *p*, divided by the local reachability density of *p*, is the local outlier factor of *p*, defined as

$$LOF_k(p) = \frac{\sum_{o \in N_k(p)} \frac{lrd_k(o)}{lrd_k(p)}}{|N_k(p)|} \tag{9}$$

Algorithm steps:


Therefore, the algorithm estimates the density of each sample from the points in its *k*-neighborhood. The lower the density of a sample relative to its neighbors, the more likely it is to be abnormal.
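Definitions 1 to 5 translate almost line by line into a brute-force computation. The sketch below is illustrative only: the function name `lof_scores` and the toy points are assumptions, the neighborhood excludes the point itself, and duplicate points (which would give a zero reachability distance) are assumed not to occur.

```python
import math


def lof_scores(points, k):
    """Return the local outlier factor of every point (brute force, O(n^2))."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    k_dist, neigh = [], []
    for i in range(n):
        others = sorted((dist[i][j], j) for j in range(n) if j != i)
        d_k = others[k - 1][0]                            # Definition 1
        k_dist.append(d_k)
        neigh.append([j for d, j in others if d <= d_k])  # Definition 2

    def reach_dist(p, o):                                 # Definition 3
        return max(k_dist[o], dist[p][o])

    lrd = []
    for p in range(n):                                    # Definition 4
        mean_reach = sum(reach_dist(p, o) for o in neigh[p]) / len(neigh[p])
        lrd.append(1.0 / mean_reach)

    return [sum(lrd[o] / lrd[p] for o in neigh[p]) / len(neigh[p])  # Definition 5
            for p in range(n)]


# Five clustered points and one isolated point (hypothetical data).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (5, 5)]
scores = lof_scores(points, k=2)
print([round(s, 2) for s in scores])
```

Points inside the cluster obtain factors close to 1, while the isolated point obtains a factor well above 1, matching the statement that lower relative density means a higher chance of being abnormal.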

#### **3.3 DBSCAN**

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) [15] is a density-based spatial clustering algorithm.

The concepts involved in the algorithm are:

**Definition 1:** (Core user). Given a distance ε, a sample point *p* is a core point if at least *Minpts* sample points lie within its ε-neighborhood. The density of *p* is defined as ρ(*p*) = |*N*<sub>ε</sub>(*p*)|, where *N*<sub>ε</sub>(·) denotes the set of points in the ε-neighborhood. A core user *p* therefore satisfies

$$
\rho(p) = |N_\varepsilon(p)| \ge Minpts \tag{10}
$$

**Definition 2:** (Directly density-reachable). Point *p* is directly density-reachable from point *q* if the following two conditions are satisfied:

i) *p* is in the ε-neighborhood of *q*: *p* ∈ *N*<sub>ε</sub>(*q*).

ii) *q* is a core user: |*N*<sub>ε</sub>(*q*)| ≥ *Minpts*.

**Definition 3:** (Density-reachable). If there is a point *o* from which both *p* and *q* are directly density-reachable, then *p* and *q* are density-reachable.

**Definition 4:** (Border user). If the ε-neighborhood of sample point *p* contains fewer than *Minpts* sample points and *p* lies in the neighborhood of another core point, then *p* is a border user.

**Definition 5:** (Noise user). If sample point *p* is neither a core user nor a border user, it is called a noise user.

**Algorithm 3**. DBSCAN

**Input:** *P* - sample data, ε - scan radius, *Minpts* - minimum number of points included

**Output:** sample category set *X* = {*x*<sub>1</sub>, *x*<sub>2</sub>, ..., *x*<sub>*n*</sub>}

1. set the number of sample categories *k* = 1, *h* = ∅
2. *while* *P* ≠ ∅
3. randomly take a user *p*<sub>*i*</sub> from *P*
4. *if* …
5. *if* …
6. …
7. *else*
8. add users in … into *h*
9. *while* *h* ≠ ∅
10. randomly select a user from *h*
11. *if* …
12. …
13. *end if*
14. *if* …
15. add users in … into *h*
16. *end if*
17. *end while*
18. *k* = *k* + 1
19. *end if*
20. *end if*
21. *end while*
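The core, border and noise definitions above can be exercised with a compact clustering sketch. It follows the usual DBSCAN expansion loop rather than reproducing Algorithm 3 step by step: points are visited in index order instead of at random, a point counts itself in its own ε-neighborhood, and the names `dbscan`, `region` and `NOISE` are assumptions of this example.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (1, 2, ...) or NOISE."""
    def region(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    k = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region(i)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        k += 1                 # point i is a core point: start a new cluster
        labels[i] = k
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = k  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = k
            j_neighbors = region(j)
            if len(j_neighbors) >= min_pts:  # j is also a core point: keep expanding
                queue.extend(j_neighbors)
    return labels


# Two dense groups and one isolated point (hypothetical data).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

For anomaly detection, the points labeled `NOISE` are the candidates of interest, since they belong to no dense region.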

## **4 Research Process**

#### **4.1 Data Sources**

The data come from the real-time streaming log data in the Knownsec Security Intelligence Brain. The fields in the log data include access time, access IP, user agent, URL link, website domain name, etc. The log data of three e-commerce websites were selected, and each combination of IP and user agent was regarded as an independent visitor of the website.
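Treating each (IP, user agent) pair as one visitor amounts to grouping log records by that pair. The sketch below assumes a hypothetical record layout with the keys `ip`, `user_agent`, `time` and `url`; the real log schema is not given in the text.

```python
from collections import defaultdict


def group_by_visitor(log_records):
    """Group log records by the (IP, user agent) pair that identifies a visitor."""
    visitors = defaultdict(list)
    for record in log_records:
        visitors[(record["ip"], record["user_agent"])].append(record)
    return dict(visitors)


# Hypothetical log records for illustration.
logs = [
    {"ip": "10.0.0.1", "user_agent": "UA-1", "time": "2021-11-11 00:00:01", "url": "aa/01"},
    {"ip": "10.0.0.1", "user_agent": "UA-1", "time": "2021-11-11 00:00:03", "url": "aa/02"},
    {"ip": "10.0.0.1", "user_agent": "UA-2", "time": "2021-11-11 00:01:00", "url": "bb/list"},
]
visitors = group_by_visitor(logs)
for key, records in visitors.items():
    print(key, len(records))
```

Two requests from the same IP with different user agents are counted as two visitors under this convention.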

Log data with abnormal access URLs or very few visits are deleted to reduce interference with the model. The data of the three e-commerce websites in November 2021 are sampled to observe their behavior over one month. The website visit statistics are shown in Table 1:


**Table 1.** Website visit number.

#### **4.2 Feature Construction**

After analyzing the access data of some users, we found that econnoisseurs can be divided into two categories: 1. Monitoring users, who monitor the preferential information of commodities in real time and at low frequency. 2. Activity users, who apply for and purchase heavily discounted goods at high frequency during the activity period.

Combining these characteristics of econnoisseurs, we constructed three groups of user features: website visits, website visit time, and different website visits. The nine features in total are shown in Table 2.


#### **4.3 User Behavior**

The user feature data of website C are reduced to two dimensions with t-SNE (t-distributed Stochastic Neighbor Embedding) for visualization, as shown in Fig. 2. The user data can be divided into four categories according to the color of the points. Sample data were extracted from each of the four categories and analyzed. Econnoisseurs appeared in categories two and four, while normal users were concentrated in categories one and three (Fig. 3).

**Fig. 2.** User dimension reduction visualization

**Fig. 3.** Feature density curves of different categories

A feature density map is drawn for the four categories of sample users, as shown in Fig. 3. The features differ between the categories of users as follows.

Category 1: random visit users. Characterized by a small number of visits, short visit time, and little website visit information.

Category 2: monitoring users. These users visit specific web pages over a long period and infrequently.

Category 3: normal access users. The categories of web pages visited are scattered, and more time is spent visiting web pages than in category 2; these are users browsing web information normally.

Category 4: specific page access users. These users make a large number of visits to the website in a short period of time, and the visited pages are targeted.

#### **4.4 Result Analysis**

The data of the previous week of each website are used as the training set, and econnoisseur detection is carried out on the user accesses of the latest day. Updating the training data and retraining daily captures the recent overall distribution of users, so the model can be adjusted in a timely manner. During the Double Eleven period, these e-commerce companies ran promotional activities, which also became a carnival for econnoisseurs. The user access data on November 11 and November 18 were extracted to compare the detection effect during the activity period and the non-activity period.
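The daily retraining scheme can be pictured as a rolling window: the seven days before the target day form the training set, and the target day is scored. The sketch below only shows the split; the function name `rolling_split`, the `records_by_day` mapping and the placeholder records are assumptions.

```python
from datetime import date, timedelta


def rolling_split(records_by_day, target_day, window_days=7):
    """Return (training records from the previous week, records of the target day)."""
    train = []
    for offset in range(1, window_days + 1):
        train.extend(records_by_day.get(target_day - timedelta(days=offset), []))
    test = records_by_day.get(target_day, [])
    return train, test


# One placeholder record per day in November 2021 (hypothetical data).
records_by_day = {date(2021, 11, d): [f"visitor-{d}"] for d in range(1, 12)}
train, test = rolling_split(records_by_day, date(2021, 11, 11))
print(train, test)
```

Calling the function again with the next day as `target_day` shifts the window by one day, which mirrors the daily update of the training data.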

Since the detected user data are unlabeled and the sample size is large, the data of 3000 users are randomly selected from each website and labeled to check the test effect of the model. Table 3 shows that the detection precision and recall of the Isolation Forest model are higher than those of the other two models, indicating that the model has better applicability.

Further analysis showed that some econnoisseurs were missed because the number of website URLs they visited differed little from that of normal users. For example, aa/01 and aa/02 are URLs of the same type, which can be combined to better distinguish different user access patterns. We therefore combine URLs of the same type with the Bidirectional Long Short-Term Memory (BiLSTM) model to reduce their impact on the distribution of URL visits. The resulting detection effect is shown in Table 4.
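The effect of merging same-type URLs on visit counts can be shown with a deliberately simple stand-in. The sketch below does not use BiLSTM: it collapses purely numeric path segments with a regular expression, which is an assumption chosen only to make the aa/01 and aa/02 example concrete.

```python
import re
from collections import Counter


def url_template(url):
    """Collapse purely numeric path segments so aa/01 and aa/02 share one key."""
    return re.sub(r"(?<=/)\d+(?=/|$)", "{id}", url)


# Hypothetical visit sequence of one user.
visits = ["aa/01", "aa/02", "aa/02", "bb/list", "aa/17"]
raw_counts = Counter(visits)
merged_counts = Counter(url_template(u) for u in visits)
print(raw_counts)
print(merged_counts)
```

Before merging, the user appears to visit four different URLs a few times each. After merging, the same user shows four visits to one page type, which is the kind of targeted access the detector is meant to pick up.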

The average number of econnoisseurs detected per day from November 1 to November 11 was taken as the daily average for the activity period, and the average from November 12 to November 30 as the daily average for the non-activity period. As shown in Fig. 4, about twice as many econnoisseurs are detected during the activity period as during the non-activity period.

It can be concluded from the above:


The detection ability for econnoisseurs during the non-activity period is better than that during the activity period. The analysis found that some users visited the preferential products many times at low frequency. This behavior is similar to that of normal users and was not detected. It can be further optimized later.


**Table 3.** Model result comparison

**Table 4.** Comparison of results before and after improvement of Isolated Forest


**Fig. 4.** Detection amount of econnoisseur users in different periods

## **5 Conclusion**

The rise of e-commerce platforms not only brings convenience to people's lives but also poses a greater challenge to the platforms' risk control ability. How to control and reduce the risk of a platform being exploited needs to be solved urgently. Based on the real log data of website visitors in the Knownsec Security Intelligence Brain, this paper extracts nine user features to identify econnoisseurs.

The comparison shows that unsupervised anomaly detection models have a certain detection ability for e-commerce website econnoisseurs. Among them, Isolation Forest has higher detection precision and recall than the LOF and DBSCAN models on the three e-commerce websites, and is more suitable for current e-commerce user data.

The analysis then found that URLs of the same type visited by users can be combined to better describe their real visit behavior. The detection results show that econnoisseur detection based on BiLSTM-IForest is further improved.

The traffic of the econnoisseurs identified in this paper can be intercepted in advance. After that, a risk control model can be built on the actual browsing, purchasing and other specific behaviors of users to strengthen real-time prevention and control.

Econnoisseurs and platforms have always been in a state of mutual competition. As the saying goes, when the defense rises by a foot, the attacker rises by ten. We need to track the attack methods of econnoisseurs in real time and, at the same time, work with the risk control platform to further combat them.

## **References**



## CMPD: Context-Based Malicious Parameter Detection for APIs

Zhangjie Zhao<sup>1</sup>, Lin Zhang<sup>1</sup>, Xing Zhang<sup>2</sup>, Ying Wang<sup>2</sup>(B), and Yi Qin<sup>2</sup>

<sup>1</sup> Beijing Big Data Centre, Beijing 100101, China {zhaozj,zhanglin}@jxj.beijing.gov.cn <sup>2</sup> National Engineering Laboratory for Big Data Collaborative Security Technology, Beijing 102209, China {zhangxing,wangying,qinyi}@cecgw.cn

Abstract. The Application Program Interface (API) plays an important role as the channel for data interaction between programs, while the widespread use of APIs has brought security risks that cannot be ignored. The adversary can perform various Web attacks, including SQL Injection and Cross-Site Scripting (XSS), by tampering with the parameters of an API. Efficient detection of parameter tampering attacks on APIs is critical to ensure that the system runs in the expected condition and to avoid data leakage and property loss. Previous works mostly use rule-based methods or simple learning-based methods to detect parameter tampering attacks. However, they ignore the contextual information of the API tokens and thus perform poorly. In this paper, we propose the Context-based Malicious Parameter Detection (CMPD) framework to detect parameter tampering attacks on APIs. We use a neural network language model to learn the distribution of the parameters, parameter names, and URLs and then use a tree model to detect malicious queries based on the high-dimensional API embedding. Experiments show that CMPD outperforms all baselines, including a rule-based method, Support Vector Machine (SVM), and Autoencoder, on the CSIC 2010 dataset, with the *F*<sub>1</sub> value reaching 0.971. CMPD can also achieve a 0.895 *F*<sub>1</sub> value when the training data are reduced to 20% and a 0.910 *F*<sub>1</sub> value when negative examples are reduced to 1%.

Keywords: Parameter tampering · API · Language model

## 1 Introduction

The API is a combination of a set of definitions and protocols, which plays an important role as a channel for data interaction between programs. Modern applications are often developed with many well-defined interfaces to improve the scalability and compatibility of the program. Although the widespread use of APIs has brought great convenience to data access, and thus different terminals can access relevant information in a similar way, the extensive use of APIs has also brought security issues that cannot be ignored. Especially in the modern microservice architecture, each application is subdivided as much as possible, making the security risks faced by APIs more difficult to be detected completely. Effective detection of API security risks ensures that the system is running in good condition.

API access is based on the HTTP/HTTPS protocol, so threats to the Web protocol, such as SQL Injection, Broken Authentication, Session Management, and Cross-Site Scripting (XSS), may extend to APIs. These attacks are usually implemented by tampering with the parameters in the API. Therefore, a key idea for mitigating API threats is to prevent the parameters from being tampered with. The security community has proposed a variety of approaches to address the security risks of parameter tampering, the most common of which is rule-based detection. Such methods are often implemented by a lightweight agent that inspects a Web request for possible security risks before the server processes it. If a relevant rule is matched and the request is identified as a security risk, the request is filtered out so that the server is not affected. Although this detection method is simple to implement and efficient, its over-reliance on rules preset by humans makes it unable to detect unknown new attacks, so it performs poorly in actual detection and has a high false-negative rate.

Deep learning models have achieved remarkable success in various natural language processing (NLP) tasks, and it has been shown that these models can effectively learn the data distribution, which is difficult for rule-based detection methods. By learning the data distribution, the detection of parameter tampering can be seen as a pattern classification problem whose goal is to distinguish the feature patterns of normal API access from those of malicious API access. However, deep learning methods usually need to learn from a large amount of data, which may be difficult to obtain. Furthermore, normal access is more common than malicious access, and the ratio of normal to malicious accesses may be significantly unbalanced. Unbalanced data also make it difficult for the model to learn the data distribution, as the model may focus mainly on the more common type of data in the dataset and ignore the less common one.

In this paper, to detect parameter tampering attacks against APIs and reduce the influence of unbalanced data, we propose the Context-based Malicious Parameter Detection (CMPD) framework. CMPD improves the effectiveness of detecting malicious parameters by learning the distribution of each component of the API and building the relationship among the URL, parameter names, and parameters. Experiments show that CMPD outperforms all baselines on the CSIC 2010 dataset, with the *F*<sub>1</sub> value reaching 0.97. CMPD also achieves a 0.91 *F*<sub>1</sub> value on an unbalanced dataset in which normal access data are 100 times more numerous than malicious data, and a 0.89 *F*<sub>1</sub> value when the training data of the CSIC 2010 dataset are reduced to 20%.

We summarize our main contributions as follows:

– We propose a semantic extraction and learning module to learn the relationship among the URL, parameter names, and parameters, which universally models the parameter distribution of different APIs in one framework.


## 2 Related Work

#### 2.1 Vulnerability Detection for APIs

To detect API vulnerabilities, current methods mainly focus on black-box testing, which needs to generate a large number of test cases. Much research relies on crawlers or manual methods to get the detection object, parses out the fuzz domain based on the detection object to generate test cases, and uses the attack pattern library to perform vulnerability detection [1,3,4]. Various methods have been proposed to generate test cases. Atlidakis et al. [2] propose REST-ler to automatically generate test requests with a random walk algorithm. Avinash et al. [12] proposed six attack patterns for replay attacks to automatically generate test cases. Douibi et al. [5] automatically generate test cases for REST APIs based on the description of Swagger and OpenAPI. Because crawler-based API vulnerability detection has low coverage and manual testing cannot be carried out on a large scale, black-box testing is often combined with the interface documentation. Yu et al. [16] propose a fuzzing system for RESTful APIs based on SwaggerHub's development interface and improve the effectiveness of fuzz testing by automatically generating and filtering test cases. Viglianisi et al. [14] generated normal and malicious test cases based on the interface documentation to test the security risk of RESTful APIs. Different tools have also been proposed to automatically scan API vulnerabilities, such as FuzzAPI<sup>1</sup>, APIFuzzer<sup>2</sup>, boofuzz<sup>3</sup>, and Astra<sup>4</sup>. These tools do not need source code or interface documents but combine manual and crawler methods to achieve vulnerability detection. When interface documents are available, tools such as TNT-Fuzzer<sup>5</sup>, 42Crunch<sup>6</sup>, and OWASP ZAP<sup>7</sup> can directly extract detection objects from interface documents to achieve vulnerability detection with high coverage.

<sup>1</sup> https://github.com/Fuzzapi/fuzzapi.

<sup>2</sup> https://github.com/KissPeter/APIFuzzer.

<sup>3</sup> https://github.com/jtpereyda/boofuzz.

<sup>4</sup> https://github.com/flipkart-incubator/Astra.

<sup>5</sup> https://github.com/Teebytes/TnT-Fuzzer.

<sup>6</sup> https://42crunch.com.

<sup>7</sup> https://www.zaproxy.org.

#### 2.2 Parameter Tampering Detection for APIs

To detect the parameter tampering attacks for APIs, both rule-based methods and learning-based methods are proposed. ModSecurity<sup>8</sup> develops the OWASP ModSecurity Core Rule Set (CRS), which contains a large number of rules for detecting SQL Injection, Cross-Site Scripting, and HTTP Protocol Violations. Rieck et al. [11] use the n-grams and a similarity measurement to generate new features for anomaly detection. Ingham et al. [6] proposed the Deterministic Finite Automata (DFA) induction method, which uses a heuristic algorithm to detect abnormalities. Ma et al. [8] use machine learning methods including Naive Bayes, Support Vector Machine, and Logistic Regression to learn the distribution of static features to detect attacks. Nguyen et al. [10] use a feature selection algorithm to reduce the dimension of features extracted from traffic, reducing the computational complexity of the learning algorithm. Liang et al. [7] developed an RNN-MLP network to detect malicious accesses, where the RNN contains LSTM and GRU cells, and the MLP follows the RNN. Wang et al. [15] investigated CNN and LSTM and their combination method for malicious detection, which outperforms the traditional methods.

## 3 Methodology

#### 3.1 Parameter Tampering Attacks Against APIs

API parameter attacks attempt to manipulate parameters transmitted between the client and server in order to alter application data, such as user passwords and permissions, or product prices and quantities. This type of data is typically kept in cookies, hidden form fields, or URL query strings and is used to regulate and enhance the functionality of the program. The attack's success depends on faults in the integrity and logical validation mechanisms, and exploiting these faults may lead to further consequences such as cross-site scripting (XSS) and SQL injection. Parameter tampering is frequently limited to several essential categories of data: API query parameters, cookies, form fields, and HTTP headers. Specifically, an API consists of a basic URL *u*, a group of parameter names {*n*<sub>*i*</sub> | *i* ∈ *N*}, and a group of parameters {*p*<sub>*i*</sub> | *i* ∈ *N*}. The *i*-th parameter is paired with the *i*-th parameter name. Suppose the server expects to receive a benign query. For the target *u*, all possible benign choices of the *i*-th parameter are denoted as 𝒫<sub>*i*</sub>, and all possible benign choices of the *i*-th parameter name are denoted as 𝒩<sub>*i*</sub>. Therefore, a benign API query for the target URL *u* can be defined as

$$\forall \ i \in N, \ p\_i \in \mathcal{P}\_i, \quad \text{and} \quad \forall \ i \in N, \ n\_i \in \mathcal{N}\_i \tag{1}$$

And a parameter tampering attack for the target URL *u* can thus be defined as

$$\exists \ i \in N, \ p\_i \notin \mathcal{P}\_i, \quad \text{or} \quad \exists \ i \in N, \ n\_i \notin \mathcal{N}\_i \tag{2}$$
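The definition can be made concrete with a small membership check. The following is only an illustrative sketch of Eq. (1), not part of CMPD: the `BENIGN` whitelist, its contents, and the function name `is_tampered` are hypothetical, and in practice the benign sets cannot be enumerated, which is what motivates the learned model in the following sections.

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical whitelist: for each base URL u, the benign parameter names
# (the sets N_i) and, per name, the benign values (the sets P_i).
BENIGN = {
    "http://localhost:8080/tienda1/publico/entrar.jsp": {
        "errorMsg": {"Credenciales incorrectas"},
    },
}

def is_tampered(query_url: str) -> bool:
    """Return True unless every name and value lies in its benign set (Eq. 1)."""
    parts = urlsplit(query_url)
    base = f"{parts.scheme}://{parts.netloc}{parts.path}"
    allowed = BENIGN.get(base)
    if allowed is None:
        return True  # unknown base URL: treated as not benign in this sketch
    for name, value in parse_qsl(parts.query, keep_blank_values=True):
        if name not in allowed:           # parameter name outside N_i
            return True
        if value not in allowed[name]:    # parameter value outside P_i
            return True
    return False

base_url = "http://localhost:8080/tienda1/publico/entrar.jsp"
print(is_tampered(base_url + "?errorMsg=Credenciales+incorrectas"))
print(is_tampered(base_url + "?errorMsg=Credenciales+incorrectas%11"))
```

Note that `parse_qsl` decodes `+` to a space and `%11` to a control character, so the second query no longer matches the whitelisted value.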

<sup>8</sup> https://github.com/SpiderLabs/ModSecurity.

Intuitively, a parameter tampering attack may happen in the following conditions:


#### 3.2 Semantic Extraction and Learning

As parameter tampering attacks have a wide attack surface and a large scope of tampering, traditional methods cannot judge within a single model whether a request has been tampered with. Furthermore, as text information is discrete, traditional methods cannot use the semantic information it contains. Therefore, we use the Semantic Extraction and Learning Module to learn the distribution relationship among the basic URL of the API, the parameter names, and the parameters, and then map the discrete text information to a high-dimensional continuous space. The general framework of the semantic extraction and learning module is shown in Fig. 1.

Fig. 1. Illustration of the semantic extraction and learning module of CMPD.

Specifically, suppose there is a neural network with *K* layers, and the weights and the bias vectors for each layer can be defined as

$$\begin{aligned} \mathbf{W}^{(1)} \in R^{m\_1 \times m\_0} & \quad \mathbf{b}^{(1)} \in R^{m\_1 \times 1} \\ \mathbf{W}^{(2)} \in R^{m\_2 \times m\_1} & \quad \mathbf{b}^{(2)} \in R^{m\_2 \times 1} \\ & \quad \dots \\ \mathbf{W}^{(K)} \in R^{m\_K \times m\_{K-1}} & \quad \mathbf{b}^{(K)} \in R^{m\_K \times 1} \end{aligned} \tag{3}$$

where (*m*<sub>0</sub>, *m*<sub>1</sub>, ···, *m<sub>K</sub>*) are the numbers of units in each layer. The activation functions of the layers are denoted as (*f*<sup>(1)</sup>, *f*<sup>(2)</sup>, ···, *f*<sup>(*K*)</sup>), and thus the output of the *k*-th layer **Y**<sup>(*k*)</sup> can be defined as

$$\begin{aligned} \mathbf{net}\_i^{(k)} &= \sum\_{j=1}^{m\_{k-1}} W\_{i,j}^{(k)} Y\_j^{(k-1)} + b\_i^{(k)}, \quad (1 \le i \le m\_k) \\ \mathbf{net}^{(k)} &= \mathbf{W}^{(k)} \mathbf{Y}^{(k-1)} + \mathbf{b}^{(k)} \\ \mathbf{net}^{(k)} &= \left[ \mathbf{net}\_1^{(k)}, \mathbf{net}\_2^{(k)}, \dots, \mathbf{net}\_{m\_k}^{(k)} \right]^T \\ \mathbf{Y}^{(k)} &= f^{(k)} \left( \mathbf{net}^{(k)} \right) = \left[ Y\_1^{(k)}, Y\_2^{(k)}, \dots, Y\_{m\_k}^{(k)} \right]^T \end{aligned} \tag{4}$$

We extract the URL *u*, the parameter names {*n<sub>i</sub>* | *i* ∈ *N*}, and the parameters {*p<sub>i</sub>* | *i* ∈ *N*} from each API query and arrange them in the order in which they appear in the query as (*w<sub>t</sub>*)<sub>*t* ∈ {1, 2, ···, *M*}</sub>. We randomly remove a token *w<sub>t</sub>* from this sequence and send the remaining tokens into the network through a look-up layer placed before the first layer. The look-up layer has *V* × *E* parameters, where *V* is the size of the vocabulary and *E* is the size of the embedding; it maps each token to a vector of continuous values. We expect the network to recover the removed token and to output its probability in **Y**<sup>(*K*)</sup>. During training, the probability of the removed token in the output layer, i.e., the *K*-th layer, is maximized. After training, we use the values in the look-up layer as the embedding of a token and the average of the token embeddings in an API query as the embedding of the API. In this high-dimensional continuous space, the representations of tokens reflect the relationships between them, so that subsequent modules can effectively use the semantic information in the tokens.
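The forward computation described above can be sketched in a few lines of NumPy. This is a minimal illustration with random, untrained weights; the toy vocabulary, the layer sizes, the choice of `tanh` and softmax as activation functions, and the mean pooling of the remaining tokens are all assumptions made for the example and are not specified in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def forward(y0, weights, biases, activations):
    """Apply Y^(k) = f^(k)(W^(k) Y^(k-1) + b^(k)) for k = 1..K (Eq. 4)."""
    y = y0
    for W, b, f in zip(weights, biases, activations):
        y = f(W @ y + b)
    return y

# Hypothetical toy vocabulary built from a single API query.
tokens = ["/tienda1/publico/entrar.jsp", "errorMsg", "Credenciales+incorrectas"]
vocab = {tok: idx for idx, tok in enumerate(tokens)}
V, E, H = len(vocab), 8, 16          # vocabulary, embedding, hidden sizes (assumed)

lookup = rng.normal(size=(V, E))     # look-up layer with V x E parameters
weights = [rng.normal(size=(H, E)), rng.normal(size=(V, H))]   # K = 2 layers
biases = [np.zeros(H), np.zeros(V)]
activations = [np.tanh, softmax]

def removed_token_probability(query_tokens, removed_pos):
    """Probability the (untrained) network assigns to the removed token."""
    rest = [t for i, t in enumerate(query_tokens) if i != removed_pos]
    # Assumption: the remaining token embeddings are combined by averaging.
    context = lookup[[vocab[t] for t in rest]].mean(axis=0)
    probs = forward(context, weights, biases, activations)
    return probs[vocab[query_tokens[removed_pos]]]

def api_embedding(query_tokens):
    """Average of the token embeddings, used as the API representation."""
    return lookup[[vocab[t] for t in query_tokens]].mean(axis=0)

print(removed_token_probability(tokens, removed_pos=1))
print(api_embedding(tokens).shape)
```

Training would adjust `lookup`, `weights`, and `biases` so that the first printed probability is maximized over many queries; that optimization loop is omitted here.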

#### 3.3 Detection on Parameter Tampering

To reduce the model's reliance on the amount of data and to enable it to learn effectively when the positive and negative samples are not balanced, we additionally classify the API embedding using a decision tree model.

A decision tree model is a tree structure that describes how instances are classified, and it is composed of nodes and directed edges. Nodes are of two types: internal nodes and leaf nodes. Internal nodes denote a property or attribute, while leaf nodes denote a class. Classification begins at the root node, where a specific feature of the instance is tested; based on the test result, the instance is assigned to one of the child nodes, each of which corresponds to a value of the feature. Instances are tested and allocated recursively until a leaf node is reached. Specifically, suppose the training data consisting of all the API embeddings is *D*, *A* is a feature, and *C<sub>k</sub>* is the set of samples of class *k*. According to the values of *A*, the dataset can be separated into *D*<sub>1</sub>, *D*<sub>2</sub>, ···, *D<sub>n</sub>*. Denoting the samples in *D<sub>i</sub>* that belong to class *C<sub>k</sub>* as *D<sub>ik</sub>*, the entropy of the dataset *D* can be calculated as

$$H(D) = -\sum\_{k=1}^{K} \frac{|C\_k|}{|D|} \log\_2 \frac{|C\_k|}{|D|} \tag{5}$$

and the conditional entropy of *D* given *A* is

$$H(D \mid A) = \sum\_{i=1}^{n} \frac{|D\_i|}{|D|} H(D\_i) = -\sum\_{i=1}^{n} \frac{|D\_i|}{|D|} \sum\_{k=1}^{K} \frac{|D\_{ik}|}{|D\_i|} \log\_2 \frac{|D\_{ik}|}{|D\_i|} \tag{6}$$

and the information gain of *D* from *A* is defined as

$$g(D, A) = H(D) - H(D \mid A) \tag{7}$$

During training, the attribute with the largest information gain rate is selected as the test attribute each time, and the decision tree is constructed from top to bottom. The Parameter Tampering Detection Module in CMPD consists of the well-learned decision tree.
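Equations (5) to (7) translate directly into code. The sketch below assumes a feature that has already been discretized into a small number of values (API embeddings are continuous, so a real tree would first choose split thresholds); the toy labels and feature values are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """H(D) = -sum_k |C_k|/|D| * log2(|C_k|/|D|), Eq. (5)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(feature_values, labels):
    """H(D|A) = sum_i |D_i|/|D| * H(D_i), Eq. (6); D_i groups samples by the value of A."""
    n = len(labels)
    groups = {}
    for a, y in zip(feature_values, labels):
        groups.setdefault(a, []).append(y)
    return sum(len(g) / n * entropy(g) for g in groups.values())

def information_gain(feature_values, labels):
    """g(D, A) = H(D) - H(D|A), Eq. (7)."""
    return entropy(labels) - conditional_entropy(feature_values, labels)

# Hypothetical toy data: four requests, two candidate features.
labels = ["benign", "benign", "malicious", "malicious"]
separating = ["low", "low", "high", "high"]   # splits the classes perfectly
useless = ["low", "high", "low", "high"]      # carries no class information

print(information_gain(separating, labels))  # 1.0
print(information_gain(useless, labels))     # 0.0
```

A feature that separates the classes perfectly recovers the full entropy of the dataset (1 bit here), while an uninformative feature yields zero gain.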

#### 3.4 Context-Based Malicious Parameter Detection Framework

Based on the previous analysis, we now illustrate the general architecture of the proposed Context-based Malicious Parameter Detection (CMPD) framework, which is shown in Fig. 2.

Fig. 2. General architecture of CMPD.

The CMPD framework consists of the semantic extraction and learning module detailed in Sect. 3.2 and the parameter tampering detection module detailed in Sect. 3.3. We first collect all the API access records in the form of URL requests and feed the collected data into the semantic extraction and learning module. Each API request is thereby mapped to the high-dimensional hidden space as an API vector representation, which contains the context information about the normal and abnormal parameters. We then feed all the API representations to the parameter tampering detection module, which can perform malicious parameter detection without relying on balanced and abundant data. The detection module classifies each API representation as a benign or an abnormal request by following the decision tree from the root to a leaf node, as illustrated in Fig. 2.

## 4 Experiments

#### 4.1 Metric

The experiments focus on the identification of parameter tampering attacks, and the evaluation metrics used in this work are precision, recall, and *F*<sub>1</sub>. These metrics are calculated from the numbers of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) in the classification results. TP and TN are the numbers of correctly classified malicious and legitimate API requests, respectively. FP is the number of normal API requests misclassified as malicious, while FN is the number of abnormal requests misclassified as legitimate. The precision is calculated as

$$\text{Precision} = \frac{TP}{TP + FP} \tag{8}$$

the recall is calculated as

$$\text{Recall } = \frac{TP}{TP + FN} \tag{9}$$

and the *F*<sub>1</sub> value is calculated as

$$F\_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{10}$$
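As a worked instance of Eqs. (8) to (10), the following sketch computes the three metrics from confusion counts. The counts used in the example are hypothetical and are not taken from the experiments.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 from confusion counts (Eqs. 8-10)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 90 malicious requests caught, 10 false alarms, 30 missed.
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.9 0.75 0.818
```

TN does not appear in any of the three formulas, which is why these metrics remain informative when benign requests vastly outnumber malicious ones.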

#### 4.2 Main Result

The results of different methods on the HTTP DATASET CSIC 2010 are shown in Table 1. CRS stands for Core Rule Set, and PL stands for Paranoia Level, which is used to control the strictness of ModSecurity's rule checking, with a smaller value indicating greater strictness. As the PL increases, the Precision value of ModSecurity increases while the Recall value decreases, resulting in a decrease in the *F*<sub>1</sub> value, which indicates that the traditional rule-based method has a very limited effect. The SVM algorithm is weaker than the traditional method at detecting parameter tampering, most likely because it is extremely dependent on the quality of the features, and the features fail to indicate the distribution of the data. Autoencoder-based methods are deep learning-based and perform better than the traditional methods. The proposed CMPD outperforms all baselines, including traditional detection methods and learning-based methods, in terms of *F*<sub>1</sub>. CMPD has balanced precision and recall, indicating that our method has low false-positive and false-negative rates.


Table 1. Comparison of Precision, Recall, and *F*<sub>1</sub> Score.

#### 4.3 Further Analysis

Influence of the Number of Training Data. In practical situations, the samples of normal and abnormal accesses may be extremely unbalanced. To explore the effect of our model in a more demanding environment, we conducted experiments in two ways. We first reduce the amount of training data to illustrate the performance of CMPD when training data is scarce. The influence of the amount of training data is shown in Fig. 3. We find that the classification performance of the model gradually increases with the amount of training data, and the classification results are consistent with the results in Table 1. When the complete training data set is used, CMPD achieves the best classification performance. Moreover, when the training data is only 20% of the original dataset, the *F*<sub>1</sub> value still reaches 0.89, indicating that CMPD is less sensitive to the amount of training data and is effective even with less data.

Influence of the Ratio Between the Number of Negative Examples and Positive Examples. Further, we randomly drop samples of malicious queries from the dataset, and the performance of our method is shown in Fig. 4. We find that when the number of malicious samples decreases, the classification performance decreases accordingly, but even when it is reduced to 1% of the normal samples, the *F*<sub>1</sub> value still exceeds 0.91, indicating that our model is effective even when the normal and malicious samples are extremely imbalanced.

Fig. 3. Influence of the percentage of training examples.

Fig. 4. Influence of the ratio of negative examples to positive examples.

Fig. 5. Visualization of the parameters and parameter names that the model most concentrated on.

Visualization of Model Concentration. Further, we collect the parameters and the parameter names that have a critical impact on the parameter tampering detection extracted by the model, and the visualization of these tokens is shown in Fig. 5. It can be seen that parameters such as *email*, *login*, and *password*, which are highly relevant to parameter tampering attacks, are correctly extracted and are considered to be of high importance, indicating that the model has different levels of attention to different parameters and successfully learns the features related to parameter tampering.

Case Study. To show that our method can identify tampering with parameters, tampering with parameter names, and mismatches between the URL and its parameters or parameter names, we provide the results of a case study in Table 2. As illustrated in Table 2, if we tamper with the parameter of the API "http://localhost:8080/tienda1/publico/entrar.jsp" by appending the string "%11" to the normal parameter "errorMsg=Credenciales+incorrectas", the CMPD framework finds that the parameters of the API have been tampered with. Similarly, if we tamper with the parameter name (from *errorMsg* to *errorMsgBAC*), CMPD still detects the tampering, which shows that our method learns the correct relationship between the parameters and the parameter names. Furthermore, if we change the parameter of "http://localhost:8080/tienda1/publico/entrar.jsp" to the normal parameter of another API, "http://localhost:8080/tienda1/publico/vaciar.jsp", our method can also detect that the parameter and parameter names do not correspond to the correct URL. This shows that CMPD successfully learns the correspondence between URLs, parameters, and parameter names, so we do not need different models to detect parameter tampering attacks for different APIs.

Table 2. Case study on different types of tampering

## 5 Conclusion

APIs are vital for data exchange between programs, but their widespread use has brought significant security risks. By modifying API parameters, an adversary can launch Web attacks such as SQL Injection and Cross-Site Scripting (XSS). API parameter tampering detection is therefore critical to keep the system running smoothly. To detect parameter tampering attacks, previous works typically used rule-based or simple learning-based methods, which ignore the contextual information of the API tokens and thus perform poorly. In this paper, we propose a framework for detecting API parameter tampering attacks called Context-based Malicious Parameter Detection (CMPD). We first learn the distribution of parameters, parameter names, and URLs using a neural network language model and then use a tree model to detect malicious queries based on the high-dimensional API embedding. On the CSIC 2010 dataset, CMPD outperforms all baselines with an *F*<sub>1</sub> of 0.971.

## References



## **Webpage Tampering Detection Method Based on BiGRU-CRF-RCNN**

Xiangyu Fan, Jilong Yang, Wei Zhao, Jincheng Deng(B), and Fangming Liu

Beijing Knownsec Information Technology Co., Ltd., Beijing 100097, China {fanxy,yangjl,zhaow,dengjc,liufm}@knownsec.com

**Abstract.** With the development of the Internet, cyber security events occur frequently, and webpage tampering events account for a high proportion of them. In response, this paper constructs a webpage tampering detection framework, BCR. The text of the webpage to be detected is segmented and extracted according to the webpage structure; text features are extracted with a BiGRU model that captures context dependence and are combined with a CRF to learn sequence labels for named entities; word vectors are then constructed from the extracted named entities and fed into an RCNN model for tampering detection. The experimental results show that the framework achieves 95.37% precision, 95.35% recall, and 95.34% F1-Score in webpage tampering detection, which is better than the TextRank-RCNN framework. In practical application, it also achieves 95.13% precision and 93.25% recall.

**Keywords:** Webpage tampering · Named entity recognition · Text classification · Bidirectional gated recurrent unit network · Conditional random field

## **1 Introduction**

With the rapid development of the Internet, various cyber security incidents continue to occur, among which the proportion of webpage tampering events has always been high. How to quickly and accurately locate the tampered content in the webpage and rectify it in time is of great significance to reducing the loss of the site.

At this stage, NLP technology is developing rapidly, text classification technology has a wide range of applications in various fields, and named entity recognition technology is becoming more and more mature. This paper uses a named entity model to extract the named entities of the webpage text segment by segment and then combines them with a text classification model to identify the tampered text.

## **2 Research Status**

At present, the commonly used webpage tampering detection methods mainly rely on image recognition and comparison or on rule-based detection. Yan Yufeng and Shen Yong [1] proposed capturing the original image and a real-time image of the webpage, detecting feature points in the two images with an image processing model, and calculating the similarity of the webpages from the feature points to determine whether the webpage has been tampered with. This method works well for webpages with relatively fixed content or a low update frequency; however, for webpages with rich and frequently updated content, model efficiency and detection accuracy suffer. Hongwei R et al. [2] proposed classifying webpage attributes by principal component analysis and introducing corresponding rules for each category to judge webpage tampering. This method is effective and efficient for webpages with a simple structure, but its recognition accuracy degrades when the webpage attributes are complex and the rules cannot cover new objects.

Named entity recognition is a popular research direction in NLP, and named entity recognition models have been applied successfully to big data research in many fields. Early named entity recognition mainly relied on building dictionaries, which required a lot of labor. After continuous optimization and iteration, today's named entity recognition models mainly rely on machine learning algorithms. In the field of named entity recognition for cyber security, Chiu J et al. [3] proposed a BiLSTM-CNN method that builds a dictionary into the neural network to encode and then match some words; this method achieves a better F1-Score than other methods on open source datasets. Fan Xiaoxia et al. [4] proposed a named entity recognition system (DNER) for darknet market text, built with CBOW-CNN-BiLSTM-CRF on Branwen's open source darknet market data; the system can significantly improve the recognition of entity types. Yi F et al. [6] considered the particularity and complexity of security entities and proposed a named entity recognition model based on regular expressions, an entity dictionary, and a CRF combined with feature templates, which obtained good results.

## **3 Research Content and Methods**

As can be seen from the above, the final data carrier in most webpage tampering detection is text, so extracting effective and well-characterized keywords from the text plays a decisive role. Webpages differ in text complexity: complex text often contains more noise and has a more complicated structure than simple text, which strongly affects the extraction of keywords with effective features. To counter the interference of complex text, this paper designs and implements a framework that extracts text segment by segment according to the webpage structure, uses a named entity model to extract named entities, constructs text vectors from them, and feeds the vectors into a text classification model for webpage tampering detection. The framework comprises a data preprocessing framework, a BiGRU-CRF named entity recognition model, and an RCNN text classification model.

#### **3.1 Data Sources**

The experimental data in this paper comes from the historical webpage tampering monitoring data in the threat intelligence data of Knownsec Security Intelligence Brain. The data is HTML text involving five types of websites: government, universities, hospitals, transportation, and energy. It contains 20,000 untampered webpages and 10,000 tampered webpages. The tampered content involves pornography, gambling, novels, third-party movie websites, third-party investment websites, and reactionary information.

#### **3.2 Data Preprocessing**

Following the method above, the original data is first extracted in segments according to the structure of the webpage, and the extracted data then undergoes manual labeling and stop word filtering.

**Data Extraction.** 1) Parse the HTML data. 2) Build a DOM tree. 3) Traverse the DOM tree to find the tag where the required text is located. 4) Extract the text data segmented based on the webpage structure from the returned HTML data according to the tag.
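The extraction steps can be sketched with Python's standard library. Note the substitution: instead of building an explicit DOM tree and traversing it, this sketch uses the event-based `html.parser.HTMLParser` and collects text while a wanted tag is open, which yields the same per-tag segments for well-formed input. The set of tags, the class name `SegmentExtractor`, and the sample page are assumptions made for illustration.

```python
from html.parser import HTMLParser

class SegmentExtractor(HTMLParser):
    """Collect one text segment per outermost occurrence of a wanted tag."""

    def __init__(self, wanted_tags=("title", "h1", "h2", "p", "li", "a")):
        super().__init__()
        self.wanted = set(wanted_tags)
        self.open_tags = []   # wanted tags that are currently open
        self.buffer = []      # text pieces of the segment being read
        self.segments = []    # (tag, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            if not self.open_tags:
                self.buffer = []          # a new outermost segment starts
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and tag == self.open_tags[-1]:
            self.open_tags.pop()
            if not self.open_tags:        # outermost wanted tag just closed
                text = " ".join("".join(self.buffer).split())
                if text:
                    self.segments.append((tag, text))

    def handle_data(self, data):
        if self.open_tags:                # ignore text outside wanted tags
            self.buffer.append(data)

def extract_segments(html: str):
    parser = SegmentExtractor()
    parser.feed(html)
    parser.close()
    return parser.segments

# Hypothetical page used only to exercise the sketch.
page = (
    "<html><head><title>City Hospital</title></head><body>"
    "<h1>News</h1><p>Clinic opens <a href='#'>Monday</a>.</p>"
    "<script>var x = 1;</script></body></html>"
)
segments = extract_segments(page)
print(segments)
```

Text inside `<script>` is skipped because no wanted tag is open there, and the nested `<a>` is folded into its enclosing `<p>` segment.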

#### **Data Labeling**

1) Named Entity Labeling

This paper uses the word segmentation tool Jieba to perform word segmentation and part-of-speech tagging on the text data. Since named entities are derived from nouns, data labeling is based on the nouns obtained after word segmentation. According to the tampering content of the webpages, a total of 5 entity types are labeled: PER (person), ORG (company/organization), PLF (platform), OBJ (special noun), and 0 (irrelevant word). Each segment is guaranteed to correspond to one named entity label, which serves as the data basis for subsequent model building.


Use each webpage domain name as the source label of segmented text data to facilitate subsequent positioning.

**Stop Word Filtering.** Build a stop word database, including: webpage navigation vocabulary, website copyright statement vocabulary, common auxiliary words, special symbols, etc.

#### **3.3 Text Vectorization**

Word2vec is used to build text vectors. It offers two models, CBOW and Skip-gram: the CBOW model predicts the central word from its context, and the Skip-gram model predicts the context from the central word. Based on the research background, this paper adopts the CBOW model to construct text vectors.
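The difference between the two models is easiest to see in the training pairs they produce. The sketch below only generates these pairs and does not train any embedding; the window size and the sample tokens are assumptions for illustration.

```python
def cbow_pairs(tokens, window=2):
    """(context words, central word) pairs: CBOW predicts the central word."""
    pairs = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            pairs.append((context, center))
    return pairs

def skipgram_pairs(tokens, window=2):
    """(central word, context word) pairs: Skip-gram predicts each context word."""
    return [(center, ctx)
            for context, center in cbow_pairs(tokens, window)
            for ctx in context]

# Hypothetical segmented text.
tokens = ["hospital", "outpatient", "registration", "notice"]
for context, center in cbow_pairs(tokens, window=1):
    print(context, "->", center)
```

With `window=1`, the second pair printed is `['hospital', 'registration'] -> outpatient`: CBOW sees both neighbors at once, whereas Skip-gram would split this into two separate (central word, context word) pairs.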

#### **3.4 BiGRU Model**

In the field of named entity recognition, the LSTM model is widely used. In the LSTM model, a single module contains three gate units: the input gate, the forget gate, and the output gate. The input gate determines which information to retain, the forget gate determines which information to discard, and the output gate produces the final result. In the GRU network, the three gating units of the LSTM model are replaced by an update gate and a reset gate. The update gate determines how much information is attended to, and the reset gate determines how much information is forgotten. The reduction in gating units also reduces the number of parameters in the network, making the GRU more concise and efficient than the LSTM. BiGRU is a neural network model composed of two unidirectional GRUs running in opposite directions. The current hidden layer state of BiGRU is jointly determined by the current input *X<sub>t</sub>*, the forward hidden layer state *h*<sup>→</sup><sub>*t*−1</sub> at time *t* − 1, and the backward hidden layer state *h*<sup>←</sup><sub>*t*−1</sub> at time *t* − 1. The state of the hidden layer at time *t* is:

$$h\_t^{\rightarrow} = G(X\_t, h\_{t-1}^{\rightarrow}) \tag{1}$$

$$h\_t^{\leftarrow} = G(X\_t, h\_{t-1}^{\leftarrow}) \tag{2}$$

$$h\_t = \omega\_t h\_t^{\rightarrow} + \vartheta\_t h\_t^{\leftarrow} + b\_t \tag{3}$$

The function *G*() is a nonlinear transformation of the input word vector that encodes the word vector at the current moment into the corresponding hidden layer state; ω<sub>*t*</sub> and ϑ<sub>*t*</sub> are the weights of *h*<sup>→</sup><sub>*t*</sub> and *h*<sup>←</sup><sub>*t*</sub> at time *t*, and *b<sub>t</sub>* is the corresponding bias. The structure of the model is shown in Fig. 1:
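A minimal NumPy sketch of Eqs. (1) to (3) follows. The text does not spell out the gate equations inside *G*, so a common GRU formulation is assumed here; the weights are random and untrained, and the combination weights are fixed scalars (0.5 each, zero bias) purely for illustration. The backward direction reads the sequence from the end, so its "previous" state is the one computed for the word to the right.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_gru(input_size, hidden_size, rng):
    """Random (untrained) parameters for one GRU direction."""
    w_shape, u_shape = (hidden_size, input_size), (hidden_size, hidden_size)
    return {name: rng.normal(scale=0.1, size=w_shape if name[0] == "W" else u_shape)
            for name in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")}

def gru_step(p, x, h_prev):
    """One application of G(X_t, h_prev) with assumed standard GRU gates."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h_prev)              # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h_prev)              # reset gate
    h_cand = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h_prev))   # candidate state
    return (1.0 - z) * h_prev + z * h_cand

def bigru(xs, fwd, bwd, hidden_size, w_fwd=0.5, w_bwd=0.5, bias=0.0):
    """Combine both directions as h_t = w_fwd * h_fwd_t + w_bwd * h_bwd_t + bias."""
    T = len(xs)
    h_f, h_b = np.zeros(hidden_size), np.zeros(hidden_size)
    forward, backward = [None] * T, [None] * T
    for t in range(T):                    # left-to-right pass
        h_f = gru_step(fwd, xs[t], h_f)
        forward[t] = h_f
    for t in reversed(range(T)):          # right-to-left pass
        h_b = gru_step(bwd, xs[t], h_b)
        backward[t] = h_b
    return np.stack([w_fwd * f + w_bwd * b + bias
                     for f, b in zip(forward, backward)])

rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 4))              # 5 word vectors of dimension 4 (toy input)
hidden = bigru(xs, init_gru(4, 3, rng), init_gru(4, 3, rng), hidden_size=3)
print(hidden.shape)  # (5, 3)
```

Each row of `hidden` is the combined state for one word and would serve as the per-word feature passed on to the CRF layer.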

#### **3.5 CRF Model**

The Conditional Random Field (CRF) model is a special Markov random field. It is assumed that there are only observation values *X* and state values *Y* in the model. In the CRF model, each state value *Y<sub>n</sub>* is related only to its adjacent state values, while its observation value *X<sub>n</sub>* does not have the Markov property. The CRF model needs to consider the correlation between the output state values, and the feature function ∂ can be used to learn the relationship between states. The CRF outputs a score for each sequence and normalizes all sequence scores to find the path with the highest probability as the predicted sequence. The CRF model includes a state feature function ∂ and a state transition function μ.

**Fig. 1.** BiGRU model structure diagram

**State Feature Function.** Related only to the current node; ϑ is the weight of the feature function, that is: ϑ∂(*Y<sub>i</sub>*, *X<sub>i</sub>*).

**State Transition Function.** Related to both node *i* + 1 and node *i* − 1; ω is the weight of the transition function, that is: ωμ(*Y*<sub>*i*+1</sub>, *Y*<sub>*i*−1</sub>, *Y<sub>i</sub>*, *X<sub>i</sub>*).

Suppose there are state feature functions ∂<sub>1</sub>, ∂<sub>2</sub>, …, ∂<sub>*L*</sub> with weights ϑ<sub>1</sub>, ϑ<sub>2</sub>, …, ϑ<sub>*L*</sub>, and state transition functions μ<sub>1</sub>, μ<sub>2</sub>, …, μ<sub>*K*</sub> with weights ω<sub>1</sub>, ω<sub>2</sub>, …, ω<sub>*K*</sub>. For the sequence *X* = {*X*<sub>1</sub>, *X*<sub>2</sub>, …, *X<sub>n</sub>*}, the probability of the output sequence *Y* can be calculated as:

$$P(Y|X) = \frac{1}{Z(X)} \exp\left(\sum \vartheta\_L \partial\_L(Y\_i, X\_i) + \sum \omega\_K \mu\_K(Y\_{i+1}, Y\_{i-1}, Y\_i, X\_i)\right) \tag{4}$$

of which:

$$Z(X) = \sum \exp\left(\sum \vartheta\_L \partial\_L(Y\_i, X\_i) + \sum \omega\_K \mu\_K(Y\_{i+1}, Y\_{i-1}, Y\_i, X\_i)\right) \tag{5}$$

*Z*(*X*) is the normalization factor, which can be seen as the sum of the scores of all output sequences.

When the transition feature and state feature are represented by unified functions *s* and *f* , the probability of the output sequence *Y* is:

$$P(Y|X) = \frac{1}{Z(X)} \exp\sum s f\_l(Y, X) \tag{6}$$

of which:

$$Z(X) = \sum \exp \sum s f\_l(Y, X) \tag{7}$$

When the CRF model is used for named entity recognition, its graph structure is shown in Fig. 2:

**Fig. 2.** CRF model structure diagram
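The scoring, normalization, and decoding steps of the CRF can be sketched as follows. This is a simplified linear-chain version under stated assumptions: state scores are given as an `emissions` matrix (one row per word, one column per label), transition scores depend only on pairs of adjacent labels rather than on the wider neighborhood written in Eq. (4), and all scores are random toy values.

```python
import itertools
import numpy as np

def sequence_score(emissions, transitions, labels):
    """Sum of state scores and transition scores along one label sequence."""
    score = sum(emissions[t, y] for t, y in enumerate(labels))
    score += sum(transitions[a, b] for a, b in zip(labels, labels[1:]))
    return score

def log_partition(emissions, transitions):
    """log Z(X) via the forward algorithm (a sum over all label sequences)."""
    alpha = emissions[0]
    for t in range(1, len(emissions)):
        scores = alpha[:, None] + transitions + emissions[t][None, :]  # [prev, cur]
        m = scores.max(axis=0)
        alpha = m + np.log(np.exp(scores - m).sum(axis=0))
    m = alpha.max()
    return m + np.log(np.exp(alpha - m).sum())

def sequence_probability(emissions, transitions, labels):
    """P(Y|X) = exp(score(Y, X)) / Z(X)."""
    return float(np.exp(sequence_score(emissions, transitions, labels)
                        - log_partition(emissions, transitions)))

def viterbi(emissions, transitions):
    """Label sequence with the highest score, hence the highest probability."""
    T, L = emissions.shape
    best = emissions[0].copy()
    back = np.zeros((T, L), dtype=int)
    for t in range(1, T):
        scores = best[:, None] + transitions      # scores[prev, cur]
        back[t] = scores.argmax(axis=0)
        best = scores.max(axis=0) + emissions[t]
    path = [int(best.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

rng = np.random.default_rng(0)
emissions = rng.normal(size=(4, 3))     # 4 words, 3 labels (toy scores)
transitions = rng.normal(size=(3, 3))
best_path = viterbi(emissions, transitions)
print(best_path, sequence_probability(emissions, transitions, best_path))
```

In a BiGRU-CRF model the emission scores would come from the BiGRU hidden states, and the transition matrix would be learned jointly with them.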

#### **3.6 RCNN Model**

The RCNN model is a commonly used text classification model, and its structure is divided into three parts.

**Region-CNN Model.** A bidirectional RNN model is used to obtain the context information of each word embedding, and its expression is:

$$c\_l(w\_i) = f\left(W\_{(l)}c\_l(w\_{i-1}) + W\_{(sl)}e(w\_{i-1})\right) \tag{8}$$

$$c\_r(w\_i) = f\left(W\_{(r)}c\_r(w\_{i+1}) + W\_{(sr)}e(w\_{i+1})\right) \tag{9}$$

of which:

*c<sub>l</sub>*(*w<sub>i</sub>*) represents the left context (the preceding text) of the word *w<sub>i</sub>*.

*c<sub>r</sub>*(*w<sub>i</sub>*) represents the right context (the following text) of the word *w<sub>i</sub>*.

*e*(*w<sub>i</sub>*) represents the embedding vector of the word *w<sub>i</sub>*.

*W*<sub>(*l*)</sub> and *W*<sub>(*r*)</sub> are weight matrices that carry the left and right contexts of one word over to the left and right contexts of the next word.

*W*<sub>(*sl*)</sub> and *W*<sub>(*sr*)</sub> are feature matrices that merge the semantic features of the current word into the left and right contexts of the next word.

**Computing Hidden Semantic Vectors.** The context information obtained in the previous step is merged with the expanded word embedding information, and the activation function is used to calculate the hidden semantic feature vector of the word *wi*. Expanded word embedding information is:

$$X\_i = [c\_l(w\_i); \, e(w\_i); \, c\_r(w\_i)] \tag{10}$$

Hidden semantic vector is:

$$Y\_i^{(2)} = \tanh\left(W^{(2)}X\_i + b^{(2)}\right) \tag{11}$$

**Continuous Learning, Output Results.** After further learning through the TextCNN, max-pooling, and fully connected layers, the classification result is obtained.

The structure diagram of the RCNN model is shown in Fig. 3:

**Fig. 3.** RCNN model structure diagram
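The data flow of Eqs. (8) to (11) followed by pooling and a fully connected output can be sketched in NumPy. Several points are assumptions for illustration: the activation *f* is taken to be `tanh`, the boundary contexts are zero vectors, the TextCNN stage mentioned above is omitted, and all weights are random and untrained.

```python
import numpy as np

def rcnn_features(E, W_l, W_sl, W_r, W_sr, W2, b2):
    """E holds one word embedding per row; returns the max-pooled hidden vector."""
    n = E.shape[0]
    c = W_l.shape[0]
    left = np.zeros((n, c))    # c_l(w_i); the first word gets a zero left context
    right = np.zeros((n, c))   # c_r(w_i); the last word gets a zero right context
    for i in range(1, n):                               # Eq. (8), left to right
        left[i] = np.tanh(W_l @ left[i - 1] + W_sl @ E[i - 1])
    for i in range(n - 2, -1, -1):                      # Eq. (9), right to left
        right[i] = np.tanh(W_r @ right[i + 1] + W_sr @ E[i + 1])
    X = np.concatenate([left, E, right], axis=1)        # Eq. (10)
    Y2 = np.tanh(X @ W2.T + b2)                         # Eq. (11), one row per word
    return Y2.max(axis=0)                               # element-wise max-pooling

rng = np.random.default_rng(0)
n, d, c, h, classes = 6, 8, 5, 7, 2     # toy sizes (assumed)
E = rng.normal(size=(n, d))
params = {
    "W_l": rng.normal(size=(c, c)), "W_sl": rng.normal(size=(c, d)),
    "W_r": rng.normal(size=(c, c)), "W_sr": rng.normal(size=(c, d)),
    "W2": rng.normal(size=(h, 2 * c + d)), "b2": np.zeros(h),
}
pooled = rcnn_features(E, **params)

# Fully connected output layer with softmax over the two classes.
W_out, b_out = rng.normal(size=(classes, h)), np.zeros(classes)
logits = W_out @ pooled + b_out
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(pooled.shape, probs)
```

Max-pooling over the word axis makes the pooled vector independent of the text length, so segments of different lengths map to inputs of the same size for the output layer.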

## **4 Experiment and Result Analysis**

#### **4.1 Experimental Environment and Evaluation Indicators**

This experiment was performed in the following configuration:

In this experiment, both the named entity model and the text classification model use the precision rate (PRE), the recall rate (REC), and the comprehensive evaluation (F1-Score) as the model's accuracy evaluation indicators.

#### **4.2 Experimental Configuration**

**Named Entity Recognition.** The 30,000 pieces of data after data preprocessing are divided into training set, test set and validation set according to the ratio of 6:2:2. The distribution of the data set is as follows:

In order to verify that the framework proposed in this paper is better, BiGRU-CRF model, BiLSTM-CRF model, and CNN-LSTM model are set up as comparison models. The three comparison model structures are shown in Table 3 (Tables 1 and 2):


**Table 1.** Configuration table.

**Table 2.** Named entity dataset partitioning.


**Table 3.** Named entity vs model structure.


The main parameter configuration of each model is shown in Table 4 (Table 5):

**Text Categorization.** The 30,000 pieces of data after data preprocessing are divided into training set, test set and validation set according to 6:2:2. The distribution of the data set is as follows:


**Table 4.** Parameter configuration.

**Table 5.** Text classification dataset partitioning.


Two methods are used to build word vectors, which are then fed into the RCNN model for comparison: named entities combined with the RCNN model, and text summarization combined with the RCNN model. The RCNN model is trained for 30 epochs with batch\_size set to 256, and the training process is shown in Table 6:


**Table 6.** RCNN model training process

#### **4.3 Experimental Results and Analysis**

**Named Entity Recognition.** The accuracy indicators of each model are shown in Fig. 4:

**Fig. 4.** Accuracy indicators of each named entity model

In terms of recognition accuracy, the PRE, REC, and F1-Score of the BiGRU-CRF model in this scenario are 93.88%, 91.36%, and 92.60% respectively, an improvement over the other two models. The main reason is that the data set consists of text segmented according to the webpage structure, and the BiGRU-CRF model, whose gating units are improved and optimized relative to those of the BiLSTM-CRF model, is better suited to such simple text data. Both the BiGRU-CRF and BiLSTM-CRF models encode text information from front to back and from back to front, which better captures bidirectional semantic dependencies, whereas the CNN-LSTM model cannot encode text from back to front and captures only one-way semantic dependencies, so its accuracy is lower than that of the other two models.

Figure 5, Fig. 6, and Fig. 7 show the evaluation indicators of each category of named entity recognition accuracy of each model:

Compared with the other two models, the BiGRU-CRF model has obvious advantages in PLF named entity recognition, and is comparable to the BiLSTM-CRF model in other types of named entity recognition. The CNN-LSTM model is far behind the other two models in terms of OBJ and PLF named entity recognition. From the comprehensive view of the above radar charts, BiGRU-CRF is relatively better in named entity recognition in this scenario.

**Text Categorization.** The accuracy evaluation indicators of each model are shown in Fig. 8:

Compared with TextRank-RCNN, BiGRU-CRF-RCNN shows a certain improvement in precision, recall, and F1-Score. The main reason is that the BCR framework extracts the keywords representing the text with the BiGRU-CRF model, and the resulting entities better represent the domain features and context features of the current text. The TextRank-RCNN framework, in contrast, builds a network from the relationships between locally adjacent nodes when extracting keywords and has no mechanism dedicated to proper nouns, so the extracted features are not comprehensive and its tampering identification accuracy is relatively poor.

**Fig. 5.** The precision of each model for each type of named entity recognition

**Fig. 6.** The recall of each model for each type of named entity recognition

**Fig. 7.** The F1-Score of each model for each type of named entity recognition

**Fig. 8.** The accuracy index of each text classification model

#### **4.4 Practical Application**

This framework has been applied in Knownsec Security Intelligence Brain. From the test results, an average of 108,326 webpages are detected every day, and an average of 411 tampered webpages are identified every day. After manual sampling by the sampling team, the sampling precision was 95.13%, and the recall was 93.25%.

#### **4.5 Conclusion**

At this stage, named entity recognition and text classification technology have been widely used in the field of cyber security, but less so in webpage tampering detection. Therefore, the BiGRU-CRF-RCNN framework is proposed for webpage tampering detection. From the above experiments and the practical application results, we draw the following conclusions:

**Advantages of this Framework.** Due to the structural characteristics of the gated unit of the BiGRU-CRF model, it has a better application than other models in this scenario. In terms of text classification, the named entities extracted based on the named entity model can better reflect the characteristics of the current field. Therefore, in the scenario of this paper, using the text vector constructed based on named entities for text classification has a better effect.

**Weaknesses of the Framework.** The BiGRU-CRF-RCNN model achieves better results because the industry content of the websites detected in production and in the experiments is only weakly related to the tampered content. Regarding generalization, if the data coverage is broadened and the positive and negative samples become related, the model will need to be improved according to its actual performance.

## **References**



**Cryptocurrency**

## **Detecting Bitcoin Nodes by the Cyberspace Search Engines**

Ruiguang Li<sup>1,2</sup>(✉), Jiawei Zhu<sup>2</sup>, Jiaqi Gao<sup>3</sup>, Fudong Wu<sup>3</sup>, Dawei Xu<sup>1,3</sup>, and Liehuang Zhu<sup>1</sup>

<sup>1</sup> School of Cyberspace Science and Technology, Beijing Institute of Technology, Beijing, China lrg@cert.org.cn

<sup>2</sup> National Computer Network Emergency Response Technical Team/Coordination Center of China, Beijing, China

<sup>3</sup> School of Cyber Security, Changchun University, Changchun, China

**Abstract.** Cyberspace search engines have shown great power in finding entities and services in the network, which provides new ideas and methods for detecting Bitcoin nodes. This paper introduces Bitcoin's P2P network and its nodes, including the reachable nodes and the unreachable nodes. Then, the results of detecting reachable nodes by the cyberspace search engines are shown. Next, the authors propose a new approach to find and verify the unreachable nodes by the cyberspace search engines. Finally, this paper illustrates the de-anonymization of some Bitcoin nodes by the cyberspace search engines, which maps some nodes' IP addresses to real Bitcoin entities, such as Zeblockchain (a browser website), Microwallet (a wallet website) and Laurentia Pool (a non-profit pool website).

**Keywords:** Cyberspace search engines · Bitcoin nodes · Reachable nodes · Unreachable nodes · De-anonymization

#### **1 Introduction**

The cyberspace search engine is a new kind of network tool that has attracted increasing attention from network researchers in recent years. It differs from traditional Web search engines such as Google, Baidu, and Bing, which take Web pages as their retrieval objects. A Web search engine crawls and stores web pages across the network, extracts and analyzes their content, and provides keyword retrieval services to the public. A cyberspace search engine finds entities and services in the network by active detection, obtains the target's information through protocol interaction, and presents a comprehensive view. At present, the well-known cyberspace search engines in the industry include Shodan (shodan.io, US), Censys (censys.io, US), BinaryEdge (www.binaryedge.io, EU), Zoomeye (www.zoomeye.org, CN), Fofa (offline now, CN), and so on.

Cyberspace search engines commonly maintain a protocol library containing a variety of protocols, deploy probes all around the world, scan the whole network using these protocols, and continuously look for open ports and services. There are many kinds of equipment in the network, including servers, network equipment, terminals, office facilities, smart home devices, industrial control equipment, webcams, blockchain entities, etc. Reference [1] made a detailed comparative analysis of the well-known cyberspace search engines, comparing their supported protocols, detected equipment, equipment types, detecting capabilities, system structure, probes, etc.

Bitcoin is the most successful electronic cryptocurrency in the world. It was proposed by Nakamoto in 2008 [2] and launched officially in January 2009. Bitcoin has run stably since then and has become an important means of global finance and payment. Cyberspace search engines have strong infrastructures that offer powerful computing, abundant storage, and a large number of detection records, which greatly facilitates the analysis of assets and equipment in cyberspace. The well-known cyberspace search engines, including Shodan, Censys, Zoomeye, and Fofa, provide detecting services for Bitcoin nodes. In this paper, we introduce our work on finding and analyzing Bitcoin nodes by cyberspace search engines.

The contributions of this paper are as follows: 1) Introduce the results of detecting the Bitcoin reachable nodes by the cyberspace search engines. 2) Propose a new approach to find and verify the Bitcoin unreachable nodes by the cyberspace search engines. 3) Illustrate the de-anonymization of some Bitcoin nodes by the cyberspace search engines, which maps some nodes' IP addresses to real Bitcoin entities.

#### **2 Bitcoin Network and Nodes**

The Bitcoin system can be logically divided into the network layer and the transaction layer, as shown in Fig. 1.

**Fig. 1.** Bitcoin's logic structure

The network layer is composed of a large number of Bitcoin nodes. Each node independently broadcasts IP addresses, verifies transactions, packages blocks, and mines. All transactions are stored in blocks, which are linked in chronological order to form a blockchain. All transactions are published to all network participants and stored in all nodes of the network. Previous studies mostly focused on the transaction layer and less on the network layer. The Bitcoin network is a typical P2P network without a central organization or trust center. Nodes build trust through their interactions and form the network by themselves. The Bitcoin network is the foundation of the Bitcoin system.

The Bitcoin network can be divided into the visible part (reachable nodes) and the invisible part (unreachable nodes). The reachable nodes can receive incoming connections and provide public services for the whole network. Generally, they would store a complete copy of the blockchain data. They open fixed ports (often 8333) waiting for connections and can be regarded as "Servers" in the network.

The unreachable nodes do not receive incoming connections from outside and don't provide public services for the whole network. They don't keep a complete copy of the blockchain data and can be regarded as "clients". The unreachable nodes are generally deployed behind a NAT (Network Address Translation) or a firewall and cannot be found by active detecting. Bitcoin nodes have different connectivity, service type, and topology, which have great impacts on the performance of Bitcoin system. Therefore, it is important to detect and study the Bitcoin nodes in depth.

#### **3 Detecting the Reachable Nodes**

Because the reachable nodes open fixed ports waiting for outside connections, we can easily detect them by active detecting. There have been many such studies so far. Joan et al. [3] measured the Bitcoin network from November 2013 to January 2014, connected to the Bitcoin nodes with bitcoin-sniffer (an open-source tool) [4], collected 872000 nodes, and analyzed the nodes' geographical distribution, stability, propagation delay, etc. Christian Decker et al. [5] in 2013 and Giuseppe Pappalardo et al. [6] in 2016 measured the Bitcoin network and observed the propagation delay of blocks and transactions in the network. In the same year, Fadhil et al. measured the Bitcoin network for a week and collected 6430 stable online nodes and 313676 client IP addresses [7]. Sehyun Park et al. carried out a comparative study [8] in 2018, developed the software Bitcoin-Node-Scanner, obtained and verified 1 million nodes' IP addresses within 37 days, and counted the IP types (IPv6/IPv4/Onion), geographical distribution, port numbers, client versions, protocol versions, etc.

All the studies above were carried out by individual researchers. Cyberspace search engines have far more power than single terminals. By scanning the whole network with probes all over the world, they can connect to all reachable nodes, collect their information, and make a time-based cumulative analysis. Fofa showed 56748 Bitcoin reachable nodes detected from December 2016 to September 2021 [9]. Zoomeye showed 63504 Bitcoin reachable nodes [10] and displays their information such as IP, open ports, open services, countries, affiliated enterprises, protocol banners, and geographic longitude and latitude, as shown in Fig. 2 below.

It should be noted that the nodes reported in [9] and [10] are only the reachable nodes detected by the cyberspace search engines. Next, we will propose an approach to find and verify unreachable nodes by the cyberspace search engines.


**Fig. 2.** Zoomeye's page of a reachable node

#### **4 Inferring the Unreachable Nodes**

The unreachable nodes don't open ports for outside connections, so they cannot be found by active detecting. Even if we obtained some addresses of unreachable nodes, we could not definitively verify them by active detecting because of "Churn Nodes", which are caused by network delay.

There have been a few studies on the unreachable nodes. Alex et al. proposed a de-anonymization method for the unreachable nodes [12] in 2014, which set up probes connected to all entry nodes. When an unreachable node broadcast a connection request through the entry nodes, the request would be forwarded to the probes and recorded. The authors believed that there were about 90000 unreachable nodes at that time. Till et al. simulated the Bitcoin network in 2016 and analyzed the broadcasting of transactions in the network [13]. It was estimated that the total number was about 16000 at that time. Liang et al. deployed 102 probes around the world to collect connection requests [14] in 2017, and estimated that there were 155000 unreachable nodes in the whole network. Matthias et al. monitored the "unsolicited" ADDR messages [15] in 2021 and could identify about 31000 active unreachable nodes every day. Federico et al. studied in detail the roles and number of unreachable nodes in the Bitcoin network [16], and proposed an improved transaction broadcasting protocol, which improved the efficiency and security of the Bitcoin network. Alex et al. introduced the Bitcoin network based on Tor [17], proposed a man-in-the-middle attack against Bitcoin, and analyzed the delay caused by unreachable nodes in the attack. Indra et al. proposed a de-anonymization method for the unreachable nodes [18] by collecting all GETDATA messages and matching IP addresses with transactions/blocks. The accuracy of identifying the unreachable nodes is up to 90%.

Next, we propose an approach to infer the unreachable nodes by the cyberspace search engines. First, we set up a fake client to actively connect to the reachable nodes and obtain a large number of Bitcoin nodes' IP addresses through the GETADDR-ADDR interaction mechanism. Then, we input these addresses into the cyberspace search engines and obtain all their feedback records. Finally, by analyzing the open services and detecting time of the target IP, we can infer the unreachable nodes. Here we make two judgments.

Judgment 1: If an IP has a record of opening the Bitcoin service with a new timestamp (within the duration of the detecting cycle), this IP stands for a reachable node.

Judgment 2: If an IP has a record of opening other services (HTTP, SSL, etc.) but not the Bitcoin service, and the timestamp is relatively new (within the duration of the detecting cycle), this IP stands for an unreachable node.

The correctness of Judgment 1 is obvious. The correctness of Judgment 2 is also easy to understand: if we can verify that a real IP obtained from the Bitcoin system is opening other services but is not opening the Bitcoin service, the IP must stand for an unreachable node. The worldwide probes and round-the-clock scanning of the cyberspace search engines ensure that the "unreachable" status of a node is not caused by "network delay". In fact, we have run experiments to test Judgment 2, and the accuracy was up to 95%.
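The two judgments can be sketched as a small classification routine. This is a minimal illustration under stated assumptions, not the authors' implementation: the record schema (`ServiceRecord`), the function name `classify_node`, the service label `"bitcoin"`, and the 30-day default for the detecting cycle are all hypothetical, and the IP being classified is assumed to have been obtained from Bitcoin ADDR messages beforehand.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class ServiceRecord:
    """One search-engine record for an IP (hypothetical schema)."""
    service: str   # e.g. "bitcoin", "ssh", "http"
    seen_on: date  # detection timestamp reported by the engine


def classify_node(records, today, cycle_days=30):
    """Apply Judgment 1 and Judgment 2 to the records of one Bitcoin-derived IP.

    cycle_days is an assumed length for the detecting cycle.
    """
    cutoff = today - timedelta(days=cycle_days)
    recent = {r.service.lower() for r in records if r.seen_on >= cutoff}
    if "bitcoin" in recent:
        return "reachable"    # Judgment 1: recent Bitcoin service record
    if recent:
        return "unreachable"  # Judgment 2: recent records, none for Bitcoin
    return "unknown"          # no recent record, so neither judgment applies


# Records resembling those described for the example node in Fig. 3.
records = [ServiceRecord("ssh", date(2022, 3, 24)),
           ServiceRecord("tcp", date(2022, 3, 24))]
verdict = classify_node(records, today=date(2022, 3, 31))  # 'today' is assumed
print(verdict)  # prints unreachable
```

The `"unknown"` outcome is an addition for completeness: an IP with only stale records satisfies neither judgment and would need a fresh scan.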


**Fig. 3.** Zoomeye's page of an unreachable node

Here is an example. As shown in Fig. 3, the node "167.172.158.149" is a real IP address obtained from the Bitcoin system. We input the IP into Zoomeye and check the feedback records. It can be seen that the IP opened the "SSH" and "TCP" services but did not open the "Bitcoin" service, and the detecting timestamp was "2022-03-24". So the node "167.172.158.149" is an unreachable node.

To make a ground-truth test, we deployed a Bitcoin probe on Vultr. By checking the real neighbors using the "peerinfo" command, we found that the node "167.172.158.149" was its neighbor with the attribute "inbound = true". The node had made an incoming connection to our probe and was a real unreachable node.
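The ground-truth check can be sketched as follows, assuming the "peerinfo" command refers to Bitcoin Core's `getpeerinfo` RPC, whose JSON output lists each peer with an `addr` field ("host:port") and a boolean `inbound` field. The helper name `inbound_peer_ips` and the sample data, including the port numbers and the second address, are illustrative only.

```python
import json


def inbound_peer_ips(getpeerinfo_json: str) -> set:
    """Return the IPs of peers that connected to us (inbound == true)."""
    ips = set()
    for peer in json.loads(getpeerinfo_json):
        if peer.get("inbound"):
            # "addr" is "host:port"; IPv6 hosts are wrapped in brackets.
            host, _, _port = peer["addr"].rpartition(":")
            ips.add(host.strip("[]"))
    return ips


# Illustrative output, e.g. saved from `bitcoin-cli getpeerinfo` on the probe.
sample = """[
  {"addr": "167.172.158.149:51234", "inbound": true},
  {"addr": "203.0.113.5:8333", "inbound": false}
]"""

candidates = {"167.172.158.149"}  # IPs inferred to be unreachable
confirmed = candidates & inbound_peer_ips(sample)
print(confirmed)  # prints {'167.172.158.149'}
```

An inferred unreachable node that appears among the probe's inbound peers initiated the connection itself, which is consistent with it not accepting incoming connections.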

### **5 De-anonymization of the Bitcoin Nodes**

As an encrypted digital currency, Bitcoin protects the privacy and security of users' transactions. However, many researchers are very interested in the de-anonymization of Bitcoin addresses and in tracing the route of transactions, and many studies on this issue have been published so far. The existing methods are mainly based on the clustering of transaction addresses. For example, Butian Huang et al. proposed a clustering algorithm "BPC" [19] based on the nodes' behaviors, which clustered the nodes after behavior similarity measurement. The experiment showed that the accuracy was higher than that of previous algorithms. Annika Baumann et al. analyzed Bitcoin's transaction graph [20], inferred that there was a close relationship between network usage and exchange rate, and de-anonymized the 11 largest entities in the transaction graph. Meng Shen et al. analyzed the transaction propagation mode, proposed a method to obtain the initial transaction by calculating the pattern matching score [21], and established the association between the transaction and the initiating node's IP. The experimental accuracy was up to 81.3%.

In the de-anonymization of Bitcoin nodes, it is important but difficult to find the association between a node's IP and the real network entity (exchanges, browsers, wallets, or pools), because many important entities keep their IP addresses highly confidential for privacy reasons. **The cyberspace search engines provide new ideas for the association between Bitcoin nodes' IP and the real entities**. A cyberspace search engine scans the whole network using various protocols and finds all services supported by a node. For the reachable nodes, all the services such as HTTP, HTTPS, SSL, and Bitcoin will be found together. As some persons or companies may open different services on one IP address, we can get extra information about a Bitcoin node by visiting its HTTP page. In some cases, we can get useful information such as the geographic location, organization information, and services operated by the website. Here we give some examples.

1) A node with IP (147.135.252.43): This IP address corresponds to a Bitcoin browser "Zeblockchain", belonging to Japan Digital Service Company, as shown in Fig. 4.


**Fig. 4.** Zeblockchain (a Bitcoin browser)

2) A node with IP (216.108.227.39): This IP address corresponds to a Bitcoin wallet "Microwallet", operated by a US company, as shown in Fig. 5.

**Fig. 5.** Microwallet (a Bitcoin wallet)

3) A node with IP (51.81.56.49): This IP address corresponds to a Bitcoin pool "Laurentia pool", which is a non-profit mining pool (open source), as shown in Fig. 6.

**Fig. 6.** Laurentia (a Bitcoin pool)

Limitations: This method can only de-anonymize Bitcoin websites that open different services on one IP address. If a large organization has many IP addresses and does not deploy different services on the same IP address, this method is no longer applicable.

## **6 Summary**

This paper introduces the working principle of the cyberspace search engines and discusses their application in detecting Bitcoin nodes. The Bitcoin network is composed of a visible part (reachable nodes) and an invisible part (unreachable nodes), which have different characteristics. The reachable nodes provide public services for the network and are easy to detect, while the unreachable nodes are only clients and are hidden in the network. The number of unreachable nodes is about ten times that of the reachable nodes [14], and they are not easy to detect and analyze.

The authors introduce the results of detecting Bitcoin nodes by the cyberspace search engines, then propose a new approach to verify the Bitcoin unreachable nodes, and finally illustrate the de-anonymization of Bitcoin nodes, which can find the association between a node's IP and the real network entity (an exchange, a browser, a wallet, or a pool). So far, the cyberspace search engines can only detect Bitcoin nodes with IPv4 addresses, and IPv6 addresses are not supported. However, with the fast improvement of the cyberspace search engines, they will play more important roles in detecting and analyzing the Bitcoin network.

**Acknowledgement.** This work is supported by National Natural Science Foundation of China (Grant No. 62106060), and National Key Research and Development Program of China (Grant No. 2020YFB1006105).

## **References**



## **Research on Bitcoin Anti-anonymity Technology Based on Behavior Vectors Mapping and Aligning Model**

Shenwen Lin(✉), Hongliang Mao, Zhen Wu, and Jinglin Yang

National Computer Network Emergency Response Technical Team/Coordination Center of China, Beijing, China lsw@cert.org.cn

**Abstract.** Traditional anti-anonymity technologies for Bitcoin transactions fall into two types. One is network-layer anti-anonymity technology, which locates the initial IP of specific transaction information by inferring the IP propagation path of the transaction. The other is transaction-layer anti-anonymity technology, which analyzes the data of the Bitcoin ledger to build an on-chain behavior portrait of the user to whom a specific wallet address belongs. In this work, we propose a new anti-anonymity technology that constructs transaction behavior vectors and social behavior vectors based on Bitcoin ledger data and off-chain social data respectively, and builds a model for mapping and aligning the two vectors. Experimental tests show that the proposed anti-anonymity technology is more accurate and has better practical effects. Furthermore, the technology is also suitable for the anti-anonymity of other virtual currencies.

**Keywords:** Bitcoin · Virtual currency · Anti-anonymity · Behavior vector

## **1 Introduction**

Bitcoin is a purely peer-to-peer version of electronic cash [1], which allows online payments to be sent directly from one party to another without going through a financial institution. It relies on digital signatures to prove ownership and a public history of transactions to prevent double-spending. Bitcoin does not rely on third-party credit and has strong anonymity, which is mainly reflected in three aspects. First, transaction addresses are anonymous: a Bitcoin transaction address is created by the user independently of any identity information, and no third party is needed to create or use it. Second, transaction behavior is fragmented: the Bitcoin system lets users generate a different address for each transaction, so a user's transaction information can be arbitrarily dispersed across different anonymous addresses. Third, the source of a Bitcoin transaction packet is difficult to find in the network: the Bitcoin communication network uses a P2P protocol with no central node, and transaction information is broadcast across the whole network, so it is difficult to track the origin of transaction information by monitoring a single server. Because of its strong anonymity, Bitcoin is often used in gambling, illegal fund-raising, fraud, pyramid selling, money laundering, and other illegal activities.

Traditional Bitcoin transaction anti-anonymity technology mainly includes two types. One is the network-layer anti-anonymity method, which detects and collects the transaction information broadcast at the Bitcoin network layer, analyzes the propagation path of a specific Bitcoin transaction in the P2P network, infers the IP address of the originating service node of the transaction, and then locates the user IP of the transaction. The other is the transaction-layer anti-anonymity method, which obtains user portrait information for a specific wallet address by analyzing the transaction relationships between different transaction addresses, especially with the help of address labels of exchanges, mining pools, and other institutions. These two types of anti-anonymity technology are not effective because they cannot trace the social identity information of the user to whom the transaction address belongs.

To address the shortcomings of traditional Bitcoin transaction anti-anonymity technology, this paper integrates on-chain and off-chain data and proposes an anti-anonymity technology for Bitcoin transactions based on a behavior vector mapping and aligning model. We build a social behavior vector based on off-chain social data and establish a mapping and aligning model with the transaction behavior vector based on Bitcoin ledger data, which realizes the anti-anonymity of Bitcoin addresses and transactions. Because the social behavior vector contains the real social identity information of users, the proposed anti-anonymity technology has better practical effect than traditional anti-anonymity technology.

## **2 Bitcoin Transaction Overview**

Every transaction in the blockchain has a list of inputs and outputs, each of which includes the addresses used in the transaction and the amount of coins spent in that transaction. The inputs of the current transaction come from the outputs of previous transactions, and the outputs of the current transaction will be used as inputs in other transactions, which forms a transaction chain (see Fig. 1).

**Fig. 1.** Bitcoin transaction chain

There will be either a single input from a larger previous transaction or multiple inputs combining smaller amounts, and at most two outputs: one for the payment, and one returning the change, if any, back to the sender, which will be automatically selected by the Bitcoin client as the input in future transactions.

Bitcoin transactions can be roughly divided into two types. The first type is the mining reward transaction. Each block has one mining reward transaction, which has no input but only outputs. The system transfers the mining reward of the block and the fees of the transactions contained in the block to the outputs. The second type is the ordinary transaction, which includes several inputs and several outputs.

Since multiple input addresses of a transaction correspond to different private keys, and spending each input requires a signature from the corresponding private key, it is generally believed that multiple input addresses of a transaction belong to the same entity. With the help of transaction address clustering, the dispersed transaction behaviors of the same entity in the ledger can therefore be gathered, which makes it easier to grasp the behavior characteristics of the entity.

There are four kinds of transaction address clustering technology [2]. The first is clustering based on multiple input addresses: multiple input addresses of a transaction belong to the same address cluster. The second is clustering based on the change address: the change address of a transaction belongs to the same address cluster as the input addresses, and with the change address as the connecting link, the input addresses of two transactions can be combined into the same address cluster. The third is clustering based on mining reward transactions: multiple output addresses of a mining reward transaction belong to the same address cluster. The fourth is comprehensive clustering, which combines the above three technologies.
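The first clustering technology, based on multiple input addresses, can be sketched with a union-find structure. This is a minimal illustration rather than the method of [2]: the function name `cluster_by_multi_input` and the toy addresses are hypothetical, and only the multi-input rule is implemented (change-address and mining-reward rules are omitted).

```python
def cluster_by_multi_input(transactions):
    """Group addresses that appear together as inputs of the same transaction.

    transactions: iterable of input-address lists, one list per transaction.
    Returns a list of address sets, one per inferred entity.
    """
    parent = {}

    def find(addr):
        parent.setdefault(addr, addr)
        while parent[addr] != addr:
            parent[addr] = parent[parent[addr]]  # path halving
            addr = parent[addr]
        return addr

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for inputs in transactions:
        if not inputs:
            continue  # e.g. mining reward transactions have no inputs
        first = inputs[0]
        find(first)  # register single-input transactions as well
        for addr in inputs[1:]:
            union(first, addr)

    clusters = {}
    for addr in list(parent):
        clusters.setdefault(find(addr), set()).add(addr)
    return list(clusters.values())


# Toy data: A and B co-spend, then B and C co-spend, so A, B, C share a cluster.
toy_inputs = [["A", "B"], ["B", "C"], ["D"]]
clusters = cluster_by_multi_input(toy_inputs)
print(sorted(sorted(c) for c in clusters))  # prints [['A', 'B', 'C'], ['D']]
```

Union-find is chosen here because clusters merge transitively: an address shared by two transactions links all of their input addresses into one cluster.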

#### **3 Transaction Scene Graph Structure**

Bitcoin transaction scenes include mining reward, deposit or withdrawal on an exchange, gambling, blackmail, MLM fraud, etc. Among them, deposit and withdrawal of Bitcoin on an exchange are the most common.

A deposit transaction transfers Bitcoin held in the user's personal wallet address to the deposit wallet address assigned to the user by the exchange. The private key of the deposit wallet address is controlled by the exchange, and different deposit wallet addresses correspond to different users. Deposit transactions include the customer-to-customer (C2C) transaction scene and the business-to-customer (B2C) transaction scene.

The general characteristics of the graph structure of a C2C deposit transaction are a small number of transaction inputs and two outputs, one of which is the user's deposit wallet address, whose cluster label is the name of the exchange (see Fig. 2).

**Fig. 2.** Graph structure of C2C deposit transaction scene

The B2C deposit transaction scene graph has a 1-to-N structure, which is generally characterized by a small number of transaction input addresses and a large number of transaction outputs. The output addresses are the deposit wallet addresses of a large number of different users, and the cluster labels of the output addresses may be the same exchange or different exchanges (see Fig. 3).

**Fig. 3.** Graph structure of B2C deposit transaction scene

A withdrawal transaction transfers Bitcoin hosted on the exchange to the wallet address specified by the user. In order to reduce the transaction fee, an exchange usually collects multiple users' withdrawal orders and transfers Bitcoin to multiple users' wallet addresses in one transaction.

The graph structure of a withdrawal transaction has a 1-to-N structure. The cluster labels of the transaction input addresses are the same exchange, and the transaction output addresses are specified by a large number of different users (see Fig. 4).

Because each transaction requires a fee, in practice deposit and withdrawal are sometimes combined in one transaction: a user withdraws Bitcoin from one exchange and deposits it directly into another exchange.

**Fig. 4.** Graph structure of withdrawal transaction scene
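The graph-structure characteristics of the three scenes can be turned into a rough rule-based classifier. The sketch below is only an illustration under stated assumptions: the function name `classify_scene`, the thresholds `few_inputs=3` and `many_outputs=10`, the majority rule for labeled outputs, and the exchange names are all hypothetical, and a combined withdrawal-and-deposit transaction is reported simply as a withdrawal.

```python
def classify_scene(input_labels, output_labels, few_inputs=3, many_outputs=10):
    """Guess the transaction scene from the cluster labels of its addresses.

    Each list holds one entry per address: the exchange name from address
    clustering, or None when the address has no known label.
    The thresholds are assumptions chosen for illustration.
    """
    input_exchanges = {label for label in input_labels if label}
    inputs_all_labeled = bool(input_labels) and all(input_labels)
    labeled_outputs = [label for label in output_labels if label]

    # Withdrawal: every input belongs to one exchange, fanning out to many users.
    if (inputs_all_labeled and len(input_exchanges) == 1
            and len(output_labels) >= many_outputs):
        return "withdrawal"

    if len(input_labels) <= few_inputs:
        # B2C deposit: few inputs, many outputs that are mostly deposit addresses.
        if (len(output_labels) >= many_outputs
                and 2 * len(labeled_outputs) > len(output_labels)):
            return "b2c_deposit"
        # C2C deposit: two outputs, exactly one of them an exchange deposit address.
        if len(output_labels) == 2 and len(labeled_outputs) == 1:
            return "c2c_deposit"

    return "other"


print(classify_scene([None, None], ["ExchangeA", None]))            # c2c_deposit
print(classify_scene([None], ["ExchangeA"] * 8 + ["ExchangeB"] * 4))  # b2c_deposit
print(classify_scene(["ExchangeA"], [None] * 15))                   # withdrawal
```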

#### **4 Traditional Bitcoin Anti-anonymity Technology**

Traditional Bitcoin anti-anonymity technology mainly includes network layer antianonymity technology and transaction layer anti-anonymity technology.

Network layer anti-anonymity technology [3] refers to collecting transaction packet transmitted by Bitcoin P2P network, analyzing the propagation path of a specific Bitcoin transaction packet in P2P network, and inferring the server IP of the first broadcast node. For example, koshy et al. [4] used special transactions to find the originating node. Most normal transactions will be forwarded once by multiple nodes, while transactions with wrong format will only be forwarded once by the originating node. Therefore, this feature can be used to identify the originating node of special transactions. However, due to the small proportion of special transactions, the effect of this method is limited. In addition, biryukov et al. [5, 6] proposed a transaction traceability mechanism based on neighbor nodes, which can improve the traceability accuracy by taking neighbor nodes as the judgment basis. However, the scheme needs to continuously send packet to all nodes in Bitcoin network, which may cause serious interference to Bitcoin network.

The network layer anti-anonymity technology has a certain probability to speculate the initial service node IP of the transaction. Gao Feng, Mao Hong-liang and others [3] have achieved the anti-anonymity traceability accuracy with a recall rate of 60% and an accuracy rate of 35.3%. The traceability and positioning from the service node IP to the end-user IP needs to be combined with the operator's traffic analysis technology and IP positioning data.

Transaction layer anti-anonymity technology refers to finding the correlation between different Bitcoin addresses by analyzing transaction records in Bitcoin ledger, so as to infer the transaction behavior law and capital flow of the transaction address. Liao et al. [7] analyzed the blackmail process of the blackmail software crypto locker by analyzing the Bitcoin ledger data, found multiple Bitcoin addresses belonging to blackmail organizations, and identified a large number of Bitcoin ransom transactions. Meiklejohn et al. [8] used heuristic cluster analysis technology to identify multiple Bitcoin addresses belonging to the Silk Road website. Guo Wen-sheng et al. [9] studied how to realize the division of Bitcoin entities with different types of characteristics through machine learning of Bitcoin ledger data.

Transaction layer anti-anonymity technology can analyze and infer the characteristics of the on-chain trading behavior of a specific wallet address. Combined with the anti-anonymity label information of exchanges, mining pools and other platform institutions, it can infer the ownership of some wallet addresses, but it is difficult to determine the user's social identity. In practice, investigations of many Bitcoin hacking incidents analyze the transaction data in the Bitcoin ledger, track the exchange into which the Bitcoin is transferred, and coordinate with the exchange to obtain the user information behind the Bitcoin addresses.

In recent years, Bitcoin anti-anonymity technology that integrates on-chain and off-chain data has gradually become a research hotspot. Husam et al. [10] identified Tor network anonymous services and their users by integrating online social network data with Bitcoin ledger data.

## **5 Behavior Vectors Mapping and Aligning Model**

Because Bitcoin transaction addresses and the trading process are anonymous, and Bitcoin ledger data is poorly readable, most centralized institutions or platforms, such as exchanges and mixing services, synchronously record the user identity information and behavior information corresponding to Bitcoin ledger data. This data is called off-chain social data. Although it does not contain Bitcoin addresses, making full use of it allows the transaction behavior in Bitcoin ledger data to be located and de-anonymized.

We define a social behavior vector S with five dimensions: [time, value, scene, name, account]. Time is the time at which the user receives the social data, value is the amount of Bitcoin in the social data, scene is the transaction scene described in the social information, name is the platform name, and account is the user's social account. If only time and value are considered, and the transaction scene and platform name are missing or ignored, the accuracy of anti-anonymity will suffer in some complex cases.

Similarly, we define a transaction behavior vector E with seven dimensions: [time, value, scene, input label, output label, input address, output address]. Time is the transaction time recorded in the Bitcoin ledger, value is the amount of Bitcoin in the transaction output, scene is the transaction scene inferred through graph structure analysis, input label is the clustering label of the transaction input address, output label is the clustering label of the transaction output address (non-change address), input address is the transaction input address, and output address is the transaction output address (non-change address). If transaction behavior vector E and social behavior vector S satisfy the following conditions:


Then the user's social account S.account can be considered to correspond to the Bitcoin transaction address E.output address. A social account is more unique and more socially meaningful than an IP address or a user behavior portrait, and therefore better reflects the user's social identity.
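The two vectors defined above can be sketched as plain records. The paper's matching conditions are not reproduced in this text, so the `aligns` rule below (a time window, equal value, equal scene, and platform name equal to the output cluster label) is purely an illustrative assumption, as are all class, field and parameter names.

```python
from dataclasses import dataclass


@dataclass
class SocialVector:
    """S = [time, value, scene, name, account] (off-chain social data)."""
    time: int      # Unix seconds when the user received the social data
    value: int     # Bitcoin amount in satoshis (unit is an assumption)
    scene: str     # transaction scene described in the social information
    name: str      # platform name
    account: str   # user's social account


@dataclass
class TransactionVector:
    """E = [time, value, scene, input label, output label, input/output address]."""
    time: int            # transaction time recorded in the ledger
    value: int           # output amount in satoshis
    scene: str           # scene inferred from graph structure analysis
    input_label: str     # clustering label of the input address
    output_label: str    # clustering label of the (non-change) output address
    input_address: str
    output_address: str  # non-change output address


def aligns(s: SocialVector, e: TransactionVector, max_gap: int = 3600) -> bool:
    """Hypothetical alignment rule; NOT the conditions given in the paper."""
    return (
        abs(s.time - e.time) <= max_gap  # assumed time window in seconds
        and s.value == e.value           # same amount
        and s.scene == e.scene           # same transaction scene
        and s.name == e.output_label     # platform matches the output cluster
    )
```

When `aligns(s, e)` holds under whatever conditions are actually used, `s.account` is the candidate owner of `e.output_address`.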

## **6 Experiment and Result Analysis**

To verify that the on-chain and off-chain behavior vector mapping and alignment model can realize the anti-anonymity of Bitcoin transactions more accurately, we conducted an experimental test on the deposit transactions of a platform. The experimental process is as follows:


The experimental results are shown in the following table (see Table 1):


**Table 1.** Anti-anonymity experimental results of Bitcoin transaction

Eleven social behavior vectors of social account A are respectively aligned with eleven C2C deposit transaction behavior vectors, and these Bitcoin transaction behavior vectors belong to one Bitcoin address, which is also the deposit address opened by the exchange for user A. Fifteen social behavior vectors of social account B are respectively aligned with fifteen B2C deposit transaction behavior vectors, and these Bitcoin transaction behavior vectors belong to one Bitcoin address, which is also the deposit address opened by the exchange for user B.

## **7 Conclusion**

The anti-anonymity technology for Bitcoin transactions proposed in this paper, based on the behavior vector mapping and aligning model, realizes the fusion analysis of on-chain and off-chain data. Compared with traditional anti-anonymity technology, it has a stronger practical effect. The proposed technology is also applicable to the anti-anonymity of other virtual currencies, such as Ethereum and Tether (USDT).

## **References**



**Information Security**

## Improving Deepfake Video Detection with Comprehensive Self-consistency Learning

Heng Bao<sup>1</sup>, Lirui Deng<sup>2</sup>, Jiazhi Guan<sup>2</sup>, Liang Zhang<sup>3(B)</sup>, and Xunxun Chen<sup>3</sup>

<sup>1</sup> School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China baoheng@iie.ac.cn <sup>2</sup> Department of Computer Science and Technology, Tsinghua University, Beijing, China {dlr18,guanjz20}@mails.tsinghua.edu.cn <sup>3</sup> CNCERT/CC, Beijing, China

zl@isc.org.cn

Abstract. Deepfake videos created by generative models have recently become a serious societal problem because they are hardly distinguishable by human eyes, which has attracted a lot of academic attention. Previous research has addressed this problem with various schemes that extract the visual artifacts of non-pristine frames or the discrepancy between real and fake videos; among them, patch-based approaches have been shown to be promising but are mostly used for frame-level prediction. In this paper, we propose a method that leverages comprehensive consistency learning in both spatial and temporal relations with patch-based feature extraction. Extensive experiments on multiple datasets demonstrate the effectiveness and robustness of our approach, which combines all consistency cues.

Keywords: Deepfake detection · Digital forensics · Video classification

## 1 Introduction

"Seeing is believing" hardly holds true nowadays, given the rapid development of computer science and information technology, and especially the massive emergence of artificial intelligence applications. Although image and video forgery has been a topic since the beginning of photography, open-source applications represented by Deepfakes [7] and others have brought this problem to a whole new level. Face manipulation in visual content has become an effortless task with the help of deep-learning-based generative models such as variational autoencoders (VAEs) [16] and generative adversarial networks (GANs) [12], so that anyone can produce fake videos with a false identity or manipulated expressions and movements (known as "deepfake videos") in several minutes without expert knowledge. Some of these tools have already been used to create malicious videos that violate citizens' privacy or attack public figures, such as fake pornography and blackmail material, which may easily lead to catastrophic results given today's mass media and social networks. The detection of deepfake videos has become a hot and urgent issue.

Various methods have been proposed in recent years by the academic community to recognize this particular type of forged images and videos. Since the majority of forgery methods share a common image stitching pipeline including face detection, warping and blending, early research addresses this task by detecting suspicious artifacts left by the stitching process at the frame level, such as face warping artifacts [20] and blending boundaries [18]. To yield a result for a whole video clip from frame-level predictions, these methods usually cascade the frame-level model with a merge module, or sometimes simply use a weighted average. But ignoring the dependency among consecutive frames tends to produce a sub-optimal combination. Frequency-based approaches [9,25] have also been introduced to fully utilize the temporal relation. Self-consistency is another crucial concept in image forensics [14,35], where patch-based and feature-map-based methods have both shown promising results. Although detection accuracy on datasets has improved significantly with the different approaches presented, forgery techniques are also evolving to reduce these artifacts, which forms an ever-changing arms race.

In this work, we aim to catch both the intra-frame discrepancy introduced during image stitching and the inherent flaws of inter-frame misalignment, for more effective and robust deepfake detection. Our contributions can be summarized as follows:


## 2 Related Work

Frame-Level Detection Methods. The emergence of deepfake videos on the internet has raised considerable concern in both industry and government over the past few years. Early research [1,26] tends to address this problem with a simple classification model built on a well-designed backbone, and some work [20] simulates the generation process of deep forgery to better capture the artifacts of the fake video pipeline. Beyond academia, Facebook, concerned about the threat deepfakes pose to social media, held a real-world deepfake detection competition with a one-million-dollar bounty to encourage better detection methods. Plenty of classification models were proposed and achieved results beyond expectation. The winner of this competition [27] adopts the state-of-the-art image classification backbone EfficientNet [28] as the main component of his model, and a novel data augmentation strategy contributes substantially to his final ranking. The runners-up later completed their method [34], which treats deepfake detection as a fine-grained classification task and explicitly refines attention maps with a regional independence loss. By merging in texture features extracted at the front-end layers, their model achieves state-of-the-art results on several datasets. An attention map prediction scheme is also considered in [6]. In that work, the forged area is predicted in both learning-based and dictionary-learning ways, and the binary classification and attention map regression tasks are trained with a multi-task loss function.

The above-mentioned detection methods all concentrate on the RGB domain of deepfakes, while some other works explore fake clues inherent in the frequency domain of deepfake images. The discrete cosine transform (DCT) [2] is adopted in [25], where the frequency layout of an image is handled in both global and local views. In combination with a learnable frequency-aware component, non-aligned information can be reliably detected at the frame level. Frank *et al.* [9] also leverage DCT for detection, and analyze which part makes synthetic deepfake images detectable. Their results suggest that up-sampling blocks leave a unique fingerprint, but those frequency clues are not robust to perturbation.

Besides, some other approaches inspect artifacts from a side view. FakeSpotter [31] does not directly use the features extracted by the backbone network, but regards neuron behaviors as the basis of discrimination, aiming at more robust detection. To take the time factor into account in video-level authentication, spotting biometric clues such as eye blinking [19,32] and head pose sequences [32] is the first and most natural idea. DeepRhythm [24] exposes deepfakes by monitoring the heartbeat rhythms associated with minuscule periodic changes of skin color caused by blood pumping through the face.

Video-level Detection Methods. Most video-level methods regard a video as a set of independent frames, and simply take the average confidence score of the frames as the basis for judging the authenticity of the video. Those methods actually follow the frame-level perspective and neglect the interconnection between successive frames.

Güera *et al.* [13] combine the advantages of convolutional neural networks (CNNs) and recurrent neural networks (RNNs) in a natural way, using a CNN for per-frame feature extraction and an RNN to explore temporal inconsistencies. But inter-frame inconsistency modeling is not well considered in their approach. Tariq *et al.* [29] consider the artifacts introduced by non-consecutive frames and develop a convolutional LSTM-based residual network for temporal feature learning. Basic human features such as eye blinking and head pose movement are used in [19] and [32] to distinguish real from fake. In [23], the authors leverage the relationship between visual and audio patterns extracted from the same video to determine whether it has been modified.

Video-level detection gathers more information and, in general, should deliver better performance. Surprisingly, however, video-level evaluation results in terms of ACC and AUC are often lower than those at the frame level. Zi *et al.* [37] propose two models built by stacking ADD blocks. In their experiments, ADDNet-3D reports much lower detection accuracy than ADDNet-2D, with a gap of about 10 percentage points on a challenging dataset. Ganiyusufoglu *et al.* [11] adopt state-of-the-art structures used in action recognition and evaluate their performance in deepfake detection.

Benchmark Datasets. Several comprehensive deepfake datasets have been published in recent years, which has greatly promoted the performance of deepfake detection methods. One of the most popular datasets is FaceForensics++ (FF++) [26]. It contains two graphics-based approaches, namely Face2Face [30] and FaceSwap [8], and two learning-based methods, Deepfakes and NeuralTextures [15]. Both face swapping and face reenactment are covered. Celeb-DF [21] is one of the most challenging datasets for deepfake detection, with clear identity labels and pixel-level annotation. Its authors carefully address several problems in fake video generation, including color mismatch, inaccurate face masks and temporal flickering. With more attention drawn to this research topic, some new and better-annotated datasets with more specific purposes have been proposed recently, such as WildDeepfake [37] for real-world challenges and OpenForensics [17] for multi-face scenarios.

## 3 Approach

Given an input video with certain human activity, our goal is to detect whether the identity has been replaced or the facial expression of the character has been manipulated. We propose the CSCL network, shown in Fig. 1, to improve the robustness and generalization ability of deepfake video detectors through self-consistency, by measuring the comprehensive spatial and temporal discrepancy within the image stream.

To be more precise, our method mainly exploits a comprehensive consistency that targets the substantial drawbacks of the deepfake video production pipeline:


Fig. 1. Framework of CSCL network

#### 3.1 Problem Formulation

We first formulate the video-level deepfake detection task. The dataset $D = \{X\_i, L\_i\}\_{i=1}^{N}$ consists of $N$ pairs of a video clip and its label, where $L\_i \in \{0, 1\}$ indicates fake or real. A video clip can be seen as multiple consecutive frames $X\_v = \{x\_t\}\_{t=1}^{T\_v}$, where $x\_t \in \mathbb{R}^{C \times H \times W}$ is the $t$-th frame of video $X\_v$ and $T\_v$ is the total number of frames. All frames in a video $X\_v$ are deemed manipulated if $X\_v$ is labeled fake, and vice versa. The goal of deepfake detection is to learn a model $\Phi$ that takes all consecutive frames of one video and gives a clear judgment of its authenticity, formulated as $\Phi(X\_v) \in \{\text{fake}, \text{real}\}$.

#### 3.2 Design of Model

Spatial Consistency of Contexts vs. Faces. Computing similarity scores among image patches to expose inconsistency has already proved effective in image forensics research [33,35,36]. Without loss of generality, we first obtain the feature $f\_t$ of image $x\_t$ from the backbone model $\mathbf{G}$, of size $C' \times H' \times W'$, where $H'$ and $W'$ are the numbers of patches along columns and rows.

$$f\_t = \mathbf{G}(x\_t)\_{t=1}^{T\_v} \in \mathbb{R}^{C' \times H' \times W'} \tag{1}$$

For each frame $x\_t$ we follow [35] to calculate the 4D consistency map $\hat{SM}$ with:

$$\begin{split} \hat{SM}\_{h,w,h',w'} &= d(f\_t^{h,w}, f\_t^{h',w'})\\ &= 1 - \cos\left(f\_t^{h,w}, f\_t^{h',w'}\right) \end{split} \tag{2}$$

Since each entry of a frame's mask has only two possible states, manipulated or not, we denote a patch $P$ located in the face area as $P\_f$ and a patch in the context as $P\_c$, with $\psi(P\_f) = 1$ and $\psi(P\_c) = 0$. The ground truth is:

$$SM\_{P\_i, P\_j} = \psi(P\_i) \oplus \psi(P\_j) \tag{3}$$

and the spatial consistency loss:

$$L\_{SC} = |SM - \hat{SM}|\tag{4}$$
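Equations (2) to (4) can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the face/context mask is assumed to be given per patch, and because Eq. (4) does not state how the absolute difference is reduced, a mean over all patch pairs is assumed here.

```python
import numpy as np


def cosine_distance_map(f):
    """Eq. (2): f has shape (C, H, W); returns the (H, W, H, W) map of 1 - cos."""
    c, h, w = f.shape
    flat = f.reshape(c, h * w).T                       # one row per patch: (H*W, C)
    norms = np.linalg.norm(flat, axis=1, keepdims=True)
    unit = flat / np.maximum(norms, 1e-12)             # guard against zero vectors
    return (1.0 - unit @ unit.T).reshape(h, w, h, w)


def ground_truth_map(face_mask):
    """Eq. (3): face_mask is an (H, W) 0/1 array holding psi(P) for each patch."""
    h, w = face_mask.shape
    m = face_mask.astype(bool).reshape(-1)
    xor = np.logical_xor(m[:, None], m[None, :])       # 1 where one patch is face, one is context
    return xor.astype(np.float64).reshape(h, w, h, w)


def spatial_consistency_loss(f, face_mask):
    """Eq. (4) with an assumed mean reduction over all patch pairs."""
    return float(np.abs(ground_truth_map(face_mask) - cosine_distance_map(f)).mean())
```

The loss is zero when face patches are orthogonal to context patches and patches within each region are perfectly aligned, which is the pattern the ground-truth map encodes.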

Temporal Consistency of Consecutive Frames. To catch inconsistency between successive frames, we further extend our attention to temporal consistency learning. Having obtained the patch-based feature $f\_t$ from $x\_t$, we consider the relation between $f\_t$ and $f\_{t-1}$. For each patch $P\_t^{h,w}$ at timestamp $t$, we have a 2D consistency map:

$$\begin{split} T\hat{M}\_{P\_t^{h,w}P\_{t-1}^{h,w}} &= d(f\_t^{h,w}, f\_{t-1}^{h,w}) \\ &= 1 - \cos\left(f\_t^{h,w}, f\_{t-1}^{h,w}\right) \end{split} \tag{5}$$

Considering the momentum between $t$ and $t-1$, we calculate the temporal consistency loss:

$$L\_{TC} = \sum\_{t} |T\hat{M}\_t - \frac{\sum\_{h,w} T\hat{M}\_{t,h,w}}{HW}|\tag{6}$$
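A minimal NumPy sketch of Eqs. (5) and (6) is given below. The function names are hypothetical, and since Eq. (6) does not spell out the norm, the absolute deviations are assumed to be summed over all patches of a frame.

```python
import numpy as np


def temporal_map(f_t, f_prev):
    """Eq. (5): 1 - cos between same-position patches of two (C, H, W) feature maps."""
    num = (f_t * f_prev).sum(axis=0)
    den = np.linalg.norm(f_t, axis=0) * np.linalg.norm(f_prev, axis=0)
    return 1.0 - num / np.maximum(den, 1e-12)          # shape (H, W)


def temporal_consistency_loss(features):
    """Eq. (6): features has shape (T, C, H, W).

    For each pair of consecutive frames, every patch's temporal distance is
    compared with the frame-wide mean distance; deviations are summed
    (the sum over patches is an assumption).
    """
    loss = 0.0
    for t in range(1, features.shape[0]):
        tm = temporal_map(features[t], features[t - 1])
        loss += float(np.abs(tm - tm.mean()).sum())
    return loss
```

Under this reading, a clip in which all patches change by the same amount between frames gives zero loss, while a clip where only some patches change is penalized.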

Coordinating Temporal and Spatial Consistency. It is not hard to imagine that, in both pristine and deepfake videos, people's faces are moving most of the time, either talking or making expressions. Otherwise there would be no need to forge a static video that conveys no more information than a photo. Measuring only the discrepancy in distance between consecutive faces and contexts would yield many false alarms. Therefore we propose a comprehensive consistency coordination loss for adaptive learning, which monitors the relation between temporal and spatial consistency. The final loss function is:

$$L = L\_{real/fake} + \lambda L\_{SC} + \beta L\_{TC} + (1 - \beta)L\_{CCC} \tag{7}$$

## 4 Experiment Results

Implementation Details. We use a modified Xception [4] as the backbone, with parameters initialized from Xception pre-trained on ImageNet. We train our model using the Adam optimizer with an initial learning rate of 1e-4 and a weight decay of 1e-7. The training epoch size is set to 2000 and the batch size to 32. If the validation loss does not improve for 5 epochs, the learning rate is decayed by a factor of 0.3, so that the model can converge after several learning rate decays.
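The learning rate schedule described above can be sketched in plain Python. This is a minimal illustration, not the authors' training code: the class name `PlateauDecay` is hypothetical, and "not getting better in 5 epochs" is read as five consecutive epochs without a new best validation loss.

```python
class PlateauDecay:
    """Decay the learning rate when the validation loss stops improving."""

    def __init__(self, lr=1e-4, factor=0.3, patience=5):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Call once per epoch; returns the learning rate for the next epoch."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor   # decay by the stated factor of 0.3
                self.bad_epochs = 0      # start counting again after a decay
        return self.lr
```

Deep learning frameworks ship comparable plateau-based schedulers, whose exact patience semantics may differ slightly from this sketch.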

#### 4.1 In-Dataset Evaluation on FF++

FF++ is one of the most popular datasets for evaluating deepfake detection methods. It contains 1000 real videos collected from the internet and 4000 fake videos generated by four kinds of deepfake techniques. Moreover, FF++ provides three different video qualities; we use the high-quality (c23) and low-quality (c40) versions in this section. The raw-quality videos are not considered because they are not very common on the internet. We use the same split as [26]: both real and fake videos are split into training, validation and test sets at a ratio of 72:14:14. Because the number of real videos is much smaller than the number of fake videos, we oversample real videos to balance the classes during training. At the test stage, one video in FF++ may contain several clips; we extract as many clips as we can from each video (interval set to 16, no overlap) and take the average score of all clips as the confidence score of the video. The test results are listed in Table 1.
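The test-time aggregation described above can be sketched as follows. The function names are hypothetical, and "interval is set to 16, no overlap" is interpreted here as 16-frame clips taken with a stride of 16, which is an assumption.

```python
def split_into_clips(num_frames, clip_len=16):
    """Return (start, end) frame index pairs of non-overlapping clips."""
    return [
        (start, start + clip_len)
        for start in range(0, num_frames - clip_len + 1, clip_len)
    ]


def video_score(clip_scores):
    """Average the per-clip confidence scores into one video-level score."""
    if not clip_scores:
        raise ValueError("video is too short to yield a single clip")
    return sum(clip_scores) / len(clip_scores)
```

Trailing frames that do not fill a whole clip are dropped in this sketch; the paper does not say how such remainders are handled.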

Table 1. In-dataset Performance (ACC %) on four types of deepfake in FF++. DF: DeepFakes, F2F: Face2Face, FS: FaceSwap, NT: NeuralTextures. The best result is shown in bold text, and the second-best is underlined.


#### 4.2 Cross-Dataset Evaluation on Celeb-DF

The poor generalization ability of deepfake detection is still a thorny problem: even state-of-the-art methods suffer drastic performance degradation when tested on deepfakes generated by unseen techniques. Our method formulates deepfake detection from a discrepancy-discovery perspective and achieves the best cross-dataset performance, as listed in Table 2. The tested model is trained on low-quality FF++, following the setting of [22] for fair comparison. Many methods report around 100% AUC on the training dataset but fail to transfer to a different dataset. Our model achieves the best cross-dataset test performance while keeping the best test result on the training dataset.


Table 2. Cross-dataset Performance (AUC %) on Celeb-DF. The best result is shown in bold text, and the second-best is underlined.

#### 4.3 Ablation Study

This section analyzes the effectiveness of our proposed CSCL module. CSCL consists of three parts: spatial consistency, temporal consistency and comprehensive consistency coordination. To validate whether each part of the comprehensive consistency improves generalizability, we conduct an ablation study comparing the following variants. (1) Xception [4]: the baseline approach without any consistency cue. (2) Xception w/ sc: we follow the setting of [35], using only the spatial patch consistency loss. (3) Xception w/ tc: we add only the temporal consistency loss to the baseline. (4) Ours: the full CSCL model with both spatial and temporal consistency plus the consistency coordination loss. Results are listed in Table 3.


Table 3. Ablation Performance in FF++. The best result is shown in bold text.

## 5 Summary

In this paper, we address the problem of deepfake detection from the viewpoint of comprehensive self-consistency learning. More specifically, we propose a CSCL model with spatial-temporal consistency learning to explicitly formulate the inherent flaws of intra- and inter-frame misalignment in deepfakes. To achieve more effective and robust deepfake detection, we also propose C3Loss, the comprehensive consistency coordination loss, which tackles the inevitable artifacts of the deepfake production pipeline. Extensive experiments demonstrate the superior performance of our method in deepfake detection, especially in more realistic tests such as the cross-dataset and low-quality settings.

## References



## **Research on Information Security Asset Value Assessment Methodology**

Xueqin Yang<sup>1</sup>, Peng Yang<sup>2</sup>, and Honggang Lin<sup>1(B)</sup>

<sup>1</sup> School of Cyberspace Security, Chengdu University of Information Technology, Sichuan 610225, Chengdu, China 1844108560@qq.com

<sup>2</sup> National Computer Network Emergency Technology Processing Coordination Center, Beijing 10009, China

**Abstract.** Traditional asset value assessment methods are subjective and cannot distinguish the value of different assets carrying the same type of business. In response, a comprehensive assessment method that takes into account the importance of the business carried by the assets is proposed. In this paper, four factors affecting business importance are selected as evaluation indicators, and the CRITIC objective weighting method is used to obtain the weight of each evaluation indicator and calculate the importance of the business carried by the asset. The asset value is then calculated using the multiplication method with the assigned values of the asset in terms of confidentiality (C), integrity (I) and availability (A). The results of the case validation show that the asset value calculated by combining business importance is consistent with the actual value of the asset, and the comparison with the traditional method shows that the proposed method is more objective and reasonable in assessing the asset value.

**Keywords:** Asset value assessment · Business importance · CRITIC objective empowerment method · Multiplication method

## **1 Introduction**

As government departments, financial institutions, enterprises and public institutions, and commercial organizations increasingly rely on information systems, information security issues have received widespread attention. Using risk assessment to analyze the security risks in information systems and propose targeted corrective measures is an effective means of solving information security problems. Identifying assets and assessing their value is the primary task of information security risk assessment, and asset value is currently calculated mainly on the basis of confidentiality (C), integrity (I) and availability (A) [1]. Tang [2] proposes an objective weighting method that assigns weights to evaluation indicators to make the calculated importance values more objective, but this method only considers the dispersion of the data and not the correlation between indicators. The literature [3] proposes that, since quantifying the security level of assets in terms of confidentiality, integrity, and availability is prone to subjectivity, the importance of the business carried by the assets should be considered as a factor to reduce the subjective influence, with weighting and other methods then used to synthesize the asset value; however, it does not give a specific implementation algorithm. Xiang Hong [4] identifies assets based on business and uses the AHP method to assign values to assets. The AHP method is effective in reducing the drawbacks of complete subjectivity, but it requires multiple experienced experts to give a reliable judgment matrix and also involves calculations such as consistency tests, which increases computational complexity. Zhou Jing-Xian [5] uses the rough set approach to calculate the value of each asset by assigning weights to the four factors that determine asset value, namely C, I, A and the importance of the business undertaken. However, since the importance of the business is judged by humans, different decision makers may assign different values to the same asset. To solve the above problems, it is necessary to quantify the importance of the business carried by an asset and to combine it into the calculation of asset value.

In this paper, we propose a method that evaluates asset value using four factors: the importance of the business carried by the asset together with confidentiality, integrity and availability. The method selects evaluation indexes that reflect the importance of the business carried by the asset and calculates the business importance value with the CRITIC weighting method. It then applies the multiplication method to the four influencing factors to obtain the asset value. This reduces the subjective influence of the traditional method, which considers CIA only, and distinguishes the value of assets that belong to different organizations but carry the same type of business, as well as assets that carry different types of business within the same organization.

## **2 Asset Valuation Method**

The value of an asset is determined by the level of assignment of the three security attributes of confidentiality, integrity and availability, as well as the importance of the business undertaken by the asset. The realization of a complete business requires the involvement of multiple assets, and the more significant the business is, the more important its associated assets are. Based on this, this paper proposes an intuitive asset valuation model to analyze the value of assets.

#### **2.1 Asset Valuation Model**

The asset value assessment model proposed in this paper is depicted in Fig. 1. For information assets, value is mainly reflected in four indicators: confidentiality, integrity, availability and business importance; the higher the requirements for these indicators, the higher the asset value. The three security attributes of confidentiality, integrity and availability are classified into five levels: very high, high, medium, low and very low. The higher the level, the higher the requirement of the asset for that security attribute. The importance of the business carried by the assets is mainly reflected in the business itself and in the impact on the organization of the assets attached to the business, so business importance can be evaluated from four aspects: organization ranking, organization level, scope of impact and business category.

**Fig. 1.** Asset valuation model

#### **2.2 Business Importance Indicators System Construction**

From the evaluation model, it is clear that asset value is affected by the importance of the business. To evaluate asset value more accurately, indicators that rely entirely on objective data must be selected to quantify business importance. Two indicators, business category and influence range, are derived from the business itself: the more core the business category, the higher the importance of the business; and the wider the range affected when the business cannot operate normally, the higher the importance of the business. However, these two indicators do not fully reflect business importance, and they cannot distinguish between different businesses that belong to the same business category and have the same scope of influence. This paper therefore starts from the assets on which the business depends and proposes two further indicators, organization ranking and organization level, which reflect the importance of the business by measuring the importance of the assets it runs on. Organization ranking refers to the ranking, within its industry, of the organization to which the asset belongs: the higher the ranking, the stronger the organization is in the industry and the higher the importance of its subordinate assets. Organization level refers to the category into which the organization to which the asset belongs is classified in that industry: the higher the category level, the more important the organization and the more important its subordinate assets. (If the value of an indicator cannot be determined for an assessment object, the same default value may be assigned to that indicator; the final sum of all indicator weights must still be 1.)

With regard to the organization ranking, organization level, and influence range indicators, it is necessary to analyze the reports issued by the organization to which the actual assets belong to obtain their values, while the business category indicator can be determined by initially knowing the classification of the business according to the literature [6] and then combining it with the business carried on the actual assets to identify the specific category. The literature roughly classifies businesses into five major categories according to their characteristics (since specific business systems are not mentioned, the information in the table is not complete, and the classification of businesses in the actual assessment work should be based on the actual situation), as shown in Table 1. Because the value of an asset takes into account the importance of the business it carries, it is likely that the exact same information asset will have a different value to the organization being evaluated because of the different businesses it carries.


**Table 1.** Example of business classification.

#### **2.3 Business Importance Calculation**

Because each indicator contributes differently to business importance, this paper uses the CRITIC method [7] to assign weights to the indicators. As described in the previous subsection, business importance is determined by four indicators: organization ranking, organization level, influence range, and business category. The CRITIC method computes the weight of each indicator from two properties: the conflict between that indicator and the others, and the variation in the values that the evaluation objects take under the indicator [8]. For example, the greater the conflict between the organization ranking indicator and the other indicators, and the greater the variation in the data under that indicator, the more information the indicator contains; it therefore receives a greater weight and contributes more to the importance of the business. The weights of the other indicators are obtained from the CRITIC method in the same way [9]. The calculation steps are as follows. (If only one indicator is selected to assess the importance of the business, only the normalization step of this algorithm is needed.)

1) To eliminate the influence of different magnitudes on the evaluation results, formula (1) is used to reverse-process indicators for which a smaller value is better, and formula (2) is used to forward-process indicators for which a larger value is better [10]:

$$x_{ij}^{*} = \frac{\max(x_{j}) - x_{ij}}{\max(x_{j}) - \min(x_{j})} \tag{1}$$

$$x_{ij}^{*} = \frac{x_{ij} - \min(x_{j})}{\max(x_{j}) - \min(x_{j})} \tag{2}$$

In the formulas, max(*xj*) is the maximum value of indicator *j*, min(*xj*) is the minimum value of indicator *j*, *xij* is the value of evaluation object *i* under indicator *j*, and *xij*\* is the processed value, whose range is [0,1].

2) After the data are processed, the standard deviation of each indicator is calculated using formula (3) as a measure of the difference in the values taken by the assessment objects under that indicator:

$$
\sigma_{j} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( x_{ij}^{*} - \overline{x_{j}} \right)^{2}} \tag{3}
$$

Here, $\sigma_{j}$ is the standard deviation of indicator *j*, and $\overline{x_{j}}$ is the average of the *n* assessment objects under indicator *j*.

3) Formulas (4) and (5) are used to calculate the magnitude of conflict between indicators:

$$r_{ij} = \frac{\sum_{k=1}^{n} (x_{ik} - \overline{x_{i}}) (x_{jk} - \overline{x_{j}})}{\sqrt{\sum_{k=1}^{n} (x_{ik} - \overline{x_{i}})^{2} \sum_{k=1}^{n} (x_{jk} - \overline{x_{j}})^{2}}} \tag{4}$$

$$A_{j} = \sum_{i=1}^{m} \left( 1 - r_{ij} \right) \tag{5}$$

In the formula, *rij* denotes the correlation coefficient between indicator *i* and indicator *j*, *xik* and *xjk* denote all data under indicator *i* and indicator *j*, respectively, and *Aj* denotes the conflict between indicator *j* and other indicators.

4) From formula (6), the weights of the indicators, *w*1, *w*2, *w*3 and *w*4, are obtained:

$$w_{j} = \frac{\sigma_{j} A_{j}}{\sum_{j=1}^{m} \sigma_{j} A_{j}} \tag{6}$$

5) From the weight of each indicator and the processed value of each business object under each indicator, the business importance α of object *i* is obtained:

$$\alpha_{i} = \sum_{j=1}^{m} w_{j} x_{ij}^{*} \tag{7}$$
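The five steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names and the five-asset input data are hypothetical, and the paper's own values come from its Table 4, which is not reproduced here.

```python
import math

def normalize(column, larger_is_better=True):
    """Min-max normalize one indicator column to [0, 1] (formulas (1) and (2))."""
    lo, hi = min(column), max(column)
    if hi == lo:
        raise ValueError("indicator has no spread; min-max normalization is undefined")
    if larger_is_better:
        return [(x - lo) / (hi - lo) for x in column]
    return [(hi - x) / (hi - lo) for x in column]

def std_dev(column):
    """Population standard deviation of one indicator (formula (3))."""
    mean = sum(column) / len(column)
    return math.sqrt(sum((x - mean) ** 2 for x in column) / len(column))

def pearson(a, b):
    """Correlation coefficient between two indicator columns (formula (4))."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))

def critic_weights(columns):
    """CRITIC weights from normalized indicator columns (formulas (5) and (6))."""
    sigmas = [std_dev(c) for c in columns]
    conflicts = [sum(1 - pearson(ci, cj) for ci in columns) for cj in columns]
    info = [s * a for s, a in zip(sigmas, conflicts)]
    total = sum(info)
    return [v / total for v in info]

def business_importance(columns, weights):
    """Weighted sum of normalized values for each assessment object (formula (7))."""
    n = len(columns[0])
    return [sum(w * col[i] for w, col in zip(weights, columns)) for i in range(n)]

# Hypothetical raw data for five assets (not taken from the paper's tables).
ranking = [1, 2, 10, 50, 100]    # smaller is better
level = [5, 5, 4, 3, 2]          # larger is better
scope = [5, 4, 4, 3, 2]          # larger is better
category = [5, 2, 4, 5, 3]       # larger is better

columns = [
    normalize(ranking, larger_is_better=False),
    normalize(level),
    normalize(scope),
    normalize(category),
]
weights = critic_weights(columns)
alpha = business_importance(columns, weights)
```

Because every normalized value lies in [0, 1] and the weights sum to 1, each resulting importance also lies in [0, 1]; an asset that is best under all four indicators, like the first hypothetical asset here, receives an importance of 1.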

#### **2.4 Asset Value Calculation**

After obtaining the asset's assigned levels of confidentiality, integrity and availability and the importance of the business it carries, we use the multiplication method to calculate the asset's value. The specific calculation steps are as follows:

Let the value of the *j*-th asset be *dj*, its values for confidentiality, integrity, and availability be *c*, *i*, and *a*, and its business importance be α. The asset value is calculated as follows:

$$d_{j} = \sqrt[3]{\alpha \cdot c \cdot i \cdot a} \tag{8}$$
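Formula (8) is small enough to express directly. The sketch below assumes that α lies in [0, 1] and that *c*, *i*, *a* are levels from 1 to 5; the function name and the two input cases are hypothetical and chosen only to show how business importance separates assets with identical security levels.

```python
def asset_value(alpha, c, i, a):
    """Formula (8): cube root of business importance times the C, I and A levels."""
    return (alpha * c * i * a) ** (1.0 / 3.0)

# Hypothetical inputs: identical C/I/A levels, different business importance.
high = asset_value(0.9, 5, 5, 5)
low = asset_value(0.6, 5, 5, 5)
print(f"alpha=0.9 -> {high:.3f}, alpha=0.6 -> {low:.3f}")
```

With α = 1 the result equals the cube root of the product of the three levels, so the business importance can only keep or lower that value, never raise it.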

#### **3 Evaluation Examples and Results Analysis**

This paper uses the bank assets which are obtained from the extranet as the valuation object, and applies the above calculation method to calculate the value of each asset, and compares and analyzes the results obtained with the traditional method.

#### **3.1 Instance Data**

To determine the values of the indicators selected for calculating business importance, the analysis here is performed on actual assets. From the literature [11], we know the ranking of the organizations to which the bank extranet assets belong, and from the literature [12], we can classify the organizations that own bank information system assets into five major levels: state-owned large banks, state-owned commercial banks, regional urban commercial banks, rural banks in each county and district, and private banks. For a bank information system, the impact range of the business it carries can be reflected by the impact range of the organization to which the asset belongs, so the impact range is divided into five categories: global, national, province/municipality/autonomous region, city, and county, with values assigned in descending order of range. Based on the actual evaluation objects and according to the literature [13], the services carried by the bank's extranet assets are classified into five major categories: transaction-type services, customer exchange services, online investment services, information services and other services.

Transaction services include money transfers and credit operations performed by individuals or companies; they are the highest level of banking service system and necessarily have access to the bank's internal network. Customer exchange services cover the communication of information, documents or files between the customer and the bank [14, 15]; this kind of service is a higher-level service system and has access to the bank's internal network. Online investment services [16] allow customers to purchase the various financial products launched by the bank. Information services publish information that can be accessed by everyone; this is the most basic type of business and has no access to the bank's internal network. Other services include various forms of special value-added services, such as life-type payment services.

The business categories are assigned values according to the degree of connection to the bank's internal network and the level of the service system; see Table 2. Based on the above analysis, the 18 acquired bank extranet assets are organized as shown in Table 3. (The 18 selected assets S1-S18 are ICBC about ICBC system, China Construction Bank deposit and loan and bank card system, China Construction Bank investment and finance system, Agricultural Bank of China personal service system, Agricultural Bank of China talent recruitment system, Bank of China personal financial system, Bank of China electronic banking system, Bank of JIANGSU personal business system, Chongqing Rural Commercial Bank savings business system, Chengdu Rural Commercial Bank's personal financial service system, Bank of Chongqing's personal business system, TRC Bank's savings business system, Bank of Dongguan's personal business system, NRC Bank's savings business system, Bank of Tangshan's email system, XIAOSHAN Rural Commercial Bank Savings Business System, ZJB's savings business system.)


**Table 2.** Assignment of business importance indicators.




**Table 3.** (*continued*)

#### **3.2 Business Importance Calculation**

We use organization ranking (Index 1), organization level (Index 2), scope of influence (Index 3), and business type (Index 4) as four indicators to assess the importance of the business carried on the bank's assets. The four indicators are quantified in Table 2, and the results are shown in Table 4.


**Table 4.** Quantification of business importance indicators.

We use formula (1) and formula (2) to reverse- or forward-process the values of the assets under the above four indicators. For the organization ranking indicator, a smaller value is better, while for the three indicators of organization level, influence range, and business category, a larger value is better. From Table 4, asset S3 takes the value *x*31 = 2 under the organization ranking indicator, and the data under this indicator have a maximum value of 100 and a minimum value of 1. Substituting into formula (1), the value of *x*31 after reverse processing is $x_{31}^{*} = \frac{100-2}{100-1} = 0.9899$. The value *x*32 of asset S3 under the organization level indicator is 5, and the data under this indicator have a maximum value of 5 and a minimum value of 2. Substituting into formula (2), we get $x_{32}^{*} = \frac{5-2}{5-2} = 1$. Likewise, the values *x*33 and *x*34 of S3 under the influence range and business category indicators become $x_{33}^{*} = 1$ and $x_{34}^{*} = 1$ after processing by formula (2). The values of the other assets under the four indicators are processed in the same way.

After the above processing, the mean values of the four indicators are 0.669, 0.519, 0.528, and 0.819, in order. Substituting the 18 values under the organization ranking indicator and the mean of that indicator into formula (3) gives a standard deviation of 0.362. The standard deviations of the other indicators are obtained in the same way, and the results are shown in Table 5. From formula (4) and formula (5), the magnitudes of conflict between each indicator and the other indicators are 1.457, 1.558, 1.486, and 3.735.

From formula (6), we can obtain the weight w1 of the organizational ranking indicator:

$$w_{1} = \frac{0.362 \times 1.457}{0.362 \times 1.457 + 0.460 \times 1.558 + 0.461 \times 1.486 + 0.330 \times 3.735} = 0.167$$

Similarly, the weight w2 of the organization level indicator is found to be 0.227, the weight w3 of the influence range indicator is 0.217, and the weight w4 of the business category indicator is 0.389.
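The weight calculation can be re-run from the standard deviations and conflict values quoted in the text. The short check below uses only those rounded figures, so small differences in the last digit are to be expected; the variable names are illustrative.

```python
# Standard deviations and conflict values as quoted in the text for Index 1 to 4.
sigma = [0.362, 0.460, 0.461, 0.330]
conflict = [1.457, 1.558, 1.486, 3.735]

# Formula (6): each weight is sigma_j * A_j divided by the sum over all indicators.
info = [s * a for s, a in zip(sigma, conflict)]
total = sum(info)
weights = [v / total for v in info]

for j, w in enumerate(weights, start=1):
    print(f"w{j} = {w:.4f}")
```

From these rounded inputs, the first three weights round to the reported 0.167, 0.227 and 0.217. The fourth comes out at about 0.390 rather than the reported 0.389, a gap that is presumably due to rounding of the intermediate values.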


**Table 5.** CRITIC method to calculate the weighting process.

From Table 4, the values of asset S3 under the four indicators are 2, 5, 5, 5, which after processing become 0.9899, 1, 1, 1. The corresponding indicator weights are (0.167, 0.227, 0.217, 0.389), so the importance α3 of the business on asset S3 is obtained from formula (7) as:

$$\alpha_{3} = w_{1} \times 0.9899 + w_{2} \times 1 + w_{3} \times 1 + w_{4} \times 1 = 0.9983$$

Similarly, we can get the importance of the business carried by other assets, see Table 6.

#### **3.3 Value Assessment**

Based on the method described in Sect. 2, the three major security attributes of the bank's extranet assets are assigned values; see Table 6. The confidentiality of an asset is mainly evaluated by the consequences of its disclosure. For example, the confidentiality of deposit-related data within a bank is the highest: once disclosed, it would have a very serious impact on the normal operation of the bank. Integrity is evaluated by the damage to the entire organization if the integrity of the asset is breached. Availability is measured by the damage caused to the organization when the asset's functions are interrupted. Assets that carry transaction-type services must have a connection channel to the bank's internal network, so their confidentiality, integrity and availability are of the highest level. Assets that run customer exchange services generally have connection channels established with the bank's internal network. Assets that carry online investment services have a certain connection to the bank's internal network. Assets that carry information services or other services have no connection channel to the bank's internal network, so their C, I and A take lower values than those of the previous categories. For assets that carry other services, C, I and A take the lowest values of all.


**Table 6.** Asset value indicators.

From Table 6, the values of C, I, and A of asset S3 and the importance of the business it carries are 5, 5, 5, and 0.9983, respectively. Substituting into formula (8), the value d3 of asset S3 is obtained as:

$$d_{3} = \sqrt[3]{0.9983 \times 5 \times 5 \times 5} = 4.997$$

The remaining assets are evaluated by the same method and the results are written in Table 7.
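The worked example for asset S3 can be reproduced from the figures quoted in the text. The snippet below is only an arithmetic check; the variable names are illustrative, and the inputs are the reported weights, the processed indicator values of S3, and its C, I, A levels.

```python
weights = [0.167, 0.227, 0.217, 0.389]     # w1..w4 as reported in the text
s3_processed = [98 / 99, 1.0, 1.0, 1.0]    # processed indicator values of S3

# Formula (7): business importance as the weighted sum of processed values.
alpha_3 = sum(w * x for w, x in zip(weights, s3_processed))

# Formula (8): cube root of importance times the C, I and A levels (all 5 for S3).
d_3 = (alpha_3 * 5 * 5 * 5) ** (1.0 / 3.0)

print(round(alpha_3, 4), round(d_3, 3))
```

Rounded to the precision used in the paper, this gives 0.9983 for the business importance and 4.997 for the asset value, matching the reported results.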


**Table 7.** Comparison of asset value results.

#### **3.4 Results Analysis and Comparison**

For bank extranet assets, Table 7 compares the evaluation results obtained by this paper's method with those of the traditional method, which considers only the three major security elements. The ranking of asset values obtained by the traditional method [17], from highest to lowest, is (S1, S3, S5, S7, S9, S10, S11, S12, S13, S14, S15, S17, S18, S8, S4, S2, S6, S16), where assets S1 to S15 and assets S17, S18 all obtain asset values of 5, and assets S8, S2, S4, S6 all obtain asset values of 3. The asset values obtained by the method in this paper, ranked from highest to lowest, are (S1, S3, S5, S7, S9, S8, S12, S14, S10, S4, S11, S13, S15, S17, S18, S2, S6, S16), and the value of each asset is different. The traditional method thus assigns the same value to many assets, making it impossible to distinguish between assets that carry different business types in the same organization or between assets that carry the same business type in different organizations, whereas these assets are clearly distinguished once the business importance factor proposed in this paper is taken into account.

For assets S3 and S4, which belong to the same organization but carry different types of business, the values calculated by the traditional method are 5 and 4, and the values obtained by the method in this paper are 4.997 and 4.006. Both methods obtain a higher value for asset S3 than for asset S4, which supports the correctness and feasibility of the proposed method.

For assets S9 and S12, which are both personal business systems, the values of C, I, and A are the same, so the traditional method gives both a value of 5 and cannot tell which is more valuable. The method of this paper gives 4.464 and 4.370 because, although the CIA values of the two assets are the same, Table 7 shows that the importance of the business carried on S9 is higher than that of S12. The value of asset S9 is therefore higher than that of asset S12, which indicates that, for assets that carry the same type of business in different organizations, this method distinguishes asset values better than the traditional method.

The above examples show that, in addition to the confidentiality, integrity and availability factors that affect asset value, it is necessary to consider the importance of the business carried by the assets. Doing so not only reduces the influence of subjective factors but also solves the problem that the traditional method cannot distinguish the values of different assets carrying the same type of business. Therefore, it is practical and realistic to use this paper's method to assess the value of assets.

## **4 Conclusions**

Objectivity, accuracy, and ease of differentiation are the goals that information asset value assessment must achieve. In this paper, in addition to the three security attributes of confidentiality, integrity, and availability, we propose that the value of assets is also influenced by the importance of the business they carry, and we use the multiplication method to calculate the value of assets. We use the CRITIC method to assign weights to four objective evaluation indicators that measure business importance: organization ranking, organization level, influence range and business category. We then calculate business importance from the obtained indicator weights and the data processed in the forward or reverse direction. The feasibility of the proposed method is verified by evaluating the value of bank assets obtained from the extranet. The method can also be applied to other organizations to calculate the value of assets and prepare for subsequent risk assessment work.

## **References**



## **Vulnerabilities**

## **Knowledge Graph Construction Research From Multi-source Vulnerability Intelligence**

Lin Du1,2(B) and Chuanqi Xu1

<sup>1</sup> Technical Team/Coordination Center of China, Tianjin Branch of National Computer Network Emergency Response, Tianjin, China dulin@cert.org.cn

<sup>2</sup> School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China

**Abstract.** China has a huge population of Internet users, and many enterprises and institutions are deeply affected by the threat of Cybersecurity vulnerabilities. At present, depending on the needs of different business scenarios, business personnel often have to search for different vulnerability information separately and manually. In addition, the vulnerability intelligence distributed across the Internet is multi-source and heterogeneous, which makes it difficult to ensure the effectiveness and reliability of vulnerability knowledge. Against this background, and with vulnerabilities as the core, this paper extracts knowledge from vulnerability intelligence according to existing standards, establishes the corresponding entities and relationships, and studies and constructs a correlated, visualized knowledge graph to support information workers in discovering and tracing vulnerability threats.

**Keywords:** Knowledge graph · Cybersecurity · Vulnerability · Named entity recognition · Relationship extraction

## **1 Introduction**

With the rapid development of the Internet industry, a large number of Cybersecurity vulnerabilities have been discovered and exploited in the products of various companies, creating potential risks to production and daily life. Vulnerability threat discovery and traceability have become common challenges and work requirements for personnel such as system operation and maintenance staff and network administrators. Vulnerability information comes from various sources, including vulnerability reports from open source communities, public vulnerability databases, and product patch information. These sources are scattered, incomplete, and differently structured, and the vulnerability knowledge they provide varies with the platform it comes from. High-quality and low-quality information is mixed together, duplication is high, correlations are unclear, and data quality cannot be guaranteed, so the information cannot effectively support the needs of Cybersecurity business personnel for vulnerability detection, analysis and judgment.

In recent years, knowledge graphs have used deep learning to turn collected, analyzed, and mined data into valuable information and knowledge models. After Google proposed the knowledge graph and applied it to intelligent search [1], it was first applied effectively in the commercial field, for example the LinkedIn economic graph (User Profile) in the social field and the Tianyancha enterprise graph (Enterprise Profile) in the field of enterprise information.

In various vertical fields in China, there has been research and exploration on the application of knowledge graphs. An Ning et al. [2] proposed the construction of a cross-platform network public opinion knowledge graph, using Sina Weibo and Douyin short videos as data sources, which is mainly used in the management and guidance of network public opinion. Xiao Le et al. [3] proposed a knowledge graph for grain situation, constructed mainly by extracting grain situation entities with a grain situation dictionary and the Flat-lattice model, which is used to assist grain situation decision-making. Mou Tianhao et al. [4] proposed a knowledge graph of process industrial control systems based on the control system cyber-physical asset management tasks to solve business problems related to industrial control systems. Zhang Kunli et al. [5] took obstetric diseases as the core and proposed a Chinese obstetric knowledge graph to facilitate medical question answering and auxiliary diagnosis and treatment.

There are few applications of knowledge graphs in the field of Cybersecurity. This paper uses a knowledge graph to correlate numerous isolated pieces of vulnerability intelligence and present a panorama of vulnerability entities, which provides a new approach to vulnerability research and analysis and helps to address difficulties in Cybersecurity work.

## **2 Vulnerability Knowledge Graph Construction Route**

Large-scale domestic vulnerability databases include the China National Vulnerability Database (CNVD) and the China National Vulnerability Database of Information Security (CNNVD), which are the main channels for building and sharing vulnerability intelligence [6]. Considering the current state of information security development, the sources of vulnerability intelligence in this paper are CNVD, CNNVD and CVE (Common Vulnerabilities and Exposures). After the vulnerability knowledge is integrated, it is proofread manually, and data with low confidence is discarded to ensure the quality of the vulnerability knowledge base. At the same time, the knowledge extraction model is continuously trained in a supervised manner on new intelligence. As data accumulates, new knowledge base data sources such as open source security websites are added as appropriate, and the entire system is updated iteratively.

#### **2.1 Schema Layer Design**

The schema layer of the vulnerability knowledge graph sits above the data layer. Its core is the ontology library, which is an abstract representation of vulnerability knowledge, like a "class" in object-oriented programming. The schema layer mainly includes two kinds of triples: entity-relation-entity and entity-attribute-attribute value. Based on "Information security technology - Cybersecurity vulnerability identification and description specification" [8] (GB/T 28458–2020), the framework of vulnerability identification and description can be composed of identification items and description items. Taking into account the actual situation of domestic vulnerabilities, mainly from the perspective of vulnerability management and emergency response [9], the main attribute of the vulnerability is CNVD\_ID, and the preliminary design of the entity and relationship framework is shown in Fig. 1.

**Fig. 1.** The framework of entity and relationship

Based on the graph structure, entities represent objects or abstract concepts in the vulnerability space, and relationships model the interactions between entities; the framework follows the triple form (head entity, relation, tail entity). In Fig. 1, entities are drawn as boxes, each row under the entity name is one of its attributes, PK marks the main attribute, and arrows represent relationships. Five entities are defined: vulnerability = {CNVD\_ID, title, date, level, product, description, solution, patch, CVE\_ID}; event = {event\_id, description, time, URL, victim}; company = {name}; product = {name}; victim = {name}. Four relationships are defined: influence, raise, belong to, use. More entities, attributes, and relationships can be gradually added within this framework.
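One way to picture this schema is as a set of typed records plus a closed list of relation names. The Python sketch below is an assumption-laden illustration rather than the authors' ontology: the class and function names are invented here, all attribute types are simplified to strings, the CNVD identifier is a placeholder, and the head and tail types chosen for each relation are guesses, since the text only confirms that a vulnerability "raises" an event.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Vulnerability:
    cnvd_id: str                    # main attribute (PK)
    title: str = ""
    date: str = ""
    level: str = ""
    product: str = ""
    description: str = ""
    solution: str = ""
    patch: str = ""
    cve_id: Optional[str] = None

@dataclass
class Event:
    event_id: str
    description: str = ""
    time: str = ""
    url: str = ""
    victim: str = ""

@dataclass
class Company:
    name: str

@dataclass
class Product:
    name: str

@dataclass
class Victim:
    name: str

# The four relationships named in the schema layer.
RELATIONS = ("influence", "raise", "belong to", "use")

Triple = Tuple[str, str, str]

def make_triple(head: str, relation: str, tail: str) -> Triple:
    """Build a (head entity, relation, tail entity) triple, rejecting unknown relations."""
    if relation not in RELATIONS:
        raise ValueError(f"unknown relation: {relation!r}")
    return (head, relation, tail)

# Placeholder identifier, for illustration only.
vuln = Vulnerability(cnvd_id="CNVD-0000-00000", product="Apache Log4j")
triples = [
    make_triple(vuln.cnvd_id, "influence", vuln.product),   # assumed direction
    make_triple("Apache Log4j", "belong to", "Apache"),     # assumed direction
]
```

Keeping the relation names in one tuple means that extending the schema, as the text anticipates, only requires adding a record type or a relation name in one place.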

#### **2.2 Data Layer Construction**

The vulnerability knowledge graph data layer consists of three steps: data collection, knowledge extraction, and knowledge fusion.

#### **2.2.1 Data Collection**

Vulnerability, company, and product data are obtained from the unstructured text of the China National Vulnerability Database (CNVD) and the semi-structured text of CVE (Common Vulnerabilities and Exposures) [10]. Data for the other two entities, events and victims, can be collected in a compliant manner by organizations according to their own circumstances, for example when they manage vulnerabilities centrally for their own unit and its subordinate units, or when they act as vulnerability managers.

#### **2.2.2 Knowledge Extraction**

Knowledge extraction automatically obtains structured information such as entities, relationships, and entity attributes from heterogeneous data such as semi-structured or unstructured data. According to the characteristics of vulnerability intelligence text, this paper annotates the text with the BIOES tagging scheme [11] and then performs the following main operations: entity extraction, attribute extraction, and relation extraction. They are introduced as follows:

1) Entity extraction, namely named entity recognition (NER), refers to the automatic recognition of named entities in text datasets. At present, the main technical approaches to named entity recognition are: rule-based and dictionary-based methods, which rely on manually constructed rule templates with pattern and string matching as the main means; statistical methods, including the Hidden Markov Model (HMM), Maximum Entropy Model (MEM), Support Vector Machine (SVM), and Conditional Random Field (CRF); and neural network methods, whose main models are NN/CNN-CRF, RNN-CRF, and LSTM-CRF. The goal of attribute extraction is to collect the attribute information of a specific entity from different information sources. For example, for a specific vulnerability, attributes such as name and affected product can be obtained from public information on the Internet. For entity and attribute extraction, this paper adopts the BLSTM-CRF model (Bidirectional Long Short-Term Memory Network - Conditional Random Field) [12], which is currently among the more effective models in the field of security vulnerabilities. Fig. 2 shows the model, taking the product entity (Apache Log4j) as an example.

**Fig. 2.** The structure of BLSTM-CRF model
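To make the BIOES annotation concrete, the sketch below tags a tokenized sentence given the token spans of its entities. It only illustrates the labeling scheme that a sequence model such as BLSTM-CRF would be trained to predict; the helper name, the example sentence, and the `PRODUCT` label are assumptions, not the paper's annotation set.

```python
def bioes_tags(tokens, spans):
    """Assign BIOES labels to tokens.

    spans: list of (start, end_exclusive, entity_type) over token indexes.
    B = begin, I = inside, E = end, S = single-token entity, O = outside.
    """
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        if end - start == 1:
            tags[start] = f"S-{etype}"
        else:
            tags[start] = f"B-{etype}"
            for k in range(start + 1, end - 1):
                tags[k] = f"I-{etype}"
            tags[end - 1] = f"E-{etype}"
    return tags

tokens = ["Apache", "Log4j", "has", "a", "remote", "code", "execution", "vulnerability"]
tags = bioes_tags(tokens, [(0, 2, "PRODUCT")])

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Here the two-token product name receives a begin tag and an end tag, and every other token is labeled `O`; a single-token entity would receive an `S-` tag instead.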

2) Relation extraction. After entities and attributes are extracted from the vulnerability intelligence text, a series of discrete named entities is obtained. Obtaining further semantic information requires relation extraction: extracting the relationships between entities from the related text and connecting entities through these relationships to form a networked knowledge structure. Unlike a graph of social relationships among people, the vulnerability knowledge graph has relatively few and simple relationships, for example, vulnerability A "raises" event B. Since the relationships defined in the schema layer are easy to distinguish in text data such as vulnerability reports, this paper chooses rule matching: relationships between the recognized entities are selected automatically according to the entity categories and the relationship definitions in the schema layer, and are fine-tuned afterwards. Because the defined entity types constrain which patterns can apply, the relationship between two entities is determined by the trigger word; samples of the designed rules are shown in Table 1.


**Table 1.** Samples of trigger word rules
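The rule-matching idea can be sketched as follows: for each ordered pair of recognized entities, look for a trigger word in the text between them and check that the entity types fit the relation. The trigger words, the rule table, and the example sentence below are hypothetical stand-ins for the paper's Table 1, and the sketch assumes the head entity appears before the tail entity in the sentence.

```python
import re

# Hypothetical trigger-word rules: (head type, trigger pattern, tail type, relation).
RULES = [
    ("vulnerability", r"\b(affects?|influences?)\b", "product", "influence"),
    ("vulnerability", r"\b(raised|caused|led to)\b", "event", "raise"),
    ("product", r"\b(developed by|belongs to)\b", "company", "belong to"),
    ("victim", r"\b(uses?|deploys?)\b", "product", "use"),
]

def extract_relations(sentence, entities):
    """Return (head, relation, tail) triples for entity pairs joined by a trigger word.

    entities: list of (text, entity_type, start_offset) from a prior NER step.
    """
    triples = []
    for h_text, h_type, h_start in entities:
        for t_text, t_type, t_start in entities:
            h_end = h_start + len(h_text)
            if h_end > t_start:
                continue  # only consider a head that ends before the tail begins
            between = sentence[h_end:t_start]
            for r_head, pattern, r_tail, relation in RULES:
                if (h_type, t_type) == (r_head, r_tail) and re.search(pattern, between):
                    triples.append((h_text, relation, t_text))
    return triples

# Example sentence written for illustration.
sentence = "CVE-2021-44228 affects Apache Log4j, a product developed by Apache"
entities = [
    ("CVE-2021-44228", "vulnerability", sentence.index("CVE-2021-44228")),
    ("Apache Log4j", "product", sentence.index("Apache Log4j")),
    ("Apache", "company", sentence.rindex("Apache")),
]
triples = extract_relations(sentence, entities)
```

Passing entity offsets rather than searching for the entity text avoids confusing the company name with the product name that contains it. Rules of this kind are brittle on free text, which is consistent with the fine-tuning step the text mentions.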

#### **2.2.3 Knowledge Fusion**

After data collection and knowledge extraction, entities, relationships and entity attribute information are obtained from the original unstructured and semi-structured vulnerability intelligence data. However, the relationship between multiple sources (information) is flat and lacks hierarchy and logic; there is still a lot of redundancy and misinformation in the knowledge. Knowledge fusion is to solve this problem, through entity disambiguation and coreference resolution, to realize the integration of vulnerability knowledge. For example, the company " " and the "Apple" belong to the entity synonymous relationship and need to be integrated. After knowledge fusion, the noise and redundancy in the data are removed, and the quality of vulnerability knowledge is improved.
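A very reduced form of this fusion step is alias normalization followed by de-duplication. The sketch below assumes a hand-maintained alias table, which is hypothetical, and it stands in for entity disambiguation and coreference resolution only in the simplest case where two surface forms are known to name the same entity.

```python
# Hypothetical alias table mapping lower-cased surface forms to a canonical name.
ALIASES = {
    "apple": "Apple",
    "apple inc": "Apple",
    "apple inc.": "Apple",
}

def canonical(name):
    """Map an entity name to its canonical form, or return it trimmed if unknown."""
    key = " ".join(name.lower().split())
    return ALIASES.get(key, name.strip())

def fuse(triples):
    """Normalize entity names and drop duplicate triples, preserving order."""
    seen, fused = set(), []
    for head, relation, tail in triples:
        triple = (canonical(head), relation, canonical(tail))
        if triple not in seen:
            seen.add(triple)
            fused.append(triple)
    return fused

# Two sources describing the same fact with different company spellings.
raw = [
    ("iPhone", "belong to", "Apple Inc."),
    ("iPhone", "belong to", "apple"),
]
fused = fuse(raw)
```

After fusion the two source triples collapse into one, which is the kind of redundancy removal the paragraph describes; real disambiguation would also need context to separate different entities that share a name.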

## **3 Vulnerability Knowledge Graph Construction Results**

#### **3.1 Experimental Environment**

The experimental environment of this paper: the operating system is Windows 10; the CPU is AMD Ryzen™ 7 5800H@3.2 GHz; the GPU is GTX 3050Ti (4 GB); the memory is 64 GB; the Python version is 3.7; the neo4j version is 3.1.1.

#### **3.2 Knowledge Graph Display**

Taking some generic vulnerability data and a small number of influenced victims under Apache as an example (entities are vulnerability ontology, historical events, involved victims, companies, and products; relationships are the edges of a directed graph), the constructed visual interface is shown in Fig. 3.

**Fig. 3.** Vulnerability knowledge graph

#### **3.3 Application Analysis**

In terms of vulnerability threat discovery and analysis, constructing the graph to correlate and analyze vulnerability information makes it possible to mine hidden information and make effective judgments. Referring to Fig. 3, the various types of entities are the nodes of the graph, and the various types of relationships between entities are its edges. Starting from a certain entity, such as a victim that operates critical infrastructure, one can find out which products of which companies the victim uses, and which security events occurred, at which times, because of which vulnerabilities. If a 0-day vulnerability later appears in the corresponding products of the company, it can reasonably be predicted that the victim will be influenced by this vulnerability, and the victim can be warned in time, before possible Cybersecurity events, to avoid major losses. This information is often unavailable from a single vulnerability report, whereas a knowledge graph can organically connect numerous pieces of vulnerability information.
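The traversal described here can be sketched over an in-memory list of triples instead of a graph database. All entity names below are placeholders, and the direction assumed for each relation (victim uses product, product belongs to company, vulnerability influences product, vulnerability raises event) is an interpretation of the schema rather than something the text fixes.

```python
from collections import defaultdict

# Placeholder triples following the schema's four relations.
TRIPLES = [
    ("Victim-A", "use", "Product-X"),
    ("Product-X", "belong to", "Company-C"),
    ("Vuln-1", "influence", "Product-X"),
    ("Vuln-1", "raise", "Event-7"),
    ("Victim-B", "use", "Product-Y"),
    ("Product-Y", "belong to", "Company-D"),
]

def build_index(triples):
    """Index triples by (head, relation) and by (tail, relation) for fast lookup."""
    out_edges, in_edges = defaultdict(list), defaultdict(list)
    for head, relation, tail in triples:
        out_edges[(head, relation)].append(tail)
        in_edges[(tail, relation)].append(head)
    return out_edges, in_edges

def victims_at_risk(product, in_edges):
    """Victims to warn when a new vulnerability influences the given product."""
    return sorted(in_edges[(product, "use")])

def history(victim, out_edges, in_edges):
    """For each product a victim uses: known vulnerabilities and the events they raised."""
    report = {}
    for product in out_edges[(victim, "use")]:
        vulns = in_edges[(product, "influence")]
        report[product] = {v: out_edges[(v, "raise")] for v in vulns}
    return report

out_edges, in_edges = build_index(TRIPLES)
at_risk = victims_at_risk("Product-X", in_edges)
past = history("Victim-A", out_edges, in_edges)
```

In a deployed system the same questions would more likely be asked as graph queries against the database holding the knowledge graph; the dictionary index here only shows which hops the analysis needs.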

## **4 Conclusion**

Based on the characteristics of the vulnerability field, this paper first integrates multi-source vulnerability intelligence data to design a vulnerability knowledge graph framework; it then uses a deep learning model to extract entities and attributes, extracts relationships based on pattern rules, and constructs, checks and analyzes a vulnerability knowledge ontology; finally, it completes the multi-source knowledge graph. In the future, by adding further vulnerability threat intelligence data sources, a larger and more complete vulnerability knowledge graph can be formed, providing more effective cybersecurity decision support for information workers.

## **References**



## **Mobile Internet**

## **Research and Practice of TCP Protocol Optimization in Mobile Internet**

Shizhan Lan<sup>1,2(B)</sup>, Lifang Ma<sup>3</sup>, and Jing Huang<sup>4</sup>

<sup>1</sup> School of Software Engineering, South China University of Technology, Guangzhou 510006, China

lanshizhan@gx.chinamobile.com

<sup>2</sup> China Mobile Guangxi Branch Co., Ltd., Nanning 530012, China

<sup>3</sup> Guangxi Polytechnic of Construction, Nanning 530007, China

<sup>4</sup> Cloud and Network Strategy Design Department, EVERSEC (Bei Jing) Technology Co., Ltd., Beijing 100191, China

**Abstract.** This paper analyzes the problems of the traditional TCP protocol in the wireless network environment and proposes a scheme based on performance-enhancing proxies, which is better suited to the actual situation of current wireless core networks. TCP application optimization is employed to enhance congestion control. Based on an automatic learning mechanism for network path features, this paper proposes the dynamic algorithm ZetaTCP. In practice, the performance enhancement proxy based on ZetaTCP was verified and achieved good results in the LTE network.

**Keywords:** LTE · TCP protocol optimization · Congestion control · ZetaTCP

## **1 Introduction**

As the main protocol for online data transmission, TCP has been widely used on the mobile Internet, where it carries over 90% of the traffic. Although LTE networks have advanced and the mobile Internet has become notably faster, traditional TCP technology, tailored for the wired network environment, cannot adapt to the wireless network environment with its relatively poor link quality and frequent changes in latency and packet loss. Therefore, improving the transmission performance of TCP in wireless networks and raising the bandwidth utilization of wireless links are critical to optimizing the wireless core network [1].

## **2 Features and Problems Analysis of Standard TCP**

TCP provides reliable, connection-oriented point-to-point transmission services and uses a sliding window to control the transmission rate. The congestion control of standard TCP comprises several key techniques, including "slow start", "congestion avoidance", "fast retransmit", "fast recovery", and "retransmission timeout" [2].

Standard TCP is very sensitive to packet loss: on loss it halves the congestion window or reduces it to the minimum, and then increases it only slowly. In wired networks, packet loss usually indicates network congestion, which standard TCP can control well and recover from quickly. The current wireless network environment, however, makes the defects of the standard TCP protocol increasingly salient:

#### **2.1 Ineffective Congestion Judgment Mechanisms**

In a wired network with a low bit error rate, it is reasonable for TCP to assume that packet loss is caused by network congestion. In wireless networks, however, packet loss can also be caused by burst errors in wireless channels, mobile device handoffs, channel fading, or changes in network topology. Standard TCP cannot accurately distinguish whether a packet loss stems from congestion, which results in congestion misjudgment.

#### **2.2 Slow Congestion Recovery Mechanisms**

Once it detects packet loss, TCP triggers its congestion control response in three steps. First, the unacknowledged packets are retransmitted, and the congestion window and the transmission rate are reduced. Then, the congestion control mechanism is activated, consisting of exponential back-off of the retransmission timer and a reduction of the slow start threshold. Finally, the congestion avoidance stage is entered to relieve the congestion. If the packet loss results from channel errors or mobile device handoffs rather than congestion, this recovery mechanism needlessly lowers throughput and lengthens latency.
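The cost of this reaction can be seen in a minimal sketch of the classic timeout response and the subsequent window growth. It is a simplification, not the paper's code: the window is counted in segments, the current window stands in for the amount of data in flight, and the 60-second timer cap is an assumed value.

```python
def on_timeout(cwnd, rto, mss=1, max_rto=60.0):
    """Reaction to a retransmission timeout: halve the slow start
    threshold, restart from one segment, and double the timer."""
    ssthresh = max(cwnd / 2, 2 * mss)
    return mss, ssthresh, min(rto * 2, max_rto)

def on_ack(cwnd, ssthresh, mss=1):
    """Window growth per new ACK: slow start below the threshold,
    congestion avoidance (about one segment per RTT) above it."""
    if cwnd < ssthresh:
        return cwnd + mss
    return cwnd + mss * mss / cwnd

# A single random (non-congestion) loss on a 20-segment window:
cwnd, ssthresh, rto = on_timeout(cwnd=20, rto=1.0)
```

After one spurious timeout the sender restarts from a single segment with a threshold of 10 segments, so it needs several round trips of slow start plus a long linear phase to return to 20 segments, even though the path was never congested.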

#### **2.3 Inaccurate Packet Loss Judgment Mechanism**

The standard TCP stack detects packet loss by two methods: counting consecutive duplicate ACKs (Dup-ACKs), and ACK timeout. When many packets are lost, too few Dup-ACKs arrive, so the loss can only be detected by the ACK timeout, which then triggers retransmission. In modern networks, packet losses are often bursty, and it is common for multiple data packets on a connection to be lost at the same time. Standard TCP must therefore rely on timeouts for retransmission, which often leads to a waiting state of several seconds or even ten seconds, causing long transmission stalls or even disconnection [3].
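The two detection paths can be sketched as follows. The class and method names are hypothetical; the threshold of three duplicate ACKs is the conventional value in standard TCP.

```python
DUP_ACK_THRESHOLD = 3  # conventional fast-retransmit trigger

class LossDetector:
    """Tracks duplicate ACKs and the retransmission timeout for one connection."""

    def __init__(self, rto):
        self.rto = rto
        self.last_ack = None
        self.dup_count = 0

    def on_ack(self, ack_no):
        """Return 'fast_retransmit' when the third duplicate ACK arrives."""
        if ack_no == self.last_ack:
            self.dup_count += 1
            if self.dup_count == DUP_ACK_THRESHOLD:
                return "fast_retransmit"
        else:
            self.last_ack = ack_no
            self.dup_count = 0
        return None

    def on_tick(self, now, oldest_unacked_sent_at):
        """Return 'timeout' once the oldest unacknowledged segment exceeds the RTO."""
        if now - oldest_unacked_sent_at >= self.rto:
            return "timeout"
        return None

detector = LossDetector(rto=1.0)
events = [detector.on_ack(a) for a in (100, 100, 100, 100)]
```

If a burst wipes out most of the window, fewer than three duplicate ACKs come back, `on_ack` never fires, and only `on_tick` can recover the loss, which is the stall the paragraph describes.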

## **3 ZetaTCP Optimization**

According to surveys and analyses of mobile Internet access quality across network operators, TCP traffic accounts for the vast majority of the traffic generated by mobile users accessing mobile Internet applications. However, because delay and packet loss in the wireless network environment change frequently, the transmission efficiency of standard TCP is often very low in this context [4]. If the defects of TCP's handling mechanisms under various wireless network conditions can be corrected and the efficiency of TCP traffic transmission enhanced, the user's mobile Internet experience can be significantly improved [5].

#### **3.1 Comparison and Improvement of TCP Optimization Technology**

Most domestic and overseas TCP optimization technologies apply static algorithms, which use fixed congestion judgment and recovery mechanisms based on assumptions about the Internet traffic model. As the Internet environment evolves, traffic characteristics become increasingly complicated and difficult to predict. Under such circumstances, these TCP optimization techniques are valid only in the specific network scenarios where their assumptions hold. Moreover, as a transmission progresses, the network path characteristics may change, making the effect unstable. Two common TCP optimization algorithms are presented below:

The Vegas TCP algorithm defines a state variable, BaseRTT (base round-trip delay), whose theoretical value is the round-trip delay of the connection without congestion. Using delay changes as the congestion indicator, Vegas is more sensitive in judging network congestion, which lowers the packet loss rate of the network and yields excellent average throughput when all flows in the network use the Vegas algorithm. However, in a network shared with packet loss-based algorithms, delay always rises rapidly before packet loss occurs. Vegas therefore shrinks CWND (the congestion window) and reduces its transmission rate before the loss-based algorithms do, making its overall performance inferior to theirs. Vegas TCP has relatively low transmission performance under shallow queue congestion and under the frequent delay changes of wireless networks [6].
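A minimal sketch of the delay-based Vegas adjustment follows. The thresholds `alpha` and `beta` (in packets) are assumed example values, and the function is a simplified per-RTT update rather than a full implementation.

```python
def vegas_update(cwnd, base_rtt, rtt, alpha=2, beta=4):
    """One per-RTT Vegas window update (window counted in packets)."""
    expected = cwnd / base_rtt              # rate if nothing were queued
    actual = cwnd / rtt                     # rate actually observed
    queued = (expected - actual) * base_rtt  # estimated packets sitting in queues
    if queued < alpha:
        return cwnd + 1   # path looks idle: grow
    if queued > beta:
        return cwnd - 1   # delay is building up: back off before any loss
    return cwnd

grow = vegas_update(cwnd=10, base_rtt=0.1, rtt=0.1)
shrink = vegas_update(cwnd=10, base_rtt=0.1, rtt=0.2)
```

Because the window shrinks as soon as the measured RTT exceeds BaseRTT by enough, any delay jitter on a wireless link is read as congestion, which is consistent with the weakness noted above.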

CUBIC TCP, an enhanced version of BIC-TCP, simplifies the window adjustment algorithm. It uses a cubic function as the growth function of the congestion window, so that the window grows only according to the time interval between two consecutive congestion events. CUBIC is the default TCP congestion control algorithm of the Linux kernel. CUBIC TCP has relatively low transmission performance under non-congestion packet loss and deep queue congestion in wireless networks [6].
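The cubic growth function can be sketched as below. The form W(t) = C(t − K)³ + W_max and the constants C = 0.4 and β = 0.7 follow the commonly published CUBIC description and are assumptions here, not values given in this paper.

```python
def cubic_k(w_max, c=0.4, beta=0.7):
    """Time (seconds) the curve needs to climb back to w_max after a loss."""
    return (w_max * (1 - beta) / c) ** (1 / 3)

def cubic_window(t, w_max, c=0.4, beta=0.7):
    """Congestion window t seconds after the last congestion event."""
    k = cubic_k(w_max, c, beta)
    return c * (t - k) ** 3 + w_max

w_after_loss = cubic_window(0, 100)            # window right after the reduction
w_at_plateau = cubic_window(cubic_k(100), 100)  # back at the previous maximum
```

The window depends only on the time since the last loss event, not on the RTT. A random wireless loss resets `t` to zero and drops the window to β · W_max even when the path is not congested.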

The Performance-enhancement Proxy Based Scheme is adopted to overcome the drawbacks of the above algorithms and to enable TCP to achieve high transmission efficiency in wireless networks with long latency and frequent link errors [4]. Because the scheme segments the original TCP connection, it is also known as TCP segmentation (see Fig. 1).

The Performance-enhancement Proxy Based Scheme follows the idea that local problems should be solved locally. By deploying the proxy at a node in the middle, the TCP connection between the server and the wireless mobile end is split into two sections: one connects to the server sending end on the fixed network, and the other connects to the mobile receiving end on the wireless network. This shields the server sending end from the influence of the wireless environment, so that random packet loss in the wireless network no longer causes the server to activate its congestion control algorithm unnecessarily. With the improved TCP deployed on the proxy, the performance of TCP in wireless networks is strengthened, and the data transmission rate to mobile ends is raised [7]. The scheme requires no modification of the TCP protocol stack at either the server sending end or the wireless mobile receiving end, which makes it feasible to implement at this stage.

**Fig. 1.** Schematic diagram of performance-enhancement proxy based scheme

#### **3.2 ZetaTCP Optimality Principle**

The proposed approach adopts a dynamic self-learning algorithm based on network path features (ZetaTCP) and is deployed using the Performance-enhancement Proxy Based Scheme. It observes and analyzes the real-time network features of each TCP connection and adjusts the algorithm at any time according to the learned network characteristics. In doing so, it can judge the degree of congestion more accurately and detect packet loss more promptly, thereby handling congestion more appropriately and retransmitting lost packets more swiftly. By design, it overcomes the inability of static algorithms to adapt to changes in network path characteristics and keeps the acceleration effect valid across different network environments and under frequently changing delay and packet loss.

Besides applying the above two loss detection approaches of standard TCP, ZetaTCP also takes packet loss and delay changes into consideration and introduces a self-learning dynamic mechanism based on the network path characteristics of each TCP connection, making the congestion judgment more accurate and timely. The dynamic learning mechanism determines the network path characteristics of each specific connection during transmission. These characteristics include the end-to-end delay and its variation, the arrival interval of feedback packets (ACKs) from the receiving end and its variation, the degree of packet reordering and its variation, delay jitter possibly caused by deep packet inspection in security equipment, and random packet loss induced by various factors. ZetaTCP tracks these features in real time, analyzes them comprehensively, and deduces the precursor signals of congestion and packet loss on the network path of the specific TCP connection. Based on these learning results and the transmission rate, it then determines the degree of congestion, a congestion recovery mechanism appropriate to the available bandwidth of the current path, and the packet loss judgment and recovery mechanisms.

ZetaTCP is implemented by an automatic learning state machine (Learning State-Machine), as indicated in Fig. 2.

**Fig. 2.** ZetaTCP automatic learning state machine

Each automatic learning state machine corresponds to one TCP connection, records the network path characteristics of that connection, and dynamically determines the appropriate congestion judgment, congestion recovery and packet loss judgment mechanisms. Specifically, connection management extracts the external features of the network path and feeds them to the state machine. The learning results accumulated by the state machine are used by the packet loss monitoring, congestion control, exception handling and delay monitoring modules to adjust the transmission behavior of the corresponding TCP connection. Dynamic feedback is conveyed back to the state machine through the exception handling and congestion control modules to further refine the network path learning.

#### **3.3 Implementation Algorithm of ZetaTCP Optimization**

On Linux, Netfilter is used for packet interception. Depending on the deployment scenario, hooks can be registered at the Ethernet bridge level to implement a transparent bridge mode, or at the INET level to implement a routing mode. A pair of hook points is mounted on the LAN and WAN sides of the engine, at NF\_INET\_POST\_ROUTING and NF\_INET\_PRE\_ROUTING respectively, as shown in Fig. 3.

**Fig. 3.** Implementation of Hook point of ZetaTCP

ZetaTCP's congestion control algorithm is shown in Fig. 4.

**Fig. 4.** Process flow of the congestion control algorithm of ZetaTCP

For the data packet with the highest sequence number acknowledged in a received ACK, the actual instantaneous throughput rate is calculated as *B*<sub>C</sub> = *F*<sub>S</sub>/(*T* − *T*<sub>S</sub>), where *T* denotes the current time, *T*<sub>S</sub> is the time at which that data packet was sent, and *F*<sub>S</sub> is the total amount of data that had been sent but not yet acknowledged at time *T*<sub>S</sub>. *T*<sub>S</sub> and *F*<sub>S</sub> are recorded when the data packet is sent.

The smoothed throughput rate is computed as *B* = (1 − α) · *B*<sup>−</sup> + α · *B*<sub>C</sub>, where α is a constant parameter, *B*<sub>C</sub> is the actual instantaneous throughput rate, and *B*<sup>−</sup> is the previously calculated smoothed throughput rate.
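The two formulas above translate directly into code. In this sketch the function names and the default α = 0.125 are assumptions; the paper only states that α is a constant.

```python
def instant_throughput(now, sent_at, in_flight):
    """B_C = F_S / (T - T_S): data in flight when the packet was sent,
    divided by the time it took for that packet to be acknowledged."""
    elapsed = now - sent_at
    if elapsed <= 0:
        raise ValueError("ACK time must be later than send time")
    return in_flight / elapsed

def smooth_throughput(previous, instant, alpha=0.125):
    """B = (1 - alpha) * B_prev + alpha * B_C (exponential moving average)."""
    return (1 - alpha) * previous + alpha * instant

b_c = instant_throughput(now=2.0, sent_at=1.5, in_flight=50_000)  # bytes per second
b = smooth_throughput(previous=80_000, instant=b_c, alpha=0.25)
```

A small α makes the smoothed rate react slowly to a single noisy sample, while a large α follows the instantaneous rate more closely.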

The CWND growth mode is one of exponential growth, linear growth, and stop. If the smoothed throughput rate increases relative to the previous smoothed throughput rate, the CWND growth mode is set to exponential growth. If the smoothed throughput rate declines for a predetermined number of consecutive times, and the total drop in the smoothed throughput rate is not less than the throughput rate drop threshold, it is further checked whether the current smoothed round-trip delay SRTT is less than or equal to η · *RTT*<sub>MIN</sub>, where *RTT*<sub>MIN</sub> is the smallest round-trip delay and η is a constant parameter. If so, the CWND growth mode is set to linear growth; if not, it is set to stop.
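The mode selection rule can be sketched as a small state holder. The class name, the default parameter values, and two behaviors the text leaves open are assumptions: the previous mode is kept when neither condition is met, and an unchanged throughput sample resets the decline streak.

```python
EXPONENTIAL, LINEAR, STOP = "exponential", "linear", "stop"

class GrowthModeSelector:
    def __init__(self, max_declines=3, drop_threshold=0.0, eta=1.5):
        self.max_declines = max_declines      # consecutive declines required
        self.drop_threshold = drop_threshold  # minimum accumulated drop
        self.eta = eta                        # tolerance factor on RTT_MIN
        self.prev_b = None
        self.declines = 0
        self.total_drop = 0.0
        self.mode = EXPONENTIAL

    def update(self, b, srtt, rtt_min):
        """Feed one smoothed throughput sample and return the CWND growth mode."""
        if self.prev_b is not None:
            if b > self.prev_b:
                self.mode = EXPONENTIAL
                self.declines, self.total_drop = 0, 0.0
            elif b < self.prev_b:
                self.declines += 1
                self.total_drop += self.prev_b - b
                if (self.declines >= self.max_declines
                        and self.total_drop >= self.drop_threshold):
                    # Delay still near the minimum: keep probing gently.
                    self.mode = LINEAR if srtt <= self.eta * rtt_min else STOP
            else:
                self.declines, self.total_drop = 0, 0.0  # assumed behavior
        self.prev_b = b
        return self.mode

sel = GrowthModeSelector(max_declines=2, drop_threshold=10, eta=1.5)
modes = [sel.update(b, srtt=0.06, rtt_min=0.05) for b in (100, 95, 88, 120)]
```

The combination of falling throughput and rising delay is what distinguishes real congestion (stop) from a throughput dip on an otherwise uncongested path (linear growth).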

ZetaTCP can, through the foregoing algorithm, obtain the real-time optimal CWND value, thereby maximizing network throughput and preventing congestion.

## **4 Application of ZetaTCP Optimization Technology in Mobile Internet**

Packet losses in wireless networks are often attributable to signal loss, interference and other causes, and the packet loss rate is higher than in wired networks [8]. As a consequence, standard TCP usually fails to detect these packet losses quickly, which leads to low transmission efficiency, unstable transmission quality, high unpredictability, and poor user experience. By contrast, ZetaTCP can quickly predict packet loss and recover in time in a wireless network environment, making transmission more stable and faster and thereby considerably improving the user experience.

In order to verify the actual effect of the ZetaTCP optimization technology in the wireless core network, the Performance-enhancement Proxy Based Scheme and the LotWan acceleration system using ZetaTCP as the proxy node are introduced to optimize the data transmission of the wireless core network and evaluate its optimization effect.

#### **4.1 ZetaTCP Optimization Deployment Scheme**

As indicated in Fig. 5, the ZetaTCP acceleration device is deployed outside the SGi interface of the PDN-GW, connected transparently in series either between the PDN-GW and the firewall or on the Internet side of the firewall. The device works as a TCP proxy, so that acceleration covers the whole wireless network transmission path.

**Fig. 5.** ZetaTCP acceleration device deployment scheme

#### **4.2 ZetaTCP Optimization Application Results**

The implementation environment covers three kinds of areas with poor wireless network quality: areas with medium-to-low field strength coverage, hotspot areas, and busy-hour areas. The data show that the congestion control algorithm adopted by ZetaTCP achieves relatively faster transmission speeds in all three kinds of areas. The average results of the application assessments are listed in Tables 1, 2 and 3.


**Table 1.** Acceleration effect of web browsing services


**Table 2.** Acceleration effect of file download services

**Table 3.** Acceleration effect of video download services


These results show that, because standard TCP is slow to detect packet losses, its transmission efficiency is often low and its transmission quality unstable and hard to predict, which seriously affects the user experience. With ZetaTCP acceleration in the wireless network environment, the network characteristics of each connection are accumulated through dynamic learning, making the ZetaTCP congestion control algorithm more accurate. ZetaTCP can also predict packet loss very quickly and recover in time, making transmission more stable and faster and thereby significantly improving the user experience.

## **5 Conclusion**

TCP optimization, an essential approach for telecom operators to optimize the wireless core network, remarkably boosts the Internet access rate of mobile phone users, enhances user perception, and adds to the competency of traffic management.

This paper proposes a ZetaTCP-based scheme for improving TCP performance in the wireless network environment, combining it with the enhanced TCP congestion control mechanism and applying it in the production environment of the current network. The experimental results demonstrate that ZetaTCP helps ensure that TCP users obtain the appropriate bandwidth as defined by the flow specification. It can eliminate the unfairness caused by different RTTs when congestion occurs. It both maintains the end-to-end semantics of TCP and takes corresponding measures after distinguishing the types of network packet loss, thereby improving the transmission performance of TCP.

## **References**



## **Traffic Analysis**

## **Research on Adversarial Patch Attack Defense Method for Traffic Sign Detection**

Yanjing Zhang<sup>1</sup>, Jianming Cui<sup>1</sup>, and Ming Liu<sup>2(B)</sup>

<sup>1</sup> School of Information Engineering, Chang'an University, Shaanxi 710064, China

<sup>2</sup> National Computer Network Emergency Response Technical Team/Coordination Center of China, Beijing 100029, China

liuming@cert.org.cn

**Abstract.** Accurate and stable traffic sign detection is a key technology for achieving L3 driving automation, and its performance has been significantly improved by the development of deep learning in recent years. However, current traffic sign detection resists attacks poorly and may lack even basic defense capability. To address this critical issue, this paper proposes IYOLO-TS, an adversarial patch attack defense model. The main innovation is to simulate traffic signs being partially damaged, obscured or maliciously modified in the real world by training attack patches, then to add attacked classes, corresponding to the original detection categories, to the last layer of YOLOv2, and finally to use the trained attack patches for adversarial training of the detection model. The attack patches are obtained by using the RP2 algorithm to attack the detection model, training from a blank patch. To verify the defense effectiveness of the proposed IYOLO-TS model, we constructed a patch dataset, LISA-Mask, containing 33,000 images generated with 50 different masks, and built the training dataset by combining the LISA and LISA-Mask datasets. The experimental results show that the mAP of the proposed IYOLO-TS reaches 98.12%. Compared with YOLOv2, it improves the defense ability against patch attacks while retaining real-time detection ability. The proposed method can therefore be considered highly practical, achieving a tradeoff between design complexity and efficiency.

**Keywords:** Traffic sign detection · Adversarial patch attack · Deep learning

## **1 Introduction**

Traffic sign detection is a key, continuously evolving technology in vision-based advanced driver assistance systems. Its purpose is to provide accurate, real-time and safe traffic sign recognition on complex and dynamic real roads [1]. The most widely used technology is object detection based on Deep Neural Networks (DNN) [2]. However, many recent studies have shown that DNN models are not secure: they are susceptible to adversarial examples, which mislead the classifier into producing incorrect predictions [3–5]. Adversarial patch attacks in the physical world are currently considered a very effective means of attacking such models and have achieved remarkable results in image classification [6], face recognition [7], object detection and other fields [8–10]. To deal with the security threats posed by patch attacks, a growing number of researchers have begun to study defense methods. However, current research focuses mainly on image classification, and there are few reports on traffic sign detection. In addition, traditional image pre-processing methods, such as image denoising [11], local gradient smoothing [12], and partial occlusion [13], reduce the detection accuracy on the original samples, and most of them are designed to operate in the digital space and are ineffective in the physical world.

The YOLO (You Only Look Once) series consists of one-stage object detectors that directly output bounding boxes and categories. Compared with RCNN (Region-Convolutional Neural Networks), Faster-RCNN and other two-stage networks, YOLO has a lighter structure, fewer parameters, and higher speed. It is therefore better suited to application research in automatic driving, which demands both real-time performance and accuracy [14]. Compared with v3–v5, YOLOv2 requires less computation in forward inference [15–18] and maintains a relatively high mAP (mean Average Precision) on the COCO dataset under the same input scale. In addition, in automatic driving, object detection models are mostly deployed on edge devices for inference, where model storage space and computing resources are limited [19]. YOLOv2 consists mainly of convolutional layers and softmax, which makes it easier to implement on mobile devices and allows inference to be accelerated with small graphics cards. The interesting and challenging question addressed here is therefore how to extend YOLOv2 to traffic sign detection and achieve a stable defense capability.

To solve the above problems, we propose IYOLO-TS (Improved YOLOv2 on Traffic Signs), an adversarial patch defense model for traffic sign detection. The main contributions can be summarized as follows: (1) We extend the research on patch attack defense to the field of traffic sign detection and propose a practical defense model, IYOLO-TS. (2) We improve the last layer of the YOLOv2 model by adding 11 attacked classes, and optimize its structure to preserve high detection performance on normal traffic signs. (3) To achieve high robustness and a more realistic style of perturbation, we adopt the RP2 algorithm [8] to attack YOLOv2 and develop a patch dataset named LISA-Mask.

## **2 Improved YOLOv2 on Traffic Signs Detection Model**

#### **2.1 Framework Design of IYOLO-TS**

Figure 1 provides an overview of IYOLO-TS. In terms of network structure, IYOLO-TS adds 11 attacked categories to the last softmax layer. As a result, IYOLO-TS can detect attacked targets while also assigning them to their true classes, as illustrated in the right part of Fig. 1.

**Fig. 1.** Framework of IYOLO-TS. We sample from each category of LISA and LISA-Mask to train IYOLO-TS. IYOLO-TS retains the network structure of YOLOv2 except for the final softmax layer, to which 11 attacked categories are added. The right part of the figure shows the detection result of IYOLO-TS on an attacked traffic sign.

The basic idea of YOLOv2 is to represent the output of the feature map as the center, width and height of the bounding box, together with the confidence and category. YOLOv2 divides the input image *x* into *N* preselected areas, and each area predicts *M* anchor boxes. Assuming that there are *n* classes to be identified (*n* = 11 for the LISA and LISA-Mask datasets), each anchor box can be written as an (*n* + 5)-dimensional vector. The result of the feature map for each anchor box can be expressed as:

$$\left\langle \hat{X}, \hat{Y}, W, H, P\_{obj}, P\_{cls1}, \dots, P\_{clsn} \right\rangle \tag{1}$$

where *X̂*, *Ŷ*, *W*, *H* are the center and size of the bounding box, *P*<sub>obj</sub> is the confidence score indicating the probability that the bounding box contains a target, and *P*<sub>cls*i*</sub> is the score of class *i*. The anchor boxes are then arranged in order, so that each preselected area outputs a vector of dimension *M*(*n* + 5), and the output of YOLOv2 is a vector of dimension *NM*(*n* + 5). IYOLO-TS inherits the form of the YOLOv2 loss function and adds a loss term for the attacked class scores. Because we add 11 attacked categories to the last softmax layer of YOLOv2, the length of each anchor box vector becomes (*n* + 5 + 11), and the final output becomes a vector of dimension *NM*(*n* + 5 + 11). This gives IYOLO-TS two advantages: the detection speed inherited from YOLOv2 meets the time-sensitive requirements of defending against physical-world attacks, and the model can also be used to detect attacks.
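As a worked example of these dimensions, the sketch below computes the output sizes for *n* = 11. The grid of 13 × 13 preselected areas and *M* = 5 anchor boxes are assumed values typical of YOLOv2 configurations; the paper does not state *N* or *M*.

```python
def output_size(num_areas, anchors_per_area, num_classes, attacked_classes=0):
    """Return (per-anchor vector length, total output dimension)."""
    per_anchor = num_classes + 5 + attacked_classes  # 4 box terms + objectness + class scores
    return per_anchor, num_areas * anchors_per_area * per_anchor

N, M, n = 13 * 13, 5, 11                      # N and M are assumptions
yolo_v2 = output_size(N, M, n)                # original YOLOv2 head
iyolo_ts = output_size(N, M, n, attacked_classes=11)
```

Under these assumptions the per-anchor vector grows from 16 to 27 entries, so the defense adds output channels only in the last layer and leaves the backbone cost unchanged.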

#### **2.2 RP2-Based Attacking Process**

To achieve high robustness and a more realistic style of perturbation, we use the method in [8] to attack the YOLOv2 detector. To generate visual adversarial perturbations that are robust under different physical conditions, the RP2 algorithm is first derived without considering physical conditions, starting from the optimization method for generating a perturbation for a single image *x*. The algorithm is then updated to account for continuous changes in the distance and angle of the camera relative to the road sign. The optimization problem of RP2 for a single image is expressed as:

$$\arg\min\_{\delta} \lambda \|\delta\|\_{p} + J\left[f\_{\theta}(\mathbf{x} + \delta), \mathbf{y}^{\*}\right] \tag{2}$$

where *J*(·) is the loss function, which measures the difference between the prediction of the model and the target class *y*∗, *x* is the input, δ denotes the perturbation of the input *x*, *f*<sub>θ</sub>(·) denotes the target classifier, and λ is the hyperparameter that controls the regularization of the distortion. The distance function is specified as ‖δ‖<sub>*p*</sub>, the *p*-norm of δ. To better capture the effects of changing physical conditions, a portion of the experimental samples are generated with random noise and added to the algorithm iterations. To ensure that the perturbation is applied only to the surface of the target object, a mask is introduced to limit the physical region of the perturbation. The final robust, spatially constrained perturbation is optimized as:

$$\arg\min\_{\delta} \lambda \|\mathbf{M}\_{\mathbf{x}} \cdot \delta\|\_{p} + \mathrm{NPS} + \mathbb{E}\_{\mathbf{x}\_{i} \sim X^{V}} J\left[f\_{\theta}\left(\mathbf{x}\_{i} + T(\mathbf{M}\_{\mathbf{x}} \cdot \delta)\right), \mathbf{y}^{\*}\right] \tag{3}$$

where the matrix *M*<sub>x</sub> represents the mask, *NPS* is the non-printability score, *x*<sub>i</sub> is an image sampled from the distribution *X*<sup>V</sup> of images taken under varying physical conditions, and *T*(·) is the alignment function that maps transformations of the object onto the perturbation. Since all perturbation values must be reproducible in the physical world, and printers introduce some error when reproducing colors [20], RP2 adds the term *NPS* to the objective function to model printer color reproduction errors. During an attack, forged patches generated under the constraints of different masks can simulate common acts of vandalism that most people would overlook. Such attacks in the physical world are highly disruptive to traffic sign detectors, so it is imperative to develop appropriate defense strategies.
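To make the three terms of the masked objective concrete, here is a heavily simplified sketch. It assumes flattened grayscale images, takes *T*(·) to be the identity, evaluates the non-printability score only on pixels inside the mask, and uses a caller-supplied `loss_fn` in place of the detector loss; none of these choices come from the paper.

```python
def p_norm(values, p=2):
    return sum(abs(v) ** p for v in values) ** (1.0 / p)

def non_printability_score(pixels, printable_colors):
    """For each pixel, multiply its distances to all printable colors, then sum."""
    total = 0.0
    for pixel in pixels:
        product = 1.0
        for color in printable_colors:
            product *= abs(pixel - color)
        total += product
    return total

def rp2_objective(delta, mask, images, target, loss_fn, lam, printable_colors, p=2):
    masked = [m * d for m, d in zip(mask, delta)]            # M_x . delta
    regularizer = lam * p_norm(masked, p)                     # lambda * ||M_x . delta||_p
    patch_pixels = [d for m, d in zip(mask, masked) if m]     # pixels inside the mask
    nps = non_printability_score(patch_pixels, printable_colors)
    losses = [loss_fn([x + d for x, d in zip(img, masked)], target) for img in images]
    return regularizer + nps + sum(losses) / len(losses)     # mean stands in for E[.]

def toy_loss(prediction, target):
    """Hypothetical stand-in for J: squared distance to the target value."""
    return sum((v - target) ** 2 for v in prediction)

value = rp2_objective([1.0, 1.0], [1, 0], [[0.0, 0.0]], 0.0, toy_loss, 0.5, [1.0])
```

In an actual attack the objective would be minimized over `delta` by gradient descent through the detector; the sketch only evaluates it once to show how the mask restricts where the perturbation can act.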

#### **2.3 Generating of LISA-Mask Dataset**

To make IYOLO-TS more generalizable and effective in defending against various patch attacks, we generate 50 different masks and construct a new dataset named LISA-Mask to help train IYOLO-TS.

During the patch generation experiments, we found that the location of a patch affects the effectiveness of the attack, and that each mask produces a different attack effect. To simulate realistic random attack scenarios as closely as possible, 50 different masks are produced in this paper by constraining the size, distance, number and shape of the masked regions. Unlike other object detection datasets, in which the whole object area can serve as the region of interest for the attack, the masks in this paper must be limited in size so that they do not obscure the whole pattern of the traffic sign.

The success rate of the attack can be expressed as follows:

$$\frac{\sum_{c \in C} \left\{ f_{\theta} \left[ A \left( c^{d,g} \right) \right] = y^{*} \land f_{\theta} \left( c^{d,g} \right) = y \right\}}{\sum_{c \in C} \left[ f_{\theta} \left( c^{d,g} \right) = y \right]} \tag{4}$$

where *C* is the original image set, *c*<sup>*d*,*g*</sup> denotes an image taken from distance *d* and angle *g*, and *A*(*c*<sup>*d*,*g*</sup>) denotes the corresponding image after the attack, whose classification result is expected to be incorrect. *y* is the actual class label of the target, and *y*<sup>∗</sup> is the detection result of the target after the attack. Each braced or bracketed condition contributes 1 to the sum when it holds and 0 otherwise, so only images that are detected correctly before the attack are counted. Figure 2 shows some of the generated masks and their attack success rates. Different kinds of masks lead to different degrees of degradation in YOLO's inference results, i.e., physical attacks on traffic signs can be simulated to some extent.

**Fig. 2.** Some of the masks and their attack success rates.
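As a rough illustration of Eq. (4), the sketch below counts how many correctly detected clean images are detected as *y*<sup>∗</sup> once the patch is applied. It is not the authors' implementation: the function name `attack_success_rate`, the callables `detector` and `attack`, and the toy lookup table are all hypothetical, and *y*<sup>∗</sup> is treated here as a fixed label supplied by the caller.

```python
from typing import Callable, Hashable, Iterable, Tuple


def attack_success_rate(
    samples: Iterable[Tuple[str, Hashable]],
    detector: Callable[[str], Hashable],
    attack: Callable[[str], str],
    target_label: Hashable,
) -> float:
    """Eq. (4): share of correctly detected clean images that are detected
    as `target_label` (y*) after the attack A is applied."""
    correct_clean = 0  # denominator: f(c) == y
    fooled = 0         # numerator: f(A(c)) == y* and f(c) == y
    for image, true_label in samples:
        if detector(image) != true_label:
            continue  # images misdetected before the attack are not counted
        correct_clean += 1
        if detector(attack(image)) == target_label:
            fooled += 1
    return fooled / correct_clean if correct_clean else 0.0


# Toy stand-ins (assumptions): "images" are strings, the detector is a lookup table.
predictions = {
    "stop_1": "stop", "stop_2": "stop", "stop_3": "yield",
    "stop_1+patch": "speedlimit", "stop_2+patch": "stop", "stop_3+patch": "speedlimit",
}
samples = [("stop_1", "stop"), ("stop_2", "stop"), ("stop_3", "stop")]
rate = attack_success_rate(samples, predictions.get, lambda img: img + "+patch", "speedlimit")
print(rate)  # 0.5: two clean images are detected correctly, one of them is fooled
```

Note that `stop_3` is excluded from both sums because the detector already fails on the clean image, which is exactly what the denominator of Eq. (4) enforces.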

**Fig. 3.** The generation process of the LISA-Mask dataset.

Figure 3 shows the generation process of the LISA-Mask dataset. First, YOLOv2 is trained on the LISA training set, and the resulting model is named *Model0*. Then, 50 different masks are generated using the aforementioned method. Next, *Model0* is attacked under each mask using the method in [8]: the difference between the detection results and the true labels is added to the loss function, and the attack patches are updated by back-propagation. Finally, the generated patches are applied to LISA, and the resulting attacked images form the LISA-Mask dataset. The produced dataset covers 11 categories of traffic sign images, each containing 3000 images that were attacked 50 times, for a total of 33,000 images.

## **3 Experiments and Results**

#### **3.1 Test Bench Setup**

To evaluate the proposed work, we constructed the experimental data according to the structure in Fig. 4. First, the LISA-Mask and LISA datasets are merged. There are 11 types of targets, and each type is divided into clean data and attacked data. Then, to keep the data balanced during training, three augmentation methods (contrast, brightness and sharpness changes) are applied to the categories with fewer than 100 images in the LISA dataset. We do not recommend cropping, mirroring, rotation or similar augmentation methods, because such complex situations are uncommon in driving detection tasks. Finally, we randomly selected two hundred images from each category to construct the experimental dataset, which is split into an 80% training set and a 20% test set.

**Fig. 4.** Construction structure of the experimental dataset.
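The sampling and split described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `build_experimental_split`, the file names and the fixed random seed are assumptions, and since the text does not say whether the 80/20 split is stratified, the sketch simply shuffles the pooled samples.

```python
import random
from typing import Dict, List, Tuple

Sample = Tuple[str, str]  # (image path, class name)


def build_experimental_split(
    images_by_class: Dict[str, List[str]],
    per_class: int = 200,
    train_fraction: float = 0.8,
    seed: int = 0,
) -> Tuple[List[Sample], List[Sample]]:
    """Draw `per_class` random images per class, then split train/test."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    selected: List[Sample] = []
    for label in sorted(images_by_class):
        paths = images_by_class[label]
        count = min(per_class, len(paths))  # guard against small classes
        selected.extend((path, label) for path in rng.sample(paths, count))
    rng.shuffle(selected)
    cut = round(len(selected) * train_fraction)
    return selected[:cut], selected[cut:]


# Hypothetical layout: 11 sign types, each with a clean and an attacked ("-ad") class.
classes = [f"sign{i}{suffix}" for i in range(11) for suffix in ("", "-ad")]
pool = {name: [f"{name}_{k}.png" for k in range(250)] for name in classes}
train_set, test_set = build_experimental_split(pool)
print(len(train_set), len(test_set))  # 3520 880
```

With 22 classes and 200 images each, the pool holds 4400 samples, so the 80/20 split yields 3520 training and 880 test samples.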

For all experiments, we use TensorFlow 1.14 and a P4000 GPU for training. YOLO is trained with the Adam optimizer with a learning rate of 0.01 and a batch size of 32. For training the adversarial patches, SGD is used with a learning rate of 0.01 and a decay rate of 0.1.

#### **3.2 Object Detection**

#### **Object Selection Performance Analysis of IYOLO-TS on Clean Dataset**

To evaluate the performance of IYOLO-TS, we calculate the AP of YOLOv2 and IYOLO-TS for each class on the LISA test set, as shown in Table 1. The AP of IYOLO-TS is only slightly lower than that of YOLOv2 for each class. On average, the mAP of IYOLO-TS is 97.75%, which is only 1.25% lower than that of YOLOv2, indicating that IYOLO-TS maintains strong traffic sign detection performance.


**Table 1.** Performance of YOLOv2 and IYOLO-TS on the LISA test set

#### **Analysis of the Validity of IYOLO-TS Defense Detection**

To evaluate the defensive capability of IYOLO-TS, we calculated the AP of each class on the dataset. IYOLO-TS can distinguish the adversarial samples from the clean data, and its mAP reaches 98.12%. Table 2 shows the detection AP of IYOLO-TS for all classes of images, which indicates strong defense detection performance. Figure 5 shows the performance of IYOLO-TS and YOLOv2 against patch attacks. Compared to YOLOv2, IYOLO-TS achieves higher metrics in all 10 classes of signs other than the signalahead class, which shows a stronger defense on attacked data.

**Fig. 5.** Performance comparison of IYOLO-TS and YOLOv2 against patch attacks.

Figure 6 shows the defense effect on LISA-Mask. The attacked addedlane sign tricks YOLOv2 into identifying it as the merge class, whereas IYOLO-TS correctly identifies the attacked target.


**Table 2.** IYOLO-TS AP for 22 classes of images, where classes indicate clean traffic signs data and classes-ad indicate attacked traffic signs data

**Fig. 6.** Performance of YOLOv2 and IYOLO-TS for detection of attacked added lane.

In addition, IYOLO-TS adds 11 attacked classes to the structure of YOLOv2, and Fig. 7 shows the detection results for some of these attacked classes. IYOLO-TS is able not only to correctly identify the attacked traffic sign, but also to distinguish whether the traffic sign is under attack. This shows that IYOLO-TS has good detection ability for different kinds of patch attacks.

**Fig. 7.** Detection results of partially attacked classes.

#### **3.3 Analysis of the Effectiveness of Patch Attack Defense**

In order to evaluate the defensive capability of IYOLO-TS, we test IYOLO-TS under white-box attacks and physical world attacks respectively.

#### **Defense Effectiveness Analysis under White-box Attacks**

We continue with the LISA-Mask generation process, by using RP2 to generate the patch dataset LISA-Mask0 against IYOLO-TS. First, IYOLO-TS was trained on the LISA training set, and then images with the patch attack were generated on the LISA dataset using RP2 against the trained IYOLO-TS to obtain the LISA-Mask0 dataset. Then, the generated patch dataset LISA-Mask0 was used to test the IYOLO-TS model. Table 3 shows the performance of IYOLO-TS against white-box attacks.



As can be seen from Table 3, except for laneend and merge, which have an accuracy of about 90%, all other classes have AP values higher than 94%, indicating that IYOLO-TS still shows a strong defense capability in the face of new attacks.

#### **Defense Effectiveness Analysis under Physical World Attacks**

To verify the practical usefulness of the proposed model, the defensive performance of IYOLO-TS in the physical world was tested. In the experiments, the generated adversarial patches are printed and attached to traffic signs to compare the defense effectiveness of YOLOv2 and IYOLO-TS. As shown in panels (a), (d), (g) and (j) of Fig. 8, YOLOv2 produces incorrect detections under the generated adversarial patches, whereas panels (b), (c), (e), (f), (h), (i), (k) and (l) show that IYOLO-TS can distinguish clean data from attacked data under physical attacks.

**Fig. 8.** Physical world attack test sample.

## **4 Conclusion and Future Work**

In this paper, an improved defense model, IYOLO-TS, was proposed to improve the attack resistance of traffic sign detection. First, masks under multi-scale and multi-constraint conditions were built to simulate random, multi-type attacks in the physical world, and the first test dataset, LISA-Mask, was constructed through annotation fusion. On this basis, 11 attacked classes were added to the YOLOv2 network structure, so that the model can distinguish attacked samples from original samples while maintaining its detection capability. In the experiments, we compared the detection performance of IYOLO-TS and YOLOv2, and carried out performance tests and analyses under white-box attacks and physical-world attacks respectively. The experimental results show that IYOLO-TS has good defense ability against adversarial patch attacks from the physical world. However, the ways in which real road traffic signs are obscured or damaged go far beyond what this study can simulate at this stage. In addition, vehicle speed, weather, lighting and other factors directly affect the processing efficiency of the model. Therefore, in future work, optimizing the model to adapt to dynamic environments and achieving more accurate and interpretable detection are important and interesting research topics.

**Acknowledgements.** This work is supported by the National Natural Science Foundation of China (Grant No. 62106060).

## **References**



**Threat Intelligence**

## **Research on Named Entity Recognition Method of Network Threat Intelligence**

Keke Zhang<sup>1</sup>, Xu Chen<sup>1(B)</sup>, Yongjun Jing<sup>1</sup>, Shuyang Wang<sup>1</sup>, and Lijun Tang<sup>2</sup>

<sup>1</sup> North Minzu University, Yinchuan Ningxia 750021, China chenxu@nmu.edu.cn

<sup>2</sup> Ningxia University, Yinchuan Ningxia 750021, China

**Abstract.** With the continuous emergence of new network threats, the rise of Cyber Threat Intelligence (CTI) technology provides a new way to turn passive defense into active prediction. CTI technology can obtain all kinds of network security threat intelligence in a timely and effective manner, helping security personnel quickly identify attacks and make effective decisions in time. However, threat intelligence not only contains a large amount of redundant information, but also suffers from mixed Chinese and English text, fuzzy entity boundaries, and polysemy of security-related entities. Identifying complex and valuable information in this data has therefore become a great challenge. To address these problems, a named entity recognition model for network threat intelligence based on BERT-BiLSTM-Self-Attention-CRF is proposed to identify complex network threat intelligence entities in text. First, dynamic word vectors are obtained through BERT to fully represent semantic information and solve the problem of polysemy. The obtained word vectors are then used as the input of BiLSTM, which produces context feature vectors. Next, the BiLSTM output is fed into the self-attention mechanism to capture the correlations within the data or features, and finally the result is input into CRF for annotation. To verify the effectiveness of the model, experiments are carried out on the constructed network threat intelligence dataset. The results show that the model significantly improves threat intelligence named entity recognition compared with several other classical models.

**Keywords:** Cybersecurity · Named entity recognition · BERT

## **1 Introduction**

With the acceleration of the world's digitization, the network environment is becoming more and more complex. At the same time, network attacks are becoming industrialized, and attack methods are becoming increasingly diverse. The traditional approach of building defense strategies and deploying products based on experience struggles to detect [1], intercept, analyze and respond in a timely and effective manner to emerging new, persistent, and advanced threats [2]. In this context, Cyber Threat Intelligence (CTI) [3] technology came into being. As an important form of network security knowledge, it can support the construction of a more active network security defense [4] mode. Based on all-around intelligence perception and multi-dimensional fusion analysis, it can assess the overall network security situation and reasonably predict threat trends, so as to achieve dynamic and accurate responses to network security threats. However, existing network threat intelligence is also mixed with a large amount of invalid or interfering information. How to obtain critical threat intelligence entity information (such as organizations, software, and vulnerability numbers) more effectively has become the focus of current research. Applying named entity recognition (NER) technology [4] to the field of network threat intelligence can effectively solve the problem of extracting important security entity information from unstructured threat intelligence text. Automatically identifying and classifying network security entities in Internet information, such as software, vulnerabilities, attack methods, and related network terms, is an important step in constructing a knowledge graph for network security [5].

## **2 Related Work**

In the early stage, NER tasks were performed using rule-based and dictionary-based approaches, which achieved good results when very comprehensive rules and dictionaries were formulated, but at great cost, so machine learning methods were considered to improve the accuracy of NER. Mulwad V et al. [6] identified potential vulnerability descriptions through an SVM classifier and used the Wikitology knowledge base to identify vulnerabilities, threats, and attacks in Web text. Since SVM cannot take context information into account, Joshi A et al. [7] used a CRF-based system to identify important entities and concepts related to network security in a given text. Part-of-speech (POS) information can also be added to further improve NER performance: Weerawardhana S et al. [8] identified the key PAG parameters embedded in vulnerability description text using machine learning and POS, including software name, version, impact, attacker operation, and user operation. Their experiments on entity recognition in the field of network security showed that the POS method provides a viable alternative to machine learning.

Although machine learning [9] has brought some improvement on NER tasks in network security, it requires network security researchers to label security data, which is extremely costly. As a branch of machine learning, deep learning has become increasingly popular in recent years, and some researchers have applied it to named entity recognition for network threat intelligence. Pingchuan Ma et al. [10] proposed a BiLSTM-CRF method to extract security-related concepts and entities from unstructured text and evaluated the model on open-source data in terms of P, R, and F1-score, with good results. Wu H et al. [11] added a domain dictionary matching correction method to BiLSTM-CRF, using BiLSTM to automatically capture context features, CRF to learn label constraint rules, and an ontology domain dictionary for matching correction. Qin Y et al. [12] added a feature template (FT) to BiLSTM-CRF to extract local context features, and a CNN to extract character-level features of security entities, such as malware and vulnerabilities with English names. Li T et al. [13] proposed a neural network model based on self-attention to identify entities: on top of the existing BiLSTM-CRF model, a self-attention mechanism was added to extract more context information related to the current word in a sentence. Han Zhang et al. [14] added a GAN to BiLSTM-Attention-CRF to obtain labeled data and address the lack of labeled data in network security. P Evangelatos et al. [15] proposed using a transformer to extract named entities in threat intelligence and verified its validity through experiments on the threat intelligence dataset DNRTI [16].

However, the named entities of network threat intelligence exhibit polysemy. The word vectors obtained by word2vec and GloVe are static and cannot solve this problem. At the same time, BiLSTM alone cannot obtain enough information about the current word. Therefore, this paper proposes a BERT-BiLSTM-CRF named entity recognition method that incorporates a self-attention mechanism. The pre-trained language model BERT (Bidirectional Encoder Representations from Transformers) [17] produces dynamic word vectors: it adjusts the embedding of a word according to the semantics of its context, better expresses the relationships between words and sentences, and solves the problem of polysemy. In addition, the self-attention mechanism pays more attention to the important words related to the target entity in a sentence, so it can better capture the interdependence between the current word and other words and extract more context information related to the current word.

## **3 BERT-BiLSTM-Self-attention-CRF Model**

The BERT-BiLSTM-Self-attention-CRF model is divided into four parts: the BERT pre-trained language model, the BiLSTM layer, the self-attention layer, and the CRF layer. The unstructured text is converted into dynamic word vectors through BERT, and the word vectors are then used as the input to BiLSTM. The context feature information is obtained from the forward LSTM and the reverse LSTM, and the self-attention mechanism then selectively pays more attention to important information and assigns it higher weights. Finally, the sequence is labeled in BIO format by the CRF. The model structure is shown in Fig. 1.

**Fig. 1.** BERT-BiLSTM-self-attention-CRF model architecture

#### **3.1 BERT Model**

The language model is one of the most important parts of named entity recognition: it transforms the input unstructured text into word vectors. Word2Vec [18] was originally used to obtain word vector representations in research on named entity recognition for network threat intelligence. Its core idea is to obtain the vectorized representation of a word from its context, using either Skip-gram or CBOW. The former predicts the surrounding words from a given central word, and the latter predicts the central word from the given context. In addition, the GloVe [19] method obtains word vector representations from a co-occurrence matrix and considers both local and global information. However, Word2Vec and GloVe both produce static word vectors, so the representation of a word is the same in different contexts, whereas complex network security texts contain polysemous words. To solve this problem, this paper adopts the BERT pre-trained language model, which generates dynamic word vector representations and thus addresses polysemy.

BERT adopts the encoder part of the bidirectional Transformer and has two pre-training tasks. The first task is the Masked Language Model, which randomly masks 15% of the words in the input text with MASK and then infers the masked words from the context. The second task, built on the first, is next sentence prediction: two sentences are randomly selected from the pre-training text and labeled IsNext/NotNext, and the model predicts whether the second sentence follows the first. Figure 2 shows the structure of the BERT model.

**Fig. 2.** BERT architecture
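The masking step of the first pre-training task can be sketched in a few lines. This is a simplified illustration under stated assumptions: the function name `mask_tokens`, the example sentence and the fixed seed are hypothetical, and every selected token is replaced by the mask symbol, whereas the published BERT recipe also leaves some selected tokens unchanged or swaps them for random tokens.

```python
import random
from typing import Dict, List, Tuple

MASK_TOKEN = "[MASK]"


def mask_tokens(
    tokens: List[str], mask_rate: float = 0.15, seed: int = 0
) -> Tuple[List[str], Dict[int, str]]:
    """Hide about `mask_rate` of the tokens; return the masked sequence and
    the original tokens (by position) that the model has to recover."""
    if not tokens:
        return [], {}
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    targets: Dict[int, str] = {}
    for pos in positions:
        targets[pos] = tokens[pos]  # prediction target for this position
        masked[pos] = MASK_TOKEN
    return masked, targets


sentence = (
    "the attacker exploited a buffer overflow in the web server "
    "to run code remotely and then quietly installed a backdoor"
).split()
masked_sentence, targets = mask_tokens(sentence)
print(masked_sentence)
print(targets)
```

The dictionary `targets` plays the role of the training labels: the model sees `masked_sentence` and is asked to predict the hidden words from the surrounding context.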

The input representation of BERT consists of three parts: Token Embedding, Segment Embedding, and Position Embedding. These three vectors are summed to form the final input, feature extraction is performed in the encoder part of the bidirectional Transformer, and a sequence vector with rich semantics is finally obtained. The input representation is shown in Fig. 3.


**Fig. 3.** Input representation of BERT
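The element-wise sum of the three embeddings can be illustrated with tiny hand-written lookup tables. The two-dimensional vectors, the table contents and the function name `input_representation` are assumptions made for the example; in a trained model these tables are learned and far larger.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def input_representation(
    token_ids: Sequence[int],
    segment_ids: Sequence[int],
    token_emb: Dict[int, Vector],
    segment_emb: Dict[int, Vector],
    position_emb: Sequence[Vector],
) -> List[Vector]:
    """Sum token, segment and position embeddings for every input position."""
    output: List[Vector] = []
    for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        summed = [
            t + s + p
            for t, s, p in zip(token_emb[tok], segment_emb[seg], position_emb[pos])
        ]
        output.append(summed)
    return output


# Toy 2-dimensional tables (hypothetical values).
token_emb = {0: [1.0, 0.0], 1: [0.0, 1.0]}
segment_emb = {0: [0.5, 0.5], 1: [-0.5, -0.5]}
position_emb = [[0.0, 0.0], [0.25, 0.125]]

vectors = input_representation([0, 1], [0, 0], token_emb, segment_emb, position_emb)
print(vectors)  # [[1.5, 0.5], [0.75, 1.625]]
```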

#### **3.2 BiLSTM Layer**

Traditional neural networks cannot memorize the context of the input or infer content from previous information. This paper uses LSTM to address this problem. The model has a memory function and can better capture long-distance dependencies. Through training, it learns which information needs to be forgotten and which needs to be remembered. Its structure is shown in Fig. 4.

**Fig. 4.** LSTM structure

Its structure is composed of a forget gate, a memory gate and an output gate, and it is controlled by the cell state. The LSTM is computed as follows:

$$f_t = \sigma\left(W_f \cdot \left[h_{t-1}, x_t\right] + b_f\right) \tag{1}$$

$$i_t = \sigma\left(W_i \cdot \left[h_{t-1}, x_t\right] + b_i\right) \tag{2}$$

$$\widetilde{C}_t = \tanh\left(W_C \cdot \left[h_{t-1}, x_t\right] + b_C\right) \tag{3}$$

$$C_t = f_t * C_{t-1} + i_t * \widetilde{C}_t \tag{4}$$

$$o_t = \sigma\left(W_o \cdot \left[h_{t-1}, x_t\right] + b_o\right) \tag{5}$$

$$h_t = o_t * \tanh(C_t) \tag{6}$$

where *x<sub>t</sub>* is the input vector, *f<sub>t</sub>* is the forget gate, *i<sub>t</sub>* is the memory gate, *o<sub>t</sub>* is the output gate, *C<sub>t</sub>* is the cell state at time *t*, and *h<sub>t</sub>* is the hidden state at time *t*. The matrices *W* and vectors *b* are the weights and biases of the corresponding gates, and σ is the sigmoid function.
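To make Eqs. (1) to (6) concrete, the following sketch performs one LSTM time step with plain Python lists. It is only an illustration: the helper names (`sigmoid`, `affine`, `lstm_step`), the dictionary layout of the parameters and the all-zero toy weights are assumptions, and a real model would use a tensor library with trained weights.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def affine(W: List[Vector], b: Vector, v: Sequence[float]) -> Vector:
    """Compute W . v + b for a weight matrix W (one row per hidden unit)."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(
    x_t: Vector, h_prev: Vector, c_prev: Vector, params: Dict[str, list]
) -> Tuple[Vector, Vector]:
    """One time step of Eqs. (1)-(6); returns (h_t, C_t)."""
    z = list(h_prev) + list(x_t)  # concatenation [h_{t-1}, x_t]
    f_t = [sigmoid(a) for a in affine(params["W_f"], params["b_f"], z)]        # Eq. (1)
    i_t = [sigmoid(a) for a in affine(params["W_i"], params["b_i"], z)]        # Eq. (2)
    c_tilde = [math.tanh(a) for a in affine(params["W_C"], params["b_C"], z)]  # Eq. (3)
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]     # Eq. (4)
    o_t = [sigmoid(a) for a in affine(params["W_o"], params["b_o"], z)]        # Eq. (5)
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]                         # Eq. (6)
    return h_t, c_t


# Toy parameters (hidden size 1, input size 1): all weights and biases are zero,
# so every gate equals sigmoid(0) = 0.5 and the candidate state is tanh(0) = 0.
params = {name: [[0.0, 0.0]] for name in ("W_f", "W_i", "W_C", "W_o")}
params.update({name: [0.0] for name in ("b_f", "b_i", "b_C", "b_o")})
h_t, c_t = lstm_step(x_t=[1.0], h_prev=[0.0], c_prev=[2.0], params=params)
print(c_t)  # [1.0]: half of the previous cell state is kept
```

With these toy weights the forget gate halves the previous cell state and the candidate state contributes nothing, which makes the role of each gate in Eq. (4) easy to check by hand.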

However, a single LSTM cannot encode information from back to front. Adding a reverse LSTM makes it possible to capture the following context as well; that is, the BiLSTM model can better capture bidirectional semantics, as shown in Fig. 5.

**Fig. 5.** BiLSTM model structure

In this paper, the word vectors output by the BERT layer are used as the input of the forward LSTM and the reverse LSTM to obtain the forward feature information *h<sub>t</sub>* and the reverse feature information *h′<sub>t</sub>*, and the two are then concatenated to obtain the final hidden state *H<sub>t</sub>*, as shown below:

$$H_t = \left[ h_t, h'_t \right] \tag{7}$$
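The concatenation in Eq. (7) can be sketched independently of the cell type. In the example below the recurrent step is passed in as a function, and a simple running sum stands in for the LSTM cell; the names `bidirectional_states` and `running_sum` are hypothetical and chosen only for this illustration.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Step = Callable[[Vector, Vector], Vector]  # (input x_t, previous hidden) -> new hidden


def bidirectional_states(
    inputs: Sequence[Vector],
    forward_step: Step,
    backward_step: Step,
    hidden_size: int,
) -> List[Vector]:
    """Return H_t = [h_t, h'_t] for every position t (Eq. 7)."""
    forward: List[Vector] = []
    h = [0.0] * hidden_size
    for x in inputs:  # left-to-right pass
        h = forward_step(x, h)
        forward.append(h)

    backward: List[Vector] = [[] for _ in inputs]
    h = [0.0] * hidden_size
    for t in range(len(inputs) - 1, -1, -1):  # right-to-left pass
        h = backward_step(inputs[t], h)
        backward[t] = h

    return [f + b for f, b in zip(forward, backward)]  # list concatenation


# Toy recurrent step standing in for an LSTM cell: a running sum of the inputs.
def running_sum(x: Vector, h_prev: Vector) -> Vector:
    return [a + b for a, b in zip(x, h_prev)]


states = bidirectional_states([[1.0], [2.0], [3.0]], running_sum, running_sum, hidden_size=1)
print(states)  # [[1.0, 6.0], [3.0, 5.0], [6.0, 3.0]]
```

Each output vector holds a summary of everything to the left of the position (first half) and everything to the right of it (second half), which is the property the BiLSTM layer relies on.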

#### **3.3 Self-attention Layer**

To better capture the useful information in threat intelligence text, this paper adds a self-attention mechanism after BiLSTM. It captures the correlations between vectors, selectively pays more attention to important information in the feature vectors output by the BiLSTM layer by giving it higher weights, and gives lower weights to other information. The self-attention mechanism in this paper is computed as follows.

First, the hidden state output by the BiLSTM layer is denoted *H<sub>t</sub>*, and the matrices *Q*, *K*, and *V* are obtained by linear mappings of *H<sub>t</sub>*:

$$\begin{aligned} Q &= H_t W^{Q} \\ K &= H_t W^{K} \\ V &= H_t W^{V} \end{aligned} \tag{8}$$

where *W<sup>Q</sup>*, *W<sup>K</sup>*, and *W<sup>V</sup>* are parameters learned during training. The output is then calculated by scaled dot-product attention:

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{9}$$

where *d<sub>k</sub>* is the dimension of the key vectors. The factor 1/√*d<sub>k</sub>* prevents the dot products from becoming too large. Finally, the scores are normalized by the Softmax function and multiplied by *V* to obtain the result.
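Eqs. (8) and (9) can be traced with a small list-based sketch. The helper names (`matmul`, `transpose`, `softmax`, `self_attention`) and the identity projection matrices are assumptions for illustration; a trained model would learn the projections and run on a tensor library.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(A: Matrix, B: Matrix) -> Matrix:
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A: Matrix) -> Matrix:
    return [list(col) for col in zip(*A)]


def softmax(row: List[float]) -> List[float]:
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(H: Matrix, W_q: Matrix, W_k: Matrix, W_v: Matrix) -> Tuple[Matrix, Matrix]:
    """Scaled dot-product self-attention over the rows of H (Eqs. 8 and 9)."""
    Q, K, V = matmul(H, W_q), matmul(H, W_k), matmul(H, W_v)  # Eq. (8)
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))                           # Q K^T
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V), weights                         # Eq. (9)


# Two positions with 2-dimensional hidden states; identity projections (toy values).
H = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
output, weights = self_attention(H, identity, identity, identity)
print(weights)  # each row sums to 1; each position attends most to itself here
```

Each row of `weights` is the attention distribution of one position over all positions, so important positions receive higher weights and the remaining ones lower weights, as described above.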

#### **3.4 CRF Layer**

A conditional random field (CRF) is a conditional probability model used to find the label sequence with the maximum probability. In the threat intelligence NER task, BiLSTM is good at processing long-distance text information, but cannot deal with the dependency between adjacent tags. CRF can obtain the best prediction sequence through the relationship between adjacent tags, which makes up for this deficiency of BiLSTM. CRF ensures the validity of the predicted tags by adding restriction rules to the final predictions. During training, these restriction rules are learned automatically by the CRF classifier, and the Viterbi algorithm is used to find the most likely tag sequence.

Given an input sentence X = {x<sub>1</sub>, x<sub>2</sub>, …, x<sub>n</sub>} and a corresponding prediction sequence Y = {y<sub>1</sub>, y<sub>2</sub>, …, y<sub>n</sub>}, the score of the prediction sequence Y is calculated as follows:

$$s(X, Y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i} \tag{10}$$

where *A* is the label transition matrix, whose entry for a pair of adjacent labels is the score of moving from the first label to the second, and *P* is the matrix of label scores, whose entry (*i*, *y<sub>i</sub>*) is the score of label *y<sub>i</sub>* at position *i*. The index range *i* = 0, …, *n* in the first sum implies boundary labels *y*<sub>0</sub> and *y*<sub>*n*+1</sub> that mark the start and end of the sentence. The score is used to compute the probability of the sequence *Y*:

$$P(Y|X) = \frac{e^{s(X,Y)}}{\sum_{\tilde{Y} \in Y_X} e^{s\left(X, \tilde{Y}\right)}} \tag{11}$$

where *Y<sub>X</sub>* is the set of all possible tag sequences for *X* and *Ỹ* ranges over its members. Taking the logarithm of both sides of the above formula gives the log-likelihood of the prediction sequence:

$$\ln(P(Y|X)) = s(X,Y) - \ln\left(\sum_{\tilde{Y} \in Y_X} e^{s\left(X, \tilde{Y}\right)}\right) \tag{12}$$

Finally, the tag sequence with the highest probability is computed by the Viterbi algorithm.
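The score in Eq. (10) and the Viterbi search can be sketched with dictionaries of scores. Everything numeric below is invented for illustration: the three labels, the transition, start, end and emission scores, and the function names `sequence_score` and `viterbi` are assumptions, with the start and end scores playing the role of the boundary transitions in Eq. (10).

```python
from typing import Dict, List, Sequence, Tuple

Scores = Dict[str, float]


def sequence_score(
    emissions: Sequence[Scores], transitions: Dict[str, Scores],
    start: Scores, end: Scores, tags: Sequence[str],
) -> float:
    """s(X, Y) of Eq. (10): transition scores plus per-position label scores."""
    score = start[tags[0]] + end[tags[-1]]
    score += sum(emissions[i][tag] for i, tag in enumerate(tags))
    score += sum(transitions[a][b] for a, b in zip(tags, tags[1:]))
    return score


def viterbi(
    emissions: Sequence[Scores], transitions: Dict[str, Scores],
    start: Scores, end: Scores, labels: Sequence[str],
) -> Tuple[List[str], float]:
    """Return the highest-scoring tag sequence and its score."""
    best = {tag: start[tag] + emissions[0][tag] for tag in labels}
    backpointers: List[Dict[str, str]] = []
    for emit in emissions[1:]:
        prev_best, best, pointers = best, {}, {}
        for cur in labels:
            prev = max(labels, key=lambda p: prev_best[p] + transitions[p][cur])
            best[cur] = prev_best[prev] + transitions[prev][cur] + emit[cur]
            pointers[cur] = prev
        backpointers.append(pointers)
    last = max(labels, key=lambda tag: best[tag] + end[tag])
    final_score = best[last] + end[last]
    path = [last]
    for pointers in reversed(backpointers):  # follow the back-pointers
        path.append(pointers[path[-1]])
    path.reverse()
    return path, final_score


labels = ["B", "I", "O"]
transitions = {  # the large negative O -> I score encodes "I cannot follow O"
    "B": {"B": -1.0, "I": 2.0, "O": 0.0},
    "I": {"B": 0.0, "I": 1.0, "O": 0.0},
    "O": {"B": 0.0, "I": -10.0, "O": 1.0},
}
start = {"B": 0.0, "I": -10.0, "O": 0.0}
end = {"B": 0.0, "I": 0.0, "O": 0.0}
emissions = [
    {"B": 1.0, "I": 0.0, "O": 2.0},  # taken alone, position 0 prefers O
    {"B": 0.0, "I": 3.0, "O": 0.0},
    {"B": 0.0, "I": 0.0, "O": 2.0},
]
path, score = viterbi(emissions, transitions, start, end, labels)
print(path, score)  # ['B', 'I', 'O'] 8.0
```

Choosing the best label at each position independently would give O, I, O, which violates the BIO constraint; the transition scores steer the search to B, I, O instead, illustrating how the CRF layer corrects the per-position scores.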

## **4 Experimental Analysis**

#### **4.1 Dataset Construction**

Since there is no public Chinese named entity recognition dataset in network security, this paper obtains the required data with Python from websites related to network security vulnerabilities, such as the National Information Security Vulnerability Sharing Platform (www.cnvd.org.cn), the Information Security Vulnerability Portal (http://cve.scap.org.cn), the 360 Network Security Response Center (cert.360.cn) and the National Internet Emergency Center (www.cert.org.cn). The entities are divided into nine types (as shown in Table 1) and labeled with the BIO scheme: B marks the first word of an entity, I marks an intermediate word of the entity, and O marks a non-entity.


**Table 1.** Entity labeled mode
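The BIO scheme can be illustrated by converting entity spans into per-token tags. The sentence, the span positions and the entity type names `SW` and `VUL` below are hypothetical examples, not the nine types of Table 1, and `bio_tags` is an illustrative helper name.

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int, str]  # (start index, end index exclusive, entity type)


def bio_tags(tokens: Sequence[str], entities: Sequence[Span]) -> List[str]:
    """Label each token with B-<type>, I-<type> or O."""
    tags = ["O"] * len(tokens)  # tokens outside any entity stay O
    for start, end, entity_type in entities:
        tags[start] = f"B-{entity_type}"      # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{entity_type}"      # remaining tokens of the entity
    return tags


tokens = ["Apache", "Log4j", "is", "affected", "by", "CVE-2021-44228"]
entities = [(0, 2, "SW"), (5, 6, "VUL")]
for token, tag in zip(tokens, bio_tags(tokens, entities)):
    print(f"{token}\t{tag}")
```

The output lists one token per line with its tag, which is the usual layout of a BIO-labeled training file.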

The labeled dataset is divided into a training set, a test set, and a validation set in a 7:2:1 ratio (as shown in Table 2).


**Table 2.** Dataset size



#### **4.2 Evaluation Metrics**

Evaluating performance is a crucial step in the NER task: it allows us to analyze the strengths and remaining problems of the proposed algorithm. At present, three main evaluation metrics are used to measure the performance of NER tasks: Precision, Recall, and F1-score.

Precision is the proportion of samples predicted to be positive that are actually positive. The formula is as follows:

$$P = \frac{TP}{TP + FP} \ast 100\% \tag{13}$$

Recall is the proportion of actually positive samples that are predicted to be positive. The formula is as follows:

$$R = \frac{TP}{TP + FN} \ast 100\% \tag{14}$$

The two metrics above usually trade off against each other, so precision and recall can rarely both reach their best values at the same time. The F1-score therefore serves as a comprehensive index that balances precision and recall. Its formula is as follows:

$$F1 = \frac{2 \ast P \ast R}{P + R} \ast 100\% \tag{15}$$

where *TP* refers to the number of samples that are actually positive and predicted to be positive, *FP* refers to the number of samples that are actually negative and predicted to be positive, *FN* refers to the number of samples that are actually positive and predicted to be negative.
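Eqs. (13) to (15) translate directly into code. The sketch below counts *TP*, *FP* and *FN* at the entity level by comparing sets of predicted and gold spans; this exact-match convention, the span format and the example spans are assumptions for illustration, since the paper does not spell out its matching rule.

```python
from typing import Set, Tuple

Span = Tuple[int, int, str]  # (start index, end index exclusive, entity type)


def precision_recall_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Entity-level precision, recall and F1 as fractions in [0, 1]."""
    tp = len(gold & predicted)   # predicted entities that are correct
    fp = len(predicted - gold)   # predicted entities that are wrong
    fn = len(gold - predicted)   # gold entities that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical spans: one prediction is correct, one has the wrong type.
gold = {(0, 2, "SW"), (5, 6, "VUL"), (8, 9, "ORG")}
predicted = {(0, 2, "SW"), (5, 6, "ORG")}
p, r, f1 = precision_recall_f1(gold, predicted)
print(f"P={p:.2%} R={r:.2%} F1={f1:.2%}")
```

Here one of two predictions is correct (precision 50%) and one of three gold entities is found (recall about 33%), giving an F1-score of 40%.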

#### **4.3 Experimental Results**

Experiments are carried out on the constructed network security data set. In order to verify the rationality of the proposed model, the model is compared with several classical models in the named entity recognition task. The comparison results are shown in Table 3.

**Table 3.** Comparison of different models (%)

Named entity recognition in network security requires more features for recognition, and the state at the current time step should be related to both the previous and the next time steps, whereas the current state in an HMM depends only on the previous state. The experimental results show that the F1 value of BiLSTM is higher than that of HMM. BiLSTM alone cannot learn the relationships between states in the label sequence; after adding CRF, the model can learn them. Comparing the BiLSTM-CRF model with BERT-BiLSTM-CRF, the experimental results show that the F1-score is significantly improved, because BERT can deeply extract the semantic information of network security text and fully reflect polysemy.

Comparing the BERT-BiLSTM-CRF model with the model proposed in this paper, which adds the self-attention mechanism to BERT-BiLSTM-CRF, the precision, recall, and F1-score are all improved. With the self-attention mechanism, the model better captures the correlations within the full network security text by computing the interactions between words, so that the F1-score of the proposed model is 2.28% higher than that of the BERT-BiLSTM-CRF model. The model achieves good results in the task of network security named entity recognition.

## **5 Conclusion and Future Work**

Threat intelligence has gradually become one of the hot areas of network security. At present, government departments and network security enterprises are paying more attention to the development of threat intelligence, and the demand for threat intelligence in all walks of life is growing. However, network threat intelligence entities pose some problems, such as ambiguous words, mixed Chinese and English, and blurred boundaries. To solve these problems, this paper presents a network security named entity recognition model based on BERT-BiLSTM-CRF combined with a self-attention mechanism. The model uses the BERT pre-trained language model to generate word vectors dynamically through a bidirectional Transformer structure, mining syntactic structure and semantic information, and introduces a self-attention mechanism to calculate the correlations between words. Long-distance dependencies can be handled better by assigning different weights to different words according to their degree of association. Experiments show that the model improves P, R, and F1-score and achieves a good recognition effect. It can carry out practical network threat intelligence entity recognition and addresses the difficulties of threat intelligence entity recognition, including polysemy.

However, there is still much room for improvement in named entity recognition for network threat intelligence. Because a large amount of unlabeled network security corpus still exists in specific areas, transfer learning can be considered in future research to address the lack of labeled data. The performance of network threat intelligence entity recognition can also be further improved by expanding the size of the corpus.

**Acknowledgement.** This work is supported by the Fundamental Research Funds for the Central Universities, North Minzu University (2022PT\_S04), the Natural Science Foundation of Ningxia Province (No. 2020AAC03212), and the Innovation Projects for Graduate Students of North Minzu University (Project No. YCX21087).

## **References**



## **Text Recognition**

## **Research on the Recognition of Internet Buzzword Features Based on Transformer**

Dawei Xu<sup>1,2(B)</sup>, Yijie She<sup>1</sup>, Zhonghua Tan<sup>3</sup>, Ruiguang Li<sup>4</sup>, and Jian Zhao<sup>1</sup>

<sup>1</sup> College of Cybersecurity, Changchun University, Changchun, China xudw@ccu.edu.cn

<sup>2</sup> School of Cyberspace Science and Technology, Beijing Institute of Technology, Beijing, China

<sup>3</sup> College of International Education, Hainan Normal University, Haikou, China

<sup>4</sup> National Computer Network Emergency Response Technical Team/Coordination Center of China, Beijing, China

**Abstract.** Accurate identification of Internet buzzwords plays an important role in guiding Internet opinion in a positive direction. To address this problem, a Transformer-based Internet buzzword feature recognition system was designed. The traditional data crawling approach is improved by adding a real-time crawling module, and a dedicated Internet buzzword corpus is constructed from the crawled data. Traditional machine learning models suffer from gradient disappearance and gradient explosion. The Transformer model, with its parallel computing and self-attention mechanism, is a good solution to these problems, and its bi-directional connection allows the parameters of the context to be updated uniformly, which aggregates information better and solves the problem of scattered contextual information. The position encoding part of the Transformer model is converted to a relative position representation (RPR), which compensates for the model's inability to obtain relative position information. The experimental results show that the improved Transformer model achieves a precision of 90.1%, a recall of 92.13%, and an F1 value of 91.16% in recognizing Internet buzzwords.

**Keywords:** Internet buzzwords · Transformer model · Relative position representation (RPR)

## **1 Introduction**

With an Internet penetration rate of 73.0% as of December 2021 [1], the Internet has become an essential part of people's lives. It has given the public more channels to express their ideas, and Internet buzzwords are a concentrated product of that expression. However, Internet buzzwords can be positive or negative: while they express the ideas of Internet users, they may also steer public opinion in a negative direction. Therefore, accurate identification of Internet buzzwords plays an important role in guiding Internet opinion correctly.

The system applies deep learning techniques to recognize Internet buzzwords. Deep learning techniques extract, transform, and combine features from the initial text to obtain a set of feature representations, which are then fed into a prediction function to obtain the recognition results [2]. Such models are built around three functional components: the embedding layer converts words into feature vectors, the encoding layer obtains contextual features of the text, and the output layer learns the rules between sequences and classifies the output [3]. Although RNN structures are widely used to process sequential, time-series data [4–6], they suffer from structural problems such as serial computation, gradient disappearance [7], and one-way construction.

The contributions of applying the Transformer model to Internet buzzword feature recognition are as follows: (1) A real-time crawling module is added to data crawling, which obtains Internet buzzword data more accurately and alleviates the slow updates of traditional crawling. (2) Because existing Internet buzzword datasets are scattered and sparse, the data collected through web crawling is used to build a dedicated dynamic Internet buzzword corpus. (3) Traditional machine learning models suffer from gradient disappearance and gradient explosion. The Transformer model, with its parallel computing and self-attention mechanism, solves these problems, and its bi-directional connection allows the parameters of the context to be updated uniformly, enabling better information aggregation and solving the problem of dispersed contextual information. (4) The position encoding of the Transformer model is improved by converting the position encoding vector to a relative position representation (RPR) [8], which compensates for the model's inability to obtain relative position information.

## **2 Related Work**

The existing literature on identifying Internet buzzwords and Internet neologisms falls into three types: rule-based methods, statistical methods, and methods that combine statistics and rules.

The rule-based approach develops rules that capture common features among words, based on linguistic theory and knowledge or on word-formation patterns observed through long-term study of the language, and then summarizes their properties and combines them with grammar. Because the core of rule-based new word discovery is the construction of a domain knowledge base, a specialized rule base must be created, and during Internet buzzword identification new words are discovered according to their degree of similarity to entries in that rule base.

The statistical approach addresses a drawback of the rule-based approach, namely its reliance on extensive manual annotation, and thus saves significant time and labor costs. Although it makes up for many shortcomings of the rule-based approach, experiments in the literature show that the statistical approach alone has a low recognition rate, while a fusion of the two improves the recognition rate of Internet buzzwords. The literature [9] proposes a kth-order algorithm for PMI; experiments show that its accuracy is about 28.79% higher than that of PMI, and that it overcomes the defects of the PMI method when the parameter k is greater than or equal to 3. The Transformer model applied in this paper likewise combines statistical and rule-based methods and is applied to Internet buzzwords to improve recognition rates.

## **3 Overall System Architecture**

The Transformer deep learning model is applied to identify the features of Internet buzzwords, and the overall process of the system is shown in Fig. 1.

**Fig. 1.** Overall system flow chart

First, the user logs in to the Internet buzzword recognition system and enters the text to be analyzed on the text analysis page. The text is then checked against the Internet buzzword database in the background. If the text is in the corpus, it is directly identified as an Internet buzzword; if it is not in the database, it is passed to the Transformer model, which determines whether it is an Internet buzzword.
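The two-stage decision described above can be sketched as follows. This is a minimal illustration, not the system's actual code: the function name `is_buzzword`, the 0.5 threshold, the sample corpus entries, and the stub scoring function that stands in for the trained Transformer model are all assumptions.

```python
from typing import Callable, Set


def is_buzzword(text: str,
                corpus: Set[str],
                model_score: Callable[[str], float],
                threshold: float = 0.5) -> bool:
    """Return True if `text` is judged to be an Internet buzzword."""
    # Stage 1: exact lookup in the buzzword corpus.
    if text in corpus:
        return True
    # Stage 2: fall back to the classifier for text not in the corpus.
    return model_score(text) >= threshold


# Hypothetical stand-ins: a tiny corpus and a stub in place of the model.
demo_corpus = {"yyds", "emo"}
stub_scores = {"xswl": 0.93, "hello": 0.04}


def stub_model(text: str) -> float:
    # Unknown inputs get a score of 0.0 in this stub.
    return stub_scores.get(text, 0.0)
```

Checking the corpus first keeps the model call off the path for known buzzwords, so the classifier is only consulted for unseen text.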

The Transformer-based Internet buzzword recognition solution is implemented in the following steps:

Step1, crawl the existing Internet buzzword corpus on Weibo. To achieve real-time incremental crawling on top of the original crawler, each URL is marked with an identifier, the data fingerprint, which is set to a hash value; comparing hash values is then enough to determine whether the crawled content needs to be updated.

Step2, pre-process the crawled Internet buzzwords: first de-duplicate the data, then segment the longer phrases into words using search engine mode, and finally filter out stop words using the Baidu stop word list.

Step3, use the matplotlib, jieba, and wordcloud libraries to visualize the processed Internet buzzwords as a word cloud.

Step4, convert the pre-processed data into text vectors using the Skip-gram method of the word2vec model.

Step5, apply position encoding to the feature vectors obtained in the previous step: a position vector carrying position information is combined with the word embedding to obtain the final vector with position information.

Step6, input the vector with position information into the Transformer model, which determines whether the input is an Internet buzzword.

## **4 System Implementation**

#### **4.1 Data Acquisition and Processing**

#### **4.1.1 Data Acquisition**

Real-time incremental crawling of Internet buzzwords is done by tagging each URL with a data fingerprint identifier. The data fingerprint is a hash value, a unique fixed-length string generated from the input, and hash values are compared to determine whether the crawl needs to be updated. Fingerprints are kept in a set in a Redis database, which is accessed through two operations: an insert operation adds a piece of data to the set, returning 1 for success and 0 for failure, and a membership query checks whether an element exists in the set, returning 1 for existence and 0 for non-existence. As shown in Fig. 2, when the Spider module receives a URL to process, a Spider middleware determines whether the fingerprint of the URL exists in the Redis database. If so, the URL is discarded; if not, the new URL is fetched and crawled.
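The fingerprint check can be sketched with the standard library alone. In this illustration an in-memory `FingerprintStore` class (a hypothetical name) stands in for the Redis set, and SHA-1 is assumed as the hash function because the text does not name one. With a real Redis client, the set commands SADD and SISMEMBER would likely play the roles of `add` and `contains`, but that mapping is also an assumption.

```python
import hashlib


class FingerprintStore:
    """In-memory stand-in for a Redis set of URL fingerprints."""

    def __init__(self) -> None:
        self._seen = set()

    def add(self, fingerprint: str) -> int:
        """Insert a fingerprint; return 1 if it was new, 0 otherwise."""
        if fingerprint in self._seen:
            return 0
        self._seen.add(fingerprint)
        return 1

    def contains(self, fingerprint: str) -> int:
        """Return 1 if the fingerprint is already stored, else 0."""
        return 1 if fingerprint in self._seen else 0


def url_fingerprint(url: str) -> str:
    """Map a URL to a fixed-length hexadecimal hash string."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def should_crawl(url: str, store: FingerprintStore) -> bool:
    """Discard URLs whose fingerprint is known; record and crawl new ones."""
    fingerprint = url_fingerprint(url)
    if store.contains(fingerprint):
        return False
    store.add(fingerprint)
    return True
```

Calling `should_crawl` twice with the same URL returns `True` the first time and `False` the second, which is the incremental behavior the middleware relies on.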

**Fig. 2.** Real-time web crawling flow chart

#### **4.1.2 Data Pre-processing**

Searching with the keyword "Internet buzzwords" yielded tens of thousands of high-frequency Internet buzzwords. First, these buzzwords were de-duplicated with pandas, a data analysis tool in Python. The duplicated() function detects duplicate data: by default, every repeated row after its first occurrence is marked True. The rows marked True are then removed by applying the drop\_duplicates() function.
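To make the de-duplication semantics concrete, the following standard-library sketch mirrors the default keep-first behavior of the two pandas functions on a plain list. The helper names are chosen to echo the pandas ones but are illustrative re-implementations, not the pandas API, and the sample data is made up.

```python
from typing import Hashable, List, Sequence


def duplicated_flags(rows: Sequence[Hashable]) -> List[bool]:
    """Flag each row that repeats an earlier row (first occurrence is False)."""
    seen = set()
    flags = []
    for row in rows:
        flags.append(row in seen)
        seen.add(row)
    return flags


def drop_duplicates(rows: Sequence[Hashable]) -> List[Hashable]:
    """Keep only the first occurrence of each row, preserving order."""
    return [row for row, dup in zip(rows, duplicated_flags(rows)) if not dup]


sample = ["a", "b", "a", "c", "b"]
flags = duplicated_flags(sample)   # [False, False, True, False, True]
unique = drop_duplicates(sample)   # ["a", "b", "c"]
```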

The next step is to apply jieba, a third-party Chinese word segmentation library for Python, to the longer phrases in the crawled Internet buzzwords. Given the granularity of Internet buzzwords, search engine mode was chosen, which further splits long words on top of accurate mode. The segmentation commands are jieba.cut\_for\_search() and jieba.lcut\_for\_search().

The next step is to filter out of the crawled data English characters, numbers, mathematical characters, punctuation marks, very frequently used single Chinese characters, auxiliary words, adverbs, prepositions, conjunctions, etc. This paper uses the Baidu stop word list as the filter.
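A filter of this kind might look like the sketch below. The two-entry stop word set is a made-up stand-in for the Baidu stop word list, and the regular expression is one possible way to drop tokens that consist only of English letters, digits, punctuation, or mathematical symbols; neither is taken from the original system.

```python
import re
from typing import Iterable, List, Set

# Matches tokens made up only of ASCII letters, digits, underscores, or
# non-word characters (punctuation, mathematical symbols, and the like).
NOISE = re.compile(r"[A-Za-z0-9_\W]+")


def filter_tokens(tokens: Iterable[str], stop_words: Set[str]) -> List[str]:
    """Drop empty tokens, noise tokens, and stop words; keep the rest in order."""
    kept = []
    for token in tokens:
        token = token.strip()
        if not token:
            continue
        if NOISE.fullmatch(token):
            continue
        if token in stop_words:
            continue
        kept.append(token)
    return kept


# Hypothetical miniature stop word list and segmented input.
demo_stop_words = {"的", "和"}
demo_tokens = ["内卷", "的", "2021", "，", "abc", "打工人", "和", "+", " "]
demo_result = filter_tokens(demo_tokens, demo_stop_words)
```

Frequent single Chinese characters and function words are handled by the stop word set rather than by the pattern, so extending the filter only requires loading a fuller list.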

#### **4.1.3 Constructing an Online Buzzword Feature Vector**

The pre-processed data is transformed into character vectors using the Word2vec model. The Word2vec module is called from the Gensim package and contains two methods for vectorizing text, CBOW and Skip-gram. In training, sg = 1 is set so that the Skip-gram algorithm is used. The sliding window size (window\_size) is set to 5, the word vector dimension (size) is set to 100, and min\_count, which filters out words whose frequency is below the set value, is set to 5 in this paper. Skip-gram predicts the surrounding words from the central word: each central word has K words as output, so there are K predictions per word, for a total of K \* V.

The model training process is as follows: (1) Use center\_words V to query W0 and target\_words T to query W1, obtaining two tensors of shape [batch\_size, embedding\_size], denoted H1 and H2 respectively. (2) Compute the dot product of the two tensors. (3) Apply a sigmoid function to the result of (2), which maps the dot product to a value between 0 and 1 that serves as the predicted probability; the model can then be trained against the label information L. After training, W0 is generally used as the final word embedding matrix, and each word is represented by its vector in W0. The similarity between different words can be calculated with the vector dot product.
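The forward computation in steps (1) to (3) can be illustrated for a single (center word, target word) pair in plain Python. The tiny 2-dimensional matrices and the binary cross-entropy loss are assumptions made for this sketch; the text only states that the model is trained against the label information L.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def dot(u: Vector, v: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sigmoid(x: float) -> float:
    """Map a real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict(W0: Matrix, W1: Matrix, center_id: int, target_id: int) -> float:
    """Look up H1 in W0 and H2 in W1, then return sigmoid(H1 . H2)."""
    h1 = W0[center_id]   # embedding of the center word
    h2 = W1[target_id]   # embedding of the target word
    return sigmoid(dot(h1, h2))


def bce_loss(p: float, label: int) -> float:
    """Binary cross-entropy between predicted probability p and label (0 or 1)."""
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(p + eps) + (1 - label) * math.log(1.0 - p + eps))


# Hypothetical 2-word vocabulary with 2-dimensional embeddings.
W0_demo = [[1.0, 0.0], [0.0, 1.0]]
W1_demo = [[2.0, 0.0], [0.0, -2.0]]
```

Here `predict(W0_demo, W1_demo, 0, 1)` is exactly 0.5 because the two looked-up vectors are orthogonal, so the dot product is zero.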

#### **4.2 Transformer Model**

The Transformer model was proposed by Vaswani et al. in the paper "Attention Is All You Need" [10], published in late 2017, and its general structure is shown in Fig. 3.

**Fig. 3.** Transformer structure diagram

Consider the position encoding in the traditional Transformer model. Because the Transformer has no iterative operation like that of a recurrent neural network, it has no inherent access to position information, so the position of each word must be provided to it explicitly. In this paper, the position encoding of both the Encoder and the Decoder of the Transformer model is converted to a relative position representation (RPR), compensating for the model's inability to obtain relative position information.

Two sets of position encoding vectors need to be learned by the model, one for computing *z<sub>i</sub>* and one for computing *e<sub>ij</sub>*. With a window of k positions on each side of the current word, there are 2k + 1 relative position encoding vectors to learn in each set: k for positions to the left, k for positions to the right, and one for the position itself. The traditional Transformer does not use relative position encoding when calculating, after SoftMax, the degree of attention that word i pays to word j. Comparing the two calculation methods shows that the RPR calculation captures position more precisely, so the model uses RPR for the position encoding of both the Encoder and Decoder parts.
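For reference, the two calculations can be written out using a commonly cited formulation of relative position representations in self-attention. The notation below ($x_i$ for the input at position $i$, $W^Q$, $W^K$, $W^V$ for the projection matrices, $d_z$ for the output dimension, and $a^K_{ij}$, $a^V_{ij}$ for the two sets of relative position vectors) is an assumption, since the text itself gives no equations.

$$
e_{ij} = \frac{(x_i W^Q)\,(x_j W^K + a^K_{ij})^{\top}}{\sqrt{d_z}}, \qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{l=1}^{n} \exp(e_{il})}, \qquad
z_i = \sum_{j=1}^{n} \alpha_{ij}\,(x_j W^V + a^V_{ij})
$$

$$
a^K_{ij} = w^K_{\mathrm{clip}(j-i,\,k)}, \qquad
a^V_{ij} = w^V_{\mathrm{clip}(j-i,\,k)}, \qquad
\mathrm{clip}(x, k) = \max(-k, \min(k, x))
$$

Because $\mathrm{clip}(j-i, k)$ takes only the values $-k, \dots, 0, \dots, k$, each set contains $2k + 1$ vectors. Setting $a^K_{ij}$ and $a^V_{ij}$ to zero recovers the attention calculation of the traditional Transformer, which is the comparison drawn above.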

## **5 Analysis and Visualization of Experimental Results**

#### **5.1 Experimental Parameters**

The number of layers is set to 2 by default, and the value of 128 is set to True. BIDIRECTIONAL is set to True to analyze the sequence from front to back and from back to front. Table 1 lists the parameters of the Transformer model and their corresponding optimal parameter values.


**Table 1.** Transformer model parameters.

#### **5.2 Comparative Experiments**

The recognition of Internet buzzwords by each network structure was evaluated using precision (Pre), recall (Rec), and F1 values. To verify the performance of the Transformer model proposed in this paper, the feature vectors of Internet buzzwords were used as the input vectors of the model, and the model was compared with the commonly used single models CRF, LSTM, BILSTM, and CNN [12]. The recognition results of the comparison experiments are shown in Table 2.
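As a reminder of how the three evaluation metrics relate, the sketch below computes them from counts of true positives, false positives, and false negatives. The function name and the example counts are illustrative only and are not taken from the experiments.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 from confusion counts."""
    # Precision: share of predicted buzzwords that are real buzzwords.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of real buzzwords that were found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 90 correct detections, 10 false alarms, 30 misses.
pre, rec, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
```

With these counts, precision is 0.9 and recall is 0.75; F1 falls between the two and closer to the lower value, as a harmonic mean always does.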


**Table 2.** Recognition performance of the models.

The experimental results show that the Transformer-based Internet buzzword recognition model outperforms the common single models CRF, LSTM, BILSTM, and CNN. The LSTM model has the lowest recognition rate for irregular words such as Internet buzzwords because it can only extract information from the preceding context, not the following context; its F1 value is only 22.3%, which makes it ineffective for recognizing Internet buzzwords. The CNN model has an F1 value of 65.38%, an average performance compared with the other models. The BILSTM model has an F1 value of 87.53%, a 6.57% improvement over the CRF model, and performs relatively well in buzzword recognition. The Transformer model performs best in terms of precision, recall, and F1, with 90.1%, 92.13%, and 91.16% respectively.

The precision P of each model is compared in Fig. 4, the recall R in Fig. 5, and the F1 value in Fig. 6.

**Fig. 4.** Comparison of precision P (%) across models

**Fig. 5.** Comparison of recall R (%) across models

**Fig. 6.** Comparison of F1 (%) across models

The line graphs of the experimental results show that the Transformer-based model has the highest precision, recall, and F1 score, with its curve at the top, at 90.1%, 92.13%, and 91.16% respectively. The experimental data show that the model in this paper improves the recognition rate of Internet buzzwords.

#### **5.3 Visualization of Internet Buzzword Recognition**

The Internet buzzword recognition system implements its visual interface with Flask, a lightweight Python web framework. After logging in, users can choose among data query, real-time analysis, and hot topics in the sidebar of the home page. The data query page is shown in Fig. 7: it contains all data, Internet buzzwords, and non-Internet buzzwords, and shows information such as user name, posting time and content, device information, number of likes, retweets, and comments, and whether the data is an Internet buzzword.


**Fig. 7.** Visualization of data enquiry pages

The real-time analysis page is shown in Fig. 8. The words to be discriminated are entered in the content input field, the predicted probability score is displayed in the sentiment score field, and whether the words are suspected to be Internet buzzwords is displayed in the sentiment evaluation column.


**Fig. 8.** Example of real-time analysis page visualization

## **6 Conclusion**

To improve the recognition rate of Internet buzzwords, Transformer-based Internet buzzword feature recognition is proposed. A real-time crawling module is added to data crawling, which obtains Internet buzzword data more accurately and alleviates the slow updates of traditional crawling. Because buzzword datasets on the web are scattered and sparse, a dedicated dynamic corpus of Internet buzzwords is constructed from data collected through web crawling. Traditional machine learning models suffer from gradient disappearance and gradient explosion. The Transformer model, with its parallel computing and self-attention mechanism, solves these problems, and its bi-directional connection allows the parameters of the context to be updated uniformly, enabling better information aggregation and solving the problem of dispersed contextual information. The position encoding of the Transformer model is improved by converting the position encoding vector to a relative position representation (RPR), which compensates for the model's inability to obtain relative position information.

**Acknowledgments.** This research was supported by the scientific research project of the Education Department of Jilin Province [NO. JJKH20220602KJ].

## **References**

