subscribe to arXiv mailings

CRUcialG: Reconstruct Integrated Attack Scenario Graphs by Cyber Threat Intelligence Reports

Authors: Wenrui Cheng, Tiantian Zhu, Tieming Chen, Qixuan Yuan, Jie Ying, Hongmei Li, Chunlin Xiong, Mingda Li, Mingqi Lv, Yan Chen

Abstract: Cyber Threat Intelligence (CTI) reports are factual records compiled by security analysts through their observations of threat events or their own practical experience with attacks. In order to utilize CTI reports for attack detection, existing methods have attempted to map the content of reports onto system-level attack provenance graphs to clearly depict attack procedures. However, existing stud… ▽ More Cyber Threat Intelligence (CTI) reports are factual records compiled by security analysts through their observations of threat events or their own practical experience with attacks. In order to utilize CTI reports for attack detection, existing methods have attempted to map the content of reports onto system-level attack provenance graphs to clearly depict attack procedures. However, existing studies on constructing graphs from CTI reports suffer from problems such as weak natural language processing (NLP) capabilities, discrete and fragmented graphs, and insufficient attack semantic representation. Therefore, we propose a system called CRUcialG for the automated reconstruction of attack scenario graphs (ASGs) by CTI reports. First, we use NLP models to extract systematic attack knowledge from CTI reports to form preliminary ASGs. Then, we propose a four-phase attack rationality verification framework from the tactical phase with attack procedure to evaluate the reasonability of ASGs. Finally, we implement the relation repair and phase supplement of ASGs by adopting a serialized graph generation model. We collect a total of 10,607 CTI reports and generate 5,761 complete ASGs. Experimental results on CTI reports from 30 security vendors and DARPA show that the similarity of ASG reconstruction by CRUcialG can reach 84.54%. Compared with SOTA (EXTRACTOR and AttackG), the recall of CRUcialG (extraction of real attack events) can reach 88.13% and 94.46% respectively, which is 40% higher than SOTA on average. The F1-score of attack phase verification is able to reach 90.04%. △ Less

Submitted 14 October, 2024; originally announced October 2024.

arXiv:2408.11431 [pdf, other]

Diagnosing and Remedying Knowledge Deficiencies in LLMs via Label-free Curricular Meaningful Learning

Authors: Kai Xiong, Xiao Ding, Li Du, Jiahao Ying, Ting Liu, Bing Qin, Yixin Cao

Abstract: Large Language Models (LLMs) are versatile and demonstrate impressive generalization ability by mining and learning information from extensive unlabeled text. However, they still exhibit reasoning mistakes, often stemming from knowledge deficiencies, which can affect their trustworthiness and reliability. Although users can provide diverse and comprehensive queries, obtaining sufficient and effect… ▽ More Large Language Models (LLMs) are versatile and demonstrate impressive generalization ability by mining and learning information from extensive unlabeled text. However, they still exhibit reasoning mistakes, often stemming from knowledge deficiencies, which can affect their trustworthiness and reliability. Although users can provide diverse and comprehensive queries, obtaining sufficient and effective feedback is demanding. Furthermore, evaluating LLMs comprehensively with limited labeled samples is difficult. This makes it a challenge to diagnose and remedy the deficiencies of LLMs through rich label-free user queries. To tackle this challenge, we propose a label-free curricular meaningful learning framework (LaMer). LaMer first employs relative entropy to automatically diagnose and quantify the knowledge deficiencies of LLMs in a label-free setting. Next, to remedy the diagnosed knowledge deficiencies, we apply curricular meaningful learning: first, we adopt meaningful learning to adaptively synthesize augmentation data according to the severity of the deficiencies, and then design a curricular deficiency remedy strategy to remedy the knowledge deficiencies of LLMs progressively. Experiments show that LaMer efficiently and effectively diagnoses and remedies knowledge deficiencies in LLMs, improving various LLMs across seven out-of-distribution (OOD) reasoning and language understanding benchmarks, achieving comparable results to baselines with just 40\% training data. LaMer even surpasses methods that rely on labeled datasets for deficiency diagnosis. In application, our label-free method can offer an effective knowledge deficiency diagnostic tool for efficient LLM development. △ Less

Submitted 21 August, 2024; originally announced August 2024.

Comments: Under Review

arXiv:2408.04535 [pdf, other]

Synchronous Multi-modal Semantic Communication System with Packet-level Coding

Authors: Yun Tian, Jingkai Ying, Zhijin Qin, Ye Jin, Xiaoming Tao

Abstract: Although the semantic communication with joint semantic-channel coding design has shown promising performance in transmitting data of different modalities over physical layer channels, the synchronization and packet-level forward error correction of multimodal semantics have not been well studied. Due to the independent design of semantic encoders, synchronizing multimodal features in both the sem… ▽ More Although the semantic communication with joint semantic-channel coding design has shown promising performance in transmitting data of different modalities over physical layer channels, the synchronization and packet-level forward error correction of multimodal semantics have not been well studied. Due to the independent design of semantic encoders, synchronizing multimodal features in both the semantic and time domains is a challenging problem. In this paper, we take the facial video and speech transmission as an example and propose a Synchronous Multimodal Semantic Communication System (SyncSC) with Packet-Level Coding. To achieve semantic and time synchronization, 3D Morphable Mode (3DMM) coefficients and text are transmitted as semantics, and we propose a semantic codec that achieves similar quality of reconstruction and synchronization with lower bandwidth, compared to traditional methods. To protect semantic packets under the erasure channel, we propose a packet-Level Forward Error Correction (FEC) method, called PacSC, that maintains a certain visual quality performance even at high packet loss rates. Particularly, for text packets, a text packet loss concealment module, called TextPC, based on Bidirectional Encoder Representations from Transformers (BERT) is proposed, which significantly improves the performance of traditional FEC methods. The simulation results show that our proposed SyncSC reduce transmission overhead and achieve high-quality synchronous transmission of video and speech over the packet loss network. △ Less

Submitted 10 August, 2024; v1 submitted 8 August, 2024; originally announced August 2024.

Comments: 12 pages, 9 figures

arXiv:2407.00497 [pdf, other]

LLMs-as-Instructors: Learning from Errors Toward Automating Model Improvement

Authors: Jiahao Ying, Mingbao Lin, Yixin Cao, Wei Tang, Bo Wang, Qianru Sun, Xuanjing Huang, Shuicheng Yan

Abstract: This paper introduces the innovative "LLMs-as-Instructors" framework, which leverages the advanced Large Language Models (LLMs) to autonomously enhance the training of smaller target models. Inspired by the theory of "Learning from Errors", this framework employs an instructor LLM to meticulously analyze the specific errors within a target model, facilitating targeted and efficient training cycles… ▽ More This paper introduces the innovative "LLMs-as-Instructors" framework, which leverages the advanced Large Language Models (LLMs) to autonomously enhance the training of smaller target models. Inspired by the theory of "Learning from Errors", this framework employs an instructor LLM to meticulously analyze the specific errors within a target model, facilitating targeted and efficient training cycles. Within this framework, we implement two strategies: "Learning from Error," which focuses solely on incorrect responses to tailor training data, and "Learning from Error by Contrast", which uses contrastive learning to analyze both correct and incorrect responses for a deeper understanding of errors. Our empirical studies, conducted with several open-source models, demonstrate significant improvements across multiple benchmarks, including mathematical reasoning, coding abilities, and factual knowledge. Notably, the refined Llama-3-8b-Instruction has outperformed ChatGPT, illustrating the effectiveness of our approach. By leveraging the strengths of both strategies, we have attained a more balanced performance improvement on both in-domain and out-of-domain benchmarks. Our code can be found at https://yingjiahao14.github.io/LLMs-as-Instructors-pages/. △ Less

Submitted 29 June, 2024; originally announced July 2024.

arXiv:2406.13167 [pdf, other]

QRMeM: Unleash the Length Limitation through Question then Reflection Memory Mechanism

Authors: Bo Wang, Heyan Huang, Yixin Cao, Jiahao Ying, Wei Tang, Chong Feng

Abstract: While large language models (LLMs) have made notable advancements in natural language processing, they continue to struggle with processing extensive text. Memory mechanism offers a flexible solution for managing long contexts, utilizing techniques such as compression, summarization, and structuring to facilitate nuanced and efficient handling of large volumes of text. However, existing techniques… ▽ More While large language models (LLMs) have made notable advancements in natural language processing, they continue to struggle with processing extensive text. Memory mechanism offers a flexible solution for managing long contexts, utilizing techniques such as compression, summarization, and structuring to facilitate nuanced and efficient handling of large volumes of text. However, existing techniques face challenges with static knowledge integration, leading to insufficient adaptation to task-specific needs and missing multi-segmentation relationships, which hinders the dynamic reorganization and logical combination of relevant segments during the response process. To address these issues, we introduce a novel strategy, Question then Reflection Memory Mechanism (QRMeM), incorporating a dual-structured memory pool. This pool synergizes static textual content with structured graph guidance, fostering a reflective trial-and-error approach for navigating and identifying relevant segments. Our evaluation across multiple-choice questions (MCQ) and multi-document question answering (Multi-doc QA) benchmarks showcases QRMeM enhanced performance compared to existing approaches. △ Less

Submitted 26 September, 2024; v1 submitted 18 June, 2024; originally announced June 2024.

Comments: EMNLP 2024 Findings

arXiv:2406.04875 [pdf, other]

3DRealCar: An In-the-wild RGB-D Car Dataset with 360-degree Views

Authors: Xiaobiao Du, Haiyang Sun, Shuyun Wang, Zhuojie Wu, Hongwei Sheng, Jiaying Ying, Ming Lu, Tianqing Zhu, Kun Zhan, Xin Yu

Abstract: 3D cars are commonly used in self-driving systems, virtual/augmented reality, and games. However, existing 3D car datasets are either synthetic or low-quality, presenting a significant gap toward the high-quality real-world 3D car datasets and limiting their applications in practical scenarios. In this paper, we propose the first large-scale 3D real car dataset, termed 3DRealCar, offering three di… ▽ More 3D cars are commonly used in self-driving systems, virtual/augmented reality, and games. However, existing 3D car datasets are either synthetic or low-quality, presenting a significant gap toward the high-quality real-world 3D car datasets and limiting their applications in practical scenarios. In this paper, we propose the first large-scale 3D real car dataset, termed 3DRealCar, offering three distinctive features. (1) \textbf{High-Volume}: 2,500 cars are meticulously scanned by 3D scanners, obtaining car images and point clouds with real-world dimensions; (2) \textbf{High-Quality}: Each car is captured in an average of 200 dense, high-resolution 360-degree RGB-D views, enabling high-fidelity 3D reconstruction; (3) \textbf{High-Diversity}: The dataset contains various cars from over 100 brands, collected under three distinct lighting conditions, including reflective, standard, and dark. Additionally, we offer detailed car parsing maps for each instance to promote research in car parsing tasks. Moreover, we remove background point clouds and standardize the car orientation to a unified axis for the reconstruction only on cars without background and controllable rendering. We benchmark 3D reconstruction results with state-of-the-art methods across each lighting condition in 3DRealCar. Extensive experiments demonstrate that the standard lighting condition part of 3DRealCar can be used to produce a large number of high-quality 3D cars, improving various 2D and 3D tasks related to cars. Notably, our dataset brings insight into the fact that recent 3D reconstruction methods face challenges in reconstructing high-quality 3D cars under reflective and dark lighting conditions. \textcolor{red}{\href{https://xiaobiaodu.github.io/3drealcar/}{Our dataset is available here.}} △ Less

Submitted 7 June, 2024; originally announced June 2024.

Comments: Project Page: https://xiaobiaodu.github.io/3drealcar

arXiv:2406.03963 [pdf, other]

A + B: A General Generator-Reader Framework for Optimizing LLMs to Unleash Synergy Potential

Authors: Wei Tang, Yixin Cao, Jiahao Ying, Bo Wang, Yuyue Zhao, Yong Liao, Pengyuan Zhou

Abstract: Retrieval-Augmented Generation (RAG) is an effective solution to supplement necessary knowledge to large language models (LLMs). Targeting its bottleneck of retriever performance, "generate-then-read" pipeline is proposed to replace the retrieval stage with generation from the LLM itself. Although promising, this research direction is underexplored and still cannot work in the scenario when source… ▽ More Retrieval-Augmented Generation (RAG) is an effective solution to supplement necessary knowledge to large language models (LLMs). Targeting its bottleneck of retriever performance, "generate-then-read" pipeline is proposed to replace the retrieval stage with generation from the LLM itself. Although promising, this research direction is underexplored and still cannot work in the scenario when source knowledge is given. In this paper, we formalize a general "A + B" framework with varying combinations of foundation models and types for systematic investigation. We explore the efficacy of the base and chat versions of LLMs and found their different functionalities suitable for generator A and reader B, respectively. Their combinations consistently outperform single models, especially in complex scenarios. Furthermore, we extend the application of the "A + B" framework to scenarios involving source documents through continuous learning, enabling the direct integration of external knowledge into LLMs. This approach not only facilitates effective acquisition of new knowledge but also addresses the challenges of safety and helpfulness post-adaptation. The paper underscores the versatility of the "A + B" framework, demonstrating its potential to enhance the practical application of LLMs across various domains. △ Less

Submitted 6 June, 2024; originally announced June 2024.

Comments: Accepted to ACL'24 (Findings)

arXiv:2405.13675 [pdf, other]

Context and Geometry Aware Voxel Transformer for Semantic Scene Completion

Authors: Zhu Yu, Runmin Zhang, Jiacheng Ying, Junchen Yu, Xiaohai Hu, Lun Luo, Si-Yuan Cao, Hui-Liang Shen

Abstract: Vision-based Semantic Scene Completion (SSC) has gained much attention due to its widespread applications in various 3D perception tasks. Existing sparse-to-dense approaches typically employ shared context-independent queries across various input images, which fails to capture distinctions among them as the focal regions of different inputs vary and may result in undirected feature aggregation of… ▽ More Vision-based Semantic Scene Completion (SSC) has gained much attention due to its widespread applications in various 3D perception tasks. Existing sparse-to-dense approaches typically employ shared context-independent queries across various input images, which fails to capture distinctions among them as the focal regions of different inputs vary and may result in undirected feature aggregation of cross-attention. Additionally, the absence of depth information may lead to points projected onto the image plane sharing the same 2D position or similar sampling points in the feature map, resulting in depth ambiguity. In this paper, we present a novel context and geometry aware voxel transformer. It utilizes a context aware query generator to initialize context-dependent queries tailored to individual input images, effectively capturing their unique characteristics and aggregating information within the region of interest. Furthermore, it extend deformable cross-attention from 2D to 3D pixel space, enabling the differentiation of points with similar image coordinates based on their depth coordinates. Building upon this module, we introduce a neural network named CGFormer to achieve semantic scene completion. Simultaneously, CGFormer leverages multiple 3D representations (i.e., voxel and TPV) to boost the semantic and geometric representation abilities of the transformed 3D volume from both local and global perspectives. Experimental results demonstrate that CGFormer achieves state-of-the-art performance on the SemanticKITTI and SSCBench-KITTI-360 benchmarks, attaining a mIoU of 16.87 and 20.05, as well as an IoU of 45.99 and 48.07, respectively. Remarkably, CGFormer even outperforms approaches employing temporal images as inputs or much larger image backbone networks. △ Less

Submitted 3 October, 2024; v1 submitted 22 May, 2024; originally announced May 2024.

Comments: NIPS 2024 Spotlight

arXiv:2405.02826 [pdf, other]

Nip in the Bud: Forecasting and Interpreting Post-exploitation Attacks in Real-time through Cyber Threat Intelligence Reports

Authors: Tiantian Zhu, Jie Ying, Tieming Chen, Chunlin Xiong, Wenrui Cheng, Qixuan Yuan, Aohan Zheng, Mingqi Lv, Yan Chen

Abstract: Advanced Persistent Threat (APT) attacks have caused significant damage worldwide. Various Endpoint Detection and Response (EDR) systems are deployed by enterprises to fight against potential threats. However, EDR suffers from high false positives. In order not to affect normal operations, analysts need to investigate and filter detection results before taking countermeasures, in which heavy manua… ▽ More Advanced Persistent Threat (APT) attacks have caused significant damage worldwide. Various Endpoint Detection and Response (EDR) systems are deployed by enterprises to fight against potential threats. However, EDR suffers from high false positives. In order not to affect normal operations, analysts need to investigate and filter detection results before taking countermeasures, in which heavy manual labor and alarm fatigue cause analysts miss optimal response time, thereby leading to information leakage and destruction. Therefore, we propose Endpoint Forecasting and Interpreting (EFI), a real-time attack forecast and interpretation system, which can automatically predict next move during post-exploitation and explain it in technique-level, then dispatch strategies to EDR for advance reinforcement. First, we use Cyber Threat Intelligence (CTI) reports to extract the attack scene graph (ASG) that can be mapped to low-level system logs to strengthen attack samples. Second, we build a serialized graph forecast model, which is combined with the attack provenance graph (APG) provided by EDR to generate an attack forecast graph (AFG) to predict the next move. Finally, we utilize the attack template graph (ATG) and graph alignment plus algorithm for technique-level interpretation to automatically dispatch strategies for EDR to reinforce system in advance. EFI can avoid the impact of existing EDR false positives, and can reduce the attack surface of system without affecting the normal operations. We collect a total of 3,484 CTI reports, generate 1,429 ASGs, label 8,000 sentences, tag 10,451 entities, and construct 256 ATGs. Experimental results on both DARPA Engagement and large scale CTI dataset show that the alignment score between the AFG predicted by EFI and the real attack graph is able to exceed 0.8, the forecast and interpretation precision of EFI can reach 91.8%. △ Less

Submitted 5 May, 2024; originally announced May 2024.

arXiv:2405.02629 [pdf, other]

SPARSE: Semantic Tracking and Path Analysis for Attack Investigation in Real-time

Authors: Jie Ying, Tiantian Zhu, Wenrui Cheng, Qixuan Yuan, Mingjun Ma, Chunlin Xiong, Tieming Chen, Mingqi Lv, Yan Chen

Abstract: As the complexity and destructiveness of Advanced Persistent Threat (APT) increase, there is a growing tendency to identify a series of actions undertaken to achieve the attacker's target, called attack investigation. Currently, analysts construct the provenance graph to perform causality analysis on Point-Of-Interest (POI) event for capturing critical events (related to the attack). However, due… ▽ More As the complexity and destructiveness of Advanced Persistent Threat (APT) increase, there is a growing tendency to identify a series of actions undertaken to achieve the attacker's target, called attack investigation. Currently, analysts construct the provenance graph to perform causality analysis on Point-Of-Interest (POI) event for capturing critical events (related to the attack). However, due to the vast size of the provenance graph and the rarity of critical events, existing attack investigation methods suffer from problems of high false positives, high overhead, and high latency. To this end, we propose SPARSE, an efficient and real-time system for constructing critical component graphs (i.e., consisting of critical events) from streaming logs. Our key observation is 1) Critical events exist in a suspicious semantic graph (SSG) composed of interaction flows between suspicious entities, and 2) Information flows that accomplish attacker's goal exist in the form of paths. Therefore, SPARSE uses a two-stage framework to implement attack investigation (i.e., constructing the SSG and performing path-level contextual analysis). First, SPARSE operates in a state-based mode where events are consumed as streams, allowing easy access to the SSG related to the POI event through semantic transfer rule and storage strategy. Then, SPARSE identifies all suspicious flow paths (SFPs) related to the POI event from the SSG, quantifies the influence of each path to filter irrelevant events. Our evaluation on a real large-scale attack dataset shows that SPARSE can generate a critical component graph (~ 113 edges) in 1.6 seconds, which is 2014 X smaller than the backtracking graph (~ 227,589 edges). SPARSE is 25 X more effective than other state-of-the-art techniques in filtering irrelevant edges. △ Less

Submitted 4 May, 2024; originally announced May 2024.

arXiv:2404.02621 [pdf, other]

Polynomial Graphical Lasso: Learning Edges from Gaussian Graph-Stationary Signals

Authors: Andrei Buciulea, Jiaxi Ying, Antonio G. Marques, Daniel P. Palomar

Abstract: This paper introduces Polynomial Graphical Lasso (PGL), a new approach to learning graph structures from nodal signals. Our key contribution lies in modeling the signals as Gaussian and stationary on the graph, enabling the development of a graph-learning formulation that combines the strengths of graphical lasso with a more encompassing model. Specifically, we assume that the precision matrix can… ▽ More This paper introduces Polynomial Graphical Lasso (PGL), a new approach to learning graph structures from nodal signals. Our key contribution lies in modeling the signals as Gaussian and stationary on the graph, enabling the development of a graph-learning formulation that combines the strengths of graphical lasso with a more encompassing model. Specifically, we assume that the precision matrix can take any polynomial form of the sought graph, allowing for increased flexibility in modeling nodal relationships. Given the resulting complexity and nonconvexity of the resulting optimization problem, we (i) propose a low-complexity algorithm that alternates between estimating the graph and precision matrices, and (ii) characterize its convergence. We evaluate the performance of PGL through comprehensive numerical simulations using both synthetic and real data, demonstrating its superiority over several alternatives. Overall, this approach presents a significant advancement in graph learning and holds promise for various applications in graph-aware signal analysis and beyond. △ Less

Submitted 3 April, 2024; originally announced April 2024.

arXiv:2404.00349 [pdf, other]

SGDFormer: One-stage Transformer-based Architecture for Cross-Spectral Stereo Image Guided Denoising

Authors: Runmin Zhang, Zhu Yu, Zehua Sheng, Jiacheng Ying, Si-Yuan Cao, Shu-Jie Chen, Bailin Yang, Junwei Li, Hui-Liang Shen

Abstract: Cross-spectral image guided denoising has shown its great potential in recovering clean images with rich details, such as using the near-infrared image to guide the denoising process of the visible one. To obtain such image pairs, a feasible and economical way is to employ a stereo system, which is widely used on mobile devices. Current works attempt to generate an aligned guidance image to handle… ▽ More Cross-spectral image guided denoising has shown its great potential in recovering clean images with rich details, such as using the near-infrared image to guide the denoising process of the visible one. To obtain such image pairs, a feasible and economical way is to employ a stereo system, which is widely used on mobile devices. Current works attempt to generate an aligned guidance image to handle the disparity between two images. However, due to occlusion, spectral differences and noise degradation, the aligned guidance image generally exists ghosting and artifacts, leading to an unsatisfactory denoised result. To address this issue, we propose a one-stage transformer-based architecture, named SGDFormer, for cross-spectral Stereo image Guided Denoising. The architecture integrates the correspondence modeling and feature fusion of stereo images into a unified network. Our transformer block contains a noise-robust cross-attention (NRCA) module and a spatially variant feature fusion (SVFF) module. The NRCA module captures the long-range correspondence of two images in a coarse-to-fine manner to alleviate the interference of noise. The SVFF module further enhances salient structures and suppresses harmful artifacts through dynamically selecting useful information. Thanks to the above design, our SGDFormer can restore artifact-free images with fine structures, and achieves state-of-the-art performance on various datasets. Additionally, our SGDFormer can be extended to handle other unaligned cross-model guided restoration tasks such as guided depth super-resolution. △ Less

Submitted 30 March, 2024; originally announced April 2024.

arXiv:2403.12541 [pdf, other]

Marlin: Knowledge-Driven Analysis of Provenance Graphs for Efficient and Robust Detection of Cyber Attacks

Authors: Zhenyuan Li, Yangyang Wei, Xiangmin Shen, Lingzhi Wang, Yan Chen, Haitao Xu, Shouling Ji, Fan Zhang, Liang Hou, Wenmao Liu, Xuhong Zhang, Jianwei Ying

Abstract: Recent research in both academia and industry has validated the effectiveness of provenance graph-based detection for advanced cyber attack detection and investigation. However, analyzing large-scale provenance graphs often results in substantial overhead. To improve performance, existing detection systems implement various optimization strategies. Yet, as several recent studies suggest, these str… ▽ More Recent research in both academia and industry has validated the effectiveness of provenance graph-based detection for advanced cyber attack detection and investigation. However, analyzing large-scale provenance graphs often results in substantial overhead. To improve performance, existing detection systems implement various optimization strategies. Yet, as several recent studies suggest, these strategies could lose necessary context information and be vulnerable to evasions. Designing a detection system that is efficient and robust against adversarial attacks is an open problem. We introduce Marlin, which approaches cyber attack detection through real-time provenance graph alignment.By leveraging query graphs embedded with attack knowledge, Marlin can efficiently identify entities and events within provenance graphs, embedding targeted analysis and significantly narrowing the search space. Moreover, we incorporate our graph alignment algorithm into a tag propagation-based schema to eliminate the need for storing and reprocessing raw logs. This design significantly reduces in-memory storage requirements and minimizes data processing overhead. As a result, it enables real-time graph alignment while preserving essential context information, thereby enhancing the robustness of cyber attack detection. Moreover, Marlin allows analysts to customize attack query graphs flexibly to detect extended attacks and provide interpretable detection results. We conduct experimental evaluations on two large-scale public datasets containing 257.42 GB of logs and 12 query graphs of varying sizes, covering multiple attack techniques and scenarios. The results show that Marlin can process 137K events per second while accurately identifying 120 subgraphs with 31 confirmed attacks, along with only 1 false positive, demonstrating its efficiency and accuracy in handling massive data. △ Less

Submitted 10 July, 2024; v1 submitted 19 March, 2024; originally announced March 2024.

arXiv:2402.11894 [pdf, other]

Automating Dataset Updates Towards Reliable and Timely Evaluation of Large Language Models

Authors: Jiahao Ying, Yixin Cao, Yushi Bai, Qianru Sun, Bo Wang, Wei Tang, Zhaojun Ding, Yizhe Yang, Xuanjing Huang, Shuicheng Yan

Abstract: Large language models (LLMs) have achieved impressive performance across various natural language benchmarks, prompting a continual need to curate more difficult datasets for larger LLMs, which is costly and time-consuming. In this paper, we propose to automate dataset updating and provide systematic analysis regarding its effectiveness in dealing with benchmark leakage issue, difficulty control,… ▽ More Large language models (LLMs) have achieved impressive performance across various natural language benchmarks, prompting a continual need to curate more difficult datasets for larger LLMs, which is costly and time-consuming. In this paper, we propose to automate dataset updating and provide systematic analysis regarding its effectiveness in dealing with benchmark leakage issue, difficulty control, and stability. Thus, once the current benchmark has been mastered or leaked, we can update it for timely and reliable evaluation. There are two updating strategies: 1) mimicking strategy to generate similar samples based on original data, preserving stylistic and contextual essence, and 2) extending strategy that further expands existing samples at varying cognitive levels by adapting Bloom's taxonomy of educational objectives. Extensive experiments on updated MMLU and BIG-Bench demonstrate the stability of the proposed strategies and find that the mimicking strategy can effectively alleviate issues of overestimation from benchmark leakage. In cases where the efficient mimicking strategy fails, our extending strategy still shows promising results. Additionally, by controlling the difficulty, we can better discern the models' performance and enable fine-grained analysis neither too difficult nor too easy an exam can fairly judge students' learning status. To the best of our knowledge, we are the first to automate updating benchmarks for reliable and timely evaluation. Our demo leaderboard can be found at https://yingjiahao14.github.io/Automating-DatasetUpdates/. △ Less

Submitted 6 June, 2024; v1 submitted 19 February, 2024; originally announced February 2024.

arXiv:2401.07762 [pdf]

Auto-Regressive Model with Exogenous Input--ARX--based traffic-flow prediction

Authors: Jun Ying, Xin Dong, Bowei Li, Zihan Tian

Abstract: Traffic flow prediction is widely used in travel decision making, traffic control, roadway system planning, business sectors, and government agencies. ARX models have proved to be highly effective and versatile. In this research, we investigated the applications of ARX models in prediction for real traffic flow in New York City. The ARX models were constructed by linear/polynomial or neural networ… ▽ More Traffic flow prediction is widely used in travel decision making, traffic control, roadway system planning, business sectors, and government agencies. ARX models have proved to be highly effective and versatile. In this research, we investigated the applications of ARX models in prediction for real traffic flow in New York City. The ARX models were constructed by linear/polynomial or neural networks. Comparative studies were carried out based on the results by the efficiency, accuracy, and training computational demand of the algorithms. △ Less

Submitted 15 January, 2024; originally announced January 2024.

arXiv:2312.00740 [pdf, ps, other]

Computing Networks Enabled Semantic Communications

Authors: Zhijin Qin, Jingkai Ying, Dingxi Yang, Hengjiang Wang, Xiaoming Tao

Abstract: Semantic communication has shown great potential in boosting the effectiveness and reliability of communications. However, its systems to date are mostly enabled by deep learning, which requires demanding computing resources. This article proposes a framework for the computing networks enabled semantic communication system, aiming to offer sufficient computing resources for semantic processing and… ▽ More Semantic communication has shown great potential in boosting the effectiveness and reliability of communications. However, its systems to date are mostly enabled by deep learning, which requires demanding computing resources. This article proposes a framework for the computing networks enabled semantic communication system, aiming to offer sufficient computing resources for semantic processing and transmission. Key techniques including semantic sampling and reconstruction, semantic-channel coding, semantic-aware resource allocation and optimization are introduced based on the cloud-edge-end computing coordination. Two use cases are demonstrated to show advantages of the proposed framework. The article concludes with several future research directions. △ Less

Submitted 1 December, 2023; originally announced December 2023.

arXiv:2311.10801 [pdf, other]

Reinforcement Learning with Maskable Stock Representation for Portfolio Management in Customizable Stock Pools

Authors: Wentao Zhang, Yilei Zhao, Shuo Sun, Jie Ying, Yonggang Xie, Zitao Song, Xinrun Wang, Bo An

Abstract: Portfolio management (PM) is a fundamental financial trading task, which explores the optimal periodical reallocation of capitals into different stocks to pursue long-term profits. Reinforcement learning (RL) has recently shown its potential to train profitable agents for PM through interacting with financial markets. However, existing work mostly focuses on fixed stock pools, which is inconsisten… ▽ More Portfolio management (PM) is a fundamental financial trading task, which explores the optimal periodical reallocation of capitals into different stocks to pursue long-term profits. Reinforcement learning (RL) has recently shown its potential to train profitable agents for PM through interacting with financial markets. However, existing work mostly focuses on fixed stock pools, which is inconsistent with investors' practical demand. Specifically, the target stock pool of different investors varies dramatically due to their discrepancy on market states and individual investors may temporally adjust stocks they desire to trade (e.g., adding one popular stocks), which lead to customizable stock pools (CSPs). Existing RL methods require to retrain RL agents even with a tiny change of the stock pool, which leads to high computational cost and unstable performance. To tackle this challenge, we propose EarnMore, a rEinforcement leARNing framework with Maskable stOck REpresentation to handle PM with CSPs through one-shot training in a global stock pool (GSP). Specifically, we first introduce a mechanism to mask out the representation of the stocks outside the target pool. Second, we learn meaningful stock representations through a self-supervised masking and reconstruction process. Third, a re-weighting mechanism is designed to make the portfolio concentrate on favorable stocks and neglect the stocks outside the target pool. Through extensive experiments on 8 subset stock pools of the US stock market, we demonstrate that EarnMore significantly outperforms 14 state-of-the-art baselines in terms of 6 popular financial metrics with over 40% improvement on profit. △ Less

Submitted 27 February, 2024; v1 submitted 17 November, 2023; originally announced November 2023.

arXiv:2309.17415 [pdf, other]

Intuitive or Dependent? Investigating LLMs' Behavior Style to Conflicting Prompts

Authors: Jiahao Ying, Yixin Cao, Kai Xiong, Yidong He, Long Cui, Yongbin Liu

Abstract: This study investigates the behaviors of Large Language Models (LLMs) when faced with conflicting prompts versus their internal memory. This will not only help to understand LLMs' decision mechanism but also benefit real-world applications, such as retrieval-augmented generation (RAG). Drawing on cognitive theory, we target the first scenario of decision-making styles where there is no superiority… ▽ More This study investigates the behaviors of Large Language Models (LLMs) when faced with conflicting prompts versus their internal memory. This will not only help to understand LLMs' decision mechanism but also benefit real-world applications, such as retrieval-augmented generation (RAG). Drawing on cognitive theory, we target the first scenario of decision-making styles where there is no superiority in the conflict and categorize LLMs' preference into dependent, intuitive, and rational/irrational styles. Another scenario of factual robustness considers the correctness of prompt and memory in knowledge-intensive tasks, which can also distinguish if LLMs behave rationally or irrationally in the first scenario. To quantify them, we establish a complete benchmarking framework including a dataset, a robustness evaluation pipeline, and corresponding metrics. Extensive experiments with seven LLMs reveal their varying behaviors. And, with role play intervention, we can change the styles, but different models present distinct adaptivity and upper-bound. One of our key takeaways is to optimize models or the prompts according to the identified style. For instance, RAG models with high role play adaptability may dynamically adjust the interventions according to the quality of retrieval results -- being dependent to better leverage informative context; and, being intuitive when external prompt is noisy. △ Less

Submitted 20 February, 2024; v1 submitted 29 September, 2023; originally announced September 2023.

arXiv:2309.13405 [pdf, other]

Learning Large-Scale MTP$_2$ Gaussian Graphical Models via Bridge-Block Decomposition

Authors: Xiwen Wang, Jiaxi Ying, Daniel P. Palomar

Abstract: This paper studies the problem of learning the large-scale Gaussian graphical models that are multivariate totally positive of order two ($\text{MTP}_2$). By introducing the concept of bridge, which commonly exists in large-scale sparse graphs, we show that the entire problem can be equivalently optimized through (1) several smaller-scaled sub-problems induced by a \emph{bridge-block decomposition… ▽ More This paper studies the problem of learning the large-scale Gaussian graphical models that are multivariate totally positive of order two ($\text{MTP}_2$). By introducing the concept of bridge, which commonly exists in large-scale sparse graphs, we show that the entire problem can be equivalently optimized through (1) several smaller-scaled sub-problems induced by a \emph{bridge-block decomposition} on the thresholded sample covariance graph and (2) a set of explicit solutions on entries corresponding to bridges. From practical aspect, this simple and provable discipline can be applied to break down a large problem into small tractable ones, leading to enormous reduction on the computational complexity and substantial improvements for all existing algorithms. The synthetic and real-world experiments demonstrate that our proposed method presents a significant speed-up compared to the state-of-the-art benchmarks. △ Less

Submitted 29 September, 2023; v1 submitted 23 September, 2023; originally announced September 2023.

arXiv:2309.02046 [pdf, other]

A Fast and Provable Algorithm for Sparse Phase Retrieval

Authors: Jian-Feng Cai, Yu Long, Ruixue Wen, Jiaxi Ying

Abstract: We study the sparse phase retrieval problem, which seeks to recover a sparse signal from a limited set of magnitude-only measurements. In contrast to prevalent sparse phase retrieval algorithms that primarily use first-order methods, we propose an innovative second-order algorithm that employs a Newton-type method with hard thresholding. This algorithm overcomes the linear convergence limitations… ▽ More We study the sparse phase retrieval problem, which seeks to recover a sparse signal from a limited set of magnitude-only measurements. In contrast to prevalent sparse phase retrieval algorithms that primarily use first-order methods, we propose an innovative second-order algorithm that employs a Newton-type method with hard thresholding. This algorithm overcomes the linear convergence limitations of first-order methods while preserving their hallmark per-iteration computational efficiency. We provide theoretical guarantees that our algorithm converges to the $s$-sparse ground truth signal $\mathbf{x}^{\natural} \in \mathbb{R}^n$ (up to a global sign) at a quadratic convergence rate after at most $O(\log (\Vert\mathbf{x}^{\natural} \Vert /x_{\min}^{\natural}))$ iterations, using $Ω(s^2\log n)$ Gaussian random samples. Numerical experiments show that our algorithm achieves a significantly faster convergence rate than state-of-the-art methods. △ Less

Submitted 19 March, 2024; v1 submitted 5 September, 2023; originally announced September 2023.

arXiv:2309.00960 [pdf, other]

Network Topology Inference with Sparsity and Laplacian Constraints

Authors: Jiaxi Ying, Xi Han, Rui Zhou, Xiwen Wang, Hing Cheung So

Abstract: We tackle the network topology inference problem by utilizing Laplacian constrained Gaussian graphical models, which recast the task as estimating a precision matrix in the form of a graph Laplacian. Recent research \cite{ying2020nonconvex} has uncovered the limitations of the widely used $\ell_1$-norm in learning sparse graphs under this model: empirically, the number of nonzero entries in the so… ▽ More We tackle the network topology inference problem by utilizing Laplacian constrained Gaussian graphical models, which recast the task as estimating a precision matrix in the form of a graph Laplacian. Recent research \cite{ying2020nonconvex} has uncovered the limitations of the widely used $\ell_1$-norm in learning sparse graphs under this model: empirically, the number of nonzero entries in the solution grows with the regularization parameter of the $\ell_1$-norm; theoretically, a large regularization parameter leads to a fully connected (densest) graph. To overcome these challenges, we propose a graph Laplacian estimation method incorporating the $\ell_0$-norm constraint. An efficient gradient projection algorithm is developed to solve the resulting optimization problem, characterized by sparsity and Laplacian constraints. Through numerical experiments with synthetic and financial time-series datasets, we demonstrate the effectiveness of the proposed method in network topology inference. △ Less

Submitted 2 September, 2023; originally announced September 2023.

arXiv:2306.04181 [pdf, other]

Benchmarking Foundation Models with Language-Model-as-an-Examiner

Authors: Yushi Bai, Jiahao Ying, Yixin Cao, Xin Lv, Yuze He, Xiaozhi Wang, Jifan Yu, Kaisheng Zeng, Yijia Xiao, Haozhe Lyu, Jiayin Zhang, Juanzi Li, Lei Hou

Abstract: Numerous benchmarks have been established to assess the performance of foundation models on open-ended question answering, which serves as a comprehensive test of a model's ability to understand and generate language in a manner similar to humans. Most of these works focus on proposing new datasets, however, we see two main issues within previous benchmarking pipelines, namely testing leakage and… ▽ More Numerous benchmarks have been established to assess the performance of foundation models on open-ended question answering, which serves as a comprehensive test of a model's ability to understand and generate language in a manner similar to humans. Most of these works focus on proposing new datasets, however, we see two main issues within previous benchmarking pipelines, namely testing leakage and evaluation automation. In this paper, we propose a novel benchmarking framework, Language-Model-as-an-Examiner, where the LM serves as a knowledgeable examiner that formulates questions based on its knowledge and evaluates responses in a reference-free manner. Our framework allows for effortless extensibility as various LMs can be adopted as the examiner, and the questions can be constantly updated given more diverse trigger topics. For a more comprehensive and equitable evaluation, we devise three strategies: (1) We instruct the LM examiner to generate questions across a multitude of domains to probe for a broad acquisition, and raise follow-up questions to engage in a more in-depth assessment. (2) Upon evaluation, the examiner combines both scoring and ranking measurements, providing a reliable result as it aligns closely with human annotations. (3) We additionally propose a decentralized Peer-examination method to address the biases in a single examiner. Our data and benchmarking results are available at: http://lmexam.xlore.cn. △ Less

Submitted 4 November, 2023; v1 submitted 7 June, 2023; originally announced June 2023.

Comments: NeurIPS 2023 Datasets and Benchmarks

arXiv:2305.16580 [pdf, other]

doi 10.1109/TNNLS.2024.3443455

TFDet: Target-Aware Fusion for RGB-T Pedestrian Detection

Authors: Xue Zhang, Xiaohan Zhang, Jiangtao Wang, Jiacheng Ying, Zehua Sheng, Heng Yu, Chunguang Li, Hui-Liang Shen

Abstract: Pedestrian detection plays a critical role in computer vision as it contributes to ensuring traffic safety. Existing methods that rely solely on RGB images suffer from performance degradation under low-light conditions due to the lack of useful information. To address this issue, recent multispectral detection approaches have combined thermal images to provide complementary information and have ob… ▽ More Pedestrian detection plays a critical role in computer vision as it contributes to ensuring traffic safety. Existing methods that rely solely on RGB images suffer from performance degradation under low-light conditions due to the lack of useful information. To address this issue, recent multispectral detection approaches have combined thermal images to provide complementary information and have obtained enhanced performances. Nevertheless, few approaches focus on the negative effects of false positives caused by noisy fused feature maps. Different from them, we comprehensively analyze the impacts of false positives on the detection performance and find that enhancing feature contrast can significantly reduce these false positives. In this paper, we propose a novel target-aware fusion strategy for multispectral pedestrian detection, named TFDet. TFDet achieves state-of-the-art performance on two multispectral pedestrian benchmarks, KAIST and LLVIP. TFDet can easily extend to multi-class object detection scenarios. It outperforms the previous best approaches on two multispectral object detection benchmarks, FLIR and M3FD. Importantly, TFDet has comparable inference efficiency to the previous approaches, and has remarkably good detection performance even under low-light conditions, which is a significant advancement for ensuring road safety. △ Less

Submitted 27 August, 2024; v1 submitted 25 May, 2023; originally announced May 2023.

Comments: This paper has been accepted by IEEE T-NNLS journal. Please jump to External DOI to view the official version

arXiv:2210.15471 [pdf, other]

Adaptive Estimation of Graphical Models under Total Positivity

Authors: Jiaxi Ying, José Vinícius de M. Cardoso, Daniel P. Palomar

Abstract: We consider the problem of estimating (diagonally dominant) M-matrices as precision matrices in Gaussian graphical models. These models exhibit intriguing properties, such as the existence of the maximum likelihood estimator with merely two observations for M-matrices \citep{lauritzen2019maximum,slawski2015estimation} and even one observation for diagonally dominant M-matrices \citep{truell2021max… ▽ More We consider the problem of estimating (diagonally dominant) M-matrices as precision matrices in Gaussian graphical models. These models exhibit intriguing properties, such as the existence of the maximum likelihood estimator with merely two observations for M-matrices \citep{lauritzen2019maximum,slawski2015estimation} and even one observation for diagonally dominant M-matrices \citep{truell2021maximum}. We propose an adaptive multiple-stage estimation method that refines the estimate by solving a weighted $\ell_1$-regularized problem at each stage. Furthermore, we develop a unified framework based on the gradient projection method to solve the regularized problem, incorporating distinct projections to handle the constraints of M-matrices and diagonally dominant M-matrices. A theoretical analysis of the estimation error is provided. Our method outperforms state-of-the-art methods in precision matrix estimation and graph edge identification, as evidenced by synthetic and financial time-series data sets. △ Less

Submitted 8 June, 2023; v1 submitted 27 October, 2022; originally announced October 2022.

Comments: 26 pages

arXiv:2210.12339 [pdf, other]

P$^3$LM: Probabilistically Permuted Prophet Language Modeling for Generative Pre-Training

Authors: Junwei Bao, Yifan Wang, Jiangyong Ying, Yeyun Gong, Jing Zhao, Youzheng Wu, Xiaodong He

Abstract: Conventional autoregressive left-to-right (L2R) sequence generation faces two issues during decoding: limited to unidirectional target sequence modeling, and constrained on strong local dependencies. To address the aforementioned problem, we propose P$^3$LM, a probabilistically permuted prophet language model, which strengthens the modeling of bidirectional information and long token dependencies… ▽ More Conventional autoregressive left-to-right (L2R) sequence generation faces two issues during decoding: limited to unidirectional target sequence modeling, and constrained on strong local dependencies. To address the aforementioned problem, we propose P$^3$LM, a probabilistically permuted prophet language model, which strengthens the modeling of bidirectional information and long token dependencies for sequence generation. Specifically, P$^3$LM learns to generate tokens in permuted order upon an order-aware transformer decoder, as well as to generate the corresponding future $N$ tokens with a multi-stream attention mechanism. Extensive experiments are conducted on the GLGE benchmark, which includes four datasets for summarization, two for question generation, one for conversational question answering, and one for dialog response generation, where P$^3$LM achieves state-of-the-art results compared with strong publicly available generative pre-training methods. △ Less

Submitted 21 October, 2022; originally announced October 2022.

Comments: Accepted to EMNLP(Findings) 2022

arXiv:2112.09008 [pdf, other]

APTSHIELD: A Stable, Efficient and Real-time APT Detection System for Linux Hosts

Authors: Tiantian Zhu, Jinkai Yu, Tieming Chen, Jiayu Wang, Jie Ying, Ye Tian, Mingqi Lv, Yan Chen, Yuan Fan, Ting Wang

Abstract: Advanced Persistent Threat (APT) attack usually refers to the form of long-term, covert and sustained attack on specific targets, with an adversary using advanced attack techniques to destroy the key facilities of an organization. APT attacks have caused serious security threats and massive financial loss worldwide. Academics and industry thereby have proposed a series of solutions to detect APT a… ▽ More Advanced Persistent Threat (APT) attack usually refers to the form of long-term, covert and sustained attack on specific targets, with an adversary using advanced attack techniques to destroy the key facilities of an organization. APT attacks have caused serious security threats and massive financial loss worldwide. Academics and industry thereby have proposed a series of solutions to detect APT attacks, such as dynamic/static code analysis, traffic detection, sandbox technology, endpoint detection and response (EDR), etc. However, existing defenses are failed to accurately and effectively defend against the current APT attacks that exhibit strong persistent, stealthy, diverse and dynamic characteristics due to the weak data source integrity, large data processing overhead and poor real-time performance in the process of real-world scenarios. To overcome these difficulties, in this paper we propose APTSHIELD, a stable, efficient and real-time APT detection system for Linux hosts. In the aspect of data collection, audit is selected to stably collect kernel data of the operating system so as to carry out a complete portrait of the attack based on comprehensive analysis and comparison of existing logging tools; In the aspect of data processing, redundant semantics skipping and non-viable node pruning are adopted to reduce the amount of data, so as to reduce the overhead of the detection system; In the aspect of attack detection, an APT attack detection framework based on ATT\&CK model is designed to carry out real-time attack response and alarm through the transfer and aggregation of labels. Experimental results on both laboratory and Darpa Engagement show that our system can effectively detect web vulnerability attacks, file-less attacks and remote access trojan attacks, and has a low false positive rate, which adds far more value than the existing frontier work. △ Less

Submitted 17 December, 2021; v1 submitted 16 December, 2021; originally announced December 2021.

arXiv:2112.01939 [pdf, other]

Fast Projected Newton-like Method for Precision Matrix Estimation under Total Positivity

Authors: Jian-Feng Cai, José Vinícius de M. Cardoso, Daniel P. Palomar, Jiaxi Ying

Abstract: We study the problem of estimating precision matrices in Gaussian distributions that are multivariate totally positive of order two ($\mathrm{MTP}_2$). The precision matrix in such a distribution is an M-matrix. This problem can be formulated as a sign-constrained log-determinant program. Current algorithms are designed using the block coordinate descent method or the proximal point algorithm, whi… ▽ More We study the problem of estimating precision matrices in Gaussian distributions that are multivariate totally positive of order two ($\mathrm{MTP}_2$). The precision matrix in such a distribution is an M-matrix. This problem can be formulated as a sign-constrained log-determinant program. Current algorithms are designed using the block coordinate descent method or the proximal point algorithm, which becomes computationally challenging in high-dimensional cases due to the requirement to solve numerous nonnegative quadratic programs or large-scale linear systems. To address this issue, we propose a novel algorithm based on the two-metric projection method, incorporating a carefully designed search direction and variable partitioning scheme. Our algorithm substantially reduces computational complexity, and its theoretical convergence is established. Experimental results on synthetic and real-world datasets demonstrate that our proposed algorithm provides a significant improvement in computational efficiency compared to the state-of-the-art methods. △ Less

Submitted 22 October, 2023; v1 submitted 3 December, 2021; originally announced December 2021.

arXiv:2111.14072 [pdf, other]

A Comprehensive and Cross-Platform Test Suite for Memory Safety -- Towards an Open Framework for Testing Processor Hardware Supported Security Extensions

Authors: Wei Song, Jiameng Ying, Sihao Shen, Boya Li, Hao Ma, Peng Liu

Abstract: Memory safety remains a critical and widely violated property in reality. Numerous defense techniques have been proposed and developed but most of them are not applied or enabled by default in production-ready environment due to their substantial running cost. The situation might change in the near future because the hardware supported defenses against these attacks are finally beginning to be ado… ▽ More Memory safety remains a critical and widely violated property in reality. Numerous defense techniques have been proposed and developed but most of them are not applied or enabled by default in production-ready environment due to their substantial running cost. The situation might change in the near future because the hardware supported defenses against these attacks are finally beginning to be adopted by commercial processors, operating systems and compilers. We then face a question as there is currently no suitable test suite to measure the memory safety extensions supported on different processors. In fact, the issue is not constrained only for memory safety but all aspect of processor security. All of the existing test suites related to processor security lack some of the key properties, such as comprehensiveness, distinguishability and portability. As an initial step, we propose an expandable test framework for measuring the processor security and open source a memory safety test suite utilizing this framework. The framework is deliberately designed to be flexible so it can be gradually extended to all types of hardware supported security extensions in processors. The initial test suite for memory safety currently contains 160 test cases covering spatial and temporal safety of memory, memory access control, pointer integrity and control-flow integrity. Each type of vulnerabilities and their related defenses have been individually evaluated by one or more test cases. The test suite has been ported to three different instruction set architectures (ISAs) and experimented on six different platforms. We have also utilized the test suite to explore the security benefits of applying different sets of compiler flags available on the latest GNU GCC and LLVM compilers. △ Less

Submitted 28 November, 2021; originally announced November 2021.

arXiv:2012.15410 [pdf, other]

Algorithms for Learning Graphs in Financial Markets

Authors: José Vinícius de Miranda Cardoso, Jiaxi Ying, Daniel Perez Palomar

Abstract: In the past two decades, the field of applied finance has tremendously benefited from graph theory. As a result, novel methods ranging from asset network estimation to hierarchical asset selection and portfolio allocation are now part of practitioners' toolboxes. In this paper, we investigate the fundamental problem of learning undirected graphical models under Laplacian structural constraints fro… ▽ More In the past two decades, the field of applied finance has tremendously benefited from graph theory. As a result, novel methods ranging from asset network estimation to hierarchical asset selection and portfolio allocation are now part of practitioners' toolboxes. In this paper, we investigate the fundamental problem of learning undirected graphical models under Laplacian structural constraints from the point of view of financial market times series data. In particular, we present natural justifications, supported by empirical evidence, for the usage of the Laplacian matrix as a model for the precision matrix of financial assets, while also establishing a direct link that reveals how Laplacian constraints are coupled to meaningful physical interpretations related to the market index factor and to conditional correlations between stocks. Those interpretations lead to a set of guidelines that practitioners should be aware of when estimating graphs in financial markets. In addition, we design numerical algorithms based on the alternating direction method of multipliers to learn undirected, weighted graphs that take into account stylized facts that are intrinsic to financial data such as heavy tails and modularity. We illustrate how to leverage the learned graphs into practical scenarios such as stock time series clustering and foreign exchange network estimation. The proposed graph learning algorithms outperform the state-of-the-art methods in an extensive set of practical experiments. Furthermore, we obtain theoretical and empirical convergence results for the proposed algorithms. Along with the developed methodologies for graph learning in financial markets, we release an R package, called fingraph, accommodating the code and data to obtain all the experimental results. △ Less

Submitted 30 December, 2020; originally announced December 2020.

Comments: 62 pages, 25 figures

arXiv:2012.01046 [pdf, other]

PiPoMonitor: Mitigating Cross-core Cache Attacks Using the Auto-Cuckoo Filter

Authors: Fengkai Yuan, Kai Wang, Rui Hou, Xiaoxin Li, Peinan Li, Lutan Zhao, Jiameng Ying, Amro Awad, Dan Meng

Abstract: Cache side channel attacks obtain victim cache line access footprint to infer security-critical information. Among them, cross-core attacks exploiting the shared last level cache are more threatening as their simplicity to set up and high capacity. Stateful approaches of detection-based mitigation observe precise cache behaviors and protect specific cache lines that are suspected of being attacked… ▽ More Cache side channel attacks obtain victim cache line access footprint to infer security-critical information. Among them, cross-core attacks exploiting the shared last level cache are more threatening as their simplicity to set up and high capacity. Stateful approaches of detection-based mitigation observe precise cache behaviors and protect specific cache lines that are suspected of being attacked. However, their recording structures incur large storage overhead and are vulnerable to reverse engineering attacks. Exploring the intrinsic non-determinate layout of a traditional Cuckoo filter, this paper proposes a space efficient Auto-Cuckoo filter to record access footprints, which succeed to decrease storage overhead and resist reverse engineering attacks at the same time. With Auto-Cuckoo filter, we propose PiPoMonitor to detect \textit{Ping-Pong patterns} and prefetch specific cache line to interfere with adversaries' cache probes. Security analysis shows the PiPoMonitor can effectively mitigate cross-core attacks and the Auto-Cuckoo filter is immune to reverse engineering attacks. Evaluation results indicate PiPoMonitor has negligible impact on performance and the storage overhead is only 0.37$\%$, an order of magnitude lower than previous stateful approaches. △ Less

Submitted 7 December, 2020; v1 submitted 2 December, 2020; originally announced December 2020.

Comments: This paper is going to appear in 2021 Design, Automation and Test in Europe Conference (DATE)

MSC Class: 68M01 (Primary) 68M99 (Secondary) ACM Class: C.1.0; C.5.3

arXiv:2006.14925 [pdf, other]

Does the $\ell_1$-norm Learn a Sparse Graph under Laplacian Constrained Graphical Models?

Authors: Jiaxi Ying, José Vinícius de M. Cardoso, Daniel P. Palomar

Abstract: We consider the problem of learning a sparse graph under the Laplacian constrained Gaussian graphical models. This problem can be formulated as a penalized maximum likelihood estimation of the Laplacian constrained precision matrix. Like in the classical graphical lasso problem, recent works made use of the $\ell_1$-norm regularization with the goal of promoting sparsity in Laplacian constrained p… ▽ More We consider the problem of learning a sparse graph under the Laplacian constrained Gaussian graphical models. This problem can be formulated as a penalized maximum likelihood estimation of the Laplacian constrained precision matrix. Like in the classical graphical lasso problem, recent works made use of the $\ell_1$-norm regularization with the goal of promoting sparsity in Laplacian constrained precision matrix estimation. However, we find that the widely used $\ell_1$-norm is not effective in imposing a sparse solution in this problem. Through empirical evidence, we observe that the number of nonzero graph weights grows with the increase of the regularization parameter. From a theoretical perspective, we prove that a large regularization parameter will surprisingly lead to a complete graph, i.e., every pair of vertices is connected by an edge. To address this issue, we introduce the nonconvex sparsity penalty, and propose a new estimator by solving a sequence of weighted $\ell_1$-norm penalized sub-problems. We establish the non-asymptotic optimization performance guarantees on both optimization error and statistical error, and prove that the proposed estimator can recover the edges correctly with a high probability. To solve each sub-problem, we develop a projected gradient descent algorithm which enjoys a linear convergence rate. Finally, an extension to learn disconnected graphs is proposed by imposing additional rank constraint. We propose a numerical algorithm based on based on the alternating direction method of multipliers, and establish its theoretical sequence convergence. Numerical experiments involving synthetic and real-world data sets demonstrate the effectiveness of the proposed method. △ Less

Submitted 5 September, 2023; v1 submitted 26 June, 2020; originally announced June 2020.

arXiv:1909.11594 [pdf, ps, other]

Structured Graph Learning Via Laplacian Spectral Constraints

Authors: Sandeep Kumar, Jiaxi Ying, Jos'e Vin'icius de M. Cardoso, Daniel P. Palomar

Abstract: Learning a graph with a specific structure is essential for interpretability and identification of the relationships among data. It is well known that structured graph learning from observed samples is an NP-hard combinatorial problem. In this paper, we first show that for a set of important graph families it is possible to convert the structural constraints of structure into eigenvalue constraint… ▽ More Learning a graph with a specific structure is essential for interpretability and identification of the relationships among data. It is well known that structured graph learning from observed samples is an NP-hard combinatorial problem. In this paper, we first show that for a set of important graph families it is possible to convert the structural constraints of structure into eigenvalue constraints of the graph Laplacian matrix. Then we introduce a unified graph learning framework, lying at the integration of the spectral properties of the Laplacian matrix with Gaussian graphical modeling that is capable of learning structures of a large class of graph families. The proposed algorithms are provably convergent and practically amenable for large-scale semi-supervised and unsupervised graph-based learning tasks. Extensive numerical experiments with both synthetic and real data sets demonstrate the effectiveness of the proposed methods. An R package containing code for all the experimental results is available at https://cran.r-project.org/package=spectralGraphTopology. △ Less

Submitted 24 September, 2019; originally announced September 2019.

Comments: 12 Pages, Accepted for NIPS 2019. arXiv admin note: substantial text overlap with arXiv:1904.09792

arXiv:1904.09792 [pdf, ps, other]

A Unified Framework for Structured Graph Learning via Spectral Constraints

Authors: Sandeep Kumar, Jiaxi Ying, José Vinícius de M. Cardoso, Daniel Palomar

Abstract: Graph learning from data represents a canonical problem that has received substantial attention in the literature. However, insufficient work has been done in incorporating prior structural knowledge onto the learning of underlying graphical models from data. Learning a graph with a specific structure is essential for interpretability and identification of the relationships among data. Useful stru… ▽ More Graph learning from data represents a canonical problem that has received substantial attention in the literature. However, insufficient work has been done in incorporating prior structural knowledge onto the learning of underlying graphical models from data. Learning a graph with a specific structure is essential for interpretability and identification of the relationships among data. Useful structured graphs include the multi-component graph, bipartite graph, connected graph, sparse graph, and regular graph. In general, structured graph learning is an NP-hard combinatorial problem, therefore, designing a general tractable optimization method is extremely challenging. In this paper, we introduce a unified graph learning framework lying at the integration of Gaussian graphical models and spectral graph theory. To impose a particular structure on a graph, we first show how to formulate the combinatorial constraints as an analytical property of the graph matrix. Then we develop an optimization framework that leverages graph learning with specific structures via spectral constraints on graph matrices. The proposed algorithms are provably convergent, computationally efficient, and practically amenable for numerous graph-based tasks. Extensive numerical experiments with both synthetic and real data sets illustrate the effectiveness of the proposed algorithms. The code for all the simulations is made available as an open source repository. △ Less

Submitted 22 April, 2019; originally announced April 2019.

arXiv:1604.02100 [pdf, other]

doi 10.1109/TSP.2017.2695566

Hankel Matrix Nuclear Norm Regularized Tensor Completion for $N$-dimensional Exponential Signals

Authors: Jiaxi Ying, Hengfa Lu, Qingtao Wei, Jian-Feng Cai, Di Guo, Jihui Wu, Zhong Chen, Xiaobo Qu

Abstract: Signals are generally modeled as a superposition of exponential functions in spectroscopy of chemistry, biology and medical imaging. For fast data acquisition or other inevitable reasons, however, only a small amount of samples may be acquired and thus how to recover the full signal becomes an active research topic. But existing approaches can not efficiently recover $N$-dimensional exponential si… ▽ More Signals are generally modeled as a superposition of exponential functions in spectroscopy of chemistry, biology and medical imaging. For fast data acquisition or other inevitable reasons, however, only a small amount of samples may be acquired and thus how to recover the full signal becomes an active research topic. But existing approaches can not efficiently recover $N$-dimensional exponential signals with $N\geq 3$. In this paper, we study the problem of recovering N-dimensional (particularly $N\geq 3$) exponential signals from partial observations, and formulate this problem as a low-rank tensor completion problem with exponential factor vectors. The full signal is reconstructed by simultaneously exploiting the CANDECOMP/PARAFAC structure and the exponential structure of the associated factor vectors. The latter is promoted by minimizing an objective function involving the nuclear norm of Hankel matrices. Experimental results on simulated and real magnetic resonance spectroscopy data show that the proposed approach can successfully recover full signals from very limited samples and is robust to the estimated tensor rank. △ Less

Submitted 31 March, 2017; v1 submitted 6 April, 2016; originally announced April 2016.

Comments: 15 pages, 12 figures

Showing 1–34 of 34 results for author: Ying, J