doi 10.1145/3510457.3513044

What are Weak Links in the npm Supply Chain?

Authors: Nusrat Zahan, Thomas Zimmermann, Patrice Godefroid, Brendan Murphy, Chandra Maddila, Laurie Williams

Abstract: Modern software development frequently uses third-party packages, raising the concern of supply chain security attacks. Many attackers target popular package managers, like npm, and their users with supply chain attacks. In 2021 there was a 650% year-on-year growth in security attacks by exploiting Open Source Software's supply chain. Proactive approaches are needed to predict package vulnerability to high-risk supply chain attacks. The goal of this work is to help software developers and security specialists in measuring npm supply chain weak link signals to prevent future supply chain attacks by empirically studying npm package metadata. In this paper, we analyzed the metadata of 1.63 million JavaScript npm packages. We propose six signals of security weaknesses in a software supply chain, such as the presence of install scripts, maintainer accounts associated with an expired email domain, and inactive packages with inactive maintainers. One of our case studies identified 11 malicious packages from the install scripts signal. We also found 2,818 maintainer email addresses associated with expired domains, allowing an attacker to hijack 8,494 packages by taking over the npm accounts. We obtained feedback on our weak link signals through a survey responded to by 470 npm package developers. The majority of the developers supported three out of our six proposed weak link signals. The developers also indicated that they would want to be notified about weak link signals before using third-party packages. Additionally, we discussed eight new signals suggested by package developers.
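As a rough illustration of the install scripts signal named in the abstract, the following sketch checks a package manifest for scripts that npm runs automatically at install time. The hook names `preinstall`, `install`, and `postinstall` are standard npm lifecycle scripts; the function name `install_scripts` and the sample manifest are hypothetical and are not taken from the paper, whose actual analysis may differ.

```python
import json

# Lifecycle hooks that npm runs automatically during `npm install`.
INSTALL_HOOKS = ("preinstall", "install", "postinstall")


def install_scripts(package_json_text: str) -> dict:
    """Return the install-time scripts declared in a package.json document.

    Hypothetical helper for illustration; not the paper's implementation.
    """
    manifest = json.loads(package_json_text)
    scripts = manifest.get("scripts")
    if not isinstance(scripts, dict):
        # A missing or malformed "scripts" field declares no hooks.
        return {}
    return {hook: scripts[hook] for hook in INSTALL_HOOKS if hook in scripts}


# Hypothetical manifest, for illustration only.
example = json.dumps({
    "name": "example-package",
    "version": "1.0.0",
    "scripts": {"test": "node test.js", "postinstall": "node setup.js"},
})

flagged = install_scripts(example)
print(flagged)
```

A non-empty result only says that code will run on installation; deciding whether that code is malicious requires further review.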

Submitted 14 February, 2022; v1 submitted 19 December, 2021; originally announced December 2021.

arXiv:2103.03846 [pdf]

Anomalicious: Automated Detection of Anomalous and Potentially Malicious Commits on GitHub

Authors: Danielle Gonzalez, Thomas Zimmermann, Patrice Godefroid, Max Schaefer

Abstract: Security is critical to the adoption of open source software (OSS), yet few automated solutions currently exist to help detect and prevent malicious contributions from infecting open source repositories. On GitHub, a primary host of OSS, repositories contain not only code but also a wealth of commit-related and contextual metadata - what if this metadata could be used to automatically identify malicious OSS contributions? In this work, we show how to use only commit logs and repository metadata to automatically detect anomalous and potentially malicious commits. We identify and evaluate several relevant factors which can be automatically computed from this data, such as the modification of sensitive files, outlier change properties, or a lack of trust in the commit's author. Our tool, Anomalicious, automatically computes these factors and considers them holistically using a rule-based decision model. In an evaluation on a data set of 15 malware-infected repositories, Anomalicious showed promising results and identified 53.33% of malicious commits, while flagging less than 1% of commits for most repositories. Additionally, the tool found other interesting anomalies that are not related to malicious commits in an analysis of repositories with no known malicious commits.

Submitted 9 March, 2021; v1 submitted 5 March, 2021; originally announced March 2021.

Comments: 10 pages, 3 figures, 3 tables. To appear at the 2021 International Conference on Software Engineering (ICSE), Software Engineering in Practice (SEiP) track

arXiv:2012.11401 [pdf, other]

Universal Policies for Software-Defined MDPs

Authors: Daniel Selsam, Jesse Michael Han, Leonardo de Moura, Patrice Godefroid

Abstract: We introduce a new programming paradigm called oracle-guided decision programming in which a program specifies a Markov Decision Process (MDP) and the language provides a universal policy. We prototype a new programming language, Dodona, that manifests this paradigm using a primitive 'choose' representing nondeterministic choice. The Dodona interpreter returns either a value or a choicepoint that includes a lossless encoding of all information necessary in principle to make an optimal decision. Meta-interpreters query Dodona's (neural) oracle on these choicepoints to get policy and value estimates, which they can use to perform heuristic search on the underlying MDP. We demonstrate Dodona's potential for zero-shot heuristic guidance by meta-learning over hundreds of synthetic tasks that simulate basic operations over lists, trees, Church datastructures, polynomials, first-order terms and higher-order terms.

Submitted 21 December, 2020; originally announced December 2020.

arXiv:2005.11498 [pdf, other]

Pythia: Grammar-Based Fuzzing of REST APIs with Coverage-guided Feedback and Learning-based Mutations

Authors: Vaggelis Atlidakis, Roxana Geambasu, Patrice Godefroid, Marina Polishchuk, Baishakhi Ray

Abstract: This paper introduces Pythia, the first fuzzer that augments grammar-based fuzzing with coverage-guided feedback and a learning-based mutation strategy for stateful REST API fuzzing. Pythia uses a statistical model to learn common usage patterns of a target REST API from structurally valid seed inputs. It then generates learning-based mutations by injecting a small amount of noise deviating from common usage patterns while still maintaining syntactic validity. Pythia's mutation strategy helps generate grammatically valid test cases and coverage-guided feedback helps prioritize the test cases that are more likely to find bugs. We present experimental evaluation on three production-scale, open-source cloud services showing that Pythia outperforms prior approaches both in code coverage and new bugs found. Using Pythia, we found 29 new bugs which we are in the process of reporting to the respective service owners.

Submitted 23 May, 2020; originally announced May 2020.

arXiv:1806.09739 [pdf, other]

REST-ler: Automatic Intelligent REST API Fuzzing

Authors: Vaggelis Atlidakis, Patrice Godefroid, Marina Polishchuk

Abstract: Cloud services have recently exploded with the advent of powerful cloud-computing platforms such as Amazon Web Services and Microsoft Azure. Today, most cloud services are accessed through REST APIs, and Swagger is arguably the most popular interface-description language for REST APIs. A Swagger specification describes how to access a cloud service through its REST API (e.g., what requests the service can handle and what responses may be expected). This paper introduces REST-ler, the first automatic intelligent REST API security-testing tool. REST-ler analyzes a Swagger specification and generates tests that exercise the corresponding cloud service through its REST API. Each test is defined as a sequence of requests and responses. REST-ler generates tests intelligently by (1) inferring dependencies among request types declared in the Swagger specification (e.g., inferring that "a request B should not be executed before a request A" because B takes as an input argument a resource-id x returned by A) and by (2) analyzing dynamic feedback from responses observed during prior test executions in order to generate new tests (e.g., learning that "a request C after a request sequence A;B is refused by the service" and therefore avoiding this combination in the future). We show that these two techniques are necessary to thoroughly exercise a service under test while pruning the large search space of possible request sequences. We also discuss the application of REST-ler to test GitLab, a large popular open-source self-hosted Git service, and the new bugs that were found.
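The dependency inference described in point (1) of the abstract can be sketched as a producer/consumer analysis over request types. This is a minimal illustration under stated assumptions, not REST-ler's implementation: the `RequestType` class, the functions `infer_dependencies` and `is_executable`, and the two example endpoints are all hypothetical, and each request type is assumed to be already reduced to the resource ids it consumes and produces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestType:
    """Hypothetical summary of one request type from an API specification."""
    name: str
    consumes: frozenset = frozenset()  # resource ids required as input
    produces: frozenset = frozenset()  # resource ids returned in the response


def infer_dependencies(requests):
    """Return (producer, consumer) name pairs: the consumer needs a
    resource id that the producer returns, so it should not run first."""
    pairs = set()
    for consumer in requests:
        for producer in requests:
            if producer is not consumer and producer.produces & consumer.consumes:
                pairs.add((producer.name, consumer.name))
    return pairs


def is_executable(sequence):
    """Check that every request's inputs were produced earlier in the sequence."""
    available = set()
    for request in sequence:
        if not request.consumes <= available:
            return False
        available |= request.produces
    return True


# Hypothetical request types, for illustration only.
a = RequestType("POST /projects", produces=frozenset({"project_id"}))
b = RequestType("GET /projects/{project_id}", consumes=frozenset({"project_id"}))

print(infer_dependencies([a, b]))
print(is_executable([a, b]), is_executable([b, a]))
```

Filtering candidate sequences with a check like `is_executable` is one way to prune the search space before any request is sent; the dynamic feedback in point (2) would prune further based on observed responses.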

Submitted 25 June, 2018; originally announced June 2018.

arXiv:1801.04589 [pdf, other]

Deep Reinforcement Fuzzing

Authors: Konstantin Böttinger, Patrice Godefroid, Rishabh Singh

Abstract: Fuzzing is the process of finding security vulnerabilities in input-processing code by repeatedly testing the code with modified inputs. In this paper, we formalize fuzzing as a reinforcement learning problem using the concept of Markov decision processes. This in turn allows us to apply state-of-the-art deep Q-learning algorithms that optimize rewards, which we define from runtime properties of the program under test. By observing the rewards caused by mutating with a specific set of actions performed on an initial program input, the fuzzing agent learns a policy that can next generate new higher-reward inputs. We have implemented this new approach, and preliminary empirical evidence shows that reinforcement fuzzing can outperform baseline random fuzzing.

Submitted 14 January, 2018; originally announced January 2018.

arXiv:1701.07232 [pdf, other]

Learn&Fuzz: Machine Learning for Input Fuzzing

Authors: Patrice Godefroid, Hila Peleg, Rishabh Singh

Abstract: Fuzzing consists of repeatedly testing an application with modified, or fuzzed, inputs with the goal of finding security vulnerabilities in input-parsing code. In this paper, we show how to automate the generation of an input grammar suitable for input fuzzing using sample inputs and neural-network-based statistical machine-learning techniques. We present a detailed case study with a complex input format, namely PDF, and a large complex security-critical parser for this format, namely, the PDF parser embedded in Microsoft's new Edge browser. We discuss (and measure) the tension between conflicting learning and fuzzing goals: learning wants to capture the structure of well-formed inputs, while fuzzing wants to break that structure in order to cover unexpected code paths and find bugs. We also present a new algorithm for this learn&fuzz challenge which uses a learnt input probability distribution to intelligently guide where to fuzz inputs.

Submitted 25 January, 2017; originally announced January 2017.

Showing 1–7 of 7 results for author: Godefroid, P