-
FabricQA-Extractor: A Question Answering System to Extract Information from Documents using Natural Language Questions
Authors:
Qiming Wang,
Raul Castro Fernandez
Abstract:
Reading comprehension models answer questions posed in natural language when provided with a short passage of text. They present an opportunity to address a long-standing challenge in data management: the extraction of structured data from unstructured text. Consequently, several approaches are using these models to perform information extraction. However, these modern approaches leave an opportun…
▽ More
Reading comprehension models answer questions posed in natural language when provided with a short passage of text. They present an opportunity to address a long-standing challenge in data management: the extraction of structured data from unstructured text. Consequently, several approaches are using these models to perform information extraction. However, these modern approaches leave an opportunity behind because they do not exploit the relational structure of the target extraction table. In this paper, we introduce a new model, Relation Coherence, that exploits knowledge of the relational structure to improve the extraction quality. We incorporate the Relation Coherence model as part of FabricQA-Extractor, an end-to-end system we built from scratch to conduct large scale extraction tasks over millions of documents. We demonstrate on two datasets with millions of passages that Relation Coherence boosts extraction performance and evaluate FabricQA-Extractor on large scale datasets.
△ Less
Submitted 17 August, 2024;
originally announced August 2024.
-
Programmable Dataflows: Abstraction and Programming Model for Data Sharing
Authors:
Siyuan Xia,
Chris Zhu,
Tapan Srivastava,
Bridget Fahey,
Raul Castro Fernandez
Abstract:
Data sharing is central to a wide variety of applications such as fraud detection, ad matching, and research. The lack of data sharing abstractions makes the solution to each data sharing problem bespoke and cost-intensive, hampering value generation. In this paper, we first introduce a data sharing model to represent every data sharing problem with a sequence of dataflows. From the model, we dist…
▽ More
Data sharing is central to a wide variety of applications such as fraud detection, ad matching, and research. The lack of data sharing abstractions makes the solution to each data sharing problem bespoke and cost-intensive, hampering value generation. In this paper, we first introduce a data sharing model to represent every data sharing problem with a sequence of dataflows. From the model, we distill an abstraction, the contract, which agents use to communicate the intent of a dataflow and evaluate its consequences, before the dataflow takes place. This helps agents move towards a common sharing goal without violating any regulatory and privacy constraints. Then, we design and implement the contract programming model (CPM), which allows agents to program data sharing applications catered to each problem's needs.
Contracts permit data sharing, but their interactive nature may introduce inefficiencies. To mitigate those inefficiencies, we extend the CPM so that it can save intermediate outputs of dataflows, and skip computation if a dataflow tries to access data that it does not have access to. In our evaluation, we show that 1) the contract abstraction is general enough to represent a wide range of sharing problems, 2) we can write programs for complex data sharing problems and exhibit qualitative improvements over other alternate technologies, and 3) quantitatively, our optimizations make sharing programs written with the CPM efficient.
△ Less
Submitted 7 August, 2024;
originally announced August 2024.
-
Controlling Dataflows with a Bolt-on Data Escrow
Authors:
Zhiru Zhu,
Raul Castro Fernandez
Abstract:
The data-driven economy has created tremendous value in our society. Individuals share their data with platforms in exchange for services such as search, social networks, and health recommendations. Platforms use the data to provide those services and create other revenue-generating opportunities, e.g., selling the data to data brokers. With the ever-expanding data economy comes the growing concer…
▽ More
The data-driven economy has created tremendous value in our society. Individuals share their data with platforms in exchange for services such as search, social networks, and health recommendations. Platforms use the data to provide those services and create other revenue-generating opportunities, e.g., selling the data to data brokers. With the ever-expanding data economy comes the growing concern about potential data misuse. While most platforms give individuals certain control over their data (i.e., what data is being shared), individuals do not know how the data will be used once shared; they cannot control the purpose.
In this paper, we introduce a data escrow design that permits individuals to observe all dataflows - not just what is shared but for what purpose. Rather than data flowing to the platform, the platform delegates their computation to the escrow, where individuals can observe and manage their data. To make the data escrow practical, we design and implement a prototype that works alongside the Apple ecosystem; specifically, we retrofit the Apple SDKs with a programming interface to enable delegated computation. Our solution does not depend on Apple's software and can be applied to other platforms, but building for Apple lets us study the main hypothesis of our work: whether such a data escrow solution is a feasible alternative to today's data governance. We show that our escrow prototype implementation is efficient, and we analyze the dataflows in real-world apps and show that the escrow's programming interface supports implementing a wide range of dataflows.
△ Less
Submitted 2 August, 2024;
originally announced August 2024.
-
Saving Money for Analytical Workloads in the Cloud
Authors:
Tapan Srivastava,
Raul Castro Fernandez
Abstract:
As users migrate their analytical workloads to cloud databases, it is becoming just as important to reduce monetary costs as it is to optimize query runtime. In the cloud, a query is billed based on either its compute time or the amount of data it processes. We observe that analytical queries are either compute- or IO-bound and each query type executes cheaper in a different pricing model. We expl…
▽ More
As users migrate their analytical workloads to cloud databases, it is becoming just as important to reduce monetary costs as it is to optimize query runtime. In the cloud, a query is billed based on either its compute time or the amount of data it processes. We observe that analytical queries are either compute- or IO-bound and each query type executes cheaper in a different pricing model. We exploit this opportunity and propose methods to build cheaper execution plans across pricing models that complete within user-defined runtime constraints. We implement these methods and produce execution plans spanning multiple pricing models that reduce the monetary cost for workloads by as much as 56%. We reduce individual query costs by as much as 90%. The prices chosen by cloud vendors for cloud services also impact savings opportunities. To study this effect, we simulate our proposed methods with different cloud prices and observe that multi-cloud savings are robust to changes in cloud vendor prices. These results indicate the massive opportunity to save money by executing workloads across multiple pricing models.
△ Less
Submitted 31 July, 2024;
originally announced August 2024.
-
A Data-Centric Online Market for Machine Learning: From Discovery to Pricing
Authors:
Minbiao Han,
Jonathan Light,
Steven Xia,
Sainyam Galhotra,
Raul Castro Fernandez,
Haifeng Xu
Abstract:
Data fuels machine learning (ML) - rich and high-quality training data is essential to the success of ML. However, to transform ML from the race among a few large corporations to an accessible technology that serves numerous normal users' data analysis requests, there still exist important challenges. One gap we observed is that many ML users can benefit from new data that other data owners posses…
▽ More
Data fuels machine learning (ML) - rich and high-quality training data is essential to the success of ML. However, to transform ML from the race among a few large corporations to an accessible technology that serves numerous normal users' data analysis requests, there still exist important challenges. One gap we observed is that many ML users can benefit from new data that other data owners possess, whereas these data owners sit on piles of data without knowing who can benefit from it. This gap creates the opportunity for building an online market that can automatically connect supply with demand. While online matching markets are prevalent (e.g., ride-hailing systems), designing a data-centric market for ML exhibits many unprecedented challenges.
This paper develops new techniques to tackle two core challenges in designing such a market: (a) to efficiently match demand with supply, we design an algorithm to automatically discover useful data for any ML task from a pool of thousands of datasets, achieving high-quality matching between ML models and data; (b) to encourage market participation of ML users without much ML expertise, we design a new pricing mechanism for selling data-augmented ML models. Furthermore, our market is designed to be API-compatible with existing online ML markets like Vertex AI and Sagemaker, making it easy to use while providing better results due to joint data and model search. We envision that the synergy of our data and model discovery algorithm and pricing mechanism will be an important step towards building a new data-centric online market that serves ML users effectively.
△ Less
Submitted 26 October, 2023;
originally announced October 2023.
-
Making Differential Privacy Easier to Use for Data Controllers and Data Analysts using a Privacy Risk Indicator and an Escrow-Based Platform
Authors:
Zhiru Zhu,
Raul Castro Fernandez
Abstract:
Differential privacy (DP) enables private data analysis but is hard to use in practice. For data controllers who decide what output to release, choosing the amount of noise to add to the output is a non-trivial task because of the difficulty of interpreting the privacy parameter $ε$. For data analysts who submit queries, it is hard to understand the impact of the noise introduced by DP on their ta…
▽ More
Differential privacy (DP) enables private data analysis but is hard to use in practice. For data controllers who decide what output to release, choosing the amount of noise to add to the output is a non-trivial task because of the difficulty of interpreting the privacy parameter $ε$. For data analysts who submit queries, it is hard to understand the impact of the noise introduced by DP on their tasks.
To address these two challenges: 1) we define a privacy risk indicator that indicates the impact of choosing $ε$ on individuals' privacy and use that to design an algorithm to choose $ε$ and release output based on controllers' privacy preferences; 2) we introduce a utility signaling protocol that helps analysts interpret the impact of DP on their downstream tasks. We implement the algorithm and the protocol inside a new platform built on top of a data escrow, which allows controllers to control dataflows while maintaining high performance. We demonstrate our contributions through an IRB-approved user study, extensive experimental evaluations, and comparison with other DP platforms. All in all, our work contributes to making DP easier to use by lowering adoption barriers.
△ Less
Submitted 2 March, 2024; v1 submitted 19 October, 2023;
originally announced October 2023.
-
Saibot: A Differentially Private Data Search Platform
Authors:
Zezhou Huang,
Jiaxiang Liu,
Daniel Alabi,
Raul Castro Fernandez,
Eugene Wu
Abstract:
Recent data search platforms use ML task-based utility measures rather than metadata-based keywords, to search large dataset corpora. Requesters submit a training dataset and these platforms search for augmentations (join or union compatible datasets) that, when used to augment the requester's dataset, most improve model (e.g., linear regression) performance. Although effective, providers that man…
▽ More
Recent data search platforms use ML task-based utility measures rather than metadata-based keywords, to search large dataset corpora. Requesters submit a training dataset and these platforms search for augmentations (join or union compatible datasets) that, when used to augment the requester's dataset, most improve model (e.g., linear regression) performance. Although effective, providers that manage personally identifiable data demand differential privacy (DP) guarantees before granting these platforms data access. Unfortunately, making data search differentially private is nontrivial, as a single search can involve training and evaluating datasets hundreds or thousands of times, quickly depleting privacy budgets.
We present Saibot, a differentially private data search platform that employs Factorized Privacy Mechanism (FPM), a novel DP mechanism, to calculate sufficient semi-ring statistics for ML over different combinations of datasets. These statistics are privatized once, and can be freely reused for the search. This allows Saibot to scale to arbitrary numbers of datasets and requests, while minimizing the amount that DP noise affects search results. We optimize the sensitivity of FPM for common augmentation operations, and analyze its properties with respect to linear regression. Specifically, we develop an unbiased estimator for many-to-many joins, prove its bounds, and develop an optimization to redistribute DP noise to minimize the impact on the model. Our evaluation on a real-world dataset corpus of 329 datasets demonstrates that Saibot can return augmentations that achieve model accuracy within 50 to 90% of non-private search, while the leading alternative DP mechanisms (TPM, APM, shuffling) are several orders of magnitude worse.
△ Less
Submitted 1 July, 2023;
originally announced July 2023.
-
Addressing Budget Allocation and Revenue Allocation in Data Market Environments Using an Adaptive Sampling Algorithm
Authors:
Boxin Zhao,
Boxiang Lyu,
Raul Castro Fernandez,
Mladen Kolar
Abstract:
High-quality machine learning models are dependent on access to high-quality training data. When the data are not already available, it is tedious and costly to obtain them. Data markets help with identifying valuable training data: model consumers pay to train a model, the market uses that budget to identify data and train the model (the budget allocation problem), and finally the market compensa…
▽ More
High-quality machine learning models are dependent on access to high-quality training data. When the data are not already available, it is tedious and costly to obtain them. Data markets help with identifying valuable training data: model consumers pay to train a model, the market uses that budget to identify data and train the model (the budget allocation problem), and finally the market compensates data providers according to their data contribution (revenue allocation problem). For example, a bank could pay the data market to access data from other financial institutions to train a fraud detection model. Compensating data contributors requires understanding data's contribution to the model; recent efforts to solve this revenue allocation problem based on the Shapley value are inefficient to lead to practical data markets.
In this paper, we introduce a new algorithm to solve budget allocation and revenue allocation problems simultaneously in linear time. The new algorithm employs an adaptive sampling process that selects data from those providers who are contributing the most to the model. Better data means that the algorithm accesses those providers more often, and more frequent accesses corresponds to higher compensation. Furthermore, the algorithm can be deployed in both centralized and federated scenarios, boosting its applicability. We provide theoretical guarantees for the algorithm that show the budget is used efficiently and the properties of revenue allocation are similar to Shapley's. Finally, we conduct an empirical evaluation to show the performance of the algorithm in practical scenarios and when compared to other baselines. Overall, we believe that the new algorithm paves the way for the implementation of practical data markets.
△ Less
Submitted 4 June, 2023;
originally announced June 2023.
-
Kitana: Efficient Data Augmentation Search for AutoML
Authors:
Zezhou Huang,
Pranav Subramaniam,
Raul Castro Fernandez,
Eugene Wu
Abstract:
AutoML services provide a way for non-expert users to benefit from high-quality ML models without worrying about model design and deployment, in exchange for a charge per hour ($21.252 for VertexAI). However, existing AutoML services are model-centric, in that they are limited to extracting features and searching for models from initial training data-they are only as effective as the initial train…
▽ More
AutoML services provide a way for non-expert users to benefit from high-quality ML models without worrying about model design and deployment, in exchange for a charge per hour ($21.252 for VertexAI). However, existing AutoML services are model-centric, in that they are limited to extracting features and searching for models from initial training data-they are only as effective as the initial training data quality. With the increasing volume of tabular data available, there is a huge opportunity for data augmentation. For instance, vertical augmentation adds predictive features, while horizontal augmentation adds examples. This augmented training data yields potentially much better AutoML models at a lower cost. However, existing systems either forgo the augmentation opportunities that provide poor models, or apply expensive augmentation searching techniques that drain users' budgets.
Kitana is a data-centric AutoML system that also searches for new tabular datasets that can augment the tabular training data with new features and/or examples. Kitana manages a corpus of datasets, exposes an AutoML interface to users and searches for augmentation with datasets in the corpus to improve AutoML performance. To accelerate search, Kitana applies aggressive pre-computation to train a factorized proxy model and evaluate each candidate augmentation within 0.1s. Kitana also uses a cost model to limit the time spent on augmentation search, supports expressive data access controls, and performs request caching to benefit from past similar requests. Using a corpus of 518 open-source datasets, Kitana produces higher quality models than existing AutoML systems in orders of magnitude less time. Across different user requests, Kitana increases the model R2 from 0.16 to 0.66 while reducing the cost by >100x compared to the naive factorized learning and SOTA data augmentation search.
△ Less
Submitted 17 May, 2023;
originally announced May 2023.
-
Data Station: Delegated, Trustworthy, and Auditable Computation to Enable Data-Sharing Consortia with a Data Escrow
Authors:
Siyuan Xia,
Zhiru Zhu,
Chris Zhu,
Jinjin Zhao,
Kyle Chard,
Aaron J. Elmore,
Ian Foster,
Michael Franklin,
Sanjay Krishnan,
Raul Castro Fernandez
Abstract:
Pooling and sharing data increases and distributes its value. But since data cannot be revoked once shared, scenarios that require controlled release of data for regulatory, privacy, and legal reasons default to not sharing. Because selectively controlling what data to release is difficult, the few data-sharing consortia that exist are often built around data-sharing agreements resulting from long…
▽ More
Pooling and sharing data increases and distributes its value. But since data cannot be revoked once shared, scenarios that require controlled release of data for regulatory, privacy, and legal reasons default to not sharing. Because selectively controlling what data to release is difficult, the few data-sharing consortia that exist are often built around data-sharing agreements resulting from long and tedious one-off negotiations. We introduce Data Station, a data escrow designed to enable the formation of data-sharing consortia. Data owners share data with the escrow knowing it will not be released without their consent. Data users delegate their computation to the escrow. The data escrow relies on delegated computation to execute queries without releasing the data first. Data Station leverages hardware enclaves to generate trust among participants, and exploits the centralization of data and computation to generate an audit log. We evaluate Data Station on machine learning and data-sharing applications while running on an untrusted intermediary. In addition to important qualitative advantages, we show that Data Station: i) outperforms federated learning baselines in accuracy and runtime for the machine learning application; ii) is orders of magnitude faster than alternative secure data-sharing frameworks; and iii) introduces small overhead on the critical path.
△ Less
Submitted 5 May, 2023;
originally announced May 2023.
-
METAM: Goal-Oriented Data Discovery
Authors:
Sainyam Galhotra,
Yue Gong,
Raul Castro Fernandez
Abstract:
Data is a central component of machine learning and causal inference tasks. The availability of large amounts of data from sources such as open data repositories, data lakes and data marketplaces creates an opportunity to augment data and boost those tasks' performance. However, augmentation techniques rely on a user manually discovering and shortlisting useful candidate augmentations. Existing so…
▽ More
Data is a central component of machine learning and causal inference tasks. The availability of large amounts of data from sources such as open data repositories, data lakes and data marketplaces creates an opportunity to augment data and boost those tasks' performance. However, augmentation techniques rely on a user manually discovering and shortlisting useful candidate augmentations. Existing solutions do not leverage the synergy between discovery and augmentation, thus under exploiting data.
In this paper, we introduce METAM, a novel goal-oriented framework that queries the downstream task with a candidate dataset, forming a feedback loop that automatically steers the discovery and augmentation process. To select candidates efficiently, METAM leverages properties of the: i) data, ii) utility function, and iii) solution set size. We show METAM's theoretical guarantees and demonstrate those empirically on a broad set of tasks. All in all, we demonstrate the promise of goal-oriented data discovery to modern data science applications.
△ Less
Submitted 18 April, 2023;
originally announced April 2023.
-
Solo: Data Discovery Using Natural Language Questions Via A Self-Supervised Approach
Authors:
Qiming Wang,
Raul Castro Fernandez
Abstract:
Most deployed data discovery systems, such as Google Datasets, and open data portals only support keyword search. Keyword search is geared towards general audiences but limits the types of queries the systems can answer. We propose a new system that lets users write natural language questions directly. A major barrier to using this learned data discovery system is it needs expensive-to-collect tra…
▽ More
Most deployed data discovery systems, such as Google Datasets, and open data portals only support keyword search. Keyword search is geared towards general audiences but limits the types of queries the systems can answer. We propose a new system that lets users write natural language questions directly. A major barrier to using this learned data discovery system is it needs expensive-to-collect training data, thus limiting its utility. In this paper, we introduce a self-supervised approach to assemble training datasets and train learned discovery systems without human intervention. It requires addressing several challenges, including the design of self-supervised strategies for data discovery, table representation strategies to feed to the models, and relevance models that work well with the synthetically generated questions. We combine all the above contributions into a system, Solo, that solves the problem end to end. The evaluation results demonstrate the new techniques outperform state-of-the-art approaches on well-known benchmarks. All in all, the technique is a stepping stone towards building learned discovery systems. The code is open-sourced at https://github.com/TheDataStation/solo
△ Less
Submitted 17 October, 2023; v1 submitted 9 January, 2023;
originally announced January 2023.
-
Ver: View Discovery in the Wild
Authors:
Yue Gong,
Zhiru Zhu,
Sainyam Galhotra,
Raul Castro Fernandez
Abstract:
We present Ver, a data discovery system that identifies project-join views over large repositories of tables that do not contain join path information, and even when input queries are inaccurate. Ver implements a reference architecture to solve both the technical (scale and search) and human (semantic ambiguity, navigating a large number of results) problems of view discovery. We demonstrate users…
▽ More
We present Ver, a data discovery system that identifies project-join views over large repositories of tables that do not contain join path information, and even when input queries are inaccurate. Ver implements a reference architecture to solve both the technical (scale and search) and human (semantic ambiguity, navigating a large number of results) problems of view discovery. We demonstrate users find the view they want when using Ver with a user study and we demonstrate its performance with large-scale end-to-end experiments on real-world datasets containing tens of millions of join paths.
△ Less
Submitted 4 October, 2022; v1 submitted 2 June, 2021;
originally announced June 2021.
-
Comprehensive and Comprehensible Data Catalogs: The What, Who, Where, When, Why, and How of Metadata Management
Authors:
Pranav Subramaniam,
Yintong Ma,
Chi Li,
Ipsita Mohanty,
Raul Castro Fernandez
Abstract:
Data management tasks require access to metadata, which is increasingly tracked by databases called data catalogs. Current catalogs are too dependent on users' understanding of data, leading to difficulties in large organizations of users with different skills: catalogs either make metadata easy for users to store and difficult to retrieve, or they make it easy to retrieve, but difficult to store.…
▽ More
Data management tasks require access to metadata, which is increasingly tracked by databases called data catalogs. Current catalogs are too dependent on users' understanding of data, leading to difficulties in large organizations of users with different skills: catalogs either make metadata easy for users to store and difficult to retrieve, or they make it easy to retrieve, but difficult to store. In this paper, we present 5W1H+R, a new catalog mental model that is comprehensive in the metadata it represents, and comprehensible in that it permits all users to locate metadata easily. We demonstrate these properties via a user study. We then discuss practical guidelines for implementing the new mental model. We conclude mental models are important to make data catalogs more useful and to boost metadata management efforts.
△ Less
Submitted 1 February, 2023; v1 submitted 12 March, 2021;
originally announced March 2021.
-
The Data Station: Combining Data, Compute, and Market Forces
Authors:
Raul Castro Fernandez,
Kyle Chard,
Ben Blaiszik,
Sanjay Krishnan,
Aaron Elmore,
Ziad Obermeyer,
Josh Risley,
Sendhil Mullainathan,
Michael Franklin,
Ian Foster
Abstract:
This paper introduces Data Stations, a new data architecture that we are designing to tackle some of the most challenging data problems that we face today: access to sensitive data; data discovery and integration; and governance and compliance. Data Stations depart from modern data lakes in that both data and derived data products, such as machine learning models, are sealed and cannot be directly…
▽ More
This paper introduces Data Stations, a new data architecture that we are designing to tackle some of the most challenging data problems that we face today: access to sensitive data; data discovery and integration; and governance and compliance. Data Stations depart from modern data lakes in that both data and derived data products, such as machine learning models, are sealed and cannot be directly seen, accessed, or downloaded by anyone. Data Stations do not deliver data to users; instead, users bring questions to data. This inversion of the usual relationship between data and compute mitigates many of the security risks that are otherwise associated with sharing and working with sensitive data.
Data Stations are designed following the principle that many data problems require human involvement, and that incentives are the key to obtaining such involvement. To that end, Data Stations implement market designs to create, manage, and coordinate the use of incentives. We explain the motivation for this new kind of platform and its design.
△ Less
Submitted 31 August, 2020;
originally announced September 2020.
-
ARDA: Automatic Relational Data Augmentation for Machine Learning
Authors:
Nadiia Chepurko,
Ryan Marcus,
Emanuel Zgraggen,
Raul Castro Fernandez,
Tim Kraska,
David Karger
Abstract:
Automatic machine learning (\AML) is a family of techniques to automate the process of training predictive models, aiming to both improve performance and make machine learning more accessible. While many recent works have focused on aspects of the machine learning pipeline like model selection, hyperparameter tuning, and feature selection, relatively few works have focused on automatic data augmen…
▽ More
Automatic machine learning (\AML) is a family of techniques to automate the process of training predictive models, aiming to both improve performance and make machine learning more accessible. While many recent works have focused on aspects of the machine learning pipeline like model selection, hyperparameter tuning, and feature selection, relatively few works have focused on automatic data augmentation. Automatic data augmentation involves finding new features relevant to the user's predictive task with minimal ``human-in-the-loop'' involvement.
We present \system, an end-to-end system that takes as input a dataset and a data repository, and outputs an augmented data set such that training a predictive model on this augmented dataset results in improved performance. Our system has two distinct components: (1) a framework to search and join data with the input data, based on various attributes of the input, and (2) an efficient feature selection algorithm that prunes out noisy or irrelevant features from the resulting join. We perform an extensive empirical evaluation of different system components and benchmark our feature selection algorithm on real-world datasets.
△ Less
Submitted 21 March, 2020;
originally announced March 2020.
-
Data Market Platforms: Trading Data Assets to Solve Data Problems
Authors:
Raul Castro Fernandez,
Pranav Subramaniam,
Michael J. Franklin
Abstract:
Data only generates value for a few organizations with expertise and resources to make data shareable, discoverable, and easy to integrate. Sharing data that is easy to discover and integrate is hard because data owners lack information (who needs what data) and they do not have incentives to prepare the data in a way that is easy to consume by others.
In this paper, we propose data market platf…
▽ More
Data only generates value for a few organizations with expertise and resources to make data shareable, discoverable, and easy to integrate. Sharing data that is easy to discover and integrate is hard because data owners lack information (who needs what data) and they do not have incentives to prepare the data in a way that is easy to consume by others.
In this paper, we propose data market platforms to address the lack of information and incentives and tackle the problems of data sharing, discovery, and integration. In a data market platform, data owners want to share data because they will be rewarded if they do so. Consumers are encouraged to share their data needs because the market will solve the discovery and integration problem for them in exchange for some form of currency.
We consider internal markets that operate within organizations to bring down data silos, as well as external markets that operate across organizations to increase the value of data for everybody. We outline a research agenda that revolves around two problems. The problem of market design, or how to design rules that lead to the outcomes we want, and the systems problem, how to implement the market and enforce the rules. Treating data as a first-class asset is sorely needed to extend the value of data to more organizations, and we propose data market platforms as one mechanism to achieve this goal.
△ Less
Submitted 1 July, 2020; v1 submitted 3 February, 2020;
originally announced February 2020.
-
Dataset-On-Demand: Automatic View Search and Presentation for Data Discovery
Authors:
Raul Castro Fernandez,
Nan Tang,
Mourad Ouzzani,
Michael Stonebraker,
Samuel Madden
Abstract:
Many data problems are solved when the right view of a combination of datasets is identified. Finding such a view is challenging because of the many tables spread across many databases, data lakes, and cloud storage in modern organizations. Finding relevant tables, and identifying how to combine them is a difficult and time-consuming process that hampers users' productivity.
In this paper, we de…
▽ More
Many data problems are solved when the right view of a combination of datasets is identified. Finding such a view is challenging because of the many tables spread across many databases, data lakes, and cloud storage in modern organizations. Finding relevant tables, and identifying how to combine them is a difficult and time-consuming process that hampers users' productivity.
In this paper, we describe Dataset-On-Demand (DoD), a system that lets users specify the schema of the view they want, and have the system find views for them. With many underlying data sources, the number of resulting views for any given query is high, and the burden of choosing the right one is onerous to users. DoD uses a new method, 4C, to reduce the size of the view choice space for users. 4C classifies views into 4 classes: compatible views are exactly the same, contained views present a subsumption relationship, complementary views are unionable, and contradictory views have incompatible values that indicate fundamental differences between views. These 4 classes permit different presentation strategies to reduce the total number of views a user must consider.
We evaluate DoD on two key metrics of interest: its ability to reduce the size of the choice space, and the end to end performance. DoD finds all views within minutes, and reduces the number of views presented to users by 2-10x.
△ Less
Submitted 26 November, 2019;
originally announced November 2019.
-
Starling: A Scalable Query Engine on Cloud Function Services
Authors:
Matthew Perron,
Raul Castro Fernandez,
David DeWitt,
Samuel Madden
Abstract:
Much like on-premises systems, the natural choice for running database analytics workloads in the cloud is to provision a cluster of nodes to run a database instance. However, analytics workloads are often bursty or low volume, leaving clusters idle much of the time, meaning customers pay for compute resources even when unused. The ability of cloud function services, such as AWS Lambda or Azure Fu…
▽ More
Much like on-premises systems, the natural choice for running database analytics workloads in the cloud is to provision a cluster of nodes to run a database instance. However, analytics workloads are often bursty or low volume, leaving clusters idle much of the time, meaning customers pay for compute resources even when unused. The ability of cloud function services, such as AWS Lambda or Azure Functions, to run small, fine granularity tasks make them appear to be a natural choice for query processing in such settings. But implementing an analytics system on cloud functions comes with its own set of challenges. These include managing hundreds of tiny stateless resource-constrained workers, handling stragglers, and shuffling data through opaque cloud services. In this paper we present Starling, a query execution engine built on cloud function services that employs number of techniques to mitigate these challenges, providing interactive query latency at a lower total cost than provisioned systems with low-to-moderate utilization. In particular, on a 1TB TPC-H dataset in cloud storage, Starling is less expensive than the best provisioned systems for workloads when queries arrive 1 minute apart or more. Starling also has lower latency than competing systems reading from cloud object stores and can scale to larger datasets.
△ Less
Submitted 26 November, 2019;
originally announced November 2019.
-
Termite: A System for Tunneling Through Heterogeneous Data
Authors:
Raul Castro Fernandez,
Samuel Madden
Abstract:
Data-driven analysis is important in virtually every modern organization. Yet, most data is underutilized because it remains locked in silos inside of organizations; large organizations have thousands of databases, and billions of files that are not integrated together in a single, queryable repository. Despite 40+ years of continuous effort by the database community, data integration still remain…
▽ More
Data-driven analysis is important in virtually every modern organization. Yet, most data is underutilized because it remains locked in silos inside of organizations; large organizations have thousands of databases, and billions of files that are not integrated together in a single, queryable repository. Despite 40+ years of continuous effort by the database community, data integration still remains an open challenge. In this paper, we advocate a different approach: rather than trying to infer a common schema, we aim to find another common representation for diverse, heterogeneous data. Specifically, we argue for an embedding (i.e., a vector space) in which all entities, rows, columns, and paragraphs are represented as points. In the embedding, the distance between points indicates their degree of relatedness. We present Termite, a prototype we have built to learn the best embedding from the data. Because the best representation is learned, this allows Termite to avoid much of the human effort associated with traditional data integration tasks. On top of Termite, we have implemented a Termite-Join operator, which allows people to identify related concepts, even when these are stored in databases with different schemas and in unstructured data such as text files, webpages, etc. Finally, we show preliminary evaluation results of our prototype via a user study, and describe a list of future directions we have identified.
△ Less
Submitted 12 March, 2019;
originally announced March 2019.
-
A Deep Neural Network for Pixel-Level Electromagnetic Particle Identification in the MicroBooNE Liquid Argon Time Projection Chamber
Authors:
MicroBooNE collaboration,
C. Adams,
M. Alrashed,
R. An,
J. Anthony,
J. Asaadi,
A. Ashkenazi,
M. Auger,
S. Balasubramanian,
B. Baller,
C. Barnes,
G. Barr,
M. Bass,
F. Bay,
A. Bhat,
K. Bhattacharya,
M. Bishai,
A. Blake,
T. Bolton,
L. Camilleri,
D. Caratelli,
I. Caro Terrazas,
R. Carr,
R. Castillo Fernandez,
F. Cavanna
, et al. (148 additional authors not shown)
Abstract:
We have developed a convolutional neural network (CNN) that can make a pixel-level prediction of objects in image data recorded by a liquid argon time projection chamber (LArTPC) for the first time. We describe the network design, training techniques, and software tools developed to train this network. The goal of this work is to develop a complete deep neural network based data reconstruction cha…
▽ More
We have developed a convolutional neural network (CNN) that can make a pixel-level prediction of objects in image data recorded by a liquid argon time projection chamber (LArTPC) for the first time. We describe the network design, training techniques, and software tools developed to train this network. The goal of this work is to develop a complete deep neural network based data reconstruction chain for the MicroBooNE detector. We show the first demonstration of a network's validity on real LArTPC data using MicroBooNE collection plane images. The demonstration is performed for stopping muon and a $ν_μ$ charged current neutral pion data samples.
△ Less
Submitted 22 August, 2018;
originally announced August 2018.
-
Smallify: Learning Network Size while Training
Authors:
Guillaume Leclerc,
Manasi Vartak,
Raul Castro Fernandez,
Tim Kraska,
Samuel Madden
Abstract:
As neural networks become widely deployed in different applications and on different hardware, it has become increasingly important to optimize inference time and model size along with model accuracy. Most current techniques optimize model size, model accuracy and inference time in different stages, resulting in suboptimal results and computational inefficiency. In this work, we propose a new tech…
▽ More
As neural networks become widely deployed in different applications and on different hardware, it has become increasingly important to optimize inference time and model size along with model accuracy. Most current techniques optimize model size, model accuracy and inference time in different stages, resulting in suboptimal results and computational inefficiency. In this work, we propose a new technique called Smallify that optimizes all three of these metrics at the same time. Specifically we present a new method to simultaneously optimize network size and model performance by neuron-level pruning during training. Neuron-level pruning not only produces much smaller networks but also produces dense weight matrices that are amenable to efficient inference. By applying our technique to convolutional as well as fully connected models, we show that Smallify can reduce network size by 35X with a 6X improvement in inference time with similar accuracy as models found by traditional training techniques.
△ Less
Submitted 10 June, 2018;
originally announced June 2018.
-
Extracting Syntactic Patterns from Databases
Authors:
Andrew Ilyas,
Joana M. F. da Trindade,
Raul Castro Fernandez,
Samuel Madden
Abstract:
Many database columns contain string or numerical data that conforms to a pattern, such as phone numbers, dates, addresses, product identifiers, and employee ids. These patterns are useful in a number of data processing applications, including understanding what a specific field represents when field names are ambiguous, identifying outlier values, and finding similar fields across data sets. One…
▽ More
Many database columns contain string or numerical data that conforms to a pattern, such as phone numbers, dates, addresses, product identifiers, and employee ids. These patterns are useful in a number of data processing applications, including understanding what a specific field represents when field names are ambiguous, identifying outlier values, and finding similar fields across data sets. One way to express such patterns would be to learn regular expressions for each field in the database. Unfortunately, exist- ing techniques on regular expression learning are slow, taking hundreds of seconds for columns of just a few thousand values. In contrast, we develop XSystem, an efficient method to learn patterns over database columns in significantly less time. We show that these patterns can not only be built quickly, but are expressive enough to capture a number of key applications, including detecting outliers, measuring column similarity, and assigning semantic labels to columns (based on a library of regular expressions). We evaluate these applications with datasets that range from chemical databases (based on a collaboration with a pharmaceutical company), our university data warehouse, and open data from MassData.gov.
△ Less
Submitted 6 December, 2017; v1 submitted 31 October, 2017;
originally announced October 2017.