Showing 1–23 of 23 results for author: Fernandez, R C

  1. arXiv:2408.09226  [pdf, other]

    cs.IR

    FabricQA-Extractor: A Question Answering System to Extract Information from Documents using Natural Language Questions

    Authors: Qiming Wang, Raul Castro Fernandez

    Abstract: Reading comprehension models answer questions posed in natural language when provided with a short passage of text. They present an opportunity to address a long-standing challenge in data management: the extraction of structured data from unstructured text. Consequently, several approaches are using these models to perform information extraction. However, these modern approaches leave an opportun…

    Submitted 17 August, 2024; originally announced August 2024.

  2. arXiv:2408.04092  [pdf, other]

    cs.DB

    Programmable Dataflows: Abstraction and Programming Model for Data Sharing

    Authors: Siyuan Xia, Chris Zhu, Tapan Srivastava, Bridget Fahey, Raul Castro Fernandez

    Abstract: Data sharing is central to a wide variety of applications such as fraud detection, ad matching, and research. The lack of data sharing abstractions makes the solution to each data sharing problem bespoke and cost-intensive, hampering value generation. In this paper, we first introduce a data sharing model to represent every data sharing problem with a sequence of dataflows. From the model, we dist…

    Submitted 7 August, 2024; originally announced August 2024.

  3. arXiv:2408.01580  [pdf, other]

    cs.DB

    Controlling Dataflows with a Bolt-on Data Escrow

    Authors: Zhiru Zhu, Raul Castro Fernandez

    Abstract: The data-driven economy has created tremendous value in our society. Individuals share their data with platforms in exchange for services such as search, social networks, and health recommendations. Platforms use the data to provide those services and create other revenue-generating opportunities, e.g., selling the data to data brokers. With the ever-expanding data economy comes the growing concer…

    Submitted 2 August, 2024; originally announced August 2024.

  4. arXiv:2408.00253  [pdf, other]

    cs.DB

    Saving Money for Analytical Workloads in the Cloud

    Authors: Tapan Srivastava, Raul Castro Fernandez

    Abstract: As users migrate their analytical workloads to cloud databases, it is becoming just as important to reduce monetary costs as it is to optimize query runtime. In the cloud, a query is billed based on either its compute time or the amount of data it processes. We observe that analytical queries are either compute- or IO-bound and each query type executes cheaper in a different pricing model. We expl…

    Submitted 31 July, 2024; originally announced August 2024.

    Comments: 12 pages; VLDB 2024

  5. arXiv:2310.17843  [pdf, other]

    cs.LG cs.GT

    A Data-Centric Online Market for Machine Learning: From Discovery to Pricing

    Authors: Minbiao Han, Jonathan Light, Steven Xia, Sainyam Galhotra, Raul Castro Fernandez, Haifeng Xu

    Abstract: Data fuels machine learning (ML) - rich and high-quality training data is essential to the success of ML. However, to transform ML from the race among a few large corporations to an accessible technology that serves numerous normal users' data analysis requests, there still exist important challenges. One gap we observed is that many ML users can benefit from new data that other data owners posses…

    Submitted 26 October, 2023; originally announced October 2023.

  6. arXiv:2310.13104  [pdf, other]

    cs.DB cs.CR

    Making Differential Privacy Easier to Use for Data Controllers and Data Analysts using a Privacy Risk Indicator and an Escrow-Based Platform

    Authors: Zhiru Zhu, Raul Castro Fernandez

    Abstract: Differential privacy (DP) enables private data analysis but is hard to use in practice. For data controllers who decide what output to release, choosing the amount of noise to add to the output is a non-trivial task because of the difficulty of interpreting the privacy parameter $ε$. For data analysts who submit queries, it is hard to understand the impact of the noise introduced by DP on their ta…

    Submitted 2 March, 2024; v1 submitted 19 October, 2023; originally announced October 2023.

  7. arXiv:2307.00432  [pdf, other]

    cs.DB cs.CR

    Saibot: A Differentially Private Data Search Platform

    Authors: Zezhou Huang, Jiaxiang Liu, Daniel Alabi, Raul Castro Fernandez, Eugene Wu

    Abstract: Recent data search platforms use ML task-based utility measures rather than metadata-based keywords, to search large dataset corpora. Requesters submit a training dataset and these platforms search for augmentations (join or union compatible datasets) that, when used to augment the requester's dataset, most improve model (e.g., linear regression) performance. Although effective, providers that man…

    Submitted 1 July, 2023; originally announced July 2023.

    Journal ref: VLDB 2023

  8. arXiv:2306.02543  [pdf, other]

    cs.LG

    Addressing Budget Allocation and Revenue Allocation in Data Market Environments Using an Adaptive Sampling Algorithm

    Authors: Boxin Zhao, Boxiang Lyu, Raul Castro Fernandez, Mladen Kolar

    Abstract: High-quality machine learning models are dependent on access to high-quality training data. When the data are not already available, it is tedious and costly to obtain them. Data markets help with identifying valuable training data: model consumers pay to train a model, the market uses that budget to identify data and train the model (the budget allocation problem), and finally the market compensa…

    Submitted 4 June, 2023; originally announced June 2023.

    Comments: Published at the International Conference on Machine Learning (ICML) 2023

  9. arXiv:2305.10419  [pdf, other]

    cs.DB

    Kitana: Efficient Data Augmentation Search for AutoML

    Authors: Zezhou Huang, Pranav Subramaniam, Raul Castro Fernandez, Eugene Wu

    Abstract: AutoML services provide a way for non-expert users to benefit from high-quality ML models without worrying about model design and deployment, in exchange for a charge per hour ($21.252 for VertexAI). However, existing AutoML services are model-centric, in that they are limited to extracting features and searching for models from initial training data; they are only as effective as the initial train…

    Submitted 17 May, 2023; originally announced May 2023.

  10. arXiv:2305.03842  [pdf, other]

    cs.DB

    Data Station: Delegated, Trustworthy, and Auditable Computation to Enable Data-Sharing Consortia with a Data Escrow

    Authors: Siyuan Xia, Zhiru Zhu, Chris Zhu, Jinjin Zhao, Kyle Chard, Aaron J. Elmore, Ian Foster, Michael Franklin, Sanjay Krishnan, Raul Castro Fernandez

    Abstract: Pooling and sharing data increases and distributes its value. But since data cannot be revoked once shared, scenarios that require controlled release of data for regulatory, privacy, and legal reasons default to not sharing. Because selectively controlling what data to release is difficult, the few data-sharing consortia that exist are often built around data-sharing agreements resulting from long…

    Submitted 5 May, 2023; originally announced May 2023.

  11. arXiv:2304.09068  [pdf, other]

    cs.DB cs.LG

    METAM: Goal-Oriented Data Discovery

    Authors: Sainyam Galhotra, Yue Gong, Raul Castro Fernandez

    Abstract: Data is a central component of machine learning and causal inference tasks. The availability of large amounts of data from sources such as open data repositories, data lakes and data marketplaces creates an opportunity to augment data and boost those tasks' performance. However, augmentation techniques rely on a user manually discovering and shortlisting useful candidate augmentations. Existing so…

    Submitted 18 April, 2023; originally announced April 2023.

    Comments: ICDE 2023 paper

  12. arXiv:2301.03560  [pdf, other]

    cs.IR

    Solo: Data Discovery Using Natural Language Questions Via A Self-Supervised Approach

    Authors: Qiming Wang, Raul Castro Fernandez

    Abstract: Most deployed data discovery systems, such as Google Datasets, and open data portals only support keyword search. Keyword search is geared towards general audiences but limits the types of queries the systems can answer. We propose a new system that lets users write natural language questions directly. A major barrier to using this learned data discovery system is it needs expensive-to-collect tra…

    Submitted 17 October, 2023; v1 submitted 9 January, 2023; originally announced January 2023.

    Comments: To appear at SIGMOD 2024

  13. arXiv:2106.01543  [pdf, other]

    cs.DB

    Ver: View Discovery in the Wild

    Authors: Yue Gong, Zhiru Zhu, Sainyam Galhotra, Raul Castro Fernandez

    Abstract: We present Ver, a data discovery system that identifies project-join views over large repositories of tables that do not contain join path information, and even when input queries are inaccurate. Ver implements a reference architecture to solve both the technical (scale and search) and human (semantic ambiguity, navigating a large number of results) problems of view discovery. We demonstrate users…

    Submitted 4 October, 2022; v1 submitted 2 June, 2021; originally announced June 2021.

  14. arXiv:2103.07532  [pdf, other]

    cs.DB

    Comprehensive and Comprehensible Data Catalogs: The What, Who, Where, When, Why, and How of Metadata Management

    Authors: Pranav Subramaniam, Yintong Ma, Chi Li, Ipsita Mohanty, Raul Castro Fernandez

    Abstract: Data management tasks require access to metadata, which is increasingly tracked by databases called data catalogs. Current catalogs are too dependent on users' understanding of data, leading to difficulties in large organizations of users with different skills: catalogs either make metadata easy for users to store and difficult to retrieve, or they make it easy to retrieve, but difficult to store.…

    Submitted 1 February, 2023; v1 submitted 12 March, 2021; originally announced March 2021.

    Comments: 14 pages, 8 figures, 8 tables

  15. arXiv:2009.00035  [pdf, other]

    cs.DB

    The Data Station: Combining Data, Compute, and Market Forces

    Authors: Raul Castro Fernandez, Kyle Chard, Ben Blaiszik, Sanjay Krishnan, Aaron Elmore, Ziad Obermeyer, Josh Risley, Sendhil Mullainathan, Michael Franklin, Ian Foster

    Abstract: This paper introduces Data Stations, a new data architecture that we are designing to tackle some of the most challenging data problems that we face today: access to sensitive data; data discovery and integration; and governance and compliance. Data Stations depart from modern data lakes in that both data and derived data products, such as machine learning models, are sealed and cannot be directly…

    Submitted 31 August, 2020; originally announced September 2020.

  16. arXiv:2003.09758  [pdf, other]

    cs.LG cs.DB stat.ML

    ARDA: Automatic Relational Data Augmentation for Machine Learning

    Authors: Nadiia Chepurko, Ryan Marcus, Emanuel Zgraggen, Raul Castro Fernandez, Tim Kraska, David Karger

    Abstract: Automatic machine learning (AML) is a family of techniques to automate the process of training predictive models, aiming to both improve performance and make machine learning more accessible. While many recent works have focused on aspects of the machine learning pipeline like model selection, hyperparameter tuning, and feature selection, relatively few works have focused on automatic data augmen…

    Submitted 21 March, 2020; originally announced March 2020.

  17. arXiv:2002.01047  [pdf, other]

    cs.DB

    Data Market Platforms: Trading Data Assets to Solve Data Problems

    Authors: Raul Castro Fernandez, Pranav Subramaniam, Michael J. Franklin

    Abstract: Data only generates value for a few organizations with expertise and resources to make data shareable, discoverable, and easy to integrate. Sharing data that is easy to discover and integrate is hard because data owners lack information (who needs what data) and they do not have incentives to prepare the data in a way that is easy to consume by others. In this paper, we propose data market platf…

    Submitted 1 July, 2020; v1 submitted 3 February, 2020; originally announced February 2020.

  18. arXiv:1911.11876  [pdf, other]

    cs.DB

    Dataset-On-Demand: Automatic View Search and Presentation for Data Discovery

    Authors: Raul Castro Fernandez, Nan Tang, Mourad Ouzzani, Michael Stonebraker, Samuel Madden

    Abstract: Many data problems are solved when the right view of a combination of datasets is identified. Finding such a view is challenging because of the many tables spread across many databases, data lakes, and cloud storage in modern organizations. Finding relevant tables, and identifying how to combine them is a difficult and time-consuming process that hampers users' productivity. In this paper, we de…

    Submitted 26 November, 2019; originally announced November 2019.

  19. arXiv:1911.11727  [pdf, other]

    cs.DB

    Starling: A Scalable Query Engine on Cloud Function Services

    Authors: Matthew Perron, Raul Castro Fernandez, David DeWitt, Samuel Madden

    Abstract: Much like on-premises systems, the natural choice for running database analytics workloads in the cloud is to provision a cluster of nodes to run a database instance. However, analytics workloads are often bursty or low volume, leaving clusters idle much of the time, meaning customers pay for compute resources even when unused. The ability of cloud function services, such as AWS Lambda or Azure Fu…

    Submitted 26 November, 2019; originally announced November 2019.

  20. arXiv:1903.05008  [pdf, other]

    cs.DB

    Termite: A System for Tunneling Through Heterogeneous Data

    Authors: Raul Castro Fernandez, Samuel Madden

    Abstract: Data-driven analysis is important in virtually every modern organization. Yet, most data is underutilized because it remains locked in silos inside of organizations; large organizations have thousands of databases, and billions of files that are not integrated together in a single, queryable repository. Despite 40+ years of continuous effort by the database community, data integration still remain…

    Submitted 12 March, 2019; originally announced March 2019.

  21. arXiv:1808.07269  [pdf, other]

    hep-ex cs.CV physics.data-an physics.ins-det

    A Deep Neural Network for Pixel-Level Electromagnetic Particle Identification in the MicroBooNE Liquid Argon Time Projection Chamber

    Authors: MicroBooNE collaboration, C. Adams, M. Alrashed, R. An, J. Anthony, J. Asaadi, A. Ashkenazi, M. Auger, S. Balasubramanian, B. Baller, C. Barnes, G. Barr, M. Bass, F. Bay, A. Bhat, K. Bhattacharya, M. Bishai, A. Blake, T. Bolton, L. Camilleri, D. Caratelli, I. Caro Terrazas, R. Carr, R. Castillo Fernandez, F. Cavanna, et al. (148 additional authors not shown)

    Abstract: We have developed a convolutional neural network (CNN) that can make a pixel-level prediction of objects in image data recorded by a liquid argon time projection chamber (LArTPC) for the first time. We describe the network design, training techniques, and software tools developed to train this network. The goal of this work is to develop a complete deep neural network based data reconstruction cha…

    Submitted 22 August, 2018; originally announced August 2018.

    Journal ref: Phys. Rev. D 99, 092001 (2019)

  22. arXiv:1806.03723  [pdf, other]

    stat.ML cs.LG

    Smallify: Learning Network Size while Training

    Authors: Guillaume Leclerc, Manasi Vartak, Raul Castro Fernandez, Tim Kraska, Samuel Madden

    Abstract: As neural networks become widely deployed in different applications and on different hardware, it has become increasingly important to optimize inference time and model size along with model accuracy. Most current techniques optimize model size, model accuracy and inference time in different stages, resulting in suboptimal results and computational inefficiency. In this work, we propose a new tech…

    Submitted 10 June, 2018; originally announced June 2018.

    Comments: 11 pages, 3 figures

  23. arXiv:1710.11528  [pdf, other]

    cs.DB

    Extracting Syntactic Patterns from Databases

    Authors: Andrew Ilyas, Joana M. F. da Trindade, Raul Castro Fernandez, Samuel Madden

    Abstract: Many database columns contain string or numerical data that conforms to a pattern, such as phone numbers, dates, addresses, product identifiers, and employee ids. These patterns are useful in a number of data processing applications, including understanding what a specific field represents when field names are ambiguous, identifying outlier values, and finding similar fields across data sets. One…

    Submitted 6 December, 2017; v1 submitted 31 October, 2017; originally announced October 2017.