Databases (cs.DB)

TELII: Temporal Event Level Inverted Indexing for Cohort Discovery on a Large Covid-19 EHR Dataset
Yan Huang
Oct 23 2024 cs.DB cs.IR arXiv:2410.17134v1

@misc{2410.17134, author = {Yan Huang}, title = {{TELII}: {T}emporal {E}vent {L}evel {I}nverted {I}ndexing for {C}ohort {D}iscovery on a {L}arge {C}ovid-19 {EHR} {D}ataset}, year = {2024}, eprint = {2410.17134}, note = {arXiv:2410.17134v1} }
Cohort discovery is a crucial step in clinical research on Electronic Health Record (EHR) data. Temporal queries, which are common in cohort discovery, can be time-consuming and prone to errors when processed on large EHR datasets. In this work, we introduce TELII, a temporal event level inverted indexing method designed for cohort discovery on large EHR datasets. TELII is engineered to pre-compute and store the relations along with the time difference between events, thereby providing fast and accurate temporal query capabilities. We implemented TELII for the OPTUM de-identified COVID-19 EHR dataset, which contains data from 8.87 million patients. We demonstrate four common temporal query tasks and their implementation using TELII with a MongoDB backend. Our results show that the temporal query speed for TELII is up to 2000 times faster than that of existing non-temporal inverted indexes. TELII achieves millisecond-level response times, enabling users to quickly explore event relations and find preliminary evidence for their research questions. Not only is TELII practical and straightforward to implement, but it also offers easy adaptability to other EHR datasets. These advantages underscore TELII's potential to serve as the query engine for EHR-based applications, ensuring fast, accurate, and user-friendly query responses.
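The pre-computation idea in this abstract can be sketched in a few lines of Python. This is a minimal illustration only, not the authors' implementation: it uses an in-memory dictionary instead of a MongoDB backend, stores only the smallest day gap per patient and event pair (an assumption made here for brevity), and the event codes and the function names `build_temporal_index` and `patients_with_b_after_a` are hypothetical.

```python
from collections import defaultdict


def build_temporal_index(patient_events):
    """Pre-compute, for every ordered event pair (A, B), which patients
    had B on or after A, together with the smallest day gap observed.

    patient_events: dict of patient_id -> list of (event_code, day).
    """
    index = defaultdict(dict)  # (code_a, code_b) -> {patient_id: min gap in days}
    for pid, events in patient_events.items():
        for code_a, day_a in events:
            for code_b, day_b in events:
                if code_a == code_b or day_b < day_a:
                    continue
                gap = day_b - day_a
                previous = index[(code_a, code_b)].get(pid)
                if previous is None or gap < previous:
                    index[(code_a, code_b)][pid] = gap
    return index


def patients_with_b_after_a(index, code_a, code_b, max_gap_days=None):
    """Answer 'B after A (within max_gap_days)' with a single index lookup."""
    hits = index.get((code_a, code_b), {})
    return sorted(
        pid for pid, gap in hits.items()
        if max_gap_days is None or gap <= max_gap_days
    )


# Hypothetical toy cohort: day offsets are relative to an arbitrary origin.
patients = {
    "p1": [("covid", 0), ("anosmia", 5)],
    "p2": [("anosmia", 2), ("covid", 10)],
    "p3": [("covid", 3), ("anosmia", 40)],
}
index = build_temporal_index(patients)
print(patients_with_b_after_a(index, "covid", "anosmia", max_gap_days=30))
```

The quadratic pair enumeration happens once at build time; each temporal query afterwards is a key lookup plus a filter on the stored gap, which is the trade-off the abstract describes.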
CUBIT: Concurrent Updatable Bitmap Indexing
Junchang Wang, Manos Athanassoulis
Oct 23 2024 cs.DB arXiv:2410.16929v1

@misc{2410.16929, author = {Junchang Wang and Manos Athanassoulis}, title = {{CUBIT}: {C}oncurrent {U}pdatable {B}itmap {I}ndexing}, year = {2024}, eprint = {2410.16929}, note = {arXiv:2410.16929v1} }
Bitmap indexes are widely used for read-intensive analytical workloads because they are clustered and offer efficient reads with a small memory footprint. However, they are notoriously inefficient to update. As analytical applications are increasingly fused with transactional applications, leading to the emergence of hybrid transactional/analytical processing (HTAP), it is desirable that bitmap indexes support efficient concurrent real-time updates. In this paper, we propose Concurrent Updatable Bitmap indexing (CUBIT) that offers efficient real-time updates that scale with the number of CPU cores used and do not interfere with queries. Our design relies on three principles. First, we employ a horizontal bitwise representation of updated bits, which enables efficient atomic updates without locking entire bitvectors. Second, we propose a lightweight snapshotting mechanism that allows queries (including range queries) to run on separate snapshots and provides a wait-free progress guarantee. Third, we consolidate updates in a latch-free manner, providing a strong progress guarantee. Our evaluation shows that CUBIT offers 3x - 16x higher throughput and 3x - 220x lower latency than state-of-the-art updatable bitmap indexes. CUBIT's update-friendly nature widens the applicability of bitmap indexing. Experimenting with OLAP workloads with standard, batched updates shows that CUBIT overcomes the maintenance downtime and outperforms DuckDB by 1.2x - 2.7x on TPC-H. For HTAP workloads with real-time updates, CUBIT achieves 2x - 11x performance improvement over the state-of-the-art approaches.
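For readers unfamiliar with the baseline structure, the sketch below shows a plain single-threaded bitmap index, with Python integers standing in for bitvectors. It does not implement CUBIT's horizontal update representation, snapshotting, or latch-free consolidation; the class name `BitmapIndex` and its methods are hypothetical. It only illustrates why reads are cheap (bitwise OR over bitvectors) and why an in-place update has to touch two bitvectors.

```python
class BitmapIndex:
    """One bitvector per distinct value; bit r is set if row r holds that value."""

    def __init__(self, values):
        self.num_rows = len(values)
        self.bitmaps = {}
        for row, value in enumerate(values):
            self.bitmaps[value] = self.bitmaps.get(value, 0) | (1 << row)

    def _rows(self, bits):
        # Decode a bitvector back into a list of row ids.
        return [r for r in range(self.num_rows) if (bits >> r) & 1]

    def rows_equal(self, value):
        return self._rows(self.bitmaps.get(value, 0))

    def rows_in(self, values):
        # A range or IN query is a bitwise OR over the matching bitvectors.
        bits = 0
        for value in values:
            bits |= self.bitmaps.get(value, 0)
        return self._rows(bits)

    def update(self, row, old_value, new_value):
        # An update clears one bit and sets another, so it modifies two
        # bitvectors; with concurrent readers this is where contention arises.
        self.bitmaps[old_value] &= ~(1 << row)
        self.bitmaps[new_value] = self.bitmaps.get(new_value, 0) | (1 << row)


idx = BitmapIndex(["a", "b", "a", "c"])
idx.update(2, "a", "c")
print(idx.rows_equal("c"))
```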
NodeOP: Optimizing Node Management for Decentralized Networks
Angela Tsang, Jiankai Sun, Boo Xie, Azeem Khan, Ender Lu, Fletcher Fan, Maggie Wu, Jing Tang
Oct 23 2024 cs.DB cs.CR arXiv:2410.16720v1

@misc{2410.16720, author = {Angela Tsang and Jiankai Sun and Boo Xie and Azeem Khan and Ender Lu and Fletcher Fan and Maggie Wu and Jing Tang}, title = {{N}ode{OP}: {O}ptimizing {N}ode {M}anagement for {D}ecentralized {N}etworks}, year = {2024}, eprint = {2410.16720}, note = {arXiv:2410.16720v1} }
We present NodeOP, a novel framework designed to optimize the management of General Node Operators in decentralized networks. By integrating Agent-Based Modeling (ABM) with a Tendermint Byzantine Fault Tolerance (BFT)-based consensus mechanism, NodeOP addresses key challenges in task allocation, consensus formation, and system stability. Through rigorous mathematical modeling and formal optimization, NodeOP ensures stable equilibrium in node task distribution. We validate the framework via convergence analysis and performance metrics such as transaction throughput, system latency, and fault tolerance. We further demonstrate NodeOP's practical utility through two use cases: decentralized sequencer management in Layer 2 networks and off-chain payment validation. These examples underscore how NodeOP enhances validation efficiency and unlocks new revenue opportunities in large-scale decentralized environments. Our results position NodeOP as a scalable and flexible solution, significantly improving operational efficiency and economic sustainability in decentralized systems.
Efficient and Effective Algorithms for A Family of Influence Maximization Problems with A Matroid Constraint
Yiqian Huang, Shiqi Zhang, Laks V.S. Lakshmanan, Wenqing Lin, Xiaokui Xiao, Bo Tang
Oct 23 2024 cs.SI cs.DB arXiv:2410.16603v1

@misc{2410.16603, author = {Yiqian Huang and Shiqi Zhang and Laks V.S.~Lakshmanan and Wenqing Lin and Xiaokui Xiao and Bo Tang}, title = {{E}fficient and {E}ffective {A}lgorithms for {A} {F}amily of {I}nfluence {M}aximization {P}roblems with {A} {M}atroid {C}onstraint}, year = {2024}, eprint = {2410.16603}, note = {arXiv:2410.16603v1} }
Influence maximization (IM) is a classic problem that aims to identify a small group of critical individuals, known as seeds, who can influence the largest number of users in a social network through word-of-mouth. This problem has important applications, including viral marketing, infection detection, and misinformation containment. The conventional IM problem is typically studied with the oversimplified goal of selecting a single seed set. Many real-world scenarios call for multiple sets of seeds, particularly on social media platforms where various viral marketing campaigns need different sets of seeds to propagate effectively. To this end, previous works have formulated various IM variants, central to which is the requirement of multiple seed sets, naturally modeled as a matroid constraint. However, the current best-known solutions for these variants either offer a weak $(1/2-\epsilon)$-approximation, or offer a $(1-1/e-\epsilon)$-approximation algorithm that is very expensive. We propose an efficient seed selection method called AMP, an algorithm with a $(1-1/e-\epsilon)$-approximation guarantee for this family of IM variants. To further improve efficiency, we also devise a fast implementation, called RAMP. We extensively evaluate the performance of our proposal against 6 competitors across 4 IM variants and on 7 real-world networks, demonstrating that our proposal outperforms all competitors in terms of result quality, running time, and memory usage. We have also deployed RAMP in a real industry-strength application involving online gaming, where we show that our deployed solution significantly improves upon the baselines.
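To make the "multiple seed sets under a matroid constraint" setting concrete, here is a minimal sketch of the textbook greedy algorithm under a partition matroid, where each campaign may receive at most a fixed number of seeds. This is not AMP or RAMP. It is the simple greedy baseline, which is known to give a 1/2-approximation for monotone submodular objectives under a matroid constraint. As a further simplifying assumption, influence spread is replaced by a deterministic coverage count (how many users a seed reaches), and all names and data are hypothetical.

```python
def greedy_partition_matroid(reach, budget):
    """Greedy seed selection with a per-campaign seed budget.

    reach:  dict campaign -> dict user -> set of users that user reaches.
    budget: dict campaign -> maximum number of seeds for that campaign.
    Returns (seeds per campaign, total number of covered users).
    """
    seeds = {c: [] for c in reach}
    covered = {c: set() for c in reach}
    while True:
        best = None  # (marginal gain, campaign, user)
        for campaign, users in reach.items():
            if len(seeds[campaign]) >= budget[campaign]:
                continue  # adding here would violate the matroid constraint
            for user, reached in users.items():
                if user in seeds[campaign]:
                    continue
                gain = len(reached - covered[campaign])
                if best is None or gain > best[0]:
                    best = (gain, campaign, user)
        if best is None or best[0] == 0:
            total = sum(len(users) for users in covered.values())
            return seeds, total
        _, campaign, user = best
        seeds[campaign].append(user)
        covered[campaign] |= reach[campaign][user]


# Hypothetical toy instance with two campaigns.
reach = {
    "campaign_x": {"u1": {"u1", "u2", "u3"}, "u2": {"u2", "u3"}, "u4": {"u4", "u5"}},
    "campaign_y": {"u1": {"u1", "u6"}, "u3": {"u3", "u4", "u5", "u6"}},
}
budget = {"campaign_x": 2, "campaign_y": 1}
seeds, total = greedy_partition_matroid(reach, budget)
print(seeds, total)
```

Each round picks the feasible (campaign, user) pair with the largest marginal gain, which is why the method is simple but can be slow on large graphs when the gain has to be estimated by simulation.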
The Cost of Representation by Subset Repairs
Yuxi Liu, Fangzhu Shen, Kushagra Ghosh, Amir Gilad, Benny Kimelfeld, Sudeepa Roy
Oct 23 2024 cs.DB arXiv:2410.16501v1

@misc{2410.16501, author = {Yuxi Liu and Fangzhu Shen and Kushagra Ghosh and Amir Gilad and Benny Kimelfeld and Sudeepa Roy}, title = {{T}he {C}ost of {R}epresentation by {S}ubset {R}epairs}, year = {2024}, eprint = {2410.16501}, note = {arXiv:2410.16501v1} }
Datasets may include errors, and specifically violations of integrity constraints, for various reasons. Standard techniques for "minimal-cost" database repairing resolve these violations by aiming for minimum change in the data, and in the process, may sway representations of different sub-populations. For instance, the repair may end up deleting more females than males, or more tuples from a certain age group or race, due to varying levels of inconsistency in different sub-populations. Such repaired data can mislead consumers when used for analytics, and can lead to biased decisions for downstream machine learning tasks. We study the "cost of representation" in subset repairs for functional dependencies. In simple terms, we target the question of how many additional tuples have to be deleted if we want to satisfy not only the integrity constraints but also representation constraints for given sub-populations. We study the complexity of this problem and compare it with the complexity of optimal subset repairs without representations. While the problem is NP-hard in general, we give polynomial-time algorithms for special cases, and efficient heuristics for general cases. We perform a suite of experiments that show the effectiveness of our algorithms in computing or approximating the cost of representation.
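The quantity studied here can be illustrated on a tiny table. The sketch below is an exhaustive search, exponential in the number of tuples and usable only for toy inputs; it is not one of the paper's algorithms. It assumes a single functional dependency and models a representation constraint as "at least a given share of the kept tuples has a given attribute value", which is an assumption about the constraint form made for this illustration. The empty table is treated as satisfying every constraint, and all names and data are hypothetical.

```python
from itertools import combinations


def satisfies_fd(rows, lhs, rhs):
    """True if no two rows agree on lhs but differ on rhs."""
    seen = {}
    for row in rows:
        if seen.setdefault(row[lhs], row[rhs]) != row[rhs]:
            return False
    return True


def satisfies_representation(rows, attr, min_share):
    """True if each listed value makes up at least its share of the rows."""
    if not rows:
        return True  # assumption: an empty table satisfies all constraints
    for value, share in min_share.items():
        count = sum(1 for row in rows if row[attr] == value)
        if count < share * len(rows):
            return False
    return True


def min_deletions(rows, lhs, rhs, attr=None, min_share=None):
    """Fewest deletions so that the FD (and optional constraint) holds."""
    for keep in range(len(rows), -1, -1):
        for subset in combinations(rows, keep):
            if not satisfies_fd(subset, lhs, rhs):
                continue
            if attr is None or satisfies_representation(subset, attr, min_share):
                return len(rows) - keep


# Hypothetical table violating the FD zip -> city.
rows = [
    {"zip": "10001", "city": "NYC", "sex": "F"},
    {"zip": "10001", "city": "NYC", "sex": "M"},
    {"zip": "10001", "city": "NYC", "sex": "M"},
    {"zip": "10001", "city": "Newark", "sex": "F"},
    {"zip": "10001", "city": "Newark", "sex": "F"},
]
plain = min_deletions(rows, "zip", "city")
fair = min_deletions(rows, "zip", "city", attr="sex", min_share={"F": 0.5})
print(plain, fair, fair - plain)
```

In this toy table the optimal plain repair keeps the three NYC tuples, only one of which is female, so requiring at least half of the kept tuples to be female forces one extra deletion; that difference is the cost of representation for this instance.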
From Tokens to Materials: Leveraging Language Models for Scientific Discovery
Yuwei Wan, Tong Xie, Nan Wu, Wenjie Zhang, Chunyu Kit, Bram Hoex
Oct 23 2024 cs.CL cs.DB arXiv:2410.16165v1

@misc{2410.16165, author = {Yuwei Wan and Tong Xie and Nan Wu and Wenjie Zhang and Chunyu Kit and Bram Hoex}, title = {{F}rom {T}okens to {M}aterials: {L}everaging {L}anguage {M}odels for {S}cientific {D}iscovery}, year = {2024}, eprint = {2410.16165}, note = {arXiv:2410.16165v1} }
Exploring the predictive capabilities of language models in materials science is an area of ongoing interest. This study investigates the application of language model embeddings to enhance material property prediction. By evaluating various contextual embedding methods and pre-trained models, including Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformers (GPT), we demonstrate that domain-specific models, particularly MatBERT, significantly outperform general-purpose models in extracting implicit knowledge from compound names and material properties. Our findings reveal that information-dense embeddings from the third layer of MatBERT, combined with a context-averaging approach, offer the most effective method for capturing material-property relationships from the scientific literature. We also identify a crucial "tokenizer effect," highlighting the importance of specialized text processing techniques that preserve complete compound names while maintaining consistent token counts. These insights underscore the value of domain-specific training and tokenization in materials science applications and offer a promising pathway for accelerating the discovery and development of new materials through AI-driven approaches.