Electrical Engineering and Systems Science
Showing new listings for Friday, 25 October 2024
- [1] arXiv:2410.18091 [pdf, other]
Title: Predicting Fine-grained Behavioral and Psychological Symptoms of Dementia Based on Machine Learning and Smart Wearable Devices
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Behavioral and Psychological Symptoms of Dementia (BPSD) impact dementia care substantially, affecting both patients and caregivers. Effective management and early detection of BPSD are crucial to reduce the stress and burden on caregivers and healthcare systems. Despite the advancements in machine learning for dementia prediction, there is a considerable gap in utilizing these methods for BPSD prediction. This study aims to fill this gap by presenting a novel personalized framework for BPSD prediction, utilizing physiological signals from smart wearable devices. Our personalized fine-grained BPSD prediction method accurately predicts BPSD occurrences by extracting individual behavioral patterns, while the generalized models identify diverse patterns and differentiate between various BPSD symptoms. Detailed comparisons between the proposed personalized method and conventional generalized methods reveal substantial improvements across all performance metrics, including a 16.0% increase in AUC. These results demonstrate the potential of our proposed method in advancing dementia care by enabling proactive interventions and improving patient outcomes in real-world scenarios. To the best of our knowledge, this is the first study that leverages physiological signals from smart wearable devices to predict BPSD, marking a significant stride in dementia care research.
- [2] arXiv:2410.18092 [pdf, html, other]
Title: Two-Stage Radio Map Construction with Real Environments and Sparse Measurements
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Radio map construction based on extensive measurements is accurate but expensive and time-consuming, while environment-aware radio map estimation reduces the costs at the expense of low accuracy. Considering accuracy and costs, a first-predict-then-correct (FPTC) method is proposed by leveraging generative adversarial networks (GANs). A primary radio map is first predicted by a radio map prediction GAN (RMP-GAN) taking environmental information as input. Then, the prediction result is corrected by a radio map correction GAN (RMC-GAN) with sparse measurements as guidelines. Specifically, the self-attention mechanism and residual-connection blocks are introduced to RMP-GAN and RMC-GAN to improve the accuracy, respectively. Experimental results validate that the proposed FPTC-GANs method achieves the best radio map construction performance, compared with the state-of-the-art methods.
- [3] arXiv:2410.18103 [pdf, html, other]
Title: A Hybrid Graph Neural Network for Enhanced EEG-Based Depression Detection
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Graph neural networks (GNNs) are becoming increasingly popular for EEG-based depression detection. However, previous GNN-based methods fail to sufficiently consider the characteristics of depression, thus limiting their performance. Firstly, studies in neuroscience indicate that depression patients exhibit both common and individualized abnormal brain patterns. Previous GNN-based approaches typically focus either on fixed graph connections to capture common abnormal brain patterns or on adaptive connections to capture individualized patterns, which is inadequate for depression detection. Secondly, the brain network exhibits a hierarchical structure, which includes the arrangement from channel-level graph to region-level graph. This hierarchical structure varies among individuals and contains significant information relevant to detecting depression. Nonetheless, previous GNN-based methods overlook this individualized hierarchical information. To address these issues, we propose a Hybrid GNN (HGNN) that merges a Common Graph Neural Network (CGNN) branch utilizing fixed connections and an Individualized Graph Neural Network (IGNN) branch employing adaptive connections. The two branches capture common and individualized depression patterns respectively, complementing each other. Furthermore, we enhance the IGNN branch with a Graph Pooling and Unpooling Module (GPUM) to extract individualized hierarchical information. Extensive experiments on two public datasets show that our model achieves state-of-the-art performance.
- [4] arXiv:2410.18116 [pdf, html, other]
Title: Reconstruction with prior support information and non-Gaussian constraints
Subjects: Signal Processing (eess.SP); Information Theory (cs.IT); Classical Analysis and ODEs (math.CA)
In this study, we introduce a novel model, termed the Weighted Basis Pursuit Dequantization ($\omega$-BPDQ$_p$), which incorporates prior support information by assigning weights on the $\ell_1$ norm in the $\ell_1$ minimization process and replaces the $\ell_2$ norm with the $\ell_p$ norm in the constraint. This adjustment addresses cases where noise deviates from a Gaussian distribution, such as quantized errors, which are common in practice. We demonstrate that Restricted Isometry Property (RIP$_{p,q}$) and Weighted Robust Null Space Property ($\omega$-RNSP$_{p,q}$) ensure stable and robust reconstruction within $\omega$-BPDQ$_p$, with the added observation that standard Gaussian random matrices satisfy these properties with high probability. Moreover, we establish a relationship between RIP$_{p,q}$ and $\omega$-RNSP$_{p,q}$ that RIP$_{p,q}$ implies $\omega$-RNSP$_{p,q}$. Additionally, numerical experiments confirm that the incorporation of weights and the non-Gaussian constraint results in improved reconstruction quality.
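As a reading aid, the program described in this abstract can be sketched as follows. The notation is an assumption, not taken from the paper: $A$ is the measurement matrix, $y$ the (possibly quantized) measurements, $\epsilon$ a noise bound, and $\omega_i$ the per-coordinate weights that encode the prior support estimate.

```latex
\min_{z \in \mathbb{R}^N} \; \sum_{i=1}^{N} \omega_i \, |z_i|
\quad \text{subject to} \quad \| A z - y \|_p \le \epsilon
```

With all $\omega_i = 1$ and $p = 2$ this reduces to standard basis pursuit denoising. A common convention in the prior-support literature is to use a smaller weight on the estimated support and weight 1 elsewhere; whether the paper uses exactly this weighting or this normalization of the constraint is not stated in the abstract.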
- [5] arXiv:2410.18161 [pdf, html, other]
Title: Bridging the Diagnostic Divide: Classical Computer Vision and Advanced AI methods for distinguishing ITB and CD through CTE Scans
Comments: 9 pages, 3 figures, 3 algorithms
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Differentiating between Intestinal Tuberculosis (ITB) and Crohn's Disease (CD) poses a significant clinical challenge due to their similar symptoms, clinical presentations, and imaging features. This study leverages Computed Tomography Enterography (CTE) scans, deep learning, and traditional computer vision to address this diagnostic dilemma. A consensus among radiologists from renowned institutions has recognized the visceral-to-subcutaneous fat (VF/SF) ratio as a surrogate biomarker for differentiating between ITB and CD. This ratio was previously calculated manually; we propose a novel 2D image computer vision algorithm for auto-segmenting subcutaneous fat to automate the calculation, enhancing diagnostic efficiency and objectivity. As a benchmark, we compare the results to those obtained using the TotalSegmentator tool, a popular deep learning-based software for automatic segmentation of anatomical structures, and to manual calculations by radiologists. We also demonstrate the performance on 3D CT volumes using a slicing method and provide a benchmark comparison of the algorithm with the TotalSegmentator tool. Additionally, we propose a scoring approach to integrate scores from radiological features, such as the fat ratio and pulmonary TB probability, into a single score for diagnosis. We trained a ResNet10 model on a dataset of CTE scans with samples from ITB, CD, and normal patients, achieving an accuracy of 75%. To enhance interpretability and gain clinical trust, we integrated the explainable AI technique Grad-CAM with ResNet10 to explain the model's predictions. Due to the small dataset size (100 total cases), the feature-based scoring system is considered more reliable and trusted by radiologists compared to the deep learning model for disease diagnosis.
- [6] arXiv:2410.18207 [pdf, html, other]
Title: Trajectory Optimization for Spatial Microstructure Control in Electron Beam Metal Additive Manufacturing
Comments: 6 pages, 6 figures
Subjects: Systems and Control (eess.SY); Optimization and Control (math.OC)
Metal additive manufacturing (AM) opens the possibility for spatial control of as-fabricated microstructure and properties. However, since the solid-state diffusional transformations that drive microstructure outcomes are governed by nonlinear ODEs in terms of temperature, which is itself governed by PDEs over the entire part domain, solving for the system inputs needed to achieve desired microstructure distributions has proven difficult. In this work, we present a trajectory optimization approach for spatial control of microstructure in metal AM, which we demonstrate by controlling the hardness of a low-alloy steel in electron beam powder bed fusion (EB-PBF). To this end, we present models for thermal and microstructural dynamics. Next, we use experimental data to identify the parameters of the microstructure transformation dynamics. We then pose spatial microstructure control as a finite-horizon optimal control problem. The optimal power field trajectory is computed using an augmented Lagrangian differential dynamic programming (AL-DDP) method with GPU acceleration. The resulting time-varying power fields are then realized on an EB-PBF machine through an approximation scheme. Measurements of the resultant hardness show that the optimized power field trajectory closely produces the desired hardness distribution.
- [7] arXiv:2410.18217 [pdf, other]
Title: A Methodology for Transformer Ratio Adjustment in Small-Size Rotary Transformers
Subjects: Systems and Control (eess.SY)
This study addresses a neglected challenge that has been hidden in the Rotary Transformer (RT) field: the possibility of a discrepancy between transformer ratio and turn number ratio in small-size transformers. Previous investigations have shown that in the geometry design of RTs, as well as their resonant circuit design, the transformer ratio has been regarded as the same as the turn number ratio. This estimation is logical and true when a large-size RT is investigated. However, in small-size RTs, the magnitudes of leakage and magnetization inductances are significantly close, which leads to a difference between transformer ratio and turn number ratio. Accordingly, the absence of an exact methodology for transformer ratio calculation brought us to conduct this investigation. In this regard, a transformer ratio adjustment is suggested after proposing a low-error magnetic model. Its accuracy is high enough to consider different air gaps and subsequently calculate inductance with reference to 3D finite element analysis (3D-FEA). Finally, we take advantage of a test bench to show the exactness and proficiency of the suggested transformer ratio adjustment.
- [8] arXiv:2410.18239 [pdf, html, other]
Title: E2E-Swin-Unet++: An Enhanced End-to-End Swin-Unet Architecture With Dual Decoders For PTMC Segmentation
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Efficiently managing papillary thyroid microcarcinoma (PTMC) while minimizing patient discomfort poses a significant clinical challenge. Radiofrequency ablation (RFA) offers a less invasive alternative to surgery and radiation therapy for PTMC treatment, characterized by shorter recovery times and reduced pain. As an image-guided procedure, RFA generates localized heat by delivering high-frequency electrical currents through electrodes to the targeted area under ultrasound imaging guidance. However, the precision and skill required by operators for accurate guidance using current ultrasound B-mode imaging technologies remain significant challenges. To address these challenges, we develop a novel AI segmentation model, E2E-Swin-Unet++. This model enhances ultrasound B-mode imaging by enabling real-time identification and segmentation of PTMC tumors and monitoring of the region of interest for precise targeting during treatment. E2E-Swin-Unet++ is an advanced end-to-end extension of the Swin-Unet architecture, incorporating thyroid region information to minimize the risk of false PTMC segmentation while providing fast inference capabilities. Experimental results on a real clinical RFA dataset demonstrate the superior performance of E2E-Swin-Unet++ compared to related models. Our proposed solution significantly improves the precision and control of RFA ablation treatment by enabling real-time identification and segmentation of PTMC margins during the procedure.
- [9] arXiv:2410.18260 [pdf, html, other]
Title: Predicting total time to compress a video corpus using online inference systems
Comments: Accepted by IEEE International Conference on Visual Communications and Image Processing (VCIP) 2024
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Predicting the computational cost of compressing/transcoding clips in a video corpus is important for resource management of cloud services and VOD (Video On Demand) providers. Currently, customers of cloud video services are unaware of the cost of transcoding their files until the task is completed. Previous work concentrated on predicting per-clip compression time, and thus estimating the cost of video compression. In this work, we propose new Machine Learning (ML) systems which predict cost for the entire corpus instead. This is a more appropriate goal since users are not interested in per-clip cost but instead the cost for the whole corpus. We evaluate our systems with respect to two video codecs (x264, x265) and a novel high-quality video corpus. We find that the accuracy of aggregate time prediction for a video corpus is more than two times better than using per-clip predictions. Furthermore, we present an online inference framework in which we update the ML models as files are processed. A consideration of video compute overhead and appropriate choice of ML predictor for each fraction of corpus completed yields a prediction error of less than 5%. This is approximately two times better than previous work which proposed generalised predictors.
- [10] arXiv:2410.18300 [pdf, html, other]
Title: A Bayesian Approach to Low-Thrust Maneuvering Spacecraft Tracking
Journal-ref: Journal of Guidance Control and Dynamics, Vol.47, No.8, April 2024, pp. 1586-1601
Subjects: Systems and Control (eess.SY); Signal Processing (eess.SP)
Bayesian estimation with an explicit transitional prior is required for a tracking algorithm to be embedded in most multi-target tracking frameworks. This paper describes a novel approach capable of tracking maneuvering spacecraft with an explicit transitional prior and in a Bayesian framework, with fewer than two observation passes per day. The algorithm samples thrust profiles according to a multivariate Laplace distribution. It is shown that multivariate Laplace distributions are particularly suited to track maneuvering spacecraft, leading to a log probability function that is almost linear with the thrust. Principles from rare event simulation theory are used to propagate the tails of the distribution. Fast propagation is enabled by multi-fidelity methods. Because of the diffuse transitional prior, a novel k-nearest neighbor-based ensemble Gaussian mixture filter is developed. The method allows Bayesian tracking of maneuvering spacecraft for several scenarios with fewer than two measurement passes per day, and with a mismatch between the true and expected thrust magnitude of up to a factor of 200. The validity domain and statistical significance of the method are shown by simulation through several Monte Carlo trials in different scenarios and with different filter settings.
- [11] arXiv:2410.18323 [pdf, html, other]
Title: Experimental Validation of a 3GPP Compliant 5G-Based Positioning System
Authors: Sarik Dhungel, Gaurav Duggal, Dara Ron, Nishith Tripathi, R. Michael Buehrer, Jeffrey H. Reed, Vijay K Shah
Comments: 8 pages, 9 figures, Accepted in ACM Wintech 2024
Subjects: Systems and Control (eess.SY)
The advent of 5G positioning techniques by 3GPP has unlocked possibilities for applications in public safety, vehicular systems, and location-based services. However, these applications demand accurate and reliable positioning performance, which has led to the proposal of newer positioning techniques. To further advance the research on these techniques, in this paper, we develop a 3GPP-compliant 5G positioning testbed, incorporating gNodeBs (gNBs) and User Equipment (UE). The testbed uses New Radio (NR) Positioning Reference Signals (PRS) transmitted by the gNB to generate Time of Arrival (TOA) estimates at the UE. We mathematically model the inter-gNB and UE-gNB time offsets affecting the TOA estimates and examine their impact on positioning performance. Additionally, we propose a calibration method for estimating these time offsets. Furthermore, we investigate the environmental impact on the TOA estimates. Our findings are based on our mathematical model and supported by experimental results.
- [12] arXiv:2410.18364 [pdf, html, other]
Title: Position-Aided Semantic Communication for Efficient Image Transmission: Design, Implementation, and Experimental Results
Subjects: Image and Video Processing (eess.IV); Signal Processing (eess.SP)
Semantic communication, augmented by knowledge bases (KBs), offers substantial reductions in transmission overhead and resilience to errors. However, existing methods predominantly rely on end-to-end training to construct KBs, often failing to fully capitalize on the rich information available at communication devices. Motivated by the growing convergence of sensing and communication, we introduce a novel Position-Aided Semantic Communication (PASC) framework, which integrates localization into semantic transmission. This framework is particularly designed for position-based image communication, such as real-time uploading of outdoor camera-view images. By utilizing the position, the framework retrieves corresponding maps, and then an advanced foundation model (FM)-driven view generator is employed to synthesize images closely resembling the target images. The PASC framework further leverages the FM to fuse the synthesized image with deviations from the real one, enhancing semantic reconstruction. Notably, the framework is highly flexible, capable of adapting to dynamic content and fluctuating channel conditions through a novel FM-based parameter optimization strategy. Additionally, the challenges of real-time deployment are addressed, with the development of a hardware testbed to validate the framework. Simulations and real-world tests demonstrate that the proposed PASC approach not only significantly boosts transmission efficiency, but also remains robust in diverse and evolving transmission scenarios.
- [13] arXiv:2410.18366 [pdf, other]
Title: Cochlear Implantation of Slim Pre-curved Arrays using Automatic Pre-operative Insertion Plans
Authors: Kareem O. Tawfik, Mohammad M.R. Khan, Ankita Patro, Miriam R. Smetak, David Haynes, Robert F. Labadie, René H. Gifford, Jack H. Noble
Comments: First two listed authors are co-first authors
Subjects: Image and Video Processing (eess.IV)
Hypothesis: Pre-operative cochlear implant (CI) electrode array (EL) insertion plans created by automated image analysis methods can improve positioning of slim pre-curved EL.
Background: This study represents the first evaluation of a system for patient-customized EL insertion planning for a slim pre-curved EL.
Methods: Twenty-one temporal bone specimens were divided into experimental and control groups and underwent cochlear implantation. For the control group, the surgeon performed a traditional insertion without an insertion plan. For the experimental group, customized insertion plans guided entry site, trajectory, curl direction, and base insertion depth. An additional 35 clinical insertions from the same surgeon were analyzed, 7 of which were conducted using the insertion plans. EL positioning was analyzed using post-operative imaging auto-segmentation techniques, allowing measurement of angular insertion depth (AID), mean modiolar distance (MMD), and scalar position.
Results: In the cadaveric temporal bones, 3 scalar translocations, including 2 foldovers, occurred in 14 control group insertions. In the clinical insertions, translocations occurred in 2 of 28 control cases. No translocations or folds occurred in the 7 experimental temporal bones and the 7 experimental clinical insertions. Among the non-translocated cases, overall AID and MMD were 401(41) degrees and 0.34(0.13) mm for the control insertions. AID and MMD for the experimental insertions were 424(43) degrees and 0.34(0.09) mm overall and were 432(19) degrees and 0.30(0.07) mm for cases where the planned insertion depth was achieved.
Conclusions: Trends toward improved EL positioning within scala tympani were observed when EL insertion plans are used. Variability in MMD was significantly reduced (0.07 mm vs 0.13 mm, p=0.039) when the planned depth was achieved.
- [14] arXiv:2410.18370 [pdf, html, other]
Title: Structured Connectivity for 6G Reflex Arc: Task-Oriented Virtual User and New Uplink-Downlink Tradeoff
Subjects: Systems and Control (eess.SY)
To accommodate the evolving demands of unmanned operations, the future sixth-generation (6G) network will support not only communication links but also sensing-communication-computing-control ($\mathbf{SC}^3$) loops. In each $\mathbf{SC}^3$ cycle, the sensor uploads sensing data to the computing center, and the computing center calculates the control command and sends it to the actuator to take action. To maintain the task-level connections between the sensor-computing center link and the computing center-actuator link, we propose to treat the sensor and actuator as a virtual user. In this way, the two communication links of the $\mathbf{SC}^3$ loop become the uplink and downlink (UL&DL) of the virtual user. Based on the virtual user, we propose a task-oriented UL&DL optimization scheme. This scheme jointly optimizes UL&DL transmit power, time, bandwidth, and CPU frequency to minimize the control linear quadratic regulator (LQR) cost. We decouple the complex problem into a convex UL&DL bandwidth allocation problem with the closed-form solution for the optimal time allocation. Simulation results demonstrate that the proposed scheme achieves a task-level balance between the UL&DL, surpassing conventional communication schemes that optimize each link separately.
- [15] arXiv:2410.18382 [pdf, html, other]
Title: Sensing-Communication-Computing-Control Closed-Loop Optimization for 6G Unmanned Robotic Systems
Subjects: Systems and Control (eess.SY); Signal Processing (eess.SP)
Rapid advancements in field robots have brought a new kind of cyber physical system (CPS)--unmanned robotic system--under the spotlight. In the upcoming sixth-generation (6G) era, these systems hold great potential to replace humans in hazardous tasks. This paper investigates an unmanned robotic system comprising a multi-functional unmanned aerial vehicle (UAV), sensors, and actuators. The UAV carries communication and computing modules, acting as an edge information hub (EIH) that transfers and processes information. During the task execution, the EIH gathers sensing data, calculates control commands, and transmits commands to actuators--leading to reflex-arc-like sensing-communication-computing-control ($\mathbf{SC}^3$) loops. Unlike existing studies that design $\mathbf{SC}^3$ loop components separately, we take each $\mathbf{SC}^3$ loop as an integrated structure and propose a goal-oriented closed-loop optimization scheme. This scheme jointly optimizes uplink and downlink (UL&DL) communication and computing within and across the $\mathbf{SC}^3$ loops to minimize the total linear quadratic regulator (LQR) cost. We derive optimal closed-form solutions for intra-loop allocation and propose an efficient iterative algorithm for inter-loop optimization. Under the condition of adequate CPU frequency availability, we derive an approximate closed-form solution for inter-loop bandwidth allocation. Simulation results demonstrate that the proposed scheme achieves a two-tier task-level balance within and across $\mathbf{SC}^3$ loops.
- [16] arXiv:2410.18456 [pdf, html, other]
Title: Multi-Stage Airway Segmentation in Lung CT Based on Multi-scale Nested Residual UNet
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Accurate and complete segmentation of airways in chest CT images is essential for the quantitative assessment of lung diseases and the facilitation of pulmonary interventional procedures. Although deep learning has led to significant advancements in medical image segmentation, maintaining airway continuity remains particularly challenging. This difficulty arises primarily from the small and dispersed nature of airway structures, as well as class imbalance in CT scans. To address these challenges, we designed a Multi-scale Nested Residual U-Net (MNR-UNet), incorporating multi-scale inputs and Residual Multi-scale Modules (RMM) into a nested residual framework to enhance information flow, effectively capturing the intricate details of small airways and mitigating gradient vanishing. Building on this, we developed a three-stage segmentation pipeline to optimize the training of the MNR-UNet. The first two stages prioritize high accuracy and sensitivity, while the third stage focuses on repairing airway breakages to balance topological completeness and correctness. To further address class imbalance, we introduced a weighted Breakage-Aware Loss (wBAL) to heighten focus on challenging samples, penalizing breakages and thereby extending the length of the airway tree. Additionally, we proposed a hierarchical evaluation framework to offer more clinically meaningful analysis. Validation on both in-house and public datasets demonstrates that our approach achieves superior performance in detecting more accurate airway voxels and identifying additional branches, significantly improving airway topological completeness. The code will be released publicly following the publication of the paper.
- [17] arXiv:2410.18461 [pdf, html, other]
Title: Uncertainty-Error correlations in Evidential Deep Learning models for biomedical segmentation
Comments: 15 pages
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Medical Physics (physics.med-ph)
In this work, we examine the effectiveness of an uncertainty quantification framework known as Evidential Deep Learning applied in the context of biomedical image segmentation. This class of models involves assigning Dirichlet distributions as priors for segmentation labels, and enables a few distinct definitions of model uncertainties. Using the cardiac and prostate MRI images available in the Medical Segmentation Decathlon for validation, we found that Evidential Deep Learning models with U-Net backbones generally yielded superior correlations between prediction errors and uncertainties relative to the conventional baseline equipped with Shannon entropy measure, Monte-Carlo Dropout and Deep Ensemble methods. We also examined these models' effectiveness in active learning, finding that relative to the standard Shannon entropy-based sampling, they yielded higher point-biserial uncertainty-error correlations while attaining similar performances in Dice-Sorensen coefficients. These superior features of EDL models render them well-suited for segmentation tasks that warrant a critical sensitivity in detecting large model errors.
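To make the uncertainty measures mentioned in this abstract concrete, here is a minimal sketch of the widely used evidential formulation, in which non-negative per-class evidence defines Dirichlet parameters. The function name and the choice of measures (vacuity and Shannon entropy of the expected probabilities) are illustrative assumptions; the abstract does not say which uncertainty definitions the paper evaluates.

```python
import math


def dirichlet_uncertainty(evidence):
    """Return (expected_probs, vacuity, entropy) for per-class evidence.

    Assumes the common evidential convention alpha_k = evidence_k + 1.
    This is a hypothetical helper, not code from the paper.
    """
    if not evidence or any(e < 0 for e in evidence):
        raise ValueError("evidence must be a non-empty list of non-negative numbers")
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]      # Dirichlet concentration parameters
    strength = sum(alpha)                    # total Dirichlet strength S
    probs = [a / strength for a in alpha]    # expected class probabilities
    vacuity = num_classes / strength         # close to 1 when little evidence exists
    entropy = -sum(p * math.log(p) for p in probs if p > 0)  # Shannon entropy baseline
    return probs, vacuity, entropy


if __name__ == "__main__":
    # One pixel with strong evidence for class 0, and one with no evidence at all.
    print(dirichlet_uncertainty([9.0, 0.0, 0.0]))
    print(dirichlet_uncertainty([0.0, 0.0, 0.0]))
```

In a segmentation setting such a function would be applied per pixel or voxel to the network's evidence output, and the resulting uncertainty map compared against the error map.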
- [18] arXiv:2410.18462 [pdf, html, other]
Title: Learn 2 Rage: Experiencing The Emotional Roller Coaster That Is Reinforcement Learning
Subjects: Systems and Control (eess.SY); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
This work presents the experiments and solution outline for our team's winning submission in the Learn To Race Autonomous Racing Virtual Challenge 2022 hosted by AIcrowd. The objective of the Learn-to-Race competition is to push the boundary of autonomous technology, with a focus on achieving the safety benefits of autonomous driving. In the description, the competition is framed as a reinforcement learning (RL) challenge. We focused our initial efforts on implementation of Soft Actor Critic (SAC) variants. Our goal was to learn non-trivial control of the race car exclusively from visual and geometric features, directly mapping pixels to control actions. We made suitable modifications to the default reward policy aiming to promote smooth steering and acceleration control. The framework for the competition provided real-time simulation, meaning a single episode (learning experience) is measured in minutes. Instead of pursuing parallelisation of episodes, we opted to explore a more traditional approach in which the visual perception was processed (via learned operators) and fed into rule-based controllers. Such a system, while not as academically "attractive" as a pixels-to-actions approach, requires less training, is more explainable, generalises better, and is easily tuned. It ultimately out-performed all other agents in the competition by a large margin.
- [19] arXiv:2410.18470 [pdf, html, other]
Title: Bearing-Only Solution for Fermat-Weber Location Problem: Generalized Algorithms
Comments: 14 pages (double-column), 7 figures, submitted to a journal
Subjects: Systems and Control (eess.SY); Optimization and Control (math.OC)
This paper presents novel algorithms for the Fermat-Weber Location Problem, guiding an autonomous agent to the point that minimizes the weighted sum of Euclidean distances to some beacons using only bearing measurements. The existing results address only the simple scenario where the beacons are stationary and the agent is modeled by a single integrator. In this paper, we propose a number of bearing-only algorithms that let the agent, which can be modeled as either a single-integrator or a double-integrator, follow the Fermat-Weber point of a group of stationary or moving beacons. The theoretical results are rigorously proven using Lyapunov theory and supported with simulation examples.
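For readers unfamiliar with the target point itself, the sketch below computes a Fermat-Weber point for stationary beacons with the classical Weiszfeld iteration. This is not the paper's bearing-only method: it uses full beacon positions and distances, and is included only to illustrate the objective being minimized. All names and parameters are hypothetical.

```python
import math


def weighted_distance_sum(point, beacons, weights):
    """Fermat-Weber objective: weighted sum of Euclidean distances to the beacons."""
    x, y = point
    return sum(w * math.hypot(x - bx, y - by) for (bx, by), w in zip(beacons, weights))


def weiszfeld(beacons, weights, start=None, iterations=500, eps=1e-12):
    """Approximate the Fermat-Weber point of 2D beacons (classical Weiszfeld iteration)."""
    if not beacons or len(beacons) != len(weights):
        raise ValueError("beacons and weights must be non-empty and of equal length")
    if start is None:
        # Default start: the weighted centroid of the beacons.
        total = sum(weights)
        x = sum(w * bx for (bx, _), w in zip(beacons, weights)) / total
        y = sum(w * by for (_, by), w in zip(beacons, weights)) / total
    else:
        x, y = start
    for _ in range(iterations):
        num_x = num_y = denom = 0.0
        for (bx, by), w in zip(beacons, weights):
            # Clamp the distance to avoid dividing by zero at a beacon location.
            d = max(math.hypot(x - bx, y - by), eps)
            num_x += w * bx / d
            num_y += w * by / d
            denom += w / d
        new_x, new_y = num_x / denom, num_y / denom
        if math.hypot(new_x - x, new_y - y) < 1e-13:
            x, y = new_x, new_y
            break
        x, y = new_x, new_y
    return x, y


if __name__ == "__main__":
    corners = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
    print(weiszfeld(corners, [1.0, 1.0, 1.0, 1.0], start=(0.3, 0.7)))
```

The distance clamp is a simple guard, not a full treatment of the case where the optimum coincides with a beacon.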
- [20] arXiv:2410.18484 [pdf, html, other]
Title: Constraint-adaptive MPC for large-scale systems: Satisfying state constraints without imposing them
Comments: 6 pages, 4 figures, IFAC NMPC 2021 conference
Subjects: Systems and Control (eess.SY); Optimization and Control (math.OC)
Model Predictive Control (MPC) is a successful control methodology, which is applied to increasingly complex systems. However, real-time feasibility of MPC can be challenging for complex systems, certainly when an (extremely) large number of constraints have to be adhered to. For such scenarios with a large number of state constraints, this paper proposes two novel MPC schemes for general nonlinear systems, which we call constraint-adaptive MPC. These novel schemes dynamically select at each time step a (varying) set of constraints that are included in the on-line optimization problem. Carefully selecting the included constraints can significantly reduce, as we will demonstrate, the computational complexity with often only a slight impact on the closed-loop performance. Although not all (state) constraints are imposed in the on-line optimization, the schemes still guarantee recursive feasibility and constraint satisfaction. A numerical case study illustrates the proposed MPC schemes and demonstrates the achieved computation time improvements exceeding two orders of magnitude without loss of performance.
- [21] arXiv:2410.18506 [pdf, other]
-
Title: Enhancing Graph Attention Neural Network Performance for Marijuana Consumption Classification through Large-scale Augmented Granger Causality (lsAGC) Analysis of Functional MR Images
Comments: 17 pages
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
In the present research, the effectiveness of large-scale Augmented Granger Causality (lsAGC) as a tool for gauging brain network connectivity was examined to differentiate between marijuana users and typical controls by utilizing resting-state functional Magnetic Resonance Imaging (fMRI). The relationship between marijuana consumption and alterations in brain network connectivity is a recognized fact in scientific literature. This study probes how lsAGC can accurately discern these changes. The technique used integrates dimension reduction with the augmentation of source time-series in a model that predicts time-series, which helps in estimating the directed causal relationships among fMRI time-series. As a multivariate approach, lsAGC uncovers the connection of the inherent dynamic system while considering all other time-series. A dataset of 60 adults with an ADHD diagnosis during childhood, drawn from the Addiction Connectome Preprocessed Initiative (ACPI), was used in the study. The brain connections assessed by lsAGC were utilized as classification attributes. A Graph Attention Neural Network (GAT) was chosen to carry out the classification task, particularly for its ability to harness graph-based data and recognize intricate interactions between brain regions, making it appropriate for fMRI-based brain connectivity data. The performance was analyzed using a five-fold cross-validation system. The average accuracy achieved by the correlation coefficient method was roughly 52.98%, with a 1.65 standard deviation, whereas the lsAGC approach yielded an average accuracy of 61.47%, with a standard deviation of 1.44. The suggested method enhances the body of knowledge in the field of neuroimaging-based classification and emphasizes the necessity to consider directed causal connections in brain network connectivity analysis when studying marijuana's effects on the brain.
- [22] arXiv:2410.18582 [pdf, html, other]
-
Title: LLM-Aided Efficient Hardware Design Automation
Subjects: Systems and Control (eess.SY)
With the rapidly increasing complexity of modern chips, hardware engineers are required to invest more effort in tasks such as circuit design, verification, and physical implementation. These workflows often involve continuous modifications, which are labor-intensive and prone to errors. Therefore, there is an increasing need for more efficient and cost-effective Electronic Design Automation (EDA) solutions to accelerate new hardware development. Recently, large language models (LLMs) have made significant advancements in contextual understanding, logical reasoning, and response generation. Since hardware designs and intermediate scripts can be expressed in text format, it is reasonable to explore whether integrating LLMs into EDA could simplify and fully automate the entire workflow. Accordingly, this paper discusses such possibilities in several aspects, covering hardware description language (HDL) generation, code debugging, design verification, and physical implementation. Two case studies, along with their future outlook, are introduced to highlight the capabilities of LLMs in code repair and testbench generation. Finally, future directions and challenges are highlighted to further explore the potential of LLMs in shaping the next-generation EDA.
- [23] arXiv:2410.18610 [pdf, html, other]
-
Title: A Joint Representation Using Continuous and Discrete Features for Cardiovascular Diseases Risk Prediction on Chest CT Scans
Authors: Minfeng Xu, Chen-Chen Fan, Yan-Jie Zhou, Wenchao Guo, Pan Liu, Jing Qi, Le Lu, Hanqing Chao, Kunlun He
Comments: 23 pages, 9 figures
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Cardiovascular diseases (CVD) remain a leading health concern and contribute significantly to global mortality rates. While clinical advancements have led to a decline in CVD mortality, accurately identifying individuals who could benefit from preventive interventions remains an unsolved challenge in preventive cardiology. Current CVD risk prediction models, recommended by guidelines, are based on limited traditional risk factors or use CT imaging to acquire quantitative biomarkers, and still have limitations in predictive accuracy and applicability. On the other hand, end-to-end trained CVD risk prediction methods leveraging deep learning on CT images often fail to provide transparent and explainable decision grounds for assisting physicians. In this work, we propose a novel joint representation that integrates discrete quantitative biomarkers and continuous deep features extracted from chest CT scans. Our approach begins with a deep CVD risk classification model that captures comprehensive continuous deep learning features while jointly obtaining clinically established quantitative biomarkers via segmentation models. In the feature joint representation stage, we use an instance-wise feature-gated mechanism to align the continuous and discrete features, followed by a soft instance-wise feature interaction mechanism fostering independent and effective feature interaction for the final CVD risk prediction. Our method substantially improves CVD risk predictive performance and offers individual contribution analysis of each biomarker, which is important in assisting physicians' decision-making processes. We validated our method on a public chest low-dose CT dataset and a private external chest standard-dose CT patient cohort of 17,207 CT volumes from 6,393 unique subjects, and demonstrated superior predictive performance, achieving AUCs of 0.875 and 0.843, respectively.
- [24] arXiv:2410.18637 [pdf, html, other]
-
Title: Remote Detection of Applications for Improved Beam Tracking in mmWave/sub-THz 5G/6G Systems
Authors: Alexander Shurakov, Margarita Ershova, Abdukodir Khakimov, Anatoliy Prikhodko, Evgeny Mokrov, Vyacheslav Begishev, Galina Chulkova, Yevgeni Koucheryavy, Gregory Gol'tsman
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG); Instrumentation and Detectors (physics.ins-det)
Beam tracking is an essential functionality of millimeter wave (mmWave, 30-100 GHz) and sub-terahertz (sub-THz, 100-300 GHz) 5G/6G systems. It operates by performing antenna sweeping at both base station (BS) and user equipment (UE) sides using the Synchronization Signal Blocks (SSB). The optimal frequency of beam tracking events is not specified by 3GPP standards and heavily depends on the micromobility properties of the applications currently utilized by the user. In the absence of explicit signalling for the type of application at the air interface, in this paper, we propose a way to remotely detect it at the BS side based on the received signal strength pattern. To this aim, we first perform a multi-stage measurement campaign at 156 GHz, belonging to the sub-THz band, to obtain the received signal strength traces of popular smartphone applications. Then, we proceed by applying conventional statistical Mann-Whitney tests and various machine learning (ML) based classification techniques to discriminate applications remotely. Our results show that the Mann-Whitney test can be used to differentiate between fast and slow application classes with a confidence of 0.95, inducing a class detection delay on the order of 1 s after application initialization. With the same time budget, random forest classifiers can differentiate between applications with fast and slow micromobility with 80% accuracy using the received signal strength metric only. The accuracy of detecting a specific application, however, is lower, reaching 60%. By utilizing the proposed technique one can estimate the optimal values of the beam tracking intervals without adding signalling to the air interface.
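As an illustration of the statistical step mentioned in this abstract, the sketch below implements a two-sided Mann-Whitney U test in plain Python, using midranks for ties and a normal approximation without tie correction for the p-value. The two sample lists are synthetic placeholders, not measurements from the paper, and the function name is hypothetical.

```python
import math

def mann_whitney_u(a, b):
    """Return (U statistic of sample a, two-sided p-value).

    The p-value uses the large-sample normal approximation and omits the
    tie correction of the variance, so it is only a rough guide for small
    or heavily tied samples.
    """
    pooled = sorted((v, g) for g, sample in enumerate((a, b)) for v in sample)
    n = len(pooled)
    rank_sum_a = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        # Tied values at positions i..j-1 share the average of ranks i+1..j.
        midrank = (i + 1 + j) / 2.0
        rank_sum_a += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(a), len(b)
    u = rank_sum_a - n1 * (n1 + 1) / 2.0
    mean = n1 * n2 / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return u, p

# Synthetic received-signal-strength fluctuations (arbitrary units).
slow_app = [0.2, 0.3, 0.1, 0.4, 0.25, 0.35]
fast_app = [1.1, 0.9, 1.4, 1.2, 0.8, 1.3]
u, p = mann_whitney_u(slow_app, fast_app)
print(u, p, p < 0.05)
```

Rejecting the null hypothesis at the 0.05 level corresponds to the 0.95 confidence quoted above.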
- [25] arXiv:2410.18669 [pdf, html, other]
-
Title: Active Target Tracking Using Bearing-only Measurements With Gaussian Process Learning
Subjects: Systems and Control (eess.SY)
This paper studies the problem of tracking a target with a partially unknown motion model by an active agent with bearing-only measurements using Gaussian process learning. To address this problem, a learning-planning-control framework is proposed. First, to learn and predict the target motion under mild assumptions, a Gaussian-process-based scheme is proposed, and a probabilistic uniform prediction error bound is rigorously proved. Second, by analyzing the data dependence of the posterior covariance, we obtain an optimal relative trajectory to achieve efficient sampling. Third, to realize efficient learning, a controller to track the planned path is proposed based on the learned target motion, which can provide guaranteed tracking performance. Theoretical analysis is conducted to prove the given probabilistic error bounds. Numerical examples and comparison with other typical methods verify the feasibility and superior performance of our proposed framework.
- [26] arXiv:2410.18690 [pdf, other]
-
Title: Advancements in Image Resolution: Super-Resolution Algorithm for Enhanced EOS-06 OCM-3 Data
Authors: Ankur Garg, Tushar Shukla, Purvee Joshi, Debojyoti Ganguly, Ashwin Gujarati, Meenakshi Sarkar, KN Babu, Mehul Pandya, S. Manthira Moorthi, Debajyoti Dhar
Comments: Preprint
Subjects: Image and Video Processing (eess.IV); Signal Processing (eess.SP)
The Ocean Color Monitor-3 (OCM-3) sensor is instrumental in Earth observation, achieving a critical balance between high-resolution imaging and broad coverage. This paper explores innovative imaging methods employed in OCM-3 and the transformative potential of super-resolution techniques to enhance image quality. The super-resolution model for OCM-3 (SOCM-3) addresses the challenges of contemporary satellite imaging by effectively navigating the trade-off between image clarity and swath width. With resolutions below 240 meters in Local Area Coverage (LAC) mode and below 750 meters in Global Area Coverage (GAC) mode, coupled with a wide 1550-kilometer swath and a 2-day revisit time, SOCM-3 emerges as a leading asset in remote sensing. The paper details the intricate interplay of atmospheric, motion, optical, and detector effects that impact image quality, emphasizing the necessity for advanced computational techniques and sophisticated algorithms for effective image reconstruction. Evaluation methods are thoroughly discussed, incorporating visual assessments using the Blind/Referenceless Image Spatial Quality Evaluator (BRISQUE) metric and computational metrics such as Line Spread Function (LSF), Full Width at Half Maximum (FWHM), and Super-Resolution (SR) ratio. Additionally, statistical analyses, including power spectrum evaluations and target-wise spectral signatures, are employed to gauge the efficacy of super-resolution techniques. By enhancing both spatial resolution and revisit frequency, this study highlights significant advancements in remote sensing capabilities, providing valuable insights for applications across cryospheric, vegetation, oceanic, and coastal domains. Ultimately, the findings underscore the potential of SOCM-3 to contribute meaningfully to our understanding of finescale oceanic phenomena and environmental monitoring.
- [27] arXiv:2410.18691 [pdf, other]
-
Title: Hyperspectral Spatial Super-Resolution using Keystone Error
Comments: Preprint
Subjects: Image and Video Processing (eess.IV)
Hyperspectral images enable precise identification of ground objects by capturing their spectral signatures with fine spectral resolution. While high spatial resolution further enhances this capability, increasing spatial resolution through hardware like larger telescopes is costly and inefficient. A better solution is to use ground processing techniques, such as hypersharpening, to merge high spectral and spatial resolution data. However, this method works best when datasets are captured under similar conditions, which is difficult when using data from different times. In this work, we propose a super-resolution approach to enhance hyperspectral data's spatial resolution without auxiliary input. Our method estimates the high-resolution point spread function (PSF) using blind deconvolution and corrects for sampling-related blur using a model-based super-resolution framework. This differs from previous approaches by not assuming a known high-resolution blur. We also introduce an adaptive prior that improves performance compared to existing methods. Applied to the visible and near-infrared (VNIR) spectrometer of HySIS, ISRO's hyperspectral sensor, our algorithm removes aliasing and boosts resolution by approximately 1.3 times. It is versatile and can be applied to similar systems.
- [28] arXiv:2410.18698 [pdf, html, other]
-
Title: Transferring Knowledge from High-Quality to Low-Quality MRI for Adult Glioma Diagnosis
Comments: Technical Report, MICCAI 2024 BraTS-SSA Challenge Runner Up
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Glioma, a common and deadly brain tumor, requires early diagnosis for improved prognosis. However, low-quality Magnetic Resonance Imaging (MRI) technology in Sub-Saharan Africa (SSA) hinders accurate diagnosis. This paper presents our work in the BraTS Challenge on SSA Adult Glioma. We adopt the model from the BraTS-GLI 2021 winning solution and utilize it with three training strategies: (1) initially training on the BraTS-GLI 2021 dataset with fine-tuning on the BraTS-Africa dataset, (2) training solely on the BraTS-Africa dataset, and (3) training solely on the BraTS-Africa dataset with 2x super-resolution enhancement. Results show that initial training on the BraTS-GLI 2021 dataset followed by fine-tuning on the BraTS-Africa dataset has yielded the best results. This suggests the importance of high-quality datasets in providing prior knowledge during training. Our top-performing model achieves Dice scores of 0.882, 0.840, and 0.926, and Hausdorff Distance (95%) scores of 15.324, 37.518, and 13.971 for enhancing tumor, tumor core, and whole tumor, respectively, in the validation phase. In the final phase of the competition, our approach successfully secured second place overall, reflecting the strength and effectiveness of our model and training strategies. Our approach provides insights into improving glioma diagnosis in SSA, showing the potential of deep learning in resource-limited settings and the importance of transfer learning from high-quality datasets.
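The Dice scores quoted in this abstract measure overlap between a predicted and a reference segmentation mask. A minimal sketch of the metric on flattened binary masks follows; the function name and the toy masks are hypothetical, and returning 1.0 when both masks are empty is a convention assumed here rather than something stated in the abstract.

```python
def dice_score(pred, truth):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two flat binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    size_sum = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    if size_sum == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0
    return 2.0 * intersection / size_sum

# Toy masks: one voxel overlaps, each mask has two foreground voxels.
print(dice_score([1, 1, 0, 0], [1, 0, 1, 0]))  # 0.5
```

Challenge evaluations typically compute this per tumor sub-region on 3D volumes; the flat-list version only shows the formula.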
- [29] arXiv:2410.18722 [pdf, other]
-
Title: Uplink Cell-Free Massive MIMO OFDM with Phase Noise-Aware Channel Estimation: Separate and Shared LOs
Authors: Yibo Wu, Luca Sanguinetti, Musa Furkan Keskin, Ulf Gustavsson, Alexandre Graell i Amat, Henk Wymeersch
Comments: 13 pages, 8 figures, submitted to an IEEE Journal
Subjects: Signal Processing (eess.SP)
Cell-free massive multiple-input multiple-output (mMIMO) networks enhance coverage and spectral efficiency (SE) by distributing antennas across access points (APs) with phase coherence between APs. However, the use of cost-efficient local oscillators (LOs) introduces phase noise (PN) that compromises phase coherence, even with centralized processing. Sharing an LO across APs can reduce costs in specific configurations but causes correlated PN between APs, leading to correlated interference that affects centralized combining. This can be improved by exploiting the PN correlation in channel estimation. This paper presents an uplink orthogonal frequency division multiplexing (OFDM) signal model for PN-impaired cell-free mMIMO, addressing gaps in single-carrier signal models. We evaluate mismatches from applying single-carrier methods to OFDM systems, showing how they underestimate the impact of PN and produce over-optimistic achievable SE predictions. Based on our OFDM signal model, we propose two PN-aware channel and common phase error estimators: a distributed estimator for uncorrelated PN with separate LOs and a centralized estimator with shared LOs. We introduce a deep learning-based channel estimator to enhance the performance and reduce the number of iterations of the centralized estimator. The simulation results show that the distributed estimator outperforms mismatched estimators with separate LOs, whereas the centralized estimator enhances distributed estimators with shared LOs.
- [30] arXiv:2410.18757 [pdf, html, other]
-
Title: Sliding DFT-based Signal Recovery for Modulo ADC with 1-bit Folding Information
Comments: 11 pages, 7 figures, this work has been submitted to the IEEE for possible publication
Subjects: Signal Processing (eess.SP); Information Theory (cs.IT)
The modulo analog-to-digital converter (ADC) is a promising solution to resolve the limited dynamic range (DR) issue of conventional ADCs. However, a modulo ADC requires an unfolding scheme to correct the nonlinear distortion introduced by the modulo operation. This paper presents a sliding discrete Fourier Transform (DFT)-based method for fast signal reconstruction given the modulo ADC output sequence and a 1-bit folding information sequence. In contrast to existing DFT-based signal recovery techniques for modulo ADCs, our proposed sliding DFT method reduces the required observation time and minimizes the spectral leakage effects via proper choice of window function parameters. A mean squared error (MSE) performance guarantee is established for the proposed signal recovery algorithm. More precisely, we derive sufficient conditions for the oversampling factor ($\mathrm{OF}$) and the number of quantization bits ($b$) to obtain a specific MSE performance. Our numerical results demonstrate that modulo ADCs equipped with our proposed recovery method can outperform conventional ADCs without modulo for $\mathrm{OF} \geq 4$ and $b \geq 4$. The impact of spectral leakage on the MSE performance of the proposed sliding DFT recovery method is also quantified.
- [31] arXiv:2410.18767 [pdf, html, other]
-
Title: STAR-RIS-Enabled Full-Duplex Integrated Sensing and Communication System
Subjects: Signal Processing (eess.SP)
Traditional self-interference cancellation (SIC) methods are common in full-duplex (FD) integrated sensing and communication (ISAC) systems. However, exploring new SIC schemes is important due to the limitations of traditional approaches. To address these limitations, this paper proposes a novel simultaneous transmitting and reflecting reconfigurable intelligent surface (STAR-RIS)-enabled FD ISAC system, where STAR-RIS enhances simultaneous communication and target sensing and reduces self-interference (SI) to a level comparable to traditional SIC approaches. Maximizing the sensing signal-to-interference-plus-noise ratio (SINR) and the communication sum rate, both crucial for improving sensing accuracy and overall communication performance, presents significant challenges due to the non-convex nature of these problems. Therefore, we develop alternating optimization algorithms to iteratively tackle these problems. Specifically, we devise a semi-definite relaxation (SDR)-based algorithm for transmit beamformer design. For the reflecting and refracting coefficients design, we adopt the successive convex approximation (SCA) method and implement the SDR-based algorithm to tackle the quartic and quadratic constraints. Simulation results validate the effectiveness of the proposed algorithms and show that the proposed deployment can achieve better performance than that of the benchmark using the traditional SIC approach without STAR-RIS deployment.
- [32] arXiv:2410.18768 [pdf, html, other]
-
Title: A New Definition of Demand Response in the Distributed Energy Resource Era
Authors: Johanna L. Mathieu, Gregor Verbič, Thomas Morstyn, Mads Almassalkhi, Kyri Baker, Julio Braslavsky, Kenneth Bruninx, Yury Dvorkin, Gregory S. Ledva, Nariman Mahdavi, Hrvoje Pandžić, Alessandra Parisio, Vedran Perić
Comments: 12 pages
Subjects: Systems and Control (eess.SY)
Demand response is a concept that has been around since the very first electric power systems. However, we have seen an explosion of research on demand response and demand-side technologies in the past 30 years, coinciding with the shift towards liberalized/deregulated electricity markets and efforts to decarbonize the power sector. Now we are also seeing a shift towards more distributed/decentralized electric systems; we have entered the era of "distributed energy resources," which require new grid management, operational, and control strategies. Given this paradigm shift, we argue that the concept of demand response needs to be revisited, and more carefully/consistently defined to enable us to better utilize this massive resource for economic, technical, environmental, and societal aims. In this paper, we survey existing demand response definitions, highlight their shortcomings, propose a new definition, and describe how this new definition enables us to more effectively harness the value of demand response in modern power systems. We conclude with a demand response research agenda informed by a discussion of demand response barriers and enablers.
- [33] arXiv:2410.18773 [pdf, html, other]
-
Title: A frequency-domain approach for estimating continuous-time diffusively coupled linear networks
Comments: 12 pages, 6 figures, extended version of paper submitted to European Control Conference, 2025, Thessaloniki, Greece
Subjects: Systems and Control (eess.SY)
This paper addresses the problem of consistently estimating a continuous-time (CT) diffusively coupled network (DCN) to identify physical components in a physical network. We develop a three-step frequency-domain identification method for linear CT DCNs that accurately recovers all the physical component values of the network while exploiting the particular symmetric structure in a DCN model. This method uses the estimated noise covariance as a non-parametric noise model to minimize the variance of the parameter estimates, obviating the need to select a parametric noise model. Moreover, this method is extended to subnetwork identification, which enables identifying the local dynamics in DCNs on the basis of partial measurements. The method is illustrated with an application from In-Circuit Testing of printed circuit boards. Experimental results highlight the method's ability to consistently estimate component values in a complex network with only a single excitation.
- [34] arXiv:2410.18834 [pdf, html, other]
-
Title: Highly efficient non-rigid registration in k-space with application to cardiac Magnetic Resonance Imaging
Authors: Aya Ghoul, Kerstin Hammernik, Andreas Lingg, Patrick Krumm, Daniel Rueckert, Sergios Gatidis, Thomas Küstner
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
In Magnetic Resonance Imaging (MRI), high temporal-resolved motion can be useful for image acquisition and reconstruction, MR-guided radiotherapy, dynamic contrast-enhancement, flow and perfusion imaging, and functional assessment of motion patterns in cardiovascular, abdominal, peristaltic, fetal, or musculoskeletal imaging. Conventionally, these motion estimates are derived through image-based registration, a particularly challenging task for complex motion patterns and high dynamic resolution. The accelerated scans in such applications result in imaging artifacts that compromise the motion estimation. In this work, we propose a novel self-supervised deep learning-based framework, dubbed the Local-All Pass Attention Network (LAPANet), for non-rigid motion estimation directly from the acquired accelerated Fourier space, i.e. k-space. The proposed approach models non-rigid motion as the cumulative sum of local translational displacements, following the Local All-Pass (LAP) registration technique. LAPANet was evaluated on cardiac motion estimation across various sampling trajectories and acceleration rates. Our results demonstrate superior accuracy compared to prior conventional and deep learning-based registration methods, accommodating as few as 2 lines/frame in a Cartesian trajectory and 3 spokes/frame in a non-Cartesian trajectory. The achieved high temporal resolution (less than 5 ms) for non-rigid motion opens new avenues for motion detection, tracking and correction in dynamic and real-time MRI applications.
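Estimating translational motion directly in k-space rests on the Fourier shift theorem: a (circular) translation of the signal multiplies its spectrum by a linear phase ramp. The 1D sketch below checks this numerically with a naive DFT. It illustrates only this underlying property, not LAPANet or the LAP registration technique, and all names and the toy signal are hypothetical.

```python
import cmath

def dft(x):
    """Naive O(N^2) discrete Fourier transform of a sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def apply_shift_in_kspace(spectrum, shift):
    """Multiply each frequency bin by the phase ramp of a circular shift."""
    n = len(spectrum)
    return [spectrum[k] * cmath.exp(-2j * cmath.pi * k * shift / n)
            for k in range(n)]

signal = [0.0, 1.0, 2.0, 3.0, 0.0, 0.0, 0.0, 0.0]
shift = 2
# Circularly shift the signal to the right by `shift` samples.
shifted = [signal[(t - shift) % len(signal)] for t in range(len(signal))]

direct = dft(shifted)
via_phase = apply_shift_in_kspace(dft(signal), shift)
print(max(abs(a - b) for a, b in zip(direct, via_phase)))  # close to zero
```

Because the displacement appears only in the phase ramp, a local translation can in principle be read off the spectrum without reconstructing an image first.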
- [35] arXiv:2410.18848 [pdf, html, other]
-
Title: Sensing Accuracy Optimization for Communication-assisted Dual-baseline UAV-InSAR
Subjects: Signal Processing (eess.SP)
In this paper, we study the optimization of the sensing accuracy of unmanned aerial vehicle (UAV)-based dual-baseline interferometric synthetic aperture radar (InSAR) systems. A swarm of three UAV-synthetic aperture radar (SAR) systems is deployed to image an area of interest from different angles, enabling the creation of two independent digital elevation models (DEMs). To reduce the InSAR sensing error, i.e., the height estimation error, the two DEMs are fused based on weighted average techniques into one final DEM. The heavy computations required for this process are performed on the ground. To this end, the radar data is offloaded in real time via a frequency division multiple access (FDMA) air-to-ground backhaul link. In this work, we focus on improving the sensing accuracy by minimizing the worst-case height estimation error of the final DEM. To this end, the UAV formation and the power allocated for offloading are jointly optimized based on alternating optimization (AO), while meeting practical InSAR sensing and communication constraints. Our simulation results demonstrate that the proposed solution can improve the sensing accuracy by over 39% compared to a classical single-baseline UAV-InSAR system and by more than 12% compared to other benchmark schemes.
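The abstract states that the two DEMs are fused by weighted averaging but does not give the weights. A common choice, assumed here purely for illustration, is inverse-variance weighting of two independent height estimates, which minimizes the variance of the fused value. The function name and the numbers are hypothetical.

```python
def fuse_heights(h1, var1, h2, var2):
    """Inverse-variance weighted average of two independent height estimates.

    Returns the fused height and its variance. Assumes unbiased estimates
    with independent errors and known, strictly positive variances.
    """
    w1 = 1.0 / var1
    w2 = 1.0 / var2
    fused = (w1 * h1 + w2 * h2) / (w1 + w2)
    fused_var = 1.0 / (w1 + w2)
    return fused, fused_var

# Two equally uncertain estimates of one DEM pixel (meters, meters squared).
print(fuse_heights(100.0, 4.0, 104.0, 4.0))  # (102.0, 2.0)
```

With equal variances the fused error variance is halved, which shows why a second baseline can reduce the height estimation error.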
- [36] arXiv:2410.18908 [pdf, html, other]
-
Title: A Survey on Speech Large Language Models
Subjects: Audio and Speech Processing (eess.AS)
Large Language Models (LLMs) exhibit strong contextual understanding and remarkable multi-task performance. Therefore, researchers have been seeking to integrate LLMs into the broad field of Spoken Language Understanding (SLU). Different from the traditional method of cascading LLMs to process text generated by Automatic Speech Recognition (ASR), new efforts have focused on designing architectures (Speech LLMs) centered around Audio Feature Extraction - Multimodal Information Fusion - LLM Inference. This approach enables richer audio feature extraction while simultaneously facilitating end-to-end fusion of audio and text modalities, thereby achieving deeper understanding and reasoning from audio data. This paper elucidates the development of Speech LLMs, offering an in-depth analysis of system architectures and training strategies. Through extensive research and a series of targeted experiments, the paper assesses Speech LLMs' advancements in Rich Audio Transcription and their potential for Cross-task Integration within the SLU field. Additionally, it indicates key challenges uncovered through experimentation, such as the Dormancy of LLMs under certain conditions. The paper further delves into the training strategies for Speech LLMs, proposing potential solutions based on these findings, and offering valuable insights and references for future research in this domain, as well as LLM applications in multimodal contexts.
New submissions (showing 36 of 36 entries)
- [37] arXiv:2410.17967 (cross-list from cs.LG) [pdf, other]
-
Title: POMDP-Driven Cognitive Massive MIMO Radar: Joint Target Detection-Tracking In Unknown Disturbances
Comments: The paper has been submitted to IEEE Transactions on Radar Systems
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Applications (stat.AP)
The joint detection and tracking of a moving target embedded in an unknown disturbance represents a key feature that motivates the development of the cognitive radar paradigm. Building upon recent advancements in robust target detection with multiple-input multiple-output (MIMO) radars, this work explores the application of a Partially Observable Markov Decision Process (POMDP) framework to enhance the tracking and detection tasks in a statistically unknown environment. In the POMDP setup, the radar system is considered as an intelligent agent that continuously senses the surrounding environment, optimizing its actions to maximize the probability of detection $(P_D)$ and improve the target position and velocity estimation, all while keeping a constant probability of false alarm $(P_{FA})$. The proposed approach employs an online algorithm that does not require any a priori knowledge of the noise statistics, and it relies on a much more general observation model than the traditional range-azimuth-elevation model employed by conventional tracking algorithms. Simulation results clearly show substantial performance improvement of the POMDP-based algorithm compared to the State-Action-Reward-State-Action (SARSA)-based one that has been recently investigated in the context of massive MIMO (MMIMO) radar systems.
- [38] arXiv:2410.18089 (cross-list from cs.CY) [pdf, html, other]
-
Title: Empowering Cognitive Digital Twins with Generative Foundation Models: Developing a Low-Carbon Integrated Freight Transportation System
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Effective monitoring of freight transportation is essential for advancing sustainable, low-carbon economies. Traditional methods relying on single-modal data and discrete simulations fall short in optimizing intermodal systems holistically. These systems involve interconnected processes that affect shipping time, costs, emissions, and socio-economic factors. Developing digital twins for real-time awareness, predictive analytics, and urban logistics optimization requires extensive efforts in knowledge discovery, data integration, and multi-domain simulation. Recent advancements in generative AI offer new opportunities to streamline digital twin development by automating knowledge discovery and data integration, generating innovative simulation and optimization solutions. These models extend digital twins' capabilities by promoting autonomous workflows for data engineering, analytics, and software development. This paper proposes an innovative paradigm that leverages generative AI to enhance digital twins for urban research and operations. Using freight decarbonization as a case study, we propose a conceptual framework employing transformer-based language models to enhance an urban digital twin through foundation models. We share preliminary results and our vision for more intelligent, autonomous, and general-purpose digital twins for optimizing integrated freight systems from multimodal to synchromodal paradigms.
- [39] arXiv:2410.18094 (cross-list from q-bio.QM) [pdf, other]
-
Title: Self-supervised inter-intra period-aware ECG representation learning for detecting atrial fibrillation
Authors: Xiangqian Zhu, Mengnan Shi, Xuexin Yu, Chang Liu, Xiaocong Lian, Jintao Fei, Jiangying Luo, Xin Jin, Ping Zhang, Xiangyang Ji
Comments: Preprint submitted to Biomedical Signal Processing and Control
Subjects: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
Atrial fibrillation is a commonly encountered clinical arrhythmia associated with stroke and increased mortality. Since professional medical knowledge is required for annotation, exploiting a large corpus of ECGs to develop accurate supervised learning-based atrial fibrillation algorithms remains challenging. Self-supervised learning (SSL) is a promising recipe for generalized ECG representation learning, eliminating the dependence on expensive labeling. However, without well-designed incorporation of knowledge related to atrial fibrillation, existing SSL approaches typically fail to capture robust ECG representations. In this paper, we propose an inter-intra period-aware ECG representation learning approach. Considering that ECGs of atrial fibrillation patients exhibit irregular RR intervals and absent P-waves, we develop specific pre-training tasks for interperiod and intraperiod representations, aiming to learn the single-period stable morphology representation while retaining crucial interperiod features. After further fine-tuning, our approach demonstrates remarkable AUC performances on the BTCH dataset, i.e., 0.953/0.996 for paroxysmal/persistent atrial fibrillation detection. On commonly used benchmarks of CinC2017 and CPSC2021, the generalization capability and effectiveness of our methodology are substantiated with competitive results.
- [40] arXiv:2410.18123 (cross-list from cs.AI) [pdf, other]
-
Title: Movement Control of Smart Mosque's Domes using CSRNet and Fuzzy Logic Techniques
Journal-ref: IJACSA, 12(3), 2021
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Systems and Control (eess.SY)
Mosques are places for the worship of Allah and must be kept clean and immaculate and provide every comfort for the worshippers in them. The Prophet's Mosque in Medina, Saudi Arabia, is one of the most important mosques for Muslims. It ranks second after the Sacred Mosque in Mecca, Saudi Arabia, and is constantly overcrowded with Muslims visiting the Prophet Mohammad's tomb. This paper proposes a smart dome model that uses artificial intelligence techniques to keep the air fresh and allow sunlight to enter the mosque. The proposed model controls the movements of the domes based on the weather conditions and the overcrowding rates in the mosque. The data were collected from two sources: the database of Saudi Arabia's weather history and the Shanghai Technology Database. The Congested Scene Recognition Network (CSRNet) and fuzzy techniques were applied, using the Python programming language, to open and close the domes for a specific time in order to renew the air inside the mosque. The model consists of several connected parts that control the opening and closing mechanism of the domes according to weather data and crowding in the mosque. Finally, the main goal of this paper has been achieved: the proposed model works efficiently and specifies the exact duration for which the domes are automatically kept open, a few minutes at the start of each hour.
- [41] arXiv:2410.18151 (cross-list from cs.SD) [pdf, html, other]
-
Title: Music102: An $D_{12}$-equivariant transformer for chord progression accompaniment
Comments: 10 pages, 3 figures
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
We present Music102, an advanced model built upon the Music101 prototype, aimed at enhancing chord progression accompaniment through a $D_{12}$-equivariant transformer. Inspired by group theory and symbolic music structures, Music102 leverages musical symmetry--such as transposition and reflection operations--integrating these properties into the transformer architecture. By encoding prior music knowledge, the model maintains equivariance across both melody and chord sequences. The POP909 dataset was employed to train and evaluate Music102, revealing significant improvements over Music101 in both weighted loss and exact accuracy metrics, despite using fewer parameters. This work showcases the adaptability of self-attention mechanisms and layer normalization to the discrete musical domain, addressing challenges in computational music analysis. With its stable and flexible neural framework, Music102 sets the stage for further exploration in equivariant music generation and computational composition tools, bridging mathematical theory with practical music performance.
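The equivariance property mentioned in this abstract can be illustrated with a small sketch. It is not the Music102 architecture: it assumes a 12-bin pitch-class vector as the representation, and the function names and the symmetric kernel are hypothetical choices made for illustration. It shows only that a circular convolution with a symmetric kernel commutes with the transposition and reflection operations of $D_{12}$.

```python
def transpose(x, k):
    """Shift a 12-bin pitch-class vector up by k semitones (a rotation in D12)."""
    return [x[(i - k) % 12] for i in range(12)]


def reflect(x):
    """Invert pitch classes about bin 0: bin i moves to bin -i mod 12 (a reflection in D12)."""
    return [x[(-i) % 12] for i in range(12)]


def circular_conv(x, w):
    """Circular convolution over the 12 pitch classes (hypothetical stand-in for a layer)."""
    return [sum(w[j] * x[(i - j) % 12] for j in range(12)) for i in range(12)]


# Assumed kernel: symmetric, w[j] == w[-j mod 12], which is what makes the
# convolution commute with reflections as well as rotations.
kernel = [2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# C major triad: pitch classes C (0), E (4), G (7).
c_major = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0]

# Equivariance: transforming the input and then applying the layer gives the
# same result as applying the layer and then transforming the output.
rotation_ok = circular_conv(transpose(c_major, 5), kernel) == transpose(circular_conv(c_major, kernel), 5)
reflection_ok = circular_conv(reflect(c_major), kernel) == reflect(circular_conv(c_major, kernel))
print(rotation_ok, reflection_ok)
```

Integer inputs are used so the equality checks are exact; with a non-symmetric kernel the rotation check would still pass but the reflection check generally would not.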
- [42] arXiv:2410.18203 (cross-list from cs.SD) [pdf, html, other]
-
Title: Melody Construction for Persian lyrics using LSTM recurrent neural networks
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
The present paper investigated automatic melody construction from Persian lyrics given as input. It was assumed that there is a phonological correlation between the lyric syllables and the melody in a song. A seq2seq neural network was developed to investigate this assumption, trained on parallel syllable and note sequences in Persian songs to suggest a pleasant melody for a new sequence of syllables. Because no dataset of Persian digital music was available, more than 100 pieces of Persian music were collected and converted from print to digital format. Finally, 14 new lyrics were given to the model as input, and the suggested melodies were performed and recorded by music experts to evaluate the trained model. The evaluation was conducted using an audio questionnaire, which more than 170 people answered. According to the answers about the pleasantness of the melody, the system outputs scored an average of 3.005 out of 5, while the human-made melodies for the same lyrics obtained an average score of 4.078.
- [43] arXiv:2410.18218 (cross-list from cs.AI) [pdf, html, other]
-
Title: Optimizing the role of human evaluation in LLM-based spoken document summarization systems
Journal-ref: Proc. Interspeech 2024, 1935-1939 (2024)
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
The emergence of powerful LLMs has led to a paradigm shift in abstractive summarization of spoken documents. The properties that make LLMs so valuable for this task -- creativity, ability to produce fluent speech, and ability to abstract information from large corpora -- also present new challenges to evaluating their content. Quick, cost-effective automatic evaluations such as ROUGE and BERTScore offer promise, but do not yet show competitive performance when compared to human evaluations. We draw on methodologies from the social sciences to propose an evaluation paradigm for spoken document summarization explicitly tailored for generative AI content. We provide detailed evaluation criteria and best practices guidelines to ensure robustness in the experimental design, replicability, and trustworthiness of human evaluation studies. We additionally include two case studies that show how these human-in-the-loop evaluation methods have been implemented at a major U.S. technology company.
- [44] arXiv:2410.18283 (cross-list from cs.LG) [pdf, html, other]
-
Title: Augmenting Training Data with Vector-Quantized Variational Autoencoder for Classifying RF Signals
Comments: IEEE Milcom 2024
Subjects: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
Radio frequency (RF) communication has been an important part of civil and military communication for decades. With the increasing complexity of wireless environments and the growing number of devices sharing the spectrum, it has become critical to efficiently manage and classify the signals that populate these frequencies. In such scenarios, the accurate classification of wireless signals is essential for effective spectrum management, signal interception, and interference mitigation. However, the classification of wireless RF signals often faces challenges due to the limited availability of labeled training data, especially under low signal-to-noise ratio (SNR) conditions. To address these challenges, this paper proposes the use of a Vector-Quantized Variational Autoencoder (VQ-VAE) to augment training data, thereby enhancing the performance of a baseline wireless classifier. The VQ-VAE model generates high-fidelity synthetic RF signals, increasing the diversity and fidelity of the training dataset by capturing the complex variations inherent in RF communication signals. Our experimental results show that incorporating VQ-VAE-generated data significantly improves the classification accuracy of the baseline model, particularly in low SNR conditions. This augmentation leads to better generalization and robustness of the classifier, overcoming the constraints imposed by limited real-world data. By improving RF signal classification, the proposed approach enhances the efficacy of wireless communication in both civil and tactical settings, ensuring reliable and secure operations. This advancement supports critical decision-making and operational readiness in environments where communication fidelity is essential.
- [45] arXiv:2410.18293 (cross-list from cs.AI) [pdf, other]
-
Title: 1-2-3-Go! Policy Synthesis for Parameterized Markov Decision Processes via Decision-Tree Learning and Generalization
Authors: Muqsit Azeem, Debraj Chakraborty, Sudeep Kanav, Jan Kretinsky, Mohammadsadegh Mohagheghi, Stefanie Mohr, Maximilian Weininger
Comments: Preprint. Under review
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Systems and Control (eess.SY)
Despite the advances in probabilistic model checking, the scalability of the verification methods remains limited. In particular, the state space often becomes extremely large when instantiating parameterized Markov decision processes (MDPs) even with moderate values. Synthesizing policies for such huge MDPs is beyond the reach of available tools. We propose a learning-based approach to obtain a reasonable policy for such huge MDPs.
The idea is to generalize optimal policies obtained by model-checking small instances to larger ones using decision-tree learning. Consequently, our method bypasses the need for explicit state-space exploration of large models, providing a practical solution to the state-space explosion problem. We demonstrate the efficacy of our approach by performing extensive experimentation on the relevant models from the quantitative verification benchmark set. The experimental results indicate that our policies perform well, even when the size of the model is orders of magnitude beyond the reach of state-of-the-art analysis tools.
- [46] arXiv:2410.18298 (cross-list from cs.LG) [pdf, html, other]
-
Title: Robust and Explainable Depression Identification from Speech Using Vowel-Based Ensemble Learning Approaches
Comments: accepted at the IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI 2024)
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
This study investigates explainable machine learning algorithms for identifying depression from speech. Grounded in evidence from speech production that depression affects motor control and vowel generation, the study uses pre-trained vowel-based embeddings that integrate semantically meaningful linguistic units. An ensemble learning approach then decomposes the problem into constituent parts characterized by specific depression symptoms and severity levels. Two methods are explored: a "bottom-up" approach with 8 models predicting individual Patient Health Questionnaire-8 (PHQ-8) item scores, and a "top-down" approach using a Mixture of Experts (MoE) with a router module for assessing depression severity. Both methods achieve performance comparable to state-of-the-art baselines, demonstrating robustness and reduced susceptibility to dataset mean/median values. The benefits of system explainability are discussed, highlighting their potential to assist clinicians in depression diagnosis and screening.
- [47] arXiv:2410.18301 (cross-list from cs.IT) [pdf, html, other]
-
Title: LEO-based Positioning: Foundations, Signal Design, and Receiver Enhancements for 6G NTN
Authors: Harish K. Dureppagari, Chiranjib Saha, Harikumar Krishnamurthy, Xiao Feng Wang, Alberto Rico-Alvariño, R. Michael Buehrer, Harpreet S. Dhillon
Comments: 7 pages, 6 figures, submitted to IEEE Communications Magazine
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
The integration of non-terrestrial networks (NTN) into 5G new radio (NR) has opened up the possibility of developing a new positioning infrastructure using NR signals from Low-Earth Orbit (LEO) satellites. LEO-based cellular positioning offers several advantages, such as a superior link budget, higher operating bandwidth, and large forthcoming constellations. Due to these factors, LEO-based positioning, navigation, and timing (PNT) is a potential enhancement for NTN in 6G cellular networks. However, extending the existing terrestrial cellular positioning methods to LEO-based NTN positioning requires considering key fundamental enhancements. These include creating broad positioning beams orthogonal to conventional communication beams, time-domain processing at the user equipment (UE) to resolve large delay and Doppler uncertainties, and efficiently accommodating positioning reference signals (PRS) from multiple satellites within the communication resource grid. In this paper, we present the first set of design insights by incorporating these enhancements and thoroughly evaluating LEO-based positioning, considering the constraints and capabilities of the NR-NTN physical layer. To evaluate the performance of LEO-based NTN positioning, we develop a comprehensive NR-compliant simulation framework, including LEO orbit simulation, transmission (Tx) and receiver (Rx) architectures, and a positioning engine incorporating the necessary enhancements. Our findings suggest that LEO-based NTN positioning could serve as a complementary infrastructure to existing Global Navigation Satellite Systems (GNSS) and, with appropriate enhancements, may also offer a viable alternative.
- [48] arXiv:2410.18322 (cross-list from cs.SD) [pdf, html, other]
-
Title: Unified Microphone Conversion: Many-to-Many Device Mapping via Feature-wise Linear Modulation
Comments: Currently under review for ICASSP 2025
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
In this study, we introduce Unified Microphone Conversion, a unified generative framework to enhance the resilience of sound event classification systems against device variability. To address the limitations of previous works, we condition the generator network with frequency response information to achieve many-to-many device mapping. This approach overcomes the inherent limitation of CycleGAN, which requires a separate model for each device pair. Our framework leverages the strengths of CycleGAN for unpaired training to simulate device characteristics in audio recordings and significantly extends its scalability by integrating frequency-response-related information via Feature-wise Linear Modulation. The experimental results show that our method outperforms the state-of-the-art method by 2.6% and reduces variability by 0.8% in macro-average F1 score.
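For readers unfamiliar with Feature-wise Linear Modulation (FiLM), the following is a minimal sketch of the general operation: each feature channel is scaled and shifted by values computed from a conditioning vector. It is not the paper's generator; the `film` function, the linear conditioning maps, and all numeric values are hypothetical, and the conditioning vector merely stands in for the frequency response information described above.

```python
def film(features, condition, w_gamma, b_gamma, w_beta, b_beta):
    """Apply out[c][t] = gamma[c] * features[c][t] + beta[c].

    gamma and beta are linear functions of the conditioning vector
    (an assumption; any network could produce them).
    """
    gamma = [sum(w * z for w, z in zip(row, condition)) + b
             for row, b in zip(w_gamma, b_gamma)]
    beta = [sum(w * z for w, z in zip(row, condition)) + b
            for row, b in zip(w_beta, b_beta)]
    return [[g * v + b for v in channel]
            for channel, g, b in zip(features, gamma, beta)]


# Two feature channels with two time steps each (illustrative values).
features = [[1.0, 2.0], [3.0, 4.0]]
# Hypothetical conditioning vector, e.g. a descriptor of the target device.
condition = [1.0, 0.0]

modulated = film(
    features,
    condition,
    w_gamma=[[2.0, 0.0], [0.0, 1.0]], b_gamma=[0.0, 1.0],
    w_beta=[[0.0, 0.0], [1.0, 0.0]], b_beta=[0.5, 0.0],
)
print(modulated)
```

Because the conditioning vector alone selects the scale and shift, one set of generator weights can in principle serve many target devices, which is the scalability argument the abstract makes.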
- [49] arXiv:2410.18358 (cross-list from cs.CY) [pdf, html, other]
-
Title: Data Publishing in Mechanics and Dynamics: Challenges, Guidelines, and Examples from Engineering Design
Authors: Henrik Ebel, Jan van Delden, Timo Lüddecke, Aditya Borse, Rutwik Gulakala, Marcus Stoffel, Manish Yadav, Merten Stender, Leon Schindler, Kristin Miriam de Payrebrune, Maximilian Raff, C. David Remy, Benedict Röder, Peter Eberhard
Comments: 21 pages, 8 figures
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Emerging Technologies (cs.ET); Systems and Control (eess.SY)
Data-based methods have gained increasing importance in engineering, driven especially, but not only, by successes with deep artificial neural networks. Success stories are prevalent, e.g., in areas such as data-driven modeling, control and automation, as well as surrogate modeling for accelerated simulation. Beyond engineering, generative and large language models are increasingly performing and helping with tasks that were previously associated solely with creative human processes. Thus, it seems timely to seek artificial-intelligence support for engineering design tasks to automate, help with, or accelerate purpose-built designs of engineering systems, e.g., in mechanics and dynamics, where design so far requires a lot of specialized knowledge. However, research-wise, compared to established, predominantly first-principles-based methods, the datasets used for training, validation, and testing become an almost inherent part of the overall methodology. Thus, data publishing becomes just as important in (data-driven) engineering science as appropriate descriptions of conventional methodology have been in past publications. This article analyzes the value and challenges of data publishing in mechanics and dynamics, in particular regarding engineering design tasks, showing that the latter also raise challenges and considerations not typical of the fields where data-driven methods originally boomed. Possible ways to deal with these challenges are discussed, and a set of examples from across different design problems shows how data publishing can be put into practice. The analysis, discussions, and examples are based on the research experience gained in a priority program of the German Research Foundation focusing on research on artificially intelligent design assistants in mechanics and dynamics.
- [50] arXiv:2410.18363 (cross-list from cs.AI) [pdf, html, other]
-
Title: Contextual Biasing to Improve Domain-specific Custom Vocabulary Audio Transcription without Explicit Fine-Tuning of Whisper Model
Subjects: Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
OpenAI's Whisper Automated Speech Recognition model excels in generalizing across diverse datasets and domains. However, this broad adaptability can lead to diminished performance in tasks requiring recognition of specific vocabularies. Addressing this challenge typically involves fine-tuning the model, which demands extensive labeled audio data that is often difficult to acquire and unavailable for specific domains. In this study, we propose a method to enhance transcription accuracy without explicit fine-tuning or altering model parameters, using a relatively small training dataset. Our method leverages contextual biasing to direct the Whisper model's output towards a specific vocabulary, integrating a neural-symbolic prefix tree structure that guides the model's transcription output. To validate our approach, we conducted experiments using a validation dataset comprising maritime data collected within a simulated training environment. A comparison between the original Whisper models of varying parameter sizes and our biased model revealed a notable reduction in transcription word error rate and enhanced performance of downstream applications. Our findings suggest that this methodology holds promise for improving speech-to-text translation performance in domains characterized by limited vocabularies.
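The general idea of prefix-tree contextual biasing can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's neural-symbolic method and not Whisper's API: tokens are whole words, the additive `bonus` is a hypothetical constant, and the maritime phrases are invented examples. A candidate token receives a score bonus when it continues a phrase from the domain vocabulary.

```python
class PrefixTree:
    """Trie over token sequences from a domain vocabulary."""

    def __init__(self, phrases):
        self.root = {}
        for phrase in phrases:
            node = self.root
            for token in phrase:
                node = node.setdefault(token, {})

    def continuations(self, prefix):
        """Return tokens that extend `prefix` along a stored phrase (empty set if none)."""
        node = self.root
        for token in prefix:
            if token not in node:
                return set()
            node = node[token]
        return set(node)


def bias_scores(scores, tree, active_prefix, bonus=2.0):
    """Add `bonus` to every candidate token that continues a vocabulary phrase."""
    allowed = tree.continuations(active_prefix)
    return {tok: s + (bonus if tok in allowed else 0.0) for tok, s in scores.items()}


# Hypothetical domain phrases, tokenized by word for readability.
tree = PrefixTree([("hard", "to", "starboard"), ("hard", "to", "port")])

# Hypothetical decoder scores for the next token after "hard to".
scores = {"start": 1.0, "starboard": 0.5, "port": 0.2}
biased = bias_scores(scores, tree, active_prefix=("hard", "to"))

best = max(biased, key=biased.get)
print(best)
```

The model parameters are untouched in this scheme; only the decoding scores change, which matches the abstract's claim of avoiding explicit fine-tuning.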
- [51] arXiv:2410.18371 (cross-list from cs.SD) [pdf, html, other]
-
Title: A Unimodal Speaker-Level Membership Inference Detector for Contrastive Pretraining
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Audio can disclose personally identifiable information (PII), particularly when combined with related text data. Therefore, it is essential to develop tools to detect privacy leakage in Contrastive Language-Audio Pretraining (CLAP). Existing membership inference attacks (MIAs) need audio as input, risking exposure of voiceprints and requiring costly shadow models. To address these challenges, we propose USMID, a textual unimodal speaker-level membership inference detector for CLAP models, which queries the target model using only text data and does not require training shadow models. We randomly generate textual gibberish that is clearly not in the training dataset. Then we extract feature vectors from these texts using the CLAP model and train a set of anomaly detectors on them. During inference, the feature vector of each test text is input into the anomaly detector to determine whether the speaker is in the training set (anomalous) or not (normal). If available, USMID can further enhance detection by integrating real audio of the tested speaker. Extensive experiments on various CLAP model architectures and datasets demonstrate that USMID outperforms baseline methods using only text data.
- [52] arXiv:2410.18395 (cross-list from cs.LG) [pdf, html, other]
-
Title: A contrastive-learning approach for auditory attention detection
Subjects: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Carrying on conversations in multi-sound environments is a challenging task, since the sounds overlap across time and frequency, making it difficult to understand a single sound source. One proposed approach to help isolate an attended speech source is to decode the electroencephalogram (EEG) and identify the attended audio source using statistical or machine learning techniques. However, the limited amount of data in comparison to other machine learning problems and the distributional shift between different EEG recordings emphasize the need for a self-supervised approach that works with limited data to achieve a more robust solution. In this paper, we propose a method based on self-supervised learning to minimize the difference between the latent representations of an attended speech signal and the corresponding EEG signal. This network is further fine-tuned for the auditory attention classification task. We compare our results with previously published methods and achieve state-of-the-art performance on the validation set.
- [53] arXiv:2410.18400 (cross-list from cs.CV) [pdf, html, other]
-
Title: DMVC: Multi-Camera Video Compression Network aimed at Improving Deep Learning Accuracy
Authors: Huan Cui (1 and 2), Qing Li (3), Hanling Wang (1), Yong jiang (1) ((1) Tsinghua University, (2) Peking University, (3) Peng Cheng Laboratory)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Image and Video Processing (eess.IV)
We introduce a cutting-edge video compression framework tailored for the age of ubiquitous video data, uniquely designed to serve machine learning applications. Unlike traditional compression methods that prioritize human visual perception, our innovative approach focuses on preserving semantic information critical for deep learning accuracy, while efficiently reducing data size. The framework operates on a batch basis and can handle multiple video streams simultaneously, thereby enhancing scalability and processing efficiency. It features a dual reconstruction mode: lightweight for real-time applications requiring swift responses, and high-precision for scenarios where accuracy is crucial. Using purpose-designed deep learning algorithms, it separates essential information from redundancy, ensuring machine learning tasks are fed with data of the highest relevance. Our experimental results, derived from diverse datasets including urban surveillance and autonomous vehicle navigation, showcase DMVC's superiority in maintaining or improving machine learning task accuracy, while achieving significant data compression. This breakthrough paves the way for smarter, scalable video analysis systems, promising immense potential across various applications from smart city infrastructure to autonomous systems, establishing a new benchmark for integrating video compression with machine learning.
- [54] arXiv:2410.18413 (cross-list from math.OC) [pdf, html, other]
-
Title: AC-Network-Informed DC Optimal Power Flow for Electricity Markets
Comments: 11 pages, 6 figures, 52nd Hawaii International Conference on System
Subjects: Optimization and Control (math.OC); Systems and Control (eess.SY)
This paper presents a parametric quadratic approximation of the AC optimal power flow (AC-OPF) problem for time-sensitive and market-based applications. The parametric approximation preserves the physics-based but simple representation provided by the DC-OPF model and leverages market and physics information encoded in the data-driven demand-dependent parameters. To enable the deployment of the proposed model for real-time applications, we propose a supervised learning approach to predict near-optimal parameters, given a certain metric concerning the dispatch quantities and locational marginal prices (LMPs). The training dataset is generated based on the solution of the accurate AC-OPF problem and a bilevel optimization problem, which calibrates parameters satisfying two market properties: cost recovery and revenue adequacy. We show the proposed approach's performance in various test systems in terms of cost and dispatch approximation errors, LMPs, market properties satisfaction, dispatch feasibility, and generalizability with respect to N-1 network topologies.
- [55] arXiv:2410.18510 (cross-list from cs.AI) [pdf, other]
-
Title: A framework for GNSS-based solutions performance analysis in an ERTMS context
Authors: Juliette Marais (COSYS-LEOST), Quentin Mayolle (IRT Railenium), Martin Fasquelle (IRT Railenium), Vincent Tardif, Emilie Chéneau-Grehalle
Journal-ref: 6th SmartRaCon Scientific Seminar, Oct 2024, San Sebastian, Spain
Subjects: Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Context: progress in introducing GNSS-based solutions in rail applications. GNSS (Global Navigation Satellite System) is now used in most of our travels and each of our smartphone apps. Most of these uses are not safety-critical. But Europe has identified GNSS for further applications, and for integration into rail in general, as part of the toolset that helps railways contribute to reducing the carbon footprint of transport. To increase the use of trains in European transport, railways must improve their attractiveness for passengers and freight, and also increase reliability, availability and efficiency by reducing capital expenditure and operational costs. GNSS is part of the global digitalization scheme for freight, which aims to offer clients added value: knowledge of the accurate time of arrival and continuous monitoring of transport conditions (temperature, humidity...). A major challenge will be to serve applications with stringent requirements; in particular, GNSS is today seen as a realistic and serious game changer for the future of the ERTMS (European Rail Traffic Management System). The localisation function is performed today with both odometry and balises. The odometer provides a continuous train position over time from a reference point. But because the distance delivered by the odometer shows a bias that grows with distance, due to wear and wheel sliding, on-track balises are used to reduce this error. Future systems will be based on on-board localisation solutions with GNSS receivers. This will allow the development of new concepts for moving blocks, virtual coupling and automation. Its use for train integrity is also being investigated. But the environmental conditions of the track and the configuration of its surroundings, i.e., tunnels, dense urban areas or vegetation, often degrade positioning performance and thus its efficiency and safety. Indeed, GNSS satellites are moving and their visibility (availability and position relative to the receiver) varies with time.
Moreover, for optimal performance, the system requires open-sky environments, which is the case for most aeronautical uses but not for train uses. Trains often run in areas where signal reception can be disturbed (multipath, intentional or unintentional interference), so performance is degraded. Although much progress has been made in recent years in developing more robust receivers [Puccitelli, 2022], multi-sensor solutions [CLUG website] and missing tools such as Digital Maps [Crespillo, 2023], in projects such as the Shift2Rail project X2Rail-5 or CLUG, some questions remain, in particular regarding performance evaluation. How can we evaluate performance in a dynamic environment (train, satellite, obstacles)? How can we be sure that every configuration has been tested? What is the impact of a failure (inaccuracy, missed detection) on operation? Some of these issues are addressed in the ongoing R2DATO project funded by Europe's Rail.
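The interplay between odometry and balises described in this abstract can be made concrete with a toy model. The sketch below assumes, purely for illustration, that the odometer error bound grows linearly with the distance travelled since the last balise and is reset to zero at each balise; the function name, the drift rate of 2% and the balise positions are all hypothetical.

```python
def odometer_error_bound(position_m, balises_m, drift_per_m=0.02):
    """Error bound in metres at `position_m`.

    Assumes linear growth since the most recent balise passed, or since the
    starting reference point at 0 m if no balise has been passed yet.
    """
    passed = [b for b in balises_m if b <= position_m]
    reference = max(passed) if passed else 0.0
    return drift_per_m * (position_m - reference)


# Hypothetical balises every 1000 m along the track.
balises = [1000.0, 2000.0, 3000.0]

with_balises = odometer_error_bound(2500.0, balises)
without_balises = odometer_error_bound(2500.0, [])
print(with_balises, without_balises)
```

Under these assumptions the bound at 2500 m is about 10 m with balises and about 50 m without them, which illustrates why periodic absolute references are needed and why an on-board GNSS receiver is attractive as a replacement for them.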
- [56] arXiv:2410.18519 (cross-list from cs.RO) [pdf, html, other]
-
Title: Towards Reinforcement Learning Controllers for Soft Robots using Learned Environments
Comments: soft manipulator, reinforcement learning, learned controllers
Journal-ref: 2024 IEEE 7th International Conference on Soft Robotics (RoboSoft), San Diego, CA, USA, 2024, pp. 933-939
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Soft robotic manipulators offer operational advantages due to their compliant and deformable structures. However, their inherently nonlinear dynamics present substantial challenges. Traditional analytical methods often depend on simplifying assumptions, while learning-based techniques can be computationally demanding and limit the control policies to existing data. This paper introduces a novel approach to soft robotic control, leveraging state-of-the-art policy gradient methods within parallelizable synthetic environments learned from data. We also propose a safety-oriented actuation space exploration protocol via cascaded updates and weighted randomness. Specifically, our recurrent forward dynamics model is learned by generating a training dataset from a physically safe mean-reverting random walk in actuation space to explore the partially observed state space. We demonstrate a reinforcement learning approach towards closed-loop control through state-of-the-art actor-critic methods, which efficiently learn high-performance behaviour over long horizons. This approach removes the need for any knowledge regarding the robot's operation or capabilities and sets the stage for a comprehensive benchmarking tool in soft robotics control.
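A mean-reverting random walk of the kind mentioned in this abstract can be sketched in a few lines. This is a generic discretized Ornstein-Uhlenbeck-style walk with clipping, not the paper's protocol (which additionally uses cascaded updates and weighted randomness); the function name and every parameter value below are assumptions chosen for illustration, with a single normalized actuator command in [0, 1].

```python
import random


def mean_reverting_walk(n_steps, mean=0.5, theta=0.1, sigma=0.05,
                        low=0.0, high=1.0, seed=0):
    """Generate actuation commands that wander randomly but are pulled back to `mean`.

    theta sets the strength of the pull, sigma the noise scale, and [low, high]
    are assumed safe actuation limits enforced by clipping.
    """
    rng = random.Random(seed)
    u = mean
    path = []
    for _ in range(n_steps):
        u += theta * (mean - u) + sigma * rng.gauss(0.0, 1.0)
        u = min(high, max(low, u))  # never leave the safe actuation range
        path.append(u)
    return path


commands = mean_reverting_walk(1000)
print(min(commands), max(commands))
```

The pull towards the mean keeps the walk from lingering at extreme actuation values, while the noise term still spreads samples across the range, which is the safety and coverage trade-off the abstract alludes to.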
- [57] arXiv:2410.18607 (cross-list from cs.CL) [pdf, html, other]
-
Title: STTATTS: Unified Speech-To-Text And Text-To-Speech Model
Comments: 11 pages, 4 Figures, EMNLP 2024 Findings
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Speech recognition and speech synthesis models are typically trained separately, each with its own set of learning objectives, training data, and model parameters, resulting in two distinct large networks. We propose a parameter-efficient approach to learning ASR and TTS jointly via a multi-task learning objective and shared parameters. Our evaluation demonstrates that the performance of our multi-task model is comparable to that of individually trained models while significantly saving computational and memory costs (~50% reduction in the total number of parameters required for the two tasks combined). We experiment with English as a resource-rich language, and Arabic as a relatively low-resource language due to shortage of TTS data. Our models are trained with publicly available data, and both the training code and model checkpoints are openly available for further research.
- [58] arXiv:2410.18625 (cross-list from physics.med-ph) [pdf, html, other]
-
Title: First performance of hybrid spectra CT reconstruction: a general Spectrum-Model-Aided Reconstruction Technique (SMART)
Subjects: Medical Physics (physics.med-ph); Image and Video Processing (eess.IV)
Hybrid spectral CT integrates energy integrating detectors (EID) and photon counting detectors (PCD) into a single system, combining the large field-of-view advantage of EID with the high energy and spatial resolution of PCD. This represents a new research direction in spectral CT imaging. However, the different imaging principles and inconsistent geometric paths of the two detectors make it difficult to reconstruct images using data from hybrid detectors. In addition, the quality of images reconstructed with spectral information is affected by the accuracy of spectral estimation and by scattered photons. In this work, we first propose a general hybrid spectral reconstruction method that takes into account both the spectral CT imaging principles of the two different detectors and the influence of scattered photons in the forward process modelling. Furthermore, we apply volume fraction constraints to the results reconstructed from the data of the two detectors. By alternately solving the spectral estimation and the spectral image reconstruction with the ADMM method, the estimated spectra and the reconstructed images reinforce each other, thus improving the accuracy of the spectral estimation and the quality of the reconstructed images. The proposed method is the first to achieve hybrid spectral CT reconstruction for both detectors, allowing simultaneous spectrum recovery and image reconstruction from hybrid spectral data containing scattering. In addition, the method is also applicable to spectral CT imaging using a single type of detector. We validated the effectiveness of the proposed method through numerical experiments and successfully performed the first hybrid spectral CT reconstruction experiment on our self-developed hybrid spectral CT system.
- [59] arXiv:2410.18628 (cross-list from cs.SD) [pdf, html, other]
-
Title: Wavetable Synthesis Using CVAE for Timbre Control Based on Semantic Label
Comments: 6 pages, 4 figures, Accepted at APSIPA ASC 2024
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
Synthesizers are essential in modern music production. However, their complex timbre parameters, often filled with technical terms, require expertise. This research introduces a method of timbre control in wavetable synthesis that is intuitive and sensible and utilizes semantic labels. Using a conditional variational autoencoder (CVAE), users can select a wavetable and define the timbre with labels such as bright, warm, and rich. The CVAE model, featuring convolutional and upsampling layers, effectively captures the wavetable nuances, ensuring real-time performance owing to their processing in the time domain. Experiments demonstrate that this approach allows for real-time, effective control of the timbre of the wavetable using semantic inputs and aims for intuitive timbre control through data-based semantic control.
- [60] arXiv:2410.18677 (cross-list from cs.CV) [pdf, html, other]
-
Title: Enhancing pretraining efficiency for medical image segmentation via transferability metrics
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
In medical image segmentation tasks, the scarcity of labeled training data poses a significant challenge when training deep neural networks. When using U-Net-style architectures, it is common practice to address this problem by pretraining the encoder part on a large general-purpose dataset like ImageNet. However, these methods are resource-intensive and do not guarantee improved performance on the downstream task. In this paper we investigate a variety of training setups on medical image segmentation datasets, using ImageNet-pretrained models. By examining over 300 combinations of models, datasets, and training methods, we find that shorter pretraining often leads to better results on the downstream task, providing additional proof to the well-known fact that the accuracy of the model on ImageNet is a poor indicator for downstream performance. As our main contribution, we introduce a novel transferability metric, based on contrastive learning, that measures how robustly a pretrained model is able to represent the target data. In contrast to other transferability scores, our method is applicable to the case of transferring from ImageNet classification to medical image segmentation. We apply our robustness score by measuring it throughout the pretraining phase to indicate when the model weights are optimal for downstream transfer. This reduces pretraining time and improves results on the target task.
- [61] arXiv:2410.18784 (cross-list from cs.LG) [pdf, html, other]
-
Title: Denoising diffusion probabilistic models are optimally adaptive to unknown low dimensionality
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Numerical Analysis (math.NA); Statistics Theory (math.ST); Machine Learning (stat.ML)
The denoising diffusion probabilistic model (DDPM) has emerged as a mainstream generative model in generative AI. While sharp convergence guarantees have been established for the DDPM, the iteration complexity is, in general, proportional to the ambient data dimension, resulting in overly conservative theory that fails to explain its practical efficiency. This has motivated the recent work Li and Yan (2024a) to investigate how the DDPM can achieve sampling speed-ups through automatic exploitation of intrinsic low dimensionality of data. We strengthen this prior work by demonstrating, in some sense, optimal adaptivity to unknown low dimensionality. For a broad class of data distributions with intrinsic dimension $k$, we prove that the iteration complexity of the DDPM scales nearly linearly with $k$, which is optimal when using KL divergence to measure distributional discrepancy. Our theory is established based on a key observation: the DDPM update rule is equivalent to running a suitably parameterized SDE upon discretization, where the nonlinear component of the drift term is intrinsically low-dimensional.
- [62] arXiv:2410.18794 (cross-list from cs.CV) [pdf, html, other]
-
Title: WARP-LCA: Efficient Convolutional Sparse Coding with Locally Competitive Algorithm
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
The locally competitive algorithm (LCA) can solve sparse coding problems across a wide range of use cases. Recently, convolution-based LCA approaches have been shown to be highly effective for enhancing robustness for image recognition tasks in vision pipelines. To additionally maximize representational sparsity, LCA with hard-thresholding can be applied. While this combination often yields very good solutions satisfying an $\ell_0$ sparsity criterion, it comes with significant drawbacks for practical application: (i) LCA is very inefficient, typically requiring hundreds of optimization cycles for convergence; (ii) the use of hard-thresholding results in a non-convex loss function, which might lead to suboptimal minima. To address these issues, we propose the Locally Competitive Algorithm with State Warm-up via Predictive Priming (WARP-LCA), which leverages a predictor network to provide a suitable initial guess of the LCA state based on the current input. Our approach significantly improves both convergence speed and the quality of solutions, while maintaining and even enhancing the overall strengths of LCA. We demonstrate that WARP-LCA converges faster by orders of magnitude and reaches better minima compared to conventional LCA. Moreover, the learned representations are more sparse and exhibit superior properties in terms of reconstruction and denoising quality as well as robustness when applied in deep recognition pipelines. Furthermore, we apply WARP-LCA to image denoising tasks, showcasing its robustness and practical effectiveness. Our findings confirm that the naive use of LCA with hard-thresholding results in suboptimal minima, whereas initializing LCA with a predictive guess results in better outcomes. This research advances the field of biologically inspired deep learning by providing a novel approach to convolutional sparse coding.
- [63] arXiv:2410.18850 (cross-list from cs.CL) [pdf, other]
-
Title: We Augmented Whisper With kNN and You Won't Believe What Came Next
Comments: 6 pages incl. appendix, 2 figures, 6 tables
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Speech recognition performance varies by language, domain, and speaker characteristics such as accent, and fine-tuning a model on any of these categories may lead to catastrophic forgetting. $k$ nearest neighbor search ($k$NN), first proposed for neural sequence decoders for natural language generation (NLG) and machine translation (MT), is a non-parametric method that can instead adapt by building an external datastore that can then be searched during inference time, without training the underlying model. We show that Whisper, a transformer end-to-end speech model, benefits from $k$NN. We investigate the differences between the speech and text setups. We discuss implications for speaker adaptation, and analyze improvements by gender, accent, and age.
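The datastore idea in the abstract above can be illustrated with a minimal sketch. It assumes the interpolation scheme commonly used in kNN-augmented language models (mixing the model's next-token distribution with one built from retrieved neighbors); the abstract does not state how the paper combines the two, and the function names, the squared-distance metric, the `temperature` and `lam` parameters, and the toy data are all hypothetical.

```python
import math


def knn_distribution(query, datastore, k, temperature=1.0):
    """Build a token distribution from the k nearest datastore entries.

    datastore is a list of (key_vector, token) pairs (hypothetical layout).
    """
    # Squared Euclidean distance from the query to every stored key.
    scored = sorted(
        (
            (sum((q - x) ** 2 for q, x in zip(query, key)), token)
            for key, token in datastore
        ),
        key=lambda pair: pair[0],
    )[:k]
    # Closer neighbors get larger weights (softmax over negative distance).
    weights = [math.exp(-dist / temperature) for dist, _ in scored]
    total = sum(weights)
    probs = {}
    for weight, (_, token) in zip(weights, scored):
        probs[token] = probs.get(token, 0.0) + weight / total
    return probs


def interpolate(model_probs, knn_probs, lam):
    """Mix the two distributions: lam * kNN + (1 - lam) * model."""
    tokens = set(model_probs) | set(knn_probs)
    return {
        t: lam * knn_probs.get(t, 0.0) + (1.0 - lam) * model_probs.get(t, 0.0)
        for t in tokens
    }


# Toy example: two nearby keys vote for "cat", the base model prefers "dog".
datastore = [([0.0, 0.0], "cat"), ([0.1, 0.0], "cat"), ([5.0, 5.0], "dog")]
knn_probs = knn_distribution([0.0, 0.0], datastore, k=2)
mixed = interpolate({"cat": 0.4, "dog": 0.6}, knn_probs, lam=0.5)
```

The underlying model is never retrained here; only the datastore changes, which is the property the abstract highlights for adapting without fine-tuning.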
- [64] arXiv:2410.18888 (cross-list from math.OC) [pdf, html, other]
-
Title: Existence of solutions to port-Hamiltonian systems: initial value problems and optimal control
Comments: 24 pages, 6 figures
Subjects: Optimization and Control (math.OC); Systems and Control (eess.SY)
We investigate the existence of solutions of reversible and irreversible port-Hamiltonian systems. To this end, we utilize the associated exergy, a function that is composed of the system's Hamiltonian and entropy, to prove global existence in time for bounded control functions. The results are then leveraged to prove existence of solutions of energy- and entropy-optimal control problems. Last, we explore model predictive control tailored to irreversible port-Hamiltonian systems by means of a numerical case study with a heat exchanger network.
- [65] arXiv:2410.18965 (cross-list from cs.LG) [pdf, html, other]
-
Title: On the Crucial Role of Initialization for Matrix Factorization
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Optimization and Control (math.OC)
This work revisits the classical low-rank matrix factorization problem and unveils the critical role of initialization in shaping convergence rates for such nonconvex and nonsmooth optimization. We introduce Nystrom initialization, which significantly improves the global convergence of Scaled Gradient Descent (ScaledGD) in both symmetric and asymmetric matrix factorization tasks. Specifically, we prove that ScaledGD with Nystrom initialization achieves quadratic convergence in cases where only linear rates were previously known. Furthermore, we extend this initialization to low-rank adapters (LoRA) commonly used for finetuning foundation models. Our approach, NoRA, i.e., LoRA with Nystrom initialization, demonstrates superior performance across various downstream tasks and model scales, from 1B to 7B parameters, in large language and diffusion models.
Cross submissions (showing 29 of 29 entries)
- [66] arXiv:2204.00768 (replaced) [pdf, html, other]
-
Title: VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature
Comments: Accepted to Interspeech 2022
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
The mainstream neural text-to-speech (TTS) pipeline is a cascade system, including an acoustic model (AM) that predicts acoustic feature from the input transcript and a vocoder that generates waveform according to the given acoustic feature. However, the acoustic feature in current TTS systems is typically mel-spectrogram, which is highly correlated along both time and frequency axes in a complicated way, leading to a great difficulty for the AM to predict. Although high-fidelity audio can be generated by recent neural vocoders from ground-truth (GT) mel-spectrogram, the gap between the GT and the predicted mel-spectrogram from AM degrades the performance of the entire TTS system. In this work, we propose VQTTS, consisting of an AM txt2vec and a vocoder vec2wav, which uses self-supervised vector-quantized (VQ) acoustic feature rather than mel-spectrogram. We redesign both the AM and the vocoder accordingly. In particular, txt2vec basically becomes a classification model instead of a traditional regression model while vec2wav uses an additional feature encoder before HifiGAN generator for smoothing the discontinuous quantized feature. Our experiments show that vec2wav achieves better reconstruction performance than HifiGAN when using self-supervised VQ acoustic feature. Moreover, our entire TTS system VQTTS achieves state-of-the-art performance in terms of naturalness among all current publicly available TTS systems.
- [67] arXiv:2311.08820 (replaced) [pdf, other]
-
Title: Reinforcement Learning with Model Predictive Control for Highway Ramp Metering
Comments: 17 pages, 10 figures, 3 tables, submitted to IEEE Transactions on Intelligent Transportation Systems
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
In the backdrop of an increasingly pressing need for effective urban and highway transportation systems, this work explores the synergy between model-based and learning-based strategies to enhance traffic flow management by use of an innovative approach to the problem of ramp metering control that embeds Reinforcement Learning (RL) techniques within the Model Predictive Control (MPC) framework. The control problem is formulated as an RL task by crafting a suitable stage cost function that is representative of the traffic conditions, variability in the control action, and violations of the constraint on the maximum number of vehicles in queue. An MPC-based RL approach, which leverages the MPC optimal problem as a function approximation for the RL algorithm, is proposed to learn to efficiently control an on-ramp and satisfy its constraints despite uncertainties in the system model and variable demands. Simulations are performed on a benchmark small-scale highway network to compare the proposed methodology against other state-of-the-art control approaches. Results show that, starting from an MPC controller that has an imprecise model and is poorly tuned, the proposed methodology is able to effectively learn to improve the control policy such that congestion in the network is reduced and constraints are satisfied, yielding an improved performance that is superior to the other controllers.
- [68] arXiv:2312.11255 (replaced) [pdf, html, other]
-
Title: State-action control barrier functions: Imposing safety on learning-based control with low online computational costs
Subjects: Systems and Control (eess.SY)
Learning-based control with safety guarantees usually requires real-time safety certification and modifications of possibly unsafe learning-based policies. The control barrier function (CBF) method uses a safety filter containing a constrained optimization problem to produce safe policies. However, finding a valid CBF for a general nonlinear system requires a complex function parameterization, which in general, makes the policy optimization problem difficult to solve in real time. For nonlinear systems with nonlinear state constraints, this paper proposes the novel concept of state-action CBFs, which not only characterize the safety at each state but also evaluate the control inputs taken at each state. State-action CBFs, in contrast to CBFs, enable a flexible parameterization, resulting in a safety filter that involves a convex quadratic optimization problem. This, in turn, significantly alleviates the online computational burden. To synthesize state-action CBFs, we propose a learning-based approach exploiting Hamilton-Jacobi reachability. The effect of learning errors on the effectiveness of state-action CBFs is addressed by constraint tightening and introducing a new concept called contractive CBFs. These contributions ensure formal safety guarantees for learned CBFs and control policies, enhancing the applicability of learning-based control in real-time scenarios. Simulation results on an inverted pendulum with elastic walls validate the proposed CBFs in terms of constraint satisfaction and CPU time.
- [69] arXiv:2401.05425 (replaced) [pdf, other]
-
Title: An Unobtrusive and Lightweight Ear-worn System for Continuous Epileptic Seizure Detection
Authors: Abdul Aziz, Nhat Pham, Neel Vora, Cody Reynolds, Jaime Lehnen, Pooja Venkatesh, Zhuoran Yao, Jay Harvey, Tam Vu, Kan Ding, Phuc Nguyen
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
Epilepsy is one of the most common neurological diseases globally (around 50 million people worldwide). Fortunately, up to 70% of people with epilepsy could live seizure-free if properly diagnosed and treated, and a reliable technique to monitor the onset of seizures could improve the quality of life of patients who are constantly facing the fear of random seizure attacks. The scalp-based EEG test, despite being the gold standard for diagnosing epilepsy, is costly, necessitates hospitalization, demands skilled professionals for operation, and is discomforting for users. In this paper, we propose EarSD, a novel lightweight, unobtrusive, and socially acceptable ear-worn system to detect epileptic seizure onsets by measuring the physiological signals from behind the user's ears. EarSD includes an integrated custom-built sensing-computing-communication PCB to collect and amplify the signals of interest, remove the noises caused by motion artifacts and environmental impacts, and stream the data wirelessly to the computer/mobile phone nearby, where data are uploaded to the host computer for further processing. We conducted both in-lab and in-hospital experiments with epileptic seizure patients who were hospitalized for seizure studies.
- [70] arXiv:2402.10998 (replaced) [pdf, other]
-
Title: Provably Safe Neural Network Controllers via Differential Dynamic Logic
Comments: 39 pages (main paper has 10 pages), 13 figures; Accepted at the Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS 2024)
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
While neural networks (NNs) have potential as autonomous controllers for Cyber-Physical Systems, verifying the safety of NN based control systems (NNCSs) poses significant challenges for the practical use of NNs, especially when safety is needed for unbounded time horizons. One reason is the intractability of analyzing NNs, ODEs and hybrid systems. To this end, we introduce VerSAILLE (Verifiably Safe AI via Logically Linked Envelopes): The first general approach that allows reusing control theory results for NNCS verification. By joining forces, we exploit the efficiency of NN verification tools while retaining the rigor of differential dynamic logic (dL). Based on provably safe control envelopes in dL, we derive specifications for the NN which is proven via NN verification. We show that a proof of the NN adhering to the specification is mirrored by a dL proof on the infinite-time safety of the NNCS.
The NN verification properties resulting from hybrid systems typically contain nonlinear arithmetic and arbitrary logical structures while efficient NN verification merely supports linear constraints. To overcome this divide, we present Mosaic: An efficient, sound and complete verification approach for polynomial real arithmetic properties on piece-wise linear NNs. Mosaic partitions complex verification queries into simple queries and lifts off-the-shelf linear constraint tools to the nonlinear setting in a completeness-preserving manner by combining approximation with exact reasoning for counterexample regions. Our evaluation demonstrates the versatility of VerSAILLE and Mosaic: We prove infinite-time safety on the classical Vertical Airborne Collision Avoidance NNCS verification benchmark for two scenarios while (exhaustively) enumerating counterexample regions in unsafe scenarios. We also show that our approach significantly outperforms State-of-the-Art tools in closed-loop NNV.
- [71] arXiv:2402.16577 (replaced) [pdf, other]
-
Title: Time vs. Frequency Domain DPD for Massive MIMO: Methods and Performance Analysis
Comments: 15 pages, 11 figures, submitted to an IEEE Journal
Subjects: Signal Processing (eess.SP)
The use of up to hundreds of antennas in massive multi-user (MU) multiple-input multiple-output (MIMO) orthogonal frequency division multiplexing (OFDM) poses a complexity challenge for digital predistortion (DPD) aiming to linearize the nonlinear power amplifiers (PAs). While the complexity for conventional time domain (TD) DPD scales with the number of PAs, frequency domain (FD) DPD has a complexity scaling with the number of user equipments (UEs). In this work, we provide a comprehensive analysis of different state-of-the-art TD and FD-DPD schemes in terms of complexity and linearization performance in both rich scattering and line-of-sight (LOS) channels and with antenna crosstalk. We propose a novel low-complexity FD convolutional neural network (CNN) DPD. We also propose a learning algorithm for any FD-DPDs with differentiable structure. The analysis shows that FD-DPD, particularly the proposed FD CNN, is preferable in LOS scenarios with few users, due to the favorable trade-off between complexity and linearization performance. On the other hand, in scenarios with more users or isotropic scattering channels, significant intermodulation distortions among UEs degrade FD-DPD performance, making TD-DPD more suitable. The proposed learning algorithm allows FD-DPDs to outperform TD-DPD optimized by indirect learning architecture under antenna crosstalk.
- [72] arXiv:2403.18200 (replaced) [pdf, html, other]
-
Title: Fault-tolerant properties of scale-free linear protocols for synchronization of homogeneous multi-agent systems
Comments: Submitted to IEEE Transactions on Automatic Control on March 27th, 2024. This updated version was re-submitted on October 7th, 2024 for second-round review
Subjects: Systems and Control (eess.SY)
Originally, protocols were designed for multi-agent systems (MAS) using information about the network which might not be available. Recently, there has been a focus on scale-free synchronization where the protocol is designed without any prior information about the network.
As long as the network contains a directed spanning tree, a scale-free protocol guarantees that the network achieves synchronization.
If there is no directed spanning tree then synchronization cannot be achieved. But what happens when these scale-free protocols are applied to such a network where the directed spanning tree no longer exists? This paper establishes that the network decomposes into a number of basic bicomponents, each of which achieves synchronization among all of its nodes. On the other hand, nodes which are not part of any basic bicomponent converge to a weighted average of the synchronized trajectories of the basic bicomponents. The weights are independent of the initial conditions and are independent of the designed protocol.
- [73] arXiv:2404.14700 (replaced) [pdf, html, other]
-
Title: FlashSpeech: Efficient Zero-Shot Speech Synthesis
Authors: Zhen Ye, Zeqian Ju, Haohe Liu, Xu Tan, Jianyi Chen, Yiwen Lu, Peiwen Sun, Jiahao Pan, Weizhen Bian, Shulin He, Wei Xue, Qifeng Liu, Yike Guo
Comments: Efficient zero-shot speech synthesis
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Recent progress in large-scale zero-shot speech synthesis has been significantly advanced by language models and diffusion models. However, the generation process of both methods is slow and computationally intensive. Efficient speech synthesis using a lower computing budget to achieve quality on par with previous work remains a significant challenge. In this paper, we present FlashSpeech, a large-scale zero-shot speech synthesis system with approximately 5% of the inference time compared with previous work. FlashSpeech is built on the latent consistency model and applies a novel adversarial consistency training approach that can train from scratch without the need for a pre-trained diffusion model as the teacher. Furthermore, a new prosody generator module enhances the diversity of prosody, making the rhythm of the speech sound more natural. The generation processes of FlashSpeech can be achieved efficiently with one or two sampling steps while maintaining high audio quality and high similarity to the audio prompt for zero-shot speech generation. Our experimental results demonstrate the superior performance of FlashSpeech. Notably, FlashSpeech can be about 20 times faster than other zero-shot speech synthesis systems while maintaining comparable performance in terms of voice quality and similarity. Furthermore, FlashSpeech demonstrates its versatility by efficiently performing tasks like voice conversion, speech editing, and diverse speech sampling. Audio samples can be found in this https URL.
- [74] arXiv:2405.04287 (replaced) [pdf, html, other]
-
Title: Asymmetry of Frequency Distribution in Power Systems: Sources, Estimation, Impact and Control
Subjects: Systems and Control (eess.SY)
This paper analyses an emerging real-world phenomenon in inverter-based renewable-dominated power systems, namely, asymmetry of frequency distribution. The paper first provides a rationale on why asymmetry reduces the "quality" of the frequency control and system operation. Then it provides qualitative theoretical insights that explain asymmetry in terms of the nonlinearity of real-world power systems and associated models. In particular, network losses and pitch angle-based frequency control of wind power plants are discussed. Then the paper proposes a nonlinear compensation control to reduce the asymmetry as well as a statistical metric based on the frequency probability distribution to quantify the level of asymmetry in a power system. Real-world data obtained from the Irish and Australian transmission systems serve to support the theoretical appraisal, whereas simulations based on an IEEE benchmark system show the effectiveness of the proposed nonlinear compensation. The case study also shows that, while automatic generation control reduces asymmetry, frequency control limits and droop-based frequency support provided by wind generation using a tight deadband of 15 mHz, namely active power control, lead to a significant increase in the asymmetry of the frequency probability distribution.
- [75] arXiv:2405.04476 (replaced) [pdf, html, other]
-
Title: BERP: A Blind Estimator of Room Acoustic and Physical Parameters for Single-Channel Noisy Speech Signals
Comments: 16-page, erratum revision, Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP)
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
Room acoustic parameters (RAPs) and room physical parameters (RPPs) are essential metrics for parameterizing the room acoustical characteristics (RACs) of a sound field around a listener's local environment, offering comprehensive indications for various applications. Current RAP and RPP estimation methods either fall short of covering broad real-world acoustic environments in the context of real background noise or lack universal frameworks for blindly estimating RAPs and RPPs from noisy single-channel speech signals, particularly sound source distances, direction of arrival (DOA) of sound sources, and occupancy levels. In this paper, we propose a new universal blind estimation framework called the blind estimator of the room acoustical and physical parameters (BERP), by introducing a new stochastic room impulse response (RIR) model, namely the sparse stochastic impulse response (SSIR) model, and endowing the BERP with a unified encoder and multiple separate predictors to estimate the RPPs and the SSIR parameters in parallel. This estimation framework enables computationally efficient and universal estimation of room parameters using only noisy single-channel speech signals. Finally, all RAPs can be simultaneously derived from RIRs synthesized from the SSIR model with estimated parameters. To evaluate the effectiveness of the proposed BERP and SSIR models, we compile a task-specific dataset from several publicly available datasets. The results reveal that the BERP achieves state-of-the-art (SOTA) performance. In addition, the evaluation results for the SSIR RIR model also demonstrate its efficacy. The code is available on GitHub.
- [76] arXiv:2405.19366 (replaced) [pdf, html, other]
-
Title: ECG Semantic Integrator (ESI): A Foundation ECG Model Pretrained with LLM-Enhanced Cardiological Text
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
The use of deep learning in electrocardiogram (ECG) analysis has advanced the accuracy and efficiency of cardiac healthcare diagnostics. By leveraging the capabilities of deep learning in semantic understanding, especially in feature extraction and representation learning, this study introduces a new multimodal contrastive pretraining framework that aims to improve the quality and robustness of learned representations of 12-lead ECG signals. Our framework comprises two key components: the Cardio Query Assistant (CQA) and the ECG Semantics Integrator (ESI). CQA integrates a retrieval-augmented generation (RAG) pipeline to leverage large language models (LLMs) and external medical knowledge to generate detailed textual descriptions of ECGs. The generated text is enriched with information about demographics and waveform patterns. ESI integrates both contrastive and captioning loss to pretrain ECG encoders for enhanced representations. We validate our approach through various downstream tasks, including arrhythmia detection and ECG-based subject identification. Our experimental results demonstrate substantial improvements over strong baselines in these tasks. These baselines encompass supervised and self-supervised learning methods, as well as prior multimodal pretraining approaches.
- [77] arXiv:2406.08177 (replaced) [pdf, html, other]
-
Title: One-Step Effective Diffusion Network for Real-World Image Super-Resolution
Comments: Accepted by NeurIPS 2024
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
The pre-trained text-to-image diffusion models have been increasingly employed to tackle the real-world image super-resolution (Real-ISR) problem due to their powerful generative image priors. Most of the existing methods start from random noise to reconstruct the high-quality (HQ) image under the guidance of the given low-quality (LQ) image. While promising results have been achieved, such Real-ISR methods require multiple diffusion steps to reproduce the HQ image, increasing the computational cost. Meanwhile, the random noise introduces uncertainty in the output, which is unfriendly to image restoration tasks. To address these issues, we propose a one-step effective diffusion network, namely OSEDiff, for the Real-ISR problem. We argue that the LQ image contains rich information to restore its HQ counterpart, and hence the given LQ image can be directly taken as the starting point for diffusion, eliminating the uncertainty introduced by random noise sampling. We finetune the pre-trained diffusion network with trainable layers to adapt it to complex image degradations. To ensure that the one-step diffusion model could yield HQ Real-ISR output, we apply variational score distillation in the latent space to conduct KL-divergence regularization. As a result, our OSEDiff model can efficiently and effectively generate HQ images in just one diffusion step. Our experiments demonstrate that OSEDiff achieves comparable or even better Real-ISR results, in terms of both objective metrics and subjective evaluations, than previous diffusion model-based Real-ISR methods that require dozens or hundreds of steps. The source codes are released at this https URL.
- [78] arXiv:2408.01738 (replaced) [pdf, html, other]
-
Title: Adaptive Safety with Control Barrier Functions and Triggered Batch Least-Squares Identifier
Comments: 11 pages, 10 figures
Subjects: Systems and Control (eess.SY)
In this paper, a triggered Batch Least-Squares Identifier (BaLSI) based adaptive safety control scheme is proposed for uncertain systems with potentially conflicting control objectives and safety constraints. A relaxation term is added to the Quadratic Programs (QP) combining the transformed Control Lyapunov Functions (CLFs) and Control Barrier Functions (CBFs), to mediate the potential conflict. The existing Lyapunov-based adaptive schemes designed to guarantee specific properties of the Lyapunov functions, may grow unboundedly under the effects of the relaxation term. The adaptive law is designed by processing system inputs and outputs, to avoid unbounded estimates and overparameterization problems in the existing results. A safety-triggered condition is presented, based on which the forward invariant property of the safe set is shown and Zeno behavior can be excluded. Simulation results are presented to demonstrate the effectiveness of the proposed adaptive control scheme.
- [79] arXiv:2408.06075 (replaced) [pdf, html, other]
-
Title: Five Pitfalls When Assessing Synthetic Medical Images with Reference Metrics
Comments: 10 pages, 5 figures, presented at Deep Generative Models workshop @ MICCAI 2024
Journal-ref: In: Mukhopadhyay, A., Oksuz, I., Engelhardt, S., Mehrof, D., Yuan, Y. (eds) Deep Generative Models. DGM4MICCAI 2024. Lecture Notes in Computer Science, vol 15224. Springer, Cham
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Reference metrics have been developed to objectively and quantitatively compare two images. Especially for evaluating the quality of reconstructed or compressed images, these metrics have shown very useful. Extensive tests of such metrics on benchmarks of artificially distorted natural images have revealed which metric best correlate with human perception of quality. Direct transfer of these metrics to the evaluation of generative models in medical imaging, however, can easily lead to pitfalls, because assumptions about image content, image data format and image interpretation are often very different. Also, the correlation of reference metrics and human perception of quality can vary strongly for different kinds of distortions and commonly used metrics, such as SSIM, PSNR and MAE are not the best choice for all situations. We selected five pitfalls that showcase unexpected and probably undesired reference metric scores and discuss strategies to avoid them.
- [80] arXiv:2408.14947 (replaced) [pdf, html, other]
-
Title: ERX: A Fast Real-Time Anomaly Detection Algorithm for Hyperspectral Line Scanning
Comments: 17 pages, 13 figures, 4 tables, code and datasets accessible at this https URL
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Detecting unexpected objects (anomalies) in real time has great potential for monitoring, managing, and protecting the environment. Hyperspectral line-scan cameras are a low-cost solution that enhances confidence in anomaly detection over RGB and multispectral imagery. However, existing line-scan algorithms are too slow when using small computers (e.g. those onboard a drone or small satellite), do not adapt to changing scenery, or lack robustness against geometric distortions. This paper introduces the Exponentially moving RX algorithm (ERX) to address these issues, and compares it with existing RX-based anomaly detection methods for hyperspectral line scanning. Three large and more complex datasets (two hyperspectral and one multispectral) are also introduced to better assess the practical challenges when using line-scan cameras. ERX is evaluated using a Jetson Xavier NX compute module, achieving the best combination of speed and detection performance. This research paves the way for future studies in grouping and locating anomalous objects, adaptive and automatic threshold selection, and real-time field tests. The datasets and the Python code are available at: this https URL.
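As a rough illustration of the general idea behind an exponentially updated RX detector, the sketch below keeps a running background mean and covariance, blends in each new scan line with an exponential weight, and scores pixels by squared Mahalanobis distance. It is a dependency-free toy under stated assumptions, not the published ERX algorithm: the class name `ExponentialRX`, the parameters `alpha` and `ridge`, the update order, and the helper `solve` are all hypothetical choices made here.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


class ExponentialRX:
    """Toy line-by-line RX detector with exponentially weighted background statistics."""

    def __init__(self, alpha=0.05, ridge=1e-6):
        self.alpha = alpha  # assumed forgetting factor: weight given to the newest line
        self.ridge = ridge  # assumed diagonal loading to keep the covariance invertible
        self.mean = None
        self.cov = None

    def process_line(self, line):
        """Update the background with one scan line (a list of pixel spectra)
        and return one squared Mahalanobis score per pixel."""
        n, bands = len(line), len(line[0])
        mean = [sum(p[b] for p in line) / n for b in range(bands)]
        cov = [
            [sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in line) / max(n - 1, 1)
             for j in range(bands)]
            for i in range(bands)
        ]
        if self.mean is None:
            # First line initializes the background statistics directly.
            self.mean, self.cov = mean, cov
        else:
            a = self.alpha
            self.mean = [(1 - a) * old + a * new for old, new in zip(self.mean, mean)]
            self.cov = [
                [(1 - a) * old + a * new for old, new in zip(row_old, row_new)]
                for row_old, row_new in zip(self.cov, cov)
            ]
        reg = [
            [v + (self.ridge if i == j else 0.0) for j, v in enumerate(row)]
            for i, row in enumerate(self.cov)
        ]
        scores = []
        for p in line:
            diff = [x - mu for x, mu in zip(p, self.mean)]
            scores.append(sum(d * s for d, s in zip(diff, solve(reg, diff))))
        return scores


# Synthetic demo: Gaussian background lines, then one line with a shifted pixel.
rng = random.Random(0)
detector = ExponentialRX()
for _ in range(10):
    detector.process_line([[rng.gauss(0, 1) for _ in range(3)] for _ in range(50)])
line = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(50)]
line[17] = [x + 10.0 for x in line[17]]
scores = detector.process_line(line)
most_anomalous = scores.index(max(scores))
```

Because the statistics are blended rather than recomputed over a window, the per-line cost stays constant and the background can drift with the scenery; a real implementation would use an array library and a thresholding rule, both of which are omitted here.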
- [81] arXiv:2409.02041 (replaced) [pdf, html, other]
-
Title: The USTC-NERCSLIP Systems for the CHiME-8 NOTSOFAR-1 Challenge
Shutong Niu, Ruoyu Wang, Jun Du, Gaobin Yang, Yanhui Tu, Siyuan Wu, Shuangqing Qian, Huaxin Wu, Haitao Xu, Xueyang Zhang, Guolong Zhong, Xindi Yu, Jieru Chen, Mengzhi Wang, Di Cai, Tian Gao, Genshun Wan, Feng Ma, Jia Pan, Jianqing Gao
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
This technical report outlines our submission system for the CHiME-8 NOTSOFAR-1 Challenge. The primary difficulty of this challenge is the dataset recorded across various conference rooms, which captures real-world complexities such as high overlap rates, background noises, a variable number of speakers, and natural conversation styles. To address these issues, we optimized the system in several aspects. For front-end speech signal processing, we introduced a data-driven joint training method for diarization and separation (JDS) to enhance audio quality. Additionally, we integrated traditional guided source separation (GSS) for the multi-channel track to provide complementary information for the JDS. For back-end speech recognition, we enhanced Whisper with WavLM, ConvNeXt, and Transformer innovations, applying multi-task training and Noise KLD augmentation, to significantly advance ASR robustness and accuracy. Our system attained a Time-Constrained minimum Permutation Word Error Rate (tcpWER) of 14.265% and 22.989% on the CHiME-8 NOTSOFAR-1 Dev-set-2 multi-channel and single-channel tracks, respectively.
- [82] arXiv:2410.07908 (replaced) [pdf, html, other]
-
Title: ONCOPILOT: A Promptable CT Foundation Model For Solid Tumor Evaluation
Léo Machado, Hélène Philippe, Élodie Ferreres, Julien Khlaut, Julie Dupuis, Korentin Le Floch, Denis Habip Gatenyo, Pascal Roux, Jules Grégory, Maxime Ronot, Corentin Dancette, Daniel Tordjman, Pierre Manceron, Paul Hérent
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Carcinogenesis is a proteiform phenomenon, with tumors emerging in various locations and displaying complex, diverse shapes. At the crucial intersection of research and clinical practice, it demands precise and flexible assessment. However, current biomarkers, such as RECIST 1.1's long and short axis measurements, fall short of capturing this complexity, offering an approximate estimate of tumor burden and a simplistic representation of a more intricate process. Additionally, existing supervised AI models face challenges in addressing the variability in tumor presentations, limiting their clinical utility. These limitations arise from the scarcity of annotations and the models' focus on narrowly defined tasks.
To address these challenges, we developed ONCOPILOT, an interactive radiological foundation model trained on approximately 7,500 CT scans covering the whole body, from both normal anatomy and a wide range of oncological cases. ONCOPILOT performs 3D tumor segmentation using visual prompts like point-click and bounding boxes, outperforming state-of-the-art models (e.g., nnUnet) and achieving radiologist-level accuracy in RECIST 1.1 measurements. The key advantage of this foundation model is its ability to surpass state-of-the-art performance while keeping the radiologist in the loop, a capability that previous models could not achieve. When radiologists interactively refine the segmentations, accuracy improves further. ONCOPILOT also accelerates measurement processes and reduces inter-reader variability, facilitating volumetric analysis and unlocking new biomarkers for deeper insights.
This AI assistant is expected to enhance the precision of RECIST 1.1 measurements, unlock the potential of volumetric biomarkers, and improve patient stratification and clinical care, while seamlessly integrating into the radiological workflow.
- [83] arXiv:2410.17539 (replaced) [pdf, html, other]
-
Title: Urban Outdoor Propagation Measurements and Channel Models at 6.75 GHz FR1(C) and 16.95 GHz FR3 Upper Mid-Band Spectrum for 5G and 6G
Dipankar Shakya, Mingjun Ying, Theodore S. Rappaport, Peijie Ma, Idris Al-Wazani, Yanze Wu, Yanbo Wang, Doru Calin, Hitesh Poddar, Ahmad Bazzi, Marwa Chafii, Yunchou Xing, Amitava Ghosh
Comments: 6 pages, 4 figures, 6 tables
Subjects: Signal Processing (eess.SP)
Global allocations in the upper mid-band spectrum (4-24 GHz) necessitate a comprehensive exploration of the propagation behavior to meet the promise of coverage and capacity. This paper presents an extensive Urban Microcell (UMi) outdoor propagation measurement campaign at 6.75 GHz and 16.95 GHz conducted in Downtown Brooklyn, USA, using a 1 GHz bandwidth sliding correlation channel sounder over 40-880 m propagation distance, encompassing 6 Line of Sight (LOS) and 14 Non-Line of Sight (NLOS) locations. Analysis of the path loss (PL) reveals lower directional and omnidirectional PL exponents compared to mmWave and sub-THz frequencies in the UMi environment, using the close-in PL model with a 1 m reference distance. Additionally, a decreasing trend in root mean square (RMS) delay spread (DS) and angular spread (AS) with increasing frequency was observed. The NLOS RMS DS and RMS AS mean values are consistently lower than 3GPP model predictions. Point data for all measured statistics at each TX-RX location are provided to supplement the models and results. The spatio-temporal statistics evaluated here offer valuable insights for the design of next-generation wireless systems and networks.
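The close-in (CI) path loss model with a 1 m reference distance can be sketched as below. The function names are hypothetical, the shadow-fading term is omitted so only the mean path loss is computed, and the exponent passed in the demo is the free-space value 2.0 chosen for illustration, not a value reported by the paper.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # m/s


def free_space_path_loss_db(freq_hz, distance_m=1.0):
    """Friis free-space path loss in dB at the given frequency and distance."""
    return 20.0 * math.log10(4.0 * math.pi * distance_m * freq_hz / SPEED_OF_LIGHT)


def close_in_path_loss_db(freq_hz, distance_m, exponent):
    """Mean CI path loss in dB with a 1 m reference distance (no shadow fading).

    PL(f, d) = FSPL(f, 1 m) + 10 * n * log10(d / 1 m)
    """
    return free_space_path_loss_db(freq_hz, 1.0) + 10.0 * exponent * math.log10(distance_m)


# Illustration only: with exponent n = 2 the CI model reduces to free-space loss.
reference_loss = free_space_path_loss_db(6.75e9)
loss_at_100_m = close_in_path_loss_db(6.75e9, 100.0, exponent=2.0)
```

Anchoring the model at the 1 m free-space loss leaves the path loss exponent as the single fitted parameter, which is what makes exponents comparable across frequency bands.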
- [84] arXiv:2410.17607 (replaced) [pdf, html, other]
-
Title: Exploiting Data Centres and Local Energy Communities Synergies for Market Participation
Comments: Accepted at IEEE PES ISGT Europe 2024
Subjects: Systems and Control (eess.SY)
The evolving energy landscape has propelled energy communities to the forefront of modern energy management. However, existing research has yet to explore the potential synergies between data centres and energy communities, necessitating an assessment of their collective capabilities for cost efficiency, waste heat optimisation, and market participation. This paper presents a mixed integer linear programming model to assess the collaborative performance of energy communities, data centres and energy markets. The evaluation focuses on the efficient use of waste heat and the flexibility of job scheduling while minimising system energy costs and maintaining quality of service requirements for data centres. Our results, based on realistic profiles of an energy community and a data centre, showcase significant benefits of these synergies, with a 38% reduction in operating costs and an 87% decrease in heat demand.
- [85] arXiv:2410.18007 (replaced) [pdf, html, other]
-
Title: Effective Finite Time Stability Control for Human-Machine Shared Vehicle Following System
Subjects: Systems and Control (eess.SY)
With the development of intelligent connected vehicle technology, human-machine shared control has gained popularity in vehicle following due to its effectiveness in driver assistance. However, traditional vehicle following systems struggle to maintain stability when driver reaction time fluctuates, as these variations require different levels of system intervention. To address this issue, the proposed human-machine shared vehicle following assistance system (HM-VFAS) integrates driver outputs under various states with the assistance system. The system employs an intelligent driver model that accounts for reaction time delays, simulating time-varying driver outputs. A control authority allocation strategy is designed to dynamically adjust the level of intervention based on real-time driver state assessment. To handle instability from driver authority switching, the proposed solution includes a two-layer adaptive finite time sliding mode controller (A-FTSMC). The first layer is an integral sliding mode adaptive controller that ensures robustness by compensating for uncertainties in the driver output. The second layer is a fast non-singular terminal sliding mode controller designed to accelerate convergence for rapid stabilization. Using real driver videos as inputs, the performance of the HM-VFAS was evaluated. Results show that the proposed control strategy maintains a safe distance under time-varying driver states, with the actual acceleration error relative to the target acceleration maintained within 0.5 m/s^2 and the maximum acceleration error reduced by 1.2 m/s^2. Compared to traditional controllers, the A-FTSMC controller offers faster convergence and less vibration, reducing the stabilization time by 27.3%.
- [86] arXiv:2208.04883 (replaced) [pdf, html, other]
-
Title: Neural-Rendezvous: Provably Robust Guidance and Control to Encounter Interstellar Objects
Hiroyasu Tsukamoto, Soon-Jo Chung, Yashwanth Kumar Nakka, Benjamin Donitz, Declan Mages, Michel Ingham
Comments: Preprint Version, Accepted: October, 2024 (DOI: this https URL)
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
Interstellar objects (ISOs) are likely representatives of primitive materials invaluable in understanding exoplanetary star systems. Due to their poorly constrained orbits with generally high inclinations and relative velocities, however, exploring ISOs with conventional human-in-the-loop approaches is significantly challenging. This paper presents Neural-Rendezvous -- a deep learning-based guidance and control framework for encountering fast-moving objects, including ISOs, robustly, accurately, and autonomously in real time. It uses pointwise minimum norm tracking control on top of a guidance policy modeled by a spectrally-normalized deep neural network, where its hyperparameters are tuned with a loss function directly penalizing the MPC state trajectory tracking error. We show that Neural-Rendezvous provides a high probability exponential bound on the expected spacecraft delivery error, the proof of which leverages stochastic incremental stability analysis. In particular, it is used to construct a non-negative function with a supermartingale property, explicitly accounting for the ISO state uncertainty and the local nature of nonlinear state estimation guarantees. In numerical simulations, Neural-Rendezvous is demonstrated to satisfy the expected error bound for 100 ISO candidates. This performance is also empirically validated using our spacecraft simulator and in high-conflict and distributed UAV swarm reconfiguration with up to 20 UAVs.
- [87] arXiv:2309.15507 (replaced) [pdf, other]
-
Title: Approximate Message Passing with Rigorous Guarantees for Pooled Data and Quantitative Group Testing
Comments: 62 pages, 11 figures, appeared in SIAM Journal on Mathematics of Data Science. The simulation results here use a slightly different metric from the journal version; see Remark 4.2
Journal-ref: SIAM Journal on Mathematics of Data Science, vol. 6, no. 4, pp. 1027-1054, 2024
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
In the pooled data problem, the goal is to identify the categories associated with a large collection of items via a sequence of pooled tests. Each pooled test reveals the number of items of each category within the pool. We study an approximate message passing (AMP) algorithm for estimating the categories and rigorously characterize its performance, in both the noiseless and noisy settings. For the noiseless setting, we show that the AMP algorithm is equivalent to one recently proposed by El Alaoui et al. Our results provide a rigorous version of their performance guarantees, previously obtained via non-rigorous techniques. For the case of pooled data with two categories, known as quantitative group testing (QGT), we use the AMP guarantees to compute precise limiting values of the false positive rate and the false negative rate. Though the pooled data problem and QGT are both instances of estimation in a linear model, existing AMP theory cannot be directly applied since the design matrices are binary valued. The key technical ingredient in our analysis is a rigorous asymptotic characterization of AMP for generalized linear models defined via generalized white noise design matrices. This result, established using a recent universality result of Wang et al., is of independent interest. Our theoretical results are validated by numerical simulations. For comparison, we propose estimators based on convex relaxation and iterative thresholding, without providing theoretical guarantees. The simulations indicate that AMP consistently outperforms these estimators.
- [88] arXiv:2310.09580 (replaced) [pdf, other]
-
Title: Where to Decide? Centralized vs. Distributed Vehicle Assignment for Platoon Formation
Journal-ref: IEEE Transactions on Intelligent Transportation Systems, 2024
Subjects: Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Platooning is a promising cooperative driving application for future intelligent transportation systems. In order to assign vehicles to platoons, some algorithm for platoon formation is required. Such vehicle-to-platoon assignments have to be computed on-demand, e.g., when vehicles join or leave the freeways. In order to get the best results from platooning, individual properties of the involved vehicles have to be considered during the assignment computation. In this paper, we explore the computation of vehicle-to-platoon assignments as an optimization problem based on similarity between vehicles. We define the similarity and, vice versa, the deviation among vehicles based on the desired driving speed of vehicles and their position on the road. We create three approaches to solve this assignment problem: centralized solver, centralized greedy, and distributed greedy, using a Mixed Integer Programming (MIP) solver and greedy heuristics, respectively. Conceptually, the approaches differ in both knowledge about vehicles as well as methodology. We perform a large-scale simulation study using PlaFoSim to compare all approaches. While the distributed greedy approach seems to have disadvantages due to the limited local knowledge, it performs as well as the centralized solver approach across most metrics. Both outperform the centralized greedy approach, which suffers from synchronization and greedy selection effects. The centralized solver approach, however, assumes global knowledge and requires a complex MIP solver to compute vehicle-to-platoon assignments. Overall, the distributed greedy approach achieves close to optimal results but requires the fewest assumptions and the least complexity. Therefore, we consider the distributed greedy approach the best approach among all presented approaches.
- [89] arXiv:2312.09734 (replaced) [pdf, html, other]
-
Title: Learning of Hamiltonian Dynamics with Reproducing Kernel Hilbert Spaces
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
This paper presents a method for learning Hamiltonian dynamics from a limited set of data points. The Hamiltonian vector field is found by regularized optimization over a reproducing kernel Hilbert space of vector fields that are inherently Hamiltonian, and where the vector field is required to be odd or even. This is done with a symplectic kernel, and it is shown how this symplectic kernel can be modified to be odd or even. The performance of the method is validated in simulations for two Hamiltonian systems. The simulations show that the learned dynamics reflect the energy-preservation of the Hamiltonian dynamics, and that the restriction to symplectic and odd dynamics gives improved accuracy over a large domain of the phase space.
- [90] arXiv:2403.00790 (replaced) [pdf, html, other]
-
Title: Structuring Concept Space with the Musical Circle of Fifths by Utilizing Music Grammar Based Activations
Comments: 3 pages
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
In this paper, we explore the intriguing similarities between the structure of a discrete neural network, such as a spiking network, and the composition of a piano piece. While both involve nodes or notes that are activated sequentially or in parallel, the latter benefits from the rich body of music theory to guide meaningful combinations. We propose a novel approach that leverages musical grammar to regulate activations in a spiking neural network, allowing for the representation of symbols as attractors. By applying rules for chord progressions from music theory, we demonstrate how certain activations naturally follow others, akin to the concept of attraction. Furthermore, we introduce the concept of modulating keys to navigate different basins of attraction within the network. Ultimately, we show that the map of concepts in our model is structured by the musical circle of fifths, highlighting the potential for leveraging music theory principles in deep learning algorithms.
- [91] arXiv:2404.07703 (replaced) [pdf, html, other]
-
Title: Learning Hamiltonian Dynamics with Reproducing Kernel Hilbert Spaces and Random Features
Subjects: Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
A method for learning Hamiltonian dynamics from a limited and noisy dataset is proposed. The method learns a Hamiltonian vector field on a reproducing kernel Hilbert space (RKHS) of inherently Hamiltonian vector fields, and in particular, odd Hamiltonian vector fields. This is done with a symplectic kernel, and it is shown how the kernel can be modified to an odd symplectic kernel to impose the odd symmetry. A random feature approximation is developed for the proposed odd kernel to reduce the problem size. The performance of the method is validated in simulations for three Hamiltonian systems. It is demonstrated that the use of an odd symplectic kernel improves prediction accuracy and data efficiency, and that the learned vector fields are Hamiltonian and exhibit the imposed odd symmetry characteristics.
- [92] arXiv:2406.19388 (replaced) [pdf, html, other]
-
Title: Taming Data and Transformers for Audio Generation
Moayed Haji-Ali, Willi Menapace, Aliaksandr Siarohin, Guha Balakrishnan, Sergey Tulyakov, Vicente Ordonez
Comments: Project Webpage: this https URL
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Generating ambient sounds is a challenging task due to data scarcity and often insufficient caption quality, making it difficult to employ large-scale generative models for the task. In this work, we tackle this problem by introducing two new models. First, we propose AutoCap, a high-quality and efficient automatic audio captioning model. By using a compact audio representation and leveraging audio metadata, AutoCap substantially enhances caption quality, reaching a CIDEr score of 83.2, marking a 3.2% improvement from the best available captioning model at four times faster inference speed. Second, we propose GenAu, a scalable transformer-based audio generation architecture that we scale up to 1.25B parameters. Using AutoCap to generate caption clips from existing audio datasets, we demonstrate the benefits of data scaling with synthetic captions as well as model size scaling. When compared to state-of-the-art audio generators trained at similar size and data scale, GenAu obtains significant improvements of 4.7% in FAD score, 22.7% in IS, and 13.5% in CLAP score, indicating significantly improved quality of generated audio compared to previous works. Moreover, we propose an efficient and scalable pipeline for collecting audio datasets, enabling us to compile 57M ambient audio clips, forming AutoReCap-XL, the largest available audio-text dataset, at 90 times the scale of existing ones. Our code, model checkpoints, and dataset are publicly available.
- [93] arXiv:2407.05180 (replaced) [pdf, html, other]
-
Title: ReCAP: Recursive Cross Attention Network for Pseudo-Label Generation in Robotic Surgical Skill Assessment
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
In surgical skill assessment, the Objective Structured Assessments of Technical Skills (OSATS) and Global Rating Scale (GRS) are well-established tools for evaluating surgeons during training. These metrics, along with performance feedback, help surgeons improve and reach practice standards. Recent research on the open-source JIGSAWS dataset, which includes both GRS and OSATS labels, has focused on regressing GRS scores from kinematic data, video, or their combination. However, we argue that regressing GRS alone is limiting, as it aggregates OSATS scores and overlooks clinically meaningful variations during a surgical trial. To address this, we developed a recurrent transformer model that tracks a surgeon's performance throughout a session by mapping hidden states to six OSATS, derived from kinematic data, using a clinically motivated objective function. These OSATS scores are averaged to predict GRS, allowing us to compare our model's performance against state-of-the-art (SOTA) methods. We report Spearman's Correlation Coefficients (SCC) demonstrating that our model outperforms SOTA using kinematic data (SCC 0.83-0.88), and matches performance with video-based models. Our model also surpasses SOTA in most tasks for average OSATS predictions (SCC 0.46-0.70) and specific OSATS (SCC 0.56-0.95). The generation of pseudo-labels at the segment level translates quantitative predictions into qualitative feedback, vital for automated surgical skill assessment pipelines. A senior surgeon validated our model's outputs, agreeing with 77% of the weakly-supervised predictions (p=0.006).
- [94] arXiv:2410.00009 (replaced) [pdf, html, other]
-
Title: Exergetic Port-Hamiltonian Systems: Compositional fluid and electro-magneto hydrodynamics models
Subjects: Computational Engineering, Finance, and Science (cs.CE); Systems and Control (eess.SY); Computational Physics (physics.comp-ph)
Fluid dynamics plays a crucial role in various multiphysics applications, including energy systems, electronics cooling, and biomedical engineering. Developing computational models for complex coupled systems can be challenging and time-consuming. In particular, ensuring the consistent integration of models from diverse physical domains requires meticulous attention. Considering the example of (electro-)magneto hydrodynamics (on a fixed spatial domain and with linear polarization and magnetization), this article demonstrates how relatively complex models can be composed from simpler parts by means of a formal language for multiphysics modeling. The Exergetic Port-Hamiltonian Systems (EPHS) modeling language features a simple graphical syntax for expressing the energy-based interconnection of subsystems. This reduces cognitive load and facilitates communication, especially in multidisciplinary environments. As the example demonstrates, existing models can be easily integrated as subsystems of new models. Specifically, the ideal fluid model is used as a subsystem of the Navier-Stokes-Fourier fluid model, which in turn is reused as a subsystem of the electro-magneto hydrodynamics model. The compositional approach makes it nearly trivial to encapsulate, reuse, and swap out (parts of) models. Moreover, structural properties of EPHS models guarantee fundamental properties of thermodynamic systems, such as conservation of energy, non-negative entropy production, and Onsager reciprocal relations.
- [95] arXiv:2410.16821 (replaced) [pdf, html, other]
-
Title: Guiding Reinforcement Learning with Incomplete System Dynamics
Shuyuan Wang, Jingliang Duan, Nathan P. Lawrence, Philip D. Loewen, Michael G. Forbes, R. Bhushan Gopaluni, Lixian Zhang
Comments: Accepted to IROS 2024
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Model-free reinforcement learning (RL) is inherently a reactive method, operating under the assumption that it starts with no prior knowledge of the system and entirely depends on trial-and-error for learning. This approach faces several challenges, such as poor sample efficiency, generalization, and the need for well-designed reward functions to guide learning effectively. On the other hand, controllers based on complete system dynamics do not require data. This paper addresses the intermediate situation where there is not enough model information for complete controller design, but there is enough to suggest that a model-free approach is not the best approach either. By carefully decoupling known and unknown information about the system dynamics, we obtain an embedded controller guided by our partial model and thus improve the learning efficiency of an RL-enhanced approach. A modular design allows us to deploy mainstream RL algorithms to refine the policy. Simulation results show that our method significantly improves sample efficiency compared with standard RL methods on continuous control tasks, and also offers enhanced performance over traditional control approaches. Experiments on a real ground vehicle also validate the performance of our method, including generalization and robustness.