Early Action Recognition with Action Prototypes

Authors: Guglielmo Camporese, Alessandro Bergamo, Xunyu Lin, Joseph Tighe, Davide Modolo

Abstract: Early action recognition is an important and challenging problem that enables the recognition of an action from a partially observed video stream where the activity is potentially unfinished or even not started. In this work, we propose a novel model that learns a prototypical representation of the full action for each class and uses it to regularize the architecture and the visual representations of the partial observations. Our model is very simple in design and also efficient. We decompose the video into short clips, where a visual encoder extracts features from each clip independently. Later, a decoder aggregates together in an online fashion features from all the clips for the final class prediction. During training, for each partial observation, the model is jointly trained to both predict the label as well as the action prototypical representation which acts as a regularizer. We evaluate our method on multiple challenging real-world datasets and outperform the current state-of-the-art by a significant margin. For example, on early recognition observing only the first 10% of each video, our method improves the SOTA by +2.23 Top-1 accuracy on Something-Something-v2, +3.55 on UCF-101, +3.68 on SSsub21, and +5.03 on EPIC-Kitchens-55, where prior work used either multi-modal inputs (e.g. optical-flow) or batched inference. Finally, we also present exhaustive ablation studies to motivate the design choices we made, as well as gather insights regarding what our model is learning semantically.

Submitted 11 December, 2023; originally announced December 2023.

arXiv:2309.11445 [pdf, other]

SkeleTR: Towards Skeleton-based Action Recognition in the Wild

Authors: Haodong Duan, Mingze Xu, Bing Shuai, Davide Modolo, Zhuowen Tu, Joseph Tighe, Alessandro Bergamo

Abstract: We present SkeleTR, a new framework for skeleton-based action recognition. In contrast to prior work, which focuses mainly on controlled environments, we target more general scenarios that typically involve a variable number of people and various forms of interaction between people. SkeleTR works with a two-stage paradigm. It first models the intra-person skeleton dynamics for each skeleton sequence with graph convolutions, and then uses stacked Transformer encoders to capture person interactions that are important for action recognition in general scenarios. To mitigate the negative impact of inaccurate skeleton associations, SkeleTR takes relative short skeleton sequences as input and increases the number of sequences. As a unified solution, SkeleTR can be directly applied to multiple skeleton-based action tasks, including video-level action classification, instance-level action detection, and group-level activity recognition. It also enables transfer learning and joint training across different action tasks and datasets, which result in performance improvement. When evaluated on various skeleton-based action recognition benchmarks, SkeleTR achieves the state-of-the-art performance.

Submitted 20 September, 2023; originally announced September 2023.

Comments: ICCV 2023

arXiv:2307.04047 [pdf, other]

Threshold-Consistent Margin Loss for Open-World Deep Metric Learning

Authors: Qin Zhang, Linghan Xu, Qingming Tang, Jun Fang, Ying Nian Wu, Joe Tighe, Yifan Xing

Abstract: Existing losses used in deep metric learning (DML) for image retrieval often lead to highly non-uniform intra-class and inter-class representation structures across test classes and data distributions. When combined with the common practice of using a fixed threshold to declare a match, this gives rise to significant performance variations in terms of false accept rate (FAR) and false reject rate (FRR) across test classes and data distributions. We define this issue in DML as threshold inconsistency. In real-world applications, such inconsistency often complicates the threshold selection process when deploying commercial image retrieval systems. To measure this inconsistency, we propose a novel variance-based metric called Operating-Point-Inconsistency-Score (OPIS) that quantifies the variance in the operating characteristics across classes. Using the OPIS metric, we find that achieving high accuracy levels in a DML model does not automatically guarantee threshold consistency. In fact, our investigation reveals a Pareto frontier in the high-accuracy regime, where existing methods to improve accuracy often lead to degradation in threshold consistency. To address this trade-off, we introduce the Threshold-Consistent Margin (TCM) loss, a simple yet effective regularization technique that promotes uniformity in representation structures across classes by selectively penalizing hard sample pairs. Extensive experiments demonstrate TCM's effectiveness in enhancing threshold consistency while preserving accuracy, simplifying the threshold selection process in practical DML settings.

Submitted 12 March, 2024; v1 submitted 8 July, 2023; originally announced July 2023.

Comments: Accepted to ICLR'24

arXiv:2306.16048 [pdf, other]

Benchmarking Zero-Shot Recognition with Vision-Language Models: Challenges on Granularity and Specificity

Authors: Zhenlin Xu, Yi Zhu, Tiffany Deng, Abhay Mittal, Yanbei Chen, Manchen Wang, Paolo Favaro, Joseph Tighe, Davide Modolo

Abstract: This paper presents novel benchmarks for evaluating vision-language models (VLMs) in zero-shot recognition, focusing on granularity and specificity. Although VLMs excel in tasks like image captioning, they face challenges in open-world settings. Our benchmarks test VLMs' consistency in understanding concepts across semantic granularity levels and their response to varying text specificity. Findings show that VLMs favor moderately fine-grained concepts and struggle with specificity, often misjudging texts that differ from their training data. Extensive evaluations reveal limitations in current VLMs, particularly in distinguishing between correct and subtly incorrect descriptions. While fine-tuning offers some improvements, it doesn't fully address these issues, highlighting the need for VLMs with enhanced generalization capabilities for real-world applications. This study provides insights into VLM limitations and suggests directions for developing more robust models.

Submitted 18 June, 2024; v1 submitted 28 June, 2023; originally announced June 2023.

Comments: CVPR2024 MMFM workshop

arXiv:2306.04849 [pdf, other]

ScaleDet: A Scalable Multi-Dataset Object Detector

Authors: Yanbei Chen, Manchen Wang, Abhay Mittal, Zhenlin Xu, Paolo Favaro, Joseph Tighe, Davide Modolo

Abstract: Multi-dataset training provides a viable solution for exploiting heterogeneous large-scale datasets without extra annotation cost. In this work, we propose a scalable multi-dataset detector (ScaleDet) that can scale up its generalization across datasets when increasing the number of training datasets. Unlike existing multi-dataset learners that mostly rely on manual relabelling efforts or sophisticated optimizations to unify labels across datasets, we introduce a simple yet scalable formulation to derive a unified semantic label space for multi-dataset training. ScaleDet is trained by visual-textual alignment to learn the label assignment with label semantic similarities across datasets. Once trained, ScaleDet can generalize well on any given upstream and downstream datasets with seen and unseen classes. We conduct extensive experiments using LVIS, COCO, Objects365, OpenImages as upstream datasets, and 13 datasets from Object Detection in the Wild (ODinW) as downstream datasets. Our results show that ScaleDet achieves compelling strong model performance with an mAP of 50.7 on LVIS, 58.8 on COCO, 46.8 on Objects365, 76.2 on OpenImages, and 71.8 on ODinW, surpassing state-of-the-art detectors with the same backbone.

Submitted 7 June, 2023; originally announced June 2023.

Comments: CVPR 2023

arXiv:2305.12039 [pdf, other]

Learning for Transductive Threshold Calibration in Open-World Recognition

Authors: Qin Zhang, Dongsheng An, Tianjun Xiao, Tong He, Qingming Tang, Ying Nian Wu, Joseph Tighe, Yifan Xing, Stefano Soatto

Abstract: In deep metric learning for visual recognition, the calibration of distance thresholds is crucial for achieving desired model performance in the true positive rates (TPR) or true negative rates (TNR). However, calibrating this threshold presents challenges in open-world scenarios, where the test classes can be entirely disjoint from those encountered during training. We define the problem of finding distance thresholds for a trained embedding model to achieve target performance metrics over unseen open-world test classes as open-world threshold calibration. Existing posthoc threshold calibration methods, reliant on inductive inference and requiring a calibration dataset with a similar distance distribution as the test data, often prove ineffective in open-world scenarios. To address this, we introduce OpenGCN, a Graph Neural Network-based transductive threshold calibration method with enhanced adaptability and robustness. OpenGCN learns to predict pairwise connectivity for the unlabeled test instances embedded in a graph to determine its TPR and TNR at various distance thresholds, allowing for transductive inference of the distance thresholds which also incorporates test-time information. Extensive experiments across open-world visual recognition benchmarks validate OpenGCN's superiority over existing posthoc calibration methods for open-world threshold calibration.

Submitted 22 March, 2024; v1 submitted 19 May, 2023; originally announced May 2023.

arXiv:2211.02175 [pdf, other]

Large Scale Real-World Multi-Person Tracking

Authors: Bing Shuai, Alessandro Bergamo, Uta Buechler, Andrew Berneshawi, Alyssa Boden, Joseph Tighe

Abstract: This paper presents a new large scale multi-person tracking dataset -- PersonPath22, which is over an order of magnitude larger than currently available high quality multi-object tracking datasets such as MOT17, HiEve, and MOT20 datasets. The lack of large scale training and test data for this task has limited the community's ability to understand the performance of their tracking systems on a wide range of scenarios and conditions such as variations in person density, actions being performed, weather, and time of day. PersonPath22 dataset was specifically sourced to provide a wide variety of these conditions and our annotations include rich meta-data such that the performance of a tracker can be evaluated along these different dimensions. The lack of training data has also limited the ability to perform end-to-end training of tracking systems. As such, the highest performing tracking systems all rely on strong detectors trained on external image datasets. We hope that the release of this dataset will enable new lines of research that take advantage of large scale video based training data.

Submitted 3 November, 2022; originally announced November 2022.

Comments: ECCV 2022

arXiv:2210.00129 [pdf, other]

An In-depth Study of Stochastic Backpropagation

Authors: Jun Fang, Mingze Xu, Hao Chen, Bing Shuai, Zhuowen Tu, Joseph Tighe

Abstract: In this paper, we provide an in-depth study of Stochastic Backpropagation (SBP) when training deep neural networks for standard image classification and object detection tasks. During backward propagation, SBP calculates the gradients by only using a subset of feature maps to save the GPU memory and computational cost. We interpret SBP as an efficient way to implement stochastic gradient descent by performing backpropagation dropout, which leads to considerable memory saving and training process speedup, with a minimal impact on the overall model accuracy. We offer some good practices to apply SBP in training image recognition models, which can be adopted in learning a wide range of deep neural networks. Experiments on image classification and object detection show that SBP can save up to 40% of GPU memory with less than 1% accuracy degradation.

Submitted 30 September, 2022; originally announced October 2022.

Comments: NeurIPS 2022

arXiv:2205.11710 [pdf, other]

SCVRL: Shuffled Contrastive Video Representation Learning

Authors: Michael Dorkenwald, Fanyi Xiao, Biagio Brattoli, Joseph Tighe, Davide Modolo

Abstract: We propose SCVRL, a novel contrastive-based framework for self-supervised learning for videos. Differently from previous contrast learning based methods that mostly focus on learning visual semantics (e.g., CVRL), SCVRL is capable of learning both semantic and motion patterns. For that, we reformulate the popular shuffling pretext task within a modern contrastive learning paradigm. We show that our transformer-based network has a natural capacity to learn motion in self-supervised settings and achieves strong performance, outperforming CVRL on four benchmarks.

Submitted 23 May, 2022; originally announced May 2022.

Comments: CVPR 2022 - L3DIVU workshop

arXiv:2204.03101 [pdf, other]

Hierarchical Self-supervised Representation Learning for Movie Understanding

Authors: Fanyi Xiao, Kaustav Kundu, Joseph Tighe, Davide Modolo

Abstract: Most self-supervised video representation learning approaches focus on action recognition. In contrast, in this paper we focus on self-supervised video learning for movie understanding and propose a novel hierarchical self-supervised pretraining strategy that separately pretrains each level of our hierarchical movie understanding model (based on [37]). Specifically, we propose to pretrain the low-level video backbone using a contrastive learning objective, while pretrain the higher-level video contextualizer using an event mask prediction task, which enables the usage of different data sources for pretraining different levels of the hierarchy. We first show that our self-supervised pretraining strategies are effective and lead to improved performance on all tasks and metrics on VidSitu benchmark [37] (e.g., improving on semantic role prediction from 47% to 61% CIDEr scores). We further demonstrate the effectiveness of our contextualized event features on LVU tasks [54], both when used alone and when combined with instance features, showing their complementarity.

Submitted 6 April, 2022; originally announced April 2022.

Comments: CVPR 2022

arXiv:2204.00746 [pdf, other]

What to look at and where: Semantic and Spatial Refined Transformer for detecting human-object interactions

Authors: A S M Iftekhar, Hao Chen, Kaustav Kundu, Xinyu Li, Joseph Tighe, Davide Modolo

Abstract: We propose a novel one-stage Transformer-based semantic and spatial refined transformer (SSRT) to solve the Human-Object Interaction detection task, which requires to localize humans and objects, and predicts their interactions. Differently from previous Transformer-based HOI approaches, which mostly focus at improving the design of the decoder outputs for the final detection, SSRT introduces two new modules to help select the most relevant object-action pairs within an image and refine the queries' representation using rich semantic and spatial features. These enhancements lead to state-of-the-art results on the two most popular HOI benchmarks: V-COCO and HICO-DET.

Submitted 25 May, 2022; v1 submitted 1 April, 2022; originally announced April 2022.

Comments: CVPR 2022 Oral

arXiv:2203.05553 [pdf, other]

Transfer of Representations to Video Label Propagation: Implementation Factors Matter

Authors: Daniel McKee, Zitong Zhan, Bing Shuai, Davide Modolo, Joseph Tighe, Svetlana Lazebnik

Abstract: This work studies feature representations for dense label propagation in video, with a focus on recently proposed methods that learn video correspondence using self-supervised signals such as colorization or temporal cycle consistency. In the literature, these methods have been evaluated with an array of inconsistent settings, making it difficult to discern trends or compare performance fairly. Starting with a unified formulation of the label propagation algorithm that encompasses most existing variations, we systematically study the impact of important implementation factors in feature extraction and label propagation. Along the way, we report the accuracies of properly tuned supervised and unsupervised still image baselines, which are higher than those found in previous works. We also demonstrate that augmenting video-based correspondence cues with still-image-based ones can further improve performance. We then attempt a fair comparison of recent video-based methods on the DAVIS benchmark, showing convergence of best methods to performance levels near our strong ImageNet baseline, despite the usage of a variety of specialized video-based losses and training particulars. Additional comparisons on JHMDB and VIP datasets confirm the similar performance of current methods. We hope that this study will help to improve evaluation practices and better inform future research directions in temporal correspondence.

Submitted 10 March, 2022; originally announced March 2022.

arXiv:2111.05431 [pdf, other]

Multi-Task Prediction of Clinical Outcomes in the Intensive Care Unit using Flexible Multimodal Transformers

Authors: Benjamin Shickel, Patrick J. Tighe, Azra Bihorac, Parisa Rashidi

Abstract: Recent deep learning research based on Transformer model architectures has demonstrated state-of-the-art performance across a variety of domains and tasks, mostly within the computer vision and natural language processing domains. While some recent studies have implemented Transformers for clinical tasks using electronic health records data, they are limited in scope, flexibility, and comprehensiveness. In this study, we propose a flexible Transformer-based EHR embedding pipeline and predictive model framework that introduces several novel modifications of existing workflows that capitalize on data attributes unique to the healthcare domain. We showcase the feasibility of our flexible design in a case study in the intensive care unit, where our models accurately predict seven clinical outcomes pertaining to readmission and patient mortality over multiple future time horizons.

Submitted 9 November, 2021; originally announced November 2021.

arXiv:2110.02768 [pdf]

Posture Recognition in the Critical Care Settings using Wearable Devices

Authors: Anis Davoudi, Patrick J. Tighe, Azra Bihorac, Parisa Rashidi

Abstract: Low physical activity levels in the intensive care units (ICU) patients have been linked to adverse clinical outcomes. Therefore, there is a need for continuous and objective measurement of physical activity in the ICU to quantify the association between physical activity and patient outcomes. This measurement would also help clinicians evaluate the efficacy of proposed rehabilitation and physical therapy regimens in improving physical activity. In this study, we examined the feasibility of posture recognition in an ICU population using data from wearable sensors.

Submitted 7 October, 2021; v1 submitted 4 October, 2021; originally announced October 2021.

Comments: 8 pages

arXiv:2109.00888 [pdf, other]

Analysis of Intra-Operative Physiological Responses Through Complex Higher-Order SVD for Long-Term Post-Operative Pain Prediction

Authors: Raheleh Baharloo, Jose C. Principe, Parisa Rashidi, Patrick J. Tighe

Abstract: Long-term pain conditions after surgery and patients' responses to pain relief medications are not yet fully understood. While recent studies developed an index for nociception level of patients under general anesthesia, based on multiple physiological parameters, it remains unclear whether and how dynamics of these parameters indicate long-term post-operative pain (POP). To extract unbiased and interpretable descriptions of how physiological parameters dynamics change over time and across patients in response to surgical procedures, we employed a multivariate-temporal analysis. We demonstrate the main features of intra-operative physiological responses can be used to predict long-term POP. We propose to use a complex higher-order SVD method to accurately decompose the patients' physiological responses into multivariate structures evolving in time. We used intra-operative vital signs of 175 patients from a mixed surgical cohort to extract three interconnected, low-dimensional complex-valued descriptions of patients' physiological responses: multivariate factors, reflecting sub-physiological parameters; temporal factors reflecting common intra-surgery temporal dynamics; and patients factors, describing patient to patient changes in physiological responses. Adoption of complex-HOSVD allowed us to clarify the dynamic correlation structure included in intra-operative physiological responses. Instantaneous phases of the complex-valued physiological responses within the subspace of principal descriptors enabled us to discriminate between mild versus severe levels of long-term POP. By abstracting patients into different surgical groups, we identified significant surgery-related principal descriptors: each of them potentially encodes different surgical stimulation. The dynamics of patients' physiological responses to these surgical events are linked to long-term post-operative pain development.

Submitted 2 September, 2021; originally announced September 2021.

arXiv:2108.08836 [pdf, other]

Multi-Object Tracking with Hallucinated and Unlabeled Videos

Authors: Daniel McKee, Bing Shuai, Andrew Berneshawi, Manchen Wang, Davide Modolo, Svetlana Lazebnik, Joseph Tighe

Abstract: In this paper, we explore learning end-to-end deep neural trackers without tracking annotations. This is important as large-scale training data is essential for training deep neural trackers while tracking annotations are expensive to acquire. In place of tracking annotations, we first hallucinate videos from images with bounding box annotations using zoom-in/out motion transformations to obtain free tracking labels. We add video simulation augmentations to create a diverse tracking dataset, albeit with simple motion. Next, to tackle harder tracking cases, we mine hard examples across an unlabeled pool of real videos with a tracker trained on our hallucinated video data. For hard example mining, we propose an optimization-based connecting process to first identify and then rectify hard examples from the pool of unlabeled videos. Finally, we train our tracker jointly on hallucinated data and mined hard video examples. Our weakly supervised tracker achieves state-of-the-art performance on the MOT17 and TAO-person datasets. On MOT17, we further demonstrate that the combination of our self-generated data and the existing manually-annotated data leads to additional improvements.

Submitted 19 August, 2021; originally announced August 2021.

arXiv:2108.02722 [pdf, other]

Video Contrastive Learning with Global Context

Authors: Haofei Kuang, Yi Zhu, Zhi Zhang, Xinyu Li, Joseph Tighe, Sören Schwertfeger, Cyrill Stachniss, Mu Li

Abstract: Contrastive learning has revolutionized self-supervised image representation learning field, and recently been adapted to video domain. One of the greatest advantages of contrastive learning is that it allows us to flexibly define powerful loss objectives as long as we can find a reasonable way to formulate positive and negative samples to contrast. However, existing approaches rely heavily on the short-range spatiotemporal salience to form clip-level contrastive signals, thus limit themselves from using global context. In this paper, we propose a new video-level contrastive learning method based on segments to formulate positive pairs. Our formulation is able to capture global context in a video, thus robust to temporal content change. We also incorporate a temporal order regularization term to enforce the inherent sequential structure of videos. Extensive experiments show that our video-level contrastive learning framework (VCLR) is able to outperform previous state-of-the-arts on five video datasets for downstream action classification, action localization and video retrieval. Code is available at https://github.com/amazon-research/video-contrastive-learning.

Submitted 5 August, 2021; originally announced August 2021.

Comments: Code is publicly available at: https://github.com/amazon-research/video-contrastive-learning

arXiv:2106.10335 [pdf, other]

Single View Physical Distance Estimation using Human Pose

Authors: Xiaohan Fei, Henry Wang, Xiangyu Zeng, Lin Lee Cheong, Meng Wang, Joseph Tighe

Abstract: We propose a fully automated system that simultaneously estimates the camera intrinsics, the ground plane, and physical distances between people from a single RGB image or video captured by a camera viewing a 3-D scene from a fixed vantage point. To automate camera calibration and distance estimation, we leverage priors about human pose and develop a novel direct formulation for pose-based auto-calibration and distance estimation, which shows state-of-the-art performance on publicly available datasets. The proposed approach enables existing camera systems to measure physical distances without needing a dedicated calibration process or range sensors, and is applicable to a broad range of use cases such as social distancing and workplace safety. Furthermore, to enable evaluation and drive research in this area, we contribute to the publicly available MEVA dataset with additional distance annotations, resulting in MEVADA -- the first evaluation benchmark in the world for the pose-based auto-calibration and distance estimation problem.

Submitted 18 June, 2021; originally announced June 2021.

arXiv:2106.09703 [pdf, other]

MaCLR: Motion-aware Contrastive Learning of Representations for Videos

Authors: Fanyi Xiao, Joseph Tighe, Davide Modolo

Abstract: We present MaCLR, a novel method to explicitly perform cross-modal self-supervised video representation learning from visual and motion modalities. Compared to previous video representation learning methods that mostly focus on learning motion cues implicitly from RGB inputs, MaCLR enriches standard contrastive learning objectives for RGB video clips with a cross-modal learning objective between a Motion pathway and a Visual pathway. We show that the representation learned with our MaCLR method focuses more on foreground motion regions and thus generalizes better to downstream tasks. To demonstrate this, we evaluate MaCLR on five datasets for both action recognition and action detection, and demonstrate state-of-the-art self-supervised performance on all datasets. Furthermore, we show that the MaCLR representation can be as effective as representations learned with full supervision on UCF101 and HMDB51 action recognition, and can even outperform the supervised representation for action recognition on VidSitu and SSv2, and action detection on AVA.

Submitted 20 July, 2022; v1 submitted 17 June, 2021; originally announced June 2021.

Comments: ECCV 2022

arXiv:2105.14158 [pdf, other]

SSCAP: Self-supervised Co-occurrence Action Parsing for Unsupervised Temporal Action Segmentation

Authors: Zhe Wang, Hao Chen, Xinyu Li, Chunhui Liu, Yuanjun Xiong, Joseph Tighe, Charless Fowlkes

Abstract: Temporal action segmentation is a task to classify each frame in the video with an action label. However, it is quite expensive to annotate every frame in a large corpus of videos to construct a comprehensive supervised training dataset. Thus in this work we propose an unsupervised method, namely SSCAP, that operates on a corpus of unlabeled videos and predicts a likely set of temporal segments across the videos. SSCAP leverages Self-Supervised learning to extract distinguishable features and then applies a novel Co-occurrence Action Parsing algorithm to not only capture the correlation among sub-actions underlying the structure of activities, but also estimate the temporal path of the sub-actions in an accurate and general way. We evaluate on both classic datasets (Breakfast, 50Salads) and the emerging fine-grained action dataset (FineGym) with more complex activity structures and similar sub-actions. Results show that SSCAP achieves state-of-the-art performance on all datasets and can even outperform some weakly-supervised approaches, demonstrating its effectiveness and generalizability.

Submitted 25 October, 2021; v1 submitted 28 May, 2021; originally announced May 2021.

Comments: WACV 2022 camera ready

arXiv:2105.11595 [pdf, other]

SiamMOT: Siamese Multi-Object Tracking

Authors: Bing Shuai, Andrew Berneshawi, Xinyu Li, Davide Modolo, Joseph Tighe

Abstract: In this paper, we focus on improving online multi-object tracking (MOT). In particular, we introduce a region-based Siamese Multi-Object Tracking network, which we name SiamMOT. SiamMOT includes a motion model that estimates the instance's movement between two frames such that detected instances are associated. To explore how the motion modelling affects its tracking capability, we present two variants of the Siamese tracker, one that implicitly models motion and one that models it explicitly. We carry out extensive quantitative experiments on three different MOT datasets: MOT17, TAO-person and Caltech Roadside Pedestrians, showing the importance of motion modelling for MOT and the ability of SiamMOT to substantially outperform the state-of-the-art. Finally, SiamMOT also outperforms the winners of the ACM MM'20 HiEve Grand Challenge on the HiEve dataset. Moreover, SiamMOT is efficient, and it runs at 17 FPS for 720P videos on a single modern GPU. Code is available at https://github.com/amazon-research/siam-mot.

Submitted 24 May, 2021; originally announced May 2021.

Journal ref: CVPR2021

arXiv:2104.11746 [pdf, other]

VidTr: Video Transformer Without Convolutions

Authors: Yanyi Zhang, Xinyu Li, Chunhui Liu, Bing Shuai, Yi Zhu, Biagio Brattoli, Hao Chen, Ivan Marsic, Joseph Tighe

Abstract: We introduce Video Transformer (VidTr) with separable-attention for video classification. Compared with commonly used 3D networks, VidTr is able to aggregate spatio-temporal information via stacked attentions and provide better performance with higher efficiency. We first introduce the vanilla video transformer and show that the transformer module is able to perform spatio-temporal modeling from raw pixels, but with heavy memory usage. We then present VidTr which reduces the memory cost by 3.3$\times$ while keeping the same performance. To further optimize the model, we propose the standard deviation based topK pooling for attention ($pool_{topK\_std}$), which reduces the computation by dropping non-informative features along the temporal dimension. VidTr achieves state-of-the-art performance on five commonly used datasets with lower computational requirements, showing both the efficiency and effectiveness of our design. Finally, error analysis and visualization show that VidTr is especially good at predicting actions that require long-term temporal reasoning.

Submitted 15 October, 2021; v1 submitted 23 April, 2021; originally announced April 2021.

Comments: ICCV 2021 Accepted

arXiv:2104.00969 [pdf, other]

TubeR: Tubelet Transformer for Video Action Detection

Authors: Jiaojiao Zhao, Yanyi Zhang, Xinyu Li, Hao Chen, Shuai Bing, Mingze Xu, Chunhui Liu, Kaustav Kundu, Yuanjun Xiong, Davide Modolo, Ivan Marsic, Cees G. M. Snoek, Joseph Tighe

Abstract: We propose TubeR: a simple solution for spatio-temporal video action detection. Different from existing methods that depend on either an off-line actor detector or hand-designed actor-positional hypotheses like proposals or anchors, we propose to directly detect an action tubelet in a video by simultaneously performing action localization and recognition from a single representation. TubeR learns a set of tubelet-queries and utilizes a tubelet-attention module to model the dynamic spatio-temporal nature of a video clip, which effectively reinforces the model capacity compared to using actor-positional hypotheses in the spatio-temporal space. For videos containing transitional states or scene changes, we propose a context aware classification head to utilize short-term and long-term context to strengthen action classification, and an action switch regression head for detecting the precise temporal action extent. TubeR directly produces action tubelets with variable lengths and even maintains good results for long video clips. TubeR outperforms the previous state-of-the-art on commonly used action detection datasets AVA, UCF101-24 and JHMDB51-21.

Submitted 10 May, 2022; v1 submitted 2 April, 2021; originally announced April 2021.

Comments: Accepted at CVPR 2022 (Oral)

arXiv:2104.00179 [pdf, other]

Selective Feature Compression for Efficient Activity Recognition Inference

Authors: Chunhui Liu, Xinyu Li, Hao Chen, Davide Modolo, Joseph Tighe

Abstract: Most action recognition solutions rely on dense sampling to precisely cover the informative temporal clip. Extensively searching the temporal region is expensive for a real-world application. In this work, we focus on improving the inference efficiency of current action recognition backbones on trimmed videos, and illustrate that one action model can also cover the informative region by dropping non-informative features. We present Selective Feature Compression (SFC), an action recognition inference strategy that greatly increases model inference efficiency without any accuracy compromise. Unlike previous works that compress kernel sizes and decrease the channel dimension, we propose to compress the feature flow along the spatio-temporal dimension without changing any backbone parameters. Our experiments on Kinetics-400, UCF101 and ActivityNet show that SFC is able to speed up inference by 6-7x and reduce memory usage by 5-6x compared with the commonly used 30-crop dense sampling procedure, while also slightly improving Top-1 accuracy. We thoroughly evaluate SFC and all its components, both quantitatively and qualitatively, and show how SFC learns to attend to important video regions and to drop temporal features that are uninformative for the task of action recognition.

Submitted 29 July, 2021; v1 submitted 31 March, 2021; originally announced April 2021.

Comments: Accepted by ICCV 2021

arXiv:2012.08041 [pdf, other]

NUTA: Non-uniform Temporal Aggregation for Action Recognition

Authors: Xinyu Li, Chunhui Liu, Bing Shuai, Yi Zhu, Hao Chen, Joseph Tighe

Abstract: In the world of action recognition research, one primary focus has been on how to construct and train networks to model the spatial-temporal volume of an input video. These methods typically uniformly sample a segment of an input clip (along the temporal dimension). However, not all parts of a video are equally important to determine the action in the clip. In this work, we focus instead on learning where to extract features, so as to focus on the most informative parts of the video. We propose a method called the non-uniform temporal aggregation (NUTA), which aggregates features only from informative temporal segments. We also introduce a synchronization method that allows our NUTA features to be temporally aligned with traditional uniformly sampled video features, so that both local and clip-level features can be combined. Our model has achieved state-of-the-art performance on four widely used large-scale action-recognition datasets (Kinetics400, Kinetics700, Something-something V2 and Charades). In addition, we have created a visualization to illustrate how the proposed NUTA method selects only the most relevant parts of a video clip.

Submitted 14 December, 2020; originally announced December 2020.

arXiv:2012.06567 [pdf, other]

A Comprehensive Study of Deep Video Action Recognition

Authors: Yi Zhu, Xinyu Li, Chunhui Liu, Mohammadreza Zolfaghari, Yuanjun Xiong, Chongruo Wu, Zhi Zhang, Joseph Tighe, R. Manmatha, Mu Li

Abstract: Video action recognition is one of the representative tasks for video understanding. Over the last decade, we have witnessed great advancements in video action recognition thanks to the emergence of deep learning. But we also encountered new challenges, including modeling long-range temporal information in videos, high computation costs, and incomparable results due to datasets and evaluation protocol variances. In this paper, we provide a comprehensive survey of over 200 existing papers on deep learning for video action recognition. We first introduce the 17 video action recognition datasets that influenced the design of models. Then we present video action recognition models in chronological order: starting with early attempts at adapting deep learning, then to the two-stream networks, followed by the adoption of 3D convolutional kernels, and finally to the recent compute-efficient models. In addition, we benchmark popular methods on several representative datasets and release code for reproducibility. In the end, we discuss open problems and shed light on opportunities for video action recognition to facilitate new research ideas.

Submitted 11 December, 2020; originally announced December 2020.

Comments: Technical report. Code and model zoo can be found at https://cv.gluon.ai/model_zoo/action_recognition.html

arXiv:2007.11040 [pdf, other]

Directional Temporal Modeling for Action Recognition

Authors: Xinyu Li, Bing Shuai, Joseph Tighe

Abstract: Many current activity recognition models use 3D convolutional neural networks (e.g. I3D, I3D-NL) to generate local spatial-temporal features. However, such features do not encode clip-level ordered temporal information. In this paper, we introduce a channel independent directional convolution (CIDC) operation, which learns to model the temporal evolution among local features. By applying multiple CIDC units we construct a light-weight network that models the clip-level temporal evolution across multiple spatial scales. Our CIDC network can be attached to any activity recognition backbone network. We evaluate our method on four popular activity recognition datasets and consistently improve upon state-of-the-art techniques. We further visualize the activation map of our CIDC network and show that it is able to focus on more meaningful, action related parts of the frame.

Submitted 21 July, 2020; originally announced July 2020.

Comments: ECCV 2020

arXiv:2004.14952 [pdf]

Pain and Physical Activity Association in Critically Ill Patients

Authors: Anis Davoudi, Tezcan Ozrazgat-Baslanti, Patrick J. Tighe, Azra Bihorac, Parisa Rashidi

Abstract: Critical care patients experience varying levels of pain during their stay in the intensive care unit, often requiring administration of analgesics and sedation. Such medications generally exacerbate the already sedentary physical activity profiles of critical care patients, contributing to delayed recovery. Thus, it is important not only to minimize pain levels, but also to optimize analgesic strategies in order to maximize mobility and activity of ICU patients. Currently, we lack an understanding of the relation between pain and physical activity on a granular level. In this study, we examined the relationship between nurse assessed pain scores and physical activity as measured using a wearable accelerometer device. We found that average, standard deviation, and maximum physical activity counts are significantly higher before high pain reports compared to before low pain reports during both daytime and nighttime, while percentage of time spent immobile was not significantly different between the two pain report groups. Clusters detected among patients using extracted physical activity features were significant in adjusted logistic regression analysis for prediction of pain report group.

Submitted 21 April, 2020; originally announced April 2020.

Comments: 4 pages, 3 figures, 6 tables. Accepted for presentation at IEEE EMBC 2020

arXiv:2004.09134 [pdf, other]

doi 10.1109/EMBC44109.2020.9176453

Joint Distribution and Transitions of Pain and Activity in Critically Ill Patients

Authors: Florenc Demrozi, Graziano Pravadelli, Patrick J Tighe, Azra Bihorac, Parisa Rashidi

Abstract: Pain and physical function are both essential indices of recovery in critically ill patients in the Intensive Care Unit (ICU). Simultaneous monitoring of pain intensity and patient activity can be important for determining which analgesic interventions can optimize mobility and function, while minimizing opioid harm. Nonetheless, so far, our knowledge of the relation between pain and activity has been limited to manual and sporadic activity assessments. In recent years, wearable devices equipped with 3-axis accelerometers have been used in many domains to provide a continuous and automated measure of mobility and physical activity. In this study, we collected activity intensity data from 57 ICU patients, using the Actigraph GT3X device. We also collected relevant clinical information, including nurse assessments of pain intensity, recorded every 1-4 hours. Our results show the joint distribution and state transition of joint activity and pain states in critically ill patients.

Submitted 20 April, 2020; originally announced April 2020.

Comments: Accepted for Publication in EMBC 2020

arXiv:2004.07786 [pdf, other]

Multi-Object Tracking with Siamese Track-RCNN

Authors: Bing Shuai, Andrew G. Berneshawi, Davide Modolo, Joseph Tighe

Abstract: Multi-object tracking systems often consist of a combination of a detector, a short term linker, a re-identification feature extractor and a solver that takes the output from these separate components and makes a final prediction. In contrast, this work aims to unify all these in a single tracking system. Towards this, we propose Siamese Track-RCNN, a two stage detect-and-track framework which consists of three functional branches: (1) the detection branch localizes object instances; (2) the Siamese-based track branch estimates the object motion and (3) the object re-identification branch re-activates the previously terminated tracks when they re-emerge. We test our tracking system on two popular datasets of the MOTChallenge. Siamese Track-RCNN achieves significantly higher results than the state-of-the-art, while also being much more efficient, thanks to its unified design.

Submitted 16 April, 2020; originally announced April 2020.

arXiv:2003.13759 [pdf, other]

Understanding the impact of mistakes on background regions in crowd counting

Authors: Davide Modolo, Bing Shuai, Rahul Rama Varior, Joseph Tighe

Abstract: Every crowd counting researcher has likely observed their model output wrong positive predictions on image regions not containing any person. But how often do these mistakes happen? Are our models negatively affected by this? In this paper we analyze this problem in depth. In order to understand its magnitude, we present an extensive analysis on five of the most important crowd counting datasets. We present this analysis in two parts. First, we quantify the number of mistakes made by popular crowd counting approaches. Our results show that (i) mistakes on background are substantial and they are responsible for 18-49% of the total error, (ii) models do not generalize well to different kinds of backgrounds and perform poorly on completely background images, and (iii) models make many more mistakes than those captured by the standard Mean Absolute Error (MAE) metric, as counting on background compensates considerably for misses on foreground. And second, we quantify the performance change gained by helping the model better deal with this problem. We enrich a typical crowd counting network with a segmentation branch trained to suppress background predictions. This simple addition (i) reduces background error by 10-83%, (ii) reduces foreground error by up to 26% and (iii) improves overall crowd counting performance up to 20%. When compared against the literature, this simple technique achieves very competitive results on all datasets, on par with the state-of-the-art, showing the importance of tackling the background problem.

Submitted 30 March, 2020; originally announced March 2020.
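
Note on point (iii) of the abstract above: the cancellation effect can be illustrated with a small arithmetic sketch. The per-image counts below are invented for illustration and are not taken from the paper; the function names `count_mae` and `region_error` are hypothetical, and `region_error` is only one plausible way to score foreground and background mistakes separately, not necessarily the metric the authors use.

```python
# Hypothetical per-image counts (illustrative only, not from the paper).
# "true" is the ground-truth head count; the prediction is split into the part
# that falls on foreground (people) and the part that falls on background.
images = [
    {"true": 100, "pred_fg": 90, "pred_bg": 10},
    {"true": 50, "pred_fg": 42, "pred_bg": 6},
]


def count_mae(images):
    """Standard MAE on total counts: background counts can cancel foreground misses."""
    errors = [abs(im["pred_fg"] + im["pred_bg"] - im["true"]) for im in images]
    return sum(errors) / len(errors)


def region_error(images):
    """Assumed region-aware error: foreground misses and background false
    positives are added instead of being allowed to cancel each other."""
    errors = [abs(im["pred_fg"] - im["true"]) + im["pred_bg"] for im in images]
    return sum(errors) / len(errors)


print(count_mae(images))     # error visible to the standard metric
print(region_error(images))  # error once cancellation is disallowed
```

In this toy case the first image looks perfect under MAE even though it misses 10 people and hallucinates 10 on the background, which is the kind of hidden mistake the abstract describes.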

arXiv:2003.13743 [pdf, other]

Combining detection and tracking for human pose estimation in videos

Authors: Manchen Wang, Joseph Tighe, Davide Modolo

Abstract: We propose a novel top-down approach that tackles the problem of multi-person human pose estimation and tracking in videos. In contrast to existing top-down approaches, our method is not limited by the performance of its person detector and can predict the poses of person instances not localized. It achieves this capability by propagating known person locations forward and backward in time and searching for poses in those regions. Our approach consists of three components: (i) a Clip Tracking Network that performs body joint detection and tracking simultaneously on small video clips; (ii) a Video Tracking Pipeline that merges the fixed-length tracklets produced by the Clip Tracking Network to arbitrary length tracks; and (iii) a Spatial-Temporal Merging procedure that refines the joint locations based on spatial and temporal smoothing terms. Thanks to the precision of our Clip Tracking Network and our merging procedure, our approach produces very accurate joint predictions and can fix common mistakes in hard scenarios like heavily entangled people. Our approach achieves state-of-the-art results on both joint detection and tracking, on both the PoseTrack 2017 and 2018 datasets, and against all top-down and bottom-up approaches.

Submitted 30 March, 2020; originally announced March 2020.

Comments: Accepted to CVPR 2020 as oral

arXiv:2003.01455 [pdf, other]

Rethinking Zero-shot Video Classification: End-to-end Training for Realistic Applications

Authors: Biagio Brattoli, Joseph Tighe, Fedor Zhdanov, Pietro Perona, Krzysztof Chalupka

Abstract: Trained on large datasets, deep learning (DL) can accurately classify videos into hundreds of diverse classes. However, video data is expensive to annotate. Zero-shot learning (ZSL) proposes one solution to this problem. ZSL trains a model once, and generalizes to new tasks whose classes are not present in the training dataset. We propose the first end-to-end algorithm for ZSL in video classification. Our training procedure builds on insights from recent video classification literature and uses a trainable 3D CNN to learn the visual features. This is in contrast to previous video ZSL methods, which use pretrained feature extractors. We also extend the current benchmarking paradigm: Previous techniques aim to make the test task unknown at training time but fall short of this goal. We encourage domain shift across training and test data and disallow tailoring a ZSL model to a specific test dataset. We outperform the state-of-the-art by a wide margin. Our code, evaluation procedure and model weights are available at github.com/bbrattoli/ZeroShotVideoClassification.

Submitted 20 June, 2020; v1 submitted 3 March, 2020; originally announced March 2020.

Comments: Accepted for publication at CVPR 2020

arXiv:1908.07625 [pdf, other]

Action recognition with spatial-temporal discriminative filter banks

Authors: Brais Martinez, Davide Modolo, Yuanjun Xiong, Joseph Tighe

Abstract: Action recognition has seen a dramatic performance improvement in the last few years. Most of the current state-of-the-art literature either aims at improving performance through changes to the backbone CNN network, or they explore different trade-offs between computational efficiency and performance, again through altering the backbone network. However, almost all of these works maintain the same last layers of the network, which simply consist of a global average pooling followed by a fully connected layer. In this work we focus on how to improve the representation capacity of the network, but rather than altering the backbone, we focus on improving the last layers of the network, where changes have low impact in terms of computational cost. In particular, we show that current architectures have poor sensitivity to finer details and we exploit recent advances in the fine-grained recognition literature to improve our model in this aspect. With the proposed approach, we obtain state-of-the-art performance on Kinetics-400 and Something-Something-V1, the two major large-scale action recognition benchmarks.

Submitted 20 August, 2019; originally announced August 2019.

Comments: ICCV 2019 Accepted Paper

arXiv:1901.06026 [pdf, other]

Multi-Scale Attention Network for Crowd Counting

Authors: Rahul Rama Varior, Bing Shuai, Joseph Tighe, Davide Modolo

Abstract: In crowd counting datasets, people appear at different scales, depending on their distance from the camera. To address this issue, we propose a novel multi-branch scale-aware attention network that exploits the hierarchical structure of convolutional neural networks and generates, in a single forward pass, multi-scale density predictions from different layers of the architecture. To aggregate these maps into our final prediction, we present a new soft attention mechanism that learns a set of gating masks. Furthermore, we introduce a scale-aware loss function to regularize the training of different branches and guide them to specialize on a particular scale. As this new training requires annotations for the size of each head, we also propose a simple, yet effective technique to estimate them automatically. Finally, we present an ablation study on each of these components and compare our approach against the literature on 4 crowd counting datasets: UCF-QNRF, ShanghaiTech A & B and UCF_CC_50. Our approach achieves state-of-the-art on all of them with a remarkable improvement on UCF-QNRF (+25% reduction in error).

Submitted 25 July, 2019; v1 submitted 17 January, 2019; originally announced January 2019.

arXiv:1812.07129 [pdf]

Does the Position of Surgical Service Providers in Intra-Operative Networks Matter? Analyzing the Impact of Influencing Factors on Patients' Outcome

Authors: Ashkan Ebadi, Patrick J. Tighe, Lei Zhang, Parisa Rashidi

Abstract: We analyzed the relation between surgical service providers' network structure and surgical team size with patient outcome during the operation. We did correlation analysis to evaluate the associations among the network structure measures in the intra-operative networks of surgical service providers. We focused on intra-operative networks of surgical service providers, in a quaternary-care academic medical center, using retrospective Electronic Medical Record (EMR) data. We used de-identified intra-operative data for adult patients who received nonambulatory/nonobstetric surgery in a main operating room at Shands at the University of Florida between June 1, 2011 and November 1, 2014. The intra-operative dataset contained 30,211 unique surgical cases. To perform the analysis, we created the networks of surgical service providers and calculated several network structure measures at both team and individual levels. We considered number of patients' complications as the target variable and assessed its interrelations with the calculated network measures along with other influencing factors (e.g. surgical team size, type of surgery). Our results confirm the significant role of interactions among surgical providers on patient outcome. In addition, we observed that highly central providers at the global network level are more likely to be associated with a lower number of surgical complications, while locally important providers might be associated with a higher number of complications. We also found a positive relation between age of patients and number of complications.

Submitted 17 December, 2018; originally announced December 2018.

Comments: 17 pages, 3 Figures, 5 Tables PrePrint

arXiv:1804.10201 [pdf, other]

The Intelligent ICU Pilot Study: Using Artificial Intelligence Technology for Autonomous Patient Monitoring

Authors: Anis Davoudi, Kumar Rohit Malhotra, Benjamin Shickel, Scott Siegel, Seth Williams, Matthew Ruppert, Emel Bihorac, Tezcan Ozrazgat-Baslanti, Patrick J. Tighe, Azra Bihorac, Parisa Rashidi

Abstract: Currently, many critical care indices are repetitively assessed and recorded by overburdened nurses, e.g. physical function or facial pain expressions of nonverbal patients. In addition, much essential information on patients and their environment is not captured at all, or is captured in a non-granular manner, e.g. sleep disturbance factors such as bright light, loud background noise, or excessive visitations. In this pilot study, we examined the feasibility of using pervasive sensing technology and artificial intelligence for autonomous and granular monitoring of critically ill patients and their environment in the Intensive Care Unit (ICU). As an exemplar prevalent condition, we also characterized delirious and non-delirious patients and their environment. We used wearable sensors, light and sound sensors, and a high-resolution camera to collect data on patients and their environment. We analyzed collected data using deep learning and statistical analysis. Our system performed face detection, face recognition, facial action unit detection, head pose detection, facial expression recognition, posture recognition, actigraphy analysis, sound pressure and light level detection, and visitation frequency detection. We were able to detect patient's face (Mean average precision (mAP)=0.94), recognize patient's face (mAP=0.80), and their postures (F1=0.94). We also found that all facial expressions, 11 activity features, visitation frequency during the day, visitation frequency during the night, light levels, and sound pressure levels during the night were significantly different between delirious and non-delirious patients (p-value<0.05). In summary, we showed that granular and autonomous monitoring of critically ill patients and their environment is feasible and can be used for characterizing critical care conditions and related environment factors.

Submitted 26 September, 2018; v1 submitted 25 April, 2018; originally announced April 2018.

arXiv:1803.03359 [pdf]

A Quest for the Structure of Intra- and Postoperative Surgical Team Networks: Does the Small World Property Evolve over Time?

Authors: Ashkan Ebadi, Patrick J. Tighe, Lei Zheng, Parisa Rashidi

Abstract: We examined the structure of intra- and postoperative case-collaboration networks among the surgical service providers in a quaternary-care academic medical center, using retrospective electronic medical record (EMR) data. We also analyzed the evolution of the network properties over time, as changes in nodes and edges can affect the network structure. We used de-identified intra- and postoperative data for adult patients, ages >= 21, who received nonambulatory/nonobstetric surgery at Shands at the University of Florida between June 1, 2011 and November 1, 2014. The intraoperative segment contained 30,245 surgical cases, and the postoperative segment considered 30,202 hospitalizations. Our results confirmed the existence of strict small world structure in both intra- and postoperative surgical team networks. A sudden declining trend is expected in the future in both intra- and postoperative networks, since the small world property is currently at its peak. In addition, high network density was observed in the intraoperative segment and partially in postoperative one, representing the existence of cohesive clusters of providers. We also observed that the small world property is exhibited more in the intraoperative compared to the postoperative network. Analyzing the temporal aspects of the networks revealed the postoperative segment tends to lose its cohesiveness as the time passes. Our results highlight the importance of stability of personnel in key positions. This highlights the important role of the central players in the network that offers change-leaders the opportunity to quantify and target those nodes as mediators of process change.

Submitted 8 March, 2018; originally announced March 2018.

arXiv:hep-th/0108051 [pdf, ps, other]

Derivative Expansions of the Exact Renormalisation Group and SU(N|N) Gauge Theory

Authors: John F. Tighe

Abstract: We investigate the convergence of the derivative expansion of the exact renormalisation group, by using it to compute the beta function of scalar theory. We demonstrate that the derivative expansion of the Polchinski flow equation converges at one loop for certain fast falling smooth cutoffs. The derivative expansion of the Legendre flow equation trivially converges at one loop, but also at two loops: slowly with sharp cutoff (as a momentum-scale expansion), and rapidly in the case of a smooth exponential cutoff. We also show that the two loop contributions to certain higher derivative operators (not involved in beta) have divergent momentum-scale expansions for sharp cutoff, but the smooth exponential cutoff gives convergent derivative expansions for all such operators with any number of derivatives. In the latter part of the thesis, we address the problems of applying the exact renormalisation group to gauge theories. A regularisation scheme utilising higher covariant derivatives and the spontaneous symmetry breaking of the gauge supergroup SU(N|N) is introduced and it is demonstrated to be finite to all orders of perturbation theory.

Submitted 8 August, 2001; originally announced August 2001.

Comments: Thesis, LaTeX, 128 pages, 14 eps figures

arXiv:hep-th/0106258 [pdf, ps, other]

doi 10.1142/S0217751X02009722

Gauge invariant regularisation via SU(N|N)

Authors: Stefano Arnone, Yuri A. Kubyshin, Tim R. Morris, John F. Tighe

Abstract: We construct a gauge invariant regularisation scheme for pure SU(N) Yang-Mills theory in fixed dimension four or less (for N = infinity in all dimensions), with a physical cutoff scale Lambda, by using covariant higher derivatives and spontaneously broken SU(N|N) supergauge invariance. Providing their powers are within certain ranges, the covariant higher derivatives cure the superficial divergence of all but a set of one-loop graphs. The finiteness of these latter graphs is ensured by properties of the supergroup and gauge invariance. In the limit Lambda tends to infinity, all the regulator fields decouple and unitarity is recovered in the renormalized pure SU(N) Yang-Mills theory. By demonstrating these properties, we prove that the regularisation works to all orders in perturbation theory.

Submitted 25 November, 2001; v1 submitted 27 June, 2001; originally announced June 2001.

Comments: Latex, 43 pages, extended to explain ERG context, preregularisation and why it is unnecessary in less than 4 dimensions or at infinite N, and explain issues in early attempts at gauge invariant Pauli Villars regularisation

Report number: SHEP 01-15

Journal ref: Int.J.Mod.Phys. A17 (2002) 2283-2330

arXiv:hep-th/0102054 [pdf, ps, other]

doi 10.1142/S0217751X0100461X

A gauge invariant regulator for the ERG

Authors: S. Arnone, Yu. A. Kubyshin, T. R. Morris, J. F. Tighe

Abstract: A gauge invariant regularisation for dealing with pure Yang-Mills theories within the exact renormalization group approach is proposed. It is based on the regularisation via covariant higher derivatives and includes auxiliary Pauli-Villars fields which amounts to a spontaneously broken SU(N|N) super-gauge theory. We demonstrate perturbatively that the extended theory is ultra-violet finite in four dimensions and argue that it has a sensible limit when the regularization cutoff is removed.

Submitted 9 February, 2001; originally announced February 2001.

Comments: 13 pages, 2 figures, uses ws-p9-75x6-50.cls, talk presented at the 2nd Conference on the Exact RG, Rome 2000

Report number: SHEP 01-07

Journal ref: Int.J.Mod.Phys. A16 (2001) 1989

arXiv:hep-th/0102027 [pdf, ps, other]

doi 10.1142/S0217751X01004761

Convergence of derivative expansions in scalar field theory

Authors: Tim R. Morris, John F. Tighe

Abstract: The convergence of the derivative expansion of the exact renormalisation group is investigated via the computation of the beta function of massless scalar lambda phi^4 theory. The derivative expansion of the Polchinski flow equation converges at one loop for certain fast falling smooth cutoffs. Convergence of the derivative expansion of the Legendre flow equation is trivial at one loop, but also can occur at two loops and in particular converges for an exponential cutoff.

Submitted 6 February, 2001; originally announced February 2001.

Comments: 6 pages, 3 figures, presented at the 2nd Conference on the Exact RG, Rome 2000

Report number: SHEP 01-06

Journal ref: Int.J.Mod.Phys. A16 (2001) 2095-2100

arXiv:hep-th/0102011 [pdf, ps, other]

Gauge invariant regularisation in the ERG approach

Authors: S. Arnone, Yu. A. Kubyshin, T. R. Morris, J. F. Tighe

Abstract: A gauge invariant regularisation which can be used for non-perturbative treatment of Yang-Mills theories within the exact renormalization group approach is constructed. It consists of a spontaneously broken SU(N|N) super-gauge extension of the initial Yang-Mills action supplied with covariant higher derivatives. We demonstrate that the extended theory in four dimensions is ultra-violet finite perturbatively and argue that it has a sensible limit when the regularisation cutoff is removed.

Submitted 2 February, 2001; originally announced February 2001.

Comments: Talk given at 15th International Workshop on High-Energy Physics and Quantum Field Theory (QFTHEP 2000), Tver, Russia, 14-20 Sep 2000

Report number: SHEP 01-04

arXiv:hep-th/9906166 [pdf, ps, other]

doi 10.1088/1126-6708/1999/08/007

Convergence of derivative expansions of the renormalization group

Authors: Tim R. Morris, John F. Tighe

Abstract: We investigate the convergence of the derivative expansion of the exact renormalization group, by using it to compute the beta function of scalar field theory. We show that the derivative expansion of the Polchinski flow equation converges at one loop for certain fast falling smooth cutoffs. The derivative expansion of the Legendre flow equation trivially converges at one loop, but also at two loops: slowly with sharp cutoff (as a momentum-scale expansion), and rapidly in the case of a smooth exponential cutoff. Finally, we show that the two loop contributions to certain higher derivative operators (not involved in beta) have divergent momentum-scale expansions for sharp cutoff, but the smooth exponential cutoff gives convergent derivative expansions for all such operators with any number of derivatives.

Submitted 21 June, 1999; originally announced June 1999.

Comments: Latex inc axodraw. 20 pages

Report number: SHEP 99-06

Journal ref: JHEP 9908:007,1999

Showing 1–44 of 44 results for author: Tighe, J