GRIN: GRadient-INformed MoE

Authors: Liyuan Liu, Young Jin Kim, Shuohang Wang, Chen Liang, Yelong Shen, Hao Cheng, Xiaodong Liu, Masahiro Tanaka, Xiaoxia Wu, Wenxiang Hu, Vishrav Chaudhary, Zeqi Lin, Chenruidong Zhang, Jilong Xue, Hany Awadalla, Jianfeng Gao, Weizhu Chen

Abstract: Mixture-of-Experts (MoE) models scale more effectively than dense models due to sparse computation through expert routing, selectively activating only a small subset of expert modules. However, sparse computation challenges traditional training practices, as discrete expert routing hinders standard backpropagation and thus gradient-based optimization, which are the cornerstone of deep learning. To better pursue the scaling power of MoE, we introduce GRIN (GRadient-INformed MoE training), which incorporates sparse gradient estimation for expert routing and configures model parallelism to avoid token dropping. Applying GRIN to autoregressive language modeling, we develop a top-2 16×3.8B MoE model. Our model, with only 6.6B activated parameters, outperforms a 7B dense model and matches the performance of a 14B dense model trained on the same data. Extensive evaluations across diverse tasks demonstrate the potential of GRIN to significantly enhance MoE efficacy, achieving 79.4 on MMLU, 83.7 on HellaSwag, 74.4 on HumanEval, and 58.9 on MATH.

Submitted 18 September, 2024; originally announced September 2024.

Comments: 58 pages

arXiv:2409.07847 [pdf, other]

C3-VQA: Cryogenic Counter-based Co-processor for Variational Quantum Algorithms

Authors: Yosuke Ueno, Satoshi Imamura, Yuna Tomida, Teruo Tanimoto, Masamitsu Tanaka, Yutaka Tabuchi, Koji Inoue, Hiroshi Nakamura

Abstract: Cryogenic quantum computers play a leading role in demonstrating quantum advantage. Given the severe constraints on the cooling capacity in cryogenic environments, thermal design is crucial for the scalability of these computers. The sources of heat dissipation include passive inflow via inter-temperature wires and the power consumption of components located in the cryostat, such as wire amplifiers and quantum-classical interfaces. Thus, a critical challenge is to reduce the number of wires by reducing the required inter-temperature bandwidth while maintaining minimal additional power consumption in the cryostat. One solution to address this challenge is near-data processing using ultra-low-power computational logic within the cryostat. Based on the workload analysis and domain-specific system design focused on Variational Quantum Algorithms (VQAs), we propose the Cryogenic Counter-based Co-processor for VQAs (C3-VQA) to enhance the design scalability of cryogenic quantum computers under the thermal constraint. The C3-VQA utilizes single-flux-quantum logic, which is an ultra-low-power superconducting digital circuit that operates in the 4 K environment. The C3-VQA precomputes a part of the expectation value calculations for VQAs and buffers intermediate values using simple bit operation units and counters in the cryostat, thereby reducing the required inter-temperature bandwidth with small additional power consumption. Consequently, the C3-VQA reduces the number of wires, leading to a reduction in the total heat dissipation in the cryostat. Our evaluation shows that the C3-VQA reduces the total heat dissipation at the 4 K stage by 30% and 81% under sequential-shot and parallel-shot execution scenarios, respectively. Furthermore, a case study in quantum chemistry shows that the C3-VQA reduces total heat dissipation by 87% with a 10,000-qubit system.

Submitted 12 September, 2024; originally announced September 2024.

Comments: 15 pages, 9 figures, 5 tables. This is an extension of arXiv:2403.00363 and arXiv:2310.01630

arXiv:2408.16978 [pdf, other]

Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer

Authors: Jinghan Yao, Sam Ade Jacobs, Masahiro Tanaka, Olatunji Ruwase, Aamir Shafi, Hari Subramoni, Dhabaleswar K. Panda

Abstract: Large Language Models (LLMs) with long context capabilities are integral to complex tasks in natural language processing and computational biology, such as text generation and protein sequence analysis. However, training LLMs directly on extremely long contexts demands considerable GPU resources and increased memory, leading to higher costs and greater complexity. Alternative approaches that introduce long context capabilities via downstream finetuning or adaptations impose significant design limitations. In this paper, we propose Fully Pipelined Distributed Transformer (FPDT) for efficiently training long-context LLMs with extreme hardware efficiency. For GPT and Llama models, we achieve a 16x increase in sequence length that can be trained on the same hardware compared to current state-of-the-art solutions. With our dedicated sequence chunk pipeline design, we can now train an 8B LLM with a sequence length of 2 million on only 4 GPUs, while also maintaining over 55% of MFU. Our proposed FPDT is agnostic to existing training techniques and is proven to work efficiently across different LLM models.

Submitted 29 August, 2024; originally announced August 2024.

arXiv:2407.15408 [pdf, other]

Chronologically Accurate Retrieval for Temporal Grounding of Motion-Language Models

Authors: Kent Fujiwara, Mikihiro Tanaka, Qing Yu

Abstract: With the release of large-scale motion datasets with textual annotations, the task of establishing a robust latent space for language and 3D human motion has recently witnessed a surge of interest. Methods have been proposed to convert human motion and texts into features to achieve accurate correspondence between them. Despite these efforts to align language and motion representations, we claim that the temporal element is often overlooked, especially for compound actions, resulting in chronological inaccuracies. To shed light on the temporal alignment in motion-language latent spaces, we propose Chronologically Accurate Retrieval (CAR) to evaluate the chronological understanding of the models. We decompose textual descriptions into events, and prepare negative text samples by shuffling the order of events in compound action descriptions. We then design a simple task for motion-language models to retrieve the more likely text from the ground truth and its chronologically shuffled version. CAR reveals many cases where current motion-language models fail to distinguish the event chronology of human motion, despite their impressive performance in terms of conventional evaluation metrics. To achieve better temporal alignment between text and motion, we further propose to use these texts with shuffled sequence of events as negative samples during training to reinforce the motion-language models. We conduct experiments on text-motion retrieval and text-to-motion generation using the reinforced motion-language models, which demonstrate improved performance over conventional approaches, indicating the necessity to consider temporal elements in motion-language alignment.

Submitted 22 July, 2024; originally announced July 2024.

Comments: To appear at ECCV 2024. Project page: https://kfworks.com/CAR-WP/

arXiv:2407.03963 [pdf, other]

LLM-jp: A Cross-organizational Project for the Research and Development of Fully Open Japanese LLMs

Authors: LLM-jp: Akiko Aizawa, Eiji Aramaki, Bowen Chen, Fei Cheng, Hiroyuki Deguchi, Rintaro Enomoto, Kazuki Fujii, Kensuke Fukumoto, Takuya Fukushima, Namgi Han, Yuto Harada, Chikara Hashimoto, Tatsuya Hiraoka, Shohei Hisada, Sosuke Hosokawa, Lu Jie, Keisuke Kamata, Teruhito Kanazawa, Hiroki Kanezashi, Hiroshi Kataoka, Satoru Katsumata, Daisuke Kawahara, Seiya Kawano, et al. (57 additional authors not shown)

Abstract: This paper introduces LLM-jp, a cross-organizational project for the research and development of Japanese large language models (LLMs). LLM-jp aims to develop open-source and strong Japanese LLMs, and as of this writing, more than 1,500 participants from academia and industry are working together for this purpose. This paper presents the background of the establishment of LLM-jp, summaries of its activities, and technical reports on the LLMs developed by LLM-jp. For the latest activities, visit https://llm-jp.nii.ac.jp/en/.

Submitted 4 July, 2024; originally announced July 2024.

arXiv:2406.18820 [pdf, other]

Universal Checkpointing: Efficient and Flexible Checkpointing for Large Scale Distributed Training

Authors: Xinyu Lian, Sam Ade Jacobs, Lev Kurilenko, Masahiro Tanaka, Stas Bekman, Olatunji Ruwase, Minjia Zhang

Abstract: Existing checkpointing approaches seem ill-suited for distributed training even though hardware limitations make model parallelism, i.e., sharding model state across multiple accelerators, a requirement for model scaling. Consolidating distributed model state into a single checkpoint unacceptably slows down training, and is impractical at extreme scales. Distributed checkpoints, in contrast, are tightly coupled to the model parallelism and hardware configurations of the training run, and thus unusable on different configurations. To address this problem, we propose Universal Checkpointing, a technique that enables efficient checkpoint creation while providing the flexibility of resuming on arbitrary parallelism strategies and hardware configurations. Universal Checkpointing unlocks unprecedented capabilities for large-scale training such as improved resilience to hardware failures through continued training on remaining healthy hardware, and reduced training time through opportunistic exploitation of elastic capacity. The key insight of Universal Checkpointing is the selection of the optimal representation in each phase of the checkpointing life cycle: distributed representation for saving, and consolidated representation for loading. This is achieved using two key mechanisms. First, the universal checkpoint format, which consists of a consolidated representation of each model parameter and metadata for mapping parameter fragments into training ranks of arbitrary model-parallelism configuration. Second, the universal checkpoint language, a simple but powerful specification language for converting distributed checkpoints into the universal checkpoint format. Our evaluation demonstrates the effectiveness and generality of Universal Checkpointing on state-of-the-art model architectures and a wide range of parallelism techniques.

Submitted 27 June, 2024; v1 submitted 26 June, 2024; originally announced June 2024.

arXiv:2406.14329 [pdf, other]

Adaptive Adversarial Cross-Entropy Loss for Sharpness-Aware Minimization

Authors: Tanapat Ratchatorn, Masayuki Tanaka

Abstract: Recent advancements in learning algorithms have demonstrated that the sharpness of the loss surface is an effective measure for improving the generalization gap. Building upon this concept, Sharpness-Aware Minimization (SAM) was proposed to enhance model generalization and achieved state-of-the-art performance. SAM consists of two main steps, the weight perturbation step and the weight updating step. However, the perturbation in SAM is determined by only the gradient of the training loss, or cross-entropy loss. As the model approaches a stationary point, this gradient becomes small and oscillates, leading to inconsistent perturbation directions and possibly causing the gradient to diminish. Our research introduces an innovative approach to further enhancing model generalization. We propose the Adaptive Adversarial Cross-Entropy (AACE) loss function to replace standard cross-entropy loss for SAM's perturbation. AACE loss and its gradient uniquely increase as the model nears convergence, ensuring consistent perturbation direction and addressing the gradient diminishing issue. Additionally, a novel perturbation-generating function utilizing AACE loss without normalization is proposed, enhancing the model's exploratory capabilities in near-optimum stages. Empirical testing confirms the effectiveness of AACE, with experiments demonstrating improved performance in image classification tasks using Wide ResNet and PyramidNet across various datasets. The reproduction code is available online.

Submitted 20 June, 2024; originally announced June 2024.

Comments: Accepted in ICIP2024. The project page can be accessed at http://www.vip.sc.e.titech.ac.jp/proj/AACE

arXiv:2405.14146 [pdf, other]

Hyperspectral Image Dataset for Individual Penguin Identification

Authors: Youta Noboru, Yuko Ozasa, Masayuki Tanaka

Abstract: Remote individual animal identification is important for food safety, sport, and animal conservation. Numerous existing remote individual animal identification studies have focused on RGB images. In this paper, we tackle individual penguin identification using hyperspectral (HS) images. To the best of our knowledge, this is the first work to analyze spectral differences between penguin individuals using an HS camera. We have constructed a novel penguin HS image dataset, including 990 hyperspectral images of 27 penguins. We experimentally demonstrate that the spectral information of HS image pixels can be used for individual penguin identification. The experimental results show the effectiveness of using HS images for individual penguin identification. The dataset and source code are available here: https://033labcodes.github.io/igrass24_penguin/

Submitted 22 May, 2024; originally announced May 2024.

Comments: Accepted by 2024 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2024)

arXiv:2405.04771 [pdf, other]

Exploring Vision Transformers for 3D Human Motion-Language Models with Motion Patches

Authors: Qing Yu, Mikihiro Tanaka, Kent Fujiwara

Abstract: To build a cross-modal latent space between 3D human motion and language, acquiring large-scale and high-quality human motion data is crucial. However, unlike the abundance of image data, the scarcity of motion data has limited the performance of existing motion-language models. To counter this, we introduce "motion patches", a new representation of motion sequences, and propose using Vision Transformers (ViT) as motion encoders via transfer learning, aiming to extract useful knowledge from the image domain and apply it to the motion domain. These motion patches, created by dividing and sorting skeleton joints based on body parts in motion sequences, are robust to varying skeleton structures, and can be regarded as color image patches in ViT. We find that transfer learning with pre-trained weights of ViT obtained through training with 2D image data can boost the performance of motion analysis, presenting a promising direction for addressing the issue of limited motion data. Our extensive experiments show that the proposed motion patches, used jointly with ViT, achieve state-of-the-art performance in the benchmarks of text-to-motion retrieval, and other novel challenging tasks, such as cross-skeleton recognition, zero-shot motion classification, and human interaction recognition, which are currently impeded by the lack of data.

Submitted 7 May, 2024; originally announced May 2024.

Comments: Accepted to CVPR 2024, Project website: https://yu1ut.com/MotionPatches-HP/

arXiv:2404.14219 [pdf, other]

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

Authors: Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Martin Cai, Qin Cai, Vishrav Chaudhary, Dong Chen, Dongdong Chen, Weizhu Chen, Yen-Chun Chen, Yi-Ling Chen, Hao Cheng, Parul Chopra, Xiyang Dai, et al. (104 additional authors not shown)

Abstract: We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. Our training dataset is a scaled-up version of the one used for phi-2, composed of heavily filtered publicly available web data and synthetic data. The model is also further aligned for robustness, safety, and chat format. We also provide parameter-scaling results with 7B and 14B models trained for 4.8T tokens, called phi-3-small and phi-3-medium, both significantly more capable than phi-3-mini (e.g., respectively 75%, 78% on MMLU, and 8.7, 8.9 on MT-bench). To enhance multilingual, multimodal, and long-context capabilities, we introduce three models in the phi-3.5 series: phi-3.5-mini, phi-3.5-MoE, and phi-3.5-Vision. The phi-3.5-MoE, a 16×3.8B MoE model with 6.6 billion active parameters, achieves superior performance in language reasoning, math, and code tasks compared to other open-source models of similar scale, such as Llama 3.1 and the Mixtral series, and is on par with Gemini-1.5-Flash and GPT-4o-mini. Meanwhile, phi-3.5-Vision, a 4.2 billion parameter model derived from phi-3.5-mini, excels in reasoning tasks and is adept at handling both single-image and text prompts, as well as multi-image and text prompts.

Submitted 30 August, 2024; v1 submitted 22 April, 2024; originally announced April 2024.

Comments: 24 pages

arXiv:2403.11517 [pdf, other]

Inter-individual and inter-site neural code conversion without shared stimuli

Authors: Haibao Wang, Jun Kai Ho, Fan L. Cheng, Shuntaro C. Aoki, Yusuke Muraki, Misato Tanaka, Yukiyasu Kamitani

Abstract: Inter-individual variability in fine-grained functional brain organization poses challenges for scalable data analysis and modeling. Functional alignment techniques can help mitigate these individual differences but typically require paired brain data with the same stimuli between individuals, which is often unavailable. We present a neural code conversion method that overcomes this constraint by optimizing conversion parameters based on the discrepancy between the stimulus contents represented by original and converted brain activity patterns. This approach, combined with hierarchical features of deep neural networks (DNNs) as latent content representations, achieves conversion accuracy comparable to methods using shared stimuli. The converted brain activity from a source subject can be accurately decoded using the target's pre-trained decoders, producing high-quality visual image reconstructions that rival within-individual decoding, even with data across different sites and limited training samples. Our approach offers a promising framework for scalable neural data analysis and modeling and a foundation for brain-to-brain communication.

Submitted 1 August, 2024; v1 submitted 18 March, 2024; originally announced March 2024.

arXiv:2403.00363 [pdf, other]

SFQ counter-based precomputation for large-scale cryogenic VQE machines

Authors: Yosuke Ueno, Satoshi Imamura, Yuna Tomida, Teruo Tanimoto, Masamitsu Tanaka, Yutaka Tabuchi, Koji Inoue, Hiroshi Nakamura

Abstract: The variational quantum eigensolver (VQE) is a promising candidate for bringing practical benefits from quantum computing. However, the required bandwidth in and out of a cryostat is a limiting factor for scaling cryogenic quantum computers. We propose a tailored counter-based module with single flux quantum circuits in the 4-K stage, which precomputes a part of the VQE calculation and reduces the amount of inter-temperature communication. The evaluation shows that our system reduces the required bandwidth by 97%, and with this drastic reduction, total power consumption is reduced by 93% in the case where 277 VQE programs are executed in parallel on a 10000-qubit machine.

Submitted 1 March, 2024; originally announced March 2024.

Comments: 7 pages, 5 figures, 3 tables. Accepted by DAC'24 WIP poster session

arXiv:2401.13868 [pdf, other]

doi 10.1007/s00158-024-03873-0

Shell topology optimization based on level set method

Authors: Hiroki Kobayashi, Katsuya Nomura, Yuqing Zhou, Masato Tanaka, Atsushi Kawamoto, Tsuyoshi Nomura

Abstract: This paper proposes a level set-based method for optimizing shell structures with large design changes in shape and topology. Conventional shell optimization methods, whether parametric or nonparametric, often only allow limited design changes in shape. In the proposed method, the shell structure is defined as the isosurface of a level set function. The level set function is iteratively updated based on the shape sensitivity on the surface mesh. Therefore, the proposed method can represent an arbitrary manifold surface while dealing with topological changes, for example, from a spherical surface to a toroidal surface. We applied the proposed method to the mean compliance minimization problems of 3D shell structural designs for dome, bending plate and cantilever beam examples to demonstrate its efficacy.

Submitted 27 August, 2024; v1 submitted 24 January, 2024; originally announced January 2024.

Comments: 15 pages, 13 figures

Journal ref: Structural and Multidisciplinary Optimization 67, 151 (2024)

arXiv:2401.08671 [pdf, other]

DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference

Authors: Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, Yuxiong He

Abstract: The deployment and scaling of large language models (LLMs) have become critical as they permeate various applications, demanding high-throughput and low-latency serving systems. Existing frameworks struggle to balance these requirements, especially for workloads with long prompts. This paper introduces DeepSpeed-FastGen, a system that employs Dynamic SplitFuse, a novel prompt and generation composition strategy, to deliver up to 2.3x higher effective throughput, 2x lower latency on average, and up to 3.7x lower (token-level) tail latency, compared to state-of-the-art systems like vLLM. We leverage a synergistic combination of DeepSpeed-MII and DeepSpeed-Inference to provide an efficient and easy-to-use serving system for LLMs. DeepSpeed-FastGen's advanced implementation supports a range of models and offers both non-persistent and persistent deployment options, catering to diverse user scenarios from interactive sessions to long-running applications. We present a detailed benchmarking methodology, analyze the performance through latency-throughput curves, and investigate scalability via load balancing. Our evaluations demonstrate substantial improvements in throughput and latency across various models and hardware configurations. We discuss our roadmap for future enhancements, including broader model support and new hardware backends. The DeepSpeed-FastGen code is readily available for community engagement and contribution.

Submitted 9 January, 2024; originally announced January 2024.

arXiv:2310.14581 [pdf, other]

Leveraging Image-Text Similarity and Caption Modification for the DataComp Challenge: Filtering Track and BYOD Track

Authors: Shuhei Yokoo, Peifei Zhu, Yuchi Ishikawa, Mikihiro Tanaka, Masayoshi Kondo, Hirokatsu Kataoka

Abstract: Large web crawl datasets have already played an important role in learning multimodal features with high generalization capabilities. However, there are still very limited studies investigating the details or improvements of data design. Recently, a DataComp challenge has been designed to propose the best training data with the fixed models. This paper presents our solution to both the filtering track and the BYOD track of the DataComp challenge. Our solution adopts the large multimodal models CLIP and BLIP-2 to filter and modify web crawl data, and utilizes external datasets along with a bag of tricks to improve the data quality. Experiments show our solution significantly outperforms DataComp baselines (filtering track: 6.6% improvement, BYOD track: 48.5% improvement).

Submitted 23 October, 2023; originally announced October 2023.

Comments: Accepted at the ICCV 2023 Workshop on Towards the Next Generation of Computer Vision Datasets: DataComp Track

arXiv:2310.10985 [pdf]

doi 10.1126/sciadv.adn6129

Computational synthesis of locomotive soft robots by topology optimization

Authors: Hiroki Kobayashi, Farzad Gholami, S. Macrae Montgomery, Masato Tanaka, Liang Yue, Changyoung Yuhn, Yuki Sato, Atsushi Kawamoto, H. Jerry Qi, Tsuyoshi Nomura

Abstract: Locomotive soft robots (SoRos) have gained prominence due to their adaptability. Traditional locomotive SoRo design is based on limb structures inspired by biological organisms and requires human intervention. Evolutionary robotics, designed using evolutionary algorithms (EAs), have shown potential for automatic design. However, EA-based methods face the challenge of high computational cost when considering multiphysics in locomotion, including materials, actuations, and interactions with environments. Here, we present a design approach for pneumatic SoRos that integrates gradient-based topology optimization with multiphysics material point method (MPM) simulations. This approach starts with a simple initial shape (a cube with a central cavity). The topology optimization with MPM then automatically and iteratively designs the SoRo shape. We design two SoRos, one for walking and one for climbing. These SoRos are 3D printed and exhibit the same locomotion features as in the simulations. This study presents an efficient strategy for designing SoRos, demonstrating that a purely mathematical process can produce limb-like structures seen in biological organisms.

Submitted 24 July, 2024; v1 submitted 17 October, 2023; originally announced October 2023.

Comments: 36 total pages (27 pages, 9 supplementary pages), 5 Figures, 9 Supplementary figures. 1 Supplementary table

Journal ref: Sci. Adv. 10, eadn6129 (2024)

arXiv:2310.04610 [pdf, other]

DeepSpeed4Science Initiative: Enabling Large-Scale Scientific Discovery through Sophisticated AI System Technologies

Authors: Shuaiwen Leon Song, Bonnie Kruft, Minjia Zhang, Conglong Li, Shiyang Chen, Chengming Zhang, Masahiro Tanaka, Xiaoxia Wu, Jeff Rasley, Ammar Ahmad Awan, Connor Holmes, Martin Cai, Adam Ghanem, Zhongzhu Zhou, Yuxiong He, Pete Luferenko, Divya Kumar, Jonathan Weyn, Ruixiong Zhang, Sylwester Klocek, Volodymyr Vragov, Mohammed AlQuraishi, Gustaf Ahdritz, Christina Floristean, Cristina Negri, et al. (67 additional authors not shown)

Abstract: In the upcoming decade, deep learning may revolutionize the natural sciences, enhancing our capacity to model and predict natural occurrences. This could herald a new era of scientific exploration, bringing significant advancements across sectors from drug development to renewable energy. To answer this call, we present the DeepSpeed4Science initiative (deepspeed4science.ai), which aims to build unique capabilities through AI system technology innovations to help domain experts unlock today's biggest science mysteries. By leveraging DeepSpeed's current technology pillars (training, inference and compression) as base technology enablers, DeepSpeed4Science will create a new set of AI system technologies tailored for accelerating scientific discoveries by addressing their unique complexity beyond the common technical approaches used for accelerating generic large language models (LLMs). In this paper, we showcase the early progress we made with DeepSpeed4Science in addressing two of the critical system challenges in structural biology research.

Submitted 11 October, 2023; v1 submitted 6 October, 2023; originally announced October 2023.

arXiv:2310.01630 [pdf, other]

doi 10.1109/LCA.2023.3322700

Inter-temperature Bandwidth Reduction in Cryogenic QAOA Machines

Authors: Yosuke Ueno, Yuna Tomida, Teruo Tanimoto, Masamitsu Tanaka, Yutaka Tabuchi, Koji Inoue, Hiroshi Nakamura

Abstract: The bandwidth limit between cryogenic and room-temperature environments is a critical bottleneck in superconducting noisy intermediate-scale quantum computers. This paper presents the first trial of algorithm-aware system-level optimization to solve this issue by targeting the quantum approximate optimization algorithm. Our counter-based cryogenic architecture using single-flux quantum logic shows exponential bandwidth reduction and decreases heat inflow and peripheral power consumption of inter-temperature cables, which contributes to the scalability of superconducting quantum computers.

Submitted 2 October, 2023; originally announced October 2023.

Comments: 4 pages, 5 figures, 1 table. Accepted by IEEE Computer Architecture Letters.

arXiv:2309.14509 [pdf, other]

DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

Authors: Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, Yuxiong He

Abstract: Computation in a typical Transformer-based large language model (LLM) can be characterized by batch size, hidden dimension, number of layers, and sequence length. Until now, system works for accelerating LLM training have focused on the first three dimensions: data parallelism for batch size, tensor parallelism for hidden size and pipeline parallelism for model depth or layers. These widely studied forms of parallelism are not targeted or optimized for long sequence Transformer models. Given practical application needs for long sequence LLMs, renewed attention is being drawn to sequence parallelism. However, existing works in sequence parallelism are constrained by memory-communication inefficiency, limiting their scalability to long sequence large models. In this work, we introduce DeepSpeed-Ulysses, a novel, portable and effective methodology for enabling highly efficient and scalable LLM training with extremely long sequence length. DeepSpeed-Ulysses at its core partitions input data along the sequence dimension and employs an efficient all-to-all collective communication for attention computation. Theoretical communication analysis shows that whereas other methods incur communication overhead as sequence length increases, DeepSpeed-Ulysses maintains constant communication volume when sequence length and compute devices are increased proportionally. Furthermore, experimental evaluations show that DeepSpeed-Ulysses trains 2.5x faster with 4x longer sequence length than the existing SOTA baseline.

Submitted 4 October, 2023; v1 submitted 25 September, 2023; originally announced September 2023.

arXiv:2308.01320 [pdf, other]

DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales

Authors: Zhewei Yao, Reza Yazdani Aminabadi, Olatunji Ruwase, Samyam Rajbhandari, Xiaoxia Wu, Ammar Ahmad Awan, Jeff Rasley, Minjia Zhang, Conglong Li, Connor Holmes, Zhongzhu Zhou, Michael Wyatt, Molly Smith, Lev Kurilenko, Heyang Qin, Masahiro Tanaka, Shuai Che, Shuaiwen Leon Song, Yuxiong He

Abstract: ChatGPT-like models have revolutionized various applications in artificial intelligence, from summarization and coding to translation, matching or even surpassing human performance. However, the current landscape lacks an accessible, efficient, and cost-effective end-to-end RLHF (Reinforcement Learning with Human Feedback) training pipeline for these powerful models, particularly when training at the scale of billions of parameters. This paper introduces DeepSpeed-Chat, a novel system that democratizes RLHF training, making it accessible to the AI community. DeepSpeed-Chat offers three key capabilities: an easy-to-use training and inference experience for ChatGPT-like models, a DeepSpeed-RLHF pipeline that replicates the training pipeline from InstructGPT, and a robust DeepSpeed-RLHF system that combines various optimizations for training and inference in a unified way. The system delivers unparalleled efficiency and scalability, enabling training of models with hundreds of billions of parameters in record time and at a fraction of the cost. With this development, DeepSpeed-Chat paves the way for broader access to advanced RLHF training, even for data scientists with limited resources, thereby fostering innovation and further development in the field of AI.

Submitted 2 August, 2023; originally announced August 2023.

Comments: 14 pages, 7 figures

arXiv:2307.13985 [pdf, other]

Enhanced Security against Adversarial Examples Using a Random Ensemble of Encrypted Vision Transformer Models

Authors: Ryota Iijima, Miki Tanaka, Sayaka Shiota, Hitoshi Kiya

Abstract: Deep neural networks (DNNs) are well known to be vulnerable to adversarial examples (AEs). In addition, AEs have adversarial transferability, which means AEs generated for a source model can fool another black-box model (target model) with a non-trivial probability. In previous studies, it was confirmed that the vision transformer (ViT) is more robust against the property of adversarial transferability than convolutional neural network (CNN) models such as ConvMixer, and moreover that encrypted ViT is more robust than ViT without any encryption. In this article, we propose a random ensemble of encrypted ViT models to achieve much more robust models. In experiments, the proposed scheme is verified to be more robust than conventional methods against not only black-box attacks but also white-box ones.

Submitted 26 July, 2023; originally announced July 2023.

Comments: 4 pages, 3 figures

arXiv:2306.11629 [pdf, other]

Sound reconstruction from human brain activity via a generative model with brain-like auditory features

Authors: Jong-Yun Park, Mitsuaki Tsukamoto, Misato Tanaka, Yukiyasu Kamitani

Abstract: The successful reconstruction of perceptual experiences from human brain activity has provided insights into the neural representations of sensory experiences. However, reconstructing arbitrary sounds has been avoided due to the complexity of temporal sequences in sounds and the limited resolution of neuroimaging modalities. To overcome these challenges, leveraging the hierarchical nature of brain auditory processing could provide a path toward reconstructing arbitrary sounds. Previous studies have indicated a hierarchical homology between the human auditory system and deep neural network (DNN) models. Furthermore, advancements in audio-generative models enable compressed representations to be transformed back into high-resolution sounds. In this study, we introduce a novel sound reconstruction method that combines brain decoding of auditory features with an audio-generative model. Using fMRI responses to natural sounds, we found that the hierarchical sound features of a DNN model could be better decoded than spectrotemporal features. We then reconstructed the sound using an audio transformer that disentangled compressed temporal information in the decoded DNN features. Our method achieves unconstrained sound reconstruction that captures the perceptual content and quality of sounds, and it generalizes to sound categories not included in the training dataset. Reconstructions from different auditory regions remain similar to actual sounds, highlighting the distributed nature of auditory representations.
To see whether the reconstructions mirrored actual subjective perceptual experiences, we performed an experiment involving selective auditory attention to one of the overlapping sounds. The results tended to resemble the attended sound more than the unattended one. These findings demonstrate that our proposed model provides a means to externalize experienced auditory contents from human brain activity.

Submitted 20 June, 2023; originally announced June 2023.

arXiv:2303.05763 [pdf, other]

Automatic Detection and Rectification of Paper Receipts on Smartphones

Authors: Edward Whittaker, Masashi Tanaka, Ikuo Kitagishi

Abstract: We describe the development of a real-time smartphone app that allows the user to digitize paper receipts in a novel way by "waving" their phone over the receipts and letting the app automatically detect and rectify the receipts for subsequent text recognition. We show that traditional computer vision algorithms for edge and corner detection do not robustly detect the non-linear and discontinuous edges and corners of a typical paper receipt in real-world settings. This is particularly the case when the colors of the receipt and background are similar, or where other interfering rectangular objects are present. Inaccurate detection of a receipt's corner positions then results in distorted images when using an affine projective transformation to rectify the perspective. We propose an innovative solution to receipt corner detection by treating each of the four corners as a unique "object", and training a Single Shot Detection MobileNet object detection model. We use a small amount of real data and a large amount of automatically generated synthetic data that is designed to be similar to real-world imaging scenarios. We show that our proposed method robustly detects the four corners of a receipt, giving a receipt detection accuracy of 85.3% on real-world data, compared to only 36.9% with a traditional edge detection-based approach. Our method works even when the color of the receipt is virtually indistinguishable from the background.
Moreover, our method is trained to detect only the corners of the central target receipt and implicitly learns to ignore other receipts, and other rectangular objects. Including synthetic data allows us to train an even better model. These factors are a major advantage over traditional edge detection-based approaches, allowing us to deliver a much better experience to the user.

Submitted 10 March, 2023; originally announced March 2023.

arXiv:2209.08724 [pdf, other]

On the Adversarial Transferability of ConvMixer Models

Authors: Ryota Iijima, Miki Tanaka, Isao Echizen, Hitoshi Kiya

Abstract: Deep neural networks (DNNs) are well known to be vulnerable to adversarial examples (AEs). In addition, AEs have adversarial transferability, which means AEs generated for a source model can fool another black-box model (target model) with a non-trivial probability. In this paper, we investigate the property of adversarial transferability between models including ConvMixer, which is an isotropic network, for the first time. To objectively verify the property of transferability, the robustness of models is evaluated by using a benchmark attack method called AutoAttack. In an image classification experiment, ConvMixer is confirmed to be weak to adversarial transferability.

Submitted 18 September, 2022; originally announced September 2022.

Comments: 5 pages, 5 figures, 5 tables. arXiv admin note: substantial text overlap with arXiv:2209.02997

arXiv:2209.06027 [pdf, other]

Two-Step Color-Polarization Demosaicking Network

Authors: Vy Nguyen, Masayuki Tanaka, Yusuke Monno, Masatoshi Okutomi

Abstract: Polarization information of light in a scene is valuable for various image processing and computer vision tasks. A division-of-focal-plane polarimeter is a promising approach to capture the polarization images of different orientations in one shot, while it requires color-polarization demosaicking. In this paper, we propose a two-step color-polarization demosaicking network (TCPDNet), which consists of two sub-tasks of color demosaicking and polarization demosaicking. We also introduce a reconstruction loss in the YCbCr color space to improve the performance of TCPDNet. Experimental comparisons demonstrate that TCPDNet outperforms existing methods in terms of the image quality of polarization images and the accuracy of Stokes parameters.

Submitted 13 September, 2022; originally announced September 2022.

Comments: Accepted in ICIP2022. Project page: http://www.ok.sc.e.titech.ac.jp/res/PolarDem/TCPDNet.html

arXiv:2209.02997 [pdf, other]

On the Transferability of Adversarial Examples between Encrypted Models

Authors: Miki Tanaka, Isao Echizen, Hitoshi Kiya

Abstract: Deep neural networks (DNNs) are well known to be vulnerable to adversarial examples (AEs). In addition, AEs have adversarial transferability, namely, AEs generated for a source model fool other (target) models. In this paper, we investigate the transferability of models encrypted for adversarially robust defense for the first time. To objectively verify the property of transferability, the robustness of models is evaluated by using a benchmark attack method called AutoAttack. In an image-classification experiment, the use of encrypted models is confirmed not only to be robust against AEs but also to reduce the influence of AEs in terms of the transferability of models.

Submitted 7 September, 2022; originally announced September 2022.

Comments: To appear in ISPACS 2022

arXiv:2208.05758 [pdf, other]

NEO-QEC: Neural Network Enhanced Online Superconducting Decoder for Surface Codes

Authors: Yosuke Ueno, Masaaki Kondo, Masamitsu Tanaka, Yasunari Suzuki, Yutaka Tabuchi

Abstract: Quantum error correction (QEC) is essential for quantum computing to mitigate the effect of errors on qubits, and surface code (SC) is one of the most promising QEC methods. Decoding SCs is the most computationally expensive task in the control device of quantum computers (QCs), and many works focus on accurate decoding algorithms for SCs, including ones with neural networks (NNs). Practical QCs also require low-latency decoding because slow decoding leads to the accumulation of errors on qubits, resulting in logical failures. For QCs with superconducting qubits, a practical decoder must be very power-efficient in addition to having high accuracy and low latency. In order to reduce the hardware complexity of QCs, SCs should be decoded in a cryogenic environment with a limited power budget, where superconducting qubits operate. In this paper, we propose an NN-based accurate, fast, and low-power decoder capable of decoding SCs and lattice surgery (LS) operations with measurement errors on ancillary qubits. To achieve both accuracy and hardware efficiency of the SC decoder, we apply a binarized NN. We design a neural processing unit (NPU) for the decoder with SFQ-based digital circuits and evaluate it with a SPICE-level simulation. We evaluate the decoder performance by a quantum error simulator for the single logical qubit protection and the minimum operation of LS with code distances up to 13, and it achieves 2.5% and 1.0% accuracy thresholds, respectively.

Submitted 1 September, 2022; v1 submitted 11 August, 2022; originally announced August 2022.

Comments: 13 pages, 9 figures, 5 tables

arXiv:2208.05198 [pdf, other]

A Detection Method of Temporally Operated Videos Using Robust Hashing

Authors: Shoko Niwa, Miki Tanaka, Hitoshi Kiya

Abstract: SNS providers are known to carry out the recompression and resizing of uploaded videos/images, but most conventional methods for detecting tampered videos/images are not robust enough against such operations. In addition, videos may undergo temporal operations, such as the insertion of new frames and the permutation of frames, which are difficult to detect with conventional methods. Accordingly, in this paper, we propose a novel method with a robust hashing algorithm for detecting temporally operated videos even when resizing and compression are applied to the videos.

Submitted 11 August, 2022; v1 submitted 10 August, 2022; originally announced August 2022.

Comments: To appear in 2022 IEEE 11th Global Conference on Consumer Electronics (GCCE 2022)

arXiv:2207.01847 [pdf, other]

PoF: Post-Training of Feature Extractor for Improving Generalization

Authors: Ikuro Sato, Ryota Yamada, Masayuki Tanaka, Nakamasa Inoue, Rei Kawakami

Abstract: It has been intensively investigated that the local shape, especially flatness, of the loss landscape near a minimum plays an important role for generalization of deep models. We developed a training algorithm called PoF: Post-Training of Feature Extractor that updates the feature extractor part of an already-trained deep model to search for a flatter minimum. The characteristics are two-fold: 1) the feature extractor is trained under parameter perturbations in the higher-layer parameter space, based on observations that suggest flattening higher-layer parameter space, and 2) the perturbation range is determined in a data-driven manner aiming to reduce a part of test loss caused by the positive loss curvature. We provide a theoretical analysis that shows the proposed algorithm implicitly reduces the target Hessian components as well as the loss. Experimental results show that PoF improved model performance against baseline methods on both CIFAR-10 and CIFAR-100 datasets for only 10-epoch post-training, and on SVHN dataset for 50-epoch post-training. Source code is available at: https://github.com/DensoITLab/PoF-v1

Submitted 5 July, 2022; originally announced July 2022.

Comments: Accepted to ICML2022. Contains a link to the code

arXiv:2205.13344 [pdf, other]

A neural network based controller for underwater robotic vehicles

Authors: Josiane Maria Macedo Fernandes, Marcelo Costa Tanaka, Raimundo Carlos Silvério Freire Júnior, Wallace Moreira Bessa

Abstract: Due to the enormous technological improvements obtained in the last decades it is possible to use robotic vehicles for underwater exploration. This work describes the development of a dynamic positioning system for remotely operated underwater vehicles based. The adopted approach is developed using Lyapunov Stability Theory and enhanced by a neural network based algorithm for uncertainty and disturbance compensation. The performance of the proposed control scheme is evaluated by means of numerical simulations.

Submitted 17 June, 2022; v1 submitted 23 May, 2022; originally announced May 2022.

Comments: References added. This is a slightly updated version of the work presented at the COBEM 2011 - 21st Congress of Mechanical Engineering, 2011, Natal Brazil

arXiv:2109.14348 [pdf, ps, other]

Smart-home anomaly detection using combination of in-home situation and user behavior

Authors: Masaaki Yamauchi, Masahiro Tanaka, Yuichi Ohsita, Masayuki Murata, Kensuke Ueda, Yoshiaki Kato

Abstract: Internet-of-things (IoT) devices are vulnerable to malicious operations by attackers, which can cause physical and economic harm to users; therefore, we previously proposed a sequence-based method that modeled user behavior as sequences of in-home events and a base home state to detect anomalous operations. However, that method modeled users' home states based on the time of day; hence, attackers could exploit the system to maximize attack opportunities. Therefore, we then proposed an estimation-based detection method that estimated the home state using not only the time of day but also the observable values of home IoT sensors and devices. However, it ignored short-term operational behaviors. Consequently, in the present work, we propose a behavior-modeling method that combines home state estimation and event sequences of IoT devices within the home to enable a detailed understanding of long- and short-term user behavior. We compared the proposed model to our previous methods using data collected from real homes. Compared with the estimation-based method, the proposed method achieved a 15.4% higher detection ratio with fewer than 10% misdetections. Compared with the sequence-based method, the proposed method achieved a 46.0% higher detection ratio with fewer than 10% misdetections.

Submitted 29 September, 2021; originally announced September 2021.

Comments: 13 pages, 22 figures

arXiv:2108.01892 [pdf, other]

A universal detector of CNN-generated images using properties of checkerboard artifacts in the frequency domain

Authors: Miki Tanaka, Sayaka Shiota, Hitoshi Kiya

Abstract: We propose a novel universal detector for detecting images generated by using CNNs. In this paper, properties of checkerboard artifacts in CNN-generated images are considered, and the spectrum of images is enhanced in accordance with the properties. Next, a classifier is trained by using the enhanced spectrums to judge whether a query image is CNN-generated or not. In addition, an ensemble of the proposed detector with emphasized spectrums and a conventional detector is proposed to improve the performance of these methods. In an experiment, the proposed ensemble is demonstrated to outperform a state-of-the-art method under some conditions.

Submitted 4 August, 2021; originally announced August 2021.

Comments: To appear in GCCE 2021

arXiv:2107.11196 [pdf, other]

Multi-Modal Pedestrian Detection with Large Misalignment Based on Modal-Wise Regression and Multi-Modal IoU

Authors: Napat Wanchaitanawong, Masayuki Tanaka, Takashi Shibata, Masatoshi Okutomi

Abstract: The combined use of multiple modalities enables accurate pedestrian detection under poor lighting conditions by using the high visibility areas from these modalities together. The vital assumption for the combined use is that there is no or only a weak misalignment between the two modalities. In general, however, this assumption often breaks in actual situations. Due to this assumption's breakdown, the position of the bounding boxes does not match between the two modalities, resulting in a significant decrease in detection accuracy, especially in regions where the amount of misalignment is large. In this paper, we propose a multi-modal Faster-RCNN that is robust against large misalignment. The keys are 1) modal-wise regression and 2) multi-modal IoU for mini-batch sampling. To deal with large misalignment, we perform bounding box regression for both the RPN and detection-head with both modalities. We also propose a new sampling strategy called "multi-modal mini-batch sampling" that integrates the IoU for both modalities. We demonstrate that the proposed method's performance is much better than that of the state-of-the-art methods for data with large misalignment through actual image experiments.

Submitted 23 July, 2021; originally announced July 2021.

Comments: Accepted by MVA2021

arXiv:2107.10524 [pdf, other]

Geometric Data Augmentation Based on Feature Map Ensemble

Authors: Takashi Shibata, Masayuki Tanaka, Masatoshi Okutomi

Abstract: Deep convolutional networks have become the mainstream in computer vision applications. Although CNNs have been successful in many computer vision tasks, they are not free from drawbacks. The performance of CNNs is dramatically degraded by geometric transformations, such as large rotations. In this paper, we propose a novel CNN architecture that can improve the robustness against geometric transformations without modifying the existing backbones of CNNs. The key is to enclose the existing backbone with a geometric transformation (and the corresponding reverse transformation) and a feature map ensemble. The proposed method can inherit the strengths of existing CNNs that have been presented so far. Furthermore, the proposed method can be employed in combination with state-of-the-art data augmentation algorithms to improve their performance. We demonstrate the effectiveness of the proposed method using standard datasets such as CIFAR, CUB-200, and Mnist-rot-12k.

Submitted 22 July, 2021; originally announced July 2021.

Comments: Accepted to ICIP2021

arXiv:2105.13954 [pdf, other]

A Gradient Method for Multilevel Optimization

Authors: Ryo Sato, Mirai Tanaka, Akiko Takeda

Abstract: Although application examples of multilevel optimization have already been discussed since the 1990s, the development of solution methods was almost limited to bilevel cases due to the difficulty of the problem. In recent years, in machine learning, Franceschi et al. have proposed a method for solving bilevel optimization problems by replacing their lower-level problems with the $T$ steepest descent update equations with some prechosen iteration number $T$. In this paper, we have developed a gradient-based algorithm for multilevel optimization with $n$ levels based on their idea and proved that our reformulation asymptotically converges to the original multilevel problem. As far as we know, this is one of the first algorithms with some theoretical guarantee for multilevel optimization. Numerical experiments show that a trilevel hyperparameter learning model considering data poisoning produces more stable prediction results than an existing bilevel hyperparameter learning model in noisy data settings.

Submitted 26 October, 2021; v1 submitted 28 May, 2021; originally announced May 2021.

Comments: NeurIPS 2021 camera-ready, 27 pages

arXiv:2103.16063 [pdf, ps, other]

Automatic Graph Partitioning for Very Large-scale Deep Learning

Authors: Masahiro Tanaka, Kenjiro Taura, Toshihiro Hanawa, Kentaro Torisawa

Abstract: This work proposes RaNNC (Rapid Neural Network Connector) as middleware for automatic hybrid parallelism. In recent deep learning research, as exemplified by T5 and GPT-3, the size of neural network models continues to grow. Since such models do not fit into the memory of accelerator devices, they need to be partitioned by model parallelism techniques. Moreover, to accelerate training for huge training data, we need a combination of model and data parallelisms, i.e., hybrid parallelism. Given a model description for PyTorch without any specification for model parallelism, RaNNC automatically partitions the model into a set of subcomponents so that (1) each subcomponent fits a device memory and (2) a high training throughput for pipeline parallelism is achieved by balancing the computation times of the subcomponents. In our experiments, we compared RaNNC with two popular frameworks, Megatron-LM (hybrid parallelism) and GPipe (originally proposed for model parallelism, but a version allowing hybrid parallelism also exists), for training models with increasingly greater numbers of parameters. In the pre-training of enlarged BERT models, RaNNC successfully trained models five times larger than those Megatron-LM could, and RaNNC's training throughputs were comparable to Megatron-LM's when pre-training the same models. RaNNC also achieved better training throughputs than GPipe on both the enlarged BERT model pre-training (GPipe with hybrid parallelism) and the enlarged ResNet models (GPipe with model parallelism) in all of the settings we tried. These results are remarkable, since RaNNC automatically partitions models without any modification to their descriptions; Megatron-LM and GPipe require users to manually rewrite the models' descriptions.

Submitted 30 March, 2021; originally announced March 2021.

Comments: Accepted to the 35th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2021), May 2021

arXiv:2103.02198 [pdf, other]

Bulk Production Augmentation Towards Explainable Melanoma Diagnosis

Authors: Kasumi Obi, Quan Huu Cap, Noriko Umegaki-Arao, Masaru Tanaka, Hitoshi Iyatomi

Abstract: Although highly accurate automated diagnostic techniques for melanoma have been reported, the realization of a system capable of providing diagnostic evidence based on medical indices remains an open issue because of difficulties in obtaining reliable training data. In this paper, we propose bulk production augmentation (BPA) to generate high-quality, diverse pseudo-skin tumor images with the desired structural malignant features for additional training images from a limited number of labeled images. The proposed BPA acts as an effective data augmentation in constructing the feature detector for the atypical pigment network (APN), which is a key structure in melanoma diagnosis. Experiments show that training with images generated by our BPA largely boosts the APN detection performance by 20.0 percentage points in the area under the receiver operating characteristic curve, which is 11.5 to 13.7 points higher than that of conventional CycleGAN-based augmentations in AUC.

Submitted 3 March, 2021; originally announced March 2021.

Comments: IEEE EMBS Conference on Biomedical Engineering and Sciences (IECBES2020), Best Paper Award Student Category in Biomedical Imaging and Image Processing

arXiv:2102.01313 [pdf, other]

Fake-image detection with Robust Hashing

Authors: Miki Tanaka, Hitoshi Kiya

Abstract: In this paper, we investigate, for the first time, whether robust hashing can robustly detect fake images even when multiple manipulation techniques such as JPEG compression are applied to the images. In an experiment, the proposed fake detection with robust hashing is demonstrated to outperform a state-of-the-art method on various datasets, including fake images generated with GANs.

Submitted 2 February, 2021; originally announced February 2021.

Comments: to appear in Life Tech 2021

arXiv:2102.00691 [pdf, other]

New Formulation for Coloring Circle Graphs and its Application to Capacitated Stowage Stack Minimization

Authors: Masato Tanaka, Tomomi Matsui

Abstract: A circle graph is a graph in which the adjacency of vertices can be represented as the intersection of chords of a circle. The problem of calculating the chromatic number is known to be NP-complete, even on circle graphs. In this paper, we propose a new integer linear programming formulation for a coloring problem on circle graphs. We also show that the linear relaxation problem of our formulation finds the fractional chromatic number of a given circle graph. As a byproduct, our formulation gives a polynomial-sized linear programming formulation for calculating the fractional chromatic number of a circle graph. We also extend our result to a formulation for a capacitated stowage stack minimization problem.

Submitted 1 February, 2021; originally announced February 2021.

Comments: 23 pages, 5 figures

MSC Class: 05C15; 90C10; 90C27; 90C35; 05C72

arXiv:2101.11180 [pdf, other]

doi 10.1016/j.mathsocsci.2021.12.002

Pseudo Polynomial Size LP Formulation for Calculating the Least Core Value of Weighted Voting Games

Authors: Masato Tanaka, Tomomi Matsui

Abstract: In this paper, we propose a pseudo polynomial size LP formulation for finding a payoff vector in the least core of a weighted voting game. The numbers of variables and constraints in our formulation are both bounded by $\mbox{O}(n W_+)$, where $n$ is the number of players and $W_+$ is the total sum of (integer) voting weights. When we employ our formulation, a commercial LP solver calculates a payoff vector in the least core of practical weighted voting games in a few seconds. We also extend our approach to vector weighted voting games.

Submitted 23 August, 2021; v1 submitted 26 January, 2021; originally announced January 2021.

Comments: 14 pages, 1 figure

MSC Class: 91B12; 90C05

Journal ref: Mathematical Social Sciences, Volume 115, 2022, Pages 47-51

arXiv:2101.02841 [pdf, other]

doi 10.3390/g13030044

Monte Carlo Methods for Calculating Shapley-Shubik Power Index in Weighted Majority Games

Authors: Yuto Ushioda, Masato Tanaka, Tomomi Matsui

Abstract: This paper addresses Monte Carlo algorithms for calculating the Shapley-Shubik power index in weighted majority games. First, we analyze a naive Monte Carlo algorithm and discuss the required number of samples. We then propose an efficient Monte Carlo algorithm and show that our algorithm reduces the required number of samples as compared to the naive algorithm.

Submitted 7 January, 2021; originally announced January 2021.

Comments: 19 pages

MSC Class: 91-08; 90-08; 65C05

Journal ref: Games 2022, 13(3), 44

arXiv:2012.00287 [pdf, other]

CycleGAN without checkerboard artifacts for counter-forensics of fake-image detection

Authors: Takayuki Osakabe, Miki Tanaka, Yuma Kinoshita, Hitoshi Kiya

Abstract: In this paper, we propose a novel CycleGAN without checkerboard artifacts for counter-forensics of fake-image detection. Recent rapid advances in image manipulation tools and deep image synthesis techniques, such as Generative Adversarial Networks (GANs), have made it easy to generate fake images, so detecting manipulated images has become an urgent issue. Most state-of-the-art forgery detection methods assume that images include checkerboard artifacts, which are generated by DNNs. Accordingly, we propose, for the first time, a novel CycleGAN without any checkerboard artifacts for counter-forensics of fake-image detection methods, as an example of GANs without checkerboard artifacts.

Submitted 1 December, 2020; originally announced December 2020.

arXiv:2011.10232 [pdf, other]

Deep Snapshot HDR Imaging Using Multi-Exposure Color Filter Array

Authors: Takeru Suda, Masayuki Tanaka, Yusuke Monno, Masatoshi Okutomi

Abstract: In this paper, we propose a deep snapshot high dynamic range (HDR) imaging framework that can effectively reconstruct an HDR image from the RAW data captured using a multi-exposure color filter array (ME-CFA), which consists of a mosaic pattern of RGB filters with different exposure levels. To effectively learn the HDR image reconstruction network, we introduce the idea of luminance normalization that simultaneously enables effective loss computation and input data normalization by considering relative local contrasts in the "normalized-by-luminance" HDR domain. This idea makes it possible to equally handle the errors in both bright and dark areas regardless of absolute luminance levels, which significantly improves the visual image quality in a tone-mapped domain. Experimental results using two public HDR image datasets demonstrate that our framework outperforms other snapshot methods and produces high-quality HDR images with fewer visual artifacts.

Submitted 20 November, 2020; originally announced November 2020.

Comments: Accepted at ACCV2020 (Oral). Project page: http://www.ok.sc.e.titech.ac.jp/res/DSHDR/

arXiv:2011.06788 [pdf, other]

Adaptive Future Frame Prediction with Ensemble Network

Authors: Wonjik Kim, Masayuki Tanaka, Masatoshi Okutomi, Yoko Sasaki

Abstract: Future frame prediction in videos is a challenging problem because videos include complicated movements and large appearance changes. Learning-based future frame prediction approaches have been proposed in the literature. A common limitation of the existing learning-based approaches is a mismatch between training data and test data. In the future frame prediction task, we can obtain the ground truth data by just waiting for a few frames. This means we can update the prediction model online in the test phase. We therefore propose an adaptive update framework for the future frame prediction task. The proposed adaptive updating framework consists of a pre-trained prediction network, a continuous-updating prediction network, and a weight estimation network. We also show that our pre-trained prediction model achieves comparable performance to the existing state-of-the-art approaches. We demonstrate that our approach outperforms existing methods, especially for dynamically changing scenes.

Submitted 15 November, 2020; v1 submitted 13 November, 2020; originally announced November 2020.

Comments: Accepted at 25th International Conference on Pattern Recognition Workshop (ICPRW 2020)

arXiv:2010.08092 [pdf, other]

Human Segmentation with Dynamic LiDAR Data

Authors: Tao Zhong, Wonjik Kim, Masayuki Tanaka, Masatoshi Okutomi

Abstract: Consecutive LiDAR scans compose dynamic 3D sequences, which contain more abundant information than a single frame. Similar to the development history of image and video perception, dynamic 3D sequence perception is starting to come into sight after inspiring research on static 3D data perception. This work proposes a spatio-temporal neural network for human segmentation with dynamic LiDAR point clouds. It takes a sequence of depth images as input. It has a two-branch structure, i.e., the spatial segmentation branch and the temporal velocity estimation branch. The velocity estimation branch is designed to capture motion cues from the input sequence and then propagate them to the other branch, so that the segmentation branch segments humans according to both spatial and temporal features. These two branches are jointly learned on a generated dynamic point cloud dataset for human recognition. Our work fills a gap in dynamic point cloud perception using the spherical representation of point clouds and achieves high accuracy. The experiments indicate that the introduction of temporal features benefits the segmentation of dynamic point clouds.

Submitted 15 October, 2020; originally announced October 2020.

arXiv:2009.11558 [pdf, other]

An Analysis of Concurrency Control Protocols for In-Memory Databases with CCBench (Extended Version)

Authors: Takayuki Tanabe, Takashi Hoshino, Hideyuki Kawashima, Jun Nemoto, Masahiro Tanaka, Osamu Tatebe

Abstract: This paper presents yet another concurrency control analysis platform, CCBench. CCBench supports seven protocols (Silo, TicToc, MOCC, Cicada, SI, SI with latch-free SSN, 2PL) and seven versatile optimization methods and enables the configuration of seven workload parameters. We analyzed the protocols and optimization methods using various workload parameters and a thread count of 224. Previous studies focused on thread scalability and did not explore the space analyzed here. We classified the optimization methods on the basis of three performance factors: CPU cache, delay on conflict, and version lifetime. Analyses using CCBench and 224 threads produced six insights. (I1) The performance of optimistic concurrency control protocol for a read only workload rapidly degrades as cardinality increases even without L3 cache misses. (I2) Silo can outperform TicToc for some write-intensive workloads by using invisible reads optimization. (I3) The effectiveness of two approaches to coping with conflict (wait and no-wait) depends on the situation. (I4) OCC reads the same record two or more times if a concurrent transaction interruption occurs, which can improve performance. (I5) Mixing different implementations is inappropriate for deep analysis. (I6) Even a state-of-the-art garbage collection method cannot improve the performance of multi-version protocols if there is a single long transaction mixed into the workload. On the basis of I4, we defined the read phase extension optimization in which an artificial delay is added to the read phase. On the basis of I6, we defined the aggressive garbage collection optimization in which even visible versions are collected. The code for CCBench and all the data in this paper are available online at GitHub.

Submitted 18 August, 2021; v1 submitted 24 September, 2020; originally announced September 2020.

Comments: A short version is accepted at VLDB 2020 (PVLDB Volume 13, Issue 13). Code is at https://github.com/thawk105/ccbench

ACM Class: H.2.4

arXiv:2007.14292 [pdf, other]

Monochrome and Color Polarization Demosaicking Using Edge-Aware Residual Interpolation

Authors: Miki Morimatsu, Yusuke Monno, Masayuki Tanaka, Masatoshi Okutomi

Abstract: A division-of-focal-plane or microgrid image polarimeter enables us to acquire a set of polarization images in one shot. Since the polarimeter consists of an image sensor equipped with a monochrome or color polarization filter array (MPFA or CPFA), the demosaicking process to interpolate missing pixel values plays a crucial role in obtaining high-quality polarization images. In this paper, we propose a novel MPFA demosaicking method based on edge-aware residual interpolation (EARI) and also extend it to CPFA demosaicking. The key of EARI is a new edge detector for generating an effective guide image used to interpolate the missing pixel values. We also present a newly constructed full color-polarization image dataset captured using a 3-CCD camera and a rotating polarizer. Using the dataset, we experimentally demonstrate that our EARI-based method outperforms existing methods in MPFA and CPFA demosaicking.

Submitted 28 July, 2020; originally announced July 2020.

Comments: Accepted in ICIP2020. Dataset and code are available at http://www.ok.sc.e.titech.ac.jp/res/PolarDem/index.html

arXiv:2007.09990 [pdf, other]

doi 10.1109/TIP.2020.3011269

Unsupervised Learning of Image Segmentation Based on Differentiable Feature Clustering

Authors: Wonjik Kim, Asako Kanezaki, Masayuki Tanaka

Abstract: The usage of convolutional neural networks (CNNs) for unsupervised image segmentation was investigated in this study. In the proposed approach, label prediction and network parameter learning are alternately iterated to meet the following criteria: (a) pixels of similar features should be assigned the same label, (b) spatially continuous pixels should be assigned the same label, and (c) the number of unique labels should be large. Although these criteria are incompatible, the proposed approach minimizes the combination of similarity loss and spatial continuity loss to find a plausible solution of label assignment that balances the aforementioned criteria well. The contributions of this study are four-fold. First, we propose a novel end-to-end network of unsupervised image segmentation that consists of normalization and an argmax function for differentiable clustering. Second, we introduce a spatial continuity loss function that mitigates the limitations of fixed segment boundaries possessed by previous work. Third, we present an extension of the proposed method for segmentation with scribbles as user input, which showed better accuracy than existing methods while maintaining efficiency. Finally, we introduce another extension of the proposed method: unseen image segmentation by using networks pre-trained with a few reference images without re-training the networks. The effectiveness of the proposed approach was examined on several benchmark datasets of image segmentation.

Submitted 20 July, 2020; originally announced July 2020.

Comments: IEEE Transactions on Image Processing, Accepted in July, 2020

arXiv:2006.08145 [pdf, other]

Classifying degraded images over various levels of degradation

Authors: Kazuki Endo, Masayuki Tanaka, Masatoshi Okutomi

Abstract: Classification of degraded images having various levels of degradation is very important in practical applications. This paper proposes a convolutional neural network to classify degraded images by using a restoration network and ensemble learning. The results demonstrate that the proposed network can classify degraded images over various levels of degradation well. This paper also reveals how the image quality of training data for a classification network affects the classification performance of degraded images.

Submitted 15 June, 2020; originally announced June 2020.

Comments: Accepted by the 27th IEEE International Conference on Image Processing (ICIP 2020)

arXiv:2003.05093 [pdf, other]

Learning-Based Human Segmentation and Velocity Estimation Using Automatic Labeled LiDAR Sequence for Training

Authors: Wonjik Kim, Masayuki Tanaka, Masatoshi Okutomi, Yoko Sasaki

Abstract: In this paper, we propose an automatic labeled sequential data generation pipeline for human segmentation and velocity estimation with point clouds. Considering the impact of deep neural networks, state-of-the-art network architectures have been proposed for human recognition using point clouds captured by Light Detection and Ranging (LiDAR). However, one disadvantage is that legacy datasets may only cover the image domain without providing important label information and this limitation has disturbed the progress of research to date. Therefore, we develop an automatic labeled sequential data generation pipeline, in which we can control any parameter or data generation environment with pixel-wise and per-frame ground truth segmentation and pixel-wise velocity information for human recognition. Our approach uses a precise human model and reproduces a precise motion to generate realistic artificial data. We present more than 7K video sequences which consist of 32 frames generated by the proposed pipeline. With the proposed sequence generator, we confirm that human segmentation performance is improved when using the video domain compared to when using the image domain. We also evaluate our data by comparing with data generated under different conditions. In addition, we estimate pedestrian velocity with LiDAR by only utilizing data generated by the proposed pipeline.

Submitted 10 March, 2020; originally announced March 2020.

Comments: Please check the following URL for more information. http://www.ok.sc.e.titech.ac.jp/res/LHD/

Showing 1–50 of 74 results for author: Tanaka, M