NaVIP: An Image-Centric Indoor Navigation Solution for Visually Impaired People

Jun Yu 1 Yifan Zhang 1 Badrinadh Aila 1 Vinod Namboodiri 1,2 {juy220, yiz521, baa223, vin423}@lehigh.edu
1 Department of Computer Science and Engineering, Lehigh University. 
2 Department of Community and Population Health, Lehigh University. 
https://accesslab180.github.io/navip.github.io
Abstract

Indoor navigation is challenging due to the absence of satellite positioning. The challenge is considerably greater for Visually Impaired People (VIPs), who cannot obtain information from wayfinding signage. Other sensor signals (e.g., Bluetooth and LiDAR) can be used to create turn-by-turn navigation solutions with position updates for users. Unfortunately, these solutions require tags to be installed throughout the environment or the use of fairly expensive hardware. Moreover, they require a high degree of manual involvement that raises costs and hampers scalability. We propose an image dataset and an associated image-centric solution called NaVIP, a step towards visual intelligence that is infrastructure-free and task-scalable and that can assist VIPs in understanding their surroundings. Specifically, we start by curating large-scale phone camera data in a four-floor research building, with 300K images, to lay the foundation for an inclusive, image-centric indoor navigation and exploration solution. Every image is labelled with a precise 6DoF camera pose, details of indoor PoIs, and descriptive captions to assist VIPs. We benchmark two main aspects, 1) the positioning system and 2) exploration support, prioritizing training scalability and real-time inference, to validate the prospect of an image-based solution for indoor navigation. The dataset, code, and model checkpoints are made publicly available at https://github.com/junfish/VIP_Navi.

Figure 1: Illustration of pipelines for purely image-based indoor navigation. We collect videos and extract image frames as data sources. Each image is annotated with: 1) 6-DoF camera poses aligned with the floor plan, 2) indoor points-of-interest (PoIs), and 3) visual descriptions that assist visually impaired people (VIPs) in understanding their surroundings. We highlight the task scalability of this solution, facilitated by its end-to-end training and inference using simple image forward pass.

1 Introduction

Indoor navigation [1, 2, 3] remains a complex challenge due to the unavailability of satellite signals, such as GPS, for indoor localization. The dynamic nature of indoor scenes further complicates positioning. While Wi-Fi-based indoor positioning systems (WIPS) [4, 5] offer some promise by utilizing signal strength measurements from multiple access points (APs), their effectiveness can vary. To improve positioning accuracy and stability, deploying Bluetooth Low Energy (BLE) beacons [6, 7, 8] within buildings has been explored. These beacons emit radio signals that, when received by mobile devices, can aid in determining the user's location. However, this solution necessitates extensive tag hardware deployment and ongoing maintenance and sometimes interferes with other signals, often making it impractical due to resistance from building managers. Similar technologies, such as radio frequency identification (RFID) [9] and ultra-wideband (UWB) [10], have been refined to mitigate the inconvenience and further enhance accuracy; however, they come with the same disadvantages of requiring tag installation and maintenance. Other alternatives like magnetic and acoustic positioning offer infrastructure-free solutions but lack robustness against the spatial and temporal variations inherent in indoor settings. Dead reckoning [11, 12] and SLAM [13, 14] accumulate positioning error during navigation and also demonstrate limited robustness to environmental changes. Current marker-based methods offer cost-effective solutions but struggle with positioning accuracy when tracking visual landmarks (e.g., fiducial markers [15, 16]).

All the technologies above have their place in indoor navigation solutions, often complementing each other within hybrid systems [17, 18, 19] that aim to leverage the strengths of each approach while mitigating their individual disadvantages. While the above-mentioned options may already serve sighted individuals with a strong sense of direction, the challenges of indoor navigation are much greater for visually impaired people (VIPs) [20, 21, 22], who cannot independently explore unfamiliar indoor environments due to lack of access to existing wayfinding signage and the inability to create a mental map of complex layouts. For VIPs, the challenge of independent indoor navigation is more than just an inconvenience; it can mean the difference between venturing into unknown indoor spaces and avoiding them altogether. In summary, current indoor navigation technologies struggle to provide effective solutions, a necessity for VIPs, in dynamic environments where (1) accurate and real-time positioning, (2) scalability across varying scene sizes, and (3) understanding of the surroundings are all required simultaneously. To address these concerns, we are pioneering the application of visual intelligence to indoor navigation. However, our initial and most significant challenge is the absence of a large real-world dataset derived directly from indoor navigation scenarios. These scenarios involve rapidly changing scenes as users move, and environments that evolve continuously over days. Therefore, we curated an image-based indoor navigation dataset to facilitate the straightforward implementation of feasible applications. To further support this, we also ensured that the dataset collection process is scalable and adaptable to changes in building layouts and scene sizes.

The key task in such a purely image-based navigation system is positioning, simply put as determining your camera location from a single image. Contemporary methods for this camera localization task exhibit different trade-offs among hardware resources, prediction accuracy, inference time, and algorithm robustness. For example, state-of-the-art (SOTA) accuracy in camera localization is achieved by constructing a 3D model of the scene using either sparse feature points [23, 24] or dense reconstruction [25, 26]. The camera pose is then estimated through geometric calculations [27, 28] or 2D-3D matching [29]. However, these methods typically require significant memory overhead and have slow processing speeds, making them impractical for integration into a navigation system. On the other hand, absolute pose regression (APR) [30, 31, 32] can determine the camera pose with a single forward propagation in deep neural networks using query images. This line of research offers significant advantages in inference speed and can be easily deployed in thin client applications due to its minimal memory footprint. It achieves this while only incurring an acceptable level of accuracy loss in positioning, which is suitable for real-time navigation purposes. Additionally, the APR decision process is aligned with other well-established computer vision (CV) tasks in the era of deep learning, such as image recognition [33, 34], semantic segmentation [35, 36], and depth estimation [37, 38]. Considering task scalability is crucial for the development of an inclusive and accessible indoor navigation system. In this paper, we not only benchmark APR but also demonstrate how image captioning can assist VIPs in navigating indoor environments independently for their own needs [39, 40]. Without the burden of human-crafted annotations, this is achieved by leveraging recent advancements in pre-trained foundation models (PFMs) [41, 42], such as Segment Anything Model (SAM) [36] and GPT-4 [43].

The main contribution of this paper is the creation of an image dataset for an indoor building and the demonstration of our image-centric solution using this dataset to assist VIPs in navigating and exploring dynamically changing indoor environments. Using only a commodity-off-the-shelf (COTS) phone camera and minimizing human involvement in ground-truth annotations, we demonstrate our solution to be both practical and highly adaptable. Our benchmarks utilize deep representation learning for its robustness to varying scenarios, scalability across different scene sizes, and, most importantly, its simple forward pass inference in an end-to-end manner, responding to query images in just a few milliseconds. In §4.1, we benchmark APR methods (e.g., representative PoseNet [30]) on NaVIP, demonstrating their ability to pinpoint phone-captured images on a floor plan with sub-meter accuracy when applied to actual building layouts. §4.2 explores the potential of image captioning techniques to support the exploration needs of VIPs, illustrating the broader applicability of this image-centric approach to addressing the diverse needs of individuals with disabilities. With the release of NaVIP, we hope to spark further interest within the academic community to develop image-centric solutions for challenges such as indoor navigation and exploration, thus realizing some important human and societal benefits of AI.

2 Related Work

2.1 Indoor Navigation

Current full-fledged indoor navigation systems [1, 2, 44, 45] employ a set of technologies including WiFi fingerprints, BLE beacons, magnetic fields, and IMUs, either individually or in combination. These sensor technologies are specifically engineered to address the most challenging aspect of indoor navigation—accurate and robust positioning. [46] proposed a hybrid multi-sensor fusion system for indoor localization using WiFi and LiDAR. ViNav [18], named vision-based navigation system, relies on a combination of WiFi fingerprints, dead reckoning, and SfM 3D point clouds to localize user positions. ASSIST [47] is a personalized system using multimodal sensors and high-level semantic information, with efficacy tested on a blind and visually impaired (BVI) group. [48] proposed data-driven methods similar to ours but trained their localization model using annotated Wi-Fi observations instead of images. [49] explored the combination of radio and visible light communication-based positioning technologies.

Recent advancements in vision-and-language navigation (VLN) [50, 51, 52] have facilitated the translation of natural language instructions into practical actions for embodied agents, leveraging their visual perceptions. This development is instrumental in assisting users who can command such agents for indoor applications, such as search-and-rescue missions [53]. However, most VLN methods [54, 55, 56, 57, 58, 59, 60, 61, 62] are largely limited to controlled, simulated 3D environments [63], which significantly narrows their applicability in real-world settings. Interactive VLN [64, 65, 66], which incorporates human feedback to adapt to new environments, can explore the feasibility of unknown commands but struggles in scenarios where the oracle is disabled. While trajectory-instruction generation [67, 68, 69, 70, 52, 71, 72, 73, 74], which synthesizes new language instructions, can alleviate data scarcity, its efficacy still lags behind that of precious oracle instructions [75].

2.2 Visual Localization

The problem of visual localization, also known as camera pose estimation, is fundamental to many CV applications beyond navigation, such as augmented reality, SLAM systems, and autonomous driving.

Indirect localization casts camera pose estimation as a query frame retrieval problem [76, 77, 24]. Traditional methods [78] depend on the quality of feature detection and matching, and often require manual tuning to effectively retrieve the most similar images stored in a database or to interpolate the camera pose from the top retrieved images. In contrast, deep learning approaches [24, 79, 80, 81] utilize hierarchical pipelines that establish 2D-3D correspondences more efficiently and robustly. Despite achieving SOTA accuracy, these methods, which often incorporate PnP and RANSAC processes, tend to be 10 to 100 times slower than APR methods. Furthermore, although relative pose regression (RPR) [82, 83, 84, 85, 86, 87] also adopts camera pose regression, its inherent retrieval phase continues to hinder inference efficiency.

Direct localization can instantly relocalize the camera pose from the query image [88, 89, 90, 30, 91, 92], thus requiring a smaller memory footprint compared to indirect methods. Feature-based matching methods [93, 94, 95, 96, 97] always perform global localization by matching the feature points of query images with a 3D point cloud reconstruction. APR methods [30, 98, 91, 99, 100, 101, 102, 103, 92, 104, 105, 32] are capable of directly regressing the camera pose from a single image input in milliseconds, with minimal loss of accuracy. Although these methods have historically struggled to generalize to new camera poses [106], recent advancements in novel view synthesis (NVS) [107, 108, 102, 109, 32] could alleviate this burden via synthesizing new images from random viewpoints as data augmentation. Recent floorplan localization [110, 111] proposed to directly predict and localize with a single image in a floor plan.

2.3 CV and Beyond

CV has been among the most successful application areas of deep learning in recent years, with advances in object detection [112, 113, 114], pose estimation [115, 116, 117], semantic segmentation [35, 36], depth estimation [37, 38], etc. Recent PFMs [41, 42] provide a powerful base that can achieve zero-shot performance. For example, SAM [36] enables zero-shot generalization in the detection and segmentation of unfamiliar subjects. Depth Anything [38] leverages unlabeled data to achieve SOTA performance in monocular depth estimation across previously unseen datasets. More recently, multimodal large language models (MLLMs), such as GPT-4(o) [43] and Gemini [118], have been instrumental in providing accurate image descriptions and facilitating various user-adaptive tasks including image-to-text and image-to-audio conversions. Although there have been significant advances in computer vision, research dedicated to enhancing assistive and accessibility technologies remains limited [119, 120]. For VIPs, the VizWiz project (https://vizwiz.org/) [121, 122, 123, 124, 125] introduced the first Visual Question Answering (VQA) datasets that originate directly from VIPs and are tailored to benefit them. Recently, applications such as Be My Eyes and Seeing AI have used a generative AI-powered virtual volunteer (https://openai.com/index/be-my-eyes/) to interpret and describe images for their BVI users. Our research, however, distinguishes itself by focusing on customizing a distilled model and providing tailored navigation assistance, thereby enhancing the support offered to VIPs.

3 NaVIP

3.1 Data collection and annotation

This section presents the methods used for collecting and annotating our large-scale image dataset. To make this process easy to extend to other buildings, we prioritize reducing human labour. Within our workflow, human involvement is confined to recording videos, pinpointing 5–10 images from each video on the floor plan, and designing prompts for GPT-4. We also release the data preprocessing and annotation code to further alleviate the burdens associated with data collection and annotation.

3.1.1 Collection

To facilitate the adaptation of this data-driven, vision-based indoor navigation solution across various buildings in the future, data collection is streamlined to minimize human effort and allow for automation through a basic robot that does not require specialized design. We therefore record videos with smartphone built-in cameras and subsequently extract image frames. In this study, we collected approximately 400 videos, each lasting 2–4 minutes, within the Health, Science, and Technology (HST) Building (HST is a hub designed to encourage collaboration among faculty and students across disciplines and is the largest building Lehigh has ever built). By setting the frame extraction interval between 0.2 and 0.3 seconds, we obtained a large dataset comprising around 300K images. The simplicity of our data collection makes the dataset easy to expand, either through additional video recordings or by narrowing the time interval for frame extraction.
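The relation between the extraction interval and the dataset size can be checked with a little arithmetic. The following is a minimal sketch, not the released preprocessing code: the function names are hypothetical, and the 60 fps frame rate and the uniform 3-minute video length are assumptions made only for illustration.

```python
from typing import List


def frame_step(fps: float, interval_s: float) -> int:
    """Number of video frames between two extracted images."""
    return max(1, round(fps * interval_s))


def frame_indices(num_frames: int, fps: float, interval_s: float) -> List[int]:
    """Indices of the frames kept when extracting one image every interval_s seconds."""
    return list(range(0, num_frames, frame_step(fps, interval_s)))


def estimated_images(num_videos: int, video_s: float, interval_s: float) -> int:
    """Rough dataset size if every video had the same length and interval."""
    return round(num_videos * video_s / interval_s)


# Hypothetical 3-minute clip recorded at an assumed 60 fps, sampled every 0.25 s.
indices = frame_indices(num_frames=60 * 180, fps=60.0, interval_s=0.25)
print(len(indices), indices[:3])         # 720 [0, 15, 30]
print(estimated_images(400, 180, 0.25))  # 288000
```

Under these assumptions, 400 videos yield roughly 288K frames, which is of the same order as the reported dataset size.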

To enhance the robustness of algorithms developed from this dataset, videos were captured using four distinct smartphone models, in both landscape and portrait orientations. The phones were held in two ways: by hand and on a smartphone gimbal stabilizer. To comprehensively capture the variability of indoor environments, recordings spanned various times of the day (sunrise, morning, noon, afternoon, sunset, and evening) and were conducted from December 2023 through June 2024. Videos recorded after April are designated as the test set to ensure that models developed using this dataset generalize. For additional statistics and instructions regarding the data, please refer to Appendix A. We acknowledge that despite our efforts to align this dataset closely with real-world scenarios, dataset biases still exist and may be captured by models [126, 127]. For example, our dataset does not cover the fall season.

Algorithm 1  Camera Pose (6DoF) Annotation
Input:  Video set $\mathcal{V}=\{V_{1},\cdots,V_{N}\}$ by mobile cameras.
Output:  Geo-registered ground-truth of camera poses (positions and orientations) on the floor plan for extracted video frames.
1:  Initialize frame interval $\Delta f$.    ▷ $\Delta f\approx 16$.
2:  for $n=1,\cdots,N$ do
3:     Sample the image sequences $\mathcal{I}_{n}=\{I_{1},\cdots,I_{M_{n}}\}$ from video $V_{n}$ every $\Delta f$ frames.
4:     repeat
5:        Run SfM algorithm on image set $\mathcal{I}$.    ▷ Use COLMAP.
6:        if Point clouds align then
7:           Obtain camera pose for image sequences $\mathcal{I}$.
8:        else
9:           Sample more images $\mathcal{I}^{\prime}$ at breaking points.    ▷ $\Delta f^{\prime}=1/3\,\Delta f$
10:           Data augmentations: $\mathcal{I}=\mathcal{I}\cup\mathcal{I}^{\prime}$.
11:        end if
12:     until Point clouds align
13:     Geo-register the camera pose on the floor plan.    ▷ See Figure 2.
14:  end for
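The inner loop of Algorithm 1 (resampling more densely wherever the reconstruction breaks) can be sketched in a few lines. This is an illustration under stated assumptions rather than the released annotation code: `run_sfm` is a hypothetical callback standing in for a COLMAP run that reports the frames after which the reconstruction breaks, `toy_sfm` is a toy stand-in for it, and the `max_rounds` bound is added here so that the sketch always terminates.

```python
from typing import Callable, List, Sequence


def sample_until_aligned(
    num_frames: int,
    run_sfm: Callable[[Sequence[int]], List[int]],
    delta_f: int = 16,
    max_rounds: int = 5,
) -> List[int]:
    """Return frame indices for which the (hypothetical) SfM run reports no breaks."""
    frames = set(range(0, num_frames, delta_f))  # step 3: every delta_f frames
    fine = max(1, delta_f // 3)                  # step 9: denser interval, about delta_f / 3
    for _ in range(max_rounds):
        ordered = sorted(frames)
        breaks = run_sfm(ordered)                # step 5: frames after which tracking breaks
        if not breaks:                           # step 6: point clouds align
            return ordered
        for b in breaks:                         # steps 9-10: add denser frames, take the union
            frames.update(range(b, min(b + delta_f, num_frames), fine))
    raise RuntimeError("reconstruction did not align within max_rounds")


def toy_sfm(ordered: Sequence[int]) -> List[int]:
    """Toy stand-in: frames 32..47 are a fast turn that tolerates gaps of at most 8."""
    return [a for a, b in zip(ordered, ordered[1:]) if 32 <= a < 48 and b - a > 8]


selected = sample_until_aligned(num_frames=96, run_sfm=toy_sfm)
print(selected)  # [0, 16, 32, 37, 42, 47, 48, 64, 80]
```

Only the segment that failed to align is densified, so the number of images grows where the scene demands it and stays sparse elsewhere.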

3.1.2 Camera Pose

There are multiple ways to annotate a camera image with its accurate shooting position. Most methods require professional equipment [88, 99, 26], e.g., the NavVis VLX 3D scanner backpack, or involve complex optimization pipelines that are either closed-source or difficult to replicate [128, 129, 130]. We summarize camera relocalization datasets, both indoor and outdoor, in Table 1. It is noteworthy that only the Cambridge dataset [30] relied on a ubiquitous device and open-source software to obtain the 6-DoF ground-truth camera poses, achieving an error level below 10 dm outdoors. Based on this, we emphasize the importance of low-cost and replicable methods for obtaining ground-truth data for training, especially when our users include low-income groups with health disparities. We employ the actively maintained open-source project COLMAP (available at https://github.com/colmap/colmap and licensed under the BSD-3-Clause license) to reconstruct the 3D scene and determine the camera pose for each image in our collection. The 6DoF annotation process is presented in Algorithm 1.

Figure 2: This illustration outlines the steps for converting camera poses from initial coordinates to a unified floor plan world coordinate system: 1) Independent 3D scene construction for images from each video using COLMAP, executed in parallel; 2) Pinpointing several anchor images from each video to the floor plan and geo-registering camera poses using these anchors; 3) Aligning and validating the entire path onto the floor plan; 4) Combining all images from different paths for unified training purposes.

As discussed in § 3.1.1, to capture the dynamic variations in scene evolution over time, we independently collected approximately 400 videos. These videos were recorded by various collectors using different gestures and devices at diverse times. Consequently, the world coordinates for each video are isolated within its own COLMAP 3D reconstruction; they are not geo-registered and cannot be integrated into a unified world coordinate system that would align with a floor plan for navigation purposes. We introduced minimal human labor to accurately pinpoint 5–10 images from each video on the floor plan. This accuracy is achievable because the 3D points of the scene and each image are visualized in the COLMAP GUI. Additionally, we utilized the built-in geo-registration function to transform the world coordinates. Figure 2 shows the process of geo-registering all camera poses from different paths into a unified world coordinate system, ensuring alignment with the floor plan.
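Geo-registration of this kind amounts to estimating a similarity transform (scale, rotation, and translation) from a handful of anchor correspondences. The sketch below is a simplified 2D, top-down illustration and not the COLMAP routine itself: it fits the least-squares map $z \mapsto az + b$ using complex numbers, and the anchor coordinates are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def fit_similarity_2d(src: List[Point], dst: List[Point]) -> Tuple[complex, complex]:
    """Least-squares fit of z -> a*z + b mapping src anchors onto dst anchors.

    The complex number a encodes scale and rotation; b encodes translation.
    """
    if len(src) != len(dst) or len(src) < 2:
        raise ValueError("need at least two matching anchors")
    zs = [complex(*p) for p in src]
    ws = [complex(*p) for p in dst]
    zc = sum(zs) / len(zs)  # centroid of source anchors
    wc = sum(ws) / len(ws)  # centroid of floor-plan anchors
    den = sum(abs(z - zc) ** 2 for z in zs)
    if den == 0:
        raise ValueError("source anchors must not all coincide")
    a = sum((w - wc) * (z - zc).conjugate() for z, w in zip(zs, ws)) / den
    b = wc - a * zc
    return a, b


def apply_similarity(a: complex, b: complex, point: Point) -> Point:
    """Map a point from reconstruction coordinates to floor-plan coordinates."""
    w = a * complex(*point) + b
    return (w.real, w.imag)


# Hypothetical anchors: reconstruction coordinates and their floor-plan positions
# (here related by scale 2, a 90-degree rotation, and a shift of (10, 5)).
sfm_anchors = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
plan_anchors = [(10.0, 5.0), (10.0, 7.0), (8.0, 5.0)]
a, b = fit_similarity_2d(sfm_anchors, plan_anchors)
print(abs(a))                                # scale, about 2.0
print(apply_similarity(a, b, (1.0, 1.0)))    # about (8.0, 7.0)
```

Once the transform is fitted from the anchors of one video, every other camera position in that video can be mapped onto the floor plan with `apply_similarity`; the full problem is the 3D analogue of this fit.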

Table 1: Summary of camera relocalization datasets. This table exclusively compares datasets that are publicly available.
| Dataset | Environment | Device | # Train / Test | Resolution | Scope | Ground Truth Tool | Error Level |
|---|---|---|---|---|---|---|---|
| Dubrovnik 6K [131] | Outdoor | – | 6K / 0.8K | – | 1.5 × 1.5 km² | SIFT Matching | ∼10 m |
| 7-Scenes [88] | Indoor | Kinect depth camera | 16K / 17K | 640 × 480 | 4 × 3 m² | KinectFusion [132] | < 10 cm |
| Cambridge [30] | Outdoor | Smartphone camera | 8.4K / 4.8K | 1920 × 1080 | 500 × 100 m² | MVS [133] | < 10 dm |
| 12-Scenes [128] | Indoor | Structure.io depth sensor with iPad | 240K / 6.7K | 1296 × 968 | 80 m³ | VoxelHashing [134] | ∼10 cm |
| TUM-LSI [99] | Indoor | NavVis M3 trolley | 875 / 220 | 4592 × 3448 | 5575 m² | SLAM | < 10 cm |
| InLoc [26] | Indoor | Faro 3D laser scanner | 10K / 0.4K | 1600 × 1200 | 186 m² | LiDAR + Manual | ∼10 dm |
| LaMAR [129] | Indoor & Outdoor | Microsoft Hololens 2, Smartphone, iPad, NavVis M6 or VLX backpack | – | 640 × 480 | 45,000 m² | LiDAR + SfM + VIO [135] | < 10 cm |
| 360Loc [130] | Indoor & Outdoor | Velodyne lidar with 360° camera | 9.3K / – | 6144 × 3072 | 105 × 70 m² | LiDAR + VIO | < 10 cm |
| NaVIP (Ours) | Indoor | Smartphone camera | 212K / 88K | 3840 × 2160 | 40 × 90 m² | COLMAP [23] | < 10 cm |

3.2 PoIs

Figure 3: Floor plan dependent PoIs can be reported simultaneously as the camera pose is determined.

By aligning all images to the floor plan, annotating nearby points-of-interest (PoIs) becomes straightforward. As illustrated in Figure 3, we can mark pixel-level PoI labels for each public area. Predicted images at specific points can then provide feedback on these PoIs to users, depending on their camera positions and orientations. We release the detailed PoIs for each pixel of the HST floor plans to support the development of applications.
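To illustrate how PoI feedback can follow from a predicted pose, the sketch below filters PoIs by distance and by whether they fall inside the camera's horizontal view cone. It is a simplification under stated assumptions: PoIs are reduced to hypothetical named points in floor-plan coordinates (the released labels are pixel-level), and the range and field-of-view thresholds are illustrative values.

```python
import math
from typing import Dict, List, Tuple


def visible_pois(
    position: Tuple[float, float],
    heading_deg: float,
    pois: Dict[str, Tuple[float, float]],
    max_dist: float = 5.0,
    fov_deg: float = 90.0,
) -> List[Tuple[str, float, float]]:
    """Return (name, distance, relative bearing) for PoIs in range and in view.

    The bearing is measured relative to the heading, in degrees, with
    counter-clockwise angles positive; results are sorted by distance.
    """
    found = []
    for name, (px, py) in pois.items():
        dx, dy = px - position[0], py - position[1]
        dist = math.hypot(dx, dy)
        if dist == 0 or dist > max_dist:
            continue
        bearing = math.degrees(math.atan2(dy, dx)) - heading_deg
        bearing = (bearing + 180.0) % 360.0 - 180.0  # wrap to [-180, 180)
        if abs(bearing) <= fov_deg / 2:
            found.append((name, dist, bearing))
    return sorted(found, key=lambda item: item[1])


# Hypothetical PoIs in floor-plan metres; the camera looks along the +x axis.
pois = {"Elevator": (3.0, 0.0), "Restroom": (2.0, 1.0),
        "Exit": (-4.0, 0.0), "Lab": (10.0, 0.0)}
nearby = visible_pois(position=(0.0, 0.0), heading_deg=0.0, pois=pois)
print([name for name, _, _ in nearby])  # ['Restroom', 'Elevator']
```

The exit behind the camera and the distant lab are dropped, so only PoIs that are both close and in front of the user are reported.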

3.3 Captions for VIPs

We leverage recent MLLMs, e.g., OpenAI GPT-4 and Google Gemini, to generate tailored image descriptions that meet the specific needs of VIPs. Our prompt design considers both the capabilities of these models and the feedback from VIP volunteers. For more details on our explorations of prompting MLLMs and output examples of image descriptions, please refer to Appendix B. We offer three types of image descriptions for each image:

  • A concise description suitable for general use.

  • A detailed description specifically designed for individuals with vision impairments (including those with low vision or acquired blindness).

  • Another detailed description tailored for individuals who have been blind since birth.
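The three description types map naturally onto three prompt variants. The snippet below is only a hypothetical sketch of such a configuration: the prompt wording is a placeholder invented for illustration and is not the prompt design reported in Appendix B, and no MLLM API call is shown.

```python
from typing import Dict

# Placeholder instructions, one per audience listed above (hypothetical wording).
AUDIENCE_INSTRUCTIONS: Dict[str, str] = {
    "general": "Give a concise description of the scene.",
    "low_vision_or_acquired_blindness": (
        "Give a detailed description for a person with low vision or acquired blindness."
    ),
    "blind_since_birth": (
        "Give a detailed description for a person who has been blind since birth."
    ),
}


def build_prompts(context: str = "indoor navigation") -> Dict[str, str]:
    """Attach a shared task context to every audience-specific instruction."""
    return {
        audience: f"Task context: {context}. {instruction}"
        for audience, instruction in AUDIENCE_INSTRUCTIONS.items()
    }


prompts = build_prompts()
for audience, prompt in prompts.items():
    print(audience, "->", prompt)
```

Keeping the audience-specific part separate from the shared context makes it straightforward to revise one description type after feedback without touching the others.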

4 Benchmarking

To evaluate the generalization ability across ever-changing indoor environments, we use images captured prior to April 15th as the training set and images captured after May 1st as the test set throughout the benchmarking process.
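A date-based split of this kind is simple to express in code. The sketch below is illustrative only: the year 2024 is inferred from the collection period (December 2023 through June 2024), the function name is hypothetical, and leaving images captured between the two dates unassigned is an assumption of this sketch.

```python
from datetime import date
from typing import Optional

TRAIN_BEFORE = date(2024, 4, 15)  # training images: captured prior to April 15th
TEST_AFTER = date(2024, 5, 1)     # test images: captured after May 1st


def assign_split(captured: date) -> Optional[str]:
    """Return 'train', 'test', or None for images in the gap between the two dates."""
    if captured < TRAIN_BEFORE:
        return "train"
    if captured > TEST_AFTER:
        return "test"
    return None


print(assign_split(date(2023, 12, 20)))  # train
print(assign_split(date(2024, 6, 1)))    # test
print(assign_split(date(2024, 4, 20)))   # None
```

Splitting by time rather than at random keeps later scene changes out of the training data, which is what makes the test set a check on generalization.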

4.1 Camera Relocalization

Preliminary. For clarity and consistency, we represent the camera pose in 3D space by a 2-element tuple $[\boldsymbol{x},\boldsymbol{q}]$ according to [30], where $\boldsymbol{x}\in\mathbb{R}^{3}$ defines the position of the camera center in 3D Cartesian coordinates and $\boldsymbol{q}\in\mathbb{R}^{4}$ is the unit quaternion encoding its orientation. Other variations focus on orientation representations, with BranchPoseNet [136] employing an Euler angle representation and MapNet [137] using the logarithm of the unit quaternion.
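For readers unfamiliar with the log-quaternion representation, the following minimal sketch converts a quaternion $(w, x, y, z)$ to its 3D logarithm. It assumes the common convention of normalizing first and restricting the quaternion to the hemisphere $w \ge 0$; the benchmarked implementations may differ in such details.

```python
import math
from typing import Sequence, Tuple


def log_quaternion(q: Sequence[float]) -> Tuple[float, float, float]:
    """Map a quaternion (w, x, y, z) to its logarithm, a 3D vector.

    The result points along the rotation axis and its length is half the rotation angle.
    """
    norm = math.sqrt(sum(c * c for c in q))
    w, x, y, z = (c / norm for c in q)
    if w < 0:  # q and -q encode the same rotation; keep one hemisphere
        w, x, y, z = -w, -x, -y, -z
    v_norm = math.sqrt(x * x + y * y + z * z)
    if v_norm < 1e-12:  # identity rotation
        return (0.0, 0.0, 0.0)
    k = math.acos(max(-1.0, min(1.0, w))) / v_norm
    return (k * x, k * y, k * z)


# A 90-degree rotation about the z axis.
q_90z = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
print(log_quaternion(q_90z))  # approximately (0.0, 0.0, 0.785), i.e. pi/4 along z
```

The logarithm has three free parameters instead of four, so a network can regress it without an explicit unit-norm constraint.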

Our purpose is to directly regress the camera pose $[\hat{\boldsymbol{x}},\hat{\boldsymbol{q}}]$ from a single monocular image $I$ using the trained function $f$. The standard objective loss function is defined as follows:

$$\mathcal{L}(I)=\left\|\hat{\boldsymbol{x}}-\boldsymbol{x}\right\|_{2}+\beta\cdot\left\|\hat{\boldsymbol{q}}-\frac{\boldsymbol{q}}{\|\boldsymbol{q}\|_{2}}\right\|_{2}, \tag{1}$$

where $[\hat{\boldsymbol{x}},\hat{\boldsymbol{q}}]=f(I)$ represents the values predicted by our models. Notably, $\beta$ is a hyperparameter introduced to balance the learning scales between position and orientation. Learnable PoseNet [91] captures the homoscedastic uncertainty [138] between the two tasks and omits $\beta$ as

$$\mathcal{L}(I)=e^{-s_{x}}\cdot\left\|\hat{\boldsymbol{x}}-\boldsymbol{x}\right\|_{2}+e^{-s_{q}}\cdot\left\|\hat{\boldsymbol{q}}-\frac{\boldsymbol{q}}{\|\boldsymbol{q}\|_{2}}\right\|_{2}+s_{x}+s_{q}, \tag{2}$$

where $s_{x}$ and $s_{q}$ are both learnable and only an approximate initial guess is required.
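Both objectives can be written down directly from Eqs. (1) and (2). The sketch below evaluates them for a single image in plain Python so the terms are easy to inspect; the function names and example values are hypothetical, and in actual training these would be tensor operations in a deep learning framework with $s_x$ and $s_q$ registered as trainable parameters.

```python
import math
from typing import List, Sequence


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def unit(q: Sequence[float]) -> List[float]:
    """Normalize a quaternion to unit length, as done for the target in Eqs. (1)-(2)."""
    norm = math.sqrt(sum(c * c for c in q))
    return [c / norm for c in q]


def posenet_loss(x_hat, q_hat, x, q, beta: float) -> float:
    """Eq. (1): position error plus beta-weighted orientation error."""
    return l2_distance(x_hat, x) + beta * l2_distance(q_hat, unit(q))


def learnable_loss(x_hat, q_hat, x, q, s_x: float, s_q: float) -> float:
    """Eq. (2): both error terms weighted by exp(-s), plus the regularizers s_x + s_q."""
    return (math.exp(-s_x) * l2_distance(x_hat, x)
            + math.exp(-s_q) * l2_distance(q_hat, unit(q))
            + s_x + s_q)


# Hypothetical prediction and ground truth for one image.
x_hat, x = (1.0, 2.0, 0.5), (1.0, 2.0, 0.0)            # 0.5 m position error
q_hat, q = (1.0, 0.0, 0.0, 0.1), (2.0, 0.0, 0.0, 0.0)  # target is normalized first
print(posenet_loss(x_hat, q_hat, x, q, beta=math.exp(5)))
print(learnable_loss(x_hat, q_hat, x, q, s_x=0.0, s_q=-5.0))
```

The additive terms $s_x + s_q$ in Eq. (2) keep the learned weights from growing without bound, since raising either $s$ lowers its error weight but raises the penalty.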

Table 2: Mean and median errors (↓, lower is better; position in m, orientation in degrees) of models across four floors (Basement, Lower Level, Level 1, and Level 2) in our dataset. Benchmarking is limited to models with publicly available code.
| Model | Backbone | Basement Mean | Basement Median | Lower Level Mean | Lower Level Median | Level 1 Mean | Level 1 Median | Level 2 Mean | Level 2 Median |
|---|---|---|---|---|---|---|---|---|---|
| PoseNet [30] | ResNet-34 | 0.52m, 5.50° | 0.42m, 4.52° | 0.96m, 7.14° | 0.71m, 5.70° | 0.91m, 7.65° | 0.52m, 6.21° | 1.12m, 6.78° | 0.72m, 5.99° |
| | MobileNet-V3 | 0.60m, 5.78° | 0.52m, 5.01° | 0.95m, 7.10° | 0.69m, 5.59° | 0.90m, 7.94° | 0.50m, 6.34° | 1.11m, 6.91° | 0.73m, 6.04° |
| Bayesian PoseNet [98] | ResNet-34 | 0.57m, 6.88° | 0.51m, 5.70° | 1.02m, 8.34° | 0.75m, 6.04° | 0.98m, 9.04° | 0.61m, 7.38° | 1.09m, 7.34° | 0.71m, 6.51° |
| | MobileNet-V3 | 0.62m, 6.99° | 0.60m, 5.98° | 1.05m, 8.73° | 0.79m, 6.48° | 1.01m, 8.93° | 0.72m, 6.76° | 1.09m, 7.52° | 0.72m, 6.58° |
| LSTM-PoseNet [99] | ResNet-34 | 0.49m, 6.20° | 0.42m, 5.13° | 0.92m, 7.43° | 0.74m, 5.97° | 0.81m, 7.54° | 0.51m, 5.49° | 0.92m, 6.17° | 0.59m, 4.73° |
| | MobileNet-V3 | 0.52m, 6.49° | 0.41m, 6.24° | 0.94m, 7.83° | 0.73m, 5.99° | 0.84m, 7.62° | 0.52m, 5.73° | 0.95m, 6.21° | 0.60m, 4.41° |
| Learnable PoseNet [91] | ResNet-34 | 0.39m, 3.41° | 0.33m, 2.19° | 0.76m, 3.24° | 0.50m, 3.19° | 0.70m, 4.30° | 0.46m, 3.84° | 0.87m, 3.90° | 0.49m, 2.41° |
| | MobileNet-V3 | 0.42m, 3.90° | 0.34m, 2.81° | 0.78m, 3.20° | 0.53m, 3.14° | 0.75m, 4.76° | 0.47m, 3.90° | 0.91m, 4.74° | 0.46m, 3.82° |
| Geometric PoseNet [91] | ResNet-34 | 0.41m, 4.45° | 0.33m, 3.01° | 0.80m, 3.41° | 0.55m, 3.24° | 0.73m, 4.28° | 0.47m, 3.87° | 0.90m, 5.21° | 0.57m, 3.43° |
| | MobileNet-V3 | 0.46m, 3.89° | 0.39m, 2.80° | 0.83m, 3.40° | 0.56m, 3.19° | 0.77m, 4.91° | 0.49m, 4.19° | 0.94m, 5.32° | 0.58m, 3.49° |
| Hourglass PoseNet [100] | ResNet-34 | 0.45m, 5.95° | 0.34m, 4.91° | 0.88m, 5.47° | 0.62m, 4.79° | 0.81m, 6.23° | 0.49m, 4.02° | 0.85m, 6.87° | 0.49m, 3.96° |
| | MobileNet-V3 | 0.53m, 6.18° | 0.48m, 6.03° | 0.89m, 5.86° | 0.60m, 4.79° | 0.83m, 6.48° | 0.50m, 4.18° | 0.94m, 8.31° | 0.52m, 4.37° |
| BranchNet-Euler6 [136] | ResNet-34 | 0.43m, 5.99° | 0.32m, 4.72° | 0.79m, 4.00° | 0.48m, 3.25° | 0.94m, 9.21° | 0.65m, 6.27° | 1.04m, 7.26° | 0.78m, 6.13° |
| | MobileNet-V3 | 0.50m, 6.23° | 0.41m, 4.82° | 0.84m, 4.31° | 0.51m, 3.36° | 0.98m, 9.74° | 0.66m, 6.40° | 1.16m, 8.17° | 0.82m, 7.10° |
| MapNet [103] | ResNet-34 | 0.39m, 3.41° | 0.31m, 2.71° | 0.70m, 3.75° | 0.42m, 3.10° | 0.65m, 4.58° | 0.44m, 3.90° | 0.82m, 5.43° | 0.48m, 3.97° |
| | MobileNet-V3 | 0.40m, 3.37° | 0.29m, 2.65° | 0.74m, 3.76° | 0.43m, 3.22° | 0.66m, 4.79° | 0.41m, 4.21° | 0.86m, 4.27° | 0.50m, 3.21° |
| MSPN [139] | ResNet-34 | 0.39m, 3.40° | 0.30m, 2.68° | 0.68m, 3.70° | 0.42m, 3.13° | 0.69m, 5.17° | 0.47m, 3.96° | 0.91m, 3.71° | 0.48m, 2.53° |
| | MobileNet-V3 | 0.41m, 3.38° | 0.30m, 2.58° | 0.73m, 3.94° | 0.45m, 3.27° | 0.71m, 5.32° | 0.48m, 3.87° | 0.92m, 4.14° | 0.49m, 3.04° |
| Direct-PoseNet [102] | ResNet-34 | 0.35m, 3.94° | 0.27m, 3.10° | 0.69m, 3.77° | 0.44m, 2.69° | 0.63m, 4.74° | 0.41m, 3.90° | 0.84m, 3.94° | 0.46m, 2.97° |
| | MobileNet-V3 | 0.39m, 3.49° | 0.28m, 2.88° | 0.70m, 3.71° | 0.45m, 2.71° | 0.61m, 4.82° | 0.42m, 3.89° | 0.88m, 4.10° | 0.45m, 3.04° |
| MS-Transformer [92] | ResNet-34 | 0.35m, 4.47° | 0.26m, 3.78° | 0.63m, 3.74° | 0.43m, 2.50° | 0.60m, 4.66° | 0.41m, 3.84° | 0.83m, 4.21° | 0.43m, 2.76° |
| | MobileNet-V3 | 0.44m, 5.26° | 0.36m, 4.19° | 0.65m, 3.89° | 0.41m, 2.43° | 0.61m, 4.06° | 0.45m, 3.17° | 0.85m, 4.18° | 0.43m, 2.96° |
| PAE [104] | ResNet-34 | 0.40m, 3.99° | 0.27m, 2.87° | 0.71m, 3.79° | 0.45m, 2.57° | 0.71m, 5.23° | 0.49m, 4.15° | 0.96m, 4.73° | 0.51m, 3.25° |
| | MobileNet-V3 | 0.41m, 4.10° | 0.27m, 2.93° | 0.73m, 3.80° | 0.47m, 2.63° | 0.71m, 5.23° | 0.49m, 4.15° | 0.98m, 4.69° | 0.52m, 3.09° |
| DFNet [105] | ResNet-34 | 0.36m, 3.49° | 0.28m, 2.58° | 0.60m, 3.80° | 0.43m, 2.78° | 0.64m, 4.82° | 0.43m, 3.85° | 0.81m, 4.37° | 0.43m, 2.81° |
| | MobileNet-V3 | 0.40m, 3.56° | 0.34m, 3.16° | 0.61m, 3.72° | 0.42m, 2.55° | 0.67m, 5.01° | 0.47m, 3.91° | 0.80m, 4.26° | 0.45m, 2.90° |

Settings. To ensure a fair comparison in benchmarking camera relocalization, we employed identical CNN architectures (ResNet-34 [34] and MobileNet-V3 [140]) across 13 models ranging from the pioneering PoseNet [30] to its latest advancements [104, 105]. All the above models were trained for 200 epochs. Specifically, for the PoseNet-series models, we utilized the loss function defined in Eq.(1), setting $\beta$ to $e^{5}$ across all four floors (tuning $\beta$ specifically for each floor may yield better results). In the Bayesian PoseNet, uncertainty is incorporated only before the layers with randomly initialized weights, following [98]. LSTM-PoseNet models have the hidden size of all LSTM units set at 256 to achieve optimal results [99]. Learnable PoseNet starts with initial guesses of $s_{x}$ and $s_{q}$ set to $0$ and $-5$, respectively, for all scenes [91], and then Geometric PoseNet continues training these models using geometric reprojection data to balance positional and rotational errors per image. We employed feature map concatenation in ResNet-34 and element-wise summation in MobileNet-V3 to combine features from the front layers and achieve the optimal results in Hourglass PoseNet [100]. For BranchNet-Euler6 [136], the networks are split before the final convolutional block to facilitate multi-task learning. MapNet [103] configurations avoid updating model weights with unlabeled data to maintain fairness in APR comparisons. Both MapNet and MSPN [139] adopt log-quaternion for rotation representation. Direct-PoseNet maintains a direct matching ratio of photometric difference to the loss function Eq.(1) at $3/7$ [102]. 
MS-Transformer [92] replaces only the convolutional backbones to extract activation maps while preserving the encoder-decoder architecture of Transformers. PAE embeds $\boldsymbol{x}$ and $\boldsymbol{q}$ using Fourier Features with expanded dimensions of 12 ($L=6$ in [104, Eq.(5)]). Finally, DFNet results are confined to single-frame APR [105]. For more details of learning settings, please refer to Appendix C and the publicly available code.
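To make the two weighting schemes in these settings concrete, the following is a minimal sketch of the fixed-weight and learnable-weight pose losses. It assumes that Eq.(1) takes the original PoseNet form [30] (position residual plus $\beta$ times the residual of the normalized quaternion) and that Learnable PoseNet uses the homoscedastic-uncertainty weighting of [91]. The function names are hypothetical, and $s_x$ and $s_q$ are plain floats here, whereas in training they would be optimized jointly with the network.

```python
import math

def pose_residuals(x_pred, x_true, q_pred, q_true):
    """Position and orientation residual norms for a single image."""
    q_norm = math.sqrt(sum(c * c for c in q_pred))
    q_unit = [c / q_norm for c in q_pred]   # normalize the predicted quaternion
    l_x = math.dist(x_pred, x_true)          # Euclidean position residual
    l_q = math.dist(q_unit, q_true)          # Euclidean quaternion residual
    return l_x, l_q

def beta_weighted_loss(l_x, l_q, beta=math.exp(5.0)):
    """Assumed form of Eq.(1): fixed scalar beta balances the two terms."""
    return l_x + beta * l_q

def learnable_weighted_loss(l_x, l_q, s_x=0.0, s_q=-5.0):
    """Assumed form from [91]: s_x and s_q are log-variance-style weights.

    The defaults are the initial guesses stated in the settings.
    """
    return l_x * math.exp(-s_x) + s_x + l_q * math.exp(-s_q) + s_q
```

Under these assumed forms, the initial guesses $s_x = 0$ and $s_q = -5$ make the learnable loss equal to the $\beta = e^{5}$ loss minus the constant 5, so both variants would start from the same balance between position and orientation.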

Figure 4: CDF of errors on MS-Transformer.

Results. Table 2 shows the mean and median errors for both camera positions and orientations. Notably, even the pioneering PoseNet [30] from 2015 achieves sub-meter accuracy, meeting the requirements for indoor navigation applications. We observe that the best performance on average is achieved by MS-Transformer [92], which benefits from its large model capacity enabled by the use of Transformers and a multi-scene mixing training strategy. Figure 4 illustrates the cumulative distribution function (CDF) of MS-Transformer’s prediction errors in positions and orientations across four floors. For additional experimental results and visualization, please refer to Appendix C.
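The statistics reported here can be sketched as follows. This is an illustrative computation only, not the evaluation script: it assumes that position error is the Euclidean distance in meters and that orientation error is the angle between the predicted and ground-truth quaternions in degrees, which is the common convention in the APR literature. All function names are hypothetical.

```python
import math
from statistics import mean, median

def position_error(x_pred, x_true):
    """Euclidean distance between predicted and true camera positions."""
    return math.dist(x_pred, x_true)

def orientation_error_deg(q_pred, q_true):
    """Angle in degrees between two quaternions (sign-invariant)."""
    n_pred = math.sqrt(sum(c * c for c in q_pred))
    n_true = math.sqrt(sum(c * c for c in q_true))
    dot = abs(sum(a * b for a, b in zip(q_pred, q_true))) / (n_pred * n_true)
    # Clamp to guard against rounding slightly above 1 before acos.
    return math.degrees(2.0 * math.acos(min(1.0, dot)))

def empirical_cdf(errors, threshold):
    """Fraction of test images whose error is at most `threshold`."""
    return sum(e <= threshold for e in errors) / len(errors)

def summarize(errors, threshold):
    """Mean, median, and one CDF value for a list of per-image errors."""
    return {
        "mean": mean(errors),
        "median": median(errors),
        "cdf_at_threshold": empirical_cdf(errors, threshold),
    }
```

Evaluating `empirical_cdf` over a sweep of thresholds yields a curve of the kind shown in Figure 4.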

4.2 Image Captioning

We can train a captioning model using outputs from GPT-4 to meet the varied needs of VIPs. Image captioning requires fluent descriptions to translate images into natural language [141, 142]. We mix image data from four floors and use concise descriptions as ground truth to distill [143] our own models. We validate the results using BLEU-4 [144], METEOR [145], CIDEr [146], and SPICE [147] metrics. Additionally, we report the model size in the number of parameters and training time in GPU hours to indicate the feasibility and practicality of this visual intelligence support. The results in Table 3 demonstrate accurate predictions with smaller-size distilled models.

Table 3: Experimental results on concise descriptions.
| Model | Backbone | BLEU-4 ↑ | METEOR ↑ | CIDEr ↑ | SPICE ↑ | #Params | Training Time ↓ |
|---|---|---|---|---|---|---|---|
| ClipCap [148] | CLIP (ViT-B/32) + GPT-2 tuning | 35.18 | 27.34 | 113.83 | 20.19 | 156 M | 46h (A6000) |
| OFA$_{\text{base}}$ [149] | ResNet101 + Transformer | 36.31 | 30.02 | 126.77 | 26.85 | 180 M | 53h (A6000) |
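For readers unfamiliar with the first metric in Table 3, the sketch below computes a simplified sentence-level BLEU-4 [144] for one candidate caption against one reference. It is an assumption-laden illustration, not the evaluation code behind Table 3: it uses whitespace tokenization, a single reference, and no smoothing, whereas reported scores are usually computed at corpus level and scaled by 100. The function names are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(candidate, reference):
    """Unsmoothed sentence-level BLEU-4 in [0, 1] against one reference."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, 5):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, any zero precision gives zero
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty discourages candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1.0 - len(ref) / len(cand))
    # Geometric mean of the four modified n-gram precisions.
    return bp * math.exp(sum(log_precisions) / 4.0)
```

A caption identical to its reference scores 1.0, and a caption that shares no 4-gram with the reference scores 0.0 under this unsmoothed variant.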

5 Potential Applications

APR methods development. Popular datasets commonly used for benchmarking APR methods include 7-Scenes [88] and Cambridge Landmarks [30]. These datasets, however, are limited in scope and size, which can lead to the overfitting already observed in previous research [30, 91, 99, 100]. This limitation complicates the performance assessment of newly proposed models, particularly in an era dominated by deep learning and PFMs. For instance, Bayesian PoseNet [98] performs worse than the classic PoseNet on our dataset. This discrepancy arises because any dropout rate applied to PoseNet constrains its capacity rather than serving its intended purpose of regularization.

VLN test under real-world environments. We plan to release this comprehensive dataset, which includes not only the original videos but also the floor plans annotated with PoIs. Each video can be segmented into multiple clips, each representing a unique navigation path. These clips are associated with an automated point-to-point (PoI-to-PoI) oracle that can be used for VLN model training. Moreover, since descriptions for each image will be available, exploring a VLN model based on PoI-to-PoI instructions that operates without detailed step-by-step guidance in natural language appears promising, particularly in unknown real-world environments. Our dataset has the potential to facilitate more robust and flexible navigation solutions, which are essential for navigating dynamic or unfamiliar spaces effectively.

Floor plan navigation development. Each image in our dataset is geo-registered to a corresponding floor plan, facilitating the development of learning-based methods that utilize only RGB images for localization. While LiDAR-based localization techniques have been explored extensively in recent research [150, 151, 152, 153], their practical application is often constrained by the hardware capabilities of commonly used mobile devices. In contrast, we anticipate that purely image-based floor plan navigation will become more viable with the availability of this large indoor navigation dataset. This could pave the way for more accessible and widely deployable navigation solutions that leverage visual data alone.

Deployment and test of “NaVIP” everywhere. We will provide comprehensive details regarding the setup and organization of our NaVIP solution within the Lehigh HST building. The data collection process requires only a mobile phone, and to simplify this process, we will also release the corresponding code, including data pre-processing and annotation. Given the limited time and resources invested in developing this effective pipeline, we have not yet tested our solutions in other buildings. We encourage any groups seeking an affordable and intelligent indoor navigation system to adopt and apply this pipeline in their settings.

6 Limitations and Discussion

Inherent biases of human data collection. In NaVIP, the video data were collected by sighted individuals, which inevitably introduced biases from the collectors themselves. To mitigate this and reduce human labor in developing such data-driven navigation systems, one possible solution is to use a simple robot equipped with a phone camera that navigates randomly to gather video data. Nonetheless, both sighted individuals and robotic agents fail to capture the user needs and data distribution pertinent to VIPs. Inspired by VizWiz [121], we believe that involving VIPs as participants in the data collection process is beneficial, as our primary objective is to assist VIPs in navigating unfamiliar environments. Efforts to address this limitation will not only enhance the practicality of the solutions but also ensure their genuine benefit to the intended users.

How to merge 3D reconstructions losslessly? In our dataset, we utilize COLMAP to reconstruct the 3D models of indoor environments along each video path. Because indoor environments change over time, there is currently no robust algorithm for merging 3D point clouds captured at different times. As a workaround, we rely on human supervision to pinpoint several anchor images directly on the floor plan. These annotated points drive the geo-registration process built into COLMAP, which transforms the coordinates of all 3D point clouds. This alignment method is not lossless and results in an inevitable system error in camera poses of approximately 0.5 meters. We will therefore release both the sparse and dense models generated by COLMAP, before and after geo-registration, to facilitate further research in this area. This contribution is expected to further enhance the accuracy of positioning systems for indoor navigation.
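The idea behind anchor-based geo-registration can be illustrated with a least-squares similarity transform. The sketch below is a deliberate simplification and not COLMAP's implementation: it works in 2D (the floor-plan plane) rather than 3D, represents points as complex numbers so that scale and rotation reduce to one complex factor, and uses hypothetical function names. It needs at least two distinct anchor points.

```python
def fit_similarity_2d(src, dst):
    """Least-squares fit of dst ≈ a * src + b, with complex a and b.

    src: anchor positions in reconstruction coordinates, as (x, y) pairs.
    dst: the same anchors pinpointed on the floor plan, as (x, y) pairs.
    """
    zs = [complex(x, y) for x, y in src]
    ws = [complex(x, y) for x, y in dst]
    z_mean = sum(zs) / len(zs)
    w_mean = sum(ws) / len(ws)
    num = sum((w - w_mean) * (z - z_mean).conjugate() for z, w in zip(zs, ws))
    den = sum(abs(z - z_mean) ** 2 for z in zs)
    a = num / den            # abs(a) is the scale, the phase of a is the rotation
    b = w_mean - a * z_mean  # translation
    return a, b

def apply_similarity_2d(a, b, points):
    """Map reconstruction-frame points into floor-plan coordinates."""
    mapped = []
    for x, y in points:
        w = a * complex(x, y) + b
        mapped.append((w.real, w.imag))
    return mapped
```

With noisy anchors the fit is only approximate, and the residual between the mapped anchors and their annotated floor-plan positions gives one way to quantify an alignment error of the kind discussed above.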

How robust is the image-centric positioning solution? We acknowledge the limitations in our experimental analysis regarding the robustness of this image-centric positioning system. Although our dataset was collected directly from real-world scenarios using common mobile devices, the performance of this system under extreme conditions, e.g., electronic failures leading to dark environments or post-construction changes within buildings, remains unexplored. To simulate a distribution shift due to changes in the building environment, our dataset incorporates a temporal gap between the training and testing datasets. Despite this, we observe no decline in performance over time.

Is GPT-4 ready for assisting VIPs? Although we meticulously designed prompts tailored to the needs of VIPs (refer to Appendix B for our exploration), we encountered challenges in meeting their personalized requirements. Key challenges include: 1) optimizing GPT-4 as an image descriptor for VIPs to enhance accuracy and eliminate misinformation, and 2) developing prompts that guide GPT-4 to generate information that aligns with user expectations. To further explore these issues, we have released the outputs of GPT-4 on our dataset of 300K images.

7 Conclusion and Future Work

In this paper, we have redirected our research from conventional sensor-based navigation systems to purely vision-based solutions for indoor environments. To facilitate this shift, we created a large image-centric dataset, named NaVIP, within the largest building at Lehigh University, the HST, specifically for research purposes. Our comprehensive pilot experiments on benchmarking APR methods by leveraging real-time end-to-end inference in deep neural networks have validated its feasibility and accuracy. This solution not only streamlines the system architecture but also enhances its applicability and usability, thereby extending its utility in assisting VIPs. The integration of image captioning models distilled from GPT-4 further highlights the potential for independent exploration by VIPs. As we look to the future, our research will focus on developing a mobile application that incorporates our trained model. We plan to test it within the Lehigh community, supported by approval from the Institutional Review Board (IRB), to ensure comprehensive human feedback integration into the study. Moreover, we intend to explore advanced unsupervised learning techniques and reinforcement learning from human feedback (RLHF) to further enhance the functionality and user experience of our mobile application. The next phase of our research will concentrate on incorporating real-world user insights to refine and optimize the navigational aid, aiming to create a more adaptive and effective tool for end-users.

Acknowledgments and Disclosure of Funding

We thank HST building coordinator Emily Diaz-Kempf and laboratory manager Chris Panko Graff for their support in video recording in the HST public area. We also extend our gratitude to Eashan Adhikarla for providing the computer used to run COLMAP in parallel, and Deven Bhadane for his time and effort in designing the logo for this work. Special thanks to our visually impaired friend, Joel Isaac, for volunteering and sharing their needs for an indoor navigation application. This work was partially supported by the U.S. National Science Foundation through awards #2409227, #2340870, and #2345057.

References

  • [1] Haosheng Huang and Georg Gartner. A survey of mobile indoor navigation systems. Springer, 2010.
  • [2] Jayakanth Kunhoth, AbdelGhani Karkar, Somaya Al-Maadeed, and Abdulla Al-Ali. Indoor positioning and wayfinding systems: a survey. Human-centric Computing and Information Sciences, 10(1):1–41, 2020.
  • [3] Zeev Volkovich, Elena V Ravve, and Renata Avros. Indoor navigation in facilities with repetitive structures. Sensors, 24(9):2876, 2024.
  • [4] Matteo Cypriani, Frédéric Lassabe, Philippe Canalda, and François Spies. Open wireless positioning system: A wi-fi-based indoor positioning system. In 2009 IEEE 70th Vehicular Technology Conference Fall, pages 1–5. IEEE, 2009.
  • [5] Richard Wandell, Md Shafaeat Hossain, and Ishtiaque Hussain. A cost-effective wi-fi-based indoor positioning system for mobile phones. Wireless Networks, 29(6):2845–2862, 2023.
  • [6] Yuan Zhuang, Jun Yang, You Li, Longning Qi, and Naser El-Sheimy. Smartphone-based indoor localization with bluetooth low energy beacons. Sensors, 16(5):596, 2016.
  • [7] Vicente Cantón Paterna, Anna Calveras Auge, Josep Paradells Aspas, and Maria Alejandra Perez Bullones. A bluetooth low energy indoor positioning system with channel diversity, weighted trilateration and kalman filtering. Sensors, 17(12):2927, 2017.
  • [8] Kamil Szyc, Maciej Nikodem, and Michał Zdunek. Bluetooth low energy indoor localization for large industrial areas and limited infrastructure. Ad Hoc Networks, 139:103024, 2023.
  • [9] Tan Kim Geok, Khaing Zar Aung, Moe Sandar Aung, Min Thu Soe, Azlan Abdaziz, Chia Pao Liew, Ferdous Hossain, Chih P Tso, and Wong Hin Yong. Review of indoor positioning: Radio wave technology. Applied Sciences, 11(1):279, 2020.
  • [10] Fuhu Che, Qasim Zeeshan Ahmed, Pavlos I Lazaridis, Pradorn Sureephong, and Temitope Alade. Indoor positioning system (ips) using ultra-wide bandwidth (uwb)—for industrial internet of things (iiot). Sensors, 23(12):5710, 2023.
  • [11] Jijun Geng, Linyuan Xia, Jingchao Xia, Qianxia Li, Hongyu Zhu, and Yuezhen Cai. Smartphone-based pedestrian dead reckoning for 3d indoor positioning. Sensors, 21(24):8180, 2021.
  • [12] Suqing Yan, Yalan Su, Xiaonan Luo, Anqing Sun, Yuanfa Ji, and Kamarul Hawari bin Ghazali. Deep learning-based geomagnetic navigation method integrated with dead reckoning. Remote Sensing, 15(17):4165, 2023.
  • [13] Zachary Teed and Jia Deng. Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras. Advances in neural information processing systems, 34:16558–16569, 2021.
  • [14] Andréa Macario Barros, Maugan Michel, Yoann Moline, Gwenolé Corre, and Frédérick Carrel. A comprehensive survey of visual slam algorithms. Robotics, 11(1):24, 2022.
  • [15] Mark Fiala. Designing highly reliable fiducial markers. IEEE Transactions on Pattern analysis and machine intelligence, 32(7):1317–1324, 2009.
  • [16] Michail Kalaitzakis, Brennan Cain, Sabrina Carroll, Anand Ambrosi, Camden Whitehead, and Nikolaos Vitzilaios. Fiducial markers for pose estimation: Overview, applications and experimental comparison of the artag, apriltag, aruco and stag markers. Journal of Intelligent & Robotic Systems, 101:1–26, 2021.
  • [17] Gabriel De Blasio, Alexis Quesada-Arencibia, Carmelo R García, Jezabel Miriam Molina-Gil, and Cándido Caballero-Gil. Study on an indoor positioning system for harsh environments based on wi-fi and bluetooth low energy. Sensors, 17(6):1299, 2017.
  • [18] Jiang Dong, Marius Noreikis, Yu Xiao, and Antti Ylä-Jääski. Vinav: A vision-based indoor navigation system for smartphones. IEEE Transactions on Mobile Computing, 18(6):1461–1475, 2018.
  • [19] Daquan Feng, Junjie Peng, Yuan Zhuang, Chongtao Guo, Tingting Zhang, Yinghao Chu, Xiaoan Zhou, and Xiang-Gen Xia. An adaptive imu/uwb fusion method for nlos indoor positioning and navigation. IEEE Internet of Things Journal, 2023.
  • [20] Kanak Manjari, Madhushi Verma, and Gaurav Singal. A survey on assistive technology for visually impaired. Internet of Things, 11:100188, 2020.
  • [21] Fatma El-Zahraa El-Taher, Ayman Taha, Jane Courtney, and Susan Mckeever. A systematic review of urban navigation systems for visually impaired people. Sensors, 21(9):3103, 2021.
  • [22] Bineeth Kuriakose, Raju Shrestha, and Frode Eika Sandnes. Tools and technologies for blind and visually impaired navigation support: a review. IETE Technical Review, 39(1):3–18, 2022.
  • [23] Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4104–4113, 2016.
  • [24] Paul-Edouard Sarlin, Cesar Cadena, Roland Siegwart, and Marcin Dymczyk. From coarse to fine: Robust hierarchical localization at large scale. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12716–12725, 2019.
  • [25] Akihiko Torii, Relja Arandjelovic, Josef Sivic, Masatoshi Okutomi, and Tomas Pajdla. 24/7 place recognition by view synthesis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1808–1817, 2015.
  • [26] Hajime Taira, Masatoshi Okutomi, Torsten Sattler, Mircea Cimpoi, Marc Pollefeys, Josef Sivic, Tomas Pajdla, and Akihiko Torii. Inloc: Indoor visual localization with dense matching and view synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7199–7209, 2018.
  • [27] Martin A Fischler and Robert C Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.
  • [28] Vincent Lepetit, Francesc Moreno-Noguer, and Pascal Fua. EPnP: An accurate O(n) solution to the PnP problem. International journal of computer vision, 81:155–166, 2009.
  • [29] Torsten Sattler, Bastian Leibe, and Leif Kobbelt. Efficient & effective prioritized matching for large-scale image-based localization. IEEE transactions on pattern analysis and machine intelligence, 39(9):1744–1756, 2016.
  • [30] Alex Kendall, Matthew Grimes, and Roberto Cipolla. Posenet: A convolutional network for real-time 6-dof camera relocalization. In Proceedings of the IEEE international conference on computer vision, pages 2938–2946, 2015.
  • [31] Mohamed Adel Musallam, Vincent Gaudilliere, Miguel Ortiz Del Castillo, Kassem Al Ismaeil, and Djamila Aouada. Leveraging equivariant features for absolute pose regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6876–6886, 2022.
  • [32] Shuai Chen, Yash Bhalgat, Xinghui Li, Jiawang Bian, Kejie Li, Zirui Wang, and Victor Adrian Prisacariu. Refinement for absolute pose regression with neural feature synthesis. arXiv preprint arXiv:2303.10087, 2023.
  • [33] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012.
  • [34] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [35] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
  • [36] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.
  • [37] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. Advances in neural information processing systems, 27, 2014.
  • [38] Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything: Unleashing the power of large-scale unlabeled data. arXiv preprint arXiv:2401.10891, 2024.
  • [39] Darius Plikynas, Arūnas Žvironas, Andrius Budrionis, and Marius Gudauskis. Indoor navigation systems for visually impaired persons: Mapping the features of existing technologies to user needs. Sensors, 20(3):636, 2020.
  • [40] Gaurav Jain, Yuanyang Teng, Dong Heon Cho, Yunhao Xing, Maryam Aziz, and Brian A Smith. "I want to figure things out": Supporting exploration in navigation for people with visual impairments. Proceedings of the ACM on Human-Computer Interaction, 7(CSCW1):1–28, 2023.
  • [41] Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
  • [42] Ce Zhou, Qian Li, Chen Li, Jun Yu, Yixin Liu, Guangjing Wang, Kai Zhang, Cheng Ji, Qiben Yan, Lifang He, et al. A comprehensive survey on pretrained foundation models: A history from bert to chatgpt. arXiv preprint arXiv:2302.09419, 2023.
  • [43] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • [44] Naser El-Sheimy and You Li. Indoor navigation: State of the art and future trends. Satellite Navigation, 2(1):7, 2021.
  • [45] Dawar Khan, Zhanglin Cheng, Hideaki Uchiyama, Sikandar Ali, Muhammad Asshad, and Kiyoshi Kiyokawa. Recent advances in vision-based indoor navigation: A systematic literature review. Computers & Graphics, 104:24–45, 2022.
  • [46] Yongliang Shi, Weimin Zhang, Zhuo Yao, Mingzhu Li, Zhenshuo Liang, Zhongzhong Cao, Hua Zhang, and Qiang Huang. Design of a hybrid indoor location system based on multi-sensor fusion for robot navigation. Sensors, 18(10):3581, 2018.
  • [47] Vishnu Nair, Manjekar Budhai, Greg Olmschenk, William H Seiple, and Zhigang Zhu. Assist: Personalized indoor navigation via multimodal sensors and high-level semantic information. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, pages 0–0, 2018.
  • [48] Xuxin Lin, Jianwen Gan, Chaohao Jiang, Shuai Xue, and Yanyan Liang. Wi-fi-based indoor localization and navigation: A robot-aided hybrid deep learning approach. Sensors, 23(14):6320, 2023.
  • [49] Lamya Albraheem and Sarah Alawad. A hybrid indoor positioning system based on visible light communication and bluetooth rss trilateration. Sensors, 23(16):7199, 2023.
  • [50] Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton Van Den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3674–3683, 2018.
  • [51] Wansen Wu, Tao Chang, Xinmeng Li, Quanjun Yin, and Yue Hu. Vision-language navigation: A survey and taxonomy. Neural Computing and Applications, pages 1–26, 2023.
  • [52] Jing Gu, Eliana Stefani, Qi Wu, Jesse Thomason, and Xin Eric Wang. Vision-and-language navigation: A survey of tasks, methods, and future directions. arXiv preprint arXiv:2203.12667, 2022.
  • [53] Satoshi Tadokoro. Rescue robotics: DDT project on robots and systems for urban search and rescue. Springer Science & Business Media, 2009.
  • [54] Peter Anderson, Ayush Shrivastava, Devi Parikh, Dhruv Batra, and Stefan Lee. Chasing ghosts: Instruction following as bayesian state tracking. Advances in neural information processing systems, 32, 2019.
  • [55] Yuankai Qi, Qi Wu, Peter Anderson, Xin Wang, William Yang Wang, Chunhua Shen, and Anton van den Hengel. Reverie: Remote embodied visual referring expression in real indoor environments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9982–9991, 2020.
  • [56] Chuang Lin, Yi Jiang, Jianfei Cai, Lizhen Qu, Gholamreza Haffari, and Zehuan Yuan. Multimodal transformer with variable-length memory for vision-and-language navigation. In European Conference on Computer Vision, pages 380–397. Springer, 2022.
  • [57] Xiangyang Li, Zihan Wang, Jiahao Yang, Yaowei Wang, and Shuqiang Jiang. Kerm: Knowledge enhanced reasoning for vision-and-language navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2583–2592, 2023.
  • [58] Jingyang Huo, Qiang Sun, Boyan Jiang, Haitao Lin, and Yanwei Fu. Geovln: Learning geometry-enhanced visual representation with slot attention for vision-and-language navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23212–23221, 2023.
  • [59] Rui Liu, Xiaohan Wang, Wenguan Wang, and Yi Yang. Bird’s-eye-view scene graph for vision-language navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10968–10980, 2023.
  • [60] Kunyang Lin, Peihao Chen, Diwei Huang, Thomas H Li, Mingkui Tan, and Chuang Gan. Learning vision-and-language navigation from youtube videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8317–8326, 2023.
  • [61] Chia-Wen Kuo, Chih-Yao Ma, Judy Hoffman, and Zsolt Kira. Structure-encoding auxiliary tasks for improved visual representation in vision-and-language navigation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1104–1113, 2023.
  • [62] Yanyuan Qiao, Zheng Yu, and Qi Wu. Vln-petl: Parameter-efficient transfer learning for vision-and-language navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15443–15452, 2023.
  • [63] Jing Yu Koh, Honglak Lee, Yinfei Yang, Jason Baldridge, and Peter Anderson. Pathdreamer: A world model for indoor navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14738–14748, 2021.
  • [64] Ta-Chung Chi, Minmin Shen, Mihail Eric, Seokhwan Kim, and Dilek Hakkani-tur. Just ask: An interactive learning framework for vision and language navigation. Proceedings of the AAAI Conference on Artificial Intelligence, 34(03):2459–2466, Apr. 2020.
  • [65] Andrea Burns, Deniz Arsan, Sanjna Agrawal, Ranjitha Kumar, Kate Saenko, and Bryan A Plummer. A dataset for interactive vision-language navigation with unknown command feasibility. In European Conference on Computer Vision, pages 312–328. Springer, 2022.
  • [66] Jacob Krantz, Shurjo Banerjee, Wang Zhu, Jason Corso, Peter Anderson, Stefan Lee, and Jesse Thomason. Iterative vision-and-language navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14921–14930, 2023.
  • [67] Daniel Fried, Ronghang Hu, Volkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe Morency, Taylor Berg-Kirkpatrick, Kate Saenko, Dan Klein, and Trevor Darrell. Speaker-follower models for vision-and-language navigation. Advances in Neural Information Processing Systems, 31, 2018.
  • [68] Sanyam Agarwal, Devi Parikh, Dhruv Batra, Peter Anderson, and Stefan Lee. Visual landmark selection for generating grounded and interpretable navigation instructions. In CVPR workshop on Deep Learning for Semantic Visual Navigation, volume 2, 2019.
  • [69] Hao Tan, Licheng Yu, and Mohit Bansal. Learning to navigate unseen environments: Back translation with environmental dropout. arXiv preprint arXiv:1904.04195, 2019.
  • [70] Tsu-Jui Fu, Xin Eric Wang, Matthew F Peterson, Scott T Grafton, Miguel P Eckstein, and William Yang Wang. Counterfactual vision-and-language navigation via adversarial path sampler. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VI 16, pages 71–86. Springer, 2020.
  • [71] Zi-Yi Dou and Nanyun Peng. Foam: A follower-aware speaker model for vision-and-language navigation. arXiv preprint arXiv:2206.04294, 2022.
  • [72] Hanqing Wang, Wei Liang, Jianbing Shen, Luc Van Gool, and Wenguan Wang. Counterfactual cycle-consistent learning for instruction following and generation in vision-language navigation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15471–15481, 2022.
  • [73] Jialu Li and Mohit Bansal. Improving vision-and-language navigation by generating future-view image semantics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10803–10812, 2023.
  • [74] Zun Wang, Jialu Li, Yicong Hong, Yi Wang, Qi Wu, Mohit Bansal, Stephen Gould, Hao Tan, and Yu Qiao. Scaling data generation in vision-and-language navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12009–12020, 2023.
  • [75] Ming Zhao, Peter Anderson, Vihan Jain, Su Wang, Alexander Ku, Jason Baldridge, and Eugene Ie. On the evaluation of vision-and-language navigation instructions. arXiv preprint arXiv:2101.10504, 2021.
  • [76] Torsten Sattler, Tobias Weyand, Bastian Leibe, and Leif Kobbelt. Image retrieval for image-based localization revisited. In BMVC, volume 1, page 4, 2012.
  • [77] Filip Radenović, Giorgos Tolias, and Ondřej Chum. Cnn image retrieval learns from bow: Unsupervised fine-tuning with hard examples. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14, pages 3–20. Springer, 2016.
  • [78] Dorian Galvez-Lopez and Juan D Tardos. Real-time loop detection with bags of binary words. In 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 51–58. IEEE, 2011.
  • [79] Hugo Germain, Guillaume Bourmaud, and Vincent Lepetit. Sparse-to-dense hypercolumn matching for long-term visual localization. In 2019 International Conference on 3D Vision (3DV), pages 513–523. IEEE, 2019.
  • [80] Xiaotian Li, Shuzhe Wang, Yi Zhao, Jakob Verbeek, and Juho Kannala. Hierarchical scene coordinate classification and regression for visual localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11983–11992, 2020.
  • [81] Shuzhe Wang, Zakaria Laskar, Iaroslav Melekhov, Xiaotian Li, Yi Zhao, Giorgos Tolias, and Juho Kannala. Hscnet++: Hierarchical scene coordinate classification and regression for visual localization with transformer. International Journal of Computer Vision, pages 1–21, 2024.
  • [82] Zakaria Laskar, Iaroslav Melekhov, Surya Kalia, and Juho Kannala. Camera relocalization by computing pairwise relative poses using convolutional neural network. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 929–938, 2017.
  • [83] Vassileios Balntas, Shuda Li, and Victor Prisacariu. Relocnet: Continuous metric learning relocalisation using neural nets. In Proceedings of the European conference on computer vision (ECCV), pages 751–767, 2018.
  • [84] Mingyu Ding, Zhe Wang, Jiankai Sun, Jianping Shi, and Ping Luo. Camnet: Coarse-to-fine retrieval for camera re-localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2871–2880, 2019.
  • [85] Mehmet Ozgur Turkoglu, Eric Brachmann, Konrad Schindler, Gabriel J Brostow, and Aron Monszpart. Visual camera re-localization using graph neural networks and relative pose supervision. In 2021 International Conference on 3D Vision (3DV), pages 145–155. IEEE, 2021.
  • [86] Kefan Chen, Noah Snavely, and Ameesh Makadia. Wide-baseline relative camera pose estimation with directional learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3258–3268, 2021.
  • [87] Samarth Sinha, Jason Y Zhang, Andrea Tagliasacchi, Igor Gilitschenski, and David B Lindell. Sparsepose: Sparse-view camera pose regression and refinement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21349–21359, 2023.
  • [88] Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. Scene coordinate regression forests for camera relocalization in rgb-d images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2930–2937, 2013.
  • [89] Abner Guzman-Rivera, Pushmeet Kohli, Ben Glocker, Jamie Shotton, Toby Sharp, Andrew Fitzgibbon, and Shahram Izadi. Multi-output learning for camera relocalization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1114–1121, 2014.
  • [90] Julien Valentin, Matthias Nießner, Jamie Shotton, Andrew Fitzgibbon, Shahram Izadi, and Philip HS Torr. Exploiting uncertainty in regression forests for accurate camera relocalization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4400–4408, 2015.
  • [91] Alex Kendall and Roberto Cipolla. Geometric loss functions for camera pose regression with deep learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5974–5983, 2017.
  • [92] Yoli Shavit, Ron Ferens, and Yosi Keller. Learning multi-scene absolute pose regression with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2733–2742, 2021.
  • [93] David G Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60:91–110, 2004.
  • [94] Clemens Arth, Daniel Wagner, Manfred Klopschitz, Arnold Irschara, and Dieter Schmalstieg. Wide area localization on mobile phones. In 2009 8th ieee international symposium on mixed and augmented reality, pages 73–82. IEEE, 2009.
  • [95] Arnold Irschara, Christopher Zach, Jan-Michael Frahm, and Horst Bischof. From structure-from-motion point clouds to fast location recognition. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 2599–2606. IEEE, 2009.
  • [96] Yunpeng Li, Noah Snavely, and Daniel P Huttenlocher. Location recognition using prioritized feature matching. In Computer Vision–ECCV 2010: 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5-11, 2010, Proceedings, Part II 11, pages 791–804. Springer, 2010.
  • [97] Paul-Edouard Sarlin, Ajaykumar Unagar, Mans Larsson, Hugo Germain, Carl Toft, Viktor Larsson, Marc Pollefeys, Vincent Lepetit, Lars Hammarstrand, Fredrik Kahl, et al. Back to the feature: Learning robust camera localization from pixels to pose. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3247–3257, 2021.
  • [98] Alex Kendall and Roberto Cipolla. Modelling uncertainty in deep learning for camera relocalization. In 2016 IEEE international conference on Robotics and Automation (ICRA), pages 4762–4769. IEEE, 2016.
  • [99] Florian Walch, Caner Hazirbas, Laura Leal-Taixe, Torsten Sattler, Sebastian Hilsenbeck, and Daniel Cremers. Image-based localization using lstms for structured feature correlation. In Proceedings of the IEEE international conference on computer vision, pages 627–637, 2017.
  • [100] Iaroslav Melekhov, Juha Ylioinas, Juho Kannala, and Esa Rahtu. Image-based localization using hourglass networks. In Proceedings of the IEEE international conference on computer vision workshops, pages 879–886, 2017.
  • [101] Mingpeng Cai, Chunhua Shen, and Ian D. Reid. A hybrid probabilistic model for camera relocalization. In British Machine Vision Conference, 2018.
  • [102] Shuai Chen, Zirui Wang, and Victor Prisacariu. Direct-posenet: Absolute pose regression with photometric consistency. In 2021 International Conference on 3D Vision (3DV), pages 1175–1185. IEEE, 2021.
  • [103] Samarth Brahmbhatt, Jinwei Gu, Kihwan Kim, James Hays, and Jan Kautz. Geometry-aware learning of maps for camera localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2616–2625, 2018.
  • [104] Yoli Shavit and Yosi Keller. Camera pose auto-encoders for improving pose regression. In European Conference on Computer Vision, pages 140–157. Springer, 2022.
  • [105] Shuai Chen, Xinghui Li, Zirui Wang, and Victor A Prisacariu. Dfnet: Enhance absolute pose regression with direct feature matching. In European Conference on Computer Vision, pages 1–17. Springer, 2022.
  • [106] Torsten Sattler, Qunjie Zhou, Marc Pollefeys, and Laura Leal-Taixe. Understanding the limitations of cnn-based absolute camera pose regression. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3302–3312, 2019.
  • [107] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
  • [108] Ricardo Martin-Brualla, Noha Radwan, Mehdi SM Sajjadi, Jonathan T Barron, Alexey Dosovitskiy, and Daniel Duckworth. Nerf in the wild: Neural radiance fields for unconstrained photo collections. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7210–7219, 2021.
  • [109] Jianlin Liu, Qiang Nie, Yong Liu, and Chengjie Wang. Nerf-loc: Visual localization with conditional neural radiance field. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 9385–9392. IEEE, 2023.
  • [110] Peter Karkus, David Hsu, and Wee Sun Lee. Particle filter networks with application to visual localization. In Conference on robot learning, pages 169–178. PMLR, 2018.
  • [111] Changan Chen, Rui Wang, Christoph Vogel, and Marc Pollefeys. F3loc: Fusion and filtering for floorplan localization. arXiv preprint arXiv:2403.03370, 2024.
  • [112] Zhong-Qiu Zhao, Peng Zheng, Shou-tao Xu, and Xindong Wu. Object detection with deep learning: A review. IEEE transactions on neural networks and learning systems, 30(11):3212–3232, 2019.
  • [113] Li Liu, Wanli Ouyang, Xiaogang Wang, Paul Fieguth, Jie Chen, Xinwang Liu, and Matti Pietikäinen. Deep learning for generic object detection: A survey. International journal of computer vision, 128:261–318, 2020.
  • [114] Zhengxia Zou, Keyan Chen, Zhenwei Shi, Yuhong Guo, and Jieping Ye. Object detection in 20 years: A survey. Proceedings of the IEEE, 111(3):257–276, 2023.
  • [115] Yu Xiang, Tanner Schmidt, Venkatraman Narayanan, and Dieter Fox. Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes. arXiv preprint arXiv:1711.00199, 2017.
  • [116] Chen Wang, Danfei Xu, Yuke Zhu, Roberto Martín-Martín, Cewu Lu, Li Fei-Fei, and Silvio Savarese. Densefusion: 6d object pose estimation by iterative dense fusion. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3343–3352, 2019.
  • [117] Yinlin Hu, Pascal Fua, Wei Wang, and Mathieu Salzmann. Single-stage 6d object pose estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2930–2939, 2020.
  • [118] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
  • [119] Ricardo E Gonzalez Penuela, Jazmin Collins, Cynthia Bennett, and Shiri Azenkot. Investigating use cases of ai-powered scene description applications for blind and low vision people. In Proceedings of the CHI Conference on Human Factors in Computing Systems, pages 1–21, 2024.
  • [120] Jasur Shukurov. Improve accessibility for low vision and blind people using machine learning and computer vision. arXiv preprint arXiv:2404.00043, 2024.
  • [121] Jeffrey P Bigham, Chandrika Jayant, Hanjie Ji, Greg Little, Andrew Miller, Robert C Miller, Robin Miller, Aubrey Tatarowicz, Brandyn White, Samual White, et al. Vizwiz: nearly real-time answers to visual questions. In Proceedings of the 23nd annual ACM symposium on User interface software and technology, pages 333–342, 2010.
  • [122] Michele A Burton, Erin Brady, Robin Brewer, Callie Neylan, Jeffrey P Bigham, and Amy Hurst. Crowdsourcing subjective fashion advice using vizwiz: challenges and opportunities. In Proceedings of the 14th international ACM SIGACCESS conference on Computers and accessibility, pages 135–142, 2012.
  • [123] Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3608–3617, 2018.
  • [124] Chongyan Chen, Samreen Anjum, and Danna Gurari. Grounding answers for visual questions asked by visually impaired people. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19098–19107, 2022.
  • [125] Jarek Reynolds, Chandra Kanth Nagesh, and Danna Gurari. Salient object detection for images taken by people with vision impairments. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 8522–8531, 2024.
  • [126] Antonio Torralba and Alexei A Efros. Unbiased look at dataset bias. In CVPR 2011, pages 1521–1528. IEEE, 2011.
  • [127] Zhuang Liu and Kaiming He. A decade’s battle on dataset bias: Are we there yet? arXiv preprint arXiv:2403.08632, 2024.
  • [128] Julien Valentin, Angela Dai, Matthias Nießner, Pushmeet Kohli, Philip Torr, Shahram Izadi, and Cem Keskin. Learning to navigate the energy landscape. In 2016 Fourth International Conference on 3D Vision (3DV), pages 323–332. IEEE, 2016.
  • [129] Paul-Edouard Sarlin, Mihai Dusmanu, Johannes L Schönberger, Pablo Speciale, Lukas Gruber, Viktor Larsson, Ondrej Miksik, and Marc Pollefeys. Lamar: Benchmarking localization and mapping for augmented reality. Computer Vision–ECCV 2022, 13667:686–704, 2022.
  • [130] Huajian Huang, Changkun Liu, Yipeng Zhu, Hui Cheng, Tristan Braud, and Sai-Kit Yeung. 360loc: A dataset and benchmark for omnidirectional visual localization with cross-device queries. arXiv preprint arXiv:2311.17389, 2023.
  • [131] Yunpeng Li, Noah Snavely, Dan Huttenlocher, and Pascal Fua. Worldwide pose estimation using 3d point clouds. In European conference on computer vision, pages 15–29. Springer, 2012.
  • [132] Richard A Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew J Davison, Pushmeet Kohli, Jamie Shotton, Steve Hodges, and Andrew Fitzgibbon. Kinectfusion: Real-time dense surface mapping and tracking. In 2011 10th IEEE international symposium on mixed and augmented reality, pages 127–136. IEEE, 2011.
  • [133] Yasutaka Furukawa, Brian Curless, Steven M Seitz, and Richard Szeliski. Towards internet-scale multi-view stereo. In 2010 IEEE computer society conference on computer vision and pattern recognition, pages 1434–1441. IEEE, 2010.
  • [134] Matthias Nießner, Michael Zollhöfer, Shahram Izadi, and Marc Stamminger. Real-time 3d reconstruction at scale using voxel hashing. ACM Transactions on Graphics (ToG), 32(6):1–11, 2013.
  • [135] Kenji Koide, Shuji Oishi, Masashi Yokozuka, and Atsuhiko Banno. General, single-shot, target-less, and automatic lidar-camera extrinsic calibration toolbox. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 11301–11307. IEEE, 2023.
  • [136] Jian Wu, Liwei Ma, and Xiaolin Hu. Delving deeper into convolutional neural networks for camera relocalization. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 5644–5651. IEEE, 2017.
  • [137] Samarth Brahmbhatt, Jinwei Gu, Kihwan Kim, James Hays, and Jan Kautz. Geometry-aware learning of maps for camera localization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [138] Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7482–7491, 2018.
  • [139] Hunter Blanton, Connor Greenwell, Scott Workman, and Nathan Jacobs. Extending absolute pose regression to multiple scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 38–39, 2020.
  • [140] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1314–1324, 2019.
  • [141] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: Lessons learned from the 2015 mscoco image captioning challenge. IEEE transactions on pattern analysis and machine intelligence, 39(4):652–663, 2016.
  • [142] MD Zakir Hossain, Ferdous Sohel, Mohd Fairuz Shiratuddin, and Hamid Laga. A comprehensive survey of deep learning for image captioning. ACM Computing Surveys (CsUR), 51(6):1–36, 2019.
  • [143] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • [144] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002.
  • [145] Michael Denkowski and Alon Lavie. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the ninth workshop on statistical machine translation, pages 376–380, 2014.
  • [146] Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575, 2015.
  • [147] Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. Spice: Semantic propositional image caption evaluation. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part V 14, pages 382–398. Springer, 2016.
  • [148] Ron Mokady, Amir Hertz, and Amit H Bermano. Clipcap: Clip prefix for image captioning. arXiv preprint arXiv:2111.09734, 2021.
  • [149] Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In International Conference on Machine Learning, pages 23318–23340. PMLR, 2022.
  • [150] Federico Boniardi, Tim Caselitz, Rainer Kümmerle, and Wolfram Burgard. Robust lidar-based localization in architectural floor plans. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3318–3324. IEEE, 2017.
  • [151] Federico Boniardi, Tim Caselitz, Rainer Kümmerle, and Wolfram Burgard. A pose graph-based localization system for long-term navigation in cad floor plans. Robotics and Autonomous Systems, 112:84–97, 2019.
  • [152] Zhikai Li, Marcelo H Ang, and Daniela Rus. Online localization with imprecise floor space maps using stochastic gradient descent. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 8571–8578. IEEE, 2020.
  • [153] Oscar Mendez, Simon Hadfield, Nicolas Pugeault, and Richard Bowden. Sedar: reading floorplans like a human—using deep learning to enable human-inspired localisation. International Journal of Computer Vision, 128(5):1286–1310, 2020.
  • [154] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [155] Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), 2016.
  • [156] Johannes Lutz Schönberger, True Price, Torsten Sattler, Jan-Michael Frahm, and Marc Pollefeys. A vote-and-verify strategy for fast spatial verification in image retrieval. In Asian Conference on Computer Vision (ACCV), 2016.
  • [157] Alberto Pepe and Joan Lasenby. Cga-posenet: Camera pose regression via a 1d-up approach to conformal geometric algebra. arXiv preprint arXiv:2302.05211, 2023.
  • [158] Alberto Pepe, Joan Lasenby, and Sven Buchholz. Cgaposenet+ gcan: A geometric clifford algebra network for geometry-aware camera pose regression. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 6593–6603, 2024.

Appendix A Dataset Details

A.1 Organization

To ensure consistency and ease of use, we organize our data using the COLMAP output structure [154, 155, 156] and adhere to the training standards derived from PoseNet [30]. For each video, our data follows the COLMAP output format (https://colmap.github.io/format.html). Below is an example of the file structure for one video folder; in total, there are 400 COLMAP project folders structured in this manner:

Data Structure
./
  train/
    Basement/
      20231220_141254_proj/
        HAND_20231220_141254/
          HAND_20231220_141254_frame_000.3s.jpg
          …
          HAND_20231220_141254_frame_166.7s.jpg
        dense/
          0/
            images/
              HAND_20231220_141254_frame_000.3s.jpg
              …
              HAND_20231220_141254_frame_166.7s.jpg
            sparse/
              cameras.bin
              images.bin
              points3D.bin
            stereo/
              patch-match.cfg
              fusion.cfg
              …
        sparse/
          0/
            cameras.bin
            images.bin
            points3D.bin
          geo/
            cameras.bin
            images.bin
            points3D.bin
        movie/
          frame000000.png
          …
          frame000001.png
        HAND_20231220_141254.MOV
        camera2world_6DoF.txt
        database.db
        geo_coord.txt
        log.log
        path.png
        path_stem.png
      …
      20240331_083440_proj/
        …
      image_train_all.txt
      geometric_data.pkl
    Lower_Level
      …
    …
  test/
    …

For files or folders that are standard COLMAP outputs, please refer to the official tutorial (https://colmap.github.io/). We describe the additional files here. All video files, which end with .MOV or .MP4, are named with the capture method (either “HAND” or “DJI”) followed by the recording timestamp. The same naming convention applies to the folders containing extracted images and to the images themselves, with a frame timestamp appended. Note that for some videos we preserved the padded image frames in a separate folder whose name begins with "HAND_pad" or "DJI_pad". The following folder structure is used to initiate any COLMAP processing:

Data Structure Before Running COLMAP
20231220_141254_proj/
  HAND_20231220_141254/
    HAND_20231220_141254_frame_000.3s.jpg
    …
    HAND_20231220_141254_frame_166.7s.jpg

A.1.1 Geo-Registration

After generating the standard outputs using COLMAP, we first saved the COLMAP reconstruction report to the file log.log, and then captured “movies” of the 3D reconstructions to give readers a direct view of the reconstruction status. For each reconstruction, we saved 5 to 10 images from various perspectives. Please refer to Figure 5 for an example.

Refer to caption
(a) frame000000
Refer to caption
(b) frame000001
Refer to caption
(c) frame000002
Refer to caption
(d) frame000003
Refer to caption
(e) frame000004
Refer to caption
(f) frame000005
Refer to caption
(g) frame000006
Refer to caption
(h) frame000007
Figure 5: Movies grabbed from the 3D reconstruction of the project 20231220_141254_proj/.

Users can reload models via the COLMAP GUI to view the 3D reconstructions from any perspective.

Additionally, because the original world coordinates are not aligned for different paths, we need to complete the geo-registration process. This involves using 5 to 10 images, manually pinpointed to the floor plan, to accurately align the models. These ground truths are saved in the file geo_coord.txt as

Ground Truth Data for Image Localization Relative to the Floor Plan
1. HAND_20231220_141254_frame_000.3s.jpg 1052 2113 0
2. HAND_20231220_141254_frame_019.3s.jpg 464 2082 0
3. HAND_20231220_141254_frame_035.4s.jpg 449 1503 0
4. HAND_20231220_141254_frame_074.8s.jpg 253 188 0
5. HAND_20231220_141254_frame_089.9s.jpg 700 292 0
6. HAND_20231220_141254_frame_100.1s.jpg 467 434 0
7. HAND_20231220_141254_frame_113.1s.jpg 450 915 0

We name these image frames “anchor frames” or “anchor images” throughout this paper. The origins of their coordinates are aligned to the top-left pixel of the floor plan images, as shown in Figure 6. The second and third columns above indicate the number of pixels from the origin in terms of height and width, respectively.
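As a minimal sketch, anchor annotations of this form could be read as follows. It assumes each record holds an image name followed by three whitespace-separated numbers; the parser also tolerates a leading list index such as "1.", since it is not clear from the listing whether the numbering is part of the file. The function name parse_geo_coord is hypothetical and is not taken from the released code.

```python
def parse_geo_coord(text):
    """Parse anchor-frame annotations into {image_name: (v1, v2, v3)}."""
    anchors = {}
    for line in text.splitlines():
        tokens = line.split()
        # Tolerate an optional leading list index such as "1."
        if tokens and tokens[0].rstrip(".").isdigit():
            tokens = tokens[1:]
        if len(tokens) != 4:
            continue  # skip blank or malformed lines
        name, v1, v2, v3 = tokens
        anchors[name] = (float(v1), float(v2), float(v3))
    return anchors
```

The three values are returned in file order, so their interpretation as floor-plan pixel offsets stays with the caller.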

Refer to caption
(a) Basement
Refer to caption
(b) Lower Level
Refer to caption
(c) Level 1
Refer to caption
(d) Level 2
Figure 6: Images of Floor Plan across four floors of HST.

For detailed instructions on how to geo-register the entire path using annotated images, please refer to the official tutorial (https://colmap.github.io/faq.html#geo-registration).

We stored the geo-registered model in the folder 20231220_141254_proj/sparse/geo. Since the camera poses represent the coordinates of the world relative to the camera center and include redundant feature point information, we consolidated and streamlined this data from ../sparse/geo/images.bin into ../camera2world_6DoF.txt.
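The conversion from a world-to-camera pose to a camera position in world coordinates can be sketched as below. COLMAP stores each image pose as a quaternion (QW, QX, QY, QZ) and a translation t such that x_cam = R x_world + t, so the camera center is C = -R^T t. The helper names are hypothetical, and whether camera2world_6DoF.txt stores exactly this quantity is an assumption; the released dataset_utils code is the authoritative reference.

```python
def quat_to_rotmat(qw, qx, qy, qz):
    """Rotation matrix (3x3 nested lists) from a unit quaternion."""
    return [
        [1 - 2 * (qy * qy + qz * qz), 2 * (qx * qy - qz * qw), 2 * (qx * qz + qy * qw)],
        [2 * (qx * qy + qz * qw), 1 - 2 * (qx * qx + qz * qz), 2 * (qy * qz - qx * qw)],
        [2 * (qx * qz - qy * qw), 2 * (qy * qz + qx * qw), 1 - 2 * (qx * qx + qy * qy)],
    ]


def camera_center(qvec, tvec):
    """Camera center in world coordinates, C = -R^T t, for a world-to-camera pose."""
    R = quat_to_rotmat(*qvec)
    # Component c of R^T t is the sum over rows r of R[r][c] * t[r].
    return tuple(-sum(R[r][c] * tvec[r] for r in range(3)) for c in range(3))
```

Pure Python is used here only to keep the sketch dependency-free; a NumPy version would be shorter.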

For each floor, we have released data in standard formats, image_train_all.txt and geometric_data.pkl, to serve as deep learning training inputs. The first four lines of image_train_all.txt are displayed as

Data Format of image_train_all.txt
Lehigh Health Science and Technology (HST) Building (https://www2.lehigh.edu/news/new-health-science-and-technology-building-a-hub-for-interdisciplinary-research).
IMG_PATH, IMG_ID, QW, QX, QY, QZ, TX, TY, TZ
20231220_141254_proj/HAND_20231220_141254/HAND_20231220_141254_frame_153.4s.jpg, 541, 0.46282440, -0.48897273, 0.52665017, -0.51897865, 602.36415529, 2137.65949852, -0.94367326
20231220_141254_proj/HAND_20231220_141254/HAND_20231220_141254_frame_153.1s.jpg, 540, 0.46784778, -0.48610208, 0.51996406, -0.52388988, 592.78433312, 2137.81172284, -1.24991543

This training format, inherited from PoseNet [30], aligns with the prevailing styles used in deep APR training [98, 91, 99, 100, 101, 102, 103, 92, 104, 105, 32].
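A label file in this format could be read with a sketch like the following. It assumes two leading non-data lines (a description line and the column header, as in the excerpt above) and comma-separated rows with nine fields; the function name load_pose_labels and the header_lines parameter are illustrative and are not part of the released code.

```python
import csv
import io


def load_pose_labels(text, header_lines=2):
    """Parse label text into dicts with path, id, quaternion, and translation."""
    samples = []
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    for line_no, row in enumerate(reader):
        # Skip the assumed header lines and anything that is not a 9-field row.
        if line_no < header_lines or len(row) != 9:
            continue
        values = [float(v) for v in row[2:]]
        samples.append({
            "image_path": row[0],
            "image_id": int(row[1]),
            "quaternion": tuple(values[:4]),   # QW, QX, QY, QZ
            "translation": tuple(values[4:]),  # TX, TY, TZ
        })
    return samples
```

For a file on disk, the text can be obtained with open(path).read() before calling the function.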

The file geometric_data.pkl is stored as a pickled Python object with the following structure:

Data Format of geometric_data.pkl
Basement/
  "train":[]
    0:{"image_path":,"w_t_c":,"c_q_w":,"c_R_w":,"w_P":,"c_p":,"K":}
    1:{"image_path":,"w_t_c":,"c_q_w":,"c_R_w":,"w_P":,"c_p":,"K":}
    …
    39842:{"image_path":,"w_t_c":,"c_q_w":,"c_R_w":,"w_P":,"c_p":,"K":}
  "test":[]
    0:{"image_path":,"w_t_c":,"c_q_w":,"c_R_w":,"w_P":,"c_p":,"K":}
    1:{"image_path":,"w_t_c":,"c_q_w":,"c_R_w":,"w_P":,"c_p":,"K":}
    …
    16180:{"image_path":,"w_t_c":,"c_q_w":,"c_R_w":,"w_P":,"c_p":,"K":}

We encourage readers to review the code file at https://github.com/junfish/VIP_Navi/blob/master/dataset_utils/extract_geometric_data.py for a detailed understanding and to generate geometric_data.pkl themselves. This data file can be used to train models on camera geometric data, e.g., Geometric PoseNet [91, 157, 158].
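For a quick inspection of the file, a sketch such as the one below may help. It assumes the unpickled object is a dict with "train" and "test" entries, each holding per-image dicts (either as a list or keyed by index); this layout is inferred from the listing above and should be checked against extract_geometric_data.py. Both function names are hypothetical. Note that pickle files should only be loaded from trusted sources.

```python
import pickle


def load_geometric_data(path):
    """Unpickle the geometric data file (only do this for trusted files)."""
    with open(path, "rb") as f:
        return pickle.load(f)


def summarize_splits(data):
    """Return {split: (number_of_entries, sorted_field_names)} for each split."""
    summary = {}
    for split in ("train", "test"):
        entries = data.get(split, [])
        # Accept either a list of records or an index-keyed dict of records.
        items = list(entries.values()) if isinstance(entries, dict) else list(entries)
        fields = sorted(items[0].keys()) if items else []
        summary[split] = (len(items), fields)
    return summary
```

Typical use would be summarize_splits(load_geometric_data("geometric_data.pkl")), which reports how many records each split holds and which fields the first record carries.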

As illustrated in Figure 7(a-d) and 7(e-h), the files path.png and path_stem.png show the geo-registered path on the floor plan and in the 3D world coordinates, respectively, and serve to validate the registration accuracy. All code for the aforementioned steps is publicly available at https://github.com/junfish/VIP_Navi/tree/master/dataset_utils.

Refer to caption
(a) Basement
Refer to caption
(b) Lower Level
Refer to caption
(c) Level 1
Refer to caption
(d) Level 2
Refer to caption
(e) Basement
Refer to caption
(f) Lower Level
Refer to caption
(g) Level 1
Refer to caption
(h) Level 2
Figure 7: Examples of path visualizations on the floor plan and in 3D world coordinates across four floors.

A.1.2 GPT-4 Captions for VIPs

The pseudo ground truth produced by GPT-4 is stored in the following eight .csv files.

Captions for VIPs
1. train_Basement_output.csv
2. train_Lower_Level_output.csv
3. train_Level_1_output.csv
4. train_Level_2_output.csv
5. test_Basement_output.csv
6. test_Lower_Level_output.csv
7. test_Level_1_output.csv
8. test_Level_2_output.csv

Each file above contains three columns, labeled image_id, image_file, and caption.
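A caption file with these three columns could be read as in the sketch below. It assumes the first row is a header naming the columns exactly image_id, image_file, and caption; the function name load_captions is illustrative only.

```python
import csv
import io


def load_captions(csv_file):
    """Map image_file -> caption from an open CSV file with a header row."""
    reader = csv.DictReader(csv_file)
    return {row["image_file"]: row["caption"] for row in reader}


# Usage with an in-memory example; captions containing commas must be quoted.
example = io.StringIO(
    'image_id,image_file,caption\n'
    '0,a.jpg,"A hallway, with stairs."\n'
)
captions = load_captions(example)
```

When reading a real file, the Python csv documentation recommends opening it with newline="" so that quoted line breaks inside captions are handled correctly.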

A.2 Examples

In this section, we first present examples of images from our dataset along with the reference objects used for the accurate human annotations of anchor frames. Subsequently, we randomly select images with captions for VIPs.

A.2.1 Image Frames

Figure 8 presents examples of images extracted from various videos recorded to develop this image-centric indoor navigation solution. We observe that some indoor environments are occasionally textureless and dynamic.

Refer to caption
Figure 8: Randomly selected pictures extracted from different 4K video recordings. These videos are captured by various mobile devices using either human hands or a DJI stabilizer, including both portrait and landscape orientations.

A.2.2 Reference Objects

To obtain accurate camera poses for the anchor frames used in geo-registrations, we identify and use prominent objects in the environment that are integrated into the sparse model of COLMAP 3D reconstructions. These objects include exit signs, stairs, gardens, walls with posters, pillars, trash bins, door frames, lockers, water dispensers, etc. Figure 9 displays selected examples of the mapping of these feature points from the images to the 3D sparse point cloud.

Refer to caption
(a) Carpet Corner
Refer to caption
(b) Exit Sign
Refer to caption
(c) Flower
Refer to caption
(d) Garden Corner
Refer to caption
(e) Lock Handle
Refer to caption
(f) Poster
Refer to caption
(g) Banister
Refer to caption
(h) Trash Bin
Figure 9: Examples of reference objects used to accurately annotate the anchor frames along the path. Multiple reference objects can be selected collaboratively to cross-validate the ground truth positions on the floor plan.

A.2.3 Captions by GPT-4

We show three randomly selected examples from our dataset below.

Captions by GPT-4

[Uncaptioned image]

Concise Version: A modern hallway with stairs and elevators, featuring wooden ceiling slats, clean floors, and glass railings.

Detailed Version:

For Early-Blind: You are in a spacious, modern hallway, about 6 feet wide. The floor is smooth with a slight speckled texture, providing a comfortable walking surface. Overhead, there are evenly spaced wooden slats along the ceiling, offering a unique spatial pattern. At 9 o’clock, about 6 feet to your left, is an elevator integrated into the hallway wall; it has a flat surface and poses no tripping risks. Straight ahead, the hallway lengthens around 20 feet. To the right, at 3 o’clock and roughly 10 feet from your current position, there is a broad staircase featuring glass railings. This staircase leads to the upper floor. The path in the hallway is clear of obstructions, making for a safe, unobstructed journey. Further down, past the stairs, the corridor proceeds and is lined with glass doors and other branching hallways.

For Low-Vision & Late-Blind: You are in a modern hallway, approximately 6 feet wide. The floor is polished and smooth, with a slight speckled texture that is easy to walk on. Overhead, there are uniform wooden slats on the ceiling running parallel to your path, providing an aesthetic touch. On your left at 9 o’clock, there’s an elevator with metallic doors set into a plain wall about 6 feet away. It’s flush with the hallway wall, without any tripping hazards. In front of you, the corridor stretches about 20 feet before widening slightly. To your right, at 3 o’clock and about 10 feet away, a wide, open staircase with transparent glass railings ascends to the next floor. There are no visible obstacles on the floor, ensuring an easy and safe walk. Farther ahead, beyond the stairs, the hallway continues, flanked by glass doors and additional hallways.

Concise Version: Office with a partition, workstations, and two people, visible furniture, and lighting.

Detailed Version:

For Early-Blind: You are in a busy office environment. Directly to your left at 9 o’clock, two people are engaged with their computer monitors in a glass cubicle. Ahead, you’ll encounter a substantial partition around 10 feet from you. This partition functions as a spatial divider unique to this setting. Walking past the partition, there are additional work areas equipped with desks, chairs, and computer systems. Overhead, the office has ceiling fixtures that likely create a well-lit environment. To your right, at 3 o’clock, there’s a wall made of frosted glass, part of the office’s contemporary design. Exercise caution as you proceed: there are chair legs by the pathway which may pose a slight tripping hazard. To safely reach the next workstation, steer slightly to your right to maintain a clear pathway free of obstacles.

For Low-Vision & Late-Blind: You are in a modern office space. To your immediate left at 9 o’clock, there are two people sitting at desks with computers in a glass-walled section. Moving forward, there is a large yellow partition about 10 feet away, which stands approximately 6 feet high and 4 feet wide, acting as a divider in the middle of the room. Directly ahead, just beyond the partition, are more workstations with office chairs, desks, and computer monitors. Fluorescent lights hanging from the white ceiling brighten the room. To your right, at 3 o’clock, you can see a frosted glass wall, which is part of a modern office design. Be cautious of the chair legs close to the pathway and navigate straight or slightly to the right to avoid obstacles and reach a workstation.

Concise Version: Office space with a kitchen, glass-enclosed room, and various furniture arranged neatly.

Detailed Version:

For Early-Blind: The scene is a tidy, modern office. In front of you, a glass partition is at 3 o’clock. Immediately to your left, around two feet, is a kitchen area with a countertop, cabinets, and appliances. You can navigate safely along a path roughly three feet wide leading into the kitchen. About ten feet beyond the kitchen, there’s a metallic refrigerator. To your right, approximately four feet away, a blue chair is near the glass partition. Inside the glass-enclosed room at 3 o’clock, there’s a white chair under a table. The floor is clear of obstacles, though watch for the sharp corners of the kitchen countertop to your left.

For Low-Vision & Late-Blind: You are in a modern office space. Directly ahead, a glass wall creates a partition at 3 o’clock. To your immediate left, around two feet away, the kitchen area has a counter, appliances, and cabinets. Beyond the kitchen, approximately ten feet away, is a stainless-steel refrigerator. There’s ample walking space, roughly three feet wide, leading to the kitchen. At the entrance to your right, around four feet away, a blue chair is securely placed near the glass wall. As you look further right at 3 o’clock, there is a white chair positioned under a table inside the glass-enclosed room. There are no visible hazards; the floor appears clear, but be mindful of the sharp corners of the countertop to your left.

A.3 Statistics

As illustrated in Figure 10, we have conducted basic statistical analyses of our video recordings to indicate potential biases inherent in data collected by humans. We note that our dataset lacks videos from the hours between 2:00 AM and 6:00 AM, as well as on Saturdays and during the Christmas holidays.

(a) Distribution of video contributions throughout daytime hours.
(b) Proportions of video postures and styles.
(c) Video contributions over weeks from December 2023 to June 2024.
Figure 10: Statistics of videos recorded for NaVIP. (a) Contribution Graph depicting the distribution of video recordings across different times of the day for each floor. The heatmap illustrates variations in the number of video contributions from Basement to Level 2, with darker greens indicating higher activity during specific hours. (b) Distribution of video capture styles across different floors, illustrated through pie charts. Each chart represents the proportion of video capture modes: Portrait by Hand, Landscape by Hand, Portrait by DJI Stabilizer, and Landscape by DJI Stabilizer. The charts highlight the variance in capture styles from Floor 2 to the Basement. (c) Contribution Graph showing the distribution of video recordings by days of the week over a period from December 5, 2023, to June 5, 2024. The heatmap illustrates the number of video contributions per day, highlighting activity patterns with darker shades representing higher frequencies of video recordings.
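The hour-of-day and day-of-week counts behind contribution graphs of this kind can be reproduced with a few lines of Python. The sketch below is not the authors' analysis script. It assumes each video carries a start timestamp in the `YYYY/MM/DD-HH:MM:SS` form used in the panel labels of the pose visualizations in Appendix C.2; the function names and the four sample timestamps (taken from those panel labels) are for illustration only.

```python
from collections import Counter
from datetime import datetime

# Assumed timestamp format, matching labels such as "2024/04/08-11:34:37".
TIMESTAMP_FORMAT = "%Y/%m/%d-%H:%M:%S"
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def contribution_counts(timestamps):
    """Count recordings per hour of day and per day of week."""
    hours = Counter()
    weekdays = Counter()
    for ts in timestamps:
        t = datetime.strptime(ts, TIMESTAMP_FORMAT)
        hours[t.hour] += 1
        weekdays[WEEKDAYS[t.weekday()]] += 1
    return hours, weekdays


def uncovered_hours(hours):
    """Return the hours of the day (0-23) with no recordings at all."""
    return [h for h in range(24) if hours[h] == 0]


# Illustrative sample only, not the full recording log.
sample = [
    "2024/04/08-11:34:37",
    "2024/04/16-23:59:26",
    "2024/05/14-08:51:38",
    "2024/04/17-00:12:46",
]
hour_counts, weekday_counts = contribution_counts(sample)
```

Applied to the full list of recordings, `uncovered_hours` would surface gaps such as the missing early-morning hours noted above, and weekdays absent from `weekday_counts` would surface gaps such as Saturdays.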

Appendix B VIP User Needs and GPT-4 Interaction

B.1 VIP User Needs and GPT-4 Text Prompts Design

We aim to experiment with the use of GPT-4 for processing images captured by visually impaired individuals using their smartphones and generating controlled, effective textual descriptions to facilitate barrier-free living.

Initially, we craft a preliminary prompt for GPT, drawing inspiration from the work of Jain et al. [40]. Jain et al. introduce the concept of "Exploration Assistance," an evolution of Navigation Assistance Systems (NASs) that empowers VIPs to explore unfamiliar environments. They analyze VIP user needs, such as what information VIPs require to explore unfamiliar environments and which factors cause these needs to vary among individuals. Following their work, we categorize visually impaired users into early-blind, late-blind, and low-vision groups, and we test GPT’s ability to produce descriptions at three levels of need: spatial information, independent exploration, and collaborative exploration.

We find that GPT’s performance limitations prevent it from generating distinct descriptions for different user needs, leading to redundant information across the multiple versions of each image description and a poorer user experience. To address this, we reassess the user categories and requirements and design an improved prompt. In our experiments, the improved prompt yields more accurate and readable GPT-generated descriptions. Additionally, we observe that GPT struggles to describe directions accurately. To mitigate this, we introduce a visual prompt that corrects these errors, providing users with more precise navigation information.

B.1.1 Designs for GPT-4 Text Prompts: First Version

Please help me to annotate images with natural language descriptions for visually impaired people (VIPs). It is important to be as descriptive and clear as possible. For each image, please give me two versions (concise & detailed version) of descriptions for three VIP groups (VIPs who are low-vision, early-blind & late-blind), respectively.

My requirements of two versions are:

  • Concise version contains only one sentence (no more than 12 words). Use specific nouns and adjectives to give environment summarization, such as “a quiet lab where some people are reading books.”

  • Detailed version contains the descriptions of three different levels, as outlined below. Each level represents a progressively higher need within the visually impaired population. Please ensure that the information for each level is contained within its own separate paragraph.

    • Level 1. Spatial information needs (perceptual insight cravings). VIPs need two types of spatial information: shape information and layout information, to get a high-level overview of a space. Your descriptions should help VIPs gather shape and layout information in a manner that facilitates active engagement with the environment.

    • Level 2. Independent exploration needs (self-directed learning desires). VIPs face difficulties in making navigation decisions based on spatial information collected via their non-visual senses and your descriptions. Additionally, acquiring appropriate orientation and mobility (O&M) training and maintaining their O&M skills can be a struggle for them, which negatively impacting their confidence to explore independently. You should try your best to afford VIPs precise and reliable spatial information and should ensure that VIPs make accurate navigation decisions based on this information. You can serve as an O&M education assistance tool to give VIPs the confidence to explore unfamiliar environments independently.

    • Level 3. Collaborative exploration needs (social interaction needs). Social pressures pose a major challenge to VIPs receiving help and to non-VIPs providing help when exploring environments collaboratively. VIPs want verbal assistance in a comprehensible format when exploring collaboratively but find it challenging to communicate this preference to others. You should normalize VIPs’ exploration behaviors by introducing social norms for exploration assistance in the current situation. If possible, you could scaffold and facilitate collaborative exploration by telling VIPs how to distribute requests for collaboration and what kind of translations may be needed for VIPs.

Here are some tips for you to refine my requirements above:

Tips
1. Object Identification and Description: Clearly identify each object in the image. Use specific nouns and adjectives to describe the objects, such as “round red apple” or “high laboratory shelves.”
2. Spatial Relationships: Describe the spatial relationships between objects. Use terms like “to the left of,” “behind,” “in front of,” “next to,” and “above” to explain how objects are situated in relation to each other.
3. Approximate Measurements: Provide approximate measurements to help the listener understand distances. Use units like steps, arm’s length, or common objects as references, for example, “about three steps away” or “an arm’s length apart.”
4. Textures and Surfaces: Mention the texture or surface of objects if relevant. For instance, “the desk has a smooth, laminate surface” or “the corridor is lined with polished tiles.”
5. Colors and Contrasts: Even though the individual may not perceive colors, describing them can help build a richer picture or convey information to those who have partial vision or can perceive some colors. Contrasts can also be useful, like “the bright fluorescent lights against the neutral-colored ceiling.”
6. Layout Description: Offer a general layout of the scene. For example, “The room is rectangular, with a door on the shorter wall and a window on the longer wall.”
7. Directional Orientation: Use cardinal directions if the person is familiar with them, like “the conference room is to the east,” or relative directions like “the sun sets in front of you when you face the window.”
8. Consistency: Keep descriptions consistent throughout the annotations. If you start by describing objects from left to right, maintain that order. You can also start by describing the immediate area, followed by objects in the middle distance, and then far away items.
9. Interactive Elements: Point out any interactive elements or objects that can be used, such as “a button to press at the crosswalk” or “to your right, there’s a window that can be opened for fresh air.”
10. Avoid Subjectivity: Keep descriptions objective and avoid personal interpretations or non-evident assumptions about the scene. When uncertain about quantities or functionalities, employ phrases such as “Some chairs feature a green accent” or “these numbered, circular floor markers with arrows possibly serve for wayfinding or organizing seating.”
11. Clarity and Brevity: Be clear and concise. While detail is important, unnecessary information can clutter the mental image. For instance, instead of “a rectangular, mahogany coffee table with intricate carvings,” use “a long, wooden coffee table in front of the couch.”
12. Safety and Navigation Cues: Highlight potential obstacles or safety hazards (like low-hanging objects, uneven surfaces) and include navigational cues if possible. Use sentences like “Please be careful of the small step down as you enter the room. There’s a low coffee table in the center, about two feet high.” or “If you keep walking straight for about six steps, you’ll reach the kitchen door. You can feel a wooden chair right before the entrance.”

By considering these tips, you can create descriptions in natural language that are informative and useful for building a mental map for VIPs.

Requirements of three groups of VIPs are:

  • Early-blind VIPs are a group of people who are blind by birth or developed vision impairments early in life. They have learned to trust their non-visual senses to collect spatial information over time, and have no concept or sense of colors. Please describe using texture or tactile feel, such as “the bench is crafted from unvarnished timber, giving it a textured, coarse feel,” rather than using color or contrast descriptions like “an aisle carpeted in dark gray.”

  • Late-blind VIPs are a group of people who are blind after they have a basic understanding of our world; they know things similar to normal people. They prefer receive various information in the descriptions with as much visual detail as possible.

  • Low vision VIPs are a group of people who have some degree of visual function and usually rely on combining their remaining vision and other senses to interact with their environment. Unlike those who are completely blind, individuals with low vision may use their visual capabilities to assist in the navigation and identification of objects, but they may have a loss of ability of Central vision, Peripheral vision, Depth perception, Contrast sensitivity, and Glare resistance.

Below are specialized tips for early-blind VIPs. Prioritize these over the initial 12 tips, as they align more closely with the unique experiences of early-blind VIPs. In cases of conflicting advice, please give these tips precedence:

Tips
1. Emphasize Non-Visual Senses: Since early-blind individuals may have limited or no visual memories, focus on descriptions that leverage other senses like touch, sound, smell, and taste based on the visuals in image. For example, “the carpet underfoot looks thick and soft, with a wavy texture; the walls seem to be smooth and cool to the touch.”
2. Spatial Layout and Orientation: Provide detailed information about the layout and orientation of spaces and objects, as this group relies heavily on spatial awareness for navigation and understanding their environment. Use language like “walk straight for about ten steps to reach the sofa.”
3. Texture and Shape Over Color: Describe the texture and shape of objects, as these will be more meaningful than color descriptions, such as “the shape of the object in front of you is similar to a large loaf of bread.”
4. Consistent Terminology: Use consistent terminology and reference points to avoid confusion, as early-blind individuals may have their unique ways of conceptualizing their environment. If you use steps as a measurement, as in “the table is four steps away from you,” continue using steps for other distances instead of suddenly switching to feet or meters.
5. Contextual Descriptions: Offer context for objects and environments that may not be immediately apparent, like the purpose of a certain object or the typical activities in a space. For example, “you are now in an office, a place for work and study; the object with a flat top and drawers you can feel is a desk, used for writing and placing computers.”

Below are specialized tips for low-vision VIPs. Prioritize these over the initial 12 tips, as they align more closely with their unique experiences. In cases of conflicting advice, please give these tips precedence:

Tips
1. People with low vision often rely on color contrast to discern details. Describe the colors and the contrast between elements in the scene.
2. Mention the scene’s lighting as it can significantly affect how someone with low vision perceives the image.
3. Provide context for the size of objects by comparing them to common items. For example, “the dog is about the size of a bicycle.”

B.1.2 Designs for GPT-4 Text Prompts: Improved Version

I need three paragraphs of scene descriptions.

  • The first description should be concise and within 15 words.

  • The second description should be detailed and 150 words and generated for people with vision impairments (low vision or late blindness). Follow these tips:

    Tips
    1. Use clear and concise language: Choose words carefully to provide clear, concise descriptions, using descriptive adjectives and adverbs for relevant information.
    2. Provide directional and distance information: Include information about the layout of the space, who and where any people are, and what they are doing. Give a reliable description of how to navigate the space, describing any safety hazards in detail. Include information on the direction and distance of points of interest. Use clock face references (e.g., “to your right at 3 o’clock.”) to give a sense of orientation. Use common reference objects to describe distances, sizes, and other measurements.
    3. Describe surroundings in stages: Detail the environment in sections, starting with the immediate area, then describing nearby objects or obstacles, and finally providing information about the destination or the route ahead.
  • The third description should also be detailed and 150 words, but generated for people who have been blind since birth. Follow the same rules as the second description, with one additional consideration:

    Tips 1. Avoid describing color information: Do not include color details that are difficult to imagine for a user who has been blind since birth.
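The constraints stated in the improved prompt above (three paragraphs, a concise description within 15 words, two detailed descriptions of about 150 words, and no color terms for users blind since birth) lend themselves to a simple automatic check. The sketch below is a hypothetical helper, not part of the described pipeline, in which responses are screened manually. The blank-line paragraph separator, the 20% tolerance around 150 words, and the short color word list are all assumptions made for illustration.

```python
import re

# Hypothetical, non-exhaustive list used only to flag obvious color terms.
COLOR_WORDS = {"red", "orange", "yellow", "green", "blue", "purple",
               "pink", "brown", "black", "white", "gray", "grey"}


def word_count(text):
    """Count whitespace-separated words."""
    return len(text.split())


def check_response(response, concise_limit=15, detailed_target=150,
                   tolerance=0.2):
    """Return a list of constraint violations (empty if none are found)."""
    # Assumption: the three descriptions are separated by blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", response)
                  if p.strip()]
    if len(paragraphs) != 3:
        return [f"expected 3 paragraphs, got {len(paragraphs)}"]

    concise, second, third = paragraphs
    problems = []
    if word_count(concise) > concise_limit:
        problems.append(
            f"concise description has {word_count(concise)} words")
    for label, text in (("second", second), ("third", third)):
        n = word_count(text)
        if abs(n - detailed_target) > tolerance * detailed_target:
            problems.append(f"{label} description has {n} words")

    # The third description targets users who have been blind since birth.
    words = set(re.findall(r"[a-z]+", third.lower()))
    colors = sorted(COLOR_WORDS & words)
    if colors:
        problems.append(
            "third description mentions color: " + ", ".join(colors))
    return problems
```

A check like this only catches format violations; it cannot judge whether directions or distances in a description are correct, which still requires a human reviewer.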

B.1.3 Examples of GPT-4 Responses

First, we show the image description results GPT generated through the first text prompt version. The first version of text prompts is shown in § B.1.1.

Inspired by the work of Jain et al. [40], we design the first version of text prompts to achieve two main goals: 1) generate three concise image descriptions, one for each user category; and 2) generate nine detailed image descriptions, one for each of the three levels of need in each user category. Our customized guidelines for detailed descriptions include, but are not limited to, detailed object shapes and environmental safety factors. This approach aims to tune each description level to the specific needs of users based on their visual impairment conditions. We hope this customization addresses the unique challenges each group faces in navigating their environments effectively.

Secondly, we show the image description results generated by GPT through the improved version of text prompts. The improved version of text prompts is shown in § B.1.2.

By manually screening and scoring the results of the three levels of descriptions, our team unanimously found that the concise descriptions for the three user groups are very similar and contain only basic scene layout information. Among the detailed descriptions, those at Level 2 (independent exploration needs) are the most accurate and contain the fewest errors, and thus align most closely with the users’ needs. Additionally, we observe that the descriptions for late-blind and low-vision users are remarkably similar. Considering the limitations of GPT and the user needs, we make the following changes in the improved version of the prompt: 1) generate a common concise description for all VIP users; 2) merge the detailed description needs of late-blind and low-vision users into one category; and 3) focus on the requirements of Level 2 and rewrite a simplified version of the GPT prompt to meet this requirement. We find that the simpler and clearer the prompt language, the more accurate and readable the generated descriptions are.

First Version of Text Prompts & Response [Uncaptioned image] For Low-Vision Concise Version: A modern help desk in a bright, spacious library. Detailed Version: Level 1: The image shows a spacious library with a semi-circular help desk prominently in the foreground. The help desk is made of dark, smooth materials and is located centrally within the open area. To the right of the help desk, there is a set of elevators framed by dark tiles, and a staircase with a metal railing leading to an upper floor. The ceiling above the desk features hanging decorations in various colors, resembling paper cranes, adding a decorative touch to the space. Level 2: The help desk is positioned near the entrance, acting as a central hub for visitors. Behind the help desk are several computer workstations where people are seated, working. The flooring throughout the area consists of dark polished tiles, which could be slightly slippery. The elevators to the right are accessible by taking a few steps from the help desk, and the staircase next to them provides an alternative route to the upper floors. The area is well-lit with large windows that let in natural light, creating a bright environment. Level 3: For assistance, you can approach the staff at the help desk, who are seated behind the counter. They are available to provide directions or support with navigating the library. The elevators to the right are easy to locate and can be used to access different floors, with the staircase offering another option. The colorful paper cranes hanging above the help desk serve as a visual landmark, which can help in giving directions or coordinating with others in the library. The help desk is a good starting point for exploring the library, and the staff can guide you to specific sections or resources. For Early-Blind Concise Version: A help desk in a spacious library with hanging decorations. Detailed Version: Level 1: The image shows a help desk in a large, open library space. 
The help desk is semi-circular and made of smooth, hard materials. It is centrally located with workstations behind it and to the sides. Directly to the right of the help desk, there is a set of elevators, encased in a structure of smooth, hard tiles. To the far right, there is a staircase with a sturdy metal railing leading upwards. Level 2: The help desk is centrally placed for easy access. Behind the desk are several workstations, each with a computer and seating arrangements. The flooring throughout the area consists of polished tiles that feel smooth underfoot and may be slippery. The elevators to the right are accessible by walking a few steps from the desk. The staircase, located further right, has a metal handrail for support. There are tactile potted plants on the desk, offering a rougher texture compared to the smooth surfaces around. Level 3: For assistance, approach the help desk staff, who are positioned behind the counter. They can provide directions and support. When entering the library, you can ask for help with locating specific areas or using the workstations. The elevators are to the right and can be used to access different floors, with the staircase providing an alternative route. The hanging decorations above the desk, which have a smooth, paper-like texture, can serve as an auditory or tactile landmark when providing directions or coordinating with others in the library. For Late-Blind Concise Version: A help desk in a modern library with colorful paper cranes above. Detailed Version: Level 1: The image depicts a help desk situated in the foreground of a modern library. The help desk is semi-circular, made of dark materials, and positioned centrally in the scene. To the right of the help desk, there’s a set of elevators, framed in a dark tiled wall, with a staircase leading upwards on the far right. 
The ceiling above the help desk features a display of colorful paper cranes hanging from strings, adding a vibrant touch to the otherwise neutral-toned environment. Level 2: The help desk is positioned near the entrance, acting as a central point for visitors. Behind the help desk, there are several computer workstations occupied by people, and large windows allowing natural light to flood the space. The elevators to the right are accessible via a short walk from the help desk, and the staircase next to the elevators leads to the upper floors. The flooring is made of polished dark tiles, which may be slightly slippery. There are potted plants on the desk, contributing to a welcoming atmosphere. Level 3: If you need assistance, the help desk staff are seated behind the counter and are easily approachable. When entering, you can ask for directions to specific sections of the library or for help using the computers. The elevators to the right are useful for reaching different floors, and the staircase next to them is another option if you prefer walking. The colorful paper cranes hanging above can serve as a visual landmark when providing directions or coordinating with someone else in the library.
Improved Version of Text Prompts & Response [Uncaptioned image] Concise Version: Help desk area with elevators, stairs, and hanging artistic paper cranes. Detailed Version: For Early-Blind You are standing in a large, open lobby. Directly ahead, approximately 10 feet in front of you, is a curved help desk with a person available to assist. Moving to the right, at 3 o’clock, are sets of elevators within a tiled encasement. Next to the elevators, stairs begin about 20 feet away, going up to the next floor. Behind the help desk, at your 9 o’clock position, is a spacious area filled with tables, where several people are using computers. This section spans about 30 feet from where you are. The ceiling features artistic hanging objects above the help desk. The floor consists of solid tiles, ensuring an even surface without visible obstacles between you and the help desk. For Low-Vision & Late-Blind You are in a spacious lobby area. Immediately in front of you, around 10 feet away, is a curved help desk, manned by a person. To your right, at 3 o’clock, are elevators encased in a dark tiled structure. A set of stairs is also to the right of the elevators, leading upwards. The stairs begin around 20 feet away from you. Directly behind the help desk, to the left at 9 o’clock, is a large open area with multiple tables and people seated working on computers. This area extends about 30 feet from your current position. Above the help desk, look up around 45 degrees to see colorful paper cranes hanging from the ceiling. The floor is covered with dark tiles, providing a stable and even surface. There are no apparent obstacles between you and the help desk.

B.2 Visual Prompts for Correcting Directional Errors

We find that GPT’s ability to count objects and recognize directions is weak. For example, it sometimes misreports the number of doors and tables, and it cannot reliably describe left and right relative to the user’s line of sight (the camera direction). Accurate spatial orientation information is crucial for blind and low-vision (BLV) users. Therefore, in addition to text prompts, we introduce a visual prompt: we embed left and right orientation information in each picture and instruct GPT to use this embedded information when describing the relative positions of objects. In our experiments, this orientation-related visual prompt effectively corrects the orientation errors in the descriptions.

B.2.1 Examples of GPT-4 Responses

First, we present an example of an image description generated by GPT using only the improved text prompt. We observe that GPT struggles to locate objects on the left and right sides correctly. For instance, GPT’s description states that the lockers are on the right side of the image and the glass wall is on the left side, when the opposite is true.

Next, we show the description results generated by combining the visual and text prompts. This corrected the inaccurate direction descriptions.

Improved Version of Text Prompts Only & Response [Uncaptioned image] Concise Version: Modern office hallway with wooden lockers and glass walls. Detailed Version: For Early-Blind This scene is set in a modern office environment, characterized by a straight, extended hallway about 20 feet in length. Along the right side of the hallway, there is a row of lockers, each equipped with metallic handles positioned at waist height to facilitate easy opening. These lockers are uniformly spaced and extend the full length of the hallway. On the opposite side, a smooth, hard barrier runs parallel to the lockers, indicating a separation from another area. The floor beneath is flat and made from a hard material, contributing to the echo of footsteps, which can help in gauging the hallway’s length and your proximity to the walls. Overhead, consistent intervals of sound suggest the presence of lighting fixtures. The hallway is about five feet wide, offering clear passage. At the corridor’s end, there is a turn to the left, though this is beyond the immediate vicinity. The space is notably quiet, hinting at either a time of low activity or effective soundproofing. For Low-Vision & Late-Blind You are in a contemporary office hallway that extends straight ahead, about 20 feet long. To your right, a row of tall, wooden lockers lines the wall, featuring metallic handles at waist height for easy access. Directly opposite these lockers, to your left, is a long, smooth glass wall, beyond which lies another office space, although specifics are obscured by the wall’s frosted lower half. The ceiling overhead is equipped with evenly spaced strip lights, providing bright illumination that reflects off the polished concrete floor beneath your feet. The hallway is wide, approximately five feet across, allowing ample room for navigation without obstacles. At the end of this hallway, the path turns left, though this is just out of immediate reach. 
The atmosphere is quiet, with minimal noise, suggesting a less busy time of day or a well-insulated space.
Combined Version of Visual and Improved Text Prompts & Response [Uncaptioned image] Concise Version: Hallway with lockers on left and office partitions on the right. Detailed Version: For Early-Blind You are in a hallway. To your left at 9 o’clock there is a series of lockers mounted on a wooden base. These lockers are square and follow a grid pattern along the wall. On your right at 3 o’clock, there is an office partition wall with panes that give a semi-transparent appearance. The floor you are standing on is carpeted, and the area around is well illuminated. In front and slightly to the center at about 12:30, there is a structural pillar emerging from the floor and extending towards the ceiling. The hallway extends further ahead past the pillar. The ceiling has visible pipes and ductwork arranged neatly. Ensure to navigate around the pillar as you proceed forward. The route seems clear and straight ahead, eventually leading towards a brighter area that suggests an external light source from a window. For Low-Vision & Late-Blind You are in a hallway with square lockers to your left at 9 o’clock. These lockers are mounted on a wooden base and appear to be organized in a grid pattern. To your right at 3 o’clock, there is an office partition wall, with frosted glass panels near the top. The floor is carpeted, and the hallway appears to be well lit. Directly ahead, there is a supporting pillar in the center of the hallway. Beyond it, the hallway continues straight ahead. The ceiling above has exposed pipes and ducts arranged in an orderly manner. There are no immediate hazards, but be mindful of the pillar directly ahead. The space seems to lead into an area further on with visible light from a distant window.

Appendix C Experimental Settings and Additional Results

C.1 Learning settings

For the learning settings, we employ the Adam optimizer. The optimizer configuration and the other hyperparameters used in our experiments are shown in Table 4.

Table 4: Learning settings used for all model training runs.
Parameter Value
Initial s_x 0.0
Initial s_q -5.0
Initial Learning Rate 0.0001
Optimizer Adam
Weight Decay 0.0005
Gamma 0.1
Scheduler StepLR
Step Size 50
Epoch 200
Batch Size 16
Pre-trained Weights ImageNet1K
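To make the schedule implied by Table 4 concrete, the sketch below computes the learning rate in effect at each epoch under step decay (initial rate 0.0001, multiplied by gamma 0.1 every 50 epochs, for 200 epochs). It is plain Python rather than the authors' training code. The second function is an assumption: the parameter names s_x and s_q suggest learnable weights that balance the position and orientation terms of a pose regression loss, and the form shown is one common way to write such a weighting, not a formula confirmed by this appendix.

```python
import math

# Values copied from Table 4.
INITIAL_LR = 1e-4
GAMMA = 0.1
STEP_SIZE = 50
EPOCHS = 200


def step_lr(epoch, initial_lr=INITIAL_LR, gamma=GAMMA, step_size=STEP_SIZE):
    """Learning rate in effect during `epoch` (0-indexed) under step decay."""
    return initial_lr * gamma ** (epoch // step_size)


def weighted_pose_loss(loss_x, loss_q, s_x=0.0, s_q=-5.0):
    """ASSUMED loss form: position and orientation errors balanced by
    learnable log-scale weights s_x and s_q (initial values from Table 4)."""
    return (loss_x * math.exp(-s_x) + s_x
            + loss_q * math.exp(-s_q) + s_q)


# One learning rate per training epoch.
schedule = [step_lr(epoch) for epoch in range(EPOCHS)]
```

Under these settings the rate falls by a factor of ten at epochs 50, 100, and 150. If the assumed loss form applies, the initial value s_q = -5.0 gives the orientation term a much larger starting weight than the position term.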

C.2 Visualizations

(a) Basement: 2024/04/08-11:34:37
(b) Basement: 2024/04/08-11:40:19
(c) Basement: 2024/04/16-23:59:26
(d) Basement: 2024/05/14-08:51:38
(e) Lower Level: 2024/04/08-11:44:36
(f) Lower Level: 2024/04/15-11:49:29
(g) Lower Level: 2024/04/17-00:12:46
(h) Lower Level: 2024/05/14-07:49:16
(i) Level 1: 2024/04/08-11:55:52
(j) Level 1: 2024/04/17-00:17:14
(k) Level 1: 2024/04/24-21:24:18
(l) Level 1: 2024/05/14-07:58:49
(a) Level 2: 2024/04/08-12:03:48
(b) Level 2: 2024/04/18-17:29:21
(c) Level 2: 2024/05/14-09:16:56
(d) Level 2: 2024/05/29-18:55:34
Figure 12: We randomly selected four videos from each floor in the test dataset to visualize the predictions of camera poses. Predictions are marked with red dots, while ground truths for each floor are color-coded as follows: Basement in green, Lower Level in purple, Level 1 in blue, and Level 2 in orange.