NaVIP: An Image-Centric Indoor Navigation Solution for Visually Impaired People

Jun Yu¹ Yifan Zhang¹ Badrinadh Aila¹ Vinod Namboodiri¹,² {juy220, yiz521, baa223, vin423}@lehigh.edu
¹ Department of Computer Science and Engineering, Lehigh University.
² Department of Community and Population Health, Lehigh University.
https://accesslab180.github.io/navip.github.io

Abstract

Indoor navigation is challenging due to the absence of satellite positioning. The challenge is many times greater for Visually Impaired People (VIPs), who cannot obtain information from wayfinding signage. Other sensor signals (e.g., Bluetooth and LiDAR) can be used to create turn-by-turn navigation solutions with position updates for users. Unfortunately, these solutions require tags to be installed throughout the environment or rely on fairly expensive hardware. Moreover, they require a high degree of manual involvement, which raises costs and hampers scalability. We propose an image dataset and an associated image-centric solution called NaVIP, a step towards visual intelligence that is infrastructure-free and task-scalable and that can assist VIPs in understanding their surroundings. Specifically, we start by curating large-scale phone camera data in a four-floor research building, with 300K images, to lay the foundation for an inclusive, image-centric indoor navigation and exploration solution. Every image is labelled with a precise 6DoF camera pose, details of indoor PoIs, and descriptive captions to assist VIPs. We benchmark two main aspects, 1) the positioning system and 2) exploration support, prioritizing training scalability and real-time inference, to validate the prospect of image-based solutions for indoor navigation. The dataset, code, and model checkpoints are made publicly available at https://github.com/junfish/VIP_Navi.

Figure 1: Illustration of pipelines for purely image-based indoor navigation. We collect videos and extract image frames as data sources. Each image is annotated with: 1) 6-DoF camera poses aligned with the floor plan, 2) indoor points-of-interest (PoIs), and 3) visual descriptions that assist visually impaired people (VIPs) in understanding their surroundings. We highlight the task scalability of this solution, facilitated by its end-to-end training and inference using a simple image forward pass.

1 Introduction

Indoor navigation [1, 2, 3] remains a complex challenge due to the unavailability of satellite signals, such as GPS, for indoor localization. The dynamic nature of indoor scenes further complicates positioning. While Wi-Fi-based indoor positioning systems (WIPS) [4, 5] offer some promise by utilizing signal strength measurements from multiple access points (APs), their effectiveness can vary. To improve positioning accuracy and stability, deploying Bluetooth Low Energy (BLE) beacons [6, 7, 8] within buildings has been explored. These beacons emit radio signals that, when received by mobile devices, can aid in determining user locations. However, this solution necessitates extensive tag hardware deployment and ongoing maintenance and sometimes interferes with other signals, often making it impractical due to resistance from building managers. Similar technologies, such as radio frequency identification (RFID) [9] and ultra-wideband (UWB) [10], have been refined to mitigate the inconvenience and further enhance accuracy; however, these come with the same disadvantages of requiring tag installation and maintenance. Other alternatives like magnetic and acoustic positioning offer infrastructure-free solutions but lack robustness against the spatial and temporal variations inherent in indoor settings. Dead reckoning [11, 12] and SLAM [13, 14] suffer from error accumulation in positioning during navigation and also demonstrate limited robustness to environmental changes. Current marker-based methods offer cost-effective solutions but struggle with positioning accuracy when tracking visual landmarks (e.g., fiducial markers [15, 16]).

All the technologies above have their place in indoor navigation solutions, often complementing each other within hybrid systems [17, 18, 19] that aim to leverage the strengths of each approach while mitigating their individual disadvantages. While the above-mentioned options may already serve sighted individuals with a strong sense of direction, the challenges of indoor navigation are much greater for visually impaired people (VIPs) [20, 21, 22], who cannot independently explore unfamiliar indoor environments due to lack of access to existing wayfinding signage and the inability to create a mental map of complex layouts. For VIPs, the challenge of independent indoor navigation is more than just an inconvenience; it can mean the difference between venturing into unknown indoor spaces and avoiding them altogether. In summary, current indoor navigation technologies struggle to provide effective solutions, a necessity for VIPs, for dynamic environments where (1) accurate and real-time positioning, (2) scalability across varying scene sizes, and (3) understanding of the surroundings are all required simultaneously. To address these concerns, we are pioneering the application of visual intelligence towards indoor navigation. However, our initial and most significant challenge is the absence of a large real-world dataset derived directly from indoor navigation scenarios. These scenarios involve rapidly changing scenes as users move, and continuously evolving environments over days. Therefore, we curated an image-based indoor navigation dataset to facilitate the straightforward implementation of feasible applications. To further support this, we also ensured that the dataset collection process is scalable and adaptable to changes in building layouts and scene sizes.

The key task in such a purely image-based navigation system is positioning, simply put as determining your camera location from a single image. Contemporary methods for this camera localization task exhibit different trade-offs among hardware resources, prediction accuracy, inference time, and algorithm robustness. For example, state-of-the-art (SOTA) accuracy in camera localization is achieved by constructing a 3D model of the scene using either sparse feature points [23, 24] or dense reconstruction [25, 26]. The camera pose is then estimated through geometric calculations [27, 28] or 2D-3D matching [29]. However, these methods typically require significant memory overhead and have slow processing speeds, making them impractical for integration into a navigation system. On the other hand, absolute pose regression (APR) [30, 31, 32] can determine the camera pose with a single forward propagation in deep neural networks using query images. This line of research offers significant advantages in inference speed and can be easily deployed in thin client applications due to its minimal memory footprint. It achieves this while only incurring an acceptable level of accuracy loss in positioning, which is suitable for real-time navigation purposes. Additionally, the APR decision process is aligned with other well-established computer vision (CV) tasks in the era of deep learning, such as image recognition [33, 34], semantic segmentation [35, 36], and depth estimation [37, 38]. Considering task scalability is crucial for the development of an inclusive and accessible indoor navigation system. In this paper, we not only benchmark APR but also demonstrate how image captioning can assist VIPs in navigating indoor environments independently for their own needs [39, 40]. Without the burden of human-crafted annotations, this is achieved by leveraging recent advancements in pre-trained foundation models (PFMs) [41, 42], such as Segment Anything Model (SAM) [36] and GPT-4 [43].

The main contribution of this paper is the creation of an image dataset for an indoor building and the demonstration of our image-centric solution using this dataset to assist VIPs in navigating and exploring dynamically changing indoor environments. Using only a commodity-off-the-shelf (COTS) phone camera and minimizing human involvement in ground-truth annotations, we demonstrate our solution to be both practical and highly adaptable. Our benchmarks utilize deep representation learning for its robustness to varying scenarios, scalability across different scene sizes, and, most importantly, its simple forward pass inference in an end-to-end manner, responding to query images in just a few milliseconds. In §4.1, we benchmark APR methods (e.g., representative PoseNet [30]) on NaVIP, demonstrating their ability to pinpoint phone-captured images on a floor plan with sub-meter accuracy when applied to actual building layouts. §4.2 explores the potential of image captioning techniques to support the exploration needs of VIPs, illustrating the broader applicability of this image-centric approach to addressing the diverse needs of individuals with disabilities. With the release of NaVIP, we hope to spark further interest within the academic community to develop image-centric solutions for challenges such as indoor navigation and exploration, thus realizing some important human and societal benefits of AI.

2 Related Work

2.1 Indoor Navigation

Current full-fledged indoor navigation systems [1, 2, 44, 45] employ a set of technologies including WiFi fingerprints, BLE beacons, magnetic fields, and IMUs, either individually or in combination. These sensor technologies are specifically engineered to address the most challenging aspect of indoor navigation—accurate and robust positioning. [46] proposed a hybrid multi-sensor fusion system for indoor localization using WiFi and LiDAR. ViNav [18], named vision-based navigation system, relies on a combination of WiFi fingerprints, dead reckoning, and SfM 3D point clouds to localize user positions. ASSIST [47] is a personalized system using multimodal sensors and high-level semantic information, with efficacy tested on a blind and visually impaired (BVI) group. [48] proposed data-driven methods similar to ours but trained their localization model using annotated Wi-Fi observations instead of images. [49] explored the combination of radio and visible light communication-based positioning technologies.

Recent advancements in vision-and-language navigation (VLN) [50, 51, 52] have facilitated the translation of natural language instructions into practical actions for embodied agents, leveraging their visual perceptions. This development is instrumental in assisting users who can command such agents for indoor applications, such as search-and-rescue missions [53]. However, most VLN methods [54, 55, 56, 57, 58, 59, 60, 61, 62] are largely limited to controlled, simulated 3D environments [63], which significantly narrows their applicability in real-world settings. Interactive VLN [64, 65, 66], which synchronizes human feedback to adapt to new environments, can explore unknown command feasibility, but struggles in scenarios where the oracle is disabled. While trajectory-instruction generation [67, 68, 69, 70, 52, 71, 72, 73, 74] that synthesizes new language instructions can alleviate data scarcity, its efficacy still lags behind precious oracle instructions [75].

2.2 Visual Localization

The problem of visual localization, also known as camera pose estimation, is fundamental to many CV applications besides navigation, such as augmented reality, SLAM systems, and autonomous driving.

Indirect localization casts the camera pose as a query frame retrieval problem [76, 77, 24]. Traditional methods [78] are dependent on the quality of feature detection and matching, and often require manual tuning to effectively retrieve the most similar images stored in a database or to interpolate the camera pose from top retrieved images. In contrast, deep learning approaches [24, 79, 80, 81] utilize hierarchical pipelines that establish 2D-3D correspondences more efficiently and robustly. Despite achieving SOTA accuracy, these methods, which often incorporate PnP and RANSAC processes, tend to be 10 to 100 times slower than APR methods. Furthermore, although relative pose regression (RPR) [82, 83, 84, 85, 86, 87] also adopts camera pose regression, the inherent retrieval phase continues to hinder inference efficiency.

Direct localization can instantly relocalize the camera pose from the query image [88, 89, 90, 30, 91, 92], thus requiring a smaller memory footprint compared to indirect methods. Feature-based matching methods [93, 94, 95, 96, 97] always perform global localization by matching the feature points of query images with a 3D point cloud reconstruction. APR methods [30, 98, 91, 99, 100, 101, 102, 103, 92, 104, 105, 32] are capable of directly regressing the camera pose from a single image input in milliseconds, with minimal loss of accuracy. Although these methods have historically struggled to generalize to new camera poses [106], recent advancements in novel view synthesis (NVS) [107, 108, 102, 109, 32] could alleviate this burden via synthesizing new images from random viewpoints as data augmentation. Recent floorplan localization [110, 111] proposed to directly predict and localize with a single image in a floor plan.

2.3 CV and Beyond

CV has achieved some of the most prominent successes among generic applications in the deep learning era, including object detection [112, 113, 114], pose estimation [115, 116, 117], semantic segmentation [35, 36], and depth estimation [37, 38]. Recent PFMs [41, 42] provide a powerful base that can achieve zero-shot performance. For example, SAM [36] enables zero-shot generalization in the detection and segmentation of unfamiliar subjects. Depth Anything [38] leverages unlabeled data to achieve SOTA performance in monocular depth estimation across previously unseen datasets. More recently, multimodal large language models (MLLMs), such as GPT-4(o) [43] and Gemini [118], have been instrumental in providing accurate image descriptions and facilitating various user-adaptive tasks including image-to-text and image-to-audio conversions. Although there have been significant advances in computer vision, research dedicated to enhancing assistive and accessibility technologies remains limited [119, 120]. For VIPs, the VizWiz project (https://vizwiz.org/) [121, 122, 123, 124, 125] pioneered the first Visual Question Answering (VQA) datasets that originate directly from VIPs and are tailored to benefit them. Recently, applications such as Be My Eyes and Seeing AI have utilized a generative AI-powered virtual volunteer (https://openai.com/index/be-my-eyes/) to interpret and describe images for their BVI users. Our research, however, distinguishes itself by focusing on customizing distilled models and providing tailored navigation assistance, thereby enhancing the support offered to VIPs.

3 NaVIP

3.1 Data collection and annotation

This section presents the methods used for collecting and annotating our large-scale image dataset. To make this process easy to extend to other buildings, we prioritize reducing human labour. Within our workflow, human involvement is confined to recording videos, pinpointing 5–10 images from each video to the floor plan, and designing prompts for GPT-4. We also release the data preprocessing and annotation code to further alleviate the burdens associated with data collection and annotation.

3.1.1 Collection

To facilitate the adaptation of this data-driven, vision-based indoor navigation solution across various buildings in the future, data collection is streamlined to minimize human effort and allow for automation through a basic robot that does not require specialized design. We therefore record videos with smartphone built-in cameras and subsequently extract image frames. In this study, we collected approximately 400 videos, each lasting 2–4 minutes, within the Health, Science, and Technology (HST) Building, a hub designed to encourage collaboration among faculty and students across disciplines and the largest building Lehigh has ever built. By setting the frame extraction interval between 0.2 and 0.3 seconds, we obtained a large dataset comprising around 300K images. The simplicity of our data collection makes the dataset easy to expand, either through additional video recordings or by narrowing the time interval for frame extraction.
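
As a rough illustration of the frame-extraction step, the sketch below converts a target time interval into a frame stride. The function names and the 60 fps frame rate are assumptions made for illustration only; the text above specifies just the 0.2 to 0.3 second interval.

```python
from typing import List


def frame_stride(fps: float, interval_s: float) -> int:
    """Number of frames between two kept frames for a target time interval."""
    if fps <= 0 or interval_s <= 0:
        raise ValueError("fps and interval_s must be positive")
    return max(1, round(fps * interval_s))


def sampled_frame_indices(num_frames: int, stride: int) -> List[int]:
    """Indices of the frames kept when taking every `stride`-th frame."""
    return list(range(0, num_frames, stride))


# Hypothetical clip: 60 fps, 3 minutes long, sampled every 0.25 s.
stride = frame_stride(fps=60.0, interval_s=0.25)
indices = sampled_frame_indices(num_frames=60 * 180, stride=stride)
print(stride, len(indices))
```

Under these assumed settings a 3-minute clip yields 720 frames, so about 400 such clips would give on the order of 300K images, which is consistent with the dataset size reported above.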

To enhance the robustness of algorithms developed from this dataset, videos were captured using four distinct smartphone models, in both landscape and portrait orientations. The phones were held in two different ways: by hand and with a smartphone gimbal stabilizer. To comprehensively capture the variability of indoor environments, recordings spanned various times of the day (sunrise, morning, noon, afternoon, sunset, and evening) and were conducted from December 2023 through June 2024. Videos recorded after April are designated as the testing set to ensure that models developed using this dataset exhibit generalizability. For additional statistics and instructions regarding the data, please refer to Appendix A. We acknowledge that, despite our efforts to align this dataset closely with real-world scenarios, dataset biases still exist and may be captured by models [126, 127]. For example, our dataset lacks coverage of the fall season.

Algorithm 1 Camera Pose (6DoF) Annotation

Require: Video set $\mathcal{V}=\{V_{1},\cdots,V_{N}\}$ by mobile cameras.
Ensure: Geo-registered ground-truth of camera poses (positions and orientations) on the floor plan for extracted video frames.
1: Initialize frame interval $\Delta f$. $\triangleright$ $\Delta f\approx 16$
2: for $n=1,\cdots,N$ do
3:   Sample the image sequences $\mathcal{I}_{n}=\{I_{1},\cdots,I_{M_{n}}\}$ from video $V_{n}$ every $\Delta f$ frames.
4:   repeat
5:     Run SfM algorithm on image set $\mathcal{I}$. $\triangleright$ Use COLMAP.
6:     if Point clouds align then
7:       Obtain camera pose for image sequences $\mathcal{I}$
8:     else
9:       Sample more images $\mathcal{I}^{\prime}$ at breaking points. $\triangleright$ $\Delta f^{\prime}=1/3\Delta f$
10:      Data augmentations: $\mathcal{I}=\mathcal{I}\cup\mathcal{I}^{\prime}$
11:    end if
12:  until Point clouds align
13:  Geo-register the camera pose on the floor plan. $\triangleright$ See Figure 2.
14: end for
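
The repeat-until loop of Algorithm 1 can be sketched in Python as follows. This is only a control-flow illustration: `run_sfm` is a hypothetical callable standing in for an SfM run (it is not a COLMAP API), `max_rounds` is a safety guard that the algorithm itself does not include, and the densification rule (adding frames between a break frame and its successor) is an assumed reading of "sample more images at breaking points".

```python
from typing import Callable, List, Set


def annotate_video(
    num_frames: int,
    run_sfm: Callable[[List[int]], List[int]],
    delta_f: int = 16,
    max_rounds: int = 5,
) -> List[int]:
    """Return the frame indices used once the reconstruction is connected.

    `run_sfm` (hypothetical) takes sorted frame indices and returns the
    indices after which the reconstruction breaks; an empty list means the
    point clouds align.
    """
    frames: Set[int] = set(range(0, num_frames, delta_f))
    dense_step = max(1, delta_f // 3)  # finer interval, about delta_f / 3
    for _ in range(max_rounds):
        ordered = sorted(frames)
        breaks = run_sfm(ordered)
        if not breaks:
            return ordered
        for b in breaks:
            # Add extra frames between the break frame and its successor.
            frames.update(range(b, min(b + delta_f, num_frames), dense_step))
    raise RuntimeError("reconstruction did not align within max_rounds")


def toy_sfm(ordered: List[int]) -> List[int]:
    """Toy stand-in: the segment starting at frame 32 needs gaps <= 8."""
    return [a for a, b in zip(ordered, ordered[1:]) if a == 32 and b - a > 8]


frames_used = annotate_video(num_frames=100, run_sfm=toy_sfm)
print(frames_used)
```

In a real pipeline the returned frame set would be passed on to pose extraction and geo-registration; here the toy callable only demonstrates that the loop terminates once the break is densified.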

3.1.2 Camera Pose

There are multiple ways to annotate a camera with its accurate shooting position. Most methods require professional equipment [88, 99, 26], e.g., the NavVis VLX 3D scanner backpack, or involve complex optimization pipelines that are either closed-source or difficult to replicate [128, 129, 130]. We summarize camera relocalization datasets, both indoor and outdoor, in Table 1. It is noteworthy that only the Cambridge dataset [30] relied on a ubiquitous device and open-source software to obtain the 6-DoF ground truth of camera poses, achieving an error level below 10 dm outdoors. Based on this, we emphasize the importance of low-cost and replicable methods for obtaining ground-truth data for training, especially when our users include low-income groups with health disparities. We employ the active open-source project COLMAP (available at https://github.com/colmap/colmap and licensed under the BSD-3-Clause license) to reconstruct the 3D scene and determine the camera pose for each image in our collections. The 6DoF annotation process is presented in Algorithm 1.

As discussed in § 3.1.1, to capture the dynamic variations in scene evolution over time, we independently collected approximately 400 videos. These videos were recorded by various collectors using different gestures and devices at diverse times. Consequently, the world coordinates for each video are isolated within the COLMAP 3D reconstructions; they are not geo-registered and cannot be integrated into a unified world coordinate system that would align with a floor plan for navigation purposes. We introduced minimal human labor to accurately pinpoint 5–10 images from each video on the floor plan. This accuracy is achievable because the 3D points of the scene and each image are visualized in the COLMAP GUI. Additionally, we utilized the geo-registration function embedded in COLMAP to transform the world coordinates. Figure 2 shows the process of geo-registering all camera poses from different paths into a unified world coordinate system, ensuring alignment with the floor plan.
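
To make the geo-registration step concrete, the sketch below fits a similarity transform (scale, rotation, translation) that maps a handful of reconstructed camera centers onto their manually pinpointed floor-plan coordinates. It is not COLMAP's implementation: it is a generic closed-form least-squares alignment in the style of Umeyama, and the function name, point counts, and synthetic data are all assumptions for illustration.

```python
import numpy as np


def fit_similarity(src, dst):
    """Least-squares s, R, t such that dst ≈ s * R @ src + t (row-wise points)."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    cov = dst_c.T @ src_c / len(src)          # cross-covariance of the two sets
    U, S, Vt = np.linalg.svd(cov)
    D = np.eye(src.shape[1])
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        D[-1, -1] = -1.0                       # avoid returning a reflection
    R = U @ D @ Vt
    var_src = (src_c ** 2).sum() / len(src)
    s = np.trace(np.diag(S) @ D) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t


# Synthetic check: 8 anchor cameras in an arbitrary SfM frame (assumed data).
rng = np.random.default_rng(0)
sfm_points = rng.normal(size=(8, 3))
angle = 0.3
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0,            0.0,           1.0]])
s_true, t_true = 2.5, np.array([10.0, -4.0, 1.0])
floor_points = s_true * sfm_points @ R_true.T + t_true

s, R, t = fit_similarity(sfm_points, floor_points)
registered = s * sfm_points @ R.T + t          # every pose can be mapped this way
```

With 5–10 anchors per video the fit is overdetermined, so small pinpointing errors are averaged out rather than propagated exactly.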

Table 1: Summary of camera relocalization datasets. This table exclusively compares datasets that are publicly available.

Dataset	Environment	Device	# Train / Test	Resolution	Scope	Ground Truth Tool	Error Level
Dubrovnik 6K [131]	Outdoor	–	6K / 0.8K	–	$1.5\times 1.5$ km²	SIFT Matching	$\sim 10$ m
7-Scenes [88]	Indoor	Kinect depth camera	16K / 17K	$640\times 480$	$4\times 3$ m²	KinectFusion [132]	$<10$ cm
Cambridge [30]	Outdoor	Smartphone camera	8.4K / 4.8K	$1920\times 1080$	$500\times 100$ m²	MVS [133]	$<10$ dm
12-Scenes [128]	Indoor	Structure.io depth sensor with iPad	240K / 6.7K	$1296\times 968$	80 m³	VoxelHashing [134]	$\sim 10$ cm
TUM-LSI [99]	Indoor	NavVis M3 trolley	875 / 220	$4592\times 3448$	5575 m²	SLAM	$<10$ cm
InLoc [26]	Indoor	Faro 3D laser scanner	10K / 0.4K	$1600\times 1200$	186 m²	LiDAR + Manual	$\sim 10$ dm
LaMAR [129]	Indoor & Outdoor	Microsoft Hololens 2, Smartphone, iPad, NavVis M6 or VLX backpack	–	$640\times 480$	45,000 m²	LiDAR + SfM + VIO [135]	$<10$ cm
360Loc [130]	Indoor & Outdoor	Velodyne lidar with $360^{\circ}$ camera	9.3K / –	$6144\times 3072$	$105\times 70$ m²	LiDAR + VIO	$<10$ cm
NaVIP (Ours)	Indoor	Smartphone camera	212K / 88K	$3840\times 2160$	$40\times 90$ m²	COLMAP [23]	$<10$ cm

3.2 PoIs

By aligning all images to the floor plan, annotating nearby points-of-interest (PoIs) becomes straightforward. As illustrated in Figure 3, we can mark pixel-level PoI labels for each public area. Query images whose poses have been predicted can then provide feedback on these PoIs to users, depending on the camera positions and orientations. We release the detailed PoIs for each pixel of the HST floor plans to support the development of applications.
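
As an illustration of how a predicted position could be turned into PoI feedback, the sketch below looks up a pixel-level label map at the predicted floor-plan location. The array layout, the label-to-name dictionary, and the `metres_per_pixel` scale are hypothetical; the released PoI files may be organized differently.

```python
import numpy as np


def poi_at(label_map, poi_names, x_m, y_m, metres_per_pixel):
    """Return the PoI name stored at floor-plan position (x_m, y_m), or None."""
    col = int(round(x_m / metres_per_pixel))
    row = int(round(y_m / metres_per_pixel))
    if not (0 <= row < label_map.shape[0] and 0 <= col < label_map.shape[1]):
        return None  # predicted position falls outside the floor plan
    return poi_names.get(int(label_map[row, col]))


# Toy 4 x 6 label map at 1 m per pixel (assumed values).
label_map = np.zeros((4, 6), dtype=int)
label_map[1:3, 2:5] = 1
poi_names = {0: "corridor", 1: "elevator lobby"}

print(poi_at(label_map, poi_names, x_m=3.0, y_m=2.0, metres_per_pixel=1.0))
```

A fuller version would also use the predicted orientation, for example by sampling the label map along the viewing direction instead of only at the camera position.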

3.3 Captions for VIPs

We leverage recent MLLMs, e.g., OpenAI GPT-4 and Google Gemini, to generate tailored image descriptions that meet the specific needs of VIPs. Our prompt design considers both the capabilities of these models and the feedback from VIP volunteers. For more details on our explorations of prompting MLLMs and output examples of image descriptions, please refer to Appendix B. We offer three types of image descriptions for each image:

• A concise description suitable for general use.
• A detailed description specifically designed for individuals with vision impairments (including those with low vision or acquired blindness).
• Another detailed description tailored for individuals who have been blind since birth.

4 Benchmarking

To evaluate generalization across ever-changing indoor environments, we use images captured prior to April 15th as the training set and images captured after May 1st as the testing set throughout the benchmarking process.
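
A minimal sketch of this temporal split is given below. The year 2024 follows from the December 2023 to June 2024 collection window; how images captured between the two cut-off dates are handled is not stated, so returning `None` for them is an assumption.

```python
from datetime import date
from typing import Optional

TRAIN_END = date(2024, 4, 15)   # images captured before this date -> train
TEST_START = date(2024, 5, 1)   # images captured after this date -> test


def assign_split(captured: date) -> Optional[str]:
    """Map a capture date to 'train', 'test', or None (unused gap period)."""
    if captured < TRAIN_END:
        return "train"
    if captured > TEST_START:
        return "test"
    return None


print(assign_split(date(2023, 12, 20)), assign_split(date(2024, 6, 1)))
```

Splitting by date rather than at random keeps later changes to the scene out of the training data, which is what makes the benchmark a test of generalization over time.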

4.1 Camera Relocalization

Preliminary. For clarity and consistency, we represent the camera pose in 3D space by a 2-element tuple $[\boldsymbol{x},\boldsymbol{q}]$ according to [30], where $\boldsymbol{x}\in\mathbb{R}^{3}$ defines the position of camera center in 3D Cartesian coordinates and $\boldsymbol{q}\in\mathbb{R}^{4}$ is the unit quaternion encoding its orientation. Other variations focus on orientation representations, with BranchPoseNet [136] employing an Euler angle representation and MapNet [137] using a logarithm of the unit quaternion.

Our purpose is to directly regress the camera pose $[\hat{\boldsymbol{x}},\hat{\boldsymbol{q}}]$ from a single monocular image $I$ using the trained function $f$. The standard objective loss function is defined as follows:

$$\mathcal{L}(I)=\|\hat{\boldsymbol{x}}-\boldsymbol{x}\|_{2}+\beta\cdot\left\|\hat{\boldsymbol{q}}-\frac{\boldsymbol{q}}{\|\boldsymbol{q}\|_{2}}\right\|_{2}, \tag{1}$$

where $[\hat{\boldsymbol{x}},\hat{\boldsymbol{q}}]=f(I)$ represents the values predicted by our models. Notably, $\beta$ is a hyperparameter introduced to balance the learning scales between position and orientation. Learnable PoseNet [91] captures the homoscedastic uncertainty [138] of the two tasks and omits $\beta$ as

$$\mathcal{L}(I)=e^{-s_{x}}\cdot\|\hat{\boldsymbol{x}}-\boldsymbol{x}\|_{2}+e^{-s_{q}}\cdot\left\|\hat{\boldsymbol{q}}-\frac{\boldsymbol{q}}{\|\boldsymbol{q}\|_{2}}\right\|_{2}+s_{x}+s_{q}, \tag{2}$$

where $s_{x}$ and $s_{q}$ are both learnable and only an approximate initial guess is required.
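
The two objectives can be written out directly, as in the NumPy sketch below for a single sample. Hatted quantities are predictions and unhatted ones are ground truth, following the text above; the function names and toy values are assumptions. In actual training these losses would be implemented in an autodiff framework so that $s_{x}$ and $s_{q}$ receive gradients.

```python
import numpy as np


def pose_loss(x_hat, q_hat, x, q, beta):
    """Eq. (1): position error plus beta-weighted orientation error."""
    q_unit = q / np.linalg.norm(q)
    return np.linalg.norm(x_hat - x) + beta * np.linalg.norm(q_hat - q_unit)


def learnable_pose_loss(x_hat, q_hat, x, q, s_x, s_q):
    """Eq. (2): exp(-s_x) and exp(-s_q) replace the fixed weight beta."""
    q_unit = q / np.linalg.norm(q)
    l_x = np.linalg.norm(x_hat - x)
    l_q = np.linalg.norm(q_hat - q_unit)
    return np.exp(-s_x) * l_x + np.exp(-s_q) * l_q + s_x + s_q


# Toy sample (assumed values): 0.5 m position error, small rotation error.
x, x_hat = np.array([1.0, 2.0, 0.0]), np.array([1.0, 2.0, 0.5])
q, q_hat = np.array([1.0, 0.0, 0.0, 0.0]), np.array([0.9, 0.1, 0.0, 0.0])

fixed = pose_loss(x_hat, q_hat, x, q, beta=np.exp(5))
learned = learnable_pose_loss(x_hat, q_hat, x, q, s_x=0.0, s_q=-5.0)
print(fixed, learned)
```

With $s_{x}=0$ and $s_{q}=-5$, Eq. (2) equals Eq. (1) with $\beta=e^{5}$ shifted by the constant $s_{x}+s_{q}=-5$, so the two objectives weight position and orientation identically until $s_{x}$ and $s_{q}$ are updated.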

Table 2: Mean and median errors ($\downarrow$) of models across four floors (Basement, Lower Level, Level 1, and Level 2) in our dataset. Benchmarking is limited to models with publicly available code.

Model	Backbone	Basement Mean	Basement Median	Lower Level Mean	Lower Level Median	Level 1 Mean	Level 1 Median	Level 2 Mean	Level 2 Median
PoseNet [30]	ResNet-34	0.52m, 5.50°	0.42m, 4.52°	0.96m, 7.14°	0.71m, 5.70°	0.91m, 7.65°	0.52m, 6.21°	1.12m, 6.78°	0.72m, 5.99°
PoseNet [30]	MobileNet-V3	0.60m, 5.78°	0.52m, 5.01°	0.95m, 7.10°	0.69m, 5.59°	0.90m, 7.94°	0.50m, 6.34°	1.11m, 6.91°	0.73m, 6.04°
Bayesian PoseNet [98]	ResNet-34	0.57m, 6.88°	0.51m, 5.70°	1.02m, 8.34°	0.75m, 6.04°	0.98m, 9.04°	0.61m, 7.38°	1.09m, 7.34°	0.71m, 6.51°
Bayesian PoseNet [98]	MobileNet-V3	0.62m, 6.99°	0.60m, 5.98°	1.05m, 8.73°	0.79m, 6.48°	1.01m, 8.93°	0.72m, 6.76°	1.09m, 7.52°	0.72m, 6.58°
LSTM-PoseNet [99]	ResNet-34	0.49m, 6.20°	0.42m, 5.13°	0.92m, 7.43°	0.74m, 5.97°	0.81m, 7.54°	0.51m, 5.49°	0.92m, 6.17°	0.59m, 4.73°
LSTM-PoseNet [99]	MobileNet-V3	0.52m, 6.49°	0.41m, 6.24°	0.94m, 7.83°	0.73m, 5.99°	0.84m, 7.62°	0.52m, 5.73°	0.95m, 6.21°	0.60m, 4.41°
Learnable PoseNet [91]	ResNet-34	0.39m, 3.41°	0.33m, 2.19°	0.76m, 3.24°	0.50m, 3.19°	0.70m, 4.30°	0.46m, 3.84°	0.87m, 3.90°	0.49m, 2.41°
Learnable PoseNet [91]	MobileNet-V3	0.42m, 3.90°	0.34m, 2.81°	0.78m, 3.20°	0.53m, 3.14°	0.75m, 4.76°	0.47m, 3.90°	0.91m, 4.74°	0.46m, 3.82°
Geometric PoseNet [91]	ResNet-34	0.41m, 4.45°	0.33m, 3.01°	0.80m, 3.41°	0.55m, 3.24°	0.73m, 4.28°	0.47m, 3.87°	0.90m, 5.21°	0.57m, 3.43°
Geometric PoseNet [91]	MobileNet-V3	0.46m, 3.89°	0.39m, 2.80°	0.83m, 3.40°	0.56m, 3.19°	0.77m, 4.91°	0.49m, 4.19°	0.94m, 5.32°	0.58m, 3.49°
Hourglass PoseNet [100]	ResNet-34	0.45m, 5.95°	0.34m, 4.91°	0.88m, 5.47°	0.62m, 4.79°	0.81m, 6.23°	0.49m, 4.02°	0.85m, 6.87°	0.49m, 3.96°
Hourglass PoseNet [100]	MobileNet-V3	0.53m, 6.18°	0.48m, 6.03°	0.89m, 5.86°	0.60m, 4.79°	0.83m, 6.48°	0.50m, 4.18°	0.94m, 8.31°	0.52m, 4.37°
BranchNet-Euler6 [136]	ResNet-34	0.43m, 5.99°	0.32m, 4.72°	0.79m, 4.00°	0.48m, 3.25°	0.94m, 9.21°	0.65m, 6.27°	1.04m, 7.26°	0.78m, 6.13°
BranchNet-Euler6 [136]	MobileNet-V3	0.50m, 6.23°	0.41m, 4.82°	0.84m, 4.31°	0.51m, 3.36°	0.98m, 9.74°	0.66m, 6.40°	1.16m, 8.17°	0.82m, 7.10°
MapNet [103]	ResNet-34	0.39m, 3.41°	0.31m, 2.71°	0.70m, 3.75°	0.42m, 3.10°	0.65m, 4.58°	0.44m, 3.90°	0.82m, 5.43°	0.48m, 3.97°
MapNet [103]	MobileNet-V3	0.40m, 3.37°	0.29m, 2.65°	0.74m, 3.76°	0.43m, 3.22°	0.66m, 4.79°	0.41m, 4.21°	0.86m, 4.27°	0.50m, 3.21°
MSPN [139]	ResNet-34	0.39m, 3.40°	0.30m, 2.68°	0.68m, 3.70°	0.42m, 3.13°	0.69m, 5.17°	0.47m, 3.96°	0.91m, 3.71°	0.48m, 2.53°
MSPN [139]	MobileNet-V3	0.41m, 3.38°	0.30m, 2.58°	0.73m, 3.94°	0.45m, 3.27°	0.71m, 5.32°	0.48m, 3.87°	0.92m, 4.14°	0.49m, 3.04°
Direct-PoseNet [102]	ResNet-34	0.35m, 3.94°	0.27m, 3.10°	0.69m, 3.77°	0.44m, 2.69°	0.63m, 4.74°	0.41m, 3.90°	0.84m, 3.94°	0.46m, 2.97°
Direct-PoseNet [102]	MobileNet-V3	0.39m, 3.49°	0.28m, 2.88°	0.70m, 3.71°	0.45m, 2.71°	0.61m, 4.82°	0.42m, 3.89°	0.88m, 4.10°	0.45m, 3.04°
MS-Transformer [92]	ResNet-34	0.35m, 4.47°	0.26m, 3.78°	0.63m, 3.74°	0.43m, 2.50°	0.60m, 4.66°	0.41m, 3.84°	0.83m, 4.21°	0.43m, 2.76°
MS-Transformer [92]	MobileNet-V3	0.44m, 5.26°	0.36m, 4.19°	0.65m, 3.89°	0.41m, 2.43°	0.61m, 4.06°	0.45m, 3.17°	0.85m, 4.18°	0.43m, 2.96°
PAE [104]	ResNet-34	0.40m, 3.99°	0.27m, 2.87°	0.71m, 3.79°	0.45m, 2.57°	0.71m, 5.23°	0.49m, 4.15°	0.96m, 4.73°	0.51m, 3.25°
PAE [104]	MobileNet-V3	0.41m, 4.10°	0.27m, 2.93°	0.73m, 3.80°	0.47m, 2.63°	0.71m, 5.23°	0.49m, 4.15°	0.98m, 4.69°	0.52m, 3.09°
DFNet [105]	ResNet-34	0.36m, 3.49°	0.28m, 2.58°	0.60m, 3.80°	0.43m, 2.78°	0.64m, 4.82°	0.43m, 3.85°	0.81m, 4.37°	0.43m, 2.81°
DFNet [105]	MobileNet-V3	0.40m, 3.56°	0.34m, 3.16°	0.61m, 3.72°	0.42m, 2.55°	0.67m, 5.01°	0.47m, 3.91°	0.80m, 4.26°	0.45m, 2.90°

Settings. To ensure a fair comparison in benchmarking camera relocalization, we employed identical CNN architectures (ResNet-34 [34] and MobileNet-V3 [140]) across 13 models ranging from the pioneering PoseNet [30] to its latest advancements [104, 105]. All of these models were trained for 200 epochs. Specifically, for the PoseNet-series models, we utilized the loss function defined in Eq. (1), setting $\beta$ to $e^{5}$ across all four floors (tuning $\beta$ specifically for each floor may yield better results). In the Bayesian PoseNet, uncertainty is incorporated only before the layers with randomly initialized weights, following [98]. LSTM-PoseNet models have the hidden size of all LSTM units set to 256 to achieve optimal results [99]. Learnable PoseNet starts with initial guesses of $s_{x}$ and $s_{q}$ set to $0$ and $-5$, respectively, for all scenes [91], and Geometric PoseNet then continues training these models using geometric reprojection data to balance positional and rotational errors per image. We employed feature map concatenation in ResNet-34 and element-wise summation in MobileNet-V3 to combine features from the front layers and achieve the optimal results in Hourglass PoseNet [100]. For BranchNet-Euler6 [136], the networks are split before the final convolutional block to facilitate multi-task learning. MapNet [103] configurations avoid updating model weights with unlabeled data to maintain fairness in APR comparisons. Both MapNet and MSPN [139] adopt the log-quaternion for rotation representation. Direct-PoseNet weights the photometric (direct matching) difference against the loss function in Eq. (1) at a ratio of $3/7$ [102]. MS-Transformer [92] replaces only the convolutional backbones used to extract activation maps while preserving the encoder-decoder architecture of Transformers. PAE embeds $\boldsymbol{x}$ and $\boldsymbol{q}$ using Fourier Features with expanded dimensions of 12 ($L=6$ in [104, Eq.(5)]). Finally, DFNet results are confined to single-frame APR [105]. For more details of learning settings, please refer to Appendix C and the publicly available code.
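
To illustrate what an expansion to 12 dimensions with $L=6$ can look like, the sketch below applies a sine/cosine Fourier-feature encoding to each scalar pose component. The exact form of [104, Eq.(5)] is not reproduced in this paper, so the power-of-two frequency schedule and the sin/cos ordering used here are assumptions borrowed from the common positional-encoding convention.

```python
import numpy as np


def fourier_features(p, num_freqs=6):
    """Map each scalar in p to 2 * num_freqs sine/cosine features."""
    p = np.asarray(p, dtype=float)
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi       # assumed schedule, shape (L,)
    angles = p[..., None] * freqs                        # shape (..., L)
    feats = np.stack([np.sin(angles), np.cos(angles)], axis=-1)  # (..., L, 2)
    return feats.reshape(*p.shape, 2 * num_freqs)


position = np.array([0.0, 0.5, 1.0])        # toy 3D position (assumed values)
encoded = fourier_features(position)
print(encoded.shape)
```

Under this assumed encoding each of the three position components and four quaternion components expands to 12 values.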

Results. Table 2 shows the mean and median errors for both camera positions and orientations. Notably, even the pioneering PoseNet [30] from 2015 achieves a sub-meter median position error on every floor, meeting the requirements for indoor navigation applications. We observe that the best performance on average is achieved by MS-Transformer [92], which benefits from its large model capacity enabled by the use of Transformers and a multi-scene mixing training strategy. Figure 4 illustrates the cumulative distribution function (CDF) of MS-Transformer's position and orientation errors across the four floors. For additional experimental results and visualization, please refer to Appendix C.
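
For reference, the sketch below shows one common way to compute such position and orientation errors in APR benchmarks: the Euclidean distance between camera centers and the angle between predicted and ground-truth quaternions. The exact evaluation code used for Table 2 is not given here, so this formulation and the toy values are assumptions.

```python
import numpy as np


def position_error_m(x_hat, x):
    """Euclidean distance between predicted and true camera centers."""
    return np.linalg.norm(np.asarray(x_hat) - np.asarray(x), axis=-1)


def orientation_error_deg(q_hat, q):
    """Rotation angle in degrees between two (batches of) quaternions."""
    q_hat = np.asarray(q_hat, dtype=float)
    q = np.asarray(q, dtype=float)
    q_hat = q_hat / np.linalg.norm(q_hat, axis=-1, keepdims=True)
    q = q / np.linalg.norm(q, axis=-1, keepdims=True)
    # abs() treats q and -q as the same rotation.
    dot = np.abs(np.sum(q_hat * q, axis=-1))
    return np.degrees(2.0 * np.arccos(np.clip(dot, -1.0, 1.0)))


# Toy batch of three predictions (assumed values).
x_true = np.zeros((3, 3))
x_pred = np.array([[0.3, 0.0, 0.0], [0.0, 0.4, 0.0], [0.0, 0.0, 1.1]])
pos_err = position_error_m(x_pred, x_true)
print(pos_err.mean(), np.median(pos_err))

identity = np.array([1.0, 0.0, 0.0, 0.0])
quarter_turn = np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])
print(orientation_error_deg(quarter_turn, identity))
```

The toy batch also shows why both statistics are reported: a single large outlier raises the mean well above the median.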

4.2 Image Captioning

We can train a captioning model using outputs from GPT-4 to meet the various needs of VIPs. Image captioning requires fluent descriptions to translate images into natural language [141, 142]. We mix image data from the four floors and use the concise descriptions as ground truth to distill [143] our own models. We validate the results using the BLEU-4 [144], METEOR [145], CIDEr [146], and SPICE [147] metrics. Additionally, we report the model size in number of parameters and the training time in GPU hours to indicate the feasibility and practicality of this visual intelligence support. The results in Table 3 demonstrate accurate predictions with smaller distilled models.

Table 3: Experimental results on concise descriptions.

Model	Backbone	BLEU-4 $\uparrow$	METEOR $\uparrow$	CIDEr $\uparrow$	SPICE $\uparrow$	#Params	Training Time $\downarrow$
ClipCap [148]	CLIP (ViT-B/32) + GPT-2 tuning	35.18	27.34	113.83	20.19	156 M	46h (A6000)
OFA$_{\text{base}}$ [149]	ResNet101 + Transformer	36.31	30.02	126.77	26.85	180 M	53h (A6000)
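To make the BLEU-4 column more concrete, the following is a simplified, sentence-level sketch of the metric with a single reference and no smoothing: the geometric mean of clipped 1- to 4-gram precisions, multiplied by a brevity penalty. It is an illustration only; standard captioning toolkits compute BLEU at corpus level over multiple references, and the scores in Table 3 appear to be scaled by 100. The example captions are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(candidate, reference):
    """Simplified sentence-level BLEU-4 in [0, 1] against one reference."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, 5):
        c_counts, r_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, r_counts[g]) for g, count in c_counts.items())
        total = sum(c_counts.values())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty discourages candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1.0 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / 4.0)

# Hypothetical captions, not taken from the dataset.
reference = "a hallway with a door on the right"
candidate = "a hallway with a door on the left"
print(bleu4(candidate, reference))
```

A single wrong word lowers all four n-gram precisions at once, which is why BLEU-4 is sensitive to local phrasing and is usually reported together with METEOR, CIDEr, and SPICE.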

5 Potential Applications

APR methods development. Popular datasets commonly used for benchmarking APR methods include 7-Scenes [88] and Cambridge Landmarks [30]. These datasets, however, are limited in scope and size, which can lead to the overfitting phenomenon already observed in previous research [30, 91, 99, 100]. This limitation complicates the performance assessment of newly proposed models, particularly in an era dominated by deep learning and PFMs. For instance, Bayesian PoseNet [98] performs worse than the classic PoseNet on our dataset. This discrepancy arises because any dropout rate applied to PoseNet constrains its capacity rather than serving its intended purpose of regularization.

VLN test under real-world environments. We plan to release this comprehensive dataset, which includes not only the original videos but also the floor plans annotated with PoIs. Each video can be segmented into clips, each representing a unique navigation path. These clips are associated with an automated point-to-point (PoI-to-PoI) oracle that can be used for VLN model training. Moreover, since descriptions for each image will be available, exploring a VLN model based on PoI-to-PoI instructions that operates without the need for detailed step-by-step guidance in natural language appears promising, particularly in unknown real-world environments. Our dataset has the potential to facilitate more robust and flexible navigation solutions, which are essential for navigating dynamic or unfamiliar spaces effectively.

Floor plan navigation development. Each image in our dataset is geo-registered to a corresponding floor plan, facilitating the development of learning-based methods that utilize only RGB images for localization. While LiDAR-based localization techniques have been explored extensively in recent research [150, 151, 152, 153], their practical application is often constrained by the hardware capabilities of commonly used mobile devices. In contrast, we anticipate that purely image-based floor plan navigation will become more viable with the availability of this large indoor navigation dataset. This could pave the way for more accessible and widely deployable navigation solutions that leverage visual data alone.

Deployment and test of “NaVIP” everywhere. We will provide comprehensive details regarding the setup and organization of our NaVIP solution within the Lehigh HST building. The data collection process requires only a mobile phone, and to simplify this process, we will also release the corresponding code, including data pre-processing and annotation. Given the limited time and resources invested in developing this effective pipeline, we have not yet tested our solutions in other buildings. We encourage any groups seeking an affordable and intelligent indoor navigation system to adopt and apply this pipeline in their settings.

6 Limitations and Discussion

Inherent biases of human data collection. In NaVIP, we collected video data via sighted individuals, which inevitably introduced biases from collectors themselves. To mitigate this and reduce human labor in developing such data-driven navigation systems, one possible solution is the use of a simple robot equipped with a phone camera that navigates randomly to gather video data. Nonetheless, both sighted individuals and robotic agents fail to capture the user needs and data distribution pertinent to VIPs. Inspired by VizWiz [121], involving VIPs as participants in the data collection process is beneficial as our primary objective is to assist VIPs in navigating unfamiliar environments. Efforts to bridge this limitation will not only enhance the practicality of the solutions but also ensure their genuine benefit to the intended users.

How to merge 3D reconstruction losslessly? In our dataset, we utilize COLMAP to reconstruct the 3D models of indoor environments along each video path. Given that indoor environments can be ever-changing, there is currently no robust algorithm to effectively merge 3D point clouds captured at different times. As a workaround, we utilize human supervision to pinpoint several anchor images directly onto the floor plan. These annotated points drive the geo-registration process built into COLMAP, which transforms the coordinates of all 3D point clouds. This alignment method is not lossless and results in an inevitable systematic error in camera poses of approximately 0.5 meters. We will therefore release both the sparse and dense models generated by COLMAP, before and after geo-registration, to facilitate further research in this area. This contribution is expected to further enhance the accuracy of positioning systems in indoor navigation.
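The idea of aligning reconstructed coordinates to a floor plan from a handful of anchors can be sketched in 2D as a least-squares similarity transform (scale, rotation, translation). This is an illustration under simplifying assumptions, not the pipeline used here: COLMAP's geo-registration operates on 3D coordinates, and the function names and anchor coordinates below are hypothetical.

```python
def fit_similarity_2d(src, dst):
    """Least-squares 2D similarity transform mapping src points onto dst points.

    Points are (x, y) tuples. Returns complex numbers (a, b) such that
    dst is approximately a * src + b, where abs(a) is the scale and the
    phase of a is the rotation angle. Needs at least two distinct points.
    """
    zs = [complex(x, y) for x, y in src]
    ws = [complex(x, y) for x, y in dst]
    n = len(zs)
    z_mean = sum(zs) / n
    w_mean = sum(ws) / n
    num = sum((w - w_mean) * (z - z_mean).conjugate() for z, w in zip(zs, ws))
    den = sum(abs(z - z_mean) ** 2 for z in zs)
    a = num / den
    b = w_mean - a * z_mean
    return a, b

def apply_similarity_2d(a, b, point):
    """Map one (x, y) point with the fitted transform."""
    w = a * complex(*point) + b
    return (w.real, w.imag)

def rms_residual(a, b, src, dst):
    """Root-mean-square distance between mapped anchors and their floor-plan positions."""
    sq = [abs(a * complex(*s) + b - complex(*d)) ** 2 for s, d in zip(src, dst)]
    return (sum(sq) / len(sq)) ** 0.5

# Hypothetical anchors: reconstruction coordinates and matching floor-plan coordinates.
anchors_model = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
anchors_plan = [(3.0, 4.0), (3.0, 6.0), (1.0, 4.0)]
a, b = fit_similarity_2d(anchors_model, anchors_plan)
print(abs(a), apply_similarity_2d(a, b, (1.0, 1.0)), rms_residual(a, b, anchors_model, anchors_plan))
```

With noisy, human-annotated anchors the residual is nonzero, which is one way to see why an alignment of this kind cannot be lossless.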

How robust is the image-centric positioning solution? We acknowledge the limitations in our experimental analysis regarding the robustness of this image-centric positioning system. Although our dataset was collected directly from real-world scenarios using common mobile devices, the performance of this system under extreme conditions, e.g., electronic failures leading to dark environments or post-construction changes within buildings, remains unexplored. To simulate a distribution shift due to changes in the building environment, our dataset incorporates a temporal gap between the training and testing datasets. Despite this, we observe no decline in performance over time.

Is GPT-4 ready for assisting VIPs? Although we meticulously designed prompts tailored to the needs of VIPs (refer to Appendix B for our exploration), we encountered challenges in meeting their personalized requirements. Key challenges include: 1) optimizing GPT-4 as an image descriptor for VIPs to enhance accuracy and eliminate misinformation, and 2) developing prompts that guide GPT-4 to generate information that aligns with user expectations. To further explore these issues, we have released the outputs of GPT-4 on our dataset of 300K images.

7 Conclusion and Future Work

In this paper, we have redirected our research from conventional sensor-based navigation systems to purely vision-based solutions for indoor environments. To facilitate this shift, we created a large image-centric dataset, named NaVIP, within the largest building at Lehigh University, the HST, specifically for research purposes. Our comprehensive pilot experiments on benchmarking APR methods, which leverage real-time end-to-end inference in deep neural networks, have validated the feasibility and accuracy of this approach. This solution not only streamlines the system architecture but also enhances its applicability and usability, thereby extending its utility in assisting VIPs. The integration of image captioning models distilled from GPT-4 further highlights the potential for independent exploration by VIPs. As we look to the future, our research will focus on developing a mobile application that incorporates our trained model. We plan to test it within the Lehigh community, supported by approval from the Institutional Review Board (IRB), to ensure comprehensive human feedback integration into the study. Moreover, we intend to explore advanced unsupervised learning techniques and reinforcement learning from human feedback (RLHF) to further enhance the functionality and user experience of our mobile application. The next phase of our research will concentrate on incorporating real-world user insights to refine and optimize the navigational aid, aiming to create a more adaptive and effective tool for end-users.

Acknowledgments and Disclosure of Funding

We thank HST building coordinator Emily Diaz-Kempf and laboratory manager Chris Panko Graff for their support in video recording in the HST public area. We also extend our gratitude to Eashan Adhikarla for providing the computer used to run COLMAP in parallel, and Deven Bhadane for his time and effort in designing the logo for this work. Special thanks to our visually impaired friend, Joel Isaac, for volunteering and sharing their needs for an indoor navigation application. This work was partially supported by the U.S. National Science Foundation through awards #2409227, #2340870, and #2345057.

References

[1] Haosheng Huang and Georg Gartner. A survey of mobile indoor navigation systems. Springer, 2010.
[2] Jayakanth Kunhoth, AbdelGhani Karkar, Somaya Al-Maadeed, and Abdulla Al-Ali. Indoor positioning and wayfinding systems: a survey. Human-centric Computing and Information Sciences, 10(1):1–41, 2020.
[3] Zeev Volkovich, Elena V Ravve, and Renata Avros. Indoor navigation in facilities with repetitive structures. Sensors, 24(9):2876, 2024.
[4] Matteo Cypriani, Frédéric Lassabe, Philippe Canalda, and François Spies. Open wireless positioning system: A wi-fi-based indoor positioning system. In 2009 IEEE 70th Vehicular Technology Conference Fall, pages 1–5. IEEE, 2009.
[5] Richard Wandell, Md Shafaeat Hossain, and Ishtiaque Hussain. A cost-effective wi-fi-based indoor positioning system for mobile phones. Wireless Networks, 29(6):2845–2862, 2023.
[6] Yuan Zhuang, Jun Yang, You Li, Longning Qi, and Naser El-Sheimy. Smartphone-based indoor localization with bluetooth low energy beacons. Sensors, 16(5):596, 2016.
[7] Vicente Cantón Paterna, Anna Calveras Auge, Josep Paradells Aspas, and Maria Alejandra Perez Bullones. A bluetooth low energy indoor positioning system with channel diversity, weighted trilateration and kalman filtering. Sensors, 17(12):2927, 2017.
[8] Kamil Szyc, Maciej Nikodem, and Michał Zdunek. Bluetooth low energy indoor localization for large industrial areas and limited infrastructure. Ad Hoc Networks, 139:103024, 2023.
[9] Tan Kim Geok, Khaing Zar Aung, Moe Sandar Aung, Min Thu Soe, Azlan Abdaziz, Chia Pao Liew, Ferdous Hossain, Chih P Tso, and Wong Hin Yong. Review of indoor positioning: Radio wave technology. Applied Sciences, 11(1):279, 2020.
[10] Fuhu Che, Qasim Zeeshan Ahmed, Pavlos I Lazaridis, Pradorn Sureephong, and Temitope Alade. Indoor positioning system (ips) using ultra-wide bandwidth (uwb)—for industrial internet of things (iiot). Sensors, 23(12):5710, 2023.
[11] Jijun Geng, Linyuan Xia, Jingchao Xia, Qianxia Li, Hongyu Zhu, and Yuezhen Cai. Smartphone-based pedestrian dead reckoning for 3d indoor positioning. Sensors, 21(24):8180, 2021.
[12] Suqing Yan, Yalan Su, Xiaonan Luo, Anqing Sun, Yuanfa Ji, and Kamarul Hawari bin Ghazali. Deep learning-based geomagnetic navigation method integrated with dead reckoning. Remote Sensing, 15(17):4165, 2023.
[13] Zachary Teed and Jia Deng. Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras. Advances in neural information processing systems, 34:16558–16569, 2021.
[14] Andréa Macario Barros, Maugan Michel, Yoann Moline, Gwenolé Corre, and Frédérick Carrel. A comprehensive survey of visual slam algorithms. Robotics, 11(1):24, 2022.
[15] Mark Fiala. Designing highly reliable fiducial markers. IEEE Transactions on Pattern analysis and machine intelligence, 32(7):1317–1324, 2009.
[16] Michail Kalaitzakis, Brennan Cain, Sabrina Carroll, Anand Ambrosi, Camden Whitehead, and Nikolaos Vitzilaios. Fiducial markers for pose estimation: Overview, applications and experimental comparison of the artag, apriltag, aruco and stag markers. Journal of Intelligent & Robotic Systems, 101:1–26, 2021.
[17] Gabriel De Blasio, Alexis Quesada-Arencibia, Carmelo R García, Jezabel Miriam Molina-Gil, and Cándido Caballero-Gil. Study on an indoor positioning system for harsh environments based on wi-fi and bluetooth low energy. Sensors, 17(6):1299, 2017.
[18] Jiang Dong, Marius Noreikis, Yu Xiao, and Antti Ylä-Jääski. Vinav: A vision-based indoor navigation system for smartphones. IEEE Transactions on Mobile Computing, 18(6):1461–1475, 2018.
[19] Daquan Feng, Junjie Peng, Yuan Zhuang, Chongtao Guo, Tingting Zhang, Yinghao Chu, Xiaoan Zhou, and Xiang-Gen Xia. An adaptive imu/uwb fusion method for nlos indoor positioning and navigation. IEEE Internet of Things Journal, 2023.
[20] Kanak Manjari, Madhushi Verma, and Gaurav Singal. A survey on assistive technology for visually impaired. Internet of Things, 11:100188, 2020.
[21] Fatma El-Zahraa El-Taher, Ayman Taha, Jane Courtney, and Susan Mckeever. A systematic review of urban navigation systems for visually impaired people. Sensors, 21(9):3103, 2021.
[22] Bineeth Kuriakose, Raju Shrestha, and Frode Eika Sandnes. Tools and technologies for blind and visually impaired navigation support: a review. IETE Technical Review, 39(1):3–18, 2022.
[23] Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4104–4113, 2016.
[24] Paul-Edouard Sarlin, Cesar Cadena, Roland Siegwart, and Marcin Dymczyk. From coarse to fine: Robust hierarchical localization at large scale. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12716–12725, 2019.
[25] Akihiko Torii, Relja Arandjelovic, Josef Sivic, Masatoshi Okutomi, and Tomas Pajdla. 24/7 place recognition by view synthesis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1808–1817, 2015.
[26] Hajime Taira, Masatoshi Okutomi, Torsten Sattler, Mircea Cimpoi, Marc Pollefeys, Josef Sivic, Tomas Pajdla, and Akihiko Torii. Inloc: Indoor visual localization with dense matching and view synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7199–7209, 2018.
[27] Martin A Fischler and Robert C Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.
[28] Vincent Lepetit, Francesc Moreno-Noguer, and Pascal Fua. EPnP: An accurate O(n) solution to the PnP problem. International journal of computer vision, 81:155–166, 2009.
[29] Torsten Sattler, Bastian Leibe, and Leif Kobbelt. Efficient & effective prioritized matching for large-scale image-based localization. IEEE transactions on pattern analysis and machine intelligence, 39(9):1744–1756, 2016.
[30] Alex Kendall, Matthew Grimes, and Roberto Cipolla. Posenet: A convolutional network for real-time 6-dof camera relocalization. In Proceedings of the IEEE international conference on computer vision, pages 2938–2946, 2015.
[31] Mohamed Adel Musallam, Vincent Gaudilliere, Miguel Ortiz Del Castillo, Kassem Al Ismaeil, and Djamila Aouada. Leveraging equivariant features for absolute pose regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6876–6886, 2022.
[32] Shuai Chen, Yash Bhalgat, Xinghui Li, Jiawang Bian, Kejie Li, Zirui Wang, and Victor Adrian Prisacariu. Refinement for absolute pose regression with neural feature synthesis. arXiv preprint arXiv:2303.10087, 2023.
[33] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012.
[34] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[35] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
[36] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.
[37] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. Advances in neural information processing systems, 27, 2014.
[38] Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything: Unleashing the power of large-scale unlabeled data. arXiv preprint arXiv:2401.10891, 2024.
[39] Darius Plikynas, Arūnas Žvironas, Andrius Budrionis, and Marius Gudauskis. Indoor navigation systems for visually impaired persons: Mapping the features of existing technologies to user needs. Sensors, 20(3):636, 2020.
[40] Gaurav Jain, Yuanyang Teng, Dong Heon Cho, Yunhao Xing, Maryam Aziz, and Brian A Smith. "I want to figure things out": Supporting exploration in navigation for people with visual impairments. Proceedings of the ACM on Human-Computer Interaction, 7(CSCW1):1–28, 2023.
[41] Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
[42] Ce Zhou, Qian Li, Chen Li, Jun Yu, Yixin Liu, Guangjing Wang, Kai Zhang, Cheng Ji, Qiben Yan, Lifang He, et al. A comprehensive survey on pretrained foundation models: A history from bert to chatgpt. arXiv preprint arXiv:2302.09419, 2023.
[43] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[44] Naser El-Sheimy and You Li. Indoor navigation: State of the art and future trends. Satellite Navigation, 2(1):7, 2021.
[45] Dawar Khan, Zhanglin Cheng, Hideaki Uchiyama, Sikandar Ali, Muhammad Asshad, and Kiyoshi Kiyokawa. Recent advances in vision-based indoor navigation: A systematic literature review. Computers & Graphics, 104:24–45, 2022.
[46] Yongliang Shi, Weimin Zhang, Zhuo Yao, Mingzhu Li, Zhenshuo Liang, Zhongzhong Cao, Hua Zhang, and Qiang Huang. Design of a hybrid indoor location system based on multi-sensor fusion for robot navigation. Sensors, 18(10):3581, 2018.
[47] Vishnu Nair, Manjekar Budhai, Greg Olmschenk, William H Seiple, and Zhigang Zhu. Assist: Personalized indoor navigation via multimodal sensors and high-level semantic information. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, pages 0–0, 2018.
[48] Xuxin Lin, Jianwen Gan, Chaohao Jiang, Shuai Xue, and Yanyan Liang. Wi-fi-based indoor localization and navigation: A robot-aided hybrid deep learning approach. Sensors, 23(14):6320, 2023.
[49] Lamya Albraheem and Sarah Alawad. A hybrid indoor positioning system based on visible light communication and bluetooth rss trilateration. Sensors, 23(16):7199, 2023.
[50] Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton Van Den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3674–3683, 2018.
[51] Wansen Wu, Tao Chang, Xinmeng Li, Quanjun Yin, and Yue Hu. Vision-language navigation: A survey and taxonomy. Neural Computing and Applications, pages 1–26, 2023.
[52] Jing Gu, Eliana Stefani, Qi Wu, Jesse Thomason, and Xin Eric Wang. Vision-and-language navigation: A survey of tasks, methods, and future directions. arXiv preprint arXiv:2203.12667, 2022.
[53] Satoshi Tadokoro. Rescue robotics: DDT project on robots and systems for urban search and rescue. Springer Science & Business Media, 2009.
[54] Peter Anderson, Ayush Shrivastava, Devi Parikh, Dhruv Batra, and Stefan Lee. Chasing ghosts: Instruction following as bayesian state tracking. Advances in neural information processing systems, 32, 2019.
[55] Yuankai Qi, Qi Wu, Peter Anderson, Xin Wang, William Yang Wang, Chunhua Shen, and Anton van den Hengel. Reverie: Remote embodied visual referring expression in real indoor environments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9982–9991, 2020.
[56] Chuang Lin, Yi Jiang, Jianfei Cai, Lizhen Qu, Gholamreza Haffari, and Zehuan Yuan. Multimodal transformer with variable-length memory for vision-and-language navigation. In European Conference on Computer Vision, pages 380–397. Springer, 2022.
[57] Xiangyang Li, Zihan Wang, Jiahao Yang, Yaowei Wang, and Shuqiang Jiang. Kerm: Knowledge enhanced reasoning for vision-and-language navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2583–2592, 2023.
[58] Jingyang Huo, Qiang Sun, Boyan Jiang, Haitao Lin, and Yanwei Fu. Geovln: Learning geometry-enhanced visual representation with slot attention for vision-and-language navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23212–23221, 2023.
[59] Rui Liu, Xiaohan Wang, Wenguan Wang, and Yi Yang. Bird’s-eye-view scene graph for vision-language navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10968–10980, 2023.
[60] Kunyang Lin, Peihao Chen, Diwei Huang, Thomas H Li, Mingkui Tan, and Chuang Gan. Learning vision-and-language navigation from youtube videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8317–8326, 2023.
[61] Chia-Wen Kuo, Chih-Yao Ma, Judy Hoffman, and Zsolt Kira. Structure-encoding auxiliary tasks for improved visual representation in vision-and-language navigation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1104–1113, 2023.
[62] Yanyuan Qiao, Zheng Yu, and Qi Wu. Vln-petl: Parameter-efficient transfer learning for vision-and-language navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15443–15452, 2023.
[63] Jing Yu Koh, Honglak Lee, Yinfei Yang, Jason Baldridge, and Peter Anderson. Pathdreamer: A world model for indoor navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14738–14748, 2021.
[64] Ta-Chung Chi, Minmin Shen, Mihail Eric, Seokhwan Kim, and Dilek Hakkani-tur. Just ask: An interactive learning framework for vision and language navigation. Proceedings of the AAAI Conference on Artificial Intelligence, 34(03):2459–2466, Apr. 2020.
[65] Andrea Burns, Deniz Arsan, Sanjna Agrawal, Ranjitha Kumar, Kate Saenko, and Bryan A Plummer. A dataset for interactive vision-language navigation with unknown command feasibility. In European Conference on Computer Vision, pages 312–328. Springer, 2022.
[66] Jacob Krantz, Shurjo Banerjee, Wang Zhu, Jason Corso, Peter Anderson, Stefan Lee, and Jesse Thomason. Iterative vision-and-language navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14921–14930, 2023.
[67] Daniel Fried, Ronghang Hu, Volkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe Morency, Taylor Berg-Kirkpatrick, Kate Saenko, Dan Klein, and Trevor Darrell. Speaker-follower models for vision-and-language navigation. Advances in Neural Information Processing Systems, 31, 2018.
[68] Sanyam Agarwal, Devi Parikh, Dhruv Batra, Peter Anderson, and Stefan Lee. Visual landmark selection for generating grounded and interpretable navigation instructions. In CVPR workshop on Deep Learning for Semantic Visual Navigation, volume 2, 2019.
[69] Hao Tan, Licheng Yu, and Mohit Bansal. Learning to navigate unseen environments: Back translation with environmental dropout. arXiv preprint arXiv:1904.04195, 2019.
[70] Tsu-Jui Fu, Xin Eric Wang, Matthew F Peterson, Scott T Grafton, Miguel P Eckstein, and William Yang Wang. Counterfactual vision-and-language navigation via adversarial path sampler. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VI 16, pages 71–86. Springer, 2020.
[71] Zi-Yi Dou and Nanyun Peng. Foam: A follower-aware speaker model for vision-and-language navigation. arXiv preprint arXiv:2206.04294, 2022.
[72] Hanqing Wang, Wei Liang, Jianbing Shen, Luc Van Gool, and Wenguan Wang. Counterfactual cycle-consistent learning for instruction following and generation in vision-language navigation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15471–15481, 2022.
[73] Jialu Li and Mohit Bansal. Improving vision-and-language navigation by generating future-view image semantics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10803–10812, 2023.
[74] Zun Wang, Jialu Li, Yicong Hong, Yi Wang, Qi Wu, Mohit Bansal, Stephen Gould, Hao Tan, and Yu Qiao. Scaling data generation in vision-and-language navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12009–12020, 2023.
[75] Ming Zhao, Peter Anderson, Vihan Jain, Su Wang, Alexander Ku, Jason Baldridge, and Eugene Ie. On the evaluation of vision-and-language navigation instructions. arXiv preprint arXiv:2101.10504, 2021.
[76] Torsten Sattler, Tobias Weyand, Bastian Leibe, and Leif Kobbelt. Image retrieval for image-based localization revisited. In BMVC, volume 1, page 4, 2012.
[77] Filip Radenović, Giorgos Tolias, and Ondřej Chum. Cnn image retrieval learns from bow: Unsupervised fine-tuning with hard examples. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14, pages 3–20. Springer, 2016.
[78] Dorian Galvez-Lopez and Juan D Tardos. Real-time loop detection with bags of binary words. In 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 51–58. IEEE, 2011.
[79] Hugo Germain, Guillaume Bourmaud, and Vincent Lepetit. Sparse-to-dense hypercolumn matching for long-term visual localization. In 2019 International Conference on 3D Vision (3DV), pages 513–523. IEEE, 2019.
[80] Xiaotian Li, Shuzhe Wang, Yi Zhao, Jakob Verbeek, and Juho Kannala. Hierarchical scene coordinate classification and regression for visual localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11983–11992, 2020.
[81] Shuzhe Wang, Zakaria Laskar, Iaroslav Melekhov, Xiaotian Li, Yi Zhao, Giorgos Tolias, and Juho Kannala. Hscnet++: Hierarchical scene coordinate classification and regression for visual localization with transformer. International Journal of Computer Vision, pages 1–21, 2024.
[82] Zakaria Laskar, Iaroslav Melekhov, Surya Kalia, and Juho Kannala. Camera relocalization by computing pairwise relative poses using convolutional neural network. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 929–938, 2017.
[83] Vassileios Balntas, Shuda Li, and Victor Prisacariu. Relocnet: Continuous metric learning relocalisation using neural nets. In Proceedings of the European conference on computer vision (ECCV), pages 751–767, 2018.
[84] Mingyu Ding, Zhe Wang, Jiankai Sun, Jianping Shi, and Ping Luo. Camnet: Coarse-to-fine retrieval for camera re-localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2871–2880, 2019.
[85] Mehmet Ozgur Turkoglu, Eric Brachmann, Konrad Schindler, Gabriel J Brostow, and Aron Monszpart. Visual camera re-localization using graph neural networks and relative pose supervision. In 2021 International Conference on 3D Vision (3DV), pages 145–155. IEEE, 2021.
[86] Kefan Chen, Noah Snavely, and Ameesh Makadia. Wide-baseline relative camera pose estimation with directional learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3258–3268, 2021.
[87] Samarth Sinha, Jason Y Zhang, Andrea Tagliasacchi, Igor Gilitschenski, and David B Lindell. Sparsepose: Sparse-view camera pose regression and refinement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21349–21359, 2023.
[88] Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. Scene coordinate regression forests for camera relocalization in rgb-d images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2930–2937, 2013.
[89] Abner Guzman-Rivera, Pushmeet Kohli, Ben Glocker, Jamie Shotton, Toby Sharp, Andrew Fitzgibbon, and Shahram Izadi. Multi-output learning for camera relocalization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1114–1121, 2014.
[90] Julien Valentin, Matthias Nießner, Jamie Shotton, Andrew Fitzgibbon, Shahram Izadi, and Philip HS Torr. Exploiting uncertainty in regression forests for accurate camera relocalization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4400–4408, 2015.
[91] Alex Kendall and Roberto Cipolla. Geometric loss functions for camera pose regression with deep learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5974–5983, 2017.
[92] Yoli Shavit, Ron Ferens, and Yosi Keller. Learning multi-scene absolute pose regression with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2733–2742, 2021.
[93] David G Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60:91–110, 2004.
[94] Clemens Arth, Daniel Wagner, Manfred Klopschitz, Arnold Irschara, and Dieter Schmalstieg. Wide area localization on mobile phones. In 2009 8th ieee international symposium on mixed and augmented reality, pages 73–82. IEEE, 2009.
[95] Arnold Irschara, Christopher Zach, Jan-Michael Frahm, and Horst Bischof. From structure-from-motion point clouds to fast location recognition. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 2599–2606. IEEE, 2009.
[96] Yunpeng Li, Noah Snavely, and Daniel P Huttenlocher. Location recognition using prioritized feature matching. In Computer Vision–ECCV 2010: 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5-11, 2010, Proceedings, Part II 11, pages 791–804. Springer, 2010.
[97] Paul-Edouard Sarlin, Ajaykumar Unagar, Mans Larsson, Hugo Germain, Carl Toft, Viktor Larsson, Marc Pollefeys, Vincent Lepetit, Lars Hammarstrand, Fredrik Kahl, et al. Back to the feature: Learning robust camera localization from pixels to pose. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3247–3257, 2021.
[98] Alex Kendall and Roberto Cipolla. Modelling uncertainty in deep learning for camera relocalization. In 2016 IEEE international conference on Robotics and Automation (ICRA), pages 4762–4769. IEEE, 2016.
[99] Florian Walch, Caner Hazirbas, Laura Leal-Taixe, Torsten Sattler, Sebastian Hilsenbeck, and Daniel Cremers. Image-based localization using lstms for structured feature correlation. In Proceedings of the IEEE international conference on computer vision, pages 627–637, 2017.
[100] Iaroslav Melekhov, Juha Ylioinas, Juho Kannala, and Esa Rahtu. Image-based localization using hourglass networks. In Proceedings of the IEEE international conference on computer vision workshops, pages 879–886, 2017.
[101] Mingpeng Cai, Chunhua Shen, and Ian D. Reid. A hybrid probabilistic model for camera relocalization. In British Machine Vision Conference, 2018.
[102] Shuai Chen, Zirui Wang, and Victor Prisacariu. Direct-posenet: Absolute pose regression with photometric consistency. In 2021 International Conference on 3D Vision (3DV), pages 1175–1185. IEEE, 2021.
[103] Samarth Brahmbhatt, Jinwei Gu, Kihwan Kim, James Hays, and Jan Kautz. Geometry-aware learning of maps for camera localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2616–2625, 2018.
[104] Yoli Shavit and Yosi Keller. Camera pose auto-encoders for improving pose regression. In European Conference on Computer Vision, pages 140–157. Springer, 2022.
[105] Shuai Chen, Xinghui Li, Zirui Wang, and Victor A Prisacariu. Dfnet: Enhance absolute pose regression with direct feature matching. In European Conference on Computer Vision, pages 1–17. Springer, 2022.
[106] Torsten Sattler, Qunjie Zhou, Marc Pollefeys, and Laura Leal-Taixe. Understanding the limitations of cnn-based absolute camera pose regression. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3302–3312, 2019.
[107] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
[108] Ricardo Martin-Brualla, Noha Radwan, Mehdi SM Sajjadi, Jonathan T Barron, Alexey Dosovitskiy, and Daniel Duckworth. Nerf in the wild: Neural radiance fields for unconstrained photo collections. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7210–7219, 2021.
[109] Jianlin Liu, Qiang Nie, Yong Liu, and Chengjie Wang. Nerf-loc: Visual localization with conditional neural radiance field. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 9385–9392. IEEE, 2023.
[110] Peter Karkus, David Hsu, and Wee Sun Lee. Particle filter networks with application to visual localization. In Conference on robot learning, pages 169–178. PMLR, 2018.
[111] Changan Chen, Rui Wang, Christoph Vogel, and Marc Pollefeys. F³loc: Fusion and filtering for floorplan localization. arXiv preprint arXiv:2403.03370, 2024.
[112] Zhong-Qiu Zhao, Peng Zheng, Shou-tao Xu, and Xindong Wu. Object detection with deep learning: A review. IEEE transactions on neural networks and learning systems, 30(11):3212–3232, 2019.
[113] Li Liu, Wanli Ouyang, Xiaogang Wang, Paul Fieguth, Jie Chen, Xinwang Liu, and Matti Pietikäinen. Deep learning for generic object detection: A survey. International journal of computer vision, 128:261–318, 2020.
[114] Zhengxia Zou, Keyan Chen, Zhenwei Shi, Yuhong Guo, and Jieping Ye. Object detection in 20 years: A survey. Proceedings of the IEEE, 111(3):257–276, 2023.
[115] Yu Xiang, Tanner Schmidt, Venkatraman Narayanan, and Dieter Fox. Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes. arXiv preprint arXiv:1711.00199, 2017.
[116] Chen Wang, Danfei Xu, Yuke Zhu, Roberto Martín-Martín, Cewu Lu, Li Fei-Fei, and Silvio Savarese. Densefusion: 6d object pose estimation by iterative dense fusion. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3343–3352, 2019.
[117] Yinlin Hu, Pascal Fua, Wei Wang, and Mathieu Salzmann. Single-stage 6d object pose estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2930–2939, 2020.
[118] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
[119] Ricardo E Gonzalez Penuela, Jazmin Collins, Cynthia Bennett, and Shiri Azenkot. Investigating use cases of ai-powered scene description applications for blind and low vision people. In Proceedings of the CHI Conference on Human Factors in Computing Systems, pages 1–21, 2024.
[120] Jasur Shukurov. Improve accessibility for low vision and blind people using machine learning and computer vision. arXiv preprint arXiv:2404.00043, 2024.
[121] Jeffrey P Bigham, Chandrika Jayant, Hanjie Ji, Greg Little, Andrew Miller, Robert C Miller, Robin Miller, Aubrey Tatarowicz, Brandyn White, Samual White, et al. Vizwiz: nearly real-time answers to visual questions. In Proceedings of the 23nd annual ACM symposium on User interface software and technology, pages 333–342, 2010.
[122] Michele A Burton, Erin Brady, Robin Brewer, Callie Neylan, Jeffrey P Bigham, and Amy Hurst. Crowdsourcing subjective fashion advice using vizwiz: challenges and opportunities. In Proceedings of the 14th international ACM SIGACCESS conference on Computers and accessibility, pages 135–142, 2012.
[123] Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3608–3617, 2018.
[124] Chongyan Chen, Samreen Anjum, and Danna Gurari. Grounding answers for visual questions asked by visually impaired people. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19098–19107, 2022.
[125] Jarek Reynolds, Chandra Kanth Nagesh, and Danna Gurari. Salient object detection for images taken by people with vision impairments. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 8522–8531, 2024.
[126] Antonio Torralba and Alexei A Efros. Unbiased look at dataset bias. In CVPR 2011, pages 1521–1528. IEEE, 2011.
[127] Zhuang Liu and Kaiming He. A decade’s battle on dataset bias: Are we there yet? arXiv preprint arXiv:2403.08632, 2024.
[128] Julien Valentin, Angela Dai, Matthias Nießner, Pushmeet Kohli, Philip Torr, Shahram Izadi, and Cem Keskin. Learning to navigate the energy landscape. In 2016 Fourth International Conference on 3D Vision (3DV), pages 323–332. IEEE, 2016.
[129] Paul-Edouard Sarlin, Mihai Dusmanu, Johannes L Schönberger, Pablo Speciale, Lukas Gruber, Viktor Larsson, Ondrej Miksik, and Marc Pollefeys. Lamar: Benchmarking localization and mapping for augmented reality. Computer Vision–ECCV 2022, 13667:686–704, 2022.
[130] Huajian Huang, Changkun Liu, Yipeng Zhu, Hui Cheng, Tristan Braud, and Sai-Kit Yeung. 360loc: A dataset and benchmark for omnidirectional visual localization with cross-device queries. arXiv preprint arXiv:2311.17389, 2023.
[131] Yunpeng Li, Noah Snavely, Dan Huttenlocher, and Pascal Fua. Worldwide pose estimation using 3d point clouds. In European conference on computer vision, pages 15–29. Springer, 2012.
[132] Richard A Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew J Davison, Pushmeet Kohli, Jamie Shotton, Steve Hodges, and Andrew Fitzgibbon. Kinectfusion: Real-time dense surface mapping and tracking. In 2011 10th IEEE international symposium on mixed and augmented reality, pages 127–136. IEEE, 2011.
[133] Yasutaka Furukawa, Brian Curless, Steven M Seitz, and Richard Szeliski. Towards internet-scale multi-view stereo. In 2010 IEEE computer society conference on computer vision and pattern recognition, pages 1434–1441. IEEE, 2010.
[134] Matthias Nießner, Michael Zollhöfer, Shahram Izadi, and Marc Stamminger. Real-time 3d reconstruction at scale using voxel hashing. ACM Transactions on Graphics (ToG), 32(6):1–11, 2013.
[135] Kenji Koide, Shuji Oishi, Masashi Yokozuka, and Atsuhiko Banno. General, single-shot, target-less, and automatic lidar-camera extrinsic calibration toolbox. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 11301–11307. IEEE, 2023.
[136] Jian Wu, Liwei Ma, and Xiaolin Hu. Delving deeper into convolutional neural networks for camera relocalization. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 5644–5651. IEEE, 2017.
[137] Samarth Brahmbhatt, Jinwei Gu, Kihwan Kim, James Hays, and Jan Kautz. Geometry-aware learning of maps for camera localization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[138] Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7482–7491, 2018.
[139] Hunter Blanton, Connor Greenwell, Scott Workman, and Nathan Jacobs. Extending absolute pose regression to multiple scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 38–39, 2020.
[140] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1314–1324, 2019.
[141] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: Lessons learned from the 2015 mscoco image captioning challenge. IEEE transactions on pattern analysis and machine intelligence, 39(4):652–663, 2016.
[142] MD Zakir Hossain, Ferdous Sohel, Mohd Fairuz Shiratuddin, and Hamid Laga. A comprehensive survey of deep learning for image captioning. ACM Computing Surveys (CSUR), 51(6):1–36, 2019.
[143] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[144] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002.
[145] Michael Denkowski and Alon Lavie. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the ninth workshop on statistical machine translation, pages 376–380, 2014.
[146] Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575, 2015.
[147] Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. Spice: Semantic propositional image caption evaluation. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part V 14, pages 382–398. Springer, 2016.
[148] Ron Mokady, Amir Hertz, and Amit H Bermano. Clipcap: Clip prefix for image captioning. arXiv preprint arXiv:2111.09734, 2021.
[149] Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In International Conference on Machine Learning, pages 23318–23340. PMLR, 2022.
[150] Federico Boniardi, Tim Caselitz, Rainer Kümmerle, and Wolfram Burgard. Robust lidar-based localization in architectural floor plans. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3318–3324. IEEE, 2017.
[151] Federico Boniardi, Tim Caselitz, Rainer Kümmerle, and Wolfram Burgard. A pose graph-based localization system for long-term navigation in cad floor plans. Robotics and Autonomous Systems, 112:84–97, 2019.
[152] Zhikai Li, Marcelo H Ang, and Daniela Rus. Online localization with imprecise floor space maps using stochastic gradient descent. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 8571–8578. IEEE, 2020.
[153] Oscar Mendez, Simon Hadfield, Nicolas Pugeault, and Richard Bowden. Sedar: reading floorplans like a human—using deep learning to enable human-inspired localisation. International Journal of Computer Vision, 128(5):1286–1310, 2020.
[154] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[155] Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), 2016.
[156] Johannes Lutz Schönberger, True Price, Torsten Sattler, Jan-Michael Frahm, and Marc Pollefeys. A vote-and-verify strategy for fast spatial verification in image retrieval. In Asian Conference on Computer Vision (ACCV), 2016.
[157] Alberto Pepe and Joan Lasenby. Cga-posenet: Camera pose regression via a 1d-up approach to conformal geometric algebra. arXiv preprint arXiv:2302.05211, 2023.
[158] Alberto Pepe, Joan Lasenby, and Sven Buchholz. Cgaposenet+ gcan: A geometric clifford algebra network for geometry-aware camera pose regression. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 6593–6603, 2024.

Appendix A Dataset Details

A.1 Organization

To ensure consistency and ease of use, we organize our data using the COLMAP output structure [154, 155, 156] and adhere to the training standards derived from PoseNet [30]. For each video, our data keeps the same output format as COLMAP (https://colmap.github.io/format.html). Below is an example of the file structure for each video folder; we have a total of 400 COLMAP project folders structured in this manner:

For files or folders that belong to the standard COLMAP outputs, please refer to the official tutorial (https://colmap.github.io/). Here we describe only the additional files. All video files, which end with .MOV or .MP4, are named beginning with the capture method (either “HAND” or “DJI”), followed by the timestamp of the recording. The same naming convention applies to the folders containing extracted images and to the images themselves, with a frame timestamp appended at the end. Note that for some videos we preserved the padded image frames in a separate folder whose name begins with “HAND_pad” or “DJI_pad”. The following folder structure is used to initiate any COLMAP processing:

A.1.1 Geo-Registration

After generating the standard outputs using COLMAP, we first stored the training report of COLMAP to the file log.log, and then captured “movies” of the 3D reconstructions to provide readers with a direct view of the reconstruction status. For each reconstruction, we saved 5 to 10 images from various perspectives. Please refer to Figure 5 as an example.

Users can reload models via the COLMAP GUI to view the 3D reconstructions from any perspective.

Additionally, because the original world coordinates are not aligned for different paths, we need to complete the geo-registration process. This involves using 5 to 10 images, manually pinpointed to the floor plan, to accurately align the models. These ground truths are saved in the file geo_coord.txt as

We name these image frames “anchor frames” or “anchor images” throughout this paper. The origins of their coordinates are aligned to the top-left pixel of the floor plan images, as shown in Figure 6. The second and third columns above indicate the number of pixels from the origin in terms of height and width, respectively.

For detailed instructions on how to geo-register the entire path using annotated images, please refer to the official tutorial (https://colmap.github.io/faq.html#geo-registration).
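As a rough illustration of what the anchor frames make possible, the sketch below fits a 2D similarity transform (scale, rotation, translation) that maps horizontal reconstruction coordinates of anchor cameras to their annotated floor-plan pixels. It is a simplified stand-in for the idea behind geo-registration, not the COLMAP procedure used to build the dataset; the function names and the sample coordinates are hypothetical.

```python
def fit_similarity_2d(src, dst):
    """Least-squares 2D similarity (scale, rotation, translation) from src to dst.

    Each (x, y) point is treated as a complex number, so the transform is
    z -> a * z + b, where abs(a) is the scale and the phase of a is the rotation.
    """
    if len(src) != len(dst) or len(src) < 2:
        raise ValueError("need at least two matching point pairs")
    zs = [complex(x, y) for x, y in src]
    zd = [complex(x, y) for x, y in dst]
    mean_s = sum(zs) / len(zs)
    mean_d = sum(zd) / len(zd)
    num = sum((d - mean_d) * (s - mean_s).conjugate() for s, d in zip(zs, zd))
    den = sum(abs(s - mean_s) ** 2 for s in zs)
    if den == 0:
        raise ValueError("source points must not all coincide")
    a = num / den
    b = mean_d - a * mean_s
    return a, b


def apply_similarity_2d(a, b, point):
    """Map one (x, y) point with the fitted transform."""
    z = a * complex(*point) + b
    return (z.real, z.imag)


# Hypothetical anchors: reconstruction coordinates and floor-plan pixels.
model_xy = [(0.0, 0.0), (1.0, 0.0), (1.0, 2.0), (0.0, 2.0)]
plan_px = [(100.0, 200.0), (100.0, 250.0), (0.0, 250.0), (0.0, 200.0)]
a, b = fit_similarity_2d(model_xy, plan_px)
mapped = apply_similarity_2d(a, b, (0.5, 1.0))
```

A similarity transform cannot represent a mirror flip, so if the floor-plan pixel axes and the reconstruction axes differ in handedness, one axis would have to be negated before fitting.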

We stored the geo-registered model in the folder 20231220_141254_proj/sparse/geo. Since the camera poses represent the coordinates of the world relative to the camera center and include redundant feature point information, we consolidated and streamlined this data from ../sparse/geo/images.bin into ../camera2world_6DoF.txt.
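The conversion from COLMAP's world-to-camera convention to a camera position in world coordinates can be sketched as follows. COLMAP stores each image pose as a scalar-first quaternion (QW, QX, QY, QZ) and a translation T such that a world point X maps to R X + T in the camera frame, so the camera center is C = -Rᵀ T. The function names are illustrative, and the exact layout of camera2world_6DoF.txt is not reproduced here.

```python
import math


def qvec_to_rotmat(qw, qx, qy, qz):
    """Rotation matrix of a scalar-first Hamilton quaternion (normalized first)."""
    norm = math.sqrt(qw * qw + qx * qx + qy * qy + qz * qz)
    if norm == 0.0:
        raise ValueError("zero quaternion")
    w, x, y, z = (v / norm for v in (qw, qx, qy, qz))
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def camera_center(qvec, tvec):
    """Camera position in world coordinates: C = -R^T t."""
    rot = qvec_to_rotmat(*qvec)
    # Column c of R dotted with t gives component c of R^T t.
    return tuple(-sum(rot[r][c] * tvec[r] for r in range(3)) for c in range(3))


# Example: a camera rotated 90 degrees about the z-axis.
s = math.sqrt(0.5)
center = camera_center((s, 0.0, 0.0, s), (1.0, 0.0, 0.0))
```

The camera-to-world rotation is simply the transpose of the matrix returned by `qvec_to_rotmat`.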

For each floor, we have released data in standard formats, image_train_all.txt and geometric_data.pkl, to serve as deep learning training inputs. The first four lines of image_train_all.txt are displayed as

This training format, inherited from PoseNet [30], aligns with the prevailing styles used in deep APR training [98, 91, 99, 100, 101, 102, 103, 92, 104, 105, 32].
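A minimal loader for such a file might look like the sketch below. It assumes the layout of the original PoseNet release, namely three header lines followed by one record per line holding an image path and seven floats (3D position, then a quaternion); this is an assumption and should be checked against the released image_train_all.txt. The type and function names and the sample lines are hypothetical.

```python
from typing import Iterable, List, NamedTuple, Tuple


class PoseSample(NamedTuple):
    image_path: str
    position: Tuple[float, ...]    # 3 values
    quaternion: Tuple[float, ...]  # 4 values


def parse_pose_lines(lines: Iterable[str], header_lines: int = 3) -> List[PoseSample]:
    """Parse PoseNet-style records, skipping the header and malformed lines."""
    samples = []
    for line in list(lines)[header_lines:]:
        fields = line.split()
        if len(fields) != 8:
            continue  # blank or unexpected line
        values = [float(v) for v in fields[1:]]
        samples.append(PoseSample(fields[0], tuple(values[:3]), tuple(values[3:])))
    return samples


# Hypothetical content in the assumed layout.
demo_lines = [
    "Header line 1",
    "ImageFile, Camera Position [X Y Z W P Q R]",
    "",
    "frames/example_0001.png 1.0 2.0 3.0 1.0 0.0 0.0 0.0",
    "frames/example_0002.png 1.5 2.0 3.0 0.0 1.0 0.0 0.0",
]
samples = parse_pose_lines(demo_lines)
```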

The file geometric_data.pkl is stored as a Python set in the format of

We encourage readers to review the code file at https://github.com/junfish/VIP_Navi/blob/master/dataset_utils/extract_geometric_data.py for a detailed understanding and to generate geometric_data.pkl themselves. This data file can be used to train models that exploit camera geometry, e.g., Geometric PoseNet [91, 157, 158].

As illustrated in Figure 7(a-d) and 7(e-h), the files path.png and path_stem.png show the geo-registration on the floor plan and in the 3D world coordinates, respectively, to validate its accuracy. All the code for the aforementioned steps is publicly available at https://github.com/junfish/VIP_Navi/tree/master/dataset_utils.

A.1.2 GPT-4 Captions for VIPs

We have eight .csv files below to store the pseudo ground truth produced by GPT-4.

Each file above contains three columns, labeled image_id, image_file, and caption.
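As a small sketch of how these files could be consumed, the snippet below reads a caption file with the three documented columns and builds a lookup from image file to caption. Only the column names come from the text above; the function name and the sample rows are hypothetical.

```python
import csv
import io


def load_captions(handle):
    """Return {image_file: caption} from a CSV with the documented columns."""
    reader = csv.DictReader(handle)
    required = {"image_id", "image_file", "caption"}
    missing = required - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return {row["image_file"]: row["caption"] for row in reader}


# Hypothetical rows; real files would be opened with open(path, newline="").
demo_csv = io.StringIO(
    "image_id,image_file,caption\n"
    '0,example_0001.png,"A hallway with lockers, seen from one end."\n'
    "1,example_0002.png,A stairwell with a handrail.\n"
)
captions = load_captions(demo_csv)
```

Using `csv.DictReader` rather than splitting on commas keeps captions that contain commas intact, provided they are quoted in the file.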

A.2 Examples

In this section, we first present examples of images from our dataset along with the reference objects used for the accurate human annotations of anchor frames. Subsequently, we randomly select images with captions for VIPs.

A.2.1 Image Frames

Figure 8 presents examples of images extracted from various videos recorded to develop this image-centric indoor navigation solution. We observe that some indoor environments are occasionally textureless and dynamic.

A.2.2 Reference Objects

To obtain accurate camera poses for the anchor frames used in geo-registrations, we identify and use prominent objects in the environment that are integrated into the sparse model of COLMAP 3D reconstructions. These objects include exit signs, stairs, gardens, walls with posters, pillars, trash bins, door frames, lockers, water dispensers, etc. Figure 9 displays selected examples of the mapping of these feature points from the images to the 3D sparse point cloud.

A.2.3 Captions by GPT-4

We randomly show three examples from our dataset below.

A.3 Statistics

As illustrated in Figure 10, we have conducted basic statistical analyses of our video recordings to indicate potential biases inherent in data collected by humans. We note that our dataset lacks videos from the hours between 2:00 AM and 6:00 AM, as well as on Saturdays and during the Christmas holidays.

Appendix B VIP User Needs and GPT-4 Interaction

B.1 VIP User Needs and GPT-4 Text Prompts Design

We aim to experiment with the use of GPT-4 for processing images captured by visually impaired individuals using their smartphones and generating controlled, effective textual descriptions to facilitate barrier-free living.

Initially, we craft a preliminary prompt for GPT, drawing inspiration from Jain et al.’s work [40]. Jain et al. introduce the concept of “Exploration Assistance,” an evolution of Navigation Assistance Systems (NASs) that empowers VIPs to explore unfamiliar environments. They study and analyze VIP user needs, such as what information VIPs require to explore unfamiliar environments and what factors influence these needs among individuals. Following their work, we categorize visually impaired users into early-blind, late-blind, and low-vision groups. We test GPT’s ability to produce descriptions at three detail levels: spatial information, independent exploration, and collaborative exploration.

We find that GPT’s performance limitations prevent it from generating distinct descriptions for different user needs, which leads to redundant information across the multiple versions of image descriptions and diminishes the user experience. To address this, we reassess user categories and requirements and design an improved prompt. Experiments show that the improved prompt enhances the accuracy and readability of GPT-generated descriptions. Additionally, we observe that GPT struggles to describe directional information accurately. To mitigate this, we introduce a visual prompt that corrects these errors, providing users with more precise navigation information.

B.1.1 Designs for GPT-4 Text Prompts: First Version

Please help me to annotate images with natural language descriptions for visually impaired people (VIPs). It is important to be as descriptive and clear as possible. For each image, please give me two versions (concise & detailed version) of descriptions for three VIP groups (VIPs who are low-vision, early-blind & late-blind), respectively.

My requirements of two versions are:

• Concise version contains only one sentence (no more than $12$ words). Use specific nouns and adjectives to give environment summarization, such as “a quiet lab where some people are reading books.”
• Detailed version contains the descriptions of three different levels, as outlined below. Each level represents a progressively higher need within the visually impaired population. Please ensure that the information for each level is contained within its own separate paragraph.
  – Level 1. Spatial information needs (perceptual insight cravings). VIPs need two types of spatial information: shape information and layout information, to get a high-level overview of a space. Your descriptions should help VIPs gather shape and layout information in a manner that facilitates active engagement with the environment.
  – Level 2. Independent exploration needs (self-directed learning desires). VIPs face difficulties in making navigation decisions based on spatial information collected via their non-visual senses and your descriptions. Additionally, acquiring appropriate orientation and mobility (O&M) training and maintaining their O&M skills can be a struggle for them, which negatively impacting their confidence to explore independently. You should try your best to afford VIPs precise and reliable spatial information and should ensure that VIPs make accurate navigation decisions based on this information. You can serve as an O&M education assistance tool to give VIPs the confidence to explore unfamiliar environments independently.
  – Level 3. Collaborative exploration needs (social interaction needs). Social pressures pose a major challenge to VIPs receiving help and to non-VIPs providing help when exploring environments collaboratively. VIPs want verbal assistance in a comprehensible format when exploring collaboratively but find it challenging to communicate this preference to others. You should normalize VIPs’ exploration behaviors by introducing social norms for exploration assistance in the current situation. If possible, you could scaffold and facilitate collaborative exploration by telling VIPs how to distribute requests for collaboration and what kind of translations may be needed for VIPs.

Here are some tips for you to refine my requirements above:

By considering these tips, you can create descriptions in natural language that are informative and useful for building a mental map for VIPs.

Requirements of three groups of VIPs are:

• Early-blind VIPs are a group of people who are blind by birth or developed vision impairments early in life. They have learned to trust their non-visual senses to collect spatial information over time, and have no concept or sense of colors. Please describe using texture or tactile feel, such as “the bench is crafted from unvarnished timber, giving it a textured, coarse feel,” rather than using color or contrast descriptions like “an aisle carpeted in dark gray.”
• Late-blind VIPs are a group of people who are blind after they have a basic understanding of our world; they know things similar to normal people. They prefer receive various information in the descriptions with as much visual detail as possible.
• Low vision VIPs are a group of people who have some degree of visual function and usually rely on combining their remaining vision and other senses to interact with their environment. Unlike those who are completely blind, individuals with low vision may use their visual capabilities to assist in the navigation and identification of objects, but they may have a loss of ability of Central vision, Peripheral vision, Depth perception, Contrast sensitivity, and Glare resistance.

Below are specialized tips for early-blind VIPs. Prioritize these over the initial 12 tips, as they align more closely with the unique experiences of early-blind VIPs. In cases of conflicting advice, please give these tips precedence:

Below are specialized tips for low-vision VIPs. Prioritize these over the initial 12 tips, as they align more closely with their unique experiences. In cases of conflicting advice, please give these tips precedence:

B.1.2 Designs for GPT-4 Text Prompts: Improved Version

I need three paragraphs of scene descriptions.

• The first description should be concise and within 15 words.
• The second description should be detailed and 150 words and generated for people with vision impairments (low vision or late blindness). Follow these tips:
• The third description should also be detailed and 150 words, but generated for people who have been blind since birth. Follow the same rules as the second description, with one additional consideration:

B.1.3 Examples of GPT-4 Responses

First, we show the image description results GPT generated through the first text prompt version. The first version of text prompts is shown in § B.1.1.

Inspired by the work of Jain et al. [40], we design the first version of text prompts to achieve two main goals: 1) generate three different concise versions of image descriptions for the three user categories; 2) generate nine detailed image descriptions, covering the three levels of needs for each user category. Our customized guidelines for detailed descriptions include, but are not limited to, detailed object shapes and environmental safety factors. This approach is intended to tune each description level to the varied and specific needs of users based on their visual impairment conditions. We hope this customization addresses the unique challenges each group faces in navigating their environments effectively.

Second, we show the image description results generated by GPT through the improved version of text prompts. The improved version of text prompts is shown in § B.1.2.

By manually screening and scoring the results of the three levels of descriptions, our team unanimously found that, in the concise description, the descriptions for the three groups of users are very similar, containing basic scene layout information. In the detailed description, the descriptions at Level 2 (Independent exploration needs) are more accurate and contain the fewest errors, thus aligning more closely with the users’ needs. Additionally, we observe that the descriptions for late-blind and low-vision users are remarkably similar. Considering the limitations of GPT and user needs, we make the following improvements in the improved version of the prompt: 1) Generate a common concise description for all VIP users; 2) Merge the detailed description needs of late-blind and low-vision users into one category; 3) Focus on the requirements of Level 2 and rewrite a simplified version of the GPT prompt to meet this requirement. The simpler and clearer the prompt language, the more accurate and readable the generated description sentences are.

B.2 Visual Prompts for Correcting Directional Errors

We find that GPT’s ability to count and to recognize direction is weak. For example, it sometimes cannot accurately describe the number of doors and tables, and it cannot reliably describe left and right orientations relative to the direction of the user’s line of sight (camera direction). Accurate spatial orientation information is crucial for blind and low-vision (BLV) users. In addition to text prompts, we therefore introduce a visual prompt: we embed left and right orientation information in each picture and instruct GPT to use this embedded information to describe each object’s relative position. Experiments show that the proposed orientation-related visual prompt can effectively correct orientation errors in the descriptions.

B.2.1 Examples of GPT-4 Responses

First, we present one example of image description generated by GPT using only the improved text prompt. We observed that GPT struggles to accurately describe the location of objects on the left and right sides. For instance, GPT’s description states that the lockers are on the right side of the image and the glass wall is on the left side, when the opposite is true.

Next, we show the description results generated by combining the visual and text prompts. This corrected the inaccurate direction descriptions.

Appendix C Experimental Settings and Additional Results

C.1 Learning settings

For the learning settings, we employed the Adam optimizer. The optimizer configuration and other hyperparameters used in our experiments are shown in Table 4.

Table 4: Learning settings used for all model training runs.

Parameter	Value
Initial sx	0.0
Initial sq	-5.0
Initial Learning Rate	0.0001
Optimizer	Adam
Weight Decay	0.0005
Gamma	0.1
Scheduler	StepLR
Step Size	50
Epoch	200
Batch Size	16
Pre-trained Weights	ImageNet1K
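To make the schedule in Table 4 concrete, the sketch below computes the learning rate that a StepLR-style schedule yields at each epoch (the initial rate multiplied by gamma once every step-size epochs). It also shows one common way the initial sx and sq values are used, on the assumption that they are the learnable weights of an uncertainty-weighted position and orientation loss; the table itself does not state this, and the function names are illustrative.

```python
import math

# Values taken from Table 4.
INITIAL_LR = 1e-4
GAMMA = 0.1
STEP_SIZE = 50
EPOCHS = 200
INITIAL_SX = 0.0
INITIAL_SQ = -5.0


def step_lr(epoch, initial_lr=INITIAL_LR, gamma=GAMMA, step_size=STEP_SIZE):
    """Learning rate at a given epoch under step decay."""
    return initial_lr * gamma ** (epoch // step_size)


def weighted_pose_loss(loss_x, loss_q, s_x=INITIAL_SX, s_q=INITIAL_SQ):
    """Assumed form: L = L_x * exp(-s_x) + s_x + L_q * exp(-s_q) + s_q."""
    return loss_x * math.exp(-s_x) + s_x + loss_q * math.exp(-s_q) + s_q


# Learning rate at the start of each decay stage over the 200 epochs.
stage_rates = [step_lr(e) for e in range(0, EPOCHS, STEP_SIZE)]
```

Under this schedule the rate drops by a factor of ten at epochs 50, 100, and 150. With the assumed loss form, an initial sq of -5.0 weights the orientation term by exp(5) at the start of training, before the weights are updated.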