1. Introduction
Unmanned autonomous vehicles have been studied for decades and have been used increasingly for many real-life applications, such as search and rescue operations, military supplies delivery, transport of agricultural products or materials, delivery of materials to different sections of a warehouse, delivery of customer orders in restaurants, and delivery of clinical supplies within the hospital [
1]. To implement these applications, LiDAR sensors play a crucial role in improving the situational awareness and navigation capabilities of robots. For example, self-driving cars and delivery robots use LiDAR sensors to aid autonomous driving, path planning, and collision avoidance. The LiDAR sensor helps the vehicle understand the surrounding environment, detect lane boundaries, and identify other vehicles, pedestrians, or obstacles. However, research has shown that the main challenges faced by unmanned autonomous vehicles are to accurately perceive their environment and to learn and develop a policy for safe and autonomous navigation [
2]. Therefore, despite the huge amount of research in the field, existing systems have not yet achieved full autonomy in terms of collision avoidance, risk awareness, and recovery.
Before the use of LiDAR information became popular, visual simultaneous localization and mapping (vSLAM) was often used to perceive environment dynamics. In vSLAM, an operator measures and generates a map showing the locations, landmarks, and the guided path to the goal in the environment. However, such visual mapping systems have two major limitations: (1) they cannot reliably identify obstacles in low light conditions or when dealing with repetitive patterns, and (2) processing visual data can be computationally intensive [
3]. Unlike vSLAM, LiDAR uses eye-safe laser beams to capture the surrounding environment in 2D or 3D, providing computing systems with an accurate representation of the environment; this has prompted its adoption by many automobile companies, such as Volkswagen, Volvo, and Hyundai, for autonomous driving, object detection, mapping, and localization [
4].
LiDAR sensors can be categorized as mechanical, semi-solid, or solid-state based on their scanning methods. Mechanical scanning LiDARs use a rotating mirror or prism to direct the laser beam over 360° using a motor. This type of LiDAR design is very expensive and large; hence, it is not suitable for large-scale use. In semi-solid-state LiDAR sensors, the mechanical rotating parts are made smaller and hidden within the shell of the LiDAR sensor, making the rotation invisible from its appearance. Solid-state LiDAR sensors do not have moving parts. They use electronic components to steer the laser beams, thereby reducing the cost of production and improving efficiency and reliability. They are also more durable and compact, making them suitable for automotive applications. The general characteristics of a LiDAR sensor are defined by its angular range (the minimum and maximum scan angles), resolution, sampling rate, and range, as shown in
Figure 1. The angular range determines the FOV covered by the LiDAR sensor, that is, the extent of the environment that the LiDAR can capture in terms of angles. The resolution parameter determines the angular spacing between individual beams within the specified FOV. A small resolution value allows more beams within the same FOV and results in increased coverage and potentially higher-resolution data. The sampling rate defines the number of beam pulses that the LiDAR sensor will emit, while the range determines how far the beam can travel. These parameters allow the design of different models of LiDAR sensors.
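To make these parameters concrete, the short sketch below computes the beam directions implied by a chosen FOV and angular resolution. The function name and the end-inclusive beam placement are illustrative assumptions, not a sensor vendor's API.

```python
import numpy as np

def beam_angles(fov_deg: float, resolution_deg: float):
    """Return the beam angles (degrees) implied by an FOV and angular resolution.

    Hypothetical helper: a beam is placed every `resolution_deg` degrees,
    with beams at both ends of the FOV.
    """
    n_beams = int(round(fov_deg / resolution_deg)) + 1
    return np.linspace(-fov_deg / 2.0, fov_deg / 2.0, n_beams)

# Example: a 180-degree FOV with 1-degree resolution yields 181 beams.
angles = beam_angles(180.0, 1.0)
print(len(angles), angles[0], angles[-1])  # 181 -90.0 90.0
```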
A 2D LiDAR sensor emits laser beams in a single plane (horizontal or vertical) to create a 2D representation of the environment. It measures the time it takes for each laser beam to return after hitting an object, allowing it to calculate distances and generate a 2D point cloud. In contrast, a 3D LiDAR sensor emits a much larger number of laser beams in multiple directions (usually both horizontally and vertically) to create a volumetric representation of the environment. Sensors with a very high number of beams come with a higher price tag and are, therefore, not cost effective for smaller applications.
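As an illustration of the time-of-flight principle behind these range measurements, the following minimal sketch converts a beam's round-trip time into a range; the function name is hypothetical.

```python
SPEED_OF_LIGHT = 299_792_458.0  # m/s

def range_from_time_of_flight(round_trip_time_s: float) -> float:
    """Distance to the reflecting surface from a beam's round-trip time.

    The pulse travels to the object and back, so the one-way range is
    half of the total path length.
    """
    return SPEED_OF_LIGHT * round_trip_time_s / 2.0

# A return after ~66.7 ns corresponds to an object roughly 10 m away.
print(range_from_time_of_flight(66.7e-9))  # ~10.0
```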
Due to the ability of LiDAR sensors to provide a real-time representation of the robot’s environment, they have increasingly been exploited to generate point clouds of the environment for training DRL models for autonomous self-driving vehicles [
5,
6,
7]. For cost effectiveness, researchers typically use 2D LiDAR sensors for small-height robots [
8,
9,
10]. In addition, different studies select LiDAR sensors with different beam densities and FOV configurations, often without a stated rationale. Since the DRL model makes decisions based on the information received from the LiDAR sensor, it is important to understand the beam density and FOV required to develop an appropriate learning model. This leads us to ask whether a large number of beams is required to navigate a static environment.
Hence, in this paper, we use a novel DRL-based approach to explore the performance of an autonomous ground vehicle driving through its environment to a goal point based on the information obtained from different LiDAR sensor configurations. Without prior knowledge of the environment, a point of interest is extracted from the robot’s present surroundings and evaluated, and a routing point is then selected to guide the DRL policy formation. Our approach calculates an initial FOV to generate a learning model, referred to as model 2 in this work, after which two further FOVs are generated, one narrower and one wider, for comparison. The emitted beams are used to estimate the distance from the robot to the obstacle ahead and to obtain a path from the robot’s initial pose to the goal point. The input to the neural network model is the robot’s linear and angular velocity, its orientation, the current distance between the robot and its goal, and the LiDAR information, while the output is the robot’s linear and angular velocity for the next time step, as shown in
Figure 2. Despite the common use of DRL to generate an obstacle avoidance control policy, the overall contribution of this work is as follows.
We design a DRL control policy based on goal-based exploration.
We explore the effect of the LiDAR beam and FOV on the performance of the DRL model by learning the appropriate FOV and beam density suitable for a static environment. This is essential when the application needs a low-resolution LiDAR sensor.
We demonstrate the performance of our model in a simulated environment to test the effect of different LiDAR sensor configurations on collision avoidance.
We demonstrate our control policy on a Husky A200 robot by Clearpath Robotics using environment dynamics different from those used for training.
The rest of the paper is organized as follows. In
Section 2, we describe the work related to the study.
Section 3 introduces the problem formulation, the elements that constitute the environment, and the methodology used in the paper.
Section 4 describes the training environment (Gazebo) and the training process. Experimental results and performance are discussed in
Section 5.
Section 6 describes the limitations of the proposed method and future work. Finally, in
Section 7, we conclude the paper.
2. Related Work
Collision avoidance and path planning problems have been investigated and solved using many techniques, such as RRT, A, A*, RRT*, and decision trees [
11,
12,
13,
14]. These techniques are mostly suitable for applications where the environment state is known and not changing. In a bid to offer a better collision avoidance solution, researchers introduced map localization and position-based methods. In map localization and positioning, cameras are used to capture the state of the environment and detect obstacles and their sizes to determine the path to follow [
15]. This usually follows the process of mapping, localization, and planning. In this scenario, as with A, RRT, and related approaches, the environment needs to be known beforehand to design a policy for path planning; hence, it is not best suited for a dynamic environment [
16,
17,
18]. Furthermore, the environment used to develop the model can change over time; maintaining and updating the model ensures that it can adapt to changes and continue to navigate safely, but doing so is costly, time-consuming, and requires knowledge and experience.
In recent years, the use of reinforcement learning and DRL has increased significantly due to its excellent environmental awareness and decision control performance [
19]. The Atari 2600 game-playing agent [
20] and AlphaGo [
21] developed by DeepMind are two of the early success stories of RL. In mobile robots, the use of RL/DRL to directly map the state of the environment to the control signal for a dynamic path planning solution remains a challenge. Kasun et al. [
22] investigated robot navigation in an uneven outdoor environment using a fully trained DRL network to obtain a cost map to perform the navigation task. The network has prior knowledge of the environment by accepting elevation maps of the environment, the robot poses, and the goal axis as input. Xue et al. [
23] and Ruan et al. [
24] investigated the use of a double-deep Q-network (DDQN). The size and position of the obstacle and the target position are taken as input to the network, and the robot’s velocity values are output. In [
25], a deep deterministic policy gradient (DDPG) algorithm was used to select a control policy for hybrid unmanned aerial underwater vehicles using the robot’s state, LiDAR measurements, and distance to the goal point. A goal-orientated approach to obstacle avoidance was implemented by [
26,
27]. Their work was based on processing a large amount of depth image information using DRL to reach its goal while avoiding obstacles in a continuous or unknown environment. In another work, Choi et al. [
28] proposed the integration of both path planning and reinforcement learning methods to predict the next movement of an obstacle using the calculated distance from the LiDAR information. Wang et al. [
29] implemented the curriculum learning of a DRL robot to navigate among movable obstacles. Rather than collecting human demonstrations as in [
30], they introduced the use of prior knowledge.
Most collision avoidance models developed using DRL obtain the state of the environment through LiDAR information. When training the learning network, researchers have used a variety of FOVs (90°, 180°, 270°, or 360°) and numbers of LiDAR beams (10–60), choices that directly impact the computational complexity of the network [
31,
32]. Tai et al. [
33] developed a navigation learning model in a simulated environment using a 10-dimensional laser beam as one input to the model. Han et al. [
34] used the fusion of RGB images from a camera and 2D LiDAR sensor data as input to a DRL network of self-state attention to investigate the effect of using 2D LiDAR on a tall robot. In their work, the training environment is captured and processed before passing it to the training network. Xie et al. [
35] applied a proportional-integral-derivative (PID) controller to improve the training rate of a convolutional neural network that takes 512 stacked laser beams as input. In [
36], a reinforcement learning method is developed that automatically learns the best number of beams required based on the application. Their work was tested on object detection and shows that the appropriate beam configuration improves the performance of the LiDAR application. Zhang et al. [
37] developed a neural network for safe navigation based on different LiDAR sensor configurations (FOV, number of mounted LiDAR sensors, and LiDAR orientation). Their work shows that models using a LiDAR sensor with a 240° FOV outperform all other FOVs tested in all scenarios. Another work by [
38] chooses an FOV with a minimum angle of 13.4° and a maximum angle of 11.5° to train a DRL model for robot navigation. Their approach was able to navigate safely with the limited FOV. Jinyoung et al. [
39] investigated the performance of a narrow-FOV LiDAR in robot navigation. They developed a navigation model using long short-term memory (LSTM), a type of recurrent neural network, with a local-map critic (LMC). However, these researchers did not provide details of the criteria used in the selection of these FOVs.
A LiDAR sensor emits light beams into its surroundings, which bounce off nearby objects back to the sensor. The beam that takes the shortest time to return is used to calculate the shortest distance to an impending obstacle, and this distance in turn is used to control the robot’s velocity values while training a navigation neural network. Therefore, it is important to investigate the LiDAR sensor configuration required to train a DRL model for a given application. Does the DRL algorithm require a 360°, 270°, 90°, or other view of the environment to be effective? To this end, we propose a method of estimating the required FOV based on the width of the sensor and the obstacle in view.
Table 1 summarizes the differences between our approach and the existing literature.
3. Problem Formulation
In this investigation, we consider moving a mobile robot from a starting point to a known target position while avoiding obstacles. For successful autonomous exploration, the mobile robot must avoid colliding with obstacles along its path while reaching its target in the shortest distance and travel time. To formulate the problem, the properties of the robot, the dynamics of the environment, and the reinforcement learning model are discussed in this section.
3.1. Simulation Environment
Mobile Robot Dynamics: For our experiment, we use a simulation of the Husky A200 UGV developed by Clearpath Robotics, Inc., Ontario, Canada. This is a non-holonomic, differential-drive mobile robot, as shown in
Figure 3, which allows the control of its linear (forward or backward) and angular (left or right rotation) velocities. The motion about the instantaneous center of curvature (ICC) relates the angular velocity of the left ground wheel $w_l$ and of the right wheel $w_r$, both expressed in radians per second (rad/s), to the commanded velocities as [40,41,42,43]:
$$w_l = \frac{v - \frac{l}{2}\,w}{r}, \qquad w_r = \frac{v + \frac{l}{2}\,w}{r},$$
where
$r$ is the radius of the driving wheel,
$v$ is the linear velocity expressed in meters per second (m/s),
$w$ is the angular velocity expressed in rad/s, and
$l$ is the distance between the wheels (the wheelbase). The kinematic model $(\dot{x}, \dot{y}, \dot{\theta})$ of the differential drive of the mobile robot is thus given as:
$$\begin{bmatrix}\dot{x}\\ \dot{y}\\ \dot{\theta}\end{bmatrix} = \begin{bmatrix}\cos\theta & 0\\ \sin\theta & 0\\ 0 & 1\end{bmatrix}\begin{bmatrix} v \\ w \end{bmatrix}.$$
In the model presented, the location of the robot is defined by its Cartesian position $(x, y)$ and the orientation $\theta$. Since the maximum speed of the robot is 1 m/s, the linear velocity is set within the range [0,1] and the angular velocity within [−1,1].
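As an illustration of these kinematics, the sketch below converts a commanded (v, w) pair into wheel speeds and integrates the unicycle model for a few steps; the wheel radius and wheelbase values are placeholders, not the exact Husky A200 parameters.

```python
import numpy as np

# Hypothetical Husky-like parameters; the real values come from the robot's URDF.
WHEEL_RADIUS = 0.165   # r, wheel radius in metres (assumed)
WHEEL_BASE = 0.555     # l, distance between left and right wheels in metres (assumed)

def wheel_speeds(v: float, w: float):
    """Left/right wheel angular velocities (rad/s) for a commanded (v, w)."""
    w_left = (v - 0.5 * WHEEL_BASE * w) / WHEEL_RADIUS
    w_right = (v + 0.5 * WHEEL_BASE * w) / WHEEL_RADIUS
    return w_left, w_right

def integrate_pose(x: float, y: float, theta: float, v: float, w: float, dt: float):
    """One Euler step of the unicycle kinematic model given above."""
    x += v * np.cos(theta) * dt
    y += v * np.sin(theta) * dt
    theta += w * dt
    return x, y, theta

# Drive straight at 0.5 m/s for one second in 0.1 s steps.
pose = (0.0, 0.0, 0.0)
for _ in range(10):
    pose = integrate_pose(*pose, v=0.5, w=0.0, dt=0.1)
print(pose)  # (~0.5, 0.0, 0.0)
```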
LiDAR Sensor Model: For our investigation, we consider map-less navigation of the robot through its surroundings. First, we consider the required angle of view subtended by an object, as depicted in
Figure 4. Denoting the width of the sensor used as $h$ in millimeters and the minimum distance between the obstacle and the robot as $L$ in millimeters, the angle of view $\theta$ is calculated as:
$$\theta = 2\arctan\!\left(\frac{h}{2L}\right).$$
For our experiment, the distance between the LiDAR sensor mounted on the robot and the obstacle is set at a minimum of 100 mm, while the width of the sensor in use is 103 mm. From the calculated angle of view, we obtain our proposed LiDAR FOV. For verification, we also consider a narrower FOV and a wider FOV. These three FOVs are used to generate three navigation models. For each of the FOVs, there are different beam densities, that is, different numbers of beams $n$ distributed uniformly throughout the FOV. The angular spacing $\Delta\theta$ between the beams is given as:
$$\Delta\theta = \frac{\mathrm{FOV}}{n - 1}.$$
The maximum range each beam can travel is set to 10 m, while the resolution is set to 1. These parameters are used to define three different LiDAR sensor configurations. For each model configuration, an array of point cloud beams from the LiDAR sensor is used to perceive the robot’s surroundings, enabling it to avoid obstacles and reach its goal point, as shown in
Figure 5. If no obstacle is detected, a beam returns its maximum range as the free distance ahead; otherwise, the smallest of the $n$ ranging values is taken as the distance to the first obstacle encountered.
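The sketch below illustrates the angle-of-view computation and the uniform beam spacing described above, assuming the standard subtended-angle geometry; the function names and the five-beam example are illustrative only.

```python
import math

def angle_of_view_deg(sensor_width_mm: float, min_distance_mm: float) -> float:
    """Angle (degrees) subtended at the sensor by a width h at a distance L."""
    return math.degrees(2.0 * math.atan(sensor_width_mm / (2.0 * min_distance_mm)))

def beam_directions(fov_deg: float, n_beams: int):
    """n beams spread uniformly across the FOV, centred on the robot heading."""
    spacing = fov_deg / (n_beams - 1)
    return [-fov_deg / 2.0 + i * spacing for i in range(n_beams)]

# With h = 103 mm and L = 100 mm (the values used in this work):
fov = angle_of_view_deg(103.0, 100.0)
print(round(fov, 1))             # ~54.5 degrees
print(beam_directions(fov, 5))   # 5 beams from -fov/2 to +fov/2
```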
Obstacle Environment: For this experiment, a square wall was used to restrain the robot from leaving the experiment space. At the beginning of each episode, the target point is changed and the four obstacles, all of the same shape and size, are placed at random positions in the experiment space, as shown in
Figure 6. The purpose is to randomly initialize the training data set.
3.2. Action Space and State Representation
Developing a reinforcement learning-based unmanned robot is based on four components: (1) the robot model and its environment, (2) the state and action space, (3) the policy and reward function, and (4) the sensing, motion, and communication units [
44,
45]. In this paper, navigation in an unknown environment is based on using the current information received by the LiDAR at each time step to control the linear and angular velocity of the vehicle. Given the initial and final coordinates of the robot, the probability of transitioning from one state to another depends on the distance between the robot and the target point $d_t$, the orientation of the robot to the goal point $\theta_t$, the previous linear velocity $v_{t-1}$, the angular velocity $w_{t-1}$, and an array $N$ of LiDAR beams $n$ giving the distance between the obstacles and the robot at each time step $t$:
$$s_t = \left(d_t,\; \theta_t,\; v_{t-1},\; w_{t-1},\; N\right).$$
The value of $n$ depends on the LiDAR configuration used. The action at time step $t$ consists of the linear and angular velocities obtained from the policy distribution $\pi(a_t \mid s_t)$.
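As an illustration, the following sketch assembles the observation vector described above; the normalization by the maximum range and the variable names are our own assumptions rather than the exact implementation.

```python
import numpy as np

def build_state(ranges, d_goal, heading_error, v_prev, w_prev, max_range=10.0):
    """Assemble the observation described above (a sketch; names are illustrative).

    ranges        : array of n LiDAR distances for the chosen configuration
    d_goal        : current distance between the robot and the goal point
    heading_error : robot orientation relative to the goal point
    v_prev, w_prev: linear/angular velocity applied at the previous step
    """
    lidar = np.clip(np.asarray(ranges, dtype=np.float32), 0.0, max_range) / max_range
    extras = np.array([d_goal, heading_error, v_prev, w_prev], dtype=np.float32)
    return np.concatenate([lidar, extras])

# Example with a 20-beam configuration:
state = build_state(np.full(20, 10.0), d_goal=3.2, heading_error=0.1,
                    v_prev=0.4, w_prev=0.0)
print(state.shape)  # (24,)
```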
3.3. Deep-Reinforcement Learning
Reinforcement learning is a process where an agent (controller) is reinforced with rewards or penalties based on its interaction with its environment. It learns to take the appropriate action to maximize the reward in the environment. Formally, every reinforcement learning problem is formulated as a Markov decision process (MDP) [
46]. An MDP is represented as a five-tuple $(S, A, P, R, \gamma)$, where $S$ is the set of states the agent can be in, $A$ is a finite set of actions the agent can take, $P$ is the state transition probability matrix $P(s_{t+1} \mid s_t, a_t)$, $R$ is the reward function, and $\gamma \in [0, 1]$ is a discount factor. This shows that the probability of transiting between state $s_t$ and $s_{t+1}$ depends on the action $a_t$. At each time step $t$, the agent chooses an action based on a policy $\pi$.
In cases where the dynamics of the environment are unknown, Q-learning is a widely used method to solve MDPs. Q-learning is a model-free, off-policy RL algorithm that learns directly from its experience of interacting with the environment. The algorithm aims to learn an optimal action-value function (Q-function), which assigns to each state–action pair the expected cumulative reward the agent can receive by taking that action in that state [
47]. The Q-table is represented as a 2D matrix, where the rows represent states, and the columns represent actions.
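For reference, a minimal tabular Q-learning update is sketched below; the grid size, learning rate, and epsilon-greedy exploration are illustrative choices, not values used in this work.

```python
import numpy as np

# Tabular Q-learning: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
n_states, n_actions = 25, 4
Q = np.zeros((n_states, n_actions))   # rows: states, columns: actions
alpha, gamma, epsilon = 0.1, 0.99, 0.1

def choose_action(state: int) -> int:
    """Epsilon-greedy action selection from the current Q-table."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[state]))

def q_update(state: int, action: int, reward: float, next_state: int) -> None:
    """One off-policy temporal-difference update of the action-value estimate."""
    td_target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (td_target - Q[state, action])
```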
In DRL, the actor–critic approach is applied to approximate the Q function [
48]. The actor, a policy-based algorithm modeled as a neural network with learnable parameters, selects the action of the agent, while the critic, a value-based algorithm, evaluates the actions and estimates the action-value function $Q(s, a)$. The critic also accounts for the discounted sum of future rewards. In this work, a twin delayed deep deterministic policy gradient (TD3) actor–critic network is used to train the control policy. As shown in
Figure 7, the actor network takes the observation state $s$ as its input. The first hidden layer is a fully connected (FC) layer with 800 units and rectified linear unit (ReLU) activation functions used to refine the representation of the state. The second hidden FC layer with 600 units also uses ReLU activations to further refine the representation of the state. The last FC layer has two units, representing the output action dimension. A tanh activation function is used to squash the output to the range [−1, 1].
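A minimal PyTorch sketch of the actor described above, with the layer widths as stated; the state dimension in the example is an assumption.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """TD3 actor sketched from the description above (layer sizes as stated)."""

    def __init__(self, state_dim: int, action_dim: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 800), nn.ReLU(),   # first FC layer, 800 units
            nn.Linear(800, 600), nn.ReLU(),         # second FC layer, 600 units
            nn.Linear(600, action_dim), nn.Tanh(),  # output squashed to [-1, 1]
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# Example: a 24-dimensional observation produces a 2-dimensional action.
actor = Actor(state_dim=24)
print(actor(torch.zeros(1, 24)).shape)  # torch.Size([1, 2])
```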
For the critic, two networks are used to evaluate the action-value function $Q(s, a)$. As shown in
Figure 8, the two networks have a similar architecture. The first hidden layer has 800 units. It takes the state representation as input and applies the ReLU activation function to introduce non-linearity to the network. The output of this layer is passed to the first transformation layer (TL1). The TL1 has 800 units and uses ReLU activation to further introduce non-linearity to the network. The output from the actor network is passed as input to the second transformation layer (TL2), which transforms the action to match the dimensionality of the state representation without introducing non-linearity to the network. The combined layer (CL) concatenates the state and the transformed action, creating a vector of features. The CL output is passed to an FC layer with 800 units and applies ReLU activation. The output from the FC layer consists of a single unit representing the estimated Q-value.
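A corresponding sketch of one critic network, following the layer description above; the concatenation width and the final linear head are our reading of that description, and TD3 maintains two such networks.

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """One TD3 Q-network sketched from the description above; TD3 keeps two copies."""

    def __init__(self, state_dim: int, action_dim: int = 2):
        super().__init__()
        self.state_in = nn.Sequential(nn.Linear(state_dim, 800), nn.ReLU())
        self.tl1 = nn.Sequential(nn.Linear(800, 800), nn.ReLU())  # state branch (TL1)
        self.tl2 = nn.Linear(action_dim, 800)                     # action branch (TL2), no activation
        self.head = nn.Sequential(
            nn.Linear(1600, 800), nn.ReLU(),  # FC layer over the combined features
            nn.Linear(800, 1),                # single unit: estimated Q-value
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        s = self.tl1(self.state_in(state))
        a = self.tl2(action)
        return self.head(torch.cat([s, a], dim=-1))  # CL: concatenate state and action features

critic_1, critic_2 = Critic(24), Critic(24)  # twin critics, as in TD3
q1 = critic_1(torch.zeros(1, 24), torch.zeros(1, 2))
print(q1.shape)  # torch.Size([1, 1])
```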
3.4. Reward Function
Based on the action of the robot in an environment, we create our reward function to guide DRL training to encourage desirable actions while traveling between states. A positive or negative reward is given based on the action and state space. Considering the navigation space of mobile robots in this research, the reward is based on the robot’s ability to navigate to its destination while avoiding obstacles. To define the reward function, the following reward variables were considered:
Collision Penalty: A fixed-radius zone, called the restricted zone, is placed around each obstacle. It is considered that a collision has occurred if the robot enters the restricted zone, that is, if the distance $d_o$ between the robot and the obstacle falls below the distance threshold $d_{\mathrm{min}}$; a fixed negative reward is then applied as the collision penalty.
Goal Reward: Like the restricted zone, a fixed-radius zone surrounding the goal point is referred to as the success zone. If the robot enters the success zone, that is, if the distance $d_g$ between the robot and the goal point falls below the goal threshold $d_{\mathrm{goal}}$, it is considered that the robot has reached its goal and a fixed positive reward is given.
Distance Penalty: This penalty is based on the current distance between the robot and the target point relative to the initial distance to the target point. If the robot is close to the goal point, it receives a small penalty, while if the distance is large, it receives a large penalty.
Heading Penalty: To ensure that the robot heads toward the goal point, a penalty is placed on the robot’s orientation. Given the robot’s orientation $\theta_r$ and the goal point orientation $\theta_g$, the heading penalty grows with the absolute difference between the two, where the goal point orientation is given by
$$\theta_g = \operatorname{atan2}\!\left(y_g - y,\; x_g - x\right).$$
From the calculated collision penalty, goal reward, distance penalty, and heading penalty, the total reward at each step is the sum of the four terms.
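A compact sketch of how the four terms can be combined into a single step reward; the constants and thresholds below are illustrative placeholders, not the values used in this work.

```python
import math

# Illustrative constants; the paper's actual values are not reproduced here.
COLLISION_PENALTY = -100.0
GOAL_REWARD = 100.0
D_MIN = 0.35    # collision threshold (m), assumed
D_GOAL = 0.30   # goal threshold (m), assumed

def step_reward(d_obstacle, d_goal, d_goal_initial, theta_robot, theta_goal):
    """Combine the four terms described above into one scalar reward (a sketch)."""
    if d_obstacle <= D_MIN:            # robot entered the restricted zone
        return COLLISION_PENALTY
    if d_goal <= D_GOAL:               # robot entered the success zone
        return GOAL_REWARD
    distance_penalty = -(d_goal / d_goal_initial)                # smaller when closer to the goal
    heading_penalty = -abs(theta_robot - theta_goal) / math.pi   # penalise facing away from the goal
    return distance_penalty + heading_penalty
```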
4. Training Environment and Process
The training of the navigation policy was performed on a computer system equipped with an NVIDIA GTX 1050 graphics card, 32 GB of RAM, and an Intel Core i7-6800K CPU. The operating system used was Ubuntu 20.04, as it supports the Gazebo simulation of the Husky A200 robot. Other packages used were TensorFlow, the Robot Operating System (ROS), Python, and PyTorch.
Three training experiments were conducted to obtain three different learning policies. In each experiment, an actor–critic network (
Figure 7 and
Figure 8) was trained over 1000 episodes. Each episode of each experiment ends when the robot reaches its goal point, falls within the collision threshold, or reaches the timeout of 500 steps. Once an episode ends, the robot position is reset to its starting pose, while the goal point and obstacles are placed randomly within the environment, ensuring that the obstacles are placed 1 m away from the goal point, as shown in
Figure 6. The three experiments all use the same number of fixed-size obstacles but differ in the number of LiDAR beams and the FOV employed. The TD3 parameter update interval was set to two episodes, while the delayed rewards were updated over the last 10 steps. In addition, after every 10 episodes, the network evaluates the current model in the training environment, saves the model, and records the result.
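The episode loop described above can be summarized as follows; the environment and agent here are stand-in stubs rather than the actual Gazebo/ROS interface and TD3 implementation.

```python
import numpy as np

# Skeleton of the training loop described above, with stub environment and agent.
MAX_EPISODES, MAX_STEPS, EVAL_EVERY = 1000, 500, 10

class StubEnv:
    def reset(self):
        return np.zeros(24, dtype=np.float32)        # re-spawn goal/obstacles, reset robot pose
    def step(self, action):
        next_state = np.zeros(24, dtype=np.float32)
        reward, done = 0.0, True                     # ends on goal, collision, or timeout
        return next_state, reward, done

class StubAgent:
    def select_action(self, state):
        return np.array([0.5, 0.0])                  # (linear, angular) velocity command
    def store(self, *transition):
        pass                                         # replay-buffer insert
    def train(self):
        pass                                         # TD3 update with delayed policy/target updates

env, agent = StubEnv(), StubAgent()
for episode in range(MAX_EPISODES):
    state = env.reset()
    for step in range(MAX_STEPS):
        action = agent.select_action(state)
        next_state, reward, done = env.step(action)
        agent.store(state, action, reward, next_state, done)
        agent.train()
        state = next_state
        if done:
            break
    if (episode + 1) % EVAL_EVERY == 0:
        pass  # evaluate the current policy in the training environment and save the model
```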
Table 2 shows the description of the parameters used to train the network. The choice of the parameters was determined through experimentation and tuning.