1. Introduction
Lake water body extraction is a fundamental and important field of remote sensing image analysis. There are approximately 304 million natural lakes on the Earth's surface, most of them small water bodies, covering a total area of about 4.6 million square kilometers [1]. Lakes are highly sensitive to global temperature changes and play an important role in the carbon cycle, which makes them particularly relevant under the threat of global warming [2]. Lakes also support irrigation [3], provide drinking water on which human life depends [4], and enable transportation [5]. Remote sensing images of lakes contain a great deal of information that can be used in other areas, such as disaster monitoring, agricultural development, livestock farming, and geographic planning. Therefore, it is important to study the automatic extraction of lake water bodies from remote sensing images.
Because remote sensing image acquisition is relatively insensitive to natural conditions, images covering large areas can be acquired quickly and at low cost. Therefore, a large number of remote sensing images have been used for sea-land segmentation [6,7], water extraction [8,9,10,11,12,13,14,15,16,17,18,19], and other applications. In recent years, numerous water extraction methods have been proposed, such as threshold methods [8,9,16,17,18], machine learning methods [10,11,19], and deep learning methods [12,13,14].
Remote sensing images use different bands and thus contain different information. The threshold method is widely used for water extraction; its variants differ mainly in the number of bands used, either single-band [16] or multi-band [8,17]. Water bodies and non-water objects differ most in the near-infrared (NIR) band, so a water index based on the single NIR band can produce satisfactory extraction results [16]. Multi-band water index thresholding exploits spectral information from different bands as fully as possible, but can also introduce redundancy. Since mountain shadows and cloud shadows have spectral characteristics similar to those of water bodies, a single fixed threshold may lead to over- or under-segmentation of the water bodies [18]. Dynamically adjusting the threshold to produce optimal segmentation results is therefore complex and time-consuming. Feyisa et al. [8] introduced an automatic water extraction index to improve classification accuracy in the presence of various environmental noises, including shadows and dark surfaces that other classification methods cannot classify correctly; it provides a stable threshold while improving accuracy. Zhang et al. [9] achieved automatic dynamic adjustment of thresholds, which reduced the dependence on data without degrading accuracy and could be applied to massive collections of remote sensing images.
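As a concrete illustration of single-threshold water indexing, the sketch below computes the widely used normalized difference water index, NDWI = (Green − NIR) / (Green + NIR), and applies a fixed threshold. This is a minimal toy example with made-up reflectance values and a zero threshold; it does not reproduce the specific indices or thresholding schemes of [8,9].

```python
import numpy as np

def ndwi(green, nir, eps=1e-9):
    """Normalized Difference Water Index: (Green - NIR) / (Green + NIR)."""
    green = green.astype(np.float64)
    nir = nir.astype(np.float64)
    return (green - nir) / (green + nir + eps)

def water_mask(green, nir, threshold=0.0):
    """Classify pixels whose NDWI exceeds a fixed threshold as water."""
    return ndwi(green, nir) > threshold

# Toy 2x2 scene: water reflects more green than NIR; land the opposite.
green = np.array([[0.30, 0.30], [0.10, 0.12]])
nir   = np.array([[0.05, 0.08], [0.40, 0.35]])
mask = water_mask(green, nir)  # top row water, bottom row land
```

A single fixed threshold such as the `0.0` above is exactly what breaks down under mountain and cloud shadows, which is why the adaptive-threshold methods cited in the text are needed.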
Methods based on spectral information give good results on low-resolution multispectral images, but they become less robust on medium- and high-resolution images, where more spatial detail is visible. Traditional machine learning algorithms, which rely on hand-crafted features, have shown strong robustness in water body extraction [10,11,19]. These methods can be classified as pixel-based [10] or object-based [19]. Zhang et al. [10] proposed a pixel area index to assist the normalized difference water index in detecting major water bodies, applying K-means clustering to water pixels near the boundary to extract complete boundary pixels, although the optimal selection of thresholds remained a problem. Chen et al. [19] used object-oriented classification techniques combined with the spectral, textural, and geometric features of remote sensing images to extract water body information. However, both pixel-based and object-based water extraction methods pay little attention to water body types; research that determines the type of an extracted water body is scarce [11]. Huang et al. [11] presented a two-level machine learning framework that identifies water types at the target level using geometric and textural features under pixel-based water extraction conditions, filling this gap. Although machine learning algorithms have achieved good results in remote sensing image analysis, they are limited by hand-crafted features that must be computed and stitched together before being fed into classifiers such as support vector machines [20] or random forests [21], and they adapt poorly to different datasets.
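The hand-crafted-feature pipeline described above can be sketched as follows: per-pixel feature vectors (here an assumed two features, NIR reflectance and a local texture statistic, with synthetic values) are stitched together and fed to a random forest. This illustrates the workflow, not any particular cited method's features or data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy per-pixel features: [NIR reflectance, local texture variance].
# Water pixels: low NIR, smooth texture; land pixels: higher NIR, rougher.
water = np.column_stack([rng.normal(0.05, 0.02, 200),
                         rng.normal(0.01, 0.005, 200)])
land  = np.column_stack([rng.normal(0.35, 0.05, 200),
                         rng.normal(0.10, 0.02, 200)])
X = np.vstack([water, land])
y = np.array([1] * 200 + [0] * 200)  # 1 = water, 0 = land

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
pred = clf.predict([[0.04, 0.008], [0.40, 0.12]])  # a water-like, a land-like pixel
```

The limitation noted in the text is visible here: the two feature columns are designed by hand, and a feature set tuned for one sensor or scene type transfers poorly to another.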
In the last decade, deep learning techniques have made significant breakthroughs in image processing, including semantic segmentation [22,23,24,25], object detection [26,27], and image classification [28,29,30]. Deep convolutional neural networks (DCNNs) can automatically learn features at different levels from a large number of training images, avoiding the drawbacks of manually designed features. Krizhevsky et al. [28] proposed AlexNet, which won the 2012 ImageNet classification task with an error rate about 10% lower than the runner-up, establishing the dominance of deep learning in image recognition. However, AlexNet suffered from high computational cost and limited network depth. Simonyan et al. [29] employed small convolutional kernels that not only increased the model's nonlinear expressiveness but also reduced the amount of computation. Shallow convolutional layers capture pixel boundary and location information, while deep convolutional layers capture the semantic information used for pixel classification. As convolutional layers get deeper, richer semantic features are acquired, but vanishing or exploding gradients can arise. He et al. [30] reformulated layers as learning residual functions, solving the degradation problem caused by increasing network depth, so that performance can be improved by making the network deeper.
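The residual formulation just mentioned can be made concrete with a minimal sketch: a block outputs x + F(x), so when its weights are zero it reduces to the identity mapping, which is why stacking many such blocks does not degrade the network. Here F is a toy linear map with ReLU; real residual units use convolutions and batch normalization.

```python
import numpy as np

def residual_block(x, weight):
    """A residual unit learns F(x) and outputs x + F(x) (He et al. [30]).
    F here is a toy linear map + ReLU standing in for the conv layers."""
    fx = np.maximum(weight @ x, 0.0)  # F(x)
    return x + fx                     # identity shortcut

x = np.array([1.0, -2.0])
# With zero weights, F(x) = 0 and the block is exactly the identity:
out = residual_block(x, np.zeros((2, 2)))
```

Because the shortcut makes the identity trivially learnable, adding depth can only add representational capacity rather than making optimization harder.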
Semantic segmentation is a typical pixel-level classification task that assigns each pixel the label with the maximum predicted probability. Long et al. [22] replaced the fully connected layers of a convolutional neural network with convolutional layers so that inputs of arbitrary size could be accepted, producing the first end-to-end trainable network for semantic segmentation. However, some detail is lost due to pooling operations. Badrinarayanan et al. [23] proposed a semantic segmentation model with an encoder-decoder structure that recovers the resolution of the feature map using max-pooling indices during decoder up-sampling, but it did not make good use of shallow, detailed location information. In the same year, Unet [24] combined high-level semantic information with low-level detail information in the up-sampling process through connection operations, obtaining more accurate pixel-level classifications. DeepLab V3+ [25], one of the best-performing semantic segmentation models, expands the receptive field of feature points through dilated convolution and combines feature maps at different scales using the atrous spatial pyramid pooling (ASPP) module. These end-to-end networks can be applied to the semantic segmentation of remote sensing images, but the large amount of noise in such images often makes their performance unsatisfactory.
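The receptive-field expansion that dilated convolution provides can be seen in a minimal 1-D sketch: a k-tap kernel with dilation d covers (k − 1)d + 1 input positions with no extra parameters. The code below is an illustration of the mechanism, not of DeepLab V3+'s actual ASPP implementation.

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    """'Same'-padded 1-D dilated convolution: kernel taps are spaced
    `dilation` apart, enlarging the receptive field at no parameter cost."""
    k = len(kernel)
    span = (k - 1) * dilation       # distance spanned by the kernel taps
    pad = span // 2
    xp = np.pad(x, pad)
    out = np.zeros(len(x))
    for i in range(len(x)):
        for j in range(k):
            out[i] += kernel[j] * xp[i + j * dilation]
    return out

def receptive_field(k, dilation):
    """Effective receptive field of a k-tap kernel with the given dilation."""
    return (k - 1) * dilation + 1

x = np.arange(8, dtype=float)
y = dilated_conv1d(x, np.array([1.0, 1.0, 1.0]), dilation=2)
# y[3] sums x[1], x[3], x[5]: taps two positions apart around the center.
```

ASPP runs several such convolutions in parallel at different dilation rates and fuses their outputs, so one layer sees context at multiple scales simultaneously.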
Multi-scale features, noise interference, and boundary blurring are the main factors affecting accuracy in lake water body extraction. Miao et al. [12] proposed the RRFDeconvnet model by combining Deconvnet with residual units and skip connections, together with a new loss function that incorporates area information into convolutional neural networks to address boundary blurring. However, with only a single dilation rate, it does not cope well with noise interference or multi-scale features. Guo et al. [14] proposed a multi-scale water extraction convolutional neural network in which the encoder's feature map is fed into four parallel dilated convolutions with different dilation rates, which mitigates noise interference. However, the large dilation rates cause a loss of important information, so small lake water bodies are extracted poorly, and relying only on bottom-level features leads to boundary blurring.
In this paper, the lake dataset is classified at the pixel level using a fully convolutional neural network. The proposed algorithm consists of three main parts: an encoder, a multi-scale densely connected feature extractor, and a decoder. We improve model performance by using a modified ResNet-101 as the encoder to extract deep semantic features. In our dataset, lakes exhibit multi-scale features owing to the gradual improvement in resolution. They are rich in textural and spectral features, and the presence of shadows and snow gives remote sensing images large intra-class variance and small inter-class variance. We propose a multi-scale densely connected feature extractor that preserves small lakes' information to solve the multi-scale problem while also expanding the receptive field to address the large intra-class variance and small inter-class variance. In the decoder stage, we use residual convolution to combine features from different layers and obtain accurate boundary segmentation results.
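The multi-scale densely connected feature extractor is described here only at a high level, so the following is a rough sketch of the connectivity pattern: parallel branches with progressively larger dilation rates, where each branch receives the channel-wise concatenation of the input and all earlier branch outputs. The dilated convolution itself is replaced by a trivial stand-in; the real module's kernels, rates, and channel widths are not specified in this excerpt.

```python
import numpy as np

def dilated_branch(x, rate, out_ch=4):
    """Stand-in for a dilated convolution at the given rate. A real branch
    applies learned kernels; this just mixes channels so tensor shapes
    behave like the real module's."""
    mixed = x.mean(axis=0, keepdims=True) / rate
    return np.repeat(mixed, out_ch, axis=0)

def dense_dilated_module(x, rates=(1, 2, 4, 8)):
    """Each branch sees the channel concatenation of the input and every
    earlier branch, so fine-detail (small-rate) features are never lost
    as the receptive field grows."""
    features = [x]
    for r in rates:
        concat = np.concatenate(features, axis=0)  # concat along channels
        features.append(dilated_branch(concat, r))
    return np.concatenate(features, axis=0)

x = np.ones((4, 8, 8))            # (channels, H, W)
out = dense_dilated_module(x)     # 4 input + 4 branches x 4 = 20 channels
```

The channel count growing with every branch is the "dense connection": later, large-rate branches still have direct access to the undilated input, which is how small-lake information survives the receptive-field expansion.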
The main contributions of this paper are as follows:
1. To take full advantage of features at different levels and prevent model degradation, this article proposes a novel multi-scale lake water extraction network, named MSLWENet.
2. Inspired by Xception [31] and MobileNet [32], depthwise separable convolution is used to reduce the number of model parameters and the volume of the model, preventing overfitting without reducing overall accuracy.
3. To handle lakes with large intra-class variance, small inter-class variance, and multiple scales, we design a multi-scale densely connected feature extractor with multiple atrous rates that not only fully extracts the information of small lakes but also further expands the receptive field to extract lake water bodies in their entirety.
4. Compared with other end-to-end models, the semantic segmentation algorithm proposed in this paper achieves the best performance on all five evaluation metrics.
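The parameter saving behind the depthwise separable convolution mentioned in the contributions can be checked by direct arithmetic: a standard k x k convolution needs C_in · C_out · k² weights, while the separable version needs C_in · k² (depthwise) plus C_in · C_out (pointwise). The 256-channel layer below is an illustrative size, not one taken from MSLWENet.

```python
def standard_conv_params(c_in, c_out, k):
    """Parameter count of a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    """One k x k depthwise filter per input channel, then a 1 x 1
    pointwise convolution that mixes channels."""
    depthwise = c_in * k * k
    pointwise = c_in * c_out
    return depthwise + pointwise

# Example layer: 256 -> 256 channels, 3 x 3 kernel.
std = standard_conv_params(256, 256, 3)        # 256*256*9   = 589,824
sep = depthwise_separable_params(256, 256, 3)  # 2,304 + 65,536 = 67,840
ratio = std / sep                              # roughly 8.7x fewer weights
```

For 3 x 3 kernels the saving approaches a factor of k² = 9 as the channel count grows, which is why the substitution shrinks the model with little accuracy cost.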
This paper is organized as follows: Section 2 presents the structure of the lake water extraction model and the data pre-processing; Section 3 describes the implementation details, dataset, and experimental results; Section 4 discusses our method; and Section 5 presents our conclusions.
4. Discussion
As the resolution of remote sensing images has improved, water extraction methods have evolved from thresholding through machine learning to deep learning. The Tibetan Plateau hosts the highest, most numerous, and largest lakes on Earth and is one of the two most densely distributed lake regions in China.
In this paper, our proposed method, MSLWENet, achieves state-of-the-art performance on the lake dataset, outperforming DeepLab V3+, PSPNet, MWEN, and Unet. In Section 3, the performance of the model is evaluated using five evaluation metrics and visualization results. In particular, our method achieves an overall accuracy (OA) of 98.53%, an improvement of 0.43% over DeepLab V3+. For small lake water extraction, OA improves little because the background occupies most of the image, but TWR improves by 1.47% compared to DeepLab V3+, showing that the proposed model is capable of extracting small lakes. In regions with large intra-class variance and small inter-class variance, PSPNet performs better than DeepLab V3+; however, MSLWENet outperforms the other CNN models, which means it suppresses noise better. These results stem mainly from the multi-scale densely connected feature extraction module, which alleviates both the information loss caused by overly large dilation rates and the failure of overly small rates to identify noise correctly. We fully utilize spatial- and channel-dimension features to better capture the multi-scale relationships between pixels, giving our model better feature extraction capability and segmentation performance. In Section 3.3.5, MSLWENet improves the OA by 1.16% compared to VGG-Concat. This result shows that VGG-16 as the backbone does not extract enough semantic features for segmentation, which may also explain the poor performance of Unet. Furthermore, MSLWENet improves the OA by 0.36% compared to ResNet-Sum, showing that channel concatenation improves performance more than element-wise addition, perhaps because it retains more information during up-sampling.
The segmentation performance of these convolutional neural networks may be related to their structure and to dataset complexity. On relatively simple datasets, overly complex neural network models are prone to overfitting, so simpler networks such as FCN and Unet can achieve better performance. The datasets in this paper, by contrast, have complex textural and spectral features and therefore require more complex models for feature extraction, such as DeepLab V3+ and PSPNet. However, since our dataset is relatively small, we apply depthwise separable convolution in our model to drastically reduce the number of trainable parameters and effectively suppress overfitting.
5. Conclusions
In this paper, a new model, MSLWENet, is proposed for remote sensing image semantic segmentation based on convolutional neural networks. The adopted structure consists of an encoder, a multi-scale densely connected module, and a decoder. For feature extraction, a modified ResNet-101 is used, whose residual structure can extract a large number of useful semantic features without degrading the model. Dilated convolution is necessary because of the lakes' multi-scale features, but excessive dilation rates can lead to the loss of useful information and incomplete segmentation of small lakes, so we use progressively larger dilation rates and concatenate channels across the different dilated convolution layers to preserve as much information as possible. During training, we use data augmentation, which helps avoid overfitting and improves the generalization of the network. Compared to existing models, our method achieves the highest OA, recall, MIoU, and TWR and the lowest FWR, and the integrity of the segmented lakes is significantly better than with other methods.
Although our method achieves good segmentation results on this dataset, there are still shortcomings that will guide our future research. Because our dataset is relatively small, the model can easily overfit; we therefore need to enrich the dataset, and pre-training on ImageNet followed by fine-tuning on our dataset would also be a better solution. Finally, because lake surfaces have rich textures, segmentation is prone to noise, so post-processing techniques such as conditional random fields and morphological filtering will be applied to optimize the segmentation results.