Holistic and Historical Instance Comparison for Cervical Cell Detection

Hao Jiang1, Runsheng Liu1, Yanning Zhou2, Huangjing Lin3 and Hao Chen1, 4, 5, ✉
Corresponding author: Hao Chen. Email: jhc@cse.ust.hk
1 Department of Computer Science and Engineering,
The Hong Kong University of Science and Technology, Hong Kong, China
2 Tencent AI Lab, Shenzhen, China
3 Imsight AI Research Lab, Shenzhen, China
4 Department of Chemical and Biological Engineering,
The Hong Kong University of Science and Technology, Hong Kong, China
5 HKUST Shenzhen-Hong Kong Collaborative Innovation Research Institute, Futian, Shenzhen, China
Abstract

Cytology screening from Papanicolaou (Pap) smears is a common and effective tool for the preventive clinical management of cervical cancer, where abnormal cell detection from whole slide images serves as the foundation for reporting cervical cytology. However, cervical cell detection remains challenging due to 1) hazily-defined cell types (e.g., ASC-US) with subtle morphological discrepancies caused by the dynamic cancerization process, i.e., cell class ambiguity, and 2) imbalanced class distributions in clinical data, which may cause missed detections, especially for minority categories, i.e., cell class imbalance. To this end, we propose a holistic and historical instance comparison approach for cervical cell detection. Specifically, we first develop a holistic instance comparison scheme enforcing both RoI-level and class-level cell discrimination. This coarse-to-fine cell comparison encourages the model to learn foreground-distinguishable and class-wise representations. To specifically improve the distinguishability of minority classes, we then introduce a historical instance comparison scheme with a confident sample selection-based memory bank, which compares current embeddings with historical embeddings for better cell instance discrimination. Extensive experiments and analysis on two large-scale cytology datasets including 42,592 and 114,513 cervical cells demonstrate the effectiveness of our method. The code is available at https://github.com/hjiangaz/HERO.

Index Terms:
Cytology Detection, Class Ambiguity, Class Imbalance, Instance Comparison, Contrastive Learning

I Introduction

Cervical cancer is one of the leading causes of cancer-related deaths, with approximately 604,127 confirmed cases and 341,831 deaths reported worldwide in 2020 [18]. Cytology screening, which identifies abnormal cells by examining thousands of cells under a microscope, is the primary approach for precancerous screening from Pap smears or liquid-based cytology specimens. However, a typical cytology test requires experienced cytologists to spend 5-10 minutes analyzing cytology characteristics under a microscope to identify abnormal cells [10]. Computational cytology has made significant progress in accelerating this screening process [7]. Cell detection is usually regarded as the prerequisite step for identifying suspicious cells throughout the entire process [5, 6, 10].

Figure 1: Illustration of cervical cell detection. (a) Detected abnormal cells denoted by blue boxes; (b) Cell class imbalance: imbalanced distribution of cell instances across classes; (c) Cell class ambiguity: cells with varied appearances and morphologies.

Modern object detectors perform well in most natural scenes but struggle in the specific task of cervical cell detection. Human Papillomavirus (HPV) invasion and infection processes are continuous and dynamic, with epithelial cells gradually progressing from low-grade lesions to cell carcinoma [14]. Therefore, the morphological discrepancies between categories in adjacent stages (e.g., HSIL and SCC) are subtle and hard to discern, resulting in class ambiguity (Fig. 1). Furthermore, cervical cytology data are inherently imbalanced across classes because screened positive candidates are mainly distributed in the early stages of cancerization, which may lead to missed detections in categories with fewer samples, e.g., SCC and AGC-N.

Several previous works have addressed the cervical cell detection task. For example, Chai et al. [3] introduced a semi-supervised learning method to leverage unlabelled data for robust detection. Inspired by the clinical practice of using surrounding cells as references, Liang et al. [9] utilized two attention modules to explore contextual information. More recently, Liu et al. [12] observed the class imbalance issue in this task, but focused only on class re-balancing. However, these studies each targeted a specific issue; none of them tackled the intrinsic issue of cell indistinguishability caused by the blurred decision boundary between adjacent classes, which derives from the gradual progression of cancerization and manifests as morphological ambiguity.

To confirm the abnormal type in the case of cell class ambiguity, cytologists often retrieve TBS guidelines for deterministic reference and review previous samples for historical reference [14]. Inspired by this, we propose a holistic instance comparison strategy, consisting of both deterministic and historical comparison within the same framework. Deterministic comparison aligns instances belonging to the same class in the RoI feature space and encourages current batch instances to be referenced against the ground truth (GT) instances, thereby learning the class distinguishability. For historical comparison, we introduce a confident sample selection-based memory bank in class-level comparison, which ensures that confident instances from minor classes are unbiasedly sampled and learnt in each batch.

The main contributions of this paper are as follows:

  • To address the issues in cervical cell detection, we present a holistic and historical instance comparison strategy to exploit comprehensive inter-cell instance discrimination.

  • We introduce a RoI-level instance comparison module (RIC) and a class-level instance comparison module (CIC) with a confident sample selection-based memory bank to learn discriminative RoI and class representation.

  • Experiments on two large-scale cervical cell datasets, the larger containing 114,513 instance annotations, demonstrate that our proposed method outperforms other state-of-the-art (SOTA) methods.

Figure 2: Overview of the proposed Holistic and Historical Instance Comparison framework (a) for instance comparison (b). It consists of contrasting RoI features using RoI-level instance comparison module (RIC) with box augmentation (c), and contrasting class features by class-level comparison module (CIC) with a confident sample selection-based memory bank (d).

II Methodology

In this section, we first present an overview of the proposed holistic and historical instance comparison approach (Sec. II-A), then introduce the details of holistic instance comparison (Sec. II-B) and historical instance comparison (Sec. II-C), and finally describe the overall training scheme (Sec. II-D).

II-A Framework Overview

The proposed instance comparison method is based on the two-stage object detection framework [16], as illustrated in Fig. 2. First, cytology images are fed into the backbone for feature extraction, followed by instance candidate generation through a Region Proposal Network (RPN). Then, the features of these RoI candidates are extracted by a projection head $E_{1}$. Next, we implement instance comparison on RoI feature maps to learn distinguishable instance features through the RIC, which contrasts current RoI embeddings with GT embeddings. To explicitly address the class imbalance, we further introduce a confident sample selection-based memory bank in the CIC, which stores historical cell instances for each class with uniform sampling, thereby improving generalizability on minority classes and avoiding the domination of majority classes.

II-B Holistic Instance Comparison

RoI-level Instance Comparison (RIC). Deterministic instance comparison on a large number of RoI and GT bounding boxes can enlarge the subtle inter-class discrepancy between cells and encourage the generation of high-quality RoI candidates.

RIC involves comparing RoI and GT embeddings through contrastive learning. It starts with assigning GTs and sampling proposals to obtain $bs \times k$ ($bs$ denotes batch size, $k=256$) RoI candidates with corresponding predicted classes $(0 \sim num.class)$, followed by filtering out the background class, obtaining $K_{0}$ RoIs. Then, the RoI feature extractor $E_{1}$ is used to generate RoI features $I$ with the size of $[K_{0} \times 256 \times 7 \times 7]$. Then, for given $K_{1}$ GT boxes, we build a box augmentation method to enhance the diversity and amount of GT instances, enriching generalizable features in instance comparison. Given a box $b=\left[x_{0},y_{0},w,h\right]$, we randomly augment GT boxes as,

$$B=\left[x_{0}\pm\frac{w}{k_{0}},\ y_{0}\pm\frac{h}{k_{0}},\ x_{0}+w\pm\frac{w}{k_{0}},\ y_{0}+h\pm\frac{h}{k_{0}}\right], \tag{1}$$

obtaining $K_{2}$ augmented boxes. By performing similar operations through the shared feature extractor $E_{1}$, we get augmented GT features $J$ with the size of $[(K_{1}+K_{2}) \times 256 \times 7 \times 7]$. The next step involves supervised contrastive learning with positive ($I^{+}$) and negative ($I^{-}$) sample selection, where positive pairs come from the same class and negative pairs from different classes, and the GTs serve as the current query batch $J$.

$$\mathcal{L}_{roi\_com}=-\sum_{j\in J}\frac{1}{|J|}\sum_{i^{+}\in I^{+}}\log\frac{\exp\left(\frac{Sim(z_{j},z_{i^{+}})}{\tau_{roi}}\right)}{\sum_{i\in I}\exp\left(\frac{Sim(z_{j},z_{i})}{\tau_{roi}}\right)}, \tag{2}$$

where $z_{i^{+}}$ represents the embedding of a positive sample for the current query sample $z_{j}$. $Sim(z_{i},z_{j})=z_{i}\cdot z_{j}/(\left\|z_{i}\right\|\left\|z_{j}\right\|)$ is the cosine similarity between two samples. $\tau_{roi}$ is a tunable temperature hyper-parameter. For the denominator, $I=I^{+}+I^{-}$ denotes the combined set of positive and negative samples.
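As an illustration of Eqs. (1) and (2), the following is a minimal pure-Python sketch rather than the released implementation. It assumes embeddings are already flattened into plain vectors, that the sign of each "±" in Eq. (1) is drawn independently per coordinate, and that $k_{0}=10$; the value of $k_{0}$ and the helper names `augment_box`, `cosine_sim`, and `roi_comparison_loss` are hypothetical and not taken from the paper.

```python
import math
import random


def augment_box(box, k0=10.0, rng=random):
    """Jitter one GT box [x0, y0, w, h] following Eq. (1).

    Returns corner form [x1, y1, x2, y2]. k0 is an assumed value.
    """
    x0, y0, w, h = box
    dx, dy = w / k0, h / k0
    sign = lambda: rng.choice((-1.0, 1.0))  # the "±", drawn per coordinate
    return [x0 + sign() * dx, y0 + sign() * dy,
            x0 + w + sign() * dx, y0 + h + sign() * dy]


def cosine_sim(a, b):
    """Sim(a, b) = a . b / (||a|| ||b||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def roi_comparison_loss(queries, query_labels, keys, key_labels, tau):
    """Eq. (2): queries are GT embeddings z_j (set J), keys are RoI embeddings z_i (set I)."""
    total = 0.0
    for z_j, c_j in zip(queries, query_labels):
        logits = [cosine_sim(z_j, z_i) / tau for z_i in keys]
        log_denominator = math.log(sum(math.exp(l) for l in logits))
        for logit, c_i in zip(logits, key_labels):
            if c_i == c_j:  # positive pair: key shares the query's class
                total -= logit - log_denominator
    return total / len(queries)  # the 1/|J| factor
```

With a single query, the loss is small when the same-class RoI embedding points in the same direction as the query and large when it points toward a different-class embedding, which is the behaviour the RIC relies on.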

Class-level Instance Comparison (CIC). To further improve class distinguishability, we design a class-level instance comparison module, which is located between the shared head and classification head to learn explicit instance class discrepancies. Specifically, we utilize historical class instance embeddings (detailed in Sec. II-C) and current batch class instance embeddings for instance comparison. First, RoI features are fed into the shared head (two fully connected layers) to obtain class features $I^{\prime}$ (with the size $[K_{0}\times 1{,}024]$). Then, we conduct contrastive learning in a similar manner. The current class embeddings $s_{i}\in I^{\prime}$ serve as query samples with a batch size of $|I^{\prime}|$, and the historical class embeddings $s_{m}\in M=\{M^{+},M^{-}\}$ include positive ($s_{m^{+}}$) and negative ($s_{m^{-}}$) samples. Thus, the instance class comparison loss function is formulated as,

$$\mathcal{L}_{cls\_com}=-\sum_{i\in I^{\prime}}\frac{1}{|I^{\prime}|}\sum_{m^{+}\in M^{+}}\log\frac{\exp\left(\frac{Sim(s_{i},s_{m^{+}})}{\tau_{cls}}\right)}{\sum_{m\in M}\exp\left(\frac{Sim(s_{i},s_{m})}{\tau_{cls}}\right)}. \tag{3}$$

Figure 3: Qualitative results of our methods and other SOTA methods. (a) FRCNN, (b) Cascade R-CNN, (c) Grid R-CNN, (d) Sparse R-CNN, (e) RepPoints, (f) CornerNet, (g) YOLOv3, (h) RetinaNet, (i) FCOS, (j) DETR, (k) LOCE, our method with R50 (l) and R101 (m), and (n) GT.

TABLE I: Quantitative detection results of our method and other SOTA methods with overall performance (AP50, AP75, AP, AR) and per-class performance. Red and blue are first and second best results.
| Methods | AP50 ↑ | AP75 ↑ | AP ↑ | AR ↑ | ASC-US | LSIL | ASC-H | HSIL | SCC | AGC | AGC-N |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FRCNN (R50) [16] | 21.4 | 13.7 | 12.6 | 37.1 | 21.3 | 56.7 | 16.0 | 16.6 | 12.0 | 26.9 | 0.0 |
| FRCNN (R101) [16] | 25.5 | 18.0 | 16.6 | 41.0 | 32.2 | 60.8 | 21.0 | 18.7 | 12.8 | 33.1 | 0.0 |
| Cascade R-CNN [1] | 32.2 | 24.2 | 21.0 | 46.5 | 34.9 | 66.4 | 34.5 | 23.2 | 32.7 | 33.8 | 0.0 |
| Grid R-CNN [13] | 32.9 | 26.1 | 23.1 | 59.6 | 32.7 | 61.3 | 31.9 | 21.1 | 29.8 | 33.3 | 19.9 |
| Sparse R-CNN [17] | 20.6 | 12.3 | 12.1 | 41.0 | 25.3 | 53.0 | 21.3 | 14.6 | 2.1 | 26.2 | 1.4 |
| RepPoints [21] | 30.9 | 23.8 | 20.9 | 56.5 | 34.7 | 68.6 | 23.7 | 16.2 | 26.3 | 32.5 | 14.2 |
| CornerNet [8] | 12.5 | 10.2 | 8.5 | 42.9 | 6.5 | 39.6 | 9.4 | 11.7 | 0.0 | 18.3 | 1.8 |
| YOLOv3 [15] | 23.2 | 15.4 | 13.5 | 38.6 | 37.5 | 53.8 | 14.6 | 18.0 | 0.6 | 30.6 | 7.1 |
| RetinaNet [11] | 28.1 | 21.4 | 18.3 | 52.9 | 38.1 | 61.3 | 18.6 | 20.6 | 5.3 | 36.9 | 15.7 |
| FCOS [19] | 25.5 | 16.6 | 15.5 | 50.8 | 26.2 | 48.8 | 19.4 | 15.8 | 28.3 | 29.0 | 10.8 |
| DETR [2] | 29.1 | 14.8 | 15.4 | 43.6 | 40.0 | 61.6 | 26.3 | 16.2 | 22.2 | 29.4 | 8.2 |
| LOCE [4] | 28.2 | 20.4 | 18.0 | 59.1 | 27.6 | 55.2 | 22.5 | 19.4 | 24.5 | 32.1 | 16.1 |
| Ours (R50) | 35.2 | 28.3 | 23.6 | 46.9 | 47.4 | 73.3 | 35.4 | 27.2 | 21.8 | 34.1 | 6.9 |
| Ours (R101) | 36.8 | 28.4 | 23.8 | 47.5 | 52.6 | 78.3 | 34.1 | 25.2 | 30.0 | 34.8 | 3.0 |

II-C Historical Instance Comparison

Although comparing current batch instances in CIC often leads to learning class discrimination to address the class ambiguity issue, it mainly encourages class discrepancies among majority class instances. This is because minority classes (e.g., SCC, AGC-N) have very low frequencies, compared to the overwhelming majority classes (e.g., LSIL, ASC-US). Therefore, we propose a confident sample selection-based memory bank, which not only increases the frequency of minority class instances to avoid class biased learning, but also increases the number of class instances for each batch training, improving the model’s generalizability.

Specifically, we use a feature memory bank to store class instance features in the current batch, and reuse these features during the following training. The memory bank has a size of $[C \times Q]$, where $C$ is the number of instance categories ($C=7,15$ in experiments) and $Q$ is the queue size per class. The memory bank $M_{cls}$ is defined as follows,

$$M_{cls}=\left[Ins_{1}^{1},\ldots,Ins_{c}^{q},\ldots,Ins_{C}^{Q}\right], \tag{4}$$

where $Ins_{c}^{q}$ represents the $q$-th class instance feature of category $c$ in memory bank $M_{cls}$.

The memory bank is dynamically updated using a queue-based scheme. Confident sample selection ensures the quality of queued instances in $M_{cls}$ and thereby enhances instance comparison: a sample is selected if $l_{c}\geq\tau_{c}$. That is, the predicted class score $l_{c}$ of the current sample is compared with the corresponding class confidence score $\tau_{c}$; confident samples are selected and added to the queue to update the memory bank $M_{cls}$. To enhance the stability of the memory bank and leverage the benefit of the confidence selection strategy, we first warm up by training the baseline model, then enable our instance comparison after a few epochs.
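The queue-based update with confident sample selection can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the class name `ClassMemoryBank`, its methods, and the use of one fixed threshold per class are hypothetical, and embeddings are represented as plain Python lists.

```python
from collections import deque


class ClassMemoryBank:
    """Per-class FIFO queues of confident instance embeddings (C queues of size Q)."""

    def __init__(self, num_classes, queue_size, thresholds):
        # One bounded queue per class, so every class keeps up to Q entries
        # regardless of how often it appears in a batch.
        self.queues = [deque(maxlen=queue_size) for _ in range(num_classes)]
        self.thresholds = thresholds  # tau_c, assumed fixed per class

    def update(self, embeddings, labels, scores):
        """Enqueue only samples whose class score satisfies l_c >= tau_c."""
        for emb, c, l_c in zip(embeddings, labels, scores):
            if l_c >= self.thresholds[c]:
                # A full deque drops its oldest entry automatically.
                self.queues[c].append(emb)

    def split(self, query_label):
        """Return (M+, M-) for a query of class query_label."""
        positives = list(self.queues[query_label])
        negatives = [emb
                     for c, queue in enumerate(self.queues) if c != query_label
                     for emb in queue]
        return positives, negatives
```

Because each class owns its own bounded queue, a rare class such as AGC-N can retain as many historical embeddings as a frequent class once enough confident samples have been seen, which is the rebalancing effect described above.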

II-D Overall Loss Function and Optimization

The framework can be trained with the overall objective $\mathcal{L}$,

$$\mathcal{L}=\lambda_{roi\_com}\mathcal{L}_{roi\_com}+\lambda_{cls\_com}\mathcal{L}_{cls\_com}+\mathcal{L}_{base}, \tag{5}$$

where $\mathcal{L}_{roi\_com}$ denotes RoI comparison loss, $\mathcal{L}_{cls\_com}$ denotes class comparison loss. $\mathcal{L}_{base}$ supervises the baseline training, including RPN loss $\mathcal{L}_{RPN}$, regression loss $\mathcal{L}_{reg}$, and classification loss $\mathcal{L}_{cls}$ following previous settings [16]. $\lambda_{roi\_com}$ and $\lambda_{cls\_com}$ are trade-off controlling parameters, set as $\{1,0.1\}$ in experiments.
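The objective of Eq. (5), combined with the warm-up described in Sec. II-C, can be sketched as below. This is an illustrative assumption-laden example: the function name `total_loss` and the value `warmup_epochs=2` are hypothetical (the paper only states "a few epochs"), and the weights $\{1, 0.1\}$ are read as $\lambda_{roi\_com}=1$ and $\lambda_{cls\_com}=0.1$ in the order listed.

```python
def total_loss(loss_base, loss_roi_com, loss_cls_com, epoch,
               lambda_roi_com=1.0, lambda_cls_com=0.1, warmup_epochs=2):
    """Combine the three loss terms of Eq. (5).

    During the assumed warm-up epochs only the baseline detection loss is used;
    afterwards the two comparison losses are added with their weights.
    """
    if epoch < warmup_epochs:
        return loss_base
    return (lambda_roi_com * loss_roi_com
            + lambda_cls_com * loss_cls_com
            + loss_base)
```

For example, with `loss_base=1.0`, `loss_roi_com=2.0`, and `loss_cls_com=3.0`, the function returns the baseline loss alone at epoch 0 and the weighted sum of all three terms once the warm-up is over.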

III Experiments and Results

III-A Dataset and Experiments Settings

TABLE II: Effectiveness of proposed Instance Comparison strategy. Red and blue are first and second best results.
| Base | RIC | Aug | CIC | AP50 ↑ | AP75 ↑ | AP ↑ | AR ↑ | ASC-US | LSIL | ASC-H | HSIL | SCC | AGC | AGC-N |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ✓ | | | | 21.4 | 13.7 | 12.6 | 37.1 | 21.3 | 56.7 | 16.0 | 16.6 | 12.0 | 26.9 | 0.0 |
| ✓ | ✓ | | | 32.7 | 25.2 | 21.5 | 44.5 | 42.1 | 74.5 | 32.5 | 24.6 | 19.3 | 33.1 | 3.0 |
| ✓ | ✓ | ✓ | | 34.5 | 27.4 | 23.2 | 47.2 | 46.3 | 74.7 | 35.2 | 26.4 | 21.4 | 34.7 | 3.0 |
| ✓ | ✓ | ✓ | ✓ | 35.2 | 28.3 | 23.6 | 46.9 | 47.4 | 73.3 | 35.4 | 27.2 | 21.8 | 34.1 | 6.9 |

Dataset. To validate the effectiveness of the proposed method, we build two cervical cytology datasets, called CC-L and CC-S, containing 39,006 and 17,301 images ($1{,}200\times 1{,}200$), respectively, cropped from cytology whole slides. Based on the Bethesda system [14], the CC-L dataset contains 114,513 box annotations. As shown in Tab. III, these instance annotations exhibit imbalanced distributions across 15 categories. To our knowledge, this is the largest cervical cytology dataset. In addition, we filter out microbial categories to focus on specific intraepithelial lesions, leaving 42,592 instances, and split the dataset into training, validation, and test sets for experiments.

TABLE III: The details of instance annotations and class distribution in CC-S and CC-L datasets.
| Dataset | Category | Instances | Category | Instances | Category | Instances |
|---|---|---|---|---|---|---|
| CC-S | ASCUS | 6,631 | LSIL | 9,273 | ASC-H | 4,305 |
| | HSIL | 17,938 | SCC | 1,394 | AGC-N | 276 |
| | AGC | 3,111 | | | | |
| CC-L | ASCUS | 22,056 | LSIL | 26,428 | ASC-H | 8,802 |
| | HSIL | 28,110 | SCC | 1,756 | AGC-N | 256 |
| | AGC | 6,533 | ACTI | 395 | EMC | 9,950 |
| | CC | 2,510 | Atrophy | 453 | FUNGI | 1,023 |
| | Metapla | 3,574 | TV | 2,274 | AIS | 402 |

Evaluation Metrics. For quantitative evaluation, we adopt commonly-used average precision (AP) and average recall (AR) across all classes over bounding box IoU thresholds ranging from 0.5 to 0.95, and the individual AP under the IoU threshold of 0.5 and 0.75, denoted by AP50 and AP75. We also report AP for each class under the IoU threshold of 0.5 to show the detection performance for each class.
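For reference, the IoU test that underlies these metrics can be sketched as follows. The sketch assumes the COCO-style convention of ten thresholds spaced 0.05 apart between 0.5 and 0.95 (the step size is an assumption, since only the range is stated above); the helper names `box_iou` and `matched_thresholds` are hypothetical.

```python
def box_iou(a, b):
    """IoU of two boxes given in corner form [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Assumed COCO-style thresholds: 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def matched_thresholds(pred, gt):
    """Thresholds at which a prediction counts as a true positive for gt."""
    iou = box_iou(pred, gt)
    return [t for t in IOU_THRESHOLDS if iou >= t]
```

AP50 and AP75 correspond to evaluating at the single thresholds 0.5 and 0.75, whereas AP and AR average over the whole list, so a loosely localized box contributes to AP50 but not to the stricter thresholds.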

Implementation Details. We utilize Faster R-CNN (FRCNN) [16] as the baseline model. We use ResNet-50-based and ResNet-101-based FPN in all experiments. For training, we utilize SGD with 0.9 momentum as the optimizer and set the initial learning rate to 0.005. To ensure convergence, we train the network for 24 epochs, reducing the learning rate by a factor of 0.1 after 8 and 14 epochs. All experiments were conducted using 24GB NVIDIA GeForce RTX 3090 GPUs.
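The step schedule described above can be written out as a short sketch. It assumes zero-indexed epochs and that "after 8 and 14 epochs" means the drop takes effect from epochs 8 and 14 onward; the function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, base_lr=0.005, milestones=(8, 14), gamma=0.1):
    """Step schedule: multiply the rate by gamma for every milestone already reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops
```

Under these assumptions, epochs 0 to 7 run at 0.005, epochs 8 to 13 at 0.0005, and epochs 14 to 23 at 0.00005.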

III-B Comparison with SOTA methods

Qualitative Results. We compare detection results on the CC-S dataset with 11 state-of-the-art detectors, including R-CNN-based models (FRCNN [16], Cascade R-CNN [1], Grid R-CNN [13], Sparse R-CNN [17]), point-based models (RepPoints [21], CornerNet [8]), one-stage models (YOLOv3 [15], RetinaNet [11], FCOS [19]), a transformer-based model (DETR [2]) and a long-tailed learning model (LOCE [4]). As shown in Fig. 3, our method outperforms the others in several aspects. First, several advanced detectors such as Sparse R-CNN, YOLOv3, and LOCE miss abnormal LSIL cells in Fig. 3(d)(g)(k), which may result in low sensitivity in clinical cervical screening. Our model, equipped with inter-class instance relationship modeling (RIC and CIC), avoids these missed detections. Second, our models achieve higher IoU detection scores than methods such as Cascade R-CNN (Fig. 3(b)) and RepPoints (Fig. 3(e)).

Quantitative Results. Tab. I provides a comprehensive quantitative comparison, which shows that our holistic and historical comparison method achieves the best performance in AP50, AP75, and AP, with significant AP50 improvements of 13.8% and 11.3% over the baseline [16] with R50 and R101, respectively. Our method effectively addresses class ambiguity, as shown by improved detection accuracy for ambiguous classes, notably a gain of over 26% in ASC-US. It also helps address class imbalance, with increased AP50 for minority classes, i.e., 9.8% for SCC and 6.9% for AGC-N. Our method outperforms two-stage and one-stage detectors, with a 2.3% gain over the strongest two-stage method, Grid R-CNN [13], and a 4.3% improvement over the best one-stage detector, RepPoints [21]. Transformer-based models (e.g., DETR) achieve promising detection results, but they are limited by issues such as tiny targets and noise. Finally, a significant performance drop in majority classes leads to an overall performance decrease in long-tailed learning, i.e., LOCE [4], highlighting the effectiveness of our method in balancing the performance of different classes.

III-C Ablation Studies and Analysis

TABLE IV: Ablation studies for parameters, temperature $\tau$ and memory bank size Q. Red denotes the best results.
| Parameter | AP50 ↑ | AP75 ↑ | AP ↑ | AR ↑ |
|---|---|---|---|---|
| $\tau$ = 4 | 27.7 | 19.4 | 16.8 | 39.8 |
| $\tau$ = 6 | 35.2 | 28.3 | 23.6 | 46.9 |
| $\tau$ = 8 | 27.5 | 17.7 | 16.5 | 40.7 |
| $\tau$ = 10 | 25.6 | 17.0 | 15.2 | 40.5 |
| Q = 16 | 27.6 | 18.0 | 16.3 | 41.0 |
| Q = 80 | 35.2 | 28.3 | 23.6 | 46.9 |
| Q = 160 | 28.0 | 21.6 | 18.1 | 42.9 |

TABLE V: Further exploration for class imbalanced learning with focal [11] and seesaw loss [20]. Red denotes the best results.
| Base | Loss | Ours | AP50 ↑ | AP75 ↑ | AP ↑ | AR ↑ | ASC-US | LSIL | ASC-H | HSIL | SCC | AGC | AGC-N |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ✓ | | | 21.4 | 13.7 | 12.6 | 37.1 | 21.3 | 56.7 | 16.0 | 16.6 | 12.0 | 26.9 | 0.0 |
| ✓ | focal [11] | | 18.5 | 13.9 | 11.9 | 46.8 | 17.3 | 53.4 | 15.3 | 14.0 | 1.3 | 25.7 | 2.6 |
| ✓ | focal [11] | ✓ | 29.7 | 21.5 | 18.7 | 52.0 | 34.1 | 66.0 | 31.8 | 20.8 | 13.7 | 29.4 | 11.7 |
| ✓ | seesaw [20] | | 33.5 | 22.9 | 20.9 | 54.2 | 29.9 | 63.9 | 28.9 | 22.7 | 36.5 | 34.8 | 17.6 |
| ✓ | seesaw [20] | ✓ | 40.2 | 30.7 | 26.3 | 59.1 | 51.5 | 61.1 | 45.7 | 21.7 | 36.7 | 38.5 | 26.0 |

TABLE VI: Effectiveness of proposed method in the CC-L dataset. Red denotes the performance changes compared to FRCNN.
| Method | ASCUS | LSIL | ASCH | HSIL | SCC | AGC | AGCN | ACTI | EMC | CC | Atrophy | FUNGI | Metapla | TV | AIS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | 28.2 | 23.6 | 25.5 | 3.3 | 1.5 | 23.8 | 0.0 | 73.7 | 39.5 | 44.3 | 18.1 | 10.5 | 4.6 | 22.2 | 0.0 |
| Ours | 31.1 (+2.9) | 39.5 (+15.9) | 31.0 (+5.5) | 6.8 (+3.5) | 13.5 (+12.0) | 22.0 (-1.8) | 1.3 (+1.3) | 86.3 (+12.6) | 47.6 (+8.1) | 39.3 (-5.0) | 33.6 (+15.5) | 19.6 (+9.1) | 18.9 (+14.3) | 24.4 (+2.2) | 14.6 (+14.6) |

Effectiveness of Instance Comparison. We perform ablation studies to investigate the effect of each proposed design. The comparison results are reported in Tab. II, with the baseline (denoted as Base) and three components: RoI-level Instance Comparison (RIC), Box Augmentation (Aug), and Class-level Instance Comparison (CIC). Specifically, by adding RIC, we observe clear performance improvements of 11.3% and 8.9% in AP50 and mean AP, respectively. At the class level, we see improvements in both majority classes (17.8% for LSIL) and minority classes (7.3% for SCC, 6.2% for AGC, 3.0% for AGC-N). Equipped with box augmentation, which increases the number and diversity of deterministic instances, the model gains further improvements of 1.8% AP50 and 1.7% AP. Class-level comparison aims to improve comparison capability by increasing the frequency of minority classes, which yields an improvement of 0.7% AP50 and notably 3.9% for the minority class AGC-N.

Temperature Coefficient $\tau$. The temperature coefficient $\tau$ is a crucial parameter for penalizing hard negative samples more effectively, thereby directing the model's updates. To choose an appropriate value, we conduct a series of ablations with $\tau\in\{4,6,8,10\}$. As shown in Tab. IV, $\tau=6$ achieves the highest detection metrics, with 35.2% AP50 and 23.6% AP, and is used in the following experiments.
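To illustrate how a temperature shapes the contrastive softmax used in Eqs. (2) and (3), the sketch below computes the probability that the softmax assigns to one positive pair among several negatives. The helper name `positive_probability` and the similarity values in the usage note are hypothetical and chosen only for illustration; they are not taken from the experiments.

```python
import math


def positive_probability(sim_pos, sim_negs, tau):
    """Softmax probability of the positive pair given similarities and temperature tau."""
    logits = [sim_pos / tau] + [s / tau for s in sim_negs]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    return exps[0] / sum(exps)
```

For a fixed set of similarities in which the positive is the most similar key, for instance `positive_probability(0.9, [0.1, 0.0], tau)`, a smaller `tau` concentrates the probability on the positive, while a larger `tau` flattens the distribution toward uniform. This is the general mechanism through which the temperature changes how strongly negatives are weighted.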

Memory Bank Size Q. We evaluate memory banks of different sizes; Tab. IV shows the corresponding results for cervical cell detection. The best performance is achieved at Q=80, with an overall improvement exceeding 7.6% AP50. A queue that is too small (Q=16) does not noticeably benefit instance comparison learning, while a queue that is too large (Q=180) may leave the queues of minority classes empty while those of majority classes are already full in early iterations.
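The trade-off in Q can be illustrated with a minimal sketch of a class-wise memory bank, assuming one fixed-length first-in-first-out queue per class and a simple confidence threshold for sample selection. The class name, the `min_confidence` parameter, and its value are hypothetical; the selection criterion actually used is described in the method section.

```python
from collections import deque


class ClassMemoryBank:
    """One FIFO queue of at most `queue_size` embeddings per class."""

    def __init__(self, num_classes, queue_size, min_confidence):
        self.min_confidence = min_confidence  # hypothetical selection threshold
        self.queues = [deque(maxlen=queue_size) for _ in range(num_classes)]

    def enqueue(self, embedding, label, confidence):
        """Store an embedding only if its prediction is confident enough.
        When the queue is full, the oldest entry is evicted automatically."""
        if confidence < self.min_confidence:
            return False
        self.queues[label].append(list(embedding))
        return True

    def get(self, label):
        """Historical embeddings of one class, oldest first."""
        return list(self.queues[label])

    def fill_ratio(self, label):
        """Fraction of the queue that is currently occupied."""
        queue = self.queues[label]
        return len(queue) / queue.maxlen


if __name__ == "__main__":
    bank = ClassMemoryBank(num_classes=2, queue_size=2, min_confidence=0.7)
    # Class 0 plays the role of a majority class, class 1 a minority class.
    bank.enqueue([0.1, 0.2], label=0, confidence=0.9)
    bank.enqueue([0.3, 0.4], label=0, confidence=0.8)
    bank.enqueue([0.5, 0.6], label=0, confidence=0.95)  # evicts the oldest
    bank.enqueue([0.7, 0.8], label=1, confidence=0.4)   # rejected: low confidence
    print(bank.fill_ratio(0), bank.fill_ratio(1))
```

With a large `queue_size`, the fill ratio of rarely seen classes stays near zero for many iterations, which is the early-iteration imbalance described above.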

Equipped with rebalancing loss. As a plug-and-play learning approach, we further explore our method by applying existing class re-balancing loss functions to the classification head. We replace the cross-entropy loss in vanilla FRCNN [16] with Focal loss [11] and Seesaw loss [20]; the results are shown in Tab. V. On their own, these losses achieve noticeable improvements in minority classes while sacrificing performance in majority classes. When equipped with the proposed instance comparison approach, both losses achieve consistent improvements, namely 11.2% AP50 for Focal loss and 6.7% AP50 for Seesaw loss.
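As a reminder of what such a replacement changes, the sketch below contrasts cross-entropy with a focal-style loss on a single sample. It assumes a softmax multi-class variant without the class-balancing weight, whereas the original Focal loss [11] is formulated with per-class sigmoid outputs; the logits are made up for illustration and the detector's actual classification head is not reproduced here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Standard cross-entropy: -log p_t."""
    return -math.log(softmax(logits)[target])


def focal_loss(logits, target, gamma=2.0):
    """Focal-style loss: -(1 - p_t)^gamma * log p_t.
    gamma = 0 recovers cross-entropy; larger gamma down-weights
    samples that are already classified with high confidence."""
    p_t = softmax(logits)[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


if __name__ == "__main__":
    easy = ([5.0, 0.0, 0.0], 0)  # confident, correct prediction
    hard = ([0.0, 2.0, 0.0], 0)  # true class receives low probability
    for name, (logits, target) in (("easy", easy), ("hard", hard)):
        ce = cross_entropy(logits, target)
        fl = focal_loss(logits, target)
        print(f"{name}: CE={ce:.4f}, focal={fl:.4f}, ratio={fl / ce:.4f}")
```

The modulating factor leaves hard samples almost untouched and suppresses easy ones, which is one way such losses shift emphasis toward under-represented classes.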

Experiments on the large-scale cytology dataset. Moreover, we conduct experiments on CC-L to demonstrate the effectiveness of our approach on a larger and more complex dataset. As shown in Tab. VI, we achieve 28.6% AP50 and 16.6% AP, with significant performance improvements in almost all categories. Specifically, our method is effective in addressing class ambiguity, as seen in the typically ambiguous categories ASC-US (+2.9%) and ASC-H (+5.5%). Moreover, the gains are most pronounced in minority categories, e.g., from 0.0% to 14.6% for AIS and from 1.5% to 13.5% for SCC.
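The per-class changes quoted from Tab. VI can be recomputed directly from the two rows of the table. The short script below does so using only the values printed in Tab. VI; the variable names are our own. The mean of the "Ours" row comes out at about 28.6, which is consistent with the per-class entries being AP50 values, although the caption does not state the metric explicitly.

```python
# Values copied from Tab. VI (CC-L dataset).
classes = ["ASCUS", "LSIL", "ASCH", "HSIL", "SCC", "AGC", "AGCN", "ACTI",
           "EMC", "CC", "Atrophy", "FUNGI", "Metapla", "TV", "AIS"]
baseline = [28.2, 23.6, 25.5, 3.3, 1.5, 23.8, 0.0, 73.7,
            39.5, 44.3, 18.1, 10.5, 4.6, 22.2, 0.0]
ours = [31.1, 39.5, 31.0, 6.8, 13.5, 22.0, 1.3, 86.3,
        47.6, 39.3, 33.6, 19.6, 18.9, 24.4, 14.6]

# Per-class change of "Ours" relative to the baseline, in percentage points.
deltas = {c: round(o - b, 1) for c, b, o in zip(classes, baseline, ours)}

improved = [c for c, d in deltas.items() if d > 0]
degraded = [c for c, d in deltas.items() if d < 0]

mean_baseline = sum(baseline) / len(baseline)
mean_ours = sum(ours) / len(ours)

if __name__ == "__main__":
    for c in classes:
        print(f"{c:>8}: {deltas[c]:+.1f}")
    print(f"improved in {len(improved)} of {len(classes)} classes; "
          f"degraded in: {', '.join(degraded)}")
    print(f"mean baseline = {mean_baseline:.2f}, mean ours = {mean_ours:.2f}")
```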

IV Conclusion

In this work, we investigated the intrinsic issues of class ambiguity and imbalance in cervical cell detection. To jointly address these issues, we proposed a novel instance comparison approach with holistic and historical cell comparison at both the RoI level and class level, together with a confident sample selection-based memory bank that compensates for the limited contribution of minority class instances. Our experiments on two large-scale cervical cytology datasets demonstrate the effectiveness of our approach. Further improving the performance on minority classes, e.g., AGC-N, remains to be explored.

V Acknowledgments

This work was supported by the National Natural Science Foundation of China (No. 62202403), Hong Kong Innovation and Technology Fund (Project No. PRP/034/22FX), Shenzhen Science and Technology Innovation Committee Fund (Project No. KCXFZ20230731094059008) and the Project of Hetao Shenzhen-Hong Kong Science and Technology Innovation Cooperation Zone (HZQB-KCZYB-2020083).

References

  • [1] Z. Cai and N. Vasconcelos, “Cascade r-cnn: Delving into high quality object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 6154–6162.
  • [2] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in European conference on computer vision.   Springer, 2020, pp. 213–229.
  • [3] Z. Chai, L. Luo, H. Lin, H. Chen, A. Han, and P.-A. Heng, “Deep semi-supervised metric learning with dual alignment for cervical cancer cell detection,” in 2022 IEEE 19th International Symposium on Biomedical Imaging (ISBI).   IEEE, 2022, pp. 1–5.
  • [4] C. Feng, Y. Zhong, and W. Huang, “Exploring classification equilibrium in long-tailed object detection,” in Proceedings of the IEEE/CVF International conference on computer vision, 2021, pp. 3417–3426.
  • [5] H. Jiang, S. Li, W. Liu, H. Zheng, J. Liu, and Y. Zhang, “Geometry-aware cell detection with deep learning,” Msystems, vol. 5, no. 1, pp. 10–1128, 2020.
  • [6] H. Jiang, R. Zhang, Y. Zhou, Y. Wang, and H. Chen, “Donet: Deep de-overlapping network for cytology instance segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 15641–15650.
  • [7] H. Jiang, Y. Zhou, Y. Lin, R. C. Chan, J. Liu, and H. Chen, “Deep learning for computational cytology: A survey,” Medical Image Analysis, p. 102691, 2022.
  • [8] H. Law and J. Deng, “Cornernet: Detecting objects as paired keypoints,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 734–750.
  • [9] Y. Liang, S. Feng, Q. Liu, H. Kuang, J. Liu, L. Liao, Y. Du, and J. Wang, “Exploring contextual relationships for cervical abnormal cell detection,” arXiv preprint arXiv:2207.04693, 2022.
  • [10] H. Lin, H. Chen, X. Wang, Q. Wang, L. Wang, and P.-A. Heng, “Dual-path network with synergistic grouping loss and evidence driven risk stratification for whole slide cervical image analysis,” Medical Image Analysis, vol. 69, p. 101955, 2021.
  • [11] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988.
  • [12] M. Liu, X. Li, X. Gao, J. Chen, L. Shen, and H. Wu, “Sample hardness based gradient loss for long-tailed cervical cell detection,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2022: 25th International Conference, Singapore, September 18–22, 2022, Proceedings, Part II.   Springer, 2022, pp. 109–119.
  • [13] X. Lu, B. Li, Y. Yue, Q. Li, and J. Yan, “Grid r-cnn,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 7363–7372.
  • [14] R. Nayar and D. C. Wilbur, The Bethesda system for reporting cervical cytology: definitions, criteria, and explanatory notes.   Springer, 2015.
  • [15] J. Redmon and A. Farhadi, “Yolov3: An incremental improvement,” arXiv preprint arXiv:1804.02767, 2018.
  • [16] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” Advances in neural information processing systems, vol. 28, 2015.
  • [17] P. Sun, R. Zhang, Y. Jiang, T. Kong, C. Xu, W. Zhan, M. Tomizuka, L. Li, Z. Yuan, C. Wang et al., “Sparse r-cnn: End-to-end object detection with learnable proposals,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 14454–14463.
  • [18] H. Sung, J. Ferlay, R. L. Siegel, M. Laversanne, I. Soerjomataram, A. Jemal, and F. Bray, “Global cancer statistics 2020: Globocan estimates of incidence and mortality worldwide for 36 cancers in 185 countries,” CA: a cancer journal for clinicians, vol. 71, no. 3, pp. 209–249, 2021.
  • [19] Z. Tian, C. Shen, H. Chen, and T. He, “Fcos: Fully convolutional one-stage object detection,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 9627–9636.
  • [20] J. Wang, W. Zhang, Y. Zang, Y. Cao, J. Pang, T. Gong, K. Chen, Z. Liu, C. C. Loy, and D. Lin, “Seesaw loss for long-tailed instance segmentation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 9695–9704.
  • [21] Z. Yang, S. Liu, H. Hu, L. Wang, and S. Lin, “Reppoints: Point set representation for object detection,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 9657–9666.