Holistic and Historical Instance Comparison for Cervical Cell Detection

Hao Jiang¹, Runsheng Liu¹, Yanning Zhou², Huangjing Lin³ and Hao Chen¹,⁴,⁵,✉
✉ Corresponding author: Hao Chen. Email: jhc@cse.ust.hk
¹ Department of Computer Science and Engineering,
The Hong Kong University of Science and Technology, Hong Kong, China
² Tencent AI Lab, Shenzhen, China
³ Imsight AI Research Lab, Shenzhen, China
⁴ Department of Chemical and Biological Engineering,
The Hong Kong University of Science and Technology, Hong Kong, China
⁵ HKUST Shenzhen-Hong Kong Collaborative Innovation Research Institute, Futian, Shenzhen, China

Abstract

Cytology screening from Papanicolaou (Pap) smears is a common and effective tool for the preventive clinical management of cervical cancer, where abnormal cell detection from whole slide images serves as the foundation for reporting cervical cytology. However, cervical cell detection remains challenging due to 1) cell class ambiguity, i.e., hazily-defined cell types (e.g., ASC-US) with subtle morphological discrepancies caused by the dynamic cancerization process, and 2) cell class imbalance, i.e., imbalanced class distributions in clinical data that may cause missed detections, especially for minority categories. To this end, we propose a holistic and historical instance comparison approach for cervical cell detection. Specifically, we first develop a holistic instance comparison scheme enforcing both RoI-level and class-level cell discrimination. This coarse-to-fine cell comparison encourages the model to learn foreground-distinguishable and class-wise representations. To further improve the distinguishability of minority classes, we then introduce a historical instance comparison scheme with a confident sample selection-based memory bank, which compares current embeddings with historical embeddings for better cell instance discrimination. Extensive experiments and analysis on two large-scale cytology datasets including 42,592 and 114,513 cervical cells demonstrate the effectiveness of our method. The code is available at https://github.com/hjiangaz/HERO.

Index Terms:

Cytology Detection, Class Ambiguity, Class Imbalance, Instance Comparison, Contrastive Learning

I Introduction

Cervical cancer is one of the leading causes of cancer-related deaths, with approximately 604,127 confirmed cases and 341,831 deaths reported worldwide in 2020 [18]. Cytology screening, which identifies abnormal cells through the examination of thousands of cells under a microscope, is the primary approach for precancerous screening from Pap smears or liquid-based cytology specimens. However, a typical cytology test requires an experienced cytologist to spend 5-10 minutes analyzing cytological characteristics under a microscope to identify abnormal cells [10]. Computational cytology has made significant progress in accelerating this screening process [7]. Cell detection is usually regarded as the prerequisite step for identifying suspicious cells throughout the entire process [5, 6, 10].

Figure 1: Illustration of cervical cell detection. (a) Detected abnormal cells denoted by blue boxes; (b) Cell class imbalance: imbalanced distribution of cell instances across classes; (c) Cell class ambiguity: cells with varied appearances and morphologies.

Modern object detectors perform well in most natural scenes but struggle with cervical cell detection. Human Papillomavirus (HPV) invasion and infection processes are continuous and dynamic, with epithelial cells gradually progressing from low-grade lesions to cell carcinoma [14]. Therefore, the morphological discrepancies between categories in adjacent stages (e.g., HSIL and SCC) are subtle and hard to discern, resulting in class ambiguity (Fig. 1). Furthermore, cervical cytology data are inherently class-imbalanced because screened positive candidates are mainly distributed in the early stages of cancerization, which may lead to missed detections in categories with fewer samples, e.g., SCC and AGC-N.

Several previous works have addressed cervical cell detection. For example, Chai et al. [3] introduced a semi-supervised learning method to leverage unlabelled data for robust detection. Inspired by the clinical practice of using surrounding cells as references, Liang et al. [9] utilized two attention modules to explore contextual information. More recently, Liu et al. [12] observed the class imbalance issue in this task but focused only on class re-balancing. However, these studies targeted specific issues only; none of them tackled the intrinsic issue of cell indistinguishability caused by the blurred decision boundary between adjacent classes, which derives from the gradual progression of cancerization and manifests as morphological ambiguity.

To confirm the abnormal type in the case of cell class ambiguity, cytologists often consult The Bethesda System (TBS) guidelines for deterministic reference and review previous samples for historical reference [14]. Inspired by this, we propose a holistic instance comparison strategy that combines deterministic and historical comparison within the same framework. Deterministic comparison aligns instances belonging to the same class in the RoI feature space and encourages current batch instances to be referenced against the ground truth (GT) instances, thereby learning class distinguishability. For historical comparison, we introduce a confident sample selection-based memory bank in class-level comparison, which ensures that confident instances from minority classes are sampled and learnt without bias in each batch.

The main contributions of this paper are as follows:

• To address the issues in cervical cell detection, we present a holistic and historical instance comparison strategy to exploit comprehensive inter-cell instance discrimination.
• We introduce a RoI-level instance comparison module (RIC) and a class-level instance comparison module (CIC) with a confident sample selection-based memory bank to learn discriminative RoI and class representations.
• Experiments on two large-scale cervical cell datasets containing up to 114,513 instance annotations demonstrate that our proposed method outperforms other state-of-the-art (SOTA) methods.

II Methodology

In this section, we first present an overview of the proposed holistic and historical instance comparison approach (Sec. II-A), then introduce the details of holistic instance comparison (Sec. II-B) and historical instance comparison (Sec. II-C), and finally describe the overall training scheme (Sec. II-D).

II-A Framework Overview

The proposed instance comparison method is based on the two-stage object detection framework [16], as illustrated in Fig. 2. First, cytology images are fed into the backbone for feature extraction, followed by instance candidate generation through a Region Proposal Network (RPN). The features of these RoI candidates are then extracted by a projection head $E_{1}$. Next, we perform instance comparison on the RoI feature maps through the RIC, which contrasts current RoI embeddings with GT embeddings to learn distinguishable instance features. To explicitly address the class imbalance, we further introduce a confident sample selection-based memory bank in the CIC, which stores historical cell instances for each class with uniform sampling, thereby improving generalizability on minority classes and avoiding domination by majority classes.

II-B Holistic Instance Comparison

RoI-level Instance Comparison (RIC). Deterministic instance comparison on a large number of RoI and GT bounding boxes can increase the subtle inter-class discrepancy between cells and encourage the generation of high-quality RoI candidates.

RIC compares RoI and GT embeddings through contrastive learning. It starts by assigning GTs and sampling proposals to obtain $bs\times k$ ($bs$ denotes the batch size, $k=256$) RoI candidates with corresponding predicted classes $(0\sim num.class)$, and then filters out the background class, leaving $K_{0}$ RoIs. The RoI feature extractor $E_{1}$ then generates RoI features $I$ of size $[K_{0}\times 256\times 7\times 7]$. For the given $K_{1}$ GT boxes, we build a box augmentation method to increase the diversity and number of GT instances, enriching generalizable features in instance comparison. Given a box $b=\left[x_{0},y_{0},w,h\right]$, we randomly augment GT boxes as,

$$B=\left[x_{0}\pm\frac{w}{k_{0}},\; y_{0}\pm\frac{h}{k_{0}},\; x_{0}+w\pm\frac{w}{k_{0}},\; y_{0}+h\pm\frac{h}{k_{0}}\right], \tag{1}$$

obtaining ${K_{2}}$ augmented boxes. Passing the original and augmented GT boxes through the shared feature extractor $E_{1}$ yields GT features $J$ of size $[(K_{1}+K_{2})\times 256\times 7\times 7]$. The next step is supervised contrastive learning with positive ($I^{+}$) and negative ($I^{-}$) sample selection, where positive pairs come from the same class, negative pairs come from different classes, and the GTs serve as the current query batch $J$.

$$\mathcal{L}_{roi\_com}=-\sum_{j\in J}\frac{1}{|J|}\sum_{i^{+}\in I^{+}}\log\frac{\exp\left(\frac{Sim(z_{j},z_{i^{+}})}{\tau_{roi}}\right)}{\sum_{i\in I}\exp\left(\frac{Sim(z_{j},z_{i})}{\tau_{roi}}\right)}, \tag{2}$$

where $z_{i^{+}}$ is the embedding of a positive sample for the current query sample $z_{j}$, $Sim(z_{i},z_{j})=z_{i}\cdot z_{j}/(\left\|z_{i}\right\|\left\|z_{j}\right\|)$ is the cosine similarity between two samples, and $\tau_{roi}$ is a tunable temperature hyper-parameter. In the denominator, $I=I^{+}\cup I^{-}$ denotes the set of all positive and negative samples.
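The box augmentation of Eq. (1) can be sketched in plain Python as follows. This is a minimal illustration rather than the released implementation: the value of $k_{0}$, the number of augmented copies per box (`n_aug`), and the choice to draw the four signs independently are assumptions, since the text does not specify them.

```python
import random


def augment_box(box, k0, rng):
    """Jitter one GT box [x0, y0, w, h] following Eq. (1).

    Returns corner form [x1, y1, x2, y2]. Each coordinate is shifted by
    w/k0 (x) or h/k0 (y) with an independently drawn sign; drawing the
    four signs independently is an assumption about the +/- in Eq. (1).
    """
    x0, y0, w, h = box
    dx, dy = w / k0, h / k0
    signs = [rng.choice((-1.0, 1.0)) for _ in range(4)]
    return [
        x0 + signs[0] * dx,
        y0 + signs[1] * dy,
        x0 + w + signs[2] * dx,
        y0 + h + signs[3] * dy,
    ]


def augment_gt_boxes(boxes, labels, n_aug=2, k0=8.0, seed=0):
    """Create n_aug jittered copies per GT box; copies inherit the GT label.

    n_aug, k0 and seed are illustrative values, not taken from the paper.
    """
    rng = random.Random(seed)
    aug_boxes, aug_labels = [], []
    for box, label in zip(boxes, labels):
        for _ in range(n_aug):
            aug_boxes.append(augment_box(box, k0, rng))
            aug_labels.append(label)
    return aug_boxes, aug_labels


# One 40x80 GT box of (hypothetical) class 3, jittered by 5 px in x and 10 px in y.
aug_boxes, aug_labels = augment_gt_boxes([[100.0, 100.0, 40.0, 80.0]], [3])
```

In this sketch every augmented box keeps the class of its source GT box, so the augmented boxes can be used directly as additional queries in the RoI-level comparison.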

Class-level Instance Comparison (CIC). To further improve class distinguishability, we design a class-level instance comparison module, which is located between the shared head and the classification head to learn explicit instance class discrepancies. Specifically, we utilize historical class instance embeddings (detailed in Sec. II-C) and current batch class instance embeddings for instance comparison. First, RoI features are fed into the shared head (two fully connected layers) to obtain class features $I^{\prime}$ (of size $[K_{0}\times 1,024]$). Then, we conduct contrastive learning in a similar manner. The current class embeddings $s_{i}\in I^{\prime}$ serve as query samples with a batch size of $|I^{\prime}|$, and the historical class embeddings $s_{m}\in M=\{M^{+},M^{-}\}$ provide positive ($s_{m^{+}}$) and negative ($s_{m^{-}}$) samples. The class-level instance comparison loss is thus formulated as,

$$\mathcal{L}_{cls\_com}=-\sum_{i\in I^{\prime}}\frac{1}{|I^{\prime}|}\sum_{m^{+}\in M^{+}}\log\frac{\exp\left(\frac{Sim(s_{i},s_{m^{+}})}{\tau_{cls}}\right)}{\sum_{m\in M}\exp\left(\frac{Sim(s_{i},s_{m})}{\tau_{cls}}\right)}. \tag{3}$$
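Eqs. (2) and (3) share one form: each query is contrasted against a set of keys, and keys of the same class act as positives. A minimal plain-Python sketch of this shared form is given below. It assumes flattened, non-zero embedding vectors and integer class labels; the function names are hypothetical, and a practical implementation would operate on batched tensors instead of Python lists.

```python
import math


def cosine_sim(a, b):
    """Sim(a, b) = a.b / (|a||b|), as defined after Eq. (2)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def comparison_loss(queries, query_labels, keys, key_labels, tau):
    """Supervised contrastive loss in the form of Eqs. (2) and (3).

    For Eq. (2): queries are GT embeddings J, keys are RoI embeddings I.
    For Eq. (3): queries are current class embeddings I', keys are the
    memory-bank embeddings M.
    """
    if not queries:
        return 0.0
    total = 0.0
    for q, q_label in zip(queries, query_labels):
        logits = [cosine_sim(q, k) / tau for k in keys]
        # log of the denominator: sum over all keys (positives and negatives)
        log_denom = math.log(sum(math.exp(v) for v in logits))
        for logit, k_label in zip(logits, key_labels):
            if k_label == q_label:  # positive pair: same class
                total -= logit - log_denom
    return total / len(queries)


# Toy check: a query aligned with its positive key gives a lower loss
# than the same query labelled with the class of the orthogonal key.
keys, key_labels = [[1.0, 0.0], [0.0, 1.0]], [0, 1]
loss_aligned = comparison_loss([[1.0, 0.0]], [0], keys, key_labels, tau=0.1)
loss_misaligned = comparison_loss([[1.0, 0.0]], [1], keys, key_labels, tau=0.1)
```

The temperature `tau=0.1` in the toy check is chosen only to make the contrast visible; it is not a value from the experiments.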

TABLE I: Quantitative detection results of our method and other SOTA methods with overall performance (AP50, AP75, AP, AR) and per-class performance. Red and blue are first and second best results.

Methods	AP50 $\uparrow$	AP75 $\uparrow$	AP $\uparrow$	AR $\uparrow$	ASC-US	LSIL	ASC-H	HSIL	SCC	AGC	AGC-N
FRCNN (R50) [16]	21.4	13.7	12.6	37.1	21.3	56.7	16.0	16.6	12.0	26.9	0.0
FRCNN (R101) [16]	25.5	18.0	16.6	41.0	32.2	60.8	21.0	18.7	12.8	33.1	0.0
Cascade R-CNN [1]	32.2	24.2	21.0	46.5	34.9	66.4	34.5	23.2	32.7	33.8	0.0
Grid R-CNN [13]	32.9	26.1	23.1	59.6	32.7	61.3	31.9	21.1	29.8	33.3	19.9
Sparse R-CNN [17]	20.6	12.3	12.1	41.0	25.3	53.0	21.3	14.6	2.1	26.2	1.4
RepPoints [21]	30.9	23.8	20.9	56.5	34.7	68.6	23.7	16.2	26.3	32.5	14.2
CornerNet [8]	12.5	10.2	8.5	42.9	6.5	39.6	9.4	11.7	0.0	18.3	1.8
YOLOv3 [15]	23.2	15.4	13.5	38.6	37.5	53.8	14.6	18.0	0.6	30.6	7.1
RetinaNet [11]	28.1	21.4	18.3	52.9	38.1	61.3	18.6	20.6	5.3	36.9	15.7
FCOS [19]	25.5	16.6	15.5	50.8	26.2	48.8	19.4	15.8	28.3	29.0	10.8
DETR [2]	29.1	14.8	15.4	43.6	40.0	61.6	26.3	16.2	22.2	29.4	8.2
LOCE [4]	28.2	20.4	18.0	59.1	27.6	55.2	22.5	19.4	24.5	32.1	16.1
Ours (R50)	35.2	28.3	23.6	46.9	47.4	73.3	35.4	27.2	21.8	34.1	6.9
Ours (R101)	36.8	28.4	23.8	47.5	52.6	78.3	34.1	25.2	30.0	34.8	3.0

II-C Historical Instance Comparison

Although comparing current-batch instances in CIC helps the model learn class discrimination and thus mitigates class ambiguity, it mainly encourages class discrepancies among majority class instances. This is because minority classes (e.g., SCC, AGC-N) occur far less frequently than the overwhelming majority classes (e.g., LSIL, ASC-US). Therefore, we propose a confident sample selection-based memory bank, which not only increases the frequency of minority class instances to avoid class-biased learning, but also increases the number of class instances available in each training batch, improving the model’s generalizability.

Specifically, we use a feature memory bank to store class instance features from the current batch and reuse these features in subsequent training iterations. The memory bank has a size of $[C\times Q]$, where $C$ is the number of instance categories ($C=7$ for CC-S and $C=15$ for CC-L in our experiments) and $Q$ is the queue size per class. The memory bank $M_{cls}$ is defined as follows,

$$M_{cls}=\left[Ins_{1}^{1},\ldots,Ins_{c}^{q},\ldots,Ins_{C}^{Q}\right], \tag{4}$$

where $Ins_{c}^{q}$ represents the $q$-th class instance feature of category $c$ in memory bank $M_{cls}$.

The memory bank is dynamically updated using a queue-based scheme. Confident sample selection is designed to ensure the quality of queued instances in $M_{cls}$, which in turn benefits instance comparison. A sample is selected if $l_{c}\geq\tau_{c}$, i.e., the predicted class score $l_{c}$ of the current sample is compared with the corresponding class confidence threshold $\tau_{c}$, and confident samples are added to the queue to update the memory bank $M_{cls}$. To stabilize the memory bank and benefit from the confidence selection strategy, we first warm up by training the baseline model and then enable our instance comparison after a few epochs.
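The queue-based update with confident sample selection can be sketched as below. This is an illustration under stated assumptions, not the released code: the class name `ClassMemoryBank`, the per-class threshold list, and the use of one first-in-first-out queue per class are hypothetical choices that follow the description of a $[C\times Q]$ bank.

```python
from collections import deque


class ClassMemoryBank:
    """Per-class FIFO queues of at most Q features each (a [C x Q] bank)."""

    def __init__(self, num_classes, queue_size, thresholds):
        # deque(maxlen=Q) silently drops the oldest entry when full.
        self.queues = [deque(maxlen=queue_size) for _ in range(num_classes)]
        self.thresholds = thresholds  # tau_c, one confidence threshold per class

    def update(self, features, labels, scores):
        """Enqueue a feature only if its class score l_c >= tau_c."""
        for feat, c, score in zip(features, labels, scores):
            if score >= self.thresholds[c]:
                self.queues[c].append(feat)

    def contents(self):
        """Return all stored features with their class labels."""
        feats, labels = [], []
        for c, queue in enumerate(self.queues):
            feats.extend(queue)
            labels.extend([c] * len(queue))
        return feats, labels


# Toy example: 2 classes, Q = 2, thresholds 0.5 and 0.9 (illustrative values).
bank = ClassMemoryBank(num_classes=2, queue_size=2, thresholds=[0.5, 0.9])
bank.update(
    features=[[1.0], [2.0], [3.0], [4.0]],
    labels=[0, 0, 0, 1],
    scores=[0.6, 0.7, 0.8, 0.8],
)
stored_feats, stored_labels = bank.contents()
```

Because every class is capped at the same queue size, a frequent class cannot occupy more slots than a rare one once the queues fill, which is the balancing effect the memory bank is meant to provide.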

II-D Overall Loss Function and Optimization

The framework can be trained with the overall objective $\mathcal{L}$ ,

$$\mathcal{L}=\lambda_{roi\_com}\mathcal{L}_{roi\_com}+\lambda_{cls\_com}\mathcal{L}_{cls\_com}+\mathcal{L}_{base}, \tag{5}$$

where $\mathcal{L}_{roi\_com}$ denotes the RoI comparison loss and $\mathcal{L}_{cls\_com}$ denotes the class comparison loss. $\mathcal{L}_{base}$ supervises the baseline training, including the RPN loss $\mathcal{L}_{RPN}$, regression loss $\mathcal{L}_{reg}$, and classification loss $\mathcal{L}_{cls}$, following previous settings [16]. $\lambda_{roi\_com}$ and $\lambda_{cls\_com}$ are trade-off parameters, set to $\{1,0.1\}$ in experiments.
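Combined with the warm-up described in Sec. II-C, the objective of Eq. (5) can be sketched as a small function. The warm-up length (`warmup_epochs=2`) is an assumed value, since the text only says "a few epochs", and the mapping of $\{1,0.1\}$ to $\lambda_{roi\_com}$ and $\lambda_{cls\_com}$ in that order is also an assumption.

```python
def total_loss(loss_base, loss_roi_com, loss_cls_com, epoch,
               warmup_epochs=2, lambda_roi_com=1.0, lambda_cls_com=0.1):
    """Eq. (5), with the comparison terms disabled during warm-up.

    epoch is 0-indexed; warmup_epochs is a hypothetical setting.
    """
    if epoch < warmup_epochs:
        # Warm-up: train the baseline detector only.
        return loss_base
    return (lambda_roi_com * loss_roi_com
            + lambda_cls_com * loss_cls_com
            + loss_base)


warmup_value = total_loss(1.0, 2.0, 3.0, epoch=0)
full_value = total_loss(1.0, 2.0, 3.0, epoch=5)
```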

III Experiments and Results

III-A Dataset and Experiments Settings

TABLE II: Effectiveness of proposed Instance Comparison strategy. Red and blue are first and second best results.

Base	RIC	Aug	CIC	AP50 $\uparrow$	AP75 $\uparrow$	AP $\uparrow$	AR $\uparrow$	ASC-US	LSIL	ASC-H	HSIL	SCC	AGC	AGC-N
✓				21.4	13.7	12.6	37.1	21.3	56.7	16.0	16.6	12.0	26.9	0.0
✓	✓			32.7	25.2	21.5	44.5	42.1	74.5	32.5	24.6	19.3	33.1	3.0
✓	✓	✓		34.5	27.4	23.2	47.2	46.3	74.7	35.2	26.4	21.4	34.7	3.0
✓	✓	✓	✓	35.2	28.3	23.6	46.9	47.4	73.3	35.4	27.2	21.8	34.1	6.9

Dataset. To validate the effectiveness of the proposed method, we build two cervical cytology datasets, called CC-L and CC-S, containing 39,006 and 17,301 images ($1,200\times 1,200$), respectively, cropped from cytology whole slides. Based on the Bethesda system [14], the CC-L dataset contains 114,513 box annotations. As shown in Tab. III, these instance annotations exhibit imbalanced distributions across 15 categories. To our knowledge, this is the largest cervical cytology dataset. In addition, we filtered out microbial categories to focus on specific intraepithelial lesions, leaving 42,592 instances, and split the dataset into training, validation, and test sets for experiments.

TABLE III: The details of instance annotations and class distribution in CC-S and CC-L datasets.

Dataset	Category	Instance	Category	Instance	Category	Instance
CC-S	ASC-US	6,631	LSIL	9,273	ASC-H	4,305
	HSIL	17,938	SCC	1,394	AGC-N	276
	AGC	3,111
CC-L	ASC-US	22,056	LSIL	26,428	ASC-H	8,802
	HSIL	28,110	SCC	1,756	AGC-N	256
	AGC	6,533	ACTI	395	EMC	9,950
	CC	2,510	Atrophy	453	FUNGI	1,023
	Metapla	3,574	TV	2,274	AIS	402

Evaluation Metrics. For quantitative evaluation, we adopt commonly-used average precision (AP) and average recall (AR) across all classes over bounding box IoU thresholds ranging from 0.5 to 0.95, and the individual AP under the IoU threshold of 0.5 and 0.75, denoted by AP50 and AP75. We also report AP for each class under the IoU threshold of 0.5 to show the detection performance for each class.
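The IoU criterion underlying these metrics can be sketched as follows. The threshold step of 0.05 is an assumption borrowed from the common COCO-style convention; the text only states the range from 0.5 to 0.95.

```python
def box_iou(a, b):
    """IoU of two boxes in corner form [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Assumed COCO-style thresholds: 0.50, 0.55, ..., 0.95 (10 values).
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(pred_box, gt_box):
    """Thresholds at which pred_box counts as a true positive for gt_box."""
    iou = box_iou(pred_box, gt_box)
    return [t for t in IOU_THRESHOLDS if iou >= t]
```

AP50 and AP75 correspond to fixing the threshold at 0.5 and 0.75, while AP and AR average over all thresholds in the list.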

Implementation Details. We utilize Faster R-CNN (FRCNN) [16] as the baseline model, with ResNet-50-based and ResNet-101-based FPN backbones in all experiments. For training, we use SGD with 0.9 momentum as the optimizer and set the initial learning rate to 0.005. To ensure convergence, we train the network for 24 epochs, reducing the learning rate by a factor of 0.1 after 8 and 14 epochs. All experiments were conducted on NVIDIA GeForce RTX 3090 GPUs with 24GB of memory.
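The step schedule described above can be written as a short function. This is a sketch assuming 0-indexed epochs, so that the first decay takes effect once 8 epochs have been completed; the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=0.005, milestones=(8, 14), gamma=0.1):
    """Step schedule: multiply base_lr by gamma for each milestone reached.

    epoch is 0-indexed, so epochs 0-7 use base_lr, epochs 8-13 use
    base_lr * gamma, and epochs 14-23 use base_lr * gamma**2.
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


# Learning rate for each of the 24 training epochs.
schedule = [learning_rate(e) for e in range(24)]
```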

III-B Comparison with SOTA methods

Qualitative Results. We compare detection results on the CC-S dataset with 11 state-of-the-art detectors, including R-CNN-based models (FRCNN [16], Cascade R-CNN [1], Grid R-CNN [13], Sparse R-CNN [17]), point-based models (RepPoints [21], CornerNet [8]), one-stage models (YOLOv3 [15], RetinaNet [11], FCOS [19]), a transformer-based model (DETR [2]) and a long-tailed learning model (LOCE [4]). As shown in Fig. 3, our method outperforms the others in several aspects. First, several advanced detectors such as Sparse R-CNN, YOLOv3, and LOCE missed abnormal LSIL cells in Fig. 3(d)(g)(k), which may result in low sensitivity in clinical cervical screening. Our model, equipped with inter-class instance relationship modeling (RIC and CIC), avoids these missed detections. Second, our model achieves higher IoU detection scores than methods such as Cascade R-CNN (Fig. 3(b)) and RepPoints (Fig. 3(e)).

Quantitative Results. We provide comprehensive quantitative comparisons in Tab. I, which shows that our holistic and historical comparison method achieves the best AP50, AP75, and AP, with improvements of 13.8% and 11.3% in AP50 over the baseline [16] with R50 and R101 backbones, respectively. Our method effectively addresses class ambiguity, as shown by improved detection accuracy for ambiguous classes, notably a gain of over 26% in ASC-US. It also helps address class imbalance, with increased AP50 for minority classes, i.e., 9.8% for SCC and 6.9% for AGC-N. Our method outperforms two-stage and one-stage detectors, with a 2.3% AP50 gain over the strongest two-stage method, Grid R-CNN [13], and a 4.3% improvement over the best one-stage detector, RepPoints [21]. Transformer-based models (e.g., DETR) achieve promising detection results, but they are limited by issues such as tiny targets and noise. Finally, long-tailed learning, i.e., LOCE [4], suffers a significant performance drop in majority classes that lowers its overall performance, highlighting the effectiveness of our method in balancing the performance of different classes.

III-C Ablation Studies and Analysis

TABLE IV: Ablation studies for the temperature $\tau$ and memory bank size Q. Red denotes the best results.

Parameter	AP50 $\uparrow$	AP75 $\uparrow$	AP $\uparrow$	AR $\uparrow$
$\tau$ = 4	27.7	19.4	16.8	39.8
$\tau$ = 6	35.2	28.3	23.6	46.9
$\tau$ = 8	27.5	17.7	16.5	40.7
$\tau$ = 10	25.6	17.0	15.2	40.5
Q = 16	27.6	18.0	16.3	41.0
Q = 80	35.2	28.3	23.6	46.9
Q = 160	28.0	21.6	18.1	42.9

TABLE V: Further exploration for class imbalanced learning with focal [11] and seesaw loss [20]. Red denotes the best results.

Base	Loss	Ours	AP50 $\uparrow$	AP75 $\uparrow$	AP $\uparrow$	AR $\uparrow$	ASC-US	LSIL	ASC-H	HSIL	SCC	AGC	AGC-N
✓			21.4	13.7	12.6	37.1	21.3	56.7	16.0	16.6	12.0	26.9	0.0
✓	focal [11]		18.5	13.9	11.9	46.8	17.3	53.4	15.3	14.0	1.3	25.7	2.6
✓	focal [11]	✓	29.7	21.5	18.7	52.0	34.1	66.0	31.8	20.8	13.7	29.4	11.7
✓	seesaw [20]		33.5	22.9	20.9	54.2	29.9	63.9	28.9	22.7	36.5	34.8	17.6
✓	seesaw [20]	✓	40.2	30.7	26.3	59.1	51.5	61.1	45.7	21.7	36.7	38.5	26.0

TABLE VI: Effectiveness of proposed method in the CC-L dataset. Red denotes the performance changes compared to FRCNN.

Method	ASC-US	LSIL	ASC-H	HSIL	SCC	AGC	AGC-N	ACTI	EMC	CC	Atrophy	FUNGI	Metapla	TV	AIS
Baseline	28.2	23.6	25.5	3.3	1.5	23.8	0.0	73.7	39.5	44.3	18.1	10.5	4.6	22.2	0.0
Ours	31.1(+2.9)	39.5(+15.9)	31.0(+5.5)	6.8(+3.5)	13.5(+12.0)	22.0(-1.8)	1.3(+1.3)	86.3(+12.6)	47.6(+8.1)	39.3(-5.0)	33.6(+15.5)	19.6(+9.1)	18.9(+14.3)	24.4(+2.2)	14.6(+14.6)

Effectiveness of Instance Comparison. We perform ablation studies to investigate the effect of each proposed design. The comparison results are given in Tab. II, with the baseline (denoted as Base) and three components: RoI-level Instance Comparison (RIC), Box Augmentation (Aug), and Class-level Instance Comparison (CIC). Specifically, adding RIC yields clear improvements of 11.3% in AP50 and 8.9% in mean AP. At the class level, we see improvements both in majority classes (17.8% for LSIL) and in minority classes (7.3% for SCC, 6.2% for AGC, 3.0% for AGC-N). Equipped with box augmentation, which increases the number and diversity of deterministic instances, the model gains further improvements of 1.8% AP50 and 1.7% AP. Class-level comparison aims to improve comparison capability by increasing the frequency of minority classes, which yields a further improvement of 0.7% AP50, notably 3.9% for the minority class AGC-N.

Temperature Coefficient $\tau$. The temperature coefficient $\tau$ is a crucial parameter that controls how strongly hard negative samples are penalized, thereby directing the model’s updates. To choose an appropriate value, we conduct a series of ablations with $\tau\in\{4,6,8,10\}$. As shown in Tab. IV, $\tau=6$ achieves the highest detection metrics, with 35.2% AP50 and 23.6% AP, and is therefore used in the following experiments.
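The role of the temperature can be illustrated with a small softmax example over cosine similarities. The similarity values below are made up for illustration; only the set of temperatures comes from the ablation.

```python
import math


def softmax_with_temperature(sims, tau):
    """Softmax over similarities scaled by 1/tau (numerically stabilized)."""
    scaled = [s / tau for s in sims]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical similarities of one query to a positive (0.9) and two negatives.
sims = [0.9, 0.2, -0.3]
weights = {tau: softmax_with_temperature(sims, tau) for tau in (4, 6, 8, 10)}
```

A smaller $\tau$ concentrates more probability mass on the most similar sample, whereas a larger $\tau$ flattens the distribution toward uniform, which is why the choice of $\tau$ changes how strongly individual negatives influence the gradient.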

Memory Bank Size Q. We employ memory banks with different sizes, and Tab. IV shows the corresponding results for cervical cell detection. The best performance is achieved when Q=80, exceeding the other settings by up to 7.6% AP50. This is because a too-small value (Q=16) cannot significantly benefit instance comparison learning, while a too-large value (Q=160) may leave the queues of minority classes nearly empty while those of majority classes are full in early iterations.

Equipped with rebalancing loss. As a plug-and-play learning approach, we further explore our method by applying existing class re-balancing loss functions to the classification head. We replace the cross-entropy loss in vanilla FRCNN [16], using Focal loss [11] and Seesaw loss [20], shown in Tab. V. They achieve noticeable improvements in minority classes, while sacrificing performance in majority classes. When equipped with the proposed instance comparison approach, both Focal loss and Seesaw loss achieve consistent improvements, namely 11.2% AP50 for Focal loss and 6.7% AP50 for Seesaw loss.

Experiments on the large-scale cytology dataset. Moreover, we conduct experiments on CC-L to demonstrate the effectiveness of our approach on a larger dataset with a more complex structure. As shown in Tab. VI, we achieve 28.6% AP50 and 16.6% AP, with performance improvements in 13 of the 15 categories. Specifically, our method is effective in addressing class ambiguity, as seen in the typical ambiguous categories ASC-US (+2.9%) and ASC-H (+5.5%). Moreover, the performance gains are especially pronounced in minority categories, e.g., from 0.0% to 14.6% in AIS and from 1.5% to 13.5% in SCC.

IV Conclusion

In this work, we investigated the intrinsic issues of class ambiguity and imbalance in cervical cell detection. To jointly address these issues, we proposed a novel instance comparison approach with holistic and historical cell comparison at both the RoI level and the class level, together with a confident sample selection-based memory bank that compensates for the limited contribution of minority class instances. Our experiments on two large-scale cervical cytology datasets demonstrate the effectiveness of our approach. Further improving the performance on minority classes, e.g., AGC-N, remains to be explored.

V Acknowledgments

This work was supported by the National Natural Science Foundation of China (No. 62202403), Hong Kong Innovation and Technology Fund (Project No. PRP/034/22FX), Shenzhen Science and Technology Innovation Committee Fund (Project No. KCXFZ20230731094059008) and the Project of Hetao Shenzhen-Hong Kong Science and Technology Innovation Cooperation Zone (HZQB-KCZYB-2020083).

References

[1] Z. Cai and N. Vasconcelos, “Cascade r-cnn: Delving into high quality object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 6154–6162.
[2] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in European conference on computer vision. Springer, 2020, pp. 213–229.
[3] Z. Chai, L. Luo, H. Lin, H. Chen, A. Han, and P.-A. Heng, “Deep semi-supervised metric learning with dual alignment for cervical cancer cell detection,” in 2022 IEEE 19th International Symposium on Biomedical Imaging (ISBI). IEEE, 2022, pp. 1–5.
[4] C. Feng, Y. Zhong, and W. Huang, “Exploring classification equilibrium in long-tailed object detection,” in Proceedings of the IEEE/CVF International conference on computer vision, 2021, pp. 3417–3426.
[5] H. Jiang, S. Li, W. Liu, H. Zheng, J. Liu, and Y. Zhang, “Geometry-aware cell detection with deep learning,” Msystems, vol. 5, no. 1, pp. 10–1128, 2020.
[6] H. Jiang, R. Zhang, Y. Zhou, Y. Wang, and H. Chen, “Donet: Deep de-overlapping network for cytology instance segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 15641–15650.
[7] H. Jiang, Y. Zhou, Y. Lin, R. C. Chan, J. Liu, and H. Chen, “Deep learning for computational cytology: A survey,” Medical Image Analysis, p. 102691, 2022.
[8] H. Law and J. Deng, “Cornernet: Detecting objects as paired keypoints,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 734–750.
[9] Y. Liang, S. Feng, Q. Liu, H. Kuang, J. Liu, L. Liao, Y. Du, and J. Wang, “Exploring contextual relationships for cervical abnormal cell detection,” arXiv preprint arXiv:2207.04693, 2022.
[10] H. Lin, H. Chen, X. Wang, Q. Wang, L. Wang, and P.-A. Heng, “Dual-path network with synergistic grouping loss and evidence driven risk stratification for whole slide cervical image analysis,” Medical Image Analysis, vol. 69, p. 101955, 2021.
[11] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988.
[12] M. Liu, X. Li, X. Gao, J. Chen, L. Shen, and H. Wu, “Sample hardness based gradient loss for long-tailed cervical cell detection,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2022: 25th International Conference, Singapore, September 18–22, 2022, Proceedings, Part II. Springer, 2022, pp. 109–119.
[13] X. Lu, B. Li, Y. Yue, Q. Li, and J. Yan, “Grid r-cnn,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 7363–7372.
[14] R. Nayar and D. C. Wilbur, The Bethesda system for reporting cervical cytology: definitions, criteria, and explanatory notes. Springer, 2015.
[15] J. Redmon and A. Farhadi, “Yolov3: An incremental improvement,” arXiv preprint arXiv:1804.02767, 2018.
[16] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” Advances in neural information processing systems, vol. 28, 2015.
[17] P. Sun, R. Zhang, Y. Jiang, T. Kong, C. Xu, W. Zhan, M. Tomizuka, L. Li, Z. Yuan, C. Wang et al., “Sparse r-cnn: End-to-end object detection with learnable proposals,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 14454–14463.
[18] H. Sung, J. Ferlay, R. L. Siegel, M. Laversanne, I. Soerjomataram, A. Jemal, and F. Bray, “Global cancer statistics 2020: Globocan estimates of incidence and mortality worldwide for 36 cancers in 185 countries,” CA: a cancer journal for clinicians, vol. 71, no. 3, pp. 209–249, 2021.
[19] Z. Tian, C. Shen, H. Chen, and T. He, “Fcos: Fully convolutional one-stage object detection,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 9627–9636.
[20] J. Wang, W. Zhang, Y. Zang, Y. Cao, J. Pang, T. Gong, K. Chen, Z. Liu, C. C. Loy, and D. Lin, “Seesaw loss for long-tailed instance segmentation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 9695–9704.
[21] Z. Yang, S. Liu, H. Hu, L. Wang, and S. Lin, “Reppoints: Point set representation for object detection,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 9657–9666.