Self-supervised learning and multi-scale ensemble for EBV prediction based on whole slide image of gastric cancer
Haohuan Zhang, Ruixuan Wang, Chunxiao Li, Hongmei Liu
Proceedings Volume 12506, Third International Conference on Computer Science and Communication Technology (ICCSCT 2022); 125063M (28 December 2022). https://doi.org/10.1117/12.2662555
Event: International Conference on Computer Science and Communication Technology (ICCSCT 2022), 2022, Beijing, China
Abstract
Gastric cancer is the fifth most common cancer and the fourth leading cause of cancer death in the world. One molecular subtype of gastric cancer, the Epstein-Barr virus (EBV) positive tumor, often responds remarkably well to immune checkpoint inhibitors and has a favorable prognosis. Since EBV testing is often time-consuming and costly, it is of great significance to develop an automatic classification method for EBV subtype prediction based only on cost-efficient pathological images. In this study, a self-supervised learning method was proposed to train a more generalizable feature extractor (often consisting of multiple convolutional layers) using only unlabeled pathological images, based on which the classifier head (often consisting of two or three fully connected layers) can then be trained more efficiently using a large number of labelled pathological image patches. In particular, a novel formation of positive pairs for self-supervised learning was proposed, based on the observation that neighboring patches in each pathological image often share similar visual features and therefore should have similar feature representations at the output of the feature extractor. In addition, by imitating the diagnosis process of pathologists, who often observe a pathological image at multiple magnifications, a multi-scale ensemble model was proposed, with each individual classifier predicting on image patches of a unique magnification scale. Experiments on two external pathological image datasets show that the proposed self-supervised learning helps produce a more effective EBV classifier and that the multi-scale ensemble model further improves prediction stability.

1. INTRODUCTION

According to the report released by the World Health Organization (WHO) in 2021, the number of cancer deaths worldwide increased by 37% from 2000 to 2019, reaching about 9.3 million in 2019 [1]. Among the variety of cancers, gastric cancer is the fifth most common cancer and the fourth leading cause of cancer death in the world [2]. Epstein-Barr virus (EBV) positive tumor is one molecular subtype of gastric cancer [3] which often responds well to immune checkpoint inhibitors [4] and has a favorable prognosis [5]. However, EBV testing is often time-consuming and costly. Therefore, it would be desirable to help pathologists determine more accurately whether a patient belongs to the EBV group based only on cost-efficient analysis of pathological images. Most recent works [6-8] focused on the task of classifying gastric cancer into positive and negative categories, with the exception of a recent work [9] in which a deep convolutional neural network (CNN) with a ResNet backbone [10] was trained to predict the molecular subtypes of gastric cancer known as microsatellite instability (MSI) and microsatellite stability (MSS).

In this study, we propose an innovative classification method for EBV prediction based on self-supervised learning and multi-scale ensemble prediction. First, considering that adjacent regions in each whole slide image (WSI) of the same tissue often share similar pathological features, a novel formation of positive pairs is proposed for the contrastive learning of the CNN feature extractor. With self-supervised learning, more unlabeled pathological images can be used to help train a more generalizable feature extractor for downstream classification tasks. Second, inspired by the diagnosis process of pathologists, who often inspect WSIs at multiple magnifications, we propose a multi-scale ensemble model for the prediction of EBV status (EBV vs. non-EBV). Experiments on two external pathological image datasets show that the proposed self-supervised learning helps produce a more effective EBV classifier and that the multi-scale ensemble model further improves prediction stability.

2. METHOD

The proposed multi-scale ensemble classifier consists of three individual classifiers, each of which classifies image patches of a unique scale (at 10×, 5×, or 2.5× magnification). The feature extractor of each individual classifier is pre-trained by self-supervised learning without using labels of image patches, and the classifier head is then trained on the available labelled image patches.

2.1 Self-supervised learning of feature extractor

The motivation for self-supervised learning of each feature extractor is to speed up the whole process of classifier training. For a specific scale of image patches, a huge number of patches can be generated (e.g., with overlapping regular sampling from each large-size WSI slide), and thus training a CNN classifier (both the feature extractor and the classifier head) on this huge number of image patches would be very time-consuming. While a pre-trained and fixed feature extractor based on a natural image dataset (e.g., ImageNet) could be used so that only the classifier head needs to be trained, such a feature extractor pre-trained on natural images may not be powerful enough for extracting and representing pathological image features. To make full use of the huge number of pathological image patches while speeding up the training process, we propose using a relatively small number of image patches to train the feature extractor in a self-supervised manner, and then using the huge number of patches to train only the classifier head (often a two- or three-layer MLP), with the pre-trained feature extractor kept fixed, as sketched below.
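A minimal sketch of this two-stage idea in PyTorch, assuming a ResNet-50 backbone as used later in the paper; names such as `cache_features` and `patch_loader` are illustrative, not from the paper. The extractor is frozen, its 2048-dimensional feature vectors are computed once per patch, and only a small MLP head remains trainable:

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Stage 1 output: a (self-supervised) pre-trained extractor, here a ResNet-50
# with its classification layer removed, frozen so it never receives gradients.
extractor = models.resnet50()
extractor.fc = nn.Identity()          # keep the 2048-d pooled features
extractor.eval()
for p in extractor.parameters():
    p.requires_grad = False

# Stage 2: cache feature vectors once, then train only a small MLP head.
@torch.no_grad()
def cache_features(patch_loader):
    feats, labels = [], []
    for x, y in patch_loader:          # x: (B, 3, 224, 224) image patches
        feats.append(extractor(x))     # (B, 2048) feature vectors
        labels.append(y)
    return torch.cat(feats), torch.cat(labels)

head = nn.Sequential(nn.Linear(2048, 2048), nn.ReLU(), nn.Linear(2048, 2))
```

Because the extractor's outputs can be cached, each head-training epoch touches only the MLP parameters, which is what makes training on the full patch set affordable.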

In this study, the contrastive learning strategy was applied to self-train each feature extractor, and a novel way of forming positive and negative pairs of image patches was proposed for contrastive learning. Specifically, positive pairs are generated not only from two augmentations of each image patch (Figure 1 left, from an original patch $x_i$ to its augmented versions $\tilde{x}_i$ and $\tilde{x}_i'$), but also, as a novel extension, from augmentations of two neighboring image patches (Figure 1 left, from two neighboring original patches $x_i$ and $x_j$ to their augmented versions $\tilde{x}_i$ and $\tilde{x}_j$). Augmented neighboring image patches are treated as positive pairs because adjacent patches come from the same tissue, which is composed of cell groups with similar morphology and the same function, and therefore often share similar morphological features. For negative pair generation, the MoCo method [11] was adopted, by which a large number of negative pairs can be generated from a memory buffer storing feature representations of image patches from both the current mini-batch and hundreds of previous mini-batches. In general, the two patches of each negative pair come from two different WSI slides or from different locations in one WSI slide. Based on the positive and negative pairs generated at a specific patch scale, the corresponding MoCo model can be well trained, and its trained encoder is kept as the feature extractor for the corresponding patch classifier (Figure 1 right).
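The following sketch illustrates, under MoCo-style assumptions [11], how the novel positive pairs fit into a standard InfoNCE contrastive loss. Here `encoder_q`, `encoder_k` (the momentum encoder), and the negative `queue` are assumed to exist as in MoCo, with 128-dimensional projected features as described in Section 3.1; the only change from the standard recipe is that `x_k` may be an augmentation of a neighboring patch rather than of the same patch:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(x_q, x_k, encoder_q, encoder_k, queue, tau=0.07):
    """x_q: an augmented patch; x_k: either a second augmentation of the SAME
    patch or an augmentation of a NEIGHBORING patch (the novel positives).
    queue: (K, 128) feature bank of negatives from previous mini-batches."""
    q = F.normalize(encoder_q(x_q), dim=1)            # (B, 128) query features
    with torch.no_grad():                             # keys use the momentum encoder
        k = F.normalize(encoder_k(x_k), dim=1)        # (B, 128) key features
    l_pos = (q * k).sum(dim=1, keepdim=True)          # (B, 1) positive logits
    l_neg = q @ queue.t()                             # (B, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(len(q), dtype=torch.long)    # positive is index 0
    return F.cross_entropy(logits, labels)
```

In this formulation the neighboring-patch positives require no change to the loss itself; only the pair-sampling step differs from standard MoCo.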

Figure 1.

Overview of the proposed method. Left: construction of positive pairs $(\tilde{x}_i, \tilde{x}_i')$ and $(\tilde{x}_i, \tilde{x}_j)$. Right: the training process of an EBV classifier, with the feature extractor trained by self-supervised learning (top) and then the classifier head (an MLP) trained with labelled image patches of a unique magnification (bottom).


2.2 Multi-scale ensemble classifier

Once a feature extractor has been self-trained with image patches of a unique scale, it can be used to extract a feature vector representation for each labelled image patch of the same scale, where the labelled patches come from the annotated EBV regions of each WSI slide. Then, built on the feature extractor, an MLP serving as the classifier head can be trained to predict each patch as EBV or non-EBV. Compared to training a full CNN classifier (i.e., the CNN feature extractor plus the final FC output layer), training a two- or three-layer MLP is much more efficient because of the much smaller number of model parameters and the smaller-size patch representation (i.e., feature vectors rather than image patches as input).

Once each MLP has been trained on top of the corresponding feature extractor for image patches of a unique scale, the feature extractor and the MLP are combined into a CNN classifier. With three different image magnifications, three such CNN classifiers can be ensembled for the prediction of any new image patch (Figure 2), as sketched below. During inference, at each location (often representing a local region) in a new WSI, three image patches at different magnifications are cropped and fed into the corresponding classifiers, resulting in three output probability predictions. The average of these predictions is used as the final prediction probability for that location in the WSI slide.
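A hedged sketch of this inference rule follows; the helper `crop_at` and the `classifiers` dictionary are hypothetical names, not from the paper:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def predict_location(wsi, loc, classifiers):
    """classifiers: maps each magnification to its trained extractor+MLP model;
    crop_at is an assumed helper that crops a patch around one WSI location."""
    probs = []
    for mag in (2.5, 5.0, 10.0):
        patch = crop_at(wsi, loc, magnification=mag)   # (1, 3, 224, 224)
        logits = classifiers[mag](patch)               # (1, 2): EBV vs. non-EBV
        probs.append(F.softmax(logits, dim=1))
    return torch.stack(probs).mean(dim=0)              # average of 3 predictions
```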

Figure 2.

Multi-scale ensemble classifier, consisting of three individual classifiers, each of which is associated with one magnification.


3. EXPERIMENT

3.1 Experimental setup

Three pathological image datasets of gastric cancer were used to evaluate the proposed method: one public dataset, TCGA, and two private datasets, SYSUCC-Internal and SYSUCC-MultiCenter (see Table 1 for details). SYSUCC-Internal was used to self-train the feature extractors, while SYSUCC-MultiCenter and TCGA were used to train the classifier heads and to evaluate the classifiers. SYSUCC-MultiCenter and TCGA were each split into five folds at the slide level, and a five-fold cross-validation strategy was adopted for evaluation, each time with three folds for MLP training, one fold for validation, and the remaining fold for testing; a sketch of this split is given below. The average area under the ROC curve (AUC) over the five testing folds was reported for SYSUCC-MultiCenter and TCGA respectively. Note that only image patches regularly extracted from tissue regions in SYSUCC-Internal were used for feature extractor training, while patches from annotated tumor regions in SYSUCC-MultiCenter and TCGA were used for classifier head training and classifier evaluation.
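A sketch of the slide-level split, assuming each patch record carries its parent slide ID; the `patches` structure and its key names are illustrative, not from the paper:

```python
import numpy as np
from sklearn.model_selection import KFold

def slide_level_splits(patches, n_folds=5):
    """Each element of `patches` is assumed to be a dict with a 'slide_id' key."""
    slide_ids = np.array(sorted({p["slide_id"] for p in patches}))
    folds = [idx for _, idx in
             KFold(n_folds, shuffle=True, random_state=0).split(slide_ids)]
    for i in range(n_folds):                         # rotate the held-out folds
        test = set(slide_ids[folds[i]])
        val = set(slide_ids[folds[(i + 1) % n_folds]])
        train = set(slide_ids) - test - val          # remaining three folds
        # patches inherit the split of their parent slide, so no slide leaks
        yield ([p for p in patches if p["slide_id"] in train],
               [p for p in patches if p["slide_id"] in val],
               [p for p in patches if p["slide_id"] in test])
```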

Table 1.

Statistics of datasets.

| Dataset            | WSIs                 | Role                          | Patch number (2.5×, 5×, 10×) | Patch size |
|--------------------|----------------------|-------------------------------|------------------------------|------------|
| SYSUCC-Internal    | 1006                 | Train feature extractor       | 895037, 910706, 1082001      | 256×256    |
| SYSUCC-MultiCenter | 98 EBV + 319 non-EBV | Train & eval. classifier head | 400669, 400669, 400669       | 256×256    |
| TCGA               | 23 EBV + 234 non-EBV | Train & eval. classifier head | 428478, 428478, 428478       | 256×256    |

The three feature extractors share the same model backbone, i.e., the ResNet-50 convolutional layers. Since the MoCo v2 [12] method was used to train the feature extractors, a two-layer MLP projection head as in MoCo v2 was attached to each feature extractor to reduce the feature vector dimension from 2048 to 128. Each feature extractor was trained by the MoCo method with the suggested hyper-parameters, using the SGD optimizer with batch size 128 for 100 epochs. Each image patch was pre-processed by Vahadane's method [13] for color normalization and randomly cropped to 224×224 pixels, and general data augmentations were applied, including random rotation, horizontal and vertical flips, and brightness, contrast, and saturation changes.
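For illustration, the listed augmentations could be composed with torchvision as below; the exact parameter values are assumptions, and Vahadane color normalization [13] is treated as a separate preprocessing step applied beforehand:

```python
from torchvision import transforms

# Illustrative augmentation pipeline for 256x256 patches; parameter values
# (rotation range, jitter strength) are assumptions, not from the paper.
augment = transforms.Compose([
    transforms.RandomCrop(224),                       # random 224x224 crop
    transforms.RandomRotation(degrees=90),            # random rotation
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.ToTensor(),
])
```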

The three classifier heads also share one backbone, i.e., a three-layer MLP with layer output dimensions of 2048, 2048, and 2, respectively. Each MLP was trained by the SGD optimizer (batch size 256, momentum 0.9) for 100 epochs with an initial learning rate of 0.001. In the first 20 epochs, the learning rate was adjusted by linear warmup, and then dynamically adjusted by cosine annealing in the remaining epochs. For data processing, each patch from TCGA and SYSUCC-MultiCenter was center-cropped to 224×224 pixels and fed to the corresponding well-trained feature extractor to obtain the feature vector used as input to the classifier head.
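A minimal sketch of this schedule, assuming the warmup spans the first 20 epochs, written with a plain `LambdaLR` so the rule is explicit:

```python
import math
import torch

def lr_lambda(epoch, warmup=20, total=100):
    if epoch < warmup:
        return (epoch + 1) / warmup                   # linear warmup
    progress = (epoch - warmup) / (total - warmup)    # cosine annealing
    return 0.5 * (1.0 + math.cos(math.pi * progress))

mlp = torch.nn.Sequential(                            # 2048 -> 2048 -> 2 head
    torch.nn.Linear(2048, 2048), torch.nn.ReLU(), torch.nn.Linear(2048, 2))
optimizer = torch.optim.SGD(mlp.parameters(), lr=0.001, momentum=0.9)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# scheduler.step() is called once per epoch
```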

3.2 Performance evaluation

Both patch-level and WSI-level AUCs were reported on SYSUCC-MultiCenter and TCGA. For slide-level AUCs, the classification probabilities of all patches in each WSI were averaged to give the classification probability of the corresponding WSI slide (see the sketch below). The original MoCo and MoCo v2 were used as two self-supervised learning baselines for comparison. In addition, simultaneous fine-tuning of each pre-trained feature extractor (initialized on the ImageNet dataset) together with training of the classifier head was used as a further baseline ('Finetune'), for which the batch size was set to 32 because of the large memory consumption. From Table 2, it can be observed that on TCGA our method achieved the best patch-level and slide-level performance at 2.5× magnification, while on SYSUCC-MultiCenter it achieved the best patch-level performance and comparable slide-level performance at 10× magnification. This demonstrates that although the proposed self-supervised learning helps improve classification performance, the patch scale affects model performance and its effect may vary across datasets. Compared to the single-scale classifiers, which were affected by patch scale, the proposed multi-scale ensemble classifier (Table 2, last row) was more stable, achieving the best patch-level and slide-level performance on SYSUCC-MultiCenter and comparable performance on TCGA.
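A sketch of the slide-level aggregation, with illustrative variable names: patch probabilities are grouped by parent WSI, averaged, and the AUC is then computed over slides.

```python
from collections import defaultdict
import numpy as np
from sklearn.metrics import roc_auc_score

def slide_level_auc(patch_probs, patch_slide_ids, slide_labels):
    """patch_probs: EBV probability per patch; patch_slide_ids: parent WSI of
    each patch; slide_labels: dict mapping slide ID -> 1 (EBV) or 0 (non-EBV)."""
    per_slide = defaultdict(list)
    for prob, sid in zip(patch_probs, patch_slide_ids):
        per_slide[sid].append(prob)                        # group by parent WSI
    sids = sorted(per_slide)
    scores = [float(np.mean(per_slide[s])) for s in sids]  # mean prob per WSI
    labels = [slide_labels[s] for s in sids]
    return roc_auc_score(labels, scores)
```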

Table 2.

Performance comparison between our method and baselines.

| Method           | Magnification   | Patch-level avg. AUC (TCGA) | Patch-level avg. AUC (MultiCenter) | WSI-level avg. AUC (TCGA) | WSI-level avg. AUC (MultiCenter) |
|------------------|-----------------|-----------------------------|------------------------------------|---------------------------|----------------------------------|
| MoCo             | 2.5×            | 0.8334                      | 0.8686                             | 0.8802                    | 0.9412                           |
| MoCo v2          | 2.5×            | 0.8204                      | 0.8750                             | 0.8474                    | 0.9454                           |
| Finetune (bs=32) | 2.5×            | 0.8136                      | 0.8852                             | 0.8550                    | 0.9468                           |
| Ours             | 2.5×            | 0.8440                      | 0.8666                             | 0.8888                    | 0.9412                           |
| MoCo             | 5×              | 0.7948                      | 0.8792                             | 0.8516                    | 0.9366                           |
| MoCo v2          | 5×              | 0.7952                      | 0.8740                             | 0.8392                    | 0.9386                           |
| Finetune (bs=32) | 5×              | 0.7998                      | 0.8910                             | 0.8502                    | 0.9422                           |
| Ours             | 5×              | 0.8102                      | 0.8868                             | 0.8566                    | 0.9428                           |
| MoCo             | 10×             | 0.7824                      | 0.8804                             | 0.8338                    | 0.9420                           |
| MoCo v2          | 10×             | 0.8056                      | 0.8888                             | 0.8652                    | 0.9446                           |
| Finetune (bs=32) | 10×             | 0.7944                      | 0.8874                             | 0.8504                    | 0.9478                           |
| Ours             | 10×             | 0.8012                      | 0.9008                             | 0.8372                    | 0.9466                           |
| Ensemble (ours)  | 2.5× & 5× & 10× | 0.8416                      | 0.9068                             | 0.8776                    | 0.9522                           |

4. CONCLUSION

In this study, we proposed a self-supervised learning method for classifying pathological images as EBV or non-EBV, aided by a novel positive pair formation based on neighboring patches in each WSI slide. By training a feature extractor with self-supervised learning, potentially many more image patches can be used to efficiently train only the classifier head. In addition, the fusion of multi-scale classifiers further improved the stability of EBV prediction.

REFERENCES

[1] World Health Organization, "World Health Statistics 2021: Monitoring Health for the SDGs, Sustainable Development Goals," (2021).

[2] Bray, F., Ferlay, J., Soerjomataram, I., et al., "Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries," CA: A Cancer Journal for Clinicians, 68(6), 394-424 (2018).

[3] Cancer Genome Atlas Research Network, "Comprehensive molecular characterization of gastric adenocarcinoma," Nature, 513(7517), 202-209 (2014). https://doi.org/10.1038/nature13480

[4] Kim, S. T., Cristescu, R., Bass, A. J., et al., "Comprehensive molecular characterization of clinical responses to PD-1 inhibition in metastatic gastric cancer," Nature Medicine, 24(9), 1449-1458 (2018). https://doi.org/10.1038/s41591-018-0101-z

[5] Qiu, M. Z., He, C. Y., Lu, S. X., et al., "Prospective observation: clinical utility of plasma Epstein-Barr virus DNA load in EBV-associated gastric carcinoma patients," International Journal of Cancer, 146(1), 272-280 (2020). https://doi.org/10.1002/ijc.v146.1

[6] Oikawa, K., Saito, A., Kiyuna, T., et al., "Pathological diagnosis of gastric cancers with a novel computerized analysis system," Journal of Pathology Informatics, 8(1), 5 (2017). https://doi.org/10.4103/2153-3539.201114

[7] Song, Z., Zou, S., Zhou, W., et al., "Clinically applicable histopathological diagnosis system for gastric cancer detection using deep learning," Nature Communications, 11(1), 1-9 (2020). https://doi.org/10.1038/s41467-020-18147-8

[8] Tsaku, N. Z., Kosaraju, S. C., Aqila, T., et al., "Texture-based deep learning for effective histopathological cancer image classification," in IEEE International Conference on Bioinformatics and Biomedicine, 973-977 (2019).

[9] Kather, J. N., Pearson, A. T., Halama, N., et al., "Deep learning can predict microsatellite instability directly from histology in gastrointestinal cancer," Nature Medicine, 25(7), 1054-1056 (2019). https://doi.org/10.1038/s41591-019-0462-y

[10] He, K., Zhang, X., Ren, S., et al., "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770-778 (2016).

[11] He, K., Fan, H., Wu, Y., et al., "Momentum contrast for unsupervised visual representation learning," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9729-9738 (2020).

[12] Chen, X., Fan, H., Girshick, R., et al., "Improved baselines with momentum contrastive learning," arXiv:2003.04297 (2020).

[13] Vahadane, A., Peng, T., Albarqouni, S., et al., "Structure-preserved color normalization for histological images," in IEEE International Symposium on Biomedical Imaging, 1012-1015 (2015).