Ensemble of convolutional neural networks for bioimage classification

Loris Nanni (Department of Information Engineering, University of Padua, Padova, Italy)

Stefano Ghidoni (Department of Information Engineering, University of Padua, Padova, Italy)

Sheryl Brahnam (Computer Information Systems, Missouri State University, Springfield, Missouri, USA)

Applied Computing and Informatics

ISSN: 2634-1964

Article publication date: 16 July 2020

Issue publication date: 4 January 2021

Downloads

2334

pdf (168 KB)

Abstract

This work presents a system based on an ensemble of Convolutional Neural Networks (CNNs) and descriptors for bioimage classification that has been validated on different datasets of color images. The proposed system represents a very simple yet effective way of boosting the performance of trained CNNs by composing multiple CNNs into an ensemble and combining scores by sum rule. Several types of ensembles are considered, with different CNN topologies along with different learning parameter sets. The proposed system not only exhibits strong discriminative power but also generalizes well over multiple datasets thanks to the combination of multiple descriptors based on different feature types, both learned and handcrafted. Separate classifiers are trained for each descriptor, and the entire set of classifiers is combined by sum rule. Results show that the proposed system obtains state-of-the-art performance across four different bioimage and medical datasets. The MATLAB code of the descriptors will be available at https://github.com/LorisNanni.

Citation

Nanni, L., Ghidoni, S. and Brahnam, S. (2021), "Ensemble of convolutional neural networks for bioimage classification", Applied Computing and Informatics, Vol. 17 No. 1, pp. 19-35. https://doi.org/10.1016/j.aci.2018.06.002

Publisher

:

Emerald Publishing Limited

License

Published in Applied Computing and Informatics. Published by Emerald Publishing Limited. This article is published under the Creative Commons Attribution (CC BY 4.0) license. Anyone may reproduce, distribute, translate and create derivative works of this article (for both commercial and non-commercial purposes), subject to full attribution to the original publication and authors. The full terms of this license may be seen at http://creativecommons.org/licences/by/4.0/legalcode

1. Introduction

Despite strong advances in automatic image analysis in recent years, in the field of medicine, expert clinicians remain the ones who typically make the final diagnostic determination of medical images. Automatic and semi-automatic analysis is gaining in importance, however, due to the massive growth in medical imaging technologies and thanks to some giant strides in the fields of image processing, pattern recognition, and image classification, all of which have made automatic analysis of medical images a viable alternative [1–3].

In general, bioimage processing often relies on approaches based on feature extraction from images that contain important information for a particular diagnostic task. Some of the best feature extraction methods for biological tissue analysis consider local textured patterns. A large variety of textural features have been employed in biomedical imaging classification systems, with some of these features combined together in ensembles under the assumption that different textural features extract different types of information from the same image [4,5]. Some typical methods for extracting textural features include Gabor filters and Haralick’s co-occurrence matrix [6]. Other feature extraction methods commonly used today are the Scale-Invariant Feature Transform (SIFT) and Local Binary Patterns (LBP) along with its many variants [7,8]. These feature extraction methods belong to what is often referred to as the class of handcrafted descriptors, so named because the algorithms are designed by researchers to detect specific characteristics considered important in the analysis of images.

Besides handcrafted features, some machine learning techniques have been developed that learn features automatically. This class of so-called learned features are also widely used in bioimage processing [9,10], but they tend to be limited in power because they rely heavily on the dataset used for training. This problem can be overcome by training on a very large dataset (or an ensemble of datasets) containing a broad set of images so that the system learns a wide variety of different patterns. In this way, the learned features become independent of any specific dataset and can be considered as general feature extractors. Like the handcrafted features mentioned above, these learned features can be used alone or in combination with other sets of features, both handcrafted and learned, to analyze new problems. Some examples along these lines include [9], where learned features are used for the detection of ovarian carcinomas, and [10], where learned features are combined with handcrafted features for histopathology image representation.

A powerful class of learned descriptors has recently been proposed that are based on the deep learning paradigm [11]. Deep learning has proven to be extremely effective in several image classification tasks, including medical image analysis [12]. Some examples include the detection/counting of mitotic events, the segmentation of nuclei, and many cancerous vs. noncancerous tissue evaluations [13].

A deep learning architecture that has been studied extensively is the Convolutional Neural Network (CNN) [14], which is a multi-layered image classification technique that incorporates spatial context and weight sharing between pixels. A CNN learns the optimal image features for a specific image classification problem by adopting an effective representation of the original image. Inspired by the process of visual perception in human beings, it requires little to no preprocessing. The basic components of a CNN are stacks of different types of specialized layers (convolutional, activation, pooling, fully-connected, softmax, etc.) that are interconnected and whose weights are trained using the backpropagation algorithm. The deepest layers of the network function as low-level feature extractors. The training phase of a CNN requires huge numbers of labelled data to avoid the problem of over-fitting; however, once trained, CNNs are capable of producing accurate and generalizable models that achieve state-of-the-art performance in general pattern recognition tasks. Some examples include LeNet [15], the first CNN proposed to classify handwritten digits; AlexNet [16], a deep network designed for image classification; ZFNet [17], a newer model that outperforms AlexNet; VGGNet [18], which increases depth using 3 × 3 convolution filters; GoogLeNet [19], which includes inception modules (which is a new organizational structure); and ResNet [20], a residual network that is much easier to optimize than VGGNets. The CNN architecture and the cited examples are discussed in more detail in Section 2.

When deep neural networks are trained on large datasets of images, the first convolutional filters learned by the network often resemble either Gabor filters or color blobs that are easily transferable to many other image tasks and datasets [21]. Pre-trained models can thus be used to extract learned features from novel sets of images, and these features can then be fed into other classifiers, similar to the way handcrafted features are used. Conversely, features computed in the last layer of a pretrained network are strongly dependent on the dataset used to train the deep learner and thus on the specific classification problem represented by a given dataset. Nonetheless, the outputs of these layers can be used for other tasks if CNN fine-tuning is exploited.

All three deep learning methods described above are used in medical and bioimage classification [22]. To summarize the possibilities mentioned so far: a) deep learners can be trained on images from scratch (as in [23]); b) pre-trained CNNs can function as additional feature extractors that can be combined with existing handcrafted image features (as in [24,25]); and c) the outputs of pre-trained CNNs can be fine-tuned by another simpler classifier, such as SVM, on novel target images (as in [26,27]). Yet another class of approaches combines different CNN architectures to exploit the strengths and offset the weaknesses of a given architecture [27].

In this work, we investigate methods for building ensembles of CNNs by leveraging pre-trained CNNs. We consider several different training patterns and experiments using different learning rates, batch sizes, and topologies. What is interesting is that this simple approach produces a very high performing system, one that strongly outperforms the single best CNN trained specifically on a given dataset. Of course, there are both pros and cons involved in combining different CNNs. Although ensembles of CNNs perform exceptionally well, training such models requires high computational power (in this work we used three TitanX GPUs). Moreover, the total size of the network set is quite large, requiring considerable computational power for input classification. Hence, this approach is suitable only for problems where computation time is not critical.

Aside from exploring different ensembles of CNNs, we also consider combining heterogeneous handcrafted descriptors for bioimage classification. The best system proposed in this work combines both learned and handcrafted descriptors. For each descriptor, a different classifier is trained, and the set of classifiers along with the classification results from the deep learners are combined by sum rule. The handcrafted descriptors tested in this paper are summarized in Section 3, and the power of this approach is validated on four different biomedical color datasets.

We wish to stress that the main goal of the proposed system is to produce a powerful general-purpose image classification system able to work out-of-the-box (i.e. requiring little to no parameter tuning) on any bioimage classification problem. We strive to produce a general-purpose system that performs competitively against less flexible systems that have been optimized for very specific image problems and datasets. Experimental results demonstrate that the proposed system obtains state-of-the-art performance in every tested bioimage problem. Yet the same set of descriptors is used in all the tested datasets, demonstrating the generalizability of the proposed approach.

2. Deep learned features

CNNs are a class of deep feed-forward neural networks. Like most neural networks, CNNs are composed of interconnected neurons that have inputs with learnable weights, biases, and activation functions.

CNN layers have neurons arranged in three dimensions: width, height and depth. This means that every layer in a CNN transforms a 3D input volume into a 3D output volume of neuron activations. CNNs are built with five classes of layers: convolutional (CONV), activation (ACT), pooling (POOL), followed by a last stage, including Fully-Connected (FC), and classification (CLASS).

The CONV layer is the core building block of a CNN and is also what makes CNNs so computationally expensive. These layers compute the outputs of neurons that are connected to local regions by applying a convolution operation to the input. The spatial extent of connectivity of these local regions is a hyperparameter called the receptive field, and a parameter sharing scheme is used in CONV Layers to control the number of parameters. This means that the parameters of CONV layers are shared sets of weights (also called kernels or filters) that have relatively small receptive fields.

POOL layers perform non-linear downsampling operations. Max pooling is the most common non-linear operation: it partitions the input into a set of non-overlapping rectangles and outputs the maximum for each group. In this way POOL reduces the spatial size of the representation while simultaneously reducing 1) the number of parameters, 2) the possibility of overfitting, and 3) the computational complexity of the network. It is common practice to insert a POOL layer between CONV layers.

ACT layers apply some activation function, such as the non-saturating ReLU (Rectified Linear Unit) function f(x)=max(0, x) or the saturating hyperbolic tangent f(x)=tanh(x), f(x)=|tanh(x)| or the sigmoid function f(x)=(1+e−x)−1.

FC layers have neurons that are fully connected to all the activations in the previous layer and are applied after CONV and POOL layers.

In this work, we test and combine the following CNN architectures:

• AlexNet [16]: this is the 2012 winner of the ImageNet ILSVRC challenge. AlexNet is a popular CNN that is composed of both stacked and connected layers. It includes five CONV layers followed by three FC layers, with some max-POOL layers inserted in the middle. A rectified linear unit nonlinearity is applied to each convolutional along with a fully connected layer to enable faster training.
• GoogleNet [19]: this is the 2014 winner of the ImageNet ILSVRC challenge. The main novelty of this CNN is the introduction of an inception module (INC), i.e. a subnetwork consisting of parallel convolutional filters whose outputs are concatenated. INC greatly reduces the number of parameters required (much lower than AlexNet). GoogleNet is composed of 22 layers that require training (27 layers in total, counting the POOL layers).
• VGGNet [18]: this is a CNN that placed second in ILSVRC 2014. The two best-performing VGG models (VGG-16 and VGG-19), with 16 and 19 wt layers, respectively, are available as pretrained models. Both models are very deep and include 16 CONV/FC layers. The CONV layers are extremely homogeneous and use very small (3 × 3) convolution filters. A POOL layer is inserted after two or three CONV layers (instead after each CONV layer as is the case with AlexNet).
• ResNet [20]: this is the winner of ILSVRC 2015. This network is approximately twenty times deeper than AlexNet and eight times deeper than VGGNet. The main novelty of this CNN is the introduction of residual (RES) layers, making it a “network-in-network” architecture. ResNet uses special skip connections and batch normalization, and the FC layers at the end of the network are substituted by global average pooling. Instead of learning unreferenced functions, ResNet explicitly reformulates layers as learning residual functions with reference to the layer inputs. As a result, ResNet is much deeper than VGGNet, although the model size is smaller and thus easier to optimize than VGGNet.
• Inception [19]: InceptionV3 is a variant of GoogleNet based on the factorization of 7 × 7 convolutions into two or three consecutive layers of 3 × 3 convolutions.
• IncResv2 [28]: Inception-ResNet-v2 is an Inception style networks that utilize residual connections instead of filter concatenation.

As noted in the introduction, the learning effectiveness of a CNN depends on the availability of large training data. Data augmentation is one effective way to expand training data when necessary and to reduce overfitting during CNN training by artificially expanding the training set using perturbations of individual images [16]. Data augmentation applies transformations and deformations to the labeled data, thus producing new samples as additional training data. A key attribute of the data augmentation process is that the labels remain unchanged after applying the transformations. In this work we perform random data augmentation with horizontal and vertical flipping, rotation in a range of 10°, translation of a maximum of five pixels, and scaling in a range of [1,2].

Fine-tuning a CNN is a procedure that essentially restarts the retraining process of a pretrained network so that it learns a different classification problem. We adopt the Two-Round Tuning for fine-tuning a CNN. With Two-Round Tuning, the first round of tuning is performed by training a CNN using a leave-one-out dataset strategy, e.g. by including in the training set all the images from the dataset summarized in Table 2 except for the target dataset. The final number of classes becomes the sum of all the classes from each classification problem. The second round of tuning is the same as in One Round Tuning and involves only the training set of the target problem.

In keeping with the rationale of the Data Augmentation step, we use the following datasets in the first round of tuning:

• PAP: the PAP SMEAR dataset [29], which contains 917 images acquired during Pap tests to identify cervical cancer diagnosis (available at http://labs.fme.aegean.gr/decision/downloads);
• LG: the “Liver gender” [30] dataset, which includes 265 images of liver tissue sections from 6-month male and female mice on a caloric restriction diet (the classes are the 2 genders);
• LA: the “Liver aging” [30] dataset, which includes 529 images of liver tissue sections from female mice of 4 ages on an ad-libitum diet;
• BR: the BREAST CANCER dataset [31], which contains 1394 images divided into the control, malignant cancer, and benign cancer classes;
• HI: the HISTOPATHOLOGY dataset [32], which contains 2828 images of connective, epithelial, muscular, and nervous tissue classes.
• RPE: a dataset composed of 195 human stem cell-derived retinal pigmented epithelium images that were divided into 16 subwindows with each subwindow divided into four classes by two trained operators (available at https://figshare.com/articles/BioMediTech_RPE_dataset/2070109).

We fine-tune the weights of the pretrained CNNs by fixing the deep CONV layers of the network and by fine-tuning only the higher-level FC layers since these layers are specific to the details of the classes contained in the target dataset. The last FC layer is designed to be the same size as the number of classes in the new dataset. All the FC layers are initialized with random values and trained from scratch using the Stochastic Gradient Descent (SGD) algorithm with data from the target training set.

3. Handcrafted features

In Table 1 we summarize the handcrafted descriptors used in our tests, along with the parameter sets used to extract each descriptor. Each descriptor is trained on an SVM, and only the training data is used to fix its parameters. Since we are working with RGB color datasets, each texture descriptor is applied separately to each RGB channel, with the final score given by the sum rule of the three classifiers trained with the three set of features.

As it can be observed in Table 1, many of the handcrafted texture descriptors are based on Local Binary Patterns (LBP), a descriptor that has achieved great success due to its computational efficiency and discriminative power. The traditional LBP [44] is expressed as

(1)LBPP,R=∑P=0P−1s(x)2P

where x=qp−qc is the difference between the intensity levels of a central pixel (q_c) and a set of neighbouring pixels (q_p). A neighbourhood is defined by a circular region of radius R and P neighbouring points. The function s(x) in Eq. (1) is defined as:

(2)s(x)={1,x≥00,otherwise

LBP descriptors are the histograms of these binary numbers.

3.1 The Local Ternary Pattern (LTP)

LTP [33] is a ternary variant of LBP and is designed to reduce the noise in the feature vector when uniform regions are analyzed. LTP proposes a three-value coding scheme that includes a threshold around zero for the evaluation of the local gray-scale difference by adding to Eq. (2) the threshold τ:

(3)s(x)={1,x≥τ0,|x|≤τ−1,x≤−τ

3.2 Multithreshold Local Phase Quantization (MLPQ)

MLPQ [34] extends the multi-threshold approach described for LBP to the LPQ feature [45,46] that is based on the phase of the Short-Term Fourier Transform (STFT) evaluated on a rectangular neighborhood of size R. The MLPQ features used in our experiments are computed using parameter belonging to the following ranges: τ ∈ {0.2, 0.4, 0.6, 0.8, 1}, R ∈ {1, 3, 5}, a ∈ {0.8, 1, 1.2, 1.4, 1.6} and ρ ∈ {0.75, 0.95, 1.15, 1.35, 1.55, 1.75, 1.95}. Such sets were proposed in [47].

3.3 Completed LBP (CLBP)

CLBP, proposed in [35], encodes a texture by means of two components: the difference sign and the different magnitude that is computed between a reference pixel and all the pixels belonging to a given neighborhood. CLBP represents a local region by its centre pixel (CLBP-C) and a local difference sign-magnitude transform (LDSMT). This is what produces the difference signs and the difference magnitudes.

Two operators, CLBP-Sign (CLBP_S) and CLBP-Magnitude (CLBP_M), are defined for the difference signs and the difference magnitudes. Since all three descriptors (CLBP_C, CLBP_S and CLBP_M) are in binary format, they can be combined to form the final CLBP histogram.

Given a central pixel gc and its P evenly spaced circular neighbors gc, gp, p,1, … ,P-1, the difference between gc and gp can be calculated as dp=gp−gc and be decomposed into two components defining the LDSMT transform: dp=Sp∗mp and {Sp=sign(dp)mp=|dp|,

(4)Sp={1,dp≥0−1,dp<0

where Sp is the sign of dp, and mp is the magnitude of dp. Thus, the LDSMT transforms the vector [d0, …, dP-1] into a sign vector [S0, …, SP-1] and a magnitude vector [m0, …, mP-1].

The CLBP_S operator is the traditional LBP operator defined in Eq. (1). The CLBP_M is defined as:

LBPMP,R=∑P=0P−1t(mp, c)2P

(5)t(x, c)={1,x≥c0,x<c

where c is the mean value of m.

The center pixels represent the image gray level and thus contains discriminant information. These values are converted into a binary code by global thresholding, which makes them consistent with CLBP_S and CLBP_M as CLBP_CP,R=t(gc, c1), where t is the threshold defined in Eq. (5), and c1 is the average gray level of the white image.

Combining CLBP_S, CLBP_M, and CLBP_C features into joint or hybrid distributions results in significant improvement for rotation invariant texture classification. The CLBP_S, CLBP_M, and CLBP_C histograms are concatenated to obtain the CLBP descriptor.

3.4 Multiscale Rotation Invariant Co-occurrence of Adjacent LBP (RIC)

RIC [36] considers the co-occurrence in the context of LBP features, or the spatial relations among pixels. This feature adds rotational invariance for angles that are multiples of 45°. RIC depends on two parameters, namely, LBP radius and the displacement among the LBPs. The values used in our experiments are: (1, 2), (2, 4) and (4, 8).

3.5 Full BSIF (FBSIF)

FBSIF [37] is an extension of the Binarized Statistical Image Feature (BSIF) [48], that assigns each pixel of the input image a n-bit label obtained by means of a set of n linear filters. Each filter operates on a neighborhood of l×l pixels around the element they should give the label. This n-bit label can be formalized as:

(6)s=WX

where X is a vector of length l² × 1 obtained from the neighborhood, while W is a n × l² matrix including the filters vector notations. FBSIF operates by evaluating BSIF using several values of the filter size (SIZE_BSIF) and a binarization threshold (FULL_BSIF). Values considered in this work are: SIZE_BSIF ∈ {3, 5, 7, 9, 11}, FULL_BSIF ∈ {−9, −6, −3, 0, 3, 6, 9}. Each combination of size and threshold is fed to a separate SVM: the SVMs are then combined by sum rule.

3.6 Adaptive Hybrid Pattern (AHP)

AHP [38] descriptors were created to overcome two main drawbacks of the LBP feature: 1) its noisy behavior in quasi-uniform regions and 2) its reactivity, that is, the strong variations in the descriptor that are possibly induced by small variation in the input image, which is caused by the use of quantization thresholds.

AHP overcomes both problems by using a Hybrid Texture Model (HTD) composed of local primitive features and global spatial structure and then by applying an adaptive quantization algorithm (AQA) to improve the noise robustness of the angular space quantization. In this way, the vector quantization thresholds are adaptive to the content of the local patch. AQA extracts the discriminative texture information provided by primitive microfeatures.THTD is defined as:

(7)THTD≈Tglobal+Tlocal

where THTD represents the texture, Tglobal the global texture information, and Tlocal the local texture information. Tglobal is the joint distribution of the global difference between gray values of the circular symmetric neighborhoods and the mean value from the whole texture image. Tlocal is the joint distribution of the local differences between the gray value of the center pixel and the gray values of the circularly symmetric neighborhoods.

The length of the feature histogram of the whole image is reduced by splitting the global pattern and the local pattern into multiple binary patterns using the threshold calculations in [49] and [50].

3.7 Gaussian of Local Descriptors (GOLD)

GOLD [39] is based on a four-step algorithm: i) evaluation of SIFT features; ii) spatial pyramid decomposition; iii) parametric probability density estimation; iv) the covariance matrix is projected onto the tangent Euclidean space in order to vectorize the feature. In other words, GOLD descriptors are obtained by extracting some descriptors from an image to obtain D={D1,…,Dℕ}, where D_i ∈ ℜn, by collecting and weighting them in a spatial pyramid, and then by describing each subregion by the estimated parameters of a multivariate Gaussian distribution. To vectorize the descriptors, the covariance matrix is projected onto a Euclidean space and concatenated to the mean vector to obtain the final descriptor of size (n² + 3n)/2. Finally, the feature vector is fed into an SVM with a histogram kernel.

3.8 Histogram of Oriented Gradients (HOG)

HOG [40] groups pixels into small windows and measures intensity gradients in each of them. It is possible to view HOG as a simplified version of SIFT. HOG calculates intensity gradients, pixel by pixel; and the selection of a corresponding histogram bin for each pixel is based on the gradient direction. A histogram is then evaluated for each window, leading to the final descriptor. Windows of size 5 × 6 are used in our experiments.

3.9 Color descriptor (COL)

COL, proposed in [51], is a simple and compact descriptor, acquired combining statistical measures extracted from each color channel in the RGB space. The final descriptor is obtained as the concatenation of several measures: the mean, the standard deviation, the 3rd and 5th moments of each color channel and the marginal Histograms (8 bins per channel) [51].

3.10 Morphological descriptor (MOR)

MOR, proposed in [41], is a set of measures extracted from a segmented version of the image, including the aspect ratio, number of objects, area, perimeter, eccentricity, and other measures.

3.11 CodebookLess Model (CLM)

CLM [42] is based on an image modeling method that can represent an image by means of a single Gaussian. This is obtained by first evaluating SIFT features on a regular grid placed on the image. Thus, CLM is a dense sampling features model, and fitting them using a Gaussian model. The main difference between CLM and the other widely used dense sampling method, such as the BoF approach [52], is the absence of a codebook.

According to the experiments reported in [24], we select for CLM the ensemble named CLoVo_3 in [24] based on e-SFT, PCA for dimensionality reduction and one-vs-one SVM for the training phase.

3.12 LETRIST descriptor (LET)

LET, proposed in [43], is simple but effective representation that encodes the joint information within an image across feature and scale spaces. We use the default values available in the MATLAB toolbox.

4. Materials

Several medical datasets were used to test our system and demonstrate the generalizability of our approach. Each dataset contains different types of medical images. For the sake of easy comparisons, the datasets used in our experiments were selected because they are publicly available:

• LY: the LYMPHOMA dataset [53], which includes 375 images of malignant lymphoma subdivided in three classes: CLL (chronic lymphocytic leukemia), FL (follicular lymphoma), and MCL (mantle cell lymphoma).
• BGR: the BREAST GRADING CARCINOMA [54], which is a medium size dataset containing 300 images (Grade 1: 107, Grade 2: 102, and Grade 3: 91 images) of resolution 1280 × 960 corresponding to 21 different patients with invasive ductal carcinoma of the breast.
• LAR: the LARYNGEAL dataset [55], which contains a well-balanced set of 1320 patches extracted from the endoscopic videos of 33 patients affected by laryngeal squamous cell carcinoma (SCC). The patches are relative to four laryngeal tissue classes. LAR contains color images. In our experiments with this dataset each descriptor is separately extracted from each color channel.
• CO: the COLORECTAL dataset [56], which is a collection of textures obtained by manual annotation and tessellation of histological images of human colorectal cancer.

Table 2 summarizes some important characteristics of each dataset including the number of classes (#C), the number of samples (#S) (i.e. the number of images), the image size, and the URL for downloading the dataset. The testing protocol used in our experiments is the fivefold cross-validation method except in those case where the database is specifics its own protocol.

5. Experimental results

The experimental evaluations reported in this section are intended first to compare the performance of handcrafted descriptors to deep learned descriptors on several cancer data analysis classification tasks and second to evaluate the performance of several ensembles based on the fusion of classifiers. Our main objective is to design a method that is both robust and effective on different classification problems. To assess the generalizability and robustness of our system, our best performing method is finally compared with several state-of-the-art results published by different researches on the same datasets. Note: Before each fusion, the scores of the classifiers of each descriptor are normalized to mean 0 and standard deviation 1. Experiments reported below were statistically validated using the Wilcoxon signed rank test.

In the first experiment, reported in Table 3, we evaluate the performance (using accuracy as the performance indicator) of the baseline handcrafted descriptors described in Section 3. Moreover, the performance obtained by the following ensembles of handcrafted methods are compared:

• FH: the fusion by sum rule of the following handcrafted methods LTP, CLBP, RIC, LET, MOR, AHP, COL, MLPQ and FullBSIF. We have not use GOLD and CLM in FH since they are computational expensive. Note: the scores of each method are normalized to mean zero and standard deviation 1 so that the importance of MLPQ and FullBSIF (that are methods based on ensemble combined by sum rule) is equal to the other approaches;
• FH + CLM: sum rule among the methods belonging to FH and CLM, i.e. the sum rule among the nine methods of FH and CLM;
• FH + CLM+GOLD: sum rule among the methods belonging to FH, GOLD and CLM;
• PREV: the ensemble of handcrafted features proposed in [24];
• PREV1: ensemble of handcrafted features proposed in [57].

Clearly, the fusion approaches FH and FH + CLM+GOLD works better (Wilcoxon signed rank test – p-value of 0.05) than the stand-alone methods and the previous handcrafted ensembles PREV and PREV1.

In the second experiment, see Tables 4 and 5, we test the feasibility of building an ensemble of convolutional neural networks¹ as follows:

• Training CNNs using different learning rates (LR), i.e. 0.001 & 0.0001;
• Training CNNs using different batch sizes (BS), i.e. 10, 30, 50 and 70;
• Training CNNs using different topologies.

In Table 4 we report experiments using standard tuning. In Table 5 we report the performance of the Two-Round tuning detailed in Section 2.

The following methods are also reported in Tables 4 and 5:

• SB: the single best CNN configuration in that dataset. This method is clearly overfitted since we report the best result on the testing set after running different parameter configurations and choosing the best one. It is important to report SB as baseline performance for the proposed ensemble.
• AB: best average CNN configuration in all the datasets.
• Fus: fusion among all the different CNNs trained varying the parameter configuration. If the CNN does not converge (i.e. it produces random results on the training data, which usually happens with AlexNet and VggNet with LR = 0.001), the CNN is excluded from the ensemble. Note that it is not always feasible to train a CNN with a large batch size. In other words, if for a given BS we obtain a “GPU out of memory” error message, we discard that CNN configuration.
• FCN-st: fusion among the methods Fus of the all CNN topologies trained using standard tuning (column All) or all the topologies trained using standard tuning except AlexNet (column NoAlex). The scores of each given CNN topology are normalized considering how many CNNs of that topology are effectively used in the fusion Fus (i.e. by excluding all CNNs that produce random results on the training set or an out of memory error message).
• FCN-Two: fusion among the methods Fus of the all CNN topologies trained using Two-round tuning (column All) or all the topologies trained using Two-round tuning except AlexNet (column NoAlex). The scores of each given CNN topology are normalized considering how many CNNs of that topology are effectively used in the fusion Fus (i.e. by excluding all CNNs that produce random results on the training set or an out of memory error message).

Notice that Two-Round Tuning is applied on a reduced number of topologies due to computational issues.

The following conclusions can be drawn from the results reported in Tables 4 and 5:

• For each topology, Fus outperforms AB (Wilcoxon signed rank test – p-value of 0.05);
• FCN outperforms each Fus (Wilcoxon signed rank test - p-value of 0.05);
• FCN-two obtains performance similar to FCN-st.

In Table 6 the ensemble of CNNs is combined with other methods. The ensembles evaluated in Table 6 are the following:

• FCN+: sum rule among the methods that belong to FCN-st and FCN-Two;
• Here1: sum rule between (FCN + NoAlex) and FH; before fusion the scores of (FCN + NoAlex) and FH are normalized to mean 0 and standard deviation 1;
• Here2: sum rule between (FCN + NoAlex) and (FH + CLM+GOLD); before fusion the scores of (FCN + NoAlex) and FH are normalized to mean 0 and standard deviation 1;

In Table 7 we compare our ensemble Here1 with the literature, for a fair comparison we have reported methods based on the same testing protocol used to assess the performance of our approaches.

The following conclusions can be drawn from the results reported in Tables 6 and 7:

• Here1 and Here2 outperform FCN+; since Here1 is simpler then Here2 our suggestion is to use Here1;
• Here1 obtains state-of-the-art-performance; e.g. in [55] a median F-measure of 92 is obtained in the LAR dataset, while our ensemble obtains an F-measure of 95.2.

Finally, in Table 8, we report the performance obtained by some ensemble proposed in this paper using the Kappa statistic [60] to measure the agreement between true and predicted class labels.

The conclusions that can be drawn by the results reported in Table 8 are similar to those that can be drawn by the performance reported in Table 6.

To better motivate the reason of the good performance of the ensemble of CNNs we calculate the Yule’s Q-statistic [61] among the methods that build the ensemble. The Q-statistic is used to provide information about the correlation among the output of different classifiers. The average Q-statistic among the different CNNs that build FCN-st is 0.7098, hence the different CNNs brings different information and their combination permits to boost the performance of the stand-alone CNN.

6. Conclusion

In this work an ensemble of CNNs is proposed for cancer related color datasets. The ensemble is built in a very simple way by training and comparing the performance of CNNs using different learning rates, batch sizes, and topologies. The set of CNNs is simply combined with the sum rule. The most important finding of this work is that this simple ensemble outperforms the best stand-alone CNN. When the ensemble of CNNs is combined with other features based on handcrafted features, the final ensemble obtains state-of-the-art performance on all the four tested datasets. For each handcrafted features a different support vector machine is trained, than the set of SVMs is combined by sum rule; also, the fusion between deep learning ensemble and handcrafted features ensemble is performed by sum rule. Notice that, before the fusion, the set of scores of each ensemble is normalized to mean 0 and standard deviation 1.

In the future, we plan to develop and test different approaches for representing images using CNNs. Features extracted from these CNNs will then be used to train SVM classifiers. To reproduce our experiments, MATLAB source code will be available at https://github.com/LorisNanni.

Table 1

Summary Handcrafted Descriptors.

Name	Parameters	Source	Section
LTP	Multiscale Uniform LTP with two (R,P) configurations: (1, 8) and (2, 16), threshold = 3.	[33]	3.1
MLPQ	Ensemble of LPQ descriptors obtained by varying the filter sizes, the scalar frequency, and the correlation coefficient between adjacent pixel values.	[34]	3.2
CLBP	Completed LBP with two (R,P) configurations: (1,8) and (2,16).	[35]	3.3
RIC	Multiscale Rotation Invariant Co-occurrence of Adjacent LBP with R ∈ {1, 2, 4}.	[36]	3.4
FBSIF	Extension of the BIF by varying the parameters of filter size (SIZE_BSIF, size ∈ {3, 5, 7, 9, 11}) and the threshold for binarizing (FULL_BSIF, th ∈ {−9, −6, −3, 0, 3, 6, 9}).	[37]	3.5
AHP	Adaptive Hybrid Pattern with quantization level = 5 and 2; the (R,P) configurations are (1, 8) and (2, 16).	[38]	3.6
GOLD	Ensemble of Gaussians of LOcal Descriptors extracted using the spatial pyramid decomposition.	[39]	3.7
HOG	Histogram of Oriented Gradients with 30 cells (5 by 6).	[40]	3.8
MOR	A set of MORphological features.	[41]	3.9
CLM	CodebookLess Model. We use the ensemble named CLoVo_3 in [15] based on e-SFT, PCA for dimensionality reduction, and one-vs-all SVM for the training phase.	[42]	3.10
LET	Same parameters used in the source code of [43]	[43]	3.11

Table 2

Descriptive Summary of the Datasets.

Dataset	#C	#S	Size	URL for Download
BGR	3	300	1280 × 960	https://zenodo.org/record/834910#.Wp1bQ-jOWUl
LY	3	375	1388 × 1040	ome.grc.nia.nih.gov/iicbu2008
LAR	3	1320	100 × 100	https://zenodo.org/record/1003200#.WdeQcnBx0nQ
CO	8	5000	150 × 150	zenodo.org/record/53169#.WaXjW8hJaUm