
Collaborative regression-based anatomical landmark detection


Published 18 November 2015 © 2015 Institute of Physics and Engineering in Medicine
Citation: Yaozong Gao and Dinggang Shen 2015 Phys. Med. Biol. 60 9377. DOI: 10.1088/0031-9155/60/24/9377


Abstract

Anatomical landmark detection plays an important role in medical image analysis, e.g. for registration, segmentation and quantitative analysis. Among the various existing methods for landmark detection, regression-based methods have recently attracted much attention due to their robustness and efficiency. In these methods, landmarks are localised through voting from all image voxels, which is completely different from the classification-based methods that use voxel-wise classification to detect landmarks. Despite their robustness, the accuracy of regression-based landmark detection methods is often limited due to (1) the inclusion of uninformative image voxels in the voting procedure, and (2) the lack of effective ways to incorporate inter-landmark spatial dependency into the detection step. In this paper, we propose a collaborative landmark detection framework to address these limitations. The concept of collaboration is reflected in two aspects. (1) Multi-resolution collaboration. A multi-resolution strategy is proposed to hierarchically localise landmarks by gradually excluding uninformative votes from faraway voxels. Moreover, for informative voxels near the landmark, a spherical sampling strategy is also designed at the training stage to improve their prediction accuracy. (2) Inter-landmark collaboration. A confidence-based landmark detection strategy is proposed to improve the detection accuracy of 'difficult-to-detect' landmarks by using spatial guidance from 'easy-to-detect' landmarks. To evaluate our method, we conducted extensive experiments on three datasets for detecting prostate landmarks and head & neck landmarks in computed tomography images, and also dental landmarks in cone beam computed tomography images. The results show the effectiveness of our collaborative landmark detection framework in improving landmark detection accuracy, compared to other state-of-the-art methods.


I. Introduction

Anatomical landmark detection aims to automatically localise specific points of interest in the human anatomy. These points are named landmarks, and often lie on the organ/structure boundary. Landmarks are important in registration, segmentation and quantitative analysis, e.g. for landmark-guided deformable registration (Han et al 2014), model initialisation in deformable segmentation (Lay et al 2013, Gao et al 2014) and dental deformity quantification (Ren et al 2014). Despite its importance, anatomical landmark detection remains a challenging problem for several reasons: (1) poor image contrast, (2) image artifacts and (3) large appearance variations of the landmark.

Figure 1 gives an example of one prostate landmark which lies on the boundary between the prostate and the rectum. Its local appearance could dramatically change due to the uncertainty of bowel gas in the rectum. Besides, CT scans may be acquired after the injection of a contrast agent, which changes the surrounding appearance of the landmark and makes automatic landmark detection even more challenging.


Figure 1. Illustration of one prostate landmark (red point) in transversal and sagittal views of three patients. This landmark is located at the most posterior point of the prostate on the prostate central slice. The three column panels show three patients with different amounts of injected contrast agent in the bladder, and with different amounts of bowel gas in the rectum. (a) No contrast agent and large bowel gas; (b) partial contrast agent and almost no bowel gas; (c) full contrast agent and some bowel gas.


Figure 2 gives an example of one tooth landmark in cone beam computed tomography (CBCT) images. As shown in the transversal view of figure 2(a), metal dental braces can cause severe streaking artifacts, which makes the landmark difficult to recognise. Besides, the challenges of teeth landmark detection can also come from various deformities in the patients. Figure 2(c) shows a patient with anterior open-bite. This deformity leads to dramatic changes in appearance of the same landmark across different patients, which increases the difficulty of landmark detection.


Figure 2. Illustration of one upper tooth landmark (red point) in different views of three patients. This landmark indicates a right central incisor on the maxilla. The column panels (a) and (b) show two patients with and without dental braces, respectively. Panel (c) shows a patient who cannot bite his teeth close together due to maxillary hypoplasia and mandibular hyperplasia. The red points indicate the positions of the same landmark in different views of various CBCT scans.


Due to the aforementioned challenges, it is difficult to empirically create all the rules to address the landmark detection problem. In the literature, researchers often rely on machine learning-based approaches to tackle this problem. The mainstream landmark detection methods can be categorised into two types: classification-based and regression-based landmark detection.

In the classification-based methods, strong classifiers are usually learned to distinguish the correct position of the anatomical landmark from the wrong ones. For example, Zhan et al (2011) used cascade AdaBoost classifiers to classify each image voxel for detecting anatomical landmarks in MR knee images. Zheng et al (2008) proposed marginal space learning, which used probabilistic boosting trees (Tu 2005) as classifiers to detect the positions of heart chambers for deformable model fitting. Gao et al (2014) proposed an online updating scheme named 'incremental learning with selective memory' to update the population-learned cascade classifiers with online collected patient-specific data for improving the accuracy of landmark detection in daily treatment CT images.

In contrast to the classification-based approaches, which often require voxel-wise classification to determine the correct landmark position, the regression-based approaches predict the landmark position from each image voxel. At the training stage, a regression model is often learned to predict the 3D displacement from any image voxel to the target landmark. At the application/testing stage, the learned regression model can be used to predict the 3D displacement for every voxel in the image. Then, based on the estimated 3D displacement, each image voxel casts one vote to a potential landmark location. Finally, all the votes from different image voxels are aggregated to localise the target landmark, e.g. at the voxel receiving the maximum number of votes. For example, Criminisi et al (2013) proposed using a regression forest with context-rich visual features for detecting bounding boxes of organs in CT images. Instead of determining the bounding box by checking only the local image features within the box, they showed that the context appearance information was also important in bounding box detection. Recently, researchers (Ebner et al 2014, Gao and Shen 2014) have shown that the bounding box detection method (Criminisi et al 2013) can also be easily extended to anatomical landmark detection. Besides the aforementioned methods, there are methods that combine both classification and regression for landmark detection. Lay et al (2013) first used a regression forest to detect candidate positions for each landmark, and then applied probabilistic boosting trees as classifiers to accurately identify the landmark location among all candidates.

Compared to classification-based methods, regression-based methods integrate context appearance information to localise landmarks, which makes them less sensitive to anatomical structures that have similar local appearances to the target landmark but completely different anatomical positions in the image. Recently, Cootes et al (2012) also showed that the random forest regression method is significantly faster and more accurate than the equivalent classification-based methods in driving the deformable segmentation of different datasets. Despite the success of recent regression-based landmark detection methods, they still suffer from several limitations:

  • (1)  
    The inclusion of faraway image voxels in the voting procedure. In the conventional regression-based method (Criminisi et al 2013), all image voxels are involved in voting for the landmark location. Since many voxels are far from the target landmark, they are not informative about the local anatomical variations around the landmark. Thus, including these voxels in the voting procedure limits the detection accuracy.
  • (2)  
    Neglect of landmark dependency at the detection step. Many anatomical landmarks are spatially dependent, and detecting them independently may cause inconsistent detection results. In the literature, most works (Zhan et al 2011, Zhang et al 2012, Donner et al 2013) exploit the landmark spatial dependency at a post-processing step, which is separate from the detection step. For example, Zhan et al (2011) exploited a linear spatial relationship between landmarks to correct wrongly localised landmarks on MR knee images. Donner et al (2013) adopted a Markov random field to find the optimal landmark configuration, given a set of landmark candidates. Because the spatial dependency is exploited after the detection step, it only helps filter out wrongly detected landmarks; it does not improve the accuracy of individual landmark detections.

In this paper, we propose a collaborative regression-based framework for solving the above limitations. Specifically, our framework consists of two components:

  • (1)  
    Multi-resolution collaboration. We propose a multi-resolution strategy named 'multi-resolution regression voting' to detect a landmark hierarchically. In the coarsest resolution, all image voxels are allowed to vote for the landmark position for rough localisation. Once the rough position is known, the landmark position can be refined by voting from nearby voxels. The training of our multi-resolution framework also takes into account the idea that nearby voxels are more useful for localising the landmark than faraway voxels. In particular, we propose a spherical sampling strategy that associates the sampling probability of a voxel with its distance to the target landmark. In this way, the spherical sampling strategy tends to draw more training samples towards the target landmark, thus improving the prediction accuracy for voxels near the target landmark.
  • (2)  
    Inter-landmark collaboration. We exploit the detection reliability of each landmark and then propose a confidence-based landmark detection strategy, which uses 'easy-to-detect' (reliable) landmarks to guide the detection of 'difficult-to-detect' (challenging) landmarks. In particular, we introduce context distance features which measure the displacements of an image voxel to reliable landmarks. Context distance features can be used to guide the detection of challenging landmarks because the displacements of an image voxel to reliable and challenging landmarks are often highly correlated. If this correlation is exploited, reliable landmarks can be used to improve the detection accuracy of challenging landmarks.

In the experiments, we extensively evaluate our method on 127 images, including 73 CT prostate images each with 6 landmarks, 14 CBCT dental images each with 15 landmarks, and 40 CT head & neck images each with 5 landmarks. The experimental results show that, with the proposed strategies, our method outperforms both the conventional regression-based method and a classification-based method in landmark detection. Moreover, our method is able to localise a landmark in about 1 s with accuracy comparable to the inter-observer variability.

The preliminary version of this work was published in Han et al (2014), where we used landmark detection for initialising deformable registration. The method described in this work extends our previous work in the following three aspects.

  • We propose a spherical sampling strategy in the multi-resolution framework. As validated in three datasets, the spherical sampling strategy improves the accuracy of landmark detection, compared to the conventional uniform sampling strategy.
  • We propose a collaborative landmark detection strategy, by using easy-to-detect landmarks to guide and improve the detection accuracy of difficult-to-detect landmarks. This strategy is important for detecting those challenging landmarks with a large variation of landmark appearance.
  • Compared to our previous work Han et al (2014), which was applied only to MRI brain images, we have now extensively evaluated our method on three different datasets. The results show that our method works not only for landmarks with clear appearances, but also for landmarks with indistinct appearances, such as prostate landmarks in CT images.

The rest of the paper is organised as follows. Section II presents the conventional regression-based landmark detection to familiarise readers with the overall flowchart. Section III elaborates on the proposed multi-resolution strategy. Section IV provides the details of confidence-based landmark detection. The experimental results of the different strategies on three applications are given in section V. Finally, sections VI and VII present the conclusion and discussion of the paper, respectively.

II. Regression-based landmark detection

In this section, we will first introduce the basics of a regression forest, which is often used as a regression model in conventional regression-based landmark detection. Then we will describe the conventional regression-based landmark detection method in detail.

II.A. Regression random forest

The regression random forest is a type of random forest specialised for non-linear regression tasks. It consists of multiple independently trained binary decision trees. Each binary decision tree is composed of two types of node: leaf nodes and split nodes. Each leaf node records the statistics that summarise the target values of all training samples falling into it. In our implementation, the mean $\mathbf{\bar{d}}\in {{\mathbb{R}}^{M}}$ and variance $\mathbf{v}\in {{\mathbb{R}}^{M}}$ are recorded in each leaf node, where M is the dimension of the target vector we want to predict/regress, e.g. M  =  3 in our case of detecting the location of a landmark in 3D images. Each split node stores a split function, which often uses a decision stump with one feature f and a threshold t, i.e. $\text{Split}\left(\Omega |\,f,t\right)=H\left({{\Omega }_{f}}<t\right)$ , where $\Omega $ represents an input sample, ${{\Omega }_{f}}$ is the value of feature f at sample $\Omega$ , and H is the Heaviside step function. If $\text{Split}\left(\Omega |\,f,t\right)=0$ , the sample $\Omega$ is sent to the left child of this split node. Otherwise, it is sent to the right child node.

Each binary decision tree in the regression random forest is independently trained with bootstrapping on both samples and features. Given a random subset of training samples and features, a binary decision tree is trained recursively, starting from the first split node (root). A good split function should separate training samples into two subsets with consistent target vectors. This could be achieved by maximising the variance reduction. Thus, the optimal parameters $\left\{\,{{f}^{*}},{{t}^{*}}\right\}$ of a split function can be found by maximising the following objective function:

$\left\{{{f}^{*}},{{t}^{*}}\right\}=\underset{f,t}{\arg \max}\sum_{i=1}^{M}\left(\mathbf{v}_{i}^{\text{split}}-\sum_{j\in \left\{L,R\right\}}\frac{{{N}^{j}}}{{{N}^{L}}+{{N}^{R}}}\mathbf{v}_{i}^{j}\right)\quad (1)

where $\mathbf{v}_{i}^{\text{split}}$ is the variance of the ith target of all the training samples arriving at the split node, ${{N}^{j}},j\in \left\{L,R\right\}$ , is the number of training samples split into the left/right child, given a pair of {f, t}, and $\mathbf{v}_{i}^{j}$ is the variance of the ith target of the training samples split into the left/right child node, i.e. j  =  L or j  =  R. To maximise equation (1), an exhaustive search over a random subset of features and thresholds is often conducted in the random forest optimisation (Criminisi et al 2011). Specifically, a set of thresholds is randomly sampled for each feature in the bootstrapped feature set. Every combination of feature and threshold is evaluated with equation (1) to find the optimal pair that achieves the maximum objective value. Once the split function is determined, it is used to split the training samples into two subsets: the left subset with training samples satisfying $\text{Split}\left(\Omega |\,f,t\right)=0$ , and the right subset with training samples satisfying $\text{Split}\left(\Omega |\,f,t\right)=1$ . For each subset, a split function can be similarly trained to further separate the training samples into subsets with more consistent target vectors. The split function is thus recursively trained until one of the stopping criteria is met: (1) the number of training samples is too small to split; (2) the maximum tree depth is reached. In these cases, the current node becomes a leaf node, and the statistics (i.e. the mean $\mathbf{\bar{d}}$ and variance $\mathbf{v}$ ) of the training samples falling into this node are stored for future prediction.
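
To make the exhaustive stump search concrete, the following Python sketch (our own naming and a simplified setting, not the authors' implementation) finds the pair {f*, t*} that maximises the variance reduction of equation (1) over a bootstrapped feature subset and randomly sampled thresholds:

import numpy as np

def variance_reduction(targets, feature_values, threshold):
    # Objective of equation (1) for one candidate stump (f, t).
    # targets: (N, M) target vectors at this node; feature_values: (N,) values of f.
    left = targets[feature_values < threshold]
    right = targets[feature_values >= threshold]
    if len(left) == 0 or len(right) == 0:
        return -np.inf                               # degenerate split
    n = len(targets)
    v_split = targets.var(axis=0)                    # per-target variance before split
    # variance reduction, summed over the M target dimensions
    return np.sum(v_split
                  - (len(left) / n) * left.var(axis=0)
                  - (len(right) / n) * right.var(axis=0))

def best_stump(targets, features, n_thresholds=100):
    # features: (N, F) matrix; each column is one bootstrapped Haar-like feature.
    best = (-np.inf, None, None)
    for f in range(features.shape[1]):
        vals = features[:, f]
        # randomly sampled thresholds within the feature's range
        for t in np.random.uniform(vals.min(), vals.max(), n_thresholds):
            gain = variance_reduction(targets, vals, t)
            if gain > best[0]:
                best = (gain, f, t)
    return best[1], best[2]

In a full forest implementation, this search is repeated at every split node on the training samples that reach it.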

At the testing stage, a testing sample is pushed into each binary decision tree, starting at the root node. Based on the split nodes learned at the training stage, the testing sample is guided towards the leaf nodes. When it arrives at a leaf node, the mean $\mathbf{\bar{d}}$ stored in the leaf node is retrieved to serve as the prediction result of this tree. Finally, the results from all different trees are fused to obtain the prediction result of the entire forest. Conventionally, averaging is often used to fuse the prediction results from different trees due to its simplicity and efficiency.

$\hat{{{\mathbf{d}}_{i}}}=\frac{1}{K}\sum_{k=1}^{K}{{\overline{{{\mathbf{d}}_{i}}}}^{(k)}}\quad (2)

where K is the number of trees in the forest, and $\hat{{{\mathbf{d}}_{i}}}$ is the ith predicted target for this testing sample. ${{\overline{{{\mathbf{d}}_{i}}}}^{(k)}}$ is the mean of the ith target stored in the leaf node reached in the kth tree. Since the variance of each leaf indicates the prediction uncertainty (i.e. a large variance indicates high uncertainty, while a small variance indicates low uncertainty), it is better to also exploit this information when fusing results from different trees. Therefore, in this paper we use variance-weighted averaging to fuse the prediction results from different trees:

$\hat{{{\mathbf{d}}_{i}}}=\frac{\sum_{k=1}^{K}\mathbf{w}_{i}^{(k)}\,{{\overline{{{\mathbf{d}}_{i}}}}^{(k)}}}{\sum_{k=1}^{K}\mathbf{w}_{i}^{(k)}},\qquad \mathbf{w}_{i}^{(k)}=\frac{1}{\mathbf{v}_{i}^{(k)}+\epsilon}\quad (3)

where $\mathbf{v}_{i}^{(k)}$ is the variance of the ith target stored in the leaf node reached in the kth tree, and $\mathbf{w}_{i}^{(k)}$ is the weight measuring the prediction confidence of the ith target by the kth tree, defined as the inverse of $\mathbf{v}_{i}^{(k)}$ : the smaller the variance, the larger the confidence. $\epsilon$ is a very small number ($1.0\times {{10}^{-6}}$ ) to handle the case when the variance of a leaf node is zero.
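
As a minimal illustration, the variance-weighted fusion of equation (3) takes only a few lines of Python (a sketch, assuming the per-tree leaf means and variances have already been collected):

import numpy as np

def fuse_tree_predictions(means, variances, eps=1e-6):
    # means, variances: (K, M) arrays of leaf statistics, one row per tree
    weights = 1.0 / (variances + eps)        # confidence = inverse variance
    return (weights * means).sum(axis=0) / weights.sum(axis=0)

Plain averaging (equation (2)) is recovered by setting all weights equal.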

II.B. Regression-based anatomical landmark detection

Regression-based landmark detection utilises context appearances to localise the target landmark. This characteristic differentiates it from classification-based landmark detection, which localises a landmark via voxel-wise classification according to the local appearance of each voxel. As a machine-learning-based approach, regression-based landmark detection has two stages, the training stage and the testing stage. At the training stage, the goal is to learn a regression model (i.e. a regression forest) that predicts the 3D displacement from any image voxel to the target landmark according to the local image appearance of the voxel. At the testing stage, the learned regression model is used to predict the 3D displacement for each image voxel in the new testing image. Based on the estimated 3D displacement to the target landmark, each image voxel casts one vote to a potential landmark position. Finally, by collecting votes from all the image voxels, the position that receives the maximum number of votes is taken as the detected landmark position. In the following paragraphs, the details of the respective training and testing stages are provided in the context of single-landmark detection for the sake of conciseness. However, they can also be used in a multi-landmark setting by assuming independence among landmarks.

Training stage: The input of the training stage is a number of training images, each with the landmark of interest annotated. To train a regression forest, training samples need to be extracted from these training images. In this paper, each training sample is a voxel from one training image, thus also referred to as a training voxel in the rest of the paper. The training voxel is represented by a feature vector and is associated with a target vector, which is the 3D displacement from this voxel to the target landmark in the same image.

The training stage consists of three successive steps: (1) sampling the training voxels, (2) extracting the feature and target vectors and (3) training the regression forest. Since step (3) is straightforward, we detail only steps (1) and (2) in the following paragraphs.

  • 1.  
    Sampling training voxels: theoretically all image voxels in all training images can be used as training voxels to train a regression forest. However, as each training voxel is often represented by a long feature vector, it is practically impossible to use all image voxels for training due to the limitations of memory and training time. Therefore, sampling is often used to draw a limited number of representative training voxels from each training image for training. In conventional regression-based landmark detection, uniform sampling is commonly adopted, where each voxel in the training images has the same probability of being sampled. For each training image, a fixed number τ of training voxels is uniformly and randomly sampled. After sampling, we have $\tau \times Z$ training voxels, where Z is the number of training images.
  • 2.  
    Extracting features and target vectors: as the landmark of interest is manually annotated on each training image, we can easily compute the target vector $\mathbf{d}$ of each sampled training voxel, i.e. $\mathbf{d}={{\mathbf{x}}^{\text{LM}}}-\mathbf{x}$ , where $\mathbf{x}$ and ${{\mathbf{x}}^{\text{LM}}}$ are the positions of the training voxel and the landmark, respectively. The features of each training voxel are calculated as 3D Haar-like features, which measure the average intensity at an arbitrary position, or the average intensity difference between two arbitrary positions, within the local patch of this voxel (see figure 3). Mathematically, the 3D Haar-like features used in our paper are formulated as:
    $f\left({{I}_{\mathbf{x}}}|{{\mathbf{c}}_{1}},{{s}_{1}},{{\mathbf{c}}_{2}},{{s}_{2}},\delta \right)=\frac{1}{s_{1}^{3}}\sum_{\mathbf{u}\in B\left({{\mathbf{c}}_{1}},{{s}_{1}}\right)}{{I}_{\mathbf{x}}}\left(\mathbf{u}\right)-\delta \cdot \frac{1}{s_{2}^{3}}\sum_{\mathbf{u}\in B\left({{\mathbf{c}}_{2}},{{s}_{2}}\right)}{{I}_{\mathbf{x}}}\left(\mathbf{u}\right),\quad \delta \in \left\{0,1\right\}\quad (4)
    where ${{I}_{\mathbf{x}}}$ denotes a local patch centred at voxel $\mathbf{x}$ , and $B\left(\mathbf{c},s\right)$ denotes the cubic block of size s centred at position $\mathbf{c}$ within the patch. $f\left({{I}_{\mathbf{x}}}|{{\mathbf{c}}_{1}},{{s}_{1}},{{\mathbf{c}}_{2}},{{s}_{2}},\delta \right)$ denotes one Haar-like feature with parameters $\left\{{{\mathbf{c}}_{1}},{{s}_{1}},{{\mathbf{c}}_{2}},{{s}_{2}},\delta \right\}$ , where ${{\mathbf{c}}_{1}}\in {{\mathbb{R}}^{3}}$ and s1 are the centre and size of the first (positive) block, respectively, and ${{\mathbf{c}}_{2}}\in {{\mathbb{R}}^{3}}$ and s2 are the centre and size of the second (negative) block, respectively. Note that ${{\mathbf{c}}_{1}}$ and ${{\mathbf{c}}_{2}}$ refer to the centres of the blocks relative to the patch rather than the overall image. $\delta \in \left\{0,1\right\}$ switches between the two types of Haar-like features (figure 3), with $\delta =0$ indicating one-block Haar-like features (figure 3(a)) and $\delta =1$ indicating two-block Haar-like features (figure 3(b)). By changing the parameters $\left\{{{\mathbf{c}}_{1}},{{s}_{1}},{{\mathbf{c}}_{2}},{{s}_{2}},\delta \right\}$ in equation (4), we can compute various Haar-like features that capture the average intensities and intensity differences at different locations in the patch (a minimal code sketch of this computation is given after figure 3). Following the idea of feature bootstrapping in the random forest, only a subset of Haar-like features is sampled to represent each training voxel by randomising the parameters $\left\{{{\mathbf{c}}_{1}},{{s}_{1}},{{\mathbf{c}}_{2}},{{s}_{2}},\delta \right\}$ .

Figure 3. Illustration of 3D Haar-like features. The red and blue boxes denote positive and negative blocks. The green boxes denote local patches. One-block Haar-like features (a) compute the average intensity of an arbitrary position within the local patch, and two-block Haar-like features (b) compute the average intensity difference of two arbitrary positions within the local patch.

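For illustration, a minimal Python sketch of equation (4) is given below. Block averages are computed here by direct summation; a practical implementation would typically use integral images for speed. All names are ours, and the blocks are assumed to lie fully inside the patch:

import numpy as np

def block_mean(patch, c, s):
    # average intensity of an s x s x s block centred at voxel offset c in the patch
    h = s // 2
    return patch[c[0]-h:c[0]+h+1, c[1]-h:c[1]+h+1, c[2]-h:c[2]+h+1].mean()

def haar_feature(patch, c1, s1, c2=None, s2=None, delta=0):
    # delta = 0: one-block feature; delta = 1: difference of two block means
    value = block_mean(patch, c1, s1)
    if delta == 1:
        value -= block_mean(patch, c2, s2)
    return value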

Once the feature vector (i.e. Haar-like features) and target vector (i.e. 3D displacement) of each training voxel are computed as described above, all the training voxels/samples are used to train the regression forest in a tree-by-tree manner. As mentioned above, each binary decision tree is trained independently. Each tree uses a different random subset of training voxels and Haar-like features in order to increase the diversity among the trained trees, thus potentially improving the performance of the ensemble model.

Testing stage: The input of the testing stage is a new image, for which the method will localise the position of the target landmark. The testing stage consists of two successive steps: (1) 3D displacement prediction, and (2) landmark voting and localisation.

  • 1.  
    3D displacement prediction: At the first step, the 3D displacement of each voxel in the new image (also referred to as a testing voxel) is predicted using the regression random forest learned in the training stage.
  • 2.  
    Landmark voting and localisation: After the 3D displacement of each testing voxel is predicted, it is used to vote for the potential landmark position. Specifically, for each testing voxel $\mathbf{x}\in {{\mathbb{R}}^{3}}$ with the predicted 3D displacement $\mathbf{\hat{d}}$ , one vote is cast onto the voxel at $\text{ROUND}(\mathbf{x}+\mathbf{\hat{d}})$ , where the function $\text{ROUND}(.)$ rounds each dimension of the input vector to the nearest integer. After collecting the votes from all the image voxels, we obtain a landmark voting map, where the value of each voxel denotes the number of votes it receives from all the locations in the image. The detected landmark position is the voxel that receives the maximum number of votes (see the sketch below).
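
A minimal sketch of this voting step (our own naming, with displacements assumed to be already predicted for all testing voxels) is:

import numpy as np

def vote_landmark(image_shape, voxels, displacements):
    # voxels: (N, 3) voxel coordinates; displacements: (N, 3) predicted 3D displacements
    votes = np.zeros(image_shape, dtype=np.int32)
    targets = np.rint(voxels + displacements).astype(int)      # ROUND(x + d_hat)
    for t in targets:
        if all(0 <= t[k] < image_shape[k] for k in range(3)):  # drop out-of-image votes
            votes[tuple(t)] += 1
    landmark = np.unravel_index(np.argmax(votes), image_shape)
    return landmark, votes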

III. Multi-resolution collaboration: multi-resolution regression voting

As briefly mentioned in the introduction, the limitation of conventional regression-based landmark detection is the inclusion of faraway voxels at both the training and testing stages. Because the local appearances of faraway voxels are insensitive to deformations happening around the landmark, faraway voxels are not informative about the precise landmark position, although they are useful for rough localisation.

Figure 4 provides two scenarios for illustration. In the CT prostate case (figure 4(a)), the relative position of the prostate landmark to the pelvic bone could change due to the inflation of the bladder or the rectum. Hence, the voxels of the pelvic bone in different images may have distinct displacements to the same prostate landmark even though their local image appearances are quite similar. The same situation applies to the CBCT dental landmark detection (figure 4(b)). Due to the deformities of patients and also the individual shape differences of the mandible, the 3D displacement from the mandible bottom to the upper frontal tooth landmark could change significantly across patients, even though the image appearances of the mandible-bottom voxels look similar across patients. These facts cause the ambiguity of 3D displacements associated with faraway voxels, thus bringing problems to both the training and testing of regression-based landmark detection.


Figure 4. Voxels (yellow crosses) with similar local appearances (green boxes) may have quite different 3D displacements (blue arrows) to the same landmark (red points) in different patient images. (a) shows two cases for a prostate CT landmark, and (b) shows two cases for a dental CBCT landmark. The top right-hand corner of each image shows the zoomed-in local patch centered at the voxel marked by a yellow cross in each image. In the prostate cases, due to the indistinct prostate boundary, we overlap the manually labelled prostate region as a red mask onto the original CT image for better visualisation.


The above examples illustrate that faraway voxels are not informative for precise landmark detection. However, at the testing stage, without knowing the landmark position in advance, it is impossible to distinguish nearby voxels from faraway voxels. This dilemma can be addressed by a multi-resolution strategy. In this paper, we propose a multi-resolution strategy named 'multi-resolution regression voting' to address this issue.

Specifically, at the testing stage, a landmark is detected in a hierarchical way. In the coarsest resolution, the landmark position is roughly localised by landmark voting from the entire image domain. Once the rough landmark position is detected, voxels within distance ρ mm from it (also referred to as the ρ-neighbourhood) are identified as nearby voxels and used to refine the landmark position in a finer resolution. With the increase in resolution, ρ is gradually decreased to exclude faraway and also less informative voxels in the landmark voting step. Algorithm 1 gives the algorithm for our multi-resolution landmark detection.

Algorithm 1. Multi-resolution regression voting algorithm.

Input: ${{I}^{\text{test}}}$ —a testing image with an unknown landmark position
   ${{\mathcal{R}}_{i}},i=\left\{\text{Coarsest},\cdots ,\text{Finest}\right\}$ — the regression forest trained at the ith resolution
   ${{\rho}_{0}}$ —the voting neighborhood size for the 2nd coarsest resolution
Output: $\mathbf{p}$ —detected landmark position
Notations: $\mathcal{N}\left(\mathbf{x},\rho \right)$ — ρ-neighborhood of voxel $\mathbf{x}$ ; $\mathcal{N}\left({{I}^{\text{test}}}\right)$ — entire image domain of ${{I}^{\text{test}}}$
Initialization: $\rho ={{\rho}_{0}}$
for $i=\text{Coarsest}$ To $\text{Finest}$ do
 Re-sample image ${{I}^{\text{test}}}$ to resolution i
 /* Set the voting area $\Phi$ */
$\Phi=\mathcal{N}\left({{I}^{\text{test}}}\right)$
if $i\ne \text{Coarsest}$ then
   $\Phi=\mathcal{N}\left(\mathbf{p},\rho \right)$
   $\rho =\rho /2$  /* Reduce the voting area by $2^3$ in the next finer resolution */
end if
 /* 3D displacement prediction */
for every voxel $\mathbf{x}$ in region $\Phi$ do
   Predict the 3D displacement $\mathbf{\hat{d}}\left(\mathbf{x}\right)$ by regression forest ${{\mathcal{R}}_{i}}$
end for
 /* Landmark voting */
 Initialize voting map V to be zero and of the same size with ${{I}^{\text{test}}}$
for every voxel $\mathbf{x}$ in region $\Phi$ do
   $V\left(\text{ROUND}\left(\mathbf{x}+\mathbf{\hat{d}}\left(\mathbf{x}\right)\right)\right)~+=1$
end for
/* Landmark localization */
$\mathbf{p}=\arg {{\max }_{\mathbf{x}}}V\left(\mathbf{x}\right)$
end for
Return $\mathbf{p}$
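
A compact Python rendering of algorithm 1 might look as follows. This is a sketch under simplifying assumptions: each regression forest is assumed to expose a predict(image, voxel) method, an axis-aligned box stands in for the spherical ρ-neighbourhood, and consecutive resolutions are assumed to differ by a factor of two in grid size, as in table 1:

import itertools
import numpy as np

def voxels_in_box(center, rho, shape):
    # axis-aligned box as a simple stand-in for the spherical rho-neighbourhood
    ranges = [range(max(0, int(c - rho)), min(n, int(c + rho) + 1))
              for c, n in zip(center, shape)]
    return itertools.product(*ranges)

def multi_resolution_voting(pyramid, forests, rho0):
    # pyramid: images from coarsest to finest, each twice the grid size of the last
    # forests: per-resolution regressors with predict(image, voxel) -> displacement
    rho, p = rho0, None
    for level, (image, forest) in enumerate(zip(pyramid, forests)):
        if level == 0:
            region = itertools.product(*(range(n) for n in image.shape))  # whole image
        else:
            p = tuple(2 * c for c in p)              # map estimate to the finer grid
            region = voxels_in_box(p, rho, image.shape)
            rho /= 2                                 # shrink the voting area
        votes = np.zeros(image.shape, dtype=np.int32)
        for x in region:
            target = np.rint(np.add(x, forest.predict(image, x))).astype(int)
            if all(0 <= target[k] < image.shape[k] for k in range(3)):
                votes[tuple(target)] += 1
        p = np.unravel_index(np.argmax(votes), image.shape)  # most-voted voxel
    return p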

The training of our multi-resolution strategy follows the same hierarchical idea as the landmark detection described above. Specifically, a regression forest is independently trained at each resolution. The regression forest in the coarsest resolution is trained with training voxels sampled from the entire image domain, while the regression forest in a finer resolution is trained with training voxels sampled only from the ρ-neighbourhood of the annotated landmark position in each training image. To take into account that nearby voxels are more informative than faraway voxels, a spherical sampling strategy is further proposed, which draws training voxels based on the distance of a voxel to the landmark. In this spherical sampling strategy, given a ρ-neighbourhood of an annotated landmark ${{\mathbf{x}}^{\text{LM}}}$ and the number of training voxels ${{N}_{\text{sample}}}$ to draw, the algorithm aims to distribute the training voxels evenly on each concentric sphere, so that concentric spheres with different radii receive roughly the same number of training voxels (see the illustration in figure 5). Mathematically, the sampling probability of each voxel can be computed as:

$P\left(\mathbf{x}\right)\propto \frac{1}{{{\left\| \mathbf{x}-{{\mathbf{x}}^{\text{LM}}} \right\|}^{2}}},\quad \mathbf{x}\in \mathcal{N}\left({{\mathbf{x}}^{\text{LM}}},\rho \right)\quad (5)

Figure 5. (a) Illustration of the spherical sampling strategy. The yellow cross denotes a target landmark ${{\mathbf{x}}^{\text{LM}}}$ . The red circles denote concentric spheres. (b) An example of the distribution of training voxels with $\rho =60$ mm.


It can be clearly seen that the sampling probability is inversely proportional to the squared distance of voxel $\mathbf{x}$ to the target landmark ${{\mathbf{x}}^{\text{LM}}}$ . Therefore, more training voxels are drawn near the landmark than far away from it, thus potentially improving the displacement prediction accuracy for nearby voxels. Algorithm 2 gives the detailed implementation of our spherical sampling strategy.

Algorithm 2. Spherical sampling strategy

Input: ${{\mathbf{x}}^{\text{LM}}}$ - an annotated landmark position
   ρ - the neighborhood size for sampling
   ${{N}_{\text{sample}}}$ - the number of training voxels requested
Output: sampled training voxel set $\mathbb{S}$
Initialization: $\mathbb{S}=\varnothing $
for i  =  1 to ${{N}_{\text{sample}}}$ do
 /* Randomly choose a concentric sphere based on the uniform distribution */
$r=\text{Random}(0,\rho )$
 /* Randomly sample a point on the unit sphere based on the uniform distribution */
$\alpha =\text{Random}(0,2\pi )$
$z=\text{Random}(-1,1)$ ; $x=\text{sqrt}\left(1-{{z}^{2}}\right)\text{cos}\alpha $ ; $y=\text{sqrt}\left(1-{{z}^{2}}\right)\text{sin}\alpha $
 /* Shift and scale it onto the selected concentric sphere */
${{\mathbf{x}}_{i}}={{\mathbf{x}}^{\text{LM}}}+r{{\left[x~y~z\right]}^{T}}$
 /* Push it into the sampled training voxel set $\mathbb{S}$ */
$\mathbb{S}=\mathbb{S}\mathop{\cup}^{}\left\{{{\mathbf{x}}_{i}}\right\}$
end for
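
Algorithm 2 translates almost line-for-line into runnable Python (a sketch; the sampled positions are continuous and would be rounded to voxel coordinates in practice):

import numpy as np

def spherical_sampling(landmark, rho, n_samples, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    r = rng.uniform(0.0, rho, n_samples)       # uniform radius gives p(x) ~ 1/r^2
    z = rng.uniform(-1.0, 1.0, n_samples)      # uniform direction on the unit sphere
    alpha = rng.uniform(0.0, 2.0 * np.pi, n_samples)
    x = np.sqrt(1.0 - z ** 2) * np.cos(alpha)
    y = np.sqrt(1.0 - z ** 2) * np.sin(alpha)
    # shift and scale onto the chosen concentric spheres around the landmark
    return np.asarray(landmark, dtype=float) + r[:, None] * np.stack([x, y, z], axis=1)

Because the radius is drawn uniformly while the direction is uniform on the sphere, the resulting density falls off as the inverse squared distance, matching equation (5).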

IV. Inter-landmark collaboration: confidence-based landmark detection

As will be shown in the experimental section, much more accurate landmark detection can be achieved with the proposed multi-resolution strategy than conventional regression-based landmark detection. However, for certain challenging landmarks, where appearance variations are large, it is still difficult to accurately detect them independently from other landmarks. To improve their detection accuracy, it is necessary to exploit the spatial dependency between these challenging landmarks and other reliable landmarks.

Joint landmark detection (Criminisi et al 2013) is a simple way to consider an inter-landmark spatial relationship at the landmark detection step. It jointly predicts the 3D displacements of a voxel to multiple landmarks using a common regression forest, instead of using separate regression forests as in individual landmark detection. Sharing a common regression forest increases the prediction efficiency. However, it also brings a limitation. As the detection of different landmarks may prefer different features and splitting functions in the random forest, the landmark detection accuracy could be compromised by sharing a common forest. Besides, all landmarks are equally treated in joint detection without considering the detection confidence of each landmark. The detection accuracy of reliable landmarks may decrease due to the negative influence from challenging landmarks.

To effectively exploit the spatial dependency among landmarks, we propose a confidence-based landmark detection strategy which uses reliable landmarks (with high detection confidence) to guide the detection of challenging landmarks (with low detection confidence). There are generally two ways to determine reliable and challenging landmarks. In applications where the spatial dependency is explicitly known, such as one landmark being annotated according to other landmarks, the dependents are challenging landmarks, and those which they depend on are reliable landmarks. In other applications where no such dependency is provided, we first compute the variance of the Euclidean distances between any pair of landmarks across subjects. Landmark pairs with small variances are considered spatially highly correlated. Next, we use cross validation to determine the detection accuracy of each landmark. If two landmarks are spatially correlated and their validated detection accuracy is statistically different (p  <  0.05), we use the landmark with the higher detection accuracy as the reliable landmark to guide the detection of the one with the lower detection accuracy. It should be noted that the above cross validation is performed on the training data without using the testing data.
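
As an illustration of this selection procedure, the across-subject variance of pairwise landmark distances (the quantity used above to identify spatially correlated pairs) can be computed as follows (a sketch, with our own naming):

import numpy as np

def pairwise_distance_variance(landmarks):
    # landmarks: (S, L, 3) positions of L landmarks across S training subjects
    diffs = landmarks[:, :, None, :] - landmarks[:, None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)   # (S, L, L) pairwise distances
    return dists.var(axis=0)                 # small entries: spatially correlated pairs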

Suppose that $\text{L}{{\text{M}}_{a}}$ is a challenging landmark and $\left\{\text{L}{{\text{M}}_{1}},\cdots \text{L}{{\text{M}}_{b}},\cdots \text{L}{{\text{M}}_{B}}\right\}$ is a set of reliable landmarks. The following paragraphs introduce how the reliable landmarks can be used to guide the detection of the challenging landmark in confidence-based landmark detection.

  • Training stage: The regression forest training for the B reliable landmarks is the same as that described in section III. To train a regression forest for the challenging landmark $\text{L}{{\text{M}}_{a}}$ , the learned regression forests for the B reliable landmarks are first applied to detect their positions $\left\{\mathbf{p}_{j}^{\text{L}{{\text{M}}_{1}}},\cdots \mathbf{p}_{j}^{\text{L}{{\text{M}}_{b}}},\cdots \mathbf{p}_{j}^{\text{L}{{\text{M}}_{B}}}\right\}$ on each (jth) training image $I_{j}^{\text{train}}$ . Then, the 3D displacements between each training voxel $\mathbf{x}$ and the detected reliable landmarks of the same training image are measured, i.e. $\left\{\mathbf{p}_{j}^{\text{L}{{\text{M}}_{1}}}-\mathbf{x},\cdots \mathbf{p}_{j}^{\text{L}{{\text{M}}_{b}}}-\mathbf{x},\cdots \mathbf{p}_{j}^{\text{L}{{\text{M}}_{B}}}-\mathbf{x}\right\}$ . These displacements are named 'context distance features'; they are used as additional geometric features for each training voxel and combined with the 3D Haar-like features to train the regression forests $\mathcal{R}_{i}^{\text{L}{{\text{M}}_{a}}},i=\left\{\text{coarsest},\cdots ,\text{finest}\right\}$ .
  • Testing stage: The testing stage follows a similar procedure to the training stage. First, the positions of B reliable landmarks $\left\{\mathbf{p}_{\text{test}}^{\text{L}{{\text{M}}_{1}}},\cdots \mathbf{p}_{\text{test}}^{\text{L}{{\text{M}}_{b}}},\cdots \mathbf{p}_{\text{test}}^{\text{L}{{\text{M}}_{B}}}\right\}$ are detected in the testing image using the multi-resolution strategy as described in algorithm 1. Then, to predict the 3D displacement of each testing voxel $\mathbf{x}$ to landmark $\text{L}{{\text{M}}_{a}}$ , the context distance features $\left\{\mathbf{p}_{\text{test}}^{\text{L}{{\text{M}}_{1}}}-\mathbf{x},\cdots \mathbf{p}_{\text{test}}^{\text{L}{{\text{M}}_{b}}}-\mathbf{x},\cdots \mathbf{p}_{\text{test}}^{\text{L}{{\text{M}}_{B}}}-\mathbf{x}\right\}$ are calculated and combined with 3D Haar-like features as input to the trained regression forest $\mathcal{R}_{i}^{\text{L}{{\text{M}}_{a}}}$ . Once the displacements of all the testing voxels are estimated, the landmark voting and localisation steps are the same, as described in section II.

It can be seen from the above description that the only difference between confidence-based landmark detection and regular regression-based landmark detection is the introduction of 'context distance features', which bridge reliable and challenging landmarks. As the selected reliable landmarks are spatially highly correlated with the challenging landmarks, for any voxel, its displacements to reliable landmarks are also highly correlated with those to challenging landmarks. Therefore, a voxel's displacements to reliable landmarks (context distance features) are very informative for regressing its displacements to challenging landmarks. With the help of these 3D displacements, the 3D displacement prediction accuracy for challenging landmarks can be improved, eventually leading to better landmark detection accuracy.
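
A minimal sketch of how these features could be assembled (our own naming; the feature bootstrapping inside the forest is omitted):

import numpy as np

def context_distance_features(voxel, reliable_landmarks):
    # displacements from the voxel to the already-detected reliable landmarks
    voxel = np.asarray(voxel, dtype=float)
    return np.concatenate([np.asarray(p, dtype=float) - voxel
                           for p in reliable_landmarks])

def augmented_features(haar_features, voxel, reliable_landmarks):
    # appearance (Haar-like) features plus geometric (context distance) features
    return np.concatenate([haar_features,
                           context_distance_features(voxel, reliable_landmarks)])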

V. Experimental results

In this section, we extensively evaluate our collaborative landmark detection framework for detecting landmarks in three datasets: (1) CT prostate images, (2) CBCT dental images and (3) CT head & neck images. The organisation of this section is as follows: the parameter setting of our method is first presented in section V.A. In all three datasets, the same parameter setting is used unless explicitly mentioned otherwise. Next, section V.B reports both the training and testing time of our method. Finally, sections V.C–V.E present the experimental results of our method on the three datasets, respectively.

V.A. Parameter setting

Multi-resolution Setting: Our multi-resolution landmark detection consists of 3 resolutions. The detailed parameters of each resolution are shown in table 1. The original spacings of the CT and CBCT images in our dataset are about $1\times 1\times 3\text{m}{{\text{m}}^{3}}$ and $0.4\times 0.4\times 0.8\text{m}{{\text{m}}^{3}}$ , respectively. To ease the image processing, the CT and CBCT images are linearly resampled to the isotropic volumes with spacings $1\times 1\times 1\,\text{m}{{\text{m}}^{3}}$ and $0.4\times 0.4\times 0.4\,\text{m}{{\text{m}}^{3}}$ , respectively. These spacings denote the spacings used in the finest resolution.

Table 1. Parameter setting for each resolution.

  R3 (Coarsest) R2 (Medium) R1 (Finest)
Spacing (mm) $4\times $ $2\times $ $1\times $
Patch Size (voxel) 15 30 30 (dental, HN), 50 (prostate)

Note: $4\times $ and $2\times $ mean that the spacing is four times and two times that of the finest resolution, respectively. HN denotes head & neck.

Regression forest setting: The training parameters for a regression random forest are provided in table 2. Similar to Lindner et al (2013), the maximum tree depth is large, which makes the trained trees as deep as possible. To prevent overfitting, a 'minimum leaf sample number' is set to stop splitting if the number of samples falling into a node is less than or equal to the specified value (i.e. 8). Empirically, we found the detection accuracy increases with the increase in (1) the tree number K, (2) the number of bootstrapped thresholds and (3) the number of bootstrapped features. However, increasing the 'tree number K' linearly increases the runtime for landmark detection. Similarly, increasing the 'number of bootstrapped thresholds' and the 'number of bootstrapped features' linearly increases the time and memory cost at the training stage. As a compromise, we adopt the parameters shown in table 2, which give good results on all three datasets. Thus, we believe this setting should also work for other applications.

Table 2. Training parameter setting for a regression random forest.

Tree number K: 10
Maximum tree depth $\mathcal{D}$: 100
Number of bootstrapped thresholds: 100
Number of bootstrapped features: 2000
Minimum leaf sample number: 8

Other parameters: The number of training voxels τ sampled from each training image is 10 000. The local voting neighbourhood size ${{\rho}_{0}}$ is 30 voxels. The block sizes $\left\{{{s}_{1}},{{s}_{2}}\right\}$ are limited to {3, 5}. For each one-block Haar-like feature, we randomly sample a value from {3, 5} for s1. For each two-block Haar-like feature, we randomly sample one value with replacement from {3, 5} for each of s1 and s2. Both one-block and two-block features are used in the training.

V.B. Training and testing timing

Our experiments are conducted on a laptop with an Intel i7-2720QM CPU (2.2 GHz) and 16 GB memory. All the algorithms are implemented in C++. OpenMP is used to parallelise the code by multi-threading. The typical runtime to detect a landmark in a $512\times 512\times 61$ image volume is about 1 s. The training time is 27 min for one tree with 54 training images and the parameter setting described in section V.A. This training time is linearly proportional to the number of training images, the number of bootstrapped thresholds and the number of bootstrapped features.

V.C. CT Prostate dataset

Data description: Our CT prostate dataset consists of 73 CT images from 73 different prostate cancer patients, acquired at the North Carolina Cancer Hospital. A radiation oncologist manually delineated the prostate in each CT image. Based on the delineation, six prostate landmarks are defined as shown in figure 6, where BS and AP are defined as the prostate centres in the most inferior and superior slices of the prostate volume, respectively. RT, LF, AT and PT are defined on the central slice of the prostate volume. They correspond to the rightmost, leftmost, most anterior and most posterior points of the prostate on the central slice, respectively.


Figure 6. Illustration of six prostate landmarks (transversal view along with 3D rendering).


Applications: These landmarks can be used to align the mean prostate shape onto the testing image for fast prostate localisation (Gao et al 2014). The mean prostate shape is represented as a 3D mesh. To construct it, the marching cubes algorithm (Lorensen and Cline 1987) is first used to extract a 3D mesh from the manual prostate segmentation of each training image. Then, the coherent point drift algorithm (Myronenko and Song 2010) is used to build the vertex-to-vertex correspondence for all prostate meshes. Finally, all the corresponding meshes are affinely registered into a common space, where the mean prostate shape is obtained by vertex-wise averaging of the aligned meshes. At the testing stage, once the prostate landmarks are detected in the new image, an affine transformation is estimated between the detected landmarks and their corresponding vertices on the mean prostate mesh (a sketch of this estimation step is given below). Then, the prostate in the new image can be quickly localised by applying the estimated transformation to the mean prostate mesh. For details, we refer interested readers to Gao et al (2014).
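
The affine estimation step can be illustrated with a short least-squares sketch (ours, not necessarily the authors' exact implementation):

import numpy as np

def fit_affine(source, target):
    # least-squares affine map: target ~ source @ A.T + b
    # source: (N, 3) mean-mesh vertices; target: (N, 3) detected landmarks
    X = np.hstack([source, np.ones((len(source), 1))])  # homogeneous coordinates
    P, *_ = np.linalg.lstsq(X, target, rcond=None)      # P is (4, 3)
    return P[:3].T, P[3]

def apply_affine(points, A, b):
    return points @ A.T + b

With six landmark correspondences (18 equations) and 12 affine parameters, the system is over-determined and solved in the least-squares sense.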

Evaluations: Four-fold cross validation is used to evaluate each component of our method. Specifically, the entire dataset is evenly divided into four folds. To test the detection accuracy of one fold, the other three folds are used as training data to learn regression forests and to construct the mean prostate shape. Two metrics are used to evaluate the performance:

  • Landmark detection error: The Euclidean distance between the ground truth landmark position and the automatically detected landmark position.
  • Prostate overlap ratio: The Dice similarity coefficient (DSC) between the manually annotated prostate and the automatically localised prostate using the six detected landmarks:
    $\text{DSC}=\frac{2\left|\text{Vo}{{\text{l}}_{\text{gt}}}\cap \text{Vo}{{\text{l}}_{\text{auto}}}\right|}{\left|\text{Vo}{{\text{l}}_{\text{gt}}}\right|+\left|\text{Vo}{{\text{l}}_{\text{auto}}}\right|}\quad (6)
    where $\text{Vo}{{\text{l}}_{\text{gt}}}$ is the voxel set of the manually annotated prostate, $\text{Vo}{{\text{l}}_{\text{auto}}}$ is the voxel set of the automatically localised prostate using the six landmarks and $|.|$ denotes the cardinality of a set.
  • Single-resolution versus multi-resolution: Table 3 quantitatively compares the average landmark detection error between single-resolution and multi-resolution landmark detection. Both methods use uniform sampling and the same parameters to train the regression random forest. We can clearly see that single-resolution landmark detection always leads to poor detection performance (mean error  ∼9 mm). In contrast, by using three resolutions, our multi-resolution landmark detection significantly improves the detection accuracy, halving the mean landmark detection error. In terms of the prostate overlap ratio, the best single-resolution method obtains a mean DSC of $67.0\pm 11.6\%$ on 73 cases, whereas our multi-resolution method significantly improves the mean DSC to $81.0\pm 4.49\%$ , comparable to the inter-operator variability of manual prostate delineation ($81.0\pm 6.00\%$ ) reported in Foskey et al (2005).
  • Uniform sampling versus spherical sampling: To justify the use of the spherical sampling strategy, we quantitatively compare uniform and spherical sampling in both the single-resolution and multi-resolution settings. Table 4 presents the comparison results. We can see that the spherical sampling strategy significantly (p  <  0.05) improves the detection accuracy in both settings. In terms of the prostate overlap ratio, the mean DSC obtained by multi-resolution landmark detection with spherical sampling is $81.0\pm 4.49\%$ , which is also statistically ($p=4.9\times {{10}^{-4}}$ ) better than the mean DSC of $80.3\pm 5.05\%$ obtained by multi-resolution landmark detection with uniform sampling.
  • Joint landmark detection versus confidence-based landmark detection: With the multi-resolution and spherical sampling strategies, we obtain detection errors of $4.4\pm 3.0$ mm for landmark PT and $3.8\pm 2.1$ mm for landmark AT. The inferior detection accuracy of landmark PT is owing to the fact that its local appearance is much more complex than that of landmark AT (figure 7). As landmarks AT and PT are spatially highly correlated, we use landmark AT as a reliable landmark to guide the detection of landmark PT.

    Table 5 shows the detection accuracy of the six landmarks by confidence-based landmark detection ('Confidence'), and compares it with the detection accuracy of joint and individual landmark detection. All three methods use the same multi-resolution strategy proposed in this paper. We can see that joint landmark detection performs worse than individual landmark detection, which supports our earlier statement that sharing a common regression model among different landmarks can compromise landmark detection accuracy.

    On the other hand, by comparing confidence-based landmark detection with individual landmark detection, we observe a significant improvement (p-value  =  0.01) in the detection accuracy of landmark PT, which improves from $4.4\pm 3.0$ mm to $4.0\pm 2.7$ mm (a $9\%$ reduction in mean detection error) due to the guidance from landmark AT. Interestingly, the detection accuracy of most other landmarks also improves slightly when using the context guidance from landmark AT, which may be explained by the weak spatial correlations between these prostate landmarks and landmark AT. Additionally, we notice that the context guidance from landmark AT improves its own detection accuracy as well. This is because the sagittal plane of landmark AT can be localised very accurately and reliably using our multi-resolution strategy (mean and maximum errors of $0.8\pm 0.6$ mm and 2.6 mm, respectively). With the guidance of such a reliably localised sagittal plane, the 3D displacement along the lateral dimension can be predicted more accurately than by relying solely on the local image appearance. Consequently, the votes cluster more tightly around the correct sagittal plane (figure 8(d)) than in the case without self-guidance (figure 8(c)), which leads to the improved detection accuracy of landmark AT.

    In terms of the prostate overlap ratio (DSC), joint landmark detection obtains $77.6\pm 7.14\%$ , which is worse than the $81.0\pm 4.49\%$ achieved by individual landmark detection. With confidence-based landmark detection, the DSC for prostate localisation is slightly improved to $81.1\pm 4.32\%$ .
  • Comparison with a multi-resolution classification-based method: Finally, we compare our method with a multi-resolution classification-based method (Gao et al 2014) in table 6. Both methods use the same number of resolutions and the same parameter setting for each resolution. Besides, the same type of Haar-like features is used in both methods to ensure a fair comparison. We can see from table 6 that our method is significantly better than Gao et al (2014) in CT prostate landmark detection. For CT prostate landmarks, whose local appearances are indistinct, there is a strong likelihood of encountering local patches with appearances similar to the target landmark. In such a situation, classification-based methods may suffer. In contrast, with the help of context image patches, regression-based methods are more robust, which explains why the regression-based method achieves a higher detection accuracy in this task. In terms of the prostate overlap ratio, our proposed method is also significantly better than the classification-based method (Gao et al 2014), which obtains a DSC of $73.3\pm 11.6\%$ on this dataset.

Table 3. Quantitative comparison between single-resolution and multi-resolution landmark detection on the CT prostate dataset.

Method        Single-resolution                                                        Multi-resolution
              Finest (R1)               Medium (R2)               Coarsest (R3)
Error (mm)    $9.3\pm 5.1$              $8.7\pm 4.6$              $8.6\pm 4.6$         $\mathbf{4.4\pm 2.5}$
p-value       $1.3\times {{10}^{-71}}$  $1.2\times {{10}^{-69}}$  $2.6\times {{10}^{-68}}$  N/A

Note: The p-values are computed with a paired t-test between the single-resolution methods and our multi-resolution method. The bold number indicates the best performance.

Table 4. Quantitative comparison between uniform sampling and spherical sampling in a CT prostate dataset.

Method        Single-resolution                                                                          Multi-resolution
              Finest (R1)                    Medium (R2)                    Coarsest (R3)
Sampling      Uniform        Spherical       Uniform        Spherical       Uniform        Spherical     Uniform        Spherical
Error (mm)    $9.3\pm 5.1$   $7.8\pm 4.6$    $8.7\pm 4.6$   $7.8\pm 4.4$    $8.6\pm 4.6$   $8.3\pm 4.3$  $4.4\pm 2.5$   $\mathbf{4.2\pm 2.5}$
p-value       $4.0\times {{10}^{-24}}$  N/A  $3.6\times {{10}^{-17}}$  N/A  $3.1\times {{10}^{-7}}$  N/A  0.01          N/A

Note: The p-values are computed with a paired t-test between uniform and spherical sampling under each setting. The bold number indicates the best performance.

Table 5. Quantitative comparison between joint landmark detection (Joint), individual landmark detection (Individual) and confidence-based landmark detection (Confidence).

Error (mm) RT LF PT AT BS AP Average p-value
Joint $4.7\pm 2.7$ $4.4\pm 2.7$ $5.2\pm 3.4$ $3.9\pm 2.1$ $4.9\pm 2.7$ $6.3\pm 4.5$ $4.9\pm 3.2$ $1.8\times {{10}^{-13}}$
Individual $4.1\pm 2.1$ $3.9\pm 2.4$ $4.4\pm 3.0$ $3.8\pm 2.1$ $4.7\pm 2.5$ $\mathbf{4.5\pm 2.9}$ $4.2\pm 2.5$ 0.01
Confidence $\mathbf{4.0\pm 2.2}$ $\mathbf{3.8\pm 2.0}$ $\mathbf{4.0\pm 2.7}$ $\mathbf{3.7\pm 1.9}$ $\mathbf{4.7\pm 2.4}$ $4.6\pm 2.6$ $\mathbf{4.1\pm 2.4}$ N/A

Note: The p-values are computed between confidence-based landmark detection and other methods.


Figure 7. Appearance variations of prostate landmarks AT and PT across patients (transversal view).


Figure 8. (a) Transversal CT prostate slice. (b) Zoomed-in view of the red rectangle in (a), where the red point indicates the position of landmark AT. (c) and (d) are the voting maps of landmark AT in the fine resolution (R1) without and with self-guidance, respectively.


Table 6. Quantitative comparison between the multi-resolution classification-based landmark detection method (Gao et al 2014) and our method.

Error (mm) RT LF PT AT BS AP Average p-value
Classification $6.5\pm 3.7$ $6.5\pm 4.8$ $7.4\pm 5.3$ $4.9\pm 2.7$ $6.2\pm 3.5$ $8.8\pm 6.7$ $6.7\pm 4.8$ $1.0\times {{10}^{-29}}$
Proposed $\mathbf{4.0\pm 2.2}$ $\mathbf{3.8\pm 2.0}$ $\mathbf{4.0\pm 2.7}$ $\mathbf{3.7\pm 1.9}$ $\mathbf{4.7\pm 2.4}$ $\mathbf{4.6\pm 2.6}$ $\mathbf{4.1\pm 2.4}$ N/A

Note: The bold numbers indicate the best performance.

V.D. CBCT dental dataset

Data description: Our CBCT dataset consists of 14 patients, each with one CBCT scan. These patients suffer from one or two of the following deformities: (1) maxillary hypoplasia, (2) mandibular hyperplasia, (3) mandibular hypoplasia, (4) bimaxillary protrusion and (5) condylar hyperplasia. In each CBCT image, 15 landmarks were manually annotated by a physician based on the CBCT segmentation (i.e. segmentation of the maxilla and mandible), as shown in figure 9.

Figure 9. Illustration of 15 dental landmarks on a 3D skull rendering, where the white and yellow parts of the skull indicate the maxilla and mandible, respectively.

Motivations: These dental landmarks are important in deformity diagnosis and treatment planning. For example, they provide important symmetry measurements that can be used in the analysis of maxillofacial deformities (Maeda et al 2006). They can also be used to estimate the patient-specific normal craniomaxillofacial shape for guiding surgery planning (Ren et al 2014). Besides, by superimposing dental landmarks of the same patient acquired at different points in time, physicians can monitor temporal changes associated with orthodontic treatment and growth. Despite their clinical importance, manual annotation of dental landmarks is very time-consuming and labour-intensive. Specifically, the physician first needs to manually segment bony structures from the CBCT image and separate the maxilla from the mandible, a procedure that often takes 5 h; the purpose of this segmentation is to separate different anatomical structures (e.g. the maxilla and mandible) and to remove metal artifacts. After that, 3D models are generated from the segmented CBCT image, and landmark annotation on these 3D models takes another 30 min. Therefore, it is clinically desirable to develop an automatic method that can efficiently and accurately localise dental landmarks directly from a CBCT image, without relying on segmentation, which is time-consuming to obtain.

Evaluations: Two-fold cross validation is used to evaluate our method on this dataset. Specifically, the entire dataset is divided into two folds of 7 CBCT scans each. To test the detection accuracy on one fold, the CBCT images in the other fold are used to learn the regression forest for each landmark. To enrich the training dataset, we also add 30 CT images, considering the similar appearance of dental landmarks in CT and CBCT images (figure 10).
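As a schematic of this protocol, the sketch below runs a two-fold loop that trains one multi-output regression forest per fold; scikit-learn's RandomForestRegressor stands in for the paper's regression forests, and the sampled data are synthetic placeholders for Haar-like features and voxel-to-landmark displacements.

```python
# Schematic of the two-fold protocol: 14 CBCT scans split into two folds of 7;
# each fold is tested with a forest trained on the other fold (plus 30 CT scans).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
N_FEATURES = 50

def sample_training_data(scan_ids):
    """Placeholder: sample voxels from the given scans and return
    (appearance features, 3D offsets from each voxel to the landmark)."""
    n = 200 * len(scan_ids)
    return rng.normal(size=(n, N_FEATURES)), rng.normal(size=(n, 3))

cbct_folds = [list(range(7)), list(range(7, 14))]
ct_scans = list(range(100, 130))  # 30 auxiliary CT scans (hypothetical ids)

errors = []
for test_ids, train_ids in (cbct_folds, cbct_folds[::-1]):
    X_tr, y_tr = sample_training_data(train_ids + ct_scans)
    forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_tr, y_tr)
    X_te, y_te = sample_training_data(test_ids)
    pred = forest.predict(X_te)                     # predicted 3D offsets
    errors.append(np.linalg.norm(pred - y_te, axis=1))

print(f"mean offset error: {np.concatenate(errors).mean():.2f} (arbitrary units)")
```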

  • Evaluation of the proposed strategies: Similar to the evaluations on the previous dataset, tables 7–9 show the quantitative comparisons (1) between single-resolution and multi-resolution landmark detection, (2) between uniform and spherical sampling, and (3) between joint and individual landmark detection. These results indicate the effectiveness of our proposed strategies in improving landmark detection accuracy. Note that confidence-based landmark detection is not used on this dataset because (1) the detection accuracies of all dental landmarks are already high with our multi-resolution strategy, and (2) for landmarks with spatial dependency (i.e. the two upper-teeth landmarks UR1 and UL1, and the two lower-teeth landmarks LR1 and LL1), the detection accuracies are almost identical, which makes further improvement from using one landmark to guide the other unlikely.
  • Comparison with the multi-resolution classification-based method: Similarly, table 9 quantitatively compares our method with the multi-resolution classification-based method (Gao et al 2014). Our method significantly outperforms the conventional method on almost all landmarks. By carefully analysing the results, we notice that the improvement of our method over Gao et al (2014) is larger for teeth landmarks than for non-teeth landmarks. This is due to the metal artifacts mentioned in the introduction: for patients with dental braces, CBCT images suffer from severe streaking artifacts (figure 2), which make the appearances of the upper and lower teeth similar and hard to distinguish. As a result, the classification-based method may detect a lower-teeth landmark on the upper teeth (figure 11(a)) because it checks only the local appearance. In contrast, with the help of context appearance, our regression-based method easily overcomes this limitation and produces a good detection result (figure 11(b)).

Table 7. Quantitative comparison between single-resolution and multi-resolution landmark detection on the CBCT dental dataset.

| Method                           | Error (mm)            | p-value              |
|----------------------------------|-----------------------|----------------------|
| Single-resolution, Finest (R1)   | $12\pm 8.6$           | $8.8\times 10^{-48}$ |
| Single-resolution, Medium (R2)   | $10\pm 7.5$           | $4.1\times 10^{-46}$ |
| Single-resolution, Coarsest (R3) | $9.3\pm 7.0$          | $7.2\times 10^{-52}$ |
| Multi-resolution                 | $\mathbf{2.8\pm 4.2}$ | N/A                  |

Note: The p-values are computed with a paired t-test between the single-resolution methods and our multi-resolution method. The bold number indicates the best performance.

Table 8. Quantitative comparison between uniform sampling and spherical sampling on the CBCT dental dataset.

| Method                           | Uniform error (mm) | Spherical error (mm)  | p-value (Uniform vs Spherical) |
|----------------------------------|--------------------|-----------------------|--------------------------------|
| Single-resolution, Finest (R1)   | $12\pm 8.6$        | $3.9\pm 4.1$          | $7.3\times 10^{-35}$           |
| Single-resolution, Medium (R2)   | $10\pm 7.5$        | $4.4\pm 3.9$          | $3.7\times 10^{-25}$           |
| Single-resolution, Coarsest (R3) | $9.3\pm 7.0$       | $5.2\pm 3.8$          | $2.3\times 10^{-19}$           |
| Multi-resolution                 | $2.8\pm 4.2$       | $\mathbf{1.5\pm 0.9}$ | $2.0\times 10^{-6}$            |

Note: The p-values are computed with a paired t-test between uniform and spherical sampling at each resolution. The bold number indicates the best performance.

Table 9. Quantitative comparisons between joint and individual landmark detection, and between the multi-resolution classification-based method (Gao et al 2014) and the proposed method on the CBCT dental dataset.

| Error (mm) | Go-R | Go-L | Me | N | Or-R | Or-L | Pg | UR1 |
|---|---|---|---|---|---|---|---|---|
| Joint | $3.8\pm 2.0$ | $2.5\pm 2.0$ | $2.2\pm 1.1$ | $2.2\pm 1.5$ | $3.7\pm 3.0$ | $3.0\pm 1.2$ | $1.9\pm 1.3$ | $5.6\pm 4.3$ |
| Classification (Gao et al 2014) | $2.5\pm 2.4$ | $2.5\pm 1.4$ | $1.6\pm 1.3$ | $1.4\pm 1.8$ | $1.6\pm 0.9$ | $\mathbf{1.3\pm 1.2}$ | $1.5\pm 0.7$ | $1.4\pm 1.4$ |
| Proposed (Individual) | $\mathbf{1.6\pm 1.0}$ | $\mathbf{1.7\pm 1.0}$ | $\mathbf{1.1\pm 0.4}$ | $\mathbf{1.2\pm 0.7}$ | $\mathbf{1.3\pm 0.9}$ | $1.5\pm 0.9$ | $\mathbf{1.2\pm 0.7}$ | $\mathbf{0.9\pm 0.7}$ |

| Error (mm) | UL1 | LR1 | LL1 | URL | LRL | ULL | LLL | Average |
|---|---|---|---|---|---|---|---|---|
| Joint | $5.5\pm 4.5$ | $6.0\pm 4.4$ | $6.8\pm 3.3$ | $6.8\pm 3.6$ | $3.8\pm 3.6$ | $4.5\pm 2.9$ | $4.9\pm 4.2$ | $4.2\pm 3.5$ |
| Classification (Gao et al 2014) | $1.7\pm 2.1$ | $2.6\pm 1.9$ | $2.2\pm 1.6$ | $3.6\pm 4.3$ | $1.8\pm 1.4$ | $2.7\pm 3.7$ | $4.7\pm 4.4$ | $2.2\pm 2.5$ |
| Proposed (Individual) | $\mathbf{0.9\pm 0.4}$ | $\mathbf{1.9\pm 1.1}$ | $\mathbf{1.9\pm 1.5}$ | $\mathbf{2.0\pm 0.7}$ | $\mathbf{1.5\pm 0.9}$ | $\mathbf{1.8\pm 0.9}$ | $\mathbf{1.8\pm 0.8}$ | $\mathbf{1.5\pm 0.9}$ |

Note: The bold numbers indicate the best performance.

Figure 10. Qualitative comparison between the landmark appearances in the CBCT and CT images.

Figure 11. Visual comparison between the classification-based method (Gao et al 2014) and our regression-based method in detecting landmark LLL on a CBCT scan. (a) Landmark position detected by the classification-based method. (b) Landmark position detected by our method. (c) Ground-truth landmark position.

V.E. CT head & neck dataset

Data descriptions: Our CT head & neck dataset is acquired from PDDCA (www.imagenglab.com//pddca_18.html). PDDCA version 1.1 comprises the CT images of 40 patients from the Radiation Therapy Oncology Group (RTOG) 0522 study, a multi-institutional clinical trial led by Dr Kian Ang. Each CT image has five manually annotated bony landmarks: chin (chine), right condyloid process (mand_r), left condyloid process (mand_l), odontoid process (odont_proc) and occipital bone (occ_bone). Figure 12 shows the positions of these landmarks on one subject.

Figure 12. Illustration of the positions of five bony landmarks in the CT head and neck dataset.

These bony landmarks are used to align CT images of different patients, correcting for the orientation and translation differences incurred by different patient setups. The accuracy of this alignment can largely influence later processing steps, e.g. multi-atlas-based tissue segmentation. Therefore, it is important to detect these landmarks accurately.
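As an illustration of such landmark-based alignment, the sketch below estimates a rigid (rotation plus translation) transform between two patients' landmark sets with the standard SVD-based Kabsch method; this is a generic technique demonstrated on synthetic coordinates, not the specific alignment procedure used in the RTOG study.

```python
# Rigid alignment (rotation R, translation t) of one patient's landmarks to
# another's via the SVD-based Kabsch method. Landmark coordinates are synthetic.
import numpy as np

def rigid_align(src, dst):
    """Find R, t minimising sum_i ||R @ src_i + t - dst_i||^2 over paired landmarks."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)              # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))           # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

# Hypothetical 5-landmark set (chin, mand_r, mand_l, odont_proc, occ_bone), in mm.
rng = np.random.default_rng(1)
landmarks_a = rng.uniform(0.0, 200.0, size=(5, 3))
theta = np.deg2rad(10.0)                             # synthetic setup rotation
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
landmarks_b = landmarks_a @ R_true.T + np.array([5.0, -3.0, 12.0])

R, t = rigid_align(landmarks_a, landmarks_b)
residual = np.linalg.norm(landmarks_a @ R.T + t - landmarks_b, axis=1).max()
print(f"max residual after alignment: {residual:.2e} mm")   # ~0 for exact data
```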

This dataset is interesting because it provides an explicit spatial dependency between landmarks, which can be used to evaluate our confidence-based landmark detection strategy. Specifically, the landmark 'occ_bone' is manually annotated on the same sagittal slice as the landmark 'chin'.

Evaluation: Four-fold cross validation is used to evaluate our method on this dataset. To test the detection accuracy on one fold, the CT images in the other three folds are used to learn the regression forest for each landmark.

  • Evaluation of the proposed strategies: Similar to the previous datasets, tables 10 and 11 provide quantitative comparisons (1) between single-resolution and multi-resolution landmark detection, and (2) between uniform and spherical sampling, respectively. The results indicate the effectiveness of the proposed multi-resolution strategy and spherical sampling in improving landmark detection accuracy.
  • Joint landmark detection versus confidence-based landmark detection: Since the landmark 'occ_bone' is annotated according to the landmark 'chin', we exploit this dependency in our confidence-based landmark detection; a minimal sketch of this guidance is given after this list. Specifically, the landmark 'chin' is used as a reliable landmark to help detect the landmark 'occ_bone'. Table 12 quantitatively compares joint landmark detection (Joint) and individual landmark detection (Individual) with confidence-based landmark detection (Confidence). As on the previous datasets, individual landmark detection outperforms joint landmark detection, but its accuracy is still limited compared to 'Confidence'. By incorporating context distance features, 'Confidence' achieves the best detection accuracy for 'occ_bone', reducing the detection error by more than half compared to 'Individual'.
  • Comparison with the multi-resolution classification-based method (Gao et al 2014): Table 12 quantitatively compares our method with the multi-resolution classification-based method (Gao et al 2014) on this dataset. Our method clearly obtains better detection accuracy. In particular, the detection error of the landmark 'occ_bone' is reduced by almost two thirds with our method compared to Gao et al (2014), which indicates the effectiveness of our collaborative landmark detection framework over the conventional classification-based method.
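The following is a minimal, hypothetical sketch of how such guidance can be wired in: the detected position of the reliable landmark 'chin' supplies context distance features that are appended to each candidate voxel's appearance features before they are fed to the 'occ_bone' regression forest. All coordinates, array sizes and the feature layout are illustrative assumptions, not the paper's exact implementation.

```python
# Hypothetical sketch of confidence-based guidance: the reliably detected
# 'chin' position provides context distance features for 'occ_bone'.
import numpy as np

def context_distance_features(voxel_xyz, guide_xyz):
    """Signed offset and Euclidean distance from a voxel to the guide landmark."""
    offset = np.asarray(voxel_xyz, dtype=float) - np.asarray(guide_xyz, dtype=float)
    return np.concatenate([offset, [np.linalg.norm(offset)]])

chin_detected = np.array([120.0, 95.0, 40.0])        # reliable landmark (mm)
candidate_voxels = np.array([[118.0, 90.0, 160.0],   # voxels near 'occ_bone'
                             [121.0, 96.0, 158.0]])

# Placeholder appearance features (e.g. Haar-like responses) per voxel.
appearance = np.random.default_rng(0).normal(size=(len(candidate_voxels), 50))

# Because 'occ_bone' lies on the same sagittal slice as 'chin', the sagittal
# component of the offset is highly informative for discarding bad candidates.
context = np.array([context_distance_features(v, chin_detected)
                    for v in candidate_voxels])
augmented = np.hstack([appearance, context])         # input to the occ_bone forest
print(augmented.shape)                               # (2, 54)
```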

Table 10. Quantitative comparison between single-resolution and multi-resolution landmark detection on the CT head & neck dataset.

| Method                           | Error (mm)            | p-value              |
|----------------------------------|-----------------------|----------------------|
| Single-resolution, Finest (R1)   | $12\pm 6.9$           | $3.0\times 10^{-42}$ |
| Single-resolution, Medium (R2)   | $9.2\pm 5.7$          | $7.2\times 10^{-36}$ |
| Single-resolution, Coarsest (R3) | $8.9\pm 5.6$          | $1.5\times 10^{-35}$ |
| Multi-resolution                 | $\mathbf{2.6\pm 2.1}$ | N/A                  |

Note: The p-values are computed with a paired t-test between the single-resolution methods and our multi-resolution method. The bold number indicates the best performance.

Table 11. Quantitative comparison between uniform sampling and spherical sampling on the CT head & neck dataset.

| Method                           | Uniform error (mm) | Spherical error (mm)  | p-value (Uniform vs Spherical) |
|----------------------------------|--------------------|-----------------------|--------------------------------|
| Single-resolution, Finest (R1)   | $12\pm 6.9$        | $7.2\pm 6.0$          | $1.1\times 10^{-19}$           |
| Single-resolution, Medium (R2)   | $9.2\pm 5.7$       | $7.8\pm 5.3$          | $1.8\times 10^{-19}$           |
| Single-resolution, Coarsest (R3) | $8.9\pm 5.6$       | $7.9\pm 5.3$          | $1.2\times 10^{-13}$           |
| Multi-resolution                 | $2.6\pm 2.1$       | $\mathbf{2.5\pm 2.0}$ | $8.2\times 10^{-4}$            |

Note: The p-values are computed with a paired t-test between uniform and spherical sampling at each resolution. The bold number indicates the best performance.

Table 12. Quantitative comparison between the multi-resolution classification-based method (Gao et al 2014), joint landmark detection (Joint), individual landmark detection (Individual) and confidence-based landmark detection (Confidence) on the CT head & neck dataset.

| Error (mm) | chin | mand_r | mand_l | odont_proc | occ_bone | Average | p-value |
|---|---|---|---|---|---|---|---|
| Classification (Gao et al 2014) | $2.1\pm 1.0$ | $3.6\pm 2.4$ | $3.7\pm 2.3$ | $2.3\pm 1.3$ | $6.5\pm 4.2$ | $3.6\pm 2.9$ | $1.2\times 10^{-14}$ |
| Joint | $2.2\pm 1.0$ | $2.8\pm 1.9$ | $2.9\pm 1.7$ | $2.1\pm 1.3$ | $6.5\pm 4.0$ | $3.3\pm 2.7$ | $3.6\times 10^{-7}$ |
| Individual | $1.6\pm 0.7$ | $2.2\pm 1.2$ | $2.4\pm 1.1$ | $1.7\pm 1.1$ | $5.1\pm 3.6$ | $2.6\pm 2.2$ | $1.9\times 10^{-5}$ |
| Confidence | $\mathbf{1.6\pm 0.7}$ | $\mathbf{2.2\pm 1.2}$ | $\mathbf{2.4\pm 1.1}$ | $\mathbf{1.7\pm 1.1}$ | $\mathbf{2.3\pm 1.4}$ | $\mathbf{2.0\pm 1.2}$ | N/A |

Note: The p-values are computed between 'Confidence' and the other methods. The bold numbers indicate the best performance.

VI. Conclusion

In this paper, we have proposed a collaborative landmark detection framework to improve the detection accuracy of conventional regression-based methods. Specifically, two strategies are proposed. The first, a multi-resolution strategy, detects a landmark location from the coarsest resolution to the finest resolution; it improves detection accuracy by gradually filtering out faraway voxels during the landmark voting step. The second, a confidence-based landmark detection strategy, utilises reliable landmarks to guide the detection of challenging landmarks; it improves detection accuracy by exploiting inter-landmark spatial relationships. Validated on 127 CT/CBCT scans from three applications, our method obtains accurate detection results at a speed of 1 s per landmark. It also outperforms the conventional classification-based and regression-based approaches.

VII. Discussion

Ground-truth annotations: In the prostate application, the landmark positions were annotated by a radiation oncologist and then reviewed by another radiation oncologist, in order to minimise potential bias. In the CBCT dental application, the maxilla and mandible are first segmented and separated from the CBCT image by a physician. The segmentation is then used to construct a 3D surface model, on which the landmarks are manually annotated. Compared to manual annotation directly on the CBCT, annotation on the constructed 3D surface model is much more reliable, suffers from less inter-patient variation and also potentially reduces annotation bias. As for the head & neck dataset, we acquired it from a public site and thus have limited information on how the manual annotation was performed. However, our visual inspection shows that all the landmarks are annotated on distinctive anatomical structures, so we believe the quality of manual annotation in this dataset is sufficiently good to serve as ground truth for evaluation.

Assessment of landmark detection accuracy: To assess the landmark detection accuracy of our method, we can compare it with the intra-operator or inter-operator variation of manual landmark annotation. Specifically, the inter-operator variation of CT prostate landmark annotation is about 5 mm, as shown in Gao et al (2014); in comparison, our method yields a detection error of $4.2\pm 2.5$ mm, which is clinically acceptable. In the CBCT-based dental application, a detection error of less than 2 mm is clinically acceptable. According to Fourie et al (2011) and Kragskov et al (1997), the intra-operator and inter-operator variations of dental landmark annotation on 3D CT and CBCT mostly range from 1.5 mm to 2 mm; in comparison, our method yields a detection error of $1.5\pm 0.9$ mm, which is thus acceptable. In the head & neck application, we did not find a reference standard. However, considering the slice thickness of 3 mm and that our method obtained a detection error of $2.0\pm 1.2$ mm, we believe the accuracy of our method is sufficient for many applications, such as global alignment.

Appearance features: In our method, Haar-like features are the only appearance features used; they have been shown to be effective in CT/CBCT images. However, to extend our method to landmark detection in MR images, which have more complex textures than CT images, it may be necessary to add more sophisticated features. Recently, deep learning has attracted much attention in machine learning and computer vision. Its main idea is to automatically learn useful appearance features from data, instead of handcrafting them as was often done in previous research. We plan to borrow deep learning techniques, such as convolutional neural networks, to learn high-level discriminative features to further boost the detection accuracy of our method, and also to extend it to other modalities such as MRI.
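Since Haar-like features carry all of the appearance information in our framework, a brief sketch may help readers unfamiliar with them. The block below computes one such feature, the difference of mean intensities of two cubic blocks displaced from a query voxel, in O(1) per query using a 3D integral image; the block sizes and displacements are illustrative parameters, not values learned by the forests.

```python
# One 3D Haar-like feature: difference of mean intensities of two blocks
# displaced from a voxel, evaluated in O(1) via a 3D integral (summed-area) image.
import numpy as np

def integral_image_3d(vol):
    """3D summed-area table: ii[x, y, z] = sum of vol[:x+1, :y+1, :z+1]."""
    return vol.cumsum(axis=0).cumsum(axis=1).cumsum(axis=2)

def block_sum(ii, lo, hi):
    """Sum of vol[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]] by inclusion-exclusion."""
    lo = np.asarray(lo) - 1
    hi = np.asarray(hi) - 1
    total = 0.0
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                idx = (lo[0] if dx else hi[0],
                       lo[1] if dy else hi[1],
                       lo[2] if dz else hi[2])
                if min(idx) < 0:       # index -1 marks an empty prefix: term is 0
                    continue
                total += (-1) ** (dx + dy + dz) * ii[idx]
    return total

def haar_feature(ii, voxel, disp_a, disp_b, half=2):
    """Mean intensity of block A minus block B; each block is a cube of side
    2*half+1, displaced from `voxel` by disp_a / disp_b (illustrative values)."""
    def mean_block(disp):
        centre = np.asarray(voxel) + np.asarray(disp)
        return block_sum(ii, centre - half, centre + half + 1) / (2 * half + 1) ** 3
    return mean_block(disp_a) - mean_block(disp_b)

vol = np.random.default_rng(0).normal(size=(64, 64, 64))   # synthetic volume
ii = integral_image_3d(vol)
print(haar_feature(ii, voxel=(32, 32, 32), disp_a=(5, 0, 0), disp_b=(-5, 0, 0)))
```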

Large-scale landmark detection: We are also targeting the large-scale landmark detection problem, where hundreds of landmarks need to be detected in a single image. In this case, efficiency may become a concern with the current framework, as its detection time is linear in the number of landmarks. To address this issue, we are considering splitting landmarks into spatially coherent groups and using joint landmark detection within each group. Similarly, confidence-based landmark detection can also be applied by first detecting landmarks in reliable groups, and then using them to guide the detection of landmarks in challenging groups.

Transfer learning: Another direction worth exploring is transfer learning for landmark detection, which we briefly touched on with the CBCT dental dataset. Specifically, due to the limited number of CBCT images, we added 30 CT dental images to the CBCT images to enrich the training dataset. The experimental results showed that the average detection accuracy is significantly (p < 0.05) improved from $2.0\pm 2.1$ mm to $1.5\pm 0.9$ mm, which justifies the benefit of using additional CT images for training. A similar situation may arise in many other settings, and more validation is required to determine whether high-quality images are indeed helpful in improving the accuracy of landmark detection in low-quality images.

Acknowledgment

This work was supported by NIH grant CA140413.
