Methods for diversity and overlap analysis in T-cell receptor populations

Rempala, Grzegorz A.; Seweryn, Michal

doi:10.1007/s00285-012-0589-7

Published: 25 September 2012

Volume 67, pages 1339–1368, (2013)

Journal of Mathematical Biology

Grzegorz A. Rempala¹ & Michal Seweryn¹,²

Abstract

The paper presents novel approaches to the empirical analysis of diversity and similarity (overlap) in biological or ecological systems. The analysis is motivated by molecular studies of highly diverse mammalian T-cell receptor (TCR) populations and is related to the classical statistical problem of analyzing two-way contingency tables with missing cells and low cell counts. New measures of diversity and overlap are proposed, based on information-theoretic as well as geometric considerations, with the capacity to naturally up-weight or down-weight rare and abundant population species. Consistent estimates are derived by applying the Good–Turing sample-coverage correction. In particular, novel consistent estimates of the Shannon entropy function and the Morisita–Horn index are provided. Data from TCR populations in mice are used to illustrate the empirical performance of the proposed methods vis-à-vis the existing alternatives.

References

Agresti A (2002) Categorical data analysis, 2nd edn. In: Wiley series in probability and statistics. Wiley, New York
Antos A, Kontoyiannis I (2001) Convergence properties of functional estimates for discrete distributions. Random Struct Algorithms 19(3–4):163–193
Arstila T, Casrouge A, Baron V, Even J, Kanellopoulos J, Kourilsky P (1999) A direct estimate of the human $\alpha \,\beta $ T-cell receptor diversity. Science 286(5441):958
Baum P, McCune J (2006) Direct measurement of T-cell receptor repertoire diversity with AmpliCot. Nat Methods 3(11):895–901
Butz EA, Bevan MJ (1998) Massive expansion of antigen-specific cd8+ T-cells during an acute virus infection. Immunity 8(2):167–75
Chao A, Shen T (2003) Nonparametric estimation of Shannon’s index of diversity when there are unseen species in sample. Environ Ecol Stat 10(4):429–443
Chao A, Chazdon RL, Colwell RK, Shen TJ (2005) A new statistical approach for assessing similarity of species composition with incidence and abundance data. Ecol Lett 8:148–159
Chen W, Jin W, Hardegen N, Lei KJ, Li L, Marinos N, McGrady G, Wahl SM (2003) Conversion of peripheral cd4+cd25$-$ naive T-cells to cd4+cd25+ regulatory T cells by tgf-beta induction of transcription factor foxp3. J Exp Med 198(12):1875–86. doi: 10.1084/jem.20030152
Davis MM, Bjorkman PJ (1988) T-cell antigen receptor genes and T-cell recognition. Nature 334(6181): 395–402. doi:10.1038/334395a0
Esteban MD, Morales D (1995) A summary on entropy statistics. Kybernetika 31(4):337–346
Esty W (1983) A normal limit law for a nonparametric estimator of the coverage of a random sample. Ann Stat 11:905–912
Esty W (1986) The efficiency of Good’s nonparametric coverage estimator. Ann Stat 3:1257–1260
Good I (1953) The population frequencies of species and the estimation of population parameters. Biometrika 40(3–4):237–264
Gras S, Kjer-Nielsen L, Burrows S, McCluskey J, Rossjohn J (2008) T-cell receptor bias and immunity. Curr Opin Immunol 20(1):119–125
Hsieh CS, Zheng Y, Liang Y, Fontenot JD, Rudensky AY (2006) An intersection between the self-reactive regulatory and nonregulatory T-cell receptor repertoires. Nat Immunol 7(7):401–410. doi:10.1038/ni1318
Hsieh CS, Lee HM, Lio CWJ (2012) Selection of regulatory T-cells in the thymus. Nat Rev Immunol 12(3):157–167. doi:10.1038/nri3155
Janeway C (2005) Immunobiology: the immune system in health and disease, 6th edn. Garland Science, New York
Jost L (2006) Entropy and diversity. Oikos 113(2):363–375
Keylock C (2005) Simpson diversity and the shannon-wiener index as special cases of a generalized entropy. Oikos 109(1):203–207
Komatsu N, Mariotti-Ferrandiz ME, Wang Y, Malissen B, Waldmann H, Hori S (2009) Heterogeneity of natural foxp3+ T-cells: a committed regulatory T-cell lineage and an uncommitted minor population retaining plasticity. Proc Natl Acad Sci USA 106(6):1903–1908. doi:10.1073/pnas.0811556106
Magurran AE (2005) Biological diversity. Curr Biol 15(4):R116–R118. doi:10.1016/j.cub.2005.02.006
Mao C, Lindsay B (2002) A poisson model for the coverage problem with a genomic application. Biometrika 89(3):669–682
Memon SA, Sportès C, Flomerfelt FA, Gress RE, Hakim FT (2012) Quantitative analysis of T-cell receptor diversity in clinical samples of human peripheral blood. J Immunol Methods 375(1–2):84–92. doi:10.1016/j.jim.2011.09.012
Mohebtash M, Tsang KY, Madan RA, Huen NY, Poole DJ, Jochems C, Jones J, Ferrara T, Heery CR, Arlen PM, Steinberg SM, Pazdur M, Rauckhorst M, Jones EC, Dahut WL, Schlom J, Gulley JL (2011) A pilot study of muc-1/cea/tricom poxviral-based vaccine in patients with metastatic breast and ovarian cancer. Clin Cancer Res 17(22):7164–7173. doi:10.1158/1078-0432.CCR-11-0649
Nayak T (1986) An analysis of diversity using Rao’s quadratic entropy. Sankhyā. Indian J Stat Ser B 48:315–330
Nielsen R, Paul J, Albrechtsen A, Song Y (2011) Genotype and SNP calling from next-generation sequencing data. Nat Rev Genet 12(6):443–451
Orlitsky A, Santhanam N, Zhang J (2003) Always Good-Turing: asymptotically optimal probability estimation. Science 302(5644):427–431
Orlitsky A, Santhanam N, Zhang J (2004) Universal compression of memoryless sources over unknown alphabets. IEEE Trans Inf Theory 50(7):1469–1481
Pacholczyk R, Ignatowicz H, Kraj P, Ignatowicz L (2006) Origin and T-cell receptor diversity of foxp3+cd4+cd25+ T-cells. Immunity 25(2):249–259. doi:10.1016/j.immuni.2006.05.016
Pacholczyk R, Kern J, Singh N, Iwashima M, Kraj P, Ignatowicz L (2007) Nonself-antigens are the cognate specificities of foxp3+ regulatory T-cells. Immunity 27(3):493–504. doi:10.1016/j.immuni.2007.07.019
Rempala GA, Seweryn M, Ignatowicz L (2011) Model for comparative analysis of antigen receptor repertoires. J Theor Biol 269(1):1–15. doi:10.1016/j.jtbi.2010.10.001
Rényi P (1961) On measures of information and entropy. In: Proceedings of the 4th Berkeley symposium on mathematics, statistics and probability, pp 547–561
Ricotta C (2005) Through the jungle of biological diversity. Acta Biotheoret 53(1):29–38
Salameire D, Le Bris Y, Fabre B, Fauconnier J, Solly F, Pernollet M, Bonnefoix T, Leroux D, Plumas J, Jacob MC (2009) Efficient characterization of the tcr repertoire in lymph nodes by flow cytometry. Cytometry A 75(9):743–753. doi:10.1002/cyto.a.20767
Spellerberg I, Fedor P (2003) A tribute to Claude Shannon (1916–2001) and a plea for more rigorous use of species richness, species diversity and the Shannon-Wiener index. Global Ecol Biogeogr 12(3):177–179
Staveley-O’Carroll K, Sotomayor E, Montgomery J, Borrello I, Hwang L, Fein S, Pardoll D, Levitsky H (1998) Induction of antigen-specific T-cell anergy: an early event in the course of tumor progression. Proc Natl Acad Sci USA 95(3):1178–1183
Tóthmérész B (1995) Comparison of different methods for diversity ordering. J Veget Sci 6(2):283–290
Valiant P (2008) Testing symmetric properties of distributions. PhD thesis, MIT
Van Den Berg HA, Molina-París C, Sewell AK (2011) Specific T-cell activation in an unspecific t-cell repertoire. Sci Prog 94(Pt 3):245–64
Vu VQ, Yu B, Kass RE (2007) Coverage-adjusted entropy estimation. Stat Med 26(21):4060. doi:10.1002/sim.2942
Zhang CH, Zhang Z (2009) Asymptotic normality of a nonparametric estimator of sample coverage. Ann Stat 37:2582–2595

Acknowledgments

The authors would like to thank Prof. Leszek Ignatowicz for allowing the use of his experimental data on TCR populations and for helpful discussions and comments on the early drafts of the paper. We are also grateful to the reviewers for their valuable suggestions and for pointing out some additional references.

Author information

Authors and Affiliations

Department of Biostatistics and Cancer Research Center, Georgia Health Sciences University, Augusta, GA, 30912, USA
Grzegorz A. Rempala & Michal Seweryn
Department of Mathematics and Computer Science, University of Lodz, Lodz, Poland
Michal Seweryn

Corresponding author

Correspondence to Grzegorz A. Rempala.

Additional information

This research was partially supported by US NIH grant R01CA-152158 (GAR, MS) and US NSF grant DMS-1106485 (GAR).

Appendix: Proofs

In this section we prove Theorems 1, 2 and 4. Recall that for the purpose of consistency analysis we consider populations with possibly an infinite number of species (i.e. the number of receptors $m\le \infty $) and we let the sample size $n$ increase to infinity. We write $X_n= O(a_n)$ (resp. $X_n= o(a_n)$) to denote the fact that the random sequence $X_n $ and a deterministic sequence $a_n$ satisfy with probability one $\sup _n {X_n}/{a_n}<\infty $ (resp. ${X_n}/{a_n}\rightarrow 0$).

1.1 Auxiliary results

Denote $\mathcal{S}_\alpha ({\varvec{p}}) :=\sum p_i^\alpha $ and $\mathcal{S}_\alpha ^{(n)}({\varvec{p}}) :=\sum \frac{p_{i}^{\alpha }}{1-\left(1- p_{i}\right)^{n}}$ for $\alpha >0.$ In order to prove the main results, we need the following

Lemma 1

Let $\alpha \in (0,\infty )$ and ${\varvec{p}}$ be a vector of probabilities (possibly of infinite length) for which $\mathcal{S}_\alpha ({\varvec{p}})<\infty $.

(i) If $\alpha >1$ and $\sum p_i \log ^r 1/p_i < \infty $ for some $r>0$, then $\mathcal{S}_\alpha ^{(n)}(\tilde{{\varvec{p}}}) \stackrel{a.s.}{\rightarrow } \mathcal{S}_\alpha ({\varvec{p}})$.

(ii) If $\alpha <1$ then $\mathcal{S}_\alpha ^{(n)}(\tilde{{\varvec{p}}}) \stackrel{a.s.}{\rightarrow } \mathcal{S}_\alpha ({\varvec{p}})$.

(iii) If $\alpha =1$ and $\sum p_i \log 1/p_i < \infty $, then $\mathcal{S}_1^{(n)}(\tilde{{\varvec{p}}}) \stackrel{a.s.}{\rightarrow } 1$.

Additionally, in the above we may replace $\mathcal{S}_\alpha ^{(n)}(\tilde{{\varvec{p}}})$ by $\mathcal{S}_{\hat{C}\alpha }^{(n)}(\tilde{{\varvec{p}}})$. That is, under any of the hypotheses in (i)–(iii), we also have

$$\begin{aligned} \mathcal{S}_{{\hat{C}}\alpha }^{(n)}(\tilde{{\varvec{p}}}) \stackrel{a.s.}{\rightarrow } \mathcal{S}_\alpha ({\varvec{p}}). \end{aligned}$$

(5.1)
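The statement (5.1) can be explored numerically. The following is only a minimal sketch, not the authors' implementation: it assumes the usual Good–Turing form $\hat{C}=1-f_1/n$ (with $f_1$ the number of species seen exactly once) and $\tilde{p}_i=\hat{C}\hat{p}_i$, which is consistent with the bound $\tilde{p}_i\ge \hat{C}/n$ used in the proof. The function names and the simulated population are hypothetical.

```python
import numpy as np

def power_sum(p, alpha):
    """Power sum S_alpha(p) = sum_i p_i**alpha over the support of p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(np.sum(p ** alpha))

def adjusted_power_sum(p, alpha, n):
    """S_alpha^{(n)}(p) = sum_i p_i**alpha / (1 - (1 - p_i)**n)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(np.sum(p ** alpha / (1.0 - (1.0 - p) ** n)))

def lemma1_estimate(counts, alpha):
    """Hypothetical helper: S_{C_hat*alpha}^{(n)}(p_tilde) from observed counts.

    Assumes C_hat = 1 - f1/n (Good-Turing) and p_tilde = C_hat * p_hat.
    """
    counts = np.asarray(counts, dtype=float)
    counts = counts[counts > 0]
    n = counts.sum()
    c_hat = 1.0 - np.sum(counts == 1) / n
    if c_hat == 0.0:
        raise ValueError("all observed species are singletons; coverage estimate is zero")
    p_tilde = c_hat * counts / n
    return adjusted_power_sum(p_tilde, c_hat * alpha, n)

# Illustrative population (an assumption): 200 species with p_i proportional to 1/i**2.
weights = 1.0 / np.arange(1, 201) ** 2
p_true = weights / weights.sum()
alpha = 2.0
target = power_sum(p_true, alpha)

rng = np.random.default_rng(0)
errors = {}
for n in (100, 10_000, 1_000_000):
    counts = rng.multinomial(n, p_true)
    errors[n] = abs(lemma1_estimate(counts, alpha) - target)
    print(n, errors[n])
```

The printed absolute errors give a rough empirical picture of the almost sure convergence for one simulated sequence of samples; they are not a substitute for the proof.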

Proof

First, we consider the consistency of $\mathcal{S}_\alpha ^{(n)}(\tilde{{\varvec{p}}})$. By the results of Antos and Kontoyiannis (2001, Section 2), the plug-in estimator $\mathcal{S}_\alpha (\hat{{\varvec{p}}})$ of the power sum is strongly consistent for each $\alpha \in (0,\infty )$, that is,

$$\begin{aligned} \left|\mathcal{S}_\alpha ({\varvec{p}})-\mathcal{S}_{\alpha }(\hat{{\varvec{p}}})\right|\rightarrow 0\qquad a.s. \end{aligned}$$

(5.2)

Moreover, the assumption that $\sum p_i \log ^r 1/p_i < \infty $ for some $r>0$ is sufficient (following Vu et al. 2007) for

$$\begin{aligned} 1-{\hat{C}}=O\left(\log ^{-r}\!n\right)\rightarrow 0\qquad a.s. \end{aligned}$$

(5.3)
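For concreteness, the coverage estimate $\hat{C}$ appearing in (5.3) can be computed from a sample as sketched below. This assumes the standard Good–Turing form $\hat{C}=1-f_1/n$, where $f_1$ is the number of species observed exactly once; the function name and the toy sample labels are hypothetical.

```python
from collections import Counter

def good_turing_coverage(sample):
    """Good-Turing sample coverage estimate C_hat = 1 - f1 / n (assumed form).

    `sample` is any iterable of hashable species labels (e.g. receptor sequences).
    """
    counts = Counter(sample)
    n = sum(counts.values())                          # sample size
    f1 = sum(1 for c in counts.values() if c == 1)    # number of singletons
    return 1.0 - f1 / n

# Toy sample: 8 draws, of which "b", "d" and "e" are seen exactly once.
sample = ["a", "a", "a", "b", "c", "c", "d", "e"]
coverage = good_turing_coverage(sample)
print(coverage)  # 1 - 3/8 = 0.625
```

A value of $1-\hat{C}$ close to zero indicates that few species were seen only once, which is the regime in which the adjusted and plug-in estimators nearly coincide.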

In view of (5.2) it suffices to show that under (i)–(iii) we have

$$\begin{aligned} \left|\mathcal{S}_\alpha ^{(n)}(\tilde{{\varvec{p}}})-\mathcal{S}_{\alpha }(\hat{{\varvec{p}}})\right|\rightarrow 0\qquad a.s. \end{aligned}$$

(5.4)

To this end, consider first $\alpha >1$ and note that the following holds with probability one

$$\begin{aligned} \left|\mathcal{S}_\alpha ^{(n)}(\tilde{{\varvec{p}}})-\mathcal{S}_{\alpha }(\hat{{\varvec{p}}})\right|&=\left|\sum \frac{\tilde{p} _{i}^{\alpha }}{1-(1-\tilde{p}_i)^{n}}-\sum \hat{p} _{i}^{\alpha }\right|= \left|\sum \frac{{\hat{C}}^{\alpha }-1+\left( 1-\tilde{p}_i\right) ^{n}}{1-(1-\tilde{p}_i)^{n}}\hat{p} _{i}^\alpha \right|\nonumber \\&\le \left|\sum \frac{{\hat{C}}^\alpha -1}{1-(1-\tilde{p}_i)^{n}}\hat{p} _{i}^{\alpha }\right|+ \left|\sum \frac{\left( 1-\tilde{p}_i\right) ^{n}}{1-(1-\tilde{p}_i)^{n}}\hat{p} _{i}^{\alpha }\right|=:(I) +(II). \nonumber \\ \end{aligned}$$

(5.5)

We now establish that both majorizing terms (I) and (II) vanish asymptotically a.s. To this end, note that since $\tilde{p}_i\ge \frac{{\hat{C}}}{n}$ a.s., we have

$$\begin{aligned} (I)\le \left|\frac{{\hat{C}}^{\alpha }\!-\!1}{1\!-\!(1\!-\! \frac{{\hat{C}}}{n})^{n}}\sum \hat{p} _i^{\alpha }\right| =O \left(\left(1-\frac{1}{\log ^r n}\right)^\alpha -1\right)=O(\log ^{-r}\!n)\rightarrow 0\qquad a.s. \end{aligned}$$

due to the consistency of the plug-in power sum estimator of order $\alpha $ and of the sample coverage estimator. As for (II), set $ \pi _{n}:=\frac{\log n}{n}$ and consider

$$\begin{aligned} (II) \le \left|\sum _{\tilde{p}_i>\pi _{n}}\frac{ \left( 1\!-\!\tilde{p}_i\right) ^{n}}{1\!-\!(1\!-\!\tilde{p}_i)^{n}}\hat{p}_i^{\alpha }\right|\!+\!\left|\sum _{\tilde{p}_i\le \pi _{n}}\frac{ \left( 1\!-\!\tilde{p}_i\right) ^{n}}{1\!-\!(1\!-\!\tilde{p}_i)^{n}}\hat{p} _i^{\alpha }\right|=:(IIa)+(IIb)\qquad a.s. \end{aligned}$$

The function $f(x):=\frac{\left( 1-x\right) ^{n}}{1-(1-x)^{n}}$ is decreasing in $x$ for $x\in (0,1)$ and thus, for $n$ sufficiently large, the first term $(IIa)$ is majorized by

$$\begin{aligned}(IIa)&= \left| \sum _{\tilde{p}_i>\pi _{n}}\frac{\left( 1-\tilde{p}_i\right) ^{n}}{1-(1-\tilde{p}_i)^{n}}\hat{p} _i^{\alpha }\right| \le \left|\frac{\left( 1-\pi _{n}\right) ^{n}}{1-(1-\pi _{n})^{n}}\sum _{\tilde{p}_i>\pi _{n}}\hat{p} _i^{\alpha }\right| \le \frac{\left( 1-\pi _{n}\right) ^{n}}{1-(1-\pi _{n})^{n}} \\&= O\left(n^{-1}\right)\rightarrow 0\qquad a.s. \end{aligned}$$
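The rate $O(n^{-1})$ in the last display can be checked directly. As a short worked bound, using only the elementary inequality $\log (1-x)\le -x$ for $x\in [0,1)$ and the fact that $t\mapsto t/(1-t)$ is increasing on $[0,1)$, one has for $n\ge 2$

$$\begin{aligned} (1-\pi _n)^n&=\exp \left\{ n\log \left(1-\frac{\log n}{n}\right)\right\} \le \exp (-\log n)=\frac{1}{n},\\ \frac{(1-\pi _n)^n}{1-(1-\pi _n)^n}&\le \frac{n^{-1}}{1-n^{-1}}=\frac{1}{n-1}=O\left(n^{-1}\right). \end{aligned}$$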

For the second term, once again due to $\tilde{p}_i\ge \frac{{\hat{C}}}{n}$ $a.s.$, we have

$$\begin{aligned}(IIb)\!=\!\left| \sum _{\tilde{p}_i\le \pi _{n}}\frac{\left( 1-\tilde{p}_i\right) ^{n}}{1-(1-\tilde{p}_i)^{n}}\hat{p} _i^{\alpha }\right| \!\le \! \frac{{\hat{C}}^{-\alpha }\left( 1-\frac{{\hat{C}}}{n }\right) ^{n}}{1-(1-\frac{{\hat{C}}}{n})^{n}} \sum _{\tilde{p} _i\le \pi _{n}}\tilde{p}_i^{\alpha } \!=\! O(n^{-\beta })\rightarrow 0\qquad a.s. \end{aligned}$$

for $0<\beta <\alpha -1$. This establishes $(II)\rightarrow 0$ $a.s.$ and hence also (5.4) for $\alpha >1$.

Consider now the case when $0<\alpha \le 1$. Note that, since $\sum p_i^\alpha <\infty $ implies that $\sum p_i \log ^{1-\alpha }1/p_i<\infty $, the relation (5.3) holds true with $r=1-\alpha $ for $\alpha <1$ and is forced by our assumption with $r=1$ when $\alpha =1$. Moreover, (5.5) still holds and the majorizing terms $(I)$ and $(IIa)$ may be handled identically as above. For the remaining term $(IIb)$, note that for $0<\alpha \le 1$ and $\tilde{\pi }_n=\pi _n/{\hat{C}}$

$$\begin{aligned} (IIb)\le \frac{\left( 1-\frac{{\hat{C}}}{n}\right) ^{n}}{1-(1-\frac{{\hat{C}}}{n})^{n}} \sum _{\hat{p} _i\le \tilde{\pi } _n}\hat{p}_i^\alpha =O\left(\sum _{\hat{p} _i\le \tilde{\pi } _n}\hat{p}_i^\alpha \right)\qquad a.s. \end{aligned}$$

(5.6)

Note also that

$$\begin{aligned} \left|\sum _{\hat{p}_i\le \tilde{\pi } _n}\hat{p}_i^\alpha \right|\le \left| \sum _{\hat{p}_i\le \tilde{\pi } _n}\hat{p}_i^\alpha - \sum _{\hat{p}_i\le \tilde{\pi } _n}{p}_i^\alpha \right| + \left|\sum _{\hat{p}_i\le \tilde{\pi } _n}p_i^\alpha - \sum _{{p}_i\le \tilde{\pi } _n}{p}_i^\alpha \right| + \left|\sum _{p_i\le \tilde{\pi } _n}{p}_i^\alpha \right|\qquad a.s. \end{aligned}$$

Asymptotically, the first term above vanishes a.s. in view of the result of Antos and Kontoyiannis (2001) and the third one vanishes a.s. due to the summability assumption and the fact that $\tilde{\pi }_n\rightarrow 0$. On the other hand, the middle term is bounded a.s. by the asymptotically vanishing terms

$$\begin{aligned} \sum _{i: p_i\le \tilde{\pi }_n<\hat{p}_i} p_i^\alpha +\sum _{i: \hat{p}_i\le \tilde{\pi }_n<{p}_i}{p}_i^\alpha \rightarrow 0\qquad a.s. \end{aligned}$$

in view of the result of Antos and Kontoyiannis (2001). Hence from (5.6) it follows that $(IIb)\rightarrow 0$ a.s., and parts (i)–(iii) of Lemma 1 are established.

Finally, we also establish (5.1). Note that without loss of generality we may assume that $P({\hat{C}}={\hat{C}}_n<1~ \text{ infinitely} \text{ often})=1$.

Assume first that $\alpha >1$ and $\sum p_i \log ^r 1/p_i < \infty $ for some $r>0$, and choose $\beta $ such that $1<\beta < \alpha $ and $\alpha -\beta -1<0.$ Due to the almost sure convergence of $\hat{C}$ to $1$ we may without loss of generality assume that $\hat{C}\alpha - \beta >0$ a.s. for each $n \in \mathbb N$. We have

$$\begin{aligned} \sum \left(\frac{\tilde{p}_i^{\hat{C}\alpha }}{1-\left(1-\tilde{p}_i\right)^n} -\frac{\tilde{p}_i^\alpha }{1-\left(1-\tilde{p}_i\right)^n}\right)&= \sum \frac{\tilde{p}_i^\beta }{1-\left(1-\tilde{p}_i\right)^n} \left(\tilde{p}_i^{\hat{C}\alpha -\beta }-\tilde{p}_i^{\alpha -\beta }\right)\\&\le \max _{x \in (0,1)} \left(x^{\hat{C}\alpha -\beta }-x^{\alpha -\beta }\right)\sum \frac{\tilde{p}_i^\beta }{1-\left(1-\tilde{p}_i\right)^n}. \end{aligned}$$

The maximum is attained at the point

$$\begin{aligned} \tilde{x} := \left(\frac{\hat{C}\alpha -\beta }{\alpha -\beta }\right)^ {\frac{1}{\alpha -\hat{C}\alpha }} \rightarrow \left(\frac{1}{e}\right)^{\frac{1}{\alpha -\beta }}\qquad {a.s.} \end{aligned}$$

thus

$$\begin{aligned} \mathcal{S}^{(n)}_{\hat{C}\alpha } (\tilde{{\varvec{p}}}) - \mathcal{S}^{(n)}_\alpha (\tilde{{\varvec{p}}}) \le \left(\tilde{x}^{\hat{C}\alpha -\beta }-\tilde{x}^ {\alpha -\beta }\right)\sum \frac{\tilde{p_i}^\beta }{1-\left(1-\tilde{p_i}\right)^n} \rightarrow 0\qquad {a.s.} \end{aligned}$$

since, under the assumption that for some $r>0$ $\sum p_i \log ^r 1/p_i < \infty ,$ we know that $\mathcal{S}^{(n)}_{\beta } (\tilde{{\varvec{p}}}) \rightarrow \mathcal{S}_{\beta } ({\varvec{p}})$ ${a.s.}$, by the first part of the lemma.
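The maximizer $\tilde{x}$ and the vanishing of the maximum can be verified by elementary calculus. As a sketch, write $a:=\hat{C}\alpha -\beta $ and $b:=\alpha -\beta $ (shorthand introduced here only for illustration), so that $0<a<b$ whenever $\hat{C}<1$, and let $h(x):=x^{a}-x^{b}$ on $(0,1)$. Then

$$\begin{aligned} h^\prime (x)&=a x^{a-1}-b x^{b-1}=0 \iff x^{b-a}=\frac{a}{b}, \qquad \text{so that}\quad \tilde{x}=\left(\frac{a}{b}\right)^{\frac{1}{b-a}},\quad b-a=\alpha -\hat{C}\alpha ,\\ \log \tilde{x}&=\frac{1}{b-a}\log \left(1-\frac{b-a}{b}\right)\rightarrow -\frac{1}{\alpha -\beta }\quad \text{as } \hat{C}\rightarrow 1,\\ h(\tilde{x})&=\tilde{x}^{a}\left(1-\tilde{x}^{b-a}\right)=\tilde{x}^{a}\,\frac{b-a}{b}\le \frac{\alpha (1-\hat{C})}{\alpha -\beta }. \end{aligned}$$

Since $h$ is positive on $(0,1)$ and vanishes at both endpoints, the single critical point $\tilde{x}$ is the maximizer, and the last bound tends to zero a.s. by (5.3).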

For $\alpha < 1$ and under the assumption that $\sum p_i^\alpha < \infty $, it follows from the inequality $\log x \le n x^{1/n}$ valid for $ x>0, n\ge 1,$ that $\sum p_i \log ^r \left(1/p_i\right) < \infty $, for each $r>0$. For any $r>1$ we have therefore

$$\begin{aligned}&\sum \left(\frac{\tilde{p}_i^{\hat{C}\alpha }}{1-\left(1-\tilde{p}_i\right)^n} -\frac{\tilde{p}_i^\alpha }{1-\left(1-\tilde{p}_i\right)^n}\right) \\&\quad = \sum \frac{\tilde{p}_i^\alpha }{1-\left(1-\tilde{p}_i\right)^n} \left(\tilde{p}_i^{\alpha (\hat{C}-1)}-1\right) \\&\quad \le \left(\left(\left(\frac{n}{\hat{C}}\right)^{1-\hat{C}}\right)^\alpha -1 \right)\sum \frac{\tilde{p}_i^\alpha }{1-\left(1-\tilde{p}_i\right)^n} \rightarrow 0\qquad {a.s.} \end{aligned}$$

since $n^{(1/\log ^r n)} \rightarrow 1, \, n\rightarrow \infty $ for any $r>1.$
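The two steps behind the last inequality and its limit can be written out as follows. This is a sketch in which $K$ denotes a generic (random but a.s. finite) constant from (5.3), introduced here only for illustration. Since $\tilde{p}_i\ge \hat{C}/n$ and the exponent $\alpha (\hat{C}-1)$ is non-positive,

$$\begin{aligned} \tilde{p}_i^{\alpha (\hat{C}-1)}&\le \left(\frac{\hat{C}}{n}\right)^{\alpha (\hat{C}-1)}=\left(\left(\frac{n}{\hat{C}}\right)^{1-\hat{C}}\right)^{\alpha },\\ n^{1-\hat{C}}&\le n^{K/\log ^r n}=\exp \left(K\log ^{1-r} n\right)\rightarrow 1\qquad \text{for } r>1, \end{aligned}$$

while $\hat{C}^{-(1-\hat{C})}\rightarrow 1$ a.s. because $\hat{C}\rightarrow 1$ a.s.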

Now, for $\alpha =1$ under the assumption that the entropy of ${\varvec{p}}$ is finite, we have similarly as above that

$$\begin{aligned} \sum \frac{\tilde{p_i}^{\hat{C}}-\tilde{p_i}}{1-\left(1-\tilde{p_i}\right)^n}&= \sum \frac{\tilde{p_i} \log (1/\tilde{p_i})}{1-\left(1-\tilde{p_i}\right)^n} \left(\frac{\tilde{p_i}^{\hat{C}-1}}{\log (1/\tilde{p_i})}-\frac{1}{\log (1/\tilde{p_i})}\right) \\&\le \left(\frac{\left(\frac{n}{\hat{C}}\right)^{1-\hat{C}}}{\log \frac{n}{\hat{C}}}-\frac{1}{\log \frac{n}{\hat{C}}}\right) \sum \frac{\tilde{p_i} \log (1/\tilde{p_i})}{1-\left(1-\tilde{p_i}\right)^n}\rightarrow 0 \qquad {a.s.} \end{aligned}$$

since $n^{1/\log n} \rightarrow e$ and $\left(\frac{1}{\hat{C}}\right)^{1-\hat{C}} \rightarrow 1$ a.s. as $n\rightarrow \infty $. Hence, under the assumptions of the lemma, we have $\mathcal{S}^{(n)}_{\hat{C}\alpha } (\tilde{{\varvec{p}}}) - \mathcal{S}^{(n)}_\alpha (\tilde{{\varvec{p}}})\rightarrow 0$ a.s. for any $\alpha >0$, and (5.1) follows.

With the above lemma in hand, we are now ready for the proof of Theorem 2, which becomes relatively straightforward.

1.2 Proof of Theorem 2

Note that it suffices to show that the estimators of the power sums of the type $\sum \frac{\tilde{p}_{i1}^{\alpha {\hat{C}}_1}}{1-\left(1- \tilde{p}_{i1}\right)^{n}}$ and $\sum \frac{\tilde{p}_{i2}^{\beta {\hat{C}}_{2}}}{1-\left(1- \tilde{p}_{i2}\right)^n}$ are strongly consistent. The result in each case follows by Lemma 1.$\square $

The next step is to prove Theorem 1.

1.3 Proof of Theorem 1

Note that for $\alpha \ne 1$ the assertions follow from Lemma 1 by continuity of the bivariate function $g(x,y):=(1-x)^{-1}\log y$. For the remaining case $\alpha =1$, the first assertion $H_1^{(n)}(\tilde{{\varvec{p}}})\rightarrow H_1({\varvec{p}})$ a.s. follows by an argument similar to that used in the proof of the lemma, and hence we forgo the details. To argue the second assertion, note that we may assume without loss of generality that $P({\hat{C}}< 1\ \text{infinitely often})=1$. In view of the result of Antos and Kontoyiannis (2001), which asserts that $H_1(\hat{{\varvec{p}}})\rightarrow H_1({\varvec{p}})$ a.s., it suffices to show

$$\begin{aligned} \Delta _n:=\left| H_{{\hat{C}}}^{(n)}(\tilde{{\varvec{p}}})-\frac{\log \mathcal{S}_1^{(n)}(\tilde{{\varvec{p}}})}{1-{\hat{C}}}-H_1(\hat{{\varvec{p}}})\right|\rightarrow 0 \qquad a.s. \end{aligned}$$

(5.7)

To this end, note that by Cauchy’s mean value theorem and (iii) of Lemma 1

$$\begin{aligned} H_{{\hat{C}}}^{(n)}(\tilde{{\varvec{p}}})-\frac{\log \mathcal{S}_1^{(n)}(\tilde{{\varvec{p}}})}{1-{\hat{C}}}&= \frac{\log \mathcal{S}_{\hat{C}}^{(n)}(\tilde{{\varvec{p}}})-\log \mathcal{S}_{1}^{(n)}(\tilde{{\varvec{p}}})}{1-{\hat{C}}}\\&= \left(\sum \frac{\tilde{p}_i^{\varphi _n}}{1-(1-\tilde{p}_i)^n}\right)^{-1} \sum \frac{\tilde{p}_i^{\varphi _n}\log 1/\tilde{p}_i}{1-(1-\tilde{p}_i)^n}\qquad a.s. \end{aligned}$$

for some ${\varphi _n}$ such that ${\hat{C}}\le {\varphi _n}\le 1$. Note that $1-{\varphi _n}=O(\log ^{-r}n)$ due to (5.3) and consequently, from the proof of Lemma 1, it follows that its assertions also hold with ${\varphi _n}$ in place of ${\hat{C}}$. In particular, in view of (5.1) with $\alpha =1$,

$$\begin{aligned} \beta _n:=\left(\sum \frac{\tilde{p}_i^{\varphi _n}}{1-(1-\tilde{p}_i)^n} \right)^{-1}\rightarrow 1\qquad a.s. \end{aligned}$$

(5.8)
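The mean-value step that produces the ratio whose normalizing factor is $\beta _n$ can be spelled out as follows. As a sketch, write $G(a):=\log \mathcal{S}_{a}^{(n)}(\tilde{{\varvec{p}}})$ for $a\in [{\hat{C}},1]$ (the symbol $G$ is introduced here only for illustration). Differentiating term by term in $a$,

$$\begin{aligned} G^\prime (a)&=\left(\sum \frac{\tilde{p}_i^{a}}{1-(1-\tilde{p}_i)^n}\right)^{-1}\sum \frac{\tilde{p}_i^{a}\log \tilde{p}_i}{1-(1-\tilde{p}_i)^n},\\ \frac{G({\hat{C}})-G(1)}{1-{\hat{C}}}&=-G^\prime ({\varphi _n})=\left(\sum \frac{\tilde{p}_i^{\varphi _n}}{1-(1-\tilde{p}_i)^n}\right)^{-1}\sum \frac{\tilde{p}_i^{\varphi _n}\log 1/\tilde{p}_i}{1-(1-\tilde{p}_i)^n} \end{aligned}$$

for some ${\varphi _n}$ between ${\hat{C}}$ and $1$, which is the identity used above.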

Re-write $\Delta _n$ as follows

$$\begin{aligned} \Delta _n&=\sum \left(\beta _n\frac{\tilde{p}_i^{\varphi _n}\log 1/\tilde{p}_i}{1-(1-\tilde{p}_i)^n} - \hat{p}_i\log 1/\hat{p}_i\right) \\&= \sum \left(\beta _n\frac{\tilde{p}_i^{\varphi _n}\log 1/\tilde{p}_i}{1-(1-\tilde{p}_i)^n} - \hat{p}_i\log 1/\tilde{p}_i\right)+\log 1/{\hat{C}} \\&=\sum \frac{\hat{p}_i \log 1/\tilde{p}_i}{1-(1-\tilde{p}_i)^n}\left( \beta _n\hat{p}_i^{{\varphi _n}-1}{\hat{C}}^{\varphi _n}- 1\right)+ \sum \frac{\hat{p}_i \log 1/\tilde{p}_i}{1-(1-\tilde{p}_i)^n}(1-\tilde{p}_i)^n+\log 1/{\hat{C}} \\&\le \left( \beta _n{\hat{C}}^{\varphi _n}n^{1-{\varphi _n}} - 1\right) \sum \frac{\hat{p}_i \log 1/\tilde{p}_i}{1-(1-\tilde{p}_i)^n}+\sum \frac{(1-\tilde{p}_i)^n}{1-(1-\tilde{p}_i)^n}\,\hat{p}_i \log 1/\tilde{p}_i+\log 1/{\hat{C}} \\&=:(I)+(II)+(III), \end{aligned}$$

(5.9)–(5.10)

where in the last inequality we applied the bound $\hat{p}_i\ge 1/n$. It is obvious that $(III):=\log (1/{\hat{C}})\rightarrow 0$ $a.s.$ For the term $(I)$, consider the following.

$$\begin{aligned} (I)\le \left( \beta _n{\hat{C}}^{\varphi _n}n^{1-{\varphi _n}} - 1\right) \sum \frac{\hat{p}_i \log 1/\tilde{p}_i}{1-(1-\tilde{p}_i)^n}\le \left( \beta _n{\hat{C}}^{\varphi _n}n^{1-{\varphi _n}} - 1\right) O\left( 1\right)\rightarrow 0\qquad a.s. \end{aligned}$$

since $\beta _n{\hat{C}}^{\varphi _n}n^{1-{\varphi _n}}\rightarrow 1$ a.s., in view of (5.8) and $1\ge {\varphi _n}\ge {\hat{C}}\rightarrow 1$ a.s., as well as $n^{1-{\varphi _n}}=\exp \left[ O(\log ^{1-r} n) \right]\rightarrow 1$ a.s. The remaining expression $(II)$ is handled similarly to the analogous term considered in the proof of Lemma 1. First note that

$$\begin{aligned} (II)&= \sum \frac{(1-\tilde{p}_i)^n}{1-(1-\tilde{p}_i)^n}\,\hat{p}_i \log 1/\hat{p}_i + \sum \frac{(1-\tilde{p}_i)^n}{1-(1-\tilde{p}_i)^n}\,\hat{p}_i\log 1/{\hat{C}}\\&=: (II)^\prime +o(1)\qquad a.s. \end{aligned}$$

and therefore it suffices to consider $(II)^\prime $ instead. To this end, set $ \pi _{n}:={\log n}/{n}$ and note that

$$\begin{aligned}&(II)^\prime \le \left|\sum _{\tilde{p}_i>\pi _{n}}\frac{ \left( 1-\tilde{p}_i\right) ^{n}}{1-(1-\tilde{p}_i)^{n}}\hat{p}_i\log 1/\hat{p}_i\right|+\left|\sum _{\tilde{p}_i\le \pi _{n}}\frac{ \left( 1-\tilde{p}_i\right) ^{n}}{1-(1-\tilde{p}_i)^{n}}\hat{p}_i \log 1/\hat{p}_i\right|\\&=:(IIa)+(IIb)\qquad a.s. \end{aligned}$$

The first term $(IIa)$ is majorized by

$$\begin{aligned}(IIa)&=\left| \sum _{\tilde{p}_i>\pi _{n}}\frac{\left( 1-\tilde{p}_i\right) ^{n}}{1-(1-\tilde{p}_i)^{n}}\hat{p}_i \log 1/\hat{p}_i\right| \le \left|\frac{\left( 1-\pi _{n}\right) ^{n}}{1-(1-\pi _{n})^{n}}\sum _{\tilde{p}_i>\pi _{n}}\hat{p}_i \log 1/\hat{p}_i\right| \\&\le \frac{\left( 1-\pi _{n}\right) ^{n}}{1-(1-\pi _{n})^{n}}\, O(1) = O\left(n^{-1}\right)\rightarrow 0\qquad a.s. \end{aligned}$$

For the second term $(IIb)$, set $\tilde{\pi }_n=\pi _n/{\hat{C}}$

$$\begin{aligned} (IIb)\!\le \! \frac{\left( 1-\frac{{\hat{C}}}{n}\right) ^{n}}{1\!-\!(1-\frac{{\hat{C}}}{n})^{n}} \sum _{\hat{p} _i\le \tilde{\pi } _n}\hat{p}_i \log 1/\hat{p}_i\!=\!O\!\left(\sum _{\hat{p} _i\le \tilde{\pi } _n}\hat{p}_i \log 1/\hat{p}_i \right)\qquad a.s.\qquad \end{aligned}$$

(5.11)

Note also that

$$\begin{aligned}&\left|\sum _{\hat{p}_i\le \tilde{\pi } _n}\hat{p}_i \log 1/\hat{p}_i \right| \le \left| \sum _{\hat{p}_i\le \tilde{\pi } _n}\hat{p}_i \log 1/\hat{p}_i - \sum _{\hat{p}_i\le \tilde{\pi } _n}{p}_i \log 1/{p}_i\right| \\&+ \left|\sum _{\hat{p}_i\le \tilde{\pi } _n}{p}_i \log 1/{p}_i - \sum _{{p}_i\le \tilde{\pi } _n}{p}_i \log 1/{p}_i\right| \nonumber + \left|\sum _{p_i\le \tilde{\pi } _n}{p}_i \log 1/{p}_i\right|\qquad a.s. \end{aligned}$$

Asymptotically, the first term above vanishes a.s. in view of the result of Antos and Kontoyiannis (2001) and the third one vanishes a.s. due to the finite entropy assumption and the fact that $\tilde{\pi }_n\rightarrow 0$.

On the other hand, the middle term is bounded a.s. by the asymptotically vanishing terms

$$\begin{aligned} \sum _{i: p_i\le \tilde{\pi }_n<\hat{p}_i} {p}_i \log 1/{p}_i +\sum _{i: \hat{p}_i\le \tilde{\pi }_n<{p}_i}{p}_i \log 1/{p}_i \rightarrow 0\qquad a.s. \end{aligned}$$

in view of the result of Antos and Kontoyiannis (2001). Hence from (5.11) it follows that $(IIb)\rightarrow 0$ a.s. and therefore $\Delta _n\le (I)+(II)+(III)\rightarrow 0$ a.s. in (5.9), and the required result (5.7) is established.$\square $
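The estimators discussed in this proof can be sketched in code. The following is a minimal illustration under stated assumptions, not the authors' implementation: it takes $\hat{C}=1-f_1/n$, $\tilde{p}_i=\hat{C}\hat{p}_i$, the Rényi-type estimator $H_\alpha ^{(n)}(\tilde{{\varvec{p}}})=(1-\alpha )^{-1}\log \mathcal{S}_\alpha ^{(n)}(\tilde{{\varvec{p}}})$ as read off from the display preceding (5.8), and the Shannon-type estimator $H_1^{(n)}(\tilde{{\varvec{p}}})=\sum \tilde{p}_i\log (1/\tilde{p}_i)/(1-(1-\tilde{p}_i)^n)$. All function names are hypothetical.

```python
import numpy as np

def _coverage_adjusted(counts):
    """Return (n, C_hat, p_tilde), assuming C_hat = 1 - f1/n and p_tilde = C_hat * p_hat."""
    counts = np.asarray(counts, dtype=float)
    counts = counts[counts > 0]
    n = counts.sum()
    c_hat = 1.0 - np.sum(counts == 1) / n
    if c_hat == 0.0:
        raise ValueError("all observed species are singletons; coverage estimate is zero")
    return n, c_hat, c_hat * counts / n

def shannon_plugin(counts):
    """Plug-in Shannon entropy H_1(p_hat) in nats."""
    counts = np.asarray(counts, dtype=float)
    p_hat = counts[counts > 0] / counts.sum()
    return float(-np.sum(p_hat * np.log(p_hat)))

def shannon_adjusted(counts):
    """Coverage-adjusted Shannon-type estimate H_1^{(n)}(p_tilde) (assumed form)."""
    n, _, p = _coverage_adjusted(counts)
    return float(np.sum(p * np.log(1.0 / p) / (1.0 - (1.0 - p) ** n)))

def renyi_adjusted(counts, alpha):
    """Coverage-adjusted Renyi-type estimate log(S_alpha^{(n)}(p_tilde)) / (1 - alpha)."""
    if alpha == 1:
        raise ValueError("use shannon_adjusted for alpha = 1")
    n, _, p = _coverage_adjusted(counts)
    s = np.sum(p ** alpha / (1.0 - (1.0 - p) ** n))
    return float(np.log(s) / (1.0 - alpha))

# Toy counts (hypothetical): three singletons, so C_hat = 1 - 3/20 = 0.85.
toy = [10, 5, 2, 1, 1, 1]
print(shannon_plugin(toy), shannon_adjusted(toy), renyi_adjusted(toy, 2.0))
```

When a sample contains no singletons, $\hat{C}=1$ and the adjusted Shannon-type value differs from the plug-in value only through the factors $1-(1-\hat{p}_i)^n$, which are close to one for well-sampled species.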

1.4 Proof of Theorem 4

We only consider the more difficult case of $\alpha =1$. The case $\alpha \ne 1$ may be handled by arguments similar to those used in the proof of Lemma 1. Without loss of generality assume that $P({\hat{C}}<1\ \text{infinitely often})=1$, since otherwise the result follows by the consistency of the 'plug-in' estimate of the $I$-index (Theorem 3). Note that it suffices to prove that

$$\begin{aligned} F_{\hat{C}}(\hat{{\varvec{P}}},\hat{{\varvec{Q}}})-F_1(\hat{{\varvec{P}}},\hat{{\varvec{Q}}}) \rightarrow 0 \quad a.s. \end{aligned}$$

(5.12)

and

$$\begin{aligned} H_1(\hat{{\varvec{P}}_{\circ }})-H_{2-{\hat{C}}}(\hat{{\varvec{P}}_{\circ }}) \rightarrow 0 \qquad a.s. \end{aligned}$$

(5.13)

where $\hat{{\varvec{Q}}}:=\hat{{\varvec{P}}}_\circ \otimes \hat{{\varvec{P}}}^\circ :=[\hat{p}_{i \circ }\,\hat{p}_{\circ j} ]$. For the proof of the above assertions, we again use Cauchy’s mean value theorem. To argue (5.12), note that there exists a ${\varphi _n}$ with ${\hat{C}}\le {\varphi _n}\le 1$ such that, almost surely,

$$\begin{aligned} F_{\hat{C}}(\hat{{\varvec{P}}},\hat{{\varvec{Q}}})=\frac{\log (\sum _{ij}\tau ^{{\hat{C}}-1}_{ij}\hat{p}_{ij})}{1-{\hat{C}}}=\frac{1}{\sum _{ij}\tau ^{{\varphi _n}-1}_{ij}\hat{p}_{ij}}\sum _{ij}\tau ^{{\varphi _n}-1}_{ij}\hat{p}_{ij}\log \tau _{ij}, \end{aligned}$$

where $\tau _{ij}=\frac{ \hat{p}_{ij}}{\hat{p}_{i\circ }\hat{p}_{\circ j}}.$ By the assumption that $\sum _{ij}p_{ij}\log ^r 1/p_{ij} < \infty $ for some $r>1$, we have as before that $1-{\varphi _n}=O\left(\frac{1}{\log ^r n}\right)$ a.s. Since $1/n\le \tau _{ij}\le n$, we have

$$\begin{aligned} 1- \sum _{ij}\tau ^{{\varphi _n}-1}_{ij}\hat{p}_{ij} \le 1- \frac{1}{n^{1-{\varphi _n}}} \sum _{ij} \hat{p}_{ij} \le 1- \frac{1}{n^{1/ \log ^r n}} \sum _{ij} \hat{p}_{ij} \rightarrow 0 \quad n\rightarrow \infty \quad a.s. \end{aligned}$$

Similarly, we obtain

$$\begin{aligned}&\left|\sum _{ij}\tau ^{{\varphi _n}-1}_{ij}\hat{p}_{ij}\log \tau _{ij} - \sum _{ij}\hat{p}_{ij}\log \tau _{ij} \right|\\&\quad = \left|\sum _{ij}\hat{p}_{ij}\log \tau _{ij}(\tau ^{{\varphi _n}-1}_{ij}-1)\right|\le d_n (H_1({\varvec{P}})+ H_1({\varvec{P}}^\circ )+H_1({\varvec{P}}_\circ )), \end{aligned}$$

where $d_n:=\max \{1-n^{{\varphi _n}-1},n^{1-{\varphi _n}}-1\}.$ Since the entropy $H_1({\varvec{P}})$ is finite and $d_n \stackrel{a.s.}{\rightarrow } 0$ as $n\rightarrow \infty $, the assertion (5.12) follows. To argue (5.13), note again that there exists a ${\varphi _n}$ (possibly different from the one considered above) with ${\hat{C}}\le {\varphi _n}\le 1$ such that

$$\begin{aligned} H_{2-{\hat{C}}}(\hat{{\varvec{P}}}_\circ )=\frac{\sum _j \hat{p}_{\circ j}^{2-{\varphi _n}} \log 1/\hat{p}_{\circ j}}{\sum _j \hat{p}_{\circ j}^{2-{\varphi _n}}}\quad a.s. \end{aligned}$$

By elementary algebra

$$\begin{aligned} 1-\sum _j \hat{p}_{\circ j}^{1-{\varphi _n}}\hat{p}_{\circ j} \le 1- \frac{1}{n^{1-{\varphi _n}}} \rightarrow 0 \quad n\rightarrow \infty \qquad a.s. \end{aligned}$$

and

$$\begin{aligned}&H_1(\hat{{\varvec{P}}}_\circ )-\sum _j \hat{p}_{\circ j}^{2-{\varphi _n}} \log 1/\hat{p}_{\circ j}\\&\quad = \sum _j \hat{p}_{\circ j} \log 1/\hat{p}_{\circ j} \left(1-\hat{p}_{\circ j}^{1-{\varphi _n}}\right) \le \left(1-\frac{1}{n^{1-{\varphi _n}}}\right)H_1(\hat{{\varvec{P}}}_\circ ) \rightarrow 0 \quad n\rightarrow \infty \quad a.s. \end{aligned}$$

which completes the proof.
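The quantities $\tau _{ij}$ and $\sum _{ij}\hat{p}_{ij}\log \tau _{ij}$ that appear in this proof are easy to compute from a two-way table of counts. The sketch below is only an illustration: the function name is hypothetical, and identifying $\sum _{ij}\hat{p}_{ij}\log \tau _{ij}$ with $F_1(\hat{{\varvec{P}}},\hat{{\varvec{Q}}})$ is an assumption based on the limiting form of the displays above. That sum is the plug-in mutual information of the table.

```python
import numpy as np

def tau_and_mutual_information(table):
    """Return (tau, mi) for a two-way table of non-negative counts.

    tau[i, j] = p_hat[i, j] / (p_hat[i, .] * p_hat[., j]) on cells with positive
    counts (zero elsewhere); mi = sum_ij p_hat[i, j] * log(tau[i, j]) in nats.
    """
    table = np.asarray(table, dtype=float)
    p = table / table.sum()
    row = p.sum(axis=1, keepdims=True)   # marginal p_hat[i, .], shape (r, 1)
    col = p.sum(axis=0, keepdims=True)   # marginal p_hat[., j], shape (1, c)
    mask = p > 0
    tau = np.zeros_like(p)
    tau[mask] = p[mask] / (row @ col)[mask]
    mi = float(np.sum(p[mask] * np.log(tau[mask])))
    return tau, mi

# Perfectly dependent toy table: every nonzero tau equals 2 and mi equals log 2.
dependent = [[5, 0], [0, 5]]
tau_dep, mi_dep = tau_and_mutual_information(dependent)
print(tau_dep, mi_dep)

# Independent toy table (rows proportional): mi is zero up to rounding.
independent = [[2, 4], [3, 6]]
tau_ind, mi_ind = tau_and_mutual_information(independent)
print(mi_ind)
```

On cells with positive counts the bound $1/n\le \tau _{ij}\le n$ used in the proof holds because $\hat{p}_{ij}\ge 1/n$ and $\hat{p}_{ij}\le \min (\hat{p}_{i\circ },\hat{p}_{\circ j})$.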

About this article

Cite this article

Rempala, G.A., Seweryn, M. Methods for diversity and overlap analysis in T-cell receptor populations. J. Math. Biol. 67, 1339–1368 (2013). https://doi.org/10.1007/s00285-012-0589-7

Received: 06 March 2012
Revised: 29 August 2012
Published: 25 September 2012
Issue Date: December 2013
DOI: https://doi.org/10.1007/s00285-012-0589-7
