Abstract
This paper presents novel approaches to the empirical analysis of diversity and similarity (overlap) in biological or ecological systems. The analysis is motivated by molecular studies of highly diverse mammalian T-cell receptor (TCR) populations and is related to the classical statistical problem of analyzing two-way contingency tables with missing cells and low cell counts. New measures of diversity and overlap are proposed, based on information-theoretic as well as geometric considerations, with the capacity to naturally up-weight or down-weight rare and abundant population species. Consistent estimates are derived by applying the Good–Turing sample-coverage correction. In particular, novel consistent estimates of the Shannon entropy function and the Morisita–Horn index are provided. Data from TCR populations in mice are used to illustrate the empirical performance of the proposed methods vis-à-vis the existing alternatives.
References
Agresti A (2002) Categorical data analysis, 2nd edn. In: Wiley series in probability and statistics. Wiley, New York
Antos A, Kontoyiannis I (2001) Convergence properties of functional estimates for discrete distributions. Random Struct Algorithms 19(3–4):163–193
Arstila T, Casrouge A, Baron V, Even J, Kanellopoulos J, Kourilsky P (1999) A direct estimate of the human \(\alpha \beta \) T-cell receptor diversity. Science 286(5441):958
Baum P, McCune J (2006) Direct measurement of T-cell receptor repertoire diversity with AmpliCot. Nat Methods 3(11):895–901
Butz EA, Bevan MJ (1998) Massive expansion of antigen-specific CD8+ T-cells during an acute virus infection. Immunity 8(2):167–175
Chao A, Shen T (2003) Nonparametric estimation of Shannon’s index of diversity when there are unseen species in sample. Environ Ecol Stat 10(4):429–443
Chao A, Chazdon RL, Colwell RK, Shen TJ (2005) A new statistical approach for assessing similarity of species composition with incidence and abundance data. Ecol Lett 8:148–159
Chen W, Jin W, Hardegen N, Lei KJ, Li L, Marinos N, McGrady G, Wahl SM (2003) Conversion of peripheral CD4+CD25\(-\) naive T-cells to CD4+CD25+ regulatory T cells by TGF-beta induction of transcription factor Foxp3. J Exp Med 198(12):1875–1886. doi:10.1084/jem.20030152
Davis MM, Bjorkman PJ (1988) T-cell antigen receptor genes and T-cell recognition. Nature 334(6181):395–402. doi:10.1038/334395a0
Esteban MD, Morales D (1995) A summary on entropy statistics. Kybernetika 31(4):337–346
Esty W (1983) A normal limit law for a nonparametric estimator of the coverage of a random sample. Ann Stat 11:905–912
Esty W (1986) The efficiency of Good’s nonparametric coverage estimator. Ann Stat 14(3):1257–1260
Good I (1953) The population frequencies of species and the estimation of population parameters. Biometrika 40(3–4):237–264
Gras S, Kjer-Nielsen L, Burrows S, McCluskey J, Rossjohn J (2008) T-cell receptor bias and immunity. Curr Opin Immunol 20(1):119–125
Hsieh CS, Zheng Y, Liang Y, Fontenot JD, Rudensky AY (2006) An intersection between the self-reactive regulatory and nonregulatory T-cell receptor repertoires. Nat Immunol 7(7):401–410. doi:10.1038/ni1318
Hsieh CS, Lee HM, Lio CWJ (2012) Selection of regulatory T-cells in the thymus. Nat Rev Immunol 12(3):157–167. doi:10.1038/nri3155
Janeway C (2005) Immunobiology: the immune system in health and disease, 6th edn. Garland Science, New York
Jost L (2006) Entropy and diversity. Oikos 113(2):363–375
Keylock C (2005) Simpson diversity and the Shannon-Wiener index as special cases of a generalized entropy. Oikos 109(1):203–207
Komatsu N, Mariotti-Ferrandiz ME, Wang Y, Malissen B, Waldmann H, Hori S (2009) Heterogeneity of natural Foxp3+ T-cells: a committed regulatory T-cell lineage and an uncommitted minor population retaining plasticity. Proc Natl Acad Sci USA 106(6):1903–1908. doi:10.1073/pnas.0811556106
Magurran AE (2005) Biological diversity. Curr Biol 15(4):R116–R118. doi:10.1016/j.cub.2005.02.006
Mao C, Lindsay B (2002) A Poisson model for the coverage problem with a genomic application. Biometrika 89(3):669–682
Memon SA, Sportès C, Flomerfelt FA, Gress RE, Hakim FT (2012) Quantitative analysis of T-cell receptor diversity in clinical samples of human peripheral blood. J Immunol Methods 375(1–2):84–92. doi:10.1016/j.jim.2011.09.012
Mohebtash M, Tsang KY, Madan RA, Huen NY, Poole DJ, Jochems C, Jones J, Ferrara T, Heery CR, Arlen PM, Steinberg SM, Pazdur M, Rauckhorst M, Jones EC, Dahut WL, Schlom J, Gulley JL (2011) A pilot study of MUC-1/CEA/TRICOM poxviral-based vaccine in patients with metastatic breast and ovarian cancer. Clin Cancer Res 17(22):7164–7173. doi:10.1158/1078-0432.CCR-11-0649
Nayak T (1986) An analysis of diversity using Rao’s quadratic entropy. Sankhyā. Indian J Stat Ser B 48:315–330
Nielsen R, Paul J, Albrechtsen A, Song Y (2011) Genotype and SNP calling from next-generation sequencing data. Nat Rev Genet 12(6):443–451
Orlitsky A, Santhanam N, Zhang J (2003) Always Good-Turing: asymptotically optimal probability estimation. Science 302(5644):427–431
Orlitsky A, Santhanam N, Zhang J (2004) Universal compression of memoryless sources over unknown alphabets. IEEE Trans Inf Theory 50(7):1469–1481
Pacholczyk R, Ignatowicz H, Kraj P, Ignatowicz L (2006) Origin and T-cell receptor diversity of Foxp3+CD4+CD25+ T-cells. Immunity 25(2):249–259. doi:10.1016/j.immuni.2006.05.016
Pacholczyk R, Kern J, Singh N, Iwashima M, Kraj P, Ignatowicz L (2007) Nonself-antigens are the cognate specificities of Foxp3+ regulatory T-cells. Immunity 27(3):493–504. doi:10.1016/j.immuni.2007.07.019
Rempala GA, Seweryn M, Ignatowicz L (2011) Model for comparative analysis of antigen receptor repertoires. J Theor Biol 269(1):1–15. doi:10.1016/j.jtbi.2010.10.001
Rényi A (1961) On measures of entropy and information. In: Proceedings of the 4th Berkeley symposium on mathematical statistics and probability, pp 547–561
Ricotta C (2005) Through the jungle of biological diversity. Acta Biotheoret 53(1):29–38
Salameire D, Le Bris Y, Fabre B, Fauconnier J, Solly F, Pernollet M, Bonnefoix T, Leroux D, Plumas J, Jacob MC (2009) Efficient characterization of the TCR repertoire in lymph nodes by flow cytometry. Cytometry A 75(9):743–753. doi:10.1002/cyto.a.20767
Spellerberg I, Fedor P (2003) A tribute to Claude Shannon (1916–2001) and a plea for more rigorous use of species richness, species diversity and the Shannon-Wiener index. Global Ecol Biogeogr 12(3):177–179
Staveley-O’Carroll K, Sotomayor E, Montgomery J, Borrello I, Hwang L, Fein S, Pardoll D, Levitsky H (1998) Induction of antigen-specific T-cell anergy: an early event in the course of tumor progression. Proc Natl Acad Sci USA 95(3):1178–1183
Tóthmérész B (1995) Comparison of different methods for diversity ordering. J Veget Sci 6(2):283–290
Valiant P (2008) Testing symmetric properties of distributions. PhD thesis, MIT
Van Den Berg HA, Molina-París C, Sewell AK (2011) Specific T-cell activation in an unspecific T-cell repertoire. Sci Prog 94(Pt 3):245–264
Vu VQ, Yu B, Kass RE (2007) Coverage-adjusted entropy estimation. Stat Med 26(21):4039–4060. doi:10.1002/sim.2942
Zhang CH, Zhang Z (2009) Asymptotic normality of a nonparametric estimator of sample coverage. Ann Stat 37:2582–2595
Acknowledgments
The authors would like to thank Prof. Leszek Ignatowicz for allowing the use of his experimental data on TCR populations and for helpful discussions and comments on the early drafts of the paper. We are also grateful to the reviewers for their valuable suggestions and for pointing out some additional references.
Additional information
This research was partially supported by US NIH grant R01CA-152158 (GAR, MS) and US NSF grant DMS-1106485 (GAR).
Appendix: Proofs
In this section we prove Theorems 1, 2 and 4. Recall that for the purpose of consistency analysis we consider populations with possibly an infinite number of species (i.e. the number of receptors \(m\le \infty \)) and we let the sample size \(n\) increase to infinity. We write \(X_n= O(a_n)\) (resp. \(X_n= o(a_n)\)) to denote the fact that the random sequence \(X_n \) and a deterministic sequence \(a_n\) satisfy with probability one \(\sup _n {X_n}/{a_n}<\infty \) (resp. \({X_n}/{a_n}\rightarrow 0\)).
1.1 Auxiliary results
Denote \(\mathcal{S}_\alpha ({\varvec{p}}) :=\sum p_i^\alpha \) and \(\mathcal{S}_\alpha ^{(n)}({\varvec{p}}) :=\sum \frac{p_{i}^{\alpha }}{1-\left(1- p_{i}\right)^{n}}\) for \(\alpha >0.\) In order to prove the main results, we need the following lemma.
Lemma 1
Let \(\alpha \in (0,\infty )\) and \({\varvec{p}}\) be a vector of probabilities (possibly of infinite length) for which \(\mathcal{S}_\alpha ({\varvec{p}})<\infty \).
(i) If \(\alpha >1\) and \(\sum p_i \log ^r 1/p_i < \infty \) for some \(r>0\), then \(\mathcal{S}_\alpha ^{(n)}(\tilde{{\varvec{p}}}) \stackrel{a.s.}{\rightarrow } \mathcal{S}_\alpha ({\varvec{p}})\).
(ii) If \(\alpha <1\) then \(\mathcal{S}_\alpha ^{(n)}(\tilde{{\varvec{p}}}) \stackrel{a.s.}{\rightarrow } \mathcal{S}_\alpha ({\varvec{p}})\).
(iii) If \(\alpha =1\) and \(\sum p_i \log 1/p_i < \infty \), then \(\mathcal{S}_1^{(n)}(\tilde{{\varvec{p}}}) \stackrel{a.s.}{\rightarrow } 1\).
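Additionally, in the above we may replace \(\mathcal{S}_\alpha ^{(n)}(\tilde{{\varvec{p}}})\) by \(\mathcal{S}_{\hat{C}\alpha }^{(n)}(\tilde{{\varvec{p}}})\). That is, under any of the hypotheses in (i)–(iii), we also have
\[ \mathcal{S}_{\hat{C}\alpha }^{(n)}(\tilde{{\varvec{p}}}) \stackrel{a.s.}{\rightarrow } \mathcal{S}_\alpha ({\varvec{p}}). \qquad (5.1) \]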
Proof
First, we consider the consistency of \(\mathcal{S}_\alpha ^{(n)}(\tilde{{\varvec{p}}})\). By the results of Antos and Kontoyiannis (2001, Section 2), the plug-in estimator of the power sum \(\mathcal{S}_\alpha (\hat{{\varvec{p}}})\) is strongly consistent for each \(\alpha \in (0,\infty )\), that is,
\[ \mathcal{S}_\alpha (\hat{{\varvec{p}}}) \stackrel{a.s.}{\rightarrow } \mathcal{S}_\alpha ({\varvec{p}}). \qquad (5.2) \]
Moreover, the assumption that \(\sum _i p_i \log ^r 1/p_i < \infty \) for some \(r>0\) is sufficient (following Vu et al. 2007) for
\[ 1-{\hat{C}} = O\left(\frac{1}{\log ^r n}\right) \quad a.s. \qquad (5.3) \]
In view of (5.2) it suffices to show that under (i)–(iii) we have
To this end, consider first \(\alpha >1\) and note that the following holds with probability one
We now establish that both majorizing terms (I) and (II) vanish asymptotically a.s. To this end note that since \(\tilde{p}_i\ge \frac{{\hat{C}}}{n}\) \(a.s.\), then
due to the consistency of the plug-in power sum estimator of order \(\alpha \) and the sample coverage estimator. Apropos (II), set \( \pi _{n}:=\frac{\log n}{n}\) and consider
The function \(f(x):=\frac{\left( 1-x\right) ^{n}}{1-(1-x)^{n}}\) is decreasing in \(x\) for \(x\in (0,1)\) and thus, for \(n\) sufficiently large, the first term \((IIa)\) is majorized by
For the second term, once again due to \(\tilde{p}_i\ge \frac{{\hat{C}}}{n}\) \(a.s.\), we have
for \(0<\beta <\alpha -1\). This establishes \((II)\rightarrow 0\) \(a.s.\) and hence also (5.4) for \(\alpha >1\).
Consider now the case when \(0<\alpha \le 1\). Note that, since \(\sum p_i^\alpha <\infty \) implies that \(\sum p_i \log ^{1-\alpha }1/p_i<\infty \), the relation (5.3) holds true with \(r=1-\alpha \) for \(\alpha <1\) and is forced by our assumption with \(r=1\) when \(\alpha =1\). Moreover, (5.5) still holds and the majorizing terms \((I)\) and \((IIa)\) may be handled identically as above. For the remaining term \((IIb)\), note that for \(0<\alpha \le 1\) and \(\tilde{\pi }_n=\pi _n/{\hat{C}}\)
Note also that
Asymptotically, the first term above vanishes a.s. in view of the result of Antos and Kontoyiannis (2001) and the third one vanishes a.s. due to the summability assumption and the fact that \(\tilde{\pi }_n\rightarrow 0\). On the other hand, the middle term is bounded a.s. by the asymptotically vanishing terms
in view of the result of Antos and Kontoyiannis (2001). Hence from (5.6) it follows that \((IIb)\rightarrow 0\) \(a.s.\) and parts (i)–(iii) of Lemma 1 are established.
Finally, we also establish (5.1). Note that without loss of generality we may assume that \(P({\hat{C}}={\hat{C}}_n<1~ \text{ infinitely} \text{ often})=1\).
Assume first that \(\alpha >1\) and \(\sum p_i \log ^r 1/p_i < \infty \) for some \(r>0\), and choose \(\beta \) such that \(1<\beta < \alpha \) and \(\alpha -\beta -1<0.\) Due to the almost sure convergence of \(\hat{C}\) to \(1\) we may without loss of generality assume that for each \(n \in \mathbb N \), \(\hat{C}\alpha - \beta >0\) \(a.s.\) We have
The maximum is attained at the point
thus
since, under the assumption that for some \(r>0\) \(\sum p_i \log ^r 1/p_i < \infty ,\) we know that \(\mathcal{S}^{(n)}_{\beta } (\tilde{{\varvec{p}}}) \rightarrow \mathcal{S}_{\beta } ({\varvec{p}})\) \({a.s.}\), by the first part of the lemma.
For \(\alpha < 1\) and under the assumption that \(\sum p_i^\alpha < \infty \), it follows from the inequality \(\log x \le n x^{1/n}\) valid for \( x>0, n\ge 1,\) that \(\sum p_i \log ^r \left(1/p_i\right) < \infty \), for each \(r>0\). For any \(r>1\) we have therefore
since \(n^{(1/\log ^r n)} \rightarrow 1\) as \(n\rightarrow \infty \) for any \(r>1.\)
Now, for \(\alpha =1\) under the assumption that the entropy of \({\varvec{p}}\) is finite, we have similarly as above that
since \(n^{1/\log n} \rightarrow e\), and \(\left(\frac{1}{\hat{C}}\right)^{1-\hat{C}} \rightarrow 1\) a.s. as \(n\rightarrow \infty .\) Hence, under the assumptions of the lemma, we have (for any \(\alpha >0\)) \(\mathcal{S}^{(n)}_{\hat{C}\alpha } (\tilde{{\varvec{p}}}) - \mathcal{S}^{(n)}_\alpha (\tilde{{\varvec{p}}})\rightarrow 0\) a.s., and (5.1) follows.
With the above lemma in hand, we are now ready for the proof of Theorem 2, which becomes relatively straightforward.
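The quantities appearing in Lemma 1 can be illustrated numerically. The following is a minimal sketch, not the authors' implementation. It assumes that \(\hat{C}\) is the Good–Turing coverage estimate \(1-f_1/n\) (with \(f_1\) the number of species observed exactly once) and that \(\tilde{p}_i={\hat{C}}\hat{p}_i\), which is consistent with the bound \(\tilde{p}_i\ge {\hat{C}}/n\) used in the proof. The function names are hypothetical.

```python
import math


def good_turing_coverage(counts):
    """Good-Turing sample coverage estimate: 1 - (number of singletons) / n."""
    n = sum(counts)
    singletons = sum(1 for c in counts if c == 1)
    return 1.0 - singletons / n


def plug_in_power_sum(counts, alpha):
    """Plug-in estimate of S_alpha(p) = sum_i p_i^alpha from observed counts."""
    n = sum(counts)
    return sum((c / n) ** alpha for c in counts if c > 0)


def adjusted_power_sum(counts, alpha, scale_exponent=False):
    """Coverage-adjusted power sum S_alpha^(n) evaluated at p_tilde.

    Assumption: p_tilde_i = C_hat * (count_i / n).
    With scale_exponent=True the exponent alpha is replaced by C_hat * alpha,
    mirroring the second variant discussed in the lemma.
    """
    n = sum(counts)
    c_hat = good_turing_coverage(counts)
    if c_hat <= 0.0:
        raise ValueError("all observed species are singletons; coverage estimate is zero")
    exponent = c_hat * alpha if scale_exponent else alpha
    total = 0.0
    for c in counts:
        if c == 0:
            continue  # unobserved species do not contribute
        p_tilde = c_hat * c / n
        # Divide by the estimated probability that the species is seen in n draws.
        total += p_tilde ** exponent / (1.0 - (1.0 - p_tilde) ** n)
    return total
```

For a vector of clonotype counts such as `[5, 3, 1, 1]`, calling `adjusted_power_sum(counts, 2.0)` and `plug_in_power_sum(counts, 2.0)` side by side shows how the coverage correction changes the estimate in a small sample.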
1.2 Proof of Theorem 2
Note that it suffices to show that the estimators of the power sums of the type \(\sum \frac{\tilde{p}_{i1}^{\alpha {\hat{C}}_1}}{1-\left(1- \tilde{p}_{i1}\right)^{n}}\) and \(\sum \frac{\tilde{p}_{i2}^{\beta {\hat{C}}_{2}}}{1-\left(1- \tilde{p}_{i2}\right)^n}\) are strongly consistent. The result in each case follows by Lemma 1.\(\square \)
The next step is to prove Theorem 1.
1.3 Proof of Theorem 1
Note that for \(\alpha \ne 1\) the assertions follow from Lemma 1 by continuity of the bivariate function \(g(x,y):=(x-1)^{-1}\log y\). For the remaining case \(\alpha =1\), the first assertion \(H_1(\tilde{{\varvec{p}}})^{(n)}\rightarrow H_1({\varvec{p}})\) a.s. follows by an argument similar to that used in the proof of the lemma and hence we forgo the details. To argue the second assertion, note that we may assume without loss of generality that \(P({\hat{C}}< 1\ \text{ infinitely} \text{ often})=1\) and that in view of the result in Antos and Kontoyiannis (2001) which asserts that \(H_1(\hat{{\varvec{p}}})\rightarrow H_1({\varvec{p}})\) a.s., it suffices to show
To this end, note that by Cauchy’s mean value theorem and (iii) of Lemma 1
for some \({\varphi _n}\) such that \({\hat{C}}\le {\varphi _n}\le 1\). Note that \(1-{\varphi _n}=O(\log ^{-r}n)\) due to (5.3) and consequently, from the proof of Lemma 1, it follows that its assertions also hold with \({\varphi _n}\) in place of \({\hat{C}}\). In particular, in view of (5.1) with \(\alpha =1\),
Re-write \(\Delta _n\) as follows
where in the last inequality we applied the bound \(\hat{p}_i\ge 1/n\). It is obvious that \((III):=\log (1/{\hat{C}})\rightarrow 0\) \(a.s.\) For the term \((I)\), consider the following.
since \(\beta _n{\hat{C}}^{\varphi _n}n^{(1-{\varphi _n})}\rightarrow 1\) \(a.s.\), in view of (5.8) and \(1\ge {\varphi _n}\ge {\hat{C}}\rightarrow 1\) \(a.s.\), as well as \(n^{1-{\varphi _n}}=\exp \left[ O(\log ^{1-r} n) \right]\rightarrow 1\) \(a.s.\) The remaining expression \((II)\) needs to be handled similarly to the analogous term considered in the proof of Lemma 1. First note that
and therefore it suffices to consider \((II)^\prime \) instead. To this end, set \( \pi _{n}:={\log n}/{n}\) and note that
The first term \((IIa)\) is majorized by
For the second term \((IIb)\), set \(\tilde{\pi }_n=\pi _n/{\hat{C}}\)
Note also that
Asymptotically, the first term above vanishes a.s. in view of the result of Antos and Kontoyiannis (2001) and the third one vanishes a.s. due to the finite entropy assumption and the fact that \(\tilde{\pi }_n\rightarrow 0\).
On the other hand, the middle term is bounded a.s. by the asymptotically vanishing terms
in view of the result of Antos and Kontoyiannis (2001). Hence from (5.11) it follows that \((IIb)\rightarrow 0\) \(a.s.\) and therefore \(\Delta _n\le (I)+(II)+(III)\rightarrow 0\) \(a.s.\) in (5.9) and the required result (5.7) is established.\(\square \)
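To make the \(\alpha =1\) case concrete, the sketch below compares the plug-in entropy \(H_1(\hat{{\varvec{p}}})\) with a coverage-adjusted estimate of the Chao and Shen (2003) type. It is an illustration under stated assumptions rather than the estimator exactly as defined in the main text: it assumes \({\hat{C}}=1-f_1/n\), \(\tilde{p}_i={\hat{C}}\hat{p}_i\), and the adjusted form \(-\sum \tilde{p}_i\log \tilde{p}_i/\left(1-(1-\tilde{p}_i)^n\right)\). The function names are hypothetical.

```python
import math


def plug_in_entropy(counts):
    """Plug-in Shannon entropy (natural log) of the empirical frequencies."""
    n = sum(counts)
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)


def coverage_adjusted_entropy(counts):
    """Chao-Shen type entropy estimate with Good-Turing coverage correction.

    Assumptions: C_hat = 1 - f1 / n and p_tilde_i = C_hat * (count_i / n).
    """
    n = sum(counts)
    singletons = sum(1 for c in counts if c == 1)
    c_hat = 1.0 - singletons / n
    if c_hat <= 0.0:
        raise ValueError("all observed species are singletons; coverage estimate is zero")
    total = 0.0
    for c in counts:
        if c == 0:
            continue  # unobserved species do not contribute
        p_tilde = c_hat * c / n
        # Inverse-probability weighting by the chance of observing the species.
        total -= p_tilde * math.log(p_tilde) / (1.0 - (1.0 - p_tilde) ** n)
    return total
```

When no singletons are present (\({\hat{C}}=1\)) the adjusted estimate is never smaller than the plug-in value, since each nonnegative term is divided by a factor not exceeding one.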
1.4 Proof of Theorem 4
We only consider the more difficult case of \(\alpha =1\). The case \(\alpha \ne 1\) may be handled by arguments similar to those used in the proof of Lemma 1. Without loss of generality assume that \(P({\hat{C}}<1 \text{ infinitely} \text{ often})=1\), since otherwise the result follows by the consistency of the “plug-in” estimate of the \(I\)-index (Theorem 3). Note that it suffices to prove that
and
where \(\hat{{\varvec{Q}}}:=\hat{{\varvec{P}}}_\circ \bigotimes \hat{{\varvec{P}}}^\circ :=[\hat{p}_{i \circ }\,\hat{p}_{\circ j} ]\). For the proof of the above assertions, we again use Cauchy’s mean value theorem. To argue (5.12), let us note that there exists a \({\varphi _n}\) with \({\hat{C}}\le {\varphi _n}\le 1\) such that almost surely,
where \(\tau _{ij}=\frac{ \hat{p}_{ij}}{\hat{p}_{i\circ }\hat{p}_{\circ j}}.\) By the assumption that \(\sum _{ij}p_{ij}\log ^r 1/p_{ij} < \infty \) for some \(r>1\), we have as before that \(1-{\varphi _n}=O\left(\frac{1}{\log ^r n}\right)\) \(a.s.\) Since \(1/n\le \tau _{ij}\le n\), we have
Similarly, we obtain
where \(d_n:=\max \{1-n^{{\varphi _n}-1},n^{1-{\varphi _n}}-1\}.\) Since the entropy \(H_1({\varvec{P}})\) is finite and \(d_n \stackrel{a.s.}{\rightarrow } 0\) as \(n\rightarrow \infty \), the assertion (5.12) follows. To argue (5.13) let us note again that there exists a \({\varphi _n}\) (possibly different from the one considered above) with \({\hat{C}}\le {\varphi _n}\le 1\) such that
By elementary algebra
and
which completes the proof.
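The bound \(1/n\le \tau _{ij}\le n\) used in the proof above holds for every cell with a positive count, because \(\hat{p}_{ij}\ge 1/n\), \(\hat{p}_{ij}\le \hat{p}_{i\circ }\le 1\) and \(1/n\le \hat{p}_{\circ j}\le 1\). The following sketch checks it on a small two-way table of counts. The function name and the example table are hypothetical, and the table is assumed to be rectangular with nonnegative integer entries.

```python
def tau_ratios(table):
    """Return (n, taus) for a two-way table of counts.

    taus maps each cell (i, j) with a positive count to
    tau_ij = p_ij / (p_i. * p_.j), computed from empirical frequencies.
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    taus = {}
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            if count > 0:
                # (count/n) / ((row_i/n) * (col_j/n)) simplifies to count*n/(row_i*col_j)
                taus[(i, j)] = count * n / (row_totals[i] * col_totals[j])
    return n, taus


# Hypothetical 3x3 table of receptor counts with several empty cells.
example_table = [
    [3, 0, 1],
    [0, 2, 0],
    [1, 0, 5],
]
n_obs, tau = tau_ratios(example_table)
bounds_hold = all(1.0 / n_obs <= t <= n_obs for t in tau.values())
```

Cells with zero counts are skipped because they contribute nothing to the plug-in sums in which \(\tau _{ij}\) appears.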
Rempala, G.A., Seweryn, M. Methods for diversity and overlap analysis in T-cell receptor populations. J. Math. Biol. 67, 1339–1368 (2013). https://doi.org/10.1007/s00285-012-0589-7
DOI: https://doi.org/10.1007/s00285-012-0589-7
Keywords
- Contingency tables
- Antigen receptors
- Richness and diversity estimation
- Rényi’s entropy
- Rényi’s divergence