diff options
Diffstat (limited to 'docs/src/sci/normalize.md')
-rw-r--r-- | docs/src/sci/normalize.md | 229 |
1 files changed, 202 insertions, 27 deletions
diff --git a/docs/src/sci/normalize.md b/docs/src/sci/normalize.md index ac3fcb1..e6056d6 100644 --- a/docs/src/sci/normalize.md +++ b/docs/src/sci/normalize.md @@ -31,6 +31,17 @@ Our goal during the normalization procedure is two-fold: 1. Estimate the low-rank mean ``\mu_{\alpha,i}``. 2. Estimate the sampling variance ``\langle\delta_{\alpha,i}^2\rangle`` of the experimental noise. Brackets denote an average over (theoretical) realizations of sequencing. This will eventually require an explicit model. +```@raw html +<p align="center"> +<figure> + <img src="/assets/heteroskedastic.png" width="79%" /> + <figurecaption> + Count matrix is heteroskedastic: variation of scales across genes (rows) and cells (columns) + </figurecaption> +</figure> +</p> +``` + Given both quantities, normalization over gene *and* cell-specific biases can be achieved by imposing the variances of all marginal distributions to be one. Specifically, we rescale all counts by cell-specific ``c_\alpha`` and gene-specific ``g_i`` factors ``\tilde{n}_{\alpha,i} \equiv c_\alpha n_{\alpha,i} g_i`` such that ```math @@ -38,7 +49,19 @@ Specifically, we rescale all counts by cell-specific ``c_\alpha`` and gene-speci ``` This system of equations can be solved by the Sinkhorn-Knopp algorithm, provided we have a model for ``\langle\delta_{\alpha,i}^2\rangle`` parameterized by measurables. Within this formalism, this choice of model fully determines the normalization scheme. -Owing to dropout and other sources of overdispersion, we model scRNAseq counts as sampled from an empirically estimated [Heteroskedastic Negative Binomial](@ref) distribution. +Owing to dropout and other sources of overdispersion, as shown empircally in the figure below, we model scRNAseq counts as sampled from an empirically estimated [Heteroskedastic Negative Binomial](@ref) distribution. + +```@raw html +<p align="center"> +<figure> + <img src="/assets/overdispersed_mean_vs_variance.png" width="49%" /> + <img src="/assets/overdispersed_zeros.png" width="49%" /> + <figurecaption> + scRNAseq data for Drosophila is overdispersed. + </figurecaption> +</figure> +</p> +``` ### Case studies @@ -54,6 +77,16 @@ In this limit, Eq.(1) reduces to the well-studied spiked population covariance m If ``\mu_{\alpha,i} = 0``, the spectral decomposition of the count matrix ``n_{\alpha,i}`` would be given by the [Marchenko-Pastur distribution](https://en.wikipedia.org/wiki/Marchenko–Pastur_distribution) asymptotically. As such, singular values would be bounded by ``\bar{\lambda} \equiv 1+\sigma\sqrt{N_g/N_c}``. + +```@raw html +<p align="center"> +<img src="/assets/marchenko-pastur.png" width="49%" class="center"/> +</p> +<p align="center"> +Marchenko pastur distribution (orange) vs random matrix eigenvalues (blue) +</p> +``` + Now consider the case of a rank 1 mean, i.e. 1 cell type in the data. ```math n_{\alpha,i} = \gamma x_\alpha \bar{x}_i + \delta_{\alpha,i} @@ -64,6 +97,14 @@ It has been shown[^1][^2] that this model exhibits the following asymptotic phas This procedure can generalized to higher rank spike-ins; sub-leading principal components can be found by simply subtracting the previously inferred component from the count matrix ``n_{\alpha,i}``. As such, we can only expect to meaningful measure the principal components of $\mu_{\alpha,i}$ that fall above the sampling noise floor, given by the Marchenko-Pastur distribution. Consequently, this forces us to define the statistically significant **rank** of the count matrix. +This is exhibited empirically below. + +```@raw html +<p align="center"> +<img src="/assets/gaussian_svd.png" width="49%" class="center"/> +<img src="/assets/gaussian_overlap.png" width="49%" class="center"/> +</p> +``` [^1]: [The singular values and vectors of low rank perturbations of large rectangular random matrices](https://arxiv.org/abs/1103.2221) [^2]: [The largest eigenvalue of rank one deformation of large Wigner matrices](https://arxiv.org/abs/math/0605624) @@ -84,7 +125,12 @@ which provides an explicit system of equations to estimate the scaling factors ` Once obtained, ``\tilde{\mu}_{\alpha,i}`` can be estimated via singular value decomposition of ``c_\alpha n_{\alpha,i} g_i``; all components with singular value greater than ``\bar{\lambda} = \sqrt{N_c}+\sqrt{N_g}`` can be confidently attributed to the "true mean" ``\mu_{\alpha,i}`` while all other components fall amongst the noise. An example is shown below: -**TODO**: ADD FIGURE +```@raw html +<p align="center"> +<img src="/assets/poisson_svd.png" width="49%" class="center"/> +<img src="/assets/poisson_overlap.png" width="49%" class="center"/> +</p> +``` [^3]: [Asymptotics of Sample Eigenstructure for a Large Dimensional Spiked Covariance Model](http://www3.stat.sinica.edu.tw/statistica/oldpdf/A17n418.pdf) [^4]: [Biwhitening Reveals the Rank of a Count Matrix](https://arxiv.org/abs/2103.13840) @@ -93,54 +139,183 @@ An example is shown below: #### Heteroskedastic Negative Binomial A negative binomial distribution is often used to model overdispersed count data, i.e. a process in which the variance grows superlinearly with the mean. -Canonically the distribution arises from the distribution of the number of successes (with probabiliy ``p``) obtained after ``\phi`` failures of a Bernoulli process. +Canonically the distribution arises from the distribution of the number of successes (with probabiliy ``p``) obtained after ``\Theta_3`` failures of a Bernoulli process. However, the generative stochastic process can equivalently be modelled as an underlying Poisson process in which the emission rate is itself a stochastic variable drawn from a Gamma distribution. -This allows us to analytically continue ``\phi`` from an integer to the reals. -Provided we have estimated the mean ``\mu`` and the overdisperson factor ``\phi``, the unbiased estimator for the variance is given by +This allows us to analytically continue ``\Theta_{3}`` from an integer to the reals. +Provided we have estimated the mean ``\mu_{i\alpha}`` and the overdisperson factor ``\Theta_{3;i}``, the unbiased estimator for the variance is given by ```math - \langle \delta^2 \rangle= = \mu\frac{1 + \mu\phi}{1 + \phi} + \tag{3} \langle \delta_{i\alpha}^2 \rangle = \mu_{i\alpha}\left(\frac{1 + \mu_{i\alpha}\Theta_{3;i}}{1 + \Theta_{3;i}}\right) ``` -Direct substitution into Eq. (2) would provide the necessary cell-specific $c_\alpha$ and gene-specific $g_i$ scaling factors. +Direct substitution into Eq. (2) would provide the necessary cell-specific ``c_\alpha`` and gene-specific ``g_i`` scaling factors. -## Empirical Sampling Distribution Estimation +## Empirical sampling distribution -As such, in order to normalize the estimated sampling variance, we must formulate a method to fit a negative binomial to the measured count data _per gene_. +In order to normalize the estimated sampling variance, we must formulate a method to fit a negative binomial to the measured count data _per gene_. We parameterize the distribution as follows: ```math - p\left(n|\mu,\phi\right) = \frac{\Gamma\left(n+\phi\right)}{\Gamma\left(n+1\right)\Gamma\left(\phi\right)}\left(\frac{\mu}{\mu+\phi}\right)^n\left(\frac{\phi}{\mu+\phi}\right)^\phi + p\left(n|\mu,\Theta_{3}\right) = \frac{\Gamma\left(n+\Theta_{3}\right)}{\Gamma\left(n+1\right)\Gamma\left(\Theta_{3}\right)}\left(\frac{\mu}{\mu+\Theta_{3}}\right)^n\left(\frac{\Theta_{3}}{\mu+\Theta_{3}}\right)^{\Theta_{3}} ``` -### Generalized Linear Model +All results, unless explicitly stated otherwise, are obtained empirically from scRNAseq data obtained during early _Drosophila melongaster_ embryogenesis. -A central complication in modeling counts of a given gene _across_ cells with the above function is that there are confounding variables that require explicit consideration. -For the present discussion, we explictly model the hetereogeneous sequencing depth across cells: naively we expect that if we sequenced a given cell with ``2x`` depth, each gene should scale accordingly. +### Generalized linear model + +A central complication in modeling the counts of a given gene _across_ cells with the above generative model is that there are confounding variables that require consideration. +For the present discussion, we explictly model the hetereogeneous sequencing depth across cells: naively we expect that if we sequenced a given cell with ``2\times`` depth, each gene should scale accordingly. As such, we formulate a generalized linear model ```math - \log \mu_{i\alpha} = A_i + B_i \log n_{\alpha} + \tag{4} \log \mu_{i\alpha} = \Theta_{1;i} + \Theta_{2;i} \log \chi_{\alpha} ``` -where ``n_\alpha`` denotes the total sequencing depth of cell ``\alpha``. +where ``\chi_\alpha`` denotes the total sequencing depth of cell ``\alpha`` We note this formulation can be easily extended to account for other confounds such as batch effect or cell type. The likelihood function for cell ``\alpha``, gene ``i`` is ```math - p\left(n_{i\alpha}|A_i, B_i, n_\alpha, \phi_i\right) = \frac{\Gamma\left(n_{i\alpha}+\phi_i\right)}{\Gamma\left(n_{i\alpha}+1\right)\Gamma\left(\phi_i\right)}\left(\frac{\mu_{i\alpha}}{\mu_{i\alpha}+\phi_i}\right)^{n_{i\alpha}}\left(\frac{\phi_i}{\mu_{i\alpha}+\phi_i}\right)^{\phi_{i}} + p\left(n_{i\alpha}|\Theta_{1;i},\Theta_{2;i}, \Theta_{3;i},\chi_\alpha\right) = \frac{\Gamma\left(n_{i\alpha}+\Theta_{3;i}\right)}{\Gamma\left(n_{i\alpha}+1\right)\Gamma\left(\Theta_{3;i}\right)}\left(\frac{\mu_{i\alpha}}{\mu_{i\alpha}+\Theta_{3;i}}\right)^{n_{i\alpha}}\left(\frac{\Theta_{3;i}}{\mu_{i\alpha}+\Theta_{3;i}}\right)^{\Theta_{3;i}} ``` ### Maximum Likelihood Estimation -``\{A_i, B_i, \phi_i\}`` represent ``3N_g`` parameters we must infer from the data. -A priori this seems underdetermined given our ``N_c \times N_g`` sized matrix and thus we attempt to estimate all parameters within a maximum likelihood framework. +``\{\Theta_{1;i}, \Theta_{2;i}, \Theta_{3;i} \}`` represent ``3N_g`` parameters we must infer from the data. +We note that _a priori_ this problem is overdetermined; our count matrix is sized ``N_g \times N_c`` and thus we attempt to estimate all parameters within a maximum likelihood framework. This is equivalent to minimizing (for each gene independently) ```math - \mathcal{L} = \displaystyle\sum\limits_{\alpha} - \left(n_{i\alpha} + \phi_i\right)\log\left(e^{A_{i}}n_\alpha^{B_i} + \phi_i\right) - - \phi_i\log\left(\phi_i\right) - - n_{i\alpha}\left(A_i + B_i\log n_\alpha\right) - - \log\left(\frac{\Gamma\left(n_{i\alpha}+\phi_i\right)}{\Gamma\left(n_{i\alpha}+1\right)\Gamma\left(\phi_i\right)}\right) +\begin{aligned} + \tag{5} \mathcal{L}_i = \displaystyle\sum\limits_{\alpha} + \left(n_{i\alpha} + \Theta_{3;i}\right)\log\left(e^{\Theta_{1;i}}\chi_\alpha^{\Theta_{2;i}} + \Theta_{3;i}\right) + - \Theta_{3;i}\log\left(\Theta_{3;i}\right) \\ + - n_{i\alpha}\left(\Theta_{1;i} + \Theta_{2;i}\log \chi_\alpha\right) + - \log\left(\frac{\Gamma\left(n_{i\alpha}+\Theta_{3;i}\right)}{\Gamma\left(n_{i\alpha}+1\right)\Gamma\left(\Theta_{3;i}\right)}\right) +\end{aligned} +``` +Empirically it was determined that to provide a robust estimate of all three parameters that our confounding variables ``\chi_{\alpha}`` are given by +```math + \chi_\alpha \equiv \exp\left(\langle \log\left(n_{i\alpha}+1\right) \rangle\right) - 1 ``` +where angle brackets denote the empirical average over genes for cell ``\alpha``. +Once the parameters ``\{\Theta_{1;i}, \Theta_{2;i}, \Theta_{3;i} \}`` are estimated, Eqns. (2-4) can be utilized to estimate the normalized mean count matrix ``\tilde{\mu}_{i\alpha}``. +This will also immediately teach us the **statistically significant rank** of the counting matrix. -### Overfitting +#### Synthetic data verification +As a first check on the methodology, we generated toy counting data sampled from negative binomial distribution with low rank mean. +After estimation of parameters by minimizing Eq. (5), we solved Eq. (2) utilizing our unbiased estimator given by Eq. (3). +The result is shown below. -### Maximum A Posterior +```@raw html +<p align="center"> +<img src="/assets/negbinom.png" width="49%" class="center"/> +<img src="/assets/negbinom_mean.png" width="49%" class="center"/> +</p> +``` +As shown, we underestimate the true rank as a few components have singular values below the Marchenko-Pastur noise floor. +This, along with our noisy estimation of the overdispersion factor ``\Theta_3`` contributes to our noisier mean estimation when compared to the Poisson case above. -#### Empirical Estimation of Priors +#### Filter uncertain genes +The uncertainty of our parameter estimates can be computed by the second derivative of the likelihood around our determined minima +```math + \delta \Theta_{a}^2 = \left[\partial_{\Theta_{b}}\partial_{\Theta_{c}} \mathcal{L}\right]^{-1}_{aa} +``` +Unsurprisingly, the uncertainty of our estimates for parameters ``\{\Theta_{1;i}, \Theta_{2;i}, \Theta_{3;i} \}`` is strongly dependent upon the underlying gene expression; our estimates for lowly expressed genes are highly uncertain. +This can be seen in the below figure, which shows the scatter plot of the parameter estimate ``\Theta_{a}`` versus its uncertainty ``\delta \Theta_a``, colored by the average gene expression. +```@raw html +<p align="center"> +<figure> + <img src="/assets/nb_1_uncertainty_vs_expression.png" width="32%" /> + <img src="/assets/nb_2_uncertainty_vs_expression.png" width="32%" /> + <img src="/assets/nb_3_uncertainty_vs_expression.png" width="32%" /> + <figurecaption> + Each point is a gene (row). Color of point determined by mean expression of gene. + </figurecaption> +</figure> +</p> +``` +We note that the estimated ``\Theta_1`` monotonically grows with increasing gene expression, as was expected by construction. +Conversely, the average estimate for ``\Theta_2`` shows no obvious trend against expression levels, however the uncertainty ``\delta\Theta_2`` monotonically increases with decreasing gene expression. +The estimated ``\Theta_3`` also appears to be independent of average expression counts with a family of curves with increasing uncertainty defined by decreasing gene levels. -## Results +We capture the _total_ uncertainty in our estimates by analyzing the trace of uncertainty ``\delta \Theta^2 = \delta \Theta_1^2 + \delta \Theta_2^2 + \delta \Theta_3^2`` +A scatter plot of the uncertainty versus average expression level is shown below. +We see that expression level is an imperfect predictor of MLE uncertainty. +The dashed cyan line denotes our chosen uncertainty cutoff used to remove genes from our count matrix. +The right figure displays the cumulative density function for the filtered genes; as can be seen these are lowly expressed genes that are, in practice, 2 state variables. + +```@raw html +<p align="center"> +<figure> + <img src="/assets/nb_total_uncertainty_vs_expression.png" width="49%" /> + <img src="/assets/nb_badfits.png" width="49%" /> + <figurecaption> + Filter genes with bad fits. Genes with high uncertainty are determined to be lowly expressed. + </figurecaption> +</figure> +</p> +``` + +Interestingly, as shown below, the mean estimated value for ``\Theta_2`` is _slightly higher_ than 1. +This phenomenon persists across all gene expression levels. + +```@raw html +<p align="center"> +<figure> + <img src="/assets/nb_param2.png" width="49%" /> + <img src="/assets/nb_param3.png" width="49%" /> + <figurecaption> + Parameter distributions + </figurecaption> +</figure> +</p> +``` + +#### Verify estimates via bootstrap + +In order to test that the method does not overfit the data, especially lowly expressed genes, we validated our estimates via bootstrap. +Specifically, we re-ran our maximum likelihood estimation of ``\{\Theta_{1;i}, \Theta_{2;i}, \Theta_{3;i} \}`` over subsamples of our given set of cells 100 times for each gene. +The mean and standard deviation of the resultant empirical distribution was compared directly to our estimates and uncertainty calculations performed on the full dataset. +As shown below, we find great quantitative agreement between the empirical estimates and our original calculations, suggesting strongly that we are _not_ overfitting the data. + +```@raw html +<p align="center"> +<figure> + <img src="/assets/bootstrap_1.png" width="60%" /> + <img src="/assets/bootstrap_2.png" width="60%" /> + <img src="/assets/bootstrap_3.png" width="60%" /> +</figure> +</p> +``` + +### Drosophila results + +Just as we performed using the toy data, once parameters ``\{\Theta_{1;i}, \Theta_{2;i}, \Theta_{3;i} \}`` are estimated, we can utilize Eqs. (2-5) to normalize the variance matrix and subsequently estimate the mean ``\mu_{i\alpha}``. +As shown below, we find there are ``\sim 30`` **statistically significant** linear dimensions in the scRNAseq data obtained during early _Drosophila melongaster_ embryogenesis. +Interestingly, while not fully delocalized as seen by the participation ratio of the "noise" components, we see that roughly ``\sim 1000`` genes contribute significantly to each component suggesting these are coarse "pathways" discovered. +```@raw html +<p align="center"> +<img src="/assets/rank_estimate.png" width="49%" class="center"/> +<img src="/assets/participation_ratio.png" width="49%" class="center"/> +</p> +``` +To ensure element-wise positivity, the estimated ``\tilde{\mu}_{i\alpha}`` is obtained by performing non-negative matrix factorization on the rescaled ``\tilde{n}_{i\alpha}`` with rank ``35``. +The factorization was initialized using the nndsvda algorithm [^6] and minimized the least squares objective via a multiplicative update [^7]. + +[^6]: [SVD based initialization: A head start for nonnegative matrix factorization](https://www.sciencedirect.com/science/article/pii/S0031320307004359) +[^7]: [Algorithms for Non-negative Matrix Factorization](https://www.cs.cmu.edu/~11755/lectures/Lee_Seung_NMF.pdf) + +## Pearson residuals + +Once obtained, ``\tilde{\mu}_{i\alpha}`` provides us an estimate for the rescaled mean of the expression of gene ``i`` for cell ``\alpha``. +It is important to note, this _will_ still have gene-specific and cell-specific scales by construction. +However, our goal was originally to convert our raw count matrix into more natural units where variation across sequencing depth and gene expression are normalized out. +As such, our normalization pipeline requires one last step: convert each rescaled mean into a z-score that measures expression differentially to the observed distribution across cells. +Recall that a negative binomial stochastic model is equivalent to a Poisson-Gamma mixture, i.e. a poisson process whose mean is drawn from a Gamma distribution. +To this end, we estimate the significance of each ``\tilde{\mu}_{i\alpha}`` by fitting a Gamma distribution across cells for each gene, in exactly the same way as we performed for the raw count data with a negative binomial distribution. +Specifically, we take the mean to be given by Eq. (4) and minimize the analog of Eq. (5); we omit details here in the interest of brevity. +This was determined to be an excellent stochastic model for the computed ``\tilde{mu}_{i\alpha}``, as shown by the linear quantile-quantile plot below. + +```@raw html +<p align="center"> +<img src="/assets/gamma_qq.png" width="49%" class="center"/> +</p> +``` + +Taken together, our normalized gene count ``z_{i\alpha}`` is given by (the mean and variance is given by the estimated Gamma distribution) +```math + \tag{6} z_{i\alpha} \equiv \frac{\tilde{\mu}_{i\alpha} - \mu_{i\alpha}}{\sigma_{i\alpha}} +``` |