author     Nicholas <nbnoll@eml.cc>  2022-02-24 15:34:14 -0800
committer  Nicholas <nbnoll@eml.cc>  2022-02-24 15:34:14 -0800
commit     dcaa6396cf980f84356414ab71ed10590fd6f24a (patch)
tree       f65ec25b093e92bf6852d989be69407dfe631b7a
parent     c3ee3f6dbdb146e3744e8305df4580acfee16eae (diff)
feat: normalization schema detailed in documentation
-rw-r--r--  docs/src/sci/inference.md  |  73
-rw-r--r--  docs/src/sci/normalize.md  | 229
2 files changed, 267 insertions(+), 35 deletions(-)
diff --git a/docs/src/sci/inference.md b/docs/src/sci/inference.md
index 0f8e200..84f1b75 100644
--- a/docs/src/sci/inference.md
+++ b/docs/src/sci/inference.md
@@ -22,6 +22,21 @@ Below we provide an (incomplete) overview of past attempts at this problem, as w
There are published techniques that attempt to solve this problem; however, our attempts to utilize them were unsuccessful.
## Our approach
+While the discussion below is general, all results shown are within the context of _Drosophila melanogaster_ and thus utilize the Berkeley Drosophila Transcriptional Network Project (BDTNP) database [^1].
+Specifically, the BDTNP collaboration produced a "virtual embryo" point cloud discretizing the spatial expression pattern of ``84`` genes.
+The virtual embryo was constructed by aligning fluorescent point-cloud data obtained by FISH of similarly staged embryos.
+```@raw html
+<p align="center">
+<figure>
+ <img src="/assets/bdtnp.jpg" width="99%" />
+  <figcaption>
+    Cartoon of the procedure used to generate the virtual embryo by the BDTNP.
+  </figcaption>
+</figure>
+</p>
+```
+
+[^1]: [Registering Drosophila Embryos at Cellular Resolution to Build a Quantitative 3D Atlas of Gene Expression Patterns and Morphology](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=1540644)
### Objective function
Let ``\alpha`` and ``a`` denote indices over _scRNAseq_ cells and embryonic spatial positions respectively.
@@ -31,15 +46,15 @@ We pose this problem using the tools of _Statistical Physics_ and thus formulate
\tag{1} F(\rho_{a\alpha}) \equiv \displaystyle\sum\limits_{\alpha,a} \epsilon\left(\vec{g}_\alpha,\vec{G}_a\right)\rho_{a\alpha} + T\rho_{a\alpha}\log\left(\rho_{a\alpha}\right)
```
where ``\vec{g}_\alpha`` and ``\vec{G}_a`` denote the transcriptomic state of cell ``\alpha`` and position ``a`` respectively.
-We note that, as formulated, this problem is equivalent to regularized optimal transport[^1].
+We note that, as formulated, this problem is equivalent to regularized optimal transport[^2].
``\epsilon`` denotes the energy required to map cell ``\alpha`` onto position ``a``.
For now, we leave the functional form generic but note that it explicitly depends upon the gene expression values of both the sequenced cell ``\alpha`` and the in-situ position ``a``.
Additionally, ``T`` is a parameter analogous to a thermodynamic "temperature"; it controls the precision that we demand of the inferred position for each sequenced cell.
-As ``T \rightarrow 0``, each cell is constrained to injectively map onto a spatial position; the problem reduces to the assignment problem[^2].
+As ``T \rightarrow 0``, each cell is constrained to injectively map onto a spatial position; the problem reduces to the assignment problem[^3].
Conversely, as ``T \rightarrow \infty``, the entropic term dominates; the solution converges onto the uniform distribution.
-[^1]: [Sinkhorn Distances: Lightspeed Computation of Optimal Transportation Distances](https://arxiv.org/abs/1306.0895)
-[^2]: [Computational optimal transport: With applications to data science](https://www.nowpublishers.com/article/DownloadSummary/MAL-073)
+[^2]: [Sinkhorn Distances: Lightspeed Computation of Optimal Transportation Distances](https://arxiv.org/abs/1306.0895)
+[^3]: [Computational optimal transport: With applications to data science](https://www.nowpublishers.com/article/DownloadSummary/MAL-073)
However, if extremized as written, Eq. (1) would not result in a well-formed probability distribution.
Specifically, we must additionally constrain the row and column sums of ``\rho_{a\alpha}`` so that the marginals are correctly interpretable.
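+In practice, the constrained minimization of Eq. (1) can be carried out with a standard Sinkhorn iteration. A minimal sketch (not the package's actual implementation) is given below, assuming uniform target marginals over positions and cells; the cost matrix `ε` and temperature `T` are inputs.
+```julia
+using LinearAlgebra
+
+# Entropy-regularized transport: minimize Eq. (1) subject to fixed marginals.
+# ε is the (positions × cells) cost matrix; p and q are the target marginals.
+function sinkhorn(ε::AbstractMatrix, T::Real;
+                  p = fill(1/size(ε,1), size(ε,1)),
+                  q = fill(1/size(ε,2), size(ε,2)), iters = 500)
+    K = exp.(-ε ./ T)                    # Gibbs kernel
+    u, v = ones(size(ε,1)), ones(size(ε,2))
+    for _ in 1:iters
+        u = p ./ (K * v)                 # enforce the row (position) marginal
+        v = q ./ (K' * u)                # enforce the column (cell) marginal
+    end
+    return Diagonal(u) * K * Diagonal(v) # ρ[a, α]
+end
+```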
@@ -72,8 +87,8 @@ However, the complication is that our _scRNAseq_ data and the in-situ expression
Both datasets are in manifestly different unit systems: the _scRNAseq_ data are expressed in _UMI_ counts, while the underlying database is ultimately fluorescent intensity collated from a myriad of FISH experiments.
We postulate that the "true" transformation that maps _scRNAseq_ counts ``g_{\alpha i}`` to fluorescent intensity ``G_{ai}`` should minimize the distortion between the two observed distributions under the action of said map.
-This ultimately suggests we identify the putative transformation via minimizing the Wasserstein metric via optimal transport [^3].
-It has been shown [^4] that strictly convex cost functions ``\varepsilon`` between ``1``D distributions admit a unique optimal transport solution.
+This ultimately suggests we identify the putative transformation by minimizing the Wasserstein metric via optimal transport [^4].
+It has been shown [^5] that strictly convex cost functions ``\varepsilon`` between ``1``D distributions admit a unique optimal transport solution.
Specifically, if we denote the cumulative density of ``g_{\alpha i}`` and ``G_{ai}`` by ``\phi_{i}`` and ``\Phi_{i}`` respectively, the minimizing transformation is given by ``\Phi_{i}^{-1} \circ \phi_{i}``.
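+A minimal sketch of this quantile matching for a single gene is shown below (a hypothetical helper; ties and interpolation are handled crudely). It sends each count to the fluorescence intensity occupying the same quantile.
+```julia
+# Map scRNAseq counts g onto the FISH intensity scale G by matching quantiles:
+# each count x is sent to Φ⁻¹(φ(x)), with φ and Φ the empirical CDFs.
+function quantile_map(g::AbstractVector, G::AbstractVector)
+    Gs, N = sort(G), length(G)
+    map(g) do x
+        ϕ = count(y -> y <= x, g) / length(g)   # empirical CDF φ(x)
+        Gs[clamp(ceil(Int, ϕ * N), 1, N)]       # inverse CDF Φ⁻¹ evaluated at φ(x)
+    end
+end
+```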
We assume a Gaussian sampling probability with mean given by the BDTNP database such that our one-body energy takes the form
```math
@@ -90,11 +105,53 @@ Minimizing of Eqn. (1) is singular as ``T \to 0`` and reduces to the assignment
Conversely, as ``T \to \infty`` the entropic contribution to Eqn. (1) dominates and thus admits a uniform sampling distribution with no patterning.
As such, we fix the temperature by hyperparameter optimization, i.e. we choose the value that maximally correlates with the original BDTNP database.
-[^3]: [Optimal transport: old and new](https://www.cedricvillani.org/sites/dev/files/old_images/2012/08/preprint-1.pdf)
-[^4]: [Mass Transportation Problems: Volume I: Theory](https://books.google.com/books/about/Mass_Transportation_Problems.html?id=t1LsSrWKjKQC)
+[^4]: [Optimal transport: old and new](https://www.cedricvillani.org/sites/dev/files/old_images/2012/08/preprint-1.pdf)
+[^5]: [Mass Transportation Problems: Volume I: Theory](https://books.google.com/books/about/Mass_Transportation_Problems.html?id=t1LsSrWKjKQC)
## Results
+Once the sampling probability ``\rho_{a\alpha}`` is known, we can immediately compute the mean expression profile for each gene
+```math
+ \bar{g}_{ia} \equiv \displaystyle\sum\limits_{\alpha} \rho_{a\alpha} g_{i\alpha}
+```
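+In code, this is a single contraction over cells; a one-line sketch, assuming ``g`` is stored genes ``\times`` cells and ``\rho`` positions ``\times`` cells:
+```julia
+# ḡ[i,a] = Σ_α ρ[a,α] g[i,α]
+ḡ = g * transpose(ρ)
+```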
+We compare the computed mean expression profile to the "known" pattern as given by the BDTNP database.
+As shown in the figure below, we capture ``\sim 70\%`` of the variance of the database, on average across all ``84`` genes, at the optimal temperature.
+The red vertical bars display the standard deviation across all genes.
+The mapping at the optimal temperature corresponds to a sampling distribution ``\rho_{a\alpha}`` with an entropy of ``\sim 6`` bits, i.e. each scRNAseq cell maps onto ``\sim 60`` positions of the embryo.
+```@raw html
+<p align="center">
+<figure>
+ <img src="/assets/bdtnp_fit.png" width="49%" />
+ <img src="/assets/bdtnp_entropy.png" width="49%" />
+  <figcaption>
+    (Left) Residuals of scRNAseq expression patterns matched to the BDTNP database.
+    (Right) Entropy of the sampling distribution, averaged over cells.
+  </figcaption>
+</figure>
+</p>
+```
+We note that our positional precision is significantly less than what has been found in previous studies utilizing fluorescence data [^6], which concluded that gene expression could identify position with subcellular accuracy.
+The origin of the discovered "imprecision" is likely multifaceted, owing in part to technical noise inherent to scRNAseq technology, systematic errors due to mistiming of the embryo stage between the database and the scRNAseq data, as well as true biological variation associated with the full 2D positional information.
+Deconvolving the contributions of each stochastic process to our measurement error is an interesting future direction for research.
+
+[^6]: [Positional information, in bits](https://www.pnas.org/content/110/41/16301)
+
+In addition to matching the small subset of genes contained within the BDTNP database, we also produce _predictions_ for gene expression for the remaining ``\sim 10^4`` genes.
+The full database of predictions is available upon request; an exploratory web front-end is currently under development.
+Below we display a representative sample.
+```@raw html
+<p align="center">
+<figure>
+ <img src="/assets/gene/eve.png" width="32%" />
+ <img src="/assets/gene/ftz.png" width="32%" />
+ <img src="/assets/gene/twi.png" width="32%" />
+  <figcaption>
+    Genes shown (left-to-right): eve, ftz, twi.
+  </figcaption>
+</figure>
+</p>
+```
+
## Discussion
Unfortunately, such databases only exist for a select few model organisms, which limits the applicability of this approach.
diff --git a/docs/src/sci/normalize.md b/docs/src/sci/normalize.md
index ac3fcb1..e6056d6 100644
--- a/docs/src/sci/normalize.md
+++ b/docs/src/sci/normalize.md
@@ -31,6 +31,17 @@ Our goal during the normalization procedure is two-fold:
1. Estimate the low-rank mean ``\mu_{\alpha,i}``.
2. Estimate the sampling variance ``\langle\delta_{\alpha,i}^2\rangle`` of the experimental noise. Brackets denote an average over (theoretical) realizations of sequencing. This will eventually require an explicit model.
+```@raw html
+<p align="center">
+<figure>
+ <img src="/assets/heteroskedastic.png" width="79%" />
+  <figcaption>
+    The count matrix is heteroskedastic: scales vary across genes (rows) and cells (columns).
+  </figcaption>
+</figure>
+</p>
+```
+
Given both quantities, normalization over gene- *and* cell-specific biases can be achieved by requiring the variances of all marginal distributions to be one.
Specifically, we rescale all counts by cell-specific ``c_\alpha`` and gene-specific ``g_i`` factors ``\tilde{n}_{\alpha,i} \equiv c_\alpha n_{\alpha,i} g_i`` such that
```math
@@ -38,7 +49,19 @@ Specifically, we rescale all counts by cell-specific ``c_\alpha`` and gene-speci
```
This system of equations can be solved by the Sinkhorn-Knopp algorithm, provided we have a model for ``\langle\delta_{\alpha,i}^2\rangle`` parameterized by measurables.
Within this formalism, this choice of model fully determines the normalization scheme.
-Owing to dropout and other sources of overdispersion, we model scRNAseq counts as sampled from an empirically estimated [Heteroskedastic Negative Binomial](@ref) distribution.
+Owing to dropout and other sources of overdispersion, shown empirically in the figure below, we model scRNAseq counts as sampled from an empirically estimated [Heteroskedastic Negative Binomial](@ref) distribution.
+
+```@raw html
+<p align="center">
+<figure>
+ <img src="/assets/overdispersed_mean_vs_variance.png" width="49%" />
+ <img src="/assets/overdispersed_zeros.png" width="49%" />
+  <figcaption>
+    scRNAseq data for Drosophila are overdispersed.
+  </figcaption>
+</figure>
+</p>
+```
### Case studies
@@ -54,6 +77,16 @@ In this limit, Eq.(1) reduces to the well-studied spiked population covariance m
If ``\mu_{\alpha,i} = 0``, the distribution of singular values of the count matrix ``n_{\alpha,i}`` would asymptotically be given by the [Marchenko-Pastur distribution](https://en.wikipedia.org/wiki/Marchenko–Pastur_distribution).
As such, singular values would be bounded by ``\bar{\lambda} \equiv 1+\sigma\sqrt{N_g/N_c}``.
+
+```@raw html
+<p align="center">
+<img src="/assets/marchenko-pastur.png" width="49%" class="center"/>
+</p>
+<p align="center">
+Marchenko-Pastur distribution (orange) vs. random matrix eigenvalues (blue).
+</p>
+```
+
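+The noise floor is easy to check numerically; a minimal sketch with unit-variance iid Gaussian noise, using the un-normalized edge ``\sqrt{N_c}+\sqrt{N_g}`` quoted later in the text:
+```julia
+using LinearAlgebra, Random
+
+Random.seed!(1)
+Nc, Ng = 2000, 500
+δ = randn(Nc, Ng)                  # pure sampling noise, unit variance
+edge = sqrt(Nc) + sqrt(Ng)         # Marchenko-Pastur edge for unit-variance noise
+println(maximum(svdvals(δ)), " vs ", edge)  # largest singular value concentrates just below the edge
+```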
Now consider the case of a rank 1 mean, i.e. 1 cell type in the data.
```math
n_{\alpha,i} = \gamma x_\alpha \bar{x}_i + \delta_{\alpha,i}
@@ -64,6 +97,14 @@ It has been shown[^1][^2] that this model exhibits the following asymptotic phas
This procedure can be generalized to higher-rank spike-ins; sub-leading principal components can be found by simply subtracting the previously inferred component from the count matrix ``n_{\alpha,i}``.
As such, we can only expect to meaningfully measure the principal components of ``\mu_{\alpha,i}`` that fall above the sampling noise floor, given by the Marchenko-Pastur distribution.
Consequently, this forces us to define the statistically significant **rank** of the count matrix.
+This is exhibited empirically below.
+
+```@raw html
+<p align="center">
+<img src="/assets/gaussian_svd.png" width="49%" class="center"/>
+<img src="/assets/gaussian_overlap.png" width="49%" class="center"/>
+</p>
+```
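+A companion sketch with a planted rank-1 spike; counting singular values above the same edge recovers the statistically significant rank (the spike strength ``\gamma`` here is chosen well above threshold):
+```julia
+using LinearAlgebra, Random
+
+Random.seed!(2)
+Nc, Ng = 2000, 500
+u, v = randn(Nc), randn(Ng)
+γ = 5.0                                       # spike strength, far above threshold
+n = γ * u * transpose(v) + randn(Nc, Ng)      # rank-1 mean plus unit-variance noise
+edge = sqrt(Nc) + sqrt(Ng)
+println(count(s -> s > edge, svdvals(n)))     # estimated rank: 1
+```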
[^1]: [The singular values and vectors of low rank perturbations of large rectangular random matrices](https://arxiv.org/abs/1103.2221)
[^2]: [The largest eigenvalue of rank one deformation of large Wigner matrices](https://arxiv.org/abs/math/0605624)
@@ -84,7 +125,12 @@ which provides an explicit system of equations to estimate the scaling factors `
Once the scaling factors are obtained, ``\tilde{\mu}_{\alpha,i}`` can be estimated via singular value decomposition of ``c_\alpha n_{\alpha,i} g_i``; all components with singular value greater than ``\bar{\lambda} = \sqrt{N_c}+\sqrt{N_g}`` can be confidently attributed to the "true mean" ``\mu_{\alpha,i}`` while all other components fall amongst the noise.
An example is shown below:
-**TODO**: ADD FIGURE
+```@raw html
+<p align="center">
+<img src="/assets/poisson_svd.png" width="49%" class="center"/>
+<img src="/assets/poisson_overlap.png" width="49%" class="center"/>
+</p>
+```
[^3]: [Asymptotics of Sample Eigenstructure for a Large Dimensional Spiked Covariance Model](http://www3.stat.sinica.edu.tw/statistica/oldpdf/A17n418.pdf)
[^4]: [Biwhitening Reveals the Rank of a Count Matrix](https://arxiv.org/abs/2103.13840)
@@ -93,54 +139,183 @@ An example is shown below:
#### Heteroskedastic Negative Binomial
A negative binomial distribution is often used to model overdispersed count data, i.e. a process in which the variance grows superlinearly with the mean.
-Canonically the distribution arises from the distribution of the number of successes (with probabiliy ``p``) obtained after ``\phi`` failures of a Bernoulli process.
+Canonically, the distribution arises as the distribution of the number of successes (with probability ``p``) obtained before ``\Theta_3`` failures of a Bernoulli process.
However, the generative stochastic process can equivalently be modelled as an underlying Poisson process in which the emission rate is itself a stochastic variable drawn from a Gamma distribution.
-This allows us to analytically continue ``\phi`` from an integer to the reals.
-Provided we have estimated the mean ``\mu`` and the overdisperson factor ``\phi``, the unbiased estimator for the variance is given by
+This allows us to analytically continue ``\Theta_{3}`` from an integer to the reals.
+Provided we have estimated the mean ``\mu_{i\alpha}`` and the overdispersion factor ``\Theta_{3;i}``, the unbiased estimator for the variance is given by
```math
- \langle \delta^2 \rangle= = \mu\frac{1 + \mu\phi}{1 + \phi}
+ \tag{3} \langle \delta_{i\alpha}^2 \rangle = \mu_{i\alpha}\left(\frac{1 + \mu_{i\alpha}\Theta_{3;i}}{1 + \Theta_{3;i}}\right)
```
-Direct substitution into Eq. (2) would provide the necessary cell-specific $c_\alpha$ and gene-specific $g_i$ scaling factors.
+Direct substitution into Eq. (2) would provide the necessary cell-specific ``c_\alpha`` and gene-specific ``g_i`` scaling factors.
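+A minimal sketch of how Eq. (3) feeds the Sinkhorn-Knopp iteration is given below. The exact normalization imposed by Eq. (2) is not reproduced in this diff, so the sketch assumes it fixes the mean rescaled variance of every gene (row) and cell (column) to one; `nb_var` implements the estimator of Eq. (3).
+```julia
+# Unbiased variance estimator of Eq. (3) for a single entry.
+nb_var(μ, θ₃) = μ * (1 + μ*θ₃) / (1 + θ₃)
+
+# Sinkhorn-Knopp-style iteration for the scale factors, assuming Eq. (2) demands
+# a unit mean rescaled variance along every row and column of V[i,α] = ⟨δ²⟩.
+function scale_factors(V::AbstractMatrix; iters = 200)
+    Ng, Nc = size(V)
+    g, c = ones(Ng), ones(Nc)
+    for _ in 1:iters
+        g .= 1 ./ sqrt.(vec(sum(V .* (c .^ 2)', dims = 2)) ./ Nc)  # gene condition
+        c .= 1 ./ sqrt.(vec(sum(V .* (g .^ 2),  dims = 1)) ./ Ng)  # cell condition
+    end
+    return c, g    # rescaled counts: ñ[i,α] = c[α] * n[i,α] * g[i]
+end
+```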
-## Empirical Sampling Distribution Estimation
+## Empirical sampling distribution
-As such, in order to normalize the estimated sampling variance, we must formulate a method to fit a negative binomial to the measured count data _per gene_.
+In order to normalize the estimated sampling variance, we must formulate a method to fit a negative binomial to the measured count data _per gene_.
We parameterize the distribution as follows:
```math
- p\left(n|\mu,\phi\right) = \frac{\Gamma\left(n+\phi\right)}{\Gamma\left(n+1\right)\Gamma\left(\phi\right)}\left(\frac{\mu}{\mu+\phi}\right)^n\left(\frac{\phi}{\mu+\phi}\right)^\phi
+ p\left(n|\mu,\Theta_{3}\right) = \frac{\Gamma\left(n+\Theta_{3}\right)}{\Gamma\left(n+1\right)\Gamma\left(\Theta_{3}\right)}\left(\frac{\mu}{\mu+\Theta_{3}}\right)^n\left(\frac{\Theta_{3}}{\mu+\Theta_{3}}\right)^{\Theta_{3}}
```
-### Generalized Linear Model
+All results, unless explicitly stated otherwise, are obtained empirically from scRNAseq data collected during early _Drosophila melanogaster_ embryogenesis.
-A central complication in modeling counts of a given gene _across_ cells with the above function is that there are confounding variables that require explicit consideration.
-For the present discussion, we explictly model the hetereogeneous sequencing depth across cells: naively we expect that if we sequenced a given cell with ``2x`` depth, each gene should scale accordingly.
+### Generalized linear model
+
+A central complication in modeling the counts of a given gene _across_ cells with the above generative model is that there are confounding variables that require consideration.
+For the present discussion, we explicitly model the heterogeneous sequencing depth across cells: naively, we expect that if we sequenced a given cell at ``2\times`` the depth, each gene should scale accordingly.
As such, we formulate a generalized linear model
```math
- \log \mu_{i\alpha} = A_i + B_i \log n_{\alpha}
+ \tag{4} \log \mu_{i\alpha} = \Theta_{1;i} + \Theta_{2;i} \log \chi_{\alpha}
```
-where ``n_\alpha`` denotes the total sequencing depth of cell ``\alpha``.
+where ``\chi_\alpha`` denotes the total sequencing depth of cell ``\alpha``.
We note this formulation can be easily extended to account for other confounds such as batch effect or cell type.
The likelihood function for cell ``\alpha``, gene ``i`` is
```math
- p\left(n_{i\alpha}|A_i, B_i, n_\alpha, \phi_i\right) = \frac{\Gamma\left(n_{i\alpha}+\phi_i\right)}{\Gamma\left(n_{i\alpha}+1\right)\Gamma\left(\phi_i\right)}\left(\frac{\mu_{i\alpha}}{\mu_{i\alpha}+\phi_i}\right)^{n_{i\alpha}}\left(\frac{\phi_i}{\mu_{i\alpha}+\phi_i}\right)^{\phi_{i}}
+ p\left(n_{i\alpha}|\Theta_{1;i},\Theta_{2;i}, \Theta_{3;i},\chi_\alpha\right) = \frac{\Gamma\left(n_{i\alpha}+\Theta_{3;i}\right)}{\Gamma\left(n_{i\alpha}+1\right)\Gamma\left(\Theta_{3;i}\right)}\left(\frac{\mu_{i\alpha}}{\mu_{i\alpha}+\Theta_{3;i}}\right)^{n_{i\alpha}}\left(\frac{\Theta_{3;i}}{\mu_{i\alpha}+\Theta_{3;i}}\right)^{\Theta_{3;i}}
```
### Maximum Likelihood Estimation
-``\{A_i, B_i, \phi_i\}`` represent ``3N_g`` parameters we must infer from the data.
-A priori this seems underdetermined given our ``N_c \times N_g`` sized matrix and thus we attempt to estimate all parameters within a maximum likelihood framework.
+``\{\Theta_{1;i}, \Theta_{2;i}, \Theta_{3;i} \}`` represent ``3N_g`` parameters we must infer from the data.
+We note that _a priori_ this problem is overdetermined: our count matrix has ``N_g \times N_c`` entries for only ``3N_g`` parameters, and thus we attempt to estimate all parameters within a maximum likelihood framework.
This is equivalent to minimizing (for each gene independently)
```math
- \mathcal{L} = \displaystyle\sum\limits_{\alpha}
- \left(n_{i\alpha} + \phi_i\right)\log\left(e^{A_{i}}n_\alpha^{B_i} + \phi_i\right)
- - \phi_i\log\left(\phi_i\right)
- - n_{i\alpha}\left(A_i + B_i\log n_\alpha\right)
- - \log\left(\frac{\Gamma\left(n_{i\alpha}+\phi_i\right)}{\Gamma\left(n_{i\alpha}+1\right)\Gamma\left(\phi_i\right)}\right)
+\begin{aligned}
+ \tag{5} \mathcal{L}_i = \displaystyle\sum\limits_{\alpha}
+ \left(n_{i\alpha} + \Theta_{3;i}\right)\log\left(e^{\Theta_{1;i}}\chi_\alpha^{\Theta_{2;i}} + \Theta_{3;i}\right)
+ - \Theta_{3;i}\log\left(\Theta_{3;i}\right) \\
+ - n_{i\alpha}\left(\Theta_{1;i} + \Theta_{2;i}\log \chi_\alpha\right)
+ - \log\left(\frac{\Gamma\left(n_{i\alpha}+\Theta_{3;i}\right)}{\Gamma\left(n_{i\alpha}+1\right)\Gamma\left(\Theta_{3;i}\right)}\right)
+\end{aligned}
+```
+Empirically, we determined that a robust estimate of all three parameters is obtained when the confounding variable ``\chi_{\alpha}`` is taken to be
+```math
+ \chi_\alpha \equiv \exp\left(\langle \log\left(n_{i\alpha}+1\right) \rangle\right) - 1
```
+where angle brackets denote the empirical average over genes for cell ``\alpha``.
+Once the parameters ``\{\Theta_{1;i}, \Theta_{2;i}, \Theta_{3;i} \}`` are estimated, Eqns. (2-4) can be utilized to estimate the normalized mean count matrix ``\tilde{\mu}_{i\alpha}``.
+This also immediately yields the **statistically significant rank** of the count matrix.
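+A sketch of the per-gene fit is given below, assuming counts are stored genes ``\times`` cells and using Nelder-Mead from Optim.jl; the helper names (`depth`, `nll`, `fit_gene`) are illustrative, not the package's API. ``\Theta_3`` is log-parameterized to keep it positive.
+```julia
+using Optim, SpecialFunctions, Statistics
+
+# Confounding variable of the text: χ_α = exp(⟨log(n_{iα}+1)⟩_genes) - 1.
+depth(n::AbstractMatrix) = exp.(vec(mean(log.(n .+ 1), dims = 1))) .- 1
+
+# Per-gene negative log-likelihood of Eq. (5); θ = (Θ₁, Θ₂, log Θ₃).
+function nll(θ, counts, χ)
+    θ₁, θ₂, θ₃ = θ[1], θ[2], exp(θ[3])
+    logμ = θ₁ .+ θ₂ .* log.(χ)
+    μ = exp.(logμ)
+    ℓ = @. ((counts + θ₃)*log(μ + θ₃) - θ₃*log(θ₃) - counts*logμ -
+            loggamma(counts + θ₃) + loggamma(counts + 1) + loggamma(θ₃))
+    return sum(ℓ)
+end
+
+# Maximum likelihood estimate of {Θ₁, Θ₂, Θ₃} for one gene (one row of counts).
+function fit_gene(counts, χ; θ₀ = [log(mean(counts) + 1e-3), 1.0, 0.0])
+    res = optimize(θ -> nll(θ, counts, χ), θ₀, NelderMead())
+    θ = Optim.minimizer(res)
+    return (Θ₁ = θ[1], Θ₂ = θ[2], Θ₃ = exp(θ[3]))
+end
+```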
-### Overfitting
+#### Synthetic data verification
+As a first check on the methodology, we generated toy count data sampled from a negative binomial distribution with a low-rank mean.
+After estimation of parameters by minimizing Eq. (5), we solved Eq. (2) utilizing our unbiased estimator given by Eq. (3).
+The result is shown below.
-### Maximum A Posterior
+```@raw html
+<p align="center">
+<img src="/assets/negbinom.png" width="49%" class="center"/>
+<img src="/assets/negbinom_mean.png" width="49%" class="center"/>
+</p>
+```
+As shown, we underestimate the true rank as a few components have singular values below the Marchenko-Pastur noise floor.
+This, along with the noisy estimation of the overdispersion factor ``\Theta_3``, makes our mean estimate noisier than in the Poisson case above.
-#### Empirical Estimation of Priors
+#### Filter uncertain genes
+The uncertainty of our parameter estimates can be computed from the second derivative of the likelihood around the determined minimum
+```math
+ \delta \Theta_{a}^2 = \left[\partial_{\Theta_{b}}\partial_{\Theta_{c}} \mathcal{L}\right]^{-1}_{aa}
+```
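+A sketch of this error estimate using a central finite-difference Hessian of the per-gene likelihood ``\mathcal{L}_i`` at its minimum (a generic numerical helper, not the package's implementation):
+```julia
+using LinearAlgebra
+
+# Central finite-difference Hessian of f at θ.
+function hessian(f, θ; h = 1e-4)
+    k = length(θ)
+    unit(i) = [j == i ? h : 0.0 for j in 1:k]
+    H = zeros(k, k)
+    for a in 1:k, b in 1:k
+        H[a, b] = (f(θ .+ unit(a) .+ unit(b)) - f(θ .+ unit(a) .- unit(b)) -
+                   f(θ .- unit(a) .+ unit(b)) + f(θ .- unit(a) .- unit(b))) / (4h^2)
+    end
+    return H
+end
+
+# δΘ_a² = [H⁻¹]_aa evaluated at the maximum likelihood estimate θ̂.
+uncertainty(f, θ̂) = diag(inv(hessian(f, θ̂)))
+```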
+Unsurprisingly, the uncertainty of our estimates for parameters ``\{\Theta_{1;i}, \Theta_{2;i}, \Theta_{3;i} \}`` is strongly dependent upon the underlying gene expression; our estimates for lowly expressed genes are highly uncertain.
+This can be seen in the figure below, which shows a scatter plot of each parameter estimate ``\Theta_{a}`` versus its uncertainty ``\delta \Theta_a``, colored by the average gene expression.
+```@raw html
+<p align="center">
+<figure>
+ <img src="/assets/nb_1_uncertainty_vs_expression.png" width="32%" />
+ <img src="/assets/nb_2_uncertainty_vs_expression.png" width="32%" />
+ <img src="/assets/nb_3_uncertainty_vs_expression.png" width="32%" />
+  <figcaption>
+    Each point is a gene (row); its color is determined by the gene's mean expression.
+  </figcaption>
+</figure>
+</p>
+```
+We note that the estimated ``\Theta_1`` grows monotonically with increasing gene expression, as expected by construction.
+Conversely, the average estimate for ``\Theta_2`` shows no obvious trend with expression level; however, the uncertainty ``\delta\Theta_2`` increases monotonically with decreasing gene expression.
+The estimated ``\Theta_3`` also appears to be independent of average expression, with a family of curves of increasing uncertainty defined by decreasing expression levels.
-## Results
+We capture the _total_ uncertainty in our estimates by analyzing the trace ``\delta \Theta^2 = \delta \Theta_1^2 + \delta \Theta_2^2 + \delta \Theta_3^2``.
+A scatter plot of the uncertainty versus average expression level is shown below.
+We see that expression level is an imperfect predictor of MLE uncertainty.
+The dashed cyan line denotes our chosen uncertainty cutoff used to remove genes from our count matrix.
+The right figure displays the cumulative distribution function of the filtered genes; as can be seen, these are lowly expressed genes that behave, in practice, as two-state variables.
+
+```@raw html
+<p align="center">
+<figure>
+ <img src="/assets/nb_total_uncertainty_vs_expression.png" width="49%" />
+ <img src="/assets/nb_badfits.png" width="49%" />
+  <figcaption>
+    Filtering genes with bad fits. Genes with high uncertainty are found to be lowly expressed.
+  </figcaption>
+</figure>
+</p>
+```
+
+Interestingly, as shown below, the mean estimated value for ``\Theta_2`` is _slightly higher_ than 1.
+This phenomenon persists across all gene expression levels.
+
+```@raw html
+<p align="center">
+<figure>
+ <img src="/assets/nb_param2.png" width="49%" />
+ <img src="/assets/nb_param3.png" width="49%" />
+  <figcaption>
+    Parameter distributions.
+  </figcaption>
+</figure>
+</p>
+```
+
+#### Verify estimates via bootstrap
+
+In order to test that the method does not overfit the data, especially for lowly expressed genes, we validated our estimates via bootstrap.
+Specifically, we re-ran our maximum likelihood estimation of ``\{\Theta_{1;i}, \Theta_{2;i}, \Theta_{3;i} \}`` over subsamples of our given set of cells 100 times for each gene.
+The mean and standard deviation of the resultant empirical distribution was compared directly to our estimates and uncertainty calculations performed on the full dataset.
+As shown below, we find excellent quantitative agreement between the bootstrap estimates and our original calculations, strongly suggesting that we are _not_ overfitting the data.
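+A sketch of the resampling loop, reusing the hypothetical `fit_gene` helper sketched above:
+```julia
+using Random, Statistics
+
+# Bootstrap one gene: refit on cells resampled with replacement, then summarize.
+function bootstrap_gene(counts, χ; B = 100)
+    Nc = length(counts)
+    fits = map(1:B) do _
+        idx = rand(1:Nc, Nc)                   # resample cells with replacement
+        fit_gene(counts[idx], χ[idx])
+    end
+    stats(k) = (mean([f[k] for f in fits]), std([f[k] for f in fits]))
+    return Dict(k => stats(k) for k in (:Θ₁, :Θ₂, :Θ₃))
+end
+```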
+
+```@raw html
+<p align="center">
+<figure>
+ <img src="/assets/bootstrap_1.png" width="60%" />
+ <img src="/assets/bootstrap_2.png" width="60%" />
+ <img src="/assets/bootstrap_3.png" width="60%" />
+</figure>
+</p>
+```
+
+### Drosophila results
+
+Just as with the toy data, once the parameters ``\{\Theta_{1;i}, \Theta_{2;i}, \Theta_{3;i} \}`` are estimated, we can utilize Eqs. (2-5) to normalize the variance matrix and subsequently estimate the mean ``\mu_{i\alpha}``.
+As shown below, we find there are ``\sim 30`` **statistically significant** linear dimensions in the scRNAseq data obtained during early _Drosophila melanogaster_ embryogenesis.
+Interestingly, while these components are not fully delocalized, as seen from the participation ratio of the "noise" components, roughly ``\sim 1000`` genes contribute significantly to each component, suggesting that these represent coarse "pathways".
+```@raw html
+<p align="center">
+<img src="/assets/rank_estimate.png" width="49%" class="center"/>
+<img src="/assets/participation_ratio.png" width="49%" class="center"/>
+</p>
+```
+To ensure element-wise positivity, the estimated ``\tilde{\mu}_{i\alpha}`` is obtained by performing non-negative matrix factorization on the rescaled ``\tilde{n}_{i\alpha}`` with rank ``35``.
+The factorization was initialized using the nndsvda algorithm [^6] and minimized the least squares objective via a multiplicative update [^7].
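+A minimal sketch of the multiplicative update [^7] for the Frobenius objective is shown below; for brevity, the nndsvda initialization [^6] is replaced by a simple random non-negative initialization.
+```julia
+using Random
+
+# Lee-Seung multiplicative updates: factor the rescaled counts X ≈ W*H with W, H ≥ 0.
+function nmf_mu(X::AbstractMatrix, k::Int; iters = 500, ϵ = 1e-9)
+    Ng, Nc = size(X)
+    W, H = rand(Ng, k), rand(k, Nc)     # stand-in for the nndsvda initialization
+    for _ in 1:iters
+        H .*= (W' * X) ./ (W' * W * H .+ ϵ)
+        W .*= (X * H') ./ (W * H * H' .+ ϵ)
+    end
+    return W, H                         # μ̃ ≈ W * H; the text uses rank k = 35
+end
+```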
+
+[^6]: [SVD based initialization: A head start for nonnegative matrix factorization](https://www.sciencedirect.com/science/article/pii/S0031320307004359)
+[^7]: [Algorithms for Non-negative Matrix Factorization](https://www.cs.cmu.edu/~11755/lectures/Lee_Seung_NMF.pdf)
+
+## Pearson residuals
+
+Once obtained, ``\tilde{\mu}_{i\alpha}`` provides an estimate of the rescaled mean expression of gene ``i`` in cell ``\alpha``.
+It is important to note that this _will_ still carry gene-specific and cell-specific scales by construction.
+However, our goal was originally to convert our raw count matrix into more natural units where variation across sequencing depth and gene expression are normalized out.
+As such, our normalization pipeline requires one last step: convert each rescaled mean into a z-score that measures expression relative to the observed distribution across cells.
+Recall that a negative binomial stochastic model is equivalent to a Poisson-Gamma mixture, i.e. a Poisson process whose mean is drawn from a Gamma distribution.
+To this end, we estimate the significance of each ``\tilde{\mu}_{i\alpha}`` by fitting a Gamma distribution across cells for each gene, in exactly the same way as we did for the raw count data with a negative binomial distribution.
+Specifically, we take the mean to be given by Eq. (4) and minimize the analog of Eq. (5); we omit details here in the interest of brevity.
+This was determined to be an excellent stochastic model for the computed ``\tilde{\mu}_{i\alpha}``, as shown by the linear quantile-quantile plot below.
+
+```@raw html
+<p align="center">
+<img src="/assets/gamma_qq.png" width="49%" class="center"/>
+</p>
+```
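+A sketch of this final step; as a simplification, the per-gene mean and standard deviation are estimated here by the method of moments across cells rather than by the Gamma GLM fit described above (in the text, ``\mu_{i\alpha}`` and ``\sigma_{i\alpha}`` also vary with the cell depth through Eq. (4)). The z-score is the one defined in Eq. (6) below.
+```julia
+using Statistics
+
+# Method-of-moments stand-in: per-gene mean and standard deviation across cells,
+# followed by the z-score of Eq. (6).
+function zscores(μ̃::AbstractMatrix)           # μ̃[i,α], genes × cells
+    μ = mean(μ̃, dims = 2)
+    σ = std(μ̃, dims = 2)
+    return (μ̃ .- μ) ./ σ
+end
+```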
+
+Taken together, our normalized gene count ``z_{i\alpha}`` is given by (where the mean ``\mu_{i\alpha}`` and standard deviation ``\sigma_{i\alpha}`` are given by the estimated Gamma distribution)
+```math
+ \tag{6} z_{i\alpha} \equiv \frac{\tilde{\mu}_{i\alpha} - \mu_{i\alpha}}{\sigma_{i\alpha}}
+```