aboutsummaryrefslogtreecommitdiff
path: root/docs/src/sci/inference.md
diff options
context:
space:
mode:
Diffstat (limited to 'docs/src/sci/inference.md')
-rw-r--r--docs/src/sci/inference.md73
1 files changed, 65 insertions, 8 deletions
diff --git a/docs/src/sci/inference.md b/docs/src/sci/inference.md
index 0f8e200..84f1b75 100644
--- a/docs/src/sci/inference.md
+++ b/docs/src/sci/inference.md
@@ -22,6 +22,21 @@ Below we provide an (incomplete) overview of past attempts at this problem, as w
There are published techniques that attempt to solve this problem, however our attempts to utilize them were unsuccessful.
## Our approach
+While the below discussion is general, all results shown within the context of _Drosophila melongaster_ and thus will be utilizing the Berkeley Drosophila Transcriptional Network Project database [^1].
+Specifically, the BDTNP collaboration produced a "virtual embryo" pointcloud discretizing the spatial expression pattern of ``84`` genes.
+The virtual embryo was constructed by aligning florescent pointcloud data obtained by FISH of similarly staged embryos.
+```@raw html
+<p align="center">
+<figure>
+ <img src="/assets/bdtnp.jpg" width="99%" />
+ <figurecaption>
+ Cartoon of the procedure used to generate the virtual embryo by the BDTNP.
+ </figurecaption>
+</figure>
+</p>
+```
+
+[^1]: [Registering Drosophila Embryos at Cellular Resolution to Build a Quantitative 3D Atlas of Gene Expression Patterns and Morphology](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=1540644)
### Objective function
Let ``\alpha`` and ``a`` denote indices over _scRNAseq_ cells and embryonic spatial positions respectively.
@@ -31,15 +46,15 @@ We pose this problem using the tools of _Statistical Physics_ and thus formulate
\tag{1} F(\rho_{a\alpha}) \equiv \displaystyle\sum\limits_{\alpha,a} \epsilon\left(\vec{g}_\alpha,\vec{G}_a\right)\rho_{a\alpha} + T\rho_{a\alpha}\log\left(\rho_{a\alpha}\right)
```
where ``\vec{g}_\alpha`` and ``\vec{G}_a`` denote the transcriptomic state of cell ``\alpha`` and position ``a`` respectively.
-We note that, as formulated, this problem is equivalent to regularized optimal transport[^1].
+We note that, as formulated, this problem is equivalent to regularized optimal transport[^2].
``\epsilon`` denotes the energy required to map cell ``\alpha`` onto position ``a``.
For now, we leave the functional form generic but denote that it explicitly depends upon the gene expression value of both the sequenced cell ``\alpha`` and the in-situ position ``a``.
Additionally, ``T`` is a parameter analogous to a thermodynamic "temperature"; it controls the precision that we demand of the inferred position for each sequenced cell.
-As ``T \rightarrow 0``, each cell is constrained to injectively map onto a spatial position; the problem reduces to the assignment problem[^2].
+As ``T \rightarrow 0``, each cell is constrained to injectively map onto a spatial position; the problem reduces to the assignment problem[^3].
Conversely, as ``T \rightarrow \infty``, the entropic term dominates; the solution converges onto the uniform distribution.
-[^1]: [Sinkhorn Distances: Lightspeed Computation of Optimal Transportation Distances](https://arxiv.org/abs/1306.0895)
-[^2]: [Computational optimal transport: With applications to data science](https://www.nowpublishers.com/article/DownloadSummary/MAL-073)
+[^2]: [Sinkhorn Distances: Lightspeed Computation of Optimal Transportation Distances](https://arxiv.org/abs/1306.0895)
+[^3]: [Computational optimal transport: With applications to data science](https://www.nowpublishers.com/article/DownloadSummary/MAL-073)
However, if extremized as written, Eq. (1) would not result in a well-formed probability distribution.
Specifically, we must additionally constrain the row and column sum of ``\rho_{a\alpha}`` to have correctly interpretable marginals.
@@ -72,8 +87,8 @@ However, the complication is that our _scRNAseq_ data and the in-situ expression
Both datasets are in manifestly different unit systems, the _scRNAseq_ data are expressed in _UMI_ counts while the underlying database is ultimately florescent intensity collated from a myriad of FISH experiments.
We postulate that the "true" transformation that maps _scRNAseq_ counts ``g_{\alpha i}`` to florescent intensity ``G_{ai}`` should minimize distortions between both observed distortions under the action of said map.
-This ultimately suggests we identify the putative transformation via minimizing the Wasserstein metric via optimal transport [^3].
-It has been shown [^4] that strictly convex cost functions ``\varepsilon`` between ``1``D distributions admit a unique optimal transport solution.
+This ultimately suggests we identify the putative transformation via minimizing the Wasserstein metric via optimal transport [^4].
+It has been shown [^5] that strictly convex cost functions ``\varepsilon`` between ``1``D distributions admit a unique optimal transport solution.
Specifically, if we denote the cumulative density of ``g_{\alpha i}`` and ``G_{ai}`` by ``\phi_{i}`` and ``\Phi_{i}`` respectively, the minimizing transformation is given by ``\Phi_{i}^{-1} \circ \phi_{i}``.
We assume a Gaussian sampling probability with mean given by the BDTNP database such that our one-body energy takes the form
```math
@@ -90,11 +105,53 @@ Minimizing of Eqn. (1) is singular as ``T \to 0`` and reduces to the assignment
Conversely, as ``T \to \infty`` the entropic contribution to Eqn. (1) dominates and thus admits a uniform sampling distribution with no patterning.
As such, we fix the temperature by hyperparameter optimization, i.e. to be the value that maximally correlates to the original BDTNP database.
-[^3]: [Optimal transport: old and new](https://www.cedricvillani.org/sites/dev/files/old_images/2012/08/preprint-1.pdf)
-[^4]: [Mass Transportation Problems: Volume I: Theory](https://books.google.com/books/about/Mass_Transportation_Problems.html?id=t1LsSrWKjKQC)
+[^4]: [Optimal transport: old and new](https://www.cedricvillani.org/sites/dev/files/old_images/2012/08/preprint-1.pdf)
+[^5]: [Mass Transportation Problems: Volume I: Theory](https://books.google.com/books/about/Mass_Transportation_Problems.html?id=t1LsSrWKjKQC)
## Results
+Once the sampling probability ``\rho_{a\alpha}`` is known, we can immediately compute the mean expression profile for each gene
+```math
+ \bar{g}_{ia} \equiv \displaystyle\sum\limits_{\alpha} \rho_{a\alpha} g_{i\alpha}
+```
+We compare the computed mean expression profile to the "known" pattern as given by the BDTNP database.
+As shown in the figure below, we capture ``\sim 70\%`` of the variance of the database, on average across all ``84`` genes, at the optimal temperature.
+The red vertical bars display the standard deviation across all genes.
+The mapping at the optimal temperature corresponds to a sampling entropy ``\rho_{i\alpha}`` of ``\sim 6`` bits, corresponding to each scRNAseq cell mapping to ``\sim 60`` positions of the embryo.
+```@raw html
+<p align="center">
+<figure>
+ <img src="/assets/bdtnp_fit.png" width="49%" />
+ <img src="/assets/bdtnp_entropy.png" width="49%" />
+ <figurecaption>
+ (Left)Residuals of scRNAseq expression patterns matches to BDTNP database.
+ (Right) Entropy of sampling distribution, averaged over cells.
+ </figurecaption>
+</figure>
+</p>
+```
+We note that our positional precision is significantly less than what has been found in previous studies utilizing florescence data [^6] that concluded gene expression could identify position with subcellular accuracy.
+The origin of the discovered "imprecision" is likely multifaceted owing, in part, to technical noise inherent to scRNAseq technology, systematic errors due to mistiming of the embryo stage between the database and scRNAseq data, as well as true biological variation associated to the full 2D positional information.
+Deconvolving the contributions of each stochastic process to our measurement error is an interesting future direction for research.
+
+[^6]: [Positional information, in bits](https://www.pnas.org/content/110/41/16301)
+
+In addition to matching the small subset of genes contained within the BDTNP database, we also produce _predictions_ for gene expression for the remaining ``\sim 10^4`` genes.
+The full database of predictions is available upon request; an exploratory web front-end is currently underway.
+Below we display a representative sample.
+```@raw html
+<p align="center">
+<figure>
+ <img src="/assets/gene/eve.png" width="32%" />
+ <img src="/assets/gene/ftz.png" width="32%" />
+ <img src="/assets/gene/twi.png" width="32%" />
+ <figurecaption>
+ Genes shown (left-to-right): eve, ftz, twi
+ </figurecaption>
+</figure>
+</p>
+```
+
## Discussion
Unfortunately, such databases only exist for a select few model organisms and thus limit the applicability of this approach.