feat: empirical manifold learning

author: Nicholas <nbnoll@eml.cc> 2022-03-01 11:02:31 -0800
committer: Nicholas <nbnoll@eml.cc> 2022-03-01 11:02:31 -0800
commit: 470872e8a2ad0a73d3c9981093afae9977ed2fea (patch)
tree: 39d5f576407361abecc9d91b9871bc70c3eb3b67
parent: ff84390bb700077bf7b9203ff732b196e2a80722 (diff)
2 files changed, 116 insertions, 11 deletions
diff --git a/docs/src/sci/autoencode.md b/docs/src/sci/autoencode.md
index 6a1e2ae..9039864 100644
--- a/docs/src/sci/autoencode.md
+++ b/docs/src/sci/autoencode.md
@@ -1,12 +1,115 @@
 # Manifold learning
 
+## Introduction
+
+In section [scRNAseq spatial inference](@ref), we demonstrated that scRNAseq expression data can be directly mapped to space using a reference database  of gene expression patterns, provided by the Berkeley Drosophila Transcriptional Network Project.
+These results motivate a more ambitious question: is positional information directly encoded within gene expression and, as such, can we utilize _just_ scRNAseq counts to infer cellular spatial position _de-novo_?
+
+We anchor our inference approach on the _continuum ansatz_ of cellular states in a developing embryo.
+Specifically, we postulate that gene expression is a _smoothly_ varying, continuous function of space, and potentially other continuous degrees of freedom such as time.
+Said another way, neighboring cells are assumed to be infinitesimally "close" in expression.
+This formulation is a departure from the conventional, _discrete_, view of cellular fates taken, for example, in the description of the striped gene patterning observed during the Anterior-Posterior segmentation of the _Drosophila melongaster_ embryo.
+From our viewpoint, a "discontinuity" of gene expression in a handful of components can also be explained by the curvature of a _smooth_ surface embedded in very high dimensions.
+
+As formulated, spatial inference from scRNAseq data is equivalent to non-linear dimensional reduction: we want to find a small number of degrees of freedom that parameterize the variation across ``\sim 10^4`` genes.
+From [Drosophila results](@ref), we know that at least minimally, we can describe gene expression within early Drosophila embryogenesis by ``31`` relevant components.
+In this section, we analyze the data within this subspace to show that the data is ultimately generated by an even smaller set of degrees of freedom.
+Furthermore, we outline a protocol to "learn" this parameterization and show that it can recapitulate space.
+
 ## Empirical analysis
 
-* Point cloud scaling
-* Physically close cells are close in expression
-* Isomap analysis picks up spatial gradients
+### Hausdorff dimension estimation
+
+A critical parameter to determine _empirically_ is the underlying dimensionality of gene expression of early _Drosophila embryogenesis, as given by the scRNAseq data.
+Informally, we expect that if the data is sampled from an underlying ``d`` dimensional manifold, than the number of points ``n`` enclosed by a sphere of radius ``R`` should grow as
+```math
+    n(R) \sim R^d
+```
+However, we note that performing this comparison utilizing the Euclidean metric between normalized cellular expression would be manifestly incorrect; we postulated that gene expression is a manifold which implies a Euclidean metric _locally_ between neighbors.
+There is no _a priori_ reason to expect Euclidean distances to be a good measure for far cells.
+Similar considerations hold for other postulated global metrics.
+Instead, we estimate geodesic distances empirically, in analog to the Isomap algorithm [^1].
+
+[^1]: [A Global Geometric Framework for Nonlinear Dimensionality Reduction](https://www.science.org/doi/10.1126/science.290.5500.2319)
+
+#### Neighborhood graph construction
+The basic construction of the neighborhood graph is demonstrated in the below cartoon.
+```@raw html
+<p align="center">
+<img src="/assets/drosophila/radius_scaling_neighborhood.svg" width="32%" class="center"/>
+<img src="/assets/drosophila/radius_scaling_shortest_path.svg" width="32%" class="center"/>
+<img src="/assets/drosophila/radius_scaling_ball.svg" width="32%" class="center"/>
+</p>
+```
+Our algorithm to construct the neighborhood graph proceeds in three main steps.
+First, we visit the local neighborhood of each point (cell) within our dataset.
+In principle, the neighborhood can be defined by either a fixed number of neighbors ``k`` or a given radius ``R``.
+In practice, for _Drosophila melongaster_ we chose to define neighborhood by ``k=8``.
+The neighborhood is assumed to be a good approximation to the manifold tangent space and thus pairwise distances between the point and its neighbors are computed within the euclidean metric.
+Each pairwise neighborhood relationship is captured by an edge, weighted by the pairwise distance.
+The collection of all cells, and their neighborhood edges, constitute the _neighborhood graph_.
+
+Given a neighborhood graph as defined above, computing the "geodesic" distance between cells reduces to finding the shortest path between all pairs of points.
+This can be solved using either the Floyd-Warshall algorithm in ``\mathcal{O}(N^3)`` time [^2] or iteratively using Dijikstra's algorithm in ``\mathcal{O}(N(N\logN + E))`` time [^3].
+We chose the later due to the sparsity of the neighborhood graph.
+The estimation of geodesic distances, if performed as described, asymptotically approaches the true value as the density of points increases [^4].
+
+[^2]: [Algorithm 97: shortest path](https://dl.acm.org/doi/pdf/10.1145/367766.368168)
+[^3]: [A note on two problems in connexion with graphs](https://link.springer.com/article/10.1007/BF01386390)
+[^4]: [Graph approximations to geodesics on embedded manifolds](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.359.7343&rep=rep1&type=pdf)
 
-## Initial considerations
+Given our estimate for all pairwise geodesic distances, we can now empirically estimate the _Hausdorff_ dimension.
+Simply stated, the estimate is done by analyzing the scaling relationship between number of points enclosed within a _geodesic_ ball of radius ``R`` and the radius ``R`` itself.
+Below we show the results plotted for each cell within our dataset, along with a trend line ``n(R) \sim R^3`` shown in red.
+```@raw html
+<p align="center">
+<img src="/assets/drosophila/pointcloud_scaling.png" width="68%" class="center"/>
+</p>
+```
+The manifold clearly looks three-dimensional.
+
+### Physical proximity implies expression proximity
+
+In [Neighborhood graph construction](@ref), we demonstrated that the normalized scRNAseq appears to be three-dimensional.
+This supports _half_ of our continuum ansatz: gene expression of early Drosophila embryogenesis is consistent with being distributed along a low-dimensional manifold.
+However, _a priori_ is is unclear that such dimensions meaningfully correspond to space.
+Additionally, at the developmental stage sampled, the _Drosophila_ embryo is a two-dimensional epithelial monolayer and as such it is unclear what a potential third dimension would parameterize.
+To this end, we utilize the positional labels calculated in [scRNAseq spatial inference](@ref) to assay if physically close cells are close along the putative manifold.
+```@raw html
+<p align="center">
+<img src="/assets/drosophila/pointcloud_locality.png" width="68%" class="center"/>
+</p>
+```
+Specifically, we demonstrate that conditioning on varying _average_ spatial geodesic distances results in quantitative effects in the distribution of pairwise expression geodesic distances.
+Physically close cells are not only close in expression, but the median of the distribution increases quantitatively (albeit non-linearly) with increasing conditioned shells of distance.
+Taken together, the scRNAseq data is empirically _consistent_ with our continuum expression ansatz.
+
+### Isomap dimensional reduction
+We provide additional evidence in support of our underlying hypothesis by utilizing the Isomap algorithm [^1].
+As such, we simply perform a Principal Coordinates Analysis on the geodesic distances estimated in [Neighborhood graph construction](@ref).
+This is tantamount to finding a set of low-dimensional coordinates ``\xi_\alpha`` that minimize the strain between their Euclidean embedding and the given distance matrix ``D_{\alpha\beta}``.
+```math
+    E(\{\xi_\alpha\}) = \left(\frac{\left(\sum_{\alpha,beta} D_{\alpha\beta} - |\xi_\alpha - \xi_\beta|^2\right)^2}{\sum_{\alpha,\beta} D_{\alpha\beta}^2 }\right)^{1/2}
+```
+```@raw html
+<p align="center">
+<img src="/assets/drosophila/isomap_AP.png" width="42%" class="center"/>
+<img src="/assets/drosophila/isomap_DV.png" width="42%" class="center"/>
+</p>
+```
+Above, we show the resultant embedding, colored by the estimated AP (anterior-posterior) and DV (dorsal-vental) position, obtained by averaging over the distribution obtained in [scRNAseq spatial inference](@ref).
+It is clear that two of the major axes of the embedding quantitatively segregate both spatial axes of the embryo.
+We emphasize that the estimated spatial positions _were not_ used to generate the shown embedding, but rather _only_ the underlying neighborhood distances of cells within our scRNAseq dataset.
+This is taken as strong evidence in support of our underlying hypothesis.
+
+By varying the dimensionality of the Isomap embedding, we see that three dimensions is the "knee" of dimensioning returns, consistent with the 3-dimensional scaling observed above.
+```@raw html
+<p align="center">
+<img src="/assets/drosophila/isomap_dimension.png" width="68%" class="center"/>
+</p>
+```
+
+## Nonlinear manifold learning
 
 Want:
 * Dimensional reduction
@@ -19,7 +122,7 @@ Natural choice for an autoencoder.
 Utilize known positional labels as a validation step.
 Not used for training purposes.
 
-## Network architecture
+### Network architecture
 
 How to pick depth? Width?
 Worry about overfitting: enter dropout and batch normalization.
@@ -58,7 +161,7 @@ Utilizing this term is problematic for several reasons:
 2. Generically, ``d`` dimensional manifolds can not be isometrically embedded into ``\mathbb{R}^d``, e.g. the sphere into the plane.
 3. It trusts the computed distances quantitatively. We simply want close cells to be close in the resultant latent space.
 
-#### Differentiable ranking
+#### Monometric formulation
 Consider a vector ``\psi_\alpha`` of scores of length ``n`` we wish to rank.
 Furthermore, define ``\sigma \in \Sigma_n`` to be an arbitrary permutation of ``n`` such scores.
 We define the **argsort** to be the permutation that sorts ``\psi`` in descending order
@@ -76,9 +179,9 @@ We wish to devise an objective function that contains functions of the rank of s
 However, ``R(\bm{\psi})`` is a non-differentiable function; it maps a vector in ``\mathbb{R}^n`` to a permutation of ``n`` items.
 Hence, we can not directly utilize the rank in a loss function as there is no way to backpropagate gradient information to the network parameters.
 In order to rectify this limitation, we first reformulate the ranking problem as a linear programming problem that permits efficient regularization.
-Note, the presentation here follows closely the original paper [^1]
+Note, the presentation here follows closely the original paper [^5]
 
-[^1]: [Fast Differentiable Sorting and Ranking](https://arxiv.org/abs/2002.08871)
+[^5]: [Fast Differentiable Sorting and Ranking](https://arxiv.org/abs/2002.08871)
 
 ##### Linear program formulation
 
@@ -126,7 +229,7 @@ Note that the limit ``\epsilon \rightarrow 0`` reproduces the linear programming
 Conversely, in the limit ``\epsilon \rightarrow \infty``, the solution will go to a constant vector that has the smallest modulus on the _permutahedron_.
 
 ##### Solution
-It has been demonstrated before that the above problem reduces to simple isotonic regression[^1][^2].
+It has been demonstrated before that the above problem reduces to simple isotonic regression[^5][^6].
 Specifically,
 ```math
 R\left(\bm{\psi}\right) = -\frac{\bm{\psi}}{\epsilon} -
@@ -149,10 +252,12 @@ Furthermore, the solution admits a simple, calculatable Jacobian
 where ``\bm{B}_i`` denotes the matrix corresponding to the `i^{th}` block obtained during isotonic regression.
 It is a constant matrix whose number of rows and columns equals the size of the block, and whose values all sum to 1.
 
-[^2]: [SparseMAP: Differentiable Sparse Structured Inference](https://arxiv.org/abs/1802.04223)
+[^6]: [SparseMAP: Differentiable Sparse Structured Inference](https://arxiv.org/abs/1802.04223)
 
 #### Loss function
 
 ### Uniform sampling of latent space
 
+### Swiss roll validation
+
 ## Results
diff --git a/docs/src/sci/inference.md b/docs/src/sci/inference.md
index ce166d0..79f2621 100644
--- a/docs/src/sci/inference.md
+++ b/docs/src/sci/inference.md
@@ -1,4 +1,4 @@
-# scRNAseq Spatial Inference
+# scRNAseq spatial inference
 
 ## Introduction
author	Nicholas <nbnoll@eml.cc>	2022-03-01 11:02:31 -0800
committer	Nicholas <nbnoll@eml.cc>	2022-03-01 11:02:31 -0800
commit	470872e8a2ad0a73d3c9981093afae9977ed2fea (patch)
tree	39d5f576407361abecc9d91b9871bc70c3eb3b67
parent	ff84390bb700077bf7b9203ff732b196e2a80722 (diff)