diff options
author | Nicholas <nbnoll@eml.cc> | 2022-03-01 11:02:31 -0800 |
---|---|---|
committer | Nicholas <nbnoll@eml.cc> | 2022-03-01 11:02:31 -0800 |
commit | 470872e8a2ad0a73d3c9981093afae9977ed2fea (patch) | |
tree | 39d5f576407361abecc9d91b9871bc70c3eb3b67 | |
parent | ff84390bb700077bf7b9203ff732b196e2a80722 (diff) |
feat: empirical manifold learning
-rw-r--r-- | docs/src/sci/autoencode.md | 125 | ||||
-rw-r--r-- | docs/src/sci/inference.md | 2 |
2 files changed, 116 insertions, 11 deletions
diff --git a/docs/src/sci/autoencode.md b/docs/src/sci/autoencode.md index 6a1e2ae..9039864 100644 --- a/docs/src/sci/autoencode.md +++ b/docs/src/sci/autoencode.md @@ -1,12 +1,115 @@ # Manifold learning +## Introduction + +In section [scRNAseq spatial inference](@ref), we demonstrated that scRNAseq expression data can be directly mapped to space using a reference database of gene expression patterns, provided by the Berkeley Drosophila Transcriptional Network Project. +These results motivate a more ambitious question: is positional information directly encoded within gene expression and, as such, can we utilize _just_ scRNAseq counts to infer cellular spatial position _de-novo_? + +We anchor our inference approach on the _continuum ansatz_ of cellular states in a developing embryo. +Specifically, we postulate that gene expression is a _smoothly_ varying, continuous function of space, and potentially other continuous degrees of freedom such as time. +Said another way, neighboring cells are assumed to be infinitesimally "close" in expression. +This formulation is a departure from the conventional, _discrete_, view of cellular fates taken, for example, in the description of the striped gene patterning observed during the Anterior-Posterior segmentation of the _Drosophila melongaster_ embryo. +From our viewpoint, a "discontinuity" of gene expression in a handful of components can also be explained by the curvature of a _smooth_ surface embedded in very high dimensions. + +As formulated, spatial inference from scRNAseq data is equivalent to non-linear dimensional reduction: we want to find a small number of degrees of freedom that parameterize the variation across ``\sim 10^4`` genes. +From [Drosophila results](@ref), we know that at least minimally, we can describe gene expression within early Drosophila embryogenesis by ``31`` relevant components. +In this section, we analyze the data within this subspace to show that the data is ultimately generated by an even smaller set of degrees of freedom. +Furthermore, we outline a protocol to "learn" this parameterization and show that it can recapitulate space. + ## Empirical analysis -* Point cloud scaling -* Physically close cells are close in expression -* Isomap analysis picks up spatial gradients +### Hausdorff dimension estimation + +A critical parameter to determine _empirically_ is the underlying dimensionality of gene expression of early _Drosophila embryogenesis, as given by the scRNAseq data. +Informally, we expect that if the data is sampled from an underlying ``d`` dimensional manifold, than the number of points ``n`` enclosed by a sphere of radius ``R`` should grow as +```math + n(R) \sim R^d +``` +However, we note that performing this comparison utilizing the Euclidean metric between normalized cellular expression would be manifestly incorrect; we postulated that gene expression is a manifold which implies a Euclidean metric _locally_ between neighbors. +There is no _a priori_ reason to expect Euclidean distances to be a good measure for far cells. +Similar considerations hold for other postulated global metrics. +Instead, we estimate geodesic distances empirically, in analog to the Isomap algorithm [^1]. + +[^1]: [A Global Geometric Framework for Nonlinear Dimensionality Reduction](https://www.science.org/doi/10.1126/science.290.5500.2319) + +#### Neighborhood graph construction +The basic construction of the neighborhood graph is demonstrated in the below cartoon. +```@raw html +<p align="center"> +<img src="/assets/drosophila/radius_scaling_neighborhood.svg" width="32%" class="center"/> +<img src="/assets/drosophila/radius_scaling_shortest_path.svg" width="32%" class="center"/> +<img src="/assets/drosophila/radius_scaling_ball.svg" width="32%" class="center"/> +</p> +``` +Our algorithm to construct the neighborhood graph proceeds in three main steps. +First, we visit the local neighborhood of each point (cell) within our dataset. +In principle, the neighborhood can be defined by either a fixed number of neighbors ``k`` or a given radius ``R``. +In practice, for _Drosophila melongaster_ we chose to define neighborhood by ``k=8``. +The neighborhood is assumed to be a good approximation to the manifold tangent space and thus pairwise distances between the point and its neighbors are computed within the euclidean metric. +Each pairwise neighborhood relationship is captured by an edge, weighted by the pairwise distance. +The collection of all cells, and their neighborhood edges, constitute the _neighborhood graph_. + +Given a neighborhood graph as defined above, computing the "geodesic" distance between cells reduces to finding the shortest path between all pairs of points. +This can be solved using either the Floyd-Warshall algorithm in ``\mathcal{O}(N^3)`` time [^2] or iteratively using Dijikstra's algorithm in ``\mathcal{O}(N(N\logN + E))`` time [^3]. +We chose the later due to the sparsity of the neighborhood graph. +The estimation of geodesic distances, if performed as described, asymptotically approaches the true value as the density of points increases [^4]. + +[^2]: [Algorithm 97: shortest path](https://dl.acm.org/doi/pdf/10.1145/367766.368168) +[^3]: [A note on two problems in connexion with graphs](https://link.springer.com/article/10.1007/BF01386390) +[^4]: [Graph approximations to geodesics on embedded manifolds](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.359.7343&rep=rep1&type=pdf) -## Initial considerations +Given our estimate for all pairwise geodesic distances, we can now empirically estimate the _Hausdorff_ dimension. +Simply stated, the estimate is done by analyzing the scaling relationship between number of points enclosed within a _geodesic_ ball of radius ``R`` and the radius ``R`` itself. +Below we show the results plotted for each cell within our dataset, along with a trend line ``n(R) \sim R^3`` shown in red. +```@raw html +<p align="center"> +<img src="/assets/drosophila/pointcloud_scaling.png" width="68%" class="center"/> +</p> +``` +The manifold clearly looks three-dimensional. + +### Physical proximity implies expression proximity + +In [Neighborhood graph construction](@ref), we demonstrated that the normalized scRNAseq appears to be three-dimensional. +This supports _half_ of our continuum ansatz: gene expression of early Drosophila embryogenesis is consistent with being distributed along a low-dimensional manifold. +However, _a priori_ is is unclear that such dimensions meaningfully correspond to space. +Additionally, at the developmental stage sampled, the _Drosophila_ embryo is a two-dimensional epithelial monolayer and as such it is unclear what a potential third dimension would parameterize. +To this end, we utilize the positional labels calculated in [scRNAseq spatial inference](@ref) to assay if physically close cells are close along the putative manifold. +```@raw html +<p align="center"> +<img src="/assets/drosophila/pointcloud_locality.png" width="68%" class="center"/> +</p> +``` +Specifically, we demonstrate that conditioning on varying _average_ spatial geodesic distances results in quantitative effects in the distribution of pairwise expression geodesic distances. +Physically close cells are not only close in expression, but the median of the distribution increases quantitatively (albeit non-linearly) with increasing conditioned shells of distance. +Taken together, the scRNAseq data is empirically _consistent_ with our continuum expression ansatz. + +### Isomap dimensional reduction +We provide additional evidence in support of our underlying hypothesis by utilizing the Isomap algorithm [^1]. +As such, we simply perform a Principal Coordinates Analysis on the geodesic distances estimated in [Neighborhood graph construction](@ref). +This is tantamount to finding a set of low-dimensional coordinates ``\xi_\alpha`` that minimize the strain between their Euclidean embedding and the given distance matrix ``D_{\alpha\beta}``. +```math + E(\{\xi_\alpha\}) = \left(\frac{\left(\sum_{\alpha,beta} D_{\alpha\beta} - |\xi_\alpha - \xi_\beta|^2\right)^2}{\sum_{\alpha,\beta} D_{\alpha\beta}^2 }\right)^{1/2} +``` +```@raw html +<p align="center"> +<img src="/assets/drosophila/isomap_AP.png" width="42%" class="center"/> +<img src="/assets/drosophila/isomap_DV.png" width="42%" class="center"/> +</p> +``` +Above, we show the resultant embedding, colored by the estimated AP (anterior-posterior) and DV (dorsal-vental) position, obtained by averaging over the distribution obtained in [scRNAseq spatial inference](@ref). +It is clear that two of the major axes of the embedding quantitatively segregate both spatial axes of the embryo. +We emphasize that the estimated spatial positions _were not_ used to generate the shown embedding, but rather _only_ the underlying neighborhood distances of cells within our scRNAseq dataset. +This is taken as strong evidence in support of our underlying hypothesis. + +By varying the dimensionality of the Isomap embedding, we see that three dimensions is the "knee" of dimensioning returns, consistent with the 3-dimensional scaling observed above. +```@raw html +<p align="center"> +<img src="/assets/drosophila/isomap_dimension.png" width="68%" class="center"/> +</p> +``` + +## Nonlinear manifold learning Want: * Dimensional reduction @@ -19,7 +122,7 @@ Natural choice for an autoencoder. Utilize known positional labels as a validation step. Not used for training purposes. -## Network architecture +### Network architecture How to pick depth? Width? Worry about overfitting: enter dropout and batch normalization. @@ -58,7 +161,7 @@ Utilizing this term is problematic for several reasons: 2. Generically, ``d`` dimensional manifolds can not be isometrically embedded into ``\mathbb{R}^d``, e.g. the sphere into the plane. 3. It trusts the computed distances quantitatively. We simply want close cells to be close in the resultant latent space. -#### Differentiable ranking +#### Monometric formulation Consider a vector ``\psi_\alpha`` of scores of length ``n`` we wish to rank. Furthermore, define ``\sigma \in \Sigma_n`` to be an arbitrary permutation of ``n`` such scores. We define the **argsort** to be the permutation that sorts ``\psi`` in descending order @@ -76,9 +179,9 @@ We wish to devise an objective function that contains functions of the rank of s However, ``R(\bm{\psi})`` is a non-differentiable function; it maps a vector in ``\mathbb{R}^n`` to a permutation of ``n`` items. Hence, we can not directly utilize the rank in a loss function as there is no way to backpropagate gradient information to the network parameters. In order to rectify this limitation, we first reformulate the ranking problem as a linear programming problem that permits efficient regularization. -Note, the presentation here follows closely the original paper [^1] +Note, the presentation here follows closely the original paper [^5] -[^1]: [Fast Differentiable Sorting and Ranking](https://arxiv.org/abs/2002.08871) +[^5]: [Fast Differentiable Sorting and Ranking](https://arxiv.org/abs/2002.08871) ##### Linear program formulation @@ -126,7 +229,7 @@ Note that the limit ``\epsilon \rightarrow 0`` reproduces the linear programming Conversely, in the limit ``\epsilon \rightarrow \infty``, the solution will go to a constant vector that has the smallest modulus on the _permutahedron_. ##### Solution -It has been demonstrated before that the above problem reduces to simple isotonic regression[^1][^2]. +It has been demonstrated before that the above problem reduces to simple isotonic regression[^5][^6]. Specifically, ```math R\left(\bm{\psi}\right) = -\frac{\bm{\psi}}{\epsilon} - @@ -149,10 +252,12 @@ Furthermore, the solution admits a simple, calculatable Jacobian where ``\bm{B}_i`` denotes the matrix corresponding to the `i^{th}` block obtained during isotonic regression. It is a constant matrix whose number of rows and columns equals the size of the block, and whose values all sum to 1. -[^2]: [SparseMAP: Differentiable Sparse Structured Inference](https://arxiv.org/abs/1802.04223) +[^6]: [SparseMAP: Differentiable Sparse Structured Inference](https://arxiv.org/abs/1802.04223) #### Loss function ### Uniform sampling of latent space +### Swiss roll validation + ## Results diff --git a/docs/src/sci/inference.md b/docs/src/sci/inference.md index ce166d0..79f2621 100644 --- a/docs/src/sci/inference.md +++ b/docs/src/sci/inference.md @@ -1,4 +1,4 @@ -# scRNAseq Spatial Inference +# scRNAseq spatial inference ## Introduction |