Directed 3d UMAP

A novel approach to 3d UMAP visualization for single-cell genomic analysis.
2d and 3d UMAP visualizations can be hard to navigate. We chose to take a new approach: flip it!

The challenge

How do we show relationships between high-dimensional data and biological functions?

Context

Within single-cell genomics, dimensionality reduction techniques have become a near-ubiquitous tool for visualizing genetic data. Techniques such as Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE) provide frameworks for projecting high-dimensional data into a low-dimensional space, most commonly a 2d graph. More recently, scientists have adopted a newer dimensionality reduction algorithm called Uniform Manifold Approximation and Projection (UMAP), which was introduced by McInnes et al. in 2018.

In an effort to rethink the traditional UMAP embedding, we have created a new framework for visualizing high-dimensional data points in 3d space. Crucially, our approach maintains the contextual information provided by UMAP clustering while adding a new dimension in which to plot additional data. This new visualization method seeks to add another layer of data to the UMAP layout rather than duplicate its existing components.

For this new visualization approach, we partnered with postdoctoral researcher Joshua Pan to create an interface for exploring the outputs of his new computational tool (Webster) for modeling genetic pleiotropy, per the paper’s abstract:

“In practice, a single gene perturbation may induce multiple cascading functional outcomes, a genetic principle known as pleiotropy. Here, we model pleiotropy in fitness screen collections by representing each gene perturbation as the sum of multiple perturbations of biological functions, each harboring independent fitness effects inferred empirically from the data. Our approach (‘Webster’) recovered pleiotropic functions for DNA damage proteins from genotoxic fitness screens, untangled distinct signaling pathways upstream of shared effector proteins from cancer cell fitness screens, and predicted the stoichiometry of a new protein complex subunit from fitness data alone.”

The concept of word vectors is a useful analogy here: one word can carry meaning in multiple different contexts, much as a single gene can contribute to multiple biological functions.

The problem

Manhattan plots have long been used for single-cell data, but the context is lost.

UMAP and t-SNE plots provide a good summary of high-dimensional data, but the user experience of navigating them is awkward.

These plots also struggle to show the “real” distance behind the local or global relationships in the data. Because everything is summarized into 2 dimensions, points can sit close together on the UMAP yet be only distantly related; conversely, points can appear far away from each other yet be closely related.
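To make the distance problem concrete, here is a minimal sketch in Python. It does not run UMAP or t-SNE; it uses the crudest possible "embedding" (dropping a coordinate) as a stand-in, and the gene names and coordinates are made up for illustration. The point is only that any reduction to 2 dimensions can make distant points look like neighbors.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def project_xy(point):
    """Keep only the first two coordinates (a crude stand-in for an embedding)."""
    return point[:2]

# Hypothetical 3-dimensional data, invented for this example.
points = {
    "gene_a": (0.0, 0.0, 0.0),
    "gene_b": (0.1, 0.0, 5.0),  # far from gene_a, but only along the third axis
    "gene_c": (1.0, 1.0, 0.0),  # moderately close to gene_a on every axis
}

for other in ("gene_b", "gene_c"):
    original = euclidean(points["gene_a"], points[other])
    projected = euclidean(project_xy(points["gene_a"]), project_xy(points[other]))
    print(f"gene_a vs {other}: original={original:.2f}, projected={projected:.2f}")
```

In the original space gene_b is the more distant point, yet after projection it lands closer to gene_a than gene_c does.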

3D visualizations are especially challenging:

navigating through 3D space is difficult,
large datasets suffer from network hairball issues,
trends are hard to see.

t-SNE captures local relationships but does not capture global relationships.

Many of these visualizations are mapped in 3D space, which increases the “hairball effect” and makes them hard to navigate.

t-SNE and UMAP are two different clustering algorithms. UMAP is better, but the same challenges remain: how to navigate 3D space, clustering overlap, etc.

Clustering algorithms encode relatedness as distance, and the embedding can change depending on how the data is being clustered. The mapping of relatedness does not always portray the data accurately. In the right panel, Near and Equidistant data points are mapped similarly to Far and Equidistant points.

The embedding can dictate the shape of the clustering. Panels A and B represent the same data, but the embeddings show drastically different results, while panels B, C, and D come from completely different datasets yet their mappings are very similar.

The concept

The concept here is mindful use of 3D space. Using the traditional 2D UMAP representation as our initial encoding, mapped along the X and Y axes, can we accurately show the relatedness of the data using the Z axis?

Here’s my initial prototype, using a subset of the data from the paper. I used the final UMAP embedding of the data and mapped those points along the X and Y axes. I then took relatedness data from a few biological functions to map along the Z axis. In the prototype, the user can select which biological function they’d like to explore and see which genes are essential to the function and related within the cluster.
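The mapping described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the prototype's actual code: the gene names, coordinates, scores, and the helper `lift_to_3d` are all hypothetical, and a missing score is assumed to mean a height of zero.

```python
# Hypothetical inputs: a finished 2D UMAP embedding and, for each
# biological function, a relatedness score per gene.
umap_xy = {
    "GENE_1": (0.2, 1.5),
    "GENE_2": (0.3, 1.4),
    "GENE_3": (-2.0, 0.7),
}
function_scores = {
    "function_A": {"GENE_1": 0.9, "GENE_2": 0.4, "GENE_3": 0.0},
    "function_B": {"GENE_1": 0.0, "GENE_2": -0.3, "GENE_3": 0.8},
}

def lift_to_3d(xy, scores):
    """Keep each gene's UMAP position on X/Y and use its score as Z.

    Genes with no score for the selected function get Z = 0.0 (an assumption).
    """
    return {gene: (x, y, scores.get(gene, 0.0)) for gene, (x, y) in xy.items()}

# Selecting a function only changes Z; the X/Y layout stays the same.
selected = "function_A"
points = lift_to_3d(umap_xy, function_scores[selected])

# The highest point on the Z axis is the gene most related to the function.
top_gene = max(points, key=lambda gene: points[gene][2])
print(selected, "top gene:", top_gene)
```

Because the X/Y positions never change, switching functions keeps the familiar UMAP layout as a stable reference while only the heights move.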

The solution

Iterating from the prototype, we wanted to keep the user from freely exploring the 3D space, because free navigation quickly disorients users. In our approach, the Z axis is defined by the selected biological function, and we use distance along the Z axis to show relatedness.

Users select a function and can rotate the plot 90 degrees to see the top-scored gene and the related genes. The visualization also shows neutral genes and negatively correlated genes.
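One way to picture the constrained rotation is as a single turn about the X axis, so the view flips between the UMAP layout (top view) and a profile of relatedness (side view). The sketch below is only an illustration: the choice of rotation axis, the sign convention, and the helper names are assumptions, not details taken from the actual interface.

```python
import math

def rotate_about_x(point, degrees):
    """Rotate a 3D point about the X axis by the given angle."""
    x, y, z = point
    t = math.radians(degrees)
    return (
        x,
        y * math.cos(t) - z * math.sin(t),
        y * math.sin(t) + z * math.cos(t),
    )

def screen_xy(point, degrees):
    """Flatten the rotated point onto the screen by dropping its depth."""
    x, y, _ = rotate_about_x(point, degrees)
    return (x, y)

# Hypothetical gene: UMAP position (0.2, 1.5), relatedness 0.9 on the Z axis.
gene = (0.2, 1.5, 0.9)

top_view = screen_xy(gene, 0)     # screen shows UMAP X against UMAP Y
side_view = screen_xy(gene, -90)  # screen shows UMAP X against relatedness
print("top view:", top_view)
print("side view:", side_view)
```

Allowing only these two orientations keeps the user anchored: the top view is always the familiar UMAP, and the side view always reads as height equals relatedness.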

Blue = genes within a biological function
Blue opacity = gene score
Grey = neutral genes
Red = negatively correlated genes
Black = centroid of other biological functions
Z axis = relatedness
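The legend above can be read as a small lookup rule. Here is a hedged Python sketch of that rule; the function name `style_for`, the width of the neutral band, and the way opacity is scaled are assumptions chosen for illustration, not values from the actual visualization.

```python
def style_for(score, is_centroid=False, neutral_band=0.05, max_score=1.0):
    """Map a gene's score for the selected function to a color and opacity.

    neutral_band and max_score are hypothetical parameters: scores within
    +/- neutral_band count as neutral, and opacity scales up to max_score.
    """
    if is_centroid:
        # Centroid of another biological function.
        return {"color": "black", "opacity": 1.0}
    if score < -neutral_band:
        # Negatively correlated with the selected function.
        return {"color": "red", "opacity": 1.0}
    if score <= neutral_band:
        # Neutral gene.
        return {"color": "grey", "opacity": 1.0}
    # Gene within the function: stronger score, more opaque blue.
    return {"color": "blue", "opacity": min(score / max_score, 1.0)}

for label, score in [("strong", 0.9), ("weak", 0.2), ("neutral", 0.0), ("negative", -0.4)]:
    print(label, style_for(score))
```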

Next questions to explore

Do the specific patterns have inherent meaning?

Is the top feature (datapoint) the centroid of the network, or the bottleneck of two clusters?

Is it the keystone?!

The results

Scientific poster presented at the 2021 Broad Institute annual retreat.

Our visualization was selected as the cover image for Cell Systems.

In this issue of Cell Systems, Pan et al. (p. 286) use cell viability changes following gene perturbation to automatically learn cellular functions or pathways from data. Here, each gene is represented as a data point in 3D space. On the XY plane, genes are arranged near each other if they induce similar cell viability changes upon perturbation (UMAP projection). On the Z axis, genes are plotted based on their association to a specific biological function learned from the data. The strength of the association can be seen by the height, size, and color of that gene's data point. Using this strategy, the authors build interactive visualizations of inferred gene functions (https://depmap.org/webster/#/). Artwork by Andrew Tang.

Read the scientific paper at Cell Systems
Continue reading on Medium
Explore the visualization