ESM Atlas · TMAP
ESM Atlas proteins, mapped with TMAP
I have been testing the new ESMFold2 and ESMC embeddings with the latest version of TMAP (a preprint is coming soon). It runs on protein embeddings out of the box, so this was mostly a matter of pointing it at the data and waiting.
Below are 50,000 metagenomic proteins from the ESM Atlas. Each one is embedded with ESMC-600M and folded with ESMFold2, and the whole set is laid out as a single map. You can hover over a point to see its dominant function, click it to see the predicted structure, and color, search, filter or lasso-select from there.
I made two versions of the same set. One comes from the raw ESMC embeddings, the other from the sparse autoencoder features, which are 16,384 interpretable directions pooled per protein. They lay the proteins out quite differently, and that contrast is the part worth poking at. Switch between them below.
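To make "pooled per protein" concrete: a per-residue feature matrix has to be collapsed into one vector per protein before it can be mapped. The post does not say which pooling was used, so the sketch below is only an assumption showing the two most common choices, mean and max, on a toy protein with 3 residues and 4 features (standing in for the 16,384 SAE features). The names `mean_pool`, `max_pool` and `toy` are made up for illustration.

```python
def mean_pool(per_residue):
    """Average a (length x dim) list of per-residue vectors into one dim-d vector."""
    length = len(per_residue)
    dim = len(per_residue[0])
    return [sum(row[k] for row in per_residue) / length for k in range(dim)]


def max_pool(per_residue):
    """Keep, for each feature, its strongest activation along the sequence."""
    dim = len(per_residue[0])
    return [max(row[k] for row in per_residue) for k in range(dim)]


# Toy protein: 3 residues x 4 sparse features (hypothetical values).
toy = [
    [0.0, 2.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 3.0],
    [0.0, 4.0, 0.0, 2.0],
]

pooled_mean = mean_pool(toy)  # [0.0, 2.0, 0.0, 2.0]
pooled_max = max_pool(toy)    # [0.0, 4.0, 0.0, 3.0]
```

Either way the result is one fixed-length row per protein, which is the shape the mapping step expects. Which pooling suits sparse features better is a judgment call that this sketch does not settle.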
What I like about TMAP is that each point is almost always connected to its true nearest neighbor in the full embedding space. The connections actually mean something, more than they do in UMAP, and there is no parameter tuning to get there. About ten lines of code, running on a laptop, all local.
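The nearest-neighbor property comes from the minimum spanning tree itself. In an exact MST with distinct edge weights, the cheapest edge at every vertex is always part of the tree. The sketch below checks this on random vectors using plain Python, exact cosine distances and Prim's algorithm. It is not TMAP's implementation: TMAP builds its tree on an approximate k-NN graph, which is why the guarantee becomes "almost always" in practice. All function names here are illustrative.

```python
import math
import random


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def exact_mst_edges(points):
    """Prim's algorithm on the complete cosine-distance graph."""
    n = len(points)
    in_tree = [False] * n
    best_dist = [math.inf] * n  # cheapest known link from each point into the tree
    best_from = [-1] * n        # tree vertex that offers that link
    best_dist[0] = 0.0
    edges = set()
    for _ in range(n):
        # Add the point outside the tree that is cheapest to attach.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best_dist[i])
        in_tree[u] = True
        if best_from[u] >= 0:
            edges.add(frozenset((u, best_from[u])))
        # Update the attachment cost of every point still outside the tree.
        for v in range(n):
            if not in_tree[v]:
                d = cosine_distance(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    best_from[v] = u
    return edges


def nearest_neighbor(points, i):
    """Index of the point closest to points[i] by brute force."""
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: cosine_distance(points[i], points[j]),
    )


rng = random.Random(0)
points = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(200)]

edges = exact_mst_edges(points)
hits = sum(
    frozenset((i, nearest_neighbor(points, i))) in edges
    for i in range(len(points))
)
nn_fraction = hits / len(points)  # share of points linked to their true nearest neighbor
```

With exact distances `nn_fraction` reaches 1.0. Swapping in an approximate neighbor search trades a little of that for speed on large sets.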
The code
The whole run is one script. The core that produces a map is short:
from tmap import TMAP

# 1,152-d ESMC-600M embeddings -> Map A (16,384-d SAE features -> Map B)
X = esmc_embeddings  # shape (N, 1152)

viz = (TMAP(metric="cosine", n_neighbors=20, layout_iterations=1000, seed=42)
       .fit(X)  # LSH-forest k-NN, then a minimum spanning tree
       .to_tmapviz())

viz.add_3d_structures(cif_urls, fmt="cif")  # ESMFold2 folds, click to view
viz.add_color_layout("pLDDT", plddt, color="viridis")
viz.write_static("esm_atlas_out/")  # one self-contained page
Code: github.com/afloresep/tmap2 · script: examples/esm_atlas_tmap.py