Nearest neighbor search on embeddings rapidly identifies distant protein relations

Cited 4 times in Web of Science; cited 5 times in Scopus
Authors

Schuetze, Konstantin; Heinzinger, Michael; Steinegger, Martin; Rost, Burkhard

Issue Date
2022-11
Publisher
Frontiers Media S.A.
Citation
Frontiers in Bioinformatics, Vol.2, p. 1033775
Abstract
Since 1992, all state-of-the-art methods for fast and sensitive identification of evolutionary, structural, and functional relations between proteins (also referred to as "homology detection") use sequences and sequence-profiles (PSSMs). Protein Language Models (pLMs) generalize sequences, possibly capturing the same constraints as PSSMs, e.g., through embeddings. Here, we explored how to use such embeddings for nearest neighbor searches to identify relations between protein pairs with diverged sequences (remote homology detection for levels of <20% pairwise sequence identity, PIDE). While this approach excelled for proteins with single domains, we demonstrated the current challenges applying this to multi-domain proteins and presented some ideas how to overcome existing limitations, in principle. We observed that sufficiently challenging data set separations were crucial to provide deeply relevant insights into the behavior of nearest neighbor search when applied to the protein embedding space, and made all our methods readily available for others.
ISSN
2673-7647
URI
https://hdl.handle.net/10371/202519
DOI
https://doi.org/10.3389/fbinf.2022.1033775
Files in This Item:
There are no files associated with this item.
Related Researcher

  • College of Natural Sciences
  • School of Biological Sciences
Research Area
Development of algorithms to search, cluster, and assemble sequence data; metagenomic analysis; pathogen detection in sequencing data

Items in S-Space are protected by copyright, with all rights reserved, unless otherwise indicated.
