Detailed Information

ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning

DC Field | Value | Language
dc.contributor.author | Elnaggar, Ahmed | -
dc.contributor.author | Heinzinger, Michael | -
dc.contributor.author | Dallago, Christian | -
dc.contributor.author | Rehawi, Ghalia | -
dc.contributor.author | Yu, Wang | -
dc.contributor.author | Jones, Llion | -
dc.contributor.author | Gibbs, Tom | -
dc.contributor.author | Feher, Tamas | -
dc.contributor.author | Angerer, Christoph | -
dc.contributor.author | Steinegger, Martin | -
dc.contributor.author | Bhowmik, Debsindhu | -
dc.contributor.author | Rost, Burkhard | -
dc.date.accessioned | 2024-05-16T01:26:45Z | -
dc.date.available | 2024-05-16T01:26:45Z | -
dc.date.created | 2021-08-09 | -
dc.date.issued | 2022-10 | -
dc.identifier.citation | IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.44 No.10, pp.7112-7127 | -
dc.identifier.issn | 0162-8828 | -
dc.identifier.uri | https://hdl.handle.net/10371/202521 | -
dc.description.abstract | Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models (LMs) taken from Natural Language Processing (NLP). These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive models (Transformer-XL, XLNet) and four auto-encoder models (BERT, Albert, Electra, T5) on data from UniRef and BFD containing up to 393 billion amino acids. The protein LMs (pLMs) were trained on the Summit supercomputer using 5616 GPUs and a TPU Pod with up to 1024 cores. Dimensionality reduction revealed that the raw pLM embeddings from unlabeled data captured some biophysical features of protein sequences. We validated the advantage of using the embeddings as exclusive input for several subsequent tasks: (1) a per-residue (per-token) prediction of protein secondary structure (3-state accuracy Q3=81%-87%); (2) per-protein (pooling) predictions of protein sub-cellular location (10-state accuracy Q10=81%) and membrane versus water-soluble (2-state accuracy Q2=91%). For secondary structure, the most informative embeddings (ProtT5) for the first time outperformed the state of the art without multiple sequence alignments (MSAs) or evolutionary information, thereby bypassing expensive database searches. Taken together, the results implied that pLMs learned some of the grammar of the language of life. All our models are available through https://github.com/agemagician/ProtTrans. | -
dc.language | English | -
dc.publisher | Institute of Electrical and Electronics Engineers | -
dc.title | ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning | -
dc.type | Article | -
dc.identifier.doi | 10.1109/TPAMI.2021.3095381 | -
dc.citation.journaltitle | IEEE Transactions on Pattern Analysis and Machine Intelligence | -
dc.identifier.wosid | 000853875300088 | -
dc.identifier.scopusid | 2-s2.0-85138449494 | -
dc.citation.endpage | 7127 | -
dc.citation.number | 10 | -
dc.citation.startpage | 7112 | -
dc.citation.volume | 44 | -
dc.description.isOpenAccess | Y | -
dc.contributor.affiliatedAuthor | Steinegger, Martin | -
dc.type.docType | Article | -
dc.description.journalClass | 1 | -
dc.subject.keywordPlus | PROTEIN SECONDARY STRUCTURE | -
dc.subject.keywordPlus | NEURAL-NETWORKS | -
dc.subject.keywordPlus | PREDICTION | -
dc.subject.keywordPlus | LOCALIZATION | -
dc.subject.keywordAuthor | Proteins | -
dc.subject.keywordAuthor | Training | -
dc.subject.keywordAuthor | Amino acids | -
dc.subject.keywordAuthor | Task analysis | -
dc.subject.keywordAuthor | Databases | -
dc.subject.keywordAuthor | Computational modeling | -
dc.subject.keywordAuthor | Three-dimensional displays | -
dc.subject.keywordAuthor | Computational biology | -
dc.subject.keywordAuthor | high performance computing | -
dc.subject.keywordAuthor | machine learning | -
dc.subject.keywordAuthor | language modeling | -
dc.subject.keywordAuthor | deep learning | -
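
The abstract above builds its downstream predictions on raw pLM embeddings used as exclusive input: per-residue (per-token) vectors for secondary structure, and pooled per-protein vectors for sub-cellular location and membrane prediction. Below is a minimal sketch of that extraction step, assuming the ProtT5 checkpoint the authors published on Hugging Face (Rostlab/prot_t5_xl_uniref50) and the transformers and torch packages; none of these specifics are stated in this record, so verify them against the linked GitHub repository.

    # Sketch: extract ProtT5 embeddings as exclusive input for downstream tasks.
    # Assumes the Rostlab/prot_t5_xl_uniref50 checkpoint on Hugging Face.
    import re
    import torch
    from transformers import T5Tokenizer, T5EncoderModel

    tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50", do_lower_case=False)
    model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_uniref50").eval()

    sequence = "MSKGEELFTGVVPILVELDGD"  # hypothetical example protein
    # ProtT5 expects space-separated residues; map rare amino acids (U, Z, O, B) to X.
    prepped = " ".join(re.sub(r"[UZOB]", "X", sequence))

    ids = tokenizer(prepped, add_special_tokens=True, return_tensors="pt")
    with torch.no_grad():
        out = model(input_ids=ids.input_ids, attention_mask=ids.attention_mask)

    # Per-residue (per-token) embeddings, e.g. for 3-state secondary-structure
    # prediction; the slice drops the special token the T5 tokenizer appends.
    per_residue = out.last_hidden_state[0, : len(sequence)]  # shape (L, 1024)

    # Per-protein embedding via mean pooling, e.g. for the 10-state location
    # and 2-state membrane classifiers described in the abstract.
    per_protein = per_residue.mean(dim=0)  # shape (1024,)

A supervised head is then trained on per_residue or per_protein; the paper's exact predictor architectures, and the 1024-dimensional embedding size assumed here, are documented in the article and the repository.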
Files in This Item:
There are no files associated with this item.

Related Researcher

Steinegger, Martin

  • College of Natural Sciences
  • School of Biological Sciences

Research Area: Development of algorithms to search, cluster, and assemble sequence data; metagenomic analysis; pathogen detection in sequencing data

