Detailed Information

Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization

DC Field | Value | Language
dc.contributor.author | Ucak, Umit V. | -
dc.contributor.author | Ashyrmamatov, Islambek | -
dc.contributor.author | Lee, Juyong | -
dc.date.accessioned | 2023-06-29T08:10:20Z | -
dc.date.available | 2023-06-29T17:10:37Z | -
dc.date.issued | 2023-05-29 | -
dc.identifier.citation | Journal of Cheminformatics, Vol.15:55 | ko_KR
dc.identifier.issn | 1758-2946 | -
dc.identifier.uri | https://hdl.handle.net/10371/194617 | -
dc.description.abstract | Tokenization is an important preprocessing step in natural language processing that may have a significant influence on prediction quality. This research showed that the traditional SMILES tokenization has a certain limitation that results in tokens failing to reflect the true nature of molecules. To address this issue, we developed the atom-in-SMILES tokenization scheme that eliminates ambiguities in the generic nature of SMILES tokens. Our results in multiple chemical translation and molecular property prediction tasks demonstrate that proper tokenization has a significant impact on prediction quality. In terms of prediction accuracy and token degeneration, atom-in-SMILES is a more effective method in generating higher-quality SMILES sequences from AI-based chemical models compared to other tokenization and representation schemes. We investigated the degrees of token degeneration of various schemes and analyzed their adverse effects on prediction quality. Additionally, token-level repetitions were quantified, and generated examples were incorporated for qualitative examination. We believe that the atom-in-SMILES tokenization has great potential to be adopted by a broad range of related scientific communities, as it provides chemically accurate, tailor-made tokens for molecular property prediction, chemical translation, and molecular generative models. | ko_KR
dc.language.iso | en | ko_KR
dc.publisher | BMC | ko_KR
dc.subject | Atom-in-SMILES | -
dc.subject | Tokenization | -
dc.subject | Repetition | -
dc.subject | Chemical language processing | -
dc.title | Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization | ko_KR
dc.type | Article | ko_KR
dc.identifier.doi | 10.1186/s13321-023-00725-9 | ko_KR
dc.citation.journaltitle | Journal of Cheminformatics | ko_KR
dc.language.rfc3066 | en | -
dc.rights.holder | The Author(s) | -
dc.date.updated | 2023-06-04T03:10:34Z | -
dc.citation.number | 55 | ko_KR
dc.citation.volume | 15 | ko_KR
