Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Ucak, Umit V. | - |
dc.contributor.author | Ashyrmamatov, Islambek | - |
dc.contributor.author | Lee, Juyong | - |
dc.date.accessioned | 2023-06-29T08:10:20Z | - |
dc.date.available | 2023-06-29T17:10:37Z | - |
dc.date.issued | 2023-05-29 | - |
dc.identifier.citation | Journal of Cheminformatics, Vol.15:55 | ko_KR |
dc.identifier.issn | 1758-2946 | - |
dc.identifier.uri | https://hdl.handle.net/10371/194617 | - |
dc.description.abstract | Tokenization is an important preprocessing step in natural language processing that may have a significant influence on prediction quality. This research showed that traditional SMILES tokenization has a limitation: its tokens fail to reflect the true nature of molecules. To address this issue, we developed the atom-in-SMILES tokenization scheme, which eliminates ambiguities arising from the generic nature of SMILES tokens. Our results in multiple chemical translation and molecular property prediction tasks demonstrate that proper tokenization has a significant impact on prediction quality. In terms of prediction accuracy and token degeneration, atom-in-SMILES is a more effective method for generating higher-quality SMILES sequences from AI-based chemical models than other tokenization and representation schemes. We investigated the degree of token degeneration of various schemes and analyzed their adverse effects on prediction quality. Additionally, token-level repetitions were quantified, and generated examples were included for qualitative examination. We believe that atom-in-SMILES tokenization has great potential for adoption across related scientific communities, as it provides chemically accurate, tailor-made tokens for molecular property prediction, chemical translation, and molecular generative models. | ko_KR |
dc.language.iso | en | ko_KR |
dc.publisher | BMC | ko_KR |
dc.subject | Atom-in-SMILES | - |
dc.subject | Tokenization | - |
dc.subject | Repetition | - |
dc.subject | Chemical language processing | - |
dc.title | Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization | ko_KR |
dc.type | Article | ko_KR |
dc.identifier.doi | 10.1186/s13321-023-00725-9 | ko_KR |
dc.citation.journaltitle | Journal of Cheminformatics | ko_KR |
dc.language.rfc3066 | en | - |
dc.rights.holder | The Author(s) | - |
dc.date.updated | 2023-06-04T03:10:34Z | - |
dc.citation.number | 55 | ko_KR |
dc.citation.volume | 15 | ko_KR |
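
The limitation of traditional SMILES tokenization that the abstract mentions can be illustrated with a minimal sketch. The code below is not the authors' atom-in-SMILES scheme, which this record does not describe in detail; it is a simplified atom-wise tokenizer of the conventional kind, and the regular expression and the `tokenize` function name are assumptions made for illustration. It shows that chemically different atoms collapse into the same generic token.

```python
import re
from collections import Counter

# Simplified atom-wise SMILES pattern (an illustrative assumption, not the
# pattern used in the paper): bracket atoms, two-letter halogens, common
# organic-subset atoms, aromatic atoms, ring-closure labels, and bond or
# branch symbols.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|%\d{2}|\d|[()=#\-+\\/:~@?>*$.]"
)

def tokenize(smiles: str) -> list[str]:
    """Split a SMILES string into conventional atom-level tokens."""
    tokens = SMILES_TOKEN.findall(smiles)
    # Guard against characters the simplified pattern does not cover.
    if "".join(tokens) != smiles:
        raise ValueError(f"could not tokenize {smiles!r} completely")
    return tokens

# Acetic acid: the methyl carbon and the carbonyl carbon both become "C",
# and the carbonyl oxygen and the hydroxyl oxygen both become "O".
tokens = tokenize("CC(=O)O")
print(tokens)
print(Counter(tokens))
```

In this sketch the two carbons of acetic acid are indistinguishable at the token level even though their chemical environments differ, which is the kind of ambiguity the abstract attributes to generic SMILES tokens.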
Items in S-Space are protected by copyright, with all rights reserved, unless otherwise indicated.