Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization

Ucak, Umit V.; Ashyrmamatov, Islambek; Lee, Juyong

doi:10.1186/s13321-023-00725-9

DC Field	Value	Language
dc.contributor.author	Ucak, Umit V.	-
dc.contributor.author	Ashyrmamatov, Islambek	-
dc.contributor.author	Lee, Juyong	-
dc.date.accessioned	2023-06-29T08:10:20Z	-
dc.date.available	2023-06-29T17:10:37Z	-
dc.date.issued	2023-05-29	-
dc.identifier.citation	Journal of Cheminformatics, Vol.15:55	ko_KR
dc.identifier.issn	1758-2946	-
dc.identifier.uri	https://hdl.handle.net/10371/194617	-
dc.description.abstract	Tokenization is an important preprocessing step in natural language processing that may have a significant influence on prediction quality. This research showed that traditional SMILES tokenization has a limitation that results in tokens failing to reflect the true nature of molecules. To address this issue, we developed the atom-in-SMILES tokenization scheme, which eliminates ambiguities in the generic nature of SMILES tokens. Our results in multiple chemical translation and molecular property prediction tasks demonstrate that proper tokenization has a significant impact on prediction quality. In terms of prediction accuracy and token degeneration, atom-in-SMILES is a more effective method for generating higher-quality SMILES sequences from AI-based chemical models than other tokenization and representation schemes. We investigated the degrees of token degeneration of various schemes and analyzed their adverse effects on prediction quality. Additionally, token-level repetitions were quantified, and generated examples were incorporated for qualitative examination. We believe that atom-in-SMILES tokenization has great potential to be adopted by a broad range of related scientific communities, as it provides chemically accurate, tailor-made tokens for molecular property prediction, chemical translation, and molecular generative models.	ko_KR
dc.language.iso	en	ko_KR
dc.publisher	BMC	ko_KR
dc.subject	Atom-in-SMILES	-
dc.subject	Tokenization	-
dc.subject	Repetition	-
dc.subject	Chemical language processing	-
dc.title	Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization	ko_KR
dc.type	Article	ko_KR
dc.identifier.doi	10.1186/s13321-023-00725-9	ko_KR
dc.citation.journaltitle	Journal of Cheminformatics	ko_KR
dc.language.rfc3066	en	-
dc.rights.holder	The Author(s)	-
dc.date.updated	2023-06-04T03:10:34Z	-
dc.citation.number	55	ko_KR
dc.citation.volume	15	ko_KR
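
The limitation the abstract attributes to traditional SMILES tokenization can be illustrated with a minimal sketch. The code below is not the authors' atom-in-SMILES implementation; it is a simplified, regex-based atom-wise tokenizer of the conventional kind, written here only for illustration. The pattern, the function names `tokenize_smiles` and `longest_run`, and the run-length measure of repetition are all assumptions and are not taken from the paper.

```python
import re

# Simplified atom-wise SMILES token pattern (an assumption for illustration;
# it covers bracket atoms, common organic-subset atoms, ring closures,
# bonds, and branches, but is not a full SMILES grammar).
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|%\d{2}|\d|[()=#\-+\\/:~@?>*$.])"
)


def tokenize_smiles(smiles):
    """Split a SMILES string into conventional atom-wise tokens."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    # Refuse input that the simplified pattern cannot cover completely.
    if "".join(tokens) != smiles:
        raise ValueError(f"could not tokenize SMILES: {smiles!r}")
    return tokens


def longest_run(tokens):
    """Length of the longest run of identical consecutive tokens.

    A hypothetical, very simple stand-in for a token-repetition measure;
    it is not the metric used in the paper.
    """
    best = run = 0
    prev = None
    for tok in tokens:
        run = run + 1 if tok == prev else 1
        best = max(best, run)
        prev = tok
    return best


if __name__ == "__main__":
    # Acetic acid: the methyl carbon and the carboxyl carbon both become
    # the same generic token "C", although their environments differ.
    print(tokenize_smiles("CC(=O)O"))
    print(longest_run(tokenize_smiles("CCCCCCCC")))
```

In the acetic acid example, both carbons map to the same `C` token even though one is a methyl carbon and the other a carboxyl carbon. This is the kind of ambiguity that, according to the abstract, atom-in-SMILES tokens are designed to eliminate.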

Appears in Collections:

College of Pharmacy (약학대학)
- Dept. of Pharmacy (약학과)
  - Journal Papers (저널논문_약학과)

Files in This Item:

13321_2023_Article_725.pdf 3.06 MB
