Model matches the accuracy of leading 1.2 billion-parameter DNA language models while using only 172 million parameters


SAN DIEGO & SEOUL, South Korea--(BUSINESS WIRE)--Inocras, a bioinformatics-led company harnessing the power of whole genome data and proprietary analytics to deliver curated insights that advance precision health, today announced that “DNAChunker: Learnable Tokenization for DNA Language Models,” a joint research paper with the Korea Advanced Institute of Science and Technology (KAIST), has been accepted for publication at the International Conference on Machine Learning (ICML) 2026.
The paper introduces DNAChunker, a learnable adaptive tokenization approach for DNA language models that dynamically segments genomic sequences into biologically meaningful, variable-length units. Unlike conventional DNA language models that process genomic sequences using fixed-size or externally defined segments, DNAChunker learns how to group genetic code based on biological context, enabling more accurate and efficient representation of complex genomic patterns.
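To illustrate the difference between fixed-size and variable-length segmentation described above, here is a minimal sketch in Python. It is not the DNAChunker implementation: the function names, the threshold, and the per-position boundary scores are hypothetical stand-ins for values that a trained model would predict from sequence context.

```python
from typing import List, Sequence


def fixed_kmer_tokens(seq: str, k: int = 3) -> List[str]:
    """Split a sequence into non-overlapping chunks of length k.

    The final chunk may be shorter if len(seq) is not a multiple of k.
    """
    return [seq[i:i + k] for i in range(0, len(seq), k)]


def chunk_by_boundaries(
    seq: str,
    boundary_scores: Sequence[float],
    threshold: float = 0.5,
) -> List[str]:
    """Split a sequence wherever a per-position boundary score crosses a threshold.

    boundary_scores[i] is a hypothetical score that a new chunk starts at
    position i. In a learnable tokenizer these scores would come from a
    trained model; here they are supplied by hand for illustration.
    """
    if len(seq) != len(boundary_scores):
        raise ValueError("seq and boundary_scores must have the same length")
    chunks: List[str] = []
    start = 0
    for i in range(1, len(seq)):
        if boundary_scores[i] >= threshold:
            chunks.append(seq[start:i])
            start = i
    if seq:
        chunks.append(seq[start:])
    return chunks


seq = "ATGCGTACGTTA"
# Hand-written scores (assumption): high values mark where a chunk begins.
scores = [1.0, 0.1, 0.2, 0.9, 0.1, 0.1, 0.1, 0.8, 0.3, 0.7, 0.2, 0.1]

fixed = fixed_kmer_tokens(seq, k=3)         # every chunk has length 3
adaptive = chunk_by_boundaries(seq, scores)  # chunk lengths vary with the scores
```

In this toy example the fixed scheme always cuts every three bases, while the score-driven scheme produces chunks of different lengths that still concatenate back to the original sequence.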
DNAChunker achieves state-of-the-art performance, matching the accuracy of leading 1.2 billion-parameter DNA language models with only 172 million parameters, making it roughly seven times smaller. By reducing model size while preserving performance, DNAChunker may help make advanced genomic AI models more practical for large-scale research, translational discovery and future clinical applications.
“DNA language models depend heavily on how genomic sequences are represented before they are interpreted by AI,” said Wonchul Lee, CIO at Inocras and co-lead of the paper. “By replacing rigid tokenization with a learnable approach, DNAChunker provides a more precise and efficient foundation for downstream genomic modeling.”
“Our ICML acceptance marks a major milestone for Inocras’ Cancer Foundation Model, developed in collaboration with KAIST and trained on thousands of whole genomes from diverse cancer types,” said Jehee Suh, CEO of Inocras. “DNAChunker provides the biologically informed genome representation layer underlying that broader vision, helping foundation models move beyond pattern recognition toward clinically meaningful cancer interpretation. Together with KAIST, we are advancing the core technologies needed to make whole-genome AI more accurate, efficient, and scalable.”
KAIST led foundational algorithm design, model implementation and validation, while Inocras contributed large-scale computational resources, key technical ideas and validation efforts to align the model with practical and clinical applications.
“DNAChunker shows that sequence representation is a central challenge in building effective DNA language models,” said Prof. Sungsoo Ahn and Insu Han of KAIST, corresponding authors of the paper. “Our collaboration with Inocras helped connect advanced AI methodology with the scale and practical requirements of whole-genome analysis.”
About Inocras
Inocras is a bioinformatics-led company redefining precision health through whole genome data and proprietary analytics. Our oncology and rare disease platforms integrate comprehensive whole genome data with advanced automation to deliver curated, actionable insights at scale, accelerating discovery and diagnostics to improve patient care and deliver real-world impact. Inocras operates a CLIA/CAP-certified laboratory and partners with leading hospitals, pharmaceutical companies, and research institutions worldwide. For more information, please visit inocras.com and follow the Inocras LinkedIn page.
Contacts
Media Contact
Vikki Herrera
Oak Street Communications for Inocras
vikki@oakstreetcommunications.com





