Swedish researchers have successfully used artificial intelligence to create synthetic DNA for controlling specific gene expression – unlocking the keys to designer mRNA.
According to the study, published 30 August 2022 in Nature Communications, the AI was told how much of a specific gene was required, before ‘printing’ the appropriate DNA sequence.
The team from Chalmers University of Technology said that the breakthrough could contribute to the development and production of mRNA vaccines, protein-based drugs for severe diseases and alternative food proteins, in a fraction of the time, and at significantly lower cost, compared with what is possible today.
Lead researcher and Associate Professor of Systems Biology, Aleksej Zelezniak, explained that researchers have been trying to control cellular protein production for decades, seeking to unlock a process that would enable the design of efficient gene therapies and microbial cell factories, with the eventual hope of reaching the Holy Grail of medical research – a cure for cancer.
“First it was about being able to fully ‘read’ the DNA molecule’s instructions,” Professor Zelezniak said.
“Now we have succeeded in designing our own DNA that contains the exact instructions to control the quantity of a specific protein.”
Gene expression is a fundamental process underlying the cellular functionality of all living organisms, with expression describing how the genetic code in DNA is transcribed to the molecule messenger RNA (mRNA), which tells a cell’s factory which protein to produce, and in what quantities.
The team used state-of-the-art deep learning models to learn and map the functional DNA regulatory sequence space to gene expression levels directly from natural genomic data in the yeast Saccharomyces cerevisiae (whose cells resemble those of mammals), enabling the deliberate design of specific proteins.
“This was made possible by incorporating multiple recent advances that enabled us to develop our supervised deep generative modelling approach… including a highly accurate predictive model of gene expression levels that could explain over 82% of expression variation from regulatory sequence alone,” the authors explained.
“We can design unique regulatory sequence variants that are nevertheless functional and contain natural-like properties and cis-regulatory grammar, even surpassing the expression level of natural, highly expressed genes.
“Moreover, since our DNA-generator has learned the generalized functional regulatory sequence space, it can generate a practically infinite supply of unique sequence samples for any gene.
“It traverses only the most relevant sequence subspace instead of randomly sampling candidates from all 4^1000 possible sequence variants, which would otherwise be needed to explore the 1000bp of regulatory DNA.”
The most common method for designing synthetic regulatory DNA currently involves stacking multiple known functional sequences and then applying random mutagenesis to a specific region, with researchers relying on in-silico screening approaches to predict gene expression.
The study noted that such random-mutagenesis-based approaches were commonly based on a brute-force strategy, starting with an existing natural sequence then working randomly, one set of mutations at a time, without considering the functional sequence context.
In recent years, more intricate solutions, including the use of genetic algorithms, have been implemented. However, these algorithms still employ random mutagenesis in each round of sequence evolution, and many of the sequences produced result in junk DNA.
As such, the search for functional sequence variants frequently requires experimental screening with multiple rounds of trial and error, or experimentally testing enormous sequence batches.
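The brute-force strategy described above can be sketched in a few lines of Python. This is a toy illustration only, not the method used in the study: the names `mutate`, `toy_score` and `brute_force_search` are hypothetical, and the GC-fraction score is an assumed stand-in for a real in-silico expression predictor.

```python
import random

BASES = "ACGT"


def mutate(seq: str, n_mutations: int, rng: random.Random) -> str:
    """Return a copy of seq with n_mutations random point substitutions."""
    chars = list(seq)
    for pos in rng.sample(range(len(chars)), n_mutations):
        # Always substitute a different base, so every chosen position changes.
        chars[pos] = rng.choice([b for b in BASES if b != chars[pos]])
    return "".join(chars)


def toy_score(seq: str) -> float:
    """Hypothetical stand-in for an expression predictor: the GC fraction."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def brute_force_search(start: str, rounds: int, batch_size: int,
                       n_mutations: int, seed: int = 0) -> tuple[str, float]:
    """Random mutagenesis, one batch of mutants per round, keeping the best.

    Mutations are drawn blindly, with no model of the functional sequence
    context, so most candidates in each batch are discarded.
    """
    rng = random.Random(seed)
    best, best_score = start, toy_score(start)
    for _ in range(rounds):
        candidates = [mutate(best, n_mutations, rng) for _ in range(batch_size)]
        top = max(candidates, key=toy_score)
        if toy_score(top) > best_score:
            best, best_score = top, toy_score(top)
    return best, best_score


if __name__ == "__main__":
    start_seq = "AT" * 25  # a 50 bp toy starting sequence
    designed, score = brute_force_search(start_seq, rounds=20,
                                         batch_size=50, n_mutations=2)
    print(designed, round(score, 2))
```

Each round scores an entire batch of random mutants in order to keep a single one. When the scoring step is a laboratory experiment rather than a cheap function call, that same loop becomes the trial-and-error screening the study describes.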
Professor Zelezniak explained that because of this, protein-based drugs for complex diseases or alternative sustainable food proteins can take many years to create and can be extremely expensive to develop.
“Some are so expensive that it is impossible to obtain a return on investment, making them economically nonviable,” he said.
“[Yet] with our technology, it is possible to develop and manufacture proteins much more efficiently so that they can be marketed.”
The resource-intensive nature of the mutagenesis-based approaches has also been the major factor limiting the current exploration of DNA to short segments of single regulatory regions and specific reporter genes.
“There are over 10^60 different ways to construct a mere 100 bp promoter sequence, covering more DNA variation than exists in all living species on our planet,” the authors highlighted.
“Experimentally, exploring even a tiny fraction of such an enormous sequence space is challenging and often infeasible due to the vast species diversity and complexity of eukaryotic gene regulation.”
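The scale of these numbers follows from simple arithmetic: each position in a DNA sequence can hold one of four bases, so a sequence of n base pairs has 4^n possible variants. The short Python check below is an illustrative sketch (the function names are hypothetical, not from the study) that confirms the orders of magnitude for a 100 bp promoter and for 1000 bp of regulatory DNA.

```python
def sequence_space_size(length_bp: int, alphabet_size: int = 4) -> int:
    """Number of distinct sequences of the given length over A, C, G and T."""
    return alphabet_size ** length_bp


def order_of_magnitude(n: int) -> int:
    """Largest k such that 10**k <= n, for a positive integer n."""
    return len(str(n)) - 1


if __name__ == "__main__":
    for length in (100, 1000):
        size = sequence_space_size(length)
        print(f"{length} bp: about 10^{order_of_magnitude(size)} variants")
```

For 100 bp the count is about 1.6 × 10^60, and for 1000 bp it exceeds 10^602, which is why exhaustive sampling is not an option.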
Lead author Dr Jan Zrimec, a research associate at the National Institute of Biology in Slovenia (and former post-doctoral student under Professor Zelezniak), said that machine learning approaches have expanded our knowledge of the principles underlying gene expression, helping us to accurately predict gene expression across multiple model organisms.
“DNA is an incredibly long and complex molecule, and as such, experimentally it was extremely challenging to make changes by iteratively reading and changing it, then reading and changing it again,” Dr Zrimec explained.
“That way, it took years of research to find something that worked – instead, it is much more effective to let an AI learn the principles of navigating DNA. What otherwise took years is now shortened to weeks or days.”
The study noted that the “striking capacity” of random DNA to evolve into functioning regulatory sequences, by introducing only a small number of base pair mutations, suggested that the “richness and plasticity of cis-regulatory grammar results in a vast functional regulatory sequence space, far larger than the one currently observed in natural systems.”
The next step is to use human cells.