Efforts to map the function of what was once considered ‘junk DNA’ have revealed the obscure yet surprisingly important roles played by the body’s forgotten proteins.
Although the map of the human genome was first drafted back in 2001, more than two decades of further research have added more than 200 million newly sequenced base pairs, 115 million of which are predicted to produce proteins that have been sitting on the scientific shelf collecting dust.
British researchers from Oxford and Cambridge Universities have created a publicly available and customisable database of these understudied proteins, dubbed the ‘unknome,’ in the hope that they can promote more rapid exploration of these proteins.
Lead author Dr Joao Rocha, from the MRC Laboratory of Molecular Biology at Cambridge, explained that the unknome database assigns every protein a ‘knownness’ score, which reflects the amount of information about it contained in the scientific literature.
“It has become clear that scientific research tends to focus on well-studied proteins, leading to a concern that poorly understood genes are unjustifiably neglected,” Dr Rocha said.
“Since the release of the first draft of the human genome sequence in 2001, the application of transcriptomics and proteomics has confirmed that most of these new proteins are expressed, and the function of many of them has been identified.
“However, despite over 20 years of extensive effort, there are also many others that still have no known function… Yet to quote James Clerk Maxwell, ‘Thoroughly conscious ignorance is the prelude to every real advance in science.’”
The study, published 8 August in PLOS Biology, found that human genes and proteins were more likely to have been investigated in the last 12 years if they were in clusters that were already well known at the start of this period, with similar trends seen in gene and protein annotation databases.
“This is despite clear evidence from studies of gene expression and genetic variation that many of the poorly characterised proteins are linked to disease, including those that are eminently druggable,” Dr Rocha said.
“This has led to concern that important fundamental or clinical insights, as well as the potential for therapeutic interventions, are being missed, and hence, the launch of several initiatives to address the problem, including programs to generate proteome-wide sets of reagents such as antibodies or mouse knock-out lines.”
The unknome database assigns each protein from a particular organism a ‘knownness’ score based on a user-controlled application of the widely used Gene Ontology (GO) annotations. For example, there were 750 human clusters whose ‘knownness’ was zero 12 years ago that have since increased to above two.
The knownness scores for clusters containing human proteins have increased across the whole range of genes, and the proportion with a score of 2 or less has declined from 43% to 23% over the last 10 years.
“The database allows selection of an ‘unknome’ for humans, or a chosen model organism, that can be tuned to reflect the degree of conservation in other species, for example, allowing a focus on those proteins of unknown function that have orthologs in humans or are widely conserved in evolution,” Dr Rocha explained.
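As a rough illustration of how such a score and selection might work, the Python sketch below sums user-set weights over a protein's GO annotations, takes the best-known member of a cluster as the cluster's score, and then keeps only low-scoring clusters that include a human protein. It is not the database's actual code: the evidence-code weights, the `Protein` record, the function names and the example clusters are all hypothetical, and the rule that a cluster takes its highest member score is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical weights per GO evidence code (assumption for this sketch only).
EVIDENCE_WEIGHTS: Dict[str, float] = {
    "EXP": 1.0, "IDA": 1.0, "IMP": 1.0,  # experimental evidence
    "ISS": 0.5,                          # inferred from sequence similarity
    "IEA": 0.0,                          # electronic annotation, not reviewed
}


@dataclass(frozen=True)
class Protein:
    name: str
    species: str
    evidence_codes: Tuple[str, ...]  # one evidence code per GO annotation


def protein_knownness(protein: Protein,
                      weights: Dict[str, float] = EVIDENCE_WEIGHTS) -> float:
    """Sum the weight of every GO annotation attached to the protein."""
    return sum(weights.get(code, 0.0) for code in protein.evidence_codes)


def cluster_knownness(members: List[Protein],
                      weights: Dict[str, float] = EVIDENCE_WEIGHTS) -> float:
    """Assumed rule: a cluster is as well known as its best-known member."""
    return max((protein_knownness(p, weights) for p in members), default=0.0)


def select_unknome(clusters: Dict[str, List[Protein]],
                   max_knownness: float = 2.0,
                   required_species: FrozenSet[str] = frozenset({"human"})) -> List[str]:
    """Return IDs of clusters at or below the threshold that contain
    at least one protein from every required species."""
    selected = []
    for cluster_id, members in clusters.items():
        species_present = {p.species for p in members}
        if (required_species <= species_present
                and cluster_knownness(members, weights=EVIDENCE_WEIGHTS) <= max_knownness):
            selected.append(cluster_id)
    return sorted(selected)


# Made-up example clusters, not real data.
clusters = {
    "c1": [Protein("A", "human", ("IDA", "IMP", "EXP")), Protein("B", "fly", ())],
    "c2": [Protein("C", "human", ("IEA",)), Protein("D", "fly", ("ISS",))],
    "c3": [Protein("E", "fly", ())],
}

unknome = select_unknome(clusters)
```

In this toy example only cluster `c2` is selected: `c1` scores above the threshold of two, and `c3` has no human member. Changing `required_species` or `max_knownness` mimics, in a simplified way, the kind of tuning Dr Rocha describes.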
To assess the value of the unknome as a foundation for experimental work, the team selected a set of 260 Drosophila proteins of unknown function that are conserved in humans. They then used RNA interference (RNAi) to knock down the expression of each gene in the fruit fly and assessed the consequences across a wide range of biological processes.
“This revealed proteins important for diverse biological roles, including cilia function and Notch pathway signalling, and overall, our approach demonstrated that significant and unexplored biology was encoded in the neglected parts of proteomes,” Dr Rocha explained.
To determine whether the genes were required for viability, a GAL4 driver was used to direct RNAi throughout development. For 162 of the 358 genes, the offspring showed compromised viability, with most failing to develop beyond pupal eclosion, suggesting that these genes were essential for cell development or function.
“Two genes gave a partial but significant reduction in female brood size, and during our work, a mouse ortholog, MARF1, was identified in a genetic screen as being required for maintaining female fertility, apparently by controlling mRNA homeostasis in oocytes,” Dr Rocha said.
“Seven genes showed near complete male sterility, with another five genes giving a statistically significant reduction in brood size.”
Under conditions of amino acid deprivation, knockdown of eight genes in the unknome test set significantly prolonged survival. Even though seven of these genes remained of unknown function, five had orthologs in other species whose localisation or interactions suggested they have roles in the endosomal system.
“Similarly, eleven genes gave a statistically meaningful increase in starvation resistance and while most of these genes’ purpose remains unknown, three have since been reported to have functions related to oxidative stress signalling,” Dr Rocha said.
“An important primary conclusion of our work is that these uncharacterised genes have not deserved their neglect, a conclusion strengthened by a variety of other studies published during the protracted course of our studies, again revealing important functions for unknown genes.
“Accurately evaluating ignorance about gene function provides a valuable resource for guiding biological studies and may even be important for determining strategies to efficiently fund science.”