Part 2 of the series describing my computational biology role at Stanford. No genetics background is required, but some parts get technical since they're important for showcasing the problem.
We hear how rapidly AI is advancing and how it's changing the world. How are we using it to unlock biological discoveries, particularly in the field of genetics?
DNA is the blueprint of life, containing the instructions for how we develop and function. It's responsible for our hair color, but also for debilitating diseases, like Huntington's. It contains over 3 billion base pairs (letters), but we only understand a fraction of it. With advances in DNA-sequencing technologies, we've obtained many DNA sequences from different individuals and organisms. This gives us an immense amount of data, with hopes of using it to decipher the DNA language.
There's been a lot of effort to understand and categorize the DNA that codes for genes. These sections of DNA get the most attention since they can be directly traced to a function within a cell. However, gene-coding regions make up only 1% of the DNA. When we do genome-wide association studies (GWAS), such as for heart disease, most hits lie in the non-coding (non-gene) regions of DNA. This tells us that the non-coding regions are just as important for understanding the full picture, which is why many genetics labs, such as mine, focus on them.
A good portion of the non-coding region of DNA is devoted to regulating genes. For example, two people can both have the same gene in their DNA, but one person gets cancer because that gene gets overexpressed in their body. We call the regions of DNA that increase gene expression enhancers. Enhancers typically work by coming into physical contact with the DNA of the gene region, catalyzing the process that converts the gene into a protein. In my lab, we were trying to identify these enhancer regions and figure out which genes they were regulating.
Where the Machine Learning Comes In
To identify enhancer regions and map those enhancers to genes, we performed scientific experiments. The idea was to use a technology, CRISPRi, to "silence" suspected enhancer regions of DNA, and then see if that silencing had any effect on the expression of a gene. These experiments were performed in other research labs, along with our own, and allowed us to build a combined dataset that looked like the following:
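The original table isn't reproduced here, but a minimal sketch of what rows of such a combined CRISPRi dataset might look like is below. The coordinates, gene names, and field names are invented for illustration, not the lab's actual schema:

```python
# Hypothetical rows: each entry is one tested enhancer-gene pair and
# whether silencing the enhancer changed the gene's expression.
dataset = [
    {"enhancer": "chr8:128200000-128201000", "gene": "MYC",
     "distance_bp": 55_000, "regulates": True},
    {"enhancer": "chr8:128300000-128301000", "gene": "MYC",
     "distance_bp": 155_000, "regulates": False},
]

# In the real combined dataset, only ~400 of ~10k pairs were positives.
positives = sum(row["regulates"] for row in dataset)
```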
Unfortunately, the CRISPRi technology doesn't scale well. We could only test a limited number of enhancers and could only apply it in certain cell types. The resulting combined dataset consisted of only 10k total enhancer-gene pairs, with only 400 being positives (Regulates=TRUE). There are thousands of cell types and millions of potential enhancer-gene pairs. So a path forward was to learn from the data we do have, and build models that can predict which enhancers regulate which genes.
Input:
Enhancer region
Gene region
Distance between enhancer and gene
Accessibility of the enhancer region (measured by a chromatin accessibility assay)
…
Output:
Does the enhancer regulate the gene?
When determining whether an enhancer regulates a gene, there are many factors involved. For example, the closer an enhancer is to the gene, the higher the chance it regulates that gene. We generated a set of features and passed them through our predictive models. The result: our simple logistic regression model surpassed all other published predictive models when benchmarked against the combined CRISPRi dataset. Due to the small dataset size (10k data points), more complex models, such as neural networks, didn't work too well, as they would often end up overfitting.
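To make the setup concrete, here is a toy sketch of fitting a logistic regression on features like the ones above. The synthetic data, feature names, and weights are all invented; this is not our actual model, features, or training code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the CRISPRi dataset: shorter enhancer-gene
# distance and higher accessibility make regulation more likely.
n = 1000
distance = rng.uniform(0, 1, n)       # normalized distance feature
accessibility = rng.uniform(0, 1, n)  # normalized accessibility feature
true_logits = 3.0 * accessibility - 4.0 * distance
y = (rng.uniform(0, 1, n) < 1 / (1 + np.exp(-true_logits))).astype(float)

# Design matrix with an intercept column, weights fit by plain
# gradient descent on the logistic loss.
X = np.column_stack([np.ones(n), distance, accessibility])
w = np.zeros(3)
for _ in range(2000):
    p = 1 / (1 + np.exp(-X @ w))
    w -= 0.1 * (X.T @ (p - y)) / n

pred = (1 / (1 + np.exp(-X @ w))) > 0.5
accuracy = (pred == y.astype(bool)).mean()
```

The fitted weights recover the expected directions: negative for distance, positive for accessibility.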
In my role, I helped generate features, train the model, and apply the model to thousands of cell types. All the work we've done in the lab has been published on bioRxiv and is in the process of review at a scientific journal.
The Future is LLMs?
Although our model achieved state-of-the-art performance, we know there's a lot of room for improvement. Some of the data we used to generate features for the model was only available in certain cell types (e.g. H3K27ac ChIP-seq in K562). So for the majority of cell types, we had to use a much simpler model that had worse performance.
But hope is not lost. These missing features are generated from the results of biological experiments. If we develop models that can predict the outcomes of those biological experiments, we can boost performance for all cell types.
The results of certain experiments can be represented as a coverage track, where each genomic position (x-axis) has a signal based on what you're measuring (y-axis). It's similar to how we might track the speed of a race car, where the y-axis is speed and the x-axis is distance from the starting line.
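In code, a coverage track is just an array of per-base signal values over a genomic window. A tiny sketch with made-up coordinates and values:

```python
# A coverage track: one signal value per base over a genomic window.
start = 1000  # hypothetical 0-based genomic start coordinate
signal = [0.0, 0.1, 0.4, 2.3, 5.1, 4.8, 1.2, 0.3, 0.0, 0.0]

# Locate the strongest signal, e.g. an accessibility peak.
peak_offset = max(range(len(signal)), key=signal.__getitem__)
peak_position = start + peak_offset
```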
Similar to translating one human language to another, we want to translate a DNA sequence into its corresponding coverage track.
But it's not as simple as taking one section of DNA and translating it to coverage. In human language translation, context matters. What was said at the beginning of a sentence influences how we translate a given word. The same applies when translating DNA to a coverage track. The coverage at a region can be heavily influenced by DNA sequence that's over 1 million base pairs away! Many nonlocal factors come into play when translating: how does the DNA fold, what's the state of the cell, how accessible is this region of DNA? Many of these higher-level concepts can be inferred by taking in context.
That's where the advancements from natural language processing (NLP) come in. New NLP model architectures, such as Transformers, have demonstrated the ability of models to recall context across very long distances. For example, in ChatGPT, when I ask it my 100th question, it still remembers the first question I asked. Context windows for these LLMs (large language models) are growing larger with more research, but aren't yet capable of handling the full-scale context of an entire human genome (3 billion letters). However, we can still leverage these models to give us insights and help with DNA translation.
Similar to how ChatGPT and other LLMs have gotten really good at understanding human language, we need an LLM to get really good at understanding the DNA language. And once we have such a model, we can fine-tune it to perform really well on certain biological tasks, such as predicting enhancer-gene pairs or DNA-drug compatibility (how will patients with different DNA react to a drug?). Genomic LLMs still have a long way to go compared to current LLMs for human language, but they're on the rise, as evidenced by the growing number of publications.
My Journey With LLMs and Deep Learning Models
Prior to my job, I had no idea how language models worked. But by the end of it, I had built my own deep learning model that takes DNA sequence as input and generates one of the coverage tracks mentioned above. I've also used and evaluated many different deep learning genomic models.
Like many others, I was blown away by ChatGPT's capabilities. I had a basic ML background, having taken an ML course back in undergrad and completed an online ML specialization course. But none of those covered natural language processing, so I would get really confused when I tried to jump right in and understand the architecture behind language models.
Amazingly, Stanford posts many of their CS courses for free on YouTube. So I watched their course on natural language processing (CS 224N: NLP), which covered embeddings and earlier architectures such as LSTMs. I even designed and built my own learning projects to gain practical experience building and training these deep learning models. Later, I'd also learn about graph neural networks (CS224W) and convolutional neural networks (CS231N: Computer Vision), as they were also relevant to certain genomic models.
Around the same time I was working through these courses, sequence-based deep learning models (models that take DNA sequence as input) were growing in popularity. The research field hoped to apply the breakthroughs in NLP to genetics. The north star was to create an LLM that understands genetic code, the same way ChatGPT understands human language.
Unfortunately, many research labs, including mine, didn't have the resources to build LLMs. So we built models on a smaller scale, with the hope that the models could still pick up on certain patterns in the DNA sequence.
Let's return to the enhancer-gene pair prediction problem. For our predictive model, the main input was a file, which could come from one of two biological experiments (DNase-seq or ATAC-seq). Both experiments theoretically measure similar things (chromatin accessibility), but do so in different ways. Unfortunately, our model didn't perform as well on ATAC-seq files as it did on DNase-seq ones. So one of the avenues we explored was converting ATAC-seq signal to DNase-seq signal, leveraging DNA sequence to do so.
As part of this attempt, I designed my own deep learning model built from both convolutional layers and transformer encoder layers. It takes both DNA sequence and ATAC-seq signal in 750 base pair context windows. Leveraging research in NLP, I adopted a transformer encoder architecture similar to BERT and incorporated relative positional encodings into my attention layers. For training, I followed best practices, such as warming up learning rates, grid searching for hyperparameters, and normalizing when applicable. I tested different architectures, model sizes, loss functions, and position encoding formats before training the model for 24 hours on a GPU.
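My exact architecture isn't shown in the post, but the relative positional encoding idea can be sketched in a few lines. Below is a minimal NumPy toy of single-head self-attention with a bias term indexed by relative offset, one simplified form of relative positional encoding; the shapes, sizes, and random weights are illustrative, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, d = 8, 16  # toy sizes; the real model used 750-bp windows

x = rng.normal(size=(seq_len, d))  # input embeddings (e.g. DNA + ATAC features)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

# One bias value per relative offset in [-(seq_len-1), seq_len-1]; a
# learned version of this table is one way to encode relative position.
rel_bias = rng.normal(size=2 * seq_len - 1) * 0.1
offsets = np.arange(seq_len)[:, None] - np.arange(seq_len)[None, :]
bias = rel_bias[offsets + seq_len - 1]  # (seq_len, seq_len) bias matrix

# Scaled dot-product attention with the relative-position bias added
# to the content scores before the softmax.
q, k, v = x @ Wq, x @ Wk, x @ Wv
scores = q @ k.T / np.sqrt(d) + bias
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ v
```

The key design point is that the bias depends only on the distance between two positions, not their absolute coordinates, which suits genomic signal where local spacing matters more than position on the chromosome.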
Using a separate evaluation criterion, the predicted DNase-seq signal performed better than the original ATAC-seq signal, but only marginally. There are a few possible explanations: the model isn't large enough, the context window is too small, and/or the underlying biological cells were different between experiments. Ultimately, another avenue I pursued showed more promise, so I stopped working on this project.