A tool for CRISPR-Cas9 sgRNA evaluation based on computational models of gene expression

Abstract

Background

CRISPR is widely used to silence genes by inducing mutations expected to nullify their expression. While numerous computational tools have been developed to design single-guide RNAs (sgRNAs) with high cutting efficiency and minimal off-target effects, only a few tools focus specifically on predicting gene knockouts following CRISPR. These tools consider factors like conservation, amino acid composition, and frameshift likelihood. However, they neglect the impact of CRISPR on gene expression, which can dramatically affect the success of CRISPR-induced gene silencing attempts. Furthermore, information regarding gene expression can be useful even when the objective is not to silence a gene. To date, no available tool considers gene expression when predicting CRISPR outcomes.

Results

We developed EXPosition, the first computational tool that combines models predicting gene knockouts after CRISPR with models that forecast gene expression, offering more accurate predictions of gene knockout outcomes. EXPosition leverages deep-learning models to predict key steps in gene expression: transcription, splicing, and translation initiation. We showed our tool performs better at predicting gene knockout than existing tools across 6 datasets, 4 cell types and ~207k sgRNAs. We also validated our gene expression models using the ClinVar dataset by showing enrichment of pathogenic mutations in high-scoring mutations according to our models.

Conclusions

We believe EXPosition will enhance both the efficiency and accuracy of genome editing projects, by directly predicting CRISPR’s effect on various aspects of gene expression. EXPosition is available at http://www.cs.tau.ac.il/~tamirtul/EXPosition. The source code is available at https://github.com/shaicoh3n/EXPosition.

Background

Over the past decade, significant progress has been achieved in the field of genome editing, largely attributed to the utilization of CRISPR (clustered regularly interspaced short palindromic repeats) and its Cas (CRISPR-associated) proteins (reviewed in [1]). The Cas9 protein creates double-stranded breaks (DSBs) that are subsequently repaired by the cell’s repair mechanisms, usually through non-homologous end joining (NHEJ), leading to the potential introduction of indels. One of the areas where these advancements have occurred pertains to gene silencing, with the primary objective being the selective inhibition of a specific gene without affecting others. Various methods for gene silencing using CRISPR exist, including expression inhibition through CRISPRi (CRISPR interference), the introduction of point mutations, and the insertion of premature stop codons. Another common approach involves the use of Cas9 proteins to induce mutations in start codons, thereby inhibiting translation initiation [2,3,4].

In most computational models predicting CRISPR activity, researchers have shown interest in the DSB’s location and likelihood, as well as the identity of the resulting mutation [5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22]. This paradigm assumes that if a mutation is induced, then the gene’s expression will significantly decrease; however, this is not always the case. For example, a gene with a mutated start codon can still be translated due to an alternative in-frame start codon, so the resulting protein remains functional [23] (Fig. 1A). Furthermore, transcription and splicing are influenced by various properties of the DNA sequence [24, 25] and not every change to the original sequence will result in a measurable change in the gene’s expression. Therefore, mutated off-target genes, i.e., genes mutated by CRISPR even though the single-guide RNA (sgRNA) was not meant to target them, do not necessarily have their expression affected. On the other hand, even if each exon’s DNA sequence remains unmutated, the gene’s expression could be affected by intronic or intergenic mutations (Fig. 1B). Thus, mutations in all positions require careful evaluation to determine whether they exert a discernible effect on expression.

Fig. 1

Overview of the tool’s rationale and structure. A A given gene’s canonical start codon (blue) is deleted by a CRISPR mutation to silence the gene. However, since an alternative ATG in the same reading frame exists (green), translation initiation still occurs, effectively keeping the gene unsilenced. B Part of intron 1 (red) is deleted by a CRISPR mutation; although each exon’s pre-mRNA sequence remains intact, the deletion causes a mis-splicing event, resulting in exon 2 not being spliced; thus, the mRNA includes only exons 1 and 3. C EXPosition pipeline. First, the tool either predicts the most likely mutations induced in a CRISPR target site or receives specific mutations from the user; it then searches for potentially affected genes and assesses the mutation’s effect on their splicing, transcription, and translation initiation. In addition, the VBC and GuidePro scores are calculated on the target site. These scores are fed along with the predicted cutting efficiency and the gene expression estimates into an SVM classifier that predicts whether the sgRNA will cause gene knockout. The user can also insert specific genes/transcripts for analysis, thus disregarding other genes/transcripts which are affected by the mutations. Green boxes represent EXPosition's main components

Recently, a few tools designed to predict gene knockout following CRISPR were created [26, 27]. These tools focus on sgRNAs that target the coding region of a gene and use amino acid composition, conservation and frameshift likelihood. However, a tool designed to address sgRNAs’ effect on gene expression, that can evaluate sgRNAs targeting non-coding regions of genes, is needed.

Here we present a tool that addresses these issues by explicitly considering CRISPR’s effect on the target site’s phenotype rather than the genotypic change. Our tool predicts the impact of CRISPR’s action on three aspects of gene expression: transcription, splicing, and translation initiation. It then incorporates these estimations with predicted cutting efficiency and predictions from tools designed to predict gene knockout to better assess gene knockout. Since not every mutation will significantly affect gene expression, researchers using our tool can save time and money when deciding which sgRNA is most likely to achieve the desired change in gene expression (e.g., silencing) without affecting the expression of off-target genes.

Implementation

General pipeline of the tool

Our tool, called EXPosition, accepts a CRISPR target site location and predicts whether targeting that site will silence the target gene. It accomplishes this by combining post-CRISPR gene expression estimations with gene knockout predictions from other models that rely on other features. Importantly, EXPosition differs from previous tools by also providing information about the likely phenotypical effect of CRISPR at that site, i.e., the effect of the induced mutation on gene expression. The tool’s modules are summarized in Fig. 1C. In short, EXPosition accepts an sgRNA and a gene of interest; predicts the likelihood of the guide’s cut and NHEJ-induced mutations using CRISPRedict [28] and Lindel [29], respectively; predicts the mutations’ effect on transcription, splicing, and translation initiation using Xpresso [24], Oncosplice [30], and TITER [31], respectively; and combines them with the scores from VBC [26], GuidePro [27], and CRISPRedict [28] to provide a prediction of gene knockout using that sgRNA. The source code for our tool is available at https://github.com/shaicoh3n/EXPosition [32].

Running EXPosition is described here in more detail. First, the user chooses which sub-models to run (Fig. 1C:a): transcription, splicing, and translation initiation. Then, the user inputs a target site location (Fig. 1C:b). Using CRISPRedict [28] and Lindel [29], the tool predicts the most likely mutations to be induced by CRISPR, along with their probabilities (Fig. 1C:I). Alternatively, the user can specify the mutations and their probabilities (Fig. 1C:c). In both cases, these mutations are then analyzed in the chosen sub-models. In each sub-model (Fig. 1C:III-V), we evaluate the effect of each mutation on one aspect of expression for each human gene, thereby finding the genes affected by the mutation. We then multiply each mutation’s phenotypic score by the probability of the mutation, and sum over all mutations to arrive at an expected value of effect for that aspect of expression for a certain gene (Fig. 1C:f–h). If multiple genes are affected by the mutations, the score of the gene with the highest score (i.e., the most detrimental effect on expression) is the output of the sub-model. Thus, the final score for each sub-model is:

$$\text{Sub-model score}=\max_{j\in\{1,\dots,N_{\mathrm{affected}\;\mathrm{genes}}\}}\sum_{i=1}^{N_{\mathrm{mutations}}}p_i\ast s_{ij}$$
(1)

where \(j\) ranges from 1 to \({N}_{\text{affected genes}}\), which is the number of genes affected by the predicted mutations and is usually 1–2 (Fig. 1C:e); \({N}_{\text{mutations}}\) is the number of top mutations (i.e., most probable mutations) considered (by default, \({N}_{\text{mutations}}=4\)); \({p}_{i}\) is the probability of mutation \(i\) occurring; and \({s}_{ij}\) represents the expected effect of mutation \(i\) on gene \(j\) according to the sub-model, i.e., its effect on transcription, splicing, or translation initiation. \({s}_{ij}\) is normalized to range between 0 and 1 in the following way:

$$s_{ij}=\min\left(1,\frac{{s'}_{ij}}{M_{\mathrm{sub}-\mathrm{model}}}\right)$$
(2)

where \({s{\prime}}_{ij}\) represents the highest raw sub-model score for mutation \(i\) over all gene \(j\)’s transcripts, and \({M}_{\text{sub}-\text{model}}\) is the maximal score predicted by the sub-model on all ClinVar mutations (see the section “Mutations with high gene expression scores are overrepresented in ClinVar-designated pathogenic mutations”).
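Equations 1 and 2 can be illustrated with a minimal Python sketch. The function names, the dictionary layout, and all numeric values below are hypothetical and are not taken from the EXPosition source code; the sketch only assumes that a raw score per mutation and gene, and the maximal ClinVar score of the sub-model, are already available.

```python
from typing import Dict, List


def normalize_score(raw_score: float, max_clinvar_score: float) -> float:
    """Eq. 2: scale a raw sub-model score by the maximal ClinVar score, capped at 1."""
    return min(1.0, raw_score / max_clinvar_score)


def sub_model_score(
    probabilities: List[float],
    raw_scores: Dict[str, List[float]],
    max_clinvar_score: float,
) -> float:
    """Eq. 1: probability-weighted sum per gene, then the maximum over affected genes."""
    expected_per_gene = {
        gene: sum(
            p * normalize_score(s, max_clinvar_score)
            for p, s in zip(probabilities, scores)
        )
        for gene, scores in raw_scores.items()
    }
    return max(expected_per_gene.values())


# Hypothetical example: four top mutations (p_i) and two affected genes (s'_ij).
p = [0.4, 0.3, 0.2, 0.1]
raw = {"GENE_A": [2.0, 0.0, 1.0, 0.0], "GENE_B": [0.5, 0.5, 0.0, 0.0]}
score = sub_model_score(p, raw, max_clinvar_score=2.0)
```

In this toy case GENE_A has the larger expected effect, so its value is the one reported for the sub-model.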

Finally, the scores from EXPosition are combined with the predicted cutting efficiency (Fig. 1C:i), VBC score (Fig. 1C:j), and GuidePro score (Fig. 1C:k) in a Support Vector Machine (SVM) classifier (Fig. 1C:VIII) to provide a binary classification of whether the sgRNA would cause a gene knockout (Fig. 1C:I).

Importantly, instead of evaluating every human coding gene/transcript affected by the mutations, the user can input a gene or transcript of interest (Fig. 1C:d); if they do so, the mutation’s effects in the selected sub-models will be checked only against that gene or transcript. If a transcript of interest was set, \({s{\prime}}_{ij}\) is the score of that transcript (instead of the maximal score over all the associated gene’s transcripts).

In the transcription sub-model (Fig. 1C:III), the score (\({s}_{ij}\)) represents the predicted relative change in mRNA levels caused by a mutation; this change is predicted using Xpresso [24]. In the splicing sub-model (Fig. 1C:IV), the score signifies the predicted loss of functionality of the gene’s proteins (based on evolutionary conservation) caused by a mutation, while considering any mis-splicing events and changes in the position of their start codon; this is accomplished using Oncosplice [30]. In the translation initiation sub-model (Fig. 1C:V), the score denotes the predicted relative change in the start codon’s efficiency of translation initiation, while considering any mis-splicing events; this is achieved using TITER [31]. The following sections detail the different parts of EXPosition.

Predicting the genotypic outcome of CRISPR’s DSB

As a first step, based on the user’s cut site location, we extract the target site along with its flanking sequences to predict the DSB’s probability and the resulting indels using CRISPRedict [28] and Lindel [29], respectively (Fig. 1C:I). CRISPRedict is a linear regression model that predicts the probability of cutting by sgRNAs; it takes as input a 30-nt sequence surrounding the cut site: 4 nt upstream of the target site, the 20 nt of the target site itself, and the 6 nt immediately downstream of it. Lindel is a logistic regression model that predicts the likelihood of NHEJ mutations induced by CRISPR; its input is a 60-nt sequence centered around the cut site, and its output consists of the predicted probabilities of 557 possible mutations: deletions around the cut site of up to 30 nt, every possible insertion of 1–2 nt, and a single collective mutation for any insertions of ≥ 3 nt.
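The two input windows described above can be sketched as simple string slices. This is an illustration only: the function names are hypothetical, the sequence is assumed to be given on the same strand as the target site, and the cut is assumed to fall 3 nt upstream of the PAM (between target positions 17 and 18), as is typical for SpCas9.

```python
def crispredict_window(seq: str, target_start: int) -> str:
    """Return the 30-nt context: 4 nt upstream, 20-nt target site, 6 nt downstream."""
    start = target_start - 4
    end = target_start + 20 + 6  # assumption: the 6 nt include the 3-nt PAM
    if start < 0 or end > len(seq):
        raise ValueError("not enough flanking sequence for a 30-nt window")
    return seq[start:end]


def lindel_window(seq: str, target_start: int) -> str:
    """Return 60 nt centered on the cut site (30 nt on each side)."""
    cut = target_start + 17  # assumption: cut 3 nt upstream of the PAM
    if cut - 30 < 0 or cut + 30 > len(seq):
        raise ValueError("not enough flanking sequence for a 60-nt window")
    return seq[cut - 30:cut + 30]


# Hypothetical locus: 40 nt of flank, a 20-nt target, a TGG PAM, 40 nt of flank.
target = "ACGT" * 5
locus = "A" * 40 + target + "TGG" + "C" * 40
context_30 = crispredict_window(locus, target_start=40)
context_60 = lindel_window(locus, target_start=40)
```

Both helpers raise an error rather than padding when the flanking sequence is too short, which keeps the window lengths fixed at 30 and 60 nt.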

We analyze only the \({N}_{\text{mutations}}\) (by default \({N}_{\text{mutations}}=4\)) most likely mutations predicted by Lindel in our tool, as checking all possibilities is not feasible timewise. We normalize the probabilities of these mutations so that their sum equals 1. We also exclude insertions longer than 2 nt, as Lindel does not provide explicit mutations for such cases, which have been demonstrated to be exceedingly rare [22, 29].

Thus, the probability for each mutation is calculated as follows:

$$p_i=\frac{{p_{\mathrm{Lindel}}}_i}{\sum_{k=1}^{N_{\mathrm{mutations}}}{p_{\mathrm{Lindel}}}_k}\ast{p_{\mathrm{CRISPRedict}}}_i$$
(3)

where \({{p}_{\text{Lindel}}}_{i}\) and \({p}_{{\text{CRISPRedict}}_{i}}\) represent the probabilities from Lindel and CRISPRedict of mutation \(i\), respectively; and \({N}_{\text{mutations}}\) is the number of most probable mutations taken from Lindel.
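As a worked illustration of Eq. 3, the following sketch selects the most probable mutations, renormalizes their Lindel probabilities so that they sum to 1, and scales them by the predicted cutting probability. The mutation labels (including the label used for the collective class of insertions of ≥ 3 nt), the function name, and the numbers are hypothetical; they do not reflect Lindel's actual output format.

```python
from typing import Dict, List, Tuple


def top_mutation_probabilities(
    lindel_probs: Dict[str, float],
    cut_probability: float,
    n_mutations: int = 4,
    collective_label: str = "ins>=3",
) -> List[Tuple[str, float]]:
    """Eq. 3: renormalized Lindel probability times the cutting probability."""
    # Exclude the collective class for insertions of 3 nt or more.
    candidates = {m: p for m, p in lindel_probs.items() if m != collective_label}
    # Keep the n_mutations most probable mutations.
    top = sorted(candidates.items(), key=lambda item: item[1], reverse=True)
    top = top[:n_mutations]
    total = sum(p for _, p in top)
    return [(m, p / total * cut_probability) for m, p in top]


# Hypothetical Lindel output and CRISPRedict cutting probability.
lindel = {"del1": 0.30, "ins1_A": 0.20, "del2": 0.10,
          "ins>=3": 0.15, "del3": 0.20, "del5": 0.05}
mutations = top_mutation_probabilities(lindel, cut_probability=0.8)
```

With this definition the resulting probabilities sum to the cutting probability rather than to 1, which is what Eq. 3 implies.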

The predicted mutations and their probabilities serve as input for the three sub-models described in the following sections. Alternatively, users can manually input specific mutations of interest along with their probabilities, which can include both indels and substitutions.

Transcription sub-model

To predict the effect of a mutation on transcription (Fig. 1C:III), we employ Xpresso [24], a fast and accurate deep learning model (Additional file 1: Fig. S1) that predicts mRNA steady-state abundance based on the nucleotide context around a transcription start site (TSS). Although more accurate models exist, such as Enformer [33], which can consider mutations up to 100k base pairs away, they are too slow and computationally heavy to be incorporated into EXPosition: Enformer can take ~ 5 min to evaluate a mutation, compared with less than a second for Xpresso. More details about Xpresso can be found in supplementary Sect. 1. Xpresso’s input consists of a 10.5-kb context around the TSS, and its output is the log10 mRNA expression level of the respective gene (Fig. 2A).

Fig. 2

Overview of the gene expression sub-models. A The transcription model takes as its inputs the sequences surrounding the transcription start site (TSS), which include both the mutated and unmutated sequences. Specifically, it considers 7000 base pairs upstream and 3499 base pairs downstream of the TSS. The mutated and unmutated sequences are each fed into Xpresso, which then predicts the mRNA levels for the mutated and unmutated genes. The model’s output is the relative change in mRNA levels compared to the unmutated gene. B The splicing model receives the gene’s predicted mutations and uses them in SpliceAI to identify potential aberrant splicing events. Accounting for any mis-splicing events detected, variant mRNAs are generated. These variant mRNAs are subsequently assessed by our translation initiation model to identify a suitable start codon. The variant mRNA sequences are translated into amino acids and employed to calculate the gene’s conservation score. C The translation initiation model accepts both the mutated and unmutated transcripts as inputs. In an iterative procedure, the mutated transcript is searched for in-frame codons within a defined window. These codons’ initiation capabilities are predicted by TITER. If a suitable codon is identified, defined as one with better efficiency than the canonical start codon or ranking within 5% of its efficiency when compared to all human transcripts, it is selected and returned. If no suitable start codon is found in the initial window, the iterative process continues with an expanded window size. This process continues until a suitable start codon is discovered or until a maximum window size is reached, at which point the best available codon is returned. Simultaneously, the unmutated mRNA’s start codon undergoes analysis by TITER, and its efficiency ranking is used to calculate the relative change in initiation efficiency ranking between the canonical start codon and the best new start codon identified in the mutated sequence

For a given mutation \(i\) and gene \(j\), we examine whether the mutation could potentially impact the gene’s transcription levels by considering its transcripts’ TSSs and checking if the mutation falls within 7 kb upstream or 3.5 kb downstream of them (i.e., in the region that Xpresso considers when evaluating a TSS). For each potentially impacted transcript, we calculate the Xpresso score of the 10.5 kb sequence around its TSS before and after the mutation (denoted \({r}_{\text{WT}}\) and \({r}_{\text{mutated}}\), respectively). Each transcript’s final transcription score reflects the relative change in mRNA transcription levels following the mutation:

$$s_{\mathrm{Transcription}}=\frac{\left|r_{\mathrm{mutated}}-r_{\mathrm{WT}}\right|}{r_{\mathrm{WT}}}$$
(4)

\({s{\prime}}_{ij}\) in Eq. 2 is the maximal \({s}_{\text{Transcription}}\) caused by mutation \(i\) over gene \(j\)’s transcripts. If a specific transcript of interest was provided, the output score pertains solely to that transcript.
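The window check and Eq. 4 can be sketched as follows. The function names and the example values are hypothetical; the Xpresso predictions are assumed to be available as plain numbers, and \(r_{\text{WT}}\) is assumed to be non-zero. The downstream limit follows the 3499 bp given in the caption of Fig. 2.

```python
from typing import List, Tuple


def in_xpresso_window(mutation_pos: int, tss: int, strand: str,
                      upstream: int = 7000, downstream: int = 3499) -> bool:
    """True if the mutation lies within the region Xpresso sees around the TSS."""
    # Signed offset relative to the TSS, oriented along the transcript.
    offset = mutation_pos - tss if strand == "+" else tss - mutation_pos
    return -upstream <= offset <= downstream


def transcription_score(r_wt: float, r_mutated: float) -> float:
    """Eq. 4: relative change of the predicted expression level."""
    return abs(r_mutated - r_wt) / r_wt


def gene_transcription_score(predictions: List[Tuple[float, float]]) -> float:
    """Raw gene score: the maximum of Eq. 4 over the gene's transcripts."""
    return max(transcription_score(wt, mut) for wt, mut in predictions)


# Hypothetical (r_WT, r_mutated) pairs for two transcripts of one gene.
gene_score = gene_transcription_score([(2.0, 1.5), (1.0, 1.0)])
```

Only transcripts whose TSS passes `in_xpresso_window` would need to be scored, since a mutation outside that region cannot change the model's input.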

Finding mRNA isoforms following a mutation

Both the splicing model (Fig. 1C:IV) and the translation initiation model (Fig. 1C:V) analyze isoforms of transcripts following mutations and potential aberrant splicing events. We obtained splice site annotations from Ensembl [34]. To predict mis-splicing events, we utilize SpliceAI [25], a deep learning tool (Additional file 1: Fig. S2) that predicts the change in a position’s probability to function as a splicing donor/acceptor site following a mutation. More information can be found in supplementary Sect. 2. We use these annotated splice sites and the splicing changes predicted by SpliceAI to generate all possible mRNA isoforms following the mutation by concatenating donor and acceptor splice sites (Additional file 1: Fig. S3). Further details are available in supplementary Sect. 3.
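The concatenation step can be illustrated with the exon-skipping case of Fig. 1B. This is a deliberately simplified sketch: the coordinates, sequences, and function names are hypothetical, and the actual enumeration of isoforms from SpliceAI predictions (supplementary Sect. 3) is not reproduced here.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # 0-based start, end-exclusive, on the pre-mRNA


def build_mrna(pre_mrna: str, exons: List[Exon]) -> str:
    """Concatenate the exon sequences delimited by acceptor/donor sites."""
    return "".join(pre_mrna[start:end] for start, end in exons)


def skip_exon(exons: List[Exon], index: int) -> List[Exon]:
    """Return the exon list of an isoform in which one exon is not spliced in."""
    return exons[:index] + exons[index + 1:]


# Hypothetical pre-mRNA: three exons (upper case) separated by two introns.
pre_mrna = "AAAA" + "gg" + "CCCC" + "tt" + "GGGG"
exons = [(0, 4), (6, 10), (12, 16)]
wild_type = build_mrna(pre_mrna, exons)
exon2_skipped = build_mrna(pre_mrna, skip_exon(exons, 1))
```

The skipped isoform contains only exons 1 and 3, mirroring the mis-splicing event shown in Fig. 1B.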

Splicing sub-model

The splicing sub-model (Fig. 1C:IV) assesses the impact of a mutation on a gene’s viability by examining the isoforms generated for each of the gene’s transcripts. To gauge the effect of a mutation on a protein’s functionality, we use Oncosplice [30]. This model receives a mutation and a gene as input and predicts how much the mutation disrupts the gene’s protein function. This disruption is scored using evolutionary conservation information (Fig. 2B).

For each isoform, a sliding window is employed to identify the most conserved area that is affected by the mutation; the window’s length is set to the average domain length of all human proteins. The score of each transcript is the average score of its isoforms (Additional file 1: Fig. S4); and finally, the gene’s score is the maximal transcript score, i.e., the transcript whose function was most significantly disrupted. If a specific transcript of interest was provided, the output score pertains solely to that transcript. For further information, please refer to supplementary Sect. 4.
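The aggregation from isoforms to transcripts to genes described above can be sketched in a few lines. The per-isoform scores are assumed to have been computed already; the function names, transcript identifiers, and values are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Optional


def transcript_score(isoform_scores: List[float]) -> float:
    """A transcript's score is the average score of its isoforms."""
    return mean(isoform_scores)


def gene_score(
    isoform_scores_by_transcript: Dict[str, List[float]],
    transcript_of_interest: Optional[str] = None,
) -> float:
    """A gene's score is the maximal transcript score, unless one transcript is requested."""
    if transcript_of_interest is not None:
        return transcript_score(isoform_scores_by_transcript[transcript_of_interest])
    return max(transcript_score(s) for s in isoform_scores_by_transcript.values())


# Hypothetical isoform scores for two transcripts of one gene.
isoforms = {"T1": [0.2, 0.6], "T2": [0.9, 0.3]}
whole_gene = gene_score(isoforms)
only_t1 = gene_score(isoforms, transcript_of_interest="T1")
```

The same aggregation applies to the translation initiation sub-model, which also averages isoform scores per transcript and takes the maximum per gene.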

Translation initiation sub-model

The translation initiation sub-model (Fig. 1C:V, Fig. 2C) assesses the ability of mutant variants to initiate translation by searching for suitable start codons within the isoforms identified by the splicing model (see the section “Finding mRNA isoforms following a mutation”). The translation initiation score of the suitable codons is determined using TITER [31], a deep learning tool (Additional file 1: Fig. S5) that integrates a deep learning algorithm with known codon compositions of translation initiation sites (TISs) to predict TIS functionality. For additional information, please consult supplementary Sect. 5.

For each isoform, we locate the start of the coding sequence through local alignment with the original transcript’s coding sequence’s start. An iterative process is then initiated where TITER examines all in-frame NUG and ACG codons (where N can be any nucleotide) within a window surrounding the coding sequence’s start, with the window size increasing in each iteration. The process concludes when either the best new codon is discovered (with a TITER score sufficiently close to that of the canonical start codon) or when the maximum window size is reached. Further information can be found in supplementary Sect. 6. We then calculate the WT start codon’s TITER score rank, compared to the TITER scores of all human canonical start codons; this rank is denoted as \({i}_{\text{WT}}\). We repeat this calculation for the best new start codon found for the isoform, whose rank is denoted \({i}_{\text{mutated}}\). The isoform's score is defined as the relative change in the isoform start codon's TITER score rank:

$$s_{\mathrm{Initiation}}=\frac{\left|i_{\mathrm{mutated}}-i_{\mathrm{WT}}\right|}{i_{\mathrm{WT}}}$$
(5)

As in the splicing sub-model, we calculate the transcript’s initiation score as the average of the isoform scores, and the gene’s initiation score (i.e., \(s'_{ij}\) from Eq. 2) as the highest score among all its transcripts (Additional file 1: Fig. S4). Likewise, if a specific transcript of interest is provided, we report the score exclusively for that transcript.
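The start codon search and Eq. 5 can be sketched as follows. Several simplifications are assumed and should not be read as the tool's actual behavior: the scoring function is a stand-in for TITER, the acceptance rule compares scores directly (within a 5% tolerance of the canonical score) rather than ranks among all human start codons, and the window sizes are hypothetical. Sequences are written in the DNA alphabet, so NUG appears as NTG.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def is_candidate_codon(codon: str) -> bool:
    """NUG (N = any nucleotide) or ACG, in the DNA alphabet."""
    return len(codon) == 3 and (codon[1:] == "TG" or codon == "ACG")


def in_frame_candidates(mrna: str, cds_start: int, window: int) -> List[int]:
    """Positions of candidate codons in frame with cds_start, within +/- window nt."""
    first = cds_start - (window // 3) * 3
    return [
        pos
        for pos in range(first, cds_start + window + 1, 3)
        if 0 <= pos <= len(mrna) - 3 and is_candidate_codon(mrna[pos:pos + 3])
    ]


def find_start_codon(
    mrna: str,
    cds_start: int,
    score_fn: Callable[[str, int], float],  # stand-in for TITER
    canonical_score: float,
    window_sizes: Sequence[int] = (30, 60, 90),  # hypothetical sizes
    tolerance: float = 0.05,
) -> Optional[Tuple[int, float]]:
    """Expand the window until a sufficiently good codon is found, else return the best."""
    best: Optional[Tuple[int, float]] = None
    for window in window_sizes:
        for pos in in_frame_candidates(mrna, cds_start, window):
            score = score_fn(mrna, pos)
            if best is None or score > best[1]:
                best = (pos, score)
        if best is not None and best[1] >= canonical_score * (1 - tolerance):
            return best
    return best


def initiation_score(rank_wt: float, rank_mutated: float) -> float:
    """Eq. 5: relative change in the start codon's TITER score rank."""
    return abs(rank_mutated - rank_wt) / rank_wt


# Hypothetical transcript with the canonical ATG at position 6 and an in-frame ACG at 12.
mrna = "GCTGCCATGGCCACGTAA"
candidates = in_frame_candidates(mrna, cds_start=6, window=6)
toy_scores = {6: 0.9, 12: 0.5}
best = find_start_codon(mrna, 6, lambda seq, pos: toy_scores.get(pos, 0.0),
                        canonical_score=0.9, window_sizes=(3, 6))
```

Note that the CTG at position 1 is not reported, because it is out of frame with the coding sequence's start.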

Vienna Bioscore CRISPR (VBC)

VBC [26] (Fig. 1C:VI) is a tool that predicts gene knockout given an sgRNA and the position of the target site. It outputs its score using linear regression with the following features: (A) indel formation predictions from inDelphi [12]; (B) a “Bioscore” calculated using protein features like Pfam domains, DNA and amino acid conservation, amino acid identity, and gene structure; and (C) sgRNA activity prediction obtained using predictors like Azimuth and similar models. Together, these components form a comprehensive score that captures key processes in CRISPR–Cas9 mutagenesis and can be used to estimate gene knockout effectiveness following CRISPR. For more information about this tool and its performance, please review the original paper and the results in this paper.

GuidePro

GuidePro [27] (Fig. 1C:VII) is another tool that predicts gene knockout of sgRNAs targeting protein-coding exons. Knockout efficiency is governed by three key factors: (A) sgRNA activity score attained with DeepHF [35], Azimuth [36] and SSC [37]; (B) frameshift probability acquired with inDelphi [12], Lindel [29], and FORECasT [14]; and (C) amino acid sensitivity score which is evaluated using conservation, Pfam domain annotations, post-translational modifications (PTMs), and secondary structures. Each of these scores is created with an SVM, and these scores are then fed into another SVM to estimate gene knockout. For more information about this tool and its performance, please review the original paper and the results in this paper.

SVM sgRNA gene knockout classifier

EXPosition utilizes an SVM classifier (Fig. 1C:VIII) with an RBF kernel, using EXPosition’s gene expression estimations along with the predicted cutting likelihood and the scores from VBC and GuidePro as features. The model was trained using all sgRNAs from all the datasets in this paper. When the scores from VBC and/or GuidePro are not available for an sgRNA, EXPosition uses one of several similarly trained SVM classifiers that do not require GuidePro and/or VBC.
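One way to organize the fallback between classifiers is to choose the feature set, and hence the trained model, from the scores that are available. The sketch below shows only that selection step; the feature names, the model keys, and the function are hypothetical. Each key would map to a separately trained RBF-kernel SVM (for example, scikit-learn's `SVC(kernel="rbf")` is one possible implementation, although the source does not state which library was used).

```python
from typing import Dict, List, Optional, Tuple

BASE_FEATURES = ["transcription", "splicing", "initiation", "cutting_efficiency"]
OPTIONAL_FEATURES = ["vbc", "guidepro"]


def select_features(scores: Dict[str, Optional[float]]) -> Tuple[str, List[float]]:
    """Return the key of the classifier to use and its feature vector."""
    names = list(BASE_FEATURES)
    for name in OPTIONAL_FEATURES:
        if scores.get(name) is not None:
            names.append(name)
    # The key encodes which optional scores the chosen classifier expects.
    model_key = "+".join(names[len(BASE_FEATURES):]) or "expression_only"
    return model_key, [scores[name] for name in names]


# Hypothetical sgRNA for which no VBC score is available.
sgrna_scores = {"transcription": 0.1, "splicing": 0.8, "initiation": 0.0,
                "cutting_efficiency": 0.6, "vbc": None, "guidepro": 0.7}
model_key, features = select_features(sgrna_scores)
```

Keeping the feature order fixed per key matters, because each classifier must receive its features in the same order used during training.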

Mutations with high gene expression scores are overrepresented in ClinVar-designated pathogenic mutations

We wanted to further validate each of our gene expression sub-models, even though their main components were already validated elsewhere, and to demonstrate our ability to predict the functional effect of mutations. Thus, we used our tool’s gene expression component to analyze mutations from the ClinVar dataset [38], which contains mutations and their phenotypes accumulated from laboratories and researchers globally. We analyzed ~ 325k mutations tagged as benign (192k) or pathogenic (133k). Although using these data is not ideal, and not every mutation would lead to a functional effect, we expect a monotonic association between the score of the model and the number of pathogenic mutations.

For each of EXPosition’s gene expression sub-models, we examined the 3254 mutations in each percentile range (i.e., 99–100%, 98–99%, etc.) and calculated the fraction of pathogenic mutations in that set (Fig. 3; the score thresholds for each percentile are detailed in supplementary Sect. 7, Additional file 1: Table S1). We denote this fraction \({S}_{\text{ClinVar}}\). A higher fraction indicates better recognition of pathogenic mutations by the sub-model. All sub-models provided meaningful rankings on the ClinVar dataset from a certain threshold. The thresholds of the transcription/splicing/translation initiation sub-models correspond to the 92/67/97 percentiles. \({S}_{\text{ClinVar}}\) values for the higher percentiles exhibited significant enrichment of pathogenic mutations in almost all top 5 percentiles (\(p<{10}^{-25}\), \(p<{10}^{-324}\), and \(p<{10}^{-308}\) for the transcription, splicing, and translation initiation sub-models, respectively, using the hypergeometric test).
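The percentile analysis can be sketched on toy data as follows: mutations are sorted by score, split into equally sized bins, and the pathogenic fraction of each bin is tested for enrichment with a one-sided hypergeometric test. The function names and the data are hypothetical, the number of mutations is assumed to be divisible by the number of bins, and the FDR correction mentioned in Fig. 3 is omitted.

```python
from math import comb
from typing import List


def pathogenic_fraction_by_bin(scores: List[float], is_pathogenic: List[int],
                               n_bins: int) -> List[float]:
    """S_ClinVar per bin, ordered from the lowest-scoring to the highest-scoring bin."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    size = len(order) // n_bins  # assumes len(scores) is divisible by n_bins
    return [
        sum(is_pathogenic[i] for i in order[b * size:(b + 1) * size]) / size
        for b in range(n_bins)
    ]


def hypergeom_upper_tail(k: int, population: int, pathogenic: int, drawn: int) -> float:
    """P(X >= k) when drawing `drawn` mutations without replacement."""
    total = comb(population, drawn)
    upper = sum(
        comb(pathogenic, x) * comb(population - pathogenic, drawn - x)
        for x in range(k, min(pathogenic, drawn) + 1)
    )
    return upper / total


# Hypothetical data: 10 mutations, 5 of them pathogenic, 5 bins of 2 mutations.
toy_scores = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
toy_labels = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
fractions = pathogenic_fraction_by_bin(toy_scores, toy_labels, n_bins=5)
p_top_bin = hypergeom_upper_tail(k=2, population=10, pathogenic=5, drawn=2)
```

Exact integer arithmetic is used for the tail probability, which is adequate for a toy example; for hundreds of thousands of mutations a numerical library would be more practical.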

Fig. 3

ClinVar analysis. \({\text{S}}_{\text{ClinVar}}\) values of the transcription, splicing, and translation initiation sub-models (yellow, blue, orange lines, and dots, respectively) for different percentiles. Asterisks denote significant p-values (p < 0.05) and empty circles denote non-significant p-values (p ≥ 0.05). The highest p-value (i.e., closest to 0.05) calculated for each sub-model is listed in the legend. All p-values were corrected using FDR. Note that for each sub-model, the dots begin at the first percentile which corresponds to a non-zero score, meaning that 52/80/95% of mutations received a score of 0 in the transcription/splicing/translation initiation sub-models respectively

Following this analysis, each sub-model’s ClinVar threshold was set as the lowest percentile in which we observed a significant enrichment of pathogenic mutations (e.g., the 97th percentile for the translation initiation sub-model). The thresholds are used for informing the user that a mutation’s effect on a certain aspect of expression is potentially disease-causing. We then calculated the maximal score predicted by each sub-model for the whole set of ClinVar mutations and normalized the thresholds by these maximal values, facilitating easier interpretability. To assess the impact of an sgRNA on a specific expression aspect, we compute an average score across the corresponding sub-model for all predicted mutations, weighted by their probabilities. If the mean sub-model score surpasses its designated threshold, EXPosition informs the user that this aspect is affected. Alternatively, if a mutation was inputted manually, its score is checked against the sub-model’s threshold and the user is notified accordingly. The maximal scores from the ClinVar analysis for each sub-model used to normalize the scores (Eq. 2), as well as the raw and normalized final thresholds, can be found in supplementary Sect. 7, Additional file 1: Table S2.
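The notification logic can be sketched as a comparison between the expected sub-model score and the normalized threshold. The function names and values are hypothetical, and the probability-weighted average is assumed here to take the same form as the expected value in Eq. 1.

```python
from typing import List


def normalized_threshold(raw_threshold: float, max_clinvar_score: float) -> float:
    """Express the ClinVar threshold on the same 0-1 scale as the normalized scores."""
    return min(1.0, raw_threshold / max_clinvar_score)


def is_aspect_affected(probabilities: List[float], normalized_scores: List[float],
                       threshold: float) -> bool:
    """True if the probability-weighted score exceeds the sub-model's threshold."""
    expected = sum(p * s for p, s in zip(probabilities, normalized_scores))
    return expected > threshold


# Hypothetical threshold and predicted mutations for one sub-model.
threshold = normalized_threshold(raw_threshold=0.3, max_clinvar_score=1.5)
strong = is_aspect_affected([0.4, 0.3, 0.2, 0.1], [1.0, 0.0, 0.5, 0.0], threshold)
weak = is_aspect_affected([0.4, 0.3, 0.2, 0.1], [0.1, 0.1, 0.0, 0.0], threshold)
```

For a manually entered mutation, the same comparison would be made with that single mutation's score.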

Technical details regarding data for validation of EXPosition

To validate EXPosition, we used four functional screening datasets published by Doench et al. (2014) [39, 40], Doench et al. (2016) [36, 41], Shalem et al. [42, 43], and Xu et al. [37, 44]. The Doench (2016) and Shalem studies performed negative selection screening using sgRNAs that targeted multiple sites with the aim of identifying essential genes in A375 and HT29 cell lines. The data from Xu contain sgRNAs marked as efficient/non-efficient at performing gene knockout on KBM7 and HL60 cells, while the data from Doench (2014) consist only of sgRNAs that demonstrated gene knockouts on A375 cells.

sgRNAs that lacked VBC or GuidePro scores, could not be analyzed due to bugs in EXPosition, or were not found in the genome were omitted from the analysis. The Xu dataset contained ~ 2k sgRNAs (for both cell lines), of which 1.8k were used. The Doench (2014) dataset consisted of ~ 1.3k sgRNAs, of which 1.2k were used. The Doench (2016) dataset contained ~ 113k and ~ 77k sites for the A375 and HT29 cell lines, respectively, of which ~ 89k and ~ 61k sites were used, respectively. The Shalem dataset contained ~ 65k sites, of which ~ 53k were used.

In the Doench (2016) dataset and the Shalem dataset used, there were measurements from both lentiCRISPRv2 and lentiGuide, with multiple repeats. Thus, these results were averaged as described in supplementary Sect. 8.

Results

EXPosition is a tool that predicts gene knockout by considering gene expression estimations

We have developed EXPosition (EXPosing CRISPR's Impact on EXPression and Position), a computational tool designed to assess CRISPR sites and determine the likelihood of successfully silencing a target gene. It does this by integrating post-CRISPR gene expression estimations with gene knockout evaluations from models that consider various other factors such as frameshift likelihood, amino acid composition, and conservation. EXPosition’s gene expression sub-models estimate the sgRNA’s impact on three key stages of gene expression: transcription, splicing, and translation initiation. The tool begins by predicting the most probable mutations and their associated probabilities resulting from CRISPR utilization. Subsequently, EXPosition assesses the impact of each mutation on these three aspects of gene expression and outputs a score for each gene affected by the mutations. Finally, EXPosition utilizes previous tools to predict scores pertaining to gene knockout, and combines their scores along with the predicted cutting efficiency and gene expression estimates in an SVM classifier to predict potential gene knockout by the input sgRNA.

EXPosition (depicted in Fig. 1C) employs deep-learning algorithms for its gene expression component, all of which have undergone independent validation to execute their tasks. The transcription sub-model (Fig. 1C:III) predicts the relative change in mRNA levels following the mutation. As for the splicing and translation initiation sub-models (Fig. 1C:IV-V), we consider all possible isoforms generated by the mutation through alternative splicing. These isoforms can lead to a distinct protein from the original one due to the following factors:

1. Aberrant splicing events: These can arise by either creating new donor/acceptor sites or deleting existing ones, resulting in an altered splicing pattern of the mRNA.

2. Alterations in initiation site usage: Mutations in the start codon or its context can modify the initiation capability. Factors such as the nucleotide context and folding energy of the start codon play pivotal roles in determining initiation efficiency (as reviewed in [45]). Any change in these aspects could impact the initiation capability of the original start codon. Additionally, other potential start codons, such as ATGs in the same reading frame, could serve as alternative initiation sites, preserving the transcript's initiation capability [23] and potentially retaining the protein’s function. Consequently, an assessment of the initiation capability of all potential start codons, including the original ATG if unaltered, is necessary to predict whether translation initiation can occur.

3. Elimination of the gene's stop codon: This type of mutation can lead to the addition of potentially unnecessary amino acids to the translated protein.

Therefore, our tool’s gene expression component comprehensively evaluates the potential effects of mutations on these three aspects of gene expression, utilizing deep-learning algorithms that have undergone rigorous validation.
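The search for alternative initiation sites described in the second item above can be illustrated with a short Python sketch that lists every ATG in the same reading frame as the canonical start codon. The function name `in_frame_start_codons` and the toy sequences are hypothetical. The sketch only enumerates candidates; it does not score their initiation capability, which in EXPosition is the job of the translation initiation sub-model.

```python
from typing import List


def in_frame_start_codons(mrna: str, canonical_start: int) -> List[int]:
    """Return 0-based positions of ATG codons in the canonical reading frame.

    The frame is defined by `canonical_start`, the position of the original
    start codon. RNA input (with U) is accepted and converted to DNA letters.
    """
    seq = mrna.upper().replace("U", "T")
    frame = canonical_start % 3
    return [i for i in range(frame, len(seq) - 2, 3) if seq[i:i + 3] == "ATG"]


# Toy transcript (hypothetical): canonical ATG at position 2, second in-frame ATG at 8.
wild_type = "GGATGAAAATGCCCTAA"
mutant = "GGACGAAAATGCCCTAA"  # T>C substitution destroys the canonical start codon

wt_candidates = in_frame_start_codons(wild_type, canonical_start=2)
mut_candidates = in_frame_start_codons(mutant, canonical_start=2)
```

In this toy case the wild type yields candidates at positions 2 and 8, while the mutant retains only the downstream ATG at position 8, which is the kind of alternative site whose initiation capability would then need to be assessed.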

The mRNA’s isoforms are constructed by predicting the mutation’s effect on alternative splicing and concatenating relevant exons, i.e., exons with viable splicing donor/acceptor pairs. These isoforms are then passed to the splicing and translation initiation sub-models, which assess the viability of these isoforms based on amino acid conservation and the translation initiation capability of the isoforms, respectively. Details regarding each gene expression sub-model appear in the “Material and methods” section.
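As a minimal sketch of the isoform construction described above, the following Python snippet concatenates exon intervals into an mRNA isoform. The exon coordinates are taken as given (here they are hypothetical); in EXPosition they would follow from the predicted donor/acceptor sites, and the function name `assemble_isoform` is an assumption made for illustration.

```python
from typing import List, Tuple


def assemble_isoform(pre_mrna: str, exons: List[Tuple[int, int]]) -> str:
    """Concatenate exon slices (0-based, end-exclusive) in transcript order."""
    ordered = sorted(exons)
    # Reject overlapping exons, which would duplicate sequence in the isoform.
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start < prev_end:
            raise ValueError("exons overlap")
    return "".join(pre_mrna[start:end] for start, end in ordered)


# Toy pre-mRNA (hypothetical): exon "AAA", intron "GTCCCAG", exon "TTT".
pre_mrna = "AAAGTCCCAGTTT"

normal_isoform = assemble_isoform(pre_mrna, [(0, 3), (10, 13)])
# If the splice sites are lost, the intron is retained as part of one long exon.
retained_intron_isoform = assemble_isoform(pre_mrna, [(0, 13)])
```

Each assembled isoform can then be passed on for downstream evaluation, in the same way the text describes passing isoforms to the splicing and translation initiation sub-models.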

Our tool also incorporates the sgRNA’s VBC (Fig. 1C:VI) and GuidePro (Fig. 1C:VII) scores and uses them along with the predicted cutting efficiency (from CRISPRedict—Fig. 1C:II) and gene expression estimates in an SVM classifier (Fig. 1C:VIII) to determine if the input sgRNA will silence the gene. Details regarding VBC, GuidePro, CRISPRedict, and the final SVM classifier also appear in the “Material and methods” section.

The tool is written in Python 3.9 and can be accessed at http://www.cs.tau.ac.il/~tamirtul/EXPosition. The tool’s GUI is shown in Fig. 4.

Fig. 4

The GUI of EXPosition. The user needs to check the boxes of the sub-models to be run (Transcription/Splicing/Initiation, upper left corner) and provide the target site information (chromosome, strand, position, and 20nt sequence). If the user predicts the resulting mutations via our tool, “Predict Muts” should be selected; otherwise, the user can insert the mutations manually and select “Manual Muts”. The initiation sub-model parameters can be adjusted manually by the user. The results are saved in a csv file and shown in the “Results” textbox

EXPosition’s gene expression component provides additional information that improves prediction of gene knockout with CRISPR

To validate our tool’s gene expression component, i.e., the gene expression estimates post-CRISPR, we searched for datasets containing measurements of gene expression following CRISPR editing. Since no such explicit data was found, we decided to use four functional screening datasets published by Doench et al. [39], Doench et al. [36], Xu et al. [37], and Shalem et al. [42]. Our reasoning was that we could validate our tool’s gene expression component by comparing its predictions with quantifications of gene knockout (more information about the data can be found in the “Methods” section).

In each of the studies, the cells were infected with lentiviruses, causing them to express sgRNAs and Cas9, and the fold change in sgRNA levels following CRISPR’s action was measured (Fig. 5A). The underlying assumption was that sgRNAs targeting essential genes would negatively impact the cell's fitness, leading to a reduction in the production of these sgRNAs. We believe that this assumption generally holds true to some extent for any gene [46, 47]. Furthermore, we believe that this relationship is, to some extent, attributable to alterations in gene expression. Consequently, we anticipated observing a stronger correlation between the impact on fitness and the predicted effect on gene expression than between the impact on fitness and the efficiency of DNA cutting alone.

Fig. 5

Illustration of all the experiments used to generate the functional screening libraries and analysis. A The cells in the top row are infected with a lentivirus and produce Cas9 proteins and sgRNAs; the left/right cell produces sgRNA1/2, which affects site 1/2 in a given gene, respectively. After 3 weeks (bottom left), sgRNA1 induces mutations in the gene, which affect its function and cause the cell’s fitness to decrease; this results in lower amounts of sgRNA1 being produced. Meanwhile, sgRNA2 induces mutations in the gene as well, but its function—and the cell’s fitness—is minimally affected; thus, sgRNA2 levels stay nearly the same. Each gene in each dataset contains multiple different target sites. B We conducted 100 iterations of Monte–Carlo cross-validation, training two SVM classifiers/XGBoost regressors with different features, each time taking a random 80% of the data and testing the models on the remaining 20%. See full details in the main text

We performed regression analysis on each dataset, comparing the performance of existing models with their performance when EXPosition’s gene expression results were added as features. In addition, we conducted classification analysis for each dataset by labeling sgRNAs with depletion rates of at least one half as silencing and all others as non-silencing (as done by the authors of VBC) (Fig. 5B). The results pertaining to each dataset were obtained using regressors/classifiers that were trained and tested (on withheld data) using data solely from that dataset. Additional details regarding the training of these models can be found in supplementary Sect. 9.
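The labeling rule and the Monte-Carlo cross-validation scheme of Fig. 5B can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions: the 0.5 threshold is our reading of "at least half", the function names are hypothetical, and model training (SVM classifiers and XGBoost regressors in the study) is deliberately left out.

```python
import random
from typing import List, Sequence, Tuple


def label_silencing(depletion_rates: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Label an sgRNA 1 (silencing) if its depletion rate reaches the threshold, else 0."""
    return [1 if rate >= threshold else 0 for rate in depletion_rates]


def monte_carlo_splits(
    n_samples: int,
    n_iterations: int = 100,
    train_fraction: float = 0.8,
    seed: int = 0,
) -> List[Tuple[List[int], List[int]]]:
    """Draw independent random train/test index splits (Monte-Carlo cross-validation)."""
    rng = random.Random(seed)
    n_train = int(round(train_fraction * n_samples))
    splits = []
    for _ in range(n_iterations):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh random permutation for every iteration
        splits.append((indices[:n_train], indices[n_train:]))
    return splits


# Hypothetical depletion rates for ten sgRNAs.
rates = [0.9, 0.1, 0.5, 0.45, 0.7, 0.2, 0.8, 0.05, 0.6, 0.3]
labels = label_silencing(rates)
splits = monte_carlo_splits(len(rates))
```

Unlike k-fold cross-validation, each iteration draws its 80/20 split independently, so a given sgRNA may appear in the test set of several iterations or of none.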

When added to VBC or GuidePro, EXPosition’s gene expression component improved prediction of gene knockout on the test sets in all cases, which span 6 functional screening experiments, 4 cell types, and approximately 207k sites (Table 1). This demonstrates that additional information is embedded in EXPosition’s gene expression features, thus validating the gene expression estimations (which were already validated, both independently and in our analysis of the ClinVar dataset).

Table 1 EXPosition’s gene expression features improve Spearman correlations with sgRNA depletion rates. “VBC vs. VBC + EXP. gene expression” and “GuidePro vs. GuidePro + EXP. gene expression” are the comparisons between the regressor trained with VBC/GuidePro and the regressor trained with VBC/GuidePro and EXPosition’s gene expression models. Each cell contains the median Spearman correlation obtained from 100 cross-validations of training an XGBoost regressor on a randomly chosen 80% of the data and testing on the remaining 20%. Cells with an asterisk were cases where the added gene expression features from EXPosition improved performance with statistical significance (p<0.05, Wilcoxon rank-sum test)

To verify that this improvement was not a result of random chance, we repeated this analysis when shuffling the training data labels; with this change, no improvement was observed for any of the comparisons, indicating the improvement is indeed not due to random chance. Similar results were obtained using classification (Additional file 1: Tables S3-S5, see supplementary Sect. 10).
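For readers who wish to reproduce the evaluation metric and the shuffling control, the following self-contained Python sketch computes a Spearman correlation (with average ranks for ties) and produces a shuffled copy of the training labels. It covers only these two steps; the XGBoost regressors and the Wilcoxon rank-sum test used in the study are not reproduced here, and all function names are hypothetical.

```python
import random
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Assign 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def shuffled_copy(labels: Sequence[float], seed: int = 0) -> List[float]:
    """Return a randomly permuted copy of the labels, leaving the input intact."""
    rng = random.Random(seed)
    permuted = list(labels)
    rng.shuffle(permuted)
    return permuted
```

Training on `shuffled_copy(train_labels)` while scoring against the true test labels breaks any real feature-label relationship, so an improvement that survived this control would point to an artifact rather than to informative features.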

We note that the main goal of EXPosition’s gene expression component is to predict CRISPR’s effect on various aspects of gene expression, which may be of interest even when the aim is not to silence a gene (e.g., to evaluate CRISPR’s effect on an off-target). On the other hand, VBC and GuidePro are designed to predict gene knockout following CRISPR, focusing on the amino acid content of the gene and the protein’s functionality; thus, EXPosition’s gene expression predictions by themselves do not exactly compete with them in terms of performance (Additional file 1: Tables S9-S10, see supplementary Sect. 10).

It is important to mention that VBC and GuidePro cannot analyze a general sgRNA, but rather only specific guides that were already analyzed and that target coding sequences; whereas EXPosition’s gene expression component deals with any guide sequence, which can affect any region of the gene, including UTRs and introns. In addition, the datasets we analyze are biased towards VBC and GuidePro (compared to our gene expression models), since similarly to these tools, the datasets are designed to target the gene’s coding sequence and change its amino acid sequence, rather than directly affecting aspects of gene expression such as transcription, splicing, and translation initiation, which is what EXPosition’s gene expression component predicts. Despite all this, we show that EXPosition’s gene expression component provides valuable information that is not contained in the other tools.

EXPosition outperforms previous tools in predicting gene knockout following CRISPR usage

We next compared EXPosition to existing tools that predict gene knockout following CRISPR usage, such as VBC [26] and GuidePro [27]. To this end, we performed regression and classification analyses similar to those in the previous section, only this time we used all of EXPosition’s outputs (gene expression estimates, predicted cutting likelihood, VBC score, GuidePro score) as features to train and test the regressors and classifiers, and compared their performance against regressors/classifiers trained solely with VBC or GuidePro. Additional details regarding the training of these models can be found in supplementary Sect. 9.

The results can be seen in Table 2. EXPosition outperforms VBC in 5 of 6 cases and outperforms GuidePro in all 6 cases. The classification analysis yielded similar results (Additional file 1: Tables S6-S8, see supplementary Sect. 10).

Table 2 EXPosition score produces better Spearman correlations with sgRNA depletion rates than previous tools’ scores. “VBC vs. EXP” and “GuidePro vs. EXP” are the comparisons between the regressor trained with VBC/GuidePro and the regressor trained with predictions from VBC, GuidePro, CRISPRedict and EXPosition’s gene expression models. Each cell contains the median Spearman correlation obtained from 100 cross-validations of training an XGBoost regressor on a randomly chosen 80% of the data and testing on the remaining 20%. Cells with an asterisk were cases where EXPosition outperformed the compared tool with statistical significance (p<0.05, Wilcoxon rank-sum test)

Examples of silencing sites found with EXPosition

Finally, we provide a few examples where EXPosition improved on previous models to predict sgRNAs that will cause gene knockout (Table 3, Fig. 6). We used classifiers trained and tested on the same data from Doench et al. [36], as defined in the section “EXPosition outperforms previous tools in predicting gene knockout following CRISPR usage” and Fig. 5B. In each example, using classifiers whose only features are the cutting efficiencies and the scores from VBC or GuidePro would misclassify the sgRNA, while a classifier incorporating EXPosition’s gene expression predictions classifies it correctly.

Table 3 Examples of sites where EXPosition improved on existing models. Each column contains the score of its respective gene expression model. In each of the cases EXPosition correctly classified the sites as silencing, while the classifiers trained only using CRISPRedict, VBC and GuidePro or just CRISPRedict misclassified them as non-silencing

Fig. 6

Predicted changes in gene expression following CRISPR action. Illustrations of the changes in gene expression, as predicted by EXPosition’s gene expression component, for examples 2–4 in Table 3. A One of the predicted deletions in one of the transcripts of the gene KCTD6 was predicted to cause a 20% relative change in mRNA levels compared to the WT. B A mis-splicing event is predicted to occur following a predicted deletion in the ENST00000294954.12 transcript of the LHCGR gene. A discovered donor site causes an intron to be included in the mRNA. C A predicted deletion causes a frameshift in one of the transcripts of the ZNF501 gene, resulting in a loss of translation initiation capability. The best alternative start codon found had a 61% relative change in initiation ranking, relative to the ranking of the canonical start codon

Discussion

In recent years, CRISPR has been used to edit genes and specifically to silence them. The prevalent paradigm is to use computational tools which estimate the likelihood of a mutation following the use of CRISPR and choose a site for gene silencing by taking the site with the highest mutation likelihood. However, when using CRISPR, we are usually interested in affecting the expression of the target gene without affecting any other gene’s expression. Thus, the current common approaches for designing sgRNAs do not optimize the right objective.

Recently, a few tools designed to predict gene knockout post-CRISPR usage, using features other than predicted cutting efficiency, were developed. These tools are helpful but they have limitations: they do not consider alterations in gene expression when forming their predictions; they cannot grade every given sgRNA, but rather are limited to a subset of already pre-processed sgRNAs; and they are focused on sgRNAs targeting coding regions.

Therefore, we created EXPosition, a tool that circumvents these limitations by combining gene knockout predictions (VBC and GuidePro) and predicted cutting efficiency (CRISPRedict) with gene expression estimates post-CRISPR to predict gene knockout. Our tool can evaluate any sgRNA (not just pre-processed ones), including sgRNAs not in coding regions, and it considers the effect of CRISPR usage on gene expression.

EXPosition’s gene expression component predicts the most likely mutations following CRISPR use and their effect on transcription, splicing, and translation initiation. In addition, EXPosition can analyze manually inserted mutations, regardless of their origin, and assess their effects on gene expression. This versatility allows users to assess the effects of mutations that may not have been generated via CRISPR or other specific methods, expanding the tool’s applicability to a wider range of scenarios. Since our tool’s gene expression component is composed of various algorithms which predict various aspects of gene expression that were validated and compared to measurements of gene expression, we expect predictions used in our tool to be relevant and correspond with actual expression measurements.

We validated our tool’s gene knockout predictions using experimental data from 6 functional screening experiments, on 4 cell types, encompassing 207k sites. EXPosition’s predictions produced better Spearman correlations with sgRNA depletion rates than GuidePro in all 6 cases and than VBC in 5 of 6 cases. In addition, EXPosition’s gene expression component was validated by showing that performance decreases significantly when VBC or GuidePro is used without the gene expression features. Similar results were obtained using classification analysis.

We also gave additional information about our gene expression outputs by providing the fraction of pathogenic mutations from the ClinVar dataset out of all pathogenic and benign mutations that received certain values of EXPosition’s gene expression score. Analysis of the ClinVar dataset also validated our gene expression models by showing enrichment of pathogenic mutations in subsets of mutations that have high-scoring gene expression estimates. We hope that the user-friendly GUI will encourage people to use our tool in their scientific endeavors. We believe that since EXPosition is modular, it will be possible to update each part of the tool with newer and better models, including models that are specific for different cell types and/or Cas proteins. We believe that once robust models of gene expression steps, likely mutations post-CRISPR cleavage and likelihood for cleavage post-CRISPR are available for non-human organisms, we will be able to extend EXPosition’s gene expression component to apply for these organisms as well.

The study reported here clearly demonstrates two important gaps in the field of CRISPR research: (1) we should carefully design better objective functions to correctly evaluate sgRNAs and (2) we should conduct more experiments that include the target gene in its endogenous genomic context while measuring the effect on gene expression in addition to cutting efficiency. Studies including this type of data will facilitate better understanding of a given sgRNA’s phenotypical effect on its target site, rather than only its genotypical effect; they can also be used to further improve our tool.

Conclusions

EXPosition is a user-friendly tool for the classification of sgRNAs into silencing/non-silencing that considers the effects of predicted gene expression alongside known factors such as conservation, amino acid composition, and frameshift likelihood. Validated on several datasets of different human cell types, it offers the scientific community a better tool to assess the functionality of sgRNAs than before and for the first time reveals the likely gene expression outcomes following CRISPR usage, which complement prediction of cleavage likelihood by predicting the actual objective of CRISPR usage: changing gene expression. With research on CRISPR ever growing, we hope that datasets with gene expression measurements post-CRISPR cleavage will be published to improve our understanding of the interplay between CRISPR and gene expression.

Data availability

For the version of EXPosition available at the time of this publication, please refer to the EXPosition citation [32] or use the following link: https://doi.org/10.5281/zenodo.14228618.

The latest developments to EXPosition can be found here: https://github.com/shaicoh3n/EXPosition.

Human genome annotations were downloaded from Ensembl [34].

The data from Doench et al. [36] that we analyzed can be found in Table 11 at [41].

The Shalem dataset can be found in Table 10 at [43].

The data from Doench et al. [39] that we analyzed can be found in Supplementary Table 10 at [40].

The data from Xu et al. that we analyzed can be found in Supplementary Table_1 at [44].

The ClinVar dataset can be found in the ClinVar FTP server (https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz) [38].

References

  1. Pickar-Oliver A, Gersbach CA. The next generation of CRISPR–Cas technologies and applications. Nat Rev Mol Cell Biol. 2019;20(8):490–507. https://doi.org/10.1038/s41580-019-0131-5.

  2. Uehara H, Zhang X, Pereira F, Narendran S, Choi S, Bhuvanagiri S, et al. Start codon disruption with CRISPR/Cas9 prevents murine Fuchs’ endothelial corneal dystrophy. Zoghbi HY, Cepko CL, Ksander B, editors. Elife. 2021;10:e55637.

  3. Si X, Zhang H, Wang Y, Chen K, Gao C. Manipulating gene translation in plants by CRISPR–Cas9-mediated genome editing of upstream open reading frames. Nat Protoc. 2020;15(2):338–63.

  4. Whitworth KM, Benne JA, Spate LD, Murphy SL, Samuel MS, Murphy CN, et al. Zygote injection of CRISPR/Cas9 RNA successfully modifies the target gene without delaying blastocyst development or altering the sex ratio in pigs. Transgenic Res. 2017;26(1):97–107.

  5. Chari R, Yeo NC, Chavez A, Church GM. sgRNA Scorer 2.0: a species-independent model to predict CRISPR/Cas9 activity. ACS Synth Biol. 2017;6(5):902–4.

  6. Listgarten J, Weinstein M, Kleinstiver BP, Sousa AA, Joung JK, Crawford J, et al. Prediction of off-target activities for the end-to-end design of CRISPR guide RNAs. Nat Biomed Eng. 2018;2(1):38–47.

  7. Lei Y, Lu L, Liu HY, Li S, Xing F, Chen LL. CRISPR-P: a web tool for synthetic single-guide RNA design of CRISPR-system in plants. Mol Plant. 2014;7(9):1494–6.

  8. Liu H, Wei Z, Dominguez A, Li Y, Wang X, Qi LS. CRISPR-ERA: a comprehensive design tool for CRISPR-mediated gene editing, repression and activation. Bioinformatics. 2015;31(22):3676–8.

  9. Concordet JP, Haeussler M. CRISPOR: intuitive guide selection for CRISPR/Cas9 genome editing experiments and screens. Nucleic Acids Res. 2018;46(W1):W242–5.

  10. Peng D, Tarleton R. EuPaGDT: a web tool tailored to design CRISPR guide RNAs for eukaryotic pathogens. Microb Genom. 2015;1(4):e000033.

  11. Montague TG, Cruz JM, Gagnon JA, Church GM, Valen E. CHOPCHOP: a CRISPR/Cas9 and TALEN web tool for genome editing. Nucleic Acids Res. 2014;42(W1):W401–7.

  12. Shen MW, Arbab M, Hsu JY, Worstell D, Culbertson SJ, Krabbe O, et al. Predictable and precise template-free CRISPR editing of pathogenic variants. Nature. 2018;563(7733):646–51.

  13. Stemmer M, Thumberger T, del Sol KM, Wittbrodt J, Mateo JL. CCTop: an intuitive, flexible and reliable CRISPR/Cas9 target prediction tool. PLoS ONE. 2015;10(4):e0124633.

  14. Allen F, Crepaldi L, Alsinet C, Strong AJ, Kleshchevnikov V, De Angeli P, et al. Predicting the mutations generated by repair of Cas9-induced double-strand breaks. Nat Biotechnol. 2019;37(1):64–72.

  15. Labuhn M, Adams FF, Ng M, Knoess S, Schambach A, Charpentier EM, et al. Refined sgRNA efficacy prediction improves large-and small-scale CRISPR–Cas9 applications. Nucleic Acids Res. 2018;46(3):1375–85.

  16. Moreno-Mateos MA, Vejnar CE, Beaudoin JD, Fernandez JP, Mis EK, Khokha MK, et al. CRISPRscan: designing highly efficient sgRNAs for CRISPR-Cas9 targeting in vivo. Nat Methods. 2015;12(10):982–8.

  17. Pulido-Quetglas C, Aparicio-Prat E, Arnan C, Polidori T, Hermoso T, Palumbo E, et al. Scalable design of paired CRISPR guide RNAs for genomic deletion. PLoS Comput Biol. 2017;13(3): e1005341.

  18. Heigwer F, Kerr G, Boutros M. E-CRISP: fast CRISPR target site identification. Nat Methods. 2014;11(2):122–3.

  19. Li VR, Zhang Z, Troyanskaya OG. CROTON: an automated and variant-aware deep learning framework for predicting CRISPR, Cas9 editing outcomes. Bioinformatics. 2021;37(Supplement_1):i342–8. https://doi.org/10.1093/bioinformatics/btab268.

  20. Molla KA, Yang Y. Predicting CRISPR/Cas9-induced mutations for precise genome editing. Trends Biotechnol. 2020;38(2):136–41. Available from https://www.sciencedirect.com/science/article/pii/S0167779919302069.

  21. Chuai G, Ma H, Yan J, Chen M, Hong N, Xue D, et al. DeepCRISPR: optimized CRISPR guide RNA design by deep learning. Genome Biol. 2018;19(1):80. https://doi.org/10.1186/s13059-018-1459-4.

  22. Leenay RT, Aghazadeh A, Hiatt J, Tse D, Roth TL, Apathy R, et al. Large dataset enables prediction of repair after CRISPR–Cas9 editing in primary T cells. Nat Biotechnol. 2019;37(9):1034–7. https://doi.org/10.1038/s41587-019-0203-2.

  23. Ben-Yehezkel T, Zur H, Marx T, Shapiro E, Tuller T. Mapping the translation initiation landscape of an S. cerevisiae gene using fluorescent proteins. Genomics. 2013;102(4):419–29.

  24. Agarwal V, Shendure J. Predicting mRNA abundance directly from genomic sequence using deep convolutional neural networks. Cell Rep. 2020;31(7):107663.

  25. Jaganathan K, Kyriazopoulou Panagiotopoulou S, McRae JF, Darbandi SF, Knowles D, Li YI, et al. Predicting splicing from primary sequence with deep learning. Cell. 2019;176(3):535–548.e24.

  26. Michlits G, Jude J, Hinterndorfer M, de Almeida M, Vainorius G, Hubmann M, et al. Multilayered VBC score predicts sgRNAs that efficiently generate loss-of-function alleles. Nat Methods. 2020;17(7):708–16.

  27. He W, Wang H, Wei Y, Jiang Z, Tang Y, Chen Y, et al. GuidePro: a multi-source ensemble predictor for prioritizing sgRNAs in CRISPR/Cas9 protein knockouts. Bioinformatics. 2021;37(1):134–6.

  28. Konstantakos V, Nentidis A, Krithara A, Paliouras G. CRISPRedict: a CRISPR-Cas9 web tool for interpretable efficiency predictions. Nucleic Acids Res. 2022;50(W1):W191–8. https://doi.org/10.1093/nar/gkac466.

  29. Chen W, McKenna A, Schreiber J, Haeussler M, Yin Y, Agarwal V, et al. Massively parallel profiling and predictive modeling of the outcomes of CRISPR/Cas9-mediated double-strand break repair. Nucleic Acids Res. 2019;47(15):7989–8003. https://doi.org/10.1093/nar/gkz487.

  30. Lynn N, Tuller T. Detecting and understanding meaningful cancerous mutations based on computational models of mRNA splicing. NPJ Syst Biol Appl. 2024;10(1):25.

  31. Zhang S, Hu H, Jiang T, Zhang L, Zeng J. TITER: predicting translation initiation sites by deep learning. Bioinformatics. 2017;33(14):i234–42. https://doi.org/10.1093/bioinformatics/btx247.

  32. Cohen* S, Bergman* S, Lynn N, Tuller T. EXPosition. Zenodo. 2024. https://doi.org/10.5281/zenodo.14228618. Cited 2024 Nov 25.

  33. Avsec Ž, Agarwal V, Visentin D, Ledsam JR, Grabska-Barwinska A, Taylor KR, et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat Methods. 2021;18(10):1196–203. https://doi.org/10.1038/s41592-021-01252-x.

  34. Cunningham F, Allen JE, Allen J, Alvarez-Jarreta J, Amode MR, Armean IM, et al. Ensembl 2022. Nucleic Acids Res. 2022;50(D1):D988–95. https://doi.org/10.1093/nar/gkab1049.

  35. Wang D, Zhang C, Wang B, Li B, Wang Q, Liu D, et al. Optimized CRISPR guide RNA design for two high-fidelity Cas9 variants by deep learning. Nat Commun. 2019;10(1):4284.

  36. Doench JG, Fusi N, Sullender M, Hegde M, Vaimberg EW, Donovan KF, et al. Optimized sgRNA design to maximize activity and minimize off-target effects of CRISPR-Cas9. Nat Biotechnol. 2016;34(2):184–91. https://doi.org/10.1038/nbt.3437.

  37. Xu H, Xiao T, Chen CH, Li W, Meyer CA, Wu Q, et al. Sequence determinants of improved CRISPR sgRNA design. Genome Res. 2015;25(8):1147–57.

  38. Landrum MJ, Lee JM, Benson M, Brown GR, Chao C, Chitipiralla S, et al. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 2018;46(D1):D1062–7. https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz.

  39. Doench JG, Hartenian E, Graham DB, Tothova Z, Hegde M, Smith I, et al. Rational design of highly active sgRNAs for CRISPR-Cas9–mediated gene inactivation. Nat Biotechnol. 2014;32(12):1262–7. https://doi.org/10.1038/nbt.3026.

  40. Doench JG, Hartenian E, Graham DB, Tothova Z, Hegde M, Smith I, et al. Rational design of highly active sgRNAs for CRISPR-Cas9–mediated gene inactivation. Supplementary Table 10. Nat Biotechnol. 2014. https://staticcontent.springer.com/esm/art%3A10.1038%2Fnbt.3026/MediaObjects/41587_2014_BFnbt3026_MOESM10_ESM.xlsx.

  41. Doench JG, Fusi N, Sullender M, Hegde M, Vaimberg EW, Donovan KF, et al. Optimized sgRNA design to maximize activity and minimize off-target effects of CRISPR-Cas9. Table S11. Nat Biotechnol. 2016;34(2):184–91. https://staticcontent.springer.com/esm/art%3A10.1038%2Fnbt.3437/MediaObjects/41587_2016_BFnbt3437_MOESM8_ESM.zip.

  42. Shalem O, Sanjana NE, Hartenian E, Shi X, Scott DA, Mikkelsen TS, et al. Genome-scale CRISPR-Cas9 knockout screening in human cells. Science. 2014;343(6166):84–7. https://doi.org/10.1126/science.1247005.

  43. Doench JG, Fusi N, Sullender M, Hegde M, Vaimberg EW, Donovan KF, et al. Optimized sgRNA design to maximize activity and minimize off-target effects of CRISPR-Cas9. Table S10. Nat Biotechnol. 2016;34(2):184–91. https://staticcontent.springer.com/esm/art%3A10.1038%2Fnbt.3437/MediaObjects/41587_2016_BFnbt3437_MOESM8_ESM.zip.

  44. Xu H, Xiao T, Chen CH, Li W, Meyer CA, Wu Q, et al. Sequence determinants of improved CRISPR sgRNA design. Supplementary Table_1. Genome Res. 2015:1147–57. https://genome.cshlp.org/content/suppl/2015/06/12/gr.191452.115.DC1/Supplemental_Table_1.xlsx.

  45. Tuller T, Zur H. Multiple roles of the coding sequence 5′ end in gene expression regulation. Nucleic Acids Res. 2015;43(1):13–28. https://doi.org/10.1093/nar/gku1313.

  46. Lang GI, Murray AW, Botstein D. The cost of gene expression underlies a fitness trade-off in yeast. Proc Natl Acad Sci. 2009;106(14):5755–60. https://doi.org/10.1073/pnas.0901620106.

  47. Keren L, Hausser J, Lotan-Pompan M, Vainberg Slutskin I, Alisar H, Kaminski S, et al. Massively parallel interrogation of the effects of gene expression levels on fitness. Cell. 2016;166(5):1282–1294.e18. Available from: https://www.sciencedirect.com/science/article/pii/S009286741630931X.

Acknowledgements

The authors would like to thank Nadav Kra-Oz for contributing to the GUI. This study was supported in part by a fellowship from the Edmond J. Safra Center for Bioinformatics at Tel-Aviv University.

Funding

SC, SB, and NL are supported by a fellowship from the Edmond J. Safra Center for Bioinformatics at Tel Aviv University. The study was also supported by the CRISPR-IL consortium grant from the Israeli Innovation Authority.

Author information

Contributions

SC, SB, and TT conceived the project. All authors analyzed the data. SC and SB wrote the software. TT supervised the project. All authors interpreted the results. All authors wrote and revised the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Tamir Tuller.

Ethics declarations

Ethics approval and consent to participate

This study only utilizes data that has been previously published [38, 40, 41, 43, 44].

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Cohen, S., Bergman, S., Lynn, N. et al. A tool for CRISPR-Cas9 sgRNA evaluation based on computational models of gene expression. Genome Med 16, 152 (2024). https://doi.org/10.1186/s13073-024-01420-6

  • DOI: https://doi.org/10.1186/s13073-024-01420-6