Deep learning in drug discovery: an integrative review and future challenges

  • Open access
  • Published: 17 November 2022
  • Volume 56 , pages 5975–6037, ( 2023 )


  • Heba Askr 1 ,
  • Enas Elgeldawi 2 ,
  • Heba Aboul Ella 4 ,
  • Yaseen A. M. M. Elshaier 5 ,
  • Mamdouh M. Gomaa 2 &
  • Aboul Ella Hassanien 3  


Recently, the use of artificial intelligence (AI) in drug discovery has received much attention because it can significantly reduce the time and cost of developing new drugs. As deep learning (DL) technology advances and drug-related data grow, DL-based approaches are increasingly being used at all stages of drug development. This paper therefore presents a systematic literature review (SLR) that integrates recent DL technologies and applications in drug discovery, including drug–target interactions (DTIs), drug–drug similarity interactions (DDIs), drug sensitivity and responsiveness, and drug side-effect prediction. We review more than 300 articles published between 2000 and 2022. The benchmark data sets, the databases, and the evaluation measures are also presented. In addition, this paper provides an overview of how explainable AI (XAI) supports drug discovery problems. Drug dosing optimization and success stories are discussed as well. Finally, digital twinning (DT) and open issues are suggested as future research challenges for drug discovery problems, and an extensive bibliography is included.


1 Introduction

Drug discovery is the study of how drugs interact with the body and of how a medication must act on the body to have a therapeutic effect. Drug discovery strategies comprise different approaches, such as physiology-based and target-based approaches, and rely on information about the ligand and the target. In this regard, we focus on selected topics, in particular drug (ligand)–target interactions, drug sensitivity and response, drug–drug interaction, and drug–drug similarity. For certain diseases such as cancer, or in pandemic situations such as COVID-19, a combination of more than one drug is required to improve the prognosis and counteract the pathogenesis. Despite all the recent advances in pharmaceuticals, medication development is still a labor-intensive and costly process. As a result, several computational algorithms have been proposed to speed up the drug discovery process (Betsabeh and Mansoor 2021).

As DL models progress and drug data sets grow, many new DL-based approaches are emerging at every stage of the drug development process (Kim et al. 2021). In addition, following the development of DL approaches, large pharmaceutical corporations have migrated toward AI, abandoning outdated and ineffective procedures in order to increase the benefit to patients while also increasing their own profits (Nag et al. 2022). Despite the impressive performance of DL, drug discovery remains a critical and challenging task, and there is room for researchers to develop algorithms that further improve drug discovery performance. Therefore, this paper presents an SLR that integrates recent DL technologies and applications in drug discovery. This review is the first to cover recent DL models and applications across the different categories of drug discovery problems, namely DTIs, DDIs, drug sensitivity and response, and drug side-effect prediction, and to present emerging topics such as XAI and DT and how they help advance drug discovery. In addition, the paper points researchers to the most frequently used datasets in the field.

The paper is built on six building blocks, as shown in Fig. 1. More than 300 articles are reviewed in this paper, distributed across these building blocks. The papers were selected using the following criteria:

Papers published from 2000 to 2022.

Papers published by IEEE, ACM, Elsevier, and Springer were given higher priority.

Fig. 1 The main building blocks of the paper

The following analytical questions (AQs) are discussed and answered in this paper:

AQ1: What DL algorithms have been used to predict the different categories of drug discovery problems?

AQ2: Which DL methods are most commonly used in drug dosing optimization?

AQ3: Are there any success stories about drug discovery and DL?

AQ4: What about the newest technologies such as XAI and DT in drug discovery?

AQ5: What are the future and open works related to drug discovery and DL?

The remainder of this review is organized as follows. Sect. 2 presents a review of related studies; Sect. 3 gives an overview of the various DL techniques. Section 4 organizes the DL applications in drug discovery by explaining each drug discovery problem category and reviewing the DL techniques used for it. Section 5 discusses the benchmark data sets and databases that have been employed in the drug development process. Section 6 presents the evaluation metrics used for each drug discovery problem category. Drug dose optimization, success stories, and XAI are introduced in Sects. 7, 8, and 9, respectively. DT and open problems are suggested as future research challenges in Sects. 10 and 11. Section 12 presents a discussion of the analytical questions. Finally, Sect. 13 concludes the paper.

2 Review of related studies

Although drug discovery is a large field with several research categories, there are few review studies of it, and each related study has focused on only one research category, such as DL applications for DTIs. This section reviews these related studies; a summary is presented in Table 1.

Kim et al. (2021) presented a survey of DL models for drug–target interaction (DTI) prediction and new medication development. They begin with a thorough summary of the many representations of drugs and proteins, DL applications, and widely used benchmark data sets for training and testing models. A strength of this study is that it identifies several obstacles facing de novo drug design and DL-based DTI prediction. Its major drawback is that it does not consider the latest technologies in DL applications for DTIs, such as XAI and DT.

Rifaioglu et al. (2019) presented recent ML applications in virtual screening (VS), together with the techniques, tools, databases, and resources used to build the models. They outline what VS is and how crucial it is to the process of finding new drugs. Strengths of this study are that it highlights the DL technologies available as open-access programming libraries, tool kits, and frameworks that can be employed in future computational drug discovery (including DTI prediction), and that it provides examples of VS studies that led to the discovery of novel bioactive compounds and drugs. However, the review does not consider drug dose optimization.

Sachdev and Gupta (2019) presented the various feature-based chemogenomic methods for DTI prediction. They offer a thorough and current review of the different feature-based methodologies and describe relevant datasets, methods for computing drug and target features, and evaluation measures. Although the study is regarded as the first integrated review to concentrate solely on feature-based DTI techniques, it does not consider the latest technologies in DL applications for DTIs, such as XAI and DT.

3 Deep learning (DL) techniques

Machine learning (ML) has lately gained favor in research and is used in applications such as spam detection, video recommendation, image classification, and multimedia concept retrieval. Deep learning (DL) is one of the most extensively used ML methods in these applications. The continuing appearance of new DL studies is due to the rapid, hard-to-predict growth in data acquisition and the remarkable progress in hardware technologies. DL is based on conventional neural networks but outperforms them significantly. Furthermore, DL uses transformations and graph technology to build multi-layer learning models (Kim et al. 2021). ML and DL have changed the way problems are tackled: DL models come in various shapes and sizes and can effectively solve problems that are too complex for standard approaches. This section reviews the various DL models (Sarker 2021).

3.1 Classic neural networks

As shown in Fig. 2, classic (fully connected) neural networks are usually identified with the multi-layer perceptron (MLP), in which every neuron is connected to the neurons of the next layer. The model is adapted to basic binary data inputs (Mukhamediev et al. 2021). This paradigm allows both linear and nonlinear functions to be included. The linear function is a single line that multiplies its inputs by a constant. The sigmoid curve, the hyperbolic tangent, and the rectified linear unit (ReLU) are three common nonlinear functions. This model is best suited to classification and regression problems with real-valued inputs and to settings that call for a flexible general-purpose model.

Fig. 2 Multilayer Perceptron or ANN
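The three nonlinear functions named above, and the way a fully connected layer combines a linear step with one of them, can be sketched in a few lines of plain Python. This is only an illustration: the layer sizes, weights, and the helper name `dense_layer` are assumptions chosen for the example and do not come from the reviewed studies.

```python
import math

def sigmoid(x):
    """Sigmoid curve: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Hyperbolic tangent: squashes any real number into (-1, 1)."""
    return math.tanh(x)

def relu(x):
    """Rectified linear unit: passes positive values, zeroes out the rest."""
    return max(0.0, x)

def dense_layer(inputs, weights, biases, activation):
    """One fully connected layer: a linear step (w . x + b) per neuron,
    followed by a nonlinear activation."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

# Toy network with hand-picked (hypothetical) weights:
# 2 inputs -> 2 hidden neurons (ReLU) -> 1 output neuron (sigmoid).
inputs = [1.0, 2.0]
hidden = dense_layer(inputs, [[0.5, 0.25], [-1.0, 1.0]], [0.0, 0.5], relu)
output = dense_layer(hidden, [[1.0, -1.0]], [0.0], sigmoid)
print(hidden, output)
```

In a trained MLP the weights and biases would be learned from data rather than written by hand; the forward computation, however, has this same shape.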

3.2 Convolutional neural networks (CNN)

As shown in Fig. 3, the classic convolutional neural network (CNN) is an advanced, high-potential variant of the ANN that was developed to handle increasing levels of complexity as well as data preprocessing and compilation. It is based on how the neurons of an animal's visual cortex are arranged (Amashita et al. 2018). CNNs are among the most flexible algorithms for processing both image and non-image data. A CNN is organized in four parts:

An input layer, usually a 2D array of neurons, for analyzing primary visual data such as image pixels.

A one-dimensional output layer of neurons, which in some CNNs processes the input images through the distributed convolutional layers connected to it.

A third layer, called the sampling layer, which limits the number of neurons involved in the corresponding network layers.

One or more connected layers that link the sampling layer to the output layer.

Fig. 3 Convolutional Neural Networks (CNN)

This network concept can help extract relevant visual data in pieces or smaller units. Each neuron in a convolutional layer is responsible for a group of neurons in the preceding layer.

After the input data have been fed into the convolutional model, the CNN is constructed in four steps:

Convolution: Feature maps are produced from the input data, and a function is then applied to them.

Max-Pooling: It helps the CNN detect an image despite given modifications.

Flattening: The data are flattened in this stage so that the CNN can analyze them.

Full Connection: Often described as a "hidden layer", it compiles the loss function for the model.
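The four steps above can be traced on a tiny example. The following is a minimal sketch in plain Python, not an implementation from any of the cited works: the 5×5 "image", the 2×2 kernel, the ReLU applied to the feature map, and the weights of the final layer are all assumptions chosen so that the numbers are easy to follow by hand.

```python
def convolve2d(image, kernel):
    """Step 1, convolution: slide the kernel over the image (no padding)
    and produce a feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def relu_map(fmap):
    """Function applied to the feature map (ReLU is assumed here)."""
    return [[max(0, v) for v in row] for row in fmap]

def max_pool(fmap, size=2):
    """Step 2, max-pooling: keep the largest value in each size x size block."""
    return [
        [
            max(fmap[i + a][j + b] for a in range(size) for b in range(size))
            for j in range(0, len(fmap[0]) - size + 1, size)
        ]
        for i in range(0, len(fmap) - size + 1, size)
    ]

def flatten(fmap):
    """Step 3, flattening: turn the 2D map into a 1D vector."""
    return [v for row in fmap for v in row]

def fully_connected(vector, weights, bias):
    """Step 4, full connection: a single output neuron (weighted sum + bias)."""
    return sum(w * v for w, v in zip(weights, vector)) + bias

image = [
    [1, 2, 0, 1, 3],
    [0, 1, 2, 3, 1],
    [1, 0, 1, 2, 0],
    [2, 1, 0, 1, 1],
    [0, 3, 1, 0, 2],
]
kernel = [[1, 0], [0, -1]]  # hypothetical diagonal-difference filter

feature_map = convolve2d(image, kernel)
pooled = max_pool(relu_map(feature_map))
flat = flatten(pooled)
score = fully_connected(flat, [0.5, 0.5, 0.5, 0.5], 1.0)
print(pooled, flat, score)
```

Real CNNs learn many kernels per layer and stack several convolution and pooling stages; the sketch keeps one of each so the data flow stays visible.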

Image recognition, image analysis, image segmentation, video analysis, and natural language processing (NLP) (Chauhan et al. 2018 ; Tajbakhsh et al. May 2016 ; Mohamed et al. 2020 ; Zhang et al. 2018 ) are among the tasks that CNNs are capable of.

3.3 Recurrent neural networks (RNNs)

RNNs were first created to help with sequence prediction. These networks operate on sequential data streams, which may have different lengths. When making the current prediction, the RNN uses knowledge of its previous state as an input. As a result, it gives the network a form of short-term memory (Tehseen et al. 2019). The Long Short-Term Memory (LSTM) method, shown in Fig. 4, is one example and is renowned for its adaptability.

Fig. 4 LSTM Network

Two RNN designs aid in the study of such problems: LSTMs, which are useful for predicting data in time sequences using memory and have three gates (input, output, and forget), and gated RNNs, which are likewise helpful for temporal sequence prediction using memory-based data. Both types of algorithms can be used to address a range of issues, including image classification (Chandra and Sharma 2017), sentiment analysis (Failed 2018), video classification (Abramovich et al. 2018), language translation (Hermanto et al. 2015), and more.
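To make the role of the three gates concrete, the sketch below implements one step of a standard LSTM cell with scalar inputs and states. It follows the textbook LSTM formulation rather than any specific model from the cited studies, and the parameter names (`wf`, `uf`, `bf`, and so on) and values are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One time step of a scalar LSTM cell.

    x: current input, h_prev: previous hidden state (short-term memory),
    c_prev: previous cell state (long-term memory), p: dict of parameters.
    """
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate value
    c = f * c_prev + i * g   # forget part of the old memory, add new content
    h = o * math.tanh(c)     # expose a gated view of the memory
    return h, c

def run_sequence(xs, p):
    """Feed a sequence of any length through the cell, reusing its state."""
    h, c = 0.0, 0.0
    states = []
    for x in xs:
        h, c = lstm_step(x, h, c, p)
        states.append(h)
    return states

# Hypothetical parameters, chosen only for illustration.
example_params = {
    "wf": 0.5, "uf": 0.1, "bf": 0.0,
    "wi": 0.6, "ui": 0.2, "bi": 0.0,
    "wo": 0.4, "uo": 0.3, "bo": 0.0,
    "wg": 0.9, "ug": 0.5, "bg": 0.0,
}
print(run_sequence([0.5, -1.0, 2.0], example_params))
```

Because the same cell is applied at every step, the function accepts sequences of different lengths, which is the property the text attributes to RNNs.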

3.4 Generative adversarial networks: GAN

As shown in Fig. 5, a GAN combines two DL neural networks, a Generator and a Discriminator. The Generator network creates fake data, while the Discriminator learns to distinguish real data from fake data (Alankrita et al. 2021).

Fig. 5 GAN: Generative Adversarial Networks

The two networks compete with one another: the Discriminator keeps trying to distinguish real from fake data, while the Generator keeps trying to make its fake data look real. For example, if a library of images is required, the Generator network, built as a deconvolutional neural network, produces simulated data resembling the authentic photos, and an image-detector (Discriminator) network is then used to discriminate between the generated and the real images. This competition eventually improves the performance of the network. GANs can be employed to create images and text, enhance images, and discover new drugs.
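The competition can be expressed through the two losses that the networks minimize. The sketch below assumes the standard binary cross-entropy formulation of GAN training, with the commonly used non-saturating generator loss; the source text does not specify a loss, so this choice and the function names are assumptions. The inputs are the Discriminator's probabilities that a sample is real.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Low when the Discriminator assigns high probability to real samples
    (d_real) and low probability to generated samples (d_fake)."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return -(real_term + fake_term)

def generator_loss(d_fake):
    """Low when the Discriminator is fooled, i.e. when it assigns high
    probability of being real to generated samples (non-saturating form)."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

# A confident, correct Discriminator: its own loss is low, the Generator's is high.
print(discriminator_loss([0.9, 0.8], [0.1, 0.2]), generator_loss([0.1, 0.2]))

# A fooled Discriminator: the Generator's loss drops.
print(discriminator_loss([0.6, 0.5], [0.7, 0.8]), generator_loss([0.7, 0.8]))
```

Training alternates between updating the Discriminator to lower the first loss and updating the Generator to lower the second, which is the competition described above.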

3.5 Self-organizing maps (SOM)

As shown in Fig. 6, Self-Organizing Maps (SOMs) use unsupervised learning to reduce the number of random variables in a model (Kohonen 1990). Because every synapse is linked to both its input and output nodes, the output in this DL approach is fixed as a two-dimensional map. Each data point competes for representation in the map, and the weights of the closest nodes, the Best Matching Units (BMUs), are adjusted. How much a node's weights change depends on how close it is to a BMU. Because the weights are themselves a node attribute, their values represent the node's position in the network. SOMs are well suited to exploring the structure of datasets that have no target (Y) values.

Fig. 6 Self-Organizing Maps (SOM)
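The BMU search and the distance-dependent weight update can be sketched as follows. This is a minimal illustration assuming a 2×2 map, a Gaussian neighbourhood function, and a fixed learning rate; these choices and the function names are assumptions and are not taken from Kohonen (1990) or the reviewed studies.

```python
import math

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def best_matching_unit(grid, sample):
    """Return the (row, col) of the node whose weights are closest to the sample."""
    positions = [(r, c) for r in range(len(grid)) for c in range(len(grid[0]))]
    return min(positions,
               key=lambda rc: squared_distance(grid[rc[0]][rc[1]], sample))

def train_step(grid, sample, learning_rate=0.5, radius=1.0):
    """Pull the BMU, and to a lesser degree its map neighbours, toward the sample."""
    bmu = best_matching_unit(grid, sample)
    for r, row in enumerate(grid):
        for c, weights in enumerate(row):
            grid_dist2 = (r - bmu[0]) ** 2 + (c - bmu[1]) ** 2
            # Gaussian neighbourhood: 1.0 at the BMU, smaller farther away.
            influence = math.exp(-grid_dist2 / (2.0 * radius ** 2))
            for k in range(len(weights)):
                weights[k] += learning_rate * influence * (sample[k] - weights[k])
    return bmu

# A 2 x 2 map of 2-dimensional weight vectors (hypothetical starting values).
grid = [[[0.0, 0.0], [1.0, 0.0]],
        [[0.0, 1.0], [1.0, 1.0]]]
bmu = train_step(grid, [0.9, 0.1])
print(bmu, grid[bmu[0]][bmu[1]])
```

Repeating this step over many samples, usually with a shrinking radius and learning rate, arranges similar inputs on nearby nodes of the map without using any target values.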

3.6 Boltzmann machines

As shown in Fig. 7, this network model has no fixed direction, so its nodes are connected in a circular pattern. Because of this distinctive property, the technique is used to generate model parameters. Unlike all the deterministic network models described above, the Boltzmann Machine is stochastic. It can be used to monitor systems, build binary recommendation platforms, and analyze specific datasets (Hinton 2011).

Fig. 7 Boltzmann Machines

Architecturally, the Boltzmann Machine described here is a two-layer neural network: the first layer is the visible (input) layer and the second is the hidden layer. Each layer is made up of neuron-like nodes that carry out computations. Nodes are connected across the two layers, but nodes within the same layer are not linked to one another. This lack of intra-layer connectivity is one of the model's restrictions. When data are fed into the nodes, they are transformed into a graph; the nodes process them and learn the parameters, patterns, and relations among the data before deciding whether to transmit them. For this reason, the Boltzmann Machine is regarded as an unsupervised DL model.

3.7 Autoencoders

As shown in Fig. 8, the autoencoder, one of the most popular DL algorithms, operates automatically on its inputs: it encodes them, applies an activation function, and decodes the result at the output. The resulting bottleneck yields a more compact representation of the data and makes the most of the data's inherent structure (Zhai et al. 2018).

Fig. 8 Autoencoders

There are various types of autoencoders:

Sparse: When the hidden layer has more nodes than the input layer, a sparsity penalty is applied to reduce overfitting. It constrains the loss function and prevents the autoencoder from using all of its nodes simultaneously.

Denoising: A corrupted version of the input is used, in which randomly selected input values are set to 0.

Contractive: When the hidden layer has more nodes than the input layer, a penalty term is added to the loss function to avoid overfitting and copying of the data.

Stacked: Adding another hidden layer to an autoencoder yields two stages of encoding followed by one stage of decoding.

Autoencoders can be applied to tasks such as feature detection, building strong recommendation models, and adding features to very large datasets.
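The corruption step of the denoising variant is simple enough to show directly. The sketch below covers only that step (masking noise) and the pairing of each corrupted input with its clean target; the encoder and decoder themselves are omitted. The drop probability and the helper names are assumptions made for illustration.

```python
import random

def mask_inputs(values, drop_prob, rng):
    """Masking noise: each input value is set to 0 with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else v for v in values]

def make_training_pair(values, drop_prob, rng):
    """A denoising autoencoder is fed the corrupted vector but is trained
    to reconstruct the original, clean vector."""
    return mask_inputs(values, drop_prob, rng), list(values)

rng = random.Random(0)  # fixed seed so the example is repeatable
clean = [0.2, 0.7, 0.1, 0.9, 0.5, 0.3]
corrupted, target = make_training_pair(clean, 0.5, rng)
print(corrupted)
print(target)
```

Because the network never sees the clean values at its input, it cannot simply copy them and has to learn structure that survives the corruption.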

4 Organization of DL applications in drug discovery problems

The primary goal of drug discovery is the development of safe and effective treatments for humans (Kim et al. 2021). Drug discovery is the problem of finding suitable drugs to treat a disease (i.e., to act on a target protein), which depends on several kinds of interactions. This paper divides drug discovery problems into four main categories, as presented in Fig. 9: drug–target interactions, drug–drug similarity, drug combination side effects, and drug sensitivity and response prediction. The following subsections review the DL literature on these problems, and some of the reviewed articles for each category are summarized in Table 2.

Fig. 9 Drug discovery problem categories

4.1 Drug–target interactions prediction using DL

Drug repurposing attempts to uncover new uses for drugs that are already approved and on the market. It has attracted much attention because it takes less time, costs less, and has a higher success rate than traditional de novo drug development (Thafar et al. 2022). The identification of drug–target interactions (DTIs) is the initial step in creating new medications and one of the most crucial aspects of drug screening and drug-guided synthesis (Wang et al. 2020a). Exploring the links between candidate drugs and targets helps researchers understand the pathophysiology of targets at the drug level, which in turn supports early detection of disease, treatment prognosis, and drug design (Lian et al. 2021). The success of drug repositioning relies largely on DTI prediction because it narrows down the set of candidate drugs for specific targets. Traditional computational methods follow two basic strategies: approaches based on molecular docking and approaches based on drugs. Molecular docking is of limited use when the 3D structures of target proteins are not available, and drug-based techniques typically give poor predictions when only a few binding molecules are known for a target. DL technologies overcome the limitations imposed by the high-dimensional structure of drugs and target proteins by using structure-free approaches that require neither 3D structural data nor docking for DTI prediction. This section therefore provides a comprehensive review of recent DL-based DTI prediction models (Chen et al. 2012).

As shown in Fig. 10, there are known interactions (solid lines) and unknown interactions (dashed lines) between diseases (proteins) and drugs. DTI prediction infers the unknown interactions, that is, which diseases (or target proteins) a new drug might treat. We divide the latest DL models used to predict DTIs into three categories according to their input features: drug-based models, structure (graph)-based models, and drug–protein(disease)-based models.

Fig. 10 DL models used for predicting DTIs, grouped into three categories: a drug-based models, b structure (graph)-based models, and c drug–protein(disease)-based models

4.1.1 Drug-based models

Figure 10a shows drug-based models, which assume that a potential drug will be similar to the known drugs for a target protein. They predict DTIs from information about the drugs known to act on the target. These models use similarity search strategies, which postulate that structurally similar compounds have similar biological functions (Thafar et al. 2019; Matsuzaka and Uesawa 2019). Such methods have been used for decades to select compounds from vast compound libraries, relying on large-scale computation or manual calculation. Deep neural network models are gradually narrowing the gap between in silico prediction and empirical study, and DL technology can shorten these time-consuming procedures and manual operations.
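A similarity search of this kind can be sketched with the Tanimoto coefficient, a widely used measure for comparing molecular fingerprints. The example below is an illustration only: fingerprints are represented as sets of "on" bit positions, and the compound names and bit values are invented, not drawn from any dataset discussed in this review.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits:
    shared bits divided by all bits present in either fingerprint."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def most_similar(query, library, top_k=1):
    """Rank library compounds by similarity to the query fingerprint."""
    scored = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]

# Hypothetical fingerprints of compounds already known to bind a target.
library = {
    "compound_A": {1, 4, 7, 9},
    "compound_B": {2, 4, 7},
    "compound_C": {3, 5, 8},
}
query = {1, 4, 7}  # hypothetical candidate drug
print(most_similar(query, library, top_k=2))
```

Under the assumption stated in the text, a candidate that scores highly against known binders of a target is a plausible binder of that target as well.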

Thanks to benchmark packages such as MoleculeNet (Wu et al. 2018) and DeepChem, researchers can now use deep neural networks to analyze drugs and predict drug-related properties, including bioactivities and physicochemical properties. As a result, basic neural networks such as the MLP and CNN have been used in numerous drug-based DL approaches (Zeng et al. 2020; Yang et al. 2019; Liu et al. 2017). ADMET studies have often focused on the representation power of molecular descriptors rather than on the model itself (Zhai et al. 2018; Liu et al. 2017; Kim et al. 2016; Tang et al. 2014). Hirohara et al. trained a CNN model on SMILES strings and then used the learned features to discover motifs with significant structures, such as protein-binding sites or unknown functional groups (Hirohara et al. 2018). Wenzel et al. (2019) used atom pairs and pharmacophoric donor–acceptor pairs as descriptors in multi-task deep neural networks to predict microsomal metabolic liability. Gao et al. (2019) compared six kinds of 2D fingerprints for predicting protein–drug affinity using ML methods such as RF, single-task DNN, and multi-task DNN models. Matsuzaka and Uesawa (2019) trained a CNN model on 2D images of 3D chemical compounds to predict constitutive androstane receptor agonists. They obtained the best performance by optimizing the snapshots of a 3D ball-and-stick model taken at various angles or coordinates, and the method outperformed seven common predictions based on 3D chemical structure.

Since the development of the graph convolutional network (GCN), drug-related GCN models have built graph representations of molecules that incorporate chemical structure information by aggregating the properties of adjacent atoms (Gilmer et al. 2017).
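The aggregation of adjacent atoms' properties can be illustrated with a single, unweighted message-passing step. The sketch below is a deliberate simplification: real GCN and MPNN layers also apply learned weight matrices and nonlinearities, which are omitted here. The molecule (the heavy atoms of ethanol, C-C-O) and the two-element atom features are assumptions chosen for readability.

```python
def neighbor_lists(num_atoms, bonds):
    """Build an adjacency list from a list of (atom, atom) bonds."""
    neighbors = [[] for _ in range(num_atoms)]
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)
    return neighbors

def aggregate_step(features, bonds):
    """New feature of each atom = its own feature + the sum of its
    bonded neighbours' features (no learned weights in this sketch)."""
    neighbors = neighbor_lists(len(features), bonds)
    updated = []
    for atom, own in enumerate(features):
        total = list(own)
        for nb in neighbors[atom]:
            total = [t + f for t, f in zip(total, features[nb])]
        updated.append(total)
    return updated

def readout(features):
    """Sum atom features into one fixed-size vector for the whole molecule."""
    return [sum(column) for column in zip(*features)]

# Heavy atoms of ethanol: C0 - C1 - O2; features are [is_carbon, is_oxygen].
atom_features = [[1, 0], [1, 0], [0, 1]]
bonds = [(0, 1), (1, 2)]

updated = aggregate_step(atom_features, bonds)
print(updated)           # each atom now "knows" about its neighbours
print(readout(updated))  # molecule-level descriptor
```

After one step the two carbon atoms, which started with identical features, are distinguishable because only one of them is bonded to the oxygen; this is how structural context enters the learned descriptor.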

Many studies have used GCNs as 3D descriptors instead of SMILES strings, and these learned descriptors have been found to outperform standard descriptors in prediction tasks and to be easier to interpret (Shin et al. 2019; Ozturk et al. 2018; Yu et al. 2019). Chemi-net used GCN models to represent molecules and compared the performance of single-task and multi-task DNNs on its own QSAR datasets (Liu et al. 2019a). As a more advanced model, Yang et al. (2019) introduced the directed message passing neural network (D-MPNN), which uses a directed message-passing paradigm. They tested it on 19 publicly available and 16 proprietary datasets and found that, in most cases, the D-MPNN models outperformed previous models. On two datasets, however, they underperformed, and they were less robust than typical 3D descriptors when the dataset was small or imbalanced. Another research group then used the D-MPNN model to successfully predict an antibiotic, named halicin, that showed bactericidal effects in mouse models (Stokes et al. 2020). This was the first case in which an antibiotic was discovered by using DL methods to explore a chemical space too large for current experimental approaches. The use of attention-based graph neural networks is another interesting recent approach (Sun et al. 2020a). Because a molecule's graph representation can be altered by edge properties, edge weights and node features can be learned jointly. Accordingly, Shang et al. proposed a multi-relational GCN with edge attention (Shang et al. 2018), in which a dictionary of attention weights is built for each edge. Because this dictionary is shared across the molecule, the approach can handle inputs of varying size.

On the Tox21 and HIV benchmark datasets, they found that this model performed better than the random forest model, which suggests that it can effectively learn pre-aligned features from the inherent properties of the molecular graph. Withnall et al. (2020) extended the MPNN model to the attention MPNN (AMPNN), in which the message-passing step uses an attention mechanism in the form of a weighted summation. They also extended the D-MPNN model with the same attention mechanism and called the result the edge memory neural network (EMNN). Although it is computationally more intensive than other models, this model outperformed them on the missing data of the maximum unbiased validation (MUV) benchmark.

4.1.2 Structure (graph)-based models

Unlike the drug-based models, the structure-based models in Fig. 10b take both protein-target and drug information as input. Typical molecular docking simulation methods aim to predict geometrically feasible binding between drugs and proteins with known tertiary structure. The drug can be expressed as a sequence of atoms and the target as a sequence of amino acid residues. Sequence-based descriptors were chosen because DL approaches can be applied to them directly, with little pre-processing of the input data.

DeepDTA, proposed by Ozturk et al. (2018), outperformed conventional ML approaches such as KronRLS (Nascimento et al. 2016) and SimBoost (Tong et al. 2017) using sequence information alone, with a CNN model based on SMILES strings and amino acid sequences. The Davis kinase binding affinity dataset (Davis et al. 2011) and the KIBA dataset (Sun et al. 2020a) were used in that study. Wen et al. used common, basic features such as ECFPs and protein sequence composition descriptors and trained on them with semi-supervised learning via a deep belief network (Wen et al. 2017). Another study, DeepConv-DTI, built a deep CNN model using only an RDKit Morgan fingerprint and protein sequences (Lee et al. 2019). It also used the pooled convolution results to capture local residue patterns in target protein sequences, which yielded high values for critical protein regions such as actual binding sites.

The scoring function, which ranks protein–drug interactions from their 3D structures and parameterizes the training data to predict binding affinity values or binding pocket sites of target proteins, is a key component of structure-based regression models. AtomNet incorporated the 3D structural features of protein–drug complexes into CNNs (Wallach et al. 2015). It placed fixed-size 3D grids (i.e., voxels) over the protein–drug complexes, with each grid cell representing the structural properties at that position. Since then, several researchers have studied deep CNN models that use voxels to predict binding pocket location or binding affinity (Wang et al. 2020b; Ashburner et al. 2000; Zhao et al. 2019). These models have shown improved performance compared with common docking approaches such as AutoDock Vina (Trott and Olson 2010) or Smina (Koes et al. 2013). This is because CNN models remain trainable even with large input sizes and are robust to noise in the input data.
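The voxel representation can be illustrated by mapping atom coordinates onto a fixed-size grid. The sketch below is a simplified illustration and does not reproduce AtomNet's actual featurization: it only counts atoms per grid cell and element, and the coordinates, grid size, and resolution are hypothetical.

```python
import math

def voxelize(atoms, grid_dim, resolution, origin=(0.0, 0.0, 0.0)):
    """Map atoms onto a cubic grid of grid_dim cells per axis.

    atoms: list of (x, y, z, element) tuples, coordinates in angstroms.
    Returns a sparse grid {(i, j, k, element): atom count}; atoms that
    fall outside the grid are ignored.
    """
    grid = {}
    for x, y, z, element in atoms:
        index = tuple(
            math.floor((coord - o) / resolution)
            for coord, o in zip((x, y, z), origin)
        )
        if all(0 <= i < grid_dim for i in index):
            key = index + (element,)
            grid[key] = grid.get(key, 0) + 1
    return grid

# Hypothetical atoms from a protein-drug complex.
atoms = [
    (0.5, 0.5, 0.5, "C"),
    (0.9, 0.2, 0.4, "C"),
    (2.4, 0.1, 1.9, "N"),
    (-1.0, 0.0, 0.0, "O"),  # outside the grid, so it is skipped
]
grid = voxelize(atoms, grid_dim=4, resolution=1.0)
print(grid)
```

Each element type acts like a colour channel of a 3D image, which is what allows ordinary 3D convolutions to be applied to the complex.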

Many DTI studies have applied GCNs within structure-based approaches (Feng et al. 2018; Liu et al. 2016). Feng et al. (2018) used both ECFPs and GCNs as drug features. On the Davis (Davis et al. 2011), Metz (Metz et al. 2011), and KIBA (Tang et al. 2014) benchmark datasets, their methods outperformed prior models such as KronRLS (Nascimento et al. 2016) and SimBoost (Tong et al. 2017). However, they acknowledged that their GCN model could not beat their ECFP model, owing to time and resource constraints in implementing the GCN. In a different DTI study, Torng et al. used an unsupervised graph-based model to learn fixed-size representations of protein binding sites (Torng and Altman 2019). The pre-trained GCN model was then used to train the protein pocket GCN, whereas the drug GCN model was trained on automatically extracted features. They concluded that their model effectively captured protein–drug binding interactions without relying on target–drug complexes.

Because models that implement an attention mechanism have properties that make them interpretable, attention-based DTI prediction approaches have emerged (Hirohara et al. 2018; Liu et al. 2016; Perozzi et al. 2014).

Gao et al. (2017) used compressed vector representations with LSTM RNNs for protein sequences and a GCN for drug structures. They concentrated on demonstrating that their method can deliver biological insights into DTI predictions. To do so, they used two-way attention mechanisms to compute the binding of drug–target pairs (DTPs), which allows flexible interpretation of high-level information about target proteins, such as GO terms. Shin et al. (2019) introduced the Molecule Transformer DTI (MT-DTI) approach, which uses a self-attention mechanism for drug representations. Starting from parameters pre-trained on 97 million PubChem compounds, the MT-DTI model was fine-tuned and evaluated on two publicly available benchmark datasets, Davis (Davis et al. 2011) and KIBA (Tang et al. 2014). However, the attention mechanism was not used to represent the protein targets, because computing it over target sequences would take too long and pre-training was not possible owing to a lack of target information.

AttentionDTA, presented by Zhao et al., in contrast, adds an attention mechanism to a CNN model to establish weighted connections between drug and protein sequences (Zhao et al. 2019). They showed that these attention-based drug and protein representations perform well on the affinity prediction task with an MLP model. DeepDTIs used external experimental DTPs to infer the probability of interaction for any given DTP. Four of the top ten predicted DTIs had previously been identified, and one was found to have weak binding affinity for the glucocorticoid receptor (Huang et al. 2018). DeepCPI was also used to predict drug–target interactions, and its predicted small-molecule interactions with the glucagon-like peptide-1 receptor, the glucagon receptor, and the vasoactive intestinal peptide receptor were tested experimentally (Wan et al. 2019).

4.1.3 Drug–protein(disease)-based models

According to polypharmacology, most medicines act on multiple targets, both primary and secondary. These effects depend on the biological networks involved and on the drug's dose. As a result, the drug–protein(disease)-based models shown in Fig. 10c are particularly beneficial when evaluating protein promiscuity or drug selectivity (Cortes-Ciriano et al. 2015). Furthermore, multi-task neural networks are well suited to learning the properties of many kinds of data simultaneously (Camacho et al. 2018). Several DL applications, such as those built on drug-induced gene-expression patterns and DTI-related heterogeneous networks, leverage relational information from distinct views. A network-based strategy employs heterogeneous networks that include a variety of node and edge types (Luo et al. 2017; David et al. 2019). The nodes in these networks exhibit local similarity, which is a significant aspect of these models. For example, in a similarity network whose nodes are drugs and whose edge weights are drug–drug similarity values, DTIs can be predicted from the connections and topological features of the network. Machine learning techniques that use heterogeneous networks as prediction frameworks include the support vector machine (Bleakley and Yamanishi 2009; Keum and Nam 2017), the regularized least squares (RLS) model (Liu et al. 2016; Xia et al. 2010; Hao et al. 2016), and the random walk with restart model (Lian et al. 2021; Nascimento et al. 2016). Owing to the increased interest in DL technologies, network-based DTI prediction studies now employ DL to improve association prediction by evaluating the topological structures of drug and target networks, which are bipartite or tripartite linked networks (drug, target, and disease networks) (Hassan-Harrirou et al. 2020; Lamb et al. 2006; Korkmaz 2020; Townshend et al. 2012; Vazquez et al. 2020). Zong et al. (2017) used the DeepWalk approach to collect local latent information, compute topology-based similarity in tripartite networks, and demonstrate the method's promise for drug repurposing.
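As a rough illustration of the random walk with restart idea mentioned above, the following sketch propagates scores from one seed drug across a small drug–drug similarity network. The similarity values, the restart probability of 0.3, and all names are illustrative assumptions; they are not taken from the cited studies.

```python
def random_walk_with_restart(weights, seed, restart=0.3, tol=1e-9, max_iter=1000):
    """Return steady-state visiting probabilities for a walker that starts at
    `seed` and jumps back to it with probability `restart` at every step."""
    n = len(weights)
    # Column-normalise the similarity matrix so that each column sums to 1.
    col_sums = [sum(weights[i][j] for i in range(n)) for j in range(n)]
    trans = [[weights[i][j] / col_sums[j] if col_sums[j] else 0.0
              for j in range(n)] for i in range(n)]
    start = [1.0 if i == seed else 0.0 for i in range(n)]
    prob = start[:]
    for _ in range(max_iter):
        nxt = [(1 - restart) * sum(trans[i][j] * prob[j] for j in range(n))
               + restart * start[i] for i in range(n)]
        converged = sum(abs(a - b) for a, b in zip(nxt, prob)) < tol
        prob = nxt
        if converged:
            break
    return prob


# Hypothetical symmetric drug-drug similarity network (4 drugs, no self-loops).
similarity = [
    [0.0, 0.9, 0.1, 0.0],
    [0.9, 0.0, 0.5, 0.0],
    [0.1, 0.5, 0.0, 0.7],
    [0.0, 0.0, 0.7, 0.0],
]
scores = random_walk_with_restart(similarity, seed=0)
# Drugs ranked by proximity to the seed drug (index 0).
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

In a DTI setting, the scores of drugs close to the seed could be used to transfer the seed drug's known targets to its neighbors; how that transfer is done differs between the cited methods.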

Some network-based DTI prediction studies used relationship-based features obtained by training an autoencoder (AE). Zhao et al. (2020) developed a DTI-CNN prediction model that uses low-dimensional but rich deep features learned from a heterogeneous network with a stacked AE. To construct the topological similarity matrices of drugs and targets, Wang et al. used a deep AE together with positive pointwise mutual information (Wang et al. 2020b). In another study, Peng et al. (2020) employed a denoising autoencoder to select network-based features and reduce the dimensionality of the representation.

A denoising autoencoder is trained to reconstruct its input from a corrupted version, which makes the encoder more robust when the high-dimensional input data is noisy and incomplete. These approaches, however, have a drawback: it is difficult to make predictions for new drugs or targets, which is known in recommendation systems as the "cold start" problem (Bedi et al. 2015). The size and shape of the network also strongly affect these models; if the network is not large enough, they cannot cover drugs or targets that are absent from it (Lamb et al. 2006).

Various studies have also used gene expression patterns as chemogenomic features to predict DTIs. These studies assume that drugs with similar expression patterns have similar effects on the same targets (Hizukuri et al. 2015; Sawada et al. 2018).

The LINCS-L1000 database, the revised version of CMAP, has been integrated into DL-based DTI models in recent works (Subramanian et al. 2017; Thafar et al. 2020; Karpov et al. 2020; Arus-Pous et al. 2020). Xie et al. developed a binary classification model using a deep neural network trained on the LINCS drug perturbation and gene knockout data (Xie et al. 2018).

Lee and Kim, on the other hand, used expression signature genes as drug and target features. They used node2vec to learn from the rich data by examining three aspects of protein function, including pathway-level memberships and PPI (Lee and Kim 2019). In DTIGCCN, Shao and Zhang employed a GCN model to extract drug and target features from LINCS data and a CNN model to extract latent features and predict DTIs (Shao et al. 2020). The Gaussian kernel function was found to help produce high-quality graphs, and as a result this hybrid model scored better on classification tests.

DeepDTnet uses a heterogeneous drug–gene–disease network containing fifteen types of chemical, genomic, phenotypic, and cellular network profiles to identify targets of known drugs. DeepDTnet predicted, and experiments confirmed, that topotecan is a new direct inhibitor of the human retinoic acid receptor-related orphan receptor (Zeng et al. 2020).

4.2 Drug sensitivity and response prediction using DL

Drug response is the clinical outcome produced by treatment with the drug of interest (https://www.sciencedirect.com/topics/drug-response). The main idea of drug response prediction is shown in Fig. 11: the DL method takes the heterogeneous network of drug and protein interactions as input and predicts the response scores. Despite the widespread use of deep neural network (DNN) approaches in various domains and sectors, including related fields such as computational chemistry (Gómez-Bombarelli et al. 2018), DNNs have only lately made their way into drug response prediction. This is due to the typically low ratio of samples to measurements per sample, which makes traditional feedforward neural networks unsuitable; overparameterization, overfitting, and poor generalization are common outcomes on such datasets. However, more public data has become available recently, and newly built DNN models have shown promise. This section therefore summarizes current computational challenges and recent breakthroughs in DL-based drug response prediction.

figure 11

Drug binding with proteins and drug sensitivity (response) scores prediction

Neural networks have been used to predict drug response since the 1990s. El-Deredy et al. (1997) showed that data from tumor nuclear magnetic resonance (NMR) spectra could be used to train a neural network to predict drug response in gliomas and to provide information on the metabolic pathways involved in drug response.

In 2018, Chang et al. (2018) created the CDRscan model, which uses a CNN architecture trained on 1000 drug response assays per molecule. Their model performed much better than traditional ML algorithms such as RF and SVM. One reason CDRscan outperformed these baselines is its ability to incorporate both genomic data and molecular fingerprints; in addition, convolutional designs have proven useful in many areas of machine learning. An autoencoder is a neural network that compresses its input and then attempts to reconstruct the original data from the compressed form. As shown by Way and Greene (2018), this is very useful for feature extraction: they compressed a gene expression profile with 5000 dimensions into at most 100 dimensions, some of which captured significant characteristics such as the patient's sex or melanoma status. Using variational autoencoders, Dincer et al. (2018) created DeepProfile, a technique that learns an eight-dimensional representation of gene expression in AML patients. This representation is then fed to a Lasso linear model for treatment response prediction, which gives better results than the same model without feature extraction.

Ding et al. (2018) proposed a deep autoencoder model for representation learning of cancer cells from input data consisting of gene expression, copy number variation (CNV), and somatic mutations.

In 2019, Sharifi-Noghabi et al. (2019) proposed MOLI (Multi-omics Late Integration), a deep learning model that integrates multi-omics data, including somatic mutations, to characterize a cell line. MOLI uses three separate subnetworks, each of which learns a representation for one type of omics data. A final network classifies a cell as responder or non-responder based on the concatenated features. These methods share two characteristics: they integrate multiple types of input data (multi-omics), and they treat drug response as a binary classification problem. Although combining several forms of omics data can improve the learning of cell line status, it may limit the method's applicability to other cell lines or patients because the model requires data beyond gene expression.

Furthermore, binary classification of drug response requires a threshold on the IC50 values to be set beforehand, and the appropriate threshold may vary with the experimental conditions, such as drug or tumor type. Twin CNN for drugs in SMILES format (TCNNS) (Liu et al. 2019b) takes a one-hot encoded representation of drugs and feature vectors of cell lines as the inputs to two encoding subnetworks, each a one-dimensional (1D) CNN. The one-hot drug encodings in TCNNS are derived from Simplified Molecular Input Line Entry System (SMILES) strings, which describe a drug compound's chemical structure. The binary feature vectors of cell lines represent 735 mutation states or CNVs of a cell. KekuleScope (Cortés-Ciriano and Bender 2019) adopts transfer learning: a CNN pre-trained on ImageNet data is further trained on images of drug compounds represented as Kekulé structures to predict the drug response.
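To make the one-hot SMILES representation concrete, the sketch below encodes a SMILES string character by character into a binary matrix with one row per symbol and one column per position. The alphabet, the padded length, and the per-character tokenization (which would split two-letter atoms such as Cl) are simplifying assumptions for illustration and are not claimed to match the TCNNS preprocessing.

```python
def one_hot_smiles(smiles, alphabet, max_len):
    """Encode a SMILES string as a len(alphabet) x max_len matrix of 0/1 values.
    Positions beyond the end of the string are left as all-zero padding."""
    index = {symbol: row for row, symbol in enumerate(alphabet)}
    matrix = [[0] * max_len for _ in alphabet]
    for position, symbol in enumerate(smiles[:max_len]):
        if symbol not in index:
            raise ValueError(f"symbol {symbol!r} is not in the alphabet")
        matrix[index[symbol]][position] = 1
    return matrix


# Illustrative alphabet and length; a real model would derive both from its data.
ALPHABET = ["C", "O", "N", "(", ")", "=", "1"]
acetic_acid = "CC(=O)O"  # short example molecule
encoded = one_hot_smiles(acetic_acid, ALPHABET, max_len=10)
```

A 1D CNN can then slide its filters along the position axis of such a matrix, treating the symbol rows as input channels.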

Yuan et al. (2019) proposed GNNDR, a GNN-based technique with a high learning capacity that predicts drug response by combining protein–protein interaction (PPI) information with genomic features. The value of including protein information was demonstrated empirically, and the method offers a viable avenue for the discovery of anti-cancer drugs. Rampášek et al. (2019) examined semi-supervised variational autoencoders for the prediction of monotherapy response. In contrast to many conventional ML methods, they developed a joint model for drug response prediction that exploits gene expression measured before and after treatment in cell lines, and they reported improved performance on a variety of FDA-approved drugs. Chiu et al. (2019) trained a deep drug response predictor after pre-training autoencoders on mutation and expression data from the TCGA dataset. The use of pre-training distinguishes their strategy from others. Whereas training on labeled data alone is limited to the gene expression profiles obtained from drug response assays, pre-training can exploit unlabeled data from outside sources such as TCGA, which significantly increases the number of available samples and improves performance.

Chiu et al. (2019) and Li et al. (2019) combined autoencoders with deep neural networks to predict drug response in cell lines and in genomically characterized tumors. To predict cell-line responses to drug combinations, deep neural encoders have also been used to link genetic characteristics with drug profiles (https://string-db.org/cgi/download.pl?sessionId=uKr0odAK9hPs).

In 2020, Wei et al. (2020) predicted drug risk levels based on adverse drug reactions (ADRs), using SMOTE and machine learning techniques. The proposed framework was used to investigate the mechanism of ADRs, to estimate drug risk levels, and to support decision-making on switching drugs from prescription (Rx) to over-the-counter (OTC) status. Within this framework, the best-performing combination was PRR-SMOTE-RF, whose macro-ROC curve indicated strong classification performance. The authors suggested that drug regulatory organizations, including the FDA and CFDA, could use the framework as a simple but dependable method for ADR signal detection and drug classification, and as an auxiliary basis for experts deciding whether to change the status of Rx drugs to OTC drugs. They propose that more ML or DL classification algorithms be tested in the future and that computational complexity be factored into the comparison. Kuenzi et al. (2020) built DrugCell, an interpretable DL model of cancer cells trained on the responses of 1235 tumor cell lines to 684 drugs. Tumor genotypes induce states in cellular subsystems, and these states are combined with drug structure to predict therapeutic outcome while also learning the molecular mechanisms underlying the response. DrugCell's predictions in cell lines are accurate and help stratify clinical outcomes. Analysis of DrugCell's mechanisms led to the design of synergistic drug combinations, which the authors tested using combinatorial CRISPR, in vitro drug–drug screening, and patient-derived xenografts. DrugCell offers a blueprint for building interpretable models for predictive medicine.
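For the PRR component of the PRR-SMOTE-RF pipeline of Wei et al. discussed above, a minimal sketch follows. It assumes that PRR denotes the proportional reporting ratio commonly used in pharmacovigilance signal detection, computed from a 2×2 table of spontaneous reports; the counts and the function name are hypothetical, and the SMOTE and RF steps are not shown.

```python
def proportional_reporting_ratio(a, b, c, d):
    """Proportional reporting ratio from a 2x2 contingency table of reports.

    a: reports of the drug of interest with the reaction of interest
    b: reports of the drug of interest with any other reaction
    c: reports of all other drugs with the reaction of interest
    d: reports of all other drugs with any other reaction
    """
    if a + b == 0 or c == 0:
        raise ValueError("PRR is undefined for these counts")
    rate_drug = a / (a + b)      # share of the drug's reports naming the reaction
    rate_others = c / (c + d)    # same share among all other drugs
    return rate_drug / rate_others


# Hypothetical counts: 20 of 100 reports for the drug mention the reaction,
# versus 100 of 10,000 reports for all other drugs.
example_prr = proportional_reporting_ratio(20, 80, 100, 9900)
```

A ratio well above 1 indicates that the reaction is reported disproportionately often for the drug; such signal values could then serve as features or labels for a downstream classifier.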

Artificial Neural Networks (ANNs) that operate on graphs as inputs are known as Graph Neural Networks (GNNs). Deep GNNs were recently employed to learn low-dimensional representations of biomolecular networks (Hamilton 2020; Wu et al. 2020). Ahmed et al. (2020) used two separate GNN methods to build models from gene expression (GE) data and a gene co-expression network, that is, a network that depicts the relationship between the expression of gene pairs.

The CNN is one of the neural network models adopted for drug response prediction. It has been actively used for image, video, text, and sound data because of its strong ability to preserve the local structure of data and learn hierarchies of features. By 2021, several methods had been developed for drug response prediction, each of which uses different input data (Baptista et al. 2021).

Nguyen et al. (2021) proposed a method to predict drug response called GraphDRP, which integrates two subnetworks for drug and cell line features, like the CNN in Liu et al. (2019b) and Qiu et al. (2021). From gene expression data of cancer cell lines and drug response data, the authors identify predictor genes for drugs of interest and provide a reliable and accurate drug response prediction model. After pre-selecting genes using the Pearson correlation coefficient, they employed the ElasticNet regression model to predict drug response and refine the gene selection. To obtain a more trustworthy set of predictor genes, they ran the regression for each drug twice, once with the IC50 and once with the area under the curve (AUC, or activity area). The Pearson correlation coefficient for each of the 12 drugs they examined was greater than 0.6. The highest Pearson correlation coefficient was obtained for 17-AAG: 0.811 with IC50 and 0.81 with AUC.

Even though the model developed in this study has excellent predictive performance on GDSC, it still has certain flaws. First, the properties of cancer cell lines may differ significantly from those of in vivo tumors, so it remains to be determined whether the model will be useful in a clinical setting. Second, the authors primarily use gene expression data to predict drug response, although drug response is influenced not only by gene expression levels but also by structural changes such as gene mutations. More research is needed to integrate such data into the model and improve its prediction capacity.
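The gene pre-selection step described above can be sketched as follows: each gene's expression across cell lines is correlated with the measured drug response, and only genes whose absolute Pearson correlation reaches a cutoff are kept. The gene names, the toy values, and the cutoff of 0.5 are assumptions made for illustration; the subsequent regression (for instance with scikit-learn's `ElasticNet`) is not shown.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant vector carries no correlation information
    return cov / math.sqrt(var_x * var_y)


def preselect_genes(expression, response, cutoff=0.5):
    """Keep genes whose |Pearson r| with the response is at least `cutoff`.

    expression: dict mapping gene name -> expression values across cell lines
    response:   drug response (e.g. IC50 or AUC) for the same cell lines
    """
    selected = {}
    for gene, values in expression.items():
        r = pearson(values, response)
        if abs(r) >= cutoff:
            selected[gene] = r
    return selected


# Toy data for five hypothetical cell lines.
expression = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "GENE_B": [5.0, 3.0, 4.0, 1.0, 2.0],
    "GENE_C": [2.0, 2.1, 1.9, 2.0, 2.05],
}
response = [0.2, 0.4, 0.6, 0.8, 1.0]
selected = preselect_genes(expression, response)
```

Running the selection once with IC50 and once with AUC as the response, and intersecting the two gene sets, would mirror the two-pass design described in the text.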

In 2022, Ren et al. (2022) proposed a deep learning-based graph regularized matrix factorization (DeepGRMF), which integrates neural networks, graph models, and matrix-factorization approaches to predict cell response to drugs. It uses a variety of information, including drug chemical structure, the effects of drugs on cellular signaling mechanisms, and the state of cancer cells. DeepGRMF learns drug embeddings such that drugs with similar structures and mechanisms of action (MOAs) lie close together in the embedding space. It learns embeddings for cells in the same way, so that cells with similar biological states and drug responses are close together. On the Cancer Cell Line Encyclopedia (CCLE) and Genomics of Drug Sensitivity in Cancer (GDSC) datasets, DeepGRMF outperforms competing models in prediction performance. On The Cancer Genome Atlas (TCGA) dataset, the model could predict the effectiveness of a treatment regimen on lung cancer patients' outcomes. The limited expressiveness of the VAE-based chemical structure representation may explain why prediction for new cell lines is more accurate than prediction for new drugs. Graph neural networks have recently been shown to represent chemical structures better, and they could be investigated in the future. Pouryahya et al. (2022) proposed a new network-based clustering approach for predicting drug response based on optimal mass transport (OMT) theory. Cell lines and drugs were represented as data networks and clustered using gene-expression profiles and cheminformatic drug characteristics, respectively. An RF model was then trained for each pair of cell-line and drug clusters. Compared with a single global model, cluster-based prediction models trained on more homogeneous data are expected to improve the accuracy and biological interpretability of drug sensitivity prediction.

4.3 Drug–drug interactions (DDIs) side effect prediction using DL

Drugs are chemical compounds that people consume and that interact with protein targets to produce a change. Drugs may alter the human body positively or negatively. Drug side effects are the undesirable alterations that drugs cause in the human body. These adverse effects range from moderate headaches to life-threatening reactions such as cardiac arrest, malignancy, and death. They differ depending on the person's age, gender, stage of sickness, and other factors (Kuijper et al. 2019). Several laboratory tests are conducted on drugs to determine whether they have any unfavorable side effects. However, these tests are both costly and time-consuming. Recently, many computational algorithms for detecting adverse drug effects have been created, and computational methodologies are replacing laboratory experiments.

On the other hand, these methods do not provide adequate data to predict drug–drug interactions (DDIs). The phenomenon of DDIs is illustrated in Fig. 12. A drug's overall effect on the human body consists of the desired effects resulting from its interaction with the intended target and the undesirable effects arising from its interactions with off-targets. Even though a drug has a strong binding affinity for one target, it also binds to several other proteins with varied affinities, which may cause adverse consequences (Liu et al. 2021). Predicting DDIs can help reduce the likelihood of adverse reactions and optimize the drug development and post-market monitoring processes (Arshed et al. 2022). Side effects of DDIs are often regarded as the leading cause of drug failure in pharmacological development. When drugs have major side effects, they are quickly withdrawn from the market. As a result, predicting side effects is a fundamental requirement in the drug discovery process, both to keep drug development costs and timelines in check and to launch drugs that benefit patients' recovery.

figure 12

Drug binding with proteins and DDI side effects

Furthermore, the average drug research and development cost is $2.6 billion (Liu et al. 2019). Determining the likelihood of adverse effects is therefore important for lowering the expense and risk of drug development, and researchers use various computational tools to speed up the process. DDI prediction is a difficult problem in pharmacology and clinical practice, and correctly detecting possible DDIs in clinical studies is crucial for patients and the public. Researchers have recently achieved a series of successes using deep learning to predict DDIs from drug structural properties and graph theory (Han et al. 2022). AI has successfully detected potential drug interactions, allowing doctors to make informed decisions before prescribing drug combinations to patients with complex or multiple conditions (Fokoue et al. 2016).

Therefore, this section reviews the DL algorithms that researchers most commonly use to predict DDIs.

In 2016, Fokoue et al. (2017) proposed Tiresias, a framework for discovering DDIs. Tiresias takes a large amount of drug-related data as input and generates DDI predictions. The approach begins by semantically integrating the input data into a knowledge graph that represents drug properties and interactions together with additional entities such as enzymes, chemical structures, and pathways. Several similarity measures between all drugs are then computed from the knowledge graph in a scalable and distributed setting. A large-scale logistic regression model uses the computed similarity measures to predict DDIs. According to the findings, Tiresias helps identify new interactions both among existing drugs and between newly developed and existing drugs. Its drawback is the need for large-scale drug data, which makes the model costly.

In 2017, Reza et al. (2017) developed a computational technique for predicting DDIs based on functional similarities among drugs. The model was built on several major biological elements: carriers, enzymes, transporters, and targets (CETT). The approach was applied to 2189 approved drugs, for which the associated CETTs were obtained and encoded as binary vectors used to find DDIs. In total, 2,394,767 potential drug–drug interactions were assessed, and over 250,000 previously unknown possible DDIs were identified. Among the several similarity measures used, inner product-based similarity measures (IPSMs) gave good predictive values for detecting DDIs. A key flaw of this strategy was the lack of pharmacological data, which led to errors when detecting all potential DDI pairs.
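The following sketch illustrates the basic idea of scoring drug pairs with an inner product over binary CETT vectors, as described above. The feature columns, drug names, and vectors are hypothetical, and the plain (unnormalized) inner product stands in for the family of inner product-based similarity measures used in the cited study.

```python
from itertools import combinations


def inner_product(u, v):
    """Number of shared features between two binary feature vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical CETT feature columns: 1 means the drug is associated with it.
FEATURES = ["carrier_1", "enzyme_1", "enzyme_2",
            "transporter_1", "target_1", "target_2"]
drugs = {
    "drug_A": [1, 1, 0, 0, 1, 0],
    "drug_B": [1, 1, 1, 0, 1, 0],
    "drug_C": [0, 0, 1, 1, 0, 1],
}

# Score every unordered drug pair and rank pairs by shared CETT features.
pair_scores = {
    (a, b): inner_product(drugs[a], drugs[b])
    for a, b in combinations(sorted(drugs), 2)
}
ranked_pairs = sorted(pair_scores, key=pair_scores.get, reverse=True)
```

Pairs that share many carriers, enzymes, transporters, or targets rank highest and would be the first candidates for a potential interaction.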

In 2018, Ryu et al. (2018) proposed a model that predicts more DDI types using the drugs' chemical structures as inputs and applied multi-task learning to DDI type prediction. In the same vein, Decagon (Zitnik et al. 2018) models polypharmacy side effects using a relational GNN. To learn representations of complex nonlinear drug interactions, Chu et al. (2018) used an autoencoder for factorization. To predict DDIs, Liu et al. (2019c) presented DDI-MDAE, a multimodal deep autoencoder based on shared latent representations. Recently, interest in employing graph neural networks (GNNs) to predict DDIs has increased. Different schemes for aggregating the feature vectors of a node's neighbors lead to different GNN variants. Asada et al. (2018) use a graph convolutional network (GCN) to encode molecular structures when extracting DDIs from text. Furthermore, Ma et al. (2018) integrated multi-view graph autoencoders with attention into a unified model.

Chen (2018) devised a model for predicting adverse drug reactions (ADRs). The predictive model used SVM, LR, RF, and GBT. It was built on the DEMO dataset, which contains attributes such as the patient's age, weight, and sex, and the DRUG dataset, which includes features such as the drug's name, role, and dosage. Males make up 46% of the sample and females 54%. The model achieved fair prediction accuracy on a representative sample set. However, the results also showed that the model is accurate only when a large amount of data is available.

To predict possible DDIs, Kastrin et al. (2018) employed statistical learning approaches. The DDIs were represented as a complex network, with nodes representing drugs and links representing their potential interactions, and link prediction on DDI networks was framed as a binary classification task. A randomly chosen large DDI database was used for prediction. Several unsupervised and supervised ML approaches, the latter including SVM, classification tree, boosting, and RF, were applied for edge prediction in various DDI networks. Compared with the unsupervised techniques, the supervised link prediction strategy produced encouraging results. To detect links between drugs, the proposed method requires Unified Medical Language System (UMLS) filtering, which posed a difficulty for the researchers. Furthermore, the system considers only fixed network snapshots, which is problematic because the DDI network is dynamic.
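As a hedged illustration of link prediction on a DDI network, the sketch below scores every pair of drugs that is not yet connected by the Jaccard coefficient of their neighbor sets, a standard unsupervised proximity measure. The tiny network and drug labels are invented for the example, and the specific measures used in the cited study may differ; in a supervised setting, such scores would be fed as features to a classifier such as RF or SVM.

```python
def jaccard_score(graph, u, v):
    """Jaccard coefficient of the neighbour sets of nodes u and v."""
    union = graph[u] | graph[v]
    return len(graph[u] & graph[v]) / len(union) if union else 0.0


def candidate_links(graph):
    """Score every pair of nodes that is not already linked."""
    nodes = sorted(graph)
    scores = {}
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            if v not in graph[u]:
                scores[(u, v)] = jaccard_score(graph, u, v)
    return scores


# Hypothetical known DDIs between five drugs, stored as an undirected graph.
known_ddis = [("A", "B"), ("A", "C"), ("B", "C"),
              ("B", "D"), ("C", "D"), ("D", "E")]
graph = {}
for u, v in known_ddis:
    graph.setdefault(u, set()).add(v)
    graph.setdefault(v, set()).add(u)

scores = candidate_links(graph)
best_candidate = max(scores, key=scores.get)
```

Because the scores depend only on a fixed snapshot of the network, this sketch also exhibits the limitation noted above: newly reported interactions require the scores to be recomputed.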

In 2019, Lee et al. (2019) proposed a deep learning system for accurately predicting the effects of DDIs. To learn the pharmacological effects of a variety of DDIs, the method employs a set of autoencoders and a deep feed-forward neural network, trained using a combination of well-known techniques. The results showed that adding GSP and TSP improves prediction accuracy over using SSP alone, and that the autoencoder is more effective than PCA at reducing the dimensionality of the profiles. In addition, the model outperformed existing approaches and predicted numerous novel DDIs supported by current research. Yue et al. (2020) combine several graph embedding methods for the DDI task, while Karim et al. (2019) model DDI prediction as link prediction with the help of a knowledge graph. Andreea and Huang (2019) presented a co-attention-based deep learning model that relies solely on side-effect data and molecular drug structure. CASTER (Huang et al. 2020), also based on drug chemical structures, is a dictionary learning framework for predicting DDIs. Chu et al. (2019) propose using semi-supervised learning to extract meaningful information for DDI prediction from both labeled and unlabeled drug data.

Shtar et al. (2019) used a mix of computational techniques to predict drug interactions, including artificial neural networks and graph node factor propagation methods such as adjacency matrix factorization (AMF) and adjacency matrix factorization with propagation (AMFP). The model was trained on a DrugBank release containing 1142 drugs and 45,297 drug–drug interactions and tested on the most recent DrugBank release, with 1442 drugs and 248,146 drug–drug interactions. AMF and AMFP were also used to develop an ensemble-based classifier, and the outcomes were assessed using the receiver operating characteristic (ROC) curve. The findings indicated that the proposed ensemble classifier provides important information for drug development and drug prescription, even when given noisy data. In addition, the drug embeddings learned while training the models on interaction networks have been made available.

To predict adverse drug events caused by DDIs, Hou et al. (2019) proposed a deep neural network model. The model is based on a database of 5000 drug codes obtained from DrugBank. Using the computed features, it identifies 80 different types of DDIs. The model, which takes 4432 drug features as input, was implemented with TensorFlow-GPU. The trained model can predict how drugs for inflammatory bowel disease (IBD) will interact, with an accuracy of 88 percent. The findings also showed that the model performs best when many datasets are used.

Wang et al. (2019) proposed a DNN model for detecting adverse drug reactions. The model predicts ADRs using chemical, biological, and biomedical information on drugs, and it also incorporates drug data from the SIDER databases. The system's performance was improved by using distributed representations: a word-embedding approach determines the association between drugs from the target drug representations in a vector space. The system's fundamental flaw was that it worked well only with the standard SIDER databases.

In 2020, numerous AI-based methods were developed for DDI event prediction, including methods that evaluate chemical structural similarity using graph neural networks (Huang et al. 2020). Attempts have also been made to predict DDIs from different data sources, such as using similarity features to construct drug features for the DDI event prediction task (Deng et al. 2020).

Using word embeddings, part-of-speech tags, and distance embeddings, Bai et al. (2020) proposed a deep learning technique that performs the DDI extraction task and supports the drug development cycle and drug repurposing. According to experimental data, the technique can better avoid instance misclassifications with minimal pre-processing. Moreover, the model employs an attention mechanism to emphasize the significance of each hidden state in the Bi-LSTM layers.

Feng et al. (2020) proposed DPDDI, which consists of a feature extractor based on a graph convolutional network (GCN) and a predictor based on a DNN. DPDDI is an effective and robust approach for predicting potential DDIs that uses data from the DDI network alone, without considering drug characteristics (i.e., drug chemical and biological properties). It is a useful tool for predicting DDIs and should also be helpful in other DDI-related scenarios, such as recognizing unanticipated side effects and guiding drug combinations. The disadvantage of this approach is that it ignores drug characteristics.

Zaikis and Vlahavas (2020) proposed a multi-level GNN framework for predicting links between biological entities. It builds a bi-level network in which the higher level represents the network of interactions among biological entities and the lower level represents the individual biological entities, such as drugs and proteins. The accuracy of the proposed model, however, still needs to be improved.

In 2021, to address DDI prediction, Lin et al. (2021) proposed an end-to-end system called Knowledge Graph Neural Network (KGNN). KGNN extends spatial GNN algorithms to the knowledge graph by selectively aggregating neighborhood information, allowing it to learn the knowledge graph's topological structure, its semantic relations, and the neighborhoods of drugs and drug-related entities. When multiple drugs are combined correctly, medical risks are reduced and the benefits of drug synergy are maximized. For multi-typed DDI pharmacological effect prediction, Yue et al. (2021) used knowledge graph summarization. Lyu et al. (2021) also introduced a Multimodal Deep Neural Network (MDNN) framework for DDI event prediction. By applying a graph neural network to the drug knowledge graph, MDNN effectively exploits topological information and semantic relations. MDNN additionally learns a joint representation of structural information and heterogeneous features, which effectively explores the cross-modal complementarity of the multimodal data. Karim et al. (2019) built a knowledge graph and used CNN and LSTM models to extract local and global drug features across the network. DANN-DDI is a deep attention neural network framework proposed by Liu et al. (2021); it carefully integrates different drug features to predict unknown DDIs.

Chun and Yi-Ping Phoebe (2021) developed a hybrid deep learning (DL) model to provide descriptive predictions of adverse drug reactions. It was one of the first hybrid DL models that could be interpreted. The model includes a graph CNN with Inception modules to improve the efficiency of learning chemical drug properties, and bidirectional long short-term memory (BiLSTM) recurrent neural networks to link drug structure to adverse effects. After the outputs of the two networks (GCNN and BiLSTM) are concatenated, a fully connected network is used to predict adverse drug reactions. Regardless of the classification threshold, the model obtains an AUC of 0.846, and it has a precision score of 0.925. Even though a small drug data set was used for adverse drug reaction (ADR) prediction, the Bilingual Evaluation Understudy (BLEU) scores were 0.973, 0.938, 0.927, and 0.318, indicating considerable achievements. Furthermore, the model can correctly form words to describe adverse drug reactions and link them to the drug's name and molecular structure. The predicted relationship between drug structure and ADRs can guide safety pharmacology research at the preclinical stage and make ADR detection easier early in the drug development process. It can also aid in detecting unknown ADRs of existing drugs.

Mohsen and Hossein () proposed a deep neural network model for DDI extraction from medical literature. The model employs a novel attention mechanism to better distinguish important words from other terms based on word similarity and position relative to the candidate drugs. Before recognizing the type of DDI, the method applies the attention weights to the outputs of a bi-directional long short-term memory (Bi-LSTM) model in the deep network architecture. The approach was tested on the standard DDI Extraction 2013 dataset. According to the experimental findings, it achieved an F1-score of 78.30, which is comparable to the best results reported for existing approaches.

In 2022, Pietro et al. (2022) introduced DruGNN, a GNN-based technique for predicting DDI side effects. The prediction is framed as a multi-class, multi-label node classification problem in which each DDI corresponds to a class. To predict the side effects of novel drugs, they use a mixed inductive-transductive learning scheme that takes advantage of drug and gene features (induction path) and knowledge of known drug side effects (transduction path). The approach is adaptable because the same machine learning framework can still be used if the graph dataset is extended with more node features and relationships. Zhang et al. (2022) proposed CNN-DDI, a new semi-supervised algorithm for predicting DDIs that uses a CNN architecture. They first extracted interaction features from drug categories, targets, pathways, and enzymes as feature vectors. Based on this feature representation, they then proposed a novel convolutional neural network as a predictor of DDI-related events. The predictor consists of five convolutional layers, two fully connected layers, and a softmax layer. The results show that CNN-DDI is superior to other state-of-the-art techniques, but it takes longer to run. Jing et al. (2022) presented DTSyn, a dual-transformer-based approach that can identify probable cancer drug combinations. It uses a multi-head attention mechanism to extract chemical substructure-gene, chemical-chemical, and chemical-cell-line associations. DTSyn is the first model that incorporates two transformer blocks to extract associations between genes, drugs, and cell lines, allowing a better understanding of drug mechanisms of action. Despite DTSyn's excellent performance, its balanced accuracy on independent data sets is still limited. Collecting more training data is expected to solve the problem. 
Another issue is that the fine-granularity transformer was only trained on 978 signature genes, which could result in some chemical-target interactions being lost.

Furthermore, DTSyn used expression data as the only cell-line features. To represent the cell line more fully, additional omics data, including methylation and genetic data, may be added in the future. He et al. (2022) proposed MFFGNN, a new end-to-end learning framework for DDI prediction that can effectively combine information from drug molecular graphs, SMILES sequences, and DDI graphs. The MFFGNN model uses a molecular graph feature extraction module to extract global and local features from molecular graphs.

They ran extensive experiments on a variety of real-world datasets. According to the results, the MFFGNN model consistently outperforms other state-of-the-art models. Furthermore, the multi-type feature fusion module uses a gating mechanism to control how much neighborhood information is passed to each node.

4.4 Drug–drug similarity prediction using DL

Drug similarity studies presume that drugs with comparable pharmacological properties have similar mechanisms of action and side effects and are used to treat similar conditions (Brown 2017; Zeng et al. 2019).

Pharmacological drug similarity is critical for various purposes, including identifying drug targets, predicting side effects, predicting drug–drug interactions, and repositioning drugs. Features of the chemical structure (Lu et al. 2017; O’Boyle 2016), protein targets (Vilar 2016; Wang et al. 2014), side-effect profiles (Campillos et al. 2008; Tatonetti et al. 2012), and gene expression profiles (Iorio et al. 2010) provide complementary perspectives for predicting similar drugs. They can compensate for gaps in individual data sources and offer fresh perspectives on drug repositioning and other uses. The main idea of drug–drug similarity is presented in Fig. 13. Each vector represents a drug's features, and the links reflect the similarity between two drugs.

Fig. 13 Drug–drug similarity main idea

4.4.1 Drug similarity measures

The similarity estimates are calculated from chemical structure, target protein sequences, target protein functions, and drug-induced pathways.

4.4.1.1 The similarity in chemical structure

DrugBank (2019) provides the chemical structures of small-molecule drugs in SDF molecular format. Invalid SDFs, such as those with an NA value or fewer than three columns in the atom or bond blocks, can be recognized and eliminated. For valid compounds, atom pair descriptors can be computed. The pairwise similarity of compounds, δ c ( di , dj ), was evaluated from atom pairs using the Tanimoto coefficient, which is defined as the number of atom pairs shared by the two compounds divided by the number of atom pairs in their union (Eq. 1).

\[ \delta_{c}(d_i, d_j) = \frac{\left| AP_i \cap AP_j \right|}{\left| AP_i \cup AP_j \right|} \qquad (1) \]

where AP i and AP j are the sets of atom pairs of drugs d i and d j , respectively. The numerator is the number of atom pairs common to both compounds, while the denominator is the total number of distinct atom pairs in the two compounds.
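As a minimal sketch of the Tanimoto calculation described above, the snippet below treats each drug's atom pairs as a plain Python set. The descriptor strings are hypothetical placeholders rather than real DrugBank output, and returning 0.0 for two empty sets is a convention assumed here, not something the source specifies.

```python
def tanimoto(ap_i, ap_j):
    """Tanimoto coefficient between two sets of atom-pair descriptors."""
    ap_i, ap_j = set(ap_i), set(ap_j)
    union = ap_i | ap_j
    if not union:
        return 0.0  # assumed convention when both descriptor sets are empty
    return len(ap_i & ap_j) / len(union)


# Hypothetical atom-pair descriptors (illustrative strings, not real data)
drug_a = {"C-C:1", "C-O:2", "C-N:3"}
drug_b = {"C-C:1", "C-O:2", "O-N:4", "C-C:2"}

# Two shared atom pairs out of five distinct atom pairs in the union
similarity = tanimoto(drug_a, drug_b)
print(f"Tanimoto similarity: {similarity:.2f}")
```

In practice a cheminformatics toolkit would generate the atom-pair descriptors from the SDF records; only the set arithmetic is shown here.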

4.4.1.2 Target protein sequence-based similarity

DrugBank provides the target sequences of all small-molecule drugs in FASTA format. The basic Needleman and Wunsch (1970) dynamic programming approach for global alignment can be used to compare protein sequences pairwise. The percentage of pairwise sequence identity (Raghava 2006) can be taken as the corresponding sequence similarity. Equation 2 was used to calculate drug–drug similarity based on target sequence similarities:

\[ \delta_{t}(d_i, d_j) = \frac{\sum_{x \in T_i} \max_{y \in T_j} S(x, y) + \sum_{y \in T_j} \max_{x \in T_i} S(x, y)}{\left| T_i \right| + \left| T_j \right|} \qquad (2) \]

where δ t ( di , dj ) denotes the target-based similarity between drugs di and dj, Ti is the set of proteins targeted by di, Tj is the set of proteins targeted by dj, and S(x,y) is a symmetric sequence-based similarity measure between two target proteins x \(\in \) Ti and y \(\in \) Tj. Overall, Eq. 2 calculates the average of the best matches, wherein each target of the first drug is matched only to the most similar target of the second drug, and vice versa.
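The best-match averaging described above can be sketched as follows, assuming the usual form of this measure: each target of one drug is paired with its most similar target of the other drug, in both directions, and the best scores are averaged over all targets. The function names, protein identifiers, and identity scores are hypothetical; in a real pipeline `seq_sim` would return the sequence identity from a global alignment.

```python
def best_match_average(targets_i, targets_j, seq_sim):
    """Average of best-match similarities between two target sets."""
    if not targets_i or not targets_j:
        return 0.0  # assumed convention when a drug has no known targets
    # Best match in T_j for every target of the first drug
    best_i = sum(max(seq_sim(x, y) for y in targets_j) for x in targets_i)
    # Best match in T_i for every target of the second drug
    best_j = sum(max(seq_sim(x, y) for x in targets_i) for y in targets_j)
    return (best_i + best_j) / (len(targets_i) + len(targets_j))


# Hypothetical, symmetric sequence-identity scores between target proteins
IDENTITY = {
    frozenset(("P1", "P3")): 0.9,
    frozenset(("P2", "P3")): 0.3,
}


def lookup_similarity(x, y):
    """Symmetric lookup; identical proteins score 1.0, unknown pairs 0.0."""
    if x == y:
        return 1.0
    return IDENTITY.get(frozenset((x, y)), 0.0)


# (0.9 + 0.3) from the first drug's targets, 0.9 from the second, over 3 targets
score = best_match_average(["P1", "P2"], ["P3"], lookup_similarity)
print(f"Target-based similarity: {score:.2f}")
```

Averaging in both directions keeps the measure symmetric even when the two drugs have different numbers of targets.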

4.4.1.3 Target protein functional similarity

Protein targets that are overrepresented by comparable biological functions and have similar sequences imply shared pharmacological mechanisms and downstream effects (Passi et al. 2018). Each protein is therefore associated with a set of Gene Ontology (GO) terms from all three categories: cellular components (CC), molecular functions (MF), and biological processes (BP). GO terms that were either very specialized (with 15 linked genes) or very general (with 100 genes) were filtered out. DrugBank (2019) provided the Human Protein–Protein Interaction (PPI) network. Wang et al. (2007) proposed leveraging the topology of the GO graph structure to determine the semantic similarity of GO terms, which was used to determine how functionally similar two drugs are, i.e., δ f (d i , d j ). Using a best-match average technique, the pairwise semantic similarities of the GO terms associated with d i and d j were aggregated into a single semantic similarity measure and entered into a final similarity matrix.

4.4.1.4 Drug-induced pathway similarity

A drug pair that triggers similar or overlapping pathways suggests that the drugs' mechanisms of action are similar, which is useful information for drug similarity and repositioning research (Zeng et al. 2015). KEGG (Kanehisa and Goto 2000) was used to find the pathways activated by each small-molecule drug. The pairwise similarity of any two pathways was calculated from the overlap of their constituent genes using the Dice similarity coefficient. After that, a pathway-based similarity score for each drug pair d i and d j , i.e., δ p ( d i , d j ), was calculated using Eq. 3:

\[ \delta_{p}(d_i, d_j) = \max_{x \in P_i,\; y \in P_j} DSC(x, y) \qquad (3) \]

where P i and P j are the sets of pathways induced by drugs d i and d j , respectively; x and y are two pathways, each represented by the set of its constituent genes; and \(DSC(x, y) = 2\left| x \cap y \right| / \left( \left| x \right| + \left| y \right| \right)\) is the Dice similarity coefficient, which measures how much the two pathways overlap. When no gene is shared by any two pathways induced by the drugs being compared, the similarity is set to 0.0. Overall, Eq. 3 implies that if two drugs induce one or more identical pathways, the maximum pathway-based similarity is achieved.
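A small sketch of the pathway-based score, assuming it is the maximum Dice coefficient over all pairs of pathways induced by the two drugs (which matches the behaviour described above: identical pathways give the maximum score, and no shared genes give 0.0). Pathways are modelled as sets of gene symbols; the gene names are placeholders, not KEGG data.

```python
def dice(x, y):
    """Dice similarity coefficient between two gene sets."""
    x, y = set(x), set(y)
    if not x and not y:
        return 0.0  # assumed convention for two empty pathways
    return 2 * len(x & y) / (len(x) + len(y))


def pathway_similarity(pathways_i, pathways_j):
    """Maximum Dice overlap over all pathway pairs of two drugs."""
    return max(
        (dice(x, y) for x in pathways_i for y in pathways_j),
        default=0.0,  # a drug with no annotated pathways scores 0.0
    )


# Hypothetical drug-induced pathways, each given as a set of gene symbols
pathways_a = [{"A", "B", "C"}, {"D", "E"}]
pathways_b = [{"B", "C", "F"}, {"X"}]

# Best pair shares genes B and C: 2 * 2 / (3 + 3)
score = pathway_similarity(pathways_a, pathways_b)
print(f"Pathway-based similarity: {score:.3f}")
```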

4.4.2 DL for drug similarity prediction

This section reviews the DL algorithms most widely used by researchers to predict drug similarity. Wang et al. (2019) introduced a gated recurrent unit (GRU) model that uses similarity to predict drug–disease interactions. In this approach, the Chemistry Development Kit (CDK) converted the SMILES strings into 2D chemical fingerprints, and the Jaccard score of the 2D chemical fingerprints was used to compare two drugs.

Hirohara et al. (2018) employed a CNN to learn molecular representations. In this approach, the molecule's SMILES notation is given as input to the convolutional layers. The TOX 21 dataset was used.

To conduct similarity analysis, Cheng et al. (2019) used the Anatomical Therapeutic Chemical (ATC) classification system and the ATC-code-based similarity of drug pairs. The authors created interaction networks, performed drug pair similarity analyses, and developed a network-based methodology for identifying clinically effective drug combinations for a specific condition.

Xin et al. (2016) presented a Ranking-based k-Nearest Neighbour (Re-KNN) technique for drug repositioning. The method's key feature is that it combines the Ranking SVM (Support Vector Machine) algorithm with the traditional KNN algorithm. The similarity computation methods they used are chemical structural similarity, target-based similarity, side-effect similarity, and topological similarity. The Tanimoto score was then used to determine the similarity between two profiles.

Seo et al. (2020) proposed an approach that combined drug–drug interactions from DrugBank, network-based drug–drug interactions, single nucleotide polymorphisms, and the anatomical hierarchy of side effects, as well as indications, targets, and chemical structures.

Zeng et al. (2019) developed a measure of clinical drug–drug similarity derived from clinical data, using electronic health records (EHRs) to analyse and establish drug–diagnosis connections. Using the Bonferroni-adjusted hypergeometric P value, they created connections between drugs and diagnoses in an electronic medical record (EMR) dataset. The distances between drugs were assessed using the Jaccard similarity coefficient, and a k-means algorithm was applied to form drug clusters.

Dai et al. (2020) reviewed and summarized representative methods and discussed applications of patient similarity. The authors discussed the value and applications of patient similarity networks. They also discussed ways to measure the similarity or distance between each pair of patients and classified them as unsupervised, supervised, or semi-supervised.

Yan et al. (2019) created BiRWDDA, a new computational method for drug repositioning that combines bi-random walk with multiple similarity measures to uncover potential associations between diseases and drugs. First, drug–drug and disease–disease similarities are assessed to identify the optimal drug and disease similarities. The information entropy of the drug and disease similarities is evaluated to select the appropriate ones. Four drug–drug similarity measures and three disease–disease similarity measures were calculated from drug- and disease-related characteristics to build a heterogeneous network. The four drug–drug similarities were based, respectively, on the drug's protein sequence information, the drug interactions extracted from DrugBank (compared using the Jaccard score), the chemical structure (canonical SMILES derived from DrugBank), and the side effects.

Yi et al. (2021) constructed a deep gated recurrent unit model to predict potential drug–disease interactions using a wide range of similarity measures and a Gaussian interaction profile kernel. A similarity measure based on chemical fingerprints is used to derive distinguishing drug features. Meanwhile, the Gaussian interaction profile kernel is used to derive effective disease features from established disease–disease relationships. After that, a deep gated recurrent unit model is built to predict potential drug–disease interactions. The experimental results showed that the proposed algorithm could be used to predict novel drug indications or disease treatments and to speed up drug repositioning and related drug research and discovery.

To predict DDIs, Yan et al. (2022) proposed a semi-supervised learning technique (DDI-IS-SL). DDI-IS-SL uses the cosine similarity method to calculate drug feature similarity by combining chemical, biological, and phenotypic data. The integrated drug information includes drug chemical structures, drug–target interactions, drug enzymes, drug transporters, drug pathways, drug indications, drug side effects, harmful effects of drug discontinuation, and known DDIs.

Heba et al. (2021) used DrugBank to develop a similarity-based machine learning framework called "SMDIP" (Similarity-based ML for Drug Interaction Prediction). They calculated drug–drug similarity using the Russell–Rao measure on the biological and structural data currently available in DrugBank to represent the limited feature space. The DDI classification is carried out using logistic regression, with an emphasis on finding the main similarity predictors. The key DDI features are then evaluated with six machine learning models (NB: naive Bayes; LR: logistic regression; KNN: k-nearest neighbours; ANN: artificial neural network; RFC: random forest classifier; SVM: support vector machine).
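To make the binary similarity measures mentioned in this section concrete, the sketch below compares two drugs encoded as equal-length binary feature vectors. Russell–Rao divides the number of features present in both drugs by the total number of features, whereas Jaccard divides by the number of features present in at least one drug. The vectors are hypothetical and the functions are illustrative, not the SMDIP implementation.

```python
def russell_rao(a, b):
    """Russell-Rao similarity: shared 1-bits divided by vector length."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    if not a:
        return 0.0  # assumed convention for empty vectors
    both = sum(1 for u, v in zip(a, b) if u and v)
    return both / len(a)


def jaccard(a, b):
    """Jaccard similarity: shared 1-bits divided by bits set in either."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    both = sum(1 for u, v in zip(a, b) if u and v)
    either = sum(1 for u, v in zip(a, b) if u or v)
    return both / either if either else 0.0


# Hypothetical binary feature vectors (e.g. target or enzyme indicators)
drug_a = [1, 1, 0, 0, 1, 0]
drug_b = [1, 0, 0, 1, 1, 0]

rr = russell_rao(drug_a, drug_b)  # 2 shared features out of 6 positions
jc = jaccard(drug_a, drug_b)      # 2 shared features out of 4 set in either
print(f"Russell-Rao: {rr:.3f}, Jaccard: {jc:.3f}")
```

Unlike Jaccard, Russell–Rao does not reach 1.0 for identical vectors unless every feature is present, so sparse profiles yield lower scores.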

For large-scale DDI prediction, Vilar et al. (2014) provided a procedure combining five types of drug similarity fingerprints (two-dimensional structural fingerprints, interaction profile fingerprints, target profile fingerprints, ADE profile fingerprints, and three-dimensional pharmacophoric approaches).

Song et al. (2022) used similarity theory and a convolutional neural network to create global structural similarity features. They employed a transformer to extract local chemical substructure semantic features for drugs and proteins. To create the global structural similarity features of drugs and proteins, the Tanimoto coefficient, the Levenshtein distance, and a CNN are all used in this study.

5 Benchmark datasets and databases

Drug development and discovery have drawn on a range of direct and indirect data sources, which have regularly demonstrated strong predictive capability in finding confirmed repositioning candidates and in other applications of computer-aided drug design. This section reviews the most important available benchmark datasets and databases used in drug discovery that researchers may need for each problem category. Thirty-five datasets are summarized in Table 3.

6 Evaluation metrics

Performance measures are required for evaluating machine learning models (Benedek et al. 2021). They serve as a tool for comparing different techniques and identifying the best one to deploy. This section describes the metrics defined for the four categories of drug discovery problems.

Table 4 shows the metrics employed in drug discovery problems. Understanding these metrics helps in assessing the effectiveness of various prediction systems. True positives (TP) are drug side effects that are present and were correctly predicted by the model. False positives (FP) are adverse pharmacological effects that are not present but were predicted by the model. True negatives (TN) are pharmacological side effects that are not present and were correctly not predicted. False negatives (FN) are adverse pharmacological effects that are present but that the model failed to predict.
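As an illustration of how these four counts are combined, the sketch below computes accuracy, precision, recall, and F1-score from a confusion matrix. These are standard definitions and may not cover every metric listed in Table 4; the counts are made up for the example.

```python
def classification_metrics(tp, fp, tn, fn):
    """Common metrics derived from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts for a side-effect predictor evaluated on 100 drug pairs
metrics = classification_metrics(tp=8, fp=2, tn=85, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With heavily imbalanced data, as is typical for side-effect prediction, accuracy can look high while recall stays modest, which is why several metrics are usually reported together.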

7 Drug dosing optimization

Drugs are vital to human health, and choosing the proper treatment and dose for the right patient is a constant problem for clinicians. Even when taken as studied and prescribed, drugs have adverse effect profiles and varying response rates. As a result, all medications must be well managed, especially those used to treat critical illnesses or those with a narrow exposure window between efficacy and toxicity. Clinicians follow typical guidelines for the first dose, which is not always optimal or safe for every patient, especially if the drug has not been evaluated at various doses in various patient types. Precision dosing can transform health care by increasing its benefits while reducing drug therapy risks. While precision dosing will probably have a significant effect for some drugs, it may not be necessary or practical to apply it to all drugs or therapeutic classes. As a result, recognizing the characteristics that make drugs suitable targets for precision dosing will help direct resources to where they will have the most impact. Identifying high-priority drugs and therapeutic classes for precision dosing could be crucial in achieving better health care performance, safety, and cost-effectiveness (Tyson et al. 2020).

Due to standard, fixed dosing procedures or gaps in knowledge, imprecise drug dosing in specific subpopulations increases the risk of potentiating adverse effects due to supratherapeutic or subtherapeutic concentrations (Watanabe et al. 2018). Currently, the Food and Drug Administration (FDA) only requires a drug to be statistically better than placebo or non-inferior to the existing standard of care. This does not guarantee that the drug will benefit most patients in clinical trials, especially for malignancies that are difficult to treat, such as diffuse intrinsic pontine glioma (DIPG) and unresectable meningioma, where therapy response rates can be exceedingly low (Fleischhack et al. 2019).

There are essential aspects of dose optimization ( https://friendsofcancerresearch.org/wpcontent/uploads/Optimizing_Dosing_in_Oncology_Drug_Development.pdf ) to consider when finding the most effective dose, which varies based on the product, the target population, and the available data:

Therapeutic properties: Drug features such as small molecule vs. large molecule and agonist vs. antagonist impact how drugs interact with the body regarding safety and efficacy. The therapeutic characteristics impact the first doses used in dose-finding studies and the procedures used to determine which doses should be used in registrational trials.

Patient populations: Patient demographics vary depending on tumour type, stage of disease, and comorbidities. Understanding how diverse factors influence the drug's efficacy may justify modifying the dose accordingly, especially in the context of expanded clinical trial populations.

Supplemental versus original approval: Differences in disease features and patient demographics between tumour types and treatment settings, such as monotherapy versus combination therapy, must be considered when assessing whether additional dose exploration is required for a supplemental application. In cases when more dose exploration is required, the research design can include previous exposure-response knowledge from the initial approval.

8 Drug discovery and XAI

The topic of XAI addresses one of the most serious flaws in ML and DL algorithms: model interpretability and explainability. Understanding how and why a prediction is formed becomes increasingly crucial as algorithms grow more sophisticated and can forecast with greater accuracy. Without interpretability and explainability, it would be impossible to trust the predictions of real-world AI applications. Human-comprehensible explanations will increase system safety while encouraging trust and sustained acceptance of machine learning technologies (). XAI has been studied as a way to circumvent the limitations that AI technologies face because of their black-box nature. In contrast to conventional AI approaches such as DL, XAI provides justifications for the model's decisions (Zhang et al. 2022). XAI approaches have attracted attention (Lipton 2018; Murdoch et al. 2019) as a way to compensate for the lack of interpretability of some ML models and to aid human decision-making and reasoning (Goebel et al. 2018). The purpose of presenting relevant explanations alongside mathematical models is to help users understand them better by (1) making the decision-making process more transparent (Doshi-Velez and Kim 2017), (2) ensuring that correct predictions are not made for the wrong reasons (Lapuschkin et al. 2019), (3) avoiding biases and discrimination that are unjust or unethical (Miller 2019), and (4) closing the gap between ML and other scientific disciplines. Effective XAI can also help scientists navigate the scientific process (Goebel et al. 2018), enabling them to fine-tune their understanding of and opinions on the process under investigation (Chander et al. 2018). This section provides an overview of recent XAI research in drug discovery.

XAI has a place in drug development. While the precise definition of XAI is still up for debate (Guidotti et al. 2018), the following characteristics of XAI are unquestionably beneficial in drug design applications (Lipton 2018):

Transparency: understanding how the system came to a specific result.

Justification: explaining why the model's response is suitable.

Informativeness: providing new information to human decision-makers.

Uncertainty estimation: determining the reliability of a prediction.

The molecular explanation of pharmacological activity is already possible with XAI (Xu et al. 2017; Ciallella and Zhu 2019), as are drug safety assessment and organic synthesis planning (Dey et al. 2018). Over time, XAI will become important for processing and interpreting increasingly complex chemical data and for generating new pharmaceutical hypotheses, all while preventing human bias (Boobier et al. 2017). Pressing drug discovery challenges such as the coronavirus pandemic may boost the development of application-specific XAI techniques that respond quickly to specific scientific questions relating to human biology and pathophysiology.

AI tools can increase their predictive performance by increasing model complexity. As a result, these models become opaque, with no clear understanding of how they operate. Because of this opacity, AI models are not widely used in critical domains such as medical care. XAI therefore focuses on understanding what goes into an AI model's prediction in order to meet the demand for transparency in AI tools. AI model interpretability approaches can be categorized by the algorithms used, the scale of interpretation, and the type of data (Adadi and Mohammed 2018). Regarding the objectives of interpretability, approaches can be grouped into white-box model development, black-box model explanation, model fairness enhancement, and predictive sensitivity testing (Guidotti et al. 2018).

The gradient-based attribution technique (Simonyan et al. 2014) attributes a prediction to the network's input features. Because this strategy is commonly used to explain a DNN's predictions, it may be a suitable solution for the various black-box DNN models used in DDI prediction (Quan et al. 2016; Sun et al. 2018). In addition, DeepLIFT is a popular method applied on top of DNN models that has been shown to outperform gradient-based techniques (Shrikumar et al. 2017). Alternatively, Guided Backpropagation may be used with convolutional network architectures (Springenberg 2015). A convolutional layer with increased stride can be used instead of max pooling in a CNN to deal with loss of precision. This method could be employed in CNN-based DDI prediction, as shown in Zeng et al. (2015).
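The idea behind gradient-based attribution can be illustrated with a deliberately tiny model. The sketch below assumes a logistic model p = sigmoid(w·x + b) standing in for a DNN, for which the gradient of the output with respect to each input is available in closed form; the absolute gradient is then used as a saliency score. The weights and input are hypothetical, and a real DDI model would rely on automatic differentiation instead.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Toy stand-in for a network: logistic model over input features."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def saliency(weights, bias, x):
    """Absolute gradient of the prediction with respect to each input."""
    p = predict(weights, bias, x)
    # d sigmoid(w.x + b) / d x_i = p * (1 - p) * w_i
    return [abs(p * (1.0 - p) * w) for w in weights]


# Hypothetical weights and input features (illustrative values only)
weights = [2.0, -0.5, 0.0]
bias = -1.0
x = [1.0, 1.0, 1.0]

scores = saliency(weights, bias, x)
ranking = sorted(range(len(x)), key=lambda i: scores[i], reverse=True)
print("saliency scores:", [round(s, 4) for s in scores])
print("features ranked by influence:", ranking)
```

A feature with zero weight receives zero saliency, while the feature with the largest absolute weight dominates the ranking.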

Furthermore, Tao et al. (2016) applied rationale extraction to neural networks for natural language processing. The method aims to find rationales, i.e., small pieces of the input text that are closely connected to the predicted outcome. Its design comprises two parts, a generator and an encoder, which together search for such text subsets. Because NLP-based models are used to extract DDIs (Quan et al. 2016), these methods should be examined as a way of improving model interpretability.

Aside from that, XAI has produced methods for developing white-box models, including linear, decision tree, rule-based, and advanced but transparent models. However, these approaches receive less attention because of their weak predictive ability, particularly in NLP-based tasks such as DDI extraction. Several ideas to address AI fairness have also been proposed. Nonetheless, only a small number of these studies have examined fairness for non-tabular data, such as the text-based data used in DDI extraction. Many DDI studies used word embedding methods (Quan et al. 2016; Zhang 2020; Bolukbasi 2016). As a result, efforts to ensure fairness in DDI research deserve more consideration. To ensure the reliability of AI models, numerous methods also examine the sensitivity of the models. Zügner et al. (2018) used adversarial example-based sensitivity analysis to explore graph-structured data. The technique modifies links between nodes or node features to attack node classification models. Because graph-based methods are frequently used in DDI research (Lin et al. 2021; Sun et al. 2020b), such methods might also be applied to DDI prediction models. For RNNs, word embedding perturbations (Miyato et al. 1605) are also worth considering. Notably, the input reduction strategy used by Feng et al. (2018) to expose hypersensitivity in NLP models could be applied to DDI extraction studies. In their DDI study, Schwarz et al. (2021) attempted to provide model interpretability using attention scores computed at all layers of the model. These scores are used to determine the importance of the similarity matrices to the drug representation vectors and to identify the drug properties that contribute to a better encoding. This method makes use of information that passes through all layers of the network.

Graph neural networks (GNNs) and their explainability are rapidly evolving in the field of graph data. GNNExplainer (Ying et al. 2019) uses mask optimization to learn soft masks for edges and node features in order to explain predictions. The soft masks are initialized at random and treated as trainable variables. GNNExplainer then combines the masks with the original graph using element-wise multiplication. After that, the masks are optimized by maximizing the mutual information between the predictions for the original graph and those for the newly obtained graph. Even though various regularization terms, such as element-wise entropy, encourage the optimized masks to be discrete, the resulting masks remain soft.

In addition, because the masks are optimized for each input graph separately, the explanations may lack a global view. To explain predictions, PGExplainer (Luo et al. 2020) learns approximated discrete edge masks. To predict edge masks, it trains a parameterized mask predictor. It starts by concatenating node embeddings to obtain an embedding for each edge in an input graph. The predictor then uses the edge embeddings to predict the probability of each edge being selected, which is regarded as an importance score. The reparameterization trick is then used to sample the approximated discrete masks. Finally, the mutual information between the original and new predictions is maximized to train the mask predictor. GraphMask (Schlichtkrull et al. 2010) is a post-hoc method that explains the importance of edges in each GNN layer. Like PGExplainer, it trains a classifier to predict whether an edge can be dropped without affecting the original predictions. A binary concrete distribution (Louizos et al. 1712) and the reparameterization trick are used to approximate discrete masks. The classifier is trained by minimizing a divergence term, which measures the difference between network predictions over the entire dataset. ZORRO (Thorben et al. 2021) employs discrete masks to identify important input nodes and node features. A greedy method is used to select nodes or node features from an input graph. At each step, ZORRO selects the node or node feature with the highest fidelity score. The objective function, the fidelity score, measures how closely the new predictions resemble the model's original predictions when the selected nodes/features are fixed and the remaining nodes/features are replaced with random noise values. Because no training process is involved, the non-differentiability of discrete masks is not a limitation.
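The greedy, fidelity-driven selection described for ZORRO can be illustrated on a toy problem. The sketch below is a simplified analogue on a flat feature vector rather than a graph, not the published algorithm: `model` is a hypothetical black-box classifier, and fidelity is estimated as the fraction of noise samples for which the prediction is unchanged when only the selected features are kept fixed.

```python
import random


def model(x):
    """Hypothetical black-box classifier: sign of a fixed weighted sum."""
    weights = [3.0, 0.1, -2.0, 0.05]
    return 1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else 0


def fidelity(x, selected, n_samples=200, seed=0):
    """Share of noisy inputs that keep the original prediction when the
    selected features are fixed and all others are replaced by noise."""
    rng = random.Random(seed)
    target = model(x)
    hits = 0
    for _ in range(n_samples):
        perturbed = [
            xi if i in selected else rng.uniform(-1.0, 1.0)
            for i, xi in enumerate(x)
        ]
        hits += model(perturbed) == target
    return hits / n_samples


def greedy_explain(x, k):
    """Greedily add the feature that yields the highest fidelity."""
    selected, order = set(), []
    for _ in range(k):
        candidates = [i for i in range(len(x)) if i not in selected]
        best = max(candidates, key=lambda i: fidelity(x, selected | {i}))
        selected.add(best)
        order.append(best)
    return order


x = [1.0, 1.0, -1.0, 1.0]
order = greedy_explain(x, k=2)
print("selected features, in order:", order)
print("fidelity with first feature only:", fidelity(x, {order[0]}))
```

Because the mask is discrete and chosen by search rather than by gradient descent, no training is needed, but each greedy step only finds a locally best choice.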

Furthermore, ZORRO avoids the "introduced evidence" problem by using hard masks. The greedy mask selection process, on the other hand, may result in locally optimal explanations. Furthermore, because masks are generated for each graph separately, the explanations may lack a global understanding. Causal Screening (Xiang et al. 2021) investigates the causal attribution of the various edges in the input graph. It identifies an edge mask for the explanatory subgraph. The essential concept behind causal attribution is to look at how predictions change when an edge is added to the current explanatory subgraph, which is called the causal effect. At each step, it examines the causal effects of many edges and selects one to include in the subgraph. It selects edges using the individual causal effect (ICE), which measures the change in mutual information after different edges are added to the subgraph.

Causal Screening, like ZORRO, is a greedy algorithm that generates discrete masks without any training. As a result, it does not suffer from the introduced evidence problem. However, it may lack a global understanding and become trapped in locally optimal explanations. SubgraphX (Yuan et al. 2102) investigates subgraph-level explanations for deep graph models. It uses the Monte Carlo Tree Search (MCTS) method (Silver et al. 2017) to efficiently explore different subgraphs by pruning nodes, and it chooses the most important subgraph from the leaves of the search tree as the explanation for the prediction.

Furthermore, Shapley values can be used as the objective function of its mask generation algorithm. The subgraphs it produces are more understandable to humans and better suited to graph data than those of previous perturbation-based approaches. However, the computational cost is higher because the MCTS algorithm explores many distinct subgraphs.

9 Success stories about using DL in drug discovery

As DL methodologies have advanced, big pharmaceutical companies have moved toward AI and away from conventional approaches in order to maximize the benefits for patients and for the companies themselves. AstraZeneca is a multinational, science-driven pharmaceutical company that has successfully used artificial intelligence in each stage of drug development, from virtual screening to clinical trials. By incorporating AI into medical science, the company could better understand current diseases, identify new targets, plan higher-quality clinical trials, and speed up the entire process. AstraZeneca's success illustrates how combining AI with medical science can yield strong results. Its collaborations with other AI-based companies demonstrate its continual efforts to increase AI utilization. One such collaboration is with Ali Health, an Alibaba subsidiary that wants to provide AI-assisted screening and diagnosis systems in China (Nag et al. 2022).

SARS-CoV-2 virus outbreak placed many businesses under duress to develop the best medicine in the shortest amount of time feasible. These businesses have turned to employ AI in conjunction based on the data available to attain their goals. Below are some examples of firms that have been successful in identifying viable strategies to combat the COVID-19 virus because of their efforts.

Deargen, a South Korean startup, developed MT-DTI (Molecule Transformer Drug Target Interaction), a DL-based drug–protein interaction prediction model. In this approach, the strength of the interaction between a drug and its target protein is predicted from simplified chemical sequences rather than 2D or 3D molecular structures. The model predicted that atazanavir, an FDA-approved antiviral used to treat HIV, is highly likely to bind to and inhibit a critical protein of SARS-CoV-2, the virus that causes COVID-19. It also identified three more antivirals, as well as Remdesivir, a not-yet-approved medicine that is currently being studied in patients. Deargen's ability to identify candidate antivirals using DL approaches is a significant step forward in pharmaceutical research, making the search less time-consuming and more efficient. If such treatments are thoroughly evaluated, there is a good chance that they could help stop the epidemic in its tracks (Beck et al. 2020; Scudellari 2020).
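As a rough illustration of what working with sequences instead of 2D or 3D structures means in practice, the sketch below turns a drug string and a protein string into fixed-length integer ids that a sequence model could consume. The character-level vocabulary, the padding scheme, and the example strings are assumptions made for this sketch and are not taken from MT-DTI.

```python
from typing import Dict, List


def build_vocab(sequences: List[str]) -> Dict[str, int]:
    """Map each character to an integer id; 0 is reserved for padding."""
    chars = sorted({ch for seq in sequences for ch in seq})
    return {ch: i + 1 for i, ch in enumerate(chars)}


def encode(seq: str, vocab: Dict[str, int], max_len: int) -> List[int]:
    """Convert a sequence to ids, truncating or zero-padding to max_len."""
    ids = [vocab[ch] for ch in seq[:max_len]]
    return ids + [0] * (max_len - len(ids))


smiles = "CC(=O)O"      # acetic acid, chosen only because it is short
protein = "MKTAYIAK"    # made-up amino-acid fragment

drug_vocab = build_vocab([smiles])
protein_vocab = build_vocab([protein])

drug_ids = encode(smiles, drug_vocab, max_len=10)
protein_ids = encode(protein, protein_vocab, max_len=6)
print(drug_ids, protein_ids)
```

A model of this kind would take both id sequences as input and output a predicted interaction strength; that part is omitted here.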

Another example is Benevolent AI, a biotechnology company in London that leverages medical information, AI, and machine learning to speed up health-related research. It has identified six medicines so far, one of which, Ruxolitinib, is reported to be in clinical trials for COVID-19 (Gatti et al. 2021). To find prospective medications that might impede the viral replication of SARS-CoV-2, the company has been using a large repository of medical information together with data extracted from the scientific literature by its AI and ML systems. The company received FDA permission to use Baricitinib, the medication it proposed, in conjunction with Remdesivir, which resulted in a higher recovery rate for hospitalized COVID-19 patients (Richardson et al. 2020).

Skin cancer is one of the most frequent forms of cancer around the globe. As its incidence continues to rise, it is becoming increasingly important to diagnose it at an early stage; research demonstrates that early identification and therapy improve the survival rate of skin cancer patients. With the advancement of medical research and AI, several skin cancer smartphone applications have been introduced to the market, allowing people with worrisome lesions to determine whether they should seek medical care. According to studies, over 235 dermatology smartphone apps were developed between 2014 and 2017 (Flaten et al. 2020). Previously, such apps worked by sending a snapshot of the lesion over the internet to a health care provider. Now, thanks to AI algorithms running on the smartphone, these applications can classify images of lesions as high or low risk, immediately assess the patient's risk, and offer advice. SkinVision (Carvalho et al. 2019) is an example of a successful application.

10 Future challenges

10.1 Digital twinning in drug discovery

The development and implementation of emerging Industry 4.0 technologies allow for the creation of digital twins (DTs), which promote the transformation of the industrial sector into a more agile and intelligent one. A DT is a digital representation of a real entity that interacts with the original through dynamic, two-way links. Today, DTs are being used in a variety of industries. Even though the pharmaceutical sector has begun to accept digitization and embrace Industry 4.0, there is yet to be a comprehensive implementation of DT in pharmaceutical manufacturing. As a result, it is vital to assess the pharmaceutical industry's progress in applying DT solutions (Chen et al. 2020).

New digital technologies are essential in today's competitive marketplaces to promote innovation, increase efficiency, and increase profitability (Legner et al. 2017 ). AI (Venkatasubramanian 2019 ), Internet of Things (IoT) devices (Venkatasubramanian 2019 ; Oztemel and Gursev 2018 ), and DTs have all piqued the interest of governments, agencies, academic institutions, and corporations (Bao et al. 2018 ). Industry 4.0 is a concept offered by a professional community to increase the level of automation to boost productivity and efficiency in the workplace.

This section provides a brief look at the evolution of DT and its application in pharmaceutical and biopharmaceutical production. We begin with an overview of the technology's principles and a brief history, then present various examples of DTs in pharmacology and drug discovery. We then discuss the significant technical and other issues that arise in these kinds of applications.

10.1.1 History and main concepts of digital twin

The idea of making a "twin" of a process or a product dates back to NASA's Apollo project in the late 1960s (Rosen et al. 2015; Mayani et al. 2018; Schleich et al. 2017), when it assembled two identical spacecraft. In this scenario, the "twin" was employed to imitate the counterpart's behavior in real time.

The DT, according to Guo et al. (2018), is a type of digital data structure that is generated as a separate entity and linked to the actual system. Michael Grieves presented the original definition of a DT in 2002 at the University of Michigan as part of an industry presentation on product lifecycle management (PLM) (Grieves 2014; Grieves and Vickers 2017; Stark et al. 2019). However, the first actual use of this notion, which gave rise to the current name, occurred in 2010, when NASA (the United States National Aeronautics and Space Administration) attempted to create virtual spacecraft simulators for testing (Glaessgen and Stargel 2012).

In theory, a DT is a digital reproduction or representation of a physical thing, process, or service. It is a computer simulation with unique features that dynamically connect the physical and digital worlds. The purpose of a DT is to model, evaluate, and improve a physical object in virtual space until it matches the predicted performance, at which time it can be created, or enhanced if already built, in the real world (Kamel et al. 2021; Marr 2017).

Since then, DT technology has gained popularity in both business and academia. The main components of present-day DTs are shown in Fig. 14. The theoretical model, however, comprises three parts: the real entity in the actual world, the digital entity in the virtual space, and the interconnection between them (Glaessgen and Stargel 2012).

Fig. 14 Main components of DT

In an ideal world, the digital component would have all the system's information that could be acquired from its physical counterpart (Kritzinger et al. 2018 ). When integrated with AI, IoT, and other recent intelligent systems, a DT can forecast how an object or process will perform.
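The three-part model (real entity, digital entity, and the connection between them) can be sketched as a small data structure. Everything in this example is hypothetical: the class names, the single temperature sensor, and the heater rule are invented to show the two-way flow of data and commands, not to describe any real plant.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PhysicalEntity:
    """Stand-in for the real asset: a vessel with one temperature sensor."""
    temperature_c: float = 37.0
    heater_on: bool = False

    def read_sensors(self) -> Dict[str, float]:
        return {"temperature_c": self.temperature_c}

    def apply(self, command: Dict[str, bool]) -> None:
        self.heater_on = command["heater_on"]


@dataclass
class DigitalEntity:
    """Virtual counterpart that mirrors state and recommends an action."""
    setpoint_c: float = 37.0
    history: List[float] = field(default_factory=list)

    def update(self, reading: Dict[str, float]) -> None:
        self.history.append(reading["temperature_c"])

    def recommend(self) -> Dict[str, bool]:
        # Turn the heater on when the latest reading is below the setpoint.
        return {"heater_on": self.history[-1] < self.setpoint_c}


def sync(physical: PhysicalEntity, digital: DigitalEntity) -> None:
    """The connection: data flows up to the twin, a command flows back down."""
    digital.update(physical.read_sensors())
    physical.apply(digital.recommend())


plant = PhysicalEntity(temperature_c=35.5)
twin = DigitalEntity()
sync(plant, twin)
print(twin.history, plant.heater_on)
```

In a real system the `recommend` step is where a predictive model, for example one trained with DL, would sit.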

10.1.2 Digital twin in pharmaceutical manufacturing

Developing a drug is lengthy and costly, requires efforts in biology, chemistry, and manufacturing, and has a low success rate. An estimated 50,000 hits (trial versions of compounds that are subsequently modified to develop a medication) are evaluated to develop one successful drug. Only one in every 12 therapeutic compounds that reach clinical trials in humans makes it to market successfully. Toxicity and lack of effectiveness (a medication's capacity to give a patient relief and slow the progression of a disease) contribute to more than 60% of all drug failures (Subramanian 2020).

Making the appropriate decisions about which targets, hits, leads, and compounds to pursue is important to a drug's successful market introduction. However, these decisions are based on in vitro (experiments in a test tube or petri dish) and in vivo (experiments in animals) systems, both of which correlate only weakly with clinical outcomes (Mak et al. 2014). An ideal decision support system for drug discovery would answer the following questions:

What is the magnitude of any target's influence on the desired clinical result?

Is the potential compound changing the target enough to change clinical outcomes?

Is the chemical sufficiently selective and free of side effects or harmful consequences?

Is the ineffectiveness attributable to the drug's failure to reach its target?

Has the trial chosen the appropriate dose and dosing regimen?

Are there any surrogate markers or biomarkers, such as cholesterol, that serve as a proxy for the root cause of the illness and can forecast a drug's success or failure?

Have the correct patients been chosen for the study?

Is it possible to identify hyper- and hypo-responders before the study begins?

Therapeutic failures are prevalent and difficult to address, given the complex process of developing drugs based on the points above. This issue must be addressed by combining data and observations from many stages of the drug development process and developing a system that can forecast an experiment's outcome or a chemical modification's influence on a therapeutic molecule. This highlights the significance of DT in the field of drug discovery.

In the United States, funding organizations such as DARPA, NSF, and DOE have actively supported bioprocess modeling at the genomic and cellular levels, resulting in high-profile programs such as BioSPICE (Kumar and Feidler 2003). These groups have shown that smaller models built to answer specific questions can greatly improve drug development efficiency. This would make it possible to apply the prediction methodology to various stages of the drug discovery and research process, including target validation, lead optimization, candidate selection, biomarker identification, development of assays and screens, and the improvement of clinical trials.

The pharmaceutical business is embracing the overall digitization trend in tandem with the US FDA's ambition to establish an agile, adaptable pharmaceutical manufacturing sector that delivers high-quality pharmaceuticals without considerable regulatory scrutiny (O’Connor et al. 2016). Industries are beginning to implement Industry 4.0 and DT principles and to use them for research and development (Barenji et al. 2019; Steinwandter et al. 2019; Lopes et al. 2019; Kumar et al. 2020; Reinhardt et al. 2020). Pharma 4.0 (Ierapetritou et al. 2016) is a digitalization initiative that integrates Industry 4.0 with International Council for Harmonisation (ICH) criteria to create a combined operating model and production control strategy.

As shown in Fig. 15, real-time monitoring of the system by Process Analytical Technology (PAT), data collection from the equipment and from the intermediate and finished products, and a global modelling and data analysis platform are some of the key requirements for achieving smart manufacturing with DT (Barenji et al. 2019). Quality-by-Design (QbD) and Continuous Manufacturing (CM) (Boukouvala et al. 2012), flowsheet modeling (Kamble et al. 2013), and PAT implementations (James et al. 2006) have all been used by the pharmaceutical industry to achieve this. Although some of these tools have been thoroughly examined, the full integration and development of DTs is still a work in progress.

Fig. 15 Main categories of smart manufacturing with DT

The pharmaceutical industry has used PAT in different programs across the steps involved in producing drugs (Nagy et al. 2013). Even though this has increased the use of PAT instruments, their implementation is mostly limited to research and development rather than large-scale manufacturing (Papadakis et al. 2018). In the small number of cases where they have been used in manufacturing, they have succeeded in decreasing production costs and enhancing product quality monitoring (Simon et al. 2019). The development of various PAT approaches, together with their proper implementation, is a vital component of a monitoring and control strategy (Boukouvala et al. 2012) and has provided a foundation for obtaining essential data from the physical component.

Papadakis et al. (2018) recently provided a framework for identifying efficient reaction pathways for pharmaceutical manufacturing (Rantanen and Khinast 2015), which comprises modeling workflows for reaction pathway identification, analysis of reactions and separations, process simulation, evaluation, optimization, and operation (Sajjia et al. 2017).

Data-driven modeling methods require the collection and use of a large number of experiments to develop models, and the resulting models depend solely on the datasets provided. Artificial neural networks (ANN) (Pandey et al. 2006; Cao et al. 2018), multivariate statistical analysis, and Monte Carlo methods (Badr and Sugiyama 2020) are all commonly used in pharmaceutical manufacturing. These methods are less computationally costly, but their predictions outside the dataset space are frequently unsatisfactory because the trained models lack an understanding of the underlying physics. Using IoT devices in pharmaceutical manufacturing lines results in massive volumes of collected data. The virtual component must receive these process data and critical quality attributes (CQAs) quickly and effectively. Additionally, several pharmaceutical process models need material properties for accurate prediction. As a result, a central database is necessary to give the virtual component access to all datasets (Lin-Gibson and Srinivasan 2019).
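The weakness of purely data-driven models outside the range of their training data can be shown with a small numerical example. The sketch fits a straight line to samples of a saturating response and compares the error inside and outside the sampled range. The response function `true_process` is invented for illustration and stands in for physics that the fitted model does not know.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = a * x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def true_process(x: float) -> float:
    """Hypothetical saturating response, standing in for the real physics."""
    return x / (1.0 + x)


train_x = [i / 10 for i in range(11)]  # training data cover 0.0 to 1.0 only
a, b = fit_line(train_x, [true_process(x) for x in train_x])


def predict(x: float) -> float:
    return a * x + b


err_inside = abs(predict(0.5) - true_process(0.5))
err_outside = abs(predict(5.0) - true_process(5.0))
print(f"error at x=0.5: {err_inside:.3f}, error at x=5.0: {err_outside:.3f}")
```

The fit is close within the sampled range but keeps rising linearly where the true response levels off, so the error grows sharply once the input leaves the dataset space.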

10.1.3 Digital twin in biopharmaceutical manufacturing

Biopharmaceutical manufacturing focuses on the synthesis of large-molecule-based entities, in various combinations, that have applications in the treatment of inflammatory, microbial, and cancer-related conditions (Glaessgen and Stargel 2012; Narayanan et al. 2020). The demand for biologic-based medications has risen in recent years, necessitating greater production efficiency and efficacy (Kamel et al. 2021). As a result, many businesses are switching from batch to continuous production and implementing intelligent manufacturing systems (Lin-Gibson and Srinivasan 2019). A DT, which incorporates the physical plant, data collection, data analysis, and system control, can aid in decision-making, risk analysis, product creation, and process prediction (Tao et al. 2018).

The components and structures of biological products are closely connected to treatment effectiveness (Read et al. 2010) and are very sensitive to the cell line and operating conditions. A thorough virtual description of the actual plant in a simulation environment is required to apply DT in biopharmaceutical manufacturing (Tao et al. 2018). This means that the simulation of each unit operation within an integrated model should accurately reflect the crucial process dynamics. Previous reviews (Narayanan et al. 2020; Tang et al. 2020; Farzan et al. 2017; Baumann and Hubbuch 2017; Smiatek et al. 2020; Olughu et al. 2019) focused on process modelling methodologies for both upstream and downstream operations.

Data from a biopharmaceutical monitoring system are typically heterogeneous in data types and time scales. A considerable amount of data is collected during biopharmaceutical manufacturing thanks to the deployment of real-time PAT sensors. As a result, data pre-processing is required to deal with missing data, visualize data, and reduce dimensions (Gangadharan et al. 2019). For batch biopharmaceutical production, Casola et al. (2019) presented data mining-based techniques for stemming, classifying, filtering, and clustering historical real-time data. Lee et al. (2012) combined different spectroscopic techniques and used data fusion to forecast the composition of raw materials.
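Two of the pre-processing steps mentioned above, handling missing readings and reconciling different time scales, can be sketched with plain Python. The forward-fill rule, the block-averaging rule, and the pH trace are assumptions chosen for illustration; a production pipeline would pick methods suited to each sensor.

```python
from typing import List, Optional


def forward_fill(values: List[Optional[float]]) -> List[Optional[float]]:
    """Replace each missing reading with the last observed one.

    Leading gaps stay as None because there is nothing to carry forward.
    """
    filled: List[Optional[float]] = []
    last: Optional[float] = None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def downsample_mean(values: List[float], factor: int) -> List[float]:
    """Average consecutive blocks so a fast signal matches a slower time scale."""
    blocks = [values[i:i + factor] for i in range(0, len(values), factor)]
    return [sum(block) / len(block) for block in blocks]


ph_per_second = [7.0, None, 7.2, 7.4, None, 7.6]  # made-up sensor trace
ph_clean = forward_fill(ph_per_second)
ph_per_3s = downsample_mean(ph_clean, factor=3)
print(ph_clean, ph_per_3s)
```

After these steps the fast pH signal is gap-free and sits on the same three-second grid as a slower measurement, so the two can be merged into one record per time step.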

10.2 AI-driven digital twins in today's pharmaceutical drug discovery

In the pharmaceutical industry, clinical studies raise challenges that make drug development incomplete, slow, uncertain, and potentially dangerous. For example, clinical trials do not truly reflect reality: only a small portion of a large and diverse population is represented among the billions of people on the planet, so it is not possible to see how each person will respond to a medicine. The rigorous physical and mental health requirements of clinical trials can also lead to failure because of a lack of eligible participants, and pharmaceutical firms struggle to recruit the precise number and kind of participants needed to comply with stringent trial designs. In addition, in most trials some participants receive a placebo instead of the actual drug, because this shows how sick individuals fare when they are not given the experimental medication; this implies that at least some trial participants do not receive the treatment. These issues can be addressed with digital twins, which can imitate a range of patient characteristics and so give a fair representation of how a medicine affects a larger population. AI-enabled digital twinning may shorten trial setup by revealing how a patient fares against the various inclusion and exclusion criteria, so that patients can be identified rapidly. Digital twins can also predict a patient's response, so placebos would not be required. Therefore, every patient in the trial can be given the new treatment, and digital twins can reduce the harmful impact of drugs in the early stages by decreasing the number of patients who need to be tested in the real world. Figure 16 illustrates a framework that runs all possible combinations: all treatment protocols are tested on a digital twin of the patient to discover an appropriate treatment protocol for this patient. Doing this quickly and accurately can provide the best-quality treatment without experimenting on the patient, which saves effort and cost and improves accuracy in determining an appropriate treatment protocol.

Fig. 16 AI-driven digital twins in today's pharmaceutical drug discovery
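The idea of testing every treatment protocol on a patient's digital twin and keeping the best one can be sketched as an exhaustive search. The `toy_twin` function, the drug names, the potency values, and the tolerance threshold are all invented for this example; a real twin would be a validated patient model.

```python
from itertools import product
from typing import Callable, Iterable, Tuple

Protocol = Tuple[str, int, int]  # (drug, dose in mg, doses per day)


def best_protocol(
    drugs: Iterable[str],
    doses_mg: Iterable[int],
    per_day: Iterable[int],
    simulate: Callable[[Protocol], float],
) -> Tuple[Protocol, float]:
    """Score every combination on the twin and return the top-scoring one."""
    scored = ((p, simulate(p)) for p in product(drugs, doses_mg, per_day))
    return max(scored, key=lambda item: item[1])


def toy_twin(protocol: Protocol) -> float:
    """Hypothetical twin: benefit grows with the daily dose but is penalised
    above a patient-specific tolerance. All numbers are invented."""
    drug, dose, freq = protocol
    daily = dose * freq
    potency = {"drug_a": 1.0, "drug_b": 0.8}[drug]
    tolerance = 200
    return potency * daily - 3.0 * max(0, daily - tolerance)


choice, score = best_protocol(
    ["drug_a", "drug_b"], [50, 100, 150], [1, 2], toy_twin
)
print(choice, score)
```

Exhaustive search is feasible here because the grid has only twelve protocols; larger protocol spaces would need a smarter search strategy.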

11 Open problems

This section discusses important issues in the progression from preclinical to clinical development and implementation in practice. These issues call for new ML solutions that support transparent, usable, and data-driven decision-making, accelerate drug discovery, and decrease the number of failures in clinical development phases.

Complex disorders, such as viral infections and advanced malignancies, frequently necessitate drug combinations (Julkunen et al. 2020; White et al. 2021). For example, kinase inhibitor combinations, or single compounds that block several kinases, may improve therapeutic efficacy and duration while combating treatment resistance in cancer (Attwood et al. 2021). While several ML models have been created to predict responses to pairwise drug–dose combinations, the systematic prediction of higher-order combination effects involving more than two medicines or targets remains an open problem. In cancer cell lines, tensor learning methods have permitted reliable prediction of pairwise drug combination dose-response matrices (Smiatek et al. 2020). This computationally efficient learning approach could use extensive pharmacogenomic data to determine which drug combinations, including higher-order combinations among novel therapeutic compounds and doses, are most promising for additional in vitro or in vivo testing in many kinds of preclinical models.

While potential toxicity and on-target efficacy are both important criteria for clinical development success, most existing ML models for predicting therapy response emphasize efficacy as the primary outcome. As a result, careful examination and prediction of toxic effects in simulated and preclinical settings are required to strike an acceptable balance between therapeutic efficacy and toxicity and so accelerate the next stages of drug development (Narayanan et al. 2020). Applying single-cell data and ML algorithms to develop combinations of anticancer drugs has shown the potential to boost the likelihood of clinical success (Tao et al. 2018). Transfer learning and in silico cell-type deconvolution techniques (Avila et al. 2020) may offer effective ways to reduce the need to generate large amounts of single-cell data to predict combination therapy responders and toxic effects, as well as the recommended dosage that optimizes both efficacy and safety.
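A short calculation helps explain why higher-order combinations are hard to cover experimentally. The sketch counts the dose settings needed to test every combination of a given order, assuming each drug is tried at the same number of dose levels. The library size of 100 compounds and the 5 dose levels are arbitrary figures chosen for illustration.

```python
from math import comb


def n_combination_experiments(n_drugs: int, order: int, n_doses: int) -> int:
    """Number of dose settings needed to test all `order`-drug combinations,
    with every drug measured at `n_doses` dose levels."""
    return comb(n_drugs, order) * n_doses ** order


pairs = n_combination_experiments(100, 2, 5)    # 4950 pairs x 25 dose settings
triples = n_combination_experiments(100, 3, 5)  # 161700 triples x 125 settings
print(pairs, triples)
```

Moving from pairs to triples multiplies the experimental burden by more than a hundredfold in this example, which is why models that predict unmeasured combinations are attractive.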

In addition, patient data and clinical profiles must be used to validate in silico therapy response predictions. Such real-world data are crucial for establishing the practical value of ML predictions and for providing guidance in clinical decision-making. For example, a no-go decision can be made early if the substance has harmful effects. Many of the issues currently encountered when using machine learning for drug discovery, particularly in clinical development, arise because current AI algorithms do not meet the requirements of clinical research. As a result, ML model validation requires systematic, comprehensive, and high-quality clinical data sets. Discovery methods must be thoroughly evaluated for accuracy and reproducibility using community-agreed performance measures in various settings, not just on a small collection of exemplary data sets. Sharing and using private patient information is possible with systems that separate the code from the data or use the model-to-data approach (Guinney and Saez-Rodriguez 2018), which makes it possible for federated learning to use patient-level data for model construction and thorough assessment.

Even though there are many applications in drug discovery, the majority of ML models, and particularly DL models, remain "black boxes", and interpretation by a human specialist is sometimes difficult (Jiménez-Luna et al. 2020). Mathematical models implemented as online decision support tools must be understandable to users to earn their confidence. Comprehensible, accessible, and explainable models should clearly state their optimization goals, such as synergy, efficacy, and/or toxicity.
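The model-to-data idea behind federated learning can be sketched with a one-parameter model: each site trains on its own records, and only the resulting weight leaves the site to be averaged. This is a minimal illustration of federated averaging under invented, noise-free data; the learning rate, step counts, and hospital data sets are assumptions, and real deployments add privacy and security safeguards that are not shown.

```python
from typing import List, Tuple

Site = List[Tuple[float, float]]  # (feature x, response y) pairs held locally


def local_update(w: float, data: Site, lr: float, steps: int) -> float:
    """Gradient descent on mean squared error for y ~ w * x, run at one site."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_average(w: float, sites: List[Site], rounds: int) -> float:
    """Each round, sites train locally; only weights are shared and averaged,
    weighted by the number of local records."""
    total = sum(len(s) for s in sites)
    for _ in range(rounds):
        local = [(local_update(w, s, lr=0.05, steps=5), len(s)) for s in sites]
        w = sum(wi * n for wi, n in local) / total
    return w


# Synthetic data (invented): both hospitals follow y = 2x with no noise.
hospital_a = [(1.0, 2.0), (2.0, 4.0)]
hospital_b = [(0.5, 1.0), (1.5, 3.0), (3.0, 6.0)]

w_global = federated_average(0.0, [hospital_a, hospital_b], rounds=20)
print(round(w_global, 6))
```

The patient-level pairs never leave their lists; the coordinating step sees only one number per site per round.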

DTI prediction is a notable example of a drug discovery research field. Research has been ongoing for more than 10 years and aims to enhance the effectiveness of computational models using various technologies. The most recent computational approaches for predicting DTIs are DL technologies. These use structure-free approaches that do not need 3D structural data or docking, which overcomes the restrictions imposed by the high-dimensional structures of the drug and the target protein. Despite the outstanding performance of DL, regression within DTI prediction remains a critical and difficult issue, and researchers could develop several strategies to improve prediction accuracy. Furthermore, data scarcity and the lack of a standardized benchmark database are still considered current research gaps.

While DL approaches show promise in predicting drug responses, especially when dealing with large amounts of data, drug response prediction research is in its early stages, and more efficient and relevant models are needed.

While DL techniques have shown to be effective in detecting DDIs, especially when dealing with large amounts of data, more promising algorithms that focus on complex molecular reactions need to be developed.

Only a few studies in the drug discovery field have investigated the explainability of their models, leaving much room for improvement. The explanations generated by XAI for human decision-making must be non-trivial, non-artificial, and helpful to the scientific community. For now, ensuring that XAI techniques achieve their goals and produce trustworthy responses requires a combined effort among DL specialists, cheminformaticians, chemists, biologists, data scientists, and other subject matter experts. As a result, we believe that more developed methodologies to explain black-box models for drug discovery fields such as DDIs, drug–target interactions, drug sensitivity, and drug side effects must be considered in the future to ensure model fairness and strict sensitivity evaluations of models. Further exploration of the capabilities and constraints of the existing chemical language for defining these models will be critical. The development of novel interpretable molecular representations for DL, and the deployment of self-explanatory algorithms alongside sufficiently accurate predictions, will be a critical area of research in the coming years. Because there are currently no methods that combine all the stated advantageous XAI characteristics (transparency, justification, informativeness, and uncertainty estimation), consensus techniques that draw on the advantages of many XAI approaches and boost model dependability will play a major role in the short and medium term. Currently, there is no open-community platform for exchanging and refining XAI software and model interpretations in drug discovery. As a result, we believe that future research into XAI in drug development has much potential.

12 Discussion

This section briefly describes how the analytical questions proposed in Sect. 2 are answered in the paper.

Several DL algorithms have been used to address the different categories of drug discovery problems, as described in detail in Sect. 4 with respect to the main categories of drug discovery problems in Fig. 8. In addition, a sample of these algorithms, with their methods, advantages, and weaknesses, is summarized in Table 2.

Recognizing the characteristics that make medications suitable targets for precision dosing will help direct resources to where they will have the most impact. Employing DL in drug dosing optimization is a major challenge, but one that can improve health care performance, safety, and cost-effectiveness, as presented in Sect. 7.

With the advancement of DL methods, big pharmaceutical businesses have migrated toward AI. One example is AstraZeneca, a multinational pharmaceutical business that has successfully used AI in every stage of drug development. Several success stories are presented in Sect. 9.

AQ4: What about using the newest technologies such as XAI and DT in drug discovery?

The topic of XAI addresses one of the most serious flaws in ML and DL algorithms: the lack of model interpretability and explainability. Without interpretability and explainability, it would be impossible to trust the predictions of real-world AI applications. Section 8 presents the literature that addresses this issue. A digital twin (DT) is a virtual representation of a real entity that is connected to the original in dynamic, reciprocal ways. Today, DTs are being used in a variety of industries. Even though the pharmaceutical sector has begun to accept digitization and embrace Industry 4.0, there is yet to be a comprehensive implementation of DT in pharmaceutical manufacturing. Success stories about employing DT in drug discovery are presented in Sect. 10.

AQ5: What are the future and open works related to drug discovery and DL?

Throughout the paper, we present how DL has succeeded in all aspects of drug discovery problems. However, it still poses important challenges for future research. Section 11 covers these challenges.

Figure 17 presents the percentage of the different DL applications for each building block of our study. The largest segment is devoted to drug discovery and DL because this is the core of our research.

Fig. 17 Percentages of DL applications for each category

13 Conclusion

Despite all the breakthroughs in pharmacology, developing new drugs still requires a great deal of time and money. As DL technology advances and the amount of drug-related data grows, many new DL-based approaches are appearing at every stage of the drug development process. In addition, large pharmaceutical corporations have migrated toward AI in the wake of the development of DL approaches.

Although drug discovery is a large field with several research categories, there are few review studies of the field, and each related study has focused on only one research category, such as reviewing DL applications for DTIs. The main goal of our research is therefore to present a systematic literature review (SLR) that integrates the recent DL technologies and applications for the different categories of drug discovery problems, including drug–target interactions (DTIs), drug–drug similarity interactions (DDIs), drug sensitivity and responsiveness, and drug side-effect predictions, together with the associated benchmark data sets and databases. Related topics such as XAI and DT, and how they support drug discovery problems, are also discussed. In addition, drug dosing optimization and success stories are presented. Finally, we suggest open problems as future research challenges.

Although DL has proved its strength in drug discovery problems, it remains a promising open research area for interested researchers. In this paper, they can find an overview of the use of DL in various drug discovery problems, along with success stories and open areas for future research.

Given the recent success of DL approaches and their use by pharmaceutical companies in identifying new medications, it seems clear that current DL techniques will be highly regarded in the next generation of large-scale data investigation and evaluation for drug discovery and development.

References

Abramovich I, Ben-Yehuda T, Cohen R (2018) Low-complexity video classification using recurrent neural networks. IEEE Int Conf Sci Electr Eng Israel (ICSEE) 2018:1–4. https://doi.org/10.1109/ICSEE.2018.8646076


Adadi A, Mohammed B (2018) Peeking inside the black-box: a survey on explainable artificial intelligence (XAI). IEEE Access 6:2169–3536


Ahmed KT, Park S, Jiang Q et al (2020) Network-based drug sensitivity prediction. BMC Med Genomics 13:193

Alankrita A, Mamta M, Gopi B (2021) Generative adversarial network: an overview of theory and applications. Int J Inf Manag Data Insights 1(1):100004

Yamashita R, Nishio M, Do RKG et al (2018) Convolutional neural networks: an overview and application in radiology. Insights Imaging 9:611–629. https://doi.org/10.1007/s13244-018-0639-9

Andreea D, Yu-Hsiang H, Petar V, Pietro L, Jian T (2019) Drug–drug adverse effect prediction with graph co-attention. https://arxiv.org/abs/1905.00534

Arshed MA, Mumtaz S, Riaz O, Sharif W, Abdullah S (2022) A deep learning framework for multi drug side effects prediction with drug chemical substructure. Int J Innovat Sci Technol 4(1):19–31

Arus-Pous J, Patronov A, Bjerrum EJ, Tyrchan C, Reymond JL, Chen H, Engkvist O (2020) SMILES-based deep generative scaffold decorator for de-novo drug design. J Cheminform 12:1–18

Asada M, Miwa M, Sasaki Y (2018) Enhancing drug–drug interaction extraction from texts by molecular structure information. In: proceedings of the 56th annual meeting of the association for computational linguistics. 2, pp 680–685, https://doi.org/10.18653/v1/P18-2108

Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT et al (2000) Gene ontology: tool for the unification of biology. Nat Genet 25:25–29

Attwood MM, Fabbro D, Sokolov AV et al (2021) Trends in kinase drug discovery: targets, indications and inhibitor design. Nat Rev Drug Discov 20(11):839–861

Avila C, Alquicira-Hernandez J, Powell JE et al (2020) Benchmarking of cell type deconvolution pipelines for transcriptomics data. Nat Commun 11(1):5650

Azad AKM, Dinarvand M, Nematollahi A, Swift J, Lutze-Mann L, Vafaee F (2021) A comprehensive integrated drug similarity resource for in-silico drug repositioning and beyond. Brief Bioinform 22(3):bbaa126. https://doi.org/10.1093/bib/bbaa126

Badr S, Sugiyama H (2020) A PSE perspective for the efficient production of monoclonal antibodies: integration of process, cell, and product design aspects. Curr Opin Chem Eng 27:121–128

Bao J, Guo D, Li J, Zhang J (2018) The modelling and operations for the digital twin in the context of manufacturing. Enterp Inf Syst 13:534–556

Baptista D, Ferreira PG, Rocha M (2021) Deep learning for drug response prediction in cancer. Brief Bioinform 22:360–379

Barenji RV, Akdag Y, Yet B, Oner L (2019) Cyber-physical-based PAT (CPbPAT) framework for Pharma 4.0. Int J Pharm 567:118445

Baumann P, Hubbuch J (2017) Downstream process development strategies for effective bioprocesses: Trends, progress, and combinatorial approaches. Eng Life Sci 17:1142–1158

Beck BR, Shin B, Choi Y, Park S, Kang K (2020) Predicting commercially available antiviral drugs that may act on the novel coronavirus (SARS-CoV-2) through a drug–target interaction deep learning model. Comput Struct Biotechnol J 18:784–790

Bedi P, Sharma C, Vashisth P, Goel D, Dhanda M (2015) Handling cold start problem in Recommender Systems by using Interaction Based Social Proximity factor. In: Proceeding of the 2015 international conference on advances in computing, communications and informatics, Kerala, India, 10–13 August 2015; pp 1987–1993

Benedek R, Stephen B, Andriy N, Michael U, Sebastian N, Eliseo P (2021) A unified view of relational deep learning for drug pair scoring. CoRR. https://arxiv.org/abs/2111.02916

Betsabeh T, Mansoor ZJ (2021) Using drug–drug and protein-protein similarities as feature vector for drug–target binding prediction. Chemom Intell Lab Syst 217:104405. https://doi.org/10.1016/j.chemolab.2021.104405

Bleakley K, Yamanishi Y (2009) Supervised prediction of drug–target interactions using bipartite local models. Bioinformatics 25:2397–2403

Bolukbasi T (2016) Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. Adv Neural Inf Process Syst 29. In: Identifying gender and sexuality of data subjects. https://cis.pubpub.org/pub/debiasing-word-embeddings-2016

Bongini P, Pancino N, Dimitri GM, Bianchini M, Scarselli F, Lio P (2022) Modular multi-source prediction of drug side-effects with DruGNN. http://arxiv.org/abs/2202.08147

Boobier S, Osbourn A, Mitchell JB (2017) Can human experts predict solubility better than computers? J Cheminform 9:63

Boukouvala F, Niotis V, Ramachandran R, Muzzio FJ, Ierapetritou MG (2012) An integrated approach for dynamic flowsheet modeling and sensitivity analysis of a continuous tablet manufacturing process. Comput Chem Eng 42:30–47

Brown AS, Patel CJ (2017) MeSHDD: literature-based drug-drug similarity for drug repositioning. J Am Med Inf Assoc 24(3):614–618

Camacho DM, Collins KM, Powers RK, Costello JC, Collins JJ (2018) Next-generation machine learning for biological networks. Cell 173:1581–1592

Campillos M et al (2008) Drug target identification using side-effect similarity. Science 321(5886):263–266. https://doi.org/10.1126/science.1158140

Cao H, Mushnoori S, Higgins B, Kollipara C, Fermier A, Hausner D, Jha S, Singh R, Ierapetritou M, Ramachandran R (2018) A systematic framework for data management and integration in a continuous pharmaceutical manufacturing processing line. Processes 6:53

Casola G, Siegmund C, Mattern M, Sugiyama H (2019) Data mining algorithm for pre-processing biopharmaceutical drug product manufacturing records. Comput Chem Eng 124:253–269

Chabner BA (2016) NCI-60 cell line screening: a radical departure in its time. J Natl Cancer Inst. https://doi.org/10.1093/jnci/djv388

Chander A, Srinivasan R, Chelian S, Wang J, Uchino K (2018) Working with beliefs: AI transparency in the enterprise. In: Joint proceedings of the ACM IUI 2018 workshops co-located with the 23rd ACM conference on intelligent user interfaces 2068 (eds Said, A. and Komatsu, T.) (CEUR-WS.org, 2018)

Chandra B, Sharma RK (2017) On improving recurrent neural network for image classification. Int Jt Conf Neural Netw (IJCNN) 2017:1904–1907. https://doi.org/10.1109/IJCNN.2017.7966083

Chang Y, Park H, Yang HJ, Lee S, Lee KY, Kim TS, Jung J, Shin JM (2018) Cancer drug response profile scan (CDRscan): a deep learning model that predicts drug effectiveness from cancer genomic signature. Sci Rep 8:1–11

Chauhan R, Ghanshala KK, Joshi RC (2018) Convolutional neural network (CNN) for image detection and recognition. First Int Conf Secure Cyber Comput Commun (ICSCCC) 2018:278–282. https://doi.org/10.1109/ICSCCC.2018.8703316

Chen AW (2018) Predicting adverse drug reaction outcomes with machine learning. Int J Commun Med Public Health 5(3):901–904

Chen JY, Mamidipalli S, Huan T (2009) Happi: an online database of comprehensive human annotated and predicted protein interactions. BMC Genomics 10(1):S16

Chen X, Liu M-X, Yan G-Y (2012) Drug–target interaction prediction by random walk on the heterogeneous network. Mol BioSyst 8:1970–1978. https://doi.org/10.1039/C2MB00002D

Chen Y, Yang O, Sampat C, Bhalode P, Ramachandran R, Ierapetritou M (2020) Digital twins in pharmaceutical and biopharmaceutical manufacturing: a literature review. Processes 8(9):1088. https://doi.org/10.3390/pr8091088

Cheng F, Kovács IA, Barabási AL (2019) Network-based prediction of drug combinations. Nat Commun 10(1):1–11

Chiu Y-C, Chen H-IH, Zhang T, Zhang S, Gorthi A, Wang L-J, Huang Y, Chen Y (2019) Predicting drug response of tumors from integrated genomic profiles by deep neural networks. BMC Med Genomics 12:119

Chu X, Lin Y, Gao J, Wang J, Wang Y, Wang L (2018) Multi-label robust factorization autoencoder and its application in predicting drug–drug interactions. arXiv:1811.00208

Chu X, Lin Y, Wang Y, Wang L, Wang J, Gao J (2019) MLRDA: a multi-task semi-supervised learning framework for drug–drug interaction prediction. In: Proceedings of the international joint conference on artificial intelligence, pp 4518–4524

Ciallella HL, Zhu H (2019) Advancing computational toxicology in the big data era by artificial intelligence: data-driven and mechanism-driven modeling for chemical toxicity. Chem Res Toxicol 32:536–547

Cortes-Ciriano I, Ain QU, Subramanian V, Lenselink EB, Méndez-Lucio O, IJzerman AP, Wohlfahrt G, Prusis P, Malliavin TE, van Westen GJP et al (2015) Polypharmacology modelling using proteochemometrics (PCM): recent methodological developments, applications to target families, and future prospects. Medchemcomm 6:24–50

Cortés-Ciriano I, Bender A (2019) KekuleScope: prediction of cancer cell line sensitivity and compound potency using convolutional neural networks trained on compound images. J Cheminform 11:1–16

Dai L, Zhu H, Liu D (2020) Patient similarity: methods and applications. http://arxiv.org/abs/2012.01976

David L, Arús-Pous J, Karlsson J, Engkvist O, Bjerrum EJ, Kogej T, Kriegl JM, Beck B, Chen H (2019) Applications of deep-learning in exploiting large-scale and heterogeneous compound data in industrial pharmaceutical research. Front Pharmacol 10:1303

Davis MI, Hunt JP, Herrgard S, Ciceri P, Wodicka LM, Pallares G, Hocker M, Treiber DK, Zarrinkar PP (2011) Comprehensive analysis of kinase inhibitor selectivity. Nat Biotechnol 29:1046–1051

De Carvalho TM, Noels E, Wakkee M, Udrea A, Nijsten T (2019) Development of smartphone apps for skin cancer risk assessment: progress and promise. JMIR Dermatol 2(1):e13376

De Kuijper GM, Risselada A, van Dijken R (2019) Monitoring drug side-effects. Handbook of intellectual disabilities. Springer, Cham, pp 275–301

“deepchem/deepchem: Democratizing Deep-Learning for Drug Discovery”; Quantum Chemistry, Materials Science and Biology; Available online: https://github.com/deepchem/deepchem (accessed on 15 April 2022).

Dey S, Luo H, Fokoue A, Hu J, Zhang P (2018) Predicting adverse drug reactions through interpretable deep learning framework. BMC Bioinform 19:476

Dincer AB, Celik S, Hiranuma N, Lee S-I (2018) DeepProfile: deep learning of cancer molecular profiles for precision medicine. bioRxiv. https://doi.org/10.1101/278739

Ding MQ, Chen L, Cooper GF, Young JD, Lu X (2018) Precision oncology beyond targeted therapy: combining omics data with machine learning matches the majority of cancer cells to effective therapeutics. Mol Cancer Res 16:269–278

Doshi-Velez F, Kim B (2017) Towards a rigorous science of interpretable machine learning. https://arxiv.org/abs/1702.08608

DrugBank (2019) DrugBank Release Version 5.1.3, chemical structures. https://www.drugbank.com

Dua D, Graff C (2017) UCI machine learning repository. https://archive.ics.uci.edu/ml/index.php

El-Deredy W et al (1997) Pretreatment prediction of the chemotherapeutic response of human glioma cell cultures using nuclear magnetic resonance spectroscopy and artificial neural networks. Cancer Res 57:4196–4199

Farzan P, Mistry B, Ierapetritou MG (2017) Review of the important challenges and opportunities related to modeling of mammalian cell bioreactors. AIChE J 63:398–408

Fatehifar M, Karshenas H (2021) Drug–drug interaction extraction using a position and similarity fusion-based attention mechanism. J Biomed Inf 115:103707. https://doi.org/10.1016/j.jbi.2021.103707

Feng S et al (2018) Pathologies of neural models make interpretations difficult. http://arxiv.org/abs/1804.07781

Feng Q, Dueva E, Cherkasov A, Ester M (2018) PADME: a deep learning-based framework for drug–target interaction prediction. arXiv:1807.09741

Feng YH, Zhang SW, Shi JY (2020) DPDDI: a deep predictor for drug–drug interactions. BMC Bioinform 21:419. https://doi.org/10.1186/s12859-020-03724-x

Ferdousi R, Safdari R, Omidi Y (2017) Computational prediction of drug–drug interactions based on drugs functional similarities. J Biomed Inform. https://doi.org/10.1016/j.jbi.2017.04.021

Finn RD et al (2013) Pfam: the protein families database. Nucleic Acids Res 42(D1):D222–D230

Flaten HK, St Claire C, Schlager E, Dunnick CA, Dellavalle RP (2020) Growth of mobile applications in dermatology. Dermatol Online J 24(2):13–16

Fleischhack G, Massimino M, Warmuth-Metz M, Khuhlaeva E, Janssen G, Graf N et al (2019) Nimotuzumab and radiotherapy for treatment of newly diagnosed diffuse intrinsic pontine glioma (DIPG): a phase III clinical study. J Neurooncol 143:107–113. https://doi.org/10.1007/s11060-019-03140-z

Fokoue A, Sadoghi M, Hassanzadeh O, Zhang P (2016) Predicting drug–drug interactions through large-scale similarity-based link prediction. In: European semantic web conference 2016 May 29; pp 774–789

Fushman D, Shooshan SE, Rodriguez L, Aronson AR, Lang F, Rogers W, Tonning J (2018) A dataset of 200 structured product labels annotated for adverse drug reactions. Sci Data 5:180001

Gangadharan N, Turner R, Field R, Oliver SG, Slater N, Dikicioglu D (2019) Metaheuristic approaches in biopharmaceutical process development data analysis. Bioprocess Biosyst Eng 42:1399–1408

Gao Z et al (2008) PDTD: a web-accessible protein database for drug target identification. BMC Bioinf 9(1):104

Gao KY, Fokoue A, Luo H, Iyengar A, Dey S, Zhang P (2017) Interpretable drug target prediction using deep neural representation. In: Proceedings of the international joint conference on artificial intelligence, Melbourne, Australia, 19–25 August 2017

Gao K, Duy Nguyen D, Sresht V, Mathiowetz AM, Tu M, Wei G-W (2019) Are 2D fingerprints still valuable for drug discovery? Phys Chem Chem Phys 22:8373–8390

Gatti M, Turrini E, Raschi E, Sestili P, Fimognari C (2021) Janus kinase inhibitors and coronavirus disease (COVID)-19: rationale, clinical evidence and safety issues. Pharmaceuticals 14(8):738

Gaulton A et al (2011) ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res 40(D1):D1100–D1107

Gilmer J, Schoenholz SS, Riley PF, Vinyals O, Dahl GE (2017) Neural message passing for quantum chemistry. 34th Int Conf Mach Learn ICML 3:2053–2070

Glaessgen EH, Stargel DS (2012) The digital twin paradigm for future NASA and US Air Force vehicles. In: Proceedings of the 53rd AIAA/ASME/ASCE/AHS/ASC structures, structural dynamics and materials conference, Honolulu, HI, USA. https://ntrs.nasa.gov/citations/20120008178

Goebel R et al (2018) Explainable AI: the new 42? In: Holzinger A, Kieseberg P, Tjoa A, Weippl E (eds) Machine learning and knowledge extraction. CD-MAKE Lecture Notes in Computer Science. Springer, New York

Gómez-Bombarelli R et al (2018) Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent Sci 4:268–276

Grieves M (2014) Digital twin: manufacturing excellence through virtual factory replication. Glob J Eng Sci Res. https://doi.org/10.5281/zenodo.1493930

Grieves M, Vickers J (2017) Digital twin: mitigating unpredictable undesirable emergent behavior in complex systems. Springer, Cham, pp 85–113

Guidotti R et al (2018) A survey of methods for explaining black box models. ACM Comput Surv 51:93

Guinney J, Saez-Rodriguez J (2018) Alternative models for sharing confidential biomedical data. Nat Biotechnol 36(5):391–392

Gunther S et al (2007) SuperTarget and Matador: resources for exploring drug–target relationships. Nucleic Acids Res 36:D919–D922

Hamilton WL (2020) Graph representation learning. Synth Lect Artif Intell Mach Learn 14:1–159

Han X, Xie R, Li X, Li J (2022) SmileGNN: drug–drug interaction prediction based on the SMILES and graph neural network. Life (Basel) 12(2):319. https://doi.org/10.3390/life12020319

Hao M, Wang Y, Bryant SH (2016) Improved prediction of drug–target interactions using regularized least squares integrating with kernel fusion technique. Anal Chim Acta 909:41

Hassan-Harrirou H, Zhang C, Lemmin T (2020) RosENet: improving binding affinity prediction by leveraging molecular mechanics energies with an ensemble of 3D convolutional neural networks. J Chem Inf Model 60:2791–2802

He C, Liu Y, Li H, Zhang H, Mao Y, Qin X, Liu L, Zhang X (2022) Multi-type feature fusion based on graph neural network for drug-drug interaction prediction. BMC Bioinf 23(1):1–8

Hecker N et al (2011) SuperTarget goes quantitative: update on drug–target interactions. Nucleic Acids Res 40(D1):D1113–D1117

Hermanto A, Adji TB, Setiawan NA (2015) Recurrent neural network language model for English-Indonesian machine translation: experimental study. Int Conf Sci Inf Technol (ICSITech) 2015:132–136. https://doi.org/10.1109/ICSITech.2015.7407791

Hinton G (2011) Boltzmann machines. In: Sammut C, Webb GI (eds) Encyclopedia of machine learning. Springer, Boston

Hirohara M, Saito Y, Koda Y, Sato K, Sakakibara Y (2018) Convolutional neural network based on SMILES representation of compounds for detecting chemical motif. BMC Bioinform 19:83–94

Hizukuri Y, Sawada R, Yamanishi Y (2015) Predicting target proteins for drug candidate compounds based on drug-induced gene expression data in a chemical structure-independent manner. BMC Med Genomics 8:82

Hou X, You J, Hu P (2019) Predicting drug–drug interactions using deep neural network. In: Proceedings of the 11th international conference on machine learning and computing, pp 168–172

http://zinc.docking.org

https://bioinf-applied.charite.de/supernatural_new/index.php

https://friendsofcancerresearch.org/wpcontent/uploads/Optimizing_Dosing_in_Oncology_Drug_Development.pdf

https://ncats.nih.gov/tox21

https://pharmacodb.pmgenomics.ca/datasets/4

https://sites.broadinstitute.org/ccle/

https://string-db.org/cgi/download.pl?sessionId=uKr0odAK9hPs

https://www.cancer.gov/about-nci/organization/ccct/ctrp

https://www.ebi.ac.uk/chebi/

https://www.sciencedirect.com/topics/drug-response

Hu J, Gao J, Fang X, Liu Z, Wang F, Huang W, Wu H, Zhao G (2022) DTSyn: a dual-transformer-based neural network to predict synergistic drug combinations. bioRxiv. https://doi.org/10.1101/2022.03.29.486200

Huang C-T et al (2018) A large-scale gene expression intensity-based similarity metric for drug repositioning. iScience 7:40–52

Huang K, Xiao C, Hoang TN, Glass LM, Sun J (2020) Caster: predicting drug interactions with chemical substructure representation. In: AAAI 2020 34th AAAI Conference on Artificial Intelligence, American Association for Artificial Intelligence (AAAI) Press, pp 702–709

Ibrahim H, El Kerdawy AM, Abdo A, Eldin AS (2021) Similarity-based machine learning framework for predicting safety signals of adverse drug–drug interactions. Inf Med Unlocked 26:100699

Ierapetritou M, Muzzio F, Reklaitis G (2016) Perspectives on the continuous manufacturing of powder-based pharmaceutical processes. AIChE J 62:1846–1862

Iorio F et al (2010) Discovery of drug mode of action and drug repositioning from transcriptional responses. PNAS 107(33):14621–14626. https://doi.org/10.1073/pnas.1000138107

Iorio F, Knijnenburg TA, Vis DJ, Bignell GR, Menden MP, Schubert M, Aben N, Gonçalves E, Barthorpe S, Lightfoot H et al (2016) A landscape of pharmacogenomic interactions in cancer. Cell 166:740–754

James M, Stanfield CF, Bir G (2006) A review of process analytical technology (PAT) in the US pharmaceutical industry. Curr Pharm Anal 2:405–414

Ji ZL, Han LY, Yap CW, Sun LZ, Chen X, Chen YZ (2003) Drug adverse reaction target database (DART). Drug Saf 26(10):685–690

Jiménez-Luna J, Grisoni F, Schneider G (2020) Drug discovery with explainable artificial intelligence. Nat Mach Intell 2(10):573–584

Julkunen H, Cichonska A, Gautam P et al (2020) Leveraging multi-way interactions for systematic prediction of pre-clinical drug combination effects. Nat Commun 11(1):6136

Kamath U, Liu J (2021) Explainable artificial intelligence: an introduction to interpretable machine learning. Springer, Cham

Kamble R, Sharma S, Varghese V, Mahadik K (2013) Process analytical technology (PAT) in pharmaceutical development and its application. Int J Pharm Sci Rev Res 23:212–223

Kamel Boulos MN, Zhang P (2021) Digital twins: from personalised medicine to precision public health. J Person Med 11(8):745

Kanehisa M, Goto S (2000) KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res 28(1):27–30

Karim MR, Cochez M, Jares JB, Uddin M, Beyan O, Decker S (2019) Drug–drug interaction prediction based on knowledge graph embeddings and convolutional-LSTM network. In: Proceedings of the 10th ACM international conference on bioinformatics, computational biology and health informatics, pp 113–123

Karpov P, Godin G, Tetko IV (2020) Transformer-CNN: Swiss knife for QSAR modeling and interpretation. J Cheminform 12:17

Kastrin A, Ferk P, Leskošek B (2018) Predicting potential drug–drug interactions on topological and semantic similarity features using statistical learning. PLoS ONE 13(5):e0196865

Keum J, Nam H (2017) SELF-BLM: prediction of drug–target interactions via self-training SVM. PLoS ONE 12:e0171839

Kim S, Thiessen PA, Bolton EE, Chen J, Fu G, Gindulyte A, Han L, He J, He S, Shoemaker BA et al (2016) PubChem substance and compound databases. Nucleic Acids Res 44:D1202–D1213

Kim J, Park S, Min D, Kim W (2021) Comprehensive survey of recent drug discovery using deep learning. Int J Mol Sci 22:9983. https://doi.org/10.3390/ijms22189983

Koes DR, Baumgartner MP, Camacho CJ (2013) Lessons learned in empirical scoring with smina from the CSAR 2011 benchmarking exercise. J Chem Inf Model 53:1893–1904

Kohonen T (1990) The self-organizing map. Proc IEEE 78(9):1464–1480

Korkmaz S (2020) Deep learning-based imbalanced data classification for drug discovery. J Chem Inf Model 60:4180–4190

Kritzinger W, Karner M, Traar G, Henjes J, Sihn W (2018) Digital Twin in manufacturing: a categorical literature review and classification. IFAC-PapersOnLine 51:1016–1022

Kuenzi BM et al (2020) Predicting drug response and synergy using a deep learning model of human cancer cells. Cancer Cell 38(5):672–684. https://doi.org/10.1016/j.ccell.2020.09.014

Kuhn M et al (2010) A side effect resource to capture phenotypic effects of drugs. Mol Syst Biol 6(1):343

Kuhn M et al (2013) STITCH 4: integration of protein–chemical interactions with user data. Nucleic Acids Res 42(D1):D401–D407

Kumar SP, Feidler JC (2003) BioSPICE: a computational infrastructure for integrative biology. OMICS J Integr Biol 7(3):225. https://doi.org/10.1089/153623103322452350

Kumar S, Talasila D, Gowrav M, Gangadharappa H (2020) Adaptations of pharma 4.0 from industry 4.0. Drug Invent Today 14:405–415

Lamb J, Crawford ED, Peck D, Modell JW, Blat IC, Wrobel MJ, Lerner J, Brunet JP, Subramanian A, Ross KN et al (2006) The connectivity map: using gene-expression signatures to connect small molecules, genes, and disease. Science 313:1929–1935

Lapuschkin S et al (2019) Unmasking clever Hans predictors and assessing what machines really learn. Nat Commun 10:1096

Lee CY, Chen YP (2021) Descriptive prediction of drug side-effects using a hybrid deep learning model. Int J Intell Syst 36(6):2491–2510

Lee H, Kim W (2019) Comparison of target features for predicting drug–target interactions by deep neural network based on large-scale drug-induced transcriptome data. Pharmaceutics 11:377

Lee HW, Christie A, Xu J, Yoon S (2012) Data fusion-based assessment of raw materials in mammalian cell culture. Biotechnol Bioeng 109:2819–2828

Lee G, Park C, Ahn J (2019) Novel deep learning model for more accurate prediction of drug–drug interaction effects. BMC Bioinform 20(1):415

Lee I, Keum J, Nam H (2019) DeepConv-DTI: prediction of drug–target interactions via deep learning with convolution on protein sequences. PLoS Comput Biol 15:1–21

Legner C, Eymann T, Hess T, Matt C, Böhmann T, Drews P, Mädche A, Urbach N, Ahlemann F (2017) Digitalization: opportunity and challenge for the business and information systems engineering community. Bus Inf Syst Eng 59:301–308

Lei T, Barzilay R, Jaakkola T (2016) Rationalizing neural predictions. In: Proceedings of the 2016 conference on empirical methods in natural language processing, Austin, Texas. Association for Computational Linguistics, pp 107–117. https://aclanthology.org/D16-1011

Li M, Wang Y, Zheng R, Shi X, Wu F, Wang J et al (2019) DeepDSC: a deep learning method to predict drug sensitivity of cancer cell lines. IEEE/ACM Trans Comput Biol Bioinform

Lian M, Du W, Wang X, Yao Q (2021) Drug–target interaction prediction based on multi-similarity fusion and sparse dual-graph regularized matrix factorization. IEEE Access 9:99718–99730. https://doi.org/10.1109/ACCESS.2021.3096830

Lin X, Quan Z, Wang Z-J, Ma T, Zeng X (2021) KGNN: knowledge graph neural network for drug–drug interaction prediction. In: Proceedings of the twenty-ninth international joint conference on artificial intelligence, Japan; IJCAI'20

Lin-Gibson S, Srinivasan V (2019) Recent industrial roadmaps to enable smart manufacturing of biopharmaceuticals. IEEE Trans Autom Sci Eng 2019:1–8

Lipton ZC (2018) The mythos of model interpretability. Queue 16:31–57

Liu Y, Wu M, Miao C, Zhao P, Li X-L (2016) Neighborhood regularized logistic matrix factorization for drug–target interaction prediction. PLoS Comput Biol 12:e1004760

Liu B, Ramsundar B, Kawthekar P, Shi J, Gomes J, Luu Nguyen Q, Ho S, Sloane J, Wender P, Pande V (2017) Retrosynthetic reaction prediction using neural sequence-to-sequence models. ACS Cent Sci 3:1103–1113

Liu N, Chen CB, Kumara S (2019) Semi-supervised learning algorithm for identifying high-priority drug–drug interactions. IEEE J Biomedic Health Inform. https://doi.org/10.1109/JBHI.2019.2932740

Liu K, Sun X, Jia L, Ma J, Xing H, Wu J, Gao H, Sun Y, Boulnois F, Fan J (2019a) Chemi-net: a molecular graph convolutional network for accurate drug property prediction. Int J Mol Sci 20:3389

Liu P, Li H, Li S, Leung KS (2019b) Improving prediction of phenotypic drug response on cancer cell lines using deep convolutional network. BMC Bioinform 20:408

Liu S, Huang Z, Qiu Y, Chen Y-PP, Zhang W (2019c) Structural network embedding using multi-modal deep auto-encoders for predicting drug–drug interactions. IEEE Int Conf Bioinform Biomed 2019:445–450. https://doi.org/10.1109/BIBM47256.2019.8983337

Liu S, Zhang Y, Cui Y, Qiu Y, Deng Y, Zhang W, Zhang Z (2021) Enhancing drug–drug interaction prediction using deep attention neural networks. BioRxiv. https://doi.org/10.1101/2021.03.16.435553

Lopes MR, Costigliola A, Pinto R, Vieira S, Sousa JMC (2019) Pharmaceutical quality control laboratory digital twin—a novel governance model for resource planning and scheduling. Int J Prod Res 58:1–15

Louizos C, Welling M, Kingma DP (2017) Learning sparse neural networks through L0 regularization. http://arxiv.org/abs/1712.01312

Lu Y, Guo Y, Korhonen A (2017) Link prediction in drug–target interactions network using similarity indices. BMC Bioinf 18(1):39. https://doi.org/10.1186/s12859-017-1460-z

Luo Y, Zhao X, Zhou J, Yang J, Zhang Y, Kuang W, Peng J, Chen L, Zeng J (2017) A network integration approach for drug–target interaction prediction and computational drug repositioning from heterogeneous information. Nat Commun 8:573

Luo D, Cheng W, Xu D, Yu W, Zong B, Chen H, Zhang X (2020) Parameterized explainer for graph neural network. Adv Neural Inf Process Syst 33:19620–19631

Lyu T, Gao J, Tian L, Li Z, Zhang P, Zhang J (2021) MDNN: a multimodal deep neural network for predicting drug–drug interaction events. In: Proceedings of the thirtieth international joint conference on artificial intelligence (IJCAI-21), pp 3536–3542. https://doi.org/10.24963/ijcai.2021/487

Ma T, Xiao C, Zhou J, Wang F (2018) Drug similarity integration through attentive Multiview graph auto-encoders. In: IJCAI 2018, proceedings of the 27th international joint conference on artificial intelligence, pp 3477–3483

Mahajan D, Kumar D (2018) Sentiment analysis using RNN and Google translator. In: 2018 8th international conference on cloud computing, data science & engineering (Confluence), pp 798–802. https://doi.org/10.1109/CONFLUENCE.2018.8442924

Mak IWY, Evaniew N, Ghert M (2014) Lost in translation: animal models and clinical trials in cancer treatment. Am J Transl Res 6:114–118

Marr B (2017) What is digital twin technology and why is it so important? Forbes. https://www.forbes.com/sites/bernardmarr/2017/03/06/what-is-digital-twin-technology-and-why-is-it-so-important

Matsuzaka Y, Uesawa Y (2019) Prediction model with high-performance constitutive androstane receptor (CAR) using DeepSnap-deep learning approach from the tox21 10K compound library. Int J Mol Sci 20:4855

Maul J-T, Djamei V, Kolios AG, Meier B, Czernielewski J, Jungo P (2016) Efficacy and survival of systemic psoriasis treatments: an analysis of the SWISS registry SDNTT. Dermatology 232(6):640–647

Mayani MG, Svendsen M, Oedegaard SI (2018) Drilling digital twin success stories the last 10 years. In: Proceedings of the SPE Norway one day seminar, Bergen, Norway. https://doi.org/10.2118/191336-MS

Metz JT, Johnson EF, Soni NB, Merta PJ, Kifle L, Hajduk PJ (2011) Navigating the kinome. Nat Chem Biol 7:200–202

Miller T (2019) Explanation in artificial intelligence: insights from the social sciences. Artif Intell 267:1–38

Miyato T, Dai AM, Goodfellow I (2016) Adversarial training methods for semi-supervised text classification. http://arxiv.org/abs/1605.07725

Mohamed C, Nsiri B, Abdelmajid S, Abdelghani EM, Brahim B (2020) Deep convolutional networks for image segmentation: application to optic disc detection. Int Conf Electr Inf Technol (ICEIT) 2020:1–3. https://doi.org/10.1109/ICEIT48248.2020.9113204

Mukhamediev RI, Symagulov A, Kuchin Y, Yakunin K, Yelis M (2021) From classical machine learning to deep neural networks: a simplified scientometric review. Appl Sci 11:5541. https://doi.org/10.3390/app11125541

Murdoch WJ, Singh C, Kumbier K, Abbasi-Asl R, Yu B (2019) Definitions, methods, and applications in interpretable machine learning. Proc Natl Acad Sci USA 116:22071–22080

Nag S, Baidya ATK, Mandal A et al (2022) Deep learning tools for advancing drug discovery and development. 3 Biotech 12:110. https://doi.org/10.1007/s13205-022-03165-8

Nagy ZK, Fevotte G, Kramer H, Simon LL (2013) Recent advances in the monitoring, modelling, and control of crystallization systems. Chem Eng Res Des 91:1903–1922

Narayanan H, Luna MF, von Stosch M, Cruz Bournazou MN, Polotti G, Morbidelli M, Butte A, Sokolov M (2020) Bioprocessing in the digital age: the role of process models. Biotechnol J 15:e1900172

Nascimento ACA, Prudêncio RBC, Costa IG (2016) A multiple kernel learning algorithm for drug–target interaction prediction. BMC Bioinforma 17:46

Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48(3):443–453

Nguyen T, Nguyen TT, Nguyen T, Le DH (2021) Graph convolutional networks for drug response prediction. IEEE/ACM Trans Comput Biol Bioinform 19:146–154

O’Connor TF, Yu LX, Lee SL (2016) Emerging technology: a key enabler for modernizing pharmaceutical manufacturing and advancing product quality. Int J Pharm 509:492–498

O’Boyle NM, Sayle RA (2016) Comparing structural fingerprints using a literature-based similarity benchmark. J Cheminform 8(1):1–14. https://doi.org/10.1186/s13321-016-0148-0

Olughu W, Deepika G, Hewitt C, Rielly C (2019) Insight into the large-scale upstream fermentation environment using scaled-down models. J Chem Technol Biotechnol 94:647–657

Oughtred R, Rust J, Chang C, Breitkreutz BJ, Stark C, Willems A, Boucher L, Leung G, Kolas N, Zhang F, Dolma S (2021) The BioGRID database: a comprehensive biomedical resource of curated protein, genetic, and chemical interactions. Protein Sci 30(1):187–200

Oztemel E, Gursev S (2018) Literature review of Industry 4.0 and related technologies. J Intell Manuf 31:127–182

Ozturk H, Ozturk A, Ozkirimli E (2018) DeepDTA: Deep drug–target binding affinity prediction. Bioinformatics 34:i821–i829

Pandey P, Katakdaunde M, Turton R (2006) Modeling weight variability in a pan coating process using Monte Carlo simulations. AAPS Pharm Sci Tech 7:E2–E11

Papadakis E, Woodley JM, Gani R (2018) Perspective on PSE in pharmaceutical process development and innovation. In: Process systems engineering for pharmaceutical manufacturing. Elsevier, Amsterdam, pp 597–656

Passi A et al (2018) RepTB: a gene ontology-based drug repurposing approach for tuberculosis. J Cheminform 10(1):24. https://doi.org/10.1186/s13321-018-0276-9

Peng J, Li J, Shang X (2020) A learning-based method for drug–target interaction prediction based on feature representation learning and deep neural network. BMC Bioinform 21:1–13

Perozzi B, Al-Rfou R, Skiena S (2014) DeepWalk: online learning of social representations. In: Proceeding of the ACM SIGKDD international conference on knowledge discovery and data mining, New York, NY, USA, 24–27 August 2014, pp 701–710

Poluzzi E, Raschi E, Piccinni C, De Ponti F (2012) Data mining techniques in pharmacovigilance: analysis of the publicly accessible FDA adverse event reporting system (AERS). In: Data mining applications in engineering and medicine. London, United Kingdom: IntechOpen. https://doi.org/10.5772/50095

Pouryahya M, Oh JH, Mathews JC, Belkhatir Z, Moosmüller C, Deasy JO, Tannenbaum AR (2022) Pan-cancer prediction of cell-line drug sensitivity using network-based methods. Int J Mol Sci 23:1074. https://doi.org/10.3390/ijms23031074

Qiu K, Lee J, Kim H, Yoon S, Kang K (2021) Machine learning based anti-cancer drug response prediction and search for predictor genes using cancer cell line gene expression. Genomics Inform. https://doi.org/10.5808/gi.20076

Quan C et al (2016) Multichannel convolutional neural network for biological relation extraction. BioMed Res Int. https://doi.org/10.1155/2016/1850404

Raghava GP, Barton GJ (2006) Quantification of the variation in percentage identity for protein sequence alignments. BMC Bioinf 7(1):415. https://doi.org/10.1186/1471-2105-7-415

Rampášek L et al (2019) Improving drug response prediction via modeling of drug perturbation effects. Bioinformatics. https://doi.org/10.1093/bioinformatics/btz158

Rantanen J, Khinast J (2015) The future of pharmaceutical manufacturing sciences. J Pharm Sci 104:3612–3638

Read EK, Park JT, Shah RB, Riley BS, Brorson KA, Rathore AS (2010) Process analytical technology (PAT) for biopharmaceutical products: Part I. Concepts and applications. Biotechnol Bioeng 105:276–284

Reinhardt IC, Oliveira DJC, Ring DDT (2020) Current perspectives on the development of industry 4.0 in the pharmaceutical sector. J Ind Inf Integr 18:100131

Ren S, Tao Y, Yu K et al (2022) De novo prediction of Cell-Drug sensitivities using deep learning-based graph regularized matrix factorization. Pacif Symp Biocomput. https://doi.org/10.7490/f1000research.1118807.1

Reza F, Reza S, Yadollah O (2017) Computational prediction of drug–drug interactions based on drugs functional similarities. J Biomed Inform 70:54–64

Richardson P, Griffin I, Tucker C, Smith D, Oechsle O, Phelan A, Rawling M, Savory E, Stebbing J (2020) Baricitinib as potential treatment for 2019-nCoV acute respiratory disease. Lancet (London, England) 395(10223):e30

Rifaioglu AS, Atas H, Martin MJ, Cetin-Atalay R, Atalay V, Dogan T (2019) Recent applications of deep learning and machine intelligence on in silico drug discovery: methods, tools and databases. Brief Bioinform 20:1878–1912

Rosen R, von Wichert G, Lo G, Bettenhausen KD (2015) About the importance of autonomy and digital twins for the future of manufacturing. IFAC-PapersOnLine 48:567–572

Ryu JY, Kim HU, Lee SY (2018) Deep learning improves prediction of drug–drug and drug–food interactions. PNAS 115(18):E4304–E4311

Sachdev K, Gupta MK (2019) A comprehensive review of feature-based methods for drug–target interaction prediction. J Biomed Inform 93:103159

Sajjia M, Shirazian S, Kelly CB, Albadarin AB, Walker G (2017) ANN analysis of a roller compaction process in the pharmaceutical industry. Chem Eng Technol 40:487–492

Sarker IH (2021) Deep learning: a comprehensive overview on techniques, taxonomy, applications and research directions. SN Comput Sci 2:420. https://doi.org/10.1007/s42979-021-00815-1

Sawada R, Iwata M, Tabei Y, Yamato H, Yamanishi Y (2018) Predicting inhibitory and activatory drug targets by chemically and genetically perturbed transcriptome signatures. Sci Rep 8:156

Schleich B, Anwer N, Mathieu L, Wartzack S (2017) Shaping the digital twin for design and production engineering. CIRP Ann 66:141–144

Schlichtkrull MS, De Cao N, Titov I (2020) Interpreting graph neural networks for NLP with differentiable edge masking. http://arxiv.org/abs/2010.00577

Schwarz K (2021) AttentionDDI: Siamese attention-based deep learning method for drug–drug interaction predictions. BMC Bioinf 22(1):412

Scudellari M (2020) Five companies using AI to fight coronavirus. https://spectrum.ieee.org/the-human-os/artificial-intelligence/medical-ai/companies-ai-coronavirus

Seo S, Lee T, Kim MH, Yoon Y (2020) Prediction of side effects using comprehensive similarity measures. BioMed Res Int. https://doi.org/10.1155/2020/1357630

Shang C, Liu Q, Chen KS, Sun J, Lu J, Yi J, Bi J (2018) Edge attention-based multi-relational graph convolutional networks. arXiv 2018; arXiv:1802.04944 .

Shao K, Zhang Z, He S, Bo X (2020) DTIGCCN: prediction of drug–target interactions based on GCN and CNN. In: Proceedings of the 2020 IEEE 32nd international conference on tools with artificial intelligence (ICTAI), Baltimore, MD, USA, 9–11 November 2020, pp 337–342

Sharifi-Noghabi H, Zolotareva O, Collins CC, Ester M (2019) MOLI: multi-omics late integration with deep neural networks for drug response prediction. Bioinformatics 35:i501–i509

Shin B, Park S, Kang K, Ho JC (2019) Self-attention based molecule representation for predicting drug–target interaction. Proc Mach Learn Res 106:1–18

Shoemaker RH (2006) The NCI60 human tumour cell line anticancer drug screen. Nat Rev Cancer 6:813–823

Shrikumar A, Greenside P, Kundaje A (2017) Learning important features through propagating activation differences. In: Proceedings of the 34th international conference on machine learning 2017; 70, JMLR.org: Sydney, NSW, Australia. pp 3145–3153

Shtar G, Rokach L, Shapira B (2019) Detecting drug–drug interactions using artificial neural networks and classic graph similarity measures. PLoS ONE 14(8):e0219796

Silver D, Schrittwieser J, Simonyan K, Antonoglou I, Huang A, Guez A, Hubert T, Baker L, Lai M, Bolton A et al (2017) Mastering the game of Go without human knowledge. Nature 550(7676):354–359

Simon LL, Kiss AA, Cornevin J, Gani R (2019) Process engineering advances in pharmaceutical and chemical industries: Digital process design, advanced rectification, and continuous filtration. Curr Opin Chem Eng 25:114–121

Simonyan K, Vedaldi A, Zisserman A (2014) Deep inside convolutional networks: visualising image classification models and saliency maps. In: 2nd international conference on learning representations, ICLR 2014, Banff, AB, Canada, April 14–16, 2014, Workshop Track Proceedings; http://arxiv.org/abs/1312.6034

Smiatek J, Jung A, Bluhmki E (2020) Towards a digital bioprocess replica: computational approaches in biopharmaceutical development and manufacturing. Trends Biotechnol 38(10):1141–1153. https://doi.org/10.1016/j.tibtech.2020.05.008

Song T, Zhang X, Ding M, Rodriguez-Paton A, Wang S, Wang G (2022) DeepFusion: a deep learning based multi-scale feature fusion method for predicting drug–target interactions. Methods 204:269–277

Springenberg JT (2015) Striving for simplicity: the all-convolutional Net. CoRR, http://arxiv.org/abs/1412.6806

Stark R, Fresemann C, Lindow K (2019) Development and operation of digital twins for technical systems and services. CIRP Ann 68:129–132

Steinwandter V, Borchert D, Herwig C (2019) Data science tools and applications on the way to Pharma 4.0. Drug Discov Today 24:1795–1805

Stokes JM, Yang K, Swanson K, Jin W, Cubillos-Ruiz A, Donghia NM, MacNair CR, French S, Carfrae LA, Bloom-Ackerman Z et al (2020) A deep learning approach to antibiotic discovery. Cell 180:688-702.e13

Subramanian K (2020) Digital twin for drug discovery and development—the virtual liver. J Indian Inst Sci 100:653–662. https://doi.org/10.1007/s41745-020-00185-2

Subramanian A, Narayan R, Corsello SM, Peck DD, Natoli TE, Lu X, Gould J, Davis JF, Tubelli AA, Asiedu JK et al (2017) A next generation connectivity map: L1000 platform and the first 1,000,000 profiles. Cell 171:1437-1452.e17

Sun X, Ma L, Du X, Feng J, Dong K (2018) Deep convolution neural networks for drug–drug interaction extraction. In: 2018 IEEE international conference on bioinformatics and biomedicine (BIBM), pp 1662–1668. https://doi.org/10.1109/BIBM.2018.8621405

Sun M, Zhao S, Gilvary C, Elemento O, Zhou J, Wang F (2020a) Graph convolutional networks for computational drug development and discovery. Brief Bioinform 21:919–935

Sun M, Wang F, Elemento O, Zhou J (2020b) Structure-based drug–drug interaction detection via expressive graph convolutional networks and deep sets. Proc AAAI Conf Artif Intell 34(10):13927–13928. https://doi.org/10.1609/aaai.v34i10.7236

System HSL (2006) Psychoactive Drug Screening Program. https://www.hsls.pitt.edu/obrc/index.php?page=URL1133202727

Tajbakhsh N et al (2016) Convolutional neural networks for medical image analysis: full training or fine tuning? IEEE Trans Med Imaging 35(5):1299–1312. https://doi.org/10.1109/TMI.2016.2535302

Tang J, Szwajda A, Shakyawar S, Xu T, Hintsanen P, Wennerberg K, Aittokallio T (2014) Making sense of large-scale kinase inhibitor bioactivity data sets: a comparative and integrative analysis. J Chem Inf Model 54:735–743

Tang P, Xu J, Louey A, Tan Z, Yongky A, Liang S, Li ZJ, Weng Y, Liu S (2020) Kinetic modeling of Chinese hamster ovary cell culture: factors and principles. Crit Rev Biotechnol 40:265–281

Tao F, Cheng J, Qi Q, Zhang M, Zhang H, Sui F (2018) Digital twin-driven product design, manufacturing and service with big data. Int J Adv Manuf Technol 94:3563–3576

Tatonetti NP et al (2012) Data-driven prediction of drug effects and interactions. Sci Transl Med 4(125):125ra31. https://doi.org/10.1126/scitranslmed.3003377

Tatonetti NP, Patrick PY, Daneshjou R, Altman RB (2012) Data driven prediction of drug effects and interactions. Sci Transl Med 4(125):125ra31

Tehseen Z, Usman Z (2019) Long short-term memory recurrent neural network architectures for Urdu acoustic modelling. Int J Speech Technol 22(1):21–30. https://doi.org/10.1007/s10772-018-09573-7

Thafar M, Raies AB, Albaradei S, Essack M, Bajic VB (2019) Comparison study of computational prediction tools for drug–target binding affinities. Front Chem 7:782. https://doi.org/10.3389/fchem.2019.00782

Thafar MA, Olayan RS, Ashoor H, Albaradei S, Bajic VB, Gao X et al (2020) DTiGEMS+: drug–target interaction prediction using graph embedding, graph mining, and similarity-based techniques. J Cheminform 12:1–17

Thafar MA, Alshahrani M, Albaradei S et al (2022) Affinity2Vec: drug–target binding affinity prediction through representation learning, graph mining, and machine learning. Sci Rep 12:4751. https://doi.org/10.1038/s41598-022-08787-9

Thorben F, Megha Kh, Avishek A (2021) Hard masking for explaining graph neural networks. In Submitted to international conference on learning representations https://openreview.net/forum?id=uDN8pRAdsoC

Tian X, Xin M, Luo J, Jiang Z (2016) Using the ranking-based KNN approach for drug repositioning based on multiple information. Springer, Cham, pp 317–327

Tong H, Heidemeyer M, Ban F, Cherkasov A, Ester M (2017) SimBoost: A read-across approach for predicting drug–target binding affinities using gradient boosting machines. J Cheminform 9:1–14

Torng W, Altman RB (2019) Graph convolutional neural networks for predicting drug–target interactions. J Chem Inf Model 59:4131–4149

Townshend RJL, Powers A, Eismann S, Derry A (2021) ATOM3D: tasks on molecules in three dimensions. arXiv 2021: arXiv:2012.04035

Trißl S, Rother K, Müller H et al (2005) Columba: an integrated database of proteins, structures, and annotations. BMC Bioinformatics 6:81. https://doi.org/10.1186/1471-2105-6-81

Trott O, Olson AJ (2010) AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization and multithreading. J Comput Chem 31:455

Tyson RJ, Park CC, Powell JR, Patterson JH, Weiner D, Watkins PB, Gonzalez D (2020) Precision dosing priority criteria: drug, disease, and patient population variables. J Front Pharmacol. https://doi.org/10.3389/fphar.2020.00420

UniProt Consortium (2014) UniProt: a hub for protein information. Nucleic Acids Res 43(D1):D204–D212

Vazquez J, Lopez M, Gibert E, Herrero E, Luque FJ (2020) Merging ligand-based and structure-based methods in drug discovery: an overview of combined virtual screening approaches. Molecules 25:4723

Venkatasubramanian V (2019) The promise of artificial intelligence in chemical engineering: is it here, finally? AIChE J 65:466–478

Vermeer NS, Straus SM, Mantel-Teeuwisse AK, Domergue F, Egberts TC, Leufkens HG, De Bruin ML (2013) Traceability of biopharmaceuticals in spontaneous reporting systems: a cross sectional study in the FDA adverse event reporting system (FAERS) and surveillance databases. Drug Saf 36(8):617–625

Vilar S, Hripcsak GJ (2016) Leveraging 3D chemical similarity, target and phenotypic data in the identification of drug-protein and drug-adverse effect associations. J Cheminform 8(1):35. https://doi.org/10.1186/s13321-016-0147-1

Vilar S, Uriarte E, Santana L, Lorberbaum T, Hripcsak G, Friedman C, Tatonetti NP (2014) Similarity-based modeling in large-scale prediction of drug–drug interactions. Nat Protoc 9(9):2147–2163. https://doi.org/10.1038/nprot.2014.151

Wallach I, Dzamba M, Heifets A (2015) AtomNet: a deep convolutional neural network for bioactivity prediction in structure-based drug discovery. arXiv 2015: arXiv:1510.02855.

Wan F et al (2019) DeepCPI: a deep learning-based framework for large-scale in silico drug screening. Genom Proteomics Bioinform 17:478–495

Wang JZ et al (2007) A new method to measure the semantic similarity of GO terms. Bioinformatics 23(10):1274–1281. https://doi.org/10.1093/bioinformatics/btm087

Wang W et al (2014) Drug repositioning by integrating target information through a heterogeneous network model. Bioinformatics 30(20):2923–2930. https://doi.org/10.1093/bioinformatics/btu403

Wang CS, Lin PJ, Cheng CL, Tai SH, Kao Yang YH, Chiang JH (2019) Detecting potential adverse drug reactions using a deep neural network model. J Med Internet Res 21(2):e11016

Wang T, Yi HC, You ZH, Li LP, Wang YB, Hu L, Wong L (2019) A gated recurrent unit model for drug repositioning by combining comprehensive similarity measures and Gaussian interaction profile kernel. In: International conference on intelligent computing. Springer, Cham. pp 344–353

Wang YB, You ZH, Yang S et al (2020a) A deep learning-based method for drug–target interaction prediction based on long short-term memory neural network. BMC Med Inform Decis Mak 20:49. https://doi.org/10.1186/s12911-020-1052-0

Wang H, Wang J, Dong C, Lian Y, Liu D, Yan Z (2020b) A novel approach for drug–target interactions prediction based on multimodal deep autoencoder. Front Pharmacol 10:1–19

Watanabe JH, McInnis T, Hirsch JD (2018) Cost of prescription drug-related morbidity and mortality. Ann Pharmacother 52:829–837. https://doi.org/10.1177/1060028018765159

Way GP, Greene CS (2018) Extracting a biologically relevant latent space from cancer transcriptomes with variational autoencoders. Pac Symp Biocomput 23:80–91

Wei J, Lu Z, Qiu K, Li P, Sun H (2020) Predicting drug risk level from adverse drug reactions using SMOTE and machine learning approaches. IEEE Access 8:185761–185775. https://doi.org/10.1109/ACCESS.2020.3029446

Weinstein JN (2004) Integromic analysis of the NCI-60 cancer cell lines. Breast Dis 19:11–22

Wen M, Zhang Z, Niu S, Sha H, Yang R, Yun Y, Lu H (2017) Deep-learning-based drug–target interaction prediction. J Proteome Res 16:1401–1409

Wenzel J, Matter H, Schmidt F (2019) Predictive multitask deep neural network models for ADME-Tox properties: learning from large data sets. J Chem Inf Model 59:1253–1268

White J, Schiffer JT, Bender R et al (2021) Drug combinations as a first line of defense against coronaviruses and other emerging viruses. Mbio 12(6):e0334721

Withnall M, Lindelöf E, Engkvist O, Chen H (2020) Building attention and edge message passing neural networks for bioactivity and physical-chemical property prediction. J Cheminform 12:1

Wu Z, Ramsundar B, Feinberg EN, Gomes J, Geniesse C, Pappu AS, Leswing K, Pande V (2018) MoleculeNet: a benchmark for molecular machine learning. Chem Sci 9:513–530

Wu Z, Pan S, Chen F et al (2020) A comprehensive survey on graph neural networks. IEEE Trans Neural Netw Learn Syst 32:4–24

Xia Z, Wu LY, Zhou X, Wong ST (2010) Semi-supervised drug-protein interaction prediction from heterogeneous biological spaces. BMC Syst Biol 4:S6

Xiang W, Yingxin W, An Z, Xiangnan H, Tat-seng C (2021) Causal screening to interpret graph neural networks. In Submitted to international conference on learning representations. https://www.openreview.net/forum?id=nzKv5vxZfge

Xie L, He S, Song X, Bo X, Zhang Z (2018) Deep learning-based transcriptome data classification for drug–target interaction prediction. BMC Genomics 19:13–16

Xie Y, Peng J, Zhou Y, et al (2019) Integrating protein-protein interaction information into drug response prediction by graph neural encoding. 16 December 2019, Available at Research Square https://doi.org/10.21203/rs.2.18936/v1 .

Xu Y, Pei J, Lai L (2017) Deep learning-based regression and multiclass models for acute oral toxicity prediction with automatic chemical feature extraction. J Chem Inf Model 57:2672–2685

Yan CK, Wang WX, Zhang G et al (2019) BiRWDDA: a novel drug repositioning method based on multisimilarity fusion. J Comput Biol 26(11):1230–1242

Yan C, Duan G, Zhang Y, Wu F-X, Pan Y, Wang J (2022) Predicting drug–drug interactions based on integrated similarity and semi-supervised learning. IEEE/ACM Trans Comput Biol Bioinf 19(1):168–179. https://doi.org/10.1109/TCBB.2020.2988018

Yang K, Swanson K, Jin W, Coley C, Eiden P, Gao H, Guzman-Perez A, Hopper T, Kelley B, Mathea M et al (2019) Analyzing learned molecular representations for property prediction. J Chem Inf Model 59:3370–3388

Yi HC, You ZH, Wang L et al (2021) In silico drug repositioning using deep learning and comprehensive similarity measures. BMC Bioinf 22:293. https://doi.org/10.1186/s12859-020-03882-y

Yifan D, Xinran X, Yang Q, Jingbo X, Wen Z, Shichao L (2020) A multimodal deep learning framework for predicting drug–drug interaction events. Bioinformatics 36:4316–4322

Ying Z, Bourgeois D, You J, Zitnik M, Leskovec J (2019) GNNExplainer: generating explanations for graph neural networks. Adv Neural Inf Process Syst 32:9244–9255

Yu Y, Si X, Hu C, Zhang J (2019) A review of recurrent neural networks: LSTM cells and network architectures. Neural Comput 31:1235–1270

Yu Y, Huang K, Zhang C, Glass LM, Sun J, Xiao C (2021) SumGNN: multi-typed drug interaction prediction via efficient knowledge graph summarization. Bioinformatics 37(18):2988–2995

Yuan H, Yu H, Wang J, Li K, Ji S (2021) On explainability of graph neural networks via subgraph explorations. http://arxiv.org/abs/2102.05152

Yue X, Wang Z, Huang J, Parthasarathy S, Moosavinasab S, Huang Y, Lin SM, Zhang W, Zhang P, Sun H (2020) Graph embedding on biomedical networks: methods, applications, and evaluations. Bioinformatics 36(4):1241–1251. https://doi.org/10.1093/bioinformatics/btz718

Yunsheng B, Ken G, Yizhou S, Wei W (2020) Bi-level graph neural networks for drug–drug interaction prediction. J Comput Eng arXiv:2006.14002

Zaikis D, Vlahavas I (2020) Drug–drug interaction classification using attention based neural networks. In: 11th Hellenic conference on artificial intelligence, pp 34–40. https://doi.org/10.1145/3411408.3411461

Zeng H, Qiu C, Cui QJD (2015) Drug-path: a database for drug-induced pathways. J Biol Databases Curation. https://doi.org/10.1093/database/bav061

Zeng T, Rongjian L, Ravi M, Jieping Y, Shuiwang J (2015) Deep convolutional neural networks for annotating gene expression patterns in the mouse brain. BMC Bioinformatics 16(1):147

Zeng X et al (2019) Measure clinical drug–drug similarity using electronic medical records. Int J Med Inf 124:97–103. https://doi.org/10.1016/j.ijmedinf.2019.02.003

Zeng X, Zhu S, Lu W, Liu Z, Huang J, Zhou Y, Fang J, Huang Y, Guo H, Li L et al (2020) Target identification among known drugs by deep learning from heterogeneous networks. Chem Sci 11:1775–1797

Zhai J, Zhang S, Chen J, He Q (2018) Autoencoder and its various variants. In: 2018 IEEE international conference on systems, man, and cybernetics (SMC), pp 415–419. https://doi.org/10.1109/SMC.2018.00080

Zhang Y (2020) Predicting drug–drug interactions using multi-modal deep autoencoders based network embedding and positive-unlabeled learning. Methods 179:37–46

Zhang M-L, Zhou Z-H (2007) ML-KNN: a lazy learning approach to multi-label learning. Pattern Recogn 40(7):2038–2048

Zhang H, Liu D, Xiong Z (2018) Convolutional neural network-based video super-resolution for action recognition. In: 2018 13th IEEE international conference on automatic face & gesture recognition (FG 2018), pp 746–750. https://doi.org/10.1109/FG.2018.00117

Zhang Y, Weng Y, Lund J (2022) Applications of explainable artificial intelligence in diagnosis and surgery. Diagnostics 12:237. https://doi.org/10.3390/diagnostics12020237

Zhang C, Lu Y, Zang T (2022) CNN-DDI: a learning-based method for predicting drug–drug interactions using convolution neural networks. BMC Bioinf 23:88. https://doi.org/10.1186/s12859-022-04612-2

Zhao Y, Zheng K, Guan B, Guo M, Song L, Gao J, Qu H, Wang Y, Shi D, Zhang Y (2020) DLDTI: a learning-based framework for drug–target interaction identification using neural networks and network representation. J Transl Med 18:434

Zhao Q, Xiao F, Yang M, Li Y, Wang J (2019) AttentionDTA: prediction of drug–target binding affinity using attention model. In: Proceedings of the 2019 IEEE international conference on bioinformatics and biomedicine, San Diego, CA, USA, 18–21 November 2019, pp 64–69

Zhou Y, Zhang Y, Lian X, Li F, Wang C, Zhu F, Qiu Y, Chen Y (2022) Therapeutic target database update 2022: facilitating drug discovery with enriched comparative data of targeted agents. Nucleic Acids Res 50:1398–1407

Zitnik M, Agrawal M, Leskovec J (2018) Modeling polypharmacy side effects with graph convolutional networks. Bioinformatics 34(13):i457–i466

Zitnik SM, Sosic R, Leskovec J (2018) Biosnap datasets: Stanford biomedical network dataset collection. http://snap.stanford.edu/biodata

Zong N, Kim H, Ngo V, Harismendy O (2017) Deep mining heterogeneous networks of biomedical linked data to predict novel drug–target associations. Bioinformatics 33:2337–2344

Zügner D, Akbarnejad A, Günnemann S (2018) Adversarial attacks on neural networks for graph data. In: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery and Data Mining. 2018, Association for Computing Machinery: London, United Kingdom. pp 2847–2856

Open access funding provided by The Science, Technology & Innovation Funding Authority (STDF) in cooperation with The Egyptian Knowledge Bank (EKB).

Author information

Authors and affiliations.

Faculty of Computers and Artificial Intelligence, University of Sadat City, Sadat City, Egypt

Computer Science Department, Faculty of Science, Minia University, Minia, Egypt

Enas Elgeldawi & Mamdouh M. Gomaa

Faculty of Computers and Artificial Intelligence, Cairo University, Cairo, Egypt

Aboul Ella Hassanien

Faculty of Pharmacy and Drug Technology, Chinese University in Egypt (CUE), Cairo, Egypt

Heba Aboul Ella

Faculty of Pharmacy, University of Sadat City, Sadat City, Menoufia, Egypt

Yaseen A. M. M. Elshaier

Contributions

Ask wrote the main text, HA wrote the digital twining part, EE wrote the deep learning part, YAMME wrote the data sets part, MMG wrote the similarity part, and AEH suggested the idea of the review and supervised the work.

Corresponding author

Correspondence to Aboul Ella Hassanien .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Askr, H., Elgeldawi, E., Aboul Ella, H. et al. Deep learning in drug discovery: an integrative review and future challenges. Artif Intell Rev 56 , 5975–6037 (2023). https://doi.org/10.1007/s10462-022-10306-1

Accepted : 24 October 2022

Published : 17 November 2022

Issue Date : July 2023

DOI : https://doi.org/10.1007/s10462-022-10306-1

  • Drug discovery
  • Artificial intelligence
  • Deep learning
  • Drug–target interactions
  • Drug–drug similarity
  • Drug side-effects
  • Drug sensitivity and response
  • Drug dosing optimization
  • Explainable artificial intelligence
  • Digital twining

An overview of drug discovery and development

Affiliation.

  • 1 Department of biomedical Science, Nazarbayev University School of Medicine, Nur-Sultan 010000, Kazakhstan.
  • PMID: 32270704
  • DOI: 10.4155/fmc-2019-0307

A new medicine takes an average of 10-15 years and more than US$2 billion to reach the pharmacy shelf. Traditionally, drug discovery relied on natural products as the main source of new drug entities, but it later shifted toward high-throughput synthesis and combinatorial chemistry-based development. New technologies such as ultra-high-throughput drug screening and artificial intelligence are being heavily employed to reduce the cost and time of early drug discovery, yet both have remained relatively unchanged. Are there other, potentially faster and cheaper means of drug discovery? Is drug repurposing a viable alternative? In this review, we discuss the different means of drug discovery, including their advantages and disadvantages.

Keywords: drug repurposing; high throughput; natural sources; small molecule.

Publication types

  • Artificial Intelligence
  • Drug Development*
  • Drug Evaluation, Preclinical

Drug Discovery and Drug Identification using AI

Rational drug design with AlphaFold 3

Understanding the biomolecular world within us, and how complex networks of molecules interact within our cells, is a crucial starting point for understanding and treating disease with rational drug design. 

To move this understanding forwards, together with Google DeepMind we have developed AlphaFold 3, our breakthrough artificial intelligence (AI) model that provides an accurate atomic-level view of the structure of biomolecular systems. This builds upon Google DeepMind’s foundational work of predicting the structure of proteins with AlphaFold 2, to now include how multiple proteins, DNA, RNA, and small molecule ligands come together and interact, as well as predicting the structural impact of post-translational modifications and ions on these molecular systems. This model is a powerful unified framework for structure prediction, encompassing unprecedented breadth and accuracy.

This breakthrough opens up exciting possibilities for drug discovery, allowing us to rationally develop therapeutics against targets that were previously difficult or deemed intractable to modulate.

AlphaFold 3 and the research that led to this breakthrough

AlphaFold 3 is an AI model that allows a scientist to input a description of a biomolecular complex that they are interested in, and predicts the 3D structure of that biomolecular complex. 

The input biomolecular system can consist of multiple proteins, nucleic acids (DNA and RNA), small molecule ligands, and ions. These inputs are processed by AlphaFold 3, a generative model whose neural network builds upon a custom Transformer with triangular attention and uses a diffusion process to generate the 3D coordinates of every atom in the input-specified system.
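The diffusion step can be pictured with a toy sampler. The sketch below is not AlphaFold 3's sampler: `toy_denoiser` is a hypothetical stand-in that simply returns a fixed target structure, where a real model would use a trained network conditioned on the input sequences and ligands. The noise schedule and the step rule (a deterministic Euler step on the noise level) are also assumptions chosen for illustration. It shows only the overall shape of such a process: start from random atom coordinates and repeatedly denoise them.

```python
import random

Coords = list[tuple[float, ...]]


def toy_denoiser(noisy: Coords, sigma: float, target: Coords) -> Coords:
    """Hypothetical stand-in for a trained network: returns the clean structure."""
    return target


def sample(target: Coords, sigmas: list[float], seed: int = 0) -> Coords:
    """Run a deterministic reverse-diffusion loop over decreasing noise levels."""
    rng = random.Random(seed)
    # Start from pure Gaussian noise at the largest noise level.
    x = [tuple(rng.gauss(0.0, sigmas[0]) for _ in range(3)) for _ in target]
    for s_now, s_next in zip(sigmas, sigmas[1:]):
        denoised = toy_denoiser(x, s_now, target)
        # Euler step along (x - denoised) / sigma, from noise level s_now to s_next.
        x = [
            tuple(xi + (s_next - s_now) * (xi - di) / s_now
                  for xi, di in zip(atom, clean))
            for atom, clean in zip(x, denoised)
        ]
    return x


# Three made-up atom positions (in angstroms) standing in for a structure.
target = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
sigmas = [10.0, 5.0, 2.0, 1.0, 0.5, 0.0]  # assumed schedule, ending at zero noise
final = sample(target, sigmas)
```

Because the stand-in denoiser already knows the answer, the loop lands on the target once the noise level reaches zero. In a real model the denoiser's output changes at every step, and the quality of the final structure depends on what the network has learned.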

Building this technology required a huge amount of innovation and research, developed with our colleagues at Google DeepMind. The model is trained on the world’s molecular structural data contained within the Protein Data Bank, and is able to process over 99% of all known biomolecular complexes contained there. The capabilities and accuracy of this model have been extensively evaluated across a wide range of test cases including many completely novel systems and molecular interfaces. From these evaluations, we’ve seen state-of-the-art accuracy across nearly all structural areas, including doubling the accuracy for some important interfaces, whilst generalising even to novel interfaces that are not seen during training, such as the examples depicted in this blog post.

We’re really excited to share the hard work of our teams; more details of the model and its results are in our Nature paper.

We’re also looking forward to seeing how scientists will use the AFServer released today to generate molecular complexes for non-commercial academic research to accelerate our understanding of biology.

Three examples that show how AF3 allows us to fold many proteins with their respective ligands, and to rationalise their mechanism of action. Ground truth structures are shown in white.

The promise of AlphaFold 3 for science and drug design

For Isomorphic Labs, AlphaFold 3 equips our drug designers with the ability to quickly and accurately predict the structure of complexes that have never been characterised before, giving us a fundamental tool that allows us to take novel approaches to drug design.

We can now create and test hypotheses at the atomic level, and produce highly accurate structure predictions within seconds, standing in stark contrast to the months, or even years, required to experimentally determine answers to similar questions.

Already, we’re using AlphaFold 3 day-to-day. Our scientists have seen:

- That designing small molecules against AlphaFold 3’s structural predictions helps create designs that bind effectively to a target protein.

- That the improved structural accuracy of protein-protein interfaces with AlphaFold 3 opens up the possibility of designing for new treatment modalities such as antibodies or other therapeutic proteins.

- That a richer understanding of a novel target can be achieved by looking at the structure of targets in their full biological context, in complex with other protein binding partners, DNA, RNA, and ligand cofactors. We believe that this broader understanding of the context within which drug targets operate will translate into more effective drugs in the clinic.

To demonstrate the potential for rational structure-based drug design with AlphaFold 3, we examined TIM-3, an immune checkpoint protein identified as a potential target for cancer immunotherapy following a 2021 publication.

The study focused on the discovery and design of small molecules capable of binding TIM-3 with high affinity. The research group experimentally solved three ligand-bound crystal structures to rationalise the structure-activity relationship observed in their work. To our knowledge, no small molecule-bound crystal structures of TIM-3 existed in the Protein Data Bank prior to this paper, and these structures were not in the training set of AlphaFold 3. 

Crucially, the authors found those ligands bound to a previously uncharacterised pocket.

We evaluated this system with AlphaFold 3 by creating three predictions corresponding to the three published crystal structures. We used the raw sequence of the protein and the SMILES representation of each ligand, without giving AlphaFold 3 any additional information about the pose, structure, or pocket. Excitingly, the predicted structures were in agreement with the published experimental structures. The pocket discovered by the study was also found by AlphaFold 3. Furthermore, the predicted binding modes were almost identical to the ground truth structures, and the ligand-free prediction we generated for reference showed a very different pocket conformation, which was flat and open. This difference between the predicted protein structure with and without the ligand demonstrates the ability of AlphaFold 3 to change the structure of the protein based on the presence of other molecules in a context-dependent manner.
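The post does not say which metric was used to judge agreement between predicted and experimental binding modes. One common choice is the root-mean-square deviation (RMSD) of the ligand atoms once the protein frames have been superimposed. The following is a minimal sketch under that assumption: both poses are taken to be in the same frame with atoms listed in the same order, and the coordinates are invented for illustration.

```python
import math

def ligand_rmsd(pred, ref):
    """RMSD (in the units of the input, e.g. angstroms) between two poses.

    `pred` and `ref` are sequences of (x, y, z) tuples with the same atom
    order, already expressed in the same coordinate frame (an assumption).
    """
    if len(pred) != len(ref) or not pred:
        raise ValueError("poses must be non-empty and have the same number of atoms")
    squared = sum(
        (p - r) ** 2
        for atom_pred, atom_ref in zip(pred, ref)
        for p, r in zip(atom_pred, atom_ref)
    )
    return math.sqrt(squared / len(pred))

# Hypothetical three-atom ligand; every predicted atom is shifted by (0.3, 0.0, 0.4).
ref_pose = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
pred_pose = [(x + 0.3, y, z + 0.4) for x, y, z in ref_pose]
print(round(ligand_rmsd(pred_pose, ref_pose), 3))
```

A low RMSD for the ligand-bound prediction, together with a visibly different pocket in the ligand-free prediction, is the kind of evidence the paragraph above describes.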

Through this example, AlphaFold 3 demonstrated it could accurately characterise the progression of a drug molecule design structurally.

Looking ahead 

We’ve already deployed our frontier version of AlphaFold 3 in our own internal pipeline of projects, and in our partnerships with pharmaceutical companies.

But AlphaFold 3 is one of the many AI-powered breakthroughs we’re working on which are needed to transform drug discovery - structural understanding is just a part of the picture. We’re combining AlphaFold 3 with our other proprietary AI models in our platform that help us understand more about the properties, function, and dynamics of molecular systems. And as we learn more about the molecular machines within us, with more structural and biological context, we can use this understanding to identify novel targets for drug design, as well as to approach existing targets with novel therapeutic mechanisms. 

We’ll continue to be heads down in research, tackling the next frontier of fundamental modelling questions in chemistry and biology from first principles with AI. Bringing these together will help change the way we design the next generation of therapeutics, and unlock new biology.

While this is an important moment for AI-powered biological research, the potential for AI to accelerate outcomes for digital biology is limitless. Further development of our AI research models will deepen our understanding of human biology and the building blocks of life to reach our ultimate goal - harnessing the power and pace of AI to reimagine the entire drug discovery process.

MIT Technology Review

Google DeepMind’s new AlphaFold can model a much larger slice of biological life

AlphaFold 3 can predict how DNA, RNA, and other molecules interact, further cementing its leading role in drug discovery and research. Who will benefit?

  • James O'Donnell

Google DeepMind has released an improved version of its biology prediction tool, AlphaFold, that can predict the structures not only of proteins but of nearly all the elements of biological life.

It’s a development that could help accelerate drug discovery and other scientific research. The tool is currently being used to experiment with identifying everything from resilient crops to new vaccines. 

While the previous model, released in 2020, amazed the research community with its ability to predict protein structures, researchers have been clamoring for the tool to handle more than just proteins.

Now, DeepMind says, AlphaFold 3 can predict the structures of DNA, RNA, and molecules like ligands, which are essential to drug discovery. DeepMind says the tool provides a more nuanced and dynamic portrait of molecule interactions than anything previously available. 

“Biology is a dynamic system,” DeepMind CEO Demis Hassabis told reporters on a call. “Properties of biology emerge through the interactions between different molecules in the cell, and you can think about AlphaFold 3 as our first big sort of step toward [modeling] that.”

AlphaFold 2 helped us better map the human heart, model antimicrobial resistance, and identify the eggs of extinct birds, but we don’t yet know what advances AlphaFold 3 will bring.

Mohammed AlQuraishi, an assistant professor of systems biology at Columbia University who is unaffiliated with DeepMind, thinks the new version of the model will be even better for drug discovery. “The AlphaFold 2 system only knew about amino acids, so it was of very limited utility for biopharma,” he says. “But now, the system can in principle predict where a drug binds a protein.”

Isomorphic Labs, a drug discovery spinoff of DeepMind, is already using the model for exactly that purpose, collaborating with pharmaceutical companies to try to develop new treatments for diseases, according to DeepMind. 

AlQuraishi says the release marks a big leap forward. But there are caveats.

“It makes the system much more general, and in particular for drug discovery purposes (in early-stage research), it’s far more useful now than AlphaFold 2,” he says. But as with most models, the impact of AlphaFold will depend on how accurate its predictions are. For some uses, AlphaFold 3 has double the success rate of similar leading models like RoseTTAFold. But for others, like protein-RNA interactions, AlQuraishi says it’s still very inaccurate. 

DeepMind says that depending on the interaction being modeled, accuracy can range from 40% to over 80%, and the model will let researchers know how confident it is in its prediction. With less accurate predictions, researchers have to use AlphaFold merely as a starting point before pursuing other methods. Regardless of these ranges in accuracy, if researchers are trying to take the first steps toward answering a question like which enzymes have the potential to break down the plastic in water bottles, it’s vastly more efficient to use a tool like AlphaFold than experimental techniques such as x-ray crystallography. 

A revamped model  

AlphaFold 3’s larger library of molecules and higher level of complexity required improvements to the underlying model architecture. So DeepMind turned to diffusion techniques, which AI researchers have been steadily improving in recent years and now power image and video generators like OpenAI’s DALL-E 2 and Sora. Diffusion works by training a model to start with a noisy image and then reduce that noise bit by bit until an accurate prediction emerges. That method allows AlphaFold 3 to handle a much larger set of inputs.

That marked “a big evolution from the previous model,” says John Jumper, director at Google DeepMind. “It really simplified the whole process of getting all these different atoms to work together.”

It also presented new risks. As the AlphaFold 3 paper details, the use of diffusion techniques made it possible for the model to hallucinate, or generate structures that look plausible but in reality could not exist. Researchers reduced that risk by adding more training data to the areas most prone to hallucination, though that doesn’t eliminate the problem completely. 

Restricted access

Part of AlphaFold 3’s impact will depend on how DeepMind divvies up access to the model. For AlphaFold 2, the company released the open-source code, allowing researchers to look under the hood to gain a better understanding of how it worked. It was also available for all purposes, including commercial use by drugmakers. For AlphaFold 3, Hassabis said, there are no current plans to release the full code. The company is instead releasing a public interface for the model called the AlphaFold Server, which imposes limitations on which molecules can be experimented with and can only be used for noncommercial purposes. DeepMind says the interface will lower the technical barrier and broaden the use of the tool to biologists who are less knowledgeable about this technology.

AlphaFold 3 predicts the structure and interactions of all of life’s molecules

May 08, 2024

Introducing AlphaFold 3, a new AI model developed by Google DeepMind and Isomorphic Labs. By accurately predicting the structure of proteins, DNA, RNA, ligands and more, and how they interact, we hope it will transform our understanding of the biological world and drug discovery.

Inside every plant, animal and human cell are billions of molecular machines. They’re made up of proteins, DNA and other molecules, but no single piece works on its own. Only by seeing how they interact together, across millions of types of combinations, can we start to truly understand life’s processes.

In a paper published in Nature, we introduce AlphaFold 3, a revolutionary model that can predict the structure and interactions of all life’s molecules with unprecedented accuracy. For the interactions of proteins with other molecule types we see at least a 50% improvement compared with existing prediction methods, and for some important categories of interaction we have doubled prediction accuracy.

We hope AlphaFold 3 will help transform our understanding of the biological world and drug discovery. Scientists can access the majority of its capabilities, for free, through our newly launched AlphaFold Server, an easy-to-use research tool. To build on AlphaFold 3’s potential for drug design, Isomorphic Labs is already collaborating with pharmaceutical companies to apply it to real-world drug design challenges and, ultimately, develop new life-changing treatments for patients.

Our new model builds on the foundations of AlphaFold 2, which in 2020 made a fundamental breakthrough in protein structure prediction. So far, millions of researchers globally have used AlphaFold 2 to make discoveries in areas including malaria vaccines, cancer treatments and enzyme design. AlphaFold has been cited more than 20,000 times and its scientific impact recognized through many prizes, most recently the Breakthrough Prize in Life Sciences. AlphaFold 3 takes us beyond proteins to a broad spectrum of biomolecules. This leap could unlock more transformative science, from developing biorenewable materials and more resilient crops, to accelerating drug design and genomics research.

7PNM - Spike protein of a common cold virus (Coronavirus OC43): AlphaFold 3’s structural prediction for a spike protein (blue) of a cold virus as it interacts with antibodies (turquoise) and simple sugars (yellow), accurately matches the true structure (gray). The animation shows the protein interacting with an antibody, then a sugar. Advancing our knowledge of such immune-system processes helps better understand coronaviruses, including COVID-19, raising possibilities for improved treatments.

How AlphaFold 3 reveals life’s molecules

Given an input list of molecules, AlphaFold 3 generates their joint 3D structure, revealing how they all fit together. It models large biomolecules such as proteins, DNA and RNA, as well as small molecules, also known as ligands, a category encompassing many drugs. Furthermore, AlphaFold 3 can model chemical modifications to these molecules, which control the healthy functioning of cells and which, when disrupted, can lead to disease.

AlphaFold 3’s capabilities come from its next-generation architecture and training that now covers all of life’s molecules. At the core of the model is an improved version of our Evoformer module — a deep learning architecture that underpinned AlphaFold 2’s incredible performance. After processing the inputs, AlphaFold 3 assembles its predictions using a diffusion network, akin to those found in AI image generators. The diffusion process starts with a cloud of atoms, and over many steps converges on its final, most accurate molecular structure.
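The idea of starting from a cloud of atoms and converging over many steps can be illustrated with a toy loop. This is only a sketch of iterative denoising in general, not AlphaFold 3's diffusion network: the learned denoiser is replaced by a simple pull toward a known target structure, and the step size, noise schedule and coordinates are all assumptions chosen for illustration.

```python
import random

def toy_denoise(target, steps=50, seed=0):
    """Move a random cloud of atoms toward `target` while injected noise decays.

    `target` is a list of (x, y, z) tuples. In a real diffusion model the
    direction of each update comes from a trained network; here it is simply
    the difference to the known target (a stand-in, not the real method).
    """
    rng = random.Random(seed)
    # Start from a random "cloud of atoms".
    coords = [[rng.gauss(0.0, 5.0) for _ in range(3)] for _ in target]
    for step in range(steps):
        # Noise level shrinks linearly and reaches zero on the final step.
        noise_scale = 1.0 - (step + 1) / steps
        for atom, goal in zip(coords, target):
            for k in range(3):
                atom[k] += 0.2 * (goal[k] - atom[k])        # denoising update
                atom[k] += rng.gauss(0.0, 0.1 * noise_scale)  # remaining noise
    return coords

target_structure = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
final = toy_denoise(target_structure)
worst = max(abs(c - t) for atom, goal in zip(final, target_structure)
            for c, t in zip(atom, goal))
print(f"largest coordinate error after denoising: {worst:.4f}")
```

Early steps make large, noisy moves and late steps make small, nearly deterministic ones, which is the qualitative behaviour the paragraph above describes.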

AlphaFold 3’s predictions of molecular interactions surpass the accuracy of all existing systems. As a single model that computes entire molecular complexes in a holistic way, it’s uniquely able to unify scientific insights.

7R6R - DNA binding protein: AlphaFold 3’s prediction for a molecular complex featuring a protein (blue) bound to a double helix of DNA (pink) is a near-perfect match to the true molecular structure discovered through painstaking experiments (gray).

Leading drug discovery at Isomorphic Labs

AlphaFold 3 creates capabilities for drug design with predictions for molecules commonly used in drugs, such as ligands and antibodies, that bind to proteins to change how they interact in human health and disease.

AlphaFold 3 achieves unprecedented accuracy in predicting drug-like interactions, including the binding of proteins with ligands and antibodies with their target proteins. AlphaFold 3 is 50% more accurate than the best traditional methods on the PoseBusters benchmark without needing the input of any structural information, making AlphaFold 3 the first AI system to surpass physics-based tools for biomolecular structure prediction. The ability to predict antibody-protein binding is critical to understanding aspects of the human immune response and the design of new antibodies — a growing class of therapeutics.
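To make a statement like "50% more accurate" concrete, it helps to see how such a comparison could be computed. Docking benchmarks are often scored as the fraction of complexes whose predicted ligand pose lies within a distance cutoff of the experimental pose. The sketch below assumes a 2 Å RMSD cutoff and reads the figure as a relative improvement in success rate; the cutoff, that reading, and all RMSD values are assumptions for illustration and do not come from this post.

```python
def success_rate(rmsds, cutoff=2.0):
    """Fraction of predictions whose ligand RMSD is at or below `cutoff`."""
    if not rmsds:
        raise ValueError("need at least one RMSD value")
    return sum(r <= cutoff for r in rmsds) / len(rmsds)

def relative_improvement(new, old):
    """Relative change of `new` over `old`, e.g. 0.5 means 50% higher."""
    return (new - old) / old

# Invented per-complex ligand RMSDs for two hypothetical methods.
baseline_rmsds = [0.8, 1.9, 2.5, 3.1, 4.0, 1.2, 5.5, 1.6, 6.0, 3.3]
model_rmsds = [0.5, 1.1, 1.8, 2.4, 1.9, 0.9, 4.2, 2.1, 1.7, 3.0]

old = success_rate(baseline_rmsds)
new = success_rate(model_rmsds)
print(f"baseline {old:.0%}, model {new:.0%}, "
      f"relative improvement {relative_improvement(new, old):.0%}")
```

Whether a headline figure is a relative or an absolute difference changes its meaning considerably, so the benchmark definition in the paper should be checked before comparing numbers.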

Using AlphaFold 3 in combination with a complementary suite of in-house AI models, Isomorphic Labs is working on drug design for internal projects as well as with pharmaceutical partners. Isomorphic Labs is using AlphaFold 3 to accelerate and improve the success of drug design — by helping understand how to approach new disease targets, and developing novel ways to pursue existing ones that were previously out of reach.

AlphaFold Server: A free and easy-to-use research tool

8AW3 - RNA modifying protein: AlphaFold 3’s prediction for a molecular complex featuring a protein (blue), a strand of RNA (purple), and two ions (yellow) closely matches the true structure (gray). This complex is involved with the creation of other proteins — a cellular process fundamental to life and health.

Google DeepMind’s newly launched AlphaFold Server is the most accurate tool in the world for predicting how proteins interact with other molecules throughout the cell. It is a free platform that scientists around the world can use for non-commercial research. With just a few clicks, biologists can harness the power of AlphaFold 3 to model structures composed of proteins, DNA, RNA and a selection of ligands, ions and chemical modifications.

AlphaFold Server helps scientists make novel hypotheses to test in the lab, speeding up workflows and enabling further innovation. Our platform gives researchers an accessible way to generate predictions, regardless of their access to computational resources or their expertise in machine learning.

Experimental protein-structure determination can take about the length of a PhD and cost hundreds of thousands of dollars. Our previous model, AlphaFold 2, has been used to predict hundreds of millions of structures, which would have taken hundreds of millions of researcher-years at the current rate of experimental structural biology.

Sharing the power of AlphaFold 3 responsibly

With each AlphaFold release, we’ve sought to understand the broad impact of the technology, working together with the research and safety community. We take a science-led approach and have conducted extensive assessments to mitigate potential risks and share the widespread benefits to biology and humanity.

Building on the external consultations we carried out for AlphaFold 2, we’ve now engaged with more than 50 domain experts, in addition to specialist third parties, across biosecurity, research and industry, to understand the capabilities of successive AlphaFold models and any potential risks. We also participated in community-wide forums and discussions ahead of AlphaFold 3’s launch.

AlphaFold Server reflects our ongoing commitment to share the benefits of AlphaFold, including our free database of 200 million protein structures. We’ll also be expanding our free AlphaFold education online course with EMBL-EBI and partnerships with organizations in the Global South to equip scientists with the tools they need to accelerate adoption and research, including on underfunded areas such as neglected diseases and food security. We’ll continue to work with the scientific community and policy makers to develop and deploy AI technologies responsibly.

Opening up the future of AI-powered cell biology

7BBV - Enzyme: AlphaFold 3’s prediction for a molecular complex featuring an enzyme protein (blue), an ion (yellow sphere) and simple sugars (yellow), along with the true structure (gray). This enzyme is found in a soil-borne fungus (Verticillium dahliae) that damages a wide range of plants. Insights into how this enzyme interacts with plant cells could help researchers develop healthier, more resilient crops.

AlphaFold 3 brings the biological world into high definition. It allows scientists to see cellular systems in all their complexity, across structures, interactions and modifications. This new window on the molecules of life reveals how they’re all connected and helps understand how those connections affect biological functions — such as the actions of drugs, the production of hormones and the health-preserving process of DNA repair.

The impacts of AlphaFold 3 and our free AlphaFold Server will be realized through how they empower scientists to accelerate discovery across open questions in biology and new lines of research. We’re just beginning to tap into AlphaFold 3’s potential and can’t wait to see what the future holds.

  • Br J Pharmacol
  • v.162(6); 2011 Mar

Principles of early drug discovery

1 MedImmune Inc, Granta Park, Cambridge, UK

2 GlaxoSmithKline, Gunnels Wood Road, Stevenage, Hertfordshire, UK

SB Kalindjian

3 King's College, Guy's Campus, London, UK

KL Philpott

Developing a new drug from original idea to the launch of a finished product is a complex process which can take 12–15 years and cost in excess of $1 billion. The idea for a target can come from a variety of sources including academic and clinical research and from the commercial sector. It may take many years to build up a body of supporting evidence before selecting a target for a costly drug discovery programme. Once a target has been chosen, the pharmaceutical industry and more recently some academic centres have streamlined a number of early processes to identify molecules which possess suitable characteristics to make acceptable drugs. This review will look at key preclinical stages of the drug discovery process, from initial target identification and validation, through assay development, high throughput screening, hit identification, lead optimization and finally the selection of a candidate molecule for clinical development.

Introduction

A drug discovery programme initiates because there is a disease or clinical condition without suitable medical products available and it is this unmet clinical need which is the underlying driving motivation for the project. The initial research, often occurring in academia, generates data to develop a hypothesis that the inhibition or activation of a protein or pathway will result in a therapeutic effect in a disease state. The outcome of this activity is the selection of a target which may require further validation prior to progression into the lead discovery phase in order to justify a drug discovery effort (Figure 1). During lead discovery, an intensive search ensues to find a drug-like small molecule or biological therapeutic, typically termed a development candidate, that will progress into preclinical, and if successful, into clinical development (Figure 2) and ultimately be a marketed medicine.

Figure 1. Drug discovery process from target ID and validation through to filing of a compound and the approximate timescale for these processes. FDA, Food and Drug Administration; IND, Investigational New Drug; NDA, New Drug Application.

Figure 2. Overview of drug discovery screening assays.

Target identification

Drugs fail in the clinic for two main reasons: the first is that they do not work and the second is that they are not safe. As such, one of the most important steps in developing a new drug is target identification and validation. A target is a broad term which can be applied to a range of biological entities which may include, for example, proteins, genes and RNA. A good target needs to be efficacious, safe, meet clinical and commercial needs and, above all, be ‘druggable’. A ‘druggable’ target is accessible to the putative drug molecule, be that a small molecule or a larger biological, and upon binding elicits a biological response which may be measured both in vitro and in vivo. It is now known that certain target classes are more amenable to small molecule drug discovery, for example, G-protein-coupled receptors (GPCRs), whereas antibodies are good at blocking protein/protein interactions. Good target identification and validation enables increased confidence in the relationship between target and disease and allows us to explore whether target modulation will lead to mechanism-based side effects.

Data mining of available biomedical data has led to a significant increase in target identification. In this context, data mining refers to the use of a bioinformatics approach to not only help in identifying but also selecting and prioritizing potential disease targets (Yang et al., 2009). The data which are available come from a variety of sources but include publications and patent information, gene expression data, proteomics data, transgenic phenotyping and compound profiling data. Identification approaches also include examining mRNA/protein levels to determine whether they are expressed in disease and if they are correlated with disease exacerbation or progression. Another powerful approach is to look for genetic associations: is there a link between a genetic polymorphism and the risk of disease or disease progression, or is the polymorphism functional? For example, familial Alzheimer's Disease (AD) patients commonly have mutations in the amyloid precursor protein or presenilin genes which lead to the production and deposition in the brain of increased amounts of the Abeta peptide, characteristic of AD (Bertram and Tanzi, 2008). There are also examples of phenotypes in humans where mutations can nullify or overactivate the receptor, for example, the voltage-gated sodium channel NaV1.7: both kinds of mutation produce a pain phenotype, insensitivity or oversensitivity respectively (Yang et al., 2004; Cox et al., 2006).
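The genetic-association question raised above (is a polymorphism linked to disease risk?) is often first examined with an odds ratio from a case-control table. The following is a minimal sketch, assuming a simple carrier/non-carrier comparison and using Woolf's logit approximation for the confidence interval; the counts are hypothetical and do not come from any study cited here.

```python
import math

def odds_ratio(case_carriers, case_noncarriers, control_carriers, control_noncarriers):
    """Odds of carrying the variant among cases divided by the odds among controls."""
    return (case_carriers * control_noncarriers) / (case_noncarriers * control_carriers)

def odds_ratio_ci95(case_carriers, case_noncarriers, control_carriers, control_noncarriers):
    """Approximate 95% confidence interval (Woolf's logit method)."""
    counts = (case_carriers, case_noncarriers, control_carriers, control_noncarriers)
    log_or = math.log(odds_ratio(*counts))
    se = math.sqrt(sum(1.0 / n for n in counts))
    return math.exp(log_or - 1.96 * se), math.exp(log_or + 1.96 * se)

# Hypothetical counts: 200 cases and 200 controls genotyped for one polymorphism.
counts = (60, 140, 30, 170)
estimate = odds_ratio(*counts)
low, high = odds_ratio_ci95(*counts)
print(f"OR = {estimate:.2f}, 95% CI {low:.2f} to {high:.2f}")
```

An interval that excludes 1 suggests an association worth following up; it does not by itself show that the polymorphism is functional, which is the second question the paragraph poses.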

An alternative approach is to use phenotypic screening to identify disease relevant targets. In an elegant experiment, Kurosawa et al. (2008) used a phage-display antibody library to isolate human monoclonal antibodies (mAbs) that bind to the surface of tumour cells. Clones were individually screened by immunostaining and those that preferentially and strongly stained the malignant cells were chosen. The antigens recognized by those clones were isolated by immunoprecipitation and identified by mass spectrometry. Of 2114 mAbs with unique sequences, they identified 21 distinct antigens highly expressed on several carcinomas, some of which may be useful targets for the corresponding carcinoma therapy, and several mAbs which may become therapeutic agents.

Target validation

Once identified, the target then needs to be fully prosecuted. Validation techniques range from in vitro tools through the use of whole animal models, to modulation of a desired target in disease patients. While each approach is valid in its own right, confidence in the observed outcome is significantly increased by a multi-validation approach (Figure 3).

Figure 3. Target ID and validation is a multifunctional process. IHC, immunohistochemistry.

Antisense technology is a potentially powerful technique which utilizes RNA-like chemically modified oligonucleotides which are designed to be complementary to a region of a target mRNA molecule (Henning and Beste, 2002). Binding of the antisense oligonucleotide to the target mRNA prevents binding of the translational machinery thereby blocking synthesis of the encoded protein. A prime example of the power of antisense technology was demonstrated by researchers at Abbott Laboratories who developed antisense probes to the rat P2X3 receptor (Honore et al., 2002). When given by intrathecal minipump, to avoid toxicities associated with bolus injection, the phosphorothioate antisense P2X3 oligonucleotides had marked anti-hyperalgesic activity in the Complete Freund's Adjuvant model, demonstrating an unambiguous role for this receptor in chronic inflammatory states. Interestingly, after administration of the antisense oligonucleotides was discontinued, receptor function and algesic responses returned. Therefore, in contrast to the gene knockout approach, antisense oligonucleotide effects are reversible and a continued presence of the antisense is required for target protein inhibition (Peet, 2003). However, the chemistry associated with creating oligonucleotides has resulted in molecules with limited bioavailability and pronounced toxicity, making their in vivo use problematic. This has been compounded by non-specific actions, problems with controls for these tools and a lack of diversity and variety in selecting appropriate nucleotide probes (Henning and Beste, 2002).

In contrast, transgenic animals are an attractive validation tool as they involve whole animals and allow observation of phenotypic endpoints to elucidate the functional consequence of gene manipulation. In the early days of gene targeting, animals were generated that lacked a given gene's function from inception and throughout their lives. This work yielded great insights into the in vivo functions of a wide range of genes. One such example is the use of the P2X7 knockout mouse to confirm a role for this ion channel in the development and maintenance of neuropathic and inflammatory pain (Chessell et al., 2005). In mice lacking P2X7 receptors, inflammatory and neuropathic hypersensitivity is completely absent to both mechanical and thermal stimuli, while normal nociceptive processing is preserved. These transgenic animals were also used to confirm the mechanism of action for this ablation in vivo as the transgenic mice were unable to release the mature pro-inflammatory cytokine IL-1beta from cells although there was no deficit in IL-1beta mRNA expression. An alternative to the gene knockout is the gene knock-in, where a non-enzymatically functioning protein replaces the endogenous protein. These animals can have a different phenotype to a knockout, for example when the protein has structural as well as enzymatic functions (Abell et al., 2005), and these mice should ostensibly mimic more closely what happens during treatment with drugs, that is, the protein is there but functionally inhibited.

More recently, the desire to be able to make tissue-restricted and/or inducible knockouts has grown. Although these approaches are technically challenging, the most obvious reason for pursuing them is the need to overcome embryonic lethality of the homozygous null animals. Other reasons include avoidance of compensatory mechanisms due to chronic absence of a gene-encoded function and avoidance of developmental phenotypes. However, the use of transgenic animals is expensive and time-consuming. So in order to circumvent some of these issues, the use of small interfering RNA (siRNA) has become increasingly popular for target validation. Double-stranded RNA (dsRNA) specific to the gene to be silenced is introduced into a cell or organism, where it is recognized as exogenous genetic material and activates the RNAi pathway. The ribonuclease protein Dicer is activated which binds and cleaves dsRNAs to produce double-stranded fragments of 21–25 base pairs with a few unpaired overhang bases on each end. These short double-stranded fragments are called siRNAs. These siRNAs are then separated into single strands and integrated into an active RNA-induced silencing complex (RISC). After integration into the RISC, siRNAs base-pair to their target mRNA and induce cleavage of the mRNA, thereby preventing it from being used as a translation template (reviewed in Castanotto and Rossi, 2009). However, RNAi technology still has the major problem of delivery to the target cell, but many viral and non-viral delivery systems are currently under investigation (for review see Whitehead et al., 2009).
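The base-pairing step described above can be sketched in a few lines. The example enumerates 21-nucleotide target sites along an mRNA and derives the complementary guide strand for each, then checks where a given guide would pair. It is an illustration of Watson-Crick complementarity only: the sequence is invented, and the function names are hypothetical. Practical siRNA design also weighs overhangs, GC content and off-target matches, none of which is modelled here.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def reverse_complement(rna):
    """Return the strand that base-pairs with `rna`, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))

def guide_candidates(mrna, length=21):
    """Yield (start, target_site, guide) for every window of `length` nucleotides."""
    for start in range(len(mrna) - length + 1):
        site = mrna[start:start + length]
        yield start, site, reverse_complement(site)

def find_target(mrna, guide):
    """Index of the first site where `guide` pairs perfectly, or -1 if none."""
    return mrna.find(reverse_complement(guide))

# Invented 32-nucleotide mRNA fragment, used purely for illustration.
mrna = "AUGGCUAGCUUAGGCAUCGAUCGGAUCCAUGA"
candidates = list(guide_candidates(mrna))
start, site, guide = candidates[0]
print(f"{len(candidates)} candidate sites; first guide {guide} pairs at index {find_target(mrna, guide)}")
```

A guide with no perfect match returns -1, which mirrors the point that silencing depends on sequence complementarity between the siRNA and its target mRNA.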

Monoclonal antibodies are an excellent target validation tool as they interact with a larger region of the target molecule surface, allowing for better discrimination between even closely related targets and often providing higher affinity. In contrast, small molecules are disadvantaged by the need to interact with the often more conserved active site of a target, while antibodies can be selected to bind to unique epitopes. This exquisite specificity is the basis for their lack of non-mechanistic (or ‘off-target’) toxicity – a major advantage over small-molecule drugs.

However, antibodies cannot cross cell membranes, which restricts the target class mainly to cell surface and secreted proteins. One impressive example of the efficacy of a mAb in vivo is the function-neutralizing anti-TrkA antibody MNAC13, which has been shown to reduce both neuropathic pain and inflammatory hypersensitivity (Ugolini et al., 2007), thereby implicating NGF in the initiation and maintenance of chronic pain. Finally, the classic target validation tool is the small bioactive molecule that interacts with and functionally modulates effector proteins.

More recently, chemical genomics, a systematic application of tool molecules to target identification and validation, has emerged. Chemical genomics can be defined as the study of genomic responses to chemical compounds. The goal is the rapid identification of novel drugs and drug targets, embracing multiple early phase drug discovery technologies ranging from target identification and validation, through compound design and chemical synthesis, to biological testing. Chemical genomics brings together diversity-oriented chemical libraries and high-information-content cellular assays, along with the informatics and mining tools necessary for storing and analysing the data generated (reviewed in Zanders et al., 2002). The ultimate goal of this approach is to provide chemical tools against every protein encoded by the genome. The aim is to use these tools to evaluate cellular function prior to full investment in the target and commitment to a screening campaign.

The hit discovery process

Following the process of target validation, it is during the hit identification and lead discovery phase of the drug discovery process that compound screening assays are developed. The meaning of a ‘hit’ molecule can vary between researchers, but in this review we define a hit as a compound which has the desired activity in a compound screen and whose activity is confirmed upon retesting. A variety of screening paradigms exist to identify hit molecules (see Table 1). High throughput screening (HTS) involves the screening of the entire compound library directly against the drug target or in a more complex assay system, such as a cell-based assay, whose activity is dependent upon the target but which would then also require secondary assays to confirm the site of action of compounds (Fox et al., 2006). This screening paradigm involves the use of complex laboratory automation but assumes no prior knowledge of the nature of the chemotype likely to have activity at the target protein. Focused or knowledge-based screening involves selecting from the chemical library smaller subsets of molecules that are likely to have activity at the target protein based on knowledge of the target protein and literature or patent precedents for the chemical classes likely to have activity at the drug target (Boppana et al., 2009). This type of knowledge has given rise, more recently, to early discovery paradigms using pharmacophores and molecular modelling to conduct virtual screens of compound databases (McInnes, 2007). Fragment screening involves the generation of libraries of very low molecular weight compounds, which are screened at high concentrations, and is typically accompanied by the generation of protein structures to enable compound progression (Law et al., 2009). Finally, a more specialized focused screening approach can also be taken: physiological screening. This is a tissue-based approach that looks for a response more aligned with the final desired in vivo effect, as opposed to targeting one specific molecular component.

Screening strategies

NMR, nuclear magnetic resonance.

High throughput and other compound screens are developed and run to identify molecules that interact with the drug target, chemistry programmes are run to improve the potency, selectivity and physicochemical properties of the molecule, and data continue to be developed to support the hypothesis that intervention at the drug target will have efficacy in the disease state. This series of activities is the subject of intense effort within the pharmaceutical industry, and increasingly within academia, to identify candidate molecules for clinical development. Pharmaceutical companies have built large organizations with the objective of identifying targets, assembling compound collections and the associated screening infrastructure, identifying initial hit molecules from HTS or other screening paradigms, and optimizing those screening ‘hits’ into clinical candidates. In recent years the academic sector has become increasingly interested in the activities traditionally performed within the lead discovery phase in the pharmaceutical industry. Academic scientists are now formatting assays for drug discovery which are passed on to academic drug discovery centres for compound screening. These centres, as exemplified by the NIH Roadmap initiative in the USA (Frearson and Collie, 2009), have established compound libraries, screening infrastructure and the appropriate expertise traditionally found within the industrial sector. They screen target proteins to identify so-called chemical probes for use in target validation and disease biology studies and, increasingly, to identify chemical start points for drug discovery programmes. The success of these efforts has been facilitated by the transfer of skills between the industrial and academic sectors.

A typical programme critical path within the lead discovery phase consists of a number of activities and begins with the development of biological assays to be used for the identification of molecules with activity at the drug target. Once developed, such assays are used to screen compound libraries to identify molecules of interest. The output of a compound screen is typically termed a hit molecule, which has been demonstrated to have specific activity at the target protein. Screening hits form the basis of a lead optimization chemistry programme to increase potency of the chemical series at the primary drug target protein. During the lead discovery phase, molecules are also screened in cell-based assays predictive of the disease state and in animal models of disease to characterize both the efficacy of the compound and its likely safety profile (Figure 2). The following paragraphs describe in more detail the requirements and application of compound screening assays within hit and lead discovery.

Assay development

In the recombinant era, the majority of assays in use within the industry rely upon the creation of stable mammalian cell lines over-expressing the target of interest, or upon the over-expression and purification of recombinant protein to establish so-called biochemical assays. In recent years, however, there has been an increase in the number of reports describing the use of primary cell systems for compound screening (Dunne et al., 2009). Generally, cell-based assays have been applied to target classes such as membrane receptors, ion channels and nuclear receptors and typically generate a functional read-out as a consequence of compound activity (Michelini et al., 2010). In contrast, biochemical assays, which have been applied to both receptor and enzyme targets, often simply measure the affinity of the test compound for the target protein. The relative merits of biochemical and cell-based assays have been debated extensively and have been reviewed elsewhere (Moore and Rees, 2001). Both assay paradigms have been used successfully to identify hit and candidate molecules.

A plethora of assay formats have been enabled to support compound screening. The choice of assay format is dependent upon the biology of the drug target protein, the equipment infrastructure in the host laboratory, the experience of the scientists in that laboratory, whether an inhibitor or activator molecule is sought and the scale of the compound screen. For example, compound screening assays at GPCRs have been configured to measure the binding affinity of a radio- or fluorescently labelled ligand to the receptor, to measure guanine nucleotide exchange at the level of the G-protein, to measure compound-mediated changes in one of a number of second messenger metabolites including calcium, cAMP or inositol phosphates, or to measure the activation of downstream reporter genes. Whatever the assay format that is selected, it is a requirement that the following factors are considered:

  • Pharmacological relevance of the assay. If available, studies should be performed using known ligands with activity at the target under study, to determine if the assay pharmacology is predictive of the disease state and to show that the assay is capable of identifying compounds with the desired potency and mechanism of action.
  • Reproducibility of the assay. Within a compound screening environment it is a requirement that the assay is reproducible across assay plates, across screen days and, within a programme that may run for several years, across the duration of the entire drug discovery programme.
  • Assay costs. Compound screening assays are typically performed in microtitre plates. Within academia, or for relatively small numbers of compounds, assays are typically formatted in 96-well or 384-well microtitre plates, whereas in industry or in HTS applications assays are formatted in 384-well or 1536-well microtitre plates in assay volumes as small as a few microlitres. In each case assay reagents and assay volumes are selected to minimize the costs of the assay.
  • Assay quality. Assay quality is typically determined according to the Z' factor (Zhang et al., 1999). This is a statistical parameter that, in addition to considering the signal window in the assay, also considers the variance around both the high and low signals in the assay. The Z' factor has become the industry standard means of measuring assay quality on a plate basis. For a usable assay the Z' factor lies between 0 and 1 (negative values indicate that the high and low control signals overlap); an assay with a Z' factor of greater than 0.4 is considered appropriately robust for compound screening, although many groups prefer to work with assays with a Z' factor of greater than 0.6. In addition to the Z' factor, assay quality is also monitored through the inclusion of pharmacological controls within each assay. Assays are deemed acceptable if the pharmacology of the standard compound(s) falls within predefined limits. Assay quality is affected by many factors. Generally, high-quality assays are created through implementing simple assay protocols with few steps, minimizing wash steps or plate-to-plate reagent transfers within the assay, through the use of stable reagents and biologicals, and through ensuring that all the instrumentation used to perform the assay is performing optimally. This is typically achieved through developing quality control practices for all items of laboratory automation (see http://www.ncgc.nih.gov/guidance/section2.html#replicate-experiment-study-summary-acceptance ).
  • Effects of compounds in the assay. Chemical libraries are typically stored in organic solvents such as ethanol or dimethyl sulphoxide (DMSO). Thus, assays need to be configured so that they are not sensitive to the concentrations of solvents used in the assay. Typically, cell-based assays are intolerant to solvent concentrations of greater than 1% DMSO, whereas biochemical assays can be performed in solvent concentrations of up to 10% DMSO. Studies are also performed to establish the false negative and false positive hit rates in the assay. If these are unacceptably high then the assay will need to be reconfigured. Finally, some consideration should be given to the screening concentration. Compound screening assays for hit discovery are typically run at 1–10 µM compound concentration. At these concentrations compounds with activities of up to 40 µM can be identified. The test concentration can be varied to identify compounds with higher or lower activity.
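
The assay quality criterion above can be made concrete with a short calculation. The sketch below uses the definition of the Z' factor from Zhang et al. (1999), Z' = 1 - 3(SD_high + SD_low) / |mean_high - mean_low|, computed from the high and low control wells of a single plate. The function name and the control values are hypothetical and serve only as an illustration.

```python
from statistics import mean, stdev


def z_prime(high_controls, low_controls):
    """Z' = 1 - 3 * (sd_high + sd_low) / |mean_high - mean_low|."""
    window = abs(mean(high_controls) - mean(low_controls))
    if window == 0:
        raise ValueError("high and low control means are identical")
    spread = stdev(high_controls) + stdev(low_controls)
    return 1.0 - 3.0 * spread / window


# Hypothetical control wells from one plate (arbitrary luminescence units).
high = [980, 1010, 1005, 995, 1020, 990]
low = [102, 98, 105, 95, 100, 100]

plate_z = z_prime(high, low)
# 0.4 is the acceptance threshold quoted in the text above.
print(f"Z' = {plate_z:.2f}, plate accepted: {plate_z > 0.4}")
```

Because the numerator penalizes the variance of both controls, a wide signal window alone does not guarantee a good Z': noisy controls can push the value below the acceptance threshold, or below zero when the two control distributions overlap.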

One example of an HTS technology implemented for the identification of hit molecules with activity at GPCRs is the aequorin assay (Stables et al., 2000). Aequorin is a calcium-sensitive bioluminescent protein cloned from the jellyfish Aequorea victoria. Stable mammalian cell lines have been created that are transfected to express both the GPCR drug target and the aequorin biosensor. For receptors capable of coupling to heterotrimeric G-proteins of the Gαq/11 family, ligand activation results in an increase in intracellular calcium concentration. When aequorin is expressed in the same cells, this increase in intracellular calcium concentration is detected as a consequence of calcium binding to the aequorin photoprotein, which, in the presence of the cofactor coelenterazine, results in the generation of a flash of light that can be detected within a microtitre plate-based luminometer such as the Lumilux™ platform (PerkinElmer, Waltham, MA, USA). The aequorin assay has a very simple protocol and has been developed for HTS in 1536-well plate format in assay volumes of 6 µL and for compound profiling activities in 384-well plate format.

When developing any HTS assay, which can involve the screening of several million molecules over several weeks, it is best practice to screen training sets of compounds to verify that the assay is performing acceptably. Figure 4 shows the screening of a 12 000 compound training set against the histamine H1 receptor expressed in Chinese hamster ovary cells in a 1536-well format HTS assay. The training set is typically run on two or three occasions to identify the hit rate in the assay, the reproducibility of the assay and the false positive and false negative hit rates in the assay. Typically, statistical packages have been developed to identify these parameters. When screened to detect agonist ligands, the hit rates in the aequorin assay are typically less than 0.5% of compounds screened with a statistical assay cut-off of 5% or less of the agonist signal seen with a standard agonist ligand. In this assay format false positive and false negative hit rates are very low. For antagonist screening the hit rate in the aequorin assay is typically 2–3% of compounds screened with an activity cut-off of greater than 25% inhibition. This difference is common to all screening assays: hit rates in antagonist or inhibitor format tend to be higher than hit rates in agonist assays because antagonist assays, which are defined by detection of a decrease in assay signal, will also detect compounds that interfere in signal generation. Following completion of robustness testing an assay moves into HTS. During HTS, up to 200 assay plates are screened each day, often using complex laboratory automation. During the screen, assay performance is measured according to the Z' on the assay plate and the variance in the pharmacology of a standard compound, with assay plates being failed and rescreened if these quality control measures fall outside predefined limits (Figure 5).
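
As an illustration of how an activity cut-off translates into a hit list, the sketch below normalizes raw well signals from an antagonist-format assay to percent inhibition and applies the 25% cut-off mentioned above. The normalization against the two plate controls is a common convention assumed here, not a procedure taken from the text, and the compound identifiers, signals and function names are hypothetical.

```python
def percent_inhibition(signal, uninhibited, fully_inhibited):
    """Scale a raw signal so that agonist alone = 0% and reference antagonist = 100%."""
    return 100.0 * (uninhibited - signal) / (uninhibited - fully_inhibited)


def call_hits(wells, uninhibited, fully_inhibited, cutoff=25.0):
    """Return {compound_id: % inhibition} for wells above the activity cut-off."""
    hits = {}
    for compound_id, signal in wells.items():
        inhibition = percent_inhibition(signal, uninhibited, fully_inhibited)
        if inhibition > cutoff:
            hits[compound_id] = inhibition
    return hits


# Hypothetical raw signals; the control means would come from control wells on the plate.
wells = {"cpd-001": 950.0, "cpd-002": 640.0, "cpd-003": 1010.0, "cpd-004": 190.0}
hits = call_hits(wells, uninhibited=1000.0, fully_inhibited=100.0)
hit_rate = 100.0 * len(hits) / len(wells)
print(sorted(hits), f"hit rate = {hit_rate:.0f}%")
```

A compound that quenches the luminescent signal would score as an inhibitor under this normalization just as a true antagonist would, which is one reason hit rates in inhibitor format tend to run higher and why confirmation assays are needed.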

Figure 4. Aequorin high throughput screening: validation testing GPCR antagonist assay (1536-well). Assay validation of a GPCR drug screening assay for the identification of agonist and antagonist ligands. Cells expressing the histamine H1 receptor and the calcium-sensitive photoprotein aequorin were dispensed into 1536-well microtitre plates. A total of 12 000 compounds were screened in duplicate to detect agonist ligands (left panel) and antagonist ligands (right panel). In the agonist assay (left panel), no drug response is represented in red, the response to a maximal concentration of the ligand histamine in blue and compound data in yellow. As is typically seen in agonist assays, the hit rate is very low due to the absence of false positives. In the antagonist assay (right panel), the response to histamine in the absence of test compound is represented in red (basal response), the response to a maximal concentration of a histamine antagonist in blue (100% inhibition) and compound data in yellow. As is typically seen in a cell-based inhibitor assay, there is significant spread of the compound data due to a combination of assay interference and compound activity. True actives correlate in the range 40% to 100% inhibition. Both assays have excellent Z'. GPCR, G-protein-coupled receptor.

Figure 5. Quality control (QC) in high throughput screening. To ensure the control of screening data in compound screening campaigns each assay plate typically contains a number of pharmacological control compounds. (A) Each 384-well plate contains 16 wells containing a low control and a further 16 wells containing an EC100 concentration of a pharmacological standard, which are used to calculate the Z' factor (Zhang et al., 1999). Plates that generate a Z' factor below 0.4 are rescreened. (B) Each plate also contains 16 wells of an EC50 concentration of a pharmacological standard to monitor the variance in the assay (diamonds). (C) A heat map is generated for all plates that pass the pharmacological standard QC to monitor the distribution of activity across the assay plate. One would expect to see a random distribution of activity across the screening plate. A plate such as the one presented would be failed and rescreened due to the active wells clustering in the centre of the plate.

Defining a hit series

Compound libraries have been assembled to contain small molecular weight molecules that obey chemical parameters such as the Lipinski Rule of Five (Lipinski et al., 2001), and more often have molecular weights of less than 400 and clogP (a measure of lipophilicity which affects absorption into the body) of less than 4. Molecules with these features have been termed ‘drug-like’, in recognition of the fact that the majority of clinically marketed drugs have a molecular weight of less than 350 and a clogP of less than 3. It is critically important to initiate a drug discovery programme with a small, simple molecule, because lead optimization to improve potency and selectivity typically involves an increase in molecular weight, which in turn can lead to safety and tolerability issues.
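
A library filter of the kind described above can be sketched in a few lines. The Rule of Five thresholds used here are the standard ones (molecular weight over 500, logP over 5, more than 5 hydrogen-bond donors, more than 10 hydrogen-bond acceptors each count as a violation); the tighter molecular weight and clogP limits are the ones quoted in the paragraph. The descriptor values are hypothetical and assumed to be pre-computed by a cheminformatics toolkit, and requiring zero violations is a deliberately strict choice made for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    mol_weight: float       # molecular weight in Da
    clogp: float            # calculated logP
    h_bond_donors: int
    h_bond_acceptors: int


def rule_of_five_violations(d):
    """Count how many of the four Lipinski criteria a compound breaks."""
    checks = [
        d.mol_weight > 500,
        d.clogp > 5,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    ]
    return sum(checks)


def passes_library_filter(d, max_mw=400.0, max_clogp=4.0):
    """No Rule of Five violations, plus the tighter MW and clogP limits from the text."""
    return (
        rule_of_five_violations(d) == 0
        and d.mol_weight < max_mw
        and d.clogp < max_clogp
    )


# Hypothetical descriptor sets for three library members.
small = Descriptors(mol_weight=320.4, clogp=2.8, h_bond_donors=2, h_bond_acceptors=5)
borderline = Descriptors(mol_weight=450.0, clogp=3.5, h_bond_donors=1, h_bond_acceptors=6)
large = Descriptors(mol_weight=612.7, clogp=5.6, h_bond_donors=3, h_bond_acceptors=11)

for name, d in [("small", small), ("borderline", borderline), ("large", large)]:
    print(name, rule_of_five_violations(d), passes_library_filter(d))
```

The borderline compound shows why the tighter limits matter: it breaks no Rule of Five criterion, yet it leaves little headroom for the molecular weight gain that lead optimization usually brings.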

Once a number of hits have been obtained from virtual screening or HTS, the first role for the drug discovery team is to try to define which compounds are the best to work on. This triaging process is essential as, from a large library, a team will likely be left with many possible hits which they will need to reduce, confirm and cluster into series. There are several steps to achieving this. First, although this is less of a problem as the quality of libraries has improved, compounds that are known by the library curators to be frequent hitters in HTS campaigns need to be removed from further consideration. Second, a number of computational chemistry algorithms have been developed to group hits based on structural similarity to ensure that a broad spectrum of chemical classes is represented on the list of compounds taken forward. Analysis of the compound hit list using these algorithms allows the selection of hits for progression based on chemical cluster, potency and factors such as ligand efficiency, which gives an idea of how well a compound binds for its size (log potency divided by the number of ‘heavy atoms’, i.e. non-hydrogen atoms, in a molecule).
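
The ligand efficiency measure described above can be written out directly. This minimal sketch takes the definition given in the text, log potency divided by heavy atom count, and assumes that log potency is expressed as pIC50 (the negative base-10 logarithm of the IC50 in mol/L). The function name and the two example hits are hypothetical.

```python
import math


def ligand_efficiency(ic50_molar, heavy_atoms):
    """pIC50 per heavy (non-hydrogen) atom."""
    if ic50_molar <= 0 or heavy_atoms <= 0:
        raise ValueError("IC50 and heavy atom count must be positive")
    return -math.log10(ic50_molar) / heavy_atoms


# Hypothetical hits: a small 1 µM compound and a larger 100 nM compound.
small_hit = ligand_efficiency(1e-6, heavy_atoms=20)
large_hit = ligand_efficiency(1e-7, heavy_atoms=40)
print(f"small hit: {small_hit:.3f}, large hit: {large_hit:.3f}")
```

In this example the larger compound is ten times more potent but binds less efficiently for its size, so ranking by ligand efficiency would favour the smaller hit as a starting point.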

The next phase in the initial refinement process is to generate dose–response curves in the primary assay for each hit, preferably with a fresh sample of the compound. Showing normal competitive behaviour in hits is important. Compounds which give an all-or-nothing response are not acting in a reversible manner and indeed may not be binding to the target protein at all, with the activity at high concentrations arising from an interaction between the sample and another component of the assay system. Reversible compounds are favoured because their effects can be more easily ‘washed out’ following drug withdrawal, an important consideration for use in patients. Obtaining a dose–response curve allows the generation of a half maximal inhibitory concentration (IC50), which is used to compare the potencies of candidate compounds. Sourcing and using fresh samples of compounds for this exercise is highly desirable. Nearly all HTS libraries are stored as frozen DMSO solutions with the result that, after some time, the compound can become degraded or modified. Virtually anyone who has worked with libraries of this type has anecdotes about how potent activity has disappeared when the compound was resynthesized and used in re-testing, although occasionally identification of potent impurities has allowed progress to be made.
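
To show how a half maximal inhibitory concentration is read from a dose–response curve, the sketch below simulates percent inhibition with a Hill (four-parameter logistic) model and then estimates the IC50 by interpolating on log concentration between the two points that bracket 50% inhibition. This is a simplification chosen to keep the example dependency-free; in practice a nonlinear fit of the full curve would normally be used. The function names, the concentration series and the simulated IC50 of 2 µM are all assumptions made for illustration.

```python
import math


def hill_response(conc, ic50, hill_slope=1.0, top=100.0, bottom=0.0):
    """Percent inhibition predicted by a four-parameter logistic (Hill) model."""
    return bottom + (top - bottom) / (1.0 + (ic50 / conc) ** hill_slope)


def interpolate_ic50(concs, inhibitions):
    """Estimate IC50 by linear interpolation on log10(concentration)
    between the two measurements that bracket 50% inhibition."""
    points = sorted(zip(concs, inhibitions))
    for (c_lo, y_lo), (c_hi, y_hi) in zip(points, points[1:]):
        if y_lo <= 50.0 <= y_hi and y_hi != y_lo:
            frac = (50.0 - y_lo) / (y_hi - y_lo)
            log_ic50 = math.log10(c_lo) + frac * (math.log10(c_hi) - math.log10(c_lo))
            return 10 ** log_ic50
    return None  # the curve never crosses 50% in the tested range


# Simulated half-log dilution series (mol/L) for a compound with a 2 µM IC50.
concs = [1e-8, 3e-8, 1e-7, 3e-7, 1e-6, 3e-6, 1e-5, 3e-5, 1e-4]
inhibitions = [hill_response(c, ic50=2e-6) for c in concs]

estimate = interpolate_ic50(concs, inhibitions)
print(f"estimated IC50 = {estimate:.2e} M")
```

A curve that jumps from no effect to full effect between adjacent concentrations would correspond to a very large `hill_slope` in this model, which is one way to flag the all-or-nothing behaviour described above.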

With reliable dose–response curves generated in the primary assay for the target, the stage is set to examine the surviving hits in a secondary assay, if one is available, for the target of choice. This need not be an assay in a high throughput format but will involve looking at the effect of the compounds on a functional response, for example in a second messenger assay or in a tissue- or cell-based bioassay. Activity in this setting will give reassurance that compounds are able to modulate more intact systems rather than simply interacting with the isolated and often engineered protein used in the primary assay. Throughout the confirmation process, medicinal chemists would be looking to cluster compounds into groups which could form the basis of lead series. As part of this process, consideration will be given to the properties of each cluster, such as whether there is an identifiable structure–activity relationship (SAR) evolving over a number of compounds, that is, a group of compounds which have some section or chemical motif in common and for which the addition of different chemical groups to this core structure results in different potencies. Issues of chemical synthesis would also be examined. Thus, ease of preparation, potential amenability to parallel synthesis and the ability to generate diversity from late-stage intermediates would be assessed.

With defined clusters in place an exercise can now take place on several groups of compounds in parallel. This phase will include the rapid generation of rudimentary SAR data and defining the essential elements in the structure associated with activity. At the same time, representative examples of each of these mini-series will be subjected to various in vitro assays designed to provide important information with regard to absorption, distribution, metabolism and excretion (ADME) properties as well as physicochemical and pharmacokinetic (PK) measurements (see Table 2). Selectivity profiling, especially against the types of targets, if any, for which the compounds were originally made, is also useful to carry out at this time. For example, one may want to inhibit kinase X but avoid kinase Y to reduce unwanted in vivo side effects. This exercise will reveal the strengths and flaws of each series and allow a decision to be taken about the most promising series of compounds to be progressed. The number of series taken forward at this stage will depend on the resources available, but ideally several should be taken into the hit-to-lead stage to allow for attrition in the coming phase.

Key in vitro assays in early drug discovery

IC50, half maximal inhibitory concentration.

Whatever the screening paradigm, the output of the hit discovery phase of a lead identification programme is a so-called ‘hit’ molecule, typically with a potency of 100 nM–5 µM at the drug target. A chemistry programme is initiated to improve the potency of this molecule.

Hit-to-lead phase

The aim of this stage of the work is to refine each hit series to try to produce more potent and selective compounds which possess PK properties adequate to examine their efficacy in any in vivo models that are available.

Typically, the work now consists of intensive SAR investigations around each core compound structure, with measurements being made to establish the magnitude of activity and selectivity of each compound. This needs to be carried out systematically and, where structural information about the target is known, structure-based drug design techniques using molecular modelling and methodologies such as X-ray crystallography and NMR can be applied to develop the SAR faster and in a more focused way. This type of activity will also often give rise to the discovery of new binding sites on the target proteins.

A screening cascade at this time would generally consist of a relatively high throughput assay establishing the activity of each molecule on the molecular target, together with assays in the same format for sites where selectivity is known, or expected, to be an issue (Figure 6). A compound meeting basic criteria at this stage would be escalated into a further bank of assays. These should include higher order functional investigations against the molecular target and also an assessment of whether the compounds are active in primary assays in different species. The HTS assay is generally carried out on protein encoded by human DNA sequences but, as animal models are used to validate the activity of compounds in in vivo disease models, in pharmacodynamic (PD)/PK modelling and in preclinical toxicity studies, it is important to have data on activity in vitro on orthologues. This is also particularly important as it will assist in minimizing dosing levels in toxicology studies, which are chosen on the basis of multiples of the pharmacologically effective doses.

Figure 6. Hypothetical screening cascade. Examples of assays along the screening cascade from high throughput screening (HTS) to candidate selection are shown. DMPK, drug metabolism pharmacokinetics.

Attention in this phase also has to turn to more detailed profiling of physicochemical and in vitro ADME properties, and this series of studies is carried out in parallel, with key compounds being selected for assessment. The sort of assays to be considered, with targets that have been found to be appropriate, are shown in Table 2.

Solubility and permeability assessments are crucial in ruling in or out the potential of a compound to be a drug: a drug substance often needs access to a patient's circulation and therefore may be injected or, more generally, has to be absorbed in the digestive system. Deficiency in one or other parameter in a molecule can sometimes be put right. For example, formulation strategies can be used to design a tablet such that it dissolves in a particular region of the gut at a pH in which the compound is more soluble. A compound that lacks both these properties is very unlikely to become a drug no matter how potent it is in the primary screening assay. Microsomal stability is a useful measure of the ability of in vivo metabolizing enzymes to modify and then remove a compound. Hepatocytes are sometimes used in this sort of study instead and these will give more extensive results, but they are not used routinely as they need to be prepared freshly on a regular basis. CYP450 inhibition is examined as, among other things, it is an important predictor of whether a new compound might have an influence on the metabolism of an existing drug with which it may be co-administered.

If one or more of these properties is less than ideal, then it might be necessary to screen many more compounds specifically for those properties. Each programme will end up subtly different in this regard. For example, in one recent project to identify novel GPCR antagonists, a number of sub-micromolar hit compounds were identified. The main issues associated with these molecules were that they showed species differences, with poorer affinities at rodent receptors, a general lack of selectivity, with >50% inhibition at 10 µM at 30 out of 63 GPCRs and transporters tested in a cross-screening panel, and broad CYP450 inhibitory activity. It was felt that a number of these deficiencies were associated with the nature of the base common to all the initial structures. Modification of the basic residue resulted in a number of compounds which were as potent as the initial hits at the principal receptor but which were more selective in their actions. In common with many programmes, as potency at the principal target improved, selectivity issues in this series were left behind.

Key compounds which are beginning to meet the target potency and selectivity, as well as most of the physicochemical and ADME targets, should be assessed for PK in rats. Here one would normally be aiming for a half-life of >60 min when the compound is administered intravenously and a fraction in excess of 20% absorbed following oral dosing, although different targets sometimes require very different PK profiles. In large pharma with in-house drug metabolism pharmacokinetics (DMPK) departments numerous compounds might be profiled, while in academic environments there may be funds for only a predefined number of these expensive investigations. As the receptor antagonist programme described above advanced through the hit-to-lead phase, a number of compounds were prepared which had potency in the nanomolar range and a benign selectivity profile except for some potency at the hERG channel, a voltage-gated potassium channel important for cardiac function, inhibition of which can cause cardiac liability. Ideally for hERG we were aiming for an activity of over 30 µM or at least a 1000-fold selectivity for the target. A number of these compounds were examined in PK studies and were found to have a reasonable half-life following intravenous dosing, but poor plasma levels were noted when the compounds were given orally to rats. It was felt that some of these compounds, representing the end of the hit-to-lead phase of the project, were capable of answering questions in disease models, although not likely themselves to be progressed. Thus, compounds were administered intra-peritoneally and results from the experiments gave substantial credence to the developing programme.
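
The progression criteria quoted above lend themselves to a simple check. The sketch below encodes the rat PK aims (intravenous half-life over 60 min, more than 20% absorbed orally) and the hERG aim (activity over 30 µM, or at least 1000-fold selectivity for the target). It assumes that the hERG activity is expressed as an IC50 and that fold selectivity is the ratio of hERG IC50 to target IC50; the function names and example values are hypothetical, and real programmes would set thresholds per target.

```python
def pk_criteria_met(iv_half_life_min, oral_fraction_absorbed_pct):
    """Rat PK aims from the text: half-life > 60 min (i.v.) and > 20% absorbed orally."""
    return iv_half_life_min > 60 and oral_fraction_absorbed_pct > 20


def herg_margin_ok(target_ic50_um, herg_ic50_um, min_herg_ic50_um=30.0, min_fold=1000.0):
    """hERG aim from the text: IC50 over 30 µM, or at least 1000-fold selectivity."""
    fold_selectivity = herg_ic50_um / target_ic50_um
    return herg_ic50_um > min_herg_ic50_um or fold_selectivity >= min_fold


# Hypothetical compounds (IC50 values in µM).
print(herg_margin_ok(target_ic50_um=0.008, herg_ic50_um=12.0))  # 1500-fold window
print(herg_margin_ok(target_ic50_um=0.05, herg_ic50_um=5.0))    # 100-fold window
print(pk_criteria_met(iv_half_life_min=95, oral_fraction_absorbed_pct=8))
```

The second criterion is expressed as a ratio because a highly potent compound can tolerate measurable hERG activity as long as the margin over the therapeutic target remains wide.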

Lead optimization phase

The object of this final drug discovery phase is to maintain favourable properties in lead compounds while improving on deficiencies in the lead structure. Continuing with the example above, the aim of the programme was now to modify the structure to minimize hERG liability and to improve the absorption of the compound. Thus, more regular checks of hERG affinity and Caco-2 permeation were undertaken, and compounds were soon available which maintained their potency and selectivity at the principal target but which had a much reduced hERG affinity and a better apparent permeation than the initial lead compounds. When examined for PK properties, one of these compounds, with 8 nM affinity at the receptor of interest, had an oral bioavailability of over 40% in rats and about 80% in dogs.

Compounds at this stage may be deemed to have met the initial goals of the lead optimization phase and are ready for final characterization before being declared as preclinical candidates. Discovery work does not cease at this stage. The team has to continue to explore synthetically in order to produce potential back-up molecules, in case the compound undergoing further preclinical or clinical characterization fails, and, more strategically, to look for follow-up series.

The stage at which the various elements that constitute further characterization are carried out will vary from company to company, and parts of this process may be incorporated into the lead optimization phase. However, in general molecules need to be examined in models of genotoxicity such as the Ames test and in in vivo models of general behaviour such as the Irwin test. High-dose pharmacology, PK/PD studies, dose linearity and repeat dosing PK looking for drug-induced metabolism and metabolic profiling all need to be carried out by the end of this stage. Consideration also needs to be given to chemical stability issues and salt selection for the putative drug substance.

All the information gathered about the molecule at this stage will allow for the preparation of a target candidate profile which, together with toxicological and chemical manufacture and control considerations, will form the basis of a regulatory submission to allow human administration to begin.

The process of hit generation to preclinical candidate selection often takes a long time and cannot in any way be considered a routine activity. There are rarely any short cuts, and significant intellectual input is required from scientists from a variety of disciplines and backgrounds. The quality of the hit-to-lead starting point and the expertise of the available team are the key determinants of a successful outcome of this phase of work. Typically, within industry, 200 000 to >10⁶ compounds might be screened initially for each project, and during the following hit-to-lead and lead optimization programmes hundreds of compounds are screened to hone down to one or two candidate molecules, usually from different chemical series. In academia, screens are more likely to be of a focused nature owing to the high cost of an extensive HTS, or compounds are derived from a structure-based approach. Only 10% of small molecule projects within industry might make the transition to candidate, with the remainder failing at multiple stages. Reasons for failure can include (i) inability to configure a reliable assay; (ii) no developable hits obtained from the HTS; (iii) compounds that do not behave as desired in secondary or native tissue assays; (iv) compounds that are toxic in vitro or in vivo; (v) compounds that have undesirable side effects which cannot be easily screened out or separated from the mode of action of the target; (vi) inability to obtain a good PK or PD profile in line with the dosing regimen required in man (for example, a once-a-day tablet requires a compound with a half-life in vivo suitable to achieve this); and (vii) inability to cross the blood brain barrier for compounds whose target lies within the central nervous system. The attrition rate for protein therapeutics, once the target has been identified, is much lower owing to fewer off-target selectivity issues and prior experience of the PK of some proteins, for example, antibodies.

Although relatively less costly than many processes carried out later on in the drug development and clinical phases, preclinical activity is sufficiently high risk and remote from financial return to often make funding it a problem. Ensuring transparency of the cost of each stage/assay within large pharma may help reduce some of their costs and there are some movements towards this as companies instigate a ‘biotech’ mentality and accountability for costs.

Once a candidate is selected, the attrition rate of compounds entering the clinical phase is also high, with again only one in 10 candidates reaching the market, but at this stage the financial consequences of failure are much higher. There has been considerable debate in industry as to how to improve the success rate by ‘failing fast and cheap’. Once a candidate reaches the clinical stage, it can become increasingly difficult to kill the project, as at this stage the project has become public knowledge and thus termination can influence confidence in the company and shareholder value. Carrying out more studies prior to clinical development, such as improved toxicology screens (using failed drugs to inform these assays), establishing predictive translational models based on a thorough disease understanding and identifying biomarkers, may help in this endeavour. It is particularly in these latter two areas where academic-industry partnerships could really add value preclinically and eventually help bring more effective drugs to patients.
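Taken together, the two attrition figures quoted above (about 10% of projects reaching candidate, and about one in 10 candidates reaching the market) imply a compound success rate. The sketch below works through that arithmetic; it assumes the two stages are independent, which is a simplification, and the function names are illustrative.

```python
import math


def overall_success(stage_rates):
    """Multiply per-stage success rates, assuming the stages are independent."""
    return math.prod(stage_rates)


def projects_needed(stage_rates, launches=1):
    """Expected number of starting projects per launched drug."""
    return launches / overall_success(stage_rates)


# project -> candidate, candidate -> market (both figures taken from the text)
rates = [0.10, 0.10]
print(f"{overall_success(rates):.2%}")   # 1.00%
print(round(projects_needed(rates)))     # 100
```

Under these assumptions roughly one project in a hundred yields a marketed drug, which is why the text stresses the value of failing early, before the more expensive clinical stages.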

Acknowledgments

Karen Philpott is supported by the Medical Research Council and Guys and St Thomas' Charity.

S. Barret Kalindjian is supported by a Seeding Drug Discovery Wellcome Trust grant.

Conflict of interest.

Jane Hughes is employed by MedImmune, Steve Rees is employed by GSK and Karen Philpott was previously employed by GSK.

Supplementary material

Supporting Information: Teaching Materials; Figs 1–6 as PowerPoint slide.

  • Abell AN, Rivera-Perez JA, Cuevas BD, Uhlik MT, Sather S, Johnson NL, et al. Ablation of MEKK4 kinase activity causes neurulation and skeletal patterning defects in the mouse embryo. Mol Cell Biol. 2005; 25 :8948–8959. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Bertram L, Tanzi RE. Thirty years of Alzheimer's disease genetics: the implications of systematic meta-analyses. Nat Rev Neurosci. 2008; 9 :768–778. [ PubMed ] [ Google Scholar ]
  • Boppana K, Dubey PK, Jagarlapudi SARP, Vadivelan S, Rambabu G. Knowledge based identification of MAO-B selective inhibitors using pharmacophore and structure based virtual screening models. Eur J Med Chem. 2009; 44 :3584–3590. [ PubMed ] [ Google Scholar ]
  • Castanotto D, Rossi JJ. The promises and pitfalls of RNA-interference-based therapeutics. Nature. 2009; 457 :426–433. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Chessell IP, Hatcher JP, Bountra C, Michel AD, Hughes JP, Green P, et al. Disruption of the P2X7 purinoceptor gene abolishes chronic inflammatory and neuropathic pain. Pain. 2005; 114 :386–396. [ PubMed ] [ Google Scholar ]
  • Cox JJ, Reimann F, Nicholas AK. An SCN9A channelopathy causes congenital inability to experience pain. Nature. 2006; 444 :894–898. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Dunne A, Jowett M, Rees S. Use of primary cells in high throughput screens. Meth Mol Biol. 2009; 565 :239–257. [ PubMed ] [ Google Scholar ]
  • Fox S, Farr-Jones S, Sopchak L, Boggs A, Nicely AW, Khoury R, et al. High-throughput screening; Update on practices and success. J Biol Screen. 2006; 11 :864–869. [ PubMed ] [ Google Scholar ]
  • Frearson JA, Collie IT. HTS and hit finding in academia – from chemical genomics to drug discovery. Drug Discov Today. 2009; 14 :1150–1158. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Henning SW, Beste G. Loss-of-function strategies in drug target validation. Curr Drug Discov. 2002; May :17–21. [ Google Scholar ]
  • Honore P, Kage K, Mikusa J, Watt AT, Johnston JF, Wyatt JR, et al. Analgesic profile of intrathecal P2X3 antisense oligonucleotide treatment in chronic inflammatory and neuropathic pain states. Pain. 2002; 99 :11–19. [ PubMed ] [ Google Scholar ]
  • Kurosawa G, Akahori Y, Morita M, Sumitomo M, Sato N, Muramatsu C, et al. Comprehensive screening for antigens overexpressed on carcinomas via isolation of human mAbs that may be therapeutic. Proc Natl Acad Sci U S A. 2008; 105 :7287–7292. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Law R, Barker O, Barker JJ, Hesterkamp T, Godemann R, Andersen O, et al. The multiple roles of computational chemistry in fragment-based drug design. J Comput Aided Mol Des. 2009; 23 :459–473. [ PubMed ] [ Google Scholar ]
  • Lipinski CA, Lombardo F, Dominy BW, Feeney PJ. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv Drug Deliv Rev. 2001; 46 :3–26. [ PubMed ] [ Google Scholar ]
  • McInnes C. Virtual screening strategies in drug discovery. Curr Opin Chem Biol. 2007; 11 :494–502. [ PubMed ] [ Google Scholar ]
  • Michelini E, Cevenini L, Mezzanotte L, Coppa A, Roda A. Cell Based Assays: fuelling drug discovery. Anal Biochem. 2010; 397 :1–10. [ PubMed ] [ Google Scholar ]
  • Moore K, Rees S. Cell-based versus isolated target screening: how lucky do you feel? J Biomol Scr. 2001; 6 :69–74. [ PubMed ] [ Google Scholar ]
  • Peet NP. What constitutes target validation? Targets. 2003; 2 :125–127. [ Google Scholar ]
  • Stables J, Mattheakis LC, Chang TR, Rees S. Recombinant aequorin as a reporter of changes in intracellular calcium concentration in mammalian n cells. Meth Enzymol. 2000; 327 :456–471. [ PubMed ] [ Google Scholar ]
  • Ugolini G, Marinelli S, Covaceuszach S, Cattaneo A, Pavone F. The function neutralizing anti-TrkA antibody MNAC13 reduces inflammatory and neuropathic pain. Proc Natl Acad Sci U S A. 2007; 104 :2985–2990. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Whitehead KA, Langer R, Anderson DG. Knocking down barriers: advances in siRNA delivery. Nature Rev Drug Discov. 2009; 8 :129–138. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Yang Y, Wang Y, Li S. Mutations in SCN9A, encoding a sodium channel alpha subunit, in patients with primary erythermalgia. J Med Genet. 2004; 41 :171–174. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Yang Y, Adelstein SJ, Kassis AI. Target discovery from data mining approaches. Drug Discov Today. 2009; 14 :147–154. [ PubMed ] [ Google Scholar ]
  • Zanders ED, Bailey DS, Dean PM. Probes for chemical genomics by design. Drug Discov Today. 2002; 7 :711–718. [ PubMed ] [ Google Scholar ]
  • Zhang JH, Chung DY, Oldenberg KR. A simple statistical parameter for use in evaluation and validation of high throughput screening assays. J Biomol Scr. 1999; 4 :67–73. [ PubMed ] [ Google Scholar ]

Google DeepMind and Isomorphic Labs reveal AI able to predict large swathes of molecular biology

Google DeepMind CEO Demis Hassabis and John Jumper, the scientist who heads the company's protein structure team, accepting a Breakthrough Prize for their work on AlphaFold 2.

Alphabet’s Google DeepMind and its sister company Isomorphic Labs have created a new AI model that they say can help predict both the structure and interaction of most molecules involved in biological processes, including proteins, DNA, RNA, and some of the chemicals used to create new medicines. The new model is a potentially giant leap for biological research. The companies are allowing researchers working on non-commercial projects to query the model for free through an internet-based interface.

Isomorphic Labs, which was spun out of Google DeepMind, has also begun using the system internally to speed its efforts to discover new drugs. The company currently has partnerships with Eli Lilly and Novartis aimed at developing multiple drugs, although the specifics of which diseases the companies are targeting have not been revealed.

Proteins are the building blocks of life, and their interactions with one another and with other molecules are the mechanism through which life’s processes happen. Being able to predict those interactions more accurately will help researchers advance science by helping them understand the mechanisms behind diseases and, potentially, how to better treat and cure them.

Called AlphaFold 3, the new AI software represents a major update and expansion of capabilities beyond Google DeepMind’s previous AlphaFold 2 system. Researchers from the companies published a paper on AlphaFold 3 today in the prestigious scientific journal Nature. Demis Hassabis, who serves as CEO of both Google DeepMind and Isomorphic, described the new model’s interaction predictions as “incredibly important for drug discovery.” John Jumper, the senior researcher who heads the protein structure team at Google DeepMind, described AlphaFold 3 as “an evolution of AlphaFold 2, but a really big one that opens up new avenues.” He also said he was excited to see what researchers would do with the new model, noting that AlphaFold 2 had already opened up new areas of biological research that he could never have imagined. AlphaFold 2 has been cited more than 20,000 times in other published scientific papers and has been used to work on drugs for malaria, cancer, and many other diseases.

AlphaFold 2 and 3

Debuted in late 2020, AlphaFold 2 solved a grand scientific challenge because it was able to accurately predict the structure of most proteins simply from their amino acid sequence. The company later published the system’s predicted structures for all 200 million proteins with known sequences and made them freely available to scientists in a massive database. Prior to this, only about 100,000 proteins had known structural information. Knowing the shape and structure of a protein is often a key part of understanding how it will function.

But proteins do not work in isolation, and AlphaFold 2 was not designed to predict how proteins would interact with one another, although scientists soon found ways to modify AlphaFold 2 to make some of these predictions. Nor could AlphaFold 2 predict protein interactions with other kinds of molecules, such as DNA, RNA, ligands, and ions, that are found inside living things. It also could not predict the interaction of these other molecules with one another. AlphaFold 3 can.

The system is not always accurate, but it represents a major leap forward in performance. According to tests conducted by Google DeepMind and Isomorphic, AlphaFold 3 can accurately predict 76% of protein interactions with small molecules, compared to 52% for the previous best predictive software. It can predict 65% of DNA interactions, compared to 28% for the next leading system. And in protein-to-protein interactions, it can predict 62% accurately, more than doubling what AlphaFold 2 could do.

Like AlphaFold 2, AlphaFold 3 also includes a confidence score alongside its predictions that gives scientists some indication of whether they should trust the system’s output. This reduces the chance that the AI model will experience the sort of “hallucinations” (plausible but inaccurate outputs) that have plagued recent generative AI models. Jumper said that so far researchers have found these confidence scores to be highly correlated with whether the structural and interaction predictions are accurate. In other words, the system is not likely to be confidently wrong.

There are a few classes of proteins where AlphaFold 3 is still not accurate. These include proteins that scientists consider “intrinsically disordered,” meaning they only assume a particular structure in the presence of another protein or molecule, perhaps changing their shape radically depending on circumstance, according to Max Jaderberg, the chief AI scientist at Isomorphic Labs.

Bioweapons worries

Many have warned that rapid advances in AI may lead to the proliferation of bioweapons by radically lowering the knowledge barrier to creating deadly pathogens. They include Google DeepMind cofounder Mustafa Suleyman, who is now heading up a new consumer AI division at Microsoft, and Dario Amodei, the cofounder and CEO of Google DeepMind rival Anthropic. Jumper said Google DeepMind and Isomorphic had consulted more than 50 experts in biosecurity, bioethics, and AI safety and concluded that the marginal risk AlphaFold 3 might present in terms of bioweapons creation was far outweighed by the system’s potential benefits to science, including advancing human understanding of disease and finding possible treatments.

The two companies are also only allowing access to the model through an internet service that allows outside researchers to prompt the system and receive a prediction, but does not give them access to the model itself or its underlying computer code. Unlike some efforts to create large language models (LLMs) for biology that can be prompted in natural language to produce a formula for a compound with particular properties, AlphaFold 3 still requires someone to have a fairly good understanding of biology to use it effectively. In addition, any suggested molecular structure it predicts would still need to be produced or isolated in a lab, a process that also requires relatively specialized knowledge.

AlphaFold 3 uses a significantly different AI design than its predecessor AlphaFold 2. While both AI models are based around transformers, a kind of artificial neural network architecture pioneered by Google researchers in 2017, Jumper said the team working on the new system replaced entire “blocks” of the large transformer that powered AlphaFold 2.

AlphaFold 2 relied heavily on evolutionary information about the proteins for which it was trying to predict structures, while AlphaFold 3 leans on this evolutionary signal far less, using it only at the first step of its structure prediction. Instead, the new system devotes the majority of its components to working through the physical shape of the molecules it is making predictions about.

AlphaFold 3 also uses a diffusion model, similar to ones used for popular text-to-image generation models such as OpenAI’s DALL-E 3 or Midjourney, to learn how to puzzle out the precise atomic structures of molecules. Overall, despite covering far more substances than AlphaFold 2, AlphaFold 3 is a simpler design, with fewer separate components, than its predecessor.  

Existing research in molecular representation learning has leveraged models like Denoising Diffusion Probabilistic Models (DDPMs) for generating accurate molecular structures by transforming random noise into structured data. Models such as GeoDiff and Torsional Diffusion have emphasized the importance of 3D molecular conformation, enhancing the prediction of molecular properties. Furthermore, methods that integrate substructural details, like GeoMol, have improved results by considering the connectivity and arrangement of atoms within molecules, advancing the field through more nuanced and precise modeling techniques.

International Digital Economy Academy (IDEA) researchers have introduced SubGDiff, a novel diffusion model aimed at enhancing molecular representation by strategically incorporating subgraph details into the diffusion process. This integration allows for a more nuanced understanding and representation of molecular structures, setting SubGDiff apart from traditional models. The key innovation of SubGDiff lies in its ability to leverage subgraph prediction within its methodology, thus allowing the model to maintain essential structural relationships and features critical for accurate molecular property prediction.

SubGDiff’s methodology centers around three principal techniques: subgraph prediction, expectation state diffusion, and k-step same-subgraph diffusion. For validation and training, the model utilizes the PCQM4Mv2 dataset, part of the larger PubChemQC project known for its extensive collection of molecular structures. SubGDiff’s approach integrates these techniques to improve the learning process by enhancing the model’s responsiveness to the intrinsic substructural features of molecules. This is achieved by employing a continuous diffusion process adjusted to focus on relevant subgraphs, thus preserving critical molecular information throughout the learning phase. This structured methodology enables SubGDiff to achieve superior performance in molecular property prediction tasks.
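To make the idea of subgraph-focused diffusion more concrete, the toy sketch below applies the standard DDPM forward update only to atoms in a selected subgraph and holds the same subgraph for k consecutive steps. It is an illustration under stated assumptions, not the SubGDiff implementation: the function names, the uniform random choice of subgraph, the constant noise schedule and the four-atom coordinates are all hypothetical.

```python
import math
import random


def masked_diffusion_step(coords, mask, beta, rng):
    """One forward noising step applied only to atoms where mask is True.

    coords: list of (x, y, z) tuples; mask: list of bool, one per atom.
    Masked atoms follow the usual DDPM update
    x_t = sqrt(1 - beta) * x_{t-1} + sqrt(beta) * eps; others are unchanged.
    """
    keep, noise = math.sqrt(1.0 - beta), math.sqrt(beta)
    out = []
    for xyz, selected in zip(coords, mask):
        if selected:
            xyz = tuple(keep * v + noise * rng.gauss(0.0, 1.0) for v in xyz)
        out.append(xyz)
    return out


def forward_diffuse(coords, subgraphs, betas, k, seed=0):
    """Run len(betas) steps, re-drawing the noised subgraph every k steps."""
    rng = random.Random(seed)
    n_atoms = len(coords)
    mask = [False] * n_atoms
    for t, beta in enumerate(betas):
        if t % k == 0:  # hold the same subgraph for k consecutive steps
            chosen = set(rng.choice(subgraphs))
            mask = [i in chosen for i in range(n_atoms)]
        coords = masked_diffusion_step(coords, mask, beta, rng)
    return coords


# Hypothetical 4-atom molecule with two candidate subgraphs (atom indices).
x0 = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0), (0.0, 1.5, 0.0)]
subgraphs = [[0, 1], [2, 3]]
xt = forward_diffuse(x0, subgraphs, betas=[0.05] * 3, k=3)
untouched = [i for i in range(4) if xt[i] == x0[i]]
print(untouched)  # indices of the two atoms outside the sampled subgraph
```

Because only one subgraph is noised during the three steps, the remaining atoms keep their original coordinates, which is the sense in which structural information outside the selected subgraph is preserved in this simplified picture.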

SubGDiff has shown impressive results in molecular property prediction, significantly outperforming standard models. In benchmark testing, SubGDiff reduced mean absolute error by up to 20% compared to traditional diffusion models like GeoDiff. Furthermore, it demonstrated a 15% increase in accuracy on the PCQM4Mv2 dataset for predicting quantum mechanical properties. These outcomes underscore SubGDiff’s effective use of molecular substructures, resulting in more accurate predictions and enhanced performance across various molecular representation tasks.

To conclude, SubGDiff advances molecular representation learning by integrating subgraph information into the diffusion process. This approach allows for a more detailed and accurate depiction of molecular structures, leading to enhanced performance in property prediction tasks. The model’s ability to incorporate essential substructural details highlights its potential to improve outcomes in drug discovery and material science, where precise molecular understanding is crucial.

Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.

  • Review Article
  • Published: 28 January 2021

Natural products in drug discovery: advances and opportunities

  • Atanas G. Atanasov   ORCID: orcid.org/0000-0003-2545-0967 1 , 2 , 3 , 4 ,
  • Sergey B. Zotchev 2 ,
  • Verena M. Dirsch   ORCID: orcid.org/0000-0002-9261-5293 2 ,
  • the International Natural Product Sciences Taskforce &
  • Claudiu T. Supuran   ORCID: orcid.org/0000-0003-4262-0323 5  

Nature Reviews Drug Discovery volume  20 ,  pages 200–216 ( 2021 ) Cite this article

314k Accesses

2033 Citations

709 Altmetric

  • Chemical biology
  • Transferases

Natural products and their structural analogues have historically made a major contribution to pharmacotherapy, especially for cancer and infectious diseases. Nevertheless, natural products also present challenges for drug discovery, such as technical barriers to screening, isolation, characterization and optimization, which contributed to a decline in their pursuit by the pharmaceutical industry from the 1990s onwards. In recent years, several technological and scientific developments — including improved analytical tools, genome mining and engineering strategies, and microbial culturing advances — are addressing such challenges and opening up new opportunities. Consequently, interest in natural products as drug leads is being revitalized, particularly for tackling antimicrobial resistance. Here, we summarize recent technological developments that are enabling natural product-based drug discovery, highlight selected applications and discuss key opportunities.

Introduction.

Historically, natural products (NPs) have played a key role in drug discovery, especially for cancer and infectious diseases 1 , 2 , but also in other therapeutic areas, including cardiovascular diseases (for example, statins) and multiple sclerosis (for example, fingolimod) 3 , 4 , 5 .

NPs offer special features in comparison with conventional synthetic molecules, which confer both advantages and challenges for the drug discovery process. NPs are characterized by enormous scaffold diversity and structural complexity. They typically have a higher molecular mass, a larger number of sp³ carbon atoms and oxygen atoms but fewer nitrogen and halogen atoms, higher numbers of H-bond acceptors and donors, lower calculated octanol–water partition coefficients (cLogP values, indicating higher hydrophilicity) and greater molecular rigidity compared with synthetic compound libraries 1 , 6 , 7 , 8 , 9 . These differences can be advantageous; for example, the higher rigidity of NPs can be valuable in drug discovery tackling protein–protein interactions 10 . Indeed, NPs are a major source of oral drugs ‘beyond Lipinski’s rule of five’ 11 . The increasing significance of drugs not conforming to this rule is illustrated by the increase in molecular mass of approved oral drugs over the past 20 years 12 . NPs are structurally ‘optimized’ by evolution to serve particular biological functions 1 , including the regulation of endogenous defence mechanisms and the interaction (often competition) with other organisms, which explains their high relevance for infectious diseases and cancer. Furthermore, their use in traditional medicine may provide insights regarding efficacy and safety. Overall, the NP pool is enriched with ‘bioactive’ compounds covering a wider area of chemical space compared with typical synthetic small-molecule libraries 13 .
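For readers unfamiliar with the rule of five mentioned above, the following minimal sketch lists its four commonly stated thresholds (molecular mass ≤ 500 Da, cLogP ≤ 5, at most 5 H-bond donors and at most 10 H-bond acceptors) and reports which of them a compound exceeds. The function name and the descriptor values are illustrative assumptions; in practice the descriptors would come from a cheminformatics toolkit rather than being typed in by hand.

```python
def rule_of_five_violations(mol_mass, clogp, h_donors, h_acceptors):
    """Return the rule-of-five criteria that a compound exceeds."""
    violations = []
    if mol_mass > 500:
        violations.append("molecular mass > 500 Da")
    if clogp > 5:
        violations.append("cLogP > 5")
    if h_donors > 5:
        violations.append("H-bond donors > 5")
    if h_acceptors > 10:
        violations.append("H-bond acceptors > 10")
    return violations


# Invented descriptor values for a large, oxygen-rich NP-like molecule.
np_like = rule_of_five_violations(mol_mass=734.0, clogp=3.1,
                                  h_donors=5, h_acceptors=14)
print(np_like)  # ['molecular mass > 500 Da', 'H-bond acceptors > 10']

# Invented descriptor values for a small synthetic-library-like molecule.
print(rule_of_five_violations(350.0, 2.0, 1, 4))  # []
```

The first example shows the pattern described in the text: a high molecular mass and many H-bond acceptors place an NP-like molecule outside the rule even though its cLogP is low.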

Despite these advantages and multiple successful drug discovery examples, several drawbacks of NPs have led pharmaceutical companies to reduce NP-based drug discovery programmes. NP screens typically involve a library of extracts from natural sources (Fig.  1 ), which may not be compatible with traditional target-based assays 14 . Identifying the bioactive compounds of interest can be challenging, and dereplication tools have to be applied to avoid rediscovery of known compounds. Accessing sufficient biological material to isolate and characterize a bioactive NP may also be challenging 15 . Furthermore, gaining intellectual property (IP) rights for (unmodified) NPs exhibiting relevant bioactivities can be a hurdle, since naturally occurring compounds in their original form may not always be patented (legal frameworks vary between countries and are evolving) 16 , although simple derivatives can be patent-protected (Box  1 ). An additional layer of complexity relates to the regulations defining the need for benefit sharing with countries of origin of the biological material, framed in the United Nations 1992 Convention on Biological Diversity and the Nagoya Protocol, which entered into force in 2014 (ref. 17 ), as well as recent developments concerning benefit sharing linked to use of marine genetic resources 18 .

figure 1

Steps in the process are shown in purple boxes, with associated key limitations shown in red boxes and advances that are helping to address these limitations in modern natural product (NP)-based drug discovery shown in green boxes. The process begins with extraction of NPs from organisms such as bacteria. The choice of extraction method determines which compound classes will be present in the extract (for example, the use of more polar solvents will result in a higher abundance of polar compounds in the crude extract). To maximize the diversity of the extracted NPs, the biological material can be subjected to extraction with several solvents of different polarity. Following the identification of a crude extract with promising pharmacological activity, the next step is its (often multiple) consecutive bioactivity-guided fractionation until the pure bioactive compounds are isolated. A key limitation for the potential of this approach to identify novel NPs is that many potential source organisms cannot be cultured or stop producing relevant NPs when taken out of their natural habitat. These limitations are being addressed through development of new methods for culturing, for in situ analysis, for NP synthesis induction and for heterologous expression of biosynthetic genes. At the crude extract step, challenges include the presence in the extracts of NPs that are already known, NPs that do not have drug-like properties or insufficient amounts of NPs for characterization. These challenges can be addressed through the development of methods for dereplication, extraction and pre-fractionation of extracts. Finally, at the last stage, when bioactive compounds are identified by phenotypic assays, significant time and effort are typically needed to identify the affected molecular targets. 
This challenge can be addressed by the development of methods for accelerated elucidation of molecular modes of action, such as the nematic protein organization technique (NPOT), drug affinity responsive target stability (DARTS), stable isotope labelling with amino acids in cell culture and pulse proteolysis (SILAC-PP), the cellular thermal shift assay (CETSA) and an extension known as thermal proteome profiling (TPP), stability of proteins from rates of oxidation (SPROX), the similarity ensemble approach (SEA) and bioinformatics-based analysis of connectivity (connectivity map, CMAP) 23 , 189 , 190 , 191 , 192 .

Although the complexity of NP structures can be advantageous, the generation of structural analogues to explore structure–activity relationships and to optimize NP leads can be challenging, particularly if synthetic routes are difficult. Also, NP-based drug leads are often identified by phenotypic assays, and deconvolution of their molecular mechanisms of action can be time-consuming 19 . Fortunately, there have been substantial advances 20 both in the development of screening assays (for example, harnessing the potential of induced pluripotent stem cells and gene editing technologies) and in strategies to identify the modes of action of active compounds (reviewed previously 21 , 22 , 23 ).

Here, we discuss recent technological and scientific advances that may help to overcome challenges in NP-based drug discovery, with an emphasis on three areas: analytical techniques, genome mining and engineering, and cultivation systems. In the concluding section, we highlight promising future directions for NP drug discovery.

Box 1 Natural products that activate the KEAP1/NRF2 pathway

An example of a pathway affected by diverse natural products (NPs) is the KEAP1/NRF2 pathway. This pathway regulates the expression of networks of genes encoding proteins with versatile cytoprotective functions and has essential roles in the maintenance of redox and protein homeostasis, mitochondrial biogenesis and the resolution of inflammation 196 , 197 , 198 , 199 .

Activation of this pathway can protect against damage by most types of oxidants and pro-inflammatory agents, and it restores redox and protein homeostasis 200 . The pathway has therefore attracted attention for the development of drugs for the prevention and treatment of complex diseases, including neurological conditions such as relapsing–remitting multiple sclerosis 201 and autism spectrum disorder 202 .

Dimethyl fumarate (DMF), the methyl ester of the NP fumarate (a tricarboxylic acid (TCA) cycle intermediate that is found in both animals and plants), is one of the earliest discovered inducers of the KEAP1/NRF2 pathway 203 , 204 . The origins of the development of DMF as a drug date back to the use in traditional medicine of the plant Fumaria officinalis . Initially, fumaric acid derivatives were used for the treatment of psoriasis as it was thought that psoriasis is caused by a metabolic deficiency in the TCA cycle that could be compensated for by repletion of fumarate 205 . Despite this erroneous assumption, DMF is effective in treating psoriasis, both topically and orally, and is the active principle of Fumaderm, which has been used clinically for several decades in the treatment of plaque psoriasis in Germany. More recently, a DMF formulation developed by Biogen has been tested in other immunological disorders, with successful phase III trials in multiple sclerosis 206 , 207 leading to its approval by the FDA and EMA in 2013.

The isothiocyanate sulforaphane, isolated from broccoli ( Brassica oleracea ) 208 , is among the most potent naturally occurring inducers of the KEAP1/NRF2 pathway 209 and has protective effects in animal models of Parkinson 210 , Huntington 211 and Alzheimer 212 diseases, traumatic brain injury 213 , spinal cord contusion injury 214 , stroke 215 , depression 216 and multiple sclerosis 217 . Sulforaphane-rich broccoli extract preparations are being developed as preventive interventions in areas of the world with unavoidable exposure to environmental pollutants, such as China; the initial results of a randomized clinical trial showed rapid and sustained, statistically significant increases in the levels of excretion of the glutathione-derived conjugates of benzene and acrolein 218 , and a follow-up trial (NCT02656420) also demonstrated dose–response-dependent benzene detoxification 219 . In a placebo-controlled, double-blind, randomized clinical trial in young individuals (age 13–27 years) with autism spectrum disorder, sulforaphane reversed many of the clinical abnormalities 202 ; these encouraging findings led to a recently completed clinical trial in children (age 3–12 years) (NCT02561481; results of the trial are not yet publicly available). An α-cyclodextrin complex of sulforaphane known as SFX-01 (developed by Evgen Pharma) is being clinically studied for its potential to reverse resistance to endocrine therapies in patients with ER + HER2 - metastatic breast cancer (phase II trial completed 220 ) and in patients with subarachnoid haemorrhage (phase II trial NCT02614742 recently completed; results not yet publicly available). Currently, a clinical trial of SFX-01 in patients hospitalized with COVID-19 is in its final stages of preparation.

Finally, the pentacyclic triterpenoids bardoxolone methyl (also known as RTA 402) and omaveloxolone (RTA 408), which are semi-synthetic derivatives of the NP oleanolic acid, are the most potent (active at nanomolar concentrations) activators of the KEAP1/NRF2 pathway known to date 221 . These compounds have shown protective effects in numerous animal models of chronic disease 222 , and are currently in clinical trials for a wide range of indications, such as chronic kidney disease in type 2 diabetes, pulmonary arterial hypertension, melanoma, radiation dermatitis, ocular inflammation and Friedreich’s ataxia 200 . Most recently, bardoxolone methyl has entered a clinical trial in patients hospitalized with confirmed COVID-19 (NCT04494646).


Application of analytical techniques

Classical NP-based drug research starts with biological screening of ‘crude’ extracts to identify a bioactive ‘hit’ extract, which is further fractionated to isolate the active NPs. Bioactivity-guided isolation is a laborious process with a number of limitations, but various strategies and technologies can be used to address some of them (Fig.  2 ). For example, to create libraries that are compatible with high-throughput screening, crude extracts can be pre-fractionated into sub-fractions that are more suitable for automated liquid handling systems. In addition, fractionation methods can be adjusted so that sub-fractions preferentially contain compounds with drug-like properties (typically moderate hydrophilicity). Such approaches can increase the number of hits compared with using crude extracts, as well as enabling more efficient follow-up of promising hits 24 .

figure 2

a | An illustrative example of the application of liquid chromatography–high-resolution mass spectrometry (LC–HRMS) metabolomics in the screening of natural product (NP) extracts is the work of Kurita et al. 58 , in which 234 bacterial extracts were subjected to image-based phenotypic bioactivity screening and LC–HRMS metabolomics. Clustering of the resulting data allowed prioritization of promising extracts for further analysis, resulting in the discovery of the new NPs, quinocinnolinomycins A–D. b | Another illustrative example of LC–HRMS screening of NP extracts is the work of Clevenger et al. 85 , who obtained novel NP extracts through heterologous expression of fungal artificial chromosomes (FACs) containing uncharacterized biosynthetic gene clusters (BGCs) from diverse fungal species in Aspergillus nidulans . Analysis of the LC–HRMS metabolomics data with a FAC-Score algorithm directed the simultaneous discovery of 15 new NPs and the characterization of their BGCs.

Metabolomics was developed as an approach to simultaneously analyse multiple metabolites in biological samples. Enabled by technological developments in chromatography and spectrometry, metabolomics was historically applied first in other research fields, such as biomedical and agricultural sciences 2 . Advances in the analytical instrumentation used in NP research 25 , 26 , coupled with computational approaches that can generate plausible NP analogue structures and their respective simulated spectra 27 , have also enabled application of ‘omics’ approaches such as metabolomics in NP-based drug discovery. Metabolomics can provide accurate information on the metabolite composition in NP extracts, thus helping to prioritize NPs for isolation, to accelerate dereplication 28 , 29 and to annotate unknown analogues and new NP scaffolds. Moreover, metabolomics can detect differences between metabolite compositions in various physiological states of producing organisms and enable the generation of hypotheses to explain them, and can also provide extensive metabolite profiles to underpin phenotypic characterization at the molecular level 30 . Both options are very useful in understanding the molecular mechanisms of action of NPs.

For metabolite profiling, NP extracts are analysed by NMR spectroscopy or high-resolution mass spectrometry (HRMS), or respective combined methods involving upstream liquid chromatography (LC) 31 , 32 , such as LC–HRMS, which can separate numerous isomers present in NP extracts 33 . Moreover, such combined methods might integrate HRMS and NMR, allowing the simultaneous use of the advantages of both techniques 34 , 35 . NMR analysis of NP extracts is simple and reproducible, and provides direct quantitative information and detailed structural information, although it has relatively low sensitivity, meaning that it generally enables profiling only of major constituents 33 . The applications of NMR in NP research are versatile 36 and the technique is used both directly for metabolomics of unfractionated NP extracts and for structural characterization of compounds and fractions obtained with appropriate separation methods, most often LC. HRMS is the gold standard for qualitative and quantitative metabolite profiling 33 and is most commonly applied in combination with LC. HRMS can also be used in the direct infusion mode (called DIMS) 37 , whereby samples are directly profiled by MS without a chromatography step, or in MS imaging (MSI) 38 , which enables determination of the spatial distribution of NPs within living organisms. HRMS enables routine acquisition of accurate molecular mass information, which together with appropriate heuristic filtering can provide unambiguous assignment of molecular formulae for hundreds to thousands of metabolites within a single extract over a dynamic range that may exceed five orders of magnitude 31 , 39 . However, challenges remain in data mining and in the unambiguous identification of the metabolites using various workflows relying on open web-based tools 40 .
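The step from an accurate mass to a molecular formula can be illustrated with a minimal sketch. It assumes a neutral monoisotopic mass (adduct correction is left out), a hypothetical 5 ppm tolerance and a small hand-written candidate list; the function names are illustrative and do not belong to any published tool. Real workflows add heuristic filters such as isotope patterns and element ratios, which are not shown here.

```python
# Monoisotopic masses (u) of the most abundant isotopes of common NP elements.
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.0030740,
    "O": 15.9949146,
    "S": 31.9720707,
}


def monoisotopic_mass(formula):
    """Sum isotope masses for a formula given as {element: count}."""
    return sum(MONOISOTOPIC[element] * count for element, count in formula.items())


def ppm_error(measured, theoretical):
    """Relative mass error in parts per million."""
    return (measured - theoretical) / theoretical * 1e6


def match_formulas(measured_mass, candidates, tol_ppm=5.0):
    """Return candidates within tol_ppm of the measured mass, best match first."""
    hits = []
    for name, formula in candidates.items():
        theoretical = monoisotopic_mass(formula)
        error = ppm_error(measured_mass, theoretical)
        if abs(error) <= tol_ppm:
            hits.append((name, theoretical, error))
    return sorted(hits, key=lambda hit: abs(hit[2]))


# Hypothetical candidate list: sulforaphane (C6H11NOS2) and an unrelated
# formula with the same nominal mass of 177.
candidates = {
    "C6H11NOS2": {"C": 6, "H": 11, "N": 1, "O": 1, "S": 2},
    "C10H11NO2": {"C": 10, "H": 11, "N": 1, "O": 2},
}
hits = match_formulas(177.0285, candidates)
```

In this toy case the two candidates share a nominal mass but differ by roughly 0.05 u, so a tolerance of a few ppm is enough to separate them; with larger candidate sets, accurate mass alone usually leaves several formulae and further filtering is needed.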

Dereplication of secondary metabolites in bioactive extracts includes the determination of molecular mass and formula and cross-searching in the literature or structural NP databases with taxonomic information, which greatly assists the identification process. Such metadata, which are difficult to query in the literature, are often compiled in proprietary databases, such as the Dictionary of Natural Products, which encompasses all NP structures reported with links to their biological sources (see Related links). However, a comprehensive experimental tandem mass spectrometry (MS/MS) database of all NPs reported to date does not exist, and a search for experimental spectra across various platforms is hindered by the lack of standardized collision energy conditions for fragmentation in LC–MS/MS 25 .

In this respect, the Global Natural Products Social (GNPS) molecular networking platform developed in the Dorrestein laboratory is an important addition to the toolbox 41 . Molecular networking organizes thousands of sets of MS/MS data recorded from a given set of extracts and visualizes the relationship of the analytes as clusters of structurally related molecules. This improves the efficiency of dereplication by enabling annotation of isomers and analogues of a given metabolite in a cluster 42 . The recorded experimental spectra can be searched against putative structures and their corresponding predicted MS/MS spectra generated by tools such as competitive fragmentation modelling (CFM-ID) 43 . Based on such approaches, vast databases of theoretical NP spectra have been created and applied in dereplication 44 . The GNPS molecular networking approach has limitations, however, such as better applicability to some classes of NPs than others and the uncertainty of structural assignment among possible predicted candidates. Efforts to address such issues are ongoing 45 , 46 , 47 , including overlaying molecular networks of large NP extract libraries with taxonomic information to improve the confidence of annotation 48 . Overall, molecular networking mainly allows better prioritization of the isolation of unknown compounds by strengthening the dereplication process and elucidating relationships between NP analogues, and rigorous structure elucidation for NPs of interest should not be neglected.
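The clustering idea behind molecular networking can be sketched in a few lines. The example below is a deliberately simplified illustration, not the GNPS algorithm: it assumes a plain cosine score on binned fragment peaks, a hypothetical similarity threshold of 0.7 and toy spectra, and it groups spectra into connected components with a small union-find. All names are illustrative.

```python
import math
from itertools import combinations


def bin_spectrum(peaks, bin_width=0.01):
    """Collapse (m/z, intensity) peaks into integer m/z bins."""
    binned = {}
    for mz, intensity in peaks:
        key = round(mz / bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned


def cosine(a, b):
    """Cosine similarity between two binned spectra (dicts of bin -> intensity)."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def network_clusters(spectra, threshold=0.7):
    """Link spectra scoring at or above threshold; return connected components."""
    binned = {name: bin_spectrum(peaks) for name, peaks in spectra.items()}
    parent = {name: name for name in spectra}

    def find(node):
        # Follow parent links to the component root, compressing the path.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for first, second in combinations(spectra, 2):
        if cosine(binned[first], binned[second]) >= threshold:
            parent[find(first)] = find(second)

    clusters = {}
    for name in spectra:
        clusters.setdefault(find(name), set()).add(name)
    return sorted(sorted(members) for members in clusters.values())


# Toy MS/MS spectra: A and B share fragments, C is unrelated.
spectra = {
    "A": [(100.0, 1.0), (150.0, 0.5)],
    "B": [(100.0, 0.9), (150.0, 0.6)],
    "C": [(200.0, 1.0), (250.0, 1.0)],
}
clusters = network_clusters(spectra)
```

A plain cosine score only rewards fragments at identical m/z values, so analogues that differ by a mass shift on one substituent would score poorly here; scoring schemes that account for such shifts are one reason why dedicated platforms are preferred over a sketch like this.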

Another useful platform for metabolite identification is METLIN 49 , which includes a high-resolution MS/MS database with a fragment similarity search function that is useful for identification of unknown compounds. Other databases and in silico tools such as Compound Structure Identification (CSI): FingerID and Input Output Kernel Regression (IOKR) can be used to search available fragment ion spectra, as well as to generate predicted spectra of fragment ions not present in current databases 50 . A novel computational platform for predicting the structural identity of metabolites derived from any identified compound has also been recently reported 51 , which should increase the searchable chemical space of NPs.

To accelerate the identification of bioactive NPs in extracts, metabolomics data can be matched to the biological activities of these extracts 52 . Various chemometric methods such as multivariate data analysis can correlate the measured activity with signals in the NMR and MS spectra, enabling the active compounds to be traced in complex mixtures with no need for further bioassays 53 , 54 , 55 . Furthermore, several analytical modules involving different bioassays and detection technologies can be linked to allow simultaneous bioactivity evaluation and identification of compounds present in small amounts (analytical scale) in complex compound mixtures 34 , 35 .
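As a rough illustration of how spectral signals can be matched to bioactivity, the sketch below ranks features by the Pearson correlation between their intensity across fractions and the measured activity of those fractions. This is a univariate simplification of the multivariate chemometric methods referred to above; the feature labels, intensities and activity values are invented for the example.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y) if spread_x and spread_y else 0.0


def rank_features(intensities, activity):
    """Rank features by correlation of their intensity profile with bioactivity."""
    scores = {feature: pearson(profile, activity) for feature, profile in intensities.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical data: intensity of three MS features in four fractions,
# and the bioactivity (for example, percentage inhibition) of each fraction.
intensities = {
    "mz_301.1": [0.0, 1.0, 2.0, 3.0],
    "mz_455.2": [3.0, 2.0, 1.0, 0.0],
    "mz_512.3": [1.0, 1.0, 1.0, 1.0],
}
activity = [0.0, 10.0, 20.0, 30.0]
ranked = rank_features(intensities, activity)
```

Features at the top of the ranking are candidates for the active constituents, although correlation across a handful of fractions cannot distinguish an active compound from an inactive one that co-elutes with it.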

Metabolomics data can be integrated with data obtained by other omics techniques such as transcriptomics and proteomics and/or with imaging-based screens. For example, Acharya et al. used this approach to characterize NP-mediated interactions between a Micromonospora species and a Rhodococcus species 56 . In another interesting example, Kurita et al. developed a compound activity mapping platform for the prediction of identities and mechanisms of action of constituents from complex NP extract libraries by integrating cytological profiling 57 with untargeted metabolomics data from a library of extracts 58 , and identified quinocinnolinomycins as a new family of NPs causing endoplasmic reticulum stress 58 (Fig.  2a ).

Analytical advances that enable the profiling of responses to bioactive molecules at the single-cell level can also accelerate NP-based drug discovery. Irish, Bachmann, Earl and colleagues developed a high-throughput platform for metabolomic profiling of bioactivity by integrating phospho-specific flow cytometry, single-cell chemical biology and cellular barcoding with metabolomic arrays (characterized chromatographic microtitre arrays originating from biological extracts) 59 . Using this platform, the authors studied the single-cell responses of bone marrow biopsy samples from patients with acute myeloid leukaemia following exposure to microbial metabolomic arrays obtained from extracts of biosynthetically prolific bacteria, which enabled the identification of new bioactive polyketides 59 .

Finally, advances in analytical technologies continue to support the rigorous structure determination of NPs of interest. The progressive development of higher-field NMR instruments and probe technology 60 , 61 has enabled NP structure determination from very small quantities (below 10 µg) 62 , 63 , which is important, as the available quantities of NPs are often limited. In addition, microcrystal electron diffraction (MicroED) has recently emerged as a cryo-electron microscopy-based technique for unambiguous structure determination of small molecules 64 and is already finding important applications in NP research 65 . The increased resolution and sensitivity of analytical equipment can also help address problems associated with ‘residual complexity’ of isolated NPs; that is when biologically potent but unidentified impurities in an isolated NP sample (which could include structurally related metabolites or conformers) lead to an incorrect assignment of structure and/or activity 66 , 67 . To avoid futile downstream development efforts, Pauli and colleagues recommended that lead NPs should undergo advanced purity analysis at an early stage using quantitative NMR and LC–MS 67 .

Genome mining and engineering

Advances in knowledge on biosynthetic pathways for NPs and in developing tools for analysing and manipulating genomes are further key drivers for modern NP-based drug discovery. Two key characteristics enable the identification of biosynthetic genes in the genomes of the producing organisms. First, these genes are clustered in the genomes of bacteria and filamentous fungi. Second, many NPs are based on polyketide or peptide cores, and their biosynthetic pathways involve enzymes — polyketide synthases (PKSs) and nonribosomal peptide synthetases (NRPSs), respectively — that are encoded by large genes with highly conserved modules 68 .

‘Genome mining’ is based on searches for genes that are likely to govern biosynthesis of scaffold structures, and can be used to identify NP biosynthetic gene clusters 69 , 70 , 71 . Prioritization of gene clusters for further work is facilitated by advances in biosynthetic knowledge and predictive bioinformatics tools, which can provide hints about whether the metabolic products of the clusters have chemical scaffolds that are new or known, thereby supporting dereplication 72 , 73 . Such predictive tools for gene cluster analysis can be applied in combination with spectroscopic techniques to accelerate the identification of NPs 65 and determine the stereochemistry of metabolic products 66 . Furthermore, to extend genome mining from a single genome to entire genera, microbiomes or strain collections, computational tools have been developed, such as BiG-SCAPE, which enables sequence similarity analysis of biosynthetic gene clusters, and CORASON, which uses a phylogenomic approach to elucidate evolutionary relationships between gene clusters 74 .
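One simple way to think about comparing gene clusters across many genomes is to treat each cluster as a set of annotated protein domains and to score pairs by set overlap. The sketch below uses the Jaccard index for this purpose. It is an illustrative simplification under stated assumptions, not a reimplementation of BiG-SCAPE or CORASON: the cluster names, the domain labels and the 0.5 cutoff are hypothetical.

```python
from itertools import combinations


def jaccard(domains_a, domains_b):
    """Jaccard index of two domain collections: shared domains / all domains."""
    set_a, set_b = set(domains_a), set(domains_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def similar_cluster_pairs(clusters, cutoff=0.5):
    """Return (name1, name2, score) for cluster pairs scoring at or above cutoff."""
    pairs = []
    for (name_a, domains_a), (name_b, domains_b) in combinations(sorted(clusters.items()), 2):
        score = jaccard(domains_a, domains_b)
        if score >= cutoff:
            pairs.append((name_a, name_b, round(score, 3)))
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


# Hypothetical domain annotations: two PKS-like clusters and one NRPS-like cluster.
clusters = {
    "bgc_A": ["KS", "AT", "ACP", "TE"],
    "bgc_B": ["KS", "AT", "ACP", "KR"],
    "bgc_C": ["C", "A", "PCP"],
}
pairs = similar_cluster_pairs(clusters)
```

Set overlap ignores domain order, copy number and sequence divergence, so a score like this is only a coarse first filter; clusters that pass it would still need sequence-level and phylogenetic comparison before being grouped into families.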

Phylogenetic studies of known groups of talented secondary metabolite producers can also empower discovery of novel NPs. Recently, a study comparing secondary metabolite profiles and phylogenetic data in myxobacteria demonstrated a correlation between the taxonomic distance and the production of distinct secondary metabolite families 75 . In filamentous fungi, it was likewise shown that secondary metabolite profiles are closely correlated with their phylogeny 76 . These organisms are rich in secondary metabolites, as demonstrated by LC–MS studies of their extracts under laboratory conditions 77 . Concurrent genomic and phylogenomic analyses implied that even the genomes of well-studied organism groups harbour many gene clusters for secondary metabolite biosynthesis with as yet unknown functions 78 . The phylogeny of biosynthetic gene clusters, together with analysis of the absence of known resistance determinants, was recently used to prioritize members of the glycopeptide antibiotic family that could have novel activities. This led to the identification of the known antibiotic complestatin and the newly discovered corbomycin as compounds that act through a previously uncharacterized mechanism involving inhibition of peptidoglycan remodelling 79 .

Many microorganisms cannot be cultured, or tools for their genetic manipulation are not sufficiently developed, which makes it more challenging to access their NP-producing potential. However, biosynthetic gene clusters for NPs can be cloned and heterologously expressed in organisms that are well-characterized and easier to culture and to genetically manipulate (such as Streptomyces coelicolor , Escherichia coli and Saccharomyces cerevisiae ) 80 . The aim is to achieve higher production titres in the heterologous hosts than in wild-type strains, improving the availability of lead compounds 80 , 81 , 82 . Vectors that can carry large DNA inserts are needed for the cloning of complete NP biosynthetic gene clusters. Cosmids (which can have inserts of 30–40 kb), fosmids (which can harbour 40–50 kb) and bacterial artificial chromosomes (BACs; which can have inserts of 100 kb to >300 kb) have been developed 83 . For fungal gene clusters, self-replicating fungal artificial chromosomes (FACs) have been developed, which can have inserts of >100 kb (ref. 84 ). FACs in combination with metabolomic scoring were used to develop a scalable platform, FAC-MS, allowing the characterization of fungal biosynthetic gene clusters and their respective NPs at unprecedented scale 85 . The application of FAC-MS for the screening of 56 biosynthetic gene clusters from different fungal species yielded the discovery of 15 new metabolites, including a new macrolactone, valactamide A 85 (Fig.  2b ).

Even in culturable microorganisms, many biosynthetic gene clusters may not be expressed under conventional culture conditions, and these silent clusters could represent a large untapped source of NPs with drug-like properties 86 . Several approaches can be pursued to identify such NPs. One approach is sequencing, bioinformatic analysis and heterologous expression of silent biosynthetic gene clusters, which has already led to the discovery of several new NP scaffolds from cultivable strains 87 . Direct cloning and heterologous expression was also used to discover the new antibiotic taromycin A, which was identified upon the transfer of a silent 67 kb NRPS biosynthetic gene cluster from Saccharomonospora sp. CNQ-490 into S. coelicolor 88 . To transfer a biosynthetic gene cluster of such size, a platform based on transformation-associated recombination (TAR) cloning was developed. This platform enables direct cloning and manipulation of large biosynthetic gene clusters in S. cerevisiae , maintenance and manipulation of the vector in E. coli , and heterologous expression of the cloned gene clusters in Actinobacteria (such as S. coelicolor ) following chromosomal integration 88 , and is an alternative to BACs for heterologous expression of large biosynthetic gene clusters.

Heterologous expression has limitations, such as the need to clone and manipulate very large genome regions occupied by biosynthetic gene clusters and the difficulty of identifying a suitable host that provides all conditions necessary for the production of the corresponding NPs. These limitations can be circumvented by activating biosynthetic gene clusters directly in the native microorganism through targeted genetic manipulations, generally involving the insertion of activating regulatory elements or deletion of inhibitory elements such as repressors or their binding sites. For example, a derepression strategy of deleting gbnR , a gene for a transcriptional repressor in Streptomyces venezuelae ATCC 10712 was used by Sidda et al. in the discovery of gaburedins, a family of γ-aminobutyrate-derived ureas 89 . An example of the activator-based strategy is the constitutive expression of the samR0484 gene in Streptomyces ambofaciens ATCC 23877, which led to the discovery of stambomycins A–D, 51-membered cytotoxic glycosylated macrolides 72 . Alternatively, silent biosynthetic gene clusters can be activated using repressor decoys 90 , which have the same DNA nucleotide sequence as the binding sites for the repressors that prevent the expression of the clusters. When these decoys are introduced into the bacteria, they sequester the respective repressors, and the ‘endogenous’ binding sites in the genome remain unoccupied, leading to derepression of the previously silent biosynthetic genes and production of the corresponding NPs. This approach has been applied to activate eight silent biosynthetic gene clusters in multiple streptomycetes and led to the characterization of a novel NP, oxazolepoxidomycin A 90 . The repressor decoy strategy is simpler, easier and faster to perform than the deletion of genes encoding regulatory factors. 
However, it has the same limitation as other approaches that rely on the introduction of recombinant DNA molecules into cells: it is necessary to develop protocols for efficient introduction of DNA into the targeted host strain, and the decoy must be maintained on a high-copy plasmid to ensure efficient repressor sequestration.

Another approach focused on exchange of regulatory elements is based on the CRISPR–Cas9 technology. The promise of this technique is exemplified in a recent work by Zhang et al., which demonstrated that CRISPR–Cas9-mediated targeted promoter introduction can efficiently activate diverse biosynthetic gene clusters in multiple Streptomyces species, leading to the production of unique metabolites, including a novel polyketide in Streptomyces viridochromogenes 91 . The CRISPR–Cas9 technology was also used to knock out genes encoding two well-known and frequently rediscovered antibiotics in several actinomycete strains, which led to the production of different rare and previously unknown variants of antibiotics that were otherwise obscured, including amicetin, thiolactomycin, phenanthroviridin and 5-chloro-3-formylindole 92 .

Approaches that rely on sequencing, bioinformatics and heterologous expression can also enable the identification of novel NPs from bacterial strains that have not yet been cultivated (Fig.  3a ). For example, Hover et al. searched the metagenomes of 2,000 soil samples for biosynthetic gene clusters for lipopeptides with calcium-binding motifs. This led to the discovery of malacidins, members of the calcium-dependent antibiotic family, via heterologous expression of a 72 kb biosynthetic gene cluster from a desert soil sample in a Streptomyces albus host strain 93 (Fig.  3b ). However, in comparison with some of the other above-discussed strategies 72 , 89 , 90 , this metagenome-based discovery approach is more suited to finding new members of known NP classes rather than discovery of entirely new classes. In another study, Chu et al. developed a human microbiome-based approach that identified nonribosomal linear heptapeptides called humimycins as novel antibiotics active against methicillin-resistant Staphylococcus aureus (MRSA) 94 (Fig.  3c ). The structure of the NPs was predicted via bioinformatics analysis of gene clusters found in human commensal bacteria, followed by their chemical synthesis. A major strength of this innovative approach is that it is entirely independent of microbial cultivation and heterologous gene expression. Nevertheless, there are limitations related to the accuracy of computational chemical structure predictions and the feasibility of total chemical synthesis if structures are complex.

figure 3

a | Genome mining-based approaches to explore the biosynthetic capacity of microorganisms rely on DNA extraction, sequencing and bioinformatics analysis. The vast majority of microbes from different environments and microbiota communities have not been cultured, and their capacity to produce natural products (NPs) was largely inaccessible until recently. In the case of unculturable microorganisms, the bioinformatics analysis step can be followed by either targeted heterologous expression of biosynthetic gene clusters (BGCs) prioritized as being likely to yield relevant new NPs or direct chemical synthesis of ‘synthetic–bioinformatic’ NP-like compounds. b , c | These two approaches are exemplified by the recent discoveries of malacidins (panel b ) and humimycins (panel c ), respectively 93 , 94 . A major strength of the ‘synthetic–bioinformatic’ approach is that it is entirely independent of microbial culture and gene expression. Its limitations are the accuracy of computational chemical structure predictions and the feasibility of total chemical synthesis. NRPS, nonribosomal peptide synthetase.

The genomes of plants or animals can also be mined for novel NPs. For example, mining of 116 plant genomes enabled by identification of a precursor gene for the biosynthesis of lyciumins, a class of branched cyclic ribosomal peptides with hypotensive action produced by Lycium barbarum (popularly known as goji), identified diverse novel lyciumin chemotypes in seven other plants, including crops such as soybean, beet, quinoa and eggplant 95 . Genome mining in the animal kingdom is exemplified by the work of Dutertre et al., which used an integrated transcriptomics and proteomics approach to discover thousands of novel venom peptides from Conus marmoreus snails 96 . Proteomics analysis revealed that the vast majority of the conopeptide diversity was derived from a set of ~100 genes through variable peptide processing 96 .

Some bioactive compounds initially isolated from marine organisms might be products of symbionts, and genome mining can facilitate the characterization of such NPs. For example, it has been shown that bioactive compounds from the sponge Theonella swinhoei are produced by bacterial symbionts 97 , and characterization of the symbiont ‘ Candidatus Entotheonella serta’ using single-cell genomics led to the discovery of gene clusters for misakinolide and theonellamide biosynthesis 98 . Another example of a marine NP produced by a bacterial symbiont is ET-743 (trabectedin), originally isolated from the tunicate Ecteinascidia turbinata . A meta-omics approach developed by Rath et al. revealed that the producer of this clinically used anticancer agent is the bacterial symbiont ‘ Candidatus Endoecteinascidia frumentensis’ 99 .

Similarly, plant microbiomes also represent a large reservoir for the identification of novel bioactive NPs (such as the antitumour agents maytansine, paclitaxel and camptothecin, which were initially isolated from plants and later shown to be produced by microbial endophytes) 100 that can be tapped by genome mining approaches. An illustrative example is a recent work by Helfrich et al. that identified hundreds of novel biosynthetic gene clusters by genome mining of 224 bacterial strains isolated from Arabidopsis thaliana leaves 101 . A combination of bioactivity screening and imaging mass spectrometry was used to select a single species for further genomic analysis and led to the isolation of a NP with an unprecedented structure, the trans -acyltransferase PKS-derived antibiotic macrobrevin 101 .

Targeted genetic engineering of NP biosynthetic gene clusters can be of high value if the producing organism is difficult to cultivate or the yield of a NP is too low to allow comprehensive NP characterization. Rational genetic engineering and heterologous expression helped to increase the production of vioprolides, a depsipeptide class of anticancer and antifungal NPs in the myxobacterium Cystobacter violaceus Cb vi35, by several orders of magnitude. In addition, non-natural vioprolide analogues were generated by this approach 102 . Similarly, promoter engineering and heterologous expression of biosynthetic gene clusters were reported to result in a 7-fold increase in the production of the cytotoxic NP disorazol 103 , and a 328-fold increase in the production of spinosad, an insecticidal macrolide produced by the bacterium Saccharopolyspora spinosa 104 .

Besides increasing NP yields, targeted gene manipulation can also be used to alter biosynthetic pathways in a predictable manner to produce new NP analogues with improved pharmacological properties, such as higher specific activity, lower toxicity and better pharmacokinetics. Such biosynthetic engineering approaches depend on a solid understanding of the biosynthetic pathway leading to a specific NP, access to the genes specifying this pathway and the ability to manipulate them in either the original or a heterologous host. Recent advances in biosynthetic engineering have enabled faster and more efficient production of NP analogues, including the development of methods for accelerated engineering and recombination of modules of PKS gene clusters 105 , NRPSs 106 , 107 and NRPS–PKS assembly lines 108 , as well as elucidation of mechanisms for polyketide chain release that are contributing to NP structural diversification 109 , 110 . Examples of biosynthetic engineering applied to several important NPs include the generation of analogues of the immunosuppressant rapamycin 111 , the antitumour agents mithramycin 112 and bleomycin 113 , and the antifungal agent nystatin 114 .

It should be noted that biosynthetic engineering has limitations regarding the parts of the NP molecule that can be targeted for modifications, and the chemical groups that can be introduced or removed. Considering the complexity of many NPs, however, total synthesis may be prohibitively costly, and a combined approach of biosynthetic engineering and chemical modification can provide a viable alternative for identifying improved drug candidates. For example, biosynthetic engineering may create a ‘handle’ for addition of a beneficial chemical group by synthetic chemistry, as demonstrated for the biosynthetically engineered analogues of nystatin mentioned above; further synthetic chemistry modifications resulted in compounds with improved in vivo pharmacotherapeutic characteristics compared with amphotericin B 115 , 116 .

Advances in microbial culturing systems

The complex regulation of NP biosynthesis in response to the environment means that the conditions under which producing organisms are cultivated can have a major impact on the chance of identifying novel NPs 87 . Several strategies have been developed to improve the likelihood of identifying novel NPs compared with monoculture under standard laboratory conditions and to make ‘uncultured’ microorganisms grow in a simulated natural environment 117 (Fig.  4 ).

Fig. 4 | New strategies for isolating previously uncultured microorganisms can enable access to new natural products (NPs) produced by them. a | To recapitulate the effect of complex signals coming from the native environment, microorganisms can be cultivated directly in the environment from which they were isolated. This concept is used with the iChip platform, in which diluted environmental samples are seeded in multiple small chambers separated from the native environment by a semipermeable membrane. The potential of this approach is illustrated by the recent discovery of teixobactin, a new antibiotic with activity against Gram-positive bacteria 134 , 135 . b | Another important recent development involves obtaining information from environmental samples using omics techniques such as metagenomics to identify and partially characterize microorganisms present in a specific environment before culturing. An approach relying on such preliminary information was recently used to engineer antibodies, on the basis of genomic information, for the capture of target microorganisms, which resulted in the successful cultivation of previously uncultured bacteria from the human mouth 145 . This reverse genomics workflow was validated by the isolation and cultivation of three species of Saccharibacteria (TM7) along with their interacting Actinobacteria hosts, as well as SR1 bacteria that are members of a candidate phylum with no previously cultured representatives.

One well-established approach to promote the identification of novel NPs is the modulation of culture conditions such as temperature, pH and nutrient sources. This strategy may lead to activation of silent gene clusters, thereby promoting production of different NPs. The term ‘One Strain Many Compounds’ (OSMAC) was coined for this approach about 20 years ago 118 , but the concept has a longer history 119 , with its use being routine in industrial microbiology since the 1960s 120 .

While OSMAC is still widely used for the identification of new bioactive compounds 121 , 122 , this approach has limited capacity to mimic the complexities of natural habitats. It is difficult to predict the combination of cues (which might also involve metabolites secreted by other members of the microbial community) to which the microorganism has evolved to respond by switching metabolic programmes. To account for such interactions, co-culturing using ‘helper’ strains can be applied 123 . This can enable the production and identification of new NPs, as illustrated by recent studies in which particular fungi were co-cultured with Streptomyces species 124 , 125 .

Study of the molecular mechanisms underlying the ability of helper strains to increase the cultivability of previously uncultured microbes can lead to the identification of specific growth factors, allowing expansion of the number of species that can be successfully cultured. This strategy was used by D’Onofrio et al. for the identification of new acyl-desferrioxamine siderophores (iron-chelating compounds) as growth factors produced by helper strains promoting the growth of previously uncultured isolates from marine sediment biofilm 117 , 126 . Siderophore-assisted growth relies on the ability of these compounds to supply iron to microbes that cannot produce siderophores themselves, and the application of this approach led to the isolation of previously uncultivated microorganisms 126 . The development of strategies to cultivate microbial symbionts that produce NPs only upon interaction with their hosts can promote access to new NPs. Microbial symbionts interacting with insects or other organisms are a highly promising reservoir for the discovery of novel bioactive NPs produced in a unique ecological context 127 , 128 , 129 , 130 . To stimulate NP production, culturing strategies can be developed that better mimic the native environment of microbial symbionts of insects, including the use of media containing either lyophilized dead insects 131 or l -proline, a major constituent of insect haemolymph 132 .

Strategies to mimic the natural environment even more closely by harnessing in situ incubation in the environment from which the microorganism is sampled have been developed, dating back more than 20 years to the biotech companies OneCell and Diversa. They developed platforms that allowed the growth of some previously uncultivated microbes from various environments based on dilution and suspension in a single drop of medium 120 , 133 . More recently, such strategies have been highlighted by the development and application of a platform dubbed the iChip, in which diluted soil samples are seeded in multiple small chambers separated from the environment by a semipermeable membrane 134 . After seeding, the iChip is placed back into the soil from which the sample was taken for an in situ incubation period, allowing the cultured microorganisms to be exposed to influences from their native environment. The power of this culturing approach was demonstrated by the discovery of a new antibiotic, teixobactin, produced by a previously uncultured soil bacterium 135 , 136 (Fig. 4a ). This platform may be of great significance for NP drug discovery, given that it has been estimated that only 1% of soil organisms have so far been successfully cultured using traditional culturing techniques 137 .
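The dilution step in such chamber-based platforms can be reasoned about with a simple Poisson model. The sketch below is illustrative only: it assumes that cells are distributed randomly and independently among chambers, and the function name and the example loading values are hypothetical rather than taken from the cited studies.

```python
import math


def chamber_occupancy(mean_cells_per_chamber: float) -> dict:
    """Poisson probabilities that a chamber receives 0, exactly 1, or 2+ cells.

    Assumption (not from the source): cells land in chambers independently,
    so the number of cells per chamber follows a Poisson distribution.
    """
    lam = mean_cells_per_chamber
    if lam < 0:
        raise ValueError("mean_cells_per_chamber must be non-negative")
    p_empty = math.exp(-lam)
    p_single = lam * math.exp(-lam)
    p_multiple = 1.0 - p_empty - p_single
    return {"empty": p_empty, "single": p_single, "multiple": p_multiple}


if __name__ == "__main__":
    # Hypothetical dilution levels, expressed as mean cells per chamber.
    for lam in (0.1, 0.5, 1.0, 2.0):
        occ = chamber_occupancy(lam)
        print(
            f"mean={lam:.1f}  empty={occ['empty']:.3f}  "
            f"single={occ['single']:.3f}  multiple={occ['multiple']:.3f}"
        )
```

Under this assumption, stronger dilution lowers the share of chambers with mixed populations at the cost of more empty chambers, which is the trade-off that any dilution-based isolation scheme has to balance.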

The omics strategies discussed in previous sections can complement efforts to explore NPs produced upon microbial interactions. The application of such a strategy is illustrated in the work of Derewacz et al., who analysed the metabolome of a genome-sequenced Nocardiopsis bacterium upon co-culture with bacteria of the genera Escherichia , Bacillus , Tsukamurella and Rhodococcus 138 . Around 14% of the metabolomic features found in co-cultures were undetectable in monocultures, with many of those being unique to specific co-culture genera, and the previously unreported polyketides ciromicin A and B, which possess an unusual pyrrolidinol substructure and displayed moderate and selective cytotoxicity, were identified 138 . Other examples include a ‘culturomics’ approach that combines multiple culture conditions with MS profiling and 16S rRNA-based taxonomy to identify prokaryotic species from the human gut 139 , and an ultrahigh-throughput screening platform based on microfluidic droplet single-cell encapsulation and cultivation followed by next-generation sequencing and LC–MS, which allows investigation of pairwise interactions between target microorganisms 140 . The latter approach enabled identification of a slow-growing oral microbiota species that inhibits the growth of S. aureus 140 .
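The kind of comparison behind a statement such as "features found in co-cultures were undetectable in monocultures" can be sketched as a set difference. The example below is a minimal illustration, not the workflow of the cited study: the function name and the feature identifiers are invented, and real analyses additionally require feature alignment, blank subtraction and intensity thresholds.

```python
def coculture_specific(mono_features, coculture_features):
    """Return features seen only in co-culture and their share of all co-culture features."""
    mono = set(mono_features)
    co = set(coculture_features)
    unique = co - mono  # detected in co-culture but absent from monoculture
    fraction = len(unique) / len(co) if co else 0.0
    return unique, fraction


if __name__ == "__main__":
    # Hypothetical feature IDs written as "m/z_retention-time".
    mono = {"301.14_5.2", "415.21_7.8", "233.09_3.1"}
    co = {"301.14_5.2", "415.21_7.8", "233.09_3.1", "512.30_9.4", "640.35_11.0"}
    unique, fraction = coculture_specific(mono, co)
    print(sorted(unique), f"{fraction:.0%}")
```

Running the same comparison separately for each partner organism would indicate which features are specific to a particular co-culture pairing.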

Historically, early-adopted microbial culturing approaches led to a bias reflected in the predominant discovery of NPs from microorganisms that are easy to cultivate (such as streptomycetes and some common filamentous fungi). As a result, a vast number of NPs from such ‘easy to culture’ microbes have already been characterized, and conventional screening efforts tend to yield disappointing returns associated with frequent rediscovery of known NPs and their closely related congeners. Therefore, culturing strategies aimed at previously unexplored (or under-investigated) microbial groups with the potential to produce NPs with entirely new scaffolds and bioactivities (such as Burkholderia , Clostridium and Xenorhabdus ) are of high interest 141 , 142 . Closthioamide, the first secondary metabolite from a strictly anaerobic bacterium, was discovered from Clostridium cellulolyticum by this approach 143 . Targeted isolation of such species is important, and a genome-guided approach to achieve this goal has recently been demonstrated for Burkholderia strains in environmental samples 144 . Another highly innovative approach to the isolation and cultivation of previously uncultured bacteria was recently reported by Cross et al. 145 , who used genomic information to engineer antibodies predicted to target selected microorganisms, and then used these antibodies to capture the microorganisms from complex communities and isolate them in pure cultures. This approach was validated by isolation and cultivation of previously uncultured bacteria from the human oral cavity 145 (Fig. 4b ), and it could be applicable to a wide range of target organisms if suitable cultivation conditions can be identified for the isolated cells.

Despite these advances in culturing strategies, artificial conditions still do not fully represent the complex environment of natural habitats. To circumvent this problem, microbial and NP diversity can also be accessed via extraction of organisms and/or their NPs in situ. To directly obtain compounds produced in the natural marine environment (which may otherwise be missed), resin capture technology can be used to capture compounds on inert sorbent supports ready to be desorbed, analysed and tested for biological activity 146 . Sustainable approaches for in situ extraction with green solvents, such as glycerol or natural deep eutectic and ionic solvents (NADES), could be used directly during field work 147 , 148 . To improve dereplication, analytical equipment miniaturization is also facilitating in situ analysis; examples include the introduction of devices for physicochemical data analysis, such as micro-MS and portable near infrared spectroscopy 149 , 150 .

Outlook for NPs in drug discovery

The technological advances discussed above have the potential to reinvigorate NP-based drug discovery in both established and emerging areas. NPs have long been the key source of new drugs against infectious diseases, especially antibiotics (reviewed elsewhere 151 , 152 ). Selected NPs with antimicrobial properties discovered by leveraging advances discussed in the sections above, including strategies to exploit the human microbiome for novel NPs 94 , 153 , are highlighted in Figs 3 , 4 . Along with the search for new NPs with antimicrobial activities, researchers are continuing to develop and optimize already known NP classes, making use of advances in biosynthetic engineering 154 , total synthesis 155 or semi-synthetic strategies 156 , 157 . In addition, antivirulence strategies could represent an alternative approach to fighting infections 158 , for which NPs targeting bacterial quorum sensing could be of interest 159 .

NPs also have a successful history as cancer therapeutics, which has been well covered in other reviews 160 , 161 , 162 , 163 . An important new opportunity in this field is the capacity of some NPs to trigger a selective yet potent host immune reaction against cancer cells, particularly given the intense interest at present in strategies that could improve response rates to immune checkpoint inhibitors by turning ‘cold’ tumours ‘hot’ 164 . For example, NPs such as cardiac glycosides 165 can increase the immunogenicity of stressed and dying cancer cells by triggering immunogenic cell death, characterized by the release of damage-associated molecular patterns (DAMPs), which could open new avenues for drug discovery or repurposing 166 , 167 , 168 .

Botanical therapies containing complex mixtures of NPs have long attracted interest owing to the potential for synergistic therapeutic effects of components within the mixture 169 , 170 . However, the variability of the NP composition in the starting plant material, caused by factors such as environmental variations at the location where the plants were collected, is a major challenge for the development of botanical drugs 1 . With the advances in technology for their characterization, such as metabolomics discussed above, as well as development of regulatory guidance for complex mixtures of NPs (see Related links), it is becoming more feasible to develop such mixtures as therapeutics, rather than to identify and purify a single active ingredient 171 .

Since gut microbiota are considered to play a major role in health and disease 172 , 173 , 174 , and NPs are known to affect the gut microbiome composition 175 , 176 , 177 , 178 , this area is an emerging opportunity for NP-based drug discovery. However, drug discovery efforts in this area are still in their infancy, with many open questions remaining 179 . A future direction may be the characterization of single microbiota-derived species for particular therapeutic applications, and the advances in culturing strategies, genome mining and analytics discussed above will be of great importance in this respect.

Many advances discussed above are supported by computational tools including databases (such as genomic, chemical or spectral analysis data; see ref. 180 for a recent review on NP databases) and tools that enable the analysis of genetic information, the prediction of chemical structures and pharmacological activities 181 , the integration of data sets with diverse information (such as tools for multi-omics analysis) 182 and machine learning applications 183 .

Although this Review focuses on technologies that enable the discovery of novel NPs, it is important to acknowledge that unmodified NPs may possess suboptimal efficacy or absorption, distribution, metabolism, excretion and toxicity (ADMET) properties. So, for development of NP hits into leads and ultimately into successful drugs, chemical modification may be required. In addition, bringing a compound into clinical development requires a sustainable and economically viable supply of sufficient quantities of the compound. Total chemical synthesis, semi-synthesis using a NP as a starting point for analogue generation and biosynthetic engineering modifying biosynthetic pathways of the producing organism will be of great importance in this context (Fig.  5 ). Recent advances in chemical synthesis and biosynthetic engineering technologies are strongly empowering NP-based drug discovery and development by enabling property optimization of complex NP scaffolds that were previously regarded as inaccessible. This allows the enrichment of screening libraries with NPs, NP hybrids, NP analogues and NP-inspired molecules, as well as superior structure functionalization approaches (including late-stage functionalization) for optimization of NP leads 94 , 105 , 106 , 107 , 108 , 184 , 185 , 186 , 187 , 188 .

Fig. 5 | Unmodified natural products (NPs) often possess suboptimal properties, and superior analogues need to be obtained in order to yield valuable new drugs. a | NP analogues can be accessed through the development of total chemical synthesis followed by chemical derivatization, through semisynthesis using a NP as a starting point for the introduction of chemical modifications, and through biosynthetic engineering using manipulations of biosynthetic pathways of the producing organism to generate NP analogues. b , c | Tetracyclines are an example of NP-derived antibiotics that have already yielded several generations of successfully marketed semisynthetic and synthetic derivatives. The first generation of tetracyclines (such as chlortetracycline and tetracycline) were unmodified NPs, while the two subsequent generations of analogues with optimized properties were semisynthetic (second-generation, doxycycline, minocycline; third-generation, tigecycline) and the most recently developed fourth-generation analogues (eravacycline) are entirely synthetic, accessed via total synthesis 193 , 194 . More recent examples of property optimization of other classes of NPs through total chemical synthesis followed by chemical derivatization or through semisynthesis are illustrated by studies focused on analogues of chrysomycin A (panel b) 195 and arylomycins (panel c) 157 , respectively. d | The biosynthetic engineering approach has also shown potential; for example, in the generation of analogues of rapamycin 111 , bleomycin 113 (panel d) and nystatin 114 . 6′-deoxy-BLM A2, 6′-deoxy-bleomycin A2; BLM A2, bleomycin A2.

Finally, although NP-based drug discovery offers a unique niche for diverse forms of academia–industry collaboration, a key challenge is that scientific and technological expertise is often scattered over many academic institutions and companies. Focused efforts are needed to support translational NP research in academia, which has become more difficult in recent years given the decline in the number of large companies actively engaged in NP research. A conventional solution to improve academia–industry interaction is to focus the relevant expertise under one umbrella and in close spatial proximity. For example, the Phytovalley Tirol, centred in Innsbruck, Austria, brings together several research institutions and companies (among others, the Austrian Drug Screening Institute (ADSI), the Michael Popp Research Institute for New Phyto-Entities, Bionorica Research and Biocrates Life Sciences AG) with the aim of accelerating NP-based drug discovery. Another solution could be virtual consortia, such as the International Natural Product Sciences Taskforce (INPST) that we have recently established (see Related links), which provides a platform for integration of expertise, technology and materials from the participating academic and industrial entities.

In conclusion, NPs remain a promising pool for the discovery of scaffolds with high structural diversity and various bioactivities that can be directly developed or used as starting points for optimization into novel drugs. While drug development overall continues to be challenged by high attrition rates, there are additional hurdles for NPs due to issues such as accessibility, sustainable supply and IP constraints. However, we believe that the scientific and technological advances discussed in this Review provide a strong basis for NP-based drug discovery to continue making major contributions to human health and longevity.

Atanasov, A. G. et al. Discovery and resupply of pharmacologically active plant-derived natural products: a review. Biotechnol. Adv. 33 , 1582–1614 (2015).

Harvey, A. L., Edrada-Ebel, R. & Quinn, R. J. The re-emergence of natural products for drug discovery in the genomics era. Nat. Rev. Drug Discov. 14 , 111–129 (2015).

Newman, D. J. & Cragg, G. M. Natural products as sources of new drugs from 1981 to 2014. J. Nat. Prod. 79 , 629–661 (2016).

Waltenberger, B., Mocan, A., Šmejkal, K., Heiss, E. H. & Atanasov, A. G. Natural products to counteract the epidemic of cardiovascular and metabolic disorders. Molecules 21 , 807 (2016).

Tintore, M., Vidal-Jordana, A. & Sastre-Garriga, J. Treatment of multiple sclerosis — success from bench to bedside. Nat. Rev. Neurol. 15 , 53–58 (2019).

Feher, M. & Schmidt, J. M. Property distributions: differences between drugs, natural products, and molecules from combinatorial chemistry. J. Chem. Inf. Comput. Sci. 43 , 218–227 (2003).

Barnes, E. C., Kumar, R. & Davis, R. A. The use of isolated natural products as scaffolds for the generation of chemically diverse screening libraries for drug discovery. Nat. Prod. Rep. 33 , 372–381 (2016).

Li, J. W.-H. & Vederas, J. C. Drug discovery and natural products: end of an era or an endless frontier? Science 325 , 161–165 (2009).

Clardy, J. & Walsh, C. Lessons from natural molecules. Nature 432 , 829–837 (2004).

Lawson, A. D. G., MacCoss, M. & Heer, J. P. Importance of rigidity in designing small molecule drugs to tackle protein–protein interactions (PPIs) through stabilization of desired conformers. J. Med. Chem. 61 , 4283–4289 (2018).

Doak, B. C., Over, B., Giordanetto, F. & Kihlberg, J. Oral druggable space beyond the rule of 5: insights from drugs and clinical candidates. Chem. Biol. 21 , 1115–1142 (2014).

Shultz, M. D. Two decades under the influence of the rule of five and the changing properties of approved oral drugs. J. Med. Chem. 62 , 1701–1714 (2019).

Lachance, H., Wetzel, S., Kumar, K. & Waldmann, H. Charting, navigating, and populating natural product chemical space for drug discovery. J. Med. Chem. 55 , 5989–6001 (2012).

Henrich, C. J. & Beutler, J. A. Matching the power of high throughput screening to the chemical diversity of natural products. Nat. Prod. Rep. 30 , 1284 (2013).

Cragg, G. M., Schepartz, S. A., Suffness, M. & Grever, M. R. The taxol supply crisis. New NCI policies for handling the large-scale production of novel natural product anticancer and anti-HIV agents. J. Nat. Prod. 56 , 1657–1668 (1993).

Harrison, C. Patenting natural products just got harder. Nat. Biotechnol. 32 , 403–404 (2014).

Burton, G. & Evans-Illidge, E. A. Emerging R and D law: the Nagoya Protocol and its implications for researchers. ACS Chem. Biol. 9 , 588–591 (2014).

Heffernan, O. Why a landmark treaty to stop ocean biopiracy could stymie research. Nature 580 , 20–22 (2020).

Corson, T. W. & Crews, C. M. Molecular understanding and modern application of traditional medicines: triumphs and trials. Cell 130 , 769–774 (2007).

Moffat, J. G., Vincent, F., Lee, J. A., Eder, J. & Prunotto, M. Opportunities and challenges in phenotypic drug discovery: an industry perspective. Nat. Rev. Drug Discov. 16 , 531–543 (2017).

Shi, Y., Inoue, H., Wu, J. C. & Yamanaka, S. Induced pluripotent stem cell technology: a decade of progress. Nat. Rev. Drug Discov. 16 , 115–130 (2017).

Fellmann, C., Gowen, B. G., Lin, P.-C., Doudna, J. A. & Corn, J. E. Cornerstones of CRISPR–Cas in drug discovery and therapy. Nat. Rev. Drug Discov. 16 , 89–100 (2017).

Schirle, M. & Jenkins, J. L. Identifying compound efficacy targets in phenotypic drug discovery. Drug Discov. Today 21 , 82–89 (2016).

Wagenaar, M. M. Pre-fractionated microbial samples-the second generation natural products library at Wyeth. Molecules 13 , 1406–1426 (2008).

Wolfender, J.-L., Nuzillard, J.-M., van der Hooft, J. J. J., Renault, J.-H. & Bertrand, S. Accelerating metabolite identification in natural product research: toward an ideal combination of liquid chromatography–high-resolution tandem mass spectrometry and NMR profiling, in silico databases, and chemometrics. Anal. Chem. 91 , 704–742 (2019).

Stuart, K. A., Welsh, K., Walker, M. C. & Edrada-Ebel, R. A. Metabolomic tools used in marine natural product drug discovery. Expert Opin. Drug Discov. 15 , 499–522 (2020).

Allard, P.-M., Genta-Jouve, G. & Wolfender, J.-L. Deep metabolome annotation in natural products research: towards a virtuous cycle in metabolite identification. Curr. Opin. Chem. Biol. 36 , 40–49 (2017).

Allard, P.-M. et al. Pharmacognosy in the digital era: shifting to contextualized metabolomics. Curr. Opin. Biotechnol. 54 , 57–64 (2018).

Hubert, J., Nuzillard, J.-M. & Renault, J.-H. Dereplication strategies in natural product research: How many tools and methodologies behind the same concept? Phytochem. Rev. 16 , 55–95 (2017).

Liu, X. & Locasale, J. W. Metabolomics: a primer. Trends Biochem. Sci. 42 , 274–284 (2017).

Eugster, P. J. et al. Ultra high pressure liquid chromatography for crude plant extract profiling. J. AOAC Int. 94 , 51–70 (2011).

Stavrianidi, A. A classification of liquid chromatography mass spectrometry techniques for evaluation of chemical composition and quality control of traditional medicines. J. Chromatogr. A 1609 , 460501 (2020).

Wolfender, J.-L., Marti, G., Thomas, A. & Bertrand, S. Current approaches and challenges for the metabolite profiling of complex natural extracts. J. Chromatogr. A 1382 , 136–164 (2015).

Tahtah, Y. et al. High-resolution PTP1B inhibition profiling combined with high-performance liquid chromatography–high-resolution mass spectrometry–solid-phase extraction–nuclear magnetic resonance spectroscopy: proof-of-concept and antidiabetic constituents in crude extract of Eremophila lucida . Fitoterapia 110 , 52–58 (2016).

Chu, C. et al. Antidiabetic constituents of Dendrobium officinale as determined by high-resolution profiling of radical scavenging and α-glucosidase and α-amylase inhibition combined with HPLC-PDA-HRMS-SPE-NMR analysis. Phytochem. Lett. 31 , 47–52 (2019).

Garcia-Perez, I. et al. Identifying unknown metabolites using NMR-based metabolic profiling techniques. Nat. Protoc. 15 , 2538–2567 (2020).

Giavalisco, P. et al. High-resolution direct infusion-based mass spectrometry in combination with whole 13C metabolome isotope labeling allows unambiguous assignment of chemical sum formulas. Anal. Chem. 80 , 9417–9425 (2008).

Covington, B. C., McLean, J. A. & Bachmann, B. O. Comparative mass spectrometry-based metabolomics strategies for the investigation of microbial secondary metabolites. Nat. Prod. Rep. 34 , 6–24 (2017).

Fontana, A., Iturrino, L., Corens, D. & Crego, A. L. Automated open-access liquid chromatography high resolution mass spectrometry to support drug discovery projects. J. Pharm. Biomed. Anal. 178 , 112908 (2020).

Kind, T. et al. Identification of small molecules using accurate mass MS/MS search. Mass. Spectrom. Rev. 37 , 513–532 (2018).

Wang, M. et al. Sharing and community curation of mass spectrometry data with Global Natural Products Social Molecular Networking. Nat. Biotechnol. 34 , 828–837 (2016).

Yang, J. Y. et al. Molecular networking as a dereplication strategy. J. Nat. Prod. 76 , 1686–1699 (2013).

Allen, F., Greiner, R. & Wishart, D. Competitive fragmentation modeling of ESI-MS/MS spectra for putative metabolite identification. Metabolomics 11 , 98–110 (2015).

Allard, P.-M. et al. Integration of molecular networking and in-silico MS/MS fragmentation for natural products dereplication. Anal. Chem. 88 , 3317–3323 (2016).

da Silva, R. R. et al. Propagating annotations of molecular networks using in silico fragmentation. PLoS Comput. Biol. 14 , e1006089 (2018).

Randazzo, G. M. et al. Prediction of retention time in reversed-phase liquid chromatography as a tool for steroid identification. Anal. Chim. Acta 916 , 8–16 (2016).

Zhou, Z., Xiong, X. & Zhu, Z.-J. MetCCS predictor: a web server for predicting collision cross-section values of metabolites in ion mobility-mass spectrometry based metabolomics. Bioinformatics 33 , 2235–2237 (2017).

Rutz, A. et al. Taxonomically informed scoring enhances confidence in natural products annotation. Front. Plant. Sci. 10 , 1329 (2019).

Guijas, C. et al. METLIN: a technology platform for identifying knowns and unknowns. Anal. Chem. 90 , 3156–3164 (2018).

Aksenov, A. A., da Silva, R., Knight, R., Lopes, N. P. & Dorrestein, P. C. Global chemical analysis of biology by mass spectrometry. Nat. Rev. Chem. 1 , 0054 (2017).

Fox Ramos, A. E. et al. CANPA: computer-assisted natural products anticipation. Anal. Chem. 91 , 11247–11252 (2019).

Wolfender, J.-L., Litaudon, M., Touboul, D. & Queiroz, E. F. Innovative omics-based approaches for prioritisation and targeted isolation of natural products – new strategies for drug discovery. Nat. Prod. Rep. 36 , 855–868 (2019).

Graziani, V. et al. Metabolomic approach for a rapid identification of natural products with cytotoxic activity against human colorectal cancer cells. Sci. Rep. 8 , 5309 (2018).

Grienke, U. et al. 1H NMR-MS-based heterocovariance as a drug discovery tool for fishing bioactive compounds out of a complex mixture of structural analogues. Sci. Rep. 9 , 11113 (2019).

Aligiannis, N. et al. Heterocovariance based metabolomics as a powerful tool accelerating bioactive natural product identification. ChemistrySelect 1 , 2531–2535 (2016).

Acharya, D. et al. Omics technologies to understand activation of a biosynthetic gene cluster in Micromonospora sp. WMMB235: deciphering keyicin biosynthesis. ACS Chem. Biol. 14 , 1260–1270 (2019).

Schulze, C. J. et al. ‘Function-first’ lead discovery: mode of action profiling of natural product libraries using image-based screening. Chem. Biol. 20 , 285–295 (2013).

Kurita, K. L., Glassey, E. & Linington, R. G. Integration of high-content screening and untargeted metabolomics for comprehensive functional annotation of natural product libraries. Proc. Natl Acad. Sci. USA 112 , 11999–12004 (2015).

Earl, D. C. et al. Discovery of human cell selective effector molecules using single cell multiplexed activity metabolomics. Nat. Commun. 9 , 39 (2018).

Wishart, D. S. NMR metabolomics: a look ahead. J. Magn. Reson. 306 , 155–161 (2019).

Berlinck, R. G. S. et al. Approaches for the isolation and identification of hydrophilic, light-sensitive, volatile and minor natural products. Nat. Prod. Rep. 36 , 981–1004 (2019).

Hilton, B. D. & Martin, G. E. Investigation of the experimental limits of small-sample heteronuclear 2D NMR. J. Nat. Prod. 73 , 1465–1469 (2010).

Sultan, S. et al. Evolving trends in the dereplication of natural product extracts. 3: Further lasiodiplodins from Lasiodiplodia theobromae , an endophyte from Mapania kurzii . Tetrahedron Lett. 55 , 453–455 (2014).

Jones, C. G. et al. The CryoEM method MicroED as a powerful tool for small molecule structure determination. ACS Cent. Sci. 4 , 1587–1592 (2018).

Ting, C. P. et al. Use of a scaffold peptide in the biosynthesis of amino acid-derived natural products. Science 365 , 280–284 (2019).

Ganesh, T. et al. Evaluation of the tubulin-bound paclitaxel conformation: synthesis, biology, and SAR studies of C-4 to C-3′ bridged paclitaxel analogues. J. Med. Chem. 50 , 713–725 (2007).

Choules, M. P. et al. Residual complexity does impact organic chemistry and drug discovery: the case of rufomyazine and rufomycin. J. Org. Chem. 83 , 6664–6672 (2018).

Ziemert, N., Alanjary, M. & Weber, T. The evolution of genome mining in microbes - a review. Nat. Prod. Rep. 33 , 988–1005 (2016).

Viehrig, K. et al. Structure and biosynthesis of crocagins: polycyclic posttranslationally modified ribosomal peptides from Chondromyces crocatus . Angew. Chem. Int. Ed. Engl. 56 , 7407–7410 (2017).

Surup, F. et al. Crocadepsins-depsipeptides from the myxobacterium Chondromyces crocatus found by a genome mining approach. ACS Chem. Biol. 13 , 267–272 (2018).

Kayrouz, C. M., Zhang, Y., Pham, T. M. & Ju, K. S. Genome mining reveals the phosphonoalamide natural products and a new route in phosphonic acid biosynthesis. ACS Chem. Biol. 15 , 1921–1929 (2020).

Laureti, L. et al. Identification of a bioactive 51-membered macrolide complex by activation of a silent polyketide synthase in Streptomyces ambofaciens . Proc. Natl Acad. Sci. USA 108 , 6258–6263 (2011).

Weber, T. & Kim, H. U. The secondary metabolite bioinformatics portal: Computational tools to facilitate synthetic biology of secondary metabolite production. Synth. Syst. Biotechnol. 1 , 69–79 (2016).

Navarro-Muñoz, J. C. et al. A computational framework to explore large-scale biosynthetic diversity. Nat. Chem. Biol. 16 , 60–68 (2020).

Hoffmann, T. et al. Correlating chemical diversity with taxonomic distance for discovery of natural products in myxobacteria. Nat. Commun. 9 , 803 (2018).

Helaly, S. E., Thongbai, B. & Stadler, M. Diversity of biologically active secondary metabolites from endophytic and saprotrophic fungi of the ascomycete order Xylariales. Nat. Prod. Rep. 35 , 992–1014 (2018).

Dalinova, A. et al. Isolation and bioactivity of secondary metabolites from solid culture of the fungus, Alternaria sonchi . Biomolecules 10 , 81 (2020).

Zerikly, M. & Challis, G. L. Strategies for the discovery of new natural products by genome mining. ChemBioChem 10 , 625–633 (2009).

Culp, E. J. et al. Evolution-guided discovery of antibiotics that inhibit peptidoglycan remodelling. Nature 578 , 582–587 (2020).

Zhang, H., Boghigian, B. A., Armando, J. & Pfeifer, B. A. Methods and options for the heterologous production of complex natural products. Nat. Prod. Rep. 28 , 125–151 (2011).

Anyaogu, D. C. & Mortensen, U. H. Heterologous production of fungal secondary metabolites in aspergilli. Front. Microbiol. 6 , 77 (2015).

Sucipto, H., Pogorevc, D., Luxenburger, E., Wenzel, S. C. & Müller, R. Heterologous production of myxobacterial α-pyrone antibiotics in Myxococcus xanthus . Metab. Eng. 44 , 160–170 (2017).

Nora, L. C. et al. The art of vector engineering: towards the construction of next-generation genetic tools. Microb. Biotechnol. 12 , 125–147 (2019).

Bok, J. W. et al. Fungal artificial chromosomes for mining of the fungal secondary metabolome. BMC Genomics 16 , 343 (2015).

Clevenger, K. D. et al. A scalable platform to identify fungal secondary metabolites and their gene clusters. Nat. Chem. Biol. 13 , 895–901 (2017).

Mao, D., Okada, B. K., Wu, Y., Xu, F. & Seyedsayamdost, M. R. Recent advances in activating silent biosynthetic gene clusters in bacteria. Curr. Opin. Microbiol. 45 , 156–163 (2018).

Rutledge, P. J. & Challis, G. L. Discovery of microbial natural products by activation of silent biosynthetic gene clusters. Nat. Rev. Microbiol. 13 , 509–523 (2015).

Yamanaka, K. et al. Direct cloning and refactoring of a silent lipopeptide biosynthetic gene cluster yields the antibiotic taromycin A. Proc. Natl Acad. Sci. USA 111 , 1957–1962 (2014).

Sidda, J. D. et al. Discovery of a family of γ-aminobutyrate ureas via rational derepression of a silent bacterial gene cluster. Chem. Sci. 5 , 86–89 (2014).

Wang, B., Guo, F., Dong, S.-H. & Zhao, H. Activation of silent biosynthetic gene clusters using transcription factor decoys. Nat. Chem. Biol. 15 , 111–114 (2019).

Zhang, M. M. et al. CRISPR–Cas9 strategy for activation of silent Streptomyces biosynthetic gene clusters. Nat. Chem. Biol. 13 , 607–609 (2017).

Culp, E. J. et al. Hidden antibiotics in actinomycetes can be identified by inactivation of gene clusters for common antibiotics. Nat. Biotechnol. 37 , 1149–1154 (2019).

Hover, B. M. et al. Culture-independent discovery of the malacidins as calcium-dependent antibiotics with activity against multidrug-resistant Gram-positive pathogens. Nat. Microbiol. 3 , 415–422 (2018).

Chu, J. et al. Discovery of MRSA active antibiotics using primary sequence from the human microbiome. Nat. Chem. Biol. 12 , 1004–1006 (2016).

Kersten, R. D. & Weng, J.-K. Gene-guided discovery and engineering of branched cyclic peptides in plants. Proc. Natl Acad. Sci. USA 115 , E10961–E10969 (2018).

Dutertre, S. et al. Deep venomics reveals the mechanism for expanded peptide diversity in cone snail venom. Mol. Cell. Proteom. 12 , 312–329 (2013).

Wilson, M. C. et al. An environmental bacterial taxon with a large and distinct metabolic repertoire. Nature 506 , 58–62 (2014).

Mori, T. et al. Single-bacterial genomics validates rich and varied specialized metabolism of uncultivated Entotheonella sponge symbionts. Proc. Natl Acad. Sci. USA 115 , 1718–1723 (2018).

Rath, C. M. et al. Meta-omic characterization of the marine invertebrate microbial consortium that produces the chemotherapeutic natural product ET-743. ACS Chem. Biol. 6 , 1244–1256 (2011).

Newman, D. J. Are microbial endophytes the ‘actual’ producers of bioactive antitumor agents? Trends Cancer 4 , 662–670 (2018).

Helfrich, E. J. N. et al. Bipartite interactions, antibiotic production and biosynthetic potential of the Arabidopsis leaf microbiome. Nat. Microbiol. 3 , 909–919 (2018).

Yan, F. et al. Biosynthesis and heterologous production of vioprolides: rational biosynthetic engineering and unprecedented 4-methylazetidinecarboxylic acid formation. Angew. Chem. Int. Ed. 57 , 8754–8759 (2018).

Tu, Q. et al. Genetic engineering and heterologous expression of the disorazol biosynthetic gene cluster via Red/ET recombineering. Sci. Rep. 6 , 21066 (2016).

Song, C. et al. Enhanced heterologous spinosad production from a 79-kb synthetic multioperon assembly. ACS Synth. Biol. 8 , 137–147 (2019).

Wlodek, A. et al. Diversity oriented biosynthesis via accelerated evolution of modular gene clusters. Nat. Commun. 8 , 1206 (2017).

Bozhüyük, K. A. J. et al. De novo design and engineering of non-ribosomal peptide synthetases. Nat. Chem. 10 , 275–281 (2018).

Bozhüyük, K. A. J. et al. Modification and de novo design of non-ribosomal peptide synthetases using specific assembly points within condensation domains. Nat. Chem. 11 , 653–661 (2019).

Awakawa, T. et al. Reprogramming of the antimycin NRPS-PKS assembly lines inspired by gene evolution. Nat. Commun. 9 , 3534 (2018).

Masschelein, J. et al. A dual transacylation mechanism for polyketide synthase chain release in enacyloxin antibiotic biosynthesis. Nat. Chem. 11 , 906–912 (2019).

Kosol, S. et al. Structural basis for chain release from the enacyloxin polyketide synthase. Nat. Chem. 11 , 913–923 (2019).

Gregory, M. A. et al. Structure guided design of improved anti-proliferative rapalogs through biosynthetic medicinal chemistry. Chem. Sci. 4 , 1046–1052 (2013).

Méndez, C., González-Sabín, J., Morís, F. & Salas, J. A. Expanding the chemical diversity of the antitumoral compound mithramycin by combinatorial biosynthesis and biocatalysis: the quest for mithralogs with improved therapeutic window. Planta Med. 81 , 1326–1338 (2015).

Hindra et al. Genome mining of Streptomyces mobaraensis DSM40847 as a bleomycin producer providing a biotechnology platform to engineer designer bleomycin analogues. Org. Lett. 19 , 1386–1389 (2017).

Brautaset, T. et al. Improved antifungal polyene macrolides via engineering of the nystatin biosynthetic genes in Streptomyces noursei . Chem. Biol. 15 , 1198–1206 (2008).

Preobrazhenskaya, M. N. et al. Synthesis and study of the antifungal activity of new mono- and disubstituted derivatives of a genetically engineered polyene antibiotic 28,29-didehydronystatin A1 (S44HP). J. Antibiot. 63 , 55–64 (2010).

Tevyashova, A. N. et al. Structure-antifungal activity relationships of polyene antibiotics of the amphotericin B group. Antimicrob. Agents Chemother. 57 , 3815–3822 (2013).

Lewis, K., Epstein, S., D’Onofrio, A. & Ling, L. L. Uncultured microorganisms as a source of secondary metabolites. J. Antibiot. 63 , 468–476 (2010).

Schiewe, H.-J. & Zeeck, A. Cineromycins, γ-butyrolactones and ansamycins by analysis of the secondary metabolite pattern created by a single strain of Streptomyces . J. Antibiot. 52 , 635–642 (1999).

Zähner, H. Some aspects of antibiotics research. Angew. Chem. Int. Ed. Engl. 16 , 687–694 (1977).

Newman, D. Screening and identification of novel biologically active natural compounds. F1000Research 6 , 783 (2017).

Hussain, A. et al. Novel bioactive molecules from Lentzea violacea strain AS 08 using one strain-many compounds (OSMAC) approach. Bioorg. Med. Chem. Lett. 27 , 2579–2582 (2017).

Hemphill, C. F. P. et al. OSMAC approach leads to new fusarielin metabolites from Fusarium tricinctum . J. Antibiot. 70 , 726–732 (2017).

Vartoukian, S. R., Palmer, R. M. & Wade, W. G. Strategies for culture of ‘unculturable’ bacteria. FEMS Microbiol. Lett. 309 , 1–7 (2010).

Moussa, M. et al. Co-culture of the fungus Fusarium tricinctum with Streptomyces lividans induces production of cryptic naphthoquinone dimers. RSC Adv. 9 , 1491–1500 (2019).

Abdel-Razek, A. S., Hamed, A., Frese, M., Sewald, N. & Shaaban, M. Penicisteroid C: new polyoxygenated steroid produced by co-culturing of Streptomyces piomogenus with Aspergillus niger . Steroids 138 , 21–25 (2018).

D’Onofrio, A. et al. Siderophores from neighboring organisms promote the growth of uncultured bacteria. Chem. Biol. 17 , 254–264 (2010).

Van Arnam, E. B., Currie, C. R. & Clardy, J. Defense contracts: molecular protection in insect-microbe symbioses. Chem. Soc. Rev. 47 , 1638–1651 (2018).

Molloy, E. M. & Hertweck, C. Antimicrobial discovery inspired by ecological interactions. Curr. Opin. Microbiol. 39 , 121–127 (2017).

Tobias, N. J., Shi, Y. M. & Bode, H. B. Refining the natural product repertoire in entomopathogenic bacteria. Trends Microbiol. 26 , 833–840 (2018).

Imai, Y. et al. A new antibiotic selectively kills Gram-negative pathogens. Nature 576 , 459–464 (2019).

Bode, E. et al. Biosynthesis and function of simple amides in Xenorhabdus doucetiae . Environ. Microbiol. 19 , 4564–4575 (2017).

Crawford, J. M., Kontnik, R. & Clardy, J. Regulating alternative lifestyles in entomopathogenic bacteria. Curr. Biol. 20 , 69–74 (2010).

Zengler, K. et al. Cultivating the uncultured. Proc. Natl Acad. Sci. USA 99 , 15681–15686 (2002).

Nichols, D. et al. Use of ichip for high-throughput in situ cultivation of ‘uncultivable’ microbial species. Appl. Environ. Microbiol. 76 , 2445–2450 (2010).

Ling, L. L. et al. A new antibiotic kills pathogens without detectable resistance. Nature 517 , 455–459 (2015).

Homma, T. et al. Dual targeting of cell wall precursors by teixobactin leads to cell lysis. Antimicrob. Agents Chemother. 60 , 6510–6517 (2016).

Pham, V. H. T. & Kim, J. Cultivation of unculturable soil bacteria. Trends Biotechnol. 30 , 475–484 (2012).

Derewacz, D. K., Covington, B. C., McLean, J. A. & Bachmann, B. O. Mapping microbial response metabolomes for induced natural product discovery. ACS Chem. Biol. 10 , 1998–2006 (2015).

Lagier, J. C. et al. Culture of previously uncultured members of the human gut microbiota by culturomics. Nat. Microbiol. 1 , 16203 (2016).

Terekhov, S. S. et al. Microfluidic droplet platform for ultrahigh-throughput single-cell screening of biodiversity. Proc. Natl Acad. Sci. USA 114 , 2550–2555 (2017).

Challinor, V. L. & Bode, H. B. Bioactive natural products from novel microbial sources. Ann. NY Acad. Sci. 1354 , 82–97 (2015).

Pidot, S. J., Coyne, S., Kloss, F. & Hertweck, C. Antibiotics from neglected bacterial sources. Int. J. Med. Microbiol. 304 , 14–22 (2014).

Lincke, T., Behnken, S., Ishida, K., Roth, M. & Hertweck, C. Closthioamide: an unprecedented polythioamide antibiotic from the strictly anaerobic bacterium Clostridium cellulolyticum . Angew. Chem. Int. Ed. 49 , 2011–2013 (2010).

Haeckl, F. P. J. et al. A selective genome-guided method for environmental Burkholderia isolation. J. Ind. Microbiol. Biotechnol. 46 , 345–362 (2019).

Cross, K. L. et al. Targeted isolation and cultivation of uncultivated bacteria by reverse genomics. Nat. Biotechnol. 37 , 1314–1321 (2019).

Vlachou, P. et al. Innovative approach to sustainable marine invertebrate chemistry and a scale-up technology for open marine ecosystems. Mar. Drugs 16 , 152 (2018).

Zainal-Abidin, M. H., Hayyan, M., Hayyan, A. & Jayakumar, N. S. New horizons in the extraction of bioactive compounds using deep eutectic solvents: a review. Anal. Chim. Acta 979 , 1–23 (2017).

Dai, Y., van Spronsen, J., Witkamp, G.-J., Verpoorte, R. & Choi, Y. H. Ionic liquids and deep eutectic solvents in natural products research: mixtures of solids as extraction solvents. J. Nat. Prod. 76 , 2162–2173 (2013).

Nemes, P. & Vertes, A. Ambient mass spectrometry for in vivo local analysis and in situ molecular tissue imaging. Trends Analyt. Chem. 34 , 22–34 (2012).

Pasquini, C. Near infrared spectroscopy: a mature analytical technique with new perspectives–a review. Anal. Chim. Acta 1026 , 8–36 (2018).

Hutchings, M., Truman, A. & Wilkinson, B. Antibiotics: past, present and future. Curr. Opin. Microbiol. 51 , 72–80 (2019).

Rossiter, S. E., Fletcher, M. H. & Wuest, W. M. Natural products as platforms to overcome antibiotic resistance. Chem. Rev. 117 , 12415–12474 (2017).

Zipperer, A. et al. Human commensals producing a novel antibiotic impair pathogen colonization. Nature 535 , 511–516 (2016).

Lešnik, U. et al. Construction of a new class of tetracycline lead structures with potent antibacterial activity through biosynthetic engineering. Angew. Chem. Int. Ed. Engl. 54 , 3937–3940 (2015).

Kling, A. et al. Antibiotics. Targeting DnaN for tuberculosis therapy using novel griselimycins. Science 348 , 1106–1112 (2015).

Shaeer, K. M., Zmarlicka, M. T., Chahine, E. B., Piccicacco, N. & Cho, J. C. Plazomicin: a next-generation aminoglycoside. Pharmacotherapy 39 , 77–93 (2019).

Smith, P. A. et al. Optimized arylomycins are a new class of Gram-negative antibiotics. Nature 561 , 189–194 (2018).

Dickey, S. W., Cheung, G. Y. C. & Otto, M. Different drugs for bad bugs: antivirulence strategies in the age of antibiotic resistance. Nat. Rev. Drug Discov. 16 , 457–471 (2017).

Park, S. R. et al. Discovery of cahuitamycins as biofilm inhibitors derived from a convergent biosynthetic pathway. Nat. Commun. 7 , 10710 (2016).

Mann, J. Natural products in cancer chemotherapy: past, present and future. Nat. Rev. Cancer 2 , 143–148 (2002).

Beck, A., Goetsch, L., Dumontet, C. & Corvaïa, N. Strategies and challenges for the next generation of antibody–drug conjugates. Nat. Rev. Drug Discov. 16 , 315–337 (2017).

Pereira, R. B. et al. Marine-derived anticancer agents: clinical benefits, innovative mechanisms, and new targets. Mar. Drugs 17 (2019).

Newman, D. J. & Cragg, G. M. Natural products as sources of new drugs over the nearly four decades from 01/1981 to 09/2019. J. Nat. Prod. 83 , 770–803 (2020).

Galon, J. & Bruni, D. Approaches to treat immune hot, altered and cold tumours with combination immunotherapies. Nat. Rev. Drug Discov. 18 , 197–218 (2019).

Menger, L. et al. Cardiac glycosides exert anticancer effects by inducing immunogenic cell death. Sci. Transl. Med. 4 , 143ra99 (2012).

Galluzzi, L., Buqué, A., Kepp, O., Zitvogel, L. & Kroemer, G. Immunogenic cell death in cancer and infectious disease. Nat. Rev. Immunol. 17 , 97–111 (2017).

Diederich, M. Natural compound inducers of immunogenic cell death. Arch. Pharm. Res. 42 , 629–645 (2019).

Radogna, F., Dicato, M. & Diederich, M. Natural modulators of the hallmarks of immunogenic cell death. Biochem. Pharmacol. 162 , 55–70 (2019).

Schmidt, B. M., Ribnicky, D. M., Lipsky, P. E. & Raskin, I. Revisiting the ancient concept of botanical therapeutics. Nat. Chem. Biol. 3 , 360–366 (2007).

Schmidt, B. et al. A natural history of botanical therapeutics. Metabolism 57 , S3–S9 (2008).

Kellogg, J. J. et al. Comparison of metabolomics approaches for evaluating the variability of complex botanical preparations: green tea ( Camellia sinensis ) as a case study. J. Nat. Prod. 80 , 1457–1466 (2017).

Marchesi, J. R. et al. The gut microbiota and host health: a new clinical frontier. Gut 65 , 330–339 (2016).

Abdollahi-Roodsaz, S., Abramson, S. B. & Scher, J. U. The metabolic role of the gut microbiota in health and rheumatic disease: mechanisms and interventions. Nat. Rev. Rheumatol. 12 , 446–455 (2016).

Lynch, S. V. & Pedersen, O. The human intestinal microbiome in health and disease. N. Engl. J. Med. 375 , 2369–2379 (2016).

Scherlach, K. & Hertweck, C. Mediators of mutualistic microbe-microbe interactions. Nat. Prod. Rep. 35 , 303–308 (2018).

Modi, S. R., Collins, J. J. & Relman, D. A. Antibiotics and the gut microbiota. J. Clin. Invest. 124 , 4212–4218 (2014).

Peterson, C. T. et al. Effects of turmeric and curcumin dietary supplementation on human gut microbiota: a double-blind, randomized, placebo-controlled pilot study. J. Evid. Based Integr. Med. 23 , 2515690X18790725 (2018).

Eid, H. M. et al. Significance of microbiota in obesity and metabolic diseases and the modulatory potential by medicinal plant and food ingredients. Front. Pharmacol. 8 , (2017).

Valencia, P. M., Richard, M., Brock, J. & Boglioli, E. The human microbiome: opportunity or hype? Nat. Rev. Drug Discov. 16 , 823–824 (2017).

Sorokina, M. & Steinbeck, C. Review on natural products databases: Where to find data in 2020. J. Cheminform. 12 , 20 (2020).

Schneider, G. et al. Deorphaning the macromolecular targets of the natural anticancer compound doliculide. Angew. Chem. Int. Ed. 55 , 12408–12411 (2016).

Palazzotto, E. & Weber, T. Omics and multi-omics approaches to study the biosynthesis of secondary metabolites in microorganisms. Curr. Opin. Microbiol. 45 , 109–116 (2018).

Dias, T., Gaudêncio, S. P. & Pereira, F. A computer-driven approach to discover natural product leads for methicillin-resistant Staphylococcus aureus infection therapy. Mar. Drugs 17 , 16 (2019).

Boström, J., Brown, D. G., Young, R. J. & Keserü, G. M. Expanding the medicinal chemistry synthetic toolbox. Nat. Rev. Drug Discov. 17 , 709–727 (2018).

Zhao, X. et al. A novel drug discovery strategy inspired by traditional medicine philosophies. Science 347 , S38–S40 (2015).

Liao, S. et al. Tanshinol borneol ester, a novel synthetic small molecule angiogenesis stimulator inspired by botanical formulations for angina pectoris. Br. J. Pharmacol. 176 , 3143–3160 (2019).

Bai, Y. et al. Polygala tenuifolia - Acori tatarinowii herbal pair as an inspiration for substituted cinnamic α-asaronol esters: design, synthesis, anticonvulsant activity, and inhibition of lactate dehydrogenase study. Eur. J. Med. Chem. 183 , 111650 (2019).

Seiple, I. B. et al. A platform for the discovery of new macrolide antibiotics. Nature 533 , 338–345 (2016).

Wang, L. et al. Novel interactomics approach identifies ABCA1 as direct target of evodiamine, which increases macrophage cholesterol efflux. Sci. Rep. 8 , 11061 (2018).

Chang, J., Kim, Y. & Kwon, H. J. Advances in identification and validation of protein targets of natural products without chemical modification. Nat. Prod. Rep. 33 , 719–730 (2016).

Adhikari, J. & Fitzgerald, M. C. SILAC-pulse proteolysis: a mass spectrometry-based method for discovery and cross-validation in proteome-wide studies of ligand binding. J. Am. Soc. Mass. Spectrom. 25 , 2073–2083 (2014).

Gregori-Puigjane, E. et al. Identifying mechanism-of-action targets for drugs and probes. Proc. Natl Acad. Sci. USA 109 , 11178–11183 (2012).

Yñigez-Gutierrez, A. E. & Bachmann, B. O. Fixing the unfixable: the art of optimizing natural products for human medicine. J. Med. Chem. 62 , 8412–8428 (2019).

Markley, J. L. & Wencewicz, T. A. Tetracycline-inactivating enzymes. Front. Microbiol. 9 , 1058 (2018).

Wu, F. et al. Chrysomycin A derivatives for the treatment of multi-drug-resistant tuberculosis. ACS Cent. Sci. 6 , 928–938 (2020).

Dayalan Naidu, S., Kostov, R. V. & Dinkova-Kostova, A. T. Transcription factors Hsf1 and Nrf2 engage in crosstalk for cytoprotection. Trends Pharmacol. Sci. 36 , 6–14 (2015).

Hayes, J. D. & Dinkova-Kostova, A. T. The Nrf2 regulatory network provides an interface between redox and intermediary metabolism. Trends Biochem. Sci. 39 , 199–218 (2014).

Mills, E. L. et al. Itaconate is an anti-inflammatory metabolite that activates Nrf2 via alkylation of KEAP1. Nature 556 , 113–117 (2018).

Murphy, K. E. & Park, J. J. Can co-activation of Nrf2 and neurotrophic signaling pathway slow Alzheimer’s disease? Int. J. Mol. Sci. 18 , 1168 (2017).

Cuadrado, A. et al. Therapeutic targeting of the NRF2 and KEAP1 partnership in chronic diseases. Nat. Rev. Drug Discov. 18 , 295–317 (2019).

Linker, R. A. et al. Fumaric acid esters exert neuroprotective effects in neuroinflammation via activation of the Nrf2 antioxidant pathway. Brain 134 , 678–692 (2011).

Singh, K. et al. Sulforaphane treatment of autism spectrum disorder (ASD). Proc. Natl Acad. Sci. USA 111 , 15550–15555 (2014).

Spencer, S. R., Wilczak, C. A. & Talalay, P. Induction of glutathione transferases and NAD(P)H:quinone reductase by fumaric acid derivatives in rodent cells and tissues. Cancer Res. 50 , 7871–7875 (1990).

Soušek, J. et al. Alkaloids and organic acids content of eight Fumaria species. Phytochem. Anal. 10 , 6–11 (1999).

Linker, R. A. & Haghikia, A. Dimethyl fumarate in multiple sclerosis: latest developments, evidence and place in therapy. Ther. Adv. Chronic Dis. 7 , 198–207 (2016).

Fox, R. J. et al. Efficacy and tolerability of delayed-release dimethyl fumarate in Black, Hispanic, and Asian patients with relapsing-remitting multiple sclerosis: post hoc integrated analysis of DEFINE and CONFIRM. Neurol. Ther. 6 , 175–187 (2017).

Fernández, Ó. et al. Efficacy and safety of delayed-release dimethyl fumarate for relapsing-remitting multiple sclerosis in prior interferon users: an integrated analysis of DEFINE and CONFIRM. Clin. Ther. 39 , 1671–1679 (2017).

Zhang, Y., Talalay, P., Cho, C. G. & Posner, G. H. A major inducer of anticarcinogenic protective enzymes from broccoli: isolation and elucidation of structure. Proc. Natl Acad. Sci. USA 89 , 2399–2403 (1992).

Dinkova-Kostova, A. T. et al. Direct evidence that sulfhydryl groups of Keap1 are the sensors regulating induction of phase 2 enzymes that protect against carcinogens and oxidants. Proc. Natl Acad. Sci. USA 99 , 11908–11913 (2002).

Morroni, F. et al. Neuroprotective effect of sulforaphane in 6-hydroxydopamine-lesioned mouse model of Parkinson’s disease. Neurotoxicology 36 , 63–71 (2013).

Liu, Y. et al. Sulforaphane enhances proteasomal and autophagic activities in mice and is a potential therapeutic reagent for Huntington’s disease. J. Neurochem. 129 , 539–547 (2014).

Kim, H. V. et al. Amelioration of Alzheimer’s disease by neuroprotective effect of sulforaphane in animal model. Amyloid 20 , 7–12 (2013).

Zhao, J., Moore, A. N., Clifton, G. L. & Dash, P. K. Sulforaphane enhances aquaporin-4 expression and decreases cerebral edema following traumatic brain injury. J. Neurosci. Res. 82 , 499–506 (2005).

Benedict, A. L. et al. Neuroprotective effects of sulforaphane after contusive spinal cord injury. J. Neurotrauma 29 , 2576–2586 (2012).

Alfieri, A. et al. Sulforaphane preconditioning of the Nrf2/HO-1 defense pathway protects the cerebral vasculature against blood-brain barrier disruption and neurological deficits in stroke. Free Radic. Biol. Med. 65 , 1012–1022 (2013).

Wu, S. et al. Sulforaphane produces antidepressant- and anxiolytic-like effects in adult mice. Behav. Brain Res. 301 , 55–62 (2016).

Li, B. et al. Sulforaphane ameliorates the development of experimental autoimmune encephalomyelitis by antagonizing oxidative stress and Th17-related inflammation in mice. Exp. Neurol. 250 , 239–249 (2013).

Egner, P. A. et al. Rapid and sustainable detoxication of airborne pollutants by broccoli sprout beverage: results of a randomized clinical trial in China. Cancer Prev. Res. 7 , 813–823 (2014).

Chen, J. G. et al. Dose-dependent detoxication of the airborne pollutant benzene in a randomized trial of broccoli sprout beverage in Qidong, China. Am. J. Clin. Nutr. 110 , 675–684 (2019).

Howell, S. J. et al. Final results of the STEM trial: SFX-01 in the treatment and evaluation of ER+ Her2– metastatic breast cancer (mBC). Ann. Oncol. 30 , v122 (2019).

Dinkova-Kostova, A. T. et al. Extremely potent triterpenoid inducers of the phase 2 response: correlations of protection against oxidant and inflammatory stress. Proc. Natl Acad. Sci. USA 102 , 4584–4589 (2005).

Liby, K. T. & Sporn, M. B. Synthetic oleanane triterpenoids: multifunctional drugs with a broad range of applications for prevention and treatment of chronic disease. Pharmacol. Rev. 64 , 972–1003 (2012).

Acknowledgements

This paper is affectionately dedicated in memory of Dr Mariola Macías (1984–2020) M.D., Ph.D. in Immunology, Emergency Physician at Hospital Punta Europa, Algeciras (Cadiz), Spain and active member of a research team working against SARS-CoV-2. An excellent professional and a better person. Her humanity, kindness, special and unmistakable smile, generosity, dedication and professionalism will never be forgotten. The authors are grateful to P. Kirkpatrick for his editorial contribution, which resulted in a greatly improved manuscript. A.G.A. acknowledges support from the Austrian Science Fund (FWF) project P25971-B23 (‘Improved cholesterol efflux by natural products’). R.B. acknowledges support by a grant from the Austrian Science Fund (FWF) P27505. V.B. acknowledges support by a grant from the Austrian Science Fund (FWF) P27682-B30. N.B. is recipient of an Australian Research Council DECRA Fellowship. A.C. and E.I. thank the Ministerio de Ciencia, Innovación y Universidades, Spain (Project AGL2017-89417-R) for support. M. Diederich is supported by the National Research Foundation (NRF) (grant number 019R1A2C1009231), by a grant from the MEST of Korea for Tumour Microenvironment Global Core Research Center (GCRC) (grant number NRF-2011-0030001), by the Creative-Pioneering Researchers Program through Seoul National University (Funding number: 370C-20160062), by the Brain Korea 21 (BK21) PLUS programme, by the ‘Recherche Cancer et Sang’ foundation, by the ‘Recherches Scientifiques Luxembourg’ association, by the ‘Een Häerz fir kriibskrank Kanner’ association, by the Action LIONS ‘Vaincre le Cancer’ association and by Télévie Luxembourg. The research work of A.T.D.-K. is funded by Cancer Research UK (C20953/A18644), the Biotechnology and Biological Sciences Research Council (BB/L01923X/1), Reata Pharmaceuticals, and Tenovus Scotland (T17/T14). B.L.F. acknowledges BMBF (TUNGER 036/FUCOFOOD) and AIF (AGEsense) for supporting his research. M.I.G. 
acknowledges financial support from the European Union’s Horizon 2020 research and innovation programme, project PlantaSYST (SGA No 739582 under FPA No. 664620) and the BG05M2OP001-1.003-001-C01 project, financed by the European Regional Development Fund through the ‘Science and Education for Smart Growth’ Operational Programme. K.M.G. is supported by the UK Medical Research Council (MC_UU_12011/4), the National Institute for Health Research (NIHR Senior Investigator (NF-SI-0515-10042) and the NIHR Southampton Biomedical Research Centre), the European Union (Erasmus+ Capacity-Building ENeA SEA Project and Seventh Framework Programme (FP7/2007-2013), projects EarlyNutrition and ODIN (grant agreements 289346 and 613977), the US National Institute On Ageing of the National Institutes of Health (award no. U24AG047867) and the UK ESRC and BBSRC (award no. ES/M00919X/1). Research in the laboratory of C.W.G. is supported by the Austrian Science Fund (FWF) through project P32109 and a NATVANTAGE grant 2019 by the Wilhelm Doerenkamp-Stiftung. A.K. acknowledges support by national funds through FCT-Foundation for Science and Technology of Portugal within the scope of UIDB/04423/2020 and UIDP/04423/2020. A.L. acknowledges HKBU SDF16-0603-P02 for supporting this research. F.A.M. acknowledges the support by Ministerio de Economia y Competitividad, Spain (project AGL2017-88083-R). A.M. acknowledges the support by a grant of the Romanian Ministry of Research and Innovation, CNCS – UEFISCDI, project number PN-III-P1-1.1-PD-2016-1900 – ‘PhytoSal’, within PNCDI III. G.P. acknowledges the support by NIH G12-MD007591, Kleberg Foundation and NIH R01-AG066749. M.R. acknowledges support by the Swiss National Science Foundation (Schweizerischer Nationalfonds, SNF), and by the Horizon 2020 programme of the European Union. J.M.R. acknowledges the support from the Austrian Science Fund (FWF: P24587), the Natvantage grant 2018 and the University of Vienna, Austria. G.L.R. 
acknowledges the group of Cellular and Molecular Nutrition (BJ-Lab) at the Institute of Food Sciences, National Research Council, Avellino, Italy. A.S.S. acknowledges the support by UIDB/00211/2020 with funding from FCT/MCTES through national funds. D.S. acknowledges the support by FWF S10711. D.S. is an Ingeborg Hochmair Professor at the University of Innsbruck. K.S.W. is supported by the National Centre for Research and Development (4/POLTUR-1/2016) and the National Science Centre (2017/27/B/NZ4/00917) and Medical University of Lublin, Poland. E.S.S. thanks Universidad Central de Chile, through Dirección de Investigación y Postgrado, for supporting this research. H. Stuppner acknowledges support by the Austrian Research Promotion Agency (FFG), the Austrian Science Fund (FWF) and the Horizon 2020 programme of the European Union (RISE, 691158). A.S. was granted by Instituto de Salud Carlos III, CIBEROBN (CB12/03/30038) and EU-COST Action (CA16112). M.W. acknowledges the support by DFG, BMBF, EU, CSC, DAAD, AvH and Land Baden Württemberg. J.L.W. is grateful to the Swiss National Science Foundation (SNF) for supporting its natural product metabolomics projects (grants nos. 310030E-164289, 31003A_163424 and 316030_164095). S.B.Z. acknowledges the support by University of Vienna, Vienna, Austria. M.H. acknowledges an EPSRC CASE Award (with Pukka Herbs Ltd, UK as industrial partner). I.B.-N. acknowledges the support of Competitivity Operational Program, 2014–2020, entitled ‘Clinical and economical impact of personalized targeted anti-microRNA therapies in reconverting lung cancer chemoresistance’ — CANTEMIR, No. 35/01.09.2016, MySMIS 103375; project PNCDI III 2015-2020 entitled ‘Increasing the performance of scientific research and technology transfer in translational medicine through the formation of a new generation of young researchers’ — ECHITAS, no. 29PFE/18.10.2018. 
This work was also funded by the Italian Ministry for University and Research (MIUR), grant PRIN: rot. 2017XYBP2R (to C.T.S).

Author information

Authors and affiliations.

Institute of Genetics and Animal Biotechnology of the Polish Academy of Sciences, Jastrzebiec, Poland

Atanas G. Atanasov

Department of Pharmacognosy, University of Vienna, Vienna, Austria

Atanas G. Atanasov, Sergey B. Zotchev, Verena M. Dirsch & Judith M. Rollinger

Institute of Neurobiology, Bulgarian Academy of Sciences, Sofia, Bulgaria

Ludwig Boltzmann Institute for Digital Health and Patient Safety, Medical University of Vienna, Vienna, Austria

Università degli Studi di Firenze, NEUROFARBA Dept, Sezione di Scienze Farmaceutiche, Florence, Italy

Claudiu T. Supuran

Department of Pharmacognosy, Faculty of Pharmacy, Gazi University, Ankara, Turkey

Ilkay Erdogan Orhan

Polish Mother’s Memorial Hospital Research Institute (PMMHRI), Łodz, Poland

Maciej Banach

Department of Chemical, Biological, Pharmaceutical and Environmental Sciences, Università degli Studi di Messina, Messina, Italy

Davide Barreca

Molecular Systems Biology (MOSYS), Department of Evolutionary and Functional Ecology, University of Vienna, Vienna, Austria

Wolfram Weckwerth

Vienna Metabolomics Center (VIME), University of Vienna, Vienna, Austria

Institute of Pharmaceutical Sciences, Department of Pharmacognosy, University of Graz, Graz, Austria

Rudolf Bauer & Franz Bucar

BioTechMed-Graz, Graz, Austria

Rudolf Bauer

Department of Biomolecular Sciences, The Weizmann Institute of Science, Rehovot, Israel

Edward A. Bayer

Sami Labs Limited, 19/1, 19/2, First Main, Second Phase, Peenya Industrial Area, Bangalore, Karnataka, India

Muhammed Majeed

Sabinsa Corporation, East Windsor, NJ, USA

Sabinsa Corporation, Payson, UT, USA

Lake Erie College of Osteopathic Medicine, Bradenton, FL, USA

Anupam Bishayee

Institute of Pharmaceutical Sciences, Department of Pharmaceutical Chemistry, University of Graz, Graz, Austria

Valery Bochkov

Institute of Analytical Chemistry and Radiochemistry, Leopold-Franzens University of Innsbruck and Austrian Drug Screening Institute — ADSI, CCB — Center of Chemistry and Biomedicine, Innsbruck, Austria

Günther K. Bonn

Centre for Healthy Brain Ageing (CHeBA), School of Psychiatry, University of New South Wales, Sydney, New South Wales, Australia

Nady Braidy

Laboratory of Foodomics, Bioactivity and Food Analysis Department, Institute of Food Science Research CIAL (UAM-CSIC), Madrid, Spain

Alejandro Cifuentes & Elena Ibanez

Clinical Psychology Service, Health Department, Fondazione IRCCS ‘Casa Sollievo della Sofferenza’, San Giovanni Rotondo, Italy

Grazia D’Onofrio

Evotec (UK) Ltd, Oxford, UK

Michael Bodkin

Department of Pharmacy, College of Pharmacy, Seoul National University, Seoul, South Korea

Marc Diederich

Jacqui Wood Cancer Centre, Division of Cellular Medicine, School of Medicine, University of Dundee, Dundee, UK

Albena T. Dinkova-Kostova

Department of Pharmacology and Molecular Sciences and Department of Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, USA

Department of Pharmaceutical Biology, Institute of Pharmaceutical and Biomedical Sciences, Johannes Gutenberg University, Mainz, Germany

Thomas Efferth

Cancer Biomarkers Working Group, Oujda, Morocco

Khalid El Bairi

International Natural Product Sciences Taskforce (INPST), Jastrzebiec, Poland

Nicolas Arkells

Department of Pharmacology, University of Cambridge, Cambridge, UK

Tai-Ping Fan

College of Life Sciences, Northwest University, Xi’an, China

Neuroimmunology and Neurochemistry Research Group, Department of Psychiatry and Psychotherapy, Medical Center – University of Freiburg, Faculty of Medicine, University of Freiburg, Freiburg, Germany

Bernd L. Fiebich

Institute of Pharmacology and the Gaston H. Glock Research Laboratories for Exploratory Drug Development, Center of Physiology and Pharmacology, Medical University of Vienna, Vienna, Austria

Michael Freissmuth & Christian W. Gruber

Laboratory of Metabolomics, The Stephan Angeloff Institute of Microbiology, Bulgarian Academy of Sciences, Plovdiv, Bulgaria

Milen I. Georgiev

Center of Plant Systems Biology and Biotechnology, Plovdiv, Bulgaria

Research Department of Pharmaceutical and Biological Chemistry, UCL School of Pharmacy, London, UK

Simon Gibbons

MRC Lifecourse Epidemiology Unit and NIHR Southampton Biomedical Research Centre, University of Southampton and University Hospital Southampton NHS Foundation Trust, Southampton, UK

Keith M. Godfrey

UCB Pharma Ltd, Slough, UK

Institute for Cell Biology, Biocenter, Medical University of Innsbruck, Innsbruck, Austria

Lukas A. Huber

Austrian Drug Screening Institute-ADSI, Innsbruck, Austria

ICBAS-Instituto de Ciências Biomédicas Abel Salazar & CIIMAR, Universidade do Porto, Porto, Portugal

Anake Kijjoa

Department of Pharmacognosy and Molecular Basis of Phytotherapy, Medical University of Warsaw, Warsaw, Poland

Anna K. Kiss

School of Chinese Medicine, Hong Kong Baptist University, Hong Kong, China

Allelopathy Group, Department of Organic Chemistry, Institute of Biomolecules (INBIO), Campus de Excelencia Internacional (ceiA3), School of Science, University of Cadiz, Cadiz, Spain

Francisco A. Macias

Kaiviti Consulting, LLC, Dallas, TX, USA

Mark J. S. Miller

Department of Pharmaceutical Botany, ‘Iuliu Haţieganu’ University of Medicine and Pharmacy, Cluj-Napoca, Romania

Andrei Mocan

Department of Microbial Natural Products, Helmholtz-Institute for Pharmaceutical Research Saarland, Helmholtz Centre for Infection Research and Department of Pharmacy, Saarland University, Saarbrücken, Germany

Rolf Müller

German Centre for Infection Research (DZIF), Partner Site Hannover, Braunschweig, Germany

Rolf Müller & Marc Stadler

Department of Biomedical and Biotechnological Sciences, University of Catania, Catania, Italy

Ferdinando Nicoletti

Department of Biology, The University of Texas at San Antonio, San Antonio, TX, USA

George Perry

Department of Drug Science, University of Catania, Catania, Italy

Valeria Pittalà

Dipartimento di Farmacia, University of Salerno, Fisciano, Italy

Luca Rastrelli

Energy Metabolism Laboratory, Institute of Translational Medicine, Swiss Federal Institute of Technology (ETH) Zurich, Schwerzenbach, Switzerland

Michael Ristow

Institute of Food Sciences, National Research Council, Avellino, Italy

Gian Luigi Russo

National Institute for Agricultural and Veterinary Research (INIAV), Vila do Conde, Portugal

Ana Sanches Silva

Center for Study in Animal Science (CECA), ICETA, University of Porto, Porto, Portugal

Department of Pharmaceutical and Medicinal Chemistry, Institute of Pharmacy, Paracelsus Medical University Salzburg, Salzburg, Austria

Daniela Schuster

Institute of Pharmacy/Pharmaceutical Chemistry and Center for Molecular Biosciences Innsbruck (CMBI), University of Innsbruck, Innsbruck, Austria

The NatPro Centre, School of Pharmacy and Pharmaceutical Sciences, Trinity College Dublin, Dublin, Ireland

Helen Sheridan

Independent Laboratory of Natural Products Chemistry, Medical University of Lublin, Lublin, Poland

Krystyna Skalicka-Woźniak

Department of Pharmacognosy and Natural Products Chemistry, Faculty of Pharmacy, National and Kapodistrian University of Athens, Panepistimioupolis Zografou, Athens, Greece

Leandros Skaltsounis

Laboratory of Pharmaceutical Chemistry, Faculty of Pharmacy, University of Santiago de Compostela, Santiago de Compostela, Spain

Eduardo Sobarzo-Sánchez

Instituto de Investigación y Postgrado en Salud, Facultad de Ciencias de la Salud, Universidad Central de Chile, Santiago, Chile

Janssen Pharmaceuticals Research & Development, San Diego, CA, USA

David S. Bredt

Institute of Pharmacy/Pharmacognosy, Center for Molecular Biosciences Innsbruck (CMBI), University of Innsbruck, Innsbruck, Austria

Hermann Stuppner

Research Group on Community Nutrition and Oxidative Stress, and Health Research Institute of the Balearic Islands (IdISBa), Department of Fundamental Biology and Health Sciences, University of Balearic Islands, Palma de Mallorca, Spain

Antoni Sureda

CIBEROBN (Physiopathology of Obesity and Nutrition), Instituto de Salud Carlos III, Madrid, Spain

Institute of Molecular Biology ‘Roumen Tsanev’, Department of Biochemical Pharmacology and Drug Design, Bulgarian Academy of Sciences, Sofia, Bulgaria

Nikolay T. Tzvetkov

Pharmaceutical Institute, University of Bonn, Bonn, Germany

Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, Italian National Council of Research, Bari, Italy

Rosa Anna Vacca

Inflammation Research Center, San Diego, CA, USA

Bharat B. Aggarwal

Department of Clinical Sciences, Università Politecnica delle Marche, Ancona, Italy

Maurizio Battino & Francesca Giampieri

International Research Center for Food Nutrition and Safety, Jiangsu University, Zhenjiang, China

Maurizio Battino, Jianbo Xiao & Maria Daglia

Department of Biochemistry, Faculty of Sciences, King Abdulaziz University, Jeddah, Saudi Arabia

Francesca Giampieri

College of Food Science and Technology, Northwest University, Xi’an, Shaanxi, China

Institute of Pharmacy and Molecular Biotechnology, Heidelberg University, Heidelberg, Germany

Michael Wink

School of Pharmaceutical Sciences, University of Geneva, CMU, Geneva, Switzerland

Jean-Luc Wolfender

Institute of Pharmaceutical Sciences of Western Switzerland, University of Geneva, CMU, Geneva, Switzerland

Nutrition and Bromatology Group, Department of Analytical Chemistry and Food Science, Faculty of Food Science and Technology, University of Vigo — Ourense Campus, Ourense, Spain

Jianbo Xiao

Oral and Maxillofacial Radiology, Applied Oral Sciences and Community Dental Care, Faculty of Dentistry, The University of Hong Kong, Hong Kong, China

Andy Wai Kan Yeung

Team Bio-PeroxIL, ‘Biochemistry of the Peroxisome, Inflammation and LipidMetabolism’ (EA7270)/University Bourgogne Franche-Comté/Inserm, Dijon, France

Gérard Lizard

Bionorica SE, Neumarkt/Oberpfalz, Germany

Michael A. Popp

Research Group ‘Pharmacognosy and Phytotherapy’, UCL School of Pharmacy, London, UK

Michael Heinrich

‘Graduate Institute of Integrated Medicine, College of Chinese Medicine’, and ‘Chinese Medicine Research Center’, China Medical University, Taichung, Taiwan

Research Center for Functional Genomics, Biomedicine and Translational Medicine, Institute of Doctoral Studies, ‘Iuliu Hatieganu’ University of Medicine and Pharmacy, Cluj-Napoca, Romania

Ioana Berindan-Neagoe

Department of Experimental Pathology, ‘Prof. Dr. Ion Chiricuta’, The Oncology Institute, Cluj-Napoca, Romania

Helmholtz-Center for Infection Research, Department of Microbial Drugs, Braunschweig, Germany

Marc Stadler

Department of Pharmacy, University of Naples Federico II, Naples, Italy

Maria Daglia

Natural Products Laboratory, Institute of Biology, Leiden University, Leiden, Netherlands

Robert Verpoorte

the International Natural Product Sciences Taskforce

  • Maciej Banach
  • Judith M. Rollinger
  • Davide Barreca
  • Wolfram Weckwerth
  • Rudolf Bauer
  • Edward A. Bayer
  • Muhammed Majeed
  • Anupam Bishayee
  • Valery Bochkov
  • Günther K. Bonn
  • Nady Braidy
  • Franz Bucar
  • Alejandro Cifuentes
  • Grazia D’Onofrio
  • Michael Bodkin
  • Marc Diederich
  • Albena T. Dinkova-Kostova
  • Thomas Efferth
  • Khalid El Bairi
  • Nicolas Arkells
  • Tai-Ping Fan
  • Bernd L. Fiebich
  • Michael Freissmuth
  • Milen I. Georgiev
  • Simon Gibbons
  • Keith M. Godfrey
  • Christian W. Gruber
  • Lukas A. Huber
  • Elena Ibanez
  • Anake Kijjoa
  • Anna K. Kiss
  • Aiping Lu
  • Francisco A. Macias
  • Mark J. S. Miller
  • Andrei Mocan
  • Rolf Müller
  • Ferdinando Nicoletti
  • George Perry
  • Valeria Pittalà
  • Luca Rastrelli
  • Michael Ristow
  • Gian Luigi Russo
  • Ana Sanches Silva
  • Daniela Schuster
  • Helen Sheridan
  • Krystyna Skalicka-Woźniak
  • Leandros Skaltsounis
  • Eduardo Sobarzo-Sánchez
  • David S. Bredt
  • Hermann Stuppner
  • Antoni Sureda
  • Nikolay T. Tzvetkov
  • Rosa Anna Vacca
  • Bharat B. Aggarwal
  • Maurizio Battino
  • Francesca Giampieri
  • Michael Wink
  • Jean-Luc Wolfender
  • Jianbo Xiao
  • Andy Wai Kan Yeung
  • Gérard Lizard
  • Michael A. Popp
  • Michael Heinrich
  • Ioana Berindan-Neagoe
  • Marc Stadler
  • Maria Daglia
  • Robert Verpoorte

Corresponding authors

Correspondence to Atanas G. Atanasov or Claudiu T. Supuran.

Ethics declarations

Competing interests.

A.G.A. is executive administrator of the International Natural Product Sciences Taskforce (INPST) and Digital Health and Patient Safety Platform (DHPSP). M. Banach has served on the speakers’ bureau of Abbott/Mylan, Abbott Vascular, Actavis, Akcea, Amgen, Biofarm, KRKA, MSD, Novo-Nordisk, Novartis, Sanofi-Aventis, Servier and Valeant, has served as a consultant to Abbott Vascular, Akcea, Amgen, Daiichi Sankyo, Esperion, Freia Pharmaceuticals, Lilly, MSD, Novartis, Polfarmex, Resverlogix, Sanofi-Aventis, and has received grants from Amgen, Mylan, Sanofi and Valeant. R.B. collaborates with Bayer Consumer Health and Dr Willmar Schwabe GmbH & Co. KG, and is scientific advisory committee member of PuraPharm International (HK) Limited and ISURA. G.K.B. is a board member of Bionorica SE. M. Daglia has received consultancy honoraria from Pfizer Italia and Mylan for training courses for chemists, and is a member of the INPST board of directors. A.T.D.-K. is a member of the Scientific and Medical Advisory Board of Evgen Pharma plc. I.E.O. is Dean of Faculty of Pharmacy, Gazi University, Ankara, Turkey, member of the Traditional Chinese Medicine Experts Group in European Pharmacopeia, and principal member of Turkish Academy of Sciences (TUBA). B.L.F. is a member of the INPST Board of Directors and has received research funding from Dr Willmar Schwabe GmbH & Co. KG. K.M.G. has received reimbursement for speaking at conferences sponsored by companies selling nutritional products and is part of an academic consortium that has received research funding from Abbott Nutrition, Nestec and Danone. C.W.G. is chairman of the scientific advisory board of Cyxone AB, SE. M.H.’s research group has received charitable donations from Dr Willmar Schwabe GmbH & Co. KG and recently completed a research project sponsored by Pukka Herbs, UK. A.L. is a member of the board of directors of Kaisa Health. M.J.S.M. is president of Kaiviti Consulting and consults for Gnosis by LeSaffre. F.N. is cofounder and shareholder of OncoNox and Aura Biopharm. G.P. is on the board of Neurotez and Neurotrope. M.R. serves as an adviser for the Nestlé Institute of Health Sciences. G.L.R. is a member of the board of directors of INPST. N.T.T. is Founder and CEO of NTZ Lab Ltd and advisory board member of INPST. M.W. collaborates with Finzelberg GmbH and Schwabe GmbH. J.L.W. collaborates with Nestlé and Firmenich. M.A.P. is CEO and owner of Bionorica SE. J.H. is an employee of and holds shares in UCB Pharma Ltd. M.M. is Founder and Chairman of Sami–Sabinsa Group of Companies. D.S.B. is an employee of Janssen R&D. M. Bodkin is an employee of Evotec (UK) Ltd.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Related links

Dictionary of Natural Products: http://dnp.chemnetbase.com/faces/chemical/ChemicalSearch.xhtml

FDA botanical drug development guidance for industry: https://www.fda.gov/regulatory-information/search-fda-guidance-documents/botanical-drug-development-guidance-industry

INPST: https://inpst.net/

Tetravalent carbon atoms forming single covalent bonds with other atoms within the molecular structure. A higher fraction of sp3 carbons within molecules is a descriptor that indicates more complex 3D structures.
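As a rough illustration of the descriptor described above, the fraction of sp3 carbons is commonly taken as the number of sp3-hybridized carbon atoms divided by the total number of carbon atoms. The sketch below assumes the two counts are already known (for example, counted by hand from a structure drawing); the function name `fraction_sp3` is hypothetical and not taken from the article.

```python
def fraction_sp3(sp3_carbons: int, total_carbons: int) -> float:
    """Return the fraction of carbon atoms that are sp3-hybridized.

    Both counts are assumed to be supplied by the caller; this sketch
    does not parse molecular structures.
    """
    if total_carbons <= 0:
        raise ValueError("total_carbons must be positive")
    if not 0 <= sp3_carbons <= total_carbons:
        raise ValueError("sp3_carbons must lie between 0 and total_carbons")
    return sp3_carbons / total_carbons


# Cyclohexane: all six carbons are sp3. Benzene: none of the six are.
cyclohexane = fraction_sp3(6, 6)
benzene = fraction_sp3(0, 6)
# Toluene: only the methyl carbon (1 of 7) is sp3.
toluene = fraction_sp3(1, 7)
```

In this reading, a fully saturated ring such as cyclohexane scores higher than a flat aromatic ring such as benzene, in line with the glossary's point that a higher fraction indicates a more complex 3D structure.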

This guideline for the likelihood of a compound having oral bioavailability is based on several characteristics containing the number 5. It predicts that a molecule is likely to have poor absorption or permeation if it has more than one of the following characteristics: more than 5 H-bond donors; more than 10 H-bond acceptors; a molecular weight above 500; or a partition coefficient (LogP) above 5. Notably, natural products were identified as common exceptions at the time of publication in 1997.
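The guideline above (commonly known as Lipinski's rule of five) can be sketched as a simple threshold count. The example below is a minimal illustration, assuming the four descriptor values have already been computed elsewhere; the class `Descriptors`, the function names and the example values are hypothetical and do not describe any real compound.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    """Precomputed molecular descriptors (hypothetical container)."""
    h_bond_donors: int
    h_bond_acceptors: int
    molecular_weight: float
    log_p: float


def rule_of_five_violations(d: Descriptors) -> int:
    """Count how many of the four thresholds the molecule exceeds."""
    checks = [
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
        d.molecular_weight > 500,
        d.log_p > 5,
    ]
    return sum(checks)


def likely_poor_absorption(d: Descriptors) -> bool:
    """Flag a molecule that exceeds more than one threshold."""
    return rule_of_five_violations(d) > 1


# Illustrative values only, not measurements of real compounds.
small_molecule = Descriptors(h_bond_donors=1, h_bond_acceptors=4,
                             molecular_weight=180.0, log_p=1.2)
large_molecule = Descriptors(h_bond_donors=7, h_bond_acceptors=12,
                             molecular_weight=1200.0, log_p=3.0)
```

A flag from this check is only a heuristic; as the glossary entry notes, natural products were identified as common exceptions.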

Pharmacological screening of natural product extracts yields hits potentially containing multiple natural products that need to be considered for further study to identify the bioactive compounds. Dereplication is the process of recognizing and excluding from further study such hit mixtures that contain already known bioactive compounds.

Assays that rely on the ability of tested compounds to exert desired phenotypic changes in cells, isolated tissues, organs or animals. They offer a complementary strategy to target-based assays for identifying new potential drugs.

The use of genomic data to reveal evolutionary relationships. In the context of natural product drug discovery, the use of phylogenomics is based on the assumption that organisms that have closer evolutionary relationships are more likely to produce similar natural products.

The distance of compared taxa on a constructed phylogenetic tree (also known as an evolutionary tree). Closer distance of compared taxa indicates a closer evolutionary relationship.
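One common way to read "distance on a phylogenetic tree" is the sum of branch lengths along the path connecting two taxa (often called the patristic distance). The sketch below assumes that reading; the glossary does not specify a particular metric. The tree, the taxon names and the branch lengths are invented for illustration, and the function name `patristic_distance` is hypothetical.

```python
def patristic_distance(parent, a, b):
    """Sum of branch lengths on the path between nodes a and b.

    `parent` maps each node to (parent_node, branch_length);
    the root maps to None.
    """
    def distances_to_ancestors(node):
        # Cumulative branch length from `node` up to each of its ancestors.
        dist = {node: 0.0}
        total = 0.0
        while parent[node] is not None:
            up, length = parent[node]
            total += length
            dist[up] = total
            node = up
        return dist

    from_a = distances_to_ancestors(a)
    from_b = distances_to_ancestors(b)
    # The shortest combined path passes through the most recent common ancestor.
    return min(from_a[n] + from_b[n] for n in from_a if n in from_b)


# Hypothetical tree: taxa A and B share ancestor X; taxon C branches at the root.
tree = {
    "root": None,
    "X": ("root", 1.0),
    "C": ("root", 3.0),
    "A": ("X", 0.5),
    "B": ("X", 0.7),
}
d_ab = patristic_distance(tree, "A", "B")
d_ac = patristic_distance(tree, "A", "C")
```

Here A and B are separated by a shorter path than A and C, which under the glossary's definition indicates a closer evolutionary relationship between A and B.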

About this article

Cite this article.

Atanasov, A.G., Zotchev, S.B., Dirsch, V.M. et al. Natural products in drug discovery: advances and opportunities. Nat Rev Drug Discov 20, 200–216 (2021). https://doi.org/10.1038/s41573-020-00114-z

Accepted: 12 November 2020

Published: 28 January 2021

Issue Date: March 2021

DOI: https://doi.org/10.1038/s41573-020-00114-z

COMMENTS

  1. Nature Reviews Drug Discovery

    Nature Reviews Drug Discovery is a journal for people interested in drug discovery and development. It features reviews, news, analysis and research highlights.

  2. Drug discovery

    Major AlphaFold upgrade offers boost for drug discovery. Latest version of the AI models how proteins interact with other molecules — but DeepMind restricts access to the tool. Ewen Callaway ...

  3. Deep learning in drug discovery: an integrative review and future

    Recently, using artificial intelligence (AI) in drug discovery has received much attention since it significantly shortens the time and cost of developing new drugs. Deep learning (DL)-based approaches are increasingly being used in all stages of drug development as DL technology advances, and drug-related data grows. Therefore, this paper presents a systematic Literature review (SLR) that ...

  4. Current Research in Pharmacology and Drug Discovery

    Current Research in Pharmacology and Drug Discovery (CRPHAR) is a new primary research, gold open access journal from Elsevier. CRPHAR publishes original papers, reviews, graphical reviews, short communications and follow-up manuscripts resulting from research in pharmacology and drug discovery that cover aspects of drug action at the cellular, molecular, and biochemical level.

  5. Drug Discovery Today

    Drug Discovery Today delivers informed and highly current reviews for the discovery community. The magazine addresses not only the rapid scientific developments in drug discovery associated technologies but also the management, commercial and regulatory issues that increasingly play a part in how R&D is planned, structured and executed. Features include comment by international experts, news ...

  6. (PDF) Recent Advances in Drug Discovery: Innovative ...

    Drug discovery is a dynamic field constantly evolving with the aim of identifying novel therapeutic agents to combat various diseases. ... Section A-Research paper. Eur. Chem. Bull. 2023,12 ...

  7. Drug Design and Discovery: Principles and Applications

    Drug discovery is the process through which potential new therapeutic entities are identified, using a combination of computational, experimental, translational, and clinical models (see, e.g., [1,2]). Despite advances in biotechnology and understanding of biological systems, drug discovery is still a lengthy, costly, difficult, and inefficient process with a high attrition rate of new ...

  8. CADD, AI and ML in drug discovery: A comprehensive review

    Drug discovery research is expensive and time-consuming, and it frequently took 10-15 years for a drug to be commercially available. ... Zhang et al. published a paper on drug repurposing using deep learning (2020). Chemical sequences [simplified molecular-input line-entry system (SMILES) strings] and amino acid (AA) sequences were used as ...

  9. Machine Learning in Drug Discovery: A Review

    This review surveys the literature on drug discovery through ML tools and techniques that are applied in every phase of drug development to accelerate the research process and reduce the risk and expenditure in clinical trials. ... In the literature, several papers provide information related to predictive models and biomarkers, and ...

  10. An overview of drug discovery and development

    Abstract. A new medicine will take an average of 10-15 years and more than US$2 billion before it can reach the pharmacy shelf. Traditionally, drug discovery relied on natural products as the main source of new drug entities, but was later shifted toward high-throughput synthesis and combinatorial chemistry-based development.

  11. Artificial intelligence in drug discovery and development

    The use of artificial intelligence (AI) has been increasing in various sectors of society, particularly the pharmaceutical industry. In this review, we highlight the use of AI in diverse sectors of the pharmaceutical industry, including drug discovery and development, drug repurposing, improving pharmaceutical productivity, and clinical trials, among others; such use reduces the human workload ...

  12. Full article: Artificial intelligence in drug discovery: recent

    1. Introduction. Machine learning algorithms have been widely applied for computer-assisted drug discovery [1-3]. Deep learning approaches, that is, artificial neural networks with several hidden processing layers [4, 5], have recently gathered renewed attention owing to their ability to perform automatic feature extractions from the input data, and their potential ...

  13. Applications of machine learning in drug discovery and development

    Drug discovery and development pipelines are long, complex and depend on numerous factors. ... This research paper describes the methodology being used by the winners of almost all categories of ...

  14. Drug Design and Discovery: Principles and Applications

    Drug discovery is the process through which potential new therapeutic entities are identified, using a combination of computational, experimental, translational, and clinical models (see, e.g., [1,2]). Despite advances in biotechnology and understanding of biological systems, drug discovery is still a lengthy, costly, difficult, and inefficient process with a high attrition rate of new ...

  15. The Stages of Drug Discovery and Development Process

    Abstract and Figures. Drug discovery is a process which aims at identifying a compound therapeutically useful in curing and treating disease. This process involves the identification of candidates ...

  16. Drug Discovery and Drug Identification using AI

    The paper deals with identifying and creating new drugs using AI techniques. We are implementing a process using the Intel OpenVINO toolkit for identification of drugs. With this detection technique we can identify the reactants which are added as drugs and automate the entire flow. We are using the Intel OpenVINO toolkit with a custom object detection technique to train the model using the faster ...

  17. Drug discovery and development: Role of basic biological research

    This article provides a brief overview of the processes of drug discovery and development. Our aim is to help scientists whose research may be relevant to drug discovery and/or development to frame their research report in a way that appropriately places their findings within the drug discovery and development process and thereby support effective translation of preclinical research to humans.

  18. Synthetic Data from Diffusion Models Improve Drug Discovery Prediction

    Artificial intelligence (AI) is increasingly used in every stage of drug development. Continuing breakthroughs in AI-based methods for drug discovery require the creation, improvement, and refinement of drug discovery data. We posit a new data challenge that slows the advancement of drug discovery AI: datasets are often collected independently from each other, often with little overlap ...

  19. Moving targets in drug discovery

    In terms of numbers of drug-efficacy target annotations, a picture similar to that of target trends can be observed with also a peak in 2011 originating from the paper by Davis et al. (81 out of ...

  20. Rational drug design with AlphaFold 3

    Further development of our AI research models will deepen our understanding of human biology and the building blocks of life to reach our ultimate goal - harnessing the power and pace of AI to reimagine the entire drug discovery process.

  21. Artificial intelligence and machine learning in drug discovery and

    Research conducted by Raschka et al. [14-15] demonstrates how machine learning can be integrated into G-protein coupled receptor (GPCR) ligand recognition, which is a key part of the drug discovery process. The aim of their research was to determine whether machine learning could replace the old technology used in Sea Lamprey Receptor 1 (SLOR1 ...

  22. Google DeepMind's new AlphaFold can model a much larger slice of

    It's a development that could help accelerate drug discovery and other scientific research. The tool is currently being used to experiment with identifying everything from resilient crops to new ...

  23. Editorial: Tumour microenvironment in cancer research and drug discovery

    DOI: 10.3389/fphar.2024.1403176 Corpus ID: 269572766; Editorial: Tumour microenvironment in cancer research and drug discovery @article{Said2024EditorialTM, title={Editorial: Tumour microenvironment in cancer research and drug discovery}, author={Nur Akmarina B. M. Said and Syed Mahmood and Kenneth K. W.

  24. Google DeepMind and Isomorphic Labs introduce AlphaFold 3 AI model

    Google DeepMind's newly launched AlphaFold Server is the most accurate tool in the world for predicting how proteins interact with other molecules throughout the cell. It is a free platform that scientists around the world can use for non-commercial research. With just a few clicks, biologists can harness the power of AlphaFold 3 to model structures composed of proteins, DNA, RNA and a ...

  25. Drug discovery

    Synthetic drug kills fungi but spares kidney cells. Amphotericin B is a clinically vital antifungal drug, but it has high renal toxicity. The compound kills cells by forming sponge-like aggregates ...

  26. Principles of early drug discovery

    Abstract. Developing a new drug from original idea to the launch of a finished product is a complex process which can take 12-15 years and cost in excess of $1 billion. The idea for a target can come from a variety of sources including academic and clinical research and from the commercial sector. It may take many years to build up a body of ...

  27. Google DeepMind and Isomorphic Labs unveil AlphaFold 3, an AI that

    AlphaFold 2 has been cited more than 20,000 times in other published scientific papers and has been used to work on drugs for malaria, cancer, and many other diseases.

  28. This AI Research Introduces SubGDiff: Utilizing Diffusion Model to

    Molecular representation learning is an essential field focusing on understanding and predicting molecular properties through advanced computational models. It plays a significant role in drug discovery and material science, providing insights by analyzing molecular structures. The fundamental challenge in molecular representation learning involves efficiently capturing the intricate 3D ...

  29. Natural products in drug discovery: advances and opportunities

    Historically, natural products (NPs) have played a key role in drug discovery, especially for cancer and infectious diseases 1, 2, but also in other therapeutic areas, including cardiovascular ...