Graph Neural Network Framework for Web-Based Prediction of Protein-Ligand Docking Scores across multiple organs

Anagha S Setlur; Vidya Niranjan; Arjun Balaji; Chandrashekar K

Dec 19, 2023

DOI

dx.doi.org/10.17504/protocols.io.j8nlkoy9xv5r/v1

¹Department of Biotechnology, RV College of Engineering, Bangalore- 560059, affiliated to Visvesvaraya Technological University (VTU), Belagavi- 590018;
²Department of Electronics & Telecommunication, RV College of Engineering, Bangalore- 560059, affiliated to Visvesvaraya Technological University (VTU), Belagavi- 590018

Vidya Niranjan

R V College of Engineering

Protocol Citation: Anagha S Setlur, Vidya Niranjan, Arjun Balaji, Chandrashekar K 2023. Graph Neural Network Framework for Web-Based Prediction of Protein-Ligand Docking Scores across multiple organs. protocols.io https://dx.doi.org/10.17504/protocols.io.j8nlkoy9xv5r/v1

License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited

Protocol status: Working

We use this protocol and it's working

Created: December 15, 2023

Last Modified: December 19, 2023

Protocol Integer ID: 92373

Keywords: QSAR, machine learning and deep learning, graph convolution networks, graph neural networks, data pre-processing, human organs, web-based predictions

Abstract

Estimating the docking score between proteins and drugs is central to structure-based drug design. This project explores the application of graph neural networks (GNNs) to molecular property prediction from SMILES representations; the trained models are then deployed on a web-based platform for broader accessibility and use. The primary dataset comprises molecules identified by MolPort IDs together with their docking scores, which are critical in assessing molecular interactions. A significant aspect of this project is data preprocessing, in which each molecule, initially represented as a SMILES string, is converted into a graph, because effective molecular representation learning is pivotal to molecular property prediction. The models are then evaluated on several performance metrics and deployed on the web-based platform.

Guidelines

QSAR modeling should first be performed for each protein under each organ. The analytical dataset should preferably contain the ligand IDs and the SMILES structures as columns.

Safety warnings

NA

Ethics statement

None.

Before start

Check that the system is able to run the data pre-processing and the GCN/hybrid GCN models.

IMPORTING LIBRARIES

Import all necessary libraries

Install and import all the libraries needed for data preprocessing, model training, and evaluation. The required imports are listed in the screenshot below.

Importing required libraries

DATASET CREATION

In the present scenario, Quantitative Structure-Activity Relationship (QSAR) data generated with Schrödinger Maestro was used for dataset creation. QSAR models were first generated for specific proteins using a set of ligands from MolPort.

Take the protein O14672 under Brain as an example. Here, Y(Obs) is the docking score. This dataset contains the MolPort IDs and the docking scores obtained from the QSAR modeling data.

Creation of analytical dataset
 
An analytical dataset was then created by combining this dataset with a second dataset containing the MolPort IDs and their SMILES strings.

The following is performed to prepare an analytical dataset:
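A minimal sketch of this step, assuming the two datasets are CSV files that share a MolPort ID column and are joined on it. The column names `MolPort_ID` and `SMILES`, the sample IDs, and the scores are hypothetical placeholders, not data from this protocol; only `Y(Obs)` is taken from the description above.

```python
import csv
import io

# Hypothetical stand-ins for the two input files. All IDs and values
# below are illustrative only and are not results from this protocol.
SCORES_CSV = """MolPort_ID,Y(Obs)
MolPort-000-000-001,-6.2
MolPort-000-000-002,-7.9
MolPort-000-000-003,-5.1
"""

SMILES_CSV = """MolPort_ID,SMILES
MolPort-000-000-001,CCO
MolPort-000-000-002,c1ccccc1O
MolPort-000-000-004,CC(=O)N
"""


def build_analytical_dataset(scores_file, smiles_file):
    """Inner-join docking scores and SMILES strings on the MolPort ID."""
    smiles_by_id = {
        row["MolPort_ID"]: row["SMILES"] for row in csv.DictReader(smiles_file)
    }
    dataset = []
    for row in csv.DictReader(scores_file):
        smiles = smiles_by_id.get(row["MolPort_ID"])
        if smiles is None:
            continue  # no structure available for this ligand, so skip it
        dataset.append(
            {
                "MolPort_ID": row["MolPort_ID"],
                "SMILES": smiles,
                "docking_score": float(row["Y(Obs)"]),
            }
        )
    return dataset


analytical = build_analytical_dataset(
    io.StringIO(SCORES_CSV), io.StringIO(SMILES_CSV)
)
for record in analytical:
    print(record)
```

Ligands that appear in only one of the two files are dropped here; whether to drop or flag them is a design choice that depends on how complete the MolPort export is.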

The processed dataset looks as follows:

DATA PRE-PROCESSING

SMILES to graph conversion

Data preprocessing is a pivotal step in this model. Each molecule represented by a SMILES string is converted into a graph, with atoms as nodes and chemical bonds as edges. This graph representation is essential for the GNN to accurately interpret molecular structures.

Feature Representation

● Atom Features: Each atom is represented by a one-hot encoded feature vector indicating the atom type. The model considers four types of atoms (C, O, N, B), leading to a 4-dimensional feature vector for each atom.
● Bond Features: Bonds are characterized by their type (single, double, triple, aromatic) and their inclusion in a ring structure. Each bond is represented by a 5-dimensional feature vector: four entries for the bond type and one for ring membership.
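To make the two vectors concrete, the following dependency-free sketch encodes a hand-written molecule. The function names, the ordering of the entries, and the choice to map any atom outside (C, O, N, B) to an all-zero vector are assumptions made for illustration.

```python
ATOM_TYPES = ["C", "O", "N", "B"]  # atom types named in the protocol
BOND_TYPES = ["single", "double", "triple", "aromatic"]


def atom_features(symbol):
    """One-hot vector over ATOM_TYPES (all zeros for other elements: an assumption)."""
    return [1.0 if symbol == t else 0.0 for t in ATOM_TYPES]


def bond_features(bond_type, in_ring):
    """Four one-hot bond-type entries followed by one ring-membership flag."""
    one_hot = [1.0 if bond_type == t else 0.0 for t in BOND_TYPES]
    return one_hot + [1.0 if in_ring else 0.0]


# Ethanol written by hand: atoms C0-C1-O2 joined by two single, non-ring bonds.
ethanol_nodes = [atom_features(s) for s in ["C", "C", "O"]]
ethanol_edges = [(0, 1), (1, 2)]
ethanol_edge_attr = [bond_features("single", False) for _ in ethanol_edges]

print(ethanol_nodes)
print(ethanol_edge_attr)
```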

 
Feature                                                            Dimensions
One-hot encoding of atom types (C, O, N, B)                        4
Edge features for bond types (single, double, triple, aromatic)    4
Edge features for bond presence in a ring structure                1
Atom features for atom presence in a ring structure                1
Bond indices for atom connectivity                                 2 per bond
  

Using RDKit library for feature representation

These features are extracted with the RDKit library. The conversion function turns a SMILES string into a molecular graph, encoding atom types with one-hot vectors and representing bonds by their type and ring membership.

Using RDKit for feature representation
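One way such a conversion could look is sketched below. It is not the authors' function: the name `smiles_to_graph`, the return layout, and the decision to store every bond in both directions are assumptions. Only standard RDKit calls are used (`Chem.MolFromSmiles`, `GetAtoms`, `GetBonds`, `GetBondType`, `IsInRing`).

```python
from rdkit import Chem

ATOM_TYPES = ["C", "O", "N", "B"]
BOND_TYPES = [
    Chem.rdchem.BondType.SINGLE,
    Chem.rdchem.BondType.DOUBLE,
    Chem.rdchem.BondType.TRIPLE,
    Chem.rdchem.BondType.AROMATIC,
]


def smiles_to_graph(smiles):
    """Return (node_features, edge_index, edge_features) for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"RDKit could not parse SMILES: {smiles!r}")

    # 4-dimensional one-hot atom type per heavy atom.
    node_features = [
        [1.0 if atom.GetSymbol() == t else 0.0 for t in ATOM_TYPES]
        for atom in mol.GetAtoms()
    ]

    edge_index, edge_features = [], []
    for bond in mol.GetBonds():
        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        # 4 bond-type entries plus 1 ring-membership flag.
        feats = [1.0 if bond.GetBondType() == t else 0.0 for t in BOND_TYPES]
        feats.append(1.0 if bond.IsInRing() else 0.0)
        # Store both directions so the graph is treated as undirected.
        edge_index += [[i, j], [j, i]]
        edge_features += [feats, feats]
    return node_features, edge_index, edge_features


nodes, edges, edge_attr = smiles_to_graph("c1ccccc1O")  # phenol
print(len(nodes), "atoms,", len(edges), "directed edges")
```

The per-atom ring flag listed in the feature table is left out of this sketch; it could be appended to each node vector with `atom.IsInRing()`.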

MODEL TRAINING AND EVALUATION

Model defining and training

Define the models and train them with early stopping and appropriate hyperparameters.
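Early stopping can be sketched independently of any deep learning library. The class below, its `patience` and `min_delta` parameters, and the validation losses are hypothetical; the protocol does not state which values were used.

```python
class EarlyStopping:
    """Signal a stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, val_loss, epoch):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            # Improvement: this is where the model weights would be checkpointed.
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses, for illustration only.
val_losses = [1.20, 0.95, 0.90, 0.91, 0.93, 0.92, 0.94]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(loss, epoch):
        stopped_at = epoch
        break

print("stopped at epoch", stopped_at, "best epoch", stopper.best_epoch)
```

In a real training loop, `step` would be called once per epoch with the validation loss, and the weights from `best_epoch` would be restored after stopping.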

MODEL 1- GRAPH CONVOLUTION NETWORK (GCN) 

The first model we explore is a Graph Convolution Network (GCN) with 2 convolution layers.
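The forward pass of a two-layer GCN can be illustrated without a deep learning framework. This sketch assumes the widely used propagation rule H' = ReLU(D^-1/2 (A + I) D^-1/2 H W), mean pooling over atoms, and a linear readout; the hidden size, the random untrained weights, and the omission of bond features are simplifications, so the printed value has no chemical meaning.

```python
import math
import random


def normalized_adjacency(num_nodes, edges):
    """Return D^-1/2 (A + I) D^-1/2 for an undirected graph given as (i, j) pairs."""
    adj = [[1.0 if i == j else 0.0 for j in range(num_nodes)] for i in range(num_nodes)]
    for i, j in edges:
        adj[i][j] = adj[j][i] = 1.0
    deg = [sum(row) for row in adj]
    return [
        [adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_nodes)]
        for i in range(num_nodes)
    ]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    return [[max(0.0, v) for v in row] for row in m]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def gcn_forward(node_features, edges, w1, w2, w_out):
    """Two graph-convolution layers, mean pooling over atoms, linear readout."""
    a_hat = normalized_adjacency(len(node_features), edges)
    h = relu(matmul(matmul(a_hat, node_features), w1))
    h = relu(matmul(matmul(a_hat, h), w2))
    pooled = [sum(col) / len(h) for col in zip(*h)]  # one vector per molecule
    return sum(p * w for p, w in zip(pooled, w_out))  # scalar score estimate


rng = random.Random(0)
hidden = 8  # hypothetical hidden size
w1 = random_matrix(4, hidden, rng)  # 4 = atom feature dimension
w2 = random_matrix(hidden, hidden, rng)
w_out = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]

# Ethanol (C-C-O) with one-hot atom features over (C, O, N, B).
x = [[1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
score = gcn_forward(x, [(0, 1), (1, 2)], w1, w2, w_out)
print(score)
```

In practice the same structure would be expressed with a graph learning library and trained by gradient descent against the docking scores.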

MODEL 2- HYBRID GCN 

The second model we explore is a hybrid GCN model, which combines the traditional GCN with an attention mechanism that employs positional encoding.

5-fold cross validation

The models were trained with 5-fold cross-validation to enhance their robustness and reliability. This method ensures a comprehensive evaluation by systematically partitioning the data into distinct subsets for training and validation.
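The partitioning can be sketched in plain Python as follows. The function name, the shuffling seed, and the sample count are hypothetical; a library splitter would serve equally well.

```python
import random


def k_fold_indices(num_samples, k=5, seed=0):
    """Yield (train_indices, val_indices) for each of k shuffled, disjoint folds."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [
        num_samples // k + (1 if f < num_samples % k else 0) for f in range(k)
    ]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


folds = list(k_fold_indices(num_samples=23, k=5))
for fold_id, (train_idx, val_idx) in enumerate(folds):
    print(f"fold {fold_id}: {len(train_idx)} train / {len(val_idx)} validation")
```

Each molecule lands in the validation subset exactly once, so every sample contributes to both training and evaluation across the five runs.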

The model's performance was further evaluated using Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE), which provide insight into its predictive accuracy and overall performance.
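Both metrics follow directly from their definitions: RMSE is the square root of the mean squared difference between observed and predicted scores, and MAE is the mean of the absolute differences. The scores below are made-up numbers for illustration, not results from this protocol.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


def mae(y_true, y_pred):
    """Mean absolute error between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical docking scores for four ligands.
observed = [-6.0, -7.5, -5.0, -8.0]
predicted = [-6.5, -7.0, -5.0, -9.0]

print("RMSE:", rmse(observed, predicted))
print("MAE:", mae(observed, predicted))
```

RMSE penalizes large individual errors more heavily than MAE, so reporting both gives a fuller picture of how the errors are distributed.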

PICKING THE BEST MODEL AND UPLOADING IN REPOSITORY

The best-performing model was selected and its weights were saved. These weights were then uploaded to the Streamlit repository.

These same steps were repeated across different proteins, datasets and models to integrate all models from each human organ into a single platform.

CONCLUSION

This protocol outlines the steps required to integrate the predicted QSAR data from each organ into a single platform covering all human organs and their associated proteins, so that users can provide a SMILES structure and obtain a predicted docking score from the integrated models. Data pre-processing is the primary step in this protocol, followed by the creation of an analytical dataset for conversion into graphs. Model 1 is a graph convolution network (GCN), a deep learning technique in which high-dimensional data is converted to a low-dimensional representation and the graphs are related to the target variable (in this case, the docking score). Model 2, the hybrid model, adds an attention mechanism that employs positional encoding alongside the traditional GCN. The web application lets users choose which model to use for their prediction. This protocol allows direct prediction of the binding affinity of small molecules to important proteins in human organs, thereby providing overall safety information on the small molecules.

ACKNOWLEDGEMENTS

The authors thank Mr. Akshay Uttarkar for providing inputs throughout.

Protocol references

Kaplan Z, Ehrlich S, Leswing K (2021) Benchmark study of DeepAutoQSAR, ChemProp, and DeepPurpose on the ADMET subset of the Therapeutic Data Commons. https://newsite.schrodinger.com/life-science/learn/white-papers/benchmark-study-deepautoqsar-chemprop-and-deeppurpose-admet-subset-therapeutic-data/

Gion K, Gattani S, Kaplan Z (2022) DeepAutoQSAR hardware benchmark. https://newsite.schrodinger.com/materials-science/learn/white-papers/deepautoqsar-hardware-benchmark/

Schrödinger Release 2023-4: DeepAutoQSAR, Schrödinger, LLC, New York, NY, 2023.

https://www.molport.com/shop/index

Wu F, Souza A, Zhang T, Fifty C, Yu T, Weinberger K. Simplifying graph convolutional networks. International conference on machine learning 2019 May 24 (pp. 6861-6871). PMLR.

Javeed A. A hybrid attention mechanism for multi-target entity relation extraction using graph neural networks. Machine Learning with Applications. 2023 Mar 15;11:100444.

