import biotite.sequence.io.fasta as fasta
import biotite.application.blast as blast
from biotite.sequence import ProteinSequence
import urllib
import io
import pickle
="P07900"
uniprotcode=9
start=236
end="https://rest.uniprot.org/uniprotkb/"+uniprotcode+".fasta"
url
with urllib.request.urlopen(url) as file_handle:
= io.TextIOWrapper(file_handle)
file_object = fasta.FastaFile.read(file_object)
fasta_file for header, seq_str in fasta_file.items():
= ProteinSequence(seq_str)
seq = str(seq[start:end])
selected_sequence
= blast.BlastWebApp("blastp", selected_sequence, database="pdb",obey_rules=True)
app 5000)
app.set_max_results(1e-30)
app.set_max_expect_value(
app.start()=1000)
app.join(timeout= app.get_alignments()
alignments =app.get_xml_response()
xml
# open a file, where you ant to store the data
file = open('data/blasthits.pkl', 'wb')
# dump information to that file
file)
pickle.dump(xml,
# close the file
file.close()
Introduction
The previous post of this series established a first protocol on how to identify & cluster the ATP binding site of the N-terminal domain of human HSP90 alpha. It can be generalized to other proteins as well but to keep it simple I’ll continue this series on a single protein for now and generalize at a later stage.
In this second post we’ll gradually extend the dataset with ATP binding sites of proteins that are very similar to the human HSP90 alpha.
What is similar?
One can adopt several ways to assess the similarity between proteins. Here I’ll use a commonly understood and accepted sequence local similarity on the N-terminal domain. I’ll further assess the binding site residue conservation in detail. So at this stage of the dataset creation I’ll focus on things I’d expect to be found as very very similar to the human HSP90 alpha N-terminal domain when doing a pocket comparison. The Rognan lab discussed this topic in more than one paper. Feel free to check out Kellenberger, Schalon, and Rognan (2008) and Eguida and Rognan (2022) for more details.
Things to keep in mind
Again I’ll track all things that I left aside or important approximations I made. They’ll be tracked as side-note and depending on the outcome of the whole exercise I’ll come back to them at a later stage.
Things to keep in mind
- listing goes here
Methods & Results
Identify similar sequences
In order to identify similar sequences one usually would do just a simple blast. Unfortunately, things are a bit trickier. First of all, we are currently looking only at the N-terminal domain of HSP90 (around 230 amino acids). The full human HSP90 alpha protein has 732 amino acids. So, shall we search for globally similar sequences? or smaller regions?
The N-terminal part is considered to be a domain, so structured on itself. There are numerous structure resolution papers that show that, Sreeramulu et al. (2009) is just one of them, as an example here, where only the N-terminal domain of HSP90 is amplified, produced & resolved.
We could even go further and use only the range of amino acids that are involved in the ATP binding site. In the case of a small domain like this one here that might not be required, but for large domains it might be of interest to focus on the binding site only. In this example this would mean start at the glutamate 47 and go up to the valine 186 (+ maybe a few flanking residues)
For the sake of simplicity I’ll take the whole N-terminal domain (amino acids 9-236), corresponding to the first region of HS90A_HUMAN in the UniProt database.
The blast mess
Now let’s try to set up a first blast search. We want to find PDB structures containing a resolved portion of the domain of interest (or similar). There are unfortunately thousands of services allowing to do that, and as usual they don’t give the same results. One important factor is to search against an up to date version of the RCSB PDB. The only service I found that clearly states what release of the PDB is being used is NCBI blastp page.
I don’t want to reinvent the wheel or install blast locally etc, so here I’ll use the biotite (Kunzmann and Hamacher (2018)) integration of the NCBI blast service.
Here we get all blast hits with an e-value above 1e-30 (arbitrary choice) and store the xml response in a pickled file, not to spam the NCBI service too much during writing this post 😉. It’s also very slow to run, so I’ll just load the pickled file in the next step.
Parse the blast results
We can now load the pickled results & parse them to make them a bit more workable
#! eval: false
import pickle
import xml.etree.ElementTree as ET
file = open('data/blasthits.pkl', 'rb')
# dump information to that file
= pickle.load(file)
xml file.close()
# print(xml)
= ET.fromstring(xml)
root = []
identifiers = []
evalues for alignment in root.iter('Hit'):
= alignment.find('Hit_id').text
identifier # Extract the coverage
= alignment.find('Hit_hsps/Hsp')
hsp = int(hsp.find('Hsp_query-to').text) / int(hsp.find('Hsp_query-from').text)
coverage # Extract the percentage of identity
= float(hsp.find('Hsp_identity').text) / float(hsp.find('Hsp_align-len').text)
identity
# Extract the e-value
= float(hsp.find('Hsp_evalue').text)
evalue
identifiers.append(identifier)
evalues.append(evalue)print(identifiers)
['pdb|7LSZ|A', 'pdb|4YKQ|A', 'pdb|1UY6|A', 'pdb|2YE2|A', 'pdb|3B24|A', 'pdb|3QDD|A', 'pdb|3INW|A', 'pdb|3R4M|A', 'pdb|6TN4|AAA', 'pdb|4AWO|A', 'pdb|4EGH|A', 'pdb|2BSM|A', 'pdb|4NH7|A', 'pdb|4EEH|A', 'pdb|2XAB|A', 'pdb|2CCS|A', 'pdb|6U99|A', 'pdb|4B7P|A', 'pdb|2FWY|A', 'pdb|6LR9|A', 'pdb|1BYQ|A', 'pdb|2YE7|A', 'pdb|2XHX|A', 'pdb|6GQS|A', 'pdb|3K97|A', 'pdb|6U98|A', 'pdb|5J6L|A', 'pdb|1UYI|A', 'pdb|6OLX|A', 'pdb|8B7I|A', 'pdb|6GP4|A', 'pdb|2XJJ|A', 'pdb|1YC1|A', 'pdb|2YEE|A', 'pdb|5J82|A', 'pdb|5J6M|A', 'pdb|3BM9|A', 'pdb|3D0B|A', 'pdb|5J80|A', 'pdb|2YI0|A', 'pdb|7KRJ|A', 'pdb|5NYI|A', 'pdb|3K98|A', 'pdb|3EKO|A', 'pdb|3HHU|A', 'pdb|1OSF|A', 'pdb|5H22|A', 'pdb|2XDU|A', 'pdb|2JJC|A', 'pdb|2K5B|A', 'pdb|3TUH|A', 'pdb|6EI5|A', 'pdb|5J2X|A', 'pdb|5GGZ|A', 'pdb|4LWE|A', 'pdb|5J20|A', 'pdb|2QFO|A', 'pdb|5LR1|A', 'pdb|5LQ9|A', 'pdb|2QF6|A', 'pdb|2YJW|A', 'pdb|6N8W|A', 'pdb|3NMQ|A', 'pdb|1UYM|A', 'pdb|6N8Y|A', 'pdb|7ULJ|A', 'pdb|5UC4|A', 'pdb|7ZUB|A', 'pdb|8GAE|A', 'pdb|5FWK|A', 'pdb|7Z37|AP1', 'pdb|7ZR0|A', 'pdb|7Y04|A', 'pdb|8EOA|A', 'pdb|2JKI|A', 'pdb|4GQT|A', 'pdb|2XCM|A', 'pdb|4X9L|A', 'pdb|7K9U|A', 'pdb|7K9R|A', 'pdb|3K60|A', 'pdb|3H80|A', 'pdb|4XC0|A', 'pdb|6CJI|A', 'pdb|3O6O|A', 'pdb|3OMU|A', 'pdb|2BRE|A', 'pdb|1A4H|A', 'pdb|1AH6|A', 'pdb|2WEP|A', 'pdb|1ZW9|A', 'pdb|2YGE|A', 'pdb|2YGF|A', 'pdb|2CGF|A', 'pdb|1AMW|A', 'pdb|1AM1|A', 'pdb|2XX2|A', 'pdb|3C0E|A', 'pdb|1BGQ|A', 'pdb|2YGA|A', 'pdb|2CG9|A', 'pdb|6XLB|A', 'pdb|2AKP|A', 'pdb|3PEH|A', 'pdb|1TBW|A', 'pdb|4NH9|A', 'pdb|5IN9|A', 'pdb|1YT2|A', 'pdb|7ULL|A', 'pdb|1QY8|A', 'pdb|2ESA|A', 'pdb|2O1W|A', 'pdb|5ULS|A', 'pdb|2O1U|A', 'pdb|2IOR|A', 'pdb|1Y4S|A', 'pdb|2IOP|A', 'pdb|3IED|A', 'pdb|7C04|A', 'pdb|5F3K|A', 'pdb|5HPH|A', 'pdb|7ULK|A', 'pdb|4Z1F|A', 'pdb|6D14|A', 'pdb|4IVG|A', 'pdb|4J0B|A', 'pdb|5TVU|A', 'pdb|5TTH|B', 'pdb|4IPE|A', 'pdb|5TTH|A', 'pdb|7KCK|A', 'pdb|6XG6|A', 'pdb|7KLU|B', 'pdb|7U4C|A']
We end up here with 139 hits. As a reminder, when gathering all structures of the same protein we identified over 300 structures. I’m expecting them to be part of the hitlist found here. I tried to find any indication on the PDB subset used by the NCBI, but couldn’t identify whether they use the full RCSB PDB or a subset only. In conclusion: this isn’t useable as such. So I optimistically tried the blast service on uniprot and got even less hits. Digging into the detail I noticed that the ebi is just running an ncbi blast behind the scenes as well.
Well that’s a bummer. I have two options here, either I expand the hitlist, by identifying all other structures encompassing the same protein as the hits I got here. But the hits aren’t unique as well per biomolecule, not sure what this corresponds to in the end. Or I’d need to basically build my own blast database locally & do everything by hand. Let me roughly outline the steps here:
For each PDB structure & chain:
- extract the expressed sequence
- add it to a blast db file with an identifier corresponding to the pdb code + chain code
- run the blast locally
Fall back to the RCSB PDB search
The RCSB PDB seems to offer a protein sequence search as API now as well & this is the one I decided to use in the end:
import json
import requests
="""{
query "query": {
"type": "terminal",
"service": "sequence",
"parameters": {
"evalue_cutoff": 1e-30,
"identity_cutoff": 0.1,
"sequence_type": "protein",
"value": "QPMEEEEVETFAFQAEIAQLMSLIINTFYSNKEIFLRELISNSSDALDKIRYESLTDPSKLDSGKELHINLIPNKQDRTLTIVDTGIGMTKADLINNLGTIAKSGTKAFMEALQAGADISMIGQFGVGFYSAYLVAEKVTVITKHNDDEQYAWESSAGGSFTVRTDTGEPMGRGTKVILHLKEDQTEYLEERRIKEIVKKHSQFIGYPITLFVEKERDKEVSDDEAE"
}
},
"request_options": {
"results_verbosity": "verbose",
"scoring_strategy": "sequence",
"return_all_hits": true
},
"return_type": "polymer_entity"
}"""
="https://search.rcsb.org/rcsbsearch/v2/query?json="+query
url=requests.get(url)
response=json.loads(response.text)["result_set"]
hitsprint(hits[0])
Here we get all structures (510 hits by the time I’m writing this) & even a bit more information which portion of the input sequence hit where on the structure. Be gentle with the calls to the RCSB, else they will blacklist your query.
Now let’s start meddling with the sequence mess and integrate already all the clustering bits we did in the previous post.
Clustering the similar binding sites
In the previous clustering I did on identical structures I didn’t need to bother about residue numberings as they were identical in the end for all human HSP90 alpha structures. Here, however, I’ll retrieve sequences from other organisms etc and it’s guaranteed that the residue numberings won’t coincide most of the time with the ones I used before.
As a result the code I provide here includes a proper sequence mapping on the results from the RCSB query directly.
Code
import gemmi
from gemmi import cif
import urllib
import numpy as np
from scipy.spatial.distance import cdist
import scipy.spatial.distance as ssd
import scipy
import matplotlib.pyplot as plt
import pandas as pd
from IPython.display import Markdown
from tabulate import tabulate
import blosum
= blosum.BLOSUM(62) subst_matrix
Code
#get all polymer identifiers from previous results
=[hit["identifier"] for hit in hits]
polymer_identifiers
#now let's build another graphql query against the rcsb to get the author defined chain names of the structures that correspond to the polymer identified.
="""
query{
polymer_entities(entity_ids: """+str(polymer_identifiers).replace("'","\"")+""")
{
rcsb_id
rcsb_polymer_entity_container_identifiers {
auth_asym_ids
}
rcsb_polymer_entity_align{
reference_database_name
aligned_regions{
entity_beg_seq_id
length
ref_beg_seq_id
}
}
}
}
"""
=f"https://data.rcsb.org/graphql?query={query}"
url=requests.get(url)
response=response.json()["data"]["polymer_entities"]
chains_tmp
#here we know what polymer id corresponds to what chain name
={chain["rcsb_id"]:chain["rcsb_polymer_entity_container_identifiers"]["auth_asym_ids"] for chain in chains_tmp}
chain_polymer_mapping
#From our blast hits, let's keep only the sequence matching bits to simplify the object a bit:
={hit["identifier"]:hit["services"][0]["nodes"][0]["match_context"][0] for hit in hits}
clean_hits
#Let's get our ATP binding site residue list mapping from our initial sequence:
=np.array([47,48,51,52,54,55,58,93,95,96,97,98,106,107,112,131,132,133,134,135,136,137,138,139,152,183,184,186])-9
residue_list
def transformResidueNumbers(residue_numbers, rcsbpolymerhit):
= pd.DataFrame(columns = ["aa_query", "aa_subject", "query_residue_number","subject_residue_number"])
df
=0
currentindex=list(rcsbpolymerhit["subject_aligned_seq"])
subject_aligned_seq_list=rcsbpolymerhit["query_beg"]
query_current_residue_number=rcsbpolymerhit["subject_beg"]
subject_current_residue_numberfor idx,queryaa in enumerate(list(rcsbpolymerhit["query_aligned_seq"])):
if(queryaa!='-'):
=query_current_residue_number
query_residue_number+=1
query_current_residue_numberelse:
=-1
query_residue_numberif(subject_aligned_seq_list[idx]!='-'):
=subject_current_residue_number
subject_residue_number+=1
subject_current_residue_numberelse:
=-1
subject_residue_number=pd.DataFrame({"aa_query":queryaa,"aa_subject":subject_aligned_seq_list[idx],"query_residue_number":query_residue_number,"subject_residue_number":subject_residue_number}, index=[0])
new_row=pd.concat([df.loc[:],new_row]).reset_index(drop=True)
df
=[df.loc[df['query_residue_number'] == residue_number, 'aa_subject'].values[0] for residue_number in residue_numbers]
subject_residue_listreturn((df.loc[df['query_residue_number'].isin(residue_numbers),"subject_residue_number"].tolist(),subject_residue_list))
def getContactMatrix(pdbCode, chainCode, residueSelection, debug=False):
= urllib.request.urlopen("https://files.rcsb.org/view/"+pdbCode+".cif").read()
content
=cif.read_string(content)[0]
block=gemmi.make_structure_from_block(block)
structure=[]
positionsfor model in structure:
if model.name=="1":
for chain in model:
if chain.name == chainCode:
for residue in chain:
if residue.label_seq in residueSelection:
if debug:
print(residue.seqid)
print(residue)
for atom in residue:
if atom.name=="CA":
if debug: print("ok")
positions.append(atom.pos.tolist())break #we need that for multiple occurences
if(len(positions)!=len(residueSelection)):
print("Not all positions found for "+pdbCode+" discarding structure")
return None
=np.array(positions)
positions_npreturn cdist(positions_np, positions_np, 'euclidean')
# create the contact matrices (CA based for now)
=[]
contactMatrices=[]
resultResidueNames=[]
selectedResidues
# for pdbid in ['1UYF_1']:
for pdbid in list(chain_polymer_mapping.keys()):
print(pdbid)
=chain_polymer_mapping[pdbid][0] #take only the first chain
chain=pdbid.split("_")[0]
pdbcode=transformResidueNumbers(residue_list,clean_hits[pdbid])
tmp_result=tmp_result[0]
mapped_residues
contactMatrices.append(getContactMatrix(pdbcode,chain,mapped_residues))1])
resultResidueNames.append(tmp_result[+":"+str(res) for res in mapped_residues])
selectedResidues.append([chain
= [ic for ic, matrix in enumerate(contactMatrices) if matrix is None]
none_indices
=[structure.split("_")[0] for structure in list(chain_polymer_mapping.keys())]
labels=[chain_polymer_mapping[structure][0] for structure in list(chain_polymer_mapping.keys())]
chainCodes= [matrix for i, matrix in enumerate(contactMatrices) if i not in none_indices]
filteredContactMatrices = [bl for i, bl in enumerate(resultResidueNames) if i not in none_indices]
filteredresultResidueNames = [bl for i, bl in enumerate(selectedResidues) if i not in none_indices]
filteredSelectedResidues
= [label for idx, label in enumerate(labels) if idx not in none_indices]
filteredLabels = [code for idx, code in enumerate(chainCodes) if idx not in none_indices]
filteredChainCodes=pd.DataFrame(({"pdbid":filteredLabels,"chain":filteredChainCodes,"residues":filteredSelectedResidues}))
finalList"data/finalList.tsv",sep="\t",index=False)
finalList.to_csv(#final clustering of contact matrices
file = open('data/contactMatrices.pkl', 'wb')
# dump information to that file
"contactMatrices":filteredContactMatrices,"fileredResidueNames":filteredresultResidueNames,"filteredLabels":filteredLabels,"filteredChainCodes":filteredChainCodes,"finalList":finalList}, file)
pickle.dump({
# close the file
file.close()
import numpy as np
def clusterMatrices(matrixList,selectedResidueList,blosumWeight=1.0):
=len(matrixList)
nMatrices=np.zeros((nMatrices,nMatrices))
resultfor i in range(nMatrices):
for j in range(nMatrices):
if i==j:
=0.0
result[i][j]elif i<j:
=selectedResidueList[i]
r1=selectedResidueList[j]
r2=(np.array([subst_matrix[r1[idx]][r2[idx]] for idx in range(len(r1))])<0.0).sum()
blosum_score=np.mean(np.abs(matrixList[i]-matrixList[j]))+blosumWeight*blosum_score
result[i][j]=result[i][j]
result[j][i]
= ssd.squareform(result)
distArray =scipy.cluster.hierarchy.linkage(result, method='single', metric='euclidean')
clustersreturn(clusters)
file = open('data/contactMatrices.pkl', 'rb')
= pickle.load(file)
contactMatrices file.close()
=clusterMatrices(contactMatrices["contactMatrices"],contactMatrices["fileredResidueNames"]) clusters
Code
=(9, 100))
plt.figure(figsize
=contactMatrices["filteredLabels"],orientation='right',leaf_font_size=9,color_threshold=2.0)
scipy.cluster.hierarchy.dendrogram(clusters,labels plt.show()
Well that was a bit painful to set up. However in the figure above now we should have a good hierarchical tree for a pure backbone geometry based comparison. The pocket comparison method we’d might want to evaluate later might make use of particular amino acids or side chain positions. Thus, I already integrated a bit of code to account for a penalty in the distance between two binding sites if unfavourable permutations were found (blosum62 score<0). Feel free to adapt to whatever you need. In the current code, an unfavourable substitution of an amino acid is equivalent to a 1A distance between the two alpha carbons. The hierarchical tree was colored at a cut distance of 2A. We can however observe 3 smaller clusters in the large pink cluster. This very likely corresponds to the 3 clusters we identified on structures of identical sequence.
In the end we get a full set of structures, HSP90 alpha human ones, and similars. Let’s check how much of our dataset established in the previous post. In theory we should be able to gather 100% of it through the approach outlined here as well.
Code
import pickle
file = open('data/identicallist.pkl', 'rb')
= pickle.load(file)
identical_list file.close()
= [contactMatrices["filteredLabels"][idx]+":"+code for idx, code in enumerate(contactMatrices["filteredChainCodes"])]
similar_list
= np.intersect1d(identical_list, similar_list)
intersection
print(f"{len(intersection)} common structures vs {len(similar_list)} in the similar list and {len(identical_list)} in the identical list")
295 common structures vs 477 in the similar list and 295 in the identical list
That’s a nice and expected outcome here. All structures identified from our previous post are found through the approach used here as well. But instead of only 295 structures, we now have a total of 477 structures with different levels of similarity between all structures.
Results
I won’t put all the raw results in a table format in HTML here, but you can get the final list of structures, chain codes & residue selections as download from here. The tsv file is structured like this:
pdbid | chain | residues | |
---|---|---|---|
0 | 1UY8 | A | [‘A:47’, ‘A:48’, ‘A:51’, ‘A:52’, ‘A:54’, ‘A:55’, ‘A:58’, ‘A:93’, ‘A:95’, ‘A:96’, ‘A:97’, ‘A:98’, ‘A:106’, ‘A:107’, ‘A:112’, ‘A:131’, ‘A:132’, ‘A:133’, ‘A:134’, ‘A:135’, ‘A:136’, ‘A:137’, ‘A:138’, ‘A:139’, ‘A:152’, ‘A:183’, ‘A:184’, ‘A:186’] |
1 | 1UY9 | A | [‘A:47’, ‘A:48’, ‘A:51’, ‘A:52’, ‘A:54’, ‘A:55’, ‘A:58’, ‘A:93’, ‘A:95’, ‘A:96’, ‘A:97’, ‘A:98’, ‘A:106’, ‘A:107’, ‘A:112’, ‘A:131’, ‘A:132’, ‘A:133’, ‘A:134’, ‘A:135’, ‘A:136’, ‘A:137’, ‘A:138’, ‘A:139’, ‘A:152’, ‘A:183’, ‘A:184’, ‘A:186’] |
2 | 1UYD | A | [‘A:47’, ‘A:48’, ‘A:51’, ‘A:52’, ‘A:54’, ‘A:55’, ‘A:58’, ‘A:93’, ‘A:95’, ‘A:96’, ‘A:97’, ‘A:98’, ‘A:106’, ‘A:107’, ‘A:112’, ‘A:131’, ‘A:132’, ‘A:133’, ‘A:134’, ‘A:135’, ‘A:136’, ‘A:137’, ‘A:138’, ‘A:139’, ‘A:152’, ‘A:183’, ‘A:184’, ‘A:186’] |
3 | 1UYE | A | [‘A:47’, ‘A:48’, ‘A:51’, ‘A:52’, ‘A:54’, ‘A:55’, ‘A:58’, ‘A:93’, ‘A:95’, ‘A:96’, ‘A:97’, ‘A:98’, ‘A:106’, ‘A:107’, ‘A:112’, ‘A:131’, ‘A:132’, ‘A:133’, ‘A:134’, ‘A:135’, ‘A:136’, ‘A:137’, ‘A:138’, ‘A:139’, ‘A:152’, ‘A:183’, ‘A:184’, ‘A:186’] |
4 | 1UYF | A | [‘A:47’, ‘A:48’, ‘A:51’, ‘A:52’, ‘A:54’, ‘A:55’, ‘A:58’, ‘A:93’, ‘A:95’, ‘A:96’, ‘A:97’, ‘A:98’, ‘A:106’, ‘A:107’, ‘A:112’, ‘A:131’, ‘A:132’, ‘A:133’, ‘A:134’, ‘A:135’, ‘A:136’, ‘A:137’, ‘A:138’, ‘A:139’, ‘A:152’, ‘A:183’, ‘A:184’, ‘A:186’] |
5 | 1UYH | A | [‘A:47’, ‘A:48’, ‘A:51’, ‘A:52’, ‘A:54’, ‘A:55’, ‘A:58’, ‘A:93’, ‘A:95’, ‘A:96’, ‘A:97’, ‘A:98’, ‘A:106’, ‘A:107’, ‘A:112’, ‘A:131’, ‘A:132’, ‘A:133’, ‘A:134’, ‘A:135’, ‘A:136’, ‘A:137’, ‘A:138’, ‘A:139’, ‘A:152’, ‘A:183’, ‘A:184’, ‘A:186’] |
6 | 1UYK | A | [‘A:47’, ‘A:48’, ‘A:51’, ‘A:52’, ‘A:54’, ‘A:55’, ‘A:58’, ‘A:93’, ‘A:95’, ‘A:96’, ‘A:97’, ‘A:98’, ‘A:106’, ‘A:107’, ‘A:112’, ‘A:131’, ‘A:132’, ‘A:133’, ‘A:134’, ‘A:135’, ‘A:136’, ‘A:137’, ‘A:138’, ‘A:139’, ‘A:152’, ‘A:183’, ‘A:184’, ‘A:186’] |
7 | 1UYL | A | [‘A:47’, ‘A:48’, ‘A:51’, ‘A:52’, ‘A:54’, ‘A:55’, ‘A:58’, ‘A:93’, ‘A:95’, ‘A:96’, ‘A:97’, ‘A:98’, ‘A:106’, ‘A:107’, ‘A:112’, ‘A:131’, ‘A:132’, ‘A:133’, ‘A:134’, ‘A:135’, ‘A:136’, ‘A:137’, ‘A:138’, ‘A:139’, ‘A:152’, ‘A:183’, ‘A:184’, ‘A:186’] |
8 | 2VCI | A | [‘A:47’, ‘A:48’, ‘A:51’, ‘A:52’, ‘A:54’, ‘A:55’, ‘A:58’, ‘A:93’, ‘A:95’, ‘A:96’, ‘A:97’, ‘A:98’, ‘A:106’, ‘A:107’, ‘A:112’, ‘A:131’, ‘A:132’, ‘A:133’, ‘A:134’, ‘A:135’, ‘A:136’, ‘A:137’, ‘A:138’, ‘A:139’, ‘A:152’, ‘A:183’, ‘A:184’, ‘A:186’] |
9 | 2WI2 | A | [‘A:47’, ‘A:48’, ‘A:51’, ‘A:52’, ‘A:54’, ‘A:55’, ‘A:58’, ‘A:93’, ‘A:95’, ‘A:96’, ‘A:97’, ‘A:98’, ‘A:106’, ‘A:107’, ‘A:112’, ‘A:131’, ‘A:132’, ‘A:133’, ‘A:134’, ‘A:135’, ‘A:136’, ‘A:137’, ‘A:138’, ‘A:139’, ‘A:152’, ‘A:183’, ‘A:184’, ‘A:186’] |
How can this be used as benchmark dataset?
Let’s come back to our initial problem. How can this dataset now be used to evaluate the performance of a pocket comparison method? The dataset contains information about:
- what binding site is similar to another
- to what extent the binding sites are similar (more or less similar)
So we already have a set of structures that we can in theory compare to one another and see if the binding site comparison similarity metric correlates with the similarity we considered here. We can also use the dataset to compare to structures outside of the set, a decoy set. This is in theory easy to set up but has important implications on how to evaluate a performance of a method.
First let’s clean up a bit and define a function that allows us to get the most similar binding sites from our set versus a chosen query structure:
Code
import itertools
def getDistanceMatrix(matrixList,selectedResidueList,blosumWeight=1.0):
=len(matrixList)
nMatrices=np.zeros((nMatrices,nMatrices))
resultfor i in range(nMatrices):
for j in range(nMatrices):
if i==j:
=0.0
result[i][j]elif i<j:
=selectedResidueList[i]
r1=selectedResidueList[j]
r2=(np.array([subst_matrix[r1[idx]][r2[idx]] for idx in range(len(r1))])<0.0).sum()
blosum_score=np.mean(np.abs(matrixList[i]-matrixList[j]))+blosumWeight*blosum_score
result[i][j]=result[i][j]
result[j][i]
return(result)
=getDistanceMatrix(contactMatrices["contactMatrices"],contactMatrices["fileredResidueNames"])
distanceMatrix
=np.array(similar_list)
np_similar_list=[]
sortedMatrixfor rowid in range(0,distanceMatrix.shape[0]):
=np.argsort(distanceMatrix[rowid,:])
o
np_similar_list[o]list(np.char.add(np.char.add(np_similar_list[o],": "),distanceMatrix[rowid,o].astype('<U7')))])
sortedMatrix.append([np_similar_list[rowid],
=pd.DataFrame(sortedMatrix,columns=["pdbid","similar structures"])
sortedMatrixDF"data/similarityList.tsv",sep="\t",index=False)
sortedMatrixDF.to_csv(
=[[element[0],element[1][:10]] for element in sortedMatrix[:10]]
truncatedMatrix
Markdown(tabulate(10],
truncatedMatrix[:=["pdbid","similar structures"]
headers ))
pdbid | similar structures |
---|---|
1UY8:A | [‘1UY8:A: 0.0’, ‘1UY9:A: 0.04950’, ‘1UY7:A: 0.07109’, ‘2FWY:A: 0.07655’, ‘4CWN:A: 0.07769’, ‘7D22:A: 0.07787’, ‘3QDD:A: 0.08176’, ‘1UYG:A: 0.08746’, ‘1UYC:A: 0.09087’, ‘5LR1:A: 0.09155’] |
1UY9:A | [‘1UY9:A: 0.0’, ‘1UY8:A: 0.04950’, ‘2FWY:A: 0.06527’, ‘4CWN:A: 0.06538’, ‘7D22:A: 0.06731’, ‘1UY7:A: 0.06833’, ‘5LR1:A: 0.07130’, ‘3QDD:A: 0.07253’, ‘2H55:A: 0.08012’, ‘7S9H:A: 0.08232’] |
1UYD:A | [‘1UYD:A: 0.0’, ‘1UYE:A: 0.05363’, ‘1UY6:A: 0.07854’, ‘1UYF:A: 0.08113’, ‘3HZ5:A: 0.08704’, ‘2FWY:A: 0.09215’, ‘1UYM:A: 0.09343’, ‘6N8Y:A: 0.09523’, ‘1UY9:A: 0.09570’, ‘6OLX:A: 0.09623’] |
1UYE:A | [‘1UYE:A: 0.0’, ‘1UYD:A: 0.05363’, ‘1UYF:A: 0.07702’, ‘3HZ5:A: 0.07724’, ‘1UY6:A: 0.08819’, ‘7S9G:A: 0.09107’, ‘6EL5:A: 0.09392’, ‘1UYM:A: 0.09860’, ‘7S9I:A: 0.10313’, ‘6OLX:A: 0.10381’] |
1UYF:A | [‘1UYF:A: 0.0’, ‘1UYE:A: 0.07702’, ‘1UYD:A: 0.08113’, ‘1UY6:A: 0.08245’, ‘3HZ5:A: 0.08700’, ‘6EL5:A: 0.09504’, ‘1UY9:A: 0.09708’, ‘6N8Y:A: 0.09839’, ‘6OLX:A: 0.09915’, ‘7LSZ:A: 0.10767’] |
1UYH:A | [‘1UYH:A: 0.0’, ‘1UYC:A: 0.08294’, ‘1UYG:A: 0.08927’, ‘1UY8:A: 0.10472’, ‘2H55:A: 0.10763’, ‘4CWN:A: 0.10797’, ‘1UY9:A: 0.11027’, ‘5H22:A: 0.11587’, ‘5T21:A: 0.11815’, ‘5FNC:A: 0.11937’] |
1UYK:A | [‘1UYK:A: 0.0’, ‘7S9H:A: 0.09523’, ‘1UY9:A: 0.09661’, ‘1UY7:A: 0.10556’, ‘1UYD:A: 0.10785’, ‘1UY8:A: 0.11345’, ‘7S9G:A: 0.11358’, ‘1UYF:A: 0.11432’, ‘1UYE:A: 0.11480’, ‘2FWY:A: 0.11513’] |
1UYL:A | [‘1UYL:A: 0.0’, ‘7S90:A: 0.08262’, ‘2YE3:A: 0.08945’, ‘4EEH:A: 0.09143’, ‘5J64:A: 0.09365’, ‘2YE9:A: 0.09593’, ‘2YE2:A: 0.09723’, ‘3BM9:A: 0.09779’, ‘6HHR:A: 0.09889’, ‘2WI3:A: 0.10079’] |
2VCI:A | [‘2VCI:A: 0.0’, ‘3OWB:A: 0.07360’, ‘4YKQ:A: 0.07581’, ‘2UWD:A: 0.07666’, ‘2XAB:A: 0.07739’, ‘2CCS:A: 0.08540’, ‘4LWI:A: 0.08556’, ‘2BT0:A: 0.08640’, ‘5XRD:A: 0.08885’, ‘2WI5:A: 0.09369’] |
2WI2:A | [‘2WI2:A: 0.0’, ‘3B24:A: 0.08150’, ‘4YKY:A: 0.08930’, ‘3FT5:A: 0.08979’, ‘6GPW:A: 0.09092’, ‘2YEH:A: 0.09562’, ‘7S8Y:A: 0.09671’, ‘4YKX:A: 0.09682’, ‘2WI1:A: 0.10036’, ‘2YED:A: 0.10248’] |
Table Table 2 shows only the first 10 most similar structures and underlying binding sites to the binding site corresponding to the pdbid on the leftmost column.
You can retrieve the full list through the downloadable tsv file here: similarityList.tsv
This file can be used for various purposes. You can decide to reduce the dataset to reduce structural reduncany. You can also use it to train a method, parameters, scores on how similarity is defined etc.
Decoys?
In all benchmark datasets that I have seen so far in the litterature there’s a distinction between binding sites one should find (actives) and binding sites one shouldn’t find (decoys). I fully understand that this is rather standard practice for similarity metric performance measurements & evaluation studies. However, in the context of binding site comparison on a large variety of structures (like against the PDB) things get a little bit trickier.
Assessing ligand similarity in similar binding sites
One interesting aspect about HSP90 is that there are a lot of structures in the RCSB. A lot of these structures have soaked or cocrystallized ligands from similar or congeneric chemical series. One interesting aspect to analyze is how similar two ligands are to each other when they are known to bind to very similar binding site conformations.
In order to try integrate that we first need to gather the ligands contained in the binding site encompassed by the residues selected, which aren’t ions or surfactants or water. I didn’t find a clear & easy way to get ligands contained within a given set of residues using RCSB resources, so I’ll calculate the center of mass of the binding site residues here and check against the center of mass of all ligands in the structure. Very cumbersome but should do the trick here.
Code
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import rdFMCS
from rdkit import DataStructs
from rdkit.Chem import rdMolDescriptors
import seaborn as sns
import io
def getLigandHetCodeInBindingSite(pdbCode, chainCode, residueSelection, debug=False):
= urllib.request.urlopen("https://files.rcsb.org/view/"+pdbCode+".cif").read()
content=cif.read_string(content)[0]
block=gemmi.make_structure_from_block(block)
structure=[]
positions={}
hetDictfor model in structure:
if model.name=="1":
for chain in model:
if chain.name == chainCode:
for residue in chain:
if residue.label_seq in residueSelection:
if debug:
print(residue.seqid)
print(residue)
for atom in residue:
if atom.name=="CA":
if debug: print("ok")
positions.append(atom.pos.tolist())break #we need that for multiple occurences
elif(str(residue.entity_type)=="EntityType.NonPolymer" and (str(residue.name) !="HOH" or str(residue.name)!="WAT")):
=residue.name+":"+chain.name+":"+str(residue.seqid.num)
dname=[]
hetDict[dname]for atom in residue:
hetDict[dname].append(atom.pos.tolist())
if(len(positions)!=len(residueSelection)):
print("Not all positions found for "+pdbCode+" discarding structure")
return None
=np.mean(np.array(positions))
binding_site_com=[]
resultsif(len(hetDict.keys())>0):
for ligand in hetDict.keys():
=np.mean(np.array(hetDict[ligand]))
ligand_com=np.linalg.norm(binding_site_com-ligand_com)
ligand_distance
results.append([ligand,ligand_distance])return results
=[]
hetatmsfor index,row in contactMatrices["finalList"].iterrows():
"pdbid"],row["chain"],[int(el.split(":")[1]) for el in row["residues"]]))
hetatms.append(getLigandHetCodeInBindingSite(row[
=[]
retainedhetatmcodes=3.0
thresholdfor index,row in contactMatrices["finalList"].iterrows():
=[hetatm[0].split(":")[0] for hetatm in hetatms[index]if hetatm[1]<threshold]
tmpif(len(tmp)>0):
+=[tmp]
retainedhetatmcodeselse:
+=[['']]
retainedhetatmcodes
"finalList"]["ligands"]=retainedhetatmcodes
contactMatrices[= list(itertools.chain(*retainedhetatmcodes))
flatretainedhetatmcodes
="""
query{
chem_comps(comp_ids: """+str(flatretainedhetatmcodes).replace("'","\"")+""") {
rcsb_id
chem_comp {
formula_weight
}
rcsb_chem_comp_descriptor{
SMILES
}
}
}
"""
={}
moleculedict=f"https://data.rcsb.org/graphql?query={query}"
url=requests.get(url)
response=response.json()["data"]["chem_comps"]
dataRCSB
for compound in dataRCSB:
if compound["chem_comp"]["formula_weight"]>180.0:
"rcsb_id"]]={"molecule":Chem.MolFromSmiles(compound["rcsb_chem_comp_descriptor"]["SMILES"])} moleculedict[compound[
import base64
=[]
plots=[]
refmolecules=[]
hetcodes=[]
pdbcodesfor index,row in contactMatrices["finalList"].iterrows():
if index<10:
print(index,retainedhetatmcodes[index])
=sortedMatrixDF.iloc[[index]]["similar structures"].values[0]
t=[line.split(" ")[0].rstrip(":") for line in t]
sortedpdbcodes=[float(line.split(" ")[1].rstrip(":")) for line in t]
sortedpocketsimilarityscores=[]
rowmolecules=[]
fpsfor pdbcode in sortedpdbcodes:
=sortedMatrixDF.loc[sortedMatrixDF["pdbid"]==pdbcode].index[0]
tableindex=contactMatrices["finalList"].iloc[[tableindex]]["ligands"].values[0]
ligandsfor ligand in ligands:
if ligand in moleculedict:
"molecule"])
rowmolecules.append(moleculedict[ligand]["molecule"],radius=2))
fps.append(rdMolDescriptors.GetMorganFingerprint(moleculedict[ligand][=fps[0]
reffp= DataStructs.BulkTanimotoSimilarity(reffp, fps)
fpscores =pd.DataFrame(zip(sortedpocketsimilarityscores,fpscores),columns=["pocket dissimilarity","fp similarity"])
plotdata= sns.pairplot(plotdata)
g = io.BytesIO()
buf format='png')
g.fig.savefig(buf, 0)
buf.seek(= base64.b64encode(buf.read())
image
plots.append(image)0])
refmolecules.append(rowmolecules["pdbid"])
pdbcodes.append(row[
hetcodes.append(retainedhetatmcodes[index])
# Draw.MolsToGridImage(rowmolecules[:20],molsPerRow=5,subImgSize=(300,200),legends=["Tanimoto: "+str(round(score,2)) for score in scores[:20]])
# mcs = rdFMCS.FindMCS(rowmolecules,threshold=0.8,completeRingsOnly=True,ringMatchesRingOnly=True)
# patt = Chem.MolFromSmarts(mcs.smartsString)
# refMol = rowmolecules[0]
# AllChem.Compute2DCoords(refMol)
# refMatch = refMol.GetSubstructMatch(patt)
# for probeMol in rowmolecules[1:]:
# AllChem.GenerateDepictionMatching2DStructure(query , refMatch)
Code
=pd.DataFrame(zip(pdbcodes,refmolecules,plots,hetcodes),columns=["pdb","molecule","similarity plot","ligand name"])
tableplotdata
with open("data/similarityplots.pkl","wb") as f:
pickle.dump(tableplotdata,f)
# Markdown(tabulate(
# tableplotdata[:10],
# headers=["pdbid","molecule","similarity"]
# ))
Code
from rdkit import Chem
from rdkit.Chem import PandasTools
import urllib
with open("data/similarityplots.pkl","rb") as f:
=pickle.load(f)
tableplotdata"smiles"]=[Chem.MolToSmiles(mol) for mol in tableplotdata["molecule"]]
tableplotdata['smiles','molecule')
PandasTools.AddMoleculeColumnToFrame(tableplotdata,for index, row in tableplotdata.iterrows():
= 'data:image/png;base64,' + urllib.parse.quote(row["similarity plot"])
uri = '<img width="300px" src = "%s"/>' % uri
html "similarity plot"]=html
row[
=False) tableplotdata.to_html(escape
pdb | molecule | similarity plot | ligand name | smiles | |
---|---|---|---|---|---|
0 | 2BYI |
|
[2DD] | NS(=O)(=O)c1ccc(CNC(=O)c2cn[nH]c2-c2cc(Cl)c(O)cc2O)cc1 | |
1 | 2H55 |
|
[DZ8] | CC(C)NCCCn1c(Cc2cc3c(cc2I)OCO3)nc2c(N)nc(F)nc21 | |
2 | 2XJX |
|
[XJX] | CC(C)c1cc(C(=O)N2Cc3ccc(CN4CCN(C)CC4)cc3C2)c(O)cc1O | |
3 | 2YEB |
|
[2K4] | Oc1ccc2c(-c3cc(Br)c(O)cc3O)noc2c1 | |
4 | 3FT5 |
|
[MO8] | Cc1nc(N)nc2c1CSCC2 | |
5 | 3QDD |
|
[94M] | COc1c(C)cnc(Cn2cnc3c(Cl)nc(N)nc32)c1C | |
6 | 4AWQ |
|
[592] | COc1cccc(C(=O)NC2CC3CCC(C2)N3c2ccc(C(=O)NCc3ccccc3)cn2)c1C | |
7 | 4EGI |
|
[B2J] | CSc1nc(N)nc(-c2ccccc2Cl)n1 | |
8 | 4O0B |
|
[2QA] | Cc1cn(-c2cc3c(c(C4CCCC4)c2)C(=O)NCC3)c2c1C(=O)CC(C)(C)C2 | |
9 | 6B9A |
|
[PA7] | CCCNC(=O)C1OC(n2cnc3c(N)ncnc32)C(O)C1O |
The table above summarizes for 10 out of over 400 structures the similarity of the binding sites versus the Tanimoto similarity of the ligands located in the binding sites. I used a Morgan fingerprint (radius) 2 and according to a blogpost by Greg Landrum, the random threshold is situated at around 0.2 - 0.3.
Feel free to adapt the code to display the plots for all structures.
When looking through the first few plots we can see that even though sometimes there might be some correlation between the two similarity metrics, for the large bulk there doesn’t seem to be any correlation. Both similarity metrics can be discussed, refined, rendered more fuzzy or precise, one can also replace the ligand similarity by an MCS approach. Again, you can adapt the code quite easily to get there, but still the current results give a lot of examples where rather different ligands can bind to very similar binding sites. Also in several cases ligands above the significance threshold can be found in very similar pockets (compound series), but the opposite is also true.
Let’s discuss in more detail a typical compound series found in the dataset. Structure 2h55 resulted in the following distribution:
In the orange box on Figure 2 several ligands with some similarity to DZ8 are identified in very similar binding sites (pocket dissimilarity close to 0). More dissimilar binding sites do not seem to bind this class of ligands. This points into the direction that this series of compounds binds clearly to a particular conformation (or induces it, or both). It is also interesting to note that the majority of other compounds in the set are significantly dissimilar to the query ligand.
On figure Figure 3 we can see a very different distribution. Here we have an ATP analog as query ligand & the corresponding binding site. It is very interesting to observe the skewness to more dissimilar binding sites compared to 6b9a. Despite the dissimilarity on binding sites we can see a large cluster of ligands with around 0.5 Tanimoto similarity to PA7. This cluster encompasses ATP & all the closer analogs. This example is a very good one to illustrate the complexity behind the theory of similar ligands bind to similar binding sites. ATP analogs clearly accomodate for a larger variety of binding site conformations & mutations than the compound series around DZ8 does. We could then even imagine that DZ8 is conformation selective & has a few hallmarks of a more specific inhibitor. That’s a bit far fetched for a comparison with a plain ATP analog here, but it appears to go into that direction nevertheless.