# Visualization and exploration of data

### Visualization and exploration of data

There exists a lot of different tools for visualization and exploration of data in Python.

#### Content:

For the purposes of this tutorial, we will compute the root-mean-square deviation of atomic positions (RMSD) of a molecule along a conformation path compared to the goal molecule conformation. We assume that each of the conformations is being stored as a separate pdb-file and these files are loaded into SAMSON according to the conformation path (the first molecule is the starting conformation and the last molecule is the goal). The result will be stored in a pandas data frame.

You can find the code in SAMSON Python Scripting samples on github.

The RMSD is computed using the next script.

```from math import sqrt import pandas as pd   def compute_rmsd(mol1_indexer, mol2_indexer): ''' Computes RMSD between two conformations of a molecule NB: this is not an optimized implementation, it is only for the purpose of this tutorial ''' if mol1_indexer.size != mol2_indexer.size: # check if indexers are of the same size print('molecules are not the same') return -1   rmsd = 0.0; for i in range(mol1_indexer.size): # go through all atoms in the indexer if mol1_indexer[i].elementType != mol2_indexer[i].elementType: print('atoms are not the same') return -1 pos1 = mol1_indexer[i].getPosition() # get position of an atom from an indexer pos2 = mol2_indexer[i].getPosition() distance = pos1 - pos2 # compute distance between atoms rmsd += distance.norm2().angstrom # accumulate squared norm for distances [in angstrom]   rmsd = sqrt(rmsd / mol1_indexer.size) # take a square root of the sum of squared distances divided by the number of atoms return rmsd   def compute_rmsd_2mol(mol1, mol2): ''' Computes several RMSD between two conformations: 1) between all atoms; 2) between backbone atoms; 3) between alpha-Carbon atoms ''' mol1_all_indexer = mol1.getNodes('n.t a') # get all atoms mol2_all_indexer = mol2.getNodes('n.t a') mol1_backbone_indexer = mol1.getNodes('n.t a in n.t bb') # get all atoms in the backbone mol2_backbone_indexer = mol2.getNodes('n.t a in n.t bb') mol1_CA_indexer = mol1.getNodes('"CA" in n.t bb') # get all alpha Carbons in the backbone mol2_CA_indexer = mol2.getNodes('"CA" in n.t bb')   rmsd_all = compute_rmsd(mol1_all_indexer, mol2_all_indexer) rmsd_backbone = compute_rmsd(mol1_backbone_indexer, mol2_backbone_indexer) rmsd_CA = compute_rmsd(mol1_CA_indexer, mol2_CA_indexer)   if (rmsd_all != -1): print("RMSD (all atoms): %e Angstrom" % rmsd_all) if (rmsd_backbone != -1): print("RMSD (backbone) : %e Angstrom" % rmsd_backbone) if (rmsd_CA != -1): print("RMSD (C_alpha) : %e Angstrom" % rmsd_CA)   return [rmsd_all, rmsd_backbone, rmsd_CA], ['all', 'backbone', 'Calpha'] # return list of different RMSD     indexer = SAMSON.getNodes('n.t m') # get all molecules if indexer.size < 2: print("less than two molecules are present in the data graph") else: # we assume here that we need to compute RMSD between the first two molecules goal = indexer[-1] # get the goal molecule [assume the last molecule to be the goal one] indexer.removeNode(goal) # remove the target node from the indexer df = pd.DataFrame() # create pandas data frame for t in range(indexer.size): mol = indexer[t] rmsd, rmsd_info = compute_rmsd_2mol(goal, mol) # compute RMSD between two conformations # fill in the data frame for j in range(len(rmsd)): df = df.append( pd.DataFrame({'name': [mol.name], 'info': rmsd_info[j], 'RMSD': rmsd[j], 'timestep': t }) )```

The data frame df will be used for visualization using different packages.

If you want to see how the data in the data frame looks like, you can use:

`df.head() # prints first several lines`

#### matplotlib

By default matplotlib creates figures in a separate window. If you would like to have your plots inlined in the Qt Console, use IPython magic to state it before importing matplotlib:

```# % - IPython magic. Works only inside a separate cell. %matplotlib inline```

If you want to use IPython magic in your script, use the next command

```from IPython import get_ipython   ipython = get_ipython() ipython.magic('matplotlib inline')```

After that, you may import matplotlib

`import matplotlib.pyplot as plt`

To plot data using matplotlib we will group our data by the type of the computed RMSD and then plot each of the grouped data on a single figure:

```df_all = df.groupby('info').get_group('all') # group by info of RMSD and get RMSD computed based on all atoms df_bb = df.groupby('info').get_group('backbone') # group by info of RMSD and get RMSD computed based on backbone atoms df_CA = df.groupby('info').get_group('Calpha') # group by info of RMSD and get RMSD computed based on Calpha atoms line_RMSD_all, = plt.plot(df_all["timestep"], df_all["RMSD"], label="all") line_RMSD_bb, = plt.plot(df_bb ["timestep"], df_bb ["RMSD"], label="backbone") line_RMSD_CA, = plt.plot(df_CA ["timestep"], df_CA ["RMSD"], label="Calpha") plt.ylabel('RMSD, Angstrom') plt.xlabel('timestep') plt.legend(handles=[line_RMSD_all, line_RMSD_bb, line_RMSD_CA]) plt.grid() # the next line is not necessary if you are using the inline plotting plt.show()```

As a result, we get the next picture: #### R ggplot

To be able to plot using R ggplot, it is necessary to install R, rpy2 and r-ggplot2 packages. For example, using conda:

```conda install -c r rpy2 r-ggplot2
```

Let us use pandas package for transferring data to R:

```from IPython import get_ipython # import function get_ipython   ipython = get_ipython() # get the global instance of the Jupyter Qt console instance ipython.magic('load_ext rpy2.ipython') # load an IPython extention named rpy2.ipython: it allows one to use R-language inside IPython shell ipython.magic('R require(ggplot2)') # R language: require ggplot2 package```

Finally, in a separate cell we can plot using R ggplot:

```# -i df - input data to be transferred to R %%R -i df ggplot(data = df) + geom_line(aes(x = timestep, y = RMSD, color = info))``` #### seaborn

“Seaborn is a Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics.” (seaborn)

Since seaborn is based on the matplotlib package, to have an inlined figure you may use the same IPython magic as for matplotlib:

```# % - IPython magic. Works only inside a separate cell. %matplotlib inline```

If you want to use IPython magic in your script, use the next command

```from IPython import get_ipython   ipython = get_ipython() ipython.magic('matplotlib inline')```

After that, you may import matplotlib

`import matplotlib.pyplot as plt`

Let us plot now the time evolution

```import seaborn as sns   sns.pointplot(x=df["timestep"], y=df["RMSD"], hue=df["info"]) # the next line is not necessary if you are using inline plotting plt.show()``` #### bokeh

If you want to explore data interactively in a browser with a help of javascript you can use such tools as bokeh.
“Bokeh is a Python interactive visualization library that targets modern web browsers for presentation. Its goal is to provide elegant, concise construction of novel graphics in the style of D3.js, and to extend this capability with high-performance interactivity over very large or streaming datasets.” (bokeh)

```from bokeh.plotting import figure, output_file, show p = figure(title="RMSD") p.line(x=df_all["timestep"], y=df_all["RMSD"], color='red', legend='all') p.line(x=df_bb ["timestep"], y=df_bb ["RMSD"], color='blue', legend='backbone') p.line(x=df_CA ["timestep"], y=df_CA ["RMSD"], color='green', legend='Calpha') show(p)``` 