Skip to content
Snippets Groups Projects
Commit 1ecd36d8 authored by Guillaume Poirier-Morency's avatar Guillaume Poirier-Morency
Browse files

Fix Km plot

parent 8c46be0d
No related branches found
No related tags found
No related merge requests found
%% Cell type:markdown id: tags:
# BIM6065C: Microtargetome analysis
We will use the basic of Pandas you have learned in previous sessions to analyze microtargetome data.
%% Cell type:code id: tags:
``` python
import pandas as pd
from io import StringIO
import requests
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
```
%% Cell type:markdown id: tags:
Seaborn has a really nice matplotlib theme to make figure readable.
%% Cell type:code id: tags:
``` python
import seaborn as sb
sb.set('talk', 'ticks')
```
%% Cell type:code id: tags:
``` python
def read_microtargetome(f):
return pd.read_csv(f, sep='\t', index_col=['gene_accession', 'target_accession', 'position', 'mirna_accession'])
```
%% Cell type:markdown id: tags:
# Querying data from miRBooking-scan
[miRBooking-scan](https://major.iric.ca/~poirigui/mirbooking-scan/) is a Web platform that provides pre-computed microtargetomes.
We currently have predictions for 29 cell lines that were retrieved from the ENCODE project.
Every endpoint allow querying programatically with the `Accept: text/tab-separated-values` to retrieve a TSV format.
```
https://major.iric.ca/~poirigui/mirbooking-scan/cell-lines/<cell_line_accession>
```
%% Cell type:code id: tags:
``` python
response = requests.get('https://major.iric.ca/~poirigui/mirbooking-scan/cell-lines/ENCSR809EFN', headers={'Accept': 'text/tab-separated-values'})
microtargetome_df = read_microtargetome(StringIO(response.text))
```
%% Cell type:markdown id: tags:
# Hop on!
Microtargetome have 4 level of hierarchy:
- gene: genomic element that encodes various transcripts
- target: RNA transcript
- position: offset on a transcript which matches the end of the microRNA seed
- microRNA: small RNA of about 22 nucleotides
For each possible quadruplet, our model predicts an equilibrium concentration `quantity` according to its equilibrium constant $K_m$. In particular, our solution satisfies:
$K_m = \frac{[E_m][S_{t,p}]}{[E_mS_{t,p}]}$
Where $[E_m]$ is the free concentration of microRNA $m$, $[S_{t,p}]$ is the free concentration of target site $(t, p)$ and $[E_mS_{t,p}]$ is the duplex formed at that particular location.
%% Cell type:code id: tags:
``` python
microtargetome_df.head()
```
%% Cell type:markdown id: tags:
# Multi-index
%% Cell type:code id: tags:
``` python
microtargetome_df.loc['ENSG00000137876.9'].head()
```
%% Cell type:markdown id: tags:
To query an arbitrary level, you can use `xs`.
%% Cell type:code id: tags:
``` python
microtargetome_df.xs('MIMAT0000098', level='mirna_accession').head()
```
%% Cell type:markdown id: tags:
Multi-index can also be used for columns, which can be very handy for grouping samples.
%% Cell type:code id: tags:
``` python
microtargetome_df.T
```
%% Cell type:code id: tags:
``` python
sponged_concentration = microtargetome_df.groupby(level='mirna_accession').quantity.sum()
total_concentration = microtargetome_df.groupby(level='mirna_accession').mirna_quantity.first()
(sponged_concentration / total_concentration).head()
```
%% Cell type:code id: tags:
``` python
microtargetome_df.groupby(level='target_accession').agg({
'gene_name': 'first',
'target_name': 'first',
'target_quantity': 'first',
'quantity': 'sum'}).head()
```
%% Cell type:markdown id: tags:
# With this in mind, let's verify if our equilibrium really hold!
%% Cell type:code id: tags:
``` python
S0 = microtargetome_df.target_quantity
E0 = microtargetome_df.mirna_quantity
```
%% Cell type:code id: tags:
``` python
S0.head()
```
%% Cell type:markdown id: tags:
The available microRNA concentration is given by conservation:
%% Cell type:code id: tags:
``` python
E = E0 - microtargetome_df.groupby(level='mirna_accession').quantity.sum()
```
%% Cell type:markdown id: tags:
Same for substrate, but position-wise:
%% Cell type:code id: tags:
``` python
S = S0 - microtargetome_df.quantity
```
%% Cell type:markdown id: tags:
Now, the complexes:
%% Cell type:code id: tags:
``` python
ES = microtargetome_df.quantity
```
%% Cell type:code id: tags:
``` python
predicted_Km = ((E * S) / ES).rename('predicted_score')
predicted_Km.head()
```
%% Cell type:code id: tags:
``` python
pd.concat([predicted_Km, microtargetome_df.score], axis=1).plot.scatter('predicted_score', 'score')
plt.xscale('log')
plt.yscale('log')
plt.xlabel(r'$\frac{[E][S]}{[ES]}$')
plt.ylabel('$K_m$')
pd.concat([predicted_Km, microtargetome_df.score], axis=1)
```
%% Cell type:code id: tags:
``` python
fig, ax = plt.subplots()
ax.set_xscale('log')
ax.set_yscale('log')
pd.concat([predicted_Km, microtargetome_df.score], axis=1).plot.scatter('predicted_score', 'score', ax=ax)
ax.set_xlabel(r'$\frac{[E][S]}{[ES]}$')
ax.set_ylabel('$K_m$')
```
%% Cell type:markdown id: tags:
However, this is not exactly equal because the available substrate concentration is actually a bit more complicated to calculate since we have to account for overlapping sites.
%% Cell type:markdown id: tags:
# Jointure, merge and concatenation
These three concepts are similar, but behave differently.
- jointure are fast and work on indexes
- merge are slow and work on columns
- concat is similar to a jointure, but require matching indexes and works with many dataframes and series
But first, let's automate the process of fetching data from miRBooking-scan so that we can study a couple of cell lines.
%% Cell type:code id: tags:
``` python
def fetch_from_mirbooking_scan(accession):
response = requests.get(f'https://major.iric.ca/~poirigui/mirbooking-scan/cell-lines/{accession}', headers={'Accept': 'text/tab-separated-values'})
return read_microtargetome(StringIO(response.text))
```
%% Cell type:markdown id: tags:
We start first with comparing undifferentiated embryonic stem cells agaist skin keratinocytes.
%% Cell type:code id: tags:
``` python
embryonic_stem_cell = fetch_from_mirbooking_scan('ENCSR820QMS')
keratinocyte = fetch_from_mirbooking_scan('ENCSR193SZM')
```
%% Cell type:code id: tags:
``` python
pd.concat([embryonic_stem_cell, keratinocyte], keys=['ENCSR820QMS', 'ENCSR193SZM'], names=['sample'])
```
%% Cell type:markdown id: tags:
We can use a different axis for concatenation.
%% Cell type:code id: tags:
``` python
pd.concat([embryonic_stem_cell, keratinocyte], keys=['ENCSR820QMS', 'ENCSR193SZM'], names=['sample'], axis=1)
```
%% Cell type:markdown id: tags:
We obtain a similar result with a jointure, but it's not really appropriate here because we're joining the same kind of data.
%% Cell type:code id: tags:
``` python
embryonic_stem_cell.join(keratinocyte, lsuffix='stem_cell', rsuffix='_keratinocyte').head()
```
%% Cell type:code id: tags:
``` python
df = pd.concat([embryonic_stem_cell, keratinocyte], keys=['ENCSR820QMS', 'ENCSR193SZM'], names=['sample'], axis=1)
df.head()
```
%% Cell type:code id: tags:
``` python
def log2fc(a, b):
return np.log2(a.replace(0, np.nan) / b.replace(0, np.nan)).dropna()
```
%% Cell type:code id: tags:
``` python
log2fc(df.ENCSR193SZM.quantity, df.ENCSR820QMS.quantity).head()
```
%% Cell type:code id: tags:
``` python
log2fc(df.ENCSR193SZM.quantity, df.ENCSR820QMS.quantity).hist(bins=20)
plt.xlabel('$\log_2$ fold-change')
plt.ylabel('Frequency')
```
%% Cell type:markdown id: tags:
It seems that most interactions are up-regulated. Let's verify if that holds:
%% Cell type:code id: tags:
``` python
log2fc(df.ENCSR193SZM.quantity, df.ENCSR820QMS.quantity).hist(bins=50, cumulative=True, density=True)
plt.axvline(0, c='r')
```
%% Cell type:markdown id: tags:
About 60% of our interactions seems to be down-regulated.
%% Cell type:code id: tags:
``` python
from scipy.stats import ttest_1samp
ttest_1samp(log2fc(df.ENCSR193SZM.quantity, df.ENCSR820QMS.quantity), popmean=0)
```
%% Cell type:markdown id: tags:
Let's calculate the bound fraction of each target. We make the simple assumption that binding sites are independent so that:
$\Pr[k > 0] = 1 - \Pr[k = 0] = 1 - \prod_{i=1}^n (1 - p_i)$
For each position $p_i$ in our target.
%% Cell type:code id: tags:
``` python
def bound_fraction(df):
return 1 - (1 - df.quantity / df.target_quantity).prod()
```
%% Cell type:code id: tags:
``` python
stem_cell_bf = df.ENCSR820QMS \
.groupby(level=['gene_accession', 'target_accession', 'position']).agg({'quantity': 'sum', 'target_quantity': 'first'}) \
.groupby(level=['gene_accession', 'target_accession']).apply(bound_fraction)
```
%% Cell type:code id: tags:
``` python
keratinocyte_bf = df.ENCSR193SZM \
.groupby(level=['gene_accession', 'target_accession', 'position']).agg({'quantity': 'sum', 'target_quantity': 'first'}) \
.groupby(level=['gene_accession', 'target_accession']).apply(bound_fraction)
```
%% Cell type:code id: tags:
``` python
log2fc(keratinocyte_bf, stem_cell_bf).hist(bins=20)
```
%% Cell type:code id: tags:
``` python
log2fc(keratinocyte_bf, stem_cell_bf).hist(bins=20, cumulative=True, density=True)
plt.axvline(0, c='r')
```
%% Cell type:code id: tags:
``` python
ttest_1samp(log2fc(keratinocyte_bf, stem_cell_bf), 0)
```
%% Cell type:markdown id: tags:
Now, it seems a bit clearer that the majority of targets had an increase in relative activity, which is consistent for a differentiated tissue.
Let's augment our analysis with gene and transcript-level information.
%% Cell type:code id: tags:
``` python
ontology = pd.concat([df.ENCSR193SZM.groupby(level=[0,1]).agg({'gene_name': 'first', 'target_name': 'first'}),
df.ENCSR820QMS.groupby(level=[0,1]).agg({'gene_name': 'first', 'target_name': 'first'})])
ontology = ontology.groupby(ontology.index).first()
```
%% Cell type:code id: tags:
``` python
k_s_log2fc = pd.concat([log2fc(keratinocyte_bf, stem_cell_bf).rename('log2fc'), ontology], axis=1)
k_s_log2fc.sort_values('log2fc').head()
```
%% Cell type:code id: tags:
``` python
sorted_k_s_index = k_s_log2fc.log2fc.abs().sort_values(ascending=False).index
k_s_log2fc.loc[sorted_k_s_index].head()
```
%% Cell type:code id: tags:
``` python
sorted_k_s_log2fc = k_s_log2fc.loc[sorted_k_s_index]
```
%% Cell type:code id: tags:
``` python
sorted_k_s_log2fc.head(100)
```
%% Cell type:code id: tags:
``` python
sorted_k_s_log2fc.head()
```
%% Cell type:code id: tags:
``` python
print('\n'.join(sorted_k_s_log2fc.gene_name.drop_duplicates()))
```
%% Cell type:markdown id: tags:
Let's paste this in [g:Profiler](https://biit.cs.ut.ee/gprofiler/gost), a meta-enrichment analysis tool.
%% Cell type:markdown id: tags:
# Cytoscape visualization
Microtargetomes are very, very large and we might prefer to summarize interactions prior to performing any visualization.
%% Cell type:code id: tags:
``` python
embryonic_stem_cell.describe()
```
%% Cell type:code id: tags:
``` python
gene_quantity = keratinocyte \
.groupby(level=['gene_accession', 'target_accession']).target_quantity.first() \
.groupby(level='gene_accession').sum().rename('gene_quantity')
gene_quantity.head()
```
%% Cell type:markdown id: tags:
We will also be interested in the gene fraction bound by at least one microRNA as a relative measure of how strong regulation is.
%% Cell type:code id: tags:
``` python
gene_mirna_bound_fraction = keratinocyte.groupby(level=['gene_accession', 'mirna_accession']).apply(bound_fraction).rename('bound_fraction')
```
%% Cell type:code id: tags:
``` python
gene_mirna_interactions = keratinocyte.groupby(level=['gene_accession', 'mirna_accession']).agg({
'gene_name': 'first',
'mirna_name': 'first',
'mirna_quantity': 'first',
'quantity': 'sum'})
gene_mirna_interactions = gene_mirna_interactions.join(gene_quantity)
gene_mirna_interactions = gene_mirna_interactions.join(gene_mirna_bound_fraction)
```
%% Cell type:markdown id: tags:
Let's narrow down our search to the genes involved in the melanosome.
%% Cell type:code id: tags:
``` python
melanosome_genes = '''
SYPL1
ATP6V0A1
CTNS
RAB27B
CAPG
HSPA5
DTNBP1
ATP1B3
RAB27A
TFRC
RAB7A
TYR
DCT
HSP90AA1
ERP29
GANAB
HSP90AB1
HPS4
SYNGR1
DNAJC5
AHCY
GPR143
OCA2
RAB2A
ANKRD27
SLC1A5
TYRP1
RAB5C
YWHAE
TMEM33
HSPA8
RAB5B
RAB35
CCT4
SLC1A4
RAB29
SLC2A1
PRDX1
CTSD
MREG
RAB32
GNA13
MLANA
ANXA11
RAB9A
RAB38
RAB17
CANX
CALU
RAN
SERPINF1
MYH11
CD63
GPNMB
RAC1
FLOT1
MYO7A
SYTL2
GGH
SDCBP
GCHFR
RAB1A
SGSM2
CLTC
SYTL1
PDIA6
RAB5A
ATP6V1B2
STOM
ITGB1
PDIA4
MMP14
NCSTN
ATP1A1
RPN1
SLC45A2
CTSB
YWHAZ
TPP1
HSP90B1
PPIB
STX3
YWHAB
PDIA3
SLC3A2
FASN
MYRIP
PDCD6IP
TMED10
BSG
CNP
TH
ANXA2
P4HB
PMEL
LAMP1
NAP1L1
TRPV2
SLC24A5
ANXA6
SND1
MYO5A
ATP6V1G2
ITGB3
SEC22B
'''.split()
specific_gene_mirna_interactions = gene_mirna_interactions[gene_mirna_interactions.gene_name.isin(melanosome_genes)]
```
%% Cell type:markdown id: tags:
Our dataframe is ready! We have:
- edges with scores (quantity or bound fraction)
- nodes with unique identifiers (accessions), labels (names) and metadata (quantity)
Now, let's export this in a TSV so that we can import it in Cytoscape.
%% Cell type:code id: tags:
``` python
specific_gene_mirna_interactions.to_csv('keratinocyte-melanosome-gene-mirna-interactions.tsv', sep='\t')
```
%% Cell type:code id: tags:
``` python
# Venn diagrams
```
%% Cell type:code id: tags:
``` python
from matplotlib_venn import venn2, venn3
```
%% Cell type:code id: tags:
``` python
venn2([set([1,2,3]), set([3,4,5])])
```
%% Cell type:code id: tags:
``` python
venn3([set([1,2,3]), set([2,3,4]), set([4,5,1])], ['label1', 'label2', 'label3'])
```
%% Cell type:code id: tags:
``` python
plt.title('Unique genes')
venn2([set(embryonic_stem_cell.gene_name), set(keratinocyte.gene_name)], ['Embryonic stem cell', 'Keratonicyte'])
```
%% Cell type:code id: tags:
``` python
plt.title('Unique microRNAs')
venn2([set(embryonic_stem_cell.mirna_name), set(keratinocyte.mirna_name)], ['Embryonic stem cell', 'Keratonicyte'])
```
%% Cell type:code id: tags:
``` python
```
......
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment