Fix Km plot

1ecd36d8 · Guillaume Poirier-Morency · 8c46be0d · 1ecd36d8
Commit 1ecd36d8 authored 5 years ago by Guillaume Poirier-Morency
--- a/Microtargetome analysis.ipynb
+++ b/Microtargetome analysis.ipynb
@@ -268,11 +268,21 @@
   "metadata": {},
   "outputs": [],
   "source": [
-    "pd.concat([predicted_Km, microtargetome_df.score], axis=1).plot.scatter('predicted_score', 'score')\n",
-    "plt.xscale('log')\n",
-    "plt.yscale('log')\n",
-    "plt.xlabel(r'$\\frac{[E][S]}{[ES]}$')\n",
-    "plt.ylabel('$K_m$')"
+    "pd.concat([predicted_Km, microtargetome_df.score], axis=1)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "fig, ax = plt.subplots()\n",
+    "ax.set_xscale('log')\n",
+    "ax.set_yscale('log')\n",
+    "pd.concat([predicted_Km, microtargetome_df.score], axis=1).plot.scatter('predicted_score', 'score', ax=ax)\n",
+    "ax.set_xlabel(r'$\\frac{[E][S]}{[ES]}$')\n",
+    "ax.set_ylabel('$K_m$')"
   ]
  },
  {
@@ -866,9 +876,9 @@
 ],
 "metadata": {
  "kernelspec": {
-   "display_name": "Python 3.6.2",
+   "display_name": "Python 3",
   "language": "python",
-   "name": "python3.6.2"
+   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
@@ -880,7 +890,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-   "version": "3.6.2"
+   "version": "3.7.3"
  },
  "toc": {
   "base_numbering": 1,

 %% Cell type:markdown id: tags:

 # BIM6065C: Microtargetome analysis

 We will use the basic of Pandas you have learned in previous sessions to analyze microtargetome data.

 %% Cell type:code id: tags:

 ``` python
 import pandas as pd
 from io import StringIO
 import requests
 import matplotlib.pyplot as plt
 import numpy as np
 %matplotlib inline
 ```

 %% Cell type:markdown id: tags:

 Seaborn has a really nice matplotlib theme to make figure readable.

 %% Cell type:code id: tags:

 ``` python
 import seaborn as sb
 sb.set('talk', 'ticks')
 ```

 %% Cell type:code id: tags:

 ``` python
 def read_microtargetome(f):
    return pd.read_csv(f, sep='\t', index_col=['gene_accession', 'target_accession', 'position', 'mirna_accession'])
 ```

 %% Cell type:markdown id: tags:

 # Querying data from miRBooking-scan

 [miRBooking-scan](https://major.iric.ca/~poirigui/mirbooking-scan/) is a Web platform that provides pre-computed microtargetomes.

 We currently have predictions for 29 cell lines that were retrieved from the ENCODE project.

 Every endpoint allow querying programatically with the `Accept: text/tab-separated-values` to retrieve a TSV format.

 ```
 https://major.iric.ca/~poirigui/mirbooking-scan/cell-lines/<cell_line_accession>
 ```

 %% Cell type:code id: tags:

 ``` python
 response = requests.get('https://major.iric.ca/~poirigui/mirbooking-scan/cell-lines/ENCSR809EFN', headers={'Accept': 'text/tab-separated-values'})
 microtargetome_df = read_microtargetome(StringIO(response.text))
 ```

 %% Cell type:markdown id: tags:

 # Hop on!

 Microtargetome have 4 level of hierarchy:

 - gene: genomic element that encodes various transcripts
 - target: RNA transcript
 - position: offset on a transcript which matches the end of the microRNA seed
 - microRNA: small RNA of about 22 nucleotides

 For each possible quadruplet, our model predicts an equilibrium concentration `quantity` according to its equilibrium constant $K_m$. In particular, our solution satisfies:

 $K_m = \frac{[E_m][S_{t,p}]}{[E_mS_{t,p}]}$

 Where $[E_m]$ is the free concentration of microRNA $m$, $[S_{t,p}]$ is the free concentration of target site $(t, p)$ and $[E_mS_{t,p}]$ is the duplex formed at that particular location.

 %% Cell type:code id: tags:

 ``` python
 microtargetome_df.head()
 ```

 %% Cell type:markdown id: tags:

 # Multi-index

 %% Cell type:code id: tags:

 ``` python
 microtargetome_df.loc['ENSG00000137876.9'].head()
 ```

 %% Cell type:markdown id: tags:

 To query an arbitrary level, you can use `xs`.

 %% Cell type:code id: tags:

 ``` python
 microtargetome_df.xs('MIMAT0000098', level='mirna_accession').head()
 ```

 %% Cell type:markdown id: tags:

 Multi-index can also be used for columns, which can be very handy for grouping samples.

 %% Cell type:code id: tags:

 ``` python
 microtargetome_df.T
 ```

 %% Cell type:code id: tags:

 ``` python
 sponged_concentration = microtargetome_df.groupby(level='mirna_accession').quantity.sum()
 total_concentration = microtargetome_df.groupby(level='mirna_accession').mirna_quantity.first()
 (sponged_concentration / total_concentration).head()
 ```

 %% Cell type:code id: tags:

 ``` python
 microtargetome_df.groupby(level='target_accession').agg({
    'gene_name': 'first',
    'target_name': 'first',
    'target_quantity': 'first',
    'quantity': 'sum'}).head()
 ```

 %% Cell type:markdown id: tags:

 # With this in mind, let's verify if our equilibrium really hold!

 %% Cell type:code id: tags:

 ``` python
 S0 = microtargetome_df.target_quantity
 E0 = microtargetome_df.mirna_quantity
 ```

 %% Cell type:code id: tags:

 ``` python
 S0.head()
 ```

 %% Cell type:markdown id: tags:

 The available microRNA concentration is given by conservation:

 %% Cell type:code id: tags:

 ``` python
 E = E0 - microtargetome_df.groupby(level='mirna_accession').quantity.sum()
 ```

 %% Cell type:markdown id: tags:

 Same for substrate, but position-wise:

 %% Cell type:code id: tags:

 ``` python
 S = S0 - microtargetome_df.quantity
 ```

 %% Cell type:markdown id: tags:

 Now, the complexes:

 %% Cell type:code id: tags:

 ``` python
 ES = microtargetome_df.quantity
 ```

 %% Cell type:code id: tags:

 ``` python
 predicted_Km = ((E * S) / ES).rename('predicted_score')
 predicted_Km.head()
 ```

 %% Cell type:code id: tags:

 ``` python
-pd.concat([predicted_Km, microtargetome_df.score], axis=1).plot.scatter('predicted_score', 'score')
-plt.xscale('log')
-plt.yscale('log')
-plt.xlabel(r'$\frac{[E][S]}{[ES]}$')
-plt.ylabel('$K_m$')
+pd.concat([predicted_Km, microtargetome_df.score], axis=1)
+```
+
+%% Cell type:code id: tags:
+
+``` python
+fig, ax = plt.subplots()
+ax.set_xscale('log')
+ax.set_yscale('log')
+pd.concat([predicted_Km, microtargetome_df.score], axis=1).plot.scatter('predicted_score', 'score', ax=ax)
+ax.set_xlabel(r'$\frac{[E][S]}{[ES]}$')
+ax.set_ylabel('$K_m$')
 ```

 %% Cell type:markdown id: tags:

 However, this is not exactly equal because the available substrate concentration is actually a bit more complicated to calculate since we have to account for overlapping sites.

 %% Cell type:markdown id: tags:

 # Jointure, merge and concatenation

 These three concepts are similar, but behave differently.

 - jointure are fast and work on indexes
 - merge are slow and work on columns
 - concat is similar to a jointure, but require matching indexes and works with many dataframes and series

 But first, let's automate the process of fetching data from miRBooking-scan so that we can study a couple of cell lines.

 %% Cell type:code id: tags:

 ``` python
 def fetch_from_mirbooking_scan(accession):
    response = requests.get(f'https://major.iric.ca/~poirigui/mirbooking-scan/cell-lines/{accession}', headers={'Accept': 'text/tab-separated-values'})
    return read_microtargetome(StringIO(response.text))
 ```

 %% Cell type:markdown id: tags:

 We start first with comparing undifferentiated embryonic stem cells agaist skin keratinocytes.

 %% Cell type:code id: tags:

 ``` python
 embryonic_stem_cell = fetch_from_mirbooking_scan('ENCSR820QMS')
 keratinocyte = fetch_from_mirbooking_scan('ENCSR193SZM')
 ```

 %% Cell type:code id: tags:

 ``` python
 pd.concat([embryonic_stem_cell, keratinocyte], keys=['ENCSR820QMS', 'ENCSR193SZM'], names=['sample'])
 ```

 %% Cell type:markdown id: tags:

 We can use a different axis for concatenation.

 %% Cell type:code id: tags:

 ``` python
 pd.concat([embryonic_stem_cell, keratinocyte], keys=['ENCSR820QMS', 'ENCSR193SZM'], names=['sample'], axis=1)
 ```

 %% Cell type:markdown id: tags:

 We obtain a similar result with a jointure, but it's not really appropriate here because we're joining the same kind of data.

 %% Cell type:code id: tags:

 ``` python
 embryonic_stem_cell.join(keratinocyte, lsuffix='stem_cell', rsuffix='_keratinocyte').head()
 ```

 %% Cell type:code id: tags:

 ``` python
 df = pd.concat([embryonic_stem_cell, keratinocyte], keys=['ENCSR820QMS', 'ENCSR193SZM'], names=['sample'], axis=1)
 df.head()
 ```

 %% Cell type:code id: tags:

 ``` python
 def log2fc(a, b):
    return np.log2(a.replace(0, np.nan) / b.replace(0, np.nan)).dropna()
 ```

 %% Cell type:code id: tags:

 ``` python
 log2fc(df.ENCSR193SZM.quantity, df.ENCSR820QMS.quantity).head()
 ```

 %% Cell type:code id: tags:

 ``` python
 log2fc(df.ENCSR193SZM.quantity, df.ENCSR820QMS.quantity).hist(bins=20)
 plt.xlabel('$\log_2$ fold-change')
 plt.ylabel('Frequency')
 ```

 %% Cell type:markdown id: tags:

 It seems that most interactions are up-regulated. Let's verify if that holds:

 %% Cell type:code id: tags:

 ``` python
 log2fc(df.ENCSR193SZM.quantity, df.ENCSR820QMS.quantity).hist(bins=50, cumulative=True, density=True)
 plt.axvline(0, c='r')
 ```

 %% Cell type:markdown id: tags:

 About 60% of our interactions seems to be down-regulated.

 %% Cell type:code id: tags:

 ``` python
 from scipy.stats import ttest_1samp
 ttest_1samp(log2fc(df.ENCSR193SZM.quantity, df.ENCSR820QMS.quantity), popmean=0)
 ```

 %% Cell type:markdown id: tags:

 Let's calculate the bound fraction of each target. We make the simple assumption that binding sites are independent so that:

 $\Pr[k > 0] = 1 - \Pr[k = 0] = 1 - \prod_{i=1}^n (1 - p_i)$

 For each position $p_i$ in our target.

 %% Cell type:code id: tags:

 ``` python
 def bound_fraction(df):
    return 1 - (1 - df.quantity / df.target_quantity).prod()
 ```

 %% Cell type:code id: tags:

 ``` python
 stem_cell_bf = df.ENCSR820QMS \
    .groupby(level=['gene_accession', 'target_accession', 'position']).agg({'quantity': 'sum', 'target_quantity': 'first'}) \
    .groupby(level=['gene_accession', 'target_accession']).apply(bound_fraction)
 ```

 %% Cell type:code id: tags:

 ``` python
 keratinocyte_bf = df.ENCSR193SZM \
    .groupby(level=['gene_accession', 'target_accession', 'position']).agg({'quantity': 'sum', 'target_quantity': 'first'}) \
    .groupby(level=['gene_accession', 'target_accession']).apply(bound_fraction)
 ```

 %% Cell type:code id: tags:

 ``` python
 log2fc(keratinocyte_bf, stem_cell_bf).hist(bins=20)
 ```

 %% Cell type:code id: tags:

 ``` python
 log2fc(keratinocyte_bf, stem_cell_bf).hist(bins=20, cumulative=True, density=True)
 plt.axvline(0, c='r')
 ```

 %% Cell type:code id: tags:

 ``` python
 ttest_1samp(log2fc(keratinocyte_bf, stem_cell_bf), 0)
 ```

 %% Cell type:markdown id: tags:

 Now, it seems a bit clearer that the majority of targets had an increase in relative activity, which is consistent for a differentiated tissue.

 Let's augment our analysis with gene and transcript-level information.

 %% Cell type:code id: tags:

 ``` python
 ontology = pd.concat([df.ENCSR193SZM.groupby(level=[0,1]).agg({'gene_name': 'first', 'target_name': 'first'}),
                      df.ENCSR820QMS.groupby(level=[0,1]).agg({'gene_name': 'first', 'target_name': 'first'})])
 ontology = ontology.groupby(ontology.index).first()
 ```

 %% Cell type:code id: tags:

 ``` python
 k_s_log2fc = pd.concat([log2fc(keratinocyte_bf, stem_cell_bf).rename('log2fc'), ontology], axis=1)
 k_s_log2fc.sort_values('log2fc').head()
 ```

 %% Cell type:code id: tags:

 ``` python
 sorted_k_s_index = k_s_log2fc.log2fc.abs().sort_values(ascending=False).index
 k_s_log2fc.loc[sorted_k_s_index].head()
 ```

 %% Cell type:code id: tags:

 ``` python
 sorted_k_s_log2fc = k_s_log2fc.loc[sorted_k_s_index]
 ```

 %% Cell type:code id: tags:

 ``` python
 sorted_k_s_log2fc.head(100)
 ```

 %% Cell type:code id: tags:

 ``` python
 sorted_k_s_log2fc.head()
 ```

 %% Cell type:code id: tags:

 ``` python
 print('\n'.join(sorted_k_s_log2fc.gene_name.drop_duplicates()))
 ```

 %% Cell type:markdown id: tags:

 Let's paste this in [g:Profiler](https://biit.cs.ut.ee/gprofiler/gost), a meta-enrichment analysis tool.

 %% Cell type:markdown id: tags:

 # Cytoscape visualization

 Microtargetomes are very, very large and we might prefer to summarize interactions prior to performing any visualization.

 %% Cell type:code id: tags:

 ``` python
 embryonic_stem_cell.describe()
 ```

 %% Cell type:code id: tags:

 ``` python
 gene_quantity = keratinocyte \
    .groupby(level=['gene_accession', 'target_accession']).target_quantity.first() \
    .groupby(level='gene_accession').sum().rename('gene_quantity')
 gene_quantity.head()
 ```

 %% Cell type:markdown id: tags:

 We will also be interested in the gene fraction bound by at least one microRNA as a relative measure of how strong regulation is.

 %% Cell type:code id: tags:

 ``` python
 gene_mirna_bound_fraction = keratinocyte.groupby(level=['gene_accession', 'mirna_accession']).apply(bound_fraction).rename('bound_fraction')
 ```

 %% Cell type:code id: tags:

 ``` python
 gene_mirna_interactions = keratinocyte.groupby(level=['gene_accession', 'mirna_accession']).agg({
    'gene_name': 'first',
    'mirna_name': 'first',
    'mirna_quantity': 'first',
    'quantity': 'sum'})
 gene_mirna_interactions = gene_mirna_interactions.join(gene_quantity)
 gene_mirna_interactions = gene_mirna_interactions.join(gene_mirna_bound_fraction)
 ```

 %% Cell type:markdown id: tags:

 Let's narrow down our search to the genes involved in the melanosome.

 %% Cell type:code id: tags:

 ``` python
 melanosome_genes = '''
 SYPL1
 ATP6V0A1
 CTNS
 RAB27B
 CAPG
 HSPA5
 DTNBP1
 ATP1B3
 RAB27A
 TFRC
 RAB7A
 TYR
 DCT
 HSP90AA1
 ERP29
 GANAB
 HSP90AB1
 HPS4
 SYNGR1
 DNAJC5
 AHCY
 GPR143
 OCA2
 RAB2A
 ANKRD27
 SLC1A5
 TYRP1
 RAB5C
 YWHAE
 TMEM33
 HSPA8
 RAB5B
 RAB35
 CCT4
 SLC1A4
 RAB29
 SLC2A1
 PRDX1
 CTSD
 MREG
 RAB32
 GNA13
 MLANA
 ANXA11
 RAB9A
 RAB38
 RAB17
 CANX
 CALU
 RAN
 SERPINF1
 MYH11
 CD63
 GPNMB
 RAC1
 FLOT1
 MYO7A
 SYTL2
 GGH
 SDCBP
 GCHFR
 RAB1A
 SGSM2
 CLTC
 SYTL1
 PDIA6
 RAB5A
 ATP6V1B2
 STOM
 ITGB1
 PDIA4
 MMP14
 NCSTN
 ATP1A1
 RPN1
 SLC45A2
 CTSB
 YWHAZ
 TPP1
 HSP90B1
 PPIB
 STX3
 YWHAB
 PDIA3
 SLC3A2
 FASN
 MYRIP
 PDCD6IP
 TMED10
 BSG
 CNP
 TH
 ANXA2
 P4HB
 PMEL
 LAMP1
 NAP1L1
 TRPV2
 SLC24A5
 ANXA6
 SND1
 MYO5A
 ATP6V1G2
 ITGB3
 SEC22B
 '''.split()

 specific_gene_mirna_interactions = gene_mirna_interactions[gene_mirna_interactions.gene_name.isin(melanosome_genes)]
 ```

 %% Cell type:markdown id: tags:

 Our dataframe is ready! We have:

 - edges with scores (quantity or bound fraction)
 - nodes with unique identifiers (accessions), labels (names) and metadata (quantity)

 Now, let's export this in a TSV so that we can import it in Cytoscape.

 %% Cell type:code id: tags:

 ``` python
 specific_gene_mirna_interactions.to_csv('keratinocyte-melanosome-gene-mirna-interactions.tsv', sep='\t')
 ```

 %% Cell type:code id: tags:

 ``` python
 # Venn diagrams
 ```

 %% Cell type:code id: tags:

 ``` python
 from matplotlib_venn import venn2, venn3
 ```

 %% Cell type:code id: tags:

 ``` python
 venn2([set([1,2,3]), set([3,4,5])])
 ```

 %% Cell type:code id: tags:

 ``` python
 venn3([set([1,2,3]), set([2,3,4]), set([4,5,1])], ['label1', 'label2', 'label3'])
 ```

 %% Cell type:code id: tags:

 ``` python
 plt.title('Unique genes')
 venn2([set(embryonic_stem_cell.gene_name), set(keratinocyte.gene_name)], ['Embryonic stem cell', 'Keratonicyte'])
 ```

 %% Cell type:code id: tags:

 ``` python
 plt.title('Unique microRNAs')
 venn2([set(embryonic_stem_cell.mirna_name), set(keratinocyte.mirna_name)], ['Embryonic stem cell', 'Keratonicyte'])
 ```

 %% Cell type:code id: tags:

 ``` python
 ```