Add a partial solution for the assignment

Commit a1cac2cf authored 5 years ago by Guillaume Poirier-Morency
--- a/Microtargetome analysis.ipynb
+++ b/Microtargetome analysis.ipynb
@@ -77,6 +77,15 @@
    "microtargetome_df = read_microtargetome(StringIO(response.text))"
   ]
  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "(microtargetome_df.groupby('gene_accession').first().sort_values('score').gene_name.to_csv('gene-for-enrichment.tsv', index=False))"
+   ]
+  },
  {
   "cell_type": "markdown",
   "metadata": {},
@@ -106,6 +115,15 @@
    "microtargetome_df.head()"
   ]
  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "microtargetome_df.loc[:,:]"
+   ]
+  },
  {
   "cell_type": "markdown",
   "metadata": {},
@@ -633,6 +651,13 @@
    "stem_cell_bf.sort_values(ascending=False).head()"
   ]
  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
  {
   "cell_type": "code",
   "execution_count": null,
@@ -1087,7 +1112,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-   "version": "3.7.3"
+   "version": "3.7.4"
  },
  "toc": {
   "base_numbering": 1,

 %% Cell type:markdown id: tags:

 # BIM6065C: Microtargetome analysis

 We will use the basics of Pandas that you learned in previous sessions to analyze microtargetome data.

 %% Cell type:code id: tags:

 ``` python
 import pandas as pd
 from io import StringIO
 import requests
 import matplotlib.pyplot as plt
 import numpy as np
 %matplotlib inline
 ```

 %% Cell type:markdown id: tags:

 Seaborn has a really nice matplotlib theme that makes figures more readable.

 %% Cell type:code id: tags:

 ``` python
 import seaborn as sb
 sb.set('talk', 'ticks')
 ```

 %% Cell type:code id: tags:

 ``` python
 def read_microtargetome(f):
    return pd.read_csv(f, sep='\t', index_col=['gene_accession', 'target_accession', 'position', 'mirna_accession'])
 ```

 %% Cell type:markdown id: tags:

 # Querying data from miRBooking-scan

 [miRBooking-scan](https://major.iric.ca/~poirigui/mirbooking-scan/) is a Web platform that provides pre-computed microtargetomes.

 We currently have predictions for 29 cell lines that were retrieved from the ENCODE project.

 Every endpoint can be queried programmatically by sending the `Accept: text/tab-separated-values` header to retrieve the data in TSV format.

 ```
 https://major.iric.ca/~poirigui/mirbooking-scan/cell-lines/<cell_line_accession>
 ```

 %% Cell type:code id: tags:

 ``` python
 response = requests.get('https://major.iric.ca/~poirigui/mirbooking-scan/cell-lines/ENCSR809EFN', headers={'Accept': 'text/tab-separated-values'})
 microtargetome_df = read_microtargetome(StringIO(response.text))
 ```

+%% Cell type:code id: tags:
+
+``` python
+(microtargetome_df.groupby('gene_accession').first().sort_values('score').gene_name.to_csv('gene-for-enrichment.tsv', index=False))
+```
+
 %% Cell type:markdown id: tags:

 # Hop on!

 A microtargetome has 4 levels of hierarchy:

 - gene: genomic element that encodes various transcripts
 - target: RNA transcript
 - position: offset on a transcript which matches the end of the microRNA seed
 - microRNA: small RNA of about 22 nucleotides

 For each possible quadruplet, our model predicts an equilibrium concentration `quantity` according to its equilibrium constant $K_m$. In particular, our solution satisfies:

 $K_m = \frac{[E_m][S_{t,p}]}{[E_mS_{t,p}]}$

 Where $[E_m]$ is the free concentration of microRNA $m$, $[S_{t,p}]$ is the free concentration of target site $(t, p)$ and $[E_mS_{t,p}]$ is the duplex formed at that particular location.

 %% Cell type:code id: tags:

 ``` python
 microtargetome_df.head()
 ```

+%% Cell type:code id: tags:
+
+``` python
+microtargetome_df.loc[:,:]
+```
+
 %% Cell type:markdown id: tags:

 # NumPy

 You can use [NumPy](https://numpy.org/) routines directly on your dataframes and series.

 %% Cell type:code id: tags:

 ``` python
 np.square(microtargetome_df.quantity).head()
 ```

 %% Cell type:code id: tags:

 ``` python
 np.sqrt(microtargetome_df.quantity).head()
 ```

 %% Cell type:markdown id: tags:

 # Multi-index

 %% Cell type:code id: tags:

 ``` python
 microtargetome_df.loc['ENSG00000125445.10'].head()
 ```

 %% Cell type:markdown id: tags:

 To query an arbitrary level, you can use `xs`.

 %% Cell type:code id: tags:

 ``` python
 microtargetome_df.xs('MIMAT0000098', level='mirna_accession').head()
 ```

 %% Cell type:markdown id: tags:

 A multi-index can also be used for columns, which is very handy for grouping samples.
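
As a minimal sketch of a multi-index on columns, the toy frames below stack two samples side by side. The sample names `sample_a` and `sample_b`, the interaction labels and all values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: two samples measured on the same three interactions.
index = pd.Index(['i1', 'i2', 'i3'], name='interaction')
sample_a = pd.DataFrame({'quantity': [1.0, 2.0, 3.0], 'score': [10.0, 20.0, 30.0]}, index=index)
sample_b = pd.DataFrame({'quantity': [4.0, 5.0, 6.0], 'score': [10.0, 20.0, 30.0]}, index=index)

# Concatenating along the columns with `keys` builds a two-level column index.
toy = pd.concat([sample_a, sample_b], keys=['sample_a', 'sample_b'],
                names=['sample', 'col'], axis=1)

# Selecting on the outer level returns the columns of a single sample.
only_a = toy['sample_a']

# `xs` on the column axis pulls one measurement out of every sample.
quantities = toy.xs('quantity', level='col', axis='columns')
```

Selecting on the outer level keeps each sample's columns together, while `xs` on the inner level gives one column per sample, which is convenient for comparisons.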

 %% Cell type:code id: tags:

 ``` python
 microtargetome_df.head()
 ```

 %% Cell type:markdown id: tags:

 # Grouping and aggregating

 One of the most useful Pandas operations is `groupby`, as it allows you to analyse parts of your data separately.

 Aggregators work on individual groups resulting from a `groupby` operation. The most frequent are:

 - `first`
 - `mean`
 - `median`
 - `sum`
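
The aggregators listed above can be sketched on a toy frame. The accessions `t1` and `t2`, the positions and the values are hypothetical and only mimic the shape of a microtargetome.

```python
import pandas as pd

# Hypothetical toy data: three binding sites spread over two targets.
toy = pd.DataFrame(
    {'target_quantity': [10.0, 10.0, 4.0], 'quantity': [1.0, 3.0, 2.0]},
    index=pd.MultiIndex.from_tuples(
        [('t1', 5), ('t1', 9), ('t2', 7)], names=['target_accession', 'position']),
)

grouped = toy.groupby(level='target_accession')

# `first` keeps one representative value per group, which suits columns
# that are constant within a group (here the total quantity of the target).
total = grouped.target_quantity.first()

# `sum`, `mean` and `median` reduce the values that vary within a group.
bound = grouped.quantity.sum()
mean_bound = grouped.quantity.mean()
median_bound = grouped.quantity.median()
```

Using `first` rather than `sum` on `target_quantity` matters here: the value is repeated on every row of a target, so summing it would count it once per site.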

 %% Cell type:code id: tags:

 ``` python
 microtargetome_df.groupby(level='target_accession').quantity.mean().sort_values(ascending=False).head()
 ```

 %% Cell type:code id: tags:

 ``` python
 sponged_concentration = microtargetome_df.groupby(level='mirna_accession').quantity.sum()
 total_concentration = microtargetome_df.groupby(level='mirna_accession').mirna_quantity.first()
 (sponged_concentration / total_concentration).head()
 ```

 %% Cell type:code id: tags:

 ``` python
 microtargetome_df.groupby(level='target_accession').agg({
    'gene_name': 'first',
    'target_name': 'first',
    'target_quantity': 'first',
    'quantity': 'sum'}).head()
 ```

 %% Cell type:markdown id: tags:

 # With this in mind, let's verify if our equilibrium really holds!

 %% Cell type:code id: tags:

 ``` python
 S0 = microtargetome_df.target_quantity
 E0 = microtargetome_df.mirna_quantity
 ```

 %% Cell type:code id: tags:

 ``` python
 S0.head()
 ```

 %% Cell type:markdown id: tags:

 The available microRNA concentration is given by conservation:

 %% Cell type:code id: tags:

 ``` python
 E = E0 - microtargetome_df.groupby(level='mirna_accession').quantity.sum()
 ```

 %% Cell type:code id: tags:

 ``` python
 E.head()
 ```

 %% Cell type:markdown id: tags:

 Same for substrate, but position-wise:

 %% Cell type:code id: tags:

 ``` python
 S = S0 - microtargetome_df.quantity
 ```

 %% Cell type:code id: tags:

 ``` python
 S.head()
 ```

 %% Cell type:markdown id: tags:

 Now, the complexes:

 %% Cell type:code id: tags:

 ``` python
 ES = microtargetome_df.quantity
 ```

 %% Cell type:code id: tags:

 ``` python
 predicted_Km = ((E * S) / ES).rename('predicted_score')
 predicted_Km.head()
 ```

 %% Cell type:code id: tags:

 ``` python
 pd.concat([predicted_Km, microtargetome_df.score], axis=1).head()
 ```

 %% Cell type:code id: tags:

 ``` python
 fig, ax = plt.subplots()
 ax.set_xscale('log')
 ax.set_yscale('log')
 pd.concat([predicted_Km, microtargetome_df.score], axis=1).plot.scatter('predicted_score', 'score', ax=ax)
 ax.set_xlabel(r'$\frac{[E][S]}{[ES]}$')
 ax.set_ylabel('$K_m$')
 ```

 %% Cell type:markdown id: tags:

 However, the two are not exactly equal: the available substrate concentration is more complicated to calculate than shown here, because overlapping sites have to be accounted for.

 %% Cell type:markdown id: tags:

 # Join, merge and concatenation

 These three concepts are similar, but behave differently.

 - joins are fast and work on indexes
 - merges are slower and work on columns
 - concat is similar to a join, but requires matching indexes and works with many dataframes and series
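
The three operations can be compared on a pair of toy frames. This is only a sketch: the `gene` index, the column names and the values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frames sharing a `gene` index.
left = pd.DataFrame({'quantity': [1.0, 2.0]}, index=pd.Index(['g1', 'g2'], name='gene'))
right = pd.DataFrame({'name': ['A', 'B']}, index=pd.Index(['g1', 'g2'], name='gene'))

# join: aligns the two frames on their index.
joined = left.join(right)

# merge: aligns on columns, so the index is first turned into a column.
merged = left.reset_index().merge(right.reset_index(), on='gene')

# concat: aligns on the index too, and accepts any number of frames or series.
concatenated = pd.concat([left, right, left.quantity.rename('quantity_again')], axis=1)
```

All three produce one row per gene here; they differ in what they align on and in how many objects they can combine at once.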

 But first, let's automate the process of fetching data from miRBooking-scan so that we can study a couple of cell lines.

 %% Cell type:code id: tags:

 ``` python
 pd.concat([microtargetome_df.loc['ENSG00000165672.6', 'ENST00000298510.3', 1065 ],
          microtargetome_df.loc['ENSG00000084623.11','ENST00000373586.1', 13]], keys=['ENSG00000165672.6', 'ENSG00000084623.11'], axis=1)
 ```

 %% Cell type:markdown id: tags:

 # Compare embryonic stem cells with keratinocytes

 %% Cell type:code id: tags:

 ``` python
 def fetch_from_mirbooking_scan(accession):
    response = requests.get(f'https://major.iric.ca/~poirigui/mirbooking-scan/cell-lines/{accession}', headers={'Accept': 'text/tab-separated-values'})
    return read_microtargetome(StringIO(response.text))
 ```

 %% Cell type:markdown id: tags:

 We start by comparing undifferentiated embryonic stem cells against skin keratinocytes.

 %% Cell type:code id: tags:

 ``` python
 embryonic_stem_cell = fetch_from_mirbooking_scan('ENCSR820QMS')
 keratinocyte = fetch_from_mirbooking_scan('ENCSR193SZM')
 ```

 %% Cell type:code id: tags:

 ``` python
 pd.concat([embryonic_stem_cell.describe(), keratinocyte.describe()], keys=['ENCSR820QMS', 'ENCSR193SZM'], axis=1)
 ```

 %% Cell type:code id: tags:

 ``` python
 keratinocyte['sample_name'] = 'Keratinocyte'
 embryonic_stem_cell['sample_name'] = 'Embryonic stem cell'
 ```

 %% Cell type:code id: tags:

 ``` python
 pd.concat([embryonic_stem_cell, keratinocyte], keys=['ENCSR820QMS', 'ENCSR193SZM'], names=['sample']).head()
 ```

 %% Cell type:markdown id: tags:

 We can use a different axis for concatenation.

 %% Cell type:code id: tags:

 ``` python
 compared_cells = pd.concat([embryonic_stem_cell, keratinocyte], keys=['ENCSR820QMS', 'ENCSR193SZM'], names=['sample', 'col'], axis=1, join='inner')
 ```

 %% Cell type:code id: tags:

 ``` python
 compared_cells.head()
 ```

 %% Cell type:code id: tags:

 ``` python
 compared_cells.xs('quantity', level=1, axis='columns').head()
 ```

 %% Cell type:markdown id: tags:

 We obtain a similar result with a join, but it's not really appropriate here because we're joining the same kind of data.

 %% Cell type:code id: tags:

 ``` python
 embryonic_stem_cell.join(keratinocyte, lsuffix='_stem_cell', rsuffix='_keratinocyte').head()
 ```

 %% Cell type:code id: tags:

 ``` python
 df = pd.concat([embryonic_stem_cell, keratinocyte], keys=['ENCSR820QMS', 'ENCSR193SZM'], names=['sample'], axis=1, join='inner')
 df.head()
 ```

 %% Cell type:code id: tags:

 ``` python
 def log2fc(a, b):
    return np.log2(a.replace(0, np.nan) / b.replace(0, np.nan)).dropna()
 ```

 %% Cell type:code id: tags:

 ``` python
 log2fc(df.ENCSR193SZM.quantity, df.ENCSR820QMS.quantity).head()
 ```

 %% Cell type:code id: tags:

 ``` python
 log2fc(df.ENCSR193SZM.quantity, df.ENCSR820QMS.quantity).hist(bins=20)
 plt.xlabel(r'$\log_2$ fold-change')
 plt.ylabel('Frequency')
 ```

 %% Cell type:markdown id: tags:

 It seems that most interactions are up-regulated. Let's verify if that holds:

 %% Cell type:code id: tags:

 ``` python
 log2fc(df.ENCSR193SZM.quantity, df.ENCSR820QMS.quantity).hist(bins=50, cumulative=True, density=True)
 plt.axvline(0, c='r')
 ```

 %% Cell type:markdown id: tags:

 About 60% of our interactions seem to be down-regulated.

 %% Cell type:code id: tags:

 ``` python
 from scipy.stats import ttest_1samp
 ttest_1samp(log2fc(df.ENCSR193SZM.quantity, df.ENCSR820QMS.quantity), popmean=0)
 ```

 %% Cell type:markdown id: tags:

 Let's calculate the bound fraction of each target. We make the simple assumption that binding sites are independent so that:

 $\Pr[k > 0] = 1 - \Pr[k = 0] = 1 - \prod_{i=1}^n (1 - p_i)$

 where $p_i$ is the probability that position $i$ of the target is bound and $n$ is the number of positions on the target.
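
To make the formula concrete, here is a small numeric sketch. The three occupancies are made-up values; each $p_i$ is taken to be the fraction of the target bound at position $i$, which is how the following cells compute it (`quantity / target_quantity`).

```python
import numpy as np

# Hypothetical occupancies p_i of three positions on one target.
p = np.array([0.5, 0.2, 0.1])

# Probability that no position is bound, assuming independent sites.
unbound = np.prod(1 - p)

# Probability that at least one position is bound.
bound = 1 - unbound
```

With these values the unbound probability is 0.5 × 0.8 × 0.9 = 0.36, so the bound fraction is 0.64, which is less than the sum of the individual occupancies (0.8).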

 %% Cell type:code id: tags:

 ``` python
 target = df.ENCSR820QMS.loc['ENSG00000004059.10', 'ENST00000000233.9']
 ```

 %% Cell type:code id: tags:

 ``` python
 1 - (1  - (target.quantity / target.target_quantity)).prod()
 ```

 %% Cell type:code id: tags:

 ``` python
 def bound_fraction(df):
    return 1 - (1 - df.quantity / df.target_quantity).prod()
 ```

 %% Cell type:code id: tags:

 ``` python
 stem_cell_bf = df.ENCSR820QMS \
    .groupby(level=['gene_accession', 'target_accession', 'position']).agg({'quantity': 'sum', 'target_quantity': 'first'}) \
    .groupby(level=['gene_accession', 'target_accession']).apply(bound_fraction)
 ```

 %% Cell type:code id: tags:

 ``` python
 stem_cell_bf.sort_values(ascending=False).head()
 ```

 %% Cell type:code id: tags:

 ``` python
+```
+
+%% Cell type:code id: tags:
+
+``` python
 keratinocyte_bf = df.ENCSR193SZM \
    .groupby(level=['gene_accession', 'target_accession', 'position']).agg({'quantity': 'sum', 'target_quantity': 'first'}) \
    .groupby(level=['gene_accession', 'target_accession']).apply(bound_fraction)
 ```

 %% Cell type:code id: tags:

 ``` python
 log2fc(keratinocyte_bf, stem_cell_bf).hist(bins=20)
 ```

 %% Cell type:code id: tags:

 ``` python
 ttest_1samp(log2fc(keratinocyte_bf, stem_cell_bf), 0)
 ```

 %% Cell type:markdown id: tags:

 Now, it seems a bit clearer that the majority of targets had an increase in relative activity, which is consistent with a differentiated tissue.

 Let's augment our analysis with gene and transcript-level information.

 %% Cell type:code id: tags:

 ``` python
 ontology = pd.concat([df.ENCSR193SZM.groupby(level=[0,1]).agg({'gene_name': 'first', 'target_name': 'first'}),
                      df.ENCSR820QMS.groupby(level=[0,1]).agg({'gene_name': 'first', 'target_name': 'first'})])
 ontology = ontology.groupby(level=['gene_accession', 'target_accession']).first()
 ```

 %% Cell type:code id: tags:

 ``` python
 ontology.head()
 ```

 %% Cell type:code id: tags:

 ``` python
 log2fc(keratinocyte_bf, stem_cell_bf).rename('log2fc').to_frame().join(ontology).head()
 ```

 %% Cell type:code id: tags:

 ``` python
 k_s_log2fc = log2fc(keratinocyte_bf, stem_cell_bf).rename('log2fc').to_frame().join(ontology)
 k_s_log2fc.sort_values('log2fc').head()
 ```

 %% Cell type:code id: tags:

 ``` python
 sorted_k_s_index = k_s_log2fc.log2fc.abs().sort_values(ascending=False).index
 k_s_log2fc.loc[sorted_k_s_index].head()
 ```

 %% Cell type:code id: tags:

 ``` python
 sorted_k_s_log2fc = k_s_log2fc.loc[sorted_k_s_index]
 ```

 %% Cell type:code id: tags:

 ``` python
 sorted_k_s_log2fc.head(10)
 ```

 %% Cell type:code id: tags:

 ``` python
 print('\n'.join(sorted_k_s_log2fc.gene_name.drop_duplicates()))
 ```

 %% Cell type:markdown id: tags:

 Let's paste this in [g:Profiler](https://biit.cs.ut.ee/gprofiler/gost), a meta-enrichment analysis tool.

 %% Cell type:markdown id: tags:

 # Cytoscape visualization

 Microtargetomes are very, very large and we might prefer to summarize interactions prior to performing any visualization.

 %% Cell type:code id: tags:

 ``` python
 embryonic_stem_cell.describe()
 ```

 %% Cell type:code id: tags:

 ``` python
 def expected_interactions(df):
    return (df.quantity / df.target_quantity).sum()
 ```

 %% Cell type:code id: tags:

 ``` python
 keratinocyte.groupby(level=['gene_accession', 'mirna_accession']).apply(expected_interactions).head()
 ```

 %% Cell type:code id: tags:

 ``` python
 gene_quantity = keratinocyte \
    .groupby(level=['gene_accession', 'target_accession']).target_quantity.first() \
    .groupby(level='gene_accession').sum().rename('gene_quantity')
 gene_quantity.head()
 ```

 %% Cell type:markdown id: tags:

 We will also be interested in the gene fraction bound by at least one microRNA as a relative measure of how strong regulation is.

 %% Cell type:code id: tags:

 ``` python
 gene_mirna_bound_fraction = keratinocyte.groupby(level=['gene_accession', 'mirna_accession']).apply(bound_fraction).rename('bound_fraction')
 ```

 %% Cell type:code id: tags:

 ``` python
 gene_mirna_interactions = keratinocyte.groupby(level=['gene_accession', 'mirna_accession']).agg({
    'gene_name': 'first',
    'mirna_name': 'first',
    'mirna_quantity': 'first',
    'quantity': 'sum'})
 gene_mirna_interactions = gene_mirna_interactions.join(gene_quantity)
 gene_mirna_interactions = gene_mirna_interactions.join(gene_mirna_bound_fraction)
 ```

 %% Cell type:markdown id: tags:

 Let's narrow down our search to the genes involved in the melanosome.

 %% Cell type:code id: tags:

 ``` python
 melanosome_genes = '''
 SYPL1
 ATP6V0A1
 CTNS
 RAB27B
 CAPG
 HSPA5
 DTNBP1
 ATP1B3
 RAB27A
 TFRC
 RAB7A
 TYR
 DCT
 HSP90AA1
 ERP29
 GANAB
 HSP90AB1
 HPS4
 SYNGR1
 DNAJC5
 AHCY
 GPR143
 OCA2
 RAB2A
 ANKRD27
 SLC1A5
 TYRP1
 RAB5C
 YWHAE
 TMEM33
 HSPA8
 RAB5B
 RAB35
 CCT4
 SLC1A4
 RAB29
 SLC2A1
 PRDX1
 CTSD
 MREG
 RAB32
 GNA13
 MLANA
 ANXA11
 RAB9A
 RAB38
 RAB17
 CANX
 CALU
 RAN
 SERPINF1
 MYH11
 CD63
 GPNMB
 RAC1
 FLOT1
 MYO7A
 SYTL2
 GGH
 SDCBP
 GCHFR
 RAB1A
 SGSM2
 CLTC
 SYTL1
 PDIA6
 RAB5A
 ATP6V1B2
 STOM
 ITGB1
 PDIA4
 MMP14
 NCSTN
 ATP1A1
 RPN1
 SLC45A2
 CTSB
 YWHAZ
 TPP1
 HSP90B1
 PPIB
 STX3
 YWHAB
 PDIA3
 SLC3A2
 FASN
 MYRIP
 PDCD6IP
 TMED10
 BSG
 CNP
 TH
 ANXA2
 P4HB
 PMEL
 LAMP1
 NAP1L1
 TRPV2
 SLC24A5
 ANXA6
 SND1
 MYO5A
 ATP6V1G2
 ITGB3
 SEC22B
 '''.split()

 specific_gene_mirna_interactions = gene_mirna_interactions[gene_mirna_interactions.gene_name.isin(melanosome_genes)]
 ```

 %% Cell type:code id: tags:

 ``` python
 gene_mirna_interactions[gene_mirna_interactions.gene_name.isin(melanosome_genes)].head()
 ```

 %% Cell type:markdown id: tags:

 Our dataframe is ready! We have:

 - edges with scores (quantity or bound fraction)
 - nodes with unique identifiers (accessions), labels (names) and metadata (quantity)

 Now, let's export this to a TSV file so that we can import it into Cytoscape.

 %% Cell type:code id: tags:

 ``` python
 specific_gene_mirna_interactions.head()
 ```

 %% Cell type:code id: tags:

 ``` python
 specific_gene_mirna_interactions.to_csv('keratinocyte-melanosome-gene-mirna-interactions.tsv', sep='\t')
 ```

 %% Cell type:code id: tags:

 ``` python
 from IPython.display import Image
 Image('keratinocyte-melanosome-gene-mirna-interactions.tsv.png')
 ```

 %% Cell type:markdown id: tags:

 # Venn diagrams

 %% Cell type:code id: tags:

 ``` python
 !pip3 install --user matplotlib_venn
 ```

 %% Cell type:code id: tags:

 ``` python
 from matplotlib_venn import venn2, venn3
 ```

 %% Cell type:code id: tags:

 ``` python
 venn2([set([1,2,3]), set([3,4,5])])
 ```

 %% Cell type:code id: tags:

 ``` python
 venn3([set([1,2,3]), set([2,3,4]), set([4,5,1])], ['label1', 'label2', 'label3'])
 ```

 %% Cell type:code id: tags:

 ``` python
 plt.title('Unique genes')
 venn2([set(embryonic_stem_cell.gene_name), set(keratinocyte.gene_name)], ['Embryonic stem cell', 'Keratinocyte'])
 ```

 %% Cell type:code id: tags:

 ``` python
 plt.title('Unique microRNAs')
 venn2([set(embryonic_stem_cell.mirna_name), set(keratinocyte.mirna_name)], ['Embryonic stem cell', 'Keratinocyte'])
 ```

--- a/ift6065c-microtargetome-analysis-solution.ipynb
+++ b/ift6065c-microtargetome-analysis-solution.ipynb
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This is a partial solution to the assignment that covers the most commonly missed points and gotchas."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "import requests\n",
+    "from io import StringIO\n",
+    "import numpy as np\n",
+    "from matplotlib_venn import venn2\n",
+    "import matplotlib.pyplot as plt"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def read_microtargetome(f):\n",
+    "    return pd.read_csv(f, sep='\\t', index_col=['gene_accession', 'target_accession', 'position', 'mirna_accession'])\n",
+    "\n",
+    "def fetch_from_mirbooking_scan(accession):\n",
+    "    response = requests.get(f'https://major.iric.ca/~poirigui/mirbooking-scan/cell-lines/{accession}', headers={'Accept': 'text/tab-separated-values'})\n",
+    "    return read_microtargetome(StringIO(response.text))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "A = fetch_from_mirbooking_scan('ENCSR172GTQ')\n",
+    "B = fetch_from_mirbooking_scan('ENCSR066FYC')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Efficiency"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Efficiency measures the degree of specificity of an interaction.\n",
+    "\n",
+    "Highly specific interactions can be completely inefficient if $K_m$ is high, and high-affinity interactions can end up being completely inefficient if they face strong competitors or bind many substrates.\n",
+    "\n",
+    "If the enzyme is exclusive to its substrate, the efficiency will be very high. If the enzyme is shared among many substrates, its free concentration will be lower and the concentration of the formed complex $[ES]$ will be lower as well. Conversely, if many enzymes are competing for a given substrate, the free concentration of the substrate will be lower and so will the concentration of the formed complexes.\n",
+    "\n",
+    "From a network perspective, it summarizes the local density surrounding an edge."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# It is best to introduce a function for this purpose\n",
+    "\n",
+    "def efficiency(df):\n",
+    "    E0 = df.mirna_quantity\n",
+    "    S0 = df.target_quantity\n",
+    "    Km = df.score\n",
+    "    ES = df.quantity\n",
+    "    Z = E0 + S0 + Km\n",
+    "    ES_max = (Z - np.sqrt(Z**2 - 4 * E0 * S0)) / 2\n",
+    "    return ES / ES_max"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "efficiency(A).sort_values(ascending=False).head(10)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "efficiency(B).sort_values(ascending=False).head(10)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Common interactions"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "plt.title('Comparison of microRNA-gene interactions present in thyroid gland\\nfor samples from 37- and 54-year-old patients')\n",
+    "venn2([set(A.index), set(B.index)], ['Thyroid gland (37 year)', 'Thyroid gland (54 year)'])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Sponged fraction\n",
+    "\n",
+    "It's important here not to combine sponged fractions from different microRNAs because they are not compatible. The solution is to use a central tendency measure such as a mean or a median."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "(A.quantity / A.mirna_quantity).groupby(['gene_accession', 'mirna_accession']).sum().groupby(['gene_accession']).mean().sort_values(ascending=False).head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "(B.quantity / B.mirna_quantity).groupby(['gene_accession', 'mirna_accession']).sum().groupby(['gene_accession']).mean().sort_values(ascending=False).head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The gene TG (ENSG00000042832.11) codes for the thyroglobulin protein and sponges a substantial fraction of the microRNAs it interacts with."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Fold-changes"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def expected_occupants(df):\n",
+    "    return (df.quantity / df.target_quantity).sum()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def bound_fraction(df):\n",
+    "    return 1 - (1 - df.quantity / df.target_quantity).prod()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "gene_log2fc = log2fc(A.groupby(['gene_accession']).apply(expected_occupants), B.groupby(['gene_accession']).apply(expected_occupants))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "gene_log2fc.abs().sort_values(ascending=False).head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We can also use the other metrics to construct our fold-changes."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "gene_efficiency_log2fc = log2fc(efficiency(A).groupby('gene_accession').median(), efficiency(B).groupby('gene_accession').median())\n",
+    "gene_efficiency_log2fc.abs().sort_values(ascending=False).head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "gene_bf_log2fc = log2fc(A.groupby('gene_accession').apply(bound_fraction), B.groupby('gene_accession').apply(bound_fraction))\n",
+    "gene_bf_log2fc.abs().sort_values(ascending=False).head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In this case, the bound fraction fold-changes are very similar to those of the expected number of occupants."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "plt.scatter(gene_log2fc, gene_bf_log2fc)\n",
+    "plt.xlabel(r'Number of occupants $\\log_2$ fold-changes')\n",
+    "plt.ylabel(r'Bound fraction $\\log_2$ fold-changes')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To make the point of question 2 about efficiency clear: changes in interaction efficiency are not reflected in changes of substrate binding."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "plt.scatter(gene_log2fc, gene_efficiency_log2fc)\n",
+    "plt.xlabel(r'Number of occupants $\\log_2$ fold-changes')\n",
+    "plt.ylabel(r'Efficiency $\\log_2$ fold-changes')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Detailed fold-changes\n",
+    "\n",
+    "If we dig deeper, we can see that some microRNAs substantially increase."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "gene_mirna_log2fc = log2fc(A.loc['ENSG00000170345.9'].groupby(['mirna_accession']).apply(expected_occupants), \n",
+    "                           B.loc['ENSG00000170345.9'].groupby(['mirna_accession']).apply(expected_occupants))\n",
+    "gene_mirna_log2fc.abs().sort_values(ascending=False).head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "However, this is just a partial picture since there are gains and losses that are not in the intersection."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "pd.concat([A.loc['ENSG00000170345.9'].groupby(['mirna_accession']).apply(expected_occupants), \n",
+    "           B.loc['ENSG00000170345.9'].groupby(['mirna_accession']).apply(expected_occupants)], axis=1, sort=True).head()"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.7.4"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
+%% Cell type:markdown id: tags:
+
+This is a partial solution to the assignment that covers the most commonly missed points and gotchas.
+
+%% Cell type:code id: tags:
+
+``` python
+import pandas as pd
+import requests
+from io import StringIO
+import numpy as np
+from matplotlib_venn import venn2
+import matplotlib.pyplot as plt
+```
+
+%% Cell type:code id: tags:
+
+``` python
+def read_microtargetome(f):
+    return pd.read_csv(f, sep='\t', index_col=['gene_accession', 'target_accession', 'position', 'mirna_accession'])
+
+def fetch_from_mirbooking_scan(accession):
+    response = requests.get(f'https://major.iric.ca/~poirigui/mirbooking-scan/cell-lines/{accession}', headers={'Accept': 'text/tab-separated-values'})
+    return read_microtargetome(StringIO(response.text))
+```
+
+%% Cell type:code id: tags:
+
+``` python
+A = fetch_from_mirbooking_scan('ENCSR172GTQ')
+B = fetch_from_mirbooking_scan('ENCSR066FYC')
+```
+
+%% Cell type:markdown id: tags:
+
+# Efficiency
+
+%% Cell type:markdown id: tags:
+
+Efficiency measures the degree of specificity of an interaction.
+
+Highly specific interactions can be completely inefficient if $K_m$ is high, and high-affinity interactions can end up being completely inefficient if they face strong competitors or bind many substrates.
+
+If the enzyme is exclusive to its substrate, the efficiency will be very high. If the enzyme is shared among many substrates, its free concentration will be lower and the concentration of the formed complex $[ES]$ will be lower as well. Conversely, if many enzymes are competing for a given substrate, the free concentration of the substrate will be lower and so will the concentration of the formed complexes.
+
+From a network perspective, it summarizes the local density surrounding an edge.
+
+%% Cell type:code id: tags:
+
+``` python
+# It is best to introduce a function for this purpose
+
+def efficiency(df):
+    E0 = df.mirna_quantity
+    S0 = df.target_quantity
+    Km = df.score
+    ES = df.quantity
+    Z = E0 + S0 + Km
+    ES_max = (Z - np.sqrt(Z**2 - 4 * E0 * S0)) / 2
+    return ES / ES_max
+```
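
`ES_max` appears to be the complex concentration the pair would reach if it were alone in solution. Substituting $[E] = E_0 - [ES]$ and $[S] = S_0 - [ES]$ into $K_m = \frac{[E][S]}{[ES]}$ gives the quadratic $[ES]^2 - Z[ES] + E_0 S_0 = 0$ with $Z = E_0 + S_0 + K_m$, whose smaller root is the expression used above. The sketch below checks this numerically; the function name `isolated_complex` and the concentrations are made up for illustration.

```python
import numpy as np

def isolated_complex(E0, S0, Km):
    """Complex concentration of a single enzyme-substrate pair at equilibrium."""
    Z = E0 + S0 + Km
    # Smaller root of ES**2 - Z * ES + E0 * S0 = 0.
    return (Z - np.sqrt(Z**2 - 4 * E0 * S0)) / 2

# Made-up concentrations and constant, in arbitrary but consistent units.
E0, S0, Km = 10.0, 5.0, 2.0
ES = isolated_complex(E0, S0, Km)

# Plugging the root back into the equilibrium equation should recover Km.
recovered_Km = (E0 - ES) * (S0 - ES) / ES
```

The smaller root is the physically meaningful one because the complex cannot exceed either total concentration, so the efficiency `ES / ES_max` compares the predicted complex with this isolated-pair upper bound.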
+
+%% Cell type:code id: tags:
+
+``` python
+efficiency(A).sort_values(ascending=False).head(10)
+```
+
+%% Cell type:code id: tags:
+
+``` python
+efficiency(B).sort_values(ascending=False).head(10)
+```
+
+%% Cell type:markdown id: tags:
+
+# Common interactions
+
+%% Cell type:code id: tags:
+
+``` python
+plt.title('Comparison of microRNA-gene interactions present in thyroid gland\nfor samples from 37- and 54-year-old patients')
+venn2([set(A.index), set(B.index)], ['Thyroid gland (37 year)', 'Thyroid gland (54 year)'])
+```
+
+%% Cell type:markdown id: tags:
+
+# Sponged fraction
+
+It's important here not to combine sponged fractions from different microRNAs because they are not compatible. The solution is to use a central tendency measure such as a mean or a median.
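
The two-step aggregation can be sketched on a toy frame: fractions are first summed per gene and microRNA, then averaged across microRNAs for each gene. The accessions `g1`, `m1`, `m2` and the quantities are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: one gene with two sites for m1 and one site for m2.
index = pd.MultiIndex.from_tuples(
    [('g1', 'm1', 10), ('g1', 'm1', 20), ('g1', 'm2', 10)],
    names=['gene_accession', 'mirna_accession', 'position'])
toy = pd.DataFrame({'quantity': [1.0, 1.0, 3.0],
                    'mirna_quantity': [10.0, 10.0, 6.0]}, index=index)

# Fraction of each microRNA held by each site.
per_site = toy.quantity / toy.mirna_quantity

# Fractions of the same microRNA share a denominator, so they can be summed.
per_mirna = per_site.groupby(level=['gene_accession', 'mirna_accession']).sum()

# Fractions of different microRNAs are summarized with a mean instead.
per_gene = per_mirna.groupby(level='gene_accession').mean()
```

Here the gene holds 20% of `m1` and 50% of `m2`, so its mean sponged fraction is 35%.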
+
+%% Cell type:code id: tags:
+
+``` python
+(A.quantity / A.mirna_quantity).groupby(['gene_accession', 'mirna_accession']).sum().groupby(['gene_accession']).mean().sort_values(ascending=False).head()
+```
+
+%% Cell type:code id: tags:
+
+``` python
+(B.quantity / B.mirna_quantity).groupby(['gene_accession', 'mirna_accession']).sum().groupby(['gene_accession']).mean().sort_values(ascending=False).head()
+```
+
+%% Cell type:markdown id: tags:
+
+The gene TG (ENSG00000042832.11) codes for the thyroglobulin protein and sponges a substantial fraction of the microRNAs it interacts with.
+
+%% Cell type:markdown id: tags:
+
+# Fold-changes
+
+%% Cell type:code id: tags:
+
+``` python
+def expected_occupants(df):
+    return (df.quantity / df.target_quantity).sum()
+```
+
+%% Cell type:code id: tags:
+
+``` python
+def bound_fraction(df):
+    return 1 - (1 - df.quantity / df.target_quantity).prod()
+```
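
The following cells call `log2fc`, which this notebook does not define. A minimal sketch, assuming the helper from the course notebook (`Microtargetome analysis.ipynb`) is the intended one, is reproduced here with a toy check on made-up series:

```python
import numpy as np
import pandas as pd

def log2fc(a, b):
    # Zeros are replaced by NaN so that undefined ratios are dropped
    # instead of producing infinite fold-changes.
    return np.log2(a.replace(0, np.nan) / b.replace(0, np.nan)).dropna()

# Made-up values: `g3` is dropped because its denominator is zero.
a = pd.Series([4.0, 1.0, 2.0], index=['g1', 'g2', 'g3'])
b = pd.Series([1.0, 2.0, 0.0], index=['g1', 'g2', 'g3'])
fold_changes = log2fc(a, b)
```

Because the division aligns the two series on their index, entries present in only one sample also become NaN and are dropped, so the result covers the intersection only.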
+
+%% Cell type:code id: tags:
+
+``` python
+gene_log2fc = log2fc(A.groupby(['gene_accession']).apply(expected_occupants), B.groupby(['gene_accession']).apply(expected_occupants))
+```
+
+%% Cell type:code id: tags:
+
+``` python
+gene_log2fc.abs().sort_values(ascending=False).head()
+```
+
+%% Cell type:markdown id: tags:
+
+We can also use the other metrics to construct our fold-changes.
+
+%% Cell type:code id: tags:
+
+``` python
+gene_efficiency_log2fc = log2fc(efficiency(A).groupby('gene_accession').median(), efficiency(B).groupby('gene_accession').median())
+gene_efficiency_log2fc.abs().sort_values(ascending=False).head()
+```
+
+%% Cell type:code id: tags:
+
+``` python
+gene_bf_log2fc = log2fc(A.groupby('gene_accession').apply(bound_fraction), B.groupby('gene_accession').apply(bound_fraction))
+gene_bf_log2fc.abs().sort_values(ascending=False).head()
+```
+
+%% Cell type:markdown id: tags:
+
+In this case, the bound fraction fold-changes are very similar to those of the expected number of occupants.
+
+%% Cell type:code id: tags:
+
+``` python
+plt.scatter(gene_log2fc, gene_bf_log2fc)
+plt.xlabel(r'Number of occupants $\log_2$ fold-changes')
+plt.ylabel(r'Bound fraction $\log_2$ fold-changes')
+```
+
+%% Cell type:markdown id: tags:
+
+To make the point of question 2 about efficiency clear: changes in interaction efficiency are not reflected in changes of substrate binding.
+
+%% Cell type:code id: tags:
+
+``` python
+plt.scatter(gene_log2fc, gene_efficiency_log2fc)
+plt.xlabel(r'Number of occupants $\log_2$ fold-changes')
+plt.ylabel(r'Efficiency $\log_2$ fold-changes')
+```
+
+%% Cell type:markdown id: tags:
+
+## Detailed fold-changes
+
+If we dig deeper, we can see that some microRNAs substantially increase.
+
+%% Cell type:code id: tags:
+
+``` python
+gene_mirna_log2fc = log2fc(A.loc['ENSG00000170345.9'].groupby(['mirna_accession']).apply(expected_occupants),
+                           B.loc['ENSG00000170345.9'].groupby(['mirna_accession']).apply(expected_occupants))
+gene_mirna_log2fc.abs().sort_values(ascending=False).head()
+```
+
+%% Cell type:markdown id: tags:
+
+However, this is just a partial picture since there are gains and losses that are not in the intersection.
+
+%% Cell type:code id: tags:
+
+``` python
+pd.concat([A.loc['ENSG00000170345.9'].groupby(['mirna_accession']).apply(expected_occupants),
+           B.loc['ENSG00000170345.9'].groupby(['mirna_accession']).apply(expected_occupants)], axis=1, sort=True).head()
+```