Gene/Tissue specificity scoring and clustering

Summer 2021 research project at WhiteLab Genomics
Under the supervision of Dr. Kevin Cheeseman

Background

Aim & Objectives

Identify the pair of most specific genes for each tissue in a given list of human tissues

Methods

Data Engineering
  • Scraped gene read counts from the MetaSRA project and annotated them with disease/healthy labels and tissue types
  • Transformed RNA read counts into Transcripts Per Million (TPM): for every 1,000,000 RNA molecules in the RNA-seq sample, x came from this gene
  • Normalized TPM samples per gene
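
The TPM and normalization steps above can be sketched as follows. This is a minimal illustration, not the project's pipeline: the gene lengths are an assumed extra input (TPM needs them to correct for longer genes collecting more reads), the min-max rescaling is only one possible reading of "normalized per gene", and the names `counts_to_tpm`, `min_max_per_gene`, and `GENE_A`..`GENE_C` are hypothetical.

```python
def counts_to_tpm(counts, gene_lengths_bp):
    """Convert one sample's raw read counts to Transcripts Per Million.

    counts: dict mapping gene -> raw read count for a single sample.
    gene_lengths_bp: dict mapping gene -> length in base pairs (assumed input).
    """
    # Reads per kilobase: divide each count by the gene length in kb.
    rpk = {g: counts[g] / (gene_lengths_bp[g] / 1000.0) for g in counts}
    total = sum(rpk.values())
    if total == 0:
        # Empty sample: nothing to scale.
        return {g: 0.0 for g in counts}
    # Scale so that the values of one sample sum to 1,000,000.
    return {g: value / total * 1_000_000 for g, value in rpk.items()}


def min_max_per_gene(values):
    """Rescale one gene's TPM values across samples to [0, 1] (assumed scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant gene: no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Toy sample: three hypothetical genes whose counts are proportional to length.
sample_counts = {"GENE_A": 10, "GENE_B": 20, "GENE_C": 30}
gene_lengths = {"GENE_A": 1000, "GENE_B": 2000, "GENE_C": 3000}
tpm = counts_to_tpm(sample_counts, gene_lengths)
scaled = min_max_per_gene([2.0, 4.0, 6.0])
```

In the toy sample, counts proportional to gene length give every gene the same TPM, which is the length correction at work.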

Data Analysis

  • Benchmarked existing scores
  • Built a new score based on Area Under the Curve (AUC) statistics to perform the gene/tissue specificity scoring
  • Used parallelization and data batching to speed up operations and make efficient use of computing power
  • Implemented divisive hierarchical clustering as a sanity check of the results
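
To give an idea of what an AUC-based specificity score can look like, here is a minimal sketch. It is an assumption, not the score built in the project: it uses the standard rank-based AUC, the probability that a sample from the target tissue shows higher expression of the gene than a sample from any other tissue, with ties counted as one half. The function name `auc_specificity` and the toy values are hypothetical.

```python
def auc_specificity(expression, tissues, target):
    """AUC of one gene's expression as a classifier for `target` vs. other tissues.

    expression: list of expression values, one per sample.
    tissues: list of tissue labels, aligned with `expression`.
    Returns a value in [0, 1]; 1.0 means every target sample outranks every other sample.
    """
    inside = [x for x, t in zip(expression, tissues) if t == target]
    outside = [x for x, t in zip(expression, tissues) if t != target]
    if not inside or not outside:
        raise ValueError("need samples both inside and outside the target tissue")
    wins = 0.0
    for a in inside:
        for b in outside:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5  # ties count as half a win
    return wins / (len(inside) * len(outside))


# Toy data: one hypothetical gene measured in five samples.
expression = [9.0, 8.0, 1.0, 2.0, 8.0]
tissues = ["liver", "liver", "lung", "lung", "heart"]
liver_score = auc_specificity(expression, tissues, "liver")
lung_score = auc_specificity(expression, tissues, "lung")
```

Each gene/tissue pair is scored independently of the others, which is what makes this kind of computation easy to split into batches and run in parallel.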

  • Client: an American biotech company looking to outsource some of its R&D
  • Aim: research the idea of therapies targeted to specific tissues
  • Challenge: human genomic data is very specific, sparse, and complicated to handle

In this project, we aimed to address the above problem. I took on the role of data scientist and worked with my supervisor to scrape and annotate the data.


Data

  • 80GB of RNA sequencing data scraped
  • 53 human tissues
  • 20,000+ different genes
  • 2 labels: Healthy/Disease
  • Biological samples:
    • Tissue
    • Primary cells
    • Cell line

by HERMINE TRANIE

  • Create a relevant dataset of human RNA sequencing data
  • Quantify the specificity of genes in human tissues

Biological structure

targeted therapies in PPI (protein-protein interaction)

Conclusion & Future Work

  • Found the best pairs of highly specific genes per tissue with a margin of error
  • Delivered a user-friendly interface to display visualizations of results

What could I do better today?
  • Use Hadoop for further optimization in scraping and cleaning
  • Use Interpretable Clustering (Bertsimas et al., 2021)

Impact

  • Client-end:
    • Delivered a robust, clean dataset usable for further analysis
    • Delivered a user-friendly, flexible interface for producing new analyses and reports
    • Saved 100+ hours of lab testing by providing an accurate pair of genes to test directly and use to tailor tissue-specific therapies
  • In-house:
    • Added an analytics tool to our own product
    • Created substantial evidence of robust work; the client hired the company again for further work

Workflow

Selecting the pair of top genes

Identifying the top 2 genes per tissue
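
Once every gene/tissue pair has a specificity score, picking the top 2 genes per tissue is a small grouping step. The sketch below assumes the scores are held in a dictionary keyed by `(gene, tissue)`; that layout, the function name `top_genes_per_tissue`, and the example values are illustrative only.

```python
import heapq
from collections import defaultdict


def top_genes_per_tissue(scores, k=2):
    """Return the k highest-scoring genes for each tissue.

    scores: dict mapping (gene, tissue) -> specificity score.
    """
    by_tissue = defaultdict(list)
    for (gene, tissue), score in scores.items():
        by_tissue[tissue].append((score, gene))
    # nlargest returns the pairs sorted from highest to lowest score.
    return {
        tissue: [gene for _, gene in heapq.nlargest(k, pairs)]
        for tissue, pairs in by_tissue.items()
    }


# Hypothetical scores for three genes in two tissues.
scores = {
    ("GENE_A", "liver"): 0.95,
    ("GENE_B", "liver"): 0.60,
    ("GENE_C", "liver"): 0.88,
    ("GENE_A", "lung"): 0.10,
    ("GENE_B", "lung"): 0.91,
    ("GENE_C", "lung"): 0.75,
}
top_pairs = top_genes_per_tissue(scores)
```

Here `top_pairs` maps "liver" to GENE_A and GENE_C, and "lung" to GENE_B and GENE_C.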

Hierarchical clustering
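
Divisive hierarchical clustering starts from one cluster containing everything and splits it top-down. The sketch below is a simplified DIANA-style version on toy 2D points, written only to illustrate the idea; the project's actual implementation, distance measure, and input features are not described in these slides, so the Euclidean distance and all names here are assumptions.

```python
import math


def mean_dist(i, group, points):
    """Average distance from point i to the other members of `group`."""
    others = [j for j in group if j != i]
    if not others:
        return 0.0
    return sum(math.dist(points[i], points[j]) for j in others) / len(others)


def split(cluster, points):
    """Split one cluster (list of indices) into a splinter group and the rest."""
    rest = list(cluster)
    # Seed the splinter group with the point farthest, on average, from the others.
    seed = max(rest, key=lambda i: mean_dist(i, rest, points))
    rest.remove(seed)
    splinter = [seed]
    moved = True
    while moved and len(rest) > 1:
        moved = False
        # Positive gain: the point is closer to the splinter group than to the rest.
        gains = {
            i: mean_dist(i, rest, points) - mean_dist(i, splinter, points)
            for i in rest
        }
        best = max(gains, key=gains.get)
        if gains[best] > 0:
            rest.remove(best)
            splinter.append(best)
            moved = True
    return splinter, rest


def diameter(cluster, points):
    """Largest pairwise distance inside a cluster."""
    return max(
        (math.dist(points[i], points[j]) for i in cluster for j in cluster),
        default=0.0,
    )


def divisive_clustering(points, n_clusters):
    """Repeatedly split the widest cluster until n_clusters are obtained."""
    clusters = [list(range(len(points)))]
    while len(clusters) < n_clusters:
        widest = max(clusters, key=lambda c: diameter(c, points))
        if len(widest) < 2:
            break  # nothing left to split
        clusters.remove(widest)
        clusters.extend(split(widest, points))
    return [sorted(c) for c in clusters]


# Toy points: three near the origin, two near (5, 5), one isolated.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
clusters = divisive_clustering(points, n_clusters=3)
```

On the toy points, the isolated point is split off first, then the two groups of nearby points are separated. Used as a sanity check, the same idea would be applied to expression profiles to see whether samples or genes group by tissue as the scores suggest.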