MIT slides
Hermine T
Created on January 29, 2022
Transcript
Gene/Tissue specificity scoring and clustering
Summer 2021 research project at WhiteLab Genomics
Under the supervision of Dr. Kevin Cheeseman
Background
Aim & Objectives
Identify the pair of most specific genes for each tissue in a given list of human tissues
Methods
Data Engineering
- Scraped gene read counts from the MetaSRA project and annotated them with disease/healthy labels and tissue types
- Transformed RNA read counts into Transcripts Per Million (TPM): of every 1,000,000 RNA molecules in the RNA-seq sample, x came from this gene
- Normalized TPM samples per gene
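The TPM step above can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the function names, the dictionary layout, and the gene lengths in kilobases are assumptions. The slides do not say which per-gene normalization was used, so the min-max rescaling shown here is only one possible choice.

```python
def counts_to_tpm(counts, lengths_kb):
    """Convert one sample's raw read counts to Transcripts Per Million.

    counts: dict mapping gene -> raw read count
    lengths_kb: dict mapping gene -> gene length in kilobases (assumed input)
    """
    # Reads per kilobase: correct each count for gene length.
    rpk = {gene: counts[gene] / lengths_kb[gene] for gene in counts}
    total = sum(rpk.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    # Rescale so that the values of the sample sum to one million.
    return {gene: value / total * 1_000_000 for gene, value in rpk.items()}


def min_max_per_gene(samples):
    """Rescale each gene to [0, 1] across samples (an assumed normalization).

    samples: list of dicts mapping gene -> TPM, all with the same genes
    """
    normalized = [dict() for _ in samples]
    for gene in samples[0]:
        values = [sample[gene] for sample in samples]
        lowest, highest = min(values), max(values)
        span = highest - lowest
        for out, sample in zip(normalized, samples):
            # A gene with constant expression carries no signal: map it to 0.
            out[gene] = (sample[gene] - lowest) / span if span else 0.0
    return normalized
```

Because every sample sums to one million after conversion, TPM values can be compared across samples with different sequencing depths.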
Data Analysis
- Benchmarked existing scores
- Built a new score based on Area Under the Curve (AUC) statistics to perform the gene/tissue specificity scoring
- Used parallelization and data batching to reduce run time and make better use of computing power
- Implemented divisive hierarchical clustering as a sanity check of the results
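The slides only state that the score is based on AUC statistics, so the sketch below is one plausible reading rather than the project's actual score. It treats a gene's expression as a classifier for "sample belongs to tissue T" and computes the AUC as the probability that a random in-tissue sample has higher expression than a random sample from any other tissue. The function names and input layout are hypothetical.

```python
def auc_specificity(in_tissue, other_tissues):
    """AUC of one gene separating a target tissue from all other tissues.

    Equals the probability that a random in-tissue sample has higher
    expression than a random out-of-tissue sample (ties count as 0.5).
    """
    if not in_tissue or not other_tissues:
        raise ValueError("both groups need at least one sample")
    wins = 0.0
    for x in in_tissue:
        for y in other_tissues:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(in_tissue) * len(other_tissues))


def score_gene(values, tissues):
    """Score one gene against every tissue label present in `tissues`.

    values: expression of the gene in each sample (e.g. normalized TPM)
    tissues: tissue label of each sample, in the same order as `values`
    """
    scores = {}
    for tissue in sorted(set(tissues)):
        inside = [v for v, t in zip(values, tissues) if t == tissue]
        outside = [v for v, t in zip(values, tissues) if t != tissue]
        scores[tissue] = auc_specificity(inside, outside)
    return scores
```

A score near 1.0 means the gene is expressed more strongly in that tissue than almost anywhere else, and a score near 0.5 means it does not distinguish the tissue. The pairwise loop is quadratic in the number of samples; a rank-based computation would be the natural replacement at the data sizes described here.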
- Client: an American biotech company looking to outsource some of its R&D
- Aim: research the idea of targeted therapies for specific tissues
- Challenge: human genomic data is very specific, sparse, and complicated to handle
In this project, we aimed to address the above problem. I took the role of data scientist and worked with my supervisor to scrape and annotate the data.
Data
- 80GB of RNA sequencing data scraped
- 53 human tissues
- 20,000+ different genes
- 2 labels: Healthy/Disease
- Biological samples: tissue, primary cells, cell line
by HERMINE TRANIE
- Create a relevant dataset of human RNA sequencing data
- Quantify the specificity of genes in human tissues
Biological structure
targeted therapies in PPI (protein-protein interaction)
Conclusion & Future Work
- Found the best pairs of highly specific genes per tissue with a margin of error
- Delivered a user-friendly interface to display visualizations of results
What could I do better today?
- Use Hadoop for further optimization in scraping and cleaning
- Use Interpretable Clustering (Bertsimas et al., 2021)
Impact
- Client-end:
  - Delivered a robust, clean dataset usable for further analysis
  - Delivered a user-friendly interface flexible enough to produce new analyses and reports
  - Saved 100+ hours of lab testing by producing an accurate pair of genes to test directly and to tailor tissue-specific therapies
- In-house:
  - Added an analytics tool to our own product
  - Created substantial evidence of robust work; the client hired the company again for further work
Workflow
Selecting pair of top genes
Identifying the top 2 genes per tissue
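Given a specificity score for every gene/tissue pair, the selection step could look like the sketch below. The nested-dictionary layout, the function name, and the placeholder gene names in the usage example are assumptions made for illustration; the slides do not describe how the selection was implemented.

```python
import heapq


def top_gene_pairs(scores, k=2):
    """Return the k highest-scoring genes for each tissue.

    scores: dict mapping gene -> dict mapping tissue -> specificity score
    """
    tissues = sorted({t for per_tissue in scores.values() for t in per_tissue})
    result = {}
    for tissue in tissues:
        # (score, gene) tuples sort by score first, then by gene name.
        candidates = [
            (per_tissue[tissue], gene)
            for gene, per_tissue in scores.items()
            if tissue in per_tissue
        ]
        result[tissue] = [gene for _, gene in heapq.nlargest(k, candidates)]
    return result


# Hypothetical scores for three placeholder genes and two tissues.
example_scores = {
    "GENE_A": {"liver": 0.99, "brain": 0.10},
    "GENE_B": {"liver": 0.95, "brain": 0.40},
    "GENE_C": {"liver": 0.20, "brain": 0.97},
}
print(top_gene_pairs(example_scores))
```

`heapq.nlargest` avoids sorting all 20,000+ genes for each tissue when only the top two are needed.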
Hierarchical clustering
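As an illustration of the divisive (top-down) idea used for the sanity check, the sketch below starts with all samples in one cluster and repeatedly splits the widest cluster in two. The splitting rule (seed with the two most distant points, assign each point to the nearer seed) is a simplification chosen for brevity and is not claimed to be the algorithm used in the project. The function names are hypothetical, and each point is assumed to be a tuple of expression values for one sample.

```python
from itertools import combinations
from math import dist


def farthest_pair(points, members):
    """Return (distance, i, j) for the two most distant members."""
    return max(
        (dist(points[i], points[j]), i, j) for i, j in combinations(members, 2)
    )


def divisive_clustering(points, n_clusters):
    """Top-down clustering: split the widest cluster until n_clusters exist.

    Returns a list of clusters, each a list of indices into `points`.
    """
    clusters = [list(range(len(points)))]
    while len(clusters) < n_clusters:
        # Only clusters with at least two members can be split.
        splittable = [c for c in clusters if len(c) > 1]
        if not splittable:
            break
        widest = max(splittable, key=lambda c: farthest_pair(points, c)[0])
        diameter, a, b = farthest_pair(points, widest)
        if diameter == 0:
            break  # all remaining points coincide; nothing left to split
        # Assign every member to the nearer of the two seed points.
        left = [
            i for i in widest
            if dist(points[i], points[a]) <= dist(points[i], points[b])
        ]
        left_set = set(left)
        right = [i for i in widest if i not in left_set]
        clusters.remove(widest)
        clusters.extend([left, right])
    return clusters
```

For a sanity check, the resulting clusters can be compared with the known tissue labels: if the selected genes are truly tissue-specific, samples from the same tissue should end up in the same cluster.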