Differences
This shows you the differences between two versions of the page.
research:ariane [2021/04/23 11:35] – created ariane | research:ariane [2021/04/23 11:37] (current) – ariane | ||
---|---|---|---|
Line 7: | Line 7: | ||
This project is a collaboration with Stefan Thor, and my main lab, with Mikael Boden. Stefan and his lab developed a comprehensive transcriptome dataset that spans the anterior-posterior (A-P) developmental axis in mice. A key feature of organisms with a central nervous system, in particular those with a distinct brain and spinal cord, is the distinctive overgrowth of the brain (the anterior region). Humans, mice, and flies all belong to this category (bilateria), | This project is a collaboration with Stefan Thor, and my main lab, with Mikael Boden. Stefan and his lab developed a comprehensive transcriptome dataset that spans the anterior-posterior (A-P) developmental axis in mice. A key feature of organisms with a central nervous system, in particular those with a distinct brain and spinal cord, is the distinctive overgrowth of the brain (the anterior region). Humans, mice, and flies all belong to this category (bilateria), | ||
+ | ==== Recently developed software ==== | ||
+ | === sci-vae === | ||
+ | Sci-VAE is a Keras implementation of a variational autoencoder (VAE) that I developed for integrating biological data. Customisations to the VAE can be passed in via the CLI (with a JSON file) or from Python and R scripts (see examples). | ||
+ | |||
+ | The VAE implementation expects a data matrix with features as columns (no headers) and training samples as rows (no row IDs). The first thing the VAE does is scale your data to between 0 and 1, so you don’t need to do this before running it. | ||
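To make the scaling step concrete, here is a minimal sketch of one common way to map a data matrix into the range 0 to 1. It assumes per-feature (column-wise) min-max scaling; the text above only says the data is transformed to between 0 and 1, so the exact method Sci-VAE uses may differ. The function name `min_max_scale` is hypothetical and is not part of the package.

```python
from typing import List


def min_max_scale(matrix: List[List[float]]) -> List[List[float]]:
    """Scale each column (feature) of a row-major matrix to the range [0, 1].

    Assumption: per-feature min-max scaling. Constant columns are mapped
    to 0.0 to avoid dividing by zero.
    """
    if not matrix:
        return []
    columns = list(zip(*matrix))  # transpose: one tuple per feature
    mins = [min(col) for col in columns]
    ranges = [max(col) - lo for col, lo in zip(columns, mins)]
    return [
        [(v - lo) / rng if rng > 0 else 0.0 for v, lo, rng in zip(row, mins, ranges)]
        for row in matrix
    ]


# Rows are training samples, columns are features (no headers, no row IDs).
data = [
    [1.0, 10.0, 5.0],
    [2.0, 20.0, 5.0],
    [3.0, 40.0, 5.0],
]
scaled = min_max_scale(data)
for row in scaled:
    print(row)
```

Scaling each feature separately keeps features measured on very different scales comparable, which is one reason it is a common default for autoencoder inputs.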
+ | |||
+ | I show several examples using MNIST, the Iris dataset, and a publicly available histone modification and RNAseq dataset from ENCODE (for those bioinformaticians out there). There are also some useful visualisations that I found myself repeating often when inspecting the latent space, so check out the Vis functions if you’re interested (these are also in the examples). | ||
+ | |||
+ | Lastly, there is an optimisation library that lets you optimise the VAE architecture towards a separable latent space, as judged by classification. If you choose to use this, you’ll also need to pass labels into the VAE. Check out the tests for how to run this; it uses an evolutionary algorithm. Code will be released soon (with the papers above). | ||
+ | |||
+ | === sci-diffMethGenes === | ||
+ | Sci-dmg aims to assign a change in DNA methylation (as calculated by an external tool) to genes in a consistent and unbiased manner. The user provides a DMR file, a file with the percentage of DNA methylation, | ||
+ | |||
+ | Any tool can be used to produce the DMCs and DMRs; two such tools are methylKit and methylSig, and many others exist. | ||
+ | |||
+ | === sci-epi2gene === | ||
+ | Sci-epi2gene maps events annotated to a genome location to nearby genes, e.g. peaks from histone modification ChIP-seq experiments stored as BED data, or DNA methylation data in CSV format (e.g. output from DMRseq or methylKit). | ||
+ | |||
+ | The user provides a SORTED gene annotation file with start, end, and direction for each gene (we recommend using sci-biomart). | ||
+ | |||
+ | The user then selects how to annotate, i.e. whether the feature must fall in the promoter region or overlap the gene body. Finally, the parameters for overlap on each side are chosen. The algorithm runs through the file only once, so it should have a computational complexity of O(N). | ||
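The single-pass idea described above can be sketched as a sweep over sorted intervals. This is an illustration only, not the sci-epi2gene code: the names `Gene`, `Peak`, `annotation_window`, and `annotate` are hypothetical, and the flank semantics (an upstream distance before the TSS and a downstream distance after it or after the gene end, flipped for the minus strand) are an assumption. The sketch sorts the windows itself for safety; with a pre-sorted annotation file that step would not be needed.

```python
from typing import List, NamedTuple, Tuple


class Gene(NamedTuple):
    name: str
    start: int
    end: int
    strand: str  # "+" or "-"


class Peak(NamedTuple):
    name: str
    start: int
    end: int


def annotation_window(gene: Gene, upstream: int, downstream: int, mode: str) -> Tuple[int, int]:
    """Return the (lo, hi) region in which a peak counts as annotated to the gene."""
    if gene.strand == "+":
        if mode == "promoter":  # window around the TSS, which is the gene start
            return gene.start - upstream, gene.start + downstream
        return gene.start - upstream, gene.end + downstream
    # Minus strand: the TSS is the gene end, so the flanks swap sides.
    if mode == "promoter":
        return gene.end - downstream, gene.end + upstream
    return gene.start - downstream, gene.end + upstream


def annotate(genes: List[Gene], peaks: List[Peak], upstream: int,
             downstream: int, mode: str = "overlaps") -> List[Tuple[str, str]]:
    """Pair each peak with every gene whose window it touches.

    Peaks must be sorted by start position (one chromosome assumed).
    """
    windows = sorted((*annotation_window(g, upstream, downstream, mode), g.name) for g in genes)
    hits = []
    active = []  # windows that may still overlap the current or a later peak
    i = 0
    for peak in peaks:
        # Admit windows that begin at or before the end of this peak.
        while i < len(windows) and windows[i][0] <= peak.end:
            active.append(windows[i])
            i += 1
        # Drop windows that end before this peak starts; later peaks start even later.
        active = [w for w in active if w[1] >= peak.start]
        for lo, hi, name in active:
            if lo <= peak.end:  # ANY overlap between peak and window counts
                hits.append((peak.name, name))
    return hits


genes = [Gene("A", 1000, 2000, "+"), Gene("B", 5000, 6000, "-")]
peaks = [Peak("p1", 900, 950), Peak("p2", 2050, 2080),
         Peak("p4", 4850, 4880), Peak("p3", 6100, 6150)]
print(annotate(genes, peaks, upstream=200, downstream=100, mode="overlaps"))
print(annotate(genes, peaks, upstream=200, downstream=100, mode="promoter"))
```

Each window and each peak is visited once by the outer loops, so the cost stays close to linear as long as only a few windows are active at any position. All matching features are kept rather than only the nearest one, so the output can be filtered afterwards.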
+ | |||
+ | It is available under the GNU General Public License (Version 3). | ||
+ | |||
+ | This package is a wrapper that allows various epigenetic data types to be annotated to genes. I also wanted different upper and lower flanking distances that take the directionality of the strand into account, and a simple output CSV file that can be filtered and used in downstream analyses. This is why I keep all features that fall within the annotation region of a gene (example below): | ||
+ | |||
+ | The overlapping methods are as follows: | ||
+ | Overlaps: this means does ANY part of the peak/ | ||
+ | |||
+ | Promoter: does ANY part of the peak/ | ||
+ | |||
+ | Lastly, there are sometimes differences between annotations (i.e. the TSS on your annotation in IGV may differ from the annotation you input to sci-epi2gene), | ||
+ | |||
+ | Please post questions and issues related to sci-epi2gene on the Issues section of the GitHub repository. | ||
+ | |||
+ | === sci-downloadAnnotateTCGA === | ||
+ | Sci-Download-Annotate-TCGA is a wrapper around the functions provided by TCGA and the GDC data portal. Long story short, I needed to merge many TCGA datasets (RNAseq and DNA methylation) and wanted to keep track of patient demographics to ensure I had a balanced dataset. I also wanted to easily find genes in groups of patients with mutations. I found no easy way to do these things, so I made this wrapper to be able to: | ||
+ | |||
+ | Create a dataframe of many RNAseq datasets from TCGA (and automatically download these) | ||
+ | |||
+ | Merge RNAseq and DNA methylation datasets so for each gene I could see a cross mode profile | ||
+ | |||
+ | Annotate each experiment with demographic information | ||
+ | |||
+ | Annotate each gene with mutation information and search for genes with specific mutations through the API. | ||
+ | |||
+ | This package provides the above in Python notebooks, R Markdown, and a CLI. | ||
+ | |||
+ | It is available under the GNU General Public License (Version 3). | ||
+ | |||
+ | Please post questions and issues related to sci-dat on the Issues section of the GitHub repository. | ||
+ | |||
+ | === sci-motf === | ||
+ | sci-moTF is a simple package for finding motifs that are enriched in different clusters and whose TFs are also expressed in your dataset, making it easier to infer which TFs may be driving the observed changes. | ||
+ | |||
+ | The inputs to sci-moTF are: 1) the output of FIMO (fimo.tsv), and 2) a CSV file with gene identifier (e.g. name), cluster, log2FC, and p-value. | ||
+ | |||
+ | === sci-biomart === | ||
+ | Sci-biomart is a simple wrapper around the BioMart API. I wrote it because existing packages were not quite sufficient for what I wanted to do. The handy thing is that most queries can be performed in a single line, and it can also be run in a pipeline (since it supports a CLI). | ||
+ | |||
+ | With it you can get the list of all genes and perform other BioMart functions, such as mapping between human and mouse. | ||
+ | |||
+ | It is available under the GNU General Public License (Version 3). | ||
+ | |||
+ | Please post questions and issues related to sci-biomart on the Issues section of the GitHub repository. | ||
+ | |||
+ | === sci-RNAprocessing === | ||
+ | Scirnap (sci-RNAprocessing) is a wrapper for some commonly used programs for processing RNAseq data. I created this wrapper to make pipelines more reproducible while keeping things completely modular and allowing any other program to be added. The main things I like are that log files are output consistently and that the direct path to a program can be passed (I’ve found this useful on shared servers). It has made it very easy for me to reproduce pipelines without adding overhead. Code will be released in mid-2021. | ||
+ | |||
+ | === sci-viso === | ||
+ | Sci-viso is a visualisation package that I use for all my scientific visualisations. It uses charts from matplotlib and seaborn, then adds styles for papers (for example, size 6 bold Arial font). Colour palettes are built in, as are statistics on boxplots. | ||
+ | |||
+ | === sci-util === | ||
+ | Sci-util contains utility functions for my sci* packages, such as error catching and handling, and logging. | ||
+ | |||
+ | ==== Previous projects ==== | ||
+ | |||
+ | === Graphical Representation of Ancestral Sequence Prediction === | ||
+ | GRASP enables users to perform ancestral sequence prediction and visualisation via a web interface. My role consisted largely of developing the web and backend architecture to support the web tool, and implementing the optimal path-finding algorithm through the POAG. | ||
+ | |||
+ | “We developed Graphical Representation of Ancestral Sequence Predictions (GRASP) to infer and explore ancestral variants of protein families with more than 10,000 members. GRASP uses partial order graphs to represent homology in very large datasets, which are intractable with current inference tools and may, for example, be used to engineer proteins by identifying ancient variants of enzymes. We demonstrate that (1) across three distinct enzyme families, GRASP predicts ancestor sequences, all of which demonstrate enzymatic activity, (2) within-family insertions and deletions can be used as building blocks to support the engineering of biologically active ancestors via a new source of ancestral variation, and (3) generous inclusion of sequence data encompassing great diversity leads to less variance in ancestor sequence.” from the documentation | ||
+ | |||
+ | Authors: Gabriel Foley, Ariane Mora, Connie M Ross, Scott Bottoms, Leander Sutzl, Marnie L Lamprecht, Julian Zaugg, Alexandra Essebier, Brad Balderson, Rhys Newell, Raine ES Thomson, Bostjan Kobe, Ross T Barnard, Luke Guddat, Gerhard Schenk, Joerg Carsten, Yosephine Gumulya, Burkhard Rost, Dietmar Haltrich, Volker Sieber, Elizabeth MJ Gillam, Mikael Boden | ||
+ | |||
+ | === Omicxview === | ||
+ | Abstract: Omicxview is an interactive visualisation portal that enables researchers to display large metabolic datasets on well-defined Escher pathways. It addresses the gap between very simple static views, such as the common approach of colouring KEGG pathways, and the comprehensive networks such as Reactome, which can be so complex that the signal of interest is dwarfed by background information. Omicxview overlays experimental data onto metabolic pathways, providing users with intuitive ways to explore large multi-omic datasets. Authors: Ariane Mora, Rowland Mosbergen, Steve Englart, Othmar Korn, Mikael Boden and Christine A Wells. | ||
+ | |||
+ | - Oral Presentation at E-Research Australasia, | ||
+ | - Oral Presentation at Joining the Dots Symposium (Aug 2017) |