GEOracle: Mining perturbation experiments using free text metadata in Gene Expression Omnibus





1. Background

NCBI's Gene Expression Omnibus (GEO) contains >79,000 gene expression data sets (GSE). Those from perturbation experiments (e.g., gene knock-out, signalling or physical stimulation) are especially valuable because they allow identification of genes that are causally downstream of a perturbation agent. This has important applications in determining signalling pathway targets and gene regulatory networks.

There are likely tens of thousands of perturbation studies in GEO, containing millions of experimentally determined perturbation data points. Nonetheless, there is no simple way to determine whether a GSE contains perturbation data, and it is not trivial to automatically match treatment samples with their respective control samples.

2. Method

A key insight is that a wealth of useful information regarding experimental design and sample description is stored as free text in GEO metadata. Although this free text is often readily interpretable by humans, there is no simple means to extract this information from GEO in an automated fashion. We reason that text mining and machine learning techniques can be used to identify GSE that contain perturbation data, and to identify and match the treatment and control samples within a perturbation data set. We posit that such an approach will allow us to extract a large amount of gene regulatory information that is already present in GEO.
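To make the matching idea concrete, the following is a minimal sketch in Python, not GEOracle's actual implementation (GEOracle itself is an R Shiny tool). It assumes that sample titles are available as plain strings, and it replaces the machine learning step with simple keyword rules. The keyword lists and the function names `label_sample` and `match_controls` are illustrative assumptions.

```python
import re

# Illustrative keyword lists; these are assumptions, not GEOracle's actual features.
CONTROL_TERMS = {"control", "ctrl", "wt", "wildtype", "untreated", "vehicle", "mock"}
PERTURBATION_TERMS = {"ko", "knockout", "knockdown", "sirna", "shrna", "treated", "stimulated"}


def tokenize(title):
    """Lowercase a sample title and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def label_sample(title):
    """Label a sample as 'control', 'perturbation', or 'unknown' from its title."""
    tokens = tokenize(title)
    if tokens & CONTROL_TERMS:
        return "control"
    if tokens & PERTURBATION_TERMS:
        return "perturbation"
    return "unknown"


def match_controls(titles):
    """Pair each perturbation sample with the control that shares the most context tokens."""
    controls = [t for t in titles if label_sample(t) == "control"]
    pairs = {}
    for title in titles:
        if label_sample(title) != "perturbation" or not controls:
            continue
        # Context tokens are what remains after removing the perturbation keywords.
        context = tokenize(title) - PERTURBATION_TERMS
        pairs[title] = max(
            controls,
            key=lambda c: len(context & (tokenize(c) - CONTROL_TERMS)),
        )
    return pairs


titles = [
    "Heart WT rep1",
    "Heart Gata4 KO rep1",
    "Liver WT rep1",
    "Liver Gata4 KO rep1",
]
print(match_controls(titles))
```

In this toy example each knock-out sample is paired with the wild-type sample from the same tissue, because they share the most tokens. Real GEO metadata is far noisier, which is why a semi-automated workflow with user review is useful.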

Using our R Shiny tool called GEOracle, we can quickly annotate many perturbation experiments from GEO in a semi-automated fashion with full user control. GEOracle then performs differential expression analysis to identify gene targets of the perturbation agent.
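As a rough illustration of the final step, the sketch below computes a per-gene log2 fold change between matched treatment and control samples. This abstract does not state which statistical method GEOracle applies, so the plain fold change here is a stand-in, and the function names, pseudocount, and threshold are assumptions chosen for the example.

```python
import math
from statistics import mean


def log2_fold_changes(expression, treatment_ids, control_ids, pseudocount=1.0):
    """Return log2(mean treatment / mean control) per gene, using a pseudocount."""
    result = {}
    for gene, values in expression.items():
        treated = mean(values[s] for s in treatment_ids)
        control = mean(values[s] for s in control_ids)
        result[gene] = math.log2((treated + pseudocount) / (control + pseudocount))
    return result


def candidate_targets(fold_changes, threshold=1.0):
    """Return genes whose absolute log2 fold change meets the threshold."""
    return sorted(g for g, fc in fold_changes.items() if abs(fc) >= threshold)


# Toy expression values (gene -> sample -> intensity); purely illustrative data.
expression = {
    "GeneA": {"c1": 7, "c2": 7, "t1": 31, "t2": 31},
    "GeneB": {"c1": 15, "c2": 15, "t1": 15, "t2": 15},
    "GeneC": {"c1": 31, "c2": 31, "t1": 7, "t2": 7},
}
fold_changes = log2_fold_changes(expression, ["t1", "t2"], ["c1", "c2"])
print(candidate_targets(fold_changes))
```

A real analysis would also need replicate-aware statistics and multiple-testing correction; the fold change alone only ranks candidate targets.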

3. Results

GEOracle is freely available. To demonstrate GEOracle's application in biomedical research, we present two case studies that involve the discovery of conserved signalling pathway target genes and the reconstruction of an organ-specific gene regulatory network.

4. Conclusions

This work shows that free text metadata in GEO can be computationally mined to extract a large amount of perturbation data. This wealth of perturbation data can be used for discovering signalling pathway target genes and causal gene regulatory networks. While we believe it is important to push for better use of standard annotations in GEO metadata, GEOracle provides a powerful and practical tool for reusing the large amount of data that already exist in GEO.

5. Future ideas/collaborators needed to further research?

We expect that more advanced natural language processing techniques can extract additional information about each gene expression experiment from the metadata, such as cell types and experimental context. Such extracted information can then be mapped to standard ontologies (e.g., cell ontology).
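A minimal sketch of the ontology mapping idea follows, assuming a hand-built synonym table. The identifiers `EX:0001` and `EX:0002` are placeholders invented for this example, not real Cell Ontology terms, and `map_to_ontology` is a hypothetical helper rather than part of GEOracle.

```python
import re

# Hypothetical synonym table; the IDs are placeholders, not real ontology identifiers.
SYNONYMS = {
    "cardiomyocyte": "EX:0001",
    "cardiac myocyte": "EX:0001",
    "hepatocyte": "EX:0002",
}


def map_to_ontology(text, synonyms=SYNONYMS):
    """Return the sorted ontology IDs whose synonyms occur in the free text."""
    # Normalise to lowercase alphanumeric words separated by single spaces.
    normalized = " ".join(re.findall(r"[a-z0-9]+", text.lower()))
    hits = set()
    for phrase, term_id in synonyms.items():
        # Allow a simple plural "s" after the synonym.
        if re.search(r"\b" + re.escape(phrase) + r"s?\b", normalized):
            hits.add(term_id)
    return sorted(hits)


print(map_to_ontology("Primary cardiac myocytes, isolated from neonatal mice"))
```

Exact synonym lookup of this kind misses abbreviations and misspellings, which is where more advanced natural language processing would be expected to help.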

Our long-term goal is to make the classification accuracy high enough that all GEO data sets can be processed fully automatically. This will allow us to build a rich resource of genetic and molecular perturbation data.

We always value collaboration with experts in natural language processing, machine learning, biological ontology, genetics and biomedical science.




Dr Joshua Ho is the Head of Bioinformatics and Systems Medicine Laboratory at the Victor Chang Cardiac Research Institute. He is also an NHMRC Career Development Fellow and a Heart Foundation Futur...

Round: Open Peer Voting
Category: Main Prize





