DOI: 10.1145/3535508.3545531
research-article

ArcheGEO: towards improving relevance of gene expression omnibus search results

Published: 07 August 2022 Publication History

Abstract

Transcriptomic data stored in the Gene Expression Omnibus (GEO) serves thousands of queries per day, but a lack of standardized machine-readable metadata causes many searches to return irrelevant hits, which impede convenient access to useful data in the GEO repository. Here, we describe ArcheGEO, a novel end-to-end framework that improves results from the GEO Browser by automatically determining the relevance of these results. Unlike existing tools, ArcheGEO reports on the irrelevant results and provides reasoning for their exclusion. Such reasoning can be leveraged to improve annotations of metadata.

Published In

BCB '22: Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics
August 2022
549 pages
ISBN: 9781450393867
DOI: 10.1145/3535508
Publisher

Association for Computing Machinery

New York, NY, United States

Acceptance Rates

Overall Acceptance Rate 254 of 885 submissions, 29%
