Abstract
Many repositories of open data for genomics, collected by worldwide consortia, are important enablers of biological research; moreover, all experimental datasets leading to publications in genomics must be deposited in public repositories and made available to the research community. These datasets are typically used by biologists for validating or enriching their experiments; their content is documented by metadata. However, the emphasis on data sharing is not matched by accuracy in data documentation: metadata are not standardized across sources and are often unstructured and incomplete.
In this paper, we propose a conceptual model of genomic metadata, whose purpose is to query the underlying data sources for locating relevant experimental datasets. First, we analyze the most typical metadata attributes of genomic sources and define their semantic properties. Then, we use a top-down method for building a global-as-view integrated schema, by abstracting the most important conceptual properties of genomic sources. Finally, we describe the validation of the conceptual model by mapping it to three well-known data sources: TCGA, ENCODE, and Gene Expression Omnibus.
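To illustrate the global-as-view idea mentioned above, the following minimal sketch defines each attribute of a small global schema as a view over the records of each source. It is an illustration only, not the paper's schema or implementation: the global attributes (`technique`, `tissue`), the source field names (`assay_label`, `data_kind`, and so on), and the helper functions `to_global` and `locate` are all hypothetical and do not reproduce the real metadata fields of TCGA, ENCODE, or GEO.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical global schema; the paper's actual conceptual model is richer.
GLOBAL_ATTRIBUTES = ("technique", "tissue")

View = Callable[[dict], Optional[str]]

# Global-as-view: every global attribute is defined as a view (a function)
# over a source record. All source field names below are invented.
MAPPINGS: Dict[str, Dict[str, View]] = {
    "ENCODE": {
        "technique": lambda r: r.get("assay_label"),
        "tissue": lambda r: r.get("biosample_label"),
    },
    "TCGA": {
        "technique": lambda r: r.get("data_kind"),
        "tissue": lambda r: r.get("site"),
    },
    "GEO": {
        # Semi-structured text: extract the value after a "tissue: " prefix.
        "tissue": lambda r: r.get("characteristics", "").partition("tissue: ")[2] or None,
    },
}


def to_global(source: str, record: dict) -> dict:
    """Translate one source record into a row of the global schema."""
    views = MAPPINGS[source]
    row = {"source": source}
    for attribute in GLOBAL_ATTRIBUTES:
        view = views.get(attribute)
        # Attributes without a view for this source stay undefined (None).
        row[attribute] = view(record) if view else None
    return row


def locate(rows: List[dict], **conditions: str) -> List[dict]:
    """Return the global rows that satisfy every attribute=value condition."""
    return [r for r in rows if all(r.get(k) == v for k, v in conditions.items())]


# Toy records, one per source, written only for this example.
rows = [
    to_global("ENCODE", {"assay_label": "ChIP-seq", "biosample_label": "liver"}),
    to_global("TCGA", {"data_kind": "RNA-seq", "site": "liver"}),
    to_global("GEO", {"characteristics": "tissue: liver"}),
]
liver_sources = [r["source"] for r in locate(rows, tissue="liver")]
```

Because queries are posed only against the global attributes, adding a further source in this sketch would mean adding one more entry to `MAPPINGS`, leaving `locate` unchanged.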
Notes
- 1.
- 2. Data-Driven Genomic Computing, http://www.bioinformatics.deib.polimi.it/geco/, ERC Advanced Grant, 2016–2021.
- 3. See the conceptual model of ENCODE at https://www.encodeproject.org/profiles/graph.svg: an ER schema with tens of entities and hundreds of relationships, which is neither readable nor supported by metadata for most concepts.
- 4.
- 5.
- 6.
- 7. Textual analysis to extract semantic information from the GEO repository is reported in [12]; we plan to reuse their library.
- 8. The metadata are provided in the NCI Genomic Data Commons portal, https://docs.gdc.cancer.gov/Data_Dictionary/viewer/.
- 9. GEO information can be retrieved through the R package GEOmetadb [37].
References
Adams, D., et al.: BLUEPRINT to decode the epigenetic signature written in blood. Nat. Biotechnol. 30(3), 224–226 (2012)
Albrecht, F., et al.: DeepBlue epigenomic data server: programmatic data retrieval and analysis of epigenome. Nucleic Acids Res. 44(W1), W581–W586 (2016)
Barrett, T., et al.: BioProject and BioSample databases at NCBI: facilitating capture and organization of metadata. Nucleic Acids Res. 40(D1), D57–D63 (2012)
Barrett, T., et al.: NCBI GEO: archive for functional genomics data sets – update. Nucleic Acids Res. 41(Database issue), D991–D995 (2013)
Bornberg-Bauer, E., Paton, N.W.: Conceptual data modelling for bioinformatics. Brief. Bioinform. 3(2), 166–180 (2002)
Buneman, P., et al.: A data transformation system for biological data sources. In: International Conference on Very Large Data Bases, pp. 158–169 (1995)
Cumbo, F., et al.: TCGA2BED: extracting, extending, integrating, and querying The Cancer Genome Atlas. BMC Bioinform. 18(6), 1–9 (2017)
Davidson, S.B., et al.: BioKleisli: a digital library for biomedical researchers. Int. J. Digit. Libr. 1(1), 36–53 (1997)
Davidson, S.B., et al.: K2/Kleisli and GUS: experiments in integrated access to genomic data sources. IBM Syst. J. 40(2), 512–531 (2001)
El-Ghalayini, H., et al.: Deriving conceptual data models from domain ontologies for bioinformatics. In: 2006 2nd Information and Communication Technologies, ICTTA 2006, vol. 2, pp. 3562–3567 (2006)
Fernández, J.D., et al.: Ontology-based search of genomic metadata. IEEE/ACM Trans. Comput. Biol. Bioinform. 13(2), 233–247 (2016)
Galeota, E., Pelizzola, M.: Ontology-based annotations and semantic relations in large-scale (epi)genomics data. Brief. Bioinform. 18(3), 403–412 (2017)
Haider, S., et al.: BioMart Central Portal - unified access to biological data. Nucleic Acids Res. 37(Web Server issue), W23–W27 (2009)
Hernandez, T., Kambhampati, S.: Integration of biological sources: current systems and challenges ahead. SIGMOD Rec. 33(3), 51–60 (2004)
Idrees, M., et al.: A review: conceptual data models for biological domain. JAPS, J. Anim. Plant Sci. 25(2), 337–345 (2015)
Ji, F., Elmasri, R., et al.: Incorporating concepts for bioinformatics data modeling into EER models. In: ACS/IEEE International Conference on Computer Systems and Applications, pp. 189–192. IEEE Computer Society, Washington, DC, USA (2005)
Kaitoua, A., Pinoli, P., Bertoni, M., Ceri, S.: Framework for supporting genomic operations. IEEE Trans. Comput. 66(3), 443–457 (2017)
Keet, M.C.: Biological data and conceptual modelling method. J. Concept. Model. 29(1), 1–14 (2003)
Kundaje, A., et al.: Integrative analysis of 111 reference human epigenomes. Nature 518(7539), 317–330 (2015)
Lenzerini, M.: Data integration: a theoretical perspective. In: Symposium on Principles of Database Systems, PODS, pp. 233–246. ACM, New York, NY, USA (2002)
Louie, B., et al.: Data integration and genomic medicine. J. Biomed. Inform. 40(1), 5–16 (2007)
Masseroli, M., Canakoglu, A., Ceri, S.: Integration and querying of genomic and proteomic semantic annotations for biomedical knowledge extraction. IEEE/ACM Trans. Comput. Biol. Bioinform. 13(2), 209–219 (2016)
Masseroli, M., et al.: GenoMetric Query Language: a novel approach to large-scale genomic data management. Bioinformatics 31(12), 1881–1888 (2015)
Masseroli, M., et al.: Modeling and interoperability of heterogeneous genomic big data for integrative processing and querying. Methods 111, 3–11 (2016)
Rechenmann, F.: Data modeling: the key to biological data integration. EMBnet. J. 18(B), 59–60 (2012)
Anonymous paper. Accelerating bioinformatics research with new software for big data to knowledge (BD2K), Paradigm4, April 2015. www.paradigm4.com
Consortium 1000Genomes: A map of human genome variation from population-scale sequencing. Nature 467(7319), 1061–1073 (2010)
Consortium ENCODE: An integrated encyclopedia of DNA elements in the human genome. Nature 489(7414), 57–74 (2012)
Reyes Román, J.F., Pastor, Ó., Casamayor, J.C., Valverde, F.: Applying conceptual modeling to better understand the human genome. In: Comyn-Wattiau, I., Tanaka, K., Song, I.-Y., Yamamoto, S., Saeki, M. (eds.) ER 2016. LNCS, vol. 9974, pp. 404–412. Springer, Cham (2016). doi:10.1007/978-3-319-46397-1_31
Roy, A., et al.: Massively parallel processing of whole genome sequence data: an in-depth performance study. In: Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD 2017, Chicago, Illinois, USA, 14–19 May 2017, pp. 187–202. ACM, New York (2017)
Sarntivijai, S., et al.: CLO: the cell line ontology. J. Biomed. Semant. 5(1), 37 (2014)
Schomburg, I., et al.: BRENDA in 2013: new options and contents in BRENDA. Nucleic Acids Res. 41(Database issue), D764–D772 (2013)
Schriml, L.M., et al.: Disease Ontology: a backbone for disease semantic integration. Nucleic Acids Res. 40(Database issue), D940–D946 (2012)
Smedley, D., et al.: The BioMart community portal: an innovative alternative to large, centralized data repositories. Nucleic Acids Res. 43(W1), W589–W598 (2015)
Wang, L., et al.: BioStar models of clinical and genomic data for biomedical data warehouse design. Int. J. Bioinform. Res. Appl. 1(1), 63–80 (2005)
Weinstein, J.N., et al.: The cancer genome atlas pan-cancer analysis project. Nat. Genet. 45(10), 1113–1120 (2013)
Zhu, Y., et al.: GEOmetadb: powerful alternative search engine for the Gene Expression Omnibus. Bioinformatics 24(23), 2798–2800 (2008)
Acknowledgement
This research is funded by the ERC Advanced Grant project GeCo (Data-Driven Genomic Computing), 2016–2021.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Bernasconi, A., Ceri, S., Campi, A., Masseroli, M. (2017). Conceptual Modeling for Genomics: Building an Integrated Repository of Open Data. In: Mayr, H., Guizzardi, G., Ma, H., Pastor, O. (eds) Conceptual Modeling. ER 2017. Lecture Notes in Computer Science, vol 10650. Springer, Cham. https://doi.org/10.1007/978-3-319-69904-2_26
DOI: https://doi.org/10.1007/978-3-319-69904-2_26
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-69903-5
Online ISBN: 978-3-319-69904-2
eBook Packages: Computer Science, Computer Science (R0)