Abstract
Data-driven science requires manipulating large datasets from various data sources through complex workflows based on a variety of models and languages. With the increasing number of big data sources and models developed by different groups, it is hard to relate models and data and to use them in unanticipated ways for specific data analyses. Current solutions are typically ad hoc and specialized for particular data, models, and workflow systems. In this paper, we focus on data-driven life science and propose an open service-based architecture, Life Science Workflow Services (LifeSWS), which provides data analysis workflow services for life sciences. We illustrate our motivations and rationale for the architecture with real use cases from life science.
Acknowledgement
This work is within the context of the HPDaSc associated team between Inria and Brazil. Some of us are supported by CNPq research productivity fellowships. C. Pradal has support from the MaCS4Plants CIRAD network, initiated from the AGAP Institute and AMAP joint research units, and EU’s Horizon 2020 research and innovation program (IPM Decisions project No. 817617, BreedingValue project No. 101000747).
Copyright information
© 2023 The Author(s), under exclusive license to Springer-Verlag GmbH, DE, part of Springer Nature
About this chapter
Cite this chapter
Akbarinia, R. et al. (2023). Life Science Workflow Services (LifeSWS): Motivations and Architecture. In: Hameurlain, A., Tjoa, A.M. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems LV. Lecture Notes in Computer Science, vol 14280. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-68100-8_1
DOI: https://doi.org/10.1007/978-3-662-68100-8_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-68099-5
Online ISBN: 978-3-662-68100-8
eBook Packages: Computer Science, Computer Science (R0)