INCREASING THE REPRODUCIBILITY OF SCIENTIFIC RESEARCH WORKS: A CASE STUDY USING THE ENVIRONMENT CODE-FIRST FRAMEWORK

Authors

  • Daniel Adorno Gomes, Universidade de Trás-os-Montes e Alto Douro
  • Pedro Mestre
  • Carlos Serôdio

DOI:

https://doi.org/10.26668/businessreview/2024.v9i5.4662

Keywords:

Infrastructure-as-Code, Reproducibility, Virtualization, Containerization, Open Science

Abstract

Objective: This paper presents a case study on how Environment Code-First (ECF), a recently proposed reproducibility framework based on the Infrastructure-as-Code approach, can improve the implementation and reproduction of computing environments by reducing complexity and manual intervention.

Methodology: The study compares the manual implementation of a pipeline's computational environment with the automated method proposed by the ECF framework, reporting concrete metrics on time consumption, effort, manual intervention, and platform agnosticism. It details the steps needed to implement the computational environment of a bioinformatics pipeline named MetaWorks from the perspective of the scientist who owns the research work, and then the steps taken to recreate that environment from the perspective of a researcher who wants to reproduce the published results.
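To make the comparison concrete, below is a minimal sketch of what an environment-as-code definition for MetaWorks could look like as a Dockerfile. It is illustrative only: the base image, file names, and environment name are assumptions, not the authors' published artifact (which is available at https://github.com/adornogomes/MetaWorks_Based_On_ECF_Framework).

    # Illustrative Dockerfile: the MetaWorks environment expressed as code.
    # Base image, paths, and the environment name are assumptions.
    FROM condaforge/miniforge3:latest

    # Fetch a pinned copy of the pipeline so every build starts from
    # identical sources.
    RUN apt-get update && apt-get install -y --no-install-recommends git \
        && rm -rf /var/lib/apt/lists/*
    RUN git clone --depth 1 https://github.com/terrimporter/MetaWorks /opt/MetaWorks
    WORKDIR /opt/MetaWorks

    # Recreate the conda environment from the repository's declarative
    # spec (assuming it ships an environment.yml, as in the MetaWorks tutorial).
    RUN conda env create -f environment.yml

    # Default to running the Snakemake workflow inside that environment
    # ("MetaWorks" is an assumed environment name).
    CMD ["conda", "run", "-n", "MetaWorks", "snakemake", "--cores", "1"]

Once such a file is versioned alongside the research code, every installation step is replayed automatically on rebuild, which is precisely the manual effort the study measures.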

Findings and Conclusion: The results demonstrate considerable benefits in adopting the ECF framework, particularly in preserving the same application behavior across different machines. This empirical evidence underscores the importance of reducing manual intervention, since automation ensures that the environment can be recreated consistently, as many times as needed, especially by researchers other than the original authors.

Originality/Value: Independently validating published findings in bioinformatics is challenging, mainly because of differences in the software and hardware available to recreate computational environments. Reproducing a computational environment that closely mimics the original is intricate and demands a significant investment of time. This study contributes to educating and assisting researchers in enhancing the reproducibility of their work by creating self-contained computational environments that are highly reproducible, isolated, portable, and platform-agnostic.
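Under the same assumptions as the sketch above, the reproduction side reduces to replaying the recorded definition rather than following a manual tutorial. A hypothetical session on any machine with Git and Docker installed could look like this (the repository layout and image tag are illustrative, not confirmed by the source):

    # Clone the published environment definition (hypothetical layout:
    # a Dockerfile at the repository root).
    git clone https://github.com/adornogomes/MetaWorks_Based_On_ECF_Framework
    cd MetaWorks_Based_On_ECF_Framework

    # Rebuild the self-contained environment from its code definition.
    docker build -t metaworks-ecf .

    # Run the pipeline; a host directory is mounted to collect results.
    docker run --rm -v "$PWD/results:/opt/MetaWorks/results" metaworks-ecf

Because the same few commands behave identically on any host with a container runtime, this is what the abstract refers to as platform agnosticism.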


References

Barba, L. A., & Thiruvathukal, G. K. (2017). Reproducible Research for Computing in Science & Engineering. Computing in Science & Engineering, 19(6), 85–87.

Cacho, J. R. F., & Taghva, K. (2018). Reproducible research in document analysis and recognition. In Information Technology-New Generations (pp. 389–395). Springer.

Cacho, J. R. F., & Taghva, K. (2020). The State of Reproducible Research in Computer Science. In 17th International Conference on Information Technology – New Generations (ITNG 2020) (Advances in Intelligent Systems and Computing, Vol. 1134, pp. 519–524). Springer.

Conda. (2024). Conda’s official website. Retrieved from https://docs.conda.io/en/latest

Coveney, P. V., Groen, D., & Hoekstra, A. G. (2021). Reliability and reproducibility in computational science: implementing validation, verification and uncertainty quantification in silico. Philosophical Transactions of the Royal Society A, 379, 1-5. https://doi.org/10.1098/rsta.2020.0409

de Bayser, M., Azevedo, L. G., & Cerqueira, R. (2015). ResearchOps: The case for DevOps in scientific applications. In 2015 IFIP/IEEE International Symposium on Integrated Network Management (IM) (pp. 1398–1404). IEEE.

Docker. (2024a). Troubleshoot topics for Docker Desktop. Retrieved from https://docs.docker.com/desktop/troubleshoot/topics/

Docker. (2024b). Workarounds for common problems. Retrieved from https://docs.docker.com/desktop/troubleshoot/workarounds/

Docker. (2024c). Known issues. Retrieved from https://docs.docker.com/desktop/troubleshoot/known-issues/

Edgar, R. C. (2016). UNOISE2: improved error-correction for Illumina 16S and ITS amplicon sequencing. bioRxiv. https://doi.org/10.1101/081257

Gomes, D. A., Mestre, P., & Serôdio, C. (2019). Infrastructure-as-Code for Scientific Computing Environments. In CENTRIC 2019: The Twelfth International Conference on Advances in Human-oriented and Personalized Mechanisms, Technologies, and Services (pp. 7-10).

Gomes, D. A., Mestre, P., & Serôdio, C. (2022). Environment Code-First Framework: Provisioning Scientific Computational Environments Using the Infrastructure-as-Code Approach. International Journal on Advances in Software, 15(1 & 2), 1-13.

Gomes, D. A. (2024). Metaworks based on ECF Framework [GitHub repository]. Retrieved from https://github.com/adornogomes/MetaWorks_Based_On_ECF_Framework

Grüning, B. A., Lampa, S., Vaudel, M., & Blankenberg, D. (2019). Software engineering for scientific big data analysis. Gigascience, 8(5). https://doi.org/10.1093/gigascience/giz054

Humble, J., & Farley, D. (2010). Continuous delivery: reliable software releases through build, test, and deployment automation. Pearson Education.

Liu, D. M., & Salganik, M. J. (2019). Successes and Struggles with Computational Reproducibility: Lessons from the Fragile Families Challenge. Socius, 5, 1-21. https://doi.org/10.1177/2378023119849803

Martin, M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal, 17(1), 10–12. https://doi.org/10.14806/ej.17.1.200

Marwick, B. (2017). Computational Reproducibility in Archaeological Research: Basic Principles and a Case Study of Their Implementation. Journal of Archaeological Method and Theory, 24, 424–450. https://doi.org/10.1007/s10816-015-9272-9

MetaWorks. (2024a). MetaWorks’ official page on GitHub. Retrieved from https://github.com/terrimporter/MetaWorks

MetaWorks. (2024b). MetaWorks’ official implementation tutorial. Retrieved from https://terrimporter.github.io/MetaWorksSite/tutorial

Morris, K. (2020). Infrastructure as Code: Dynamic Systems for the Cloud Age (2nd ed.). O’Reilly Media, Inc.

Porter, T. M., & Hajibabaei, M. (2022). MetaWorks: A flexible, scalable bioinformatic pipeline for high-throughput multi-marker biodiversity assessments. PLoS ONE, 17(9), 1-11. https://doi.org/10.1371/journal.pone.0274260

Reinecke, R., Trautmann, T., Wagener, T., & Schüler, K. (2022). The critical need to foster computational reproducibility. Environmental Research Letters, 17. https://doi.org/10.1088/1748-9326/ac5cf8

Segal, J., & Morris, C. (2012). Developing Software for a Scientific Community: Some Challenges and Solutions. In J. Leng & W. Sharrock (Eds.), Handbook of Research on Computational Science and Engineering: Theory and Practice (pp. 177-196). IGI Global. https://doi.org/10.4018/978-1-61350-116-0.ch008

Snakemake. (2024). Snakemake’s official website. Retrieved from https://snakemake.github.io

St John, J. (2016). SeqPrep’s official page on GitHub. Retrieved from https://github.com/jstjohn/SeqPrep/releases

Wang, Q., Garrity, G. M., Tiedje, J. M., & Cole, J. R. (2007). Naive Bayesian Classifier for Rapid Assignment of rRNA Sequences into the New Bacterial Taxonomy. Applied and Environmental Microbiology, 73(16), 5261–5267. https://doi.org/10.1128/AEM.00062-07

Wiebels, K., & Moreau, D. (2021). Leveraging containers for reproducible psychological research. Advances in Methods and Practices in Psychological Science, 4(2), Article 25152459211017853. https://doi.org/10.1177/25152459211017853

Wiggins, A. (2017). The Twelve-Factor App Official Website. Retrieved from http://12factor.net

Published

2024-05-14

How to Cite

Adorno Gomes, D., Mestre, P., & Serôdio, C. (2024). INCREASING THE REPRODUCIBILITY OF SCIENTIFIC RESEARCH WORKS: A CASE STUDY USING THE ENVIRONMENT CODE-FIRST FRAMEWORK. International Journal of Professional Business Review, 9(5), e04662. https://doi.org/10.26668/businessreview/2024.v9i5.4662