Towards Reproducible Data Analysis Using Cloud and Container Technologies

Published on May 29, 2019

Author: insideHPC

Source: slideshare.net

1. Towards Reproducible Data Analysis Using Container Technologies. Sergio Maffioletti, EnhanceR project director, UZH/S3IT

2. Disclaimer
What I'm presenting here is the result of personal experience plus the outcomes of various discussions within the EnhanceR project. In other words: if you like the talk, congratulate me; if you don't, blame EnhanceR.

3. What are we going to talk about?
● Context
● What is the user story we have in mind?
● Let's build the infrastructure support
● Let's not stop here: building containers for/with end-users
● One more step: what do we put inside the container?
● Main challenges and open questions

4. Who is EnhanceR again?

5. What problems are we facing? Reproducible data analysis.
"Reproducibility is just collaboration with people you don't know, including yourself next week" — Philip Stark, UC Berkeley Statistics

6. Context
Repeatability (same team, same experimental setup): the measurement can be obtained with stated precision by the same team using the same measurement procedure, the same measuring system, under the same operating conditions, in the same location, on multiple trials. For computational experiments, this means that a researcher can reliably repeat her own computation.
Replicability (different team, same experimental setup): the measurement can be obtained with stated precision by a different team using the same measurement procedure, the same measuring system, under the same operating conditions, in the same or a different location, on multiple trials. For computational experiments, this means that an independent group can obtain the same result using the author's own artifacts.
Reproducibility (different team, different experimental setup): the measurement can be obtained with stated precision by a different team, a different measuring system, in a different location, on multiple trials. For computational experiments, this means that an independent group can obtain the same result using artifacts which they develop completely independently.

7. https://www.enhancer.ch Let’s simplify... Peng, R. D. (2011). Reproducible research in computational science. Science (New York, Ny), 334(6060), 1226.

8. What is the user story we have in mind?
A typical researcher:
● develops on a personal server
● changes code and data as the research progresses
● finally gets publishable results, sometimes running on a large-scale research IT infrastructure
● prepares slides / images / tables / manuscript
● publishes the manuscript at the end of a review process

9. What is the user story we have in mind?
Researcher-side recommendations for Open Science:
● Share data, software, workflows and other digital artifacts.
● Persistent links should appear in the published article for data, code, and digital artifacts.
● Citation should be standard practice, to enable credit for shared digital scholarly objects.
● Document digital scholarly artifacts, to facilitate reuse.
● Use open licensing when publishing digital scholarly objects.

10-12. What does this mean for a service provider?
● "Reproducible Data Analysis as a service" implies looking at the full stack of the service*:
○ infrastructure + tools + competences + policies + best practices + support
● Why?
○ understand the user side: anticipate issues; steer adoption and development; enforce policies; plan resources better.
● And at the end?
○ we become a valuable asset for a research group
○ we actually help them
* I know, I'm intentionally skipping the business aspect of this...

13. https://www.enhancer.ch Let’s build the infrastructure what container technology orchestration integration with resource management storage for data and container’s images deployment and management monitoring

14. https://www.enhancer.ch Let’s build the infrastructure validation and verification automated policies scanning signing https://www.docker.com https://www.enhancer.ch/pipeline

15. https://www.enhancer.ch Let’s not stop here what to consider ● Automated build / integration with CD/CI ● design strategies ● naming schema ● Path binding ● documentation, metadata and runner script building containers for/with end-users competences ● version control - CD/CI ● container build process opportunities ● development best practises ● embed policies ● standardise assumptions

16. Container design strategies: see https://www.enhancer.ch/pipeline

17. One more step: what do we put inside the container?
(see https://nbis-reproducible-research.readthedocs.io/en/course_1811/tutorial_intro/)
What to consider:
● tracking software dependencies
● in-container executions
Competences:
● tracking requirements in software development
● software deployment, CI/CD
Opportunities:
● end-user best practices
● better handling of software dependencies
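
One common way to track software dependencies inside a container (a convention, not necessarily EnhanceR's) is to snapshot the exact installed package versions at build time, so the record ships with the image. A minimal Python sketch; the lock-file path is an assumption:

```python
"""Record the exact Python environment inside a container at build time."""
import subprocess
import sys

def write_lockfile(path: str = "/opt/environment.lock") -> None:
    # 'pip freeze' captures exact installed versions; conda or apt
    # manifests would be captured analogously for those ecosystems.
    frozen = subprocess.run(
        [sys.executable, "-m", "pip", "freeze"],
        capture_output=True, text=True, check=True,
    ).stdout
    with open(path, "w") as fh:
        fh.write(frozen)
    print(f"Recorded {len(frozen.splitlines())} pinned packages in {path}")

if __name__ == "__main__":
    write_lockfile()
```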

18. Open questions
Infrastructure / Pull:
● which containers shall I allow on my infrastructure?
● how do I make sure a cited container is exactly what I'm getting?
● how do I verify and validate containers when we deploy them on our infrastructure?
● how do I know what the container is doing?
● how do I know whether the container has the latest security patches?
Run:
● how do I make sure a deployed container runs "as documented" on my data?
● how do I find a container that I need for running RNAseq?
Build:
● what assumptions can I make when building a container, and what should I try to avoid? (data mapping in and out, user privileges, ...)
● where do I publish my container, and how do I get a DOI for the publication?
● how do I publish my container so that people can find it for their purposes? (metadata)
● how do I describe/document my container's behaviour?
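
For the metadata and documentation questions above (finding a container, describing its behaviour), one existing mechanism is OCI image labels, which can be read back at deploy time. A sketch assuming a hypothetical image reference; the label keys are the standard org.opencontainers.image.* annotation keys:

```python
import json
import subprocess

out = subprocess.run(
    ["docker", "inspect", "--format", "{{json .Config.Labels}}",
     "registry.example.org/lab/rnaseq:1.0"],  # hypothetical reference
    capture_output=True, text=True, check=True,
)
labels = json.loads(out.stdout) or {}

# Standard OCI annotation keys a well-documented image might carry.
for key in ("org.opencontainers.image.title",
            "org.opencontainers.image.description",
            "org.opencontainers.image.source",
            "org.opencontainers.image.revision"):
    print(f"{key}: {labels.get(key, '<not set>')}")
```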

19. Main challenges
● Social:
○ adoption by end-users
○ how to address "is it worth the investment?"
● Technical:
○ scale-out / orchestration
○ integration of specialised resources (e.g. GPUs)
○ multi-tenancy and privileges
○ documented assumptions within the containers
○ maintenance (bug fixes and security)
○ portability vs. performance

20. Acknowledgments
● Guidelines for pipeline interoperability using containers: https://www.enhancer.ch/pipeline
● Survey for Research IT Infrastructure providers: https://forms.gle/JBW78qDPWabd4GDR8
