
Containerized Pipelines

Jeffrey K Gillan edited this page Apr 15, 2024 · 51 revisions

What's a Pipeline?

A 'pipeline' is an automated sequence of data processing steps in which the output of one step becomes the input of the next. Throughout this course we have presented many Python scripts to automate geospatial data processing. A script with several sequential steps could be considered a 'pipeline'.
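The idea of chaining steps can be sketched with ordinary command line tools, where each step reads the previous step's output file. The file names and the "processing" below are hypothetical placeholders, not part of any real analysis:

```shell
#!/bin/sh
# Minimal pipeline sketch: each step's output is the next step's input.
# (All file names and thresholds here are made up for illustration.)

# Step 1: generate some raw "observations"
printf 'plot_a 3\nplot_b 7\nplot_c 5\n' > observations.txt

# Step 2: filter observations whose value exceeds a threshold
awk '$2 > 4' observations.txt > filtered.txt

# Step 3: summarize the filtered results (count of remaining rows)
wc -l < filtered.txt > summary.txt

cat summary.txt
```

In a real geospatial workflow, each step might instead be a Python or R script, but the shape is the same: a fixed sequence where data flows from one step into the next.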


For the purposes of this lesson, we are interested in connecting sequential processing steps across multiple types of software. For example, we may want to use a combination of command line tools, Python, JavaScript, and R to go from the first to the last step of our pipeline. Fortunately, there are several options for linking disparate software together into a sequential pipeline. This gives us the power to automate and combine our favorite software even when the tools come from different ecosystems.



Open & Reproducible Code

Sharing your scientific analysis code with your colleagues is an act of collaboration that will help push your field forward. There are, however, technical challenges that may prevent your colleagues from effectively running the code on their own computer. These include:

  • hardware: CPUs, GPUs, RAM
  • Operating System: Linux, MacOS, Windows
  • Software version: R, Python, etc
  • Library versions and dependencies

How do we make it easier to share analysis code and avoid the challenges of computer and environment setups?


What are containers?

A container is a standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another. A container image is a lightweight, standalone, executable package of software that includes everything needed to run an application: code, runtime, system tools, system libraries, and settings. Each of these elements is specifically versioned and does not change. The user does not need to install the software in the traditional sense.
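As a sketch of how an image bundles code plus pinned dependencies, a `Dockerfile` might look like the following. The base image, library versions, and script name are hypothetical placeholders, not a recipe from this course:

```dockerfile
# Hypothetical Dockerfile: pin a specific runtime so the environment never drifts
FROM python:3.11-slim

# Install exact library versions for reproducibility
RUN pip install --no-cache-dir rasterio==1.3.9 geopandas==0.14.3

# Copy the analysis script into the image
COPY process_imagery.py /app/process_imagery.py
WORKDIR /app

# Default command executed when the container runs
CMD ["python", "process_imagery.py"]
```

Anyone who builds or pulls this image gets the same Python version and the same library versions, regardless of what is installed on their own machine.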

A useful analogy is to think of software containers as shipping containers, which let us move cargo (software) around the world in a standardized way. A shipping container can be offloaded and put to use anywhere, as long as the destination has a shipping port (i.e., a container engine such as Docker).



Containers are similar to virtual machines (VMs), but are smaller and easier to share. A key distinction between containers and VMs is what is inside each environment: a VM must include an entire guest OS within its image, whereas containers rely on the host OS and a container engine (e.g., Docker Engine).



Containers for Reproducible Science

Software containers, such as those managed by Docker or Apptainer (formerly Singularity), are incredibly useful for reproducible science for several reasons:

Environment Consistency:

Containers encapsulate the software environment, ensuring that the same versions of software, libraries, and dependencies are used every time, reducing the "it works on my machine" problem.

Ease of Sharing:

Containers can be easily shared with other researchers, allowing them to replicate the exact software environment used in a study.

Platform Independence:

Containers can run on different operating systems and cloud platforms, allowing for consistency across different hardware and infrastructure.

Version Control:

Containers can be versioned, making it easy to keep track of changes in the software environment over time.

Scalability:

Containers can be easily scaled and deployed on cloud infrastructure, allowing for reproducible science at scale.

Isolation:

Containers isolate the software environment from the host system, reducing the risk of conflicts with other software and ensuring a clean and controlled environment.


Docker

The most common container software is Docker, a platform for developers and sysadmins to develop, deploy, and run applications with containers. Apptainer (formerly Singularity) is another popular container engine, which allows you to deploy containers on HPC clusters.

Docker Hub is the world's largest repository of container images. Think of it as the 'GitHub' of container images. It facilitates collaboration among developers and allows you to share your container images with the world. Docker Hub also lets users maintain different versions of container images.
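Sharing an image through Docker Hub can be sketched with the commands below. The image name, tag, and username are hypothetical, and the commands require a Docker installation and a Docker Hub account, so they are illustrative rather than meant to be run as-is:

```shell
# Pull a specific, versioned image from Docker Hub
docker pull ubuntu:22.04

# Tag a locally built image under a Docker Hub username (hypothetical names)
docker tag my-analysis:latest mydockerhubuser/my-analysis:v1.0

# Log in and push the versioned image so colleagues can pull the exact same environment
docker login
docker push mydockerhubuser/my-analysis:v1.0
```

Because each push is tagged (here `v1.0`), collaborators can pull the precise version of the environment used in a study, not just "latest".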



Containerized Pipelines

Depending on the goals of your geospatial analysis, you may need to use different software at different stages of the pipeline. For example, you may need to use Python for the first step, R for the second step, and a command line tool for the third step. Using software such as Docker Compose, we can link all of these steps together into a single automated workflow.
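A Python-then-R-then-CLI pipeline of that shape could be sketched in a `docker-compose.yml` like the one below. The service names, images, scripts, and file paths are hypothetical placeholders, not a working configuration from this course:

```yaml
# Hypothetical docker-compose.yml: three containers, run in sequence,
# sharing a ./data directory so each step can read the previous step's output.
services:
  step1-python:
    image: python:3.11-slim
    volumes: ["./data:/data"]
    command: python /data/extract_features.py

  step2-r:
    image: rocker/r-ver:4.3.2
    volumes: ["./data:/data"]
    command: Rscript /data/model_features.R
    depends_on:
      step1-python:
        condition: service_completed_successfully

  step3-cli:
    image: ghcr.io/osgeo/gdal:ubuntu-small-3.8.0
    volumes: ["./data:/data"]
    command: gdal_translate /data/model_output.tif /data/final_map.png
    depends_on:
      step2-r:
        condition: service_completed_successfully
```

The `depends_on` conditions ask Compose to start each step only after the previous container has exited successfully, so a single `docker compose up` would walk through the whole sequence.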

As a benchmark, it took 8 minutes to process drone imagery using GitHub Codespaces (a 16-core VM with 64 GB RAM).
