Creating components
A common file format
One of the core principals of OpenPipelines is to use MuData as a common data format troughout the whole pipeline. See the concepts page for more information on openpipelines uses MuData to store single-cell data.
Component location
As discussed in the project structure, components in the repository are stored within src
. Additionally, components are grouped into namespaces, according to a common functionality. An example of such a namespace is the dimensionality reduction namespace (dimred
), of which the components pca
and umap
are members. This means that within src
, the namespace folders can be found that stores the components that belong to these namespaces.
In order to create a new component in OpenPipelines, you will need to create a new folder that will contain the different elements of the component:
mkdir src/my_namespace/my_component
Take a look at the components that are already in src/
! There might be a component that already does something similar to what you need.
The elements of a component
A component consists of one or more scripts that provide the functionality of the component together with metadata of the component in a configuration file. The Viash config contains metadata of your dataset, which script is used to run it, and the required dependencies. An in-depth guide on how to create components is available on the viash website, but a few specifics and guidelines will be discussed here.
The config
functionality:
name: "my_component"
namespace: "my_namespace"
description: "My new custom component"
authors:
- __merge__: ../../authors/my_name.yaml
roles: [ author ]
arguments:
- name: "--output"
type: file
example: "output_file.h5mu"
description: "Location were the output file should be written to."
direction: "output"
resources:
- type: python_script
path: script.py
platforms:
- type: docker
image: python:3.11
setup:
- type: python
packages: mudata~=0.2.3
- type: nextflow
directives:
label: [highcpu, midmem]
Basic information
Each component should have the name, a namespace, a description and author information defined in the config. Because a single author can contribute to multiple components, the author information is often duplicated across components, which was causing issues with the author information being out of date and not easy to maintain. Therefore, it was decided to move author information to ./src/authors
. Each author has a yaml
file containing the author information, and the viash __merge__
property is used to merge this information into the viash configs.
Basic information checklist:
- Give the component a name
- Add the component to an appropriate namespace
- Add a description
- Add author information
Arguments and argument groups
If you component requires arguments, they should be defined in arguments
or argument_groups
. Try tro group individual arguments into argument_groups
when the number of arguments become too larg (10 or more as a rule of thumb).
Argument checklist:
- Add a description and name
- Each argument should have the appropriate type.
- Input and output files should be of type
file
instead ofstring
and use the appropriatedirection:
- If possible: add an example
- If the argument can accept multiple values, add
multiple: true
- If the possible input for an argument is limited to certain set of values, use
choices:
(Test)resources
Resources define files that are required for a component to perform its function. These can be scripts, but also additional files like settings for tools you might require. Defining resources is both a necessity because viash needs to know what code to execute, but defining resources also has the added benefit that these resources are automatically made available, regardless of the build environment. For example: resources are automatically mounted within a running docker container.
There is a difference between defining resources
and test_resources
. While resources are required for a component to function, test_resources
only need to be included when testing the component (with for example viash test
) in addition to the regular resources. Having a look at the example above, resources are defined using the resources:
property. It takes a list of multiple files or folders.
In openpipelines, it was decided to not use a service like git lfs
to include large resources into the repository. Instead, if large resources are required, there are two possibilities: * Large resources required for testing are to uploaded into an s3
bucket that is synced automatically before running tests (both locally and on github). Please ping a maintainer when you open a PR and ask them to upload the files for you. * Other large resources that are not needed for testing can be considered as input. This means that an argument of type: file
needs to be created. The downside of this method is that viash is not able to natively support remote files f
Resources checklist: - Script resources are located next to the config and added to the config with the correct type (python_script
, r_script
, …) - Small resources (<50MB) that are not scripts can also be checked in into the repo, next to the
The script file
TODO
Adding dependencies
TODO
Building components from their source
When running or testing individual components, it is not necessary to execute an extra command to run the build step, viash test
and viash run
will build the component on the fly. However, before integrating components into a pipeline, you will need to build the components. More specifically, openpipelines uses Nextflow to combine components into pipelines, so we need to have at least the components build for nextflow
platform as target. The easiest method to build the components is to use:
viash ns build --parallel --setup cachedbuild
After using viash ns build
, the target folder will be populated with three subfolders, corresponding to the build platforms that viash supports: native
, docker
and nextflow
.
Building an individual component can still be useful, for example when debugging a component for which the build fails or if you want to create a standalone executable for a component to execute it without the need to use viash
. To build an individual component, viash build
can be used. Note that the default build directory of this viash base command is output
, which is not the location where build components will be imported from when integrating them in pipelines. Using the --output
argument, you can set it to any directory you want, for example:
viash build src/filter/do_filter/config.vsh.yaml -o target/native/filter/do_filter/ -p native
Containerization
One of the key benefits of using Viash is that containers can be created that gather dependencies per component, which avoids building one container that has to encorporate all dependencies for a pipeline together. The containers for a single component can be reduced in size, defining the minimal requirements to run the component. That being said, building containers from scratch can be labour intensive and error prone, with base containers from reputable publishers often benefiting from improved reliability and security. Hence, a balance has to be made between reducing the container’s size and adding many dependencies to a small base container.
The preferred containerization setup in OpenPipelines uses the following guidelines:
- Choose a base container from a reputable source and use its latest version
- Do not use base containers that have not been updated in a while
- Use package managers to install dependencies as much as possible
- Avoid building depdencies from source.
Examples of base containers that are currently being used are:
- python:3.11 for python environments
- ubuntu:focal for general linux environments and bash scripts
- eddelbuettel/r2u:22.04 for R
- nvcr.io/nvidia/pytorch:22.09-py3 for using GPU accelerated calculations using pytorch in python