Creating components

A guide on how to create new components

A common file format

One of the core principles of OpenPipelines is to use MuData as a common data format throughout the whole pipeline. See the concepts page for more information on how OpenPipelines uses MuData to store single-cell data.

Component location

As discussed in the project structure, components in the repository are stored within src. Additionally, components are grouped into namespaces according to a common functionality. An example of such a namespace is the dimensionality reduction namespace (dimred), of which the components pca and umap are members. This means that within src, you will find the namespace folders that store the components belonging to these namespaces.

In order to create a new component in OpenPipelines, you will need to create a new folder that will contain the different elements of the component:

mkdir src/my_namespace/my_component

Tip

Take a look at the components that are already in src/! There might be a component that already does something similar to what you need.

The elements of a component

A component consists of one or more scripts that provide its functionality, together with a configuration file that holds the component's metadata. The Viash config contains the metadata of your component, the script that is used to run it, and the required dependencies. An in-depth guide on how to create components is available on the Viash website, but a few specifics and guidelines are discussed here.

The config

functionality:
  name: "my_component"
  namespace: "my_namespace"
  description: "My new custom component"
  authors:
    - __merge__: ../../authors/my_name.yaml
      roles: [ author ]
  arguments:
    - name: "--output"
      type: file
      example: "output_file.h5mu"
      description: "Location where the output file should be written to."
      direction: "output"
  resources:
    - type: python_script
      path: script.py
platforms:
  - type: docker
    image: python:3.11
    setup:
      - type: python
        packages: mudata~=0.2.3
  - type: nextflow
    directives:
      label: [highcpu, midmem]

Basic information

Each component should have a name, a namespace, a description and author information defined in the config. Because a single author can contribute to multiple components, author information used to be duplicated across components, which made it hard to maintain and prone to going out of date. Author information was therefore moved to ./src/authors. Each author has a yaml file containing the author information, and the viash __merge__ property is used to merge this information into the viash configs.

Basic information checklist:

Give the component a name
Add the component to an appropriate namespace
Add a description
Add author information
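
To illustrate the __merge__ mechanism described above, the sketch below shows what an author file and the config entry that references it might look like. The author name and every key under info are hypothetical and only serve as an example; the layout actually used in src/authors may differ.

```yaml
# --- src/authors/my_name.yaml (hypothetical author file) ---
name: Jane Doe            # placeholder name
info:
  # free-form metadata; these keys are illustrative assumptions
  links:
    github: janedoe

# --- src/my_namespace/my_component/config.vsh.yaml (excerpt) ---
# functionality:
#   authors:
#     - __merge__: ../../authors/my_name.yaml   # path relative to this config
#       roles: [ author ]                       # defined per component
```

With this setup, the author details live in a single file, while the roles stay specific to each component that references it.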

Arguments and argument groups

If your component requires arguments, they should be defined in arguments or argument_groups. Try to group individual arguments into argument_groups when the number of arguments becomes too large (10 or more as a rule of thumb).

Argument checklist:

Add a description and name
Each argument should have the appropriate type.
Input and output files should be of type file instead of string and use the appropriate direction:
If possible: add an example
If the argument can accept multiple values, add multiple: true
If the possible input for an argument is limited to a certain set of values, use choices:
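
The checklist above could translate into a config along the lines of the following sketch. The group names, argument names, descriptions and choices are made up for illustration; only the keys (argument_groups, type, direction, example, multiple, choices) are the Viash properties the checklist refers to.

```yaml
functionality:
  argument_groups:
    - name: Inputs                       # hypothetical group name
      arguments:
        - name: "--input"
          type: file                     # file, not string
          direction: input
          required: true
          example: "input_file.h5mu"
          description: "Path to the input file."
        - name: "--modality"             # hypothetical argument
          type: string
          multiple: true                 # accepts more than one value
          example: "rna"
          description: "One or more modalities to process."
    - name: Options                      # hypothetical group name
      arguments:
        - name: "--method"               # hypothetical argument
          type: string
          choices: [ "mean", "median" ]  # restrict input to a fixed set
          default: "mean"
          description: "Which summary statistic to use."
    - name: Outputs                      # hypothetical group name
      arguments:
        - name: "--output"
          type: file
          direction: output
          example: "output_file.h5mu"
          description: "Location where the output file should be written to."
```

For a component with only a handful of arguments, a flat arguments list as in the example config at the top of this page is sufficient.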

(Test)resources

Resources define files that are required for a component to perform its function. These can be scripts, but also additional files such as settings for tools you might require. Defining resources is necessary because viash needs to know what code to execute, and it has the added benefit that these resources are automatically made available regardless of the build environment. For example, resources are automatically mounted within a running docker container.

There is a difference between defining resources and test_resources. While resources are required for a component to function, test_resources only need to be included when testing the component (with for example viash test) in addition to the regular resources. Having a look at the example above, resources are defined using the resources: property. It takes a list of multiple files or folders.
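
As a minimal sketch of the difference, a config might declare both kinds of resources as shown below. The file names test.py and test_data/ are assumptions chosen for illustration, not files that exist in the repository.

```yaml
functionality:
  resources:
    # required for the component to run
    - type: python_script
      path: script.py
  test_resources:
    # only made available when testing, e.g. with viash test
    - type: python_script
      path: test.py        # hypothetical test script
    - path: test_data/     # hypothetical folder with small test files
```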

In openpipelines, it was decided not to use a service like git lfs to include large resources in the repository. Instead, if large resources are required, there are two possibilities:

* Large resources required for testing are to be uploaded to an s3 bucket that is synced automatically before running tests (both locally and on github). Please ping a maintainer when you open a PR and ask them to upload the files for you.
* Other large resources that are not needed for testing can be considered as input. This means that an argument of type: file needs to be created. The downside of this method is that viash is not able to natively support remote files f

Resources checklist:

- Script resources are located next to the config and added to the config with the correct type (python_script, r_script, …)
- Small resources (<50MB) that are not scripts can also be checked into the repo, next to the

The script file

TODO

Author information

TODO

Adding dependencies

TODO

Building components from their source

When running or testing individual components, it is not necessary to execute an extra command for the build step: viash test and viash run will build the component on the fly. However, before integrating components into a pipeline, you will need to build them. More specifically, openpipelines uses Nextflow to combine components into pipelines, so the components need to be built with at least the nextflow platform as target. The easiest method to build the components is to use:

viash ns build --parallel --setup cachedbuild

After using viash ns build, the target folder will be populated with three subfolders, corresponding to the build platforms that viash supports: native, docker and nextflow.

Building an individual component can still be useful, for example when debugging a component for which the build fails, or when you want to create a standalone executable that can be run without viash. To build an individual component, viash build can be used. Note that the default build directory of this command is output, which is not the location from which built components are imported when integrating them into pipelines. Using the --output argument, you can set it to any directory you want, for example:

viash build src/filter/do_filter/config.vsh.yaml -o target/native/filter/do_filter/ -p native

Containerization

One of the key benefits of using Viash is that containers can be created that gather dependencies per component, which avoids building one container that has to incorporate all dependencies for a pipeline together. The container for a single component can be reduced in size by defining only the minimal requirements to run the component. That being said, building containers from scratch can be labour intensive and error prone, and base containers from reputable publishers often benefit from improved reliability and security. Hence, a balance has to be struck between reducing the container's size and adding many dependencies to a small base container.

The preferred containerization setup in OpenPipelines uses the following guidelines:

Choose a base container from a reputable source and use its latest version
Do not use base containers that have not been updated in a while
Use package managers to install dependencies as much as possible
Avoid building dependencies from source

Examples of base containers that are currently being used are:

python:3.11 for python environments
ubuntu:focal for general linux environments and bash scripts
eddelbuettel/r2u:22.04 for R
nvcr.io/nvidia/pytorch:22.09-py3 for using GPU accelerated calculations using pytorch in python
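
Following these guidelines, a docker platform definition might look like the sketch below: a base image from the list above, with dependencies installed through package managers rather than built from source. The apt package procps is only an illustrative choice; the mudata pin is taken from the example config earlier on this page.

```yaml
platforms:
  - type: docker
    image: python:3.11              # base container from a reputable source
    setup:
      - type: apt                   # system packages via the OS package manager
        packages: [ procps ]        # illustrative choice
      - type: python                # python packages via pip
        packages: [ mudata~=0.2.3 ]
  - type: nextflow
    directives:
      label: [highcpu, midmem]
```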