Architecture
OpenPipeline is a pipeline for processing multimodal single-cell data that scales to large numbers of samples. Covering the architecture requires explaining several angles: what the expected inputs and outputs of each workflow are, how the workflows relate to each other, and what the state of the data is at each step of the pipeline. Below is an overview of the general steps involved in processing sequencing data into a single integrated object. Each of these steps is discussed in more detail further below.
- Ingestion: Convert raw sequencing data or count tables into MuData objects for further processing.
- Splitting modalities: Creating several MuData objects, one per modality, out of a multimodal input sample.
- Unimodal Single Sample Processing: tools applied to each modality of each sample individually. This mostly involves separating true cells from false cells.
- Unimodal Multi Sample Processing: steps that require information from all samples together. Processing is still performed per-modality.
- Merging: Creating one MuData object from several unimodal MuData input files.
- Initializing Integration: Performs dimensionality reduction and cell type clustering on non-integrated samples. These are popular steps that would otherwise be executed manually, and they also provide input for downstream integration methods.
- Integration: The alignment of cell types across samples. Can be performed per modality or based on multiple modalities.
- Downstream Processing: Extra analyses performed on the integrated dataset and conversion to other file formats.
Available workflows
The structure of the sections laid out below follows a logical grouping of the processing according to the state of the data. However, even though this grouping makes sense from a data perspective, it does not mean that a workflow exists for each section. For example, the processing a single sample section describes how processing of a single sample is performed as part of the full pipeline, but there is no singlesample workflow that a user can execute. The inverse is also possible: while a multisample pipeline does exist, its functionality is not limited to what is described in the Multisample Processing section. This section lists all the available workflows and describes how they link to the relevant sections below.
Ingestion workflows
All of the following workflows from the ingestion namespace have been discussed in more detail in the ingestion section:
Multiomics workflows
There exists no singlesample workflow. However, the prot_singlesample and rna_singlesample pipelines do exist, and they map identically to the functionality described in the single-sample antibody capture processing and single-sample gene expression processing sections respectively. If you would like to process your samples as described in the unimodal single sample processing section, you can execute both workflows in tandem for the two modalities.
Contrary to the workflows for single-sample processing, there does exist a multiomics/multisample workflow. However, this workflow is not simply the multiomics/prot_multisample and multiomics/rna_multisample workflows combined. Instead, it combines the multiomics/prot_multisample, multiomics/rna_multisample and multiomics/integration/initialize_integration workflows. The purpose of this pipeline is to provide an extra ‘entrypoint’ into the full pipeline that skips the single-sample processing, allowing samples that have already been processed to be reprocessed. A popular use case is to manually select one or more cell types which need to be processed again, or to integrate observations from multiple experiments into a single dataset. Keep in mind that concatenation is not included in the multisample pipeline, so when multiple input files are specified they are processed in parallel. If you would like to integrate multiple experiments, you need to first concatenate them in a separate step:
The “full” pipeline
The name of this pipeline is a bit of a misnomer, because it does not include all the steps from ingestion to integration. As discussed in the ingestion section, which ingestion strategy you need depends on your technology provider and the chosen platform. For integration, many methods and combinations of methods exist, and you may wish to choose which integration methods are applicable to your use case. As a consequence, these two stages in the analysis of single-cell data need to be executed separately and not as part of a single unified pipeline. All other steps outlined below, on the other hand, are included in the “full” pipeline, which can therefore be summarized in the following figure:
Integration workflows
For each of the integration methods (and their optional combination with other tools), a separate pipeline is defined. More information about each of the pipelines is available in the integration methods section.
Important dataflow components
While most components included in openpipelines are involved in data analysis, the sole purpose of other components is to facilitate data flow throughout the pipelines. In a workflow, the output of a component is written to disk after it has finished its task and is read back in by the next component. However, the relation between one component and the next is not always a clear one-to-one relationship. For example, some tools are capable of analyzing a single sample, while others require the input of all samples together. Additionally, not only are the input requirements of tools a limiting factor; performance also needs to be taken into account. Tasks that are performed on each sample separately can be executed in parallel, which is not possible when the same task is performed on a single file that contains the data for all samples. In order to facilitate one-to-many or many-to-one relations between components and to allow parallel execution of tasks, components that are specialized in dataflow were implemented.
Splitting modalities
We speak of splitting modalities when a multimodal MuData file is split into several unimodal MuData files. The number of output files is equal to the number of modalities present in the input file. Splitting the modalities works both on MuData files containing data for multiple samples and on single-sample files.
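To illustrate the idea, here is a minimal sketch of splitting using the mudata Python library directly; the file names are hypothetical, and the actual component may differ in its details.

```python
import mudata as md

# Minimal sketch: split a multimodal MuData file into one file per modality.
# File names are hypothetical examples.
mdata = md.read_h5mu("sample.h5mu")
for mod_name, adata in mdata.mod.items():
    unimodal = md.MuData({mod_name: adata})
    unimodal.write_h5mu(f"sample.{mod_name}.h5mu")
```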
Merging of modalities
Merging refers to combining multiple unimodal files into a single output file that contains all input modalities. It is the inverse operation of splitting the modalities.
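Conceptually, merging can be sketched with the mudata library as follows; file and modality names are again hypothetical, and the actual merge component may behave differently.

```python
import mudata as md

# Minimal sketch: merge unimodal MuData files back into one multimodal object.
rna = md.read_h5mu("combined.rna.h5mu")
prot = md.read_h5mu("combined.prot.h5mu")
merged = md.MuData({"rna": rna.mod["rna"], "prot": prot.mod["prot"]})
merged.write_h5mu("combined.h5mu")
```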
Concatenation of samples
Joining the observations of different samples, each stored in their respective MuData file, into a single MuData file for all samples together is called sample concatenation. In practice, this operation is performed for each modality separately. An extra column (with default name sample_id) is added to the annotation of the observations (.obs) to indicate which sample each observation originated from.
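As an illustration, per-modality concatenation with a sample_id column can be sketched with anndata's concat function; the sample names are hypothetical, and the concat component itself performs additional bookkeeping.

```python
import anndata as ad
import mudata as md

# Minimal sketch: concatenate the RNA modality of two samples and record
# the sample of origin in .obs["sample_id"]. Sample names are hypothetical.
samples = {
    "sample_a": md.read_h5mu("sample_a.h5mu"),
    "sample_b": md.read_h5mu("sample_b.h5mu"),
}
rna = ad.concat(
    {sid: m.mod["rna"] for sid, m in samples.items()},
    label="sample_id",  # name of the new .obs column
    join="outer",
)
```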
Special care must be taken when considering annotations for observations and features while concatenating the samples. Indeed, the data from different samples can contain conflicting information. Openpipeline’s concat component provides an argument other_axis_mode that allows a user to specify what happens when conflicting information is found. The move option for this argument is the default behavior. In this mode, each annotation column (from .obs and .var) is compared across samples. When no conflicts are found or the column is unique to a sample, the column is added to the output object. When a conflict does occur, the columns are gathered from all samples and stored in a dataframe. This dataframe is then stored in .obsm for observation annotations and in .varm for feature annotations. This way, a user can inspect the conflicts and decide what to do with them.
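The simplified pandas sketch below shows the idea behind the move behavior for feature annotations; column names and values are made up, and the real component handles many more edge cases.

```python
import pandas as pd

# Simplified sketch of the "move" behaviour for .var annotations.
# Columns that agree across samples stay in the merged .var; conflicting
# columns are gathered per sample and stored in .varm instead.
var_a = pd.DataFrame({"feature_type": ["Gene", "Gene"], "gene_symbol": ["A", "B"]},
                     index=["g1", "g2"])
var_b = pd.DataFrame({"feature_type": ["Gene", "Gene"], "gene_symbol": ["A", "C"]},
                     index=["g1", "g2"])

conflicting = [c for c in var_a.columns if not var_a[c].equals(var_b[c])]
kept = var_a.drop(columns=conflicting)                       # ends up in the merged .var
moved = pd.concat({"sample_a": var_a[conflicting],
                   "sample_b": var_b[conflicting]}, axis=1)  # ends up in .varm
```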
Ingestion
Ingestion is the conversion of raw sequencing data or count tables into MuData objects that can be used for further processing.
Demultiplexing refers to a two-step process:
- The conversion of the binary base call (BCL) files, output by the sequencing machines, into the text-based FASTQ format.
- The sorting of reads into different FASTQ files for different libraries pooled together into a single sequencing run.
In order to perform demultiplexing, several tools have been made available in the demux workflow, where the --demultiplexer argument can be used to select the demultiplexer of your choice. Currently, three options are available:
- bcl2fastq(2): a legacy tool from Illumina that has been replaced by BCL Convert
- BCL Convert: general demultiplexing software by Illumina.
- Cellranger’s mkfastq: a wrapper around BCL Convert that provides extra convenience features for the processing of 10X single-cell data.
The alignment of reads from the FASTQ files to an appropriate genome reference is called mapping. The result of the mapping process is a set of tables that count the number of times a read has been mapped to a certain feature, together with metadata for the cells (observations) and features. There are different formats that can be used to store this information together. Since OpenPipeline uses MuData as a common file format throughout its pipelines, a conversion to MuData is included in the mapping pipelines. The choice between workflows for mapping depends on your single-cell library provider and technology:
- For BD Rhapsody libraries, the BD Rhapsody pipeline can be used.
- For 10X based libraries, either cellranger count or cellranger multi is provided. For more information about the differences between the two and when to use which mapping software, please consult the 10X genomics website.
Creating a transcriptomics reference
Mapping reads from the FASTQ files to features requires a reference, which needs to be provided to the mapping component. Depending on the use case, you might even need to provide references specific to the modalities that you are trying to analyze. For gene expression data, the reference is a reference genome, together with its appropriate gene annotation. A genome reference is often indexed in order to improve the mapping speed. Additionally, some mapping frameworks provided by the single-cell technology providers require extra preprocessing of the reference before it can be used with their workflow. OpenPipelines provides a make_reference workflow that allows you to create references in many formats which can be used to map your reads to.
Processing a single sample
Some processing can (or must) be performed without concatenating samples together. Even when there is a choice, adding a tool to the single-sample processing is preferred, because multiple samples can then be processed in parallel, improving the processing speed. In general, the processing is modality specific, meaning that a multimodal sample is first split into its unimodal counterparts. As described in the multi-sample processing section, the resulting files are not merged back together after the single-sample processing is done. Instead, the output files for all samples are gathered per modality and concatenated to create a multi-sample unimodal object.
Single-sample Gene Expression Processing
Single-sample gene expression processing involves two steps: removing cells based on count statistics and flagging observations originating from doublets.
The removal of cells based on basic count statistics is split into two parts: first, cells are flagged for removal by filter_with_counts. It flags observations based on several thresholds (a rough sketch follows the list below):
- The number of genes that have at least a single count. Both a minimum and a maximum number of genes per cell can be specified; cells falling outside this range are flagged for removal.
- The percentage of read counts that originate from mitochondrial genes. Cells can be filtered based on either a maximum or a minimum fraction of mitochondrial counts.
- The minimum or maximum total number of counts captured per cell. Cells with 0 total counts are always removed.
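The scanpy-based sketch below gives an idea of how cells could be flagged on such thresholds; the thresholds, column name and True/False convention used here are assumptions, not the component's actual defaults.

```python
import scanpy as sc

# Rough sketch: compute QC metrics and flag cells for removal based on
# example thresholds. Thresholds and column names are assumptions.
adata = sc.read_h5ad("sample.rna.h5ad")  # hypothetical input file
adata.var["mito"] = adata.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(adata, qc_vars=["mito"], percent_top=None, inplace=True)

remove = (
    (adata.obs["n_genes_by_counts"] < 200)
    | (adata.obs["n_genes_by_counts"] > 6000)
    | (adata.obs["pct_counts_mito"] > 20)
    | (adata.obs["total_counts"] == 0)
)
adata.obs["filter_with_counts"] = remove  # True = flagged for removal (assumption)
```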
Flagging cells for removal involves adding a boolean column to the .obs dataframe. After the cells have been flagged for removal, they are actually filtered using do_filter, which reads the values in .obs and removes the cells labeled True. This applies the general philosophy of “separation of concerns”: one component is responsible for labeling the cells, another for removing them. This keeps the codebase for a single component small and its functionality testable.
The next and final step in the single-sample gene expression processing is doublet detection using filter_with_scrublet. Like filter_with_counts, it will not remove cells but adds a column to .obs (which has the name filter_with_scrublet by default). The single-sample GEX workflow will not remove the flagged cells during processing (hence no do_filter). However, you can choose to remove them yourself before doing your analyses by applying a filter with the column in .obs yourself.
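If you do want to drop the flagged doublets yourself, a minimal sketch could look like this, assuming the default column name and that True marks cells flagged for removal, following the convention described above.

```python
import mudata as md

# Minimal sketch: remove cells flagged by scrublet from the RNA modality.
# Assumes the default column name and that True marks flagged cells.
mdata = md.read_h5mu("sample.h5mu")  # hypothetical file name
rna = mdata.mod["rna"]
rna_filtered = rna[~rna.obs["filter_with_scrublet"]].copy()
```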
Single-sample Antibody Capture Processing
The process of filtering antibody capture data is similar to the filtering in the single-sample gene-expression processing, but without doublet detection. In some particular cases you can use your ADT data to perform doublet detection, for example using cell-type markers. More information can be found in the single-cell best practices book.
Multisample Processing
After the processing of individual samples has been concluded, samples can be concatenated for further processing. As with the single-sample processing, the multisample processing is not performed on multimodal objects but on each modality separately, in order to tailor the processing to the specific modality in question. This means that the results from the single-sample processing are concatenated per modality to create unimodal multisample objects. After processing each modality, all of the modalities can finally be merged and a single object is created that is ready for integration.
Multisample Gene Expression Processing
Processing multisample gene expression involves the following steps (a sketch follows the list below):
- Normalization: Normalization aims to adjust the raw counts in the dataset for variable sampling effects by scaling the observable variance to a specified range. There are different ways to transform the data; the default normalization makes sure each observation (cell) has a total count equal to the median of the total counts over all cells before normalization.
- Log transformation: Calculates \(X = \ln(X + 1)\), which converts multiplicative relative changes to additive differences. This allows for interpreting the gene expression in terms of relative, rather than absolute, abundances of genes.
- Highly variable gene detection: Detects genes that have a large change in expression between samples. By default, OpenPipeline uses the method from Seurat (Satija et al.). As with other “filtering” components, the filter_with_hvg component does not remove features, but rather annotates genes of interest by adding a boolean column to .var.
- QC metric calculations
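A rough sketch of these steps using scanpy equivalents is shown below; OpenPipeline runs its own components for each step, so the parameters and column names here are assumptions.

```python
import scanpy as sc

# Rough scanpy-based sketch of the multisample gene expression steps.
adata = sc.read_h5ad("concatenated.rna.h5ad")        # hypothetical input
sc.pp.normalize_total(adata, target_sum=None)        # scale each cell to the median total count
sc.pp.log1p(adata)                                   # X = ln(X + 1)
sc.pp.highly_variable_genes(adata, flavor="seurat")  # adds a boolean .var["highly_variable"]
sc.pp.calculate_qc_metrics(adata, inplace=True)      # per-cell and per-gene QC metrics
```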
Multisample Antibody Capture Processing
Processing the ADT modality for multiple samples
Integration
Dimensionality Reduction
scRNA-seq is a high-throughput sequencing technology that produces datasets with high dimensionality in both the number of cells and the number of genes. While this data should provide more information, it also contains more noise and redundant information, making it harder to distill the useful information. The number of genes and cells can already be reduced by gene filtering, but further reduction is a necessity for downstream analysis. Dimensionality reduction projects high-dimensional data into a lower-dimensional space (like taking a photo (2D) of some 3D structure). The lower-dimensional representation still captures the underlying information of the data, while having fewer dimensions.
Several dimensionality reduction methods have been developed and applied to single-cell data analysis. Two of these are applied in OpenPipeline:
- Principal Component Analysis (PCA): PCA reduces the dimension of a dataset by creating a new set of variables (principal components, PCs) from a linear combination of the original features in such a way that they are as uncorrelated as possible. The PCs can be ranked by the amount of variability in the original dataset that they explain. By keeping the top n PCs, the PCs with the lowest variance are discarded, effectively reducing the dimensionality of the data without losing much information (a brief sketch follows this list).
- Uniform manifold approximation and projection (UMAP): a non-linear dimensionality reduction technique. It constructs a high-dimensional graph representation of the dataset and optimizes a low-dimensional graph representation to be structurally as similar as possible to the original graph. In a review by Xiang et al., 2021, it showed the highest stability and best separated the original cell populations.
- t-SNE is another popular non-linear, graph-based dimensionality reduction technique that is very similar to UMAP, but it has not yet been implemented in OpenPipeline.
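As a quick illustration of PCA in this context, the scanpy call below computes the top principal components and shows how much variance each explains; the number of components is an arbitrary choice.

```python
import scanpy as sc

# Illustration only: compute the top 50 principal components and inspect
# the fraction of variance explained by the first few.
adata = sc.read_h5ad("processed.rna.h5ad")  # hypothetical input
sc.pp.pca(adata, n_comps=50)
print(adata.uns["pca"]["variance_ratio"][:10])
```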
Initializing integration
As will be described in more detail later on, many integration methods exist and therefore there is no single integration which is executed by default. However, there are common tasks which are run before integration, either because they provide required input for many downstream integration methods or because they are popular steps that would otherwise be done manually. These operations are executed by default when using the “full” pipeline as part of the initialize_integration subworkflow.
PCA is used to reduce the dimensionality of the dataset as described previously. Find Neighbors and Leiden Clustering are useful for the identification of cell types or states in the data. Here we apply a popular approach: first calculate a neighborhood graph on a low-dimensional representation of the data, and then cluster the data based on similarity between data points. Finally, UMAP allows us to visualise the clusters by reducing the dimensionality of the data while still providing an accurate representation of the underlying cell population structure.
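Expressed with scanpy equivalents, the chain of operations roughly looks like this; OpenPipeline wraps each step in its own component, so slot names and defaults here are assumptions.

```python
import scanpy as sc

# Rough sketch of the initialize_integration chain with scanpy equivalents.
adata = sc.read_h5ad("multisample.rna.h5ad")   # hypothetical input
sc.pp.pca(adata, n_comps=50)                   # dimensionality reduction
sc.pp.neighbors(adata, use_rep="X_pca")        # neighborhood graph on the PCA space
sc.tl.leiden(adata, key_added="leiden")        # graph-based clustering
sc.tl.umap(adata)                              # 2-D embedding for visualisation
```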
Integration Methods
Integration is the alignment of cell types across samples. There exist three different types of integration methods, based on the degree of integration across modalities:
- Unimodal integration across batches. For example: scVI, scanorama, harmony
- Multimodal integration across batches and modalities. Can be used to integrate joint-profiling data where multiple modalities are measured. For example: totalVI
- Mosaic integration: data integration across batches and modalities where not all cells are profiled in all modalities and it may be the case that no cells contain profiles in all integrated modalities. Mosaic integration methods have not been made available in OpenPipeline yet. An example of a tool that performs mosaic integration is StabMap.
In any of the three cases, concatenated samples are required, and merged modalities are preferred. A plethora of integration methods exists, which in turn interact with other functionality (like clustering and dimensionality reduction methods), generating a large number of possible use cases which one pipeline cannot cover in an easy manner. Therefore, there is no single integration step that is part of a global pipeline which is executed by default. Instead, a user can choose from the integration workflows provided, and ‘stack’ integration methods by adding the outputs to different output slots of the MuData object. The following sections describe the integration workflows that are available in OpenPipeline.
Unimodal integration
For unimodal integration, scVI, scanorama and harmony have been added to the scvi_leiden, scanorama_leiden, and harmony_leiden workflows respectively. After executing the integration methods themselves, Find Neighbors and Leiden Clustering are run on the results of the integration, as well as UMAP, in order to be able to visualise the results. The functioning of these components has already been described here.
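As an illustration of the unimodal case, the sketch below uses scanpy's wrapper around Harmony followed by the same neighbors/Leiden/UMAP steps; it is a stand-in for the harmony_leiden workflow, and the slot and column names are assumptions.

```python
import scanpy as sc

# Illustrative stand-in for the harmony_leiden workflow; slot names are assumptions.
adata = sc.read_h5ad("multisample.rna.h5ad")              # hypothetical input with .obsm["X_pca"]
sc.external.pp.harmony_integrate(adata, key="sample_id")  # writes .obsm["X_pca_harmony"]
sc.pp.neighbors(adata, use_rep="X_pca_harmony")
sc.tl.leiden(adata, key_added="leiden_harmony")
sc.tl.umap(adata)
```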
Multimodal Integration
A single multimodal integration method is currently available in OpenPipeline: totalVI. It allows using information from both the gene-expression data and the antibody-capture data together to integrate the cell types. As with the other integration workflows, after running totalVI, Find Neighbors, Leiden Clustering and UMAP are run on the result. However, in this case the three components are executed on both of the integrated modalities.
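For intuition, the scvi-tools API for totalVI looks roughly like this; the obsm key holding the protein counts, the batch key and the output slot name are assumptions, and the workflow's actual wiring may differ.

```python
import anndata as ad
import scvi

# Rough sketch of running totalVI with scvi-tools; file and key names are assumptions.
adata = ad.read_h5ad("multisample.rna.h5ad")  # RNA counts, with ADT counts stored in .obsm
scvi.model.TOTALVI.setup_anndata(
    adata,
    protein_expression_obsm_key="protein_counts",  # hypothetical .obsm key with ADT counts
    batch_key="sample_id",
)
model = scvi.model.TOTALVI(adata)
model.train()
adata.obsm["X_totalVI"] = model.get_latent_representation()
```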