Scvi

Performs scVI integration as done in the Human Lung Cell Atlas (https://github.com/LungCellAtlas/HLCA).

Info

ID: scvi
Namespace: integrate

Links

Source

Example commands

You can run the pipeline using nextflow run.

View help

You can use --help as a parameter to get an overview of the possible parameters.

nextflow run openpipelines-bio/openpipeline \
  -r 2.1.1 -latest \
  -main-script target/nextflow/integrate/scvi/main.nf \
  --help

Run command

Example of params.yaml

# Inputs
input: # please fill in - example: "path/to/file"
modality: "rna"
# input_layer: "foo"
obs_batch: "sample_id"
# var_gene_names: "foo"
# var_input: "foo"
# obs_labels: "foo"
# obs_size_factor: "foo"
# obs_categorical_covariate: ["foo"]
# obs_continuous_covariate: ["foo"]

# Outputs
# output: "$id.$key.output"
# output_model: "$id.$key.output_model"
# output_compression: "gzip"
obsm_output: "X_scvi_integrated"

# SCVI options
n_hidden_nodes: 128
n_dimensions_latent_space: 30
n_hidden_layers: 2
dropout_rate: 0.1
dispersion: "gene"
gene_likelihood: "nb"

# Variational auto-encoder model options
use_layer_normalization: "both"
use_batch_normalization: "none"
encode_covariates: true
deeply_inject_covariates: false
use_observed_lib_size: false

# Early stopping arguments
# early_stopping: true
early_stopping_monitor: "elbo_validation"
early_stopping_patience: 45
early_stopping_min_delta: 0.0

# Learning parameters
# max_epochs: 123
reduce_lr_on_plateau: true
lr_factor: 0.6
lr_patience: 30.0

# Data validation
n_obs_min_count: 0
n_var_min_count: 0

# Nextflow input-output arguments
publish_dir: # please fill in - example: "output/"
# param_list: "my_params.yaml"

# Arguments

nextflow run openpipelines-bio/openpipeline \
  -r 2.1.1 -latest \
  -profile docker \
  -main-script target/nextflow/integrate/scvi/main.nf \
  -params-file params.yaml

Note

Replace -profile docker with -profile podman or -profile singularity depending on the desired backend.

Argument groups

Inputs

Name	Description	Attributes
`--input`	Input h5mu file	`file`, required
`--modality`		`string`, default: `"rna"`
`--input_layer`	Input layer to use. If None, X is used	`string`
`--obs_batch`	Column name discriminating between your batches.	`string`, default: `"sample_id"`
`--var_gene_names`	.var column containing gene names. By default, use the index.	`string`
`--var_input`	.var column containing highly variable genes. By default, do not subset genes.	`string`
`--obs_labels`	Key in adata.obs for label information. Categories will automatically be converted into integer categories and saved to adata.obs['_scvi_labels']. If None, assigns the same label to all the data.	`string`
`--obs_size_factor`	Key in adata.obs for size factor information. Instead of using library size as a size factor, the provided size factor column will be used as offset in the mean of the likelihood. Assumed to be on linear scale.	`string`
`--obs_categorical_covariate`	Keys in adata.obs that correspond to categorical data. These covariates can be added in addition to the batch covariate and are also treated as nuisance factors (i.e., the model tries to minimize their effects on the latent space). Thus, these should not be used for biologically-relevant factors that you do not want to correct for.	List of `string`, multiple_sep: `";"`
`--obs_continuous_covariate`	Keys in adata.obs that correspond to continuous data. These covariates can be added in addition to the batch covariate and are also treated as nuisance factors (i.e., the model tries to minimize their effects on the latent space). Thus, these should not be used for biologically-relevant factors that you do not want to correct for.	List of `string`, multiple_sep: `";"`
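
The two covariate arguments accept several values. In params.yaml they are written as a YAML list; on the command line the values are joined with the separator given by multiple_sep (";"). The sketch below only illustrates that split. The helper name split_multiple and the column names "donor" and "assay" are hypothetical and are not part of the component.

```python
def split_multiple(value: str, sep: str = ";") -> list[str]:
    """Split a separator-joined argument value into a list of keys.

    Illustrative only: mirrors how a value such as "donor;assay"
    corresponds to the YAML list ["donor", "assay"].
    """
    # Drop empty items so a trailing separator does not create an empty key.
    return [item for item in value.split(sep) if item]


# Hypothetical .obs column names, used purely as an example.
print(split_multiple("donor;assay"))
```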

Outputs

Name	Description	Attributes
`--output`	Output h5mu file.	`file`, required
`--output_model`	Folder where the state of the trained model will be saved to.	`file`
`--output_compression`	The compression format to be used on the output h5mu object.	`string`, example: `"gzip"`
`--obsm_output`	In which .obsm slot to store the resulting integrated embedding.	`string`, default: `"X_scvi_integrated"`

SCVI options

Name	Description	Attributes
`--n_hidden_nodes`	Number of nodes per hidden layer.	`integer`, default: `128`
`--n_dimensions_latent_space`	Dimensionality of the latent space.	`integer`, default: `30`
`--n_hidden_layers`	Number of hidden layers used for encoder and decoder neural-networks.	`integer`, default: `2`
`--dropout_rate`	Dropout rate for the neural networks.	`double`, default: `0.1`
`--dispersion`	Behavior of the dispersion parameter of the negative binomial distribution. `gene`: dispersion is constant per gene across cells; `gene-batch`: dispersion can differ between batches; `gene-label`: dispersion can differ between labels; `gene-cell`: dispersion can differ for every gene in every cell.	`string`, default: `"gene"`
`--gene_likelihood`	Count-based likelihood distribution used to model the expression data. `nb`: negative binomial distribution; `zinb`: zero-inflated negative binomial distribution; `poisson`: Poisson distribution.	`string`, default: `"nb"`
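
As a rough orientation, the options above correspond to constructor arguments of scvi.model.SCVI in scvi-tools (n_hidden, n_latent, n_layers, dropout_rate, dispersion, gene_likelihood). The sketch below shows one plausible translation. The mapping table and the helper to_scvi_kwargs are assumptions made for illustration; the component's actual forwarding code is not shown in this page.

```python
# Assumed mapping from component argument names to scvi.model.SCVI keyword names.
COMPONENT_TO_SCVI = {
    "n_hidden_nodes": "n_hidden",
    "n_dimensions_latent_space": "n_latent",
    "n_hidden_layers": "n_layers",
    "dropout_rate": "dropout_rate",
    "dispersion": "dispersion",
    "gene_likelihood": "gene_likelihood",
}

# Allowed values as listed in the table above.
ALLOWED = {
    "dispersion": {"gene", "gene-batch", "gene-label", "gene-cell"},
    "gene_likelihood": {"nb", "zinb", "poisson"},
}


def to_scvi_kwargs(params: dict) -> dict:
    """Translate component-style parameters into scVI-style keyword arguments."""
    kwargs = {}
    for name, scvi_name in COMPONENT_TO_SCVI.items():
        if name not in params:
            continue  # leave unset options to the model defaults
        value = params[name]
        if name in ALLOWED and value not in ALLOWED[name]:
            raise ValueError(f"{name}={value!r} is not one of {sorted(ALLOWED[name])}")
        kwargs[scvi_name] = value
    return kwargs


# Defaults taken from the table above.
defaults = {
    "n_hidden_nodes": 128,
    "n_dimensions_latent_space": 30,
    "n_hidden_layers": 2,
    "dropout_rate": 0.1,
    "dispersion": "gene",
    "gene_likelihood": "nb",
}
print(to_scvi_kwargs(defaults))
```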

Variational auto-encoder model options

Name	Description	Attributes
`--use_layer_normalization`	Neural networks for which to enable layer normalization.	`string`, default: `"both"`
`--use_batch_normalization`	Neural networks for which to enable batch normalization.	`string`, default: `"none"`
`--encode_covariates`	Whether to concatenate covariates to expression in the encoder.	`boolean_false`
`--deeply_inject_covariates`	Whether to concatenate covariates into the output of hidden layers in the encoder/decoder, so that they become part of the input of subsequent hidden layers. This option only applies when there is more than one hidden layer (`--n_hidden_layers` > 1).	`boolean_true`
`--use_observed_lib_size`	Use observed library size for RNA as scaling factor in mean of conditional distribution.	`boolean_true`

Early stopping arguments

Name	Description	Attributes
`--early_stopping`	Whether to perform early stopping with respect to the validation set.	`boolean`
`--early_stopping_monitor`	Metric logged during validation set epoch.	`string`, default: `"elbo_validation"`
`--early_stopping_patience`	Number of validation epochs with no improvement after which training will be stopped.	`integer`, default: `45`
`--early_stopping_min_delta`	Minimum change in the monitored quantity to qualify as an improvement; an absolute change of less than min_delta counts as no improvement.	`double`, default: `0`
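
The interaction between patience and min_delta can be sketched in a few lines. This is a simplified illustration that assumes the monitored metric should decrease (as with a validation loss); it is not the training loop used by the component, and the function name early_stopping_epoch is hypothetical.

```python
def early_stopping_epoch(values, patience=45, min_delta=0.0):
    """Return the index of the validation epoch at which training would stop.

    Returns None if the stopping criterion is never met.
    Assumes lower values of the monitored metric are better.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, value in enumerate(values):
        if best - value > min_delta:
            # Improvement larger than min_delta: remember it and reset the counter.
            best = value
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Toy metric values: the metric stops improving after the third epoch.
print(early_stopping_epoch([10.0, 9.0, 8.0, 8.0, 8.0, 8.0], patience=3))  # 5
```

With a larger min_delta, small improvements no longer reset the counter, so training stops earlier on slowly improving runs.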

Learning parameters

Name	Description	Attributes
`--max_epochs`	Number of passes through the dataset. Defaults to (20000 / number of cells) * 400 or 400, whichever is smaller.	`integer`
`--reduce_lr_on_plateau`	Whether to monitor validation loss and reduce learning rate when validation set `lr_scheduler_metric` plateaus.	`boolean`, default: `TRUE`
`--lr_factor`	Factor to reduce learning rate.	`double`, default: `0.6`
`--lr_patience`	Number of epochs with no improvement after which learning rate will be reduced.	`double`, default: `30`
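
The default for max_epochs described above can be worked through as follows. This is a sketch of the stated formula only: rounding to the nearest integer and the lower bound of one epoch are assumptions added here, and the function name default_max_epochs is hypothetical.

```python
def default_max_epochs(n_cells: int, epochs_cap: int = 400, decay_cells: int = 20000) -> int:
    """Heuristic default: the smaller of (20000 / n_cells) * 400 and 400."""
    if n_cells <= 0:
        raise ValueError("n_cells must be positive")
    scaled = round((decay_cells / n_cells) * epochs_cap)
    # Assumption: never return fewer than one epoch for very large datasets.
    return max(1, min(scaled, epochs_cap))


for n_cells in (10_000, 20_000, 100_000, 1_000_000):
    print(n_cells, default_max_epochs(n_cells))
```

Datasets with up to 20000 cells get the full 400 epochs; beyond that, the number of epochs shrinks in proportion to the number of cells (for example, 80 epochs for 100000 cells).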

Data validation

Name	Description	Attributes
`--n_obs_min_count`	Minimum number of cells threshold ensuring that every obs_batch category has sufficient observations (cells) for model training.	`integer`, default: `0`
`--n_var_min_count`	Minimum number of genes threshold ensuring that every var_input filter has sufficient observations (genes) for model training.	`integer`, default: `0`
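
The two thresholds can be pictured as simple counts over the batch column and the gene-selection column. The sketch below uses plain Python lists in place of the .obs and .var columns; the helper names are hypothetical, and treating a count equal to the threshold as sufficient is an assumption. How the component reacts to a failed check is not described here.

```python
from collections import Counter


def batches_below_min_count(batch_labels, n_obs_min_count):
    """Return the batches (and their cell counts) with fewer than n_obs_min_count cells."""
    counts = Counter(batch_labels)
    return {batch: n for batch, n in counts.items() if n < n_obs_min_count}


def has_enough_genes(var_input_mask, n_var_min_count):
    """Check that the boolean gene-selection column keeps at least n_var_min_count genes."""
    return sum(bool(flag) for flag in var_input_mask) >= n_var_min_count


# Toy data: five cells from sample "s1", two from sample "s2".
labels = ["s1"] * 5 + ["s2"] * 2
print(batches_below_min_count(labels, 3))        # {'s2': 2}
print(has_enough_genes([True, False, True], 2))  # True
```

With the default thresholds of 0, both checks always pass.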

Authors

Malte D. Luecken (author)
Dries Schaumont (maintainer)
Matthias Beyens (contributor)