Scvi

Performs scVI integration as done in the Human Lung Cell Atlas (https://github.com/LungCellAtlas/HLCA).

Info

ID: scvi
Namespace: integrate

Example commands

You can run the pipeline using nextflow run.

View help

You can use --help as a parameter to get an overview of the possible parameters.

nextflow run openpipelines-bio/openpipeline \
  -r 1.0.1 -latest \
  -main-script target/nextflow/integrate/scvi/main.nf \
  --help

Run command

Example of params.yaml
# Inputs
input: # please fill in - example: "path/to/file"
modality: "rna"
# input_layer: "foo"
obs_batch: "sample_id"
# var_input: "foo"
# obs_labels: "foo"
# obs_size_factor: "foo"
# obs_categorical_covariate: ["foo"]
# obs_continuous_covariate: ["foo"]

# Outputs
# output: "$id.$key.output.output"
# output_model: "$id.$key.output_model.output_model"
# output_compression: "gzip"
obsm_output: "X_scvi_integrated"

# SCVI options
n_hidden_nodes: 128
n_dimensions_latent_space: 30
n_hidden_layers: 2
dropout_rate: 0.1
dispersion: "gene"
gene_likelihood: "nb"

# Variational auto-encoder model options
use_layer_normalization: "both"
use_batch_normalization: "none"
encode_covariates: true
deeply_inject_covariates: false
use_observed_lib_size: false

# Early stopping arguments
# early_stopping: true
early_stopping_monitor: "elbo_validation"
early_stopping_patience: 45
early_stopping_min_delta: 0.0

# Learning parameters
# max_epochs: 123
reduce_lr_on_plateau: true
lr_factor: 0.6
lr_patience: 30

# Data validation
n_obs_min_count: 0
n_var_min_count: 0

# Nextflow input-output arguments
publish_dir: # please fill in - example: "output/"
# param_list: "my_params.yaml"

nextflow run openpipelines-bio/openpipeline \
  -r 1.0.1 -latest \
  -profile docker \
  -main-script target/nextflow/integrate/scvi/main.nf \
  -params-file params.yaml

Note

Replace -profile docker with -profile podman or -profile singularity depending on the desired backend.

Argument groups

Inputs

Name Description Attributes
--input Input h5mu file file, required
--modality string, default: "rna"
--input_layer Input layer to use. If not specified, .X is used. string
--obs_batch Column name discriminating between your batches. string, default: "sample_id"
--var_input .var column containing highly variable genes. By default, do not subset genes. string
--obs_labels Key in adata.obs for label information. Categories will automatically be converted into integer categories and saved to adata.obs['_scvi_labels']. If None, assigns the same label to all the data. string
--obs_size_factor Key in adata.obs for size factor information. Instead of using library size as a size factor, the provided size factor column will be used as offset in the mean of the likelihood. Assumed to be on linear scale. string
--obs_categorical_covariate Keys in adata.obs that correspond to categorical data. These covariates can be added in addition to the batch covariate and are also treated as nuisance factors (i.e., the model tries to minimize their effects on the latent space). Thus, these should not be used for biologically-relevant factors that you do not want to correct for. List of string, multiple_sep: ";"
--obs_continuous_covariate Keys in adata.obs that correspond to continuous data. These covariates can be added in addition to the batch covariate and are also treated as nuisance factors (i.e., the model tries to minimize their effects on the latent space). Thus, these should not be used for biologically-relevant factors that you do not want to correct for. List of string, multiple_sep: ";"
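The input arguments above closely mirror the keyword arguments of `scvi.model.SCVI.setup_anndata` in scvi-tools. The following is a minimal sketch of how such a translation could look, assuming the component forwards the values unchanged; the helper names `split_multiple` and `setup_anndata_kwargs` and the column names `donor` and `assay` are hypothetical and only serve as illustration.

```python
from typing import List, Optional


def split_multiple(value: Optional[str], sep: str = ";") -> Optional[List[str]]:
    """Split a ';'-separated command-line value into a list; empty input stays None."""
    if not value:
        return None
    return [item.strip() for item in value.split(sep) if item.strip()]


def setup_anndata_kwargs(par: dict) -> dict:
    """Translate pipeline arguments into scvi-tools keyword names (assumed mapping)."""
    return {
        "layer": par.get("input_layer"),
        "batch_key": par.get("obs_batch", "sample_id"),
        "labels_key": par.get("obs_labels"),
        "size_factor_key": par.get("obs_size_factor"),
        "categorical_covariate_keys": split_multiple(par.get("obs_categorical_covariate")),
        "continuous_covariate_keys": split_multiple(par.get("obs_continuous_covariate")),
    }


# Hypothetical example: two categorical nuisance covariates passed as "donor;assay".
example = setup_anndata_kwargs(
    {"obs_batch": "sample_id", "obs_categorical_covariate": "donor;assay"}
)
print(example)
```

Arguments that are left unset stay `None`, which matches the "If None" behavior described for `--input_layer` and `--obs_labels`.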

Outputs

Name Description Attributes
--output Output h5mu file. file, required
--output_model Folder where the state of the trained model will be saved to. file
--output_compression The compression format to be used on the output h5mu object. string, example: "gzip"
--obsm_output In which .obsm slot to store the resulting integrated embedding. string, default: "X_scvi_integrated"

SCVI options

Name Description Attributes
--n_hidden_nodes Number of nodes per hidden layer. integer, default: 128
--n_dimensions_latent_space Dimensionality of the latent space. integer, default: 30
--n_hidden_layers Number of hidden layers used for encoder and decoder neural-networks. integer, default: 2
--dropout_rate Dropout rate for the neural networks. double, default: 0.1
--dispersion Behavior of the dispersion parameter of the negative binomial distribution. Options: gene (dispersion is constant per gene across cells); gene-batch (dispersion can differ between batches); gene-label (dispersion can differ between labels); gene-cell (dispersion can differ for every gene in every cell). string, default: "gene"
--gene_likelihood Count-based likelihood distribution used to model the expression data. Options: nb (negative binomial distribution); zinb (zero-inflated negative binomial distribution); poisson (Poisson distribution). string, default: "nb"

Variational auto-encoder model options

Name Description Attributes
--use_layer_normalization Neural networks for which to enable layer normalization. string, default: "both"
--use_batch_normalization Neural networks for which to enable batch normalization. string, default: "none"
--encode_covariates Whether to concatenate covariates to the expression data in the encoder. boolean_false
--deeply_inject_covariates Whether to concatenate covariates into the output of hidden layers in the encoder/decoder. This option only applies when n_hidden_layers > 1. The covariates are concatenated to the input of subsequent hidden layers. boolean_true
--use_observed_lib_size Use observed library size for RNA as scaling factor in mean of conditional distribution. boolean_true
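The SCVI options and the variational auto-encoder options use more descriptive names than the corresponding scvi-tools keywords. A minimal sketch of the correspondence, assuming the component passes the values straight through to the scvi-tools model; the function name `scvi_model_kwargs` is hypothetical, and the fallback values are the defaults listed in the tables and the example params.yaml.

```python
def scvi_model_kwargs(par: dict) -> dict:
    """Map pipeline option names to scvi-tools keyword names (assumed mapping)."""
    return {
        # SCVI options
        "n_hidden": par.get("n_hidden_nodes", 128),
        "n_latent": par.get("n_dimensions_latent_space", 30),
        "n_layers": par.get("n_hidden_layers", 2),
        "dropout_rate": par.get("dropout_rate", 0.1),
        "dispersion": par.get("dispersion", "gene"),
        "gene_likelihood": par.get("gene_likelihood", "nb"),
        # Variational auto-encoder model options
        "use_layer_norm": par.get("use_layer_normalization", "both"),
        "use_batch_norm": par.get("use_batch_normalization", "none"),
        "encode_covariates": par.get("encode_covariates", True),
        "deeply_inject_covariates": par.get("deeply_inject_covariates", False),
        "use_observed_lib_size": par.get("use_observed_lib_size", False),
    }


# Override a single option and keep the documented defaults for the rest.
kwargs = scvi_model_kwargs({"n_hidden_layers": 3})
print(kwargs)
```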

Early stopping arguments

Name Description Attributes
--early_stopping Whether to perform early stopping with respect to the validation set. boolean
--early_stopping_monitor Metric logged during validation set epoch. string, default: "elbo_validation"
--early_stopping_patience Number of validation epochs with no improvement after which training will be stopped. integer, default: 45
--early_stopping_min_delta Minimum change in the monitored quantity to qualify as an improvement, i.e. an absolute change of less than min_delta will count as no improvement. double, default: 0
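The interplay between the patience and min_delta settings can be illustrated with a simplified sketch. It assumes a monitored metric where lower is better and is not the actual training code; the function name `early_stopping_epoch` is hypothetical.

```python
def early_stopping_epoch(losses, patience=45, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    wait = 0  # number of consecutive validation epochs without improvement
    for epoch, loss in enumerate(losses, start=1):
        if best - loss > min_delta:
            # The metric improved by more than min_delta: reset the counter.
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# With patience=3, training stops after three epochs without improvement.
stop = early_stopping_epoch([10.0, 9.0, 8.0, 8.0, 8.0, 8.0], patience=3)
print(stop)
```

A larger min_delta makes the criterion stricter, because small decreases in the metric no longer reset the counter.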

Learning parameters

Name Description Attributes
--max_epochs Number of passes through the dataset. Defaults to (20000 / number of cells) * 400 or 400, whichever is smaller. integer
--reduce_lr_on_plateau Whether to monitor the validation loss and reduce the learning rate when the validation metric (lr_scheduler_metric) plateaus. boolean, default: TRUE
--lr_factor Factor by which the learning rate is multiplied when it is reduced. double, default: 0.6
--lr_patience Number of epochs with no improvement after which learning rate will be reduced. double, default: 30
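The default for max_epochs and the effect of lr_factor can be worked out with a few lines of arithmetic. This is a sketch under stated assumptions: rounding to the nearest integer is an assumption, the starting learning rate of 1e-3 is a hypothetical value that is not a parameter of this component, and both function names are illustrative.

```python
def default_max_epochs(n_cells: int) -> int:
    """(20000 / number of cells) * 400, capped at 400; rounding is an assumption."""
    return min(round(20000 / n_cells * 400), 400)


def reduced_lr(initial_lr: float, lr_factor: float = 0.6, n_reductions: int = 1) -> float:
    """Learning rate after the scheduler has reduced it n_reductions times."""
    return initial_lr * lr_factor**n_reductions


# Small datasets hit the cap of 400 epochs; larger datasets get fewer epochs.
for n_cells in (10_000, 40_000, 100_000):
    print(n_cells, default_max_epochs(n_cells))

# Two plateaus with lr_factor=0.6 scale the hypothetical starting rate by 0.36.
print(reduced_lr(1e-3, lr_factor=0.6, n_reductions=2))
```

The cap applies to any dataset with 20000 cells or fewer; above that, the default number of epochs shrinks in proportion to the number of cells.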

Data validation

Name Description Attributes
--n_obs_min_count Minimum number of cells that every obs_batch category must contain, ensuring sufficient observations (cells) for model training. integer, default: 0
--n_var_min_count Minimum number of genes that must remain after applying the var_input filter, ensuring sufficient observations (genes) for model training. integer, default: 0
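The per-batch check described for n_obs_min_count could look roughly like the sketch below. It assumes that a batch fails when it has strictly fewer cells than the threshold; the function name `batches_below_min` and the batch labels are hypothetical, and the component's real validation may differ.

```python
from collections import Counter


def batches_below_min(batch_labels, n_obs_min_count=0):
    """Return {batch: cell count} for batches with fewer cells than the threshold."""
    counts = Counter(batch_labels)
    return {batch: n for batch, n in counts.items() if n < n_obs_min_count}


# Hypothetical obs_batch column with three cells in "s1" and one cell in "s2".
labels = ["s1", "s1", "s1", "s2"]
too_small = batches_below_min(labels, n_obs_min_count=2)
print(too_small)
```

With the default threshold of 0, no batch can fail the check, so the validation is effectively disabled unless a positive value is set.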

Authors

  • Malte D. Luecken (author)

  • Dries Schaumont (maintainer)

  • Matthias Beyens (contributor)