Highly variable features scanpy

Annotate highly variable features [Satija15] [Zheng17] [Stuart19].

Info

ID: highly_variable_features_scanpy
Namespace: feature_annotation

Links

Expects logarithmized data, except when flavor=‘seurat_v3’ in which count data is expected.

Depending on flavor, this reproduces the R-implementations of Seurat [Satija15], Cell Ranger [Zheng17], and Seurat v3 [Stuart19].

For the dispersion-based methods ([Satija15] and [Zheng17]), the normalized dispersion is obtained by scaling with the mean and standard deviation of the dispersions for features falling into a given bin for mean expression of features. This means that for each bin of mean expression, highly variable features are selected.

For [Stuart19], a normalized variance for each feature is computed. First, the data are standardized (i.e., z-score normalization per feature) with a regularized standard deviation. Next, the normalized variance is computed as the variance of each feature after the transformation. Features are ranked by the normalized variance.

Example commands

You can run the pipeline using nextflow run.

View help

You can use --help as a parameter to get an overview of the possible parameters.

nextflow run openpipelines-bio/openpipeline \
  -r 2.1.1 -latest \
  -main-script target/nextflow/feature_annotation/highly_variable_features_scanpy/main.nf \
  --help

Run command

Example of params.yaml

# Arguments
input: # please fill in - example: "input.h5mu"
modality: "rna"
# layer: "foo"
# var_input: "foo"
# output: "$id.$key.output.h5mu"
# output_compression: "gzip"
var_name_filter: "filter_with_hvg"
varm_name: "hvg"
flavor: "seurat"
# n_top_features: 123
min_mean: 0.0125
max_mean: 3.0
min_disp: 0.5
# max_disp: 123.0
span: 0.3
n_bins: 20
# obs_batch_key: "foo"

# Nextflow input-output arguments
publish_dir: # please fill in - example: "output/"
# param_list: "my_params.yaml"

nextflow run openpipelines-bio/openpipeline \
  -r 2.1.1 -latest \
  -profile docker \
  -main-script target/nextflow/feature_annotation/highly_variable_features_scanpy/main.nf \
  -params-file params.yaml

Note

Replace -profile docker with -profile podman or -profile singularity depending on the desired backend.

Argument group

Arguments

Name	Description	Attributes
`--input`	Input h5mu file	`file`, required, example: `"input.h5mu"`
`--modality`		`string`, default: `"rna"`
`--layer`	use adata.layers[layer] for expression values instead of adata.X.	`string`
`--var_input`	If specified, use boolean array in adata.var[var_input] to calculate hvg on subset of vars.	`string`
`--output`	Output h5mu file.	`file`, example: `"output.h5mu"`
`--output_compression`	The compression format to be used on the output h5mu object.	`string`, example: `"gzip"`
`--var_name_filter`	In which .var slot to store a boolean array corresponding to which observations should be filtered out.	`string`, default: `"filter_with_hvg"`
`--varm_name`	In which .varm slot to store additional metadata.	`string`, default: `"hvg"`
`--flavor`	Choose the flavor for identifying highly variable features. For the dispersion based methods in their default workflows, Seurat passes the cutoffs whereas Cell Ranger passes n_top_features.	`string`, default: `"seurat"`
`--n_top_features`	Number of highly-variable features to keep. Mandatory if flavor=‘seurat_v3’.	`integer`
`--min_mean`	If n_top_features is defined, this and all other cutoffs for the means and the normalized dispersions are ignored. Ignored if flavor=‘seurat_v3’.	`double`, default: `0.0125`
`--max_mean`	If n_top_features is defined, this and all other cutoffs for the means and the normalized dispersions are ignored. Ignored if flavor=‘seurat_v3’.	`double`, default: `3`
`--min_disp`	If n_top_features is defined, this and all other cutoffs for the means and the normalized dispersions are ignored. Ignored if flavor=‘seurat_v3’.	`double`, default: `0.5`
`--max_disp`	If n_top_features is defined, this and all other cutoffs for the means and the normalized dispersions are ignored. Ignored if flavor=‘seurat_v3’. Default is +inf.	`double`
`--span`	The fraction of the data (cells) used when estimating the variance in the loess model fit if flavor=‘seurat_v3’.	`double`, default: `0.3`
`--n_bins`	Number of bins for binning the mean feature expression. Normalization is done with respect to each bin. If just a single feature falls into a bin, the normalized dispersion is artificially set to 1.	`integer`, default: `20`
`--obs_batch_key`	If specified, highly-variable features are selected within each batch separately and merged. This simple process avoids the selection of batch-specific features and acts as a lightweight batch correction method. For all flavors, features are first sorted by how many batches they are a HVG. For dispersion-based flavors ties are broken by normalized dispersion. If flavor = ‘seurat_v3’, ties are broken by the median (across batches) rank based on within-batch normalized variance.	`string`

Authors

Dries De Maeyer (contributor)
Robrecht Cannoodt (maintainer, contributor)