Filter with hvg

Annotate highly variable genes [Satija15] [Zheng17] [Stuart19].

Info

ID: filter_with_hvg
Namespace: filter

Expects logarithmized data, except when flavor=‘seurat_v3’ in which count data is expected.

Depending on flavor, this reproduces the R-implementations of Seurat [Satija15], Cell Ranger [Zheng17], and Seurat v3 [Stuart19].

For the dispersion-based methods ([Satija15] and [Zheng17]), the normalized dispersion is obtained by scaling with the mean and standard deviation of the dispersions for genes falling into a given bin for mean expression of genes. This means that for each bin of mean expression, highly variable genes are selected.

For [Stuart19], a normalized variance for each gene is computed. First, the data are standardized (i.e., z-score normalization per feature) with a regularized standard deviation. Next, the normalized variance is computed as the variance of each gene after the transformation. Genes are ranked by the normalized variance.

Example commands

You can run the pipeline using nextflow run.

View help

You can use --help as a parameter to get an overview of the possible parameters.

nextflow run openpipelines-bio/openpipeline \
  -r 1.0.2 -latest \
  -main-script target/nextflow/filter/filter_with_hvg/main.nf \
  --help

Run command

Example of params.yaml
# Arguments
input: # please fill in - example: "input.h5mu"
modality: "rna"
# layer: "foo"
# output: "$id.$key.output.h5mu"
# output_compression: "gzip"
var_name_filter: "filter_with_hvg"
varm_name: "hvg"
do_subset: false
flavor: "seurat"
# n_top_genes: 123
min_mean: 0.0125
max_mean: 3
min_disp: 0.5
# max_disp: 123.0
span: 0.3
n_bins: 20
# obs_batch_key: "foo"

# Nextflow input-output arguments
publish_dir: # please fill in - example: "output/"
# param_list: "my_params.yaml"
nextflow run openpipelines-bio/openpipeline \
  -r 1.0.2 -latest \
  -profile docker \
  -main-script target/nextflow/filter/filter_with_hvg/main.nf \
  -params-file params.yaml
Note

Replace -profile docker with -profile podman or -profile singularity depending on the desired backend.

Argument group

Arguments

Name Description Attributes
--input Input h5mu file file, required, example: "input.h5mu"
--modality string, default: "rna"
--layer use adata.layers[layer] for expression values instead of adata.X. string
--output Output h5mu file. file, example: "output.h5mu"
--output_compression The compression format to be used on the output h5mu object. string, example: "gzip"
--var_name_filter In which .var slot to store a boolean array corresponding to which observations should be filtered out. string, default: "filter_with_hvg"
--varm_name In which .varm slot to store additional metadata. string, default: "hvg"
--do_subset Whether to subset before storing the output. boolean_true
--flavor Choose the flavor for identifying highly variable genes. For the dispersion based methods in their default workflows, Seurat passes the cutoffs whereas Cell Ranger passes n_top_genes. string, default: "seurat"
--n_top_genes Number of highly-variable genes to keep. Mandatory if flavor=‘seurat_v3’. integer
--min_mean If n_top_genes is defined, this and all other cutoffs for the means and the normalized dispersions are ignored. Ignored if flavor=‘seurat_v3’. double, default: 0.0125
--max_mean If n_top_genes is defined, this and all other cutoffs for the means and the normalized dispersions are ignored. Ignored if flavor=‘seurat_v3’. double, default: 3
--min_disp If n_top_genes is defined, this and all other cutoffs for the means and the normalized dispersions are ignored. Ignored if flavor=‘seurat_v3’. double, default: 0.5
--max_disp If n_top_genes is defined, this and all other cutoffs for the means and the normalized dispersions are ignored. Ignored if flavor=‘seurat_v3’. Default is +inf. double
--span The fraction of the data (cells) used when estimating the variance in the loess model fit if flavor=‘seurat_v3’. double, default: 0.3
--n_bins Number of bins for binning the mean gene expression. Normalization is done with respect to each bin. If just a single gene falls into a bin, the normalized dispersion is artificially set to 1. integer, default: 20
--obs_batch_key If specified, highly-variable genes are selected within each batch separately and merged. This simple process avoids the selection of batch-specific genes and acts as a lightweight batch correction method. For all flavors, genes are first sorted by how many batches they are a HVG. For dispersion-based flavors ties are broken by normalized dispersion. If flavor = ‘seurat_v3’, ties are broken by the median (across batches) rank based on within-batch normalized variance. string

Authors

  • Dries De Maeyer (contributor)

  • Robrecht Cannoodt (maintainer, contributor)