Filter with hvg

Annotate highly variable genes [Satija15] [Zheng17] [Stuart19].

Info

ID: filter_with_hvg
Namespace: filter

Expects logarithmized data, except when flavor=‘seurat_v3’, in which case count data is expected.

Depending on the flavor, this reproduces the R implementations of Seurat [Satija15], Cell Ranger [Zheng17], and Seurat v3 [Stuart19].

For the dispersion-based methods ([Satija15] and [Zheng17]), genes are first binned by mean expression. The normalized dispersion of each gene is then obtained by scaling its dispersion with the mean and standard deviation of the dispersions of all genes in the same bin. As a result, highly variable genes are selected within each bin of mean expression.
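The per-bin normalization described above can be sketched in plain Python. This is only an illustration, not the component's implementation: the function name `normalized_dispersions`, the use of equal-width bins, and the handling of bins with zero spread are assumptions made for the example. The single-gene rule follows the description of `--n_bins`.

```python
from statistics import mean, stdev


def normalized_dispersions(means, dispersions, n_bins=20):
    """Z-score each gene's dispersion within its mean-expression bin.

    Hypothetical helper: equal-width binning is an assumption of this sketch.
    """
    lo, hi = min(means), max(means)
    # Fall back to a width of 1.0 when all means are identical.
    width = (hi - lo) / n_bins or 1.0
    # Assign every gene to a bin; the maximum lands in the last bin.
    bins = [min(int((m - lo) / width), n_bins - 1) for m in means]

    # Collect the dispersions that fall into each bin.
    by_bin = {}
    for b, d in zip(bins, dispersions):
        by_bin.setdefault(b, []).append(d)

    result = []
    for b, d in zip(bins, dispersions):
        group = by_bin[b]
        if len(group) == 1:
            # A bin with a single gene: normalized dispersion is set to 1.
            result.append(1.0)
            continue
        sd = stdev(group)
        # Assumption: a bin with zero spread yields 0 instead of dividing by 0.
        result.append((d - mean(group)) / sd if sd > 0 else 0.0)
    return result


if __name__ == "__main__":
    gene_means = [0.1, 0.2, 0.3, 5.0]
    gene_disps = [1.0, 2.0, 3.0, 4.0]
    print(normalized_dispersions(gene_means, gene_disps, n_bins=2))
```

In this toy input the first three genes share a bin and are z-scored against each other, while the fourth gene sits alone in its bin and therefore receives the fixed value 1.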

For [Stuart19], a normalized variance for each gene is computed. First, the data are standardized (i.e., z-score normalization per feature) with a regularized standard deviation. Next, the normalized variance is computed as the variance of each gene after the transformation. Genes are ranked by the normalized variance.
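As a rough sketch of the two steps above, the following plain-Python example standardizes one gene's counts with a given regularized standard deviation and takes the variance of the result. The helper names are hypothetical. The regularized standard deviation is passed in as an input here (in the actual method it comes from a loess fit, which is out of scope for this sketch), and the upper clip at the square root of the number of cells is an assumption based on the Seurat v3 convention rather than something stated on this page.

```python
from math import sqrt


def normalized_variance(counts, reg_std):
    """Variance of one gene after standardization with a regularized std.

    counts:  raw counts of the gene across cells.
    reg_std: regularized standard deviation (assumed to be precomputed).
    """
    n = len(counts)
    mu = sum(counts) / n
    # Assumption: standardized values are clipped from above at sqrt(n).
    clip = sqrt(n)
    z = [min((x - mu) / reg_std, clip) for x in counts]
    # Variance around the original mean, i.e. around 0 after standardization.
    return sum(v * v for v in z) / (n - 1)


def rank_genes(norm_var_by_gene):
    """Return gene names ordered from highest to lowest normalized variance."""
    return sorted(norm_var_by_gene, key=norm_var_by_gene.get, reverse=True)


if __name__ == "__main__":
    print(normalized_variance([0, 0, 0, 4], reg_std=2.0))
    print(rank_genes({"a": 0.5, "b": 2.0, "c": 1.0}))
```

Genes whose observed variance exceeds what the regularized standard deviation predicts end up with a normalized variance above 1 and move towards the top of the ranking.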

Example commands

You can run the pipeline using nextflow run.

View help

You can use --help as a parameter to get an overview of the possible parameters.

nextflow run openpipelines-bio/openpipeline \
  -r 1.0.2 -latest \
  -main-script target/nextflow/filter/filter_with_hvg/main.nf \
  --help

Run command

Example of params.yaml

# Arguments
input: # please fill in - example: "input.h5mu"
modality: "rna"
# layer: "foo"
# output: "$id.$key.output.h5mu"
# output_compression: "gzip"
var_name_filter: "filter_with_hvg"
varm_name: "hvg"
do_subset: false
flavor: "seurat"
# n_top_genes: 123
min_mean: 0.0125
max_mean: 3
min_disp: 0.5
# max_disp: 123.0
span: 0.3
n_bins: 20
# obs_batch_key: "foo"

# Nextflow input-output arguments
publish_dir: # please fill in - example: "output/"
# param_list: "my_params.yaml"

nextflow run openpipelines-bio/openpipeline \
  -r 1.0.2 -latest \
  -profile docker \
  -main-script target/nextflow/filter/filter_with_hvg/main.nf \
  -params-file params.yaml

Note

Replace -profile docker with -profile podman or -profile singularity depending on the desired backend.

Argument group

Arguments

Name	Description	Attributes
`--input`	Input h5mu file	`file`, required, example: `"input.h5mu"`
`--modality`		`string`, default: `"rna"`
`--layer`	Use adata.layers[layer] for expression values instead of adata.X.	`string`
`--output`	Output h5mu file.	`file`, example: `"output.h5mu"`
`--output_compression`	The compression format to be used on the output h5mu object.	`string`, example: `"gzip"`
`--var_name_filter`	In which .var slot to store a boolean array marking which features (genes) were annotated as highly variable.	`string`, default: `"filter_with_hvg"`
`--varm_name`	In which .varm slot to store additional metadata.	`string`, default: `"hvg"`
`--do_subset`	Whether to subset before storing the output.	`boolean_true`
`--flavor`	Choose the flavor for identifying highly variable genes. For the dispersion-based methods in their default workflows, Seurat passes the cutoffs whereas Cell Ranger passes n_top_genes.	`string`, default: `"seurat"`
`--n_top_genes`	Number of highly-variable genes to keep. Mandatory if flavor=‘seurat_v3’.	`integer`
`--min_mean`	Lower cutoff for the mean expression. If n_top_genes is defined, this and all other cutoffs for the means and the normalized dispersions are ignored. Ignored if flavor=‘seurat_v3’.	`double`, default: `0.0125`
`--max_mean`	Upper cutoff for the mean expression. If n_top_genes is defined, this and all other cutoffs for the means and the normalized dispersions are ignored. Ignored if flavor=‘seurat_v3’.	`double`, default: `3`
`--min_disp`	Lower cutoff for the normalized dispersion. If n_top_genes is defined, this and all other cutoffs for the means and the normalized dispersions are ignored. Ignored if flavor=‘seurat_v3’.	`double`, default: `0.5`
`--max_disp`	Upper cutoff for the normalized dispersion. If n_top_genes is defined, this and all other cutoffs for the means and the normalized dispersions are ignored. Ignored if flavor=‘seurat_v3’. Default is +inf.	`double`
`--span`	The fraction of the data (cells) used when estimating the variance in the loess model fit if flavor=‘seurat_v3’.	`double`, default: `0.3`
`--n_bins`	Number of bins for binning the mean gene expression. Normalization is done with respect to each bin. If just a single gene falls into a bin, the normalized dispersion is artificially set to 1.	`integer`, default: `20`
`--obs_batch_key`	If specified, highly variable genes are selected within each batch separately and merged. This simple process avoids the selection of batch-specific genes and acts as a lightweight batch correction method. For all flavors, genes are first sorted by the number of batches in which they are an HVG. For dispersion-based flavors, ties are broken by normalized dispersion. If flavor=‘seurat_v3’, ties are broken by the median (across batches) rank based on within-batch normalized variance.	`string`
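The batch-aware ordering described for `--obs_batch_key` can be illustrated for the dispersion-based flavors with a short plain-Python sketch. The function `merge_batch_hvgs` and its input layout are hypothetical, and averaging the normalized dispersion across batches for the tie-break is an assumption of this example. The page only states that ties are broken by normalized dispersion.

```python
def merge_batch_hvgs(per_batch, n_top_genes):
    """Rank genes across batches and return the top n_top_genes names.

    per_batch: list with one dict per batch, mapping
               gene -> (is_hvg, normalized_dispersion).
    """
    genes = list(per_batch[0])
    n_batches = len(per_batch)

    stats = {}
    for gene in genes:
        # Primary key: in how many batches the gene was flagged as an HVG.
        hvg_count = sum(1 for batch in per_batch if batch[gene][0])
        # Tie-break (assumption): mean normalized dispersion across batches.
        mean_disp = sum(batch[gene][1] for batch in per_batch) / n_batches
        stats[gene] = (hvg_count, mean_disp)

    # Tuples sort by hvg_count first, then by mean_disp; highest first.
    ranked = sorted(genes, key=lambda g: stats[g], reverse=True)
    return ranked[:n_top_genes]


if __name__ == "__main__":
    batch_1 = {"g1": (True, 2.0), "g2": (True, 1.0), "g3": (False, 0.1)}
    batch_2 = {"g1": (False, 0.4), "g2": (True, 1.5), "g3": (True, 3.0)}
    print(merge_batch_hvgs([batch_1, batch_2], n_top_genes=2))
```

Here `g2` ranks first because it is an HVG in both batches, and `g3` is preferred over `g1` because both are HVGs in one batch but `g3` has the higher average normalized dispersion.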

Authors

Dries De Maeyer (contributor)
Robrecht Cannoodt (maintainer, contributor)