Align query reference

Alignment of a query and reference dataset by: * Alignment of layers * Harmonization of .obs field names for batch and cell type labels * Harmonization of .var field name for gene names * Sanitation of gene names * Cross-checking of genes * Assignment of an id to the query and reference datasets

Info

ID: align_query_reference
Namespace: feature_annotation

Links

Source

Example commands

You can run the pipeline using nextflow run.

View help

You can use --help as a parameter to get an overview of the possible parameters.

nextflow run openpipelines-bio/openpipeline \
  -r 2.1.1 -latest \
  -main-script target/nextflow/feature_annotation/align_query_reference/main.nf \
  --help

Run command

Example of params.yaml

# Inputs
input: # please fill in - example: "input.h5mu"
modality: "rna"
# input_layer: "foo"
# input_layer_lognormalized: "foo"
# input_var_gene_names: "foo"
input_obs_batch: # please fill in - example: "sample_id"
# input_obs_label: "cell_type"
input_id: "query"

# Reference
# reference: "reference.h5mu"
# reference_layer: "foo"
# reference_layer_lognormalized: "foo"
# reference_var_gene_names: "foo"
reference_obs_batch: # please fill in - example: "sample_id"
# reference_obs_label: "cell_type"
reference_id: "reference"

# Outputs
# output_query: "$id.$key.output_query.h5mu"
# output_reference: "$id.$key.output_reference.h5mu"
# output_compression: "gzip"
output_layer: "_counts"
output_layer_lognormalized: "_log_normalized"
output_var_gene_names: "_gene_names"
output_obs_batch: "_sample_id"
output_obs_label: "_cell_type"
output_obs_id: "_dataset"
output_var_index: "_ori_var_index"
output_var_common_genes: "_common_vars"

# Arguments
input_reference_gene_overlap: 100
align_layers_raw_counts: true
align_layers_lognormalized_counts: false
unkown_celltype_label: "Unknown"
overwrite_existing_key: false
preserve_var_index: false

# Nextflow input-output arguments
publish_dir: # please fill in - example: "output/"
# param_list: "my_params.yaml"

nextflow run openpipelines-bio/openpipeline \
  -r 2.1.1 -latest \
  -profile docker \
  -main-script target/nextflow/feature_annotation/align_query_reference/main.nf \
  -params-file params.yaml

Note

Replace -profile docker with -profile podman or -profile singularity depending on the desired backend.

Argument groups

Inputs

Input dataset (query) arguments

Name	Description	Attributes
`--input`	The input (query) data to be labeled. Should be a .h5mu file.	`file`, required, example: `"input.h5mu"`
`--modality`	Which modality to process. Note that the query and reference modalities should be the same.	`string`, default: `"rna"`
`--input_layer`	The layer in the input (query) data containing raw counts if .X is not to be used.	`string`
`--input_layer_lognormalized`	The layer in the input (query) data containing log normalized counts if .X is not to be used.	`string`
`--input_var_gene_names`	The name of the .var column in the input (query) data containing gene names; when no gene_name_layer is provided, the var index will be used.	`string`
`--input_obs_batch`	The name of the .obs column in the input (query) data containing batch information.	`string`, required, example: `"sample_id"`
`--input_obs_label`	The name of the .obs column in the input (query) data containing cell type labels. If not provided, the –unkown_celltype_label will be assigned to all observations.	`string`, example: `"cell_type"`
`--input_id`	Meta id value to be assigned to the –output_obs_id .obs field of the aligned input (query) data.	`string`, default: `"query"`

Reference

Arguments related to the reference dataset.

Name	Description	Attributes
`--reference`	The reference data to train the CellTypist classifiers on. Only required if a pre-trained –model is not provided.	`file`, example: `"reference.h5mu"`
`--reference_layer`	The layer in the reference data containing raw counts if .X is not to be used. Data are expected to be processed in the same way as the –input query dataset.	`string`
`--reference_layer_lognormalized`	The layer in the reference data containing log normalized counts if .X is not to be used. Data are expected to be processed in the same way as the –input query dataset.	`string`
`--reference_var_gene_names`	The name of the .var column in the reference data containing gene names; when no gene_name_layer is provided, the var index will be used.	`string`
`--reference_obs_batch`	The name of the .obs column in the reference data containing batch information.	`string`, required, example: `"sample_id"`
`--reference_obs_label`	The name of the .obs column in the reference data containing cell type labels. If not provided, the –unkown_celltype_label will be assigned to all observations.	`string`, example: `"cell_type"`
`--reference_id`	Meta id value to be assigned to the –output_obs_id .obs field of the aligned reference data.	`string`, default: `"reference"`

Outputs

Output arguments.

Name	Description	Attributes
`--output_query`	Aligned query data.	`file`, example: `"output_query.h5mu"`
`--output_reference`	Aligned reference data.	`file`, example: `"output_reference.h5mu"`
`--output_compression`		`string`, example: `"gzip"`
`--output_layer`	Name of the aligned layer containing raw counts in the output query and reference datasets.	`string`, default: `"_counts"`
`--output_layer_lognormalized`	Name of the aligned layer containing log normalized counts in the output query and reference datasets.	`string`, default: `"_log_normalized"`
`--output_var_gene_names`	Name of the .var column in the output query and reference datasets containing the gene names.	`string`, default: `"_gene_names"`
`--output_obs_batch`	Name of the .obs column in the output query and reference datasets containing the batch information.	`string`, default: `"_sample_id"`
`--output_obs_label`	Name of the .obs column in the output query and reference datasets containing the cell type labels.	`string`, default: `"_cell_type"`
`--output_obs_id`	Name of the .obs column in the output query and reference datasets containing the dataset id.	`string`, default: `"_dataset"`
`--output_var_index`	Name of the .var column to which the .var index of the –input and –reference datasets is stored. Only relevant if “–preserve_var_index” is False.	`string`, default: `"_ori_var_index"`
`--output_var_common_genes`	Name of the .var column in the output query and reference datasets containing the boolean array indicating the common variables.	`string`, default: `"_common_vars"`

Arguments

Arguments related to the alignment of the input and reference datasets.

Name	Description	Attributes
`--input_reference_gene_overlap`	The minimum number of genes present in both the reference and query datasets.	`integer`, default: `100`
`--align_layers_raw_counts`	Whether to align the query and reference layers containing raw counts.	`boolean`, default: `TRUE`
`--align_layers_lognormalized_counts`	Whether to align the query and reference layers containing log normalized counts.	`boolean_true`
`--unkown_celltype_label`	The label to assign to cells with an unknown cell type.	`string`, default: `"Unknown"`
`--overwrite_existing_key`	If set to true and the layer, obs or var key already exists in the query/reference file, the key will be overwritten.	`boolean_true`
`--preserve_var_index`	If set to true, the .var index of the –input and –reference datasets will be preserved. If set to false (default behavior), the original .var index will be stored in the –output_var_index .var column and the .var index will be replaced with the sanitized & aligned gene names.	`boolean_true`

Authors

Dorien Roosen (maintainer)