Celltypist
Automated cell type annotation tool for scRNA-seq datasets on the basis of logistic regression classifiers optimised by the stochastic gradient descent algorithm.
Info
ID: celltypist
Namespace: annotate
Example commands
You can run the pipeline using nextflow run.
View help
You can use --help as a parameter to get an overview of the possible parameters.
nextflow run openpipelines-bio/openpipeline \
-r 2.1.0 -latest \
-main-script target/nextflow/annotate/celltypist/main.nf \
--help
Run command
Example of params.yaml
# Inputs
input: # please fill in - example: "input.h5mu"
modality: "rna"
# input_layer: "foo"
# input_var_gene_names: "foo"
input_reference_gene_overlap: 100
# Reference
# reference: "reference.h5mu"
# reference_layer: "foo"
reference_obs_target: "cell_ontology_class"
# reference_var_gene_names: "foo"
# reference_var_input: "foo"
# Model arguments
# model: "pretrained_model.pkl"
feature_selection: false
majority_voting: false
C: 1.0
max_iter: 1000
use_SGD: false
min_prop: 0.0
# Outputs
# output: "$id.$key.output.h5mu"
# output_compression: "gzip"
output_obs_predictions: "celltypist_pred"
output_obs_probability: "celltypist_probability"
# Nextflow input-output arguments
publish_dir: # please fill in - example: "output/"
# param_list: "my_params.yaml"
# Arguments
nextflow run openpipelines-bio/openpipeline \
-r 2.1.0 -latest \
-profile docker \
-main-script target/nextflow/annotate/celltypist/main.nf \
-params-file params.yaml
Note
Replace -profile docker with -profile podman or -profile singularity depending on the desired backend.
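When many runs share most settings, the parameter file can also be generated from a script. Nextflow's -params-file option accepts JSON as well as YAML, so the minimal sketch below writes a JSON file using only the Python standard library. The helper name write_params and all file names and values are illustrative assumptions, not part of the pipeline.

```python
import json
from pathlib import Path


def write_params(path, **overrides):
    """Write a JSON params file for the celltypist workflow and return its path."""
    # Placeholder values: the file names and output directory are assumptions.
    params = {
        "input": "input.h5mu",
        "modality": "rna",
        "reference": "reference.h5mu",
        "reference_obs_target": "cell_ontology_class",
        "majority_voting": True,
        "publish_dir": "output/",
    }
    # Keyword arguments replace or extend the placeholder values above.
    params.update(overrides)
    path = Path(path)
    path.write_text(json.dumps(params, indent=2) + "\n")
    return path


if __name__ == "__main__":
    written = write_params("params.json", min_prop=0.2)
    print(f"Wrote {written}")
```

The resulting file can then be passed in place of the YAML file, for example with -params-file params.json.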
Argument groups
Inputs
Input dataset (query) arguments
Name | Description | Attributes |
---|---|---|
--input | The input (query) data to be labeled. Should be a .h5mu file. | file, required, example: "input.h5mu" |
--modality | Which modality to process. | string, default: "rna" |
--input_layer | The layer in the input data containing log normalized counts to be used for cell type annotation if .X is not to be used. | string |
--input_var_gene_names | The name of the adata var column in the input data containing gene names. If not provided, the var index will be used. | string |
--input_reference_gene_overlap | The minimum number of genes present in both the reference and query datasets. | integer, default: 100 |
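To illustrate what --input_reference_gene_overlap guards against, here is a minimal sketch of an overlap check on two gene name lists. The function name check_gene_overlap and the toy gene lists are assumptions made for illustration; the component's own check may differ in detail (for example in how gene names are normalised).

```python
def check_gene_overlap(query_genes, reference_genes, min_overlap=100):
    """Return the sorted shared genes, or raise if too few are shared."""
    shared = set(query_genes) & set(reference_genes)
    if len(shared) < min_overlap:
        raise ValueError(
            f"Only {len(shared)} genes are shared between query and reference; "
            f"at least {min_overlap} are required."
        )
    return sorted(shared)


# Toy gene lists (hypothetical); real datasets contain thousands of genes.
query = ["CD3E", "CD19", "MS4A1", "GAPDH"]
reference = ["MS4A1", "CD3E", "NKG7"]

# With a threshold of 2 the check passes and returns the two shared genes.
print(check_gene_overlap(query, reference, min_overlap=2))

# With the default threshold of 100 the same lists are rejected.
try:
    check_gene_overlap(query, reference)
except ValueError as err:
    print(err)
```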
Reference
Arguments related to the reference dataset.
Name | Description | Attributes |
---|---|---|
--reference | The reference data to train the CellTypist classifiers on. Only required if a pre-trained --model is not provided. | file, example: "reference.h5mu" |
--reference_layer | The layer in the reference data to be used for cell type annotation if .X is not to be used. Data are expected to be processed in the same way as the --input query dataset. | string |
--reference_obs_target | The name of the adata obs column in the reference data containing cell type annotations. | string, default: "cell_ontology_class" |
--reference_var_gene_names | The name of the adata var column in the reference data containing gene names. If not provided, the var index will be used. | string |
--reference_var_input | The .var column containing highly variable genes. By default, genes are not subset. | string |
Model arguments
Model arguments.
Name | Description | Attributes |
---|---|---|
--model | Pretrained model in pkl format. If not provided, the model will be trained on the reference data and --reference should be provided. | file, example: "pretrained_model.pkl" |
--feature_selection | Whether to perform feature selection. | boolean, default: FALSE |
--majority_voting | Whether to refine the predicted labels by running the majority voting classifier after over-clustering. | boolean, default: FALSE |
--C | Inverse of regularization strength in logistic regression. | double, default: 1 |
--max_iter | Maximum number of iterations before reaching the minimum of the cost function. | integer, default: 1000 |
--use_SGD | Whether to use the stochastic gradient descent algorithm. | boolean_true |
--min_prop | For the dominant cell type within a subcluster, the minimum proportion of cells required to support naming of the subcluster by this cell type. Ignored if majority_voting is set to False. Subclusters that fail to pass this proportion threshold will be assigned "Heterogeneous". | double, default: 0 |
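The interaction between --majority_voting and --min_prop can be sketched in a few lines. The example below is a simplified illustration and not CellTypist's implementation: it takes the subcluster assignment as given (the over-clustering step is not shown), and the function name majority_vote and the toy labels are assumptions. It also assumes that a dominant proportion exactly equal to min_prop passes the threshold.

```python
from collections import Counter, defaultdict


def majority_vote(predictions, subclusters, min_prop=0.0):
    """Relabel each cell with the dominant predicted label of its subcluster."""
    # Group the per-cell predictions by subcluster.
    labels_by_cluster = defaultdict(list)
    for label, cluster in zip(predictions, subclusters):
        labels_by_cluster[cluster].append(label)

    # Pick the dominant label per subcluster, or "Heterogeneous" when its
    # share of the subcluster falls below min_prop.
    cluster_label = {}
    for cluster, labels in labels_by_cluster.items():
        dominant, count = Counter(labels).most_common(1)[0]
        proportion = count / len(labels)
        cluster_label[cluster] = dominant if proportion >= min_prop else "Heterogeneous"

    return [cluster_label[cluster] for cluster in subclusters]


# Toy data (hypothetical): subcluster 0 is 2/3 "T", subcluster 1 is all "B".
predictions = ["T", "T", "B", "B", "B", "B"]
subclusters = [0, 0, 0, 1, 1, 1]

print(majority_vote(predictions, subclusters))                # min_prop = 0.0
print(majority_vote(predictions, subclusters, min_prop=0.7))  # stricter threshold
```

With the default min_prop of 0, every subcluster takes its dominant label. Raising min_prop to 0.7 leaves subcluster 1 as "B" but turns subcluster 0 into "Heterogeneous", because only two of its three cells agree.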
Outputs
Output arguments.
Name | Description | Attributes |
---|---|---|
--output | Output h5mu file. | file, example: "output.h5mu" |
--output_compression | | string, example: "gzip" |
--output_obs_predictions | In which .obs slots to store the predicted information. | string, default: "celltypist_pred" |
--output_obs_probability | In which .obs slots to store the probability of the predictions. | string, default: "celltypist_probability" |