Random forest annotation

Automated cell type annotation tool for scRNA-seq datasets on the basis of random forest.

Info

ID: random_forest_annotation
Namespace: annotate

Example commands

You can run the pipeline using nextflow run.

View help

You can use --help as a parameter to get an overview of the possible parameters.

nextflow run openpipelines-bio/openpipeline \
  -r 2.1.0 -latest \
  -main-script target/nextflow/annotate/random_forest_annotation/main.nf \
  --help

Run command

Example of params.yaml
# Inputs
input: # please fill in - example: "input.h5mu"
modality: "rna"
# input_layer: "foo"
# input_var_gene_names: "foo"
input_reference_gene_overlap: 100

# Reference
# reference: "reference.h5mu"
# reference_layer: "foo"
reference_obs_target: # please fill in - example: "foo"
# reference_var_gene_names: "foo"
# reference_var_input: "foo"

# Outputs
# output: "$id.$key.output.h5mu"
# output_compression: "gzip"
output_obs_predictions: "random_forest_pred"
output_obs_probability: "random_forest_probability"

# Model arguments
# model: "pretrained_model.pkl"
n_estimators: 100
# max_depth: 123
criterion: "gini"
class_weight: "balanced_subsample"
max_features: "200"

# Nextflow input-output arguments
publish_dir: # please fill in - example: "output/"
# param_list: "my_params.yaml"

# Arguments
nextflow run openpipelines-bio/openpipeline \
  -r 2.1.0 -latest \
  -profile docker \
  -main-script target/nextflow/annotate/random_forest_annotation/main.nf \
  -params-file params.yaml
Note

Replace -profile docker with -profile podman or -profile singularity depending on the desired backend.

Argument groups

Inputs

Input dataset (query) arguments

Name Description Attributes
--input The input (query) data to be labeled. Should be a .h5mu file. file, required, example: "input.h5mu"
--modality Which modality to process. string, default: "rna"
--input_layer The layer in the input data to be used for cell type annotation if .X is not to be used. string
--input_var_gene_names The name of the adata var column in the input data containing gene names; when no gene_name_layer is provided, the var index will be used. string
--input_reference_gene_overlap The minimum number of genes present in both the reference and query datasets. integer, default: 100

Reference

Arguments related to the reference dataset.

Name Description Attributes
--reference The reference data to train the CellTypist classifiers on. Only required if a pre-trained –model is not provided. file, example: "reference.h5mu"
--reference_layer The layer in the reference data to be used for cell type annotation if .X is not to be used. Data are expected to be processed in the same way as the –input query dataset. string
--reference_obs_target Key in obs field of reference modality with cell-type information. string, required
--reference_var_gene_names The name of the adata var column in the reference data containing gene names; when no gene_name_layer is provided, the var index will be used. string
--reference_var_input .var column containing highly variable genes. By default, do not subset genes. string

Outputs

Output arguments.

Name Description Attributes
--output Output h5mu file. file, example: "output.h5mu"
--output_compression string, example: "gzip"
--output_obs_predictions In which .obs slots to store the predicted information. string, default: "random_forest_pred"
--output_obs_probability In which .obs slots to store the probability of the predictions. string, default: "random_forest_probability"

Model arguments

Model arguments.

Name Description Attributes
--model Pretrained model in pkl format. If not provided, the model will be trained on the reference data and –reference should be provided. file, example: "pretrained_model.pkl"
--n_estimators Number of trees in the random forest. integer, default: 100
--max_depth Maximum depth of the trees in the random forest. If not provided, the nodes are expanded until all leaves only contain a single sample. integer
--criterion The function to measure the quality of a split. string, default: "gini"
--class_weight Weights associated with classes. The balanced mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data. The balanced_subsample mode is the same as balanced except that weights are computed based on the bootstrap sample for every tree grown. The uniform mode gives all classes a weight of one. string, default: "balanced_subsample"
--max_features The number of features to consider when looking for the best split. The value can either be a positive integer or one of sqrt, log2 or all. If integer: consider max_features features at each split. If sqrt: max_features is the squareroot of all input features. If log2: max_features is the log2 of all input features. If all: max features equals all input features. string, default: "200"

Authors

  • Jakub Majercik (author)