The workflow takes a pre-processed h5mu file as query input and performs:

- subsetting for highly variable genes (HVG)
- cross-checking of genes with the model vocabulary
- binning of gene counts
- padding and tokenizing of genes
- transformer-based cell type prediction

Note that cell type prediction using scGPT is only possible with a fine-tuned scGPT model.
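As a conceptual illustration of the first two steps (HVG subsetting and vocabulary cross-checking), a minimal Python sketch using scanpy and mudata is shown below. The modality, layer and batch column names are assumptions, and the workflow performs these steps internally for you.

```python
import json
import scanpy as sc
import mudata as mu

# Load the query data and select the RNA modality (names are assumptions).
mdata = mu.read_h5mu("input.h5mu")
adata = mdata.mod["rna"]

# Subset to highly variable genes, mirroring the workflow defaults
# (n_hvg=1200, hvg_flavor="cell_ranger", batched on the .obs batch label).
sc.pp.highly_variable_genes(
    adata, n_top_genes=1200, flavor="cell_ranger", batch_key="sample_id"
)
adata = adata[:, adata.var["highly_variable"]].copy()

# Cross-check gene names against the scGPT model vocabulary, assuming
# vocab.json maps gene names to token ids, and keep only known genes.
with open("vocab.json") as f:
    vocab = json.load(f)
adata = adata[:, adata.var_names.isin(list(vocab))].copy()
```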
Example commands
You can run the pipeline using nextflow run.
View help
You can use --help as a parameter to get an overview of the possible parameters.
```bash
nextflow run openpipelines-bio/openpipeline \
  -r 2.1.0 -latest \
  -main-script target/nextflow/workflows/annotation/scgpt_annotation/main.nf \
  --help
```
Run command
Example of params.yaml
```yaml
# Query input
id: # please fill in - example: "foo"
input: # please fill in - example: "input.h5mu"
modality: "rna"
# input_layer: "foo"
# input_var_gene_names: "foo"
input_obs_batch_label: # please fill in - example: "foo"

# Model input
model: # please fill in - example: "best_model.pt"
model_config: # please fill in - example: "args.json"
model_vocab: # please fill in - example: "vocab.json"
finetuned_checkpoints_key: "model_state_dict"
label_mapper_key: "id_to_class"

# Outputs
# output: "$id.$key.output.h5mu"
# output_compression: "gzip"
output_obs_predictions: "scgpt_pred"
output_obs_probability: "scgpt_probability"

# Padding arguments
pad_token: "<pad>"
pad_value: -2

# HVG subset arguments
n_hvg: 1200
hvg_flavor: "cell_ranger"

# Tokenization arguments
# max_seq_len: 123

# Embedding arguments
dsbn: true
batch_size: 64

# Binning arguments
n_input_bins: 51
# seed: 123

# Nextflow input-output arguments
publish_dir: # please fill in - example: "output/"
# param_list: "my_params.yaml"
```
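With a params.yaml like the one above, the run command follows the same pattern as the help invocation; the example below assumes the Docker backend and a params.yaml in the launch directory.

```bash
nextflow run openpipelines-bio/openpipeline \
  -r 2.1.0 -latest \
  -profile docker \
  -main-script target/nextflow/workflows/annotation/scgpt_annotation/main.nf \
  -params-file params.yaml
```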
Replace -profile docker with -profile podman or -profile singularity depending on the desired backend.
Argument groups
Query input
| Name | Description | Attributes |
|------|-------------|------------|
| --id | ID of the sample. | string, required, example: "foo" |
| --input | Path to the input file. | file, required, example: "input.h5mu" |
| --modality | | string, default: "rna" |
| --input_layer | The layer of the input dataset to process if .X is not to be used. Should contain log normalized counts. | string |
| --input_var_gene_names | The .var field in the input (query) dataset containing gene names; if not provided, the .var index will be used. | string |
| --input_obs_batch_label | The .obs field in the input (query) dataset containing the batch labels. | string, required |
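As a quick sanity check of the query input, you can verify that the expected fields are present before running the workflow. The sketch below uses mudata; the layer and column names ("log_normalized", "sample_id") are assumptions and should match what you pass to --input_layer and --input_obs_batch_label.

```python
import mudata as mu

mdata = mu.read_h5mu("input.h5mu")
adata = mdata.mod["rna"]  # the modality selected via --modality

# The batch label column passed to --input_obs_batch_label must exist in .obs.
assert "sample_id" in adata.obs.columns

# If --input_layer is used, it should hold log-normalized counts;
# otherwise .X is used directly.
assert "log_normalized" in adata.layers

# Gene names are taken from --input_var_gene_names, or from the .var index.
print(adata.var_names[:5])
```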
Model input
| Name | Description | Attributes |
|------|-------------|------------|
| --model | The scGPT model file. Must be a fine-tuned model that contains keys for the checkpoints (--finetuned_checkpoints_key) and the cell type label mapper (--label_mapper_key). | file, required, example: "best_model.pt" |
| --model_config | The scGPT model configuration file. | file, required, example: "args.json" |
| --model_vocab | The scGPT model vocabulary file. | file, required, example: "vocab.json" |
| --finetuned_checkpoints_key | Key in the model file containing the fine-tuned model checkpoints. | string, default: "model_state_dict" |
| --label_mapper_key | Key in the model file containing the cell type class to label mapper dictionary. | string, default: "id_to_class" |
Outputs
| Name | Description | Attributes |
|------|-------------|------------|
| --output | Output file path. | file, required, example: "output.h5mu" |
| --output_compression | The compression algorithm to use for the output h5mu file. | string, example: "gzip" |
| --output_obs_predictions | The name of the adata.obs column to write the predicted cell type labels to. | string, default: "scgpt_pred" |
| --output_obs_probability | The name of the adata.obs column to write the probabilities of the predicted cell type labels to. | string, default: "scgpt_probability" |
Padding arguments
| Name | Description | Attributes |
|------|-------------|------------|
| --pad_token | Token used for padding. | string, default: "<pad>" |
| --pad_value | The value of the padding token. | integer, default: -2 |
HVG subset arguments
| Name | Description | Attributes |
|------|-------------|------------|
| --n_hvg | Number of highly variable genes to subset for. | integer, default: 1200 |
| --hvg_flavor | Method to be used for identifying highly variable genes. Note that the default for this workflow (cell_ranger) is not the default method for scanpy HVG detection (seurat). | string, default: "cell_ranger" |
Tokenization arguments
| Name | Description | Attributes |
|------|-------------|------------|
| --max_seq_len | The maximum sequence length of the tokenized data. | integer |
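To illustrate how the padding and tokenization arguments interact, the sketch below pads one cell's gene tokens and binned values to a fixed sequence length. This is only a conceptual illustration of what the workflow does internally, not the scGPT implementation; the pad token id is an assumption taken from the model vocabulary.

```python
import numpy as np

max_seq_len = 1201   # --max_seq_len
pad_token_id = 0     # id of "<pad>" in the model vocabulary (assumption)
pad_value = -2       # --pad_value

def pad_cell(gene_ids, values, max_seq_len, pad_token_id, pad_value):
    """Truncate or pad one cell's gene ids and values to max_seq_len."""
    gene_ids = np.asarray(gene_ids[:max_seq_len])
    values = np.asarray(values[:max_seq_len])
    n_pad = max_seq_len - len(gene_ids)
    gene_ids = np.concatenate([gene_ids, np.full(n_pad, pad_token_id)])
    values = np.concatenate([values, np.full(n_pad, pad_value)])
    return gene_ids, values

ids, vals = pad_cell([12, 87, 5], [3, 1, 7], max_seq_len, pad_token_id, pad_value)
print(ids.shape, vals.shape)  # (1201,) (1201,)
```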
Embedding arguments
| Name | Description | Attributes |
|------|-------------|------------|
| --dsbn | Apply domain-specific batch normalization. | boolean, default: TRUE |
| --batch_size | The batch size to be used for embedding inference. | integer, default: 64 |
Binning arguments
| Name | Description | Attributes |
|------|-------------|------------|
| --n_input_bins | The number of bins to discretize the data into. When no value is provided, the data won't be binned. | integer, default: 51 |
| --seed | Seed for random number generation used for binning. If not set, no seed is used. | integer |
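The binning step discretizes each cell's non-zero expression values into --n_input_bins value bins, with zeros kept in a bin of their own. A conceptual sketch using per-cell quantile-based bin edges is shown below; the exact scGPT binning code may differ.

```python
import numpy as np

def bin_counts(values, n_bins=51):
    """Digitize one cell's non-zero expression values into n_bins value bins."""
    binned = np.zeros_like(values, dtype=int)
    nonzero = values > 0
    if nonzero.any():
        # Quantile-based bin edges computed on this cell's non-zero values.
        edges = np.quantile(values[nonzero], np.linspace(0, 1, n_bins - 1))
        binned[nonzero] = np.digitize(values[nonzero], edges, right=True) + 1
    return binned

cell = np.array([0.0, 0.5, 2.3, 0.0, 7.1])
print(bin_counts(cell, n_bins=51))
```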