The workflow takes a pre-processed h5mu file as query input and performs:

- subsetting for highly variable genes (HVG)
- cross-checking of genes with the model vocabulary
- binning of gene counts
- padding and tokenizing of genes
- transformer-based cell type prediction

Note that cell type prediction using scGPT is only possible with a fine-tuned scGPT model.
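As a conceptual illustration of the first two steps (HVG subsetting and vocabulary cross-checking), a minimal Python sketch using scanpy and mudata is shown below. The modality, layer and batch column names are assumptions, and the workflow performs these steps internally for you.

```python
import json
import scanpy as sc
import mudata as mu

# Load the query data and select the RNA modality (names are assumptions).
mdata = mu.read_h5mu("input.h5mu")
adata = mdata.mod["rna"]

# Subset to highly variable genes, mirroring the workflow defaults
# (n_hvg=1200, hvg_flavor="cell_ranger", batched on the .obs batch label).
sc.pp.highly_variable_genes(
    adata, n_top_genes=1200, flavor="cell_ranger", batch_key="sample_id"
)
adata = adata[:, adata.var["highly_variable"]].copy()

# Cross-check gene names against the scGPT model vocabulary, assuming
# vocab.json maps gene names to token ids, and keep only known genes.
with open("vocab.json") as f:
    vocab = json.load(f)
adata = adata[:, adata.var_names.isin(list(vocab))].copy()
```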
Example commands
You can run the pipeline using nextflow run.
View help
You can use --help as a parameter to get an overview of the possible parameters.
```bash
nextflow run openpipelines-bio/openpipeline \
  -r 2.1.0 -latest \
  -main-script target/nextflow/workflows/annotation/scgpt_annotation/main.nf \
  --help
```
Run command
Example of params.yaml
```yaml
# Query input
id: # please fill in - example: "foo"
input: # please fill in - example: "input.h5mu"
modality: "rna"
# input_layer: "foo"
# input_var_gene_names: "foo"
input_obs_batch_label: # please fill in - example: "foo"

# Model input
model: # please fill in - example: "best_model.pt"
model_config: # please fill in - example: "args.json"
model_vocab: # please fill in - example: "vocab.json"
finetuned_checkpoints_key: "model_state_dict"
label_mapper_key: "id_to_class"

# Outputs
# output: "$id.$key.output.h5mu"
# output_compression: "gzip"
output_obs_predictions: "scgpt_pred"
output_obs_probability: "scgpt_probability"

# Padding arguments
pad_token: "<pad>"
pad_value: -2

# HVG subset arguments
n_hvg: 1200
hvg_flavor: "cell_ranger"

# Tokenization arguments
# max_seq_len: 123

# Embedding arguments
dsbn: true
batch_size: 64

# Binning arguments
n_input_bins: 51
# seed: 123

# Nextflow input-output arguments
publish_dir: # please fill in - example: "output/"
# param_list: "my_params.yaml"
```
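With a params.yaml like the one above, the run command follows the same pattern as the help invocation; the example below assumes the Docker backend and a params.yaml in the launch directory.

```bash
nextflow run openpipelines-bio/openpipeline \
  -r 2.1.0 -latest \
  -profile docker \
  -main-script target/nextflow/workflows/annotation/scgpt_annotation/main.nf \
  -params-file params.yaml
```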
Replace -profile docker with -profile podman or -profile singularity depending on the desired backend.
Argument groups
Query input
| Name | Description | Attributes |
|------|-------------|------------|
| --id | ID of the sample. | string, required, example: "foo" |
| --input | Path to the input file. | file, required, example: "input.h5mu" |
| --modality | | string, default: "rna" |
| --input_layer | The layer of the input dataset to process if .X is not to be used. Should contain log normalized counts. | string |
| --input_var_gene_names | The .var field in the input (query) dataset containing gene names; if not provided, the .var index will be used. | string |
| --input_obs_batch_label | The .obs field in the input (query) dataset containing the batch labels. | string, required |
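As a quick sanity check of the query input, you can verify that the expected fields are present before running the workflow. The sketch below uses mudata; the layer and column names ("log_normalized", "sample_id") are assumptions and should match what you pass to --input_layer and --input_obs_batch_label.

```python
import mudata as mu

mdata = mu.read_h5mu("input.h5mu")
adata = mdata.mod["rna"]  # the modality selected via --modality

# The batch label column passed to --input_obs_batch_label must exist in .obs.
assert "sample_id" in adata.obs.columns

# If --input_layer is used, it should hold log-normalized counts;
# otherwise .X is used directly.
assert "log_normalized" in adata.layers

# Gene names are taken from --input_var_gene_names, or from the .var index.
print(adata.var_names[:5])
```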
Model input
| Name | Description | Attributes |
|------|-------------|------------|
| --model | The scGPT model file. Must be a fine-tuned model that contains keys for the checkpoints (--finetuned_checkpoints_key) and the cell type label mapper (--label_mapper_key). | file, required, example: "best_model.pt" |
| --model_config | The scGPT model configuration file. | file, required, example: "args.json" |
| --model_vocab | The scGPT model vocabulary file. | file, required, example: "vocab.json" |
| --finetuned_checkpoints_key | Key in the model file containing the fine-tuned model checkpoints. | string, default: "model_state_dict" |
| --label_mapper_key | Key in the model file containing the cell type class to label mapper dictionary. | string, default: "id_to_class" |
Outputs
| Name | Description | Attributes |
|------|-------------|------------|
| --output | Output file path. | file, required, example: "output.h5mu" |
| --output_compression | The compression algorithm to use for the output h5mu file. | string, example: "gzip" |
| --output_obs_predictions | The name of the adata.obs column to write the predicted cell type labels to. | string, default: "scgpt_pred" |
| --output_obs_probability | The name of the adata.obs column to write the probabilities of the predicted cell type labels to. | string, default: "scgpt_probability" |
Padding arguments
| Name | Description | Attributes |
|------|-------------|------------|
| --pad_token | Token used for padding. | string, default: "<pad>" |
| --pad_value | The value of the padding token. | integer, default: -2 |
HVG subset arguments
| Name | Description | Attributes |
|------|-------------|------------|
| --n_hvg | Number of highly variable genes to subset for. | integer, default: 1200 |
| --hvg_flavor | Method to be used for identifying highly variable genes. Note that the default for this workflow (cell_ranger) is not the default method for scanpy HVG detection (seurat). | string, default: "cell_ranger" |
Tokenization arguments
| Name | Description | Attributes |
|------|-------------|------------|
| --max_seq_len | The maximum sequence length of the tokenized data. | integer |
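To illustrate how the padding and tokenization arguments interact, the sketch below pads one cell's gene tokens and binned values to a fixed sequence length. This is only a conceptual illustration of what the workflow does internally, not the scGPT implementation; the pad token id is an assumption taken from the model vocabulary.

```python
import numpy as np

max_seq_len = 1201   # --max_seq_len
pad_token_id = 0     # id of "<pad>" in the model vocabulary (assumption)
pad_value = -2       # --pad_value

def pad_cell(gene_ids, values, max_seq_len, pad_token_id, pad_value):
    """Truncate or pad one cell's gene ids and values to max_seq_len."""
    gene_ids = np.asarray(gene_ids[:max_seq_len])
    values = np.asarray(values[:max_seq_len])
    n_pad = max_seq_len - len(gene_ids)
    gene_ids = np.concatenate([gene_ids, np.full(n_pad, pad_token_id)])
    values = np.concatenate([values, np.full(n_pad, pad_value)])
    return gene_ids, values

ids, vals = pad_cell([12, 87, 5], [3, 1, 7], max_seq_len, pad_token_id, pad_value)
print(ids.shape, vals.shape)  # (1201,) (1201,)
```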
Embedding arguments
| Name | Description | Attributes |
|------|-------------|------------|
| --dsbn | Apply domain-specific batch normalization. | boolean, default: TRUE |
| --batch_size | The batch size to be used for embedding inference. | integer, default: 64 |
Binning arguments
| Name | Description | Attributes |
|------|-------------|------------|
| --n_input_bins | The number of bins to discretize the data into. When no value is provided, the data won't be binned. | integer, default: 51 |
| --seed | Seed for random number generation used for binning. If not set, no seed is used. | integer |
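The binning step discretizes each cell's non-zero expression values into --n_input_bins value bins, with zeros kept in a bin of their own. A conceptual sketch using per-cell quantile-based bin edges is shown below; the exact scGPT binning code may differ.

```python
import numpy as np

def bin_counts(values, n_bins=51):
    """Digitize one cell's non-zero expression values into n_bins value bins."""
    binned = np.zeros_like(values, dtype=int)
    nonzero = values > 0
    if nonzero.any():
        # Quantile-based bin edges computed on this cell's non-zero values.
        edges = np.quantile(values[nonzero], np.linspace(0, 1, n_bins - 1))
        binned[nonzero] = np.digitize(values[nonzero], edges, right=True) + 1
    return binned

cell = np.array([0.0, 0.5, 2.3, 0.0, 7.1])
print(bin_counts(cell, n_bins=51))
```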