Embedding

Generation of cell embeddings for the integration of single cell transcriptomic count data using scGPT

Info

ID: embedding
Namespace: scgpt

Example commands

You can run the pipeline using nextflow run.

View help

You can use --help as a parameter to get an overview of the possible parameters.

nextflow run openpipelines-bio/openpipeline \
  -r 2.1.0 -latest \
  -main-script target/nextflow/scgpt/embedding/main.nf \
  --help

Run command

Example of params.yaml
# Inputs
input: # please fill in - example: "input.h5mu"
modality: "rna"
model: # please fill in - example: "best_model.pt"
model_vocab: # please fill in - example: "vocab.json"
model_config: # please fill in - example: "args.json"
obsm_gene_tokens: "gene_id_tokens"
obsm_tokenized_values: "values_tokenized"
obsm_padding_mask: "padding_mask"
# var_gene_names: "foo"
# obs_batch_label: "foo"
# finetuned_checkpoints_key: "model_state_dict"

# Outputs
# output: "$id.$key.output.h5mu"
# output_compression: "gzip"
obsm_embeddings: "X_scGPT"

# Arguments
pad_token: "<pad>"
pad_value: -2
dsbn: true
batch_size: 64

# Nextflow input-output arguments
publish_dir: # please fill in - example: "output/"
# param_list: "my_params.yaml"
nextflow run openpipelines-bio/openpipeline \
  -r 2.1.0 -latest \
  -profile docker \
  -main-script target/nextflow/scgpt/embedding/main.nf \
  -params-file params.yaml
Note

Replace -profile docker with -profile podman or -profile singularity depending on the desired backend.

Argument groups

Inputs

Name Description Attributes
--input The input h5mu file containing tokenized gene and count data. file, required, example: "input.h5mu"
--modality string, default: "rna"
--model Path to scGPT model file. file, required, example: "best_model.pt"
--model_vocab Path to scGPT model vocabulary file. file, required, example: "vocab.json"
--model_config Path to scGPT model config file. file, required, example: "args.json"
--obsm_gene_tokens The key of the .obsm array containing the gene token ids string, default: "gene_id_tokens", example: "values.pt"
--obsm_tokenized_values The key of the .obsm array containing the count values of the tokenized genes string, default: "values_tokenized"
--obsm_padding_mask The key of the .obsm array containing the padding mask. string, default: "padding_mask"
--var_gene_names The name of the .var column containing gene names. When no gene_name_layer is provided, the .var index will be used. string
--obs_batch_label The name of the adata.obs column containing the batch labels. Must be provided when ‘dsbn’ is set to True. string
--finetuned_checkpoints_key Key in the model file containing the pretrained checkpoints. Only relevant for fine-tuned models. string, example: "model_state_dict"

Outputs

Name Description Attributes
--output Path to output anndata file containing pre-processed data as well as scGPT embeddings. file, required, example: "output.h5mu"
--output_compression The compression algorithm to use for the output h5mu file. string, example: "gzip"
--obsm_embeddings The name of the adata.obsm array to which scGPT embeddings will be written. string, default: "X_scGPT"

Arguments

Name Description Attributes
--pad_token The token to be used for padding. string, default: "<pad>"
--pad_value The value of the padding token. integer, default: -2
--dsbn Whether to apply domain-specific batch normalization for generating embeddings. When set to True, ‘obs_batch_labels’ must be set as well. boolean, default: TRUE
--batch_size The batch size to be used for inference integer, default: 64

Authors

  • Dorien Roosen (maintainer, author)

  • Elizabeth Mlynarski (author)

  • Weiwei Schultz (contributor)