Embedding
Generation of cell embeddings for the integration of single-cell transcriptomic count data using scGPT.
Info
ID: embedding
Namespace: scgpt
Example commands
You can run the pipeline using nextflow run.
View help
You can use --help as a parameter to get an overview of the possible parameters.
nextflow run openpipelines-bio/openpipeline \
-r 2.1.0 -latest \
-main-script target/nextflow/scgpt/embedding/main.nf \
--help
Run command
Example of params.yaml
# Inputs
input: # please fill in - example: "input.h5mu"
modality: "rna"
model: # please fill in - example: "best_model.pt"
model_vocab: # please fill in - example: "vocab.json"
model_config: # please fill in - example: "args.json"
obsm_gene_tokens: "gene_id_tokens"
obsm_tokenized_values: "values_tokenized"
obsm_padding_mask: "padding_mask"
# var_gene_names: "foo"
# obs_batch_label: "foo"
# finetuned_checkpoints_key: "model_state_dict"
# Outputs
# output: "$id.$key.output.h5mu"
# output_compression: "gzip"
obsm_embeddings: "X_scGPT"
# Arguments
pad_token: "<pad>"
pad_value: -2
dsbn: true
batch_size: 64
# Nextflow input-output arguments
publish_dir: # please fill in - example: "output/"
# param_list: "my_params.yaml"
nextflow run openpipelines-bio/openpipeline \
-r 2.1.0 -latest \
-profile docker \
-main-script target/nextflow/scgpt/embedding/main.nf \
-params-file params.yaml
Note
Replace -profile docker with -profile podman or -profile singularity depending on the desired backend.
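The example params.yaml keeps obs_batch_label commented out while dsbn is set to true, yet the argument descriptions state that a batch label must be provided when dsbn is enabled. A small pre-flight check can catch this before the workflow is launched. The sketch below is illustrative only: check_params is a hypothetical helper, not part of OpenPipeline, and it takes the parameters as a plain dictionary instead of parsing the YAML file.

```python
# Hypothetical pre-flight check; not part of OpenPipeline.
REQUIRED_FILES = ("input", "model", "model_vocab", "model_config")


def check_params(params):
    """Return a list of human-readable problems found in `params`."""
    problems = []
    # The four file parameters are marked as required.
    for key in REQUIRED_FILES:
        if not params.get(key):
            problems.append(f"missing required parameter: {key}")
    # dsbn defaults to true and needs a batch label column in .obs.
    if params.get("dsbn", True) and not params.get("obs_batch_label"):
        problems.append("dsbn is true but obs_batch_label is not set")
    return problems


if __name__ == "__main__":
    params = {
        "input": "input.h5mu",
        "model": "best_model.pt",
        "model_vocab": "vocab.json",
        "model_config": "args.json",
        "dsbn": True,
    }
    for problem in check_params(params):
        print(problem)
```

With the values shown, the check reports the missing batch label. Setting obs_batch_label or switching dsbn to false clears it.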
Argument groups
Inputs
Name | Description | Attributes |
---|---|---|
--input | The input h5mu file containing tokenized gene and count data. | file, required, example: "input.h5mu" |
--modality | | string, default: "rna" |
--model | Path to the scGPT model file. | file, required, example: "best_model.pt" |
--model_vocab | Path to the scGPT model vocabulary file. | file, required, example: "vocab.json" |
--model_config | Path to the scGPT model config file. | file, required, example: "args.json" |
--obsm_gene_tokens | The key of the .obsm array containing the gene token ids. | string, default: "gene_id_tokens", example: "values.pt" |
--obsm_tokenized_values | The key of the .obsm array containing the count values of the tokenized genes. | string, default: "values_tokenized" |
--obsm_padding_mask | The key of the .obsm array containing the padding mask. | string, default: "padding_mask" |
--var_gene_names | The name of the .var column containing gene names. When not provided, the .var index will be used. | string |
--obs_batch_label | The name of the .obs column containing the batch labels. Must be provided when --dsbn is set to true. | string |
--finetuned_checkpoints_key | Key in the model file containing the pretrained checkpoints. Only relevant for fine-tuned models. | string, example: "model_state_dict" |
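To make the three .obsm inputs concrete, the sketch below pads one toy cell to a fixed length using plain Python lists. It is an illustration under assumptions, not the tokenizer used upstream: the vocabulary, the gene names, and the helper pad_cell are invented, and the convention that the mask is True at padded positions is assumed and not stated on this page. Only the "<pad>" token and the pad value -2 come from the defaults listed here.

```python
# Toy vocabulary; real token ids come from the model's vocab.json.
VOCAB = {"<pad>": 0, "GENE_A": 1, "GENE_B": 2, "GENE_C": 3}


def pad_cell(genes, values, max_len, pad_token="<pad>", pad_value=-2):
    """Pad one cell's gene tokens and values to `max_len` entries."""
    if len(genes) != len(values):
        raise ValueError("genes and values must have the same length")
    if len(genes) > max_len:
        raise ValueError("cell has more genes than max_len")
    n_pad = max_len - len(genes)
    gene_tokens = [VOCAB[g] for g in genes] + [VOCAB[pad_token]] * n_pad
    tokenized_values = list(values) + [pad_value] * n_pad
    # Assumed convention: True marks a padded position.
    padding_mask = [False] * len(genes) + [True] * n_pad
    return gene_tokens, tokenized_values, padding_mask


if __name__ == "__main__":
    tokens, vals, mask = pad_cell(["GENE_B", "GENE_C"], [5, 1], max_len=4)
    print(tokens)  # [2, 3, 0, 0]
    print(vals)    # [5, 1, -2, -2]
    print(mask)    # [False, False, True, True]
```

Each returned list corresponds to one row of the arrays stored under the --obsm_gene_tokens, --obsm_tokenized_values and --obsm_padding_mask keys.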
Outputs
Name | Description | Attributes |
---|---|---|
--output | Path to the output h5mu file containing the pre-processed data as well as the scGPT embeddings. | file, required, example: "output.h5mu" |
--output_compression | The compression algorithm to use for the output h5mu file. | string, example: "gzip" |
--obsm_embeddings | The name of the .obsm array to which the scGPT embeddings will be written. | string, default: "X_scGPT" |
Arguments
Name | Description | Attributes |
---|---|---|
--pad_token | The token to be used for padding. | string, default: "<pad>" |
--pad_value | The value of the padding token. | integer, default: -2 |
--dsbn | Whether to apply domain-specific batch normalization for generating embeddings. When set to true, --obs_batch_label must be set as well. | boolean, default: true |
--batch_size | The batch size to be used for inference. | integer, default: 64 |
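The batch size sets how many cells are passed through the model per inference step. As a rough illustration of what that means for a dataset of a given size, the sketch below splits cell indices into consecutive batches. It is not the component's actual inference loop, and iter_batches is a hypothetical helper.

```python
def iter_batches(n_cells, batch_size=64):
    """Yield (start, stop) index pairs covering `n_cells` in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, n_cells, batch_size):
        # The last batch may be smaller than batch_size.
        yield start, min(start + batch_size, n_cells)


if __name__ == "__main__":
    batches = list(iter_batches(150))
    print(len(batches))  # 3
    print(batches[-1])   # (128, 150)
```

With the default of 64, a dataset of 150 cells is processed in three steps, the last of which holds 22 cells. A smaller batch size typically lowers peak memory use at the cost of more steps.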