Pad tokenize

Tokenize and pad a batch of data for scGPT integration zero-shot inference or fine-tuning.

Info

ID: pad_tokenize
Namespace: scgpt

Example commands

You can run the pipeline using nextflow run.

View help

You can use --help as a parameter to get an overview of the possible parameters.

nextflow run openpipelines-bio/openpipeline \
  -r 2.1.1 -latest \
  -main-script target/nextflow/scgpt/pad_tokenize/main.nf \
  --help

Run command

Example of params.yaml

# Inputs
input: # please fill in - example: "input.h5mu"
modality: "rna"
model_vocab: # please fill in - example: "vocab.json"
# var_gene_names: "foo"
var_input: "id_in_vocab"
input_obsm_binned_counts: "binned_counts"

# Outputs
# output: "$id.$key.output.h5mu"
# output_compression: "gzip"
obsm_gene_tokens: "gene_id_tokens"
obsm_tokenized_values: "values_tokenized"
obsm_padding_mask: "padding_mask"

# Arguments
pad_token: "<pad>"
pad_value: -2
# max_seq_len: 123

# Nextflow input-output arguments
publish_dir: # please fill in - example: "output/"
# param_list: "my_params.yaml"

nextflow run openpipelines-bio/openpipeline \
  -r 2.1.1 -latest \
  -profile docker \
  -main-script target/nextflow/scgpt/pad_tokenize/main.nf \
  -params-file params.yaml

Note

Replace -profile docker with -profile podman or -profile singularity depending on the desired backend.

Argument groups

Inputs

Name	Description	Attributes
`--input`	The input h5mu file of pre-processed data.	`file`, required, example: `"input.h5mu"`
`--modality`		`string`, default: `"rna"`
`--model_vocab`	Path to model vocabulary file.	`file`, required, example: `"vocab.json"`
`--var_gene_names`	The name of the .var column containing gene names. When `--var_gene_names` is not provided, the .var index is used.	`string`
`--var_input`	The name of the .var column containing a boolean mask that selects the genes to tokenize, i.e. genes that were cross-checked against the model vocabulary and/or are highly variable.	`string`, default: `"id_in_vocab"`
`--input_obsm_binned_counts`	The name of the .obsm field containing the binned counts to be padded and tokenized.	`string`, default: `"binned_counts"`
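
As a rough illustration of how the inputs above fit together, the sketch below maps gene names to vocabulary ids while honouring a boolean mask such as the one named by `--var_input`. It is not the component's implementation: the toy vocabulary, the gene names, and the helper `genes_to_token_ids` are hypothetical, and the real vocabulary is read from the file passed to `--model_vocab`.

```python
# Hypothetical toy data; a real run reads these from the h5mu file and the vocabulary file.
vocab = {"<pad>": 0, "<cls>": 1, "CD3E": 2, "CD19": 3, "MS4A1": 4}
gene_names = ["CD3E", "FOO1", "CD19", "MS4A1"]  # .var index or the --var_gene_names column
id_in_vocab = [True, False, True, True]         # boolean mask from the --var_input column


def genes_to_token_ids(gene_names, mask, vocab):
    """Return the vocabulary ids of the genes selected by the boolean mask."""
    return [vocab[gene] for gene, keep in zip(gene_names, mask) if keep]


gene_ids = genes_to_token_ids(gene_names, id_in_vocab, vocab)
print(gene_ids)  # [2, 3, 4]
```

Genes that are masked out (here the made-up `FOO1`) never reach the vocabulary lookup, which is why the mask is expected to exclude genes missing from the vocabulary.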

Outputs

Name	Description	Attributes
`--output`	The output h5mu file containing obsm arrays for gene tokens, tokenized data and padding mask.	`file`, required, example: `"output.h5mu"`
`--output_compression`	The compression type for the output file.	`string`, example: `"gzip"`
`--obsm_gene_tokens`	The key of the .obsm array containing the gene token ids.	`string`, default: `"gene_id_tokens"`, example: `"values.pt"`
`--obsm_tokenized_values`	The key of the .obsm array containing the count values of the tokenized genes.	`string`, default: `"values_tokenized"`
`--obsm_padding_mask`	The key of the .obsm array containing the padding mask.	`string`, default: `"padding_mask"`

Arguments

Name	Description	Attributes
`--pad_token`	Token used for padding.	`string`, default: `"<pad>"`
`--pad_value`	The value written to the tokenized values at padded positions.	`integer`, default: `-2`
`--max_seq_len`	The maximum sequence length of the tokenized data. Defaults to the number of features if not provided.	`integer`
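
To make the interaction between `--pad_token`, `--pad_value` and `--max_seq_len` concrete, here is a minimal sketch of padding one cell to a fixed length. It is a simplification under stated assumptions, not the component's code: the helper `pad_or_truncate` is hypothetical, truncation simply keeps the first `max_seq_len` genes, and the mask is assumed to be `True` at padded positions. The scGPT tokenizer itself may behave differently, for example in how it selects genes when a cell exceeds the maximum length.

```python
def pad_or_truncate(gene_ids, values, max_seq_len, pad_id, pad_value):
    """Return (gene_tokens, tokenized_values, padding_mask), each of length max_seq_len."""
    # Truncate first (assumption: keep the leading genes).
    gene_tokens = list(gene_ids[:max_seq_len])
    tokenized_values = list(values[:max_seq_len])
    # Fill the remaining positions with the pad token id and the pad value.
    n_pad = max_seq_len - len(gene_tokens)
    gene_tokens += [pad_id] * n_pad
    tokenized_values += [pad_value] * n_pad
    # Assumption: True marks a padded position.
    padding_mask = [token == pad_id for token in gene_tokens]
    return gene_tokens, tokenized_values, padding_mask


# Hypothetical example: three genes, binned counts 5, 0 and 1, padded to length 5.
pad_id = 0  # id of "<pad>" in a made-up vocabulary
tokens, vals, mask = pad_or_truncate([2, 3, 4], [5, 0, 1], 5, pad_id, -2)
print(tokens)  # [2, 3, 4, 0, 0]
print(vals)    # [5, 0, 1, -2, -2]
print(mask)    # [False, False, False, True, True]
```

The three returned lists correspond in spirit to the arrays stored under `--obsm_gene_tokens`, `--obsm_tokenized_values` and `--obsm_padding_mask`. Using a negative `pad_value` such as the default `-2` keeps padded positions distinguishable from genuine binned counts, which are non-negative.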

Authors

Dorien Roosen (maintainer, author)
Elizabeth Mlynarski (author)
Weiwei Schultz (contributor)