Pad tokenize
Tokenize and pad a batch of data for scGPT integration zero-shot inference or fine-tuning
Info
ID: pad_tokenize
Namespace: scgpt
Example commands
You can run the pipeline using nextflow run.
View help
You can use --help as a parameter to get an overview of the possible parameters.
nextflow run openpipelines-bio/openpipeline \
-r 2.1.1 -latest \
-main-script target/nextflow/scgpt/pad_tokenize/main.nf \
--help
Run command
Example of params.yaml
# Inputs
input: # please fill in - example: "input.h5mu"
modality: "rna"
model_vocab: # please fill in - example: "vocab.json"
# var_gene_names: "foo"
var_input: "id_in_vocab"
input_obsm_binned_counts: "binned_counts"
# Outputs
# output: "$id.$key.output.h5mu"
# output_compression: "gzip"
obsm_gene_tokens: "gene_id_tokens"
obsm_tokenized_values: "values_tokenized"
obsm_padding_mask: "padding_mask"
# Arguments
pad_token: "<pad>"
pad_value: -2
# max_seq_len: 123
# Nextflow input-output arguments
publish_dir: # please fill in - example: "output/"
# param_list: "my_params.yaml"
nextflow run openpipelines-bio/openpipeline \
-r 2.1.1 -latest \
-profile docker \
-main-script target/nextflow/scgpt/pad_tokenize/main.nf \
-params-file params.yaml
Note
Replace -profile docker with -profile podman or -profile singularity depending on the desired backend.
Argument groups
Inputs
| Name | Description | Attributes |
|---|---|---|
| --input | The input h5mu file of pre-processed data. | file, required, example: "input.h5mu" |
| --modality | | string, default: "rna" |
| --model_vocab | Path to model vocabulary file. | file, required, example: "vocab.json" |
| --var_gene_names | The name of the .var column containing gene names. When this argument is not provided, the .var index will be used. | string |
| --var_input | The name of the .var column containing a boolean mask for vocabulary cross-checked and/or highly variable genes. | string, default: "id_in_vocab" |
| --input_obsm_binned_counts | The name of the .obsm field containing the binned counts to be padded and tokenized. | string, default: "binned_counts" |
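
To illustrate how the --model_vocab, --var_gene_names and --var_input arguments relate to each other, here is a minimal sketch in Python. It is not the component's implementation: the toy vocabulary, the gene names, and the mask values are all hypothetical, and a real vocab.json is assumed to map gene names to integer token ids.

```python
import numpy as np

# Hypothetical toy vocabulary (assumption: vocab.json maps gene names to integer ids)
vocab = {"<pad>": 0, "<cls>": 1, "GENE_A": 2, "GENE_B": 3, "GENE_C": 4}

# Hypothetical .var contents: gene names plus a boolean mask column such as "id_in_vocab"
gene_names = np.array(["GENE_A", "GENE_X", "GENE_B", "GENE_C"])
id_in_vocab = np.array([True, False, True, True])

# Keep only the genes flagged by the mask, then look up their token ids
selected = gene_names[id_in_vocab]
gene_ids = np.array([vocab[name] for name in selected])

print(selected.tolist())
print(gene_ids.tolist())
```

In this sketch the gene missing from the vocabulary (GENE_X) is dropped before the lookup, which is why a mask column like the one named by --var_input is useful.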
Outputs
| Name | Description | Attributes |
|---|---|---|
| --output | The output h5mu file containing obsm arrays for gene tokens, tokenized data and padding mask. | file, required, example: "output.h5mu" |
| --output_compression | The compression type for the output file. | string, example: "gzip" |
| --obsm_gene_tokens | The key of the .obsm array containing the gene token ids. | string, default: "gene_id_tokens", example: "values.pt" |
| --obsm_tokenized_values | The key of the .obsm array containing the count values of the tokenized genes. | string, default: "values_tokenized" |
| --obsm_padding_mask | The key of the .obsm array containing the padding mask. | string, default: "padding_mask" |
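
The three .obsm outputs are easiest to read together. The sketch below uses small hand-written arrays (hypothetical values, not real output) and assumes that the padding mask marks positions where the gene token equals the id of the pad token, and that padded positions in the value array hold the pad value. Treat both as assumptions about the output layout rather than a guarantee.

```python
import numpy as np

pad_id = 0       # hypothetical id of "<pad>" in the vocabulary
pad_value = -2   # default of --pad_value

# Hypothetical arrays for two cells, each padded to a sequence length of 3
gene_id_tokens = np.array([[3, 4, 0],
                           [2, 0, 0]])
values_tokenized = np.array([[3, 1, -2],
                             [5, -2, -2]])

# Assumption: the mask is True wherever a position holds padding
padding_mask = gene_id_tokens == pad_id

# Number of real (non-padded) tokens per cell
n_tokens = (~padding_mask).sum(axis=1)
print(padding_mask.tolist())
print(n_tokens.tolist())
```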
Arguments
| Name | Description | Attributes |
|---|---|---|
| --pad_token | Token used for padding. | string, default: "<pad>" |
| --pad_value | The value of the padding token. | integer, default: -2 |
| --max_seq_len | The maximum sequence length of the tokenized data. Defaults to the number of features if not provided. | integer |
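
As a rough illustration of how --pad_token, --pad_value and --max_seq_len interact, the following Python sketch pads or truncates a single cell. The function name pad_tokenize_row is hypothetical, and the sketch makes two simplifying assumptions that may differ from the component: only genes with nonzero binned counts are kept, and sequences that are too long are cut at the front rather than subsampled.

```python
import numpy as np

def pad_tokenize_row(gene_ids, values, max_seq_len, pad_id, pad_value=-2):
    """Hypothetical helper: tokenize one cell and pad it to max_seq_len."""
    # Assumption: keep only genes with a nonzero binned count
    nonzero = np.nonzero(values)[0]
    # Simplification: truncate by keeping the first max_seq_len entries
    genes = gene_ids[nonzero][:max_seq_len]
    vals = values[nonzero][:max_seq_len]
    # Fill the remaining positions with the pad token id and the pad value
    n_pad = max_seq_len - len(genes)
    genes = np.concatenate([genes, np.full(n_pad, pad_id, dtype=genes.dtype)])
    vals = np.concatenate([vals, np.full(n_pad, pad_value, dtype=vals.dtype)])
    return genes, vals

gene_ids = np.array([2, 3, 4, 5])   # hypothetical token ids of four genes
values = np.array([0, 3, 1, 0])     # hypothetical binned counts for one cell

padded_genes, padded_vals = pad_tokenize_row(gene_ids, values, max_seq_len=3, pad_id=0)
short_genes, short_vals = pad_tokenize_row(gene_ids, values, max_seq_len=1, pad_id=0)
print(padded_genes.tolist(), padded_vals.tolist())
```

With max_seq_len=3 the two expressed genes are followed by one padded position; with max_seq_len=1 the row is truncated and no padding is added.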