Split h5mu train test

Split a MuData object into separate training and testing (and, optionally, validation) MuData objects by partitioning its observations.

Info

ID: split_h5mu_train_test
Namespace: dataflow

Example commands

You can run the pipeline using nextflow run.

View help

You can use --help as a parameter to get an overview of the possible parameters.

nextflow run openpipelines-bio/openpipeline \
  -r 2.1.0 -latest \
  -main-script target/nextflow/dataflow/split_h5mu_train_test/main.nf \
  --help

Run command

Example of params.yaml
# Inputs
input: # please fill in - example: "input.h5mu"
modality: "rna"

# Outputs
# output_train: "$id.$key.output_train.h5mu"
# output_test: "$id.$key.output_test.h5mu"
# output_val: "$id.$key.output_val.h5mu"
# compression: "gzip"

# Split arguments
test_size: 0.2
# val_size: 123.0
shuffle: false
# random_state: 123

# Nextflow input-output arguments
publish_dir: # please fill in - example: "output/"
# param_list: "my_params.yaml"

# Arguments
nextflow run openpipelines-bio/openpipeline \
  -r 2.1.0 -latest \
  -profile docker \
  -main-script target/nextflow/dataflow/split_h5mu_train_test/main.nf \
  -params-file params.yaml
Note

Replace -profile docker with -profile podman or -profile singularity depending on the desired backend.
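If the command is launched from a script, switching backends amounts to changing one argument. The following is a minimal sketch, not part of the pipeline: the helper name `build_run_command` and the constants are hypothetical, and the flags and values are copied from the example command above.

```python
import shlex

# Values copied from the example command in this document.
PIPELINE = "openpipelines-bio/openpipeline"
REVISION = "2.1.0"
MAIN_SCRIPT = "target/nextflow/dataflow/split_h5mu_train_test/main.nf"


def build_run_command(params_file="params.yaml", profile="docker"):
    """Return the `nextflow run` invocation as an argument list.

    Hypothetical helper for illustration only.
    """
    # Only the three backends named in the note are accepted here.
    if profile not in {"docker", "podman", "singularity"}:
        raise ValueError(f"unsupported profile: {profile}")
    return [
        "nextflow", "run", PIPELINE,
        "-r", REVISION, "-latest",
        "-profile", profile,
        "-main-script", MAIN_SCRIPT,
        "-params-file", params_file,
    ]


if __name__ == "__main__":
    # Print a copy-pasteable command line for the singularity backend.
    print(shlex.join(build_run_command(profile="singularity")))
```

The returned list can be passed to `subprocess.run` as-is, which avoids shell quoting issues when the params file path contains spaces.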

Argument groups

Inputs

Input dataset in mudata format.

| Name | Description | Attributes |
|---|---|---|
| --input | The input data to be split. Should be a .h5mu file. | file, required, example: "input.h5mu" |
| --modality | Which modality to process. | string, default: "rna" |

Outputs

Output arguments.

| Name | Description | Attributes |
|---|---|---|
| --output_train | The output training data in mudata format. | file, required, example: "output_train.h5mu" |
| --output_test | The output testing data in mudata format. | file, required, example: "output_test.h5mu" |
| --output_val | The output validation data in mudata format. | file, example: "output_val.h5mu" |
| --compression | | string, example: "gzip" |

Split arguments

Arguments that control how the observations are split.

| Name | Description | Attributes |
|---|---|---|
| --test_size | The proportion of the dataset to include in the test split. | double, default: 0.2 |
| --val_size | The proportion of the dataset to include in the validation split. | double |
| --shuffle | Whether or not to shuffle the data before splitting. | boolean_true |
| --random_state | The seed used by the random number generator. | integer |
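To make the interaction between the split arguments concrete, here is a rough sketch of how observation indices could be partitioned. It is not the component's implementation: the function name `split_indices`, the rounding rule (rounding the test and validation counts up) and the order of the slices are assumptions chosen for illustration.

```python
import math
import random


def split_indices(n_obs, test_size=0.2, val_size=None, shuffle=False, random_state=None):
    """Partition observation indices 0..n_obs-1 into train/test(/val) groups.

    Illustrative only; rounding and slice order are assumptions.
    """
    val_size = val_size or 0.0
    if not 0.0 < test_size < 1.0 or not 0.0 <= val_size < 1.0:
        raise ValueError("test_size and val_size must be proportions below 1")

    indices = list(range(n_obs))
    if shuffle:
        # A seeded generator makes the shuffled order reproducible.
        random.Random(random_state).shuffle(indices)

    # Assumption: test and validation counts are rounded up.
    n_test = math.ceil(test_size * n_obs)
    n_val = math.ceil(val_size * n_obs)
    n_train = n_obs - n_test - n_val
    if n_train <= 0:
        raise ValueError("no observations left for the training split")

    split = {
        "train": indices[:n_train],
        "test": indices[n_train:n_train + n_test],
    }
    if n_val:
        # The validation group is only produced when val_size is set.
        split["val"] = indices[n_train + n_test:]
    return split


if __name__ == "__main__":
    # 10 observations, 20% test, 10% validation, no shuffling.
    print(split_indices(10, test_size=0.2, val_size=0.1))
```

Without shuffling, the groups are contiguous blocks in the original observation order, so a dataset sorted by sample or batch would yield unrepresentative splits. Shuffling together with a fixed seed gives a random split that can still be reproduced.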

Authors

  • Jakub Majercik (author)