install.packages("devtools")
devtools::install_github("sahilseth/flowr")
## OR
install.packages("flowr")
library(flowr) ## load the library
setup() ## copy flowr bash script
run('sleep', execute=TRUE, platform='moab')
## OR from terminal
# flowr run sleep execute=TRUE platform=moab
# flowr

Faced with a deluge of data, flowr helps in streamlining computing workflows.
flowr has two ingredients:

- **flow_mat**: a table of shell commands to run (one row per command)
- **flow_def**: a flow definition describing how those commands are stitched together and submitted

The flow definition answers two questions, using submission types (`scatter`, `sequential`/`serial`) and dependency types (`gather`, `serial`, `burst`):
Given a bunch of shell commands for a step, how should the jobs be submitted?

- `serial` (or `sequential`): submit them one after the other
- `scatter`: submit all of them at the same time, executing them in parallel

In what fashion should the downstream step wait for the previous step(s)?

- `gather`: wait for N jobs in the previous step to complete
- `serial`: when the ith job in the previous step completes, start the ith job in the current step
- `burst`: when the previous step completes (and had a single job), start N jobs of the current step
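As a sketch of how the two ingredients fit together, one might build them as plain data frames. The `samplename`, `jobname`, and `cmd` columns match the defaults flowr reports later in this document; the `sub_type`, `dep_type`, and `prev_jobs` column names are illustrative assumptions, not checked against the package:

```r
## a minimal flow_mat: one row per shell command
flowmat = data.frame(
  samplename = "sample1",
  jobname    = c("sleep", "sleep", "sleep", "merge"),
  cmd        = c("sleep 1", "sleep 2", "sleep 3", "cat tmp* > merged.txt"),
  stringsAsFactors = FALSE)

## a minimal flow_def: one row per step, wiring the submission
## and dependency types described above (column names assumed)
flowdef = data.frame(
  jobname   = c("sleep", "merge"),
  sub_type  = c("scatter", "serial"),   ## sleep jobs run in parallel
  prev_jobs = c("none", "sleep"),
  dep_type  = c("none", "gather"),      ## merge waits for all sleep jobs
  stringsAsFactors = FALSE)
```

In this sketch the three `sleep` commands would be scattered in parallel, and `merge` would start only once all of them complete (`gather`).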
#install.packages(devtools)
#devtools::install_github("sahilseth/flowr")
## OR
#install.packages("flowr")
library(flowr)
setup()
Consider adding ~/bin to your PATH variable in .bashrc.
export PATH=$PATH:$HOME/bin
You may now use all R functions using 'flowr' from shell.
exdata = file.path(system.file(package = "flowr"), "pipelines")
flowmat = as.flowmat(file.path(exdata, "abcd.tsv"))
flowdef = as.flowdef(file.path(exdata, "abcd.def"))
fobj <- to_flow(x = flowmat, def = flowdef)
input x is data.frame
##--- Getting default values for missing parameters...
Using `samplename` as the grouping column
Using `jobname` as the jobname column
Using `cmd` as the cmd column
Using flow_base_path default: ~/flowr
##--- Checking flow definition and flow matrix for consistency...
##--- Detecting platform...
Platform supplied, this will override defaults from flow_definition...
##--- flowr submission...
Working on... sample1
Test Successful!
You may check this folder for consistency. Also you may re-run submit with execute=TRUE
~/flowr/example1-sample1-20150706-21-50-08-AuNGnTHi
plot_flow(fobj)
submit_flow(fobj)
submit_flow(fobj, execute = TRUE)
Flow has been submitted. Track it from terminal using:
flowr::status(x="~/flowr/type1-20150520-15-18-46-sySOzZnE")
OR
flowr status x=~/flowr/type1-20150520-15-18-46-sySOzZnE
$ flowr status x=~/flowr/sample1-20150619-07-43-28-OTpuKaMz
Flowr: streamlining workflows
Showing status of: ~/flowr/sample1-20150619-07-43-28-OTpuKaMz
| | total| started| completed| exit_status|
|:---------|-----:|-------:|---------:|-----------:|
|001.sleep | 3| 3| 1| 0|
|002.tmp | 3| 1| 1| 0|
|003.merge | 1| 0| 0| 0|
The run command creates and submits a flow object. Here is an example:
flowr run sleep execute=TRUE platform=moab
$ flowr status x=sample1
Showing status of: ./sample1-20150619-07-34-17-lykJ4pdf
| | total| started| completed| exit_status|
|:---------|-----:|-------:|---------:|-----------:|
|001.sleep | 3| 3| 3| 0|
|002.tmp | 3| 3| 3| 0|
|003.merge | 1| 1| 1| 0|
Showing status of: ./sample1-20150619-07-43-28-OTpuKaMz
| | total| started| completed| exit_status|
|:---------|-----:|-------:|---------:|-----------:|
|001.sleep | 3| 3| 3| 0|
|002.tmp | 3| 3| 3| 0|
|003.merge | 1| 0| 0| 0|
`status()` shows a summary of all the flows in a folder; it is designed to work much like `ls` does in the terminal.
flowr run sleep execute=TRUE flow_base_path="~/flowr/sleep"
flowr status x=~/flowr/sleep ## parent folder with 3 flows inside
Showing status of: /rsrch2/iacs/iacs_dep/sseth/flowr/sleep
| | total| started| completed| exit_status|
|:---------|-----:|-------:|---------:|-----------:|
|001.sleep | 9| 9| 6| 0|
|002.tmp | 9| 6| 6| 0|
|003.merge | 3| 1| 1| 0|
|004.size | 3| 1| 1| 0|
flowr status x=~/flowr/sleep/sample1* ## get the status of all of them
`kill_flow` fetches the jobid of each job and kills them:

flowr kill_flow wd=~/flowr/sample1-20150619-07-53-58-ySuYo5t0
flowr rerun_flow x=~/flowr/sample1-20150619-11-41-50-eXa0insg start_from=tmp
Extracting commands from previous run.
Hope the reason for previous failure was fixed...
Subsetting... get stuff to run starting tmp
Using flow_base_path default: ~/flowr
├── 001.sleep
│ ├── 001.sleep
│ ├── sleep_cmd_1.sh
│ ├── sleep_cmd_2.sh
│ └── sleep_cmd_3.sh
├── 002.tmp
│ ├── 002.tmp
│ ├── tmp_cmd_1.sh
│ ├── tmp_cmd_2.sh
│ └── tmp_cmd_3.sh
├── 003.merge
│ ├── 003.merge
│ └── merge_cmd_1.sh
├── 004.size
│ ├── 004.size
│ └── size_cmd_1.sh
├── example1-flow_design.pdf
├── flow_details.rda
├── flow_details.txt
├── flow_status.txt
├── tmp
│ ├── merge1
│ ├── tmp1_1
│ ├── tmp1_2
│ └── tmp1_3
└── trigger
├── trigger_001.sleep_1.txt
├── trigger_001.sleep_2.txt
├── trigger_001.sleep_3.txt
├── trigger_002.tmp_1.txt
├── trigger_002.tmp_2.txt
├── trigger_002.tmp_3.txt
├── trigger_003.merge_1.txt
└── trigger_004.size_1.txt
Supported platforms include `local` and `moab`.
Syntax: `flowr function parameters`

Using `-h` or a missing argument loads the R help file. For example:

flowr rnorm n=100
Loading required package: shape
Flowr: streamlining workflows
2.277249 0.3188005 -0.9658285 0.4719445
....
## load help file for knitr
funr knitr::knit
## OR use
funr knitr::knit -h