Skip to contents

Introduction

the PIP project makes use of many directory paths and ad-hoc information that cannot be gathered endogenously from the PIP project. Since pipload is the package in charge of accessing and loading all the data in PIP, it is also in charge of loading into memery the ad-hoc variables that are used across PIP.

Functions

the two main functions are pip_create_globals() and add_gls_to_env(). The former is the main function that creates all the variables and returns them in a list. The latter is a convenient function to standardize the use of the global variables across all PIP processes.

Create globals

Basics

pip_create_globals() makes use of three arguments that work together: root_dir, out_dir, and vintage. root_dir refers to the directory path where the PIP data resides. Since the directory path of the PIP data cannot be made publicly available, it is stored in the environment variable PIP_ROOT_DIR, which hosted privately by the PIP Technical Team.

For the sake of sake of demostrations, let’s use a temporal directory as it it where the root directory. The default behavior of pip_create_globals() is to create the the whole directory structure in case it does not exist:

fake_dir1 <- paste0(tempdir(), "/1")
Sys.setenv(PIP_ROOT_DIR = fake_dir1)
gls <- pip_create_globals()
fs::dir_tree(fake_dir1, type = "directory")
#> /tmp/RtmpzynBs8/1
#> ├── DLW-RAW
#> ├── PIP-Data_QA
#> └── pip_ingestion_pipeline
#>     ├── pc_data
#>     │   ├── cache
#>     │   │   └── clean_survey_data
#>     │   └── output
#>     │       └── 20220314
#>     │           ├── _aux
#>     │           ├── estimations
#>     │           └── survey_data
#>     └── tb_data
#>         ├── arrow
#>         ├── cache
#>         │   └── clean_survey_data
#>         └── output
#>             └── 20220314
#>                 └── estimations

In addition to create the global vartiables, pip_create_globals() creates two main folders in case they don’t exist, PIP-Data_QA and pip_ingestion_pipeline. This is useful if the user does not want to work in the main PIP data folder structure. Let’s focus for now on the pc_data sub-directory inside pip_ingestion_pipeline. This folder has two other folders, cache and output. Within the latter you see a vintage folder 20220314 with the form "%Y%m%d".

Since by default both input and output data are stored within the structure of root_dir directory, the value of out_dir is the same as root_dir. Yet, the user can specify a different directory path to store the output from the PIP pipeline.

fake_dir2 <- paste0(tempdir(), "/2")
fake_dir3 <- paste0(tempdir(), "/3")
Sys.setenv(PIP_ROOT_DIR = fake_dir2)
gls <- pip_create_globals(out_dir = fake_dir3)
fs::dir_tree(fake_dir2, type = "directory")
#> /tmp/RtmpzynBs8/2
#> ├── DLW-RAW
#> ├── PIP-Data_QA
#> └── pip_ingestion_pipeline
#>     ├── pc_data
#>     │   └── cache
#>     │       └── clean_survey_data
#>     └── tb_data
#>         ├── arrow
#>         └── cache
#>             └── clean_survey_data
fs::dir_tree(fake_dir3, type = "directory")
#> /tmp/RtmpzynBs8/3
#> └── pip_ingestion_pipeline
#>     ├── pc_data
#>     │   └── output
#>     │       └── 20220314
#>     │           ├── _aux
#>     │           ├── estimations
#>     │           └── survey_data
#>     └── tb_data
#>         └── output
#>             └── 20220314
#>                 └── estimations

In the case above, folder fake_dir2 represents root_dir and fake_dir3 is the output directory, out_dir. You can see that folder pip_ingestion_pipeline is available in both fake_dir2 and fake_dir3, but output was created only on fake_dir3 and not in fake_dir2. It does not mean that in reality fake_dir2 does not contain an output folder, but that given that we are working with temporal directories, the output folder does not exist in either of them.

Vintages

This is where the option vintage comes into play. This argument refers to the name of the sub-directories inside the output folder. It can take two special values, “latest” or “new”, or any other character. If it is “latest” (default), the most recent version available in the vintage directory of the form “%Y%m%d” will be used. If no folder exists with this form, a new folder with the date of the execution will be created. If it is “new”, a new folder with a name of the form “%Y%m%d” will be created. All the names will be coerced to lower cases.

So, let’s pretend that, inside the official directory, fake_dir1, the most recent vintage of the output sub-directories is 20220314 .Yet, you don’t want to mess up with it, so you want to create a new vintage, “temp_out”, and you also want to do it in a directory different to the official one, fake_dir2. In this, can do something like this.

Sys.setenv(PIP_ROOT_DIR = fake_dir2)
gls <- pip_create_globals(root_dir = fake_dir1, 
                          out_dir  = fake_dir2, 
                          vintage  = "temp_out")
#>  Alternative root directory for root_dir is set to </tmp/RtmpzynBs8/1>
fs::dir_tree(fake_dir1, type = "directory")
#> /tmp/RtmpzynBs8/1
#> ├── DLW-RAW
#> ├── PIP-Data_QA
#> └── pip_ingestion_pipeline
#>     ├── pc_data
#>     │   ├── cache
#>     │   │   └── clean_survey_data
#>     │   └── output
#>     │       └── 20220314
#>     │           ├── _aux
#>     │           ├── estimations
#>     │           └── survey_data
#>     └── tb_data
#>         ├── arrow
#>         ├── cache
#>         │   └── clean_survey_data
#>         └── output
#>             └── 20220314
#>                 └── estimations
fs::dir_tree(fake_dir2, type = "directory")
#> /tmp/RtmpzynBs8/2
#> ├── DLW-RAW
#> ├── PIP-Data_QA
#> └── pip_ingestion_pipeline
#>     ├── pc_data
#>     │   ├── cache
#>     │   │   └── clean_survey_data
#>     │   └── output
#>     │       └── temp_out
#>     │           ├── _aux
#>     │           ├── estimations
#>     │           └── survey_data
#>     └── tb_data
#>         ├── arrow
#>         ├── cache
#>         │   └── clean_survey_data
#>         └── output
#>             └── temp_out
#>                 └── estimations

As you can see, directory temp_out was created inside output of the fake_dir2 directory.