11  Load microdata and Auxiliary data

Make sure you have all the packages installed and loaded into memory. Given that they are hosted in Github, the code below makes sure that any package in the PIP workflow can be installed correctly.

# ## First specify the packages of interest
# packages = c("pipaux", "pipload")
# 
# ## Now load or install&load all
# package.check <- lapply(
#   packages,
#   FUN = function(x) {
#     if (!require(x, character.only = TRUE)) {
#       pck_name <- paste0("PIP-Technical-Team/", x)
#       devtools::install_github(pck_name)
#       library(x, character.only = TRUE)
#     }
#   }
# )

11.1 Auxiilary data

Even though pipaux has more than functions, most of its features can be executed by using only the pipaux::load_aux and pipaux::update_aux functions.

11.1.1 udpate data

the main function of the pipaux package is udpate_aux. The first argument of this function is measure and it refers to the measure data to be loaded. The measures available are ****.

# pipaux::update_aux(measure = "cpi")

11.1.2 Load data

Loading auxiliary data is the job of the package pipload through the function pipload::pip_load_aux(), though pipaux also provides pipaux::load_aux() for the same purpose. Notice that, though both function do exactly the same, the loading function from pipload has the prefix pip_ to distinguish it from the one in pipaux. However, we are going to limit the work of pipaux to update auxiliary data and the work of pipload to load data. Thus, all the examples below use pipload for loading either microdata or auxiliary data.

# df <- pipload::pip_load_aux(measure = "cpi")
# head(df)

11.2 Microdata

Loading PIP microdata is the most practical action in the pipload package. However, it is important to understand the logic of microdata.

PIP microdata has several characteristics,

  • There could be more than once survey for each Country/Year. This happens when there are more than one welfare variable available such as income and consumption.
  • Some countries, like Mexico, have the two different welfare types in the same survey for the same country/year. This add a layer of complexity when the objective is to known which is default one.
  • There are multiple version of the same harmonized survey. These version are organized in a two-type vintage control. It is possible to have a new version of the data because the Raw data–the one provided by the official NSO–has been updated, or because there has been un update in the harmonization process.
  • Each survey could be use for more than one analytic tool in PIP (e.g., Poverty Calculator, Table Maker, or SOL). Thus, the data to be loaded depends on the tool in which it is going to be used.

Thus, in order to make the process of finding and loading data efficiently, pipload is a three-step process.

11.2.1 Inventory file

The inventory file resides in y:/PIP-Data/_inventory/inventory.fst. This file is a data frame with all the microdata available in the PIP structure. It has two main variables, orig and filename. The former refers to the full directory path of the database, whereas the latter is only the file name. the other variables in this data frame are derived from these two.

The inventory file is used to speed up the file searching process in pipload. In previous packages, each time the user wanted to find a particular data base, it was necessary to look into the folder structure and extract the name of all the file that meet a particular criteria. This is time-consuming and inefficient. The advantage of this method though, is that, by construction, it finds all the the data available. By contrast, the inventory file method is much faster than the “searching” method, as it only requires to load a light file with all the data available, filter the data, and return the required information. The drawback, however, is that it needs to be kept up to date as data changes constantly.

To update the inventory file, you need to use the function pip_update_inventory. If you don’t provide any argument, it will update the whole inventory, which may take around 10 to 15 min–the function will warn you about it. By provide the country/ies you want to update, the process is way faster.

# # update one country
# pip_update_inventory("MEX")
# 
# # Load inventory file
# df <- pip_load_inventory()
# head(df[, "filename"])

11.2.2 Finding data

Every dataset in the PIP microdata repository is identified by seven variables! Country code, survey year, survey acronym, master version, alternative version, tool, and source. So giving the user the responsibility to know all the different combinations of each file is a heavy burden. Thus, the data finder, pip_find_data(), will provide the names of all the files available that meet the criteria in the arguments provided by the user. For instance, if the use wants to know the all the file available for Paraguay, we could type,

# pip_find_data(country = "PRY")[["filename"]]

Yet, if the user need to be more precise in its request, she can add information to the different arguments of the function. For example, this is data available in 2012,

# pip_find_data(country = "PRY", 
#               year = 2012)[["filename"]]

11.2.3 Loading data

Function pip_load_data takes care of loading the data. The very first instruction within pip_load_data is to find the data avialable in the repository by using pip_load_inventory(). The difference however is two-fold. First, pip_load_data will load the default and/or most recent version of the country/year combination available. Second, it gives the user the possibility to load different datasets in either list or dataframe form. For instance, if the user wants to load the Paraguay data in 2014 and 2015 used in the Poverty Calculator tool, she may type,


# df <- pip_load_data(country = "PRY",
#                     year    = c(2014, 2015), 
#                     tool    = "PC")
# 
# janitor::tabyl(df, survey_id)