4  Joining Data

Since the PIP project is comprised of several databases at very different domain levels, this chapter provides all the guidelines to join data frames correctly. The first thing to understand is that the reporting measures in PIP (e.g. poverty and inequality) are uniquely identified by four variables: Country, year, domain, and welfare type.

The challenge of joining different data frames in PIP is that these four variables that uniquely identify the reporting measures are not available on any of the PIP data files—with the exception of the cache files that we discuss below. This challenge is easily addressed by having a clear understanding of the Price FrameWork (pfw) data frame. This file does not only contain valuable metadata, but it could also be considered as the anchor among all PIP data.

4.1 The Price FrameWork (pfw) data

As always, this file can be loaded by typing,

library(data.table)
#pfw <- pipload::pip_load_aux("pfw")
#joyn::is_id(pfw, by = c("country_code", "surveyid_year", "survey_acronym"))

First of all, notice that pfw is uniquely identified by country code, survey year, and survey acronym. The reason for this is that pfw aims at providing a link between every single household survey and all the other auxiliary data. Since welfare data is stored following the naming convention of the International Household Survey Network (IHSN), data is stored according to country, survey year, acronym of the survey, and vintage control of master and alternative versions. The vintage control of the master and alternative version of the data is not relevant for joining data because PIP uses, by default, the most recent version.

Keep in mind that PIP estimates are reported at the country, year, domain, and welfare type level, but the last two of these are not found either in the survey ID nor as unique identifiers of the pfw. To solve this problem, the pfw data makes use of the variables welfare_type, aaa_domain, and aaa_domain_var.

As the name suggests, welfare_type indicates the main welfare aggregate type (i.e, income or consumption ) of the variable welfare in the GMD datasets that correspond to the survey ID formed by concatenating variables country_code, surveyid_year, and survey_acronym. For example, the welfare_type of the welfare variable in the datasets of COL_2018_GEIH_V01_M_V03_A_GMD is income.

# pfw[ country_code    == "COL"
#     & surveyid_year  == 2018
#     & survey_acronym == "GEIH", # Not necessary since it is the only one
#     unique(welfare_type)]

The prefix aaa in variables aaa_domain and aaa_domain_var refers to the identification code of any of the auxiliary data. Thus, you will find a gdp_domain, cpi_domain, ppp_domain and several others. All aaa_domain variables contain the lower level of geographical disaggregation of the corresponding aaa auxiliary data. There are only three possible levels of disaagregation,

As of now, no survey or auxiliary data is broken down at level 3 (i.e., subnational), but it is important to know that the PIP internal code takes that possibility into account for future cases.

Depending on the country, the domain level of each auxiliary data might be different. In Indonesia, for instance, the CPI domain is national, whereas the PPP domain is “urban/rural.”

# pfw[ country_code == "IDN" & surveyid_year == 2018, 
#     .(cpi = unique(cpi_domain), 
#       ppp = unique(ppp_domain))]

Finally, and this is really important, variables aaa_domain_var contains the name of variable in the GMD dataset that uniquely identify the household survey in the corresponding aaa auxiliary data. In other words, aaa_domain_var contains the name of the variable in GMD that must be used as key to join GMD to aaa. You may ask, does the name of the variable in the aaa auxiliary data have the same variable name in the GMD data specified in aaa_domain_var? No, it does not. Since the domain level to identify observations in the aaa auxiliary data is unique, there is only one variable in auxiliary data used to merge any welfare data, aaa_data_level. Since all this process is a little cumbersome, the {pipdp} Stata package, during the process of cleaning GMD databases to PIP databases, creates as many aaa_data_level variables as needed in order to make the join of welfare data and auxiliary data simpler. You can see the lines of code that create these variables in this section of the file “pipdp_md_clean.ado.”1

4.1.1 Joining data example

Let’s see the case of Indonesia above. The pfw says that the CPI domain is “national” and the PPP domain is “urban/rural.” That means that the welfare data join to each of these auxiliary data with two different variables,

# domains <- 
#   pfw[ country_code == "IDN" & surveyid_year == 2018, 
#      .(cpi = unique(cpi_domain_var), 
#        ppp = unique(ppp_domain_var))][]

This says that the name of the variable in the welfare data to join PPP data is called uban, but there is not seem to be a variable name in GMD to join the CPI data. When the name of the variable is missing, it indicates that the welfare data is not split by any variable to merge CPI data. That is, it is at the national level.

# ccode  <- "CHN"
# cpi <- pipload::pip_load_aux("cpi")
# ppp <- pipload::pip_load_aux("ppp")
# 
# CHN <-  pipload::pip_load_data(country = ccode, 
#                               year    = 2015)
# 
# dt <- joyn::merge(CHN, cpi,
#                   by = c("country_code", "survey_year",
#                          "survey_acronym", "cpi_data_level"),
#                   match_type = "m:1", 
#                   keep = "left")

4.2 Special data cases

NOTE: Andres. Add this section


  1. You can find more information about the conversion from GMD to PIP databases in Section @ref(welfare-data)↩︎