8  Auxiliary data

As explained in Section @ref(data-acquisition), auxiliary data is the data used to temporally deflate and line up welfare data. The objective is to obtain poverty estimates that are comparable over time and across countries and, more importantly, to be able to produce regional and global estimates. Auxiliary data also includes metadata with functional and qualitative information. Functional information is used in internal calculations, such as time comparability or survey availability. Qualitative information is useful information that neither affects nor depends on quantitative data. It is primarily collected and made available for the end user.

As explained in Chapter @ref(folder-structure), all auxiliary data is stored in "y:/PIP-Data/_aux/".

The naming convention of subfolders inside the _aux directory is useful because auxiliary data is commonly referred to in all technical processes by its convention rather than by its actual name. For instance, Gross Domestic Product and Purchasing Power Parity are better known as gdp and ppp, respectively. Other measures, such as national population or consumption, also make use of conventions.

In this chapter you will learn everything related to each of the files that store auxiliary data. Notice that the chapter is structured by files rather than by measures or types of auxiliary data because you may find more than one measure in one file.

The R package that manages auxiliary data is {pipaux}.

As explained in Chapter @ref(folder-structure), within the folder of each auxiliary file you will find, at a minimum, a _vintage folder, one xxx.fst file, one xxx.dta file, and one xxx_datasignature.txt file, where xxx stands for the name of the file.
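The layout described above can be sketched in base R. This is only an illustration: `aux_expected_paths()` is a hypothetical helper, not a function of {pipaux} or {pipload}, and it assumes that the subfolder and the files share the same short name (for example, pop).

```r
# Hypothetical helper (not part of {pipaux}): build the paths that are
# expected, at a minimum, inside the folder of one auxiliary file.
aux_expected_paths <- function(measure, aux_dir = "y:/PIP-Data/_aux") {
  base <- file.path(aux_dir, measure)
  c(
    vintage   = file.path(base, "_vintage"),
    fst       = file.path(base, paste0(measure, ".fst")),
    dta       = file.path(base, paste0(measure, ".dta")),
    signature = file.path(base, paste0(measure, "_datasignature.txt"))
  )
}

paths <- aux_expected_paths("pop")
print(paths)

# Check which pieces exist (all FALSE if the network drive is not mounted)
print(file.exists(paths))
```

A check like this can be handy before running an update, to confirm that the folder of a given measure is complete.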

8.1 Population

8.1.1 Original data

Everything related to population data should be placed in the folder y:\PIP-Data\_aux\pop\, hereafter (./).

The population data come from one of two sources: WDI or an internal file provided by a member of the DECDG team. Ideally, population data should be downloaded from WDI, but sometimes the most recent data available has not been uploaded yet, so it needs to be collected internally in DECDG. As of now (August 24, 2023), the DECDG focal point to provide the population data is Emi Suzuki. You just need to send her an email, and she will provide the data to you.

If the data is provided by DECDG, it should be stored in the folder ./raw_data. The original Excel file must be placed without modification in the folder ./raw_data/original. The file is then copied one level up into the folder ./raw_data with the name population_country_yyyy-mm-dd.xlsx, where yyyy-mm-dd refers to the official release date of the population data. Notice that for PSE, KWT, and SXM, some years of population data are missing in the DECDG main file and hence in WDI. We complement the main file with an additional file shared by Emi to ensure complete coverage. This file contains historical data and does not need to be updated every year. The additional file follows the naming convention population_missing_yyyy-mm-dd.xlsx and goes through the same process as the population_country file. Once all the files are in their corresponding places, you can update the ./pop.fst file by typing pipaux::pip_pop_update(src = "decdg").
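As an illustration of the naming convention, the following base R sketch validates file names in ./raw_data and keeps the most recent release of each type. The file names and dates are hypothetical, and this is not how pipaux::pip_pop_update() necessarily selects its input.

```r
# Hypothetical listing of ./raw_data (file names and dates are made up)
files <- c(
  "population_country_2022-07-01.xlsx",
  "population_country_2023-07-05.xlsx",
  "population_missing_2023-03-21.xlsx",
  "notes.txt"
)

# Naming convention: population_<country|missing>_yyyy-mm-dd.xlsx
pattern <- "^population_(country|missing)_([0-9]{4}-[0-9]{2}-[0-9]{2})\\.xlsx$"
valid   <- grepl(pattern, files)

raw <- data.frame(
  file    = files[valid],
  type    = sub(pattern, "\\1", files[valid]),
  release = as.Date(sub(pattern, "\\2", files[valid])),
  stringsAsFactors = FALSE
)

# Keep the most recent release of each file type
raw    <- raw[order(raw$type, raw$release), ]
latest <- raw[!duplicated(raw$type, fromLast = TRUE), ]
print(latest)
```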

If the data comes directly from WDI, you just need to update the file ./pop.fst by typing pipaux::pip_pop_update(src = "wdi"). It is worth mentioning that the population codes used in WDI are “SP.POP.TOTL”, “SP.RUR.TOTL”, and “SP.URB.TOTL”, which are total population, rural population, and urban population, respectively. If PIP begins using subnational population data, a new set of WDI codes should be added to the R script in pipaux::pip_pop_update().

8.1.2 Data structure

Population data is loaded by typing either pipload::pip_load_aux("pop") or pipaux::pip_pop("load"). We highly recommend the former, as {pipload} is the intended R package for loading any PIP data.

# pop <- pipload::pip_load_aux("pop")
# head(pop)

8.2 National Accounts

National accounts measure the economic development of a country at an aggregate or macroeconomic level. These measures are thus useful to interpolate or extrapolate microeconomic measures, such as the mean welfare aggregate or the poverty headcount, when household surveys are not available. National accounts work as a proxy for the economic development that would have been observed if household surveys were available.

There are two main types of national accounts, Household Final Consumption Expenditure (HFCE) and Gross Domestic Product (GDP)—both in real per capita terms. Please refer to Section 5.3 of [@worldbankPovertyInequalityPlatform2021] to understand the usage of national accounts data.
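To give an intuition of how national accounts can stand in for missing surveys, here is a minimal sketch in base R. It assumes the simplest possible rule, in which the survey mean grows at the same rate as the national accounts series. All numbers are hypothetical, and the actual method is the one described in Section 5.3 of the reference above.

```r
# Hypothetical GDP per capita series in constant prices
gdp <- c(`2015` = 1000, `2016` = 1030, `2017` = 1071.2)

survey_year <- "2015"
survey_mean <- 4.0   # hypothetical mean welfare from the 2015 survey

# Assumption: the mean grows at the same rate as the national accounts
lined_up <- survey_mean * gdp / gdp[[survey_year]]
print(round(lined_up, 3))
```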

8.2.1 GDP

As explained in Section 5.3 of [@worldbankPovertyInequalityPlatform2021], there are three sources of GDP data, and one more for a few particular cases. The integration of all the sources of GDP data is performed by pipaux::pip_gdp_update(), but you need to manually download and store the data from WEO and the data for the special cases. The national accounts series from WDI is GDP per capita [series code: NY.GDP.PCAP.KD]. This series is in constant 2010 US$.

The most recent version of the WEO data must be downloaded from the World Economic Outlook Databases on the IMF.org website and saved as an .xls file in <maindir>/_aux/weo/. The filename should follow the structure WEO_<YYYY-DD-MM>.xls. Due to potential file corruption, the file must be opened and re-saved before it can be updated with pip_gdp_weo(), which is an internal function of pipaux::pip_gdp_update(). Hopefully, in the future the IMF will stop using an `.xls` file that’s not really xls.

8.2.2 Consumption (PCE)

Private Consumption Expenditure (PCE) is gathered from WDI, with the exception of a few special cases, which are treated in the same way as for GDP. You only need to execute the function pipaux::pip_pce_update() to update the PCE data. The series from WDI is HFCE per capita [series code: NE.CON.PRVT.PC.KD] [@prydzNationalAccountsData2019]. This series is in constant 2010 US$.

8.2.3 National Accounts, Special Cases

Special national accounts are used for lining up poverty estimates in the following cases¹:

  1. National accounts data are unavailable in the latest version of WDI.

    In such cases, national accounts data are obtained, in order of preference, from the latest version of WEO, or the latest version of MPD. For example, the entire series of GDP per capita for Taiwan, China and Somalia are missing in WDI, so WEO series are used instead.

  2. National accounts data are incomplete in the latest version of WDI.

    These are the cases where national accounts data are not available in WDI for some historical or recent years. In such cases, national accounts data in WDI are chained on backward or forward using growth rates from WEO or MPD, in that order. For example, GDP per capita for South Sudan (2016-2019) are based on the growth rate of GDP per capita from WEO. GDP per capita data for Liberia up to 1999 are based on the growth rate in GDP per capita from MPD.

  3. The available national accounts data from official sources (e.g. WDI, WEO, MPD) are considered to have quality issues.

    This is the case for Syria. Supplementary national accounts data are obtained from other sources, including research papers or national statistical offices. GDP per capita series for Syria (for 2010 through 2019) are from research (for 2011-2015) and @devadasGrowthWarSyria2019 (for 2016-2019)—and are chained on backward with growth rates in GDP per capita from WEO. See *y:/PIP-Data/_aux/sna/* for more details on how this is implemented.

  4. National accounts data need to be adjusted for the purposes of global poverty monitoring.

    This is the case for India. Growth rates in national accounts data for rural and urban India after 2014, precisely HFCE (or formerly PCE) per capita from WDI, are adjusted with a pass-through rate of 67%, as described in Section 5 of @castanedaaguilarSeptember2020PovcalNet2020. See *y:/PIP-Data/_aux/sna/NAS special_2021-01-14.csv* for more details on how this is implemented.
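Cases 2 and 4 above can be illustrated with a short base R sketch. `chain_series()` and `apply_pass_through()` are hypothetical helpers written for this example, not functions of {pipaux}, and all values are made up. The sketch assumes that chaining means applying the year-on-year growth of the secondary source to the nearest observed value, and that the pass-through rate multiplies annual growth rates.

```r
# Chain a series with gaps using the growth rates of a second source.
# Both vectors must cover the same years in the same order.
chain_series <- function(main, aux) {
  out <- main
  n   <- length(out)
  # Forward: fill values that follow an observed one
  for (i in seq_len(n)[-1]) {
    if (is.na(out[i]) && !is.na(out[i - 1])) {
      out[i] <- out[i - 1] * aux[i] / aux[i - 1]
    }
  }
  # Backward: fill values that precede an observed one
  for (i in rev(seq_len(n - 1))) {
    if (is.na(out[i]) && !is.na(out[i + 1])) {
      out[i] <- out[i + 1] * aux[i] / aux[i + 1]
    }
  }
  out
}

# Scale annual growth rates by a pass-through rate (67% in the text)
apply_pass_through <- function(x, rate = 0.67) {
  growth <- x[-1] / x[-length(x)] - 1
  out <- x[1] * cumprod(c(1, 1 + rate * growth))
  names(out) <- names(x)
  out
}

# Hypothetical series: WDI has gaps at both ends, WEO is complete
wdi <- c(`2014` = NA, `2015` = 200, `2016` = 210, `2017` = NA)
weo <- c(`2014` = 95, `2015` = 100, `2016` = 104, `2017` = 109.2)
chained <- chain_series(wdi, weo)
print(chained)

# Hypothetical HFCE per capita growing 10% a year after 2014
hfce     <- c(`2014` = 100, `2015` = 110, `2016` = 121)
adjusted <- apply_pass_through(hfce)
print(adjusted)
```

Chaining preserves the level of the main source and borrows only growth rates from the secondary source, so the observed WDI values are left untouched.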

8.3 CPI

8.3.1 Raw data

General documents on the CPI source and the CPI frameworks are posted here. Yet, for more details, please refer to [@laknerConsumerPriceIndices2018b; @azevedoPricesUsedGlobal2018a].

There are three sources of CPI: IFS, WEO, and country team.

In general, the CPI data is taken from the IMF International Financial Statistics (IFS). For each upcoming update, about 2-3 months before the upload, we send a request to the DECDG CPI team for three series from the IFS CPI database: annual, quarterly, and monthly. The purpose of requesting the data from the DECDG CPI team is to ensure that the same vintage is the one updated in WDI in the next update cycle. We need the three series because for some countries only annual data is available, whereas for others data is available up to monthly frequency. Having them also lets us check the consistency of the annual and monthly series. The monthly series is used to construct the annual and quarterly series, as there are some inconsistencies between the annual and monthly series in the IFS. For the exceptional countries, the Poverty GP replaces data series based on previous consultations when there is no update or better information.
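The construction of annual and quarterly series from monthly data, and the consistency check against the annual series, can be sketched in base R as follows. The sketch assumes that the annual and quarterly CPI are simple averages of the monthly values, which may differ from the actual procedure, and all values and column names are hypothetical.

```r
# Hypothetical monthly CPI for one country
monthly <- data.frame(
  year  = rep(c(2020, 2021), each = 12),
  month = rep(1:12, times = 2),
  cpi   = c(seq(100, 105.5, by = 0.5), seq(106, 111.5, by = 0.5))
)

# Assumption: annual CPI is the simple average of the monthly values
annual <- aggregate(cpi ~ year, data = monthly, FUN = mean)

# Assumption: quarterly CPI is the average within each quarter
monthly$quarter <- (monthly$month - 1) %/% 3 + 1
quarterly <- aggregate(cpi ~ year + quarter, data = monthly, FUN = mean)

# Consistency check against a hypothetical annual series from the source
published <- data.frame(year = c(2020, 2021), cpi_annual = c(102.75, 109.00))
check <- merge(annual, published, by = "year")
check$inconsistent <- abs(check$cpi - check$cpi_annual) > 0.01
print(check)
```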

Some countries also use national series that are not available from IFS. In this case, we ask the country poverty TTLs to provide the updated information, especially when there is a new survey for that country.

In some cases, the CPI values are missing in some sources but not in others. This is especially true for very old years, where the data is available from only one source, or for very recent years, where WEO has a projection of the CPI that is not yet available in IFS. In those cases, we follow the logic and method described in the “CPI source document” to ensure we have CPI values for all data points.

The Global D4G team prepares and sends the raw CPI series as well as the weighted numbers for the current data points in the system. The data is queried through datalibweb. The following files are added to the system each round:

| File name | Description |
|---|---|
| Final_CPI_PPP_to_be_used.dta | final weighted CPI for poverty calculation |
| Yearly_CPI_Final.dta | annual CPI, combined from different sources using chained methods |
| Yearly_CPI.dta | annual CPI constructed from the monthly CPI |
| Yearly_CPI_Annual.dta | annual CPI from the annual series |
| Quarterly_CPI.dta | quarterly CPI |
| Monthly_CPI.dta | monthly CPI series |
| WEO_Yearly_CPI.dta | annual CPI from WEO |
| Special_CPI_series.dta | special cases of CPI (national source, imputation) |

8.3.2 Vintage control

Vintage control of the CPI data follows a convention similar to that of welfare data: CPI_vXX_M_vXX_A, where vXX_M refers to the version of the master or raw data, and vXX_A refers to the alternative version.

Every year, around November-December, PIP CPI data is updated with the most recent version of the IMF CPI data, which includes information for the most recent year available as well as changes, fixes, and additions to previous years for each country. When this happens, the master version of the CPI ID is increased by one unit before the data is saved. As of today, the current ID is . If data is modified during the course of the year, then the alternative version of the CPI ID is increased by one unit.
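The versioning rule can be sketched in base R. `bump_cpi_id()` is a hypothetical helper, the ID used below is made up, and the two-digit zero padding is an assumption based on the vXX notation. The text does not say what happens to the alternative version when the master version increases, so the sketch leaves it unchanged.

```r
# Hypothetical helper: increase one component of a CPI ID of the form
# CPI_vXX_M_vXX_A by one unit.
bump_cpi_id <- function(id, part = c("master", "alternative")) {
  part <- match.arg(part)
  m <- regmatches(id, regexec("^CPI_v([0-9]+)_M_v([0-9]+)_A$", id))[[1]]
  if (length(m) != 3) stop("Unexpected CPI ID: ", id)
  master <- as.integer(m[2])
  alt    <- as.integer(m[3])
  if (part == "master") {
    master <- master + 1L   # new IMF release
  } else {
    alt <- alt + 1L         # modification during the year
  }
  sprintf("CPI_v%02d_M_v%02d_A", master, alt)
}

new_master <- bump_cpi_id("CPI_v05_M_v01_A", "master")
new_alt    <- bump_cpi_id("CPI_v05_M_v01_A", "alternative")
print(c(new_master, new_alt))
```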

8.3.3 Data structure

When you load CPI data using pipload::pip_load_aux("cpi"), the data you get has already been cleaned for use in the PIP workflow, and it is slightly different from the original CPI data stored in the datalibweb servers. That is, the way CPI data is used and referred to in datalibweb is different from the way it is used in PIP, even though both achieve the same purpose.

The most important variable in CPI data is, not surprisingly, cpi. This variable, however, is not available in the original CPI data from dlw. The original name of this variable comes in the form cpiYYYY, where YYYY refers to the base year of the CPI, which in turn depends on the collection year of the PPP. Today, this variable is thus “.” The name of this variable is stored in the pipaux.cpivar object in the zzz.R file of the {pipaux} package. This updates the option getOption("pipaux.cpivar"), guaranteeing that pipaux::pip_cpi_update() uses the right variable when updating the CPI data.

Another important variable in the CPI data frame is change_cpiYYYY, where YYYY stands for the base year of the CPI. Since version control of the CPI data does not depend on the individual changes in the CPI series of each country but on the release of new data by the IMF or on additional modifications by the Poverty GP, the variable change_cpiYYYY tracks changes in the CPI at the country/year/survey level with respect to the previous version. This is very useful when you need to identify changes in output measures, like poverty rates, that depend on deflation. One possible source of difference is the CPI, and this variable will help you identify whether the number of interest has changed because the CPI has changed.
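The idea behind a change-tracking variable can be sketched by comparing two vintages in base R. This is not how change_cpiYYYY is actually built: the column names, the 0/1 coding, and the values are assumptions, and the survey dimension is omitted to keep the example short.

```r
# Two hypothetical vintages of the CPI data (country/year level only)
old <- data.frame(
  country_code = c("AAA", "AAA", "BBB"),
  year         = c(2018, 2019, 2018),
  cpi          = c(1.00, 1.05, 0.90)
)
new <- data.frame(
  country_code = c("AAA", "AAA", "BBB"),
  year         = c(2018, 2019, 2018),
  cpi          = c(1.00, 1.06, 0.90)
)

# Match each observation of the new vintage with the previous one
cmp <- merge(new, old, by = c("country_code", "year"),
             suffixes = c("", "_old"), all.x = TRUE)

# Assumption: 1 if the value is new or differs from the previous vintage
cmp$change_cpi <- as.integer(
  is.na(cmp$cpi_old) | abs(cmp$cpi - cmp$cpi_old) > 1e-8
)
print(cmp)
```

With a flag like this, a change in a poverty rate can be traced to the CPI by filtering on the country and year of interest.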

8.4 PPP

8.4.1 Raw data

The PPP data is downloaded from the ICP website for most countries (ask the ICP team for the link, outliers, and countries with changes in currency). Often a GPWG working group assesses the PPP and its impact on poverty. In that case, the team determines the countries for which the PPP value needs to be imputed, either from the ICP model or from the team model.

After the validation and adjustment process, the PPP values are stored in the data file for all PPP rounds with vintage controls for each round.

The names of the variables in the wide-format file follow the structure ppp_YYYY_vX_vY, where:

  • YYYY: refers to the ICP round.

  • vX: refers to the version of the release.

  • vY: refers to the adaptation of the release. So, v1 will be the original data, whereas v2 would be the first adaptation or estimates of the release.
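The naming structure makes it straightforward to split a variable name into its components and to reshape the wide file into one row per country and PPP variable. The following base R sketch uses hypothetical values and column names, and is not the code used in {pipaux}.

```r
# Hypothetical wide-format PPP file
wide <- data.frame(
  country_code   = c("AAA", "BBB"),
  ppp_2011_v1_v1 = c(1.5, 20.0),
  ppp_2011_v2_v1 = c(1.6, 20.0),
  ppp_2017_v1_v2 = c(1.8, 22.5)
)

# Columns that follow the structure ppp_YYYY_vX_vY
ppp_cols <- grep("^ppp_[0-9]{4}_v[0-9]+_v[0-9]+$", names(wide), value = TRUE)

# One block of rows per PPP variable, stacked into long format
long <- do.call(rbind, lapply(ppp_cols, function(col) {
  parts <- strsplit(col, "_", fixed = TRUE)[[1]]   # "ppp", YYYY, vX, vY
  data.frame(
    country_code       = wide$country_code,
    ppp_year           = as.integer(parts[2]),
    release_version    = parts[3],
    adaptation_version = parts[4],
    ppp                = wide[[col]],
    stringsAsFactors   = FALSE
  )
}))
print(long)
```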

8.4.2 Data structure

PPP data is available by typing pipload::pip_load_aux("ppp"). As expected, the data you get has already been cleaned for use in the PIP workflow, and it is slightly different from the original PPP data stored in the datalibweb servers. The most important difference between the PIP data frame and the datalibweb data frame is its rectangular structure: PIP data is in long format, whereas datalibweb data is in wide format.

The reason for having PPP data in long format in PIP is that some countries, very few, use a different PPP year than the rest of the countries. Instead of using a different variable for the calculations of those specific countries, we use the same variable for all the countries but filter the corresponding observations for each country using metadata from the Price Framework database.

The PPP data is at the country/ppp year/data_level/release version/adaptation version level. Yet, several filters must always be applied before this data can be used. Ultimately, the data frame should be at the country/data_level level to be used properly. As a general rule, the filter must be done by selecting the most recent release_version and the most recent adaptation_version in each year. Then you can just filter by the PPP year you want to work with. To make this process even easier, we have created the variables ppp_default and ppp_default_by_year, which are dummy variables to filter data. If you keep all observations for which ppp_default == 1, you get the current PPP used for all PIP calculations. If you use ppp_default_by_year == 1, you get the default version used in each PPP year. This is useful in case you want to make comparisons between PPP releases. These two variables are created in the function pipaux::pip_ppp_clean(), in particular in these lines.
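The filters described above can be sketched with a toy data frame in base R. The values are hypothetical, and the column names country_code, ppp_year, and ppp are assumptions. With real data you would start from pipload::pip_load_aux("ppp") instead.

```r
# Hypothetical long-format PPP data for one country
ppp <- data.frame(
  country_code        = rep("AAA", 4),
  ppp_year            = c(2011, 2011, 2017, 2017),
  release_version     = c("v1", "v2", "v1", "v1"),
  adaptation_version  = c("v1", "v1", "v1", "v2"),
  data_level          = "national",
  ppp                 = c(1.5, 1.6, 1.75, 1.8),
  ppp_default         = c(0, 0, 0, 1),
  ppp_default_by_year = c(0, 1, 0, 1),
  stringsAsFactors    = FALSE
)

# Current PPP used for all calculations
current <- ppp[ppp$ppp_default == 1, ]

# Default version within each PPP year
by_year <- ppp[ppp$ppp_default_by_year == 1, ]

# Same result by hand: most recent release, then most recent adaptation
ver <- function(x) as.integer(sub("^v", "", x))
ord <- order(ppp$country_code, ppp$data_level, ppp$ppp_year,
             ver(ppp$release_version), ver(ppp$adaptation_version))
sorted <- ppp[ord, ]
key    <- paste(sorted$country_code, sorted$data_level, sorted$ppp_year)
manual <- sorted[!duplicated(key, fromLast = TRUE), ]

print(current)
print(manual)
```

In this toy example the manual rule and ppp_default_by_year select the same rows, which is the behavior the dummy variables are meant to shortcut.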

8.5 Price Framework (PFW)


8.5.1 Original data


8.6 Abbreviations

HFCE – household final consumption expenditure

MPD – Maddison Project Database

PCE – private consumption expenditure

WDI – World Development Indicators

WEO – World Economic Outlook


  1. The examples of special cases mentioned in this document are based on the March 2021 PovcalNet update.