7 Welfare data

Welfare data refers to any data file that contains the welfare vector that underlies the poverty and inequality measures in each country/year/welfare type. PIP makes use of four different data kinds to store the welfare vectors: microdata, bin data, group data, and synthetic data. Each of these is collected, formatted, and stored in a particular way.

7.1 From Raw to GMD

7.1.1 Microdata

The Global Monitoring Database (GMD) is the World Bank’s repository of multitopic income and expenditure household surveys used to monitor global poverty and shared prosperity. The household survey data are typically collected by national statistical offices in each country, and then compiled, processed, and harmonized. The process of harmonization is done at the regional level by the regional statistical development teams. This is to ensure that the harmonization is done with best country knowledge from the country TTLs as well as to ensure consistency of country data overtime. Selected variables have been harmonized to the extent possible such that levels and trends in poverty and other key sociodemographic attributes can be reasonably compared across and within countries over time.

The regional harmonization process also follows the global harmonization guidelines for most of the variables in GMD. PovcalNet also contributed historical data from before 1990, and recent survey data from Luxemburg Income Studies (LIS).

The global update process is coordinated by the Data for Goals (D4G) team and supported by the six regional statistics teams in the Poverty and Equity Global Practice.

The original data files used for the regional harmonization and GMD harmonization should be cataloged and stored in the regional shared drive and the central Microdata Library catalog (http://microdatalib/). We assume that the depositing of the original microdata in the Microdata Library is a collective responsibility shared by all World Bank staff working in all the World Bank’s regions, irrespective of global practice, since the dialogue and data acquisition through the NSO and line ministries often take place in a decentralized manner.

7.1.2 Group Data

Some countries have very restrictive and confidential data sharing conditions. In such cases, country economist, in collaboration with the regional team and PIP team, would send a request to the country NSO for the group or “collapsed” data for the purpose of monitoring international poverty (SDG 1.1.1). Historically, the group data can be exported from national statistics reports with either data format of type 2 or type 5. In recent years, country economists would request the group data from NSO directly. For example, we request the group data - ventile (20) for China by urban and rural, type 2. For other countries, like high income countries, such as UEA (Country code ARE), NSO directly reach out to us in order to update SDG 1.1.1, in this case the group data is shared to us and we validate the data.

7.1.3 Bin Data

Some of the countries that are not available in DataLibWeb can be found in the repository of the LIS Cross-National Data Center (hereafter, LIS). Currently, PovcalNet uses LIS data for 8 high-income economies: AUS, CAN, DEU, ISR, JPN, KOR, TWN & USA, plus the Pre-EUSILC years (generally before 2002) of European Economies.

LIS datasets cannot be downloaded in full; however, they provide a remote-execution system, LISSY, that allows us interact with their microdata without having access to the individual records. We have developed a set of Stata do-files to interact with LISSY and aggregate the welfare distribution of our countries of interest to 400 bins. Then, these data is organized locally and shared with the Poverty GP to be included in DataLibWeb as a collection independent from GPWG.

The LIS_data repository

In order to work with the LIS data you need to clone the repository PovcalNet-Team/LIS_data. You will find in there three folders, 00.LIS_output, 01.programs, and 02.data.

Interacting with LISSY

Opening an account in LISSY:

To interact with the LIS data you need to first register here, by first completing the LIS microdata User Registration Form, and then submitting it through the same website using your institutional e-mail account. Within a couple of days, you will receive an e-mail from LIS containing your username and password.

You do not get to choose your own username or password. LISdatacenter creates both for you and those won’t change in time. Make sure to save that e-mail and record that information for your future log-ins. Also, know that LISSY passwords expire each year on December 31st. While your password won’t change, it must be renewedhere after January 1st.

Interacting with LISSY:

To get acquainted with LISSY’s interface, coding structure, database naming and variables available, and learn how to compute estimates within LISSY we highly recommend taking some time to review the tutorials and self-teaching materials. However, in order to update the 400 bins twice a year, Stata codes have been previously written, so you simply need to follow the 5 steps in the next section.

Getting the 400 bins from LISSY

1. log in Go to LIS main page, scroll down and click on the lock icon

2. Provide info Feed the three drop-down menus on top of the platform with the following information:

Project: LIS
Package: Stata
Subject: (Choose a name Ex: “Bins #1 - Dec 2020”)

The LISSY platform cannot run the code for ALL surveys available at once. If you attempt to do so, your project will stop and you will receive an e-mail containing the text:

#####################################################
  Your job has been killed and will be not executed  
#####################################################

To avoid this, we need to run the code in groups of 5 to 6 countries, depending on the amount of years in each of them. Currently, LIS has data for 52 countries (26 of them are in the EUSILC project, and 26 do not) which usually take approximately 10 rounds of this process.

3. Add do-file Copy and paste the entire content of 01.LIS_400bins.do file into the the main large command window and update the locals in lines 23-24 with the LIS 2-letter acronyms of the group of 5-6 countries in each round.

Remember to update the subject with each round (Ex: “Bins #2 - Dec 2020”) so you keep track of the number of output files. Be careful not to leave out [or repeat] any country in the process.

4 Submit. Click on the green arrow icon to submit your project. You will get an e-mail within some minutes with your output. If the system kills your project, your group of countries was probably too large. Remove one country and try again.

5. Retrieve results Copy the entire text in the output e-mail you receive for each round, open your notepad and paste. Save each round in the \00.LIS_output folder. Save each text file with the name LISSY_Dec202@_#.txt, [where @ is the year and # the round]. Consistency with this naming format is important for the next step (02.LIS_organize_output.do file)

From Text file to datalibweb structure

We now need to convert the the text files generated by the LISSY system to actual data suitable for datalibweb. This structure is suggested by the International Household Survey Network (IHSN). Once the data is saved in folder 00.LIS_output you need to execute the file 02.LIS_organize_output.do. This file created to be executed in just one go. However, it could be ran in sections taking advantage of the different frames along the code.

Before you execute this code, you need to ensure a few things,

1. Get rcall working in your computer

The processing of the text files is not done anymore on Stata but in R. To avoid changing systems, we need to execute R code directly from Stata. In order to do this, you need to make sure to have install R in your computer and also the Stata command rcall. The do-file 02.LIS_organize_output.do will check if you have it installed and will install it for you in case it is not. However, you can run the lines below to make sure everything is working fine. Also, you can take a look at the help file of rcall to get familiar with it.

2. Personal Drive

Make sure to add your UPI to the appropriate sections it appears by typing disp lower("`c(username)'") , following the example below,

3. Directives of the code

This do-file works like an ado-file in the sense that the output depends on the value of some local macros,

If local update_surveynames is set to 1, the code will load the sheet LIS_survname from the the file 02.data/_aux/LIS datasets.xlsx and updated the file 02.data/_aux/LIS_survname.dta. If replace is set to 1, the code will replace any output with the same name. Otherwise, it will create a new vintage version if the two files are different. If they are not different, the code will do nothing. local p_drive_output_dir is deprecated, so you must leave it as 0.

4. Pattern of the text files

When the text files with the information from LIS are stored in 00.LIS_output, they should be stored in a systematic way so that they could be loaded and processed at the same time. This can be done by specifying in a matching regular expression in local pattern. For instance, all the files downloaded in December, 2020 could by loaded and processed using the directive, local pattern = "LISSY_Dec2020.*txt".

5. Output

When the do-file is concluded, it saves the file 02.data/create_dta_status.dta with the status of all the surveys processed.

Compare new LIS data to Datalibweb inventory

To identify what data is new and what data has changed with respect to the one available in datalibweb, you need to execute do-file 03.LIS_compare_dlw.do. Again, this do-file is intended to be executed in one run, but you can do it in parts taking advantage of the different frames. At the end of the execution the file 02.data/comparison_results.dta is created. This file contains three important variables wf, wt, and gn, which correspond to the ration of welfare means, weight means, and Gini coefficient between the data in datalibweb and the data in the folder, p:/01.PovcalNet/03.QA/06.LIS/03.Vintage_control.

You should only send to the Poverty GP those surveys for which at least one of these three variables is different to 1.

The Excel file `LIS datasets.xlsx`

With each LIS data update performed, we must first identify from LIS the new surveys (countries and/or years) they had recently added. LIS send users e-mails informing about new datasets added, and also releases newsletters with this information.

Inside the 02.data folder of your LIS_data GIT repository you will find a _aux sub folder, and the LIS datasets.xlsx file placed in there. We must manually update the tab LIS_survname tab adding new rows to the sheet. All necessary information to fill up this metadata (household size, currency, etc.) can be found in METIS.

ACRONYMS: The column survey_acronym is created by us. If you come across a new survey for which an acronym has not been previously established, the rules applied in the past by the team were the following:

Acronyms are created based on the ENGLISH name of the survey. (Ex: German Transfer Survey (Germany) is “GTS”, followed by the suffix -LIS; thus GTS-LIS.
For the surveys that were Microcensus, we created the acronym “MC”, and for Denmark’s Law Model, “LM”.
All acronyms are created in capital letters.

Finally, while the survey names in METIS are in English, some of the acronyms in parenthesis are still in the original language. In those cases we translated them to English. For instance, the survey name “Household Budget Survey (BdF) (France)” from METIS was changed to “Household Budget Survey (HBS) (France)” in the column surveyname of the excel.

Prepare data for the Poverty GP

Finally, the do-file 04.Append_new_LIS_bases.do prepares the data to be shared with the Poverty GP. Note that this do-file ONLY appends the 400 bins data of surveys that are new and those where welfare changed, which are identified in the previous step /comparison_results.dta as those gn != 1.

Before running the code, make sure to change the output file name to the date of your update (last one saved was “LIS_bins_Dec_21_2020.dta”). The output is saved in P:\01.PovcalNet\03.QA\06.LIS\04.Share_with_GP.

Finally, quickly prepare a short .dta file importing the metadata already created in the LIS_survname tab from the Excel, keeping ONLY the surveys of the append output you just run and send both files to Minh Cong Nguyen <mnguyen3@worldbank.org> from the Poverty GP.

7.1.4 Synthetic Data

Synthetic (or imputed) data is used in particular instances where household surveys do not collect income or expenditure information; or collect partial expenditures (i.e. for a small list of items and rotate across different households). In cases of that nature, country teams incur in efforts to impute the income or consumption welfare vectors. This exercise is based on the imputation of welfare from one survey to another. The imputation modeling is between the welfare, and household & individual characteristics from a previous household survey, where data on both is available. This exercise consists of simulating welfare information for each household in the survey several times based on different simulated coefficients from the imputation models. The resulting data is stacked together from each simulated set. Final data should follow the same format as the regular GPWG module with the exception of the variable “sim” (which indicates the number of simulations in the stacked database).

FGT indicators are estimated in the same way as any other data; however, for sensitive distributional indicators, calculations are run separately for each simulation, then averaged across results to get the final indicator. This data will get a value of 1 in the “imputed” variable of the price framework database.

7.2 From GMD to PIP ({pipdp} package)

The next step in the preperation of survey data for PIP is to leverage the {pipdp} package. This package retrives survey data from DLW and standardizes it into the format used by the PIP ingestion pipelines (Poverty Calculator Pipeline and Table Maker Pipeline).

7.2.1 Requirements

{pipdp} depends on the following Stata packages.

You will also need to have the World Bank {datalibweb} module and the PovcalNet internal {pcn} command installed.

7.2.2 Installation

7.2.3 Remote server

It is recommended to both use {pipdp} on the PovcalNet remote server (WBGMSDDG001). The phyiscal location of this server is much closer to the data storage so queries to Datalibweb, as well as read/write operations to the PIP shared network drive, will be much faster.

The only issue with using the server is that Datalibweb will require login credentials for each new Stata session. To avoid this it is highly recommended that users apply for direct access to Datalibweb. This will grant access to the Datalibweb file storage and enable the possiblity of using the files option in all datalibweb queries.

In order to use the files option you will also need to make some minor adjustments to your Datalibweb setup files. In particular the global root variable in C:/ado/personal/datalibweb/GMD.do needs to be specified correctly. This can be achived by copying datalibweb.ado and GMD.do in the _aux directory to their respective locations.

Since WBGMSDDG001 is a shared server and the DLW settings thus can be reset, you will need to make sure that these settings are correct every time you use pipdp. The modified setup files will only change the behavior of Datalibweb for users that have direct access to the file storage. It will not affect other users. You could either check and copy the files manually or run the following from the command line.

For details on how to connect to WBGMSDDG001 see the Remote server connection section in the PovcalNet Internal Guidelines and Protocols.

7.2.4 Usage

pipdp has two main commands, pipdp group to copy and prepare grouped data files from the PocvalNet shared network drive and pipdp micro to download and prepare micro datasets from Datalibweb. For examples on how to use these commands see the Usage section in the {pipdp}README.

7.2.5 Preparing survey data for Poverty Calculator Pipeline

Follow these steps when preparing survey data for the Poverty Calculator Pipeline

Make sure the Price Framework file is up to date in the directory you are using.

Make sure the Datalibweb repository file is up to date in directory you are using. Note that if you are running this code on the PovcalNet remote server you will be required to log in to DLW, regardless if you have direct access or not. This is because the underlying datalibweb, type(GMD) repo(create dlwrepo) command always sends queries over web. If you want you to avoid the login requirement you can conduct this specific step from your local machine.

Make sure DLW setup files are up-to-date (needed for files option).

Update surveys

Update the PIP inventory

7.3 Survey ID nomenclature

All household surveys in the PIP repository are stored following the naming convention of the International Household Survey Network (IHSN) for archiving and managing data. This structure can be generalized as follows:

CCC_YYYY_SSSS_ vNN_M_vNN_A_TYPE_MODULE

where,

CCC refers to 3 letter ISO country code
YYYY refers to survey year when data collection started
SSSS refers to survey acronym (e.g., LSMS, CWIQ, HBS, etc.)
vNN is the version vintage of the raw/master data if followed by an M or of the alternative/adapted data if followed by an A. The first version will always be v01, and when a newer version is available it should be named sequentially (e.g. v02, v03, etc.). Note that when a new version is available, the previous ones must still be kept.
TYPE refers to the collection name. In the case of PIP data, the type is precisely, PIP, but be aware that there are several other types in the datalibweb collection. For instance, the Global Monitoring Database uses GMD; the South Asia region uses SARMD; or the LAC region uses SEDLAC.
MODULE refers to the module of the collection. This part of the survey ID is only available at the file level, not at the folder (i.e, survey) level. Since the folder structure is created at the survey level, there is no place in the survey ID to include the module of the survey. However, within a single survey, you may find different modules, which are specified in the name of the file. In the case of PIP, the module of the survey is divided in two: the PIP tool and the GMD module. The PIP tool could be PC, TB, or SOL (forthcoming), which stand for one of the PIP systems, Poverty Calculator, Table Baker (or Maker), and Statistics OnLine. the GMD module, refers to the original module in the GMD collection, such as module ALL, or GPWG, HIST, or BIN.

For example, the most recent version of the harmonized Pakistani Household Survey of 2015, she would refer to the survey ID PAK_2015_PSLM_v01_M_v02_A_PIP. In this case, PSLM refers to the acronym of the Pakistan Social and Living Standards Measurement Survey. v01_M means that the version of the raw data has not changed since it was released and v02_A means that the most recent version of the alternative version is 02.2

7.4 Survey CACHE ID nomenclature

NOTE:Andres, add explanation here

Testing tachyons colors

This is a test Royal blue color in line