3 Folder Structure
The PIP root directory (Sys.getenv("PIP_ROOT_DIR")) contains the following main folders.
Folder | Explanation |
---|---|
PIP-Data | No longer in use |
PIP-Data_QA | QA directory for survey and AUX data |
PIP-Data_ExtSOL | Data for the external SOL application |
PIP-PIP_Data_Testing | Testing directory for pipaux and pipdp |
PIP-PIP_Data_Vintage | Currently not in use |
pip_ingestion_pipeline | Pipeline directory |
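As a minimal sketch of how paths below the root directory might be built in R: the helper `pip_path()` is hypothetical (it is not part of any PIP package), and the root used in the example is a placeholder rather than the real network location.

```r
# Build a path below the PIP root directory.
# `pip_path()` is a hypothetical helper, not part of any PIP package.
pip_path <- function(..., root = Sys.getenv("PIP_ROOT_DIR")) {
  if (!nzchar(root)) {
    stop("PIP_ROOT_DIR is not set; pass `root` explicitly.")
  }
  file.path(root, ...)
}

# Placeholder root, used for illustration only.
aux_dir <- pip_path("PIP-Data_QA", "_aux", root = "/path/to/pip-root")
print(aux_dir)
```

When PIP_ROOT_DIR is set in the session, the `root` argument can be left out and the environment variable is used instead.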
Details on each folder and their subfolders are given below.
3.1 PIP-Data
This folder is no longer in use.
This was the original PIP_Data folder. It was used for testing and development of {pipdp} and {pipaux}, but has since been replaced by PIP-Data_QA and PIP_Data_Testing.
3.2 PIP-Data_QA
This is the main output directory for the {pipdp} and {pipaux} packages. It contains all the survey and auxiliary data needed to run the Poverty Calculator and Table Maker pipelines.
Note that the contents of this folder are for QA and production purposes. Please use the PIP_Data_Testing directory, or your own personal testing directory, if you intend to test code changes in {pipdp} or {pipaux}.
3.2.1 _aux
The _aux folder contains the auxiliary data used by the Poverty Calculator Pipeline and the {pipdp} and {pipapi} packages.
Please note the following:
- The file sna/NAS special_2021-01-14.xlsx is currently hardcoded in {pipaux}. If the contents of the file change, this package might need to be updated.
- Some other National Accounts special cases (e.g. BLZ, VEN) are manually hardcoded in {pipaux}. Beware of this when updating GDP/PCE.
- The file weo/weo_<YYYY-MM-DD>.xls needs to be manually downloaded from the IMF, opened, and then re-saved as weo_<YYYY-MM-DD>.xls.
- The grouped data means currently come from the PovcalNet Masterfile. This should be changed when PovcalNet goes out of production.
An explanation of each subfolder is given below.
Folder | Measure | Usage | Source |
---|---|---|---|
countries | PIP country list | pipapi | pipaux |
country_list | WDI country list | pipaux, PC pipeline | pipaux |
cp | Country profiles | pipapi | pipaux |
cpi | CPI | PC pipeline | pipaux |
dlw | DLW repository | pipdp | pipdp |
gdm | Grouped data means | PC pipeline | pipaux |
gdp | Gross Domestic Product | PC pipeline | pipaux |
maddison | Maddison Project Data | pipaux | pipaux |
indicators | Indicators master | pipapi, PC pipeline | Manual, pipaux |
pce | Private consumption | PC pipeline | pipaux |
pfw | Price Framework | pipdp, PC pipeline | pipaux |
pl | Poverty Lines | pipapi | pipaux |
pop | Population | PC pipeline | pipaux |
ppp | Purchasing Power Parity | PC pipeline | pipaux |
regions | Regions | pipapi | pipaux |
sna | Special National Account cases | pipaux | Manual |
weo | World Economic Outlook (GDP) | pipaux | IMF, pipaux |
The data in the folders countries, regions, pl, cp and indicators are loaded into the Poverty Calculator pipeline, but they are not used for any calculations or modified in any way when being passed through the pipeline. They are thus only listed with {pipapi} as their usage.
In contrast, the measures CPI, GDP, PCE, POP and PPP are both used in the pre-calculations and transformed before being saved as pipeline outputs. So even though these measures are also available in the PIP PC API, the files at this stage only have the Poverty Calculator pipeline as their use case.
3.2.2 _inventory
The _inventory folder contains the PIP inventory file created by pipload::pip_update_inventory(). It is important to update this file if the survey data has been updated.
3.2.3 Country data
The folders with country codes as names, e.g. AGO, contain survey data for each country. These are created by {pipdp}. The folder structure within each country folder follows roughly the same convention as used by DLW.
There will be one data file for each survey, based on DLW’s GMD module and the grouped data in the PovcalNet-drive. These are labelled with PC in their filenames. Additionally, if there is survey data available from DLW’s ALL module, there will also be a dataset for the Table Maker, labelled with TB.
3.3 PIP-Data_ExtSOL
This folder is for the external SOL application. It is where the data for the external SOL is synced between the network drive and the storage in Azure. The synchronization is done daily by ITS. There are two subfolders, GMD-DLW and HFPS-COVID19.
- GMD-DLW: the folder for GMD data, both raw and harmonized.
- HFPS-COVID19: the folder for High Frequency Phone Survey data, both raw and harmonized.
The folder structure for these two catalogs follows the standard used in the DLW system.
Two files cover the synchronization between GMD in DLW and GMD in SOL: “Data in SOL.xlsx” and “Convert GMD to GMD SOL.do”. Only data with a clear and explicit data license will be uploaded to this folder.
For now, only “public data” is in this application. Public means that any user can download and redistribute the data, and that there is no login on the country NSO website or any conditions/terms when downloading the data. Otherwise, there is an explicit agreement with the NSO on the usage of the data for SOL – email: Microdata for an external Statistics Online (SOL) platform.
At the moment, we are working with Legal on a Custom license, a data license template in which SOL will be mentioned explicitly. Once the Custom license is developed (after the launch of SOL), we can reach out to all countries to ask for permission to use and “limited redistribute” the data in our secured platform.
Please do not touch or change any content in this folder without permission.
3.4 PIP_Data_Testing
This folder is used as a testing directory for development of the {pipdp} and {pipaux} packages.
3.5 PIP_Data_Vintage
This folder is currently not in use.
3.6 pip_ingestion_pipeline
This is the main output directory for both the Poverty Calculator and Table Maker pipelines.
3.6.1 pc_data
Folder | Explanation |
---|---|
_targets | Targets storage |
cache | Cleaned survey data |
output | PC pipeline output directory |
validation | Validation directory |
3.6.1.1 _targets
This folder contains the objects that are cached by {targets} when the Poverty Calculator Pipeline is run. It is located on the shared network drive so that the latest cached objects are available to all team members. Please note, however, that a shared _targets folder does entail some risks. It is very important that only one person runs the pipeline at a time. Concurrent runs against the same _targets store will wreak havoc. And needless to say, if you are developing new features or customising the pipeline, use a different _targets store. (We should probably have a custom DEV or testing folder for this, similar to PIP_Data_Testing.)
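One way to keep development runs away from the shared store is to point {targets} at a separate folder. The sketch below is an illustration under stated assumptions, not the team's established workflow: the store location and the wrapper `run_dev_pipeline()` are hypothetical, and it relies on the `store` argument of targets::tar_make() that is available in recent versions of {targets}.

```r
# Hypothetical personal store; the location is an assumption for illustration.
dev_store <- file.path(tempdir(), "pip_dev_targets")
dir.create(dev_store, showWarnings = FALSE, recursive = TRUE)

# Hypothetical wrapper: runs the pipeline defined in _targets.R against the
# personal store instead of the shared _targets folder.
run_dev_pipeline <- function(store = dev_store) {
  targets::tar_make(store = store)
}

# Call run_dev_pipeline() from the directory that contains _targets.R.
```

Because the store is passed explicitly, nothing in the shared _targets folder is read or written by such a run.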
3.6.1.2 cache
The survey data output of {pipdp} cannot be directly used as input to the Poverty Calculator Pipeline. This is because the data needs to be cleaned before running the necessary pre-calculations for the PIP database. The cleaning currently includes removal of rows with missing welfare and weight values, standardising of grouped data, and restructuring of datasets with both income and consumption based welfare values in the same survey. Please refer to {wbpip} and {pipdm} for the latest cleaning procedures.
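The first of these cleaning steps, dropping rows with missing welfare or weight values, can be illustrated with base R. This is a toy sketch only: the data, the column names welfare and weight, and the helper `drop_missing()` are assumptions, and the actual procedures live in {wbpip} and {pipdm}.

```r
# Toy survey data; column names are assumed for illustration.
survey <- data.frame(
  welfare = c(10.5, NA, 7.2, 3.1),
  weight  = c(1.0, 2.0, NA, 1.5)
)

# Hypothetical helper: keep only rows where all listed columns are non-missing.
drop_missing <- function(df, cols = c("welfare", "weight")) {
  keep <- stats::complete.cases(df[, cols, drop = FALSE])
  df[keep, , drop = FALSE]
}

clean <- drop_missing(survey)
print(clean)
```

In this toy data only the first and last rows survive, since the second row lacks a welfare value and the third lacks a weight.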
In order to avoid doing this cleaning every time the pipeline runs, an intermediate version of the survey data is stored in the cache/clean_survey_data/ folder. Please make sure this folder is up-to-date before running the _targets.R pipeline.
Note: This is not exactly true. We need to solve the issue with the double caching of survey data. But the essence of what we want to accomplish is to avoid cleaning and checking the survey data in every pipeline run.
3.6.1.3 output
This folder contains the outputs of the Poverty Calculator Pipeline. This is separated into three subfolders: aux, containing the auxiliary data; estimations, containing the survey, interpolated and distributional pre-calculated estimates; and survey_data, containing the cleaned survey data.
3.6.1.4 validation
TBD Andres: Could you write this section?
3.6.2 tb_data
TBD Andres: Could you write this section?