Extract Different Rows
extract_diff_rows.Rmd
This article looks a series of workflow for the function
extract_diff_rows()
.
Introduction
At its core, the function does the following:
- Sets up the two data frames (
dfx
anddfy
) for comparison:- Pairs rows based on the keys provided (given by the user using the
by
,by.x
,by.y
arguments, or by row number). - Pairs columns based on the column names.
- Pairs rows based on the keys provided (given by the user using the
- Runs the comparison between shared rows and shared columns only.
- Returns the rows/observations that are missing from either
dfx
ordfy
, with all variables from both datasets.
⚠️ No keys provided
Right now, myrror does not align rows based on row
content. This means that if the user does not provide keys, the function
will compare the two datasets based on row number only. This can lead to
incorrect results (extract_diff_row()
will not return the
correct rows) if:
the datasets have different row orders.
a detaset have duplicate rows.
a dataset has new/missing rows.
Therefore, we suggest the user to always provide keys when running the function.
Example 1: missing rows in dfy.
In extract_diff_rows()
, the default options for
output
is simple
. This returns a data.table
with the rows that are missing or new. In this example, we will compare
the survey_data
dataset with a modified version,
survey_data_4
:
output <- extract_diff_rows(survey_data, survey_data_4, by = c('country', 'year'), output = "simple")
print(output)
#> df country year variable1.x variable2.x variable3.x variable4.x
#> <char> <char> <int> <num> <num> <num> <num>
#> 1: dfx A 2010 -0.56047565 0.4978505 0.8951257 0.77996512
#> 2: dfx A 2011 -0.23017749 -1.9666172 0.8781335 -0.08336907
#> 3: dfx A 2012 1.55870831 0.7013559 0.8215811 0.25331851
#> 4: dfx A 2013 0.07050839 -0.4727914 0.6886403 -0.02854676
#> variable1.y variable2.y variable3.y variable4.y rn
#> <num> <num> <num> <num> <char>
#> 1: NA NA NA NA 1
#> 2: NA NA NA NA 2
#> 3: NA NA NA NA 3
#> 4: NA NA NA NA 4
The column df
tells the user whether the row is part of
dfx
or dfy
. In this case, the rows are part of
the dfx
dataset, but are missing from dfy
(survey_data
). So the column df
shows the
value dfx
.
If the user runs the function with output = "full"
, the
console will print a user-friendly summary of the comparison, similar to
the output of myrror()
. However, given that the output of
extract_diff_rows()
is not part of the myrror
object, the user will need to run extract_diff_rows()
separately to see the extracted rows.
# First run the function in full mode:
extract_diff_rows(survey_data, survey_data_4, by = c('country', 'year'), output = "full")
#>
#> ── Myrror Report ───────────────────────────────────────────────────────────────
#>
#> ── General Information: ──
#>
#> dfx: dfx with 16 rows and 6 columns.
#> dfy: dfy with 12 rows and 6 columns.
#> keys: country and year.
#>
#> ── Note: comparison is done for shared columns and rows. ──
#>
#> ✔ Total shared columns (no keys): 4
#> ! Non-shared columns in dfx: 0 ()
#> ! Non-shared columns in dfy: 0 ()
#>
#> ✔ Total shared rows: 12
#> ! Non-shared rows in dfx: 4.
#> ! Non-shared rows in dfy: 0.
#>
#> ℹ Note: run `extract_diff_rows()` to extract the missing/new rows.
Following the suggestion in the note, the user can extract the
results. The advantage of extracting the results is that the user can
further explore or manipulate the data using additional packages or
functions. For example, here we extract the results and display them
using DT::datatable()
:
# Then extract the results and analyse them:
extract_diff_rows() |>
datatable(
filter = 'top',
rownames = FALSE,
style = 'auto',
options = list(
searching = FALSE,
paging = TRUE,
scrollX = TRUE,
autoWidth = TRUE,
pageLength = 5
)
)
#> Last myrror object used for comparison.
Example 2: missing rows in dfx.
Similarly, if dfx
is missing some rows, they will be
extracted like so:
output <- extract_diff_rows(survey_data_4, survey_data, by = c('country', 'year'), interactive = FALSE, output = "simple")
output |>
datatable(
filter = 'top',
rownames = FALSE,
style = 'auto',
options = list(
searching = FALSE,
paging = TRUE,
scrollX = TRUE,
autoWidth = TRUE,
pageLength = 5
)
)