Tools for GitHub Interaction
Introduction
The pipfun package provides a set of tools for interacting with GitHub content, especially designed for working with PIP repositories. These functions act as lightweight wrappers around the gh package, offering a simpler and more focused interface for specific GitHub tasks related to PIP.
Managing Credentials
The pipfun package utilizes the gitcreds package to manage, store, and authenticate GitHub credentials. gitcreds provides a straightforward way to securely store your GitHub Personal Access Token (PAT) and ensure seamless access when using pipfun.
To set up your credentials, follow the instructions on the official {gitcreds} documentation page, or refer to Jenny Bryan’s excellent guide, Happy Git With R. Both resources explain in detail how to configure your Git credentials properly.
Please note that pipfun requires valid GitHub credentials to interact with repositories. If your credentials are not set up correctly with gitcreds, you will be unable to use pipfun for any GitHub-related functionality.
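Before calling any pipfun function, it can help to confirm that gitcreds can actually find a token. The following is a minimal sketch, not part of pipfun: has_gh_credentials() is a hypothetical helper name, and it relies only on gitcreds::gitcreds_get(), which signals an error when no credentials are available.

```r
# Hypothetical helper (not part of pipfun): TRUE if gitcreds can find a
# non-empty GitHub credential, FALSE otherwise.
has_gh_credentials <- function(url = "https://github.com") {
  tryCatch(
    {
      creds <- gitcreds::gitcreds_get(url = url)
      isTRUE(nzchar(creds$password))
    },
    # gitcreds_get() raises an error when no credential is stored
    error = function(e) FALSE
  )
}

if (!has_gh_credentials()) {
  message("No GitHub credentials found. Run gitcreds::gitcreds_set() in an interactive session.")
}
```

gitcreds::gitcreds_set() prompts for the token, so it is only useful in an interactive session.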
Getting Data from GitHub
The functions in pipfun are organized to build upon one another, allowing for progressively more complex interactions with GitHub. Below, we’ll start by introducing the foundational, lower-level functions that provide core metadata and file information. We will then move towards the higher-level, more abstracted functions that make complex tasks easier.
Retrieving File Metadata from GitHub
Every file stored in a GitHub repository carries a rich set of metadata, which is useful for tracking changes, validating updates, and managing file versions. For example, each file has a unique hash (its SHA) that serves as a signature for its version. Whenever a file is updated, its hash changes, making it possible to determine whether the file has been modified since it was last accessed. The same hash is required to confirm that you intend to overwrite an existing file in a GitHub repository.
To retrieve this metadata, use the get_file_info_from_gh() function as follows:
ghf_info <- get_file_info_from_gh(
  owner     = getOption("pipfun.ghowner"),
  repo      = "pip_info",
  file_path = "releases.csv",
  branch    = "releases"
)
#> Git credentials are missing or invalid in non-interactive mode.
names(ghf_info)
#> [1] "name" "path" "sha" "size" "url"
#> [6] "html_url" "git_url" "download_url" "type" "content"
#> [11] "encoding" "_links" "owner" "repo" "branch"
In this example, we use the function to get information about the file "releases.csv" in the "releases" branch of the "pip_info" repository. The owner argument specifies the GitHub repository owner, which by default is the value returned by getOption("pipfun.ghowner"). This default simplifies repeated calls, especially when consistently working within a specific organization’s repositories.
The function returns metadata about the file, such as its current hash and other useful information that can be leveraged for version control or ensuring file consistency.
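As a small illustration of how the hash can be used, the sketch below compares a previously stored SHA with the one in freshly retrieved metadata. file_has_changed() is a hypothetical helper, and the two lists are mock stand-ins for the output of get_file_info_from_gh() with made-up hash values.

```r
# Hypothetical helper: TRUE when the stored hash differs from the hash in
# newly retrieved file metadata.
file_has_changed <- function(previous_sha, current_info) {
  !identical(previous_sha, current_info$sha)
}

# Mock metadata (made-up hashes) standing in for get_file_info_from_gh() output
old_info <- list(path = "releases.csv", sha = "abc123")
new_info <- list(path = "releases.csv", sha = "def456")

file_has_changed(old_info$sha, old_info)  # FALSE: same version
file_has_changed(old_info$sha, new_info)  # TRUE: file was modified
```

Storing only the sha element between sessions is enough to decide later whether a file needs to be downloaded again.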
Retrieving Branch Metadata from GitHub
In a similar way, you can use the get_branch_info_from_gh() function to retrieve detailed metadata about a specific branch within a GitHub repository. This function provides essential information, such as the branch’s name, the latest commit’s hash, and details about branch protection rules (if any). Like other {pipfun} functions, it queries the GitHub API and uses your credentials to authenticate requests. The output can be used for tasks such as verifying the latest changes, ensuring compliance with branch protection rules, or tracking updates across branches in the repository.
gh_branch_info <- get_branch_info_from_gh(
  owner  = getOption("pipfun.ghowner"),
  repo   = "aux_ppp",
  branch = "DEV"
)
#> Git credentials are missing or invalid in non-interactive mode.
names(gh_branch_info)
#> [1] "name" "commit" "_links" "protected"
#> [5] "protection" "protection_url" "owner" "repo"
#> [9] "branch"
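The elements listed above can be condensed into a short summary. This sketch assumes that the commit element follows the layout of the GitHub REST API branch response, where the latest commit hash is stored under commit$sha. summarise_branch() is a hypothetical helper, and the list is mock data rather than real output.

```r
# Hypothetical helper: pull the most commonly needed fields out of the
# branch metadata. Assumes the commit hash is stored under `commit$sha`.
summarise_branch <- function(branch_info) {
  list(
    branch      = branch_info$name,
    last_commit = branch_info$commit$sha,
    protected   = isTRUE(branch_info$protected)
  )
}

# Mock metadata (made-up hash) standing in for get_branch_info_from_gh() output
mock_branch <- list(
  name      = "DEV",
  commit    = list(sha = "0a1b2c3"),
  protected = FALSE
)

str(summarise_branch(mock_branch))
```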
Downloading a File from GitHub and Reading from Local Drive
Downloading and reading files from GitHub can be challenging because of the diversity of file formats, many of which cannot be read directly. This is where the get_file_from_gh() function in pipfun comes into play.
The get_file_from_gh() function simplifies this process by identifying the format of the file, downloading it as a temporary file on your local drive (when necessary), and then loading it directly into memory. This function supports the most common file formats used in the PIP project, including:
- CSV (.csv)
- Excel files (.xls, .xlsx)
- Stata (.dta)
- Quick Serialization (.qs)
- Fast serialization (.fst)
- YAML (.yaml)
- R Data (.rds)
- JSON (.json)
This flexibility allows users to work seamlessly with various data types commonly encountered in PIP, without worrying about the specific steps needed to handle each format.
dt <- get_file_from_gh(
  owner     = getOption("pipfun.ghowner"),
  repo      = "pip_info",
  file_path = "releases.csv",
  branch    = "releases"
)
#> Git credentials are missing or invalid in non-interactive mode.
head(dt)
#> release identity
#> <int> <char>
#> 1: 20240326 PROD
#> 2: 20241101 PROD
#> 3: 20241105 PROD
#> 4: 20250203 TEST
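The format identification described above can be pictured as a dispatch on the file extension. The sketch below is illustrative only: reader_for() is a hypothetical helper, and the reader functions it names are reasonable choices for each format, not a statement of what pipfun calls internally.

```r
# Illustrative only: map a file extension to the name of a reader function
# that could load it. The choices are assumptions, not pipfun internals.
reader_for <- function(file_path) {
  ext <- tolower(tools::file_ext(file_path))
  switch(ext,
    csv  = "data.table::fread",
    xls  = ,                      # falls through to the xlsx reader
    xlsx = "readxl::read_excel",
    dta  = "haven::read_dta",
    qs   = "qs::qread",
    fst  = "fst::read_fst",
    yaml = "yaml::read_yaml",
    rds  = "readRDS",
    json = "jsonlite::fromJSON",
    stop("Unsupported file format: ", ext)
  )
}

reader_for("releases.csv")
reader_for("survey.DTA")
```

Binary formats such as .xlsx or .dta generally have to be written to a temporary file before a reader can open them, which is consistent with the "when necessary" download step mentioned above.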
Saving Data to GitHub
To save data to GitHub using pipfun, you need to have appropriate permissions configured in your R environment. Ensure that the token you use with gitcreds has the necessary permissions to write files to GitHub repositories.
The process of saving data with the save_to_gh() function differs from the traditional Git workflow of staging, committing, and pushing changes. Instead, save_to_gh() interacts directly with the GitHub API, allowing you to upload data without manually handling Git commands. This approach provides a convenient way to programmatically save or update files in a repository.
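To make "interacting directly with the GitHub API" more concrete, the sketch below builds the request body that GitHub's "create or update file contents" endpoint expects and shows, without running it, how the request could be sent with gh::gh(). This is an assumption about the general mechanism, not a description of pipfun's internals. build_contents_body(), its "DEV" default, and the commit message are hypothetical, and jsonlite is used here only for Base64 encoding.

```r
# Sketch only: build the body for GitHub's "create or update file contents"
# endpoint. `build_contents_body()` is a hypothetical helper.
build_contents_body <- function(df, message, branch = "DEV", sha = NULL) {
  # Serialize the data frame to CSV text in memory
  csv_text <- paste(
    utils::capture.output(utils::write.csv(df, row.names = FALSE)),
    collapse = "\n"
  )
  list(
    message = message,
    content = jsonlite::base64_enc(charToRaw(csv_text)),  # API expects Base64
    branch  = branch,
    sha     = sha  # hash of the file being replaced; NULL for a new file
  )
}

body <- build_contents_body(
  df      = data.frame(a = 1:2, b = c("x", "y")),
  message = "Add data_example.csv"
)

# The request needs valid credentials and write access, so it is shown
# here but not run:
# gh::gh(
#   "PUT /repos/{owner}/{repo}/contents/{path}",
#   owner   = getOption("pipfun.ghowner"),
#   repo    = "aux_test",
#   path    = "data_example.csv",
#   message = body$message,
#   content = body$content,
#   branch  = body$branch,
#   sha     = body$sha
# )
```

The sha field is what ties this to the file metadata discussed earlier: the API only accepts an update to an existing file when the hash of the version being replaced is supplied.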
If the file already exists, save_to_gh() updates it with the new content. If it does not exist, a new file is created. By default, save_to_gh() saves the data in CSV format, but you can choose any of the supported formats listed earlier.
Below are examples demonstrating how to use it:
Example 1: Creating a New File
This will create a new file named data_example.csv in the DEV branch of the aux_test repository.
library(pipfun)

# Example data frame
df <- data.frame(a = 1:5, b = letters[1:5])

# Save the data to a repository
save_to_gh(
  df       = df,
  owner    = getOption("pipfun.ghowner"),
  repo     = "aux_test",
  filename = "data_example",
  ext      = "csv"
)
Example 2: Updating an Existing File
# Updated data frame
df <- data.frame(a = 6:10, b = letters[6:10])

# Update the file in the repository
save_to_gh(
  df       = df,
  repo     = "aux_test",
  filename = "data_example",
  ext      = "csv"
)
Example 3: Saving the Same Data to the Same File (No Change)
# Save the same data to the same file
result_no_change <- save_to_gh(
  df       = df,
  repo     = "aux_test",
  filename = "data_example",
  ext      = "csv"
)

# Check if the data was changed
print(result_no_change$data_change) # Should be FALSE as the content is identical
Additionally, if you already have the metadata for an existing file (see Retrieving File Metadata from GitHub), you can pass it directly to the function through the metadata argument.
Understanding the Output of save_to_gh()
The save_to_gh() function invisibly returns a list containing information about the upload or update operation. The key elements of this output list are:
- content: Metadata about the uploaded or updated file, including its SHA hash, path in the repository, and other details.
- commit: Information about the commit associated with the upload or update operation, such as the commit SHA and message.
- init: Metadata of the file before the operation. If the file did not exist, this will be NULL.
- owner: The GitHub username or organization that owns the repository.
- repo: The name of the GitHub repository where the file was uploaded or updated.
- branch: The branch of the repository where the file was uploaded or updated.
- data_change: A logical value indicating whether the file’s content was updated (TRUE) or remained unchanged (FALSE).
Note: If the file did not exist before the operation, data_change will be TRUE because creating a new file is considered a change to the repository’s data state.
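Taken together, init and data_change are enough to tell the three possible outcomes apart. The sketch below is a minimal illustration built only on the element descriptions above. describe_save() is a hypothetical helper, and the lists are mock results rather than real save_to_gh() output.

```r
# Hypothetical helper: classify the outcome of a save operation from the
# `init` and `data_change` elements described above.
describe_save <- function(result) {
  if (is.null(result$init)) {
    "created"      # no file existed before the operation
  } else if (isTRUE(result$data_change)) {
    "updated"      # file existed and its content changed
  } else {
    "unchanged"    # file existed and the content was identical
  }
}

# Mock results (made-up hashes) standing in for save_to_gh() output
res_new  <- list(init = NULL, data_change = TRUE)
res_upd  <- list(init = list(sha = "abc123"), data_change = TRUE)
res_same <- list(init = list(sha = "abc123"), data_change = FALSE)

vapply(list(res_new, res_upd, res_same), describe_save, character(1))
```

Because save_to_gh() returns its list invisibly, the result has to be assigned to a variable, as in Example 3, before it can be inspected this way.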