tidypolis
Project Overview
tidypolis is an R package that streamlines the cleaning, preparation, and management of global polio surveillance data from the WHO Polio Information System (POLIS) API. It provides tools for downloading data, conducting quality checks, standardizing fields, archiving outputs, and caching data both locally and in Azure storage.
Key features include:
- Automated downloading and caching of global polio surveillance data from the POLIS API.
- Comprehensive data cleaning functions to standardize variable names, formats, and values.
- Built-in data quality checks to identify missing or inconsistent records.
- Archival tools to support reliable long-term storage and reproducibility.
- Compatibility with Azure storage for cloud-based workflows, including those run in Posit Workbench.
- Cleaning and standardization of WHO geodatabases following CDC Surveillance, Innovation, and Research methodologies.
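The standardization and quality-check features listed above can be illustrated with a minimal sketch. It does not call any tidypolis function. The data frame, column names, and checks are hypothetical and only show the general pattern with dplyr: standardize variable names, normalize formats and values, then flag problem records rather than dropping them.

```r
library(dplyr)

# Hypothetical raw records; columns and values are illustrative, not POLIS fields.
raw_cases <- tibble(
  `EPID Number` = c("AFG-001", "AFG-002", NA, "PAK-010"),
  `Onset Date`  = c("2024-01-15", "2024-02-30", "2024-03-01", "2024-03-12"),
  Country       = c("afghanistan", "AFGHANISTAN ", "Pakistan", "pakistan")
)

# Standardize variable names to snake_case, then normalize formats and values.
clean_cases <- raw_cases %>%
  rename_with(~ tolower(gsub("[^A-Za-z0-9]+", "_", .x))) %>%
  mutate(
    country = toupper(trimws(country)),
    # Dates that cannot be parsed (such as 30 February) become NA.
    onset_date = as.Date(onset_date, format = "%Y-%m-%d")
  )

# Flag records that fail basic checks so they can be reviewed.
flagged_cases <- clean_cases %>%
  mutate(
    missing_epid  = is.na(epid_number),
    invalid_onset = is.na(onset_date)
  ) %>%
  filter(missing_epid | invalid_onset)
```

Keeping the flagged rows in a separate data frame makes it possible to report or archive them alongside the cleaned output.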
The package enables consistent, transparent, and reproducible workflows for analyzing global polio data. The original lead developer was Nishant Kishore, PhD, with Nicholas Heaghney, MS, serving as primary maintainer until May 2025. I have since taken over ongoing maintenance and development.
My key contributions include:
- Implementing cloud-compatible authentication and caching systems, enabling the cleaning pipeline to run fully in Posit Workbench and reducing dependence on local compute resources.
- Improving documentation and vignettes to enhance usability and adoption.
- Refactoring and modularizing code to improve readability, performance, and maintainability.
- Resolving breaking changes caused by POLIS API updates.
- Strengthening the reliability of case geographic assignment workflows.
Upon becoming the primary maintainer, I focused on two goals: enabling reliable execution in cloud environments and improving maintainability. Initially, the cleaning pipeline could not run on Posit Workbench because it relied on user-based Azure authentication workflows designed for a local machine. I redesigned the authentication system to use a service principal, with credentials stored securely within Posit Workbench. This eliminated the need for user-based authentication and allowed the entire pipeline to run non-interactively in the cloud.
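A minimal sketch of non-interactive service principal authentication with AzureAuth and AzureStor is shown here. It is not the tidypolis implementation. The function names `read_sp_credentials()` and `connect_container()`, the environment variable names, and the idea of supplying credentials through environment variables are all assumptions made for illustration.

```r
# Requires the AzureAuth and AzureStor packages to be installed.

# Read service principal credentials from environment variables.
# The variable names are hypothetical, not ones defined by tidypolis.
read_sp_credentials <- function() {
  vars <- c(
    tenant   = "AZURE_TENANT_ID",
    app      = "AZURE_CLIENT_ID",
    password = "AZURE_CLIENT_SECRET"
  )
  creds <- vapply(vars, Sys.getenv, character(1), unset = NA_character_)
  missing <- is.na(creds) | !nzchar(creds)
  if (any(missing)) {
    stop("Missing environment variables: ",
         paste(vars[missing], collapse = ", "))
  }
  creds
}

# Obtain a token with the client credentials flow (no browser or user prompt)
# and open a blob container with it.
connect_container <- function(endpoint_url, container_name,
                              creds = read_sp_credentials()) {
  token <- AzureAuth::get_azure_token(
    resource  = "https://storage.azure.com/",
    tenant    = creds[["tenant"]],
    app       = creds[["app"]],
    password  = creds[["password"]],
    auth_type = "client_credentials"
  )
  endpoint <- AzureStor::storage_endpoint(endpoint_url, token = token)
  AzureStor::storage_container(endpoint, container_name)
}

# Example call (placeholder URL and container name):
# container <- connect_container(
#   "https://<account>.blob.core.windows.net", "<container>"
# )
```

Because the client credentials flow needs no interactive login, a script built this way can run unattended, which is the property the redesign described above relies on.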
To improve maintainability and support external collaboration, I led the modularization of the previously monolithic cleaning script, breaking the workflow into smaller, well-documented functions that partners outside CDC can reuse. This work was conducted in collaboration with Nicholas Heaghney, MS (CDC), Nishant Kishore, PhD (CDC), and Mohammed Yusuf, PhD (WHO). It has increased transparency, reduced compute requirements, and enabled external analysts to run the same cleaning pipeline used internally at CDC.
Technologies Used
- Programming: R
- Data Manipulation: dplyr, tidyr, purrr, stringr
- Data Access: httr, jsonlite, AzureStor, AzureAuth, Microsoft365R
- Documentation and Package Development: roxygen2, devtools, usethis
