Data Ingestion and Engineering

All data engineering and ingestion scripts are open source, released under the Apache 2.0 license. Please visit our GitHub page for more details.

Each repository is named after its source data. For example, the mimi-cdc repository contains the download and ingestion scripts for datasets from the Centers for Disease Control and Prevention.

A repository usually consists of two groups of Python/PySpark scripts:

download*: A script that starts with the download prefix downloads source files from the relevant websites or APIs.

ingest*: A script that starts with the ingest prefix ingests the source files into the corresponding delta tables in the workspace.
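The two-step pattern described above can be sketched roughly as follows. This is only an illustration, not code taken from the repositories: the function names (`target_path`, `download`, `ingest`), the landing directory, the CSV input format, and the table name are all hypothetical assumptions. The ingest step assumes a Spark environment with Delta Lake available.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve


def target_path(url: str, landing_dir: str) -> Path:
    """Derive the local file path for a source URL (hypothetical layout)."""
    filename = Path(urlparse(url).path).name
    return Path(landing_dir) / filename


def download(url: str, landing_dir: str) -> Path:
    """download* step: fetch one source file unless it is already present."""
    dest = target_path(url, landing_dir)
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():
        urlretrieve(url, str(dest))
    return dest


def ingest(csv_path: Path, table_name: str) -> None:
    """ingest* step: load a downloaded CSV file and write it to a Delta table."""
    # Imported lazily so the download step can run without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.csv(str(csv_path), header=True, inferSchema=True)
    # Assumption: each run overwrites the whole table.
    df.write.format("delta").mode("overwrite").saveAsTable(table_name)
```

In this sketch a download script would call `download(...)` for each source URL, and the matching ingest script would call `ingest(...)` on the resulting files. Keeping the two steps separate means ingestion can be re-run without fetching the source files again.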

If you find any bugs or have ideas for improvement, please email them to