Data Ingestion and Engineering

Open-Source Scripts

We believe in transparency and collaboration. That's why many of our data engineering and ingestion scripts are open-source, available under the Apache 2.0 license. You can explore and contribute to these scripts on our GitHub page.

Repository Structure

Our GitHub repositories are organized for easy navigation:

  • Repository Naming: Each repository is named after its data source.
    • Example: The mimi-cdc repository contains scripts for datasets from the Centers for Disease Control and Prevention.
  • Script Types: Within each repository, you'll find two main types of Python/PySpark scripts (a sketch of this pattern follows the list):
    1. download*: These scripts fetch source files from publisher websites or APIs.
    2. ingest*: These scripts load the downloaded files into the corresponding Delta tables in our workspace.
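The actual scripts vary by data source; the sketch below is illustrative only. All file names, URLs, and the table name are hypothetical (only the mimi_ws_1 catalog name comes from this document):

# download_example.py -- sketch of a download* script:
# fetch a source file and save it locally.
import requests

SRC_URL = "https://example.org/data/providers_2024.csv"  # hypothetical URL
DEST = "/tmp/providers_2024.csv"

resp = requests.get(SRC_URL, timeout=60)
resp.raise_for_status()
with open(DEST, "wb") as f:
    f.write(resp.content)

# ingest_example.py -- sketch of an ingest* script:
# load the downloaded file into a Delta table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv(DEST, header=True, inferSchema=True)
df.write.format("delta").mode("overwrite").saveAsTable("mimi_ws_1.example.providers")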

Column Naming Convention

To ensure consistency and readability:

  • We use snake_case for all column names.
    • Example: "Provider Name" becomes "provider_name" (see the short helper below).
  • This makes it easier to work with the data across different programming languages and platforms.
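As an illustration, the conversion takes only a few lines of Python. This helper is hypothetical, not the exact code our ingestion scripts use:

import re

def to_snake_case(name: str) -> str:
    # Lower-case the name, then collapse runs of non-alphanumeric
    # characters into single underscores: "Provider Name" -> "provider_name".
    name = name.strip().lower()
    return re.sub(r"[^0-9a-z]+", "_", name).strip("_")

print(to_snake_case("Provider Name"))  # provider_name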

Data Alignment

Data publishers occasionally change their file formats or column names. To handle this:

  • We align columns across file versions by mapping renamed or reformatted fields to a single, consistent set of column names.
  • Always refer to the column descriptions in the Data Catalog on Databricks for the most up-to-date and accurate information about each column.

Additional Variables

To enhance traceability and version control, we've added three special variables to all tables (an example query follows the list):

  • mimi_src_file_name: The name of the source file.
  • mimi_src_file_date: The date stamp of the source file (its creation date or measurement cut-off date).
  • mimi_dlt_load_date: The date the data was loaded into the lakehouse as a Delta table.
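For example, these variables let you see at a glance which source-file versions a table contains. A sketch in PySpark, assuming a Databricks notebook where spark is predefined and with {schema} and {table} as placeholders:

from pyspark.sql import functions as F

versions = (
    spark.table("mimi_ws_1.{schema}.{table}")
         .groupBy("mimi_src_file_name", "mimi_src_file_date", "mimi_dlt_load_date")
         .agg(F.count("*").alias("row_count"))
         .orderBy("mimi_src_file_date")
)
versions.show(truncate=False)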

For more detailed information about these variables and how they can be used, check the Catalog section in the workspace.

Accessing Data (Basic)

You can access the data using SQL or PySpark. Here's a basic SQL example:

-- Replace the {schema} and {table} placeholders with a schema
-- and table name from the mimi_ws_1 catalog.
SELECT * FROM mimi_ws_1.{schema}.{table} LIMIT 10;
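The equivalent read in PySpark, assuming a Databricks notebook where spark is predefined:

# {schema} and {table} are placeholders, as in the SQL example above.
df = spark.table("mimi_ws_1.{schema}.{table}")
df.show(10)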

Accessing the Latest Data

Often, you'll only need the most recent data. We've made this easy with a utility function:

First, add this macro to your SQL script (if you work in PySpark instead, see the equivalent sketch after the SQL example):

-- Macro: pass the full table path as input;
-- it returns the latest mimi_src_file_date for that table.
CREATE OR REPLACE FUNCTION get_max_date(tablepath2 STRING) RETURNS DATE
RETURN SELECT MAX(max_date) FROM mimi_ws_1.default.tablestats WHERE tablepath = tablepath2;

Then, you can use it to fetch the latest data:

-- Replace the {schema} and {table} placeholders with a schema
-- and table name from the mimi_ws_1 catalog.
SELECT * FROM mimi_ws_1.{schema}.{table} 
WHERE mimi_src_file_date = get_max_date('mimi_ws_1.{schema}.{table}');
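If you work in PySpark, you can get the same result without the SQL macro by computing the latest date from the table itself rather than from the mimi_ws_1.default.tablestats helper table the macro queries. A minimal sketch, with {schema} and {table} as placeholders:

from pyspark.sql import functions as F

table_path = "mimi_ws_1.{schema}.{table}"  # placeholder path

# Find the most recent source-file date in the table.
max_date = (
    spark.table(table_path)
         .agg(F.max("mimi_src_file_date").alias("max_date"))
         .collect()[0]["max_date"]
)

# Keep only the rows from the latest source file.
latest = spark.table(table_path).where(F.col("mimi_src_file_date") == max_date)
latest.show(10)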

This ensures you're always working with the most up-to-date information.

Finding Examples and Getting Help

We want to make your data exploration as smooth as possible. Here are some resources:

  • Databricks Search: Use the search function on Databricks to find relevant notebooks and examples.
  • Mimibot on Slack: Our intelligent bot can answer many of your questions instantly.
  • Public Example Scripts: We've made many example scripts publicly accessible to help you get started quickly.

Feedback and Contributions

Your input is valuable! We're always looking to improve:

  • Found a bug?
  • Have an idea for improvement?
  • Want to contribute to our open-source scripts?

Please email us at info@mimilabs.ai. We're excited to hear from you and work together to make our data tools even better. Happy data exploring!