This technical report explores the use of ChatGPT API to automatically document DBT projects. ChatGPT is a natural language processing API that can generate human-like responses to prompts. We will be using the Obsidian ChatGPT MD plugin for this project, which provides a workflow to develop prompts for our documentation.

Documentation is a crucial task for any data project. However, documentation tends to be time-consuming and tedious, taking away from the time and focus that could be spent on other aspects of the project. In this technical report, we explore the use of ChatGPT API and the ChatGPT MD plugin in Obsidian to provide a no-code solution to automatically document DBT projects. By automating the documentation process, we can save time and resources while still maintaining high-quality documentation for our DBT projects

ChatGPT MD Plugin

The ChatGPT MD plugin allows us to generate documentation by writing prompts directly along with a littlle custom formatting to indicate message types. The plugin then sends the prompts to the ChatGPT API, which generates a response that is inserted into the Markdown document. This allows us to quickly and easily generate documentation without having to spend time writing out each section manually. I mapped the Chat command to ctrl + ⏎ to speed things up.

One fun side effect of this format over using the ChatGPT website is that you can modify the assistant replies in place which can help to fix small mistakes in real time while larger changes can be achieved via further conversation.

Example Usage

Here is an example of how we can use ChatGPT to generate documentation for the tables in the MIMIM-IV ICU database. I chose this as it’s the only open source schema I knew where to find off the top of my head.

A brief explanation of how to interpet the below; the first section surrouded by --- allows you to specify custom args to the OpenAI API. Next, the role::system line lets the plugin know that the following block is a system message and the <hr class="__chatgpt_plugin"> line lets the plugin know the end of the block. This works similarly for role::user and role::assistant.

---
max_tokens: 1000
model: gpt-3.5-turbo
---
role::system

you are a dbt docs generator, you have the following objectives:
- when you are told the tables reply "acknowledge" and nothing else
- you must only output a single table at a time
- your only responses will be in yaml codeblock
- you expect the user to give you table descriptions and you will reply using the templates to follow
- dont try to make up any information
- If extra information is required from the user use `<CLARIFY>` inline to signify more should be added
- If asked to document a database you will use the database template
- if asked to document a table you will use the tables template

## Database template
```yaml
version: 2

sources:
  - name: <string> # required
    description: <markdown_string>
    database: <database_name>
    schema: <schema_name>
    tags: [<string>]
```


## Tables template

```
version: 2

sources:
  - name: <string> # required
    tables:
      - name: <string> #required
        description: <markdown_string>

        columns:
          - name: <column_name> # required
            description: <markdown_string>
            tests:
              - <test>
              - ... # declare additional tests
            tags: [<string>]
          - name: ... # declare properties of additional columns
```

Out of the box, dbt ships with four generic tests already defined: unique, not_null, accepted_values and relationships. Here's a full example using those tests on an orders model:
```
version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'returned']
      - name: customer_id
        tests:
          - relationships:
              to: ref('customers')
              field: id
```

In plain English, these tests translate to:

    unique: the order_id column in the orders model should be unique
    not_null: the order_id column in the orders model should not contain null values
    accepted_values: the status column in the orders should be one of 'placed', 'shipped', 'completed', or 'returned'
    relationships: each customer_id in the orders model exists as an id in the customers table (also known as referential integrity)

<hr class="__chatgpt_plugin">

Next supply the table definitions, I’ve used DDL’s in this case and in all my experimentation but I imagine you should be able to use any format you’d like, even free text perhaps?

role::user
```
DROP TABLE IF EXISTS mimiciv_icu.caregiver;
CREATE TABLE mimiciv_icu.caregiver
(
  caregiver_id INTEGER NOT NULL
);
DROP TABLE IF EXISTS mimiciv_icu.chartevents;
CREATE TABLE mimiciv_icu.chartevents
(
  subject_id INTEGER NOT NULL,
  hadm_id INTEGER NOT NULL,
  stay_id INTEGER NOT NULL,
  caregiver_id INTEGER,
  charttime TIMESTAMP NOT NULL,
  storetime TIMESTAMP,
  itemid INTEGER NOT NULL,
  value VARCHAR(200),
  valuenum FLOAT,
  valueuom VARCHAR(20),
  warning SMALLINT
);
DROP TABLE IF EXISTS mimiciv_icu.datetimeevents;
CREATE TABLE mimiciv_icu.datetimeevents
(
  subject_id INTEGER NOT NULL,
  hadm_id INTEGER NOT NULL,
  stay_id INTEGER NOT NULL,
  caregiver_id INTEGER,
  charttime TIMESTAMP NOT NULL,
  storetime TIMESTAMP,
  itemid INTEGER NOT NULL,
  value TIMESTAMP NOT NULL,
  valueuom VARCHAR(20),
  warning SMALLINT
);
DROP TABLE IF EXISTS mimiciv_icu.d_items;
CREATE TABLE mimiciv_icu.d_items
(
  itemid INTEGER NOT NULL,
  label VARCHAR(100) NOT NULL,
  abbreviation VARCHAR(50) NOT NULL,
  linksto VARCHAR(30) NOT NULL,
  category VARCHAR(50) NOT NULL,
  unitname VARCHAR(50),
  param_type VARCHAR(20) NOT NULL,
  lownormalvalue FLOAT,
  highnormalvalue FLOAT
);
DROP TABLE IF EXISTS mimiciv_icu.icustays;
CREATE TABLE mimiciv_icu.icustays
(
  subject_id INTEGER NOT NULL,
  hadm_id INTEGER NOT NULL,
  stay_id INTEGER NOT NULL,
  first_careunit VARCHAR(255),
  last_careunit VARCHAR(255),
  intime TIMESTAMP,
  outtime TIMESTAMP,
  los FLOAT
);
DROP TABLE IF EXISTS mimiciv_icu.ingredientevents;
CREATE TABLE mimiciv_icu.ingredientevents(
  subject_id INTEGER NOT NULL,
  hadm_id INTEGER NOT NULL,
  stay_id INTEGER,
  caregiver_id INTEGER,
  starttime TIMESTAMP NOT NULL,
  endtime TIMESTAMP NOT NULL,
  storetime TIMESTAMP,
  itemid INTEGER NOT NULL,
  amount FLOAT,
  amountuom VARCHAR(20),
  rate FLOAT,
  rateuom VARCHAR(20),
  orderid INTEGER NOT NULL,
  linkorderid INTEGER,
  statusdescription VARCHAR(20),
  originalamount FLOAT,
  originalrate FLOAT
);
DROP TABLE IF EXISTS mimiciv_icu.inputevents;
CREATE TABLE mimiciv_icu.inputevents
(
  subject_id INTEGER NOT NULL,
  hadm_id INTEGER NOT NULL,
  stay_id INTEGER,
  caregiver_id INTEGER,
  starttime TIMESTAMP NOT NULL,
  endtime TIMESTAMP NOT NULL,
  storetime TIMESTAMP,
  itemid INTEGER NOT NULL,
  amount FLOAT,
  amountuom VARCHAR(20),
  rate FLOAT,
  rateuom VARCHAR(20),
  orderid INTEGER NOT NULL,
  linkorderid INTEGER,
  ordercategoryname VARCHAR(50),
  secondaryordercategoryname VARCHAR(50),
  ordercomponenttypedescription VARCHAR(100),
  ordercategorydescription VARCHAR(30),
  patientweight FLOAT,
  totalamount FLOAT,
  totalamountuom VARCHAR(50),
  isopenbag SMALLINT,
  continueinnextdept SMALLINT,
  statusdescription VARCHAR(20),
  originalamount FLOAT,
  originalrate FLOAT
);
DROP TABLE IF EXISTS mimiciv_icu.outputevents;
CREATE TABLE mimiciv_icu.outputevents
(
  subject_id INTEGER NOT NULL,
  hadm_id INTEGER NOT NULL,
  stay_id INTEGER NOT NULL,
  caregiver_id INTEGER,
  charttime TIMESTAMP NOT NULL,
  storetime TIMESTAMP NOT NULL,
  itemid INTEGER NOT NULL,
  value FLOAT NOT NULL,
  valueuom VARCHAR(20)
);
DROP TABLE IF EXISTS mimiciv_icu.procedureevents;
CREATE TABLE mimiciv_icu.procedureevents
(
  subject_id INTEGER NOT NULL,
  hadm_id INTEGER NOT NULL,
  stay_id INTEGER NOT NULL,
  caregiver_id INTEGER,
  starttime TIMESTAMP NOT NULL,
  endtime TIMESTAMP NOT NULL,
  storetime TIMESTAMP NOT NULL,
  itemid INTEGER NOT NULL,
  value FLOAT,
  valueuom VARCHAR(20),
  location VARCHAR(100),
  locationcategory VARCHAR(50),
  orderid INTEGER,
  linkorderid INTEGER,
  ordercategoryname VARCHAR(50),
  ordercategorydescription VARCHAR(30),
  patientweight FLOAT,
  isopenbag SMALLINT,
  continueinnextdept SMALLINT,
  statusdescription VARCHAR(20),
  originalamount FLOAT,
  originalrate FLOAT
);
```
<hr class="__chatgpt_plugin">
role::assistant

acknowledge
<hr class="__chatgpt_plugin">

Finally request the tables and database one at a time, from there it’s up to the user to stitch this all together at the end…or maybe someone could write a library!

role::user

document the mimiciv_icu database


<hr class="__chatgpt_plugin">

role::assistant

```
version: 2

sources:
  - name: mimiciv_icu
    description: This is the ICU database of MIMIC-IV, containing data from adult ICU stays.
    database: mimiciv_icu
    schema: public
    tags: [ICU, MIMIC-IV]
```


<hr class="__chatgpt_plugin">

role::user

document the outputevents table

<hr class="__chatgpt_plugin">

role::assistant

```
version: 2

sources:
  - name: mimiciv_icu
    tables:
      - name: outputevents
        description: This table contains information about output events (e.g. urine output) for patients in the ICU.
        columns:
          - name: subject_id
            description: The unique identifier for the patient.
            tests:
              - not_null
          - name: hadm_id
            description: The unique identifier for the patient's hospital admission.
            tests:
              - not_null
          - name: stay_id
            description: The unique identifier for the patient's ICU stay.
            tests:
              - not_null
          - name: caregiver_id
/docker-entrypoint.sh: Configuration complete; ready for start up the caregiver involved in the output event.
          - name: charttime
            description: The time at which the output event was charted.
            tests:
              - not_null
          - name: storetime
            description: The time at which the output event was stored.
          - name: itemid
            description: The identifier for the type of output event.
            tests:
              - not_null
          - name: value
            description: The value of the output event.
            tests:
              - not_null
          - name: valueuom
            description: The unit of measurement for the value of the output event.
```

Cool that we even get some tests as well!

Closing thoughts

For MIMIC there is plenty of information on the internet that the model could have memorised in order to make accurate descriptions. (To see what I mean try sending the system message and then the phrase document icu_stays). It would be interesting to see if more obscure schemas can be documented as accurately, I have had some anecdotal success with several applications.

One thing I have not noticed in any of my work on these prompts is any indication that the model would like some clarification. Even after explicity setting a clarify tag this output has not ocurred once for me.

Using ChatGPT API and the ChatGPT MD plugin in Obsidian provides a no-code solution to automatically document DBT projects. This allows us to save time and focus on other aspects of the project while still maintaining high-quality documentation. With all that saidm this is only meant as a first pass and results should be carefully reviewed by humans.