- Published on
Populate Postgres database with fake data
Directly from the database itself
- Authors
- Name
- Nico Prananta
- Follow me on Bluesky
During development and testing, it's common for us to set up the environment where our app runs, along with the necessary data. For instance, if we have a web store app and want to create a feature enabling users to receive notifications when an item they try to add to a cart is sold out, we need to efficiently develop this by using fake data that includes items already sold out in the database. This practice not only aids current development but also prepares us for future bug fixes or improvements related to this feature.
In the past, I used scripts with libraries such as Faker to add fake data to the database. But now, I have found another way to insert data into Postgres using SQL.
Setting the Stage for a Chat App
Consider we're building a chat app capable of handling millions of messages daily. To get started, we set up two tables: one for the chats and another for the individual messages. Here's a quick look at the schema:
create table chats (
id bigserial,
created_at timestamptz not null default now(),
primary key (id)
);
create table chat_messages (
id bigserial,
created_at timestamptz not null,
chat_id bigint not null,
chat_created_at timestamptz not null,
message text not null,
primary key (id),
foreign key (chat_id) references chats (id)
);
This setup focuses on simplicity, sidestepping details like sender or receiver identification for clarity's sake.
Injecting Life with Fake Data
To make our tables more lively, let's start by filling the 'chats' table with a series of timestamps.
INSERT INTO chats (created_at)
SELECT generate_series(
'2023-01-01'::timestamptz,
'2023-01-30 23:00:00'::timestamptz,
interval '1 hour'
);
generate_series(start, stop, step)
is a function that generates a set of values, from start to stop, at the interval specified by step. Here, it's used to create a series of timestamps. Each generated timestamp is inserted into the created_at
column of the chats table. Since id
is a bigserial
, it will automatically increment for each row inserted.
Next, we fill the chat_messages
table with content, ensuring each chat is bustling with activity:
INSERT INTO chat_messages (
created_at,
chat_id,
chat_created_at,
message
)
SELECT
mca,
chats.id,
chats.created_at,
(SELECT ($$[0:2]={'hello','goodbye','I would like a sandwich please'}$$::text[])[trunc(random() * 3)::int])
FROM chats
CROSS JOIN LATERAL (
SELECT generate_series(
chats.created_at,
chats.created_at + interval '1 day',
interval '1 minute'
) AS mca
) b;
This SQL statement inserts data into the chat_messages
table using a combination of a SELECT
statement, a CROSS JOIN LATERAL
, and a subquery to generate the message content. Here's a friendly and simplified explanation of inserting data into the chat_messages table:
Columns to Insert: Specifies
created_at
,chat_id
,chat_created_at
, andmessage
as the columns to receive the inserted data in thechat_messages
table.Selecting Data: The
SELECT
statement is used to generate the data to be inserted into these columns.Generating Message Creation Times: For each row in the
chats
table, thegenerate_series
function is used in conjunction withCROSS JOIN LATERAL
to create a series of timestamps (aliased asmca
for "message creation at"). This series starts at the chat'screated_at
time and ends one day later, with one timestamp generated for every minute within that day.Setting
chat_id
andchat_created_at
: For each generated timestamp, thechats.id
andchats.created_at
are selected to associate each message with its corresponding chat and to denote when the chat was originally created.Generating Random Messages: The
message
column's values are determined by a subquery that selects a random element from a predefined array of texts:'hello'
,'goodbye'
, and'I would like a sandwich please'
. This is accomplished by creating an array of these texts, indexing it with a random number truncated and cast to an integer (trunc(random() * 3)::int
), which effectively selects an index between 0 and 2 (inclusive). The selected text from the array is then used as the message content.The
CROSS JOIN LATERAL
allows for each row in thechats
table to be joined with the result of the lateral subquery (b
). A lateral join enables the subquery to reference columns from the preceding tables in theFROM
clause (chats
in this case). This technique is particularly useful for applying operations that need to use values from each row of thechats
table as input, such as generating a series of timestamps based on each chat'screated_at
time.
Additionally, instead of using random texts, we can explore the possibility of creating more meaningful data using postgresql_faker.
Are you working in a team environment and your pull request process slows your team down? Then you have to grab a copy of my book, Pull Request Best Practices!