Manipulating Tables with Delta Lake 🐟
In this blog post, we’re going to explore how to effectively manage and manipulate tables using Delta Lake. Whether you're new to Delta Lake or need a refresher, this hands-on guide will take you through the essential operations needed to work with Delta tables.
From creating tables to updating and deleting records, we’ve got you covered! So, let’s dive in and get started! 🚀
Learning Objectives 🧩
By the end of this lab, you should be able to execute standard operations to create and manipulate Delta Lake tables, including:
- Creating tables
- Inserting data
- Selecting records
- Updating values
- Deleting rows
- Merging data
- Dropping tables
Setup ⚙️
Before we jump into the fun part, let’s clear out any previous runs of this notebook and set up the necessary environment. Run the script below to reset and prepare everything.
%run ../Includes/Classroom-Setup-2.2L
Create Table ➕
We'll kick things off by creating a Delta Lake table to track our favorite bean collection. The table will include a few basic fields to describe each bean.
| Field Name | Field Type |
|---|---|
| name | STRING |
| color | STRING |
| grams | FLOAT |
| delicious | BOOLEAN |
Let’s go ahead and create the beans table with the following schema:
CREATE TABLE beans
(name STRING, color STRING, grams FLOAT, delicious BOOLEAN)
Note
We'll use Python to run checks occasionally throughout the lab. If you have not followed the instructions, the following cell will return an error with a message describing what needs to change. No output from cell execution means that you have completed this step.
assert spark.table("beans"), "Table named `beans` does not exist"
assert spark.table("beans").columns == ["name", "color", "grams", "delicious"], "Please name the columns in the order provided above"
assert spark.table("beans").dtypes == [("name", "string"), ("color", "string"), ("grams", "float"), ("delicious", "boolean")], "Please make sure the column types are identical to those provided above"
Insert Data 📇
Next, let’s populate the table with some data. The following SQL command will insert three records into our table.
INSERT INTO beans VALUES
("black", "black", 500, true),
("lentils", "brown", 1000, true),
("jelly", "rainbow", 42.5, false)
To make sure that the data was inserted correctly, let’s query the table to review the contents:
SELECT * FROM beans
Now, let’s add a few more records in one transaction:
INSERT INTO beans VALUES
('pinto', 'brown', 1.5, true),
('green', 'green', 178.3, true),
('beanbag chair', 'white', 40000, false)
Verify the data is in the correct state using the cell below:
assert spark.conf.get("spark.databricks.delta.lastCommitVersionInSession") == "2", "Only 3 commits should have been made to the table"
assert spark.table("beans").count() == 6, "The table should have 6 records"
assert set(row["name"] for row in spark.table("beans").select("name").collect()) == {'beanbag chair', 'black', 'green', 'jelly', 'lentils', 'pinto'}, "Make sure you have not modified the data provided"
Update Records 📢
Now, let's update some of our data. A friend pointed out that jelly beans are, in fact, delicious. Let’s update the delicious column for jelly beans to reflect this new information.
UPDATE beans
SET delicious = true
WHERE name = "jelly"
You also realize that the weight for the pinto beans was entered incorrectly. Let’s update the weight to the correct value of 1500 grams.
UPDATE beans
SET grams = 1500
WHERE name = 'pinto'
Ensure everything is updated correctly by running the cell below:
assert spark.table("beans").filter("name='pinto'").count() == 1, "There should only be 1 entry for pinto beans"
row = spark.table("beans").filter("name='pinto'").first()
assert row["color"] == "brown", "The pinto bean should be labeled as the color brown"
assert row["grams"] == 1500, "Make sure you correctly specified the `grams` as 1500"
assert row["delicious"] == True, "The pinto bean is a delicious bean"
Delete Records ❌
Let’s say you’ve decided that only delicious beans are worth tracking. Use the query below to remove any non-delicious beans from the table.
DELETE FROM beans
WHERE delicious = false
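As a general habit (not required for this lab), you can preview exactly which rows a DELETE will touch by running the same predicate as a SELECT before you delete, for example:
SELECT * FROM beans
WHERE delicious = false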
Run the following cell to confirm this operation was successful.
assert spark.table("beans").filter("delicious=true").count() == 5, "There should be 5 delicious beans in your table"
assert spark.table("beans").filter("delicious=false").count() == 0, "There should be 0 delicious beans in your table"
assert spark.table("beans").filter("name='beanbag chair'").count() == 0, "Make sure your logic deletes non-delicious beans"
Merge Records ⛙
Your friend brought some new beans! We’ll register these new beans as a temporary view and merge them with our existing table.
CREATE OR REPLACE TEMP VIEW new_beans(name, color, grams, delicious) AS VALUES
('black', 'black', 60.5, true),
('lentils', 'green', 500, true),
('kidney', 'red', 387.2, true),
('castor', 'brown', 25, false);
SELECT * FROM new_beans
In the cell below, use the above view to write a merge statement that updates and inserts new records into your beans table as one transaction.
Make sure your logic:
- Matches beans by name and color
- Updates existing beans by adding the new weight to the existing weight
- Inserts new beans only if they are delicious
MERGE INTO beans a
USING new_beans b
ON a.name = b.name AND a.color = b.color
WHEN MATCHED THEN
  UPDATE SET grams = a.grams + b.grams
WHEN NOT MATCHED AND b.delicious = true THEN
  INSERT *
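Because MERGE completes as a single transaction, it appears as one new version in the table history. If you're curious, you can inspect that history directly in SQL (optional; the graded check below reads the same history from Python):
DESCRIBE HISTORY beans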
Check your work by running the following:
version = spark.sql("DESCRIBE HISTORY beans").selectExpr("max(version)").first()[0]
last_tx = spark.sql("DESCRIBE HISTORY beans").filter(f"version={version}")
assert last_tx.select("operation").first()[0] == "MERGE", "Transaction should be completed as a merge"
metrics = last_tx.select("operationMetrics").first()[0]
assert metrics["numOutputRows"] == "3", "Make sure you only insert delicious beans"
assert metrics["numTargetRowsUpdated"] == "1", "Make sure you match on name and color"
assert metrics["numTargetRowsInserted"] == "2", "Make sure you insert newly collected beans"
assert metrics["numTargetRowsDeleted"] == "0", "No rows should be deleted by this operation"
Dropping Tables 📍
Finally, when you're done with a managed Delta Lake table, you can drop it, which permanently deletes the table and its underlying data. Let’s write a query to drop the beans table.
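Before you drop it, if you're curious where the managed table's data files actually live, DESCRIBE DETAIL reports the table's storage location and file count (an optional peek, not graded):
DESCRIBE DETAIL beans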
DROP TABLE beans
Run the following cell to confirm the table is gone:
assert spark.sql("SHOW TABLES LIKE 'beans'").collect() == [], "Confirm that you have dropped the `beans` table from your current database"
Final Thoughts 🤔
Working with Delta Lake tables provides immense flexibility and control when managing data, and mastering these basic operations can significantly boost your productivity.
From creating tables to merging data, these skills form the foundation of efficient data manipulation. Keep practicing, and soon, managing Delta Lake tables will feel like second nature!