Compare two versions of a Delta table

Use time travel to compare two versions of a Delta table.

Written by mathan.pillai

Last published at: May 10th, 2022

Delta Lake supports time travel, which allows you to query an older snapshot of a Delta table.

One common use case is to compare two versions of a Delta table in order to identify what changed.

For more details on time travel, please review the Delta Lake time travel documentation (AWS | Azure | GCP).

Identify all differences

You can use a SQL SELECT query to identify all differences between two versions of a Delta table.

You need to know the name of the table and the version numbers of the snapshots you want to compare.

%sql

select * from <table-name>@v<version-number>
except all
select * from
<table-name>@v<version-number>

For example, if you had a table named “schedule” and you wanted to compare version 2 with the original version, your query would look like this:

%sql

select * from schedule@v2
except all
select * from
schedule@v0

Identify files added to a specific version

You can use a Scala query to retrieve a list of files that were added to a specific version of the Delta table.

%scala

display(spark.read.json("dbfs:/<path-to-delta-table>/_delta_log/00000000000000000002.json").where("add is not null").select("add.path"))

In this example, we are getting a list of all files that were added to version 2 of the Delta table.

00000000000000000002.json contains the list of all files in version 2.

After reading in the full list, we are excluding files that already existed, so the displayed list only includes files added to version 2.


Was this article helpful?