Delta Lake by Bennie Haelen
Author: Bennie Haelen
Language: eng
Format: epub
Publisher: O'Reilly Media
Published: 2023-10-23T00:00:00+00:00
Time Travel Under the Hood
A Delta table can keep a version history because the transaction log tracks which files should and should not be read for each operation on the table. The DESCRIBE HISTORY command also returns operationMetrics, which reports the number of files added and removed by each operation. When you perform an UPDATE, DELETE, or MERGE on a table, the affected data is not physically removed from the underlying storage. Rather, these operations update the transaction log to indicate which files should or should not be read. Similarly, restoring a table to a previous version does not physically add or remove data; it only updates the metadata in the transaction log to tell readers which files to read.
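The idea of logical removal can be illustrated with a minimal sketch in plain Python. It assumes a heavily simplified transaction log in which each commit is a list of `add` and `remove` actions keyed by file path. The file names, the `files_at_version` helper, and the `count_file_changes` helper are hypothetical and exist only for illustration; real commits carry many more fields, and this is not how Delta Lake itself is implemented.

```python
# Hypothetical, simplified commits: one list of actions per table version.
commits = [
    # version 0: initial write
    [{"add": {"path": "part-0000.parquet"}}],
    # version 1: append
    [{"add": {"path": "part-0001.parquet"}}],
    # version 2: an UPDATE rewrites one file; the old file is only marked removed
    [{"remove": {"path": "part-0000.parquet"}},
     {"add": {"path": "part-0002.parquet"}}],
]


def files_at_version(commits, version):
    """Replay commits 0..version and return the files a reader should scan."""
    active = set()
    for actions in commits[: version + 1]:
        for action in actions:
            if "add" in action:
                active.add(action["add"]["path"])
            elif "remove" in action:
                # Logical removal: the file leaves the active set,
                # but nothing is deleted from storage.
                active.discard(action["remove"]["path"])
    return active


def count_file_changes(actions):
    """Count added and removed files in one commit (illustrative metric only)."""
    return {
        "added": sum(1 for a in actions if "add" in a),
        "removed": sum(1 for a in actions if "remove" in a),
    }


print(sorted(files_at_version(commits, 2)))  # current state
print(sorted(files_at_version(commits, 0)))  # time travel to version 0
print(count_file_changes(commits[2]))
```

Because `part-0000.parquet` is dropped only from the active set, replaying the log up to version 0 still finds it, which is what makes reading an older version possible in this model.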
In Chapter 2 you learned about the JSON files within the _delta_log directory and about checkpoint files. Checkpoint files save the state of the entire table at a point in time, and are automatically generated to maintain read performance by combining JSON commits into Parquet files. To get the current state of the table, or a previous state in the case of time travel, a reader can load a checkpoint file and then apply only the commits that follow it, avoiding the need to list and reprocess all of the commits.
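As a rough sketch of why checkpoints help, the following plain-Python example reconstructs a table's state from the nearest earlier checkpoint plus the commits after it, rather than from version 0. Everything here is an assumption made for illustration: the log is an in-memory dictionary, the checkpoint is a plain set of file paths rather than a Parquet file, and the names `apply_actions`, `full_replay`, and `state_at` are hypothetical.

```python
# Hypothetical, simplified log: version -> list of actions.
commits = {
    0: [{"add": {"path": "a.parquet"}}],
    1: [{"add": {"path": "b.parquet"}}],
    2: [{"remove": {"path": "a.parquet"}}, {"add": {"path": "c.parquet"}}],
    3: [{"add": {"path": "d.parquet"}}],
    4: [{"remove": {"path": "b.parquet"}}],
}


def apply_actions(active, actions):
    """Apply one commit's add/remove actions to the set of active files."""
    for action in actions:
        if "add" in action:
            active.add(action["add"]["path"])
        elif "remove" in action:
            active.discard(action["remove"]["path"])


def full_replay(version):
    """Rebuild the state by replaying every commit from version 0."""
    active = set()
    for v in range(version + 1):
        apply_actions(active, commits[v])
    return active


# Assume a checkpoint was written at version 2 (a snapshot of the table state).
checkpoints = {2: full_replay(2)}


def state_at(version):
    """Start from the latest checkpoint at or before `version`, then replay the rest."""
    usable = [v for v in checkpoints if v <= version]
    if usable:
        start = max(usable)
        active = set(checkpoints[start])  # copy so the snapshot is not mutated
        first = start + 1
    else:
        active, first = set(), 0
    for v in range(first, version + 1):
        apply_actions(active, commits[v])
    return active


print(sorted(state_at(4)))  # uses the checkpoint, replays versions 3 and 4
print(sorted(state_at(1)))  # no earlier checkpoint, replays versions 0 and 1
```

In this sketch, both paths yield the same state as a full replay; the checkpoint only shortens the number of commits that must be read.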
The transaction log commits, the checkpoint files, and the fact that data files are only logically removed rather than physically removed are the foundation for how Delta Lake enables time travel on your Delta table. Figure 6-3 shows the transaction log entries for each of the operations on the taxidb.tripData table throughout the different transactions and versions.
