Releases: delta-io/delta

Delta Lake 2.1.0

18 Aug 00:26

We are excited to announce the release of Delta Lake 2.1.0 on Apache Spark 3.3. Similar to Apache Spark™, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.

The key features in this release are as follows:

  • Support for Apache Spark 3.3.
  • Support for [TIMESTAMP | VERSION] AS OF in SQL. With Spark 3.3, Delta now supports time travel in SQL to query older data easily. With this update, time travel is now available both in SQL and through the DataFrame API.
  • Support for Trigger.AvailableNow when streaming from a Delta table. Spark 3.3 introduces Trigger.AvailableNow for running streaming queries like Trigger.Once in multiple batches. This is now supported when using Delta tables as a streaming source.
  • Support for SHOW COLUMNS to return the list of columns in a table.
  • Support for DESCRIBE DETAIL in the Scala and Python DeltaTable API. Retrieve detailed information about a Delta table using the DeltaTable API and in SQL.
  • Support for returning operation metrics from SQL Delete, Merge, and Update commands. Previously these SQL commands returned an empty DataFrame; they now return a DataFrame with useful metrics about the operation performed.
  • Optimize performance improvements
    • Added a config to use repartition(1) instead of coalesce(1) in Optimize for better performance when compacting many small files.
    • Improve Optimize performance by using a queue-based approach to parallelize the compaction jobs.
  • Other notable changes
    • Support for using variables in the VACUUM and OPTIMIZE SQL commands.
    • Improvements for CONVERT TO DELTA with catalog tables.
      • Autofill the partition schema from the catalog when it’s not provided.
      • Use partition information from the catalog to find the data files to commit instead of doing a full directory scan. Instead of committing all data files in the table directory, only data files under the directories of active partitions will be committed.
    • Support for Change Data Feed (CDF) batch reads on column mapping enabled tables when DROP COLUMN and RENAME COLUMN have not been used. See the documentation for more details.
    • Improve Update performance by enabling schema pruning in the first pass.
    • Fix for DeltaTableBuilder to preserve table property case of non-delta properties when setting properties.
    • Fix for duplicate CDF row output for delete-when-matched merges with multiple matches.
    • Fix for consistent timestamps in a MERGE command.
    • Fix for incorrect operation metrics for DataFrame writes with a replaceWhere option.
    • Fix for a bug in Merge that sometimes caused empty files to be committed to the table.
    • Change in log4j properties file format. Apache Spark upgraded the log4j version from 1.x to 2.x which has a different format for the log4j file. Refer to the Spark upgrade notes.
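
The time travel bullet above can be illustrated with a small pure-Python model. In SQL the feature is written along the lines of SELECT * FROM events VERSION AS OF 1 (the table name events is hypothetical). The sketch below is not Delta's implementation: it assumes a toy commit history and assumes that TIMESTAMP AS OF resolves to the latest version committed at or before the given timestamp. It does not model any validation of timestamps later than the last commit.

```python
from bisect import bisect_right

# Toy commit history: (version, commit time in epoch seconds), sorted by time.
# Hypothetical data; a real table derives this from its transaction log.
COMMITS = [(0, 1000), (1, 2000), (2, 3000)]


def resolve_version_as_of(version):
    """Return the requested version if it exists in the toy history."""
    if version not in [v for v, _ in COMMITS]:
        raise ValueError(f"version {version} not found")
    return version


def resolve_timestamp_as_of(ts):
    """Return the latest version committed at or before ts (assumed rule)."""
    timestamps = [t for _, t in COMMITS]
    idx = bisect_right(timestamps, ts)
    if idx == 0:
        raise ValueError("timestamp is before the earliest commit")
    return COMMITS[idx - 1][0]
```

For example, in this toy history resolve_timestamp_as_of(2500) selects version 1, because version 2 was committed later.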

Benchmark framework update

We have improved the benchmark framework (initial version added in version 1.2.0), including support for benchmarking arbitrary functions and not just SQL queries. We’ve also added Terraform scripts to automatically generate the infrastructure to run benchmarks on AWS and GCP.

Credits

Adam Binford, Allison Portis, Andreas Chatzistergiou, Andrew Vine, Andy Lam, Carlos Peña, Chang Yong Lik, Christos Stavrakakis, David Lewis, Denis Krivenko, Denny Lee, EJ Song, Edmondo Porcu, Felipe Pessoto, Fred Liu, Fu Chen, Grzegorz Kołakowski, Hedi Bejaoui, Hussein Nagree, Ionut Boicu, Ivan Sadikov, Jackie Zhang, Jiawei Bao, Jintao Shen, Jintian Liang, Jonas Irgens Kylling, Juliusz Sompolski, Junlin Zeng, KaiFei Yi, Kam Cheung Ting, Karen Feng, Koert Kuipers, Lars Kroll, Lin Zhou, Lukas Rupprecht, Max Gekk, Min Yang, Ming DAI, Nick, Ole Sasse, Prakhar Jain, Rahul Shivu Mahadev, Rajesh Parangi, Rui Wang, Ryan Johnson, Sabir Akhadov, Scott Sandre, Serge Rielau, Shixiong Zhu, Tathagata Das, Terry Kim, Thomas Newton, Tom van Bussel, Tyson Condie, Venki Korukanti, Vini Jaiswal, Will Jones, Xi Liang, Yijia Cui, Yousry Mohamed, Zach Schuermann, sherlockbeard, yikf

Delta Lake 2.0.0

20 Jul 19:39

We are excited to announce the release of Delta Lake 2.0.0 on Apache Spark 3.2.

The key features in this release are as follows.

  • Support Change Data Feed on Delta tables. Change Data Feed represents the row level changes between different versions of the table. When enabled, additional information is recorded regarding row level changes for every write operation on the table. See the documentation for more details.

  • Support Z-Order clustering of data to reduce the amount of data read. Z-Ordering is a technique to colocate related information in the same set of files. This data clustering allows column stats (released in Delta 1.2) to be more effective in skipping data based on filters in a query. See the documentation for more details.

  • Support for idempotent writes to Delta tables to enable fault-tolerant retry of Delta table writing jobs without writing the data multiple times to the table. See the documentation for more details.

  • Support for dropping columns in a Delta table as a metadata change operation. This command drops the column from metadata and not the column data in underlying files. See the documentation for more details.

  • Support for dynamic partition overwrite. Overwrite only the partitions with data written into them at runtime. See the documentation for details.

  • Experimental support for multi-part checkpoints, which split the Delta Lake checkpoint into multiple parts to speed up writing and reading checkpoints. See the documentation for more details.

  • Python and Scala API support for OPTIMIZE file compaction and Z-order by.

  • Other notable changes

    • Improve data skipping on generated columns by adding support for generated columns derived from nested columns.
    • Improve the table schema validation by blocking the unsupported data types in Delta Lake.
    • Support creating a Delta Lake table with an empty schema.
    • Change the behavior of DROP CONSTRAINT to throw an error when the constraint does not exist. Before this version the command used to return silently.
    • Fix the symlink manifest generation when partition values contain spaces.
    • Fix an issue where incorrect commit stats are collected.
    • Support for SimpleAWSCredentialsProvider or TemporaryAWSCredentialsProvider in the LogStore that supports S3 multi-cluster writes.
    • Fix an issue in generated columns that would not allow null columns in the insert DataFrame to be written even if the column was nullable.
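
The idempotent-writes bullet above can be sketched as a toy model. This is a minimal illustration, not Delta's implementation: it assumes each write carries an application id and a monotonically increasing version, and that the table skips any write whose version is not newer than the last one recorded for that application. The class ToyTable and the parameter names txn_app_id and txn_version are illustrative.

```python
class ToyTable:
    """In-memory stand-in for a table that deduplicates retried writes."""

    def __init__(self):
        self.rows = []
        # Highest committed transaction version per application id.
        self.last_txn = {}

    def write(self, rows, txn_app_id, txn_version):
        """Append rows unless this (app id, version) was already committed.

        Returns True if the rows were written, False if the write was skipped.
        """
        last = self.last_txn.get(txn_app_id)
        if last is not None and txn_version <= last:
            return False  # retry of an already committed write
        self.rows.extend(rows)
        self.last_txn[txn_app_id] = txn_version
        return True
```

With this model, a job that fails after committing and is then retried with the same id and version leaves the table unchanged on the second attempt.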

Benchmark Framework Update

Independent of this release, we have improved the framework for writing large-scale performance benchmarks (initial version added in version 1.2.0). We have added support for running benchmarks on Google Cloud Platform using Google Dataproc (in addition to the existing support for EMR on AWS).

Credits

Adam Binford, Alkis Evlogimenos, Allison Portis, Ankur Dave, Bingkun Pan, Burak Yilmaz, Chang Yong Lik, Chen Qingzhi, Denny Lee, Eric Chang, Felipe Pessoto, Fred Liu, Fu Chen, Gaurav Rupnar, Grzegorz Kołakowski, Hussein Nagree, Jacek Laskowski, Jackie Zhang, Jiaan Geng, Jintao Shen, Jintian Liang, John O'Dwyer, Junyong Lee, Kam Cheung Ting, Karen Feng, Koert Kuipers, Lars Kroll, Liwen Sun, Lukas Rupprecht, Max Gekk, Michael Mengarelli, Min Yang, Naga Raju Bhanoori, Nick Grigoriev, Nick Karpov, Ole Sasse, Patrick Grandjean, Peng Zhong, Prakhar Jain, Rahul Shivu Mahadev, Rajesh Parangi, Ruslan Dautkhanov, Sabir Akhadov, Scott Sandre, Serge Rielau, Shixiong Zhu, Shoumik Palkar, Tathagata Das, Terry Kim, Tyson Condie, Venki Korukanti, Vini Jaiswal, Wenchen Fan, Xinyi, Yijia Cui, Yousry Mohamed

Delta Lake 1.2.1

27 Apr 21:47

We are excited to announce the release of Delta Lake 1.2.1 on Apache Spark 3.2. Similar to Apache Spark™, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.

Key features in this release

  • Fix an issue with loading error messages in --packages mode. The previous release had a bug that caused users to get a NullPointerException instead of the proper error message when using Delta Lake in --packages mode in either pyspark or spark-shell (Fix, Test).
  • Fix incorrect exception type thrown in some Python APIs. A bug caused pyspark to throw exceptions of an incorrect type instead of the expected AnalysisException. This issue is fixed. See issue #1086 for more details.
  • Fix for S3 multi-cluster mode configuration. A bug in the S3 multi-cluster mode caused --conf to not work for certain configuration parameters. This issue is fixed by having these configuration parameters begin with the "spark." prefix. See the updated documentation.
  • Make the GCS LogStore configuration simpler by automatically deriving the LogStore implementation class config spark.delta.logStore.gs.impl from the scheme in the table path. See the updated documentation.
  • Make SetAccumulator thread safe. SetAccumulator used by Merge was not thread safe and might cause executor heartbeat failures in rare cases. This was fixed by using a synchronized set.
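
The SetAccumulator fix above relies on a synchronized set. The actual change lives in Delta's JVM code; the following is only a Python analogue of the idea, where a lock guards every access to a shared set. The names SynchronizedSet and fill_concurrently are hypothetical.

```python
import threading


class SynchronizedSet:
    """A set whose operations are serialized by a lock."""

    def __init__(self):
        self._items = set()
        self._lock = threading.Lock()

    def add(self, item):
        with self._lock:
            self._items.add(item)

    def snapshot(self):
        """Return an immutable copy taken under the lock."""
        with self._lock:
            return frozenset(self._items)


def fill_concurrently(n_threads=4, per_thread=1000):
    """Add disjoint ranges of integers from several threads."""
    acc = SynchronizedSet()

    def worker(offset):
        for i in range(per_thread):
            acc.add(offset * per_thread + i)

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return acc.snapshot()
```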

Credits

Allison Portis, Chang Yong Lik, Kam Cheung Ting, Rahul Mahadev, Scott Sandre, Venki Korukanti

Delta Lake 1.2.0

13 Apr 21:05

We are excited to announce the release of Delta Lake 1.2.0 on Apache Spark 3.2. Similar to Apache Spark™, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.

Key features in this release

  • Support multi-cluster write in Delta Lake tables stored in S3. Users now have the option of specifying a new and experimental LogStore implementation that supports concurrent reads and writes to a single Delta Lake table in S3 from multiple Spark drivers. See the documentation for more details.

  • Support for compacting small files (optimize) into larger files in a Delta Lake table. Reduced number of data files improves read latency due to reduced metadata size and per-file overheads such as file-open overhead and file-close overhead. See the documentation for more details.

  • Support for data skipping using column statistics. Column statistics are collected for each file as part of the Delta Lake table writes. These statistics can be used during the reading of a Delta Lake table to skip reading files not matching the filters in the query. See the documentation for more details.

  • Support for restoring a Delta table to an earlier version. Restoring to an earlier version number or a version of a specific timestamp is supported using the SQL command, Scala APIs or Python APIs. See the documentation for more details.

  • Support for column renaming in a Delta Lake table without the need to rewrite the underlying Parquet data files. See the documentation for more details.

  • Support for arbitrary characters in column names in Delta tables. Previously, the supported characters were limited to those supported by the Parquet data format. Column names containing special characters such as space, tab, comma, {, and ( are now supported. See the documentation for more details.

  • Support for automatic data skipping using generated columns. For any partition column that is a generated column, partition filters will be automatically generated from any data filters on its generating column(s), when possible.

  • Support for Google Cloud Storage is now generally available. See the documentation on how to read and write Delta Lake tables in Google Cloud Storage.

  • Other notable changes

    • Create a new module delta-storage. This extracts out the LogStore interface and implementations in a separate module which is published as its own jar. This enables new implementations of LogStore without depending upon the complete Delta jars. See the migration guide for more details.
    • Improve the error messages and exceptions to be better organized and queryable.
    • Support for gettimestamp expression in generated columns.
    • Snapshot/Checkpoint management improvements
      • Make loading snapshots resilient to corrupt checkpoints in Delta. When reading a checkpoint fails, we try to search for an alternative checkpoint and use it to construct a snapshot.
      • Fix to snapshot writing to not fail the write when a checkpoint fails due to non-fatal errors.
      • Optimization to reduce the number of list calls to storage
    • Improved output metrics for DELETE table command.
    • Improved output metrics for UPDATE table command.
    • Optimize merge operation in a Delta table with a large number of columns.
    • Fix a NullPointerException when trying to reference a DeltaLog created with a SparkContext that has stopped.
    • Fix an issue in handling null partition column values in the change data capture feature.
    • Fix an issue in adding a new column to the Delta table when the preceding column is of type Array.
    • Fix an issue where we are not closing the file list iterator when reading large log files in the Delta Streaming source.
    • Throw proper exceptions when searching for a Delta table in the catalog.
    • Fix a schema evolution issue when the column type is an array of structs.
    • Better handling of FileNotFoundException when reading Delta log files to distinguish between the corrupt log files and no files found.
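
The data-skipping bullet above describes using per-file column statistics to avoid reading files that cannot match a filter. A minimal sketch of that idea, assuming each file records only the minimum and maximum of one integer column, follows. FileStats, files_to_read, and the file names are hypothetical and do not reflect Delta's internal data structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    path: str
    min_value: int
    max_value: int


def files_to_read(files, lower, upper):
    """Keep files whose [min, max] range overlaps the filter range [lower, upper]."""
    return [f.path for f in files if f.max_value >= lower and f.min_value <= upper]


# Hypothetical statistics collected at write time.
FILES = [
    FileStats("part-0", 0, 99),
    FileStats("part-1", 100, 199),
    FileStats("part-2", 200, 299),
]
```

For a filter such as value BETWEEN 150 AND 210, only part-1 and part-2 overlap the range, so part-0 is skipped without being opened.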

Benchmark Framework

Independent of this release, we have also built a framework for writing large scale performance benchmarks on Delta tables using a real cluster. Currently, the framework provides a TPC-DS inspired benchmark to measure the ingestion time (e.g. time taken to create TPC-DS tables) and query times. But we encourage the community to contribute more benchmarks to measure performance of different real-world workloads on Delta tables.

Credits

Adam Binford, Alex Liu, Allison Portis, Anton Okolnychyi, Bart Samwel, Carmen Kwan, Chang Yong Lik, Christian Williams, Christos Stavrakakis, David Lewis, Denny Lee, Fabio Badalì, Fred Liu, Gengliang Wang, Hoang Pham, Hussein Nagree, Hyukjin Kwon, Jackie Zhang, Jan Paw, John ODwyer, Junlin Zeng, Jackie Zhang, Junyong Lee, Kam Cheung Ting, Kapil Sreedharan, Lars Kroll, Liwen Sun, Maksym Dovhal, Mariusz Krynski, Meng Tong, Peng Zhong, Prakhar Jain, Pranav, Ryan Johnson, Sabir Akhadov, Scott Sandre, Shixiong Zhu, Sri Tikkireddy, Tathagata Das, Tyson Condie, Vegard Stikbakke, Venkata Sai Akhil Gudesa, Venki Korukanti, Vini Jaiswal, Wenchen Fan, Will Jones, Xinyi Yu, Yann Byron, Yaohua Zhao, Yijia Cui

Delta Lake 1.0.1

10 Feb 21:39

We are excited to announce the release of Delta Lake 1.0.1 on Apache Spark™ 3.1, which back-ports bug fixes from Delta Lake 1.1.0 to Delta Lake 1.0.0.

The details of the fixed bugs are as follows:

  • Fix for rare data corruption issue on GCS - Experimental GCS support released in Delta Lake 1.0 has a rare bug that can lead to Delta tables being unreadable due to partially written transaction log files. This issue has now been fixed (1, 2).

  • Fix for the incorrect return object in Python DeltaTable.convertToDelta() - This existing API now returns the correct Python object of type delta.tables.DeltaTable instead of an incorrectly-typed, and therefore unusable object.

  • Fix for incorrect handling of special characters (e.g. spaces) in paths by MERGE/UPDATE/DELETE operations

  • Fix for Hadoop configurations not being used to write checkpoints

  • Improvements to DeltaTableBuilder API introduced in Delta 1.0.0

    • Fix for bug that prevented passing of multiple partition columns in Python DeltaTableBuilder.partitionBy.
    • Throw error when column data type is not specified.

Credits
Jarred Parrett, Shixiong Zhu, Tathagata Das, Tom Lynch, Yijia Cui, Yaohua Zhao, gurunath

Delta Lake 1.1.0

03 Dec 20:53

We are excited to announce the release of Delta Lake 1.1.0 on Apache Spark 3.2. Similar to Apache Spark™, we have released Maven artifacts for both Scala 2.12 and Scala 2.13. The key features in this release are as follows.

  • Performance improvements in MERGE operation - On partitioned tables, MERGE operations will automatically repartition the output data before writing to files. This ensures better performance out-of-the-box for both the MERGE operation as well as subsequent read operations.

  • Support for passing Hadoop configurations via DataFrameReader/Writer options - You can now set Hadoop FileSystem configurations (e.g., access credentials) via DataFrameReader/Writer options. Earlier the only way to pass such configurations was to set Spark session configuration which would set them to the same value for all reads and writes. Now you can set them to different values for each read and write. See the documentation for more details.

  • Support for arbitrary expressions in replaceWhere DataFrameWriter option - Instead of expressions only on partition columns, you can now use arbitrary expressions in the replaceWhere DataFrameWriter option. That is, you can replace arbitrary data in a table directly with DataFrame writes. See the documentation for more details.

  • Improvements to nested field resolution and schema evolution in MERGE operation on array of structs - When applying the MERGE operation on a target table having a column typed as an array of nested structs, the nested columns between the source and target data are now resolved by name and not by position in the struct. This ensures structs in arrays have a consistent behavior with structs outside arrays. When automatic schema evolution is enabled for MERGE, nested columns in structs in arrays will follow the same evolution rules (e.g., column added if no column by the same name exists in the table) as columns in structs outside arrays. See the documentation for more details.

  • Support for Generated Columns in MERGE operation - You can now apply MERGE operations on tables having Generated Columns.

  • Fix for rare data corruption issue on GCS - Experimental GCS support released in Delta Lake 1.0 has a rare bug that can lead to Delta tables being unreadable due to partially written transaction log files. This issue has now been fixed (1, 2).

  • Fix for the incorrect return object in Python DeltaTable.convertToDelta() - This existing API now returns the correct Python object of type delta.tables.DeltaTable instead of an incorrectly-typed, and therefore unusable object.

  • Python type annotations - We have added Python type annotations which improve auto-completion performance in editors which support type hints. Optionally, you can enable static checking through mypy or built-in tools (for example PyCharm tools).

  • Other notable changes

    • Removed support to read tables with certain special characters in partition column name. See migration guide for details.
    • Support for “delta.`path`” in DeltaTable.forName() for consistency with other APIs
    • Improvements to DeltaTableBuilder API introduced in Delta 1.0.0
      • Fix for bug that prevented passing of multiple partition columns in Python DeltaTableBuilder.partitionBy.
      • Throw error when column data type is not specified.
    • Improved support for MERGE/UPDATE/DELETE on temp views.
    • Support for setting userMetadata in the commit information when creating or replacing tables.
    • Fix for an incorrect analysis exception in MERGE with multiple INSERT and UPDATE clauses and automatic schema evolution enabled.
    • Fix for incorrect handling of special characters (e.g. spaces) in paths by MERGE/UPDATE/DELETE operations.
    • Fix to prevent Vacuum parallel mode from being affected by Adaptive Query Execution, which is enabled by default in Apache Spark 3.2.
    • Fix for earliest valid time travel version.
    • Fix for Hadoop configurations not being used to write checkpoints.
    • Multiple fixes (1, 2, 3) to Delta Constraints.
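
The replaceWhere bullet above can be modeled in a few lines. This is a toy sketch over lists of dictionaries, not the DataFrameWriter API: rows matching the predicate are replaced by the new rows. The check that every new row satisfies the predicate is an assumption about the intended semantics, and the function name replace_where is hypothetical.

```python
def replace_where(table_rows, predicate, new_rows):
    """Replace the rows matching predicate with new_rows.

    Assumption: new rows must themselves satisfy the predicate, so the
    replacement never writes data outside the region being replaced.
    """
    new_rows = list(new_rows)
    outside = [r for r in new_rows if not predicate(r)]
    if outside:
        raise ValueError(f"{len(outside)} new row(s) do not match the predicate")
    kept = [r for r in table_rows if not predicate(r)]
    return kept + new_rows
```

The predicate can reference any column, for example lambda r: r["amount"] > 100, which mirrors the move from partition-only expressions to arbitrary expressions.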

Credits
Abhishek Somani, Adam Binford, Alex Jing, Alexandre Lopes, Allison Portis, Bogdan Raducanu, Bart Samwel, Burak Yavuz, David Lewis, Eunjin Song, Feng Zhu, Flavio Cruz, Florian Valeye, Fred Liu, Guy Khazma, Jacek Laskowski, Jackie Zhang, Jarred Parrett, JassAbidi, Jose Torres, Junlin Zeng, Junyong Lee, KamCheung Ting, Karen Feng, Lars Kroll, Li Zhang, Linhong Liu, Liwen Sun, Maciej, Max Gekk, Meng Tong, Prakhar Jain, Pranav Anand, Rahul Mahadev, Ryan Johnson, Sabir Akhadov, Scott Sandre, Shixiong Zhu, Shuting Zhang, Tathagata Das, Terry Kim, Tom Lynch, Vijayan Prabhakaran, Vítor Mussa, Wenchen Fan, Yaohua Zhao, Yijia Cui, YuXuan Tay, Yuchen Huo, Yuhong Chen, Yuming Wang, Yuyuan Tang, Zach Schuermann, ericfchang, gurunath

Delta Lake 1.0.0

24 May 23:22

We are excited to announce the release of Delta Lake 1.0.0 on Apache Spark 3.1. The key features in this release are as follows.

  • Unlimited MATCHED and NOT MATCHED clauses for merge operations in SQL - With the upgrade to Apache Spark 3.1, the MERGE SQL command now supports any number of WHEN MATCHED and WHEN NOT MATCHED clauses (the Scala, Java and Python APIs have supported unlimited clauses since 0.8.0 on Spark 3.0). See the documentation on MERGE for more details.

  • New programmatic APIs to create tables - Delta Lake now allows you to directly create new Delta tables programmatically (Scala, Java, and Python) without using DataFrame APIs. We have introduced new DeltaTableBuilder and DeltaColumnBuilder APIs to specify all the table details that you can specify through SQL CREATE TABLE. See the documentation for details and examples.

  • Experimental support for Generated Columns - Delta Lake now supports Generated Columns which are a special type of columns whose values are automatically generated based on a user-specified function over other columns in the Delta table. You can use most built-in SQL functions in Apache Spark to generate the values of these generated columns. For example, you can automatically generate a date column (for partitioning the table by date) from the timestamp column; any writes into the table need only specify the data for the timestamp column. You can create Delta tables with Generated Columns using the new programmatic APIs to create tables. See the documentation for details.

  • Simplified storage configuration - Delta Lake can now automatically load the correct LogStore needed for common storage systems hosting the Delta table being read or written to. Users no longer need to explicitly configure the LogStore implementation if they are running Delta Lake on AWS S3, Azure blob stores, and HDFS. This also allows the same application to simultaneously read and write to Delta tables on different cloud storage systems. The scheme of the Delta table path is used to dynamically load the necessary LogStore implementation. Using storage systems other than the ones listed above still needs explicit configuration. See the documentation on storage configuration for details.

  • Experimental support for additional cloud storage systems - Delta Lake now has experimental support for Google Cloud Storage, Oracle Cloud Storage, and IBM Cloud Object Storage. You will have to add an additional Maven artifact delta-contribs to access the LogStores corresponding to them, and explicitly configure the LogStore names corresponding to the relevant path schemes. See the documentation on storage configuration for details. In addition, we have also defined a more stable LogStore API for building custom implementations.

  • Public APIs for catching exceptions due to conflicts - The exceptions thrown on conflict between concurrent operations have now been converted to public APIs. This allows you to catch those exceptions and retry your write operations. See the API documentation for details.

  • PyPI release - Delta Lake can now be installed from PyPI with pip install delta-spark. However, along with pip installation, you also have to configure the SparkSession. See the documentation for details.

  • Other notable changes

    • New Maven artifact delta-contribs which contains contributions from the community that are still experimental and need more testing before being packaged in the main artifact delta-core.
    • Execution time metrics for UPDATE, DELETE, and MERGE operations are available in table history.
    • Fixed multiple bugs in schema evolution of nested columns in MERGE operation.
    • Fixed bug in handling dots in column names.
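
The simplified storage configuration described above selects a LogStore from the scheme of the table path. The sketch below shows one way such a lookup could work. The default mapping and the toy.* implementation names are placeholders, not Delta class names; the per-scheme configuration key follows the spark.delta.logStore.gs.impl pattern quoted in the 1.2.1 notes.

```python
from urllib.parse import urlparse

# Placeholder defaults for illustration only; not Delta's real class names.
DEFAULT_LOG_STORES = {
    "s3a": "toy.S3LogStore",
    "wasbs": "toy.AzureLogStore",
    "hdfs": "toy.HDFSLogStore",
}


def resolve_log_store(table_path, conf=None):
    """Pick a LogStore name from the path scheme.

    An explicit per-scheme setting in conf wins over the built-in default.
    """
    conf = conf or {}
    scheme = urlparse(table_path).scheme
    explicit = conf.get(f"spark.delta.logStore.{scheme}.impl")
    if explicit:
        return explicit
    if scheme in DEFAULT_LOG_STORES:
        return DEFAULT_LOG_STORES[scheme]
    raise ValueError(f"no LogStore configured for scheme {scheme!r}")
```

Because the lookup is per path, one application can resolve different implementations for tables on different storage systems.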

Delta Sharing
In relation to this release, we have also introduced a new Delta Sharing project which is an open protocol for secure real-time exchange of large datasets, which enables organizations to share data in real-time regardless of which computing platforms they use. It is a simple REST protocol that securely shares access to part of a cloud dataset and leverages modern cloud storage systems, such as S3, ADLS, or GCS, to reliably transfer data. See the project repository and the release notes for details.

Credits
Alex Ott, Ali Afroozeh, Antonio, Bruno Palos, Burak Yavuz, Christopher Grant, Denny Lee, Gengliang Wang, Guy Khazma, Howard Xiao, Jacek Laskowski, Joe Widen, Jose Torres, Lars Kroll, Linhong Liu, Meng Tong, Prakhar Jain, Pranav Anand, R. Tyler Croy, Rahul Mahadev, Ranu Vikram, Sabir Akhadov, Shixiong Zhu, Stefan Zeiger, Tathagata Das, Tom van Bussel, Vijayan Prabhakaran, Vivek Bhaskar, Wenchen Fan, Yijia Cui, Yingyi Bu, Yuchen Huo, Brenner Heintz, fvaleye, Herman van Hovell, Liwen Sun, Mahmoud Mahdi, Sabir Akhadov, Yaohua Zhao

Delta Lake 0.8.0

05 Feb 02:23

We are excited to announce the release of Delta Lake 0.8.0, which introduces the following key features.

  • Unlimited MATCHED and NOT MATCHED clauses for merge operations in Scala, Java, and Python - merge operations now support any number of whenMatched and whenNotMatched clauses. In addition, merge queries that unconditionally delete matched rows no longer throw errors on multiple matches. See the documentation for details.

  • MERGE operation now supports schema evolution of nested columns - Schema evolution of nested columns now has the same semantics as that of top-level columns. For example, new nested columns can be automatically added to a StructType column. See Automatic schema evolution in Merge for details.

  • MERGE INTO and UPDATE operations now resolve nested struct columns by name - The UPDATE and MERGE INTO commands now resolve nested struct columns by name. That is, when comparing or assigning columns of type StructType, the order of the nested columns does not matter (exactly in the same way as the order of top-level columns). To revert to resolving by position, set the Spark configuration "spark.databricks.delta.resolveMergeUpdateStructsByName.enabled" to "false".

  • Check constraints on Delta tables - Delta now supports CHECK constraints. When supplied, Delta automatically verifies that data added to a table satisfies the specified constraint expression. To add CHECK constraints, use the ALTER TABLE ADD CONSTRAINT command. See the documentation for details.

  • Start streaming a table from a specific version (#474) - When using Delta as a streaming source, you can use the options startingTimestamp or startingVersion to start processing the table from a given version onwards. You can also set startingVersion to latest to skip existing data in the table and stream from the new incoming data. See the documentation for details.

  • Ability to perform parallel deletes with VACUUM (#395) - When using VACUUM, you can set the session configuration "spark.databricks.delta.vacuum.parallelDelete.enabled" to "true" in order to use Spark to perform the deletion of files in parallel (based on the number of shuffle partitions). See the documentation for details.

  • Use Scala implicits to simplify read and write APIs - You can import io.delta.implicits._ to use the delta method with Spark read and write APIs such as spark.read.delta("/my/table/path"). See the documentation for details.
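
The difference between resolving nested struct columns by position and by name, as described above, can be shown with a toy example. The functions below are hypothetical and model a struct as a list of (field name, value) pairs; they are not how Delta represents StructType internally.

```python
def assign_by_position(target_fields, source_struct):
    """Pair the i-th source value with the i-th target field, ignoring names."""
    return {t: value for t, (_, value) in zip(target_fields, source_struct)}


def assign_by_name(target_fields, source_struct):
    """Look up each target field in the source by its name."""
    by_name = dict(source_struct)
    return {t: by_name[t] for t in target_fields}


# Target struct declares (city, zip); the source lists the fields in reverse order.
TARGET_FIELDS = ["city", "zip"]
SOURCE_STRUCT = [("zip", "94105"), ("city", "SF")]
```

Here assign_by_position swaps the two values, while assign_by_name places each value under the field it was written for, which is why field order no longer matters.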

Credits
Adam Binford, Alan Jin, Alex liu, Ali Afroozeh, Andrew Fogarty, Burak Yavuz, David Lewis, Gengliang Wang, HyukjinKwon, Jacek Laskowski, Jose Torres, Kian Ghodoussi, Linhong Liu, Liwen Sun, Mahmoud Mahdi, Maryann Xue, Michael Armbrust, Mike Dias, Pranav Anand, Rahul Mahadev, Scott Sandre, Shixiong Zhu, Stephanie Bodoff, Tathagata Das, Wenchen Fan, Wesley Hoffman, Xiao Li, Yijia Cui, Yuanjian Li, Zach Schuermann, contrun, ekoifman, Yi Wu

Delta Lake 0.7.0

18 Jun 21:22

We are excited to announce the release of Delta Lake 0.7.0 on Apache Spark 3.0. This is the first release on Spark 3.x and adds support for metastore-defined tables and SQL DDLs. The key features in this release are as follows.

  • Support for defining tables in the Hive metastore (#85) - You can now define Delta tables in the Hive metastore and use the table name in all SQL operations. Specifically, we have added support for: 

    This integration uses Catalog APIs introduced in Spark 3.0. You must enable the Delta Catalog by setting additional configurations when starting your SparkSession. See the documentation for details.

  • Support for SQL Delete, Update and Merge - With Spark 3.0, you can now use SQL DML operations DELETE, UPDATE and MERGE. See the documentation for details.

  • Support for automatic and incremental Presto/Athena manifest generation (#453) - You can now use ALTER TABLE SET TBLPROPERTIES to enable automatic regeneration of the Presto/Athena manifest files on every operation on a Delta table. This regeneration is incremental, that is, manifest files are updated for only the partitions that have been updated by the operation. See the documentation for details.

  • Support for controlling the retention of the table history - You can now use ALTER TABLE SET TBLPROPERTIES to configure how long the table history and deleted files are maintained in Delta tables. See the documentation for details.

  • Support for adding user-defined metadata in Delta table commits - You can now add user-defined metadata as strings in commits made to a Delta table by any operation. For DataFrame.write and DataFrame.writeStream operations, you can set the option userMetadata. For other operations, you can set the SparkSession configuration spark.databricks.delta.commitInfo.userMetadata. See the documentation for details.

  • Support Azure Data Lake Storage Gen2 (#288) - Spark 3.0 has support for Hadoop 3.2 libraries which enables support for Azure Data Lake Storage Gen2. See the documentation for details on how to configure Delta Lake with the correct versions of Spark and Hadoop libraries for Azure storage systems.

  • Improved support for streaming one-time triggers - With Spark 3.0, we now ensure that one-time trigger (also known as Trigger.Once) processes all outstanding data in a Delta table in a single micro-batch even if rate limits are set with the DataStreamReader option maxFilesPerTrigger.
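
The one-time trigger behavior above (all outstanding data in a single micro-batch, even when maxFilesPerTrigger is set) can be pictured with a toy batch planner. This is a simplified model and not Spark's scheduler; plan_batches and its parameters are hypothetical names.

```python
def plan_batches(pending_files, max_files_per_trigger=None, trigger_once=False):
    """Split pending files into micro-batches.

    A one-time trigger ignores the rate limit and puts everything in one batch;
    otherwise the rate limit caps the number of files per batch.
    """
    files = list(pending_files)
    if not files:
        return []
    if trigger_once or max_files_per_trigger is None:
        return [files]
    step = max_files_per_trigger
    return [files[i:i + step] for i in range(0, len(files), step)]
```

With five pending files and a limit of two, the rate-limited plan has three batches, while the one-time plan has a single batch containing all five files.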

Due to the significant internal changes, workloads on previous versions of Delta using the DeltaTable programmatic APIs may require additional changes to migrate to 0.7.0. See the Migration Guide for details.

Credits
Alan Jin, Alex Ott, Burak Yavuz, Jose Torres, Pranav Anand, QP Hou, Rahul Mahadev, Rob Kelly, Shixiong Zhu, Subhash Burramsetty, Tathagata Das, Wesley Hoffman, Yin Huai, Youngbin Kim, Zach Schuermann, Eric Chang, Herman van Hovell, Mahmoud Mahdi

Delta Lake 0.6.1

26 May 23:03

We are excited to announce the release of Delta Lake 0.6.1, which fixes a few critical bugs in merge operation and operation metrics. If you are using version 0.6.0, it is strongly recommended that you upgrade to version 0.6.1. The details of the fixed bugs are as follows:

  • Invalid MERGE INTO AnalysisExceptions (#419) - A couple of bugs related to merge operation were causing analysis errors in 0.6.0 on previously supported merge queries.

    • Fixing one of these bugs required reverting a minor change to the DeltaTable 0.6.0 API. In 0.6.1 (similar to 0.5.0), if the table’s schema has changed since the creation of the DeltaTable instance, DeltaTable.toDF() does not return a DataFrame with the latest schema. In such scenarios, you must recreate the DeltaTable instance for it to recognize the latest schema.
  • Incorrect operations metrics in history - 0.6.0 reported an incorrect number of rows processed during Update and Delete. This is fixed in 0.6.1.

Credits
Alan Jin, Jose Torres, Rahul Mahadev, Tathagata Das