Databricks, the inventor and commercial distributor of the Apache Spark processing platform, has announced the launch of an open source project called Delta Sharing at the Data + AI Summit.
The supplier describes Delta Sharing as the “first open protocol for securely sharing data across organisations in real time, completely independent of the platform on which the data resides”.
Delta Sharing is part of the Delta Lake project, which combines data lake technology with data warehousing attributes. Databricks open sourced Delta Lake from its own Delta product in 2019 at its conference, then called Spark + AI Summit.
As a term, “data lakehouse” has some currency beyond Databricks, attracting the imprimatur of the O’Reilly media group, albeit in association with the supplier.
Delta Sharing is said to be supported by data providers Nasdaq, ICE, S&P, Precisely, Factset, Foursquare and SafeGraph, and by storage and software providers Amazon Web Services (AWS), Microsoft, Google Cloud and Tableau.
Matei Zaharia, chief technologist and co-founder of Databricks, said: “The top challenge for data providers today is making their data easily and broadly consumable. Managing dozens of different data delivery solutions to reach all user platforms is untenable. An open, interoperable standard for real-time data sharing will dramatically improve the experience for data providers and data users.
“Delta Sharing will standardise how data is securely exchanged between enterprises regardless of which storage or computing platform they use, and we are thrilled to make this innovation open source.”
In an interview ahead of the summit, Joel Minnick, vice-president of marketing at Databricks, said: “The lakehouse is emerging as the new architecture for how customers think about their data, in that it brings their data and AI [artificial intelligence] initiatives onto the same platform.”
The term, he said, is gaining recognition across the IT industry, and featured at AWS’s re:Invent conference, with a focus on Amazon Redshift.
Minnick cited a recent blog post by Bill Inmon, often described as the father of data warehousing, as an important validation for the data lakehouse concept. The post describes the lakehouse as the natural evolution of data architecture. Inmon is speaking at the Data + AI Summit.
“In pursuit of machine learning and AI initiatives, getting value from unstructured data, alongside structured data, is something that data warehouses cannot do. And nor can data lakes. The lakehouse [concept] recognises that the vast majority of your data today is landing in your data lake, and data lakes lack reliability, performance capability and governance,” said Minnick.
“Data lakes are great places to put data, but they are not engineered to have lots of concurrent users running analytic workloads,” he added. “Data warehouses do have great performance, reliability and governance, but they are not built for unstructured data types, and are generally proprietary. It’s easier to move a data lake up, and bring governance to it, than to bring a data warehouse down to deal with less structured types of data.”
Minnick said the value of the Delta Sharing product lay in organisations wanting to “ask bigger questions” by pooling data from outside. “Retailers, for example, want to share data with other retailers and their suppliers, and that is not easy to do,” he said. “Even within companies, different divisions have their own data platforms. And it’s not just [data in] traditional tables that companies want to share, but unstructured data.”
Minnick said Delta Sharing offered a solution to this data sharing problem. “We’ve had great support in this from the data providers, like Nasdaq, Standard and Poor’s, and AWS, and from the data tools side, like Microsoft, Tableau, Looker and Qlik, in getting one common format to align behind to share data with their customers,” he added.
The protocol is said to establish a common standard for sharing all data types that can be used in SQL, visual analytics tools, and programming languages such as Python and R. Delta Sharing also allows organisations to share existing large-scale datasets in the Apache Parquet and Delta Lake formats in real time without copying them, and can be implemented within existing software that supports Parquet.
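To give a sense of what reading shared data from Python might look like, the following is a minimal sketch that assumes the open source delta-sharing Python connector and its load_as_pandas function. It is not taken from the announcement. The profile file name and the share, schema and table names are hypothetical placeholders, and the data provider would supply the real values.

```python
def table_url(profile_path: str, share: str, schema: str, table: str) -> str:
    """Build a table address in the '<profile>#<share>.<schema>.<table>' form
    that the delta-sharing Python connector expects (assumed here)."""
    return f"{profile_path}#{share}.{schema}.{table}"


def load_shared_table(url: str):
    """Read a shared table into a pandas DataFrame.

    The import is done lazily so that table_url() can be used without
    the connector installed (pip install delta-sharing).
    """
    import delta_sharing

    return delta_sharing.load_as_pandas(url)


# Hypothetical names for illustration only.
url = table_url("config.share", "retail", "sales", "orders")
print(url)
```

Calling load_shared_table(url) would then fetch the data, provided the profile file points at a reachable sharing server and the recipient has been granted access to the table.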