This session is part of the Getting Started with Delta Lake series with Denny Lee and the Delta Lake team.
Developer Advocate Denny Lee will interview Burak Yavuz, Software Engineer at Databricks, to learn about the Delta Lake team's decision making process and why they designed, architected, and implemented Delta Lake the way it is today. Understand the technical challenges that the team faced, how those challenges were solved, and learn about plans for the future.
– I will get that sent out to you ASAP. So without further ado, I'll hand it off to Denny, and Burak's gonna come over and switch chairs with me.
– Hi everybody. Thanks very much, Karen. My name is Denny Lee. I'm a developer advocate here at Databricks, based out of Seattle, Washington, which is why I'm sitting with this weird espresso background while Burak moves chairs and pops on over; he's based out of San Francisco, California. So without further ado, you're currently watching our interview with Burak on the genesis of Delta Lake. But before we go into it, the heart of this type of online meetup is that we get to interview the people behind the technology. So Burak, why don't we start with you telling us a little bit about yourself? – Yeah, hi Denny. Hello everyone. My name is Burak. I'm a software engineer here at Databricks, and I'm also a Spark committer. I work in a team called the stream team at Databricks. Our goal is to make the lives of data engineers much simpler, and our team motto is, "We Make Your Streams Come True."
– That's right. The old purple T-shirts say that: make your streams come true. By any chance, did you make up that phrase, or did somebody else? – I mean, I don't know. We were joking about it. We had versions like may your streams come true, we make your streams come true. We were kind of joking around; we called ourselves the stream team, and then once we started working on Delta, we switched it to the dream team. It was just joking around. – Right, so the standard rigmarole of geeks trying to think of marketing phrases. – Exactly. (laughs) – Let's go backwards a little bit though. Can you tell us how you even progressed? What made you decide to get into computer science or engineering? What's your background? I'm just curious here. – Yeah, so I'm originally a mechanical engineer, but I've been programming since early high school. I was really interested in robots and things like bionic arms. I wanted to get a full view of engineering, where in mechanical engineering you build things that move and work, control systems, and on the other hand you have to program them. And I really enjoyed that whole area, making something work, making something move. So I studied mechanical engineering, but then came to the U.S. and did something called management science and engineering, which is kind of like industrial engineering, and worked on large scale optimization problems. That's how I got into the world of big data. – Cool. So you finished your degree in mechanical engineering, then decided to switch gears and go into the management aspect, right, and that led to big data. Where did you go to school? You'll notice that a lot of the folks here are Berkeley based, but I don't believe you're a Berkeley grad. – Yeah, so I came to Stanford University for grad school. That's where I did management science and engineering. I was planning on doing a full PhD program, like six years to learn everything about optimization and things like that. But that introduced me to the world of big data and machine learning, because every machine learning algorithm in the end uses some optimization algorithm, some optimization routine, to actually get a result. – Got it. – That's where I got introduced to Databricks and Spark as well. – Okay, so before you got introduced to Spark, because as you noted you're a committer on Spark, I'm just curious, what were the libraries and machine learning tools you were using? Was it old school Java Mallet? Was it pre-Python, or were you using Python pandas? Just curious about your progression through the machine learning cycles before you switched over to the data engineering cycles. – Yeah, honestly, we were doing very academic research in that sense. We had a lot of code in MATLAB. We had a lot of code in C. We had some code in Java. We were doing well-known optimization routines like stochastic gradient descent. And it turned out I was using all these tools that people had built on top of C, or even Fortran, where you have the very optimized matrix-matrix multiplication routines and whatnot. So that's what I was used to. Things like PyTorch and those kinds of more modern machine learning libraries didn't exist. – Gotcha.
So we're going back to almost old school Fortran 77 types of stuff. – [Burak] Yeah. – Got it. Okay, cool. No, fair enough, that's even what I did in the past, so I completely get you. Okay, so you finished off your degree at Stanford, but then how did you progress to Databricks, and for that matter, how did you progress into Spark in the first place? – Yeah. So that's a very long story for happy hour, to be honest. (laughs) – Okay, fair enough. – Basically, it turned out that I was continuing the research of the maintainer of MLlib at Databricks, Xiangrui Meng. And I signed up for a Spark workshop, which was being held at Stanford at the time, then actually went to a separate interview for a different company that day and missed the entire Spark workshop, but received an email later saying, oh, do you wanna intern at Databricks? I'm like, yeah, sure. And I noticed the email was from Xiangrui, so I just sent my resume over. – Cool, all right. (laughing) So not the most direct path, but that's pretty cool. So then you're a part of Databricks. I believe you started off saying that you were more focused on the Spark side of things, or were you more focused on the machine learning side of things when you first joined? – I mean, when I first joined, it was kind of like I'm building the tooling. I started off as an intern. Back then, Spark 1.0 had just been released, and we were trying to build all these tools to figure out regressions in Spark. We were adding all kinds of code to Spark, and we just wanted to make sure that we didn't regress in performance. So one of the first things I worked on was spark-perf, which was a library that allowed us to run benchmarks on Spark: run the machine learning algorithms that existed at the time on our data sets and ensure that over time we didn't regress in performance. And then I moved more towards linear algebra, worked on MLlib within Spark, and then data frames came out, and I worked on the statistics routines within data frames. – Got it. Okay, cool. So basically, you started out doing the mathematics behind a lot of this in the first place, but then once data frames came out, is that when you progressed into more of the stream team early on? – Yeah, exactly. That was around the time we started adding all these statistical routines into data frames, and then we created the data team at Databricks, and we were tasked with building a new logging data pipeline. And at that time we were like, data frames are cool, maybe we should come up with streaming data frames as well. And we were wondering how that would look, because we had Spark Streaming, which had the DStream API; we had Spark Core, which had the RDD API; and then you had Spark SQL, which had the data frame APIs. How do we connect these two or three? Are we gonna have a DStream of data frames? Are we gonna have some other concept? So early on we were thinking, it would be super cool to have streaming data frames, but we didn't know what it was gonna look like. – Perfect. Let's hold that thought.
So basically what you're telling us here is that the progression into streaming data frames, which we'll talk about shortly, actually started from problems on the Databricks side. Without diving into too many details, can you provide some context on why this was such a big problem, or the type of problems you were trying to solve? – Yeah, I mean, it wasn't a humongous problem at first. Early on, we had this very simple batch data pipeline that took one day's worth of data, processed it, and put it into a nice table. We needed to do something better. And that's when Databricks decided that, hey, a new college grad should build our new data pipeline. So we looked into what the industry standards were at the time, which was this thing called the Lambda Architecture, and we decided to move towards building such a data pipeline at Databricks as well. – Got it. So as you started progressing toward the Lambda Architecture, is this where that delineation between the concept of streaming data frames versus batch data frames first started for you and your team? – Yeah, precisely. The idea was, we were streaming data with a completely different API, Spark Streaming with DStreams, but then we had our batch processing, which was in data frames, and managing two completely different code paths started becoming a hassle. So we started thinking about how we could unify these APIs and come up with a model where you wouldn't have to change too much code: you just give us the transformations that you wanna do. And doing transformations on data frames was super declarative, super easy, with the kind of SQL-like APIs that people were used to. So we were like, oh, can we start doing this in a streaming fashion as well? – Got it. So from that progression, because you had all this data and because you wanted a Lambda Architecture where you're doing batch queries, whether it was machine learning or BI or whatever else, against that data, but you also needed to look at the data live via streaming, right? And because you're looking at that data live, you wanted to take that same declarative nature from batch and apply it to streaming. So that was basically part of the reason why the Spark community itself was saying, hey, we see the experience and we want to build streaming on top of the Spark SQL syntax, versus DStreams. I'm presuming that's the progression. How did that communication go with the community? How many others were bugging you about the same problem? I'm just curious. – Yeah, in the stream team we had many ideas, but basically people like Matei Zaharia and Michael Armbrust put their heads together and asked, how can we get this to work nicely? And they were like, yeah, we could just unify this within one API. Then we started talking about it within the developer community as well, and people were like, yeah, this sounds like a good idea, let's just make it as reusable as possible. If you can run the same logic in batch and streaming, it'll make it easier to test your code as well.
And so there were a lot of things where we were like, yeah, if we can just have this one unified API, and all you need to do is change two lines of your code, then it's gonna make it so much simpler for people to move from a batch world to a streaming world. – Got it. So up to this point, you have your streaming data frames and you have your batch data frames, and you've now successfully built a Lambda Architecture, right?
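To make the "change two lines" point concrete, here is a minimal PySpark sketch; the paths, schema, and column names are hypothetical. The transformation in the middle is shared, and only the read and write entry points differ between batch and streaming.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("unified-api-sketch").getOrCreate()

# Batch version: read, transform, write.
events = spark.read.format("json").load("/data/events")
errors = events.where(F.col("level") == "ERROR").select("timestamp", "message")
errors.write.format("parquet").mode("append").save("/data/errors")

# Streaming version: the transformation in the middle is untouched;
# only the entry point (readStream) and exit point (writeStream) change.
stream = (spark.readStream.format("json")
          .schema(events.schema)   # streaming file sources need an explicit schema
          .load("/data/events"))
errors = stream.where(F.col("level") == "ERROR").select("timestamp", "message")
(errors.writeStream.format("parquet")
 .option("path", "/data/errors")
 .option("checkpointLocation", "/checkpoints/errors")
 .start())
```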
By building the streaming data frames and the batch data frames, now that you have your Lambda Architecture, what were some of the issues that you and your team were running into? Because it seemed like you had a solution: you had the Lambda Architecture, the popular concept du jour, per se.
What were some of the issues that you ran into then? – Oh, there were like so many. I can’t begin to count.
Basically, we would get alerts every other day on our pipelines that something was failing. And it was just the reality of working on large scale distributed systems. We had many cases where data would arrive late, and we would forget to process that data.
We didn't know how far back to look when new data came in, so we just said, okay, we expect data to come in within three days, so let's just reprocess the last three days of our data set every time. And as we were doing all this streaming work alongside our batch pipeline, querying the latest streaming data was always very slow. The reason for that was the concept of small files: having a lot of small files. The idea there is that we were initially up in AWS working with Amazon S3, and these kinds of storage systems, data lakes as we call them, are just key-value blob storage systems. They're great for storing insane amounts of data, but they're not great at telling you what data is there or which version of the data is there; it's just very hard for them to give you very consistent semantics. So we would hit so many issues around Amazon S3's eventual consistency. We would write out a file, but before writing out the file, you would have to check that the file isn't there, just so that you don't overwrite it or write garbage data, and that check would prime a negative cache. You would write the file, try to read it back, and then hit these issues of, oh, the file doesn't exist. And you're like, well, I just wrote it there, how does it not exist? Those were the kinds of problems we had to deal with. Listing all those files was super expensive, because S3 just wasn't built for listing things; it's very hard for those kinds of systems to give you a list of what's there. And then just reading all those small files, opening so many HTTP connections, was super expensive. We would be killed on throughput just trying to access every single file and open all those connections. So we had all kinds of performance issues as well as correctness issues and consistency issues. – Got it. So the heart of the matter, at least when you were doing the Lambda Architecture for log analytics, was that the underlying file system, in this case the cloud storage system itself, was not reliable, right? And so I'm just curious.
Based on, obviously, this being your experience, did you then see the same thing happen with a lot of Databricks' customers? – Yeah, I mean, everyone was building the same thing at the time. And with these architectures, it's not that the storage system is unreliable, it's that everyone had to build their own database semantics on top of the storage system. People were used to working with things like MySQL or data warehouses, storage systems that were very easy to deal with, where you didn't have to think about a lot of the problems you might face. But suddenly, when you came into this data lake architecture, which all our customers were also in, you had to deal with all kinds of questions: how do I deal with files? Can I delete files? How do I optimize my IO patterns with these files? Which file format do I save them in? Which file sizes do I want to have? Suddenly all these new problems arose that people had never had to think about as an end user. – Right. So that transition from on-premise to the cloud, that transition from a single box to distributed systems, the fact that you actually had to deal with a distributed file system, basically introduced a whole set of issues that not only you were suffering from in doing the analysis of the data, but that many of the Databricks' customers themselves were suffering from as well? – Yeah, precisely. – Right, and so, sounding a little dorky on my part, this is, in essence, the genesis of Delta, right? The problems that you ran into were basically the reason for what ultimately became Delta Lake. – Precisely. I mean, we were trying to solve all these problems in different ways. We would get all kinds of support tickets saying, oh, my queries are slow, or my listing is super slow, can I make this faster? We were getting all kinds of support tickets around, oh, two people tried to change the same table at the same time, and now I have this totally inconsistent garbage state of my table. People would be like, oh, I have duplicate records here, why do I have duplicate records? Well, you had partial failures. Those were the kinds of issues. Then with Spark, for example, a lot of things were built with the Hadoop Distributed File System in mind, where the idea was that on on-prem HDFS you could just write to a temporary location and rename, and renames are super fast, a constant time operation. Whereas with cloud storage systems, it could either be a very quick rename or it could be a server-side copy of the entire file. As people were moving from on-prem to the cloud, they just had so much trouble dealing with all of these kinds of inconsistencies and performance issues. So that really led to the problem, and we came up with intermediate solutions first. – Yeah, actually, with that, I'd love to dive a little bit into some of the intermediate solutions you had to put in place before you actually had the solution. So just to give a little heads up to the audience here: we are gonna be talking about the Delta Lake transaction log, which ultimately solved some of these problems.
But before we talk about that, I'd love to understand a little bit more, as you hinted there Burak, what were some of those intermediate solutions, for example for the duplication or the consistency of the file systems, whichever is your favorite pain in the butt, I guess I'll put it that way. – Yeah, so for example, with structured streaming, once we released streaming data frames, structured streaming, we came up with this file sink implementation, where you could take your data from anywhere, Kafka, Kinesis, Azure Event Hubs, whatever, or files, and then store it in some other file storage system. And what this file sink did was actually kind of the initial implementation of Delta's transaction log. It would write out all the files with unique names. These unique names ensured that you wouldn't ever hit an eventual consistency problem. You would never hit the case of a failed task writing out a file and a second task, the retry, writing out the same file; you would just get new sets of files every time. And once all the files were complete, it would take the set of files that it wrote and store it in a manifest file that said, okay, for this micro-batch I wrote all these files. Then when Spark would actually query this table, it would go directly to this manifest. It wouldn't have to list any of the directories. It wouldn't have to list anything. This manifest was the source of truth about which files Spark had to read to actually have a full view of the table. So that was one of our initial solutions for avoiding listing and having an atomic operation where, if the write fails, Spark's not gonna read those files; Spark's still gonna look at the transaction log, or the manifest file, to see what's the source of truth. So that was a very early implementation of what was getting us there. – Got it. So basically, prior to the transaction log, in essence you have a manifest file, just a file which lists all the names. And because listing the files from S3 became relatively slow when you have a lot of files, it was actually faster to read that one manifest file, which itself contained the list of, let's just say, 25 files. Even though there may have been 50 files due to failures on write or whatever other reason, it would only grab the 25 files that you needed. – Exactly. And it wasn't just a single file, it was actually an ordered operation. So it was kind of like, the first batch wrote these files, the second batch wrote these files, the third batch wrote these files. Once we saw this directory, we would list it, read manifests one, two, three, and then answer a query based on the file list generated from those. – Gotcha. So this was the precursor to the transaction log. So the idea is, basically you had a file. I'm just curious, was there any discussion on why that manifest would in fact be a file versus, for the sake of argument, some other SQL or NoSQL store or something like that? Were there discussions from the standpoint of a queuing or in-memory system instead? – Yeah. No, that's a great question.
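For readers who want to see this in practice, here is a minimal sketch of the structured streaming file sink pattern Burak describes, with hypothetical paths and schema. The sink writes uniquely named part files and records each completed micro-batch in a _spark_metadata manifest inside the output directory; a later batch read consults that manifest rather than listing files.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("file-sink-sketch").getOrCreate()

# Hypothetical schema for the incoming raw records.
schema = StructType([
    StructField("timestamp", TimestampType()),
    StructField("event", StringType()),
])

# The file sink writes uniquely named part files, so a retried task can
# never clobber an earlier attempt's output, and it commits each completed
# micro-batch to the _spark_metadata manifest under the output path.
query = (spark.readStream.format("json").schema(schema).load("/data/raw")
         .writeStream
         .format("parquet")
         .option("path", "/data/clean")                  # data + _spark_metadata
         .option("checkpointLocation", "/checkpoints/clean")
         .start())

# A batch read of the same directory uses _spark_metadata as the source of
# truth, so files from failed or in-flight writes are never picked up.
df = spark.read.format("parquet").load("/data/clean")
```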
I mean, with Spark, every time the biggest question is scalability: how can we build something scalable? And one other thing was, we don't wanna depend on external systems; avoid dependence on external systems, because it just adds more problems onto the users. So we were like, oh, they're already writing to a storage system, why not have our source of truth live along with all the data files within that storage system? We didn't wanna have them set up a connection to some other database. We didn't wanna have them set up a connection to some other key-value store. We already have permissions; they've set everything up so that the right people can write to that directory or read from that directory. Why not just store all the information that we need in there? So that was our idea. And anything in memory we just wanted to avoid as much as possible, because it wasn't gonna scale. – Got it. So the in-memory option was out basically because it couldn't persist; by that very definition, there wasn't something you could point to and say, this is my source of truth. – You don't get fault tolerance in that case either. The thing is, the moment your Spark driver's gone, you just lose whatever you have about that table. – Got it. Cool. So the manifest file did solve a bunch of things, especially the file writes issue, but I'm just curious: you started off with the Lambda Architecture really talking about streaming, right? So was it just the manifest itself that resolved the streaming issues? Or what else did you have to do to resolve these things? – Yeah, so the manifest file resolved the issues around distributed failures, partial failures, and file listing for streaming, but it didn't get rid of the problem of having a lot of small files. The manifest is the source of truth, it tells us which files we have to read, but it only worked with streaming writes. So how do you actually read from this table and compact your files at the end of the day? Do you have one table that's just streaming and then compact into a separate table? Or do you ignore the transaction log? Some customers would just ignore that transaction log, overwrite everything, and blow everything away at the end of the day, even though we told them, hey, please don't do it, you're not getting any guarantees this way. But some people were still okay with that solution.
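As a hypothetical illustration of where that "keep the source of truth with the data" decision eventually landed, this is roughly what a Delta table's directory layout looks like: the transaction log sits in a _delta_log subdirectory right next to the data files, so the same storage permissions govern both, and no external database or in-memory service is involved.

```python
# Roughly the eventual Delta layout (hypothetical table path):
#
#   /delta/events/
#     _delta_log/
#       00000000000000000000.json   # commit 0: protocol, metadata, initial files
#       00000000000000000001.json   # commit 1: files added and removed
#     part-00000-<uuid>.snappy.parquet
#     part-00001-<uuid>.snappy.parquet
#
# A reader reconstructs the table state from the commit log rather than
# trusting a listing of the data files themselves.
df = spark.read.format("delta").load("/delta/events")  # assumes a SparkSession `spark`
```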
But still, a lot of issues existed, because there was no unification with batch yet.
Especially with this new streaming file sink. So in the end, we noticed all these problems, people were starting to use structured streaming a lot more, and we were like, well, maybe we should start thinking about this again and come up with a V2 concept, because people don't wanna write (audio breaking up).
– Oh, apologies. There seems to be some connection issue. I'm not sure if it's on my end or on your end, but can you hear me okay, actually? (audio breaking up) Okay, well, this is I guess the love of doing live sessions, where sometimes we have technical difficulties. I don't know if it's on my end or your end, but nevertheless, let's progress, because we actually only have a few minutes left on the interview portion. We did wanna try to time these interviews to roughly the average length of time it takes for somebody to commute in San Francisco, which is about 32 minutes. So nevertheless, all right. You've told us a little bit about how that transaction log worked, the streaming sink. So then, what were some of the other issues that you ran into, especially with your customers? Because you progressed from the file sink to what ultimately turned into a transaction log. What were the other things? For example, I'm presuming one of the problems was that as time progressed, business needs changed, so the schemas of the data also changed.
(audio breaking up) – Oh, we caught up. Can you hear us? – I can hear you now, no problem. – Yeah, your face froze for a second. (laughing) So yeah, to repeat your question: what kind of business needs came up over time that the file sink manifest did not support? Another big thing that came up was GDPR.
All these issues around data protection and data privacy came up, and the requirement for data subject requests, where people could ask specifically, what is my data, or can you update my data, or just delete my data entirely? And people had to build very complex systems and data pipelines or data architectures to solve those issues. But we were like, normally you should be able to write an UPDATE statement in SQL and be able to update your table. That's what people are generally used to. Or you should be able to write a DELETE statement on your table and delete all the records for a user. That's what our users were used to from on-premise data warehouses and databases. So we saw these new, more complex workloads emerging from all these new requirements in the data world as well. And our transaction log, which only supported streaming writes, was never gonna support those use cases. So we had to come up with a new protocol that was actually able to understand what changes are being made to the table. – Got it. So now you've expanded the transaction log so it understands what changes are made to the table. What made you decide that it would have to be SQL syntax? I mean, because it makes us think about the database world all over again.
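As a rough illustration, assuming a SparkSession and a Delta table with hypothetical names, the data subject requests described above map onto ordinary DML once the transaction log understands changes:

```python
# Hypothetical table and column names; assumes a SparkSession `spark`
# and an existing Delta table called `users`.

# Access request: what data do you hold about me?
spark.sql("SELECT * FROM users WHERE user_id = 'u-123'").show()

# Rectification request: update my data in place.
spark.sql("""
    UPDATE users
    SET email = 'redacted@example.com'
    WHERE user_id = 'u-123'
""")

# Erasure request: delete my records entirely.
spark.sql("DELETE FROM users WHERE user_id = 'u-123'")
```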
– I mean, it wasn't necessarily just SQL syntax. It was SQL because people were just used to SQL, and it was the easiest thing that people knew across different roles. The SQL syntax allowed them to transition a lot more easily into this world without having to know any Spark APIs or data frame APIs. Maybe data scientists knew about data frames from pandas or R, but not necessarily a data analyst who was using BI tools or writing dashboards, and we needed to empower all these people to actually build such things as well when required. – Perfect. – [Burak] Yeah. – Cool. Well then, let me end the interview portion with this: what spawned the idea of time travel? – Oh, time travel. – [Denny] Yeah. – Well, yeah, so it's kind of like when we came up with Delta, with all the intermediate solutions, Delta with its transaction log and its protocol solved a lot of the support tickets that we would get. Honestly, it eased a lot of issues.
Another issue that came up was that with Delta, we tried to enforce best practices on users. What would happen was, with Hive there's this concept of dynamic partition overwrites. What that does is, you have a data set, you write it out in overwrite mode, so it's gonna overwrite some amount of data, and it overwrites only the partitions that it writes new data to. The idea was, it was kind of a lazy way of saying, I have this entire new data set, overwrite whatever you need to overwrite with this. So the initial users of Delta who were used to that kind of mode started overwriting their entire tables, which meant deleting all their data and actually inserting only a very small subset of data. And they would create a lot of support tickets asking, why did this happen, oh, Delta lost all my data. And we could say, well, here's the history log, and here's the operation that you ran. We have this option called replaceWhere, which you can use to actually guarantee that you're overwriting the right data, so you know you're not deleting spurious data accidentally. But also, here you go, here's time travel, which will allow you to actually get whatever data was in a previous version, and you can merge all that data back into your current version. So in the first week we released time travel, actually six or seven customers saved data that they had accidentally deleted. – Oh, wow. – [Burak] Yeah. – So that's pretty cool. So basically, even though time travel is a pretty cool, fun feature, it was born from the idea that people were using overwrite incorrectly, in essence. – Yeah. I mean, it came from the idea that people make mistakes. – Got it. – And if there's a way that we can prevent those mistakes, or roll back from them much more easily, then we should provide that feature to users. And just from how the transaction log and the concepts of multi-version concurrency control work, it was really easy for us to go back to the state of a table at any given time. So why not empower users: if they wanna query the differences, do that; if you accidentally deleted data, add that back.
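A hedged sketch of the two features from that exchange, with hypothetical paths, predicates, column names, version numbers, and a hypothetical new_data DataFrame: replaceWhere scopes an overwrite to a predicate instead of the whole table, and time travel reads an earlier version so accidentally deleted data can be merged back.

```python
# Safer overwrite: only rows matching the predicate are replaced, so a
# mis-scoped write can't silently delete the rest of the table.
# `new_data` is a hypothetical DataFrame whose rows all match the predicate;
# assumes a SparkSession `spark` with Delta Lake configured.
(new_data.write.format("delta")
 .mode("overwrite")
 .option("replaceWhere", "date >= '2019-01-01' AND date < '2019-02-01'")
 .save("/delta/events"))

# Inspect the table history to find the version before the bad write.
spark.sql("DESCRIBE HISTORY delta.`/delta/events`").show()

# Time travel: read the table as of an earlier version...
old = spark.read.format("delta").option("versionAsOf", 5).load("/delta/events")

# ...and merge the recovered rows back into the current table.
old.createOrReplaceTempView("events_v5")
spark.sql("""
    MERGE INTO delta.`/delta/events` AS t
    USING events_v5 AS s
    ON t.event_id = s.event_id
    WHEN NOT MATCHED THEN INSERT *
""")
```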