Data + AI Research

Databricks' founders and staff include leading researchers in distributed systems, artificial intelligence and data analytics who pioneered widely used techniques and software. Read some of the recent papers our staff contributed to, in collaboration with leading universities such as UC Berkeley and Stanford.

SQL and Lakehouse Platforms

View PDF
Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics
Authors: Michael Armbrust, Ali Ghodsi, Reynold Xin, Matei Zaharia
View PDF
Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores
Authors: Michael Armbrust, Tathagata Das, Liwen Sun, Burak Yavuz, Shixiong Zhu, Mukul Murthy, Joseph Torres, Herman van Hovell, Adrian Ionescu, Alicja Łuszczak, Michał Świtakowski, Michał Szafrański, Xiao Li, Takuya Ueshin, Mostafa Mokhtar, Peter Boncz, Ali Ghodsi, Sameer Paranjpye, Pieter Senster, Reynold Xin, Matei Zaharia
View PDF
Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark
Authors: Michael Armbrust, Tathagata Das, Joseph Torres, Burak Yavuz, Shixiong Zhu, Reynold Xin, Ali Ghodsi, Ion Stoica, Matei Zaharia
View PDF
Filter Before You Parse: Faster Analytics on Raw Data with Sparser
Authors: Shoumik Palkar, Firas Abuzaid, Peter Bailis, Matei Zaharia
View PDF
Spark SQL: Relational Data Processing in Spark
Authors: Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, Matei Zaharia
View PDF
Shark: SQL and Rich Analytics at Scale
Authors: Reynold S. Xin, Josh Rosen, Matei Zaharia, Michael J. Franklin, Scott Shenker, Ion Stoica

Apache Spark

View PDF
Apache Spark: A Unified Engine for Big Data Processing
Authors: Matei Zaharia, Reynold S. Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J. Franklin, Ali Ghodsi, Joseph Gonzalez, Scott Shenker, Ion Stoica
View PDF
Discretized Streams: Fault-Tolerant Streaming Computation at Scale
Authors: Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, Ion Stoica
View PDF
GraphFrames: An Integrated API for Mixing Graph and Relational Queries
Authors: Ankur Dave, Alekh Jindal, Li Erran Li, Reynold Xin, Joseph Gonzalez, Matei Zaharia
View PDF
GraphX: Graph Processing in a Distributed Dataflow Framework
Authors: Joseph E. Gonzalez, Reynold S. Xin, Ankur Dave, Daniel Crankshaw, Michael J. Franklin, Ion Stoica
View PDF
SparkR: Scaling R Programs with Spark
Authors: Shivaram Venkataraman, Zongheng Yang, Davies Liu, Eric Liang, Hossein Falaki, Xiangrui Meng, Reynold Xin, Ali Ghodsi, Michael Franklin, Ion Stoica, Matei Zaharia
View PDF
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Authors: Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica

Machine Learning and Artificial Intelligence

View PDF
Developments in MLflow: A System to Accelerate the Machine Learning Lifecycle
Authors: Andrew Chen, Andy Chow, Aaron Davidson, Arjun DCunha, Ali Ghodsi, Sue Ann Hong, Andy Konwinski, Clemens Mewald, Siddharth Murching, Tomas Nykodym, Paul Ogilvie, Mani Parkhe, Avesh Singh, Fen Xie, Matei Zaharia, Richard Zang, Juntai Zheng, Corey Zumar, Databricks, Inc.
View PDF
Accelerating the Machine Learning Lifecycle with MLflow
Authors: Matei Zaharia, Andrew Chen, Aaron Davidson, Ali Ghodsi, Sue Ann Hong, Andy Konwinski, Siddharth Murching, Tomas Nykodym, Paul Ogilvie, Mani Parkhe, Fen Xie, Corey Zumar, Databricks Inc.
View PDF
Ray: A Distributed Framework for Emerging AI Applications
Authors: Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica, UC Berkeley
View PDF
Parametrized Hierarchical Procedures for Neural Programming
Authors: Roy Fox, Richard Shin, Sanjay Krishnan, Ken Goldberg, Dawn Song, Ion Stoica
View PDF
Yggdrasil: An Optimized System for Training Deep Decision Trees at Scale
Authors: Firas Abuzaid, Joseph Bradley, Feynman Liang, Andrew Feng, Lee Yang, Matei Zaharia, Ameet Talwalkar
View PDF
DAWNBench: An End-to-End Deep Learning Benchmark and Competition
Authors: Cody Coleman, Deepak Narayanan, Daniel Kang, Tian Zhao, Jian Zhang, Luigi Nardi, Peter Bailis, Kunle Olukotun, Chris Ré, Matei Zaharia
View PDF
Clipper: A Low-Latency Online Prediction Serving System
Authors: Daniel Crankshaw, Xin Wang, Giulio Zhou, Michael J. Franklin, Joseph E. Gonzalez, Ion Stoica
View PDF
Matrix Computations and Optimization in Apache Spark
Authors: Reza Bosagh Zadeh, Xiangrui Meng, Alexander Ulanov, Burak Yavuz, Li Pu, Shivaram Venkataraman, Evan Sparks, Aaron Staple, Matei Zaharia
View PDF
MLlib: Machine Learning in Apache Spark
Authors: Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, Doris Xin, Reynold Xin, Michael J. Franklin, Reza Zadeh, Matei Zaharia, Ameet Talwalkar
View PDF
RLlib: Abstractions for Distributed Reinforcement Learning
Authors: Eric Liang, Richard Liaw, Philipp Moritz, Robert Nishihara, Roy Fox, Ken Goldberg, Joseph E. Gonzalez, Michael I. Jordan, Ion Stoica

Applications

View PDF
NoScope: Optimizing Neural Network Queries over Video at Scale
Authors: Daniel Kang, John Emmons, Firas Abuzaid, Peter Bailis, Matei Zaharia
View PDF
Kira: Processing Astronomy Imagery Using Big Data Technology
Authors: Zhao Zhang, Kyle Barbary, Frank Austin Nothaft, Evan R. Sparks, Oliver Zahn, Michael J. Franklin, David A. Patterson, Saul Perlmutter
View PDF
C3: Internet-Scale Control Plane for Video Quality Optimization
Authors: Aditya Ganjam, Junchen Jiang, Xi Liu, Vyas Sekar, Faisal Siddiqui, Ion Stoica, Jibin Zhan, Hui Zhang
View PDF
CellIQ: Real-Time Cellular Network Analytics at Scale
Authors: Anand Padmanabha Iyer, Li Erran Li, Ion Stoica
View PDF
Rethinking Data-Intensive Science Using Scalable Analytics Systems
Authors: Frank Austin Nothaft, Matt Massie, Timothy Danford, Zhao Zhang, Uri Laserson, Carl Yeksigian, Jey Kottalam, Arun Ahuja, Jeff Hammerbacher, Michael Linderman, Michael J. Franklin, Anthony D. Joseph, David A. Patterson
View PDF
A Cloud-Compatible Bioinformatics Pipeline for Ultrarapid Pathogen Identification from Next-Generation Sequencing of Clinical Samples
Authors: Samia N. Naccache, Scot Federman, Narayanan Veeraraghavan, Matei Zaharia, Deanna Lee, Erik Samayoa, Jerome Bouquet, Alexander L. Greninger, Ka-Cheung Luk, Barryett Enge, Debra A. Wadford, Sharon L. Messenger, Gillian L. Genrich, Kristen Pellegrino, Gilda Grard, Eric Leroy, Bradley S. Schneider, Joseph N. Fair, Miguel A. Martínez, Pavel Isa, John A. Crump, Joseph L. DeRisi, Taylor Sittler, John Hackett, Jr., Steve Miller, Charles Y. Chiu
View PDF
ADAM: Genomics Formats and Processing Patterns for Cloud Scale Computing
Authors: Matt Massie, Frank Nothaft, Christopher Hartl, Christos Kozanitis, André Schumacher, Anthony D. Joseph, David A. Patterson

Large Scale Distributed Computing

View PDF
ASAP: Fast, Approximate Graph Pattern Mining at Scale
Authors: Anand Padmanabha Iyer, Zaoxing Liu, Xin Jin, Shivaram Venkataraman, Vladimir Braverman, Ion Stoica
View PDF
Drizzle: Fast and Adaptable Stream Processing at Scale
Authors: Shivaram Venkataraman, Aurojit Panda, Kay Ousterhout, Michael Armbrust, Ali Ghodsi, Michael J. Franklin, Benjamin Recht, Ion Stoica
View PDF
Dominant Resource Fairness: Fair Allocation of Multiple Resource Types
Authors: Ali Ghodsi, Matei Zaharia, Benjamin Hindman, Andy Konwinski, Scott Shenker, Ion Stoica
View PDF
Occupy the Cloud: Distributed Computing for the 99%
Authors: Eric Jonas, Qifan Pu, Shivaram Venkataraman, Ion Stoica, Benjamin Recht
View PDF
Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
Authors: Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, Ion Stoica
View PDF
Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks
Authors: Haoyuan Li, Ali Ghodsi, Matei Zaharia, Scott Shenker, Ion Stoica
View PDF
Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling
Authors: Matei Zaharia, Dhruba Borthakur, Joydeep Sen Sarma, Khaled Elmeleegy, Scott Shenker, Ion Stoica
View PDF
Above the Clouds: A View of Cloud Computing
Authors: Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy Katz, Andy Konwinski, Gunho Lee, David Patterson, Ariel Rabkin, Ion Stoica, Matei Zaharia
View PDF
Improving MapReduce Performance in Heterogeneous Environments
Authors: Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, Ion Stoica
View PDF
Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications
Authors: D. Karger, H. Balakrishnan, I. Stoica, M.F. Kaashoek, R. Morris