Brooke Wenig

Machine Learning Practice Lead, Databricks

Brooke Wenig is a Machine Learning Practice Lead at Databricks. She leads a team of data scientists who develop large-scale machine learning pipelines for customers, as well as teach courses on distributed machine learning best practices. She is a co-author of Learning Spark, 2nd Edition, co-instructor of the Distributed Computing with Spark SQL Coursera course, and co-host of the Data Brew podcast. She received an MS in Computer Science from UCLA with a focus on distributed machine learning. She speaks Mandarin Chinese fluently and enjoys cycling.

Watch this speaker at Data + AI Summit 2021

Past sessions

Summit 2021 Keynotes: Data Science and Machine Learning

May 27, 2021 08:30 AM PT

The pursuit of AI is one of the biggest priorities in data today. The Thursday morning keynote will be led by Databricks Cofounder and CEO Ali Ghodsi and cover advances in data science, machine learning, MLOps and more in both open source and the Databricks Lakehouse Platform.

We’ll also be joined by data leaders from McDonalds and Microsoft, as well as the legendary Bill Nye, a scientist, engineer, comedian and author.

Summit 2021 YOLO with Data-Driven Software

May 27, 2021 11:00 AM PT

Software engineering evolved around certain best practices such as versioning code, dependency management, feature branches, etc. However, the same best practices have not translated to data science. Data scientists who update a stage of their ML pipeline need to understand the cascading effects of their change so that their downstream dependencies do not end up with stale data, or unnecessarily rerunning the entire pipeline end-to-end. When data scientists collaborate, they should be able to use the intermediate results from their colleagues instead of computing everything from scratch.

This presentation shows how to treat data like code through the concept of Data-Driven Software (DDS). This concept, implemented as a lightweight and easy-to-use python package, solves all the issues mentioned above for single user and collaborative data pipelines, and it fully integrates with a lakehouse architecture such as Databricks. In effect, it allows data engineers and data scientists to go YOLO: you only load your data once, and you never recalculate existing pieces.

Through live demonstrations leveraging DDS, you will see how data science teams can:

Integrate data and complex code bases with nearly no overhead in code or performance
Build fully reproducible data pipelines
Collaborate on data products with the same level of ease and speed as using familiar software tools such as git

In this session watch:

Brooke Wenig, Machine Learning Practice Lead, Databricks

Tim Hunter, Data Scientist, ABN AMRO

[daisna21-sessions-od]

Wednesday Morning Keynote

November 17, 2020 04:00 PM PT

Welcome from Ali Ghodsi, Databricks

Project Zen: Making Spark Pythonic

Reynold Xin
Co-founder & Chief Architect, Databricks

In this keynote from Reynold Xin, the top contributor to Apache Spark and PMC member, we will review the state of the project and highlight major community developments in the 10th anniversary release and beyond. Reynold will review how the recent Spark 3.0 release focused on making it easier to use, faster, and more ANSI standard compliant. With Python representing nearly 70% of notebook commands, he’ll focus on the development of Project Zen - the community effort to make Spark more Pythonic. This includes improvements in development tooling, API design, error handling and more, to make data scientists and engineers more productive with data.

Demo:Pythonic Spark with Real Koalas

Caryl Yuhas
Sr. Manager, Field Engineering, Databricks

The Rise of the Lakehouse

Ali Ghodsi
Co-founder & CEO
Original Creator of Apache Spark, Databricks

Data warehouses have a long history in decision support and business intelligence applications. But, data warehouses were not well suited to dealing with the unstructured, semi-structured, and streaming data common in modern enterprises. This led to organizations building data lakes of raw data about a decade ago. But, they also lacked important capabilities. The need for a better solution has given rise to lakehouse architecture, which implements similar data structures and data management features to those in a data warehouse, directly on the kind of low cost storage used for data lakes.

This keynote by Databricks CEO, Ali Ghodsi, explains how the open source Delta Lake project allows the industry to realize the full potential of lakehouse architecture. Additionally, Ali will discuss the newly announced SQL Analytics service that allows users to run traditional analytics on their data lake, instead of moving data out to data warehouses, without sacrificing performance, security, or quality. This service completes the vision of lakehouse architecture to allow the data lake to be a single source of truth of all data workloads.

Discussion with Tableau Software

Francois Ajenstat
Chief Product Officer, Tableau Software

Demo: SQL Analytics and the Lakehouse Architecture

Brooke Wenig,
Machine Learning Practice Lead, Databricks

How SQL Analytics Makes Lakehouse Fast

Reynold Xin
Co-founder & Chief Architect, Databricks

In this keynote, Reynold Xin, Co-founder and Chief Architect at Databricks, will explore how SQL Analytics brings a new level of performance to data lakes for analytics workloads. Traditionally, data lakes have struggled with analytics, because they struggle to deliver the fast query performance wiht low latency at high user concurrency. Reynold will provide a techical deep dive of how Databricks has addresssed these challenges. First, Delta Engine, Databricks' polymorphic vectorized execution engine, delivers extremely fast single query throughput. Second, the new auto-scaling SQL-optimized clusters in SQL Analytics make it easy to match compute capacity to user load. And third, optimizations in the new SQL Analytics Endpoints reduce the time required to get query results by up to 6x. Altogether, SQL Analytics is able to provide users with data warehousing performance at data lake economics for their analytics workloads.

Discussion with Peter Boncz

Professor, CWI & Vrije Universiteit Amsterdam

Discussion with Unilever

Phinean Woodward
Head of Architecture, Information and Analytics, Unilever

In this talk, we’ll discuss how the Lakehouse architecture has become a critical part of Unilever’s information management infrastructure to limit traditional enterprise data silos, and enable agile access to data both up and downstream that’s needed for faster decision making. As a result, IT is helping Unilever to deliver higher quality predictions in many areas of the business, thereby building trust in AI throughout the company.

Why Data Should Drive the Next Pandemic Response

Malcolm Gladwell
Best-selling author, journalist, and podcast host

Imagine what a data-driven response to the Covid-19 pandemic would have looked like — if we could set aside politics and ego. Award-winning author and journalist Malcolm Gladwell discusses the lessons we can learn from the current crisis, and how data and data teams will be critical in solving the world’s toughest problems – including future pandemic outbreaks. He also reveals the essential role that data teams play in his own work every day.

Close

Ali Ghodsi

Summit 2020 Spark + AI Summit 2020: Wednesday Morning Keynotes

June 23, 2020 05:00 PM PT

Ali Ghodsi - Intro to Lakehouse, Delta Lake (Databricks) - 46:40
Matei Zaharia - Spark 3.0, Koalas 1.0 (Databricks) - 17:03
Brooke Wenig - DEMO: Koalas 1.0, Spark 3.0 (Databricks) - 35:46
Reynold Xin - Introducing Delta Engine (Databricks) - 1:01:50
Arik Fraimovich - Redash Overview & DEMO (Databricks) - 1:27:25
Vish Subramanian - Brewing Data at Scale (Starbucks) - 1:39:50

Realizing the Vision of the Data Lakehouse
Ali Ghodsi

Data warehouses have a long history in decision support and business intelligence applications. But, data warehouses were not well suited to dealing with the unstructured, semi-structured, and streaming data common in modern enterprises. This led to organizations building data lakes of raw data about a decade ago. But, they also lacked important capabilities. The need for a better solution has given rise to the data lakehouse, which implements similar data structures and data management features to those in a data warehouse, directly on the kind of low cost storage used for data lakes.

This keynote by Databricks CEO, Ali Ghodsi, explains why the open source Delta Lake project takes the industry closer to realizing the full potential of the data lakehouse, including new capabilities within the Databricks Unified Data Analytics platform to significantly accelerate performance. In addition, Ali will announce new open source capabilities to collaboratively run SQL queries against your data lake, build live dashboards, and alert on important changes to make it easier for all data teams to analyze and understand their data.

Introducing Apache Spark 3.0:
A retrospective of the Last 10 Years, and a Look Forward to the Next 10 Years to Come.
Matei Zaharia and Brooke Wenig

In this keynote from Matei Zaharia, the original creator of Apache Spark, we will highlight major community developments with the release of Apache Spark 3.0 to make Spark easier to use, faster, and compatible with more data sources and runtime environments. Apache Spark 3.0 continues the project’s original goal to make data processing more accessible through major improvements to the SQL and Python APIs and automatic tuning and optimization features to minimize manual configuration. This year is also the 10-year anniversary of Spark’s initial open source release, and we’ll reflect on how the project and its user base has grown, as well as how the ecosystem around Spark (e.g. Koalas, Delta Lake and visualization tools) is evolving to make large-scale data processing simpler and more powerful.

Delta Engine: High Performance Query Engine for Delta Lake
Reynold Xin

How Starbucks is Achieving its 'Enterprise Data Mission' to Enable Data and ML at Scale and Provide World-Class Customer Experiences
Vish Subramanian

Starbucks makes sure that everything we do is through the lens of humanity – from our commitment to the highest quality coffee in the world, to the way we engage with our customers and communities to do business responsibly. A key aspect to ensuring those world-class customer experiences is data. This talk highlights the Enterprise Data Analytics mission at Starbucks that helps making decisions powered by data at tremendous scale. This includes everything ranging from processing data at petabyte scale with governed processes, deploying platforms at the speed-of-business and enabling ML across the enterprise. This session will detail how Starbucks has built world-class Enterprise data platforms to drive world-class customer experiences.

Summit Europe 2019 New Developments in the Open Source Ecosystem: Apache Spark 3.0, Delta Lake, and Koalas

October 15, 2019 05:00 PM PT

In this talk, we will highlight major efforts happening in the Spark ecosystem. In particular, we will dive into the details of adaptive and static query optimizations in Spark 3.0 to make Spark easier to use and faster to run. We will also demonstrate how new features in Koalas, an open source library that provides Pandas-like API on top of Spark, helps data scientists gain insights from their data quicker.

Summit Europe 2019 Koalas: Pandas on Apache Spark EU

October 15, 2019 05:00 PM PT

In this tutorial we will present Koalas, a new open source project that we announced at the Spark + AI Summit in April. Koalas is an open-source Python package that implements the pandas API on top of Apache Spark, to make the pandas API scalable to big data. Using Koalas, data scientists can make the transition from a single machine to a distributed environment without needing to learn a new framework.

We will demonstrate Koalas' new functionalities since its initial release, discuss its roadmaps, and how we think Koalas could become the standard API for large scale data science.

What you will learn:

How to get started with Koalas
Easy transition from Pandas to Koalas on Apache Spark
Similarities between Pandas and Koalas APIs for DataFrame transformation and feature engineering
Single machine Pandas vs distributed environment of Koalas

Prerequisites:

A fully-charged laptop (8-16GB memory) with Chrome or Firefox
Python 3 and pip pre-installed
pip install koalas from PyPI
Pre-register for Databricks Community Edition
Read koalas docs

Summit Europe 2019 Koalas: Pandas on Apache Spark (continued)

October 15, 2019 05:00 PM PT

We will demonstrate Koalas' new functionalities since its initial release, discuss its roadmaps, and how we think Koalas could become the standard API for large scale data science.

What you will learn:

How to get started with Koalas
Easy transition from Pandas to Koalas on Apache Spark
Similarities between Pandas and Koalas APIs for DataFrame transformation and feature engineering
Single machine Pandas vs distributed environment of Koalas

Prerequisites:

A fully-charged laptop (8-16GB memory) with Chrome or Firefox
Python 3 and pip pre-installed
pip install koalas from PyPI
Read koalas docs

Summit 2019 Official Announcement of Koalas Open Source Project

April 23, 2019 05:00 PM PT

Keynote from Spark + AI Summit 2019: Reynold Xin, Databricks, Brooke Wenig, Databricks

Summit 2019 The Pursuit of Happiness: Building a Scalable Pipeline Using Apache Spark and NLP to Measure Customer Service Quality

April 23, 2019 05:00 PM PT

How do we get better than good enough? Leveraging NLP techniques, we can determine the general sentiment of a sentence, phrase, or a paragraph of text. We can mine the world of social data to get a sense of what is being said. But, how do you get control of the factors that create happiness? How do you become proactive in making end-users happy? Chatbots, human chats, and conversations are the means we are using to express our ideas to each other. NLP is great for helping us process and understand this data but can fall short.

In our session, we will explore how to expand NLP/sentiment analysis to investigate the intense interactions that can occur between humans and humans or humans and robots. We will show how to pinpoint the things that work to improve quality and how to use those data points to measure the effectiveness of chatbots. Learn how we have applied popular NLP frameworks such as NLTK, Stanford CoreNLP and John Snow Labs NLP to financial customer service data. Explore techniques to analyze conversations for actionable insights. Leave with an understanding of how to influence your customers' happiness.

Summit Europe 2018 A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch

October 3, 2018 05:00 PM PT

We all know what they say – the bigger the data, the better. But when the data gets really big, how do you mine it and what deep learning framework to use? This talk will survey, with a developer’s perspective, three of the most popular deep learning frameworks—TensorFlow, Keras, and PyTorch—as well as when to use their distributed implementations.

We’ll compare code samples from each framework and discuss their integration with distributed computing engines such as Apache Spark (which can handle massive amounts of data) as well as help you answer questions such as:

As a developer how do I pick the right deep learning framework?
Do I want to develop my own model or should I employ an existing one?
How do I strike a trade-off between productivity and control through low-level APIs?
What language should I choose?

In this session, we will explore how to build a deep learning application with Tensorflow, Keras, or PyTorch in under 30 minutes. After this session, you will walk away with the confidence to evaluate which framework is best for you.

Session hashtag: #SAISDL3

Summit 2018 A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, and Deep Learning Pipelines

June 4, 2018 05:00 PM PT

We all know what they say - the bigger the data, the better. But when the data gets really big, how do you use it? This talk will cover three of the most popular deep learning frameworks: TensorFlow, Keras, and Deep Learning Pipelines, and when, where, and how to use them.

We'll also discuss their integration with distributed computing engines such as Apache Spark (which can handle massive amounts of data), as well as help you answer questions such as:

- As a developer how do I pick the right deep learning framework for me?

- Do I want to develop my own model or should I employ an existing one

- How do I strike a trade-off between productivity and control through low-level APIs?

In this session, we will show you how easy it is to build an image classifier with Tensorflow, Keras, and Deep Learning Pipelines in under 30 minutes. After this session, you will walk away with the confidence to evaluate which framework is best for you, and perhaps with a better sense for how to fool an image classifier!

Session hashtag: #DL4SAIS