Agenda subject to change.
This course covers the new features and changes introduced to Apache Spark and the surrounding ecosystem during the past 12 months. It focuses on Spark 2.4 and 3.0, updates to performance, monitoring, usability, stability, extensibility, PySpark, SparkR, Delta Lake, Pandas, and MLflow. Students will also learn about backwards compatibility with 2.x and the considerations required for...
Ali Ghodsi – Intro to Lakehouse, Delta Lake (Databricks) – 46:40
Matei Zaharia – Spark 3.0, Koalas 1.0 (Databricks) – 17:03
Brooke Wenig – DEMO: Koalas 1.0, Spark 3.0 (Databricks) – 35:46
Reynold Xin – Introducing Delta Engine (Databricks) – 1:01:50
Arik Fraimovich – Redash Overview & DEMO (Databricks) – 1:27:25
Vish Subramanian – Brewing...
Modern enterprises depend on trusted data for AI, analytics, and data science to drive deeper insights and business value. With intelligent and automated data management, you can take advantage of Databricks Delta to gain efficiencies, cost savings, and scale to succeed. In this session we will discuss how customers can modernize their on-premises Data Lake...
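A minimal sketch of the Delta Lake pattern such a migration lands on; the storage paths and DataFrame names are hypothetical, and the cluster is assumed to have the delta-spark package configured:

```python
# Minimal sketch: converting landed Parquet data to a Delta table.
# Paths are hypothetical; assumes delta-spark is configured on the cluster.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-migration").getOrCreate()

# Read the on-premises extract that has been landed in cloud storage.
raw = spark.read.parquet("/mnt/landing/transactions")

# Rewrite it as a Delta table to gain ACID transactions and time travel.
raw.write.format("delta").mode("overwrite").save("/mnt/lake/transactions")

# Downstream readers query the Delta table like any other Spark source.
df = spark.read.format("delta").load("/mnt/lake/transactions")
```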
Machine learning model fairness and interpretability are critical for data scientists, researchers and developers to explain their models and understand the value and accuracy of their findings. Interpretability is also important to debug machine learning models and make informed decisions about how to improve them. In this session, Francesca will go over a few methods...
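As one illustration of the kind of method such a session surveys, here is a hedged sketch using the SHAP library; SHAP is a stand-in choice on a toy dataset, not necessarily the speaker's tooling:

```python
# Sketch of one common interpretability method: SHAP values for a tree model.
# Illustrative only; the session's actual libraries and data may differ.
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = RandomForestClassifier(n_estimators=100).fit(X, y)

# TreeExplainer attributes each prediction to the input features.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Global view: which features drive the model's predictions overall.
shap.summary_plot(shap_values, X)
```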
You’re moving more data and workloads to Databricks – but what do you do with the application data inside your mainframes or IBM i systems? You cannot ignore this critical data – but making it usable in Databricks is easier said than done. That is why you need an approach to data integration that helps...
Enterprises have used ETL tools for decades for higher productivity and standardization. Data engineers find these tools no longer work for them and have moved to code. However, this brings us back to ad hoc scripts and frameworks, reminding us of the world before ETL tools. We show how a new generation of tools can be built for Spark...
When combined with scale-out cloud infrastructure, modern hyperparameter optimization (HPO) libraries allow data scientists to deploy more compute power to improve model accuracy, running hundreds or thousands of model variants with minimal code changes. HPO has traditionally run into two barriers – complexity of model management and computational cost. In this talk, we walk through...
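A hedged sketch of that scale-out pattern using Hyperopt's SparkTrials, one common way to fan trials across a cluster; the search space and objective below are hypothetical:

```python
# Sketch: distributed hyperparameter search with Hyperopt + SparkTrials.
# The model, search space, and parallelism are illustrative assumptions.
from hyperopt import STATUS_OK, SparkTrials, fmin, hp, tpe
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

def objective(params):
    # Each evaluation trains one model variant; loss is negated accuracy.
    acc = cross_val_score(SVC(C=params["C"], gamma=params["gamma"]), X, y).mean()
    return {"loss": -acc, "status": STATUS_OK}

space = {"C": hp.loguniform("C", -3, 3), "gamma": hp.loguniform("gamma", -3, 3)}

# SparkTrials fans the trials out across the cluster's executors.
best = fmin(objective, space, algo=tpe.suggest, max_evals=100,
            trials=SparkTrials(parallelism=8))
```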
How are customers building enterprise data lakes on AWS with Databricks? Learn how Databricks complements the AWS data lake strategy and how HP has succeeded in transforming business with this approach.
A Feature Store enables machine learning (ML) features to be registered, discovered, and used as part of ML pipelines, thus making it easier to transform and validate the training data that is fed into machine learning systems. Feature stores can also enable consistent engineering of features between training and inference, but to do so, they...
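A minimal, purely hypothetical sketch of that register/discover/reuse idea; real feature stores add persistence, versioning, and point-in-time correctness on top of this pattern:

```python
# Hypothetical in-memory feature registry illustrating the register /
# discover / reuse pattern; real feature stores persist and version features.
from typing import Callable, Dict
from pyspark.sql import DataFrame

class FeatureRegistry:
    def __init__(self) -> None:
        self._features: Dict[str, Callable[[DataFrame], DataFrame]] = {}

    def register(self, name: str, fn: Callable[[DataFrame], DataFrame]) -> None:
        self._features[name] = fn

    def apply(self, name: str, df: DataFrame) -> DataFrame:
        # The same transformation is reused for training and inference,
        # which is what keeps the two code paths consistent.
        return self._features[name](df)

registry = FeatureRegistry()
registry.register("amount_log",
                  lambda df: df.selectExpr("*", "log(amount + 1) AS amount_log"))
```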
As the usage of Apache Spark continues to ramp up within the industry, a major challenge has been scaling our development. Too often we find that developers are re-implementing a similar set of cross-cutting concerns, sprinkled with some variance of use-case-specific business logic, as a concrete Spark app. The consequences of this anti-pattern are...
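One remedy is to factor the cross-cutting concerns into a reusable job template; the class and hook names below are hypothetical:

```python
# Hypothetical template: cross-cutting concerns (session lifecycle, logging,
# error handling) live in the base class; each app implements only run().
import logging
from abc import ABC, abstractmethod
from pyspark.sql import SparkSession

class SparkJob(ABC):
    def __init__(self, app_name: str):
        self.log = logging.getLogger(app_name)
        self.spark = SparkSession.builder.appName(app_name).getOrCreate()

    @abstractmethod
    def run(self) -> None: ...

    def main(self) -> None:
        try:
            self.run()
        except Exception:
            self.log.exception("job failed")
            raise
        finally:
            self.spark.stop()

class DailyAggregation(SparkJob):
    def run(self) -> None:
        # Only the business logic lives here; everything else is inherited.
        self.spark.read.parquet("/data/events").groupBy("day").count() \
            .write.mode("overwrite").parquet("/data/daily_counts")
```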
Continuing with the objectives to make Spark faster, easier, and smarter, Apache Spark 3.0 extends its scope with more than 3,000 resolved JIRAs. We will talk about the exciting new developments in Spark 3.0 as well as some other major initiatives that are coming in the future. In this talk, we want to share...
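One concrete example among those developments is adaptive query execution. The configuration flags below are real Spark 3.0 settings; the orders/customers DataFrames are illustrative:

```python
# Spark 3.0's adaptive query execution re-optimizes plans at runtime using
# shuffle statistics; it is enabled with a single configuration flag.
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Skew-join handling is a related 3.0 feature, also behind a flag.
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Joins in subsequent queries can now switch strategies mid-execution,
# e.g. from sort-merge to broadcast when one side turns out to be small.
orders.join(customers, "customer_id").groupBy("country").count().show()
```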
Take a walk through the daily struggles of a data engineer in this presentation as we cover what is truly needed to create robust, end-to-end big data solutions.
In financial markets, credit card companies always need to measure risk optimally and understand the performance of products before investing and making strategic decisions. At Capital One we are leveraging technologies to provide end-to-end analytical experiences for modelers and enable self-service solutions for analysts...
Data processing and deep learning are often split into two pipelines, one for ETL processing, the second for model training. Enabling deep learning frameworks to integrate seamlessly with ETL jobs allows for more streamlined production jobs, with faster iteration between feature engineering and model training. The newly introduced Horovod Spark Estimator API enables TensorFlow and...
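A hedged sketch of the Horovod Spark Estimator pattern the abstract describes; the model, column names, and store path are hypothetical, and exact constructor arguments vary by Horovod version:

```python
# Sketch: training a Keras model directly on a Spark DataFrame via Horovod's
# Spark Estimator API. Columns, store path, and the model are hypothetical.
import tensorflow as tf
import horovod.spark.keras as hvd_keras
from horovod.spark.common.store import Store

# Intermediate training data and checkpoints are staged through a store.
store = Store.create("/tmp/horovod")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

estimator = hvd_keras.KerasEstimator(
    num_proc=4,                      # parallel training processes on the cluster
    store=store,
    model=model,
    optimizer=tf.keras.optimizers.Adam(),
    loss="binary_crossentropy",
    feature_cols=["features"],
    label_cols=["label"],
    batch_size=64,
    epochs=5,
)

# fit() runs distributed training over the DataFrame and returns a Spark
# Transformer that can score new DataFrames via keras_model.transform(df).
keras_model = estimator.fit(train_df)
```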
Optimize the performance of Databricks infrastructure by adding Privacera security and compliance workflows. Learn how Privacera’s Apache Ranger-based architecture in the cloud integrates with Databricks Delta Lake to enable a secure multi-tenant framework to efficiently find and easily manage sensitive data with centralized, fine-grained access control.
Tired of spending too much time on manually intensive tasks to onboard and prepare data for your AI and ML projects? A proliferation of loosely integrated point tools and the lack of automation results in a great deal of time spent writing glue code and coordinating tooling, instead of training and operationalizing your ML models....
Leading enterprises continue to drive digital transformation and are modernizing their data architecture to take advantage of the many economic and functional benefits enabled by the cloud. While the move to the cloud is making companies more competitive, lean, and nimble, many technical teams are concerned about the complexities and business risks associated with...
Running a stream in a development environment is relatively easy. However, some topics, if not addressed properly, can cause serious issues for streams in production. In this presentation we cover four of them. The first topic considers what happens if input...
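One such production concern is failure recovery. Here is a minimal Structured Streaming sketch with a checkpoint location; the broker, topic, and paths are hypothetical:

```python
# Minimal sketch: a production stream needs a checkpoint location so state and
# progress survive restarts. Broker, topic, and paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("prod-stream").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load())

query = (events.selectExpr("CAST(value AS STRING) AS body")
         .writeStream
         .format("delta")
         .option("checkpointLocation", "/mnt/checkpoints/events")  # recovery point
         .outputMode("append")
         .start("/mnt/lake/events"))
```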
Python notebooks are great for communicating data analysis and research, but how do you port these data visualizations between the many available platforms (Jupyter, Databricks, Zeppelin, Colab, …)? Also learn how to scale up your visualizations using Spark. This talk will address:
6-8 strategies to render Matplotlib that generalize well
Reviewing the landscape of Python...
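One portability strategy, sketched under the assumption that every target platform can show a saved image or a returned Figure object:

```python
# One strategy that ports across notebook platforms: build an explicit Figure
# and return/display the object instead of relying on global pyplot state.
import matplotlib
matplotlib.use("Agg")  # headless backend renders anywhere, including drivers
import matplotlib.pyplot as plt

def make_plot(xs, ys):
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.plot(xs, ys, marker="o")
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    return fig

fig = make_plot([1, 2, 3], [2, 4, 9])
fig.savefig("plot.png")   # file output works on every platform
# In Jupyter/Colab the returned fig renders inline; Databricks offers display(fig).
```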
Text or image classification done using deep neural networks presents us with a unique way to identify each trained image/word via something known as an 'embedding'. Embeddings are fixed-size vectors learned during the training of a neural network, but it is very difficult to make sense of these raw values. However,...
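A hedged sketch of pulling the learned embedding vectors out of a trained Keras model for inspection; the model architecture and layer name are hypothetical:

```python
# Sketch: extracting learned embeddings from a Keras text model so the
# fixed-size vectors can be inspected; the model itself is hypothetical.
import numpy as np
import tensorflow as tf

vocab_size, embed_dim = 10_000, 64
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embed_dim, name="embedding"),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
# ... model.fit(...) would go here in a real pipeline ...

# The embedding matrix is just the layer's weights: one 64-dim row per token.
vectors = model.get_layer("embedding").get_weights()[0]
print(vectors.shape)  # (10000, 64)

# Cosine nearest neighbors reveal which tokens the network groups together.
def neighbors(token_id, k=5):
    sims = vectors @ vectors[token_id] / (
        np.linalg.norm(vectors, axis=1) * np.linalg.norm(vectors[token_id]))
    return np.argsort(-sims)[1:k + 1]
```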
Organized by Databricks
If you have questions, or would like information on sponsoring a Spark + AI Summit, please contact organizers@spark-summit.org.
Apache, Apache Spark, Spark, and the Spark logo are trademarks of the Apache Software Foundation. The Apache Software Foundation has no affiliation with and does not endorse the materials provided at this event.