The Notebook based environment you see here is Databricks. You can edit, run and share code and documentation and visualizations and output in many languages. The Notebook can be attached to existing clusters, or if necessary, a new one can be spun up.
We will look at the problem in three parts: data access and preparation, modeling and interpretation, and finally, deployment. We’re going to only briefly review the work of the data engineer in Databricks here. Her goal is to make raw data usable. This includes fixing errors in the input, or standardizing representations and joining disparate data to produce tables ready for use by modelers and analysts.
Here the input are just CSV files from the World Health Organization and World Bank, and an additional drug overdose data set that we included to explore as a factor. And they can be read directly, “as is” from distributed storage as Spark DataFrames. Note that these data sources could just as easily have been a SQL database or JSON files or Parquet files, and so on. The schema is automatically read, or inferred in the case of CSV.
These files contain about 2000 different health and demographic features for 16 developed countries over the last several decades. The goal will be to learn how these features predict life expectancy, and which one or two are most important.
So the data engineers might prefer to work in SQL and Scala as here. Databricks supports these, in addition to Python and R, which the data scientists might prefer. All may be used even within one notebook.
And along the way, even the data engineers can query the data with SQL, and see the results with built in visualizations like this one. They might want to take an early look at the data, and we see that clearly the trend in life expectancy between 2000 and 2016 looks different for the USA. It’s low and declining. And the question will be why.
So at the end of the data engineering workflow, the three data sources are joined on country and year, and written as a registered Delta Lake table. So to the data scientist, this representation doesn’t make much direct difference. It’s just a table that can be read like any other data source. But Delta Lake provides transactional writes, and lets the data engineers update and fix data from a bad feed, or gracefully modify the schema or enforce certain constraints on the data. And that does matter to data scientists.
Anyone who’s faced modeling on dumps of data from a database that might break or change in subtle or silent ways, will appreciate the real data science problems that these factors can cause downstream, and that Delta Lake helps fix.
Or, for example, consider the need to recall exactly what data was used to produce a model for governance or reproducibility. Managing the data as a Delta Lake table allows the data scientist to query the data as of a previous point in time. And this helps with reproducibility of course. Note that everyone can collaborate on this notebook, perhaps leaving comments like this one.
So here we pick up from the world of the data scientist. Her goal is to explore the data, further enrich and refine it for this specific analysis here, and produce another table of featurized data for modeling and production deployment. So the collaboration began simply: read this table of data that the data engineers produced. This notebook uses Python, and the ecosystem may be more familiar and useful for data science. The Spark API is the same and the data tables are available in exactly the same way, however.
Now you don’t have to use just Spark and Databricks only. For example, here we dropped to Pandas, to fill in some missing values, before returning to Spark. You can define efficient UDFs, or user defined functions, for Spark that leverage Pandas too.
The Python ecosystem offers libraries for visualization too, and with Databricks, likewise, you can just use these. Common libraries like Matplotlib and Seaborn are already built into the ML Runtime, and others can be added easily.
Here the data scientist uses Seaborn to generate a pair plot of a few of the features such as life expectancy, literacy rate, and opioid deaths per capita. It can highlight correlations, such as that between per capita expenditure on healthcare and GDP and it also reveals a clear outlier. So the outlier here with respect to opioid deaths is going to turn out to be the United States again.
After some additional featurization, the data is written as another Delta Lake table.
Now, the data set is small enough that it’s possible to manipulate and model with tools like XGBoost and Pandas, all of which are available already in the Runtime. But Spark will still be relevant in a moment.
The data scientist is here developing modeling code using XGBoost. It will regress life expectancy as a function of the thousand or so features left in the input. But building a model really means building hundreds of them in order to discover the optimal settings of the model’s hyperparameters.
Typically, a data scientist might perform a grid or random search over these values, and wait hours while the platform crunches through each of them serially. Databricks, however, provides a Bayesian optimization framework called HyperOpt in its runtime, which is one modern and efficient way to perform the search in parallel among others.
So given a search space, Hyperopt runs variations on the model in parallel across a cluster, learning as it goes which settings give increasingly better results. HyperOpt can run these variations in parallel using Spark, even though the simple modeling here doesn’t need Spark itself. This parallelism can dramatically lower the wall clock time that these hyperparameter searches take.
The results are automatically tracked using MLflow in Databricks. This gives a quick view of the modeling runs that were created in this notebook. And we can drill into the experiment view.
This is the MLflow tracking server, and this experiment shows a more detailed overview of these trials. We can use this to search through runs and even compare them as here. For example, we might want to compare all of the runs using a parallel coordinate plot. And in this way, figure out which of the combinations of hyperparameters seem to produce the best loss.
This detail view also shows, for example, who created the model and with what revision of the notebook, when it was created, and all the details of the hyperparameters including loss. It also includes the model itself, along with feature importance plots that the data scientist created.
The model can now be registered with the Model Registry as the current “staging candidate” model for further analysis. The Model Registry is one centralized repository of logical models that are being managed by Databricks. It manages artifacts and versions of this model, and manages their promotion through staging to production.
This particular model has a few versions registered as versions of the same logical model, and they can exist in states like Staging and Production. The Model Registry is also accessible in the left nav bar.
Instead of managing models, as for example, a list of coefficients written down in some file, or a pickle file stored on a shared drive, the Model Registry puts a more formal workflow around tracking not just the artifacts of the model, but which ones are ready for which stage of production deployment.
So next, a manager might review the model and the plots. Here’s a feature importance plot created by Shap, and it shows that the feature that most influences predicted life expectancy is mortality from cancer, diabetes, and heart disease. Low mortality, in blue, indicates higher life expectancy and appears to explain plus one to maybe minus one point five years of life expectancy. And “year” is next most important, which unsurprisingly captures many of the cumulative effects of better health over time.
So notice that drug related deaths do not appear to be a top explanatory feature overall — other diseases still dominate. So after reviewing the model, and other plots, the manager might finally approve this model for production.
The deployment engineer takes over here. She loads the latest production model from the Model Registry, the one that has been approved for deployment. But MLflow automatically converts it, if desired, to a Spark UDF, or user defined function. This means it can be applied to featurize data at scale with Spark with just one line of code.
Now, this could work equally well in a batch scoring job or with a streaming job as well. So for a second, compare this to for example, handing the model or even just the coefficients to a software engineer to attempt to correctly re-implement as code that can be run in production. Here the exact model that the data scientist created is made available to production engineers, and nothing is lost in translation.
So in this production job for example, the data is applied to inputs from 2017 to 2018, for which life expectancy figures are not known. And this can be joined with data up to 2016 to complete the plot we saw earlier, showing the extrapolated trends.
Now, notice that this could be registered not only as a UDF in Python, but also in SQL as well. So note that the predictions appear fairly flat. But this is mostly because over half the feature data is missing in later years. And this model could also have been deployed as a REST API or as a service in Amazon Sagemaker or Azure ML.
So this has been a simple example of how Databricks can power the whole lifecycle from data engineering through to data science and modeling, and finally the MLOps tasks like Model Management and deployment.