Cataloging data for a lakehouse
Providing seamless access across the platform requires a strong catalog service
Using AWS Glue as a catalog for Databricks
To discover data across all your services, you need a strong catalog that makes data easy to find and access. AWS Glue is a serverless, Apache Hive-compatible metastore that allows you to easily share table metadata across AWS services, applications or AWS accounts. Databricks and Delta Lake are integrated with AWS Glue to discover data in your organization, to register data in Delta Lake and to discover data between Databricks instances.
Benefits
Databricks comes pre-integrated with AWS Glue
Simple
Simplifies manageability by using the same AWS Glue catalog across multiple Databricks workspaces.
Secure
Secures metadata access in AWS Glue with AWS Identity and Access Management (IAM) credential passthrough. For a detailed explanation, see the Databricks blog introducing Databricks AWS IAM Credential Pass-Through.
Collaborative
Provides easier access to metadata across Amazon services and to data cataloged in AWS Glue.
Databricks Delta Lake integration with AWS core services
This reference implementation illustrates how the Databricks Delta Lake integration with AWS core services helps you solve your most complex data lake challenges. Delta Lake runs on top of S3 and is integrated with Amazon Kinesis, AWS Glue, Amazon Athena, Amazon Redshift and Amazon QuickSight, to name a few.
If you are new to Delta Lake, you can learn more here.
Integrating Databricks with AWS Glue
STEP 1
How to configure a Databricks cluster to access the AWS Glue Catalog
Launch
First, launch the Databricks computation cluster with the necessary AWS Glue catalog IAM role. The IAM role and policy requirements are outlined step by step in the Databricks AWS Glue as Metastore documentation.
In this example, create an AWS IAM role called Field_Glue_Role, which also has delegated access to your S3 bucket. Attach the role to the cluster configuration, as depicted in the demo video.
Update
Next, the Spark configuration properties of the cluster must be set prior to cluster startup, as shown in the how-to-update video and sketched below.
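As a minimal sketch, the key cluster configuration property is the Glue catalog flag from the Databricks documentation; the second line is only needed when the Glue catalog lives in a different AWS account than the workspace (the account ID shown is a placeholder):

```
spark.databricks.hive.metastore.glueCatalog.enabled true
spark.hadoop.hive.metastore.glue.catalogid 123456789012
```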
STEP 2
Setting up the AWS Glue database using a Databricks notebook
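As a minimal sketch of this step (the database name and S3 location are illustrative assumptions), create the database from a notebook cell; because the cluster points at the Glue catalog, the database is registered in AWS Glue:

```python
# Create a database in the shared AWS Glue catalog from a Databricks
# notebook. `spark` is the SparkSession Databricks provides in notebooks.
spark.sql("""
    CREATE DATABASE IF NOT EXISTS movies_db
    LOCATION 's3a://aws-airlifts/movies_db/'
""")
spark.sql("USE movies_db")
```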
STEP 3
Create a Delta Lake table and manifest file using the same metastore
Create and catalog
Create and catalog the table directly from the notebook into the AWS Glue data catalog. Refer to Populating the AWS Glue data catalog for creating and cataloging tables using crawlers.
The demo data set here is from a movie recommendation site called MovieLens, which comprises movie ratings. Create a DataFrame with this Python code.
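A sketch of that code, assuming the MovieLens ratings file has been staged in S3 (the path is a placeholder):

```python
# Load the MovieLens ratings into a DataFrame and expose it as a
# temporary view so it can be referenced from SQL in the next step.
ratings = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("s3a://aws-airlifts/movielens/ratings.csv")
)
ratings.createOrReplaceTempView("ratings_temp")
```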
Delta Lake
Now create a Delta Lake table using the temporary table created in the previous step and a SQL command like the one sketched below.
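A minimal sketch of such a command, run here through spark.sql (the table name and S3 location are illustrative assumptions, kept consistent with the manifest path used later):

```python
# Create an external Delta Lake table from the temporary view. Because the
# metastore is AWS Glue, the table is cataloged there as well.
spark.sql("""
    CREATE TABLE IF NOT EXISTS movies_db.movie_ratings
    USING DELTA
    LOCATION 's3a://aws-airlifts/movies_delta/'
    AS SELECT * FROM ratings_temp
""")
```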
Note: It’s very easy to create a Delta Lake table, as described in the Delta Lake Quickstart Guide.
Generating a manifest for Amazon Athena
Now generate the manifest file required by Amazon Athena using the following steps.
1. Generate manifests by running this Scala method (both steps are sketched in Python after this list). Remember to prefix the cell with %scala if you have created a Python, SQL or R notebook.
2. Create a table in the Hive metastore connected to Athena using the special format SymlinkTextInputFormat and the manifest file location.
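A hedged sketch of both steps in Python: the Scala method referenced above has a Python equivalent on the DeltaTable API, and the Athena table DDL can be submitted through spark.sql (or run directly in the Athena query editor, since both share the Glue catalog). The table names, MovieLens column list and S3 paths are illustrative assumptions:

```python
from delta.tables import DeltaTable

# Step 1: generate the symlink manifest that Athena reads. In a Scala cell
# this is DeltaTable.forPath(spark, path).generate("symlink_format_manifest").
delta_table = DeltaTable.forPath(spark, "s3a://aws-airlifts/movies_delta/")
delta_table.generate("symlink_format_manifest")

# Step 2: register a table over the manifest using SymlinkTextInputFormat,
# so Athena lists data files via the manifest rather than the raw S3 path.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS movies_db.movie_ratings_athena (
        userId INT, movieId INT, rating DOUBLE, `timestamp` BIGINT
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION 's3a://aws-airlifts/movies_delta/_symlink_format_manifest/'
""")
```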
In the sample code, the manifest file is created in the s3a://aws-airlifts/movies_delta/_symlink_format_manifest/ location.
STEP 4
Query the Delta Lake table using Amazon Athena
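This step is typically run in the Athena console; as a hedged sketch, the same query can also be submitted programmatically through the Athena API with boto3 (the region, database, table and results bucket are illustrative assumptions):

```python
import time
import boto3

# Submit a query to Athena against the Glue-cataloged Delta table and poll
# until it completes. Athena requires an S3 location for query results.
athena = boto3.client("athena", region_name="us-west-2")

query_id = athena.start_query_execution(
    QueryString="SELECT rating, COUNT(*) AS cnt "
                "FROM movie_ratings_athena GROUP BY rating",
    QueryExecutionContext={"Database": "movies_db"},
    ResultConfiguration={"OutputLocation": "s3://aws-airlifts/athena-results/"},
)["QueryExecutionId"]

while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    # Print the first page of results (the header row comes first).
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```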
Conclusion
Integrating with AWS Glue provides a powerful serverless metastore strategy for any enterprise in the AWS ecosystem. Delta Lake elevates the reliability of your data lake, and the integration with Amazon Athena provides seamless, serverless access to the data. Together, the Databricks Lakehouse Platform powers a data lake strategy on AWS that gives data analysts, data engineers and data scientists performant, reliable access to data.