Author – Moulshree Suhas, Cloud Engineer
Azure Databricks is a fast, collaborative, Apache Spark-based analytics platform on Azure. Its interactive workspaces and streamlined workflows enable data scientists, data engineers and business analysts to extract, transform and load big data with ease. It integrates with a variety of data sources and services such as Azure SQL Data Warehouse, Azure Cosmos DB, Azure Data Lake, Azure Blob Storage, Event Hubs and Power BI. Azure Databricks provides a secure, scalable environment with Azure Active Directory integration, role-based access control and machine learning capabilities, at reduced cost on a fully managed cloud platform.
This document walks through an end-to-end Databricks solution to load, prepare and store data.
STEP 1: Create Azure Databricks
- Go to the Azure Marketplace, select ‘Analytics’ and then ‘Azure Databricks’.
- Select a pricing tier and click ‘Review + Create’, then click ‘Create’.
- Click ‘Launch Workspace’; Azure Databricks will open in a new tab.
STEP 2: Create a storage account.
- Create a general-purpose v1 storage account and click ‘Review + Create’.
- Add a container to the storage account.
- Add a file to the storage container.
STEP 3: Create Key vault.
- We will create a key vault to store our secrets. Go to the Azure Marketplace and search for ‘Key Vault’.
- Click on ‘Create’.
- Add a name and disable soft delete. Click Review + Create.
STEP 4: Create secret in Azure Key vault.
- Go to the Azure storage account, click ‘Access keys’, and copy a key to a text editor for later use in the tutorial.
- Go to ‘Secrets’ in the left panel and click on ‘Generate/Import’.
- Enter a name for your secret and add the storage account key as the value. Then click ‘Create’.
- Go to ‘Properties’ and copy both the DNS Name and the Resource ID; store them for future use.
STEP 5: Create a scope in Azure Databricks.
- Go to ‘https://<databricks-instance>#secrets/createScope’. The ‘Create Secret Scope’ page will open.
- Enter a scope name and paste the DNS Name and Resource ID from the previous step.
- The ‘Manage Principal’ drop-down specifies whether All Users or only the Creator have MANAGE permission for this secret scope.
- The MANAGE permission allows users to read and write to this secret scope. The Azure Databricks Premium plan lets you change the permissions for the scope.
- Click ‘Create’.
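Once the scope exists, you can sanity-check it from a notebook cell. The following is a minimal sketch: the scope name ‘myscope’ is a placeholder for whatever you entered on the createScope page, and the dbutils handle (predefined in every Databricks notebook) is passed in explicitly so the helper is easy to exercise elsewhere:

```python
def list_scope_secrets(dbutils, scope="myscope"):
    """Return the names of the secrets in a scope (the values themselves stay hidden).

    In a Databricks notebook, call list_scope_secrets(dbutils).
    'myscope' is a placeholder for the scope name you created above.
    """
    # dbutils.secrets.list returns metadata objects; .key is the secret's name.
    return [item.key for item in dbutils.secrets.list(scope)]
```

If the scope was created correctly, the list will contain the secret name you chose in Step 4.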
STEP 6: Create a cluster and notebook on Azure Databricks.
- Click ‘Clusters’ in the left panel, click ‘Create Cluster’, configure the cluster and create it.
- Go to Home and click ‘New Notebook’. Select ‘Python’ as the primary language and click ‘Create’.
- A notebook for writing code will be displayed. At the top left, below the notebook name, click the ‘Detached’ drop-down and select the cluster you want to run the notebook on.
STEP 7: Run code.
1. Mount the Azure Blob storage.
2. Create a DataFrame and display it. The DataFrame can have any name; here it is ‘Air1’.
3. Apply a filter transformation and store the result in a table.
4. Refresh the table and get the row count using SQL. To run another language in the notebook, use ‘magic commands’: start the cell with %language (for example, %sql).
5. Write the data to an Azure SQL database.
- Column ‘_c50’ will be renamed to ‘tab_col’.
6. Connect to the Azure SQL server and run a command to count the rows.
7. Set up and test the connection to the Azure SQL database via JDBC.
Note: The jdbcUsername and jdbcPassword also need to be stored in Azure Key Vault, and a scope needs to be created for them, as shown in steps 4 and 5.
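The original screenshots of these notebook cells are not reproduced here, so the following is a minimal PySpark sketch of steps 1–7, collected into one function. Every name in it is a placeholder: storage account ‘mystorage’, container ‘airdata’, file ‘airlines.csv’, secret scope ‘myscope’, storage secret ‘storage-key’, SQL server ‘myserver’, database ‘mydb’ and the intermediate table ‘air_filtered’. Only ‘_c50’, ‘tab_col’, ‘jdbcUsername’, ‘jdbcPassword’ and ‘Table1example’ come from the tutorial itself; spark and dbutils are the handles predefined in a Databricks notebook:

```python
# Placeholder configuration -- substitute your own resource names.
ACCOUNT = "mystorage"                 # hypothetical storage account
CONTAINER = "airdata"                 # hypothetical container
MOUNT_POINT = f"/mnt/{CONTAINER}"
SOURCE = f"wasbs://{CONTAINER}@{ACCOUNT}.blob.core.windows.net"
CONF_KEY = f"fs.azure.account.key.{ACCOUNT}.blob.core.windows.net"
JDBC_URL = ("jdbc:sqlserver://myserver.database.windows.net:1433;"
            "database=mydb;encrypt=true;loginTimeout=30;")

def run_pipeline(spark, dbutils):
    """Steps 1-7 as one function; in a notebook each numbered part is its own cell."""
    # 1. Mount the Blob container, reading the account key from the secret scope.
    dbutils.fs.mount(
        source=SOURCE,
        mount_point=MOUNT_POINT,
        extra_configs={CONF_KEY: dbutils.secrets.get(scope="myscope", key="storage-key")},
    )

    # 2. Create a DataFrame from the uploaded file and display it.
    Air1 = spark.read.csv(f"{MOUNT_POINT}/airlines.csv", header=True, inferSchema=True)
    Air1.show(5)

    # 3. Apply a filter transformation and store the result in a table.
    filtered = Air1.filter(Air1["_c50"].isNotNull())
    filtered.write.mode("overwrite").saveAsTable("air_filtered")

    # 4. Refresh the table and count rows with SQL
    #    (in a notebook, a %sql magic cell can run the same statements).
    spark.sql("REFRESH TABLE air_filtered")
    spark.sql("SELECT COUNT(*) FROM air_filtered").show()

    # 5./6./7. Rename '_c50' to 'tab_col', write to Azure SQL over JDBC with
    # credentials from Key Vault, then read the row count back to test the link.
    creds = {
        "user": dbutils.secrets.get(scope="myscope", key="jdbcUsername"),
        "password": dbutils.secrets.get(scope="myscope", key="jdbcPassword"),
    }
    out = filtered.withColumnRenamed("_c50", "tab_col")
    (out.write.format("jdbc")
        .option("url", JDBC_URL).option("dbtable", "Table1example")
        .options(**creds).mode("overwrite").save())
    (spark.read.format("jdbc")
        .option("url", JDBC_URL)
        .option("dbtable", "(SELECT COUNT(*) AS n FROM Table1example) AS t")
        .options(**creds).load().show())
```

Because the storage key and JDBC credentials are fetched with dbutils.secrets.get, no secret ever appears in the notebook in plain text, which is the point of steps 3–5.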
STEP 8: Verify the table.
- Open SSMS, connect to the Azure SQL Database, and verify the table with the following query:
SELECT * FROM Table1example