In this blog post, we will see how to build a batch data process in Azure Data Engineering, using a COVID dataset as a sample. If you are a beginner, this post will give you a high-level idea of batch data processing in Azure.
- An active Azure subscription. If you don't have one, create a free account.
- Contributor or Owner access at the subscription level. We can check our access under Subscriptions → Access Control (IAM) → View my access (as shown below).
Sample Steps – Azure Batch Data Process
1 – CSV file data
Get the above sample dataset from the GitHub repo.
2 – CSV file at Blob
- Create an Azure Blob Storage account (enable the hierarchical namespace if creating ADLS). We can use either Blob or ADLS; for large volumes, ADLS with a folder structure is recommended.
- Since we created it as Blob Storage, we can upload files directly through the Azure portal. If we had created it as ADLS, we could upload our local files using either Azure Storage Explorer (a downloadable desktop application) or AzCopy v10 (Preview).
3 – Scala script at ADB
- Create an Azure Databricks workspace (the Trial pricing tier gives free DBUs [Databricks Units] for up to 14 days).
- Launch the workspace.
- Create a Standard cluster.
- Create a notebook.
- Use Scala scripts, with SQL in between via the magic command (%sql). [Python can also be used, depending on preference.]
In the Azure Databricks notebook,
- Create connection strings for the source [Blob/ADLS]
- Mount the file into Databricks and convert it to a DataFrame
- Make the necessary column-name changes, then register it as a temporary SQL view
- Apply a normal SQL GROUP BY query with aggregated counts
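The notebook steps above can be sketched roughly as follows. This is a minimal Scala sketch, not the post's actual script: the storage account name, secret scope, file name, and column names (`Country/Region`, `confirmed`) are all placeholder assumptions you would replace with your own.

```scala
// Mount the source container (Blob/ADLS) into DBFS.
// "covidstore" and the secret names are placeholders for your own account.
dbutils.fs.mount(
  source = "wasbs://covid@covidstore.blob.core.windows.net/",
  mountPoint = "/mnt/covid",
  extraConfigs = Map(
    "fs.azure.account.key.covidstore.blob.core.windows.net" ->
      dbutils.secrets.get(scope = "my-scope", key = "storage-key")
  )
)

// Read the CSV into a DataFrame and rename columns as needed.
val covidDf = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/mnt/covid/covid_data.csv")
  .withColumnRenamed("Country/Region", "country")

// Register a temp view so plain SQL (or %sql cells) can query it.
covidDf.createOrReplaceTempView("covid")

// A normal GROUP BY query with aggregated counts.
val summary = spark.sql(
  "SELECT country, SUM(confirmed) AS total_confirmed FROM covid GROUP BY country"
)
display(summary)
```

In a notebook, the same GROUP BY could also be run directly in a `%sql` cell against the `covid` temp view.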
If you are interested in reusing or practicing the above Scala script/code, visit the Arulmouzhi/AzureStuffs repo.
4 – SQL query at Azure SQL DB
Create the destination DB and set the network/firewall options.
- Create connection strings for the destination [SQL DB/Synapse/ideally any preferred DB]
- Load the data as a table in the destination DB
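Loading the aggregated DataFrame into the destination DB can be sketched with Spark's built-in JDBC writer. Again a hedged sketch: the server, database, table, and credential names below are placeholders, and `summary` is assumed to be the aggregated DataFrame from the notebook step.

```scala
// Connection details for the destination Azure SQL DB (placeholders).
val jdbcUrl =
  "jdbc:sqlserver://myserver.database.windows.net:1433;database=coviddb"

val connProps = new java.util.Properties()
connProps.setProperty("user", "sqladmin")
connProps.setProperty("password",
  dbutils.secrets.get(scope = "my-scope", key = "sql-password"))
connProps.setProperty("driver",
  "com.microsoft.sqlserver.jdbc.SQLServerDriver")

// Write the summary DataFrame as a table in the destination DB;
// "overwrite" replaces the table if it already exists.
summary.write
  .mode("overwrite")
  .jdbc(jdbcUrl, "dbo.covid_summary", connProps)
```

Pulling credentials from a Databricks secret scope (rather than hard-coding them in the notebook) keeps the connection string safe to share.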
5 – PBI visual at Power BI Desktop
- Create a Power BI visualization using the world map visual with Aerial as the map type/view.
- Try to keep regions the same across the different services, and use locally-redundant storage (LRS) as the replication option for the cheapest/fastest storage (since this is a sample, high availability is not considered). When data is ingested from Azure Storage [Blob/ADLS] across Azure regions, we incur bandwidth costs; any data transfer across regions has an associated cost.
- Set the cluster mode to Standard (High Concurrency is for multiple users; Single Node is a newer option that is also fine), set the min/max workers as low as possible (I set 1–2), set termination after x minutes of inactivity (e.g., x = 30), and set the worker type to compute optimized (e.g., Standard_F4s).
- Create the SQL DB in the Basic purchase tier under the DTU model, which is cheap compared with other purchase tiers if you prefer MS SQL. Add your firewall rule, and enable the 'Allow Azure services and resources to access this server' option.
- Use the Power BI Desktop version (recommended but not mandatory).
- Once the experiments are done, we can delete the resource group, so that all the Azure resources within it are deleted automatically.
Q: Why Spark/ADB in between, when Power BI itself supports CSV directly?
- Here, the file is not that huge. But for large batch data, we need high computing power like ADB's Spark, and we need to perform the required transformations as well. So, for practice, ADB (Spark) is added in between.
Recent Related Posts
- PowerShell way to Create New Azure SQL Database
- T-SQL way to ADD and EDIT Client IP Address in Azure SQL DB
- Azure CLI way to Create New Azure SQL Database
- Create New Azure SQL Database
Azure Resource Lock
We can even apply an Azure Resource Lock, which prevents accidental deletion or modification of the resources.
Thus, we saw at a high level how to do batch data processing in Azure, along with the related settings and options. Here we used Azure services such as Azure Blob Storage, Azure Databricks, Azure SQL DB, and Power BI.
Last but not least, the idea behind this post is Shashank Mishra 🇮🇳's #NoobDE_Challenge. The challenge actually uses open-source tech stacks/NoSQL for the batch data process; since it would take me much time to learn the open-source stack from the ground up, I tried the same in the Azure way.