
Azure Batch Data Process – Covid Dataset

In this blog post, we will see how to create a batch data process through Azure Data Engineering, with a COVID dataset as the sample. If you are a beginner, this post will give you a high-level idea of batch data processing in Azure.

Prerequisites

  • An active Azure subscription. If you don't have one, create a free account.
  • Contributor or Owner access at the subscription level. You can check your access under Subscriptions -> Access Control (IAM) -> View my access (type "Subscriptions" in the portal's global search box to get there).

Sample Steps – Azure Batch Data Process

1 – CSV file data


Get the above sample dataset from the GitHub repo.

2 – CSV file in Azure Blob Storage

  • Create an Azure Blob Storage account (enable the hierarchical namespace if creating ADLS). We can use either Blob or ADLS; for huge volumes, ADLS with a folder structure is recommended.
  • Since we created it as Blob Storage, we can upload files directly through the Azure portal itself. If we had created it as ADLS, we could upload our local files using either Azure Storage Explorer [a downloadable desktop tool] or AzCopy v10 (Preview).

3 – Scala script at ADB

  • Create an Azure Databricks workspace (the Trial pricing tier gives free DBUs [Databricks Units] for up to 14 days).
  • Launch the workspace.
  • Create a Standard cluster.
  • Create a notebook.
  • Use Scala scripts, with SQL in between via the %sql magic command (see the example after this list). Python also works, depending on personal preference.
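
As a quick illustration of the magic command: putting %sql on the first line of a cell runs that one cell as SQL instead of Scala (covid_vw here is a hypothetical temp view name, registered in a later step):

    %sql
    -- this single cell runs as SQL; the rest of the notebook stays Scala
    SELECT * FROM covid_vw LIMIT 10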

In the Azure Databricks notebook:

  • Create the connection string for the source [Blob/ADLS].
  • Mount the file into Databricks and load it as a DataFrame.
  • Make the necessary column-name changes, then register the result as a SQL temp view.
  • Apply a normal SQL GROUP BY query with aggregated counts, as in the sketch below.
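
Putting those steps together, here is a minimal Scala sketch of the notebook flow. The storage account, container, secret scope, file name, and column names below (covidstore, covid_dataset.csv, Country/Region, Confirmed) are illustrative placeholders, not the exact values from the repo:

    // Source Blob Storage details -- hypothetical placeholders, substitute your own
    val storageAccount = "covidstore"
    val containerName  = "coviddata"
    val accessKey      = dbutils.secrets.get(scope = "demo-scope", key = "storage-key")

    // Mount the container into DBFS (skip if it is already mounted)
    dbutils.fs.mount(
      source = s"wasbs://$containerName@$storageAccount.blob.core.windows.net/",
      mountPoint = "/mnt/covid",
      extraConfigs = Map(
        s"fs.azure.account.key.$storageAccount.blob.core.windows.net" -> accessKey))

    // Read the CSV into a DataFrame, inferring the schema from the header row
    val covidDf = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/mnt/covid/covid_dataset.csv")

    // Rename awkward columns, then register a temp view so plain SQL works on it
    val renamedDf = covidDf.withColumnRenamed("Country/Region", "Country")
    renamedDf.createOrReplaceTempView("covid_vw")

    // Normal SQL GROUP BY with aggregated counts
    val summaryDf = spark.sql(
      """SELECT Country, SUM(Confirmed) AS TotalConfirmed
        |FROM covid_vw
        |GROUP BY Country
        |ORDER BY TotalConfirmed DESC""".stripMargin)
    summaryDf.show()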

If you are interested in reusing or practicing the above Scala script/code, visit the Arulmouzhi/AzureStuffs repo.

4 – SQL Query at Azure SQL DB

  • Create the destination DB and set the network/firewall options.
  • Create the connection string for the destination [SQL DB, Synapse, or ideally any preferred DB].
  • Load the data as a table in the destination DB, as in the sketch below.
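
As a rough sketch of that load (the server, database, table, and credential names are hypothetical placeholders; summaryDf is the aggregated DataFrame from the notebook sketch above):

    // Destination Azure SQL DB -- hypothetical server/db/user, substitute your own
    val jdbcUrl = "jdbc:sqlserver://covid-demo-srv.database.windows.net:1433;database=coviddb"

    val connectionProps = new java.util.Properties()
    connectionProps.put("user", "sqladmin")
    connectionProps.put("password", dbutils.secrets.get(scope = "demo-scope", key = "sql-password"))
    connectionProps.put("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")

    // Write the aggregated DataFrame as a table in the destination DB
    summaryDf.write
      .mode("overwrite") // recreate the table on each batch run
      .jdbc(jdbcUrl, "dbo.CovidCountrySummary", connectionProps)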

5 – Power BI visual at Power BI Desktop

  • Create the Power BI visualization using the world map visual, with Aerial as the map type/view.

Notes

  • Try to keep the region the same across the different services, and use locally-redundant storage [LRS] as the replication option for the cheapest/fastest storage (since this is a sample, high availability is not considered). When data is ingested from Azure Storage [Blob/ADLS] across Azure regions, bandwidth costs apply; any data transfer across regions has an associated cost.
  • Set the cluster mode to Standard (High Concurrency is meant for multiple users; the new Single Node mode is also a fine choice). Set the min/max workers as low as possible (I set 1-2), set "Terminate after x minutes of inactivity" (e.g., x = 30), and set a compute-optimized worker type (e.g., Standard_F4s).
  • Create the SQL DB in the Basic purchase tier under the DTU model; it is cheap compared with the other purchase tiers if you prefer MS SQL. Add your IP to the firewall and enable the "Allow Azure services and resources to access this server" option.
  • Use the Power BI Desktop version (recommended but not mandatory).
  • Once the experiments are done, we can delete the resource group, so that all the Azure resources within it get deleted automatically.

Q: Why use Spark/ADB in between when Power BI itself supports CSV directly?

  • Here, the file is not that huge. But for huge batch data, we need high computing power like ADB's Spark, and we also need to do the required transformations. So, for practice, ADB (Spark) is added in between.

Recent Related Posts

Azure Resource Lock

We can even apply an Azure Resource Lock, which prevents accidental deletion or modification of resources.

Summary

Thus, we saw at a high level how to do batch data processing in Azure, along with the related settings and options. Here we used Azure services such as Azure Blob Storage, Azure Databricks, Azure SQL DB, and Power BI.

Last but not least, the idea behind this post is Shashank Mishra 🇮🇳's #NoobDE_Challenge. The challenge actually uses open-source tech stacks/NoSQL for the batch data process; since it would take me a long time to learn the open-source stack from the ground up, I tried the same thing the Azure way.

Follow the blog and show your support for many more interesting upcoming posts!
