The technology of parallel data processing on many connected servers is not new. But in the case of a product like SQL Server, it can arouse interest, and rightly so! Imagine one non-trivial SQL query that has to return complex aggregates from hundreds of billions of rows. In Azure Data Warehouse, it will first go to the primary server – Control Node, where it will be divided into tasks for 60 smaller databases called distributions, which are connected to a specified number of compute servers known as Compute Nodes. The number of computing units depends on our subscription in Azure, where in the simplest variant (DWU100) one Compute Node will perform tasks on all 60 databases, and in the highest (DW6000) each database will be used for tasks individually by 60 computing units.
In addition, if our requirements are even bigger, we can change the compute model to a performance-optimized model (Compute Optimized Performace Tier) with just a few clicks, where the distribution data is kept on local storage caches and tempdb structures based on extremely fast NVMe SSDs. Their performance in action can be viewed here.
The advantages of MPP architecture are also used by the ADFv2 mechanisms. PolyBase support allows you to load data directly into distribution databases in parallel, bypassing Control Node. The parameterization logic, based on the JSON format, allows you to easily and quickly create universal solutions. Automatic column and type mapping help you create generic components for loading data from different sources to stage.
In the ELT approach (Extract, Load, Transform), all transformations are carried out at the warehouse, just after loading the data. It gives us complete freedom to manage changes in logic using pure T-SQL code, but also gives incomparably higher performance compared to traditional SQL Server or even Integration Services. The “CREATE TABLE AS SELECT” statement allows us to create tables easily and quickly. We can also directly add and reload partitions without the need for special functions and schemes, all thanks to the simplified syntax. We just need to remember that our data is already divided into 60 parts (after all, we have 60 different distributions), which is important for performance in Columnstore tables. Of course, for smaller sets we can use a traditional clustered index or leave them as heaps.
The entire flow of loading and processing data can be run directly from Azure Data Factory v2. Thanks to error handling, timeouts, and repeat configuration we can quickly and effectively implement logic for unexpected situations.
Do you like it and need more?
Don’t worry! The last part of this article will be available next week!
Learn it. Know it. Done.