Published on August 18th, 2016 | by Warner Chaves0
Beginner’s Guide to Azure SQL Data Warehouse
UPDATE: I have published a 90 minutes course on Pluralsight that goes over the fundamentals of Azure SQL Data Warehouse and has demos of provisioning, designing, loading and monitoring the system. The course can be found HERE.
The power of cloud storage and compute power has put data warehousing in the hands of businesses of all sizes. What was once a large capital expenditure and multi-year implementation can now be deployed and ready to use within minutes and allow any organization to collect, query and discover insights from their structured data sources. Azure SQL Data Warehouse (ASDW) is Microsoft’s entry in this new cloud offering segment and is a new permanent product in the Data Platform ecosystem.
The first thing to keep in mind is that ASDW was designed to be a cloud based system. As such, it aims to be very flexible for resource allocation and very efficient to scale up or down. To meet those goals, the system allows you too:
- Increase or decrease compute power represented by Data Warehousing Units.
- The amount of storage can grow and is charged independently from the compute power.
- The compute power can be completely paused and only storage is payed at that point.
The security model is similar to SQL and it supports Transparent Database Encryption (TDE) and Azure AD authentication.
To implement these elastic capabilities, the system is broken down into three main components:
- Data distributions: tables are split into 60 different storage buckets called distributions.
- Control node: is the connection endpoint and sends and gathers results from the Compute nodes.
- Compute nodes: they are attached to the
To distribute the data in the 60 buckets, there are also different options:
– Hash key distribution: the data is distributed by a column.
– Round robin distribution: the data is distributed round-robin style, putting roughly the same amount of records on each distribution without any consideration on their actual value.
Data Warehousing Units
Compute power in Azure SQL Data Warehouse is defined by Data Warehousing Units. This is very similar to the Database Transaction Unit in Azure SQL Database except it’s more aligned to the power of a distributed warehousing system instead to the OLTP scenario.
The amount of DWUs controls the following:
– How many concurrent queries you can have running.
– How many Compute nodes are part of your warehouse.
– How many parallel readers you get for data loads.
– The size limit of transactions.
One really outstanding capability of Azure SQL Data Warehouse is that you can PAUSE compute completely. If you have a workload that is not 24×7, you can pause all compute and only pay for storage and get maximum savings on the service cost.
Azure SQL Data Warehouse supports the same tools you can use for SQL Server On-premises like SSIS or BCP. However, for the fastest data load you will want to use the feature called PolyBase.
PolyBase works with flat files and it supports parallel readers all working on the same flat file loading data in parallel to the warehouse. This is much faster than loading data through BCP or SSIS because it goes straight from the file into the storage distributions. Loading data with BCP or SSIS, while it works, has a bottleneck in that it flows the data through the Control Node.
The cost of the service has two components: storage and compute. Storage is billed separately by the TB/hour and it includes 7 days of incremental snapshots.
Compute is billed by the amount of DWUs that you provision per hour. If you have two different levels per hour, you will be charged at the highest one.
Amazon has Redshift as its managed data warehouse. Redshift has two weaknesses compared to Azure:
- It is scaled and charged by node. This means that if you just want to increase compute, you will have to pay for the full node regardless or if you just want to add storage you will have to pay for compute since you need to add a full node.
- It does not allow for Pause of compute. This means that if you have idle times you would have to take a snapshot of the cluster, destroy it and then recreate it once you need to query again. Recreating a cluster can take several hours depending on the warehouse size.
Google also has a warehousing service called BigQuery. Just like Azure SQL DW, BigQuery also separates storage from compute and goes one step further: it only charges per query. However, BigQuery charges $5 per TB that you read while doing queries. So if your data warehouse is really busy and has to read a lot of data, this can add up pretty quickly and can end up as a disadvantage. Also, BigQuery does not provide much control over how tables are indexed, everything is always columnar storage. BigQuery also does NOT allow modifying the data after it is loaded, if you need to do any modifications you basically have to create a new table from the contents of the original one which can be expensive in terms of time and will temporarily spike up your storage.
Azure SQL Data Warehouse is a very interesting and exciting new service in the Azure ecosystem and the general Warehouse as a Service offerings in the public cloud.
It has three main advantages: independent scale of compute and storage, ability to pause compute and being built on the solid foundation of SQL Server. It’s a great service to consider not only for SQL Server warehouses migrating to the cloud but for anyone looking at data warehousing in the cloud in general.
Stay tuned for more material on the service coming to the blog, cheers!