Published on December 7th, 2016 | by Warner Chaves
SQL On The Edge #11 – Azure Data Lake Fundamentals
It is no secret that we are generating increasing amounts of data every day. The hundreds of apps on the phones we carry with us every day amass a collection of data from connected devices. Hundreds and thousands of different computer systems enable the operations of every single company on the planet. We are at the point where we generate more data than we sometimes know what to do with. Previously, organizations would, best case, archive it and, worst case, simply delete it and carry on. The question, though, is this: what if that data actually held valuable insights that could provide tangible, practical benefits to the operation of the business?
What is Azure Data Lake?
Azure Data Lake (ADL) is a Platform as a Service offering on Microsoft Azure. The service is split into two independent but closely related subservices: Azure Data Lake Store and Azure Data Lake Analytics.
Azure Data Lake Store
Azure Data Lake Store was built to overcome some of the limitations of regular Azure Blob Storage. First, there are no limits on the number of files, the size of a file, or the aggregate size of all files in the account. Second, the storage is optimized for Big Data and analytics workloads, which means lots of writes and large read scans.
A great feature of ADL Store is that it provides a WebHDFS-compatible API, so you can create an HDInsight cluster, attach the ADL store and then run the entire set of Hadoop tools on your data.
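To make the WebHDFS point a little more concrete, here is a minimal Python sketch of how a WebHDFS-style REST URL is put together. The `/webhdfs/v1/` prefix and the `op` query parameter follow the WebHDFS REST convention. The store name `myadlstore`, the `azuredatalakestore.net` host suffix and the `/logs/2016/12` path are assumptions for illustration only, and a real request would also need an OAuth bearer token, which is left out here.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host: str, path: str, op: str, **params: str) -> str:
    """Build a WebHDFS-style REST URL for a file system operation.

    host   -- store endpoint (hypothetical in the example below)
    path   -- file or folder path inside the store
    op     -- WebHDFS operation name, e.g. LISTSTATUS or OPEN
    params -- any extra query parameters for the operation
    """
    query = urlencode({"op": op, **params})
    # quote() keeps "/" intact by default, so nested folders survive.
    return f"https://{host}/webhdfs/v1/{quote(path.lstrip('/'))}?{query}"


# Hypothetical store name and folder, used only to show the URL shape.
listing_url = webhdfs_url(
    "myadlstore.azuredatalakestore.net", "/logs/2016/12", "LISTSTATUS"
)
print(listing_url)
```

The sketch only builds the URL and does not call the service. Because the URL follows the same convention Hadoop tools already use, those tools can talk to the store without a custom connector.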
That is the beauty of a service like Azure Data Lake: you can store and accumulate data at a low cost in the cloud and not pay anything to analyze it until you’re ready to do so. With a managed Hadoop cluster, whether on Azure, AWS or GCP, you pay for the compute for as long as the cluster is up, even if you’re not actively using it. Azure Data Lake takes this concept to the next level with ADL Analytics.
Azure Data Lake Analytics
For the analytics piece of the service, Microsoft wanted users to have an experience that was cost efficient to scale and use, and easy to pick up for data pros that were not Hadoop experts.
To achieve the cost efficiency piece, the analytics service is 100% on-demand: there is no cluster to deploy, and none to destroy when you’re done. You submit an analytics job and you pay by the minute based on the compute power you assign to the job. Compute power is measured in Analytics Units (AUs); each unit represents a compute container that can process data in parallel.
With the service, you can accumulate data continuously throughout the month and pay only for storage. You can then fire up an end-of-month analytics job, get your aggregation or any other results, and pay for compute only for that one job.
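The billing difference is easy to see with some quick arithmetic. The Python sketch below compares a single on-demand job (units × minutes × price) with a cluster that stays up all month (nodes × hours × price). Every number in it is made up for illustration: the prices, the unit count, the job duration and the cluster size are all assumptions, not actual Azure rates.

```python
from decimal import Decimal


def on_demand_job_cost(units: int, minutes: int,
                       price_per_unit_minute: Decimal) -> Decimal:
    """Compute cost of one job billed per minute, per unit assigned."""
    return units * minutes * price_per_unit_minute


def always_on_cluster_cost(nodes: int, hours_up: int,
                           price_per_node_hour: Decimal) -> Decimal:
    """Compute cost of a cluster billed for every hour it is up."""
    return nodes * hours_up * price_per_node_hour


# Hypothetical figures, NOT real pricing:
# one end-of-month job using 20 units for 45 minutes.
job = on_demand_job_cost(20, 45, Decimal("0.03"))

# Hypothetical 4-node cluster left running for a 30-day month (720 hours).
cluster = always_on_cluster_cost(4, 720, Decimal("0.40"))

print(f"on-demand job: {job}")
print(f"always-on cluster: {cluster}")
```

With these invented numbers the job costs 27.00 and the cluster 1152.00. The exact figures do not matter; the point is that the job's cost depends only on how long it runs and how many units it uses, while the cluster's cost depends on how long it exists.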
To achieve the second objective of ease of use for data pros, Microsoft released a new language to run analytics called U-SQL.
What is U-SQL?
U-SQL is the language used for writing ADL Analytics jobs. It’s a mix of T-SQL and C#, so it is instantly familiar to anyone who has done any SQL or object-oriented programming with C# or Java. U-SQL makes it easy to approach the service without having to be aware of all the different tools and methods for running analytics in the Hadoop ecosystem: you only need to know some SQL and some basic programming to start coding your own Big Data analytics jobs. I can’t stress enough how big this is for lowering the barrier to entry for Big Data analytics. I have personally demonstrated U-SQL to groups of DBAs and shown them how to get up and running with it, and they get very excited about the service. What looked like a long and intricate learning curve suddenly becomes very approachable through this new service and a familiar language.
At this point, the biggest gap that I can see in ADL is the lack of an interactive session type of experience. Right now, analytics have to be run in batch mode through jobs, and the output is always written as a file in the ADL storage. Down the line, I hope we’ll see support for interactive sessions similar to what you can do with Google BigQuery or with a Spark cluster and a notebook.
Now, let’s go to the video and the demo for an introduction to the Azure Portal experience for Azure Data Lake and a quick look at some basic U-SQL. Enjoy!