Posted by Trey Johnson on 20 November 2020
I often meet with IT managers and data analysts who instinctively believe in moving some or all of their BI workloads to the cloud using solutions such as an ETL Data Warehouse, but who don't yet have a ready grasp of the landscape and tools available to them. And that's fair enough: right now it's not exactly easy, especially as Microsoft has recently bolstered its Azure platform around Data Science, Data Warehousing and Business Intelligence/Analytics. Three questions keep coming up in these conversations, and I'll look at each in detail in this blog.
There are many ways of describing the Azure Data Platform end-to-end (or any Modern Data Warehouse architecture), and I would offer that the simplest definitions are sometimes the best. For this reason, we'll focus on three core areas: the collection and management of data; transformation and analysis; and visualization and decision-making.
For starters, the Modern Data Warehouse architecture still has a requirement to support the collection and management of data. If anything, the complexity here lies in the modern composition of that data.
Merriam-Webster defines data, in part, as "information output by a sensing device or organ that includes both useful and irrelevant or redundant information and must be processed to be meaningful."
With this definition in mind, let's talk about the modern composition of this data. There are numerous types of data providers for the Modern Data Warehouse platform. Historically, the most common were relational, or highly structured, in nature. Structured data is a familiar concept, probably older than the definition of data itself. Lifetimes ago there were ship manifests, human registries and transactions/invoices, all of which preceded the modern age of technology. The data coming off these documents was well understood, and the schema was defined in advance because the format for entering the information itself was defined.
Before discussing the other providers, it's worth noting a very popular movement born out of traditional, relational data: the shift towards ELT (Extract, Load and Transform) versus ETL (Extract, Transform and Load). ELT makes it easier to preserve the incoming data and empowers those who work with the data (in a transformative and/or preparatory capacity) to really understand it in its original form.
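To make the ELT idea concrete, here is a minimal sketch using SQLite as a stand-in warehouse (the table and column names are illustrative, not part of any Azure service): the raw rows are landed untouched in a staging table first, and only then transformed with SQL inside the store itself, so the original data is preserved.

```python
import sqlite3

# ELT sketch: land the source rows untouched in a staging table first,
# then transform with SQL inside the warehouse itself.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging_orders (raw_id TEXT, amount TEXT, region TEXT)")

# Extract + Load: raw values go in exactly as they arrived (note the messy types).
source_rows = [("A-1", "100.50", "east"), ("A-2", "75.25", "WEST"), ("A-3", "n/a", "east")]
conn.executemany("INSERT INTO staging_orders VALUES (?, ?, ?)", source_rows)

# Transform: cleansing happens after the load, so the raw data survives
# in staging for anyone who wants to see it in its original form.
conn.execute("""
    CREATE TABLE orders AS
    SELECT raw_id,
           CAST(amount AS REAL) AS amount,
           UPPER(region)        AS region
    FROM staging_orders
    WHERE amount GLOB '[0-9]*'
""")
clean = conn.execute("SELECT raw_id, amount, region FROM orders ORDER BY raw_id").fetchall()
print(clean)  # the 'n/a' row stays behind in staging for later inspection
```

With ETL, by contrast, the cleansing would happen before the load, and the "n/a" row would be gone before anyone downstream could examine it.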
(You might also be interested to read "What are the 3 Main Differences between ELT and ETL?")
Other data providers take the form of non-relational data, often carrying the popular "NoSQL" designation, which ultimately means the encumbering rules of a relational data store no longer apply. There is the freedom to be flexible with the schema, and to acknowledge that some of the incoming data simply doesn't have a purely tabular orientation (rows and columns).
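A quick sketch of what that schema flexibility means in practice (the field names are invented for illustration): each document carries only the fields it needs, where a relational table would force every row into the same columns.

```python
# NoSQL-style flexibility: no shared schema is enforced, so each
# "document" carries only the fields that make sense for it.
docs = [
    {"id": 1, "name": "sensor",  "tags": ["iot", "outdoor"]},
    {"id": 2, "name": "gateway", "firmware": {"version": "2.1", "beta": True}},
    {"id": 3, "name": "cable"},  # nothing beyond the basics
]

# Inspect which extra fields each document chose to carry.
shapes = {doc["name"]: sorted(set(doc) - {"id", "name"}) for doc in docs}
print(shapes)  # {'sensor': ['tags'], 'gateway': ['firmware'], 'cable': []}
```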
With the proliferation of NoSQL/Non-Relational data stores, especially on Azure, there are great opportunities to really revisit the data definition from earlier. Streaming is akin to the cognitive processes our bodies engage in, as streaming affords us the architectural opportunity to handle data defined as “information output by a sensing device or organ that includes both useful and irrelevant or redundant information and must be processed to be meaningful.”
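That definition, with its "useful and irrelevant or redundant information," maps neatly onto stream processing. Here is a minimal sketch (the error-code convention is an assumption for illustration) of processing a raw sensor stream so that only the meaningful readings survive:

```python
# A raw sensor stream mixes useful readings with redundant and
# irrelevant ones; processing keeps only what is meaningful.
raw_stream = [21.5, 21.5, 21.5, -999.0, 22.1, 22.1, 23.0]  # -999.0 = error code

def process(stream, error_code=-999.0):
    last = None
    for reading in stream:
        if reading == error_code:   # irrelevant: drop error markers
            continue
        if reading == last:         # redundant: drop repeated values
            continue
        last = reading
        yield reading

meaningful = list(process(raw_stream))
print(meaningful)  # [21.5, 22.1, 23.0]
```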
Or, "I've heard of Data Lakes; is this another metaphor, or is it something different?"
A data stream reflects higher volumes of smaller transactions arriving at a velocity where the data may need some level of queuing to process. As such, Microsoft Azure is the better Microsoft platform for accommodating the data stream (or big data), and in Azure there are three influential parties supporting the streaming of data:
1. Azure Data Lake Storage
The first member of the Azure Big Data/streaming party is Azure Data Lake Storage. So yes, you could say a stream feeds into the lake if that helps you remember (but it isn’t exactly what the cool kids are saying!).
Characteristically, Azure Data Lake Storage provides a Hadoop Distributed File System (HDFS) in the cloud, with support for exceptionally large storage scenarios (think petabytes and exabytes here) and for data however it arrives: structured, semi-structured or unstructured. Of course, it wouldn't be Azure-based if it didn't promote exceptional security, using Azure AD for identity and access management.
2. Azure Data Lake Analytics
The second member of the Azure Big Data/Streaming party is Azure Data Lake Analytics. I would simply state that Azure Data Lake Analytics is the execution environment for running programmatic/processing types of activities. R, Python and .NET code can drive data transformations and processing programs, along with a newcomer called U-SQL, which combines the declarative nature of SQL (Structured Query Language) with C#'s data types and expression language.
3. Azure HDInsight
The last, but certainly not least, member of the Azure Big Data/Streaming party is Azure HDInsight. Azure HDInsight is the powerhouse for supporting the creation of clusters, performing analyses and delivering solutions which leverage open-source frameworks such as Apache Hadoop, Spark, Hive, Kafka and HBase.
Or, you might ask, “so what about the older ways of collecting and managing Data, especially if we have ‘Not BIG’ Data?”
This is a popular sentiment. From a collection and management standpoint, I have done a "fly-by" of the Azure support for big data and streaming, and could easily have used ten times as many words to describe it. But that's the realm of data science, a rapidly maturing but still slightly "Wild West" discipline.
(We recommend reading "Gartner’s Definition of Data Management for BI and how ZAP Data Hub Compares")
If, on the other hand, you’re looking for data collection and preparation on structured data, the great news is Azure SQL, Azure SQL Data Warehouse and Azure Analysis Services really can pave the way for a reliable data platform to support the next key facet of the Microsoft Data Warehouse: Transformation and Analysis.
Transformation, particularly in support of analysis, has long been a big part of making data valuable for businesses of all shapes and sizes. The reality is that transformation becomes more challenging and extensive when we move beyond structured data. Regardless, let's talk about the characteristic layers of what Azure affords in this area.
Data orchestration layer
Orchestration is key to the overall transformation and analysis capabilities of Azure in the Modern Data Warehouse. It gives us the opportunity to automate and join the disparate processes that support the collection and management of data. Orchestration might be something as simple as copying data, or as complex as triggering the execution of a transformation routine. Without it, the world ends up looking a lot like out-of-control batch files or super-challenging programming logic. In my view, as of this writing, orchestration is one of the key areas where Azure Data Factory shines, and with further adoption of paradigms from SQL Server Integration Services (SSIS), aspects like "Control Activities" become much more compelling!
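At its heart, orchestration is just running dependent activities in the right order. A minimal sketch (these function names are illustrative stand-ins, not Azure Data Factory API calls) of a copy activity feeding a transform feeding a publish step:

```python
# Orchestration sketch: a pipeline of dependent activities run in order.
# Each activity reads and extends a shared state, standing in for the
# datasets an orchestrator like Data Factory would move between steps.

def copy_raw(state):
    state["raw"] = ["row1", "row2"]                     # stand-in copy activity

def transform(state):
    state["clean"] = [r.upper() for r in state["raw"]]  # stand-in transform

def publish(state):
    state["published"] = True                           # stand-in publish step

# The pipeline is an ordered chain: each step only runs once its
# dependencies have completed -- the part an orchestrator automates.
pipeline = [copy_raw, transform, publish]
state = {}
for activity in pipeline:
    activity(state)
print(state["clean"], state["published"])  # ['ROW1', 'ROW2'] True
```

A real orchestrator adds the parts this sketch leaves out: scheduling, retries, and triggering on external events rather than a simple loop.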
Complex event processing example
Generally speaking, event processing is fairly synonymous with the technical demand created by streaming data sources. Events themselves might simply be IoT devices providing feedback. Maybe it is visual feedback on the appearance of a group of kids at a baseball field, with a separate event showing an adult in a uniform unloading equipment from a truck. Combine those two with the observance of a person dressed in an umpire's uniform interacting with the group, or the lighting of the stadium lights/scoreboard. Each of those is a simple event; taken together, the complex event is a "youth baseball game".
Complex event processing on Azure allows information to be queued by Event Hubs and then peeled off those queues by event-processing jobs; the processing itself is enabled by Azure Stream Analytics. In the real world, this is used across supply chains, trading activities and industries where "knowing in advance" is critical, from energy load prediction through to more intelligent patient care.
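The baseball example above can be sketched in a few lines (the event names and the pattern rule are invented for illustration, and an in-process queue stands in for an event hub): simple events arrive on a queue, and a rule fires a complex event once the right combination has been observed.

```python
import queue

# Simple events arrive on a queue (a stand-in for an event hub).
events = queue.Queue()
for e in ["kids_at_field", "equipment_unloaded", "umpire_interacting", "lights_on"]:
    events.put(e)

# The "complex event" rule: this combination of simple events means a game.
GAME_PATTERN = {"kids_at_field", "equipment_unloaded", "umpire_interacting"}

seen = set()
complex_events = []
while not events.empty():
    seen.add(events.get())   # peel the next simple event off the queue
    if GAME_PATTERN <= seen and "youth_baseball_game" not in complex_events:
        complex_events.append("youth_baseball_game")

print(complex_events)  # ['youth_baseball_game']
```

Real CEP engines add windowing and time-ordering on top of this, so that events arriving hours apart are not combined into one conclusion.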
Data modeling – in the context of Azure
Modeling is really a precursor to machine learning: the process by which elements of the data are shaped into the requisite variables to feed one or more mathematical models. It is a bit different from, and scientifically more significant than, traditional modeling of a data mart or warehouse, but the two are often confused. Modeling in the context of the Modern Microsoft Data Warehouse on Azure is where Azure Machine Learning models are prepared and evaluated to fit into more automated or ongoing predictive/machine learning scenarios.
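A small sketch of what "shaping elements of the data into requisite variables" looks like (the record fields are invented for illustration): raw records become uniform numeric feature vectors, with a categorical field one-hot encoded.

```python
# Modeling sketch: shape raw records into the numeric variables
# (features) a mathematical model expects.
records = [
    {"age": 34, "city": "Austin", "purchases": 12},
    {"age": 51, "city": "Denver", "purchases": 3},
    {"age": 27, "city": "Austin", "purchases": 8},
]

cities = sorted({r["city"] for r in records})  # categorical -> one-hot columns

def to_features(record):
    one_hot = [1.0 if record["city"] == c else 0.0 for c in cities]
    return [float(record["age"]), float(record["purchases"])] + one_hot

X = [to_features(r) for r in records]
print(X[0])  # [34.0, 12.0, 1.0, 0.0] -- age, purchases, is_Austin, is_Denver
```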
Azure machine learning
Machine learning is very much the intersection of computational horsepower with a mathematical vehicle for interpreting larger sets of variables, derived through data preparation activities (or purpose-built programs in a variety of languages), to achieve a result or a set of classified results. I highly recommend the Azure Machine Learning Cheat Sheet mini-site as a simplified way of understanding all the options Azure Machine Learning brings to the table.
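To ground that "mathematical vehicle" idea at toy scale, here is a minimal sketch, entirely illustrative and unrelated to any Azure service: fitting a line y = w·x + b to three points with gradient descent, the same learn-then-predict loop that real ML services run at vastly larger scale.

```python
# Minimal machine learning: fit y = w*x + b with gradient descent.
data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # generated from y = 2x + 1

w, b, lr = 0.0, 0.0, 0.05
for _ in range(2000):
    # Gradients of mean squared error with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
    grad_b = sum(2 * (w * x + b - y) for x, y in data) / len(data)
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 2), round(b, 2))  # close to 2.0 and 1.0
prediction = w * 10.0 + b        # predict for an unseen input
```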
When Microsoft describes the Modern Data Warehouse and the Modern Data Platform, the concepts of visualization and decision are never left out of the conversation. Part of the reason for this is the Power BI platform is a significant enabler in many areas.
(You might also be interested to check out "Why all Financial Analytics need Data Visualization")
Whether it's reports or dashboards, Power BI takes advantage of all of the data stores I've mentioned: Azure SQL, Azure SQL DW, Azure Data Lake, Azure Analysis Services, and so on. In fact, Power BI, much like other analytical tools, takes on some of the responsibility for supporting first-generation, self-service data modeling, right up to the point where storage limits are reached and promoting the tabular model underneath Power BI to Azure Analysis Services makes a tremendous amount of sense.
The realm of visualization and ‘decision supporting’ (now, that’s an old phrase!) capabilities will continue to evolve through the combination of unique data and compute support on Azure and features of products like Power BI.
Update: much was announced at MS Ignite (September 2018), with changes and additions across multiple Microsoft product lines and cloud offerings. Where practical, our subsequent blogs will cover these changes as they become mainstream, since they may ultimately affect your journey on the path to Azure BI.
Trey is Chief Evangelist and leader of ZAP’s Americas business with a background of 25 years working with SQL Server and Microsoft Data Platform technologies. His other roles include being an industry speaker, published author, Board Member of the PASS organization, member of Microsoft’s Advisory Councils and community enthusiast for the last two decades.
View my social profiles: LinkedIn | Twitter