Posted by Trey Johnson on 08 January 2021
So there you are, maybe you are taking stock of multiple data sources your business has a need for and you might be contemplating the best way to make data available for your users in a Data Warehouse, Data Mart, or Data Lake, some of whom are very keen to work on Power BI or Tableau to visualize the information. As you contemplate your path for moving, manipulating and making your business intelligence data available, you may be fixed on doing things as you have always done. In the case of data, this might be ELT or ETL processes which will allow for data to be available.
If this is all a bit new, you might not be familiar with ETL as a term, it stands for Extract, Transform and Load of data, generally into a data store for users. There are many tools which support ETL processes and Microsoft, with recent announcements about Azure Data Factory (ADF), is providing two products, SQL Server Integration Services and ADF. And it is worth pointing out that ZAP Data Hub has an ability to plug nicely into the orchestrations of the SSIS and ADF platforms.
(You might also like "5 Reasons Why a Data Warehouse Improves Business Reporting")
ELT (or Extract, Load and Transform) compared to ETL has some main differences and that is the real focus of this post, as well as exploring some differences which may cause you to ‘change direction’…
“If you tell the truth, you don’t have to remember anything.” – Mark Twain
When it comes to ETL, telling the truth (or not manipulating the data) is the only way you DON’T engage memory. ETL architectures require the use of memory to store and manipulate (transform) data which has been extracted, prior to it being loaded into a destination. This has the potential of becoming crazy fast for computational transformations and possibly very slow if other data is being used to enrich this information.
Depending on the nature of the transformation, large amounts of memory MAY be consumed to support the data being processed through ETL “Pipelines”.
Conversely, ELT offers some unique abilities to manipulate data but generally requires the data is extracted and often, non-modified when loaded into an intermediate data store, often described a Staging Environment.
ELT offers some benefits to transformation as rows can be looked at, in total, and transformations (called set-based operations) against large amounts of data can occur in a computationally more excellent way than “Row-By-Row” as we do in ETL.
It’s not uncommon for people to mix and match where transforms occur or to even do ETL-like operations against data which is being “staged” by an ELT process. But suffice it to say, you as the vision-keeper need to understand what your environment allows for (Scale of Memory and Storage) and how current and future data might intersect.
As the Data Focus of many companies starts to include elements like conversations, reactions, social posts and other loosely or entirely unstructured STREAMs of data. It is important to understand the patterns for how ETL/ELT are used with this information. Unstructured data, generally, needs to find a home before it can be manipulated. This pattern means the flow of information looks to be more like ELT than ETL. Data is often picked up by a “listener” and written to storage (such as BLOB storage on Azure HD Insight or another NOSQL environment). From there, one or more technologies can be leveraged to transform the data.
The alternative, which may gain strength as cloud-based ETL processes scale differently and allow for more dynamic data through ETL pipelines, will be highly dependent on how the functions of transformations are declared and used with row-focused logic definitions.
Not all transformations are the same. In fact, some are proprietary to a particular ETL toolset. Ultimately, this means moving on from one toolset to another means logic doesn’t always “port”. The end result is bidding a fond-farewell to logic which may have been challengingly crafted (or difficult to recreate!)
ELT affords your logic a bit more of a resilient life. Assuming the data is available, ELT logic can work well and often works best with large volumes of data requiring complex manipulations. These complex manipulations become the true “life-blood” of the data and thus ELT, in this sense, provides for a bit less risk against re-tooling.
As I write today’s blog, I’ve also been listening to the podcast I recorded with the guys at Data Driven. The podcast was recorded approximately 4 months before this blog and it is genuinely amazing at how much the world is changing around ZAP’s Data Hub offering and complimentary technologies like Microsoft’s Azure Data Factory.
The changes are so significant that possibly the greatest difference is there will likely be a need to no longer choose ONLY ETL or ONLY ELT but likely some combination of the two. Beyond ZAP’s investments to provide the most suitable Automated data management platform, is the reality where former “Visual-Only” folks are now building dataflows and pipelines in tools like Power BI and Tableau (Prep), which often augment ETL, ELT and automated data management.
(We also recommend reading "Do Tableau and Power BI replace the need for a Data Warehouse?")
So between the words here and the conversation on the podcast, I hope the reader/listener finds the opportunity to fully exploit ETL and ELT to the fullest in delivering rich-data to their end-users and remember ZAP can automate much of it with our ZAP Data Hub offering!
Trey is Chief Evangelist and leader of ZAP’s Americas business with a background of 25 years working with SQL Server and Microsoft Data Platform technologies. His other roles include being an industry speaker, published author, Board Member of the PASS organization, member of Microsoft’s Advisory Councils and community enthusiast for the last two decades.
View my social profiles: LinkedIn | Twitter