So, for 2023 I’ve decided that I want to learn Azure Synapse. I want to be able to make training content on it by the end of the year. I’d like to be able to consult on it in two years. And right now, I am absolutely banging my head against the learning curve. Let’s talk about why.
The integration problem
Occasionally, I’ll describe Power BI as “3 raccoons in a trench coat: Power Query, DAX, and visuals”. What I mean by that is that it’s really 3 separate products masquerading as a single, perfectly cohesive product. Each of those pieces started out as a separate Excel add-in, and they were later combined into a single product. And it shows.
The team at Microsoft has done a great job of smoothing out the rough edges, but you still occasionally run into situations where the integration isn’t perfect. A simple example: where should I create my date tables in Power BI, in M or in DAX? The answer is either! Both have good tooling for it. Because these tools evolved separately, there is going to be some overlap, and there are going to be some gaps.
Azure in general (and Synapse in particular) has this problem. If Power BI is 3 raccoons in a trench coat, Synapse is 10 of them wobbling from side to side. The power of the cloud is that Microsoft can quickly iterate and provide targeted tooling for specific needs. If a tool is unpopular or unsuccessful, like Azure Data Catalog, Microsoft can build a replacement, like Azure Purview.
But this makes learning difficult. Gone are the days of a monolithic SQL Server product where, in theory, all of the parts (SSRS/SSIS/SSAS) were designed to fit cohesively into a single product. Instead, Microsoft and data professionals like us must provide the glue after the fact, after these products have evolved and taken shape. Unfortunately, this means understanding not only how these pieces fit together, but also when, in practice, they don’t.
This is the curse of the modern cloud professional. We are all generalists now.
The alternatives problem
The other big problem is that, just like with M and DAX, there are multiple tools available to do the same job. And while M and DAX compete only at the borders, at the joints, Azure Synapse has tools that are direct competitors. The most prominent example is the querying engines.
From what I understand, Azure Synapse has 3 main ways to access and process data: dedicated SQL pools, Apache Spark pools, and serverless SQL. Imagine if I told you that you had 3 ways to cut things: a scalpel, a butter knife, and a wood saw. These all cut things, it’s true. But then imagine if I immediately dived into the type of metal we use for our butter knives, the 60 teeth on our wood saws, and so on.
It would be a little disorienting. It would be a little frustrating.
You might wonder how we ended up with 3 different tools that do similar things. You might wonder when you should use which. You might especially wonder when you shouldn’t use one of them. Giving your learners the general shape and parameters of a tool is a big deal.
Imagine if a course on Azure ButterKnife™ instead started with “This is Azure ButterKnife™. It is ideal for cutting food, especially soft food. It shouldn’t be used on anything harder than a crispy piece of toast. It originally started as a way to spread butter on toast.” It would take 20 seconds to orient the learner, and if they were looking for a way to cut lumber, they could quickly move on.
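To bring the analogy back to the real tools for a second: even a rough sketch of how the engines relate goes a long way. Here is a minimal example, assuming a Synapse workspace with an Apache Spark pool attached (the storage path and column names below are hypothetical, not from any real workspace), showing that all three engines ultimately work against the same files sitting in the data lake.

```python
from pyspark.sql import SparkSession

# In a Synapse notebook the `spark` session already exists; building it here just
# keeps the sketch self-contained.
spark = SparkSession.builder.getOrCreate()

# Read raw Parquet files straight out of the data lake with the Spark pool
# (the abfss:// path is hypothetical -- substitute your own storage account).
df = spark.read.parquet("abfss://data@mylake.dfs.core.windows.net/raw/sales/")

# Heavy, distributed processing happens on the Spark side.
df.groupBy("region").sum("amount").show()

# The same folder can also be queried ad hoc from serverless SQL (via OPENROWSET),
# or loaded into a dedicated SQL pool (via COPY INTO): three tools, one set of files.
```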
The expertise problem
When I was doing a course on ksqlDB for Kafka, I ran into a particular problem. Because ksqlDB was a thin layer of SQL on top of a well-known Kafka infrastructure, so much of the content assumed you were experienced and entrenched in the Kafka ecosystem. It quickly covered terms and ideas that made sense in that world, but no sense if you were coming from the relational database world.
And a question I kept asking, to no one in particular, was “How did we end up here?” What was the pain point that caused people to create an event streaming technology and then put a SQL query language on top of it, instead of just using a relational database? I talk about this more in a podcast episode with the company that made ksqlDB.
Azure Synapse has a similar problem. It is an iteration on various technologies over the past decade. And it’s designed to support large datasets (multi-terabyte) and complex enterprise scenarios. And so a lot of the content out there assumes a certain level of expertise, in part because the people interested in it and the people training on it are both experts.
The challenge this presents is twofold. First, the more of an expert you are, the harder it is to empathize with a new learner. Often the best teacher is someone who learned a technology a year ago, and remembers all the stumbling blocks. This is a challenge I struggle with regularly myself.
The other issue is that the content often presupposes the learner knows what the foundational technologies are and why they are important. It might assume the learner knows what Delta Lake is, and what Parquet is, and, um, why are we storing all our data in flat files to begin with???
That’s not to say that every course needs to be a 9-hour foundations course. But there are ways to briefly remind the viewer why something is important, what pain point it solves, and why they should care. And if they are totally new, this helps orient them quickly.
For example, a course could say “Here we are using the Delta Lake approach. This lets us keep the efficient columnar storage of Parquet files while adding the ACID guarantees we usually lose out on when using a data lake.” This explains to new learners why we are here and reminds seasoned learners why they should care. This can be done quickly and deftly, without feeling like you are talking down to experienced learners.
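To make that concrete, here is a minimal sketch of the difference, assuming PySpark with Delta Lake support on a Synapse Spark pool (the paths and column names are hypothetical): the same DataFrame written once as plain Parquet and once as a Delta table, where the only addition is the transaction log that buys you the ACID behavior.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A toy DataFrame standing in for real data; the column name is made up.
df = spark.range(1_000).withColumnRenamed("id", "order_id")

# Plain Parquet: efficient columnar storage, but no transactional guarantees --
# a half-finished overwrite can leave the folder in a broken state.
df.write.mode("overwrite").parquet(
    "abfss://data@mylake.dfs.core.windows.net/raw/orders_parquet"
)

# Delta Lake: the same Parquet files underneath, plus a _delta_log folder that
# makes overwrites, merges, and concurrent reads behave like a transactional table.
df.write.format("delta").mode("overwrite").save(
    "abfss://data@mylake.dfs.core.windows.net/curated/orders_delta"
)
```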
So now what?
I’m hoping this will help folks who make content in this area. If nothing else, I hope it will be a reminder to me a year from now, when I’ve forgotten what a pain this was. In the next blog post, I’ll write about the instructional design techniques people can use to get around these issues.