Author: Eugene Meidinger

Power BI Consulting: What Is in the Course?

April 6, 2025 | Career & Professional Development, Course Updates
This course is launching April 8th, 2025 for $10 for 24 hours. Then it will be $50 until April 13th.

Below is a summary of the contents of the course.

Module 1 – Choosing to consult

This module is a reality check on why you want to consult and what things you should consider before making the jump. Module 1 videos are available for free on YouTube and on the course site.

In addition to the videos, there are 3 bonus docs:
- Readiness Checklist. This is a checklist of thought exercises to make sure you are ready to take the leap.
- Burn Rate Calculator. This is a simple Excel file to estimate your monthly income and see how many months you can work with your existing savings.
- Recommended Reading List. A list of recommended and optional reading, podcasts, and videos for each module.
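
The idea behind the Burn Rate Calculator can be sketched in a few lines of Python. This is not the course's spreadsheet, just a minimal illustration that assumes runway is savings divided by the gap between monthly expenses and monthly income. The function name and the dollar figures are hypothetical.

```python
import math


def runway_months(savings, monthly_expenses, monthly_income=0.0):
    """Return how many whole months the savings last.

    Assumption: net burn is expenses minus income. If income already
    covers expenses, the runway is unlimited (math.inf).
    """
    net_burn = monthly_expenses - monthly_income
    if net_burn <= 0:
        return math.inf
    return math.floor(savings / net_burn)


# Hypothetical figures, for illustration only
print(runway_months(savings=30_000, monthly_expenses=5_000, monthly_income=2_000))
```

With these made-up numbers, the net burn is $3,000 a month, so $30,000 in savings lasts 10 months.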
Module 2 – Paperwork

Module 2 focuses on the paperwork involved with getting started. In short, you will want:
- A legal entity (preferably one that provides liability protection)
- Business Insurance (general liability and Errors & Omissions)
- In the US, you’ll want to research an S-corp tax election
- A business bank account
- A default service agreement contract
- The ability to write up a scope of work
- The ability to track your time and to send invoices
The module also includes some quick demos on tracking time with Toggl and creating an invoice.

Module 3 – Sales and Marketing

This module covers the fundamentals of sales and marketing with core concepts like the AIDA model and the sales funnel. It talks about how consulting is high-trust work, and how your sales and marketing strategy should reflect that.

Module 4 – How to Scope

The scoping section covers what goes into a scope of work, and how to estimate time and overall scope. It explains what deliverables are and how they can vary in concreteness.

It also includes a private custom GPT that you can interact with to practice gathering requirements. If you are stuck, there is a document with a list of questions to ask the GPT. I also very quickly demo using Microsoft Word to write a scope of work.

Module 5 – How to Price and Contract

This module talks about three of the main pricing models: hourly, flat rate, and value pricing. It explains how to estimate your hourly rate based on your salary and desired role.
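
As a rough illustration of turning a salary into an hourly rate, here is a minimal Python sketch. It is not the formula taught in the course. It assumes you only bill part of each week, take some weeks off, and need to cover overhead such as insurance and taxes. Every parameter name and default below is a hypothetical placeholder.

```python
def hourly_rate(target_salary, billable_hours_per_week=20,
                weeks_worked=46, overhead_rate=0.30):
    """Estimate an hourly rate from a target salary.

    Assumptions (all hypothetical defaults):
    - only some hours per week are billable
    - some weeks per year are unpaid time off
    - overhead_rate is extra revenue needed on top of salary
    """
    billable_hours = billable_hours_per_week * weeks_worked
    required_revenue = target_salary * (1 + overhead_rate)
    return required_revenue / billable_hours


# Hypothetical target salary, for illustration only
print(round(hourly_rate(100_000), 2))
```

The point of the sketch is that dividing a salary by a full-time 2,080-hour year understates the rate, because a consultant rarely bills every working hour.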

For contracting, the module covers the gist of what should go into a service agreement and what to watch out for. As an exercise, I’ve included an intentionally malicious service agreement that you need to review for problems. This exercise also has a custom GPT for practicing contract negotiation. As part of the exercise, I have a marked-up version of the contract if you are stuck finding problematic clauses.

Module 6 – Your First Project

This final module helps to answer the question of how you know you are ready skill-wise. It talks about some of the mental health hurdles to expect when working for yourself. Finally, it covers some specific technical details of Power BI consulting and that first customer.
SUM and SUMX often have identical performance.

March 21, 2025 | DAX Functions, Performance Optimization, Power BI, Query Optimization

For years, I told people to avoid iterators. I compared them to cursors in SQL, which are really bad, or for loops in C# which are normally fine. I knew that DAX was column based and that it often broke down when doing row-based operations, but I couldn’t tell you why.

Advice to avoid iterators is often based on a misunderstanding of how the Vertipaq engine works. If you are blindly giving this advice out, like I was, you are promoting a fundamental misunderstanding of how DAX works. We think that iterators run row-by-agonizing-row (RBAR), toiling away and wasting CPU.

The truth is that SUM and SUMX are the same. Specifically, SUM is syntactic sugar for SUMX. That means when you write SUM, the engine functionally rewrites it as a SUMX. There is no performance difference. There is no execution difference. There are identical execution plans. You can look for yourself.

Looking at the data

Here is the evaluation of SUM over 100 million rows of Contoso generated data, gathered with DAX Studio. With caching off, it takes 13 milliseconds and performs a single scan operation.

Here is SUMX over the same data. 15 ms, same scan operation, same xm_SQL output on the right. Any DAX within 4ms should be considered to have functionally identical performance, according to SQLBI.
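
To make the comparison explicit, here is a tiny Python sketch of that rule of thumb. The 4 ms threshold and the two timings come from the text above. The function and constant names are hypothetical.

```python
NOISE_THRESHOLD_MS = 4  # rule of thumb attributed to SQLBI in this post


def functionally_identical(first_ms, second_ms, threshold_ms=NOISE_THRESHOLD_MS):
    """Treat two timings as equivalent when they differ by no more than the threshold."""
    return abs(first_ms - second_ms) <= threshold_ms


# SUM took 13 ms and SUMX took 15 ms in the runs described above
print(functionally_identical(13, 15))
```

The 2 ms gap between the two runs falls inside the threshold, so the measurements count as the same.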

Here are the physical and logical execution plans for SUM:

Here are the logical and physical plans for SUMX. Identical.

Why the confusion?

So why is this a point of confusion? It is good to avoid row-based operations in general, but the engine often optimizes those away behind the scenes. So a blanket ban on SUMX is silly and misguided.

The fact of the matter is that if you stick to functions like SUM then you will fall into the pit of success. You will have better performance, on average, because the code you write will better align with how the formula engine and the storage engine work. CALCULATE + SUM is like having a safety on your code and when you have to step outside of that and use iterators like SUMX or FILTER you know that you have to be more cautious.

Sticking to SUM will force you to engage in patterns that often lead to better performance. But SUM by itself makes no difference.

But beyond that, it’s easy to write really, really bad code with iterators. If you put an IF statement inside of your SUMX then you will see CALLBACKDATAID, which is a sign the storage engine is having to make calls to the formula engine to handle logic it can’t handle by itself. Depending on how poorly you write your SUMX, it may do the vast majority of the work in the formula engine instead of using the storage engine and sending back data caches.

If you want to learn more, I recommend checking out the super comprehensive book by SQLBI or my course on performance tuning.
Microsoft Fabric Guidance for Small Businesses

January 26, 2025 | Microsoft Fabric

If you are a small (or even medium) business, you may be wondering “What is Fabric and do we even need it?” If you are primarily on Power BI Pro licenses today, you may not find a compelling reason to switch to Fabric yet, but the value add should improve over time as new features are added on the Fabric side and some features get deprecated on the Power BI side.

If you have the budget, time, and luxury, then you should start playing around with a Fabric 60-day trial today and continue to experiment with a pausable F2 afterwards. Not because of any immediate value add, but because when the time comes to consider using Fabric, it will be far less frustrating to evaluate your use cases.

This will cost $0.36 per hour for pausable capacity (plus storage). That is roughly $270/mo if you leave it on all the time. See here for a licensing explanation in plain English. Folks on Reddit have shared when they found the entry level F2 to be useful.
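
The monthly figure is simple arithmetic, sketched below in Python. The $0.36 hourly rate comes from the text above. The function name, the 31-day month, and the 4-hours-a-day scenario are assumptions for illustration. Storage is excluded, and the sketch ignores any extra charges from bursting.

```python
F2_RATE_PER_HOUR = 0.36  # pausable F2 rate quoted in this post


def monthly_compute_cost(hours_per_day, days_per_month=31,
                         rate_per_hour=F2_RATE_PER_HOUR):
    """Estimate monthly compute cost in dollars for a pausable capacity.

    Assumes the capacity is billed only while it is running and that
    no smoothing debt is charged on pause.
    """
    return round(hours_per_day * days_per_month * rate_per_hour, 2)


print(monthly_compute_cost(24))  # always on
print(monthly_compute_cost(4))   # hypothetical: running 4 hours a day
```

Left on around the clock for a 31-day month, the estimate lands just under the roughly $270 mentioned above. Running it a few hours a day cuts the compute bill proportionally.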

Warning! Fabric provides for bursting and smoothing, with up to 32x consumption for an F2. This means that if you run a heavy workload and immediately turn off your F2, you may get billed as if you had run an F64 because smoothing isn’t given time to pay back down the CU debt. If you are using an F2, you 100% need to research surge protection (currently in preview).

Microsoft is providing you with an ever-growing buffet of tools and options within Fabric. But, as with a buffet, if someone has food allergies or dietary restrictions, it would be reckless to toss them at it and say “Good luck!”.

What the Fabric is Microsoft Fabric?

If you are a Power BI user, Microsoft Fabric is best seen as an expansive extension of Power BI Premium (and a replacement for it), built to compete in the massively parallel processing space with tools like Databricks and Snowflake. Fabric does have mirroring solutions for both of those products, so it doesn’t have to be a strict replacement.

Microsoft has not had success in this space historically and has decided to take a bundled approach with Power BI. This bundling means that over time, there will be more motivation for Power BI users to investigate Fabric as a tool as the value of Fabric increases.

Fabric is an attempt to take a Software-as-a-Service approach to the broader Azure data ecosystem, strongly inspired by the success of Power BI. However, this can lead to frustration as you are given options and comparisons, but not necessarily explicit guidance.

Metaphorically speaking, Microsoft is handing you a salad fork and a grapefruit spoon, but no one is telling you “You are eating a grapefruit, use the grapefruit spoon!” This blog post attempts to remedy that with explicit instructions and personal opinions.

The core of Fabric is heavily inspired by the Databricks lakehouse approach specifically, and data lakes more generally. In short, data lakes make sense when it’s cheaper to store the data rather than figure out what to keep. A data lakehouse is the result of taking a lake-first approach and then figuring out how to recreate the flexibility, consistency, and convenience of a SQL endpoint.

How should you approach Fabric?

If you are comfortable with Microsoft Power BI, you should give preference to tools that are built on the same technology as Power BI. This means Gen2 dataflows (which are not a feature superset of Gen 1 dataflows), visual SQL queries, and your standard Power BI semantic models. You should only worry about data pipelines and Spark notebooks if and when you run into performance issues with dataflows, which are typically more expensive to run. See episode 1 of the Figuring out Fabric podcast for more on when to make the switch.

In terms of data storage, if you are happily pulling data from your existing data sources such as SQL Server or Excel, there is no urgent reason to switch to a lakehouse or a data warehouse as your data source. These tools provide better analytical performance (because of column compression) and a SQL endpoint, but if you are only using Power BI import mode, these features aren’t huge motivators. The Vertipaq engine already provides column compression.

In terms of choosing a Lakehouse versus a Warehouse, my recommendation is to use a Lakehouse for experimentation or as a default and a Warehouse for standalone production solutions. More documentation, design patterns, and non-MSFT content exist around lakehouses. Fabric Data Warehouses are more of a Fabric-specific offshoot.

Both are backed by OneLake storage, which is really Azure Data Lake Storage, which is really Azure Blob Storage but with folder support and big data APIs. Both use the Parquet file format, which is column compressed similar to the Vertipaq engine in Power BI. Both use Delta Lake to provide transactional guarantees for adds and deletes.

Important: I have covered Delta Lake and a lot of the motivation to use these tools in this user group presentation.

Lakehouses are powered by the Spark engine, are more flexible, more interoperable, and more popular than Fabric-style data warehouses. Fabric Data Warehouses are not warehouses in the traditional sense. Instead, they are more akin to modern lakehouses but with stronger transactional guarantees and the ability to write back to the data source via T-SQL. That is to say that a Fabric Data Warehouse is closer in lineage to Hadoop or Databricks than it is to SQL Server Analysis Services or a Star Schema database on SQL Server.

What are the benefits of Fabric?

In the same way that many of the benefits of Power Query don’t apply to people with clean data living in SQL databases, many of the benefits of Fabric may not apply to you, such as Direct Lake (which in my opinion is most useful with more than 100 million rows). Fabric, in theory, provides a single repository of data for data scientists, data engineers, BI developers, and business users to work together. But.

If you are a small business, you do not have any data scientists or data engineers. In fact, your BI dev is likely your sole IT person or a savvy business user who has been field promoted into Power BI dev.

If Power BI is the faucet of your data plumbing, the benefits of industrial plumbing are of little interest to you. However, you may be interested in setting up or managing a cistern or well, metaphorically speaking. Or you may want to move from a well and an outhouse to indoor plumbing. This is where Fabric can be of value to you.

There are three main benefits of Fabric to small business users, in my opinion. First is if you have a meaningful amount of data in flat files such as Excel and CSV. In my testing, Parquet loads 59% faster and the files are 78% smaller. Compression will vary wildly based on the shape of the data but will follow very similar patterns as the Vertipaq engine in Power BI. Also, technically speaking, in Fabric you are not reading directly from the raw Parquet files into Power BI. Instead, you are going through the lakehouse with Direct Lake or the SQL Analytics Endpoint.

Moving that data into a Lakehouse and then loading it into delta tables will likely provide a better user experience, faster Power BI refreshes, and the ability to query the data with a SQL analytics endpoint. Now, as you are already aware, flat file data tends to be ugly. This means that you will likely need to use Gen2 dataflows to clean and load the data into delta tables instead of doing a raw load.

You may have heard of medallion architecture. This is more naming convention than architecture, but the idea of “zones” of increasing data quality is real and valuable. In your case, I recommend considering the files section of a lakehouse as your bronze layer, the cleaned delta tables as your silver layer and your Power BI semantic model as your gold layer. Anything more than this is overcomplicating things for a small business starting out.

The second benefit of Fabric is the ability to provide a SQL endpoint for your data. SQL is the most common and popular data querying tool available. After Excel, it is the most popular business intelligence tool in the world. This is a very similar use case to Power BI Datamarts, which after 2 years in preview are unlikely to ever leave public preview.

Last is the ability to capture and store data from APIs as well as storing a history of the data over time. This would be tedious to do in pure Power BI but is incredibly simple with Gen2 dataflows and a lakehouse.

What are the downsides of Microsoft Fabric?

Given that Microsoft Fabric is following a similar iterative design approach to Power BI, it is still a bit rough around the edges, in the same way that Power BI was rough around the edges for the first 3 years. Fabric was very buggy on launch and has improved a lot since then, but many items are still in public preview.

Experiment with Fabric now, so that when you feel it is ready for prime time, you are ready as well. Niche, low-usage features like streaming datasets will likely be deprecated and moved to Fabric. In that instance, users only had 2 weeks of notice before the ability to create new streaming datasets was removed, which is utterly unacceptable, in my humble opinion [Edit: Shannon makes a fair point in the comments that deprecation of existing solutions is fairly slow]. New features, like DevOps pipelines, will be Fabric first and will likely not ever be backported to Power BI Pro (I assume). Over time, the weight of the feature set difference will become significant.

Fabric adds a layer of complexity and confusion that is frustrating. While my hope is that Fabric is Power BI-ifying Azure, many worry that the opposite is happening instead. There are 5x the number of Fabric items you can create compared to Power BI and it is overwhelming at first. We know from Reza and Arun that more is on the way. Stick to what you know and ignore the rest.

One area where this strategy is difficult is in cost management. If you plan to use Fabric, then you need to become intimately aware of the capacity management app. Because of the huge variety in workloads, there is a huge variety in cost of these workloads. When I benchmarked ways to load CSV files into Fabric, there was a 4x difference in cost between the cheapest and most expensive ways to load the data. This is not easy to predict or intuit in advance. Surge protection is currently in public preview and is desperately needed.

Another downside is that although you are charged separately for storage and compute, they are not separate from a user perspective. If you turn off or pause your Fabric capacity, you will temporarily lose access to the underlying data. From what I’ve been told, this is not the norm when it comes to lakehouses and can be a point of frustration for anyone wanting to use Fabric in an on-demand or almost serverless kind of way. In fact, Databricks offers a serverless option, something which we had in Azure Synapse but is fundamentally incompatible with the Fabric capacity model.

Sidenote: if you want to save money, you can in theory automate turning Fabric on and off for a few hours per day primarily to import data into Power BI. This is a janky but valid approach and requires a certain amount of sophistication in terms of automation and skill. You are, in a sense, building your own semi-serverless approach.

Another downside of Fabric is that you are left to your own devices when it comes to management and governance. While some tools are provided, such as semantic link, you will likely have to build your own solutions from scratch with Python and Spark notebooks. Michael Kovalsky has created Semantic Link Labs and provides a number of templates. Over time, the number of community solutions will expand.

My recommendation is to experiment with Python and Spark notebooks now so that when the time comes that you need to use them for management and orchestration, you aren’t feeling overwhelmed and frustrated. They are a popular tool for this purpose when it comes to Fabric.

Summary

So, should you use Fabric as a small business? In most cases no, in some cases yes. Should you start learning Fabric now? 100% yes. Integration between Power BI and Fabric will continue and most new features that aren’t core to Power BI (Power Query, DAX, core visuals) will show up in Fabric first.

I’ve seen multiple public calls for a Fabric Per User license. When my friend Alex Powers has surveyed people on what they would pay for an FPU license, people’s responses ranged between $30-70 per user per month. The time between Power BI Premium and PPU was 4 years and the time from Paginated Reports in Premium to Paginated Reports in Pro was 3 years. I have no insider knowledge about an FPU license, but these general ranges seem reasonable to me as estimates.

Finally, Power BI took about 4 years (2015-2019) before it felt well-polished (in my opinion) and I felt comfortable unconditionally endorsing it. I don’t think it’s unreasonable that Fabric follows a similar timeline, but that’s pure speculation on my part. I’ve started the Figuring out Fabric podcast to talk about the good and the bad, and I hope you’ll give it a listen.
Announcing the Figuring out Fabric Podcast!

January 20, 2025 | Uncategorized
I’m delighted to announce the launch of the Figuring out Fabric Podcast. Currently you can find it on Buzzsprout (RSS feed) and YouTube, but soon it will be coming to a podcast directory near you.

Each week I’ll be interviewing experts and users alike on their experience with Fabric, warts and all. I can guarantee that we’ll have voices you aren’t used to and perspectives you won’t expect.

Each episode will be 30 minutes long with a single topic, so you can listen during your commute or while you exercise. Skip the topics you aren’t interested in. This will be a podcast that respects your time and your intelligence. No 2-hour BS sessions.

In our inaugural episode, Kristyna Ferris helps us pick the right data movement tool.

Here are the upcoming guests and topics:
- Cathrine Wilhelmsen. Medallion Architecture
- Kellyn Gorman. Extracting data from legacy systems
- Ginger Grant. Lakehouse versus Warehouse
- Frank Geisler. Realtime Intelligence
- Stephanie Bruno. Semantic link
Come along for the ride!
Should Power BI be Detached from Fabric?

January 16, 2025 | Uncategorized
If you know Betteridge’s Law of Headlines, then you know the answer is no. But let’s get into it anyway.

Recently there was a LinkedIn post that made a bunch of great and valid points but ended on an odd one.

Number one change would be removing Power BI from Fabric completely and doubling down on making it even easier for the average business user, as I have previously covered in some posts.

It’s hard for me to take this as a serious proposal instead of wishful thinking, but I think the author is being serious, so let’s treat it as such.

Historically, Microsoft has failed to stick the landing on big data

If you look back at the family tree of Microsoft Fabric, it’s a series of attempts to turn SQL Server into MPP and Big Data tools. None of which, as far as I can tell, ever gained significant popularity. Each time, the architecture would change, pivoting to the current hotness (MPP -> Hadoop -> Kubernetes -> Spark -> Databricks). Below are all tools that either died out or morphed their way into Fabric today.
- (2010) Parallel Data Warehouse. An MPP tool by DATAllegro that was tied to an HP hardware appliance. Never once did I hear about someone implementing this.
- (2014) Analytics Platform System. A rename and enhancement of PDW, adding in HDInsight. Never once did I hear about someone implementing this. Support ends in 2026.
- (2015) Azure SQL Data Warehouse. A migration of APS to the cloud, providing the ability to charge storage and compute separately. Positioned as a competitor to Redshift. I may have rarely heard of people using this, but nothing sticks out.
- (2019) Big Data Clusters. An overly complicated attempt to run a cluster of SQL Server nodes on Linux, supporting HDFS and Spark. It was killed off 3 years later.
- (2019) Azure Synapse Dedicated Pools. This was a new coat of paint on Azure SQL Data Warehouse, put under the same umbrella as other products. I have in fact heard of some people using this. I found it incredibly frustrating to learn.
- (2023) Microsoft Fabric. Yet another evolution, replacing Synapse. Synapse is still supported but I haven’t seen any feature updates, so I would treat it as on life support.
That’s 6 products in 13 years. A new product every 2 years. If you are familiar with this saga, I can’t blame you for being pessimistic about the future of Fabric. Microsoft does not have a proven track record here.

Fabric would fail without Power BI

So is Fabric a distraction? Certainly. Should Power BI just be sliced off from Fabric, so it can continue to be a self-service B2C tool, and get the attention it deserves? Hell, no.

In my opinion, making such a suggestion completely misses the point. Fabric will fail without Power BI, full stop. Splitting would mean throwing in the towel for Microsoft and be highly embarrassing.

The only reason I have any faith in Fabric is because of Power BI and the amazing people who built Power BI. The only reason I have any confidence in Fabric is because of the proven pricing and development model of Power BI. The only reason I’m learning Fabric is because the fate of the two is inextricably bound now. I’m not doing it because I want to. We are all along for the ride whether we like it or not.

I have spent the past decade of my career successfully dodging Azure. I have never had to use Azure in any of my work, outside of very basic VMs for testing purposes. I have never learned how to use ADF, Azure SQL, Synapse, or any of that stuff. But that streak has ended with Fabric.

My customers are asking me about Fabric. I had to give a 5-day Power BI training, with one of the days on Fabric. Change is coming for us Power BI folks and I think consultants like me are mad that Microsoft moved our cheese. I get it. I spent a decade peacefully ignorant of what a lakehouse was until now, blah.

Is Power BI at risk? Of course it is! Microsoft Fabric is a massively ambitious project and a lot of development energy is going into adding new tools to Fabric like SQL DBs as well as quality-of-life improvements. It’s a big bet and I estimate it will be another 2-3 years until it feels fully baked, just like it took Power BI 4 years. It’s a real concern right now.

Lastly, the logistics of detachment would be so complex and painful to MSFT that suggesting it is woefully naive. Many of the core PBI staff were moved to the Synapse side years ago. It’s a joint Fabric CAT team now.

Is MSFT supposed to undo the deprecation of the P1 SKU and say “whoopsie-daisy”? “Hey sorry we scared you into signing a multi-year Fabric agreement, you can have your P1 back”? Seriously?

No, Odysseus has been tied to the mast. Fabric and Power BI sink or swim together. And for Power BI consultants like me, our careers sink or swim with it. Scary stuff!

Where Microsoft can do better

Currently I think there is a lot of room for improvement in the storytelling around which product to use when. I think there is room to improve on the massive tables and long user scenarios. I would love to see videos with clear do’s and don’ts, but I expect those will have to come from the community. I see a lot of How To’s from my peers, but I would love more How To Nots.

I really want to see Microsoft take staggered feature adoption seriously. Admin toggles are not scalable. It’s not an easy task, but I think we need something similar to roles or RBAC. Something like Power BI workspace roles, but much, much bigger. The number of Fabric items you can create is 5x the number of Power BI items and growing every day. There needs to be a better middle ground than “turn it all off” or “Wild West”.

One suggestion made by the original LinkedIn author was a paid add-on for Power BI Pro that adds Power BI Copilot. I think we absolutely do not need that right now. Copilot is expensive in Fabric ($0.32 to $2.90 per day by my math) and still could use some work. It needs more time to bake as LLM prices plummet. If we are bringing Fabric features to a shared capacity model, let’s get Fabric Per User and let’s do it right. Not a rushed job because of some AI hype.

Also, I don’t get why people are expecting a Copilot add-on or FPU license already. It was 4 years from Power BI Premium (2017) to Premium Per User (2021). It was 3 years from Paginated Reports in Premium (2019) until we got Paginated Reports in Pro (2022). Fabric has been out for less than 2 years and it is having a lot of growing pains. Perhaps we can be more patient?

How I hope to help

People are reasonably frustrated and feeling lost. Personally, I’d love to see more content about real, lived experiences and real pain points. But complaining only goes so far. So, with that I’m excited to announce the Figuring Out Fabric podcast coming out next week.

You and I can be lost together every week. I’ll ask real Fabric users some real questions about Fabric, and we’ll discuss the whole product, warts and all. If you are mad about Fabric, be mad with me. If you are excited about Fabric, be excited with me.
How Power BI Dogma Leads to a Lack of Understanding

January 11, 2025 | Uncategorized
I continue to be really frustrated about the dogmatic approach to Power BI. Best practices become religion, not to be questioned or elaborated on. Only to be followed. And you start to end up with these 10 Power BI modeling commandments:
1. Thou shalt not use Many-to-Many
2. Thou shalt not use bi-directional filtering
3. Thou shalt not use calculated columns
4. Thou shalt not use implicit measures
5. Thou shalt not auto date/time
6. Thou shalt avoid iterators
7. Thou shalt star schema all the things
8. Thou shalt query fold
9. Thou shalt go as upstream as possible, as downstream as necessary
10. Thou shalt avoid DirectQuery
And I would recommend all of these. If you have zero context and you have a choice, follow these suggestions. On average, they will lead to better user experiences, smaller models, and faster performance.

On. Average.

On. Average.

But there are problems when rules of thumb and best practices become edicts.

Why are people like this?

I think this type of advice comes from a really good and well-intentioned place. First, my friend Greg Baldini likes to point out that Power BI growth has been literally exponential. In the sense that the number of PBI users today is a multiple of PBI users a year ago. This means that new PBI users always outnumber experienced PBI users. This means we are in Eternal September.
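
The arithmetic behind that claim can be sketched in Python. This assumes a constant yearly growth multiple and that nobody leaves, both of which are simplifications. The function name is hypothetical. Under those assumptions, people who joined in the last year outnumber everyone else once the user base more than doubles in a year.

```python
def new_user_share(growth_multiple):
    """Fraction of today's users who joined within the last year.

    Assumes users today = growth_multiple * users a year ago,
    and that no existing users left.
    """
    if growth_multiple < 1:
        raise ValueError("growth_multiple must be at least 1")
    return 1 - 1 / growth_multiple


for multiple in (1.5, 2, 4):
    share = new_user_share(multiple)
    print(f"{multiple}x growth: {share:.0%} of users joined in the last year")
```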

I answer questions on Reddit, and I don’t know how many more times I can explain why Star Schema is a best practice (it’s better for performance, better for user experience, and leads to simpler DAX, BTW). Many times, I just point to the official docs, say it’s a best practice and move on. It’s hard to fit explanations in 280 characters.

The other reason is that Power BI is performant, until it suddenly isn’t. Power BI is easy, until it suddenly isn’t. And as Power BI devs and consultants, we often have to come in and clean the messes. It’s really tempting to scream “If you had just followed the commandments, this wouldn’t have happened. VertiPaq is a vengeful god!!!”.

I get it. But I think we need to be better in trying to teach people to fish, not just saying “this spot is good. Only fish in this bay. Don’t go anywhere else.”

Why we need to do better

So why does it matter? Well, a couple of reasons. One is it leads to people not digging deeper to learn internals and instead they hear what the experts say and just echo that. And sometimes that information is wrong. I ran into that today.

Someone on Reddit reasonably pushed back on me suggesting SUMX, violating commandment #6, which is a big no no. I tried to explain that in the most simple cases, SUM and SUMX are identical under the hood: identical performance, identical query plans, etc. SUM is just syntactic sugar for SUMX.

Here was the response:

That’s really overcomplicating things. No point in skipping best practices in your code. He doesn’t need to know such nuances to understand to avoid SUMX when possible

And no, sumx(table,col) is not the same as sum(col). One iterates on each row of the table, one sums up the column

And this was basically me from 2016 until….um….embarrassingly 2022. I knew iterators were bad, and some were worse than others. People said they were bad, so I avoided them. I couldn’t tell you when performance became an issue. I didn’t know enough internals to accurately intuit why it was slow. I just assumed it was like a cursor in SQL.

I then repeated in my lectures that it was bad sometimes. Something something nested iterators. Something something column lookups. I was spreading misinformation or at least muddled information. I had become part of the problem.

And that’s the problem. Dogma turns off curiosity. It turns off the desire to learn about the formula engine and the storage engine, to learn about data caches, to know a system deep in your bones.

Dogma is great when you can stay on the golden path. But when you deviate from the path and need to get back, all you get is a scolding and a spanking. This is my concern. Instead of equipping learners, are we preparing them to feel utterly overwhelmed when they get lost in the woods?

How we can be better

I think the path to being better is simple.
1. Avoid absolutist language. Many of these commandments have exceptions, a few don’t. Many lead to a better default experience or better performance on average. Say that.
2. Give reasons why. In a 280 character post, spend 100 characters on why. The reader can research if they want to or ask for elaboration.
3. Encourage learning internals. Give the best practice but then point to tools like DAX studio to see under the hood. Teach internals in your demos.
4. Respect your audience. Treat your audience with respect and assume they are intelligent. Don’t denigrate business users or casual learners.
It’s hard to decide how much to explain, and no one wants to fit a lecture into their “TOP 10 TIPS!” post. But a small effort here can make a big difference.
The fraught ethics around AI, ChatGPT, and Power BI.

January 1, 2025

|

Uncategorized
The more I tried to research practical ways to make use of ChatGPT and Power BI, the more pissed I became. Like bitcoin and NFTs before it, this is a world inextricably filled with liars, frauds, and scam artists. Honestly many of those people just frantically erased blockchain from their business cards and scribbled on “AI”.

There are many valid and practical uses of AI; I use it daily. But there are just as many people who want to take advantage of you. It is essential to educate yourself on how LLMs work and what their limitations are.

Other than Kurt Buhler and Chris Webb, I have yet to find anyone else publicly and critically discussing the limitations, consequences, and ethics of applying this new technology to my favorite reporting tool. Aside from some video courses on LinkedIn Learning, nearly every resource I find seems to either have a financial incentive to downplay the issues and limitations of AI or seems to be recklessly trying to ride the AI hype wave for clout.

Everyone involved here is hugely biased, including myself. So, let’s talk about it.

Legal disclaimer

Everything below is my own personal opinion based on disclosed facts. I do not have, nor am I implying having, any secret knowledge about any parties involved. This is not intended as defamation of any individuals or corporations. This is not intended as an attack or a dogpile on any individuals or corporations and to that effect, in all of my examples I have avoided directly naming or linking to the examples.

Please be kind to others. This is about a broader issue, not about any one individual. Please do not call out, harass, or try to cancel any individuals referenced in this blog post. My goal here is not to “cancel” anyone but to encourage better behavior through discussion. Thank you.

LLMs are fruit of the poisoned tree

Copyright law is a societal construct, but I am a fan of it because it allows me to make a living. I’m not a fan of it extending 70 years after the author’s death. I’m not a fan of companies suing archival organizations. But if copyright law did not exist, I would not have a job as a course creator. I would not be able to make the living I do.

While I get annoyed when people pirate my content, on some level I get it. I was a poor college student once. I’ve heard the arguments of “well they wouldn’t have bought it anyway”. I’ll be annoyed about the $2 I missed out on, but I’ll be okay. Now, if you spin up a BitTorrent tracker and encourage others to pirate, I’m going to be furious because you are now directly attacking my livelihood. Now it is personal.

Whatever your opinions are on the validity of copyright law and whether LLMs count as Fair Use or Transformative Use, one thing is clear. LLMs can only exist thanks to massive and blatant copyright infringement. LLMs are fruit of the poisoned tree. And no matter how sweet that fruit, we need to acknowledge this.

Anything that is publicly available online is treated as fair game, regardless of whether or not the author of the material has given or even implied permission, including 7,000 indie books that were priced at $0. Many lawsuits allege that non-public, copyrighted material is being used, given AI’s ability to reproduce snippets of text verbatim. In an interview with the Wall Street Journal, OpenAI’s CTO dodged the question on whether Sora was trained on YouTube videos.

Moving forward, I will be pay-walling more and more of my content as the only way to opt-out of this. As a consequence, this means less free training material for you, dear reader. There are negative, personal consequences for you.

Again, whatever your stance on this is (and there is room for disagreement on the legalities, ethics, and societal benefits), it’s shocking and disgusting that this is somehow all okay, but in the early 2000s the RIAA and MPAA sued thousands of individuals for file-sharing and copyright infringement, including a 12-year-old girl. As a society, there is a real incoherence around copyright infringement that seems to be motivated primarily by profit and power.

The horse has left the barn

No matter how mad or frustrated I may get, the horse has permanently left the barn. No amount of me stomping my feet will change that. No amount of national regulation will change that. You can run a GPT-4 level LLM on a personal machine today. Chinese organizations are catching up in the LLM race. And I doubt any Chinese organization intends to listen to US or EU regulations on the matter.

Additionally, LLMs are massively popular. One survey in May 2024 (n=4010) of participants in the education system found that 50% of students and educators were using ChatGPT weekly.

Another survey from the Wharton Business School of 800 business leaders found that weekly usage of AI had gone up from 37% in 2023 to 73% in 2024.

Yet another study found that 24% of US workers aged 18-64 use AI on a weekly basis.

If you think that AI is a problem for society, then I regret to inform you that we are irrevocably screwed. The individual benefits and corporate benefits are just too strong and enticing to roll back the clock on this one. Although I do hope for some sort of regulation in this space.

So now what?

While we can vote for and hope for regulation around this, no amount of regulation can completely stop it, in the same way that copyright law has utterly failed to stop pirating and copyright infringement.

Instead, I think the best we can do is to try to hold ourselves and others to a higher ethical standard, no matter how convenient it may be to do otherwise. Below are my opinions on the ethical obligations we have around AI. Many will disagree, and that’s OK! I don’t expect to persuade many of you, in the same way that I’ll never persuade many of my friends to not pirate video games that are still easily available for sale.

Obligations for individuals

As an individual, I encourage you to educate yourself on how LLMs work and their limitations. LLMs are a dangerous tool and you have an obligation to use them wisely.

Here are some of my favorite free resources:
- Intro to Large Language Models – Andrej Karpathy
- Large Language Models explained Briefly – Grant Sanderson
- Transformers (how LLMs work) explained visually – Grant Sanderson
- Let’s Build GPT: from scratch, in code, spelled out – Andrej Karpathy
- What is ChatGPT Doing and Why Does it Work? – Stephen Wolfram
- One Useful Thing – Ethan Mollick
Additionally, Co-Intelligence: Living and Working with AI by Ethan Mollick is a splendid, splendid book on the practical use and ethics of LLMs and can be gotten cheaply at Audible.

If you are using ChatGPT for work, you have an obligation to understand when and how it can train on your chat data (which it does by default). You have an ethical obligation to follow your company’s security and AI policies to avoid accidentally exfiltrating confidential information.

I also strongly encourage you to ask ChatGPT questions in your core area of expertise. This is the best way to understand the jagged frontier of AI capabilities.

Obligations for content creators

If you are a content creator, you have an ethical obligation to not use ChatGPT as a ghostwriter. I think using it for a first pass can be okay and using it for brainstorming or editing is perfectly reasonable. Hold yourself to the same standards as if you were using a human.

For example, if you are writing a conference abstract and you use ChatGPT, that’s fine. I have a friend whose abstracts I help edit and refine. Although, be aware that if you don’t edit the output, the organizers can tell because it’s going to be mediocre.

But if you paid someone to write an entire technical article and then slapped your name on it, that would be unethical and dishonest. If I found out you were doing that, I would stop reading your blog posts and in private I would encourage others to do the same.

You have an ethical obligation to take responsibility for the content you create and publish. To not do so is functionally littering at best, and actively harmful and malicious at worst. To publish an article about using Power BI for DAX without testing it first is harmful and insulting. Below is an article on LinkedIn with faulty DAX code that subverted the point of the article. Anyone who tried to use the code would have potentially wasted hours troubleshooting.

Don’t put bad code online. Don’t put untested code online. Just don’t.

One company in the Power BI space has decided to AI generate articles en masse, with (as far as I can tell), no human review for quality. The one on churn rate analysis is #2 on the search results for Bing.

When you open the page, it’s a bunch of AI generated slop including the ugliest imitation of the Azure Portal I have ever seen. This kind of content is a waste of time and actively harmful.

I will give them credit for at least including a clear disclaimer, so I don’t waste my time. Many people don’t do even that little. Unfortunately, this only shows up when you scroll to the bottom. This means this article wasted 5-10 minutes of my time when I was trying to answer a question on Reddit.

Even more insultingly, they ask for feedback if something is incorrect. So, you are telling me you have decided to mass litter content on the internet, wasting people’s time with inaccurate posts and you want me to do free labor to clean up your mess and benefit your company’s bottom line? No. Just no.

Now you may argue “Well, Google and Bing do it with their AI generated snippets. Hundreds of companies are doing it.”. This is the most insulting and condescending excuse I have ever heard. If you are telling me that your ethical bar is set by what trillion dollar corporations are doing, well then perhaps you shouldn’t have customers.

Next, if you endorse an AI product in any capacity, you have an ethical obligation to announce any financial relationship or compensation you receive from that product. I suspect it’s rare for people in our space to properly disclose these financial relationships, and I can understand why. I’ve been on the fence on how much to disclose in my business dealings. However, I think it’s important and I make an effort to do it for any company that I’ve done paid work with, as that introduces a bias into my endorsement.

These tools can produce bad or even harmful code. These tools are extremely good at appearing to be more capable than they actually are. It is easy to violate the data security boundary with these tools and allow them to train their models on confidential data.

For goodness sake, perhaps hold yourself to a higher ethical standard than an influencer on TikTok.

Obligations for companies

Software companies that combine Power BI and AI have an obligation to have crystal clear documentation on how they handle both user privacy and data security. I’m talking architecture diagrams and precise detail about what if any user data touches your servers. A small paragraph is woefully inadequate and encourages bad security practices. Additionally, this privacy and security information should be easily discoverable.

I was able to find three companies selling AI visuals for Power BI. Below is the entirety of the security statements I could find, outside of legalese buried in their terms of service or privacy documents.

While the security details are hinted at in the excerpts below, I’m not a fan of “just trust us, bro”. Any product that is exfiltrating your data beyond the security perimeter needs to be abundantly clear on the exact software architecture and processes used. This includes when and how much data is sent over the wire. Personally, I find the lack of this information to be disappointing.

Product #1

“[Product name] provides a secure connection between LLMs and your data, granting you the freedom to select your desired configuration.”

“Why trust us?

Your data remains your own. We’re committed to upholding the highest standards of data security and privacy, ensuring you maintain full control over your data at all times. With [product name], you can trust that your data is safe and secure.”

“Secure

At [Product name], we value your data privacy. We neither store, log, sell, nor monitor your data.

You Are In Control

We leverage OpenAI’s API in alignment with their recommended security measures. As stated on March 1, 2023, “OpenAI will not use data submitted by customers via our API to train or improve our models.”

Data Logging

[Product name] holds your privacy in the highest regard. We neither log nor store any information. Post each AI Lens session, all memory resides locally within Power BI.”

Product #2

Editor’s note: this sentence on AppSource was the only mention of security I could find. I found nothing on the product page.

“This functionality is especially valuable when you aim to offer your business users a secure and cost-effective way of interacting with LLMs such as ChatGPT, eliminating the requirement for additional frontend hosting.”

Product #3

“Security

The data is processed locally in the Power BI report. By default, messages are not stored. We use the OpenAI model API which follows a policy of not training their model with the data it processes.”

“Is it secure? Are all my data sent to OpenAI or Anthropic?

The security and privacy of your data are our top priorities. By default, none of your messages are stored. Your data is processed locally within your Power BI report, ensuring a high level of confidentiality. Interacting with the OpenAI or Anthropic model is designed to be aware only of the schema of your data and the outcomes of queries, enabling it to craft responses to your questions without compromising your information. It’s important to note that the OpenAI and Anthropic API strictly follows a policy of not training its model with any processed data. In essence, both on our end and with the OpenAI or Anthropic API, your data is safeguarded, providing you with a secure and trustworthy experience.”

Clarity about the model being used

Software companies have an obligation to clearly disclose which AI model they are using. There is a huge, huge difference in quality between GPT-3.5, GPT-4o mini, and GPT-4o. Enough so that to not be clear on this is defrauding your customers. Thankfully, some software companies are good about doing this, but not all.

Mention of limitations

Ideally, any company selling you on using AI will at least have some sort of reasonable disclaimer about the limitations of AI and for Power BI, which things AI is not the best at. However, I understand that sales is sales and that I’m not going to win this argument. Still, this frustrates me.

Final thoughts

Thank you all for bearing with me. This was something I really needed to get off my chest.

I don’t plan on stopping using LLMs anytime soon. I use ChatGPT daily in my work and I recently signed up for GitHub Copilot and plan to experiment with that. If I can ever afford access to an F64 SKU, I plan to experiment with Copilot for Fabric and Power BI as well.

If you are concerned about data security, I recommend looking into tools like LM Studio and Ollama to safely and securely experiment with local LLMs.

I think if used wisely and cautiously, these can be an amazing tool. We all have an obligation to educate ourselves on the best use of them and their failings. Content creators have an obligation to disclose financial incentives, when they use ChatGPT heavily to create content, and general LLM limitations. Software companies have an obligation to be crystal clear about security and privacy, as well as which models they use.
Lessons learned from Self-employment: 6 years in

December 17, 2024

|

Career & Professional Development, Self-Employment Insights

On some level, I’ve started to hate writing these blog posts.

The original intent was to show the ups and downs of being a consultant, inspired by Brent Ozar’s series on the same thing. There’s a huge survivorship bias in our field: only the winners talk about self-employment, and the LinkedIn algorithm encourages only Shiny Happy People. But when you enter the third consecutive year of the 3 most difficult years of your career, you start to wonder if it might be a you problem. So here we go.

Pivoting my business

Two years ago, Pluralsight gave all authors a 25% pay cut and I knew I needed to get out. I reached out to everyone I knew who sold courses themselves for advice. I’m deeply grateful to Matthew Roche, Melissa Coates, Brent Ozar, Kendra Little, and Erik Darling for the conversations that calmed my freak out at the time.

One year ago, I learned that I can’t successfully make content my full-time job while also successfully consulting. Consulting work tends to be a lot of hurry-up-and-wait. Lots of fires, emergencies, and urgencies. No customer is going to be happy if you tell them the project needs to wait a month because you have a course you need to get out. Previously with Pluralsight I was able to make it work because they scoped the work, so it was more like a project. Not so when hungry algos demand weekly content.

So, I cut the consulting work to a bare minimum. Thankfully, I receive enough money from Pluralsight royalties that even with the cut we never have to worry about paying the mortgage. However, it’s nowhere close to covering topline expenses. At the beginning of the pandemic, $6k/mo gross revenue was what we needed to live comfortably (Western PA is dirt cheap). After the pandemic, I hired a part-time employee, inflation happened, and I pay for a lot more subscriptions, like Teachable and StreamlineHQ, so that number is closer to $9k/mo now.

I can confirm that I have not and never will make $9k/mo or more from just Pluralsight. My royalties overall have been stagnant or even gone down a bit since the huge spike upwards in early 2020. So it’s not enough to live off of alone.

Finally, after a lot of dithering in 2023, I decided to set a public and hard deadline for my course. We were launching in February 2024, come hell or high water. I launched with 2 out of 7 modules and it was a huge success, making low four figures. I’m grateful to everyone who let me onto their podcast or livestream, which provided a noticeable boost in sales.

Unfortunately, I had a number of projects right after launch, taking a lot of my focus. I also found out that this content was much much more difficult than the Pluralsight content I was used to. There was no one from curriculum to hand me a set of course objectives to build to. No one to define the scope and duration of the course.

What’s worse, the reason there is a moat and demand for Power BI performance tuning content is that almost no one talks about it. You have dozens of scattered Chris Webb blog posts, a book and a course from SQLBI, a course by Nikola Ilic, and a book by Thomas LeBlanc and Bhavik Merchant. And that’s about it?

I thought I was going to be putting out a module per week, when in reality I was doing Google searches for “Power BI Performance tuning”, opening 100 tabs, and realizing I had signed myself up for making 500 level internals content. F*ck.

A summer of sadness

All at the same time I was dealing with burnout. My health hadn’t really improved any over the past 3 years and I was finding it hard to work at all. I was anxious. I couldn’t focus. And the content work required deep thought and space and I couldn’t find any. I felt a sense of fragility where I might have a good week one week and then have a bad night’s sleep and derail the next week.

I hadn’t made any progress on my course and a handful of people reached out. I apologized profusely, offered refunds, and promised to give them free access to the next course. If you were impacted by my delays, do please reach out.

In general, I decided that I needed to keep cutting things. I tried to get any volunteer or work obligations off my plate. The one exception is I took on bringing back SQL Saturday Pittsburgh. With the help of co-organizers like James Donahoe and Steph Bruno, it was a lot of work but a big success. I’m very proud of that accomplishment.

Finally turning a corner

I think I finally started turning a corner around PASS Summit. It was refreshing to see my friends and see where the product is going. Before Summit, I had about 3.5 modules done. In the period of a few weeks I rushed to get the rest done. This was also because I really wanted to get the course finished for a Black Friday sale.

The sale went well, making mid three figures. Not enough to live on, but proof that there is demand and it’s worth continuing instead of burning it all down and getting a salaried job. Still, I recently had to float expenses on a credit card for the first time in years, so money is tighter than it used to be. Oh the joys of being paid NET 30 or more.

Immediately after Black Friday, I went to Philadelphia to deliver a week-long workshop on Fabric and Power BI. The longest training I had ever given before was 2 days. The workshop went well, but every evening I was rushing back to my hotel room to make more content. You would think that 70 slides plus exercises would last a whole day, but no, not even close.

Now I’m back home and effectively on vacation for the rest of the year and it’s lovely. I’m actually excited to be working on whatever whim hits me, setting up a homelab and doing Fabric benchmarks. It’s the first time I’ve done work for fun in years.

I’m excited for 2025 but cautious to not over-extend myself.
Fabric Benchmarking Part 1: Copying CSV Files to OneLake.

December 15, 2024

|

Microsoft Fabric, Performance Optimization

First, a disclaimer: I am not a data engineer, and I have never worked with Fabric in a professional capacity. With the announcement of Fabric SQL DBs, there’s been some discussion on whether they are better for Power BI import than Lakehouses. I was hoping to do some tests, but along the way I ended up on an extensive Yak Shaving expedition.

I have likely done some of these tests inefficiently. I have posted as much detail and source code as I can and if there is a better way for any of these, I’m happy to redo the tests and update the results.

Part one focuses on loading CSV files to the files portion of a lakehouse. Future benchmarks will look at CSV to delta and PBI imports.

General Summary

In this benchmark, I generated ~2 billion rows of sales data using the Contoso data generator on an F8as_v6 virtual machine in Azure with a terabyte of premium SSD. This took about 2 hours (log) and produced 194 GB of files, which works out to about $1-2 as far as I can tell (assuming you shut down the VM and delete the premium disk quickly). You could easily do it for cheaper, since it only needed about 16 GB of RAM.

In general, I would create a separate lakehouse for each test and a separate workspace for each run of a given test. This was tedious and inefficient, but the easiest way to get clean results from the Fabric Capacity Metrics app without automation or custom reporting. I tried to set up Will Crayger’s monitoring tool but ran into some issues and will be submitting some pull requests.

To get the CU seconds, I copied from the Power BI visual in the metrics app and tried to ignore incidental costs (like creating a SQL endpoint for a lakehouse). To get the costs, I took the price of an F2 in East US 2 ($162/mo), divided it by the number of CUs (2 CUs), and divided by the number of seconds in 30 days (30*24*60*60). This technically overestimates the costs for months with 31 days in them.
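
As a rough sketch, the per-CU-second price described above works out as follows. The price, CU count, and 30-day month come from the paragraph above; the constant and function names are made up for illustration.

```python
# Sketch of the cost-per-CU-second arithmetic, using the figures quoted in the text.
F2_PRICE_PER_MONTH = 162.0  # USD, F2 in East US 2 (from the text)
F2_CAPACITY_UNITS = 2       # CUs in an F2 (from the text)


def price_per_cu_second(days_in_month: int = 30) -> float:
    """Spread the monthly price evenly over every CU second in the month."""
    seconds_in_month = days_in_month * 24 * 60 * 60
    return F2_PRICE_PER_MONTH / F2_CAPACITY_UNITS / seconds_in_month


rate_30 = price_per_cu_second(30)  # the rate used for the cost estimates
rate_31 = price_per_cu_second(31)  # what the rate would be in a 31-day month

print(f"30-day month: ${rate_30:.8f} per CU second")
print(f"31-day month: ${rate_31:.8f} per CU second")
```

The 30-day rate is slightly higher than the 31-day rate, which is why multiplying CU seconds by it overestimates the cost in longer months.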

Anyway, here are the numbers:

External methods of file upload (Azure Storage Explorer, AzCopy, and OneLake File Explorer) are clear winners, and browser-based upload is a clear loser here. Do be aware that external methods may have external costs (i.e. Azure costs).

Data Generation process

As I mentioned, I spun up a beefy VM and ran the Contoso Data Generator, which is surprisingly well documented for a free, open source tool. You’ll need .NET 8 installed to build and run the tool. The biggest thing is that you will want to modify the config file if you want a non-standard size for your data. In my case, I wanted 1 billion rows of data (OrdersCount setting) and I limited each file to 10 million rows of data (CsvMaxOrdersPerFile setting). This technically will produce 1 billion orders, so actually 2 billion sales rows when the order header is combined with the order line item. This produced 100 sales files of about 1.9 GB each.
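
As a quick sanity check on those settings, here is a small sketch of how the order count and per-file limit translate into the number of sales files. It assumes that nearly all of the 194 GB reported earlier is sales data, which is a simplification; the variable names are hypothetical and are not the generator's actual config keys beyond the two settings named above.

```python
# Sketch: how the two generator settings translate into file count and size.
ORDERS_COUNT = 1_000_000_000          # OrdersCount setting (from the text)
CSV_MAX_ORDERS_PER_FILE = 10_000_000  # CsvMaxOrdersPerFile setting (from the text)
TOTAL_SIZE_GB = 194                   # total output size reported in the summary

# Ceiling division, so a partially filled final file would still be counted.
sales_files = -(-ORDERS_COUNT // CSV_MAX_ORDERS_PER_FILE)

# Assumption: treat the whole 194 GB as sales files to get a rough average.
avg_file_size_gb = TOTAL_SIZE_GB / sales_files

print(f"{sales_files} sales files, roughly {avg_file_size_gb:.2f} GB each")
```
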

I was hoping the temporary SSD drive included with Azure VMs was going to be enough, but it was ~30 GB if I recall, not nearly big enough. So instead, I went with Premium SSD storage, which has the downside of burning into my Azure Credits for as long as it exists.

One very odd note: at around 70 percent complete, the data generation halted for no particular reason for about 45 minutes. It was only using 8 GB of the 32 GB available and was completely idle with no CPU activity. Totally bizarre. You can see it in the generation log. My best theory is it was waiting for the file system to catch up.

Lastly, I wish I was aware of how easy it was to expand the VM disk image when you allocate a terabyte of SSD. Instead, I allocated the rest of the SSD as an E drive. It was still easy to generate the data, but it added needless complication.

CSV to CSV tests

Thanks to James Serra’s recent blog post, I had a great starting point to identify all the ways to load data into Fabric. That said, I’d love it if he expanded it to full paragraphs since the difference between a copy activity and a copy job was not clear at all. Additionally, the Contoso generator docs list 3 ways to load the data, which was also a helpful starting point.

I stored the data on a container on Azure Blob storage with Hierarchical Namespaces turned on and it said the Data Lake Storage endpoint is turned on by default, making it Azure Data Lake Storage Gen 2? At least I think it does, but I don’t know for sure and I have no idea how to tell.

Azure Storage Explorer

The Azure Storage Explorer is pretty neat and I was able to get it running without issue or confusion. Here are the docs for connecting to OneLake; it’s really straightforward. I did lose my RDP connection during all three of the official tests, because it maxed out IO on the disk, which was the OS disk. I probably should have made a separate data disk, UGH. Bandwidth would fluctuate wildly between 2,000 and 8,000 Mbps. I suspect a separate disk would go even faster. The first time I had tested it, I swear it stayed at 5,000 Mbps and took 45 seconds, but I failed to record that.

It was also mildly surprising to find there was a deletion restriction for workspaces with capital letters in the name. Also, based on the log files in the .azcopy folder, I’m 95% sure the storage explorer is just a wrapper for AzCopy.

AzCopy

AzCopy is also neat, but much more complicated, since it’s a command line program. Thankfully, Azure Storage Explorer let me export the AzCopy commands so I ran that instead of figuring it out myself or referencing the Contoso docs.

If you go this route, you’ll get a message like “To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code ABCDE12FG to authenticate”. This authentication could be done from any computer, not just the VM, which was neat.

I got confirmation from the console output that the disk was impacting upload speeds. Whoops.

OneLake File Explorer

The OneLake File Explorer allows you to treat your OneLake like it was a OneDrive folder. This was easy to set up and use, with a few minor exceptions. First, it’s not supported on Windows Server and in fact I couldn’t find a way at all to install the MSIX file on Windows Server 2022. I tried to follow a guide to do that, but no luck.

The other issue is I don’t know what the heck I’m doing, so I didn’t realize I could expand the C drive on the default image. Instead, I allocated the spare SSD space to the F drive. But when I tried to copy the files to the C drive, there wasn’t enough space, so I had to copy them in 3 batches of 34 files.

This feature is extremely convenient but was challenging to work with at this scale. First, because it’s placed under the Users folder, both Windows search index and anti-virus were trying to scan the files. Additionally, because my files were very large, it would be quite slow when I deallocated files to free up space.

Oddly, the first batch stayed around 77 MB/s, the second was around 50 MB/s, and the last batch tanked to a speed of 12 MB/s, more than doubling the upload time. Task Manager showed disk usage at 100%, completely saturated. I tried taking a look at resource monitor but I didn’t see anything unusual. Most likely it’s just a bad idea to copy 194 GB from one drive back to itself, while deallocating the files in-between.

Browser Upload

Browser-based file upload was the most expensive in terms of CUs but was very convenient. It was shockingly stable as well. I’ve had trouble downloading multiple large files with Edge/Chrome before, so I was surprised it uploaded one hundred 2 GB files without issue or error. It took 30 minutes, but I expected a slowdown going via browser so no complaints here. Great feature.

Pipeline Copy Activity

Setting up a pipeline copy activity to read from Azure Blob storage was pretty easy to do. The biggest challenge was navigating all the options without feeling overwhelmed.

Surprisingly, there was no measurable difference in CUs between schema agnostic (binary) copy and not schema agnostic (CSV validation) copy. However, all the testing returned the same cost, so I’m guessing the costing isn’t as granular and doesn’t pick up a 2 second difference between runs.

Based on the logs it looks like it may also be using AzCopy because azCopyCommand was logged as true. It’s AzCopy all the way down apparently. The CU cost (23,040) is exactly equal to 2 times the logged copy duration (45 s) times the usedDataIntegrationUnits (256), so I suspect this is how it’s costed, but I have no way of proving it. It would explain why there was no cost variation between runs.
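
The suspected relationship can be checked with a few lines of arithmetic. This is only a sketch of the guess described above, not documented billing behavior: the multiplier of 2 is inferred from the numbers, and the dollar figure reuses the F2 price and 30-day month from the summary section. All names are made up for illustration.

```python
# Sketch of the *suspected* costing formula; the multiplier is a guess, not documented.
COPY_DURATION_SECONDS = 45         # logged copy duration (from the text)
USED_DATA_INTEGRATION_UNITS = 256  # usedDataIntegrationUnits from the log (from the text)
SUSPECTED_MULTIPLIER = 2           # inferred from the numbers, unverified
REPORTED_CU_SECONDS = 23_040       # CU cost shown in the metrics app (from the text)

estimated_cu_seconds = (
    SUSPECTED_MULTIPLIER * COPY_DURATION_SECONDS * USED_DATA_INTEGRATION_UNITS
)
formula_matches = estimated_cu_seconds == REPORTED_CU_SECONDS

# Convert to dollars with the F2 price: $162/mo, 2 CUs, 30-day month.
price_per_cu_second = 162.0 / 2 / (30 * 24 * 60 * 60)
estimated_cost_usd = estimated_cu_seconds * price_per_cu_second

print(f"Estimated CU seconds: {estimated_cu_seconds} (matches report: {formula_matches})")
print(f"Estimated cost: ${estimated_cost_usd:.2f}")
```

If the formula holds, any run that logs the same duration and the same number of data integration units would be billed identically, which fits the lack of variation between runs.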

Pipeline Copy Job

The copy job feature is just lovely. I was confused based on the name how it differed from a copy activity, but it seems to be a simpler way of copying files with fewer overwhelming options and nicer UI that clearly shows throughput, etc. The JSON code also looks very simple. Just wonderful overall.

It is in preview, so you will have to turn it on. But that’s just an admin toggle. Reitse Eskens has a nice blog post on it. My only complaint is I didn’t see a way to copy a job or import the JSON code.

Spark Notebook – Fast copy

My friend Sandeep Pawar recommended trying fastcp from notebookutils in order to copy files with Spark. The documentation is fairly sparse for now, but Sandeep has a short blog post that was helpful. Still, understanding the exact URL structure and how to authenticate was a challenge.

Fastcp is a wrapper for….you guessed it, AzCopy. It seems to take the same time as all the other options running AzCopy (45 seconds) + about 12 seconds for spinning up a Spark session as far as I can tell. Sandeep has told me that it also works in Python for cheaper, but when I ran the same code I got an authorization error.

Overall, I see the appeal of Spark notebooks, but one frustration was that DAX has taught me to press Alt + Enter when I need a newline, which does the exact opposite in notebooks and will instead execute a cell and make a new one.

Learnings and paper cuts

I think my biggest knowledge gap overall was in the precise difference between blob storage and ADLS Gen2 storage, as well as access URLs and access methods. Multiple times I tried to generate an SAS key from the Azure Portal and got an error when I tried to use it. Once, out of frustration, I copied the one from the export to AzCopy option into my Spark notebook to get it to work. Another time I used the generate SAS UI in the storage explorer and that worked great.
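To make the URL difference concrete, here is a small sketch of the two address shapes for the same storage account: the Blob service endpoint over HTTPS and the ADLS Gen2 `abfss` URI. The helper names, account, container, and path are hypothetical placeholders, and which form a given tool expects is something to verify per tool.

```python
def blob_https_url(account, container, path="", sas_token=""):
    """Blob service endpoint; optionally appends a SAS token as the query string."""
    url = f"https://{account}.blob.core.windows.net/{container}/{path.lstrip('/')}"
    return f"{url}?{sas_token.lstrip('?')}" if sas_token else url


def adls_abfss_url(account, container, path=""):
    """ADLS Gen2 (dfs endpoint) URI in the abfss scheme commonly used from Spark."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


# Hypothetical names, for illustration only.
print(blob_https_url("myaccount", "raw", "data/file.csv"))
print(adls_abfss_url("myaccount", "raw", "data/file.csv"))
```

Note that the container moves from the path (Blob) to the front of the host (abfss), which is an easy detail to get wrong when switching between the two.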

Even trying to be aware of all the ways you can copy both CSV files as well as convert CSV to delta is quite a bit to take on. I’m not sure how anyone does it.

My biggest frustration with Fabric right now is around credentials management. Because I had made some different tests, if I searched for “blob”, 3 options might show up (1 blob storage, 2 ADLS).

Twice, I clicked on the wrong one (ADLS) and got an error. The icons and name are identical so the only way you can tell the difference is by “type”.

This is just so, so frustrating. Coming from Power BI, I know exactly where the data connection is because it’s embedded in the semantic model. In OneLake it appears that connections are shared and I have no idea what scope they are shared within (per user, per workspace, per domain?) and I have no idea where to go to manage them. This produces a sense of unease and being lost. It also led to frustration multiple times when I tried to add a lakehouse data source but my dataflow already had that source.

What I would love to see from the team is some sort of clear and easily accessible edit link when it pulls in an existing data source. This would be simple (I hope) and would lead to a sense of orientation, the same way that the settings section for a semantic model has similar links.
Fabric Licensing from Scratch

December 7, 2024

|

Glossary & Resources, Microsoft Fabric
The Basics

If you’ve dealt with Power BI licensing before, Fabric licensing makes sense as an extension of that model plus some new parts around CUs, bursting and smoothing. But what if you are brand new to Fabric, Power BI, and possibly even Office 365?

If you want to get started with Fabric, you need at a bare minimum the following:
1. Fabric computing capacity. The cheapest option, F2, costs $263 per month for pausable capacity (called Pay-as-you-go) and $156 per month for reserved capacity. Like Azure, prices vary per region.
2. An Entra tenant. Formerly called Azure Active Directory, Entra is required for managing users and authentication. You will also need an Office 365 tenant on top of that.
3. Fabric Free license. Even though you are paying for compute capacity, all users need some sort of license applied to them as well.
Once you have an F2, you can assign that capacity to Fabric workspaces. Workspaces are basically fancy content folders with some security on top. Workspaces are the most common way access is provided to content. With the F2 you’ll have access to all non-Power BI Fabric objects.
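The two F2 prices quoted above suggest a rough break-even between pausing a pay-as-you-go capacity and committing to a reservation. The sketch below assumes pay-as-you-go cost scales linearly with the fraction of the month the capacity is running, and uses the list prices from this post; real prices vary by region.

```python
PAYG_MONTHLY = 263.0      # F2 pay-as-you-go, running all month (from the post)
RESERVED_MONTHLY = 156.0  # F2 reserved (from the post)


def payg_cost(fraction_running):
    """Assumed model: pay-as-you-go cost is proportional to time spent running."""
    return PAYG_MONTHLY * fraction_running


# Fraction of the month at which both options cost the same.
break_even_fraction = RESERVED_MONTHLY / PAYG_MONTHLY
print(f"Break-even at {break_even_fraction:.1%} of the month")
```

Under these assumptions, pay-as-you-go only wins if the capacity is paused more than roughly 40% of the time; above about 59% uptime the reservation is cheaper.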

The F2 SKU provides 0.25 virtual cores for Power BI workloads, 4 virtual cores for Spark workloads, and 1 core for data warehouse workloads. These all correspond to 2 CUs, also known as capacity units. CUs are a made-up unit like DTUs for databases or Fahrenheit in America. They are, however, the way that you track and manage everything in your capacity and keep costs under control.
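To get a feel for how those numbers grow with SKU size, here is a minimal sketch that scales the F2 figures by CU count. Linear scaling is an assumption for illustration, and the dictionary keys are made-up names.

```python
F2_CUS = 2
# Core counts for F2 as stated above.
F2_CORES = {"power_bi_vcores": 0.25, "spark_vcores": 4, "warehouse_cores": 1}


def cores_for_sku(cus):
    """Scale the F2 core counts linearly to a SKU with the given number of CUs."""
    scale = cus / F2_CUS
    return {name: cores * scale for name, cores in F2_CORES.items()}


print(cores_for_sku(64))
# {'power_bi_vcores': 8.0, 'spark_vcores': 128.0, 'warehouse_cores': 32.0}
```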

Storage is paid for separately. OneLake storage costs $0.023 per GB per month. You also get free mirroring storage, in TB, equal to your SKU number. So F2 gets 2 TB of free mirroring storage.

There is no cost for networking, but that will change at some point in the future.

Power BI content

If your users want to create Power BI reports in these workspaces, they will need to be assigned a Power BI Pro license at a minimum, which costs $14 per user per month. This applies to both report creators and report consumers. Pro provides a majority of Power BI features.

The features this does not provide are covered by Power BI Premium per User (PPU) licenses, which cost $24 per user per month. These licenses allow for things like more frequent refreshes and larger data models. PPU is a hybrid license because you both license the user as well as assign the content to a workspace set to PPU capacity.

One of the downsides of the PPU model is that PPU users act as a universal receiver of content, but PPU workspaces are not a universal donor. Essentially, the only way for anyone to read reports hosted in a PPU workspace is to have a PPU license. So, you can’t use this as a cheat code to license your report creators with PPU and everyone else with Pro. Nice try.

There is demand for a Fabric equivalent, an FPU license, but there is no word on when or if this will happen. Folks estimate this could cost anywhere from $30 to $70 per user per month if we get one.

Finally, if you ramp up to an F64 SKU, Power BI content is then included. Users will still need a Fabric Free license. At $5002/mo for F64, this means it’s worth switching over at 358 Pro users or 209 PPU users. Additionally, you unlock all premium features including Copilot.
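The break-even counts come from dividing the capacity price by the per-user price and rounding up. A minimal sketch using the prices quoted in this post (the function name is just for illustration):

```python
import math

F64_MONTHLY = 5002  # F64 price per month (from the post)
PRO_PER_USER = 14   # Power BI Pro, per user per month
PPU_PER_USER = 24   # Premium per User, per user per month


def break_even_users(capacity_cost, per_user_cost):
    """Smallest user count whose license bill meets or exceeds the capacity cost."""
    return math.ceil(capacity_cost / per_user_cost)


print(break_even_users(F64_MONTHLY, PRO_PER_USER))  # 358
print(break_even_users(F64_MONTHLY, PPU_PER_USER))  # 209
```

This simple version ignores the Pro licenses that report creators still need on F64, so the real break-even is slightly higher.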

Even if you pay for F64 or higher (or Power BI Report Server on-prem), any report creators need to be licensed with Power BI Pro for use of that publish button. I cannot understand why Microsoft would charge $5k per month and then charge for publishing on top.

There are also licensing complications for embedding Power BI in a custom application, which are outside of the scope of this post. Basically, the A and EM SKUs are more restrictive than the F SKUs.

Capacity management

Despite a Fabric SKU providing a fixed number of Capacity Units, Fabric is also intended to be somewhat flexible. Fabric customers like the pricing predictability of Fabric compared to Azure workloads, but because of the sheer number of tools and workloads supported, actual usage can vary wildly compared to when premium capacity was only Power BI reports.

In order to support that, Fabric allows for bursting and smoothing. This is similar to auto-scaling, but not quite. Bursting will provide you with more capacity temporarily during spiky workloads, by up to a factor of 12 in most cases. However, this bursting isn’t free. You are borrowing against future compute capacity. This means it’s possible to throttle yourself by overextending.

Bursting is balanced out by smoothing. Whenever you have exceeded your default capacity, future work is spread out over a smoothing window. This is a 5 minute window for anything a user might see and 24 hours for background tasks. If you are using pay-as-you-go capacity, you’ll see a spike in CUs when you shut down the capacity, as all of this burst debt is paid off all at once instead of waiting for smoothing to catch up.
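To see why the two smoothing windows matter, here is a deliberately simplified sketch that spreads a job's total consumption evenly over a window and compares it to the capacity. The even spread, the function name, and the 23,040 CU-second job are illustrative assumptions; the real smoothing and throttling logic is more involved.

```python
SECONDS_PER_DAY = 24 * 60 * 60
INTERACTIVE_WINDOW = 5 * 60  # 5 minute window for user-facing work


def smoothed_load(cu_seconds, capacity_cus, window_seconds):
    """Fraction of the capacity a job occupies when spread evenly over the window."""
    cu_per_second = cu_seconds / window_seconds
    return cu_per_second / capacity_cus


job = 23_040  # hypothetical job size in CU-seconds
f2 = 2        # F2 provides 2 CUs

print(f"{smoothed_load(job, f2, SECONDS_PER_DAY):.1%}")     # 13.3% of an F2 for 24 hours
print(f"{smoothed_load(job, f2, INTERACTIVE_WINDOW):.0%}")  # 3840% of an F2 for 5 minutes
```

The same job is a modest background load over 24 hours but far beyond an F2 if it had to be absorbed in 5 minutes, which is the intuition behind borrowing against future capacity.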

From what I’ve been told by peers, it’s possible to effectively take down a capacity with a rogue Spark notebook by bursting for too long. Essentially, smoothing then has to use the full window to catch up. At Ignite 2024 they announced they are working on surge protection to prevent this.

Capacity consumption can be monitored with the Fabric Capacity Metrics App.

I believe you can also upgrade a reserved capacity temporarily and pay the pay-go costs for the difference, but I can’t find docs to that effect.