Monthly Archives: January 2025

Microsoft Fabric Guidance for Small Businesses

If you are a small (or even medium) business, you may be wondering “What is Fabric and do we even need it?” If you are primarily on Power BI Pro licenses today, you may not find a compelling reason to switch to Fabric yet, but the value add should improve over time as new features are added on the Fabric side and some features get deprecated on the Power BI side.

If you have the budget, time, and luxury, then you should start playing around with a Fabric 60-day trial today and continue to experiment with a pausable F2 afterwards. Not because of any immediate value add, but because when the time comes to consider using Fabric, it will be far less frustrating to evaluate your use cases.

This will cost $0.36 per hour for pausable capacity (plus storage), roughly $270/mo if you left it on all the time. See here for a licensing explanation in plain English. Folks on Reddit have shared when they found the entry-level F2 to be useful.

Warning! Fabric provides for bursting and smoothing, with up to 32x consumption for an F2. This means that if you run a heavy workload and immediately turn off your F2, you may get billed as if you had run an F64 because smoothing isn’t given time to pay back down the CU debt. If you are using an F2, you 100% need to research surge protection (currently in preview).

Microsoft is providing you with an ever-growing buffet of tools and options within Fabric. But as with a buffet, if someone has food allergies or dietary restrictions, it would be reckless to toss them at it and say “Good luck!”

What the Fabric is Microsoft Fabric?

If you are a Power BI user, Microsoft Fabric is best seen as an expansive extension of Power BI Premium (and a replacement for it) in order to compete in the massively parallel processing space, similar to tools like Databricks and Snowflake. Fabric does have mirroring solutions for both of those products, so it doesn’t have to be a strict replacement.

Microsoft has not had success in this space historically and has decided to take a bundled approach with Power BI. This bundling means that over time, there will be more motivation for Power BI users to investigate Fabric as a tool as the value of Fabric increases.

Fabric is an attempt to take a Software-as-a-Service approach to the broader Azure data ecosystem, strongly inspired by the success of Power BI. However, this can lead to frustration as you are given options and comparisons, but not necessarily explicit guidance.

Metaphorically speaking, Microsoft is handing you a salad fork and a grapefruit spoon, but no one is telling you “You are eating a grapefruit, use the grapefruit spoon!” This blog post attempts to remedy that with explicit instructions and personal opinions.

The core of Fabric is heavily inspired by the Databricks lakehouse approach specifically, and data lakes more generally. In short, data lakes make sense when it’s cheaper to store the data rather than figure out what to keep. A data lakehouse is the result of taking a lake-first approach and then figuring out how to recreate the flexibility, consistency, and convenience of a SQL endpoint.

How should you approach Fabric?

If you are comfortable with Microsoft Power BI, you should give preference to tools that are built on the same technology as Power BI. This means Gen2 dataflows (which are not a feature superset of Gen1 dataflows), visual SQL queries, and your standard Power BI semantic models. You should only worry about data pipelines and Spark notebooks if and when you run into performance issues with dataflows, which are typically the more expensive option to run. See episode 1 of the Figuring out Fabric podcast for more on when to make the switch.

In terms of data storage, if you are happily pulling data from your existing data sources such as SQL Server or Excel, there is no urgent reason to switch to a lakehouse or a data warehouse as your data source. These tools provide better analytical performance (because of column compression) and a SQL endpoint, but if you are only using Power BI import mode, these features aren’t huge motivators. The Vertipaq engine already provides column compression.

In terms of choosing a Lakehouse versus a Warehouse, my recommendation is to use a Lakehouse for experimentation or as a default and a Warehouse for standalone production solutions. More documentation, design patterns, and non-MSFT content exist around lakehouses. Fabric Data Warehouses are more of a Fabric-specific offshoot.

Both are backed by OneLake storage, which is really Azure Data Lake Storage, which is really Azure Blob Storage but with folder support and big data APIs. Both use the Parquet file format, which is column compressed similar to the Vertipaq engine in Power BI. Both use Delta Lake to provide transactional guarantees for adds and deletes.

Important: I have covered Delta Lake and a lot of the motivation to use these tools in this user group presentation.

Lakehouses are powered by the Spark engine, are more flexible, more interoperable, and more popular than Fabric-style data warehouses. Fabric Data Warehouses are not warehouses in the traditional sense. Instead, they are more akin to modern lakehouses but with stronger transactional guarantees and the ability to write back to the data source via T-SQL. That is to say that a Fabric Data Warehouse is closer in lineage to Hadoop or Databricks than it is to SQL Server Analysis Services or a star schema database on SQL Server.
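To make this concrete, here is roughly what working with a Delta table looks like in a Fabric notebook attached to a lakehouse. This is a minimal sketch with made-up table and column names, not a recommendation of any particular pattern.

```python
# Minimal sketch: a Delta table in a Fabric notebook attached to a lakehouse.
# The "sales" table and its columns are made up; `spark` is pre-defined in
# Fabric notebooks.
from pyspark.sql import Row

df = spark.createDataFrame([
    Row(order_id=1, customer="Contoso", amount=120.50),
    Row(order_id=2, customer="Northwind", amount=80.00),
])

# Saving to the Tables section creates a Delta table: Parquet files plus a
# transaction log, visible to both the SQL analytics endpoint and Direct Lake.
df.write.format("delta").mode("overwrite").saveAsTable("sales")

# Query it back with Spark SQL
spark.sql("SELECT customer, SUM(amount) AS total FROM sales GROUP BY customer").show()
```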

What are the benefits of Fabric?

In the same way that many of the benefits of Power Query don’t apply to people with clean data living in SQL databases, many of the benefits of Fabric may not apply to you, such as Direct Lake (which in my opinion is most useful with more than 100 million rows). Fabric, in theory, provides a single repository of data for data scientists, data engineers, BI developers, and business users to work together. But.

If you are a small business, you do not have any data scientists or data engineers. In fact, your BI dev is likely your sole IT person or a savvy business user who has been field promoted into Power BI dev.

If Power BI is the faucet of your data plumbing, the benefits of industrial plumbing are of little benefit or interest to you. However, you may be interested in setting up or managing a cistern or well, metaphorically speaking. Or you may want to move from a well and an outhouse to indoor plumbing. This is where Fabric can be of value to you.

There are three main benefits of Fabric to small business users, in my opinion. First is if you have a meaningful amount of data in flat files such as Excel and CSV. In my testing, Parquet loads 59% faster and the files are 78% smaller. Compression will vary wildly based on the shape of the data but will follow very similar patterns as the Vertipaq engine in Power BI. Also, technically speaking, in Fabric you are not reading directly from the raw Parquet files into Power BI. Instead, you are going through the lakehouse with Direct Lake or the SQL analytics endpoint.

Moving that data into a Lakehouse and then loading it into Delta tables will likely provide a better user experience, faster Power BI refreshes, and the ability to query the data with a SQL analytics endpoint. Now, as you are already aware, flat file data tends to be ugly. This means that you will likely need to use Gen2 dataflows to clean and load the data into Delta tables instead of doing a raw load.

You may have heard of medallion architecture. This is more naming convention than architecture, but the idea of “zones” of increasing data quality is real and valuable. In your case, I recommend treating the Files section of a lakehouse as your bronze layer, the cleaned Delta tables as your silver layer, and your Power BI semantic model as your gold layer. Anything more than this is overcomplicating things for a small business starting out.
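For the curious, here is the rough shape of that bronze-to-silver step if you script it in a notebook instead of a Gen2 dataflow. The Files path and column names are placeholders, and this is a sketch rather than a full cleaning routine.

```python
# Rough sketch of a bronze-to-silver load in a Fabric notebook; `spark` comes
# pre-defined there. The Files path and column names are placeholders.
from pyspark.sql import functions as F

# Bronze: raw CSV files landed in the Files section of the lakehouse
raw = (spark.read
       .option("header", True)
       .csv("Files/bronze/sales/*.csv"))

# Light cleanup: trim text, fix types, drop rows missing a key
clean = (raw
         .withColumn("customer", F.trim(F.col("customer")))
         .withColumn("amount", F.col("amount").cast("double"))
         .dropna(subset=["order_id"]))

# Silver: a cleaned Delta table that Power BI and the SQL endpoint can read
clean.write.format("delta").mode("overwrite").saveAsTable("sales_silver")
```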

The second benefit of Fabric is the ability to provide a SQL endpoint for your data. SQL is the most common and popular data querying tool available. After Excel, it is the most popular business intelligence tool in the world. This is a very similar use case to Power BI Datamarts, which after 2 years in preview are unlikely to ever leave public preview.
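To give a sense of what that endpoint buys you: anything that speaks T-SQL can connect to it. Here is a rough sketch using pyodbc from outside Fabric. The server, database, and table names are placeholders (copy the real connection details from the endpoint’s settings in the portal), and it assumes the ODBC Driver 18 for SQL Server with Entra ID interactive sign-in.

```python
# Sketch of querying a lakehouse SQL analytics endpoint from outside Fabric.
# Server, database, and table names are placeholders; authentication options
# may differ in your tenant.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=your-endpoint.datawarehouse.fabric.microsoft.com;"
    "Database=YourLakehouse;"
    "Authentication=ActiveDirectoryInteractive;"
    "Encrypt=yes;"
)

for row in conn.execute("SELECT TOP 10 customer, amount FROM sales_silver"):
    print(row.customer, row.amount)
```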

Last is the ability to capture and store data from APIs, as well as storing a history of that data over time. This would be tedious to do in pure Power BI but is incredibly simple with Gen2 dataflows and a lakehouse.
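If you prefer to see it as code, here is the rough shape of that pattern in a notebook. The API URL and its response shape are hypothetical; the point is appending each pull to a Delta table with a snapshot date so history accumulates.

```python
# Sketch of snapshotting an API into a Delta table; `spark` is pre-defined in
# Fabric notebooks. The API and its response shape are hypothetical.
import datetime
import requests
from pyspark.sql import functions as F

rates = requests.get("https://api.example.com/v1/exchange-rates").json()["rates"]

snapshot = (spark.createDataFrame([(r["currency"], r["rate"]) for r in rates],
                                  ["currency", "rate"])
            .withColumn("snapshot_date", F.lit(datetime.date.today().isoformat())))

# Append rather than overwrite, so each scheduled run adds one day's history
snapshot.write.format("delta").mode("append").saveAsTable("exchange_rates_history")
```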

What are the downsides of Microsoft Fabric?

Given that Microsoft Fabric is following a similar iterative design approach to Power BI, it is still a bit rough around the edges, in the same way that Power BI was rough around the edges for the first 3 years. Fabric was very buggy on launch and has improved a lot since then, but many items are still in public preview.

Experiment with Fabric now, so that when you feel it is ready for prime time, you are ready as well. Niche, low-usage features like streaming datasets will likely be deprecated and moved to Fabric. In that instance, users had only 2 weeks of notice before the ability to create new streaming datasets was removed, which is utterly unacceptable, in my humble opinion [Edit: Shannon makes a fair point in the comments that deprecation of existing solutions is fairly slow]. New features, like DevOps pipelines, will be Fabric-first and will likely never be backported to Power BI Pro (I assume). Over time, the weight of the feature set difference will become significant.

Fabric adds a layer of complexity and confusion that is frustrating. While my hope is that Fabric is Power BI-ifying Azure, many worry that the opposite is happening instead. There are 5x the number of Fabric items you can create compared to Power BI and it is overwhelming at first. We know from Reza and Arun that more is on the way. Stick to what you know and ignore the rest.

One area where this strategy is difficult is in cost management. If you plan to use Fabric, then you need to become intimately aware of the capacity management app. Because of the huge variety in workloads, there is a huge variety in cost of these workloads. When I benchmarked ways to load CSV files into Fabric, there was a 4x difference in cost between the cheapest and most expensive ways to load the data. This is not easy to predict or intuit in advance. Surge protection is currently in public preview and is desperately needed.

Another downside is that although you are charged separately for storage and compute, they are not separate from a user perspective. If you turn off or pause your Fabric capacity, you will temporarily lose access to the underlying data. From what I’ve been told, this is not the norm when it comes to lakehouses and can be a point of frustration for anyone wanting to use Fabric in an on-demand or almost serverless kind of way. In fact, Databricks offers a serverless option, something which we had in Azure Synapse but is fundamentally incompatible with the Fabric capacity model.

Sidenote: if you want to save money, you can in theory automate turning Fabric on and off for a few hours per day primarily to import data into Power BI. This is a janky but valid approach and requires a certain amount of sophistication in terms of automation and skill. You are, in a sense, building your own semi-serverless approach.
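For the curious, the mechanics look roughly like this: Fabric capacities are Azure resources, so suspend and resume can be scripted against the Azure management API. The subscription, resource group, and capacity names below are placeholders, and the api-version is my assumption, so verify it against the current docs before relying on this.

```python
# Sketch of pausing/resuming a Fabric capacity via the Azure management API.
# Names are placeholders and the api-version is an assumption; verify against
# the current Microsoft.Fabric documentation.
import requests
from azure.identity import DefaultAzureCredential

SUB = "<subscription-id>"
RG = "<resource-group>"
CAPACITY = "<capacity-name>"
API_VERSION = "2023-11-01"  # assumed; check before relying on it

token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token

def capacity_action(action: str) -> int:
    """POST the suspend or resume action against the capacity resource."""
    url = (f"https://management.azure.com/subscriptions/{SUB}"
           f"/resourceGroups/{RG}/providers/Microsoft.Fabric/capacities/{CAPACITY}"
           f"/{action}?api-version={API_VERSION}")
    return requests.post(url, headers={"Authorization": f"Bearer {token}"}).status_code

capacity_action("resume")   # wake the capacity before the scheduled refresh
# ... kick off your Power BI import here ...
capacity_action("suspend")  # pause it again once the refresh window closes
```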

Another downside of Fabric is that you are left to your own devices when it comes to management and governance. While some tools are provided, such as semantic link, you will likely have to build your own solutions from scratch with Python and Spark notebooks. Michael Kovalsky has created Semantic Link Labs, which provides a number of templates. Over time, the number of community solutions will expand.

My recommendation is to experiment with Python and Spark notebooks now so that when the time comes that you need to use them for management and orchestration, you aren’t feeling overwhelmed and frustrated. They are a popular tool for this purpose when it comes to Fabric.
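As a taste, here is a minimal semantic link sketch you could run in a Fabric notebook today (the sempy package ships with the Fabric Spark runtime, as I understand it). The dataset name and DAX query are placeholders.

```python
# Minimal semantic link sketch for a Fabric notebook. Dataset, table, and
# measure names are placeholders.
import sempy.fabric as fabric

# Inventory what you can see: handy for home-grown governance reports
workspaces = fabric.list_workspaces()
datasets = fabric.list_datasets()
print(datasets.head())

# Run DAX against a semantic model and get a pandas-style DataFrame back
df = fabric.evaluate_dax(
    dataset="Sales Model",
    dax_string='EVALUATE SUMMARIZECOLUMNS(Sales[Customer], "Total", [Total Sales])',
)
print(df)
```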

Summary

So, should you use Fabric as a small business? In most cases no, in some cases yes. Should you start learning Fabric now? 100% yes. Integration between Power BI and Fabric will continue and most new features that aren’t core to Power BI (Power Query, DAX, core visuals) will show up in Fabric first.

I’ve seen multiple public calls for a Fabric Per User license. When my friend Alex Powers has surveyed people on what they would pay for an FPU license, responses ranged from $30 to $70 per user per month. The time between Power BI Premium and PPU was 4 years, and the time from Paginated Reports in Premium to Paginated Reports in Pro was 3 years. I have no insider knowledge about an FPU license, but these general ranges seem reasonable to me as estimates.

Finally, Power BI took about 4 years (2015-2019) before it felt well-polished (in my opinion) and I felt comfortable unconditionally endorsing it. I don’t think it’s unreasonable that Fabric follows a similar timeline, but that’s pure speculation on my part. I’ve started the Figuring out Fabric podcast to talk about the good and the bad, and I hope you’ll give it a listen.

Announcing the Figuring out Fabric Podcast!

I’m delighted to announce the launch of the Figuring out Fabric Podcast. Currently you can find it on Buzzsprout (RSS feed) and YouTube, but soon it will be coming to a podcast directory near you.

Each week I’ll be interviewing experts and users alike on their experience with Fabric, warts and all. I can guarantee that we’ll have voices you aren’t used to and perspectives you won’t expect.

Each episode will be 30 minutes long with a single topic, so you can listen during your commute or while you exercise. Skip the topics you aren’t interested in. This will be a podcast that respects your time and your intelligence. No 2 hour BS sessions.

In our inaugural episode, Kristyna Ferris helps us pick the right data movement tool.

Here are the upcoming guests and topics:

Come along for the ride!

Should Power BI be Detached from Fabric?

If you know Betteridge’s Law of Headlines, then you know the answer is no. But let’s get into it anyway.

Recently there was a LinkedIn post that made a bunch of great and valid points but ended on an odd one.

Number one change would be removing Power BI from Fabric completely and doubling down on making it even easier for the average business user, as I have previously covered in some posts.

It’s hard for me to take this as a serious proposal instead of wishful thinking, but I think the author is being serious, so let’s treat it as such.

Historically, Microsoft has failed to stick the landing on big data

If you look back at the family tree of Microsoft Fabric, it’s a series of attempts to turn SQL Server into MPP and Big Data tools. None of which, as far as I can tell, ever gained significant popularity. Each time, the architecture would change, pivoting to the current hotness (MPP -> Hadoop -> Kubernetes -> Spark -> Databricks). Below are all tools that either died out or morphed their way into Fabric today.

  • (2010) Parallel Data Warehouse. An MPP tool built on DATAllegro technology and tied to an HP hardware appliance. Never once did I hear about someone implementing this.
  • (2014) Analytics Platform System. A rename and enhancement of PDW, adding in HDInsight. Never once did I hear about someone implementing this. Support ends in 2026.
  • (2015) Azure SQL Data Warehouse. A migration of APS to the cloud, providing the ability to bill storage and compute separately. Positioned as a competitor to Redshift. I vaguely recall hearing of people using this, but nothing sticks out.
  • (2019) Big Data Clusters. An overly complicated attempt to run a cluster of SQL Server nodes on Linux, supporting HDFS and Spark. It was killed off 3 years later.
  • (2019) Azure Synapse Dedicated Pools. A new coat of paint on Azure SQL Data Warehouse, put under the same umbrella as other products. I have in fact heard of some people using this. I found it incredibly frustrating to learn.
  • (2023) Microsoft Fabric. Yet another evolution, replacing Synapse. Synapse is still supported but I haven’t seen any feature updates, so I would treat it as on life support.

That’s 6 products in 13 years. A new product every 2 years. If you are familiar with this saga, I can’t blame you for being pessimistic about the future of Fabric. Microsoft does not have a proven track record here.

Fabric would fail without Power BI

So is Fabric a distraction? Certainly. Should Power BI just be sliced off from Fabric, so it can continue to be a self-service BI tool and get the attention it deserves? Hell, no.

In my opinion, making such a suggestion completely misses the point. Fabric will fail without Power BI, full stop. Splitting would mean throwing in the towel for Microsoft and be highly embarrassing.

The only reason I have any faith in Fabric is because of Power BI and the amazing people who built Power BI. The only reason I have any confidence in Fabric is because of the proven pricing and development model of Power BI. The only reason I’m learning Fabric is because the fate of the two is inextricably bound now. I’m not doing it because I want to. We are all along for the ride whether we like it or not.

I have spent the past decade of my career successfully dodging Azure. I have never had to use Azure in any of my work, outside of very basic VMs for testing purposes. I have never learned how to use ADF, Azure SQL, Synapse, or any of that stuff. But that streak has ended with Fabric.

My customers are asking me about Fabric. I had to give a 5 day Power BI training, with one of the days on Fabric. Change is coming for us Power BI folks and I think consultants like me are mad that Microsoft moved our cheese. I get it. I spent a decade peacefully ignorant of what a lakehouse was until now, blah.

Is Power BI at risk? Of course it is! Microsoft Fabric is a massively ambitious project, and a lot of development energy is going into adding new tools to Fabric like SQL DBs as well as quality-of-life improvements. It’s a big bet and I estimate it will be another 2-3 years until it feels fully baked, just like it took Power BI 4 years. It’s a real concern right now.

Lastly, the logistics of detachment would be so complex and painful to MSFT that suggesting it is woefully naive. Many of the core PBI staff were moved to the Synapse side years ago. It’s a joint Fabric CAT team now.

Is MSFT supposed to undo the deprecation of the P1 SKU and say “whoopsie-daisy”? “Hey, sorry we scared you into signing a multi-year Fabric agreement, you can have your P1 back”? Seriously?

No, Odysseus has been tied to the mast. Fabric and Power BI sink or swim together. And for Power BI consultants like me, our careers sink or swim with it. Scary stuff!

Where Microsoft can do better

Currently I think there is a lot of room for improvement in the storytelling around which product to use when. The current guidance of massive comparison tables and long user scenarios leaves a lot to be desired. I would love to see videos with clear do’s and don’ts, but I expect those will have to come from the community. I see a lot of How To’s from my peers, but I would love more How To Nots.

I really want to see Microsoft take staggered feature adoption seriously. Admin toggles are not scalable. It’s not an easy task, but I think we need something similar to roles or RBAC. Something like Power BI workspace roles, but much, much bigger. The number of Fabric items you can create is 5x the number of Power BI items and growing every day. There needs to be a better middle ground than “turn it all off” or “Wild West”.

One suggestion made by the original LinkedIn author was a paid add-on for Power BI Pro that adds Power BI Copilot. I think we absolutely do not need that right now. Copilot is expensive in Fabric ($0.32 to $2.90 per day by my math) and still could use some work. It needs more time to bake as LLM prices plummet. If we are bringing Fabric features to a shared capacity model, let’s get Fabric Per User and let’s do it right. Not a rushed job because of some AI hype.

Also, I don’t get why people are expecting a Copilot add-on or FPU license already. It was 4 years from Power BI Premium (2017) to Premium Per User (2021). It was 3 years from Paginated Reports in Premium (2019) until we got Paginated Reports in Pro (2022). Fabric has been out for less than 2 years and it is having a lot of growing pains. Perhaps we can be more patient?

How I hope to help

People are reasonably frustrated and feeling lost. Personally, I’d love to see more content about real, lived experiences and real pain points. But complaining only goes so far. So, with that I’m excited to announce the Figuring Out Fabric podcast coming out next week.

You and I can be lost together, every week. I’ll ask real Fabric users some real questions about Fabric, and we’ll discuss the whole product, warts and all. If you are mad about Fabric, be mad with me. If you are excited about Fabric, be excited with me.

How Power BI Dogma Leads to a Lack of Understanding

I continue to be really frustrated about the dogmatic approach to Power BI. Best practices become religion, not to be questioned or elaborated on. Only to be followed. And you start to end up with these 10 Power BI modeling commandments:

  1. Thou shalt not use Many-to-Many
  2. Thou shalt not use bi-directional filtering
  3. Thou shalt not use calculated columns
  4. Thou shalt not use implicit measures
  5. Thou shalt not use auto date/time
  6. Thou shalt avoid iterators
  7. Thou shalt star schema all the things
  8. Thou shalt query fold
  9. Thou shalt go as upstream as possible, as downstream as necessary
  10. Thou shalt avoid DirectQuery

And I would recommend all of these. If you have zero context and you have a choice, follow these suggestions. On average, they will lead to better user experiences, smaller models, and faster performance.

On. Average.

On. Average.

But there are problems when rules of thumb and best practices become edicts.

Why are people like this?

I think this type of advice comes from a really good and well-intentioned place. First, my friend Greg Baldini likes to point out that Power BI growth has been literally exponential. In the sense that the number of PBI users today is a multiple of PBI users a year ago. This means that new PBI users always outnumber experienced PBI users. This means we are in Eternal September.

I answer questions on Reddit, and I don’t know how many more times I can explain why Star Schema is a best practice (it’s better for performance, better for user experience, and leads to simpler DAX, BTW). Many times, I just point to the official docs, say it’s a best practice and move on. It’s hard to fit explanations in 280 characters.

The other reason is that Power BI is performant, until it suddenly isn’t. Power BI is easy, until it suddenly isn’t. And as Power BI devs and consultants, we often have to come in and clean the messes. It’s really tempting to scream “If you had just followed the commandments, this wouldn’t have happened. VertiPaq is a vengeful god!!!”.

I get it. But I think we need to be better in trying to teach people to fish, not just saying “this spot is good. Only fish in this bay. Don’t go anywhere else.”

Why we need to do better

So why does it matter? Well, a couple of reasons. One is that it leads to people not digging deeper to learn the internals; instead, they hear what the experts say and just echo that. And sometimes that information is wrong. I ran into that today.

Someone on Reddit reasonably pushed back on me suggesting SUMX, violating commandment #6, which is a big no-no. I tried to explain that in the simplest cases, SUM and SUMX are identical under the hood: identical performance, identical query plans, etc. SUM is just syntactic sugar for SUMX.
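Before I get to the response: you don’t have to take my word for it. Here is a rough sketch using semantic link to run both forms against a model (dataset, table, and column names are placeholders); the more convincing evidence is running the two queries in DAX Studio and seeing identical server timings and query plans.

```python
# Sketch: run both formulations and compare results. Names are placeholders;
# the stronger evidence is comparing the query plans for the two queries in
# DAX Studio, which come out the same in this simple case.
import sempy.fabric as fabric

q_sum = 'EVALUATE ROW("Total", SUM(Sales[Amount]))'
q_sumx = 'EVALUATE ROW("Total", SUMX(Sales, Sales[Amount]))'

r_sum = fabric.evaluate_dax(dataset="Sales Model", dax_string=q_sum)
r_sumx = fabric.evaluate_dax(dataset="Sales Model", dax_string=q_sumx)

print(r_sum.equals(r_sumx))  # same numbers back from both queries
```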

Here was the response:

That’s really overcomplicating things. No point in skipping best practices in your code. He doesn’t need to know such nuances to understand to avoid SUMX when possible

And no, sumx(table,col) is not the same as sum(col). One iterates on each row of the table, one sums up the column

And this was basically me from 2016 until….um….embarrassingly 2022. I knew iterators were bad, and some were worse than others. People said they were bad, so I avoided them. I couldn’t tell you when performance became an issue. I didn’t know enough internals to accurately intuit why it was slow. I just assumed it was like a cursor in SQL.

I then repeated in my lectures that it was bad sometimes. Something something nested iterators. Something something column lookups. I was spreading misinformation or at least muddled information. I had become part of the problem.

And that’s the problem. Dogma turns off curiosity. It turns off the desire to learn about the formula engine and the storage engine, to learn about data caches, to know a system deep in your bones.

Dogma is great when you can stay on the golden path. But when you deviate from the path and need to get back, all you get is a scolding and a spanking. This is my concern. Instead of equipping learners, are we preparing them to feel utterly overwhelmed when they get lost in the woods?

How we can be better

I think the path to being better is simple.

  1. Avoid absolutist language. Many of these commandments have exceptions, a few don’t. Many lead to a better default experience or better performance on average. Say that.
  2. Give reasons why. In a 280 character post, spend 100 characters on why. The reader can research if they want to or ask for elaboration.
  3. Encourage learning internals. Give the best practice but then point to tools like DAX studio to see under the hood. Teach internals in your demos.
  4. Respect your audience. Treat your audience with respect and assume they are intelligent. Don’t denigrate business users or casual learners.

It’s hard to decide how much to explain, and no one wants to fit a lecture into their “TOP 10 TIPS!” post. But a small effort here can make a big difference.

The fraught ethics around AI, ChatGPT, and Power BI

The more I tried to research practical ways to make use of ChatGPT and Power BI, the more pissed I became. Like Bitcoin and NFTs before it, this is a world inextricably filled with liars, frauds, and scam artists. Honestly, many of those people just frantically erased blockchain from their business cards and scribbled on “AI”.

There are many valid and practical uses of AI, I use it daily. But there are just as many people who want to take advantage of you. It is essential to educate yourself on how LLMs work and what their limitations are.

Other than Kurt Buhler and Chris Webb, I have yet to find anyone else publicly and critically discussing the limitations, consequences, and ethics of applying this new technology to my favorite reporting tool. Aside from some video courses on LinkedIn Learning, nearly every resource I find seems to either have a financial incentive to downplay the issues and limitations of AI or seems to be recklessly trying to ride the AI hype wave for clout.

Everyone involved here is hugely biased, including myself. So, let’s talk about it.

Everything below is my own personal opinion based on disclosed facts. I do not have, nor am I implying having, any secret knowledge about any parties involved. This is not intended as defamation of any individuals or corporations. This is not intended as an attack or a dogpile on any individuals or corporations and to that effect, in all of my examples I have avoided directly naming or linking to the examples.

Please be kind to others. This is about a broader issue, not about any one individual. Please do not call out, harass, or try to cancel any individuals referenced in this blog post. My goal here is not to “cancel” anyone but to encourage better behavior through discussion. Thank you.

LLMs are fruit of the poisoned tree

Copyright law is a societal construct, but I am a fan of it because it allows me to make a living. I’m not a fan of it extending 70 years after the author’s death. I’m not a fan of companies suing archival organizations. But if copyright law did not exist, I would not have a job as a course creator. I would not be able to make the living I do.

While I get annoyed when people pirate my content, on some level I get it. I was a poor college student once. I’ve heard the arguments of “well they wouldn’t have bought it anyway”. I’ll be annoyed about the $2 I missed out on, but I’ll be okay. Now, if you spin up a BitTorrent tracker and encourage others to pirate, I’m going to be furious because you are now directly attacking my livelihood. Now it is personal.

Whatever your opinions are on the validity of copyright law and whether LLMs count as Fair Use or Transformative Use, one thing is clear. LLMs can only exist thanks to massive and blatant copyright infringement. LLMs are fruit of the poisoned tree. And no matter how sweet that fruit, we need to acknowledge this.

Anything that is publicly available online is treated as fair game, regardless of whether or not the author of the material has given or even implied permission, including 7,000 indie books that were priced at $0. Many lawsuits allege that non-public, copyrighted material is being used, given AI’s ability to reproduce snippets of text verbatim. In an interview with the Wall Street Journal, OpenAI’s CTO dodged the question of whether Sora was trained on YouTube videos.

Moving forward, I will be pay-walling more and more of my content as the only way to opt-out of this. As a consequence, this means less free training material for you, dear reader. There are negative, personal consequences for you.

Again, whatever your stance on this is (and there is room for disagreement on the legalities, ethics, and societal benefits), it’s shocking and disgusting that this is somehow all okay, when in the early 2000s the RIAA and MPAA sued thousands of individuals for file-sharing and copyright infringement, including a 12-year-old girl. As a society, there is a real incoherence around copyright infringement that seems to be motivated primarily by profit and power.

The horse has left the barn

No matter how mad or frustrated I may get, the horse has permanently left the barn. No amount of me stomping my feet will change that. No amount of national regulation will change that. You can run a GPT-4 level LLM on a personal machine today. Chinese organizations are catching up in the LLM race. And I doubt any Chinese organization intends on listening to US or EU regulations on the matter.

Additionally, LLMs are massively popular. One survey in May 2024 (n=4010) of participants in the education system found that 50% of students and educators were using ChatGPT weekly.

Another survey, from the Wharton Business School, of 800 business leaders found that weekly usage of AI had gone up from 37% in 2023 to 73% in 2024.

Yet another study found that 24% of US workers aged 18-64 use AI on a weekly basis.

If you think that AI is a problem for society, then I regret to inform you that we are irrevocably screwed. The individual benefits and corporate benefits are just too strong and enticing to roll back the clock on this one. Although I do hope for some sort of regulation in this space.

So now what?

While we can vote for and hope for regulation around this, no amount of regulation can completely stop it, in the same way that copyright law has utterly failed to stop pirating and copyright infringement.

Instead, I think the best we can do is try to hold ourselves and others to a higher ethical standard, no matter how convenient it may be to do otherwise. Below are my opinions on the ethical obligations we have around AI. Many will disagree, and that’s OK! I don’t expect to persuade many of you, in the same way that I’ll never persuade many of my friends to not pirate video games that are still easily available for sale.

Obligations for individuals

As an individual, I encourage you to educate yourself on how LLMs work and their limitations. LLMs are a dangerous tool and you have an obligation to use them wisely.

Here are some of my favorite free resources:

Additionally, Co-Intelligence: Living and Working with AI by Ethan Mollick is a splendid, splendid book on the practical use and ethics of LLMs and can be had cheaply on Audible.

If you are using ChatGPT for work, you have an obligation to understand when and how it can train on your chat data (which it does by default). You have an ethical obligation to follow your company’s security and AI policies to avoid accidentally exfiltrating confidential information.

I also strongly encourage you to ask ChatGPT questions in your core area of expertise. This is the best way to understand the jagged frontier of AI capabilities.

Obligations for content creators

If you are a content creator, you have an ethical obligation to not use ChatGPT as a ghostwriter. I think using it for a first pass can be okay, and using it for brainstorming or editing is perfectly reasonable. Hold yourself to the same standards as if you were working with a human.

For example, if you are writing a conference abstract and you use ChatGPT, that’s fine. I have a friend who I help edit and refine his abstracts. Although, be aware that if you don’t edit the output, the organizers can tell because it’s going to be mediocre.

But if you paid someone to write an entire technical article and then slapped your name on it, that would be unethical and dishonest. If I found out you were doing that, I would stop reading your blog posts and in private I would encourage others to do the same.

You have an ethical obligation to take responsibility for the content you create and publish. To not do so is functionally littering at best, and actively harmful and malicious at worst. To publish an article about DAX without testing the code first is harmful and insulting. Below is an article on LinkedIn with faulty DAX code that subverted the point of the article. Anyone who tried to use the code would have potentially wasted hours troubleshooting.

Don’t put bad code online. Don’t put untested code online. Just don’t.

One company in the Power BI space has decided to AI generate articles en masse, with (as far as I can tell), no human review for quality. The one on churn rate analysis is #2 on the search results for Bing.

When you open the page, it’s a bunch of AI generated slop including the ugliest imitation of the Azure Portal I have ever seen. This kind of content is a waste of time and actively harmful.

I will give them credit for at least including a clear disclaimer, so I don’t waste my time. Many people don’t do even that little. Unfortunately, this only shows up when you scroll to the bottom. This means this article wasted 5-10 minutes of my time when I was trying to answer a question on Reddit.

Even more insultingly, they ask for feedback if something is incorrect. So, you are telling me you have decided to mass litter content on the internet, wasting people’s time with inaccurate posts and you want me to do free labor to clean up your mess and benefit your company’s bottom line? No. Just no.

Now you may argue “Well, Google and Bing do it with their AI generated snippets. Hundreds of companies are doing it.”. This is the most insulting and condescending excuse I have ever heard. If you are telling me that your ethical bar is set by what trillion dollar corporations are doing, well then perhaps you shouldn’t have customers.

Next, if you endorse an AI product in any capacity, you have an ethical obligation to disclose any financial relationship or compensation you receive from that product. I suspect it’s rare for people in our space to properly disclose these financial relationships, and I can understand why. I’ve been on the fence on how much to disclose in my business dealings. However, I think it’s important, and I make an effort to do it for any company that I’ve done paid work with, as that introduces a bias into my endorsement.

These tools can produce bad or even harmful code. These tools are extremely good at appearing to be more capable than they actually are. It is easy to violate the data security boundary with these tools and allow them to train their models on confidential data.

For goodness sake, perhaps hold yourself to a higher ethical standard than an influencer on TikTok.

Obligations for companies

Software companies that combine Power BI and AI have an obligation to have crystal clear documentation on how they handle both user privacy and data security. I’m talking architecture diagrams and precise detail about what if any user data touches your servers. A small paragraph is woefully inadequate and encourages bad security practices. Additionally, this privacy and security information should be easily discoverable.

I was able to find three companies selling AI visuals for Power BI. Below is the entirety of the security statements I could find, outside of legalese buried in their terms of service or privacy documents.

While the security details are hinted at in the excerpts below, I’m not a fan of “just trust us, bro”. Any product that is exfiltrating your data beyond the security perimeter needs to be abundantly clear on the exact software architecture and processes used. This includes when and how much data is sent over the wire. Personally, I find the lack of this information to be disappointing.

Product #1

“[Product name] provides a secure connection between LLMs and your data, granting you the freedom to select your desired configuration.”

Why trust us?

Your data remains your own. We’re committed to upholding the highest standards of data security and privacy, ensuring you maintain full control over your data at all times. With [product name], you can trust that your data is safe and secure.”

Secure

At [Product name], we value your data privacy. We neither store, log, sell, nor monitor your data.

You Are In Control

We leverage OpenAI’s API in alignment with their recommended security measures. As stated on March 1, 2023, “OpenAI will not use data submitted by customers via our API to train or improve our models.”

Data Logging

[Product name] holds your privacy in the highest regard. We neither log nor store any information. Post each AI Lens session, all memory resides locally within Power BI.”

Product #2

Editor’s note: this sentence on AppSource was the only mention of security I could find. I found nothing on the product page.

“This functionality is especially valuable when you aim to offer your business users a secure and cost-effective way of interacting with LLMs such as ChatGPT, eliminating the requirement for additional frontend hosting.”

Product #3

 Security

The data is processed locally in the Power BI report. By default, messages are not stored. We use the OpenAI model API which follows a policy of not training their model with the data it processes.”

Is it secure? Are all my data sent to OpenAI or Anthropic?

The security and privacy of your data are our top priorities. By default, none of your messages are stored. Your data is processed locally within your Power BI report, ensuring a high level of confidentiality. Interacting with the OpenAI or Anthropic model is designed to be aware only of the schema of your data and the outcomes of queries, enabling it to craft responses to your questions without compromising your information. It’s important to note that the OpenAI and Anthropic API strictly follows a policy of not training its model with any processed data. In essence, both on our end and with the OpenAI or Anthropic API, your data is safeguarded, providing you with a secure and trustworthy experience.”

Clarity about the model being used

Software companies have an obligation to clearly disclose which AI model they are using. There is a huge, huge difference in quality between GPT-3.5, GPT-4o mini, and GPT-4o. Enough so that not being clear on this is defrauding your customers. Thankfully, some software companies are good about doing this, but not all.

Mention of limitations

Ideally, any company selling you on using AI will at least have some sort of reasonable disclaimer about the limitations of AI and for Power BI, which things AI is not the best at. However, I understand that sales is sales and that I’m not going to win this argument. Still, this frustrates me.

Final thoughts

Thank you all for bearing with me. This was something I really needed to get off my chest.

I don’t plan to stop using LLMs anytime soon. I use ChatGPT daily in my work, and I recently signed up for GitHub Copilot and plan to experiment with that. If I can ever afford access to an F64 SKU, I plan to experiment with Copilot for Fabric and Power BI as well.

If you are concerned about data security, I recommend looking into tools like LM Studio and Ollama to safely and securely experiment with local LLMs.
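As a taste, here is roughly what prompting a local model through Ollama’s default local HTTP API looks like. The model name is whatever you’ve pulled locally; nothing leaves your machine.

```python
# Tiny sketch of prompting a local model through Ollama's HTTP API, which
# listens on localhost:11434 by default. The model name is whatever you've
# pulled locally (e.g. `ollama pull llama3.1`).
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1",
        "prompt": "Explain a star schema in two sentences.",
        "stream": False,
    },
)
print(resp.json()["response"])
```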

I think if used wisely and cautiously, these can be an amazing tool. We all have an obligation to educate ourselves on the best use of them and their failings. Content creators have an obligation to disclose financial incentives, when they use ChatGPT heavily to create content, and general LLM limitations. Software companies have an obligation to be crystal clear about security and privacy, as well as which models they use.