All posts by Eugene Meidinger

Lessons learned from Self-employment: 6 years in

On some level, I’ve started to hate writing these blog posts.

The original intent was to show the ups and downs of being a consultant, inspired by Brent Ozar’s series on the same thing. There’s a huge survivorship bias in our field, only the winners talk about self-employment, and the LinkedIn algorithm encourages only Shiny Happy People. But when you enter the third consecutive year of the 3 most difficult years of your career, you start to wonder if it might be a you problem. So here we go.

Pivoting my business

Two years ago, Pluralsight gave all authors a 25% pay cut and I knew I needed to get out. I reached out to everyone I knew who sold courses themselves for advice. I’m deeply grateful to Matthew Roche, Melissa Coates, Brent Ozar, Kendra Little, and Erik Darling for the conversations that calmed my freak out at the time.

One year ago, I learned that I can’t successfully make content my full-time job while also successfully consulting. Consulting work tends to be a lot of hurry-up-and-wait. Lots of fires, emergencies, and urgencies. No customer is going to be happy if you tell them the project needs to wait a month because you have a course you need to get out. Previously with Pluralsight I was able to make it work because they scoped the work, so it was more like a project. Not so when hungry algos demand weekly content.

So, I cut the consulting work to a bare minimum. Thankfully, I receive money enough from Pluralsight royalties that even with the cut we never have to worry about paying the mortgage. However, it’s nowhere close to covering topline expenses. At the beginning pandemic, $6k/mo gross revenue was what we needed to live comfortably (Western PA is dirt cheap). After the pandemic, I hired a part time employee, inflation happened, and I pay for a lot more subscriptions, like Teachable and StreamlineHQ, so that number is closer to $9k/mo now.

I can confirm that I have not and never will make $9k/mo or more from just Pluralsight. My royalties overall have been stagnant or even gone down a bit since the huge spike upwards in early 2020. So it’s not enough to live off of alone.

Finally, after a lot of dithering in the 2023, I decided to set a public and hard deadline for my course. We were launching in February 2024 hell or high water. I launched with 2 out of 7 modules and it was a huge success, making low four figures. I’m grateful to everyone who let me on to their podcast or livestream, which provided a noticeable boost in sales.

Unfortunately, I had a number of projects right after launch, taking a lot of my focus. I also found out that this content was much much more difficult than the Pluralsight content I was used to. There was no one from curriculum to hand me a set of course objectives to build to. No one to define the scope and duration of the course.

What’s worse, the reason there is a moat and demand for Power BI performance tuning content is almost no one talks about it. You have dozens of scattered Chris Webb blog posts, a book and a course from SQL BI, a course by Nikola Illic, and a book by Thomas LeBlanc and Bhavik Merchant. And that’s about it?

I thought I was going to be putting out a module per week, when in reality I was doing Google searches for “Power BI Performance tuning”, opening 100 tabs, and realizing I had signed myself up for making 500 level internals content. F*ck.

A summer of sadness

All at the same time I was dealing with burnout. My health hadn’t really improved any over the past 3 years and I was finding it hard to work at all. I was anxious. I couldn’t focus. And the content work required deep thought and space and I couldn’t find any. I felt a sense of fragility where I might have a good week the one week and then have a bad nights sleep and derail the next week.

I hadn’t made any progress on my course and a handful of people reached out. I apologized profusely, offered refunds, and promised to give them free access to the next course. If you were impacted by my delays, do please reach out.

In general, I decided that I needed to keep cutting things. I tried to get any volunteer or work obligations off my plate. The one exception is I took on bringing back SQL Saturday Pittsburgh. With the help of co-organizers like James Donahoe and Steph Bruno, it was a lot of work but a big success. I’m very proud of that accomplishment.

Finally turning a corner

I think I finally started turning a corner around PASS Summit. It was refreshing to see my friends and see where the product is going. Before Summit, I had about 3.5 modules done. In the period of a few weeks I rushed to get the rest done. This was also because I really wanted to get the course finished for a Black Friday sale.

The sale went well, making mid three figures. Not enough to live on, but proof that there is demand and it’s worth continuing instead of burning it all down and getting a salaried job. Still, I recently had to float expenses on a credit card for the first time in years, so money is tighter than it used to be. Oh the joys of being paid NET 30 or more.

Immediately after Black Friday, I went to Philadephia to delivery a week long workshop on Fabric and Power BI. The longest training I had ever given before was 2 days. The workshop went well, but every evening I was rushing back to my hotel room to make more content. You would think that 70 slides plus exercises would last a whole day, but no, not even close.

Now I’m back home and effectively on vacation for the rest of the year and it’s lovely. I’m actually excited to be working on whatever whim hits me, setting up a homelab and doing Fabric benchmarks. It’s the first time I’ve done work for fun in years.

I’m excited for 2025 but cautious to not over-extend myself.

Fabric Benchmarking Part 1: Copying CSV Files to OneLake.

First, a disclaimer: I am not a data engineer, and I have never worked with Fabric in a professional capacity. With the announcement of Fabric SQL DBs, there’s been some discussion on whether they are better for Power BI import than Lakehouses. I was hoping to do some tests, but along the way I ended up on an extensive Yak Shaving expedition.

I have likely done some of these tests inefficiently. I have posted as much detail and source code as I can and if there is a better way for any of these, I’m happy to redo the tests and update the results.

Part one focuses on loading CSV files to the files portion of a lakehouse. Future benchmarks look at CSV to delta and PBI imports.

General Summary

In this benchmark, I generated ~2 billion rows of sales data using the Contoso data generator on a F8as_v6 virtual machine in Azure with a terabyte of premium SSD. This took about 2 hours (log) and produced 194 GB of files, which works out to about $1-2 as far as I can tell (assuming you shut down the VM and delete the premium disk quickly). You could easily do it for cheaper, since it only needed about 16 GB of RAM.

In general, I would create a separate lakehouse for each test and a separate workspace for each run of a given test. This was tedious and inefficient, but the easiest way to get clean results from the Fabric Capacity Metrics app without automation or custom reporting.  I tried to set up Will Crayger’s monitoring tool but ran into some issues and will be submitting some pull requests.

To get the CU seconds, I copied from the Power BI visual in the metrics app and tried to ignore incidental costs (like creating a SQL endpoint for a lakehouse). To get the costs, I took the price of an F2 in East US 2 ($162/mo), divided it by the number of CUs (2 CUs), and divided by the number of seconds in 30 days (30*24*60*60). This technically overestimates the costs for months with 31 days in them.

Anyway, here are the numbers:

External methods of file upload (Azure Storage explorer, AZ Copy, and OneLake File Explorer) are clear winners, and browser based upload is a clear loser here. Do be aware that external methods may have external costs (i.e. Azure costs).

Data Generation process

As I mentioned, I spun up a beefy VM and ran the Contoso Data Generator, which is surprisingly well documented for a free, open source tool. You’ll need .NET 8 installed to build and run the tool. The biggest thing is that you will want to modify the config file if you want a non-standard size for your data. In my case, I wanted 1 billion rows of data (OrdersCount setting) and I limited each file to 10 million rows of data (CsvMaxOrdersPerFile setting). This technically will produce 1 billion orders so 2 actually billion sales rows when order header is combined with order lineitem. This produced 100 sales files of about 1.9 GB each.

I was hoping the temporary SSD drive included with Azure VMs was going to be enough, but it was ~30 GB if I recall, not nearly big enough. So instead, I went with Premium SSD storage instead, which has the downside of burning into my Azure Credits for as long as it exists.

One very odd note, at around %70 percent complete, the data generation halted for no particular reason for about 45 minutes. It was only using 8 GB of the 32 GB available and was completely idle with no CPU activity. Totally bizarre. You can see it in the generation log. My best theory is it was waiting for the file system to catch up.

Lastly, I wish I was aware of how easy it was to expand the VM disk image when you allocate a terabyte of SSD. Instead, I allocated the rest of the SSD as a E drive. It was still easy to generate the data, but it added needless complication.

CSV to CSV tests

Thanks to James Serra’s recent blog post, I had a great starting point to identify all the ways to load data into Fabric. That said, I’d love it if he expanded it to full paragraphs since the difference between a copy activity and a copy job was not clear at all. Additionally, the Contoso generator docs list 3 ways to load the data, which was also a helpful starting point.

I stored the data on a container on Azure Blob storage with Hierarchical Namespaces turned on and the it said the Data Lake Storage endpoint  is turned on by default, making it Azure Data Lake Storage Gen 2? At least I think it does, but I don’t know for sure and I have no idea how to tell.

Azure storage Explorer

The Azure Storage Explorer is pretty neat and I was able to get it running without issue or confusion. Here are the docs for connecting to OneLake, it’s really straightforward. I did lose my RDP connection during all three of the official tests, because it maxed out IO on the disk which was the OS disk. I probably should have made a separate data disk, UGH. Bandwidth would fluctuate wildly between 2,000 and 8,000 Mbps. I suspect a separate disk would go even faster. The first time I had tested it, I swear it stayed at 5,000 Mbps and took 45 seconds, but I failed to record that.

It was also mildly surprising to find there was a deletion restriction for workspaces with capital letters in the name. Also, based on the log files in the .azcopy folder, I’m 95% sure the storage explorer is just a wrapper for AzCopy

AzCopy

AzCopy is also neat, but much more complicated, since it’s a command line program. Thankfully, Azure Storage Explorer let me export the AzCopy commands so I ran that instead of figuring it out myself or referencing the Contoso docs.

If you go this route, you’ll get a message like “To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code ABCDE12FG to authenticate”. This authentication could be done from any computer, not just the VM, which was neat.

 I got confirmation from the console output that the disk was impacting upload speeds. Whoops.

OneLake File Explorer

The OneLake File Explorer allows you to treat your OneLake like it was a OneDrive Folder. This was easy to set up and use, with a few minor exceptions. First, it’s not supported on Windows Server and in fact I couldn’t find a way at all to install the MISX file on Windows Server 2022. I tried to follow a guide to do that, but no luck.

The other issue is I don’t know what the heck I’m doing, so I didn’t realize I could expand the C Drive on the default image. Instead, I allocated the spare SSD space to the F drive. But when I tried to copy the files to the C drive, there wasn’t enough space, so I had them in 3 batches of 34 files.

This feature is extremely convenient but was challenging to work with at this scale. First, because it’s placed under the Users folder, both Windows search index and anti-virus were trying to scan the files. Additionally, because my files were very large, it would be quite slow when I deallocated files to free up space.

Oddly, the first batch stayed around 77 MB/s, the second was around 50 MB/s, and the last batch tanked to a speed of 12 MB/s, more than doubling the upload time. Task Manager showed disk usage at 100%, completely saturated. I tried taking a look at resource monitor but I didn’t see anything unusual. Most likely it’s just a bad idea to copy 194 GB from one drive back to itself, while deallocating the files in-between.

Browser Upload

Browser-based file upload was the most expensive in terms of CUs but was very convenient. It was shockingly stable as well. I’ve had trouble downloading multiple large files with Edge/Chrome before, so I was surprised it uploaded one hundred 2 GB files without issue or error. It took 30 minutes, but I expected a slowdown going via browser so not complaints here. Great feature.

Pipeline Copy Activity

Setting up a pipeline copy activity to read from Azure Blob storage was pretty easy to do. The biggest challenge was navigating all the options without feeling overwhelmed.

Surprisingly, there was no measurable difference in CUs between schema agnostic (binary) copy and not schema agnostic (CSV validation) copy. However, all the testing returned the same cost, so I’m guessing the costing isn’t as granular and doesn’t pick up a 2 second difference between runs.

Based on the logs it looks like it may also be using AzCopy because azCopyCommand was logged as true. It’s AzCopy all the way down apparently. The CU cost (23,040) is exactly equal to 2 times the logged copy duration (45 s) times the usedDataIntegrationUnits (256), so I suspect this is how it’s costed, but I have no way of proving it. It would explain why there was no cost variation between runs.

Pipeline Copy Job

The copy job feature is just lovely. I was confused based on the name how it differed from a copy activity, but it seems to be a simpler way of copying files with fewer overwhelming options and nicer UI that clearly shows throughput, etc. The JSON code also looks very simple. Just wonderful overall.

It is in preview, so you will have to turn it on. But that’s just an admin toggle. Reitse Eskens has a nice blog post on it. My only complaint is I didn’t see a way to copy a job or import the JSON code.

Spark Notebook – Fast copy

My friend Sandeep Pawar recommended trying fastcp from notbookutils in order to copy files with spark. The documentation is fairly sparse for now, but Sandeep has a short blog post that was helpful. Still, understanding the exact URL structure and how to authenticate was a challenge.

Fastcp is a wrapper for….you guessed it, AzCopy. It seems to take the same time as all the other options running AzCopy (45 seconds) + about 12 seconds for spinning up a Spark session as far as I can tell. Sandeep has told me that it also works in Python for cheaper, but when I ran the same code I got an authorization error.

Overall, I see the appeal of Spark notebooks, but one frustration was that DAX has taught me to press Alt + Enter when I need a newline, which does the exact opposite in notebooks and will instead execute a cell and make a new one.

Learnings and paper cuts

I think my biggest knowledge gap overall was in the precise difference between blob storage and ADLS storage gen 2, as well as access URLS and access methods. Multiple times I tried to generate an SAS key from the Azure Portal and got an error when I tried to use it. Once, out of frustration I copied the one from the export to AzCopy option into my spark notebook to get it to work. Another time I used the generate SAS UI in the storage explorer and that worked great.

Even trying to be aware of all the ways you can copy both CSV files as well as convert CSV to delta is quite a bit to take on. I’m not sure how anyone does it.

My biggest frustration with Fabric right now is around credentials management. Because I had made some different tests, if I searched for “blob”, 3 options might show up (1 blob storage, 2 ADLS).

Twice, I clicked on the wrong one (ADLS) and got an error. The icons and name are identical so the only way you can tell the difference is by “type”.

This is just so, so frustrating. Coming from Power BI, I know exactly where the data connection is because it’s embedded in the semantic model. In OneLake it appears that connections are shared and I have no idea what scope they are shared within (per user, per workspace, per domain?) and I have no idea where to go to mange them. This produces a sense of unease and being lost. It also led to frustration multiple times when I tried to add a lakehouse data source but my dataflow already had that source.

What I would love to see from the team is some sort of clear and easily accessible edit link when it pulls in an existing data source. This would be simple (I hope) and would lead to a sense of orientation, the same way that the settings section for a semantic model has similar links.

Fabric Licensing from Scratch

The Basics

If you’ve dealt with Power BI licensing before, Fabric licensing makes sense as an extension of that model plus some new parts around CUs, bursting and smoothing. But what if you are brand new to Fabric, Power BI, and possibly even Office 365?

If you want to get started with Fabric, you need at a bare minimum the following:

  1. Fabric computing capacity. The cheapest option, F2, costs $263 per month for pausable capacity (called Pay-as-you-go) and $156 per month for reserved capacity. Like Azure, prices vary per region.
  2. An Entra tenant. Formerly called Azure Active Directory, Entra is required for managing users and authentication.
  3. Fabric Free license. Even though you are paying for compute capacity, all users need some sort of license applied to them as well. I think assigning a license requires an office 365 tenant to access the admin portal but I’m not sure.

Once you have an F2, you can assign that capacity to Fabric workspaces. Workspaces are basically fancy content folders with some security on top of it. Workspaces are the most common way access is provided to content. With the F2 you’ll have access to all non-Power BIfabric objects.

The F2 sku provides 0.25 virtual cores for Power BI workloads, 4 virtual cores for Spark workloads, and 1 core for data warehouse workloads. These all correspond to 2 CUs, also known as compute units. CUs are a made up unit like DTUs for databases or Fahrenheit in America. They are, however, the way that you track and manage everything in your capacity and keep costs under control.

Storage is paid for separately. OneLake storage costs $0.023 per GB per month. You also get X TB of free mirroring storage equal to your SKU level. So F2 gets 2 TB of storage.

There is no cost for networking, but that will change at some point in the future.

Power BI content

If your users want to create Power BI reports in these workspaces, they will need to be assigned a Power BI Pro license at a minimum, which costs $14 per user per month. This applies to both report creators and report consumers. Pro provides a majority of Power BI features.

The features this does not provide are covered by Power BI Premium per User (PPU) licenses, which cost $24 per user per month. These licenses allow for things like more frequent refreshes and larger data models. PPU is a hybrid license because you both license the user as well as assign the content to a workspace set to PPU capacity.

One of the downsides of the PPU model is that they act as a universal receiver of content but not a universal donor. Essentially, the only way for anyone to read reports hosted in a PPU workspace is to have a PPU license. So, you can’t use this as a cheat code to license your report creators with PPU and everyone else with Pro. Nice try.

There is demand for a fabric equivalent, a FPU license, but there is no word on when or if this will happen. Folks estimate this could cost anywhere from $30 to $70 per user per month if we get one.

Finally, if you ramp up to an F64 sku, Power BI content is then included. Users will still need a Fabric Free license. At $5002/mo for F64, this means it’s worth switching over at 358 Pro users or 209 PPU users. Additionally, you unlock all premium features including copilot.

Even if you pay for F64 or higher (or Power BI report server on Prem), any report creators need to be licensed with Power BI Pro for use of that publish button. I cannot understand why Microsoft would charge $5k per month and then charge for publishing on top.

There are also licensing complications for embedding Power BI in a custom application which is outside of the scope of this post.

Capacity management

Despite a Fabric SKU providing a fixed number of Capacity Units, Fabric is also intended to be somewhat flexible. Fabric customers like the pricing predictability of Fabric compared to Azure workloads, but because of the sheer number of workloads supported, actual usage can vary wildly compared to when premium capacity was only Power BI reports.

In order to support that, Fabric allows for bursting and smoothing. This is similar to auto-scaling, but not quite. Bursting will provide you with more capacity temporarily during spikey workloads, by up to a factor of 12 in most cases. However this bursting isn’t free. You are borrowing against future compute capacity. This means it’s possible to throttle yourself.

Bursting is balanced out by smoothing. Whenever you have exceeded your default capacity, future work is spread out over a smoothing window. This is a 5 minute window for anything a user might see and 24 hours for background tasks. If you are using pay-as-you-go capacity, you’ll see a spike in CUs when you shut down the capacity as all of this burst debt is paid off all at once instead of waiting for smoothing to catch up.

From what I’ve been told by peers, it’s possible that you can effectively take down a capacity with a rogue Spark notebook by bursting for so long that smoothing has to use the full window to catch up. At Ignite they announced they are working on Surge protection to prevent this

Capacity consumption can be monitored with the Fabric Capacity Metrics App.

I believe you can also upgrade a reserved capacity temporarily and pay the pay-go costs for the difference, but I can’t find docs to that effect.

Benchmarking Power BI import speed for local data sources

TL;DR – The fastest local format for importing data into Power BI is Parquet and then….MS Access?

The chart above shows the number of seconds it took to load X million rows of data from a given data source, according to a profiler trace and Phil Seamark’s Refresh visualizer. Parquet is a clear winner by far, with MS Access surprisingly coming in second. Sadly the 2 GB file limit stops Access from becoming the big data format of the future.

Part of the reason I wanted to do these tests is often people on Reddit will complain that their refresh is slow and their CPU is maxed out. This is almost always a sign that they are importing oodles and oodles of CSV files. I recommended trying Parquet instead of CSV, but it’s nice to have concrete proof that it’s a better file source.

For clarification, SQL_CCI means I used a clustered columnstore index on the transaction table and “JSON – no types” means all of the data was stored as text strings, even the numbers.

Finally, if you like this kind of content, let me know! This took about 2 days of configuration, prep, and testing to do. It also involved learning things that the Contoso generated dataset has Nan as a given name, which my python code interpreted as NaN and caused Power BI to throw an error. I’m considering doing something similar for Fabric data sources when Fabric DBs show up in my tenant.

Methodology

All of these test were run on my GIGABYTE – G6 KF 16″ 165Hz Gaming Work Laptop (don’t tell my accountants). It has an Intel i7-13620H 2.40 GHz processor, 32 GB of RAM, and a Gigabyte ag450e1024-si secondary SSD. The only time a resource seemed to be maxed out was my RAM for the 100 million row SQL test (but not for columnstore). For SQL Server, I was running SQL Server 2022.

The data I used was the Contoso generated dataset from the folks at SQLBI.com. This is a great resource if you want to do any sort of performance testing around Star Schema data. I had to manually convert it to JSON, XML, Excel and MS Access. For Excel, I had to use 3 files for the transaction table.

Initially, I was planning on testing in 10x increments from 10k rows to 100m. However, MS Access imported in under a second for both 10k and 100k, making that a useless benchmark. Trying to convert the data to more than 1m rows of data for XML, JSON, and Excel seemed like more work than it was worth. However, if someone really wants to see those numbers, I can figure it out.

For recording the times, I did an initial run to warm any caches involved. Then I ran and recorded it 3 times and reported the median time in seconds. For 100m rows, I took so long I just reported the initial run, since I didn’t want to spend half an hour importing data 4 times over.

Want to try it yourself? Here’s a bunch of the files and some sample at the 10k level:

Perf Data – local import blog.zip

What to learn more?

If you want to learn more about performance tuning Power BI, consider checking out my training course. You can use code ACCESS24 to get it for $20 until Dec 6th.

Power BI performance tuning – what’s in the course?

My course on performance tuning is live and you can use code LAUNCHBLOG for 50% off until Sunday February 11th. Module 1 is free on YouTube and Teachable, no signups.

Performance tuning playlist – Module 1

The goal of this course is to orient you to the various pieces of Power BI, identify the source of problem, and give some general tips for solving them. If you are stuck and need help now, this should help.

Note! This is an early launch. Modules 1 and 2 are available now, and the remaining ones will be coming out weekly.

  • Module 1: A Guide to Performance Tuning. This module focuses on defining a performance tuning strategy, and all of the places where Power BI can be slow.
  • Module 2: Improving Refresh – Optimizing Power Query. Optimize Power Query by understanding its data-pulling logic, reducing the data being loaded, and leveraging query folding for faster refreshes.
  • Module 3: Improving Refresh – Measuring Refresh Performance. Master measuring refresh performance using diagnostics and the refresh visualizer to identify which parts are slow.
  • Module 4: Improving Rendering – Modeling. Better modeling means faster rending. Understand the internals of models, using columnar storage, star schema, and tools like DAX Studio for optimization.
  • Module 5: Improving Rendering – DAX Code. Optimize DAX code to run faster, focusing on minimizing formula engine workload and effective data pre-calculation
  • Module 6. Improving Rendering – Visuals. Streamline visuals for better performance by minimizing objects, avoiding complex visuals, and using just-in-time context with report tool-tips and drill-through pages.
  • Module 7. Improving DirectQuery. Optimize DirectQuery with strategies to limit querying, improve SQL performance, and employ advanced features like user defined aggregations, composite models, and hybrid tables.

Each module after the first covers how to solve performance problems in each specific area. Each module also provides demos of the various tools you can use (of which there are many, see below).

Fabric Ridealong week 4 – Who invented this?

Last week I struggled to load and process the data. I was frustrated and a good bit disoriented. This week has been mostly backing up (again) and getting a better idea of what’s going on.

Understanding Databricks is core to understanding Fabric

One of the things that helps to understand Fabric is that it’s heavily influenced by Databricks. It’s built on delta lake, which is created and open sourced by Databricks 2019. You are encouraged to use a medallion architecture, which as far as I can tell, comes from Databricks.

You will be a lot less frustrated if you realize that much of what’s going on with Fabric is a blend of open source formats and protocols, but also is a combination of the idiosyncrasies of Databricks and then those of Microsoft. David Gomes has good post about data lake file formats, and it’s interesting to imagine the parallel universe where Fabric is built on Iceberg (which is also based on Parquet files) instead of delta lake. (Note, I found this post from this week’s issue of Brent Ozar’s Newsletter)

It was honestly a bit refreshing to see Marco Russo, DAX expert, a bit befuddled on Twitter and LinkedIn about how wishy-washy medallion architecture is. This was reaffirmed by Simon Whitely’s recent video.

This also means that the best place to learn about these is Databricks itself. I’ve been skimming through Delta Lake: Up & Running and finding it helpful. It looks like you can also download it for free if you don’t mind a sales call.

What should I use for ETL?

After playing around some more, I think the best approach right now is to work with notebooks for all of my data transformation. So far I see a couple of benefits. First, it’s easier to put the code into source control, at least in theory. In practice, a notebook files is actually a big ol’ JSON file, so the commits may look a bit ugly.

Second, it’s easier from a from a “I’m completely lost” perspective, because it’s easier to step through individual steps, see the results, etc. This is especially true when Delta Lake: Up & Running has exercises in PySpark. I’d prefer to work with dataflows because that’s what I’m comfortable with, but clearly that hasn’t worked for me so far.

Clip from the book

Tomaž Kaštrun has a blog series on getting into fabric which shows how easy it is to create a PySpark notebook. I am a bit frustrated that I didn’t realize notebooks were a valid ETL tool, I always thought of them being for data science experiments. Microsoft has some terse documentation that covers some of the options for getting data into your Lakehouse. I hope they continue to expand it like they have done with the Power BI guidance.

Lessons learned from being self-employed: 5 years in

Content warning: burnout, health issues

I have not been looking forward to writing this blog post. I started the series, inspired by Brent Ozar’s series, because being able to see how the other side lived helped me to evaluate the risks and take the leap to work for myself. Unfortunately, that commitment means writing about one of the worst years of my career, and what has felt largely like a waste.

A health scare

2023 started off in a state of burnout, techniques for recovery that worked in the past had stopped working. I was forced to try taking 2 consecutive weeks off for the first time in my career, and it helped dramatically. Also, during this time I was panicking about the change in payments from Pluralsight, and I reached out to everyone I could think of who sold courses or had a big YouTube channel for advice. Thank you to everyone who spent the time to help.

As a result I had decided I was going to start selling my own, self-hosted courses. I think I had hoped that I could just ramp up the social media a smidge, ramp down the consulting a smidge, and make it all work. If I could go back in time, I would have cut down on all extraneous commitments and focused just on this. Instead, I tried to make it all work, because of what I thought I “should” be able to accomplish, or what I had been able to accomplish in years past.

Around this same time, Meagan Longoria (along with others) convinced me at SQLBits to raise my consulting rates by 30%. Meagan has the tendency to be painfully blunt, while also being kind and empathetic. I think it’s difficult to nail both candor and kindness at the same time.

The health scare came in March, when I started weighing myself again. Travel from Bits and work had caused me to fall out of habit with exercising. What I found was I was the heaviest I had been in my entire life at 300 lbs. Even heavier than when I was in college and considered myself fairly obese. I had gained 20 lbs in 3 months, which as a diabetic is very very bad.

Barreling towards burnout

I decided I needed to do something, so I bribed myself with a Magic the Gathering booster every morning I exercised, and a Steam Deck if I could do that daily for 3 consecutive months. Overall, that worked, but I did find that in my mid 30s, it’s hard to just push through like that. I have to be careful, or I’ll develop plantar fasciitis or some other issues for a while.

At the same time, however, my work requirements had picked up. I had signed up for a volunteer position with a local organization that had become very stressful. I had work projects that had dragged on longer than they should and were starting to frustrate my customers. And I had found that the branding and marketing of selling my own courses involved much much more work and executive function than I had realized.

I did end up contracting and then hiring part time a local college grad to be my marketing assistant. She was recommended via a close professor friend of mine and overall she has been great. The biggest challenge has been acclimating someone to our particular niche of the data space and what the community is like.

Around June, I realized I was simply spread too thin. I had experienced being physically unable to get out of bed any sooner than was physically necessary. I was physically unable to get up an hour early for work to try to push through a project or deadline. I was should-ing myself to death, taking on more than I could handle because I thought I should be able to do more, because I thought people would be disappointed in me if I had to close out projects and work.

Ultimately, the largest threat to my health and well-being was my own personal pride.

Turning the corner

Thankfully I did decide to wind down as much of my consulting work as I could. It took multiple months longer than I would have preferred, honestly. I closed out any open projects that I could easily do so, and now I’m down to 2 customers that are a few hours per week. I also decided that I wouldn’t take on any new projects during November or December.

I’ve also been focusing on making that course and I officially have given myself a hard deadline of February 5th. At the moment I have absolutely no idea how well it will do. If it does well, that means I can continue to focus on making training content for a living. If not, I’ll have to consider pivoting into more of a focus on consulting or going back to a regular job. I would have preferred to be releasing this in the summer of 2023, but here we are.

I think the hardest thing to grapple with regarding burnout, is the uncertainty of how long it will take to recover and how aggressive you have to be in resting to recover. I’m grateful to both Matthew Roche and Cathrine Wilhelmsen for putting that into perspective.

There are days that I feel much better, I feel energetic and enthusiastic. Coming back from PASS Summit, I felt that way all week. But at the moment it’s still fragile, and I have to remind myself that a good day in a week doesn’t mean the issue has been totally solved yet.

One other thing, I always struggle with the lack of sunlight in the winter. For the first time ever, I’m being proactive about it and going somewhere warm in December instead of January or February when the issue becomes apparent. So, I’ll be spending Christmas week in San Juan, Puerto Rico where it is currently 80 degrees Fahrenheit. See y’all on the other side of 2024.

Fabric Ridealong Week 3 – Trying to put it into a table

Last week, I struggled to load the data into Fabric, but finally got it into a Lakehouse. I was starting to run into a lot of frustration, and so it seemed like a good time to back up and get more oriented about the different pieces of Fabric and how they fit together. In my experience, it’s often most effective to try to do something, review some learning, and alternate. Without a particular pain point, it’s hard for the information to stick.

As an aside, I wish there was more training content that focused on orienting learners. In her book, Design for How People Learn, Julie Dirksen uses the closet analogy for memory and learning. Imagine someone asks you to put away a winter hat. Does that go with the other hats? Does it go with the other winter clothes? An instructor’s job is to provide boxes and labels for where knowledge should go.

Orienting training content says “Here are the boxes, here are the labels”. So if I learn Fabric supports Spark, should I put that in the big data box, the compute engine box, the delta lake box, or something else entirely? If you are posting the Microsoft graphic below without additional context, you are doing folks a disservice, because it would be like laying out your whole wardrobe on the floor and then asking someone to put it away.

Getting oriented

So, to get oriented, first I watched Learning Microsoft Fabric: A Data Analytics and Engineering Preview by Helen Wall and Gini von Courter on LinkedIn Learning. It was slightly more introductory than I would have liked, but did a good job of explaining how many of the pieces fit together.

Next, I starting going through the Microsoft learning path and cloud skills challenge. Some of the initial content was more marketing and fluffy than I would have preferred. For example, explanations of the tools used words from the tool name and then fluff like “industry-leading”.  This wouldn’t have helped me at all with my previous issue last week of understanding what data warehousing means in this context.

After some of the fluff, however, Microsoft has very well written exercises. They are detailed, easy to follow, and include technical tidbits along the way. I think the biggest possible improvement would be to have links to more in-depth information and guidance. For example, when the Lakehouse lab mentions the Parquet file format, I’d love for that to have a link explaining Parquet, or at least how it fits into the Microsoft ecosystem.

Trying it with the MTG data

Feeling more comfortable with how Lakehouse works, I try to load the CSV to a lakehouse table and I immediately run into an error.

It turns out that it doesn’t allow for spaces in column names. It would be nice if it provided me with an option to automatically rename the columns, but alas. So next I try to use a dataflow to transform the CSV into a suitable shape. I try loading files from OneLake data hub, and at first I assume I’m out of luck, because I don’t see my file. I assume this only shows processed parquet files, because I can see the sales table I made in the MS Learn lab.

It takes a few tries and some digging to notice the little arrow by the files and realize it’s a subfolder and not the name of the folder I’m in. This hybrid files and tables and SQL Endpoints thing is going to take some getting used to.

I create a dataflow based on the file, remove all but the first few columns and select publish. It seems to work for a while, and then I get an error:

MashupException.Error: Expression.Error: Failed to insert a table., InnerException: We cannot convert a value of type Table to type Text.

This seems…bizarre. I got back and check my data and it looks like plain CSV file, no nested data types or anything weird. Now I do see table data types as part of the navigation steps, but none of the previews for any of the steps show any errors. I hit publish again, and it spins for a long time. I assume this means it’s refreshing, but I honestly can’t tell. I go to the workspace list and manually click refresh.

I get the same error as before, and I’m not entirely sure how to solve it. In Power BI Desktop, I’m used to being taken to what line is producing the error.

It turns out that I also had a failed SQL connection from a different workspace in the same dataflow. How I caused that or created it, I have no idea. The original error message did include the name of the query, but because I had called it MS_learn, I thought the error was pointing me to a specific article.

It takes about 15 minutes to run, then the new file shows up under…tables in a subfolder called unidentified. I get a warning that I should move these over to files. It’s at this point I’m very confused about what is happening and what I am doing.

So, I move it to files, and then select load to tables. Do that seems to work, although I’m mildly concerned that I might have deleted the original CSV file with my dataflow because I don’t see it anymore.

Additionally, I notice that I have been doing this all in My Workspace, which isn’t ideal, but that when I create a semantic model, it doesn’t let me create it there. So I have to create it in my Fabric Test workspace instead.

Regardless, I’m able to create a semantic model and start creating a report. Overall, this is promising.

Summary

So far, it feels a lot like there is a lot of potential with Fabric, but if you fall off the ideal path, it can be challenging to get back onto it. I’m impressed with the amount of visual tools available, this seems to be underappreciated when people talk about Fabric. It’s clearly designed to put Power BI users at ease and make the learning experience better.

I’m still unclear when I’m supposed to put the data into a warehouse instead of this current workflow, and I’m still unclear what the proper way is to clean up my data or deal with issues like this.

Fabric ridealong Week 2 – getting the data uploaded

I want to preface that a lot of the issues I run into below are because of my own ignorance around the tooling, and a lot of the detail I include is to show what that ignorance looks like, since many people reading this might be used to Fabric or at least data engineering.

So, last week we took a look at the data and saw that it was suitable for learning fabric. The next step is to upload it. Before we do anything else, we need to start a Fabric Trial. The process is very easy, although part of me would have expected it to show up on the main page and not just in the account menu. That said, I think the process is identical for Power BI.

Once I start the trial, more options show up on the main page. Fabric is really a collection of tools. I like that there are clear links at the bottom for the documentation and the community.

I think something that could be clearer is that the documentation includes tutorials and learning paths. While I understand that the docs.microsoft.com subdomain has been merged into the learn.microsoft.com subdomain, when I see “Read documentation” I assume that means stuffy reference material as opposed to anything hands on. This is an opportunity to take a lesson from Power BI Desktop by maybe having an introduction video, or at least having a “If you don’t know where to start, start here” link.

Ignoring all of that, the first I’m tempted to do is select one of these personas and see if I can upload my data. So, I take a guess and try Data Warehouse. Unfortunately, it turns out that this is more a targeted subset of the functionality. Essentially, as far as I would be aware, I’m still in Power BI. This risks a little bit of confusion, because the first 3 personas (Power BI, Data Factory, and Data Activator) are product names, so I’m likely to assume that the rest of them are also separate products. In part, because that’s how it historically has felt to me in Azure, as I’ve talked about when first learning Synapse.

Now thankfully, I’m aware that the goal of Fabric is to have more of a Power BI style experience, so I’m able to quickly orient myself and realize it is showing me a subset of functionality instead of a singular tool. I also see “?experience=data-warehouse” in the URL which is also a hint. So, I go ahead and click on the warehouse button, hoping this is what I need to upload my data. Unfortunately, I get a warning.

The warning says I need to upgrade to a free trial. But I just signed up for the free trial! Reading the description, I realize that I need to assign my personal workspace to the premium capacity provided by the free trial. This is a little confusing, and at first I had assumed I ran into a bug. I click upgrade and it works.

Finding where to put the data

Next it asks me for the name of my warehouse. I choose “MTG Test” and cross my fingers. Overall it seems to work. Again, I’m presented with some default buttons in the middle. I see options for dataflows and pipelines, and I assume those are intended for pulling data from an existing source, not uploading data. I also see an option for sample data, which I really appreciate for ease of learning.

I see Get Data in the top left, which I find comforting because it looks a lot like Get Data for Power BI, so let’s take a look. Unfortunately, it’s the same 2 buttons. So, we are at a bit of an impasse.

I click on the dataflow piece, but I’m starting to feel out of my depth. If my data already existed somewhere, I’d be fine, but it doesn’t. I have to figure out how to get the data into the data lake. So I back up a bit and then Bing “Fabric file upload”. The second option is documentation on “Options to get data into the Fabric Lakehouse”.

The first option shows how to do it in the lakehouse explorer. I go back to my warehouse explorer, looking for the tables folder, but it’s not there. I see a schemas folder, which I assume is maybe a rename like how they recently renamed datasets to semantic models. I assume that maybe schemas are different than tables and that I need to find a more detailed article on Lakehouse Explorer. It probably takes me a full minute to realize that a warehouse and a lakehouse are not the same thing, and that I’m probably in a different tool.

So, I backup again and search for the more specific query “fabric warehouse upload”. I see an article called “Tutorial: Ingest data into a Warehouse in Microsoft Fabric”. I quickly scan the article and see it suggesting using a pipeline to pull in data from blob storage. So I know that’s an option, but I’m under the vague impression that there should be a way to upload the data directly in the explorer.

Giving up and trying again

I dig around in Bing some more and I find another article called “Bring your data to OneLake with Lakehouse”. From demos I’ve seen of OneLake, it’s supposed to work kinda like One Drive. At this point I know I’m misunderstanding something about the distinction between a warehouse and a lakehouse, but I decide to just give up and try to upload data to a lakehouse. The naming requirements are more strict so I make MTG_Test.

I got to get data, I see the option to upload files. I upload a 10 gigabyte file and it works! Next week I’ll figure out how to do something with it.

Summary

Setting up the fabric trial was extremely easy and well documented. As far as I can tell, there’s a lot of getting started documentation for Fabric, but I wish it was surfaced or advertised a bit better. I run into a lot of frustration trying to just upload a file, in part because I don’t have a good understanding of the architecture and because my use case is a bit odd.

Overall, I’m feeling a bit disheartened, but I have to remind myself that I ran into a lot of the same frustrations learning Power BI. Some of that was the newness, some of that is learning anything, and some of that I expect the product team will smooth out over time.

I also acknowledge that I’d probably have an easier time if I just sat down and went through the learning paths and the tutorials. In practice though, a lot of times when I’m learning a new technology I like to see how quickly I can get my hands dirty, and then back up as necessary.

Fabric ride-along Week 1 – Reviewing the data

This is week 1 where I try to take Magic the Gathering draft data to learn Microsoft Fabric. Check out week 0 for some reasoning why.

So, before I do anything else, I want to get a sense of the data I’m looking at to see if it’s suitable for this project. I download the data, and because it’s gzipped, I use 7-zip to open it up on windows 10, or Windows explorer on Windows 11. In either case, the first thing I notice is the huge size disparity. When compressed, it is a quarter of a gigabyte. Uncompressed, it’s about 10 GB. This tells us something.

The longer you work in business intelligence, and especially in consulting, the more you start picking up clues and making inferences. You do this because scope creep is extremely prevalent in BI, and if you are a consultant you might be the one paying for it. So, what does 40x compression difference tell us about the data?

40x is abnormal. In my experience with the Vertipaq engine in Power BI, on a good day you are looking at 5-10x compression compared to a SQL backend. So, we know that there is a lot of repeated data. Because this is the only file for this data, we can infer that we will have to do quite a bit of normalization. CSV is a flat format, so the source data is likely heavily denormalized in this case. I would be shocked if there was any nested or hierarchical data like you might expect with JSON.

The next step is to take a peek at the data. There might be documentation somewhere, but for whatever reason I prefer to just take a look and get a feel for it. So how do we do that? Well, someone experienced would probably use a dedicated tool for large files. But I’m not experienced, so I confirm that I have 32 gigs of RAM, double click on the file and cross my fingers. In doing so, I create the most viral tweet of my career.

Excel complains that there are too many rows, but eventually shows me the first million of them. I take a quick glance to get oriented. The very first thing I’m scanning for is anything with the word “id” in it (1). The next thing I’m scanning for are repeated values (2), these are likely to go with the id as a header table or dimension table. Then I see pick number incrementing (3), so it’s likely functioning as a line number. Then I see a bunch of ones and zeros (4) to the right, and I don’t like that.

Issues with the data

I don’t like that because it’s data I don’t know how to deal with. My first guess is it’s data for data science that’s been turned into features. Columns like this are great for running experiments, but awful for traditional analytical reporting. I’ll likely have to reshape the data into something more dimensional, but I’ll have to learn how best to store this information. Doing a pivot is simple enough, but I have a nagging feeling I’m missing something.

So, the next question, is just how many columns do we have and what do they look like? I scroll over all the way to the right, and I see the letters YS. I don’t know how many that is, but I know it’s bad. Typically, in my work it never gets past A and another letter. I check and there are 672 columns!!!

Why so many columns? This data is around drafting Magic the Gathering cards. So, for each card in the specific magic set (a quarterly release of cards), we have a column if it was possibly in that card pack (the cards the player can choose from), as well as in the player’s already selected pool (the cards they’ve drafted). Essentially, for every card they could possibly see in a draft we are tracking what they have seen as well as what they have picked.

Accordingly, we have a very sparse dataset. Based on how the math works out, these columns will have 0 the vast majority of the time. I know that having lots and lots of columns interferes with run-length encoding, so leaving the dataset as is not ideal from a compression and performance standpoint. This does explain why the data compresses so well though, since most of it is long chunks of 0s and commas. The gzip algorithm is able to see that and substitute it.

There’s another issue with this shape. We have columns with specific names of the cards. The cards available each set are completely different, with only a handful of repeats. This means if we just merged in the schema each new set, we would have thousands of columns. This simply isn’t feasible; we have to reshape the data. We are going to need to learn how to dynamically unpivot the data, probably in Azure Data Factory, which I have no experience in.

Coincidentally, Javier Villegas was giving a presentation on data ingestion in the Data Toboggan conference. I think an important part of learning technologies is giving yourself the chance for “serendipity” or “luck”. If you are regularly bumping into content, you can find content that is relevant to the problems you have. As I mentioned in week 0, if you don’t have active problems or active tasks you sometimes have to make your own.

Summary

We can tell the data is abnormally compressible and we need to figure out why. It turns out it is a sparse data set. The first thing I do is rapidly scan for id fields, numerically incrementing fields, and repeated values to get a sense of how I might normalize the data. Based on the current shape of the data, I know I’m going to have to pivot it. I’ll probably have to learn Azure Data Factory for that, but we’ll see. I know vaguely that Fabric has support for PowerQuery.