Category Archives: Performance Optimization

Benchmarking Power BI import speed for local data sources

TL;DR – The fastest local format for importing data into Power BI is Parquet and then….MS Access?

The chart above shows the number of seconds it took to load X million rows of data from a given data source, according to a profiler trace and Phil Seamark’s Refresh visualizer. Parquet is a clear winner by far, with MS Access surprisingly coming in second. Sadly the 2 GB file limit stops Access from becoming the big data format of the future.

Part of the reason I wanted to do these tests is often people on Reddit will complain that their refresh is slow and their CPU is maxed out. This is almost always a sign that they are importing oodles and oodles of CSV files. I recommended trying Parquet instead of CSV, but it’s nice to have concrete proof that it’s a better file source.

For clarification, SQL_CCI means I used a clustered columnstore index on the transaction table and “JSON – no types” means all of the data was stored as text strings, even the numbers.

Finally, if you like this kind of content, let me know! This took about 2 days of configuration, prep, and testing to do. It also involved learning things that the Contoso generated dataset has Nan as a given name, which my python code interpreted as NaN and caused Power BI to throw an error. I’m considering doing something similar for Fabric data sources when Fabric DBs show up in my tenant.

Methodology

All of these test were run on my GIGABYTE – G6 KF 16″ 165Hz Gaming Work Laptop (don’t tell my accountants). It has an Intel i7-13620H 2.40 GHz processor, 32 GB of RAM, and a Gigabyte ag450e1024-si secondary SSD. The only time a resource seemed to be maxed out was my RAM for the 100 million row SQL test (but not for columnstore). For SQL Server, I was running SQL Server 2022.

The data I used was the Contoso generated dataset from the folks at SQLBI.com. This is a great resource if you want to do any sort of performance testing around Star Schema data. I had to manually convert it to JSON, XML, Excel and MS Access. For Excel, I had to use 3 files for the transaction table.

Initially, I was planning on testing in 10x increments from 10k rows to 100m. However, MS Access imported in under a second for both 10k and 100k, making that a useless benchmark. Trying to convert the data to more than 1m rows of data for XML, JSON, and Excel seemed like more work than it was worth. However, if someone really wants to see those numbers, I can figure it out.

For recording the times, I did an initial run to warm any caches involved. Then I ran and recorded it 3 times and reported the median time in seconds. For 100m rows, I took so long I just reported the initial run, since I didn’t want to spend half an hour importing data 4 times over.

Want to try it yourself? Here’s a bunch of the files and some sample at the 10k level:

Perf Data – local import blog.zip

What to learn more?

If you want to learn more about performance tuning Power BI, consider checking out my training course. You can use code ACCESS24 to get it for $20 until Dec 6th.

Query folding workaround for Azure Devops and Power BI

Query folding is one of the most powerful tools in Power Query and Power BI. It is the automatic process of pushing down filters and other transformations back to the data source. This can dramatically improve performance for your queries.

Unfortunately, OData is not guaranteed to support query folding. According to the Power BI documentation on incremental refresh.

Most data sources that support SQL queries support query folding. However, data sources like flat files, blobs, web, and OData feeds typically do not. In cases where the filter is not supported by the datasource back-end, it cannot be pushed down. In such cases, the mashup engine compensates and applies the filter locally, which may require retrieving the full dataset from the data source.

I recently did some tests on this for the OData source for Azure Devops. When I tested with the sample Northwind database, query folding was working. I was able to see with Fiddler that my date filter was getting pushed back down.

clip_image001

However, when I tried to the same with Azure Devops, none of my filters where getting pushed down to the source. As a workaround, I was able to put my filters in the URL. So to filter based on date, I used the following url:
https://analytics.dev.azure.com/eugene1234/TestingOdata/_odata/v2.0/WorkItems?$select=WorkItemId,WorkItemType,Title,State,ChangedDate&$filter=ChangedDate%20gt%202018-12-30T23:59:59.99Z

clip_image001[5]

In this case, I was manually specifying the date filter in the URL. But it should be possible to use M code to dynamically generate the URL. Another option might be to create a custom data connector for oData that supports query folding.

DAX Error: The Expression Refers to Multiple Columns. Multiple Columns Cannot Be Converted to a Scalar Value.

Promotion: Use code DAXERROR to save 10% off my course. Module 1 is free.

Sometimes, when working with DAX, you might get the following error:

The expression refers to multiple columns. Multiple columns cannot be converted to a scalar value.

This error occurs whenever the DAX engine was expecting a single value, or scalar, and instead received a table of values instead. This is an easy error to make because many DAX functions, such as FILTER, SUMMARIZE and ALL, return table values. There are three situations where this error commonly occurs:

  1. Assigning a table value to a measure or calculated column
  2. Forgetting to use a DAX aggregation
  3. Treating ALL or FILTER as an action, not a function

In the rest of the post, we’ll cover each scenario and how to fix it.

Assigning a table value to a measure or calculated column

Let’s say that you were doing some analysis on the products table in the AdventureWorks sample database. In this case, maybe you want to only look at the black products. So you create a measure with the following code:

BlackProducts = FILTER(Products, Products[Color] = “Black”)

image

One solution to this problem is instead of assigning the code to a measure, which is intended to display a single value, you can create a calculated table instead.

To do so, go to Modeling –> New table in Power BI Desktop. Then ender the same code as before but for the calculated table. Now you will see a table filtered accordingly.
image

Forgetting to use a DAX aggregation

Now, what if we actually did want a single value instead of a table? Let’s say we want to count the number of black products. In that case, we could wrap our code in an aggregation function, such as COUNTROWS which can take in a table and return a single value.

CountOfBlackProducts = COUNTROWS(FILTER(Products, Products[Color] = “Black”))

This code will return the count of all products, but only if they have black as the color.

Treating ALL or FILTER as an action, not a function

Sometimes, people will try to use functions like ALL or FILTER to filter information on the report. By themselves, these functions actually return a table. However, when they are used with CALCULATE and CALCULATETABLE then you can use them to filter your data appropriately.

Want to learn more?

If you want to learn more about DAX, then check out my free learning path and my paid Pluralsight course.

Fumbling in the Dark with DevOps and Automation

In the past, I’ve been skeptical about how much things like PowerShell, Devops and Docker are relevant to me personally. It makes sense if you are writing application code. It makes sense if you are managing hundreds of servers.

But I do Business Intelligence. How do you write unit tests for a report? Why do I need PowerShell when I can just hit Publish on Power BI Desktop? Do I really need Powershell if I manage 3 SQL Servers?

This year, however, there have been a number of events that have been slowly changing my mind:

I don’t know what I’m doing

I’ve talked before about how automation is a relative term. But I’d like to do some true automation, I’d like to make a script like Cody’s where I can spin up a multi-server homelab with SQL Server, Sample databases and client tools all installed.

And right now I have no idea what I’m Doing and I’m fumbling in the dark. I’ve made a github project and I’ve gotten Lability to create the virtual machines. I know I need to learn Desired State Configuration, and I can’t quite get it to work with Lability yet.

And beyond that, I have no idea what I’m doing. And that’s okay. I suspect that this is a pain a lot of people run into with devops and why they put it off. The reason I write this is to remind people is that it’s okay to suck at something.

Image result for adventure time suck

I’ll keep y’all updated as I slowly make progress, fumbling in the dark.

DAX error: A function ‘XXXX’ has been used in a True/False expression that is used as a table filter expression. This is not allowed.

Promotion: Use code DAXERROR  to save 10% off my course. Module 1 is free.

Whenever you start trying to use more complicated filters in the CALCULATE or CALCULATETABLE functions in DAX, you may start to get the following error:

A function 'MAX' has been used in a True/False expression that is used as a table filter expression. This is not allowed.

image

The function in single quotes may vary. Instead of MAX, it could be SUM, MIN, AVERAGE or nearly anything. Sometimes, you may not even be using a function and the error will just say CALCULATE is the problem:

A function 'CALCULATE' has been used in a True/False expression that is used as a table filter expression. This is not allowed.

image

What causes this error?

The error is caused by using a TRUE/FALSE expression, something that evaluates to TRUE or FALSE, to filter the table in a way that CALCULATE or CALCULATETABLE doesn’t support.  So the error is saying you can’t use a boolean comparison to filter your table except in very specific circumstances.

The following comparisons are not supported:

    1. Comparing to a column to a measure. SalesHeader[TerritoryID] = [LargestTerritory]
    2. Comparing a column to a an aggregate value. SalesHeader[TerritoryID] = MAX(TerritoryID[TerritoryID]])
    3. Comparing a column to a What-If parameter. SalesHeader[TerritoryID] =

TerritoryParameter[TerritoryParameter Value]

In fact, you only have three options if you want to filter a column in a CALCULATE/CALCULATETABLE function:

  1. Compare the column to a static value. SalesHeader[TerritoryID] = 6
  2. Use variables to create a static value. VAR LargestTerritory = MAX(SalesHeader[TerritoryID])
  3. Use a FILTER function instead of a true/false expression. FILTER(SalesHeader, SalesHeader[TerritoryID] = [LargestTerritory])

This is because CALCULATE was designed for safety and performance. Complex row based comparisons can dramatically affect performance. So, in order to do more complex comparisons, you have to take the safety feature off and use the FILTER function.

How do I fix it?

In order to fix the issue, wrap your expression in the FILTER function. To use the FILTER function, you need to pass in the table you want to filter, and then a TRUE/FALSE expression to determine which rows get return. So, let’s say we had the following code:

CALCULATE (
    SUM ( SalesHeader[TotalDue] ),
    SalesHeader[TerritoryID] = [LargestTerritory]
)

to use the FILTER function, we would use this:

CALCULATE (
    SUM ( SalesHeader[TotalDue] ),
    FILTER ( ALL ( SalesHeader[TerritoryID] ), SalesHeader[TerritoryID] =    [LargestTerritory] )
)

The ALL function isn’t strictly necessary, but normally when we filter a single column in a CALCULATE function, it will undo any existing filters on that column. We use ALL here to replicate that behavior. In order to understand the specifics better, check out this article at sqlbi.com

Slides for PASS BI Virtual Group

Today I’m going to be presenting on DAX for the PASS BI Virtual Group. The focus is on all the hard mental concepts of DAX. If I could sum up the talk in one picture, it would be this:

That red area is where I banged my head when learning DAX. The learning curve shoots up wildly in the middle of learning the technology, instead of a slow gentle curve. This presentation covers the middle parts that are key to understanding DAX.

Here are the slides for the presentation: Introduction-to-DAX-2017-03-30

Here is the recording:


Here is the talk by Marco Russo I mentioned in my presentation: Optimizing Multi-Billion Row Tables in Tabular