This guide will explain 2 different approaches to performing linear regression in Power BI. The 2 forecasting approaches we have worked with as a Power BI consultancy are:
Linear regression using Power BI's out-of-the-box AutoML functionality
Linear regression coded from scratch in DAX
These 2 approaches are briefly compared below:
| | Linear Regression in DAX | Linear Regression in AutoML |
| --- | --- | --- |
| Technical complexity | High | Low |
| Level of customisation | High | Low |
| License required | Free | Premium |
| Selection of prediction variables | Manual | Automated |
Let’s now discuss these 2 forecasting approaches in more detail. We will start with pre-built out of the box models since setting them up is less technical. However, even if you are not planning to code a linear regression model from scratch, it is still important to understand how they are built step by step. For this reason we recommend that you don’t skip the second part of this article.
This article is part of a series on forecasting and predictive analytics in Power BI. Stay tuned for more guides on this topic.
There are several prerequisites for accessing the out-of-the-box ML functionality in Power BI:
A Power BI Premium license (as noted in the comparison above, AutoML is not available on the free tier)
A workspace hosted on that Premium capacity
A dataflow in that workspace containing the data you want to model
Bear in mind that linear regression can only be applied to numerical fields. If you want to predict the number of visitors based on the day of the week, you cannot use a column with values like “Monday” or “Tuesday” directly; you first need to convert them into numbers (for example 1 for Monday, 2 for Tuesday, and so on).
In order to auto-generate an ML model, you need to navigate to your workspace and find your dataflow. Click the AutoML button and then add a machine learning model.
The next step is to select the table you want to work with as the entity and the metric that you want to predict as the outcome.
Next, you would select the type of ML model that you want to use. Choose regression to begin the forecasting process.
Power BI then automatically analyses the selected dataset and selects the columns that are likely to be good predictors of your target variable. It does this by identifying columns with a high correlation to the metric you are trying to predict.
The final step is to name the ML model and begin to train it. You can do this by pressing the “Save and Train” button.
This will create a new model, which you will be able to find in the “Machine Learning Models” tab. Click the right-arrow button to run the model and then click “Save and apply”.
Once applied, you will see 2 new tables created in your dataflows. The table that says “enriched with ML” contains the results of the linear regression model. You will be able to connect to this table from Power BI Desktop. Simply click “get data”, select a dataflow to connect to and bring this table in.
Once you load this data into Power BI Desktop, you will see all of your original columns plus a new column called “Regression Results”, which you can then start visualising.
If you are looking for a more custom approach to forecasting in Power BI, it is better to code your prediction model from scratch using DAX.
Linear Regression is the most popular predictive model and it is a very useful tool for a Power BI data analyst. This is the model that most stakeholders of data analysis projects understand and therefore is often a good starting point for predictive analysis.
Variable selection is the most important part of linear regression. As a result we need to start with data exploration to build hypotheses about which variables will be most useful.
I will be using a real-life case study that we worked on to explain linear regression in Power BI. This was a project for a night club where we predicted how many visitors they would have on each day of the following year. They used this analysis to plan their most expensive events for the dates for which we predicted the highest attendance.
In our night club project the metric that we wanted to predict was number of visitors. We began this project by analysing which other columns the “number of visitors” seemed to be correlated with.
We did this by creating graphs where we plotted the number of visitors with other columns. You will see a few examples below:
As you can see, there seems to be a correlation between the number of visitors and the day of the week. Specifically, there is a big spike in visitors on Fridays. As a result, we can make a hypothesis that weekdays are a good predictor of the number of visitors.
Next we can look at the number of visitors by month. You can see from the graph below that the number of visitors usually grows from April through September and then drops significantly over October and November. We can make an assumption that there is a seasonal pattern here which can help us predict the number of visitors next year.
Finally, we can also see that there is some YoY growth in the business and we might be able to assume that this level of growth will continue going forward. Of course we only have 2 years of data so making this assumption can be a little dangerous. In an ideal world we would have several years of data so we can confidently say if there is a clear trend.
In a real-world Power BI forecasting project you would need to keep doing this data exploration exercise to create more hypotheses about prediction factors. You would be able to test those hypotheses in the next step of the forecasting process.
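Alongside the graphs, you can also put a rough number on these relationships. The measure below is a minimal sketch of a Pearson correlation between the month and the number of visitors; it assumes the data sits in the Predictions table used later in this article, and since DAX has no built-in correlation function, the coefficient is computed from its definition:

Correlation Month vs Visitors =
VAR _n = COUNTROWS ( Predictions )
VAR _sumX = SUMX ( Predictions, Predictions[Month] )
VAR _sumY = SUMX ( Predictions, Predictions[Visitors] )
VAR _sumXY = SUMX ( Predictions, Predictions[Month] * Predictions[Visitors] )
VAR _sumX2 = SUMX ( Predictions, Predictions[Month] ^ 2 )
VAR _sumY2 = SUMX ( Predictions, Predictions[Visitors] ^ 2 )
VAR _numerator = _n * _sumXY - _sumX * _sumY
VAR _denominator = SQRT ( ( _n * _sumX2 - _sumX ^ 2 ) * ( _n * _sumY2 - _sumY ^ 2 ) )
RETURN
    DIVIDE ( _numerator, _denominator )  // Pearson correlation coefficient, between -1 and 1

Values close to -1 or 1 suggest a strong linear relationship, while values near 0 suggest the column is unlikely to be a useful linear predictor on its own.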
The DAX function for linear regression in Power BI is LINEST(). You cannot use this function to create a calculated column or a measure, but you can use it to create a new table.
To begin our prediction process, let's create a new table using the following DAX formula:

linest test = LINEST(Predictions[Visitors], Predictions[Year], Predictions[Weekday], Predictions[Month])
In this case the table Predictions is the dataset we are trying to build our linear regression model on. It contains the data for number of visitors per day for 2022 and 2023. Remember, we are trying to use this dataset to predict the number of visitors per day in 2024.
Predictions[Visitors] is the column that we want to predict.
Predictions[Year], Predictions[Weekday] and Predictions[Month] are the variables that predict the number of visitors according to our hypothesis.
The next step is to evaluate whether our hypothesis is accurate. In order to do this we need to look at the table created with the Linest DAX function.
One particular column that you need to pay attention to is “Coefficient of Determination”. This metric is also known as R squared.
Simply put, R squared measures how much of the variation in the target metric our model explains. A value of 0 means the model explains none of the variation, a value of 1 means it explains all of it, and a value above 0.5 means the model explains more than half of the variation in the number of visitors. As you work on your linear regression model you want to get your R squared as close to 1 as possible.
One limitation of R squared is that if you keep adding features to the model, the value will keep increasing. In other words, R squared never decreases when you add features. However, the goal is not to add all the possible features but to find the ones that are actually good predictors.
As you can see, the coefficient of determination in our case is 0.08 which is a pretty weak performance. However, before we learn how to improve it we need to discuss how to read our Linest table.
The Intercept column is basically the expected number of visitors if year = 0, month = 0 and weekday = 0. It is represented as a blue dot on the graph below. In our case it is -550338.
The slope of our prediction is built from the columns Slope1, Slope2 and Slope3, one for each feature. The resulting formula is: Predicted visitors = Intercept + Slope1 × Year + Slope2 × Weekday + Slope3 × Month.
The image below is a simplified version of what the linear regression function would look like graphically. Our function already has 3 prediction factors, meaning that a complete chart would need to be in 4D.
The main way to optimize a linear regression model is through feature selection. You either need to select a different set of features or change the way the current features are represented.
Please note that all the features you use for linear regression must be numerical. If you have categorical data, you need to transform it into numbers. For example, your weekdays can be 1, 2, 3… instead of “Monday”, “Tuesday”, “Wednesday”.
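As a hedged sketch of that conversion, either of the calculated columns below would work; the column names Predictions[Date] and Predictions[Weekday Name] are assumptions, so adjust them to your own model:

Weekday =
WEEKDAY ( Predictions[Date], 2 )  // return type 2 gives 1 = Monday ... 7 = Sunday

or, if you only have the weekday names as text:

Weekday =
SWITCH (
    Predictions[Weekday Name],  // assumed text column holding "Monday", "Tuesday", ...
    "Monday", 1,
    "Tuesday", 2,
    "Wednesday", 3,
    "Thursday", 4,
    "Friday", 5,
    "Saturday", 6,
    "Sunday", 7
)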
One important rule of feature selection to keep in mind is that the features should not be closely related to each other. For example, if you are predicting sales based on gender, it doesn't make sense to select is_male and is_female as your features, because each one can be derived from the other.
There are also statistical ways to analyse whether a selected feature is good or bad.
If your R squared increases significantly after you add another feature to your linear regression, that is a good sign. However, R squared is a measure of the linear regression model as a whole, not of individual variables.
If you want to measure the usefulness of individual features, you can do so by calculating a p-value as a new measure using the formula below:
p-val Queries =
VAR _df = FIRSTNONBLANK ( 'linest test'[DegreesOfFreedom], 1 )     // degrees of freedom of the regression
VAR _coef = FIRSTNONBLANK ( 'linest test'[Slope1], 1 )             // coefficient of the first feature
VAR _se = FIRSTNONBLANK ( 'linest test'[StandardErrorSlope1], 1 )  // standard error of that coefficient
VAR _tval = ABS ( _coef ) / _se                                    // t-statistic
VAR _pval = T.DIST.2T ( _tval, _df )                               // two-tailed p-value
RETURN
    _pval
The formula above calculates the p-value for the first variable in your model; to test the other variables, swap Slope1 and StandardErrorSlope1 for Slope2 and StandardErrorSlope2, and so on. The closer the p-value is to 0, the more significant the corresponding feature is for predicting the number of visitors.
When applying the formula above to the features in our model we got the following results.
Year and month seem to be good predictors according to our p-value test, but weekday is not. Let's now see if we can improve our model by replacing the single weekday feature with 7 features, one for each individual day of the week.
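As a sketch of what this change could look like in DAX (the indicator column names are illustrative), you would first add one 0/1 column per day and then pass them to LINEST instead of the single weekday number:

IsMonday = IF ( Predictions[Weekday] = 1, 1, 0 )
IsTuesday = IF ( Predictions[Weekday] = 2, 1, 0 )
IsWednesday = IF ( Predictions[Weekday] = 3, 1, 0 )
IsThursday = IF ( Predictions[Weekday] = 4, 1, 0 )
IsFriday = IF ( Predictions[Weekday] = 5, 1, 0 )
IsSaturday = IF ( Predictions[Weekday] = 6, 1, 0 )
IsSunday = IF ( Predictions[Weekday] = 7, 1, 0 )

linest test v2 =
LINEST (
    Predictions[Visitors],
    Predictions[Year],
    Predictions[Month],
    Predictions[IsMonday],
    Predictions[IsTuesday],
    Predictions[IsWednesday],
    Predictions[IsThursday],
    Predictions[IsFriday],
    Predictions[IsSaturday],
    Predictions[IsSunday]
)

Strictly speaking the seven indicators always sum to 1, so you could drop one of them to respect the “unrelated features” rule above; we keep all seven here to mirror the approach described.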
As you can see, the coefficient of determination is now 0.55, which is a big improvement from 0.08. This tells us that the relationship between the day of the week and the number of visitors is not linear. In other words, the number of daily visitors does not simply increase with every new day of the week.
Coming back to our data exploration phase, we saw that the number of visitors increases with every month until September and then drops drastically in October. This might suggest that we need 2 linear regressions: one for Jan-Sep and another for Oct-Dec.
We can apply a filter to our LINEST table by wrapping it into a CALCULATETABLE function as shown below.
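The exact formula is not reproduced here, but a minimal sketch of the idea, assuming we fit a separate model for January to September and reuse the weekday indicator columns from the previous step, could look like this:

linest Jan-Sep =
CALCULATETABLE (
    LINEST (
        Predictions[Visitors],
        Predictions[Year],
        Predictions[Month],
        Predictions[IsMonday],
        Predictions[IsTuesday],
        Predictions[IsWednesday],
        Predictions[IsThursday],
        Predictions[IsFriday],
        Predictions[IsSaturday],
        Predictions[IsSunday]
    ),
    Predictions[Month] <= 9  // keep only January to September rows
)

A second table with the filter Predictions[Month] > 9 would cover October to December.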
As you can see, our coefficient of determination increased to 0.64, which is a great result.
Now that we got a good R squared we can start applying our linear regression model to predict visitors.
First of all, we can test the accuracy of our model on historical data. We have the actuals for 2022 and 2023, so we can generate predictions for those years and measure the difference.
We can do this by creating a new calculated column in our 2022-2023 table. We would replicate the standard linear regression function using DAX: ŷ = b₀ + b₁x₁ + b₂x₂ + …
b₀ would be our Intercept column from the LINEST table
b₁ would be our Slope1 column from the LINEST table
and so on for the remaining slopes
The full DAX formula for predicting visitors simply extends this pattern to every slope in the LINEST table.
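Here is a minimal sketch of such a calculated column on the Predictions table. It assumes the LINEST output table has a single row and, for brevity, uses the original three-feature model rather than the full set of weekday indicators:

Predicted Visitors =
VAR _intercept = FIRSTNONBLANK ( 'linest test'[Intercept], 1 )
VAR _slopeYear = FIRSTNONBLANK ( 'linest test'[Slope1], 1 )     // slope for Year
VAR _slopeWeekday = FIRSTNONBLANK ( 'linest test'[Slope2], 1 )  // slope for Weekday
VAR _slopeMonth = FIRSTNONBLANK ( 'linest test'[Slope3], 1 )    // slope for Month
RETURN
    _intercept
        + _slopeYear * Predictions[Year]
        + _slopeWeekday * Predictions[Weekday]
        + _slopeMonth * Predictions[Month]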
You can see that our prediction model accurately reflects some patterns, such as the spikes in visitors on Fridays. It is also only 4% off on a yearly basis, but it can be around 30% off on average when predicting on a daily basis. This shows that linear regression is more reliable for predicting monthly or annual visitors than daily visitors.
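If you want to reproduce this kind of accuracy check, the two measures below are a sketch of how the daily and yearly errors could be calculated; they assume the predicted values sit in the Predicted Visitors column from the sketch above:

Daily MAPE =
AVERAGEX (
    Predictions,
    DIVIDE ( ABS ( Predictions[Visitors] - Predictions[Predicted Visitors] ), Predictions[Visitors] )
)

Yearly Error % =
DIVIDE (
    ABS ( SUM ( Predictions[Visitors] ) - SUM ( Predictions[Predicted Visitors] ) ),
    SUM ( Predictions[Visitors] )
)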
Now that we have tested our prediction on the historical data, it is time to create a new table that contains 2024 data including all of our selected fields:
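One way to generate such a table in DAX is sketched below; it assumes you want one row per day of 2024 with the same Year, Month and Weekday columns used for training:

Predictions 2024 =
ADDCOLUMNS (
    CALENDAR ( DATE ( 2024, 1, 1 ), DATE ( 2024, 12, 31 ) ),  // one row per day of 2024
    "Year", YEAR ( [Date] ),
    "Month", MONTH ( [Date] ),
    "Weekday", WEEKDAY ( [Date], 2 )                           // 1 = Monday ... 7 = Sunday
)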
We will then create a new calculated column for this table using the same logic as before.
Creating these reports can definitely be a technical challenge. Let professional Power BI consultants do the job! Simply reach out to us and let’s discuss your Power BI forecasting project!