Using a Logistic Regression to Make a College Basketball Predictive Ranking

Avi Rajendra-Nicolucci
Mar 1, 2021

I was recently assigned a take-home project, one part of which was to build a predictive ranking of ACC basketball teams from the 2018/19 season using box score data. Creating a rating to back up the ranking was optional, but I did it anyway. The idea is that if team A is ranked higher than team B, team A is predicted to win on a neutral court.

North Carolina vs. Duke is one of the best rivalries in American sports

This problem immediately seemed like a logistic regression problem due to the binary nature of winning and losing. My plan from the start was to create a logistic regression model and use the coefficients to make a rating system.

There were two ways to use the data: per-game season averages or rolling averages. Either way, I would have to calculate them myself. I decided to go with rolling averages because past research suggested this was the best way of modeling college basketball. Plus, it made more sense to assess how a team was currently playing rather than rely only on end-of-season averages. I settled on a 5-game rolling average, which gives a nice snapshot of how teams were playing.
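As a rough illustration of the 5-game rolling average, here is a minimal pandas sketch. The data, the column names (`team`, `game_no`, `points`) and the helper `add_rolling_mean` are made up for the example. The `shift(1)`, which keeps a game's own box score out of its average, is an assumption on my part and not necessarily how the original project handled it.

```python
import pandas as pd

# Hypothetical box score rows: one row per team per game, in playing order.
games = pd.DataFrame({
    "team": ["Duke"] * 6 + ["Virginia"] * 6,
    "game_no": list(range(1, 7)) * 2,
    "points": [80, 90, 70, 85, 75, 95, 60, 65, 70, 75, 80, 85],
})

WINDOW = 5  # the 5-game window described above


def add_rolling_mean(df, column, window=WINDOW):
    """Add a per-team rolling mean of `column` built from previous games only."""
    out = df.sort_values(["team", "game_no"]).copy()
    out[f"{column}_roll"] = (
        out.groupby("team")[column]
        # shift(1) drops the current game, so each row only sees earlier games
        .transform(lambda s: s.shift(1).rolling(window).mean())
    )
    return out


rolled = add_rolling_mean(games, "points")
print(rolled)
```

With this setup, a team's first five games have no rolling value yet, so those rows would need to be dropped (or the window relaxed) before fitting a model.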

The box score data provided consisted of a team’s points scored, assists, turnovers, steals, blocks, rebounds, offensive/defensive rebounds, field goals attempted/made, 3-point field goals attempted/made, free throws attempted/made, and fouls committed in a single game. That’s 15 variables, which is kind of a lot! Having too many variables can lead to overfitting. The other issue is multicollinearity, meaning some of these variables are highly correlated with each other. Total rebounds, for example, is just the sum of offensive and defensive rebounds. The solution is feature selection.
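One quick way to spot multicollinearity is to look at pairwise correlations. The sketch below uses simulated numbers, not the real box scores, and the helper `correlated_pairs` and the 0.7 cutoff are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Simulated box score columns (not real data), just to show the check.
rng = np.random.default_rng(0)
n = 200
off_reb = rng.poisson(10, n)
def_reb = rng.poisson(25, n)
box = pd.DataFrame({
    "off_reb": off_reb,
    "def_reb": def_reb,
    "rebounds": off_reb + def_reb,  # total rebounds is the sum of the two above
    "steals": rng.poisson(6, n),
})


def correlated_pairs(df, threshold=0.7):
    """Return column pairs whose absolute correlation exceeds `threshold`."""
    corr = df.corr().abs()
    cols = list(corr.columns)
    return [
        (a, b, round(float(corr.loc[a, b]), 2))
        for i, a in enumerate(cols)
        for b in cols[i + 1:]
        if corr.loc[a, b] > threshold
    ]


pairs = correlated_pairs(box)
print(pairs)
```

Pairs flagged this way are candidates for dropping or combining before the model is fit.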

Virginia was the eventual national champion in 2018/19

I decided to use a technique called backward feature selection. Essentially, you start with every variable in the model, remove the least significant one, refit, and repeat until all of the remaining variables have a p-value below a chosen significance level. Typically that level is 0.05, which I stuck with here.

It was fascinating to see which variables didn’t make the cut in each iteration of the model. In the end, I was left with only three variables: margin of victory/loss, turnovers and fouls (all rolling averages). This, of course, does not include home-court advantage. When home court was included, it was by far the most important variable, which was cool to see given how important fans are thought to be in college basketball.

I multiplied the coefficients of the variables by each team’s final rolling averages to get an index that could be used to predict any future game. Check out the table below. Note that the rating is based on current form over the last five games, but as you can see, it matches closely with the final standings.
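The index can be sketched as a weighted sum: each rolling average times its coefficient, added up per team. The coefficients, team names and rolling averages below are invented for illustration and are not the values from my model.

```python
import pandas as pd

# Hypothetical logistic regression coefficients (not the fitted values).
coefs = pd.Series({"margin": 0.10, "turnovers": -0.05, "fouls": -0.02})

# Hypothetical final 5-game rolling averages for three made-up teams.
form = pd.DataFrame(
    {
        "margin": [10.0, 4.0, -2.0],
        "turnovers": [10.0, 12.0, 14.0],
        "fouls": [15.0, 18.0, 20.0],
    },
    index=["Team A", "Team B", "Team C"],
)

# Rating = sum over variables of coefficient * rolling average.
ratings = form[coefs.index].dot(coefs).sort_values(ascending=False)
print(ratings)
```

Sorting the ratings from highest to lowest gives the ranking, and the higher-rated team in any pairing is the predicted winner on a neutral court.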

2018/19 ACC Basketball Ratings

One limitation of doing this with coefficients is that they are a crude way of measuring feature importance, since the size of a raw coefficient depends on the scale of its variable. Using an importance score would’ve been a more accurate method here. But overall I’m pleased with the results. The rating definitely passes the eye test, with North Carolina, Duke and Virginia in the top three. Virginia, of course, ended up winning the NCAA Tournament that season, while Duke was knocked out in the Elite Eight and North Carolina was eliminated in the Sweet Sixteen.

This model was far from perfect, but it was my first time using rolling averages and box score data together. All of this was done in Python, in Jupyter Notebooks, with the statsmodels package. I hope this was informative, and I’d love to do more modeling work in this area, either with basketball or other sports.
