# Introduction to regression using NBA statistics.

Abstract

This paper presents an activity that was used to introduce concepts related to the simple linear regression model using data from the National Basketball Association (NBA). Using SPSS to facilitate student understanding and interpretation of statistics concepts, this particular classroom example also illustrates potential problems that can arise when manipulating real-life data. Teaching activities of this sort might help students to begin to make the connection between learning in the classroom and applying the methods out in the real world.

Introduction

The teaching of introductory statistics courses and concepts can be a challenge. Many teachers not only want their students to attain an understanding of basic statistics concepts but they also would like to demonstrate to students the practical applications of the methods in the real world (Gal, Ginsburg, & Schau, 1997; Doerr & English, 2003; Groth & Powell, 2004).

Selecting an appropriate research scenario using real data that illustrates the concepts taught in an introductory class can also be a concern for teachers (Hill & Ball, 2004; Groth & Powell, 2004; Franklin, 2000). Although real data can illustrate practical applications of the methods, such data rarely conform neatly to the theory that students learn in the classroom. Nevertheless, introductory statistics concepts can provide an entry point to interesting applications that involve understanding relationships in the real world. In addition, when deviations from theory occur, they may stimulate other worthwhile discussions and enhance understanding of related statistics concepts.

The purpose of this paper is to present an activity that was used to introduce ideas related to the simple linear regression model as well as to illustrate potential problems that can arise when manipulating 'real-life' data. Using the Statistical Product and Service Solutions (SPSS) software (or Excel) and data from the National Basketball Association (NBA) website, the following classroom activity might be helpful for students learning introductory statistics concepts. This activity is also appropriate for teachers of grades 6-12 as an application of the components of the Data Analysis and Probability Standard (NCTM, 2000). In addition, other important teaching objectives can be emphasized, such as interpreting scatterplots and correlation, evaluating the tenability of assumptions, writing null and alternative hypotheses associated with hypothesis tests, and reporting and interpreting confidence intervals and p-values.

Classroom Activity

Most students are familiar with many of the teams affiliated with the NBA--most will even have a favorite. Near the end of a class period is a good time to allow students to log on to the internet to collect data for the next class. At [1], students can access individual player statistics for their favorite team such as average points per game, rebounds, fouls, steals, turnovers, etc.

After a class lecture on simple linear regression, students were asked to collect data on two variables from the NBA website that might be linearly related. During the next class, we used SPSS to analyze the data, interpreted related concepts, and evaluated assumptions before interpreting statistics and making inferences. Below, we offer one example, using data collected on the Atlanta Hawks' 2002-03 season, that illustrates our challenge using 'real data'.

Research Question

'Boo' has been an Atlanta Hawks fan all of her life. Her team finished in 11th place (out of 15) in the Eastern Conference regular season standings. When considering her two variables, Boo offered that new head coach Terry Stotts might also be interested in her 'study' because studying the team statistics from last season might help the team improve in the upcoming season. She also added that in the past few seasons, the Hawks have struggled with many of the fundamentals of the game, including grabbing rebounds and cutting down on turnovers. Therefore, Boo wanted to know if there was a linear relationship between a player's rebounds per game and the number of points scored [Table 1]. To see all mentioned tables/figures, visit the issue website at http://rapidintellect.com/AEQweb/fal2005.htm

After the data were entered into SPSS, a user-friendly software program that, like Excel, can generate simple statistics, a scatterplot revealed a positive and moderately strong linear relationship between the two variables [Figure 1]. The Pearson correlation of .78 confirmed this relationship; that is, there is an overall tendency for players who grab more rebounds to score more points [Table 2]. The simple linear regression equation relating these two variables was also generated, but the assumptions for making inferences needed to be checked first. The students, eager to crunch numbers right away and determine whether there was statistical significance, presumed that checking the assumptions would be a simple technicality. However, we found that this was not the case. Below, we use Boo's example to illustrate how we dealt with the violations. We also discussed and reinforced other concepts related to regression analysis, including interpreting and understanding scatterplots, correlation, hypothesis testing, assumptions, confidence intervals, and p-values. We concluded the activity by discussing cause and effect, another important limitation of inferences drawn from the linear relationship or association between two variables.
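For readers without SPSS, the correlation step can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in class: the function name `pearson_r` and the six data pairs are hypothetical stand-ins, since the values in Table 1 are not reproduced here.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values for illustration only; NOT the Hawks data from Table 1.
avg_rebs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # average rebounds per game
ppg = [2.5, 4.0, 7.5, 7.0, 12.0, 13.5]      # points per game

r = pearson_r(avg_rebs, ppg)
print(f"r = {r:.2f}")
```

A value of r near +1 corresponds to points that cluster tightly around an upward-sloping line in the scatterplot; a value near 0 indicates little or no linear association.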

[FIGURE 1 OMITTED]

Assumptions

Using Boo's data, we generated the output and examined the assumptions using SPSS and an overhead projection while the rest of the class watched and asked questions when needed. Although the simple linear regression model relating points per game (PPG) to average rebounds (AVGREBS) yielded the estimated equation 2.086x + .056, where x is AVGREBS [Table 3], an evaluation of the assumptions indicated a possible violation. First, the assumption of normality for the distribution of the errors appeared to be maintained when examining the standardized residual plot [Figure 2]. Despite the small sample, almost all of the standardized residuals were within 2 standard deviations of the mean; only one standardized residual (Jason Terry) was beyond +2 standard deviations [Table 4]. Therefore, we concluded that the normality assumption appeared to be met.
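The fit and the residual check described above can be sketched as follows. This is an illustration under stated assumptions: the data are hypothetical, the helper names `fit_line` and `standardized_residuals` are invented for this example, and the residuals are standardized by one common definition (each residual divided by the residual standard error), which may differ in detail from the SPSS output.

```python
import math

def fit_line(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def standardized_residuals(x, y):
    """Residuals divided by the residual standard error (n - 2 degrees of freedom)."""
    slope, intercept = fit_line(x, y)
    residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (len(x) - 2)
    return [e / math.sqrt(mse) for e in residuals]

# Hypothetical values for illustration only; NOT the Hawks data from Table 1.
avg_rebs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ppg = [2.5, 4.0, 7.5, 7.0, 12.0, 13.5]

slope, intercept = fit_line(avg_rebs, ppg)
z = standardized_residuals(avg_rebs, ppg)

# Flag observations whose standardized residual lies beyond 2 in absolute value.
flagged = [i for i, value in enumerate(z) if abs(value) > 2]
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, flagged = {flagged}")
```

Plotting `z` against `avg_rebs` gives the kind of residual plot used to judge both normality and constant variance.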

[FIGURE 2 OMITTED]

On the other hand, there was some evidence to indicate that the assumption of constant variance across the values of AVGREBS was not maintained. The plot revealed that the variances increased from left to right [Figure 2]. Thus, we were not justified in making any inferences from our model because of the violation of this assumption. We then discussed how to resolve the violation by transforming the dependent variable to a different scale, considering either the logarithm or the inverse of PPG (Anderson, Sweeney, & Williams, 1994). At this point during the activity, we took a few minutes to review how to make these transformations using our calculators. For Boo's data, each student computed the transformed values for each function and then confirmed them using SPSS (the transformations are also easy to carry out in Excel). We ultimately discovered, after discussing the shapes of the plots for the two functions, that applying the natural logarithm to PPG satisfied the requirements for making inferences back to the population. In fact, an inspection of the new plot revealed that the LOGEPPG scores not only provided a better normal approximation but also corrected the wedge-shaped pattern of the variances in the residual plot [Figure 3]. It was explained that the residuals now appeared to have a more even scatter about the line from end to end (i.e., a rectangular pattern). One student also noted that we lost observation 20 (Paul Shirley) [Table 1] [Table 5] because the natural log is not defined for x less than or equal to 0. In addition, the instructor pointed out that one observation (Brandon Williams) appeared to be somewhat different from the rest of the group, which reinforced the idea that one person (i.e., an outlier) might potentially affect results and interpretations.
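The two candidate transformations, and the reason an observation is lost, can be sketched in Python. The player labels and scoring averages below are hypothetical placeholders, not the values in Table 1; the point is only that a scoring average of 0 has no natural logarithm (and no inverse), so that row must be set aside before refitting.

```python
import math

# Hypothetical (player, points per game) pairs for illustration only.
records = [
    ("Player A", 17.2),
    ("Player B", 8.4),
    ("Player C", 0.0),   # a scoring average of 0 cannot be log-transformed
    ("Player D", 2.1),
]

# Keep only rows where the transformation is defined.
log_ppg = [(name, math.log(ppg)) for name, ppg in records if ppg > 0]
inverse_ppg = [(name, 1.0 / ppg) for name, ppg in records if ppg > 0]
dropped = [name for name, ppg in records if ppg <= 0]

for name, value in log_ppg:
    print(f"{name}: ln(PPG) = {value:.3f}")
print("dropped:", dropped)
```

After transforming, the regression is refit with the transformed values as the dependent variable and the residual plot is inspected again.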

[FIGURE 3 OMITTED]

One student questioned whether it was even important to 'do' (evaluate) these assumptions, as 'we really just want to know whether there is a relationship between points and rebounds or not'. Another student responded by reminding the class that 'a different regression model will change the relationships'. With those thoughts in mind, we proceeded to estimating and interpreting the model, which also included emphasizing other important concepts related to regression, such as writing hypotheses, testing for statistical significance, interpreting the confidence interval, and prediction. We finished our activity by considering a follow-up research question, offered our advice to the coaches, and talked about the limitations of our inferences.

Interpretation of the Model

We generated the output for the estimated regression equation for the natural logarithm of PPG, which indicated .343x + .344 [Table 6]. Boo's interpretation of the slope was that 'a player's average points per game will increase by 1.40 points for every rebound he obtains', on average. Next, we talked about whether the relationship was statistically significant by considering hypothesis testing. The null and alternative hypotheses about the relationship between rebounds and points scored in the population were written as

Null hypothesis: The population slope Beta equals 0 versus the

Alternative hypothesis: The population slope Beta does not equal 0.
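In symbols, the transformed model and the two hypotheses above can be written as follows. The notation is an assumption made for this illustration: $\beta_0$ and $\beta_1$ denote the population intercept and slope, and $\varepsilon$ the error term; these symbols were not part of the classroom handout.

```latex
\[
\ln(\mathrm{PPG}) = \beta_0 + \beta_1\,\mathrm{AVGREBS} + \varepsilon,
\qquad
H_0\colon \beta_1 = 0
\quad\text{versus}\quad
H_a\colon \beta_1 \neq 0 .
\]
```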

The F-statistic from the SPSS ANOVA summary table revealed that the relationship was in fact statistically significant at the .05 level of significance (F(1,17) equals 20.5, with p equal to .0002). Alternatively, we discussed that the t-statistic can be used to test the same relationship (t(17) equals 4.5, with p equal to .0002). This was also an appropriate moment to solidify the interpretation of the p-value, a concept that tends to remain fuzzy even for the most advanced learner.
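The two tests agree because, in simple linear regression with a single predictor, the F-statistic equals the square of the slope's t-statistic. A quick check using the rounded values reported above:

```latex
\[
F = t^{2}, \qquad t = 4.5 \;\Longrightarrow\; t^{2} = 20.25 \approx 20.5 = F(1, 17).
\]
```

The small gap is consistent with the t-statistic having been rounded to one decimal place.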

No one volunteered to interpret the p-value, although many students were able to state the decision to reject because the p-value is less than the alpha of .05. The instructor therefore offered the interpretation: the probability of obtaining a result at least as extreme as this one, if in fact there were no relationship between points per game and rebounds, is very small. We all agreed that we would reject the null hypothesis that there is no relationship between rebounds and number of points scored in the population. Instead, there is sufficient evidence that these two variables are linearly related, with a positive and somewhat strong relationship. Furthermore, the 95% confidence interval included on the output provided another means to demonstrate how interval estimates can be used with real data: We can be 95% confident that a player's average points per game will increase on average somewhere between 1.20 and 1.65 points for every rebound grabbed. Finally, the students were able to use the model for prediction (only over the range of the x-values) and estimated that a player who averages 7 rebounds per game might average 15.5 points per game.
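The prediction can be reproduced from the reported coefficients. The sketch below assumes the fitted model is ln(PPG) = .344 + .343(AVGREBS), as read from the equation given earlier; the constant and function names are chosen for this illustration. Because the dependent variable is on the log scale, the fitted value must be exponentiated to return to points per game. By the same arithmetic, exponentiating the slope gives a multiplicative factor per additional rebound, which appears to be the source of the 1.40 figure and the 1.20 to 1.65 interval quoted above.

```python
import math

# Coefficients reported in the text for ln(PPG) = INTERCEPT + SLOPE * AVGREBS.
SLOPE = 0.343
INTERCEPT = 0.344

def predict_ppg(avg_rebs):
    """Back-transform the fitted log-scale value to points per game."""
    return math.exp(INTERCEPT + SLOPE * avg_rebs)

# Factor by which predicted PPG is multiplied for each additional rebound.
factor = math.exp(SLOPE)

print(f"multiplicative change per rebound: {factor:.2f}")
print(f"predicted PPG at 7 rebounds: {predict_ppg(7):.1f}")
```

The result for 7 rebounds is close to the 15.5 points per game the students reported; any small difference reflects rounding of the coefficients.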

In reviewing other statistics presented on the website, we also noticed that a player's average rebounds per game were divided into two distinct types: offensive and defensive rebounds. Therefore, we considered another quick, interesting question: Does obtaining rebounds in the opponent's court (defensive rebounds) have a stronger relationship with scoring, or is hustling to obtain a rebound after a missed field goal in the Hawks' court (offensive rebounds) perhaps more important? The class was given a short time to work on this question in groups of 4, generate the output, and make a decision about which variable might be the better predictor. The results of the simple linear regression for each variable with average points per game indicated that both were important and statistically significant (defensive rebounds: t(18) equals 6.566, with p equal to .0001; offensive rebounds: t(18) equals 2.586, with p equal to .019), assuming all assumptions were met. In addition, the correlation matrix of the three variables indicated that the number of defensive rebounds a player snags was a better predictor for scoring points (r equals .84) than offensive rebounds (r equals .52).
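The group comparison amounts to computing each predictor's correlation with points per game and seeing which is larger in absolute value. A minimal sketch follows; the data are hypothetical and do not reproduce the Hawks values, and the names `pearson_r`, `predictors`, and `best` are chosen for this example.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values for illustration only; NOT the Hawks data.
ppg = [2.5, 4.0, 7.5, 7.0, 12.0, 13.5]
predictors = {
    "defensive": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "offensive": [0.5, 1.5, 0.4, 2.0, 1.0, 1.8],
}

# Correlation of each rebound type with points per game.
correlations = {name: pearson_r(values, ppg) for name, values in predictors.items()}

# The predictor with the largest absolute correlation explains the most variance
# in a simple (one-predictor) linear regression.
best = max(correlations, key=lambda name: abs(correlations[name]))

for name, r in correlations.items():
    print(f"{name}: r = {r:.2f}")
print("stronger single predictor:", best)
```

This comparison treats each predictor separately; it does not account for the correlation between offensive and defensive rebounds themselves.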

Finally, we discussed why we might have obtained the results we did and offered the following advice to the Hawks' coaches: The fact that defensive rebounds are important is no surprise; however, any athlete knows that obtaining offensive rebounds provides a better (and second) chance for a bucket. Perhaps this is where the Hawks need to improve. Therefore, one might speculate from our results that the Hawks are lacking in this area and should focus more on obtaining offensive and defensive rebounds in order to improve their winning ways!

We concluded our activities by re-emphasizing the limitations of our inferences about two variables. That is, a strong association between two variables is not adequate to draw conclusions about cause and effect. Furthermore, other variables, in addition to the ones we selected, almost certainly affect a player's average points per game and can produce spurious correlations (one student reported that one of the Hawks' star players was sidelined because of an injury; therefore, 'playing time' is also a factor). Whether or not to interpret the intercept, the effect of outliers, and missing data were other interesting topics that students encountered while working with their data.

Final Thoughts

Interesting data sets available on the internet can be used in the classroom not only to supplement instruction but also to motivate students to learn how methods are used in practice (students could also be required to collect data from websites outside of class time to allow more time for analyses and discussions in class). A number of other websites can also be considered to illustrate methods in practice that might interest students. Sports data can be collected from websites such as the NFL [2] and NHL [3], and interesting data from previous seasons might even be available from the athletic department at their school, college, or university. Another motivating website that lists statistics about music, movies, and artists on a weekly basis is the Billboard website at [4]. As a project, students might be asked to collect data for their favorite team or band and follow the analysis through from posing research questions, evaluating the tenability of assumptions, and hypothesis testing to making decisions and conclusions within the context of the scenario.

The purpose of this paper was to present a real-life example used in the classroom that illustrates introductory ideas about the simple linear regression model. Elementary and secondary teachers could also consider similar learning activities as a hands-on application of the Data Analysis and Probability Standard (NCTM, 2000). Using NBA statistics, concepts related to the regression model, such as scatterplots, correlation, hypothesis testing, assumptions, confidence intervals, and p-values were also discussed and illustrated within the research context. In our case, using real data created challenges (i.e., violation of assumptions) that students might not otherwise have encountered using a textbook.

As a teacher of introductory statistics, I was able to achieve many objectives in my classroom that day, by also incorporating how to apply and interpret other related concepts. At the same time, I also provided a fun activity for the students using data that students find interesting as well as integrating technology and the use of the internet into instruction. Hopefully, teaching activities of this sort can help students to begin to make the connection between learning in the classroom and applying the methods out in the real world.

References

Anderson, D. R., Sweeney, D. J., & Williams, T. A. (1994). Introduction to statistics: Concepts and applications (3rd ed.). West Publishing Company: Minneapolis/St. Paul.

Doerr, H. M., & English, L. D. (2003). A modeling perspective on students' mathematical reasoning about data. Journal for Research in Mathematics Education, 34(2), 110-136.

Franklin, C. (2000, October). Are our teachers prepared to provide instruction in statistics at the K-12 levels? Dialogues, 10. Retrieved October 10, 2002 from [5]

Gal, I., Ginsburg, L., & Schau, C. (1997). Monitoring attitudes and beliefs in statistics education. In I. Gal & J.B. Garfield (Eds.), The assessment challenge in statistics education (pp. 37-51). Netherlands: IOS Press.

Groth, R. E., & Powell, N.N. (2004). Using research projects to help develop high school students' statistical thinking. Mathematics Teacher, 97(2), 106-109.

Hill, H. C., & Ball, D. L. (2004). Learning mathematics for teaching: Results from California's mathematics professional development institutes. Journal for Research in Mathematics Education, 35(5), 330-351.

National Council of Teachers of Mathematics (2000). Principles and standards for school mathematics. NCTM: Reston, VA.

Endnotes

[1] www.nba.com

[2] www.NFL.com

[3] www.NHL.com

[4] www.billboard.com

[5] http://www.nctm.org/dialogues/2000-10/areyour.htm

Jamie D. Mills, University of Alabama

Jamie Mills, Ph.D., is an Assistant Professor who teaches hybrid statistical methods courses in the College of Education.
COPYRIGHT 2005 Rapid Intellect Group, Inc.