P-Hacking Made Easy

Using a public data set of Major League Baseball salaries and on-field statistics, this paper runs regressions for all possible combinations of the selected control variables to generate statistically significant but spurious results, a practice known as "p-hacking." This overt, deliberate, and systematic p-hacking leads to many counterintuitive results that can help students think carefully about variable selection, causality, and parsimony. In addition, this paper provides an R script that students can easily modify to fit data sets of their choosing.


Introduction
Every semester hundreds of professors teaching Introductory Econometrics inveigh against the manipulation of variables and functional form to achieve statistical significance. Similarly, textbooks and peer-reviewed articles warn against a practice that is both pernicious and ubiquitous. For example, Imbens (2021) writes: "To put it bluntly, researchers are incentivized to find p-values below 0.05. This has led to concerns about researchers searching for specifications (whether consciously or unconsciously) that lead to such p-values in ways that invalidate the meaning and interpretation of those p-values. This has become known as p-hacking."[1] While there has been some progress (Brodeur, et al., 2020), multiple academic fields have suffered replication crises in recent years, suggesting that "unsatisfactory" results lurk behind far too many published findings (Loken & Gelman, 2017). The final results may appear sound, but only the authors know how many alternative specifications they considered and rejected before submitting their findings to peer review.
Perhaps a different approach could help, motivated by Angrist and Pischke's (2017) observation that "econometrics is better taught by example than abstraction." Instead of railing against p-hacking, we might embrace it in order to expose the results of overt, deliberate, and systematic p-hacking. By highlighting regression's potential shortcomings early in a student's academic career, we can instill a healthy and lifelong skepticism of significant results. This paper does not attempt to provide a conclusive analysis of baseball salaries; instead, it highlights the many ways such an attempt could fail. We do so by constructing a data set of 400 Major League Baseball (MLB) players' 2016 salaries and 16 measures of their on-field performance over the 2015 season, and then regressing salary on all combinations of those 16 variables.
We record and summarize the coefficients and t-statistics, showing the prevalence of statistically significant yet spurious results.
The remainder of this paper proceeds as follows: The Data section introduces a public data set of professional baseball salaries and performance statistics, noting both its strengths and shortcomings. The Results section summarizes the findings from regressing salary on every available combination of control variables, highlighting the results that defy basic intuition. Having demonstrated the prevalence of the problem, the How to Select Independent Variables section suggests ways credible econometricians should think about selecting control variables. Finally, this paper provides the R Replication Script used to generate the results so that students can recreate this exercise on datasets of their choice.

Data
The R package Lahman (Friendly, et al., 2022) provided the raw data for this paper. The code used to obtain the data set used in the regressions appears in Part 1 of the R Replication Script section below. This captures the 2016 salary and 2015 on-field performance of 400 MLB players.[2] The independent variables are: At Bats (AB), Runs (R), Hits (H), Doubles (X2B), Triples (X3B), Home Runs (HR), Runs Batted In (RBI), Stolen Bases (SB), Base on Balls (BB, i.e. "walks"), Strikeouts (SO), Intentional Walks (IBB), Grounded into Double Plays (GIDP), Games Started (GS), Putouts (PO),[3] Assists (A), and Errors (E). For those unfamiliar with the game, we expect the coefficients in these regressions to all be positive except those for SO, GIDP, and E.

[1] For more examples, see McCloskey and Ziliak (1996) or Head, et al. (2015).
[2] The data used in this paper excludes pitchers, as their salaries depend on different performance measures than field position players.
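As a rough illustration of how such a data set can be assembled, the sketch below merges 2015 batting and fielding totals with 2016 salaries from the Lahman package. This is illustrative only, not the paper's actual Part 1 script; the table and column names (Batting, Fielding, Salaries, and the statistic abbreviations) follow Lahman's documented conventions, and the dplyr pipeline is an assumption of ours.

```r
library(Lahman)
library(dplyr)

# 2015 batting totals, summed across stints/teams for each player
bat <- Batting %>%
  filter(yearID == 2015) %>%
  group_by(playerID) %>%
  summarise(across(c(AB, R, H, X2B, X3B, HR, RBI, SB, BB, SO, IBB, GIDP), sum))

# 2015 fielding totals (GS, putouts, assists, errors)
fld <- Fielding %>%
  filter(yearID == 2015) %>%
  group_by(playerID) %>%
  summarise(across(c(GS, PO, A, E), sum))

# 2016 salaries (one row per player)
sal <- Salaries %>%
  filter(yearID == 2016) %>%
  group_by(playerID) %>%
  summarise(salary = max(salary))

mlb <- bat %>%
  inner_join(fld, by = "playerID") %>%
  inner_join(sal, by = "playerID")
```

A student adapting this to another data set would only need to swap the source tables and the column list passed to `across()`.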
The most salient feature of the dependent variable is its right-skew. Salaries ranged from $507,500 to $28,000,000 with a median of $2,125,000 and a mean of $5,073,769. For a visualization, see the histogram in Figure 1, below.
Almost every empirical analysis of observational data suffers from some omitted variable bias. This data set is no exception. There are three notable missing variables: experience, position, and contract status. A promising young player in 2015 may have earned a 2016 contract that assumed his continued improvement, whereas an older player "past his prime" might sign a contract reflecting an anticipated downward trajectory.[4] Some positions earn more than others, so different intercepts for positions would improve model fit (Magel and Hoffman, 2015).[5] The direction of omitted variable bias from contract status is hard to interpret, but the relative weights of guarantees and incentives will impact the model's fit for better or worse.[6] In light of these missing variables and the wide dispersion of salaries, we should not be too surprised if our regressions fail to explain much of the variance. Students with a particular interest in baseball may use the Lahman package to explore a wider space of independent variables than those used in this paper.
Despite those shortcomings, this dataset is more suitable for a simple regression than most real-world data for three reasons. First, we have little to no concern that measurement error will bias our coefficients downward. Economists have trouble measuring common variables such as income and GDP (Feldstein, 2017), whereas a player's number of strikeouts in a season suffers no such ambiguity. Second, using the set of all MLB players eliminates selection bias, a problem that confounds attempts to measure the marginal effect of things like education (Winship and Mare, 1992) or union membership (Heckman, 1990). Third, we know the true effect of every variable, making it obvious when we have the "wrong" sign for baseball statistics; for many economic questions that will not be so clear. For example, an increase in wages has an ambiguous effect on an individual worker's labor supply: the income effect tends to increase the demand for leisure, while the substitution effect increases the opportunity cost of leisure and incentivizes more hours worked. For this topic and many others, a poorly reasoned regression can yield incorrect results without the benefit of knowing the "right" sign for a coefficient.

Results
First, we consider the univariate regressions shown in Table 1, below. The most conspicuous result is the finding that RBIs alone can account for over 27 percent of the observed variation in salaries. This overstates the importance of RBIs, as they are correlated with games played (and several other variables).[7] But the issue of omitted variable bias becomes even more obvious when we consider strikeouts. Taken at face value, a univariate regression implies a marginal raise of over $51,000 per strikeout. Moreover, this effect is highly significant, with a t-statistic over 7.[8]

[5] However, as shown in Part 1 of the R script below, many players recorded statistics at multiple positions, making the designation of, say, "shortstop" more complicated than it might appear.
[6] See https://www.mlb.com/glossary/transactions/guaranteed-contract and https://www.mlb.com/glossary/transactions/incentive-clause, respectively.
[7] "A batter is credited with an RBI in most cases where the result of his plate appearance is a run being scored." See https://www.mlb.com/glossary/standard-stats/runs-batted-in.
[8] Errors also display the "wrong" sign in Table 1, albeit without a significant t-statistic.

So, we need to control for other factors, but which ones? Suppose we had four potential independent variables and wanted to understand the range of possible marginal effects and significance for the variable of interest, var1. The R function crossing() generates a matrix of 1s and 0s indicating every available combination, with 1 and 0 corresponding to "included" and "excluded," respectively.[9] For the four-variable example, our matrix is:

var1 var2 var3 var4
   1    0    0    0
   1    0    0    1
   1    0    1    0
   1    0    1    1
   1    1    0    0
   1    1    0    1
   1    1    1    0
   1    1    1    1

[9] This function is part of the tidyr package. See https://tidyr.tidyverse.org/ for details.
Note that var1 is always "on," so it appears in every regression. The first row corresponds to a univariate regression, while the last represents the regression on all possible variables. The inner loop in the R script below runs a regression for every row in the 0/1 matrix. First, it multiplies each column by the corresponding 1 or 0. If a column sums to zero, it replaces the 0s with NA, allowing the regression function lm() to skip that variable. The outer loop moves each variable into the first column in turn and reruns the inner loop: first with var1 always "on," then var2 always "on," and so on. This lets R always look for the coefficient and t-statistic in the first independent variable's position. For our baseball data with 16 independent variables, each variable has 2^15 = 32,768 possible sets of controls. Hence the R script below runs 32,768 different regressions (inner loop) for each of the 16 independent variables of interest (outer loop), for a total of 16 × 2^15 = 524,288 regressions. Note that some regressions run more than once; for example, every outer loop includes the regression on all possible independent variables.
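The loop just described can be condensed into a short sketch. For illustration we use four hypothetical variables on simulated data, and build each formula with reformulate() rather than the replication script's NA-masking trick; the mechanics of enumerating specifications are the same.

```r
library(tidyr)

vars <- c("var1", "var2", "var3", "var4")

# Every combination of controls, with the variable of interest forced "on"
grid <- crossing(var1 = 1, var2 = 0:1, var3 = 0:1, var4 = 0:1)

# Toy data so the sketch runs end to end (var1 truly matters, the rest do not)
set.seed(1)
df <- as.data.frame(matrix(rnorm(100 * 4), ncol = 4))
names(df) <- vars
df$y <- 2 * df$var1 + rnorm(100)

# One regression per row of the 0/1 matrix; record var1's coefficient
# and t-statistic from every specification
results <- apply(grid, 1, function(row) {
  included <- vars[row == 1]
  fit <- lm(reformulate(included, response = "y"), data = df)
  summary(fit)$coefficients["var1", c("Estimate", "t value")]
})
```

With four variables the grid has 2^3 = 8 rows, so `results` holds var1's estimate and t-statistic across eight specifications; repeating the outer step for each variable in turn reproduces the full combinatorial exercise.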
We conservatively defined "significant" t-statistics as those greater than 2 in magnitude; using the traditional 5 percent significance level (i.e. ±1.96) would produce even more striking results. Table 2, below, summarizes the results numerically. The most striking entry is Games Started (GS), which frequently enters with a negative and significant coefficient. It would be difficult to conceive of a less intuitive finding. However, simply starting a game does a team no good; players earn their salary via the actions captured by the other independent variables. A player called up from the minors mid-season who hits 35 home runs in 80 games can expect to make much more than a player taking 160 games to hit the same benchmark.
We also observe many regressions with "incorrect" yet significant signs for strikeouts and hits; 171 and 31, respectively. While uncommon, these spurious results occur often enough that a scholar could present a vacuous set of robustness checks supporting multiple significant, but wrong, results. Students and journal referees alike should think critically about robustness checks that simply add additional controls; Table 2 shows that an article could report that hits decrease salary, then display 30 alternative equations supporting that claim.
The results for doubles (X2B) and triples (X3B) are harder to explain.[10] Both are relatively rare: the median player's doubles and triples in the data set are 17 and 1, respectively. 2,429 regressions return a negative and significant value for doubles, and 32,683 such regressions exist for triples. The lesson here is that students should hesitate before making an inference from a sparse phenomenon.
Before moving on, consider that none of the results shown relied on data transformation. We did not take the logarithm of salary or any control variable, nor did we use quadratics or interaction terms. To test a hypothesis that going from 10 to 20 home runs affects salary differently than going from 40 to 50, quadratics like HR and HR^2 could offer a simple answer. Economists often apply the natural logarithm to skewed data, such as the salaries featured in this paper, to make inferences more reliable.[11] Still, researchers should recall Keene's (1995) admonition to pharmaceutical researchers: "It is clear that an industry statistician should not analyse the data using a number of transformations and pick the most favourable to the company." A responsible econometrician considers data transformations before running any regressions. In other words, the first question to ask is "What is the question?"

[10] Also known as "extra-base hits," doubles and triples are both components of "hits," which is the sum of singles, doubles, triples, and home runs. See https://www.mlb.com/glossary/standard-stats/hit.
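The log transformation mentioned above can be illustrated with simulated data (the numbers below are assumed for illustration, not the paper's salaries). When salaries are generated multiplicatively, the levels regression faces heavily skewed residuals while the log regression recovers a clean semi-elasticity.

```r
set.seed(42)
n  <- 400
HR <- rpois(n, 12)                               # toy home-run counts
salary <- exp(13 + 0.05 * HR + rnorm(n, 0, 0.8)) # right-skewed by construction

level_fit <- lm(salary ~ HR)       # levels: skewed, heteroskedastic residuals
log_fit   <- lm(log(salary) ~ HR)  # logs: slope near the true 0.05
```

In the log model a coefficient b on HR is read as roughly a 100*b percent change in salary per additional home run, which is usually the more natural way to discuss pay.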

How to Select Independent Variables
Social scientists should have a clear justification of variable choice in mind before typing a single line of code. Four scenarios below illustrate how the end goal should guide the empirical strategy.
First, suppose a player's agent wanted to argue that her client was underpaid relative to his performance. In that case, she might need the most accurate forecast possible and thus include every available variable. This maximizes R^2 but at the cost of losing inference regarding marginal effects. There is no sensible way to interpret the effect of one additional hit holding strikeouts, walks, home runs, doubles, and triples all constant. This sort of trade-off is inevitable in econometrics.[12]
Second, to discern causal effects we need to consider mediators and moderators. Suppose a young player wants to maximize his career earnings; should he focus more on raising his RBIs or his home runs? Certain control variables are obvious candidates for inclusion: one could argue that both RBIs and home runs at the plate are unaffected by errors in the field. But an increase in RBIs would almost certainly require an increase in hits. Similarly, swinging for the fences usually leads to more strikeouts. If we included these in our regression, they might diminish the effect of RBIs or home runs, causing us to underestimate their effect on salary. Pearl and Mackenzie (2018) offer a great starting point for thinking about this issue at a level accessible to undergraduates.
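The mediator problem in the second scenario can be made concrete with a toy simulation (all coefficients below are assumed for illustration). Here a variable of interest raises salary partly through a mediator; conditioning on the mediator recovers only the direct effect and understates the total effect.

```r
set.seed(3)
n  <- 1000
HR <- rnorm(n)
M  <- 0.8 * HR + rnorm(n)                 # mediator caused by HR (e.g., the hits channel)
salary <- 1.0 * HR + 0.5 * M + rnorm(n)   # total effect of HR = 1.0 + 0.5 * 0.8 = 1.4

total  <- coef(lm(salary ~ HR))["HR"]      # ~1.4: total (causal) effect
direct <- coef(lm(salary ~ HR + M))["HR"]  # ~1.0: direct effect only
```

For a player deciding where to focus effort, the total effect is the relevant quantity, so "controlling for everything" is exactly the wrong strategy here.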
Third, every variable discussed so far has been a discrete count variable, making every regression a "component model." However, ratios may convey as much information as their components, if not more. For baseball, batting average and fielding percentage are ubiquitous measures of player performance.[13] The application of regression to finance can involve a range of ratios, broadly classified into liquidity, leverage, efficiency, profitability, and market value ratios. Ratios are also common in healthcare, covering everything from blood pressure to body mass index. The possibilities are vast, but the same pitfalls apply to the use of ratios as to their components. Firebaugh and Gibbs (1985) wrote an introduction to this subject that most students should be able to read and understand during their first semester of econometrics.
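Constructing the two baseball ratios named above from their count components is a one-liner each; the toy data frame below is hypothetical, but the formulas (H/AB for batting average, (PO+A)/(PO+A+E) for fielding percentage) are the standard definitions.

```r
# Hypothetical players: a first baseman and a middle infielder
players <- data.frame(H  = c(150, 90),  AB = c(500, 360),
                      PO = c(800, 200), A  = c(50, 300), E = c(5, 12))

players$BA <- with(players, H / AB)                    # batting average
players$FP <- with(players, (PO + A) / (PO + A + E))   # fielding percentage
```

Students could then rerun the combinatorial exercise with BA and FP in place of (or alongside) their components and compare the resulting coefficient distributions.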
Lastly, Table 2 hints at an intuitive (but very inefficient) method of selecting a parsimonious model that lowers the risk of over-fitting. If we regress salary on the eight variables with a median t-statistic greater than 2 in absolute value, we obtain an R^2 of 0.374, which compares favorably with the R^2 of 0.397 when we regress on all 16 independent variables. This could be a useful starting point for introducing Least Absolute Shrinkage and Selection Operator (Lasso) regression. Whereas traditional regressions minimize the mean squared error, a Lasso regression also penalizes the magnitude of the coefficients. As Géron (2019) notes in an introduction to the subject, "An important feature of Lasso regression is that it tends to eliminate the weights of the least important features."

[11] West (2022) is an accessible introduction to log transformation.
[12] Moreover, blindly maximizing R^2 often leads to a model that overfits the data and fails to generalize. R^2 can convey useful information, but no single summary statistic should be considered in isolation. As Ziliak and McCloskey (2008) put it, "Fit is not the same thing as importance. Statistical significance is not the same thing as scientific finding."
[13] These ratios are H/AB and (PO+A)/(PO+A+E), respectively.
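The Lasso idea above can be sketched in a few lines. The glmnet package is our assumption here (the paper does not name an implementation), and the data are simulated so the example is self-contained: sixteen candidate regressors, only two of which truly matter.

```r
library(glmnet)

set.seed(7)
n <- 400; p <- 16
X <- matrix(rnorm(n * p), n, p)
y <- 3 * X[, 1] + 2 * X[, 2] + rnorm(n)  # only the first two variables matter

# alpha = 1 selects the L1 (Lasso) penalty; cv.glmnet picks lambda by
# cross-validation
cvfit <- cv.glmnet(X, y, alpha = 1)

# At the conservative lambda.1se, most noise coefficients shrink to
# exactly zero
coef(cvfit, s = "lambda.1se")
```

Because the L1 penalty zeroes out weak coefficients entirely, the Lasso automates the variable-screening step that the median-t-statistic heuristic performs by hand.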