I am trying to eliminate the undergraduate curve at Georgetown's business school. Here is a draft of a document I'm preparing to make that case. It should be of interest to faculty or anyone interested in issues of perverse incentives.
A Proposal for the
Elimination of the Grading Curve in the Undergraduate Program
or
Exemption from the Curve for Teach-to-Mastery Methods
Submitted by:
Jason Brennan
Flanagan Family Professor, SEEPP
A. Executive Summary
This document forwards two proposals. The first proposal is to eliminate the undergraduate curve entirely. The second, conditional on the failure to pass the first, is to allow undergraduate faculty to opt out of the curve at will provided they teach to mastery.
The curve disadvantages our students in graduate admissions and job searches in several ways: it artificially makes them appear worse than students from comparable universities; it decreases collaboration and increases competition among students; it increases the role of luck in determining grades; it often makes grades depend on unreliable and statistically insignificant small differences in absolute scores; it compounds equity issues, especially in the first year, arising from differences in the strength of high school curricula; and it incentivizes professors either to make classes too hard or to refrain from teaching everyone to full mastery of the material in order to get a distribution of skill levels.
There is no strong evidence of grade inflation, and even if there were, solving the problem cannot be done unilaterally. Further, most of the purported reasons to adopt a grading curve can be satisfied more effectively without a curve.
B. The Two Proposals
This document forwards two proposals, the second of which is conditional upon the failure of the first to be approved.
Proposal 1: Effective immediately, grading curves are eliminated in the undergraduate business curriculum. Undergraduate business faculty will receive the same academic freedom to assign grades as their peers throughout Georgetown University and their peers at most other colleges and universities within the United States.
Proposal 2: If proposal 1 fails and a grading curve remains as the default for undergraduate courses, faculty may, at will, and without requiring prior approval from their peers, opt out of the curve provided they follow a teach-to-mastery method.
C. Benchmarking: Who Curves?
Mandatory grading curves are unusual in most fields of study in the US.
Nearly all law schools impose grading curves, such that letter grades serve as an approximation of or shorthand for class rank, though there is a conceptual problem (discussed below) about the coherence of averaging grades across courses.
Beyond that, the overwhelming majority of universities and individual academic programs in nearly all undergraduate majors do not have a mandatory or externally imposed curve. Individual faculty retain academic freedom to distribute grades largely as they see fit, following their judgment about appropriate norms.
Table 1 below lists the undergraduate business schools ranked among the top 20 by US News and World Report and indicates whether they have a mandatory or suggested curve.
TABLE 1: Mandatory Curves at the Top 20 Undergraduate Business Schools
2021 US News Rank | School | Curve? |
1 | Pennsylvania | No |
2 | MIT | No |
3 | Berkeley | Fixed 3.45 for core; 3.5 for electives, 3.65 for low enrollment electives[1] |
3 | Michigan | No |
5 | New York University | In classes of 25 or greater, no more than 35% of students receive A or A-.[2] |
5 | Texas | No |
7 | Carnegie Mellon | No |
7 | Cornell | No |
7 | North Carolina | No |
7 | Virginia | No |
11 | Indiana | No |
12 | Emory | Recommended distribution[3] |
12 | Notre Dame | Variable mean dependent on year and department[4] |
12 | Southern California | No (recently eliminated) |
12 | Washington, St. Louis | No |
16 | Georgetown | Fixed mean cannot exceed 3.5 except in FYS courses |
16 | Ohio State | No |
16 | Wisconsin | 3.0 in select courses, 3.3 and no more than 30% As in others.[5] |
19 | Georgia Tech | No |
19 | Illinois | No |
19 | Maryland | No |
19 | Minnesota | No |
19 | Washington | No |
The upshot: Among the 23 universities ranked within the “top 20,” five have mandatory curves and one has a suggested curve, which appears to be lightly enforced.
D. What Do Grades Signify? (Skip If Necessary)
While nearly all colleges and universities within the United States use the A, B, C, D, F, +/- grading system, there is no universal definition of what these grades signify. Indeed, Guy Montrose Whipple noted this point in the early 1900s:
When we consider the practically universal use in all educational institutions of a system of marks, whether numbers or letters, to indicate scholastic attainment of the pupils or students in these institutions, and when we remember how very great stress is laid by teachers and pupils alike upon these marks as real measures or indicators of attainment, we can but be astonished at the blind faith that has been felt in the reliability of the marking system. School administrators have been using with confidence an absolutely uncalibrated instrument…
What we need to know is: What are the traits, qualities or capacities we are actually trying to measure in our marking systems? How are these capacities distributed in the body of pupils or students? What method ought we to follow in measuring these capacities? What faults appear in the marking systems that we are now using, and how can these be avoided or minimized?[6]
American universities and colleges have in effect landed on a common set of symbols for grades without agreeing on what these symbols mean or signify, except that in some way A > B > C, etc.
To illustrate, even within one university, grades in different classes might in various professors’ own minds signify any of the following:
1. Grades as rankings:
a. A letter grade ranks a student against other students in the same section of the same class that semester.
b. A letter grade ranks a student against all other students in any section of a given class in a semester.
c. Further expansion: We could in principle expand the ranking set outward to go across professors, years, or even universities. At the limit, a letter grade in introductory microeconomics could rank a student against all other students who have ever taken or ever will take that class at any university anywhere.
2. Grades as qualitative evaluations:
a. A letter grade reports a qualitative description of how well a student mastered material according to the professor’s absolute standards, though different professors might have different standards.
b. A letter grade reports a qualitative description of how well a student mastered material according to the university’s absolute standards, consistent among all professors, but the standards might vary from university to university.
c. A letter grade reports a qualitative description of how well a student mastered a given set of material according to what is meant to be a universal absolute standard, e.g., such that a B in ECON 101 at Boise State = a B in ECON 101 at Cornell.
3. Grades as quantitative scores/percentages:
a. A letter grade reports what percent of questions and problems a student got correct, according to the professor’s standards, but the standards might vary from professor to professor.
b. A letter grade reports what percent of questions and problems a student mastered according to the university’s internal standards, consistent among all professors, though the standards might vary from university to university.
c. A letter grade reports what percent of questions and problems a student mastered, according to what is meant to be a universal standard, e.g., such that a B in ECON 101 at Boise State = a B in ECON 101 at Princeton.
In fact, a typical undergraduate business student at Georgetown has grades that were assigned on many of these different conceptions of grades. A FINC 101 grade might signify her ranking against all other students that semester. Her FREN 101 grade might signify the professor’s qualitative evaluation based on her professor’s judgment of what constitute universal standards for introductory French at the college level. Her CHEM 101 grade might be a shorthand for the total percentage of problems she got right on a set of multiple-choice exams with somewhat arbitrarily assigned weights to questions.
Further, grades distributed at different universities might have different values. ECON 101 at Dartmouth might have higher standards than ECON 101 at Keene State University.
The upshot: While everyone recognizes that all things equal, an A is better than a B, in fact, grades do not have universal significance or meaning.
E. Incommensurability of Grades (Skip If Necessary)
It is common for universities to average grades among classes and for instructors to average grades within classes. However, whether it is coherent to do so depends upon the meaning of the grades in question.
For instance, rankings cannot generally be averaged because they represent ordinal numbers on incommensurable scales. Compare:
1. If MSB ranks #16 among undergraduate programs and #21 among MBA programs, it is not accurate to say that we rank #18.5 overall. This presumes, incorrectly, both that the two rankings are equally weighty and that distances between spots on the rankings fall along a constant scale.
2. If Christine ranks 1/50 in STRT 230 and 45/50 in ACCT 101, it is similarly not accurate to say that she ranks 23/50 overall in these two courses. Again, such averaging presumes both that the two rankings are equally weighty and that the distances between spots on the rankings fall along a constant scale. Accordingly, at MSB, when we average her grades in these two courses for GPA purposes, we pretend that we can average incommensurable ordinal rankings. We insert information that was not present in the original grade and ignore information that was present. It is mathematically incoherent.
This problem is compounded when we start averaging grades across classes in which the grades have very different meanings. To illustrate, suppose Christine has five courses. In some courses, grades represent a qualitative judgment, in some a ranking, and in others a percentage of somewhat arbitrarily weighted test problems solved correctly. Her GPA calculation for the semester in effect looks like this:
GPA = [Good + Excellent + 100(249/320) + ranked 1/45 + ranked 19/38] ÷ 5
= [3.00 + 4.00 + 2.66 + 4.00 + 3.33] ÷ 5
= 3.398
The second step involves inserting information that was not present in the first step, ignoring information that was present, and pretending that incommensurate meanings are in fact commensurate cardinal numbers.
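The arithmetic above can be sketched in code to show exactly how much information the averaging step discards (a hypothetical illustration: the course labels are invented, and the letter-to-points mapping follows the numbers used in the example above):

```python
# Hypothetical illustration of averaging incommensurable grades.
# Each course's grade has a different underlying meaning, yet GPA
# arithmetic treats all of them as commensurate cardinal numbers.

POINTS = {"A": 4.00, "B+": 3.33, "B": 3.00, "B-": 2.66}

# Christine's five courses: (course, what the grade actually meant, letter assigned)
record = [
    ("course 1", "qualitative judgment: Good",           "B"),
    ("course 2", "qualitative judgment: Excellent",      "A"),
    ("course 3", "249/320 weighted test points (~78%)",  "B-"),
    ("course 4", "ranked 1st of 45 (curved)",            "A"),
    ("course 5", "ranked 19th of 38 (curved)",           "B+"),
]

# The averaging step: five incommensurate meanings collapse to one number.
gpa = sum(POINTS[letter] for _, _, letter in record) / len(record)
print(f"GPA = {gpa:.3f}")  # prints "GPA = 3.398"; the meanings do not survive
```

Note that nothing in the final number records that two of the grades were ordinal ranks and one was a test-point percentage; a transcript reader cannot recover any of that from the 3.398.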
Despite these conceptual problems, GPA might have external validity. Perhaps, all things equal, those with higher GPAs might perform better in the future, earn higher incomes, and so on. However, the evidence of such external validity is limited. One major paper claims the relationship between GPA and future job success is weak.[7] In contrast, Frank Schmidt and John Hunter, in a comprehensive meta-analysis of previous papers, claim that the correlation between college GPA and future job success is 0.34.[8] There is strong evidence that higher high school GPAs predict higher incomes, though this results almost entirely from those with higher high school GPAs being more likely to graduate from college and receive the college wage premium.[9]
The upshot: In fact, grades between classes (and even individual grades within classes) cannot meaningfully be averaged. GPA might nevertheless have some external validity, though the evidence is surprisingly weak.
F. The Social Meaning of Grades Is a Collective Action Problem
We use grades for both internal and external purposes.
Internally, we use them to rank students against each other. Grades represent a short-hand description of a student’s approximate demonstrated skill compared to others in their year in the program. Even this description is too generous, as the problem of incommensurability means that two students with a 3.5 GPA in MSB courses might be radically different in terms of ability and absolute performance. Nevertheless, at least internally, everyone understands the grades are meant to represent an approximate relative ranking against other students at MSB.
Externally, though, what grades mean is not up to us. We do not get to decide how employers and graduate admissions committees interpret our grades. Instead, potential employers and graduate admissions programs will tend to presume our grades signify more common qualitative assessments. Many potential employers or graduate admissions programs are and will remain unaware that we use grades to approximate our internal rankings, even if we attempt to inform them of such through notes on transcripts or accompanying letters.
Further, we can expect that employers and graduate admissions programs make subjective judgments about how to compare equivalent letter grades across universities. For instance, a GPA of 3.6 from MIT might be viewed as more impressive than a GPA of 3.8 from UMass Boston. We can expect that employers and others will “weigh” our students’ GPAs based on their stereotypes of the quality of our students and the rigor of programs. Accordingly, employers might look more favorably on a 3.5 undergraduate GPA from Georgetown’s MSB than a 3.5 GPA from ASU’s Carey. However, unless it is widely known that MSB students’ grades are curved—and it isn’t—then we nevertheless harm our students when they are compared to students from other universities which employers and admissions officers stereotype as being equivalent in rigor and quality.
Perhaps, thanks to our grading curve, a 3.5 GPA at MSB should be seen as better than a 3.5 GPA at USC’s Marshall, but that doesn’t mean it is. If we have a curve and they do not, we disadvantage our students. We create the appearance that our students are worse though they are not.
Some employers and graduate admissions programs require a minimum GPA for consideration; they do not allow students with lower GPAs to apply, period.
It would perhaps be best if all undergraduate business schools coordinated on a common set of standards in grading, just as law schools have coordinated in using grades to represent rankings (ignoring the problem of incommensurability mentioned above). But they have not.
Upshot: When students apply for jobs or graduate admissions, it doesn’t matter what our internal philosophy of grades is or what we take grades to signify. What matters is how others interpret those grades. They are generally unaware of our curve. We thus artificially create the appearance that our students are worse than comparable students at other universities. We do not get to decide how others interpret our grades. Our intentions may be good, but that does not mean they produce good results. We might believe internally that our curve increases the rigor of our program; externally, it might nevertheless signify our students are worse than others.
G. Grades as a Zero-Sum Game
A curve makes grades competitive. What grade a student receives depends not on how well she does, but on how well she does compared to others. The exact details will vary from class to class, but in general for one student to improve her grade over the semester requires corresponding losses to others’ grades. Instead of “People for others,” our grading motto is in effect, “My gain is your loss”.
Accordingly, a curve creates perverse incentives to avoid collaborating with or helping others not in one’s own group or even to sabotage others in different groups.
As we teach in many of our courses, competition is often a good thing. We want companies to exist in a competitive market. In some cases, the zero-sum grading curve allows us to better simulate competitive markets in class. Nevertheless, it is unclear whether the benefits outweigh the costs for classroom learning.
Our university advertises itself as promoting a collaborative environment in which students take an interest in each other’s welfare. A zero-sum grading system is in tension with these professed values.
Upshot: The curve creates a zero-sum grade distribution and corresponding competitive mentality among our students. This has advantages and disadvantages.
H. Luck
One might think that as much as possible, grades should be based on a student’s skill, talent, and performance, not luck. However, if grades are determined by relative ranking, then by necessity a student’s grades depend heavily upon luck. If a student is in an unusually talented or driven class, her grades will be lower than if she were in an unusually untalented or lazy class, though her absolute performance remains the same.
Upshot: A curve increases the significance of luck in determining grades.
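The mechanism can be made concrete with a small sketch. The cohort scores and the grade cutoffs below are invented for illustration, not MSB’s actual curve; the point is only that an identical absolute score maps to different curved grades depending on who else happens to be enrolled:

```python
# Hypothetical sketch: the same absolute score earns different curved
# grades depending on cohort strength. All numbers are invented.

def curved_grade(score, cohort, cutoffs=(0.35, 0.70)):
    """Grade by percentile rank within the cohort: top 35% get an A,
    the next 35% a B, and the rest a C (an illustrative curve only)."""
    rank = sum(s < score for s in cohort) / len(cohort)  # fraction of peers beaten
    if rank >= 1 - cutoffs[0]:
        return "A"
    if rank >= 1 - cutoffs[1]:
        return "B"
    return "C"

strong_cohort = [88, 90, 91, 92, 93, 94, 95, 96, 97, 98]
weak_cohort   = [70, 72, 74, 75, 78, 80, 81, 83, 85, 91]

print(curved_grade(91, strong_cohort))  # prints C: a 91 beats few strong peers
print(curved_grade(91, weak_cohort))    # prints A: the same 91 beats most weak peers
```

The student’s performance (91) is identical in both runs; only the luck of cohort assignment differs.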
I. Score Compression, Grade Uncertainty, Arbitrariness, and Faculty Sneakiness
Some faculty distribute scores during the semester with the goal of ensuring students know roughly where they stand. Students’ final grades are not a surprise.
However, in many classes, scores are compressed within a narrow range. Final averages might cluster between, say, 90-94. A professor will then need to impose a curve, such that a 90 becomes a B and a 94 an A.
This is problematic for a number of reasons. For one, students face anxiety from uncertainty. They cannot easily anticipate what their in-class scores mean. Second, and perhaps more damningly, this means that final grades are effectively arbitrary. Small point differences inside a course are rarely significant or robust indicators of genuine differences in effort, ability, or performance. Empirical work on grading indeed finds that professors routinely score the same essay or assignment very differently depending on how the professor feels, whether the professor has recently eaten, or other arbitrary factors.
Jeffrey Schinske and Kimberly Tanner summarize the extant empirical literature on this last point:
[Educational psychologist W.C.] Eells investigated the consistency of individual teachers’ grading by asking 61 teachers to grade the same history and geography papers twice—the second time 11 wks [sic] after the first. He concluded that “variability of grading is about as great in the same individual as in groups of different individuals” and that, after analysis of reliability coefficients, assignment of scores amounted to “little better than sheer guesses”. Similar problems in marking reliability have been observed in higher education environments, although the degree of reliability varies dramatically, likely due to differences in instructor training, assessment type, grading system, and specific topic assessed. Factors that occasionally influence an instructor's scoring of written work include the penmanship of the author, sex of the author, ethnicity of the author, level of experience of the instructor, order in which the papers are reviewed, and even the attractiveness of the author.[10]
Professors tend to hate student “grade grubbing”. The curve creates a perverse incentive for faculty to score students very closely together, creating for many of them the illusion that they will receive high marks. Faculty can then assign lower grades to many of these students after student evaluations have been written and when the faculty no longer expect to see the students face-to-face.
Consider: If you were reading a paper submitted to a journal which relied upon this kind of data, you would probably laugh as you recommended rejection.
Upshot: Score compression is a problem in many courses. Combined with our curve, this leads to effectively arbitrary grade distributions based on small, unreliable differences in assigned scores.
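To see how compression and a forced curve interact, consider a hypothetical sketch (the scores, the 3.5 target mean, and the two-grade simplification are all invented for illustration):

```python
# Hypothetical sketch: final averages compressed into 90-94, then forced
# under a fixed-mean curve. A sub-point difference, well within ordinary
# grading noise, separates an A from a B. All scores are invented.

scores = [90.2, 90.8, 91.1, 91.5, 92.0, 92.4, 92.9, 93.3, 93.8, 94.0]

# To hit a mandated 3.5 mean using only A (4.0) and B (3.0) grades,
# exactly half the class must receive each grade: split at the median.
median = sorted(scores)[len(scores) // 2]
grades = {s: ("A" if s >= median else "B") for s in scores}

print(grades[92.0], grades[92.4])  # prints "B A": a 0.4-point gap decides the letter
```

Given the unreliability of scoring documented above, a 0.4-point spread on a 100-point scale is far too small to support a full letter-grade distinction, yet the curve forces one anyway.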
J. Teaching to Mastery vs Teaching to Mediocrity
Traditional teaching methods typically ask students to complete a project, test, essay, or activity. An instructor then grades the deliverable, offering advice about what the student could have done better. But then the student does not get the chance to do better; they move on to the next item.
It’s instructive that basically no one uses this method when the instruction actually matters and the trainee is expected to perform. GEICO gives insurance adjusters tests about insurance law and policy, but it doesn’t let them progress with training and start doing the job until they actually master the material.
At base, “teach-to-mastery” or “mastery methods” employ this same philosophy for classroom instruction. Teach-to-mastery is what an instructor does when they take FINC 101 at least as seriously as a child’s piano teacher takes the C major scale.
For instance, since the curve was suspended, Jason Brennan has implemented a simple teach-to-mastery method in all of his courses. Students must hand in their assignments on time, but they are permitted to revise every assignment as many times as they want, until the semester ends, to get the grade they want. In the past, a student might get a C on the first essay and then slowly improve over the semester. Now, students instead fix that first essay. Surprisingly, Brennan’s classes often have a culture of revision, in which students revise and improve work even when they have already gotten an A and such revisions cannot improve their final grade.
Such teaching does indeed take more work, though not necessarily as much as one might think, in part because revisions often simply add missing material rather than start over from scratch. In Brennan’s case, he estimates it adds maybe 15-20 extra hours of teaching work per semester. It partly reduces work because it eliminates grade-grubbing; instead of students arguing they deserved a better grade, they rework their projects to get the better grade. He has not found his research output to have suffered in any way because of this.
At MSB, tenure-track faculty have a monetary incentive to minimize teaching work and instead maximize research output. Prestige, promotion, speaking engagements, and raises are more strongly tied to research than teaching.
Accordingly, given these incentives and given that we are a Research I university, it is probably too much to demand that all faculty teach to mastery. However, it seems perverse to insist that faculty do not teach to mastery or to discourage them from doing so.
As it stands, our curve creates perverse incentives. Faculty want to avoid a situation in which nearly all students master the material, but nevertheless some students must receive low B grades because they were slightly less exceptional than others. Accordingly, we are incentivized to make tests too long, essays too difficult, projects too rigorous, and to prevent too many students from learning too much, all to ensure that a sufficient number of students do poorly enough to fall below the fixed mean. The curve creates perverse incentives against holding extra office hours, allowing revisions, holding review sessions, explaining the material well, and pre-briefing projects.
When the curve is absent, a professor is free to think, “I have certain absolute standards for what constitutes A-level work and I want to work to ensure all of my students get there.” When the curve is present, we are instead incentivized to think, “I need to ensure that a sufficiently large number of students fall below my standards.” Thus, we are incentivized to teach to mediocrity rather than mastery.
Upshot: The curve creates perverse incentives to underteach or apply inappropriately high standards. It disincentivizes teaching to mastery, the system which is probably best for students.
K. The Widespread but Unsupported Belief in Grade Inflation
There is a widespread belief among academics and others that grades have become “inflated”. A C in 1960 is the same as a B in 2021, or so the claim goes.
For the sake of argument, assume grade inflation exists and that it’s bad. One might think that curves are an effective tool to combat such grade inflation. However, we again fall back into a collective action problem. If a small minority of schools impose strict curves, they reduce inflation at their school. But, to extend the metaphor, when their students apply for jobs or graduate admissions, the “buyers” won’t know the students’ “prices” are in a different “currency”. They won’t know that a Georgetown 2021B = a Georgetown 1960B = a Wharton 2021A-. They will simply assume our grades are also inflated and presume the Georgetown B-student is less capable than she really is. Curves might be a good solution when almost everyone does them, but that doesn’t mean we should adopt them when others have not. We do not get to decide what others interpret our grades as signifying.
However, we need to ask whether this assumption is warranted. As business professors, we all understand that a mere increase in monetary prices does not signify inflation. There could be demand or supply shifts or shocks, or any number of other things going on. We need to know whether prices changed because of changes in the quantity of money.
The analog for grades is that it is not enough to know whether average GPAs are higher now than in the past. We need also to know whether work has gotten better, worse, or stayed the same. Georgetown’s admissions are more rigorous now than in the past, our students are better, our faculty are better, students have more support services, students are more informed and better at self-sorting into classes they excel at, and so on. If we held constant standards over 50 years, then our students today should be getting better grades than students 50 years ago.
In fact, no one has published a study proving that students today are getting higher grades for work equivalent to that of students in the past, i.e., that a C paper yesterday really is the same as a B paper today. No one has shown that GPAs have increased over time at a faster rate than the quality of student work. So, we don’t know whether grade inflation has occurred. (At elite universities, student credentials are stronger now than in the past, so we should expect their work is better, all things equal.)
But it gets worse. It turns out we don’t even know how average college GPAs have changed. Many reports which appear to show a raw GPA increase use student-reported GPAs or other poor sources of data. These are unreliable: students might lie, misreport, or forget their GPAs, there might be selection problems in who answers the surveys, and so on.
We need good data—actual student transcripts collected in a way that ensures proper sampling. The only major study that does so was conducted by US Dept of Education researcher Clifford Adelman. A short summary of his findings is that the cohort of 1972 had an average GPA of 2.70, the cohort of 1982 had an average GPA of 2.66, and the cohort of 1992 had an average GPA of 2.74.[11] His study is now out of date and no one has published an equivalently rigorous study since then, in part because only someone in the Dept of Ed could force universities to provide the needed data.
GradeInflation.com appears to demonstrate otherwise. But it’s bogus. It’s bogus in part because the author at best finds changes in raw GPA over time but doesn’t attempt to measure underlying performance. For all we know, student work has improved faster than average GPA, and so what looks like inflation is actually deflation. It’s bogus also because most of his data sources are bad. He relies on student newspapers, student self-reports, reports in rival university newspapers, and so on. Only in a minority of cases does he use unassailable data from registrar reports or properly sampled student transcripts. Even then, there usually isn’t sufficient data to determine whether many of the purported changes are statistically significant rather than random noise.
L. How Do Grades Affect Learning?
Experimental data suggests grades do little to help students improve their work. For instance, R. Butler and M. Nisan ran an experiment in which students completed a task, received either no feedback or one of two different types, and then had to do the task again.[12] They could then measure the value-added, if any, of the feedback. They gave the experimental groups either what they called evaluative or descriptive feedback. Evaluative feedback—such as a letter grade—tells students how good or bad their work is. Descriptive feedback gives students advice about how to do better. They generally found that—to produce better performance in the future—giving grades and evaluative feedback was better than giving no feedback, but giving descriptive feedback by itself was better than giving evaluative feedback and grades.
Further, grades do not appear to have a positive effect on students’ motivation. As Jeffrey Schinske and Kimberly Tanner summarize the large body of extant research:
It would not be surprising to most faculty members that, rather than stimulating an interest in learning, grades primarily enhance students’ motivation to avoid receiving bad grades. Grades appear to play on students’ fears of punishment or shame, or their desires to outcompete peers, as opposed to stimulating interest and enjoyment in learning tasks. Grades can dampen existing intrinsic motivation, give rise to extrinsic motivation, enhance fear of failure, reduce interest, decrease enjoyment in class work, increase anxiety, hamper performance on follow-up tasks, stimulate avoidance of challenging tasks, and heighten competitiveness. Even providing encouraging, written notes on graded work does not appear to reduce the negative impacts grading exerts on motivation. Rather than seeing low grades as an opportunity to improve themselves, students receiving low scores generally withdraw from class work. While students often express a desire to be graded, surveys indicate they would prefer descriptive comments to grades as a form of feedback. [13]
M. Responses to Arguments for the Curve
Argument: “We need to reduce grade variance.”
Response: The curve reduces variance in the final grades. It does not reduce variance in course content, course standards, or student performance. By itself, it creates the illusion of uniformity without creating underlying uniformity. It is a method of putting the same shell on different snails. Further, it’s not even clear that uniformity is good. We want faculty to experiment with different teaching methods, evaluation techniques, and so on.
Argument: “The curve creates rigor.”
Response: First, anyone asserting this should provide evidence it indeed does so. Whether it does is far from clear. Some classes are easy, some are hard, and yet all are curved. In some classes, a 92/100 generates a B average. Second, anyone asserting this should demonstrate that we cannot provide appropriate rigor through other means. For instance, teaching-to-mastery is a way of being rigorous—indeed more rigorous—even though every student can receive an A. Third, if the problem is rigor, we should solve the problem by demanding rigor. We could have faculty evaluate each other’s syllabi or grading to ensure no one is going “too easy” on students. If we genuinely cared about rigor, we’d enforce rigor, not enforce the appearance of rigor.
Argument: “Students will go searching for easy As and faculty will be incentivized to provide them to get higher SET scores.”
Response: The research on SET scores over the past 20 years is almost univocal: SET scores are not a valid measure of faculty teaching effectiveness. The fact that Georgetown uses them indicates either that our administration is culpably misinformed or that we are not actually concerned about teaching effectiveness. Regardless, we should not do something wrong (impose the curve) as a response to a previously morally wrong and unscientific decision we made (use SET scores instead of measuring teaching effectiveness).
Argument: “It serves my department’s or my own self-interest to use the curve.”
Response: We have a fiduciary duty to students which renders all such arguments inadmissible.
Argument: “The curve helps students get jobs and earns them more money.”
Response: I cannot find evidence that this is true.
[1] https://haas.berkeley.edu/ewmba/academics/grades/
[2] https://www.stern.nyu.edu/portal-partners/current-students/undergraduate/academics/academic-policies#Grade%20Point%20Average
[3] https://goizueta.emory.edu/undergraduate-business-degree/curriculum/standards
[4] https://mendozaugrad.nd.edu/academics/departmental-grading-guidelines/
[5] https://guide.wisc.edu/undergraduate/business/#policiesandregulationstext
[6] Guy Montrose Whipple, “Editor’s Preface”, in Finklestein 1913, 1.
[7] Bretz 1989.
[8] Schmidt and Hunter 1998.
[9] French et al 2015.
[10] Schinske and Tanner 2014. See also Branthwaite, Trueman, and Berrisford, 1981 for further evidence that grading is unreliable.
[11] Adelman 2009.
[12] Butler and Nisan 1986.
[13] Schinske and Tanner 2014.