## *Généalogie* (qv. Foucault, 1975) of Assessment in HE

The term "assessment" is now taken for granted in education, but it is of relatively recent adoption: it does not feature once in John Roach's *Public Examinations in England, 1850-1900* (1971). Prior to the 1960s, terms like 'educational measurement' or 'evaluation' dominated, the former carrying a clinical objectification, the latter - from the Latin *valere*: 'be of value, be worth' - a moral judgement. 'Assessment', coming from the Latin *assidere*, 'to sit *with*', implies a collaboration, something done *with* the learner, not *to* her, a point made by many scholars deploring systems of assessment which fail to deliver on their etymological promise (Green, 1998; Knight, 2000; McFadzien, 2015). In Britain, 1976 is a key moment for the word (if not the practice), with the creation of the Assessment of Performance Unit (APU), under these terms of reference:

> *"To promote the development of methods of assessing and monitoring the achievement of children at school \[...\]"*
> (DES, 1976, para. 53)

> [!note] Three decades of use of the term 'university assessment'
> ![[Assessment NGRAM.png]]
> *Google NGram chart showing an eight-fold increase in the incidence of the term 'university assessment' in its English language corpus between 1990 and 2020.*

In this report from the Department of Education and Science (the 'yellow book' - 1976), it is *achievement* that is assessed, not learning; indeed, it is also *monitored*. This language, and subsequent uses of the term in the text, are ambiguous as to whether they refer to the assessment of individual students or to the collective achievement of the schools charged with delivering it. By 1988, the National Curriculum, created as part of the Education Reform Act (1988, c. 40), used the term in reference to individual pupils' "attainment targets", particularly at the end of a "programme of study" - the newly created "key stages" (Harris, 1989).

The history of assessment, as a practice, is of course much older. Written exams date back to Imperial China around 200 B.C.E., where their major function was the ruthless selection of civil servants (Miyazaki, 1976). In Western Europe, assessment was foremost oral: medieval universities continued the practice of _recitation_, inherited from monastic catechism, and used _disputation_ - a sometimes day-long dialectic argumentation in teams of students and masters which, in an age before academic journals, served as a method of critical inquiry and knowledge dissemination, and doubled as examination and graduation ceremony for the students involved. It is the rise of the modern nation-state, in the 18th century, that births correspondingly modern examinations; in England, orals were replaced by more selective _written_ exams by the mid-19th century. The result of those assessments is the rank ordering of students: marking systems appear even later, initially in an effort to provide transparency in the establishment of this ranking. As in the imperial China of antiquity, it is the State's need to set standards for, and select, its civil servants that drives these shifts (Wilbrink, 1997). Historically, assessment has thus long served a role of *classification* - indeed, of *Discipline*, as Foucault (1975) would term it, ranking, as it does, subjectivities into hierarchies.
University assessment does start to show an ancillary concern for learning from the mid-19th century: in 1852, a committee in Oxford wanted examinations to promote habits of "self-instruction, voluntary labour, and self-examination" (Cox, 1967, p. 294). However, if we look beyond the ivory tower (and towards the thatched huts), the pedagogical role of assessment has been evident in the apprenticeship of trades and crafts since the Middle Ages *at least*. Indeed, many of the 'innovative' forms of assessment which appeared in British universities this century (self-, peer-, portfolio-, authentic-...) map onto practices of workplace learning, past or present - it is only in the academe that *assessment for learning* (more on which below) represents 'innovation' (Kvale, 2007). These different conceptions of assessment across the academic/vocational binary persist to this day in British further (post-16) education (Torrance _et al._, 2005), and they inform the construction of an 'assessment identity' shaping students' expectations of university assessment (Ecclestone, 2007).

## Formative vs Summative, AfL vs AoL

The UK's Quality Assurance Agency for Higher Education (QAA - more on which below) defines assessment as "any processes that appraise an individual's knowledge, understanding, abilities or skills" (QAA, 2006, p. 4). An umbrella term is useful to cover the variety of such processes: exams, dissertations, *vivas*, projects... It also allows the term to encompass assessments that differ in purpose, if not in form. In the latest *Quality Code* from the QAA, 'assessment' itself is not defined, but its two purposes are:

> “**Formative assessment:** Assessment with a developmental purpose, designed to help learners learn more effectively by giving them feedback on their performance and how it can be improved and/or maintained. Reflective practice by students sometimes contributes to formative assessment.
> **Summative assessment:** Used to indicate the extent of a learner’s success in meeting the assessment criteria to gauge the intended learning outcomes of a module or course. Typically, within summative assessment, the marks awarded count towards the final mark of the course/module/award”
> (QAA, 2018, p. 1)

> [!note] Formative vs Summative
> ![[F vs S NGRAM.png]]
> _Google NGram chart showing the respective incidence of the terms "summative assessment" and "formative assessment" in their English language corpus_

These two terms capture the different instrumental goals of assessment; they are mirrored by a pair of terms capturing its philosophical purpose: assessment *of* learning (AoL), and assessment *for* learning (AfL). This is more than phrasing: not all formative assessment serves learning, nor is all assessment of learning summative. There is considerable tension between these aspects, representing, at one end, the classification purpose of assessment, which “cabins, cribs and confines, no matter how its supporters may protest to the contrary” (Barnett, 2007, p. 35); at the other, its pedagogical intent: students' work must be appraised if they are to be given feedback on how to improve it, an iterative cycle that prompted the more accurate, but less popular, phrase 'assessment *as* learning' (Schellekens *et al.*, 2021). AoL calls for objective standards, which presume an objective appraisal of learning; as such, it proceeds from a positivist epistemology, using quantitative measurements (at an ordinal, as opposed to interval, level).
AfL, conversely, uses qualitative appraisal, in the form of feedback, necessary (but not sufficient!) to student learning (Broadfoot _et al._, 2002); it acknowledges that knowledge is constructed by the learner through interaction and experience (Piaget, 1972; Vygotsky and Cole, 1978). This pedagogical constructivism is predicated upon an epistemological social constructionism; AoL and its positivism, *a contrario*, seem to lead, in history if not in theory, to an '*instructionist*' (to its detractors - 'didactic' to its proponents) view of pedagogy: for learning to be objectively validated, knowledge needs to be reified, with teaching becoming its mere, one-way, transmission from teacher to student.

Thomas Kuhn (1977) wrote of *the essential tension* between tradition and innovation in scientific research. With AoL and, more broadly, didactic pedagogy representing 'tradition', and constructivist AfL 'innovation', a similar essential tension runs through the evolution of policy and praxis in British higher education over the last three decades, in which the QAA plays a major role.

> [!note] Here comes a new challenger!
> ![[Formative A vs AfL NGRAM.png]]

Founded in 1997, the QAA succeeded the Higher Education Quality Council (itself founded earlier that decade), absorbing its functions along with part of those of England and Wales's higher education funding councils; the publication, mere months later, of Sir Ron Dearing's report, *Higher Education and the Learning Society* (Dearing, 1997), would give the QAA a mandate in assuring standards and quality in higher education (King, 2019). This was the formal beginning of a regime of auditing that, to paraphrase Adams (1980), has *made a lot of people very angry*. How *widely* it is *regarded as a bad move*, however, is a matter of perspective; it has certainly been viewed as a managerialist mission of monitoring, adding to administrative burden without improving learning (Laughton, 2003; Lucas, 2014).

The standardisation mission of the QAA aligned it firmly with an AoL perspective, but the end of the millennium would also mark the start of 'the AfL movement': in 1996, when the British Educational Research Association (BERA) put an end to its funding of such groups, members of its Policy Task Group on Assessment obtained funding from the Nuffield Foundation to continue their work, as the Assessment Reform Group (ARG), with the stated mission for "assessment policy and practice at all levels \[to take\] account of relevant research evidence" (Nuffield Foundation, n.d.). The ARG's magnum opus, Black and Wiliam's (1998a) literature review on formative assessment in *school education*, with its companion booklet *Inside the Black Box* (1998b), fulfilled this mission thoroughly. This text had significant impact on assessment in British schools, foregrounding its formative role.

> [!note] Assessment for ~~Learning~~ Performance
> There is considerable irony in Black and Wiliam (1998a, 1998b) presenting quantitative data as *positivist*, objective, measures of learning, in order to make the case for *constructivist* pedagogy. This quantitative method may be why, with the ascendency, in the UK, US and Australia, of a discourse of 'evidence-informed' practice "misdirect\[ing\] public policy" (Simpson, 2017), the "AfL movement" would become by 2010 "a new orthodoxy" in UK schools, with a rhetoric at odds with actual practice, where high-stakes summative testing still dominates (Taber _et al._, 2011).
AfL went on to become somewhat of a problem child for Black and Wiliam in the 2000s, with Black warning it had become "a free brand name to attach to any practice" (Black, 2006, p. 11), then Wiliam despairing: "What will it take to make it work?" (Wiliam and Thompson, 2008). Delivering the inaugural annual lecture of the Chartered College of Teaching in 2023, Wiliam critiqued quantitative meta-analyses with the same arguments as Simpson (2017), then, drawing on Soderstrom and Bjork (2015), reminded the audience that "performance is not learning" (Wiliam, 2023, p. 10); both valid critiques, unacknowledged at the time, of his own seminal work with Black 25 years prior.

Formative assessment has received much less attention in higher education than in school contexts: Morris and colleagues' (2021) systematic review identifies low-stakes quizzes/tests and peer-assessment as well evidenced, with "technology" showing mixed findings - unsurprisingly, given the breadth of the term. The true takeaway from this review may be the paucity of high-quality evidence about formative assessment and feedback in higher education, in contrast with schools, which the authors ascribe to differences, in the UK, between each sector's research practices and teaching aims and incentives. Schools prioritise students' academic progress, to which they are held accountable (via external exam grades), and school research is carried out by third parties, who can scope studies so as to favour generalisability. Universities, conversely, answer to the Teaching Excellence Framework and the Office for Students (Dickinson, 2025; Fung, 2025), and their assessment research is often carried out locally, thus leaning towards the idiographic rather than the nomothetic (generalisable). Implementation tends, in schools, towards near-standardised best practice, whereas universities, offering greater teacher autonomy, will see a multiplicity of novel designs deployed and evaluated, further impairing the development of a comprehensive evidence base. Indeed, the review's findings need caveating: evidence on quizzes comes primarily from STEM subjects, and that on peer assessment from writing instruction (Morris, Perry and Wardle, 2021).

## Key concepts in assessment

---

> *“Deliberate, systematic quality assurance ensures that assessment processes, standards and any other criteria are applied **consistently** and **equitably**, with **reliability**, **validity** and **fairness**.”*
> (Quality Assurance Agency for Higher Education, 2018, p. 2)

### Validity

For the first half of the twentieth century, validity was a property of the test. Indeed, the QAA's definition, “how well a test measures what it claims to measure” (Quality Assurance Agency for Higher Education, 2018, p. 4), is found near-verbatim in Kelly (1927, p. 14, cited in Shaw and Crisp, 2011), and is still in wide use in education texts in the 21st century (e.g. Brown and Knight, 1994; Cohen _et al._, 2010). But, as Wiliam (2010) points out, tests do not make claims; it is the people administering the tests who do. It is a definition predicated on what is (claimed to be) measured having an intrinsic, specific value for each individual - a positivist view. A valid test would correlate with other, existing measures, and predict future performance in assessment.
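As an illustration not drawn from the sources above, this classical conception reduces to a correlation between test scores and an external criterion:

$$r_{XY} = \frac{\operatorname{cov}(X, Y)}{\sigma_X\,\sigma_Y}$$

where $X$ denotes the test scores and $Y$ the criterion measure - a contemporaneous one for concurrent validity, a later one (say, performance in a subsequent assessment) for predictive validity. The closer $r_{XY}$ is to 1, the stronger the classical claim to validity.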
Validity vis-à-vis external criteria became known as *criterion validity* - concurrent or predictive, respectively - distinguishing it from *content* validity (how well the test samples the content studied) and *construct* validity: validity with respect to a theoretical construct (Wiliam, 2010; Shaw and Crisp, 2011). The latter proceeded directly from construct validity in psychological testing (of which educational assessment is arguably a form), as originated by Cronbach and Meehl (1955), and would rise to prominence in educational assessment over the following 30 years, coming into general adoption. It was Messick (1989) who proposed a unified model based on a broader construct validity, including all available sources of evidence and subsuming the content and criterion conceptions. Construct validity underpins this model: validity is "construct-*referenced*" (p. 35), but he defines it as “the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores” (p. 13). For Messick (1989), then, validation is an ongoing process; he further noted the need for "appraisals \[...\] of the social consequences of using the scores for applied decision making" (p. 13) - what Boud (2007) terms *consequential validity*. Later, building on Messick's (1989) call for the integration of consequential sources of evidence, as well as on Cronbach's (1988) conceptualisation of validation as evaluation (of the test itself), Kane (2001, 2006) proposed validation as the construction of a dual argument: one expounding the rationale of the interpretation (decisions and consequences) of test results, the other the validity and appropriateness of the assumptions and inferences that underpin that interpretive argument (Shaw and Crisp, 2011).

**TODO:**
- Consequential validity and washback
- Consequential validity and fairness.

### Reliability

Reliability is about consistency of measurement; Brown and Knight (1994) make the analogy of measuring fabric: a wooden yardstick is a reliable instrument, an elastic measuring tape is not. The outcome of the assessment needs to come from what is being assessed, not from the assessment instrument. In its narrowest sense, reliability is *“the consistency of marks obtained with the same assessors and the same test conditions”* (Tan, 2007, p. 124), through *stability*: the test "\[coming\] up with the same result on different occasions, independently of who was giving and marking it” (Biggs *et al.*, 2022, p. 218). This is the 'measurement model' (Taylor, 1994), with a history in psychometrics, where the person administering the test, e.g. a quantitative scale, has limited influence on its results; in such a model, reliability is an attribute of the test. Evaluating student work is often qualitative, so in the 'standards model' of assessment, the one relevant to quality assurance, reliability extends to *intra-judge reliability* - the examiner's judgement being consistent for the same standard of work - and further to *inter-judge reliability*: different examiners judging consistently (Biggs *et al.*, 2022).
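The sources cited here do not prescribe a particular statistic, but one common way of quantifying inter-judge agreement is Cohen's kappa, which corrects the raw proportion of agreement for what two examiners would agree on by chance:

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

where $p_o$ is the observed proportion of scripts awarded the same grade by both examiners and $p_e$ the proportion expected by chance. If, say, two markers agree on 80% of scripts where chance agreement would be 50%, then $\kappa = (0.80 - 0.50)/(1 - 0.50) = 0.60$, conventionally read as moderate agreement.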
National standards are an effort to extend this 'intra-institution' reliability to an *inter-institution* one (Orr *et al.*, 2012) - moderation by external examiners is an effort in this direction, although, as of 2006 at least, one "widely recognised as being a fairly ‘light touch’ nod in the direction of attempting to maintain some degree of equivalence between different institutions” (Murphy, 2006, p. 40).

### Referencing

Key to the *stability* element of reliability is the way the assessment is referenced. Norm-referenced assessment evaluates the performance of a student against that of others in the same test: the rank ordering of students that pre-dates the giving of marks is norm-referencing. Criterion-referenced assessment (CRA), conversely, evaluates the student's performance against criteria known in advance (Biggs, Tang and Kennedy, 2022, p. 222). This is not to say that all marks come from CRA: when English universities switched to marks in order to legitimise the eventual ranking, they were using a form of CRA, but in France, candidates were ranked in 10 tiers, leaving the extreme tiers free for extreme (under-)achievers (Rothblatt, 1993; Chervel, 1993, cited in Wilbrink, 1997). 'Grading on the curve' is a contemporary form of the practice: whilst the marks come from CRA, ensuring that the grades follow a normal distribution is norm-referenced assessment (Biggs, Tang and Kennedy, 2022). Finally, *ipsative* assessment has the learner evaluated against her own past performance: a prototypical example is one's running time (or Tetris score) (Brown and Knight, 1994, p. 18). The pedagogical function of assessment really only requires ipsative standards, whereas most conceptions of validity, particularly those needed for quality assurance, demand criterion-referencing.

### The validity-reliability tension

Brown and Knight (1994) position reliability in tension with validity: "reliability can be high when we try to measure some of the products of study, but validity may be a more appropriate concept to value when we talk about addressing the processes of study" (p. 17). They contrast higher education assessment, prioritising validity at the expense of reliability, with the externally-administered and comprehensively benchmarked A-levels: significant effort towards reliability, at the cost of validity - if only because no such examination assesses the full specification. Standards of reliability are inherited from the tradition of positivist psychology, where they eventually came under criticism for resulting in narrow and artificial measures of phenomena that are in reality much more complex (Brown and Knight, 1994, p. 14). It is to this history - starting with Sir Francis Galton (1894), English polymath and vocal eugenicist, noticing the normal distribution of a number of human characteristics - that we owe 'the curve', upon which grading is sometimes done. This tension maps onto that between AfL, which prioritises validity over reliability, and AoL, which technically should ascribe equal importance to both but in practice prioritises reliability, as the unreliability of an assessment is noticeable from its results alone, whereas auditing for validity is more slippery, and requires re-assessing the purported learning.
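Since 'the curve' recurs here, a minimal sketch may make the norm-referencing described under Referencing above concrete; it is illustrative only, and the z-score grade boundaries are hypothetical rather than taken from any source cited here:

```python
import statistics

def curve_grades(raw_marks, cutoffs=(-1.0, 0.0, 1.0)):
    """Norm-referenced ('curved') grading: each criterion-referenced raw mark
    is converted to a z-score against its own cohort, then banded D/C/B/A
    using hypothetical z-score cut-offs."""
    mean = statistics.mean(raw_marks)
    sd = statistics.stdev(raw_marks)
    grades = []
    for mark in raw_marks:
        z = (mark - mean) / sd
        if z < cutoffs[0]:
            grades.append("D")
        elif z < cutoffs[1]:
            grades.append("C")
        elif z < cutoffs[2]:
            grades.append("B")
        else:
            grades.append("A")
    return grades

# The same raw mark earns different grades in different cohorts:
print(curve_grades([48, 55, 62, 70, 85]))  # ['D', 'C', 'C', 'B', 'A'] - 62 is a C
print(curve_grades([62, 64, 66, 68, 70]))  # ['D', 'C', 'B', 'B', 'A'] - 62 is now a D
```

A raw mark of 62 receives a C in the first cohort and a D in the second: under norm-referencing, the grade reports standing relative to peers, not attainment against fixed criteria.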
Drawing on Johnson and Blinkhorn's (1992) work on National Vocational Qualification assessment, Brown and Knight argue for a reconceptualisation of reliability as a correct match between reference criteria and the evidence presented - whilst acknowledging that this approach alone may guarantee intra-institution reliability but fall short of the inter-institution reliability required for quality assurance of HE standards (Brown and Knight, 1994, pp. 19-21).

### Learning Outcomes

The UK was one of the original signatories, in 1999, of the 'Bologna process', which created a European Higher Education Area (EHEA), originating in, but separate from, the European Union: it now contains 49 countries, although Belarus and the Russian Federation have seen their rights suspended since 2022 (EHEA, 2025). Along with harmonising higher education cycles (Bachelor's, Master's, Doctoral), the Bologna process enabled mutual recognition of degrees, and the implementation of a system of quality assurance in teaching and learning - the QAA being the agent of this system in the United Kingdom (European Commission, 2022). In fact, Loughlin and colleagues argue that UK reforms coming in the wake of the Dearing report (1997) “pre-empted, and indeed inspired, elements of the Bologna Process” (Loughlin *et al.*, 2021, p. 123).

Core to this programme of accountability has been the rollout of *'intended learning outcomes'*: "statements of what the learner will know, understand and be able to demonstrate after completion of a programme of learning (or individual subject/course)" (Rauhvargers *et al.*, 2009, p. 56, cited in Hadjianastasis, 2017). These, often referred to simply as 'learning outcomes', apply at the institutional level, as broad "graduate attributes", and, more substantively, at the programme (degree) level, then down to individual courses (Biggs, Tang and Kennedy, 2022). Part of the standardisation work of the QAA is the requirement that modules "have a coherent and explicit set of learning outcomes and assessment criteria" (QAA, 2018, p. 1), and by 2007 it was able to declare that “most departments in most institutions, have fully adopted the principles of programme design with respect to learning outcomes.” (QAA, 2007, p. 1)

The history of learning outcomes, originally called 'objectives', dates back to the turn of the twentieth century and the era's push for a 'scientific' approach to education, with the work of Thorndike and Bobbitt, "the father of curriculum theory" (Eisner, 1983 \[1967\], p. 551). Bobbitt saw education as mere transmission of knowledge, and his theory of curriculum design, expounded in *The Curriculum* (1918), classified 160 specific objectives in nine areas. After a temporary shift towards educational progressivism, behaviourist approaches to curriculum design came back in favour, with Benjamin Bloom's *Taxonomy of Educational Objectives* (Bloom and Krathwohl, 1956), which classifies learning skills by complexity, from "remembering" to "creating" (in the revised version of Anderson and Krathwohl, 2001). Those *educational* objectives became *instructional* objectives in Mager (1962), a terminology that frames them as the result of instruction; Mager deliberately avoided the qualifier 'behavioural', given its association with Skinner's (1945) radical behaviourism. Mager's successors had no such qualms with the adjective, and objectives became *behavioural*, “effect\[ing\] a polarisation of reaction to the notion of an educational objective” in curriculum planning (Allan, 1996, p. 97).
By the 1980s, *'objectives'* became *'outcomes'*, "because behavioural objectives \[had\] received a bad press" (Brown, Bull and Pendlebury, 1997, p. 17).

> [!note] The evolution of learning outcomes
> ![[Allen, 2016, p. 101.png]]
> (Allan, 1996, p. 101)

Outcomes are broader, including non-behavioural, non-subject-specific outcomes - e.g. "be able to take responsibility for one's learning" (p. 102). At the course level, subject-specific outcomes are broader and more complex than their forebears and, unlike them, not tied to a single performance variable - e.g. "apply knowledge of validity, reliability and triangulation to a chosen research issue" (p. 99).

The use of 'outcomes' is epitomised in John Biggs's influential Structure of Observed Learning Outcomes (SOLO - Biggs, 1979; Biggs and Collis, 1982). This taxonomy relates to Piaget's (1964) stages of cognitive development, and owes much to Bloom's (Bloom and Krathwohl, 1956), but is logically distinct from the former (Biggs, 1979, p. 385). Biggs and colleagues (2022) claim that SOLO's taxonomy, *unlike Bloom's*, is hierarchical, by which they surely must mean that its hierarchy is more explicit: each of the five levels must make use of the previous ones. Importantly, SOLO's framing is about response to a task, with levels defined by the number of "items" of "data" included, then (once full coverage is reached) the quality of the relations drawn between them (Biggs, 1979, p. 385). There is empirical evidence for the effectiveness of SOLO, particularly if the learner is introduced to the framework (developing meta-cognition, or giving the student a better grasp of what to showcase in assessment?), but the ambiguity of identifying the level of a response has also been shown to lead to poor inter-examiner reliability (Wells, 2015).

### Constructive Alignment

Biggs would go on to establish the very influential concept of *constructive alignment* (CA), now “\[w\]idely viewed as best practice, at least in managerial circles” (Newby and Cornelissen, 2025, p. 1). Introduced in 1996, coinciding with the wave of reforms in European HE leading up to the Bologna process, it came with the explicit promise of reconciling the behaviourist evaluation required for quality assurance with a constructivist pedagogy, moving away from the traditional modes of instruction (mass lecture) and assessment (examination of rote learning) which “relied on intrinsic motivation and highly developed study skills of an academic elite” (Loughlin et al., 2021, p. 121). Instead, CA promoted *constructivist* teaching and learning activities, *aligned* with the learning outcomes - that is, aligned with the criterion-referenced assessment verifying them (Biggs, 1996, 1999, 2003). Beyond a *tour de force* in reconciling paradigms, Biggs argued his case through the contrasting examples of Robert, the shallow, learn-to-the-test student, and Susan, the dedicated deep learner. Without constructive alignment, their ultimate summative outcomes could very well be similar. With it, Robert would either fail the summative assessment if relying on his old tricks, or, ideally, engage in the learning activities so as to become Susan-like (Biggs, 1999; Biggs, Tang and Kennedy, 2022).
Learning Outcomes as originally construed by the QAA and Dearing needn't be aligned with teaching: the initial focus of the agency was accountability; it is in this century that CA ascended to a de-facto orthodoxy, being explicitly referenced in Bologna process materials from 2015 (European Commission: Directorate-General for Education, Youth, Sport and Culture, 2015) and in the QAA's latest code of practice (Loughlin, Lygo-Baker and Lindberg-Sand, 2021). It is worth noting that CA, from its inception (Biggs, 1996), recommends SOLO (Biggs, 1979; Biggs and Collis, 1982) as an evaluation taxonomy for student responses, but that this inclusion hasn't percolated into policy recommendations.

> [!note] Constructive Alignment in the QAA's Code of Practice
> ![[QAA, 2018, p. 4.png]]
> (QAA, 2018, p. 4)

However, the auditing and validation burden applies to LOs and assessment, not teaching activities: the alignment of teaching is a recommendation, not a mandate. The association of CA with the programme of accountability deployed across institutions resulted in its vilification amongst many academics as a managerial, rather than pedagogic, endeavour. Perilous is the path from policy to practice, to be sure, and much of the criticism of CA is that of its administrative implementation, vitiating a sound pedagogical theory (Loughlin, Lygo-Baker and Lindberg-Sand, 2021). CA has also received critique on its own terms: its theoretical underpinnings are seen as at odds with ambitions to decolonise the university or make it more inclusive of disabilities, and it *de facto* proscribes some alternative, student-centred pedagogies (Newby and Cornelissen, 2025). From the opposite side, some have argued that constructivism is ill-suited to the natural sciences (Jervis and Jervis, 2005) - everyone, indeed, is a critic. Ultimately, there is a core critique: CA attempts to combine two incompatible paradigms (Hadjianastasis, 2017) and, having ended up instrumentalised by the managers and auditors of HE, effectively *sells instructionist wine in constructivist bottles*.

Learning outcomes, and Outcomes Based Education (OBE) more broadly, have received significant critique: in theory, for their intrinsically behaviourist philosophy and their neglect of broader outcomes (Allan, 1996; Cohen _et al._, 2010; Hadjianastasis, 2017); in practice, for being an instrument of managerial control, a "strait-jacket", a "box-ticking" exercise (Hussey and Smith, 2003; Dobbins _et al._, 2016), imposed as the consequence of the marketisation of education, with each LO a promise of service from teacher-producer to student-consumer (Brancaleone and O’Brien, 2011; Brown and Carasso, 2013).

### Authentic assessment

### Equitable assessment

### Workload aspects

**TO DO:**
- [ ] Search Assessment and Integrity
- [ ] Search assessment and authorship
- [ ] Search assessment and originality