## A brief history of assessment in HE

The term "assessment" is now taken for granted in education, but it is of relatively recent adoption: it does not feature once in John Roach's *Public Examinations in England, 1850-1900* (1971). Prior to the 1960s, terms like 'educational measurement' or 'evaluation' dominated, the former carrying a clinical objectification, the latter - from the Latin *'valere'*: 'be of value, be worth' - a moral judgement. 'Assessment', coming from the Latin *'assidere'* - 'to sit *with*' - implies a collaboration, something done *with* the learner, not *to* her, a point made by many scholars deploring systems of assessment which fail to deliver on their etymological promise (Green, 1998; Knight, 2000; McFadzien, 2015).

In Britain, 1976 is a key moment for the word (if not the practice), with the creation of the Assessment of Performance Unit (APU), under these terms of reference:

> *"To promote the development of methods of assessing and monitoring the achievement of children at school \[...\]"*
> (DES, 1976, para. 53)

> [!note] Three decades of use of the term 'university assessment'
> ![[Assessment NGRAM.png]]
> *Google NGram chart showing an eight-fold increase in the incidence of the term 'university assessment' in its English language corpus between 1990 and 2020.*

In this report, from the Department of Education and Science (the 'yellow book' - 1976), it is *achievement* that is assessed, not learning; indeed, it is also *monitored*. This language, and subsequent uses of the term in the text, are ambiguous as to whether they refer to the assessment of individual students or to the assessment of their achievements in the collective - meaning that of the schools charged with delivering it. By 1988, the National Curriculum, created as part of the Education Reform Act (1988, c. 40), uses the term in reference to individual pupils' "attainment targets", particularly at the end of a "programme of study" - the newly created "key stages" (Harris, 1989).

The history of assessment, as a practice, is of course much older. Written exams date back to Imperial China around 200 BCE, where their major function was the ruthless selection of civil servants (Miyazaki, 1976). In Western Europe, assessment was foremost oral: medieval universities continued the practice of _recitation_, inherited from monastic catechism, and used _disputation_ - a sometimes day-long dialectic argumentation in teams of students and masters which, in addition to being a method of critical inquiry and knowledge dissemination before academic journals, doubled up as examination and graduation ceremony for those students. It is the rise of the modern nation-state, in the 18th century, that births correspondingly modern examinations, with orals being replaced in England by more selective _written_ exams by the mid-19th century. The result of those assessments was the rank ordering of students: marking systems appear even later, initially in an effort to provide transparency in the establishment of this ranking. As in the imperial China of antiquity, it is the State's need for standards and for the selection of its civil servants that drives these shifts (Wilbrink, 1997). Historically, assessment has thus long served a role of *classification* - indeed, of *Discipline*, as Foucault (1975) would term it, ranking, as it does, subjectivities into hierarchies.
University assessment does start to show an ancillary concern for learning from the mid-19th century: in 1852, a committee in Oxford wanted examinations to promote habits of "self-instruction, voluntary labour, and self-examination" (Cox, 1967, p. 294). However, if we look beyond the ivory tower (and towards the thatched huts), the pedagogical role of assessment has been evident in the context of apprenticeship in trades and crafts since the Middle Ages *at least*. Indeed, many of the 'innovative' forms of assessment which appeared in British universities this century (self-, peer-, portfolio-, authentic-...) map onto practices of workplace learning, past or present - it is only in the academe that *assessment for learning* (more on which below) represents 'innovation' (Kvale, 2007). These different conceptions of assessment across the academic/vocational binary persist to this day in British further (post-16) education (Torrance _et al._, 2005), and they inform the construction of an 'assessment identity' shaping students' expectations of university assessment (Ecclestone, 2007).

## Formative vs Summative, AfL vs AoL

The UK's Quality Assurance Agency for Higher Education (QAA - more on which below) defines assessment as "any processes that appraise an individual's knowledge, understanding, abilities or skills" (QAA, 2006, p. 4). An umbrella term is useful to cover the variety of such processes: exams, dissertations, *vivas*, projects... It also allows the term to include a different type of assessment, different in purpose if not in form. In the latest *Quality Code* from the QAA, 'assessment' itself is not defined, but its two purposes are:

> “**Formative assessment:** Assessment with a developmental purpose, designed to help learners learn more effectively by giving them feedback on their performance and how it can be improved and/or maintained. Reflective practice by students sometimes contributes to formative assessment.
> **Summative assessment:** Used to indicate the extent of a learner’s success in meeting the assessment criteria to gauge the intended learning outcomes of a module or course. Typically, within summative assessment, the marks awarded count towards the final mark of the course/module/award”
> (QAA, 2018, p. 1)

> [!note] Formative vs Summative
> ![[F vs S NGRAM.png]]
> _Google NGram chart showing the respective incidence of the terms "summative assessment" and "formative assessment" in their English language corpus._

These two terms capture the different instrumental goals of assessment; they are mirrored by a pair of terms capturing its philosophical purpose: assessment *of* learning (AoL) and assessment *for* learning (AfL). This is more than a matter of phrasing: not all formative assessment serves learning, nor is all assessment of learning summative. There is considerable tension between these aspects, representing, at one end, the classification purpose of assessment, which “cabins, cribs and confines, no matter how its supporters may protest to the contrary” (Barnett, 2007, p. 35); at the other, its pedagogical intent: students' work must be appraised if they are to be given feedback on how to improve it, an iterative cycle that prompted the more accurate, but less popular, phrase 'assessment *as* learning' (Schellekens *et al.*, 2021). AoL calls for objective standards, which presume an objective appraisal of learning; as such, it proceeds from a positivist epistemology, using quantitative measurement (at an ordinal, as opposed to interval, level).
AfL, conversely, uses qualitative appraisal, in the form of feedback, necessary (but not sufficient!) to student learning (Broadfoot _et al._, 2002); it acknowledges that knowledge is constructed by the learner through interaction and experience (Piaget, 1972; Vygotsky and Cole, 1978). This pedagogical constructivism is predicated upon an epistemological social constructionism; AoL and its positivism, *a contrario*, seem to lead, in history if not in theory, to an '*instructionist*' (to its detractors - 'didactic' to its proponents) view of pedagogy: for learning to be objectively validated, knowledge needs to be reified, with teaching becoming its mere one-way transmission from teacher to student.

Thomas Kuhn (1977) wrote of *the essential tension* between tradition and innovation in scientific research. With AoL and, more broadly, didactic pedagogy representing, on the one hand, 'tradition', and constructivist AfL, on the other, 'innovation', a similar essential tension runs through the evolution of policy and praxis in British higher education over the last three decades, in which the QAA plays a major role.

> [!note] Here comes a new challenger!
> ![[Formative A vs AfL NGRAM.png]]

Founded in 1997, the QAA succeeded the Higher Education Quality Council, itself founded earlier that decade, and alongside its functions absorbed part of those of England and Wales's higher education funding councils; the publication, mere months later, of Sir Ron Dearing's report, *Higher Education and the Learning Society* (Dearing, 1997), would give the QAA a mandate in assuring standards and quality in higher education (King, 2019). This was the formal beginning of a regime of auditing that, to paraphrase Adams (1980), has *made a lot of people very angry*. How *widely* it is *regarded as a bad move*, however, is a matter of perspective; it has certainly been viewed as a managerialist mission of monitoring, adding to administrative burden without improving learning (Laughton, 2003; Lucas, 2014).

The standardisation mission of the QAA aligned it firmly with an AoL perspective, but the end of the millennium would also mark the start of 'the AfL movement': in 1996, when the British Educational Research Association (BERA) put an end to its funding of such groups, members of its Policy Task Group on Assessment obtained funding from the Nuffield Foundation to continue their work as the Assessment Reform Group (ARG), with the stated mission for "assessment policy and practice at all levels \[to take\] account of relevant research evidence" (Nuffield Foundation, n.d.). The ARG's magnum opus, Black and Wiliam's (1998a) literature review on formative assessment in *school education*, with its companion booklet *Inside the Black Box* (1998b), fulfilled this mission thoroughly. This text had significant impact on assessment in British schools, foregrounding its formative role.

> [!note] Assessment for ~~Learning~~ Performance

There is considerable irony in Black and Wiliam (1998a, 1998b) presenting quantitative data as *positivist*, objective measures of learning in order to make the case for *constructivist* pedagogy. This quantitative method may be why, with the ascendancy, in the UK, US and Australia, of a discourse of 'evidence-informed' practice "misdirect\[ing\] public policy" (Simpson, 2017), the "AfL movement" would become by 2010 "a new orthodoxy" in UK schools, with a rhetoric at odds with actual practice, where high-stakes summative testing still dominates (Taber _et al._, 2011).
AfL went on to become somewhat of a problem child for Black and Wiliam in the 2000s, with Black warning it had become "a free brand name to attach to any practice" (Black, 2006, p. 11), then Wiliam despairing: "What will it take to make it work?" (Wiliam and Thompson, 2008). Delivering the inaugural annual lecture of the Chartered College of Teaching in 2023, Wiliam critiqued quantitative meta-analyses with the same arguments as Simpson (2017), then, drawing on Soderstrom and Bjork (2015), reminded the audience that "performance is not learning" (Wiliam, 2023, p. 10) - both very valid critiques of his seminal work with Black 25 years prior, unacknowledged at the time.

Formative assessment has received much less attention in higher education than in school contexts: Morris and colleagues' (2021) systematic review identifies low-stakes quizzes/tests and peer assessment as well evidenced, with "technology" showing mixed findings - unsurprisingly, given the breadth of the term. The true takeaway from this review may be the paucity of high-quality evidence about formative assessment and feedback in higher education, in contrast with schools, which the authors ascribe to differences, in the UK, between each sector's research practices and teaching aims/incentives. Schools prioritise students' academic progress, for which they are held accountable (via external exam grades), and school research is carried out by third parties, who can scope studies so as to favour generalisability. Universities, conversely, ==operate under the dual scrutiny== of the Teaching Excellence Framework and the Office for Students (Dickinson, 2025; Fung, 2025), and assessment research is often carried out locally, thus leaning towards the idiographic rather than the nomothetic (generalisable). Implementation tends, in schools, towards near-standardised best practice, whereas universities, offering greater teacher autonomy, see a multiplicity of novel designs deployed and evaluated, further impairing the development of a comprehensive evidence base. Indeed, the review's findings need caveating: evidence on quizzes comes primarily from STEM subjects, and that on peer assessment from writing instruction (Morris, Perry and Wardle, 2021).

## Key concepts in assessment

---

> *“Deliberate, systematic quality assurance ensures that assessment processes, standards and any other criteria are applied **consistently** and **equitably**, with **reliability**, **validity** and **fairness**.”*
> (Quality Assurance Agency for Higher Education, 2018, p. 2)

### Validity

Defined by the QAA as “how well a test measures what it claims to measure” (Quality Assurance Agency for Higher Education, 2018, p. 4), validity is not just a property of the test, but also of what is claimed to be measured. A better formulation, in Brown and Knight (1994), is "measuring what *you* set to measure" (emphasis mine): it hints at what is fully articulated in Messick (1989): "the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores" (p. 13). Setting out to measure something serves a purpose: this measure has summative or formative consequences - are they appropriate?
The 'empirical evidence' includes other measures of the same learning: in the early 20th century, the main criterion for validity was results congruent with other metrics **MORE FROM** (Shaw and Crisp, 2011).

### Reliability

Reliability is about consistency of measurement; Brown and Knight (1994) make the analogy of measuring fabric: a wooden yardstick is a reliable instrument, an elastic measuring tape is not. The outcome of the assessment needs to come from what is being assessed, not from the assessment instrument. In its narrowest sense, reliability is *“the consistency of marks obtained with the same assessors and the same test conditions”* (Boud and Falchikov, 2007, p. 124), through *stability*: the test "\[coming\] up with the same result on different occasions, independently of who was giving and marking it” (Biggs *et al.*, 2022, p. 218). This is the 'measurement model' (Taylor, 1994), with a history in psychometrics, where the person administering the test, e.g. a quantitative scale, has limited influence on its results; in such a model, reliability is an attribute of the test. Evaluating student work is often qualitative, so in the 'standards model' of assessment, the one relevant to quality assurance, reliability extends to *intra-judge reliability* - the examiner's judgement being consistent for the same standard of work - and further to *inter-judge reliability* - different examiners judging consistently (Biggs *et al.*, 2022). National standards are an effort to extend this 'intra-institution' reliability to an *inter-institution* one (Orr *et al.*, 2012); moderation by external examiners is an effort in this direction, although, as of 2006 at least, one "widely recognised as being a fairly ‘light touch’ nod in the direction of attempting to maintain some degree of equivalence between different institutions” (Murphy, 2006, p. 40).

### Referencing

Key to the *stability* element of reliability is the way the assessment is referenced. Norm-referenced assessment evaluates the performance of a student against that of others taking the same test: the ranking of students that pre-dates their being given marks is norm-referencing. Criterion-referenced assessment (CRA), conversely, evaluates the student's performance against criteria known in advance (Biggs, Tang and Kennedy, 2022, p. 222). This is not to say that all marks come from CRA: when English universities switched to marks in order to legitimise the eventual ranking, they were using a form of CRA, but in France, candidates were ranked in 10 tiers, leaving the extremes free for extreme (under-)achievers (Rothblatt, 1993; Chervel, 1993, cited in Wilbrink, 1997). 'Grading on the curve' is a contemporary form of the practice: whilst the marks come from CRA, ensuring the grades follow a normal distribution is norm-referenced assessment (Biggs, Tang and Kennedy, 2022). Finally, *ipsative* assessment has the learner evaluated against her own past performance: a prototypical example is one's running time (or Tetris score) (Brown and Knight, 1994, p. 18).

### The validity-reliability tension

Brown and Knight (1994) position reliability in tension with validity: "reliability can be high when we try to measure some of the products of study, but validity may be a more appropriate concept to value when we talk about addressing the processes of study" (p. 17).
They contrast higher education assessment, prioritising validity at the expense of reliability, with the externally administered and comprehensively benchmarked A-levels: significant effort towards reliability, at the cost of validity - if only because no such examination assesses the full specification. Standards of reliability are inherited from the tradition of positivist psychology, where they eventually came under criticism for resulting in narrow and artificial measures of phenomena that are, in reality, much more complex (Brown and Knight, 1994, p. 14). It is to this history - starting with Sir Francis Galton (1894), English polymath and vocal eugenicist, noticing the normal distribution of a number of human characteristics - that we owe 'the curve' upon which grading is sometimes done. The authors, drawing on Johnson and Blinkhorn's (1992) work on National Vocational Qualification assessment, argue for a reconceptualisation of reliability as a correct match between reference criteria and the evidence presented - whilst acknowledging that this approach alone may guarantee intra-institution reliability but fall short of the inter-institution reliability required for the quality assurance of standards in HE (Brown and Knight, 1994, pp. 19-21).

### Learning Outcomes

The UK was one of the original signatories, in 1999, of the 'Bologna Process', which created a European Higher Education Area (EHEA), originating in, but separate from, the European Union: it now contains 49 countries, although Belarus and the Russian Federation have seen their rights suspended since 2022 (EHEA, 2025). Along with harmonising higher education cycles (Bachelor's, Master's, Doctoral), the Bologna Process enabled mutual recognition of degrees and the implementation of a system of quality assurance in teaching and learning - the QAA being the agent of this system in the United Kingdom (European Commission, 2022). In fact, Loughlin and colleagues argue that UK reforms coming in the wake of the Dearing report (1997) “pre-empted, and indeed inspired, elements of the Bologna Process” (Loughlin *et al.*, 2021, p. 123).

Core to this programme of accountability has been the rollout of *'intended learning outcomes'*: "statements of what the learner will know, understand and be able to demonstrate after completion of a programme of learning (or individual subject/course)" (Rauhvargers *et al.*, 2009, p. 56, cited in Hadjianastasis, 2017). These, often referred to simply as 'learning outcomes', apply at the institutional level, as broad "graduate attributes", and, more substantively, at the programme (degree) level and then down to individual courses (Biggs, Tang and Kennedy, 2022). Part of the standardisation work of the QAA is the requirement that modules "have a coherent and explicit set of learning outcomes and assessment criteria" (QAA, 2018, p. 1), and by 2007 it was able to declare that “most departments in most institutions, have fully adopted the principles of programme design with respect to learning outcomes” (QAA, 2007, p. 1).

The history of learning outcomes, originally called 'objectives', dates back to the turn of the twentieth century and the era's efforts towards a 'scientific' approach to education, with the work of Thorndike and Bobbitt, "the father of curriculum theory" (Eisner, 1983 \[1967\], p. 551). Bobbitt saw education as mere transmission of knowledge, and his theory of curriculum design, expounded in *The Curriculum* (1918), classified 160 specific objectives in nine areas.
After a temporary shift towards educational progressivism, behaviourist approaches to curriculum design came back into favour with Benjamin Bloom's *Taxonomy of Educational Objectives* (Bloom and Krathwohl, 1956), which classifies learning skills by complexity, from "remembering" to "creating" (in the revised version of Anderson and Krathwohl, 2001). Those *educational* objectives became *instructional* objectives in Mager (1962), a terminology that frames them as a result of instruction; Mager (1962) deliberately avoided the qualifier 'behavioural', given its association with Skinner's (1945) radical behaviourism. Mager's successors did not have such qualms with the adjective, and objectives became *behavioural*, “effect\[ing\] a polarisation of reaction to the notion of an educational objective” in curriculum planning (Allan, 1996, p. 97). By the 1980s, *'objectives'* became *'outcomes'*, "because behavioural objectives \[had\] received a bad press" (Brown, Bull and Pendlebury, 1997, p. 17).

> [!note] The evolution of learning outcomes
> ![[Pasted image 20250925200932.png]]
> (Allan, 1996, p. 101)

Outcomes are broader, including non-behavioural, non-subject-specific outcomes - e.g. "be able to take responsibility for one's learning" (p. 102). At the course level, subject-specific outcomes are broader and more complex than their forebears and, unlike them, not tied to a single performance variable - e.g. "apply knowledge of validity, reliability and triangulation to a chosen research issue" (p. 99).

The use of 'outcomes' is epitomised in John Biggs's significantly influential Structure of Observed Learning Outcomes (SOLO - Biggs, 1979; Biggs and Collis, 1982). This taxonomy relates to Piaget's (1964) stages of cognitive development and owes much to Bloom's (Bloom and Krathwohl, 1956), but is logically distinct from the former (Biggs, 1979, p. 385). Biggs and colleagues (2022) claim that SOLO's taxonomy is hierarchical *unlike Bloom's*, by which they surely must mean its hierarchy is more explicit: each of the five levels must make use of the previous ones. Importantly, SOLO's framing is about the response to a task, with levels defined by the number of "items" of "data" included, then (once full coverage is reached) by the quality of the relations drawn between them (Biggs, 1979, p. 385).

### Constructive alignment

### Authentic assessment

### Equitable assessment

## Workload aspects

TO DO:
- [ ] Search assessment and integrity
- [ ] Search assessment and authorship
- [ ] Search assessment and originality