Computer Adaptive Testing: Background, benefits and case study of a large-scale national testing programme

Computer Adaptive Testing (CAT) is a hot topic amongst the assessment community. However, despite its many benefits, it still isn’t very widely used. In this article, we’re going to give you an overview of CAT, a run-down of some of the benefits and, without too much jargon, an overview of the technology behind it. To help contextualise it, we’re going to reference a recent case study of how the CAT technology in Surpass has been used to deliver an innovative national personalised assessment programme which is changing the shape of national education.

What is a Computer Adaptive Test?

Put simply, a Computer Adaptive Test (sometimes referred to as personalised assessment) is an online test which adapts to the candidate’s ability in real-time by selecting different questions from the bank in order to provide a more accurate measure of their ability level on a common scale.

What is a Computer Adaptive Test like for a candidate?

A personalised assessment pulls questions from a large pool of items that have been carefully calibrated in order to determine their difficulty level (more on this in the next section).

When a candidate begins their test, they are first presented with an item of medium difficulty deemed appropriate for their year group. If they get that question right, the next item they see will be slightly harder; if they get it wrong, they’ll see a slightly easier item. The system is constantly calculating the candidate’s estimated ability depending on what they get right and wrong, and presenting them with a personalised set of items until the level of confidence in the ability estimate has exceeded a pre-defined level (or the maximum number of questions has been presented) and the test ends. As every learner takes a different path through the test, with a different set of questions, they can potentially receive tests of a different length.

In contrast to a linear test, which in some scenarios only gives useful results for learners of average ability, a personalised assessment presents items that are all designed to be challenging for the candidate. The number of easy items presented to high-ability candidates is reduced, as is the number of hard questions presented to low-ability candidates, as neither gives a clear indication of the ability of those learners.

As everything is scored in real-time, at the end of the test the candidate can receive immediate, personalised feedback in the form of ability statements as opposed to a raw score or grade, which provides factual information on their strengths and weaknesses based on the questions they answered.

How does a Computer Adaptive Test work in Surpass?

For a CAT to work, it needs reliable data and a comprehensive item bank with a good spread of content coverage and difficulty level. This means that the item bank must first be calibrated through pre-testing. This is one of the key barriers to CAT, as a larger item bank and extensive work are required to get reliable data before any live tests can be delivered. The general rule is that an item must be exposed a minimum of 200 times before reliable data can be generated. Using this exposure data, Item Response Theory (IRT) is then used to calculate IRT parameters for each of the items in the bank. These IRT parameters include the difficulty of the item and the discrimination of the item, i.e. the factor which determines how much effect an increase in a candidate’s ability has on the probability of them getting that item correct. In Surpass, these values are attached to the items as tags.
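To make the two parameters concrete, here is a minimal Python sketch of the two-parameter logistic (2PL) model, a common IRT formulation. It is an illustration only: the article does not say which exact formula Surpass uses, and the function name `prob_correct` is hypothetical.

```python
import math


def prob_correct(ability, difficulty, discrimination=1.0):
    """Two-parameter logistic (2PL) probability of a correct response.

    Illustrative only: ability and difficulty sit on the same scale, and
    discrimination controls how steeply the probability rises with ability.
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


# A candidate whose ability equals the item's difficulty has an even chance.
print(prob_correct(ability=0.0, difficulty=0.0))  # 0.5

# The same one-unit gain in ability moves the probability further on a
# highly discriminating item than on a weakly discriminating one.
for a in (0.5, 2.0):
    gain = prob_correct(1.0, 0.0, a) - prob_correct(0.0, 0.0, a)
    print(f"discrimination={a}: gain in probability={gain:.3f}")
```

In this sketch, calibration would be the separate step of estimating `difficulty` and `discrimination` for each item from the pre-test responses.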

A test ‘blueprint’ is generated which determines factors such as content coverage of the test. Many more parameters can also be specified, including the minimum and maximum number of items to present and the stopping conditions. An item pool is created which contains all of the items that could appear in the test.

Whereas with a linear test, the system knows which items will be delivered before the test begins, with an adaptive test, an algorithm selects the next item in real time, at the point the candidate clicks the ‘next’ button in the test driver. The algorithm works to the blueprint to ensure good coverage of all content areas and controls item exposure across the bank as a whole (so that some items aren’t presented more frequently than others), meaning the entire item bank is most efficiently used. The algorithm is capable of supporting up to three IRT parameters – difficulty, discrimination, and guessing.
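As a rough illustration of the selection step, the sketch below picks the unseen item that carries the most statistical information at the candidate's current ability estimate, using the three-parameter logistic (3PL) model. This is a textbook approach and not necessarily the Surpass algorithm. The names `Item`, `information` and `select_next_item` are hypothetical, and blueprint constraints and exposure control are deliberately left out.

```python
import math
from dataclasses import dataclass


@dataclass
class Item:
    item_id: str
    difficulty: float             # b parameter
    discrimination: float = 1.0   # a parameter
    guessing: float = 0.0         # c parameter, must be below 1


def prob_correct(ability, item):
    """3PL probability of a correct response."""
    z = item.discrimination * (ability - item.difficulty)
    logistic = 1.0 / (1.0 + math.exp(-z))
    return item.guessing + (1.0 - item.guessing) * logistic


def information(ability, item):
    """Fisher information of a 3PL item at the given ability."""
    p = prob_correct(ability, item)
    c = item.guessing
    return item.discrimination ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def select_next_item(ability, pool, seen_ids):
    """Return the most informative unseen item, or None if the pool is used up."""
    candidates = [item for item in pool if item.item_id not in seen_ids]
    if not candidates:
        return None
    return max(candidates, key=lambda item: information(ability, item))


pool = [Item("easy", -2.0), Item("medium", 0.0), Item("hard", 2.0)]
print(select_next_item(0.3, pool, seen_ids=set()).item_id)        # medium
print(select_next_item(0.3, pool, seen_ids={"medium"}).item_id)   # hard
```

A production selector would also filter `candidates` by the blueprint's content areas and down-weight over-exposed items before taking the maximum.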

In Surpass, all of this clever logic happens within just 300 milliseconds of the learner selecting ‘Next’ to move to the next question, meaning there’s never a noticeable delay for the candidate. The algorithm continues until the candidate’s ability has been estimated to the required level of accuracy.
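One common way to express "the required level of accuracy" is a target standard error on the ability estimate. The sketch below is a hedged illustration of that idea, not the Surpass implementation. It assumes a 2PL model, a standard normal prior and a simple grid-based expected a posteriori (EAP) estimate. The thresholds `target_se`, `min_items` and `max_items` are made-up defaults standing in for blueprint settings.

```python
import math


def estimate_ability(responses):
    """Return (estimate, standard_error) using grid-based EAP estimation.

    `responses` is a list of (difficulty, discrimination, correct) tuples.
    Assumes a 2PL model and a standard normal prior (illustrative choices).
    """
    grid = [g / 10.0 for g in range(-40, 41)]  # ability values from -4.0 to 4.0
    weights = []
    for theta in grid:
        log_w = -0.5 * theta * theta  # log of the unnormalised normal prior
        for difficulty, discrimination, correct in responses:
            z = discrimination * (theta - difficulty)
            # log P(correct) = -log(1 + e^-z); log P(incorrect) = -log(1 + e^z)
            log_w -= math.log1p(math.exp(-z if correct else z))
        weights.append(math.exp(log_w))
    total = sum(weights)
    mean = sum(t * w for t, w in zip(grid, weights)) / total
    variance = sum((t - mean) ** 2 * w for t, w in zip(grid, weights)) / total
    return mean, math.sqrt(variance)


def should_stop(responses, target_se=0.3, min_items=5, max_items=30):
    """Stop once the estimate is precise enough or the item limit is reached."""
    if len(responses) < min_items:
        return False
    if len(responses) >= max_items:
        return True
    _, standard_error = estimate_ability(responses)
    return standard_error <= target_se


answers = [(0.0, 1.2, True), (0.6, 1.0, True), (1.1, 1.5, False)]
estimate, se = estimate_ability(answers)
print(f"ability={estimate:.2f}, standard error={se:.2f}, stop={should_stop(answers)}")
```

Because each response narrows the estimate by a different amount, two candidates can reach the target precision after different numbers of items, which is why test lengths vary.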

The Surpass team have worked hard to ensure the system can handle these large volumes of data without affecting performance. Microsoft Azure apps have been utilised which are automatically scalable depending on anticipated volumes, and throughput (number of requests per second) has been tested at volumes much higher than those currently being delivered.

One of the key benefits of the adaptive assessment delivered through Surpass is that not only can you make use of the standard reporting functionality, but bespoke reports can be defined and generated via the Surpass API, making use of all of the rich data that is produced from an adaptive test. Reports can show individual candidate journeys throughout the test, as well as results at group, class or even national level.

What are the benefits of CAT over paper-based testing?

There are numerous benefits to CAT over paper-based testing for formative assessment (providing the item bank has been properly calibrated) including:

Precise information for candidates of all ability

Traditional linear tests, where all candidates receive the same set of items, only ever really challenge the middle third of learners. A CAT is designed to challenge learners of all ability levels, providing an accurate and useful picture of learner ability for everyone.

Decrease in teacher workload

Many school-level tests are still delivered on paper, which presents significant workload for teachers with the marking and administration of results. Immediate scoring and accurate learner-specific feedback gives teachers more time to focus on teaching, and implementing feedback to help their students to progress.

Potential for on-demand

With an on-screen personalised assessment, there’s no restriction to deliver within the paper test window, meaning tests can be delivered for diagnostic purposes at any point throughout the year when the teacher feels it’s suitable. As every learner receives a personalised test, there’s no need for the cohort to all sit the test at exactly the same time.

More accurate feedback that can be actioned immediately

More accurate feedback can be provided immediately after the test in the form of competency-based ability statements rather than a score. This indicates to the candidate areas they have done well on, and areas they may need to improve. This kind of feedback is more useful in formative assessment, demonstrating to learners that there are areas to progress to, or constructive guidance on where to improve. Teachers can also see the performance of a class as a whole, indicating areas they may need to focus their teaching on.

Learner engagement

With questions that challenge learners of all ability, learner engagement throughout the test is better maintained. Low-achievers are encouraged, and high-achievers are challenged. Adaptive assessments can also take less time to complete than a traditional linear test, with an accurate ability measurement reached in a shorter time.

Using CAT for a large-scale national testing programme in the UK: A Case Study

At the 2019 Surpass Conference, Gavin Busuttil-Reynaud from AlphaPlus updated the Surpass Community on the use of adaptive tests built in Surpass for a large-scale national testing programme of primary and secondary school children in Wales. Some of the key points are summarised here, or you can catch up on the presentation in full by watching this video.

After introducing national testing for schoolchildren in Wales (UK) on paper in 2013, a feasibility study was conducted early on to determine how it could be delivered on-screen. In 2018, the phased transition of these tests to computer adaptive tests began, starting with procedural numeracy, to be followed by reading and numerical reasoning. This is considered revolutionary given that paper-based testing still dominates global government testing programmes. Back in 2004, Ken Boston, then head of the Qualifications and Curriculum Authority, stated that ‘on-screen assessment will shortly touch the lives of every learner in the country’, with one of his objectives for the next 5 years being that ‘all new qualifications would include an option for on-screen assessment.’ As we know, 15 years on, this is not the case, with many qualifications still delivered solely on paper, which makes the achievements of the project in Wales even more remarkable, particularly for pre-16 assessment.

In the first year alone, 268,000 learners have sat a personalised assessment in procedural numeracy, which equates to 96% of the cohort of learners in years 2-9 in Wales, matching the completion rate of the paper tests.

The introduction of on-screen assessment also saw a significant reduction in the number of modified papers required. In 2018, over 4000 modified papers were ordered for this test; in 2019, this was reduced to just 357 modified large print and 12 braille assessments.

The assessment can be self-scheduled, giving teachers the flexibility to use it for diagnostic purposes at any point in the year. However, in the first year, many schools stuck to the traditional end of term testing period, although it’s possible that this practice will change in future as teachers become more familiar with these tests.

How has this new way of testing been received by teachers?

There are many benefits to personalised assessments in this scenario, as detailed in the section above. AlphaPlus have received positive feedback from teachers for the procedural numeracy assessment pilot which has been the focus of this case study. A teacher questionnaire revealed that 78% thought that learners were engaged, 83% thought the assessments were the right length, and over 60% found the learner and feedback reports to be useful.

However, during his 2019 Surpass Conference presentation, Gavin observed that there are still some barriers to overcome as the mindset shifts from paper-based testing. With a personalised assessment, the algorithm stops once it can confidently give an ability estimate, so some learners see more questions than others, which wouldn’t happen on a paper test.

“There is part of our paper culture that is so deeply ingrained that fairness is about doing exactly the same for all people, even if it’s a terrible fit for some of those people…the personalisation message has not got through to all of the teachers yet.”
Gavin Busuttil-Reynaud, AlphaPlus

Additionally, since a CAT is designed to challenge the high-ability learners, candidates can be presented with questions from older age groups that they haven’t been formally taught. While the objective of this is to show learners what they can move on to, or even demonstrate capabilities beyond their age group, Gavin went on to observe:

“Some teachers embrace this… others think it’s terrible that a learner had been asked something they won’t be taught until next year and think their teaching is being judged on something they haven’t been taught yet… There’s still a massive cultural journey for everyone to go on because these tests are so different from current practice, but the primary purpose of all this is to provide some detailed feedback.”
Gavin Busuttil-Reynaud, AlphaPlus

The priority of these tests is to inform teaching and learning with detailed reports based on all the available data designed to help teachers identify areas for improvement, and they are not used as a school accountability measure. No score is given on the learner report, just factual statements to highlight strengths and weaknesses.

Provided reliable data is available, the teacher receives a skills profile for their class, giving them an indication of where to focus their teaching. They also receive learner journey charts, which show the path each learner took through the test and can reveal patterns of learner behaviour.

Rob Nicholson, Headteacher of Borras Park Community School whose learners have sat these assessments commented:

“The personalised assessments can be used alongside other forms of assessment that schools have…it can be used to just solidify scores and assessments and knowledge of the child.”
Rob Nicholson, Headteacher of Borras Park Community School

How have the personalised assessments been received by learners?

For this project, the team were mindful of the young age of the learners, and so the Surpass test driver was customised to simplify the interface and create the best possible experience. The tests could be delivered on desktop computers, laptops, or tablet devices, which was important due to the inconsistency of hardware available in schools across the country.

Every candidate is challenged by the questions presented to them, so they can demonstrate what they know rather than what they don’t, with the algorithm designed so learners get 50% of items right, and 50% wrong. For the first time, some high achievers found questions they were unfamiliar with, while the lower achievers gained confidence by being able to answer some of the questions.
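One way to see where a figure of around 50% could come from is to assume a two-parameter logistic model with no guessing. The article does not state which model or selection rule produces this target, so treat the following as an illustrative assumption. Here $\theta$ is the learner's ability, $b$ the item's difficulty and $a$ its discrimination:

```latex
P(\text{correct} \mid \theta) = \frac{1}{1 + e^{-a(\theta - b)}}
\quad\Longrightarrow\quad
P(\text{correct} \mid \theta = b) = \frac{1}{1 + e^{0}} = \frac{1}{2}
```

Under that assumption, an item whose difficulty matches the learner's current ability estimate is answered correctly about half the time, whatever its discrimination.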

“For the learners at the lower end of the ability spectrum, typically, when they were doing the paper test, they would get somewhere between 90-95% of the items wrong. What an incredibly dispiriting experience. But they come out of this adaptive test going, I could do it!…And the high flyers who would whiz through a paper test in ten minutes suddenly now say, ‘that was a difficult test, I had to think’…at least it’s making them realise there’s something else to move on to.”
Gavin Busuttil-Reynaud, AlphaPlus

Learners are generally unfazed by a move to on-screen, as Jenny Jones, Deputy Headteacher of Borras Park Community School, observed:

“They’re used to working online, they’re used to using their iPads or the computers so they feel confident using them. It’s a fun activity.”
Jenny Jones, Deputy Headteacher of Borras Park Community School

There have also been benefits for learners with a visual impairment or accessibility requirements, who would usually require a modified version of the paper test. The only real difference is where diagrams are included, in which case a simplified version or braille version is provided in a paper booklet. Accessibility tools such as a magnifier and screen reader mean that the on-screen test is accessible to as many people as possible. AlphaPlus have worked with visually impaired learners and conclude that learners ‘wholeheartedly prefer the online versions’, are unfazed by accessibility tools as it’s their usual way of working, and welcome being able to work at a computer the same as everyone else.

Conclusion

The case study of a successful national CAT implementation in the UK demonstrates that this type of testing can be introduced, and can have significant benefits over fixed tests, particularly in a formative setting. Shorter, personalised tests with learner-appropriate content provide greater learner engagement and a better learner experience. The results are processed faster, so they can be reviewed with the learner whilst their assessment experience is still fresh in their mind.

Psychometrically valid results, along with rich data on every candidate, give a greater understanding of what learners are capable of and, used in conjunction with other indicators, can better inform teaching and learning and give the best possible opportunities for learner progression.

Commenting on the work with schoolchildren in Wales, Roger Murphy, Emeritus Professor of Education at Nottingham University stated:

“It’s a feature of the education system in Wales which is being watched very closely by many countries all round the world.”
Roger Murphy, Emeritus Professor of Education at Nottingham University

However, it should be noted that CAT is not going to be appropriate in all scenarios. CAT is limited to objective question types, restricting the type of skills that can be tested, and the generally accepted view is that producing a CAT is expensive. Maybe, as assessment technology progresses even further, functionality such as automatic item generation could mitigate some of the cost implications around creating larger item banks. Ultimately, the cost to produce must be weighed up against the benefits to determine whether CAT is the right way to go for your testing programme.

If you’re interested in learning more about personalised assessments in Surpass, please speak to your Surpass Account Manager.