
AI and automatic item generation (AIG) became the ‘it’ words at conferences more than a decade ago.
Back then, the issues I was dealing with in high-stakes testing were mostly focused on the cognitive burden and workload of my key subject matter experts (SMEs) who were vital to the success of my team.
I argued that I would rather AI focus on reducing the time and effort that went into assembling test forms than on producing a plethora of items that would create a backlog still requiring technical review and revision – work that the SMEs would ultimately have to do.
Today, it is gratifying to see AI being used in ways beyond AIG to support test development efforts (e.g. enemy item detection, semantic search, test construction). However, AI alone can’t fix a broken system.
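To make one of those uses concrete, enemy item detection is often framed as a similarity problem: items whose stems overlap too closely may cue one another and should not appear on the same form. The sketch below is purely illustrative – it is not how Surpass implements the feature – and the embedding model, item text, and similarity threshold are all assumptions.

```python
# Illustrative only: flag potential "enemy" items (pairs that overlap or cue each
# other) by comparing item stems with an off-the-shelf embedding model.
# Model name, sample items, and threshold are assumptions, not Surpass settings.
from itertools import combinations

from sentence_transformers import SentenceTransformer, util

items = {
    "ITEM-001": "Which muscle is primarily responsible for elbow flexion?",
    "ITEM-002": "Identify the primary muscle involved in flexing the elbow.",
    "ITEM-003": "What is the normal range for an adult resting heart rate?",
}

model = SentenceTransformer("all-MiniLM-L6-v2")
ids = list(items)
embeddings = model.encode([items[i] for i in ids], convert_to_tensor=True)
similarity = util.cos_sim(embeddings, embeddings)

THRESHOLD = 0.80  # hypothetical cut-off; tune against SME-confirmed enemy pairs

for a, b in combinations(range(len(ids)), 2):
    score = float(similarity[a][b])
    if score >= THRESHOLD:
        # Surface the pair for human review rather than acting on it automatically.
        print(f"Possible enemy pair: {ids[a]} and {ids[b]} (similarity {score:.2f})")
```

Note that the output is a queue for human review, not an automated decision – which is exactly the ‘humans in the loop’ point developed later in this article.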
“Just like a new roof isn’t going to help a house with a crumbling foundation, AI can’t turn a flailing testing program into a successful one.”
Laura Steighner, Director, Test Development Customer Experience
Underneath the excitement around AI, a testing program already needs to be getting the basics right. Just like a new roof isn’t going to help a house with a crumbling foundation, AI can’t turn a flailing testing program into a successful one. Assessment software needs to recognize and reduce the pain points for testing programs. So what do assessment systems and testing programs need to do to be successful?
Ultimately for testing programs, the answer involves establishing the right policies and procedures and actually implementing and enforcing them. But even with the right policies and procedures in place, something else is more critical still: the right people. AI can streamline the work, but we still need humans to make sure the items and tests we produce meet our quality standards, ensuring that items are correctly keyed, technically accurate, and free of bias.
In Spring 2025, I joined Surpass as the Director of Test Development Customer Experience and have had the opportunity to talk to many Surpass customers about their testing programs, how they leverage the tools at hand, and how we could better support their process. Customers included testing programs that develop K-12 or higher education formative and summative assessments, as well as professional licensing and certification programs in industries such as healthcare and business.
Despite the differences between these customers and their testing programs, their feedback coalesced around the usability of software and services: ensuring a smooth process for working with their SMEs and capturing their authoring, edits, reviews, or other approvals in an intuitive and, I should emphasize, accurate way.
Their feedback on what they want to see from assessment systems can be categorized into four main areas:
- Data control: Customers should be able to manipulate their own data (e.g. export their metadata to analyze the health of their item bank – see the sketch after this list).
- Content-first views: On-screen views should prioritize content and metadata (e.g. users should be able to collapse/expand to maximize the important information on the screen).
- Scalability: Software development should account for scale so that usability does not falter and features continue to work well when concurrent writing and reviewing sessions grow from tens to hundreds (and in some cases, thousands!) of items.
- Efficient navigation: Navigation and functionality should be efficient and intuitive to users, including those who access the software infrequently like SMEs.
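On the data control point, a minimal sketch of what customers mean by “analyzing the health of their bank” is shown below. It assumes a hypothetical exported file and column names (status, content_area, p_value, last_review_date); real exports will differ by program and platform, and this is not a description of any Surpass export format.

```python
# Illustrative only: a quick "bank health" summary from an exported metadata file.
# File name and column names are hypothetical placeholders.
import pandas as pd

bank = pd.read_csv("item_bank_metadata.csv", parse_dates=["last_review_date"])

# How many items sit in each workflow status per content area?
coverage = bank.pivot_table(index="content_area", columns="status",
                            values="item_id", aggfunc="count", fill_value=0)

# Flag items that are statistically too easy/hard or overdue for review.
stale_cutoff = pd.Timestamp.today() - pd.DateOffset(years=3)
flags = bank[(bank["p_value"].lt(0.30) | bank["p_value"].gt(0.95))
             | (bank["last_review_date"] < stale_cutoff)]

print(coverage)
print(f"{len(flags)} of {len(bank)} items flagged for SME review")
```

The point is not the specific metrics – programs will have their own thresholds – but that an export of clean metadata makes this kind of routine health check possible at all.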
In response, Surpass has developed new enhancements in its recent releases, including the Item Authoring Task in the 25.4 Surpass release, which enables a more collaborative item-writing session between a SME and the lead item development manager.
As for the wider issue of how testing programs can evolve and improve, I recently returned from the Institute for Credentialing Excellence (ICE) Exchange annual conference, where AI was still the overwhelming theme. However, more than ever, the conversation centered around the need to keep ‘humans in the loop’ throughout the test development process. In part, humans are needed because AI is still generating significant hallucinations and biases, but for some programs human oversight is also required by accreditation requirements established by the National Commission for Certifying Agencies (NCCA) and, soon, by the ISO/IEC 17024 personnel certification standard.[1]
Surpass Copilot is an example of how item banking software can integrate AI with the item development process, where humans are not only in the loop, but are also at the controls.
In addition to item writing, SMEs are responsible for ensuring technical accuracy, spotting nuance in items, identifying alternative ways an item can be interpreted, assessing for negative impact on subsets of the candidate pool, reasoning through item performance issues, establishing cut scores, etc.
Interestingly, several presenters at the ICE Exchange referenced the concept of ‘infusion’ rather than ‘generation.’ SME-generated items rarely go directly onto an exam for pre-testing without a significant amount of review and editing. In many cases, the draft submitted by a SME looks nothing like the item that makes it onto the exam.
Likewise, an AI-generated item also needs to go through review and editing (the results from a Surpass case study demonstrated that the number of edits required on AI-infused items was lower than for items created without AI – we’re looking forward to sharing that case study in the coming months).
As a result, while the item may have been drafted by AI, humans with the appropriate technical expertise make sure that the final item meets technical, quality, and fairness standards. The resulting item should no longer be considered ‘AI-generated’ but rather ‘AI-infused.’ AI may play a role, but humans ensure the item is ready for testing.
This past year, we saw significant public criticism of the use of AI-generated items on a licensing exam. High-stakes testing programs have some of the most highly motivated and stressed-out candidates. These test takers expect an error-free testing experience with high-quality items. When they encounter mistakes, these candidates become vocal. And then to hear that some of the items were generated by AI and not by qualified SMEs…that didn’t go over well.
For accuracy’s sake, we should consider changing the way we refer to AI-generated items, because testing professionals (the conservative, risk-averse group that we are) are not promoting the use of AIG without human oversight. The process for quality assurance should be the same whether the item development process starts with AI or a SME. A change of terminology could convey that distinction.
Regardless, testing programs need to be more transparent about how AI is used to support item and test development and how they involve SMEs to ensure that items meet their technical and quality standards.
Remember, AI is a tool. It’s a great tool. But ultimately, our testing programs are lost without our SMEs – they are the foundation for a strong testing program.
[1] An updated version of the ISO/IEC 17024 standard on personnel certification is forthcoming in March 2026. The revised standard will include requirements regarding the appropriate use of AI by certifying bodies and the need for human oversight.
Find out more
To read more about Surpass Assessment’s approach to ethical AI, and to create better test questions, faster, for less cost, check out Surpass Copilot.
Discover enterprise-grade AI that delivers instant, expert-level feedback on real examination questions – see how Surpass Tutor can transform learning through assessment, building on decades of expertise in professional assessment to benefit learners.







