At the end of a training programme, someone in the organisation will ask whether it worked. At that point, it is usually too late to answer the question properly. The data was never collected. The baseline was never set. The behaviours you wanted to change were never defined precisely enough to measure.

This is the most common mistake in L&D evaluation: treating measurement as something you do at the end, rather than something you design from the very beginning.

The Kirkpatrick New World Model is the most widely used framework for evaluating training effectiveness. Used correctly, it is not just an evaluation tool. It is a design tool, a briefing tool, and a way of setting expectations with stakeholders before a single slide is built. This article explains how to use it that way.

The four levels

Donald Kirkpatrick introduced his four-level evaluation model in the 1950s. The New World Model, developed by James and Wendy Kirkpatrick, updates the original with a stronger emphasis on working backwards from business outcomes, and on what happens after the training rather than during it.

The four levels, listed from the business outcome back to the training room:

Level 4: Results
The degree to which targeted outcomes occur as a result of the training and the support and accountability package. This is the business result: sales figures, error rates, customer satisfaction scores, retention, productivity. The reason the training was commissioned in the first place.

Level 3: Behaviour
The degree to which participants apply what they learned during training when they are back on the job. This is where most programmes quietly fail. Knowledge from the room does not automatically become changed behaviour in the workplace.

Level 2: Learning
The degree to which participants acquire the intended knowledge, skills, attitudes, confidence, and commitment based on their participation in the training. Did they actually learn what the programme set out to teach?

Level 1: Reaction
The degree to which participants find the training favourable, engaging, and relevant to their jobs. The post-training feedback form. What most organisations measure, and often mistake for evidence that the programme worked.

The levels are displayed here in reverse order intentionally. In the New World Model, you start at Level 4 and work backwards. Not because the other levels do not matter, but because everything else only makes sense in relation to the business outcome you are trying to achieve.

Why most organisations stop at Level 1

A score of 4.3 out of 5 on a post-training feedback form tells you that participants found the session pleasant and the facilitator engaging. It does not tell you whether anything changed. It does not tell you whether the knowledge will be applied. It certainly does not tell you whether the business result will move.

Yet Level 1 is where most evaluation effort goes. There are understandable reasons for this. It is cheap. It is immediate. It produces a number that can go in a report. And it is easy to confuse with evidence of impact.

A happy participant is not a changed employee. A changed employee is not necessarily a better business result. Kirkpatrick gives you a framework to track the gaps between all three.

The problem is not that Level 1 measurement is useless. Learner reaction tells you whether the training conditions were right: whether the facilitator was credible, whether the content felt relevant, whether the environment was conducive to learning. That matters. But it is the floor, not the ceiling. If you stop there, you are measuring inputs, not outcomes.

The critical shift: start at Level 4

The New World Model proposes something that sounds simple but changes everything: before you design a single element of a programme, define what success looks like at Level 4.

What business result needs to move? By how much? Over what timeframe? How will you know if it has moved?

If you cannot answer these questions before the programme begins, you cannot answer them at the end either. And if you cannot answer them at the end, you have no way of knowing whether the investment was worthwhile.

This is not about creating bureaucracy. It is about being precise. "We want managers to give better feedback" is not a Level 4 outcome. "We want to reduce voluntary attrition in the product team from 22% to 15% over 12 months, and we believe a significant driver is the quality of manager feedback conversations" is a Level 4 outcome. The difference is enormous, both for programme design and for evaluation.

What to define at each level, and when

The Kirkpatrick New World Model works best when it is treated as a planning tool from the very first conversation about the programme, not as a retrospective exercise after it has been delivered.

Here is what to define at each level, and when:

Level 4: Results
Define the target business outcome
Set a baseline: where is the metric now?
Agree a realistic timeframe for change
Identify who owns the outcome
Agree how it will be measured

Do this: before the design brief is written

Level 3: Behaviour
Define the specific behaviours that need to change
Agree what "changed behaviour" looks like in practice
Identify who will observe and report on behaviour change
Plan manager involvement and briefing
Design on-the-job practice opportunities

Do this: during programme design

Level 2: Learning
Define what learners need to know, be able to do, and believe
Set a pre-training baseline (knowledge check or self-assessment)
Design assessments that test application, not just recall
Build in confidence and commitment measures alongside knowledge

Do this: during programme design

Level 1: Reaction
Design a short, focused feedback form
Ask about relevance and engagement, not just satisfaction
Include a question on intent to apply
Use results to improve delivery, not to prove impact

Do this: at the end of each learning event
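If it helps to make the plan concrete, the checklist above can be held as a small data structure and checked for gaps before the brief is signed off. The following is a minimal sketch, not part of the Kirkpatrick model itself: the class name, the function name, and the example entries are all assumptions made for illustration, with the attrition figures borrowed from the example earlier in this article.

```python
from dataclasses import dataclass, field


@dataclass
class LevelPlan:
    """One Kirkpatrick level in a measurement plan (hypothetical structure)."""
    name: str
    due: str  # when the definitions should be agreed
    # Checklist item -> what has been agreed ("" means not yet agreed)
    items: dict[str, str] = field(default_factory=dict)

    def missing(self) -> list[str]:
        """Return checklist items that have no agreed definition yet."""
        return [item for item, value in self.items.items() if not value.strip()]


def brief_gaps(plan: list[LevelPlan]) -> dict[str, list[str]]:
    """Map each level name to its undefined items, skipping complete levels."""
    return {level.name: level.missing() for level in plan if level.missing()}


# Illustrative entries only; the wording of each value is an assumption.
example_plan = [
    LevelPlan(
        name="Level 4: Results",
        due="before the design brief is written",
        items={
            "target outcome": "voluntary attrition in the product team at 15%",
            "baseline": "22%",
            "timeframe": "12 months",
            "owner": "",
            "measurement method": "",
        },
    ),
    LevelPlan(
        name="Level 3: Behaviour",
        due="during programme design",
        items={
            "target behaviours": "structured feedback conversations",
            "observer": "",
        },
    ),
]

example_gaps = brief_gaps(example_plan)
for level_name, undefined in example_gaps.items():
    print(f"{level_name}: still to agree -> {', '.join(undefined)}")
```

Any level that still shows gaps is a prompt for another stakeholder conversation, not a reason to start building content.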

The conversation this forces

One of the most valuable things about using Kirkpatrick this way is the conversation it forces with your internal stakeholders before the programme begins.

When you ask a business leader to define the Level 4 outcome they expect from a training investment, you quickly discover whether the problem is actually a training problem. Many requests for training are really requests for a solution to something else: a process issue, a management issue, a resourcing issue, or a culture issue that training cannot fix.

If the business leader cannot define what behaviour change they expect, or what business result they are trying to move, that is a signal worth paying attention to. It does not necessarily mean the training should not happen. It means the brief needs more work before it becomes a design.

The single most useful question you can ask before commissioning any training programme: what will be different in this organisation in six months if the programme works? If the answer is vague, the brief is not ready.

Measuring at the end: what to actually collect

Once the programme has run, measurement at each level looks like this in practice.

Level 1 is collected immediately after each learning event: a short form, five to eight questions, focused on relevance, engagement, and intent to apply. Keep it short enough that people actually fill it in honestly rather than clicking through quickly to leave.

Level 2 is measured by comparing pre- and post-training assessments. Not a quiz that tests whether someone remembers a fact, but a task or scenario that tests whether they can apply a skill. The gap between the pre and post scores tells you whether learning happened. It does not tell you whether it will transfer.
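The pre and post comparison can be as simple as the sketch below. It assumes both assessments are scored on the same scale and that scores are recorded per participant; the function name, participant labels, and scores are invented for illustration. Only people who completed both assessments are compared, so a late joiner does not distort the gap.

```python
from statistics import mean


def learning_gain(pre: dict[str, float], post: dict[str, float]) -> dict[str, float]:
    """Summarise the pre/post gap for participants who sat both assessments."""
    shared = sorted(set(pre) & set(post))
    if not shared:
        raise ValueError("No participant completed both assessments")
    gains = [post[person] - pre[person] for person in shared]
    return {
        "participants": len(shared),
        "mean_pre": mean(pre[person] for person in shared),
        "mean_post": mean(post[person] for person in shared),
        "mean_gain": mean(gains),
        "improved": sum(1 for gain in gains if gain > 0),
    }


# Hypothetical scenario-task scores out of 100.
pre_scores = {"participant_a": 40, "participant_b": 55, "participant_c": 70}
post_scores = {
    "participant_a": 65,
    "participant_b": 60,
    "participant_c": 70,
    "participant_d": 80,  # no pre-score, so excluded from the comparison
}

summary = learning_gain(pre_scores, post_scores)
print(summary)
```

A summary like this shows that learning happened in the room. It says nothing yet about Level 3.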

Level 3 is the hardest to measure and the most important. You need to collect evidence that behaviour changed on the job, not in the training room. This means structured conversations with managers three to six weeks after the programme, observation data if the role allows it, or self-reported behaviour logs with specific prompts tied to the skills the programme targeted. The key word is structured: if you leave behaviour measurement to chance, you will get anecdotes, not evidence.

Level 4 is measured against the baseline you set at the beginning. This is why the baseline matters so much. If you did not record the attrition rate, the error rate, or the customer satisfaction score before the programme began, you cannot demonstrate change at the end. The measurement plan has to be built into the project from the start.
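As a worked illustration of why the baseline matters, the sketch below expresses a current reading as a share of the agreed change. It reuses the attrition target from earlier in this article (22% down to 15%); the current reading of 18% and the function name are assumptions made for the example. The calculation shows movement against the baseline only. It says nothing about what caused the movement.

```python
def progress_towards_target(baseline: float, target: float, current: float) -> float:
    """Fraction of the agreed change achieved so far.

    0.0 means no movement from the baseline, 1.0 means the target was met.
    Works whether the metric is meant to go down (attrition) or up (satisfaction).
    """
    if baseline == target:
        raise ValueError("Baseline and target must differ")
    return (current - baseline) / (target - baseline)


baseline_attrition = 22.0  # % at the start, from the example target in this article
target_attrition = 15.0    # % agreed target over 12 months
current_attrition = 18.0   # % hypothetical reading, assumed for illustration

point_change = current_attrition - baseline_attrition
progress = progress_towards_target(baseline_attrition, target_attrition, current_attrition)

print(f"Change from baseline: {point_change:+.1f} percentage points")
print(f"Share of targeted change achieved: {progress:.0%}")
```

Without the 22% recorded at the start, neither figure could be calculated.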

A practical note on attribution

One question always comes up at Level 4: how do you know the training caused the result, rather than something else that happened in the same period?

Honest answer: in most organisational settings, you cannot prove causation with certainty. Business results are influenced by many factors simultaneously, and you rarely have the conditions for a controlled experiment. What you can do is build a credible case for contribution: show the chain of evidence from Level 1 through to Level 4, demonstrate that behaviour changed in the ways the programme targeted, and document any other factors that may have influenced the result.

A credible case for contribution is worth far more than a claim of causation that nobody believes. Stakeholders who commission training understand that organisations are complex. What they want is evidence that the investment made a difference, and a clear account of how you know.

The connection to the 12 levers

Kirkpatrick and the 12 levers of transfer are complementary frameworks. Kirkpatrick tells you what to measure and when. The 12 levers tell you what conditions need to be in place for Level 3 and Level 4 results to occur. Used together, they give you both a design checklist and an evaluation framework.

If your Level 3 results are poor, the 12 levers diagnostic will usually tell you why: manager support was not activated, opportunity to practise was not designed, learner readiness was assumed rather than built. The frameworks point to the same root causes from different directions.

You can read more about the 12 levers, and how to use them as a diagnostic, in "The 12 levers of transfer: a practical guide for HR managers".

Where to start

If you are currently running programmes without Level 3 or Level 4 measurement in place, the most useful first step is to pick one programme and define what you would need to see at each level to consider it a success. Do that exercise before the next delivery cycle begins.

You do not need to build a complex measurement infrastructure overnight. Start with a clear Level 4 target, a Level 3 behaviour definition, and a structured check-in with managers three weeks after delivery. That alone puts you ahead of most organisations.

If you want a second pair of eyes on how your current programmes hold up against the Kirkpatrick framework, that is exactly what the free half-day audit is designed to do.

Want to know if your programmes would pass a Kirkpatrick review?

The free audit is a half-day review of your current L&D setup. We look at what you are measuring, what you are not, and what a well-designed evaluation plan would look like for your context. Honest findings. No obligation.