Inventory Forecasting with AI

3. How to validate this?

The core question before any modelling

Validation must show how the model handles data it has never seen—exactly the situation in production. So the dataset has to be split:

  • Training data – the model learns patterns here.
  • Validation (test) data – we check what the model really knows here.


But where to draw the line? I wanted to feed the model every possible hint (all SKUs and all 36 months) to capture as many patterns as possible. Yet part of the timeline had to stay hidden; otherwise the network would simply memorise history and fail in the real world.

Solution? I built three validation scenarios of increasing strictness, from a “quick check” (the model validates only on data it already saw) to a hard test on truly unseen months.

3 Validation Scenarios

  • Scenario 1 – Smoke test: validation on the same dataset used for training (all 36 months). Goal: make sure the model can find patterns and predict values it has already seen, while watching for overfitting.
  • Scenario 2 – Semi-strict: training on 33 months, validation on the last 6 months (split 3 + 3: three months the model saw during training, three it did not).

    This setup is much closer to real deployment: the model first “walks through” history, learns the patterns, and only then predicts a fresh, unknown period. It lets us:

    • Test generalisation – can the model extend the trend, or does it just echo the past?
    • Spot shifts in seasonality – e.g. will it handle the Christmas peak or other patterns?
    • Check promo impact – are forecasts influenced by planned discount campaigns?
    • Gauge sensitivity to new items – the last months often contain SKUs with a short history; validation shows right away how the model copes with a changing assortment.

  • Scenario 3 – Strict: same training window as Scenario 2; validation = only the last 3 unseen months. I use this as a control check. (A minimal sketch of these splits in code follows below.)
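
For illustration, here is a minimal sketch of how these splits could be built with pandas. The file name and the column names (`sku`, `month`, `qty`) are placeholders for this example, not the actual pipeline used in the project.

```python
import pandas as pd

# Assumed monthly sales table: one row per SKU per month.
# Columns `sku`, `month`, `qty` and the file name are illustrative placeholders.
df = pd.read_csv("monthly_sales.csv", parse_dates=["month"])

months = sorted(df["month"].unique())      # 36 months of history
train_months = months[:33]                 # months 1-33 used for training
val_months_semi = months[-6:]              # Scenario 2: last 6 months (3 seen + 3 unseen)
val_months_strict = months[-3:]            # Scenario 3: last 3 months, never seen in training

train = df[df["month"].isin(train_months)]
val_semi = df[df["month"].isin(val_months_semi)]
val_strict = df[df["month"].isin(val_months_strict)]

# Scenario 1 ("smoke test") simply validates on the full 36-month dataset.
val_smoke = df
```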

Only after I was satisfied with the results did I train the final version on the full dataset (no hold-out), so the model could leverage 100 % of the data for live deployment in the warehouse.

Note: All metrics and charts shown later in the article come from Scenario 2 (unless explicitly stated otherwise). Metrics are calculated on the final 3 + 3 months; visuals use a 12-month window for clarity.

The next graph illustrates the gap between Scenario 1 and Scenario 2. Red = Scenario 1, where the model saw every outcome during training. Green = Scenario 2, where the last three months were hidden. It is clear that unseen data must be included in validation from day one.

Which metric to choose when you can’t manually inspect every SKU?

In day-to-day practice people reach for R², RMSE or MAPE, so why invent anything else? The core trouble is the presence of zeros and the huge spread of values.

Why are zeros an issue?

Imagine the model predicts a turnover of 2 units for a month, while the real sales were zero. A difference of only two pieces, yet a percentage metric such as MAPE explodes, because you divide by zero (or by a tiny number). The same happens with RMSE: a few “small” deviations on high-volume items can inflate the total error and completely hide how well the model performs on key SKUs.
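
To make the zero problem concrete, here is a tiny sketch with made-up numbers (NumPy): a single zero-sales month is enough to push MAPE to infinity, while a volume-weighted error such as WAPE stays finite.

```python
import numpy as np

actual = np.array([0.0, 3.0, 5.0, 4.0])      # one month with zero sales
forecast = np.array([2.0, 3.0, 4.0, 5.0])    # off by only 2 pcs on the zero month

# Naive MAPE: the zero actual puts a zero in the denominator -> infinite error
with np.errstate(divide="ignore"):
    mape = np.mean(np.abs((actual - forecast) / actual)) * 100
print(mape)                                  # inf

# WAPE divides by total volume instead of per-point actuals, so it stays finite
wape = np.abs(actual - forecast).sum() / actual.sum() * 100
print(round(wape, 1))                        # ~33.3 %
```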

Why is high value variability a problem?

| Item | Actual sales | Predicted sales | Absolute error | Relative error |
|---|---|---|---|---|
| A (low-volume) | 5 pcs | 8 pcs | 3 pcs | +60 % |
| B (high-volume) | 5 000 pcs | 4 800 pcs | 200 pcs | –4 % |

  • MAPE shoots up because of item A (60 %), even though we’re talking about ±3 pieces.
  • RMSE instead highlights item B (200 pcs), being highly sensitive to large absolute errors.
  • R² may look superb (99 % of variance explained) thanks to high-volume items, yet tells us little about “small” SKUs.

No single metric is enough. With extreme variability and lots of zeros or small items, I had to switch to a combination of metrics to get a fair view of model quality.

| Metric | What it measures | Note | Goal |
|---|---|---|---|
| WAPE (Weighted Absolute Percentage Error) | What % of total sales the model “missed”. | Sensitive to SKUs with zero sales. | ⬇️ |
| RMSE (Root Mean Squared Error) | On average, by how many units the model was off. | Sensitive to high-volume SKUs. | ⬇️ |
| R² (Coefficient of determination) | How much of the variation in sales the model correctly “explains”. | Strongly influenced by high-volume SKUs. | ⬆️ |
| MAPE (Mean Absolute Percentage Error) | Average percentage error per item. | Sensitive to low-volume SKUs. | ⬇️ |
| Robust score | Composite of accuracy, deviation control and trend-direction match (1 = great, 0 = poor). | Captures whether the forecast follows rising/falling trends. | ⬆️ |
| Stable score | WAPE plus extra checks on deviations and relative error for small sales (0 = great). | Makes sure the forecast is smooth and not overreacting on low-volume items. | ⬇️ |

Goal tells us which direction we want the number to move (low ⬇️ or high ⬆️).
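
For reference, a compact sketch of the standard metrics from the table. These are the textbook formulas, not necessarily the exact implementation behind the numbers later in the article; the Robust and Stable scores are custom composites specific to this project, so they are not reproduced here, and the `eps` guard in MAPE is my own illustrative choice.

```python
import numpy as np

def wape(actual: np.ndarray, forecast: np.ndarray) -> float:
    """Weighted Absolute Percentage Error in % (lower is better)."""
    return float(np.abs(actual - forecast).sum() / actual.sum() * 100)

def rmse(actual: np.ndarray, forecast: np.ndarray) -> float:
    """Root Mean Squared Error in units (lower is better)."""
    return float(np.sqrt(np.mean((actual - forecast) ** 2)))

def mape(actual: np.ndarray, forecast: np.ndarray, eps: float = 1e-9) -> float:
    """Mean Absolute Percentage Error in % (lower is better).

    eps only prevents a hard division by zero; zero-sales rows still
    produce huge percentages, which is exactly the problem described above.
    """
    return float(np.mean(np.abs((actual - forecast) / np.maximum(actual, eps))) * 100)

def r2(actual: np.ndarray, forecast: np.ndarray) -> float:
    """Coefficient of determination (higher is better)."""
    ss_res = np.sum((actual - forecast) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return float(1 - ss_res / ss_tot)
```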

My primary yardsticks are WAPE, RMSE and R².

  • MAPE is not reliable for highly variable SKUs.
  • Robust and Stable serve mainly as “concept checks”; once the model is roughly tuned, those scores hardly change.

Isn’t that too simple?

During training I ran into one more snag:
Dataset-wide scores look fine, but they don’t flag outliers at the SKU level. So I expanded each metric into two flavours:

| Metric version | How it’s computed | What question it answers |
|---|---|---|
| Whole-dataset | All SKUs are concatenated into one long vector, then the metric is calculated. | How many units do we miss in total? → direct financial impact on inventory. |
| Per-SKU average | The metric is computed for every item first, then the item-level values are averaged. | How accurate is the forecast for a typical product? → quality across the entire assortment. |
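
A sketch of the two flavours, using WAPE as the example metric and a pandas DataFrame with assumed columns `sku`, `actual`, `forecast` (illustrative names, not the project’s actual schema). The whole-dataset value treats all rows as one long vector; the per-SKU value averages the metric computed item by item.

```python
import numpy as np
import pandas as pd

def wape(actual: np.ndarray, forecast: np.ndarray) -> float:
    return float(np.abs(actual - forecast).sum() / actual.sum() * 100)

def wape_all(df: pd.DataFrame) -> float:
    """Whole-dataset flavour: all SKUs concatenated into one long vector."""
    return wape(df["actual"].to_numpy(), df["forecast"].to_numpy())

def wape_per_sku(df: pd.DataFrame) -> float:
    """Per-SKU flavour: compute WAPE for each item, then average the item-level values."""
    per_item = [
        wape(g["actual"].to_numpy(), g["forecast"].to_numpy())
        for _, g in df.groupby("sku")
    ]
    return float(np.mean(per_item))
```

One practical consequence of the per-SKU flavour: an item whose actual sales are all zero makes the item-level WAPE divide by zero, so a handful of such SKUs can push the Item-level value far above the whole-dataset one.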

Below is an example of these metrics for a two-layer neural network with a soft-plus normaliser. Note the gap between WAPE and MAPE—the very issue discussed above.

| Metric | Value (ALL) | Value (Item) |
|---|---|---|
| WAPE | 50.1925 | 387.2097 |
| RMSE | 69.0821 | 31.4245 |
| R² | 0.6239 | |
| MAPE | 107.1253 | |
| ROBUST | 0.4552 | 7.7031 |
| STABLE | 0.5869 | |

| Metric | Value (ALL) | Value (Item) |
|---|---|---|
| WAPE | 31.1611 | 44.6504 |
| RMSE | 50.2607 | 26.5974 |
| R² | 0.9084 | |
| MAPE | 75.7044 | |
| ROBUST | 0.5226 | 84.916 |
| STABLE | 0.3901 | |

Visual inspection still matters

Even with a full battery of metrics I discovered that visual examination of the curves is essential. Certain item profiles—special-purpose SKUs, long stretches of zeros, strong seasonality—can hide poor forecasts inside otherwise “good” global scores.

The chart below illustrates the point.
On paper the metrics are acceptable, yet the forecast for the final months (red) collapses to almost zero. This happens frequently with ultra-low-volume items.

| Metric | Red line (ALL) | Red line (Item) | Green line (ALL) | Green line (Item) |
|---|---|---|---|---|
| WAPE | 41.2218 | 61.4738 | 34.7262 | 51.7859 |
| RMSE | 67.3586 | 29.0839 | 48.2791 | 23.7958 |
| R² | 0.7453 | | 0.8691 | |
| MAPE | 82.3626 | | 70.7658 | |
| ROBUST | 0.4432 | 88.1001 | 0.4785 | 77.5447 |
| STABLE | 0.4872 | | 0.418 | |

Note: In every chart the actual sales are drawn in black, forecasts in colour—so the gap is obvious at a glance.
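
For the visual check itself, here is a minimal matplotlib sketch following the same convention (black for actual sales, colour for the forecast). The DataFrame columns `sku`, `month`, `actual`, `forecast` are assumed names for the example, not the project’s actual schema.

```python
import matplotlib.pyplot as plt

def plot_sku(df, sku_id):
    """Plot actual vs. forecast for one SKU (columns sku, month, actual, forecast assumed)."""
    g = df[df["sku"] == sku_id].sort_values("month")
    plt.figure(figsize=(8, 3))
    plt.plot(g["month"], g["actual"], color="black", label="actual")      # actuals in black
    plt.plot(g["month"], g["forecast"], color="tab:red", label="forecast")  # forecast in colour
    plt.title(f"SKU {sku_id}")
    plt.legend()
    plt.tight_layout()
    plt.show()
```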