Inventory Forecasting with AI

8. Generalization – Including Items with Incomplete History

Full Dataset Test

In the previous chapter, we confirmed that the model performs “flawlessly” for the 70% of the assortment with a complete 36-month history.
Now it’s time for a real challenge: what happens when we unleash the model on the entire dataset, including products with shorter or even nonexistent history?

| History Length | Number of Items | Share of Assortment |
| --- | --- | --- |
| 36 months (full history) | 10,105 | 70.00% |
| 12–35 months | 2,668 | 18.00% |
| Less than 12 months | 1,211 | 8.00% |
| No history | 519 | 4.00% |

All the following experiments use the same model I built in earlier chapters, including all feature-engineering improvements and outlier control. I'm still using Scenario 2: 33 months for training, 3 months for validation.
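For concreteness, here is a minimal sketch of that split in Python, assuming a dense long-format table with one row per SKU per month (months before an item's launch carried as NaN); the column names `sku`, `month`, `qty` and the file path are placeholders, not the project's actual schema:

```python
import pandas as pd

# Hypothetical dense long-format sales table: one row per SKU per month,
# with months before an item's launch carried as NaN in `qty`.
sales = pd.read_parquet("monthly_sales.parquet")  # placeholder path

# Scenario 2: of the 36 available months, the first 33 train the model
# and the last 3 are held out for validation.
months = sorted(sales["month"].unique())
train_months, valid_months = months[:33], months[33:]

train_df = sales[sales["month"].isin(train_months)]
valid_df = sales[sales["month"].isin(valid_months)]
```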

I tested three training variants, while always validating on the full dataset:

| Variant | Training Includes | New Items in Validation |
| --- | --- | --- |
| A – Full history | Only items with complete history | 468 |
| B – Min. 1 month | All SKUs with at least 1 historical data point | 73 |
| C – No restrictions | All items, including those with no historical data | 0 |

Note: For example, in the Full history variant (A), validation contained 468 new items that the model had never seen during training.
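Building the three training sets is then a simple filter on per-SKU history length. A sketch, reusing the frames from the split above (column names again assumed):

```python
# Number of months with a real sales record per SKU in the training window.
history_len = (
    train_df.dropna(subset=["qty"])
            .groupby("sku")["month"]
            .nunique()
            .reindex(sales["sku"].unique(), fill_value=0)
)

# Variant A – only SKUs whose history spans the whole training window.
train_a = train_df[train_df["sku"].isin(history_len[history_len == len(train_months)].index)]
# Variant B – every SKU with at least one real data point.
train_b = train_df[train_df["sku"].isin(history_len[history_len >= 1].index)]
# Variant C – no restrictions: all SKUs, even those with zero history.
train_c = train_df
```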

What I’m Analyzing

  • Global performance metrics (R², WAPE, RMSE, etc.) across variants A, B, and C – see the WAPE sketch after this list
  • Impact on items with full history – Does adding “data-poor” items hurt performance on well-covered SKUs?
  • Behavior of new or short-history items – Can the model transfer knowledge from long time series, or do predictions still collapse to zero or behave randomly?
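Where the tables report a metric twice, “Value (ALL)” is pooled over every validation row and “Value (Item)” is an average of per-SKU scores. For reference, WAPE at both levels can be computed like this (the per-item variant shown here is an unweighted mean of per-SKU WAPEs – one plausible definition, not a documented formula):

```python
import numpy as np
import pandas as pd

def wape(actual: np.ndarray, forecast: np.ndarray) -> float:
    """Weighted Absolute Percentage Error in percent: sum(|error|) / sum(actual)."""
    return 100.0 * np.abs(actual - forecast).sum() / np.abs(actual).sum()

def wape_per_item(df: pd.DataFrame) -> float:
    """Unweighted mean of per-SKU WAPEs: a slow mover counts as much as a bestseller."""
    per_sku = df.groupby("sku").apply(
        lambda g: wape(g["qty"].to_numpy(), g["forecast"].to_numpy())
    )
    return float(per_sku.mean())
```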
| Training variant | WAPE (ALL) | WAPE (Item) | RMSE (ALL) | RMSE (Item) | R² | MAPE | ROBUST (ALL) | ROBUST (Item) | STABLE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| A – Full history | 41.0748 | 133.7093 | 48.8867 | 24.5721 | 0.7952 | 131.7516 | 0.4741 | 158.7209 | 0.5013 |
| C – No restrictions | 38.4147 | 93.4879 | 57.3472 | 23.2029 | 0.7408 | 70.8846 | 0.4785 | 78.5382 | 0.4613 |
| B – Min. 1 month | 34.6338 | 58.0287 | 35.4181 | 18.5031 | 0.8511 | 68.2377 | 0.5312 | 77.5052 | 0.3681 |

Summary of Results

  • Training on full-history SKUs only (first row): Every SKU used in training had a complete 36-month history. However, the model failed dramatically when asked to predict new or unseen SKUs – with Item-level WAPE and MAPE both exceeding 130%.
  • Training on all SKUs, including those with zero history (second row): This variant did not improve global metrics and actually increased RMSE. Including completely unknown products in training added significant noise to the learning process.
  • The optimal variant was training on all items with at least 1 month of history (third row): This approach substantially reduced prediction error while maintaining a strong R² score. I selected this configuration as my new baseline for further development.

Visual Inspection

As in previous chapters, I manually reviewed forecasts for several problematic SKUs – this time expanded to include items with short or no history.

Displayed models:

  • 🔴 Red – Model trained on full-history items only
  • 🟢 Green – Model trained on all items, including those with zero history
  • 🟣 Purple – Model trained on items with at least 1 month of history

Note: To maintain context on each item’s history length, all plots show the full 36-month window.

For items with full history, all three models produced nearly identical forecasts. This means that even when items with missing history are added to training, the model remains stable – thanks to the weight feature, it learns to skip empty periods without distorting the signal.
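The weighting trick can be sketched in a few lines: months before an item's first recorded sale get zero weight, so they contribute nothing to the loss. A minimal sketch with the same assumed columns as before:

```python
# First month with a real sales record per SKU (NaT for zero-history SKUs).
first_sale = sales.dropna(subset=["qty"]).groupby("sku")["month"].min()

# Weight 1.0 from the first real sale onward, 0.0 before it.
# Comparing against NaT yields False, so zero-history SKUs also get 0.0.
sales["weight"] = (sales["month"] >= sales["sku"].map(first_sale)).astype(float)

# Most training APIs accept such weights directly, e.g. scikit-learn style:
# model.fit(X_train, y_train, sample_weight=w_train)
```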

Interestingly, models trained on mixed-history items (especially the purple one) often predicted higher and more accurate seasonal peaks than the model trained solely on complete series. Shorter time series likely provided extra contextual signals that helped the network generalize seasonal patterns more boldly – and, in many cases, more accurately.

For items with partial history, the red model (trained only on full-history data) performed significantly worse. Since it had no access to a single real record for these SKUs during training, it had no understanding of their typical turnover. As a result, the model could only guess based on similarity to other products — often leading to inaccurate predictions.

We see a similar drop in performance for items with very short history. Once again, the red model had nothing to grab onto – no useful data to estimate demand.  The other two models, while also working with limited history, were still able to estimate the item’s turnover level and get much closer to reality.

In the second visual, you can clearly see how the green and purple models successfully “pull up” the prediction based on similar SKUs.  Even without direct historical data, they predict a sales spike in the second month, simply because similar products in the same group show the same pattern.  In other words, the model correctly transferred knowledge from richer time series instead of defaulting to zero.

When no historical data is available at all, predictions become little more than educated guesses into the void. Without a real sales anchor, the models default to some kind of segment average – which leads to significant underestimation for some items and overestimation for others. Without at least one actual sales data point, the forecast remains highly uncertain and difficult to interpret.


How Model Performance Varies by History Length

To get a more detailed view, I validated performance across five history-length buckets, with the full dataset as a reference point. The following insights are based solely on the model trained on items with at least one month of sales history (the purple line in the previous charts).
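A sketch of how such a per-bucket breakdown can be produced, reusing `wape()` and the frames from the earlier sketches and assuming the model's predictions have been merged into `valid_df` as a `forecast` column; the bucket edges mirror the table below:

```python
# History length per SKU over the full 36-month window.
full_hist = sales.dropna(subset=["qty"]).groupby("sku")["month"].nunique()

# Bucket each validation row by the owning SKU's history length.
buckets = pd.cut(
    valid_df["sku"].map(full_hist).fillna(0),
    bins=[-1, 0, 6, 12, 35, 36],
    labels=["No history", "1-6 months", "6-12 months", "12-35 months", "36 months"],
)

bucket_wape = valid_df.groupby(buckets, observed=True).apply(
    lambda g: wape(g["qty"].to_numpy(), g["forecast"].to_numpy())
)
print(bucket_wape)
```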

| History bucket | WAPE (ALL) | WAPE (Item) | RMSE (ALL) | RMSE (Item) | R² | MAPE | ROBUST (ALL) | ROBUST (Item) | STABLE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Full dataset (100% of items) | 34.6338 | 58.0287 | 35.4181 | 18.5031 | 0.8511 | 68.2377 | 0.5312 | 77.5052 | 0.3681 |
| 36 months (72% of items) | 33.4331 | 45.3934 | 38.1581 | 19.3561 | 0.8535 | 61.9222 | 0.5282 | 62.6903 | 0.3505 |
| 12–35 months (18% of items) | 36.5081 | 49.183 | 23.4648 | 13.58 | 0.7525 | 67.8013 | 0.5365 | 66.6689 | 0.4093 |
| 6–12 months (4% of items) | 37.7617 | 55.3109 | 36.3108 | 11.9183 | 0.7767 | 69.1417 | 0.5478 | 69.4062 | 0.4296 |
| 1–6 months (4% of items) | 56.1651 | 61.0081 | 46.4892 | 34.9088 | 0.5277 | 98.2678 | 0.5247 | 105.5108 | 0.5151 |
| No history (2% of items) | 117.1288 | 884.835 | 45.6377 | 33.9161 | -0.0826 | 636.876 | 0.4238 | 940.0516 | 1.3529 |

Key Takeaways:

  • Stable metrics across history buckets – The model performs consistently well up to the 6–12 month category. This shows it can handle incomplete history quite effectively. The only notable drop is in R², which is expected due to the lack of historical variance.
  • Lower RMSE in 12–35 and 6–12 month groups – These items tend to have moderate turnover, so absolute prediction errors are smaller, which brings down RMSE.
  • Short histories (less than 6 months) – Performance begins to degrade here. The model lacks a sense of typical turnover, and must rely more heavily on similarity with other items. Errors grow quickly — but still remain within reasonable bounds.
  • Zero history – Metrics are heavily skewed. The model uses group-level signals, but lacks any concrete context for the individual SKU, resulting in unstable outputs.

Mission: “Short-History Items”

It became clear that I needed to focus primarily on SKUs with short histories. For brand-new products, with no sales records and no signal about typical volume, the model simply cannot guess effectively. That’s why I set a practical minimum requirement: At least one real month of sales history. From this single data point, the model can infer an approximate scale of demand – and then reconstruct the rest of the pattern from similar items using group relationships and embedding vectors.

To achieve this, I began testing two approaches:

  • Group-level features – Every SKU belongs to a group and receives its own embedding vector. I aim to enrich this with explicit signals from the group, such as average turnover, the group's seasonal shape, etc.
  • Synthetic history – For new items, I generate a “warm-up” period: the first few months are filled in using estimates based on similar SKUs. These synthetic data points are used only for feature calculation – this gives the model a rough idea of the curve shape (trend, seasonality) without affecting the target values used in training. Both ideas are sketched in code after this list.
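A combined sketch of both ideas as one might prototype them, assuming each row additionally carries a `group` label and that `month` is a datetime column; none of these names come from the actual project:

```python
# (1) Explicit group-level signals to sit alongside the learned embedding.
group_avg = (
    sales.groupby(["group", "month"])["qty"].mean()
         .rename("group_avg_qty").reset_index()
)
sales = sales.merge(group_avg, on=["group", "month"], how="left")

# Seasonal shape of the group: each calendar month's share of the group's total.
sales["moy"] = sales["month"].dt.month
shape = (
    sales.groupby(["group", "moy"])["qty"].mean()
         .rename("group_moy_avg").reset_index()
)
shape["group_seasonal_share"] = (
    shape["group_moy_avg"]
    / shape.groupby("group")["group_moy_avg"].transform("sum")
)
sales = sales.merge(
    shape[["group", "moy", "group_seasonal_share"]], on=["group", "moy"], how="left"
)

# (2) Synthetic warm-up: where a new item has no real record, borrow the
# group average so lag/rolling features get a plausible shape. Only feature
# calculation reads this column – the training target keeps the real values.
sales["qty_for_features"] = sales["qty"].fillna(sales["group_avg_qty"])
```

Keeping the synthetic values in a separate column is the safeguard here: lags and rolling windows can read it, but the loss never does.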