Inventory Forecasting with AI

4. Baseline Models vs. Deep Learning

Why start with classic ML models?

  • Quick reality check: baseline algorithms train in minutes and instantly reveal major data issues (bad calendar encoding, duplicated rows, etc.).
  • Reference benchmark: once we know the performance of linear regression or a decision tree, we can quantify exactly how much a deep-learning model must improve to justify its higher cost.
  • Transparency: simpler models are easier to explain; they help us understand the relationship between inputs and outputs before deploying a more complex architecture.
Baseline Models and How They Work

  • LR – Linear Regression: Captures simple increasing/decreasing trends; once the data becomes more complex, it quickly loses accuracy.
  • DT – Decision Tree: Handles more complex curves and nonlinearities, but if the tree grows too deep it overfits – it starts memorizing the training data instead of learning general patterns.
  • FR – Random Forest: Averages out the errors of individual trees → more stable than a single tree, but more memory-intensive and slower with a large number of items.
  • XGB – XGBoost Regressor: Often the best among traditional ML models – it can capture complex relationships without much manual tuning.
  • GBR – Gradient Boosting Regressor: Good as a low-cost benchmark: if even this fails, the data is truly challenging.
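
To show just how little code these baselines need, here is a minimal sketch using scikit-learn and XGBoost on simple lag features. The file name, the column names (sku, month, qty) and the chosen lags are my own illustrative assumptions, not the exact feature set used in the project.

    # Minimal baseline sketch: lag features + classic regressors.
    # Assumes a long-format table with columns: sku, month (datetime), qty.
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
    from xgboost import XGBRegressor

    def add_features(df: pd.DataFrame) -> pd.DataFrame:
        df = df.sort_values(["sku", "month"]).copy()
        for lag in (1, 2, 3, 12):                      # recent months + same month last year
            df[f"lag_{lag}"] = df.groupby("sku")["qty"].shift(lag)
        df["month_of_year"] = df["month"].dt.month     # simple calendar encoding
        return df.dropna()

    df = add_features(pd.read_csv("sales.csv", parse_dates=["month"]))  # hypothetical file
    X, y = df.drop(columns=["sku", "month", "qty"]), df["qty"]

    models = {
        "LR": LinearRegression(),
        "DT": DecisionTreeRegressor(max_depth=6),
        "FR": RandomForestRegressor(n_estimators=200),
        "GBR": GradientBoostingRegressor(),
        "XGB": XGBRegressor(n_estimators=300, learning_rate=0.05),
    }
    for name, model in models.items():
        model.fit(X, y)                # in practice, use a time-based train/validation split
        print(name, "trained")

Even a rough setup like this is enough to surface data problems such as duplicated rows or broken calendar encoding before any deep-learning work starts.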

What Does a 12-Month Forecast Visualization for 4 SKUs Look Like?


Deep Learning (DL) Architectures I Deployed

For comparison, I built three basic deep learning models, all of them architectures designed specifically for time series forecasting. While the baseline models sit at the simpler end of the spectrum, these sit at the more complex end.

DL Algorithms and How They Work

  • DeepAR (Autoregressive LSTM)
    How it works: Learns from similar items; during prediction it generates the full probabilistic range of demand → we can see both optimistic and pessimistic sales scenarios.
    Advantages / disadvantages: For new products, long-tail items, or series with many zeros and extremes, it tends to "smooth out" the curve.
  • TFT (Temporal Fusion Transformer)
    How it works: Excellent for complex scenarios – it automatically selects which information is important and can explain its choices.
    Advantages / disadvantages: Best for complex datasets with many external signals, but requires a large amount of data and has long training times.
  • N-HiTS (Hierarchical Interpolation)
    How it works: Decomposes a time series into multiple temporal levels (year/month/week), applies a dedicated small network at each level, then recombines them.
    Advantages / disadvantages: Great for long forecast horizons and fast inference, but requires a regular time step.
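
The article does not say which library was used; as one possible setup, here is a minimal sketch with the open-source neuralforecast package, which ships implementations of all three architectures. The data layout (unique_id, ds, y), the file name, and the horizon/lookback values follow that library's conventions and are my assumptions, not the author's exact configuration.

    # Minimal sketch: fitting DeepAR, TFT and N-HiTS on monthly SKU histories
    # with the neuralforecast library (one possible tool choice, assumed here).
    import pandas as pd
    from neuralforecast import NeuralForecast
    from neuralforecast.models import DeepAR, TFT, NHITS

    # Long-format data: unique_id = SKU, ds = month start date, y = units sold.
    df = pd.read_csv("sales_long.csv", parse_dates=["ds"])  # hypothetical file

    horizon = 3     # decoder: forecast 3 months ahead
    lookback = 18   # encoder: read the last 18 months

    models = [
        DeepAR(h=horizon, input_size=lookback, max_steps=500),
        TFT(h=horizon, input_size=lookback, max_steps=500),
        NHITS(h=horizon, input_size=lookback, max_steps=500),
    ]

    nf = NeuralForecast(models=models, freq="MS")  # monthly series, month-start timestamps
    nf.fit(df=df)
    forecasts = nf.predict()   # one row per SKU and future month, one column per model
    print(forecasts.head())

In a real pipeline the histories would be split in time so that the validation months are never seen during training.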

Why These Complex Models?

  • They capture subtle and long-term seasonal patterns and promotional effects that classical ML often overlooks.
  • They share learned knowledge across the product range — helping new items and those with short histories.
  • They return uncertainty intervals, so the buyer receives a recommendation of “how much to order in both optimistic and pessimistic scenarios.”
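
To make the last point concrete, here is a tiny, purely illustrative sketch of how a probabilistic forecast could translate into an order quantity: the buyer picks the quantile that matches the desired service level. All numbers and names here are made up for illustration.

    # Turning a probabilistic demand forecast into an order recommendation
    # (illustrative only; quantile values are made up).
    forecast_quantiles = {"P10": 80, "P50": 120, "P90": 190}  # next-month demand, in units
    on_hand = 60                                              # current stock for the SKU

    scenario = "P90"  # pessimistic-demand scenario -> high availability target
    order_qty = max(0, forecast_quantiles[scenario] - on_hand)
    print(f"Order {order_qty} units to cover the {scenario} demand scenario.")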

What Does a 12-Month Forecast Visualization of the DL Models for 4 SKUs Look Like?


Results

The testing was conducted only under Scenario 1 — both training and validation were performed on the same dataset. I used 5,000 products with a full 36-month history.

Classical models provide an initial orientation, but even the best of them fall short of the target accuracy. Deep learning brings a significant improvement — even in the basic configuration, it reduces the error by half or even two-thirds. This is a clear signal that it’s worth investing time into tuning features, hyperparameters, and deploying the model in production.

Comparison of metrics between the ML and DL models. I started using more advanced validation metrics only at a later stage.

ML - baseline models

Model |        | WAPE     | RMSE     | MAPE
GBR   | 0.4259 | 124.9833 | 173.5992 | 75.8702
FR    | 0.3444 | 133.5684 | 157.6839 | 217.063
DT    | 0.2529 | 142.5767 | 165.8517 | 212.5132
XGB   | 0.2153 | 146.125  | 168.5045 | 266.1286
LR    | 0.0745 | 158.6966 | 200.3072 | 477.9817
Deep learning models

Model  |       | WAPE     | RMSE     | MAPE
DeepAR | 0.847 | 53.615   | 62.8242  | 188.5804
TFT    | 0.390 | 130.2255 | 125.4162 | 141.6336
NHiTS  | 0.432 | 206.9327 | 121.0321 | 382.6455
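
For reference, here is a small numpy sketch of the three named metrics, using their standard definitions (the exact formulas used in the project are not spelled out, so treat these as assumptions):

    # Standard definitions of the reported error metrics (assumed, not taken
    # from the project code). y_true / y_pred hold actual and predicted
    # monthly demand across all SKUs and months.
    import numpy as np

    def wape(y_true, y_pred):
        # Total absolute error relative to total demand (often also reported as a percentage).
        return np.abs(y_true - y_pred).sum() / np.abs(y_true).sum()

    def rmse(y_true, y_pred):
        return np.sqrt(np.mean((y_true - y_pred) ** 2))

    def mape(y_true, y_pred):
        mask = y_true != 0   # skip zero-demand months to avoid division by zero
        return np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])) * 100

    y_true = np.array([120.0, 80.0, 0.0, 45.0])
    y_pred = np.array([100.0, 95.0, 10.0, 40.0])
    print(wape(y_true, y_pred), rmse(y_true, y_pred), mape(y_true, y_pred))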


Test 2 – Encoder vs Decoder

In the previous phase, I confirmed that deeper (DL) architectures clearly outperform traditional ML. Before diving into tuning parameters like loss function, optimizer, or dropout (which I won’t cover here), it’s essential to understand the encoder–decoder architecture. In time series models, this is a key component that determines how much of the past the model “reads” and how far into the future it predicts.

  • Encoder: Defines how much of the historical data the model uses to generate a single prediction. You can think of it like a buyer looking at the last 12 months of history.
  • Decoder: Represents the forward-looking part – it generates predictions based on the information provided by the encoder.

Unlike a human buyer, the architecture creates multiple predictions (sliding windows) during training for each product. If I have 36 months of history and I want the model to look back at the last 18 months and predict the next 3 months, then 36 - 18 - 3 + 1 = 16 sliding windows are generated during training over this range (the decoder of the last window must still end inside the known history).

Encoder | Decoder
1-18    | 19-21
2-19    | 20-22
...     | ...
16-33   | 34-36

A three-year history thus creates 16 training windows for each item; the model sees all possible transitions (e.g., winter → spring, Christmas → January, etc.).
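
A minimal sketch of how these encoder/decoder windows can be cut from a single 36-month history (month indices are 1-based, as in the table above; the helper itself is my illustration, not the internals of any particular library):

    # Generate (encoder, decoder) training windows from one item's monthly history.
    # Illustrative helper, not the internals of a specific forecasting library.
    def sliding_windows(history_len=36, encoder_len=18, decoder_len=3):
        windows = []
        # The decoder target must still lie inside the known history,
        # hence history_len - encoder_len - decoder_len + 1 windows.
        for start in range(1, history_len - encoder_len - decoder_len + 2):
            encoder = (start, start + encoder_len - 1)
            decoder = (start + encoder_len, start + encoder_len + decoder_len - 1)
            windows.append((encoder, decoder))
        return windows

    wins = sliding_windows()
    print(len(wins))           # 16 complete windows for a 36-month history
    print(wins[0], wins[-1])   # ((1, 18), (19, 21)) ... ((16, 33), (34, 36))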

Main difference and impact on model behavior:
What’s the difference between setting the encoder length to 18 versus 30 months?
The key difference lies in the types of patterns the model learns and what it emphasizes.
It’s a classic trade-off between flexibility and stability.

Encoder_length = 18
  • Better adaptation to changes: the model reacts faster to sudden market shifts (trends, promotions).
  • More training windows: faster learning, lower risk of overfitting (especially for short histories).
  • Lower memory and computation requirements: faster inference and training.
  • Struggles with seasonal cycles: the model may not fully learn patterns that repeat over multiple years.
  • Limited seasonal memory: higher risk of "forgetting" past seasonal peaks like Christmas.

Encoder_length = 30
  • Good at seasonality and long-term trends: better at capturing repeating patterns and slow long-term trends.
  • Prediction stability: robust against short-term noise.
  • Fewer training windows: longer training, higher overfitting risk if data is limited.
  • Slower reaction to changes: takes longer to adapt to sudden market shifts (trends, promotions).
  • Higher memory and computation requirements.

How to Choose?

There’s no universally best setting. In practice, I tested encoder lengths of 12 / 18 / 24 / 30 months, and the most balanced turned out to be encoder = 18 months, decoder = 3 months.
It provides enough historical context, generates a large number of training windows, and responds well to short-term market changes.
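
In library terms, the chosen setting maps onto two constructor arguments. Sticking with the neuralforecast API assumed in the earlier sketch (the parameter names are that library's, not necessarily the author's code):

    # Encoder = 18 months of context, decoder = 3-month forecast horizon.
    from neuralforecast import NeuralForecast
    from neuralforecast.models import TFT

    model = TFT(
        h=3,             # decoder length: months predicted per window
        input_size=18,   # encoder length: months of history read per window
        max_steps=1000,
    )
    nf = NeuralForecast(models=[model], freq="MS")
    # nf.fit(df) and nf.predict() then work exactly as in the earlier sketch.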