Inventory Forecasting with AI

5. Feature Engineering I – A Leap in Accuracy

DL model selection

In the previous part we decided to base further development on deep-learning architectures. To keep future visualisations readable, I will focus mainly on the Temporal Fusion Transformer (TFT). Three key arguments led me to this choice:

  • Scaling with complexity – the richer and more granular the dataset (dozens of features, promo flags, seasonality, relative indicators) the more clearly TFT outperforms other models.
  • DeepAR limits on the long tail – DeepAR tends to pull low-volume or sporadic SKUs down to zero, distorting purchase plans for slow-moving yet important items.
  • Attention = smart feature selection – TFT automatically decides which input columns matter at any moment and can explain that choice; a major benefit when you have 100+ features.

At first DeepAR seemed to achieve comparable metrics with lower compute demand, but once we enriched the inputs with more complex seasonal, promo and relative signals the picture flipped: TFT kept scaling while DeepAR hit its limits.

Scenario 1 – why it is wrong

During further testing the drawback of the first scenario became obvious – training and validation data were identical.

The charts show that Scenario 1 tracks reality very closely throughout, because the model had a chance to “see the future”. Once we withheld the last three months (Scenario 2) the gap appeared quickly. On the final chart with a seasonal item the model fails completely.
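
The difference between the two scenarios comes down to how the data is split. A minimal sketch of a time-based holdout, assuming each row is a dict with a `time_idx` field; the function name `split_by_time` and the toy data are illustrative and not taken from the project:

```python
def split_by_time(rows, holdout_months=3):
    """Reserve the last `holdout_months` time steps for validation only."""
    last_idx = max(r["time_idx"] for r in rows)
    cutoff = last_idx - holdout_months  # last time_idx still used for training
    train = [r for r in rows if r["time_idx"] <= cutoff]
    valid = [r for r in rows if r["time_idx"] > cutoff]
    return train, valid


# Toy data: one SKU with 12 monthly observations (time_idx 0..11).
rows = [{"ID": "A", "time_idx": t, "turnover": 100 + t} for t in range(12)]
train, valid = split_by_time(rows, holdout_months=3)
print(len(train), len(valid))  # 9 3
```

Because the split is made on `time_idx` rather than at random, no validation month can leak into training.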

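Metric      | Scenario 1 (ALL) | Scenario 1 (Item) | Scenario 2 (ALL) | Scenario 2 (Item)
WAPE        | 32.1632          | 45.7959           | 53.1177          | 66.9848
RMSE        | 62.0655          | 25.2746           | 83.1493          | 45.7434
(unlabeled) | 0.8603           | -                 | 0.7754           | -
MAPE        | 63.9494          | -                 | 95.889           | -
ROBUST      | 0.4796           | 66.0129           | 0.4184           | 98.0722
STABLE      | 0.3819           | -                 | 0.4615           | -

Rows with a single value per scenario were reported without an ALL/Item split; the value is shown in the first column of each scenario.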
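
For readers unfamiliar with the first metrics in the table, here is a sketch of the standard textbook definitions of WAPE, RMSE and MAPE. The exact formulas used in the project are not stated, so treat these as assumptions; in particular, skipping zero actuals in MAPE is my choice for the sketch. ROBUST and STABLE are project-specific and are not reproduced here.

```python
import math


def wape(actual, forecast):
    """Weighted absolute percentage error: total |error| / total |actual|, in %."""
    total_actual = sum(abs(a) for a in actual)
    total_error = sum(abs(a - f) for a, f in zip(actual, forecast))
    return 100.0 * total_error / total_actual


def rmse(actual, forecast):
    """Root mean squared error, in the units of the target."""
    squared = [(a - f) ** 2 for a, f in zip(actual, forecast)]
    return math.sqrt(sum(squared) / len(squared))


def mape(actual, forecast):
    """Mean absolute percentage error in %; zero actuals are skipped (assumption)."""
    terms = [abs(a - f) / abs(a) for a, f in zip(actual, forecast) if a != 0]
    return 100.0 * sum(terms) / len(terms)


# Toy example: every forecast is off by 10 units.
actual = [100.0, 50.0, 10.0]
forecast = [90.0, 60.0, 20.0]
print(wape(actual, forecast), rmse(actual, forecast), mape(actual, forecast))
```

In the toy example the same absolute error yields a WAPE of 18.75 but a MAPE above 43, because MAPE weights the low-volume item as heavily as the large one.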

When reviewing Scenario 2 results we can see the model has some predictive power, especially for SKUs with long and stable history, but the results are still well below expectations. Three key weaknesses emerged:

  • Low-turnover items – forecasts dropped to zero; risk of under-stocking the long tail.
  • Promo peaks – without an explicit discount signal the models reproduced the uplift only partially.
  • Seasonal cycles – for highly seasonal items the models underestimated the amplitude of summer and Christmas peaks.

It was clear the network needed additional inputs to supply the context missing from the raw numbers. That is the job of feature engineering.

What are features?

When working with AI on time series, each data row represents a Product (ID) × Time point (e.g., item A in March 2024). Everything else in that row – segment, category, sales, price – we call features. These columns provide the model with the context it needs: they help it understand a product’s behaviour over time, its similarity to others, and the influence of external factors such as promos or season.

Why?

  • A raw sales number tells only how many units were sold.
  • Features add the why: it was August, a discount was running, the item is new, high season is peaking.

Thanks to this the model can recognise patterns it would never extract from a plain sales series.

How the model works with features

As noted, each row in our dataset is a Product (ID) × Month (time_idx) pair. Everything else on that row counts as a feature.

Feature list for the project dataset:

ID, seasonality, category, type, segment, segment 1, segment 2, name, turnover, date, time_idx, weight, SALE, SALE_INTENSITY, product_volume__bin

Three types of features
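The basic split is into static and dynamic features, that is, whether the values change over time. Dynamic features are further divided by whether their future values are known in advance, which gives three types in total.
Product type is the same for every month, whereas a promo flag appears only in certain months.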

Feature type | Example | How it helps the model
Static | Product type, SKU ID, category, segment | Lets the model share patterns across similar items and supports new products that have very little history.
Dynamic – known | Calendar date, discount flag | Values change month by month and are known in advance, so the model can “look ahead” (e.g., it already knows when a promo will run).
Dynamic – unknown | Sales volume | Values change every month and are not known for the future; this is the variable we actually want the model to predict.
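
The static/dynamic split can be checked mechanically: a column is static if its value never changes within any single ID. A small sketch, assuming rows are dicts keyed by column name; the helper `find_static_columns` and the toy rows are hypothetical:

```python
from collections import defaultdict


def find_static_columns(rows, id_col="ID"):
    """Return columns whose value never changes within any single ID."""
    seen = defaultdict(lambda: defaultdict(set))  # column -> ID -> distinct values
    for row in rows:
        for col, value in row.items():
            if col != id_col:
                seen[col][row[id_col]].add(value)
    return sorted(
        col
        for col, per_id in seen.items()
        if all(len(values) == 1 for values in per_id.values())
    )


rows = [
    {"ID": "A", "time_idx": 0, "category": "tools", "SALE": 0, "turnover": 120.0},
    {"ID": "A", "time_idx": 1, "category": "tools", "SALE": 1, "turnover": 210.0},
    {"ID": "B", "time_idx": 0, "category": "garden", "SALE": 0, "turnover": 15.0},
    {"ID": "B", "time_idx": 1, "category": "garden", "SALE": 0, "turnover": 12.0},
]
print(find_static_columns(rows))  # ['category']
```

Whether a dynamic column is "known" or "unknown" cannot be derived from the data in the same way; it is a modelling decision about what will be available at forecast time.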

Numerical vs. Categorical features
Numerical (sales, averages, discounts) → fed directly into the network.
Categorical (product type, segment) → turned into embeddings, short vectors learned together with the model.


Why embeddings help
Two vectors that sit “close” in space = two categories whose demand behaves similarly. A brand-new “tools” SKU can instantly benefit from the sales history of “electro-accessories” if their profiles match.
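
"Close in space" is usually measured with cosine similarity. The sketch below uses made-up three-dimensional vectors purely for illustration; real embeddings are learned during training and their values here are not from the project:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings, invented for this example.
embeddings = {
    "tools": [0.9, 0.1, 0.3],
    "electro-accessories": [0.8, 0.2, 0.4],
    "garden": [-0.5, 0.9, 0.0],
}

sim_close = cosine_similarity(embeddings["tools"], embeddings["electro-accessories"])
sim_far = cosine_similarity(embeddings["tools"], embeddings["garden"])
print(round(sim_close, 3), round(sim_far, 3))
```

With these invented vectors, "tools" and "electro-accessories" score close to 1, while "tools" and "garden" score below 0.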

Which features I added—and why

🗓️ Date-encoding

The model doesn’t “see” calendar months, only a numeric index (time_idx = 0 … 36). These features restore that link.

  • Harmonic month code (sin / cos) → the network learns that January follows December and the whole year is cyclical.
  • Helps capture periodic sales patterns (quarters, years, summer season).
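
A minimal sketch of the harmonic month code, assuming months are numbered 1 to 12 (the function names are illustrative). Each month becomes a point on the unit circle, so December and January end up as close to each other as any two neighbouring months:

```python
import math


def encode_month(month):
    """Map a calendar month (1..12) to (sin, cos) coordinates on the unit circle."""
    angle = 2 * math.pi * (month - 1) / 12
    return math.sin(angle), math.cos(angle)


def distance(p, q):
    """Euclidean distance between two encoded months."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


dec_to_jan = distance(encode_month(12), encode_month(1))
jan_to_feb = distance(encode_month(1), encode_month(2))
jan_to_jul = distance(encode_month(1), encode_month(7))
print(round(dec_to_jan, 4), round(jan_to_feb, 4), round(jan_to_jul, 4))
```

A plain month number would put December (12) and January (1) eleven units apart; on the circle they are one step apart, and January and July sit at opposite ends.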

📊 Relative (ratio) features

  • Express the deviation or share of a SKU’s sales against a larger whole (group mean, seasonal maximum, long-term trend …).
  • The model instantly spots when an item is above or below its norm and reacts faster to unexpected swings.
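
One possible relative feature is a SKU's turnover divided by the mean turnover of its category in the same month. The sketch below assumes the column names `category`, `time_idx` and `turnover` from the feature list; the helper name and the toy numbers are hypothetical:

```python
from collections import defaultdict


def add_ratio_to_group_mean(rows, group_col="category", value_col="turnover"):
    """Add `<value_col>_vs_group`: value divided by the group's mean for that month."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        key = (row[group_col], row["time_idx"])
        sums[key] += row[value_col]
        counts[key] += 1
    for row in rows:
        key = (row[group_col], row["time_idx"])
        group_mean = sums[key] / counts[key]
        # Guard against an all-zero group to avoid division by zero.
        row[f"{value_col}_vs_group"] = row[value_col] / group_mean if group_mean else 0.0
    return rows


rows = [
    {"ID": "A", "category": "tools", "time_idx": 0, "turnover": 150.0},
    {"ID": "B", "category": "tools", "time_idx": 0, "turnover": 50.0},
    {"ID": "C", "category": "garden", "time_idx": 0, "turnover": 80.0},
]
for row in add_ratio_to_group_mean(rows):
    print(row["ID"], row["turnover_vs_group"])
```

Since the ratio is built from turnover itself, it only exists for past months; to use it over the forecast horizon it would have to be computed on lagged values.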

🌀 Lag & Rolling windows

  • Supply detailed information on the recent trajectory.
  • Speed up detection of momentum (sales speeding up or slowing down).
  • Rolling mean / std filter noise and show whether current sales sit above or below their moving average.
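
A sketch of lag and rolling-window features on a single SKU's monthly series. The window length of 3 and the choice to exclude the current month from the window (so the feature never contains the value being predicted) are assumptions made for this example:

```python
import statistics


def lag(series, k):
    """Value from k steps earlier; None where there is not enough history."""
    return [None] * k + list(series[:len(series) - k])


def rolling(series, window, func):
    """Apply `func` to the previous `window` values, excluding the current one."""
    out = []
    for i in range(len(series)):
        if i < window:
            out.append(None)  # not enough history yet
        else:
            out.append(func(series[i - window:i]))
    return out


turnover = [10.0, 12.0, 11.0, 15.0, 30.0, 14.0]
lag_1 = lag(turnover, 1)
roll_mean_3 = rolling(turnover, 3, statistics.mean)
roll_std_3 = rolling(turnover, 3, statistics.stdev)

# Ratio of the current month to its trailing mean: above 1 means above the norm.
vs_mean = [t / m if m else None for t, m in zip(turnover, roll_mean_3)]
print(lag_1)
print(roll_mean_3)
print(vs_mean)
```

In the toy series the spike of 30 shows up as a ratio well above 2 against its trailing mean, which is the "above or below the moving average" signal described above.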

🌊 Wavelet signals

  • Decompose the curve into short ripples versus long trends.
  • The network simultaneously “sees” fine promo jumps and slow multi-year cycles.
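
The article does not say which wavelet family was used, so the following is only an illustration with the simplest one, the Haar wavelet, written out by hand. In practice a library such as PyWavelets would normally do this. One decomposition step splits a series into a smoothed half-length "approximation" (the slow component) and a "detail" series (the short ripples):

```python
import math


def haar_step(series):
    """One level of the Haar transform on an even-length series."""
    if len(series) % 2:
        raise ValueError("series length must be even")
    s = math.sqrt(2)
    approx = [(series[i] + series[i + 1]) / s for i in range(0, len(series), 2)]
    detail = [(series[i] - series[i + 1]) / s for i in range(0, len(series), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Rebuild the original series from one level of Haar coefficients."""
    s = math.sqrt(2)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out


# Toy series with a one-month spike (40.0) on top of a slow upward drift.
turnover = [10.0, 12.0, 11.0, 13.0, 40.0, 14.0, 15.0, 17.0]
approx, detail = haar_step(turnover)
print([round(d, 2) for d in detail])
```

The spike produces by far the largest detail coefficient, while the approximation carries the slow drift; both can be fed to the model as separate columns.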

📈 Trend

  • Adds direction and slope of sales.
  • Provides context on how sales fluctuate around their mean.
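
One simple way to express direction and slope is a least-squares line fitted to the most recent months. The sketch below is an assumption about how such a feature could be built, not the project's actual implementation; the six-month window and the feature names are illustrative:

```python
def slope(values):
    """Least-squares slope of `values` against the positions 0, 1, 2, ..."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den


def trend_features(series, window=6):
    """Slope over the trailing window and the last value's offset from the window mean."""
    recent = series[-window:]
    mean = sum(recent) / len(recent)
    return {"slope": slope(recent), "last_vs_mean": series[-1] - mean}


turnover = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
print(trend_features(turnover))  # {'slope': 2.0, 'last_vs_mean': 5.0}
```

A positive slope means sales are rising by that many units per month over the window; `last_vs_mean` shows how far the latest month sits from the window average.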

🔢 Absolute and log values

Most features are created both in raw and logarithmic form.

Why logarithmic?

  • Compress extremes → small items aren’t drowned out.
  • Stabilises variance; curves sit closer to normal → faster learning.
  • Converts multiplicative jumps (×2, ×3) into linear shifts the model can capture easily.
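
The effect of the log transform can be shown with two numbers. The sketch uses `log(1 + x)` so that zero-sales months stay defined; whether the project used this variant or a plain logarithm is not stated, so treat it as an assumption:

```python
import math


def log_feature(value):
    """log(1 + x): defined at zero and close to log(x) for large values."""
    return math.log1p(value)


slow_mover, best_seller = 20.0, 20000.0
raw_ratio = best_seller / slow_mover  # 1000x apart on the raw scale
log_gap = log_feature(best_seller) - log_feature(slow_mover)

# Doubling adds the same constant (log 2) on the log scale, whatever the level.
jump_small = math.log(2 * slow_mover) - math.log(slow_mover)
jump_big = math.log(2 * best_seller) - math.log(best_seller)
print(raw_ratio, round(log_gap, 2), round(jump_small, 4), round(jump_big, 4))
```

On the raw scale the best seller is 1000 times larger than the slow mover; on the log scale the gap shrinks to a single-digit difference, and a ×2 promo uplift looks identical for both items.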

Result

Thanks to this combination of inputs, the model now not only predicts turnover but also understands the context of each month: what is a normal trend, what is an outlier, and what a promo spike looks like.

After several rounds of tuning I settled on about 75 features—the sweet spot where metrics improved the most without blowing up training time or GPU memory. Adding more columns brought only marginal gains while slowing training dramatically, and some features even started to interfere with one another.

The charts compare three model variants:

  • Red curve – baseline forecast without any extra features.
  • Green curve – baseline plus harmonic date-encoding (sin/cos month).
  • Purple curve – same model further enriched with relative and rolling features derived from turnover.

The contrast between the curves shows how each feature bundle improves—or shifts—the model’s ability to track real sales.

The table illustrates how extra information pushes model quality upward:

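Metric      | no features (ALL) | no features (Item) | date features (ALL) | date features (Item) | turnover features (ALL) | turnover features (Item)
WAPE        | 53.1177           | 66.9848            | 32.0651             | 46.7711              | 30.2608                 | 44.5829
RMSE        | 83.1493           | 45.7434            | 72.175              | 25.1354              | 46.1427                 | 22.3537
(unlabeled) | 0.7754            | -                  | 0.8111              | -                    | 0.9012                  | -
MAPE        | 95.889            | -                  | 62.9665             | -                    | 60.0505                 | -
ROBUST      | 0.4184            | 98.0722            | 0.4907              | 65.2697              | 0.4948                  | 62.2027
STABLE      | 0.4615            | -                  | 0.3839              | -                    | 0.365                   | -

Rows with a single value per variant were reported without an ALL/Item split; the value is shown in the first column of each variant.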

What the numbers say

  • Moving from zero features to just date-encoding (harmonic month, quarter, …) cuts WAPE (ALL) from 53.1 to 32.1 and MAPE from 95.9 to 63.0 – the model now “gets” seasonality.
  • Adding turnover features (relative ratios, lags, rolling windows) shaves off roughly two more WAPE points (32.1 → 30.3) and, more importantly, slashes RMSE (ALL) from 72.2 to 46.1.
  • The gap between the two enriched models is smaller; to judge which feature bundle truly pays off, you have to look beyond aggregate scores—zoom in on specific scenarios (seasonal peaks, promo spikes, long-tail SKUs). That’s why visual analysis is essential: contrasting curves reveal nuances and point you to precise problem types.

Impact of feature count on speed

On a test sample I measured how resource usage grows with more features (13 features = baseline).

Features | Training time | GPU memory
120      | 7×            | 2.5×
444      | 26×           | 6×