Inventory Forecasting with AI #5: Feature Engineering I – A Leap in Accuracy

Inventory Forecasting with AI

5. Feature Engineering I – A Leap in Accuracy

DL model selection

In the previous part we decided to base further development on deep-learning architectures. To keep future visualisations readable I will focus mainly on Temporal Fusion Transformer (TFT). Three key arguments led me to this choice:

Scaling with complexity – the richer and more granular the dataset (dozens of features, promo flags, seasonality, relative indicators) the more clearly TFT outperforms other models.
DeepAR limits on the long tail – DeepAR tends to pull low-volume or sporadic SKUs down to zero, distorting purchase plans for slow-moving yet important items.
Attention = smart feature selection – TFT automatically decides which input columns matter at any moment and can explain that choice; a major benefit when you have 100 + features.

At first DeepAR seemed to achieve comparable metrics with lower compute demand, but once we enriched the inputs with more complex seasonal, promo and relative signals the picture flipped: TFT kept scaling while DeepAR hit its limits.

Scenario 1 – why it is wrong

During further testing the drawback of the first scenario became obvious – training and validation data were identical.

The charts show that Scenario 1 tracks reality very closely throughout, because the model had a chance to “see the future”. Once we withheld the last three months (Scenario 2) the gap appeared quickly. On the final chart with a seasonal item the model fails completely.

Metrics	Scenario 1		Scenario 2
Metrics	Value (ALL)	Value (Item)	Value (ALL)	Value (Item)
WAPE	32.1632	45.7959	53.1177	66.9848
RMSE	62.0655	25.2746	83.1493	45.7434
R²	0.8603		0.7754
MAPE	63.9494		95.889
ROBUST	0.4796	66.0129	0.4184	98.0722
STABLE	0.3819		0.4615

When reviewing Scenario 2 results we can see the model has some predictive power, especially for SKUs with long and stable history, but the results are still well below expectations. Three key weaknesses emerged:

Low-turnover items – forecasts dropped to zero; risk of under-stocking the long tail.
Promo peaks – without an explicit discount signal the models reproduced the uplift only partially.
Seasonal cycles – for highly seasonal items the models underestimated the amplitude of summer and Christmas peaks.

It was clear the network needs additional inputs to read the context missing from raw numbers – feature engineering.

What are features?

When working with AI on time series, each data row represents a Product (ID) × Time point (e.g., item A in March 2024). Everything else in that row – segment, category, sales, price – we call features. These columns provide the model with the context it needs: they help it understand a product’s behaviour over time, its similarity to others, and the influence of external factors such as promos or season.

Why?

A raw sales number tells only how many units were sold.
Features add the why: it was August, a discount was running, the item is new, high season is peaking.

Thanks to this the model can recognise patterns it would never extract from a plain sales series.

How the model works with features

As noted, each row in our dataset is a Product (ID) × Month (time_idx) pair. Everything else on that row counts as a feature.

Feature list for the project dataset:

ID, seasonality, category, type, segment, segment 1, segment 2, name, turnover, date, time_idx, weight, SALE, SALE_INTENSITY, product_volume__bin

Three types of features
The basic split is into static and dynamic – that is, whether the values change over time.
Product type is the same for every month, whereas a promo flag appears only in certain months.

Feature type	Example	How it helps the model
Static	Product type, SKU ID, category, segment	Lets the model share patterns across similar items and supports new products that have very little history.
Dynamic – known	Calendar date, discount flag	Values change month by month and are known in advance, so the model can “look ahead” (e.g., it already knows when a promo will run).
Dynamic – unknown	Sales volume	Values change every month and are not known for the future—this is the variable we actually want the model to predict.

Numerical vs. Categorical features
Numerical (sales, averages, discounts) → fed directly into the network.
Categorical (product type, segment) → turned into embeddings, short vectors learned together with the model.

Why embeddings help
Two vectors that sit “close” in space = two categories whose demand behaves similarly. A brand-new “tools” SKU can instantly benefit from the sales history of “electro-accessories” if their profiles match.

Which features I added—and why

🗓️ Date-encoding

The model doesn’t “see” calendar months, only a numeric index (time_idx = 0 … 36). These features restore that link.

Harmonic month code (sin / cos) → the network learns that January follows December and the whole year is cyclical.
Helps capture periodic sales patterns (quarters, years, summer season).

📊 Relative (ratio) features

Express the deviation or share of a SKU’s sales against a larger whole (group mean, seasonal maximum, long-term trend …).
The model instantly spots when an item is above or below its norm and reacts faster to unexpected swings.

🌀 Lag & Rolling windows

Supply detailed information on the recent trajectory.
Speed up detection of momentum (sales speeding up or slowing down).
Rolling mean / std filter noise and show whether current sales sit above or below their moving average.

🌊 Wavelet signals

Decompose the curve into short ripples versus long trends.
The network simultaneously “sees” fine promo jumps and slow multi-year cycles.

📈 Trend

Adds direction and slope of sales.
Provides context on how sales fluctuate around their mean.

🔢 Absolute and log values

Most features are created both in raw and logarithmic form.

Why logarithmic?

Compress extremes → small items aren’t drowned out.
Stabilises variance; curves sit closer to normal → faster learning.
Converts multiplicative jumps (×2, ×3) into linear shifts the model can capture easily.

Result

Thanks to this combination of inputs, the model now predicts turnover and understands the context of each month—what is a normal trend, what is an outlier, and what a promo spike looks like.

After several rounds of tuning I settled on about 75 features—the sweet spot where metrics improved the most without blowing up training time or GPU memory. Adding more columns brought only marginal gains while slowing training dramatically, and some features even started to interfere with one another.

The charts compare three model variants:

Red curve – baseline forecast without any extra features.
Green curve – baseline plus harmonic date-encoding (sin/cos month).
Purple curve – same model further enriched with relative and rolling features derived from turnover.

The contrast between the curves shows how each feature bundle improves—or shifts—the model’s ability to track real sales.

The table illustrates how extra information pushes model quality upward:

Metrics	no features		date features		turnover features
Metrics	Value (ALL)	Value (Item)	Value (ALL)	Value (Item)	Value (ALL)	Value (Item)
WAPE	53.1177	66.9848	32.0651	46.7711	30.2608	44.5829
RMSE	83.1493	45.7434	72.175	25.1354	46.1427	22.3537
R²	0.7754		0.8111		0.9012
MAPE	95.889		62.9665		60.0505
ROBUST	0.4184	98.0722	0.4907	65.2697	0.4948	62.2027
STABLE	0.4615		0.3839		0.365

What the numbers say

Moving from zero features to just date-encoding (harmonic month, quarter, …) cuts error by dozens of percentage points—the model now “gets” seasonality.
Adding turnover features (relative ratios, lags, rolling windows) shaves off a few more points and, more importantly, slashes RMSE.
The gap between the two enriched models is smaller; to judge which feature bundle truly pays off, you have to look beyond aggregate scores—zoom in on specific scenarios (seasonal peaks, promo spikes, long-tail SKUs). That’s why visual analysis is essential: contrasting curves reveal nuances and point you to precise problem types.

Impact of feature count on speed

On a test sample I measured how resource usage grows with more features (13 features = baseline).

Features	Training time	GPU memory
120	x 7	x 2,5
444	x 26	x 6

Inventory Forecasting with AI #5: Feature Engineering I – A Leap in Accuracy