Inventory Forecasting with AI 10. Final Model

Final Test – Full Dataset
Moment of truth: how does the model perform when run on all 15,000 items?
Model and data configuration (based on previous findings):
- Feature engineering: 120 features (a combination of individual sales features and group features)
- Outlier trimming
- Merging of sparse categories
- Training only on items with at least 1 month of history
- Encoder length: 18 months
- Architecture: TFT
- Validation: last 3 months not included in training
For the final test, I selected a model with group features but without synthetic data. This setup proved to be the most stable across the entire assortment, although synthetic data remains an interesting option for future projects or specific subsets of products.

| Metric | Full dataset (ALL) | Full dataset (Item) | 33 months (ALL) | 33 months (Item) | 13–32 months (ALL) | 13–32 months (Item) |
|---|---|---|---|---|---|---|
| WAPE | 30.1022 | 49.7127 | 28.1069 | 41.6723 | 33.4304 | 40.1634 |
| RMSE | 45.907 | 22.3895 | 50.4051 | 23.9145 | 26.3204 | 17.0358 |
| R² | 0.9232 | | 0.93 | | 0.7245 | |
| MAPE | 65.5438 | | 57.332 | | 60.9407 | |
| ROBUST | 0.551 | 72.2735 | 0.5498 | 58.7128 | 0.5049 | 66.3819 |
| STABLE | 0.3884 | | 0.3642 | | 0.5422 | |

| Metric | 6–12 months (ALL) | 6–12 months (Item) | 1–6 months (ALL) | 1–6 months (Item) | No history (ALL) | No history (Item) |
|---|---|---|---|---|---|---|
| WAPE | 38.5444 | 45.8518 | 44.6778 | 47.1055 | 80.7317 | 220.7146 |
| RMSE | 26.6287 | 17.1458 | 43.1306 | 27.5178 | 95.9705 | 61.587 |
| R² | 0.7529 | | 0.7451 | | -0.1913 | |
| MAPE | 64.676 | | 79.1142 | | 186.139 | |
| ROBUST | 0.5233 | 66.5118 | 0.5441 | 89.7998 | 0.4013 | 250.967 |
| STABLE | 0.4801 | | 0.4776 | | 0.9196 | |

Results by history length:
- Full history (36 months) – WAPE around 30 % (ALL), high R² (0.92+).
- Medium history (12–35 months) – similar results to full history, with only a slight drop in R².
- Short history (6–12 months) – still solid forecasts, WAPE around 38 %.
- Very short history (1–6 months) – higher errors, but still usable for rough planning.
- No history – predictions remain unusable; the model has no data point from which to determine the correct scale.
Compared to Chapter #9 (partial dataset test), the metrics are practically identical – the model maintains stability even with triple the dataset size. A slight improvement is visible for SKUs with short history, mainly thanks to a greater number of categories within groups → richer embedding data → better pattern transfer.

Visualization
The charts compare the prediction from the partial dataset (previous chapter) with the new prediction from the full dataset.
🔴 Red – prediction from the model on the partial dataset.
🟢 Green – prediction from the final model on the full dataset.
The most visible differences are found in SKUs with short history – here, group-level information now plays a much stronger role. For longer histories, the curves almost perfectly overlap, confirming that expanding the training dataset to the full range did not harm performance for established items.

Promo Campaigns
The model performs very well on the full dataset, so I also tested how it handles promotional campaigns. In the training data, the model receives information about when and for how long a given product was on sale. Within feature engineering, I also add additional context (a sketch follows at the end of this section):
- how often the item has promotions
- how long promotions typically last
- what their historical impact on sales has been
The visualization shows the difference in forecasts when an item had a planned discount vs. no discount.
🔴 Red – without planned discounts
🟢 Green – discount applied in the last 3 months
Where history confirms the success of promotions, the sales forecast increases. Where past campaigns have not worked, the curve remains almost unchanged (e.g., C511478). This shows that the model can make decisions based on the actual historical impact of promotional events, and that a discount by itself is not a universal trigger for increased sales.
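The promo context features above can be derived directly from the monthly long table. Below is a minimal pandas sketch, assuming the columns ID, time_idx, turnover, and a 0/1 SALE flag from the dataset described earlier; the derived column names are hypothetical and the real pipeline is more involved:

```python
import pandas as pd

def add_promo_context(df: pd.DataFrame) -> pd.DataFrame:
    """Per-item promo context: frequency, typical duration, historical uplift (illustrative)."""
    df = df.sort_values(["ID", "time_idx"]).copy()

    # Share of months in which the item was on promotion.
    df["promo_share"] = df.groupby("ID")["SALE"].transform("mean")

    # Typical promo duration: average length of consecutive SALE == 1 runs per item.
    def avg_run_length(sale: pd.Series) -> float:
        runs = (sale != sale.shift()).cumsum()        # id of each consecutive run
        lengths = sale.groupby(runs).sum()            # run length for SALE == 1 runs, 0 otherwise
        lengths = lengths[lengths > 0]
        return float(lengths.mean()) if len(lengths) else 0.0

    df["promo_avg_duration"] = df.groupby("ID")["SALE"].transform(avg_run_length)

    # Historical promo uplift: mean sales in promo months vs. non-promo months.
    promo_mean = df[df["SALE"] == 1].groupby("ID")["turnover"].mean()
    base_mean = df[df["SALE"] == 0].groupby("ID")["turnover"].mean()
    uplift = (promo_mean / base_mean.replace(0, float("nan"))).rename("promo_uplift").reset_index()
    df = df.merge(uplift, on="ID", how="left")
    df["promo_uplift"] = df["promo_uplift"].fillna(1.0)  # items with no promo history: neutral uplift

    return df
```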
Production Model
To deliver a complete solution to the customer, I trained the final version of the model – this time on the entire 42-month history (several months have passed since the start of the project). This time there is no validation window, so the evaluation is done by visually comparing the model's output with the customer's internal forecast and the actual sales (which I backfilled retrospectively). Even without precise metrics, the key takeaway is that the customer is satisfied and sees clear value in the project.
What the visualization app shows:
⚫ Historical data – actual sales
🔵 POC prediction – output from the test model with only basic settings
🟢 Final prediction – final model
🔴 Customer forecast – customer's internal forecast
- In most cases, the model's predictions stay very close to actual values.
- Deviations appear mainly in anomalous sales that differ significantly from normal behavior.
- The model reliably captures seasonal patterns and maintains their shape even when volumes fluctuate.
- It can correct errors in the customer's manual forecast.
- For items with shorter history, predictions are more influenced by group behavior, which can lead to deviations from reality.
- New items with less than 6 months of history remain challenging – accuracy starts to drop noticeably in this range.
👉 Try the results yourself: https://demo-inventory-forecasts.streamlit.app/
(If the app is sleeping, click "Yes, get this app back up!" – it will be ready in under a minute.)
Note: All data shown is anonymized and slightly modified while remaining informative.

Conclusion of the Series – Inventory Forecasting with AI
After ten chapters, thousands of experiments, and terabytes of training data, I've finally reached the final solution. Throughout the project, I have:
- Built a robust data processing and feature engineering pipeline
- Handled outliers effectively
- Included items with incomplete history
- Tested dozens of scenario, model, and configuration combinations
- Introduced group features for transferring knowledge between similar items
Final choice: a TFT model with group features, without synthetic values, trained on items with at least 1 month of history.
The result? Stable predictions across the entire assortment, strong seasonality preservation, and the ability to correct errors in the customer's manual forecasts.
Inventory Forecasting with AI #9: Forecasting Short-History Items
Inventory Forecasting with AI 9. Forecasting Short-History Items

Summary
In the previous chapter, I trained the model on the entire assortment for the first time – not only SKUs with a full 36-month history, but also those with just a few months of data, or even no history at all. I compared three training variants:
- Only items with a full history
- Items with at least one month of history
- The entire dataset, including items with no history at all
The "minimum 1 month of history" variant delivered the highest stability – the model was able to use as much data as possible without training on "empty" rows. However, detailed validation revealed that the biggest weakness remains items with less than 6 months of history:
- Items with at least 6 months of history are handled relatively reliably.
- Below this threshold – especially for brand-new SKUs – errors increase sharply.
- The most common problem: the model fails to estimate even the basic scale – it doesn't know whether to forecast 10, 100, or 1,000 units.

Feature Engineering III – Group Features
In chapter five, I showed how additional features can turn raw numbers into a "map" that allows the neural network to orient itself and predict demand with useful accuracy. This time, however, I'm targeting items with short history, which simply don't contain enough information on their own for the model to make accurate predictions.
The goal
- Find similar items – those sharing the same category, seasonality, and promotion behavior.
- Use group embeddings – each segment receives its own vector capturing demand similarity.
- Calculate group averages – if an item has no history, the model can use averages from its group.
- Add these values as new features – the network learns that it can "transfer" information from rich series to sparse ones.
Why it works
- A new product in the "tools" category can immediately benefit from the history of "electrical accessories" if they share a long-term demand pattern.
- The model can better distinguish one-off sales spikes from normal seasonal patterns.
- Short series stop collapsing to zero forecasts, without harming the predictions of items with long history.
New group features
The same types of features I first used at the individual sales level are now applied at the group level (a sketch follows this list):
- Relative (ratio) features – comparing an item to its group average.
- Lag & rolling windows – delayed values and moving averages to capture trends.
- Wavelet signals – detecting periodic patterns.
- Trend indicators – slope and direction of sales changes.
- Absolute and log-transformed values – better scaling across volume levels.
For each item, I also calculate ratio-based log-features, giving the model a more fine-grained measure of how much the item deviates from its group – exactly the kind of missing signal short-history products need.
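A minimal sketch of how group-level averages and item-to-group ratio features of this kind could be built with pandas. The columns ID, category, seasonality, time_idx, and turnover come from the dataset described earlier; the grouping choice and derived column names are assumptions, not the project's actual code:

```python
import numpy as np
import pandas as pd

def add_group_features(df: pd.DataFrame, group_cols=("category", "seasonality")) -> pd.DataFrame:
    """Group-level averages, item/group ratios, and a rolling group trend (illustrative)."""
    df = df.sort_values(["ID", "time_idx"]).copy()
    group_cols = list(group_cols)

    # Average sales of the whole group in each month.
    df["group_mean"] = df.groupby(group_cols + ["time_idx"])["turnover"].transform("mean")

    # Ratio and log-ratio of the item against its group average.
    df["ratio_to_group"] = df["turnover"] / df["group_mean"].replace(0, np.nan)
    df["log_ratio_to_group"] = np.log1p(df["turnover"]) - np.log1p(df["group_mean"])

    # Lag and rolling mean of the group signal, computed per item so windows stay aligned in time.
    df["group_mean_lag1"] = df.groupby("ID")["group_mean"].shift(1)
    df["group_mean_roll3"] = (
        df.groupby("ID")["group_mean"]
        .transform(lambda s: s.rolling(3, min_periods=1).mean())
    )

    return df
```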
Results by history length
To get a detailed view, I again split validation into five segments by history length. Validation was run on a model trained on items with at least one month of history, plus the added group features.

| Metric | Full dataset (ALL) | Full dataset (Item) | 36 months (ALL) | 36 months (Item) | 12–35 months (ALL) | 12–35 months (Item) |
|---|---|---|---|---|---|---|
| WAPE | 32.4045 | 55.2969 | 30.3056 | 40.9703 | 35.6533 | 41.0825 |
| RMSE | 39.9062 | 20.7024 | 41.4421 | 21.1611 | 25.4482 | 14.6151 |
| R² | 0.8745 | | 0.8862 | | 0.7558 | |
| MAPE | 60.6599 | | 55.216 | | 62.2879 | |
| ROBUST | 0.5071 | 69.6946 | 0.5078 | 55.8221 | 0.567 | 61.2863 |
| STABLE | 0.3969 | | 0.3695 | | 0.4645 | |

| Metric | 6–12 months (ALL) | 6–12 months (Item) | 1–6 months (ALL) | 1–6 months (Item) | No history (ALL) | No history (Item) |
|---|---|---|---|---|---|---|
| WAPE | 43.6437 | 58.8135 | 53.8679 | 59.4594 | 95.8109 | 522.2663 |
| RMSE | 22.9386 | 15.9583 | 48.4724 | 33.3359 | 44.6723 | 27.5461 |
| R² | 0.7551 | | 0.647 | | -0.0372 | |
| MAPE | 63.4811 | | 73.2619 | | 373.445 | |
| ROBUST | 0.5038 | 64.5602 | 0.5628 | 75.3369 | 0.3612 | 550.8374 |
| STABLE | 0.4537 | | 0.4594 | | 1.0963 | |

- Full history (36 months): metrics remain virtually unchanged – group features did not harm the model.
- Medium-length series (12–35 months): results are comparable to the baseline, with no drop in performance.
- Short series (≤ 6 months): noticeable improvement in R² and MAPE – the model estimates the sales scale more accurately.
- No history (0 months): improved from "absolute disaster" to "still unusable," but the model now shows some ability to infer curve shape from embeddings and group context.

Synthetic Data – When to (Not) Include It in the Model
Short and zero histories are the biggest challenge for forecasting – the model often fails to estimate even the basic scale. To give these items at least a hint of "history," I replaced missing sales values with synthetic values. These values are calculated as a weighted average of sales from similar items, where:
- Similarity weights are derived from embeddings across a combination of groups (category × season × …).
- Unlike the feature-engineering approach, where multiple separate values are created for different groups, here only a single final value is generated for a given time point.
The next question: when should synthetic data be used?

| Variant | Advantages | Disadvantages |
|---|---|---|
| 1️⃣ Synthetic data already in training | • Model immediately learns the scale of the new item → less tendency to collapse to zero. • More "complete" series = better stability. • Reduced risk of noise from empty items. | • Trains on values that never actually existed → risk of noise. • Predefined patterns may persist even after real sales start to behave differently (overfitting to synthetic history). |
| 2️⃣ Synthetic data only at validation | • Training remains "clean," without noise risk. • Easy to test what synthetic data actually brings. • Synthetic rules can be swapped anytime without retraining. | • Model hasn't seen these patterns during training → may ignore synthetic data or scale it incorrectly. • Possible confusion (why were there zeros before, and now there aren't?). |

For evaluation, I again split the results into five history-length segments. All models already included the group-features extension.
Variant 1 – Synthetic in Training

| Metric | Full dataset (ALL) | Full dataset (Item) | 33 months (ALL) | 33 months (Item) | 13–32 months (ALL) | 13–32 months (Item) |
|---|---|---|---|---|---|---|
| WAPE | 32.1862 | 71.2019 | 30.0079 | 43.2471 | 43.5696 | 51.4367 |
| RMSE | 39.8141 | 20.1656 | 41.7923 | 20.6518 | 37.0207 | 18.3537 |
| R² | 0.8751 | | 0.8843 | | 0.7773 | |
| MAPE | 71.3488 | | 62.8897 | | 66.1173 | |
| ROBUST | 0.5136 | 87.4698 | 0.5182 | 64.0748 | 0.4673 | 65.5932 |
| STABLE | 0.3973 | | 0.3686 | | 0.544 | |
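To make the weighted-average construction described above concrete, here is a minimal sketch of backfilling missing months with an embedding-similarity weighted average of similar items. The embeddings dictionary, the cosine similarity, and the top-k neighbour choice are assumptions for illustration; the real implementation works over the full group structure:

```python
import numpy as np
import pandas as pd

def synthetic_history(df: pd.DataFrame, embeddings: dict[str, np.ndarray], top_k: int = 20) -> pd.DataFrame:
    """Fill months where an item did not exist (weight == 0) with a similarity-weighted
    average of sales from its nearest neighbours in embedding space (illustrative)."""
    df = df.copy()
    # Monthly sales pivot: rows = time_idx, columns = item ID.
    sales = df.pivot_table(index="time_idx", columns="ID", values="turnover")

    for item in df.loc[df["weight"].eq(0), "ID"].unique():
        if item not in embeddings:
            continue
        vec = embeddings[item]
        # Cosine similarity against every other item's embedding.
        sims = {
            other: float(np.dot(vec, emb) / (np.linalg.norm(vec) * np.linalg.norm(emb)))
            for other, emb in embeddings.items() if other != item
        }
        neighbours = sorted(sims, key=sims.get, reverse=True)[:top_k]
        weights = np.array([max(sims[n], 0.0) for n in neighbours])
        if weights.sum() == 0:
            continue
        # Weighted average of neighbour sales per month (one synthetic value per time point).
        blended = (sales[neighbours] * weights).sum(axis=1) / weights.sum()

        mask = (df["ID"] == item) & (df["weight"] == 0)
        df.loc[mask, "turnover"] = df.loc[mask, "time_idx"].map(blended)

    return df
```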
Inventory Forecasting with AI #8: Generalization – Including Items with Incomplete History
Inventory Forecasting with AI 8. Generalization – Including Items with Incomplete History

Full Dataset Test
In the previous chapter, we confirmed that the model performs "flawlessly" for the 70 % of the assortment with a complete 36-month history. Now it's time for a real challenge: what happens when we unleash the model on the entire dataset, including products with shorter or even nonexistent history?

| History Length | Number of Items | Share of Assortment |
|---|---|---|
| 36 months (full history) | 10,105 | 70.00 % |
| 12–35 months | 2,668 | 18.00 % |
| Less than 12 months | 1,211 | 8.00 % |
| No history | 519 | 4.00 % |

All the following experiments use the same model I built in earlier chapters – i.e. including all feature-engineering improvements and outlier control. I'm still using Scenario 2: 33 months for training, 3 months for validation.
I tested three training variants, while always validating on the full dataset (a filtering sketch follows the results summary below):

| Variant | Training Includes | New Items in Validation |
|---|---|---|
| A – Full history | Only items with complete history | 468 |
| B – Min. 1 month | All SKUs with at least 1 historical data point | 73 |
| C – No restrictions | All items, including those with no historical data | 0 |

Note: For example, in the Full history variant (A), validation contained 468 new items that the model had not seen during training.

What I'm Analyzing
- Global performance metrics (R², WAPE, RMSE, etc.) across variants A, B, and C
- Impact on items with full history – does adding "data-poor" items hurt performance on well-covered SKUs?
- Behavior of new or short-history items – can the model transfer knowledge from long time series, or do predictions still collapse to zero or behave randomly?

| Metric | Full history (ALL) | Full history (Item) | Any history (ALL) | Any history (Item) | Min. 1 month history (ALL) | Min. 1 month history (Item) |
|---|---|---|---|---|---|---|
| WAPE | 41.0748 | 133.7093 | 38.4147 | 93.4879 | 34.6338 | 58.0287 |
| RMSE | 48.8867 | 24.5721 | 57.3472 | 23.2029 | 35.4181 | 18.5031 |
| R² | 0.7952 | | 0.7408 | | 0.8511 | |
| MAPE | 131.7516 | | 70.8846 | | 68.2377 | |
| ROBUST | 0.4741 | 158.7209 | 0.4785 | 78.5382 | 0.5312 | 77.5052 |
| STABLE | 0.5013 | | 0.4613 | | 0.3681 | |

Summary of Results
- Training on full-history SKUs only (left column): every SKU used in training had a complete 36-month history. However, the model failed dramatically when asked to predict new or unseen SKUs – with item-level WAPE and MAPE both exceeding 130 %.
- Training on all SKUs, including those with zero history (middle column): this variant did not improve global metrics and actually increased RMSE. Including completely unknown products in training added significant noise to the learning process.
- The optimal variant was training on all items with at least 1 month of history (right column): this approach substantially reduced prediction error while maintaining a strong R² score. I selected this configuration as my new baseline for further development.
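A minimal sketch of how the three training subsets could be selected from the long table, assuming the per-item history length is derived from the weight column described in chapter 2; this is illustrative, not the project's actual code:

```python
import pandas as pd

def build_training_variants(df: pd.DataFrame) -> dict[str, pd.DataFrame]:
    """Split the long table into the three training variants A, B, C (illustrative)."""
    # Months of real history per item: weight == 1 marks months in which the item existed.
    history_len = df[df["weight"] == 1].groupby("ID")["time_idx"].nunique()

    return {
        # A – only items with a complete 36-month history
        "A_full_history": df[df["ID"].isin(history_len[history_len >= 36].index)],
        # B – every SKU with at least one real data point
        "B_min_1_month": df[df["ID"].isin(history_len[history_len >= 1].index)],
        # C – no restrictions, including items with no history at all
        "C_no_restrictions": df,
    }
```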
Visual Inspection
As in previous chapters, I manually reviewed forecasts for several problematic SKUs – this time expanded to include items with short or no history.
Displayed models:
🔴 Red – model trained on full-history items only
🟢 Green – model trained on all items, including zero-history
🟣 Purple – model trained on items with at least 1 month of history
Note: To maintain context on each item's history length, all plots show the full 36-month window.
For items with full history, all three models produced nearly identical forecasts. This means that even when items with missing history are added to training, the model remains stable – thanks to the weight feature, it learns to skip empty periods without distorting the signal.
Interestingly, models trained on mixed-history items (especially the purple one) often predicted higher and more accurate seasonal peaks than the model trained solely on complete series. Shorter time series likely provided extra contextual signals that helped the network generalize seasonal patterns more boldly – and, in many cases, more accurately.
For items with partial history, the red model (trained only on full-history data) performed significantly worse. Since it had no access to a single real record for these SKUs during training, it had no understanding of their typical turnover. As a result, the model could only guess based on similarity to other products – often leading to inaccurate predictions.
We see a similar drop in performance for items with very short history. Once again, the red model had nothing to grab onto – no useful data to estimate demand. The other two models, while also working with limited history, were still able to estimate the item's turnover level and get much closer to reality.
In the second visual, you can clearly see how the green and purple models successfully "pull up" the prediction based on similar SKUs. Even without direct historical data, they predict a sales spike in the second month, simply because similar products in the same group show the same pattern. In other words, the model correctly transferred knowledge from richer time series instead of defaulting to zero.
When no historical data is available at all, predictions become little more than educated guesses into the void. Without a real sales anchor, the models default to some kind of segment average – which leads to significant underestimation for some items and overestimation for others. Without at least one actual sales data point, the forecast remains highly uncertain and difficult to interpret.

How Model Performance Varies by History Length
To get a more detailed view, I validated performance across five segments, grouped by the length of each item's history. The following insights are based solely on the model trained on items with at least one month of sales history (green line in previous charts).

| Metric | Full dataset, 100 % of items (ALL) | (Item) | 36 months, 72 % of items (ALL) | (Item) | 12–35 months, 18 % of items (ALL) | (Item) |
|---|---|---|---|---|---|---|
| WAPE | 34.6338 | 58.0287 | 33.4331 | 45.3934 | 36.5081 | 49.183 |
| RMSE | 35.4181 | 18.5031 | 38.1581 | 19.3561 | 23.4648 | 13.58 |
| R² | 0.8511 | | 0.8535 | | 0.7525 | |
| MAPE | 68.2377 | | 61.9222 | | 67.8013 | |
| ROBUST | 0.5312 | 77.5052 | 0.5282 | 62.6903 | 0.5365 | 66.6689 |
| STABLE | 0.3681 | | 0.3505 | | 0.4093 | |

| Metric | 6–12 months, 4 % of items (ALL) | (Item) | 1–6 months, 4 % of items (ALL) | (Item) | No history, 2 % of items (ALL) | (Item) |
|---|---|---|---|---|---|---|
| WAPE | 37.7617 | 55.3109 | 56.1651 | 61.0081 | 117.1288 | 884.835 |
| RMSE | 36.3108 | 11.9183 | 46.4892 | 34.9088 | | |
Inventory Forecasting with AI #7: Feature Engineering II – How to Tame Seasonality, Zeros, and Promotions
Inventory Forecasting with AI 7. Feature Engineering II – How to Tame Seasonality, Zeros, and Promotions

Seasonal Sales
A classic example: Christmas products. They don't sell at all for ten months, and then sales suddenly spike in November and December. The key challenge here was teaching the model to distinguish true seasonal demand from random fluctuations or outliers. In the training window (33 months), such items only showed two peaks – which wasn't enough for the model to recognize a meaningful pattern. As a result, it treated them as noise and continued predicting zeros.
What went wrong:
- Too few repetitions – two peaks in 33 months don't form a statistically significant sample.
- Robust normalization – it softened the spikes, which made it easier for the model to ignore them.
- No explicit signal – the model didn't know that a specific month was special for the given item.
Each visualization includes a chart showing:
Black – actual historical sales
Red – model output before adjustment
Green – model output after adjustment
Solution: Feature Engineering to the Rescue
I extended the dataset with new signals to help the model understand seasonality:
- Peak is coming – a binary feature that flags the specific month (or time index) when sales consistently spike (e.g., Christmas, Black Friday, summer season).
- It's seasonal – a second binary signal that indicates whether this spike has occurred in at least two consecutive years, helping the model distinguish true seasonality from one-off events.
- Average peak size – a numerical feature showing the typical increase in sales during the spike, giving the model a sense of scale.
- Maximum peak size – the historically highest recorded spike, useful for anchoring expectations.

Zero Predictions
For part of the catalog, the model consistently predicted zero. Items with very low turnover and long zero-sales stretches quickly "taught" the model a safe rule → always predict zero.
What went wrong:
- Imbalanced signal – the data contained far more zeros than actual sales, so the model naturally slid toward zero predictions.
- Missing context ("this is normal") – the model couldn't recognize that occasional small sales were normal for the item, not noise.
- Normalization drowned the signal – small-volume items got lost among those with high sales, making small variations practically invisible.
Solution
This turned out to be one of the most difficult challenges. For some items – especially when low volumes were combined with promotional campaigns – zero predictions still dominated. The breakthrough came only after changing the normalization strategy (an individual approach for each feature) and adding new contextual features (a sketch follows this list):
- Item age – the model now knows how long the product has been on the market. New items are allowed to have a few initial zeros, while older products with long sales gaps indicate possible decline.
- Time since last sale – how many months have passed since the item was last sold. A recent sale increases the likelihood of another one.
- Time without sales – how long the item has been inactive. The longer the gap, the more cautiously the model predicts new sales.
- Seasonal month flag – a reminder that a seasonal sale historically occurred in this specific month.
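A minimal sketch of the activity- and seasonality-related features described above, built from the monthly long table. The columns ID, time_idx, turnover, and weight come from the dataset described earlier; the peak detection rule and derived column names are simplifications for illustration:

```python
import numpy as np
import pandas as pd

def add_activity_features(df: pd.DataFrame) -> pd.DataFrame:
    """Item age, time since last sale, and a simple seasonal-month flag (illustrative)."""
    df = df.sort_values(["ID", "time_idx"]).copy()

    # Item age: months since the first month the item existed (weight == 1).
    first_month = df[df["weight"] == 1].groupby("ID")["time_idx"].min().rename("first_month")
    df = df.merge(first_month.reset_index(), on="ID", how="left")
    df["item_age"] = (df["time_idx"] - df["first_month"]).clip(lower=0)

    # Time since last sale: carry forward the index of the last month with sales > 0.
    last_sale_idx = df["time_idx"].where(df["turnover"] > 0)
    df["last_sale_idx"] = last_sale_idx.groupby(df["ID"]).ffill()
    df["months_since_sale"] = df["time_idx"] - df["last_sale_idx"]

    # Seasonal month flag: the same calendar month spiked above 2x the item's median
    # in at least two different years (assuming time_idx 0 corresponds to January).
    df["month"] = df["time_idx"] % 12
    median = df.groupby("ID")["turnover"].transform("median")
    spike = df["turnover"] > 2 * median.clip(lower=1)
    spike_years = spike.groupby([df["ID"], df["month"]]).transform("sum")
    df["seasonal_month_flag"] = (spike_years >= 2).astype(int)

    return df.drop(columns=["first_month", "last_sale_idx"])
```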
Promotional Sales
A typical example is small electronics: most of the year, sales trickle in slowly – just a few units per month. But once the e-shop launches a –20 % weekly promo, orders can spike by dozens of units, only to drop back down once the campaign ends. Promotions were fairly common in the dataset – some items had several campaigns per year.
What went wrong:
- Promo signal got lost in the crowd – after expanding the dataset with many engineered features, only two were directly related to promotions. Their influence in the model's attention mechanisms weakened significantly. In other words, in the original dataset promo indicators were dominant; now they were drowned in noise and largely ignored.
- Missing campaign context – the binary SALE feature could say "a discount is active now," but it didn't convey how long it lasted or when a similar campaign happened in the past.
- Equal treatment of all months – from the loss function's perspective, underpredicting a promo month was just as bad (and no worse) than underpredicting a regular month.
Solution
Once again, I turned to feature engineering and added new contextual information to the dataset:
- Time since last promotion – the model learns that the longer the gap since the last promo, the higher the chance that a new one will re-spark demand.
- Duration of active promotion – a second signal tells the model how long the current discount has been running.
- Product activity – features such as item age, time gaps between sales, and number of sales in the past year help eliminate false positives for products that have essentially gone inactive.
- Discount month indicators – information about which months the item has typically been on sale in the past.

How the Metrics Changed
I reran the test with these new features included. At first glance, global metric changes were minimal, and in some cases even slightly worse. That's expected: items with seasonal, promotional, or irregular sales patterns make up only a small portion of the total volume. In the aggregated results, they are outweighed by the rest of the assortment. So the slight worsening likely reflects statistical noise rather than a real drop in model quality. Visual inspection, however, confirmed clear improvements in forecasts for problematic items.
It's also important to keep in mind that this is a stochastic problem – model outputs naturally vary between training runs. If your setup is correct, these variations should stay small – typically just a few percentage points.

| Metric | Prediction (ALL) | Prediction (Item) | New prediction (ALL) | New prediction (Item) |
|---|---|---|---|---|
| WAPE | 28.4178 | 42.0787 | 29.9001 | 39.1834 |
| RMSE | 46.3156 | 21.1877 | 52.8141 | 20.9628 |
| R² | 0.9004 | | 0.9072 | |
| MAPE | 52.4587 | | 59.802 | |
| ROBUST | 0.4986 | 59.7858 | 0.5259 | 62.4322 |
| STABLE | 0.3677 | | 0.3702 | |
Inventory Forecasting with AI #6: Outliers in the Data – The Next Step Toward Robust Forecasting
Inventory Forecasting with AI 6. Outliers in the Data – The Next Step Toward Robust Forecasting

Why Handle Outliers?
Every product range occasionally experiences a one-time "spike" – for example, a customer purchasing hundreds of units for their own promotion, or a company buying out an entire truckload. These volumes are real but non-repeatable, and they stand out significantly during a typical month.
How will the model react?
- The model overfits to the extreme – it begins to overstock an item that normally sells only in small quantities.
- The model ignores the extreme – it learns that the spike is just noise and, in the process, suppresses real seasonal peaks.
- The model shares patterns across products – a single outlier can disrupt entire product groups.
- Large absolute errors inflate evaluation metrics (e.g., RMSE).
The chart shows detected outliers across the full dataset. It's clear that even with a high detection threshold, many outliers exist and deviate significantly from the average. That's why it's necessary to detect outliers and decide what to do with them. The key question is: where does legitimate sales volume end, and where does an outlier begin?
Note: At standard detection settings, we're already seeing over 1,000 outliers.
How Do I Detect Outliers?
I use a combination of a robust center (trimmed mean), MAD score, and percentile-based filtering. Sales values that are both above the x-th percentile and further than k × MAD from the center are labeled as outliers and capped to a "safe upper limit" (a sketch follows below).
MAD score: |x – median| / MAD, where MAD is the median absolute deviation.
Threshold options:
- No outlier cut – keep all extremes.
- Medium outlier cut (threshold k ≈ 3–4) – the model becomes more stable, but we risk removing valid seasonal peaks.
- Top outlier cut (threshold k ≈ 8–10) – keeps most of the data and filters only true outliers; in testing, this proved to be the safest compromise.
Striking a Balance
Outlier handling is a nuanced topic. According to the metrics, a "medium cut" delivers the best results. However, this setting trims thousands of peaks – including some that may represent real demand – and can ultimately backfire.

| Metric | No outlier cut (ALL) | No outlier cut (Item) | Medium outlier cut (ALL) | Medium outlier cut (Item) | Top outlier cut (ALL) | Top outlier cut (Item) |
|---|---|---|---|---|---|---|
| WAPE | 30.2608 | 44.5829 | 28.8406 | 42.9816 | 28.7899 | 43.5887 |
| RMSE | 46.1427 | 22.3537 | 48.0208 | 21.3234 | 48.2553 | 21.7787 |
| R² | 0.9012 | | 0.9141 | | 0.8904 | |
| MAPE | 60.0505 | | 58.8268 | | 56.4778 | |
| ROBUST | 0.4948 | 62.2027 | 0.5001 | 61.8983 | 0.4874 | 62.6898 |
| STABLE | 0.365 | | 0.3482 | | 0.3718 | |

My decision: after visual inspection, I decided on a conservative approach – cut only the most extreme spikes. This way, the model retains important seasonal signals. The metrics may lose a bit of precision, but the forecasts stay more faithful to real product behavior. I'd rather accept a one-point WAPE increase than risk chopping off half of the Christmas demand.
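A minimal sketch of the capping logic described above (robust center, MAD score, percentile filter). The thresholds, the use of the median as the center, and the column names are illustrative choices, not the project's exact settings:

```python
import pandas as pd

def cap_outliers(df: pd.DataFrame, k: float = 8.0, pct: float = 0.99) -> pd.DataFrame:
    """Cap sales that are both above the pct-th percentile of the item's history
    and further than k * MAD from a robust center (illustrative)."""
    df = df.copy()

    def cap_item(sales: pd.Series) -> pd.Series:
        center = sales.median()
        mad = (sales - center).abs().median()     # median absolute deviation
        if mad == 0:
            return sales
        upper_pct = sales.quantile(pct)
        mad_score = (sales - center).abs() / mad
        is_outlier = (sales > upper_pct) & (mad_score > k)
        # "Safe upper limit": the larger of the percentile bound and center + k * MAD.
        safe_limit = max(upper_pct, center + k * mad)
        return sales.where(~is_outlier, safe_limit)

    df["turnover"] = df.groupby("ID")["turnover"].transform(cap_item)
    return df
```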
Sparse Categories
Besides turnover outliers, another problem emerges: categories with very few items. Why is this harmful to the model? During deep-learning training, each category gets an embedding vector (which encodes similarities between categories). For a group with only a handful of items, the vector is trained on virtually no data:
- It carries no meaningful signal (the network doesn't "learn" it).
- It dilutes model capacity – adding parameters that don't contribute to prediction.
- It increases the risk of overfitting to that single item.
Solution: merge all sparse segments into a single label, e.g. "unknown", and keep all other groups with sufficient size unchanged. This way, embeddings are only created for categories that have at least several items – resulting in fewer but more informative vectors. It also reduces VRAM requirements.

Other Settings
Once the data was cleaned and enriched with meaningful features, the second half of the job began – convincing the neural network to learn the right way. In practice, this is like fine-tuning a coffee machine: same beans, same water, but tiny differences in pressure or temperature make all the difference in taste.
Modern deep-learning models have orders of magnitude more parameters than classic algorithms – and dozens of hyperparameters to tune. The final setup always depends on the character of the data (seasonality, promotions, long tail) and the business goal (minimize stockouts, reduce inventory, estimate uncertainty).
A few key parameters:
- Loss function – what we consider an error. A function that measures the difference between the model's prediction and reality, guiding how much the network should adjust during training.
- Optimizer – how the model learns. An algorithm that updates the network's weights based on the loss and its gradients to improve predictions.
- Learning rate – how fast the model learns. Controls how much the weights are adjusted at each training step.
- Model size – learning capacity. The number of neurons and layers determines how many patterns the model can absorb.

Summary – Current Progress
At this stage, I wrapped up the preparation phase and ran the first test on Scenario 2:
- 33 months of historical data for training
- 3 months of validation data, not seen during training
- Only items with a complete 36-month history
- Feature engineering complete
- Extreme values trimmed

| Metric | Value (ALL) | Value (Item) |
|---|---|---|
| WAPE | 28.4178 | 42.0787 |
| RMSE | 46.3156 | 21.1877 |
| R² | 0.9004 | |
| MAPE | 52.4587 | |
| ROBUST | 0.4986 | 59.7858 |
| STABLE | 0.3677 | |

The results confirm that combining enriched features with outlier control delivers the first practically usable outcomes – predictions are more stable and, most importantly, capture both the trend and the absolute demand level more accurately. However, even after all these improvements, visual inspection revealed three specific types of items where the model still struggles:

| Item type | Symptoms | Why It's a Problem |
|---|---|---|
| Seasonal sales | 10 months of zero sales, then a sudden spike | The model smooths the peak away as noise |
| Zero predictions | Irregular sales, frequent zero months | Forecasts drift toward zero |
| Promotions | Sales increase only slightly during promos | Promotional effects aren't reflected properly in predictions |

The chart illustrates a typical seasonal item. The model fails to reproduce this seasonality without further support. In some cases – mostly low-turnover items with frequent zeros – the model persistently predicted zeros, regardless of past spikes. Another graph shows the model's poor reaction to promotional campaigns.
Inventory Forecasting with AI #5: Feature Engineering I – A Leap in Accuracy
Inventory Forecasting with AI 5. Feature Engineering I – A Leap in Accuracy

DL Model Selection
In the previous part we decided to base further development on deep-learning architectures. To keep future visualisations readable, I will focus mainly on the Temporal Fusion Transformer (TFT). Three key arguments led me to this choice:
- Scaling with complexity – the richer and more granular the dataset (dozens of features, promo flags, seasonality, relative indicators), the more clearly TFT outperforms other models.
- DeepAR limits on the long tail – DeepAR tends to pull low-volume or sporadic SKUs down to zero, distorting purchase plans for slow-moving yet important items.
- Attention = smart feature selection – TFT automatically decides which input columns matter at any moment and can explain that choice; a major benefit when you have 100+ features.
At first, DeepAR seemed to achieve comparable metrics with lower compute demand, but once we enriched the inputs with more complex seasonal, promo, and relative signals, the picture flipped: TFT kept scaling while DeepAR hit its limits.

Scenario 1 – Why It Is Wrong
During further testing, the drawback of the first scenario became obvious – training and validation data were identical. The charts show that Scenario 1 tracks reality very closely throughout, because the model had a chance to "see the future." Once we withheld the last three months (Scenario 2), the gap appeared quickly. On the final chart with a seasonal item, the model fails completely.

| Metric | Scenario 1 (ALL) | Scenario 1 (Item) | Scenario 2 (ALL) | Scenario 2 (Item) |
|---|---|---|---|---|
| WAPE | 32.1632 | 45.7959 | 53.1177 | 66.9848 |
| RMSE | 62.0655 | 25.2746 | 83.1493 | 45.7434 |
| R² | 0.8603 | | 0.7754 | |
| MAPE | 63.9494 | | 95.889 | |
| ROBUST | 0.4796 | 66.0129 | 0.4184 | 98.0722 |
| STABLE | 0.3819 | | 0.4615 | |

When reviewing the Scenario 2 results, we can see the model has some predictive power, especially for SKUs with long and stable history, but the results are still well below expectations. Three key weaknesses emerged:
- Low-turnover items – forecasts dropped to zero; risk of under-stocking the long tail.
- Promo peaks – without an explicit discount signal, the models reproduced the uplift only partially.
- Seasonal cycles – for highly seasonal items, the models underestimated the amplitude of summer and Christmas peaks.
It was clear the network needed additional inputs to read the context missing from the raw numbers – feature engineering.

What Are Features?
When working with AI on time series, each data row represents a Product (ID) × Time point (e.g., item A in March 2024). Everything else in that row – segment, category, sales, price – we call features. These columns provide the model with the context it needs: they help it understand a product's behaviour over time, its similarity to others, and the influence of external factors such as promos or season.
Why? A raw sales number tells only how many units were sold. Features add the why: it was August, a discount was running, the item is new, high season is peaking. Thanks to this, the model can recognise patterns it would never extract from a plain sales series.
How the Model Works with Features
As noted, each row in our dataset is a Product (ID) × Month (time_idx) pair. Everything else on that row counts as a feature.
Feature list for the project dataset: ID, seasonality, category, type, segment, segment 1, segment 2, name, turnover, date, time_idx, weight, SALE, SALE_INTENSITY, product_volume__bin
Three Types of Features
The basic split is into static and dynamic – that is, whether the values change over time. Product type is the same for every month, whereas a promo flag appears only in certain months.

| Feature type | Example | How it helps the model |
|---|---|---|
| Static | Product type, SKU ID, category, segment | Lets the model share patterns across similar items and supports new products that have very little history. |
| Dynamic – known | Calendar date, discount flag | Values change month by month and are known in advance, so the model can "look ahead" (e.g., it already knows when a promo will run). |
| Dynamic – unknown | Sales volume | Values change every month and are not known for the future – this is the variable we actually want the model to predict. |

Numerical vs. Categorical Features
Numerical features (sales, averages, discounts) → fed directly into the network.
Categorical features (product type, segment) → turned into embeddings, short vectors learned together with the model.
Why embeddings help: two vectors that sit "close" in space = two categories whose demand behaves similarly. A brand-new "tools" SKU can instantly benefit from the sales history of "electro-accessories" if their profiles match.
Which Features I Added – and Why
🗓️ Date encoding
The model doesn't "see" calendar months, only a numeric index (time_idx = 0 … 36). These features restore that link.
Harmonic month code (sin / cos) → the network learns that January follows December and the whole year is cyclical. Helps capture periodic sales patterns (quarters, years, summer season).
📊 Relative (ratio) features
Express the deviation or share of a SKU's sales against a larger whole (group mean, seasonal maximum, long-term trend, …). The model instantly spots when an item is above or below its norm and reacts faster to unexpected swings.
🌀 Lag & rolling windows
Supply detailed information on the recent trajectory. Speed up detection of momentum (sales speeding up or slowing down). Rolling mean / std filter noise and show whether current sales sit above or below their moving average.
🌊 Wavelet signals
Decompose the curve into short ripples versus long trends. The network simultaneously "sees" fine promo jumps and slow multi-year cycles.
📈 Trend
Adds the direction and slope of sales. Provides context on how sales fluctuate around their mean.
🔢 Absolute and log values
Most features are created in both raw and logarithmic form. Why logarithmic?
- Compresses extremes → small items aren't drowned out.
- Stabilises variance; curves sit closer to normal → faster learning.
- Converts multiplicative jumps (×2, ×3) into linear shifts the model can capture easily.
Result
Thanks to this combination of inputs, the model now predicts turnover and understands the context of each month – what is a normal trend, what is an outlier, and what is a promo spike.
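A minimal sketch of a few of the feature families above (harmonic month encoding, lags, rolling statistics, log transforms). This is a simplified illustration built on the columns ID, time_idx, and turnover, not the project's actual pipeline:

```python
import numpy as np
import pandas as pd

def add_basic_features(df: pd.DataFrame) -> pd.DataFrame:
    """Harmonic month encoding, lag/rolling features, and log-transformed sales (illustrative)."""
    df = df.sort_values(["ID", "time_idx"]).copy()

    # Harmonic encoding of the calendar month: the network learns that December and
    # January are neighbours on the yearly cycle (assuming time_idx 0 is January).
    month = df["time_idx"] % 12
    df["month_sin"] = np.sin(2 * np.pi * month / 12)
    df["month_cos"] = np.cos(2 * np.pi * month / 12)

    # Lags and rolling statistics per item.
    g = df.groupby("ID")["turnover"]
    df["sales_lag1"] = g.shift(1)
    df["sales_lag12"] = g.shift(12)
    df["sales_roll3_mean"] = g.transform(lambda s: s.rolling(3, min_periods=1).mean())
    df["sales_roll3_std"] = g.transform(lambda s: s.rolling(3, min_periods=1).std())

    # Log-transformed sales: compresses extremes and turns multiplicative jumps into additive shifts.
    df["log_sales"] = np.log1p(df["turnover"])

    return df
```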
Inventory Forecasting with AI #4: Baseline Models vs. Deep Learning
Inventory Forecasting with AI 4. Baseline Models vs. Deep Learning

Why Start with Classic ML Models?
- Quick reality check – baseline algorithms train in minutes and instantly reveal major data issues (bad calendar encoding, duplicated rows, etc.).
- Reference benchmark – once we know the performance of linear regression or a decision tree, we can quantify exactly how much a deep-learning model must improve to justify its higher cost.
- Transparency – simpler models are easier to explain; they help us understand the relationship between inputs and outputs before deploying a more complex architecture.

| Baseline Model | How It Works |
|---|---|
| LR – Linear Regression | Can capture simple increasing/decreasing trends. Once the data becomes more complex, it quickly loses accuracy. |
| DT – Decision Tree | Can handle more complex curves and nonlinearities, but if the tree grows too deep, it overfits – it starts repeating patterns. |
| FR – Random Forest | Averages out errors from individual trees → more stable than a single tree, but more memory-intensive and slower when dealing with a large number of items. |
| XGB – XGBoost Regressor | Often the best among traditional ML models – it can capture complex relationships without much manual tuning. |
| GBR – Gradient Boosting Regressor | Good as a low-cost benchmark: if even this fails, the data is truly challenging. |

What Does a 12-Month Forecast Visualization for 4 SKUs Look Like?
Deep Learning (DL) Architectures I Deployed
For comparison, I built three basic deep-learning models. In all cases, these are AI models designed for time-series forecasting. While the baseline models belong to the simpler category, these belong to the more complex end of the spectrum.

| DL algorithm | How It Works | Advantages / Disadvantages |
|---|---|---|
| DeepAR (autoregressive LSTM) | Learns from similar items; during prediction, it generates the full probabilistic range of demand → we can see both optimistic and pessimistic sales scenarios. | For new products, long-tail items, or series with many zeros and extremes, it tends to "smooth out" the curve. |
| TFT (Temporal Fusion Transformer) | Excellent for complex scenarios – it automatically selects which information is important and can explain its choices. | Best for complex datasets with many external signals, but requires a large amount of data and has long training times. |
| N-HiTS (hierarchical interpolation) | Decomposes a time series into multiple temporal levels (year/month/week), applies a dedicated small network at each level, then recombines them. | Great for long forecast horizons and fast inference, but requires a regular time step. |

Why These Complex Models?
- They capture subtle and long-term seasonal patterns and promotional effects that classical ML often overlooks.
- They share learned knowledge across the product range – helping new items and those with short histories.
- They return uncertainty intervals, so the buyer receives a recommendation of "how much to order" in both optimistic and pessimistic scenarios.
What Does a 12-Month Forecast Visualization of DL Models for 4 SKUs Look Like?
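To make the baseline side of this comparison concrete, here is a minimal sketch of the kind of quick classical benchmark described above: a gradient-boosting regressor on simple lag features, evaluated with WAPE on the last three months. The lag features, split, and hyperparameters are illustrative assumptions, not the project's actual configuration:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

def baseline_benchmark(df: pd.DataFrame) -> float:
    """Fit a quick gradient-boosting baseline on lag features and return WAPE (%)
    on the last 3 months of each series (illustrative)."""
    df = df.sort_values(["ID", "time_idx"]).copy()
    lags = (1, 2, 3, 12)
    for lag in lags:
        df[f"lag_{lag}"] = df.groupby("ID")["turnover"].shift(lag)
    df = df.dropna(subset=[f"lag_{lag}" for lag in lags])

    cutoff = df["time_idx"].max() - 3
    train, valid = df[df["time_idx"] <= cutoff], df[df["time_idx"] > cutoff]
    features = ["time_idx"] + [f"lag_{lag}" for lag in lags]

    model = GradientBoostingRegressor(random_state=0)
    model.fit(train[features], train["turnover"])
    pred = model.predict(valid[features])

    # WAPE: total absolute error relative to total actual sales.
    wape = (valid["turnover"] - pred).abs().sum() / valid["turnover"].abs().sum() * 100
    return float(wape)
```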
Results
The testing was conducted only under Scenario 1 – both training and validation were performed on the same dataset. I used 5,000 products with a full 36-month history. Classical models provide an initial orientation, but even the best of them fall short of the target accuracy. Deep learning brings a significant improvement – even in the basic configuration, it reduces the error by half or even two-thirds. This is a clear signal that it's worth investing time into tuning features and hyperparameters and deploying the model in production.
Comparison metrics between ML and DL models (I started using the advanced validation metrics only at a later stage):

ML – baseline models
| Model | R² | WAPE | RMSE | MAPE |
|---|---|---|---|---|
| GBR | 0.4259 | 124.9833 | 173.599 | 275.8702 |
| FR | 0.3444 | 133.5684 | 157.6839 | 217.063 |
| DT | 0.2529 | 142.5767 | 165.8517 | 212.5132 |
| XGB | 0.2153 | 146.125 | 168.5045 | 266.1286 |
| LR | 0.0745 | 158.6966 | 200.3072 | 477.9817 |

Deep learning models
| Model | R² | WAPE | RMSE | MAPE |
|---|---|---|---|---|
| DeepAR | 0.847 | 53.615 | 62.8242 | 188.5804 |
| TFT | 0.3901 | 30.2255 | 125.4162 | 141.6336 |
| N-HiTS | 0.432 | 206.9327 | 121.0321 | 382.6455 |

Test 2 – Encoder vs. Decoder
In the previous phase, I confirmed that deeper (DL) architectures clearly outperform traditional ML. Before diving into tuning parameters like the loss function, optimizer, or dropout (which I won't cover here), it's essential to understand the encoder–decoder architecture. In time-series models, this is a key component that determines how much of the past the model "reads" and how far into the future it predicts.
Encoder: defines how much of the historical data the model uses to generate a single prediction. You can think of it like a buyer looking at the last 12 months of history.
Decoder: represents the forward-looking part – it generates predictions based on the information provided by the encoder.
Unlike a human buyer, the architecture creates multiple predictions (sliding windows) during training for each product. If I have 36 months of history and I want the model to look back at the last 18 months and predict the next 3 months, then 18 sliding windows are generated during training over this time range (see the sketch after the tables below).

| Encoder | Decoder |
|---|---|
| 0–18 | 19–21 |
| 1–19 | 20–22 |
| … | … |
| 17–35 | 36–38 |

A three-year history thus creates 18 training scenarios for each item; the model sees all possible transitions (e.g., winter → spring, Christmas → January, etc.).
Main difference and impact on model behavior: what's the difference between setting the encoder length to 18 versus 30 months? The key difference lies in the types of patterns the model learns and what it emphasizes. It's a classic trade-off between flexibility and stability.

| Encoder_length = 18 | Encoder_length = 30 |
|---|---|
| Better adaptation to changes: the model reacts faster to sudden market shifts (trends, promotions) | Good at seasonality and long-term trends: better at capturing repeating patterns and slow long-term trends |
| More training windows: faster learning, lower risk of overfitting (especially for short histories) | Prediction stability: robust against short-term noise |
| Lower memory and computation requirements: faster inference and training | Fewer training windows: longer training, higher overfitting risk if data is limited |
| Struggles with seasonal cycles: the model may not fully learn patterns that repeat over multiple years | Slower reaction to changes: takes longer to adapt to sudden market shifts (trends, promotions) |
| Seasonal memory: higher risk of "forgetting" past seasonal peaks like Christmas | Higher memory and computation requirements |
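A minimal sketch of how such encoder/decoder sliding windows can be enumerated for one item, assuming a 36-month history, an 18-month encoder, and a 3-month decoder. Note that if the decoder has to stay inside the observed history, 36 − 18 − 3 + 1 = 16 complete windows fit; the table above also lists windows whose decoder reaches past month 36. The function name and counting convention here are illustrative:

```python
def sliding_windows(n_months: int = 36, encoder_len: int = 18, decoder_len: int = 3):
    """Enumerate (encoder, decoder) index windows over a single item's history."""
    windows = []
    # Last start position at which the decoder still fits inside the observed history.
    last_start = n_months - encoder_len - decoder_len
    for start in range(last_start + 1):
        encoder = list(range(start, start + encoder_len))
        decoder = list(range(start + encoder_len, start + encoder_len + decoder_len))
        windows.append((encoder, decoder))
    return windows

print(len(sliding_windows()))  # 16 windows fully inside a 36-month history
```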
Inventory Forecasting with AI #3: How to validate this?
Inventory Forecasting with AI 3. How to Validate This?

The Core Question Before Any Modelling
Validation must show how the model handles data it has never seen – exactly the situation in production. So the dataset has to be split:
- Training data – the model learns patterns here.
- Validation (test) data – here we check what the model really knows.
But where to draw the line? I wanted to feed the model every possible hint – all SKUs and all 36 months – to capture as many patterns as possible. Yet part of the timeline had to stay hidden; otherwise the network would simply memorise history and crash in the real world.
The solution? I built three validation scenarios of increasing strictness, from a "quick check" (the model validates only on data it already saw) to a hard test on truly unseen months.
3 Validation Scenarios
Scenario 1 – Smoke test
Validation on the same dataset used for training (all 36 months). Goal: make sure the model can find patterns and predict values it has already seen – while watching for overfitting.
Scenario 2 – Semi-strict
Training on 33 months, validation on the last 6 months (split 3 + 3: three months the model saw during training, three it did not). This setup is much closer to real deployment: the model first "walks through" history, learns the patterns, and only then predicts a fresh, unknown period. It lets us:
- Test generalisation – can the model extend the trend, or does it just echo the past?
- Spot shifts in seasonality – e.g. will it handle the Christmas peak or other patterns?
- Check promo impact – are forecasts influenced by planned discount campaigns?
- Gauge sensitivity to new items – the last months often contain SKUs with a short history; validation shows right away how the model copes with a changing assortment.
Scenario 3 – Strict
Same training window as Scenario 2; validation = only the last 3 unseen months. I use this as a control check.
Only after I was satisfied with the results did I train the final version on the full dataset (no hold-out), so the model could leverage 100 % of the data for live deployment in the warehouse.
Note: All metrics and charts shown later in the article come from Scenario 2 (unless explicitly stated otherwise). Metrics are calculated on the final 3 + 3 months; visuals use a 12-month window for clarity.
The next graph illustrates the gap between Scenario 1 and Scenario 2.
Red = Scenario 1, where the model saw every outcome during training.
Green = Scenario 2, where the last three months were hidden.
It is clear that unseen data must be included in validation from day one.

Which Metric to Choose When You Can't Manually Inspect Every SKU?
In day-to-day practice, people reach for R², RMSE, or MAPE – so why invent anything else? The core trouble is the presence of zeros and the huge spread of values.
Why are zeros an issue? Imagine the model predicts a turnover of 2 units for a month, while the real sales were zero. A difference of only two pieces, yet a percentage metric such as MAPE explodes, because you divide by zero (or by a tiny number). The same happens with RMSE: a few "small" deviations on high-volume items can inflate the total error and completely hide how well the model performs on key SKUs.
Why is high value variability a problem?

| Item | Actual sales | Predicted sales | Absolute error | Relative error |
|---|---|---|---|---|
| A (low-volume) | 5 pcs | 8 pcs | 3 pcs | +60 % |
| B (high-volume) | 5,000 pcs | 4,800 pcs | 200 pcs | –4 % |

MAPE shoots up because of item A (60 %), even though we're talking about ±3 pieces.
RMSE instead highlights item B (200 pcs), being highly sensitive to large absolute errors. R² may look superb (99 % of variance explained) thanks to high-volume items, yet it tells us little about "small" SKUs.
No single metric is enough. With extreme variability and lots of zeros or small items, I had to switch to a combination of metrics to get a fair view of model quality.

| Metric | What it measures | Note | Goal |
|---|---|---|---|
| WAPE (Weighted Absolute Percentage Error) | What % of total sales the model "missed." | Sensitive to SKUs with zero sales. | ⬇️ |
| RMSE (Root Mean Squared Error) | On average, by how many units the model was off. | Sensitive to high-volume SKUs. | ⬇️ |
| R² (coefficient of determination) | How much of the variation in sales the model correctly "explains." | Strongly influenced by high-volume SKUs. | ⬆️ |
| MAPE (Mean Absolute Percentage Error) | Average percentage error per item. | Sensitive to low-volume SKUs. | ⬇️ |
| Robust score | Composite of accuracy, deviation control, and trend-direction match (1 = great, 0 = poor). | Captures whether the forecast follows rising/falling trends. | ⬆️ |
| Stable score | WAPE plus extra checks on deviations and relative error for small sales (0 = great). | Makes sure the forecast is smooth and not overreacting on low-volume items. | ⬇️ |

Goal tells us which direction we want the number to move (low ⬇️ or high ⬆️). My primary yardsticks are WAPE, RMSE, and R². MAPE is not reliable for highly variable SKUs. Robust and Stable serve mainly as "concept checks"; once the model is roughly tuned, those scores hardly change.
Isn't That Too Simple?
During training I ran into one more snag: dataset-wide scores look fine, but they don't flag outliers at the SKU level. So I expanded each metric into two flavours:

| Metric version | How it's computed | What question it answers |
|---|---|---|
| Whole-dataset | All SKUs are concatenated into one long vector, then the metric is calculated. | How many units do we miss in total? → direct financial impact on inventory. |
| Per-SKU average | The metric is computed for every item first, then the item-level values are averaged. | How accurate is the forecast for a typical product? → quality across the entire assortment. |

Below is an example of these metrics for a two-layer neural network with a soft-plus normaliser. Note the gap between WAPE and MAPE – the very issue discussed above.

| Metric | Value (ALL) | Value (Item) |
|---|---|---|
| WAPE | 50.1925 | 387.2097 |
| RMSE | | |
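A minimal sketch of the two metric flavours, using WAPE as the example; the column names (ID, y_true, y_pred) are hypothetical:

```python
import pandas as pd

def wape(y_true: pd.Series, y_pred: pd.Series) -> float:
    """Weighted absolute percentage error in %."""
    denom = y_true.abs().sum()
    if denom == 0:
        return float("nan")  # items with zero total sales have no defined WAPE
    return float((y_true - y_pred).abs().sum() / denom * 100)

def wape_all_and_item(df: pd.DataFrame) -> tuple[float, float]:
    """Whole-dataset WAPE vs. the average of per-SKU WAPEs."""
    # Whole-dataset: all SKUs concatenated into one long vector.
    wape_all = wape(df["y_true"], df["y_pred"])

    # Per-SKU: compute the metric per item, then average the item-level values.
    per_item = df.groupby("ID").apply(lambda g: wape(g["y_true"], g["y_pred"]))
    wape_item = float(per_item.mean())

    return wape_all, wape_item
```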
Inventory Forecasting with AI#2: Data Structure & First Pass Analysis
Inventory Forecasting with AI 2. Data Structure & First-Pass Analysis

First Look at the Data
The customer supplied plain-text exports – CSV files. Direct integration with internal systems was deliberately postponed; the aim at this stage was simply to verify that an AI model could work on the data.
Each row in the files represents one product in one period and contains:
- A unique product identifier (identical in all files)
- Seasonality tag, category, type, and segment
- Description and any sub-category breakdown
- Unit of measure and detailed item description
- History for the last 36 months
- Turnover (in units and in currency)
- Quantity ordered
- Dates and labels of promotional campaigns
In total there were roughly 15,000 individual SKUs. The basic time index is the month, which lets us follow sales and inventory development over time.
Even at first glance it was clear the data were incomplete and showed several inconsistencies:
- Missing sales history: for many products, part or all of the history was absent, typically for new or strongly seasonal items.
- Negative sales values: some records contained negative sales, often returns, cancellations, or input errors.
- Unit-of-measure mismatches: different units make turnover interpretation difficult (pieces vs. grams, single units vs. thousands).
- Duplicate records: certain items appeared more than once with identical or slightly differing attributes.
- Promo duration vs. sales granularity: promotions were recorded in days, whereas sales were monthly.
- Uneven history length: some products had a three-year history; others only a few months.
- Other missing data: absent dates, categories, etc.
The first graph (box plot) shows the distribution of product turnovers across the entire dataset, plotted on a logarithmic scale, while the second graph (histogram) displays the frequency of individual turnover values. The median log(turnover) is roughly 3.5, which corresponds to a typical volume of around 33 units per month. Most products fall in the 1.5–5 range (≈ 4–150 units/month). A few items, however, exhibit much higher turnovers; these extreme values form the box plot's right-hand "tail".
A large share of products has zero or very low sales (left side). The highest density is between log(turnover) 2–4, i.e. roughly 7–55 units per month. Only a small subset lies on the far right – products with very high turnover. They are rare but have a disproportionate impact on total sales volume.

Data Import and Cleaning
Before any forecasting model could be trained, the raw data had to be thoroughly prepared and cleaned. In AI projects this step is often the most time-consuming – and it has a decisive impact on model quality.
Missing sales history: A major issue was incomplete sales history. In some extracts the warehouse system wrote a zero as an empty cell. In others the product did not yet exist for that month, so the field was also empty. To help the model tell the difference, I added a new column – weight – that marks whether the item actually existed in each period.
Negative sales values: Several rows contained negative sales caused by product returns. In agreement with the customer, these values were converted to zero to keep the model from being skewed by returns.
Unit-of-measure mismatch: Some items sell only a few pieces per period, while others move by the thousands in the same timeframe. Because a global conversion was not feasible, the model had to learn to scale correctly.
Promo duration vs. monthly sales: Promotions were stored in days and often spanned multiple months, whereas sales were monthly. I therefore created a column that records the percentage of each month covered by a promo (e.g., promo active 50 % of the month).
Uneven history length: Some SKUs were launched during the 36-month window and had only a short record. To keep a consistent timeline, I again relied on the weight flag to indicate months before the item existed.
Duplicates and other gaps: Together with the client, we removed duplicate rows and filled or dropped the remaining missing fields so that the final training set would influence the model as little as possible in a negative way.
Resulting structure
The data were converted into a classic "long" format, where each row represents one product–month combination. The final dataset contains roughly 15,000 SKUs and just over half a million rows.
List of columns: The list below shows the structure of the prepared long table. In addition to the basic identifiers and commercial attributes, it already includes several auxiliary columns that are essential for modelling – most importantly weight and time_idx.
ID, seasonality, category, type, segment, unit, name, turnover, price, ordered, date, time_idx, weight, SALE, SALE_INTENSITY

First Analyses and Patterns Discovered
Although the ultimate goal of the project is to forecast inventory levels, a detailed data review showed that it is more practical to predict turnover (expected sales) first. Standard logistics rules (min/max stock, lead times, safety stock) can then convert the sales forecast into an optimal stock target for each month.
Key decisions during data exploration
Some columns were excluded from model training because they do not influence sales volume directly:
- Price – in the customer's setting, the selling price showed no clear correlation with quantity sold.
- Quantity ordered – supplier performance affects stock, not sales; these variables will therefore be used later, when we translate the sales forecast into stock recommendations.
- Segment hierarchy – the "segment" attribute has three levels. To keep the full information, I retained all three levels as separate fields: Segment, Segment 1, Segment 2.

| Segment | Number of IDs | mean_target | std_target | var_target |
|---|---|---|---|---|
| AA-BB-UU | 91 | 20.925 | 44.316 | 2,001.933 |
| AA-EE-DD | 89 | 54.869 | 70.83 | 5,114.107 |
| AA-GG-LL | 88 | 26.011 | 33.747 | 1,160.926 |
| AA-AA-SS | 89 | 84.013 | 173.254 | 30,598.405 |
| XX-WW-TT | 91 | 50.484 | 89.99 | 8,255.076 |
| GG-PP-AA | 86 | 28.943 | 40.32 | 1,657.201 |
| DD-GG-VV | 85 | 4.494 | 5.227 | 27.855 |
| HH-EE-RR | 81 | 12.802 | 11.232 | 128.596 |
| AA-DD-OO | 79 | 21.321 | 36.809 | 1,381.121 |
| BB-WW-TT | 77 | 90.052 | 141.141 | 20,306.666 |

| Segment 2 | Number of IDs | mean_target | std_target | var_target |
|---|---|---|---|---|
| DD | 2645 | 32.412 | 169.623 | 29,329.194 |
| FF | 2689 | 119.556 | 455.99 | 211,954.269 |
| GG | 1687 | 32.537 | 67.237 | 4,608.307 |
| AA | 1288 | 63.001 | 150.065 | 22,955.733 |
| BB | 611 | 104.113 | 287.801 | 84,433.946 |
| HH | 509 | 84.453 | | |
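A minimal sketch of the cleaning steps described above (weight flag, negative sales clipped to zero, promo coverage as a share of the month). The column names ID, date, turnover, SALE, and SALE_INTENSITY mirror the list above; the promos table with ID, month, and promo_days columns is a hypothetical input, and the code is illustrative rather than the project's actual pipeline:

```python
import pandas as pd

def clean_long_table(df: pd.DataFrame, promos: pd.DataFrame) -> pd.DataFrame:
    """Basic cleaning of the product x month long table (illustrative)."""
    df = df.sort_values(["ID", "date"]).copy()  # assumes date is already parsed as datetime

    # weight = 1 only for months in which the item already existed
    # (here approximated as: from the item's first non-empty record onward).
    first_seen = df.dropna(subset=["turnover"]).groupby("ID")["date"].min()
    df["weight"] = (df["date"] >= df["ID"].map(first_seen)).astype(int)

    # Negative sales (returns, cancellations) are clipped to zero; empty cells become zero.
    df["turnover"] = df["turnover"].fillna(0).clip(lower=0)

    # Promo coverage: share of the month covered by a campaign, from daily promo records.
    coverage = (
        promos.groupby(["ID", "month"])["promo_days"].sum()
        .div(30)                       # rough month length
        .clip(upper=1.0)
        .rename("SALE_INTENSITY")
        .reset_index()
    )
    df["month"] = df["date"].dt.to_period("M")
    df = df.merge(coverage, on=["ID", "month"], how="left")
    df["SALE_INTENSITY"] = df["SALE_INTENSITY"].fillna(0.0)
    df["SALE"] = (df["SALE_INTENSITY"] > 0).astype(int)

    return df.drop(columns=["month"])
```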
Inventory Forecasting with AI#1: Project Introduction & Objectives
Inventory Forecasting with AI 1. Project Introduction & Objectives

Smart inventory planning is a challenge not only for large corporations but also for small and medium-sized businesses across industries. Today, AI opens up entirely new possibilities for them. After three years of deep study in AI, I decided to launch my own project: applying deep learning to inventory forecasting.
Why this topic?
A colleague asked me whether artificial intelligence could help solve inventory planning problems in a wholesale business. It quickly became clear that inventory management is a never-ending balancing act – too much stock means unnecessary costs, too little means lost sales and frustrated customers. On top of that, many dynamic factors come into play.
How to Read This Series
The goal of this series is to make the world of inventory forecasting with AI accessible to business professionals – those who deal with operations and inventory planning in practice and want to understand how AI and modern data tools can help. I won't dive into technical details. Instead, I'll focus on real-world challenges and how I addressed them. Each article will present a specific problem, show how I tackled it, and support it with examples or charts – so you can get a clear picture of what AI can (and can't) do in supply chain forecasting.
My aim is to provide value both to business managers who want to explore AI's potential and to analysts interested in how data handling and modeling decisions impact forecasting results. All data used in this series come from real projects but have been anonymized. New articles will be published progressively, as I gather new insights and practical examples.
The Business Problem
- Tens of thousands of unique product SKUs
- Categorized by seasonality, type, product category, and segment
- Dozens of new items created and discontinued each month
- Promotions and special discounts applied irregularly
- Forecasts and ordering handled manually by purchasing teams
Project Goal
To build a system that can:
- Track current inventory levels
- Predict future sales
- Take into account additional variables such as lead times, supplier reliability, promotions, etc.
- Recommend order quantities and automate min/max inventory thresholds
Sales trends over time for various products (SKUs): The following charts illustrate sales trends for four different products over a 24-month period. Each line shows the sales volume of a single product. You can clearly see how these products differ – in sales volume, seasonal patterns, and even periods with zero sales. These variations highlight why inventory forecasting is hard – and why it requires a much more sophisticated approach than a simple average or basic extrapolation.
Project Definition
To meet the project goals, we first needed to clearly define the specific requirements, constraints, and risks involved.
The solution must:
- Automatically recommend monthly min/max inventory levels for each SKU
- Take into account multiple parameters: product categorization, promotional sales, seasonality, sales history, similarity between products, new product launches
- Be financially feasible
- Allow companies to test the model on their own data
Project Risks & Challenges
From the beginning, it was clear that the model would face a wide range of challenges:
- One-time sales spikes (e.g. sudden bulk orders or unplanned demand surges)
- Frequent assortment changes (tens of products introduced or discontinued each year)
- Varying sales history lengths (new items vs. long-standing products)
- Zero-sales periods (months without demand, followed by sudden activity)
- Large variance in sales volumes (from one-piece-per-month SKUs to thousands of units sold)
- Product interdependencies (sales of one item influence demand for another)
- Impact of seasonality and planned promotions
The objective was not only to build a model that can predict average demand, but also one that can handle these irregular, complex, and often unpredictable scenarios.
Sales History Coverage Table
The sales history table illustrates the variability in how much historical data was available per product. While the majority of products had a complete 3-year history, a significant portion had only limited or no historical data – posing a major challenge for traditional forecasting models.

| Product history length | Number of items | Share of assortment |
|---|---|---|
| 36 months (full history) | 10,105 | 70.00 % |
| 12–35 months | 2,668 | 18.00 % |
| Less than 12 months | 1,211 | 8.00 % |
| No history | 519 | 4.00 % |

Solution – In a Nutshell
Many companies assume that feeding data into an AI model is enough to get useful results. As this project showed, the reality is far more complex. Initial tests revealed that even modern deep learning models (used out of the box) were not sufficient – they struggled with high assortment variability, frequent product turnover, and a large number of zero-sale periods.
The project gradually expanded in scope. It became necessary to:
- Improve data preparation (feature engineering)
- Train models to handle extreme values
- Address varying history lengths
- Dynamically scale inputs based on context
Today, the model handles most of these challenges automatically, with minimal manual intervention.
Results

| Metric | Value | Business Meaning |
|---|---|---|
| Forecast Accuracy | 80 % | How accurately the model predicted future sales. |
| Average Deviation | 48 | Average difference between predicted and actual sales. |
| Mean Absolute Error | 31 | Average size of the error in number of units sold. |

Key Strengths of the Solution
- Forecast accuracy over 80 % across the full product assortment
- The model leverages both individual sales history and group-level patterns from similar products
- Special attention to promotional and discounted items
- Capable of forecasting newly introduced products with minimal historical data
- Responsive to both seasonal effects and short-term trends
- Automated outputs generated for the entire assortment
Scalability and Reusability
The solution is highly adaptable and can be configured for different clients or datasets. Most algorithms and core components are predefined and reusable. While expert knowledge is still required for initial setup and result interpretation, the deployment and validation process is much faster compared to custom development from scratch.
Forecast vs. Reality – Visual Results
The following charts (not shown here) compare actual vs. predicted sales for selected SKUs. You can clearly see that the model is able to capture overall sales trends and seasonal fluctuations, even for products with inconsistent histories or irregular sales spikes. At the same time, these charts