PatchTST: Using Transformers for Time Series Forecasting
Author: edparks
Paper summary
Citation
| Paper | A Time Series is Worth 64 Words: Long-term Forecasting with Transformers |
|---|---|
| Authors | Nie, Y., Nguyen, N. H., Sinthong, P., & Kalagnanam, J. |
| Year published | 2023 |
| Venue | ICLR 2023 |
| Paper URL | https://arxiv.org/abs/2211.14730 |
| Code URL | https://github.com/yuqinie98/PatchTST |
LLM
| LLM Model | HuggingFace SmolLM3-3B |
|---|---|
| LLM Model Version | Version 3 - Commit ID: a07cc9a |
| Model/Service URL | https://huggingface.co/HuggingFaceTB/SmolLM3-3B |
| Why This Model? | SmolLM3 offers a balance of performance and efficiency, with advanced features such as reasoning. |
Paper Questions
What LLM did you select (model and version) and why?
For the project I selected HuggingFace's SmolLM3-3B for the paper task over similarly sized models such as Qwen2.5-3B and larger models such as Llama 3.2 or Google's Gemma. The biggest reason for selecting the model was its balance between performance and efficiency, along with its easy integration into the HuggingChat platform for providing prompts.
Benefits I was seeking with the model are detailed below:
Ultra-Efficiency: The model is specifically designed to perform as well as possible while remaining efficient enough for even the most resource-constrained environments. The model's authors note that this is a growing ask among LLM users (Bakouch et al., 2025).
Strong Performance: In a comparison between three Qwen LLMs, Llama 3.2, and Gemma3, the SmolLM3-3B Base model placed third in terms of benchmark performance despite being the second fastest among the six models tested, highlighting the balance between performance and efficiency (Bakouch et al., 2025).
Long Context Window: The 128,000-token inference context window was one of the biggest benefits, as it allows more nuanced and coherent answers to be developed. This is necessary given that we are working with academic papers that pack dense, inter-related information across long spans of text.
Reasoning Ability: SmolLM3-3B can incorporate a reasoning feature into its responses. This was of particular interest as the questions prepared for the paper are longer and involve multi-step reasoning to achieve the best results.
Some of the initial concerns I had for the model are detailed below:
Lower Parameter Size: Given the relatively low parameter count compared to larger models from Meta and Google, there were concerns about the model’s ability to realistically handle a complex prompt and make meaning of the entire paper. The larger pre-trained models typically perform more strongly on benchmarks and handle nuanced prompts more reliably.
Hallucinations: Since the paper includes over 10 pages of text with tables and diagrams, there was concern as to whether the model would be able to effectively fit the paper within its context window and produce high quality answers to various parts.
What are the authors proposing?
AI Question
Based on the copy of the machine learning paper included in this chat can you simply describe in a few sentences what the authors are proposing? Please don’t just regurgitate the abstract, but rather describe the novel concepts of the authors' experiments.
AI Response
The authors propose a Transformer-based model, PatchTST, designed for long-term time series forecasting. Their key innovations include:
- Patch-based Input Segmentation:
- Patching Mechanism: They divide multivariate time series into non-overlapping patches of fixed length (e.g., 16 time steps with a stride of 8). This reduces the input size from L (the original sequence length) to approximately L/S, significantly lowering computational complexity.
- Local Semantic Information: Patches retain local patterns, enabling the model to learn meaningful temporal relationships at different scales, which is critical for long-term forecasting.
- Channel-Independence Design:
- Independent Processing of Channels: Each channel (e.g., a single time series feature) is processed independently, ensuring that information from one channel does not interfere with another. This allows the model to focus on specific features without cross-channel noise.
- Adaptability: PatchTST adapts to diverse time series behaviors, as each series can learn distinct attention patterns, improving robustness against unrelated or conflicting data.
- Longer Look-Back Windows:
- Strategic Downsampling: By downsampling the input sequence (e.g., every 4 steps) and padding with the last value, PatchTST can use a longer look-back window (L) without increasing computational costs. This enhances the model's ability to capture historical dependencies.
- Self-Supervised Pre-training:
- Masked Autoencoder: They use masked autoencoders to learn abstract representations by randomly masking patches and training the model to reconstruct them. This avoids the need for large labeled datasets and produces high-quality features for downstream tasks.
- Transfer Learning:
- Cross-Dataset Transfer: Pre-trained models on one dataset (e.g., Electricity) are fine-tuned on others (e.g., Weather, Traffic), reducing the need for extensive labeled data. This approach leverages learned representations for improved performance across tasks.
- Efficiency and Robustness:
- Reduced Complexity: PatchTST's design minimizes memory and computational demands, making it scalable for large datasets.
- Robustness to Overfitting: By using downsampling and padding, it avoids overfitting while preserving critical temporal information. Instance normalization further mitigates distribution shifts between training and testing data.
The authors demonstrate that PatchTST outperforms traditional Transformer models in multivariate time series forecasting, especially on large datasets. Their approach combines patching and channel independence to retain semantic information, reduce complexity, and enable effective learning from longer sequences, making it a robust and efficient foundation model for future work.
Human Response
In the paper the authors propose a custom rendition of a transformer encoder model that is designed specifically for sequential time-series data. Instead of framing a time series as a sequence of individual events, parts of the overall series are tokenized similar to a language model for forecasting tasks.
Conceptually, the idea was framed as an extension of recent deep learning based forecasting developments. These include the use of standard feed-forward networks, convolutional networks, and other transformer based methods. Despite realizing promising results for the application of deep learning in the field, these models have been equally matched by simpler linear models (Zeng et al., 2022).
The authors note that aspects of the transformer model, such as the attention mechanism and the way data is represented in tokens, present many opportunities for new breakthroughs in forecasting performance that cannot be achieved with linear models. They specifically propose two new approaches utilizing transformers for forecasting through methods called patching and channel independence.
Patching involves creating local context embeddings from the larger sequence, mimicking the tokenization phase, while channel independence involves processing multi-variate sequences independently rather than through a standard channel mixing process. Additionally, the authors proposed language model-like pre-training processes such as token masking as a method for richer interpretations of general time series signals to boost the performance of downstream tasks.
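To make the patching idea concrete, here is a minimal numpy sketch (not the authors' code) of splitting a look-back window into patches using the paper's default patch length of 16 and stride of 8; the last-value padding the paper uses to gain one extra patch is omitted for brevity, and channel independence is shown simply as patching each variable separately before it would enter the shared backbone.

```python
import numpy as np

def make_patches(series: np.ndarray, patch_len: int = 16, stride: int = 8) -> np.ndarray:
    """Split a univariate series of length L into patches of patch_len,
    advancing by `stride` steps (patches overlap when stride < patch_len)."""
    L = series.shape[0]
    starts = range(0, L - patch_len + 1, stride)
    return np.stack([series[s:s + patch_len] for s in starts])

# Channel independence: each variable of a multivariate series is
# patched and processed separately through the same backbone.
multivariate = np.random.randn(3, 336)   # 3 channels, look-back window L = 336
per_channel = [make_patches(ch) for ch in multivariate]
print(per_channel[0].shape)              # (41, 16): roughly L/stride patches
```

Each channel's sequence of 336 raw time steps becomes only 41 patch tokens, which is what makes the quadratic attention cost manageable.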
What is the motivation for the work?
AI Question
Based on the copy of the machine learning paper included in this chat can you simply describe in a few sentences what the motivation is for the authors’ work in this paper? Please help me specifically understand how similar types of previous experiments were inspiration for the methods the authors proposed and why.
AI Response
The authors' motivation stems from the limitations of Transformers in handling long-term time series forecasting, where their inherent design struggles with large sequence lengths and high computational costs. Key inspirations for their methods include:
- Patch-based Optimization:
- Insight: Patching, a technique used in computer vision and NLP to reduce memory and improve efficiency, was adapted to Transformers. By segmenting time series into non-overlapping patches, the authors reduce the input size from L to approximately L/S, which is critical for Transformers' O(N^2) complexity.
- Adaptation: This approach is inspired by the success of patching in other domains (e.g., ViT, BEiT) to handle large-scale data with limited resources.
- Channel Independence:
- Insight: Channel independence, where each channel (e.g., a single time series feature) is processed independently, was used in traditional methods like TCN and LogTrans. The authors extend this to Transformers, ensuring cross-channel noise is mitigated.
- Adaptation: Channel independence leverages the efficiency of Transformers while avoiding cross-channel interference, a key advantage over channel-mixing approaches that blend all channels into a single input.
- Longer Look-Back Windows:
- Insight: Downsampling (e.g., sampling every 4 steps) and padding with the last value allow the model to use a longer historical context (L) without increasing computational demands. This is inspired by techniques in time series analysis (e.g., ARIMA) that use sparse data for efficiency.
- Adaptation: Downsampling is a pragmatic solution to enable Transformers to learn from longer sequences while maintaining scalability.
- Self-Supervised Learning for Pre-training:
- Insight: Masked autoencoders, a staple in NLP for learning embeddings from unlabeled data, were applied to Transformers. This addresses the scarcity of labeled time series data, a common challenge in generative and forecasting tasks.
- Adaptation: The authors adapt this self-supervised strategy to Transformers, enabling effective pre-training on unlabeled data and transfer learning across datasets.
- Efficiency and Robustness:
- Insight: Traditional Transformer models often overfit due to their large receptive fields. The authors use downsampling and padding to reduce overfitting while preserving temporal context, inspired by sparse convolutional networks (e.g., TCN) that mitigate overfitting through sparsity.
- Adaptation: This approach balances computational efficiency with the need for robustness, addressing the trade-off between model complexity and data utilization.
Why These Approaches?
The authors' work is driven by the need to scale Transformers for time series forecasting while avoiding overfitting and maintaining computational feasibility. Traditional Transformer models struggle with long sequences due to their inherent design, which is compounded by the lack of labeled data in many time series tasks. By combining patching, channel independence, downsampling, and self-supervised learning, they create a more efficient and adaptable framework that outperforms existing methods. This approach bridges the gap between the Transformers' strengths in sequence modeling and their limitations in temporal data, making it a viable solution for long-term forecasting and foundation models in time series analysis.
Human Response
The authors of the paper highlighted motivations for the specific model architecture largely based on the current state of deep learning approaches for forecasting and the general importance of the forecasting task to analysis with time series data.
It was distinctly noted in the paper that deep learning methods show considerable promise in their forecasting ability but currently lack the performance edge that would give any incentive for their practical use over simpler linear models. Transformer based solutions in particular require substantially more skill, time, cost, and data to properly develop a strong model.
Another critical motivation noted by the authors is the complexity that traditional forecasting methods fail to model properly. Most time series sequences are not univariate by nature, but given data and modeling limitations the multiple influencing factors rarely are able to be completely captured for a forecast. The prospects of deep learning to better account for these multi-variate scenarios is a direct motivation for components in the proposed architecture.
A final key motivation for the work is the natural ability of transformers to self-discover patterns given the way data is tokenized and can attend to other tokens in a broader sequence. While this has been explored in other transformer based forecasting variants, the methods by which the data is preprocessed remain an active area of research.
What are the results and how do they compare with competing approaches?
AI Question
Based on the copy of the machine learning paper included in this chat can you simply describe in a few sentences what the results are for the authors’ novel approach and how they compare to the others tested? Please include the high-level summary results the authors noted in the paper and yourself summarize any key resulting takeaways from the performance tables included in the report.
AI Response
The authors' novel approach, PatchTST, significantly outperforms existing Transformer-based models in multivariate time series forecasting. Key results include:
- Supervised Forecasting:
- PatchTST/64 (with a longer look-back window and more patches) achieves 21.0% reduction in MSE and 16.7% in MAE compared to the best Transformer baselines on large datasets (Weather, Traffic, Electricity).
- On smaller datasets (ILI), PatchTST/42 (default model) still outperforms other Transformers by 20.2% in MSE and 16.4% in MAE.
- Self-Supervised Pre-training:
- PatchTST outperforms state-of-the-art self-supervised methods like BTSF, TS2Vec, TNC, and TS-TCC by 34.5%-48.8% in forecasting accuracy, demonstrating its effectiveness in learning abstract representations.
- Transfer Learning:
- Pre-training PatchTST on Electricity and fine-tuning on other datasets yields MSE improvements of 48.5%-48.8%, slightly exceeding supervised training in some cases.
- Ablation Study:
- Channel independence is critical: PatchTST with both patching and channel independence (P+CI) outperforms all other variants, including standalone patching (P) and channel-mixing (CI).
- Patch length and look-back window are synergistic: Increasing L consistently improves performance, with PatchTST/64 showing the largest gains.
Key Takeaways:
- PatchTST is the most effective Transformer model for multivariate time series forecasting, especially when combined with channel independence.
- Channel independence reduces cross-channel interference and allows Transformers to focus on local patterns, enhancing robustness.
- Patch length and look-back window are trade-offs: A longer L improves performance but requires downsampling to maintain efficiency.
- Self-supervised pre-training is a powerful tool for learning abstract representations, outperforming supervised methods on time series.
- Transfer learning with PatchTST is viable but requires careful implementation, yielding competitive results on diverse datasets.
In summary, PatchTST's combination of patching, channel independence, and downsampling enables Transformers to handle long-term dependencies efficiently, making it a superior choice for time series forecasting and a promising foundation model for future work.
Human Response
After training and evaluation across multiple datasets and training styles, the PatchTST model displayed clear evidence of outperforming all current state-of-the-art (SotA) approaches on the downstream tasks. The models used in the comparison included the transformer based variants FEDformer (Zhou et al., 2022), Autoformer (Wu et al., 2021), and Informer (Zhou et al., 2021), and a linear model called DLinear (Zeng et al., 2022), which is noted for its comparable performance.
Datasets used for the task evaluation span the domains of weather, utilities usage, and traffic patterns. In addition, the model’s performance was evaluated across forecast horizons of 96, 192, 336, and 720 steps. Overall, the 64-patch version of the model (PatchTST/64) outperformed all baseline methods, achieving a 21.0% overall decrease in mean squared error (MSE) and a 16.7% reduction in mean absolute error (MAE).
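For reference, the two error measures quoted here are straightforward to compute; a quick numpy sketch with made-up values (not the paper's results):

```python
import numpy as np

def mse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean squared error: average of squared forecast errors."""
    return float(np.mean((y_true - y_pred) ** 2))

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean absolute error: average of absolute forecast errors."""
    return float(np.mean(np.abs(y_true - y_pred)))

y_true = np.array([1.0, 2.0, 4.0])
y_pred = np.array([1.0, 3.0, 2.0])
print(mse(y_true, y_pred), mae(y_true, y_pred))  # 1.666..., 1.0
```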
The different training methods implemented for the model include a linear probe of the fine-tuning model head, an end-to-end fine-tuning cycle, and a supervised training cycle from scratch. Across the different datasets, forecast lengths, and competitor models, self-supervised pre-training followed by end-to-end fine-tuning of the PatchTST model produced the best results, highlighting the power of language model-style processes for the time series implementation.
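As an illustration of the linear-probe idea, the sketch below fits only a linear forecasting head on top of frozen backbone features by least squares; the features here are random stand-ins for a pre-trained encoder's output, and the feature dimension and horizon are assumed for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical frozen backbone output: representations for 200 windows,
# each a 32-dimensional feature vector (stand-in for a pre-trained encoder).
features = rng.normal(size=(200, 32))
targets = rng.normal(size=(200, 96))   # 96-step forecast targets

# Linear probing: keep the backbone frozen and fit only a linear
# forecasting head (here via ordinary least squares).
head, *_ = np.linalg.lstsq(features, targets, rcond=None)
preds = features @ head
print(head.shape)                      # (32, 96)
```

End-to-end fine-tuning would instead update both the head and the backbone weights, which is why it tends to achieve the best final accuracy.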
Given that the PatchTST model is built on a transformer core architecture, its transfer learning ability for time series forecasting was also evaluated. The performance was benchmarked against other SotA transfer learning models such as the Bilinear Temporal-Spectral Fusion (BTSF) model (Yang & Hong, 2022) and TS2Vec (Yue et al., 2022). PatchTST reported improvements ranging from 34.5% to 48.8% in MSE for the transfer learning tasks.
Additionally, an ablation study was conducted to determine the usefulness of the patching and channel independence mechanisms in improving the forecasting ability of PatchTST. The results of the ablation experiments were compared to the performance of the FEDformer model, which was considered the strongest SotA transformer approach. The ablation study highlighted that the combination of the patching and channel independence mechanisms is required for the model to achieve world-class results. When combined, the PatchTST model outperformed FEDformer at every forecast length across every dataset.
Are the comparisons fair?
AI Question
Based on the copy of the machine learning paper included in this chat and given what you are trained on, can you simply describe in a few sentences whether the comparisons of the authors’ PatchTST model to those in the results table are fair? If the comparisons are not fair please simply explain why and what change could make them more fair.
AI Response
The comparisons in the authors' results are fair but could be improved by standardizing the baselines' hyperparameters and architectures. Key issues include:
- Varying Architectural and Hyperparameter Differences:
- Baselines like FEDformer, Autoformer, and Informer use different attention mechanisms and architectures (e.g., FEDformer uses Fourier-enhanced attention, Autoformer uses auto-correlation).
- The authors should use the same architecture (e.g., identical number of layers, attention heads, and dropout rates) across all baselines to ensure fair comparisons.
- Differing Look-Back Windows:
- Transformers (e.g., FEDformer, Autoformer) use varying look-back windows (e.g., 24, 48, 96, 192, 336, 720), while DLinear uses L=336.
- The authors should standardize the look-back window (e.g., use L=336 for all baselines) to compare PatchTST's effectiveness with Transformers on the same scale.
- Self-Supervised vs. Supervised Baselines:
- The self-supervised comparisons (Table 4) are valid as they evaluate distinct learning paradigms (e.g., masked autoencoders vs. Transformer-based models). However, the supervised baselines (Table 3) are all Transformers, so the self-supervised results could be seen as a separate benchmark rather than a direct comparison.
Improvements:
- Standardize Hyperparameters and Architectures: Use identical dropout rates, attention heads, and layer counts across all baselines.
- Uniform Look-Back Window: Apply the same L value (e.g., 336) to all models to ensure fair comparisons.
- Separate Baselines for Self-Supervised vs. Supervised: Clearly distinguish self-supervised (Table 4) and supervised (Table 3) results to avoid conflating different learning paradigms.
By addressing these issues, the comparisons would be more apples-to-apples, allowing for a more accurate assessment of PatchTST's performance relative to a standardized set of baselines.
Human Response
Without intimately knowing the details of all comparison models, datasets, and nuances of the evaluation tasks that do not exist in the paper, it is hard to give a definitive answer as to whether the comparisons are completely fair. However, based on the descriptions of how the experiments were designed and why, the authors present strong evidence to justify that the comparisons made are fair.
More specifically in terms of models, the authors implemented comparable deep learning based forecasting methods that are task appropriate unless otherwise stated. In general forecasting scenarios, the authors compared the PatchTST model with general transformer architectures such as FEDformer and Informer, whereas unsupervised learning task transfer experiments leveraged models that perform more strongly in those areas for comparison (i.e., BTSF and TS2Vec). The only non-transformer model utilized for general task evaluation was the DLinear model, which the authors clearly noted was included due to its documented performance against other complex deep learning forecasting methods.
To ensure the evaluation measures were fair, the authors did a thorough evaluation combining different forecast lengths from 96 to 720 steps, dataset domains from weather to traffic patterns, methods of training from end-to-end fine-tuning to standard supervised learning, and alternative measures of error such as mean squared error and mean absolute error. The combination of these was likely selected to highlight the raw power of the PatchTST model while ensuring the authors did not cherry-pick results.
Lastly, the authors performed a detailed ablation study on the newly proposed features of the PatchTST framework to provide additional support for the robustness of each feature. The comparable results of PatchTST without the patching and channel independence features to those of FEDformer are a prime example of this fairness, while still leaving space to showcase the effectiveness of the new features.
What are the takeaways according to the authors?
AI Question
Based on the copy of the machine learning paper included in this chat can you simply describe in a few sentences what takeaways the authors want the readers to take from it? Please don’t just regurgitate the conclusions but rather leverage any points the authors make across the entire paper.
AI Response
The authors' key takeaways for readers are:
Transformer-based models are not inherently effective for long-term time series forecasting. Their design, particularly PatchTST, is necessary to overcome the limitations of existing Transformers in handling long sequences and complex temporal dependencies.
Patching and channel independence are critical for capturing local and global information. PatchTST's combination of patching (reducing input complexity) and channel independence (avoiding cross-channel interference) enables Transformers to learn meaningful patterns at multiple scales, improving forecasting accuracy and computational efficiency.
Transformer models can be adapted for foundation tasks. Their approach demonstrates the potential of Transformers in time series forecasting and self-supervised learning, showing that they can serve as building blocks for future work in this domain.
Self-supervised pre-training and transfer learning are powerful strategies. PatchTST achieves state-of-the-art results through self-supervised learning, and its representations can be transferred to other datasets, reducing reliance on large labeled data.
Efficiency and scalability are essential for large-scale time series tasks. Patching and downsampling in PatchTST significantly reduce computational costs while enabling longer look-back windows, making it practical for resource-constrained environments.
Channel independence is a key design principle. It allows Transformers to focus on local patterns within individual channels, avoiding cross-channel noise and improving feature learning robustness.
Transformer models can be effective when combined with clever architectural choices. The authors show that their design balances local semantic information retention with global context capture, highlighting the importance of thoughtful architecture engineering.
Their approach is generalizable across datasets and tasks. PatchTST's versatility in handling diverse multivariate time series and its ability to transfer learning suggest it could be extended to other applications.
In summary, the authors want readers to understand that their PatchTST model is a transformative approach that overcomes Transformer limitations in time series forecasting by combining patching and channel independence, making it a foundational model for future work in this field.
Human Response
According to the authors the key takeaways are that patching and channel independence are very important and effective new methods that give transformers the ability to produce world class forecasting results.
The patching mechanism specifically allows the model to capture local information from surrounding steps in the sequence for a more nuanced understanding of a time series, rather than modeling a series at the individual step level. Additionally, utilizing patching allows the full series to be processed with a lower overall number of steps. This substantially reduces the compute cost of the model, as the attention mechanism naturally incurs quadratic time and space complexity.
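The savings can be checked with quick arithmetic: using the paper's default look-back window of L = 336 and stride S = 8, patching cuts the token count by a factor of S, and the quadratic attention cost by roughly S squared.

```python
# Attention cost scales with the square of the token count, so patching a
# look-back window of L = 336 steps with stride S = 8 shrinks it by ~S^2.
L, S = 336, 8
tokens_pointwise = L          # one token per raw time step
tokens_patched = L // S       # roughly L/S tokens after patching
ratio = tokens_pointwise**2 / tokens_patched**2
print(tokens_patched, ratio)  # 42 tokens, 64x fewer attention operations
```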
Channel independence, on the other hand, is designed to take multivariate patches and process them individually in the architecture, which has performance benefits over more traditional methods such as channel mixing. While the concept is relatively new to transformers, it offers numerous future extensions, such as channel enhancements that incorporate the dependencies among channels.
Furthermore, the authors note the potential of self-supervised learning methods for more complex downstream applications. Despite supervised implementations of the training process achieving strong performance, the ability of the model to leverage representation learning methods enhances the broader performance and opens the approach to many future areas of experimentation.
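A minimal sketch of that masked-patch pre-training objective (assuming the paper's 40% masking ratio; the Transformer encoder is replaced by an identity stand-in, so the loss shown is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_patches(patches: np.ndarray, mask_ratio: float = 0.4):
    """Zero out a random subset of patches; the model is trained to
    reconstruct the originals at the masked positions."""
    n = patches.shape[0]
    masked_idx = rng.choice(n, size=int(n * mask_ratio), replace=False)
    corrupted = patches.copy()
    corrupted[masked_idx] = 0.0
    return corrupted, masked_idx

patches = rng.normal(size=(42, 16))   # 42 patches of length 16
corrupted, idx = mask_patches(patches)

reconstruction = corrupted            # stand-in for the encoder's output
# MSE reconstruction objective, computed only on the masked patches:
loss = np.mean((reconstruction[idx] - patches[idx]) ** 2)
```

In the actual method the corrupted patch sequence is passed through the Transformer encoder, and the loss drives the encoder to learn representations rich enough to fill in the missing patches.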
What are the takeaways according to you?
Human Response
The paper regarding the PatchTST model made several important contributions to the realm of time series forecasting, and provided strong evidence that transformers and general statistical language processing frameworks are applicable in many other domains.
More specifically, the authors highlighted that a modified way of representing the time series similar to what is done for text transformers allows the model to better understand the local context in a more nuanced way that improves the accuracy for long term predictions. Given the uniqueness of the approach, transformers may not be applicable for all forecasting scenarios, with simpler methods producing comparable results.
A final aspect that is important for future time series experimentation is the success of the self-supervised learning methods for transfer onto other tasks. These methods included a rendition similar to the pre-training of language models through masked token reconstruction. Token reconstruction has been profoundly successful in the language domain, allowing transformers to garner a deeper understanding of word meaning and apply it to other text domain tasks.
Various time series on the other hand rarely have the same underlying meaning hence the surprising results in the ability to transfer knowledge from one domain to another. This opens the possibilities for how time series analyses can be approached to incorporate foundational structure with series specific patterns.
Would you use this? If so, how/where would you use this?
Human Response
I personally would use the PatchTST approach to time series modeling, as I believe it is a very innovative approach to how sequential data can be represented; this was one of the driving reasons behind my selection of the paper. Additionally, I believe there are many alternative renditions of this model that could be utilized for even stronger performance on a broad range of tasks.
The best scenarios I would apply the model to include time series challenges where there is a substantial amount of training data to utilize, there are complex multi-variate features that contribute to an overall forecast, and the forecasts are required to be long in terms of their context. All of these are where the model performs best as it is designed to capture complex relationships of a series and generate accurate long context forecasts.
More specifically, the domains I would apply this model to include: climate modeling, given there are many different environmental aspects that are all complexly related to one another in producing current and future states; medical vitals for forecasting patient health, as nuanced trends across different vital signals can produce early indications of where general health is trending during a hospital stay; and financial analysis, such as forecasting the valuation of a security, where the sequential nature of various company measurements may be valuable for obtaining a longer term forecast to effectively manage a client’s portfolio.
Additionally, the prospect of being able to pre-train the model opens many more forecasting opportunities where data may be considerably limited. For example, a restaurant chain may have daily sales data across all its locations. The model could be pre-trained on these various sequences for general patterns and then fine-tuned on a smaller dataset from a new location to obtain the strongest sales forecast possible.
What problems remain and what are the next steps?
AI Question
Based on the copy of the machine learning paper included in this chat can you simply describe in a few sentences what key problems remain in applying the transformer architecture to non-text scenarios and time series forecasting with deep learning models? Please explain how the authors’ approach did not cover these areas but noted them as needed future work.
AI Response
The authors identified several key remaining challenges for applying transformers to non-text scenarios and time series forecasting, which they noted as future areas of work:
Long-term dependency resolution: Transformers struggle with capturing long-term dependencies in time series due to their reliance on sequential information. While the authors used downsampling and padding to manage this, efficient attention mechanisms tailored to temporal coherence could enhance performance.
Cross-channel interactions: Channel independence in PatchTST avoids cross-channel interference but integrating cross-channel correlations (e.g., time series with similar patterns across channels) remains unexplored.
Temporal coherence of patches: The model assumes sequential patches (e.g., overlapping or consecutive patches) but ensuring temporal coherence (e.g., consistent temporal relationships between patches) could improve accuracy.
Self-supervised learning for time series: While self-supervised pre-training is effective, more sophisticated methods (e.g., contrastive learning, attention mechanisms) might better capture abstract representations in time series.
Non-linear relationship modeling: Transformers might not fully capture non-linear temporal dependencies (e.g., seasonality, trends) that are common in time series, suggesting the need for non-linear attention mechanisms.
Temporal consistency in transfer learning: Ensuring temporal alignment between pre-trained and fine-tuned models on different datasets is a challenge that the authors didn't address.
Efficient attention for time series: Transformers use a vanilla attention mechanism, but attention mechanisms optimized for time series (e.g., dynamic attention, hierarchical attention) could improve scalability and accuracy.
These issues highlight the need for transformer-specific attention and attentional architectures for time series, as well as holistic approaches to temporal coherence and cross-channel interactions, which the authors see as critical for future advancements in the field.
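The patching idea the response refers to is the paper's central move: instead of feeding every time step to the attention layers, each univariate series is cut into subseries-level patches that serve as input tokens. A minimal numpy sketch of that step, using the paper's default patch length of 16 and stride of 8, but omitting the end-padding the paper adds (which is why a length-512 input yields 63 rather than the titular 64 patches here):

```python
import numpy as np

def patchify(series, patch_len=16, stride=8):
    """Split a 1-D series into (possibly overlapping) patches.

    Each patch becomes one "token" for the Transformer, so a
    length-512 input shrinks from 512 steps to a few dozen tokens.
    """
    n_patches = (len(series) - patch_len) // stride + 1
    return np.stack([series[i * stride : i * stride + patch_len]
                     for i in range(n_patches)])

x = np.arange(512, dtype=float)   # stand-in for one univariate channel
patches = patchify(x)
print(patches.shape)              # (63, 16)
```

Because attention cost grows quadratically with the number of tokens, shortening the token sequence this way is what lets PatchTST use longer lookback windows at reasonable cost, and the "temporal coherence of patches" concern above is about what ordering information survives this grouping.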
Human Response
Based on the foundational work the authors discussed regarding deep learning for forecasting, numerous hurdles remain despite the promising results of the PatchTST model. Many of these challenges surround the practicality of using a transformer model for such tasks when much simpler approaches can often achieve comparable results.
A major challenge for these models, not covered in depth by the authors, is their use in scenarios where the data lacks a stable cyclical nature or contains substantial random noise. In the paper, all models were tested on weather, utility usage, and traffic patterns, which are heavily cyclical and predictable and carry less noise than something like a time series of human behavior. It remains relatively unclear how the model would perform in scenarios where random fluctuations in the data occur more frequently.
As mentioned, transformer models are also far more complex and computationally intensive than standard forecasting models, making them prone to a host of issues. These include the possibility of overfitting to small, homogeneous datasets, the additional skill required to ensure they are being properly utilized, and the cost of training and inference. These concerns are more practical than scientific in nature but still affect the potential success of the approach in more novel scenarios.
Finally, the authors' adaptation of language-model-style pre-training showed signs of performance improvement, but not to the same extent as in text-based language tasks. These processes are critical to the broader success of language-based transformer models, but they may not produce the same leap in general forecasting performance, with basic supervised learning approaches still producing impressive results. Improving these concepts has substantial potential to raise the overall performance of transformer-based time series models across a broad spectrum of domains with far less available data.
What issues and limitations did you encounter while using the LLM to assist with your paper summary?
Human Response
My experience with the HuggingFace SmolLM3-3B model was overall positive. The model comprehended my complex questions and prompts efficiently and completely, and its chat interface on the HuggingChat platform felt similar to large-scale closed-weight LLMs such as ChatGPT, Grok, or Gemini. Despite the great experience, a few minor challenges remained.
The first issue was less about the model and more about the HuggingChat platform. Since the model is trained to comprehend only text, it initially could not process the paper's raw PDF. Instead, the PDF had to be converted to a text-only format, raising concerns about whether the paper's contents would render correctly and whether the model might misinterpret or hallucinate information as a result.
Aside from that platform-specific issue, the only challenges with the model concerned the framing of its answers to my questions. The answers themselves were of high quality but often overly verbose when far less would have sufficed. Despite prompts to keep responses to a few sentences, the model often opted for a multi-bullet output in which each bullet was a sentence or two. This format was helpful for gathering the individual components of the paper, but it was not concise enough to make the general concepts easy to grasp.
Additional Citations
Bakouch, E., Ben Allal, L., Lozhkov, A., Tazi, N., Tunstall, L., Patiño, C. M., Beeching, E., Roucher, A., Reedi, A. J., Gallouédec, Q., Rasul, K., Habib, N., Fourrier, C., Kydlicek, H., Penedo, G., Larcher, H., Morlon, M., Srivastav, V., Lochner, J., Nguyen, X.-S., Raffel, C., von Werra, L., & Wolf, T. (2025). SmolLM3: Smol, multilingual, long-context reasoner. Hugging Face. https://huggingface.co/blog/smollm3
Nie, Y., Nguyen, N. H., Sinthong, P., & Kalagnanam, J. (2023). A Time Series is Worth 64 Words: Long-term Forecasting with Transformers. arXiv. https://arxiv.org/abs/2211.14730
Wu, H., Xu, J., Wang, J., & Long, M. (2022). Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. arXiv. https://arxiv.org/abs/2106.13008
Yang, L., & Hong, S. (2022). Unsupervised time-series representation learning with iterative bilinear temporal-spectral fusion. arXiv. https://arxiv.org/abs/2202.04770
Yue, Z., Wang, Y., Duan, J., Yang, T., Huang, C., Tong, Y., & Xu, B. (2022). TS2Vec: Towards Universal Representation of Time Series. Proceedings of the AAAI Conference on Artificial Intelligence, 36(8), 8980-8987. https://doi.org/10.1609/aaai.v36i8.20881
Zeng, A., Chen, M., Zhang, L., & Xu, Q. (2022). Are transformers effective for time series forecasting? arXiv. https://arxiv.org/abs/2205.13504
Zhou, T., Ma, Z., Wen, Q., Wang, X., Sun, L., & Jin, R. (2022). FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting. arXiv. https://arxiv.org/abs/2201.12740
Zhou, H., Zhang, S., Peng, J., Zhang, S., Li, J., Xiong, H., & Zhang, W. (2021). Informer: Beyond efficient transformer for long sequence time-series forecasting. arXiv. https://arxiv.org/abs/2012.07436