Sunday, October 12, 2025

Book Review: Time Series Forecasting using Foundation Models

As someone who primarily works in NLP and Search in the Health Domain, I don't have much use for Time Series. However, having been exploring the Financial domain out of personal interest, I have been curious about Time Series for some time. Recently I attended the OpenHPI course Time Series Analysis taught by Mario Tormo Romero (I even did the quizzes and earned the certificate of completion!). I was familiar with traditional techniques such as ARIMA (and its derivatives), but the course also covered Neural Network based techniques using CNN and RNN architectures, the MLP-based N-BEATS, and Transformer based models such as Autoformer, Informer and TFT. Overall, I loved the course and learned a lot from it. If I had to complain, it would be about the lack of practical code examples and/or exercises, but I suppose it may not be that hard to Google (or nowadays, ChatGPT) that stuff on my own.

As I get older, I find I learn faster by using what I already know to create analogies for what I am learning, rather than starting from scratch. So it seemed to me that there is some similarity between predicting the next word in a sentence and predicting where a stock price is headed next week given its previous history. Thus methods useful in NLP, including the relatively cutting-edge methods around Transformers and Generative AI, could, at least in principle, be applicable to Time Series forecasting. Of course, NLP involves discrete entities, i.e., words in a vocabulary, while Time Series involve continuous values, so there are bound to be differences as well.

So when I came across Marco Peixeiro's Time Series Forecasting using Foundation Models, I was quite intrigued (sorry if I sound Victorian, but that's the closest word I can think of to describe the mixture of vindication and curiosity I felt when I saw the title). Being a relative outsider to the world of Time Series forecasting, I felt vindicated that there is a research community actually looking at this connection, and I was curious to see where they had taken it. So I read the book, and here is what I learned.

High level feedback -- overall, this book fulfills the promise it makes in its title, and then some. It covers 7 different Foundation Models (loosely speaking, since some of these are more methodological frameworks than models), spanning encoder-only, encoder-decoder and decoder-only (and even a couple of Mixture of Experts) architectures. Each of the model specific chapters provides code examples for using the model in zero-shot mode, and for fine-tuning where applicable. For models that produce point estimates, it demonstrates cross-validation based methods to produce a forecast distribution, as well as code for anomaly detection where applicable. Over the course of these seven chapters, it compares and contrasts these models with each other, so by the end of the book the reader has a good grasp of what each model can or cannot do, and where each might shine. There is also a capstone project on a different dataset, which serves to cement the reader's understanding of these various models. I think the material is not only comprehensive, but also prepares you to intelligently follow advances in the field of Time Series forecasting using Foundation Models, which is important given that it is still a relatively nascent and fast-growing field.

Detailed per-chapter feedback -- the book is organized in three parts (four if you include the Capstone Project, which is really one large exercise). Part 1 is mostly background, Part 2 covers 5 models developed specifically for Time Series forecasting, and Part 3 covers 2 models where the Time Series task is converted into a language task and an LLM is used to handle it.

Part 1

  • Chapter 1: Understanding Foundation Models -- covers the Transformer architecture, with detailed coverage of its building blocks. Of note is the coverage of positional embeddings, which become even more crucial in the context of Time Series (a meaningless stream of numbers rather than a semi-meaningful stream of words); see the positional-encoding sketch after this list. It also covers why (and why not) one would want to use Foundation Models for Time Series forecasting.
  • Chapter 2: Building Foundation Models -- covers the N-BEATS model architecture. N-BEATS was also one of the models covered towards the end of the OpenHPI course, so this represents a sort of progression towards the use of FMs for Time Series forecasting. In addition, it covers the different evaluation metrics used in this area (a sketch of some common ones also follows this list), and the effect of forecasting horizons on performance.
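
Since positional information gets so much attention in Chapter 1, here is a minimal sketch of the classic sinusoidal positional encoding from the original Transformer paper (my own illustration, not code from the book):

    import numpy as np

    def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
        """Sinusoidal positional encoding from "Attention Is All You Need"."""
        positions = np.arange(seq_len)[:, np.newaxis]        # (seq_len, 1)
        dims = np.arange(0, d_model, 2)[np.newaxis, :]       # (1, d_model/2)
        angle_rates = 1.0 / np.power(10000.0, dims / d_model)
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(positions * angle_rates)        # even dimensions
        pe[:, 1::2] = np.cos(positions * angle_rates)        # odd dimensions
        return pe

    # every one of 512 time steps gets a unique, deterministic 64-dim vector
    pe = sinusoidal_positional_encoding(seq_len=512, d_model=64)

Each time step gets a distinct vector, which is what lets the model tell otherwise anonymous numbers apart by their position in the series.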
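
And since Chapter 2 introduces the standard evaluation metrics, here are NumPy versions of three common ones (again my own implementations, shown only to make the definitions concrete):

    import numpy as np

    def mae(y_true, y_pred):
        """Mean Absolute Error: average magnitude of forecast errors."""
        return np.mean(np.abs(y_true - y_pred))

    def smape(y_true, y_pred):
        """Symmetric MAPE: a percentage error bounded to [0, 200]."""
        return 100.0 * np.mean(2.0 * np.abs(y_pred - y_true) /
                               (np.abs(y_true) + np.abs(y_pred)))

    def mase(y_true, y_pred, y_train, m=1):
        """MAE scaled by the in-sample naive (lag-m) forecast error."""
        naive_mae = np.mean(np.abs(y_train[m:] - y_train[:-m]))
        return mae(y_true, y_pred) / naive_mae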

Part 2

  • Chapter 3: Forecasting with TimeGPT -- covers the TimeGPT model, an encoder-decoder model that can predict future values of a univariate Time Series with exogenous variables. Code examples illustrate how to use this model for zero-shot forecasting and fine-tuning, as well as for cross-validation over different forecasting horizons and anomaly detection.
  • Chapter 4: Zero Shot Probabilistic Forecasting with Lag-LLaMA -- this is an open-source model built on the decoder-only LLaMA architecture from Meta. It supports univariate Time Series only, and is trained using lagged values of many different Time Series as features. Lag-LLaMA provides probabilistic forecasts rather than point predictions. Code examples similar to those in the previous chapter are also provided.
  • Chapter 5: Learning the language of time with Chronos -- this chapter covers Chronos, a framework that allows using T5 and GPT-2 like language models with Time Series data. It describes techniques such as mean scaling, mixup (convex combinations of multiple Time Series) and KernelSynth for data augmentation. The framework yields probabilistic forecasts as well, and the median is usually used for point predictions if needed (see the Chronos sketch after this list). As in previous chapters, code examples for zero-shot forecasting and fine-tuning, as well as cross-validation and anomaly detection, are provided.
  • Chapter 6: Moirai, a Universal Forecasting Transformer -- Moirai is an encoder-only model, provides probabilistic forecasts, and supports exogenous features out of the box. It uses a technique called patching to combine multiple consecutive inputs into a single element, similar to how one might use n-grams in NLP, which allows it to capture local semantic meaning and support longer context lengths (a minimal patching sketch follows this list). The output is sent through a linear projection layer. Moirai comes in two flavors, this one and Moirai-MoE, a mixture-of-experts version based on a decoder-only Transformer model.
  • Chapter 7: Deterministic Forecasting with TimesFM -- TimesFM produces deterministic point predictions rather than a probabilistic forecast. It cannot be used for anomaly detection directly, since we cannot construct confidence intervals from point predictions. One innovation in TimesFM is the use of residual blocks (sketched after this list). The output is in the form of patches, which go through a linear layer to produce the final prediction. Exogenous variables are supported through the use of an additional regression model. Unlike the other chapters, this one does not cover fine-tuning, since that requires JAX and was considered out of scope for the book (but maybe it's a good reason to learn JAX?).
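
The zero-shot workflow these chapters walk through follows a common shape. As an illustration (not the book's code), here is roughly what zero-shot probabilistic forecasting looks like with the open-source chronos-forecasting package, assuming its ChronosPipeline API and the amazon/chronos-t5-small checkpoint:

    import numpy as np
    import torch
    from chronos import ChronosPipeline  # pip install chronos-forecasting

    # a random-walk series as a stand-in for real historical data
    history = np.random.randn(200).cumsum()

    # load a pretrained Chronos checkpoint (T5 backbone)
    pipeline = ChronosPipeline.from_pretrained("amazon/chronos-t5-small")

    # sample forecast trajectories for the next 12 steps
    context = torch.tensor(history, dtype=torch.float32)
    samples = pipeline.predict(context, prediction_length=12)  # (1, num_samples, 12)

    # median as the point forecast, 10th/90th percentiles as an interval
    low, median, high = np.quantile(samples[0].numpy(), [0.1, 0.5, 0.9], axis=0)

The interval is also what enables anomaly detection: observations that fall outside it can be flagged.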
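
The patching idea shared by Moirai and TimesFM is also easy to make concrete; here is a minimal sketch (mine, not the book's) of slicing a series into non-overlapping patches and projecting each one into the model dimension:

    import torch
    import torch.nn as nn

    patch_len, d_model = 16, 128
    series = torch.randn(1, 512)  # (batch, seq_len)

    # group consecutive values into patches, like n-grams over words
    patches = series.unfold(dimension=1, size=patch_len, step=patch_len)  # (1, 32, 16)

    # each patch becomes a single "token" embedding via a linear projection
    embed = nn.Linear(patch_len, d_model)
    tokens = embed(patches)  # (1, 32, 128): 512 time steps -> 32 positions

Shrinking 512 time steps down to 32 attention positions is what buys the longer context lengths.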
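
As for the residual blocks mentioned under TimesFM, a generic residual MLP block (not TimesFM's exact design) looks like this:

    import torch.nn as nn

    class ResidualBlock(nn.Module):
        """A feed-forward block whose input is added back to its output."""
        def __init__(self, dim: int, hidden: int):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(dim, hidden),
                nn.ReLU(),
                nn.Linear(hidden, dim),
            )

        def forward(self, x):
            # the skip connection lets gradients bypass the block,
            # which stabilizes training of deeper stacks
            return x + self.net(x)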

Part 3

  • Chapter 8: Forecasting as a Language task -- this chapter covers PromptCast, a technique that turns the Time Series forecasting task into a language task. The LLMs used here are Flan-T5 and Llama 3.2 3B-Instruct. Essentially, it consists of creating prompts that specify an input sequence, optionally describing the task, and asking the LLM to provide the next value (a sample prompt follows this list). The chapter illustrates zero-shot, few-shot and chain-of-thought prompting. The approach is likened to the Pudding mit Gabel ("pudding with a fork") festival, where people use forks to eat pudding.
  • Chapter 9: Reprogram an LLM for forecasting -- this chapter covers TimeLLM, another framework that reframes a Time Series forecasting task as a language task. It splits the input series into patches and reprograms them by mapping them onto the LLM's vocabulary embeddings; the reprogrammed patches, along with a text prompt, are fed to the frozen LLM, and a linear layer produces the prediction from the learned embeddings. Training involves updating only the weights of the patch reprogramming and linear layers. While it produces point predictions, it can be used for anomaly detection by using cross-validation to generate forecasts across multiple time horizons.
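
To make the PromptCast idea concrete, here is the kind of prompt such a method builds (my own paraphrase of the pattern, not a template from the book):

    def build_forecast_prompt(values, horizon=1, unit="visits"):
        """Serialize a numeric history into a natural-language forecasting prompt."""
        history = ", ".join(f"{v:.1f}" for v in values)
        return (
            f"The {unit} over the last {len(values)} days were: {history}. "
            f"What will the {unit} be over the next {horizon} day(s)? "
            "Answer with numbers only."
        )

    prompt = build_forecast_prompt([120, 132, 129, 141, 138, 150, 147], horizon=1)
    # the prompt is then sent to an instruction-tuned LLM such as Flan-T5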

Part 4

  • Chapter 10: Capstone Project -- forecasting daily visits to a blog -- the chapter provides the dataset and asks the reader to build models that predict future daily visits. The provided solution starts with a SARIMA baseline (see the sketch below), then applies the different models discussed in the book to produce progressively better predictions.
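
For reference, a SARIMA baseline along these lines might look like the following with statsmodels (the orders here are illustrative placeholders, not the book's choices):

    import numpy as np
    from statsmodels.tsa.statespace.sarimax import SARIMAX

    # y: daily visit counts; a random-walk stand-in for the book's dataset
    y = np.abs(np.random.randn(365).cumsum()) + 100

    # weekly seasonality (s=7); the (p,d,q) and (P,D,Q) values are illustrative
    model = SARIMAX(y, order=(1, 1, 1), seasonal_order=(1, 1, 1, 7))
    result = model.fit(disp=False)

    forecast = result.forecast(steps=14)  # point forecasts for the next two weeks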

So there you have it. As I mentioned earlier, I found this book quite useful, not only for its coverage of the various models and how they are used for Time Series forecasting, but also as a primer for following research progress in this field. I hope you found this review helpful, and that this book serves you as well as it has served me.