LakeFM: Toward a Foundation Model for Aquatic Ecosystems Using Irregular Multivariate Multi-depth Time Series Data

Abhilash Neog¹, Sepideh Fatemi¹, Medha Sawhney¹, Kazi Sajeed Mehrab¹, Aanish Pradhan¹, Bennett J. McAfee², Emma Marchisin³, Robert Ladwig⁴, Arka Daw⁵, Cayelan C. Carey⁶, Paul C. Hanson³, Anuj Karpatne¹

¹Dept. of Computer Science, Virginia Tech    ²Annis Water Resources Institute, Grand Valley State University
³Center for Limnology, University of Wisconsin–Madison    ⁴Dept. of Ecoscience, Aarhus University
⁵Oak Ridge National Laboratory    ⁶Dept. of Biological Sciences, Virginia Tech

Accepted at KDD 2026

Paper (Longer Version)

Code

Motivation

Understanding and forecasting lake dynamics is critical for monitoring water quality and ecosystem health across lakes and reservoirs. While there is a growing body of work on modeling lake water temperature, a single variate gives only a partial view of the complex, interacting processes that govern lake dynamics. These processes are observed at depths, frequencies, subsets of variables, and levels of reliability that vary from one site (lake) to another.

Benchmarking efforts such as LakeBeD-US have harmonized water quality observations across multiple monitoring programs, resulting in over 500 million observations spanning 17 variables from 21 lakes. Yet the data remain plagued by a high degree of missing values, uneven sampling frequencies, and highly variable depth and variate coverage across sites. This sparsity and heterogeneity, which are intrinsic to real-world environmental monitoring, severely limit the ability of ML methods to scale to broader collections of lakes using irregular multivariate multi-depth time series data.

At the same time, the broader ML community has made significant progress in developing time series foundation models (e.g., Chronos 2, MOMENT) that learn task-agnostic representations from large heterogeneous corpora. However, aquatic sciences still lacks a foundation model capable of unifying information across multiple lakes and variates with irregular frequencies and depths. Existing TS foundation models either focus on univariate signals or assume clean, densely sampled data.

Motivated by this gap, we ask the following research questions:

RQ1. Can we build a foundation model for aquatic sciences that learns generic lake processes across a broad collection of lakes and variables, while retaining site-specific nuances?
RQ2. Can we use such a foundation model to forecast lake dynamics using any subset of variables available at a lake with irregular observations across time and depth?
RQ3. Can we extract feature representations of lakes that capture their static and time-varying characteristics, revealing novel information about their similarity and temporal evolution at macro-system scales?

To answer these questions, we introduce LakeFM, a foundation model pre-trained on a large-scale ecological dataset containing over 1.5 million samples, comprising over 1,000 diverse lake simulations from physics-based models and real-world observations from 21 lakes in the LakeBeD-US dataset.

LakeFM — Methodology

LakeFM operates as an encoder-decoder framework with four core components designed to handle the sparsity and irregularity inherent in lake ecosystems.

LakeFM architecture overview — **LakeFM architecture.** Each observation *(time, variable, depth, value)* is tokenized and passed through RoPE-augmented Transformer encoder layers with an intra/inter-variate attention bias. Two linear projectors disentangle the encoder output into static (lake-invariant) and temporal (dynamic) subspaces, jointly trained with a contrastive loss and a probabilistic forecasting loss. A query-based decoder conditions on arbitrary future *(time, variable, depth)* queries to produce Student-t predictive distributions.

Tokenization & Embedding

Each observation tuple (time, variable, depth, value) is treated as a token. Composite embeddings combine time, depth, variate, and value signals — allowing the model to handle irregular grids without imputation.
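As a minimal sketch of the tokenization step only (the embedding layers are not shown), the snippet below flattens irregular records into *(time, variable, depth, value)* tokens and skips missing readings instead of imputing them. The record layout, the variable names, and the `tokenize` helper are hypothetical and chosen for illustration; they are not taken from the LakeFM code.

```python
import math

# Hypothetical raw records: one dict per sampling event at a given time and depth.
# A missing reading is either absent or None; nothing is imputed.
records = [
    {"time": 0.0, "depth": 1.0, "temp": 21.4, "do": 8.9},
    {"time": 0.5, "depth": 1.0, "temp": 21.6},               # DO not measured
    {"time": 1.0, "depth": 5.0, "temp": 14.2, "do": None},   # DO sensor dropout
]


def tokenize(records, variables):
    """Flatten records into (time, variable, depth, value) tokens, skipping gaps."""
    tokens = []
    for rec in records:
        for var in variables:
            value = rec.get(var)
            # Skip absent, None, and NaN readings rather than filling them in.
            if value is None or (isinstance(value, float) and math.isnan(value)):
                continue
            tokens.append((rec["time"], var, rec["depth"], value))
    return tokens


tokens = tokenize(records, ["temp", "do"])
for token in tokens:
    print(token)
```

Because every token carries its own time and depth, the sequence length simply follows the number of available observations, and no regular grid is required.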

Encoder Layers

The encoder stacks Transformer layers with Rotary Position Embeddings (RoPE) and a learnable attention bias that distinguishes intra-variate from inter-variate interactions across irregular token sequences.
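The attention bias can be illustrated with a small sketch. It assumes the bias reduces to two scalars, one for token pairs of the same variate and one for pairs of different variates, added to the attention logits before the softmax. The scalars are fixed here, whereas in the model they would be learned, and RoPE is omitted. The names `attention_bias`, `intra_bias`, and `inter_bias` are hypothetical.

```python
import math


def attention_bias(variate_ids, intra_bias, inter_bias):
    """Return an n x n matrix: intra_bias for same-variate pairs, inter_bias otherwise."""
    return [
        [intra_bias if vi == vj else inter_bias for vj in variate_ids]
        for vi in variate_ids
    ]


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Three tokens: two temperature readings and one dissolved-oxygen reading.
variate_ids = ["temp", "temp", "do"]
raw_scores = [[0.0] * 3 for _ in range(3)]  # placeholder query-key logits

bias = attention_bias(variate_ids, intra_bias=0.5, inter_bias=-0.5)
weights = [
    softmax([s + b for s, b in zip(score_row, bias_row)])
    for score_row, bias_row in zip(raw_scores, bias)
]
print(weights[0])
```

With identical raw logits, the first temperature token attends more to the other temperature token than to the DO token, which is the effect a positive intra-variate bias would have.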

Static & Temporal Disentanglement

Two parallel linear projectors separate the encoder output into static (lake-invariant characteristics) and temporal (dynamic behavior) subspaces, jointly optimized with contrastive and forecasting objectives.
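One common form of contrastive objective is InfoNCE. The sketch below assumes that the positive for an anchor embedding is another embedding from the same lake and that the negatives come from other lakes; this pairing scheme, the cosine similarity, and the temperature value are assumptions for illustration and may differ from the actual training setup.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: -log(exp(s_pos/t) / (exp(s_pos/t) + sum_k exp(s_neg_k/t)))."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]


anchor = [1.0, 0.0]
same_lake = [0.9, 0.1]      # hypothetical embedding from the same lake
other_lake = [0.0, 1.0]     # hypothetical embeddings from other lakes
another_lake = [-1.0, 0.0]

aligned_loss = info_nce(anchor, same_lake, [other_lake, another_lake])
misaligned_loss = info_nce(anchor, other_lake, [same_lake, another_lake])
print(aligned_loss, misaligned_loss)
```

The loss is small when the same-lake embedding is the closest one to the anchor and large when an embedding from another lake is closer.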

Query-Based Forecasting

A decoder conditioned on arbitrary future (time, variable, depth) queries attends over the encoded history, enabling forecasts at any irregular output grid without requiring a fixed forecast horizon.
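The idea of decoding at arbitrary queries can be sketched with plain scaled dot-product cross-attention. This is an illustration under simplifying assumptions, not the LakeFM decoder: the query, key, and value vectors are hypothetical stand-ins for embedded *(time, variable, depth)* queries and encoded history tokens, and there is a single head with no learned projections.

```python
import math


def cross_attend(queries, keys, values):
    """For each query, return a softmax-weighted average of the history values."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        attn = [e / total for e in exps]
        outputs.append(
            [sum(a * v[c] for a, v in zip(attn, values)) for c in range(len(values[0]))]
        )
    return outputs


# Encoded history: two tokens with hypothetical keys and one-dimensional values.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0], [20.0]]

# Any number of future queries can be decoded against the same history.
queries = [[5.0, 0.0], [0.0, 5.0], [0.0, 0.0]]
outputs = cross_attend(queries, keys, values)
print(outputs)
```

The number of outputs equals the number of queries, so the forecast grid is set entirely by which queries are asked rather than by a fixed horizon.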

Predictions are parameterized as a Student-t distribution (μ, σ, ν), with per-token degrees-of-freedom ν learned adaptively to capture the heavy-tailed noise common in ecological measurements. Pre-training jointly minimizes a probabilistic forecasting loss and a lake-wise InfoNCE contrastive loss to encourage meaningful lake-specific representations.
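For reference, the negative log-likelihood of a location-scale Student-t distribution can be written in a few lines. This is a sketch of the standard density only; how the forecasting loss is aggregated over tokens, and how μ, σ, and ν are produced by the decoder, is not shown and the function name is hypothetical.

```python
import math


def student_t_nll(y, mu, sigma, nu):
    """Negative log-likelihood of y under a Student-t with location mu, scale sigma, df nu."""
    z = (y - mu) / sigma
    log_pdf = (
        math.lgamma((nu + 1.0) / 2.0)
        - math.lgamma(nu / 2.0)
        - 0.5 * math.log(nu * math.pi)
        - math.log(sigma)
        - (nu + 1.0) / 2.0 * math.log1p(z * z / nu)
    )
    return -log_pdf


# An outlier six scale units from the mean is penalized far less under heavy tails.
heavy_tail_nll = student_t_nll(6.0, mu=0.0, sigma=1.0, nu=3.0)
near_gaussian_nll = student_t_nll(6.0, mu=0.0, sigma=1.0, nu=1e6)
print(heavy_tail_nll, near_gaussian_nll)
```

A small ν keeps the penalty for occasional extreme measurements modest, while a large ν approaches the Gaussian case, which is why letting ν vary per token can help with heavy-tailed noise.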

Experiments and Results

Dataset

LakeFM is pre-trained on a large-scale ecological dataset containing over 1.5 million samples:

LakeBeD-US [McAfee et al., 2025] — 500 million unique observations spanning 21 lakes across the United States, covering 17 variables with 60–70% sparsity on average.
WQHanson Simulations [Hanson et al., 2023] — 4 simulation lakes generated using a process-based water quality model.
FCR Simulations [Hipsey et al., 2019] — 1,000 simulations generated using the GLM-AED process-based model.

Evaluation Setup

The LakeBeD-US data is partitioned into an In-Distribution (ID) set (15 lakes) and an Out-of-Distribution (OOD) set (6 entirely unseen lakes). We compare against time-series foundation models (Chronos 2, LPTM, MOMENT) and a non-foundation local model (iTransformer trained per-lake).

Forecasting Performance

LakeFM achieves the best overall rank, 2.03 across all ID lakes and 2.0 across all OOD lakes, in terms of lake-wise MSE. In the ID setting, LakeFM consistently shows the lowest MSE across all lakes, while baselines like iTransformer show high variability on BM and GL4. On OOD lakes, LakeFM shows the best zero-shot performance on all lakes except TR. iTransformer's performance varies widely across OOD lakes because it is trained only on local data from a single lake and cannot transfer knowledge across lakes, unlike LakeFM and the other foundation models.
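To make the ranking metric concrete, the sketch below computes an average rank from lake-wise MSE. It assumes that models are ranked within each lake (1 = lowest MSE) and that ranks are then averaged over lakes; the exact aggregation used in the paper may differ. The model names, lake names, and MSE values are made up for illustration and are not results from the paper.

```python
def average_ranks(mse_by_lake):
    """Rank models within each lake by MSE (1 = best), then average ranks over lakes."""
    totals = {}
    for scores in mse_by_lake.values():
        ordered = sorted(scores, key=scores.get)  # ties are not handled in this sketch
        for rank, model in enumerate(ordered, start=1):
            totals[model] = totals.get(model, 0) + rank
    n_lakes = len(mse_by_lake)
    return {model: total / n_lakes for model, total in totals.items()}


# Hypothetical lake-wise MSE values for three placeholder models.
mse_by_lake = {
    "lake_1": {"model_a": 0.2, "model_b": 0.5, "model_c": 0.9},
    "lake_2": {"model_a": 0.7, "model_b": 0.4, "model_c": 0.8},
    "lake_3": {"model_a": 0.1, "model_b": 0.3, "model_c": 0.2},
}

ranks = average_ranks(mse_by_lake)
print(ranks)
```

An average rank is less sensitive than mean MSE to a single lake with very large errors, which makes it a convenient summary across heterogeneous sites.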

Out-of-distribution forecasting performance bar plot — Zero-shot generalization (OOD) across 6 unseen lakes.

Discovering New Insights of Variate Interactions

LakeFM's query-based decoder enables forecasting under arbitrary input masking — a capability existing TSFMs lack. By selectively withholding variables or depth layers from the context window, we can probe the cross-variate and cross-depth dependencies that LakeFM has learned, generating ecologically novel and testable hypotheses.
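Because each observation is its own token, masking amounts to dropping tokens from the context. The sketch below shows one way to withhold a variable or a depth range; the token values, the `mask_context` helper, and its parameters are hypothetical and only illustrate the probing procedure.

```python
def mask_context(tokens, drop_variables=(), drop_depth_range=None):
    """Remove tokens of the given variables and/or within an inclusive depth range."""
    kept = []
    for time, variable, depth, value in tokens:
        if variable in drop_variables:
            continue
        if drop_depth_range is not None:
            shallow, deep = drop_depth_range
            if shallow <= depth <= deep:
                continue
        kept.append((time, variable, depth, value))
    return kept


# Hypothetical context window of (time, variable, depth, value) tokens.
context = [
    (0.0, "temp", 0.5, 22.1),
    (0.0, "do", 0.5, 8.7),
    (0.0, "temp", 6.0, 12.3),
    (1.0, "do", 6.0, 3.4),
]

without_temp = mask_context(context, drop_variables=("temp",))
without_shallow = mask_context(context, drop_depth_range=(0.0, 2.0))
print(without_temp)
print(without_shallow)
```

The same forecast queries can then be issued against the full and the masked context, and the change in error attributed to the withheld variable or depth stratum.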

Variate Masking

We mask individual variates in the context window and measure the change in forecasting performance of all variables — revealing which variables carry the most predictive information for others. As a case study, we examine Lake PRLA, masking either Dissolved Oxygen (DO) or Water Temperature (Temp) and observing the impact on DO forecasts.

Masking DO yields MSE = 12.57 and CRPS = 1.93, while masking Temp yields MSE = 11.00 and CRPS = 2.52. Although masking Temp gives the lower DO MSE, its higher CRPS indicates that the model becomes overconfident yet inaccurate: it produces a narrow prediction interval that fails to capture the true value. Conversely, when DO is masked, the model assigns higher uncertainty and produces a wider, more inclusive prediction range, which yields the lower CRPS despite the higher MSE. This indicates that LakeFM's uncertainty estimates are physically meaningful: the model recognizes when a crucial covariate is missing and responds by widening its prediction interval rather than collapsing to a point estimate.
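The interplay between MSE and CRPS can be reproduced on a toy example. The sketch below uses the standard sample-based CRPS estimator, E|X − y| − ½·E|X − X′|; using samples rather than a closed form for the Student-t is an assumption made here for simplicity, and the forecast samples are invented rather than taken from Lake PRLA.

```python
def crps_from_samples(samples, y):
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'| (lower is better)."""
    n = len(samples)
    error_term = sum(abs(x - y) for x in samples) / n
    spread_term = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return error_term - 0.5 * spread_term


def squared_error_of_mean(samples, y):
    """Squared error of the sample mean, i.e. the point-forecast error."""
    mean = sum(samples) / len(samples)
    return (mean - y) ** 2


truth = 10.0
narrow = [7.9, 8.0, 8.1]    # confident forecast that misses the truth
wide = [5.0, 7.5, 10.0]     # less accurate mean, but the spread covers the truth

print(squared_error_of_mean(narrow, truth), crps_from_samples(narrow, truth))
print(squared_error_of_mean(wide, truth), crps_from_samples(wide, truth))
```

Here the narrow forecast has the smaller squared error but the larger CRPS, mirroring the pattern described above: CRPS rewards a forecast whose spread honestly covers the observation.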

DO forecasts under different variate masking scenarios for Lake PRLA at depth 1.0 m: (a) no masking, (b) Temp masked, (c) DO masked.

Depth Masking

We study the effect of masking all variates at either shallow or deep layers in the context window, measuring the impact on forecasting performance. LakeFM leverages cross-depth relationships to maintain accuracy even when one depth stratum is withheld — for example, masking shallow layers in Lake BARC reveals that deep-layer context is sufficient to reconstruct near-surface dynamics.

Depth masking scenarios for Lake BARC at depth 0.5 m: (a) no masking, (b) shallow layers masked, (c) deep layers masked.

Physical Consistency

LakeFM is not explicitly trained with physical constraints, yet its predictions demonstrate emergent compliance with two fundamental limnological laws, evaluated across 100 unseen simulation lakes:

Thermal Stratification Law. During summer, lake temperature decreases monotonically with depth. We measure this via the inversion rate — the average number of depth-wise temperature inversions per day (lower is better).
Beer-Lambert Law. Light intensity attenuates exponentially with depth due to biomass in the water column. We quantify this via the Pearson R² between predicted Chlorophyll-a and Light Attenuation (higher is better).

LakeFM shows a lower inversion rate and higher Beer-Lambert R² than Chronos 2 on the large majority of unseen lakes, indicating that physical plausibility emerges from pre-training on diverse ecological data alone.
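Both consistency metrics are simple to compute from predictions. The sketch below is one plausible reading of the definitions above: an inversion is counted whenever a deeper layer is predicted warmer than the layer directly above it, and the Beer-Lambert score is the squared Pearson correlation between two predicted series. The tolerance-free comparison, the data layout, and all numbers are assumptions for illustration.

```python
def inversion_rate(daily_profiles):
    """Average number of adjacent-depth temperature inversions per day (lower is better)."""
    total = 0
    for profile in daily_profiles:
        ordered = sorted(profile)  # sort (depth, temperature) pairs by depth
        total += sum(
            1
            for (_, upper_temp), (_, lower_temp) in zip(ordered, ordered[1:])
            if lower_temp > upper_temp
        )
    return total / len(daily_profiles)


def pearson_r2(xs, ys):
    """Squared Pearson correlation between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov * cov / (var_x * var_y)


# Hypothetical summer profiles as (depth in m, predicted temperature in deg C).
profiles = [
    [(0.0, 24.0), (2.0, 22.0), (5.0, 15.0)],   # monotone: no inversion
    [(0.0, 23.0), (2.0, 23.5), (5.0, 16.0)],   # one inversion between 0 m and 2 m
]
rate = inversion_rate(profiles)

# Hypothetical predicted chlorophyll-a and light attenuation series.
r2 = pearson_r2([1.0, 2.0, 4.0, 3.0], [0.4, 0.6, 1.1, 0.8])
print(rate, r2)
```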

Inversion rate comparison: LakeFM vs Chronos 2 — Inversion rate (thermal stratification law, ↓).

Beer-Lambert R² comparison: LakeFM vs Chronos 2 — Beer-Lambert R² (light attenuation law, ↑).

Learned Lake Representations

t-SNE of static lake embeddings colored by hydrologic regime — **Static embeddings.** t-SNE of LakeFM's lake-level static embeddings colored by hydrologic regime. Without any explicit supervision, the embeddings cluster along known ecological and geographic axes — lakes with similar hydrology and geographic origin are placed close together.

Temporal embedding trajectories for five Wisconsin lakes in 2018 — **Temporal embeddings.** Seasonal trajectories of five Wisconsin lakes across 2018, projected via t-SNE and colored by trophic state and hydrologic regime. Eutrophic drainage lakes (ME, MO) follow similar temporal paths, while oligotrophic seepage lakes (SP, BM) form a separate cluster — demonstrating ecologically meaningful temporal dynamics in the representations.

t-SNE of simulated lake embeddings colored by w_p_cyano parameter — **LakeFM Embedding Alignment with Simulation Parameters.** We investigate whether LakeFM's embeddings of simulated lakes encode information about the process-based parameters used to generate the simulations. The embedding space shows a clear gradient with respect to the *w_p_cyano* parameter, with low, intermediate, and high values clearly separated.

Qualitative comparison on Lake Mendota water temperature at 0m — **Qualitative Comparison.** Qualitative analysis comparing LakeFM and baselines — Chronos 2, iTransformer, MOMENT, and LPTM — on Lake Mendota for the Water Temperature variate observed at depth 0 m.

BibTeX

If you find this work useful, please cite:

@article{neog2024lakefm,
  title     = {LakeFM: Toward a Foundation Model for Aquatic Ecosystems
               Using Irregular Multivariate Multi-depth Time Series Data},
  author    = {Neog, Abhilash and Fatemi, Sepideh and Sawhney, Medha and
               Mehrab, Kazi Sajeed and Pradhan, Aanish and McAfee, Bennett J.
               and Marchisin, Emma and Ladwig, Robert and Daw, Arka and
               Carey, Cayelan C. and Hanson, Paul C. and Karpatne, Anuj},
  year      = {2024}
}

References

P. C. Hanson, R. Ladwig, C. Buelo, E. A. Albright, A. D. Delany, and C. C. Carey. Legacy Phosphorus and Ecosystem Memory Control Future Water Quality in a Eutrophic Lake. Journal of Geophysical Research: Biogeosciences 128, 12 (2023), e2023JG007620. doi:10.1029/2023JG007620
M. R. Hipsey, L. C. Bruce, C. Boon, B. Busch, C. C. Carey, D. P. Hamilton, P. C. Hanson, J. S. Read, E. de Sousa, M. Weber, and L. A. Winslow. A General Lake Model (GLM 3.0) for linking with high-frequency sensor data from the Global Lake Ecological Observatory Network (GLEON). Geoscientific Model Development 12, 1 (2019), 473–523.
B. J. McAfee, A. Pradhan, A. Neog, S. Fatemi, R. T. Hensley, M. E. Lofton, A. Karpatne, C. C. Carey, and P. C. Hanson. LakeBeD-US: a benchmark dataset for lake water quality time series and vertical profiles. Earth System Science Data 17, 7 (2025), 3141–3165.

Acknowledgements

We sincerely thank Mary E. Lofton from the Department of Biology, Virginia Tech for preparing and curating the FCR simulations (comprising 1,000 simulation lake datasets) used in this study. This work was supported in part by NSF awards #2213549 and #2213550. We are also grateful for computing resources from Bridges-2 at Pittsburgh Supercomputing Center, available through NAIRR pilot award #240161, and from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296. We also thank the Advanced Research Computing (ARC) Center at Virginia Tech for providing access to GPU compute resources for this project. This manuscript has been authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE).