Most weather apps have a simple problem: they show you the forecast, but they don’t tell you when it actually changed.
That might sound trivial. It isn’t.
Modern numerical weather prediction (NWP) systems — like ECMWF IFS — produce remarkably accurate forecasts at ~9 km resolution, updated every few hours. The data is already very good.
The problem is not the forecast.
The problem is attention: knowing when a change in that data is actually meaningful.
I didn’t learn that from software engineering. I learned it years earlier, studying chaos theory at the Instituto Balseiro. It was there, working through dynamical systems, that I first encountered a slightly unsettling idea:
A system can be completely deterministic and still be practically unpredictable.
That idea stayed with me. And years later, when I started building AI systems, I realized that many of them were ignoring it.
When I started seeing how developers were building weather agents, I noticed a pattern: fetch the forecast, hand it to an LLM, and ask the model to judge whether the change is significant.
At first glance, this seems reasonable. From a physics perspective it is problematic, at least for problems where the decision boundary is already well-defined, because it replaces an explicit threshold with a probabilistic interpretation.
In a chaotic system, significance is not a linguistic judgment — it is a threshold defined on variables like temperature, precipitation, or wind speed. It depends on magnitudes, context, and time horizons.
An LLM is a stochastic process. It is very good at generating language, but it is not designed to enforce deterministic boundaries on physical systems.
When you ask an LLM whether a forecast “changed significantly,” you’re asking a probabilistic model to approximate a deterministic rule that you could have defined explicitly. That introduces variability exactly where you want consistency.
The failure modes are subtle: the same delta can be judged significant on one run and dismissed on the next, and small changes in prompt wording can shift the effective threshold.
In many applications, that might be acceptable. In agriculture, energy, and logistics — where a 3°C drop is a phase transition for a crop, a non-linear spike in energy demand, or an operational disruption — it is not. These decisions need to be stable and explainable.
Which led me to a simple rule:
If you can write an assert statement for it, you probably shouldn’t be using a prompt.
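The rule can be made concrete in a couple of lines. A minimal sketch (the function name and the 20 pp threshold are illustrative, not Skygent's actual code):

```python
def rain_change_is_significant(prev_pct: float, curr_pct: float,
                               threshold_pp: float = 20.0) -> bool:
    """Deterministic significance check: did rain probability move
    by more than `threshold_pp` percentage points?"""
    return abs(curr_pct - prev_pct) > threshold_pp

# The point of the rule: this is assertable. A prompt is not.
assert rain_change_is_significant(10.0, 50.0)      # 40 pp crosses 20 pp
assert not rain_change_is_significant(10.0, 25.0)  # 15 pp does not
```

The same question posed to an LLM would return an answer, but not one you could put in a test suite.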
My career has looked less like a straight line and more like a trajectory in phase space. A Marie Curie PhD in climate dynamics, five years directing R&D at Uruguay’s national meteorology institute — forest fire prevention, seasonal forecasting, climate adaptation — then a shift to production ML at Microsoft and Mercado Libre.
That arc gave me something specific: I already understood the physics of the data, the skill horizons of the models, and what “significant change” actually means in a physical system. Not as a software abstraction — as a measurable delta on a variable with known uncertainty bounds.
When I started building AI systems, the instinct was immediate: this is a threshold problem. Thresholds belong in code, not in prompts.
Skygent is one expression of that perspective — an agent designed not to display forecasts, but to detect meaningful changes in them.
The system runs continuously on real forecast data for user-defined events, evaluating changes every few hours and only triggering alerts when predefined conditions are met. In practice, most evaluation cycles result in no alert — only a small fraction of changes cross the significance threshold. That’s the point: signal, not noise.
Skygent follows a clean separation across five layers:

[Figure: Skygent's five-layer pipeline architecture]
Only one layer calls the LLM.
At the core is a Python evaluator. It doesn’t interpret; it calculates: it compares consecutive forecast snapshots, computes the delta for each monitored variable, and checks that delta against its configured threshold.
This is where decisions are made. Every alert has a traceable path: which variable changed, by how much, which threshold was crossed. In a corporate or government environment, being able to explain why an alert fired — without saying “the model felt like it” — is not optional.
An alert fires only if a threshold is crossed. If the delta doesn’t cross the boundary, nothing happens. This is a binary, testable condition — not a judgment call.
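A simplified sketch of this kind of evaluator (thresholds and field names are illustrative defaults, not Skygent's actual implementation):

```python
from dataclasses import dataclass

@dataclass
class Alert:
    variable: str
    from_value: float
    to_value: float
    delta: float
    threshold: float

# Illustrative defaults; in Skygent the thresholds are configurable.
THRESHOLDS = {
    "precipitation_probability_max": 20.0,  # percentage points
    "temperature_max": 3.0,                 # degrees C
    "wind_speed_max": 15.0,                 # km/h
}

def evaluate(prev: dict, curr: dict) -> list[Alert]:
    """Pure calculation: compare snapshots and return an Alert only
    for variables whose delta crosses the configured threshold."""
    alerts = []
    for var, threshold in THRESHOLDS.items():
        if var not in prev or var not in curr:
            continue
        delta = curr[var] - prev[var]
        if abs(delta) > threshold:
            alerts.append(Alert(var, prev[var], curr[var], delta, threshold))
    return alerts
```

Every `Alert` carries the traceable path: which variable, from what to what, by how much, and against which threshold.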
Only after the decision is made does the LLM enter the pipeline. Its role is strictly limited: take structured JSON data, translate it into natural language.
# Structured payload sent to GPT-4o-mini
{
  "event_name": "Ana's Wedding",
  "variable": "precipitation_probability_max",
  "from_value": 10.0,
  "to_value": 50.0,
  "delta": 40.0,
  "horizon_days": 5.2,
  "confidence": "medium"
}
Output:
“Rain probability increased from 10% to 50% for your event window. Confidence is medium due to the 5-day forecast horizon.”
The LLM is not deciding anything. It is explaining.
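Keeping the model this constrained is mostly a matter of construction: it only ever sees validated, structured fields. A sketch of the payload and prompt assembly (Skygent validates inputs with Pydantic; a plain dataclass is used here to keep the example dependency-free, and the prompt wording is an assumption):

```python
from dataclasses import dataclass

@dataclass
class AlertPayload:
    event_name: str
    variable: str
    from_value: float
    to_value: float
    delta: float
    horizon_days: float
    confidence: str

def build_prompt(p: AlertPayload) -> str:
    """Deterministic prompt assembly: the decision was already made
    upstream, so the model is only asked to phrase these fields."""
    return (
        f"Write a short alert for the event '{p.event_name}'. "
        f"{p.variable} changed from {p.from_value} to {p.to_value} "
        f"(delta {p.delta}) at a forecast horizon of {p.horizon_days} days. "
        f"Confidence: {p.confidence}. "
        "Do not add any information beyond these fields."
    )
```

The only non-deterministic step left is the wording of the narrative itself.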
It is practically impossible to reach 100% test coverage on a pure LLM agent — you cannot write deterministic assertions on probabilistic outputs.
The hybrid approach changes this. The decision logic is pure Python with Pydantic-validated inputs: 204 unit tests, zero LLM dependencies in the test suite. The LLM handles only the narrative tone — the one thing that genuinely benefits from natural language generation.
This is not just a testing convenience. It means every decision the system makes can be explained, reproduced, and verified independently of the LLM.
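Because the decision path is pure Python, the tests are ordinary assertions. A sketch of what such tests look like (not Skygent's actual suite; `significant` is a hypothetical stand-in for the evaluator's core comparison):

```python
def significant(delta: float, threshold: float) -> bool:
    """Hypothetical stand-in for the evaluator's core comparison."""
    return abs(delta) > threshold

# Deterministic tests: no LLM, no network, no flakiness.
def test_crossing_fires():
    assert significant(delta=40.0, threshold=20.0)

def test_below_threshold_is_silent():
    assert not significant(delta=15.0, threshold=20.0)

def test_direction_does_not_matter():
    assert significant(delta=-25.0, threshold=20.0)

test_crossing_fires()
test_below_threshold_is_silent()
test_direction_does_not_matter()
```

None of these tests can flake, because nothing in the decision path is sampled from a distribution.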
A naive agent calls the LLM on every polling cycle. This one doesn’t.
Skygent evaluates every 6 hours. It only calls the model when a threshold is crossed — roughly once or twice per week per monitored event, compared to ~28 calls for a naive polling agent.
At gpt-4o-mini pricing (~$0.0001 per narrative), cost is negligible. More importantly, cost is proportional to actual information: you pay for an LLM call only when something worth communicating happened.
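The call-budget arithmetic is simple enough to write down (figures taken from the article's own numbers):

```python
# Back-of-the-envelope cost comparison per monitored event, per week.
EVALS_PER_DAY = 24 / 6                       # one evaluation every 6 hours
NAIVE_CALLS_PER_WEEK = EVALS_PER_DAY * 7     # LLM on every cycle: 28
THRESHOLD_CALLS_PER_WEEK = 1.5               # only on crossings: ~1-2 per week

COST_PER_NARRATIVE = 0.0001                  # approx. gpt-4o-mini narrative cost

naive_cost = NAIVE_CALLS_PER_WEEK * COST_PER_NARRATIVE      # approx. $0.0028
hybrid_cost = THRESHOLD_CALLS_PER_WEEK * COST_PER_NARRATIVE # approx. $0.00015
```

The absolute dollars are negligible either way; the structural point is that in the hybrid design, calls scale with information, not with time.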
Previous snapshot: Rain probability 10%, Max temp 22°C, Wind 15 km/h
Current snapshot: Rain probability 50%, Max temp 21.4°C, Wind 18 km/h
Threshold: Alert if rain probability Δ > 20pp
Evaluation frequency: Every 6 hours
Result: Alert triggered → GPT-4o-mini generates narrative → Telegram delivery
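Run through the example numbers, the decision reduces to one comparison per variable (a sketch; variable names are illustrative):

```python
prev = {"rain_pct": 10.0, "temp_max_c": 22.0, "wind_kmh": 15.0}
curr = {"rain_pct": 50.0, "temp_max_c": 21.4, "wind_kmh": 18.0}

# Only the rain threshold is crossed: alert if delta > 20 pp.
rain_delta = curr["rain_pct"] - prev["rain_pct"]  # 40.0 pp
alert = rain_delta > 20.0                         # True: narrative + Telegram

print(alert)  # True
```

The temperature and wind changes (0.6 °C, 3 km/h) stay well inside any reasonable threshold, so they generate nothing.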

[Screenshot: example of a Skygent alert]
This approach doesn’t apply everywhere. It breaks down when the decision boundary cannot be written down in advance: when the task involves open-ended reasoning, ambiguous goals, or judgment that resists an explicit threshold.
In those cases, LLM-first architectures such as ReAct or Plan-and-Execute make more sense.
One honest caveat: the thresholds in Skygent are configurable defaults — reasonable starting points informed by meteorological practice, but not calibrated against historical forecast errors for specific use cases. Calibration against real outcomes is the natural next step for any vertical deployment. The pattern is sound; the parameters are a starting point.
The most important decision I made building this system was not choosing a model or a framework.
It was deciding where not to use an LLM.
There is a tendency right now to delegate more and more to language models — to let them figure things out. But some problems already have structure. Some decisions already have boundaries.
When they do, approximating them with language is the wrong move. Encoding them explicitly is better.
In practice, this often comes down to a simple distinction: use LLMs to explain decisions, not to replace well-defined ones.
The full implementation — significance evaluator, LangGraph pipeline, Telegram bot — is available at: github.com/ferariz/skygent
Fernando Arizmendi builds production AI systems at the intersection of rigorous scientific method and applied AI engineering. He is a physicist (B.Sc. & M.Sc.) from Instituto Balseiro, former Marie Curie fellow (Ph.D. studying Climate Dynamics & Complex Systems), and previously directed R&D at Uruguay’s national meteorology institute.
All images by the author. Pipeline diagram generated with Claude (Anthropic).