Modelling Every Transformer in the Grid

Most grid planning workflows still lean on coarse assumptions and manual tooling: average loads, aggregate statistics, spreadsheets, and a lot of hand calculation. These approaches worked when energy flow was one-directional. They don’t work anymore. Rooftop solar, batteries, and EVs have introduced two-way flows and rapid variability at the edge. Distribution networks are becoming harder to model and plan with static tools.

Australia has an advantage: smart meters at scale. Millions of them.

They provide household-level data that can make planning much more accurate. But the raw data is messy, delayed, inconsistent, and fragmented across DNSPs (distribution network service providers). There are very few tools to operationalise it into something actionable for network planners.

This is the gap we set out to address.

Why Per-Transformer Modelling

The core idea is simple: instead of building broad models that smooth over local details, we create a dedicated load model for every distribution transformer. Each transformer has its own customers, DER penetration, and behaviour. Aggregating them into clusters hides the differences that matter for planning.

For each transformer, we train a neural network in TensorFlow using only its downstream smart meter data. No clustering, no synthetic smoothing, just the actual behaviour represented as a predictive model. These models can be updated regularly (on demand or automatically) and used immediately.
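
To make that concrete, here is a minimal sketch of what one per-transformer model could look like. The window length, layer sizes, and random stand-in data are assumptions for illustration, not a tuned configuration:

```python
import numpy as np
import tensorflow as tf

def build_transformer_model(window: int = 48, horizon: int = 1) -> tf.keras.Model:
    """Small feed-forward model: the last `window` half-hourly aggregate
    readings for one transformer in, the next `horizon` readings out."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(window,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(horizon),
    ])
    model.compile(optimizer="adam", loss="mse", metrics=["mae"])
    return model

# X: sliding windows of downstream smart meter load (kW); y: the following interval.
# Random arrays stand in for the real meter data here.
X = np.random.rand(1000, 48).astype("float32")
y = np.random.rand(1000, 1).astype("float32")

model = build_transformer_model()
model.fit(X, y, epochs=5, batch_size=64, validation_split=0.2, verbose=0)
```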

That enables practical questions like:

  • What’s the available capacity on this transformer right now?

  • What happens if 100 more PV systems connect on this feeder?

  • Where can we safely increase export limits without voltage violations?
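
Take the first question as an example. With a per-transformer model in hand, available capacity reduces to a forecast plus a subtraction. The helper below is a hypothetical sketch; the rating and power-factor figures are placeholders, not engineering guidance:

```python
import numpy as np
import tensorflow as tf

def available_headroom_kw(
    model: tf.keras.Model,
    recent_load_kw: np.ndarray,  # the last `window` aggregate readings for this transformer
    rating_kva: float,           # nameplate rating of the transformer
    power_factor: float = 0.95,  # assumed conversion from kVA to usable kW
) -> float:
    """Hypothetical helper: forecast the next interval and subtract it from
    an assumed usable real-power rating."""
    forecast_kw = float(model.predict(recent_load_kw[np.newaxis, :], verbose=0)[0, 0])
    return rating_kva * power_factor - forecast_kw
```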

The Scaling Problem

Training one model is easy. Training hundreds of models per transformer is not, and we have to, because we don’t know upfront which architecture or parameter set will perform best. For each transformer, we generate candidate configurations, train hundreds of them in parallel, and select the one that performs best on validation metrics.
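
Generating the candidates themselves is straightforward; the search space below is illustrative rather than our production grid, but it shows the shape of the problem:

```python
import itertools
import random

# Illustrative hyperparameter search space (not the production one).
SEARCH_SPACE = {
    "hidden_layers": [(32,), (64, 32), (128, 64, 32)],
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "window": [48, 96, 336],   # half-hour intervals: 1 day, 2 days, 1 week
    "dropout": [0.0, 0.1, 0.3],
}

def candidate_configs(max_candidates: int = 200, seed: int = 0) -> list[dict]:
    """Enumerate the full grid, then subsample to cap how many training
    runs get fanned out for a single transformer."""
    keys = list(SEARCH_SPACE)
    grid = [dict(zip(keys, values)) for values in itertools.product(*SEARCH_SPACE.values())]
    random.Random(seed).shuffle(grid)
    return grid[:max_candidates]
```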

At this point, the orchestration problem becomes the bottleneck. We potentially have hundreds of thousands of model-training tasks. They need to run reliably, with retries and state management. Hardware will fail. Containers will restart. Some models take minutes to train; others take hours. If the process loses state on failure, we start over, which doesn’t scale. This is a huge orchestration problem we initially set out to solve on our own with RabbitMQ and Celery.

Why We Chose Temporal

After months of building on RabbitMQ and Celery, we had a system that technically worked but behaved like a black box. Data went in, and with luck, a model came out. When things broke, diagnosing issues consumed more engineering time than the actual business problem we were trying to solve: transformer analytics.

We needed a way to express complex workflows: fan out to hundreds of training tasks, fan in for evaluation, and handle retries, all without building the orchestration logic ourselves. Temporal solved this.

  • Each workflow is durable. If a worker goes down, the workflow continues when it comes back.

  • State is persisted, so we don’t lose track of which models finished, which failed, and which need retrying.

  • Determinism ensures the workflow logic is replay-able without side effects, which matters when you have thousands of concurrent executions.
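
In practice, the unit Temporal retries for us is a single candidate training run, wrapped as an activity. A pared-down sketch in the Python SDK, with the payload shape and return fields as assumptions and the training body elided:

```python
from temporalio import activity

@activity.defn
async def train_candidate(params: dict) -> dict:
    """Train one candidate model for one transformer and report how it did.

    If the worker dies mid-run, Temporal retries just this activity, not the
    whole workflow, and records the result in the workflow's event history
    once it succeeds.
    """
    activity.heartbeat("training started")  # lets the server spot stuck or dead runs
    # Real body: build a model from `params`, fit it on the transformer's
    # downstream meter data, evaluate on a held-out window (elided here).
    return {"config": params, "val_mae": 0.123, "model_uri": "models/candidate.keras"}
```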

Deterministic and Durable

A durable workflow is a stateful process that can pause, resume, and recover from failures without losing progress. Because activity results are persisted, a workflow can sleep indefinitely and pick up exactly where it left off.
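
As a minimal sketch of that durability in the Python SDK, assuming a hypothetical retrain_transformer activity and an arbitrary 30-day refresh interval:

```python
import asyncio
from datetime import timedelta
from temporalio import workflow

@workflow.defn
class ScheduledRefreshWorkflow:
    @workflow.run
    async def run(self, transformer_id: str) -> None:
        # Inside a workflow, asyncio.sleep() is a durable timer backed by the
        # Temporal service, so the workflow can wait out weeks and survive
        # worker restarts in between.
        await asyncio.sleep(timedelta(days=30).total_seconds())
        await workflow.execute_activity(
            "retrain_transformer",  # hypothetical activity name
            transformer_id,
            start_to_close_timeout=timedelta(hours=4),
        )
```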

Our typical flow (sketched in code below):

1. Start a workflow for a transformer.

2. Generate candidate hyperparameter sets.

3. For each candidate, spawn an activity to train a model.

4. Collect performance metrics and pick the best.

5. Persist the selected model and its metadata.
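
A condensed sketch of that flow as a Temporal workflow in Python follows. The train_candidate activity is the one sketched earlier; the other activity names, the timeouts, and the val_mae metric are illustrative assumptions:

```python
import asyncio
from datetime import timedelta
from temporalio import workflow
from temporalio.common import RetryPolicy

@workflow.defn
class TrainTransformerWorkflow:
    @workflow.run
    async def run(self, transformer_id: str) -> dict:
        # Steps 1-2: the workflow starts per transformer and asks an activity
        # for candidate hyperparameter sets.
        configs = await workflow.execute_activity(
            "generate_candidates",  # hypothetical activity name
            transformer_id,
            start_to_close_timeout=timedelta(minutes=5),
        )

        # Step 3: fan out one training activity per candidate.
        results = await asyncio.gather(
            *[
                workflow.execute_activity(
                    train_candidate,  # the activity sketched earlier
                    {"transformer_id": transformer_id, **cfg},
                    start_to_close_timeout=timedelta(hours=4),
                    retry_policy=RetryPolicy(maximum_attempts=3),
                )
                for cfg in configs
            ],
            return_exceptions=True,  # a failing candidate doesn't sink the workflow
        )

        # Step 4: fan in, drop failed candidates, pick the best by validation error.
        succeeded = [r for r in results if not isinstance(r, BaseException)]
        best = min(succeeded, key=lambda r: r["val_mae"])

        # Step 5: persist the winning model and its metadata.
        await workflow.execute_activity(
            "promote_model",  # hypothetical activity name
            best,
            start_to_close_timeout=timedelta(minutes=10),
        )
        return best
```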

Failures at any step are automatically isolated and retried, eliminating manual intervention and preventing stalled executions. Some candidate configurations will inevitably produce poor or invalid models—those activities fail, as shown in red below—but the workflow continues unaffected, ensuring the overall process completes successfully.

Durable Model Training

Utilising durable workflows to train hundreds of candidate models for evaluation. The best model is selected and promoted to be used in further analytics workflows.


Deployment Constraints

All of this runs on-premises inside DNSP infrastructure. That’s non-negotiable because of data sovereignty and compliance. It means:

  • No managed cloud services. We are bound by customer hardware capacity.

  • Everything containerised for controlled deployment.

  • Observability and failure handling built in.

Temporal does all the heavy lifting, leaving us to work on our business domain and not stress about workflow orchestration.

We use TensorFlow for neural models, simpler libraries for lightweight cases, and Litestar for our API layer. Temporal handles the orchestration. The result isn’t flashy—it’s a system that quietly trains, evaluates, and maintains per-transformer models at scale without collapsing under its own weight.