Imagine you are a hospital group trying to train a shared diagnostic model without sending patient data off-premises. You have heard that federated learning solves this. You open GitHub, search for implementations, and find dozens of algorithm repositories that cannot be meaningfully compared because they each used a different dataset, a different data split, and a different definition of what “heterogeneous” even means. That is the problem PFLlib was built to fix.
Key Points
- PFLlib is a peer-reviewed open-source library covering 37 federated learning algorithms, 8 traditional and 29 personalized, evaluated on 24 datasets across three heterogeneity scenarios.
- The benchmark shows that personalized FL methods outperform their traditional counterparts by wide margins under non-IID data, with the best methods reaching over 97% accuracy on Fashion-MNIST under practical label skew, where FedAvg manages roughly 86%.
- The gap between pathological and practical label skew settings is large enough to change which algorithm looks best, meaning benchmarks that only test one setting are misleading.
- Model-splitting approaches dominate the upper end of the accuracy table, with knowledge-distillation methods competitive mainly under pathological skew, suggesting that how a network is divided into shared and personal components is the more productive research direction.
- The library is designed so a new algorithm requires only two files to add, which has already attracted downstream projects including FL-bench, HtFLlib, and FL-IoT.
- Privacy evaluation via the Deep Leakage from Gradients attack is built in, letting researchers test whether their personalization strategy also changes the model’s vulnerability to gradient inversion.
The Problem With Federated Learning Benchmarks Before PFLlib
Federated learning arrived with a compelling promise. Train a model across many devices or institutions, share only gradient updates, keep the raw data where it lives. The original FedAvg paper by McMahan and colleagues made this look straightforward. What followed was anything but.
Within a few years, dozens of competing algorithms had accumulated in the literature, each claiming superiority over FedAvg. The trouble was that most of them had been evaluated on different datasets, with different train-test splits, different numbers of clients, and sometimes different definitions of the same technical term. “Non-IID data” meant one thing in one paper and something else in the next. Reproducing results from scratch took weeks, and the comparisons between papers were often not meaningful.
The platforms that emerged to address this, FATE, FedML, FederatedScope, Flower, and others, are excellent engineering tools built for production deployments. They offer real infrastructure for managing clients, scheduling rounds, and handling failures. But they are not primarily designed for someone trying to understand why pFedMe performs differently from APFL, or whether the personalized aggregation family of methods holds up under feature shift as well as it does under label skew. They are deployment frameworks rather than research laboratories.
There were also beginner-oriented platforms like LEAF and NIID-Bench, but these have lagged on algorithm coverage. As of the PFLlib paper’s submission, the most direct competitor, pFL-Bench, covered only five personalized FL methods, all of them predating 2022.
Jianqing Zhang and colleagues at Shanghai Jiao Tong University, in collaboration with researchers at Tsinghua, Queen’s University Belfast, and Stevens Institute of Technology, set out to close this gap. The result, published in the Journal of Machine Learning Research in early 2025, is PFLlib: A Beginner-Friendly and Comprehensive Personalized Federated Learning Library and Benchmark.
What the Library Actually Contains
The headline number is 37 algorithms: 8 traditional FL methods and 29 personalized FL methods. That alone would be notable. What matters more is how they are organized and what each family of methods is actually trying to do.
The Traditional FL Baseline Tier
The 8 traditional methods act as calibration points rather than competitors. FedAvg is the anchor. SCAFFOLD adds variance reduction through control variates. FedProx and FedDyn introduce regularization to keep local updates from drifting too far. MOON uses contrastive learning and FedLC uses logit calibration. FedGen and FedNTD bring knowledge distillation into the traditional framework.
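Since FedAvg anchors every comparison that follows, it helps to see its aggregation step in isolation. The sketch below is a minimal pure-Python illustration, assuming model weights are stored as plain lists keyed by layer name; the `fedavg` function and the `Weights` alias are hypothetical names for this example and are not PFLlib's API.

```python
from typing import Dict, List

# Hypothetical representation: layer name -> flat list of parameters.
Weights = Dict[str, List[float]]


def fedavg(client_weights: List[Weights], num_samples: List[int]) -> Weights:
    """Average client parameters, weighting each client by its sample count."""
    total = sum(num_samples)
    averaged: Weights = {}
    for key in client_weights[0]:
        size = len(client_weights[0][key])
        averaged[key] = [
            sum(w[key][j] * n for w, n in zip(client_weights, num_samples)) / total
            for j in range(size)
        ]
    return averaged


# Two toy clients; the first holds three times as much data as the second.
toy_clients = [
    {"layer": [1.0, 2.0]},
    {"layer": [3.0, 6.0]},
]
global_weights = fedavg(toy_clients, num_samples=[30, 10])
print(global_weights)  # {'layer': [1.5, 3.0]}
```

The larger client pulls the average toward its own weights, which is exactly why a single global model can fit some clients poorly when their data distributions differ.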
These are the methods personalized FL is trying to beat. Their presence in the same benchmark, on the same datasets and splits, is what makes comparison honest.
The Personalized FL Landscape
The 29 personalized methods fall into five rough families based on how they individualize the model.
Meta-learning approaches, represented by Per-FedAvg, treat the global model as an initialization that each client fine-tunes rapidly. Regularization-based methods like pFedMe and Ditto add terms to the local loss that keep personal models anchored to the global one without forcing them to be identical. Personalized aggregation methods, including APFL, FedFomo, FedAMP, FedPHP, APPLE, and FedALA, learn which other clients are worth aggregating with rather than treating all clients equally.
Model-splitting approaches, the largest family with twelve members, split the network into shared lower layers and personalized upper layers. Different papers split at different points, and the benchmark makes clear that this design choice matters enormously. Knowledge-distillation methods like FedProto and FedPCL avoid sharing weights entirely, instead sharing compact representations or prototypes that carry semantic information without raw gradient exposure.
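The core mechanic of model splitting can be sketched in a few lines. The example below is an illustrative assumption rather than any specific published method: parameters whose names start with a hypothetical `base.` prefix are averaged across clients, while everything else (here a `head.` layer) never leaves the client.

```python
from typing import Dict, List

Params = Dict[str, List[float]]

SHARED_PREFIX = "base."  # hypothetical naming convention for shared layers


def aggregate_shared(client_params: List[Params]) -> Params:
    """Average only the parameters marked as shared; personal layers are skipped."""
    n = len(client_params)
    shared_keys = [k for k in client_params[0] if k.startswith(SHARED_PREFIX)]
    return {
        k: [
            sum(p[k][j] for p in client_params) / n
            for j in range(len(client_params[0][k]))
        ]
        for k in shared_keys
    }


def receive_shared(local: Params, shared: Params) -> Params:
    """Overwrite the shared layers and leave the personal layers untouched."""
    merged = dict(local)
    merged.update({k: list(v) for k, v in shared.items()})
    return merged


split_clients = [
    {"base.conv": [0.0, 2.0], "head.fc": [1.0]},
    {"base.conv": [4.0, 6.0], "head.fc": [9.0]},
]
shared = aggregate_shared(split_clients)
split_clients = [receive_shared(c, shared) for c in split_clients]
print(split_clients)
```

After one round both clients hold the same `base.conv` values but keep their own `head.fc`. Where the prefix boundary falls is the design choice that individual papers make differently.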
The taxonomy matters because the right algorithm choice depends on what kind of heterogeneity your federation actually faces. A method that dominates under label skew can perform mediocrely under feature shift, and the benchmark is one of the first to test both seriously.
Three Scenarios, Not One
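Most earlier benchmarks tested a single notion of data heterogeneity. PFLlib tests three scenarios: label skew, feature shift (clients share a label space but draw their inputs from different domains), and real-world data. Label skew itself comes in a pathological and a practical variant, and the differences between these settings are not cosmetic.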
The pathological label skew scenario assigns each client only a small fixed number of class labels, producing the extreme partitions that appear throughout early FL literature. It is a stress test rather than a realistic model of the world.
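A pathological split of this kind can be sketched with the standard library alone. This is a simplified illustration and not PFLlib's partitioning code: the function name, the round-robin class assignment, and the assumption that labels are integers `0..K-1` are all choices made for this example.

```python
import random
from collections import defaultdict
from typing import Dict, List


def pathological_split(
    labels: List[int], num_clients: int, classes_per_client: int, seed: int = 0
) -> Dict[int, List[int]]:
    """Give each client samples from only `classes_per_client` classes."""
    rng = random.Random(seed)
    num_classes = max(labels) + 1  # assumes labels are 0..K-1

    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    # Assign classes round-robin, then record which clients own each class.
    owners = defaultdict(list)
    for c in range(num_clients):
        for j in range(classes_per_client):
            owners[(c * classes_per_client + j) % num_classes].append(c)

    # Deal each class's shuffled samples out among the clients that own it.
    partition: Dict[int, List[int]] = {c: [] for c in range(num_clients)}
    for k, clients_with_k in owners.items():
        idxs = by_class[k][:]
        rng.shuffle(idxs)
        for pos, idx in enumerate(idxs):
            partition[clients_with_k[pos % len(clients_with_k)]].append(idx)
    return partition


toy_labels = [i % 10 for i in range(200)]
path_partition = pathological_split(toy_labels, num_clients=5, classes_per_client=2)
for client, idxs in path_partition.items():
    print(client, sorted({toy_labels[i] for i in idxs}), len(idxs))
```

Every client ends up with exactly two classes and a hard boundary between them, which is what makes this setting a stress test.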
The practical label skew scenario uses a Dirichlet distribution to assign labels across clients, which produces a messier and more realistic imbalance. Some clients still see rare classes more than others, but the boundaries are fuzzy rather than hard.
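A Dirichlet-based split can be sketched as follows. This is a minimal standard-library illustration rather than PFLlib's own partitioning code; the function names and the concentration value `alpha=0.5` are assumptions for the example. Smaller `alpha` values produce more skewed clients, larger values approach an even split.

```python
import random
from typing import Dict, List


def sample_dirichlet(alpha: float, k: int, rng: random.Random) -> List[float]:
    """Draw one sample from a symmetric Dirichlet via normalized gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def dirichlet_split(
    labels: List[int], num_clients: int, alpha: float = 0.5, seed: int = 0
) -> Dict[int, List[int]]:
    """Split each class across clients with Dirichlet-distributed shares."""
    rng = random.Random(seed)
    partition: Dict[int, List[int]] = {c: [] for c in range(num_clients)}
    for k in sorted(set(labels)):
        idxs = [i for i, y in enumerate(labels) if y == k]
        rng.shuffle(idxs)
        shares = sample_dirichlet(alpha, num_clients, rng)
        start, cumulative = 0, 0.0
        for c in range(num_clients):
            cumulative += shares[c]
            # The last client takes the remainder so no sample is dropped.
            end = len(idxs) if c == num_clients - 1 else round(cumulative * len(idxs))
            partition[c].extend(idxs[start:end])
            start = end
    return partition


dir_labels = [i % 4 for i in range(400)]
dir_partition = dirichlet_split(dir_labels, num_clients=5, alpha=0.5, seed=0)
for client, idxs in dir_partition.items():
    counts = [sum(1 for i in idxs if dir_labels[i] == k) for k in range(4)]
    print(client, counts)
```

The per-class counts printed for each client are uneven but overlapping, which is the fuzzy-boundary behavior the practical setting is meant to capture.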
The real-world scenario uses datasets that were naturally collected from different sites: the HAR and PAMAP2 sensor datasets from wearable devices, Camelyon17 from hospitals across multiple countries, and iWildCam from geographically distributed camera traps. These datasets were not constructed to be heterogeneous. They just are, because the world is.
The dataset coverage spans Computer Vision, Natural Language Processing, and Sensor Signal Processing. Among the 24 included datasets are Fashion-MNIST, CIFAR-100, Tiny-ImageNet, and AG News, plus the real-world options above. Vision tasks use CNN-based models, with ResNet-18 also evaluated on Tiny-ImageNet, and NLP tasks use fastText.
“The gap between pathological and practical settings is large enough to change which algorithm looks best.” (on the PFLlib benchmark results, JMLR 2025)
Reading the Benchmark Table Honestly
The numbers in Table 2 of the paper reward careful reading. Here is a condensed view of the methods that matter most, drawn directly from the published results on two settings that bracket the real performance range.
| Algorithm | Family | FMNIST (Path.) | CIFAR-100 (Path.) | FMNIST (Prac.) | CIFAR-100 (Prac.) |
|---|---|---|---|---|---|
| FedAvg | Traditional | 80.41 | 25.98 | 85.85 | 31.89 |
| FedProx | Traditional | 78.08 | 25.94 | 85.63 | 31.99 |
| Ditto | Regularization | 99.44 | 67.23 | 97.47 | 52.87 |
| APPLE | Pers. Aggregation | 99.30 | 65.80 | 97.06 | 53.22 |
| FedALA | Pers. Aggregation | 99.57 | 67.83 | 97.66 | 55.92 |
| FedCP | Model Splitting | 99.66 | 71.80 | 97.89 | 59.56 |
| GPFL | Model Splitting | 99.85 | 71.78 | 97.81 | 61.86 |
| FedDBE | Model Splitting | 99.74 | 73.38 | 97.69 | 64.39 |
| FedProto | Knowledge Distil. | 99.49 | 69.18 | 97.40 | 52.70 |
| FedBABU | Model Splitting | 99.41 | 66.85 | 97.46 | 55.02 |
Accuracy (%) on Fashion-MNIST (FMNIST) and CIFAR-100 under pathological (Path.) and practical (Prac.) label skew. Source: Zhang et al., JMLR 2025, Table 2.
A few things jump out immediately. First, the gap between traditional FL and personalized FL is not marginal. FedAvg sits at about 26% on CIFAR-100 under pathological label skew. The best personalized methods reach 71 to 73%. That is not a refinement; it is a different order of result.
Second, look at what happens when you move from pathological to practical label skew. The pathological setting is an artificial extreme where clients hold very few classes. Under this setting, almost any personalized method that simply memorizes its local classes will look good. In the practical setting, the advantage narrows. Methods that looked roughly equivalent at 99% on Fashion-MNIST start to separate on CIFAR-100, where GPFL reaches 61.86% and Ditto trails at 52.87%. The practical setting is the one you should actually care about.
Third, the model-splitting family sits at the top across both settings. FedDBE, GPFL, and FedCP reach roughly 60 to 64% on CIFAR-100 under practical skew. The knowledge-distillation method FedProto is competitive under pathological skew (69.18%) but drops to 52.70% under practical skew, the lowest of the personalized methods listed. This suggests that the productive axis for personalized FL research is how you split shared and personal components of a neural network, not just how you weight the aggregation step.
Benchmarks that test only pathological label skew overstate the performance of simple local fine-tuning methods. Practical and real-world scenario results are the ones that translate to deployed systems. PFLlib gives you both, which is the core methodological contribution beyond the code itself.
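The ranking shift between the two settings can be checked directly from the CIFAR-100 columns of the condensed table above. The short script below only re-sorts those published numbers; the variable and function names are illustrative.

```python
# CIFAR-100 accuracy (%) from the condensed table: (pathological, practical).
cifar100 = {
    "FedAvg": (25.98, 31.89),
    "FedProx": (25.94, 31.99),
    "Ditto": (67.23, 52.87),
    "APPLE": (65.80, 53.22),
    "FedALA": (67.83, 55.92),
    "FedCP": (71.80, 59.56),
    "GPFL": (71.78, 61.86),
    "FedDBE": (73.38, 64.39),
    "FedProto": (69.18, 52.70),
    "FedBABU": (66.85, 55.02),
}


def ranks(setting_index: int) -> dict:
    """Rank algorithms (1 = best) by accuracy in the chosen setting."""
    ordered = sorted(
        cifar100, key=lambda name: cifar100[name][setting_index], reverse=True
    )
    return {name: position for position, name in enumerate(ordered, start=1)}


rank_path = ranks(0)
rank_prac = ranks(1)
for name in sorted(cifar100, key=rank_prac.get):
    print(f"{name:9s} pathological #{rank_path[name]:<2d} practical #{rank_prac[name]}")
```

FedDBE stays first in both settings, but FedProto falls from fourth under pathological skew to eighth under practical skew, and GPFL and FedCP swap places. A benchmark that reported only one column would tell a different story.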
The Mathematical Framing Behind Personalized FL
The general optimization objective that most personalized FL methods are solving can be written as a weighted combination of a global and a local term. For client i, the goal is to find a local model that performs well on the client’s own distribution while remaining connected to what the federation as a whole has learned.
\[
\min_{\theta_i} \; \mathcal{L}_i(\theta_i) + \lambda \, \Omega(\theta_i, \theta_{\text{global}})
\]
Here \(\theta_i\) is client \(i\)’s personal model, \(\mathcal{L}_i\) is the local empirical loss on client \(i\)’s data, \(\theta_{\text{global}}\) is the aggregated model from the server, and \(\Omega\) is a regularization term that keeps the personal model from straying too far. The scalar \(\lambda\) trades off personalization against global coherence. Different algorithm families implement \(\Omega\) very differently. Ditto uses a simple squared distance. pFedMe uses the Moreau envelope, a smoothed version of the distance penalty. FedALA learns a personalized mixing of aggregated weights before local training even begins.
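To make the role of \(\lambda\) concrete, consider a one-dimensional toy problem, chosen purely for illustration: a quadratic local loss \(\tfrac{1}{2}(\theta - a)^2\) with local optimum \(a\), and a squared-distance penalty \(\tfrac{1}{2}(\theta - \theta_{\text{global}})^2\) as in Ditto. Setting the derivative to zero gives the closed form \(\theta^* = (a + \lambda\,\theta_{\text{global}})/(1 + \lambda)\), which the sketch below cross-checks against plain gradient descent. All function names and values are assumptions for this example.

```python
def personal_optimum(local_opt: float, global_model: float, lam: float) -> float:
    """Closed-form minimizer of 0.5*(t - a)^2 + lam * 0.5*(t - g)^2."""
    return (local_opt + lam * global_model) / (1.0 + lam)


def gradient_descent(
    local_opt: float, global_model: float, lam: float,
    lr: float = 0.05, steps: int = 500,
) -> float:
    """Minimize the same objective numerically as a cross-check."""
    theta = 0.0
    for _ in range(steps):
        grad = (theta - local_opt) + lam * (theta - global_model)
        theta -= lr * grad
    return theta


a, g = 2.0, 0.0  # toy local optimum and toy global model
for lam in (0.0, 1.0, 10.0):
    closed = personal_optimum(a, g, lam)
    numeric = gradient_descent(a, g, lam)
    print(f"lambda={lam:4.1f}  closed-form={closed:.4f}  gradient-descent={numeric:.4f}")
```

With \(\lambda = 0\) the personal model sits at the local optimum, with \(\lambda = 1\) it lands halfway, and as \(\lambda\) grows it is pulled toward the global model.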
The model-splitting approaches restructure the problem entirely. Rather than penalizing deviation from the global model, they designate certain layers as always-shared and others as always-personal. The shared layers capture general feature representations; the personal layers capture client-specific decision boundaries or class distributions. GPFL, one of the top performers in the benchmark, simultaneously learns global feature information and personalized feature information, allowing the two representations to complement each other within a single forward pass.
Privacy Is Not an Afterthought Here
PFLlib includes an implementation of the Deep Leakage from Gradients (DLG) attack, along with Peak Signal-to-Noise Ratio as a measurement of reconstruction quality. This matters because personalized FL and privacy interact in non-obvious ways.
The standard argument for federated learning is that sharing gradients rather than raw data protects privacy. The DLG attack shows this argument has limits. Given a model’s gradient update, an adversary can sometimes reconstruct the original training image with surprising fidelity, especially when the batch size is small and the gradient signal is rich.
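Reconstruction fidelity is what Peak Signal-to-Noise Ratio measures. The sketch below computes PSNR for flat lists of pixel values, assuming intensities scaled to the range 0 to 1; the function name and toy pixel values are illustrative and do not come from PFLlib.

```python
import math
from typing import Sequence


def psnr(original: Sequence[float], reconstructed: Sequence[float],
         max_value: float = 1.0) -> float:
    """Peak Signal-to-Noise Ratio in decibels between two equal-length signals."""
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical signals: perfect reconstruction
    return 10.0 * math.log10(max_value ** 2 / mse)


image = [0.0, 0.5, 1.0, 0.25]
close_guess = [0.1, 0.4, 0.9, 0.35]  # every pixel off by 0.1
poor_guess = [0.5, 0.0, 0.5, 0.75]   # every pixel off by 0.5

print(f"close reconstruction: {psnr(image, close_guess):.2f} dB")
print(f"poor reconstruction:  {psnr(image, poor_guess):.2f} dB")
```

A higher PSNR means the attacker's reconstruction is closer to the private input, so in a leakage evaluation lower values are better for the defender.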
Personalized methods change the gradient structure. A method that keeps certain layers frozen during the federated round sends different gradients than one that fine-tunes the entire network. Whether this makes reconstruction harder or easier depends on the specific method, and PFLlib gives researchers the tools to test this directly rather than assuming that personalization implies stronger privacy.
This is a genuinely underexplored dimension of the personalized FL literature. Most papers report accuracy numbers. Far fewer report privacy vulnerability numbers. Having both evaluations in the same codebase, with the same data splits and training setup, is a meaningful research enabler.
Reproducibility and Usability
The architecture of PFLlib is deliberately flat. Each algorithm lives in a pair of files: a server file and a client file. The server file handles aggregation logic. The client file handles local training. Both inherit from base classes that manage the repetitive infrastructure of FL rounds, client selection, and communication simulation. Adding a new algorithm means writing only the parts that are actually novel, not rebuilding the entire scaffolding.
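The server/client pairing can be illustrated with a deliberately tiny sketch. The class names below (`ClientBase`, `ServerBase`, `ClientMean`, `ServerMean`) are hypothetical and do not reproduce PFLlib's actual base classes; the "model" is a single float so the control flow stays visible.

```python
from typing import List


class ClientBase:
    """Hypothetical base class: stores local data and the received model."""

    def __init__(self, data: List[float]):
        self.data = data
        self.model = 0.0

    def receive(self, model: float) -> None:
        self.model = model

    def train(self) -> None:
        raise NotImplementedError


class ServerBase:
    """Hypothetical base class: owns the round loop shared by all algorithms."""

    def __init__(self, clients: List[ClientBase]):
        self.clients = clients
        self.model = 0.0

    def aggregate(self) -> float:
        raise NotImplementedError

    def run(self, rounds: int) -> float:
        for _ in range(rounds):
            for client in self.clients:
                client.receive(self.model)
                client.train()
            self.model = self.aggregate()
        return self.model


# A new "algorithm" only fills in the two methods that differ.
class ClientMean(ClientBase):
    def train(self) -> None:
        # One step of size 0.5 toward the mean of the local data.
        local_mean = sum(self.data) / len(self.data)
        self.model += 0.5 * (local_mean - self.model)


class ServerMean(ServerBase):
    def aggregate(self) -> float:
        return sum(c.model for c in self.clients) / len(self.clients)


server = ServerMean([ClientMean([1.0, 3.0]), ClientMean([5.0, 7.0])])
final_model = server.run(rounds=20)
print(f"global model after 20 rounds: {final_model:.4f}")
```

The round loop lives once in the base class, and each new method overrides only `train` and `aggregate`. That division of labor is what keeps the per-algorithm files short.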
Creating a dataset scenario takes one command. Running an experiment takes another. The paper gives the exact example for FedALA on MNIST, which requires generating the non-IID split and then calling the main script with the algorithm name and dataset flag. This is the kind of usability that research benchmarks often claim and rarely deliver.
At the time of publication, PFLlib had accumulated over 1600 GitHub stars and 300 forks. FL-bench, HtFLlib, and FL-IoT have all been built on top of it. Several of the authors’ own subsequent papers, on FedTGP, FedL2G, and other methods, used PFLlib as their experimental platform, which gives the maintainers an ongoing incentive to keep the library up to date.
Where the Limitations Sit
The benchmark published in the paper covers 20 of the 37 available algorithms, limited by space in the JMLR format. Full results for all algorithms and all datasets are available on the project website, but the selective presentation in the paper means some comparisons require visiting the site rather than reading the paper directly.
The real-world scenario datasets, HAR, PAMAP2, Camelyon17, and iWildCam, receive less coverage in the main benchmark table than the label skew scenarios. The choice of which algorithms to evaluate on which datasets is not always fully explained. A reader trying to understand how a specific method behaves on sensor data will need to run their own experiments rather than reading off a table.
The benchmark uses the 4-layer CNN from McMahan et al. as the default architecture for CV tasks, with ResNet-18 as a secondary option on Tiny-ImageNet. Modern vision tasks increasingly involve transformer backbones, and it is not clear how algorithm rankings would shift with a vision transformer as the base model. The NLP evaluation uses fastText, which is lightweight and interpretable but not representative of how most production NLP systems are built today.
None of these are criticisms of the paper so much as honest limits of what any benchmark can cover. The library is designed to be extended, and the community is evidently doing that.
Complete PyTorch Implementation of the Ditto Regularization Core
Below is a compact, runnable implementation of the Ditto personalized FL objective, a solid regularization-based baseline in the PFLlib benchmark and a method that cleanly illustrates that family. It is a simplified, full-batch teaching version rather than PFLlib's own code. It includes the server aggregation, the personalized local training loop with the proximity regularization term, an evaluation function, and a smoke test on synthetic data.
```python
# Ditto: Fair and Robust Federated Learning Through Personalization
# Li et al., ICML 2021, reproduced for educational benchmarking
# Compatible with PFLlib's evaluation methodology

import random
from copy import deepcopy
from typing import List, Tuple

import torch
import torch.nn as nn
import torch.optim as optim


# ---------------------------------------------------------
# Simple CNN backbone (loosely modeled on PFLlib's 4-layer CNN)
# ---------------------------------------------------------
class SimpleCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512),
            nn.ReLU(),
            nn.Linear(512, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))


# ---------------------------------------------------------
# Ditto client: maintains both a global and a personal model
# ---------------------------------------------------------
class DittoClient:
    def __init__(
        self,
        client_id: int,
        data: Tuple[torch.Tensor, torch.Tensor],
        num_classes: int = 10,
        lr: float = 0.01,
        lam: float = 0.1,  # regularization strength lambda
        local_steps: int = 10,
        pfl_steps: int = 10,
        device: str = "cpu",
    ):
        self.client_id = client_id
        self.x, self.y = data[0].to(device), data[1].to(device)
        self.lam = lam
        self.local_steps = local_steps
        self.pfl_steps = pfl_steps
        self.device = device

        # Global model: updated by server aggregation each round
        self.global_model = SimpleCNN(num_classes).to(device)
        # Personal model: updated locally with proximity regularization
        self.personal_model = SimpleCNN(num_classes).to(device)

        self.optimizer_g = optim.SGD(self.global_model.parameters(), lr=lr)
        self.optimizer_p = optim.SGD(self.personal_model.parameters(), lr=lr)
        self.criterion = nn.CrossEntropyLoss()

    def receive_global_model(self, state_dict: dict):
        """Load the aggregated global weights from the server."""
        self.global_model.load_state_dict(deepcopy(state_dict))

    def local_train_global(self):
        """Standard local SGD on the global model (sent back for aggregation)."""
        self.global_model.train()
        for _ in range(self.local_steps):
            self.optimizer_g.zero_grad()
            logits = self.global_model(self.x)
            loss = self.criterion(logits, self.y)
            loss.backward()
            self.optimizer_g.step()
        return self.global_model.state_dict()

    def local_train_personal(self):
        """Ditto personal update: minimize local loss + lambda * proximity to global.

        Objective per client i:
            min_{theta_i} L_i(theta_i) + (lambda/2) * ||theta_i - w_global||^2
        """
        self.personal_model.train()
        global_params = [
            p.detach().clone() for p in self.global_model.parameters()
        ]
        for _ in range(self.pfl_steps):
            self.optimizer_p.zero_grad()
            logits = self.personal_model(self.x)
            task_loss = self.criterion(logits, self.y)
            # Proximity regularization term: squared L2 distance, computed
            # without a square root so the gradient stays well defined even
            # when the personal and global parameters coincide
            prox = sum(
                (pp - gp).pow(2).sum()
                for pp, gp in zip(self.personal_model.parameters(), global_params)
            )
            loss = task_loss + (self.lam / 2) * prox
            loss.backward()
            self.optimizer_p.step()

    def evaluate(self) -> float:
        """Return accuracy of the personal model on local data."""
        self.personal_model.eval()
        with torch.no_grad():
            preds = self.personal_model(self.x).argmax(dim=1)
        return (preds == self.y).float().mean().item()


# ---------------------------------------------------------
# Server: FedAvg aggregation over selected clients
# ---------------------------------------------------------
class DittoServer:
    def __init__(self, global_model: nn.Module, num_clients: int):
        self.global_model = global_model
        self.num_clients = num_clients

    def aggregate(self, client_state_dicts: List[dict]):
        """Simple FedAvg aggregation across selected clients."""
        avg_state = deepcopy(client_state_dicts[0])
        for key in avg_state:
            for i in range(1, len(client_state_dicts)):
                avg_state[key] += client_state_dicts[i][key]
            avg_state[key] = avg_state[key] / len(client_state_dicts)
        self.global_model.load_state_dict(avg_state)
        return avg_state

    def select_clients(
        self, clients: List[DittoClient], fraction: float = 1.0
    ) -> List[DittoClient]:
        k = max(1, int(len(clients) * fraction))
        return random.sample(clients, k)


# ---------------------------------------------------------
# Federated training loop
# ---------------------------------------------------------
def run_ditto(
    clients: List[DittoClient],
    server: DittoServer,
    rounds: int = 5,
    client_fraction: float = 1.0,
):
    for r in range(1, rounds + 1):
        selected = server.select_clients(clients, client_fraction)

        # Step 1: broadcast current global weights to all selected clients
        global_state = server.global_model.state_dict()
        for c in selected:
            c.receive_global_model(global_state)

        # Step 2: each client trains its global model copy locally
        updated_states = [c.local_train_global() for c in selected]

        # Step 3: server aggregates and updates global model
        global_state = server.aggregate(updated_states)

        # Step 4: each client updates its personal model with proximity reg
        for c in selected:
            c.receive_global_model(global_state)
            c.local_train_personal()

        # Evaluate personal models
        accs = [c.evaluate() for c in clients]
        mean_acc = sum(accs) / len(accs)
        print(f"Round {r:02d} | Mean personal accuracy: {mean_acc:.4f}")


# ---------------------------------------------------------
# Smoke test on synthetic data: no real datasets needed
# ---------------------------------------------------------
if __name__ == "__main__":
    torch.manual_seed(42)
    device = "cuda" if torch.cuda.is_available() else "cpu"

    NUM_CLIENTS = 4
    NUM_CLASSES = 10

    # Synthetic 28x28 grayscale images, 50 samples per client
    clients = [
        DittoClient(
            client_id=i,
            data=(
                torch.randn(50, 1, 28, 28),
                torch.randint(0, NUM_CLASSES, (50,)),
            ),
            num_classes=NUM_CLASSES,
            lr=0.01,
            lam=0.1,
            local_steps=5,
            pfl_steps=5,
            device=device,
        )
        for i in range(NUM_CLIENTS)
    ]

    global_model = SimpleCNN(NUM_CLASSES).to(device)
    server = DittoServer(global_model, num_clients=NUM_CLIENTS)

    run_ditto(clients, server, rounds=3, client_fraction=1.0)
    print("Smoke test passed.")
```
Conclusion
PFLlib does something deceptively simple and genuinely useful. It puts 37 algorithms in the same room, gives them the same data, the same splits, and the same evaluation code, and then steps back. The results are not always flattering to the methods that generated the most citations in their original papers.
The conceptual shift the library represents is a move from algorithm-centric evaluation to ecosystem-centric evaluation. It is no longer enough to show that your method beats FedAvg. You have to show that it beats Ditto, FedALA, FedCP, GPFL, and FedDBE, all of which beat FedAvg, some of them by a very large margin. That is a higher standard, and it is the right one.
The transferability of this work goes beyond federated learning research. The model-splitting insight, that some layers should be shared globally while others are kept local, has direct analogues in multi-task learning, domain adaptation, and transfer learning. Researchers in those adjacent fields may find the PFLlib benchmark table useful as a proof of concept that representation sharing can be structured rather than blunt.
The honest remaining limitation is model and modality coverage. The benchmark as published tests CNNs and fastText. Much of practical ML now runs on transformer backbones, and it is an open question whether the algorithm rankings in Table 2 hold when you swap the backbone. This is not a criticism the authors ignore; they note that the library is designed for extension. What it means in practice is that a team evaluating PFLlib methods for a vision transformer deployment should run their own experiments rather than reading off the benchmark table.
Future directions suggested by the benchmark results seem clear enough. The feature shift scenario and the real-world scenario deserve more comprehensive coverage. The interaction between personalization strategy and privacy vulnerability under gradient inversion is underexplored and potentially important. And the question of how personalized FL methods scale when the number of clients grows from the tens to the thousands is one the current benchmark does not fully answer.
What PFLlib has already done is give the field a shared vocabulary and a shared scoreboard. That turns out to be most of what is needed for a research community to make faster progress.
Frequently Asked Questions
What is the difference between traditional and personalized federated learning?
Traditional federated learning trains a single global model shared equally by all clients. Personalized federated learning allows each client to end up with a model that fits its own data distribution, either by fine-tuning the global model locally, by learning which other clients to aggregate with, or by splitting the network into shared and personal components. The benchmark results in PFLlib show that personalized methods consistently outperform traditional ones when client data distributions differ significantly.
How many algorithms does PFLlib include?
PFLlib includes 37 algorithms in total, covering 8 traditional federated learning methods and 29 personalized methods. The personalized methods are organized into five families based on their core technique: meta-learning, regularization, personalized aggregation, model splitting, and knowledge distillation. Each algorithm is implemented in a server file and a client file, inheriting shared infrastructure from base classes.
What is label skew?
Label skew means that different clients hold data from different subsets of the available classes, rather than seeing the full class distribution equally. A hospital that mostly treats a particular condition will have a very different label distribution from one in another region. PFLlib tests two versions of this: a pathological setting where each client holds only a small number of classes, and a practical setting where a Dirichlet distribution assigns labels unevenly but with overlap. The practical setting is more realistic and reveals different algorithm rankings.
Which algorithms perform best in the benchmark?
Model-splitting methods consistently appear at the top of the benchmark table under both pathological and practical label skew settings. FedDBE, GPFL, and FedCP all achieve strong results on CIFAR-100, a harder benchmark than Fashion-MNIST. FedALA from the personalized aggregation family is also among the top performers. Traditional methods like FedAvg and FedProx lag significantly behind, particularly on harder datasets and under more realistic data splits.
Does federated learning guarantee privacy?
Federated learning reduces privacy risk by keeping raw data local, but sharing gradient updates still creates vulnerability. The Deep Leakage from Gradients attack can sometimes reconstruct training images from gradient information alone. PFLlib includes an implementation of this attack along with the Peak Signal-to-Noise Ratio metric so researchers can evaluate how their chosen algorithm performs on both accuracy and privacy resistance. Personalization strategies change the gradient structure in ways that may increase or decrease this vulnerability depending on the method.
How do I add a new algorithm to PFLlib?
Adding a new algorithm to PFLlib requires writing two files: a server file named serverY.py and a client file named clientY.py, where Y is the algorithm name. Both files inherit from the base server and client classes, which handle the repetitive infrastructure of federated rounds, client selection, and communication simulation. Only the parts of the algorithm that are genuinely novel need to be written from scratch, which lowers the barrier for researchers wanting to prototype new ideas within the benchmark framework.
This analysis is based on the published paper and an independent evaluation of its claims.
