In Single Image Super-Resolution (SISR), Generative Adversarial Network (GAN)-based algorithms often score lower on conventional evaluation metrics than other methods, even though human observers judge their results visually comparable or even superior. This discrepancy points to inherent limitations in traditional evaluation methodologies. Specifically, PSNR (Peak Signal-to-Noise Ratio) is computed from pixel-wise differences and increases as MSE (Mean Squared Error) decreases. Since GAN-based methods typically employ perceptual or adversarial loss functions rather than MSE-based objectives, they are inherently disadvantaged under PSNR evaluation. To address this issue, we propose Calibrated-PSNR, a novel evaluation metric that better accounts for the characteristics of GAN-based super-resolution models.
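For reference, these are the standard definitions behind this argument. The symbols are introduced here for illustration: $I$ is the ground-truth image, $\hat{I}$ the reconstruction, $H \times W$ the image size, and $\mathrm{MAX}$ the peak pixel value (255 for 8-bit images). For color images the mean is also taken over channels.

```latex
\mathrm{MSE} = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W}\bigl(I(i,j)-\hat{I}(i,j)\bigr)^{2},
\qquad
\mathrm{PSNR} = 10\,\log_{10}\frac{\mathrm{MAX}^{2}}{\mathrm{MSE}}\ \text{dB}
```

PSNR is a strictly decreasing function of MSE, so a model trained to minimize MSE is optimizing PSNR directly, while a model trained with other losses is not.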
As shown in the table and figures below, the SRResNet model achieves higher PSNR scores than SRGAN on uniform/monochromatic regions. (1: top-left, 2: top-right, 3: bottom-left, 4: bottom-right of the monochromatic test image)
Calibrated-PSNR Approach:
1. Divide the image into 32×32 non-overlapping blocks
2. Compute PSNR for each block individually
3. Exclude blocks with PSNR ≥ 35 dB (typically uniform/low-texture regions)
4. Calculate the final PSNR using only the remaining blocks

This strategy reduces the disproportionate influence of easily-reconstructed uniform regions, allowing the metric to better reflect perceptual quality in textured/complex areas where GAN-based models excel.
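The procedure above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code in Proposed_metric.py. The function names are hypothetical, and the text leaves three details open, so the sketch assumes them: partial blocks at the right and bottom edges are skipped, the final score is the average of the kept per-block PSNRs, and NaN is returned when every block is excluded.

```python
import numpy as np


def block_psnr(ref, test, max_val=255.0):
    """PSNR in dB between two equally shaped arrays; inf when they are identical."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)


def calibrated_psnr(hr, sr, block=32, threshold=35.0, max_val=255.0):
    """Mean PSNR over the blocks whose PSNR is below `threshold` dB.

    Assumptions (not stated in the text): partial edge blocks are skipped,
    the final score is the average of the kept per-block PSNRs, and NaN is
    returned when every block is excluded.
    """
    if hr.shape != sr.shape:
        raise ValueError("hr and sr must have the same shape")
    h, w = hr.shape[:2]
    kept = []
    # Walk over the non-overlapping block grid (full blocks only).
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            p = block_psnr(hr[y:y + block, x:x + block],
                           sr[y:y + block, x:x + block], max_val)
            if p < threshold:  # exclude blocks with PSNR >= threshold
                kept.append(p)
    return float(np.mean(kept)) if kept else float("nan")
```

As an example, take a 64×64 8-bit image in which one 32×32 block is off by a constant 10 gray levels and the other three blocks are reconstructed exactly. Standard PSNR over the whole image is about 34.15 dB, while this sketch returns about 28.13 dB, the PSNR of the single imperfect block, because the three perfect blocks are excluded.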
When comparing standard PSNR with our proposed Calibrated-PSNR, the performance gap between SRResNet and SRGAN significantly narrows. This demonstrates that Calibrated-PSNR effectively mitigates the bias toward monochromatic regions, providing a more balanced evaluation aligned with human perception.
python Proposed_metric.py
Input Requirements
- hrdirroot: Path to the directory containing ground-truth high-resolution images
- srdirroot: Path to the directory containing super-resolved output images
- save_dir: Path to save the CSV file with metric results
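How Proposed_metric.py consumes these three paths is not shown here. The following is a minimal sketch of one plausible flow, using only the standard library. The function name `evaluate_pairs`, the `metric` callable, the output name `metrics.csv`, and matching HR and SR images by identical file name are all assumptions made for illustration.

```python
import csv
import os


def evaluate_pairs(hrdirroot, srdirroot, save_dir, metric, csv_name="metrics.csv"):
    """Score every SR image that has an HR image with the same file name.

    `metric` is any callable taking (hr_path, sr_path) and returning a number.
    Matching by identical file name and the default CSV name are assumptions.
    """
    os.makedirs(save_dir, exist_ok=True)
    out_path = os.path.join(save_dir, csv_name)
    # Only file names present in both directories are evaluated.
    names = sorted(set(os.listdir(hrdirroot)) & set(os.listdir(srdirroot)))
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["image", "score"])
        for name in names:
            score = metric(os.path.join(hrdirroot, name),
                           os.path.join(srdirroot, name))
            writer.writerow([name, score])
    return out_path
```

Passing the metric as a callable keeps the file handling separate from the scoring, so the same loop could serve PSNR, Calibrated-PSNR, or any other per-image score.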
Generate Low-Resolution Images from GT
python GT2LR.py
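The downsampling settings used by GT2LR.py are not described here. As a rough sketch of how such a step is commonly done with Pillow, the example below assumes a ×4 scale factor and bicubic resampling. Both are assumptions, as are the function names `downsample` and `convert_dir`, and the GT directory is assumed to contain only image files.

```python
import os

from PIL import Image


def downsample(img, scale=4):
    """Return a bicubic-downsampled copy of a PIL image.

    The image is first cropped so both sides are divisible by `scale`.
    scale=4 and bicubic resampling are assumptions for illustration.
    """
    w, h = img.size
    w, h = w - w % scale, h - h % scale
    img = img.crop((0, 0, w, h))
    return img.resize((w // scale, h // scale), Image.BICUBIC)


def convert_dir(gt_dir, lr_dir, scale=4):
    """Write a downsampled copy of every image in gt_dir to lr_dir, keeping file names."""
    os.makedirs(lr_dir, exist_ok=True)
    for name in sorted(os.listdir(gt_dir)):
        with Image.open(os.path.join(gt_dir, name)) as img:
            downsample(img, scale).save(os.path.join(lr_dir, name))
```

Cropping to a multiple of the scale factor keeps each LR image an exact 1/scale of its GT counterpart, which avoids size mismatches when the super-resolved output is later compared against the ground truth.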
- Python 3.6.5
- PyTorch 1.1.0
- Pillow 5.1.0
- numpy 1.14.5
- scikit-image 0.15.0
git clone https://github.com/dongheehand/SRGAN-PyTorch.git
Training:
python main.py --LR_path ./LR_imgs_dir --GT_path ./GT_imgs_dir
Testing:
python main.py --mode test_only --LR_path ./LR_imgs_dir --generator_path ./model/SRResNet.pt
Uses the same codebase as SRResNet.
Train
python main.py --LR_path ./LR_imgs_dir --GT_path ./GT_imgs_dir
Test
python main.py --mode test_only --LR_path ./LR_imgs_dir --generator_path ./model/SRGAN.pt
Based on XPixelGroup/RankSRGAN
# 1. Clone repository
git clone https://github.com/WenlongZhang0724/RankSRGAN.git
# 2. Place LR images to restore in './LR' folder
# 3. Download pretrained models from Google Drive:
# https://drive.google.com/drive/folders/1_KhEc_zBRW7iLeEJITU3i923DC6wv51T
# → Place models in './experiments/pretrained_models/'
# 4. Configure options in test.py and run evaluation
python test.py -opt options/test/test_RankSRGAN.yml
# 5. Results saved in './results' folder
# 1. Generate rank dataset: ./datasets/generate_rankdataset/
python train_rank.py -opt options/train/train_Ranker.yml
# 1. Modify config file as needed: options/train/train_RankSRGAN.yml
python train_niqe.py -opt options/train/train_RankSRGAN.yml
Implementation based on the original authors' repository: adamian98/pulse
We use existing implementations of the following Image Quality Assessment (IQA) metrics: LPIPS, FID, NIQE, MA, MUSIQ, NIMA, DBCNN, WaDIQaM, BRISQUE, PI
- Primary IQA Toolkit: chaofengc/IQA-PyTorch, a comprehensive PyTorch toolbox for full-reference and no-reference IQA metrics
- LPIPS Implementation Reference: S-aiueo32/lpips-pytorch
- Batch Evaluation Script: save_metrics.py
# Configuration parameters:
hrdirroot # Path to ground-truth HR images
srdirroot # Path to super-resolved output images
save_dir # Path to output CSV file for metric results
💡 Note: Calibrated-PSNR is designed as a complementary metric—not a replacement—for comprehensive SR evaluation. We recommend using it alongside perceptual metrics (LPIPS, FID) and user studies for holistic model assessment.


