What's the Role of Classifier Models?
Image classifiers (the terms "tagger" and "labeler" are used interchangeably) play a major role in the generative AI bias pipeline, functioning both as contributors to and as indicators of systemic representation issues. Even so, they have become essential tools for creating, filtering, and organizing the very datasets that power text-to-image (image-generating) models. The taxonomies and tagging patterns they employ directly influence both the composition of training data and how generated images are subsequently classified.
Different architectural approaches (Vision Transformers, SwinV2, ConvNext) offer different trade-offs among accuracy, efficiency, and built-in biases. Researchers examining the Danbooru dataset (***NOTE: posts are uploaded constantly and frequently feature Not Safe For Work content. The provided link includes a filter to display only content rated "safe," but it may not be foolproof***) that powers these taggers have found significant gender imbalances that skew toward female characters and often include sexualization bias, which is then encoded into the tagging systems and propagated forward into generation models (Park et al., 2024).
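For readers who want a sense of how such an imbalance can be measured, the sketch below counts a handful of gender-related tags in a local dump of post metadata. The file name, the "tag_string" field, and the specific tags chosen are illustrative assumptions, not a description of any published study's method.

```python
# Minimal sketch: estimate how often a few gender-related tags appear in a
# local metadata dump (assumed format: one JSON object per line, each with a
# space-separated "tag_string" field, as in Danbooru API exports).
import json
from collections import Counter

WATCHED_TAGS = {"1girl", "1boy", "multiple_girls", "multiple_boys"}  # illustrative subset

counts = Counter()
total_posts = 0

with open("danbooru_posts.jsonl", encoding="utf-8") as f:  # hypothetical file
    for line in f:
        post = json.loads(line)
        tags = set(post.get("tag_string", "").split())
        counts.update(tags & WATCHED_TAGS)
        total_posts += 1

for tag in sorted(WATCHED_TAGS):
    share = counts[tag] / total_posts if total_posts else 0.0
    print(f"{tag}: {counts[tag]} posts ({share:.1%})")
```

A real audit would go further, for example by cross-tabulating these tags with appearance- and sexualization-related tags, but even raw counts make the skew visible.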
At their core, classifier models illustrate the metadata side of the problem. Analyzing tagging models alongside the generative models they support presents an opportunity for information professionals to intervene in this cycle through improved documentation, dataset curation, and taxonomy development.
Waifu Diffusion Taggers
SmilingWolf, a prominent AI developer, created a family of specialized image taggers that have become standard tools for image analysis, particularly influential in anime/manga contexts but with impacts extending into broader image generation ecosystems. These "WD" (Waifu Diffusion) taggers employ different neural network architectures while sharing a common training approach: all are trained on the Danbooru dataset, restricted to tags that appear at least 600 times in order to focus on high-quality annotations. The models categorize images across three domains: general content tags, character identification, and content rating assessment. Their widespread adoption has influenced classification pipelines for both specialized and general-purpose generative systems, making them valuable subjects for studying how taxonomies and labeling conventions shape representation biases throughout the AI stack, from dataset curation to final outputs.
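To make the tagging step concrete, here is a minimal sketch of running one of these taggers locally with ONNX Runtime. The repository id, file names, 448x448 input size, and the 0.35 cutoff are assumptions about how the WD v3 models are commonly distributed on the Hugging Face Hub; each model card documents its own preprocessing and tuned thresholds.

```python
# Minimal sketch: run a WD tagger exported to ONNX on a single image.
# Repo/file names and the 448x448 BGR preprocessing are assumptions; consult
# the model card for the exact requirements of each tagger.
import csv
import numpy as np
import onnxruntime as ort
from PIL import Image
from huggingface_hub import hf_hub_download

REPO = "SmilingWolf/wd-vit-large-tagger-v3"  # assumed repository id
model_path = hf_hub_download(REPO, "model.onnx")
tags_path = hf_hub_download(REPO, "selected_tags.csv")

# Tag vocabulary: one row per output column (rating, general, and character tags).
with open(tags_path, newline="", encoding="utf-8") as f:
    tag_names = [row["name"] for row in csv.DictReader(f)]

# Preprocess: pad to a white square, resize, convert RGB -> BGR float32, add batch dim.
image = Image.open("example.png").convert("RGB")
size = 448  # assumed input resolution
canvas = Image.new("RGB", (max(image.size),) * 2, (255, 255, 255))
canvas.paste(image, ((canvas.width - image.width) // 2, (canvas.height - image.height) // 2))
arr = np.asarray(canvas.resize((size, size)), dtype=np.float32)
arr = arr[:, :, ::-1].copy()      # RGB -> BGR, contiguous copy
arr = arr[np.newaxis, ...]        # shape (1, H, W, 3)

session = ort.InferenceSession(model_path)
input_name = session.get_inputs()[0].name
scores = session.run(None, {input_name: arr})[0][0]  # one confidence score per tag

threshold = 0.35  # illustrative cutoff; the model cards publish tuned thresholds
predicted = [(t, float(s)) for t, s in zip(tag_names, scores) if s >= threshold]
print(sorted(predicted, key=lambda x: -x[1])[:15])
```

The output is simply a list of (tag, confidence) pairs; whatever taxonomy the tag vocabulary encodes is what downstream dataset curation inherits.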
WD EVA02-Large Tagger v3
A powerful image tagging model based on the EVA02 architecture, a Transformer-based model with 304M parameters. It uses an advanced plain Transformer design and was pre-trained to reconstruct language-aligned vision features via masked image modeling. This tagger supports ratings, characters, and general tags, and was trained on Danbooru images (up to ID 7220105) with a focus on high-quality data (images with 10+ general tags). It is compatible with both timm and ONNX and reports a validation F1 score of 0.4772. EVA02's architecture includes mean pooling, SwiGLU activation functions, and Rotary Position Embeddings for enhanced performance. Model card
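Because the model card advertises timm compatibility, the tagger can presumably also be loaded as an ordinary timm model from the Hugging Face Hub. The hub identifier below is an assumption, and the preprocessing is taken from the checkpoint's own configuration rather than hard-coded.

```python
# Minimal sketch: load the EVA02-based tagger through timm (assumes the hub id
# and that the published weights are in a timm-loadable format).
import timm
import torch
from PIL import Image

model = timm.create_model(
    "hf-hub:SmilingWolf/wd-eva02-large-tagger-v3",  # assumed hub id
    pretrained=True,
).eval()

# Build the preprocessing pipeline declared in the checkpoint's config.
cfg = timm.data.resolve_data_config({}, model=model)
transform = timm.data.create_transform(**cfg)

image = Image.open("example.png").convert("RGB")
with torch.inference_mode():
    logits = model(transform(image).unsqueeze(0))
    probs = torch.sigmoid(logits)[0]  # multi-label: one independent score per tag

print(probs.shape)  # one probability per tag in the tagger's vocabulary
```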
WD ViT-Large Tagger v3
A Vision Transformer (ViT) based image tagging model that supports ratings, characters, and general tags. This model employs a standard Vision Transformer architecture and was trained on the same Danbooru dataset as the other models in this family. The ViT architecture is simpler than some alternatives, with fewer built-in architectural priors, making it versatile for various computer vision tasks. Its validation threshold is 0.2606 with an F1 score of 0.4674. This model has become popular for anime image tagging applications due to its balance of accuracy and efficiency. Model card
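The validation threshold is the cutoff applied to each tag's sigmoid score when a raw model output is converted into a discrete tag list; the F1 score is measured at that cutoff. The snippet below shows the thresholding step in isolation, using made-up tag names and scores.

```python
# Minimal sketch: turn per-tag confidence scores into a tag list using the
# threshold reported on the model card. Tag names and scores here are made up.
THRESHOLD = 0.2606  # validation threshold reported for WD ViT-Large Tagger v3

tag_names = ["1girl", "solo", "outdoors", "smile"]  # illustrative tags
scores = [0.91, 0.67, 0.31, 0.12]                   # illustrative sigmoid scores

predicted = {t: s for t, s in zip(tag_names, scores) if s >= THRESHOLD}
print(predicted)  # {'1girl': 0.91, 'solo': 0.67, 'outdoors': 0.31}
```

Lowering the threshold yields more (and noisier) tags, while raising it yields fewer; that choice alone changes which descriptions make it into a curated dataset.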
WD SwinV2 Tagger v3
A powerful and accurate model built on the Swin Transformer V2 architecture, which is well-suited for most image tagging use cases. Like others in the family, it supports ratings, characters, and general tags, and was trained on filtered Danbooru images. Version 2.0 employed tag frequency-based loss scaling to combat class imbalance, resulting in improved performance with a validation F1 score of 0.4541 (up from 0.4411 in v1.0). The Swin architecture uses shifted windows for more efficient attention computation compared to standard Vision Transformers. Model card
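Tag frequency-based loss scaling is a standard remedy for class imbalance in multi-label training: rarer tags receive larger weights so that the most frequent tags do not dominate the loss. The sketch below expresses the idea with a weighted binary cross-entropy in PyTorch; it illustrates the general technique under assumed tag counts, not SmilingWolf's actual training code.

```python
# Minimal sketch of tag frequency-based loss scaling for multi-label training.
# Illustrates the general idea (rarer tags weigh more), not the exact weighting
# scheme used to train the WD taggers.
import torch
import torch.nn as nn

# Hypothetical per-tag occurrence counts across a training set.
tag_counts = torch.tensor([120_000.0, 8_500.0, 900.0, 650.0])

# Inverse-frequency weights (sklearn-style "balanced" heuristic):
# weight_i = total_count / (num_tags * count_i), so rare tags get larger weights.
weights = tag_counts.sum() / (len(tag_counts) * tag_counts)

# pos_weight scales the positive term of the loss per tag.
criterion = nn.BCEWithLogitsLoss(pos_weight=weights)

logits = torch.randn(4, len(tag_counts))              # fake model outputs (batch of 4)
targets = torch.randint(0, 2, logits.shape).float()   # fake multi-hot tag labels

loss = criterion(logits, targets)
print(loss.item())
```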
WD ConvNext Tagger v3
An efficient CNN-based model offering a good balance of speed and accuracy for image tagging tasks. Like the other taggers in this family, it was trained on the same Danbooru dataset and supports the same tag categories. Version 2.0 introduced tag frequency-based loss scaling and extended training, improving its F1 score to 0.4419 from 0.4282 in v1.0. ConvNext represents a modern convolutional neural network design that incorporates some transformer-inspired improvements while maintaining the efficiency of CNNs. Model card
Want to try them out?
Check out the public demo.
© 2025 E'Narda McCalister. All rights reserved.