From Pixels to Posts: Retrieval-Augmented Fashion Captioning and Hashtag Generation
Meet a new way to write captions and hashtags for fashion photos, grounded in what is actually in the picture.
- Detects multiple garments in an image with a YOLO-based model.
- Extracts dominant colors and infers fabric and gender by retrieving similar products with CLIP and FAISS.
- Packages these facts as an evidence pack that steers a large language model to stay faithful while sounding stylish (see the sketch after this list).
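Here is a minimal sketch of how those three steps could fit together, assuming ultralytics YOLO for detection, Hugging Face CLIP for embeddings, and FAISS for nearest-neighbor search. The weight file, catalog files, attribute tags, and prompt wording are hypothetical stand-ins, not the paper's exact setup.

```python
import json
from collections import Counter

import faiss
import numpy as np
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
from ultralytics import YOLO

detector = YOLO("garment_detector.pt")  # hypothetical fine-tuned weights
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical precomputed catalog: N x 512 L2-normalized CLIP embeddings
# plus per-product attribute tags (fabric, gender).
catalog_vecs = np.load("catalog_clip.npy").astype("float32")
catalog_meta = json.load(open("catalog_meta.json"))

index = faiss.IndexFlatIP(512)  # inner product == cosine on unit vectors
index.add(catalog_vecs)

def embed(image: Image.Image) -> np.ndarray:
    """L2-normalized CLIP image embedding, ready for cosine search."""
    inputs = proc(images=image, return_tensors="pt")
    feats = clip.get_image_features(**inputs).detach().numpy().astype("float32")
    return feats / np.linalg.norm(feats, axis=1, keepdims=True)

def majority(values):
    """Most common tag among the retrieved neighbors."""
    return Counter(values).most_common(1)[0][0]

def dominant_color(crop: Image.Image) -> tuple:
    """Dominant RGB via palette quantization (a simple stand-in;
    a real system would map this to a color name)."""
    small = crop.convert("RGB").resize((64, 64)).quantize(colors=4)
    palette = small.getpalette()
    count, i = sorted(small.getcolors(), reverse=True)[0]
    return tuple(palette[3 * i : 3 * i + 3])

def evidence_pack(photo: Image.Image, k: int = 5) -> list[dict]:
    """Detect garments, then attach colors and retrieved attributes to each crop."""
    pack = []
    for box in detector(photo)[0].boxes:
        x1, y1, x2, y2 = map(int, box.xyxy[0])
        crop = photo.crop((x1, y1, x2, y2))
        _, idx = index.search(embed(crop), k)  # k nearest catalog items
        neighbors = [catalog_meta[i] for i in idx[0]]
        pack.append({
            "garment": detector.names[int(box.cls[0])],
            "color": dominant_color(crop),
            "fabric": majority(n["fabric"] for n in neighbors),
            "gender": majority(n["gender"] for n in neighbors),
        })
    return pack

def build_prompt(pack: list[dict]) -> str:
    """Constrain the LLM to verified attributes only."""
    facts = "\n".join(
        f"- {e['garment']}: color {e['color']}, {e['fabric']}, {e['gender']}"
        for e in pack
    )
    return ("Write a stylish caption and hashtags for this outfit. "
            "Mention only these verified attributes:\n" + facts)
```

An inner-product index over L2-normalized vectors makes FAISS retrieval equivalent to cosine similarity, the usual choice for CLIP embeddings, and the prompt passes the LLM only attributes backed by detection or retrieval.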
Why it matters: classic end-to-end captioners often miss attributes or hallucinate details that are not in the image. Retrieval-augmented generation preserves the stylish tone while improving factual grounding.
Results: the detector reached 0.71 mAP across nine garment types. The RAG-LLM pipeline delivered more attribute-aligned captions and hashtags with higher coverage (including full coverage at the 50% threshold), while a fine-tuned BLIP baseline showed higher word overlap but weaker generalization.
Takeaway: blend vision, retrieval, and LLMs to scale accurate, on-brand fashion copy across products and shoots.
Paper: https://arxiv.org/abs/2511.19149v1
Register: https://www.AiFeta.com
#AI #FashionTech #ComputerVision #GenAI #RAG #LLM #YOLO #CLIP #BLIP #ecommerce #arXiv