From Pixels to Posts: Retrieval-Augmented Fashion Captioning and Hashtag Generation

Meet a new way to write captions and hashtags for fashion photos—grounded in what is actually in the picture.

  • Detects multiple garments in an image with a YOLO-based model.
  • Extracts dominant colors and infers fabric and gender by retrieving similar products with CLIP and FAISS.
  • Packages these facts as an evidence pack that steers a large language model to stay faithful while sounding stylish.
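To make the retrieval step concrete, here is a minimal sketch of how an evidence pack could be assembled: embed the garment crop, find the nearest catalog products, and majority-vote their attributes. This is an illustrative stand-in only — the toy catalog, random embeddings, and numpy nearest-neighbor search are assumptions replacing the paper's actual CLIP encoder and FAISS index.

```python
import numpy as np
from collections import Counter

# Hypothetical mini-catalog; in the real pipeline these entries would be
# products retrieved from a FAISS index built over CLIP image embeddings.
CATALOG = [
    {"fabric": "denim",  "gender": "women"},
    {"fabric": "denim",  "gender": "women"},
    {"fabric": "cotton", "gender": "men"},
    {"fabric": "denim",  "gender": "unisex"},
]

def retrieve_top_k(query, index, k=3):
    """Cosine-similarity top-k over L2-normalized embeddings (numpy stand-in for FAISS)."""
    index = index / np.linalg.norm(index, axis=1, keepdims=True)
    query = query / np.linalg.norm(query)
    sims = index @ query
    return np.argsort(-sims)[:k]

def build_evidence_pack(query_emb, index_embs, catalog, k=3):
    """Majority-vote fabric/gender attributes from the k nearest catalog items."""
    ids = retrieve_top_k(query_emb, index_embs, k)
    fabrics = Counter(catalog[i]["fabric"] for i in ids)
    genders = Counter(catalog[i]["gender"] for i in ids)
    return {
        "fabric": fabrics.most_common(1)[0][0],
        "gender": genders.most_common(1)[0][0],
        "neighbors": [int(i) for i in ids],
    }

# Simulated embeddings: the query is a slightly perturbed copy of catalog item 0,
# so item 0 should be its nearest neighbor.
rng = np.random.default_rng(0)
index_embs = rng.normal(size=(4, 8))
query_emb = index_embs[0] + 0.01 * rng.normal(size=8)

pack = build_evidence_pack(query_emb, index_embs, CATALOG, k=3)
print(pack)
```

The resulting dict (dominant fabric, inferred gender, neighbor IDs) is the kind of structured evidence that can be serialized into the LLM prompt so the generated caption stays tied to retrieved facts rather than free association.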

Why it matters: classic end-to-end captioners often miss garment attributes or hallucinate ones that are not in the image. Retrieval-augmented generation preserves the stylish tone while improving factual grounding.

Results: the detector reached 0.71 mAP across nine garment types. The RAG-LLM pipeline delivered more attribute-aligned captions and hashtags with higher coverage (including full coverage at the 50% threshold), while a fine-tuned BLIP baseline showed higher word overlap but weaker generalization.

Takeaway: blend vision, retrieval, and LLMs to scale accurate, on-brand fashion copy across products and shoots.

Paper: https://arxiv.org/abs/2511.19149v1

Register: https://www.AiFeta.com

#AI #FashionTech #ComputerVision #GenAI #RAG #LLM #YOLO #CLIP #BLIP #ecommerce #arXiv
