We are excited to present the accepted papers for the Computer Vision in Advertising and Marketing (CVAM) Workshop at ICCV 2025.
🏆 ICCV Proceedings Track
Full papers (4-8 pages) published in the official ICCV 2025 Workshop Proceedings.
AI vs. Human Moderators: A Comparative Evaluation of Multimodal LLMs in Content Moderation for Brand Safety
Adi Levi,
Or Levi,
Jonathan Morra,
Sardhendu Mishra
📅 Whitepaper Presentations #1 - Oct 20 10:00 AM HST
Abstract: As the volume of video content online grows exponentially, the demand for moderation of unsafe videos has surpassed human capabilities, posing both operational and mental health challenges. While recent studies demonstrated the merits of Multimodal Large Language Models (MLLMs) in various video understanding tasks, their application to multimodal content moderation, a domain that requires nuanced understanding of both visual and textual cues, remains relatively underexplored. In this work, we benchmark the capabilities of MLLMs in brand safety classification, a critical subset of content moderation for safeguarding advertising integrity. To this end, we introduce a novel, multimodal and multilingual dataset, meticulously labeled by professional reviewers in a multitude of risk categories. Through a detailed comparative analysis, we demonstrate the effectiveness of MLLMs such as Gemini, GPT, and Llama in multimodal brand safety, and evaluate their accuracy and cost efficiency compared to professional human reviewers. Furthermore, we present an in-depth discussion shedding light on limitations of MLLMs and failure cases. We are releasing our dataset alongside this paper to facilitate future research on effective and responsible brand safety and content moderation.
ByDeWay: Boost Your multimodal LLM with DEpth prompting in a Training-Free Way
Abstract: We introduce ByDeWay, a training-free framework for boosting the performance of Multimodal Large Language Models (MLLMs). Specifically, ByDeWay leverages a novel prompting strategy, Layered-Depth-Based Prompting (LDP), that enhances the spatial reasoning and grounding capabilities of MLLMs. Our key insight is to inject structured spatial context derived from monocular depth estimation into the input prompts—without modifying any model parameters. By segmenting scenes into closest, mid-range, and farthest depth layers and generating region-specific captions using a grounded vision-language model, we produce explicit depth-aware textual descriptions. These descriptions are concatenated with image-question prompts to guide the model toward spatially grounded and hallucination-resistant outputs. Our method is lightweight, modular, and compatible with any black-box MLLM. Evaluations on hallucination-sensitive (POPE) and reasoning-intensive (GQA) tasks show consistent improvements across multiple MLLMs, demonstrating the effectiveness of depth-aware prompting in a zero-training setup.
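The core of LDP is easy to picture: split a monocular depth map into a few bands and prepend band-specific captions to the image question. Below is a minimal illustrative sketch of that idea in Python (not the authors' code; the quantile split, layer names, and prompt wording are assumptions):

```python
import numpy as np

def depth_layers(depth: np.ndarray, n_layers: int = 3) -> list:
    """Split a monocular depth map into closest / mid-range / farthest masks by quantile."""
    edges = np.quantile(depth, np.linspace(0, 1, n_layers + 1))[1:-1]
    ids = np.digitize(depth, edges)   # 0 = closest ... n_layers-1 = farthest (metric-depth convention)
    return [ids == k for k in range(n_layers)]

def build_ldp_prompt(question: str, layer_captions: list) -> str:
    """Concatenate region-specific captions with the image question, closest layer first."""
    names = ["closest", "mid-range", "farthest"]
    context = "\n".join(f"{n} region: {c}" for n, c in zip(names, layer_captions))
    return f"Spatial context (from depth):\n{context}\n\nQuestion: {question}"

# Toy example with a synthetic depth map and placeholder captions:
depth = np.random.rand(240, 320)
masks = depth_layers(depth)   # in the full pipeline, each masked region is captioned by a grounded VLM
prompt = build_ldp_prompt(
    "What is the person holding?",
    ["a person holding a cup", "a wooden table", "a window with curtains"],
)
print(prompt)
```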
CLASP: Adaptive Spectral Clustering for Unsupervised Per-Image Segmentation
Max Curie,
Paulo da Costa
📅 Whitepaper Presentations #1 - Oct 20 10:00 AM HST
Abstract: We introduce CLASP (Clustering via Adaptive Spectral Processing), a lightweight framework for unsupervised image segmentation that operates without any labeled data or fine-tuning. CLASP first extracts per-patch features using a self-supervised ViT encoder (DINO); then, it builds an affinity matrix and applies spectral clustering. To avoid manual tuning, we select the segment count automatically with an eigengap-silhouette search, and we sharpen the boundaries with a fully connected DenseCRF. Despite its simplicity and training-free nature, CLASP attains competitive mIoU and pixel-accuracy on COCO-Stuff and ADE20K, matching recent unsupervised baselines. The zero-training design makes CLASP a strong, easily reproducible baseline for large unannotated corpora—especially common in digital advertising and marketing workflows such as brand-safety screening, creative asset curation, and social-media content moderation.
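As a rough picture of the pipeline, the sketch below clusters per-patch features with spectral clustering and selects the segment count automatically. For simplicity it scores candidate counts with the silhouette criterion only (the paper combines this with an eigengap heuristic) and omits the DenseCRF refinement; it is illustrative, not the authors' code:

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import silhouette_score
from sklearn.metrics.pairwise import cosine_similarity

def cluster_patches(feats: np.ndarray, k_min: int = 2, k_max: int = 8):
    """Pick the segment count automatically by scoring each candidate k with the silhouette."""
    affinity = np.clip(cosine_similarity(feats), 0, None)   # non-negative affinity matrix
    best = (None, -1.0, None)
    for k in range(k_min, k_max + 1):
        labels = SpectralClustering(n_clusters=k, affinity="precomputed",
                                    assign_labels="kmeans", random_state=0).fit_predict(affinity)
        score = silhouette_score(feats, labels)
        if score > best[1]:
            best = (k, score, labels)
    return best[0], best[2]

feats = np.random.rand(196, 384)        # stand-in for 14x14 DINO patch features
k, labels = cluster_patches(feats)
segmentation = labels.reshape(14, 14)   # per-patch segment ids, before any CRF refinement
```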
Reasoning-Enhanced Prompt Strategies for Multi-Label Classification
Jinze Yu,
Guanghui Wang
📅 Whitepaper Presentations #2 - Oct 20 3:00 PM HST
Abstract: This paper introduces Confidence-Ranked Reasoning, a novel approach for multi-label classification using large language models (LLMs) that balances reasoning capabilities with computational efficiency. Our approach addresses token constraints by instructing the model to rank categories by confidence, then performing detailed reasoning only for the top-k candidates. Evaluating on a customer service dialog dataset with 65 categories, we demonstrate that our method with k=5 achieves a 13% improvement in micro F1 score over standard Chain-of-Thought prompting while using 32% fewer tokens. The approach effectively focuses reasoning resources on promising categories, with optimal efficiency around k=5. Our method enhances interpretability through explicit reasoning traces and provides controllable trade-offs between thoroughness and efficiency, representing a practical advancement for multi-label classification with LLMs.
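The two-stage prompting flow can be sketched in a few lines: a cheap ranking pass over all categories, followed by detailed reasoning restricted to the top-k shortlist. The snippet below is an illustrative sketch with a placeholder LLM call, not the paper's implementation:

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your chat-completion client here")

def classify(dialog: str, categories: list, k: int = 5) -> str:
    # Stage 1: cheap ranking pass — no reasoning, just an ordered shortlist by confidence.
    rank_prompt = (
        "Rank the following categories by how likely they apply to the dialog, most likely first. "
        f"Categories: {', '.join(categories)}\n\nDialog:\n{dialog}\n"
        "Return a comma-separated list only."
    )
    ranked = [c.strip() for c in call_llm(rank_prompt).split(",")][:k]

    # Stage 2: detailed reasoning restricted to the top-k candidates.
    reason_prompt = (
        "For each candidate category, reason step by step about whether it applies, "
        "then output the final applicable labels.\n"
        f"Candidates: {', '.join(ranked)}\n\nDialog:\n{dialog}"
    )
    return call_llm(reason_prompt)
```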
A Multi-Stage Pipeline for Accurate Handwritten Information Extraction from Financial Forms
Guanghui Wang,
Xing Zhang,
Jinze Yu,
Tomal Deb,
Xuefeng Liu,
Peiyang He
📅 Whitepaper Presentations #2 - Oct 20 3:00 PM HST
Abstract: Financial institutions continue to process millions of handwritten forms despite digital transformation efforts, creating a significant operational bottleneck. This research addresses the persistent challenge of automating handwritten data extraction from financial documents by introducing a four-stage processing pipeline that significantly outperforms existing solutions. Our approach sequentially combines targeted structural analysis, specialized optical character recognition, multimodal large language model (MLLM) verification, and database cross-validation to handle the inherent variability in handwritten content. Experimental results demonstrate exceptional accuracy with our enhanced hybrid method achieving 98.4% F1-score across diverse field types (textual, numerical, and checkbox), with perfect extraction of textual content and near-perfect numerical field recognition (98.2% F1-score). This represents a dramatic improvement over conventional systems, particularly for numerical data where precision is critical for financial transactions. The document-level accuracy of 80% substantially reduces manual review requirements, offering immediate practical value while establishing a methodological framework for combining complementary technologies to overcome individual component limitations. This research demonstrates how strategically sequenced verification steps can systematically enhance extraction reliability for mission-critical document processing applications.
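Since the sequencing of the four stages is the key design point, a structural sketch may help. Every stage below is a placeholder to be swapped for a real component (layout analysis, handwriting OCR, an MLLM client, a database lookup); none of it is the paper's code:

```python
from dataclasses import dataclass

@dataclass
class FieldResult:
    name: str
    value: str
    confidence: float

def locate_fields(form_image): ...           # stage 1: structural analysis -> (field_name, crop) pairs
def run_handwriting_ocr(crop): ...           # stage 2: specialized OCR on each field crop
def verify_with_mllm(crop, text): ...        # stage 3: MLLM checks/repairs the OCR output -> (text, conf)
def matches_reference_db(name, value): ...   # stage 4: cross-validation against known records

def extract(form_image):
    results = []
    for field_name, crop in locate_fields(form_image):
        text = run_handwriting_ocr(crop)
        text, conf = verify_with_mllm(crop, text)
        if not matches_reference_db(field_name, text):
            conf *= 0.5                      # low-confidence fields are flagged for manual review
        results.append(FieldResult(field_name, text, conf))
    return results
```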
MGT: Extending Virtual Try-Off to Multi-Garment Scenarios
Riza Velioglu,
Petra Bevandić,
Robin Chan,
Barbara Hammer
📅 Whitepaper Presentations #2 - Oct 20 3:00 PM HST
Abstract: Computer vision is transforming the fashion industry through Virtual Try-On (VTON) and Virtual Try-Off (VTOFF). VTON generates images of a person in a specified garment using a target photo and a standardized garment image, while a more challenging variant, Person-to-Person Virtual Try-On (p2p-VTON), uses a photo of another person wearing the garment. VTOFF, in contrast, extracts standardized garment images from photos of clothed individuals. We introduce Multi-Garment TryOffDiff (MGT), a diffusion-based VTOFF model capable of handling diverse garment types, including upper-body garments, lower-body garments, and dresses. MGT builds on a latent diffusion architecture with SigLIP-based image conditioning to capture garment characteristics such as shape, texture, and pattern. To address garment diversity, MGT incorporates class-specific embeddings, achieving state-of-the-art VTOFF results on VITON-HD and competitive performance on DressCode, demonstrating effectiveness across multiple garment types. When paired with VTON models, it further enhances p2p-VTON by reducing unwanted attribute transfer, such as skin tone, ensuring preservation of person-specific characteristics.
From Pixels to Context: Adapting Generative Models for Advertising at Scale
HyunHee Chung,
Taeyoung Na
📅 Whitepaper Presentations #2 - Oct 20 3:00 PM HST
Abstract: As marketing shifts toward hyper-personalization, advertisers seek to generate customized advertisement posters at scale—an inherently challenging task for traditional heuristic workflows. Generative AI offers a promising solution, but its adaptation to real-world advertising presents two key challenges: (1) generalized models fail to precisely capture target tasks and therefore require personalization, yet selecting optimal training samples and defining their inclusion criteria remains an inefficient trial-and-error process; and (2) models must be fine-tuned without sacrificing generative diversity and controllability, where controllability in advertisement poster generation specifically requires preserving the input product image without distortion. Existing methods rely on ad-hoc dataset selection and often constrain latent spaces, leading to suboptimal personalization. To address these challenges, we introduce DCD-Pipeline (Directional Context Derivative Pipeline) for systematic in-context data selection and DBA-Attention (Dual-Branch Adaptive Attention) for preserving both generalization and personalization through separate attention branches. Applied to advertising poster generation, our approach significantly improves context-aware, high-fidelity content creation, demonstrating the potential of Generative AI in scalable, industry-driven applications.
Cross-lingual Visual Text Stylization and Synthesis Incorporating Text Rendering and Diffusion Model
Minmin Shen,
Caren Chen
📅 Whitepaper Presentations #2 - Oct 20 3:00 PM HST
Abstract: Visual Text Stylization and Synthesis aims to generate text in the same style as the input text image. The task is more challenging when the input and output images are in different languages, and it remains an unaddressed issue for state-of-the-art diffusion-based image generation models. To fulfill the demand for cross-lingual visual text stylization and synthesis in commercial applications, we propose a hybrid approach that combines the strengths of two different methods, text rendering and diffusion models, to generate visual text in the same style as a reference visual text image. This approach addresses the technical challenges of cross-lingual text style transfer and produces high-quality visual text with various styles and complex textures. Moreover, our approach handles long text with multi-line layouts by incorporating large language models (LLMs). We evaluate our approach on a large test set of bilingual visual text pairs. The experiments show that the proposed approach outperforms strong baselines and, according to human perception, its outputs are comparable to human-created ones.
Toward Scalable Video Narration: A Training-free Approach using Multimodal Large Language Models
Abstract: In this paper, we introduce VideoNarrator, a novel training-free pipeline designed to generate dense video captions that offer a structured snapshot of video content. These captions offer detailed narrations with precise timestamps, capturing the nuances present in each segment of the video. Despite advancements in multimodal large language models (MLLMs) for video comprehension, these models often struggle with temporally aligned narrations and tend to hallucinate, particularly in unfamiliar scenarios. VideoNarrator addresses these challenges by leveraging a flexible pipeline where off-the-shelf MLLMs and visual-language models (VLMs) can function as caption generators, context providers, or caption verifiers. Our experimental results demonstrate that the synergistic interaction of these components significantly enhances the quality and accuracy of video narrations, effectively reducing hallucinations and improving temporal alignment. This structured approach not only enhances video understanding but also facilitates downstream tasks such as video summarization and video question answering, and can be potentially extended for advertising and marketing applications.
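A rough sketch of such a generate-then-verify loop over timestamped segments is shown below; the caption and verification calls are placeholders for whichever MLLM/VLM plays each role, and this is not the paper's pipeline:

```python
def caption_segment(frames, context: str) -> str: ...   # generator role: an off-the-shelf MLLM
def verify_caption(frames, caption: str) -> bool: ...   # verifier role: a VLM checking the caption

def narrate(video_segments, max_retries: int = 2):
    narration, context = [], ""
    for start, end, frames in video_segments:            # segments carry timestamps
        caption = caption_segment(frames, context)
        for _ in range(max_retries):                      # regenerate if the verifier rejects it
            if verify_caption(frames, caption):
                break
            caption = caption_segment(frames, context)
        narration.append((start, end, caption))
        context = caption    # context-provider role: the previous caption conditions the next segment
    return narration
```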
Training-Free Diffusion Framework for Stylized Image Generation with Identity Preservation
Mohammad Ali Rezaei,
Helia Hajikazem,
Saeed Khanehgir,
Mahdi Javanmardi
📅 Whitepaper Presentations #1 - Oct 20 10:00 AM HST
Abstract: Although diffusion models have demonstrated remarkable generative capabilities, existing style transfer techniques often struggle to maintain identity while achieving high-quality stylization. This limitation becomes particularly critical in practical applications such as advertising and marketing, where preserving the identity of featured individuals is essential for a campaign's effectiveness. It is particularly severe when subjects are distant from the camera or appear within a group, frequently leading to a significant loss of identity. To address this issue, we introduce a novel, training-free framework for identity-preserved stylized image synthesis. Key contributions include the "Mosaic Restored Content Image" technique, which significantly enhances identity retention in complex scenes, and a training-free content consistency loss that improves the preservation of fine-grained details by directing more attention to the original image during stylization. Our experiments reveal that the proposed approach substantially exceeds the baseline model in concurrently maintaining high stylistic fidelity and robust identity integrity, all without necessitating model retraining or fine-tuning.
Seeing the Signs: A Survey of Edge-Deployable OCR Models for Billboard Visibility Analysis
Maciej Szankin,
Lihang Ying,
Vidhyananth Ramasamy Venkatasamy
📅 Whitepaper Presentations #1 - Oct 20 10:00 AM HST
Abstract: Outdoor advertisements remain a critical medium for modern marketing, yet accurately verifying billboard text visibility under real-world conditions is still challenging. Traditional Optical Character Recognition (OCR) pipelines excel at cropped text recognition but often struggle with complex outdoor scenes, varying fonts, and weather-induced visual noise. Recently, multimodal Vision-Language Models (VLMs) have emerged as promising alternatives, offering end-to-end scene understanding with no explicit detection step. This work systematically benchmarks representative VLMs—including Qwen 2.5 VL 3B, InternVL3, and SmolVLM2—against a compact CNN-based OCR baseline (PaddleOCRv4) across two public datasets (ICDAR 2015 and SVT), augmented with synthetic weather distortions to simulate realistic degradation. Our results reveal that while selected VLMs excel at holistic scene reasoning, lightweight CNN pipelines still achieve competitive accuracy for cropped text at a fraction of the computational cost—an important consideration for edge deployment. To foster future research, we release our weather-augmented benchmark and evaluation code publicly.
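As a toy example of the kind of synthetic degradation involved, the snippet below blends an image toward white to mimic haze and adds Gaussian noise before running the OCR or VLM under test; the benchmark's actual weather augmentations may differ:

```python
import numpy as np
from PIL import Image

def add_fog_and_noise(img: Image.Image, fog: float = 0.4, noise_std: float = 10.0) -> Image.Image:
    """Blend toward white to mimic haze, then add Gaussian noise; a crude stand-in for weather."""
    arr = np.asarray(img).astype(np.float32)
    arr = (1 - fog) * arr + fog * 255.0
    arr = arr + np.random.normal(0.0, noise_std, arr.shape)
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))

# degraded = add_fog_and_noise(Image.open("billboard.jpg"))  # then evaluate OCR/VLM on the degraded image
```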
Align Before You Recommend: Parameter Efficient Personalization via Cross-Attentive Fusion of Hierarchical Language Models
Alicja Kwasniewska,
Chad Neal,
Marcin Bednarz
📅 Whitepaper Presentations #1 - Oct 20 10:00 AM HST
Abstract: The rapidly growing global advertising and marketing industry demands innovative machine learning systems that balance accuracy with efficiency. Recommendation systems, crucial to many platforms, require careful consideration and potential enhancements. While Large Language Models (LLMs) have transformed various domains, their potential in sequential recommendation systems remains under-explored. Pioneering works like Hierarchical Large Language Models (HLLM) demonstrated LLMs' capability for next-item recommendation but relied on computationally intensive fine-tuning, limiting widespread adoption. This work introduces HLLM+, enhancing the HLLM framework to achieve high-accuracy recommendations without full model fine-tuning. By introducing targeted alignment components between frozen LLMs, our approach matches fully-tuned model performance in popular item recommendation tasks (recall/NDCG @5/@10) while reducing training time by 30.7%. We also propose a ranking-aware loss adjustment, improving convergence and recommendation quality for popular items. Experiments show HLLM+ achieves superior performance with frozen item representations, improving recall@5 by up to 52% compared to baseline frozen models. These findings are significant for the advertising technology sector, where rapid adaptation and efficient deployment across brands are essential for maintaining competitive advantage.
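The general shape of such an alignment component, a small trainable cross-attention block sitting between two frozen towers, can be sketched as follows (illustrative PyTorch; the dimensions, residual connection, and placement are assumptions, not the HLLM+ architecture):

```python
import torch
import torch.nn as nn

class CrossAttentiveFusion(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, user_states: torch.Tensor, item_states: torch.Tensor) -> torch.Tensor:
        # the user history attends over frozen item embeddings; only this block is trained
        fused, _ = self.attn(query=user_states, key=item_states, value=item_states)
        return self.norm(user_states + fused)

user_states = torch.randn(4, 20, 768)   # frozen user-LLM outputs (batch, history length, dim)
item_states = torch.randn(4, 50, 768)   # frozen item-LLM outputs (batch, candidate items, dim)
fused_states = CrossAttentiveFusion()(user_states, item_states)   # fed to the recommendation head
```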
Human + AI for Accelerating Ad Localization Evaluation
Abstract: Adapting advertisements for multilingual audiences requires more than simple text translation. It demands preservation of visual consistency, spatial alignment, and stylistic integrity across diverse languages and formats. This paper introduces a structured framework that combines automated components with human oversight to address the complexities of advertisement localization. To the best of our knowledge, this is the first work that combines OCR, inpainting, MT, and reimposition specifically for accelerating ad localization evaluation workflows. Qualitative results across six language locales demonstrate that the proposed approach produces semantically accurate and visually coherent localized advertisements suitable for real-world applications.
Privacy-Preserving Audience Analytics: Lightweight Thermal Face Recognition for Real-Time Marketing Intelligence at the Edge
Maciej Szankin,
Jacek Ruminski
📅 Whitepaper Presentations #1 - Oct 20 10:00 AM HST
Abstract: Modern retail analytics demand real-time audience measurement, yet privacy regulations and consumer concerns limit traditional RGB cameras for demographic analysis. We present LTAS (Lightweight Thermal Architecture Search), a privacy-preserving framework that optimizes pre-trained models for edge-deployed thermal face recognition with minimal adaptation - requiring only single-batch fine-tuning on 64 images. Unlike expensive super-network approaches, LTAS leverages thermal imagery's constrained visual diversity to achieve rapid optimization. Evaluating 500 architectural variants across three thermal face datasets reveals that network depth reduction is the primary efficiency driver, achieving up to 48% parameter reduction while maintaining 82% of baseline accuracy. Depth optimization alone delivers 35-45% parameter reduction without accuracy loss, while kernel size modifications provide limited benefits. This enables real-time privacy-compliant audience analytics on resource-constrained retail devices, making thermal-based marketing measurement both practical and scalable.
📄 ArXiv Track
Short papers (up to 4 pages) hosted on arXiv and featured on the workshop website (non-archival).
A Survey of Human Synergy in Influencer Marketing Through Authenticity-Preserving Content Generation Approaches
Nafi Diallo,
Pegah Ojaghi
Abstract: The rapid integration of artificial intelligence (AI) in influencer marketing has transformed the industry, offering enhanced efficiency and scalability while raising critical concerns about authenticity, transparency, and consumer trust preservation. This survey examines current approaches to human-AI collaboration in influencer marketing, analyzing how brands can leverage AI-powered content generation while maintaining the credibility and transparent relationships that drive consumer engagement.
Through comprehensive analysis of existing literature on influencer credibility, disclosure practices, and consumer trust mechanisms, we present a taxonomy of authenticity-preserving content generation methods. Our findings reveal that strategic AI-human collaboration must navigate complex consumer persuasion knowledge, advertising literacy, and skepticism while maintaining the transparency and source credibility that underpin successful influencer marketing.
Key approaches include disclosure-aware content creation, credibility-preserving collaboration workflows, and trust-building mechanisms that acknowledge consumer sophistication in detecting commercial intent. We identify critical challenges including the need for appropriate disclosure of AI assistance, maintaining para-social relationships in technology-mediated content creation, and balancing efficiency gains with credibility preservation. As consumer advertising literacy continues to evolve, this survey offers a comprehensive foundation for navigating the balance between technological innovation and the transparency, credibility, and authentic engagement that drive sustainable influencer marketing success.
DocQIR-Emb: Document Image Retrieval with Multi-lingual Question Query
Chih-Hui Ho,
Giovanna Carreira Marinho,
Felipe Viana,
Varad Pimpalkhute,
Rodolfo Luis Tonoli,
Andre Von Zuben
Abstract: Document image retrieval is crucial for document understanding. Unlike standard text-to-image tasks that align captions with natural images, it requires interpreting user questions and retrieving relevant tables or scientific figures. Domain gaps between captions and queries, as well as natural and scientific images, make existing retrieval models ineffective. To study this, we introduce DocQIR, a multilingual benchmark covering 5 languages. Our study shows that off-the-shelf models fail when queries are multilingual. To address this, we propose DocQIR-Emb, which uses a multilingual text encoder and a VLM to map questions and images into a shared space. The text encoder is frozen, while the VLM is optimized. Experiments show DocQIR-Emb improves retrieval by over 40% across both tables and scientific images.
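A common way to train such a shared space is a symmetric contrastive (InfoNCE-style) objective between question and image embeddings; the sketch below illustrates that general recipe and is not the paper's training code:

```python
import torch
import torch.nn.functional as F

def info_nce(question_emb: torch.Tensor, image_emb: torch.Tensor, temperature: float = 0.07):
    q = F.normalize(question_emb, dim=-1)   # frozen multilingual text-encoder outputs
    v = F.normalize(image_emb, dim=-1)      # trainable image-side (VLM) outputs
    logits = q @ v.t() / temperature        # similarity of every question to every image in the batch
    targets = torch.arange(len(q))          # matching question-image pairs sit on the diagonal
    return F.cross_entropy(logits, targets)

loss = info_nce(torch.randn(8, 512), torch.randn(8, 512, requires_grad=True))
loss.backward()   # with the text side frozen, gradients flow only into the image-side parameters
```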
LLM-HYPE: Generative CTR Modeling for Cold-Start Ad Personalization via Multimodal LLM-Based Hypernetworks
Abstract: In online advertising platforms, newly introduced promotional ads suffer from the cold-start problem: they lack sufficient user feedback for model training. In this work, we propose LLM-HYPE, a novel framework that treats large language models (LLMs) as hypernetworks to directly generate the parameters of the click-through rate (CTR) estimator—without any training labels. LLM-HYPE uses few-shot Chain-of-Thought prompting over multi-modal ad content (text and images) to infer feature-wise model weights for a linear CTR predictor. By retrieving semantically similar past campaigns via CLIP embeddings and formatting them into prompt-based demonstrations, the LLM learns to reason about customer intent, feature influence, and content relevance. To ensure numerical stability and servability, we introduce normalization and calibration techniques that align the generated weights with production-ready CTR distributions. Extensive offline experiments show that LLM-HYPE significantly outperforms cold-start baselines and even surpasses traditionally trained models in NDCG@10. A 30-day online A/B test demonstrates that the generated models achieve CTR performance comparable to continuously trained systems—without incurring labeling or retraining costs.
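The overall flow (retrieve similar past campaigns, prompt the LLM for per-feature weights, then normalize them for a linear predictor) can be sketched as below; the embedding and LLM calls are placeholders, and none of this is the production system described in the paper:

```python
import numpy as np

def embed(ad) -> np.ndarray: ...                         # e.g., a CLIP embedding of the ad's text + image
def llm_generate_weights(prompt: str) -> dict: ...       # LLM returns {feature_name: weight}

def cold_start_ctr_model(new_ad, past_campaigns, feature_names, k: int = 5):
    # Few-shot retrieval: the k most similar historical campaigns become demonstrations.
    sims = [float(embed(new_ad) @ embed(c["ad"])) for c in past_campaigns]
    demos = [past_campaigns[i] for i in np.argsort(sims)[-k:]]
    prompt = "\n".join(f"Ad: {d['summary']} -> weights: {d['weights']}" for d in demos)
    prompt += f"\nAd: {new_ad['summary']} -> weights:"
    raw = llm_generate_weights(prompt)

    # Normalize the generated weights so the predictor is numerically stable at serving time.
    w = np.array([raw.get(f, 0.0) for f in feature_names])
    w = w / (np.linalg.norm(w) + 1e-8)
    return lambda x: 1.0 / (1.0 + np.exp(-(w @ x)))       # linear CTR predictor with a sigmoid
```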
Human + AI for Accelerating Ad Localization Evaluation
Abstract: Adapting advertisements for multilingual audiences requires more than simple text translation; it demands preservation of visual consistency, spatial alignment, and stylistic integrity across diverse languages and formats. We introduce a structured framework that combines automated components with human oversight to address the complexities of advertisement localization. To the best of our knowledge, this is the first work to integrate scene text detection, inpainting, machine translation (MT), and text reimposition specifically for accelerating ad localization evaluation workflows. Qualitative results across six locales demonstrate that our approach produces semantically accurate and visually coherent localized advertisements, suitable for deployment in real-world workflows.
CAP: Evaluation of Persuasive and Creative Image Generation
Aysan Aghazadeh,
Adriana Kovashka
Abstract: We introduce a new framework with three metrics, Creativity, Alignment, and Persuasiveness (CAP), for evaluating advertisement image generation. Current Text-to-Image (T2I) methods excel with explicit descriptions but struggle with generating creative and persuasive images from implicit prompts. We highlight these weaknesses and present a simple yet effective approach to enhance the creativity, alignment, and persuasiveness of generated images.
The Face of Persuasion: Analyzing Bias and Generating Culture-Aware Ads
Aysan Aghazadeh,
Adriana Kovashka
Abstract: Text-to-image models are appealing for customizing visual advertisements and targeting specific populations. We investigate this potential by examining the demographic bias within ads for different ad topics, and the disparate level of persuasiveness (judged by models) of ads that are identical except for the gender/race of the people portrayed. We also experiment with a technique to target ads for specific countries.
Leveraging Artificial Empathy for Permission-Based Advertising
Xinying Hao,
Garrett Sonnier
Abstract: Ad avoidance behavior is an increasingly important problem for social platforms that rely on permission advertising revenues. A common tactic employed by platforms is to impose a period of forced ad exposure. For example, YouTube typically requires viewers to watch the first five seconds of an ad, after which the viewer can choose to skip the ad and proceed to the desired content. In this paper, we develop a model that quantifies the effects of forced ad exposure on consumers' emotions and ad-skipping behavior when watching online video advertisements. We use artificial empathy (i.e., facial recognition technology) to measure emotions in a way that is completely unobtrusive, thus avoiding mere measurement effects. Leveraging computer vision techniques, we also extract frame-level features from video advertisements. Our Bayesian dynamic generalized linear model captures the temporal trajectory of consumer emotions under forced and unforced ad exposure conditions as well as the dynamics of the consumer's ad skipping behavior. Our results indicate that forced ad exposure largely ignites contempt and disgust and suppresses feelings of surprise. Surprise and anger cause a decrease in skipping probability while contempt, disgust, and sadness increase the risk of losing the audience's attention. Moreover, we find a high carryover effect in the skipping propensity, which highlights the importance of capturing viewers' attention in the opening seconds. These findings offer actionable insights for ad designers and platforms, especially in the era of generative AI and personalized video content.