
Showing 1–6 of 6 results for author: Tutar, I

Searching in archive cs.
  1. arXiv:2406.02987  [pdf, other]

    cs.CV

    Enhancing Multimodal Large Language Models with Multi-instance Visual Prompt Generator for Visual Representation Enrichment

    Authors: Wenliang Zhong, Wenyi Wu, Qi Li, Rob Barton, Boxin Du, Shioulin Sam, Karim Bouyarmane, Ismail Tutar, Junzhou Huang

    Abstract: Multimodal Large Language Models (MLLMs) have achieved SOTA performance in various visual language tasks by fusing the visual representations with LLMs leveraging some visual adapters. In this paper, we first establish that adapters using query-based Transformers such as Q-former is a simplified Multi-instance Learning method without considering instance heterogeneity/correlation. We then propose…

    Submitted 5 June, 2024; originally announced June 2024.

  2. arXiv:2401.13795  [pdf, other]

    cs.CV

    Diffuse to Choose: Enriching Image Conditioned Inpainting in Latent Diffusion Models for Virtual Try-All

    Authors: Mehmet Saygin Seyfioglu, Karim Bouyarmane, Suren Kumar, Amir Tavanaei, Ismail B. Tutar

    Abstract: As online shopping is growing, the ability for buyers to virtually visualize products in their settings-a phenomenon we define as "Virtual Try-All"-has become crucial. Recent diffusion models inherently contain a world model, rendering them suitable for this task within an inpainting context. However, traditional image-conditioned diffusion models often fail to capture the fine-grained details of…

    Submitted 24 January, 2024; originally announced January 2024.

  3. arXiv:2308.16354  [pdf, other]

    cs.CV

    Catalog Phrase Grounding (CPG): Grounding of Product Textual Attributes in Product Images for e-commerce Vision-Language Applications

    Authors: Wenyi Wu, Karim Bouyarmane, Ismail Tutar

    Abstract: We present Catalog Phrase Grounding (CPG), a model that can associate product textual data (title, brands) into corresponding regions of product images (isolated product region, brand logo region) for e-commerce vision-language applications. We use a state-of-the-art modulated multimodal transformer encoder-decoder architecture unifying object detection and phrase-grounding. We train the model in…

    Submitted 30 August, 2023; originally announced August 2023.

    Comments: KDD 2022 Workshop on First Content Understanding and Generation for e-Commerce

  4. arXiv:2305.01257  [pdf, other]

    cs.CV cs.AI

    DreamPaint: Few-Shot Inpainting of E-Commerce Items for Virtual Try-On without 3D Modeling

    Authors: Mehmet Saygin Seyfioglu, Karim Bouyarmane, Suren Kumar, Amir Tavanaei, Ismail B. Tutar

    Abstract: We introduce DreamPaint, a framework to intelligently inpaint any e-commerce product on any user-provided context image. The context image can be, for example, the user's own image for virtual try-on of clothes from the e-commerce catalog on themselves, the user's room image for virtual try-on of a piece of furniture from the e-commerce catalog in their room, etc. As opposed to previous augmented-…

    Submitted 2 May, 2023; originally announced May 2023.

  5. arXiv:2204.05555  [pdf, other]

    cs.CL cs.AI cs.IR cs.LG

    Solving Price Per Unit Problem Around the World: Formulating Fact Extraction as Question Answering

    Authors: Tarik Arici, Kushal Kumar, Hayreddin Çeker, Anoop S V K K Saladi, Ismail Tutar

    Abstract: Price Per Unit (PPU) is an essential information for consumers shopping on e-commerce websites when comparing products. Finding total quantity in a product is required for computing PPU, which is not always provided by the sellers. To predict total quantity, all relevant quantities given in a product attributes such as title, description and image need to be inferred correctly. We formulate this p…

    Submitted 12 April, 2022; originally announced April 2022.

    Comments: 8 pages, 2 figures, 8 tables. This work was accepted in TrueFact KDD '21 workshop

  6. arXiv:2109.12178  [pdf, other]

    cs.CV cs.AI cs.LG

    MLIM: Vision-and-Language Model Pre-training with Masked Language and Image Modeling

    Authors: Tarik Arici, Mehmet Saygin Seyfioglu, Tal Neiman, Yi Xu, Son Train, Trishul Chilimbi, Belinda Zeng, Ismail Tutar

    Abstract: Vision-and-Language Pre-training (VLP) improves model performance for downstream tasks that require image and text inputs. Current VLP approaches differ on (i) model architecture (especially image embedders), (ii) loss functions, and (iii) masking policies. Image embedders are either deep models like ResNet or linear projections that directly feed image-pixels into the transformer. Typically, in a…

    Submitted 24 September, 2021; originally announced September 2021.