Details the Q-Former architecture: a 12-layer BERT-based model using 32 learnable query embeddings. These queries use cross-attention to extract visual information for MLLM input.

Visual Prompt Generation: Cross-Attention in Q-Former

2025/11/20 00:00

Abstract and 1 Introduction

  2. Related Work

    2.1. Multimodal Learning

    2.2. Multiple Instance Learning

  3. Methodology

    3.1. Preliminaries and Notations

    3.2. Relations between Attention-based VPG and MIL

    3.3. MIVPG for Multiple Visual Inputs

    3.4. Unveiling Instance Correlation in MIVPG for Enhanced Multi-instance Scenarios

  4. Experiments and 4.1. General Setup

    4.2. Scenario 1: Samples with Single Image

    4.3. Scenario 2: Samples with Multiple Images, with Each Image as a General Embedding

    4.4. Scenario 3: Samples with Multiple Images, with Each Image Having Multiple Patches to be Considered and 4.5. Case Study

  5. Conclusion and References

Supplementary Material

A. Detailed Architecture of QFormer

B. Proof of Proposition

C. More Experiments

Figure 7. Overview of QFormer

A. Detailed Architecture of QFormer

The architecture overview is depicted in Figure 7. Specifically, QFormer is initialized as a BERT-based model [8] comprising a total of L = 12 layers. In contrast to typical BERT models, which process textual inputs, QFormer takes R = 32 learnable query embeddings as input. During Stage-1 pretraining in BLIP2 [22], these embeddings are used to extract visual information from the input visual data; after projection, they serve as visual prompt embeddings for the LLM input.
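The extraction step above can be illustrated with plain scaled dot-product cross-attention: the learnable queries attend over the visual embeddings and pull out a fixed-size summary. This is a minimal single-head NumPy sketch with hypothetical dimensions (d = 64, N = 196 patch embeddings); the actual model uses multi-head attention inside BERT layers with learned weights:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, visual, Wq, Wk, Wv):
    # queries: (R, d) learnable query embeddings (the attention queries)
    # visual:  (N, d) visual embeddings from the image encoder (keys/values)
    Q = queries @ Wq                                 # (R, d)
    K = visual @ Wk                                  # (N, d)
    V = visual @ Wv                                  # (N, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # (R, N) similarity of each query to each patch
    return softmax(scores) @ V                       # (R, d) fixed-size visual summary

rng = np.random.default_rng(0)
d, R, N = 64, 32, 196                                # hypothetical sizes; R = 32 queries as in the paper
queries = rng.normal(size=(R, d)) * 0.02             # learnable, randomly initialized
visual = rng.normal(size=(N, d))                     # e.g. ViT patch embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.02 for _ in range(3))

out = cross_attention(queries, visual, Wq, Wk, Wv)
print(out.shape)                                     # (32, 64): one summary vector per query
```

Whatever the number of input patches N, the output always has R rows, which is what makes the queries usable as a fixed-length visual prompt for the LLM after projection.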

Inside the QFormer, each layer includes a self-attention module composed of a multi-head attention component and a feed-forward module (consisting of a linear layer, layer normalization, and a residual connection). A cross-attention module, initialized with random values, is inserted every G layers; there, the learnable query embeddings attend to the visual embeddings. In the main paper, for conciseness, we condensed the multi-head attention and feed-forward modules into self- (cross-) attention modules, and we illustrated only the modifications MIVPG makes to the cross-attention module, since the self-attention modules remain unchanged. The final QFormer output is the last layer's query embeddings.
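The layer structure described above can be sketched as a loop over L = 12 layers, with cross-attention inserted only every G layers. This is a rough single-head NumPy sketch under assumed values (G = 2, d = 64, fresh random matrices standing in for the learned per-layer weights), not the actual multi-head implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def attend(q_in, kv_in, Wq, Wk, Wv):
    # Single-head attention as a stand-in for a multi-head module.
    Q, K, V = q_in @ Wq, kv_in @ Wk, kv_in @ Wv
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

L, G, R, N, d = 12, 2, 32, 196, 64                 # assumed sizes; G = 2 is hypothetical here
rng = np.random.default_rng(1)
weights = lambda: [rng.normal(size=(d, d)) * 0.02 for _ in range(3)]

x = rng.normal(size=(R, d)) * 0.02                 # learnable query embeddings
visual = rng.normal(size=(N, d))                   # frozen image-encoder outputs

for layer in range(L):
    # Self-attention among the queries, then residual + LayerNorm.
    x = layer_norm(x + attend(x, x, *weights()))
    if layer % G == 0:
        # Cross-attention to visual embeddings, inserted every G layers.
        x = layer_norm(x + attend(x, visual, *weights()))
    # Feed-forward (a single linear map here), then residual + LayerNorm.
    x = layer_norm(x + x @ (rng.normal(size=(d, d)) * 0.02))

print(x.shape)                                     # (32, 64): the last layer's query embeddings
```

The self-attention and feed-forward sublayers run in every layer, while the `layer % G == 0` guard reproduces the periodic insertion of cross-attention; the loop's final `x` corresponds to the QFormer output described above.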

For a more comprehensive understanding, readers are encouraged to refer to [22].


:::info Authors:

(1) Wenliang Zhong, The University of Texas at Arlington ([email protected]);

(2) Wenyi Wu, Amazon ([email protected]);

(3) Qi Li, Amazon ([email protected]);

(4) Rob Barton, Amazon ([email protected]);

(5) Boxin Du, Amazon ([email protected]);

(6) Shioulin Sam, Amazon ([email protected]);

(7) Karim Bouyarmane, Amazon ([email protected]);

(8) Ismail Tutar, Amazon ([email protected]);

(9) Junzhou Huang, The University of Texas at Arlington ([email protected]).

:::


:::info This paper is available on arxiv under CC by 4.0 Deed (Attribution 4.0 International) license.

:::

