The code implementation does not match the description in the paper.

@zhangshaolei1998 
Hi there, thanks for the great work. While trying to reproduce it, I read through your code and found that the implementation differs from the description in the paper.
First, the paper describes the **pre-fusion** module as taking the original visual input $H^{V}$ (as shown in Figure 6 of the paper). However, in `llavamini/model/llavamini_arch.py`, at lines 369 and 410, the **compressed visual tokens** (named `compressed_image_features`) are also fed into the **pre-fusion** module, as shown below:
```python
#line 369
x=torch.cat([global_image_features,compressed_image_features,text_embedding],dim=1)
#line 410
# modality pre-fusion
for layer in self.get_model().prefusion_layers:
    x = layer(x,attention_mask=attention_mask,position_ids=position_ids)[0]
```
Second, the compressed visual tokens that are ultimately returned (`compressed_image_features` in your code) are sliced out of $x$, the output of the **pre-fusion** module, which does not match the description in the paper. See `llavamini/model/llavamini_arch.py`, lines 410-417:
```python
# modality pre-fusion
for layer in self.get_model().prefusion_layers:
    x = layer(x,attention_mask=attention_mask,position_ids=position_ids)[0]

fusion_text_features=x[:,-1*input_ids.size(1):,:]
compressed_image_features=x[:,-1*input_ids.size(1)-1*compressed_image_features.size(1):-1*input_ids.size(1),:]
fusion_text_features=fusion_text_features*(~padding_mask).unsqueeze(-1).int()+all_text_embedding*padding_mask.unsqueeze(-1)

return compressed_image_features,fusion_text_features
```
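To make the slicing above easier to follow, here is a minimal sketch in plain Python that applies the same negative-index slices to a list of token labels instead of a tensor. The token counts (`num_global`, `num_compressed`, `num_text`) are hypothetical values chosen for illustration, not the repository's settings, and the token order is assumed from the `torch.cat` call quoted from line 369.

```python
# Hypothetical sizes, chosen only for illustration (not the repo's values).
num_global = 1       # global image tokens
num_compressed = 4   # compressed image tokens
num_text = 3         # text tokens, i.e. input_ids.size(1) in the quoted code

# Assumed sequence order, following the quoted torch.cat call:
# [global image tokens, compressed image tokens, text tokens]
sequence = (
    [f"global_{i}" for i in range(num_global)]
    + [f"compressed_{i}" for i in range(num_compressed)]
    + [f"text_{i}" for i in range(num_text)]
)

# The same negative-index slices as the quoted code, on the sequence axis only.
fusion_text = sequence[-num_text:]
compressed = sequence[-num_text - num_compressed:-num_text]

print(compressed)
print(fusion_text)
```

Under these assumptions, the second slice selects exactly the positions that held the compressed tokens on input, so the returned `compressed_image_features` are the pre-fusion outputs at those positions rather than the tokens that entered the module.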
Could anyone explain the reason for this inconsistency, or have I misunderstood something?
