@zhangshaolei1998
Hi there, thanks for the great work. While trying to reproduce it, I read through the code and found that the implementation differs from the description in the paper.
First, the paper describes the pre-fusion module as taking the original visual input $H^{V}$ (as shown in Figure 6 of the paper). However, in llavamini/model/llavamini_arch.py, at lines 369 and 410, the compressed visual tokens (named compressed_image_features) are also fed into the pre-fusion module, as shown below:
# line 369
x=torch.cat([global_image_features,compressed_image_features,text_embedding],dim=1)
# line 410
# modality pre-fusion
for layer in self.get_model().prefusion_layers:
    x = layer(x,attention_mask=attention_mask,position_ids=position_ids)[0]
Second, the compressed visual tokens (compressed_image_features in your code) are ultimately taken from $x$, the output of the pre-fusion module, which does not match the description in the paper. See llavamini/model/llavamini_arch.py, lines 410-417:
# modality pre-fusion
for layer in self.get_model().prefusion_layers:
    x = layer(x,attention_mask=attention_mask,position_ids=position_ids)[0]
fusion_text_features=x[:,-1*input_ids.size(1):,:]
compressed_image_features=x[:,-1*input_ids.size(1)-1*compressed_image_features.size(1):-1*input_ids.size(1),:]
fusion_text_features=fusion_text_features*(~padding_mask).unsqueeze(-1).int()+all_text_embedding*padding_mask.unsqueeze(-1)
return compressed_image_features,fusion_text_features
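To make my reading of those slices concrete, here is a minimal sketch. It uses a plain Python list in place of the sequence dimension (dim=1) of `x`, and the token counts and the helper name `split_prefusion_output` are made up for illustration, not taken from the repository. It assumes the concatenation order from line 369: global image features, then compressed image features, then text.

```python
# Hypothetical token counts, chosen only for illustration.
n_global, n_compressed, n_text = 4, 1, 3

# Label each position of the sequence that enters pre-fusion, in the order
# used at line 369: [global | compressed | text].
x = ["global"] * n_global + ["compressed"] * n_compressed + ["text"] * n_text


def split_prefusion_output(x, n_compressed, n_text):
    """Mirror the negative-index slices from lines 410-417 on a plain list.

    Assumes n_text > 0, since a slice ending at -0 would be empty.
    """
    fusion_text = x[-n_text:]                        # last n_text positions
    compressed = x[-n_text - n_compressed:-n_text]   # block just before the text
    return compressed, fusion_text


compressed, fusion_text = split_prefusion_output(x, n_compressed, n_text)
print(compressed)   # ['compressed']
print(fusion_text)  # ['text', 'text', 'text']
```

If this reading is right, the returned compressed_image_features are the positions of `x` that the compressed tokens occupied on input, after they have passed through the pre-fusion layers together with the global image features and the text.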
Could anyone explain the reason for this inconsistency, or did I misunderstand something?