This repo contains the data and evaluation code for the paper "Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation".
- 🔥[2025-02-06]: We have released our FG-BMK benchmark!
FG-BMK is a comprehensive fine-grained evaluation benchmark comprising 1.01 million questions and 0.28 million images, providing a robust test bed for evaluating LVLMs. FG-BMK incorporates two evaluation paradigms: human-oriented and machine-oriented. The human-oriented evaluation employs dialogue-like interactions to assess a model's ability to understand and respond to fine-grained visual queries in a conversational context. The machine-oriented evaluation focuses on two core fine-grained vision tasks, image retrieval and recognition, to directly measure the feature representation capabilities of LVLMs. Compared with existing efforts, which primarily focus on fine-grained classification or contain only a limited number of questions, FG-BMK enables a comprehensive assessment of LVLMs' fine-grained feature representation and semantic recognition abilities. Our evaluations of eight open-source LVLMs/VLMs uncover key findings regarding the influence of training paradigms, modality alignment, perturbation susceptibility, and fine-grained category reasoning on task performance.
Before running the inference, you need to download the corresponding dataset images. The links to the dataset projects are provided below, and you can download the datasets to any location:
For human-oriented evaluations, we have pre-generated questions for each image within the dataset, as detailed in files such as benchmark/human-oriented/attribute_recognition/cub_attribute_questions.jsonl.
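Each question file is plain JSONL (one JSON object per line), so it can be loaded with the standard library alone. A minimal loader sketch (the exact field names may vary between task files):

```python
import json

def load_questions(path):
    """Load a pre-generated question file: one JSON object per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```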
For machine-oriented evaluations, the dataset categories, along with the corresponding training set train.csv and test set test.csv, can be found in directories like benchmark/machine-oriented/aircraft.
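The split files are ordinary CSVs and can be read with the standard library. A minimal sketch; the column names used here (image, label) are assumptions, so check the header of the actual train.csv/test.csv first:

```python
import csv

def load_split(csv_path):
    """Read a split file into (image, label) pairs.
    The 'image' and 'label' column names are hypothetical."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return [(row["image"], row["label"]) for row in csv.DictReader(f)]
```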
Our evaluation includes eight LVLMs/VLMs. Below, we list the project page for each model to help you configure the corresponding environment and download the relevant checkpoint.
| Model Projects | Checkpoints |
|---|---|
| InternVL | InternVL-Chat-V1.1 |
| LLaVA-1.5 | LLaVA-1.5-7B |
| Qwen-VL | Qwen-VL-Chat |
| BLIP-2 | BLIP-2-FLAN-T5-XL |
| EVA-CLIP | EVA02_CLIP_L_psz14_s4B |
| BEiT3 | BEiT3-large-itc |
| CoCa | CoCa-L |
| DINOv2 | DINOv2-L |
To use your own model and produce the final answers, you first need to modify the model loading code in human_evaluation_demo.py to adapt it to your specific model. Here is an example of loading the InternVL model:

```python
# Load the InternVL model, tokenizer, and image processor
import torch
from transformers import AutoModel, AutoTokenizer, CLIPImageProcessor

model = AutoModel.from_pretrained(
    args.model_path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(args.model_path, trust_remote_code=True, use_fast=False)
image_processor = CLIPImageProcessor.from_pretrained(args.model_path)
```
Then, the model answers the questions through its own inference code, for example:

```python
# Load and preprocess the image
from PIL import Image
image = Image.open(image_path).convert('RGB').resize((448, 448))
pixel_values = image_processor(images=image, return_tensors='pt').pixel_values.to(torch.bfloat16).cuda()

# Generate the response
generation_config = dict(max_new_tokens=1024, do_sample=True)
response = model.chat(tokenizer, pixel_values, prompt_text, generation_config)
```
After modifying the model loading code, configure the model-path (checkpoint), question-file, image-folder (path to where the dataset is stored), and answers-file (output path) in run_human_demo.sh, then run the demo:

```shell
bash run_human_demo.sh
# The code splits the question file based on the number of GPUs and runs inference concurrently.
```

The outputs will be merged into one file in the following format:

```json
{"question_id": 1, "image": "images/001.Black_footed_Albatross/Black_Footed_Albatross_0078_796126.jpg", "prompt": "Is the genus of the object geococcyx? Answer with yes or no.", "text": "No", "class": "no", "category": "generic"}
{"question_id": 2, "image": "images/001.Black_footed_Albatross/Black_Footed_Albatross_0003_796136.jpg", "prompt": "Is the genus of the object raven? Answer with yes or no.", "text": "No", "class": "no", "category": "generic"}
```
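The splitting step can be sketched as a simple round-robin partition (the actual chunking strategy in run_human_demo.sh may differ):

```python
def split_for_gpus(questions, num_gpus):
    """Give each GPU a near-equal, disjoint share of the questions."""
    return [questions[i::num_gpus] for i in range(num_gpus)]
```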
Finally, use answer_acc.py to compute the accuracy of the model's answers:

```shell
python answer_acc.py
```

Please refer to the example output for the detailed prediction file format. We also provide inference code for Qwen-VL and BLIP-2 as references.
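As a rough illustration of what such scoring involves (answer_acc.py's exact normalization may differ), accuracy can be computed by comparing each record's predicted "text" against its ground-truth "class" field, grouped by "category":

```python
import json
from collections import defaultdict

def accuracy_per_category(answers_path):
    """Per-category accuracy: normalized 'text' vs. ground-truth 'class'."""
    correct, total = defaultdict(int), defaultdict(int)
    with open(answers_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            rec = json.loads(line)
            cat = rec.get("category", "all")
            total[cat] += 1
            pred = rec["text"].strip().lower().rstrip(".")
            if pred == rec["class"].strip().lower():
                correct[cat] += 1
    return {c: correct[c] / total[c] for c in total}
```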
To evaluate the LVLM's feature representation ability, you first need to modify the feature extraction code in models.py. Here is an example of the CoCa feature extraction code:

```python
def coca(model_name, pretrained, cache_dir):
    from open_clip import create_model_and_transforms

    def _hook(self, _, input, output):
        self.feat.append(output.transpose(0, 1))

    def get_intermediate_layers(self, x, n=1, return_class_token=True):
        self.feat = []
        self(x)
        class_tokens = [out[:, 0] for out in self.feat]
        outputs = [out[:, 1:] for out in self.feat]
        return tuple(zip(outputs, class_tokens))

    model, _, preprocess = create_model_and_transforms(model_name, pretrained, cache_dir=cache_dir)
    model = model.visual
    model.eval()
    model.cuda()
    model.__class__._hook = _hook
    model.__class__.get_intermediate_layers = get_intermediate_layers
    model.transformer.resblocks[-2].register_forward_hook(model._hook)
    model.transformer.resblocks[-1].register_forward_hook(model._hook)
    return model
```
In this module, we use _hook together with the defined get_intermediate_layers method to extract the visual features from the last two layers of the vision encoder. We then concatenate the cls token and image tokens in the predefined order and return the patched CoCa vision model. Examples of visual feature extraction using EVA-CLIP and Qwen-VL are already provided in models.py.
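The return structure of get_intermediate_layers can be illustrated framework-free, with nested lists standing in for tensors (a toy mimic of the slicing, not the real tensor code):

```python
def split_cls_and_patches(feat):
    """Mimic get_intermediate_layers: feat is a list of per-layer token lists,
    each shaped [cls, patch_1, patch_2, ...]; returns one (patches, cls) pair
    per hooked layer."""
    class_tokens = [layer[0] for layer in feat]
    outputs = [layer[1:] for layer in feat]
    return tuple(zip(outputs, class_tokens))
```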
Once you've made the modifications, simply import the model in eval_linear.py or eval_retrieval.py:

```python
from dinov2.utils.config import setup
from models import coca

torch.backends.cudnn.benchmark = True
model = coca('coca_ViT-L-14', 'laion2b_s13b_b90k', '.cache')
config = setup(args)
autocast_dtype = torch.float16
```
Then run the demo by executing:

```shell
python eval_linear.py
```

or

```shell
python eval_retrieval.py
```

The output log will look like:

```
I20240504 13:34:23 16157 dinov2 helpers.py:103] Training [ 0/10000] eta: 8:13:21 loss: 143.1147 (143.1147) lr: 0.0005 (0.0005) time: 2.960182 data: 2.449736 max mem: 2711
...
```
After the training completes, the code will automatically run inference on the test set of the fine-grained dataset and report the results.
