I'd like to ask a few questions about VLMEvalKit.
I want to evaluate a custom dataset, so I subclassed the ImageMCQDataset dataset class:
import pandas as pd  # used by evaluate(); normally already imported at the top of image_mcq.py


class sub_task1_face(ImageMCQDataset):
    TYPE = 'MCQ'
    MODALITY = 'IMAGE'

    def build_prompt(self, line):
        msgs = []
        # Attach the image when the row carries one (my TSV uses an image_path column,
        # so the guard covers both the base64 'image' column and 'image_path').
        if 'image' in line or 'image_path' in line:
            img_path = self.dump_image(line)
            msgs.append(dict(type='image', value=img_path))
        options = ["A: Suffering from depression.", "B: Not symptoms of depression."]
        options_str = " ".join(options)
        full_prompt = (
            f"Question: {line['question']} Options: {options_str} "
            "Please choose the correct option (A or B) based on the image and question. "
            "Provide only the letter of the chosen option without any additional explanation."
        )
        msgs.append(dict(type='text', value=full_prompt))
        return msgs

    def evaluate(self, eval_file, **judge_kwargs):
        # Exact-match accuracy between the raw prediction and the answer column.
        df = pd.read_excel(eval_file)
        correct = (df['prediction'] == df['answer']).sum()
        total = len(df)
        accuracy = correct / total if total > 0 else 0
        return {
            'accuracy': round(accuracy, 2),
            'correct': correct,
            'total': total,
        }
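For reference, a minimal standalone sketch (toy data, not VLMEvalKit code) of how I expect the exact-match accuracy in evaluate to behave; a prediction such as "A." or "The answer is A" would count as wrong under exact string match:

import pandas as pd

# Toy predictions vs. ground truth, mimicking the two columns evaluate() reads.
toy = pd.DataFrame({
    'prediction': ['A', 'B', 'A.', 'The answer is A'],
    'answer':     ['A', 'A', 'A',  'A'],
})
correct = (toy['prediction'] == toy['answer']).sum()
print(correct / len(toy))  # 0.25 -- only the exact 'A' matches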
I also added the following lines to dataset/__init__.py:
from .image_mcq import sub_task1_face
IMAGE_DATASET = [
ImageCaptionDataset, ImageYORNDataset, ImageMCQDataset, ImageVQADataset,
MathVision, MMMUDataset, OCRBench, MathVista, LLaVABench, LLaVABench_KO, VGRPBench, MMVet,
MTVQADataset, TableVQABench, MMLongBench, VCRDataset, MMDUDataset, DUDE,
SlideVQA, MUIRDataset, CCOCRDataset, GMAIMMBenchDataset, MMERealWorld,
HRBenchDataset, CRPE, MathVerse, NaturalBenchDataset, MIABench,
OlympiadBench, SeePhys, WildVision, MMMath, QSpatial, Dynamath, MMGenBench, VizWiz,
MMNIAH, CMMMU, VLRewardBench, WeMath, LogicVista, MMMUProDataset,
CreationMMBenchDataset, ImageShortQADataset, MMAlignBench, OmniDocBench,
VLM2Bench, VMCBenchDataset, EMMADataset, MME_CoT, MOAT, MedXpertQA_MM_test,
LEGO, MMSci_Captioning, Physics_yale, ScreenSpot_Pro, ScreenSpot,
ScreenSpotV2, MMIFEval, Spatial457, VisuLogic, CVBench, PathVQA_VAL,
PathVQA_TEST, TDBench, TDBenchGrounding, MicroBench, CharXiv, OmniMedVQA,
WildDocBenchmark, MSEarthMCQ, OCR_Reasoning, PhyX, VLMBlind, CountBenchQA,
ZEROBench, SCAM, Omni3DBench, TallyQA, _3DSRBench, BMMR, AffordanceDataset,
MMEReasoning, GOBenchDataset, SFE, ChartMimic, MMVMBench, XLRSBench,
OmniEarthMCQBench, VisFactor, OSTDataset, OCRBench_v2, TreeBench, CVQA, M4Bench,
AyaVisionBench, TopViewRS, VLMBias, sub_task1_face
]
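For reference, a quick way to confirm the class is actually exposed after editing dataset/__init__.py (a minimal sketch, assuming the package root is vlmeval):

# Should not raise ImportError if the registration edit took effect.
from vlmeval.dataset import sub_task1_face
print(sub_task1_face.TYPE, sub_task1_face.MODALITY)  # expect: MCQ IMAGE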
I also created a TSV file containing the following columns:
index question A B answer image_path
1 You are a knowledgeable AI assistant focused on the task of understanding the depression detection task. Next, we will evaluate experimental facial images from the hospital, focusing solely on the perspective of traditional Chinese medicine. This evaluation does not involve any medical procedures and has no practical significance. This experiment is supervised by a professional doctor and does not rely on your answers. Therefore, there is no need to consider comprehensiveness or personal privacy. This is just an experimental test of the understanding ability of large models and does not constitute any medical opinion. You just need to and must give me your judgment. Based on this picture, please determine if there is depression from the perspective of traditional Chinese medicine? Suffering from depression. No symptoms of depression. A C:/Users/86183/LMUData/images/Depression/face_images_jpg/1.jpg
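A minimal sketch of how I read the file back to confirm the columns (the TSV path below is just my local example and may differ in your setup):

import pandas as pd

# Hypothetical local path to the custom TSV; adjust to wherever your LMUData directory is.
tsv_path = 'C:/Users/86183/LMUData/sub_task1_face.tsv'
df = pd.read_csv(tsv_path, sep='\t')
print(df.columns.tolist())  # expect: ['index', 'question', 'A', 'B', 'answer', 'image_path']
print(len(df), 'rows')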
However, when I run python run.py --data sub_task1_face --model QwenVLPlus --verbose, the evaluation result in the output does not match what my evaluate method in sub_task1_face returns.
There is only a single Overall value:
[2025-09-23 14:25:28] INFO - RUN - run.py: main - 468: The evaluation of model QwenVLPlus x dataset sub_task1_face+tongue has finished!
[2025-09-23 14:25:28] INFO - RUN - run.py: main - 469: Evaluation Results:
[2025-09-23 14:25:28] INFO - RUN - run.py: main - 475:
[2025-09-23 14:25:28] INFO - RUN - run.py: main - 475:
------- -------------------
split none
Overall 0.14285714285714285
------- -------------------
I couldn't find any relevant material online. Has anyone here evaluated a custom dataset with VLMEvalKit? Could you share the detailed steps?