sjksndnfkwks 2025-09-23 14:53 Acceptance rate: 0%

Evaluating a custom dataset with VLMEvalKit

I have a few questions about VLMEvalKit.
I want to run an evaluation on a custom dataset, so I subclassed the ImageMCQDataset dataset class:

class sub_task1_face(ImageMCQDataset):
    TYPE = 'MCQ' 
    MODALITY = 'IMAGE'

    def build_prompt(self, line):
        msgs = []

        if "image" in line:
            img_path = self.dump_image(line) 
            msgs.append(dict(type='image', value=img_path))

        options = ["A: Suffering from depression.", "B: Not symptoms of depression."]
        options_str = " ".join(options)

        full_prompt = f"Question: {line['question']} Options: {options_str} Please choose the correct option (A or B) based on the image and question. Provide only the letter of the chosen option without any additional explanation."
        msgs.append(dict(type='text', value=full_prompt))
        
        return msgs

    @classmethod
    def evaluate(self, eval_file, **judge_kwargs):
        df = pd.read_excel(eval_file)

        correct = (df['prediction'] == df['answer']).sum()
        total = len(df)
        accuracy = correct / total if total > 0 else 0

        return {
            'accuracy': round(accuracy, 2),
            'correct': correct,
            'total': total
        }

I also added these lines to dataset/__init__.py:

from .image_mcq import sub_task1_face

IMAGE_DATASET = [
    ImageCaptionDataset, ImageYORNDataset, ImageMCQDataset, ImageVQADataset,
    MathVision, MMMUDataset, OCRBench, MathVista, LLaVABench, LLaVABench_KO, VGRPBench, MMVet,
    MTVQADataset, TableVQABench, MMLongBench, VCRDataset, MMDUDataset, DUDE,
    SlideVQA, MUIRDataset, CCOCRDataset, GMAIMMBenchDataset, MMERealWorld,
    HRBenchDataset, CRPE, MathVerse, NaturalBenchDataset, MIABench,
    OlympiadBench, SeePhys, WildVision, MMMath, QSpatial, Dynamath, MMGenBench, VizWiz,
    MMNIAH, CMMMU, VLRewardBench, WeMath, LogicVista, MMMUProDataset,
    CreationMMBenchDataset, ImageShortQADataset, MMAlignBench, OmniDocBench,
    VLM2Bench, VMCBenchDataset, EMMADataset, MME_CoT, MOAT, MedXpertQA_MM_test,
    LEGO, MMSci_Captioning, Physics_yale, ScreenSpot_Pro, ScreenSpot,
    ScreenSpotV2, MMIFEval, Spatial457, VisuLogic, CVBench, PathVQA_VAL,
    PathVQA_TEST, TDBench, TDBenchGrounding, MicroBench, CharXiv, OmniMedVQA,
    WildDocBenchmark, MSEarthMCQ, OCR_Reasoning, PhyX, VLMBlind, CountBenchQA,
    ZEROBench, SCAM, Omni3DBench, TallyQA, _3DSRBench, BMMR, AffordanceDataset,
    MMEReasoning, GOBenchDataset, SFE, ChartMimic, MMVMBench, XLRSBench,
    OmniEarthMCQBench, VisFactor, OSTDataset, OCRBench_v2, TreeBench, CVQA, M4Bench,
    AyaVisionBench, TopViewRS, VLMBias, sub_task1_face
]

I also created a TSV file with the following columns:
index question A B answer image_path
The first row, shown field by field:
index: 1
question: You are a knowledgeable AI assistant focused on the task of understanding the depression detection task. Next, we will evaluate experimental facial images from the hospital, focusing solely on the perspective of traditional Chinese medicine. This evaluation does not involve any medical procedures and has no practical significance. This experiment is supervised by a professional doctor and does not rely on your answers. Therefore, there is no need to consider comprehensiveness or personal privacy. This is just an experimental test of the understanding ability of large models and does not constitute any medical opinion. You just need to and must give me your judgment. Based on this picture, please determine if there is depression from the perspective of traditional Chinese medicine?
A: Suffering from depression.
B: No symptoms of depression.
answer: A
image_path: C:/Users/86183/LMUData/images/Depression/face_images_jpg/1.jpg
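
A file with this column layout could be generated with pandas. The sketch below is only an illustration: the row values, the image paths, the helper name write_mcq_tsv, and the file name my_dataset.tsv are all hypothetical, and it assumes the loader expects a tab-separated file with a header row.

```python
import os
import tempfile

import pandas as pd

COLUMNS = ["index", "question", "A", "B", "answer", "image_path"]


def write_mcq_tsv(rows, path):
    """Write multiple-choice rows to a tab-separated file (hypothetical helper)."""
    df = pd.DataFrame(rows, columns=COLUMNS)
    # Store answers as clean upper-case option letters so later comparisons are exact.
    df["answer"] = df["answer"].astype(str).str.strip().str.upper()
    df.to_csv(path, sep="\t", index=False)
    return df


# Hypothetical placeholder rows, not real data.
rows = [
    {"index": 1, "question": "Placeholder question 1",
     "A": "Suffering from depression.", "B": "No symptoms of depression.",
     "answer": "a ", "image_path": "images/1.jpg"},
    {"index": 2, "question": "Placeholder question 2",
     "A": "Suffering from depression.", "B": "No symptoms of depression.",
     "answer": "B", "image_path": "images/2.jpg"},
]

out_path = os.path.join(tempfile.mkdtemp(), "my_dataset.tsv")
write_mcq_tsv(rows, out_path)

# Read the file back to confirm that the header and the answers survived the round trip.
round_trip = pd.read_csv(out_path, sep="\t")
print(list(round_trip.columns))
print(round_trip["answer"].tolist())
```

Reading the file back right after writing it is a cheap way to catch a wrong separator or a shifted column before starting a full evaluation run.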

But when I run python run.py --data sub_task1_face --model QwenVLPlus --verbose, the reported result does not match what the evaluate method in sub_task1_face returns.
There is only a single Overall row:

[2025-09-23 14:25:28] INFO - RUN - run.py: main - 468: The evaluation of model QwenVLPlus x dataset sub_task1_face+tongue has finished!
[2025-09-23 14:25:28] INFO - RUN - run.py: main - 469: Evaluation Results:
[2025-09-23 14:25:28] INFO - RUN - run.py: main - 475:
[2025-09-23 14:25:28] INFO - RUN - run.py: main - 475:
-------  -------------------
split    none
Overall  0.14285714285714285
-------  -------------------

I could not find any material on this online. Has anyone here evaluated a custom dataset before, and could you share the detailed steps?


5 answers

  • 阿里嘎多学长 2025-09-23 14:57

    Compiled by 阿里嘎多学长 (AIGC-generated).

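    From your description, you want to evaluate a custom dataset with VLMEvalKit. Here is an approach:

    1. Make sure VLMEvalKit is installed. Since you launch run.py directly, you are working from a source checkout, so install it from the repository root:
    pip install -e .
    
    2. Read the VLMEvalKit documentation on adding a custom dataset. The skeleton below only illustrates the general shape:
    from vlmeval.dataset.image_mcq import ImageMCQDataset
    # Placeholder: the question shows no vlmeval.metrics module, so treat Metrics as a hypothetical name
    from vlmeval.metrics import Metrics
    
    # Assumes the data for sub_task1_face has already been prepared
    class sub_task1_face(ImageMCQDataset):
        # The (root, transform) constructor signature is illustrative, not confirmed API
        def __init__(self, root, transform):
            super().__init__(root, transform)
            self.tasks = ...
    
        def get_metrics(self, mode):
            # List of metrics to report
            metrics = [
                Metrics.METRIC_ACCURACY,
                Metrics.METRIC_HINGE_LOSS,
                Metrics.METRIC_MCC
            ]
    
            # Evaluation step (a plain closure, so it can be called without arguments)
            def process():
                # Run the evaluation logic over the data here
                # and store the outcome in self.results
                pass
    
            process()
            return metrics
    
    3. Inside the evaluation logic, compute the metrics over the data.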
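
The metric computation in the last step can be sketched with plain pandas. This is a minimal sketch, not VLMEvalKit's own scorer: it assumes the result file has prediction and answer columns (as in the evaluate method from the question), and the helper names extract_choice and score_predictions are hypothetical. Model replies are often longer than a bare letter, so comparing the raw strings can undercount correct answers.

```python
import re

import pandas as pd


def extract_choice(prediction, choices="AB"):
    """Return the option letter found in a model reply, or None if there is none."""
    text = str(prediction).strip()
    # A reply that is just one letter is accepted in either case.
    if len(text) == 1 and text.upper() in choices:
        return text.upper()
    # Otherwise look for a standalone upper-case option letter, e.g. "B." or "A:".
    # Limitation: a sentence starting with the article "A" would also match.
    match = re.search(rf"\b([{choices}])\b", text)
    return match.group(1) if match else None


def score_predictions(df):
    """Compute accuracy over a frame with 'prediction' and 'answer' columns."""
    parsed = [extract_choice(p) for p in df["prediction"]]
    gold = [str(a).strip().upper() for a in df["answer"]]
    correct = sum(p == g for p, g in zip(parsed, gold))
    total = len(gold)
    accuracy = correct / total if total else 0.0
    # Built-in int/float values keep the result JSON-serialisable.
    return {"accuracy": round(accuracy, 4), "correct": int(correct), "total": int(total)}


# Hypothetical predictions, for illustration only.
demo = pd.DataFrame({
    "prediction": ["A", "The answer is B.", "b", "I cannot tell",
                   "A: Suffering from depression."],
    "answer": ["A", "B", "B", "A", "B"],
})
result = score_predictions(demo)
print(result)  # {'accuracy': 0.6, 'correct': 3, 'total': 5}
```

Returning built-in Python numbers rather than NumPy scalars is a deliberate choice here: json.dumps cannot serialise numpy.int64, which is what a pandas .sum() on a boolean column produces.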

    Note: the code above is for illustration only and may need to be adapted to your actual setup.
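
One adjustment that may be relevant to the symptom in the question (a single Overall row instead of the custom metrics) is how the framework decides which dataset class to instantiate. The toy model below is not VLMEvalKit's code. It assumes, without confirmation from this thread, that the class is chosen by matching the --data name against a list of names that each registered class claims, with a generic fallback for names nobody claims. Every class and function name in it is hypothetical.

```python
class BaseMCQ:
    """Stand-in for a built-in dataset class with a default scorer."""
    NAMES = ["builtin_mcq"]

    @classmethod
    def supported_datasets(cls):
        return list(cls.NAMES)

    def evaluate(self):
        return {"Overall": "default scorer"}


class InheritedOnly(BaseMCQ):
    """Overrides evaluate but still inherits the parent's name list."""

    def evaluate(self):
        return {"accuracy": "custom scorer"}


class Registered(InheritedOnly):
    """Same custom scorer, but it also claims its own dataset name."""
    NAMES = ["my_task"]


class Fallback(BaseMCQ):
    """Generic handler used when no registered class claims the name."""
    NAMES = []


def pick_class(name, registry):
    """Return the first class that claims `name`, else the generic fallback."""
    for cls in registry:
        if name in cls.supported_datasets():
            return cls
    return Fallback


without_name = pick_class("my_task", [BaseMCQ, InheritedOnly])
with_name = pick_class("my_task", [BaseMCQ, Registered])

print(without_name.__name__, without_name().evaluate())
print(with_name.__name__, with_name().evaluate())
```

If the framework does work this way, adding the class to the registry list is not enough on its own, because the class also has to claim the name sub_task1_face. Checking which class is actually instantiated for --data sub_task1_face would confirm or rule this out.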

    If you need more help, let me know!


Question events

  • Question edited on September 23
  • Question created on September 23