Crkylin · 2022-12-31 20:16

A practice problem on the ID3 algorithm

[image: the exercise statement]


I'd like to ask how to analyze this. I've just started learning ID3 and would appreciate a standard worked reference. Thanks, everyone!


1 answer

  • 码农阿豪@新空间代码工作室 (featured Java author) 2024-05-31 23:55
    Let 阿豪 help you with this one. This answer was written with reference to ChatGPT 3.5; if you still have questions, feel free to comment or leave a message.
    The problem asks you to learn a decision tree from the given dataset with the ID3 algorithm and then use that tree to classify new samples. The steps are as follows:
    1. Data preprocessing: put the raw data into the form ID3 expects, with one sample per row, the class label in the last column, and the features in the preceding columns. The given dataset is:

       | No. | age         | work | house | credit rating | label |
       | --- | ----------- | ---- | ----- | ------------- | ----- |
       | 1   | youth       | no   | no    | fair          | no    |
       | 2   | youth       | no   | no    | excellent     | no    |
       | 3   | middle_aged | no   | no    | excellent     | yes   |
       | 4   | senior      | no   | yes   | fair          | yes   |
       | 5   | senior      | yes  | no    | fair          | yes   |
       | 6   | middle_aged | no   | no    | fair          | yes   |
       | 7   | youth       | yes  | yes   | fair          | no    |
       | 8   | youth       | no   | no    | fair          | yes   |
       | 9   | senior      | yes  | yes   | fair          | yes   |
       | 10  | youth       | yes  | yes   | excellent     | yes   |
       | 11  | middle_aged | yes  | no    | excellent     | yes   |
       | 12  | middle_aged | no   | yes   | fair          | yes   |
       | 13  | senior      | yes  | no    | excellent     | no    |

       It can be converted into the following form:
    [["youth","no","no","fair","no"],
     ["youth","no","no","excellent","no"],
     ["middle_aged","no","no","excellent","yes"],
     ...
     ["senior","yes","yes","excellent","no"]]
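As a quick sanity check on this list-of-rows representation, the entropy of the full label column (9 "yes" vs. 4 "no") can be computed directly before any splitting. This is a standalone illustrative sketch, not part of the answer's code:

```python
import math

# label column of the 13-sample dataset above: 9 "yes", 4 "no"
label_column = ["no", "no", "yes", "yes", "yes", "yes", "no",
                "yes", "yes", "yes", "yes", "yes", "no"]

def entropy(values):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(values)
    return -sum((values.count(v) / n) * math.log2(values.count(v) / n)
                for v in set(values))

# H(D) = -(9/13) log2(9/13) - (4/13) log2(4/13)
print(round(entropy(label_column), 4))  # → 0.8905
```

Every information gain ID3 computes below is measured against this baseline entropy of the whole dataset.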
    
    2. Choose the feature with the largest information gain to split the current node: for each candidate feature, compute the information gain obtained by splitting on it, and pick the maximum. On this dataset, the feature with the largest gain at the root is "age", so it becomes the root node.
    3. For each child node, repeat step 2 on the corresponding data subset, stopping when either:
       - all samples in the subset belong to the same class, or
       - all features have already been used along the path; in that case the node is labeled with the majority class of the remaining samples.
    4. To classify a new sample, walk it down the tree from the root and return the label of the leaf it reaches.

    A possible implementation:
    import math  # used by calcEntropy for the base-2 logarithm

    # the training set: one sample per row, class label in the last column
    dataset = [
        ["youth","no","no","fair","no"],
        ["youth","no","no","excellent","no"],
        ["middle_aged","no","no","excellent","yes"],
        ["senior","no","yes","fair","yes"],
        ["senior","yes","no","fair","yes"],
        ["middle_aged","no","no","fair","yes"],
        ["youth","yes","yes","fair","no"],
        ["youth","no","no","fair","yes"],
        ["senior","yes","yes","fair","yes"],
        ["youth","yes","yes","excellent","yes"],
        ["middle_aged","yes","no","excellent","yes"],
        ["middle_aged","no","yes","fair","yes"],
        ["senior","yes","no","excellent","no"]
    ]
    # feature names; the last entry names the class-label column
    labels = ["age", "work", "house", "credit rating", "label"]
    # node class for the decision tree
    class Node:
        def __init__(self, label=None, feature=None, branch=None, number=None):
            self.label = label              # class label at this node
            self.feature = feature          # feature used to split at this node
            self.branch = branch            # dict mapping feature value -> child node
            self.number = number            # node number
    # compute the entropy of a dataset's label column
    def calcEntropy(dataSet):
        labelCount = {}
        for data in dataSet:
            label = data[-1]
            labelCount[label] = labelCount.get(label, 0) + 1
        entropy = 0
        for key in labelCount:
            prob = float(labelCount[key]) / len(dataSet)
            entropy -= prob * math.log(prob, 2)
        return entropy
    # keep the rows where column `feature` equals `value`, dropping that column
    def splitDataSet(dataSet, feature, value):
        subDataSet = []
        for data in dataSet:
            if data[feature] == value:
                subData = data[:feature]
                subData.extend(data[feature+1:])
                subDataSet.append(subData)
        return subDataSet
    # choose the feature with the largest information gain
    def chooseBestFeature(dataSet):
        n = len(dataSet[0]) - 1
        baseEntropy = calcEntropy(dataSet)
        bestInfoGain = 0
        bestFeature = -1
        for i in range(n):
            featureList = [data[i] for data in dataSet]
            uniqueVals = set(featureList)
            newEntropy = 0
            for value in uniqueVals:
                subDataSet = splitDataSet(dataSet, i, value)
                prob = len(subDataSet) / float(len(dataSet))
                newEntropy += prob * calcEntropy(subDataSet)
            infoGain = baseEntropy - newEntropy
            if infoGain > bestInfoGain:
                bestInfoGain = infoGain
                bestFeature = i
        return bestFeature
    # node numbers come from a module-level counter so that every node gets
    # a unique ID (the original recursion left them all at the default 0)
    from itertools import count
    _node_ids = count()
    # build the ID3 decision tree recursively
    def createTree(dataSet, labels):
        classList = [data[-1] for data in dataSet]
        # if all samples share the same class, return a leaf with that class
        if classList.count(classList[0]) == len(classList):
            return Node(label=classList[0], number=next(_node_ids))
        # if only the label column is left, return a leaf labeled with the
        # majority class of the remaining samples
        if len(dataSet[0]) == 1:
            majority = max(set(classList), key=classList.count)
            return Node(label=majority, number=next(_node_ids))
        # otherwise split on the feature with the largest information gain;
        # the majority class is stored on the internal node as well, so that
        # classify() has a fallback for feature values it has never seen
        feature = chooseBestFeature(dataSet)
        majority = max(set(classList), key=classList.count)
        node = Node(feature=labels[feature], label=majority, number=next(_node_ids))
        uniqueVals = set(data[feature] for data in dataSet)
        subLabels = labels[:feature] + labels[feature+1:]
        node.branch = {}
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, feature, value)
            node.branch[value] = createTree(subDataSet, subLabels)
        return node
    # classify a sample by walking down the tree
    def classify(data, node):
        feature = node.feature
        if feature is None:                     # leaf node
            return node.label
        value = data[labels.index(feature)]
        if value not in node.branch:            # unseen feature value:
            return node.label                   # fall back to this node's label
        return classify(data, node.branch[value])
    # build the decision tree
    tree = createTree(dataset, labels)
    # classify a new sample: age=senior, no job, owns a house, excellent credit
    newData = ["senior","no","yes","excellent"]
    result = classify(newData, tree)
    print(result)
    

    The output is:

    no
    

    That is, the decision tree predicts the class "no" for this sample.
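To verify the root split by hand, the information gain of each feature on the full dataset can be computed independently of the tree code. In this standalone sketch the helper names (`entropy`, `info_gain`) are mine, not from the answer; it should confirm that "age" has the largest gain at the root, and that within the "senior" branch "credit rating" separates the classes perfectly, which is why the sample (senior, no, yes, excellent) lands in the "no" leaf:

```python
import math

data = [
    ["youth", "no", "no", "fair", "no"],
    ["youth", "no", "no", "excellent", "no"],
    ["middle_aged", "no", "no", "excellent", "yes"],
    ["senior", "no", "yes", "fair", "yes"],
    ["senior", "yes", "no", "fair", "yes"],
    ["middle_aged", "no", "no", "fair", "yes"],
    ["youth", "yes", "yes", "fair", "no"],
    ["youth", "no", "no", "fair", "yes"],
    ["senior", "yes", "yes", "fair", "yes"],
    ["youth", "yes", "yes", "excellent", "yes"],
    ["middle_aged", "yes", "no", "excellent", "yes"],
    ["middle_aged", "no", "yes", "fair", "yes"],
    ["senior", "yes", "no", "excellent", "no"],
]
features = ["age", "work", "house", "credit rating"]

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(v) / n) * math.log2(labels.count(v) / n)
                for v in set(labels))

def info_gain(rows, col):
    """Information gain of splitting `rows` on column `col`."""
    base = entropy([r[-1] for r in rows])
    cond = 0.0
    for value in {r[col] for r in rows}:
        subset = [r for r in rows if r[col] == value]
        cond += len(subset) / len(rows) * entropy([r[-1] for r in subset])
    return base - cond

# gains at the root; "age" should be the largest (~0.2674)
gains = {f: round(info_gain(data, i), 4) for i, f in enumerate(features)}
print(gains)

# within the senior branch, find the best of the remaining features
senior = [r for r in data if r[0] == "senior"]
best_senior = max((info_gain(senior, i), features[i]) for i in range(1, 4))
print(best_senior)  # "credit rating" splits the senior subset with zero
                    # remaining entropy: fair -> all yes, excellent -> no
```

This matches the answer's tree: root on age, senior branch split on credit rating, so (senior, no, yes, excellent) is classified "no".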

