duanbogan5878 2016-10-27 02:56
54 views
Accepted

Cheapest way to classify HTTP POST objects

I can use SciPy to classify text on my machine, but I need to categorize string objects from HTTP POST requests at or near real time. What algorithms should I research if my goals are high concurrency, near-real-time output, and a small memory footprint? I figured I could get by with the Support Vector Machine (SVM) implementation in Go, but is that the best algorithm for my use case?


1 answer

  • doukezi4606 2016-10-27 03:26

    Yes, an SVM with a linear kernel should be a good starting point. You can use scikit-learn (which wraps liblinear, I believe) to train your model. Once trained, the model is simply a list of feature:weight pairs for each category you want to classify into. Something like this (suppose you have only 3 classes):

    class1[feature1] = weight11
    class1[feature2] = weight12
    ...
    class1[featurek] = weight1k    ------- for class 1
    
    ... different <feature, weight> ------ for class 2
    ... different <feature, weight> ------ for class 3 , etc
    
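    The weight tables above can be sketched as plain nested dictionaries. The class names, feature names, and weight values below are made-up placeholders; in practice the weights would come out of the trained linear model:

    ```python
    # Per-class weight tables: one {feature: weight} dict per class.
    # All values here are hypothetical; a real table would be exported
    # from the trained linear SVM.
    weights = {
        "class1": {"feature1": 0.8, "feature2": -0.3, "feature3": 1.2, "feature5": 0.4},
        "class2": {"feature1": -0.5, "feature3": 0.1, "feature5": 0.9},
        "class3": {"feature2": 0.7, "feature3": -0.2},
    }

    # Looking up the weight of feature3 for class1:
    print(weights["class1"]["feature3"])  # 1.2
    ```

    A feature absent from a class's table simply has weight 0 for that class, which is what makes the representation compact.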

    At prediction time you don't need scikit-learn at all; you can use whatever language your server backend runs to do a linear computation. Suppose a specific POST request contains the features (feature3, feature5); then what you need to do looks like this:

    linear_score[class1] = 0
    linear_score[class1] += lookup weight of feature3 in class1
    linear_score[class1] += lookup weight of feature5 in class1
    
    linear_score[class2] = 0
    linear_score[class2] += lookup weight of feature3 in class2
    linear_score[class2] += lookup weight of feature5 in class2
    
    ..... same thing for class3
    pick class1, or class2 or class3 whichever has the highest linear_score
    
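    The scoring loop above can be sketched in a few lines of Python (the weight values are hypothetical; features missing from a class's table contribute zero):

    ```python
    # Hypothetical per-class weight tables produced by training.
    weights = {
        "class1": {"feature3": 1.2, "feature5": 0.4},
        "class2": {"feature3": 0.1, "feature5": 0.9},
        "class3": {"feature3": -0.2},
    }

    def predict(active_features):
        # Sum the weights of the request's active features for each class,
        # treating absent features as weight 0, then take the argmax.
        scores = {
            cls: sum(table.get(f, 0.0) for f in active_features)
            for cls, table in weights.items()
        }
        return max(scores, key=scores.get)

    print(predict(["feature3", "feature5"]))  # class1: 1.2 + 0.4 = 1.6, the highest
    ```

    Each request only touches the handful of features it actually contains, so the work per request is proportional to the request size, not the vocabulary size.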

    One step further: if you have some way to assign each feature a weight (e.g., the tf-idf score of its token), then your prediction becomes:

    linear_score[class1] += class1[feature3] x feature_weight[feature3]
    so on and so forth.
    

    Note that feature_weight[feature k] is usually different for each request. Since the number of active features in a request is much smaller than the total number of features considered (think 50 tokens versus an entire vocabulary of 1 million tokens), prediction should be very fast. Once your model is ready, I can imagine the prediction being implemented on top of a simple key-value store (e.g., Redis).
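    With per-request feature weights such as tf-idf, the scoring becomes a sparse dot product over only the request's active features. A sketch, again with made-up weight and tf-idf values:

    ```python
    # Hypothetical per-class weight tables from training.
    weights = {
        "class1": {"feature3": 1.2, "feature5": 0.4},
        "class2": {"feature3": 0.1, "feature5": 0.9},
    }

    def predict_weighted(feature_weight):
        # feature_weight maps each active feature in THIS request to its
        # tf-idf score. Iterating over the request's features (not the
        # whole vocabulary) keeps the dot product sparse and fast.
        scores = {
            cls: sum(table.get(f, 0.0) * w for f, w in feature_weight.items())
            for cls, table in weights.items()
        }
        return max(scores, key=scores.get)

    # A request whose tokens yield these (hypothetical) tf-idf scores:
    print(predict_weighted({"feature3": 2.0, "feature5": 0.5}))
    # class1: 1.2*2.0 + 0.4*0.5 = 2.6 beats class2: 0.1*2.0 + 0.9*0.5 = 0.65
    ```

    The per-class tables could just as well live in a key-value store keyed by (class, feature), with the same loop doing one lookup per active feature.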

    This answer was accepted as the best answer by the asker.

