dtyqflrr775518 2012-08-06 02:07

XML parsing in Linux, printing multiple elements

I found a script online for parsing XML on Linux that I want to use. I was hoping to get some help understanding how the script works and how to edit it for my own use.

Here is the script (credit)

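#!/bin/bash

# Print the text of every <title> element found in the file given as $1.
awk '
{
   # Reset the scan state for each input line.
   xml = $0; len = length(xml); pos = 1
   while (pos <= len) {
      if (substr(xml, pos, 7) == "<title>") {
         pos += 7
         endp = pos
         # Advance endp to the closing tag (or to the end of the line).
         while (substr(xml, endp, 8) != "</title>" && endp < len)
            endp++
         print "   ", substr(xml, pos, endp - pos), " * "
         pos = endp + 7   # with the pos++ below, this skips the 8-character closing tag
      }
      pos++
   }
}' "$1"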

Here is a simplified sample of the xml data I will be using

I have already removed the extra characters on both sides of the tags and made a few other adjustments by changing the script to this:

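#!/bin/bash

# Print the CDATA text of every <sport> element found in the file given as $1.
awk '
{
   # Reset the scan state for each input line.
   xml = $0; len = length(xml); pos = 1
   while (pos <= len) {
      if (substr(xml, pos, 16) == "<sport><![CDATA[") {
         pos += 16
         endp = pos
         # Advance endp to the closing sequence (or to the end of the line).
         while (substr(xml, endp, 11) != "]]></sport>" && endp < len)
            endp++
         print "", substr(xml, pos, endp - pos), ""
         pos = endp + 10   # with the pos++ below, this skips the 11-character closing sequence
      }
      pos++
   }
}' "$1"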

Running this script leaves me with a plain text file containing:

Women's Soccer
Men's Soccer
Women's Soccer

Ultimately I'd like to have a script output the following

Women's Soccer Away @ South Carolina (Exhibition) at 7:00 PM
Men's Soccer Home vs. Ohio State at 7:00 PM
Women's Soccer Away @ William and Mary at 7:00 PM

For those wondering, this is the shell that calls the parse script (ignore file names and locations)

wget -O rss.xml http://en-us.fxfeeds.mozilla.com/en-US/firefox/headlines.xml
~dsl/bin/rssparse! rss.xml > headlines_$$.tmp
cd /tmp/ldmtrx
split --lines=30 /tmp/headlines_$$.tmp ldmtrxnews
cd /tmp
rm headlines_$$.tmp rss.xml

While it would be greatly appreciated, I don't expect anyone to complete this task for me; I'd just like some tips to help me get started. I'm not sure how to run this script on a different element and then print both elements (for example <sport> and <homeaway>). I could run the script again, but then the elements would be printed on different lines.
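
One way to keep several elements on the same line is to pull every field out of a single <item> before printing, instead of running the scan once per element. The following is only a minimal sketch under assumptions: the function names print_items and field are made up for illustration, and it assumes each entry is wrapped in <item>...</item> with <sport>, <homeaway>, <opponent> and <time> children whose text may or may not be wrapped in CDATA.

```shell
#!/bin/bash
# Sketch: print several child elements of each <item> on one line.
# Assumed layout (not confirmed by the real feed):
#   <item><sport><![CDATA[...]]></sport><homeaway>...</homeaway>
#         <opponent>...</opponent><time>...</time></item>

print_items() {   # usage: print_items [file...]   (reads stdin if no file is given)
  awk '
  # Return the text of the first <tag>...</tag> found in s, without a CDATA wrapper.
  function field(s, tag,    otag, ctag, start, rest, stop, val) {
    otag = "<" tag ">"
    ctag = "</" tag ">"
    start = index(s, otag)
    if (start == 0) return ""
    rest = substr(s, start + length(otag))
    stop = index(rest, ctag)
    if (stop == 0) return ""
    val = substr(rest, 1, stop - 1)
    if (substr(val, 1, 9) == "<![CDATA[")            # drop the CDATA opener
      val = substr(val, 10)
    if (substr(val, length(val) - 2) == "]]>")       # drop the CDATA closer
      val = substr(val, 1, length(val) - 3)
    return val
  }

  # Collect the whole input first, because an item may span several lines.
  { doc = doc $0 " " }

  END {
    while ((p = index(doc, "<item>")) > 0) {
      doc = substr(doc, p + 6)                       # skip past <item>
      e = index(doc, "</item>")
      if (e == 0) break                              # unterminated item: stop
      item = substr(doc, 1, e - 1)
      doc = substr(doc, e + 7)                       # skip past </item>
      print field(item, "sport"), field(item, "homeaway"), field(item, "opponent"), "at", field(item, "time")
    }
  }' "$@"
}

# Example: print_items rss.xml
```

Because all four fields are read from the same item before the print statement runs, they end up on one line. If <homeaway> only holds a word such as Home or Away, the "vs." or "@" text would still have to be added in the print statement.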

Lastly, I'd like to know how to exclude all data that does not have a <date> matching today's date. Thanks for your help.
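
For the date filter, one possible approach is to pass the wanted date into awk with -v and keep only the items whose <date> text equals it. This is a rough sketch, not a drop-in solution: the function name items_for_date is made up, the <item> wrapper is assumed, and DATE_FMT is a guess that has to be changed to whatever format the feed actually uses in <date>.

```shell
#!/bin/bash
# Sketch: keep only the <item> blocks whose <date> equals a given day.
# DATE_FMT is an assumption; set it to the format the feed really uses.
DATE_FMT='%Y-%m-%d'

items_for_date() {   # usage: items_for_date WANTED_DATE [file...]
  local wanted=$1
  shift
  awk -v wanted="$wanted" '
  # Collect the whole input first, because an item may span several lines.
  { doc = doc $0 " " }

  END {
    while ((p = index(doc, "<item>")) > 0) {
      doc = substr(doc, p + 6)                 # skip past <item>
      e = index(doc, "</item>")
      if (e == 0) break                        # unterminated item: stop
      item = substr(doc, 1, e - 1)
      doc = substr(doc, e + 7)                 # skip past </item>
      # Accept the date with or without a CDATA wrapper.
      if (index(item, "<date>" wanted "</date>") || index(item, "<date><![CDATA[" wanted "]]></date>"))
        print "<item>" item "</item>"
    }
  }' "$@"
}

# Example: items_for_date "$(date +"$DATE_FMT")" rss.xml
```

The output is still XML, one matching item per line, so it can be piped into the parse script in place of the full feed.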


1 answer

  • dongne1560 2012-08-07 22:03

    Note that your example is not valid XML; it needs a few tweaks.

    Use this pastie instead of that pastie.

    Then, using xmlstarlet, you can replace everything this script does.

    $ wget --output-document - http://pastie.org/pastes/4408130/download | xmlstarlet sel -t -m rss/channel/item -v sport -o ' Away @ ' -v opponent -o ' at ' -v time -n
    

    That outputs:

    Women's Soccer Away @ South Carolina (Exhibition) at 7:00 PM
    Men's Soccer Away @ Ohio State (Exhibition) at 7:00 PM
    Women's Soccer Away @ William and Mary at 7:00 PM
    

    Once the output is what you need, you can pass -C to xmlstarlet to print the XSLT stylesheet it generates, which you can then reuse from any language where you need the same parsing.

    This answer was accepted by the asker as the best answer.
