I found a script online for parsing XML on Linux that I'd like to use. I'm hoping to get some help understanding how it works and how to adapt it for my own use.
Here is the script (credit)
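#!/bin/bash
# Print the text of every <title> element found in the file given as $1.
awk '
{
    # Reset the scan state for each input line.
    xml = $0; len = length(xml); pos = 1
    while (pos <= len) {
        if (substr(xml, pos, 7) == "<title>") {
            pos = pos + 7
            endp = pos
            # Advance endp to the closing tag (or to the end of the line).
            while ((substr(xml, endp, 8) != "</title>") && (endp < len)) {
                endp++
            }
            print " ", substr(xml, pos, endp - pos), " * "
            pos = endp + 7
        }
        pos++
    }
}' "$1"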
Here is a simplified sample of the xml data I will be using
I have already gotten rid of the extra characters printed on both sides of each value and made a few other adjustments by changing the script to this:
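#!/bin/bash
# Print the CDATA text of every <sport> element found in the file given as $1.
awk '
{
    # Reset the scan state for each input line.
    xml = $0; len = length(xml); pos = 1
    while (pos <= len) {
        if (substr(xml, pos, 16) == "<sport><![CDATA[") {
            pos = pos + 16
            endp = pos
            # Advance endp to the closing marker (or to the end of the line).
            while ((substr(xml, endp, 11) != "]]></sport>") && (endp < len)) {
                endp++
            }
            print "", substr(xml, pos, endp - pos), ""
            pos = endp + 10
        }
        pos++
    }
}' "$1"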
Running this script leaves me with a plain text file containing:
Women's Soccer
Men's Soccer
Women's Soccer
Ultimately I'd like the script to output the following:
Women's Soccer Away @ South Carolina (Exhibition) at 7:00 PM
Men's Soccer Home vs. Ohio State at 7:00 PM
Women's Soccer Away @ William and Mary at 7:00 PM
For those wondering, this is the shell that calls the parse script (ignore file names and locations)
wget -O rss.xml http://en-us.fxfeeds.mozilla.com/en-US/firefox/headlines.xml
~dsl/bin/rssparse! rss.xml > headlines_$$.tmp
cd /tmp/ldmtrx
split --lines=30 /tmp/headlines_$$.tmp ldmtrxnews
cd /tmp
rm headlines_$$.tmp rss.xml
While it would be greatly appreciated, I don't expect anyone to complete this task for me; I'd just really like some tips and help getting started. I'm not sure how to run this script on a different element and then print both elements (for example <sport> and <homeaway>). I could run the script again, but then the elements would be printed on different lines.
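A minimal sketch of one way to get both elements onto the same line: read the whole file, cut it into records, and look up each element inside a record with a small helper. The function names print_fields and field and the record tag "event" are hypothetical, since the enclosing element of the feed is not shown here; the sketch also assumes <homeaway> is wrapped in CDATA the same way <sport> is.

```shell
#!/bin/bash
# print_fields FILE: hypothetical helper that prints <sport> and <homeaway>
# of every record on one line. The record tag "event" is an assumption.
print_fields() {
    awk -v rectag="event" '
    # Return the CDATA text of <tag> inside rec, or "" if it is absent.
    function field(rec, tag,    otag, ctag, s, e) {
        otag = "<" tag "><![CDATA["
        ctag = "]]></" tag ">"
        s = index(rec, otag)
        if (s == 0) return ""
        rec = substr(rec, s + length(otag))
        e = index(rec, ctag)
        return (e == 0) ? "" : substr(rec, 1, e - 1)
    }
    # Collect the whole file so a record may span several lines.
    { xml = xml $0 "\n" }
    END {
        # Cut the text at every closing record tag, then read each piece.
        n = split(xml, recs, "</" rectag ">")
        for (i = 1; i <= n; i++) {
            sport = field(recs[i], "sport")
            if (sport != "")
                print sport, field(recs[i], "homeaway")
        }
    }' "$1"
}
```

Called as print_fields rss.xml, this prints one line per record. Further elements could be appended to the print statement with more field() calls, once their tag names are known.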
Lastly, I'd like to know how to exclude all data that does not have a <date> matching today's date. Thanks for your help.
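A rough sketch of the date filter, under several assumptions: the date is passed into awk with -v, each record is enclosed in a hypothetical "event" element, <date> is wrapped in CDATA like <sport>, and the feed writes dates as YYYY-MM-DD. The format string given to the date command would need to match whatever the feed actually uses. The function names events_on and field are made up for illustration.

```shell
#!/bin/bash
# events_on FILE [DATE]: hypothetical helper that prints sport and home/away
# for every record whose <date> equals DATE. DATE defaults to today in the
# assumed YYYY-MM-DD format; the record tag "event" is also an assumption.
events_on() {
    awk -v rectag="event" -v want="${2:-$(date +%Y-%m-%d)}" '
    # Return the CDATA text of <tag> inside rec, or "" if it is absent.
    function field(rec, tag,    otag, ctag, s, e) {
        otag = "<" tag "><![CDATA["
        ctag = "]]></" tag ">"
        s = index(rec, otag)
        if (s == 0) return ""
        rec = substr(rec, s + length(otag))
        e = index(rec, ctag)
        return (e == 0) ? "" : substr(rec, 1, e - 1)
    }
    # Collect the whole file so a record may span several lines.
    { xml = xml $0 "\n" }
    END {
        n = split(xml, recs, "</" rectag ">")
        for (i = 1; i <= n; i++) {
            # Keep only the records dated on the wanted day.
            if (field(recs[i], "date") == want)
                print field(recs[i], "sport"), field(recs[i], "homeaway")
        }
    }' "$1"
}
```

Called as events_on rss.xml, it filters on today's date; a second argument selects another day, which is handy for trying it against an old copy of the feed.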