douzhixun8393 2013-10-27 17:20

Regex in PHP to scrape a street address that contains a line break

I've been searching for two days now with Google, and a lot on Stack Overflow, but I can't solve this regex preg_match problem. I simply want to scrape a street address, and normally I can do this easily, but because some street addresses have a line break in the middle followed by around 25 characters of whitespace, my code displays an empty array or just NULL.

Below I have included the source code to show an example of what I'm trying to scrape, and also the failing code I have so far. Any help from someone with more experience than I have would be greatly appreciated this Sunday morning.

Sample of the source code:

<span style="font-size:14px;">736 
                  E 17th St</span><br />

My attempt so far:

$new_data = file_get_contents('someURLaddress');

$street_address_regex = '~14px\;\"\>(.*?)\<\/span\>\<br\s\/\>\s~s';

preg_match($street_address_regex,$new_data,$extracted_street_address);

var_dump ($extracted_street_address);

1 answer

  • dsy19811981 2013-10-30 07:15

    I'm only answering because using a dot here is horrible practice. The giveaway that you're doing something wrong in a regular expression is reaching for the Single-Line option. It's a huge waste of resources and bound to break at some point. A negated character class such as [^<] already matches line breaks, so the option isn't needed.

    This is 99.9% positively what you need to use:

    $street_address_regex = '~14px;">([^<]*)~i';
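To see how the pattern above behaves on the multi-line sample from the question, here is a minimal sketch. The `$html` string is only a stand-in for the fetched page, and collapsing the whitespace with `preg_replace` is an extra step assumed here, not part of the pattern itself.

```php
<?php
// Stand-in for the fetched page: the address is split by a newline and padding.
$html = "<span style=\"font-size:14px;\">736 \n                  E 17th St</span><br />";

$street_address_regex = '~14px;">([^<]*)~i';

$street = null;
if (preg_match($street_address_regex, $html, $m)) {
    // [^<] matches newlines too, so no s flag is needed.
    // Collapse the line break and its padding into single spaces.
    $street = trim(preg_replace('~\s+~', ' ', $m[1]));
}

echo $street, "\n";
```

The capture itself still contains the raw line break; only the clean-up step turns it into a one-line address.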
    

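    Or, if you are (for some reason) expecting a < as a legitimate character, either meaning less-than or as part of formatting tags like bold or italics, then you can do this (the outer group is needed because a repeated capturing group keeps only its last iteration):

    $street_address_regex = '~14px;">((?:[^<]*<)*?)\/span~i';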
    

    And if it bothers you enough that you don't want to strip the trailing < character you'll get in your string, you can do this:

    $street_address_regex = '~14px;">((?:[^<]*(?(?!<\/span)<))*)~i';
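As a rough illustration of the two tag-tolerant variants, the sketch below runs them against a hypothetical input in which the address contains an inline `<b>` tag. The input string and variable names are assumptions for the example, and the first pattern is written with an outer capturing group so the whole run is kept.

```php
<?php
// Hypothetical input: the address contains an inline <b> tag.
$html = '<span style="font-size:14px;">736 <b>E</b> 17th St</span><br />';

// Variant that leaves the "<" of "</span" at the end of the capture.
preg_match('~14px;">((?:[^<]*<)*?)/span~i', $html, $a);
$with_bracket = $a[1];

// Variant with the conditional: a "<" is consumed only when it does not start "</span".
preg_match('~14px;">((?:[^<]*(?(?!</span)<))*)~i', $html, $b);
$clean = $b[1];

// Either result can then be reduced to plain text.
$plain = strip_tags($clean);

echo $with_bracket, "\n", $clean, "\n", $plain, "\n";
```

With the first variant you would still need something like `rtrim($with_bracket, '<')` before using the value.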
    


    But honestly, you shouldn't even be using regex here. Find the stripos of <span style="font-size:14px;"> and add its length to get the address's starting point. Then find the stripos of </span>, passing the starting point as the offset, to get the address's ending point. Subtract the two to get the length, then pull the substr using the original string, the start index, and the length.

    Sounds like a lot, but wrap it in a small function that you use instead of regex: pass in the original string, the start string, and the end string, and return the contents between them using the method I just described. That makes it reusable.

    With that function, that portion of your code will literally run 10 times faster, at least. Regex is powerful as hell for patterns, but you don't have a pattern, you have two static strings from which you want the contents between them. Regex is slow as hell for static string manipulation... Especially using the Dot with Single-Line ~Shiver~

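    $Input = '<span style="font-size:14px;">736 E 17th St</span><br />';
    echo GetBetween($Input, '14px;">', '</span');

    // Returns the text between $StartStr and $EndStr (case-insensitive), or null if either is missing.
    function GetBetween($OrigStr, $StartStr, $EndStr) {
        $StartPos = stripos($OrigStr, $StartStr);
        if ($StartPos === false) {
            return null;
        }
        $StartPos += strlen($StartStr); // skip past the start marker
        $EndPos = stripos($OrigStr, $EndStr, $StartPos); // search only after the start marker
        if ($EndPos === false) {
            return null;
        }
        return substr($OrigStr, $StartPos, $EndPos - $StartPos);
    }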
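The speed claim above is easy to check on your own input. The following is only a rough timing sketch: the filler markup, the iteration count, and the helper name `get_between` are assumptions, and the numbers will vary with the PHP version and the size of the page, so measure before relying on any particular ratio.

```php
<?php
// Hypothetical helper mirroring the stripos/substr approach; returns null when a marker is missing.
function get_between(string $haystack, string $start, string $end): ?string {
    $from = stripos($haystack, $start);
    if ($from === false) {
        return null;
    }
    $from += strlen($start);
    $to = stripos($haystack, $end, $from);
    return $to === false ? null : substr($haystack, $from, $to - $from);
}

// Sample from the question, preceded by filler markup to resemble a real page (assumption).
$page = str_repeat('<div>filler</div>', 200)
      . "<span style=\"font-size:14px;\">736 \n                  E 17th St</span><br />";

$iterations = 2000;

// Time the dot-with-single-line regex.
$t0 = microtime(true);
for ($i = 0; $i < $iterations; $i++) {
    preg_match('~14px;">(.*?)</span>~s', $page, $m);
}
$regex_seconds = microtime(true) - $t0;

// Time the plain string functions.
$t0 = microtime(true);
for ($i = 0; $i < $iterations; $i++) {
    $between = get_between($page, '14px;">', '</span');
}
$string_seconds = microtime(true) - $t0;

// Both approaches should extract the same raw text, line break included.
$same = ($m[1] === $between);

printf("regex: %.4fs, stripos/substr: %.4fs, same result: %s\n",
    $regex_seconds, $string_seconds, $same ? 'yes' : 'no');
```

Either way, the extracted text still contains the line break and padding, so a whitespace clean-up such as `preg_replace('~\s+~', ' ', $between)` is needed afterwards.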
    