dongtong7990 2015-05-18 23:24
Accepted

Unset an array if preg_match does not match the pattern?

I have a multidimensional array that looks like this:

Array
(
    [0] => Array
        (
            [0] => Title 1
            [1] => Some text ... US5801351017 ...
        )

    [1] => Array
        (
            [0] => Title 2
            [1] => Some text ... US0378331005 ...
        )

    [2] => Array
        (
            [0] => Title 3
            [1] => Some text ... //Note here that it does not contain an ISIN Code
        )
...

I am trying to keep only the arrays whose text matches my regex for an ISIN code, and drop the rest. The array above was produced by the following code:

$title = $html->find("h3.r a");
$titlearray = array_map(function($value){
    return trim($value->plaintext);
}, $title);

$description = $html->find("span.st");
$descriptionarray = array_map(function($value){
    $string = strip_tags($value);
    return $string;
}, $description);

$result1 = array();
foreach($titlearray as $key => $value) {
    $tmp = array($value);
    if (isset($descriptionarray[$key])) {
        $tmp[] = $descriptionarray[$key];
    }
    $result1[] = $tmp;
}

print_r($result1);

I have written some code that comes very close but does not really unset the arrays that do not contain an ISIN Code. The code I have is this:

$title = $html->find("h3.r a");
$titlearray = array_map(function($value){
    return trim($value->plaintext);
}, $title);

$description = $html->find("span.st");
$descriptionarray = array_map(function($value){
    $match = array();
    $string = strip_tags($value);
    $pattern = "/[BE|BM|FR|BG|VE|DK|HR|DE|JP|HU|HK|JO|US|BR|XS|FI|GR|IS|RU|LB|"
            . "PT|NO|TW|UA|TR|LK|LV|LU|TH|NL|PK|PH|RO|EG|PL|AA|CH|CN|CL|EE|CA|"
            . "IR|IT|ZA|CZ|CY|AR|AU|AT|IN|CS|CR|IE|ID|ES|PE|TN|PA|SG|IL|US|MX|"
            . "SK|KRSI|KW|MY|MO|SE|GB|GG|KY|JE|VG|NG|SA|MU]{2}[A-Z0-9]{10}/";
    preg_match($pattern, $string, $match);
    return $match;
}, $description);

$merged = array();
$i=0;
foreach($descriptionarray as $value){
  $merged[$i] = $value;
  $merged[$i][] = $titlearray[$i];
  $i++;
}

print_r($merged);

which gives me these arrays:

Array
(
    [0] => Array
        (
            [0] => US5801351017
            [1] => Title 1
        )

    [1] => Array
        (
            [0] => US0378331005
            [1] => Title 2
        )

    [2] => Array
        (
            [0] => Title 3
        )
...

How can I get rid of the arrays that do not match my Regex? What I am looking for is this output:

Array
(
    [0] => Array
        (
            [0] => Title 1
            [1] => US5801351017
        )

    [1] => Array
        (
            [0] => Title 2
            [1] => US0378331005
        )
...

EDIT

@CasimiretHippolyte

According to his answer, I have this code now:

$titles = $html->find("h3.r a");

$descriptions = $html->find("span.st");

$ISIN_PATTERN = "/[BE|BM|FR|BG|VE|DK|HR|DE|JP|HU|HK|JO|US|BR|XS|FI|GR|IS|RU|LB|"
            . "PT|NO|TW|UA|TR|LK|LV|LU|TH|NL|PK|PH|RO|EG|PL|AA|CH|CN|CL|EE|CA|"
            . "IR|IT|ZA|CZ|CY|AR|AU|AT|IN|CS|CR|IE|ID|ES|PE|TN|PA|SG|IL|US|MX|"
            . "SK|KRSI|KW|MY|MO|SE|GB|GG|KY|JE|VG|NG|SA|MU]{2}[A-Z0-9]{10}/";

$results = [];

foreach ($descriptions as $k => $v) {
    if (preg_match($ISIN_PATTERN, strip_tags($v), $m)) {
        $results[] = ['Title' => trim($titles[$k]->plaintext), 'ISIN' => $m[1]];
    }
}

print_r($results);

This narrows my array down to the elements that match the regex, but the match itself does not show up under 'ISIN' => $m[1]. It outputs this:

Array
(
    [0] => Array
        (
            [Title] => Title 1
            [ISIN] => 
        )

    [1] => Array
        (
            [Title] => Title 2
            [ISIN] => 
        )
...
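
The empty ISIN values are consistent with how preg_match fills its match array: index 0 holds the full match, and index 1 exists only when the pattern contains a capture group. A minimal sketch of that behaviour, using a shortened stand-in pattern (an assumption for illustration, not the full country list from the question):

```php
<?php
// Shortened stand-in for the question's pattern; like the original, it has no capture group.
$pattern = '/[A-Z]{2}[A-Z0-9]{10}/';
$text = 'Some text ... US5801351017 ...';

preg_match($pattern, $text, $m);

// $m[0] always holds the full match; $m[1] only exists when the pattern has a (...) group.
$fullMatch = $m[0];
$hasGroup1 = isset($m[1]);

// Wrapping the expression in parentheses creates group 1.
preg_match('/([A-Z]{2}[A-Z0-9]{10})/', $text, $g);
$group1 = $g[1];
```

So with a pattern that has no parentheses, reading $m[0] instead of $m[1] returns the ISIN.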

FURTHER EDIT

This code solves the issue:

$titles = $html->find("h3.r a");

$descriptions = $html->find("span.st");

$ISIN_PATTERN = "/[BE|BM|FR|BG|VE|DK|HR|DE|JP|HU|HK|JO|US|BR|XS|FI|GR|IS|RU|LB|"
            . "PT|NO|TW|UA|TR|LK|LV|LU|TH|NL|PK|PH|RO|EG|PL|AA|CH|CN|CL|EE|CA|"
            . "IR|IT|ZA|CZ|CY|AR|AU|AT|IN|CS|CR|IE|ID|ES|PE|TN|PA|SG|IL|US|MX|"
            . "SK|KRSI|KW|MY|MO|SE|GB|GG|KY|JE|VG|NG|SA|MU]{2}[A-Z0-9]{10}/";

$results1 = [];

foreach ($descriptions as $k => $v) {
    if (preg_match($ISIN_PATTERN, strip_tags($v), $m)) {
        $results1[] = ['Title' => trim($titles[$k]->plaintext), 'ISIN' => $m[1]];
    }
}

$titlesarray = array_column($results1, 'Title');

$results2 = array_map(function($value){
    $match = array();
    $string = strip_tags($value);
    $pattern = "/[BE|BM|FR|BG|VE|DK|HR|DE|JP|HU|HK|JO|US|BR|XS|FI|GR|IS|RU|LB|"
            . "PT|NO|TW|UA|TR|LK|LV|LU|TH|NL|PK|PH|RO|EG|PL|AA|CH|CN|CL|EE|CA|"
            . "IR|IT|ZA|CZ|CY|AR|AU|AT|IN|CS|CR|IE|ID|ES|PE|TN|PA|SG|IL|US|MX|"
            . "SK|KRSI|KW|MY|MO|SE|GB|GG|KY|JE|VG|NG|SA|MU]{2}[A-Z0-9]{10}/";
    preg_match($pattern, $string, $match);
    return $match;
}, $descriptions);

$descriptionarray = array_column($results2, 0);

$result3 = array();
foreach($titlesarray as $key => $value) {
    $tmp = array($value);
    if (isset($descriptionarray[$key])) {
        $tmp[] = $descriptionarray[$key];
    }
    $result3[] = $tmp;
}

print_r($result3);

I scraped something together quickly because I needed a fast solution. It is highly inefficient: I use an extra array_map(), flatten the arrays into simple arrays, and then join them back together. On top of that, I repeat my regex.


LAST EDIT

@CasimiretHippolyte's answer is the most efficient solution, and it works with either his pattern and $m[1] or my pattern and $m[0].
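
For reference, a single-pass sketch that produces the [Title, ISIN] layout asked for above while reading $m[0]. The plain string arrays are hypothetical stand-ins for the scraped elements, and the country list is shortened for illustration. One point worth noting: square brackets in a regex define a character class, so [US|DE]{2} matches any two characters from the set U, S, |, D, E rather than the codes US or DE. The sketch uses a non-capturing group instead.

```php
<?php
// Hypothetical stand-ins for the scraped titles and descriptions (plain strings here).
$titles = ['Title 1', 'Title 2', 'Title 3'];
$descriptions = [
    'Some text ... US5801351017 ...',
    'Some text ... US0378331005 ...',
    'Some text ...',
];

// Shortened country list for illustration; (?:...) groups whole codes,
// whereas [US|DE] would be a character class matching single characters.
$isinPattern = '/\b(?:US|DE|FR|GB|JP)[A-Z0-9]{10}\b/';

$results = [];
foreach ($descriptions as $k => $text) {
    // Keep the pair only when an ISIN is found; $m[0] is the full match.
    if (isset($titles[$k]) && preg_match($isinPattern, $text, $m)) {
        $results[] = [$titles[$k], $m[0]];
    }
}

print_r($results);
```

Entries without an ISIN are never added, so nothing has to be unset afterwards.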

1 answer

  • doumiebiao6827 2015-05-19 01:18

    You can structure your code differently: use a single foreach and build each result item only when an ISIN code is found:

    $titles = $html->find("h3.r a");
    $descriptions = $html->find("span.st");
    
    define ('ISIN_PATTERN', '~
     \b  # there is probably a word boundary at the beginning of the ISIN code
     (?=([A-Z]{2}[A-Z0-9]{10})\b) # check the format before testing the whole alternation
                                  # at the same time, the ISIN is captured in group 1
     (?: # so, this alternation is only here to make the pattern fail or succeed
         C[AHLNRSYZ]|I[DELNRST]|P[AEHKLT]|S[AEIGK]|A[ARTU]|B[EGMR]|L[BKUV]|M[OUXY]|T[HNRW]
         |E[EGS]|G[BGR]|H[KRU]|J[EOP]|K[RWY]|N[GLO]|D[EK]|F[IR]|R[OU]|U[AS]|V[EG]|XS|ZA
     )~x');
    
    $results = [];
    
    foreach ($descriptions as $k => $v) {
        if (preg_match(ISIN_PATTERN, strip_tags($v), $m))
            $results[] = [ 'ISIN' => $m[1], 'Title' => trim($titles[$k]->plaintext) ]; 
    }
    
    print_r($results);
    

    Note: this code is not tested and can probably be improved. Several ideas:

    • stop using simplehtml and use DOMDocument and DOMXPath instead
    • the hand-written pattern assumes that all countries are equally likely. If that isn't the case, rewrite it so that the most common countries are checked first
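
The first idea could look roughly like the sketch below. It is untested against the real page: the sample markup is hypothetical, the XPath expressions assume that the class attribute is exactly "r" or "st" (the CSS selectors h3.r and span.st would also match elements carrying additional classes), and the pattern is a simplified format check without the country alternation.

```php
<?php
// Hypothetical sample markup mirroring the selectors "h3.r a" and "span.st".
$html = '<div><h3 class="r"><a href="#">Title 1</a></h3>'
      . '<span class="st">Some <b>text</b> ... US5801351017 ...</span></div>'
      . '<div><h3 class="r"><a href="#">Title 3</a></h3>'
      . '<span class="st">Some text ...</span></div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$titles = $xpath->query('//h3[@class="r"]/a');
$descriptions = $xpath->query('//span[@class="st"]');

// Simplified format check only (two letters, then ten letters or digits).
$pattern = '~\b[A-Z]{2}[A-Z0-9]{10}\b~';

$results = [];
for ($i = 0; $i < $descriptions->length; $i++) {
    $titleNode = $titles->item($i);
    // textContent already contains the text without tags, so strip_tags is not needed.
    if ($titleNode !== null
        && preg_match($pattern, $descriptions->item($i)->textContent, $m)) {
        $results[] = ['ISIN' => $m[0], 'Title' => trim($titleNode->textContent)];
    }
}
```

As in the foreach above, pairing by index assumes that titles and descriptions appear in the same order and in equal numbers.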
    This answer was selected by the asker as the best answer.