douhuiqi3855 2015-03-20 13:53

A headache with a multi-line text regex in PHP

I have unformatted text data extracted from a PDF, like this:

AB01234 This could be a

long question with multiple

new lines a)these b)are c)the responses which could

contains new lines d)either b

AB01235 This is another question with same multiple

response a) one b) two c) three d) four c

...

My goal is to group the question identifiers, questions, answers, and the correct answer, which is the last character. Is there any way to do this with a regex?

{
   [0] => 'AB01234',
   [1] => 'This could be a long question with multiple new lines',
   [2] => 'these',
   [3] => 'are',
   [4] => 'the responses which could contains new lines',
   [5] => 'either',
   [6] => 'b'
}

1 answer

  • duanlie7962 2015-03-20 14:32

    I would not attempt to do this with a single regular expression. There is far too much variance in the input. I would clean up the text like this:

    $text = '
        AB01234 This could be a
        long question with multiple
        new lines a)these b)are c)the responses which could
        contains new lines d)either b
        AB01235 This is another question with same multiple
        response a) one b) two c) three d) four c
    ';
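    // Flag each question identifier (two capital letters followed by five digits).
    $text = preg_replace('/([A-Z]{2}[0-9]{5})/', ' QUESTION\1 ', $text);
    // Flag each answer label: a lowercase letter followed by ")".
    $text = preg_replace('/([a-z]\))/', ' ANSWER\1 ', $text);
    // Collapse every run of whitespace, newlines included, into a single space.
    $text = trim(preg_replace('/\s+/', ' ', $text));
    print($text);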
    

    You will see that the text is now fairly clean: it is a single line with normalized spacing, and it carries clear QUESTION and ANSWER flags. You can change the flags to anything you like, such as !@#$#@!# for a question, as long as they never appear in the text itself.
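Since the flags must never occur in the source text, it may be worth checking that before inserting them. The following is only a minimal sketch: `flagText` is a hypothetical helper name, not part of the answer, and it wraps the same three `preg_replace` calls with the flags passed in as parameters.

```php
<?php
// Hypothetical helper (an assumption, not from the original answer):
// runs the clean-up step, but refuses to run if a flag already occurs in the raw text.
function flagText(string $text, string $qFlag = 'QUESTION', string $aFlag = 'ANSWER'): string
{
    foreach ([$qFlag, $aFlag] as $flag) {
        if (strpos($text, $flag) !== false) {
            throw new InvalidArgumentException("Flag '$flag' already appears in the text");
        }
    }
    // Flag question identifiers: two capital letters followed by five digits.
    $text = preg_replace('/([A-Z]{2}[0-9]{5})/', ' ' . $qFlag . '\1 ', $text);
    // Flag answer labels: a lowercase letter followed by ")".
    $text = preg_replace('/([a-z]\))/', ' ' . $aFlag . '\1 ', $text);
    // Collapse all whitespace runs into single spaces.
    return trim(preg_replace('/\s+/', ' ', $text));
}
```

Calling `flagText($text, '@@Q@@', '@@A@@')` would use different flags. The later `explode` calls would then have to split on those same strings.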

    Now, you could try a regular expression, but explode is easier at this point because you have flagged the delimiters. I used explode and implode a lot in this example, just in case you haven't seen them much. You don't have to use them. You could use regex or substrings instead.
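For comparison, here is roughly what the regex route could look like for one question chunk. This is a sketch under assumptions: `splitAnswers` is a hypothetical name, and it splits directly on the answer labels with `preg_split` rather than on an ANSWER flag. Like the flagging regex, it would also split on any other lowercase letter followed by ")" in the text.

```php
<?php
// Hypothetical alternative (not from the original answer): split one
// question chunk on its answer labels instead of exploding on a flag.
function splitAnswers(string $chunk): array
{
    // Matches labels such as "a)" or " b) ", including the spaces around them.
    $parts = preg_split('/\s*[a-z]\)\s*/', $chunk);
    return array_map('trim', $parts);
}

print_r(splitAnswers('This is another question with same multiple response a) one b) two c) three d) four c'));
```

The first element is the question text and the remaining elements are the answers. The last answer still carries the correct-answer letter.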

    $questions = array();
    $qas = explode("QUESTION", $text);
    foreach($qas as $qa)
    {
        // Skip the empty chunk that precedes the first QUESTION flag.
        if(trim($qa) == "") continue;
        $answers = explode("ANSWER", $qa);
        $q = array();
        foreach($answers as $i=>$answer)
        {
            $a = explode(' ', trim($answer));
            // The first word is the identifier (chunk 0) or an answer label such as "a)".
            $first = array_shift($a);
            if($i == 0) $q[] = $first;
            $q[] = implode(' ', $a);
        }
        $questions[] = $q;
    }
    print_r($questions);
    

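    You should now have one array per question, holding the identifier, the question text, and each answer. The correct-answer letter stays at the end of the last answer, as in "either b".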
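To match the layout requested in the question, the trailing correct-answer letter can be split off the last element. This is a minimal sketch: `splitCorrectAnswer` is a hypothetical helper name, and it assumes every question ends with a single lowercase letter separated from the last answer by whitespace.

```php
<?php
// Hypothetical helper (an assumption, not from the original answer):
// moves the trailing answer letter of the last element into its own element.
function splitCorrectAnswer(array $q): array
{
    if (count($q) === 0) {
        return $q;
    }
    $last = trim(array_pop($q));
    // Expect "<answer text> <single lowercase letter>" at the very end.
    if (preg_match('/^(.*)\s([a-z])$/s', $last, $m)) {
        $q[] = $m[1];
        $q[] = $m[2];
    } else {
        $q[] = $last; // no trailing letter found; keep the trimmed element
    }
    return $q;
}

$q = array('AB01235', 'This is another question with same multiple response', 'one', 'two', 'three', 'four c');
print_r(splitCorrectAnswer($q));
```

For the sample array, the last element 'four c' becomes two elements, 'four' and 'c'. Applying the helper to each entry, for example with `array_map('splitCorrectAnswer', $questions)`, would cover every question.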
