doupuzhimuhan9216 2018-11-13 17:00
Views: 34
Accepted


I am looking to loop through existing .vtt files and read the cue data into a database.

The format of the .vtt files is:

WEBVTT FILE

line1
00:00:00.000 --> 00:00:10.000
‘Stuff’

line2
00:00:10.000 --> 00:00:20.000
Other stuff
Example with 2 lines

line3
00:00:20.00 --> 00:00:30.000
Example with only 2 digits in milliseconds

line4
00:00:30.000 --> 00:00:40.000
Different stuff

00:00:40.000 --> 00:00:50.000
Example without a head line

Originally I tried to anchor each line strictly with ^ and $, along the lines of: /^(\w*)$^(\d{2}):(\d{2}):(\d{2})\.(\d{2,3}) --> (\d{2}):(\d{2}):(\d{2})\.(\d{2,3})$^(.+)$/ims but I could not get this working in the regex checker and resorted to using \s to handle the line starts and ends.

Currently I am using the following regex: /(.*)\s(\d{2}):(\d{2}):(\d{2})\.(\d{2,3}) --> (\d{2}):(\d{2}):(\d{2})\.(\d{2,3})\s(.+)/im

This partially works in online regex checkers such as https://regex101.com/r/mmpObk/3 (this example does not pick up multi-line subtitles, but it does get the first line, which is good enough for now because all my subtitles are currently one-liners). However, if I put this into PHP (preg_match_all("/(.*)\s(\d{2}):(\d{2}):(\d{2})\.(\d{2,3}) --> (\d{2}):(\d{2}):(\d{2})\.(\d{2,3})\s(.+)/mi", $fileData, $matches)) and dump the results, I get an array of empty arrays.

What might be different between the online regex checker and PHP?

Thanks in advance for any suggestions.

EDIT--- Below is a dump of $fileData and a dump of $matches:

string(341) "WEBVTT FILE

line1
00:00:00.000 --> 00:00:10.000
‘Stuff’

line2
00:00:10.000 --> 00:00:20.000
Other stuff
Example with 2 lines

line3
00:00:20.00 --> 00:00:30.000
Example with only 2 digits in milliseconds

line4
00:00:30.000 --> 00:00:40.000
Different stuff

00:00:40.000 --> 00:00:50.000
Example without a head line"

array(11) {
    [0]=>
        array(0) {}
    [1]=>
        array(0) {}
    [2]=>
        array(0) {}
    [3]=>
        array(0) {}
    [4]=>
        array(0) {}
    [5]=>
        array(0) {}
    [6]=>
        array(0) {}
    [7]=>
        array(0) {}
    [8]=>
        array(0) {}
    [9]=>
        array(0) {}
    [10]=>
        array(0) {}
}

1 answer

  • dqch34769 2018-11-13 17:45

    The problem with your regular expression is poor line-ending handling.

    You have this at the end: \s(.+)/mi.
    \s matches exactly one whitespace character, but a newline can be 1 or 2 characters long.

    To fix it, you can use \R(.+)/mi.
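
As a minimal sketch of the difference, the snippet below runs a shortened version of the pattern against the same cue with Windows-style and Linux-style line endings. The `$cueCrlf` and `$cueLf` strings and the shortened patterns are made up for illustration; only the trailing `\s(.+)` versus `\R(.+)` part matters.

```php
<?php
// Hypothetical sample data: the same cue with CRLF and with LF line endings.
$cueCrlf = "line1\r\n00:00:00.000 --> 00:00:10.000\r\nStuff\r\n";
$cueLf   = "line1\n00:00:00.000 --> 00:00:10.000\nStuff\n";

// Shortened patterns (assumption: only the timestamp line and the text line).
$withS = '/(\d{2}:\d{2}:\d{2}\.\d{2,3}) --> (\d{2}:\d{2}:\d{2}\.\d{2,3})\s(.+)/mi';
$withR = '/(\d{2}:\d{2}:\d{2}\.\d{2,3}) --> (\d{2}:\d{2}:\d{2}\.\d{2,3})\R(.+)/mi';

// preg_match_all() returns the number of full matches.
$sOnCrlf = preg_match_all($withS, $cueCrlf, $m1); // \s eats "\r", then (.+) hits "\n"
$sOnLf   = preg_match_all($withS, $cueLf, $m2);
$rOnCrlf = preg_match_all($withR, $cueCrlf, $m3); // \R eats "\r\n" as one unit
$rOnLf   = preg_match_all($withR, $cueLf, $m4);

var_dump($sOnCrlf, $sOnLf, $rOnCrlf, $rOnLf);

// With CRLF input, (.+) typically still captures the trailing "\r",
// because . only stops at "\n" by default, so trim the captured text.
$text = rtrim($m3[3][0]);
```

If the `\s` version finds nothing in your real file while the `\R` version does, Windows-style line endings are the likely cause.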

    It works on the website because the site normalizes your newlines into Linux-style newlines.
    That is, Windows-style newlines are \r\n (2 characters) and Linux-style newlines are \n (1 character).
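
Another option, offered here as an assumption rather than something taken from the patterns in this answer, is to do in PHP what the website does: convert every line ending to \n before matching, after which the \s-based pattern from the question works unchanged. The `$fileData` value below is a made-up two-cue sample.

```php
<?php
// Hypothetical sample: two cues with Windows-style (CRLF) line endings.
$fileData = "WEBVTT FILE\r\n\r\nline1\r\n00:00:00.000 --> 00:00:10.000\r\nStuff\r\n\r\n"
          . "line2\r\n00:00:10.000 --> 00:00:20.000\r\nOther stuff\r\n";

// Replace "\r\n" first, then any leftover lone "\r", with a single "\n".
$normalized = str_replace(["\r\n", "\r"], "\n", $fileData);

// The pattern from the question, unchanged.
$pattern = '/(.*)\s(\d{2}):(\d{2}):(\d{2})\.(\d{2,3}) --> (\d{2}):(\d{2}):(\d{2})\.(\d{2,3})\s(.+)/im';

$count = preg_match_all($pattern, $normalized, $matches);

// $matches[1] holds the header lines, $matches[10] the first text line of each cue.
print_r($matches[1]);
print_r($matches[10]);
```

This keeps the original regex readable, at the cost of one extra pass over the file contents.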


    Alternatively, you can try this regular expression:

    /(?:line(\d+)\R)?(\d{2}(?::\d{2}){2}\.\d{2,3})\s*-->\s*(\d{2}(?::\d{2}){2}\.\d{2,3})\R((?:[^\r\n]|\r?\n[^\r\n])*)(?:\r?\n\r?\n|$)/i
    

    It looks horrible, but it works.
    Note: I'm swapping between \R and \r?\n because \R does not work as a newline escape inside [] (there it matches the literal R).

    The data is captured like this:

    1. Line number (if present)
    2. Initial timestamp
    3. Final timestamp
    4. Multiline text
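
To make the group numbering concrete, here is a sketch that copies the four groups into one associative array per cue, which is roughly the shape you would want before inserting into a database. The key names (`line`, `start`, `end`, `text`) and the `$vtt` sample are my own choices for illustration, not part of the regex.

```php
<?php
// Hypothetical sample with CRLF line endings; the second cue has no "lineN" header.
$vtt = "WEBVTT FILE\r\n\r\n"
     . "line2\r\n00:00:10.000 --> 00:00:20.000\r\nOther stuff\r\nExample with 2 lines\r\n\r\n"
     . "00:00:40.000 --> 00:00:50.000\r\nExample without a head line";

$re = '/(?:line(\d+)\R)?(\d{2}(?::\d{2}){2}\.\d{2,3})\s*-->\s*(\d{2}(?::\d{2}){2}\.\d{2,3})\R((?:[^\r\n]|\r?\n[^\r\n])*)(?:\r?\n\r?\n|$)/i';

// PREG_SET_ORDER gives one array per cue: [full match, group 1, ..., group 4].
preg_match_all($re, $vtt, $matches, PREG_SET_ORDER);

$cues = [];
foreach ($matches as $m) {
    $cues[] = [
        // Group 1 is an empty string when the cue has no header line.
        'line'  => $m[1] !== '' ? (int) $m[1] : null,
        'start' => $m[2],
        'end'   => $m[3],
        // Multi-line text keeps its original line breaks.
        'text'  => $m[4],
    ];
}

print_r($cues);
```

Each element of `$cues` could then be bound to a prepared INSERT statement, with the table layout left to you.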

    You can try it on https://regex101.com/r/Yk8iD1/1

    You can use the handy code generator tool to generate the following PHP:

    $re = '/(?:line(\d+)\R)?(\d{2}(?::\d{2}){2}\.\d{2,3})\s*-->\s*(\d{2}(?::\d{2}){2}\.\d{2,3})\R((?:[^\r\n]|\r?\n[^\r\n])*)(?:\r?\n\r?\n|$)/i';
    $str = 'WEBVTT FILE
    
    line1
    00:00:00.000 --> 00:00:10.000
    ‘Stuff’
    
    line2
    00:00:10.000 --> 00:00:20.000
    Other stuff
    Example with 2 lines
    
    line3
    00:00:20.00 --> 00:00:30.000
    Example with only 2 digits in milliseconds
    
    line4
    00:00:30.000 --> 00:00:40.000
    Different stuff
    
    00:00:40.000 --> 00:00:50.000
    Example without a head line';
    
    preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
    
    // Print the entire match result
    var_dump($matches);
    

    You can test it on http://sandbox.onlinephpfunctions.com/code/7f5362f56e912f3504ed075e7013071059cdee7b

    This answer was selected as the best answer by the asker.
