doujia1939 2014-06-14 11:20

Text processing with PHP

Stack Overflow: I need your help!

I've been tasked with turning some (fairly) complex work diagrams for railway staff extracted from a Word document into something more usable for further processing, such as into a PHP array.

Here is a sample of one of the work diagrams:

LTP  BH  4000
( Link 5)
    DVR                         Su
On  00.22       PASS    Barnham     00+34   5H97                
Off 08.03           Lham    00+42                       
Hrs 7:41        PPTC    Lham        (06+24) 5N08                
                Traction for the above Service is           
Days    Su          class 377
From    18/05/2014  377 PC  Lham        01+46   5S62    DOO         
To  24/08/2014          (Via CET)           
            TC  Lham O Sh   01+50                       
            PNB             
        377 PC  Lham O Sh       03+10   5W62    DOO         
                (Via CWM)           
            DTCS    Lham    03+32                       
        377 PP  Lham Shed       04+10   5W00    DOO         
                (Via CWM)           
            DTCS    Lham Shed   04+24                       
            PPTC    Lham Shed       (07+39) 5E24                
                Traction for the above Service is           
                class 377
            PPTC    Lham        (06+37) 5H92                
                Traction for the above service is           
                class 377
        377 PP  Lham Shed       05+45   5W01    DOO         
                (Via CET)           
        377     Lham O Sh   05+57   06+28   5W01    DOO         
                (Via CWM)           
            TC  Lham Shed   06+42                       
            PPTC    Lham Shed       (09+58) 5H67                
                Traction for the above Service is           
                class 377
            PPTC    Lham Shed       (07+41) 5P29        RP MO       
                Traction for the above Service is           
                class 377
                (Unit forms part of 22+17           
                attachment)
            PASS    Lham        07.54   2P31                
                (To Bognor Regis)           
                Barnham 08.02                       


Routes  919

I've managed to process some of the data using simple regular expressions, but I am struggling with the "middle" data, which actually shows the work to be done. There is no real structure that defines what each line should look like: many lines differ from one another, and some even include free-text notes.

What I am looking to accomplish is to turn each row into an array that looks like the following:

$row = array("stock", "activity", "location", "departure_time", "arrival_time", "train_id", "notes");

The difficulty comes as not every line fits into this format - some lines have every "column", whereas others have one or more columns missing and other lines consist of free text.

I am by no means a text processing expert, and I cannot seem to find a solution to this problem. I'm not after a complete solution; some pointers would be gratefully received!

Update: Just for clarification, I'm not interested in the free-text rows. The data they contain is not important for what I am trying to accomplish.
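
Since the free-text rows can be discarded, one possible first pass is to keep only the lines that look like work rows. The sketch below is a heuristic, not a definitive parser. It assumes that a work row contains at least one time token such as `00+34`, `07.54` or `(06+24)`, and that note rows open with a parenthesis followed by a letter, as in `(Via CET)`. The function name `looksLikeWorkRow` is hypothetical.

```php
<?php
// Hypothetical helper: true when a line looks like a work row rather than a note.
function looksLikeWorkRow(string $line): bool
{
    $trimmed = trim($line);

    // Assumption: note rows such as "(Via CET)" open with "(" followed by a letter.
    if (preg_match('/^\([A-Za-z]/', $trimmed) === 1) {
        return false;
    }

    // Assumption: a time token is two digits, "+" or ".", two digits,
    // optionally wrapped in parentheses, e.g. 00+34, 07.54, (06+24).
    return preg_match('/\(?\b\d{2}[+.]\d{2}\)?/', $trimmed) === 1;
}

// A few lines copied from the sample diagram.
$lines = [
    '        377 PC  Lham O Sh       03+10   5W62    DOO',
    '                (Via CWM)',
    '            DTCS    Lham    03+32',
    '                Traction for the above Service is',
];

// Keep only the lines the heuristic accepts (the first and the third here).
$workRows = array_values(array_filter($lines, 'looksLikeWorkRow'));
```

Under these assumptions a row with no time at all, such as `PNB`, is dropped too, so it would need a rule of its own if it matters.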


2 answers

  • dongzouqie4220 2014-06-14 22:18

    I found what was causing me grief in solving this. I'm loading the Word document using a tool called "antiword". Antiword seems to strip special characters such as tabs. However, I found that when I pass the "-w 0" switch, these characters are preserved, and parsing the diagrams with simple regular expressions became trivial. Many thanks to @Iserni for taking the time to help me, nonetheless.
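
If the tabs do survive as described, each work row can be split on the tab character instead of guessing at column boundaries. The following is a minimal sketch under assumptions the thread does not confirm: that every column is separated by exactly one tab, that an empty column yields an empty cell, and that the cells arrive in the order of the column names given in the question. The names `COLUMNS` and `parseTabbedRow` and the sample line are hypothetical.

```php
<?php
// Column names from the question; their order in the real output is an assumption.
const COLUMNS = ['stock', 'activity', 'location', 'departure_time',
                 'arrival_time', 'train_id', 'notes'];

// Hypothetical helper: turn one tab-separated line into an associative array.
function parseTabbedRow(string $line): array
{
    $width = count(COLUMNS);

    // Split on tabs and trim stray spaces from every cell.
    $cells = array_map('trim', explode("\t", rtrim($line, "\r\n")));

    // Pad short rows with empty strings and drop any surplus cells,
    // so array_combine() always receives matching lengths.
    $cells = array_slice(array_pad($cells, $width, ''), 0, $width);

    return array_combine(COLUMNS, $cells);
}

// Made-up tab-separated line for illustration; the arrival cell is empty.
$row = parseTabbedRow("377\tPC\tLham\t01+46\t\t5S62\tDOO");
```

Missing columns then show up as empty strings rather than shifting the later values, which is the main reason to prefer tab splitting over whitespace splitting here. The actual column order should be checked against real antiword output before relying on the keys.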

    This answer was accepted by the asker as the best answer.
