dongwuxie5112 2013-04-24 04:07
Accepted

How to go through a long text and convert it into MySQL INSERT statements

I have a very long text that looks like this:

1- E.M. Smith, J.P. LAVERGNE, P. VIALLEFONT et J. DAUNIS. Recherches en série triazépine-1,2,4. J. Heterocyclic Chem. 12, 66 (1975).

2- M. BENCHIDMI et E.M. ESSASSI. Synthèse de bis s-triazolo [4,3-b : 4,3-d] triazépines-1,2,4. J. Heterocyclic Chem., 13, 885 (1976).

3- LAVERGNE et P. VIALLEFONT. Hydrazinolyse d'azabenzodiazépinones et d'azabenzodiazépine-thiones de type 1,5. Tetrahedron, 33, 28O7 (1977).

4- E.M. ESSASSI. "Synthèse et étude de RMN1H en présence de l'Eu(fod)3 des pyrazolo [1,5,4-ef] benzodiazépine-1,5 ones-6 Bull. Soc. Chim. Belg., 96, 399 (1987).

. . . .

And the list continues for over 300 more entries. I need to extract each line and add it to an INSERT query for MySQL, removing the list numbers and escaping all single and double quotes. I have thought about using regular expressions, but it turns out to be quite difficult for me.

The insert query should look like:

INSERT INTO PUBLICATIONS (NAME,AUTHOR,CITE,PUB_YEAR) VALUES
("Recherches en série triazépine-1,2,4.", "E.M. Smith, J.P. LAVERGNE, P. VIALLEFONT et J. DAUNIS.","J. Heterocyclic Chem. 12, 66","1975"), 
( "Synthèse de bis s-triazolo [4,3-b : 4,3-d] triazépines-1,2,4.", "M. BENCHIDMI et E.M. ESSASSI.","J. Heterocyclic Chem., 13, 885","1976" ),
etc.

I only formatted the text above to give an idea of its content; the real text has no line breaks between entries, it is all one huge string.

What I have thought of is using something like:

$string = "all my string";
$pattern = '/regex pattern/';
$replacement = 'result format';
echo preg_replace($pattern, $replacement, $string);

I realized that splitting it up might be impossible as there is no specific pattern, so I could maybe add a delimiter manually to split each line.

Thanks a lot!


1 answer

  • dongyuling0312 2013-04-24 06:22

    EDIT: After some observation, this kind of pattern can do the job, but I need more data to see all possible exceptions and to better understand the "logic" of this kind of data. (But the first answer is still an option.)

    Some rules I have seen:

    authors :

    • Begin with optional first-name initials, followed by the surname
    • Authors are separated by a comma and a space, the last one by " et " (with a space on each side)
    • End with a dot and a space

    titles :

    • Begin with an uppercase letter, possibly preceded by a double quote
    • Contain no dots, apart from a possible final one
    • Often, but not always, end with a digit:
      • preceded by a comma and followed by a dot and a space
      • or preceded by a - and followed by a space
    • The final dot may be missing

    cites :

    • Begin with an uppercase letter
    • Consist of several words, each starting with an uppercase letter, which can be abbreviated with a dot
    • Are followed by: optional comma, space, number, comma, space, number, space

    code

    $subject = <<<LOD
    1- E.M. Smith, J.P. LAVERGNE, P. VIALLEFONT et J. DAUNIS. Recherches en série triazépine-1,2,4. J. Heterocyclic Chem. 12, 66 (1975).
    2- M. BENCHIDMI et E.M. ESSASSI. Synthèse de bis s-triazolo [4,3-b : 4,3-d] triazépines-1,2,4. J. Heterocyclic Chem., 13, 885 (1976).
    3- LAVERGNE et P. VIALLEFONT. Hydrazinolyse d'azabenzodiazépinones et d'azabenzodiazépine-thiones de type 1,5. Tetrahedron, 33, 28O7 (1977).
    4- E.M. ESSASSI. "Synthèse et étude de RMN1H en présence de l'Eu(fod)3 des pyrazolo [1,5,4-ef] benzodiazépine-1,5 ones-6 Bull. Soc. Chim. Belg., 96, 399 (1987).
    1O- J.M.F. BOURGOIN-DE-LA-VILLARDIERE. Recherches en série triazepine-1,2,4: 1 - détermination de la structure de la triazolotriazépinone obtenue par action de l'acétylacétate d'éthyle sur le diamino-3,4 triazole-1,2,4 J. Heterocyclic Chem., 13, 885 (1976).
    LOD;
    $pattern =
     '~# authors :
      (?(DEFINE)(?<FN>(?:[A-Z]\.){0,3}+(?(?<=\.)\h)) ) # ForName
      (?(DEFINE)(?<NM>[A-Z](?:[A-Z]++|[a-z]++)(?:-[A-Z](?:[A-Z]++|[a-z]++))*+)) # NaMe
      [O\d]++-\h(?<author>(?&FN)(?&NM)(?>(,\h(?&FN)(?&NM))*+\het\h(?&FN)(?&NM))?+)\.\h
      # titles :
      "?+(?<title>[A-Z][^.]+?(?:\.|(?:,|-)\d))\h
      # cites :
      (?<cite>(?:[A-Z][a-z]*+\.?+\h)*[A-Z][a-z]*+\.?+,?+\h[O\d]++,\h[O\d]++)\h
      # date :
      \((?<date>[^)]++)\) 
     ~x';               
    
    preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER);
    // Cosmetic: keep only the named groups author, title, cite and date.
    foreach ($matches as &$match) {
        foreach ($match as $key => $value) {
            if (is_numeric($key) || $key === 'NM' || $key === 'FN') unset($match[$key]);
        }
    }
    unset($match); // break the reference left behind by the foreach
    echo '<meta charset="UTF-8"/><pre>' . print_r($matches, true) . '</pre>';
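
    To get from rows like these to the INSERT statement the question asks for, something like the following could work. It is only a sketch: the helper names sqlQuote() and buildInsert() are hypothetical, each row is assumed to carry the author, title, cite and date keys produced by the pattern above, and addslashes() is used only to keep the example free of a database connection. With a live connection, prepared statements (PDO or mysqli) would be the safer choice.

    ```php
    <?php
    // Hypothetical helper: wrap a value in double quotes, backslash-escaping
    // quotes and backslashes. Acceptable for a one-off import script only.
    function sqlQuote(string $value): string
    {
        return '"' . addslashes(trim($value)) . '"';
    }

    // Hypothetical helper: turn rows shaped like the matches above into one
    // multi-row INSERT for the PUBLICATIONS table from the question.
    function buildInsert(array $rows): string
    {
        $tuples = [];
        foreach ($rows as $row) {
            $tuples[] = '(' . implode(', ', [
                sqlQuote($row['title']),
                sqlQuote($row['author']),
                sqlQuote($row['cite']),
                sqlQuote($row['date']),
            ]) . ')';
        }
        return "INSERT INTO PUBLICATIONS (NAME, AUTHOR, CITE, PUB_YEAR) VALUES\n"
            . implode(",\n", $tuples) . ';';
    }

    // Sample row copied by hand from entry 3 of the list.
    $rows = [[
        'author' => 'LAVERGNE et P. VIALLEFONT',
        'title'  => "Hydrazinolyse d'azabenzodiazépinones et d'azabenzodiazépine-thiones de type 1,5.",
        'cite'   => 'Tetrahedron, 33, 28O7',
        'date'   => '1977',
    ]];
    echo buildInsert($rows), "\n";
    ```

    In practice the $rows array would simply be $matches after the cosmetic loop. Note that the author group captured by the pattern does not include the final dot.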
    

    --Answer before edit--

    Wow, do you notice that there is absolutely nothing to mark the boundary between Author, Name and Cite? One way is to split them by hand (a simple newline between Author, Name and Cite); at about 5s per line, you finish in less than 30 min, toutouyoutou :).

    I say that because the only difference I see between Author, Name and Cite is their meaning, which can't be matched with a regex.

    If you do this tedious work, it will be easy to build the SQL query. Example:

    1- E.M. Smith, J.P. LAVERGNE, P. VIALLEFONT et J. DAUNIS.
    Recherches en série triazépine-1,2,4.
    J. Heterocyclic Chem. 12, 66 (1975).
    

    That's all; there is no need to touch the number or the date, the regex can handle those. If you do this work, edit your question so that we can help with the regex.
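
    Once the text has been split by hand like this, the remaining extraction is small. The following is a rough sketch under that assumption (authors, title and cite on three consecutive lines); the function name parseRecord() is hypothetical, and the list number is allowed to contain an O typed in place of a zero, as in the sample data.

    ```php
    <?php
    // Hypothetical helper: parse one hand-split record (three lines: authors,
    // title, cite with the year in parentheses). Returns null when the record
    // does not have the expected shape.
    function parseRecord(string $record): ?array
    {
        $lines = array_values(array_filter(array_map('trim', explode("\n", $record)), 'strlen'));
        if (count($lines) !== 3) {
            return null;
        }
        list($author, $title, $cite) = $lines;

        // Drop the list number ("1- ", "1O- ", ...) in front of the authors.
        $author = preg_replace('~^[O\d]+-\s*~', '', $author);

        // Split "J. Heterocyclic Chem. 12, 66 (1975)." into cite and year.
        if (!preg_match('~^(.*?)\s*\(([^)]+)\)\.?$~', $cite, $m)) {
            return null;
        }
        return ['author' => $author, 'title' => $title, 'cite' => $m[1], 'date' => $m[2]];
    }

    $record = "1- E.M. Smith, J.P. LAVERGNE, P. VIALLEFONT et J. DAUNIS.\n"
            . "Recherches en série triazépine-1,2,4.\n"
            . "J. Heterocyclic Chem. 12, 66 (1975).";
    print_r(parseRecord($record));
    ```

    Records that do not have exactly three non-empty lines come back as null, which makes it easy to list the entries that still need manual attention.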
