dongwuxie5112 2013-04-24 04:07

How to go through a long text and convert it into MySQL INSERT statements

I have a very long text that looks like this:

1- E.M. Smith, J.P. LAVERGNE, P. VIALLEFONT et J. DAUNIS. Recherches en série triazépine-1,2,4. J. Heterocyclic Chem. 12, 66 (1975).

2- M. BENCHIDMI et E.M. ESSASSI. Synthèse de bis s-triazolo [4,3-b : 4,3-d] triazépines-1,2,4. J. Heterocyclic Chem., 13, 885 (1976).

3- LAVERGNE et P. VIALLEFONT. Hydrazinolyse d'azabenzodiazépinones et d'azabenzodiazépine-thiones de type 1,5. Tetrahedron, 33, 28O7 (1977).

4- E.M. ESSASSI. "Synthèse et étude de RMN1H en présence de l'Eu(fod)3 des pyrazolo [1,5,4-ef] benzodiazépine-1,5 ones-6 Bull. Soc. Chim. Belg., 96, 399 (1987).

. . . .

And the list continues for over 300 more entries. I need to extract each entry and add it to a MySQL INSERT query, removing the list numbers and escaping all single and double quotes. I have thought about using regular expressions, but it turns out to be quite difficult for me.

The insert query should look like:

INSERT INTO PUBLICATIONS (NAME,AUTHOR,CITE,PUB_YEAR) VALUES
("Recherches en série triazépine-1,2,4.", "E.M. Smith, J.P. LAVERGNE, P. VIALLEFONT et J. DAUNIS.","J. Heterocyclic Chem. 12, 66","1975"), 
( "Synthèse de bis s-triazolo [4,3-b : 4,3-d] triazépines-1,2,4.", "M. BENCHIDMI et E.M. ESSASSI.","J. Heterocyclic Chem., 13, 885","1976" ),
etc.

I added line breaks to the text above to make it readable, but the real data has no spacing or newlines between entries; it is all one huge string.

What I have thought is using something like:

$string = "all my string";
$pattern = '/regex pattern/';
$replacement = 'result format';
echo preg_replace($pattern, $replacement, $string);

I realized that splitting it up might be impossible, as there is no specific pattern, so I could maybe add a separator manually to split each line.

Thanks a lot!

1 answer

  • dongyuling0312 2013-04-24 06:22

    EDIT: After some observation, this kind of pattern can do the job, but I need more data to see all the possible exceptions and to better understand the "logic" of this kind of data. (The first answer, kept further down, is still an option.)

    Some rules I have seen:

    authors :

    • Optionally begin with forename initials, followed by the name
    • Authors are separated by a comma and a space, the last one by ~ et ~
    • End with a dot and a space

    titles :

    • Begin with an uppercase letter, optionally preceded by a double quote
    • Contain no dots before the end
    • Do not always end with a digit; when they do, the digit is:
      • preceded by a comma, and followed by a dot and a space
      • or preceded by a - and followed by a space
    • End with a dot, except when the final digit is followed directly by a space

    cites :

    • Begin with an uppercase letter
    • Consist of several words, each starting with an uppercase letter, that can be abbreviated with a dot
    • Followed by: an optional comma, space, number, comma, space, number, space

    Code:

    $subject = <<<LOD
    1- E.M. Smith, J.P. LAVERGNE, P. VIALLEFONT et J. DAUNIS. Recherches en série triazépine-1,2,4. J. Heterocyclic Chem. 12, 66 (1975).
    2- M. BENCHIDMI et E.M. ESSASSI. Synthèse de bis s-triazolo [4,3-b : 4,3-d] triazépines-1,2,4. J. Heterocyclic Chem., 13, 885 (1976).
    3- LAVERGNE et P. VIALLEFONT. Hydrazinolyse d'azabenzodiazépinones et d'azabenzodiazépine-thiones de type 1,5. Tetrahedron, 33, 28O7 (1977).
    4- E.M. ESSASSI. "Synthèse et étude de RMN1H en présence de l'Eu(fod)3 des pyrazolo [1,5,4-ef] benzodiazépine-1,5 ones-6 Bull. Soc. Chim. Belg., 96, 399 (1987).
    1O- J.M.F. BOURGOIN-DE-LA-VILLARDIERE. Recherches en série triazepine-1,2,4: 1 - détermination de la structure de la triazolotriazépinone obtenue par action de l'acétylacétate d'éthyle sur le diamino-3,4 triazole-1,2,4 J. Heterocyclic Chem., 13, 885 (1976).
    LOD;
    $pattern =
     '~# authors :
      (?(DEFINE)(?<FN>(?:[A-Z]\.){0,3}+(?(?<=\.)\h)) ) # ForName
      (?(DEFINE)(?<NM>[A-Z](?:[A-Z]++|[a-z]++)(?:-[A-Z](?:[A-Z]++|[a-z]++))*+)) # NaMe
      [O\d]++-\h(?<author>(?&FN)(?&NM)(?>(,\h(?&FN)(?&NM))*+\het\h(?&FN)(?&NM))?+)\.\h
      # titles :
      "?+(?<title>[A-Z][^.]+?(?:\.|(?:,|-)\d))\h
      # cites :
      (?<cite>(?:[A-Z][a-z]*+\.?+\h)*[A-Z][a-z]*+\.?+,?+\h[O\d]++,\h[O\d]++)\h
      # date :
      \((?<date>[^)]++)\) 
     ~x';               
    
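    preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER);
    foreach ($matches as &$match) {  // cosmetic: keep only the named groups of interest
        foreach ($match as $key => $value) {
            // drop numeric groups and the two helper groups from the DEFINE blocks
            if (is_int($key) || $key === 'NM' || $key === 'FN') unset($match[$key]);
        }
    }
    unset($match); // break the reference left over from the by-reference foreach
    echo '<meta charset="UTF-8"/><pre>' . print_r($matches, true) . '</pre>';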
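
To get from the `$matches` array to the INSERT statement the question asks for, something along these lines might work. This is only a sketch: the helper names `sqlQuote` and `buildInsert` are made up for illustration, the table and column names are copied from the question, and `addslashes` is a simple stand-in for real escaping. With a live connection, prepared statements or `PDO::quote` would be the safer choice.

```php
<?php
// Hypothetical helper: wrap a value in double quotes and backslash-escape
// quotes and backslashes (a stand-in for the database driver's own quoting).
function sqlQuote(string $value): string
{
    return '"' . addslashes($value) . '"';
}

// Hypothetical helper: turn rows shaped like the cleaned-up $matches entries
// (keys author, title, cite, date) into one multi-row INSERT statement.
function buildInsert(array $rows): string
{
    $tuples = [];
    foreach ($rows as $row) {
        $tuples[] = '(' . implode(', ', [
            sqlQuote($row['title']),
            sqlQuote($row['author']),
            sqlQuote($row['cite']),
            sqlQuote($row['date']),
        ]) . ')';
    }
    return "INSERT INTO PUBLICATIONS (NAME, AUTHOR, CITE, PUB_YEAR) VALUES\n"
         . implode(",\n", $tuples) . ';';
}

// One row in the shape produced by the pattern above.
$rows = [[
    'author' => 'LAVERGNE et P. VIALLEFONT',
    'title'  => "Hydrazinolyse d'azabenzodiazépinones et d'azabenzodiazépine-thiones de type 1,5.",
    'cite'   => 'Tetrahedron, 33, 28O7',
    'date'   => '1977',
]];
echo buildInsert($rows), "\n";
```

With the real data, `buildInsert($matches)` would be called after the cosmetic loop. Entries the pattern fails to match are silently missing from `$matches`, so comparing `count($matches)` with the expected number of entries is a cheap sanity check.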
    

    --Answer before edit--

    Wow, have you noticed that there is absolutely nothing that marks the boundaries between Author, Name and Cite? One way is to slice the text by hand (a simple newline between Author, Name and Cite); at about 5 s per line, you would finish in less than 30 min, toutouyoutou :).

    I say that because the only difference I can see between Author, Name and Cite is their meaning, which a regex cannot match.

    If you do this tedious work, it will be easy to build the SQL query. Example:

    1- E.M. Smith, J.P. LAVERGNE, P. VIALLEFONT et J. DAUNIS.
    Recherches en série triazépine-1,2,4.
    J. Heterocyclic Chem. 12, 66 (1975).
    

    That's all. There is no need to touch the number or the date; the regex can take care of those. If you do this work, edit your question so that you can get some help with the regex.
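
A rough sketch of that second step, assuming each entry has been split by hand into exactly three lines (authors, title, citation) and that entries are separated by a blank line. The blank-line separator and the function name `parseRecord` are assumptions made for this example, not something the hand-split format above prescribes.

```php
<?php
// Hypothetical parser for one hand-split entry: three lines holding
// authors, title and citation (with the year in parentheses at the end).
function parseRecord(string $record): ?array
{
    $lines = array_values(array_filter(array_map('trim', explode("\n", $record)), 'strlen'));
    if (count($lines) !== 3) {
        return null; // not a well-formed entry
    }
    [$author, $title, $cite] = $lines;

    // Drop the list number ("1- ", "1O- "; the data sometimes has O for 0).
    $author = preg_replace('/^[O\d]+-\s*/', '', $author);

    // Split "J. Heterocyclic Chem. 12, 66 (1975)." into citation and year.
    if (!preg_match('/^(.*?)\s*\(([^)]+)\)\.?$/', $cite, $m)) {
        return null;
    }
    return ['author' => $author, 'title' => $title, 'cite' => $m[1], 'date' => $m[2]];
}

$text = <<<TXT
1- E.M. Smith, J.P. LAVERGNE, P. VIALLEFONT et J. DAUNIS.
Recherches en série triazépine-1,2,4.
J. Heterocyclic Chem. 12, 66 (1975).

2- M. BENCHIDMI et E.M. ESSASSI.
Synthèse de bis s-triazolo [4,3-b : 4,3-d] triazépines-1,2,4.
J. Heterocyclic Chem., 13, 885 (1976).
TXT;

// Entries are assumed to be separated by one or more blank lines.
$records = array_map('parseRecord', preg_split('/\R\s*\R/', trim($text)));
print_r($records);
```

Entries that do not have exactly three lines, or whose last line lacks a year in parentheses, come back as `null`, which makes the slicing mistakes easy to spot before any SQL is generated.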

    This answer was selected by the asker as the best answer.
