dsi36131 2013-11-07 17:39

Go, regex: a very challenging regular expression

Do you think this is possible with regex alone?

Here is my attempt on the Go Playground. It works, but only with some dirty code:

http://play.golang.org/p/YysZCB3vlu

I want expanded (decomposed) Korean characters to be combined into complete letters. For example, "ㅈㅗㅎㅇㅡㄴㄱㅏㅂㅅㅇㅣㅆㅏㅇㅛㅇㅏㅊㅣㅁㅇㅏㄴㄴㅕㅇㅎㅏㅅㅔㅇㅛㅇㅜㅔㄴ" should become 좋은값이싸요아침안녕하세요웬

For browsers that don't render Korean characters correctly:

좋   은   값   이   싸   요   아   침   안   녕   하   세   요   웬

The easy part is that a Korean letter always starts with one consonant followed by one or two vowels. That can be caught with (.([ㅏ-ㅣ])+).

The challenging part is the zero, one, or at most two optional consonants that follow the vowel. It is hard because, after those optional consonants, the next consonant does not belong to the previous letter: it starts a new letter.

Like below:

ㄱㅏㅂㅅㅇㅣ
= ㄱㅏㅂㅅ  +  ㅇㅣ
= 값 + 이
= 값이
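
The joining step in the example above can be done with the Unicode Hangul syllable arithmetic, where a syllable's code point is 0xAC00 + (initial*21 + vowel)*28 + final. The following is a minimal sketch, assuming the input uses Hangul Compatibility Jamo (the glyphs shown in this question) and that each group has already been split off. The names composeGroup, merge, and indexOf are hypothetical helpers, not part of any library.

```go
package main

import "fmt"

// Orderings used by the Unicode Hangul syllable block (U+AC00 onward).
var (
	initials = []rune("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")
	vowels   = []rune("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")
	// Index 0 means "no final consonant"; the leading space is only a placeholder.
	finals = []rune(" ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")
)

// Two-jamo sequences that merge into one compound vowel or compound final.
var vowelPairs = map[string]rune{
	"ㅗㅏ": 'ㅘ', "ㅗㅐ": 'ㅙ', "ㅗㅣ": 'ㅚ',
	"ㅜㅓ": 'ㅝ', "ㅜㅔ": 'ㅞ', "ㅜㅣ": 'ㅟ', "ㅡㅣ": 'ㅢ',
}
var finalPairs = map[string]rune{
	"ㄱㅅ": 'ㄳ', "ㄴㅈ": 'ㄵ', "ㄴㅎ": 'ㄶ', "ㄹㄱ": 'ㄺ', "ㄹㅁ": 'ㄻ', "ㄹㅂ": 'ㄼ',
	"ㄹㅅ": 'ㄽ', "ㄹㅌ": 'ㄾ', "ㄹㅍ": 'ㄿ', "ㄹㅎ": 'ㅀ', "ㅂㅅ": 'ㅄ',
}

// indexOf returns the position of r in set, or -1 if it is absent.
func indexOf(set []rune, r rune) int {
	for i, x := range set {
		if x == r {
			return i
		}
	}
	return -1
}

// merge reduces one or two jamo to a single jamo using the given pair table.
func merge(rs []rune, pairs map[string]rune) (rune, bool) {
	switch len(rs) {
	case 1:
		return rs[0], true
	case 2:
		r, ok := pairs[string(rs)]
		return r, ok
	}
	return 0, false
}

// composeGroup turns one group such as "ㄱㅏㅂㅅ" into a syllable such as '값'.
// It reports false if the group is not initial + vowel(s) + optional final(s).
func composeGroup(group string) (rune, bool) {
	rs := []rune(group)
	if len(rs) < 2 {
		return 0, false
	}
	li := indexOf(initials, rs[0])
	// Find where the run of vowels after the initial consonant ends.
	n := 1
	for n < len(rs) && indexOf(vowels, rs[n]) >= 0 {
		n++
	}
	v, ok := merge(rs[1:n], vowelPairs)
	if li < 0 || !ok {
		return 0, false
	}
	vi, t := indexOf(vowels, v), 0
	if n < len(rs) {
		f, ok := merge(rs[n:], finalPairs)
		if t = indexOf(finals, f); !ok || t <= 0 {
			return 0, false
		}
	}
	return rune(0xAC00 + (li*21+vi)*28 + t), true
}

func main() {
	for _, g := range []string{"ㄱㅏㅂㅅ", "ㅇㅣ"} {
		if r, ok := composeGroup(g); ok {
			fmt.Print(string(r))
		}
	}
	fmt.Println() // prints 값이
}
```

This only covers the join; deciding where one group ends and the next begins is the separate problem described in this question.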

It is possible to catch all the patterns with if-conditions and basic regex, but it would be good to have a shorter version.

My ultimate goal is to convert "ㅈㅗㅎㅇㅡㄴㄱㅏㅂㅅㅇㅣㅆㅏㅇㅛㅇㅏㅊㅣㅁㅇㅏㄴㄴㅕㅇㅎㅏㅅㅔㅇㅛㅇㅜㅔㄴ" to 좋은값이싸요아침안녕하세요웬

1 answer

  • dougan4884 2013-11-07 19:10

    I don't know Korean, but it sounds like your possible input combinations are:

    CV (C = consonant, V = vowel)
    CVV
    CVVC
    CVVCC
    CVC
    CVCC
    

    So a regex rule to capture that (without capturing the first consonant of the next letter) is: CV{1,2}C{0,2}(?!V)

    Then you just need to define your C and V character classes, for example by replacing V with [ㅏ-ㅣ] and C with [ㄱ-ㅎ].
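
To make the role of the trailing (?!V) concrete, here is a small sketch in Go that defines the two classes and runs the pattern with the lookahead left out. It assumes the input uses Hangul Compatibility Jamo, where the consonants span ㄱ to ㅎ and the vowels span ㅏ to ㅣ. The names consonant, vowel, and greedy are illustrative only.

```go
package main

import (
	"fmt"
	"regexp"
)

const (
	consonant = `[ㄱ-ㅎ]` // U+3131..U+314E
	vowel     = `[ㅏ-ㅣ]` // U+314F..U+3163
)

// greedy is CV{1,2}C{0,2} with the lookahead left out.
var greedy = regexp.MustCompile(consonant + vowel + `{1,2}` + consonant + `{0,2}`)

func main() {
	// Works when the letter really ends in two consonants.
	fmt.Println(greedy.FindAllString("ㄱㅏㅂㅅㅇㅣ", -1)) // [ㄱㅏㅂㅅ ㅇㅣ]
	// Fails when the second consonant is the start of the next letter:
	// it is swallowed, and the leftover "ㅕㅇ" no longer matches anything.
	fmt.Println(greedy.FindAllString("ㅇㅏㄴㄴㅕㅇ", -1)) // [ㅇㅏㄴㄴ]
}
```

The second call shows the greedy C{0,2} taking the initial consonant of the following letter, which is exactly what the lookahead is meant to prevent.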

    Use your program to loop through the matches found in the string and output the combined letters.

    EDIT: Go doesn't support negative lookahead, so I suggest doing the following:

    1. Reverse the string (see "How to reverse a string in Go?", and reverse by rune rather than by byte so that multi-byte UTF-8 sequences stay intact)
    2. Run a match on C{0,2}V{1,2}C
    3. Reverse each match and perform the letter join/lookup
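
The three steps above can be sketched as follows. This is only an illustration under the assumption that the input consists of Hangul Compatibility Jamo; reverse and splitLetters are hypothetical helper names, and the join/lookup of each group into a finished letter is left out.

```go
package main

import (
	"fmt"
	"regexp"
)

// reversedLetter is C{0,2}V{1,2}C, meant to be applied to the reversed input.
var reversedLetter = regexp.MustCompile(`[ㄱ-ㅎ]{0,2}[ㅏ-ㅣ]{1,2}[ㄱ-ㅎ]`)

// reverse reverses a string rune by rune, so multi-byte characters stay intact.
func reverse(s string) string {
	rs := []rune(s)
	for i, j := 0, len(rs)-1; i < j; i, j = i+1, j-1 {
		rs[i], rs[j] = rs[j], rs[i]
	}
	return string(rs)
}

// splitLetters returns the jamo groups of s in reading order.
func splitLetters(s string) []string {
	matches := reversedLetter.FindAllString(reverse(s), -1)
	groups := make([]string, len(matches))
	for i, m := range matches {
		// Matches come out last letter first, so fill the result from the end.
		groups[len(matches)-1-i] = reverse(m)
	}
	return groups
}

func main() {
	fmt.Println(splitLetters("ㄱㅏㅂㅅㅇㅣ")) // [ㄱㅏㅂㅅ ㅇㅣ]
	fmt.Println(splitLetters("ㅇㅏㄴㄴㅕㅇ")) // [ㅇㅏㄴ ㄴㅕㅇ]
}
```

Reversing works because every letter starts with exactly one consonant. In the reversed string that consonant becomes a mandatory terminator, so no lookahead is needed to find the boundary.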

    There are other ways of getting around the lack of negative lookahead, but they will probably involve a lot more code to manipulate where the next match starts in the input string.
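
One such alternative, sketched here as an illustration rather than a recommendation, is to match greedily from the current position and then hand the last consonant back whenever the next character turns out to be a vowel. It assumes Hangul Compatibility Jamo input; splitForward, isVowel, and isConsonant are hypothetical names.

```go
package main

import (
	"fmt"
	"regexp"
	"unicode/utf8"
)

// greedyLetter is CV{1,2}C{0,2}, anchored so it only matches at the current position.
var greedyLetter = regexp.MustCompile(`^[ㄱ-ㅎ][ㅏ-ㅣ]{1,2}[ㄱ-ㅎ]{0,2}`)

func isVowel(r rune) bool     { return r >= 'ㅏ' && r <= 'ㅣ' }
func isConsonant(r rune) bool { return r >= 'ㄱ' && r <= 'ㅎ' }

// splitForward walks the input left to right and returns one jamo group per letter.
func splitForward(s string) []string {
	var groups []string
	for pos := 0; pos < len(s); {
		loc := greedyLetter.FindStringIndex(s[pos:])
		if loc == nil {
			// Nothing can start a letter here: skip one rune.
			_, size := utf8.DecodeRuneInString(s[pos:])
			pos += size
			continue
		}
		end := pos + loc[1]
		next, _ := utf8.DecodeRuneInString(s[end:])
		last, size := utf8.DecodeLastRuneInString(s[pos:end])
		if isVowel(next) && isConsonant(last) {
			// The last consonant starts the next letter, so give it back.
			end -= size
		}
		groups = append(groups, s[pos:end])
		pos = end
	}
	return groups
}

func main() {
	fmt.Println(splitForward("ㅇㅏㄴㄴㅕㅇ")) // [ㅇㅏㄴ ㄴㅕㅇ]
}
```

The manual check after each match plays the role of the lookahead, at the cost of tracking byte offsets by hand.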

    Also, when defining the set of characters you will look for as vowels or consonants, it would be better to use Unicode escape sequences rather than the Korean glyphs themselves. Go's regexp syntax accepts the \x{...} form, so the vowel class [ㅏ-ㅣ] can be written as [\x{314F}-\x{3163}].

    This answer was selected as the best answer by the asker.