douzou7012 2011-04-28 23:07

How do I fix sentence spacing in plain text extracted from HTML?

I'm pulling articles from specific URLs and splitting them into sentences, but the extracted text body seemingly at random drops the whitespace between some sentences, resulting in:

Jane went to the store.She bought a dog. The dog was very friendly.It had no teeth.

Some of my text contains stock symbols such as (AZ.GAN), so I can't simply insert a space after every period that has no adjacent whitespace.

Jane bought several shares of (TY.JPN). She lost all her cash money."Arg!" She cried.

Blindly inserting spaces in the example above would break the stock symbol.
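
To make the problem concrete, here is a minimal sketch in JavaScript. The function names `naiveFix` and `guardedFix` are hypothetical, and the guard rests on an assumption: a sentence ends in a lowercase letter and the next one starts with an uppercase letter (optionally after an opening double quote), while stock symbols are all uppercase. It is a heuristic for text shaped like the examples above, not a general sentence splitter.

```javascript
// Naive approach: add a space after every period that is directly
// followed by a non-whitespace character. This also splits "TY.JPN".
function naiveFix(text) {
  return text.replace(/\.(?=\S)/g, '. ');
}

// Guarded heuristic (assumption: symbols are all uppercase, sentences
// end in a lowercase letter). Adds a space only when the terminator
// follows a lowercase letter and precedes an uppercase letter,
// optionally preceded by an opening double quote.
function guardedFix(text) {
  return text.replace(/([a-z])([.!?])(?="?[A-Z])/g, '$1$2 ');
}

const sample =
  'Jane bought several shares of (TY.JPN). She lost all her cash money."Arg!" She cried.';

console.log(naiveFix(sample));   // the symbol becomes "(TY. JPN)"
console.log(guardedFix(sample)); // the symbol is left untouched
```

The guard would still misfire on abbreviations or mixed-case tokens glued to a following capitalized word, so it is worth checking against real article text before relying on it.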

I'm curious whether anyone knows the cause of this. I have tried several HTML and DOM parsers. I use Simple_DOM to grab the plaintext, although I get the same result if I do it manually or with any other parsing engine.

2 answers

  • dongxi5505 2011-04-28 23:17

    Unfortunately I don't have a fix for your specific question, but is it possible that the missing space between sentences is actually a line break (e.g. \n) that your text viewer (whatever it is) isn't showing you?

    Perhaps try something like this, just to make sure:

    var articleContent = ... // get content
    // Replace every line break with a visible marker so hidden breaks show up
    articleContent = articleContent.replace(/\n/g, ' NEW LINE ');
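
If hidden characters do turn out to be the cause, the check above could be extended along these lines. This is only a sketch: `showWhitespace` and `normalizeWhitespace` are hypothetical helper names, and the set of characters inspected (carriage return, line feed, tab, non-breaking space) is an assumption about what the extracted text might contain.

```javascript
// Hypothetical helper: replace invisible whitespace characters with
// visible bracketed escape markers so they can be spotted in the output.
function showWhitespace(text) {
  const names = { '\r': '\\r', '\n': '\\n', '\t': '\\t', '\u00a0': '\\u00a0' };
  return text.replace(/[\r\n\t\u00a0]/g, function (ch) {
    return '[' + names[ch] + ']';
  });
}

// Hypothetical helper: collapse every run of whitespace (including line
// breaks and non-breaking spaces) into one ordinary space.
function normalizeWhitespace(text) {
  return text.replace(/\s+/g, ' ').trim();
}

const extracted = 'Jane went to the store.\nShe bought a dog.';

console.log(showWhitespace(extracted));      // shows a [\n] marker between the sentences
console.log(normalizeWhitespace(extracted)); // sentences separated by a single space
```

If the markers appear between the glued sentences, normalizing the whitespace fixes the spacing without touching periods at all, so stock symbols stay intact.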

    This answer was selected as the best answer by the asker.