douliangpo0128 2011-12-12 23:17

Accepted

Wordwrap / cut text in an HTML string

Here is what I want to do: I have a string containing HTML tags, and I want to cut it with the wordwrap function while ignoring the HTML tags.

I'm stuck:

public function textWrap($string, $width)
{
    $dom = new DOMDocument();
    $dom->loadHTML($string);
    foreach ($dom->getElementsByTagName('*') as $elem)
    {
        foreach ($elem->childNodes as $node)
        {
            if ($node->nodeType === XML_TEXT_NODE)
            {
                $text = trim($node->nodeValue);
                $length = mb_strlen($text);
                $width -= $length;
                if($width <= 0)
                { 
                    // Here, I would like to delete all next nodes
                    // and cut the current nodeValue and finally return the string 
                }
            }
        }
    }
}

I'm not sure I'm doing this the right way at the moment. I hope it's clear...

EDIT:

Here is an example. I have this text:

    <p>
        <span class="Underline"><span class="Bold">Test to be cut</span></span>
   </p><p>Some text</p>

Let's say I want to cut it at the 6th character. I would like it to return this:

<p>
    <span class="Underline"><span class="Bold">Test to</span></span>
</p>

dongyou8701 2011-12-13 19:20

As I wrote in a comment, you first need to find the textual offset at which to cut.

First I set up a DOMDocument containing the HTML fragment and then select the body element, which represents the fragment in the DOM:

$htmlFragment = <<<HTML
<p>
        <span class="Underline"><span class="Bold">Test to be cut</span></span>
   </p><p>Some text </p>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($htmlFragment);
$parent = $dom->getElementsByTagName('body')->item(0);
if (!$parent)
{
    throw new Exception('Parent element not found.');
}

Then I use my TextRange class to find the place where the cut needs to be made. The same TextRange also performs the cut and locates the DOMNode that should become the last node of the fragment:

$range = new TextRange($parent);

// find the position where to cut the HTML textual representation
// by looking for the end of a word, or at least some whitespace,
// with a regular expression.
$width = 17;
$pattern = sprintf('~^.{0,%d}(?<=\S)(?=\s)|^.{0,%1$d}(?=\s)~su', $width);
$r = preg_match($pattern, $range, $matches);
if (FALSE === $r)
{
    throw new Exception('Wordcut regex failed.');
}
if (!$r)
{
    throw new Exception(sprintf('Text "%s" is not cut-able (should not happen).', $range));
}

This regular expression finds the offset at which to cut in the textual representation made available by $range. The pattern is inspired by another answer, which discusses it in more detail, and has been slightly modified to fit this answer's needs.
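To see what the pattern matches on its own, here is a minimal sketch that applies it to a plain string instead of a TextRange. The helper name wordCutPrefix is hypothetical and not part of the original code; the pattern is the same, except that both placeholders are written as %1$d.

```php
<?php
// Hypothetical helper (not from the original answer): returns the prefix of
// $text matched by the word-cut pattern, or null when nothing matches.
function wordCutPrefix(string $text, int $width): ?string
{
    // First alternative: up to $width characters ending right after a
    // non-whitespace character and right before whitespace (end of a word).
    // Second alternative: up to $width characters ending before whitespace.
    $pattern = sprintf('~^.{0,%1$d}(?<=\S)(?=\s)|^.{0,%1$d}(?=\s)~su', $width);

    $result = preg_match($pattern, $text, $matches);
    if ($result === false) {
        throw new RuntimeException('Wordcut regex failed.');
    }

    return $result === 1 ? $matches[0] : null;
}

var_dump(wordCutPrefix('Test to be cut', 8)); // string(7) "Test to"
var_dump(wordCutPrefix('Test to be cut', 6)); // string(4) "Test"
var_dump(wordCutPrefix('Test', 10));          // NULL
```

The last call shows why the "not cut-able" check exists: both alternatives require whitespace after the cut position, so a text without any whitespace does not match at all.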

// chop-off the textnodes to make a cut in DOM possible
$range->split($matches[0]);
$nodes = $range->getNodes();
$cutPosition = end($nodes);

Because the range can end up containing no nodes at all (in which case the body will become empty), I need to deal with that special case. Otherwise, as noted in the comment, all following nodes need to be removed:

// obtain list of elements to remove with xpath
if (FALSE === $cutPosition)
{
    // if there is no node, delete all parent children
    $cutPosition = $parent;
    $xpath = 'child::node()';
}
else
{
    $xpath = 'following::node()';
}

The rest is straightforward: query the XPath, remove the nodes and output the result:

// execute xpath
$xp = new DOMXPath($dom);
$remove = $xp->query($xpath, $cutPosition);
if (!$remove)
{
    throw new Exception('XPath query failed to obtain elements to remove');
}

// remove nodes
foreach($remove as $node)
{
    $node->parentNode->removeChild($node);
}

// inner HTML (PHP >= 5.3.6)
foreach($parent->childNodes as $node)
{
    echo $dom->saveHTML($node);
}
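The removal step can also be tried in isolation. The sketch below assumes the cut position is already known (here the text node is split by hand with splitText) and uses a made-up helper name, removeFollowingNodes. It copies the XPath result into an array first and skips nodes that no longer have a parent, which is a cautious variation on the loop above rather than a required change.

```php
<?php
// Hypothetical helper: removes every node that comes after $cut in
// document order (the XPath following axis excludes ancestors).
function removeFollowingNodes(DOMNode $cut): void
{
    $xpath = new DOMXPath($cut->ownerDocument);

    // Copy the node list so that removals cannot disturb the iteration.
    $following = iterator_to_array($xpath->query('following::node()', $cut));

    foreach ($following as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }
}

$doc = new DOMDocument();
$doc->loadHTML('<p><b>Test to be cut</b></p><p>Some text</p>');
$body = $doc->getElementsByTagName('body')->item(0);

// body > p > b > text node; keep "Test to" and split off the rest.
$text = $body->firstChild->firstChild->firstChild;
$text->splitText(7);

removeFollowingNodes($text);

// Print the inner HTML of the body (PHP >= 5.3.6).
foreach ($body->childNodes as $child) {
    echo $doc->saveHTML($child);
}
```

After the call, the body holds a single paragraph whose text is "Test to"; the enclosing p and b elements survive because they are ancestors of the cut node, not followers.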

The full code example, including the TextRange class, is available on viper codepad. The codepad has a bug, so its result is not correct (related: XPath query result order). The actual output is the following:

<p>
        <span class="Underline"><span class="Bold">Test to</span></span></p>

So make sure you have a current libxml version (normally the case). Also note that the output foreach at the end calls the PHP function saveHTML with a node parameter, which is only available since PHP 5.3.6. If you don't have that PHP version, use an alternative such as the one outlined in How to get the xml content of a node as a string? or a similar question.

If you look closely at my example code, you might notice that the cut length is quite large ($width = 17;). That is because there are many whitespace characters in front of the text. This could be tweaked by making the regular expression drop any amount of whitespace in front of the text and/or by trimming the TextRange first. The second option needs more functionality; I wrote something quick that can be used after creating the initial range:

...
$range = new TextRange($parent);
$trimmer = new TextRangeTrimmer($range);
$trimmer->trim();
...

That removes the needless whitespace on the left and right inside your HTML fragment. The TextRangeTrimmer code is the following:

class TextRangeTrimmer
{
    /**
     * @var TextRange
     */
    private $range;

    /**
     * @var array
     */
    private $charlist;

    public function __construct(TextRange $range, Array $charlist = NULL)
    {
        $this->range = $range;
        $this->setCharlist($charlist);      
    }
    /**
     * @param array $charlist list of UTF-8 encoded characters
     * @throws InvalidArgumentException
     */
    public function setCharlist(Array $charlist = NULL)
    {
        if (NULL === $charlist)
        {
            // default: the same characters that PHP's trim() strips
            $charlist = str_split(" \t\n\r\0\x0B");
        }

        $list = array();

        foreach($charlist as $char)
        {
            if (!is_string($char))
            {
                throw new InvalidArgumentException('Not an Array of strings.');
            }
            if (strlen($char))
            {
                $list[] = $char; 
            }
        }

        $this->charlist = array_flip($list);
    }
    /**
     * @return array characters
     */
    public function getCharlist()
    {
        return array_keys($this->charlist);
    }
    public function trim()
    {
        if (!$this->charlist) return;
        $this->ltrim();
        $this->rtrim();
    }
    /**
     * number of consecutive characters of $charlist, counted from $start in $direction
     * 
     * @param array $charlist
     * @param int $start offset
     * @param int $direction 1: forward, -1: backward
     * @throws InvalidArgumentException
     */
    private function lengthOfCharacterSequence(Array $charlist, $start, $direction = 1)
    {
        $start = (int) $start;              
        $direction = max(-1, min(1, $direction));
        if (!$direction) throw new InvalidArgumentException('Direction must be 1 or -1.');

        $count = 0;
        for(;$char = $this->range->getCharacter($start), $char !== ''; $start += $direction, $count++)
            if (!isset($charlist[$char])) break;

        return $count;
    }
    public function ltrim()
    {
        $count = $this->lengthOfCharacterSequence($this->charlist, 0);

        if ($count)
        {
            $remainder = $this->range->split($count);
            foreach($this->range->getNodes() as $textNode)
            {
                $textNode->parentNode->removeChild($textNode);
            }
            $this->range->setNodes($remainder->getNodes());
        }

    }
    public function rtrim()
    {
        $count = $this->lengthOfCharacterSequence($this->charlist, -1, -1);

        if ($count)
        {
            $chop = $this->range->split(-$count);
            foreach($chop->getNodes() as $textNode)
            {
                $textNode->parentNode->removeChild($textNode);
            }
        }
    }
}
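For the first option mentioned earlier (letting the regular expression skip the leading whitespace instead of trimming the range), a possible sketch follows. It is an assumption about how the pattern could be adapted, not code from the codepad, and the function name is hypothetical. The possessive \s*+ swallows the leading whitespace without counting it against the width, so the width applies only to the visible text.

```php
<?php
// Hypothetical variant of the word-cut pattern: leading whitespace is
// matched but does not count against $width.
function wordCutPrefixSkippingLeadingSpace(string $text, int $width): ?string
{
    // \s*+  : all leading whitespace, possessively (no backtracking into it)
    // .{0,N}: up to $width further characters, ending at the end of a word
    $pattern = sprintf('~^\s*+.{0,%d}(?<=\S)(?=\s)~su', $width);

    return preg_match($pattern, $text, $matches) === 1 ? $matches[0] : null;
}

// Text shaped like the fragment above: a newline and eight spaces of
// indentation in front of the words.
$text = "\n        Test to be cut ";

$match = wordCutPrefixSkippingLeadingSpace($text, 7);
var_dump(ltrim($match)); // string(7) "Test to"
```

The match still starts at the beginning of the text, so its length remains a usable cut offset, while $width can now be chosen without knowing how much indentation the fragment has.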

Hope this is helpful.
