I am working on extracting the data from a two-column table. The first column is the variable name and the second column is the data for that variable.
I have this almost working, but some values contain HTML, often wrapped in a DIV. I want to get the HTML inside the DIV, but not the DIV itself. I know regex might be a solution, but I'd like to better understand DOMDocument.
This is the code I have so far:
private function readHtml()
{
    $url = "https://docs.google.com/spreadsheets/d/1Klpic32Gb_TDblDZDJQOkDedFGuNHAokxUXqrCPDFWE/pubhtml";
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
    $htmlData = curl_exec($curl);
    curl_close($curl);
    $dom = new \DOMDocument();
    $html = $dom->loadHTML($htmlData);
    $dom->preserveWhiteSpace = false;
    $tables = $dom->getElementsByTagName('table');
    $rows = $tables->item(0)->getElementsByTagName('tr');
    $cols = $rows->item(1)->getElementsByTagName('td');
    $table = [];
    $key = null;
    $value = null;
    foreach ($rows as $i => $row) {
        //skip the heading columns
        if ($i <= 1) continue;
        $cols = $row->getElementsByTagName('td');
        foreach ($cols as $count => $node) {
            if ($count == 0) {
                $key = strtolower(str_replace(' ', '_', $node->textContent));
            } else {
                $htmlNode = $node->getElementsByTagName('div');
                if ($htmlNode->length >= 1) {
                    $innerHTML = '';
                    foreach ($htmlNode as $innerNode) {
                        $innerHTML .= $innerNode->ownerDocument->saveHTML($innerNode);
                    }
                    $value = $innerHTML;
                } else {
                    $value = $node->textContent;
                }
            }
        }
        $table[$key] = $value;
    }
    return $table;
}
My output is correct, but I'd like to not include the wrapper DIV of the data that contains HTML:
Array
(
[type] => raw
[direction] => north
[intro] => Welcome to the test.
[html_body] => <div class="softmerge-inner" style="width: 5653px; left: -1px;">Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut <span style="font-weight:bold;">aliquip</span> ex ea commodo consequat. Duis aute irure dolor in <span style="text-decoration:underline;">reprehenderit</span> in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, <span style="font-style:italic;">sunt in</span> culpa qui officia deserunt mollit anim id est laborum.</div>
[count] => 1003
)
UPDATE
Based on feedback and ideas in the answers, this is the current iteration of the function, which is slimmer and returns the desired output. I don't feel too good about the double regex, but it's working.
private function readHtml()
{
    // URL of the published spreadsheet
    $url = "https://docs.google.com/spreadsheets/d/1Klpic32Gb_TDblDZDJQOkDedFGuNHAokxUXqrCPDFWE/pubhtml";
    $dom = new \DOMDocument();
    $dom->loadHTMLFile($url);
    $dom->preserveWhiteSpace = false;
    $tables = $dom->getElementsByTagName('table');
    $rows = $tables->item(0)->getElementsByTagName('tr');
    $cols = $rows->item(1)->getElementsByTagName('td');
    $table = [];
    $key = null;
    $value = null;
    foreach ($rows as $i => $row) {
        //skip the heading columns
        if ($i <= 1) continue;
        $cols = $row->getElementsByTagName('td');
        foreach ($cols as $count => $node) {
            if ($count == 0) {
                $key = strtolower(str_replace(' ', '_', $node->textContent));
            } else {
                // serialize the whole cell, then strip the DIV and TD tags
                $value = $node->ownerDocument->saveHTML($node);
                $value = preg_replace('/(<div.*?>|<\/div>)/', '', $value);
                $value = preg_replace('/(<td.*?>|<\/td>)/', '', $value);
            }
        }
        $table[$key] = $value;
    }
    return $table;
}
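For comparison, here is a rough sketch of a regex-free variant that I think could replace the two preg_replace calls: instead of serializing the cell and stripping tags afterwards, it serializes only the child nodes of the wrapper DIV (or of the TD when there is no DIV). The helper names innerHtml and cellValue and the sample markup are made up for illustration; the real spreadsheet page has extra header rows and cells that this sketch does not handle.

```php
<?php
// Hypothetical helper: serialize a node's children, leaving out the node's own tag.
function innerHtml(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

// Hypothetical helper: value of a data cell. If the cell contains a DIV,
// the first (outermost) DIV is treated as the wrapper and only its contents are kept.
function cellValue(DOMElement $cell): string
{
    $divs = $cell->getElementsByTagName('div');
    $source = $divs->length > 0 ? $divs->item(0) : $cell;
    return trim(innerHtml($source));
}

// Made-up sample markup standing in for the published spreadsheet (assumption).
$sample = '<table>'
    . '<tr><td>Intro</td><td>Welcome to the test.</td></tr>'
    . '<tr><td>HTML Body</td><td><div class="softmerge-inner">'
    . 'Lorem <span style="font-weight:bold;">ipsum</span> dolor</div></td></tr>'
    . '</table>';

$dom = new DOMDocument();
$dom->loadHTML($sample);

$table = [];
foreach ($dom->getElementsByTagName('tr') as $row) {
    $cols = $row->getElementsByTagName('td');
    if ($cols->length < 2) {
        continue; // not a name/value row
    }
    $key = strtolower(str_replace(' ', '_', trim($cols->item(0)->textContent)));
    $table[$key] = cellValue($cols->item(1));
}

print_r($table);
```

One caveat with this sketch: anything in the cell that sits outside the first DIV is dropped, so it assumes the DIV wraps the whole value, as in my html_body row.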