I am trying to process a large HTML file with PHP's DOM extension (DOMDocument). I read it in and immediately write it out to another file without making any changes, but the output file is far smaller than the input: 219 bytes instead of about 2.6 MB.
This is particularly puzzling, because I could swear I did this previously while learning to use DOM and the output looked okay.
Here is my code:
<?php
// ini_set("memory_limit", -1);
require_once("inc/common.inc");
$acad = "../inprogress/academy/";
$htmFName = "$acad/mf/humanacad.htm";
$sz = filesize($htmFName);
echo "fname: $htmFName, $sz bytes\n";
$dom = new DOMDocument();
$dom->loadHTML($htmFName);
$dom->save("z");
$sz = filesize("z");
echo "fname: z: $sz bytes\n";
And the output:
fname: ../inprogress/academy//mf/humanacad.htm, 2621622 bytes
fname: z: 219 bytes
Here is the beginning of the input file:
<html>
<head>
<meta http-equiv=Content-Type content="text/html; charset=utf-8">
<meta name=Generator content="Microsoft Word 11 (filtered)">
<title> The Hanging Academy</title>
<style>
<!--
...
-->
</style>
</head>
<body lang=EN-US link=blue vlink=blue>
<div class=Section1>
<p class=SectionHd>THE HANGING ACADEMY -- Part 1: Miranda</p>
And here is the entirety of the output file:
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>../inprogress/academy//mf/humanacad.htm</p></body></html>
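One thing the output itself shows: the only text in the saved document is the path string, which suggests the path was parsed as markup rather than opened as a file. DOMDocument::loadHTML() takes a string of HTML, while DOMDocument::loadHTMLFile() takes a filename; likewise save() writes XML (hence the <?xml ...?> line), while saveHTMLFile() writes HTML. The following is only a minimal sketch of that difference, assuming a file round trip is the goal. The temporary files and the tiny sample document are hypothetical stand-ins, not the real humanacad.htm.

```php
<?php
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

// Hypothetical stand-ins for the real input and output files.
$src = tempnam(sys_get_temp_dir(), "in");
$dst = tempnam(sys_get_temp_dir(), "out");
file_put_contents(
    $src,
    "<html><head><title>T</title></head>" .
    "<body><p class=SectionHd>Part 1</p></body></html>"
);

// loadHTML() parses its argument as markup, so a path ends up as plain text.
$asString = new DOMDocument();
$asString->loadHTML($src);
$parsedText = $asString->documentElement->textContent;

// loadHTMLFile() opens the file at that path and parses its contents.
$dom = new DOMDocument();
$dom->loadHTMLFile($src);
$dom->saveHTMLFile($dst); // HTML serialization; save() would emit XML

$roundTrip = file_get_contents($dst);
echo "text seen by loadHTML(): $parsedText\n";
echo "round trip via loadHTMLFile(): " . strlen($roundTrip) . " bytes\n";
```

If this reading is right, the 219-byte file is simply the path wrapped in the minimal html/body/p skeleton that the parser supplies for bare text.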