May 19, 2010
XML_Transformer.php and bad chars
XML/Transformer.php, that wonderful unmaintained PEAR module, uses PHP's built-in xml_parse function, and, at least on my server, this function loves to fail, crash, and/or die on extended ASCII codes.
So what did I do? I hacked it up myself. In the transform function of the class I started by just replacing ampersands
$xml = preg_replace('/&(?!amp;)/i', '&amp;', $xml);
but that wasn't cutting it once I started getting errors like
Transformer: XML Error: Invalid character at line 1:342549
You guessed right, I didn't want to go searching for the 342549th character in line 1, so I wrote even more replacements:
// bad_chr found on php manual page for built-in function xml_parse
// added higher codes found on https://files.oakland.edu/users/grossman/web/ascii.codes.html
// Each value is a literal placeholder string such as "chr(7)", not the result of calling chr().
// Note that \x09 (tab), \x0a (line feed) and \x0d (carriage return) are legal in XML 1.0 but get replaced here too.
$bad_chr = array(
"\x00" => "chr(0)", "\x01" => "chr(1)", "\x02" => "chr(2)", "\x03" => "chr(3)",
"\x04" => "chr(4)", "\x05" => "chr(5)", "\x06" => "chr(6)", "\x07" => "chr(7)",
"\x08" => "chr(8)", "\x09" => "chr(9)", "\x0a" => "chr(10)", "\x0b" => "chr(11)",
"\x0c" => "chr(12)", "\x0d" => "chr(13)", "\x0e" => "chr(14)", "\x0f" => "chr(15)",
"\x10" => "chr(16)", "\x11" => "chr(17)", "\x12" => "chr(18)", "\x13" => "chr(19)",
"\x14" => "chr(20)", "\x15" => "chr(21)", "\x16" => "chr(22)", "\x17" => "chr(23)",
"\x18" => "chr(24)", "\x19" => "chr(25)", "\x1a" => "chr(26)", "\x1b" => "chr(27)",
"\x1c" => "chr(28)", "\x1d" => "chr(29)", "\x1e" => "chr(30)", "\x1f" => "chr(31)",
"\x91" => "chr(145)", //single quote
"\x92" => "chr(146)", //single quote
"\x93" => "chr(147)", //double quote
"\x94" => "chr(148)", //double quote
"\x96" => "chr(150)", //short dash
"\x97" => "chr(151)", //long dash
"\xA0" => "chr(32)", //other space
"\xB4" => "chr(180)", //some other single quote
"\xBC" => "chr(188)", //frac 1/4
"\xBD" => "chr(189)", //frac 1/2
"\xBE" => "chr(190)", //frac 3/4
);
$xml = strtr($xml, $bad_chr);
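For anyone who wants something a bit more general, here is a rough sketch of the same cleanup. It assumes the incoming file is single-byte Windows-1252 text rather than UTF-8 (on UTF-8 input the high-byte replacements would corrupt multibyte characters), and all four function names are made up for illustration; none of this is part of XML_Transformer. Instead of placeholder strings, the high bytes become numeric character references, which are plain ASCII, and tab, line feed, and carriage return are left alone because XML 1.0 allows them.

```php
<?php
// Sketch only: these helpers are hypothetical, not part of XML_Transformer.

// Escape ampersands that do not already start an entity or character
// reference, so "&lt;" and "&#160;" are not double-escaped.
function escape_stray_ampersands($xml)
{
    return preg_replace('/&(?!(?:[a-z][a-z0-9]*|#[0-9]+|#x[0-9a-f]+);)/i', '&amp;', $xml);
}

// Remove the control bytes that XML 1.0 forbids. Tab (0x09), line feed
// (0x0A) and carriage return (0x0D) are legal, so they are kept.
function strip_forbidden_controls($xml)
{
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $xml);
}

// Assumption: input is Windows-1252. Map the usual offenders to numeric
// character references for the Unicode characters they stand for.
function replace_cp1252_punctuation($xml)
{
    $map = array(
        "\x91" => '&#8216;', // left single quote
        "\x92" => '&#8217;', // right single quote
        "\x93" => '&#8220;', // left double quote
        "\x94" => '&#8221;', // right double quote
        "\x96" => '&#8211;', // en dash
        "\x97" => '&#8212;', // em dash
        "\xA0" => '&#160;',  // no-break space
        "\xB4" => '&#180;',  // acute accent
        "\xBC" => '&#188;',  // fraction 1/4
        "\xBD" => '&#189;',  // fraction 1/2
        "\xBE" => '&#190;',  // fraction 3/4
    );
    return strtr($xml, $map);
}

// Run all three steps in sequence.
function clean_xml_input($xml)
{
    $xml = escape_stray_ampersands($xml);
    $xml = strip_forbidden_controls($xml);
    return replace_cp1252_punctuation($xml);
}

echo clean_xml_input("Fish & Chips \x93special\x94\x07"), "\n";
// prints: Fish &amp; Chips &#8220;special&#8221;
```

Character references are not expanded inside CDATA sections or comments, so this only helps in ordinary text and attribute values. Named entities such as &nbsp; are left untouched by the ampersand pattern and would still trip the parser unless they are declared.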
In the end the strtr() replacement is working. I am not entirely sure how I'll catch other people introducing more bad character codes, since the upload and parse are fully automated, but hopefully the list covers what any normal human would dare use.
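One way to catch new offenders in an automated run might be to scan for anything suspicious that is still left after the replacements and log it. The sketch below is only that: find_suspect_bytes() is a hypothetical helper, and it treats every byte outside 7-bit ASCII as suspect, which fits single-byte input but would flag every legitimate multibyte character in a UTF-8 file.

```php
<?php
// Sketch only: find_suspect_bytes() is a hypothetical helper.
// Returns array(offset, hex) pairs for every byte that is either a control
// character forbidden by XML 1.0 or outside 7-bit ASCII.
function find_suspect_bytes($xml)
{
    $found = array();
    $pattern = '/[\x00-\x08\x0B\x0C\x0E-\x1F\x80-\xFF]/';
    if (preg_match_all($pattern, $xml, $m, PREG_OFFSET_CAPTURE)) {
        foreach ($m[0] as $hit) {
            // $hit[0] is the matched byte, $hit[1] its zero-based byte offset.
            $found[] = array($hit[1], bin2hex($hit[0]));
        }
    }
    return $found;
}

// Example with a made-up string; in the real script this would run on $xml
// after the strtr() call, with the result sent to a log instead of stdout.
$sample = "caf\xE9 \x07 ok";
foreach (find_suspect_bytes($sample) as $pair) {
    printf("suspect byte 0x%s at offset %d\n", $pair[1], $pair[0]);
}
// prints:
// suspect byte 0xe9 at offset 3
// suspect byte 0x07 at offset 5
```

The offsets are zero-based byte positions, so they line up with substr() if you want to look at the surrounding text.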
Posted by Matt at 08:44 PM