
May 19, 2010

XML_Transformer.php and bad chars

XML/Transformer.php, that wonderful unmaintained PEAR module, uses PHP's built-in xml_parse function and, at least on my server, this function loves to fail, crash, and/or die on extended ASCII codes.

So what did I do? I hacked it up myself. In the transform function of the class, I started by just replacing bare ampersands:

$xml = preg_replace('/&(?!amp;)/i', '&amp;', $xml);
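One catch with that pattern: it only spares `&amp;`, so an entity that is already escaped, such as `&lt;` or `&#169;`, would be escaped a second time. A rough sketch of a stricter version follows; `escape_bare_ampersands()` is a made-up helper name, not something that exists in XML_Transformer, and the pattern is just one reasonable guess at what counts as "already an entity".

```php
<?php
// Hypothetical helper (not part of XML_Transformer).
// Escapes only ampersands that do not already begin an entity reference:
//   &name;   named entity
//   &#123;   decimal character reference
//   &#x7B;   hexadecimal character reference
function escape_bare_ampersands($xml)
{
    return preg_replace(
        '/&(?!(?:[a-z][a-z0-9]*|#[0-9]+|#x[0-9a-f]+);)/i',
        '&amp;',
        $xml
    );
}

echo escape_bare_ampersands('Tom & Jerry &amp; friends &lt;3 &#169; &#xA9;'), "\n";
// Tom &amp; Jerry &amp; friends &lt;3 &#169; &#xA9;
```

Note that XML itself only predefines `amp`, `lt`, `gt`, `apos` and `quot`, so a named entity like `&nbsp;` passes through this filter untouched and can still upset the parser unless a DTD declares it.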

But that wasn't cutting it once I started getting errors like:
Transformer: XML Error: Invalid character at line 1:342549

You guessed right, I didn't want to go searching for the 342549th character in line 1, so I wrote even more replacements:

    // bad_chr found on the PHP manual page for the built-in function xml_parse
    // added higher codes found on https://files.oakland.edu/users/grossman/web/ascii.codes.html
    // each offending byte is swapped for a visible placeholder, e.g. the literal text "chr(0)"
    $bad_chr = array(
        // control characters 0x00-0x1f (note: 0x09, 0x0a and 0x0d are legal in XML 1.0)
        "\x00" => "chr(0)",  "\x01" => "chr(1)",  "\x02" => "chr(2)",  "\x03" => "chr(3)",
        "\x04" => "chr(4)",  "\x05" => "chr(5)",  "\x06" => "chr(6)",  "\x07" => "chr(7)",
        "\x08" => "chr(8)",  "\x09" => "chr(9)",  "\x0a" => "chr(10)", "\x0b" => "chr(11)",
        "\x0c" => "chr(12)", "\x0d" => "chr(13)", "\x0e" => "chr(14)", "\x0f" => "chr(15)",
        "\x10" => "chr(16)", "\x11" => "chr(17)", "\x12" => "chr(18)", "\x13" => "chr(19)",
        "\x14" => "chr(20)", "\x15" => "chr(21)", "\x16" => "chr(22)", "\x17" => "chr(23)",
        "\x18" => "chr(24)", "\x19" => "chr(25)", "\x1a" => "chr(26)", "\x1b" => "chr(27)",
        "\x1c" => "chr(28)", "\x1d" => "chr(29)", "\x1e" => "chr(30)", "\x1f" => "chr(31)",
        // extended codes
        "\x91" => "chr(145)", // single quote
        "\x92" => "chr(146)", // single quote
        "\x93" => "chr(147)", // double quote
        "\x94" => "chr(148)", // double quote
        "\x96" => "chr(150)", // short dash
        "\x97" => "chr(151)", // long dash
        "\xA0" => "chr(32)",  // other space
        "\xB4" => "chr(180)", // some other single quote
        "\xBC" => "chr(188)", // frac 1/4
        "\xBD" => "chr(189)", // frac 1/2
        "\xBE" => "chr(190)", // frac 3/4
    );

    $xml = strtr($xml, $bad_chr);
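When a new "Invalid character" error does turn up, the byte at the reported position can be dumped rather than hunted for by hand. This is only a sketch: `show_bytes_around()` is a made-up helper name, and it assumes the column number in the error message is close to a byte offset into the string, which is an assumption and not something checked against the parser.

```php
<?php
// Hypothetical helper (not part of XML_Transformer).
// Returns a spaced hex dump of up to 2 * $radius bytes starting $radius
// bytes before $offset, so the offending code can be read off directly.
function show_bytes_around($data, $offset, $radius = 8)
{
    $start = max(0, $offset - $radius);
    $chunk = substr($data, $start, $radius * 2);
    // bin2hex() yields two hex digits per byte; chunk_split() spaces them out.
    return trim(chunk_split(bin2hex($chunk), 2, ' '));
}

$sample = "price: 3\xBD cups";
echo show_bytes_around($sample, 8, 2), "\n";
// 20 33 bd 20
```

Here the `bd` in the dump is the 1/2 fraction byte, which is exactly the kind of value that would then get a new entry in `$bad_chr`.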

In the end, this is working. I am not entirely sure how I'll catch any other bad character codes that people sneak in, since the upload and parsing are all automated, but hopefully this covers what any normal human would dare use.
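One possible way to catch newcomers in an automated job is to scan the data after the replacements and log whatever unexpected bytes are left. The sketch below assumes the feed is supposed to be plain single-byte ASCII text, so anything outside tab, LF, CR and the printable range 0x20-0x7E counts as suspect; legitimate UTF-8 multibyte characters would be flagged too. `find_suspect_bytes()` is a made-up helper name.

```php
<?php
// Hypothetical helper (not part of XML_Transformer).
// Returns an array of distinct suspect bytes ("0xNN") mapped to how often
// each one occurs, sorted by key, suitable for writing to a log.
function find_suspect_bytes($data)
{
    // No /u modifier, so the pattern matches individual bytes.
    preg_match_all('/[^\x09\x0A\x0D\x20-\x7E]/', $data, $matches);
    $counts = array();
    foreach ($matches[0] as $byte) {
        $key = sprintf('0x%02X', ord($byte));
        $counts[$key] = isset($counts[$key]) ? $counts[$key] + 1 : 1;
    }
    ksort($counts);
    return $counts;
}

print_r(find_suspect_bytes("caf\xE9 \x93quoted\x94 \x93more\x94"));
// reports 0x93 => 2, 0x94 => 2, 0xE9 => 1
```

An empty result means nothing unexpected survived; anything else is a candidate for the map.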

Posted by Matt at May 19, 2010 08:44 PM
