How to deal with Java encoding problems (especially XML)?
I searched for java and encoding but did not find a resource explaining how to deal with the common problems that arise in Java when encoding and decoding strings. There are lots of specific questions about single errors, but I did not find a wider response/reference guide to the problem. The main questions are:
What is string encoding?
Why in Java can I read files with wrong characters?
Why, when dealing with XML, do I get an invalid byte x of y-byte UTF-8 sequence exception? What are its main causes and how do I avoid them?
Since Stack Overflow encourages self-answers, I will try to answer it myself.
Encoding is the process of converting information from one format to another; this response details how string encoding works in Java (you may want to read a more generic introduction to text encoding first).
Introduction
String encoding/decoding is the process that transforms a byte[] into a String and vice versa.
At first sight you may think there are no problems here, but if you look more closely at the process, issues may arise. At the lowest level information is stored and transmitted in bytes: files are sequences of bytes and network communication is done by sending and receiving bytes. So every time you want to read or write a file with plain readable content, or every time you submit a web form or read a web page, there is an underlying encoding operation. Let's start with the most basic string encoding operation in Java: creating a String from a sequence of bytes. The following code converts a byte[] (the bytes may come from a file or a socket) into a String.
```java
byte[] stringInByte = new byte[]{104, 101, 108, 108, 111};
String simple = new String(stringInByte);
System.out.println("simple=" + simple); // prints simple=hello
```
So far so good, we got "hello". The byte values used here show one way to map letters and numbers to bytes, but let's complicate the example with a simple requirement: the byte[] must contain the € (euro) sign. Oops, there is no euro symbol in the ASCII table.
This can be summarized as the core of the problem: the human readable characters (together with other necessary ones such as carriage return, line feed, etc.) are more than 256, i.e. they cannot all be represented with one byte. If for some reason you must stick to a single byte representation (historical reasons: the first encoding tables used 7 bits; or space constraints: if space on disk is limited and you write text documents only for English speaking people, there is no need to include Italian accented letters such as è, ì), you have the problem of choosing which characters to represent.
Choosing an encoding means choosing a mapping between bytes and chars.
Coming back to the euro example and sticking to a one byte --> one char mapping, the ISO8859-15 encoding table has the € sign; the sequence of bytes representing the string "hello €" is the following one:
```java
byte[] stringInByte1 = new byte[]{104, 101, 108, 108, 111, 32, (byte) 164};
```
How do we "tell" Java which encoding to use for the conversion? The String class has the constructor

```java
String(byte[] bytes, String charsetName)
```
that allows you to specify "the mapping"; if you use different charsets you get different output results, as you can see below:
```java
byte[] stringInByte1 = new byte[]{104, 101, 108, 108, 111, 32, (byte) 164};
String simple1 = new String(stringInByte1, "ISO8859-15");
System.out.println("simple1=" + simple1); // prints simple1=hello €
String simple2 = new String(stringInByte1, "ISO8859-1");
System.out.println("simple2=" + simple2); // prints simple2=hello ¤
So this explains why you can read characters different from the ones that were written: if the encoding used for writing (String to byte[]) is different from the one used for reading (byte[] to String), the same byte may map to different characters in the two encodings, so some characters may "look strange". These are the basic concepts needed to understand string encoding; now let's complicate the matter a little bit more. Sometimes there is the need to represent more than 256 symbols in one text document, and in order to accomplish that, multi byte encodings have been created.
With a multibyte encoding there is no longer a one byte --> one char mapping but a sequence of bytes --> one char mapping.
One of the best known multibyte encodings is UTF-8; UTF-8 is a variable length encoding: some chars are represented with one byte, others with more than one.
UTF-8 overlaps with US-ASCII (and with the first 128 characters of one byte encodings such as ISO8859-1); it can be viewed as an extension of those one byte encodings.
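As a quick check of the variable-length property, this small sketch (my addition, not part of the original answer) prints how many bytes UTF-8 needs for three sample characters:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    public static void main(String[] args) {
        // 'h' is in the ASCII range: 1 byte in UTF-8
        System.out.println("h -> " + "h".getBytes(StandardCharsets.UTF_8).length + " byte(s)");
        // 'è' is a Latin-1 accented letter: 2 bytes in UTF-8
        System.out.println("è -> " + "è".getBytes(StandardCharsets.UTF_8).length + " byte(s)");
        // '€' (U+20AC): 3 bytes in UTF-8
        System.out.println("€ -> " + "€".getBytes(StandardCharsets.UTF_8).length + " byte(s)");
    }
}
```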
Let's see UTF-8 in action on the first example:
```java
byte[] stringInByte = new byte[]{104, 101, 108, 108, 111};
String simple = new String(stringInByte);
System.out.println("simple=" + simple); // prints simple=hello
String simple3 = new String(stringInByte, "UTF-8");
System.out.println("simple3=" + simple3); // also prints simple3=hello
```
As you can see, the code prints hello in both cases, i.e. the bytes representing hello are the same in UTF-8 and ISO8859-1.
But if we try the example with the € sign we get a ?:
```java
byte[] stringInByte1 = new byte[]{104, 101, 108, 108, 111, 32, (byte) 164};
String simple1 = new String(stringInByte1, "ISO8859-15");
System.out.println("simple1=" + simple1); // prints simple1=hello €
String simple4 = new String(stringInByte1, "UTF-8");
System.out.println("simple4=" + simple4); // prints simple4=hello ?
```
The ? means the char was not recognized, i.e. there was an error. Note that no exception is thrown even though an error occurred during the conversion.
Unfortunately not all Java classes behave the same way when dealing with invalid chars; let's see what happens when we deal with XML.
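If you prefer a hard failure over silent replacement, the JDK's CharsetDecoder lets you request an exception on malformed input instead of substituting a ?. A minimal sketch (my addition, not part of the original answer), reusing the byte array from the example above:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecode {
    public static void main(String[] args) {
        // "hello " followed by byte 164, which is not valid on its own in UTF-8
        byte[] bytes = {104, 101, 108, 108, 111, 32, (byte) 164};
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)       // fail instead of replacing
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            System.out.println("decoded fine");
        } catch (CharacterCodingException e) {
            // new String(bytes, "UTF-8") would silently have produced "hello ?"
            System.out.println("malformed input detected: " + e);
        }
    }
}
```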
Managing XML
Before going through the examples it is worth remembering that in Java InputStream/OutputStream read/write bytes, while Reader/Writer read/write characters.
Let's try to read the same sequence of bytes of an XML file in different ways, i.e. reading the file into a String vs reading the file into a DOM.
```java
// create the XML file
String xmlSample = "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n<specialChars>àèìòù€</specialChars>";
try (FileOutputStream fosXmlFileOutputStream = new FileOutputStream("test.xml")) {
    // write the file with the wrong encoding
    fosXmlFileOutputStream.write(xmlSample.getBytes("ISO8859-15"));
}
try (
    FileInputStream xmlFileInputStream = new FileInputStream("test.xml");
    // read the file with the encoding declared in the XML header
    InputStreamReader inputStreamReader = new InputStreamReader(xmlFileInputStream, "UTF-8");
) {
    char[] cbuf = new char[xmlSample.length()];
    inputStreamReader.read(cbuf);
    System.out.println("file read with UTF-8=" + new String(cbuf));
    // prints
    // file read with UTF-8=<?xml version="1.0" encoding="utf-8"?>
    // <specialChars>������</specialChars>
}
File xmlFile = new File("test.xml");
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(xmlFile);
// throws com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException:
// Invalid byte 2 of 3-byte UTF-8 sequence
```
In the first case the result is some unusual chars but no exception; in the second case we get an exception (invalid sequence...). The exception occurs because the parser is reading a 3-byte char of a UTF-8 sequence and the second byte has an invalid value (because of the way UTF-8 encodes chars).
The tricky part is that, since UTF-8 overlaps with other encodings, the invalid byte 2 of 3-byte UTF-8 sequence exceptions arise "randomly" (i.e. only for messages containing characters represented with more than one byte), so in a production environment the error can be hard to track down and reproduce.
With this information we can try to answer the following question:
Why do I get an invalid byte x of y-byte UTF-8 sequence exception when reading/dealing with an XML file?
Because there is a mismatch between the encoding used for writing (ISO8859-15 in the test case above) and the encoding used for reading (UTF-8 in the test case above); the mismatch may have different causes:
You are making a wrong conversion between bytes and chars: for example, if you are reading the file with an InputStream, converting it into a Reader and passing the Reader to the XML library, you must specify the charset name as in the following code (i.e. you must know the encoding used for saving the file):
```java
try (
    FileInputStream xmlFileInputStream = new FileInputStream("test.xml");
    // this is the Reader for the XML library (dom4j, JDOM for example);
    // UTF-8 is the file encoding: if you specify the wrong encoding, or do not
    // specify any encoding, you may face the invalid byte x of y-byte UTF-8
    // sequence exception
    InputStreamReader inputStreamReader = new InputStreamReader(xmlFileInputStream, "UTF-8");
)
```
You are passing the InputStream straight to the XML library but the file itself is not correct (as in the first example of Managing XML: the header states UTF-8 but the real encoding is ISO8859-15). Simply putting an encoding in the first line of the file is not enough; the file must actually be saved with the encoding declared in the header.
You are reading the file with a Reader created without specifying the encoding, and the platform encoding is different from the file encoding:
```java
FileReader fileReader = new FileReader("test.xml");
```
This leads to the aspect that, at least for me, is the source of most string encoding problems in Java: using the default platform encoding.
When you call
```java
"hello €".getBytes();
```
you can get different results on different operating systems; that is because on Windows the default encoding is windows-1252 while on Linux it may be UTF-8. The € char is encoded differently, resulting not only in different bytes but also in different array sizes:
```java
String helloEuro = "hello €";
// prints hello euro byte[] size in ISO8859-15 = 7
System.out.println("hello euro byte[] size in ISO8859-15 = " + helloEuro.getBytes("ISO8859-15").length);
// prints hello euro byte[] size in UTF-8 = 9
System.out.println("hello euro byte[] size in UTF-8 = " + helloEuro.getBytes("UTF-8").length);
```
Using String.getBytes() or new String(byte[] ...) without specifying the encoding is the first thing to check when you run into encoding issues.
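Since Java 7 you can avoid both the platform default and the checked UnsupportedEncodingException by using the constants in java.nio.charset.StandardCharsets; a small sketch (my addition, not part of the original answer):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ExplicitCharset {
    public static void main(String[] args) {
        String helloEuro = "hello €";
        // explicit charset: the same bytes on every platform, and no
        // UnsupportedEncodingException to catch
        byte[] utf8 = helloEuro.getBytes(StandardCharsets.UTF_8);
        String back = new String(utf8, StandardCharsets.UTF_8);
        System.out.println("round-trip ok: " + back.equals(helloEuro)); // prints round-trip ok: true
        System.out.println("utf8 bytes: " + Arrays.toString(utf8));
    }
}
```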
The second one is checking whether you are reading or writing files using FileReader or FileWriter; in both cases the documentation states:
The constructors of this class assume that the default character encoding and the default byte-buffer size are acceptable.
As with String.getBytes(), reading/writing the same file on different platforms with those Reader/Writer classes, without specifying a charset, may lead to different byte sequences due to the different default platform encodings.
The solution the Javadoc suggests is to use InputStreamReader/OutputStreamWriter, which wrap an InputStream/OutputStream with a charset specification.
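A minimal sketch of that fix: wrap the streams and pass the charset explicitly on both sides (the file name test.txt is my choice, not from the original answer):

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetIo {
    public static void main(String[] args) throws IOException {
        String text = "hello €";
        // write with an explicit charset instead of new FileWriter("test.txt")
        try (Writer writer = new OutputStreamWriter(new FileOutputStream("test.txt"), StandardCharsets.UTF_8)) {
            writer.write(text);
        }
        // read back with the same charset instead of new FileReader("test.txt")
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(new FileInputStream("test.txt"), StandardCharsets.UTF_8)) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        }
        System.out.println("read back: " + sb); // prints read back: hello €
    }
}
```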
Some final notes on how XML libraries read XML content:
If you pass a Reader, the library relies on the Reader's encoding (i.e. it does not check what the XML header says) and does not perform any decoding itself, since it is reading chars, not bytes.
If you pass an InputStream or a File, the library relies on the encoding declared in the XML header and may throw encoding exceptions.
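The two behaviors can be observed with DocumentBuilder and InputSource: given a character stream, the parser trusts the Reader; given a byte stream, it decodes according to the XML header. A sketch (my addition; the file name mismatch.xml is mine) that recreates the mismatched file from the earlier example:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class ReaderVsStream {
    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n<specialChars>àèìòù€</specialChars>";
        File file = new File("mismatch.xml");
        try (FileOutputStream out = new FileOutputStream(file)) {
            out.write(xml.getBytes("ISO8859-15")); // the header lies: it claims utf-8
        }
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // 1) Reader with the REAL encoding: parses fine, the header encoding is
        //    ignored because the parser receives chars, not bytes
        try (Reader reader = new InputStreamReader(new FileInputStream(file), "ISO8859-15")) {
            Document doc = builder.parse(new InputSource(reader));
            System.out.println("via Reader: " + doc.getDocumentElement().getTextContent());
        }
        // 2) InputStream: the parser believes the utf-8 header and fails on the
        //    ISO8859-15 bytes
        try (InputStream in = new FileInputStream(file)) {
            builder.parse(new InputSource(in));
        } catch (Exception e) {
            System.out.println("via InputStream: " + e.getMessage());
        }
    }
}
```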
Database
A different issue may arise when dealing with databases: when a database is created it has an encoding property used to save varchar and other string columns (e.g. CLOB). If the database was created with an 8 bit encoding (ISO8859-15 for example), problems may arise when you try to insert chars not allowed by that encoding: what is saved on the db may be different from the string you specified at the Java level, because in Java strings are represented in memory in UTF-16, which is "wider" than the encoding specified at the database level. The simplest solution: create the database with UTF-8 encoding.
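The "wider" in-memory representation can be verified directly: Java strings are sequences of UTF-16 code units, so a character outside the Basic Multilingual Plane takes two chars (a surrogate pair) while still being a single code point. A small sketch (my addition; the musical symbol 𝄞 is my example):

```java
public class Utf16Width {
    public static void main(String[] args) {
        String euro = "€";  // U+20AC, inside the BMP: one UTF-16 code unit
        String clef = "𝄞";  // U+1D11E, outside the BMP: a surrogate pair
        System.out.println("€: length=" + euro.length()
                + " codePoints=" + euro.codePointCount(0, euro.length())); // length=1 codePoints=1
        System.out.println("𝄞: length=" + clef.length()
                + " codePoints=" + clef.codePointCount(0, clef.length())); // length=2 codePoints=1
    }
}
```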
Web
For encoding problems in web applications, this is a good starting point.
If you feel something is missing, feel free to ask for more in the comments.
java xml file encoding