Checking the charset of a byte array (UTF-8)

Development/JAVA | 2015. 3. 3. 15:58 | Posted by 자수씨

Source: http://codereview.stackexchange.com/questions/59428/validating-utf-8-byte-array

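When several systems exchange XML-based data and the charset is not known for certain, charset mismatches tend to cause problems.

In system A the XML document is UTF-8, but the content inside it (Base64-encoded) is EUC-KR. In system B the XML document is UTF-8 and the content inside it is UTF-8 as well.

In that situation, a system that has to work with both A and B needs to tell whether a given byte array is UTF-8 or EUC-KR.

(If the data is expected to be decoded as UTF-8, run the check below and, if the bytes turn out not to be UTF-8, decode them again as EUC-KR.)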
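The fallback described above could look roughly like the sketch below. Note that it swaps in the JDK's own strict `CharsetDecoder` rather than the hand-written check from this post, so it is stricter (it also rejects overlong encodings and surrogates). The class name `FallbackDecoder` and its `decode` method are made up for illustration, and the input is assumed to be the bytes left after Base64 decoding.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

/** Hypothetical helper (not from the original post): decode as UTF-8, else as EUC-KR. */
final class FallbackDecoder {

    private static final Charset UTF_8 = Charset.forName("UTF-8");
    private static final Charset EUC_KR = Charset.forName("EUC-KR");

    static String decode(byte[] bytes) {
        try {
            // REPORT makes the decoder throw on bad input instead of
            // silently substituting U+FFFD replacement characters.
            return UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not well-formed UTF-8, so assume the only other charset in play.
            return new String(bytes, EUC_KR);
        }
    }

    public static void main(String[] args) {
        // Two Hangul syllables (U+D55C, U+AE00), written as escapes so the
        // source file stays ASCII-only.
        String text = "\uD55C\uAE00";
        System.out.println(decode(text.getBytes(UTF_8)).equals(text));
        System.out.println(decode(text.getBytes(EUC_KR)).equals(text));
    }
}
```

Both calls in `main` should give back the original string, whichever of the two charsets produced the bytes. The sketch only makes sense when UTF-8 and EUC-KR are the only candidates, as in the A/B scenario above.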

The page I referred to is written for Java 7, so I converted the code to Java 6.
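The main difference between the two versions is how the bit masks are written: Java 7 added binary literals such as `0b11100000`, which Java 6 does not have. As a small illustration (the class `MaskLiteralDemo` and the helper `fromBinary` are hypothetical names, not from the original page), the same mask can be produced on Java 6 either by parsing a binary string or by using a hex literal:

```java
/** Hypothetical demo class: two Java 6 compatible ways to write the same bit mask. */
final class MaskLiteralDemo {

    /** Parses a string of '0' and '1' characters into an int. */
    static int fromBinary(String bits) {
        return Integer.parseInt(bits, 2);
    }

    public static void main(String[] args) {
        int parsed = fromBinary("11100000"); // parsed at run time
        int hex = 0xE0;                      // same value as a compile-time constant
        System.out.println(parsed == hex);
        System.out.println(Integer.toBinaryString(hex));
    }
}
```

Parsing keeps the bit pattern readable but does the work at run time on every call, whereas a hex literal is a plain constant.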

In Java 6

###java

public class ValidateUtf8 {

    /**
     * Returns the number of UTF-8 characters, or -1 if the array does not
     * contain a valid UTF-8 string. Overlong encodings, null characters,
     * invalid Unicode values, and surrogates are accepted.
     *
     * @param bytes byte array to check length
     * @return length
     */
    public static int charLength(byte[] bytes) {
        int charCount = 0, expectedLen;

        for (int i = 0; i < bytes.length; i++) {
            charCount++;

            // Lead byte analysis. Java 6 has no binary literals, so the masks
            // are written in hex instead of being parsed on every iteration.
            if ((bytes[i] & 0x80) == 0x00) {          // 0xxxxxxx
                continue;
            } else if ((bytes[i] & 0xE0) == 0xC0) {   // 110xxxxx
                expectedLen = 2;
            } else if ((bytes[i] & 0xF0) == 0xE0) {   // 1110xxxx
                expectedLen = 3;
            } else if ((bytes[i] & 0xF8) == 0xF0) {   // 11110xxx
                expectedLen = 4;
            } else if ((bytes[i] & 0xFC) == 0xF8) {   // 111110xx
                expectedLen = 5;
            } else if ((bytes[i] & 0xFE) == 0xFC) {   // 1111110x
                expectedLen = 6;
            } else {
                return -1;
            }

            // Count trailing bytes: each must look like 10xxxxxx
            while (--expectedLen > 0) {
                if (++i >= bytes.length) {
                    return -1;
                }
                if ((bytes[i] & 0xC0) != 0x80) {
                    return -1;
                }
            }
        }
        return charCount;
    }

    /**
     * Validate a UTF-8 byte array
     *
     * @param bytes byte array to validate
     * @return true if UTF-8
     */
    public static boolean validate(byte[] bytes) {
        return (charLength(bytes) != -1);
    }
}

In Java 7 and later

###java

public class ValidateUtf8 {

    /**
     * Returns the number of UTF-8 characters, or -1 if the array does not
     * contain a valid UTF-8 string. Overlong encodings, null characters,
     * invalid Unicode values, and surrogates are accepted.
     *
     * @param bytes byte array to check length
     * @return length
     */
    public static int charLength(byte[] bytes) {
        int charCount = 0, expectedLen;

        for (int i = 0; i < bytes.length; i++) {
            charCount++;

            // Lead byte analysis
            if ((bytes[i] & 0b10000000) == 0b00000000) {
                continue;
            } else if ((bytes[i] & 0b11100000) == 0b11000000) {
                expectedLen = 2;
            } else if ((bytes[i] & 0b11110000) == 0b11100000) {
                expectedLen = 3;
            } else if ((bytes[i] & 0b11111000) == 0b11110000) {
                expectedLen = 4;
            } else if ((bytes[i] & 0b11111100) == 0b11111000) {
                expectedLen = 5;
            } else if ((bytes[i] & 0b11111110) == 0b11111100) {
                expectedLen = 6;
            } else {
                return -1;
            }

            // Count trailing bytes
            while (--expectedLen > 0) {
                if (++i >= bytes.length) {
                    return -1;
                }
                if ((bytes[i] & 0b11000000) != 0b10000000) {
                    return -1;
                }
            }
        }
        return charCount;
    }

    /**
     * Validate a UTF-8 byte array
     *
     * @param bytes byte array to validate
     * @return true if UTF-8
     */
    public static boolean validate(byte[] bytes) {
        return (charLength(bytes) != -1);
    }
}
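To get a feel for what this kind of structural check accepts and rejects, here is a compact restatement of the same lead-byte and trailing-byte rules with a few sample inputs. The class `Utf8StructureDemo` and the method `looksLikeUtf8` are hypothetical names used only for this sketch. It is meant for experimenting, not as a replacement for the code above.

```java
import java.nio.charset.Charset;

/** Hypothetical demo class restating the structural rules of the check above. */
final class Utf8StructureDemo {

    /** True if every lead byte is followed by the expected number of 10xxxxxx bytes. */
    static boolean looksLikeUtf8(byte[] bytes) {
        for (int i = 0; i < bytes.length; i++) {
            int b = bytes[i] & 0xFF;             // treat the byte as unsigned
            int trailing;
            if (b < 0x80) trailing = 0;          // 0xxxxxxx
            else if (b < 0xC0) return false;     // 10xxxxxx cannot start a sequence
            else if (b < 0xE0) trailing = 1;     // 110xxxxx
            else if (b < 0xF0) trailing = 2;     // 1110xxxx
            else if (b < 0xF8) trailing = 3;     // 11110xxx
            else if (b < 0xFC) trailing = 4;     // 111110xx
            else if (b < 0xFE) trailing = 5;     // 1111110x
            else return false;                   // 0xFE and 0xFF are never valid
            for (; trailing > 0; trailing--) {
                if (++i >= bytes.length || (bytes[i] & 0xC0) != 0x80) {
                    return false;                // truncated or not 10xxxxxx
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String text = "\uD55C\uAE00";            // two Hangul syllables, as escapes
        byte[] utf8 = text.getBytes(Charset.forName("UTF-8"));
        byte[] eucKr = text.getBytes(Charset.forName("EUC-KR"));
        byte[] overlong = {(byte) 0xC0, (byte) 0x80};   // overlong form of U+0000
        byte[] ambiguous = {(byte) 0xC2, (byte) 0xA1};  // also a legal EUC-KR byte pair

        System.out.println(looksLikeUtf8(utf8));      // true
        System.out.println(looksLikeUtf8(eucKr));     // false
        System.out.println(looksLikeUtf8(overlong));  // true
        System.out.println(looksLikeUtf8(ambiguous)); // true
    }
}
```

The last sample is worth keeping in mind: some EUC-KR byte pairs happen to form structurally valid UTF-8, so on very short inputs the check can misclassify EUC-KR data as UTF-8. On longer Korean text this is much less likely, because every pair in the array would have to fit the UTF-8 pattern.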
