juniversalchardet: an encoding detection tool



Published: 2015-01-27


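Many years ago Mozilla built a very good encoding detection tool called chardet (Java port: jchardet). It later released universalchardet, which uses an improved algorithm and drives automatic encoding detection in Firefox. In addition, the distribution package of Tika, Apache's content extraction project, bundles juniversalchardet in tika-app-1.*.jar (version 1.2 and later).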

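Note: when the input is a short text of only a few bytes, the detected encoding may be wrong. This is probably a limitation of the algorithm itself. For somewhat larger texts the accuracy is very high, and notably much higher than that of chardet.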
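One way to see why very short inputs are hard to classify: the same few bytes are often valid in several encodings at once, so a detector has little evidence to choose between them. The sketch below uses only the JDK, not juniversalchardet. The class name `ShortInputAmbiguity`, the method `decodableAs`, and the sample bytes are hypothetical and chosen purely for illustration. Which charsets are available depends on the JVM, so unsupported names are skipped.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.ArrayList;
import java.util.List;

public class ShortInputAmbiguity {
  // Returns the candidate charset names under which `data` decodes without error.
  static List<String> decodableAs(byte[] data, String... candidates) {
    List<String> accepted = new ArrayList<>();
    for (String name : candidates) {
      if (!Charset.isSupported(name)) {
        continue; // this JVM does not provide the charset
      }
      try {
        Charset.forName(name).newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT)
            .decode(ByteBuffer.wrap(data));
        accepted.add(name);
      } catch (CharacterCodingException e) {
        // The bytes are not valid in this charset; leave it out.
      }
    }
    return accepted;
  }

  public static void main(String[] args) {
    // Two bytes only. UTF-8 rejects them (0xE3 is not a continuation byte),
    // but single-byte and several double-byte charsets may all accept them.
    byte[] twoBytes = {(byte) 0xC4, (byte) 0xE3};
    System.out.println(decodableAs(twoBytes,
        "UTF-8", "ISO-8859-1", "windows-1252", "GB18030", "Big5", "EUC-KR"));
  }
}
```

Strict decoding only tells you which encodings are possible, not which one is right. A statistical detector needs more bytes before character frequencies can separate the remaining candidates.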

Encodings that can be detected

Chinese
- ISO-2022-CN
- BIG5
- EUC-TW
- GB18030
- HZ-GB-2312

Cyrillic
- ISO-8859-5
- KOI8-R
- WINDOWS-1251
- MACCYRILLIC
- IBM866
- IBM855

Greek
- ISO-8859-7
- WINDOWS-1253

Hebrew
- ISO-8859-8
- WINDOWS-1255

Japanese
- ISO-2022-JP
- SHIFT_JIS
- EUC-JP

Korean
- ISO-2022-KR
- EUC-KR

Unicode
- UTF-8
- UTF-16BE / UTF-16LE
- UTF-32BE / UTF-32LE / X-ISO-10646-UCS-4-3412 / X-ISO-10646-UCS-4-2143

Others
- WINDOWS-1252
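The Unicode entries in the list differ mainly in byte order, and the usual way to tell them apart is the byte order mark (BOM) at the start of the data. The following is a minimal sketch of such a check using the well-known BOM byte patterns. It is not taken from juniversalchardet, and the class name `BomSniffer` and the method `fromBom` are hypothetical.

```java
public class BomSniffer {
  // True if `data` begins with the given unsigned byte values.
  private static boolean startsWith(byte[] data, int... prefix) {
    if (data.length < prefix.length) {
      return false;
    }
    for (int i = 0; i < prefix.length; i++) {
      if ((data[i] & 0xFF) != prefix[i]) {
        return false;
      }
    }
    return true;
  }

  // Returns an encoding name based on the BOM, or null if no BOM is present.
  static String fromBom(byte[] data) {
    // Four-byte patterns are checked first because FF FE and FE FF
    // are also prefixes of the UCS-4 patterns.
    if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
    if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
    if (startsWith(data, 0xFE, 0xFF, 0x00, 0x00)) return "X-ISO-10646-UCS-4-3412";
    if (startsWith(data, 0x00, 0x00, 0xFF, 0xFE)) return "X-ISO-10646-UCS-4-2143";
    if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
    if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
    if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
    return null;
  }

  public static void main(String[] args) {
    byte[] sample = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
    System.out.println(fromBom(sample));
  }
}
```

A BOM is optional, so a `null` result here only means that content-based detection is still needed. Note also that `FF FE 00 00` could be a UTF-16LE BOM followed by a NUL character; this sketch resolves the tie in favor of UTF-32LE.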

Related Works

jchardet

jchardet is another Java port of Mozilla's encoding detection library. The main difference between jchardet and juniversalchardet is the module each is based on. jchardet is based on the long-established 'chardet' module. juniversalchardet is based on the newer 'universalchardet' module, which generally gives more accurate detection results.

Sample Code

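import java.io.FileInputStream;
import java.io.IOException;
import org.mozilla.universalchardet.UniversalDetector;

public class TestDetector {
  public static void main(String[] args) throws IOException {
    if (args.length < 1) {
      System.err.println("Usage: java TestDetector <file>");
      return;
    }
    byte[] buf = new byte[4096];
    String fileName = args[0];

    // (1) Construct a detector; no listener is passed here.
    UniversalDetector detector = new UniversalDetector(null);

    // (2) Feed data until the detector is done or the file ends.
    //     try-with-resources closes the stream in either case.
    try (FileInputStream fis = new FileInputStream(fileName)) {
      int nread;
      while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
        detector.handleData(buf, 0, nread);
      }
    }

    // (3) Tell the detector that no more data is coming.
    detector.dataEnd();

    // (4) Read the result; null means no encoding was detected.
    String encoding = detector.getDetectedCharset();
    if (encoding != null) {
      System.out.println("Detected encoding = " + encoding);
    } else {
      System.out.println("No encoding detected.");
    }

    // (5) Reset the detector before reusing the instance.
    detector.reset();
  }
}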
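Once a name has been detected, the natural next step is to decode the bytes with it. The sketch below uses only the JDK and takes the detected name as a plain string, such as the value returned by `getDetectedCharset()`. The class `DecodeWithDetected`, its methods, and the choice of UTF-8 as the fallback are assumptions made for illustration. The fallback matters because the name can be `null`, and because a detected name is not guaranteed to correspond to a charset that the running JVM supports.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;

public class DecodeWithDetected {
  // Maps a detected name to a Charset, or returns `fallback`
  // when the name is null, malformed, or unsupported on this JVM.
  static Charset resolve(String detectedName, Charset fallback) {
    if (detectedName == null) {
      return fallback;
    }
    try {
      return Charset.isSupported(detectedName)
          ? Charset.forName(detectedName)
          : fallback;
    } catch (IllegalCharsetNameException e) {
      return fallback; // the name contains characters not allowed in charset names
    }
  }

  // Decodes `data` using the detected name, falling back to UTF-8.
  static String decode(byte[] data, String detectedName) {
    return new String(data, resolve(detectedName, StandardCharsets.UTF_8));
  }

  public static void main(String[] args) {
    byte[] utf16le = {'h', 0, 'i', 0};
    System.out.println(decode(utf16le, "UTF-16LE"));
  }
}
```

Falling back silently is convenient for display purposes. For data that must round-trip exactly, it is probably safer to report the unsupported name to the caller instead.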
