Detecting the Encoding of XML Documents with SAX and XNI




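XML is defined in terms of Unicode characters. When a document is transmitted or stored on a modern computer, those Unicode characters have to be stored as bytes and decoded again by the parser. Many encoding schemes exist for this purpose: UTF-8, UTF-16, ISO-8859-1, Cp1252, SJIS, and others.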
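To make the point concrete, here is a small sketch (not part of the original article; the class name EncodingBytesDemo and its helper methods are invented for illustration) that prints the bytes produced for one character, U+00E9, under three of the encodings named above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingBytesDemo {

    // Formats a byte array as space-separated uppercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encodes the text with the given charset and returns the bytes as hex.
    static String encode(String text, Charset charset) {
        return hex(text.getBytes(charset));
    }

    public static void main(String[] args) {
        String text = "\u00E9"; // one character: e with acute accent
        System.out.println("UTF-8:      " + encode(text, StandardCharsets.UTF_8));      // C3 A9
        System.out.println("UTF-16BE:   " + encode(text, StandardCharsets.UTF_16BE));   // 00 E9
        System.out.println("ISO-8859-1: " + encode(text, StandardCharsets.ISO_8859_1)); // E9
    }
}
```

The same character becomes one or two bytes with different values depending on the scheme, which is why a consumer of raw bytes has to know which encoding was used.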

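Usually, though not always, you do not care about the underlying encoding. The XML parser converts the document, whatever encoding it was written in, into Unicode strings and character arrays, and your program works with the decoded strings. This article covers the less common cases in which the underlying encoding really does matter.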

The most common case is wanting to preserve the input encoding in the output.
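One way to do this, sketched here under assumptions rather than taken from the article, is to hand the detected encoding to a JAXP identity Transformer through OutputKeys.ENCODING. The class PreserveEncodingDemo and the method copyWithEncoding are hypothetical names, and the encoding is assumed to have been detected already by one of the techniques described later.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class PreserveEncodingDemo {

    // Copies the document unchanged but serializes it in the given encoding.
    // inputEncoding is assumed to have been detected beforehand.
    static byte[] copyWithEncoding(byte[] xml, String inputEncoding) {
        try {
            // A Transformer created without a stylesheet performs an identity copy
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.ENCODING, inputEncoding);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            identity.transform(new StreamSource(new ByteArrayInputStream(xml)),
                               new StreamResult(out));
            return out.toByteArray();
        } catch (TransformerException ex) {
            throw new IllegalStateException("Could not copy document", ex);
        }
    }

    public static void main(String[] args) {
        byte[] input = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><note>caf\u00E9</note>"
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] output = copyWithEncoding(input, "ISO-8859-1");
        // Decode with the same encoding to display the serialized result
        System.out.println(new String(output, StandardCharsets.ISO_8859_1));
    }
}
```

Without the setOutputProperty call, the serializer would fall back to its default output encoding, and the original encoding would be lost.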

Another case is storing the document in a database as a string or Character Large Object (CLOB) without parsing it.

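Similarly, some systems transmit XML documents over HTTP without reading them in full, yet still have to set the HTTP Content-type header with the correct encoding. In cases like these, you need to know how the document is encoded.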
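As a minimal illustration of the HTTP case, the following sketch builds a Content-type header value from an encoding name. The class ContentTypeDemo, the method contentTypeFor, the choice of application/xml as the media type, and the "unknown" sentinel are all assumptions made for this example.

```java
class ContentTypeDemo {

    // Builds a Content-type header value for an XML document.
    // When the encoding is not known, the charset parameter is left out
    // rather than guessed.
    static String contentTypeFor(String encoding) {
        if (encoding == null || encoding.isEmpty() || encoding.equals("unknown")) {
            return "application/xml";
        }
        return "application/xml; charset=" + encoding;
    }

    public static void main(String[] args) {
        System.out.println(contentTypeFor("ISO-8859-1")); // application/xml; charset=ISO-8859-1
        System.out.println(contentTypeFor(null));         // application/xml
    }
}
```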

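Most of the time you know how the documents you write yourself are encoded. But when you did not write the document and simply received it from somewhere else (from an Atom feed, for example), the best approach is to use a streaming API such as Simple API for XML (SAX), Streaming API for XML (StAX), System.Xml.XmlReader, or Xerces Native Interface (XNI). You could also use a tree API such as the Document Object Model (DOM). Tree APIs, however, have to read the entire document, even though the first 100 bytes (or fewer) are usually enough to determine the encoding. A streaming API can read only as much as it needs and stop parsing as soon as it has the answer, which is far more efficient.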
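The article goes on to cover SAX and XNI. For comparison, here is a hedged sketch of the same idea with StAX, which is also named above. It relies on the standard javax.xml.stream.XMLStreamReader method getCharacterEncodingScheme(), which reports the encoding named in the XML declaration. The class StaxEncodingDemo and the method declaredEncoding are hypothetical names, and the exact values returned may vary between StAX implementations.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxEncodingDemo {

    // Returns the encoding named in the XML declaration,
    // or null if the document does not declare one.
    static String declaredEncoding(byte[] xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new ByteArrayInputStream(xml));
            try {
                // The reader starts in the START_DOCUMENT state, so the
                // declaration is available without reading any further.
                return reader.getCharacterEncodingScheme();
            } finally {
                reader.close();
            }
        } catch (XMLStreamException ex) {
            throw new IllegalStateException("Could not read document", ex);
        }
    }

    public static void main(String[] args) {
        byte[] xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><feed/>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(declaredEncoding(xml));
    }
}
```

The reader is closed as soon as the declaration has been inspected, so the rest of the document is never parsed.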

　　SAX

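Most current SAX parsers, including the one bundled with Sun's Java™ Software Development Kit (JDK) 6, can detect the encoding. The technique is not hard to implement, but it is not obvious either. In short: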

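In the setDocumentLocator method, cast the Locator argument to Locator2.

Save the Locator2 object in a field.

In the startDocument method, call getEncoding() on the Locator2 field.

(Optional) If you now have everything you wanted, throw a SAXException to end parsing early.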

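Listing 1 demonstrates the technique with a simple program that prints the encoding of every URL given on the command line.

Listing 1. Determining a document's encoding with SAX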

import org.xml.sax.*;
import org.xml.sax.ext.*;
import org.xml.sax.helpers.*;
import java.io.IOException;

public class SAXEncodingDetector extends DefaultHandler {

    public static void main(String[] args) throws SAXException, IOException {
        XMLReader parser = XMLReaderFactory.createXMLReader();
        SAXEncodingDetector handler = new SAXEncodingDetector();
        parser.setContentHandler(handler);
        for (int i = 0; i < args.length; i++) {
            try {
                parser.parse(args[i]);
            }
            catch (SAXException ex) {
                // startDocument() throws on purpose once the encoding is known
                System.out.println(handler.encoding);
            }
        }
    }

    private String encoding;
    private Locator2 locator;

    @Override
    public void setDocumentLocator(Locator locator) {
        // Only Locator2 exposes getEncoding()
        if (locator instanceof Locator2) {
            this.locator = (Locator2) locator;
        }
        else {
            this.encoding = "unknown";
        }
    }

    @Override
    public void startDocument() throws SAXException {
        if (locator != null) {
            this.encoding = locator.getEncoding();
        }
        // Nothing else is needed, so stop parsing here
        throw new SAXException("Early termination");
    }

}

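This approach works about 90% of the time, perhaps a little more. However, SAX parsers are not required to support the Locator interface, let alone Locator2 or other extensions. If you know you are using Xerces, a second option is XNI.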

　　Xerces Native Interface

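Using XNI is very similar to using SAX (in fact, in Xerces the SAX parser is a thin layer on top of the native XNI parser). If anything, this approach is a little easier, because the encoding is passed directly to startDocument() as an argument. All you have to do is read it, as Listing 2 shows.

Listing 2. Determining a document's encoding with XNI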

import java.io.IOException;
import org.apache.xerces.parsers.*;
import org.apache.xerces.xni.*;
import org.apache.xerces.xni.parser.*;

public class XNIEncodingDetector extends XMLDocumentParser {

    public static void main(String[] args) throws XNIException, IOException {
        XNIEncodingDetector parser = new XNIEncodingDetector();
        for (int i = 0; i < args.length; i++) {
            try {
                XMLInputSource document = new XMLInputSource("", args[i], "");
                parser.parse(document);
            }
            catch (XNIException ex) {
                // startDocument() throws on purpose once the encoding is known
                System.out.println(parser.encoding);
            }
        }
    }

    private String encoding = "unknown";

    @Override
    public void startDocument(XMLLocator locator, String encoding,
            NamespaceContext context, Augmentations augs)
            throws XNIException {
        // XNI passes the encoding in directly
        this.encoding = encoding;
        throw new XNIException("Early termination");
    }

}

Note that, for reasons unknown, this technique works only with the actual Xerces classes in org.apache.xerces, not with the repackaged Xerces classes in com.sun.org.apache.xerces.internal that are bundled with Sun's JDK 6.

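XNI offers one more capability that SAX lacks. In rare cases, the encoding declared in the XML declaration is not the actual encoding. SAX reports only the actual encoding, but XNI can also tell you the declared encoding, through the xmlDecl() method, as Listing 3 shows.

Listing 3. Determining a document's declared and actual encodings with XNI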

import java.io.IOException;
import org.apache.xerces.parsers.*;
import org.apache.xerces.xni.*;
import org.apache.xerces.xni.parser.*;

public class AdvancedXNIEncodingDetector extends XMLDocumentParser {

    public static void main(String[] args) throws XNIException, IOException {
        AdvancedXNIEncodingDetector parser = new AdvancedXNIEncodingDetector();
        for (int i = 0; i < args.length; i++) {
            try {
                XMLInputSource document = new XMLInputSource("", args[i], "");
                parser.parse(document);
            }
            catch (XNIException ex) {
                // startElement() throws on purpose; both values are known by then
                System.out.println("Actual: " + parser.actualEncoding);
                System.out.println("Declared: " + parser.declaredEncoding);
            }
        }
    }

    private String actualEncoding = "unknown";
    private String declaredEncoding = "none";

    @Override
    public void startDocument(XMLLocator locator, String encoding,
            NamespaceContext namespaceContext, Augmentations augs)
            throws XNIException {
        this.actualEncoding = encoding;
        this.declaredEncoding = "none"; // reset
    }

    // this method is not called if there's no XML declaration
    @Override
    public void xmlDecl(String version, String encoding,
            String standalone, Augmentations augs) throws XNIException {
        this.declaredEncoding = encoding;
    }

    @Override
    public void startElement(QName element, XMLAttributes attributes,
            Augmentations augs) throws XNIException {
        // The declaration, if any, has been seen by the time the root element starts
        throw new XNIException("Early termination");
    }

}

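A mismatch between the declared and the actual encoding usually indicates a bug on the server. The most common cause is an HTTP Content-type header that specifies a different encoding from the one in the XML declaration. In that case, strict adherence to the specification requires that the HTTP header take precedence. In practice, however, the value in the XML declaration is more likely to be the correct one.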
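The precedence rule just described can be sketched as follows. This is an illustration, not code from the article: the class EncodingPrecedenceDemo and its methods are hypothetical, the header parsing is deliberately simple, and encoding names are compared case-insensitively without resolving aliases.

```java
class EncodingPrecedenceDemo {

    // Extracts the charset parameter from a Content-type header value,
    // or returns null if the header carries none.
    static String charsetFromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String value = p.substring(8).trim();
                // Strip optional surrounding quotes
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    // Strict rule: the HTTP header wins whenever it names a charset.
    static String strictEncoding(String contentType, String declaredEncoding) {
        String fromHeader = charsetFromContentType(contentType);
        return fromHeader != null ? fromHeader : declaredEncoding;
    }

    // True when both sources name an encoding and the names differ.
    static boolean conflicts(String contentType, String declaredEncoding) {
        String fromHeader = charsetFromContentType(contentType);
        return fromHeader != null && declaredEncoding != null
                && !fromHeader.equalsIgnoreCase(declaredEncoding);
    }

    public static void main(String[] args) {
        String header = "text/xml; charset=ISO-8859-1";
        String declared = "UTF-8";
        System.out.println("Conflict: " + conflicts(header, declared));           // Conflict: true
        System.out.println("Strict choice: " + strictEncoding(header, declared)); // Strict choice: ISO-8859-1
    }
}
```

A caller that prefers the pragmatic reading, that the XML declaration is more likely to be right, could use conflicts() to detect the mismatch and then choose the declared value instead.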

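Conclusion

In most cases you do not need to know the encoding of the input document. Just let the parser process the input and write the result as UTF-8. When you do need the input encoding, though, SAX and XNI offer a fast and efficient way to get it.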
