archived-doc-zh/reference/mbstring/functions/mb-detect-encoding.xml

<?xml version="1.0" encoding="utf-8"?>
<!-- $Revision$ -->
<!-- EN-Revision: 7c4b5fb40ac3149a5b931f1e31b1050ab5eaab7e Maintainer: daijie Status: ready -->
<!-- CREDITS: mowangjuanzi -->
<refentry xml:id="function.mb-detect-encoding" xmlns="http://docbook.org/ns/docbook" xmlns:xlink="http://www.w3.org/1999/xlink">
 <refnamediv>
  <refname>mb_detect_encoding</refname>
  <refpurpose>检测字符的编码</refpurpose>
 </refnamediv>

 <refsect1 role="description">
  &reftitle.description;
  <methodsynopsis>
   <type class="union"><type>string</type><type>false</type></type><methodname>mb_detect_encoding</methodname>
   <methodparam><type>string</type><parameter>string</parameter></methodparam>
   <methodparam choice="opt"><type class="union"><type>array</type><type>string</type><type>null</type></type><parameter>encodings</parameter><initializer>&null;</initializer></methodparam>
   <methodparam choice="opt"><type>bool</type><parameter>strict</parameter><initializer>&false;</initializer></methodparam>
  </methodsynopsis>
  <para>
   从候选列表中检测 <type>string</type> <parameter>string</parameter> 最可能的字符编码。
  </para>
  <para>
   自 PHP 8.1 起，此函数使用启发式检测指定列表中最有可能正确的有效文本编码，且可能不按照提供的 <parameter>encodings</parameter> 顺序进行检测。
  </para>
  <para>
   对预期（intended）字符编码的自动检测不可能永远完全可靠；没有额外的信息，就类似于在没有密钥的情况下解码已编码的字符串。最好使用与数据一起存储或传输的字符编码表示，例如“Content-Type” HTTP 头。
  </para>
  <para>
   此函数适用于多字节编码，但并非所有字节顺序都构成有效字符串。如果输入字符串包含这样的顺序，则将会拒绝该编码。
  </para>

  <warning>
   <title>结果不准确</title>
   <para>
    此函数的名称具有误导性，它执行的是“猜测”而不是“检测”。
   </para>
   <para>
    猜测结果远非准确，因此无法使用此函数准确检测正确的字符编码。
   </para>
  </warning>
 </refsect1>

 <refsect1 role="parameters">
  &reftitle.parameters;
  <para>
   <variablelist>
    <varlistentry>
     <term><parameter>string</parameter></term>
     <listitem>
      <para>
       要检测的 <type>string</type>。
      </para>
     </listitem>
    </varlistentry>
    <varlistentry>
     <term><parameter>encodings</parameter></term>
     <listitem>
      <para>
       尝试的字符编码列表。该列表可以指定为字符串数组，或以逗号分隔的单个字符串。
      </para>
      <para>
       如果省略 <parameter>encodings</parameter> 被或为 &null;，则将使用当前的 detect_order（使用
       <link linkend="ini.mbstring.detect-order">mbstring.detect_order</link> 配置选项或
       <function>mb_detect_order</function> 函数设置）。
      </para>
     </listitem>
    </varlistentry>
    <varlistentry>
     <term><parameter>strict</parameter></term>
     <listitem>
      <para>
       控制 <parameter>string</parameter> 在列出的所有 <parameter>encodings</parameter> 中无效时的行为。如果
       <parameter>strict</parameter> 设置为 &false;，将返回最接近的匹配编码；如果 <parameter>strict</parameter>
       设置为 &true;，将返回 &false;。
      </para>
      <para>
       可以使用 <link linkend="ini.mbstring.strict-detection">mbstring.strict_detection</link> 配置选项设置 <parameter>strict</parameter> 的默认值。
      </para>
     </listitem>
    </varlistentry>
   </variablelist>
  </para>
 </refsect1>

 <refsect1 role="returnvalues">
  &reftitle.returnvalues;
  <para>
   检测到的字符编码，如果字符串在任何列出的编码中均无效，则返回 &false;。
  </para>
 </refsect1>

 <refsect1 role="changelog">
  &reftitle.changelog;
  <informaltable>
   <tgroup cols="2">
    <thead>
     <row>
      <entry>&Version;</entry>
      <entry>&Description;</entry>
     </row>
    </thead>
    <tbody>
     <row>
      <entry>8.2.0</entry>
      <entry>
       <function>mb_detect_encoding</function>
       将不再返回以下非文本编码：<literal>"Base64"</literal>、<literal>"QPrint"</literal>、<literal>"UUencode"</literal>、<literal>"HTML entities"</literal>、<literal>"7 bit"</literal> 和 <literal>"8 bit"</literal>。
      </entry>
     </row>
    </tbody>
   </tgroup>
  </informaltable>
 </refsect1>

 <refsect1 role="examples">
  &reftitle.examples;
  <para>
   <example>
    <title><function>mb_detect_encoding</function> 示例</title>
    <programlisting role="php">
<![CDATA[
<?php

$str = "\x95\xB6\x8E\x9A\x83\x52\x81\x5B\x83\x68";

// 使用当前的 detect_order 来检测字符编码
var_dump(mb_detect_encoding($str));

// “auto”将根据 mbstring.language 来扩展
var_dump(mb_detect_encoding($str, "auto"));

// 通过以逗号分隔的列表指定“encodings”参数
var_dump(mb_detect_encoding($str, "JIS, eucjp-win, sjis-win"));

// 使用数组指定“encodings”参数
$encodings = [
  "ASCII",
  "JIS",
  "EUC-JP"
];
var_dump(mb_detect_encoding($str, $encodings));
?>
]]>
    </programlisting>
    &example.outputs;
    <screen>
<![CDATA[
string(5) "ASCII"
string(5) "ASCII"
string(8) "SJIS-win"
string(5) "ASCII"
]]>
    </screen>
   </example>
  </para>
  <para>
   <example>
    <title><parameter>strict</parameter> 参数的影响</title>
    <programlisting role="php">
     <![CDATA[
<?php
// 'áéóú' 在 ISO-8859-1 中的编码
$str = "\xE1\xE9\xF3\xFA";

// 该字符串不是有效的 ASCII 或 UTF-8，但 UTF-8 被认为是更接近的匹配
var_dump(mb_detect_encoding($str, ['ASCII', 'UTF-8'], false));
var_dump(mb_detect_encoding($str, ['ASCII', 'UTF-8'], true));

// 如果找到有效编码，则严格参数不会更改结果
var_dump(mb_detect_encoding($str, ['ASCII', 'UTF-8', 'ISO-8859-1'], false));
var_dump(mb_detect_encoding($str, ['ASCII', 'UTF-8', 'ISO-8859-1'], true));
?>
]]>
    </programlisting>
    &example.outputs;
    <screen>
<![CDATA[
string(5) "UTF-8"
bool(false)
string(10) "ISO-8859-1"
string(10) "ISO-8859-1"
]]>
    </screen>
   </example>
  </para>
  <para>
   在某些情况下，相同的字节顺序可能会在多种字符编码中形成有效的字符串，并且无法知道其意图是哪种解释。例如，在众多字符编码中，字节序列“\xC4\xA2”可能是：
  </para>
  <para>
   <simplelist>
    <member>
     "Ä¢" (U+00C4 LATIN CAPITAL LETTER A WITH DIAERESIS followed by U+00A2 CENT SIGN)
     encoded in any of ISO-8859-1, ISO-8859-15, or Windows-1252
    </member>
    <member>
     "ФЂ" (U+0424 CYRILLIC CAPITAL LETTER EF followed by U+0402 CYRILLIC CAPITAL LETTER
     DJE) encoded in ISO-8859-5
    </member>
    <member>
     "Ģ" (U+0122 LATIN CAPITAL LETTER G WITH CEDILLA) encoded in UTF-8
    </member>
   </simplelist>
  </para>
  <para>
   <example>
    <title>匹配多个编码时顺序的影响</title>
    <programlisting role="php">
     <![CDATA[
<?php
$str = "\xC4\xA2";

// 该字符串在所有三种编码中均有效，但返回的可能并不总是列表中的第一个编码
var_dump(mb_detect_encoding($str, ['UTF-8']));
var_dump(mb_detect_encoding($str, ['UTF-8', 'ISO-8859-1', 'ISO-8859-5'])); // 自 php8.1 起返回 ISO-8859-1 而不是 UTF-8
var_dump(mb_detect_encoding($str, ['ISO-8859-1', 'ISO-8859-5', 'UTF-8']));
var_dump(mb_detect_encoding($str, ['ISO-8859-5', 'UTF-8', 'ISO-8859-1']));
?>
]]>
    </programlisting>
    &example.outputs;
    <screen>
<![CDATA[
string(5) "UTF-8"
string(10) "ISO-8859-1"
string(10) "ISO-8859-1"
string(10) "ISO-8859-5"
]]>
    </screen>
   </example>
  </para>
 </refsect1>

 <refsect1 role="seealso">
  &reftitle.seealso;
  <para>
   <simplelist>
    <member><function>mb_detect_order</function></member>
   </simplelist>
  </para>
 </refsect1>

</refentry>
<!-- Keep this comment at the end of the file
Local variables:
mode: sgml
sgml-omittag:t
sgml-shorttag:t
sgml-minimize-attributes:nil
sgml-always-quote-attributes:t
sgml-indent-step:1
sgml-indent-data:t
indent-tabs-mode:nil
sgml-parent-document:nil
sgml-default-dtd-file:"~/.phpdoc/manual.ced"
sgml-exposed-tags:nil
sgml-local-catalogs:nil
sgml-local-ecat-files:nil
End:
vim600: syn=xml fen fdm=syntax fdl=2 si
vim: et tw=78 syn=sgml
vi: ts=1 sw=1
-->