mirror of
https://github.com/php/php-src.git
synced 2026-04-11 10:03:18 +02:00
Upgrade PCRE to version 3.4.
@@ -20,7 +20,21 @@ restrictions:
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

2. The origin of this software must not be misrepresented, either by
explicit claim or by omission.
explicit claim or by omission. In practice, this means that if you use
PCRE in software which you distribute to others, commercially or
otherwise, you must put a sentence like this

Regular expression support is provided by the PCRE library package,
which is open source software, written by Philip Hazel, and copyright
by the University of Cambridge, England.

somewhere reasonably visible in your documentation and in any relevant
files or online help data or similar. A reference to the ftp site for
the source, that is, to

ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/

should also be given in the documentation.

3. Altered versions must be plainly marked as such, and must not be
misrepresented as being the original software.
@@ -2,6 +2,46 @@ ChangeLog for PCRE
------------------


Version 3.4 22-Aug-00
---------------------

1. Fixed typo in pcre.h: unsigned const char * changed to const unsigned char *.

2. Diagnose condition (?(0) as an error instead of crashing on matching.


Version 3.3 01-Aug-00
---------------------

1. If an octal character was given, but the value was greater than \377, it
was not getting masked to the least significant bits, as documented. This could
lead to crashes in some systems.

2. Perl 5.6 (if not earlier versions) accepts classes like [a-\d] and treats
the hyphen as a literal. PCRE used to give an error; it now behaves like Perl.

3. Added the functions pcre_free_substring() and pcre_free_substring_list().
These just pass their arguments on to (pcre_free)(), but they are provided
because some uses of PCRE bind it to non-C systems that can call its functions,
but cannot call free() or pcre_free() directly.

4. Add "make test" as a synonym for "make check". Corrected some comments in
the Makefile.

5. Add $(DESTDIR)/ in front of all the paths in the "install" target in the
Makefile.

6. Changed the name of pgrep to pcregrep, because Solaris has introduced a
command called pgrep for grepping around the active processes.

7. Added the beginnings of support for UTF-8 character strings.

8. Arranged for the Makefile to pass over the settings of CC, CFLAGS, and
RANLIB to ./ltconfig so that they are used by libtool. I think these are all
the relevant ones. (AR is not passed because ./ltconfig does its own figuring
out for the ar command.)
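
Item 1 of the 3.3 list concerns octal escapes whose value exceeds \377. The following is only a sketch of that masking, assuming up to three octal digits are read after the backslash. The function name octal_escape_value is hypothetical and is not part of PCRE.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not PCRE code: read up to three octal digits
   starting at s and return the value masked to the least significant
   8 bits, so that "777" (0x1ff) yields 0xff rather than an
   out-of-range value. */
static unsigned int octal_escape_value(const char *s)
{
    unsigned int value = 0;
    size_t i;

    assert(s != NULL);
    for (i = 0; i < 3 && s[i] >= '0' && s[i] <= '7'; i++)
        value = value * 8 + (unsigned int)(s[i] - '0');

    return value & 0xffu;   /* keep only the low byte */
}
```

Masking at the point where the value is produced means that any later code using it as a table index stays within 0 to 255.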


Version 3.2 12-May-00
---------------------

@@ -4,7 +4,7 @@ Basic Installation
These are generic installation instructions that apply to systems that
can run the `configure' shell script - Unix systems and any that imitate
it. They are not specific to PCRE. There are PCRE-specific instructions
for non-Unix systems in the file NON-UNIX.
for non-Unix systems in the file NON-UNIX-USE.

The `configure' shell script attempts to guess correct values for
various system-dependent variables used during compilation. It uses
@@ -20,19 +20,21 @@ restrictions:
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

2. The origin of this software must not be misrepresented, either by
explicit claim or by omission. In practice, this means you must put
a sentence like this
explicit claim or by omission. In practice, this means that if you use
PCRE in software which you distribute to others, commercially or
otherwise, you must put a sentence like this

Regular expression support is provided by the PCRE library package,
which is open source software, copyright by the University of
Cambridge.
which is open source software, written by Philip Hazel, and copyright
by the University of Cambridge, England.

somewhere reasonably visible in your documentation and in any relevant
files. A reference to the ftp site for the source should also be given
files or online help data or similar. A reference to the ftp site for
the source, that is, to

ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/

in the documentation.
should also be given in the documentation.

3. Altered versions must be plainly marked as such, and must not be
misrepresented as being the original software.
@@ -1,6 +1,14 @@
News about PCRE releases
------------------------

Release 3.3 01-Aug-00
---------------------

There is some support for UTF-8 character strings. This is incomplete and
experimental. The documentation describes what is and what is not implemented.
Otherwise, this is just a bug-fixing release.


Release 3.0 01-Feb-00
---------------------

@@ -7,6 +7,15 @@ The latest release of PCRE is always available from

Please read the NEWS file if you are upgrading from a previous release.

PCRE has its own native API, but a set of "wrapper" functions that are based on
the POSIX API are also supplied in the library libpcreposix. Note that this
just provides a POSIX calling interface to PCRE: the regular expressions
themselves still follow Perl syntax and semantics. The header file
for the POSIX-style functions is called pcreposix.h. The official POSIX name is
regex.h, but I didn't want to risk possible problems with existing files of
that name by distributing it that way. To use it with an existing program that
uses the POSIX API, it will have to be renamed or pointed at by a link.


Building PCRE on a Unix system
------------------------------
@@ -15,20 +24,29 @@ To build PCRE on a Unix system, run the "configure" command in the PCRE
distribution directory. This is a standard GNU "autoconf" configuration script,
for which generic instructions are supplied in INSTALL. On many systems just
running "./configure" is sufficient, but the usual methods of changing standard
defaults are available. For example
defaults are available. For example,

CFLAGS='-O2 -Wall' ./configure --prefix=/opt/local

specifies that the C compiler should be run with the flags '-O2 -Wall' instead
of the default, and that "make install" should install PCRE under /opt/local
instead of the default /usr/local. The "configure" script builds thre files:
instead of the default /usr/local.

If you want to make use of the experimental, incomplete support for UTF-8
character strings in PCRE, you must add --enable-utf8 to the "configure"
command. Without it, the code for handling UTF-8 is not included in the
library. (Even when included, it still has to be enabled by an option at run
time.)

The "configure" script builds four files:

. Makefile is built by copying Makefile.in and making substitutions.
. config.h is built by copying config.in and making substitutions.
. pcre-config is built by copying pcre-config.in and making substitutions.
. RunTest is a script for running tests

Once "configure" has run, you can run "make". It builds two libraries called
libpcre and libpcreposix, a test program called pcretest, and the pgrep
libpcre and libpcreposix, a test program called pcretest, and the pcregrep
command. You can use "make install" to copy these, and the public header file
pcre.h, to appropriate live directories on your system, in the normal way.

@@ -54,11 +72,11 @@ The default distribution builds PCRE as two shared libraries. This support is
new and experimental and may not work on all systems. It relies on the
"libtool" scripts - these are distributed with PCRE. It should build a
"libtool" script and use this to compile and link shared libraries, which are
placed in a subdirectory called .libs. The programs pcretest and pgrep are
placed in a subdirectory called .libs. The programs pcretest and pcregrep are
built to use these uninstalled libraries by means of wrapper scripts. When you
use "make install" to install shared libraries, pgrep and pcretest are
use "make install" to install shared libraries, pcregrep and pcretest are
automatically re-built to use the newly installed libraries. However, only
pgrep is installed, as pcretest is really just a test program.
pcregrep is installed, as pcretest is really just a test program.

To build PCRE using static libraries you must use --disable-shared when
configuring it. For example
@@ -82,8 +100,8 @@ Testing PCRE
------------

To test PCRE on a Unix system, run the RunTest script in the pcre directory.
(This can also be run by "make runtest" or "make check".) For other systems,
see the instruction in NON-UNIX-USE.
(This can also be run by "make runtest", "make check", or "make test".) For
other systems, see the instruction in NON-UNIX-USE.

The script runs the pcretest test program (which is documented in
doc/pcretest.txt) on each of the testinput files (in the testdata directory) in
@@ -97,12 +115,24 @@ RunTest, for example:
The first and third test files can also be fed directly into the perltest
script to check that Perl gives the same results. The third file requires the
additional features of release 5.005, which is why it is kept separate from the
main test input, which needs only Perl 5.004. In the long run, when 5.005 is
widespread, these two test files may get amalgamated.
main test input, which needs only Perl 5.004. In the long run, when 5.005 (or
higher) is widespread, these two test files may get amalgamated.

The second set of tests check pcre_info(), pcre_study(), pcre_copy_substring(),
pcre_get_substring(), pcre_get_substring_list(), error detection and run-time
flags that are specific to PCRE, as well as the POSIX wrapper API.
The second set of tests check pcre_fullinfo(), pcre_info(), pcre_study(),
pcre_copy_substring(), pcre_get_substring(), pcre_get_substring_list(), error
detection, and run-time flags that are specific to PCRE, as well as the POSIX
wrapper API. It also uses the debugging flag to check some of the internals of
pcre_compile().

If you build PCRE with a locale setting that is not the standard C locale, the
character tables may be different (see next paragraph). In some cases, this may
cause failures in the second set of tests. For example, in a locale where the
isprint() function yields TRUE for characters in the range 128-255, the use of
[:isascii:] inside a character class defines a different set of characters, and
this shows up in this test as a difference in the compiled code, which is being
listed for checking. Where the comparison test output contains [\x00-\x7f] the
test will contain [\x00-\xff], and similarly in some other cases. This is not a
bug in PCRE.

The fourth set of tests checks pcre_maketables(), the facility for building a
set of character tables for a specific locale and using them instead of the
@@ -117,14 +147,10 @@ output to say why. If running this test produces instances of the error
in the comparison output, it means that locale is not available on your system,
despite being listed by "locale". This does not mean that PCRE is broken.

PCRE has its own native API, but a set of "wrapper" functions that are based on
the POSIX API are also supplied in the library libpcreposix.a. Note that this
just provides a POSIX calling interface to PCRE: the regular expressions
themselves still follow Perl syntax and semantics. The header file
for the POSIX-style functions is called pcreposix.h. The official POSIX name is
regex.h, but I didn't want to risk possible problems with existing files of
that name by distributing it that way. To use it with an existing program that
uses the POSIX API, it will have to be renamed or pointed at by a link.
The fifth test checks the experimental, incomplete UTF-8 support. It is not run
automatically unless PCRE is built with UTF-8 support. This file can be fed
directly to the perltest8 script, which requires Perl 5.6 or higher. The sixth
file tests internal UTF-8 features of PCRE that are not relevant to Perl.


Character tables
@@ -197,7 +223,7 @@ The distribution should contain the following files:
NEWS              important changes in this release
NON-UNIX-USE      notes on building PCRE on non-Unix systems
README            this file
RunTest           a Unix shell script for running tests
RunTest.in        template for a Unix shell script for running tests
config.guess      ) files used by libtool,
config.sub        ) used only when building a shared library
configure         a configuring shell script (built by autoconf)
@@ -211,24 +237,29 @@ The distribution should contain the following files:
doc/pcreposix.txt       plain text version
doc/pcretest.txt        documentation of test program
doc/perltest.txt        documentation of Perl test program
doc/pgrep.1             man page source for the pgrep utility
doc/pgrep.html          HTML version
doc/pgrep.txt           plain text version
doc/pcregrep.1          man page source for the pcregrep utility
doc/pcregrep.html       HTML version
doc/pcregrep.txt        plain text version
install-sh              a shell script for installing files
ltconfig                ) files used to build "libtool",
ltmain.sh               ) used only when building a shared library
pcretest.c              test program
perltest                Perl test program
pgrep.c                 source of a grep utility that uses PCRE
perltest8               Perl test program for UTF-8 tests
pcregrep.c              source of a grep utility that uses PCRE
pcre-config.in          source of script which retains PCRE information
testdata/testinput1     test data, compatible with Perl 5.004 and 5.005
testdata/testinput2     test data for error messages and non-Perl things
testdata/testinput3     test data, compatible with Perl 5.005
testdata/testinput4     test data for locale-specific tests
testdata/testinput5     test data for UTF-8 tests compatible with Perl 5.6
testdata/testinput6     test data for other UTF-8 tests
testdata/testoutput1    test results corresponding to testinput1
testdata/testoutput2    test results corresponding to testinput2
testdata/testoutput3    test results corresponding to testinput3
testdata/testoutput4    test results corresponding to testinput4
testdata/testoutput5    test results corresponding to testinput5
testdata/testoutput6    test results corresponding to testinput6

(C) Auxiliary files for Win32 DLL

@@ -236,4 +267,4 @@ The distribution should contain the following files:
pcre.def

Philip Hazel <ph10@cam.ac.uk>
February 2000
August 2000
@@ -1,5 +1,8 @@
#! /bin/sh

# This file is generated by configure from RunTest.in. Make any changes
# to that file.

# Run PCRE tests

cf=diff
@@ -10,6 +13,8 @@ do1=no
do2=no
do3=no
do4=no
do5=no
do6=no

while [ $# -gt 0 ] ; do
  case $1 in
@@ -17,16 +22,32 @@ while [ $# -gt 0 ] ; do
    2) do2=yes;;
    3) do3=yes;;
    4) do4=yes;;
    5) do5=yes;;
    6) do6=yes;;
    *) echo "Unknown test number $1"; exit 1;;
  esac
  shift
done

if [ $do1 = no -a $do2 = no -a $do3 = no -a $do4 = no ] ; then
if [ "" = "" ] ; then
  if [ $do5 = yes ] ; then
    echo "Can't run test 5 because UTF8 support is not configured"
    exit 1
  fi
  if [ $do6 = yes ] ; then
    echo "Can't run test 6 because UTF8 support is not configured"
    exit 1
  fi
fi

if [ $do1 = no -a $do2 = no -a $do3 = no -a $do4 = no -a\
     $do5 = no -a $do6 = no ] ; then
  do1=yes
  do2=yes
  do3=yes
  do4=yes
  if [ "" != "" ] ; then do5=yes; fi
  if [ "" != "" ] ; then do6=yes; fi
fi

# Primary test, Perl-compatible
@@ -66,6 +87,7 @@ if [ $do3 = yes ] ; then
fi

if [ $do1 = yes -a $do2 = yes -a $do3 = yes ] ; then
  echo " "
  echo "The three main tests all ran OK"
  echo " "
fi
@@ -79,8 +101,14 @@ if [ $do4 = yes ] ; then
  ./pcretest testdata/testinput4 testtry
  if [ $? = 0 ] ; then
    $cf testtry testdata/testoutput4
    if [ $? != 0 ] ; then exit 1; fi
    if [ $? != 0 ] ; then
      echo " "
      echo "Locale test did not run entirely successfully."
      echo "This usually means that there is a problem with the locale"
      echo "settings rather than a bug in PCRE."
    else
      echo "Locale test ran OK"
    fi
    echo " "
  else exit 1
  fi
@@ -91,4 +119,30 @@ if [ $do4 = yes ] ; then
  fi
fi

# Additional tests for UTF8 support

if [ $do5 = yes ] ; then
  echo "Testing experimental, incomplete UTF8 support (Perl compatible)"
  ./pcretest testdata/testinput5 testtry
  if [ $? = 0 ] ; then
    $cf testtry testdata/testoutput5
    if [ $? != 0 ] ; then exit 1; fi
  else exit 1
  fi
  echo "UTF8 test ran OK"
  echo " "
fi

if [ $do6 = yes ] ; then
  echo "Testing API and internals for UTF8 support (not Perl compatible)"
  ./pcretest testdata/testinput6 testtry
  if [ $? = 0 ] ; then
    $cf testtry testdata/testoutput6
    if [ $? != 0 ] ; then exit 1; fi
  else exit 1
  fi
  echo "UTF8 internals test ran OK"
  echo " "
fi

# End
@@ -202,9 +202,10 @@ Forward assertions are just like other subpatterns, but starting with one of
the opcodes OP_ASSERT or OP_ASSERT_NOT. Backward assertions use the opcodes
OP_ASSERTBACK and OP_ASSERTBACK_NOT, and the first opcode inside the assertion
is OP_REVERSE, followed by a two byte count of the number of characters to move
back the pointer in the subject string. A separate count is present in each
alternative of a lookbehind assertion, allowing them to have different fixed
lengths.
back the pointer in the subject string. When operating in UTF-8 mode, the count
is a character count rather than a byte count. A separate count is present in
each alternative of a lookbehind assertion, allowing them to have different
fixed lengths.
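
To illustrate what moving back by a character count (rather than a byte count) involves in UTF-8 mode, here is a minimal sketch. It assumes valid UTF-8 input. The function name utf8_back is hypothetical and this is not the code PCRE uses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not PCRE code: move p back by count UTF-8
   characters, never going before start. Returns the new position, or
   NULL if fewer than count characters precede p. Assumes valid UTF-8. */
static const unsigned char *utf8_back(const unsigned char *start,
                                      const unsigned char *p,
                                      size_t count)
{
    assert(start != NULL && p != NULL && p >= start);
    while (count-- > 0) {
        if (p == start)
            return NULL;            /* not enough characters */
        p--;
        /* Skip continuation bytes (10xxxxxx) to reach the lead byte. */
        while (p > start && (*p & 0xc0) == 0x80)
            p--;
    }
    return p;
}
```

In byte mode the same step would simply be a subtraction of the count from the pointer, which is why the stored count has to be interpreted differently in the two modes.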


Once-only subpatterns
@@ -239,4 +240,4 @@ the compiled data.


Philip Hazel
February 2000
August 2000
@@ -44,6 +44,12 @@ pcre - Perl-compatible regular expressions.
.B int *\fIovector\fR, int \fIstringcount\fR, "const char ***\fIlistptr\fR);"
.PP
.br
.B void pcre_free_substring(const char *\fIstringptr\fR);
.PP
.br
.B void pcre_free_substring_list(const char **\fIstringptr\fR);
.PP
.br
.B const unsigned char *pcre_maketables(void);
.PP
.br
@@ -70,7 +76,9 @@ pcre - Perl-compatible regular expressions.
The PCRE library is a set of functions that implement regular expression
pattern matching using the same syntax and semantics as Perl 5, with just a few
differences (see below). The current implementation corresponds to Perl 5.005,
with some additional features from the Perl development release.
with some additional features from later versions. This includes some
experimental, incomplete support for UTF-8 encoded strings. Details of exactly
what is and what is not supported are given below.

PCRE has its own native API, which is described in this document. There is also
a set of wrapper functions that correspond to the POSIX regular expression API.
@@ -84,12 +92,16 @@ contain the major and minor release numbers for the library. Applications can
use these to include support for different releases.

The functions \fBpcre_compile()\fR, \fBpcre_study()\fR, and \fBpcre_exec()\fR
are used for compiling and matching regular expressions, while
\fBpcre_copy_substring()\fR, \fBpcre_get_substring()\fR, and
are used for compiling and matching regular expressions.

The functions \fBpcre_copy_substring()\fR, \fBpcre_get_substring()\fR, and
\fBpcre_get_substring_list()\fR are convenience functions for extracting
captured substrings from a matched subject string. The function
\fBpcre_maketables()\fR is used (optionally) to build a set of character tables
in the current locale for passing to \fBpcre_compile()\fR.
captured substrings from a matched subject string; \fBpcre_free_substring()\fR
and \fBpcre_free_substring_list()\fR are also provided, to free the memory used
for extracted strings.

The function \fBpcre_maketables()\fR is used (optionally) to build a set of
character tables in the current locale for passing to \fBpcre_compile()\fR.

The function \fBpcre_fullinfo()\fR is used to find out information about a
compiled pattern; \fBpcre_info()\fR is an obsolete version which returns only
@@ -223,6 +235,14 @@ This option inverts the "greediness" of the quantifiers so that they are not
greedy by default, but become greedy if followed by "?". It is not compatible
with Perl. It can also be set by a (?U) option setting within the pattern.

PCRE_UTF8

This option causes PCRE to regard both the pattern and the subject as strings
of UTF-8 characters instead of just byte strings. However, it is available only
if PCRE has been built to include UTF-8 support. If not, the use of this option
provokes an error. Support for UTF-8 is new, experimental, and incomplete.
Details of exactly what it entails are given below.


.SH STUDYING A PATTERN
When a pattern is going to be used several times, it is worth spending more
@@ -558,7 +578,7 @@ extract a single substring, whose number is given as \fIstringnumber\fR. A
value of zero extracts the substring that matched the entire pattern, while
higher values extract the captured substrings. For \fBpcre_copy_substring()\fR,
the string is placed in \fIbuffer\fR, whose length is given by
\fIbuffersize\fR, while for \fBpcre_get_substring()\fR a new block of store is
\fIbuffersize\fR, while for \fBpcre_get_substring()\fR a new block of memory is
obtained via \fBpcre_malloc\fR, and its address is returned via
\fIstringptr\fR. The yield of the function is the length of the string, not
including the terminating zero, or one of
@@ -590,6 +610,15 @@ string. This can be distinguished from a genuine zero-length substring by
inspecting the appropriate offset in \fIovector\fR, which is negative for unset
substrings.

The two convenience functions \fBpcre_free_substring()\fR and
\fBpcre_free_substring_list()\fR can be used to free the memory returned by
a previous call of \fBpcre_get_substring()\fR or
\fBpcre_get_substring_list()\fR, respectively. They do nothing more than call
the function pointed to by \fBpcre_free\fR, which of course could be called
directly from a C program. However, PCRE is used in some situations where it is
linked via a special interface to another programming language which cannot use
\fBpcre_free\fR directly; it is for these cases that the functions are
provided.
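
The indirection described above can be sketched in a standalone program. The names demo_free, demo_free_substring and counting_free are stand-ins invented for this illustration; they are not the PCRE symbols, and real code would include pcre.h and call pcre_free_substring() itself.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-ins for illustration only; these are not the PCRE symbols.
   demo_free plays the role of the pcre_free function pointer. */
static void (*demo_free)(void *) = free;

/* Wrapper in the style described above: it only forwards its argument
   to whatever function demo_free currently points to. */
static void demo_free_substring(const char *stringptr)
{
    assert(demo_free != NULL);
    (demo_free)((void *)stringptr);
}

/* A replacement deallocator that counts calls, to show that the
   wrapper honours the pointer rather than calling free() directly. */
static int demo_free_calls = 0;
static void counting_free(void *block)
{
    demo_free_calls++;
    free(block);
}
```

A foreign-language binding that can call exported functions but cannot dereference a C function pointer would call the wrapper, and the memory would still be released by whichever deallocator the application installed.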


.SH LIMITATIONS
@@ -691,8 +720,14 @@ The syntax and semantics of the regular expressions supported by PCRE are
described below. Regular expressions are also described in the Perl
documentation and in a number of other books, some of which have copious
examples. Jeffrey Friedl's "Mastering Regular Expressions", published by
O'Reilly (ISBN 1-56592-257), covers them in great detail. The description
here is intended as reference documentation.
O'Reilly (ISBN 1-56592-257), covers them in great detail.

The description here is intended as reference documentation. The basic
operation of PCRE is on strings of bytes. However, there are the beginnings of
some support for UTF-8 character strings. To use this support you must
configure PCRE to include it, and then call \fBpcre_compile()\fR with the
PCRE_UTF8 option. How this affects the pattern matching is described in the
final section of this document.

A regular expression is a pattern that is matched against a subject string from
left to right. Most characters stand for themselves in a pattern, and match the
@@ -1210,7 +1245,7 @@ to the string

/* first command */ not comment /* second comment */

fails, because it matches the entire string due to the greediness of the .*
fails, because it matches the entire string owing to the greediness of the .*
item.

However, if a quantifier is followed by a question mark, it ceases to be
@@ -1311,7 +1346,7 @@ example, the pattern

(a|b\\1)+

matches any number of "a"s and also "aba", "ababaa" etc. At each iteration of
matches any number of "a"s and also "aba", "ababbaa" etc. At each iteration of
the subpattern, the back reference matches the character string corresponding
to the previous iteration. In order for this to work, the pattern must be such
that the first iteration does not need to match the back reference. This can be
@@ -1529,9 +1564,10 @@ subpattern, a compile-time error occurs.

There are two kinds of condition. If the text between the parentheses consists
of a sequence of digits, the condition is satisfied if the capturing subpattern
of that number has previously matched. Consider the following pattern, which
contains non-significant white space to make it more readable (assume the
PCRE_EXTENDED option) and to divide it into three parts for ease of discussion:
of that number has previously matched. The number must be greater than zero.
Consider the following pattern, which contains non-significant white space to
make it more readable (assume the PCRE_EXTENDED option) and to divide it into
three parts for ease of discussion:

( \\( )? [^()]+ (?(1) \\) )

@@ -1685,6 +1721,77 @@ with the pattern above. The former gives a failure almost instantly when
applied to a whole line of "a" characters, whereas the latter takes an
appreciable time with strings longer than about 20 characters.


.SH UTF-8 SUPPORT
Starting at release 3.3, PCRE has some support for character strings encoded
in the UTF-8 format. This is incomplete, and is regarded as experimental. In
order to use it, you must configure PCRE to include UTF-8 support in the code,
and, in addition, you must call \fBpcre_compile()\fR with the PCRE_UTF8 option
flag. When you do this, both the pattern and any subject strings that are
matched against it are treated as UTF-8 strings instead of just strings of
bytes, but only in the cases that are mentioned below.

If you compile PCRE with UTF-8 support, but do not use it at run time, the
library will be a bit bigger, but the additional run time overhead is limited
to testing the PCRE_UTF8 flag in several places, so should not be very large.

PCRE assumes that the strings it is given contain valid UTF-8 codes. It does
not diagnose invalid UTF-8 strings. If you pass invalid UTF-8 strings to PCRE,
the results are undefined.

Running with PCRE_UTF8 set causes these changes in the way PCRE works:

1. In a pattern, the escape sequence \\x{...}, where the contents of the braces
is a string of hexadecimal digits, is interpreted as a UTF-8 character whose
code number is the given hexadecimal number, for example: \\x{1234}. This
inserts from one to six literal bytes into the pattern, using the UTF-8
encoding. If a non-hexadecimal digit appears between the braces, the item is
not recognized.

2. The original hexadecimal escape sequence, \\xhh, generates a two-byte UTF-8
character if its value is greater than 127.

3. Repeat quantifiers are NOT correctly handled if they follow a multibyte
character. For example, \\x{100}* and \\xc3+ do not work. If you want to
repeat such characters, you must enclose them in non-capturing parentheses,
for example (?:\\x{100}), at present.

4. The dot metacharacter matches one UTF-8 character instead of a single byte.

5. Unlike literal UTF-8 characters, the dot metacharacter followed by a
repeat quantifier does operate correctly on UTF-8 characters instead of
single bytes.

6. Although the \\x{...} escape is permitted in a character class, characters
whose values are greater than 255 cannot be included in a class.

7. A class is matched against a UTF-8 character instead of just a single byte,
but it can match only characters whose values are less than 256. Characters
with greater values always fail to match a class.

8. Repeated classes work correctly on multiple characters.

9. Classes containing just a single character whose value is greater than 127
(but less than 256), for example, [\\x80] or [^\\x{93}], do not work because
these are optimized into single byte matches. In the first case, of course,
the class brackets are just redundant.

10. Lookbehind assertions move backwards in the subject by a fixed number of
characters instead of a fixed number of bytes. Simple cases have been tested
to work correctly, but there may be hidden gotchas herein.

11. The character types such as \\d and \\w do not work correctly with UTF-8
characters. They continue to test a single byte.

12. Anything not explicitly mentioned here continues to work in bytes rather
than in characters.
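
The first two items in the list above say that an escape such as \\x{1234} is turned into one to six literal bytes, and that \\xhh above 127 becomes two bytes. A minimal sketch of that encoding step follows, assuming the original one-to-six-byte UTF-8 scheme for values up to 0x7fffffff. The function name utf8_encode is hypothetical and this is not PCRE's own code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not PCRE code: encode code point c using the
   original one-to-six-byte UTF-8 scheme (values up to 0x7fffffff).
   Writes the bytes to buf, which must hold at least 6 bytes, and
   returns the number of bytes written. */
static size_t utf8_encode(unsigned long c, unsigned char *buf)
{
    /* Largest code point that fits in 1, 2, 3, 4 and 5 bytes. */
    static const unsigned long limit[] =
        { 0x7fUL, 0x7ffUL, 0xffffUL, 0x1fffffUL, 0x3ffffffUL };
    /* Lead-byte marker bits for sequences of 1 to 6 bytes. */
    static const unsigned char lead[] =
        { 0x00, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc };
    size_t extra = 0;   /* number of continuation bytes */
    size_t i;

    assert(buf != NULL && c <= 0x7fffffffUL);
    while (extra < 5 && c > limit[extra])
        extra++;

    /* Fill continuation bytes from the end, six bits at a time. */
    for (i = extra; i > 0; i--) {
        buf[i] = (unsigned char)(0x80 | (c & 0x3f));
        c >>= 6;
    }
    buf[0] = (unsigned char)(lead[extra] | c);
    return extra + 1;
}
```

With this scheme 0x1234 becomes the three bytes e1 88 b4, and 0xc3 becomes the two bytes c3 83, which matches the two-byte case described for \\xhh values above 127.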

The following UTF-8 features of Perl 5.6 are not implemented:

1. The escape sequence \\C to match a single byte.

2. The use of Unicode tables and properties and escapes \\p, \\P, and \\X.

.SH AUTHOR
Philip Hazel <ph10@cam.ac.uk>
.br
@@ -1696,6 +1803,8 @@ Cambridge CB2 3QG, England.
|
||||
.br
|
||||
Phone: +44 1223 334714
|
||||
|
||||
Last updated: 27 January 2000
|
||||
Last updated: 28 August 2000,
|
||||
.br
|
||||
the 250th anniversary of the death of J.S. Bach.
|
||||
.br
|
||||
Copyright (c) 1997-2000 University of Cambridge.
|
||||
|
||||
@@ -37,7 +37,8 @@ conversion went wrong.
|
||||
<LI><A NAME="TOC27" HREF="#SEC27">COMMENTS</A>
|
||||
<LI><A NAME="TOC28" HREF="#SEC28">RECURSIVE PATTERNS</A>
|
||||
<LI><A NAME="TOC29" HREF="#SEC29">PERFORMANCE</A>
|
||||
<LI><A NAME="TOC30" HREF="#SEC30">AUTHOR</A>
|
||||
<LI><A NAME="TOC30" HREF="#SEC30">UTF-8 SUPPORT</A>
|
||||
<LI><A NAME="TOC31" HREF="#SEC31">AUTHOR</A>
|
||||
</UL>
|
||||
<LI><A NAME="SEC1" HREF="#TOC1">NAME</A>
|
||||
<P>
|
||||
@@ -76,6 +77,12 @@ pcre - Perl-compatible regular expressions.
|
||||
<B>int *<I>ovector</I>, int <I>stringcount</I>, const char ***<I>listptr</I>);</B>
|
||||
</P>
|
||||
<P>
|
||||
<B>void pcre_free_substring(const char *<I>stringptr</I>);</B>
|
||||
</P>
|
||||
<P>
|
||||
<B>void pcre_free_substring_list(const char **<I>stringptr</I>);</B>
|
||||
</P>
|
||||
<P>
|
||||
<B>const unsigned char *pcre_maketables(void);</B>
|
||||
</P>
|
||||
<P>
|
||||
@@ -100,7 +107,9 @@ pcre - Perl-compatible regular expressions.
|
||||
The PCRE library is a set of functions that implement regular expression
|
||||
pattern matching using the same syntax and semantics as Perl 5, with just a few
|
||||
differences (see below). The current implementation corresponds to Perl 5.005,
|
||||
with some additional features from the Perl development release.
|
||||
with some additional features from later versions. This includes some
|
||||
experimental, incomplete support for UTF-8 encoded strings. Details of exactly
|
||||
what is and what is not supported are given below.
|
||||
</P>
|
||||
<P>
|
||||
PCRE has its own native API, which is described in this document. There is also
|
||||
@@ -117,12 +126,18 @@ use these to include support for different releases.
|
||||
</P>
|
||||
<P>
|
||||
The functions <B>pcre_compile()</B>, <B>pcre_study()</B>, and <B>pcre_exec()</B>
|
||||
are used for compiling and matching regular expressions, while
|
||||
<B>pcre_copy_substring()</B>, <B>pcre_get_substring()</B>, and
|
||||
are used for compiling and matching regular expressions.
|
||||
</P>
|
||||
<P>
|
||||
The functions <B>pcre_copy_substring()</B>, <B>pcre_get_substring()</B>, and
|
||||
<B>pcre_get_substring_list()</B> are convenience functions for extracting
|
||||
captured substrings from a matched subject string. The function
|
||||
<B>pcre_maketables()</B> is used (optionally) to build a set of character tables
|
||||
in the current locale for passing to <B>pcre_compile()</B>.
|
||||
captured substrings from a matched subject string; <B>pcre_free_substring()</B>
|
||||
and <B>pcre_free_substring_list()</B> are also provided, to free the memory used
|
||||
for extracted strings.
|
||||
</P>
|
||||
<P>
|
||||
The function <B>pcre_maketables()</B> is used (optionally) to build a set of
|
||||
character tables in the current locale for passing to <B>pcre_compile()</B>.
|
||||
</P>
|
||||
<P>
|
||||
The function <B>pcre_fullinfo()</B> is used to find out information about a
|
||||
@@ -297,6 +312,18 @@ This option inverts the "greediness" of the quantifiers so that they are not
|
||||
greedy by default, but become greedy if followed by "?". It is not compatible
|
||||
with Perl. It can also be set by a (?U) option setting within the pattern.
|
||||
</P>
|
||||
<P>
|
||||
<PRE>
|
||||
PCRE_UTF8
|
||||
</PRE>
|
||||
</P>
|
||||
<P>
|
||||
This option causes PCRE to regard both the pattern and the subject as strings
|
||||
of UTF-8 characters instead of just byte strings. However, it is available only
|
||||
if PCRE has been built to include UTF-8 support. If not, the use of this option
|
||||
provokes an error. Support for UTF-8 is new, experimental, and incomplete.
|
||||
Details of exactly what it entails are given below.
|
||||
</P>
|
||||
<LI><A NAME="SEC6" HREF="#TOC1">STUDYING A PATTERN</A>
|
||||
<P>
|
||||
When a pattern is going to be used several times, it is worth spending more
|
||||
@@ -743,7 +770,7 @@ extract a single substring, whose number is given as <I>stringnumber</I>. A
|
||||
value of zero extracts the substring that matched the entire pattern, while
|
||||
higher values extract the captured substrings. For <B>pcre_copy_substring()</B>,
|
||||
the string is placed in <I>buffer</I>, whose length is given by
|
||||
<I>buffersize</I>, while for <B>pcre_get_substring()</B> a new block of store is
|
||||
<I>buffersize</I>, while for <B>pcre_get_substring()</B> a new block of memory is
|
||||
obtained via <B>pcre_malloc</B>, and its address is returned via
|
||||
<I>stringptr</I>. The yield of the function is the length of the string, not
|
||||
including the terminating zero, or one of
|
||||
@@ -789,6 +816,17 @@ string. This can be distinguished from a genuine zero-length substring by
|
||||
inspecting the appropriate offset in <I>ovector</I>, which is negative for unset
|
||||
substrings.
|
||||
</P>
|
||||
<P>
|
||||
The two convenience functions <B>pcre_free_substring()</B> and
|
||||
<B>pcre_free_substring_list()</B> can be used to free the memory returned by
|
||||
a previous call of <B>pcre_get_substring()</B> or
|
||||
<B>pcre_get_substring_list()</B>, respectively. They do nothing more than call
|
||||
the function pointed to by <B>pcre_free</B>, which of course could be called
|
||||
directly from a C program. However, PCRE is used in some situations where it is
|
||||
linked via a special interface to another programming language which cannot use
|
||||
<B>pcre_free</B> directly; it is for these cases that the functions are
|
||||
provided.
|
||||
</P>
|
||||
<LI><A NAME="SEC11" HREF="#TOC1">LIMITATIONS</A>
|
||||
<P>
|
||||
There are some size limitations in PCRE but it is hoped that they will never in
|
||||
@@ -908,8 +946,15 @@ The syntax and semantics of the regular expressions supported by PCRE are
|
||||
described below. Regular expressions are also described in the Perl
|
||||
documentation and in a number of other books, some of which have copious
|
||||
examples. Jeffrey Friedl's "Mastering Regular Expressions", published by
|
||||
O'Reilly (ISBN 1-56592-257), covers them in great detail. The description
|
||||
here is intended as reference documentation.
|
||||
O'Reilly (ISBN 1-56592-257), covers them in great detail.
|
||||
</P>
|
||||
<P>
|
||||
The description here is intended as reference documentation. The basic
|
||||
operation of PCRE is on strings of bytes. However, there is the beginnings of
|
||||
some support for UTF-8 character strings. To use this support you must
|
||||
configure PCRE to include it, and then call <B>pcre_compile()</B> with the
|
||||
PCRE_UTF8 option. How this affects the pattern matching is described in the
|
||||
final section of this document.
|
||||
</P>
|
||||
<P>
|
||||
A regular expression is a pattern that is matched against a subject string from
|
||||
@@ -1576,7 +1621,7 @@ to the string
|
||||
</PRE>
|
||||
</P>
|
||||
<P>
|
||||
fails, because it matches the entire string due to the greediness of the .*
|
||||
fails, because it matches the entire string owing to the greediness of the .*
|
||||
item.
|
||||
</P>
|
||||
<P>
|
||||
@@ -1718,7 +1763,7 @@ example, the pattern
|
||||
</PRE>
|
||||
</P>
|
||||
<P>
|
||||
matches any number of "a"s and also "aba", "ababaa" etc. At each iteration of
|
||||
matches any number of "a"s and also "aba", "ababbaa" etc. At each iteration of
|
||||
the subpattern, the back reference matches the character string corresponding
|
||||
to the previous iteration. In order for this to work, the pattern must be such
|
||||
that the first iteration does not need to match the back reference. This can be
|
||||
@@ -2033,9 +2078,10 @@ subpattern, a compile-time error occurs.
|
||||
<P>
|
||||
There are two kinds of condition. If the text between the parentheses consists
|
||||
of a sequence of digits, the condition is satisfied if the capturing subpattern
|
||||
of that number has previously matched. Consider the following pattern, which
|
||||
contains non-significant white space to make it more readable (assume the
|
||||
PCRE_EXTENDED option) and to divide it into three parts for ease of discussion:
|
||||
of that number has previously matched. The number must be greater than zero.
|
||||
Consider the following pattern, which contains non-significant white space to
|
||||
make it more readable (assume the PCRE_EXTENDED option) and to divide it into
|
||||
three parts for ease of discussion:
|
||||
</P>
|
||||
<P>
|
||||
<PRE>
|
||||
@@ -2240,7 +2286,96 @@ with the pattern above. The former gives a failure almost instantly when
|
||||
applied to a whole line of "a" characters, whereas the latter takes an
|
||||
appreciable time with strings longer than about 20 characters.
|
||||
</P>
|
||||
<LI><A NAME="SEC30" HREF="#TOC1">AUTHOR</A>
|
||||
<LI><A NAME="SEC30" HREF="#TOC1">UTF-8 SUPPORT</A>
|
||||
<P>
|
||||
Starting at release 3.3, PCRE has some support for character strings encoded
|
||||
in the UTF-8 format. This is incomplete, and is regarded as experimental. In
|
||||
order to use it, you must configure PCRE to include UTF-8 support in the code,
|
||||
and, in addition, you must call <B>pcre_compile()</B> with the PCRE_UTF8 option
|
||||
flag. When you do this, both the pattern and any subject strings that are
|
||||
matched against it are treated as UTF-8 strings instead of just strings of
|
||||
bytes, but only in the cases that are mentioned below.
|
||||
</P>
|
||||
<P>
|
||||
If you compile PCRE with UTF-8 support, but do not use it at run time, the
|
||||
library will be a bit bigger, but the additional run time overhead is limited
|
||||
to testing the PCRE_UTF8 flag in several places, so should not be very large.
|
||||
</P>
|
||||
<P>
|
||||
PCRE assumes that the strings it is given contain valid UTF-8 codes. It does
|
||||
not diagnose invalid UTF-8 strings. If you pass invalid UTF-8 strings to PCRE,
|
||||
the results are undefined.
|
||||
</P>
|
||||
<P>
|
||||
Running with PCRE_UTF8 set causes these changes in the way PCRE works:
|
||||
</P>
|
||||
<P>
|
||||
1. In a pattern, the escape sequence \x{...}, where the contents of the braces
|
||||
is a string of hexadecimal digits, is interpreted as a UTF-8 character whose
|
||||
code number is the given hexadecimal number, for example: \x{1234}. This
|
||||
inserts from one to six literal bytes into the pattern, using the UTF-8
|
||||
encoding. If a non-hexadecimal digit appears between the braces, the item is
|
||||
not recognized.
|
||||
</P>
|
||||
<P>
|
||||
2. The original hexadecimal escape sequence, \xhh, generates a two-byte UTF-8
|
||||
character if its value is greater than 127.
|
||||
</P>
|
||||
<P>
|
||||
3. Repeat quantifiers are NOT correctly handled if they follow a multibyte
|
||||
character. For example, \x{100}* and \xc3+ do not work. If you want to
|
||||
repeat such characters, you must enclose them in non-capturing parentheses,
|
||||
for example (?:\x{100}), at present.
|
||||
</P>
|
||||
<P>
|
||||
4. The dot metacharacter matches one UTF-8 character instead of a single byte.
|
||||
</P>
|
||||
<P>
|
||||
5. Unlike literal UTF-8 characters, the dot metacharacter followed by a
|
||||
repeat quantifier does operate correctly on UTF-8 characters instead of
|
||||
single bytes.
|
||||
</P>
|
||||
<P>
|
||||
4. Although the \x{...} escape is permitted in a character class, characters
|
||||
whose values are greater than 255 cannot be included in a class.
|
||||
</P>
|
||||
<P>
|
||||
5. A class is matched against a UTF-8 character instead of just a single byte,
|
||||
but it can match only characters whose values are less than 256. Characters
|
||||
with greater values always fail to match a class.
|
||||
</P>
|
||||
<P>
|
||||
6. Repeated classes work correctly on multiple characters.
|
||||
</P>
|
||||
<P>
|
||||
7. Classes containing just a single character whose value is greater than 127
|
||||
(but less than 256), for example, [\x80] or [^\x{93}], do not work because
|
||||
these are optimized into single byte matches. In the first case, of course,
|
||||
the class brackets are just redundant.
|
||||
</P>
|
||||
<P>
|
||||
8. Lookbehind assertions move backwards in the subject by a fixed number of
|
||||
characters instead of a fixed number of bytes. Simple cases have been tested
|
||||
to work correctly, but there may be hidden gotchas herein.
|
||||
</P>
|
||||
<P>
|
||||
9. The character types such as \d and \w do not work correctly with UTF-8
|
||||
characters. They continue to test a single byte.
|
||||
</P>
|
||||
<P>
|
||||
10. Anything not explicitly mentioned here continues to work in bytes rather
|
||||
than in characters.
|
||||
</P>
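
Moving backwards by a fixed number of characters, as described for lookbehind assertions above, amounts to skipping continuation bytes until a lead byte is reached. The following is a minimal sketch under that assumption; utf8_step_back is a hypothetical name and this is not how PCRE itself is necessarily written.

```c
#include <assert.h>
#include <stddef.h>

/* Step back n UTF-8 characters from position p within a subject that begins
   at start. Returns the new position, or NULL if fewer than n characters
   precede p. The subject is assumed to be valid UTF-8.
   (Hypothetical helper, not part of the PCRE API.) */
const char *utf8_step_back(const char *start, const char *p, size_t n)
{
    assert(start != NULL && p != NULL && p >= start);
    while (n > 0) {
        if (p == start) return NULL;   /* ran out of subject */
        p--;
        /* Continuation bytes have the form 10xxxxxx; keep moving back
           until the lead byte of the character is reached. */
        while (p > start && ((unsigned char)*p & 0xC0) == 0x80) p--;
        n--;
    }
    return p;
}
```

For the subject "a\xC4\x80" "b" (four bytes, three characters), stepping back two characters from the end lands on the 0xC4 lead byte at offset 1, whereas stepping back two bytes would land in the middle of that character.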
|
||||
<P>
|
||||
The following UTF-8 features of Perl 5.6 are not implemented:
|
||||
</P>
|
||||
<P>
|
||||
1. The escape sequence \C to match a single byte.
|
||||
</P>
|
||||
<P>
|
||||
2. The use of Unicode tables and properties and escapes \p, \P, and \X.
|
||||
</P>
|
||||
<LI><A NAME="SEC31" HREF="#TOC1">AUTHOR</A>
|
||||
<P>
|
||||
Philip Hazel <ph10@cam.ac.uk>
|
||||
<BR>
|
||||
@@ -2253,6 +2388,10 @@ Cambridge CB2 3QG, England.
|
||||
Phone: +44 1223 334714
|
||||
</P>
|
||||
<P>
|
||||
Last updated: 27 January 2000
|
||||
Last updated: 28 August 2000,
|
||||
<BR>
|
||||
<PRE>
|
||||
the 250th anniversary of the death of J.S. Bach.
|
||||
<BR>
|
||||
</PRE>
|
||||
Copyright (c) 1997-2000 University of Cambridge.
|
||||
|
||||
@@ -28,6 +28,10 @@ SYNOPSIS
|
||||
int pcre_get_substring_list(const char *subject,
|
||||
int *ovector, int stringcount, const char ***listptr);
|
||||
|
||||
void pcre_free_substring(const char *stringptr);
|
||||
|
||||
void pcre_free_substring_list(const char **stringptr);
|
||||
|
||||
const unsigned char *pcre_maketables(void);
|
||||
|
||||
int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
|
||||
@@ -48,9 +52,12 @@ DESCRIPTION
|
||||
The PCRE library is a set of functions that implement regu-
|
||||
lar expression pattern matching using the same syntax and
|
||||
semantics as Perl 5, with just a few differences (see
|
||||
|
||||
below). The current implementation corresponds to Perl
|
||||
5.005, with some additional features from the Perl develop-
|
||||
ment release.
|
||||
5.005, with some additional features from later versions.
|
||||
This includes some experimental, incomplete support for
|
||||
UTF-8 encoded strings. Details of exactly what is and what
|
||||
is not supported are given below.
|
||||
|
||||
PCRE has its own native API, which is described in this
|
||||
document. There is also a set of wrapper functions that
|
||||
@@ -67,13 +74,18 @@ DESCRIPTION
|
||||
releases.
|
||||
|
||||
The functions pcre_compile(), pcre_study(), and pcre_exec()
|
||||
are used for compiling and matching regular expressions,
|
||||
while pcre_copy_substring(), pcre_get_substring(), and
|
||||
pcre_get_substring_list() are convenience functions for
|
||||
are used for compiling and matching regular expressions.
|
||||
|
||||
The functions pcre_copy_substring(), pcre_get_substring(),
|
||||
and pcre_get_substring_list() are convenience functions for
|
||||
extracting captured substrings from a matched subject
|
||||
string. The function pcre_maketables() is used (optionally)
|
||||
to build a set of character tables in the current locale for
|
||||
passing to pcre_compile().
|
||||
string; pcre_free_substring() and pcre_free_substring_list()
|
||||
are also provided, to free the memory used for extracted
|
||||
strings.
|
||||
|
||||
The function pcre_maketables() is used (optionally) to build
|
||||
a set of character tables in the current locale for passing
|
||||
to pcre_compile().
|
||||
|
||||
The function pcre_fullinfo() is used to find out information
|
||||
about a compiled pattern; pcre_info() is an obsolete version
|
||||
@@ -92,10 +104,19 @@ DESCRIPTION
|
||||
|
||||
|
||||
MULTI-THREADING
|
||||
The PCRE functions can be used in multi-threading applica-
|
||||
tions, with the proviso that the memory management functions
|
||||
pointed to by pcre_malloc and pcre_free are shared by all
|
||||
threads.
|
||||
The PCRE functions can be used in multi-threading
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
SunOS 5.8 Last change: 2
|
||||
|
||||
|
||||
|
||||
applications, with the proviso that the memory management
|
||||
functions pointed to by pcre_malloc and pcre_free are shared
|
||||
by all threads.
|
||||
|
||||
The compiled form of a regular expression is not altered
|
||||
during matching, so the same compiled pattern can safely be
|
||||
@@ -103,7 +124,6 @@ MULTI-THREADING
|
||||
|
||||
|
||||
|
||||
|
||||
COMPILING A PATTERN
|
||||
The function pcre_compile() is called to compile a pattern
|
||||
into an internal form. The pattern is a C string terminated
|
||||
@@ -235,12 +255,23 @@ COMPILING A PATTERN
|
||||
followed by "?". It is not compatible with Perl. It can also
|
||||
be set by a (?U) option setting within the pattern.
|
||||
|
||||
PCRE_UTF8
|
||||
|
||||
This option causes PCRE to regard both the pattern and the
|
||||
subject as strings of UTF-8 characters instead of just byte
|
||||
strings. However, it is available only if PCRE has been
|
||||
built to include UTF-8 support. If not, the use of this
|
||||
option provokes an error. Support for UTF-8 is new, experi-
|
||||
mental, and incomplete. Details of exactly what it entails
|
||||
are given below.
|
||||
|
||||
|
||||
|
||||
STUDYING A PATTERN
|
||||
When a pattern is going to be used several times, it is
|
||||
worth spending more time analyzing it in order to speed up
|
||||
the time taken for matching. The function pcre_study() takes
|
||||
|
||||
a pointer to a compiled pattern as its first argument, and
|
||||
returns a pointer to a pcre_extra block (another void
|
||||
typedef) containing additional information about the pat-
|
||||
@@ -344,9 +375,9 @@ INFORMATION ABOUT A PATTERN
|
||||
|
||||
PCRE_INFO_BACKREFMAX
|
||||
|
||||
Return the number of the highest back reference in the pat-
|
||||
tern. The fourth argument should point to an int variable.
|
||||
Zero is returned if there are no back references.
|
||||
Return the number of the highest back reference in the
|
||||
pattern. The fourth argument should point to an int vari-
|
||||
able. Zero is returned if there are no back references.
|
||||
|
||||
PCRE_INFO_FIRSTCHAR
|
||||
|
||||
@@ -605,6 +636,15 @@ MATCHING A PATTERN
|
||||
|
||||
EXTRACTING CAPTURED SUBSTRINGS
|
||||
Captured substrings can be accessed directly by using the
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
SunOS 5.8 Last change: 12
|
||||
|
||||
|
||||
|
||||
offsets returned by pcre_exec() in ovector. For convenience,
|
||||
the functions pcre_copy_substring(), pcre_get_substring(),
|
||||
and pcre_get_substring_list() are provided for extracting
|
||||
@@ -631,7 +671,7 @@ EXTRACTING CAPTURED SUBSTRINGS
|
||||
the entire pattern, while higher values extract the captured
|
||||
substrings. For pcre_copy_substring(), the string is placed
|
||||
in buffer, whose length is given by buffersize, while for
|
||||
pcre_get_substring() a new block of store is obtained via
|
||||
pcre_get_substring() a new block of memory is obtained via
|
||||
pcre_malloc, and its address is returned via stringptr. The
|
||||
yield of the function is the length of the string, not
|
||||
including the terminating zero, or one of
|
||||
@@ -665,6 +705,16 @@ EXTRACTING CAPTURED SUBSTRINGS
|
||||
inspecting the appropriate offset in ovector, which is nega-
|
||||
tive for unset substrings.
|
||||
|
||||
The two convenience functions pcre_free_substring() and
|
||||
pcre_free_substring_list() can be used to free the memory
|
||||
returned by a previous call of pcre_get_substring() or
|
||||
pcre_get_substring_list(), respectively. They do nothing
|
||||
more than call the function pointed to by pcre_free, which
|
||||
of course could be called directly from a C program. How-
|
||||
ever, PCRE is used in some situations where it is linked via
|
||||
a special interface to another programming language which
|
||||
cannot use pcre_free directly; it is for these cases that
|
||||
the functions are provided.
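
The indirection described above can be sketched as follows. This is not PCRE source code: every demo_ name is a hypothetical stand-in (demo_free plays the role of the pcre_free pointer, demo_free_substring the role of pcre_free_substring), and the sketch assumes nothing about PCRE beyond what the paragraph states.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stand-ins for replaceable memory management hooks. A caller may point
   these at its own allocator. (Hypothetical names.) */
void *(*demo_malloc)(size_t) = malloc;
void (*demo_free)(void *) = free;

/* Copy subject[start..end) into newly obtained memory and zero-terminate
   it. Returns NULL if the allocation fails. */
char *demo_get_substring(const char *subject, int start, int end)
{
    size_t len;
    char *copy;

    assert(subject != NULL && start >= 0 && end >= start);
    len = (size_t)(end - start);
    copy = (char *)demo_malloc(len + 1);
    if (copy == NULL) return NULL;
    memcpy(copy, subject + start, len);
    copy[len] = '\0';
    return copy;
}

/* The wrapper does nothing more than call through the hook. It exists so
   that callers which cannot reach a function pointer variable directly,
   such as bindings for another language, still have an ordinary function
   to call. */
void demo_free_substring(const char *stringptr)
{
    demo_free((void *)stringptr);
}
```

A C caller could equally well invoke the hook itself; the wrapper only matters where the pointer variable is not accessible.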
|
||||
|
||||
|
||||
|
||||
@@ -733,6 +783,7 @@ DIFFERENCES FROM PERL
|
||||
(?p{code}) constructions. However, there is some experimen-
|
||||
tal support for recursive patterns using the non-Perl item
|
||||
(?R).
|
||||
|
||||
8. There are at the time of writing some oddities in Perl
|
||||
5.005_02 concerned with the settings of captured strings
|
||||
when part of a pattern is repeated. For example, matching
|
||||
@@ -785,11 +836,17 @@ REGULAR EXPRESSION DETAILS
|
||||
The syntax and semantics of the regular expressions sup-
|
||||
ported by PCRE are described below. Regular expressions are
|
||||
also described in the Perl documentation and in a number of
|
||||
|
||||
other books, some of which have copious examples. Jeffrey
|
||||
Friedl's "Mastering Regular Expressions", published by
|
||||
O'Reilly (ISBN 1-56592-257), covers them in great detail.
|
||||
O'Reilly (ISBN 1-56592-257), covers them in great detail.
|
||||
|
||||
The description here is intended as reference documentation.
|
||||
The basic operation of PCRE is on strings of bytes. However,
|
||||
there is the beginnings of some support for UTF-8 character
|
||||
strings. To use this support you must configure PCRE to
|
||||
include it, and then call pcre_compile() with the PCRE_UTF8
|
||||
option. How this affects the pattern matching is described
|
||||
in the final section of this document.
|
||||
|
||||
A regular expression is a pattern that is matched against a
|
||||
subject string from left to right. Most characters stand for
|
||||
@@ -1004,6 +1061,7 @@ CIRCUMFLEX AND DOLLAR
|
||||
Outside a character class, in the default matching mode, the
|
||||
circumflex character is an assertion which is true only if
|
||||
the current matching point is at the start of the subject
|
||||
|
||||
string. If the startoffset argument of pcre_exec() is non-
|
||||
zero, circumflex can never match. Inside a character class,
|
||||
circumflex has an entirely different meaning (see below).
|
||||
@@ -1056,6 +1114,7 @@ FULL STOP (PERIOD, DOT)
|
||||
Outside a character class, a dot in the pattern matches any
|
||||
one character in the subject, including a non-printing char-
|
||||
acter, but not (by default) newline. If the PCRE_DOTALL
|
||||
|
||||
option is set, dots match newlines as well. The handling of
|
||||
dot is entirely independent of the handling of circumflex
|
||||
and dollar, the only relationship being that they both
|
||||
@@ -1403,7 +1462,7 @@ REPETITION
|
||||
|
||||
/* first command */ not comment /* second comment */
|
||||
|
||||
fails, because it matches the entire string due to the
|
||||
fails, because it matches the entire string owing to the
|
||||
greediness of the .* item.
|
||||
|
||||
However, if a quantifier is followed by a question mark, it
|
||||
@@ -1517,18 +1576,19 @@ BACK REFERENCES
|
||||
A back reference that occurs inside the parentheses to which
|
||||
it refers fails when the subpattern is first used, so, for
|
||||
example, (a\1) never matches. However, such references can
|
||||
be useful inside repeated subpatterns. For example, the
|
||||
pattern
|
||||
be useful inside repeated subpatterns. For example, the pat-
|
||||
tern
|
||||
|
||||
(a|b\1)+
|
||||
|
||||
matches any number of "a"s and also "aba", "ababaa" etc. At
|
||||
matches any number of "a"s and also "aba", "ababbaa" etc. At
|
||||
each iteration of the subpattern, the back reference matches
|
||||
the character string corresponding to the previous itera-
|
||||
tion. In order for this to work, the pattern must be such
|
||||
that the first iteration does not need to match the back
|
||||
reference. This can be done using alternation, as in the
|
||||
example above, or by a quantifier with a minimum of zero.
|
||||
the character string corresponding to the previous
|
||||
iteration. In order for this to work, the pattern must be
|
||||
such that the first iteration does not need to match the
|
||||
back reference. This can be done using alternation, as in
|
||||
the example above, or by a quantifier with a minimum of
|
||||
zero.
|
||||
|
||||
|
||||
|
||||
@@ -1681,9 +1741,9 @@ ONCE-ONLY SUBPATTERNS
|
||||
|
||||
This kind of parenthesis "locks up" the part of the pattern
|
||||
it contains once it has matched, and a failure further into
|
||||
the pattern is prevented from backtracking into it. Back-
|
||||
tracking past it to previous items, however, works as nor-
|
||||
mal.
|
||||
the pattern is prevented from backtracking into it.
|
||||
Backtracking past it to previous items, however, works as
|
||||
normal.
|
||||
|
||||
An alternative description is that a subpattern of this type
|
||||
matches the string of characters that an identical stan-
|
||||
@@ -1778,10 +1838,11 @@ CONDITIONAL SUBPATTERNS
|
||||
There are two kinds of condition. If the text between the
|
||||
parentheses consists of a sequence of digits, the condition
|
||||
is satisfied if the capturing subpattern of that number has
|
||||
previously matched. Consider the following pattern, which
|
||||
contains non-significant white space to make it more read-
|
||||
able (assume the PCRE_EXTENDED option) and to divide it into
|
||||
three parts for ease of discussion:
|
||||
previously matched. The number must be greater than zero.
|
||||
Consider the following pattern, which contains non-
|
||||
significant white space to make it more readable (assume the
|
||||
PCRE_EXTENDED option) and to divide it into three parts for
|
||||
ease of discussion:
|
||||
|
||||
( \( )? [^()]+ (?(1) \) )
|
||||
|
||||
@@ -1966,6 +2027,92 @@ PERFORMANCE
|
||||
|
||||
|
||||
|
||||
UTF-8 SUPPORT
|
||||
Starting at release 3.3, PCRE has some support for character
|
||||
strings encoded in the UTF-8 format. This is incomplete, and
|
||||
is regarded as experimental. In order to use it, you must
|
||||
configure PCRE to include UTF-8 support in the code, and, in
|
||||
addition, you must call pcre_compile() with the PCRE_UTF8
|
||||
option flag. When you do this, both the pattern and any sub-
|
||||
ject strings that are matched against it are treated as
|
||||
UTF-8 strings instead of just strings of bytes, but only in
|
||||
the cases that are mentioned below.
|
||||
|
||||
If you compile PCRE with UTF-8 support, but do not use it at
|
||||
run time, the library will be a bit bigger, but the addi-
|
||||
tional run time overhead is limited to testing the PCRE_UTF8
|
||||
flag in several places, so should not be very large.
|
||||
|
||||
PCRE assumes that the strings it is given contain valid
|
||||
UTF-8 codes. It does not diagnose invalid UTF-8 strings. If
|
||||
you pass invalid UTF-8 strings to PCRE, the results are
|
||||
undefined.
|
||||
|
||||
Running with PCRE_UTF8 set causes these changes in the way
|
||||
PCRE works:
|
||||
|
||||
1. In a pattern, the escape sequence \x{...}, where the con-
|
||||
tents of the braces is a string of hexadecimal digits, is
|
||||
interpreted as a UTF-8 character whose code number is the
|
||||
given hexadecimal number, for example: \x{1234}. This
|
||||
inserts from one to six literal bytes into the pattern,
|
||||
using the UTF-8 encoding. If a non-hexadecimal digit appears
|
||||
between the braces, the item is not recognized.
|
||||
|
||||
2. The original hexadecimal escape sequence, \xhh, generates
|
||||
a two-byte UTF-8 character if its value is greater than 127.
|
||||
|
||||
3. Repeat quantifiers are NOT correctly handled if they fol-
|
||||
low a multibyte character. For example, \x{100}* and \xc3+
|
||||
do not work. If you want to repeat such characters, you must
|
||||
enclose them in non-capturing parentheses, for example
|
||||
(?:\x{100}), at present.
|
||||
|
||||
4. The dot metacharacter matches one UTF-8 character instead
|
||||
of a single byte.
|
||||
|
||||
5. Unlike literal UTF-8 characters, the dot metacharacter
|
||||
followed by a repeat quantifier does operate correctly on
|
||||
UTF-8 characters instead of single bytes.
|
||||
|
||||
4. Although the \x{...} escape is permitted in a character
|
||||
class, characters whose values are greater than 255 cannot
|
||||
be included in a class.
|
||||
|
||||
5. A class is matched against a UTF-8 character instead of
|
||||
just a single byte, but it can match only characters whose
|
||||
values are less than 256. Characters with greater values
|
||||
always fail to match a class.
|
||||
|
||||
6. Repeated classes work correctly on multiple characters.
|
||||
|
||||
7. Classes containing just a single character whose value is
|
||||
greater than 127 (but less than 256), for example, [\x80] or
|
||||
[^\x{93}], do not work because these are optimized into sin-
|
||||
gle byte matches. In the first case, of course, the class
|
||||
brackets are just redundant.
|
||||
|
||||
8. Lookbehind assertions move backwards in the subject by a
|
||||
fixed number of characters instead of a fixed number of
|
||||
bytes. Simple cases have been tested to work correctly, but
|
||||
there may be hidden gotchas herein.
|
||||
|
||||
9. The character types such as \d and \w do not work
|
||||
correctly with UTF-8 characters. They continue to test a
|
||||
single byte.
|
||||
|
||||
10. Anything not explicitly mentioned here continues to work
|
||||
in bytes rather than in characters.
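
Items 1 and 2 above say that \x{...} inserts from one to six bytes and that \xhh with a value greater than 127 becomes two bytes. The arithmetic behind this can be sketched in C as below. The function name utf8_encode is hypothetical and this is not taken from the PCRE source; it assumes the original UTF-8 scheme, which covers code points up to 0x7FFFFFFF.

```c
#include <assert.h>
#include <stddef.h>

/* Encode code point cp into out, which must have room for 6 bytes.
   Returns the number of bytes written (1..6), or 0 if cp is too large
   for the six-byte scheme. (Hypothetical helper, not part of PCRE.) */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    /* Largest code point that fits in 1, 2, ..., 6 bytes. */
    static const unsigned long limit[6] = {
        0x7FUL, 0x7FFUL, 0xFFFFUL, 0x1FFFFFUL, 0x3FFFFFFUL, 0x7FFFFFFFUL
    };
    /* Marker bits for the lead byte of a 1..6 byte sequence. */
    static const unsigned char lead[6] = {
        0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC
    };
    size_t n, i;

    assert(out != NULL);
    for (n = 0; n < 6; n++)
        if (cp <= limit[n]) break;
    if (n == 6) return 0;

    /* Fill the continuation bytes from the end, six payload bits each. */
    for (i = n; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (cp & 0x3F));
        cp >>= 6;
    }
    /* Whatever bits remain go into the lead byte. */
    out[0] = (unsigned char)(lead[n] | cp);
    return n + 1;
}
```

Under this scheme the value 0xC3 becomes the two bytes C3 83, and 0x1234 becomes the three bytes E1 88 B4, which is why a quantifier applied to the last byte alone does not repeat the whole character.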
|
||||
|
||||
The following UTF-8 features of Perl 5.6 are not imple-
|
||||
mented:
|
||||
1. The escape sequence \C to match a single byte.
|
||||
|
||||
2. The use of Unicode tables and properties and escapes \p,
|
||||
\P, and \X.
|
||||
|
||||
|
||||
|
||||
AUTHOR
|
||||
Philip Hazel <ph10@cam.ac.uk>
|
||||
University Computing Service,
|
||||
@@ -1973,5 +2120,6 @@ AUTHOR
|
||||
Cambridge CB2 3QG, England.
|
||||
Phone: +44 1223 334714
|
||||
|
||||
Last updated: 27 January 2000
|
||||
Last updated: 28 August 2000,
|
||||
the 250th anniversary of the death of J.S. Bach.
|
||||
Copyright (c) 1997-2000 University of Cambridge.
|
||||
|
||||
@@ -1,20 +1,20 @@
|
||||
.TH PGREP 1
|
||||
.TH PCREGREP 1
|
||||
.SH NAME
|
||||
pgrep - a grep with Perl-compatible regular expressions.
|
||||
pcregrep - a grep with Perl-compatible regular expressions.
|
||||
.SH SYNOPSIS
|
||||
.B pgrep [-Vchilnsvx] pattern [file] ...
|
||||
.B pcregrep [-Vchilnsvx] pattern [file] ...
|
||||
|
||||
|
||||
.SH DESCRIPTION
|
||||
\fBpgrep\fR searches files for character patterns, in the same way as other
|
||||
\fBpcregrep\fR searches files for character patterns, in the same way as other
|
||||
grep commands do, but it uses the PCRE regular expression library to support
|
||||
patterns that are compatible with the regular expressions of Perl 5. See
|
||||
\fBpcre(3)\fR for a full description of syntax and semantics.
|
||||
|
||||
If no files are specified, \fBpgrep\fR reads the standard input. By default,
|
||||
If no files are specified, \fBpcregrep\fR reads the standard input. By default,
|
||||
each line that matches the pattern is copied to the standard output, and if
|
||||
there is more than one file, the file name is printed before each line of
|
||||
output. However, there are options that can change how \fBpgrep\fR behaves.
|
||||
output. However, there are options that can change how \fBpcregrep\fR behaves.
|
||||
|
||||
Lines are limited to BUFSIZ characters. BUFSIZ is defined in \fB<stdio.h>\fR.
|
||||
The newline character is removed from the end of each line before it is matched
|
||||
@@ -73,4 +73,4 @@ for syntax errors or inacessible files (even if matches were found).
|
||||
.SH AUTHOR
|
||||
Philip Hazel <ph10@cam.ac.uk>
|
||||
.br
|
||||
Copyright (c) 1997-1999 University of Cambridge.
|
||||
Copyright (c) 1997-2000 University of Cambridge.
|
||||
@@ -1,9 +1,9 @@
|
||||
<HTML>
|
||||
<HEAD>
|
||||
<TITLE>pgrep specification</TITLE>
|
||||
<TITLE>pcregrep specification</TITLE>
|
||||
</HEAD>
|
||||
<body bgcolor="#FFFFFF" text="#00005A">
|
||||
<H1>pgrep specification</H1>
|
||||
<H1>pcregrep specification</H1>
|
||||
This HTML document has been generated automatically from the original man page.
|
||||
If there is any nonsense in it, please consult the man page in case the
|
||||
conversion went wrong.
|
||||
@@ -18,24 +18,24 @@ conversion went wrong.
|
||||
</UL>
|
||||
<LI><A NAME="SEC1" HREF="#TOC1">NAME</A>
|
||||
<P>
|
||||
pgrep - a grep with Perl-compatible regular expressions.
|
||||
pcregrep - a grep with Perl-compatible regular expressions.
|
||||
</P>
|
||||
<LI><A NAME="SEC2" HREF="#TOC1">SYNOPSIS</A>
|
||||
<P>
|
||||
<B>pgrep [-Vchilnsvx] pattern [file] ...</B>
|
||||
<B>pcregrep [-Vchilnsvx] pattern [file] ...</B>
|
||||
</P>
|
||||
<LI><A NAME="SEC3" HREF="#TOC1">DESCRIPTION</A>
|
||||
<P>
|
||||
<B>pgrep</B> searches files for character patterns, in the same way as other
|
||||
<B>pcregrep</B> searches files for character patterns, in the same way as other
|
||||
grep commands do, but it uses the PCRE regular expression library to support
|
||||
patterns that are compatible with the regular expressions of Perl 5. See
|
||||
<B>pcre(3)</B> for a full description of syntax and semantics.
|
||||
</P>
|
||||
<P>
|
||||
If no files are specified, <B>pgrep</B> reads the standard input. By default,
|
||||
If no files are specified, <B>pcregrep</B> reads the standard input. By default,
|
||||
each line that matches the pattern is copied to the standard output, and if
|
||||
there is more than one file, the file name is printed before each line of
|
||||
output. However, there are options that can change how <B>pgrep</B> behaves.
|
||||
output. However, there are options that can change how <B>pcregrep</B> behaves.
|
||||
</P>
|
||||
<P>
|
||||
Lines are limited to BUFSIZ characters. BUFSIZ is defined in <B><stdio.h></B>.
|
||||
@@ -102,4 +102,4 @@ for syntax errors or inaccessible files (even if matches were found).
|
||||
<P>
|
||||
Philip Hazel <ph10@cam.ac.uk>
|
||||
<BR>
|
||||
Copyright (c) 1997-1999 University of Cambridge.
|
||||
Copyright (c) 1997-2000 University of Cambridge.
|
||||
@@ -1,25 +1,26 @@
|
||||
NAME
|
||||
pgrep - a grep with Perl-compatible regular expressions.
|
||||
pcregrep - a grep with Perl-compatible regular expressions.
|
||||
|
||||
|
||||
|
||||
SYNOPSIS
|
||||
pgrep [-Vchilnsvx] pattern [file] ...
|
||||
pcregrep [-Vchilnsvx] pattern [file] ...
|
||||
|
||||
|
||||
|
||||
DESCRIPTION
|
||||
pgrep searches files for character patterns, in the same way
|
||||
as other grep commands do, but it uses the PCRE regular
|
||||
pcregrep searches files for character patterns, in the same
|
||||
way as other grep commands do, but it uses the PCRE regular
|
||||
expression library to support patterns that are compatible
|
||||
with the regular expressions of Perl 5. See pcre(3) for a
|
||||
full description of syntax and semantics.
|
||||
|
||||
If no files are specified, pgrep reads the standard input.
|
||||
By default, each line that matches the pattern is copied to
|
||||
the standard output, and if there is more than one file, the
|
||||
file name is printed before each line of output. However,
|
||||
there are options that can change how pgrep behaves.
|
||||
If no files are specified, pcregrep reads the standard
|
||||
input. By default, each line that matches the pattern is
|
||||
copied to the standard output, and if there is more than one
|
||||
file, the file name is printed before each line of output.
|
||||
However, there are options that can change how pcregrep
|
||||
behaves.
|
||||
|
||||
Lines are limited to BUFSIZ characters. BUFSIZ is defined in
|
||||
<stdio.h>. The newline character is removed from the end of
|
||||
@@ -82,5 +83,5 @@ DIAGNOSTICS
|
||||
|
||||
AUTHOR
|
||||
Philip Hazel <ph10@cam.ac.uk>
|
||||
Copyright (c) 1997-1999 University of Cambridge.
|
||||
Copyright (c) 1997-2000 University of Cambridge.
|
||||
|
||||
@@ -77,6 +77,14 @@ to the native function.
|
||||
The PCRE_MULTILINE option is set when the expression is passed for compilation
|
||||
to the native function.
|
||||
|
||||
In the absence of these flags, no options are passed to the native function.
|
||||
This means that the regex is compiled with PCRE default semantics. In
|
||||
particular, the way it handles newline characters in the subject string is the
|
||||
Perl way, not the POSIX way. Note that setting PCRE_MULTILINE has only
|
||||
\fIsome\fR of the effects specified for REG_NEWLINE. It does not affect the way
|
||||
newlines are matched by . (they aren't) or a negative class such as [^a] (they
|
||||
are).
|
||||
|
||||
The yield of \fBregcomp()\fR is zero on success, and non-zero otherwise. The
|
||||
\fIpreg\fR structure is filled in on success, and one member of the structure
|
||||
is publicized: \fIre_nsub\fR contains the number of capturing subpatterns in
|
||||
@@ -138,4 +146,4 @@ Cambridge CB2 3QG, England.
|
||||
.br
|
||||
Phone: +44 1223 334714
|
||||
|
||||
Copyright (c) 1997-1999 University of Cambridge.
|
||||
Copyright (c) 1997-2000 University of Cambridge.
|
||||
|
||||
@@ -107,6 +107,15 @@ The PCRE_MULTILINE option is set when the expression is passed for compilation
|
||||
to the native function.
|
||||
</P>
|
||||
<P>
|
||||
In the absence of these flags, no options are passed to the native function.
|
||||
This means that the regex is compiled with PCRE default semantics. In
|
||||
particular, the way it handles newline characters in the subject string is the
|
||||
Perl way, not the POSIX way. Note that setting PCRE_MULTILINE has only
|
||||
<I>some</I> of the effects specified for REG_NEWLINE. It does not affect the way
|
||||
newlines are matched by . (they aren't) or a negative class such as [^a] (they
|
||||
are).
|
||||
</P>
|
||||
<P>
|
||||
The yield of <B>regcomp()</B> is zero on success, and non-zero otherwise. The
|
||||
<I>preg</I> structure is filled in on success, and one member of the structure
|
||||
is publicized: <I>re_nsub</I> contains the number of capturing subpatterns in
|
||||
@@ -179,4 +188,4 @@ Cambridge CB2 3QG, England.
|
||||
Phone: +44 1223 334714
|
||||
</P>
|
||||
<P>
|
||||
Copyright (c) 1997-1999 University of Cambridge.
|
||||
Copyright (c) 1997-2000 University of Cambridge.
|
||||
|
||||
@@ -80,6 +80,15 @@ COMPILING A PATTERN
|
||||
The PCRE_MULTILINE option is set when the expression is
|
||||
passed for compilation to the native function.
|
||||
|
||||
In the absence of these flags, no options are passed to the
|
||||
native function. This means that the regex is compiled with
|
||||
PCRE default semantics. In particular, the way it handles
|
||||
newline characters in the subject string is the Perl way,
|
||||
not the POSIX way. Note that setting PCRE_MULTILINE has only
|
||||
some of the effects specified for REG_NEWLINE. It does not
|
||||
affect the way newlines are matched by . (they aren't) or a
|
||||
negative class such as [^a] (they are).
|
||||
|
||||
The yield of regcomp() is zero on success, and non-zero oth-
|
||||
erwise. The preg structure is filled in on success, and one
|
||||
member of the structure is publicized: re_nsub contains the
|
||||
@@ -147,4 +156,4 @@ AUTHOR
|
||||
Cambridge CB2 3QG, England.
|
||||
Phone: +44 1223 334714
|
||||
|
||||
Copyright (c) 1997-1999 University of Cambridge.
|
||||
Copyright (c) 1997-2000 University of Cambridge.
|
||||
|
||||
@@ -43,6 +43,10 @@ backslash, because
|
||||
is interpreted as the first line of a pattern that starts with "abc/", causing
|
||||
pcretest to read the next line as a continuation of the regular expression.
|
||||
|
||||
|
||||
PATTERN MODIFIERS
|
||||
-----------------
|
||||
|
||||
The pattern may be followed by i, m, s, or x to set the PCRE_CASELESS,
|
||||
PCRE_MULTILINE, PCRE_DOTALL, or PCRE_EXTENDED options, respectively. For
|
||||
example:
|
||||
@@ -103,37 +107,48 @@ compiled, and the results used when the expression is matched.
|
||||
The /M modifier causes the size of memory block used to hold the compiled
|
||||
pattern to be output.
|
||||
|
||||
Finally, the /P modifier causes pcretest to call PCRE via the POSIX wrapper API
|
||||
rather than its native API. When this is done, all other modifiers except /i,
|
||||
/m, and /+ are ignored. REG_ICASE is set if /i is present, and REG_NEWLINE is
|
||||
set if /m is present. The wrapper functions force PCRE_DOLLAR_ENDONLY always,
|
||||
and PCRE_DOTALL unless REG_NEWLINE is set.
|
||||
The /P modifier causes pcretest to call PCRE via the POSIX wrapper API rather
|
||||
than its native API. When this is done, all other modifiers except /i, /m, and
|
||||
/+ are ignored. REG_ICASE is set if /i is present, and REG_NEWLINE is set if /m
|
||||
is present. The wrapper functions force PCRE_DOLLAR_ENDONLY always, and
|
||||
PCRE_DOTALL unless REG_NEWLINE is set.
|
||||
|
||||
The /8 modifier causes pcretest to call PCRE with the PCRE_UTF8 option set.
|
||||
This turns on the (currently incomplete) support for UTF-8 character handling
|
||||
in PCRE, provided that it was compiled with this support enabled. This modifier
|
||||
also causes any non-printing characters in output strings to be printed using
|
||||
the \x{hh...} notation if they are valid UTF-8 sequences.
|
||||
|
||||
|
||||
DATA LINES
|
||||
----------
|
||||
|
||||
Before each data line is passed to pcre_exec(), leading and trailing whitespace
|
||||
is removed, and it is then scanned for \ escapes. The following are recognized:
|
||||
|
||||
\a alarm (= BEL)
|
||||
\b backspace
|
||||
\e escape
|
||||
\f formfeed
|
||||
\n newline
|
||||
\r carriage return
|
||||
\t tab
|
||||
\v vertical tab
|
||||
\nnn octal character (up to 3 octal digits)
|
||||
\xhh hexadecimal character (up to 2 hex digits)
|
||||
\a alarm (= BEL)
|
||||
\b backspace
|
||||
\e escape
|
||||
\f formfeed
|
||||
\n newline
|
||||
\r carriage return
|
||||
\t tab
|
||||
\v vertical tab
|
||||
\nnn octal character (up to 3 octal digits)
|
||||
\xhh hexadecimal character (up to 2 hex digits)
|
||||
\x{hh...} hexadecimal UTF-8 character
|
||||
|
||||
\A pass the PCRE_ANCHORED option to pcre_exec()
|
||||
\B pass the PCRE_NOTBOL option to pcre_exec()
|
||||
\Cdd call pcre_copy_substring() for substring dd after a successful match
|
||||
(any decimal number less than 32)
|
||||
\Gdd call pcre_get_substring() for substring dd after a successful match
|
||||
(any decimal number less than 32)
|
||||
\L call pcre_get_substring_list() after a successful match
|
||||
\N pass the PCRE_NOTEMPTY option to pcre_exec()
|
||||
\Odd set the size of the output vector passed to pcre_exec() to dd
|
||||
(any number of decimal digits)
|
||||
\Z pass the PCRE_NOTEOL option to pcre_exec()
|
||||
\A pass the PCRE_ANCHORED option to pcre_exec()
|
||||
\B pass the PCRE_NOTBOL option to pcre_exec()
|
||||
\Cdd call pcre_copy_substring() for substring dd after a successful
|
||||
match (any decimal number less than 32)
|
||||
\Gdd call pcre_get_substring() for substring dd after a successful
|
||||
match (any decimal number less than 32)
|
||||
\L call pcre_get_substring_list() after a successful match
|
||||
\N pass the PCRE_NOTEMPTY option to pcre_exec()
|
||||
\Odd set the size of the output vector passed to pcre_exec() to dd
|
||||
(any number of decimal digits)
|
||||
\Z pass the PCRE_NOTEOL option to pcre_exec()
|
||||
|
||||
A backslash followed by anything else just escapes the anything else. If the
|
||||
very last character is a backslash, it is ignored. This gives a way of passing
|
||||
@@ -143,6 +158,15 @@ If /P was present on the regex, causing the POSIX wrapper API to be used, only
|
||||
\B, and \Z have any effect, causing REG_NOTBOL and REG_NOTEOL to be passed to
|
||||
regexec() respectively.
|
||||
|
||||
The use of \x{hh...} to represent UTF-8 characters is not dependent on the use
|
||||
of the /8 modifier on the pattern. It is recognized always. There may be any
|
||||
number of hexadecimal digits inside the braces. The result is from one to six
|
||||
bytes, encoded according to the UTF-8 rules.
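
As an illustration of what "one to six bytes, encoded according to the UTF-8 rules" means, here is a minimal sketch of an encoder for the original one-to-six-byte UTF-8 scheme. The name utf8_encode and its interface are assumptions made for this example; this is not pcretest's own code, and pcretest's internals may differ.

```c
#include <stddef.h>

/* Hypothetical helper (not part of pcretest): encode a character value
   using the original 1-to-6-byte UTF-8 scheme. Returns the number of
   bytes written to out, or 0 if the value is above 0x7fffffff.
   The caller must supply a buffer of at least 6 bytes. */
static int utf8_encode(unsigned long cp, unsigned char *out)
{
    /* Largest value representable with 0..5 continuation bytes. */
    static const unsigned long limit[6] =
        { 0x7fUL, 0x7ffUL, 0xffffUL, 0x1fffffUL, 0x3ffffffUL, 0x7fffffffUL };
    /* Indicator bits for the first byte, indexed by continuation count. */
    static const unsigned char lead[6] =
        { 0x00, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc };
    int extra, i;

    for (extra = 0; extra < 6; extra++)
        if (cp <= limit[extra]) break;
    if (extra == 6) return 0;

    /* Fill continuation bytes from the end; each carries 6 low-order bits. */
    for (i = extra; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (cp & 0x3f));
        cp >>= 6;
    }
    /* Whatever bits remain go into the first byte. */
    out[0] = (unsigned char)(lead[extra] | cp);
    return extra + 1;
}
```

Under these assumptions, a data-line escape such as \x{e9} corresponds to the two bytes 0xc3 0xa9, and \x{20ac} to the three bytes 0xe2 0x82 0xac.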
|
||||
|
||||
|
||||
OUTPUT FROM PCRETEST
|
||||
--------------------
|
||||
|
||||
When a match succeeds, pcretest outputs the list of captured substrings that
|
||||
pcre_exec() returns, starting with number 0 for the string that matched the
|
||||
whole pattern. Here is an example of an interactive pcretest run.
|
||||
@@ -158,8 +182,9 @@ whole pattern. Here is an example of an interactive pcretest run.
|
||||
No match
|
||||
|
||||
If the strings contain any non-printing characters, they are output as \0x
|
||||
escapes. If the pattern has the /+ modifier, then the output for substring 0 is
|
||||
followed by the rest of the subject string, identified by "0+" like this:
|
||||
escapes, or as \x{...} escapes if the /8 modifier was present on the pattern.
|
||||
If the pattern has the /+ modifier, then the output for substring 0 is followed
|
||||
by the rest of the subject string, identified by "0+" like this:
|
||||
|
||||
re> /cat/+
|
||||
data> cataract
|
||||
@@ -190,6 +215,10 @@ Note that while patterns can be continued over several lines (a plain ">"
|
||||
prompt is used for continuations), data lines may not. However newlines can be
|
||||
included in data by means of the \n escape.
|
||||
|
||||
|
||||
COMMAND LINE OPTIONS
|
||||
--------------------
|
||||
|
||||
If the -p option is given to pcretest, it is equivalent to adding /P to each
|
||||
regular expression: the POSIX wrapper API is used to call PCRE. None of the
|
||||
following flags has any effect in this case.
|
||||
@@ -208,10 +237,10 @@ a synonym for -m.
|
||||
|
||||
If the -t option is given, each compile, study, and match is run 20000 times
|
||||
while being timed, and the resulting time per compile or match is output in
|
||||
milliseconds. Do not set -t with -s, because you will then get the size output
|
||||
milliseconds. Do not set -t with -m, because you will then get the size output
|
||||
20000 times and the timing will be distorted. If you want to change the number
|
||||
of repetitions used for timing, edit the definition of LOOPREPEAT at the top of
|
||||
pcretest.c
|
||||
|
||||
Philip Hazel <ph10@cam.ac.uk>
|
||||
January 2000
|
||||
August 2000
|
||||
|
||||
@@ -13,11 +13,17 @@ for perltest as well as for pcretest, and the special upper case modifiers such
|
||||
as /A that pcretest recognizes are not used in these files. The output should
|
||||
be identical, apart from the initial identifying banner.
|
||||
|
||||
For testing UTF-8 features, an alternative form of perltest, called perltest8,
|
||||
is supplied. This requires Perl 5.6 or higher. It recognizes the special
|
||||
modifier /8 that pcretest uses to invoke UTF-8 functionality. The testinput5
|
||||
file can be fed to perltest8.
|
||||
|
||||
The testinput2 and testinput4 files are not suitable for feeding to perltest,
|
||||
since they do make use of the special upper case modifiers and escapes that
|
||||
pcretest uses to test some features of PCRE. The first of these files also
|
||||
contains malformed regular expressions, in order to check that PCRE diagnoses
|
||||
them correctly.
|
||||
them correctly. Similarly, testinput6 tests UTF-8 features that do not relate
|
||||
to Perl.
|
||||
|
||||
Philip Hazel <ph10@cam.ac.uk>
|
||||
January 2000
|
||||
August 2000
|
||||
|
||||
@@ -9,7 +9,7 @@ the file Tech.Notes for some information on the internals.
|
||||
|
||||
Written by: Philip Hazel <ph10@cam.ac.uk>
|
||||
|
||||
Copyright (c) 1997-1999 University of Cambridge
|
||||
Copyright (c) 1997-2000 University of Cambridge
|
||||
|
||||
-----------------------------------------------------------------------------
|
||||
Permission is granted to anyone to use this software for any purpose on any
|
||||
@@ -143,6 +143,25 @@ return 0;
|
||||
|
||||
|
||||
|
||||
/*************************************************
|
||||
* Free store obtained by get_substring_list *
|
||||
*************************************************/
|
||||
|
||||
/* This function exists for the benefit of people calling PCRE from non-C
|
||||
programs that can call its functions, but not free() or (pcre_free)() directly.
|
||||
|
||||
Argument: the result of a previous pcre_get_substring_list()
|
||||
Returns: nothing
|
||||
*/
|
||||
|
||||
void
|
||||
pcre_free_substring_list(const char **pointer)
|
||||
{
|
||||
(pcre_free)((void *)pointer);
|
||||
}
|
||||
|
||||
|
||||
|
||||
/*************************************************
|
||||
* Copy captured string to new store *
|
||||
*************************************************/
|
||||
@@ -186,4 +205,23 @@ substring[yield] = 0;
|
||||
return yield;
|
||||
}
|
||||
|
||||
|
||||
|
||||
/*************************************************
|
||||
* Free store obtained by get_substring *
|
||||
*************************************************/
|
||||
|
||||
/* This function exists for the benefit of people calling PCRE from non-C
|
||||
programs that can call its functions, but not free() or (pcre_free)() directly.
|
||||
|
||||
Argument: the result of a previous pcre_get_substring()
|
||||
Returns: nothing
|
||||
*/
|
||||
|
||||
void
|
||||
pcre_free_substring(const char *pointer)
|
||||
{
|
||||
(pcre_free)((void *)pointer);
|
||||
}
|
||||
|
||||
/* End of get.c */
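
The two new free functions follow a common pattern: memory handed out by a library is released through the same library, so callers never need direct access to free() or to the library's allocator hook. The sketch below shows that pattern in isolation. Every name in it (lib_malloc, lib_free, lib_get_substring, lib_free_substring) is a hypothetical stand-in chosen for this example, not a PCRE identifier.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for a library's replaceable allocator hooks. */
static void *(*lib_malloc)(size_t) = malloc;
static void (*lib_free)(void *) = free;

/* Hypothetical getter: copies subject[start..end) into newly obtained
   store and returns it, or NULL if the allocation fails. */
static const char *lib_get_substring(const char *subject, int start, int end)
{
    size_t len = (size_t)(end - start);
    char *copy = (char *)(lib_malloc)(len + 1);
    if (copy == NULL) return NULL;
    memcpy(copy, subject + start, len);
    copy[len] = 0;
    return copy;
}

/* Matching release function: the caller hands the pointer back instead
   of calling free() itself, so the right deallocator is always used. */
static void lib_free_substring(const char *pointer)
{
    (lib_free)((void *)pointer);
}
```

A caller would obtain a string with lib_get_substring("cataract", 0, 3), use the resulting "cat", and then pass it to lib_free_substring(). This stays correct even if the allocator hooks have been replaced.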
|
||||
|
||||
@@ -109,7 +109,7 @@ time, run time or study time, respectively. */
|
||||
|
||||
#define PUBLIC_OPTIONS \
|
||||
(PCRE_CASELESS|PCRE_EXTENDED|PCRE_ANCHORED|PCRE_MULTILINE| \
|
||||
PCRE_DOTALL|PCRE_DOLLAR_ENDONLY|PCRE_EXTRA|PCRE_UNGREEDY)
|
||||
PCRE_DOTALL|PCRE_DOLLAR_ENDONLY|PCRE_EXTRA|PCRE_UNGREEDY|PCRE_UTF8)
|
||||
|
||||
#define PUBLIC_EXEC_OPTIONS \
|
||||
(PCRE_ANCHORED|PCRE_NOTBOL|PCRE_NOTEOL|PCRE_NOTEMPTY)
|
||||
@@ -278,6 +278,10 @@ just to accommodate the POSIX wrapper. */
|
||||
#define ERR29 "(?p must be followed by )"
|
||||
#define ERR30 "unknown POSIX class name"
|
||||
#define ERR31 "POSIX collating elements are not supported"
|
||||
#define ERR32 "this version of PCRE is not compiled with PCRE_UTF8 support"
|
||||
#define ERR33 "characters with values > 255 are not yet supported in classes"
|
||||
#define ERR34 "character value in \\x{...} sequence is too large"
|
||||
#define ERR35 "invalid condition (?(0)"
|
||||
|
||||
/* All character handling must be done as unsigned characters. Otherwise there
|
||||
are problems with top-bit-set characters and functions such as isspace().
|
||||
@@ -334,6 +338,7 @@ typedef struct match_data {
|
||||
BOOL offset_overflow; /* Set if too many extractions */
|
||||
BOOL notbol; /* NOTBOL flag */
|
||||
BOOL noteol; /* NOTEOL flag */
|
||||
BOOL utf8; /* UTF8 flag */
|
||||
BOOL endonly; /* Dollar not before final \n */
|
||||
BOOL notempty; /* Empty string match not wanted */
|
||||
const uschar *start_pattern; /* For use when recursing */
|
||||
|
||||
@@ -66,6 +66,16 @@ not be set greater than 200. */
|
||||
#define BRASTACK_SIZE 200
|
||||
|
||||
|
||||
/* The number of bytes in a literal character string above which we can't add
|
||||
any more is different when UTF-8 characters may be encountered. */
|
||||
|
||||
#ifdef SUPPORT_UTF8
|
||||
#define MAXLIT 250
|
||||
#else
|
||||
#define MAXLIT 255
|
||||
#endif
|
||||
|
||||
|
||||
/* Min and max values for the common repeats; for the maxima, 0 => infinity */
|
||||
|
||||
static const char rep_min[] = { 0, 0, 1, 1, 0, 0 };
|
||||
@@ -176,6 +186,64 @@ void (*pcre_free)(void *) = free;
|
||||
|
||||
|
||||
|
||||
/*************************************************
|
||||
* Macros and tables for character handling *
|
||||
*************************************************/
|
||||
|
||||
/* When UTF-8 encoding is being used, a character is no longer just a single
|
||||
byte. The macros for character handling generate simple sequences when used in
|
||||
byte-mode, and more complicated ones for UTF-8 characters. */
|
||||
|
||||
#ifndef SUPPORT_UTF8
|
||||
#define GETCHARINC(c, eptr) c = *eptr++;
|
||||
#define GETCHARLEN(c, eptr, len) c = *eptr;
|
||||
#define BACKCHAR(eptr)
|
||||
|
||||
#else /* SUPPORT_UTF8 */
|
||||
|
||||
/* Get the next UTF-8 character, advancing the pointer */
|
||||
|
||||
#define GETCHARINC(c, eptr) \
|
||||
c = *eptr++; \
|
||||
if (md->utf8 && (c & 0xc0) == 0xc0) \
|
||||
{ \
|
||||
int a = utf8_table4[c & 0x3f]; /* Number of additional bytes */ \
|
||||
int s = 6 - a; /* Amount to shift next byte */ \
|
||||
c &= utf8_table3[a]; /* Low order bits from first byte */ \
|
||||
while (a-- > 0) \
|
||||
{ \
|
||||
c |= (*eptr++ & 0x3f) << s; \
|
||||
s += 6; \
|
||||
} \
|
||||
}
|
||||
|
||||
/* Get the next UTF-8 character, not advancing the pointer, setting length */
|
||||
|
||||
#define GETCHARLEN(c, eptr, len) \
|
||||
c = *eptr; \
|
||||
len = 1; \
|
||||
if (md->utf8 && (c & 0xc0) == 0xc0) \
|
||||
{ \
|
||||
int i; \
|
||||
int a = utf8_table4[c & 0x3f]; /* Number of additional bytes */ \
|
||||
int s = 6 - a; /* Amount to shift next byte */ \
|
||||
c &= utf8_table3[a]; /* Low order bits from first byte */ \
|
||||
for (i = 1; i <= a; i++) \
|
||||
{ \
|
||||
c |= (eptr[i] & 0x3f) << s; \
|
||||
s += 6; \
|
||||
} \
|
||||
len += a; \
|
||||
}
|
||||
|
||||
/* If the pointer is not at the start of a character, move it back until
|
||||
it is. */
|
||||
|
||||
#define BACKCHAR(eptr) while((*eptr & 0xc0) == 0x80) eptr--;
|
||||
|
||||
#endif
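
The BACKCHAR idea can also be written as a function. The sketch below is an illustration, not code from this commit: the name utf8_back_to_start is hypothetical, and the explicit lower bound is an addition that the macro above does not have.

```c
#include <stddef.h>

/* Function form of the BACKCHAR idea: step back over continuation bytes
   (bit pattern 10xxxxxx) until the first byte of a character is reached.
   Assumes valid UTF-8 and that p points into the buffer that begins at
   start. The bound check against start is an addition in this sketch. */
static const unsigned char *utf8_back_to_start(const unsigned char *start,
                                               const unsigned char *p)
{
    while (p > start && (*p & 0xc0) == 0x80) p--;
    return p;
}
```

For the four bytes 'a', 0xc3, 0xa9, 'z', a pointer at the third byte (0xa9) is moved back to the second (0xc3), while a pointer at 'z' is left where it is.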
|
||||
|
||||
|
||||
|
||||
/*************************************************
|
||||
* Default character tables *
|
||||
@@ -191,6 +259,66 @@ tables. */
|
||||
|
||||
|
||||
|
||||
#ifdef SUPPORT_UTF8
|
||||
/*************************************************
|
||||
* Tables for UTF-8 support *
|
||||
*************************************************/
|
||||
|
||||
/* These are the breakpoints for different numbers of bytes in a UTF-8
|
||||
character. */
|
||||
|
||||
static int utf8_table1[] = { 0x7f, 0x7ff, 0xffff, 0x1fffff, 0x3ffffff, 0x7fffffff};
|
||||
|
||||
/* These are the indicator bits and the mask for the data bits to set in the
|
||||
first byte of a character, indexed by the number of additional bytes. */
|
||||
|
||||
static int utf8_table2[] = { 0, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc};
|
||||
static int utf8_table3[] = { 0xff, 0x1f, 0x0f, 0x07, 0x03, 0x01};
|
||||
|
||||
/* Table of the number of extra characters, indexed by the first character
|
||||
masked with 0x3f. The highest number for a valid UTF-8 character is in fact
|
||||
0x3d. */
|
||||
|
||||
static uschar utf8_table4[] = {
|
||||
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
|
||||
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
|
||||
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
|
||||
3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5 };
|
||||
|
||||
|
||||
/*************************************************
|
||||
* Convert character value to UTF-8 *
|
||||
*************************************************/
|
||||
|
||||
/* This function takes an integer value in the range 0 - 0x7fffffff
|
||||
and encodes it as a UTF-8 character in 1 to 6 bytes.
|
||||
|
||||
Arguments:
|
||||
cvalue the character value
|
||||
buffer pointer to buffer for result - at least 6 bytes long
|
||||
|
||||
Returns: number of bytes placed in the buffer
|
||||
*/
|
||||
|
||||
static int
|
||||
ord2utf8(int cvalue, uschar *buffer)
|
||||
{
|
||||
register int i, j;
|
||||
for (i = 0; i < sizeof(utf8_table1)/sizeof(int); i++)
|
||||
if (cvalue <= utf8_table1[i]) break;
|
||||
*buffer++ = utf8_table2[i] | (cvalue & utf8_table3[i]);
|
||||
cvalue >>= 6 - i;
|
||||
for (j = 0; j < i; j++)
|
||||
{
|
||||
*buffer++ = 0x80 | (cvalue & 0x3f);
|
||||
cvalue >>= 6;
|
||||
}
|
||||
return i + 1;
|
||||
}
|
||||
#endif
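
To make the role of utf8_table4 concrete, the following sketch wraps a copy of that table in a small lookup function. The wrapper name utf8_extra_bytes and the decision to return 0 for bytes that do not start a multi-byte character are assumptions of this example.

```c
/* Copy of the 64-entry table from the hunk above, indexed by the first
   byte of a character masked with 0x3f. */
static const unsigned char extra_bytes_table[64] = {
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
  2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
  3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5 };

/* Hypothetical wrapper: number of additional bytes that follow the given
   first byte. Returns 0 when the byte is not of the form 11xxxxxx, that
   is, for single-byte characters and for continuation bytes. */
static int utf8_extra_bytes(unsigned char first)
{
    if ((first & 0xc0) != 0xc0) return 0;
    return extra_bytes_table[first & 0x3f];
}
```

The test on the top two bits mirrors the (c & 0xc0) == 0xc0 check in the GETCHARINC and GETCHARLEN macros, which consult the table only for bytes that begin a multi-byte character.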
|
||||
|
||||
|
||||
|
||||
/*************************************************
|
||||
* Return version string *
|
||||
*************************************************/
|
||||
@@ -349,9 +477,9 @@ while (length-- > 0)
|
||||
|
||||
/* This function is called when a \ has been encountered. It either returns a
|
||||
positive value for a simple escape such as \n, or a negative value which
|
||||
encodes one of the more complicated things such as \d. On entry, ptr is
|
||||
pointing at the \. On exit, it is on the final character of the escape
|
||||
sequence.
|
||||
encodes one of the more complicated things such as \d. When UTF-8 is enabled,
|
||||
a positive value greater than 255 may be returned. On entry, ptr is pointing at
|
||||
the \. On exit, it is on the final character of the escape sequence.
|
||||
|
||||
Arguments:
|
||||
ptrptr points to the pattern position pointer
|
||||
@@ -373,7 +501,9 @@ check_escape(const uschar **ptrptr, const char **errorptr, int bracount,
|
||||
const uschar *ptr = *ptrptr;
|
||||
int c, i;
|
||||
|
||||
c = *(++ptr) & 255; /* Ensure > 0 on signed-char systems */
|
||||
/* If backslash is at the end of the pattern, it's an error. */
|
||||
|
||||
c = *(++ptr);
|
||||
if (c == 0) *errorptr = ERR1;
|
||||
|
||||
/* Digits or letters may have special meaning; all others are literals. */
|
||||
@@ -433,18 +563,46 @@ else
|
||||
}
|
||||
|
||||
/* \0 always starts an octal number, but we may drop through to here with a
|
||||
larger first octal digit */
|
||||
larger first octal digit. */
|
||||
|
||||
case '0':
|
||||
c -= '0';
|
||||
while(i++ < 2 && (cd->ctypes[ptr[1]] & ctype_digit) != 0 &&
|
||||
ptr[1] != '8' && ptr[1] != '9')
|
||||
c = c * 8 + *(++ptr) - '0';
|
||||
c &= 255; /* Take least significant 8 bits */
|
||||
break;
|
||||
|
||||
/* Special escapes not starting with a digit are straightforward */
|
||||
/* \x is complicated when UTF-8 is enabled. \x{ddd} is a character number
|
||||
which can be greater than 0xff, but only if the ddd are hex digits. */
|
||||
|
||||
case 'x':
|
||||
#ifdef SUPPORT_UTF8
|
||||
if (ptr[1] == '{' && (options & PCRE_UTF8) != 0)
|
||||
{
|
||||
const uschar *pt = ptr + 2;
|
||||
register int count = 0;
|
||||
c = 0;
|
||||
while ((cd->ctypes[*pt] & ctype_xdigit) != 0)
|
||||
{
|
||||
count++;
|
||||
c = c * 16 + cd->lcc[*pt] -
|
||||
(((cd->ctypes[*pt] & ctype_digit) != 0)? '0' : 'W');
|
||||
pt++;
|
||||
}
|
||||
if (*pt == '}')
|
||||
{
|
||||
if (c < 0 || count > 8) *errorptr = ERR34;
|
||||
ptr = pt;
|
||||
break;
|
||||
}
|
||||
/* If the sequence of hex digits does not end with '}', then we don't
|
||||
recognize this construct; fall through to the normal \x handling. */
|
||||
}
|
||||
#endif
|
||||
|
||||
/* Read just a single hex char */
|
||||
|
||||
c = 0;
|
||||
while (i++ < 2 && (cd->ctypes[ptr[1]] & ctype_xdigit) != 0)
|
||||
{
|
||||
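
The hex accumulation in the \x{...} branch subtracts either '0' or 'W' from the lower-cased digit. 'W' works because in ASCII it is 'a' - 10, so 'a' to 'f' map to 10 to 15. The stand-alone sketch below shows the same arithmetic. It assumes ASCII, uses the standard <ctype.h> functions instead of PCRE's tables, and the name parse_braced_hex is hypothetical. Unlike the code above, it reports a missing '}' as an error and does not fall through to plain \x handling.

```c
#include <ctype.h>

/* Hypothetical helper: parse the hex digits that follow "\x{" up to the
   closing '}'. Returns the value, or -1 if the sequence is unterminated
   or the value is too large (more than 8 digits, or above 0x7fffffff). */
static long parse_braced_hex(const char *s)
{
    unsigned long c = 0;
    int count = 0;

    while (isxdigit((unsigned char)*s)) {
        int ch = tolower((unsigned char)*s++);
        if (++count > 8) return -1;   /* too many digits: value too large */
        /* 'W' is 'a' - 10 in ASCII, so 'a'..'f' become 10..15. */
        c = c * 16 + (unsigned long)(ch - (isdigit(ch) ? '0' : 'W'));
    }
    if (*s != '}' || c > 0x7fffffffUL) return -1;
    return (long)c;
}
```

The check for more than 8 digits corresponds to the count > 8 test in the hunk, and the upper bound plays the role of the c < 0 test, which catches the same overflow when c is a 32-bit signed int.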
@@ -454,6 +612,8 @@ else
|
||||
}
|
||||
break;
|
||||
|
||||
/* Other special escapes not starting with a digit are straightforward */
|
||||
|
||||
case 'c':
|
||||
c = *(++ptr);
|
||||
if (c == 0)
|
||||
@@ -591,12 +751,13 @@ if the length is fixed. This is needed for dealing with backward assertions.
|
||||
|
||||
Arguments:
|
||||
code points to the start of the pattern (the bracket)
|
||||
options the compiling options
|
||||
|
||||
Returns: the fixed length, or -1 if there is no fixed length
|
||||
*/
|
||||
|
||||
static int
|
||||
find_fixedlength(uschar *code)
|
||||
find_fixedlength(uschar *code, int options)
|
||||
{
|
||||
int length = -1;
|
||||
|
||||
@@ -617,7 +778,7 @@ for (;;)
|
||||
case OP_BRA:
|
||||
case OP_ONCE:
|
||||
case OP_COND:
|
||||
d = find_fixedlength(cc);
|
||||
d = find_fixedlength(cc, options);
|
||||
if (d < 0) return -1;
|
||||
branchlength += d;
|
||||
do cc += (cc[1] << 8) + cc[2]; while (*cc == OP_ALT);
|
||||
@@ -671,10 +832,17 @@ for (;;)
|
||||
cc++;
|
||||
break;
|
||||
|
||||
/* Handle char strings */
|
||||
/* Handle char strings. In UTF-8 mode we must count characters, not bytes.
|
||||
This requires a scan of the string, unfortunately. We assume valid UTF-8
|
||||
strings, so all we do is reduce the length by one for each byte whose bits are
|
||||
10xxxxxx. */
|
||||
|
||||
case OP_CHARS:
|
||||
branchlength += *(++cc);
|
||||
#ifdef SUPPORT_UTF8
|
||||
for (d = 1; d <= *cc; d++)
|
||||
if ((cc[d] & 0xc0) == 0x80) branchlength--;
|
||||
#endif
|
||||
cc += *cc + 1;
|
||||
break;
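
The counting rule used for OP_CHARS (start from the byte length, then subtract one for every continuation byte) can be sketched on its own as follows. The function name utf8_char_count is hypothetical. Like the code above, the sketch assumes the input is valid UTF-8.

```c
#include <stddef.h>

/* Count characters in a valid UTF-8 byte string by starting from the
   byte count and subtracting one for each continuation byte (10xxxxxx). */
static size_t utf8_char_count(const unsigned char *s, size_t nbytes)
{
    size_t n = nbytes;
    size_t i;
    for (i = 0; i < nbytes; i++)
        if ((s[i] & 0xc0) == 0x80) n--;
    return n;
}
```

Every character contributes exactly one byte that is not a continuation byte, so the remaining count is the number of characters.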
|
||||
|
||||
@@ -1054,7 +1222,17 @@ for (;; ptr++)
|
||||
goto FAILED;
|
||||
}
|
||||
}
|
||||
/* Fall through if single character */
|
||||
|
||||
/* Fall through if single character, but don't at present allow
|
||||
chars > 255 in UTF-8 mode. */
|
||||
|
||||
#ifdef SUPPORT_UTF8
|
||||
if (c > 255)
|
||||
{
|
||||
*errorptr = ERR33;
|
||||
goto FAILED;
|
||||
}
|
||||
#endif
|
||||
}
|
||||
|
||||
/* A single character may be followed by '-' to form a range. However,
|
||||
@@ -1074,17 +1252,29 @@ for (;; ptr++)
|
||||
}
|
||||
|
||||
/* The second part of a range can be a single-character escape, but
|
||||
not any of the other escapes. */
|
||||
not any of the other escapes. Perl 5.6 treats a hyphen as a literal
|
||||
in such circumstances. */
|
||||
|
||||
if (d == '\\')
|
||||
{
|
||||
const uschar *oldptr = ptr;
|
||||
d = check_escape(&ptr, errorptr, *brackets, options, TRUE, cd);
|
||||
|
||||
#ifdef SUPPORT_UTF8
|
||||
if (d > 255)
|
||||
{
|
||||
*errorptr = ERR33;
|
||||
goto FAILED;
|
||||
}
|
||||
#endif
|
||||
/* \b is backspace; any other special means the '-' was literal */
|
||||
|
||||
if (d < 0)
|
||||
{
|
||||
if (d == -ESC_b) d = '\b'; else
|
||||
{
|
||||
*errorptr = ERR7;
|
||||
goto FAILED;
|
||||
ptr = oldptr - 2;
|
||||
goto SINGLE_CHARACTER; /* A few lines below */
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -1112,6 +1302,8 @@ for (;; ptr++)
|
||||
/* Handle a lone single character - we can get here for a normal
|
||||
non-escape char, or after \ that introduces a single character. */
|
||||
|
||||
SINGLE_CHARACTER:
|
||||
|
||||
class [c/8] |= (1 << (c&7));
|
||||
if ((options & PCRE_CASELESS) != 0)
|
||||
{
|
||||
@@ -1562,6 +1754,11 @@ for (;; ptr++)
|
||||
{
|
||||
condref = *ptr - '0';
|
||||
while (*(++ptr) != ')') condref = condref*10 + *ptr - '0';
|
||||
if (condref == 0)
|
||||
{
|
||||
*errorptr = ERR35;
|
||||
goto FAILED;
|
||||
}
|
||||
ptr++;
|
||||
}
|
||||
else ptr--;
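
The new check rejects a condition that refers to group 0. A stand-alone sketch of the same validation is shown below. The name parse_condition_ref, the -1 error return, and the 65535 cap are all assumptions of this example, not part of the commit.

```c
#include <ctype.h>

/* Hypothetical helper: parse the decimal reference that follows "(?("
   and must be terminated by ')'. Returns the reference number, or -1 if
   the text is malformed or the reference is 0 (the case now diagnosed). */
static int parse_condition_ref(const char *s)
{
    int ref = 0;
    if (!isdigit((unsigned char)*s)) return -1;
    while (isdigit((unsigned char)*s)) {
        ref = ref * 10 + (*s++ - '0');
        if (ref > 65535) return -1;   /* arbitrary cap to avoid overflow */
    }
    if (*s != ')' || ref == 0) return -1;
    return ref;
}
```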
|
||||
@@ -1829,6 +2026,20 @@ for (;; ptr++)
|
||||
tempptr = ptr;
|
||||
c = check_escape(&ptr, errorptr, *brackets, options, FALSE, cd);
|
||||
if (c < 0) { ptr = tempptr; break; }
|
||||
|
||||
/* If a character is > 127 in UTF-8 mode, we have to turn it into
|
||||
two or more characters in the UTF-8 encoding. */
|
||||
|
||||
#ifdef SUPPORT_UTF8
|
||||
if (c > 127 && (options & PCRE_UTF8) != 0)
|
||||
{
|
||||
uschar buffer[8];
|
||||
int len = ord2utf8(c, buffer);
|
||||
for (c = 0; c < len; c++) *code++ = buffer[c];
|
||||
length += len;
|
||||
continue;
|
||||
}
|
||||
#endif
|
||||
}
|
||||
|
||||
/* Ordinary character or single-char escape */
|
||||
@@ -1839,7 +2050,7 @@ for (;; ptr++)
|
||||
|
||||
/* This "while" is the end of the "do" above. */
|
||||
|
||||
while (length < 255 && (cd->ctypes[c = *(++ptr)] & ctype_meta) == 0);
|
||||
while (length < MAXLIT && (cd->ctypes[c = *(++ptr)] & ctype_meta) == 0);
|
||||
|
||||
/* Update the last character and the count of literals */
|
||||
|
||||
@@ -1851,7 +2062,7 @@ for (;; ptr++)
|
||||
the next state. */
|
||||
|
||||
previous[1] = length;
|
||||
if (length < 255) ptr--;
|
||||
if (length < MAXLIT) ptr--;
|
||||
break;
|
||||
}
|
||||
} /* end of big loop */
|
||||
@@ -1889,7 +2100,7 @@ Argument:
|
||||
ptrptr -> the address of the current pattern pointer
|
||||
errorptr -> pointer to error message
|
||||
lookbehind TRUE if this is a lookbehind assertion
|
||||
condref > 0 for OP_CREF setting at start of conditional group
|
||||
condref >= 0 for OP_CREF setting at start of conditional group
|
||||
reqchar -> place to put the last required character, or a negative number
|
||||
countlits -> place to put the shortest literal count of any branch
|
||||
cd points to the data block with tables pointers
|
||||
@@ -1917,7 +2128,7 @@ code += 3;
|
||||
/* At the start of a reference-based conditional group, insert the reference
|
||||
number as an OP_CREF item. */
|
||||
|
||||
if (condref > 0)
|
||||
if (condref >= 0)
|
||||
{
|
||||
*code++ = OP_CREF;
|
||||
*code++ = condref;
|
||||
@@ -1989,7 +2200,7 @@ for (;;)
|
||||
if (lookbehind)
|
||||
{
|
||||
*code = OP_END;
|
||||
length = find_fixedlength(last_branch);
|
||||
length = find_fixedlength(last_branch, options);
|
||||
DPRINTF(("fixed length = %d\n", length));
|
||||
if (length < 0)
|
||||
{
|
||||
@@ -2280,6 +2491,16 @@ uschar bralenstack[BRASTACK_SIZE];
|
||||
uschar *code_base, *code_end;
|
||||
#endif
|
||||
|
||||
/* Can't support UTF8 unless PCRE has been compiled to include the code. */
|
||||
|
||||
#ifndef SUPPORT_UTF8
|
||||
if ((options & PCRE_UTF8) != 0)
|
||||
{
|
||||
*errorptr = ERR32;
|
||||
return NULL;
|
||||
}
|
||||
#endif
|
||||
|
||||
/* We can't pass back an error message if errorptr is NULL; I guess the best we
|
||||
can do is just return NULL. */
|
||||
|
||||
@@ -2775,6 +2996,16 @@ while ((c = *(++ptr)) != 0)
|
||||
&compile_block);
|
||||
if (*errorptr != NULL) goto PCRE_ERROR_RETURN;
|
||||
if (c < 0) { ptr = saveptr; break; }
|
||||
|
||||
#ifdef SUPPORT_UTF8
|
||||
if (c > 127 && (options & PCRE_UTF8) != 0)
|
||||
{
|
||||
int i;
|
||||
for (i = 0; i < sizeof(utf8_table1)/sizeof(int); i++)
|
||||
if (c <= utf8_table1[i]) break;
|
||||
runlength += i;
|
||||
}
|
||||
#endif
|
||||
}
|
||||
|
||||
/* Ordinary character or single-char escape */
|
||||
@@ -2784,7 +3015,7 @@ while ((c = *(++ptr)) != 0)
|
||||
|
||||
/* This "while" is the end of the "do" above. */
|
||||
|
||||
while (runlength < 255 &&
|
||||
while (runlength < MAXLIT &&
|
||||
(compile_block.ctypes[c = *(++ptr)] & ctype_meta) == 0);
|
||||
|
||||
ptr--;
|
||||
@@ -3429,10 +3660,21 @@ for (;;)
|
||||
|
||||
/* Move the subject pointer back. This occurs only at the start of
|
||||
each branch of a lookbehind assertion. If we are too close to the start to
|
||||
move back, this match function fails. */
|
||||
move back, this match function fails. When working with UTF-8 we move
|
||||
back a number of characters, not bytes. */
|
||||
|
||||
case OP_REVERSE:
|
||||
#ifdef SUPPORT_UTF8
|
||||
c = (ecode[1] << 8) + ecode[2];
|
||||
for (i = 0; i < c; i++)
|
||||
{
|
||||
eptr--;
|
||||
BACKCHAR(eptr)
|
||||
}
|
||||
#else
|
||||
eptr -= (ecode[1] << 8) + ecode[2];
|
||||
#endif
|
||||
|
||||
if (eptr < md->start_subject) return FALSE;
|
||||
ecode += 3;
|
||||
break;
|
||||
@@ -3752,6 +3994,10 @@ for (;;)
|
||||
if ((ims & PCRE_DOTALL) == 0 && eptr < md->end_subject && *eptr == '\n')
|
||||
return FALSE;
|
||||
if (eptr++ >= md->end_subject) return FALSE;
|
||||
#ifdef SUPPORT_UTF8
|
||||
if (md->utf8)
|
||||
while (eptr < md->end_subject && (*eptr & 0xc0) == 0x80) eptr++;
|
||||
#endif
|
||||
ecode++;
|
||||
break;
|
||||
|
||||
@@ -3953,7 +4199,13 @@ for (;;)
|
||||
for (i = 1; i <= min; i++)
|
||||
{
|
||||
if (eptr >= md->end_subject) return FALSE;
|
||||
c = *eptr++;
|
||||
GETCHARINC(c, eptr) /* Get character; increment eptr */
|
||||
|
||||
#ifdef SUPPORT_UTF8
|
||||
/* We do not yet support class members > 255 */
|
||||
if (c > 255) return FALSE;
|
||||
#endif
|
||||
|
||||
if ((data[c/8] & (1 << (c&7))) != 0) continue;
|
||||
return FALSE;
|
||||
}
|
||||
@@ -3973,7 +4225,12 @@ for (;;)
|
||||
if (match(eptr, ecode, offset_top, md, ims, eptrb, 0))
|
||||
return TRUE;
|
||||
if (i >= max || eptr >= md->end_subject) return FALSE;
|
||||
c = *eptr++;
|
||||
GETCHARINC(c, eptr) /* Get character; increment eptr */
|
||||
|
||||
#ifdef SUPPORT_UTF8
|
||||
/* We do not yet support class members > 255 */
|
||||
if (c > 255) return FALSE;
|
||||
#endif
|
||||
if ((data[c/8] & (1 << (c&7))) != 0) continue;
|
||||
return FALSE;
|
||||
}
|
||||
@@ -3985,17 +4242,29 @@ for (;;)
|
||||
else
|
||||
{
|
||||
const uschar *pp = eptr;
|
||||
for (i = min; i < max; eptr++, i++)
|
||||
int len = 1;
|
||||
for (i = min; i < max; i++)
|
||||
{
|
||||
if (eptr >= md->end_subject) break;
|
||||
c = *eptr;
|
||||
if ((data[c/8] & (1 << (c&7))) != 0) continue;
|
||||
break;
|
||||
GETCHARLEN(c, eptr, len) /* Get character, set length if UTF-8 */
|
||||
|
||||
#ifdef SUPPORT_UTF8
|
||||
/* We do not yet support class members > 255 */
|
||||
if (c > 255) break;
|
||||
#endif
|
||||
if ((data[c/8] & (1 << (c&7))) == 0) break;
|
||||
eptr += len;
|
||||
}
|
||||
|
||||
while (eptr >= pp)
|
||||
{
|
||||
if (match(eptr--, ecode, offset_top, md, ims, eptrb, 0))
|
||||
return TRUE;
|
||||
|
||||
#ifdef SUPPORT_UTF8
|
||||
BACKCHAR(eptr)
|
||||
#endif
|
||||
}
|
||||
return FALSE;
|
||||
}
|
||||
}
|
||||
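The class-matching hunks above test membership with `data[c/8] & (1 << (c&7))` and, in UTF-8 mode, give up on any character value above 255. A minimal sketch of that 256-bit bitmap follows. The names `class_clear`, `class_add` and `class_has` are hypothetical and chosen for this illustration; they are not PCRE functions.

```c
#include <assert.h>
#include <string.h>

#define CLASS_BYTES 32   /* 256 bits, one per possible byte value */

/* Empty the class bitmap. */
static void class_clear(unsigned char *map)
{
  memset(map, 0, CLASS_BYTES);
}

/* Add character c (0..255) to the class: byte c/8, bit c&7. */
static void class_add(unsigned char *map, int c)
{
  assert(c >= 0 && c <= 255);
  map[c / 8] |= (unsigned char)(1 << (c & 7));
}

/* Test membership. Values outside 0..255 cannot be represented in the
   bitmap, so they are reported as non-members. */
static int class_has(const unsigned char *map, int c)
{
  if (c < 0 || c > 255) return 0;
  return (map[c / 8] & (1 << (c & 7))) != 0;
}
```

The range check in `class_has` plays the same role as the `if (c > 255) return FALSE;` lines in the diff: without it, a decoded multi-byte character would index past the end of the 32-byte map.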
@@ -4315,13 +4584,29 @@ for (;;)
|
||||
|
||||
/* First, ensure the minimum number of matches are present. Use inline
|
||||
code for maximizing the speed, and do the type test once at the start
|
||||
(i.e. keep it out of the loop). Also test that there are at least the
|
||||
minimum number of characters before we start. */
|
||||
(i.e. keep it out of the loop). Also we can test that there are at least
|
||||
the minimum number of bytes before we start, except when doing '.' in
|
||||
UTF8 mode. Leave the test in in all cases; in the special case we have
|
||||
to test after each character. */
|
||||
|
||||
if (min > md->end_subject - eptr) return FALSE;
|
||||
if (min > 0) switch(ctype)
|
||||
{
|
||||
case OP_ANY:
|
||||
#ifdef SUPPORT_UTF8
|
||||
if (md->utf8)
|
||||
{
|
||||
for (i = 1; i <= min; i++)
|
||||
{
|
||||
if (eptr >= md->end_subject ||
|
||||
(*eptr++ == '\n' && (ims & PCRE_DOTALL) == 0))
|
||||
return FALSE;
|
||||
while (eptr < md->end_subject && (*eptr & 0xc0) == 0x80) eptr++;
|
||||
}
|
||||
break;
|
||||
}
|
||||
#endif
|
||||
/* Non-UTF8 can be faster */
|
||||
if ((ims & PCRE_DOTALL) == 0)
|
||||
{ for (i = 1; i <= min; i++) if (*eptr++ == '\n') return FALSE; }
|
||||
else eptr += min;
|
||||
@@ -4379,6 +4664,10 @@ for (;;)
|
||||
{
|
||||
case OP_ANY:
|
||||
if ((ims & PCRE_DOTALL) == 0 && c == '\n') return FALSE;
|
||||
#ifdef SUPPORT_UTF8
|
||||
if (md->utf8)
|
||||
while (eptr < md->end_subject && (*eptr & 0xc0) == 0x80) eptr++;
|
||||
#endif
|
||||
break;
|
||||
|
||||
case OP_NOT_DIGIT:
|
||||
@@ -4418,6 +4707,33 @@ for (;;)
|
||||
switch(ctype)
|
||||
{
|
||||
case OP_ANY:
|
||||
|
||||
/* Special code is required for UTF8, but when the maximum is unlimited
|
||||
we don't need it. */
|
||||
|
||||
#ifdef SUPPORT_UTF8
|
||||
if (md->utf8 && max < INT_MAX)
|
||||
{
|
||||
if ((ims & PCRE_DOTALL) == 0)
|
||||
{
|
||||
for (i = min; i < max; i++)
|
||||
{
|
||||
if (eptr >= md->end_subject || *eptr++ == '\n') break;
|
||||
while (eptr < md->end_subject && (*eptr & 0xc0) == 0x80) eptr++;
|
||||
}
|
||||
}
|
||||
else
|
||||
{
|
||||
for (i = min; i < max; i++)
|
||||
{
|
||||
eptr++;
|
||||
while (eptr < md->end_subject && (*eptr & 0xc0) == 0x80) eptr++;
|
||||
}
|
||||
}
|
||||
break;
|
||||
}
|
||||
#endif
|
||||
/* Non-UTF8 can be faster */
|
||||
if ((ims & PCRE_DOTALL) == 0)
|
||||
{
|
||||
for (i = min; i < max; i++)
|
||||
@@ -4490,8 +4806,14 @@ for (;;)
|
||||
}
|
||||
|
||||
while (eptr >= pp)
|
||||
{
|
||||
if (match(eptr--, ecode, offset_top, md, ims, eptrb, 0))
|
||||
return TRUE;
|
||||
#ifdef SUPPORT_UTF8
|
||||
if (md->utf8)
|
||||
while (eptr > pp && (*eptr & 0xc0) == 0x80) eptr--;
|
||||
#endif
|
||||
}
|
||||
return FALSE;
|
||||
}
|
||||
/* Control never gets here */
|
||||
@@ -4572,6 +4894,7 @@ match_block.end_subject = match_block.start_subject + length;
|
||||
end_subject = match_block.end_subject;
|
||||
|
||||
match_block.endonly = (re->options & PCRE_DOLLAR_ENDONLY) != 0;
|
||||
match_block.utf8 = (re->options & PCRE_UTF8) != 0;
|
||||
|
||||
match_block.notbol = (options & PCRE_NOTBOL) != 0;
|
||||
match_block.noteol = (options & PCRE_NOTEOL) != 0;
|
||||
|
||||
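Several of the pcre.c hunks above move `eptr` by whole characters in UTF-8 mode by skipping bytes for which `(*eptr & 0xc0) == 0x80` holds, both forwards (for `.`) and backwards (for OP_REVERSE and when backtracking). The following is a small standalone sketch of that technique. The helper names `utf8_next`, `utf8_prev` and `utf8_count` are hypothetical and do not appear in PCRE, and the sketch assumes the input is already well-formed UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* A UTF-8 continuation byte has the bit pattern 10xxxxxx. */
static int is_continuation(unsigned char b)
{
  return (b & 0xc0) == 0x80;
}

/* Advance p by one character, never moving past end. */
static const unsigned char *utf8_next(const unsigned char *p,
                                      const unsigned char *end)
{
  assert(p != NULL && end != NULL);
  if (p < end) p++;
  while (p < end && is_continuation(*p)) p++;
  return p;
}

/* Step p back by one character, never moving before start. */
static const unsigned char *utf8_prev(const unsigned char *p,
                                      const unsigned char *start)
{
  assert(p != NULL && start != NULL);
  if (p > start) p--;
  while (p > start && is_continuation(*p)) p--;
  return p;
}

/* Count the characters (not bytes) in the len bytes starting at s. */
static size_t utf8_count(const unsigned char *s, size_t len)
{
  const unsigned char *end = s + len;
  size_t n = 0;
  while (s < end) { s = utf8_next(s, end); n++; }
  return n;
}
```

For the bytes `'a', 0xc4, 0x80, 'b'` (that is, `a`, U+0100, `b`), `utf8_count` sees three characters in four bytes, which is why the byte-count shortcut for `min` cannot be relied on for `.` in UTF-8 mode.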
@@ -4,14 +4,15 @@
|
||||
|
||||
/* Copyright (c) 1997-2000 University of Cambridge */
|
||||
|
||||
#ifndef PCRE_H
|
||||
#define PCRE_H
|
||||
#ifndef _PCRE_H
|
||||
#define _PCRE_H
|
||||
|
||||
/* The file pcre.h is build by "configure". Do not edit it; instead
|
||||
make changes to pcre.in. */
|
||||
|
||||
#define PCRE_MAJOR 3
|
||||
#define PCRE_MINOR 1
|
||||
#define PCRE_DATE 09-Feb-2000
|
||||
|
||||
#include "php_compat.h"
|
||||
#define PCRE_MINOR 4
|
||||
#define PCRE_DATE 22-Aug-2000
|
||||
|
||||
/* Win32 uses DLL by default */
|
||||
|
||||
@@ -28,7 +29,6 @@
|
||||
/* Have to include stdlib.h in order to ensure that size_t is defined;
|
||||
it is needed here for malloc. */
|
||||
|
||||
#include <sys/types.h>
|
||||
#include <stdlib.h>
|
||||
|
||||
/* Allow for C++ users */
|
||||
@@ -50,6 +50,7 @@ extern "C" {
|
||||
#define PCRE_NOTEOL 0x0100
|
||||
#define PCRE_UNGREEDY 0x0200
|
||||
#define PCRE_NOTEMPTY 0x0400
|
||||
#define PCRE_UTF8 0x0800
|
||||
|
||||
/* Exec-time and get-time error codes */
|
||||
|
||||
@@ -88,14 +89,16 @@ PCRE_DL_IMPORT extern void (*pcre_free)(void *);
|
||||
/* Functions */
|
||||
|
||||
extern pcre *pcre_compile(const char *, int, const char **, int *,
|
||||
const unsigned char *);
|
||||
extern int pcre_copy_substring(const char *, int *, int, int, char *, int);
|
||||
extern int pcre_exec(const pcre *, const pcre_extra *, const char *,
|
||||
int, int, int, int *, int);
|
||||
extern int pcre_get_substring(const char *, int *, int, int, const char **);
|
||||
extern int pcre_get_substring_list(const char *, int *, int, const char ***);
|
||||
extern int pcre_info(const pcre *, int *, int *);
|
||||
extern int pcre_fullinfo(const pcre *, const pcre_extra *, int, void *);
|
||||
const unsigned char *);
|
||||
extern int pcre_copy_substring(const char *, int *, int, int, char *, int);
|
||||
extern int pcre_exec(const pcre *, const pcre_extra *, const char *,
|
||||
int, int, int, int *, int);
|
||||
extern void pcre_free_substring(const char *);
|
||||
extern void pcre_free_substring_list(const char **);
|
||||
extern int pcre_get_substring(const char *, int *, int, int, const char **);
|
||||
extern int pcre_get_substring_list(const char *, int *, int, const char ***);
|
||||
extern int pcre_info(const pcre *, int *, int *);
|
||||
extern int pcre_fullinfo(const pcre *, const pcre_extra *, int, void *);
|
||||
extern unsigned const char *pcre_maketables(void);
|
||||
extern pcre_extra *pcre_study(const pcre *, int, const char **);
|
||||
extern const char *pcre_version(void);
|
||||
|
||||
@@ -1,7 +1,10 @@
|
||||
/*************************************************
|
||||
* PCRE grep program *
|
||||
* pcregrep program *
|
||||
*************************************************/
|
||||
|
||||
/* This is a grep program that uses the PCRE regular expression library to do
|
||||
its pattern matching. */
|
||||
|
||||
#include <stdio.h>
|
||||
#include <string.h>
|
||||
#include <stdlib.h>
|
||||
@@ -59,7 +62,7 @@ return sys_errlist[n];
|
||||
*************************************************/
|
||||
|
||||
static int
|
||||
pgrep(FILE *in, char *name)
|
||||
pcregrep(FILE *in, char *name)
|
||||
{
|
||||
int rc = 1;
|
||||
int linenumber = 0;
|
||||
@@ -119,7 +122,7 @@ return rc;
|
||||
static int
|
||||
usage(int rc)
|
||||
{
|
||||
fprintf(stderr, "Usage: pgrep [-Vchilnsvx] pattern [file] ...\n");
|
||||
fprintf(stderr, "Usage: pcregrep [-Vchilnsvx] pattern [file] ...\n");
|
||||
return rc;
|
||||
}
|
||||
|
||||
@@ -165,7 +168,7 @@ for (i = 1; i < argc; i++)
|
||||
break;
|
||||
|
||||
default:
|
||||
fprintf(stderr, "pgrep: unknown option %c\n", s[-1]);
|
||||
fprintf(stderr, "pcregrep: unknown option %c\n", s[-1]);
|
||||
return usage(2);
|
||||
}
|
||||
}
|
||||
@@ -180,7 +183,7 @@ if (i >= argc) return usage(0);
|
||||
pattern = pcre_compile(argv[i++], options, &error, &errptr, NULL);
|
||||
if (pattern == NULL)
|
||||
{
|
||||
fprintf(stderr, "pgrep: error in regex at offset %d: %s\n", errptr, error);
|
||||
fprintf(stderr, "pcregrep: error in regex at offset %d: %s\n", errptr, error);
|
||||
return 2;
|
||||
}
|
||||
|
||||
@@ -189,13 +192,13 @@ if (pattern == NULL)
|
||||
hints = pcre_study(pattern, 0, &error);
|
||||
if (error != NULL)
|
||||
{
|
||||
fprintf(stderr, "pgrep: error while studing regex: %s\n", error);
|
||||
fprintf(stderr, "pcregrep: error while studing regex: %s\n", error);
|
||||
return 2;
|
||||
}
|
||||
|
||||
/* If there are no further arguments, do the business on stdin and exit */
|
||||
|
||||
if (i >= argc) return pgrep(stdin, NULL);
|
||||
if (i >= argc) return pcregrep(stdin, NULL);
|
||||
|
||||
/* Otherwise, work through the remaining arguments as files. If there is only
|
||||
one, don't give its name on the output. */
|
||||
@@ -213,7 +216,7 @@ for (; i < argc; i++)
|
||||
}
|
||||
else
|
||||
{
|
||||
int frc = pgrep(in, filenames? argv[i] : NULL);
|
||||
int frc = pcregrep(in, filenames? argv[i] : NULL);
|
||||
if (frc == 0 && rc == 1) rc = 0;
|
||||
fclose(in);
|
||||
}
|
||||
@@ -80,7 +80,11 @@ static int eint[] = {
|
||||
REG_BADPAT, /* "assertion expected after (?(" */
|
||||
REG_BADPAT, /* "(?p must be followed by )" */
|
||||
REG_ECTYPE, /* "unknown POSIX class name" */
|
||||
REG_BADPAT /* "POSIX collating elements are not supported" */
|
||||
REG_BADPAT, /* "POSIX collating elements are not supported" */
|
||||
REG_INVARG, /* "this version of PCRE is not compiled with PCRE_UTF8 support" */
|
||||
REG_BADPAT, /* "characters with values > 255 are not yet supported in classes" */
|
||||
REG_BADPAT, /* "character value in \x{...} sequence is too large" */
|
||||
REG_BADPAT /* "invalid condition (?(0)" */
|
||||
};
|
||||
|
||||
/* Table of texts corresponding to POSIX error codes */
|
||||
|
||||
@@ -4,8 +4,8 @@
|
||||
|
||||
/* Copyright (c) 1997-2000 University of Cambridge */
|
||||
|
||||
#ifndef PCREPOSIX_H
|
||||
#define PCREPOSIX_H
|
||||
#ifndef _PCREPOSIX_H
|
||||
#define _PCREPOSIX_H
|
||||
|
||||
/* This is the header for the POSIX wrapper interface to the PCRE Perl-
|
||||
Compatible Regular Expression library. It defines the things POSIX says should
|
||||
|
||||
@@ -38,6 +38,113 @@ static size_t gotten_store;
|
||||
|
||||
|
||||
|
||||
static int utf8_table1[] = {
  0x0000007f, 0x000007ff, 0x0000ffff, 0x001fffff, 0x03ffffff, 0x7fffffff};

static int utf8_table2[] = {
  0, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc};

static int utf8_table3[] = {
  0xff, 0x1f, 0x0f, 0x07, 0x03, 0x01};


/*************************************************
*       Convert character value to UTF-8         *
*************************************************/

/* This function takes an integer value in the range 0 - 0x7fffffff
and encodes it as a UTF-8 character in 0 to 6 bytes.

Arguments:
  cvalue     the character value
  buffer     pointer to buffer for result - at least 6 bytes long

Returns:     number of characters placed in the buffer
             -1 if input character is negative
             0 if input character is positive but too big (only when
             int is longer than 32 bits)
*/

static int
ord2utf8(int cvalue, unsigned char *buffer)
{
register int i, j;
for (i = 0; i < sizeof(utf8_table1)/sizeof(int); i++)
  if (cvalue <= utf8_table1[i]) break;
if (i >= sizeof(utf8_table1)/sizeof(int)) return 0;
if (cvalue < 0) return -1;
*buffer++ = utf8_table2[i] | (cvalue & utf8_table3[i]);
cvalue >>= 6 - i;
for (j = 0; j < i; j++)
  {
  *buffer++ = 0x80 | (cvalue & 0x3f);
  cvalue >>= 6;
  }
return i + 1;
}


/*************************************************
*          Convert UTF-8 string to value         *
*************************************************/

/* This function takes one or more bytes that represents a UTF-8 character,
and returns the value of the character.

Argument:
  buffer   a pointer to the byte vector
  vptr     a pointer to an int to receive the value

Returns:   >  0 => the number of bytes consumed
           -6 to 0 => malformed UTF-8 character at offset = (-return)
*/

int
utf82ord(unsigned char *buffer, int *vptr)
{
int c = *buffer++;
int d = c;
int i, j, s;

for (i = -1; i < 6; i++)               /* i is number of additional bytes */
  {
  if ((d & 0x80) == 0) break;
  d <<= 1;
  }

if (i == -1) { *vptr = c; return 1; }  /* ascii character */
if (i == 0 || i == 6) return 0;        /* invalid UTF-8 */

/* i now has a value in the range 1-5 */

d = c & utf8_table3[i];
s = 6 - i;

for (j = 0; j < i; j++)
  {
  c = *buffer++;
  if ((c & 0xc0) != 0x80) return -(j+1);
  d |= (c & 0x3f) << s;
  s += 6;
  }

/* Check that encoding was the correct unique one */

for (j = 0; j < sizeof(utf8_table1)/sizeof(int); j++)
  if (d <= utf8_table1[j]) break;
if (j != i) return -(i+1);

/* Valid value */

*vptr = d;
return i+1;
}
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
/* Debugging function to print the internal form of the regex. This is the same
|
||||
code as contained in pcre.c under the DEBUG macro. */
|
||||
|
||||
@@ -265,14 +372,31 @@ for(;;)
|
||||
|
||||
|
||||
|
||||
/* Character string printing function. */
|
||||
/* Character string printing function. A "normal" and a UTF-8 version. */
|
||||
|
||||
static void pchars(unsigned char *p, int length)
|
||||
static void pchars(unsigned char *p, int length, int utf8)
|
||||
{
|
||||
int c;
|
||||
while (length-- > 0)
|
||||
{
|
||||
if (utf8)
|
||||
{
|
||||
int rc = utf82ord(p, &c);
|
||||
if (rc > 0)
|
||||
{
|
||||
length -= rc - 1;
|
||||
p += rc;
|
||||
if (c < 256 && isprint(c)) fprintf(outfile, "%c", c);
|
||||
else fprintf(outfile, "\\x{%02x}", c);
|
||||
continue;
|
||||
}
|
||||
}
|
||||
|
||||
/* Not UTF-8, or malformed UTF-8 */
|
||||
|
||||
if (isprint(c = *(p++))) fprintf(outfile, "%c", c);
|
||||
else fprintf(outfile, "\\x%02x", c);
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
@@ -403,6 +527,7 @@ while (!done)
|
||||
int do_g = 0;
|
||||
int do_showinfo = showinfo;
|
||||
int do_showrest = 0;
|
||||
int utf8 = 0;
|
||||
int erroroffset, len, delimiter;
|
||||
|
||||
if (infile == stdin) printf(" re> ");
|
||||
@@ -494,6 +619,7 @@ while (!done)
|
||||
case 'S': do_study = 1; break;
|
||||
case 'U': options |= PCRE_UNGREEDY; break;
|
||||
case 'X': options |= PCRE_EXTRA; break;
|
||||
case '8': options |= PCRE_UTF8; utf8 = 1; break;
|
||||
|
||||
case 'L':
|
||||
ppp = pp;
|
||||
@@ -633,7 +759,7 @@ while (!done)
|
||||
if (backrefmax > 0)
|
||||
fprintf(outfile, "Max back reference = %d\n", backrefmax);
|
||||
if (options == 0) fprintf(outfile, "No options\n");
|
||||
else fprintf(outfile, "Options:%s%s%s%s%s%s%s%s\n",
|
||||
else fprintf(outfile, "Options:%s%s%s%s%s%s%s%s%s\n",
|
||||
((options & PCRE_ANCHORED) != 0)? " anchored" : "",
|
||||
((options & PCRE_CASELESS) != 0)? " caseless" : "",
|
||||
((options & PCRE_EXTENDED) != 0)? " extended" : "",
|
||||
@@ -641,7 +767,8 @@ while (!done)
|
||||
((options & PCRE_DOTALL) != 0)? " dotall" : "",
|
||||
((options & PCRE_DOLLAR_ENDONLY) != 0)? " dollar_endonly" : "",
|
||||
((options & PCRE_EXTRA) != 0)? " extra" : "",
|
||||
((options & PCRE_UNGREEDY) != 0)? " ungreedy" : "");
|
||||
((options & PCRE_UNGREEDY) != 0)? " ungreedy" : "",
|
||||
((options & PCRE_UTF8) != 0)? " utf8" : "");
|
||||
|
||||
if (((((real_pcre *)re)->options) & PCRE_ICHANGED) != 0)
|
||||
fprintf(outfile, "Case state changes\n");
|
||||
@@ -796,6 +923,30 @@ while (!done)
|
||||
break;
|
||||
|
||||
case 'x':
|
||||
|
||||
/* Handle \x{..} specially - new Perl thing for utf8 */
|
||||
|
||||
if (*p == '{')
|
||||
{
|
||||
unsigned char *pt = p;
|
||||
c = 0;
|
||||
while (isxdigit(*(++pt)))
|
||||
c = c * 16 + tolower(*pt) - ((isdigit(*pt))? '0' : 'W');
|
||||
if (*pt == '}')
|
||||
{
|
||||
unsigned char buffer[8];
|
||||
int ii, utn;
|
||||
utn = ord2utf8(c, buffer);
|
||||
for (ii = 0; ii < utn - 1; ii++) *q++ = buffer[ii];
|
||||
c = buffer[ii]; /* Last byte */
|
||||
p = pt + 1;
|
||||
break;
|
||||
}
|
||||
/* Not correct form; fall through */
|
||||
}
|
||||
|
||||
/* Ordinary \x */
|
||||
|
||||
c = 0;
|
||||
while (i++ < 2 && isxdigit(*p))
|
||||
{
|
||||
@@ -876,12 +1027,12 @@ while (!done)
|
||||
{
|
||||
fprintf(outfile, "%2d: ", (int)i);
|
||||
pchars(dbuffer + pmatch[i].rm_so,
|
||||
pmatch[i].rm_eo - pmatch[i].rm_so);
|
||||
pmatch[i].rm_eo - pmatch[i].rm_so, utf8);
|
||||
fprintf(outfile, "\n");
|
||||
if (i == 0 && do_showrest)
|
||||
{
|
||||
fprintf(outfile, " 0+ ");
|
||||
pchars(dbuffer + pmatch[i].rm_eo, len - pmatch[i].rm_eo);
|
||||
pchars(dbuffer + pmatch[i].rm_eo, len - pmatch[i].rm_eo, utf8);
|
||||
fprintf(outfile, "\n");
|
||||
}
|
||||
}
|
||||
@@ -931,14 +1082,14 @@ while (!done)
|
||||
else
|
||||
{
|
||||
fprintf(outfile, "%2d: ", i/2);
|
||||
pchars(bptr + offsets[i], offsets[i+1] - offsets[i]);
|
||||
pchars(bptr + offsets[i], offsets[i+1] - offsets[i], utf8);
|
||||
fprintf(outfile, "\n");
|
||||
if (i == 0)
|
||||
{
|
||||
if (do_showrest)
|
||||
{
|
||||
fprintf(outfile, " 0+ ");
|
||||
pchars(bptr + offsets[i+1], len - offsets[i+1]);
|
||||
pchars(bptr + offsets[i+1], len - offsets[i+1], utf8);
|
||||
fprintf(outfile, "\n");
|
||||
}
|
||||
}
|
||||
@@ -971,7 +1122,8 @@ while (!done)
|
||||
else
|
||||
{
|
||||
fprintf(outfile, "%2dG %s (%d)\n", i, substring, rc);
|
||||
free((void *)substring);
|
||||
/* free((void *)substring); */
|
||||
pcre_free_substring(substring);
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -989,7 +1141,8 @@ while (!done)
|
||||
fprintf(outfile, "%2dL %s\n", i, stringlist[i]);
|
||||
if (stringlist[i] != NULL)
|
||||
fprintf(outfile, "string list not terminated by NULL\n");
|
||||
free((void *)stringlist);
|
||||
/* free((void *)stringlist); */
|
||||
pcre_free_substring_list(stringlist);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
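The pcretest.c changes above add `ord2utf8` and `utf82ord` so that `\x{...}` in test data can be turned into bytes and matched substrings can be printed as character values. For comparison, here is an independent sketch of the conventional UTF-8 layout, in which the lead byte carries the most significant bits and each continuation byte carries the next six. It is not a copy of the functions in the diff: `encode_utf8`, `decode_utf8`, `max_for_len` and `lead_marker` are hypothetical names, the 6-byte range up to 0x7fffffff is kept only to match the tables in the diff, and the decoder's overlong-form check is an assumption about what a strict decoder should do.

```c
#include <assert.h>
#include <stddef.h>

/* Largest value that fits in 1..6 bytes, and the lead-byte marker bits
   for each length. */
static const int max_for_len[] = {
  0x7f, 0x7ff, 0xffff, 0x1fffff, 0x3ffffff, 0x7fffffff };
static const int lead_marker[] = { 0, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc };

/* Encode cvalue into buffer (at least 6 bytes). Returns the number of bytes
   written, -1 for a negative value, 0 if the value is too large. */
static int encode_utf8(int cvalue, unsigned char *buffer)
{
  int extra, j;
  assert(buffer != NULL);
  if (cvalue < 0) return -1;
  for (extra = 0; extra < 6; extra++)
    if (cvalue <= max_for_len[extra]) break;
  if (extra == 6) return 0;
  /* Fill the continuation bytes from the last one backwards. */
  for (j = extra; j > 0; j--)
    {
    buffer[j] = (unsigned char)(0x80 | (cvalue & 0x3f));
    cvalue >>= 6;
    }
  buffer[0] = (unsigned char)(lead_marker[extra] | cvalue);
  return extra + 1;
}

/* Decode one character from at most avail bytes. Returns the number of
   bytes consumed, or 0 if the sequence is malformed or overlong. */
static int decode_utf8(const unsigned char *buffer, int avail, int *vptr)
{
  int c, extra, j, value;
  assert(buffer != NULL && vptr != NULL);
  if (avail <= 0) return 0;
  c = buffer[0];
  if (c < 0x80) { *vptr = c; return 1; }
  if (c < 0xc0 || c > 0xfd) return 0;    /* stray continuation or bad lead */
  for (extra = 1; extra < 5; extra++)    /* number of continuation bytes */
    if (c < lead_marker[extra + 1]) break;
  if (avail < extra + 1) return 0;
  value = c & (0x3f >> extra);           /* payload bits of the lead byte */
  for (j = 1; j <= extra; j++)
    {
    c = buffer[j];
    if ((c & 0xc0) != 0x80) return 0;
    value = (value << 6) | (c & 0x3f);
    }
  if (value <= max_for_len[extra - 1]) return 0;   /* overlong form */
  *vptr = value;
  return extra + 1;
}
```

With this layout U+0100 becomes the two bytes `0xc4 0x80` and U+1234 becomes `0xe1 0x88 0xb4`, while a sequence such as `0xc0 0x88` is rejected as an overlong encoding of a one-byte value.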
@@ -9,7 +9,7 @@
|
||||
sub pchars {
|
||||
my($t) = "";
|
||||
|
||||
foreach $c (split(//, @_[0]))
|
||||
foreach $c (split(//, $_[0]))
|
||||
{
|
||||
if (ord $c >= 32 && ord $c < 127) { $t .= $c; }
|
||||
else { $t .= sprintf("\\x%02x", ord $c); }
|
||||
|
||||
208
ext/pcre/pcrelib/perltest8
Executable file
@@ -0,0 +1,208 @@
|
||||
#! /usr/bin/perl
|
||||
|
||||
# Program for testing regular expressions with perl to check that PCRE handles
|
||||
# them the same. This is the version that supports /8 for UTF-8 testing. It
|
||||
# requires at least Perl 5.6.
|
||||
|
||||
|
||||
# Function for turning a string into a string of printing chars. There are
|
||||
# currently problems with UTF-8 strings; this fudges round them.
|
||||
|
||||
sub pchars {
|
||||
my($t) = "";
|
||||
|
||||
if ($utf8)
|
||||
{
|
||||
use utf8;
|
||||
@p = unpack('U*', $_[0]);
|
||||
foreach $c (@p)
|
||||
{
|
||||
if ($c >= 32 && $c < 127) { $t .= chr $c; }
|
||||
else { $t .= sprintf("\\x{%02x}", $c); }
|
||||
}
|
||||
}
|
||||
|
||||
else
|
||||
{
|
||||
foreach $c (split(//, $_[0]))
|
||||
{
|
||||
if (ord $c >= 32 && ord $c < 127) { $t .= $c; }
|
||||
else { $t .= sprintf("\\x%02x", ord $c); }
|
||||
}
|
||||
}
|
||||
|
||||
$t;
|
||||
}
|
||||
|
||||
|
||||
|
||||
# Read lines from named file or stdin and write to named file or stdout; lines
|
||||
# consist of a regular expression, in delimiters and optionally followed by
|
||||
# options, followed by a set of test data, terminated by an empty line.
|
||||
|
||||
# Sort out the input and output files
|
||||
|
||||
if (@ARGV > 0)
|
||||
{
|
||||
open(INFILE, "<$ARGV[0]") || die "Failed to open $ARGV[0]\n";
|
||||
$infile = "INFILE";
|
||||
}
|
||||
else { $infile = "STDIN"; }
|
||||
|
||||
if (@ARGV > 1)
|
||||
{
|
||||
open(OUTFILE, ">$ARGV[1]") || die "Failed to open $ARGV[1]\n";
|
||||
$outfile = "OUTFILE";
|
||||
}
|
||||
else { $outfile = "STDOUT"; }
|
||||
|
||||
printf($outfile "Perl $] Regular Expressions\n\n");
|
||||
|
||||
# Main loop
|
||||
|
||||
NEXT_RE:
|
||||
for (;;)
|
||||
{
|
||||
printf " re> " if $infile eq "STDIN";
|
||||
last if ! ($_ = <$infile>);
|
||||
printf $outfile "$_" if $infile ne "STDIN";
|
||||
next if ($_ eq "");
|
||||
|
||||
$pattern = $_;
|
||||
|
||||
while ($pattern !~ /^\s*(.).*\1/s)
|
||||
{
|
||||
printf " > " if $infile eq "STDIN";
|
||||
last if ! ($_ = <$infile>);
|
||||
printf $outfile "$_" if $infile ne "STDIN";
|
||||
$pattern .= $_;
|
||||
}
|
||||
|
||||
chomp($pattern);
|
||||
$pattern =~ s/\s+$//;
|
||||
|
||||
# The private /+ modifier means "print $' afterwards".
|
||||
|
||||
$showrest = ($pattern =~ s/\+(?=[a-z]*$)//);
|
||||
|
||||
# The private /8 modifier means "operate in UTF-8". Currently, Perl
|
||||
# has bugs that we try to work around using this flag.
|
||||
|
||||
$utf8 = ($pattern =~ s/8(?=[a-z]*$)//);
|
||||
|
||||
# Check that the pattern is valid
|
||||
|
||||
if ($utf8)
|
||||
{
|
||||
use utf8;
|
||||
eval "\$_ =~ ${pattern}";
|
||||
}
|
||||
else
|
||||
{
|
||||
eval "\$_ =~ ${pattern}";
|
||||
}
|
||||
|
||||
if ($@)
|
||||
{
|
||||
printf $outfile "Error: $@";
|
||||
next NEXT_RE;
|
||||
}
|
||||
|
||||
# If the /g modifier is present, we want to put a loop round the matching;
|
||||
# otherwise just a single "if".
|
||||
|
||||
$cmd = ($pattern =~ /g[a-z]*$/)? "while" : "if";
|
||||
|
||||
# If the pattern is actually the null string, Perl uses the most recently
|
||||
# executed (and successfully compiled) regex is used instead. This is a
|
||||
# nasty trap for the unwary! The PCRE test suite does contain null strings
|
||||
# in places - if they are allowed through here all sorts of weird and
|
||||
# unexpected effects happen. To avoid this, we replace such patterns with
|
||||
# a non-null pattern that has the same effect.
|
||||
|
||||
$pattern = "/(?#)/$2" if ($pattern =~ /^(.)\1(.*)$/);
|
||||
|
||||
# Read data lines and test them
|
||||
|
||||
for (;;)
|
||||
{
|
||||
printf "data> " if $infile eq "STDIN";
|
||||
last NEXT_RE if ! ($_ = <$infile>);
|
||||
chomp;
|
||||
printf $outfile "$_\n" if $infile ne "STDIN";
|
||||
|
||||
s/\s+$//;
|
||||
s/^\s+//;
|
||||
|
||||
last if ($_ eq "");
|
||||
|
||||
$x = eval "\"$_\""; # To get escapes processed
|
||||
|
||||
# Empty array for holding results, then do the matching.
|
||||
|
||||
@subs = ();
|
||||
|
||||
$pushes = "push \@subs,\$&;" .
|
||||
"push \@subs,\$1;" .
|
||||
"push \@subs,\$2;" .
|
||||
"push \@subs,\$3;" .
|
||||
"push \@subs,\$4;" .
|
||||
"push \@subs,\$5;" .
|
||||
"push \@subs,\$6;" .
|
||||
"push \@subs,\$7;" .
|
||||
"push \@subs,\$8;" .
|
||||
"push \@subs,\$9;" .
|
||||
"push \@subs,\$10;" .
|
||||
"push \@subs,\$11;" .
|
||||
"push \@subs,\$12;" .
|
||||
"push \@subs,\$13;" .
|
||||
"push \@subs,\$14;" .
|
||||
"push \@subs,\$15;" .
|
||||
"push \@subs,\$16;" .
|
||||
"push \@subs,\$'; }";
|
||||
|
||||
if ($utf8)
|
||||
{
|
||||
use utf8;
|
||||
eval "${cmd} (\$x =~ ${pattern}) {" . $pushes;
|
||||
}
|
||||
else
|
||||
{
|
||||
eval "${cmd} (\$x =~ ${pattern}) {" . $pushes;
|
||||
}
|
||||
|
||||
if ($@)
|
||||
{
|
||||
printf $outfile "Error: $@\n";
|
||||
next NEXT_RE;
|
||||
}
|
||||
elsif (scalar(@subs) == 0)
|
||||
{
|
||||
printf $outfile "No match\n";
|
||||
}
|
||||
else
|
||||
{
|
||||
while (scalar(@subs) != 0)
|
||||
{
|
||||
printf $outfile (" 0: %s\n", &pchars($subs[0]));
|
||||
printf $outfile (" 0+ %s\n", &pchars($subs[17])) if $showrest;
|
||||
$last_printed = 0;
|
||||
for ($i = 1; $i <= 16; $i++)
|
||||
{
|
||||
if (defined $subs[$i])
|
||||
{
|
||||
while ($last_printed++ < $i-1)
|
||||
{ printf $outfile ("%2d: <unset>\n", $last_printed); }
|
||||
printf $outfile ("%2d: %s\n", $i, &pchars($subs[$i]));
|
||||
$last_printed = $i;
|
||||
}
|
||||
}
|
||||
splice(@subs, 0, 18);
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
printf $outfile "\n";
|
||||
|
||||
# End
|
||||
22
ext/pcre/pcrelib/testdata/testinput1
vendored
@@ -1899,4 +1899,24 @@
|
||||
//g
|
||||
abc
|
||||
|
||||
/ End of test input /
|
||||
/<tr([\w\W\s\d][^<>]{0,})><TD([\w\W\s\d][^<>]{0,})>([\d]{0,}\.)(.*)((<BR>([\w\W\s\d][^<>]{0,})|[\s]{0,}))<\/a><\/TD><TD([\w\W\s\d][^<>]{0,})>([\w\W\s\d][^<>]{0,})<\/TD><TD([\w\W\s\d][^<>]{0,})>([\w\W\s\d][^<>]{0,})<\/TD><\/TR>/is
|
||||
<TR BGCOLOR='#DBE9E9'><TD align=left valign=top>43.<a href='joblist.cfm?JobID=94 6735&Keyword='>Word Processor<BR>(N-1286)</a></TD><TD align=left valign=top>Lega lstaff.com</TD><TD align=left valign=top>CA - Statewide</TD></TR>
|
||||
|
||||
/a[^a]b/
|
||||
acb
|
||||
a\nb
|
||||
|
||||
/a.b/
|
||||
acb
|
||||
*** Failers
|
||||
a\nb
|
||||
|
||||
/a[^a]b/s
|
||||
acb
|
||||
a\nb
|
||||
|
||||
/a.b/s
|
||||
acb
|
||||
a\nb
|
||||
|
||||
/ End of testinput1 /
|
||||
|
||||
8
ext/pcre/pcrelib/testdata/testinput2
vendored
@@ -40,8 +40,6 @@
|
||||
|
||||
/[\B]/
|
||||
|
||||
/[a-\w]/
|
||||
|
||||
/[z-a]/
|
||||
|
||||
/^*/
|
||||
@@ -707,4 +705,8 @@
|
||||
Ab
|
||||
AB
|
||||
|
||||
/ End of test input /
|
||||
/[\200-\410]/
|
||||
|
||||
/^(?(0)f|b)oo/
|
||||
|
||||
/ End of testinput2 /
|
||||
|
||||
16
ext/pcre/pcrelib/testdata/testinput3
vendored
@@ -1707,4 +1707,18 @@
|
||||
/a*/g
|
||||
abbab
|
||||
|
||||
/ End of test input /
|
||||
/^[a-\d]/
|
||||
abcde
|
||||
-things
|
||||
0digit
|
||||
*** Failers
|
||||
bcdef
|
||||
|
||||
/^[\d-a]/
|
||||
abcde
|
||||
-things
|
||||
0digit
|
||||
*** Failers
|
||||
bcdef
|
||||
|
||||
/ End of testinput3 /
|
||||
|
||||
1
ext/pcre/pcrelib/testdata/testinput4
vendored
@@ -62,3 +62,4 @@
|
||||
*** Failers
|
||||
école
|
||||
|
||||
/ End of testinput4 /
|
||||
|
||||
118
ext/pcre/pcrelib/testdata/testinput5
vendored
Normal file
@@ -0,0 +1,118 @@
|
||||
/-- Because of problems with Perl 5.6 in handling UTF-8 vs non UTF-8 --/
|
||||
/-- strings automatically, do not use the \x{} construct except with --/
|
||||
/-- patterns that have the /8 option set, and don't use them without! --/
|
||||
|
||||
/a.b/8
|
||||
acb
|
||||
a\x7fb
|
||||
a\x{100}b
|
||||
*** Failers
|
||||
a\nb
|
||||
|
||||
/a(.{3})b/8
|
||||
a\x{4000}xyb
|
||||
a\x{4000}\x7fyb
|
||||
a\x{4000}\x{100}yb
|
||||
*** Failers
|
||||
a\x{4000}b
|
||||
ac\ncb
|
||||
|
||||
/a(.*?)(.)/
|
||||
a\xc0\x88b
|
||||
|
||||
/a(.*?)(.)/8
|
||||
a\x{100}b
|
||||
|
||||
/a(.*)(.)/
|
||||
a\xc0\x88b
|
||||
|
||||
/a(.*)(.)/8
|
||||
a\x{100}b
|
||||
|
||||
/a(.)(.)/
|
||||
a\xc0\x92bcd
|
||||
|
||||
/a(.)(.)/8
|
||||
a\x{240}bcd
|
||||
|
||||
/a(.?)(.)/
|
||||
a\xc0\x92bcd
|
||||
|
||||
/a(.?)(.)/8
|
||||
a\x{240}bcd
|
||||
|
||||
/a(.??)(.)/
|
||||
a\xc0\x92bcd
|
||||
|
||||
/a(.??)(.)/8
|
||||
a\x{240}bcd
|
||||
|
||||
/a(.{3})b/8
|
||||
a\x{1234}xyb
|
||||
a\x{1234}\x{4321}yb
|
||||
a\x{1234}\x{4321}\x{3412}b
|
||||
*** Failers
|
||||
a\x{1234}b
|
||||
ac\ncb
|
||||
|
||||
/a(.{3,})b/8
|
||||
a\x{1234}xyb
|
||||
a\x{1234}\x{4321}yb
|
||||
a\x{1234}\x{4321}\x{3412}b
|
||||
axxxxbcdefghijb
|
||||
a\x{1234}\x{4321}\x{3412}\x{3421}b
|
||||
*** Failers
|
||||
a\x{1234}b
|
||||
|
||||
/a(.{3,}?)b/8
|
||||
a\x{1234}xyb
|
||||
a\x{1234}\x{4321}yb
|
||||
a\x{1234}\x{4321}\x{3412}b
|
||||
axxxxbcdefghijb
|
||||
a\x{1234}\x{4321}\x{3412}\x{3421}b
|
||||
*** Failers
|
||||
a\x{1234}b
|
||||
|
||||
/a(.{3,5})b/8
|
||||
a\x{1234}xyb
|
||||
a\x{1234}\x{4321}yb
|
||||
a\x{1234}\x{4321}\x{3412}b
|
||||
axxxxbcdefghijb
|
||||
a\x{1234}\x{4321}\x{3412}\x{3421}b
|
||||
axbxxbcdefghijb
|
||||
axxxxxbcdefghijb
|
||||
*** Failers
|
||||
a\x{1234}b
|
||||
axxxxxxbcdefghijb
|
||||
|
||||
/a(.{3,5}?)b/8
|
||||
a\x{1234}xyb
|
||||
a\x{1234}\x{4321}yb
|
||||
a\x{1234}\x{4321}\x{3412}b
|
||||
axxxxbcdefghijb
|
||||
a\x{1234}\x{4321}\x{3412}\x{3421}b
|
||||
axbxxbcdefghijb
|
||||
axxxxxbcdefghijb
|
||||
*** Failers
|
||||
a\x{1234}b
|
||||
axxxxxxbcdefghijb
|
||||
|
||||
/^[a\x{c0}]/8
|
||||
*** Failers
|
||||
\x{100}
|
||||
|
||||
/(?<=aXb)cd/8
|
||||
aXbcd
|
||||
|
||||
/(?<=a\x{100}b)cd/8
|
||||
a\x{100}bcd
|
||||
|
||||
/(?<=a\x{100000}b)cd/8
|
||||
a\x{100000}bcd
|
||||
|
||||
/(?:\x{100}){3}b/8
|
||||
\x{100}\x{100}\x{100}b
|
||||
*** Failers
|
||||
\x{100}\x{100}b
|
||||
|
||||
/ End of testinput5 /
|
||||
52
ext/pcre/pcrelib/testdata/testinput6
vendored
Normal file
@@ -0,0 +1,52 @@
|
||||
/\x{100}/8DM
|
||||
|
||||
/\x{1000}/8DM
|
||||
|
||||
/\x{10000}/8DM
|
||||
|
||||
/\x{100000}/8DM
|
||||
|
||||
/\x{1000000}/8DM
|
||||
|
||||
/\x{4000000}/8DM
|
||||
|
||||
/\x{7fffFFFF}/8DM
|
||||
|
||||
/[\x{ff}]/8DM
|
||||
|
||||
/[\x{100}]/8DM
|
||||
|
||||
/\x{ffffffff}/8
|
||||
|
||||
/\x{100000000}/8
|
||||
|
||||
/^\x{100}a\x{1234}/8
|
||||
\x{100}a\x{1234}bcd
|
||||
|
||||
/\x80/8D
|
||||
|
||||
/\xff/8D
|
||||
|
||||
/-- These tests are here rather than in testinput5 because Perl 5.6 has --/
|
||||
/-- some problems with UTF-8 support, in the area of \x{..} where the --/
|
||||
/-- value is < 255. It grumbles about invalid UTF-8 strings. --/
|
||||
|
||||
/^[a\x{c0}]b/8
|
||||
\x{c0}b
|
||||
|
||||
/^([a\x{c0}]*?)aa/8
|
||||
a\x{c0}aaaa/
|
||||
|
||||
/^([a\x{c0}]*?)aa/8
|
||||
a\x{c0}aaaa/
|
||||
a\x{c0}a\x{c0}aaa/
|
||||
|
||||
/^([a\x{c0}]*)aa/8
|
||||
a\x{c0}aaaa/
|
||||
a\x{c0}a\x{c0}aaa/
|
||||
|
||||
/^([a\x{c0}]*)a\x{c0}/8
|
||||
a\x{c0}aaaa/
|
||||
a\x{c0}a\x{c0}aaa/
|
||||
|
||||
/ End of testinput6 /
|
||||
45
ext/pcre/pcrelib/testdata/testoutput1
vendored
@@ -1,4 +1,4 @@
|
||||
PCRE version 3.2 12-May-2000
|
||||
PCRE version 3.4 22-Aug-2000
|
||||
|
||||
/the quick brown fox/
|
||||
the quick brown fox
|
||||
@@ -2921,5 +2921,46 @@ No match
|
||||
0:
|
||||
0:
|
||||
|
||||
/ End of test input /
|
||||
/<tr([\w\W\s\d][^<>]{0,})><TD([\w\W\s\d][^<>]{0,})>([\d]{0,}\.)(.*)((<BR>([\w\W\s\d][^<>]{0,})|[\s]{0,}))<\/a><\/TD><TD([\w\W\s\d][^<>]{0,})>([\w\W\s\d][^<>]{0,})<\/TD><TD([\w\W\s\d][^<>]{0,})>([\w\W\s\d][^<>]{0,})<\/TD><\/TR>/is
|
||||
<TR BGCOLOR='#DBE9E9'><TD align=left valign=top>43.<a href='joblist.cfm?JobID=94 6735&Keyword='>Word Processor<BR>(N-1286)</a></TD><TD align=left valign=top>Lega lstaff.com</TD><TD align=left valign=top>CA - Statewide</TD></TR>
|
||||
0: <TR BGCOLOR='#DBE9E9'><TD align=left valign=top>43.<a href='joblist.cfm?JobID=94 6735&Keyword='>Word Processor<BR>(N-1286)</a></TD><TD align=left valign=top>Lega lstaff.com</TD><TD align=left valign=top>CA - Statewide</TD></TR>
|
||||
1: BGCOLOR='#DBE9E9'
|
||||
2: align=left valign=top
|
||||
3: 43.
|
||||
4: <a href='joblist.cfm?JobID=94 6735&Keyword='>Word Processor<BR>(N-1286)
|
||||
5:
|
||||
6:
|
||||
7: <unset>
|
||||
8: align=left valign=top
|
||||
9: Lega lstaff.com
|
||||
10: align=left valign=top
|
||||
11: CA - Statewide
|
||||
|
||||
/a[^a]b/
|
||||
acb
|
||||
0: acb
|
||||
a\nb
|
||||
0: a\x0ab
|
||||
|
||||
/a.b/
|
||||
acb
|
||||
0: acb
|
||||
*** Failers
|
||||
No match
|
||||
a\nb
|
||||
No match
|
||||
|
||||
/a[^a]b/s
|
||||
acb
|
||||
0: acb
|
||||
a\nb
|
||||
0: a\x0ab
|
||||
|
||||
/a.b/s
|
||||
acb
|
||||
0: acb
|
||||
a\nb
|
||||
0: a\x0ab
|
||||
|
||||
/ End of testinput1 /
|
||||
|
||||
|
||||
13
ext/pcre/pcrelib/testdata/testoutput2
vendored
@@ -1,4 +1,4 @@
|
||||
PCRE version 3.2 12-May-2000
|
||||
PCRE version 3.4 22-Aug-2000
|
||||
|
||||
/(a)b|/
|
||||
Capturing subpattern count = 1
|
||||
@@ -94,9 +94,6 @@ Failed: missing terminating ] for character class at offset 5
|
||||
/[\B]/
|
||||
Failed: invalid escape sequence in character class at offset 2
|
||||
|
||||
/[a-\w]/
|
||||
Failed: invalid escape sequence in character class at offset 4
|
||||
|
||||
/[z-a]/
|
||||
Failed: range out of order in character class at offset 3
|
||||
|
||||
@@ -2064,7 +2061,13 @@ No match
|
||||
AB
|
||||
No match
|
||||
|
||||
/ End of test input /
|
||||
/[\200-\410]/
|
||||
Failed: range out of order in character class at offset 9
|
||||
|
||||
/^(?(0)f|b)oo/
|
||||
Failed: invalid condition (?(0) at offset 5
|
||||
|
||||
/ End of testinput2 /
|
||||
Capturing subpattern count = 0
|
||||
No options
|
||||
First char = ' '
|
||||
|
||||
28
ext/pcre/pcrelib/testdata/testoutput3
vendored
@@ -1,4 +1,4 @@
|
||||
PCRE version 3.2 12-May-2000
|
||||
PCRE version 3.4 22-Aug-2000
|
||||
|
||||
/(?<!bar)foo/
|
||||
foo
|
||||
@@ -2963,5 +2963,29 @@ No match
|
||||
0:
|
||||
0:
|
||||
|
||||
/ End of test input /
|
||||
/^[a-\d]/
|
||||
abcde
|
||||
0: a
|
||||
-things
|
||||
0: -
|
||||
0digit
|
||||
0: 0
|
||||
*** Failers
|
||||
No match
|
||||
bcdef
|
||||
No match
|
||||
|
||||
/^[\d-a]/
|
||||
abcde
|
||||
0: a
|
||||
-things
|
||||
0: -
|
||||
0digit
|
||||
0: 0
|
||||
*** Failers
|
||||
No match
|
||||
bcdef
|
||||
No match
|
||||
|
||||
/ End of testinput3 /
|
||||
|
||||
|
||||
3
ext/pcre/pcrelib/testdata/testoutput4
vendored
@@ -1,4 +1,4 @@
|
||||
PCRE version 3.2 12-May-2000
|
||||
PCRE version 3.4 22-Aug-2000
|
||||
|
||||
/^[\w]+/
|
||||
*** Failers
|
||||
@@ -112,4 +112,5 @@ No match
|
||||
école
|
||||
No match
|
||||
|
||||
/ End of testinput4 /
|
||||
|
||||
|
||||
242
ext/pcre/pcrelib/testdata/testoutput5
vendored
Normal file
@@ -0,0 +1,242 @@
|
||||
PCRE version 3.4 22-Aug-2000
|
||||
|
||||
/-- Because of problems with Perl 5.6 in handling UTF-8 vs non UTF-8 --/
|
||||
/-- strings automatically, do not use the \x{} construct except with --/
|
||||
No match
|
||||
/-- patterns that have the /8 option set, and don't use them without! --/
|
||||
No match
|
||||
|
||||
/a.b/8
|
||||
acb
|
||||
0: acb
|
||||
a\x7fb
|
||||
0: a\x{7f}b
|
||||
a\x{100}b
|
||||
0: a\x{100}b
|
||||
*** Failers
|
||||
No match
|
||||
a\nb
|
||||
No match
|
||||
|
||||
/a(.{3})b/8
|
||||
a\x{4000}xyb
|
||||
0: a\x{4000}xyb
|
||||
1: \x{4000}xy
|
||||
a\x{4000}\x7fyb
|
||||
0: a\x{4000}\x{7f}yb
|
||||
1: \x{4000}\x{7f}y
|
||||
a\x{4000}\x{100}yb
|
||||
0: a\x{4000}\x{100}yb
|
||||
1: \x{4000}\x{100}y
|
||||
*** Failers
|
||||
No match
|
||||
a\x{4000}b
|
||||
No match
|
||||
ac\ncb
|
||||
No match
|
||||
|
||||
/a(.*?)(.)/
|
||||
a\xc0\x88b
|
||||
0: a\xc0
|
||||
1:
|
||||
2: \xc0
|
||||
|
||||
/a(.*?)(.)/8
|
||||
a\x{100}b
|
||||
0: a\x{100}
|
||||
1:
|
||||
2: \x{100}
|
||||
|
||||
/a(.*)(.)/
|
||||
a\xc0\x88b
|
||||
0: a\xc0\x88b
|
||||
1: \xc0\x88
|
||||
2: b
|
||||
|
||||
/a(.*)(.)/8
|
||||
a\x{100}b
|
||||
0: a\x{100}b
|
||||
1: \x{100}
|
||||
2: b
|
||||
|
||||
/a(.)(.)/
|
||||
a\xc0\x92bcd
|
||||
0: a\xc0\x92
|
||||
1: \xc0
|
||||
2: \x92
|
||||
|
||||
/a(.)(.)/8
|
||||
a\x{240}bcd
|
||||
0: a\x{240}b
|
||||
1: \x{240}
|
||||
2: b
|
||||
|
||||
/a(.?)(.)/
|
||||
a\xc0\x92bcd
|
||||
0: a\xc0\x92
|
||||
1: \xc0
|
||||
2: \x92
|
||||
|
||||
/a(.?)(.)/8
|
||||
a\x{240}bcd
|
||||
0: a\x{240}b
|
||||
1: \x{240}
|
||||
2: b
|
||||
|
||||
/a(.??)(.)/
|
||||
a\xc0\x92bcd
|
||||
0: a\xc0
|
||||
1:
|
||||
2: \xc0
|
||||
|
||||
/a(.??)(.)/8
|
||||
a\x{240}bcd
|
||||
0: a\x{240}
|
||||
1:
|
||||
2: \x{240}
|
||||
|
||||
/a(.{3})b/8
|
||||
a\x{1234}xyb
|
||||
0: a\x{1234}xyb
|
||||
1: \x{1234}xy
|
||||
a\x{1234}\x{4321}yb
|
||||
0: a\x{1234}\x{4321}yb
|
||||
1: \x{1234}\x{4321}y
|
||||
a\x{1234}\x{4321}\x{3412}b
|
||||
0: a\x{1234}\x{4321}\x{3412}b
|
||||
1: \x{1234}\x{4321}\x{3412}
|
||||
*** Failers
|
||||
No match
|
||||
a\x{1234}b
|
||||
No match
|
||||
ac\ncb
|
||||
No match
|
||||
|
||||
/a(.{3,})b/8
|
||||
a\x{1234}xyb
|
||||
0: a\x{1234}xyb
|
||||
1: \x{1234}xy
|
||||
a\x{1234}\x{4321}yb
|
||||
0: a\x{1234}\x{4321}yb
|
||||
1: \x{1234}\x{4321}y
|
||||
a\x{1234}\x{4321}\x{3412}b
|
||||
0: a\x{1234}\x{4321}\x{3412}b
|
||||
1: \x{1234}\x{4321}\x{3412}
|
||||
axxxxbcdefghijb
|
||||
0: axxxxbcdefghijb
|
||||
1: xxxxbcdefghij
|
||||
a\x{1234}\x{4321}\x{3412}\x{3421}b
|
||||
0: a\x{1234}\x{4321}\x{3412}\x{3421}b
|
||||
1: \x{1234}\x{4321}\x{3412}\x{3421}
|
||||
*** Failers
|
||||
No match
|
||||
a\x{1234}b
|
||||
No match
|
||||
|
||||
/a(.{3,}?)b/8
|
||||
a\x{1234}xyb
|
||||
0: a\x{1234}xyb
|
||||
1: \x{1234}xy
|
||||
a\x{1234}\x{4321}yb
|
||||
0: a\x{1234}\x{4321}yb
|
||||
1: \x{1234}\x{4321}y
|
||||
a\x{1234}\x{4321}\x{3412}b
|
||||
0: a\x{1234}\x{4321}\x{3412}b
|
||||
1: \x{1234}\x{4321}\x{3412}
|
||||
axxxxbcdefghijb
|
||||
0: axxxxb
|
||||
1: xxxx
|
||||
a\x{1234}\x{4321}\x{3412}\x{3421}b
|
||||
0: a\x{1234}\x{4321}\x{3412}\x{3421}b
|
||||
1: \x{1234}\x{4321}\x{3412}\x{3421}
|
||||
*** Failers
|
||||
No match
|
||||
a\x{1234}b
|
||||
No match
|
||||
|
||||
/a(.{3,5})b/8
|
||||
a\x{1234}xyb
|
||||
0: a\x{1234}xyb
|
||||
1: \x{1234}xy
|
||||
a\x{1234}\x{4321}yb
|
||||
0: a\x{1234}\x{4321}yb
|
||||
1: \x{1234}\x{4321}y
|
||||
a\x{1234}\x{4321}\x{3412}b
|
||||
0: a\x{1234}\x{4321}\x{3412}b
|
||||
1: \x{1234}\x{4321}\x{3412}
|
||||
axxxxbcdefghijb
|
||||
0: axxxxb
|
||||
1: xxxx
|
||||
a\x{1234}\x{4321}\x{3412}\x{3421}b
|
||||
0: a\x{1234}\x{4321}\x{3412}\x{3421}b
|
||||
1: \x{1234}\x{4321}\x{3412}\x{3421}
|
||||
axbxxbcdefghijb
|
||||
0: axbxxb
|
||||
1: xbxx
|
||||
axxxxxbcdefghijb
|
||||
0: axxxxxb
|
||||
1: xxxxx
|
||||
*** Failers
|
||||
No match
|
||||
a\x{1234}b
|
||||
No match
|
||||
axxxxxxbcdefghijb
|
||||
No match
|
||||
|
||||
/a(.{3,5}?)b/8
|
||||
a\x{1234}xyb
|
||||
0: a\x{1234}xyb
|
||||
1: \x{1234}xy
|
||||
a\x{1234}\x{4321}yb
|
||||
0: a\x{1234}\x{4321}yb
|
||||
1: \x{1234}\x{4321}y
|
||||
a\x{1234}\x{4321}\x{3412}b
|
||||
0: a\x{1234}\x{4321}\x{3412}b
|
||||
1: \x{1234}\x{4321}\x{3412}
|
||||
axxxxbcdefghijb
|
||||
0: axxxxb
|
||||
1: xxxx
|
||||
a\x{1234}\x{4321}\x{3412}\x{3421}b
|
||||
0: a\x{1234}\x{4321}\x{3412}\x{3421}b
|
||||
1: \x{1234}\x{4321}\x{3412}\x{3421}
|
||||
axbxxbcdefghijb
|
||||
0: axbxxb
|
||||
1: xbxx
|
||||
axxxxxbcdefghijb
|
||||
0: axxxxxb
|
||||
1: xxxxx
|
||||
*** Failers
|
||||
No match
|
||||
a\x{1234}b
|
||||
No match
|
||||
axxxxxxbcdefghijb
|
||||
No match
|
||||
|
||||
/^[a\x{c0}]/8
|
||||
*** Failers
|
||||
No match
|
||||
\x{100}
|
||||
No match
|
||||
|
||||
/(?<=aXb)cd/8
|
||||
aXbcd
|
||||
0: cd
|
||||
|
||||
/(?<=a\x{100}b)cd/8
|
||||
a\x{100}bcd
|
||||
0: cd
|
||||
|
||||
/(?<=a\x{100000}b)cd/8
|
||||
a\x{100000}bcd
|
||||
0: cd
|
||||
|
||||
/(?:\x{100}){3}b/8
|
||||
\x{100}\x{100}\x{100}b
|
||||
0: \x{100}\x{100}\x{100}b
|
||||
*** Failers
|
||||
No match
|
||||
\x{100}\x{100}b
|
||||
No match
|
||||
|
||||
/ End of testinput5 /
|
||||
|
||||
185
ext/pcre/pcrelib/testdata/testoutput6
vendored
Normal file
@@ -0,0 +1,185 @@
|
||||
PCRE version 3.4 22-Aug-2000
|
||||
|
||||
/\x{100}/8DM
|
||||
Memory allocation (code space): 11
|
||||
------------------------------------------------------------------
|
||||
0 7 Bra 0
|
||||
3 2 \xc0\x88
|
||||
7 7 Ket
|
||||
10 End
|
||||
------------------------------------------------------------------
|
||||
Capturing subpattern count = 0
|
||||
Options: utf8
|
||||
First char = 192
|
||||
Need char = 136
|
||||
|
||||
/\x{1000}/8DM
|
||||
Memory allocation (code space): 12
|
||||
------------------------------------------------------------------
|
||||
0 8 Bra 0
|
||||
3 3 \xe0\x80\x84
|
||||
8 8 Ket
|
||||
11 End
|
||||
------------------------------------------------------------------
|
||||
Capturing subpattern count = 0
|
||||
Options: utf8
|
||||
First char = 224
|
||||
Need char = 132
|
||||
|
||||
/\x{10000}/8DM
|
||||
Memory allocation (code space): 13
|
||||
------------------------------------------------------------------
|
||||
0 9 Bra 0
|
||||
3 4 \xf0\x80\x80\x82
|
||||
9 9 Ket
|
||||
12 End
|
||||
------------------------------------------------------------------
|
||||
Capturing subpattern count = 0
|
||||
Options: utf8
|
||||
First char = 240
|
||||
Need char = 130
|
||||
|
||||
/\x{100000}/8DM
|
||||
Memory allocation (code space): 13
|
||||
------------------------------------------------------------------
|
||||
0 9 Bra 0
|
||||
3 4 \xf0\x80\x80\xa0
|
||||
9 9 Ket
|
||||
12 End
|
||||
------------------------------------------------------------------
|
||||
Capturing subpattern count = 0
|
||||
Options: utf8
|
||||
First char = 240
|
||||
Need char = 160
|
||||
|
||||
/\x{1000000}/8DM
|
||||
Memory allocation (code space): 14
|
||||
------------------------------------------------------------------
|
||||
0 10 Bra 0
|
||||
3 5 \xf8\x80\x80\x80\x90
|
||||
10 10 Ket
|
||||
13 End
|
||||
------------------------------------------------------------------
|
||||
Capturing subpattern count = 0
|
||||
Options: utf8
|
||||
First char = 248
|
||||
Need char = 144
|
||||
|
||||
/\x{4000000}/8DM
|
||||
Memory allocation (code space): 15
|
||||
------------------------------------------------------------------
|
||||
0 11 Bra 0
|
||||
3 6 \xfc\x80\x80\x80\x80\x82
|
||||
11 11 Ket
|
||||
14 End
|
||||
------------------------------------------------------------------
|
||||
Capturing subpattern count = 0
|
||||
Options: utf8
|
||||
First char = 252
|
||||
Need char = 130
|
||||
|
||||
/\x{7fffFFFF}/8DM
|
||||
Memory allocation (code space): 15
|
||||
------------------------------------------------------------------
|
||||
0 11 Bra 0
|
||||
3 6 \xfd\xbf\xbf\xbf\xbf\xbf
|
||||
11 11 Ket
|
||||
14 End
|
||||
------------------------------------------------------------------
|
||||
Capturing subpattern count = 0
|
||||
Options: utf8
|
||||
First char = 253
|
||||
Need char = 191
|
||||
|
||||
/[\x{ff}]/8DM
|
||||
Memory allocation (code space): 40
|
||||
------------------------------------------------------------------
|
||||
0 6 Bra 0
|
||||
3 1 \xff
|
||||
6 6 Ket
|
||||
9 End
|
||||
------------------------------------------------------------------
|
||||
Capturing subpattern count = 0
|
||||
Options: utf8
|
||||
First char = 255
|
||||
No need char
|
||||
|
||||
/[\x{100}]/8DM
|
||||
Memory allocation (code space): 40
|
||||
Failed: characters with values > 255 are not yet supported in classes at offset 7
|
||||
|
||||
/\x{ffffffff}/8
|
||||
Failed: character value in \x{...} sequence is too large at offset 11
|
||||
|
||||
/\x{100000000}/8
|
||||
Failed: character value in \x{...} sequence is too large at offset 12
|
||||
|
||||
/^\x{100}a\x{1234}/8
|
||||
\x{100}a\x{1234}bcd
|
||||
0: \x{100}a\x{1234}
|
||||
|
||||
/\x80/8D
|
||||
------------------------------------------------------------------
|
||||
0 7 Bra 0
|
||||
3 2 \xc0\x84
|
||||
7 7 Ket
|
||||
10 End
|
||||
------------------------------------------------------------------
|
||||
Capturing subpattern count = 0
|
||||
Options: utf8
|
||||
First char = 192
|
||||
Need char = 132
|
||||
|
||||
/\xff/8D
|
||||
------------------------------------------------------------------
|
||||
0 7 Bra 0
|
||||
3 2 \xdf\x87
|
||||
7 7 Ket
|
||||
10 End
|
||||
------------------------------------------------------------------
|
||||
Capturing subpattern count = 0
|
||||
Options: utf8
|
||||
First char = 223
|
||||
Need char = 135
|
||||
|
||||
/-- These tests are here rather than in testinput5 because Perl 5.6 has --/
|
||||
/-- some problems with UTF-8 support, in the area of \x{..} where the --/
|
||||
No match
|
||||
/-- value is < 255. It grumbles about invalid UTF-8 strings. --/
|
||||
No match
|
||||
|
||||
/^[a\x{c0}]b/8
|
||||
\x{c0}b
|
||||
0: \x{c0}b
|
||||
|
||||
/^([a\x{c0}]*?)aa/8
|
||||
a\x{c0}aaaa/
|
||||
0: a\x{c0}aa
|
||||
1: a\x{c0}
|
||||
|
||||
/^([a\x{c0}]*?)aa/8
|
||||
a\x{c0}aaaa/
|
||||
0: a\x{c0}aa
|
||||
1: a\x{c0}
|
||||
a\x{c0}a\x{c0}aaa/
|
||||
0: a\x{c0}a\x{c0}aa
|
||||
1: a\x{c0}a\x{c0}
|
||||
|
||||
/^([a\x{c0}]*)aa/8
|
||||
a\x{c0}aaaa/
|
||||
0: a\x{c0}aaaa
|
||||
1: a\x{c0}aa
|
||||
a\x{c0}a\x{c0}aaa/
|
||||
0: a\x{c0}a\x{c0}aaa
|
||||
1: a\x{c0}a\x{c0}a
|
||||
|
||||
/^([a\x{c0}]*)a\x{c0}/8
|
||||
a\x{c0}aaaa/
|
||||
0: a\x{c0}
|
||||
1:
|
||||
a\x{c0}a\x{c0}aaa/
|
||||
0: a\x{c0}a\x{c0}
|
||||
1: a\x{c0}
|
||||
|
||||
/ End of testinput6 /
|
||||
|
||||