mirror of https://github.com/php/php-src.git synced 2026-04-11 10:03:18 +02:00

Upgrade PCRE to version 3.4.

Andrei Zmievski
2001-02-20 22:00:33 +00:00
parent 714e340a3b
commit 6542e70473
41 changed files with 2299 additions and 251 deletions


@@ -20,7 +20,21 @@ restrictions:
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
2. The origin of this software must not be misrepresented, either by
explicit claim or by omission.
explicit claim or by omission. In practice, this means that if you use
PCRE in software which you distribute to others, commercially or
otherwise, you must put a sentence like this
Regular expression support is provided by the PCRE library package,
which is open source software, written by Philip Hazel, and copyright
by the University of Cambridge, England.
somewhere reasonably visible in your documentation and in any relevant
files or online help data or similar. A reference to the ftp site for
the source, that is, to
ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/
should also be given in the documentation.
3. Altered versions must be plainly marked as such, and must not be
misrepresented as being the original software.


@@ -2,6 +2,46 @@ ChangeLog for PCRE
------------------
Version 3.4 22-Aug-00
---------------------
1. Fixed typo in pcre.h: unsigned const char * changed to const unsigned char *.
2. Diagnose condition (?(0) as an error instead of crashing on matching.
Version 3.3 01-Aug-00
---------------------
1. If an octal character was given, but the value was greater than \377, it
was not getting masked to the least significant bits, as documented. This could
lead to crashes in some systems.
2. Perl 5.6 (if not earlier versions) accepts classes like [a-\d] and treats
the hyphen as a literal. PCRE used to give an error; it now behaves like Perl.
3. Added the functions pcre_free_substring() and pcre_free_substring_list().
These just pass their arguments on to (pcre_free)(), but they are provided
because some uses of PCRE bind it to non-C systems that can call its functions,
but cannot call free() or pcre_free() directly.
4. Add "make test" as a synonym for "make check". Corrected some comments in
the Makefile.
5. Add $(DESTDIR)/ in front of all the paths in the "install" target in the
Makefile.
6. Changed the name of pgrep to pcregrep, because Solaris has introduced a
command called pgrep for grepping around the active processes.
7. Added the beginnings of support for UTF-8 character strings.
8. Arranged for the Makefile to pass over the settings of CC, CFLAGS, and
RANLIB to ./ltconfig so that they are used by libtool. I think these are all
the relevant ones. (AR is not passed because ./ltconfig does its own figuring
out for the ar command.)
Version 3.2 12-May-00
---------------------


@@ -4,7 +4,7 @@ Basic Installation
These are generic installation instructions that apply to systems that
can run the `configure' shell script - Unix systems and any that imitate
it. They are not specific to PCRE. There are PCRE-specific instructions
for non-Unix systems in the file NON-UNIX.
for non-Unix systems in the file NON-UNIX-USE.
The `configure' shell script attempts to guess correct values for
various system-dependent variables used during compilation. It uses


@@ -20,19 +20,21 @@ restrictions:
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
2. The origin of this software must not be misrepresented, either by
explicit claim or by omission. In practice, this means you must put
a sentence like this
explicit claim or by omission. In practice, this means that if you use
PCRE in software which you distribute to others, commercially or
otherwise, you must put a sentence like this
Regular expression support is provided by the PCRE library package,
which is open source software, copyright by the University of
Cambridge.
which is open source software, written by Philip Hazel, and copyright
by the University of Cambridge, England.
somewhere reasonably visible in your documentation and in any relevant
files. A reference to the ftp site for the source should also be given
files or online help data or similar. A reference to the ftp site for
the source, that is, to
ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/
in the documentation.
should also be given in the documentation.
3. Altered versions must be plainly marked as such, and must not be
misrepresented as being the original software.


@@ -1,6 +1,14 @@
News about PCRE releases
------------------------
Release 3.3 01-Aug-00
---------------------
There is some support for UTF-8 character strings. This is incomplete and
experimental. The documentation describes what is and what is not implemented.
Otherwise, this is just a bug-fixing release.
Release 3.0 01-Feb-00
---------------------


@@ -7,6 +7,15 @@ The latest release of PCRE is always available from
Please read the NEWS file if you are upgrading from a previous release.
PCRE has its own native API, but a set of "wrapper" functions that are based on
the POSIX API are also supplied in the library libpcreposix. Note that this
just provides a POSIX calling interface to PCRE: the regular expressions
themselves still follow Perl syntax and semantics. The header file
for the POSIX-style functions is called pcreposix.h. The official POSIX name is
regex.h, but I didn't want to risk possible problems with existing files of
that name by distributing it that way. To use it with an existing program that
uses the POSIX API, it will have to be renamed or pointed at by a link.
Building PCRE on a Unix system
------------------------------
@@ -15,20 +24,29 @@ To build PCRE on a Unix system, run the "configure" command in the PCRE
distribution directory. This is a standard GNU "autoconf" configuration script,
for which generic instructions are supplied in INSTALL. On many systems just
running "./configure" is sufficient, but the usual methods of changing standard
defaults are available. For example
defaults are available. For example,
CFLAGS='-O2 -Wall' ./configure --prefix=/opt/local
specifies that the C compiler should be run with the flags '-O2 -Wall' instead
of the default, and that "make install" should install PCRE under /opt/local
instead of the default /usr/local. The "configure" script builds three files:
instead of the default /usr/local.
If you want to make use of the experimental, incomplete support for UTF-8
character strings in PCRE, you must add --enable-utf8 to the "configure"
command. Without it, the code for handling UTF-8 is not included in the
library. (Even when included, it still has to be enabled by an option at run
time.)
The "configure" script builds four files:
. Makefile is built by copying Makefile.in and making substitutions.
. config.h is built by copying config.in and making substitutions.
. pcre-config is built by copying pcre-config.in and making substitutions.
. RunTest is a script for running tests
Once "configure" has run, you can run "make". It builds two libraries called
libpcre and libpcreposix, a test program called pcretest, and the pgrep
libpcre and libpcreposix, a test program called pcretest, and the pcregrep
command. You can use "make install" to copy these, and the public header file
pcre.h, to appropriate live directories on your system, in the normal way.
@@ -54,11 +72,11 @@ The default distribution builds PCRE as two shared libraries. This support is
new and experimental and may not work on all systems. It relies on the
"libtool" scripts - these are distributed with PCRE. It should build a
"libtool" script and use this to compile and link shared libraries, which are
placed in a subdirectory called .libs. The programs pcretest and pgrep are
placed in a subdirectory called .libs. The programs pcretest and pcregrep are
built to use these uninstalled libraries by means of wrapper scripts. When you
use "make install" to install shared libraries, pgrep and pcretest are
use "make install" to install shared libraries, pcregrep and pcretest are
automatically re-built to use the newly installed libraries. However, only
pgrep is installed, as pcretest is really just a test program.
pcregrep is installed, as pcretest is really just a test program.
To build PCRE using static libraries you must use --disable-shared when
configuring it. For example
@@ -82,8 +100,8 @@ Testing PCRE
------------
To test PCRE on a Unix system, run the RunTest script in the pcre directory.
(This can also be run by "make runtest" or "make check".) For other systems,
see the instructions in NON-UNIX-USE.
(This can also be run by "make runtest", "make check", or "make test".) For
other systems, see the instructions in NON-UNIX-USE.
The script runs the pcretest test program (which is documented in
doc/pcretest.txt) on each of the testinput files (in the testdata directory) in
@@ -97,12 +115,24 @@ RunTest, for example:
The first and third test files can also be fed directly into the perltest
script to check that Perl gives the same results. The third file requires the
additional features of release 5.005, which is why it is kept separate from the
main test input, which needs only Perl 5.004. In the long run, when 5.005 is
widespread, these two test files may get amalgamated.
main test input, which needs only Perl 5.004. In the long run, when 5.005 (or
higher) is widespread, these two test files may get amalgamated.
The second set of tests check pcre_info(), pcre_study(), pcre_copy_substring(),
pcre_get_substring(), pcre_get_substring_list(), error detection and run-time
flags that are specific to PCRE, as well as the POSIX wrapper API.
The second set of tests checks pcre_fullinfo(), pcre_info(), pcre_study(),
pcre_copy_substring(), pcre_get_substring(), pcre_get_substring_list(), error
detection, and run-time flags that are specific to PCRE, as well as the POSIX
wrapper API. It also uses the debugging flag to check some of the internals of
pcre_compile().
If you build PCRE with a locale setting that is not the standard C locale, the
character tables may be different (see next paragraph). In some cases, this may
cause failures in the second set of tests. For example, in a locale where the
isprint() function yields TRUE for characters in the range 128-255, the use of
[:isascii:] inside a character class defines a different set of characters, and
this shows up in this test as a difference in the compiled code, which is being
listed for checking. Where the comparison test output contains [\x00-\x7f] the
test will contain [\x00-\xff], and similarly in some other cases. This is not a
bug in PCRE.
The fourth set of tests checks pcre_maketables(), the facility for building a
set of character tables for a specific locale and using them instead of the
@@ -117,14 +147,10 @@ output to say why. If running this test produces instances of the error
in the comparison output, it means that locale is not available on your system,
despite being listed by "locale". This does not mean that PCRE is broken.
PCRE has its own native API, but a set of "wrapper" functions that are based on
the POSIX API are also supplied in the library libpcreposix.a. Note that this
just provides a POSIX calling interface to PCRE: the regular expressions
themselves still follow Perl syntax and semantics. The header file
for the POSIX-style functions is called pcreposix.h. The official POSIX name is
regex.h, but I didn't want to risk possible problems with existing files of
that name by distributing it that way. To use it with an existing program that
uses the POSIX API, it will have to be renamed or pointed at by a link.
The fifth test checks the experimental, incomplete UTF-8 support. It is not run
automatically unless PCRE is built with UTF-8 support. This file can be fed
directly to the perltest8 script, which requires Perl 5.6 or higher. The sixth
file tests internal UTF-8 features of PCRE that are not relevant to Perl.
Character tables
@@ -197,7 +223,7 @@ The distribution should contain the following files:
NEWS important changes in this release
NON-UNIX-USE notes on building PCRE on non-Unix systems
README this file
RunTest a Unix shell script for running tests
RunTest.in template for a Unix shell script for running tests
config.guess ) files used by libtool,
config.sub ) used only when building a shared library
configure a configuring shell script (built by autoconf)
@@ -211,24 +237,29 @@ The distribution should contain the following files:
doc/pcreposix.txt plain text version
doc/pcretest.txt documentation of test program
doc/perltest.txt documentation of Perl test program
doc/pgrep.1 man page source for the pgrep utility
doc/pgrep.html HTML version
doc/pgrep.txt plain text version
doc/pcregrep.1 man page source for the pcregrep utility
doc/pcregrep.html HTML version
doc/pcregrep.txt plain text version
install-sh a shell script for installing files
ltconfig ) files used to build "libtool",
ltmain.sh ) used only when building a shared library
pcretest.c test program
perltest Perl test program
pgrep.c source of a grep utility that uses PCRE
perltest8 Perl test program for UTF-8 tests
pcregrep.c source of a grep utility that uses PCRE
pcre-config.in source of script which retains PCRE information
testdata/testinput1 test data, compatible with Perl 5.004 and 5.005
testdata/testinput2 test data for error messages and non-Perl things
testdata/testinput3 test data, compatible with Perl 5.005
testdata/testinput4 test data for locale-specific tests
testdata/testinput5 test data for UTF-8 tests compatible with Perl 5.6
testdata/testinput6 test data for other UTF-8 tests
testdata/testoutput1 test results corresponding to testinput1
testdata/testoutput2 test results corresponding to testinput2
testdata/testoutput3 test results corresponding to testinput3
testdata/testoutput4 test results corresponding to testinput4
testdata/testoutput5 test results corresponding to testinput5
testdata/testoutput6 test results corresponding to testinput6
(C) Auxiliary files for Win32 DLL
@@ -236,4 +267,4 @@ The distribution should contain the following files:
pcre.def
Philip Hazel <ph10@cam.ac.uk>
February 2000
August 2000


@@ -1,5 +1,8 @@
#! /bin/sh
# This file is generated by configure from RunTest.in. Make any changes
# to that file.
# Run PCRE tests
cf=diff
@@ -10,6 +13,8 @@ do1=no
do2=no
do3=no
do4=no
do5=no
do6=no
while [ $# -gt 0 ] ; do
case $1 in
@@ -17,16 +22,32 @@ while [ $# -gt 0 ] ; do
2) do2=yes;;
3) do3=yes;;
4) do4=yes;;
5) do5=yes;;
6) do6=yes;;
*) echo "Unknown test number $1"; exit 1;;
esac
shift
done
if [ $do1 = no -a $do2 = no -a $do3 = no -a $do4 = no ] ; then
if [ "" = "" ] ; then
if [ $do5 = yes ] ; then
echo "Can't run test 5 because UTF8 support is not configured"
exit 1
fi
if [ $do6 = yes ] ; then
echo "Can't run test 6 because UTF8 support is not configured"
exit 1
fi
fi
if [ $do1 = no -a $do2 = no -a $do3 = no -a $do4 = no -a\
$do5 = no -a $do6 = no ] ; then
do1=yes
do2=yes
do3=yes
do4=yes
if [ "" != "" ] ; then do5=yes; fi
if [ "" != "" ] ; then do6=yes; fi
fi
# Primary test, Perl-compatible
@@ -66,6 +87,7 @@ if [ $do3 = yes ] ; then
fi
if [ $do1 = yes -a $do2 = yes -a $do3 = yes ] ; then
echo " "
echo "The three main tests all ran OK"
echo " "
fi
@@ -79,8 +101,14 @@ if [ $do4 = yes ] ; then
./pcretest testdata/testinput4 testtry
if [ $? = 0 ] ; then
$cf testtry testdata/testoutput4
if [ $? != 0 ] ; then exit 1; fi
if [ $? != 0 ] ; then
echo " "
echo "Locale test did not run entirely successfully."
echo "This usually means that there is a problem with the locale"
echo "settings rather than a bug in PCRE."
else
echo "Locale test ran OK"
fi
echo " "
else exit 1
fi
@@ -91,4 +119,30 @@ if [ $do4 = yes ] ; then
fi
fi
# Additional tests for UTF8 support
if [ $do5 = yes ] ; then
echo "Testing experimental, incomplete UTF8 support (Perl compatible)"
./pcretest testdata/testinput5 testtry
if [ $? = 0 ] ; then
$cf testtry testdata/testoutput5
if [ $? != 0 ] ; then exit 1; fi
else exit 1
fi
echo "UTF8 test ran OK"
echo " "
fi
if [ $do6 = yes ] ; then
echo "Testing API and internals for UTF8 support (not Perl compatible)"
./pcretest testdata/testinput6 testtry
if [ $? = 0 ] ; then
$cf testtry testdata/testoutput6
if [ $? != 0 ] ; then exit 1; fi
else exit 1
fi
echo "UTF8 internals test ran OK"
echo " "
fi
# End


@@ -202,9 +202,10 @@ Forward assertions are just like other subpatterns, but starting with one of
the opcodes OP_ASSERT or OP_ASSERT_NOT. Backward assertions use the opcodes
OP_ASSERTBACK and OP_ASSERTBACK_NOT, and the first opcode inside the assertion
is OP_REVERSE, followed by a two byte count of the number of characters to move
back the pointer in the subject string. A separate count is present in each
alternative of a lookbehind assertion, allowing them to have different fixed
lengths.
back the pointer in the subject string. When operating in UTF-8 mode, the count
is a character count rather than a byte count. A separate count is present in
each alternative of a lookbehind assertion, allowing them to have different
fixed lengths.
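The difference between a character count and a byte count can be sketched as follows. This is a stand-alone Python illustration, not PCRE's own code; the function name move_back is hypothetical, and the subject is assumed to be valid UTF-8.

```python
def move_back(subject, pos, count):
    """Move pos back by count UTF-8 characters in the bytes object subject.

    Returns the new byte offset, or -1 if fewer than count characters
    precede pos. The subject is assumed to be valid UTF-8.
    """
    for _ in range(count):
        if pos == 0:
            return -1
        pos -= 1
        # Skip continuation bytes (10xxxxxx) until a lead byte is reached.
        while pos > 0 and (subject[pos] & 0xC0) == 0x80:
            pos -= 1
    return pos


if __name__ == "__main__":
    # One 1-byte, one 2-byte and one 3-byte character: 6 bytes in total.
    subject = "a\u00e9\u20ac".encode("utf-8")
    print(move_back(subject, len(subject), 1))
```

In byte mode, moving back by one would always subtract one from the offset; in this sketch, moving back by one character from the end skips the whole three-byte sequence.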
Once-only subpatterns
@@ -239,4 +240,4 @@ the compiled data.
Philip Hazel
February 2000
August 2000


@@ -44,6 +44,12 @@ pcre - Perl-compatible regular expressions.
.B int *\fIovector\fR, int \fIstringcount\fR, "const char ***\fIlistptr\fR);"
.PP
.br
.B void pcre_free_substring(const char *\fIstringptr\fR);
.PP
.br
.B void pcre_free_substring_list(const char **\fIstringptr\fR);
.PP
.br
.B const unsigned char *pcre_maketables(void);
.PP
.br
@@ -70,7 +76,9 @@ pcre - Perl-compatible regular expressions.
The PCRE library is a set of functions that implement regular expression
pattern matching using the same syntax and semantics as Perl 5, with just a few
differences (see below). The current implementation corresponds to Perl 5.005,
with some additional features from the Perl development release.
with some additional features from later versions. This includes some
experimental, incomplete support for UTF-8 encoded strings. Details of exactly
what is and what is not supported are given below.
PCRE has its own native API, which is described in this document. There is also
a set of wrapper functions that correspond to the POSIX regular expression API.
@@ -84,12 +92,16 @@ contain the major and minor release numbers for the library. Applications can
use these to include support for different releases.
The functions \fBpcre_compile()\fR, \fBpcre_study()\fR, and \fBpcre_exec()\fR
are used for compiling and matching regular expressions, while
\fBpcre_copy_substring()\fR, \fBpcre_get_substring()\fR, and
are used for compiling and matching regular expressions.
The functions \fBpcre_copy_substring()\fR, \fBpcre_get_substring()\fR, and
\fBpcre_get_substring_list()\fR are convenience functions for extracting
captured substrings from a matched subject string. The function
\fBpcre_maketables()\fR is used (optionally) to build a set of character tables
in the current locale for passing to \fBpcre_compile()\fR.
captured substrings from a matched subject string; \fBpcre_free_substring()\fR
and \fBpcre_free_substring_list()\fR are also provided, to free the memory used
for extracted strings.
The function \fBpcre_maketables()\fR is used (optionally) to build a set of
character tables in the current locale for passing to \fBpcre_compile()\fR.
The function \fBpcre_fullinfo()\fR is used to find out information about a
compiled pattern; \fBpcre_info()\fR is an obsolete version which returns only
@@ -223,6 +235,14 @@ This option inverts the "greediness" of the quantifiers so that they are not
greedy by default, but become greedy if followed by "?". It is not compatible
with Perl. It can also be set by a (?U) option setting within the pattern.
PCRE_UTF8
This option causes PCRE to regard both the pattern and the subject as strings
of UTF-8 characters instead of just byte strings. However, it is available only
if PCRE has been built to include UTF-8 support. If not, the use of this option
provokes an error. Support for UTF-8 is new, experimental, and incomplete.
Details of exactly what it entails are given below.
.SH STUDYING A PATTERN
When a pattern is going to be used several times, it is worth spending more
@@ -558,7 +578,7 @@ extract a single substring, whose number is given as \fIstringnumber\fR. A
value of zero extracts the substring that matched the entire pattern, while
higher values extract the captured substrings. For \fBpcre_copy_substring()\fR,
the string is placed in \fIbuffer\fR, whose length is given by
\fIbuffersize\fR, while for \fBpcre_get_substring()\fR a new block of store is
\fIbuffersize\fR, while for \fBpcre_get_substring()\fR a new block of memory is
obtained via \fBpcre_malloc\fR, and its address is returned via
\fIstringptr\fR. The yield of the function is the length of the string, not
including the terminating zero, or one of
@@ -590,6 +610,15 @@ string. This can be distinguished from a genuine zero-length substring by
inspecting the appropriate offset in \fIovector\fR, which is negative for unset
substrings.
The two convenience functions \fBpcre_free_substring()\fR and
\fBpcre_free_substring_list()\fR can be used to free the memory returned by
a previous call of \fBpcre_get_substring()\fR or
\fBpcre_get_substring_list()\fR, respectively. They do nothing more than call
the function pointed to by \fBpcre_free\fR, which of course could be called
directly from a C program. However, PCRE is used in some situations where it is
linked via a special interface to another programming language which cannot use
\fBpcre_free\fR directly; it is for these cases that the functions are
provided.
.SH LIMITATIONS
@@ -691,8 +720,14 @@ The syntax and semantics of the regular expressions supported by PCRE are
described below. Regular expressions are also described in the Perl
documentation and in a number of other books, some of which have copious
examples. Jeffrey Friedl's "Mastering Regular Expressions", published by
O'Reilly (ISBN 1-56592-257), covers them in great detail. The description
here is intended as reference documentation.
O'Reilly (ISBN 1-56592-257), covers them in great detail.
The description here is intended as reference documentation. The basic
operation of PCRE is on strings of bytes. However, there are the beginnings of
some support for UTF-8 character strings. To use this support you must
configure PCRE to include it, and then call \fBpcre_compile()\fR with the
PCRE_UTF8 option. How this affects the pattern matching is described in the
final section of this document.
A regular expression is a pattern that is matched against a subject string from
left to right. Most characters stand for themselves in a pattern, and match the
@@ -1210,7 +1245,7 @@ to the string
/* first command */ not comment /* second comment */
fails, because it matches the entire string due to the greediness of the .*
fails, because it matches the entire string owing to the greediness of the .*
item.
However, if a quantifier is followed by a question mark, it ceases to be
@@ -1311,7 +1346,7 @@ example, the pattern
(a|b\\1)+
matches any number of "a"s and also "aba", "ababaa" etc. At each iteration of
matches any number of "a"s and also "aba", "ababbaa" etc. At each iteration of
the subpattern, the back reference matches the character string corresponding
to the previous iteration. In order for this to work, the pattern must be such
that the first iteration does not need to match the back reference. This can be
@@ -1529,9 +1564,10 @@ subpattern, a compile-time error occurs.
There are two kinds of condition. If the text between the parentheses consists
of a sequence of digits, the condition is satisfied if the capturing subpattern
of that number has previously matched. Consider the following pattern, which
contains non-significant white space to make it more readable (assume the
PCRE_EXTENDED option) and to divide it into three parts for ease of discussion:
of that number has previously matched. The number must be greater than zero.
Consider the following pattern, which contains non-significant white space to
make it more readable (assume the PCRE_EXTENDED option) and to divide it into
three parts for ease of discussion:
( \\( )? [^()]+ (?(1) \\) )
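The conditional pattern above can be tried out with a short sketch. It uses Python's re module rather than PCRE itself, which is an assumption made to keep the example self-contained: the two engines share the (?(1)...) syntax but are not identical. re.VERBOSE stands in for PCRE_EXTENDED, the single backslashes are the pattern as it appears once the troff escaping is removed, and the helper name balanced is hypothetical.

```python
import re

# Python's re module stands in for PCRE here (an assumption for illustration).
# re.VERBOSE ignores the white space in the pattern, like PCRE_EXTENDED.
PATTERN = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)


def balanced(text):
    """Return True if the whole of text is a run of non-parenthesis
    characters, optionally wrapped in one pair of parentheses."""
    return PATTERN.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ("abc", "(abc)", "(abc", "abc)"):
        print(repr(sample), balanced(sample))
```

The closing parenthesis is required only when group 1 (the optional opening parenthesis) has matched, so "abc" and "(abc)" are accepted while "(abc" and "abc)" are rejected.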
@@ -1685,6 +1721,77 @@ with the pattern above. The former gives a failure almost instantly when
applied to a whole line of "a" characters, whereas the latter takes an
appreciable time with strings longer than about 20 characters.
.SH UTF-8 SUPPORT
Starting at release 3.3, PCRE has some support for character strings encoded
in the UTF-8 format. This is incomplete, and is regarded as experimental. In
order to use it, you must configure PCRE to include UTF-8 support in the code,
and, in addition, you must call \fBpcre_compile()\fR with the PCRE_UTF8 option
flag. When you do this, both the pattern and any subject strings that are
matched against it are treated as UTF-8 strings instead of just strings of
bytes, but only in the cases that are mentioned below.
If you compile PCRE with UTF-8 support, but do not use it at run time, the
library will be a bit bigger, but the additional run time overhead is limited
to testing the PCRE_UTF8 flag in several places, so should not be very large.
PCRE assumes that the strings it is given contain valid UTF-8 codes. It does
not diagnose invalid UTF-8 strings. If you pass invalid UTF-8 strings to PCRE,
the results are undefined.
Running with PCRE_UTF8 set causes these changes in the way PCRE works:
1. In a pattern, the escape sequence \\x{...}, where the contents of the braces
is a string of hexadecimal digits, is interpreted as a UTF-8 character whose
code number is the given hexadecimal number, for example: \\x{1234}. This
inserts from one to six literal bytes into the pattern, using the UTF-8
encoding. If a non-hexadecimal digit appears between the braces, the item is
not recognized.
2. The original hexadecimal escape sequence, \\xhh, generates a two-byte UTF-8
character if its value is greater than 127.
3. Repeat quantifiers are NOT correctly handled if they follow a multibyte
character. For example, \\x{100}* and \\xc3+ do not work. If you want to
repeat such characters, you must enclose them in non-capturing parentheses,
for example (?:\\x{100}), at present.
4. The dot metacharacter matches one UTF-8 character instead of a single byte.
5. Unlike literal UTF-8 characters, the dot metacharacter followed by a
repeat quantifier does operate correctly on UTF-8 characters instead of
single bytes.
6. Although the \\x{...} escape is permitted in a character class, characters
whose values are greater than 255 cannot be included in a class.
7. A class is matched against a UTF-8 character instead of just a single byte,
but it can match only characters whose values are less than 256. Characters
with greater values always fail to match a class.
8. Repeated classes work correctly on multiple characters.
9. Classes containing just a single character whose value is greater than 127
(but less than 256), for example, [\\x80] or [^\\x{93}], do not work because
these are optimized into single byte matches. In the first case, of course,
the class brackets are just redundant.
10. Lookbehind assertions move backwards in the subject by a fixed number of
characters instead of a fixed number of bytes. Simple cases have been tested
to work correctly, but there may be hidden gotchas herein.
11. The character types such as \\d and \\w do not work correctly with UTF-8
characters. They continue to test a single byte.
12. Anything not explicitly mentioned here continues to work in bytes rather
than in characters.
The following UTF-8 features of Perl 5.6 are not implemented:
1. The escape sequence \\C to match a single byte.
2. The use of Unicode tables and properties and escapes \\p, \\P, and \\X.
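Item 1 of the first list above says that \\x{...} inserts from one to six literal bytes into the pattern. The sketch below shows how a code number maps to that many bytes under the original UTF-8 layout, which allowed sequences of up to six bytes. It is a stand-alone Python illustration, not PCRE's own code, and the function name encode_utf8 is hypothetical.

```python
def encode_utf8(code):
    """Encode a code number (0 to 0x7FFFFFFF) as 1 to 6 bytes, following
    the original UTF-8 layout that allowed sequences of up to six bytes."""
    if code < 0 or code > 0x7FFFFFFF:
        raise ValueError("code number out of range")
    if code < 0x80:
        return bytes([code])               # single byte, plain ASCII
    # (upper limit of the range, lead-byte marker) for 2..6 byte sequences
    ranges = [
        (0x7FF, 0xC0),
        (0xFFFF, 0xE0),
        (0x1FFFFF, 0xF0),
        (0x3FFFFFF, 0xF8),
        (0x7FFFFFFF, 0xFC),
    ]
    for extra, (limit, marker) in enumerate(ranges, start=1):
        if code <= limit:
            break
    out = []
    for _ in range(extra):
        out.append(0x80 | (code & 0x3F))   # continuation byte: 10xxxxxx
        code >>= 6
    out.append(marker | code)              # lead byte carries the top bits
    return bytes(reversed(out))


if __name__ == "__main__":
    print(encode_utf8(0x1234).hex())
```

Under this layout the code number 1234 (hexadecimal) becomes the three bytes E1 88 B4, and a value such as C3, which is greater than 127, becomes the two bytes C3 83, in line with item 2.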
.SH AUTHOR
Philip Hazel <ph10@cam.ac.uk>
.br
@@ -1696,6 +1803,8 @@ Cambridge CB2 3QG, England.
.br
Phone: +44 1223 334714
Last updated: 27 January 2000
Last updated: 28 August 2000,
.br
the 250th anniversary of the death of J.S. Bach.
.br
Copyright (c) 1997-2000 University of Cambridge.


@@ -37,7 +37,8 @@ conversion went wrong.
<LI><A NAME="TOC27" HREF="#SEC27">COMMENTS</A>
<LI><A NAME="TOC28" HREF="#SEC28">RECURSIVE PATTERNS</A>
<LI><A NAME="TOC29" HREF="#SEC29">PERFORMANCE</A>
<LI><A NAME="TOC30" HREF="#SEC30">AUTHOR</A>
<LI><A NAME="TOC30" HREF="#SEC30">UTF-8 SUPPORT</A>
<LI><A NAME="TOC31" HREF="#SEC31">AUTHOR</A>
</UL>
<LI><A NAME="SEC1" HREF="#TOC1">NAME</A>
<P>
@@ -76,6 +77,12 @@ pcre - Perl-compatible regular expressions.
<B>int *<I>ovector</I>, int <I>stringcount</I>, const char ***<I>listptr</I>);</B>
</P>
<P>
<B>void pcre_free_substring(const char *<I>stringptr</I>);</B>
</P>
<P>
<B>void pcre_free_substring_list(const char **<I>stringptr</I>);</B>
</P>
<P>
<B>const unsigned char *pcre_maketables(void);</B>
</P>
<P>
@@ -100,7 +107,9 @@ pcre - Perl-compatible regular expressions.
The PCRE library is a set of functions that implement regular expression
pattern matching using the same syntax and semantics as Perl 5, with just a few
differences (see below). The current implementation corresponds to Perl 5.005,
with some additional features from the Perl development release.
with some additional features from later versions. This includes some
experimental, incomplete support for UTF-8 encoded strings. Details of exactly
what is and what is not supported are given below.
</P>
<P>
PCRE has its own native API, which is described in this document. There is also
@@ -117,12 +126,18 @@ use these to include support for different releases.
</P>
<P>
The functions <B>pcre_compile()</B>, <B>pcre_study()</B>, and <B>pcre_exec()</B>
are used for compiling and matching regular expressions, while
<B>pcre_copy_substring()</B>, <B>pcre_get_substring()</B>, and
are used for compiling and matching regular expressions.
</P>
<P>
The functions <B>pcre_copy_substring()</B>, <B>pcre_get_substring()</B>, and
<B>pcre_get_substring_list()</B> are convenience functions for extracting
captured substrings from a matched subject string. The function
<B>pcre_maketables()</B> is used (optionally) to build a set of character tables
in the current locale for passing to <B>pcre_compile()</B>.
captured substrings from a matched subject string; <B>pcre_free_substring()</B>
and <B>pcre_free_substring_list()</B> are also provided, to free the memory used
for extracted strings.
</P>
<P>
The function <B>pcre_maketables()</B> is used (optionally) to build a set of
character tables in the current locale for passing to <B>pcre_compile()</B>.
</P>
<P>
The function <B>pcre_fullinfo()</B> is used to find out information about a
@@ -297,6 +312,18 @@ This option inverts the "greediness" of the quantifiers so that they are not
greedy by default, but become greedy if followed by "?". It is not compatible
with Perl. It can also be set by a (?U) option setting within the pattern.
</P>
<P>
<PRE>
PCRE_UTF8
</PRE>
</P>
<P>
This option causes PCRE to regard both the pattern and the subject as strings
of UTF-8 characters instead of just byte strings. However, it is available only
if PCRE has been built to include UTF-8 support. If not, the use of this option
provokes an error. Support for UTF-8 is new, experimental, and incomplete.
Details of exactly what it entails are given below.
</P>
<LI><A NAME="SEC6" HREF="#TOC1">STUDYING A PATTERN</A>
<P>
When a pattern is going to be used several times, it is worth spending more
@@ -743,7 +770,7 @@ extract a single substring, whose number is given as <I>stringnumber</I>. A
value of zero extracts the substring that matched the entire pattern, while
higher values extract the captured substrings. For <B>pcre_copy_substring()</B>,
the string is placed in <I>buffer</I>, whose length is given by
<I>buffersize</I>, while for <B>pcre_get_substring()</B> a new block of store is
<I>buffersize</I>, while for <B>pcre_get_substring()</B> a new block of memory is
obtained via <B>pcre_malloc</B>, and its address is returned via
<I>stringptr</I>. The yield of the function is the length of the string, not
including the terminating zero, or one of
@@ -789,6 +816,17 @@ string. This can be distinguished from a genuine zero-length substring by
inspecting the appropriate offset in <I>ovector</I>, which is negative for unset
substrings.
</P>
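As a rough illustration of the offset check described above, the following C sketch tests whether a capture is unset before using it. It assumes the usual pcre_exec() layout in which capture n occupies ovector[2*n] (start) and ovector[2*n+1] (end); the helper names are made up for this example and are not part of PCRE.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: nonzero if capture n took no part in the match.
   Relies on unset captures having a negative start offset. */
int substring_is_unset(const int *ovector, int n)
{
    assert(ovector != NULL && n >= 0);
    return ovector[2 * n] < 0;
}

/* Hypothetical helper: length of capture n. Only meaningful for a set
   capture, so a genuine empty match yields 0 here. */
int substring_length(const int *ovector, int n)
{
    assert(!substring_is_unset(ovector, n));
    return ovector[2 * n + 1] - ovector[2 * n];
}
```

A caller would check substring_is_unset() first and only then treat a zero length as a real empty match.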
<P>
The two convenience functions <B>pcre_free_substring()</B> and
<B>pcre_free_substring_list()</B> can be used to free the memory returned by
a previous call of <B>pcre_get_substring()</B> or
<B>pcre_get_substring_list()</B>, respectively. They do nothing more than call
the function pointed to by <B>pcre_free</B>, which of course could be called
directly from a C program. However, PCRE is used in some situations where it is
linked via a special interface to another programming language which cannot use
<B>pcre_free</B> directly; it is for these cases that the functions are
provided.
</P>
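The indirection described above can be sketched in plain C. This is not PCRE's source: demo_free, demo_free_substring and demo_counting_free are hypothetical stand-ins for pcre_free, pcre_free_substring() and an application-supplied deallocator, used only to show a wrapper forwarding to whatever function the pointer currently holds.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the replaceable pcre_free pointer. */
void (*demo_free)(void *) = free;

/* Hypothetical stand-in for pcre_free_substring(): it does nothing
   more than call through the pointer. */
void demo_free_substring(const char *stringptr)
{
    assert(demo_free != NULL);
    (demo_free)((void *)stringptr);
}

/* An application-supplied deallocator that counts its calls. */
int demo_free_calls = 0;

void demo_counting_free(void *block)
{
    demo_free_calls++;
    free(block);
}
```

A foreign-language binding that cannot dereference a C function pointer can still call a wrapper of this shape as an ordinary function.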
<LI><A NAME="SEC11" HREF="#TOC1">LIMITATIONS</A>
<P>
There are some size limitations in PCRE but it is hoped that they will never in
@@ -908,8 +946,15 @@ The syntax and semantics of the regular expressions supported by PCRE are
described below. Regular expressions are also described in the Perl
documentation and in a number of other books, some of which have copious
examples. Jeffrey Friedl's "Mastering Regular Expressions", published by
O'Reilly (ISBN 1-56592-257), covers them in great detail. The description
here is intended as reference documentation.
O'Reilly (ISBN 1-56592-257), covers them in great detail.
</P>
<P>
The description here is intended as reference documentation. The basic
operation of PCRE is on strings of bytes. However, there are the beginnings of
some support for UTF-8 character strings. To use this support you must
configure PCRE to include it, and then call <B>pcre_compile()</B> with the
PCRE_UTF8 option. How this affects the pattern matching is described in the
final section of this document.
</P>
<P>
A regular expression is a pattern that is matched against a subject string from
@@ -1576,7 +1621,7 @@ to the string
</PRE>
</P>
<P>
fails, because it matches the entire string due to the greediness of the .*
fails, because it matches the entire string owing to the greediness of the .*
item.
</P>
<P>
@@ -1718,7 +1763,7 @@ example, the pattern
</PRE>
</P>
<P>
matches any number of "a"s and also "aba", "ababaa" etc. At each iteration of
matches any number of "a"s and also "aba", "ababbaa" etc. At each iteration of
the subpattern, the back reference matches the character string corresponding
to the previous iteration. In order for this to work, the pattern must be such
that the first iteration does not need to match the back reference. This can be
@@ -2033,9 +2078,10 @@ subpattern, a compile-time error occurs.
<P>
There are two kinds of condition. If the text between the parentheses consists
of a sequence of digits, the condition is satisfied if the capturing subpattern
of that number has previously matched. Consider the following pattern, which
contains non-significant white space to make it more readable (assume the
PCRE_EXTENDED option) and to divide it into three parts for ease of discussion:
of that number has previously matched. The number must be greater than zero.
Consider the following pattern, which contains non-significant white space to
make it more readable (assume the PCRE_EXTENDED option) and to divide it into
three parts for ease of discussion:
</P>
<P>
<PRE>
@@ -2240,7 +2286,96 @@ with the pattern above. The former gives a failure almost instantly when
applied to a whole line of "a" characters, whereas the latter takes an
appreciable time with strings longer than about 20 characters.
</P>
<LI><A NAME="SEC30" HREF="#TOC1">AUTHOR</A>
<LI><A NAME="SEC30" HREF="#TOC1">UTF-8 SUPPORT</A>
<P>
Starting at release 3.3, PCRE has some support for character strings encoded
in the UTF-8 format. This is incomplete, and is regarded as experimental. In
order to use it, you must configure PCRE to include UTF-8 support in the code,
and, in addition, you must call <B>pcre_compile()</B> with the PCRE_UTF8 option
flag. When you do this, both the pattern and any subject strings that are
matched against it are treated as UTF-8 strings instead of just strings of
bytes, but only in the cases that are mentioned below.
</P>
<P>
If you compile PCRE with UTF-8 support, but do not use it at run time, the
library will be a bit bigger, but the additional run time overhead is limited
to testing the PCRE_UTF8 flag in several places, so should not be very large.
</P>
<P>
PCRE assumes that the strings it is given contain valid UTF-8 codes. It does
not diagnose invalid UTF-8 strings. If you pass invalid UTF-8 strings to PCRE,
the results are undefined.
</P>
<P>
Running with PCRE_UTF8 set causes these changes in the way PCRE works:
</P>
<P>
1. In a pattern, the escape sequence \x{...}, where the contents of the braces
is a string of hexadecimal digits, is interpreted as a UTF-8 character whose
code number is the given hexadecimal number, for example: \x{1234}. This
inserts from one to six literal bytes into the pattern, using the UTF-8
encoding. If a non-hexadecimal digit appears between the braces, the item is
not recognized.
</P>
<P>
2. The original hexadecimal escape sequence, \xhh, generates a two-byte UTF-8
character if its value is greater than 127.
</P>
<P>
3. Repeat quantifiers are NOT correctly handled if they follow a multibyte
character. For example, \x{100}* and \xc3+ do not work. If you want to
repeat such characters, you must enclose them in non-capturing parentheses,
for example (?:\x{100}), at present.
</P>
<P>
4. The dot metacharacter matches one UTF-8 character instead of a single byte.
</P>
<P>
5. Unlike literal UTF-8 characters, the dot metacharacter followed by a
repeat quantifier does operate correctly on UTF-8 characters instead of
single bytes.
</P>
<P>
6. Although the \x{...} escape is permitted in a character class, characters
whose values are greater than 255 cannot be included in a class.
</P>
<P>
7. A class is matched against a UTF-8 character instead of just a single byte,
but it can match only characters whose values are less than 256. Characters
with greater values always fail to match a class.
</P>
<P>
8. Repeated classes work correctly on multiple characters.
</P>
<P>
9. Classes containing just a single character whose value is greater than 127
(but less than 256), for example, [\x80] or [^\x{93}], do not work because
these are optimized into single byte matches. In the first case, of course,
the class brackets are just redundant.
</P>
<P>
10. Lookbehind assertions move backwards in the subject by a fixed number of
characters instead of a fixed number of bytes. Simple cases have been tested
to work correctly, but there may be hidden gotchas herein.
</P>
<P>
11. The character types such as \d and \w do not work correctly with UTF-8
characters. They continue to test a single byte.
</P>
<P>
12. Anything not explicitly mentioned here continues to work in bytes rather
than in characters.
</P>
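To make the "one to six literal bytes" statement for \x{...} concrete, here is a minimal C sketch of the original UTF-8 encoding scheme, which allows code points up to 0x7FFFFFFF. It is an illustration only, not PCRE's internal code, and the function name utf8_encode is an assumption made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Encode code point cp (0..0x7FFFFFFF) into out[], which must have room
   for at least 6 bytes. Returns the number of bytes written (1 to 6). */
int utf8_encode(unsigned long cp, unsigned char *out)
{
    /* Largest code point that fits in 1, 2, ... 6 bytes. */
    static const unsigned long limit[6] = {
        0x7FUL, 0x7FFUL, 0xFFFFUL, 0x1FFFFFUL, 0x3FFFFFFUL, 0x7FFFFFFFUL
    };
    /* Marker bits for the first byte of a sequence of each length. */
    static const unsigned char lead[6] = {
        0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC
    };
    int n, i;

    assert(out != NULL);
    assert(cp <= 0x7FFFFFFFUL);

    /* n becomes the number of continuation bytes needed. */
    for (n = 0; n < 6; n++)
        if (cp <= limit[n]) break;

    /* Fill continuation bytes from the end, six payload bits each. */
    for (i = n; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (cp & 0x3F));
        cp >>= 6;
    }
    out[0] = (unsigned char)(lead[n] | cp);
    return n + 1;
}
```

With this scheme \x{1234} becomes the three bytes E1 88 B4, and a value such as \xc3, being greater than 127, becomes the two bytes C3 83.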
<P>
The following UTF-8 features of Perl 5.6 are not implemented:
</P>
<P>
1. The escape sequence \C to match a single byte.
</P>
<P>
2. The use of Unicode tables and properties and escapes \p, \P, and \X.
</P>
<LI><A NAME="SEC31" HREF="#TOC1">AUTHOR</A>
<P>
Philip Hazel &#60;ph10@cam.ac.uk&#62;
<BR>
@@ -2253,6 +2388,10 @@ Cambridge CB2 3QG, England.
Phone: +44 1223 334714
</P>
<P>
Last updated: 27 January 2000
Last updated: 28 August 2000,
<BR>
<PRE>
the 250th anniversary of the death of J.S. Bach.
<BR>
</PRE>
Copyright (c) 1997-2000 University of Cambridge.

View File

@@ -28,6 +28,10 @@ SYNOPSIS
int pcre_get_substring_list(const char *subject,
int *ovector, int stringcount, const char ***listptr);
void pcre_free_substring(const char *stringptr);
void pcre_free_substring_list(const char **stringptr);
const unsigned char *pcre_maketables(void);
int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
@@ -48,9 +52,12 @@ DESCRIPTION
The PCRE library is a set of functions that implement regu-
lar expression pattern matching using the same syntax and
semantics as Perl 5, with just a few differences (see
below). The current implementation corresponds to Perl
5.005, with some additional features from the Perl develop-
ment release.
5.005, with some additional features from later versions.
This includes some experimental, incomplete support for
UTF-8 encoded strings. Details of exactly what is and what
is not supported are given below.
PCRE has its own native API, which is described in this
document. There is also a set of wrapper functions that
@@ -67,13 +74,18 @@ DESCRIPTION
releases.
The functions pcre_compile(), pcre_study(), and pcre_exec()
are used for compiling and matching regular expressions,
while pcre_copy_substring(), pcre_get_substring(), and
pcre_get_substring_list() are convenience functions for
are used for compiling and matching regular expressions.
The functions pcre_copy_substring(), pcre_get_substring(),
and pcre_get_substring_list() are convenience functions for
extracting captured substrings from a matched subject
string. The function pcre_maketables() is used (optionally)
to build a set of character tables in the current locale for
passing to pcre_compile().
string; pcre_free_substring() and pcre_free_substring_list()
are also provided, to free the memory used for extracted
strings.
The function pcre_maketables() is used (optionally) to build
a set of character tables in the current locale for passing
to pcre_compile().
The function pcre_fullinfo() is used to find out information
about a compiled pattern; pcre_info() is an obsolete version
@@ -92,10 +104,19 @@ DESCRIPTION
MULTI-THREADING
The PCRE functions can be used in multi-threading applica-
tions, with the proviso that the memory management functions
pointed to by pcre_malloc and pcre_free are shared by all
threads.
The PCRE functions can be used in multi-threading
SunOS 5.8 Last change: 2
applications, with the proviso that the memory management
functions pointed to by pcre_malloc and pcre_free are shared
by all threads.
The compiled form of a regular expression is not altered
during matching, so the same compiled pattern can safely be
@@ -103,7 +124,6 @@ MULTI-THREADING
COMPILING A PATTERN
The function pcre_compile() is called to compile a pattern
into an internal form. The pattern is a C string terminated
@@ -235,12 +255,23 @@ COMPILING A PATTERN
followed by "?". It is not compatible with Perl. It can also
be set by a (?U) option setting within the pattern.
PCRE_UTF8
This option causes PCRE to regard both the pattern and the
subject as strings of UTF-8 characters instead of just byte
strings. However, it is available only if PCRE has been
built to include UTF-8 support. If not, the use of this
option provokes an error. Support for UTF-8 is new, experi-
mental, and incomplete. Details of exactly what it entails
are given below.
STUDYING A PATTERN
When a pattern is going to be used several times, it is
worth spending more time analyzing it in order to speed up
the time taken for matching. The function pcre_study() takes
a pointer to a compiled pattern as its first argument, and
returns a pointer to a pcre_extra block (another void
typedef) containing additional information about the pat-
@@ -344,9 +375,9 @@ INFORMATION ABOUT A PATTERN
PCRE_INFO_BACKREFMAX
Return the number of the highest back reference in the pat-
tern. The fourth argument should point to an int variable.
Zero is returned if there are no back references.
Return the number of the highest back reference in the
pattern. The fourth argument should point to an int vari-
able. Zero is returned if there are no back references.
PCRE_INFO_FIRSTCHAR
@@ -605,6 +636,15 @@ MATCHING A PATTERN
EXTRACTING CAPTURED SUBSTRINGS
Captured substrings can be accessed directly by using the
offsets returned by pcre_exec() in ovector. For convenience,
the functions pcre_copy_substring(), pcre_get_substring(),
and pcre_get_substring_list() are provided for extracting
@@ -631,7 +671,7 @@ EXTRACTING CAPTURED SUBSTRINGS
the entire pattern, while higher values extract the captured
substrings. For pcre_copy_substring(), the string is placed
in buffer, whose length is given by buffersize, while for
pcre_get_substring() a new block of store is obtained via
pcre_get_substring() a new block of memory is obtained via
pcre_malloc, and its address is returned via stringptr. The
yield of the function is the length of the string, not
including the terminating zero, or one of
@@ -665,6 +705,16 @@ EXTRACTING CAPTURED SUBSTRINGS
inspecting the appropriate offset in ovector, which is nega-
tive for unset substrings.
The two convenience functions pcre_free_substring() and
pcre_free_substring_list() can be used to free the memory
returned by a previous call of pcre_get_substring() or
pcre_get_substring_list(), respectively. They do nothing
more than call the function pointed to by pcre_free, which
of course could be called directly from a C program. How-
ever, PCRE is used in some situations where it is linked via
a special interface to another programming language which
cannot use pcre_free directly; it is for these cases that
the functions are provided.
@@ -733,6 +783,7 @@ DIFFERENCES FROM PERL
(?p{code}) constructions. However, there is some experimen-
tal support for recursive patterns using the non-Perl item
(?R).
8. There are at the time of writing some oddities in Perl
5.005_02 concerned with the settings of captured strings
when part of a pattern is repeated. For example, matching
@@ -785,11 +836,17 @@ REGULAR EXPRESSION DETAILS
The syntax and semantics of the regular expressions sup-
ported by PCRE are described below. Regular expressions are
also described in the Perl documentation and in a number of
other books, some of which have copious examples. Jeffrey
Friedl's "Mastering Regular Expressions", published by
O'Reilly (ISBN 1-56592-257), covers them in great detail.
O'Reilly (ISBN 1-56592-257), covers them in great detail.
The description here is intended as reference documentation.
The basic operation of PCRE is on strings of bytes. However,
there are the beginnings of some support for UTF-8 character
strings. To use this support you must configure PCRE to
include it, and then call pcre_compile() with the PCRE_UTF8
option. How this affects the pattern matching is described
in the final section of this document.
A regular expression is a pattern that is matched against a
subject string from left to right. Most characters stand for
@@ -1004,6 +1061,7 @@ CIRCUMFLEX AND DOLLAR
Outside a character class, in the default matching mode, the
circumflex character is an assertion which is true only if
the current matching point is at the start of the subject
string. If the startoffset argument of pcre_exec() is non-
zero, circumflex can never match. Inside a character class,
circumflex has an entirely different meaning (see below).
@@ -1056,6 +1114,7 @@ FULL STOP (PERIOD, DOT)
Outside a character class, a dot in the pattern matches any
one character in the subject, including a non-printing char-
acter, but not (by default) newline. If the PCRE_DOTALL
option is set, dots match newlines as well. The handling of
dot is entirely independent of the handling of circumflex
and dollar, the only relationship being that they both
@@ -1403,7 +1462,7 @@ REPETITION
/* first command */ not comment /* second comment */
fails, because it matches the entire string due to the
fails, because it matches the entire string owing to the
greediness of the .* item.
However, if a quantifier is followed by a question mark, it
@@ -1517,18 +1576,19 @@ BACK REFERENCES
A back reference that occurs inside the parentheses to which
it refers fails when the subpattern is first used, so, for
example, (a\1) never matches. However, such references can
be useful inside repeated subpatterns. For example, the
pattern
be useful inside repeated subpatterns. For example, the pat-
tern
(a|b\1)+
matches any number of "a"s and also "aba", "ababaa" etc. At
matches any number of "a"s and also "aba", "ababbaa" etc. At
each iteration of the subpattern, the back reference matches
the character string corresponding to the previous itera-
tion. In order for this to work, the pattern must be such
that the first iteration does not need to match the back
reference. This can be done using alternation, as in the
example above, or by a quantifier with a minimum of zero.
the character string corresponding to the previous
iteration. In order for this to work, the pattern must be
such that the first iteration does not need to match the
back reference. This can be done using alternation, as in
the example above, or by a quantifier with a minimum of
zero.
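The iteration behaviour of (a|b\1)+ can be traced with a small hand-written matcher. The C sketch below is not how PCRE works internally; it hard-codes this one pattern, anchors it at the start of the subject, and uses a made-up function name. It is intended only to show how each iteration's capture feeds the next back reference.

```c
#include <assert.h>
#include <string.h>

/* Returns the length of the prefix of s matched by (a|b\1)+ when the
   match is anchored at the start of s, or -1 if there is no match. */
int match_a_or_b_backref(const char *s)
{
    const char *p = s;
    const char *cap = NULL;  /* what group 1 matched on the last iteration */
    size_t caplen = 0;

    assert(s != NULL);
    for (;;) {
        if (*p == 'a') {
            /* First alternative: group 1 is now just "a". */
            cap = p;
            caplen = 1;
            p += 1;
        } else if (*p == 'b' && cap != NULL &&
                   strncmp(p + 1, cap, caplen) == 0) {
            /* Second alternative: "b" followed by the previous capture.
               The new capture is one character longer than the old one. */
            size_t n = 1 + caplen;
            cap = p;
            caplen = n;
            p += n;
        } else {
            break;
        }
    }
    return cap == NULL ? -1 : (int)(p - s);
}
```

For "ababbaa" the successive captures are "a", "ba", "bba" and "a", which together consume all seven characters; for "ababaa" the third iteration fails, so only the prefix "aba" is matched.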
@@ -1681,9 +1741,9 @@ ONCE-ONLY SUBPATTERNS
This kind of parenthesis "locks up" the part of the pattern
it contains once it has matched, and a failure further into
the pattern is prevented from backtracking into it. Back-
tracking past it to previous items, however, works as nor-
mal.
the pattern is prevented from backtracking into it.
Backtracking past it to previous items, however, works as
normal.
An alternative description is that a subpattern of this type
matches the string of characters that an identical stan-
@@ -1778,10 +1838,11 @@ CONDITIONAL SUBPATTERNS
There are two kinds of condition. If the text between the
parentheses consists of a sequence of digits, the condition
is satisfied if the capturing subpattern of that number has
previously matched. Consider the following pattern, which
contains non-significant white space to make it more read-
able (assume the PCRE_EXTENDED option) and to divide it into
three parts for ease of discussion:
previously matched. The number must be greater than zero.
Consider the following pattern, which contains non-
significant white space to make it more readable (assume the
PCRE_EXTENDED option) and to divide it into three parts for
ease of discussion:
( \( )? [^()]+ (?(1) \) )
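The logic of this conditional pattern can be written out by hand. The following C sketch is an illustration only: it hard-codes the pattern above, anchors it at the start of the subject, and uses a made-up function name. It requires a closing parenthesis exactly when an opening one was captured.

```c
#include <assert.h>
#include <string.h>

/* Returns the length matched by ( \( )? [^()]+ (?(1) \) ) when anchored
   at the start of s, or -1 if there is no match there. */
int match_optional_parens(const char *s)
{
    const char *p = s;
    int opened = 0;   /* did capturing group 1 match? */
    size_t run;

    assert(s != NULL);

    /* Part 1: optional opening parenthesis, remembered as group 1. */
    if (*p == '(') {
        opened = 1;
        p++;
    }

    /* Part 2: one or more characters that are not parentheses. */
    run = strcspn(p, "()");
    if (run == 0) return -1;
    p += run;

    /* Part 3: the condition. A closing parenthesis is required only if
       group 1 matched; otherwise nothing more is needed. */
    if (opened) {
        if (*p != ')') return -1;
        p++;
    }
    return (int)(p - s);
}
```

An unanchored search, as pcre_exec() performs by default, could still find a match further along a subject such as "(abc"; this sketch only tests position zero.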
@@ -1966,6 +2027,92 @@ PERFORMANCE
UTF-8 SUPPORT
Starting at release 3.3, PCRE has some support for character
strings encoded in the UTF-8 format. This is incomplete, and
is regarded as experimental. In order to use it, you must
configure PCRE to include UTF-8 support in the code, and, in
addition, you must call pcre_compile() with the PCRE_UTF8
option flag. When you do this, both the pattern and any sub-
ject strings that are matched against it are treated as
UTF-8 strings instead of just strings of bytes, but only in
the cases that are mentioned below.
If you compile PCRE with UTF-8 support, but do not use it at
run time, the library will be a bit bigger, but the addi-
tional run time overhead is limited to testing the PCRE_UTF8
flag in several places, so should not be very large.
PCRE assumes that the strings it is given contain valid
UTF-8 codes. It does not diagnose invalid UTF-8 strings. If
you pass invalid UTF-8 strings to PCRE, the results are
undefined.
Running with PCRE_UTF8 set causes these changes in the way
PCRE works:
1. In a pattern, the escape sequence \x{...}, where the con-
tents of the braces is a string of hexadecimal digits, is
interpreted as a UTF-8 character whose code number is the
given hexadecimal number, for example: \x{1234}. This
inserts from one to six literal bytes into the pattern,
using the UTF-8 encoding. If a non-hexadecimal digit appears
between the braces, the item is not recognized.
2. The original hexadecimal escape sequence, \xhh, generates
a two-byte UTF-8 character if its value is greater than 127.
3. Repeat quantifiers are NOT correctly handled if they fol-
low a multibyte character. For example, \x{100}* and \xc3+
do not work. If you want to repeat such characters, you must
enclose them in non-capturing parentheses, for example
(?:\x{100}), at present.
4. The dot metacharacter matches one UTF-8 character instead
of a single byte.
5. Unlike literal UTF-8 characters, the dot metacharacter
followed by a repeat quantifier does operate correctly on
UTF-8 characters instead of single bytes.
6. Although the \x{...} escape is permitted in a character
class, characters whose values are greater than 255 cannot
be included in a class.
7. A class is matched against a UTF-8 character instead of
just a single byte, but it can match only characters whose
values are less than 256. Characters with greater values
always fail to match a class.
8. Repeated classes work correctly on multiple characters.
9. Classes containing just a single character whose value is
greater than 127 (but less than 256), for example, [\x80] or
[^\x{93}], do not work because these are optimized into sin-
gle byte matches. In the first case, of course, the class
brackets are just redundant.
10. Lookbehind assertions move backwards in the subject by a
fixed number of characters instead of a fixed number of
bytes. Simple cases have been tested to work correctly, but
there may be hidden gotchas herein.
11. The character types such as \d and \w do not work
correctly with UTF-8 characters. They continue to test a
single byte.
12. Anything not explicitly mentioned here continues to work
in bytes rather than in characters.
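Matching a dot against one UTF-8 character, or moving a lookbehind back by characters rather than bytes, comes down to stepping over whole byte sequences. The C sketch below shows one way this can be done for valid UTF-8 input; it is not taken from PCRE, and both function names are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Length in bytes of the UTF-8 character whose first byte is c, using
   the original 1 to 6 byte scheme. Assumes c is a valid first byte. */
int utf8_char_len(unsigned char c)
{
    if (c < 0x80) return 1;
    if (c < 0xE0) return 2;
    if (c < 0xF0) return 3;
    if (c < 0xF8) return 4;
    if (c < 0xFC) return 5;
    return 6;
}

/* Move back nchars characters from p without going before start.
   Returns the new position, or NULL if fewer than nchars characters
   precede p. Continuation bytes have the bit pattern 10xxxxxx. */
const unsigned char *utf8_step_back(const unsigned char *start,
                                    const unsigned char *p, int nchars)
{
    assert(start != NULL && p != NULL && p >= start);
    while (nchars-- > 0) {
        if (p == start) return NULL;
        p--;
        while (p > start && (*p & 0xC0) == 0x80) p--;
    }
    return p;
}
```

Because the input is assumed to be valid, neither function checks for malformed sequences, which mirrors the statement above that invalid UTF-8 strings give undefined results.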
The following UTF-8 features of Perl 5.6 are not imple-
mented:
1. The escape sequence \C to match a single byte.
2. The use of Unicode tables and properties and escapes \p,
\P, and \X.
AUTHOR
Philip Hazel <ph10@cam.ac.uk>
University Computing Service,
@@ -1973,5 +2120,6 @@ AUTHOR
Cambridge CB2 3QG, England.
Phone: +44 1223 334714
Last updated: 27 January 2000
Last updated: 28 August 2000,
the 250th anniversary of the death of J.S. Bach.
Copyright (c) 1997-2000 University of Cambridge.

View File

@@ -1,20 +1,20 @@
.TH PGREP 1
.TH PCREGREP 1
.SH NAME
pgrep - a grep with Perl-compatible regular expressions.
pcregrep - a grep with Perl-compatible regular expressions.
.SH SYNOPSIS
.B pgrep [-Vchilnsvx] pattern [file] ...
.B pcregrep [-Vchilnsvx] pattern [file] ...
.SH DESCRIPTION
\fBpgrep\fR searches files for character patterns, in the same way as other
\fBpcregrep\fR searches files for character patterns, in the same way as other
grep commands do, but it uses the PCRE regular expression library to support
patterns that are compatible with the regular expressions of Perl 5. See
\fBpcre(3)\fR for a full description of syntax and semantics.
If no files are specified, \fBpgrep\fR reads the standard input. By default,
If no files are specified, \fBpcregrep\fR reads the standard input. By default,
each line that matches the pattern is copied to the standard output, and if
there is more than one file, the file name is printed before each line of
output. However, there are options that can change how \fBpgrep\fR behaves.
output. However, there are options that can change how \fBpcregrep\fR behaves.
Lines are limited to BUFSIZ characters. BUFSIZ is defined in \fB<stdio.h>\fR.
The newline character is removed from the end of each line before it is matched
@@ -73,4 +73,4 @@ for syntax errors or inacessible files (even if matches were found).
.SH AUTHOR
Philip Hazel <ph10@cam.ac.uk>
.br
Copyright (c) 1997-1999 University of Cambridge.
Copyright (c) 1997-2000 University of Cambridge.

View File

@@ -1,9 +1,9 @@
<HTML>
<HEAD>
<TITLE>pgrep specification</TITLE>
<TITLE>pcregrep specification</TITLE>
</HEAD>
<body bgcolor="#FFFFFF" text="#00005A">
<H1>pgrep specification</H1>
<H1>pcregrep specification</H1>
This HTML document has been generated automatically from the original man page.
If there is any nonsense in it, please consult the man page in case the
conversion went wrong.
@@ -18,24 +18,24 @@ conversion went wrong.
</UL>
<LI><A NAME="SEC1" HREF="#TOC1">NAME</A>
<P>
pgrep - a grep with Perl-compatible regular expressions.
pcregrep - a grep with Perl-compatible regular expressions.
</P>
<LI><A NAME="SEC2" HREF="#TOC1">SYNOPSIS</A>
<P>
<B>pgrep [-Vchilnsvx] pattern [file] ...</B>
<B>pcregrep [-Vchilnsvx] pattern [file] ...</B>
</P>
<LI><A NAME="SEC3" HREF="#TOC1">DESCRIPTION</A>
<P>
<B>pgrep</B> searches files for character patterns, in the same way as other
<B>pcregrep</B> searches files for character patterns, in the same way as other
grep commands do, but it uses the PCRE regular expression library to support
patterns that are compatible with the regular expressions of Perl 5. See
<B>pcre(3)</B> for a full description of syntax and semantics.
</P>
<P>
If no files are specified, <B>pgrep</B> reads the standard input. By default,
If no files are specified, <B>pcregrep</B> reads the standard input. By default,
each line that matches the pattern is copied to the standard output, and if
there is more than one file, the file name is printed before each line of
output. However, there are options that can change how <B>pgrep</B> behaves.
output. However, there are options that can change how <B>pcregrep</B> behaves.
</P>
<P>
Lines are limited to BUFSIZ characters. BUFSIZ is defined in <B>&#60;stdio.h&#62;</B>.
@@ -102,4 +102,4 @@ for syntax errors or inacessible files (even if matches were found).
<P>
Philip Hazel &#60;ph10@cam.ac.uk&#62;
<BR>
Copyright (c) 1997-1999 University of Cambridge.
Copyright (c) 1997-2000 University of Cambridge.

View File

@@ -1,25 +1,26 @@
NAME
pgrep - a grep with Perl-compatible regular expressions.
pcregrep - a grep with Perl-compatible regular expressions.
SYNOPSIS
pgrep [-Vchilnsvx] pattern [file] ...
pcregrep [-Vchilnsvx] pattern [file] ...
DESCRIPTION
pgrep searches files for character patterns, in the same way
as other grep commands do, but it uses the PCRE regular
pcregrep searches files for character patterns, in the same
way as other grep commands do, but it uses the PCRE regular
expression library to support patterns that are compatible
with the regular expressions of Perl 5. See pcre(3) for a
full description of syntax and semantics.
If no files are specified, pgrep reads the standard input.
By default, each line that matches the pattern is copied to
the standard output, and if there is more than one file, the
file name is printed before each line of output. However,
there are options that can change how pgrep behaves.
If no files are specified, pcregrep reads the standard
input. By default, each line that matches the pattern is
copied to the standard output, and if there is more than one
file, the file name is printed before each line of output.
However, there are options that can change how pcregrep
behaves.
Lines are limited to BUFSIZ characters. BUFSIZ is defined in
<stdio.h>. The newline character is removed from the end of
@@ -82,5 +83,5 @@ DIAGNOSTICS
AUTHOR
Philip Hazel <ph10@cam.ac.uk>
Copyright (c) 1997-1999 University of Cambridge.
Copyright (c) 1997-2000 University of Cambridge.

View File

@@ -77,6 +77,14 @@ to the native function.
The PCRE_MULTILINE option is set when the expression is passed for compilation
to the native function.
In the absence of these flags, no options are passed to the native function.
This means that the regex is compiled with PCRE default semantics. In
particular, the way it handles newline characters in the subject string is the
Perl way, not the POSIX way. Note that setting PCRE_MULTILINE has only
\fIsome\fR of the effects specified for REG_NEWLINE. It does not affect the way
newlines are matched by . (they aren't) or a negative class such as [^a] (they
are).
The yield of \fBregcomp()\fR is zero on success, and non-zero otherwise. The
\fIpreg\fR structure is filled in on success, and one member of the structure
is publicized: \fIre_nsub\fR contains the number of capturing subpatterns in
@@ -138,4 +146,4 @@ Cambridge CB2 3QG, England.
.br
Phone: +44 1223 334714
Copyright (c) 1997-1999 University of Cambridge.
Copyright (c) 1997-2000 University of Cambridge.

View File

@@ -107,6 +107,15 @@ The PCRE_MULTILINE option is set when the expression is passed for compilation
to the native function.
</P>
<P>
In the absence of these flags, no options are passed to the native function.
This means that the regex is compiled with PCRE default semantics. In
particular, the way it handles newline characters in the subject string is the
Perl way, not the POSIX way. Note that setting PCRE_MULTILINE has only
<I>some</I> of the effects specified for REG_NEWLINE. It does not affect the way
newlines are matched by . (they aren't) or a negative class such as [^a] (they
are).
</P>
<P>
The yield of <B>regcomp()</B> is zero on success, and non-zero otherwise. The
<I>preg</I> structure is filled in on success, and one member of the structure
is publicized: <I>re_nsub</I> contains the number of capturing subpatterns in
@@ -179,4 +188,4 @@ Cambridge CB2 3QG, England.
Phone: +44 1223 334714
</P>
<P>
Copyright (c) 1997-1999 University of Cambridge.
Copyright (c) 1997-2000 University of Cambridge.

View File

@@ -80,6 +80,15 @@ COMPILING A PATTERN
The PCRE_MULTILINE option is set when the expression is
passed for compilation to the native function.
In the absence of these flags, no options are passed to the
native function. This means that the regex is compiled with
PCRE default semantics. In particular, the way it handles
newline characters in the subject string is the Perl way,
not the POSIX way. Note that setting PCRE_MULTILINE has only
some of the effects specified for REG_NEWLINE. It does not
affect the way newlines are matched by . (they aren't) or a
negative class such as [^a] (they are).
The yield of regcomp() is zero on success, and non-zero oth-
erwise. The preg structure is filled in on success, and one
member of the structure is publicized: re_nsub contains the
@@ -147,4 +156,4 @@ AUTHOR
Cambridge CB2 3QG, England.
Phone: +44 1223 334714
Copyright (c) 1997-1999 University of Cambridge.
Copyright (c) 1997-2000 University of Cambridge.

View File

@@ -43,6 +43,10 @@ backslash, because
is interpreted as the first line of a pattern that starts with "abc/", causing
pcretest to read the next line as a continuation of the regular expression.
PATTERN MODIFIERS
-----------------
The pattern may be followed by i, m, s, or x to set the PCRE_CASELESS,
PCRE_MULTILINE, PCRE_DOTALL, or PCRE_EXTENDED options, respectively. For
example:
@@ -103,37 +107,48 @@ compiled, and the results used when the expression is matched.
The /M modifier causes the size of memory block used to hold the compiled
pattern to be output.
Finally, the /P modifier causes pcretest to call PCRE via the POSIX wrapper API
rather than its native API. When this is done, all other modifiers except /i,
/m, and /+ are ignored. REG_ICASE is set if /i is present, and REG_NEWLINE is
set if /m is present. The wrapper functions force PCRE_DOLLAR_ENDONLY always,
and PCRE_DOTALL unless REG_NEWLINE is set.
The /P modifier causes pcretest to call PCRE via the POSIX wrapper API rather
than its native API. When this is done, all other modifiers except /i, /m, and
/+ are ignored. REG_ICASE is set if /i is present, and REG_NEWLINE is set if /m
is present. The wrapper functions force PCRE_DOLLAR_ENDONLY always, and
PCRE_DOTALL unless REG_NEWLINE is set.
The /8 modifier causes pcretest to call PCRE with the PCRE_UTF8 option set.
This turns on the (currently incomplete) support for UTF-8 character handling
in PCRE, provided that it was compiled with this support enabled. This modifier
also causes any non-printing characters in output strings to be printed using
the \x{hh...} notation if they are valid UTF-8 sequences.
DATA LINES
----------
Before each data line is passed to pcre_exec(), leading and trailing whitespace
is removed, and it is then scanned for \ escapes. The following are recognized:
\a alarm (= BEL)
\b backspace
\e escape
\f formfeed
\n newline
\r carriage return
\t tab
\v vertical tab
\nnn octal character (up to 3 octal digits)
\xhh hexadecimal character (up to 2 hex digits)
\a alarm (= BEL)
\b backspace
\e escape
\f formfeed
\n newline
\r carriage return
\t tab
\v vertical tab
\nnn octal character (up to 3 octal digits)
\xhh hexadecimal character (up to 2 hex digits)
\x{hh...} hexadecimal UTF-8 character
\A pass the PCRE_ANCHORED option to pcre_exec()
\B pass the PCRE_NOTBOL option to pcre_exec()
\Cdd call pcre_copy_substring() for substring dd after a successful match
(any decimal number less than 32)
\Gdd call pcre_get_substring() for substring dd after a successful match
(any decimal number less than 32)
\L call pcre_get_substring_list() after a successful match
\N pass the PCRE_NOTEMPTY option to pcre_exec()
\Odd set the size of the output vector passed to pcre_exec() to dd
(any number of decimal digits)
\Z pass the PCRE_NOTEOL option to pcre_exec()
\A pass the PCRE_ANCHORED option to pcre_exec()
\B pass the PCRE_NOTBOL option to pcre_exec()
\Cdd call pcre_copy_substring() for substring dd after a successful
match (any decimal number less than 32)
\Gdd call pcre_get_substring() for substring dd after a successful
match (any decimal number less than 32)
\L call pcre_get_substring_list() after a successful match
\N pass the PCRE_NOTEMPTY option to pcre_exec()
\Odd set the size of the output vector passed to pcre_exec() to dd
(any number of decimal digits)
\Z pass the PCRE_NOTEOL option to pcre_exec()
A backslash followed by anything else just escapes the anything else. If the
very last character is a backslash, it is ignored. This gives a way of passing
@@ -143,6 +158,15 @@ If /P was present on the regex, causing the POSIX wrapper API to be used, only
\B, and \Z have any effect, causing REG_NOTBOL and REG_NOTEOL to be passed to
regexec() respectively.
The use of \x{hh...} to represent UTF-8 characters is not dependent on the use
of the /8 modifier on the pattern. It is always recognized. There may be any
number of hexadecimal digits inside the braces. The result is from one to six
bytes, encoded according to the UTF-8 rules.
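As a rough illustration of what "from one to six bytes, encoded according to the UTF-8 rules" means, the sketch below encodes a character value using the standard UTF-8 bit layout, with the high-order bits in the first byte. The name utf8_encode and its signature are assumptions made for this example. It is not the ord2utf8() routine from pcretest.c and may not produce the same bytes.

```c
/* Hypothetical helper, not part of PCRE or pcretest.
   Encodes 'value' (0 .. 0x7fffffff) into 'out', which must hold at least
   6 bytes. Returns the number of bytes written, or 0 if 'value' is too
   large to encode. */
int utf8_encode(unsigned long value, unsigned char *out)
{
    /* Largest value that fits with 0, 1, 2, ... 5 continuation bytes */
    static const unsigned long limit[] = {
        0x7fUL, 0x7ffUL, 0xffffUL, 0x1fffffUL, 0x3ffffffUL, 0x7fffffffUL };
    /* Indicator bits for the first byte, indexed by continuation count */
    static const unsigned char lead[] = { 0x00, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc };
    int extra, i;

    for (extra = 0; extra < 6; extra++)
        if (value <= limit[extra]) break;
    if (extra == 6) return 0;

    /* Fill continuation bytes from the end, six bits at a time */
    for (i = extra; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (value & 0x3f));
        value >>= 6;
    }
    out[0] = (unsigned char)(lead[extra] | value);
    return extra + 1;
}
```

Under these assumptions, a data line containing \x{20ac} would correspond to the three bytes e2 82 ac.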
OUTPUT FROM PCRETEST
--------------------
When a match succeeds, pcretest outputs the list of captured substrings that
pcre_exec() returns, starting with number 0 for the string that matched the
whole pattern. Here is an example of an interactive pcretest run.
@@ -158,8 +182,9 @@ whole pattern. Here is an example of an interactive pcretest run.
No match
If the strings contain any non-printing characters, they are output as \xhh
escapes. If the pattern has the /+ modifier, then the output for substring 0 is
followed by the rest of the subject string, identified by "0+" like this:
escapes, or as \x{...} escapes if the /8 modifier was present on the pattern.
If the pattern has the /+ modifier, then the output for substring 0 is followed
by the rest of the subject string, identified by "0+" like this:
re> /cat/+
data> cataract
@@ -190,6 +215,10 @@ Note that while patterns can be continued over several lines (a plain ">"
prompt is used for continuations), data lines may not. However newlines can be
included in data by means of the \n escape.
COMMAND LINE OPTIONS
--------------------
If the -p option is given to pcretest, it is equivalent to adding /P to each
regular expression: the POSIX wrapper API is used to call PCRE. None of the
following flags has any effect in this case.
@@ -208,10 +237,10 @@ a synonym for -m.
If the -t option is given, each compile, study, and match is run 20000 times
while being timed, and the resulting time per compile or match is output in
milliseconds. Do not set -t with -s, because you will then get the size output
milliseconds. Do not set -t with -m, because you will then get the size output
20000 times and the timing will be distorted. If you want to change the number
of repetitions used for timing, edit the definition of LOOPREPEAT at the top of
pcretest.c
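The -t behaviour described above amounts to a repeat-and-average timing loop. The following is a minimal sketch of that idea, assuming clock() is an acceptable timer. The names demo_workload and mean_ms_per_call are hypothetical and do not come from pcretest.c.

```c
#include <time.h>

/* Hypothetical workload standing in for one compile, study, or match. */
void demo_workload(void)
{
    volatile int sink = 0;
    int i;
    for (i = 0; i < 100; i++) sink += i;
}

/* Call fn() 'repeats' times and return the mean CPU time per call,
   in milliseconds. */
double mean_ms_per_call(void (*fn)(void), long repeats)
{
    clock_t start = clock();
    long i;
    for (i = 0; i < repeats; i++) fn();
    return ((double)(clock() - start) * 1000.0 / (double)CLOCKS_PER_SEC)
           / (double)repeats;
}
```

Calling mean_ms_per_call(demo_workload, 20000) mirrors the repeat count quoted above. Anything that prints inside the loop would be repeated just as often and would distort the timing.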
Philip Hazel <ph10@cam.ac.uk>
January 2000
August 2000

View File

@@ -13,11 +13,17 @@ for perltest as well as for pcretest, and the special upper case modifiers such
as /A that pcretest recognizes are not used in these files. The output should
be identical, apart from the initial identifying banner.
For testing UTF-8 features, an alternative form of perltest, called perltest8,
is supplied. This requires Perl 5.6 or higher. It recognizes the special
modifier /8 that pcretest uses to invoke UTF-8 functionality. The testinput5
file can be fed to perltest8.
The testinput2 and testinput4 files are not suitable for feeding to perltest,
since they do make use of the special upper case modifiers and escapes that
pcretest uses to test some features of PCRE. The first of these files also
contains malformed regular expressions, in order to check that PCRE diagnoses
them correctly.
them correctly. Similarly, testinput6 tests UTF-8 features that do not relate
to Perl.
Philip Hazel <ph10@cam.ac.uk>
January 2000
August 2000

View File

@@ -9,7 +9,7 @@ the file Tech.Notes for some information on the internals.
Written by: Philip Hazel <ph10@cam.ac.uk>
Copyright (c) 1997-1999 University of Cambridge
Copyright (c) 1997-2000 University of Cambridge
-----------------------------------------------------------------------------
Permission is granted to anyone to use this software for any purpose on any
@@ -143,6 +143,25 @@ return 0;
/*************************************************
* Free store obtained by get_substring_list *
*************************************************/
/* This function exists for the benefit of people calling PCRE from non-C
programs that can call its functions, but not free() or (pcre_free)() directly.
Argument: the result of a previous pcre_get_substring_list()
Returns: nothing
*/
void
pcre_free_substring_list(const char **pointer)
{
(pcre_free)((void *)pointer);
}
/*************************************************
* Copy captured string to new store *
*************************************************/
@@ -186,4 +205,23 @@ substring[yield] = 0;
return yield;
}
/*************************************************
* Free store obtained by get_substring *
*************************************************/
/* This function exists for the benefit of people calling PCRE from non-C
programs that can call its functions, but not free() or (pcre_free)() directly.
Argument: the result of a previous pcre_get_substring()
Returns: nothing
*/
void
pcre_free_substring(const char *pointer)
{
(pcre_free)((void *)pointer);
}
/* End of get.c */
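The two new free functions are thin wrappers that route the release through the replaceable pcre_free pointer. The sketch below imitates that pattern without linking against PCRE. Every demo_ name is a hypothetical stand-in (demo_malloc and demo_free play the role of pcre_malloc and pcre_free), so treat it as an illustration of the design rather than the library's code.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the replaceable allocator pointers. */
void *(*demo_malloc)(size_t) = malloc;
void (*demo_free)(void *) = free;

/* Counts releases so a caller can see which allocator was used. */
int demo_free_calls = 0;

void demo_counting_free(void *p)
{
    demo_free_calls++;
    free(p);
}

/* Copy subject[start..end) into new store obtained via demo_malloc. */
char *demo_get_substring(const char *subject, int start, int end)
{
    int len = end - start;
    char *copy = (char *)(demo_malloc)((size_t)len + 1);
    if (copy == NULL) return NULL;
    memcpy(copy, subject + start, (size_t)len);
    copy[len] = 0;
    return copy;
}

/* Release store through the same pointer that is paired with the
   allocator, so callers never need to call free() themselves. */
void demo_free_substring(const char *pointer)
{
    (demo_free)((void *)pointer);
}
```

The parentheses around the pointer name follow the style used in get.c. They force a call through the pointer even if a function-like macro of the same name exists.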

View File

@@ -109,7 +109,7 @@ time, run time or study time, respectively. */
#define PUBLIC_OPTIONS \
(PCRE_CASELESS|PCRE_EXTENDED|PCRE_ANCHORED|PCRE_MULTILINE| \
PCRE_DOTALL|PCRE_DOLLAR_ENDONLY|PCRE_EXTRA|PCRE_UNGREEDY)
PCRE_DOTALL|PCRE_DOLLAR_ENDONLY|PCRE_EXTRA|PCRE_UNGREEDY|PCRE_UTF8)
#define PUBLIC_EXEC_OPTIONS \
(PCRE_ANCHORED|PCRE_NOTBOL|PCRE_NOTEOL|PCRE_NOTEMPTY)
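Masks such as PUBLIC_OPTIONS are typically used to reject option words that contain bits the caller is not allowed to set. The sketch below shows that check in isolation. The DEMO_ constants are hypothetical. Their values copy the four option values visible in the pcre.h hunk of this diff, and the two masks are deliberately cut down, so this is not the real PCRE option set.

```c
/* Hypothetical option bits (values as shown in the pcre.h hunk). */
#define DEMO_NOTEOL    0x0100
#define DEMO_UNGREEDY  0x0200
#define DEMO_NOTEMPTY  0x0400
#define DEMO_UTF8      0x0800

/* Cut-down masks: compile-time options versus exec-time options. */
#define DEMO_PUBLIC_OPTIONS       (DEMO_UNGREEDY|DEMO_UTF8)
#define DEMO_PUBLIC_EXEC_OPTIONS  (DEMO_NOTEOL|DEMO_NOTEMPTY)

/* Return 1 if 'options' sets no bit outside 'allowed', otherwise 0. */
int demo_options_ok(int options, int allowed)
{
    return (options & ~allowed) == 0;
}
```

With these definitions, DEMO_UTF8 passes the compile-time mask but fails the exec-time one, which is the effect of adding PCRE_UTF8 to PUBLIC_OPTIONS only.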
@@ -278,6 +278,10 @@ just to accommodate the POSIX wrapper. */
#define ERR29 "(?p must be followed by )"
#define ERR30 "unknown POSIX class name"
#define ERR31 "POSIX collating elements are not supported"
#define ERR32 "this version of PCRE is not compiled with PCRE_UTF8 support"
#define ERR33 "characters with values > 255 are not yet supported in classes"
#define ERR34 "character value in \\x{...} sequence is too large"
#define ERR35 "invalid condition (?(0)"
/* All character handling must be done as unsigned characters. Otherwise there
are problems with top-bit-set characters and functions such as isspace().
@@ -334,6 +338,7 @@ typedef struct match_data {
BOOL offset_overflow; /* Set if too many extractions */
BOOL notbol; /* NOTBOL flag */
BOOL noteol; /* NOTEOL flag */
BOOL utf8; /* UTF8 flag */
BOOL endonly; /* Dollar not before final \n */
BOOL notempty; /* Empty string match not wanted */
const uschar *start_pattern; /* For use when recursing */

View File

@@ -66,6 +66,16 @@ not be set greater than 200. */
#define BRASTACK_SIZE 200
/* The number of bytes in a literal character string above which we can't add
any more is different when UTF-8 characters may be encountered. */
#ifdef SUPPORT_UTF8
#define MAXLIT 250
#else
#define MAXLIT 255
#endif
/* Min and max values for the common repeats; for the maxima, 0 => infinity */
static const char rep_min[] = { 0, 0, 1, 1, 0, 0 };
@@ -176,6 +186,64 @@ void (*pcre_free)(void *) = free;
/*************************************************
* Macros and tables for character handling *
*************************************************/
/* When UTF-8 encoding is being used, a character is no longer just a single
byte. The macros for character handling generate simple sequences when used in
byte-mode, and more complicated ones for UTF-8 characters. */
#ifndef SUPPORT_UTF8
#define GETCHARINC(c, eptr) c = *eptr++;
#define GETCHARLEN(c, eptr, len) c = *eptr;
#define BACKCHAR(eptr)
#else /* SUPPORT_UTF8 */
/* Get the next UTF-8 character, advancing the pointer */
#define GETCHARINC(c, eptr) \
c = *eptr++; \
if (md->utf8 && (c & 0xc0) == 0xc0) \
{ \
int a = utf8_table4[c & 0x3f]; /* Number of additional bytes */ \
int s = 6 - a; /* Amount to shift next byte */ \
c &= utf8_table3[a]; /* Low order bits from first byte */ \
while (a-- > 0) \
{ \
c |= (*eptr++ & 0x3f) << s; \
s += 6; \
} \
}
/* Get the next UTF-8 character, not advancing the pointer, setting length */
#define GETCHARLEN(c, eptr, len) \
c = *eptr; \
len = 1; \
if (md->utf8 && (c & 0xc0) == 0xc0) \
{ \
int i; \
int a = utf8_table4[c & 0x3f]; /* Number of additional bytes */ \
int s = 6 - a; /* Amount to shift next byte */ \
c &= utf8_table3[a]; /* Low order bits from first byte */ \
for (i = 1; i <= a; i++) \
{ \
c |= (eptr[i] & 0x3f) << s; \
s += 6; \
} \
len += a; \
}
/* If the pointer is not at the start of a character, move it back until
it is. */
#define BACKCHAR(eptr) while((*eptr & 0xc0) == 0x80) eptr--;
#endif
/*************************************************
* Default character tables *
@@ -191,6 +259,66 @@ tables. */
#ifdef SUPPORT_UTF8
/*************************************************
* Tables for UTF-8 support *
*************************************************/
/* These are the breakpoints for different numbers of bytes in a UTF-8
character. */
static int utf8_table1[] = { 0x7f, 0x7ff, 0xffff, 0x1fffff, 0x3ffffff, 0x7fffffff};
/* These are the indicator bits and the mask for the data bits to set in the
first byte of a character, indexed by the number of additional bytes. */
static int utf8_table2[] = { 0, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc};
static int utf8_table3[] = { 0xff, 0x1f, 0x0f, 0x07, 0x03, 0x01};
/* Table of the number of extra characters, indexed by the first character
masked with 0x3f. The highest number for a valid UTF-8 character is in fact
0x3d. */
static uschar utf8_table4[] = {
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5 };
/*************************************************
* Convert character value to UTF-8 *
*************************************************/
/* This function takes an integer value in the range 0 - 0x7fffffff
and encodes it as a UTF-8 character in 1 to 6 bytes.
Arguments:
cvalue the character value
buffer pointer to buffer for result - at least 6 bytes long
Returns: number of characters placed in the buffer
*/
static int
ord2utf8(int cvalue, uschar *buffer)
{
register int i, j;
for (i = 0; i < sizeof(utf8_table1)/sizeof(int); i++)
if (cvalue <= utf8_table1[i]) break;
*buffer++ = utf8_table2[i] | (cvalue & utf8_table3[i]);
cvalue >>= 6 - i;
for (j = 0; j < i; j++)
{
*buffer++ = 0x80 | (cvalue & 0x3f);
cvalue >>= 6;
}
return i + 1;
}
#endif
/*************************************************
* Return version string *
*************************************************/
@@ -349,9 +477,9 @@ while (length-- > 0)
/* This function is called when a \ has been encountered. It either returns a
positive value for a simple escape such as \n, or a negative value which
encodes one of the more complicated things such as \d. On entry, ptr is
pointing at the \. On exit, it is on the final character of the escape
sequence.
encodes one of the more complicated things such as \d. When UTF-8 is enabled,
a positive value greater than 255 may be returned. On entry, ptr is pointing at
the \. On exit, it is on the final character of the escape sequence.
Arguments:
ptrptr points to the pattern position pointer
@@ -373,7 +501,9 @@ check_escape(const uschar **ptrptr, const char **errorptr, int bracount,
const uschar *ptr = *ptrptr;
int c, i;
c = *(++ptr) & 255; /* Ensure > 0 on signed-char systems */
/* If backslash is at the end of the pattern, it's an error. */
c = *(++ptr);
if (c == 0) *errorptr = ERR1;
/* Digits or letters may have special meaning; all others are literals. */
@@ -433,18 +563,46 @@ else
}
/* \0 always starts an octal number, but we may drop through to here with a
larger first octal digit */
larger first octal digit. */
case '0':
c -= '0';
while(i++ < 2 && (cd->ctypes[ptr[1]] & ctype_digit) != 0 &&
ptr[1] != '8' && ptr[1] != '9')
c = c * 8 + *(++ptr) - '0';
c &= 255; /* Take least significant 8 bits */
break;
/* Special escapes not starting with a digit are straightforward */
/* \x is complicated when UTF-8 is enabled. \x{ddd} is a character number
which can be greater than 0xff, but only if the ddd are hex digits. */
case 'x':
#ifdef SUPPORT_UTF8
if (ptr[1] == '{' && (options & PCRE_UTF8) != 0)
{
const uschar *pt = ptr + 2;
register int count = 0;
c = 0;
while ((cd->ctypes[*pt] & ctype_xdigit) != 0)
{
count++;
c = c * 16 + cd->lcc[*pt] -
(((cd->ctypes[*pt] & ctype_digit) != 0)? '0' : 'W');
pt++;
}
if (*pt == '}')
{
if (c < 0 || count > 8) *errorptr = ERR34;
ptr = pt;
break;
}
/* If the sequence of hex digits does not end with '}', then we don't
recognize this construct; fall through to the normal \x handling. */
}
#endif
/* Read just a single hex char */
c = 0;
while (i++ < 2 && (cd->ctypes[ptr[1]] & ctype_xdigit) != 0)
{
@@ -454,6 +612,8 @@ else
}
break;
/* Other special escapes not starting with a digit are straightforward */
case 'c':
c = *(++ptr);
if (c == 0)
@@ -591,12 +751,13 @@ if the length is fixed. This is needed for dealing with backward assertions.
Arguments:
code points to the start of the pattern (the bracket)
options the compiling options
Returns: the fixed length, or -1 if there is no fixed length
*/
static int
find_fixedlength(uschar *code)
find_fixedlength(uschar *code, int options)
{
int length = -1;
@@ -617,7 +778,7 @@ for (;;)
case OP_BRA:
case OP_ONCE:
case OP_COND:
d = find_fixedlength(cc);
d = find_fixedlength(cc, options);
if (d < 0) return -1;
branchlength += d;
do cc += (cc[1] << 8) + cc[2]; while (*cc == OP_ALT);
@@ -671,10 +832,17 @@ for (;;)
cc++;
break;
/* Handle char strings */
/* Handle char strings. In UTF-8 mode we must count characters, not bytes.
This requires a scan of the string, unfortunately. We assume valid UTF-8
strings, so all we do is reduce the length by one for each byte whose bits are
10xxxxxx. */
case OP_CHARS:
branchlength += *(++cc);
#ifdef SUPPORT_UTF8
for (d = 1; d <= *cc; d++)
if ((cc[d] & 0xc0) == 0x80) branchlength--;
#endif
cc += *cc + 1;
break;
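The counting rule in the OP_CHARS case (start from the byte count, then subtract one for every 10xxxxxx byte) can be tried out on its own. This is a small sketch, assuming the input is valid UTF-8 as the comment above does. The function name utf8_count_chars is hypothetical.

```c
#include <stddef.h>

/* Count characters in a valid UTF-8 byte string: begin with the number
   of bytes and subtract one for each continuation byte (10xxxxxx). */
size_t utf8_count_chars(const unsigned char *s, size_t nbytes)
{
    size_t count = nbytes;
    size_t i;
    for (i = 0; i < nbytes; i++)
        if ((s[i] & 0xc0) == 0x80) count--;
    return count;
}
```

For the six bytes 61 c3 a9 e2 82 ac this gives three characters, because three of the bytes are continuation bytes.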
@@ -1054,7 +1222,17 @@ for (;; ptr++)
goto FAILED;
}
}
/* Fall through if single character */
/* Fall through if single character, but don't at present allow
chars > 255 in UTF-8 mode. */
#ifdef SUPPORT_UTF8
if (c > 255)
{
*errorptr = ERR33;
goto FAILED;
}
#endif
}
/* A single character may be followed by '-' to form a range. However,
@@ -1074,17 +1252,29 @@ for (;; ptr++)
}
/* The second part of a range can be a single-character escape, but
not any of the other escapes. */
not any of the other escapes. Perl 5.6 treats a hyphen as a literal
in such circumstances. */
if (d == '\\')
{
const uschar *oldptr = ptr;
d = check_escape(&ptr, errorptr, *brackets, options, TRUE, cd);
#ifdef SUPPORT_UTF8
if (d > 255)
{
*errorptr = ERR33;
goto FAILED;
}
#endif
/* \b is backspace; any other special means the '-' was literal */
if (d < 0)
{
if (d == -ESC_b) d = '\b'; else
{
*errorptr = ERR7;
goto FAILED;
ptr = oldptr - 2;
goto SINGLE_CHARACTER; /* A few lines below */
}
}
}
@@ -1112,6 +1302,8 @@ for (;; ptr++)
/* Handle a lone single character - we can get here for a normal
non-escape char, or after \ that introduces a single character. */
SINGLE_CHARACTER:
class [c/8] |= (1 << (c&7));
if ((options & PCRE_CASELESS) != 0)
{
@@ -1562,6 +1754,11 @@ for (;; ptr++)
{
condref = *ptr - '0';
while (*(++ptr) != ')') condref = condref*10 + *ptr - '0';
if (condref == 0)
{
*errorptr = ERR35;
goto FAILED;
}
ptr++;
}
else ptr--;
@@ -1829,6 +2026,20 @@ for (;; ptr++)
tempptr = ptr;
c = check_escape(&ptr, errorptr, *brackets, options, FALSE, cd);
if (c < 0) { ptr = tempptr; break; }
/* If a character is > 127 in UTF-8 mode, we have to turn it into
two or more characters in the UTF-8 encoding. */
#ifdef SUPPORT_UTF8
if (c > 127 && (options & PCRE_UTF8) != 0)
{
uschar buffer[8];
int len = ord2utf8(c, buffer);
for (c = 0; c < len; c++) *code++ = buffer[c];
length += len;
continue;
}
#endif
}
/* Ordinary character or single-char escape */
@@ -1839,7 +2050,7 @@ for (;; ptr++)
/* This "while" is the end of the "do" above. */
while (length < 255 && (cd->ctypes[c = *(++ptr)] & ctype_meta) == 0);
while (length < MAXLIT && (cd->ctypes[c = *(++ptr)] & ctype_meta) == 0);
/* Update the last character and the count of literals */
@@ -1851,7 +2062,7 @@ for (;; ptr++)
the next state. */
previous[1] = length;
if (length < 255) ptr--;
if (length < MAXLIT) ptr--;
break;
}
} /* end of big loop */
@@ -1889,7 +2100,7 @@ Argument:
ptrptr -> the address of the current pattern pointer
errorptr -> pointer to error message
lookbehind TRUE if this is a lookbehind assertion
condref > 0 for OPT_CREF setting at start of conditional group
condref >= 0 for OPT_CREF setting at start of conditional group
reqchar -> place to put the last required character, or a negative number
countlits -> place to put the shortest literal count of any branch
cd points to the data block with tables pointers
@@ -1917,7 +2128,7 @@ code += 3;
/* At the start of a reference-based conditional group, insert the reference
number as an OP_CREF item. */
if (condref > 0)
if (condref >= 0)
{
*code++ = OP_CREF;
*code++ = condref;
@@ -1989,7 +2200,7 @@ for (;;)
if (lookbehind)
{
*code = OP_END;
length = find_fixedlength(last_branch);
length = find_fixedlength(last_branch, options);
DPRINTF(("fixed length = %d\n", length));
if (length < 0)
{
@@ -2280,6 +2491,16 @@ uschar bralenstack[BRASTACK_SIZE];
uschar *code_base, *code_end;
#endif
/* Can't support UTF8 unless PCRE has been compiled to include the code. */
#ifndef SUPPORT_UTF8
if ((options & PCRE_UTF8) != 0)
{
*errorptr = ERR32;
return NULL;
}
#endif
/* We can't pass back an error message if errorptr is NULL; I guess the best we
can do is just return NULL. */
@@ -2775,6 +2996,16 @@ while ((c = *(++ptr)) != 0)
&compile_block);
if (*errorptr != NULL) goto PCRE_ERROR_RETURN;
if (c < 0) { ptr = saveptr; break; }
#ifdef SUPPORT_UTF8
if (c > 127 && (options & PCRE_UTF8) != 0)
{
int i;
for (i = 0; i < sizeof(utf8_table1)/sizeof(int); i++)
if (c <= utf8_table1[i]) break;
runlength += i;
}
#endif
}
/* Ordinary character or single-char escape */
@@ -2784,7 +3015,7 @@ while ((c = *(++ptr)) != 0)
/* This "while" is the end of the "do" above. */
while (runlength < 255 &&
while (runlength < MAXLIT &&
(compile_block.ctypes[c = *(++ptr)] & ctype_meta) == 0);
ptr--;
@@ -3429,10 +3660,21 @@ for (;;)
/* Move the subject pointer back. This occurs only at the start of
each branch of a lookbehind assertion. If we are too close to the start to
move back, this match function fails. */
move back, this match function fails. When working with UTF-8 we move
back a number of characters, not bytes. */
case OP_REVERSE:
#ifdef SUPPORT_UTF8
c = (ecode[1] << 8) + ecode[2];
for (i = 0; i < c; i++)
{
eptr--;
BACKCHAR(eptr)
}
#else
eptr -= (ecode[1] << 8) + ecode[2];
#endif
if (eptr < md->start_subject) return FALSE;
ecode += 3;
break;
@@ -3752,6 +3994,10 @@ for (;;)
if ((ims & PCRE_DOTALL) == 0 && eptr < md->end_subject && *eptr == '\n')
return FALSE;
if (eptr++ >= md->end_subject) return FALSE;
#ifdef SUPPORT_UTF8
if (md->utf8)
while (eptr < md->end_subject && (*eptr & 0xc0) == 0x80) eptr++;
#endif
ecode++;
break;
@@ -3953,7 +4199,13 @@ for (;;)
for (i = 1; i <= min; i++)
{
if (eptr >= md->end_subject) return FALSE;
c = *eptr++;
GETCHARINC(c, eptr) /* Get character; increment eptr */
#ifdef SUPPORT_UTF8
/* We do not yet support class members > 255 */
if (c > 255) return FALSE;
#endif
if ((data[c/8] & (1 << (c&7))) != 0) continue;
return FALSE;
}
@@ -3973,7 +4225,12 @@ for (;;)
if (match(eptr, ecode, offset_top, md, ims, eptrb, 0))
return TRUE;
if (i >= max || eptr >= md->end_subject) return FALSE;
c = *eptr++;
GETCHARINC(c, eptr) /* Get character; increment eptr */
#ifdef SUPPORT_UTF8
/* We do not yet support class members > 255 */
if (c > 255) return FALSE;
#endif
if ((data[c/8] & (1 << (c&7))) != 0) continue;
return FALSE;
}
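The class tests in these hunks index a 32-byte bitmap with c/8 and c&7, which is why any character value above 255 has to be rejected before the lookup. Here is a minimal sketch of such a bitmap, with hypothetical demo_ names that are not taken from pcre.c.

```c
#include <string.h>

/* Hypothetical 256-bit character class: one bit per byte value. */
typedef struct { unsigned char bits[32]; } demo_class;

void demo_class_clear(demo_class *cls)
{
    memset(cls->bits, 0, sizeof cls->bits);
}

/* Add byte value c to the class; values outside 0..255 are ignored. */
void demo_class_add(demo_class *cls, int c)
{
    if (c < 0 || c > 255) return;
    cls->bits[c / 8] |= (unsigned char)(1 << (c & 7));
}

/* Test membership. Values above 255 never match, which mirrors the
   "not yet supported" guard in the hunks above and keeps the index
   inside the 32-byte table. */
int demo_class_has(const demo_class *cls, long c)
{
    if (c < 0 || c > 255) return 0;
    return (cls->bits[c / 8] & (1 << (c & 7))) != 0;
}
```

Without the range check, a decoded UTF-8 character such as 0x20ac would index far beyond the end of the table.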
@@ -3985,17 +4242,29 @@ for (;;)
else
{
const uschar *pp = eptr;
for (i = min; i < max; eptr++, i++)
int len = 1;
for (i = min; i < max; i++)
{
if (eptr >= md->end_subject) break;
c = *eptr;
if ((data[c/8] & (1 << (c&7))) != 0) continue;
break;
GETCHARLEN(c, eptr, len) /* Get character, set length if UTF-8 */
#ifdef SUPPORT_UTF8
/* We do not yet support class members > 255 */
if (c > 255) break;
#endif
if ((data[c/8] & (1 << (c&7))) == 0) break;
eptr += len;
}
while (eptr >= pp)
{
if (match(eptr--, ecode, offset_top, md, ims, eptrb, 0))
return TRUE;
#ifdef SUPPORT_UTF8
BACKCHAR(eptr)
#endif
}
return FALSE;
}
}
@@ -4315,13 +4584,29 @@ for (;;)
/* First, ensure the minimum number of matches are present. Use inline
code for maximizing the speed, and do the type test once at the start
(i.e. keep it out of the loop). Also test that there are at least the
minimum number of characters before we start. */
(i.e. keep it out of the loop). Also we can test that there are at least
the minimum number of bytes before we start, except when doing '.' in
UTF8 mode. Leave the test in place in all cases; in the special case we have
to test after each character. */
if (min > md->end_subject - eptr) return FALSE;
if (min > 0) switch(ctype)
{
case OP_ANY:
#ifdef SUPPORT_UTF8
if (md->utf8)
{
for (i = 1; i <= min; i++)
{
if (eptr >= md->end_subject ||
(*eptr++ == '\n' && (ims & PCRE_DOTALL) == 0))
return FALSE;
while (eptr < md->end_subject && (*eptr & 0xc0) == 0x80) eptr++;
}
break;
}
#endif
/* Non-UTF8 can be faster */
if ((ims & PCRE_DOTALL) == 0)
{ for (i = 1; i <= min; i++) if (*eptr++ == '\n') return FALSE; }
else eptr += min;
@@ -4379,6 +4664,10 @@ for (;;)
{
case OP_ANY:
if ((ims & PCRE_DOTALL) == 0 && c == '\n') return FALSE;
#ifdef SUPPORT_UTF8
if (md->utf8)
while (eptr < md->end_subject && (*eptr & 0xc0) == 0x80) eptr++;
#endif
break;
case OP_NOT_DIGIT:
@@ -4418,6 +4707,33 @@ for (;;)
switch(ctype)
{
case OP_ANY:
/* Special code is required for UTF8, but when the maximum is unlimited
we don't need it. */
#ifdef SUPPORT_UTF8
if (md->utf8 && max < INT_MAX)
{
if ((ims & PCRE_DOTALL) == 0)
{
for (i = min; i < max; i++)
{
if (eptr >= md->end_subject || *eptr++ == '\n') break;
while (eptr < md->end_subject && (*eptr & 0xc0) == 0x80) eptr++;
}
}
else
{
for (i = min; i < max; i++)
{
eptr++;
while (eptr < md->end_subject && (*eptr & 0xc0) == 0x80) eptr++;
}
}
break;
}
#endif
/* Non-UTF8 can be faster */
if ((ims & PCRE_DOTALL) == 0)
{
for (i = min; i < max; i++)
@@ -4490,8 +4806,14 @@ for (;;)
}
while (eptr >= pp)
{
if (match(eptr--, ecode, offset_top, md, ims, eptrb, 0))
return TRUE;
#ifdef SUPPORT_UTF8
if (md->utf8)
while (eptr > pp && (*eptr & 0xc0) == 0x80) eptr--;
#endif
}
return FALSE;
}
/* Control never gets here */
@@ -4572,6 +4894,7 @@ match_block.end_subject = match_block.start_subject + length;
end_subject = match_block.end_subject;
match_block.endonly = (re->options & PCRE_DOLLAR_ENDONLY) != 0;
match_block.utf8 = (re->options & PCRE_UTF8) != 0;
match_block.notbol = (options & PCRE_NOTBOL) != 0;
match_block.noteol = (options & PCRE_NOTEOL) != 0;

View File

@@ -4,14 +4,15 @@
/* Copyright (c) 1997-2000 University of Cambridge */
#ifndef PCRE_H
#define PCRE_H
#ifndef _PCRE_H
#define _PCRE_H
/* The file pcre.h is built by "configure". Do not edit it; instead
make changes to pcre.in. */
#define PCRE_MAJOR 3
#define PCRE_MINOR 1
#define PCRE_DATE 09-Feb-2000
#include "php_compat.h"
#define PCRE_MINOR 4
#define PCRE_DATE 22-Aug-2000
/* Win32 uses DLL by default */
@@ -28,7 +29,6 @@
/* Have to include stdlib.h in order to ensure that size_t is defined;
it is needed here for malloc. */
#include <sys/types.h>
#include <stdlib.h>
/* Allow for C++ users */
@@ -50,6 +50,7 @@ extern "C" {
#define PCRE_NOTEOL 0x0100
#define PCRE_UNGREEDY 0x0200
#define PCRE_NOTEMPTY 0x0400
#define PCRE_UTF8 0x0800
/* Exec-time and get-time error codes */
@@ -88,14 +89,16 @@ PCRE_DL_IMPORT extern void (*pcre_free)(void *);
/* Functions */
extern pcre *pcre_compile(const char *, int, const char **, int *,
const unsigned char *);
extern int pcre_copy_substring(const char *, int *, int, int, char *, int);
extern int pcre_exec(const pcre *, const pcre_extra *, const char *,
int, int, int, int *, int);
extern int pcre_get_substring(const char *, int *, int, int, const char **);
extern int pcre_get_substring_list(const char *, int *, int, const char ***);
extern int pcre_info(const pcre *, int *, int *);
extern int pcre_fullinfo(const pcre *, const pcre_extra *, int, void *);
const unsigned char *);
extern int pcre_copy_substring(const char *, int *, int, int, char *, int);
extern int pcre_exec(const pcre *, const pcre_extra *, const char *,
int, int, int, int *, int);
extern void pcre_free_substring(const char *);
extern void pcre_free_substring_list(const char **);
extern int pcre_get_substring(const char *, int *, int, int, const char **);
extern int pcre_get_substring_list(const char *, int *, int, const char ***);
extern int pcre_info(const pcre *, int *, int *);
extern int pcre_fullinfo(const pcre *, const pcre_extra *, int, void *);
extern unsigned const char *pcre_maketables(void);
extern pcre_extra *pcre_study(const pcre *, int, const char **);
extern const char *pcre_version(void);

View File

@@ -1,7 +1,10 @@
/*************************************************
* PCRE grep program *
* pcregrep program *
*************************************************/
/* This is a grep program that uses the PCRE regular expression library to do
its pattern matching. */
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
@@ -59,7 +62,7 @@ return sys_errlist[n];
*************************************************/
static int
pgrep(FILE *in, char *name)
pcregrep(FILE *in, char *name)
{
int rc = 1;
int linenumber = 0;
@@ -119,7 +122,7 @@ return rc;
static int
usage(int rc)
{
fprintf(stderr, "Usage: pgrep [-Vchilnsvx] pattern [file] ...\n");
fprintf(stderr, "Usage: pcregrep [-Vchilnsvx] pattern [file] ...\n");
return rc;
}
@@ -165,7 +168,7 @@ for (i = 1; i < argc; i++)
break;
default:
fprintf(stderr, "pgrep: unknown option %c\n", s[-1]);
fprintf(stderr, "pcregrep: unknown option %c\n", s[-1]);
return usage(2);
}
}
@@ -180,7 +183,7 @@ if (i >= argc) return usage(0);
pattern = pcre_compile(argv[i++], options, &error, &errptr, NULL);
if (pattern == NULL)
{
fprintf(stderr, "pgrep: error in regex at offset %d: %s\n", errptr, error);
fprintf(stderr, "pcregrep: error in regex at offset %d: %s\n", errptr, error);
return 2;
}
@@ -189,13 +192,13 @@ if (pattern == NULL)
hints = pcre_study(pattern, 0, &error);
if (error != NULL)
{
fprintf(stderr, "pgrep: error while studying regex: %s\n", error);
fprintf(stderr, "pcregrep: error while studying regex: %s\n", error);
return 2;
}
/* If there are no further arguments, do the business on stdin and exit */
if (i >= argc) return pgrep(stdin, NULL);
if (i >= argc) return pcregrep(stdin, NULL);
/* Otherwise, work through the remaining arguments as files. If there is only
one, don't give its name on the output. */
@@ -213,7 +216,7 @@ for (; i < argc; i++)
}
else
{
int frc = pgrep(in, filenames? argv[i] : NULL);
int frc = pcregrep(in, filenames? argv[i] : NULL);
if (frc == 0 && rc == 1) rc = 0;
fclose(in);
}

View File

@@ -80,7 +80,11 @@ static int eint[] = {
REG_BADPAT, /* "assertion expected after (?(" */
REG_BADPAT, /* "(?p must be followed by )" */
REG_ECTYPE, /* "unknown POSIX class name" */
REG_BADPAT /* "POSIX collating elements are not supported" */
REG_BADPAT, /* "POSIX collating elements are not supported" */
REG_INVARG, /* "this version of PCRE is not compiled with PCRE_UTF8 support" */
REG_BADPAT, /* "characters with values > 255 are not yet supported in classes" */
REG_BADPAT, /* "character value in \x{...} sequence is too large" */
REG_BADPAT /* "invalid condition (?(0)" */
};
/* Table of texts corresponding to POSIX error codes */

View File

@@ -4,8 +4,8 @@
/* Copyright (c) 1997-2000 University of Cambridge */
#ifndef PCREPOSIX_H
#define PCREPOSIX_H
#ifndef _PCREPOSIX_H
#define _PCREPOSIX_H
/* This is the header for the POSIX wrapper interface to the PCRE Perl-
Compatible Regular Expression library. It defines the things POSIX says should

View File

@@ -38,6 +38,113 @@ static size_t gotten_store;
static int utf8_table1[] = {
0x0000007f, 0x000007ff, 0x0000ffff, 0x001fffff, 0x03ffffff, 0x7fffffff};
static int utf8_table2[] = {
0, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc};
static int utf8_table3[] = {
0xff, 0x1f, 0x0f, 0x07, 0x03, 0x01};
/*************************************************
* Convert character value to UTF-8 *
*************************************************/
/* This function takes an integer value in the range 0 - 0x7fffffff
and encodes it as a UTF-8 character in 1 to 6 bytes.
Arguments:
cvalue the character value
buffer pointer to buffer for result - at least 6 bytes long
Returns: number of characters placed in the buffer
-1 if input character is negative
0 if input character is positive but too big (only when
int is longer than 32 bits)
*/
static int
ord2utf8(int cvalue, unsigned char *buffer)
{
register int i, j;
for (i = 0; i < sizeof(utf8_table1)/sizeof(int); i++)
if (cvalue <= utf8_table1[i]) break;
if (i >= sizeof(utf8_table1)/sizeof(int)) return 0;
if (cvalue < 0) return -1;
*buffer++ = utf8_table2[i] | (cvalue & utf8_table3[i]);
cvalue >>= 6 - i;
for (j = 0; j < i; j++)
{
*buffer++ = 0x80 | (cvalue & 0x3f);
cvalue >>= 6;
}
return i + 1;
}
/*************************************************
* Convert UTF-8 string to value *
*************************************************/
/* This function takes one or more bytes that represent a UTF-8 character,
and returns the value of the character.
Argument:
buffer a pointer to the byte vector
vptr a pointer to an int to receive the value
Returns: > 0 => the number of bytes consumed
-6 to 0 => malformed UTF-8 character at offset = (-return)
*/
int
utf82ord(unsigned char *buffer, int *vptr)
{
int c = *buffer++;
int d = c;
int i, j, s;
for (i = -1; i < 6; i++) /* i is number of additional bytes */
{
if ((d & 0x80) == 0) break;
d <<= 1;
}
if (i == -1) { *vptr = c; return 1; } /* ascii character */
if (i == 0 || i == 6) return 0; /* invalid UTF-8 */
/* i now has a value in the range 1-5 */
d = c & utf8_table3[i];
s = 6 - i;
for (j = 0; j < i; j++)
{
c = *buffer++;
if ((c & 0xc0) != 0x80) return -(j+1);
d |= (c & 0x3f) << s;
s += 6;
}
/* Check that encoding was the correct unique one */
for (j = 0; j < sizeof(utf8_table1)/sizeof(int); j++)
if (d <= utf8_table1[j]) break;
if (j != i) return -(i+1);
/* Valid value */
*vptr = d;
return i+1;
}
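For comparison with utf82ord() above, the sketch below decodes one character under the standard UTF-8 layout (high-order bits in the first byte) and adds a helper that backs a pointer up to the start of a character, in the manner of the BACKCHAR macro in pcre.c. The names utf8_decode_one and utf8_char_start are assumptions made for this example. Unlike utf82ord(), the sketch does not reject overlong encodings, and it is not a transcription of the code in this diff.

```c
/* Hypothetical helper: decode one character starting at p.
   Stores the value in *value and returns the number of bytes used,
   or 0 if the sequence is malformed. Input must be NUL-terminated or
   long enough to hold the whole sequence. */
int utf8_decode_one(const unsigned char *p, unsigned long *value)
{
    unsigned long c = p[0];
    int extra, i;

    if (c < 0x80) { *value = c; return 1; }        /* plain ASCII */
    if      ((c & 0xe0) == 0xc0) extra = 1;
    else if ((c & 0xf0) == 0xe0) extra = 2;
    else if ((c & 0xf8) == 0xf0) extra = 3;
    else if ((c & 0xfc) == 0xf8) extra = 4;
    else if ((c & 0xfe) == 0xfc) extra = 5;
    else return 0;            /* stray continuation byte, or 0xfe/0xff */

    c &= 0x3fUL >> extra;     /* keep only the data bits of the lead byte */
    for (i = 1; i <= extra; i++) {
        if ((p[i] & 0xc0) != 0x80) return 0;
        c = (c << 6) | (p[i] & 0x3f);
    }
    *value = c;
    return extra + 1;
}

/* Hypothetical helper: move p back until it no longer points at a
   continuation byte, without going before 'start'. */
const unsigned char *utf8_char_start(const unsigned char *p,
                                     const unsigned char *start)
{
    while (p > start && (*p & 0xc0) == 0x80) p--;
    return p;
}
```

Backing up with a lower bound is a design choice of this sketch. It avoids reading before the buffer when the subject begins with a continuation byte.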
/* Debugging function to print the internal form of the regex. This is the same
code as contained in pcre.c under the DEBUG macro. */
@@ -265,14 +372,31 @@ for(;;)
/* Character string printing function. */
/* Character string printing function. A "normal" and a UTF-8 version. */
static void pchars(unsigned char *p, int length)
static void pchars(unsigned char *p, int length, int utf8)
{
int c;
while (length-- > 0)
{
if (utf8)
{
int rc = utf82ord(p, &c);
if (rc > 0)
{
length -= rc - 1;
p += rc;
if (c < 256 && isprint(c)) fprintf(outfile, "%c", c);
else fprintf(outfile, "\\x{%02x}", c);
continue;
}
}
/* Not UTF-8, or malformed UTF-8 */
if (isprint(c = *(p++))) fprintf(outfile, "%c", c);
else fprintf(outfile, "\\x%02x", c);
}
}
@@ -403,6 +527,7 @@ while (!done)
int do_g = 0;
int do_showinfo = showinfo;
int do_showrest = 0;
int utf8 = 0;
int erroroffset, len, delimiter;
if (infile == stdin) printf(" re> ");
@@ -494,6 +619,7 @@ while (!done)
case 'S': do_study = 1; break;
case 'U': options |= PCRE_UNGREEDY; break;
case 'X': options |= PCRE_EXTRA; break;
case '8': options |= PCRE_UTF8; utf8 = 1; break;
case 'L':
ppp = pp;
@@ -633,7 +759,7 @@ while (!done)
if (backrefmax > 0)
fprintf(outfile, "Max back reference = %d\n", backrefmax);
if (options == 0) fprintf(outfile, "No options\n");
else fprintf(outfile, "Options:%s%s%s%s%s%s%s%s\n",
else fprintf(outfile, "Options:%s%s%s%s%s%s%s%s%s\n",
((options & PCRE_ANCHORED) != 0)? " anchored" : "",
((options & PCRE_CASELESS) != 0)? " caseless" : "",
((options & PCRE_EXTENDED) != 0)? " extended" : "",
@@ -641,7 +767,8 @@ while (!done)
((options & PCRE_DOTALL) != 0)? " dotall" : "",
((options & PCRE_DOLLAR_ENDONLY) != 0)? " dollar_endonly" : "",
((options & PCRE_EXTRA) != 0)? " extra" : "",
((options & PCRE_UNGREEDY) != 0)? " ungreedy" : "");
((options & PCRE_UNGREEDY) != 0)? " ungreedy" : "",
((options & PCRE_UTF8) != 0)? " utf8" : "");
if (((((real_pcre *)re)->options) & PCRE_ICHANGED) != 0)
fprintf(outfile, "Case state changes\n");
@@ -796,6 +923,30 @@ while (!done)
break;
case 'x':
/* Handle \x{..} specially - new Perl thing for utf8 */
if (*p == '{')
{
unsigned char *pt = p;
c = 0;
while (isxdigit(*(++pt)))
c = c * 16 + tolower(*pt) - ((isdigit(*pt))? '0' : 'W');
if (*pt == '}')
{
unsigned char buffer[8];
int ii, utn;
utn = ord2utf8(c, buffer);
for (ii = 0; ii < utn - 1; ii++) *q++ = buffer[ii];
c = buffer[ii]; /* Last byte */
p = pt + 1;
break;
}
/* Not correct form; fall through */
}
/* Ordinary \x */
c = 0;
while (i++ < 2 && isxdigit(*p))
{
@@ -876,12 +1027,12 @@ while (!done)
{
fprintf(outfile, "%2d: ", (int)i);
pchars(dbuffer + pmatch[i].rm_so,
pmatch[i].rm_eo - pmatch[i].rm_so);
pmatch[i].rm_eo - pmatch[i].rm_so, utf8);
fprintf(outfile, "\n");
if (i == 0 && do_showrest)
{
fprintf(outfile, " 0+ ");
pchars(dbuffer + pmatch[i].rm_eo, len - pmatch[i].rm_eo);
pchars(dbuffer + pmatch[i].rm_eo, len - pmatch[i].rm_eo, utf8);
fprintf(outfile, "\n");
}
}
@@ -931,14 +1082,14 @@ while (!done)
else
{
fprintf(outfile, "%2d: ", i/2);
pchars(bptr + offsets[i], offsets[i+1] - offsets[i]);
pchars(bptr + offsets[i], offsets[i+1] - offsets[i], utf8);
fprintf(outfile, "\n");
if (i == 0)
{
if (do_showrest)
{
fprintf(outfile, " 0+ ");
pchars(bptr + offsets[i+1], len - offsets[i+1]);
pchars(bptr + offsets[i+1], len - offsets[i+1], utf8);
fprintf(outfile, "\n");
}
}
@@ -971,7 +1122,8 @@ while (!done)
else
{
fprintf(outfile, "%2dG %s (%d)\n", i, substring, rc);
free((void *)substring);
/* free((void *)substring); */
pcre_free_substring(substring);
}
}
}
@@ -989,7 +1141,8 @@ while (!done)
fprintf(outfile, "%2dL %s\n", i, stringlist[i]);
if (stringlist[i] != NULL)
fprintf(outfile, "string list not terminated by NULL\n");
free((void *)stringlist);
/* free((void *)stringlist); */
pcre_free_substring_list(stringlist);
}
}
}
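The \x{..} handling added to the data-line parser in this file converts hex digits with the expression tolower(*pt) - (isdigit(*pt) ? '0' : 'W'). The constant 'W' works because, in ASCII, 'W' equals 'a' - 10, so 'a' to 'f' map to 10 to 15. A minimal standalone sketch of that parsing step follows. The name parse_braced_hex_sketch, its signature, and the requirement of at least one digit are assumptions for illustration (the code in the diff accepts an empty \x{} as zero), and overflow of very long digit strings is not checked.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not part of pcretest.c.
   p must point at the '{' that follows "\x". On success the parsed value
   is stored in *value and a pointer just past the '}' is returned.
   Returns NULL if the text is not of the form {hexdigits}. */
const char *parse_braced_hex_sketch(const char *p, unsigned long *value)
{
    unsigned long v = 0;
    const char *pt;

    assert(p != NULL && value != NULL);
    if (*p != '{') return NULL;

    pt = p + 1;
    if (!isxdigit((unsigned char)*pt)) return NULL;   /* need one digit */

    while (isxdigit((unsigned char)*pt))
    {
        int ch = tolower((unsigned char)*pt);
        /* 'W' is 'a' - 10 in ASCII, so 'a'..'f' become 10..15. */
        v = v * 16 + (unsigned long)(ch - (isdigit(ch) ? '0' : 'W'));
        pt++;
    }

    if (*pt != '}') return NULL;                      /* not the braced form */

    *value = v;
    return pt + 1;
}
```

A caller that gets NULL back can fall through to the ordinary two-digit \x handling, as the code in the diff does.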


@@ -9,7 +9,7 @@
sub pchars {
my($t) = "";
foreach $c (split(//, @_[0]))
foreach $c (split(//, $_[0]))
{
if (ord $c >= 32 && ord $c < 127) { $t .= $c; }
else { $t .= sprintf("\\x%02x", ord $c); }

ext/pcre/pcrelib/perltest8 Executable file

@@ -0,0 +1,208 @@
#! /usr/bin/perl
# Program for testing regular expressions with perl to check that PCRE handles
# them the same. This is the version that supports /8 for UTF-8 testing. It
# requires at least Perl 5.6.
# Function for turning a string into a string of printing chars. There are
# currently problems with UTF-8 strings; this fudges round them.
sub pchars {
my($t) = "";
if ($utf8)
{
use utf8;
@p = unpack('U*', $_[0]);
foreach $c (@p)
{
if ($c >= 32 && $c < 127) { $t .= chr $c; }
else { $t .= sprintf("\\x{%02x}", $c); }
}
}
else
{
foreach $c (split(//, $_[0]))
{
if (ord $c >= 32 && ord $c < 127) { $t .= $c; }
else { $t .= sprintf("\\x%02x", ord $c); }
}
}
$t;
}
# Read lines from named file or stdin and write to named file or stdout; lines
# consist of a regular expression, in delimiters and optionally followed by
# options, followed by a set of test data, terminated by an empty line.
# Sort out the input and output files
if (@ARGV > 0)
{
open(INFILE, "<$ARGV[0]") || die "Failed to open $ARGV[0]\n";
$infile = "INFILE";
}
else { $infile = "STDIN"; }
if (@ARGV > 1)
{
open(OUTFILE, ">$ARGV[1]") || die "Failed to open $ARGV[1]\n";
$outfile = "OUTFILE";
}
else { $outfile = "STDOUT"; }
printf($outfile "Perl $] Regular Expressions\n\n");
# Main loop
NEXT_RE:
for (;;)
{
printf " re> " if $infile eq "STDIN";
last if ! ($_ = <$infile>);
printf $outfile "$_" if $infile ne "STDIN";
next if ($_ eq "");
$pattern = $_;
while ($pattern !~ /^\s*(.).*\1/s)
{
printf " > " if $infile eq "STDIN";
last if ! ($_ = <$infile>);
printf $outfile "$_" if $infile ne "STDIN";
$pattern .= $_;
}
chomp($pattern);
$pattern =~ s/\s+$//;
# The private /+ modifier means "print $' afterwards".
$showrest = ($pattern =~ s/\+(?=[a-z]*$)//);
# The private /8 modifier means "operate in UTF-8". Currently, Perl
# has bugs that we try to work around using this flag.
$utf8 = ($pattern =~ s/8(?=[a-z]*$)//);
# Check that the pattern is valid
if ($utf8)
{
use utf8;
eval "\$_ =~ ${pattern}";
}
else
{
eval "\$_ =~ ${pattern}";
}
if ($@)
{
printf $outfile "Error: $@";
next NEXT_RE;
}
# If the /g modifier is present, we want to put a loop round the matching;
# otherwise just a single "if".
$cmd = ($pattern =~ /g[a-z]*$/)? "while" : "if";
# If the pattern is actually the null string, Perl uses the most recently
# executed (and successfully compiled) regex instead. This is a
# nasty trap for the unwary! The PCRE test suite does contain null strings
# in places - if they are allowed through here all sorts of weird and
# unexpected effects happen. To avoid this, we replace such patterns with
# a non-null pattern that has the same effect.
$pattern = "/(?#)/$2" if ($pattern =~ /^(.)\1(.*)$/);
# Read data lines and test them
for (;;)
{
printf "data> " if $infile eq "STDIN";
last NEXT_RE if ! ($_ = <$infile>);
chomp;
printf $outfile "$_\n" if $infile ne "STDIN";
s/\s+$//;
s/^\s+//;
last if ($_ eq "");
$x = eval "\"$_\""; # To get escapes processed
# Empty array for holding results, then do the matching.
@subs = ();
$pushes = "push \@subs,\$&;" .
"push \@subs,\$1;" .
"push \@subs,\$2;" .
"push \@subs,\$3;" .
"push \@subs,\$4;" .
"push \@subs,\$5;" .
"push \@subs,\$6;" .
"push \@subs,\$7;" .
"push \@subs,\$8;" .
"push \@subs,\$9;" .
"push \@subs,\$10;" .
"push \@subs,\$11;" .
"push \@subs,\$12;" .
"push \@subs,\$13;" .
"push \@subs,\$14;" .
"push \@subs,\$15;" .
"push \@subs,\$16;" .
"push \@subs,\$'; }";
if ($utf8)
{
use utf8;
eval "${cmd} (\$x =~ ${pattern}) {" . $pushes;
}
else
{
eval "${cmd} (\$x =~ ${pattern}) {" . $pushes;
}
if ($@)
{
printf $outfile "Error: $@\n";
next NEXT_RE;
}
elsif (scalar(@subs) == 0)
{
printf $outfile "No match\n";
}
else
{
while (scalar(@subs) != 0)
{
printf $outfile (" 0: %s\n", &pchars($subs[0]));
printf $outfile (" 0+ %s\n", &pchars($subs[17])) if $showrest;
$last_printed = 0;
for ($i = 1; $i <= 16; $i++)
{
if (defined $subs[$i])
{
while ($last_printed++ < $i-1)
{ printf $outfile ("%2d: <unset>\n", $last_printed); }
printf $outfile ("%2d: %s\n", $i, &pchars($subs[$i]));
$last_printed = $i;
}
}
splice(@subs, 0, 18);
}
}
}
}
printf $outfile "\n";
# End


@@ -1899,4 +1899,24 @@
//g
abc
/ End of test input /
/<tr([\w\W\s\d][^<>]{0,})><TD([\w\W\s\d][^<>]{0,})>([\d]{0,}\.)(.*)((<BR>([\w\W\s\d][^<>]{0,})|[\s]{0,}))<\/a><\/TD><TD([\w\W\s\d][^<>]{0,})>([\w\W\s\d][^<>]{0,})<\/TD><TD([\w\W\s\d][^<>]{0,})>([\w\W\s\d][^<>]{0,})<\/TD><\/TR>/is
<TR BGCOLOR='#DBE9E9'><TD align=left valign=top>43.<a href='joblist.cfm?JobID=94 6735&Keyword='>Word Processor<BR>(N-1286)</a></TD><TD align=left valign=top>Lega lstaff.com</TD><TD align=left valign=top>CA - Statewide</TD></TR>
/a[^a]b/
acb
a\nb
/a.b/
acb
*** Failers
a\nb
/a[^a]b/s
acb
a\nb
/a.b/s
acb
a\nb
/ End of testinput1 /


@@ -40,8 +40,6 @@
/[\B]/
/[a-\w]/
/[z-a]/
/^*/
@@ -707,4 +705,8 @@
Ab
AB
/ End of test input /
/[\200-\410]/
/^(?(0)f|b)oo/
/ End of testinput2 /


@@ -1707,4 +1707,18 @@
/a*/g
abbab
/ End of test input /
/^[a-\d]/
abcde
-things
0digit
*** Failers
bcdef
/^[\d-a]/
abcde
-things
0digit
*** Failers
bcdef
/ End of testinput3 /


@@ -62,3 +62,4 @@
*** Failers
école
/ End of testinput4 /

ext/pcre/pcrelib/testdata/testinput5 vendored Normal file

@@ -0,0 +1,118 @@
/-- Because of problems with Perl 5.6 in handling UTF-8 vs non UTF-8 --/
/-- strings automatically, do not use the \x{} construct except with --/
/-- patterns that have the /8 option set, and don't use them without! --/
/a.b/8
acb
a\x7fb
a\x{100}b
*** Failers
a\nb
/a(.{3})b/8
a\x{4000}xyb
a\x{4000}\x7fyb
a\x{4000}\x{100}yb
*** Failers
a\x{4000}b
ac\ncb
/a(.*?)(.)/
a\xc0\x88b
/a(.*?)(.)/8
a\x{100}b
/a(.*)(.)/
a\xc0\x88b
/a(.*)(.)/8
a\x{100}b
/a(.)(.)/
a\xc0\x92bcd
/a(.)(.)/8
a\x{240}bcd
/a(.?)(.)/
a\xc0\x92bcd
/a(.?)(.)/8
a\x{240}bcd
/a(.??)(.)/
a\xc0\x92bcd
/a(.??)(.)/8
a\x{240}bcd
/a(.{3})b/8
a\x{1234}xyb
a\x{1234}\x{4321}yb
a\x{1234}\x{4321}\x{3412}b
*** Failers
a\x{1234}b
ac\ncb
/a(.{3,})b/8
a\x{1234}xyb
a\x{1234}\x{4321}yb
a\x{1234}\x{4321}\x{3412}b
axxxxbcdefghijb
a\x{1234}\x{4321}\x{3412}\x{3421}b
*** Failers
a\x{1234}b
/a(.{3,}?)b/8
a\x{1234}xyb
a\x{1234}\x{4321}yb
a\x{1234}\x{4321}\x{3412}b
axxxxbcdefghijb
a\x{1234}\x{4321}\x{3412}\x{3421}b
*** Failers
a\x{1234}b
/a(.{3,5})b/8
a\x{1234}xyb
a\x{1234}\x{4321}yb
a\x{1234}\x{4321}\x{3412}b
axxxxbcdefghijb
a\x{1234}\x{4321}\x{3412}\x{3421}b
axbxxbcdefghijb
axxxxxbcdefghijb
*** Failers
a\x{1234}b
axxxxxxbcdefghijb
/a(.{3,5}?)b/8
a\x{1234}xyb
a\x{1234}\x{4321}yb
a\x{1234}\x{4321}\x{3412}b
axxxxbcdefghijb
a\x{1234}\x{4321}\x{3412}\x{3421}b
axbxxbcdefghijb
axxxxxbcdefghijb
*** Failers
a\x{1234}b
axxxxxxbcdefghijb
/^[a\x{c0}]/8
*** Failers
\x{100}
/(?<=aXb)cd/8
aXbcd
/(?<=a\x{100}b)cd/8
a\x{100}bcd
/(?<=a\x{100000}b)cd/8
a\x{100000}bcd
/(?:\x{100}){3}b/8
\x{100}\x{100}\x{100}b
*** Failers
\x{100}\x{100}b
/ End of testinput5 /

ext/pcre/pcrelib/testdata/testinput6 vendored Normal file

@@ -0,0 +1,52 @@
/\x{100}/8DM
/\x{1000}/8DM
/\x{10000}/8DM
/\x{100000}/8DM
/\x{1000000}/8DM
/\x{4000000}/8DM
/\x{7fffFFFF}/8DM
/[\x{ff}]/8DM
/[\x{100}]/8DM
/\x{ffffffff}/8
/\x{100000000}/8
/^\x{100}a\x{1234}/8
\x{100}a\x{1234}bcd
/\x80/8D
/\xff/8D
/-- These tests are here rather than in testinput5 because Perl 5.6 has --/
/-- some problems with UTF-8 support, in the area of \x{..} where the --/
/-- value is < 255. It grumbles about invalid UTF-8 strings. --/
/^[a\x{c0}]b/8
\x{c0}b
/^([a\x{c0}]*?)aa/8
a\x{c0}aaaa/
/^([a\x{c0}]*?)aa/8
a\x{c0}aaaa/
a\x{c0}a\x{c0}aaa/
/^([a\x{c0}]*)aa/8
a\x{c0}aaaa/
a\x{c0}a\x{c0}aaa/
/^([a\x{c0}]*)a\x{c0}/8
a\x{c0}aaaa/
a\x{c0}a\x{c0}aaa/
/ End of testinput6 /


@@ -1,4 +1,4 @@
PCRE version 3.2 12-May-2000
PCRE version 3.4 22-Aug-2000
/the quick brown fox/
the quick brown fox
@@ -2921,5 +2921,46 @@ No match
0:
0:
/ End of test input /
/<tr([\w\W\s\d][^<>]{0,})><TD([\w\W\s\d][^<>]{0,})>([\d]{0,}\.)(.*)((<BR>([\w\W\s\d][^<>]{0,})|[\s]{0,}))<\/a><\/TD><TD([\w\W\s\d][^<>]{0,})>([\w\W\s\d][^<>]{0,})<\/TD><TD([\w\W\s\d][^<>]{0,})>([\w\W\s\d][^<>]{0,})<\/TD><\/TR>/is
<TR BGCOLOR='#DBE9E9'><TD align=left valign=top>43.<a href='joblist.cfm?JobID=94 6735&Keyword='>Word Processor<BR>(N-1286)</a></TD><TD align=left valign=top>Lega lstaff.com</TD><TD align=left valign=top>CA - Statewide</TD></TR>
0: <TR BGCOLOR='#DBE9E9'><TD align=left valign=top>43.<a href='joblist.cfm?JobID=94 6735&Keyword='>Word Processor<BR>(N-1286)</a></TD><TD align=left valign=top>Lega lstaff.com</TD><TD align=left valign=top>CA - Statewide</TD></TR>
1: BGCOLOR='#DBE9E9'
2: align=left valign=top
3: 43.
4: <a href='joblist.cfm?JobID=94 6735&Keyword='>Word Processor<BR>(N-1286)
5:
6:
7: <unset>
8: align=left valign=top
9: Lega lstaff.com
10: align=left valign=top
11: CA - Statewide
/a[^a]b/
acb
0: acb
a\nb
0: a\x0ab
/a.b/
acb
0: acb
*** Failers
No match
a\nb
No match
/a[^a]b/s
acb
0: acb
a\nb
0: a\x0ab
/a.b/s
acb
0: acb
a\nb
0: a\x0ab
/ End of testinput1 /


@@ -1,4 +1,4 @@
PCRE version 3.2 12-May-2000
PCRE version 3.4 22-Aug-2000
/(a)b|/
Capturing subpattern count = 1
@@ -94,9 +94,6 @@ Failed: missing terminating ] for character class at offset 5
/[\B]/
Failed: invalid escape sequence in character class at offset 2
/[a-\w]/
Failed: invalid escape sequence in character class at offset 4
/[z-a]/
Failed: range out of order in character class at offset 3
@@ -2064,7 +2061,13 @@ No match
AB
No match
/ End of test input /
/[\200-\410]/
Failed: range out of order in character class at offset 9
/^(?(0)f|b)oo/
Failed: invalid condition (?(0) at offset 5
/ End of testinput2 /
Capturing subpattern count = 0
No options
First char = ' '


@@ -1,4 +1,4 @@
PCRE version 3.2 12-May-2000
PCRE version 3.4 22-Aug-2000
/(?<!bar)foo/
foo
@@ -2963,5 +2963,29 @@ No match
0:
0:
/ End of test input /
/^[a-\d]/
abcde
0: a
-things
0: -
0digit
0: 0
*** Failers
No match
bcdef
No match
/^[\d-a]/
abcde
0: a
-things
0: -
0digit
0: 0
*** Failers
No match
bcdef
No match
/ End of testinput3 /


@@ -1,4 +1,4 @@
PCRE version 3.2 12-May-2000
PCRE version 3.4 22-Aug-2000
/^[\w]+/
*** Failers
@@ -112,4 +112,5 @@ No match
école
No match
/ End of testinput4 /

ext/pcre/pcrelib/testdata/testoutput5 vendored Normal file

@@ -0,0 +1,242 @@
PCRE version 3.4 22-Aug-2000
/-- Because of problems with Perl 5.6 in handling UTF-8 vs non UTF-8 --/
/-- strings automatically, do not use the \x{} construct except with --/
No match
/-- patterns that have the /8 option set, and don't use them without! --/
No match
/a.b/8
acb
0: acb
a\x7fb
0: a\x{7f}b
a\x{100}b
0: a\x{100}b
*** Failers
No match
a\nb
No match
/a(.{3})b/8
a\x{4000}xyb
0: a\x{4000}xyb
1: \x{4000}xy
a\x{4000}\x7fyb
0: a\x{4000}\x{7f}yb
1: \x{4000}\x{7f}y
a\x{4000}\x{100}yb
0: a\x{4000}\x{100}yb
1: \x{4000}\x{100}y
*** Failers
No match
a\x{4000}b
No match
ac\ncb
No match
/a(.*?)(.)/
a\xc0\x88b
0: a\xc0
1:
2: \xc0
/a(.*?)(.)/8
a\x{100}b
0: a\x{100}
1:
2: \x{100}
/a(.*)(.)/
a\xc0\x88b
0: a\xc0\x88b
1: \xc0\x88
2: b
/a(.*)(.)/8
a\x{100}b
0: a\x{100}b
1: \x{100}
2: b
/a(.)(.)/
a\xc0\x92bcd
0: a\xc0\x92
1: \xc0
2: \x92
/a(.)(.)/8
a\x{240}bcd
0: a\x{240}b
1: \x{240}
2: b
/a(.?)(.)/
a\xc0\x92bcd
0: a\xc0\x92
1: \xc0
2: \x92
/a(.?)(.)/8
a\x{240}bcd
0: a\x{240}b
1: \x{240}
2: b
/a(.??)(.)/
a\xc0\x92bcd
0: a\xc0
1:
2: \xc0
/a(.??)(.)/8
a\x{240}bcd
0: a\x{240}
1:
2: \x{240}
/a(.{3})b/8
a\x{1234}xyb
0: a\x{1234}xyb
1: \x{1234}xy
a\x{1234}\x{4321}yb
0: a\x{1234}\x{4321}yb
1: \x{1234}\x{4321}y
a\x{1234}\x{4321}\x{3412}b
0: a\x{1234}\x{4321}\x{3412}b
1: \x{1234}\x{4321}\x{3412}
*** Failers
No match
a\x{1234}b
No match
ac\ncb
No match
/a(.{3,})b/8
a\x{1234}xyb
0: a\x{1234}xyb
1: \x{1234}xy
a\x{1234}\x{4321}yb
0: a\x{1234}\x{4321}yb
1: \x{1234}\x{4321}y
a\x{1234}\x{4321}\x{3412}b
0: a\x{1234}\x{4321}\x{3412}b
1: \x{1234}\x{4321}\x{3412}
axxxxbcdefghijb
0: axxxxbcdefghijb
1: xxxxbcdefghij
a\x{1234}\x{4321}\x{3412}\x{3421}b
0: a\x{1234}\x{4321}\x{3412}\x{3421}b
1: \x{1234}\x{4321}\x{3412}\x{3421}
*** Failers
No match
a\x{1234}b
No match
/a(.{3,}?)b/8
a\x{1234}xyb
0: a\x{1234}xyb
1: \x{1234}xy
a\x{1234}\x{4321}yb
0: a\x{1234}\x{4321}yb
1: \x{1234}\x{4321}y
a\x{1234}\x{4321}\x{3412}b
0: a\x{1234}\x{4321}\x{3412}b
1: \x{1234}\x{4321}\x{3412}
axxxxbcdefghijb
0: axxxxb
1: xxxx
a\x{1234}\x{4321}\x{3412}\x{3421}b
0: a\x{1234}\x{4321}\x{3412}\x{3421}b
1: \x{1234}\x{4321}\x{3412}\x{3421}
*** Failers
No match
a\x{1234}b
No match
/a(.{3,5})b/8
a\x{1234}xyb
0: a\x{1234}xyb
1: \x{1234}xy
a\x{1234}\x{4321}yb
0: a\x{1234}\x{4321}yb
1: \x{1234}\x{4321}y
a\x{1234}\x{4321}\x{3412}b
0: a\x{1234}\x{4321}\x{3412}b
1: \x{1234}\x{4321}\x{3412}
axxxxbcdefghijb
0: axxxxb
1: xxxx
a\x{1234}\x{4321}\x{3412}\x{3421}b
0: a\x{1234}\x{4321}\x{3412}\x{3421}b
1: \x{1234}\x{4321}\x{3412}\x{3421}
axbxxbcdefghijb
0: axbxxb
1: xbxx
axxxxxbcdefghijb
0: axxxxxb
1: xxxxx
*** Failers
No match
a\x{1234}b
No match
axxxxxxbcdefghijb
No match
/a(.{3,5}?)b/8
a\x{1234}xyb
0: a\x{1234}xyb
1: \x{1234}xy
a\x{1234}\x{4321}yb
0: a\x{1234}\x{4321}yb
1: \x{1234}\x{4321}y
a\x{1234}\x{4321}\x{3412}b
0: a\x{1234}\x{4321}\x{3412}b
1: \x{1234}\x{4321}\x{3412}
axxxxbcdefghijb
0: axxxxb
1: xxxx
a\x{1234}\x{4321}\x{3412}\x{3421}b
0: a\x{1234}\x{4321}\x{3412}\x{3421}b
1: \x{1234}\x{4321}\x{3412}\x{3421}
axbxxbcdefghijb
0: axbxxb
1: xbxx
axxxxxbcdefghijb
0: axxxxxb
1: xxxxx
*** Failers
No match
a\x{1234}b
No match
axxxxxxbcdefghijb
No match
/^[a\x{c0}]/8
*** Failers
No match
\x{100}
No match
/(?<=aXb)cd/8
aXbcd
0: cd
/(?<=a\x{100}b)cd/8
a\x{100}bcd
0: cd
/(?<=a\x{100000}b)cd/8
a\x{100000}bcd
0: cd
/(?:\x{100}){3}b/8
\x{100}\x{100}\x{100}b
0: \x{100}\x{100}\x{100}b
*** Failers
No match
\x{100}\x{100}b
No match
/ End of testinput5 /

ext/pcre/pcrelib/testdata/testoutput6 vendored Normal file

@@ -0,0 +1,185 @@
PCRE version 3.4 22-Aug-2000
/\x{100}/8DM
Memory allocation (code space): 11
------------------------------------------------------------------
0 7 Bra 0
3 2 \xc0\x88
7 7 Ket
10 End
------------------------------------------------------------------
Capturing subpattern count = 0
Options: utf8
First char = 192
Need char = 136
/\x{1000}/8DM
Memory allocation (code space): 12
------------------------------------------------------------------
0 8 Bra 0
3 3 \xe0\x80\x84
8 8 Ket
11 End
------------------------------------------------------------------
Capturing subpattern count = 0
Options: utf8
First char = 224
Need char = 132
/\x{10000}/8DM
Memory allocation (code space): 13
------------------------------------------------------------------
0 9 Bra 0
3 4 \xf0\x80\x80\x82
9 9 Ket
12 End
------------------------------------------------------------------
Capturing subpattern count = 0
Options: utf8
First char = 240
Need char = 130
/\x{100000}/8DM
Memory allocation (code space): 13
------------------------------------------------------------------
0 9 Bra 0
3 4 \xf0\x80\x80\xa0
9 9 Ket
12 End
------------------------------------------------------------------
Capturing subpattern count = 0
Options: utf8
First char = 240
Need char = 160
/\x{1000000}/8DM
Memory allocation (code space): 14
------------------------------------------------------------------
0 10 Bra 0
3 5 \xf8\x80\x80\x80\x90
10 10 Ket
13 End
------------------------------------------------------------------
Capturing subpattern count = 0
Options: utf8
First char = 248
Need char = 144
/\x{4000000}/8DM
Memory allocation (code space): 15
------------------------------------------------------------------
0 11 Bra 0
3 6 \xfc\x80\x80\x80\x80\x82
11 11 Ket
14 End
------------------------------------------------------------------
Capturing subpattern count = 0
Options: utf8
First char = 252
Need char = 130
/\x{7fffFFFF}/8DM
Memory allocation (code space): 15
------------------------------------------------------------------
0 11 Bra 0
3 6 \xfd\xbf\xbf\xbf\xbf\xbf
11 11 Ket
14 End
------------------------------------------------------------------
Capturing subpattern count = 0
Options: utf8
First char = 253
Need char = 191
/[\x{ff}]/8DM
Memory allocation (code space): 40
------------------------------------------------------------------
0 6 Bra 0
3 1 \xff
6 6 Ket
9 End
------------------------------------------------------------------
Capturing subpattern count = 0
Options: utf8
First char = 255
No need char
/[\x{100}]/8DM
Memory allocation (code space): 40
Failed: characters with values > 255 are not yet supported in classes at offset 7
/\x{ffffffff}/8
Failed: character value in \x{...} sequence is too large at offset 11
/\x{100000000}/8
Failed: character value in \x{...} sequence is too large at offset 12
/^\x{100}a\x{1234}/8
\x{100}a\x{1234}bcd
0: \x{100}a\x{1234}
/\x80/8D
------------------------------------------------------------------
0 7 Bra 0
3 2 \xc0\x84
7 7 Ket
10 End
------------------------------------------------------------------
Capturing subpattern count = 0
Options: utf8
First char = 192
Need char = 132
/\xff/8D
------------------------------------------------------------------
0 7 Bra 0
3 2 \xdf\x87
7 7 Ket
10 End
------------------------------------------------------------------
Capturing subpattern count = 0
Options: utf8
First char = 223
Need char = 135
/-- These tests are here rather than in testinput5 because Perl 5.6 has --/
/-- some problems with UTF-8 support, in the area of \x{..} where the --/
No match
/-- value is < 255. It grumbles about invalid UTF-8 strings. --/
No match
/^[a\x{c0}]b/8
\x{c0}b
0: \x{c0}b
/^([a\x{c0}]*?)aa/8
a\x{c0}aaaa/
0: a\x{c0}aa
1: a\x{c0}
/^([a\x{c0}]*?)aa/8
a\x{c0}aaaa/
0: a\x{c0}aa
1: a\x{c0}
a\x{c0}a\x{c0}aaa/
0: a\x{c0}a\x{c0}aa
1: a\x{c0}a\x{c0}
/^([a\x{c0}]*)aa/8
a\x{c0}aaaa/
0: a\x{c0}aaaa
1: a\x{c0}aa
a\x{c0}a\x{c0}aaa/
0: a\x{c0}a\x{c0}aaa
1: a\x{c0}a\x{c0}a
/^([a\x{c0}]*)a\x{c0}/8
a\x{c0}aaaa/
0: a\x{c0}
1:
a\x{c0}a\x{c0}aaa/
0: a\x{c0}a\x{c0}
1: a\x{c0}
/ End of testinput6 /