diff --git a/ext/pcre/pcrelib/COPYING b/ext/pcre/pcrelib/COPYING
index f305033c16c..34d20db9288 100644
--- a/ext/pcre/pcrelib/COPYING
+++ b/ext/pcre/pcrelib/COPYING
@@ -20,7 +20,21 @@ restrictions:
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
2. The origin of this software must not be misrepresented, either by
- explicit claim or by omission.
+ explicit claim or by omission. In practice, this means that if you use
+ PCRE in software which you distribute to others, commercially or
+ otherwise, you must put a sentence like this
+
+ Regular expression support is provided by the PCRE library package,
+ which is open source software, written by Philip Hazel, and copyright
+ by the University of Cambridge, England.
+
+ somewhere reasonably visible in your documentation and in any relevant
+ files or online help data or similar. A reference to the ftp site for
+ the source, that is, to
+
+ ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/
+
+ should also be given in the documentation.
3. Altered versions must be plainly marked as such, and must not be
misrepresented as being the original software.
diff --git a/ext/pcre/pcrelib/ChangeLog b/ext/pcre/pcrelib/ChangeLog
index 5bedd53bc63..2133dd7612f 100644
--- a/ext/pcre/pcrelib/ChangeLog
+++ b/ext/pcre/pcrelib/ChangeLog
@@ -2,6 +2,46 @@ ChangeLog for PCRE
------------------
+Version 3.4 22-Aug-00
+---------------------
+
+1. Fixed typo in pcre.h: unsigned const char * changed to const unsigned char *.
+
+2. Diagnose condition (?(0) as an error instead of crashing on matching.
+
+
+Version 3.3 01-Aug-00
+---------------------
+
+1. If an octal character was given, but the value was greater than \377, it
+was not getting masked to the least significant bits, as documented. This could
+lead to crashes in some systems.
+
+2. Perl 5.6 (if not earlier versions) accepts classes like [a-\d] and treats
+the hyphen as a literal. PCRE used to give an error; it now behaves like Perl.
+
+3. Added the functions pcre_free_substring() and pcre_free_substring_list().
+These just pass their arguments on to (pcre_free)(), but they are provided
+because some uses of PCRE bind it to non-C systems that can call its functions,
+but cannot call free() or pcre_free() directly.
+
+4. Add "make test" as a synonym for "make check". Corrected some comments in
+the Makefile.
+
+5. Add $(DESTDIR)/ in front of all the paths in the "install" target in the
+Makefile.
+
+6. Changed the name of pgrep to pcregrep, because Solaris has introduced a
+command called pgrep for grepping around the active processes.
+
+7. Added the beginnings of support for UTF-8 character strings.
+
+8. Arranged for the Makefile to pass over the settings of CC, CFLAGS, and
+RANLIB to ./ltconfig so that they are used by libtool. I think these are all
+the relevant ones. (AR is not passed because ./ltconfig does its own figuring
+out for the ar command.)
+
+
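The masking described in item 1 of the 3.3 entry can be illustrated with a small sketch. This is not PCRE's code: parse_octal_escape() is a hypothetical helper, and it assumes that "the least significant bits" means the low 8 bits of the value, and that at most three octal digits are read.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of PCRE. Reads up to three octal digits
   from s and returns the value masked to its low 8 bits (an assumption
   about what "least significant bits" means in the entry above).
   If consumed is non-NULL, it receives the number of digits read. */
int parse_octal_escape(const char *s, int *consumed)
{
    int value = 0;
    int n = 0;

    assert(s != NULL);
    while (n < 3 && s[n] >= '0' && s[n] <= '7') {
        value = value * 8 + (s[n] - '0');
        n++;
    }
    if (consumed != NULL)
        *consumed = n;
    return value & 0xFF;   /* without this mask, "400" would yield 256 */
}
```

With the mask in place, any result fits in a single byte, so it cannot index past the end of a 256-entry character table.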
Version 3.2 12-May-00
---------------------
diff --git a/ext/pcre/pcrelib/INSTALL b/ext/pcre/pcrelib/INSTALL
index d63a78fef9b..08802812deb 100644
--- a/ext/pcre/pcrelib/INSTALL
+++ b/ext/pcre/pcrelib/INSTALL
@@ -4,7 +4,7 @@ Basic Installation
These are generic installation instructions that apply to systems that
can run the `configure' shell script - Unix systems and any that imitate
it. They are not specific to PCRE. There are PCRE-specific instructions
-for non-Unix systems in the file NON-UNIX.
+for non-Unix systems in the file NON-UNIX-USE.
The `configure' shell script attempts to guess correct values for
various system-dependent variables used during compilation. It uses
diff --git a/ext/pcre/pcrelib/LICENCE b/ext/pcre/pcrelib/LICENCE
index 8422bd9ae66..34d20db9288 100644
--- a/ext/pcre/pcrelib/LICENCE
+++ b/ext/pcre/pcrelib/LICENCE
@@ -20,19 +20,21 @@ restrictions:
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
2. The origin of this software must not be misrepresented, either by
- explicit claim or by omission. In practice, this means you must put
- a sentence like this
+ explicit claim or by omission. In practice, this means that if you use
+ PCRE in software which you distribute to others, commercially or
+ otherwise, you must put a sentence like this
Regular expression support is provided by the PCRE library package,
- which is open source software, copyright by the University of
- Cambridge.
+ which is open source software, written by Philip Hazel, and copyright
+ by the University of Cambridge, England.
somewhere reasonably visible in your documentation and in any relevant
- files. A reference to the ftp site for the source should also be given
+ files or online help data or similar. A reference to the ftp site for
+ the source, that is, to
ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/
- in the documentation.
+ should also be given in the documentation.
3. Altered versions must be plainly marked as such, and must not be
misrepresented as being the original software.
diff --git a/ext/pcre/pcrelib/NEWS b/ext/pcre/pcrelib/NEWS
index 4c80bd6833f..56fccdfad37 100644
--- a/ext/pcre/pcrelib/NEWS
+++ b/ext/pcre/pcrelib/NEWS
@@ -1,6 +1,14 @@
News about PCRE releases
------------------------
+Release 3.3 01-Aug-00
+---------------------
+
+There is some support for UTF-8 character strings. This is incomplete and
+experimental. The documentation describes what is and what is not implemented.
+Otherwise, this is just a bug-fixing release.
+
+
Release 3.0 01-Feb-00
---------------------
diff --git a/ext/pcre/pcrelib/README b/ext/pcre/pcrelib/README
index 90aaf4d6a65..d124ee014c1 100644
--- a/ext/pcre/pcrelib/README
+++ b/ext/pcre/pcrelib/README
@@ -7,6 +7,15 @@ The latest release of PCRE is always available from
Please read the NEWS file if you are upgrading from a previous release.
+PCRE has its own native API, but a set of "wrapper" functions that are based on
+the POSIX API are also supplied in the library libpcreposix. Note that this
+just provides a POSIX calling interface to PCRE: the regular expressions
+themselves still follow Perl syntax and semantics. The header file
+for the POSIX-style functions is called pcreposix.h. The official POSIX name is
+regex.h, but I didn't want to risk possible problems with existing files of
+that name by distributing it that way. To use it with an existing program that
+uses the POSIX API, it will have to be renamed or pointed at by a link.
+
Building PCRE on a Unix system
------------------------------
@@ -15,20 +24,29 @@ To build PCRE on a Unix system, run the "configure" command in the PCRE
distribution directory. This is a standard GNU "autoconf" configuration script,
for which generic instructions are supplied in INSTALL. On many systems just
running "./configure" is sufficient, but the usual methods of changing standard
-defaults are available. For example
+defaults are available. For example,
CFLAGS='-O2 -Wall' ./configure --prefix=/opt/local
specifies that the C compiler should be run with the flags '-O2 -Wall' instead
of the default, and that "make install" should install PCRE under /opt/local
-instead of the default /usr/local. The "configure" script builds thre files:
+instead of the default /usr/local.
+
+If you want to make use of the experimental, incomplete support for UTF-8
+character strings in PCRE, you must add --enable-utf8 to the "configure"
+command. Without it, the code for handling UTF-8 is not included in the
+library. (Even when included, it still has to be enabled by an option at run
+time.)
+
+The "configure" script builds four files:
. Makefile is built by copying Makefile.in and making substitutions.
. config.h is built by copying config.in and making substitutions.
. pcre-config is built by copying pcre-config.in and making substitutions.
+. RunTest is a script for running tests.
Once "configure" has run, you can run "make". It builds two libraries called
-libpcre and libpcreposix, a test program called pcretest, and the pgrep
+libpcre and libpcreposix, a test program called pcretest, and the pcregrep
command. You can use "make install" to copy these, and the public header file
pcre.h, to appropriate live directories on your system, in the normal way.
@@ -54,11 +72,11 @@ The default distribution builds PCRE as two shared libraries. This support is
new and experimental and may not work on all systems. It relies on the
"libtool" scripts - these are distributed with PCRE. It should build a
"libtool" script and use this to compile and link shared libraries, which are
-placed in a subdirectory called .libs. The programs pcretest and pgrep are
+placed in a subdirectory called .libs. The programs pcretest and pcregrep are
built to use these uninstalled libraries by means of wrapper scripts. When you
-use "make install" to install shared libraries, pgrep and pcretest are
+use "make install" to install shared libraries, pcregrep and pcretest are
automatically re-built to use the newly installed libraries. However, only
-pgrep is installed, as pcretest is really just a test program.
+pcregrep is installed, as pcretest is really just a test program.
To build PCRE using static libraries you must use --disable-shared when
configuring it. For example
@@ -82,8 +100,8 @@ Testing PCRE
------------
To test PCRE on a Unix system, run the RunTest script in the pcre directory.
-(This can also be run by "make runtest" or "make check".) For other systems,
-see the instruction in NON-UNIX-USE.
+(This can also be run by "make runtest", "make check", or "make test".) For
+other systems, see the instructions in NON-UNIX-USE.
The script runs the pcretest test program (which is documented in
doc/pcretest.txt) on each of the testinput files (in the testdata directory) in
@@ -97,12 +115,24 @@ RunTest, for example:
The first and third test files can also be fed directly into the perltest
script to check that Perl gives the same results. The third file requires the
additional features of release 5.005, which is why it is kept separate from the
-main test input, which needs only Perl 5.004. In the long run, when 5.005 is
-widespread, these two test files may get amalgamated.
+main test input, which needs only Perl 5.004. In the long run, when 5.005 (or
+higher) is widespread, these two test files may get amalgamated.
-The second set of tests check pcre_info(), pcre_study(), pcre_copy_substring(),
-pcre_get_substring(), pcre_get_substring_list(), error detection and run-time
-flags that are specific to PCRE, as well as the POSIX wrapper API.
+The second set of tests checks pcre_fullinfo(), pcre_info(), pcre_study(),
+pcre_copy_substring(), pcre_get_substring(), pcre_get_substring_list(), error
+detection, and run-time flags that are specific to PCRE, as well as the POSIX
+wrapper API. It also uses the debugging flag to check some of the internals of
+pcre_compile().
+
+If you build PCRE with a locale setting that is not the standard C locale, the
+character tables may be different (see next paragraph). In some cases, this may
+cause failures in the second set of tests. For example, in a locale where the
+isprint() function yields TRUE for characters in the range 128-255, the use of
+[:isascii:] inside a character class defines a different set of characters, and
+this shows up in this test as a difference in the compiled code, which is being
+listed for checking. Where the comparison test output contains [\x00-\x7f] the
+test will contain [\x00-\xff], and similarly in some other cases. This is not a
+bug in PCRE.
The fourth set of tests checks pcre_maketables(), the facility for building a
set of character tables for a specific locale and using them instead of the
@@ -117,14 +147,10 @@ output to say why. If running this test produces instances of the error
in the comparison output, it means that locale is not available on your system,
despite being listed by "locale". This does not mean that PCRE is broken.
-PCRE has its own native API, but a set of "wrapper" functions that are based on
-the POSIX API are also supplied in the library libpcreposix.a. Note that this
-just provides a POSIX calling interface to PCRE: the regular expressions
-themselves still follow Perl syntax and semantics. The header file
-for the POSIX-style functions is called pcreposix.h. The official POSIX name is
-regex.h, but I didn't want to risk possible problems with existing files of
-that name by distributing it that way. To use it with an existing program that
-uses the POSIX API, it will have to be renamed or pointed at by a link.
+The fifth test checks the experimental, incomplete UTF-8 support. It is not run
+automatically unless PCRE is built with UTF-8 support. This file can be fed
+directly to the perltest8 script, which requires Perl 5.6 or higher. The sixth
+file tests internal UTF-8 features of PCRE that are not relevant to Perl.
Character tables
@@ -197,7 +223,7 @@ The distribution should contain the following files:
NEWS important changes in this release
NON-UNIX-USE notes on building PCRE on non-Unix systems
README this file
- RunTest a Unix shell script for running tests
+ RunTest.in template for a Unix shell script for running tests
config.guess ) files used by libtool,
config.sub ) used only when building a shared library
configure a configuring shell script (built by autoconf)
@@ -211,24 +237,29 @@ The distribution should contain the following files:
doc/pcreposix.txt plain text version
doc/pcretest.txt documentation of test program
doc/perltest.txt documentation of Perl test program
- doc/pgrep.1 man page source for the pgrep utility
- doc/pgrep.html HTML version
- doc/pgrep.txt plain text version
+ doc/pcregrep.1 man page source for the pcregrep utility
+ doc/pcregrep.html HTML version
+ doc/pcregrep.txt plain text version
install-sh a shell script for installing files
ltconfig ) files used to build "libtool",
ltmain.sh ) used only when building a shared library
pcretest.c test program
perltest Perl test program
- pgrep.c source of a grep utility that uses PCRE
+ perltest8 Perl test program for UTF-8 tests
+ pcregrep.c source of a grep utility that uses PCRE
pcre-config.in source of script which retains PCRE information
testdata/testinput1 test data, compatible with Perl 5.004 and 5.005
testdata/testinput2 test data for error messages and non-Perl things
testdata/testinput3 test data, compatible with Perl 5.005
testdata/testinput4 test data for locale-specific tests
+ testdata/testinput5 test data for UTF-8 tests compatible with Perl 5.6
+ testdata/testinput6 test data for other UTF-8 tests
testdata/testoutput1 test results corresponding to testinput1
testdata/testoutput2 test results corresponding to testinput2
testdata/testoutput3 test results corresponding to testinput3
testdata/testoutput4 test results corresponding to testinput4
+ testdata/testoutput5 test results corresponding to testinput5
+ testdata/testoutput6 test results corresponding to testinput6
(C) Auxiliary files for Win32 DLL
@@ -236,4 +267,4 @@ The distribution should contain the following files:
pcre.def
Philip Hazel
@@ -76,6 +77,12 @@ pcre - Perl-compatible regular expressions.
int *ovector, int stringcount, const char ***listptr);
+void pcre_free_substring(const char *stringptr);
+
+void pcre_free_substring_list(const char **stringptr);
+
const unsigned char *pcre_maketables(void);
@@ -100,7 +107,9 @@ pcre - Perl-compatible regular expressions.
The PCRE library is a set of functions that implement regular expression
pattern matching using the same syntax and semantics as Perl 5, with just a few
differences (see below). The current implementation corresponds to Perl 5.005,
-with some additional features from the Perl development release.
+with some additional features from later versions. This includes some
+experimental, incomplete support for UTF-8 encoded strings. Details of exactly
+what is and what is not supported are given below.
PCRE has its own native API, which is described in this document. There is also
@@ -117,12 +126,18 @@ use these to include support for different releases.
The functions pcre_compile(), pcre_study(), and pcre_exec()
-are used for compiling and matching regular expressions, while
-pcre_copy_substring(), pcre_get_substring(), and
+are used for compiling and matching regular expressions.
+
+The functions pcre_copy_substring(), pcre_get_substring(), and
pcre_get_substring_list() are convenience functions for extracting
-captured substrings from a matched subject string. The function
-pcre_maketables() is used (optionally) to build a set of character tables
-in the current locale for passing to pcre_compile().
+captured substrings from a matched subject string; pcre_free_substring()
+and pcre_free_substring_list() are also provided, to free the memory used
+for extracted strings.
+
+The function pcre_maketables() is used (optionally) to build a set of
+character tables in the current locale for passing to pcre_compile().
The function pcre_fullinfo() is used to find out information about a
@@ -297,6 +312,18 @@ This option inverts the "greediness" of the quantifiers so that they are not
greedy by default, but become greedy if followed by "?". It is not compatible
with Perl. It can also be set by a (?U) option setting within the pattern.
+
+ PCRE_UTF8
+
+
+This option causes PCRE to regard both the pattern and the subject as strings
+of UTF-8 characters instead of just byte strings. However, it is available only
+if PCRE has been built to include UTF-8 support. If not, the use of this option
+provokes an error. Support for UTF-8 is new, experimental, and incomplete.
+Details of exactly what it entails are given below.
+
When a pattern is going to be used several times, it is worth spending more @@ -743,7 +770,7 @@ extract a single substring, whose number is given as stringnumber. A value of zero extracts the substring that matched the entire pattern, while higher values extract the captured substrings. For pcre_copy_substring(), the string is placed in buffer, whose length is given by -buffersize, while for pcre_get_substring() a new block of store is +buffersize, while for pcre_get_substring() a new block of memory is obtained via pcre_malloc, and its address is returned via stringptr. The yield of the function is the length of the string, not including the terminating zero, or one of @@ -789,6 +816,17 @@ string. This can be distinguished from a genuine zero-length substring by inspecting the appropriate offset in ovector, which is negative for unset substrings.
+
+The two convenience functions pcre_free_substring() and
+pcre_free_substring_list() can be used to free the memory returned by
+a previous call of pcre_get_substring() or
+pcre_get_substring_list(), respectively. They do nothing more than call
+the function pointed to by pcre_free, which of course could be called
+directly from a C program. However, PCRE is used in some situations where it is
+linked via a special interface to another programming language which cannot use
+pcre_free directly; it is for these cases that the functions are
+provided.
+
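The forwarding behaviour described in the paragraph above can be sketched as follows. This is a stand-alone model, not the PCRE source: demo_free stands in for the pcre_free pointer, and demo_free_substring(), demo_free_substring_list(), demo_counting_free() and demo_free_calls are hypothetical names used only for illustration. It assumes, as the text implies, that a substring list is released with a single call to the free function.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for PCRE's pcre_free pointer (hypothetical name). A host
   program may point it at its own deallocator. */
void (*demo_free)(void *) = free;

/* Counter and hook used only to make the forwarding observable. */
int demo_free_calls = 0;

void demo_counting_free(void *block)
{
    demo_free_calls++;
    free(block);
}

/* Model of pcre_free_substring(): forward to the current free hook. */
void demo_free_substring(const char *stringptr)
{
    (demo_free)((void *)stringptr);
}

/* Model of pcre_free_substring_list(): the list is assumed to live in
   one block, so one call to the hook releases it. */
void demo_free_substring_list(const char **listptr)
{
    (demo_free)((void *)listptr);
}
```

A binding for a language that cannot dereference a C function pointer can export the two wrappers as ordinary functions and never needs to touch the pointer itself.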
There are some size limitations in PCRE but it is hoped that they will never in @@ -908,8 +946,15 @@ The syntax and semantics of the regular expressions supported by PCRE are described below. Regular expressions are also described in the Perl documentation and in a number of other books, some of which have copious examples. Jeffrey Friedl's "Mastering Regular Expressions", published by -O'Reilly (ISBN 1-56592-257), covers them in great detail. The description -here is intended as reference documentation. +O'Reilly (ISBN 1-56592-257), covers them in great detail. +
+
+The description here is intended as reference documentation. The basic
+operation of PCRE is on strings of bytes. However, there is the beginnings of
+some support for UTF-8 character strings. To use this support you must
+configure PCRE to include it, and then call pcre_compile() with the
+PCRE_UTF8 option. How this affects the pattern matching is described in the
+final section of this document.
A regular expression is a pattern that is matched against a subject string from
@@ -1576,7 +1621,7 @@ to the string
-fails, because it matches the entire string due to the greediness of the .*
+fails, because it matches the entire string owing to the greediness of the .*
item.
@@ -1718,7 +1763,7 @@ example, the pattern
-matches any number of "a"s and also "aba", "ababaa" etc. At each iteration of
+matches any number of "a"s and also "aba", "ababbaa" etc. At each iteration of
the subpattern, the back reference matches the character string corresponding
to the previous iteration. In order for this to work, the pattern must be such
that the first iteration does not need to match the back reference. This can be
@@ -2033,9 +2078,10 @@ subpattern, a compile-time error occurs.
There are two kinds of condition. If the text between the parentheses consists
of a sequence of digits, the condition is satisfied if the capturing subpattern
-of that number has previously matched. Consider the following pattern, which
-contains non-significant white space to make it more readable (assume the
-PCRE_EXTENDED option) and to divide it into three parts for ease of discussion:
+of that number has previously matched. The number must be greater than zero.
+Consider the following pattern, which contains non-significant white space to
+make it more readable (assume the PCRE_EXTENDED option) and to divide it into
+three parts for ease of discussion:
@@ -2240,7 +2286,96 @@ with the pattern above. The former gives a failure almost instantly when applied to a whole line of "a" characters, whereas the latter takes an appreciable time with strings longer than about 20 characters. -
+Starting at release 3.3, PCRE has some support for character strings encoded
+in the UTF-8 format. This is incomplete, and is regarded as experimental. In
+order to use it, you must configure PCRE to include UTF-8 support in the code,
+and, in addition, you must call pcre_compile() with the PCRE_UTF8 option
+flag. When you do this, both the pattern and any subject strings that are
+matched against it are treated as UTF-8 strings instead of just strings of
+bytes, but only in the cases that are mentioned below.
+
+
+If you compile PCRE with UTF-8 support, but do not use it at run time, the
+library will be a bit bigger, but the additional run time overhead is limited
+to testing the PCRE_UTF8 flag in several places, so should not be very large.
+
+
+PCRE assumes that the strings it is given contain valid UTF-8 codes. It does
+not diagnose invalid UTF-8 strings. If you pass invalid UTF-8 strings to PCRE,
+the results are undefined.
+
+
+Running with PCRE_UTF8 set causes these changes in the way PCRE works:
+
+1. In a pattern, the escape sequence \x{...}, where the contents of the braces
+is a string of hexadecimal digits, is interpreted as a UTF-8 character whose
+code number is the given hexadecimal number, for example: \x{1234}. This
+inserts from one to six literal bytes into the pattern, using the UTF-8
+encoding. If a non-hexadecimal digit appears between the braces, the item is
+not recognized.
+
+
+2. The original hexadecimal escape sequence, \xhh, generates a two-byte UTF-8
+character if its value is greater than 127.
+
+
+3. Repeat quantifiers are NOT correctly handled if they follow a multibyte
+character. For example, \x{100}* and \xc3+ do not work. If you want to
+repeat such characters, you must enclose them in non-capturing parentheses,
+for example (?:\x{100}), at present.
+
+
+4. The dot metacharacter matches one UTF-8 character instead of a single byte.
+
+
+5. Unlike literal UTF-8 characters, the dot metacharacter followed by a
+repeat quantifier does operate correctly on UTF-8 characters instead of
+single bytes.
+
+
+6. Although the \x{...} escape is permitted in a character class, characters
+whose values are greater than 255 cannot be included in a class.
+
+
+7. A class is matched against a UTF-8 character instead of just a single byte,
+but it can match only characters whose values are less than 256. Characters
+with greater values always fail to match a class.
+
+
+8. Repeated classes work correctly on multiple characters.
+
+
+9. Classes containing just a single character whose value is greater than 127
+(but less than 256), for example, [\x80] or [^\x{93}], do not work because
+these are optimized into single byte matches. In the first case, of course,
+the class brackets are just redundant.
+
+
+10. Lookbehind assertions move backwards in the subject by a fixed number of
+characters instead of a fixed number of bytes. Simple cases have been tested
+to work correctly, but there may be hidden gotchas herein.
+
+
+11. The character types such as \d and \w do not work correctly with UTF-8
+characters. They continue to test a single byte.
+
+
+12. Anything not explicitly mentioned here continues to work in bytes rather
+than in characters.
+
+
+The following UTF-8 features of Perl 5.6 are not implemented:
+
+
+1. The escape sequence \C to match a single byte.
+
+
+2. The use of Unicode tables and properties and escapes \p, \P, and \X.
+
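The "one to six literal bytes" produced by \x{...} follow the original UTF-8 layout, which covers code numbers up to 0x7FFFFFFF. The sketch below shows that encoding in isolation. utf8_encode() is a hypothetical helper written for this illustration and is not a PCRE function; how PCRE itself performs the conversion is not shown in this patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of PCRE. Encodes the code number cp in
   the original one-to-six byte UTF-8 layout. buf must have room for six
   bytes. Returns the number of bytes written, or 0 if cp is larger than
   0x7FFFFFFF and so cannot be encoded. */
int utf8_encode(unsigned long cp, unsigned char *buf)
{
    /* Largest code number representable with 1, 2, ... 6 bytes. */
    static const unsigned long limit[6] = {
        0x7FUL, 0x7FFUL, 0xFFFFUL, 0x1FFFFFUL, 0x3FFFFFFUL, 0x7FFFFFFFUL
    };
    /* Marker bits of the first byte for each length. */
    static const unsigned char lead[6] = {
        0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC
    };
    int extra;   /* number of continuation bytes */
    int i;

    assert(buf != NULL);
    for (extra = 0; extra < 6; extra++)
        if (cp <= limit[extra])
            break;
    if (extra == 6)
        return 0;

    /* Fill continuation bytes from the end, six payload bits each. */
    for (i = extra; i > 0; i--) {
        buf[i] = (unsigned char)(0x80 | (cp & 0x3F));
        cp >>= 6;
    }
    buf[0] = (unsigned char)(lead[extra] | cp);
    return extra + 1;
}
```

Under this layout \x{1234} becomes the three bytes E1 88 B4, and a value such as 0xC3 (above 127) becomes the two bytes C3 83, which matches the two-byte behaviour described for \xhh.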
+
Philip Hazel <ph10@cam.ac.uk>
@@ -2253,6 +2388,10 @@ Cambridge CB2 3QG, England.
Phone: +44 1223 334714
-Last updated: 27 January 2000
+Last updated: 28 August 2000,
+
+ the 250th anniversary of the death of J.S. Bach. +Copyright (c) 1997-2000 University of Cambridge. diff --git a/ext/pcre/pcrelib/doc/pcre.txt b/ext/pcre/pcrelib/doc/pcre.txt index b8106e4457f..1db4b537b7a 100644 --- a/ext/pcre/pcrelib/doc/pcre.txt +++ b/ext/pcre/pcrelib/doc/pcre.txt @@ -28,6 +28,10 @@ SYNOPSIS int pcre_get_substring_list(const char *subject, int *ovector, int stringcount, const char ***listptr); + void pcre_free_substring(const char *stringptr); + + void pcre_free_substring_list(const char **stringptr); + const unsigned char *pcre_maketables(void); int pcre_fullinfo(const pcre *code, const pcre_extra *extra, @@ -48,9 +52,12 @@ DESCRIPTION The PCRE library is a set of functions that implement regu- lar expression pattern matching using the same syntax and semantics as Perl 5, with just a few differences (see + below). The current implementation corresponds to Perl - 5.005, with some additional features from the Perl develop- - ment release. + 5.005, with some additional features from later versions. + This includes some experimental, incomplete support for + UTF-8 encoded strings. Details of exactly what is and what + is not supported are given below. PCRE has its own native API, which is described in this document. There is also a set of wrapper functions that @@ -67,13 +74,18 @@ DESCRIPTION releases. The functions pcre_compile(), pcre_study(), and pcre_exec() - are used for compiling and matching regular expressions, - while pcre_copy_substring(), pcre_get_substring(), and - pcre_get_substring_list() are convenience functions for + are used for compiling and matching regular expressions. + + The functions pcre_copy_substring(), pcre_get_substring(), + and pcre_get_substring_list() are convenience functions for extracting captured substrings from a matched subject - string. The function pcre_maketables() is used (optionally) - to build a set of character tables in the current locale for - passing to pcre_compile(). 
+ string; pcre_free_substring() and pcre_free_substring_list() + are also provided, to free the memory used for extracted + strings. + + The function pcre_maketables() is used (optionally) to build + a set of character tables in the current locale for passing + to pcre_compile(). The function pcre_fullinfo() is used to find out information about a compiled pattern; pcre_info() is an obsolete version @@ -92,10 +104,19 @@ DESCRIPTION MULTI-THREADING - The PCRE functions can be used in multi-threading applica- - tions, with the proviso that the memory management functions - pointed to by pcre_malloc and pcre_free are shared by all - threads. + The PCRE functions can be used in multi-threading + + + + + +SunOS 5.8 Last change: 2 + + + + applications, with the proviso that the memory management + functions pointed to by pcre_malloc and pcre_free are shared + by all threads. The compiled form of a regular expression is not altered during matching, so the same compiled pattern can safely be @@ -103,7 +124,6 @@ MULTI-THREADING - COMPILING A PATTERN The function pcre_compile() is called to compile a pattern into an internal form. The pattern is a C string terminated @@ -235,12 +255,23 @@ COMPILING A PATTERN followed by "?". It is not compatible with Perl. It can also be set by a (?U) option setting within the pattern. + PCRE_UTF8 + + This option causes PCRE to regard both the pattern and the + subject as strings of UTF-8 characters instead of just byte + strings. However, it is available only if PCRE has been + built to include UTF-8 support. If not, the use of this + option provokes an error. Support for UTF-8 is new, experi- + mental, and incomplete. Details of exactly what it entails + are given below. + STUDYING A PATTERN When a pattern is going to be used several times, it is worth spending more time analyzing it in order to speed up the time taken for matching. 
The function pcre_study() takes + a pointer to a compiled pattern as its first argument, and returns a pointer to a pcre_extra block (another void typedef) containing additional information about the pat- @@ -344,9 +375,9 @@ INFORMATION ABOUT A PATTERN PCRE_INFO_BACKREFMAX - Return the number of the highest back reference in the pat- - tern. The fourth argument should point to an int variable. - Zero is returned if there are no back references. + Return the number of the highest back reference in the + pattern. The fourth argument should point to an int vari- + able. Zero is returned if there are no back references. PCRE_INFO_FIRSTCHAR @@ -605,6 +636,15 @@ MATCHING A PATTERN EXTRACTING CAPTURED SUBSTRINGS Captured substrings can be accessed directly by using the + + + + + +SunOS 5.8 Last change: 12 + + + offsets returned by pcre_exec() in ovector. For convenience, the functions pcre_copy_substring(), pcre_get_substring(), and pcre_get_substring_list() are provided for extracting @@ -631,7 +671,7 @@ EXTRACTING CAPTURED SUBSTRINGS the entire pattern, while higher values extract the captured substrings. For pcre_copy_substring(), the string is placed in buffer, whose length is given by buffersize, while for - pcre_get_substring() a new block of store is obtained via + pcre_get_substring() a new block of memory is obtained via pcre_malloc, and its address is returned via stringptr. The yield of the function is the length of the string, not including the terminating zero, or one of @@ -665,6 +705,16 @@ EXTRACTING CAPTURED SUBSTRINGS inspecting the appropriate offset in ovector, which is nega- tive for unset substrings. + The two convenience functions pcre_free_substring() and + pcre_free_substring_list() can be used to free the memory + returned by a previous call of pcre_get_substring() or + pcre_get_substring_list(), respectively. They do nothing + more than call the function pointed to by pcre_free, which + of course could be called directly from a C program. 
How- + ever, PCRE is used in some situations where it is linked via + a special interface to another programming language which + cannot use pcre_free directly; it is for these cases that + the functions are provided. @@ -733,6 +783,7 @@ DIFFERENCES FROM PERL (?p{code}) constructions. However, there is some experimen- tal support for recursive patterns using the non-Perl item (?R). + 8. There are at the time of writing some oddities in Perl 5.005_02 concerned with the settings of captured strings when part of a pattern is repeated. For example, matching @@ -785,11 +836,17 @@ REGULAR EXPRESSION DETAILS The syntax and semantics of the regular expressions sup- ported by PCRE are described below. Regular expressions are also described in the Perl documentation and in a number of - other books, some of which have copious examples. Jeffrey Friedl's "Mastering Regular Expressions", published by - O'Reilly (ISBN 1-56592-257), covers them in great detail. + O'Reilly (ISBN 1-56592-257), covers them in great detail. + The description here is intended as reference documentation. + The basic operation of PCRE is on strings of bytes. However, + there is the beginnings of some support for UTF-8 character + strings. To use this support you must configure PCRE to + include it, and then call pcre_compile() with the PCRE_UTF8 + option. How this affects the pattern matching is described + in the final section of this document. A regular expression is a pattern that is matched against a subject string from left to right. Most characters stand for @@ -1004,6 +1061,7 @@ CIRCUMFLEX AND DOLLAR Outside a character class, in the default matching mode, the circumflex character is an assertion which is true only if the current matching point is at the start of the subject + string. If the startoffset argument of pcre_exec() is non- zero, circumflex can never match. Inside a character class, circumflex has an entirely different meaning (see below). 
@@ -1056,6 +1114,7 @@ FULL STOP (PERIOD, DOT)
Outside a character class, a dot in the pattern matches any
one character in the subject, including a non-printing char-
acter, but not (by default) newline. If the PCRE_DOTALL
+
option is set, dots match newlines as well. The handling of
dot is entirely independent of the handling of circumflex
and dollar, the only relationship being that they both
@@ -1403,7 +1462,7 @@ REPETITION
/* first command */ not comment /* second comment */
- fails, because it matches the entire string due to the
+ fails, because it matches the entire string owing to the
greediness of the .* item.
However, if a quantifier is followed by a question mark, it
@@ -1517,18 +1576,19 @@ BACK REFERENCES
A back reference that occurs inside the parentheses to which
it refers fails when the subpattern is first used, so, for
example, (a\1) never matches. However, such references can
- be useful inside repeated subpatterns. For example, the
- pattern
+ be useful inside repeated subpatterns. For example, the pat-
+ tern
(a|b\1)+
- matches any number of "a"s and also "aba", "ababaa" etc. At
+ matches any number of "a"s and also "aba", "ababbaa" etc. At
each iteration of the subpattern, the back reference matches
- the character string corresponding to the previous itera-
- tion. In order for this to work, the pattern must be such
- that the first iteration does not need to match the back
- reference. This can be done using alternation, as in the
- example above, or by a quantifier with a minimum of zero.
+ the character string corresponding to the previous
+ iteration. In order for this to work, the pattern must be
+ such that the first iteration does not need to match the
+ back reference. This can be done using alternation, as in
+ the example above, or by a quantifier with a minimum of
+ zero.
@@ -1681,9 +1741,9 @@ ONCE-ONLY SUBPATTERNS
This kind of parenthesis "locks up" the part of the pattern
it contains once it has matched, and a failure further into
- the pattern is prevented from backtracking into it. Back-
- tracking past it to previous items, however, works as nor-
- mal.
+ the pattern is prevented from backtracking into it.
+ Backtracking past it to previous items, however, works as
+ normal.
An alternative description is that a subpattern of this type
matches the string of characters that an identical stan-
@@ -1778,10 +1838,11 @@ CONDITIONAL SUBPATTERNS
There are two kinds of condition. If the text between the
parentheses consists of a sequence of digits, the condition
is satisfied if the capturing subpattern of that number has
- previously matched. Consider the following pattern, which
- contains non-significant white space to make it more read-
- able (assume the PCRE_EXTENDED option) and to divide it into
- three parts for ease of discussion:
+ previously matched. The number must be greater than zero.
+ Consider the following pattern, which contains non-
+ significant white space to make it more readable (assume the
+ PCRE_EXTENDED option) and to divide it into three parts for
+ ease of discussion:
( \( )? [^()]+ (?(1) \) )
@@ -1966,6 +2027,92 @@ PERFORMANCE
+
+
+UTF-8 SUPPORT
+ Starting at release 3.3, PCRE has some support for character
+ strings encoded in the UTF-8 format. This is incomplete, and
+ is regarded as experimental. In order to use it, you must
+ configure PCRE to include UTF-8 support in the code, and, in
+ addition, you must call pcre_compile() with the PCRE_UTF8
+ option flag. When you do this, both the pattern and any sub-
+ ject strings that are matched against it are treated as
+ UTF-8 strings instead of just strings of bytes, but only in
+ the cases that are mentioned below.
+
+ If you compile PCRE with UTF-8 support, but do not use it at
+ run time, the library will be a bit bigger, but the addi-
+ tional run time overhead is limited to testing the PCRE_UTF8
+ flag in several places, so should not be very large.
+
+ PCRE assumes that the strings it is given contain valid
+ UTF-8 codes. It does not diagnose invalid UTF-8 strings. If
+ you pass invalid UTF-8 strings to PCRE, the results are
+ undefined.
+
+ Running with PCRE_UTF8 set causes these changes in the way
+ PCRE works:
+
+ 1. In a pattern, the escape sequence \x{...}, where the con-
+ tents of the braces is a string of hexadecimal digits, is
+ interpreted as a UTF-8 character whose code number is the
+ given hexadecimal number, for example: \x{1234}. This
+ inserts from one to six literal bytes into the pattern,
+ using the UTF-8 encoding. If a non-hexadecimal digit appears
+ between the braces, the item is not recognized.
+
+ 2. The original hexadecimal escape sequence, \xhh, generates
+ a two-byte UTF-8 character if its value is greater than 127.
+
+ 3. Repeat quantifiers are NOT correctly handled if they fol-
+ low a multibyte character. For example, \x{100}* and \xc3+
+ do not work. If you want to repeat such characters, you must
+ enclose them in non-capturing parentheses, for example
+ (?:\x{100}), at present.
+
+ 4. The dot metacharacter matches one UTF-8 character instead
+ of a single byte.
+
+ 5. Unlike literal UTF-8 characters, the dot metacharacter
+ followed by a repeat quantifier does operate correctly on
+ UTF-8 characters instead of single bytes.
+
+ 4. Although the \x{...} escape is permitted in a character
+ class, characters whose values are greater than 255 cannot
+ be included in a class.
+
+ 5. A class is matched against a UTF-8 character instead of
+ just a single byte, but it can match only characters whose
+ values are less than 256. Characters with greater values
+ always fail to match a class.
+
+ 6. Repeated classes work correctly on multiple characters.
+
+ 7. Classes containing just a single character whose value is
+ greater than 127 (but less than 256), for example, [\x80] or
+ [^\x{93}], do not work because these are optimized into sin-
+ gle byte matches. In the first case, of course, the class
+ brackets are just redundant.
+
+ 8. Lookbehind assertions move backwards in the subject by a
+ fixed number of characters instead of a fixed number of
+ bytes. Simple cases have been tested to work correctly, but
+ there may be hidden gotchas herein.
+
+ 9. The character types such as \d and \w do not work
+ correctly with UTF-8 characters. They continue to test a
+ single byte.
+
+ 10. Anything not explicitly mentioned here continues to work
+ in bytes rather than in characters.
+
+ The following UTF-8 features of Perl 5.6 are not imple-
+ mented:
+ 1. The escape sequence \C to match a single byte.
+
+ 2. The use of Unicode tables and properties and escapes \p,
+ \P, and \X.
+
+
+
AUTHOR
Philip Hazel
+
-pgrep - a grep with Perl-compatible regular expressions.
+pcregrep - a grep with Perl-compatible regular expressions.
-pgrep [-Vchilnsvx] pattern [file] ...
+pcregrep [-Vchilnsvx] pattern [file] ...
-pgrep searches files for character patterns, in the same way as other
+pcregrep searches files for character patterns, in the same way as other
grep commands do, but it uses the PCRE regular expression library to support
patterns that are compatible with the regular expressions of Perl 5. See
pcre(3) for a full description of syntax and semantics.
-If no files are specified, pgrep reads the standard input. By default,
+If no files are specified, pcregrep reads the standard input. By default,
each line that matches the pattern is copied to the standard output, and if
there is more than one file, the file name is printed before each line of
-output. However, there are options that can change how pgrep behaves.
+output. However, there are options that can change how pcregrep behaves.
Lines are limited to BUFSIZ characters. BUFSIZ is defined in <stdio.h>.
@@ -102,4 +102,4 @@ for syntax errors or inacessible files (even if matches were found).
Philip Hazel <ph10@cam.ac.uk>
-Copyright (c) 1997-1999 University of Cambridge.
+Copyright (c) 1997-2000 University of Cambridge.
diff --git a/ext/pcre/pcrelib/doc/pgrep.txt b/ext/pcre/pcrelib/doc/pcregrep.txt
similarity index 78%
rename from ext/pcre/pcrelib/doc/pgrep.txt
rename to ext/pcre/pcrelib/doc/pcregrep.txt
index bcd08c0aaba..3483f9e1587 100644
--- a/ext/pcre/pcrelib/doc/pgrep.txt
+++ b/ext/pcre/pcrelib/doc/pcregrep.txt
@@ -1,25 +1,26 @@
NAME
- pgrep - a grep with Perl-compatible regular expressions.
+ pcregrep - a grep with Perl-compatible regular expressions.
SYNOPSIS
- pgrep [-Vchilnsvx] pattern [file] ...
+ pcregrep [-Vchilnsvx] pattern [file] ...
DESCRIPTION
- pgrep searches files for character patterns, in the same way
- as other grep commands do, but it uses the PCRE regular
+ pcregrep searches files for character patterns, in the same
+ way as other grep commands do, but it uses the PCRE regular
expression library to support patterns that are compatible
with the regular expressions of Perl 5. See pcre(3) for a
full description of syntax and semantics.
- If no files are specified, pgrep reads the standard input.
- By default, each line that matches the pattern is copied to
- the standard output, and if there is more than one file, the
- file name is printed before each line of output. However,
- there are options that can change how pgrep behaves.
+ If no files are specified, pcregrep reads the standard
+ input. By default, each line that matches the pattern is
+ copied to the standard output, and if there is more than one
+ file, the file name is printed before each line of output.
+ However, there are options that can change how pcregrep
+ behaves.
Lines are limited to BUFSIZ characters. BUFSIZ is defined in
+In the absence of these flags, no options are passed to the native function.
+This means the regex is compiled with PCRE default semantics. In
+particular, the way it handles newline characters in the subject string is the
+Perl way, not the POSIX way. Note that setting PCRE_MULTILINE has only
+some of the effects specified for REG_NEWLINE. It does not affect the way
+newlines are matched by . (they aren't) or a negative class such as [^a] (they
+are).
+
+The yield of regcomp() is zero on success, and non-zero otherwise. The preg
structure is filled in on success, and one member of the structure is
publicized: re_nsub contains the number of capturing subpatterns in
@@ -179,4 +188,4 @@ Cambridge CB2 3QG, England.
Phone: +44 1223 334714
-Copyright (c) 1997-1999 University of Cambridge.
+Copyright (c) 1997-2000 University of Cambridge.
diff --git a/ext/pcre/pcrelib/doc/pcreposix.txt b/ext/pcre/pcrelib/doc/pcreposix.txt
index 4a7036f3406..2d76f7cdcc3 100644
--- a/ext/pcre/pcrelib/doc/pcreposix.txt
+++ b/ext/pcre/pcrelib/doc/pcreposix.txt
@@ -80,6 +80,15 @@ COMPILING A PATTERN
The PCRE_MULTILINE option is set when the expression is
passed for compilation to the native function.
+ In the absence of these flags, no options are passed to the
+ native function. This means the regex is compiled with
+ PCRE default semantics. In particular, the way it handles
+ newline characters in the subject string is the Perl way,
+ not the POSIX way. Note that setting PCRE_MULTILINE has only
+ some of the effects specified for REG_NEWLINE. It does not
+ affect the way newlines are matched by . (they aren't) or a
+ negative class such as [^a] (they are).
+
The yield of regcomp() is zero on success, and non-zero oth-
erwise. The preg structure is filled in on success, and one
member of the structure is publicized: re_nsub contains the
@@ -147,4 +156,4 @@ AUTHOR
Cambridge CB2 3QG, England.
Phone: +44 1223 334714
- Copyright (c) 1997-1999 University of Cambridge.
+ Copyright (c) 1997-2000 University of Cambridge.
diff --git a/ext/pcre/pcrelib/doc/pcretest.txt b/ext/pcre/pcrelib/doc/pcretest.txt
index 0e6783af0c5..add2979f14e 100644
--- a/ext/pcre/pcrelib/doc/pcretest.txt
+++ b/ext/pcre/pcrelib/doc/pcretest.txt
@@ -43,6 +43,10 @@ backslash, because
is interpreted as the first line of a pattern that starts with "abc/", causing
pcretest to read the next line as a continuation of the regular expression.
+
+PATTERN MODIFIERS
+-----------------
+
The pattern may be followed by i, m, s, or x to set the PCRE_CASELESS,
PCRE_MULTILINE, PCRE_DOTALL, or PCRE_EXTENDED options, respectively. For
example:
@@ -103,37 +107,48 @@ compiled, and the results used when the expression is matched.
The /M modifier causes the size of memory block used to hold the compiled
pattern to be output.
-Finally, the /P modifier causes pcretest to call PCRE via the POSIX wrapper API
-rather than its native API. When this is done, all other modifiers except /i,
-/m, and /+ are ignored. REG_ICASE is set if /i is present, and REG_NEWLINE is
-set if /m is present. The wrapper functions force PCRE_DOLLAR_ENDONLY always,
-and PCRE_DOTALL unless REG_NEWLINE is set.
+The /P modifier causes pcretest to call PCRE via the POSIX wrapper API rather
+than its native API. When this is done, all other modifiers except /i, /m, and
+/+ are ignored. REG_ICASE is set if /i is present, and REG_NEWLINE is set if /m
+is present. The wrapper functions force PCRE_DOLLAR_ENDONLY always, and
+PCRE_DOTALL unless REG_NEWLINE is set.
+
+The /8 modifier causes pcretest to call PCRE with the PCRE_UTF8 option set.
+This turns on the (currently incomplete) support for UTF-8 character handling
+in PCRE, provided that it was compiled with this support enabled. This modifier
+also causes any non-printing characters in output strings to be printed using
+the \x{hh...} notation if they are valid UTF-8 sequences.
+
+
+DATA LINES
+----------
Before each data line is passed to pcre_exec(), leading and trailing whitespace
is removed, and it is then scanned for \ escapes. The following are recognized:
- \a alarm (= BEL)
- \b backspace
- \e escape
- \f formfeed
- \n newline
- \r carriage return
- \t tab
- \v vertical tab
- \nnn octal character (up to 3 octal digits)
- \xhh hexadecimal character (up to 2 hex digits)
+ \a alarm (= BEL)
+ \b backspace
+ \e escape
+ \f formfeed
+ \n newline
+ \r carriage return
+ \t tab
+ \v vertical tab
+ \nnn octal character (up to 3 octal digits)
+ \xhh hexadecimal character (up to 2 hex digits)
+ \x{hh...} hexadecimal UTF-8 character
- \A pass the PCRE_ANCHORED option to pcre_exec()
- \B pass the PCRE_NOTBOL option to pcre_exec()
- \Cdd call pcre_copy_substring() for substring dd after a successful match
- (any decimal number less than 32)
- \Gdd call pcre_get_substring() for substring dd after a successful match
- (any decimal number less than 32)
- \L call pcre_get_substringlist() after a successful match
- \N pass the PCRE_NOTEMPTY option to pcre_exec()
- \Odd set the size of the output vector passed to pcre_exec() to dd
- (any number of decimal digits)
- \Z pass the PCRE_NOTEOL option to pcre_exec()
+ \A pass the PCRE_ANCHORED option to pcre_exec()
+ \B pass the PCRE_NOTBOL option to pcre_exec()
+ \Cdd call pcre_copy_substring() for substring dd after a successful
+ match (any decimal number less than 32)
+ \Gdd call pcre_get_substring() for substring dd after a successful
+ match (any decimal number less than 32)
+ \L call pcre_get_substring_list() after a successful match
+ \N pass the PCRE_NOTEMPTY option to pcre_exec()
+ \Odd set the size of the output vector passed to pcre_exec() to dd
+ (any number of decimal digits)
+ \Z pass the PCRE_NOTEOL option to pcre_exec()
A backslash followed by anything else just escapes the anything else. If the
very last character is a backslash, it is ignored. This gives a way of passing
@@ -143,6 +158,15 @@ If /P was present on the regex, causing the POSIX wrapper API to be used, only
\B, and \Z have any effect, causing REG_NOTBOL and REG_NOTEOL to be passed to
regexec() respectively.
+The use of \x{hh...} to represent UTF-8 characters is not dependent on the use
+of the /8 modifier on the pattern. It is recognized always. There may be any
+number of hexadecimal digits inside the braces. The result is from one to six
+bytes, encoded according to the UTF-8 rules.
+
+
+OUTPUT FROM PCRETEST
+--------------------
+
When a match succeeds, pcretest outputs the list of captured substrings that
pcre_exec() returns, starting with number 0 for the string that matched the
whole pattern. Here is an example of an interactive pcretest run.
@@ -158,8 +182,9 @@ whole pattern. Here is an example of an interactive pcretest run.
No match
If the strings contain any non-printing characters, they are output as \0x
-escapes. If the pattern has the /+ modifier, then the output for substring 0 is
-followed by the the rest of the subject string, identified by "0+" like this:
+escapes, or as \x{...} escapes if the /8 modifier was present on the pattern.
+If the pattern has the /+ modifier, then the output for substring 0 is followed
+by the rest of the subject string, identified by "0+" like this:
re> /cat/+
data> cataract
@@ -190,6 +215,10 @@ Note that while patterns can be continued over several lines (a plain ">"
prompt is used for continuations), data lines may not. However newlines can be
included in data by means of the \n escape.
+
+COMMAND LINE OPTIONS
+--------------------
+
If the -p option is given to pcretest, it is equivalent to adding /P to each
regular expression: the POSIX wrapper API is used to call PCRE. None of the
following flags has any effect in this case.
@@ -208,10 +237,10 @@ a synonym for -m.
If the -t option is given, each compile, study, and match is run 20000 times
while being timed, and the resulting time per compile or match is output in
-milliseconds. Do not set -t with -s, because you will then get the size output
+milliseconds. Do not set -t with -m, because you will then get the size output
20000 times and the timing will be distorted. If you want to change the number
of repetitions used for timing, edit the definition of LOOPREPEAT at the top of
pcretest.c
Philip Hazel ]{0,})> ]{0,})>([\d]{0,}\.)(.*)((
([\w\W\s\d][^<>]{0,})|[\s]{0,}))<\/a><\/TD>]{0,})>([\w\W\s\d][^<>]{0,})<\/TD> ]{0,})>([\w\W\s\d][^<>]{0,})<\/TD><\/TR>/is
+
+
+/a[^a]b/
+ acb
+ a\nb
+
+/a.b/
+ acb
+ *** Failers
+ a\nb
+
+/a[^a]b/s
+ acb
+ a\nb
+
+/a.b/s
+ acb
+ a\nb
+
+/ End of testinput1 /
diff --git a/ext/pcre/pcrelib/testdata/testinput2 b/ext/pcre/pcrelib/testdata/testinput2
index 1d9504cec28..ad116ef75a1 100644
--- a/ext/pcre/pcrelib/testdata/testinput2
+++ b/ext/pcre/pcrelib/testdata/testinput2
@@ -40,8 +40,6 @@
/[\B]/
-/[a-\w]/
-
/[z-a]/
/^*/
@@ -707,4 +705,8 @@
Ab
AB
-/ End of test input /
+/[\200-\410]/
+
+/^(?(0)f|b)oo/
+
+/ End of testinput2 /
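Note on the /[\200-\410]/ case added to testinput2: \410 is an octal escape above \377 (264 decimal). The 3.3 ChangeLog entry says such values were not being masked to the least significant bits as documented. The following is only a sketch of that masking, not PCRE's actual code; the helper name parse_octal_escape is hypothetical, and masking to 8 bits is an assumption based on PCRE operating on bytes.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE: reads up to three octal digits
   starting at s and returns the value reduced to a single byte. */
static unsigned int parse_octal_escape(const char *s)
{
    unsigned int value = 0;
    size_t i;

    /* Accumulate at most three octal digits, stopping at the first
       character that is not in the range '0'..'7'. */
    for (i = 0; i < 3 && s[i] >= '0' && s[i] <= '7'; i++)
        value = value * 8 + (unsigned int)(s[i] - '0');

    /* Assumption: keep only the least significant 8 bits, so that a
       value above \377 cannot index past a 256-entry table. */
    return value & 0xffu;
}
```

Under this sketch, parse_octal_escape("200") yields 0x80 and parse_octal_escape("410") yields 0x08, so the range in the test pattern would end below its start after masking.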
diff --git a/ext/pcre/pcrelib/testdata/testinput3 b/ext/pcre/pcrelib/testdata/testinput3
index 67d39f3ac54..d3bd74fdd33 100644
--- a/ext/pcre/pcrelib/testdata/testinput3
+++ b/ext/pcre/pcrelib/testdata/testinput3
@@ -1707,4 +1707,18 @@
/a*/g
abbab
-/ End of test input /
+/^[a-\d]/
+ abcde
+ -things
+ 0digit
+ *** Failers
+ bcdef
+
+/^[\d-a]/
+ abcde
+ -things
+ 0digit
+ *** Failers
+ bcdef
+
+/ End of testinput3 /
diff --git a/ext/pcre/pcrelib/testdata/testinput4 b/ext/pcre/pcrelib/testdata/testinput4
index c23b52aceb8..f2878965f64 100644
--- a/ext/pcre/pcrelib/testdata/testinput4
+++ b/ext/pcre/pcrelib/testdata/testinput4
@@ -62,3 +62,4 @@
*** Failers
école
+/ End of testinput4 /
diff --git a/ext/pcre/pcrelib/testdata/testinput5 b/ext/pcre/pcrelib/testdata/testinput5
new file mode 100644
index 00000000000..d66cfbddf30
--- /dev/null
+++ b/ext/pcre/pcrelib/testdata/testinput5
@@ -0,0 +1,118 @@
+/-- Because of problems with Perl 5.6 in handling UTF-8 vs non UTF-8 --/
+/-- strings automatically, do not use the \x{} construct except with --/
+/-- patterns that have the /8 option set, and don't use them without! --/
+
+/a.b/8
+ acb
+ a\x7fb
+ a\x{100}b
+ *** Failers
+ a\nb
+
+/a(.{3})b/8
+ a\x{4000}xyb
+ a\x{4000}\x7fyb
+ a\x{4000}\x{100}yb
+ *** Failers
+ a\x{4000}b
+ ac\ncb
+
+/a(.*?)(.)/
+ a\xc0\x88b
+
+/a(.*?)(.)/8
+ a\x{100}b
+
+/a(.*)(.)/
+ a\xc0\x88b
+
+/a(.*)(.)/8
+ a\x{100}b
+
+/a(.)(.)/
+ a\xc0\x92bcd
+
+/a(.)(.)/8
+ a\x{240}bcd
+
+/a(.?)(.)/
+ a\xc0\x92bcd
+
+/a(.?)(.)/8
+ a\x{240}bcd
+
+/a(.??)(.)/
+ a\xc0\x92bcd
+
+/a(.??)(.)/8
+ a\x{240}bcd
+
+/a(.{3})b/8
+ a\x{1234}xyb
+ a\x{1234}\x{4321}yb
+ a\x{1234}\x{4321}\x{3412}b
+ *** Failers
+ a\x{1234}b
+ ac\ncb
+
+/a(.{3,})b/8
+ a\x{1234}xyb
+ a\x{1234}\x{4321}yb
+ a\x{1234}\x{4321}\x{3412}b
+ axxxxbcdefghijb
+ a\x{1234}\x{4321}\x{3412}\x{3421}b
+ *** Failers
+ a\x{1234}b
+
+/a(.{3,}?)b/8
+ a\x{1234}xyb
+ a\x{1234}\x{4321}yb
+ a\x{1234}\x{4321}\x{3412}b
+ axxxxbcdefghijb
+ a\x{1234}\x{4321}\x{3412}\x{3421}b
+ *** Failers
+ a\x{1234}b
+
+/a(.{3,5})b/8
+ a\x{1234}xyb
+ a\x{1234}\x{4321}yb
+ a\x{1234}\x{4321}\x{3412}b
+ axxxxbcdefghijb
+ a\x{1234}\x{4321}\x{3412}\x{3421}b
+ axbxxbcdefghijb
+ axxxxxbcdefghijb
+ *** Failers
+ a\x{1234}b
+ axxxxxxbcdefghijb
+
+/a(.{3,5}?)b/8
+ a\x{1234}xyb
+ a\x{1234}\x{4321}yb
+ a\x{1234}\x{4321}\x{3412}b
+ axxxxbcdefghijb
+ a\x{1234}\x{4321}\x{3412}\x{3421}b
+ axbxxbcdefghijb
+ axxxxxbcdefghijb
+ *** Failers
+ a\x{1234}b
+ axxxxxxbcdefghijb
+
+/^[a\x{c0}]/8
+ *** Failers
+ \x{100}
+
+/(?<=aXb)cd/8
+ aXbcd
+
+/(?<=a\x{100}b)cd/8
+ a\x{100}bcd
+
+/(?<=a\x{100000}b)cd/8
+ a\x{100000}bcd
+
+/(?:\x{100}){3}b/8
+ \x{100}\x{100}\x{100}b
+ *** Failers
+ \x{100}\x{100}b
+
+/ End of testinput5 /
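Note on the /8 cases in testinput5: they rely on the dot matching one UTF-8 character rather than one byte, which is why a(.{3})b can match a\x{4000}xyb even though the captured text is longer than three bytes. The following is a rough sketch of that kind of character counting, assuming valid input and the one-to-six-byte encoding described in the documentation; utf8_seq_len and utf8_count_chars are hypothetical names, not PCRE functions.

```c
#include <stddef.h>

/* Hypothetical helpers, not PCRE functions. */

/* Length in bytes of the UTF-8 sequence that starts with lead byte c,
   using the original 1-6 byte scheme. The input is assumed to be valid,
   so c is never a continuation byte (10xxxxxx). */
static size_t utf8_seq_len(unsigned char c)
{
    if (c < 0x80) return 1;             /* 0xxxxxxx */
    if ((c & 0xe0) == 0xc0) return 2;   /* 110xxxxx */
    if ((c & 0xf0) == 0xe0) return 3;   /* 1110xxxx */
    if ((c & 0xf8) == 0xf0) return 4;   /* 11110xxx */
    if ((c & 0xfc) == 0xf8) return 5;   /* 111110xx */
    return 6;                           /* 1111110x */
}

/* Number of UTF-8 characters in the first nbytes bytes of s.
   Only lead bytes are inspected; each step skips a whole sequence. */
static size_t utf8_count_chars(const unsigned char *s, size_t nbytes)
{
    size_t i = 0, count = 0;

    while (i < nbytes) {
        i += utf8_seq_len(s[i]);
        count++;
    }
    return count;
}
```

For the subject a\x{4000}xyb, the text between "a" and "b" is the bytes e4 80 80 followed by "x" and "y": five bytes, but three characters by this count.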
diff --git a/ext/pcre/pcrelib/testdata/testinput6 b/ext/pcre/pcrelib/testdata/testinput6
new file mode 100644
index 00000000000..1ccaa0dbc1c
--- /dev/null
+++ b/ext/pcre/pcrelib/testdata/testinput6
@@ -0,0 +1,52 @@
+/\x{100}/8DM
+
+/\x{1000}/8DM
+
+/\x{10000}/8DM
+
+/\x{100000}/8DM
+
+/\x{1000000}/8DM
+
+/\x{4000000}/8DM
+
+/\x{7fffFFFF}/8DM
+
+/[\x{ff}]/8DM
+
+/[\x{100}]/8DM
+
+/\x{ffffffff}/8
+
+/\x{100000000}/8
+
+/^\x{100}a\x{1234}/8
+ \x{100}a\x{1234}bcd
+
+/\x80/8D
+
+/\xff/8D
+
+/-- These tests are here rather than in testinput5 because Perl 5.6 has --/
+/-- some problems with UTF-8 support, in the area of \x{..} where the --/
+/-- value is < 255. It grumbles about invalid UTF-8 strings. --/
+
+/^[a\x{c0}]b/8
+ \x{c0}b
+
+/^([a\x{c0}]*?)aa/8
+ a\x{c0}aaaa/
+
+/^([a\x{c0}]*?)aa/8
+ a\x{c0}aaaa/
+ a\x{c0}a\x{c0}aaa/
+
+/^([a\x{c0}]*)aa/8
+ a\x{c0}aaaa/
+ a\x{c0}a\x{c0}aaa/
+
+/^([a\x{c0}]*)a\x{c0}/8
+ a\x{c0}aaaa/
+ a\x{c0}a\x{c0}aaa/
+
+/ End of testinput6 /
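Note on the \x{...} patterns in testinput6: they run from \x{100} up to \x{7fffFFFF}, which corresponds to the "one to six bytes" stated in the documentation. The following is a minimal sketch of such an encoder, assuming the original one-to-six-byte UTF-8 layout; utf8_encode is a hypothetical name and this is not the code PCRE uses.

```c
#include <stddef.h>

/* Hypothetical helper, not taken from PCRE. Encodes code point cp into
   buf and returns the number of bytes written (1 to 6). The caller must
   supply at least 6 bytes and a value no greater than 0x7fffffff. */
static size_t utf8_encode(unsigned long cp, unsigned char *buf)
{
    /* Largest code point that fits in 1..5 bytes; larger values need 6. */
    static const unsigned long limit[] = {
        0x7fUL, 0x7ffUL, 0xffffUL, 0x1fffffUL, 0x3ffffffUL
    };
    /* Marker bits of the lead byte for sequences of 1..6 bytes. */
    static const unsigned char lead[] = {
        0x00, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc
    };
    size_t extra = 0, i;

    /* Work out how many continuation bytes are needed. */
    while (extra < 5 && cp > limit[extra])
        extra++;

    /* Fill continuation bytes from the end, six payload bits each. */
    for (i = extra; i > 0; i--) {
        buf[i] = (unsigned char)(0x80 | (cp & 0x3f));
        cp >>= 6;
    }

    /* The remaining high bits go into the lead byte. */
    buf[0] = (unsigned char)(lead[extra] | cp);
    return extra + 1;
}
```

With this sketch, 0x100 encodes as the two bytes c4 80 and 0x7fffffff as six bytes starting with fd. Values above 0x7fffffff, such as the \x{ffffffff} pattern in the test file, are outside what the sketch handles.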
diff --git a/ext/pcre/pcrelib/testdata/testoutput1 b/ext/pcre/pcrelib/testdata/testoutput1
index 4d5dc904984..a6930bc9f1e 100644
--- a/ext/pcre/pcrelib/testdata/testoutput1
+++ b/ext/pcre/pcrelib/testdata/testoutput1
@@ -1,4 +1,4 @@
-PCRE version 3.2 12-May-2000
+PCRE version 3.4 22-Aug-2000
/the quick brown fox/
the quick brown fox
@@ -2921,5 +2921,46 @@ No match
0:
0:
-/ End of test input /
+/43.Word Processor
(N-1286)Lega lstaff.com CA - Statewide ]{0,})> ]{0,})>([\d]{0,}\.)(.*)((
([\w\W\s\d][^<>]{0,})|[\s]{0,}))<\/a><\/TD>]{0,})>([\w\W\s\d][^<>]{0,})<\/TD> ]{0,})>([\w\W\s\d][^<>]{0,})<\/TD><\/TR>/is
+
+ 0: 43.Word Processor
(N-1286)Lega lstaff.com CA - Statewide
+ 1: BGCOLOR='#DBE9E9'
+ 2: align=left valign=top
+ 3: 43.
+ 4: Word Processor43.Word Processor
(N-1286)Lega lstaff.com CA - Statewide
(N-1286)
+ 5:
+ 6:
+ 7: