mirror of
https://github.com/php/php-src.git
synced 2026-04-25 17:08:14 +02:00
Upgraded bundled PCRE to version 8.11.
@@ -1,6 +1,136 @@
|
||||
ChangeLog for PCRE
|
||||
------------------
|
||||
|
||||
Version 8.11 10-Dec-2010
|
||||
------------------------
|
||||
|
||||
1. (*THEN) was not working properly if there were untried alternatives prior
|
||||
to it in the current branch. For example, in ((a|b)(*THEN)(*F)|c..) it
|
||||
backtracked to try for "b" instead of moving to the next alternative branch
|
||||
at the same level (in this case, to look for "c"). The Perl documentation
|
||||
is clear that when (*THEN) is backtracked onto, it goes to the "next
|
||||
alternative in the innermost enclosing group".
|
||||
|
||||
2. (*COMMIT) was not overriding (*THEN), as it does in Perl. In a pattern
|
||||
such as (A(*COMMIT)B(*THEN)C|D) any failure after matching A should
|
||||
result in overall failure. Similarly, (*COMMIT) now overrides (*PRUNE) and
|
||||
(*SKIP), (*SKIP) overrides (*PRUNE) and (*THEN), and (*PRUNE) overrides
|
||||
(*THEN).
|
||||
|
||||
3. If \s appeared in a character class, it removed the VT character from
|
||||
the class, even if it had been included by some previous item, for example
|
||||
in [\x00-\xff\s]. (This was a bug related to the fact that VT is not part
|
||||
of \s, but is part of the POSIX "space" class.)
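
The bit arithmetic behind item 3 can be sketched as follows. This is a minimal illustration, assuming a 32-byte class bitmap in which character c is bit c%8 of byte c/8, so that VT (character 11) is bit 0x08 of byte 1, as in the `classbits[1]` lines of the pcre_compile.c hunk in this commit. The helper names and the hard-coded space table are hypothetical and are not PCRE's own.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define VT_BYTE 1     /* VT is character 11: byte 11/8 = 1 */
#define VT_BIT  0x08  /* bit 11%8 = 3 */

/* Hypothetical stand-in for the POSIX "space" bitmap: HT, LF, VT, FF, CR, SP. */
static void build_space_bits(unsigned char sb[32])
{
  static const unsigned char spaces[] = { 9, 10, 11, 12, 13, 32 };
  size_t i;
  assert(sb != NULL);
  memset(sb, 0, 32);
  for (i = 0; i < sizeof(spaces); i++)
    sb[spaces[i] / 8] |= (unsigned char)(1u << (spaces[i] % 8));
}

/* Old behaviour: OR in the whole space class, then clear VT. This also
   clears a VT bit that an earlier class item had already set. */
static void add_s_old(unsigned char cls[32], const unsigned char sb[32])
{
  int c;
  for (c = 0; c < 32; c++) cls[c] |= sb[c];
  cls[VT_BYTE] &= (unsigned char)~VT_BIT;
}

/* New behaviour: mask VT out of the bits being added, so anything already
   present in the class is left alone. */
static void add_s_new(unsigned char cls[32], const unsigned char sb[32])
{
  int c;
  for (c = 0; c < 32; c++)
    cls[c] |= (c == VT_BYTE) ? (unsigned char)(sb[c] & ~VT_BIT) : sb[c];
}
```

With a class that starts as all ones (the [\x00-\xff\s] case), `add_s_old` drops VT while `add_s_new` keeps it; starting from an empty class, neither adds VT.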
|
||||
|
||||
4. A partial match never returns an empty string (because you can always
|
||||
match an empty string at the end of the subject); however the checking for
|
||||
an empty string was starting at the "start of match" point. This has been
|
||||
changed to the "earliest inspected character" point, because the returned
|
||||
data for a partial match starts at this character. This means that, for
|
||||
example, /(?<=abc)def/ gives a partial match for the subject "abc"
|
||||
(previously it gave "no match").
|
||||
|
||||
5. Changes have been made to the way PCRE_PARTIAL_HARD affects the matching
|
||||
of $, \z, \Z, \b, and \B. If the match point is at the end of the string,
|
||||
previously a full match would be given. However, setting PCRE_PARTIAL_HARD
|
||||
has an implication that the given string is incomplete (because a partial
|
||||
match is preferred over a full match). For this reason, these items now
|
||||
give a partial match in this situation. [Aside: previously, the one case
|
||||
/t\b/ matched against "cat" with PCRE_PARTIAL_HARD set did return a partial
|
||||
match rather than a full match, which was wrong by the old rules, but is
|
||||
now correct.]
|
||||
|
||||
6. There was a bug in the handling of #-introduced comments, recognized when
|
||||
PCRE_EXTENDED is set, when PCRE_NEWLINE_ANY and PCRE_UTF8 were also set.
|
||||
If a UTF-8 multi-byte character included the byte 0x85 (e.g. U+0445, whose
|
||||
UTF-8 encoding is 0xd1,0x85), this was misinterpreted as a newline when
|
||||
scanning for the end of the comment. (*Character* 0x85 is an "any" newline,
|
||||
but *byte* 0x85 is not, in UTF-8 mode). This bug was present in several
|
||||
places in pcre_compile().
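
The difference between byte 0x85 and character U+0085 described in item 6 can be sketched as below. This is an illustration only: it recognizes just LF and NEL (U+0085, encoded as 0xC2 0x85) as newlines, whereas the "any" convention covers more sequences, and the function names are hypothetical rather than taken from pcre_compile().

```c
#include <assert.h>
#include <stddef.h>

/* True if the UTF-8 character starting at p is U+0085 (NEL), i.e. 0xC2 0x85. */
static int is_nel_utf8(const unsigned char *p, const unsigned char *end)
{
  return (end - p) >= 2 && p[0] == 0xC2 && p[1] == 0x85;
}

/* Character-wise scan: returns the offset of the newline that ends a
   #-comment, or len if there is none. After each step, continuation bytes
   (10xxxxxx) are skipped so that only character starts are tested. */
static size_t comment_end_utf8(const unsigned char *s, size_t len)
{
  size_t i = 0;
  assert(s != NULL);
  while (i < len)
    {
    if (s[i] == '\n' || is_nel_utf8(s + i, s + len)) break;
    i++;
    while (i < len && (s[i] & 0xC0) == 0x80) i++;
    }
  return i;
}

/* Byte-wise scan, shown only to reproduce the kind of bug described above:
   any 0x85 byte is treated as a newline, even inside another character. */
static size_t comment_end_bytes(const unsigned char *s, size_t len)
{
  size_t i = 0;
  assert(s != NULL);
  while (i < len && s[i] != '\n' && s[i] != 0x85) i++;
  return i;
}
```

For a comment containing U+0445 (0xD1 0x85), the byte-wise scan stops in the middle of that character, while the character-wise scan continues to the real newline.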
|
||||
|
||||
7. Related to (6) above, when pcre_compile() was skipping #-introduced
|
||||
comments when looking ahead for named forward references to subpatterns,
|
||||
the only newline sequence it recognized was NL. It now handles newlines
|
||||
according to the set newline convention.
|
||||
|
||||
8. SunOS4 doesn't have strerror() or strtoul(); pcregrep dealt with the
|
||||
former, but used strtoul(), whereas pcretest avoided strtoul() but did not
|
||||
cater for a lack of strerror(). These oversights have been fixed.
|
||||
|
||||
9. Added --match-limit and --recursion-limit to pcregrep.
|
||||
|
||||
10. Added two casts needed to build with Visual Studio when NO_RECURSE is set.
|
||||
|
||||
11. When the -o option was used, pcregrep was setting a return code of 1, even
|
||||
when matches were found, and --line-buffered was not being honoured.
|
||||
|
||||
12. Added an optional parentheses number to the -o and --only-matching options
|
||||
of pcregrep.
|
||||
|
||||
13. Imitating Perl's /g action for multiple matches is tricky when the pattern
|
||||
can match an empty string. The code to do it in pcretest and pcredemo
|
||||
needed fixing:
|
||||
|
||||
(a) When the newline convention was "crlf", pcretest got it wrong, skipping
|
||||
only one byte after an empty string match just before CRLF (this case
|
||||
just got forgotten; "any" and "anycrlf" were OK).
|
||||
|
||||
(b) The pcretest code also had a bug, causing it to loop forever in UTF-8
|
||||
mode when an empty string match preceded an ASCII character followed by
|
||||
a non-ASCII character. (The code for advancing by one character rather
|
||||
than one byte was nonsense.)
|
||||
|
||||
(c) The pcredemo.c sample program did not have any code at all to handle
|
||||
the cases when CRLF is a valid newline sequence.
|
||||
|
||||
14. Neither pcre_exec() nor pcre_dfa_exec() was checking that the value given
|
||||
as a starting offset was within the subject string. There is now a new
|
||||
error, PCRE_ERROR_BADOFFSET, which is returned if the starting offset is
|
||||
negative or greater than the length of the string. In order to test this,
|
||||
pcretest is extended to allow the setting of negative starting offsets.
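
The range check described in item 14 amounts to something like the sketch below. The error value is the one added to pcre.h in this commit; the function name is hypothetical, and the real check lives inside pcre_exec() and pcre_dfa_exec(). Note that an offset equal to the subject length is accepted, since only offsets greater than the length are rejected.

```c
#include <assert.h>

#define PCRE_ERROR_BADOFFSET (-24)  /* value from the pcre.h hunk in this commit */

/* Hypothetical helper: returns 0 if start_offset lies within a subject of
   the given length, or PCRE_ERROR_BADOFFSET if it is negative or greater
   than the length. */
static int check_start_offset(int start_offset, int length)
{
  assert(length >= 0);
  if (start_offset < 0 || start_offset > length)
    return PCRE_ERROR_BADOFFSET;
  return 0;
}
```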
|
||||
|
||||
15. In both pcre_exec() and pcre_dfa_exec() the code for checking that the
|
||||
starting offset points to the beginning of a UTF-8 character was
|
||||
unnecessarily clumsy. I tidied it up.
|
||||
|
||||
16. Added PCRE_ERROR_SHORTUTF8 to make it possible to distinguish between a
|
||||
bad UTF-8 sequence and one that is incomplete when using PCRE_PARTIAL_HARD.
|
||||
|
||||
17. Nobody had reported that the --include_dir option, which was added in
|
||||
release 7.7 should have been called --include-dir (hyphen, not underscore)
|
||||
for compatibility with GNU grep. I have changed it to --include-dir, but
|
||||
left --include_dir as an undocumented synonym, and the same for
|
||||
--exclude-dir, though that is not available in GNU grep, at least as of
|
||||
release 2.5.4.
|
||||
|
||||
18. At a user's suggestion, the macros GETCHAR and friends (which pick up UTF-8
|
||||
characters from a string of bytes) have been redefined so as not to use
|
||||
loops, in order to improve performance in some environments. At the same
|
||||
time, I abstracted some of the common code into auxiliary macros to save
|
||||
repetition (this should not affect the compiled code).
|
||||
|
||||
19. If \c was followed by a multibyte UTF-8 character, bad things happened. A
|
||||
compile-time error is now given if \c is not followed by an ASCII
|
||||
character, that is, a byte less than 128. (In EBCDIC mode, the code is
|
||||
different, and any byte value is allowed.)
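
A minimal sketch of the ASCII-mode \c handling described in item 19, following the logic visible in the pcre_compile.c hunk of this commit: reject bytes above 127, upper-case a lower-case letter, then flip the 0x40 bit. The function name and the -1 return (standing in for the compile-time error) are assumptions made for illustration.

```c
#include <assert.h>

/* Hypothetical helper: maps the byte that follows \c to the resulting
   character, or returns -1 if that byte is not ASCII. */
static int control_escape(int c)
{
  assert(c >= 0 && c <= 255);
  if (c > 127) return -1;             /* non-ASCII byte: compile-time error */
  if (c >= 'a' && c <= 'z') c -= 32;  /* upper-case a lower-case letter */
  return c ^ 0x40;                    /* flip the 0x40 bit */
}
```

So \cA and \ca both give 0x01, \c[ gives 0x1b (ESC), and \c? gives 0x7f (DEL).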
|
||||
|
||||
20. Recognize (*NO_START_OPT) at the start of a pattern to set the
PCRE_NO_START_OPTIMIZE option, which is now allowed at compile time - but just
|
||||
passed through to pcre_exec() or pcre_dfa_exec(). This makes it available
|
||||
to pcregrep and other applications that have no direct access to PCRE
|
||||
options. The new /Y option in pcretest sets this option when calling
|
||||
pcre_compile().
|
||||
|
||||
21. Change 18 of release 8.01 broke the use of named subpatterns for recursive
|
||||
back references. Groups containing recursive back references were forced to
|
||||
be atomic by that change, but in the case of named groups, the amount of
|
||||
memory required was incorrectly computed, leading to "Failed: internal
|
||||
error: code overflow". This has been fixed.
|
||||
|
||||
22. Some patches to pcre_stringpiece.h, pcre_stringpiece_unittest.cc, and
|
||||
pcretest.c, to avoid build problems in some Borland environments.
|
||||
|
||||
|
||||
Version 8.10 25-Jun-2010
|
||||
------------------------
|
||||
|
||||
|
||||
@@ -4,6 +4,7 @@ Technical Notes about PCRE
|
||||
These are very rough technical notes that record potentially useful information
|
||||
about PCRE internals.
|
||||
|
||||
|
||||
Historical note 1
|
||||
-----------------
|
||||
|
||||
@@ -22,6 +23,7 @@ the one matching the longest subset of the subject string was chosen. This did
|
||||
not necessarily maximize the individual wild portions of the pattern, as is
|
||||
expected in Unix and Perl-style regular expressions.
|
||||
|
||||
|
||||
Historical note 2
|
||||
-----------------
|
||||
|
||||
@@ -34,6 +36,7 @@ maximizing (or, optionally, minimizing in Perl) the amount of the subject that
|
||||
matches individual wild portions of the pattern. This is an "NFA algorithm" in
|
||||
Friedl's terminology.
|
||||
|
||||
|
||||
OK, here's the real stuff
|
||||
-------------------------
|
||||
|
||||
@@ -44,6 +47,7 @@ in the pattern, to save on compiling time. However, because of the greater
|
||||
complexity in Perl regular expressions, I couldn't do this. In any case, a
|
||||
first pass through the pattern is helpful for other reasons.
|
||||
|
||||
|
||||
Computing the memory requirement: how it was
|
||||
--------------------------------------------
|
||||
|
||||
@@ -54,6 +58,7 @@ idea was that this would turn out faster than the Henry Spencer code because
|
||||
the first pass is degenerate and the second pass can just store stuff straight
|
||||
into the vector, which it knows is big enough.
|
||||
|
||||
|
||||
Computing the memory requirement: how it is
|
||||
-------------------------------------------
|
||||
|
||||
@@ -75,6 +80,7 @@ runs more slowly than before (30% or more, depending on the pattern) because it
|
||||
is doing a full analysis of the pattern. My hope was that this would not be a
|
||||
big issue, and in the event, nobody has commented on it.
|
||||
|
||||
|
||||
Traditional matching function
|
||||
-----------------------------
|
||||
|
||||
@@ -84,6 +90,7 @@ and the way that Perl works. This is not surprising, since it is intended to be
|
||||
as compatible with Perl as possible. This is the function most users of PCRE
|
||||
will use most of the time.
|
||||
|
||||
|
||||
Supplementary matching function
|
||||
-------------------------------
|
||||
|
||||
@@ -119,7 +126,6 @@ quantifiers) are always just two bytes long.
|
||||
|
||||
A list of the opcodes follows:
|
||||
|
||||
|
||||
Opcodes with no following data
|
||||
------------------------------
|
||||
|
||||
@@ -151,12 +157,24 @@ These items are all just one byte long
|
||||
OP_EXTUNI match an extended Unicode character
|
||||
OP_ANYNL match any Unicode newline sequence
|
||||
|
||||
OP_ACCEPT ) These are Perl 5.10's "backtracking
|
||||
OP_COMMIT ) control verbs". If OP_ACCEPT is inside
|
||||
OP_FAIL ) capturing parentheses, it may be preceded
|
||||
OP_PRUNE ) by one or more OP_CLOSE, followed by a 2-byte
|
||||
OP_SKIP ) number, indicating which parentheses must be
|
||||
OP_THEN ) closed.
|
||||
OP_ACCEPT ) These are Perl 5.10's "backtracking control
|
||||
OP_COMMIT ) verbs". If OP_ACCEPT is inside capturing
|
||||
OP_FAIL ) parentheses, it may be preceded by one or more
|
||||
OP_PRUNE ) OP_CLOSE, followed by a 2-byte number,
|
||||
OP_SKIP ) indicating which parentheses must be closed.
|
||||
|
||||
|
||||
Backtracking control verbs with data
|
||||
------------------------------------
|
||||
|
||||
OP_THEN is followed by a LINK_SIZE offset, which is the distance back to the
|
||||
start of the current branch.
|
||||
|
||||
OP_MARK is followed by the mark name, preceded by a one-byte length, and
|
||||
followed by a binary zero. For (*PRUNE), (*SKIP), and (*THEN) with arguments,
|
||||
the opcodes OP_PRUNE_ARG, OP_SKIP_ARG, and OP_THEN_ARG are used. For the first
|
||||
two, the name follows immediately; for OP_THEN_ARG, it follows the LINK_SIZE
|
||||
offset value.
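
The layouts described above can be sketched as follows. This assumes LINK_SIZE is 2 (PCRE's default) and uses a hypothetical helper name; it only shows where the name sits relative to the opcode, which is why the skipping code in pcre_compile.c uses code[1] for OP_MARK, OP_PRUNE_ARG and OP_SKIP_ARG but code[1+LINK_SIZE] for OP_THEN_ARG.

```c
#include <assert.h>
#include <stddef.h>

#define LINK_SIZE 2  /* assumption: the default link size */

/* Hypothetical helper: given a pointer to a verb opcode that carries a name,
   return a pointer to the name and store its length in *len. has_link is
   non-zero for the OP_THEN_ARG layout, where a LINK_SIZE offset sits between
   the opcode and the one-byte length. */
static const unsigned char *verb_name(const unsigned char *code, int has_link,
  size_t *len)
{
  const unsigned char *p;
  assert(code != NULL && len != NULL);
  p = code + 1 + (has_link ? LINK_SIZE : 0);  /* skip opcode, optional offset */
  *len = *p;                                  /* one-byte name length */
  return p + 1;                               /* name, then a binary zero */
}
```

The opcode byte itself is not inspected here, so any placeholder value can be used when trying the helper on a hand-built buffer.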
|
||||
|
||||
|
||||
Repeating single characters
|
||||
@@ -419,4 +437,4 @@ at compile time, and so does not cause anything to be put into the compiled
|
||||
data.
|
||||
|
||||
Philip Hazel
|
||||
-October 2009
+October 2010
|
||||
|
||||
@@ -1,6 +1,27 @@
|
||||
News about PCRE releases
|
||||
------------------------
|
||||
|
||||
Release 8.11 10-Dec-2010
|
||||
------------------------
|
||||
|
||||
A number of bugs in the library and in pcregrep have been fixed. As always, see
|
||||
ChangeLog for details. The following are the non-bug-fix changes:
|
||||
|
||||
. Added --match-limit and --recursion-limit to pcregrep.
|
||||
|
||||
. Added an optional parentheses number to the -o and --only-matching options
|
||||
of pcregrep.
|
||||
|
||||
. Changed the way PCRE_PARTIAL_HARD affects the matching of $, \z, \Z, \b, and
|
||||
\B.
|
||||
|
||||
. Added PCRE_ERROR_SHORTUTF8 to make it possible to distinguish between a
|
||||
bad UTF-8 sequence and one that is incomplete when using PCRE_PARTIAL_HARD.
|
||||
|
||||
. Recognize (*NO_START_OPT) at the start of a pattern to set the
PCRE_NO_START_OPTIMIZE option, which is now allowed at compile time
|
||||
|
||||
|
||||
Release 8.10 25-Jun-2010
|
||||
------------------------
|
||||
|
||||
|
||||
@@ -23,6 +23,7 @@
|
||||
# define PCRE_EXP_DATA_DEFN __attribute__ ((visibility("default")))
|
||||
#endif
|
||||
|
||||
|
||||
/* Exclude these below definitions when building within PHP */
|
||||
#ifndef ZEND_API
|
||||
|
||||
@@ -281,7 +282,7 @@ them both to 0; an emulation function will be used. */
|
||||
#define PACKAGE_NAME "PCRE"
|
||||
|
||||
/* Define to the full name and version of this package. */
|
||||
-#define PACKAGE_STRING "PCRE 8.10"
+#define PACKAGE_STRING "PCRE 8.11"
|
||||
|
||||
/* Define to the one symbol short name of this package. */
|
||||
#define PACKAGE_TARNAME "pcre"
|
||||
@@ -290,7 +291,7 @@ them both to 0; an emulation function will be used. */
|
||||
#define PACKAGE_URL ""
|
||||
|
||||
/* Define to the version of this package. */
|
||||
-#define PACKAGE_VERSION "8.10"
+#define PACKAGE_VERSION "8.11"
|
||||
|
||||
|
||||
/* If you are compiling for a system other than a Unix-like system or
|
||||
@@ -346,7 +347,7 @@ them both to 0; an emulation function will be used. */
|
||||
|
||||
/* Version number of package */
|
||||
#ifndef VERSION
|
||||
-#define VERSION "8.10"
+#define VERSION "8.11"
|
||||
#endif
|
||||
|
||||
/* Define to empty if `const' does not conform to ANSI C. */
|
||||
|
||||
+1399
-1209
File diff suppressed because it is too large
+40
-36
@@ -42,9 +42,9 @@ POSSIBILITY OF SUCH DAMAGE.
|
||||
/* The current PCRE version information. */
|
||||
|
||||
#define PCRE_MAJOR 8
|
||||
-#define PCRE_MINOR 10
+#define PCRE_MINOR 11
|
||||
#define PCRE_PRERELEASE
|
||||
-#define PCRE_DATE 2010-06-25
+#define PCRE_DATE 2010-12-10
|
||||
|
||||
/* When an application links to a PCRE DLL in Windows, the symbols that are
|
||||
imported have to be identified as such. When building PCRE, the appropriate
|
||||
@@ -96,42 +96,44 @@ extern "C" {
|
||||
#endif
|
||||
|
||||
/* Options. Some are compile-time only, some are run-time only, and some are
|
||||
both, so we keep them all distinct. */
|
||||
both, so we keep them all distinct. However, almost all the bits in the options
|
||||
word are now used. In the long run, we may have to re-use some of the
|
||||
compile-time only bits for runtime options, or vice versa. */
|
||||
|
||||
#define PCRE_CASELESS 0x00000001
|
||||
#define PCRE_MULTILINE 0x00000002
|
||||
#define PCRE_DOTALL 0x00000004
|
||||
#define PCRE_EXTENDED 0x00000008
|
||||
#define PCRE_ANCHORED 0x00000010
|
||||
#define PCRE_DOLLAR_ENDONLY 0x00000020
|
||||
#define PCRE_EXTRA 0x00000040
|
||||
#define PCRE_NOTBOL 0x00000080
|
||||
#define PCRE_NOTEOL 0x00000100
|
||||
#define PCRE_UNGREEDY 0x00000200
|
||||
#define PCRE_NOTEMPTY 0x00000400
|
||||
#define PCRE_UTF8 0x00000800
|
||||
#define PCRE_NO_AUTO_CAPTURE 0x00001000
|
||||
#define PCRE_NO_UTF8_CHECK 0x00002000
|
||||
#define PCRE_AUTO_CALLOUT 0x00004000
|
||||
#define PCRE_PARTIAL_SOFT 0x00008000
|
||||
#define PCRE_CASELESS 0x00000001 /* Compile */
|
||||
#define PCRE_MULTILINE 0x00000002 /* Compile */
|
||||
#define PCRE_DOTALL 0x00000004 /* Compile */
|
||||
#define PCRE_EXTENDED 0x00000008 /* Compile */
|
||||
#define PCRE_ANCHORED 0x00000010 /* Compile, exec, DFA exec */
|
||||
#define PCRE_DOLLAR_ENDONLY 0x00000020 /* Compile */
|
||||
#define PCRE_EXTRA 0x00000040 /* Compile */
|
||||
#define PCRE_NOTBOL 0x00000080 /* Exec, DFA exec */
|
||||
#define PCRE_NOTEOL 0x00000100 /* Exec, DFA exec */
|
||||
#define PCRE_UNGREEDY 0x00000200 /* Compile */
|
||||
#define PCRE_NOTEMPTY 0x00000400 /* Exec, DFA exec */
|
||||
#define PCRE_UTF8 0x00000800 /* Compile */
|
||||
#define PCRE_NO_AUTO_CAPTURE 0x00001000 /* Compile */
|
||||
#define PCRE_NO_UTF8_CHECK 0x00002000 /* Compile, exec, DFA exec */
|
||||
#define PCRE_AUTO_CALLOUT 0x00004000 /* Compile */
|
||||
#define PCRE_PARTIAL_SOFT 0x00008000 /* Exec, DFA exec */
|
||||
#define PCRE_PARTIAL 0x00008000 /* Backwards compatible synonym */
|
||||
#define PCRE_DFA_SHORTEST 0x00010000
|
||||
#define PCRE_DFA_RESTART 0x00020000
|
||||
#define PCRE_FIRSTLINE 0x00040000
|
||||
#define PCRE_DUPNAMES 0x00080000
|
||||
#define PCRE_NEWLINE_CR 0x00100000
|
||||
#define PCRE_NEWLINE_LF 0x00200000
|
||||
#define PCRE_NEWLINE_CRLF 0x00300000
|
||||
#define PCRE_NEWLINE_ANY 0x00400000
|
||||
#define PCRE_NEWLINE_ANYCRLF 0x00500000
|
||||
#define PCRE_BSR_ANYCRLF 0x00800000
|
||||
#define PCRE_BSR_UNICODE 0x01000000
|
||||
#define PCRE_JAVASCRIPT_COMPAT 0x02000000
|
||||
#define PCRE_NO_START_OPTIMIZE 0x04000000
|
||||
#define PCRE_NO_START_OPTIMISE 0x04000000
|
||||
#define PCRE_PARTIAL_HARD 0x08000000
|
||||
#define PCRE_NOTEMPTY_ATSTART 0x10000000
|
||||
#define PCRE_UCP 0x20000000
|
||||
#define PCRE_DFA_SHORTEST 0x00010000 /* DFA exec */
|
||||
#define PCRE_DFA_RESTART 0x00020000 /* DFA exec */
|
||||
#define PCRE_FIRSTLINE 0x00040000 /* Compile */
|
||||
#define PCRE_DUPNAMES 0x00080000 /* Compile */
|
||||
#define PCRE_NEWLINE_CR 0x00100000 /* Compile, exec, DFA exec */
|
||||
#define PCRE_NEWLINE_LF 0x00200000 /* Compile, exec, DFA exec */
|
||||
#define PCRE_NEWLINE_CRLF 0x00300000 /* Compile, exec, DFA exec */
|
||||
#define PCRE_NEWLINE_ANY 0x00400000 /* Compile, exec, DFA exec */
|
||||
#define PCRE_NEWLINE_ANYCRLF 0x00500000 /* Compile, exec, DFA exec */
|
||||
#define PCRE_BSR_ANYCRLF 0x00800000 /* Compile, exec, DFA exec */
|
||||
#define PCRE_BSR_UNICODE 0x01000000 /* Compile, exec, DFA exec */
|
||||
#define PCRE_JAVASCRIPT_COMPAT 0x02000000 /* Compile */
|
||||
#define PCRE_NO_START_OPTIMIZE 0x04000000 /* Compile, exec, DFA exec */
|
||||
#define PCRE_NO_START_OPTIMISE 0x04000000 /* Synonym */
|
||||
#define PCRE_PARTIAL_HARD 0x08000000 /* Exec, DFA exec */
|
||||
#define PCRE_NOTEMPTY_ATSTART 0x10000000 /* Exec, DFA exec */
|
||||
#define PCRE_UCP 0x20000000 /* Compile */
|
||||
|
||||
/* Exec-time and get/set-time error codes */
|
||||
|
||||
@@ -159,6 +161,8 @@ both, so we keep them all distinct. */
|
||||
#define PCRE_ERROR_RECURSIONLIMIT (-21)
|
||||
#define PCRE_ERROR_NULLWSLIMIT (-22) /* No longer actually used */
|
||||
#define PCRE_ERROR_BADNEWLINE (-23)
|
||||
#define PCRE_ERROR_BADOFFSET (-24)
|
||||
#define PCRE_ERROR_SHORTUTF8 (-25)
|
||||
|
||||
/* Request types for pcre_fullinfo() */
|
||||
|
||||
|
||||
+127
-31
@@ -406,6 +406,7 @@ static const char error_texts[] =
|
||||
"different names for subpatterns of the same number are not allowed\0"
|
||||
"(*MARK) must have an argument\0"
|
||||
"this version of PCRE is not compiled with PCRE_UCP support\0"
|
||||
"\\c must be followed by an ASCII character\0"
|
||||
;
|
||||
|
||||
/* Table to identify digits and hex digits. This is used when compiling
|
||||
@@ -839,7 +840,8 @@ else
|
||||
break;
|
||||
|
||||
/* For \c, a following letter is upper-cased; then the 0x40 bit is flipped.
|
||||
This coding is ASCII-specific, but then the whole concept of \cx is
|
||||
An error is given if the byte following \c is not an ASCII character. This
|
||||
coding is ASCII-specific, but then the whole concept of \cx is
|
||||
ASCII-specific. (However, an EBCDIC equivalent has now been added.) */
|
||||
|
||||
case CHAR_c:
|
||||
@@ -849,11 +851,15 @@ else
|
||||
*errorcodeptr = ERR2;
|
||||
break;
|
||||
}
|
||||
|
||||
#ifndef EBCDIC /* ASCII/UTF-8 coding */
|
||||
#ifndef EBCDIC /* ASCII/UTF-8 coding */
|
||||
if (c > 127) /* Excludes all non-ASCII in either mode */
|
||||
{
|
||||
*errorcodeptr = ERR68;
|
||||
break;
|
||||
}
|
||||
if (c >= CHAR_a && c <= CHAR_z) c -= 32;
|
||||
c ^= 0x40;
|
||||
#else /* EBCDIC coding */
|
||||
#else /* EBCDIC coding */
|
||||
if (c >= CHAR_a && c <= CHAR_z) c += 64;
|
||||
c ^= 0xC0;
|
||||
#endif
|
||||
@@ -1097,10 +1103,21 @@ top-level call starts at the beginning of the pattern. All other calls must
|
||||
start at a parenthesis. It scans along a pattern's text looking for capturing
|
||||
subpatterns, and counting them. If it finds a named pattern that matches the
|
||||
name it is given, it returns its number. Alternatively, if the name is NULL, it
|
||||
returns when it reaches a given numbered subpattern. We know that if (?P< is
|
||||
encountered, the name will be terminated by '>' because that is checked in the
|
||||
first pass. Recursion is used to keep track of subpatterns that reset the
|
||||
capturing group numbers - the (?| feature.
|
||||
returns when it reaches a given numbered subpattern. Recursion is used to keep
|
||||
track of subpatterns that reset the capturing group numbers - the (?| feature.
|
||||
|
||||
This function was originally called only from the second pass, in which we know
|
||||
that if (?< or (?' or (?P< is encountered, the name will be correctly
|
||||
terminated because that is checked in the first pass. There is now one call to
|
||||
this function in the first pass, to check for a recursive back reference by
|
||||
name (so that we can make the whole group atomic). In this case, we need check
|
||||
only up to the current position in the pattern, and that is still OK because
|
||||
any previous occurrences will have been checked. To make this work, the test
|
||||
for "end of pattern" is a check against cd->end_pattern in the main loop,
|
||||
instead of looking for a binary zero. This means that the special first-pass
|
||||
call can adjust cd->end_pattern temporarily. (Checks for binary zero while
|
||||
processing items within the loop are OK, because afterwards the main loop will
|
||||
terminate.)
|
||||
|
||||
Arguments:
|
||||
ptrptr address of the current character pointer (updated)
|
||||
@@ -1108,6 +1125,7 @@ Arguments:
|
||||
name name to seek, or NULL if seeking a numbered subpattern
|
||||
lorn name length, or subpattern number if name is NULL
|
||||
xmode TRUE if we are in /x mode
|
||||
utf8 TRUE if we are in UTF-8 mode
|
||||
count pointer to the current capturing subpattern number (updated)
|
||||
|
||||
Returns: the number of the named subpattern, or -1 if not found
|
||||
@@ -1115,7 +1133,7 @@ Returns: the number of the named subpattern, or -1 if not found
|
||||
|
||||
static int
|
||||
find_parens_sub(uschar **ptrptr, compile_data *cd, const uschar *name, int lorn,
|
||||
BOOL xmode, int *count)
|
||||
BOOL xmode, BOOL utf8, int *count)
|
||||
{
|
||||
uschar *ptr = *ptrptr;
|
||||
int start_count = *count;
|
||||
@@ -1200,9 +1218,11 @@ if (ptr[0] == CHAR_LEFT_PARENTHESIS)
|
||||
}
|
||||
|
||||
/* Past any initial parenthesis handling, scan for parentheses or vertical
|
||||
bars. */
|
||||
bars. Stop if we get to cd->end_pattern. Note that this is important for the
|
||||
first-pass call when this value is temporarily adjusted to stop at the current
|
||||
position. So DO NOT change this to a test for binary zero. */
|
||||
|
||||
-for (; *ptr != 0; ptr++)
+for (; ptr < cd->end_pattern; ptr++)
|
||||
{
|
||||
/* Skip over backslashed characters and also entire \Q...\E */
|
||||
|
||||
@@ -1276,7 +1296,15 @@ for (; *ptr != 0; ptr++)
|
||||
|
||||
if (xmode && *ptr == CHAR_NUMBER_SIGN)
|
||||
{
|
||||
while (*(++ptr) != 0 && *ptr != CHAR_NL) {};
|
||||
ptr++;
|
||||
while (*ptr != 0)
|
||||
{
|
||||
if (IS_NEWLINE(ptr)) { ptr += cd->nllen - 1; break; }
|
||||
ptr++;
|
||||
#ifdef SUPPORT_UTF8
|
||||
if (utf8) while ((*ptr & 0xc0) == 0x80) ptr++;
|
||||
#endif
|
||||
}
|
||||
if (*ptr == 0) goto FAIL_EXIT;
|
||||
continue;
|
||||
}
|
||||
@@ -1285,7 +1313,7 @@ for (; *ptr != 0; ptr++)
|
||||
|
||||
if (*ptr == CHAR_LEFT_PARENTHESIS)
|
||||
{
|
||||
int rc = find_parens_sub(&ptr, cd, name, lorn, xmode, count);
|
||||
int rc = find_parens_sub(&ptr, cd, name, lorn, xmode, utf8, count);
|
||||
if (rc > 0) return rc;
|
||||
if (*ptr == 0) goto FAIL_EXIT;
|
||||
}
|
||||
@@ -1331,12 +1359,14 @@ Arguments:
|
||||
name name to seek, or NULL if seeking a numbered subpattern
|
||||
lorn name length, or subpattern number if name is NULL
|
||||
xmode TRUE if we are in /x mode
|
||||
utf8 TRUE if we are in UTF-8 mode
|
||||
|
||||
Returns: the number of the found subpattern, or -1 if not found
|
||||
*/
|
||||
|
||||
static int
|
||||
find_parens(compile_data *cd, const uschar *name, int lorn, BOOL xmode)
|
||||
find_parens(compile_data *cd, const uschar *name, int lorn, BOOL xmode,
|
||||
BOOL utf8)
|
||||
{
|
||||
uschar *ptr = (uschar *)cd->start_pattern;
|
||||
int count = 0;
|
||||
@@ -1349,7 +1379,7 @@ matching closing parens. That is why we have to have a loop. */
|
||||
|
||||
for (;;)
|
||||
{
|
||||
rc = find_parens_sub(&ptr, cd, name, lorn, xmode, &count);
|
||||
rc = find_parens_sub(&ptr, cd, name, lorn, xmode, utf8, &count);
|
||||
if (rc > 0 || *ptr++ == 0) break;
|
||||
}
|
||||
|
||||
@@ -1722,9 +1752,12 @@ for (;;)
|
||||
case OP_MARK:
|
||||
case OP_PRUNE_ARG:
|
||||
case OP_SKIP_ARG:
|
||||
case OP_THEN_ARG:
|
||||
code += code[1];
|
||||
break;
|
||||
|
||||
case OP_THEN_ARG:
|
||||
code += code[1+LINK_SIZE];
|
||||
break;
|
||||
}
|
||||
|
||||
/* Add in the fixed length from the table */
|
||||
@@ -1825,9 +1858,12 @@ for (;;)
|
||||
case OP_MARK:
|
||||
case OP_PRUNE_ARG:
|
||||
case OP_SKIP_ARG:
|
||||
case OP_THEN_ARG:
|
||||
code += code[1];
|
||||
break;
|
||||
|
||||
case OP_THEN_ARG:
|
||||
code += code[1+LINK_SIZE];
|
||||
break;
|
||||
}
|
||||
|
||||
/* Add in the fixed length from the table */
|
||||
@@ -2103,10 +2139,13 @@ for (code = first_significant_code(code + _pcre_OP_lengths[*code], NULL, 0, TRUE
|
||||
case OP_MARK:
|
||||
case OP_PRUNE_ARG:
|
||||
case OP_SKIP_ARG:
|
||||
case OP_THEN_ARG:
|
||||
code += code[1];
|
||||
break;
|
||||
|
||||
case OP_THEN_ARG:
|
||||
code += code[1+LINK_SIZE];
|
||||
break;
|
||||
|
||||
/* None of the remaining opcodes are required to match a character. */
|
||||
|
||||
default:
|
||||
@@ -2504,8 +2543,15 @@ if ((options & PCRE_EXTENDED) != 0)
|
||||
while ((cd->ctypes[*ptr] & ctype_space) != 0) ptr++;
|
||||
if (*ptr == CHAR_NUMBER_SIGN)
|
||||
{
|
||||
while (*(++ptr) != 0)
|
||||
ptr++;
|
||||
while (*ptr != 0)
|
||||
{
|
||||
if (IS_NEWLINE(ptr)) { ptr += cd->nllen; break; }
|
||||
ptr++;
|
||||
#ifdef SUPPORT_UTF8
|
||||
if (utf8) while ((*ptr & 0xc0) == 0x80) ptr++;
|
||||
#endif
|
||||
}
|
||||
}
|
||||
else break;
|
||||
}
|
||||
@@ -2541,8 +2587,15 @@ if ((options & PCRE_EXTENDED) != 0)
|
||||
while ((cd->ctypes[*ptr] & ctype_space) != 0) ptr++;
|
||||
if (*ptr == CHAR_NUMBER_SIGN)
|
||||
{
|
||||
while (*(++ptr) != 0)
|
||||
ptr++;
|
||||
while (*ptr != 0)
|
||||
{
|
||||
if (IS_NEWLINE(ptr)) { ptr += cd->nllen; break; }
|
||||
ptr++;
|
||||
#ifdef SUPPORT_UTF8
|
||||
if (utf8) while ((*ptr & 0xc0) == 0x80) ptr++;
|
||||
#endif
|
||||
}
|
||||
}
|
||||
else break;
|
||||
}
|
||||
@@ -3115,9 +3168,14 @@ for (;; ptr++)
|
||||
if ((cd->ctypes[c] & ctype_space) != 0) continue;
|
||||
if (c == CHAR_NUMBER_SIGN)
|
||||
{
|
||||
while (*(++ptr) != 0)
|
||||
ptr++;
|
||||
while (*ptr != 0)
|
||||
{
|
||||
if (IS_NEWLINE(ptr)) { ptr += cd->nllen - 1; break; }
|
||||
ptr++;
|
||||
#ifdef SUPPORT_UTF8
|
||||
if (utf8) while ((*ptr & 0xc0) == 0x80) ptr++;
|
||||
#endif
|
||||
}
|
||||
if (*ptr != 0) continue;
|
||||
|
||||
@@ -3492,9 +3550,14 @@ for (;; ptr++)
|
||||
for (c = 0; c < 32; c++) classbits[c] |= ~cbits[c+cbit_word];
|
||||
continue;
|
||||
|
||||
/* Perl 5.004 onwards omits VT from \s, but we must preserve it
|
||||
if it was previously set by something earlier in the character
|
||||
class. */
|
||||
|
||||
case ESC_s:
|
||||
for (c = 0; c < 32; c++) classbits[c] |= cbits[c+cbit_space];
|
||||
classbits[1] &= ~0x08; /* Perl 5.004 onwards omits VT from \s */
|
||||
classbits[0] |= cbits[cbit_space];
|
||||
classbits[1] |= cbits[cbit_space+1] & ~0x08;
|
||||
for (c = 2; c < 32; c++) classbits[c] |= cbits[c+cbit_space];
|
||||
continue;
|
||||
|
||||
case ESC_S:
|
||||
@@ -4806,7 +4869,12 @@ for (;; ptr++)
|
||||
*errorcodeptr = ERR66;
|
||||
goto FAILED;
|
||||
}
|
||||
*code++ = verbs[i].op;
|
||||
*code = verbs[i].op;
|
||||
if (*code++ == OP_THEN)
|
||||
{
|
||||
PUT(code, 0, code - bcptr->current_branch - 1);
|
||||
code += LINK_SIZE;
|
||||
}
|
||||
}
|
||||
|
||||
else
|
||||
@@ -4816,7 +4884,12 @@ for (;; ptr++)
|
||||
*errorcodeptr = ERR59;
|
||||
goto FAILED;
|
||||
}
|
||||
*code++ = verbs[i].op_arg;
|
||||
*code = verbs[i].op_arg;
|
||||
if (*code++ == OP_THEN_ARG)
|
||||
{
|
||||
PUT(code, 0, code - bcptr->current_branch - 1);
|
||||
code += LINK_SIZE;
|
||||
}
|
||||
*code++ = arglen;
|
||||
memcpy(code, arg, arglen);
|
||||
code += arglen;
|
||||
@@ -5010,7 +5083,7 @@ for (;; ptr++)
|
||||
/* Search the pattern for a forward reference */
|
||||
|
||||
else if ((i = find_parens(cd, name, namelen,
|
||||
(options & PCRE_EXTENDED) != 0)) > 0)
|
||||
(options & PCRE_EXTENDED) != 0, utf8)) > 0)
|
||||
{
|
||||
PUT2(code, 2+LINK_SIZE, i);
|
||||
code[1+LINK_SIZE]++;
|
||||
@@ -5311,11 +5384,17 @@ for (;; ptr++)
|
||||
while ((cd->ctypes[*ptr] & ctype_word) != 0) ptr++;
|
||||
namelen = (int)(ptr - name);
|
||||
|
||||
/* In the pre-compile phase, do a syntax check and set a dummy
|
||||
reference number. */
|
||||
/* In the pre-compile phase, do a syntax check. We used to just set
|
||||
a dummy reference number, because it was not used in the first pass.
|
||||
However, with the change of recursive back references to be atomic,
|
||||
we have to look for the number so that this state can be identified, as
|
||||
otherwise the incorrect length is computed. If it's not a backwards
|
||||
reference, the dummy number will do. */
|
||||
|
||||
if (lengthptr != NULL)
|
||||
{
|
||||
const uschar *temp;
|
||||
|
||||
if (namelen == 0)
|
||||
{
|
||||
*errorcodeptr = ERR62;
|
||||
@@ -5331,7 +5410,22 @@ for (;; ptr++)
|
||||
*errorcodeptr = ERR48;
|
||||
goto FAILED;
|
||||
}
|
||||
recno = 0;
|
||||
|
||||
/* The name table does not exist in the first pass, so we cannot
|
||||
do a simple search as in the code below. Instead, we have to scan the
|
||||
pattern to find the number. It is important that we scan it only as
|
||||
far as we have got because the syntax of named subpatterns has not
|
||||
been checked for the rest of the pattern, and find_parens() assumes
|
||||
correct syntax. In any case, it's a waste of resources to scan
|
||||
further. We stop the scan at the current point by temporarily
|
||||
adjusting the value of cd->end_pattern. */
|
||||
|
||||
temp = cd->end_pattern;
|
||||
cd->end_pattern = ptr;
|
||||
recno = find_parens(cd, name, namelen,
|
||||
(options & PCRE_EXTENDED) != 0, utf8);
|
||||
cd->end_pattern = temp;
|
||||
if (recno < 0) recno = 0; /* Forward ref; set dummy number */
|
||||
}
|
||||
|
||||
/* In the real compile, seek the name in the table. We check the name
|
||||
@@ -5356,7 +5450,7 @@ for (;; ptr++)
|
||||
}
|
||||
else if ((recno = /* Forward back reference */
|
||||
find_parens(cd, name, namelen,
|
||||
(options & PCRE_EXTENDED) != 0)) <= 0)
|
||||
(options & PCRE_EXTENDED) != 0, utf8)) <= 0)
|
||||
{
|
||||
*errorcodeptr = ERR15;
|
||||
goto FAILED;
|
||||
@@ -5467,7 +5561,7 @@ for (;; ptr++)
|
||||
if (called == NULL)
|
||||
{
|
||||
if (find_parens(cd, NULL, recno,
|
||||
(options & PCRE_EXTENDED) != 0) < 0)
|
||||
(options & PCRE_EXTENDED) != 0, utf8) < 0)
|
||||
{
|
||||
*errorcodeptr = ERR15;
|
||||
goto FAILED;
|
||||
@@ -6797,6 +6891,8 @@ while (ptr[skipatstart] == CHAR_LEFT_PARENTHESIS &&
|
||||
{ skipatstart += 7; options |= PCRE_UTF8; continue; }
|
||||
else if (strncmp((char *)(ptr+skipatstart+2), STRING_UCP_RIGHTPAR, 4) == 0)
|
||||
{ skipatstart += 6; options |= PCRE_UCP; continue; }
|
||||
else if (strncmp((char *)(ptr+skipatstart+2), STRING_NO_START_OPT_RIGHTPAR, 13) == 0)
|
||||
{ skipatstart += 15; options |= PCRE_NO_START_OPTIMIZE; continue; }
|
||||
|
||||
if (strncmp((char *)(ptr+skipatstart+2), STRING_CR_RIGHTPAR, 3) == 0)
|
||||
{ skipatstart += 5; newnl = PCRE_NEWLINE_CR; }
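Note on the hunk above: the new (*NO_START_OPT) item is recognised the same way as (*UTF8) and (*UCP), by comparing the text after "(*" against a fixed string and skipping its length plus two. The following is a minimal standalone sketch of that scan, not the PCRE code itself. The function name, the table, and the SKETCH_* flag values are hypothetical and chosen only for illustration; newline items such as (*CR) are deliberately left out.

```c
#include <string.h>

/* Hypothetical flag values for this sketch; PCRE's real option bits differ. */
#define SKETCH_UTF8          0x1u
#define SKETCH_UCP           0x2u
#define SKETCH_NO_START_OPT  0x4u

/* Consume leading "(*NAME)" items, OR the matching flags into *options, and
   return the number of bytes skipped. An unrecognised item stops the scan. */
static size_t scan_leading_options(const char *pattern, unsigned *options)
{
  static const struct { const char *text; unsigned flag; } items[] = {
    { "UTF8)",         SKETCH_UTF8 },
    { "UCP)",          SKETCH_UCP },
    { "NO_START_OPT)", SKETCH_NO_START_OPT },
  };
  const size_t n = sizeof(items) / sizeof(items[0]);
  size_t skip = 0;

  while (pattern[skip] == '(' && pattern[skip + 1] == '*')
    {
    size_t i;
    for (i = 0; i < n; i++)
      {
      size_t len = strlen(items[i].text);
      if (strncmp(pattern + skip + 2, items[i].text, len) == 0)
        {
        *options |= items[i].flag;
        skip += len + 2;          /* "(*" plus the name and ")" */
        break;
        }
      }
    if (i == n) break;            /* Not an item this sketch knows about */
    }
  return skip;
}
```

With this sketch, "(*UTF8)(*NO_START_OPT)abc" skips 7 + 15 bytes, which agrees with the `skipatstart += 7` and `skipatstart += 15` increments in the diff.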
|
||||
|
||||
@@ -292,7 +292,7 @@ argument of match(), which never changes. */
|
||||
|
||||
#define RMATCH(ra,rb,rc,rd,re,rf,rg,rw)\
|
||||
{\
|
||||
heapframe *newframe = (pcre_stack_malloc)(sizeof(heapframe));\
|
||||
heapframe *newframe = (heapframe *)(pcre_stack_malloc)(sizeof(heapframe));\
|
||||
if (newframe == NULL) RRETURN(PCRE_ERROR_NOMEMORY);\
|
||||
frame->Xwhere = rw; \
|
||||
newframe->Xeptr = ra;\
|
||||
@@ -420,17 +420,18 @@ immediately. The second one is used when we already know we are past the end of
|
||||
the subject. */
|
||||
|
||||
#define CHECK_PARTIAL()\
|
||||
if (md->partial != 0 && eptr >= md->end_subject && eptr > mstart)\
|
||||
{\
|
||||
md->hitend = TRUE;\
|
||||
if (md->partial > 1) MRRETURN(PCRE_ERROR_PARTIAL);\
|
||||
if (md->partial != 0 && eptr >= md->end_subject && \
|
||||
eptr > md->start_used_ptr) \
|
||||
{ \
|
||||
md->hitend = TRUE; \
|
||||
if (md->partial > 1) MRRETURN(PCRE_ERROR_PARTIAL); \
|
||||
}
|
||||
|
||||
#define SCHECK_PARTIAL()\
|
||||
if (md->partial != 0 && eptr > mstart)\
|
||||
{\
|
||||
md->hitend = TRUE;\
|
||||
if (md->partial > 1) MRRETURN(PCRE_ERROR_PARTIAL);\
|
||||
if (md->partial != 0 && eptr > md->start_used_ptr) \
|
||||
{ \
|
||||
md->hitend = TRUE; \
|
||||
if (md->partial > 1) MRRETURN(PCRE_ERROR_PARTIAL); \
|
||||
}
|
||||
|
||||
|
||||
@@ -486,7 +487,7 @@ heap storage. Set up the top-level frame here; others are obtained from the
|
||||
heap whenever RMATCH() does a "recursion". See the macro definitions above. */
|
||||
|
||||
#ifdef NO_RECURSE
|
||||
heapframe *frame = (pcre_stack_malloc)(sizeof(heapframe));
|
||||
heapframe *frame = (heapframe *)(pcre_stack_malloc)(sizeof(heapframe));
|
||||
if (frame == NULL) RRETURN(PCRE_ERROR_NOMEMORY);
|
||||
frame->Xprevframe = NULL; /* Marks the top level */
|
||||
|
||||
@@ -708,36 +709,47 @@ for (;;)
|
||||
case OP_FAIL:
|
||||
MRRETURN(MATCH_NOMATCH);
|
||||
|
||||
/* COMMIT overrides PRUNE, SKIP, and THEN */
|
||||
|
||||
case OP_COMMIT:
|
||||
RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md,
|
||||
ims, eptrb, flags, RM52);
|
||||
if (rrc != MATCH_NOMATCH) RRETURN(rrc);
|
||||
if (rrc != MATCH_NOMATCH && rrc != MATCH_PRUNE &&
|
||||
rrc != MATCH_SKIP && rrc != MATCH_SKIP_ARG &&
|
||||
rrc != MATCH_THEN)
|
||||
RRETURN(rrc);
|
||||
MRRETURN(MATCH_COMMIT);
|
||||
|
||||
/* PRUNE overrides THEN */
|
||||
|
||||
case OP_PRUNE:
|
||||
RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md,
|
||||
ims, eptrb, flags, RM51);
|
||||
if (rrc != MATCH_NOMATCH) RRETURN(rrc);
|
||||
if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN) RRETURN(rrc);
|
||||
MRRETURN(MATCH_PRUNE);
|
||||
|
||||
case OP_PRUNE_ARG:
|
||||
RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode] + ecode[1], offset_top, md,
|
||||
ims, eptrb, flags, RM56);
|
||||
if (rrc != MATCH_NOMATCH) RRETURN(rrc);
|
||||
if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN) RRETURN(rrc);
|
||||
md->mark = ecode + 2;
|
||||
RRETURN(MATCH_PRUNE);
|
||||
|
||||
/* SKIP overrides PRUNE and THEN */
|
||||
|
||||
case OP_SKIP:
|
||||
RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md,
|
||||
ims, eptrb, flags, RM53);
|
||||
if (rrc != MATCH_NOMATCH) RRETURN(rrc);
|
||||
if (rrc != MATCH_NOMATCH && rrc != MATCH_PRUNE && rrc != MATCH_THEN)
|
||||
RRETURN(rrc);
|
||||
md->start_match_ptr = eptr; /* Pass back current position */
|
||||
MRRETURN(MATCH_SKIP);
|
||||
|
||||
case OP_SKIP_ARG:
|
||||
RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode] + ecode[1], offset_top, md,
|
||||
ims, eptrb, flags, RM57);
|
||||
if (rrc != MATCH_NOMATCH) RRETURN(rrc);
|
||||
if (rrc != MATCH_NOMATCH && rrc != MATCH_PRUNE && rrc != MATCH_THEN)
|
||||
RRETURN(rrc);
|
||||
|
||||
/* Pass back the current skip name by overloading md->start_match_ptr and
|
||||
returning the special MATCH_SKIP_ARG return code. This will either be
|
||||
@@ -747,17 +759,24 @@ for (;;)
|
||||
md->start_match_ptr = ecode + 2;
|
||||
RRETURN(MATCH_SKIP_ARG);
|
||||
|
||||
/* For THEN (and THEN_ARG) we pass back the address of the bracket or
|
||||
the alt that is at the start of the current branch. This makes it possible
|
||||
to skip back past alternatives that precede the THEN within the current
|
||||
branch. */
|
||||
|
||||
case OP_THEN:
|
||||
RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md,
|
||||
ims, eptrb, flags, RM54);
|
||||
if (rrc != MATCH_NOMATCH) RRETURN(rrc);
|
||||
md->start_match_ptr = ecode - GET(ecode, 1);
|
||||
MRRETURN(MATCH_THEN);
|
||||
|
||||
case OP_THEN_ARG:
|
||||
RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode] + ecode[1], offset_top, md,
|
||||
ims, eptrb, flags, RM58);
|
||||
RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode] + ecode[1+LINK_SIZE],
|
||||
offset_top, md, ims, eptrb, flags, RM58);
|
||||
if (rrc != MATCH_NOMATCH) RRETURN(rrc);
|
||||
md->mark = ecode + 2;
|
||||
md->start_match_ptr = ecode - GET(ecode, 1);
|
||||
md->mark = ecode + LINK_SIZE + 2;
|
||||
RRETURN(MATCH_THEN);
|
||||
|
||||
/* Handle a capturing bracket. If there is space in the offset vector, save
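Note on the two hunks above: each verb now absorbs the return codes of the verbs it overrides, so the precedence is COMMIT over SKIP over PRUNE over THEN. A compact way to picture this, assuming result codes that are ordered by strength (the enum values and the function name below are hypothetical, not PCRE's, and error codes are not modelled):

```c
/* Hypothetical result codes, ordered from weakest to strongest. */
enum { R_NOMATCH, R_THEN, R_PRUNE, R_SKIP, R_COMMIT, R_MATCH };

/* Result passed up by `verb` when the code that follows it returned `rrc`.
   A plain failure or a weaker verb's code is replaced by this verb's own
   code; anything at least as strong (including a match) passes through. */
static int backtrack_through(int verb, int rrc)
{
  return (rrc < verb) ? verb : rrc;
}
```

For example, a THEN result reaching a COMMIT becomes COMMIT, while a SKIP result reaching a PRUNE stays SKIP. The real code also records extra state (md->mark, md->start_match_ptr) that this sketch leaves out.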
|
||||
@@ -802,7 +821,9 @@ for (;;)
|
||||
{
|
||||
RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md,
|
||||
ims, eptrb, flags, RM1);
|
||||
if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN) RRETURN(rrc);
|
||||
if (rrc != MATCH_NOMATCH &&
|
||||
(rrc != MATCH_THEN || md->start_match_ptr != ecode))
|
||||
RRETURN(rrc);
|
||||
md->capture_last = save_capture_last;
|
||||
ecode += GET(ecode, 1);
|
||||
}
|
||||
@@ -863,7 +884,9 @@ for (;;)
|
||||
|
||||
RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md, ims,
|
||||
eptrb, flags, RM2);
|
||||
if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN) RRETURN(rrc);
|
||||
if (rrc != MATCH_NOMATCH &&
|
||||
(rrc != MATCH_THEN || md->start_match_ptr != ecode))
|
||||
RRETURN(rrc);
|
||||
ecode += GET(ecode, 1);
|
||||
}
|
||||
/* Control never reaches here. */
|
||||
@@ -1064,7 +1087,8 @@ for (;;)
|
||||
ecode += 1 + LINK_SIZE + GET(ecode, LINK_SIZE + 2);
|
||||
while (*ecode == OP_ALT) ecode += GET(ecode, 1);
|
||||
}
|
||||
else if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN)
|
||||
else if (rrc != MATCH_NOMATCH &&
|
||||
(rrc != MATCH_THEN || md->start_match_ptr != ecode))
|
||||
{
|
||||
RRETURN(rrc); /* Need braces because of following else */
|
||||
}
|
||||
@@ -1192,7 +1216,9 @@ for (;;)
|
||||
mstart = md->start_match_ptr; /* In case \K reset it */
|
||||
break;
|
||||
}
|
||||
if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN) RRETURN(rrc);
|
||||
if (rrc != MATCH_NOMATCH &&
|
||||
(rrc != MATCH_THEN || md->start_match_ptr != ecode))
|
||||
RRETURN(rrc);
|
||||
ecode += GET(ecode, 1);
|
||||
}
|
||||
while (*ecode == OP_ALT);
|
||||
@@ -1226,7 +1252,9 @@ for (;;)
|
||||
do ecode += GET(ecode,1); while (*ecode == OP_ALT);
|
||||
break;
|
||||
}
|
||||
if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN) RRETURN(rrc);
|
||||
if (rrc != MATCH_NOMATCH &&
|
||||
(rrc != MATCH_THEN || md->start_match_ptr != ecode))
|
||||
RRETURN(rrc);
|
||||
ecode += GET(ecode,1);
|
||||
}
|
||||
while (*ecode == OP_ALT);
|
||||
@@ -1363,7 +1391,8 @@ for (;;)
|
||||
(pcre_free)(new_recursive.offset_save);
|
||||
MRRETURN(MATCH_MATCH);
|
||||
}
|
||||
else if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN)
|
||||
else if (rrc != MATCH_NOMATCH &&
|
||||
(rrc != MATCH_THEN || md->start_match_ptr != ecode))
|
||||
{
|
||||
DPRINTF(("Recursion gave error %d\n", rrc));
|
||||
if (new_recursive.offset_save != stacksave)
|
||||
@@ -1406,7 +1435,9 @@ for (;;)
|
||||
mstart = md->start_match_ptr;
|
||||
break;
|
||||
}
|
||||
if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN) RRETURN(rrc);
|
||||
if (rrc != MATCH_NOMATCH &&
|
||||
(rrc != MATCH_THEN || md->start_match_ptr != ecode))
|
||||
RRETURN(rrc);
|
||||
ecode += GET(ecode,1);
|
||||
}
|
||||
while (*ecode == OP_ALT);
|
||||
@@ -1672,37 +1703,40 @@ for (;;)
|
||||
if (eptr < md->end_subject)
|
||||
{ if (!IS_NEWLINE(eptr)) MRRETURN(MATCH_NOMATCH); }
|
||||
else
|
||||
{ if (md->noteol) MRRETURN(MATCH_NOMATCH); }
|
||||
{
|
||||
if (md->noteol) MRRETURN(MATCH_NOMATCH);
|
||||
SCHECK_PARTIAL();
|
||||
}
|
||||
ecode++;
|
||||
break;
|
||||
}
|
||||
else
|
||||
else /* Not multiline */
|
||||
{
|
||||
if (md->noteol) MRRETURN(MATCH_NOMATCH);
|
||||
if (!md->endonly)
|
||||
{
|
||||
if (eptr != md->end_subject &&
|
||||
(!IS_NEWLINE(eptr) || eptr != md->end_subject - md->nllen))
|
||||
MRRETURN(MATCH_NOMATCH);
|
||||
ecode++;
|
||||
break;
|
||||
}
|
||||
if (!md->endonly) goto ASSERT_NL_OR_EOS;
|
||||
}
|
||||
|
||||
/* ... else fall through for endonly */
|
||||
|
||||
/* End of subject assertion (\z) */
|
||||
|
||||
case OP_EOD:
|
||||
if (eptr < md->end_subject) MRRETURN(MATCH_NOMATCH);
|
||||
SCHECK_PARTIAL();
|
||||
ecode++;
|
||||
break;
|
||||
|
||||
/* End of subject or ending \n assertion (\Z) */
|
||||
|
||||
case OP_EODN:
|
||||
if (eptr != md->end_subject &&
|
||||
ASSERT_NL_OR_EOS:
|
||||
if (eptr < md->end_subject &&
|
||||
(!IS_NEWLINE(eptr) || eptr != md->end_subject - md->nllen))
|
||||
MRRETURN(MATCH_NOMATCH);
|
||||
|
||||
/* Either at end of string or \n before end. */
|
||||
|
||||
SCHECK_PARTIAL();
|
||||
ecode++;
|
||||
break;
|
||||
|
||||
@@ -5598,6 +5632,7 @@ if ((options & ~PUBLIC_EXEC_OPTIONS) != 0) return PCRE_ERROR_BADOPTION;
|
||||
if (re == NULL || subject == NULL ||
|
||||
(offsets == NULL && offsetcount > 0)) return PCRE_ERROR_NULL;
|
||||
if (offsetcount < 0) return PCRE_ERROR_BADCOUNT;
|
||||
if (start_offset < 0 || start_offset > length) return PCRE_ERROR_BADOFFSET;
|
||||
|
||||
/* This information is for finding all the numbers associated with a given
|
||||
name, for condition testing. */
|
||||
@@ -5764,16 +5799,14 @@ back the character offset. */
|
||||
#ifdef SUPPORT_UTF8
|
||||
if (utf8 && (options & PCRE_NO_UTF8_CHECK) == 0)
|
||||
{
|
||||
if (_pcre_valid_utf8((USPTR)subject, length) >= 0)
|
||||
return PCRE_ERROR_BADUTF8;
|
||||
int tb;
|
||||
if ((tb = _pcre_valid_utf8((USPTR)subject, length)) >= 0)
|
||||
return (tb == length && md->partial > 1)?
|
||||
PCRE_ERROR_SHORTUTF8 : PCRE_ERROR_BADUTF8;
|
||||
if (start_offset > 0 && start_offset < length)
|
||||
{
|
||||
int tb = ((USPTR)subject)[start_offset];
|
||||
if (tb > 127)
|
||||
{
|
||||
tb &= 0xc0;
|
||||
if (tb != 0 && tb != 0xc0) return PCRE_ERROR_BADUTF8_OFFSET;
|
||||
}
|
||||
tb = ((USPTR)subject)[start_offset] & 0xc0;
|
||||
if (tb == 0x80) return PCRE_ERROR_BADUTF8_OFFSET;
|
||||
}
|
||||
}
|
||||
#endif
|
||||
@@ -5901,9 +5934,10 @@ for(;;)
|
||||
/* There are some optimizations that avoid running the match if a known
|
||||
starting point is not found, or if a known later character is not present.
|
||||
However, there is an option that disables these, for testing and for ensuring
|
||||
that all callouts do actually occur. */
|
||||
that all callouts do actually occur. The option can be set in the regex by
|
||||
(*NO_START_OPT) or passed in match-time options. */
|
||||
|
||||
if ((options & PCRE_NO_START_OPTIMIZE) == 0)
|
||||
if (((options | re->options) & PCRE_NO_START_OPTIMIZE) == 0)
|
||||
{
|
||||
/* Advance to a unique first byte if there is one. */
|
||||
|
||||
|
||||
@@ -192,9 +192,7 @@ stdint.h is available, include it; it may define INT64_MAX. Systems that do not
|
||||
have stdint.h (e.g. Solaris) may have inttypes.h. The macro int64_t may be set
|
||||
by "configure". */
|
||||
|
||||
#ifdef PHP_WIN32
|
||||
#include "win32/php_stdint.h"
|
||||
#elif HAVE_STDINT_H
|
||||
#if HAVE_STDINT_H
|
||||
#include <stdint.h>
|
||||
#elif HAVE_INTTYPES_H
|
||||
#include <inttypes.h>
|
||||
@@ -410,9 +408,10 @@ capturing parenthesis numbers in back references. */
|
||||
|
||||
/* When UTF-8 encoding is being used, a character is no longer just a single
|
||||
byte. The macros for character handling generate simple sequences when used in
|
||||
byte-mode, and more complicated ones for UTF-8 characters. BACKCHAR should
|
||||
never be called in byte mode. To make sure it can never even appear when UTF-8
|
||||
support is omitted, we don't even define it. */
|
||||
byte-mode, and more complicated ones for UTF-8 characters. GETCHARLENTEST is
|
||||
not used when UTF-8 is not supported, so it is not defined, and BACKCHAR should
|
||||
never be called in byte mode. To make sure they can never even appear when
|
||||
UTF-8 support is omitted, we don't even define them. */
|
||||
|
||||
#ifndef SUPPORT_UTF8
|
||||
#define GETCHAR(c, eptr) c = *eptr;
|
||||
@@ -420,43 +419,83 @@ support is omitted, we don't even define it. */
|
||||
#define GETCHARINC(c, eptr) c = *eptr++;
|
||||
#define GETCHARINCTEST(c, eptr) c = *eptr++;
|
||||
#define GETCHARLEN(c, eptr, len) c = *eptr;
|
||||
/* #define GETCHARLENTEST(c, eptr, len) */
|
||||
/* #define BACKCHAR(eptr) */
|
||||
|
||||
#else /* SUPPORT_UTF8 */
|
||||
|
||||
/* These macros were originally written in the form of loops that used data
|
||||
from the tables whose names start with _pcre_utf8_table. They were rewritten by
|
||||
a user so as not to use loops, because in some environments this gives a
|
||||
significant performance advantage, and it seems never to do any harm. */
|
||||
|
||||
/* Base macro to pick up the remaining bytes of a UTF-8 character, not
|
||||
advancing the pointer. */
|
||||
|
||||
#define GETUTF8(c, eptr) \
|
||||
{ \
|
||||
if ((c & 0x20) == 0) \
|
||||
c = ((c & 0x1f) << 6) | (eptr[1] & 0x3f); \
|
||||
else if ((c & 0x10) == 0) \
|
||||
c = ((c & 0x0f) << 12) | ((eptr[1] & 0x3f) << 6) | (eptr[2] & 0x3f); \
|
||||
else if ((c & 0x08) == 0) \
|
||||
c = ((c & 0x07) << 18) | ((eptr[1] & 0x3f) << 12) | \
|
||||
((eptr[2] & 0x3f) << 6) | (eptr[3] & 0x3f); \
|
||||
else if ((c & 0x04) == 0) \
|
||||
c = ((c & 0x03) << 24) | ((eptr[1] & 0x3f) << 18) | \
|
||||
((eptr[2] & 0x3f) << 12) | ((eptr[3] & 0x3f) << 6) | \
|
||||
(eptr[4] & 0x3f); \
|
||||
else \
|
||||
c = ((c & 0x01) << 30) | ((eptr[1] & 0x3f) << 24) | \
|
||||
((eptr[2] & 0x3f) << 18) | ((eptr[3] & 0x3f) << 12) | \
|
||||
((eptr[4] & 0x3f) << 6) | (eptr[5] & 0x3f); \
|
||||
}
|
||||
|
||||
/* Get the next UTF-8 character, not advancing the pointer. This is called when
|
||||
we know we are in UTF-8 mode. */
|
||||
|
||||
#define GETCHAR(c, eptr) \
|
||||
c = *eptr; \
|
||||
if (c >= 0xc0) \
|
||||
{ \
|
||||
int gcii; \
|
||||
int gcaa = _pcre_utf8_table4[c & 0x3f]; /* Number of additional bytes */ \
|
||||
int gcss = 6*gcaa; \
|
||||
c = (c & _pcre_utf8_table3[gcaa]) << gcss; \
|
||||
for (gcii = 1; gcii <= gcaa; gcii++) \
|
||||
{ \
|
||||
gcss -= 6; \
|
||||
c |= (eptr[gcii] & 0x3f) << gcss; \
|
||||
} \
|
||||
}
|
||||
if (c >= 0xc0) GETUTF8(c, eptr);
|
||||
|
||||
/* Get the next UTF-8 character, testing for UTF-8 mode, and not advancing the
|
||||
pointer. */
|
||||
|
||||
#define GETCHARTEST(c, eptr) \
|
||||
c = *eptr; \
|
||||
if (utf8 && c >= 0xc0) \
|
||||
if (utf8 && c >= 0xc0) GETUTF8(c, eptr);
|
||||
|
||||
/* Base macro to pick up the remaining bytes of a UTF-8 character, advancing
|
||||
the pointer. */
|
||||
|
||||
#define GETUTF8INC(c, eptr) \
|
||||
{ \
|
||||
int gcii; \
|
||||
int gcaa = _pcre_utf8_table4[c & 0x3f]; /* Number of additional bytes */ \
|
||||
int gcss = 6*gcaa; \
|
||||
c = (c & _pcre_utf8_table3[gcaa]) << gcss; \
|
||||
for (gcii = 1; gcii <= gcaa; gcii++) \
|
||||
if ((c & 0x20) == 0) \
|
||||
c = ((c & 0x1f) << 6) | (*eptr++ & 0x3f); \
|
||||
else if ((c & 0x10) == 0) \
|
||||
{ \
|
||||
gcss -= 6; \
|
||||
c |= (eptr[gcii] & 0x3f) << gcss; \
|
||||
c = ((c & 0x0f) << 12) | ((*eptr & 0x3f) << 6) | (eptr[1] & 0x3f); \
|
||||
eptr += 2; \
|
||||
} \
|
||||
else if ((c & 0x08) == 0) \
|
||||
{ \
|
||||
c = ((c & 0x07) << 18) | ((*eptr & 0x3f) << 12) | \
|
||||
((eptr[1] & 0x3f) << 6) | (eptr[2] & 0x3f); \
|
||||
eptr += 3; \
|
||||
} \
|
||||
else if ((c & 0x04) == 0) \
|
||||
{ \
|
||||
c = ((c & 0x03) << 24) | ((*eptr & 0x3f) << 18) | \
|
||||
((eptr[1] & 0x3f) << 12) | ((eptr[2] & 0x3f) << 6) | \
|
||||
(eptr[3] & 0x3f); \
|
||||
eptr += 4; \
|
||||
} \
|
||||
else \
|
||||
{ \
|
||||
c = ((c & 0x01) << 30) | ((*eptr & 0x3f) << 24) | \
|
||||
((eptr[1] & 0x3f) << 18) | ((eptr[2] & 0x3f) << 12) | \
|
||||
((eptr[3] & 0x3f) << 6) | (eptr[4] & 0x3f); \
|
||||
eptr += 5; \
|
||||
} \
|
||||
}
|
||||
|
||||
@@ -465,32 +504,49 @@ know we are in UTF-8 mode. */
|
||||
|
||||
#define GETCHARINC(c, eptr) \
|
||||
c = *eptr++; \
|
||||
if (c >= 0xc0) \
|
||||
{ \
|
||||
int gcaa = _pcre_utf8_table4[c & 0x3f]; /* Number of additional bytes */ \
|
||||
int gcss = 6*gcaa; \
|
||||
c = (c & _pcre_utf8_table3[gcaa]) << gcss; \
|
||||
while (gcaa-- > 0) \
|
||||
{ \
|
||||
gcss -= 6; \
|
||||
c |= (*eptr++ & 0x3f) << gcss; \
|
||||
} \
|
||||
}
|
||||
if (c >= 0xc0) GETUTF8INC(c, eptr);
|
||||
|
||||
/* Get the next character, testing for UTF-8 mode, and advancing the pointer.
|
||||
This is called when we don't know if we are in UTF-8 mode. */
|
||||
|
||||
#define GETCHARINCTEST(c, eptr) \
|
||||
c = *eptr++; \
|
||||
if (utf8 && c >= 0xc0) \
|
||||
if (utf8 && c >= 0xc0) GETUTF8INC(c, eptr);
|
||||
|
||||
/* Base macro to pick up the remaining bytes of a UTF-8 character, not
|
||||
advancing the pointer, incrementing the length. */
|
||||
|
||||
#define GETUTF8LEN(c, eptr, len) \
|
||||
{ \
|
||||
int gcaa = _pcre_utf8_table4[c & 0x3f]; /* Number of additional bytes */ \
|
||||
int gcss = 6*gcaa; \
|
||||
c = (c & _pcre_utf8_table3[gcaa]) << gcss; \
|
||||
while (gcaa-- > 0) \
|
||||
if ((c & 0x20) == 0) \
|
||||
{ \
|
||||
gcss -= 6; \
|
||||
c |= (*eptr++ & 0x3f) << gcss; \
|
||||
c = ((c & 0x1f) << 6) | (eptr[1] & 0x3f); \
|
||||
len++; \
|
||||
} \
|
||||
else if ((c & 0x10) == 0) \
|
||||
{ \
|
||||
c = ((c & 0x0f) << 12) | ((eptr[1] & 0x3f) << 6) | (eptr[2] & 0x3f); \
|
||||
len += 2; \
|
||||
} \
|
||||
else if ((c & 0x08) == 0) \
|
||||
{\
|
||||
c = ((c & 0x07) << 18) | ((eptr[1] & 0x3f) << 12) | \
|
||||
((eptr[2] & 0x3f) << 6) | (eptr[3] & 0x3f); \
|
||||
len += 3; \
|
||||
} \
|
||||
else if ((c & 0x04) == 0) \
|
||||
{ \
|
||||
c = ((c & 0x03) << 24) | ((eptr[1] & 0x3f) << 18) | \
|
||||
((eptr[2] & 0x3f) << 12) | ((eptr[3] & 0x3f) << 6) | \
|
||||
(eptr[4] & 0x3f); \
|
||||
len += 4; \
|
||||
} \
|
||||
else \
|
||||
{\
|
||||
c = ((c & 0x01) << 30) | ((eptr[1] & 0x3f) << 24) | \
|
||||
((eptr[2] & 0x3f) << 18) | ((eptr[3] & 0x3f) << 12) | \
|
||||
((eptr[4] & 0x3f) << 6) | (eptr[5] & 0x3f); \
|
||||
len += 5; \
|
||||
} \
|
||||
}
|
||||
|
||||
@@ -499,19 +555,7 @@ if there are extra bytes. This is called when we know we are in UTF-8 mode. */
|
||||
|
||||
#define GETCHARLEN(c, eptr, len) \
|
||||
c = *eptr; \
|
||||
if (c >= 0xc0) \
|
||||
{ \
|
||||
int gcii; \
|
||||
int gcaa = _pcre_utf8_table4[c & 0x3f]; /* Number of additional bytes */ \
|
||||
int gcss = 6*gcaa; \
|
||||
c = (c & _pcre_utf8_table3[gcaa]) << gcss; \
|
||||
for (gcii = 1; gcii <= gcaa; gcii++) \
|
||||
{ \
|
||||
gcss -= 6; \
|
||||
c |= (eptr[gcii] & 0x3f) << gcss; \
|
||||
} \
|
||||
len += gcaa; \
|
||||
}
|
||||
if (c >= 0xc0) GETUTF8LEN(c, eptr, len);
|
||||
|
||||
/* Get the next UTF-8 character, testing for UTF-8 mode, not advancing the
|
||||
pointer, incrementing length if there are extra bytes. This is called when we
|
||||
@@ -519,19 +563,7 @@ do not know if we are in UTF-8 mode. */
|
||||
|
||||
#define GETCHARLENTEST(c, eptr, len) \
|
||||
c = *eptr; \
|
||||
if (utf8 && c >= 0xc0) \
|
||||
{ \
|
||||
int gcii; \
|
||||
int gcaa = _pcre_utf8_table4[c & 0x3f]; /* Number of additional bytes */ \
|
||||
int gcss = 6*gcaa; \
|
||||
c = (c & _pcre_utf8_table3[gcaa]) << gcss; \
|
||||
for (gcii = 1; gcii <= gcaa; gcii++) \
|
||||
{ \
|
||||
gcss -= 6; \
|
||||
c |= (eptr[gcii] & 0x3f) << gcss; \
|
||||
} \
|
||||
len += gcaa; \
|
||||
}
|
||||
if (utf8 && c >= 0xc0) GETUTF8LEN(c, eptr, len);
|
||||
|
||||
/* If the pointer is not at the start of a character, move it back until
|
||||
it is. This is called only in UTF-8 mode - we don't put a test within the macro
|
||||
@@ -539,7 +571,7 @@ because almost all calls are already within a block of UTF-8 only code. */
|
||||
|
||||
#define BACKCHAR(eptr) while((*eptr & 0xc0) == 0x80) eptr--
|
||||
|
||||
#endif
|
||||
#endif /* SUPPORT_UTF8 */
|
||||
|
||||
|
||||
/* In case there is no definition of offsetof() provided - though any proper
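Note on the macro rewrite above: the table-driven loops are replaced by a chain of tests on the lead byte, one branch per sequence length. The same idea as a plain function, as a sketch only. The name decode_utf8 is hypothetical, the input is assumed to be well-formed UTF-8, and unlike GETUTF8 the sketch stops at 4-byte sequences (the RFC 3629 limit) rather than handling 5- and 6-byte forms.

```c
#include <stdint.h>

/* Decode one character from well-formed UTF-8 at p without using a loop.
   The number of bytes consumed is stored in *len. */
static uint32_t decode_utf8(const unsigned char *p, int *len)
{
  uint32_t c = p[0];

  if (c < 0xc0)                         /* ASCII (or a stray trail byte) */
    { *len = 1; return c; }
  if ((c & 0x20) == 0)                  /* 110xxxxx: one extra byte */
    { *len = 2; return ((c & 0x1f) << 6) | (p[1] & 0x3f); }
  if ((c & 0x10) == 0)                  /* 1110xxxx: two extra bytes */
    {
    *len = 3;
    return ((c & 0x0f) << 12) | ((uint32_t)(p[1] & 0x3f) << 6) | (p[2] & 0x3f);
    }
  *len = 4;                             /* 11110xxx: three extra bytes */
  return ((c & 0x07) << 18) | ((uint32_t)(p[1] & 0x3f) << 12) |
         ((uint32_t)(p[2] & 0x3f) << 6) | (p[3] & 0x3f);
}
```

The bytes E2 82 AC decode to U+20AC with a length of 3, following the second branch of the chain.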
|
||||
@@ -583,7 +615,7 @@ time, run time, or study time, respectively. */
|
||||
PCRE_DOTALL|PCRE_DOLLAR_ENDONLY|PCRE_EXTRA|PCRE_UNGREEDY|PCRE_UTF8| \
|
||||
PCRE_NO_AUTO_CAPTURE|PCRE_NO_UTF8_CHECK|PCRE_AUTO_CALLOUT|PCRE_FIRSTLINE| \
|
||||
PCRE_DUPNAMES|PCRE_NEWLINE_BITS|PCRE_BSR_ANYCRLF|PCRE_BSR_UNICODE| \
|
||||
PCRE_JAVASCRIPT_COMPAT|PCRE_UCP)
|
||||
PCRE_JAVASCRIPT_COMPAT|PCRE_UCP|PCRE_NO_START_OPTIMIZE)
|
||||
|
||||
#define PUBLIC_EXEC_OPTIONS \
|
||||
(PCRE_ANCHORED|PCRE_NOTBOL|PCRE_NOTEOL|PCRE_NOTEMPTY|PCRE_NOTEMPTY_ATSTART| \
|
||||
@@ -900,15 +932,16 @@ so that PCRE works on both ASCII and EBCDIC platforms, in non-UTF-mode only. */
|
||||
|
||||
#define STRING_DEFINE "DEFINE"
|
||||
|
||||
#define STRING_CR_RIGHTPAR "CR)"
|
||||
#define STRING_LF_RIGHTPAR "LF)"
|
||||
#define STRING_CRLF_RIGHTPAR "CRLF)"
|
||||
#define STRING_ANY_RIGHTPAR "ANY)"
|
||||
#define STRING_ANYCRLF_RIGHTPAR "ANYCRLF)"
|
||||
#define STRING_BSR_ANYCRLF_RIGHTPAR "BSR_ANYCRLF)"
|
||||
#define STRING_BSR_UNICODE_RIGHTPAR "BSR_UNICODE)"
|
||||
#define STRING_UTF8_RIGHTPAR "UTF8)"
|
||||
#define STRING_UCP_RIGHTPAR "UCP)"
|
||||
#define STRING_CR_RIGHTPAR "CR)"
|
||||
#define STRING_LF_RIGHTPAR "LF)"
|
||||
#define STRING_CRLF_RIGHTPAR "CRLF)"
|
||||
#define STRING_ANY_RIGHTPAR "ANY)"
|
||||
#define STRING_ANYCRLF_RIGHTPAR "ANYCRLF)"
|
||||
#define STRING_BSR_ANYCRLF_RIGHTPAR "BSR_ANYCRLF)"
|
||||
#define STRING_BSR_UNICODE_RIGHTPAR "BSR_UNICODE)"
|
||||
#define STRING_UTF8_RIGHTPAR "UTF8)"
|
||||
#define STRING_UCP_RIGHTPAR "UCP)"
|
||||
#define STRING_NO_START_OPT_RIGHTPAR "NO_START_OPT)"
|
||||
|
||||
#else /* SUPPORT_UTF8 */
|
||||
|
||||
@@ -1154,15 +1187,16 @@ only. */
|
||||
|
||||
#define STRING_DEFINE STR_D STR_E STR_F STR_I STR_N STR_E
|
||||
|
||||
#define STRING_CR_RIGHTPAR STR_C STR_R STR_RIGHT_PARENTHESIS
|
||||
#define STRING_LF_RIGHTPAR STR_L STR_F STR_RIGHT_PARENTHESIS
|
||||
#define STRING_CRLF_RIGHTPAR STR_C STR_R STR_L STR_F STR_RIGHT_PARENTHESIS
|
||||
#define STRING_ANY_RIGHTPAR STR_A STR_N STR_Y STR_RIGHT_PARENTHESIS
|
||||
#define STRING_ANYCRLF_RIGHTPAR STR_A STR_N STR_Y STR_C STR_R STR_L STR_F STR_RIGHT_PARENTHESIS
|
||||
#define STRING_BSR_ANYCRLF_RIGHTPAR STR_B STR_S STR_R STR_UNDERSCORE STR_A STR_N STR_Y STR_C STR_R STR_L STR_F STR_RIGHT_PARENTHESIS
|
||||
#define STRING_BSR_UNICODE_RIGHTPAR STR_B STR_S STR_R STR_UNDERSCORE STR_U STR_N STR_I STR_C STR_O STR_D STR_E STR_RIGHT_PARENTHESIS
|
||||
#define STRING_UTF8_RIGHTPAR STR_U STR_T STR_F STR_8 STR_RIGHT_PARENTHESIS
|
||||
#define STRING_UCP_RIGHTPAR STR_U STR_C STR_P STR_RIGHT_PARENTHESIS
|
||||
#define STRING_CR_RIGHTPAR STR_C STR_R STR_RIGHT_PARENTHESIS
|
||||
#define STRING_LF_RIGHTPAR STR_L STR_F STR_RIGHT_PARENTHESIS
|
||||
#define STRING_CRLF_RIGHTPAR STR_C STR_R STR_L STR_F STR_RIGHT_PARENTHESIS
|
||||
#define STRING_ANY_RIGHTPAR STR_A STR_N STR_Y STR_RIGHT_PARENTHESIS
|
||||
#define STRING_ANYCRLF_RIGHTPAR STR_A STR_N STR_Y STR_C STR_R STR_L STR_F STR_RIGHT_PARENTHESIS
|
||||
#define STRING_BSR_ANYCRLF_RIGHTPAR STR_B STR_S STR_R STR_UNDERSCORE STR_A STR_N STR_Y STR_C STR_R STR_L STR_F STR_RIGHT_PARENTHESIS
|
||||
#define STRING_BSR_UNICODE_RIGHTPAR STR_B STR_S STR_R STR_UNDERSCORE STR_U STR_N STR_I STR_C STR_O STR_D STR_E STR_RIGHT_PARENTHESIS
|
||||
#define STRING_UTF8_RIGHTPAR STR_U STR_T STR_F STR_8 STR_RIGHT_PARENTHESIS
|
||||
#define STRING_UCP_RIGHTPAR STR_U STR_C STR_P STR_RIGHT_PARENTHESIS
|
||||
#define STRING_NO_START_OPT_RIGHTPAR STR_N STR_O STR_UNDERSCORE STR_S STR_T STR_A STR_R STR_T STR_UNDERSCORE STR_O STR_P STR_T STR_RIGHT_PARENTHESIS
|
||||
|
||||
#endif /* SUPPORT_UTF8 */
|
||||
|
||||
@@ -1516,8 +1550,9 @@ in UTF-8 mode. The code that uses this table must know about such things. */
|
||||
3, 3, /* RREF, NRREF */ \
|
||||
1, /* DEF */ \
|
||||
1, 1, /* BRAZERO, BRAMINZERO */ \
|
||||
3, 1, 3, /* MARK, PRUNE, PRUNE_ARG, */ \
|
||||
1, 3, 1, 3, /* SKIP, SKIP_ARG, THEN, THEN_ARG, */ \
|
||||
3, 1, 3, /* MARK, PRUNE, PRUNE_ARG */ \
|
||||
1, 3, /* SKIP, SKIP_ARG */ \
|
||||
1+LINK_SIZE, 3+LINK_SIZE, /* THEN, THEN_ARG */ \
|
||||
1, 1, 1, 3, 1 /* COMMIT, FAIL, ACCEPT, CLOSE, SKIPZERO */
|
||||
|
||||
|
||||
@@ -1536,7 +1571,8 @@ enum { ERR0, ERR1, ERR2, ERR3, ERR4, ERR5, ERR6, ERR7, ERR8, ERR9,
|
||||
ERR30, ERR31, ERR32, ERR33, ERR34, ERR35, ERR36, ERR37, ERR38, ERR39,
|
||||
ERR40, ERR41, ERR42, ERR43, ERR44, ERR45, ERR46, ERR47, ERR48, ERR49,
|
||||
ERR50, ERR51, ERR52, ERR53, ERR54, ERR55, ERR56, ERR57, ERR58, ERR59,
|
||||
ERR60, ERR61, ERR62, ERR63, ERR64, ERR65, ERR66, ERR67, ERRCOUNT };
|
||||
ERR60, ERR61, ERR62, ERR63, ERR64, ERR65, ERR66, ERR67, ERR68,
|
||||
ERRCOUNT };
|
||||
|
||||
/* The real format of the start of the pcre block; the index of names and the
|
||||
code vector run on as long as necessary after the end. We store an explicit
|
||||
|
||||
@@ -537,11 +537,26 @@ for(;;)
|
||||
case OP_MARK:
|
||||
case OP_PRUNE_ARG:
|
||||
case OP_SKIP_ARG:
|
||||
case OP_THEN_ARG:
|
||||
fprintf(f, " %s %s", OP_names[*code], code + 2);
|
||||
extra += code[1];
|
||||
break;
|
||||
|
||||
case OP_THEN:
|
||||
if (print_lengths)
|
||||
fprintf(f, " %s %d", OP_names[*code], GET(code, 1));
|
||||
else
|
||||
fprintf(f, " %s", OP_names[*code]);
|
||||
break;
|
||||
|
||||
case OP_THEN_ARG:
|
||||
if (print_lengths)
|
||||
fprintf(f, " %s %d %s", OP_names[*code], GET(code, 1),
|
||||
code + 2 + LINK_SIZE);
|
||||
else
|
||||
fprintf(f, " %s %s", OP_names[*code], code + 2 + LINK_SIZE);
|
||||
extra += code[1+LINK_SIZE];
|
||||
break;
|
||||
|
||||
/* Anything else is just an item with no data*/
|
||||
|
||||
default:
|
||||
|
||||
@@ -417,10 +417,13 @@ for (;;)
|
||||
case OP_MARK:
|
||||
case OP_PRUNE_ARG:
|
||||
case OP_SKIP_ARG:
|
||||
case OP_THEN_ARG:
|
||||
cc += _pcre_OP_lengths[op] + cc[1];
|
||||
break;
|
||||
|
||||
case OP_THEN_ARG:
|
||||
cc += _pcre_OP_lengths[op] + cc[1+LINK_SIZE];
|
||||
break;
|
||||
|
||||
/* For the record, these are the opcodes that are matched by "default":
|
||||
OP_ACCEPT, OP_CLOSE, OP_COMMIT, OP_FAIL, OP_PRUNE, OP_SET_SOM, OP_SKIP,
|
||||
OP_THEN. */
|
||||
|
||||
@@ -70,6 +70,20 @@ Arguments:
|
||||
|
||||
Returns: < 0 if the string is a valid UTF-8 string
|
||||
>= 0 otherwise; the value is the offset of the bad byte
|
||||
|
||||
Bad bytes can be:
|
||||
|
||||
. An isolated byte whose most significant bits are 0x80, because this
|
||||
can only correctly appear within a UTF-8 character;
|
||||
|
||||
. A byte whose most significant bits are 0xc0, but whose other bits indicate
|
||||
that there are more than 3 additional bytes (i.e. an RFC 2279 starting
|
||||
byte, which is no longer valid under RFC 3629);
|
||||
|
||||
.
|
||||
|
||||
The returned offset may also be equal to the length of the string; this means
|
||||
that one or more bytes is missing from the final UTF-8 character.
|
||||
*/
|
||||
|
||||
int
|
||||
@@ -91,7 +105,8 @@ for (p = string; length-- > 0; p++)
|
||||
if (c < 128) continue;
|
||||
if (c < 0xc0) return p - string;
|
||||
ab = _pcre_utf8_table4[c & 0x3f]; /* Number of additional bytes */
|
||||
if (length < ab || ab > 3) return p - string;
|
||||
if (ab > 3) return p - string; /* Too many for RFC 3629 */
|
||||
if (length < ab) return p + 1 + length - string; /* Missing bytes */
|
||||
length -= ab;
|
||||
|
||||
/* Check top bits in the second byte */
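Note on the hunk above: splitting the old combined test lets the caller tell a truncated final character (offset equal to the string length) from a genuinely bad byte, which is what pcre_exec() uses to return PCRE_ERROR_SHORTUTF8 for hard partial matching. A simplified sketch of that return convention follows. The function name check_utf8 is hypothetical, and the sketch omits the overlong-form and other per-byte checks the real validator may perform.

```c
/* Returns -1 if the string is acceptable to this simplified check.
   Otherwise returns the offset of the offending byte, or `length` itself
   when the final character is missing one or more bytes. */
static int check_utf8(const unsigned char *s, int length)
{
  int i = 0;

  while (i < length)
    {
    unsigned c = s[i];
    int ab, k;                              /* ab = additional bytes */

    if (c < 0x80) { i++; continue; }        /* ASCII */
    if (c < 0xc0) return i;                 /* Isolated trail byte */
    ab = (c < 0xe0) ? 1 : (c < 0xf0) ? 2 : (c < 0xf8) ? 3 : 4;
    if (ab > 3) return i;                   /* Too many for RFC 3629 */
    if (length - i - 1 < ab) return length; /* Missing bytes at the end */
    for (k = 1; k <= ab; k++)
      if ((s[i + k] & 0xc0) != 0x80) return i;
    i += ab + 1;
    }
  return -1;
}
```

A caller can then compare the result with the length: equality means "more input might complete this character", any other non-negative value means the data is invalid regardless of what follows.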
|
||||
|
||||
@@ -50,13 +50,16 @@ const char *error;
|
||||
char *pattern;
|
||||
char *subject;
|
||||
unsigned char *name_table;
|
||||
unsigned int option_bits;
|
||||
int erroffset;
|
||||
int find_all;
|
||||
int crlf_is_newline;
|
||||
int namecount;
|
||||
int name_entry_size;
|
||||
int ovector[OVECCOUNT];
|
||||
int subject_length;
|
||||
int rc, i;
|
||||
int utf8;
|
||||
|
||||
|
||||
/**************************************************************************
|
||||
@@ -238,15 +241,56 @@ if (namecount <= 0) printf("No named substrings\n"); else
|
||||
* subject is not a valid match; other possibilities must be tried. The *
|
||||
* second flag restricts PCRE to one match attempt at the initial string *
|
||||
* position. If this match succeeds, an alternative to the empty string *
|
||||
* match has been found, and we can proceed round the loop. *
|
||||
* match has been found, and we can print it and proceed round the loop, *
|
||||
* advancing by the length of whatever was found. If this match does not *
|
||||
* succeed, we still stay in the loop, advancing by just one character. *
|
||||
* In UTF-8 mode, which can be set by (*UTF8) in the pattern, this may be *
|
||||
* more than one byte. *
|
||||
* *
|
||||
* However, there is a complication concerned with newlines. When the *
|
||||
* newline convention is such that CRLF is a valid newline, we must      *
|
||||
* advance by two characters rather than one. The newline convention can *
|
||||
* be set in the regex by (*CR), etc.; if not, we must find the default. *
|
||||
*************************************************************************/
|
||||
|
||||
if (!find_all)
|
||||
if (!find_all) /* Check for -g */
|
||||
{
|
||||
pcre_free(re); /* Release the memory used for the compiled pattern */
|
||||
return 0; /* Finish unless -g was given */
|
||||
}
|
||||
|
||||
/* Before running the loop, check for UTF-8 and whether CRLF is a valid newline
|
||||
sequence. First, find the options with which the regex was compiled; extract
|
||||
the UTF-8 state, and mask off all but the newline options. */
|
||||
|
||||
(void)pcre_fullinfo(re, NULL, PCRE_INFO_OPTIONS, &option_bits);
|
||||
utf8 = option_bits & PCRE_UTF8;
|
||||
option_bits &= PCRE_NEWLINE_CR|PCRE_NEWLINE_LF|PCRE_NEWLINE_CRLF|
|
||||
PCRE_NEWLINE_ANY|PCRE_NEWLINE_ANYCRLF;
|
||||
|
||||
/* If no newline options were set, find the default newline convention from the
|
||||
build configuration. */
|
||||
|
||||
if (option_bits == 0)
|
||||
{
|
||||
int d;
|
||||
(void)pcre_config(PCRE_CONFIG_NEWLINE, &d);
|
||||
/* Note that these values are always the ASCII ones, even in
|
||||
EBCDIC environments. CR = 13, NL = 10. */
|
||||
option_bits = (d == 13)? PCRE_NEWLINE_CR :
|
||||
(d == 10)? PCRE_NEWLINE_LF :
|
||||
(d == (13<<8 | 10))? PCRE_NEWLINE_CRLF :
|
||||
(d == -2)? PCRE_NEWLINE_ANYCRLF :
|
||||
(d == -1)? PCRE_NEWLINE_ANY : 0;
|
||||
}
|
||||
|
||||
/* See if CRLF is a valid newline sequence. */
|
||||
|
||||
crlf_is_newline =
|
||||
option_bits == PCRE_NEWLINE_ANY ||
|
||||
option_bits == PCRE_NEWLINE_CRLF ||
|
||||
option_bits == PCRE_NEWLINE_ANYCRLF;
|
||||
|
||||
/* Loop for second and subsequent matches */
|
||||
|
||||
for (;;)
|
||||
@@ -280,14 +324,32 @@ for (;;)
|
||||
is zero, it just means we have found all possible matches, so the loop ends.
|
||||
Otherwise, it means we have failed to find a non-empty-string match at a
|
||||
point where there was a previous empty-string match. In this case, we do what
|
||||
Perl does: advance the matching position by one, and continue. We do this by
|
||||
setting the "end of previous match" offset, because that is picked up at the
|
||||
top of the loop as the point at which to start again. */
|
||||
Perl does: advance the matching position by one character, and continue. We
|
||||
do this by setting the "end of previous match" offset, because that is picked
|
||||
up at the top of the loop as the point at which to start again.
|
||||
|
||||
There are two complications: (a) When CRLF is a valid newline sequence, and
|
||||
the current position is just before it, advance by an extra byte. (b)
|
||||
Otherwise we must ensure that we skip an entire UTF-8 character if we are in
|
||||
UTF-8 mode. */
|
||||
|
||||
if (rc == PCRE_ERROR_NOMATCH)
|
||||
{
|
||||
if (options == 0) break;
|
||||
ovector[1] = start_offset + 1;
|
||||
if (options == 0) break; /* All matches found */
|
||||
ovector[1] = start_offset + 1; /* Advance one byte */
|
||||
if (crlf_is_newline && /* If CRLF is newline & */
|
||||
start_offset < subject_length - 1 && /* we are at CRLF, */
|
||||
subject[start_offset] == '\r' &&
|
||||
subject[start_offset + 1] == '\n')
|
||||
ovector[1] += 1; /* Advance by one more. */
|
||||
else if (utf8) /* Otherwise, ensure we */
|
||||
{ /* advance a whole UTF-8 */
|
||||
while (ovector[1] < subject_length) /* character. */
|
||||
{
|
||||
if ((subject[ovector[1]] & 0xc0) != 0x80) break;
|
||||
ovector[1] += 1;
|
||||
}
|
||||
}
|
||||
continue; /* Go round the loop again */
|
||||
}
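Note on the pcredemo hunk above: after a failed retry at the position of an empty match, the next start offset advances by one byte, by two if the position is at CRLF and CRLF counts as a newline, and otherwise past any UTF-8 trail bytes. The same logic isolated as a function, as a sketch; the function name is hypothetical, and the parameters mirror the variables used in the diff.

```c
/* Compute where to restart after an empty match at start_offset when the
   non-empty retry found nothing. */
static int advance_after_empty_match(const char *subject, int subject_length,
                                     int start_offset, int crlf_is_newline,
                                     int utf8)
{
  int next = start_offset + 1;              /* Advance one byte */

  if (crlf_is_newline &&
      start_offset < subject_length - 1 &&
      subject[start_offset] == '\r' &&
      subject[start_offset + 1] == '\n')
    next += 1;                              /* Step over the whole CRLF */
  else if (utf8)
    {
    /* Skip trail bytes (10xxxxxx) so we land on a character boundary. */
    while (next < subject_length && (subject[next] & 0xc0) == 0x80)
      next++;
    }
  return next;
}
```

The mask test works whether plain char is signed or unsigned, because only bits 6 and 7 are inspected after the AND.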
|
||||
|
||||
|
||||
@@ -149,6 +149,7 @@ static const int eint[] = {
|
||||
REG_BADPAT, /* different names for subpatterns of the same number are not allowed */
|
||||
REG_BADPAT, /* (*MARK) must have an argument */
|
||||
REG_INVARG, /* this version of PCRE is not compiled with PCRE_UCP support */
|
||||
REG_BADPAT, /* \c must be followed by an ASCII character */
|
||||
};
|
||||
|
||||
/* Table of texts corresponding to POSIX error codes */
|
||||
|
||||