mirror of https://github.com/php/php-src.git synced 2026-04-25 17:08:14 +02:00

Upgraded bundled PCRE to version 8.11.

Ilia Alshanetsky
2010-12-24 15:03:29 +00:00
parent e540cb0c45
commit 15b7a4476a
14 changed files with 2062 additions and 1436 deletions
+130
@@ -1,6 +1,136 @@
ChangeLog for PCRE
------------------
Version 8.11 10-Dec-2010
------------------------
1. (*THEN) was not working properly if there were untried alternatives prior
to it in the current branch. For example, in ((a|b)(*THEN)(*F)|c..) it
backtracked to try for "b" instead of moving to the next alternative branch
at the same level (in this case, to look for "c"). The Perl documentation
is clear that when (*THEN) is backtracked onto, it goes to the "next
alternative in the innermost enclosing group".
2. (*COMMIT) was not overriding (*THEN), as it does in Perl. In a pattern
such as (A(*COMMIT)B(*THEN)C|D) any failure after matching A should
result in overall failure. Similarly, (*COMMIT) now overrides (*PRUNE) and
(*SKIP), (*SKIP) overrides (*PRUNE) and (*THEN), and (*PRUNE) overrides
(*THEN).
3. If \s appeared in a character class, it removed the VT character from
the class, even if it had been included by some previous item, for example
in [\x00-\xff\s]. (This was a bug related to the fact that VT is not part
of \s, but is part of the POSIX "space" class.)
4. A partial match never returns an empty string (because you can always
match an empty string at the end of the subject); however the checking for
an empty string was starting at the "start of match" point. This has been
changed to the "earliest inspected character" point, because the returned
data for a partial match starts at this character. This means that, for
example, /(?<=abc)def/ gives a partial match for the subject "abc"
(previously it gave "no match").
5. Changes have been made to the way PCRE_PARTIAL_HARD affects the matching
of $, \z, \Z, \b, and \B. If the match point is at the end of the string,
previously a full match would be given. However, setting PCRE_PARTIAL_HARD
has an implication that the given string is incomplete (because a partial
match is preferred over a full match). For this reason, these items now
give a partial match in this situation. [Aside: previously, the one case
/t\b/ matched against "cat" with PCRE_PARTIAL_HARD set did return a partial
match rather than a full match, which was wrong by the old rules, but is
now correct.]
6. There was a bug in the handling of #-introduced comments, recognized when
PCRE_EXTENDED is set, when PCRE_NEWLINE_ANY and PCRE_UTF8 were also set.
If a UTF-8 multi-byte character included the byte 0x85 (e.g. U+0445, whose
UTF-8 encoding is 0xd1,0x85), this was misinterpreted as a newline when
scanning for the end of the comment. (*Character* 0x85 is an "any" newline,
but *byte* 0x85 is not, in UTF-8 mode). This bug was present in several
places in pcre_compile().
7. Related to (6) above, when pcre_compile() was skipping #-introduced
comments when looking ahead for named forward references to subpatterns,
the only newline sequence it recognized was NL. It now handles newlines
according to the set newline convention.
8. SunOS4 doesn't have strerror() or strtoul(); pcregrep dealt with the
former, but used strtoul(), whereas pcretest avoided strtoul() but did not
cater for a lack of strerror(). These oversights have been fixed.
9. Added --match-limit and --recursion-limit to pcregrep.
10. Added two casts needed to build with Visual Studio when NO_RECURSE is set.
11. When the -o option was used, pcregrep was setting a return code of 1, even
when matches were found, and --line-buffered was not being honoured.
12. Added an optional parentheses number to the -o and --only-matching options
of pcregrep.
13. Imitating Perl's /g action for multiple matches is tricky when the pattern
can match an empty string. The code to do it in pcretest and pcredemo
needed fixing:
(a) When the newline convention was "crlf", pcretest got it wrong, skipping
only one byte after an empty string match just before CRLF (this case
just got forgotten; "any" and "anycrlf" were OK).
(b) The pcretest code also had a bug, causing it to loop forever in UTF-8
mode when an empty string match preceded an ASCII character followed by
a non-ASCII character. (The code for advancing by one character rather
than one byte was nonsense.)
(c) The pcredemo.c sample program did not have any code at all to handle
the cases when CRLF is a valid newline sequence.
14. Neither pcre_exec() nor pcre_dfa_exec() was checking that the value given
as a starting offset was within the subject string. There is now a new
error, PCRE_ERROR_BADOFFSET, which is returned if the starting offset is
negative or greater than the length of the string. In order to test this,
pcretest is extended to allow the setting of negative starting offsets.
15. In both pcre_exec() and pcre_dfa_exec() the code for checking that the
starting offset points to the beginning of a UTF-8 character was
unnecessarily clumsy. I tidied it up.
16. Added PCRE_ERROR_SHORTUTF8 to make it possible to distinguish between a
bad UTF-8 sequence and one that is incomplete when using PCRE_PARTIAL_HARD.
17. Nobody had reported that the --include_dir option, which was added in
release 7.7, should have been called --include-dir (hyphen, not underscore)
for compatibility with GNU grep. I have changed it to --include-dir, but
left --include_dir as an undocumented synonym, and the same for
--exclude-dir, though that is not available in GNU grep, at least as of
release 2.5.4.
18. At a user's suggestion, the macros GETCHAR and friends (which pick up UTF-8
characters from a string of bytes) have been redefined so as not to use
loops, in order to improve performance in some environments. At the same
time, I abstracted some of the common code into auxiliary macros to save
repetition (this should not affect the compiled code).
19. If \c was followed by a multibyte UTF-8 character, bad things happened. A
compile-time error is now given if \c is not followed by an ASCII
character, that is, a byte less than 128. (In EBCDIC mode, the code is
different, and any byte value is allowed.)
20. Recognize (*NO_START_OPT) at the start of a pattern to set the
PCRE_NO_START_OPTIMIZE option, which is now allowed at compile time - but just
passed through to pcre_exec() or pcre_dfa_exec(). This makes it available
to pcregrep and other applications that have no direct access to PCRE
options. The new /Y option in pcretest sets this option when calling
pcre_compile().
21. Change 18 of release 8.01 broke the use of named subpatterns for recursive
back references. Groups containing recursive back references were forced to
be atomic by that change, but in the case of named groups, the amount of
memory required was incorrectly computed, leading to "Failed: internal
error: code overflow". This has been fixed.
22. Some patches to pcre_stringpiece.h, pcre_stringpiece_unittest.cc, and
pcretest.c, to avoid build problems in some Borland environments.
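
The override order described in item 2 above can be pictured as a ranking of the verbs. The following is only a sketch under stated assumptions: the enum names and both functions are hypothetical and are not PCRE's internal MATCH_* codes, and the sketch tracks only the result code, not the extra state (such as the skip position) that the real matcher passes back.

```c
/* Hypothetical result codes, for illustration only. */
enum result { R_MATCH, R_NOMATCH, R_THEN, R_PRUNE, R_SKIP, R_COMMIT };

/* Strength of a backtracking verb; a stronger verb overrides a weaker one.
   Ordinary results (a match or a plain no-match) have rank 0. */
int verb_rank(enum result r)
{
  switch (r)
    {
    case R_THEN:   return 1;
    case R_PRUNE:  return 2;
    case R_SKIP:   return 3;
    case R_COMMIT: return 4;
    default:       return 0;
    }
}

/* "verb" is the verb just passed in the pattern; "rest" is the result of
   matching everything after it. Returns the code handed back to the caller. */
enum result apply_verb(enum result verb, enum result rest)
{
  if (rest == R_MATCH) return rest;                     /* success always wins */
  if (verb_rank(rest) > verb_rank(verb)) return rest;   /* stronger verb later */
  return verb;                                          /* this verb overrides */
}
```

For example, in (A(*COMMIT)B(*THEN)C|D) a failure at C first yields the (*THEN) code, which apply_verb(R_COMMIT, R_THEN) turns into the (*COMMIT) code, so the whole match fails.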
Version 8.10 25-Jun-2010
------------------------
+26 -8
@@ -4,6 +4,7 @@ Technical Notes about PCRE
These are very rough technical notes that record potentially useful information
about PCRE internals.
Historical note 1
-----------------
@@ -22,6 +23,7 @@ the one matching the longest subset of the subject string was chosen. This did
not necessarily maximize the individual wild portions of the pattern, as is
expected in Unix and Perl-style regular expressions.
Historical note 2
-----------------
@@ -34,6 +36,7 @@ maximizing (or, optionally, minimizing in Perl) the amount of the subject that
matches individual wild portions of the pattern. This is an "NFA algorithm" in
Friedl's terminology.
OK, here's the real stuff
-------------------------
@@ -44,6 +47,7 @@ in the pattern, to save on compiling time. However, because of the greater
complexity in Perl regular expressions, I couldn't do this. In any case, a
first pass through the pattern is helpful for other reasons.
Computing the memory requirement: how it was
--------------------------------------------
@@ -54,6 +58,7 @@ idea was that this would turn out faster than the Henry Spencer code because
the first pass is degenerate and the second pass can just store stuff straight
into the vector, which it knows is big enough.
Computing the memory requirement: how it is
-------------------------------------------
@@ -75,6 +80,7 @@ runs more slowly than before (30% or more, depending on the pattern) because it
is doing a full analysis of the pattern. My hope was that this would not be a
big issue, and in the event, nobody has commented on it.
Traditional matching function
-----------------------------
@@ -84,6 +90,7 @@ and the way that Perl works. This is not surprising, since it is intended to be
as compatible with Perl as possible. This is the function most users of PCRE
will use most of the time.
Supplementary matching function
-------------------------------
@@ -119,7 +126,6 @@ quantifiers) are always just two bytes long.
A list of the opcodes follows:
Opcodes with no following data
------------------------------
@@ -151,12 +157,24 @@ These items are all just one byte long
OP_EXTUNI match an extended Unicode character
OP_ANYNL match any Unicode newline sequence
OP_ACCEPT ) These are Perl 5.10's "backtracking
OP_COMMIT ) control verbs". If OP_ACCEPT is inside
OP_FAIL ) capturing parentheses, it may be preceded
OP_PRUNE ) by one or more OP_CLOSE, followed by a 2-byte
OP_SKIP ) number, indicating which parentheses must be
OP_THEN ) closed.
OP_ACCEPT ) These are Perl 5.10's "backtracking control
OP_COMMIT ) verbs". If OP_ACCEPT is inside capturing
OP_FAIL ) parentheses, it may be preceded by one or more
OP_PRUNE ) OP_CLOSE, followed by a 2-byte number,
OP_SKIP ) indicating which parentheses must be closed.
Backtracking control verbs with data
------------------------------------
OP_THEN is followed by a LINK_SIZE offset, which is the distance back to the
start of the current branch.
OP_MARK is followed by the mark name, preceded by a one-byte length, and
followed by a binary zero. For (*PRUNE), (*SKIP), and (*THEN) with arguments,
the opcodes OP_PRUNE_ARG, OP_SKIP_ARG, and OP_THEN_ARG are used. For the first
two, the name follows immediately; for OP_THEN_ARG, it follows the LINK_SIZE
offset value.
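
As an illustration of the OP_THEN_ARG layout just described (opcode, LINK_SIZE offset, one-byte length, name, binary zero), here is a minimal sketch. It assumes LINK_SIZE is 2 and a big-endian offset; the opcode value and the three function names are made up for the example and are not part of PCRE.

```c
#include <stddef.h>
#include <string.h>

#define LINK_SIZE 2              /* assumption for this sketch */
#define OP_THEN_ARG_DEMO 0x9a    /* hypothetical opcode value */

/* Write opcode, offset back to the branch start, name length, name and a
   terminating zero. Returns the number of bytes written. The caller must
   supply a name shorter than 256 bytes and a large enough buffer. */
size_t put_then_arg(unsigned char *code, unsigned int branch_offset,
  const char *name)
{
  size_t len = strlen(name);
  code[0] = OP_THEN_ARG_DEMO;
  code[1] = (unsigned char)(branch_offset >> 8);     /* high byte */
  code[2] = (unsigned char)(branch_offset & 255);    /* low byte */
  code[1 + LINK_SIZE] = (unsigned char)len;          /* one-byte length */
  memcpy(code + LINK_SIZE + 2, name, len);           /* the name itself */
  code[LINK_SIZE + 2 + len] = 0;                     /* binary zero */
  return LINK_SIZE + 3 + len;
}

/* Read the offset back. */
unsigned int then_arg_offset(const unsigned char *code)
{
  return ((unsigned int)code[1] << 8) | code[2];
}

/* The name starts after the opcode, the offset and the length byte. */
const char *then_arg_name(const unsigned char *code)
{
  return (const char *)(code + LINK_SIZE + 2);
}
```

The name position, code + LINK_SIZE + 2, and the length position, code[1+LINK_SIZE], are the same ones the compile and exec changes in this commit use for OP_THEN_ARG.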
Repeating single characters
@@ -419,4 +437,4 @@ at compile time, and so does not cause anything to be put into the compiled
data.
Philip Hazel
October 2009
October 2010
+21
@@ -1,6 +1,27 @@
News about PCRE releases
------------------------
Release 8.11 10-Dec-2010
------------------------
A number of bugs in the library and in pcregrep have been fixed. As always, see
ChangeLog for details. The following are the non-bug-fix changes:
. Added --match-limit and --recursion-limit to pcregrep.
. Added an optional parentheses number to the -o and --only-matching options
of pcregrep.
. Changed the way PCRE_PARTIAL_HARD affects the matching of $, \z, \Z, \b, and
\B.
. Added PCRE_ERROR_SHORTUTF8 to make it possible to distinguish between a
bad UTF-8 sequence and one that is incomplete when using PCRE_PARTIAL_HARD.
. Recognize (*NO_START_OPT) at the start of a pattern to set the
PCRE_NO_START_OPTIMIZE option, which is now allowed at compile time.
Release 8.10 25-Jun-2010
------------------------
+4 -3
@@ -23,6 +23,7 @@
# define PCRE_EXP_DATA_DEFN __attribute__ ((visibility("default")))
#endif
/* Exclude these below definitions when building within PHP */
#ifndef ZEND_API
@@ -281,7 +282,7 @@ them both to 0; an emulation function will be used. */
#define PACKAGE_NAME "PCRE"
/* Define to the full name and version of this package. */
#define PACKAGE_STRING "PCRE 8.10"
#define PACKAGE_STRING "PCRE 8.11"
/* Define to the one symbol short name of this package. */
#define PACKAGE_TARNAME "pcre"
@@ -290,7 +291,7 @@ them both to 0; an emulation function will be used. */
#define PACKAGE_URL ""
/* Define to the version of this package. */
#define PACKAGE_VERSION "8.10"
#define PACKAGE_VERSION "8.11"
/* If you are compiling for a system other than a Unix-like system or
@@ -346,7 +347,7 @@ them both to 0; an emulation function will be used. */
/* Version number of package */
#ifndef VERSION
#define VERSION "8.10"
#define VERSION "8.11"
#endif
/* Define to empty if `const' does not conform to ANSI C. */
+1399 -1209
File diff suppressed because it is too large
+40 -36
@@ -42,9 +42,9 @@ POSSIBILITY OF SUCH DAMAGE.
/* The current PCRE version information. */
#define PCRE_MAJOR 8
#define PCRE_MINOR 10
#define PCRE_MINOR 11
#define PCRE_PRERELEASE
#define PCRE_DATE 2010-06-25
#define PCRE_DATE 2010-12-10
/* When an application links to a PCRE DLL in Windows, the symbols that are
imported have to be identified as such. When building PCRE, the appropriate
@@ -96,42 +96,44 @@ extern "C" {
#endif
/* Options. Some are compile-time only, some are run-time only, and some are
both, so we keep them all distinct. */
both, so we keep them all distinct. However, almost all the bits in the options
word are now used. In the long run, we may have to re-use some of the
compile-time only bits for runtime options, or vice versa. */
#define PCRE_CASELESS 0x00000001
#define PCRE_MULTILINE 0x00000002
#define PCRE_DOTALL 0x00000004
#define PCRE_EXTENDED 0x00000008
#define PCRE_ANCHORED 0x00000010
#define PCRE_DOLLAR_ENDONLY 0x00000020
#define PCRE_EXTRA 0x00000040
#define PCRE_NOTBOL 0x00000080
#define PCRE_NOTEOL 0x00000100
#define PCRE_UNGREEDY 0x00000200
#define PCRE_NOTEMPTY 0x00000400
#define PCRE_UTF8 0x00000800
#define PCRE_NO_AUTO_CAPTURE 0x00001000
#define PCRE_NO_UTF8_CHECK 0x00002000
#define PCRE_AUTO_CALLOUT 0x00004000
#define PCRE_PARTIAL_SOFT 0x00008000
#define PCRE_CASELESS 0x00000001 /* Compile */
#define PCRE_MULTILINE 0x00000002 /* Compile */
#define PCRE_DOTALL 0x00000004 /* Compile */
#define PCRE_EXTENDED 0x00000008 /* Compile */
#define PCRE_ANCHORED 0x00000010 /* Compile, exec, DFA exec */
#define PCRE_DOLLAR_ENDONLY 0x00000020 /* Compile */
#define PCRE_EXTRA 0x00000040 /* Compile */
#define PCRE_NOTBOL 0x00000080 /* Exec, DFA exec */
#define PCRE_NOTEOL 0x00000100 /* Exec, DFA exec */
#define PCRE_UNGREEDY 0x00000200 /* Compile */
#define PCRE_NOTEMPTY 0x00000400 /* Exec, DFA exec */
#define PCRE_UTF8 0x00000800 /* Compile */
#define PCRE_NO_AUTO_CAPTURE 0x00001000 /* Compile */
#define PCRE_NO_UTF8_CHECK 0x00002000 /* Compile, exec, DFA exec */
#define PCRE_AUTO_CALLOUT 0x00004000 /* Compile */
#define PCRE_PARTIAL_SOFT 0x00008000 /* Exec, DFA exec */
#define PCRE_PARTIAL 0x00008000 /* Backwards compatible synonym */
#define PCRE_DFA_SHORTEST 0x00010000
#define PCRE_DFA_RESTART 0x00020000
#define PCRE_FIRSTLINE 0x00040000
#define PCRE_DUPNAMES 0x00080000
#define PCRE_NEWLINE_CR 0x00100000
#define PCRE_NEWLINE_LF 0x00200000
#define PCRE_NEWLINE_CRLF 0x00300000
#define PCRE_NEWLINE_ANY 0x00400000
#define PCRE_NEWLINE_ANYCRLF 0x00500000
#define PCRE_BSR_ANYCRLF 0x00800000
#define PCRE_BSR_UNICODE 0x01000000
#define PCRE_JAVASCRIPT_COMPAT 0x02000000
#define PCRE_NO_START_OPTIMIZE 0x04000000
#define PCRE_NO_START_OPTIMISE 0x04000000
#define PCRE_PARTIAL_HARD 0x08000000
#define PCRE_NOTEMPTY_ATSTART 0x10000000
#define PCRE_UCP 0x20000000
#define PCRE_DFA_SHORTEST 0x00010000 /* DFA exec */
#define PCRE_DFA_RESTART 0x00020000 /* DFA exec */
#define PCRE_FIRSTLINE 0x00040000 /* Compile */
#define PCRE_DUPNAMES 0x00080000 /* Compile */
#define PCRE_NEWLINE_CR 0x00100000 /* Compile, exec, DFA exec */
#define PCRE_NEWLINE_LF 0x00200000 /* Compile, exec, DFA exec */
#define PCRE_NEWLINE_CRLF 0x00300000 /* Compile, exec, DFA exec */
#define PCRE_NEWLINE_ANY 0x00400000 /* Compile, exec, DFA exec */
#define PCRE_NEWLINE_ANYCRLF 0x00500000 /* Compile, exec, DFA exec */
#define PCRE_BSR_ANYCRLF 0x00800000 /* Compile, exec, DFA exec */
#define PCRE_BSR_UNICODE 0x01000000 /* Compile, exec, DFA exec */
#define PCRE_JAVASCRIPT_COMPAT 0x02000000 /* Compile */
#define PCRE_NO_START_OPTIMIZE 0x04000000 /* Compile, exec, DFA exec */
#define PCRE_NO_START_OPTIMISE 0x04000000 /* Synonym */
#define PCRE_PARTIAL_HARD 0x08000000 /* Exec, DFA exec */
#define PCRE_NOTEMPTY_ATSTART 0x10000000 /* Exec, DFA exec */
#define PCRE_UCP 0x20000000 /* Compile */
/* Exec-time and get/set-time error codes */
@@ -159,6 +161,8 @@ both, so we keep them all distinct. */
#define PCRE_ERROR_RECURSIONLIMIT (-21)
#define PCRE_ERROR_NULLWSLIMIT (-22) /* No longer actually used */
#define PCRE_ERROR_BADNEWLINE (-23)
#define PCRE_ERROR_BADOFFSET (-24)
#define PCRE_ERROR_SHORTUTF8 (-25)
/* Request types for pcre_fullinfo() */
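
The ChangeLog says PCRE_ERROR_BADOFFSET is returned when the starting offset is negative or greater than the length of the string. A minimal sketch of that range check follows; the function name is hypothetical, and the real check lives inside pcre_exec() and pcre_dfa_exec().

```c
#define PCRE_ERROR_BADOFFSET (-24)   /* value as defined in the header above */

/* Returns 0 when start_offset is usable, PCRE_ERROR_BADOFFSET otherwise.
   An offset equal to the length is allowed: it addresses the end of the
   subject, where an empty match is still possible. */
int check_start_offset(int start_offset, int length)
{
  if (start_offset < 0 || start_offset > length) return PCRE_ERROR_BADOFFSET;
  return 0;
}
```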
+127 -31
@@ -406,6 +406,7 @@ static const char error_texts[] =
"different names for subpatterns of the same number are not allowed\0"
"(*MARK) must have an argument\0"
"this version of PCRE is not compiled with PCRE_UCP support\0"
"\\c must be followed by an ASCII character\0"
;
/* Table to identify digits and hex digits. This is used when compiling
@@ -839,7 +840,8 @@ else
break;
/* For \c, a following letter is upper-cased; then the 0x40 bit is flipped.
This coding is ASCII-specific, but then the whole concept of \cx is
An error is given if the byte following \c is not an ASCII character. This
coding is ASCII-specific, but then the whole concept of \cx is
ASCII-specific. (However, an EBCDIC equivalent has now been added.) */
case CHAR_c:
@@ -849,11 +851,15 @@ else
*errorcodeptr = ERR2;
break;
}
#ifndef EBCDIC /* ASCII/UTF-8 coding */
#ifndef EBCDIC /* ASCII/UTF-8 coding */
if (c > 127) /* Excludes all non-ASCII in either mode */
{
*errorcodeptr = ERR68;
break;
}
if (c >= CHAR_a && c <= CHAR_z) c -= 32;
c ^= 0x40;
#else /* EBCDIC coding */
#else /* EBCDIC coding */
if (c >= CHAR_a && c <= CHAR_z) c += 64;
c ^= 0xC0;
#endif
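
The ASCII branch of the \c handling above can be tried in isolation with the sketch below. The function name and the -1 return value are assumptions of the example; the real code sets ERR68 instead. It assumes an ASCII execution character set.

```c
/* Returns the control character for \c followed by byte c, or -1 where
   the compiler would now report that \c must be followed by an ASCII
   character. */
int control_escape(int c)
{
  if (c > 127) return -1;                 /* not an ASCII character */
  if (c >= 'a' && c <= 'z') c -= 32;      /* upper-case a letter */
  return c ^ 0x40;                        /* flip the 0x40 bit */
}
```

With this, both \cA and \ca give 0x01, \c[ gives 0x1b (escape), and \c? gives 0x7f (delete).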
@@ -1097,10 +1103,21 @@ top-level call starts at the beginning of the pattern. All other calls must
start at a parenthesis. It scans along a pattern's text looking for capturing
subpatterns, and counting them. If it finds a named pattern that matches the
name it is given, it returns its number. Alternatively, if the name is NULL, it
returns when it reaches a given numbered subpattern. We know that if (?P< is
encountered, the name will be terminated by '>' because that is checked in the
first pass. Recursion is used to keep track of subpatterns that reset the
capturing group numbers - the (?| feature.
returns when it reaches a given numbered subpattern. Recursion is used to keep
track of subpatterns that reset the capturing group numbers - the (?| feature.
This function was originally called only from the second pass, in which we know
that if (?< or (?' or (?P< is encountered, the name will be correctly
terminated because that is checked in the first pass. There is now one call to
this function in the first pass, to check for a recursive back reference by
name (so that we can make the whole group atomic). In this case, we need check
only up to the current position in the pattern, and that is still OK because
any previous occurrences will have been checked. To make this work, the test
for "end of pattern" is a check against cd->end_pattern in the main loop,
instead of looking for a binary zero. This means that the special first-pass
call can adjust cd->end_pattern temporarily. (Checks for binary zero while
processing items within the loop are OK, because afterwards the main loop will
terminate.)
Arguments:
ptrptr address of the current character pointer (updated)
@@ -1108,6 +1125,7 @@ Arguments:
name name to seek, or NULL if seeking a numbered subpattern
lorn name length, or subpattern number if name is NULL
xmode TRUE if we are in /x mode
utf8 TRUE if we are in UTF-8 mode
count pointer to the current capturing subpattern number (updated)
Returns: the number of the named subpattern, or -1 if not found
@@ -1115,7 +1133,7 @@ Returns: the number of the named subpattern, or -1 if not found
static int
find_parens_sub(uschar **ptrptr, compile_data *cd, const uschar *name, int lorn,
BOOL xmode, int *count)
BOOL xmode, BOOL utf8, int *count)
{
uschar *ptr = *ptrptr;
int start_count = *count;
@@ -1200,9 +1218,11 @@ if (ptr[0] == CHAR_LEFT_PARENTHESIS)
}
/* Past any initial parenthesis handling, scan for parentheses or vertical
bars. */
bars. Stop if we get to cd->end_pattern. Note that this is important for the
first-pass call when this value is temporarily adjusted to stop at the current
position. So DO NOT change this to a test for binary zero. */
for (; *ptr != 0; ptr++)
for (; ptr < cd->end_pattern; ptr++)
{
/* Skip over backslashed characters and also entire \Q...\E */
@@ -1276,7 +1296,15 @@ for (; *ptr != 0; ptr++)
if (xmode && *ptr == CHAR_NUMBER_SIGN)
{
while (*(++ptr) != 0 && *ptr != CHAR_NL) {};
ptr++;
while (*ptr != 0)
{
if (IS_NEWLINE(ptr)) { ptr += cd->nllen - 1; break; }
ptr++;
#ifdef SUPPORT_UTF8
if (utf8) while ((*ptr & 0xc0) == 0x80) ptr++;
#endif
}
if (*ptr == 0) goto FAIL_EXIT;
continue;
}
@@ -1285,7 +1313,7 @@ for (; *ptr != 0; ptr++)
if (*ptr == CHAR_LEFT_PARENTHESIS)
{
int rc = find_parens_sub(&ptr, cd, name, lorn, xmode, count);
int rc = find_parens_sub(&ptr, cd, name, lorn, xmode, utf8, count);
if (rc > 0) return rc;
if (*ptr == 0) goto FAIL_EXIT;
}
@@ -1331,12 +1359,14 @@ Arguments:
name name to seek, or NULL if seeking a numbered subpattern
lorn name length, or subpattern number if name is NULL
xmode TRUE if we are in /x mode
utf8 TRUE if we are in UTF-8 mode
Returns: the number of the found subpattern, or -1 if not found
*/
static int
find_parens(compile_data *cd, const uschar *name, int lorn, BOOL xmode)
find_parens(compile_data *cd, const uschar *name, int lorn, BOOL xmode,
BOOL utf8)
{
uschar *ptr = (uschar *)cd->start_pattern;
int count = 0;
@@ -1349,7 +1379,7 @@ matching closing parens. That is why we have to have a loop. */
for (;;)
{
rc = find_parens_sub(&ptr, cd, name, lorn, xmode, &count);
rc = find_parens_sub(&ptr, cd, name, lorn, xmode, utf8, &count);
if (rc > 0 || *ptr++ == 0) break;
}
@@ -1722,9 +1752,12 @@ for (;;)
case OP_MARK:
case OP_PRUNE_ARG:
case OP_SKIP_ARG:
case OP_THEN_ARG:
code += code[1];
break;
case OP_THEN_ARG:
code += code[1+LINK_SIZE];
break;
}
/* Add in the fixed length from the table */
@@ -1825,9 +1858,12 @@ for (;;)
case OP_MARK:
case OP_PRUNE_ARG:
case OP_SKIP_ARG:
case OP_THEN_ARG:
code += code[1];
break;
case OP_THEN_ARG:
code += code[1+LINK_SIZE];
break;
}
/* Add in the fixed length from the table */
@@ -2103,10 +2139,13 @@ for (code = first_significant_code(code + _pcre_OP_lengths[*code], NULL, 0, TRUE
case OP_MARK:
case OP_PRUNE_ARG:
case OP_SKIP_ARG:
case OP_THEN_ARG:
code += code[1];
break;
case OP_THEN_ARG:
code += code[1+LINK_SIZE];
break;
/* None of the remaining opcodes are required to match a character. */
default:
@@ -2504,8 +2543,15 @@ if ((options & PCRE_EXTENDED) != 0)
while ((cd->ctypes[*ptr] & ctype_space) != 0) ptr++;
if (*ptr == CHAR_NUMBER_SIGN)
{
while (*(++ptr) != 0)
ptr++;
while (*ptr != 0)
{
if (IS_NEWLINE(ptr)) { ptr += cd->nllen; break; }
ptr++;
#ifdef SUPPORT_UTF8
if (utf8) while ((*ptr & 0xc0) == 0x80) ptr++;
#endif
}
}
else break;
}
@@ -2541,8 +2587,15 @@ if ((options & PCRE_EXTENDED) != 0)
while ((cd->ctypes[*ptr] & ctype_space) != 0) ptr++;
if (*ptr == CHAR_NUMBER_SIGN)
{
while (*(++ptr) != 0)
ptr++;
while (*ptr != 0)
{
if (IS_NEWLINE(ptr)) { ptr += cd->nllen; break; }
ptr++;
#ifdef SUPPORT_UTF8
if (utf8) while ((*ptr & 0xc0) == 0x80) ptr++;
#endif
}
}
else break;
}
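
The reason the new comment-skipping loops step over UTF-8 continuation bytes can be seen in a small standalone sketch. It is a simplification under stated assumptions: only LF and NEL (character 0x85) are treated as newlines, and both function names are hypothetical stand-ins for IS_NEWLINE() and the inline loops above.

```c
/* Length of the newline at p, or 0 if there is none. In UTF-8 mode NEL is
   the two-byte sequence 0xc2 0x85; a lone byte 0x85 is not a newline. */
int newline_length(const unsigned char *p, int utf8)
{
  if (p[0] == '\n') return 1;
  if (utf8) return (p[0] == 0xc2 && p[1] == 0x85) ? 2 : 0;
  return (p[0] == 0x85) ? 1 : 0;
}

/* ptr points at '#'. Returns a pointer just past the newline that ends the
   comment, or a pointer to the terminating zero if there is no newline. */
const unsigned char *skip_comment(const unsigned char *ptr, int utf8)
{
  ptr++;                                   /* step over '#' */
  while (*ptr != 0)
    {
    int nl = newline_length(ptr, utf8);
    if (nl > 0) return ptr + nl;
    ptr++;
    /* Move on to the start of the next character, not the next byte. */
    if (utf8) while ((*ptr & 0xc0) == 0x80) ptr++;
    }
  return ptr;
}
```

For the bytes '#', 0xd1, 0x85, LF (a comment containing U+0445), the UTF-8 scan ends after the LF, whereas a byte-wise scan stops early at the 0x85 byte.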
@@ -3115,9 +3168,14 @@ for (;; ptr++)
if ((cd->ctypes[c] & ctype_space) != 0) continue;
if (c == CHAR_NUMBER_SIGN)
{
while (*(++ptr) != 0)
ptr++;
while (*ptr != 0)
{
if (IS_NEWLINE(ptr)) { ptr += cd->nllen - 1; break; }
ptr++;
#ifdef SUPPORT_UTF8
if (utf8) while ((*ptr & 0xc0) == 0x80) ptr++;
#endif
}
if (*ptr != 0) continue;
@@ -3492,9 +3550,14 @@ for (;; ptr++)
for (c = 0; c < 32; c++) classbits[c] |= ~cbits[c+cbit_word];
continue;
/* Perl 5.004 onwards omits VT from \s, but we must preserve it
if it was previously set by something earlier in the character
class. */
case ESC_s:
for (c = 0; c < 32; c++) classbits[c] |= cbits[c+cbit_space];
classbits[1] &= ~0x08; /* Perl 5.004 onwards omits VT from \s */
classbits[0] |= cbits[cbit_space];
classbits[1] |= cbits[cbit_space+1] & ~0x08;
for (c = 2; c < 32; c++) classbits[c] |= cbits[c+cbit_space];
continue;
case ESC_S:
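
The effect of the \s change on the class bitmap can be checked with a small sketch. The bitmap is 32 bytes, one bit per character, so VT (0x0b) is bit 0x08 of byte 1. The function names here are hypothetical, and the space table is passed in rather than taken from cbits.

```c
/* OR the "space" bitmap into a class bitmap, leaving VT out of what is
   added but never clearing a VT bit that an earlier item has already set. */
void add_backslash_s(unsigned char classbits[32], const unsigned char space[32])
{
  int c;
  classbits[0] |= space[0];
  classbits[1] |= (unsigned char)(space[1] & ~0x08);   /* omit VT */
  for (c = 2; c < 32; c++) classbits[c] |= space[c];
}

/* Non-zero if character ch (0 to 255) is in the class. */
int class_has(const unsigned char classbits[32], int ch)
{
  return (classbits[ch / 8] >> (ch % 8)) & 1;
}
```

Applied to a class that already holds every character, as in [\x00-\xff\s], VT stays in the class; applied to an empty class, VT stays out while the other space characters are added.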
@@ -4806,7 +4869,12 @@ for (;; ptr++)
*errorcodeptr = ERR66;
goto FAILED;
}
*code++ = verbs[i].op;
*code = verbs[i].op;
if (*code++ == OP_THEN)
{
PUT(code, 0, code - bcptr->current_branch - 1);
code += LINK_SIZE;
}
}
else
@@ -4816,7 +4884,12 @@ for (;; ptr++)
*errorcodeptr = ERR59;
goto FAILED;
}
*code++ = verbs[i].op_arg;
*code = verbs[i].op_arg;
if (*code++ == OP_THEN_ARG)
{
PUT(code, 0, code - bcptr->current_branch - 1);
code += LINK_SIZE;
}
*code++ = arglen;
memcpy(code, arg, arglen);
code += arglen;
@@ -5010,7 +5083,7 @@ for (;; ptr++)
/* Search the pattern for a forward reference */
else if ((i = find_parens(cd, name, namelen,
(options & PCRE_EXTENDED) != 0)) > 0)
(options & PCRE_EXTENDED) != 0, utf8)) > 0)
{
PUT2(code, 2+LINK_SIZE, i);
code[1+LINK_SIZE]++;
@@ -5311,11 +5384,17 @@ for (;; ptr++)
while ((cd->ctypes[*ptr] & ctype_word) != 0) ptr++;
namelen = (int)(ptr - name);
/* In the pre-compile phase, do a syntax check and set a dummy
reference number. */
/* In the pre-compile phase, do a syntax check. We used to just set
a dummy reference number, because it was not used in the first pass.
However, with the change of recursive back references to be atomic,
we have to look for the number so that this state can be identified, as
otherwise the incorrect length is computed. If it's not a backwards
reference, the dummy number will do. */
if (lengthptr != NULL)
{
const uschar *temp;
if (namelen == 0)
{
*errorcodeptr = ERR62;
@@ -5331,7 +5410,22 @@ for (;; ptr++)
*errorcodeptr = ERR48;
goto FAILED;
}
recno = 0;
/* The name table does not exist in the first pass, so we cannot
do a simple search as in the code below. Instead, we have to scan the
pattern to find the number. It is important that we scan it only as
far as we have got because the syntax of named subpatterns has not
been checked for the rest of the pattern, and find_parens() assumes
correct syntax. In any case, it's a waste of resources to scan
further. We stop the scan at the current point by temporarily
adjusting the value of cd->end_pattern. */
temp = cd->end_pattern;
cd->end_pattern = ptr;
recno = find_parens(cd, name, namelen,
(options & PCRE_EXTENDED) != 0, utf8);
cd->end_pattern = temp;
if (recno < 0) recno = 0; /* Forward ref; set dummy number */
}
/* In the real compile, seek the name in the table. We check the name
@@ -5356,7 +5450,7 @@ for (;; ptr++)
}
else if ((recno = /* Forward back reference */
find_parens(cd, name, namelen,
(options & PCRE_EXTENDED) != 0)) <= 0)
(options & PCRE_EXTENDED) != 0, utf8)) <= 0)
{
*errorcodeptr = ERR15;
goto FAILED;
@@ -5467,7 +5561,7 @@ for (;; ptr++)
if (called == NULL)
{
if (find_parens(cd, NULL, recno,
(options & PCRE_EXTENDED) != 0) < 0)
(options & PCRE_EXTENDED) != 0, utf8) < 0)
{
*errorcodeptr = ERR15;
goto FAILED;
@@ -6797,6 +6891,8 @@ while (ptr[skipatstart] == CHAR_LEFT_PARENTHESIS &&
{ skipatstart += 7; options |= PCRE_UTF8; continue; }
else if (strncmp((char *)(ptr+skipatstart+2), STRING_UCP_RIGHTPAR, 4) == 0)
{ skipatstart += 6; options |= PCRE_UCP; continue; }
else if (strncmp((char *)(ptr+skipatstart+2), STRING_NO_START_OPT_RIGHTPAR, 13) == 0)
{ skipatstart += 15; options |= PCRE_NO_START_OPTIMIZE; continue; }
if (strncmp((char *)(ptr+skipatstart+2), STRING_CR_RIGHTPAR, 3) == 0)
{ skipatstart += 5; newnl = PCRE_NEWLINE_CR; }
+80 -46
@@ -292,7 +292,7 @@ argument of match(), which never changes. */
#define RMATCH(ra,rb,rc,rd,re,rf,rg,rw)\
{\
heapframe *newframe = (pcre_stack_malloc)(sizeof(heapframe));\
heapframe *newframe = (heapframe *)(pcre_stack_malloc)(sizeof(heapframe));\
if (newframe == NULL) RRETURN(PCRE_ERROR_NOMEMORY);\
frame->Xwhere = rw; \
newframe->Xeptr = ra;\
@@ -420,17 +420,18 @@ immediately. The second one is used when we already know we are past the end of
the subject. */
#define CHECK_PARTIAL()\
if (md->partial != 0 && eptr >= md->end_subject && eptr > mstart)\
{\
md->hitend = TRUE;\
if (md->partial > 1) MRRETURN(PCRE_ERROR_PARTIAL);\
if (md->partial != 0 && eptr >= md->end_subject && \
eptr > md->start_used_ptr) \
{ \
md->hitend = TRUE; \
if (md->partial > 1) MRRETURN(PCRE_ERROR_PARTIAL); \
}
#define SCHECK_PARTIAL()\
if (md->partial != 0 && eptr > mstart)\
{\
md->hitend = TRUE;\
if (md->partial > 1) MRRETURN(PCRE_ERROR_PARTIAL);\
if (md->partial != 0 && eptr > md->start_used_ptr) \
{ \
md->hitend = TRUE; \
if (md->partial > 1) MRRETURN(PCRE_ERROR_PARTIAL); \
}
@@ -486,7 +487,7 @@ heap storage. Set up the top-level frame here; others are obtained from the
heap whenever RMATCH() does a "recursion". See the macro definitions above. */
#ifdef NO_RECURSE
heapframe *frame = (pcre_stack_malloc)(sizeof(heapframe));
heapframe *frame = (heapframe *)(pcre_stack_malloc)(sizeof(heapframe));
if (frame == NULL) RRETURN(PCRE_ERROR_NOMEMORY);
frame->Xprevframe = NULL; /* Marks the top level */
@@ -708,36 +709,47 @@ for (;;)
case OP_FAIL:
MRRETURN(MATCH_NOMATCH);
/* COMMIT overrides PRUNE, SKIP, and THEN */
case OP_COMMIT:
RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md,
ims, eptrb, flags, RM52);
if (rrc != MATCH_NOMATCH) RRETURN(rrc);
if (rrc != MATCH_NOMATCH && rrc != MATCH_PRUNE &&
rrc != MATCH_SKIP && rrc != MATCH_SKIP_ARG &&
rrc != MATCH_THEN)
RRETURN(rrc);
MRRETURN(MATCH_COMMIT);
/* PRUNE overrides THEN */
case OP_PRUNE:
RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md,
ims, eptrb, flags, RM51);
if (rrc != MATCH_NOMATCH) RRETURN(rrc);
if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN) RRETURN(rrc);
MRRETURN(MATCH_PRUNE);
case OP_PRUNE_ARG:
RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode] + ecode[1], offset_top, md,
ims, eptrb, flags, RM56);
if (rrc != MATCH_NOMATCH) RRETURN(rrc);
if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN) RRETURN(rrc);
md->mark = ecode + 2;
RRETURN(MATCH_PRUNE);
/* SKIP overrides PRUNE and THEN */
case OP_SKIP:
RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md,
ims, eptrb, flags, RM53);
if (rrc != MATCH_NOMATCH) RRETURN(rrc);
if (rrc != MATCH_NOMATCH && rrc != MATCH_PRUNE && rrc != MATCH_THEN)
RRETURN(rrc);
md->start_match_ptr = eptr; /* Pass back current position */
MRRETURN(MATCH_SKIP);
case OP_SKIP_ARG:
RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode] + ecode[1], offset_top, md,
ims, eptrb, flags, RM57);
if (rrc != MATCH_NOMATCH) RRETURN(rrc);
if (rrc != MATCH_NOMATCH && rrc != MATCH_PRUNE && rrc != MATCH_THEN)
RRETURN(rrc);
/* Pass back the current skip name by overloading md->start_match_ptr and
returning the special MATCH_SKIP_ARG return code. This will either be
@@ -747,17 +759,24 @@ for (;;)
md->start_match_ptr = ecode + 2;
RRETURN(MATCH_SKIP_ARG);
/* For THEN (and THEN_ARG) we pass back the address of the bracket or
the alt that is at the start of the current branch. This makes it possible
to skip back past alternatives that precede the THEN within the current
branch. */
case OP_THEN:
RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md,
ims, eptrb, flags, RM54);
if (rrc != MATCH_NOMATCH) RRETURN(rrc);
md->start_match_ptr = ecode - GET(ecode, 1);
MRRETURN(MATCH_THEN);
case OP_THEN_ARG:
RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode] + ecode[1], offset_top, md,
ims, eptrb, flags, RM58);
RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode] + ecode[1+LINK_SIZE],
offset_top, md, ims, eptrb, flags, RM58);
if (rrc != MATCH_NOMATCH) RRETURN(rrc);
md->mark = ecode + 2;
md->start_match_ptr = ecode - GET(ecode, 1);
md->mark = ecode + LINK_SIZE + 2;
RRETURN(MATCH_THEN);
/* Handle a capturing bracket. If there is space in the offset vector, save
@@ -802,7 +821,9 @@ for (;;)
{
RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md,
ims, eptrb, flags, RM1);
if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN) RRETURN(rrc);
if (rrc != MATCH_NOMATCH &&
(rrc != MATCH_THEN || md->start_match_ptr != ecode))
RRETURN(rrc);
md->capture_last = save_capture_last;
ecode += GET(ecode, 1);
}
@@ -863,7 +884,9 @@ for (;;)
RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md, ims,
eptrb, flags, RM2);
if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN) RRETURN(rrc);
if (rrc != MATCH_NOMATCH &&
(rrc != MATCH_THEN || md->start_match_ptr != ecode))
RRETURN(rrc);
ecode += GET(ecode, 1);
}
/* Control never reaches here. */
@@ -1064,7 +1087,8 @@ for (;;)
ecode += 1 + LINK_SIZE + GET(ecode, LINK_SIZE + 2);
while (*ecode == OP_ALT) ecode += GET(ecode, 1);
}
else if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN)
else if (rrc != MATCH_NOMATCH &&
(rrc != MATCH_THEN || md->start_match_ptr != ecode))
{
RRETURN(rrc); /* Need braces because of following else */
}
@@ -1192,7 +1216,9 @@ for (;;)
mstart = md->start_match_ptr; /* In case \K reset it */
break;
}
if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN) RRETURN(rrc);
if (rrc != MATCH_NOMATCH &&
(rrc != MATCH_THEN || md->start_match_ptr != ecode))
RRETURN(rrc);
ecode += GET(ecode, 1);
}
while (*ecode == OP_ALT);
@@ -1226,7 +1252,9 @@ for (;;)
do ecode += GET(ecode,1); while (*ecode == OP_ALT);
break;
}
if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN) RRETURN(rrc);
if (rrc != MATCH_NOMATCH &&
(rrc != MATCH_THEN || md->start_match_ptr != ecode))
RRETURN(rrc);
ecode += GET(ecode,1);
}
while (*ecode == OP_ALT);
@@ -1363,7 +1391,8 @@ for (;;)
(pcre_free)(new_recursive.offset_save);
MRRETURN(MATCH_MATCH);
}
else if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN)
else if (rrc != MATCH_NOMATCH &&
(rrc != MATCH_THEN || md->start_match_ptr != ecode))
{
DPRINTF(("Recursion gave error %d\n", rrc));
if (new_recursive.offset_save != stacksave)
@@ -1406,7 +1435,9 @@ for (;;)
mstart = md->start_match_ptr;
break;
}
if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN) RRETURN(rrc);
if (rrc != MATCH_NOMATCH &&
(rrc != MATCH_THEN || md->start_match_ptr != ecode))
RRETURN(rrc);
ecode += GET(ecode,1);
}
while (*ecode == OP_ALT);
@@ -1672,37 +1703,40 @@ for (;;)
if (eptr < md->end_subject)
{ if (!IS_NEWLINE(eptr)) MRRETURN(MATCH_NOMATCH); }
else
{ if (md->noteol) MRRETURN(MATCH_NOMATCH); }
{
if (md->noteol) MRRETURN(MATCH_NOMATCH);
SCHECK_PARTIAL();
}
ecode++;
break;
}
else
else /* Not multiline */
{
if (md->noteol) MRRETURN(MATCH_NOMATCH);
if (!md->endonly)
{
if (eptr != md->end_subject &&
(!IS_NEWLINE(eptr) || eptr != md->end_subject - md->nllen))
MRRETURN(MATCH_NOMATCH);
ecode++;
break;
}
if (!md->endonly) goto ASSERT_NL_OR_EOS;
}
/* ... else fall through for endonly */
/* End of subject assertion (\z) */
case OP_EOD:
if (eptr < md->end_subject) MRRETURN(MATCH_NOMATCH);
SCHECK_PARTIAL();
ecode++;
break;
/* End of subject or ending \n assertion (\Z) */
case OP_EODN:
if (eptr != md->end_subject &&
ASSERT_NL_OR_EOS:
if (eptr < md->end_subject &&
(!IS_NEWLINE(eptr) || eptr != md->end_subject - md->nllen))
MRRETURN(MATCH_NOMATCH);
/* Either at end of string or \n before end. */
SCHECK_PARTIAL();
ecode++;
break;
@@ -5598,6 +5632,7 @@ if ((options & ~PUBLIC_EXEC_OPTIONS) != 0) return PCRE_ERROR_BADOPTION;
if (re == NULL || subject == NULL ||
(offsets == NULL && offsetcount > 0)) return PCRE_ERROR_NULL;
if (offsetcount < 0) return PCRE_ERROR_BADCOUNT;
if (start_offset < 0 || start_offset > length) return PCRE_ERROR_BADOFFSET;
/* This information is for finding all the numbers associated with a given
name, for condition testing. */
@@ -5764,16 +5799,14 @@ back the character offset. */
#ifdef SUPPORT_UTF8
if (utf8 && (options & PCRE_NO_UTF8_CHECK) == 0)
{
if (_pcre_valid_utf8((USPTR)subject, length) >= 0)
return PCRE_ERROR_BADUTF8;
int tb;
if ((tb = _pcre_valid_utf8((USPTR)subject, length)) >= 0)
return (tb == length && md->partial > 1)?
PCRE_ERROR_SHORTUTF8 : PCRE_ERROR_BADUTF8;
if (start_offset > 0 && start_offset < length)
{
int tb = ((USPTR)subject)[start_offset];
if (tb > 127)
{
tb &= 0xc0;
if (tb != 0 && tb != 0xc0) return PCRE_ERROR_BADUTF8_OFFSET;
}
tb = ((USPTR)subject)[start_offset] & 0xc0;
if (tb == 0x80) return PCRE_ERROR_BADUTF8_OFFSET;
}
}
#endif
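The start_offset test in the hunk above was reduced to a single mask-and-compare. The following is a minimal sketch, not PCRE code, of why the two forms appear to agree: once a byte is known to be above 127, masking with 0xc0 can only yield 0x80 or 0xc0, so "neither 0 nor 0xc0" is the same as "equal to 0x80". The function names here are hypothetical and exist only for illustration.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE: returns 1 when the byte at
   start_offset is a UTF-8 continuation byte (bit pattern 10xxxxxx), which
   means the offset points into the middle of a character. */
int offset_splits_utf8_char(const unsigned char *subject,
                            size_t length, size_t start_offset)
{
  /* As in the diff, only offsets strictly inside the subject are tested. */
  if (start_offset == 0 || start_offset >= length) return 0;
  return (subject[start_offset] & 0xc0) == 0x80;
}

/* The older form of the test, reproduced only for comparison. */
int old_style_check(int tb)
{
  if (tb > 127)
    {
    tb &= 0xc0;
    if (tb != 0 && tb != 0xc0) return 1;
    }
  return 0;
}

/* The newer, single-expression form of the same test. */
int new_style_check(int tb)
{
  return (tb & 0xc0) == 0x80;
}
```

Looping over all 256 byte values and comparing old_style_check with new_style_check is a quick way to confirm the simplification does not change behaviour.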
@@ -5901,9 +5934,10 @@ for(;;)
/* There are some optimizations that avoid running the match if a known
starting point is not found, or if a known later character is not present.
However, there is an option that disables these, for testing and for ensuring
that all callouts do actually occur. */
that all callouts do actually occur. The option can be set in the regex by
(*NO_START_OPT) or passed in match-time options. */
if ((options & PCRE_NO_START_OPTIMIZE) == 0)
if (((options | re->options) & PCRE_NO_START_OPTIMIZE) == 0)
{
/* Advance to a unique first byte if there is one. */
+129 -93
@@ -192,9 +192,7 @@ stdint.h is available, include it; it may define INT64_MAX. Systems that do not
have stdint.h (e.g. Solaris) may have inttypes.h. The macro int64_t may be set
by "configure". */
#ifdef PHP_WIN32
#include "win32/php_stdint.h"
#elif HAVE_STDINT_H
#if HAVE_STDINT_H
#include <stdint.h>
#elif HAVE_INTTYPES_H
#include <inttypes.h>
@@ -410,9 +408,10 @@ capturing parenthesis numbers in back references. */
/* When UTF-8 encoding is being used, a character is no longer just a single
byte. The macros for character handling generate simple sequences when used in
byte-mode, and more complicated ones for UTF-8 characters. BACKCHAR should
never be called in byte mode. To make sure it can never even appear when UTF-8
support is omitted, we don't even define it. */
byte-mode, and more complicated ones for UTF-8 characters. GETCHARLENTEST is
not used when UTF-8 is not supported, so it is not defined, and BACKCHAR should
never be called in byte mode. To make sure they can never even appear when
UTF-8 support is omitted, we don't even define them. */
#ifndef SUPPORT_UTF8
#define GETCHAR(c, eptr) c = *eptr;
@@ -420,43 +419,83 @@ support is omitted, we don't even define it. */
#define GETCHARINC(c, eptr) c = *eptr++;
#define GETCHARINCTEST(c, eptr) c = *eptr++;
#define GETCHARLEN(c, eptr, len) c = *eptr;
/* #define GETCHARLENTEST(c, eptr, len) */
/* #define BACKCHAR(eptr) */
#else /* SUPPORT_UTF8 */
/* These macros were originally written in the form of loops that used data
from the tables whose names start with _pcre_utf8_table. They were rewritten by
a user so as not to use loops, because in some environments this gives a
significant performance advantage, and it seems never to do any harm. */
/* Base macro to pick up the remaining bytes of a UTF-8 character, not
advancing the pointer. */
#define GETUTF8(c, eptr) \
{ \
if ((c & 0x20) == 0) \
c = ((c & 0x1f) << 6) | (eptr[1] & 0x3f); \
else if ((c & 0x10) == 0) \
c = ((c & 0x0f) << 12) | ((eptr[1] & 0x3f) << 6) | (eptr[2] & 0x3f); \
else if ((c & 0x08) == 0) \
c = ((c & 0x07) << 18) | ((eptr[1] & 0x3f) << 12) | \
((eptr[2] & 0x3f) << 6) | (eptr[3] & 0x3f); \
else if ((c & 0x04) == 0) \
c = ((c & 0x03) << 24) | ((eptr[1] & 0x3f) << 18) | \
((eptr[2] & 0x3f) << 12) | ((eptr[3] & 0x3f) << 6) | \
(eptr[4] & 0x3f); \
else \
c = ((c & 0x01) << 30) | ((eptr[1] & 0x3f) << 24) | \
((eptr[2] & 0x3f) << 18) | ((eptr[3] & 0x3f) << 12) | \
((eptr[4] & 0x3f) << 6) | (eptr[5] & 0x3f); \
}
/* Get the next UTF-8 character, not advancing the pointer. This is called when
we know we are in UTF-8 mode. */
#define GETCHAR(c, eptr) \
c = *eptr; \
if (c >= 0xc0) \
{ \
int gcii; \
int gcaa = _pcre_utf8_table4[c & 0x3f]; /* Number of additional bytes */ \
int gcss = 6*gcaa; \
c = (c & _pcre_utf8_table3[gcaa]) << gcss; \
for (gcii = 1; gcii <= gcaa; gcii++) \
{ \
gcss -= 6; \
c |= (eptr[gcii] & 0x3f) << gcss; \
} \
}
if (c >= 0xc0) GETUTF8(c, eptr);
/* Get the next UTF-8 character, testing for UTF-8 mode, and not advancing the
pointer. */
#define GETCHARTEST(c, eptr) \
c = *eptr; \
if (utf8 && c >= 0xc0) \
if (utf8 && c >= 0xc0) GETUTF8(c, eptr);
/* Base macro to pick up the remaining bytes of a UTF-8 character, advancing
the pointer. */
#define GETUTF8INC(c, eptr) \
{ \
int gcii; \
int gcaa = _pcre_utf8_table4[c & 0x3f]; /* Number of additional bytes */ \
int gcss = 6*gcaa; \
c = (c & _pcre_utf8_table3[gcaa]) << gcss; \
for (gcii = 1; gcii <= gcaa; gcii++) \
if ((c & 0x20) == 0) \
c = ((c & 0x1f) << 6) | (*eptr++ & 0x3f); \
else if ((c & 0x10) == 0) \
{ \
gcss -= 6; \
c |= (eptr[gcii] & 0x3f) << gcss; \
c = ((c & 0x0f) << 12) | ((*eptr & 0x3f) << 6) | (eptr[1] & 0x3f); \
eptr += 2; \
} \
else if ((c & 0x08) == 0) \
{ \
c = ((c & 0x07) << 18) | ((*eptr & 0x3f) << 12) | \
((eptr[1] & 0x3f) << 6) | (eptr[2] & 0x3f); \
eptr += 3; \
} \
else if ((c & 0x04) == 0) \
{ \
c = ((c & 0x03) << 24) | ((*eptr & 0x3f) << 18) | \
((eptr[1] & 0x3f) << 12) | ((eptr[2] & 0x3f) << 6) | \
(eptr[3] & 0x3f); \
eptr += 4; \
} \
else \
{ \
c = ((c & 0x01) << 30) | ((*eptr & 0x3f) << 24) | \
((eptr[1] & 0x3f) << 18) | ((eptr[2] & 0x3f) << 12) | \
((eptr[3] & 0x3f) << 6) | (eptr[4] & 0x3f); \
eptr += 5; \
} \
}
@@ -465,32 +504,49 @@ know we are in UTF-8 mode. */
#define GETCHARINC(c, eptr) \
c = *eptr++; \
if (c >= 0xc0) \
{ \
int gcaa = _pcre_utf8_table4[c & 0x3f]; /* Number of additional bytes */ \
int gcss = 6*gcaa; \
c = (c & _pcre_utf8_table3[gcaa]) << gcss; \
while (gcaa-- > 0) \
{ \
gcss -= 6; \
c |= (*eptr++ & 0x3f) << gcss; \
} \
}
if (c >= 0xc0) GETUTF8INC(c, eptr);
/* Get the next character, testing for UTF-8 mode, and advancing the pointer.
This is called when we don't know if we are in UTF-8 mode. */
#define GETCHARINCTEST(c, eptr) \
c = *eptr++; \
if (utf8 && c >= 0xc0) \
if (utf8 && c >= 0xc0) GETUTF8INC(c, eptr);
/* Base macro to pick up the remaining bytes of a UTF-8 character, not
advancing the pointer, incrementing the length. */
#define GETUTF8LEN(c, eptr, len) \
{ \
int gcaa = _pcre_utf8_table4[c & 0x3f]; /* Number of additional bytes */ \
int gcss = 6*gcaa; \
c = (c & _pcre_utf8_table3[gcaa]) << gcss; \
while (gcaa-- > 0) \
if ((c & 0x20) == 0) \
{ \
gcss -= 6; \
c |= (*eptr++ & 0x3f) << gcss; \
c = ((c & 0x1f) << 6) | (eptr[1] & 0x3f); \
len++; \
} \
else if ((c & 0x10) == 0) \
{ \
c = ((c & 0x0f) << 12) | ((eptr[1] & 0x3f) << 6) | (eptr[2] & 0x3f); \
len += 2; \
} \
else if ((c & 0x08) == 0) \
{\
c = ((c & 0x07) << 18) | ((eptr[1] & 0x3f) << 12) | \
((eptr[2] & 0x3f) << 6) | (eptr[3] & 0x3f); \
len += 3; \
} \
else if ((c & 0x04) == 0) \
{ \
c = ((c & 0x03) << 24) | ((eptr[1] & 0x3f) << 18) | \
((eptr[2] & 0x3f) << 12) | ((eptr[3] & 0x3f) << 6) | \
(eptr[4] & 0x3f); \
len += 4; \
} \
else \
{\
c = ((c & 0x01) << 30) | ((eptr[1] & 0x3f) << 24) | \
((eptr[2] & 0x3f) << 18) | ((eptr[3] & 0x3f) << 12) | \
((eptr[4] & 0x3f) << 6) | (eptr[5] & 0x3f); \
len += 5; \
} \
}
@@ -499,19 +555,7 @@ if there are extra bytes. This is called when we know we are in UTF-8 mode. */
#define GETCHARLEN(c, eptr, len) \
c = *eptr; \
if (c >= 0xc0) \
{ \
int gcii; \
int gcaa = _pcre_utf8_table4[c & 0x3f]; /* Number of additional bytes */ \
int gcss = 6*gcaa; \
c = (c & _pcre_utf8_table3[gcaa]) << gcss; \
for (gcii = 1; gcii <= gcaa; gcii++) \
{ \
gcss -= 6; \
c |= (eptr[gcii] & 0x3f) << gcss; \
} \
len += gcaa; \
}
if (c >= 0xc0) GETUTF8LEN(c, eptr, len);
/* Get the next UTF-8 character, testing for UTF-8 mode, not advancing the
pointer, incrementing length if there are extra bytes. This is called when we
@@ -519,19 +563,7 @@ do not know if we are in UTF-8 mode. */
#define GETCHARLENTEST(c, eptr, len) \
c = *eptr; \
if (utf8 && c >= 0xc0) \
{ \
int gcii; \
int gcaa = _pcre_utf8_table4[c & 0x3f]; /* Number of additional bytes */ \
int gcss = 6*gcaa; \
c = (c & _pcre_utf8_table3[gcaa]) << gcss; \
for (gcii = 1; gcii <= gcaa; gcii++) \
{ \
gcss -= 6; \
c |= (eptr[gcii] & 0x3f) << gcss; \
} \
len += gcaa; \
}
if (utf8 && c >= 0xc0) GETUTF8LEN(c, eptr, len);
/* If the pointer is not at the start of a character, move it back until
it is. This is called only in UTF-8 mode - we don't put a test within the macro
@@ -539,7 +571,7 @@ because almost all calls are already within a block of UTF-8 only code. */
#define BACKCHAR(eptr) while((*eptr & 0xc0) == 0x80) eptr--
#endif
#endif /* SUPPORT_UTF8 */
/* In case there is no definition of offsetof() provided - though any proper
@@ -583,7 +615,7 @@ time, run time, or study time, respectively. */
PCRE_DOTALL|PCRE_DOLLAR_ENDONLY|PCRE_EXTRA|PCRE_UNGREEDY|PCRE_UTF8| \
PCRE_NO_AUTO_CAPTURE|PCRE_NO_UTF8_CHECK|PCRE_AUTO_CALLOUT|PCRE_FIRSTLINE| \
PCRE_DUPNAMES|PCRE_NEWLINE_BITS|PCRE_BSR_ANYCRLF|PCRE_BSR_UNICODE| \
PCRE_JAVASCRIPT_COMPAT|PCRE_UCP)
PCRE_JAVASCRIPT_COMPAT|PCRE_UCP|PCRE_NO_START_OPTIMIZE)
#define PUBLIC_EXEC_OPTIONS \
(PCRE_ANCHORED|PCRE_NOTBOL|PCRE_NOTEOL|PCRE_NOTEMPTY|PCRE_NOTEMPTY_ATSTART| \
@@ -900,15 +932,16 @@ so that PCRE works on both ASCII and EBCDIC platforms, in non-UTF-mode only. */
#define STRING_DEFINE "DEFINE"
#define STRING_CR_RIGHTPAR "CR)"
#define STRING_LF_RIGHTPAR "LF)"
#define STRING_CRLF_RIGHTPAR "CRLF)"
#define STRING_ANY_RIGHTPAR "ANY)"
#define STRING_ANYCRLF_RIGHTPAR "ANYCRLF)"
#define STRING_BSR_ANYCRLF_RIGHTPAR "BSR_ANYCRLF)"
#define STRING_BSR_UNICODE_RIGHTPAR "BSR_UNICODE)"
#define STRING_UTF8_RIGHTPAR "UTF8)"
#define STRING_UCP_RIGHTPAR "UCP)"
#define STRING_CR_RIGHTPAR "CR)"
#define STRING_LF_RIGHTPAR "LF)"
#define STRING_CRLF_RIGHTPAR "CRLF)"
#define STRING_ANY_RIGHTPAR "ANY)"
#define STRING_ANYCRLF_RIGHTPAR "ANYCRLF)"
#define STRING_BSR_ANYCRLF_RIGHTPAR "BSR_ANYCRLF)"
#define STRING_BSR_UNICODE_RIGHTPAR "BSR_UNICODE)"
#define STRING_UTF8_RIGHTPAR "UTF8)"
#define STRING_UCP_RIGHTPAR "UCP)"
#define STRING_NO_START_OPT_RIGHTPAR "NO_START_OPT)"
#else /* SUPPORT_UTF8 */
@@ -1154,15 +1187,16 @@ only. */
#define STRING_DEFINE STR_D STR_E STR_F STR_I STR_N STR_E
#define STRING_CR_RIGHTPAR STR_C STR_R STR_RIGHT_PARENTHESIS
#define STRING_LF_RIGHTPAR STR_L STR_F STR_RIGHT_PARENTHESIS
#define STRING_CRLF_RIGHTPAR STR_C STR_R STR_L STR_F STR_RIGHT_PARENTHESIS
#define STRING_ANY_RIGHTPAR STR_A STR_N STR_Y STR_RIGHT_PARENTHESIS
#define STRING_ANYCRLF_RIGHTPAR STR_A STR_N STR_Y STR_C STR_R STR_L STR_F STR_RIGHT_PARENTHESIS
#define STRING_BSR_ANYCRLF_RIGHTPAR STR_B STR_S STR_R STR_UNDERSCORE STR_A STR_N STR_Y STR_C STR_R STR_L STR_F STR_RIGHT_PARENTHESIS
#define STRING_BSR_UNICODE_RIGHTPAR STR_B STR_S STR_R STR_UNDERSCORE STR_U STR_N STR_I STR_C STR_O STR_D STR_E STR_RIGHT_PARENTHESIS
#define STRING_UTF8_RIGHTPAR STR_U STR_T STR_F STR_8 STR_RIGHT_PARENTHESIS
#define STRING_UCP_RIGHTPAR STR_U STR_C STR_P STR_RIGHT_PARENTHESIS
#define STRING_CR_RIGHTPAR STR_C STR_R STR_RIGHT_PARENTHESIS
#define STRING_LF_RIGHTPAR STR_L STR_F STR_RIGHT_PARENTHESIS
#define STRING_CRLF_RIGHTPAR STR_C STR_R STR_L STR_F STR_RIGHT_PARENTHESIS
#define STRING_ANY_RIGHTPAR STR_A STR_N STR_Y STR_RIGHT_PARENTHESIS
#define STRING_ANYCRLF_RIGHTPAR STR_A STR_N STR_Y STR_C STR_R STR_L STR_F STR_RIGHT_PARENTHESIS
#define STRING_BSR_ANYCRLF_RIGHTPAR STR_B STR_S STR_R STR_UNDERSCORE STR_A STR_N STR_Y STR_C STR_R STR_L STR_F STR_RIGHT_PARENTHESIS
#define STRING_BSR_UNICODE_RIGHTPAR STR_B STR_S STR_R STR_UNDERSCORE STR_U STR_N STR_I STR_C STR_O STR_D STR_E STR_RIGHT_PARENTHESIS
#define STRING_UTF8_RIGHTPAR STR_U STR_T STR_F STR_8 STR_RIGHT_PARENTHESIS
#define STRING_UCP_RIGHTPAR STR_U STR_C STR_P STR_RIGHT_PARENTHESIS
#define STRING_NO_START_OPT_RIGHTPAR STR_N STR_O STR_UNDERSCORE STR_S STR_T STR_A STR_R STR_T STR_UNDERSCORE STR_O STR_P STR_T STR_RIGHT_PARENTHESIS
#endif /* SUPPORT_UTF8 */
@@ -1516,8 +1550,9 @@ in UTF-8 mode. The code that uses this table must know about such things. */
3, 3, /* RREF, NRREF */ \
1, /* DEF */ \
1, 1, /* BRAZERO, BRAMINZERO */ \
3, 1, 3, /* MARK, PRUNE, PRUNE_ARG, */ \
1, 3, 1, 3, /* SKIP, SKIP_ARG, THEN, THEN_ARG, */ \
3, 1, 3, /* MARK, PRUNE, PRUNE_ARG */ \
1, 3, /* SKIP, SKIP_ARG */ \
1+LINK_SIZE, 3+LINK_SIZE, /* THEN, THEN_ARG */ \
1, 1, 1, 3, 1 /* COMMIT, FAIL, ACCEPT, CLOSE, SKIPZERO */
@@ -1536,7 +1571,8 @@ enum { ERR0, ERR1, ERR2, ERR3, ERR4, ERR5, ERR6, ERR7, ERR8, ERR9,
ERR30, ERR31, ERR32, ERR33, ERR34, ERR35, ERR36, ERR37, ERR38, ERR39,
ERR40, ERR41, ERR42, ERR43, ERR44, ERR45, ERR46, ERR47, ERR48, ERR49,
ERR50, ERR51, ERR52, ERR53, ERR54, ERR55, ERR56, ERR57, ERR58, ERR59,
ERR60, ERR61, ERR62, ERR63, ERR64, ERR65, ERR66, ERR67, ERRCOUNT };
ERR60, ERR61, ERR62, ERR63, ERR64, ERR65, ERR66, ERR67, ERR68,
ERRCOUNT };
/* The real format of the start of the pcre block; the index of names and the
code vector run on as long as necessary after the end. We store an explicit
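The macro changes in this file replace table-driven loops with unrolled branches on the lead byte. The sketch below restates the non-advancing GETCHAR/GETUTF8 pair as an ordinary function so the branch structure is easier to follow. It is an illustration only: decode_utf8_char is a hypothetical name, and the function assumes its argument points at the first byte of a well-formed sequence (like the macro, it accepts the older forms of up to 6 bytes and does no validation).

```c
/* Hypothetical stand-alone equivalent of GETCHAR + GETUTF8. The pointer is
   not advanced; the decoded code point is returned. */
unsigned int decode_utf8_char(const unsigned char *p)
{
  unsigned int c = p[0];

  if (c < 0xc0) return c;                        /* Single-byte character */

  if ((c & 0x20) == 0)                           /* 110xxxxx: 1 extra byte */
    return ((c & 0x1f) << 6) | (p[1] & 0x3f);

  if ((c & 0x10) == 0)                           /* 1110xxxx: 2 extra bytes */
    return ((c & 0x0f) << 12) | ((p[1] & 0x3f) << 6) | (p[2] & 0x3f);

  if ((c & 0x08) == 0)                           /* 11110xxx: 3 extra bytes */
    return ((c & 0x07) << 18) | ((p[1] & 0x3f) << 12) |
           ((p[2] & 0x3f) << 6) | (p[3] & 0x3f);

  if ((c & 0x04) == 0)                           /* 111110xx: 4 extra bytes */
    return ((c & 0x03) << 24) | ((p[1] & 0x3f) << 18) |
           ((p[2] & 0x3f) << 12) | ((p[3] & 0x3f) << 6) |
           (p[4] & 0x3f);

  return ((c & 0x01) << 30) | ((p[1] & 0x3f) << 24) |   /* 5 extra bytes */
         ((p[2] & 0x3f) << 18) | ((p[3] & 0x3f) << 12) |
         ((p[4] & 0x3f) << 6) | (p[5] & 0x3f);
}
```

Each branch tests one more bit of the lead byte, so the number of continuation bytes is decided without a table lookup or a loop counter.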
+16 -1
@@ -537,11 +537,26 @@ for(;;)
case OP_MARK:
case OP_PRUNE_ARG:
case OP_SKIP_ARG:
case OP_THEN_ARG:
fprintf(f, " %s %s", OP_names[*code], code + 2);
extra += code[1];
break;
case OP_THEN:
if (print_lengths)
fprintf(f, " %s %d", OP_names[*code], GET(code, 1));
else
fprintf(f, " %s", OP_names[*code]);
break;
case OP_THEN_ARG:
if (print_lengths)
fprintf(f, " %s %d %s", OP_names[*code], GET(code, 1),
code + 2 + LINK_SIZE);
else
fprintf(f, " %s %s", OP_names[*code], code + 2 + LINK_SIZE);
extra += code[1+LINK_SIZE];
break;
/* Anything else is just an item with no data */
default:
+4 -1
@@ -417,10 +417,13 @@ for (;;)
case OP_MARK:
case OP_PRUNE_ARG:
case OP_SKIP_ARG:
case OP_THEN_ARG:
cc += _pcre_OP_lengths[op] + cc[1];
break;
case OP_THEN_ARG:
cc += _pcre_OP_lengths[op] + cc[1+LINK_SIZE];
break;
/* For the record, these are the opcodes that are matched by "default":
OP_ACCEPT, OP_CLOSE, OP_COMMIT, OP_FAIL, OP_PRUNE, OP_SET_SOM, OP_SKIP,
OP_THEN. */
+16 -1
@@ -70,6 +70,20 @@ Arguments:
Returns: < 0 if the string is a valid UTF-8 string
>= 0 otherwise; the value is the offset of the bad byte
Bad bytes can be:
. An isolated byte whose most significant bits are 0x80, because this
can only correctly appear within a UTF-8 character;
. A byte whose most significant bits are 0xc0, but whose other bits indicate
that there are more than 3 additional bytes (i.e. an RFC 2279 starting
byte, which is no longer valid under RFC 3629);
.
The returned offset may also be equal to the length of the string; this means
that one or more bytes is missing from the final UTF-8 character.
*/
int
@@ -91,7 +105,8 @@ for (p = string; length-- > 0; p++)
if (c < 128) continue;
if (c < 0xc0) return p - string;
ab = _pcre_utf8_table4[c & 0x3f]; /* Number of additional bytes */
if (length < ab || ab > 3) return p - string;
if (ab > 3) return p - string; /* Too many for RFC 3629 */
if (length < ab) return p + 1 + length - string; /* Missing bytes */
length -= ab;
/* Check top bits in the second byte */
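The change above separates two failure cases that used to share one return value: a lead byte that asks for more than 3 additional bytes, and a final character that is cut short. The following is a simplified sketch of that return convention, not the PCRE function itself. The name utf8_first_bad_offset is hypothetical, overlong-encoding checks are deliberately omitted, and reporting the offset of a bad continuation byte is a choice made for this sketch.

```c
#include <stddef.h>

/* Hypothetical, simplified validity check.
   Returns -1 if the string looks like valid UTF-8, otherwise the offset of
   the offending byte. A return value equal to length means the final
   character is missing one or more bytes. */
long utf8_first_bad_offset(const unsigned char *s, size_t length)
{
  size_t i = 0;

  while (i < length)
    {
    unsigned int c = s[i];
    size_t ab, k;                         /* ab = additional bytes expected */

    if (c < 0x80) { i++; continue; }      /* Plain ASCII */
    if (c < 0xc0) return (long)i;         /* Isolated continuation byte */

    if (c < 0xe0) ab = 1;
    else if (c < 0xf0) ab = 2;
    else if (c < 0xf8) ab = 3;
    else return (long)i;                  /* Too many for RFC 3629 */

    /* Not enough bytes left: report the end of the string. */
    if (length - i - 1 < ab) return (long)length;

    for (k = 1; k <= ab; k++)
      if ((s[i + k] & 0xc0) != 0x80) return (long)(i + k);

    i += ab + 1;
    }

  return -1;
}
```

A caller doing partial matching can compare the result with the string length to tell "need more input" apart from "input is malformed", which is the distinction PCRE_ERROR_SHORTUTF8 and PCRE_ERROR_BADUTF8 draw in the pcre_exec.c hunk.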
+69 -7
@@ -50,13 +50,16 @@ const char *error;
char *pattern;
char *subject;
unsigned char *name_table;
unsigned int option_bits;
int erroffset;
int find_all;
int crlf_is_newline;
int namecount;
int name_entry_size;
int ovector[OVECCOUNT];
int subject_length;
int rc, i;
int utf8;
/**************************************************************************
@@ -238,15 +241,56 @@ if (namecount <= 0) printf("No named substrings\n"); else
* subject is not a valid match; other possibilities must be tried. The *
* second flag restricts PCRE to one match attempt at the initial string *
* position. If this match succeeds, an alternative to the empty string *
* match has been found, and we can proceed round the loop. *
* match has been found, and we can print it and proceed round the loop, *
* advancing by the length of whatever was found. If this match does not *
* succeed, we still stay in the loop, advancing by just one character. *
* In UTF-8 mode, which can be set by (*UTF8) in the pattern, this may be *
* more than one byte. *
* *
* However, there is a complication concerned with newlines. When the *
* newline convention is such that CRLF is a valid newline, we must *
* advance by two characters rather than one. The newline convention can *
* be set in the regex by (*CR), etc.; if not, we must find the default. *
*************************************************************************/
if (!find_all)
if (!find_all) /* Check for -g */
{
pcre_free(re); /* Release the memory used for the compiled pattern */
return 0; /* Finish unless -g was given */
}
/* Before running the loop, check for UTF-8 and whether CRLF is a valid newline
sequence. First, find the options with which the regex was compiled; extract
the UTF-8 state, and mask off all but the newline options. */
(void)pcre_fullinfo(re, NULL, PCRE_INFO_OPTIONS, &option_bits);
utf8 = option_bits & PCRE_UTF8;
option_bits &= PCRE_NEWLINE_CR|PCRE_NEWLINE_LF|PCRE_NEWLINE_CRLF|
PCRE_NEWLINE_ANY|PCRE_NEWLINE_ANYCRLF;
/* If no newline options were set, find the default newline convention from the
build configuration. */
if (option_bits == 0)
{
int d;
(void)pcre_config(PCRE_CONFIG_NEWLINE, &d);
/* Note that these values are always the ASCII ones, even in
EBCDIC environments. CR = 13, NL = 10. */
option_bits = (d == 13)? PCRE_NEWLINE_CR :
(d == 10)? PCRE_NEWLINE_LF :
(d == (13<<8 | 10))? PCRE_NEWLINE_CRLF :
(d == -2)? PCRE_NEWLINE_ANYCRLF :
(d == -1)? PCRE_NEWLINE_ANY : 0;
}
/* See if CRLF is a valid newline sequence. */
crlf_is_newline =
option_bits == PCRE_NEWLINE_ANY ||
option_bits == PCRE_NEWLINE_CRLF ||
option_bits == PCRE_NEWLINE_ANYCRLF;
/* Loop for second and subsequent matches */
for (;;)
@@ -280,14 +324,32 @@ for (;;)
is zero, it just means we have found all possible matches, so the loop ends.
Otherwise, it means we have failed to find a non-empty-string match at a
point where there was a previous empty-string match. In this case, we do what
Perl does: advance the matching position by one, and continue. We do this by
setting the "end of previous match" offset, because that is picked up at the
top of the loop as the point at which to start again. */
Perl does: advance the matching position by one character, and continue. We
do this by setting the "end of previous match" offset, because that is picked
up at the top of the loop as the point at which to start again.
There are two complications: (a) When CRLF is a valid newline sequence, and
the current position is just before it, advance by an extra byte. (b)
Otherwise we must ensure that we skip an entire UTF-8 character if we are in
UTF-8 mode. */
if (rc == PCRE_ERROR_NOMATCH)
{
if (options == 0) break;
ovector[1] = start_offset + 1;
if (options == 0) break; /* All matches found */
ovector[1] = start_offset + 1; /* Advance one byte */
if (crlf_is_newline && /* If CRLF is newline & */
start_offset < subject_length - 1 && /* we are at CRLF, */
subject[start_offset] == '\r' &&
subject[start_offset + 1] == '\n')
ovector[1] += 1; /* Advance by one more. */
else if (utf8) /* Otherwise, ensure we */
{ /* advance a whole UTF-8 */
while (ovector[1] < subject_length) /* character. */
{
if ((subject[ovector[1]] & 0xc0) != 0x80) break;
ovector[1] += 1;
}
}
continue; /* Go round the loop again */
}
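The advance logic added to the demo loop can be read in isolation. The sketch below pulls it into a single function under a hypothetical name, next_start_after_empty_match; the parameters mirror the demo's variables, but the function itself is an illustration and not part of the demo program.

```c
#include <stddef.h>

/* Hypothetical helper: given the offset at which a non-empty match attempt
   failed after a previous empty match, return the offset to retry from. */
size_t next_start_after_empty_match(const char *subject,
                                    size_t subject_length,
                                    size_t start_offset,
                                    int crlf_is_newline,
                                    int utf8)
{
  size_t next = start_offset + 1;                 /* Advance one byte */

  if (crlf_is_newline &&
      start_offset + 1 < subject_length &&
      subject[start_offset] == '\r' &&
      subject[start_offset + 1] == '\n')
    {
    next += 1;                                    /* Step over the whole CRLF */
    }
  else if (utf8)
    {
    /* Skip continuation bytes so we land on a character boundary. */
    while (next < subject_length &&
           ((unsigned char)subject[next] & 0xc0) == 0x80)
      next++;
    }

  return next;
}
```

The bounds test is written as start_offset + 1 < subject_length so that it stays safe with unsigned arithmetic when the subject is empty; the demo uses signed int offsets and the equivalent start_offset < subject_length - 1.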
+1
@@ -149,6 +149,7 @@ static const int eint[] = {
REG_BADPAT, /* different names for subpatterns of the same number are not allowed */
REG_BADPAT, /* (*MARK) must have an argument */
REG_INVARG, /* this version of PCRE is not compiled with PCRE_UCP support */
REG_BADPAT, /* \c must be followed by an ASCII character */
};
/* Table of texts corresponding to POSIX error codes */