Upgraded\ bundled\ PCRE\ to\ version\ 8.11.

2026-04-25 17:08:14 +02:00 · 2010-12-24 15:03:29 +00:00
parent e540cb0c45
commit 15b7a4476a
14 changed files with 2062 additions and 1436 deletions
@@ -1,6 +1,136 @@
 ChangeLog for PCRE
 ------------------

+Version 8.11 10-Dec-2010
+------------------------
+
+1.  (*THEN) was not working properly if there were untried alternatives prior
+    to it in the current branch. For example, in ((a|b)(*THEN)(*F)|c..) it
+    backtracked to try for "b" instead of moving to the next alternative branch
+    at the same level (in this case, to look for "c"). The Perl documentation
+    is clear that when (*THEN) is backtracked onto, it goes to the "next
+    alternative in the innermost enclosing group".
+
+2.  (*COMMIT) was not overriding (*THEN), as it does in Perl. In a pattern
+    such as   (A(*COMMIT)B(*THEN)C|D)  any failure after matching A should
+    result in overall failure. Similarly, (*COMMIT) now overrides (*PRUNE) and
+    (*SKIP), (*SKIP) overrides (*PRUNE) and (*THEN), and (*PRUNE) overrides
+    (*THEN).
+
+3.  If \s appeared in a character class, it removed the VT character from
+    the class, even if it had been included by some previous item, for example
+    in [\x00-\xff\s]. (This was a bug related to the fact that VT is not part
+    of \s, but is part of the POSIX "space" class.)
+
+4.  A partial match never returns an empty string (because you can always
+    match an empty string at the end of the subject); however the checking for
+    an empty string was starting at the "start of match" point. This has been
+    changed to the "earliest inspected character" point, because the returned
+    data for a partial match starts at this character. This means that, for
+    example, /(?<=abc)def/ gives a partial match for the subject "abc"
+    (previously it gave "no match").
+
+5.  Changes have been made to the way PCRE_PARTIAL_HARD affects the matching
+    of $, \z, \Z, \b, and \B. If the match point is at the end of the string,
+    previously a full match would be given. However, setting PCRE_PARTIAL_HARD
+    has an implication that the given string is incomplete (because a partial
+    match is preferred over a full match). For this reason, these items now
+    give a partial match in this situation. [Aside: previously, the one case
+    /t\b/ matched against "cat" with PCRE_PARTIAL_HARD set did return a partial
+    match rather than a full match, which was wrong by the old rules, but is
+    now correct.]
+
+6.  There was a bug in the handling of #-introduced comments, recognized when
+    PCRE_EXTENDED is set, when PCRE_NEWLINE_ANY and PCRE_UTF8 were also set.
+    If a UTF-8 multi-byte character included the byte 0x85 (e.g. +U0445, whose
+    UTF-8 encoding is 0xd1,0x85), this was misinterpreted as a newline when
+    scanning for the end of the comment. (*Character* 0x85 is an "any" newline,
+    but *byte* 0x85 is not, in UTF-8 mode). This bug was present in several
+    places in pcre_compile().
+
+7.  Related to (6) above, when pcre_compile() was skipping #-introduced
+    comments when looking ahead for named forward references to subpatterns,
+    the only newline sequence it recognized was NL. It now handles newlines
+    according to the set newline convention.
+
+8.  SunOS4 doesn't have strerror() or strtoul(); pcregrep dealt with the
+    former, but used strtoul(), whereas pcretest avoided strtoul() but did not
+    cater for a lack of strerror(). These oversights have been fixed.
+
+9.  Added --match-limit and --recursion-limit to pcregrep.
+
+10. Added two casts needed to build with Visual Studio when NO_RECURSE is set.
+
+11. When the -o option was used, pcregrep was setting a return code of 1, even
+    when matches were found, and --line-buffered was not being honoured.
+
+12. Added an optional parentheses number to the -o and --only-matching options
+    of pcregrep.
+
+13. Imitating Perl's /g action for multiple matches is tricky when the pattern
+    can match an empty string. The code to do it in pcretest and pcredemo
+    needed fixing:
+
+    (a) When the newline convention was "crlf", pcretest got it wrong, skipping
+        only one byte after an empty string match just before CRLF (this case
+        just got forgotten; "any" and "anycrlf" were OK).
+
+    (b) The pcretest code also had a bug, causing it to loop forever in UTF-8
+        mode when an empty string match preceded an ASCII character followed by
+        a non-ASCII character. (The code for advancing by one character rather
+        than one byte was nonsense.)
+
+    (c) The pcredemo.c sample program did not have any code at all to handle
+        the cases when CRLF is a valid newline sequence.
+
+14. Neither pcre_exec() nor pcre_dfa_exec() was checking that the value given
+    as a starting offset was within the subject string. There is now a new
+    error, PCRE_ERROR_BADOFFSET, which is returned if the starting offset is
+    negative or greater than the length of the string. In order to test this,
+    pcretest is extended to allow the setting of negative starting offsets.
+
+15. In both pcre_exec() and pcre_dfa_exec() the code for checking that the
+    starting offset points to the beginning of a UTF-8 character was
+    unnecessarily clumsy. I tidied it up.
+
+16. Added PCRE_ERROR_SHORTUTF8 to make it possible to distinguish between a
+    bad UTF-8 sequence and one that is incomplete when using PCRE_PARTIAL_HARD.
+
+17. Nobody had reported that the --include_dir option, which was added in
+    release 7.7 should have been called --include-dir (hyphen, not underscore)
+    for compatibility with GNU grep. I have changed it to --include-dir, but
+    left --include_dir as an undocumented synonym, and the same for
+    --exclude-dir, though that is not available in GNU grep, at least as of
+    release 2.5.4.
+
+18. At a user's suggestion, the macros GETCHAR and friends (which pick up UTF-8
+    characters from a string of bytes) have been redefined so as not to use
+    loops, in order to improve performance in some environments. At the same
+    time, I abstracted some of the common code into auxiliary macros to save
+    repetition (this should not affect the compiled code).
+
+19. If \c was followed by a multibyte UTF-8 character, bad things happened. A
+    compile-time error is now given if \c is not followed by an ASCII
+    character, that is, a byte less than 128. (In EBCDIC mode, the code is
+    different, and any byte value is allowed.)
+
+20. Recognize (*NO_START_OPT) at the start of a pattern to set the PCRE_NO_
+    START_OPTIMIZE option, which is now allowed at compile time - but just
+    passed through to pcre_exec() or pcre_dfa_exec(). This makes it available
+    to pcregrep and other applications that have no direct access to PCRE
+    options. The new /Y option in pcretest sets this option when calling
+    pcre_compile().
+
+21. Change 18 of release 8.01 broke the use of named subpatterns for recursive
+    back references. Groups containing recursive back references were forced to
+    be atomic by that change, but in the case of named groups, the amount of
+    memory required was incorrectly computed, leading to "Failed: internal
+    error: code overflow". This has been fixed.
+
+22. Some patches to pcre_stringpiece.h, pcre_stringpiece_unittest.cc, and
+    pcretest.c, to avoid build problems in some Borland environments.
+
+
 Version 8.10 25-Jun-2010
 ------------------------

@@ -4,6 +4,7 @@ Technical Notes about PCRE
 These are very rough technical notes that record potentially useful information 
 about PCRE internals.

+
 Historical note 1
 -----------------

@@ -22,6 +23,7 @@ the one matching the longest subset of the subject string was chosen. This did
 not necessarily maximize the individual wild portions of the pattern, as is
 expected in Unix and Perl-style regular expressions.

+
 Historical note 2
 -----------------

@@ -34,6 +36,7 @@ maximizing (or, optionally, minimizing in Perl) the amount of the subject that
 matches individual wild portions of the pattern. This is an "NFA algorithm" in
 Friedl's terminology.

+
 OK, here's the real stuff
 -------------------------

@@ -44,6 +47,7 @@ in the pattern, to save on compiling time. However, because of the greater
 complexity in Perl regular expressions, I couldn't do this. In any case, a
 first pass through the pattern is helpful for other reasons. 

+
 Computing the memory requirement: how it was
 --------------------------------------------

@@ -54,6 +58,7 @@ idea was that this would turn out faster than the Henry Spencer code because
 the first pass is degenerate and the second pass can just store stuff straight
 into the vector, which it knows is big enough.

+
 Computing the memory requirement: how it is
 -------------------------------------------

@@ -75,6 +80,7 @@ runs more slowly than before (30% or more, depending on the pattern) because it
 is doing a full analysis of the pattern. My hope was that this would not be a
 big issue, and in the event, nobody has commented on it.

+
 Traditional matching function
 -----------------------------

@@ -84,6 +90,7 @@ and the way that Perl works. This is not surprising, since it is intended to be
 as compatible with Perl as possible. This is the function most users of PCRE
 will use most of the time.

+
 Supplementary matching function
 -------------------------------

@@ -119,7 +126,6 @@ quantifiers) are always just two bytes long.

 A list of the opcodes follows:

-
 Opcodes with no following data
 ------------------------------

@@ -151,12 +157,24 @@ These items are all just one byte long
  OP_EXTUNI              match an extended Unicode character 
  OP_ANYNL               match any Unicode newline sequence 
  
-  OP_ACCEPT              ) These are Perl 5.10's "backtracking    
-  OP_COMMIT              ) control verbs". If OP_ACCEPT is inside
-  OP_FAIL                ) capturing parentheses, it may be preceded 
-  OP_PRUNE               ) by one or more OP_CLOSE, followed by a 2-byte 
-  OP_SKIP                ) number, indicating which parentheses must be
-  OP_THEN                ) closed.
+  OP_ACCEPT              ) These are Perl 5.10's "backtracking control   
+  OP_COMMIT              ) verbs". If OP_ACCEPT is inside capturing
+  OP_FAIL                ) parentheses, it may be preceded by one or more
+  OP_PRUNE               ) OP_CLOSE, followed by a 2-byte number,
+  OP_SKIP                ) indicating which parentheses must be closed.
+  
+
+Backtracking control verbs with data
+------------------------------------
+ 
+OP_THEN is followed by a LINK_SIZE offset, which is the distance back to the
+start of the current branch.
+
+OP_MARK is followed by the mark name, preceded by a one-byte length, and 
+followed by a binary zero. For (*PRUNE), (*SKIP), and (*THEN) with arguments, 
+the opcodes OP_PRUNE_ARG, OP_SKIP_ARG, and OP_THEN_ARG are used. For the first 
+two, the name follows immediately; for OP_THEN_ARG, it follows the LINK_SIZE 
+offset value.
  

 Repeating single characters
@@ -419,4 +437,4 @@ at compile time, and so does not cause anything to be put into the compiled
 data.

 Philip Hazel
-October 2009
+October 2010
@@ -1,6 +1,27 @@
 News about PCRE releases
 ------------------------

+Release 8.11 10-Dec-2010
+------------------------
+
+A number of bugs in the library and in pcregrep have been fixed. As always, see
+ChangeLog for details. The following are the non-bug-fix changes:
+
+. Added --match-limit and --recursion-limit to pcregrep.
+
+. Added an optional parentheses number to the -o and --only-matching options
+  of pcregrep.
+
+. Changed the way PCRE_PARTIAL_HARD affects the matching of $, \z, \Z, \b, and
+  \B.
+
+. Added PCRE_ERROR_SHORTUTF8 to make it possible to distinguish between a
+  bad UTF-8 sequence and one that is incomplete when using PCRE_PARTIAL_HARD.
+
+. Recognize (*NO_START_OPT) at the start of a pattern to set the PCRE_NO_
+  START_OPTIMIZE option, which is now allowed at compile time
+
+
 Release 8.10 25-Jun-2010
 ------------------------

@@ -23,6 +23,7 @@
 # define PCRE_EXP_DATA_DEFN	__attribute__ ((visibility("default")))
 #endif

+
 /* Exclude these below definitions when building within PHP */
 #ifndef ZEND_API

@@ -281,7 +282,7 @@ them both to 0; an emulation function will be used. */
 #define PACKAGE_NAME "PCRE"

 /* Define to the full name and version of this package. */
-#define PACKAGE_STRING "PCRE 8.10"
+#define PACKAGE_STRING "PCRE 8.11"

 /* Define to the one symbol short name of this package. */
 #define PACKAGE_TARNAME "pcre"
@@ -290,7 +291,7 @@ them both to 0; an emulation function will be used. */
 #define PACKAGE_URL ""

 /* Define to the version of this package. */
-#define PACKAGE_VERSION "8.10"
+#define PACKAGE_VERSION "8.11"


 /* If you are compiling for a system other than a Unix-like system or
@@ -346,7 +347,7 @@ them both to 0; an emulation function will be used. */

 /* Version number of package */
 #ifndef VERSION
-#define VERSION "8.10"
+#define VERSION "8.11"
 #endif

 /* Define to empty if `const' does not conform to ANSI C. */
@@ -42,9 +42,9 @@ POSSIBILITY OF SUCH DAMAGE.
 /* The current PCRE version information. */

 #define PCRE_MAJOR          8
-#define PCRE_MINOR          10
+#define PCRE_MINOR          11
 #define PCRE_PRERELEASE     
-#define PCRE_DATE           2010-06-25
+#define PCRE_DATE           2010-12-10

 /* When an application links to a PCRE DLL in Windows, the symbols that are
 imported have to be identified as such. When building PCRE, the appropriate
@@ -96,42 +96,44 @@ extern "C" {
 #endif

 /* Options. Some are compile-time only, some are run-time only, and some are
-both, so we keep them all distinct. */
+both, so we keep them all distinct. However, almost all the bits in the options
+word are now used. In the long run, we may have to re-use some of the
+compile-time only bits for runtime options, or vice versa. */

-#define PCRE_CASELESS           0x00000001
-#define PCRE_MULTILINE          0x00000002
-#define PCRE_DOTALL             0x00000004
-#define PCRE_EXTENDED           0x00000008
-#define PCRE_ANCHORED           0x00000010
-#define PCRE_DOLLAR_ENDONLY     0x00000020
-#define PCRE_EXTRA              0x00000040
-#define PCRE_NOTBOL             0x00000080
-#define PCRE_NOTEOL             0x00000100
-#define PCRE_UNGREEDY           0x00000200
-#define PCRE_NOTEMPTY           0x00000400
-#define PCRE_UTF8               0x00000800
-#define PCRE_NO_AUTO_CAPTURE    0x00001000
-#define PCRE_NO_UTF8_CHECK      0x00002000
-#define PCRE_AUTO_CALLOUT       0x00004000
-#define PCRE_PARTIAL_SOFT       0x00008000
+#define PCRE_CASELESS           0x00000001  /* Compile */
+#define PCRE_MULTILINE          0x00000002  /* Compile */
+#define PCRE_DOTALL             0x00000004  /* Compile */
+#define PCRE_EXTENDED           0x00000008  /* Compile */
+#define PCRE_ANCHORED           0x00000010  /* Compile, exec, DFA exec */
+#define PCRE_DOLLAR_ENDONLY     0x00000020  /* Compile */
+#define PCRE_EXTRA              0x00000040  /* Compile */
+#define PCRE_NOTBOL             0x00000080  /* Exec, DFA exec */
+#define PCRE_NOTEOL             0x00000100  /* Exec, DFA exec */
+#define PCRE_UNGREEDY           0x00000200  /* Compile */
+#define PCRE_NOTEMPTY           0x00000400  /* Exec, DFA exec */
+#define PCRE_UTF8               0x00000800  /* Compile */
+#define PCRE_NO_AUTO_CAPTURE    0x00001000  /* Compile */
+#define PCRE_NO_UTF8_CHECK      0x00002000  /* Compile, exec, DFA exec */
+#define PCRE_AUTO_CALLOUT       0x00004000  /* Compile */
+#define PCRE_PARTIAL_SOFT       0x00008000  /* Exec, DFA exec */
 #define PCRE_PARTIAL            0x00008000  /* Backwards compatible synonym */
-#define PCRE_DFA_SHORTEST       0x00010000
-#define PCRE_DFA_RESTART        0x00020000
-#define PCRE_FIRSTLINE          0x00040000
-#define PCRE_DUPNAMES           0x00080000
-#define PCRE_NEWLINE_CR         0x00100000
-#define PCRE_NEWLINE_LF         0x00200000
-#define PCRE_NEWLINE_CRLF       0x00300000
-#define PCRE_NEWLINE_ANY        0x00400000
-#define PCRE_NEWLINE_ANYCRLF    0x00500000
-#define PCRE_BSR_ANYCRLF        0x00800000
-#define PCRE_BSR_UNICODE        0x01000000
-#define PCRE_JAVASCRIPT_COMPAT  0x02000000
-#define PCRE_NO_START_OPTIMIZE  0x04000000
-#define PCRE_NO_START_OPTIMISE  0x04000000
-#define PCRE_PARTIAL_HARD       0x08000000
-#define PCRE_NOTEMPTY_ATSTART   0x10000000
-#define PCRE_UCP                0x20000000
+#define PCRE_DFA_SHORTEST       0x00010000  /* DFA exec */
+#define PCRE_DFA_RESTART        0x00020000  /* DFA exec */
+#define PCRE_FIRSTLINE          0x00040000  /* Compile */
+#define PCRE_DUPNAMES           0x00080000  /* Compile */
+#define PCRE_NEWLINE_CR         0x00100000  /* Compile, exec, DFA exec */
+#define PCRE_NEWLINE_LF         0x00200000  /* Compile, exec, DFA exec */
+#define PCRE_NEWLINE_CRLF       0x00300000  /* Compile, exec, DFA exec */
+#define PCRE_NEWLINE_ANY        0x00400000  /* Compile, exec, DFA exec */
+#define PCRE_NEWLINE_ANYCRLF    0x00500000  /* Compile, exec, DFA exec */
+#define PCRE_BSR_ANYCRLF        0x00800000  /* Compile, exec, DFA exec */
+#define PCRE_BSR_UNICODE        0x01000000  /* Compile, exec, DFA exec */
+#define PCRE_JAVASCRIPT_COMPAT  0x02000000  /* Compile */
+#define PCRE_NO_START_OPTIMIZE  0x04000000  /* Compile, exec, DFA exec */
+#define PCRE_NO_START_OPTIMISE  0x04000000  /* Synonym */
+#define PCRE_PARTIAL_HARD       0x08000000  /* Exec, DFA exec */
+#define PCRE_NOTEMPTY_ATSTART   0x10000000  /* Exec, DFA exec */
+#define PCRE_UCP                0x20000000  /* Compile */

 /* Exec-time and get/set-time error codes */

@@ -159,6 +161,8 @@ both, so we keep them all distinct. */
 #define PCRE_ERROR_RECURSIONLIMIT (-21)
 #define PCRE_ERROR_NULLWSLIMIT    (-22)  /* No longer actually used */
 #define PCRE_ERROR_BADNEWLINE     (-23)
+#define PCRE_ERROR_BADOFFSET      (-24)
+#define PCRE_ERROR_SHORTUTF8      (-25)

 /* Request types for pcre_fullinfo() */

@@ -406,6 +406,7 @@ static const char error_texts[] =
  "different names for subpatterns of the same number are not allowed\0"
  "(*MARK) must have an argument\0"
  "this version of PCRE is not compiled with PCRE_UCP support\0"
+  "\\c must be followed by an ASCII character\0"
  ;

 /* Table to identify digits and hex digits. This is used when compiling
@@ -839,7 +840,8 @@ else
    break;

    /* For \c, a following letter is upper-cased; then the 0x40 bit is flipped.
-    This coding is ASCII-specific, but then the whole concept of \cx is
+    An error is given if the byte following \c is not an ASCII character. This
+    coding is ASCII-specific, but then the whole concept of \cx is
    ASCII-specific. (However, an EBCDIC equivalent has now been added.) */

    case CHAR_c:
@@ -849,11 +851,15 @@ else
      *errorcodeptr = ERR2;
      break;
      }
-
-#ifndef EBCDIC  /* ASCII/UTF-8 coding */
+#ifndef EBCDIC    /* ASCII/UTF-8 coding */
+    if (c > 127)  /* Excludes all non-ASCII in either mode */
+      {
+      *errorcodeptr = ERR68;
+      break;
+      }
    if (c >= CHAR_a && c <= CHAR_z) c -= 32;
    c ^= 0x40;
-#else           /* EBCDIC coding */
+#else             /* EBCDIC coding */
    if (c >= CHAR_a && c <= CHAR_z) c += 64;
    c ^= 0xC0;
 #endif
@@ -1097,10 +1103,21 @@ top-level call starts at the beginning of the pattern. All other calls must
 start at a parenthesis. It scans along a pattern's text looking for capturing
 subpatterns, and counting them. If it finds a named pattern that matches the
 name it is given, it returns its number. Alternatively, if the name is NULL, it
-returns when it reaches a given numbered subpattern. We know that if (?P< is
-encountered, the name will be terminated by '>' because that is checked in the
-first pass. Recursion is used to keep track of subpatterns that reset the
-capturing group numbers - the (?| feature.
+returns when it reaches a given numbered subpattern. Recursion is used to keep
+track of subpatterns that reset the capturing group numbers - the (?| feature.
+
+This function was originally called only from the second pass, in which we know
+that if (?< or (?' or (?P< is encountered, the name will be correctly
+terminated because that is checked in the first pass. There is now one call to
+this function in the first pass, to check for a recursive back reference by
+name (so that we can make the whole group atomic). In this case, we need check
+only up to the current position in the pattern, and that is still OK because
+and previous occurrences will have been checked. To make this work, the test
+for "end of pattern" is a check against cd->end_pattern in the main loop,
+instead of looking for a binary zero. This means that the special first-pass
+call can adjust cd->end_pattern temporarily. (Checks for binary zero while
+processing items within the loop are OK, because afterwards the main loop will
+terminate.)

 Arguments:
  ptrptr       address of the current character pointer (updated)
@@ -1108,6 +1125,7 @@ Arguments:
  name         name to seek, or NULL if seeking a numbered subpattern
  lorn         name length, or subpattern number if name is NULL
  xmode        TRUE if we are in /x mode
+  utf8         TRUE if we are in UTF-8 mode
  count        pointer to the current capturing subpattern number (updated)

 Returns:       the number of the named subpattern, or -1 if not found
@@ -1115,7 +1133,7 @@ Returns:       the number of the named subpattern, or -1 if not found

 static int
 find_parens_sub(uschar **ptrptr, compile_data *cd, const uschar *name, int lorn,
-  BOOL xmode, int *count)
+  BOOL xmode, BOOL utf8, int *count)
 {
 uschar *ptr = *ptrptr;
 int start_count = *count;
@@ -1200,9 +1218,11 @@ if (ptr[0] == CHAR_LEFT_PARENTHESIS)
  }

 /* Past any initial parenthesis handling, scan for parentheses or vertical
-bars. */
+bars. Stop if we get to cd->end_pattern. Note that this is important for the
+first-pass call when this value is temporarily adjusted to stop at the current
+position. So DO NOT change this to a test for binary zero. */

-for (; *ptr != 0; ptr++)
+for (; ptr < cd->end_pattern; ptr++)
  {
  /* Skip over backslashed characters and also entire \Q...\E */

@@ -1276,7 +1296,15 @@ for (; *ptr != 0; ptr++)

  if (xmode && *ptr == CHAR_NUMBER_SIGN)
    {
-    while (*(++ptr) != 0 && *ptr != CHAR_NL) {};
+    ptr++;
+    while (*ptr != 0)
+      {
+      if (IS_NEWLINE(ptr)) { ptr += cd->nllen - 1; break; }
+      ptr++;
+#ifdef SUPPORT_UTF8
+      if (utf8) while ((*ptr & 0xc0) == 0x80) ptr++;
+#endif
+      }
    if (*ptr == 0) goto FAIL_EXIT;
    continue;
    }
@@ -1285,7 +1313,7 @@ for (; *ptr != 0; ptr++)

  if (*ptr == CHAR_LEFT_PARENTHESIS)
    {
-    int rc = find_parens_sub(&ptr, cd, name, lorn, xmode, count);
+    int rc = find_parens_sub(&ptr, cd, name, lorn, xmode, utf8, count);
    if (rc > 0) return rc;
    if (*ptr == 0) goto FAIL_EXIT;
    }
@@ -1331,12 +1359,14 @@ Arguments:
  name         name to seek, or NULL if seeking a numbered subpattern
  lorn         name length, or subpattern number if name is NULL
  xmode        TRUE if we are in /x mode
+  utf8         TRUE if we are in UTF-8 mode

 Returns:       the number of the found subpattern, or -1 if not found
 */

 static int
-find_parens(compile_data *cd, const uschar *name, int lorn, BOOL xmode)
+find_parens(compile_data *cd, const uschar *name, int lorn, BOOL xmode,
+  BOOL utf8)
 {
 uschar *ptr = (uschar *)cd->start_pattern;
 int count = 0;
@@ -1349,7 +1379,7 @@ matching closing parens. That is why we have to have a loop. */

 for (;;)
  {
-  rc = find_parens_sub(&ptr, cd, name, lorn, xmode, &count);
+  rc = find_parens_sub(&ptr, cd, name, lorn, xmode, utf8, &count);
  if (rc > 0 || *ptr++ == 0) break;
  }

@@ -1722,9 +1752,12 @@ for (;;)
      case OP_MARK:
      case OP_PRUNE_ARG:
      case OP_SKIP_ARG:
-      case OP_THEN_ARG:
      code += code[1];
      break;
+
+      case OP_THEN_ARG:
+      code += code[1+LINK_SIZE];
+      break;
      }

    /* Add in the fixed length from the table */
@@ -1825,9 +1858,12 @@ for (;;)
      case OP_MARK:
      case OP_PRUNE_ARG:
      case OP_SKIP_ARG:
-      case OP_THEN_ARG:
      code += code[1];
      break;
+
+      case OP_THEN_ARG:
+      code += code[1+LINK_SIZE];
+      break;
      }

    /* Add in the fixed length from the table */
@@ -2103,10 +2139,13 @@ for (code = first_significant_code(code + _pcre_OP_lengths[*code], NULL, 0, TRUE
    case OP_MARK:
    case OP_PRUNE_ARG:
    case OP_SKIP_ARG:
-    case OP_THEN_ARG:
    code += code[1];
    break;

+    case OP_THEN_ARG:
+    code += code[1+LINK_SIZE];
+    break;
+
    /* None of the remaining opcodes are required to match a character. */

    default:
@@ -2504,8 +2543,15 @@ if ((options & PCRE_EXTENDED) != 0)
    while ((cd->ctypes[*ptr] & ctype_space) != 0) ptr++;
    if (*ptr == CHAR_NUMBER_SIGN)
      {
-      while (*(++ptr) != 0)
+      ptr++;
+      while (*ptr != 0)
+        {
        if (IS_NEWLINE(ptr)) { ptr += cd->nllen; break; }
+        ptr++;
+#ifdef SUPPORT_UTF8
+        if (utf8) while ((*ptr & 0xc0) == 0x80) ptr++;
+#endif
+        }
      }
    else break;
    }
@@ -2541,8 +2587,15 @@ if ((options & PCRE_EXTENDED) != 0)
    while ((cd->ctypes[*ptr] & ctype_space) != 0) ptr++;
    if (*ptr == CHAR_NUMBER_SIGN)
      {
-      while (*(++ptr) != 0)
+      ptr++;
+      while (*ptr != 0)
+        {
        if (IS_NEWLINE(ptr)) { ptr += cd->nllen; break; }
+        ptr++;
+#ifdef SUPPORT_UTF8
+        if (utf8) while ((*ptr & 0xc0) == 0x80) ptr++;
+#endif
+        }
      }
    else break;
    }
@@ -3115,9 +3168,14 @@ for (;; ptr++)
    if ((cd->ctypes[c] & ctype_space) != 0) continue;
    if (c == CHAR_NUMBER_SIGN)
      {
-      while (*(++ptr) != 0)
+      ptr++;
+      while (*ptr != 0)
        {
        if (IS_NEWLINE(ptr)) { ptr += cd->nllen - 1; break; }
+        ptr++;
+#ifdef SUPPORT_UTF8
+        if (utf8) while ((*ptr & 0xc0) == 0x80) ptr++;
+#endif
        }
      if (*ptr != 0) continue;

@@ -3492,9 +3550,14 @@ for (;; ptr++)
            for (c = 0; c < 32; c++) classbits[c] |= ~cbits[c+cbit_word];
            continue;

+            /* Perl 5.004 onwards omits VT from \s, but we must preserve it
+            if it was previously set by something earlier in the character
+            class. */
+
            case ESC_s:
-            for (c = 0; c < 32; c++) classbits[c] |= cbits[c+cbit_space];
-            classbits[1] &= ~0x08;   /* Perl 5.004 onwards omits VT from \s */
+            classbits[0] |= cbits[cbit_space];
+            classbits[1] |= cbits[cbit_space+1] & ~0x08;
+            for (c = 2; c < 32; c++) classbits[c] |= cbits[c+cbit_space];
            continue;

            case ESC_S:
@@ -4806,7 +4869,12 @@ for (;; ptr++)
              *errorcodeptr = ERR66;
              goto FAILED;
              }
-            *code++ = verbs[i].op;
+            *code = verbs[i].op;
+            if (*code++ == OP_THEN)
+              {
+              PUT(code, 0, code - bcptr->current_branch - 1);
+              code += LINK_SIZE;
+              }
            }

          else
@@ -4816,7 +4884,12 @@ for (;; ptr++)
              *errorcodeptr = ERR59;
              goto FAILED;
              }
-            *code++ = verbs[i].op_arg;
+            *code = verbs[i].op_arg;
+            if (*code++ == OP_THEN_ARG)
+              {
+              PUT(code, 0, code - bcptr->current_branch - 1);
+              code += LINK_SIZE;
+              }
            *code++ = arglen;
            memcpy(code, arg, arglen);
            code += arglen;
@@ -5010,7 +5083,7 @@ for (;; ptr++)
        /* Search the pattern for a forward reference */

        else if ((i = find_parens(cd, name, namelen,
-                        (options & PCRE_EXTENDED) != 0)) > 0)
+                        (options & PCRE_EXTENDED) != 0, utf8)) > 0)
          {
          PUT2(code, 2+LINK_SIZE, i);
          code[1+LINK_SIZE]++;
@@ -5311,11 +5384,17 @@ for (;; ptr++)
        while ((cd->ctypes[*ptr] & ctype_word) != 0) ptr++;
        namelen = (int)(ptr - name);

-        /* In the pre-compile phase, do a syntax check and set a dummy
-        reference number. */
+        /* In the pre-compile phase, do a syntax check. We used to just set
+        a dummy reference number, because it was not used in the first pass.
+        However, with the change of recursive back references to be atomic,
+        we have to look for the number so that this state can be identified, as
+        otherwise the incorrect length is computed. If it's not a backwards
+        reference, the dummy number will do. */

        if (lengthptr != NULL)
          {
+          const uschar *temp;
+
          if (namelen == 0)
            {
            *errorcodeptr = ERR62;
@@ -5331,7 +5410,22 @@ for (;; ptr++)
            *errorcodeptr = ERR48;
            goto FAILED;
            }
-          recno = 0;
+
+          /* The name table does not exist in the first pass, so we cannot
+          do a simple search as in the code below. Instead, we have to scan the
+          pattern to find the number. It is important that we scan it only as
+          far as we have got because the syntax of named subpatterns has not
+          been checked for the rest of the pattern, and find_parens() assumes
+          correct syntax. In any case, it's a waste of resources to scan
+          further. We stop the scan at the current point by temporarily
+          adjusting the value of cd->endpattern. */
+
+          temp = cd->end_pattern;
+          cd->end_pattern = ptr;
+          recno = find_parens(cd, name, namelen,
+            (options & PCRE_EXTENDED) != 0, utf8);
+          cd->end_pattern = temp;
+          if (recno < 0) recno = 0;    /* Forward ref; set dummy number */
          }

        /* In the real compile, seek the name in the table. We check the name
@@ -5356,7 +5450,7 @@ for (;; ptr++)
            }
          else if ((recno =                /* Forward back reference */
                    find_parens(cd, name, namelen,
-                      (options & PCRE_EXTENDED) != 0)) <= 0)
+                      (options & PCRE_EXTENDED) != 0, utf8)) <= 0)
            {
            *errorcodeptr = ERR15;
            goto FAILED;
@@ -5467,7 +5561,7 @@ for (;; ptr++)
            if (called == NULL)
              {
              if (find_parens(cd, NULL, recno,
-                    (options & PCRE_EXTENDED) != 0) < 0)
+                    (options & PCRE_EXTENDED) != 0, utf8) < 0)
                {
                *errorcodeptr = ERR15;
                goto FAILED;
@@ -6797,6 +6891,8 @@ while (ptr[skipatstart] == CHAR_LEFT_PARENTHESIS &&
    { skipatstart += 7; options |= PCRE_UTF8; continue; }
  else if (strncmp((char *)(ptr+skipatstart+2), STRING_UCP_RIGHTPAR, 4) == 0)
    { skipatstart += 6; options |= PCRE_UCP; continue; }
+  else if (strncmp((char *)(ptr+skipatstart+2), STRING_NO_START_OPT_RIGHTPAR, 13) == 0)
+    { skipatstart += 15; options |= PCRE_NO_START_OPTIMIZE; continue; }

  if (strncmp((char *)(ptr+skipatstart+2), STRING_CR_RIGHTPAR, 3) == 0)
    { skipatstart += 5; newnl = PCRE_NEWLINE_CR; }
@@ -292,7 +292,7 @@ argument of match(), which never changes. */

 #define RMATCH(ra,rb,rc,rd,re,rf,rg,rw)\
  {\
-  heapframe *newframe = (pcre_stack_malloc)(sizeof(heapframe));\
+  heapframe *newframe = (heapframe *)(pcre_stack_malloc)(sizeof(heapframe));\
  if (newframe == NULL) RRETURN(PCRE_ERROR_NOMEMORY);\
  frame->Xwhere = rw; \
  newframe->Xeptr = ra;\
@@ -420,17 +420,18 @@ immediately. The second one is used when we already know we are past the end of
 the subject. */

 #define CHECK_PARTIAL()\
-  if (md->partial != 0 && eptr >= md->end_subject && eptr > mstart)\
-    {\
-    md->hitend = TRUE;\
-    if (md->partial > 1) MRRETURN(PCRE_ERROR_PARTIAL);\
+  if (md->partial != 0 && eptr >= md->end_subject && \
+      eptr > md->start_used_ptr) \
+    { \
+    md->hitend = TRUE; \
+    if (md->partial > 1) MRRETURN(PCRE_ERROR_PARTIAL); \
    }

 #define SCHECK_PARTIAL()\
-  if (md->partial != 0 && eptr > mstart)\
-    {\
-    md->hitend = TRUE;\
-    if (md->partial > 1) MRRETURN(PCRE_ERROR_PARTIAL);\
+  if (md->partial != 0 && eptr > md->start_used_ptr) \
+    { \
+    md->hitend = TRUE; \
+    if (md->partial > 1) MRRETURN(PCRE_ERROR_PARTIAL); \
    }


@@ -486,7 +487,7 @@ heap storage. Set up the top-level frame here; others are obtained from the
 heap whenever RMATCH() does a "recursion". See the macro definitions above. */

 #ifdef NO_RECURSE
-heapframe *frame = (pcre_stack_malloc)(sizeof(heapframe));
+heapframe *frame = (heapframe *)(pcre_stack_malloc)(sizeof(heapframe));
 if (frame == NULL) RRETURN(PCRE_ERROR_NOMEMORY);
 frame->Xprevframe = NULL;            /* Marks the top level */

@@ -708,36 +709,47 @@ for (;;)
    case OP_FAIL:
    MRRETURN(MATCH_NOMATCH);

+    /* COMMIT overrides PRUNE, SKIP, and THEN */
+
    case OP_COMMIT:
    RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md,
      ims, eptrb, flags, RM52);
-    if (rrc != MATCH_NOMATCH) RRETURN(rrc);
+    if (rrc != MATCH_NOMATCH && rrc != MATCH_PRUNE &&
+        rrc != MATCH_SKIP && rrc != MATCH_SKIP_ARG &&
+        rrc != MATCH_THEN)
+      RRETURN(rrc);
    MRRETURN(MATCH_COMMIT);

+    /* PRUNE overrides THEN */
+
    case OP_PRUNE:
    RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md,
      ims, eptrb, flags, RM51);
-    if (rrc != MATCH_NOMATCH) RRETURN(rrc);
+    if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN) RRETURN(rrc);
    MRRETURN(MATCH_PRUNE);

    case OP_PRUNE_ARG:
    RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode] + ecode[1], offset_top, md,
      ims, eptrb, flags, RM56);
-    if (rrc != MATCH_NOMATCH) RRETURN(rrc);
+    if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN) RRETURN(rrc);
    md->mark = ecode + 2;
    RRETURN(MATCH_PRUNE);

+    /* SKIP overrides PRUNE and THEN */
+
    case OP_SKIP:
    RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md,
      ims, eptrb, flags, RM53);
-    if (rrc != MATCH_NOMATCH) RRETURN(rrc);
+    if (rrc != MATCH_NOMATCH && rrc != MATCH_PRUNE && rrc != MATCH_THEN)
+      RRETURN(rrc);
    md->start_match_ptr = eptr;   /* Pass back current position */
    MRRETURN(MATCH_SKIP);

    case OP_SKIP_ARG:
    RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode] + ecode[1], offset_top, md,
      ims, eptrb, flags, RM57);
-    if (rrc != MATCH_NOMATCH) RRETURN(rrc);
+    if (rrc != MATCH_NOMATCH && rrc != MATCH_PRUNE && rrc != MATCH_THEN)
+      RRETURN(rrc);

    /* Pass back the current skip name by overloading md->start_match_ptr and
    returning the special MATCH_SKIP_ARG return code. This will either be
@@ -747,17 +759,24 @@ for (;;)
    md->start_match_ptr = ecode + 2;
    RRETURN(MATCH_SKIP_ARG);

+    /* For THEN (and THEN_ARG) we pass back the address of the bracket or
+    the alt that is at the start of the current branch. This makes it possible
+    to skip back past alternatives that precede the THEN within the current
+    branch. */
+
    case OP_THEN:
    RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md,
      ims, eptrb, flags, RM54);
    if (rrc != MATCH_NOMATCH) RRETURN(rrc);
+    md->start_match_ptr = ecode - GET(ecode, 1);
    MRRETURN(MATCH_THEN);

    case OP_THEN_ARG:
-    RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode] + ecode[1], offset_top, md,
-      ims, eptrb, flags, RM58);
+    RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode] + ecode[1+LINK_SIZE],
+      offset_top, md, ims, eptrb, flags, RM58);
    if (rrc != MATCH_NOMATCH) RRETURN(rrc);
-    md->mark = ecode + 2;
+    md->start_match_ptr = ecode - GET(ecode, 1);
+    md->mark = ecode + LINK_SIZE + 2;
    RRETURN(MATCH_THEN);

    /* Handle a capturing bracket. If there is space in the offset vector, save
@@ -802,7 +821,9 @@ for (;;)
        {
        RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md,
          ims, eptrb, flags, RM1);
-        if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN) RRETURN(rrc);
+        if (rrc != MATCH_NOMATCH &&
+            (rrc != MATCH_THEN || md->start_match_ptr != ecode))
+          RRETURN(rrc);
        md->capture_last = save_capture_last;
        ecode += GET(ecode, 1);
        }
@@ -863,7 +884,9 @@ for (;;)

      RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md, ims,
        eptrb, flags, RM2);
-      if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN) RRETURN(rrc);
+      if (rrc != MATCH_NOMATCH &&
+          (rrc != MATCH_THEN || md->start_match_ptr != ecode))
+        RRETURN(rrc);
      ecode += GET(ecode, 1);
      }
    /* Control never reaches here. */
@@ -1064,7 +1087,8 @@ for (;;)
        ecode += 1 + LINK_SIZE + GET(ecode, LINK_SIZE + 2);
        while (*ecode == OP_ALT) ecode += GET(ecode, 1);
        }
-      else if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN)
+      else if (rrc != MATCH_NOMATCH &&
+              (rrc != MATCH_THEN || md->start_match_ptr != ecode))
        {
        RRETURN(rrc);         /* Need braces because of following else */
        }
@@ -1192,7 +1216,9 @@ for (;;)
        mstart = md->start_match_ptr;   /* In case \K reset it */
        break;
        }
-      if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN) RRETURN(rrc);
+      if (rrc != MATCH_NOMATCH &&
+          (rrc != MATCH_THEN || md->start_match_ptr != ecode))
+        RRETURN(rrc);
      ecode += GET(ecode, 1);
      }
    while (*ecode == OP_ALT);
@@ -1226,7 +1252,9 @@ for (;;)
        do ecode += GET(ecode,1); while (*ecode == OP_ALT);
        break;
        }
-      if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN) RRETURN(rrc);
+      if (rrc != MATCH_NOMATCH &&
+          (rrc != MATCH_THEN || md->start_match_ptr != ecode))
+        RRETURN(rrc);
      ecode += GET(ecode,1);
      }
    while (*ecode == OP_ALT);
@@ -1363,7 +1391,8 @@ for (;;)
            (pcre_free)(new_recursive.offset_save);
          MRRETURN(MATCH_MATCH);
          }
-        else if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN)
+        else if (rrc != MATCH_NOMATCH &&
+                (rrc != MATCH_THEN || md->start_match_ptr != ecode))
          {
          DPRINTF(("Recursion gave error %d\n", rrc));
          if (new_recursive.offset_save != stacksave)
@@ -1406,7 +1435,9 @@ for (;;)
        mstart = md->start_match_ptr;
        break;
        }
-      if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN) RRETURN(rrc);
+      if (rrc != MATCH_NOMATCH &&
+          (rrc != MATCH_THEN || md->start_match_ptr != ecode))
+        RRETURN(rrc);
      ecode += GET(ecode,1);
      }
    while (*ecode == OP_ALT);
@@ -1672,37 +1703,40 @@ for (;;)
      if (eptr < md->end_subject)
        { if (!IS_NEWLINE(eptr)) MRRETURN(MATCH_NOMATCH); }
      else
-        { if (md->noteol) MRRETURN(MATCH_NOMATCH); }
+        {
+        if (md->noteol) MRRETURN(MATCH_NOMATCH);
+        SCHECK_PARTIAL();
+        }
      ecode++;
      break;
      }
-    else
+    else  /* Not multiline */
      {
      if (md->noteol) MRRETURN(MATCH_NOMATCH);
-      if (!md->endonly)
-        {
-        if (eptr != md->end_subject &&
-            (!IS_NEWLINE(eptr) || eptr != md->end_subject - md->nllen))
-          MRRETURN(MATCH_NOMATCH);
-        ecode++;
-        break;
-        }
+      if (!md->endonly) goto ASSERT_NL_OR_EOS;
      }
+
    /* ... else fall through for endonly */

    /* End of subject assertion (\z) */

    case OP_EOD:
    if (eptr < md->end_subject) MRRETURN(MATCH_NOMATCH);
+    SCHECK_PARTIAL();
    ecode++;
    break;

    /* End of subject or ending \n assertion (\Z) */

    case OP_EODN:
-    if (eptr != md->end_subject &&
+    ASSERT_NL_OR_EOS:
+    if (eptr < md->end_subject &&
        (!IS_NEWLINE(eptr) || eptr != md->end_subject - md->nllen))
      MRRETURN(MATCH_NOMATCH);
+
+    /* Either at end of string or \n before end. */
+
+    SCHECK_PARTIAL();
    ecode++;
    break;

@@ -5598,6 +5632,7 @@ if ((options & ~PUBLIC_EXEC_OPTIONS) != 0) return PCRE_ERROR_BADOPTION;
 if (re == NULL || subject == NULL ||
   (offsets == NULL && offsetcount > 0)) return PCRE_ERROR_NULL;
 if (offsetcount < 0) return PCRE_ERROR_BADCOUNT;
+if (start_offset < 0 || start_offset > length) return PCRE_ERROR_BADOFFSET;

 /* This information is for finding all the numbers associated with a given
 name, for condition testing. */
@@ -5764,16 +5799,14 @@ back the character offset. */
 #ifdef SUPPORT_UTF8
 if (utf8 && (options & PCRE_NO_UTF8_CHECK) == 0)
  {
-  if (_pcre_valid_utf8((USPTR)subject, length) >= 0)
-    return PCRE_ERROR_BADUTF8;
+  int tb;
+  if ((tb = _pcre_valid_utf8((USPTR)subject, length)) >= 0)
+    return (tb == length && md->partial > 1)?
+      PCRE_ERROR_SHORTUTF8 : PCRE_ERROR_BADUTF8;
  if (start_offset > 0 && start_offset < length)
    {
-    int tb = ((USPTR)subject)[start_offset];
-    if (tb > 127)
-      {
-      tb &= 0xc0;
-      if (tb != 0 && tb != 0xc0) return PCRE_ERROR_BADUTF8_OFFSET;
-      }
+    tb = ((USPTR)subject)[start_offset] & 0xc0;
+    if (tb == 0x80) return PCRE_ERROR_BADUTF8_OFFSET;
    }
  }
 #endif
@@ -5901,9 +5934,10 @@ for(;;)
  /* There are some optimizations that avoid running the match if a known
  starting point is not found, or if a known later character is not present.
  However, there is an option that disables these, for testing and for ensuring
-  that all callouts do actually occur. */
+  that all callouts do actually occur. The option can be set in the regex by
+  (*NO_START_OPT) or passed in match-time options. */

-  if ((options & PCRE_NO_START_OPTIMIZE) == 0)
+  if (((options | re->options) & PCRE_NO_START_OPTIMIZE) == 0)
    {
    /* Advance to a unique first byte if there is one. */

@@ -192,9 +192,7 @@ stdint.h is available, include it; it may define INT64_MAX. Systems that do not
 have stdint.h (e.g. Solaris) may have inttypes.h. The macro int64_t may be set
 by "configure". */

-#ifdef PHP_WIN32
-#include "win32/php_stdint.h"
-#elif HAVE_STDINT_H
+#if HAVE_STDINT_H
 #include <stdint.h>
 #elif HAVE_INTTYPES_H
 #include <inttypes.h>
@@ -410,9 +408,10 @@ capturing parenthesis numbers in back references. */

 /* When UTF-8 encoding is being used, a character is no longer just a single
 byte. The macros for character handling generate simple sequences when used in
-byte-mode, and more complicated ones for UTF-8 characters. BACKCHAR should
-never be called in byte mode. To make sure it can never even appear when UTF-8
-support is omitted, we don't even define it. */
+byte-mode, and more complicated ones for UTF-8 characters. GETCHARLENTEST is
+not used when UTF-8 is not supported, so it is not defined, and BACKCHAR should
+never be called in byte mode. To make sure they can never even appear when
+UTF-8 support is omitted, we don't even define them. */

 #ifndef SUPPORT_UTF8
 #define GETCHAR(c, eptr) c = *eptr;
@@ -420,43 +419,83 @@ support is omitted, we don't even define it. */
 #define GETCHARINC(c, eptr) c = *eptr++;
 #define GETCHARINCTEST(c, eptr) c = *eptr++;
 #define GETCHARLEN(c, eptr, len) c = *eptr;
+/* #define GETCHARLENTEST(c, eptr, len) */
 /* #define BACKCHAR(eptr) */

 #else   /* SUPPORT_UTF8 */

+/* These macros were originally written in the form of loops that used data
+from the tables whose names start with _pcre_utf8_table. They were rewritten by
+a user so as not to use loops, because in some environments this gives a
+significant performance advantage, and it seems never to do any harm. */
+
+/* Base macro to pick up the remaining bytes of a UTF-8 character, not
+advancing the pointer. */
+
+#define GETUTF8(c, eptr) \
+    { \
+    if ((c & 0x20) == 0) \
+      c = ((c & 0x1f) << 6) | (eptr[1] & 0x3f); \
+    else if ((c & 0x10) == 0) \
+      c = ((c & 0x0f) << 12) | ((eptr[1] & 0x3f) << 6) | (eptr[2] & 0x3f); \
+    else if ((c & 0x08) == 0) \
+      c = ((c & 0x07) << 18) | ((eptr[1] & 0x3f) << 12) | \
+      ((eptr[2] & 0x3f) << 6) | (eptr[3] & 0x3f); \
+    else if ((c & 0x04) == 0) \
+      c = ((c & 0x03) << 24) | ((eptr[1] & 0x3f) << 18) | \
+          ((eptr[2] & 0x3f) << 12) | ((eptr[3] & 0x3f) << 6) | \
+          (eptr[4] & 0x3f); \
+    else \
+      c = ((c & 0x01) << 30) | ((eptr[1] & 0x3f) << 24) | \
+          ((eptr[2] & 0x3f) << 18) | ((eptr[3] & 0x3f) << 12) | \
+          ((eptr[4] & 0x3f) << 6) | (eptr[5] & 0x3f); \
+    }
+
 /* Get the next UTF-8 character, not advancing the pointer. This is called when
 we know we are in UTF-8 mode. */

 #define GETCHAR(c, eptr) \
  c = *eptr; \
-  if (c >= 0xc0) \
-    { \
-    int gcii; \
-    int gcaa = _pcre_utf8_table4[c & 0x3f];  /* Number of additional bytes */ \
-    int gcss = 6*gcaa; \
-    c = (c & _pcre_utf8_table3[gcaa]) << gcss; \
-    for (gcii = 1; gcii <= gcaa; gcii++) \
-      { \
-      gcss -= 6; \
-      c |= (eptr[gcii] & 0x3f) << gcss; \
-      } \
-    }
+  if (c >= 0xc0) GETUTF8(c, eptr);

 /* Get the next UTF-8 character, testing for UTF-8 mode, and not advancing the
 pointer. */

 #define GETCHARTEST(c, eptr) \
  c = *eptr; \
-  if (utf8 && c >= 0xc0) \
+  if (utf8 && c >= 0xc0) GETUTF8(c, eptr);
+
+/* Base macro to pick up the remaining bytes of a UTF-8 character, advancing
+the pointer. */
+
+#define GETUTF8INC(c, eptr) \
    { \
-    int gcii; \
-    int gcaa = _pcre_utf8_table4[c & 0x3f];  /* Number of additional bytes */ \
-    int gcss = 6*gcaa; \
-    c = (c & _pcre_utf8_table3[gcaa]) << gcss; \
-    for (gcii = 1; gcii <= gcaa; gcii++) \
+    if ((c & 0x20) == 0) \
+      c = ((c & 0x1f) << 6) | (*eptr++ & 0x3f); \
+    else if ((c & 0x10) == 0) \
      { \
-      gcss -= 6; \
-      c |= (eptr[gcii] & 0x3f) << gcss; \
+      c = ((c & 0x0f) << 12) | ((*eptr & 0x3f) << 6) | (eptr[1] & 0x3f); \
+      eptr += 2; \
+      } \
+    else if ((c & 0x08) == 0) \
+      { \
+      c = ((c & 0x07) << 18) | ((*eptr & 0x3f) << 12) | \
+          ((eptr[1] & 0x3f) << 6) | (eptr[2] & 0x3f); \
+      eptr += 3; \
+      } \
+    else if ((c & 0x04) == 0) \
+      { \
+      c = ((c & 0x03) << 24) | ((*eptr & 0x3f) << 18) | \
+          ((eptr[1] & 0x3f) << 12) | ((eptr[2] & 0x3f) << 6) | \
+          (eptr[3] & 0x3f); \
+      eptr += 4; \
+      } \
+    else \
+      { \
+      c = ((c & 0x01) << 30) | ((*eptr & 0x3f) << 24) | \
+          ((eptr[1] & 0x3f) << 18) | ((eptr[2] & 0x3f) << 12) | \
+          ((eptr[3] & 0x3f) << 6) | (eptr[4] & 0x3f); \
+      eptr += 5; \
      } \
    }

@@ -465,32 +504,49 @@ know we are in UTF-8 mode. */

 #define GETCHARINC(c, eptr) \
  c = *eptr++; \
-  if (c >= 0xc0) \
-    { \
-    int gcaa = _pcre_utf8_table4[c & 0x3f];  /* Number of additional bytes */ \
-    int gcss = 6*gcaa; \
-    c = (c & _pcre_utf8_table3[gcaa]) << gcss; \
-    while (gcaa-- > 0) \
-      { \
-      gcss -= 6; \
-      c |= (*eptr++ & 0x3f) << gcss; \
-      } \
-    }
+  if (c >= 0xc0) GETUTF8INC(c, eptr);

 /* Get the next character, testing for UTF-8 mode, and advancing the pointer.
 This is called when we don't know if we are in UTF-8 mode. */

 #define GETCHARINCTEST(c, eptr) \
  c = *eptr++; \
-  if (utf8 && c >= 0xc0) \
+  if (utf8 && c >= 0xc0) GETUTF8INC(c, eptr);
+
+/* Base macro to pick up the remaining bytes of a UTF-8 character, not
+advancing the pointer, incrementing the length. */
+
+#define GETUTF8LEN(c, eptr, len) \
    { \
-    int gcaa = _pcre_utf8_table4[c & 0x3f];  /* Number of additional bytes */ \
-    int gcss = 6*gcaa; \
-    c = (c & _pcre_utf8_table3[gcaa]) << gcss; \
-    while (gcaa-- > 0) \
+    if ((c & 0x20) == 0) \
      { \
-      gcss -= 6; \
-      c |= (*eptr++ & 0x3f) << gcss; \
+      c = ((c & 0x1f) << 6) | (eptr[1] & 0x3f); \
+      len++; \
+      } \
+    else if ((c & 0x10)  == 0) \
+      { \
+      c = ((c & 0x0f) << 12) | ((eptr[1] & 0x3f) << 6) | (eptr[2] & 0x3f); \
+      len += 2; \
+      } \
+    else if ((c & 0x08)  == 0) \
+      {\
+      c = ((c & 0x07) << 18) | ((eptr[1] & 0x3f) << 12) | \
+          ((eptr[2] & 0x3f) << 6) | (eptr[3] & 0x3f); \
+      len += 3; \
+      } \
+    else if ((c & 0x04)  == 0) \
+      { \
+      c = ((c & 0x03) << 24) | ((eptr[1] & 0x3f) << 18) | \
+          ((eptr[2] & 0x3f) << 12) | ((eptr[3] & 0x3f) << 6) | \
+          (eptr[4] & 0x3f); \
+      len += 4; \
+      } \
+    else \
+      {\
+      c = ((c & 0x01) << 30) | ((eptr[1] & 0x3f) << 24) | \
+          ((eptr[2] & 0x3f) << 18) | ((eptr[3] & 0x3f) << 12) | \
+          ((eptr[4] & 0x3f) << 6) | (eptr[5] & 0x3f); \
+      len += 5; \
      } \
    }

@@ -499,19 +555,7 @@ if there are extra bytes. This is called when we know we are in UTF-8 mode. */

 #define GETCHARLEN(c, eptr, len) \
  c = *eptr; \
-  if (c >= 0xc0) \
-    { \
-    int gcii; \
-    int gcaa = _pcre_utf8_table4[c & 0x3f];  /* Number of additional bytes */ \
-    int gcss = 6*gcaa; \
-    c = (c & _pcre_utf8_table3[gcaa]) << gcss; \
-    for (gcii = 1; gcii <= gcaa; gcii++) \
-      { \
-      gcss -= 6; \
-      c |= (eptr[gcii] & 0x3f) << gcss; \
-      } \
-    len += gcaa; \
-    }
+  if (c >= 0xc0) GETUTF8LEN(c, eptr, len);

 /* Get the next UTF-8 character, testing for UTF-8 mode, not advancing the
 pointer, incrementing length if there are extra bytes. This is called when we
@@ -519,19 +563,7 @@ do not know if we are in UTF-8 mode. */

 #define GETCHARLENTEST(c, eptr, len) \
  c = *eptr; \
-  if (utf8 && c >= 0xc0) \
-    { \
-    int gcii; \
-    int gcaa = _pcre_utf8_table4[c & 0x3f];  /* Number of additional bytes */ \
-    int gcss = 6*gcaa; \
-    c = (c & _pcre_utf8_table3[gcaa]) << gcss; \
-    for (gcii = 1; gcii <= gcaa; gcii++) \
-      { \
-      gcss -= 6; \
-      c |= (eptr[gcii] & 0x3f) << gcss; \
-      } \
-    len += gcaa; \
-    }
+  if (utf8 && c >= 0xc0) GETUTF8LEN(c, eptr, len);

 /* If the pointer is not at the start of a character, move it back until
 it is. This is called only in UTF-8 mode - we don't put a test within the macro
@@ -539,7 +571,7 @@ because almost all calls are already within a block of UTF-8 only code. */

 #define BACKCHAR(eptr) while((*eptr & 0xc0) == 0x80) eptr--

-#endif
+#endif  /* SUPPORT_UTF8 */


 /* In case there is no definition of offsetof() provided - though any proper
@@ -583,7 +615,7 @@ time, run time, or study time, respectively. */
   PCRE_DOTALL|PCRE_DOLLAR_ENDONLY|PCRE_EXTRA|PCRE_UNGREEDY|PCRE_UTF8| \
   PCRE_NO_AUTO_CAPTURE|PCRE_NO_UTF8_CHECK|PCRE_AUTO_CALLOUT|PCRE_FIRSTLINE| \
   PCRE_DUPNAMES|PCRE_NEWLINE_BITS|PCRE_BSR_ANYCRLF|PCRE_BSR_UNICODE| \
-   PCRE_JAVASCRIPT_COMPAT|PCRE_UCP)
+   PCRE_JAVASCRIPT_COMPAT|PCRE_UCP|PCRE_NO_START_OPTIMIZE)

 #define PUBLIC_EXEC_OPTIONS \
  (PCRE_ANCHORED|PCRE_NOTBOL|PCRE_NOTEOL|PCRE_NOTEMPTY|PCRE_NOTEMPTY_ATSTART| \
@@ -900,15 +932,16 @@ so that PCRE works on both ASCII and EBCDIC platforms, in non-UTF-mode only. */

 #define STRING_DEFINE               "DEFINE"

-#define STRING_CR_RIGHTPAR          "CR)"
-#define STRING_LF_RIGHTPAR          "LF)"
-#define STRING_CRLF_RIGHTPAR        "CRLF)"
-#define STRING_ANY_RIGHTPAR         "ANY)"
-#define STRING_ANYCRLF_RIGHTPAR     "ANYCRLF)"
-#define STRING_BSR_ANYCRLF_RIGHTPAR "BSR_ANYCRLF)"
-#define STRING_BSR_UNICODE_RIGHTPAR "BSR_UNICODE)"
-#define STRING_UTF8_RIGHTPAR        "UTF8)"
-#define STRING_UCP_RIGHTPAR         "UCP)"
+#define STRING_CR_RIGHTPAR             "CR)"
+#define STRING_LF_RIGHTPAR             "LF)"
+#define STRING_CRLF_RIGHTPAR           "CRLF)"
+#define STRING_ANY_RIGHTPAR            "ANY)"
+#define STRING_ANYCRLF_RIGHTPAR        "ANYCRLF)"
+#define STRING_BSR_ANYCRLF_RIGHTPAR    "BSR_ANYCRLF)"
+#define STRING_BSR_UNICODE_RIGHTPAR    "BSR_UNICODE)"
+#define STRING_UTF8_RIGHTPAR           "UTF8)"
+#define STRING_UCP_RIGHTPAR            "UCP)"
+#define STRING_NO_START_OPT_RIGHTPAR   "NO_START_OPT)"

 #else  /* SUPPORT_UTF8 */

@@ -1154,15 +1187,16 @@ only. */

 #define STRING_DEFINE               STR_D STR_E STR_F STR_I STR_N STR_E

-#define STRING_CR_RIGHTPAR          STR_C STR_R STR_RIGHT_PARENTHESIS
-#define STRING_LF_RIGHTPAR          STR_L STR_F STR_RIGHT_PARENTHESIS
-#define STRING_CRLF_RIGHTPAR        STR_C STR_R STR_L STR_F STR_RIGHT_PARENTHESIS
-#define STRING_ANY_RIGHTPAR         STR_A STR_N STR_Y STR_RIGHT_PARENTHESIS
-#define STRING_ANYCRLF_RIGHTPAR     STR_A STR_N STR_Y STR_C STR_R STR_L STR_F STR_RIGHT_PARENTHESIS
-#define STRING_BSR_ANYCRLF_RIGHTPAR STR_B STR_S STR_R STR_UNDERSCORE STR_A STR_N STR_Y STR_C STR_R STR_L STR_F STR_RIGHT_PARENTHESIS
-#define STRING_BSR_UNICODE_RIGHTPAR STR_B STR_S STR_R STR_UNDERSCORE STR_U STR_N STR_I STR_C STR_O STR_D STR_E STR_RIGHT_PARENTHESIS
-#define STRING_UTF8_RIGHTPAR        STR_U STR_T STR_F STR_8 STR_RIGHT_PARENTHESIS
-#define STRING_UCP_RIGHTPAR         STR_U STR_C STR_P STR_RIGHT_PARENTHESIS
+#define STRING_CR_RIGHTPAR             STR_C STR_R STR_RIGHT_PARENTHESIS
+#define STRING_LF_RIGHTPAR             STR_L STR_F STR_RIGHT_PARENTHESIS
+#define STRING_CRLF_RIGHTPAR           STR_C STR_R STR_L STR_F STR_RIGHT_PARENTHESIS
+#define STRING_ANY_RIGHTPAR            STR_A STR_N STR_Y STR_RIGHT_PARENTHESIS
+#define STRING_ANYCRLF_RIGHTPAR        STR_A STR_N STR_Y STR_C STR_R STR_L STR_F STR_RIGHT_PARENTHESIS
+#define STRING_BSR_ANYCRLF_RIGHTPAR    STR_B STR_S STR_R STR_UNDERSCORE STR_A STR_N STR_Y STR_C STR_R STR_L STR_F STR_RIGHT_PARENTHESIS
+#define STRING_BSR_UNICODE_RIGHTPAR    STR_B STR_S STR_R STR_UNDERSCORE STR_U STR_N STR_I STR_C STR_O STR_D STR_E STR_RIGHT_PARENTHESIS
+#define STRING_UTF8_RIGHTPAR           STR_U STR_T STR_F STR_8 STR_RIGHT_PARENTHESIS
+#define STRING_UCP_RIGHTPAR            STR_U STR_C STR_P STR_RIGHT_PARENTHESIS
+#define STRING_NO_START_OPT_RIGHTPAR   STR_N STR_O STR_UNDERSCORE STR_S STR_T STR_A STR_R STR_T STR_UNDERSCORE STR_O STR_P STR_T STR_RIGHT_PARENTHESIS

 #endif  /* SUPPORT_UTF8 */

@@ -1516,8 +1550,9 @@ in UTF-8 mode. The code that uses this table must know about such things. */
  3, 3,                          /* RREF, NRREF                            */ \
  1,                             /* DEF                                    */ \
  1, 1,                          /* BRAZERO, BRAMINZERO                    */ \
-  3, 1, 3,                       /* MARK, PRUNE, PRUNE_ARG,                */ \
-  1, 3, 1, 3,                    /* SKIP, SKIP_ARG, THEN, THEN_ARG,        */ \
+  3, 1, 3,                       /* MARK, PRUNE, PRUNE_ARG                 */ \
+  1, 3,                          /* SKIP, SKIP_ARG                         */ \
+  1+LINK_SIZE, 3+LINK_SIZE,      /* THEN, THEN_ARG                         */ \
  1, 1, 1, 3, 1                  /* COMMIT, FAIL, ACCEPT, CLOSE, SKIPZERO  */


@@ -1536,7 +1571,8 @@ enum { ERR0,  ERR1,  ERR2,  ERR3,  ERR4,  ERR5,  ERR6,  ERR7,  ERR8,  ERR9,
       ERR30, ERR31, ERR32, ERR33, ERR34, ERR35, ERR36, ERR37, ERR38, ERR39,
       ERR40, ERR41, ERR42, ERR43, ERR44, ERR45, ERR46, ERR47, ERR48, ERR49,
       ERR50, ERR51, ERR52, ERR53, ERR54, ERR55, ERR56, ERR57, ERR58, ERR59,
-       ERR60, ERR61, ERR62, ERR63, ERR64, ERR65, ERR66, ERR67, ERRCOUNT };
+       ERR60, ERR61, ERR62, ERR63, ERR64, ERR65, ERR66, ERR67, ERR68,
+       ERRCOUNT };

 /* The real format of the start of the pcre block; the index of names and the
 code vector run on as long as necessary after the end. We store an explicit
@@ -537,11 +537,26 @@ for(;;)
    case OP_MARK:
    case OP_PRUNE_ARG:
    case OP_SKIP_ARG:
-    case OP_THEN_ARG:
    fprintf(f, "    %s %s", OP_names[*code], code + 2);
    extra += code[1];
    break;

+    case OP_THEN:
+    if (print_lengths)
+      fprintf(f, "    %s %d", OP_names[*code], GET(code, 1));
+    else
+      fprintf(f, "    %s", OP_names[*code]);
+    break;
+
+    case OP_THEN_ARG:
+    if (print_lengths)
+      fprintf(f, "    %s %d %s", OP_names[*code], GET(code, 1),
+        code + 2 + LINK_SIZE);
+    else
+      fprintf(f, "    %s %s", OP_names[*code], code + 2 + LINK_SIZE);
+    extra += code[1+LINK_SIZE];
+    break;
+
    /* Anything else is just an item with no data*/

    default:
@@ -417,10 +417,13 @@ for (;;)
    case OP_MARK:
    case OP_PRUNE_ARG:
    case OP_SKIP_ARG:
-    case OP_THEN_ARG:
    cc += _pcre_OP_lengths[op] + cc[1];
    break;

+    case OP_THEN_ARG:
+    cc += _pcre_OP_lengths[op] + cc[1+LINK_SIZE];
+    break;
+
    /* For the record, these are the opcodes that are matched by "default":
    OP_ACCEPT, OP_CLOSE, OP_COMMIT, OP_FAIL, OP_PRUNE, OP_SET_SOM, OP_SKIP,
    OP_THEN. */
@@ -70,6 +70,20 @@ Arguments:

 Returns:       < 0    if the string is a valid UTF-8 string
               >= 0   otherwise; the value is the offset of the bad byte
+
+Bad bytes can be:
+
+  . An isolated byte whose most significant bits are 0x80, because this
+    can only correctly appear within a UTF-8 character;
+
+  . A byte whose most significant bits are 0xc0, but whose other bits indicate
+    that there are more than 3 additional bytes (i.e. an RFC 2279 starting
+    byte, which is no longer valid under RFC 3629);
+
+  .
+
+The returned offset may also be equal to the length of the string; this means
+that one or more bytes is missing from the final UTF-8 character.
 */

 int
@@ -91,7 +105,8 @@ for (p = string; length-- > 0; p++)
  if (c < 128) continue;
  if (c < 0xc0) return p - string;
  ab = _pcre_utf8_table4[c & 0x3f];     /* Number of additional bytes */
-  if (length < ab || ab > 3) return p - string;
+  if (ab > 3) return p - string;        /* Too many for RFC 3629 */
+  if (length < ab) return p + 1 + length - string;   /* Missing bytes */
  length -= ab;

  /* Check top bits in the second byte */
@@ -50,13 +50,16 @@ const char *error;
 char *pattern;
 char *subject;
 unsigned char *name_table;
+unsigned int option_bits;
 int erroffset;
 int find_all;
+int crlf_is_newline;
 int namecount;
 int name_entry_size;
 int ovector[OVECCOUNT];
 int subject_length;
 int rc, i;
+int utf8;


 /**************************************************************************
@@ -238,15 +241,56 @@ if (namecount <= 0) printf("No named substrings\n"); else
 * subject is not a valid match; other possibilities must be tried. The   *
 * second flag restricts PCRE to one match attempt at the initial string  *
 * position. If this match succeeds, an alternative to the empty string   *
-* match has been found, and we can proceed round the loop.               *
+* match has been found, and we can print it and proceed round the loop,  *
+* advancing by the length of whatever was found. If this match does not  *
+* succeed, we still stay in the loop, advancing by just one character.   *
+* In UTF-8 mode, which can be set by (*UTF8) in the pattern, this may be *
+* more than one byte.                                                    *
+*                                                                        *
+* However, there is a complication concerned with newlines. When the     *
+* newline convention is such that CRLF is a valid newline, we want must  *
+* advance by two characters rather than one. The newline convention can  *
+* be set in the regex by (*CR), etc.; if not, we must find the default.  *
 *************************************************************************/

-if (!find_all)
+if (!find_all)     /* Check for -g */
  {
  pcre_free(re);   /* Release the memory used for the compiled pattern */
  return 0;        /* Finish unless -g was given */
  }

+/* Before running the loop, check for UTF-8 and whether CRLF is a valid newline
+sequence. First, find the options with which the regex was compiled; extract
+the UTF-8 state, and mask off all but the newline options. */
+
+(void)pcre_fullinfo(re, NULL, PCRE_INFO_OPTIONS, &option_bits);
+utf8 = option_bits & PCRE_UTF8;
+option_bits &= PCRE_NEWLINE_CR|PCRE_NEWLINE_LF|PCRE_NEWLINE_CRLF|
+               PCRE_NEWLINE_ANY|PCRE_NEWLINE_ANYCRLF;
+
+/* If no newline options were set, find the default newline convention from the
+build configuration. */
+
+if (option_bits == 0)
+  {
+  int d;
+  (void)pcre_config(PCRE_CONFIG_NEWLINE, &d);
+  /* Note that these values are always the ASCII ones, even in
+  EBCDIC environments. CR = 13, NL = 10. */
+  option_bits = (d == 13)? PCRE_NEWLINE_CR :
+          (d == 10)? PCRE_NEWLINE_LF :
+          (d == (13<<8 | 10))? PCRE_NEWLINE_CRLF :
+          (d == -2)? PCRE_NEWLINE_ANYCRLF :
+          (d == -1)? PCRE_NEWLINE_ANY : 0;
+  }
+
+/* See if CRLF is a valid newline sequence. */
+
+crlf_is_newline =
+     option_bits == PCRE_NEWLINE_ANY ||
+     option_bits == PCRE_NEWLINE_CRLF ||
+     option_bits == PCRE_NEWLINE_ANYCRLF;
+
 /* Loop for second and subsequent matches */

 for (;;)
@@ -280,14 +324,32 @@ for (;;)
  is zero, it just means we have found all possible matches, so the loop ends.
  Otherwise, it means we have failed to find a non-empty-string match at a
  point where there was a previous empty-string match. In this case, we do what
-  Perl does: advance the matching position by one, and continue. We do this by
-  setting the "end of previous match" offset, because that is picked up at the
-  top of the loop as the point at which to start again. */
+  Perl does: advance the matching position by one character, and continue. We
+  do this by setting the "end of previous match" offset, because that is picked
+  up at the top of the loop as the point at which to start again.
+
+  There are two complications: (a) When CRLF is a valid newline sequence, and
+  the current position is just before it, advance by an extra byte. (b)
+  Otherwise we must ensure that we skip an entire UTF-8 character if we are in
+  UTF-8 mode. */

  if (rc == PCRE_ERROR_NOMATCH)
    {
-    if (options == 0) break;
-    ovector[1] = start_offset + 1;
+    if (options == 0) break;                    /* All matches found */
+    ovector[1] = start_offset + 1;              /* Advance one byte */
+    if (crlf_is_newline &&                      /* If CRLF is newline & */
+        start_offset < subject_length - 1 &&    /* we are at CRLF, */
+        subject[start_offset] == '\r' &&
+        subject[start_offset + 1] == '\n')
+      ovector[1] += 1;                          /* Advance by one more. */
+    else if (utf8)                              /* Otherwise, ensure we */
+      {                                         /* advance a whole UTF-8 */
+      while (ovector[1] < subject_length)       /* character. */
+        {
+        if ((subject[ovector[1]] & 0xc0) != 0x80) break;
+        ovector[1] += 1;
+        }
+      }
    continue;    /* Go round the loop again */
    }

@@ -149,6 +149,7 @@ static const int eint[] = {
  REG_BADPAT,  /* different names for subpatterns of the same number are not allowed */
  REG_BADPAT,  /* (*MARK) must have an argument */
  REG_INVARG,  /* this version of PCRE is not compiled with PCRE_UCP support */
+  REG_BADPAT,  /* \c must be followed by an ASCII character */
 };

 /* Table of texts corresponding to POSIX error codes */