The only character that is 'special' is the left square bracket or
[. The simplest pattern is just literal text, with no left square brackets.
Whenever we need EasyPattern keywords we just put them inside
[...].
EasyPattern
Description
Matches this text...
hello there
No [ ... ] expression has been used, this is just literal text
hello there
hello there [longest 1 or more letters]
The special part is [longest 1 or more letters]
hello there Fred, hello there Cornelia, etc
I am [1 or more digits] years old
The special part is [1 or more digits]
I am 2 years old, I am 302 years old, etc
This is a left square bracket
[ '[' ]
This shows how to insert a left square bracket in
literal text.
This is a left square bracket [
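For readers who know conventional regular expressions, the rows above can be approximated with Python's re module. This is only a sketch: the regex spellings are assumed equivalents chosen for illustration, not EasyPattern's own translation.

```python
import re

# Assumed regex equivalents (illustrative only, not EasyPattern's internals):
#   hello there [longest 1 or more letters]  ->  hello there [A-Za-z]+
#   I am [1 or more digits] years old        ->  I am [0-9]+ years old
#   This is a left square bracket [ '[' ]    ->  This is a left square bracket \[
greeting = re.compile(r"hello there [A-Za-z]+")
age = re.compile(r"I am [0-9]+ years old")
bracket = re.compile(r"This is a left square bracket \[")

greeting_match = greeting.search("Well, hello there Cornelia!")
age_match = age.search("I am 302 years old")
bracket_match = bracket.search("This is a left square bracket [")

print(greeting_match.group())  # hello there Cornelia
print(age_match.group())       # I am 302 years old
print(bracket_match.group())   # This is a left square bracket [
```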
To use multiple keywords, you can either
Put them next to each other
[...][...]
e.g. [letter][digit] matches "a1", "b1", etc.
Put commas between them
[..., ...]
e.g. [letter, digit] instead of
[letter][digit]
Put spaces between them
[... ...]
e.g. [letter digit] instead of
[letter][digit]
You can also put literal text anywhere inside [...] using single quotes
or double quotes.
['literal']
e.g. "['abc']" instead of
"abc"
[... 'literal']
e.g. "[digit, 'abc']" instead of
"[digit]abc"
['literal' ...]
e.g. "['abc', digit]" instead of
"abc[digit]"
[... 'literal' ...]
e.g. "[digit, 'abc', digit]" instead of
"[digit]abc[digit]"
There is usually no difference in meaning between including
literal text within a bracketed expression (in single quotes) and leaving
literal text outside the brackets. The choice is a matter of individual
preference. One exception: when using [not], only the single-quoted literal will
work, e.g. [not '-'].
The most important keywords represent character classes or sets, that
is, a set of related characters.
Any character, letters, digits, etc.
[character],
[char],
[chars],
[characters]
All 256 chars (every character including NULL). EasyPattern's
[character] or
[char] will match any character including
return. If you want any character except a return (or formfeed), use
[paragraphChar];
that is, any character that could appear in a paragraph. Details below.
[letter],
[letters]
Includes letters with umlauts and other diacritical marks common in certain
European languages.
[digit],
[digits]
Decimal digits 0-9
[number],
[numbers],
[numeric]
A number with an optional leading sign, digits, optional decimal point and
trailing digits
[Integer]
A number with an optional leading sign, followed by digits
[Float]
A number with an optional leading sign, digits, optional decimal point
and trailing digits, optionally followed by 'e', a sign, and 1 or more
digits
[EBCDICletter]
An EBCDIC letter
[EBCDICupper]
An EBCDIC uppercase letter
[EBCDIClower]
An EBCDIC lowercase letter
[EBCDICdigit]
An EBCDIC digit, hex F0-F9
[punctuation]
Printing characters, excluding letters and digits, includes
!?.,:; " ' ' / - () {} -
Note that ? and ? are considered punctuation.
[symbol],
[symbols]
~@#$%^&*
EasyPattern distinguishes punctuation from symbols; the sets do
not overlap. For broader combinations, see
[printableChar] and
[typewriterChar].
For narrower focus, see [sentencePunctuation],
[anyQuote],
[anyBracket] and
[anyDash].
Special letters
[upper],
[uppercase],
[uppercaseLetter]
Uppercase letters. Note: In TextPipe you will also need to enable the
Match Case option for this to make any difference.
[lower], [lowercase],
[lowercaseLetter]
Lowercase letters. Note: In TextPipe you will also need to enable the
Match Case option for this to make any difference.
Reserved punctuation
[leftBracket]
[
[rightBracket]
]
[leftParen],
[leftParenthesis]
(
[rightParen],
[rightParenthesis]
)
[leftAngle],
[lessThan]
<
[rightAngle],
[greaterThan]
>
[comma]
,
[singleQuote]
'
[doubleQuote],
[quote]
" (i.e. standard ASCII "straight" quotation mark)
[backwardSingleQuote]
`
ASCII function()
asc(code, ...),
ascii(code, ...)
The ASCII() or ASC() function embeds arbitrary control characters by
entering the control code in decimal or hex (precede the hex digits with '$',
e.g. $ff). You can add one or more control
characters by separating each with a space or comma, e.g. ASC( 65, 66 )
outputs 'AB' into the pattern.
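The code-to-character conversion described above can be sketched in Python. The helper name asc below is hypothetical and only mimics the described behaviour (decimal codes, or hex codes prefixed with '$', separated by spaces or commas); it is not EasyPattern's implementation.

```python
import re

def asc(codes: str) -> str:
    """Hypothetical helper: turn '65, 66' or '$41 $42' into the characters they encode."""
    chars = []
    for token in re.split(r"[,\s]+", codes.strip()):
        if not token:
            continue
        # A leading '$' marks a hexadecimal code; anything else is read as decimal.
        value = int(token[1:], 16) if token.startswith("$") else int(token)
        chars.append(chr(value))
    return "".join(chars)

print(asc("65, 66"))   # AB
print(asc("$41 $42"))  # AB
```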
EBCDIC function()
ebcdic( literal )
The EBCDIC() function embeds Mainframe EBCDIC characters translated
from a string literal you provide e.g. EBCDIC( '0' )
outputs \xF0 (an EBCDIC '0')
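The same translation can be checked with Python's built-in codecs. The sketch assumes code page cp037, one common EBCDIC variant; which code page EasyPattern actually uses is not stated here.

```python
# cp037 is an assumption: it is one widely used EBCDIC code page.
ebcdic_zero = "0".encode("cp037")
ebcdic_digits = "0123456789".encode("cp037")

print(ebcdic_zero)          # b'\xf0'
print(ebcdic_digits.hex())  # f0f1f2f3f4f5f6f7f8f9
```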
Square brackets are "reserved" by EasyPattern. The other
punctuation marks listed here are literal when they appear outside of brackets.
EasyPattern gives special meaning to certain punctuation marks;
these keywords can be used to represent the literal character.
A literal comma and left & right brackets & parentheses may appear
inside single quotes. The keywords are provided to make patterns easier to read.
Filename patterns
[Drive]
A drive letter followed by a colon (:) e.g
d:\special\folder\filename.doc,
feeding the letter into @Drive@
[Folder]
A path fragment between \ ... \, e.g. d:\special\folder\filename.doc,
feeding into @Folder@
[Path]
A path with optional drive e.g.
d:\special\folder\filename.doc,
feeding into @Drive@ and @Path@
[UNCpath]
A UNC path consisting of server, share and path - these
feed into @Server@, @Share@ and @Path@
[Filename]
A filename, starting from \ and not ending with \ e.g
d:\special\folder\filename.doc
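As a rough analogy, Python's pathlib splits a Windows path into similar pieces. This is only a sketch of the idea; the exact text that EasyPattern feeds into @Drive@, @Folder@, @Path@, @Server@ and @Share@ may differ (for example, whether separators are included).

```python
from pathlib import PureWindowsPath

p = PureWindowsPath(r"d:\special\folder\filename.doc")
drive = p.drive            # drive letter plus colon
folders = p.parts[1:-1]    # the folder names between the root and the file name
filename = p.name          # last component, no trailing backslash

unc = PureWindowsPath(r"\\server\share\path\file.txt")
server_and_share = unc.drive  # pathlib reports server and share together

print(drive, folders, filename)
print(server_and_share)
```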
There are many ways to create your own character sets to
match exactly the characters you require.
You can combine existing character sets using "or":
[... or ...]
e.g. [letter or digit],
['a' or 'b']
Most EasyPattern keywords refer to a set of characters from which one will
match. The first use of "or" is to make a larger set -- though again, only one
of the larger set will match. (Any quantity can be specified using repetition
keywords, but it is still applying a quantity to a single character not to
multiple characters)
When sets are combined with "or", parentheses are optional. (In
technical terms, the character set use of "or" has very high
precedence; see below.)
[letter, letter or digit, letter] -- matches "aaa", "xyz", "h4q", "b7f" etc.
It doesn't hurt to add parentheses even though they are not required.
[letter (letter or digit) letter] -- same as above
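In conventional regex terms, "or" between character sets behaves like a single bracketed class. A minimal Python sketch, assuming ASCII letters and digits stand in for [letter] and [digit]:

```python
import re

# Assumed equivalent of [letter, letter or digit, letter]
pattern = re.compile(r"[A-Za-z][A-Za-z0-9][A-Za-z]")

samples = ["aaa", "xyz", "h4q", "b7f", "4ab", "a--"]
results = {s: bool(pattern.fullmatch(s)) for s in samples}
print(results)
```

Only the first four samples match; "4ab" fails on the first position and "a--" on the second.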
Negation - Match anything except a given set
Instead of specifying all the characters that could occur in a
match, it is often convenient to specify characters that could not occur.
[not ...],
[non ...],
[anyExcept ...]
e.g. [oneOrMore non letter]
EasyPattern has keywords for [quotedString] and
[HTMLTag], but if it didn't,
they would be easy to define:
['<', oneOrMore not '>', '>'] -- same as [HTMLTag]
[quote, oneOrMore not quote, quote] -- a simple definition for [quotedString]
Negation can only be applied to a single character, or a character
set from which one will match. For example:
[not letter] -- fine
[not letter or digit] -- fine: [letter or digit] is a set from which one will match
[not word] -- ERROR: "word" matches multiple characters
[not 'a'] -- fine (a single character)
[not 'whatever'] -- ERROR
[not lineChar or letter] -- CAREFUL! [lineChar] is defined as
[not linefeed OR verticalTab OR formfeed OR return].
You cannot combine negated and non-negated character sets, so this pattern
is equivalent to [ (not lineChar) or letter ], instead of [ not (lineChar or
letter) ]
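The two "easy to define" patterns above correspond closely to negated character classes in conventional regex. A small Python sketch, where the regex spellings are assumed equivalents rather than EasyPattern's actual definitions:

```python
import re

# ['<', oneOrMore not '>', '>']        ->  <[^>]+>
# [quote, oneOrMore not quote, quote]  ->  "[^"]+"
html_tag = re.compile(r"<[^>]+>")
quoted_string = re.compile(r'"[^"]+"')

text = 'Click <a href="x.html">here</a> or say "hello there".'
tags = html_tag.findall(text)
quotes = quoted_string.findall(text)

print(tags)
print(quotes)
```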
Keywords such as [letter] and
[digit] are character sets defined
internally to EasyPattern; the angle bracket notation lets you define your own
character sets. In both cases, EasyPattern matches any single character in that
set.
[<...>]
e.g. [<aeiou>],
[<135>],
[<!@#$%^&*>]
For single characters, [or] and a set are interchangeable, e.g.
[<aei>]
and ['a' or 'e' or 'i'] have the same meaning.
User defined sets, single character literals and EasyPattern
keywords can be combined with [or]:
[<aeiou> or <123> or '7' or symbol]
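The interchangeability of a set and single-character alternatives also holds in conventional regex, which makes it easy to check. A Python sketch under the assumption that [<aei>] corresponds to the class [aei], and that [symbol] is the set ~@#$%^&* listed earlier:

```python
import re

as_set = re.compile(r"[aei]")            # like [<aei>]
as_alternatives = re.compile(r"a|e|i")   # like ['a' or 'e' or 'i']

# Both forms accept exactly the same single characters.
same = all(
    bool(as_set.fullmatch(c)) == bool(as_alternatives.fullmatch(c))
    for c in "abcdefghij"
)

# Assumed equivalent of [<aeiou> or <123> or '7' or symbol]
combined = re.compile(r"[aeiou1237~@#$%^&*]")
hits = combined.findall("a7#z")

print(same, hits)
```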
When [or] is used to specify alternatives as part of a larger
pattern, grouping parentheses are required, e.g.
[space, 'Player' or 'EasyPattern', space] -- may not mean what you think!
[space]Player[or]EasyPattern[space] -- may not mean what you think!
[(space, 'Player') or ('EasyPattern', space)] -- that's what they mean
[space ('Player' or 'EasyPattern') space] -- this might be what you wanted
Remember: as noted in the section on expressions,
commas are allowed between items to make patterns easier to read; they do not
affect what the pattern means.
If you leave out the parentheses, EasyPattern will treat everything to
the left of the [or] as one implicit group and everything to the right of the
[or] as a separate group.
Note that visual grouping with brackets or commas is not enough; you must
use parentheses. For example, all of the following will be interpreted as
[(digit, 'this') or ('that')]:
[digit, 'this' or 'that'] -- careful; the commas may mislead
[digit]['this' or 'that'] -- careful; the brackets may mislead
[digit 'this' or 'that'] -- the grouping isn't clear; parentheses would help
[digit]this[or]that -- the grouping isn't clear; parentheses would help
As noted in the previous section, parentheses are not
required when [or] is used to combine character sets.
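Conventional regex alternation has the same low precedence, so the grouping pitfall can be reproduced in Python. The patterns below are assumed equivalents of the EasyPattern examples, for illustration only:

```python
import re

ungrouped = re.compile(r"[0-9]this|that")      # like [(digit, 'this') or ('that')]
grouped = re.compile(r"[0-9](?:this|that)")    # like [digit ('this' or 'that')]

ungrouped_accepts = [s for s in ("1this", "1that", "that") if ungrouped.fullmatch(s)]
grouped_accepts = [s for s in ("1this", "1that", "that") if grouped.fullmatch(s)]

print(ungrouped_accepts)  # ['1this', 'that']
print(grouped_accepts)    # ['1this', '1that']
```

Without parentheses the digit belongs only to the first alternative, so a bare "that" is accepted and "1that" is not.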
"or" as set vs. "or" as alternative
In many cases, you don't have to worry that there are two
different uses for "or"; both generally make sense in context. However, there
are 2 reasons for learning the differences:
or as set doesn't require parentheses; the grouping is implied
or as set can be part of a "not" expression since it still
represents one character
When the repetition or count includes a range of values to match,
EasyPattern has the choice of matching the "shortest" sequence of
characters that fits the pattern, or the "longest" that fits the
pattern. For example
[shortest zeroOrOne ...]
0 or 1
will try to match zero occurrences
[shortest zeroOrMore ...]
0+
will try to match zero occurrences
[shortest oneOrMore ...]
1+
will try to match one occurrence
[shortest twoOrMore ...]
2+
will try to match two occurrences
EasyPattern defaults to the SHORTEST match so the "shortest" keyword
is optional.
[shortest ... ...]
match the lowest possible number of repetitions (default)
[longest ... ...]
match the highest possible number of repetitions
In these cases, EasyPattern will only match more than the minimum
if required to complete additional parts of the pattern, e.g. given "abc123" and
the pattern [shortest oneOrMore letter, digit], EasyPattern will match "abc1", i.e. all
3 letters. However, given the same string and the pattern
[shortest oneOrMore
letter], EasyPattern will just match "a" the first letter. Given the same string and
[longest oneOrMore letter], EasyPattern will match "abc". Note that EasyPattern always
starts with the first character that matches that pattern, e.g. despite "c1"
being shorter than "abc1", EasyPattern matches the latter.
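The "abc123" walk-through above can be reproduced with lazy and greedy quantifiers in Python's re module. This is a sketch that assumes "shortest" behaves like a lazy quantifier (+?) and "longest" like a greedy one (+):

```python
import re

text = "abc123"

# [shortest oneOrMore letter, digit]: lazy, but must extend until a digit follows
shortest_then_digit = re.search(r"[A-Za-z]+?[0-9]", text).group()

# [shortest oneOrMore letter]: nothing follows, so one letter is enough
shortest_alone = re.search(r"[A-Za-z]+?", text).group()

# [longest oneOrMore letter]: takes as many letters as possible
longest_alone = re.search(r"[A-Za-z]+", text).group()

print(shortest_then_digit, shortest_alone, longest_alone)  # abc1 a abc
```

Note that the first result is "abc1" rather than "c1": the search starts at the leftmost position where a match is possible.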
Shortest/longest can be confusing.
Shortest can be quite slow, use "not" if possible
Pattern matching can become very time consuming if the number of repeats is
not known. Take for example
The pattern matcher first matches a, and 8 '2's, then it finds that 'z' does
not match 'b'. So it backtracks, trying with 7 '2's, failing again, then with 6
'2's, all the way back to 1 '2', before finally giving up, and starting to test
for 'a' again. If we know that backtracking into a repeated match will still
result in failure, we can tell EasyPattern not to bother, by using the
atomic keyword.
This time, the pattern matcher first matches a, and 8 '2's, then it finds
that 'z' does not match 'b'. Instead of retrying with fewer '2's, it goes straight
back to testing for 'a' again.
Literals, groups
All of the repetition & quantity keywords can be applied to
literals and groups as well as to individual keywords, e.g.
[oneOrMore 'ab'] -- matches "ab", "abab", "ababab" etc.
[oneOrMore letter or digit] -- matches "aaa", "456", "a45bbb" etc.
[oneOrMore not letter or digit] -- matches punctuation, symbols, whitespace etc.
[oneOrMore ('alpha' or 'omega')] -- matches "alphaalpha", "alphaomega" etc.
[oneOrMore (letter, digit)] -- matches "r2", "r2d2", "r2d2f7b2c4" etc.
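Repetition applied to literals and groups maps onto repeated non-capturing groups in conventional regex. A Python sketch with assumed equivalents of the examples above (ASCII letters and digits only):

```python
import re

examples = {
    r"(?:ab)+": "ababab",                  # [oneOrMore 'ab']
    r"[A-Za-z0-9]+": "a45bbb",             # [oneOrMore letter or digit]
    r"(?:alpha|omega)+": "alphaomega",     # [oneOrMore ('alpha' or 'omega')]
    r"(?:[A-Za-z][0-9])+": "r2d2f7b2c4",   # [oneOrMore (letter, digit)]
}

all_match = all(re.fullmatch(rx, s) for rx, s in examples.items())
odd_length_fails = re.fullmatch(r"(?:[A-Za-z][0-9])+", "r2d") is None

print(all_match, odd_length_fails)  # True True
```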
Assigns the contents of the group to a variable which can be referred to
later in both the search pattern ([group#],
e.g. [group6]; # can range
from 1 to 26) and in the replacement string ($#, e.g. $6; # can range from
1-9, a-z; $0 represents the entire matched string). If specified, the text
can also be stored in the global variable @varname in addition to the
positional variables $1, $2 etc.
[group#]
Matches the same text that a previously captured group found.
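Captured groups and [group#] play the role that capturing parentheses and backreferences play in conventional regex. The Python sketch below is an analogy only: \1 in the pattern stands in for [group1], and \1 in the replacement stands in for $1.

```python
import re

# Find a word that is immediately repeated, e.g. "is is".
doubled = re.compile(r"\b([A-Za-z]+) \1\b")

# Replace the doubled word with a single copy of the captured text.
fixed = doubled.sub(r"\1", "this is is a test")

print(fixed)  # this is a test
```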
When a match is found, it must be/must not be preceded by what is in the
brackets. The bracket contents are NOT included in the actual match. The
bracket contents are limited to fixed length strings - so no '3+' etc are
allowed. This must be the first part of your pattern.
[mustBeginWith(
'hello' or 'goodbye' ) 'fred']
[... mustEndWith(...)],
[... mustNotEndWith(...)]
When a match is found, it must be/must not be followed by what is in the
brackets. The bracket contents are NOT included in the actual match. The
bracket contents are limited to fixed length strings - so no '3+' etc are
allowed. This must be the last part of your pattern.
['fred' mustEndWith(
'erick' or 'dy' ) ]
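The mustEndWith example behaves much like a lookahead assertion in conventional regex: the following text is checked but not included in the match. A Python sketch, assuming (?=...) is a fair stand-in for mustEndWith(...):

```python
import re

# ['fred' mustEndWith( 'erick' or 'dy' )]  ->  fred(?=erick|dy)
pattern = re.compile(r"fred(?=erick|dy)")

m = pattern.search("frederick")
matched_text = m.group()                    # only 'fred' is part of the match
no_match = pattern.search("fred") is None   # nothing follows, so no match

print(matched_text, m.span(), no_match)  # fred (0, 4) True
```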
Parentheses must match, i.e. ")" always ends the most recent "(",
independent of number.
EasyPattern allows comments to be included in multi-line patterns using the
character ';' or '#' to mark the start of a comment, extending until the end of
the line e.g.
[ 3 space ;look for 3 spaces
'hello' #then the keyword we want
]
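Python's re module offers a comparable facility through the re.VERBOSE flag, which allows '#' comments inside a multi-line pattern. The sketch below is an assumed regex counterpart of the commented pattern above (';' comments have no regex equivalent):

```python
import re

pattern = re.compile(
    r"""
    [ ]{3}   # look for 3 spaces
    hello    # then the keyword we want
    """,
    re.VERBOSE,
)

m = pattern.search("say   hello")
print(repr(m.group()))  # '   hello'
```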
[space OR tab OR cr OR lf OR verticalTab OR nonbreakingSpace]
[tab]
ASCII 9, \t
[return], [cr]
ASCII 13, \r
[linefeed],
[lf]
ASCII 10, \n
[verticalTab]
ASCII 11
[formfeed]
ASCII 12, \f
[null]
ASCII 0
[CRLF]
[return, linefeed]
[newline]
[(return, linefeed) or return or linefeed]
[DOSNewline]
[return, linefeed]
[UNIXNewline]
[linefeed]
[MacNewline]
[return]
[not] cannot be applied to
[CRLF],
[newline] or
[DOSNewline] since they
either are or may be a character sequence rather than just a single character.
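The [newline] definition, [(return, linefeed) or return or linefeed], can be written as a conventional alternation. A Python sketch under that assumption; the two-character sequence is listed first so that a DOS line ending is consumed as one unit:

```python
import re

newline = re.compile(r"\r\n|\r|\n")

text = "dos\r\nunix\nmac\rend"
lines = newline.split(text)

print(lines)  # ['dos', 'unix', 'mac', 'end']
```

Because the pattern may match two characters, it cannot be placed inside a negated character class, which mirrors the restriction on [not] described above.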
A space character can usually be typed directly into a pattern ([ ' ' ]) but
using the keyword may make the pattern easier to understand (and modify later)
Whitespace combinations
[horizontalWhitespace],
[hSpace]
[space or nonbreakingSpace or tab]
[verticalWhitespace],
[vSpace]
[return or linefeed or formfeed or verticalTab]
words, columns, lines & paragraphs
[wordDelimiter]
[space OR tab OR linefeed OR verticalTab OR formfeed OR
return]
[wordChar]
[not wordDelimiter]
[word]
[1+ wordChar]
[columnDelimiter]
[tab OR linefeed OR formfeed OR return]
[columnChar]
[not columnDelimiter]
[column]
[1+ columnChar] Note: Use
[0+ columnChar] instead if
the column could be blank
[lineDelimiter]
[linefeed OR verticalTab OR formfeed OR return]
[lineChar]
[not lineDelimiter]
[line]
[1+ lineChar] Note: Use
[0+ lineChar] instead if the
line could be blank
[paragraphDelimiter]
[formfeed OR return]
[paragraphChar]
[not paragraphDelimiter]
[paragraph]
[1+ paragraphChar]
The above delimiters are characters not positions; they will
"consume" the character that they match. In contrast,
[TextStart] and
[TextEnd]
(below) are positions.
The above objects (word, column, line, paragraph) do not include
delimiters. So, to match multiple objects, you need to include the delimiters,
e.g.
[2+ word] -- won't match anything
[2+ (word, optional wordDelimiter)] -- correct
The definition for word is based strictly on whitespace so it will
include punctuation, matching text such as "$27.52" and "fancy+name". Although
in many cases it would be nice to exclude trailing punctuation, that pattern
would fail for text like "S.M.U.". When EP's definition of a word isn't
appropriate for your text, simply use the custom pattern that fits. For example,
[1+ wordChar, letter or digit or symbol] would ensure that the last char is not
punctuation.
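The whitespace-based definition of a word can be imitated with a negated character class. This Python sketch assumes [wordChar] corresponds to "anything except space, tab, linefeed, verticalTab, formfeed or return", as listed above:

```python
import re

# Assumed equivalent of [wordChar]: anything except the six word delimiters.
WORD_CHAR = r"[^ \t\n\x0b\x0c\r]"
word = re.compile(WORD_CHAR + "+")   # like [1+ wordChar]

words = word.findall("Pay $27.52 for fancy+name at S.M.U. today")
print(words)
```

Punctuation stays attached to each word, so "$27.52", "fancy+name" and "S.M.U." each come back as a single item.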
Word, column, line & paragraph require one or more characters. If a
line might be empty, use: [0+ lineChar] instead of
[line].
Because the definitions for word, column, line & paragraph look for
anything except the appropriate delimiter (rather than the leading
delimiter, a series of anything else, and the trailing delimiter), they can
be used to get the rest of a word, column, line & paragraph when the
starting point is already in the middle.
These definitions allow control characters (except the specific whitespace used as delimiters) to appear in words, columns, lines & paragraphs.
A column may contain the verticalTab character (it's used by
FileMaker to indicate line breaks within a field)
Word, column, line & paragraph consist of multiple characters so patterns
like [not word] don't make sense.
matches at end of the entire text or before newline at end
[lineStart]
matches the start of a line (*)
[lineEnd]
matches the end of a line (*)
[wordBoundary] or [wordBreak]
matches at a word boundary
[notWordBoundary]
matches when not at a word boundary
(*) [lineStart] and
[lineEnd] will work fine if the file you're editing has
Unix end of line characters, because the core EasyPattern engine assumes this.
For DOS or Windows files, you should use [ cr lf or textEnd ]
or [ mustEndWith(cr lf or textEnd) ]
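Python's re module has the same Unix-oriented behaviour, which makes the workaround easy to demonstrate. In this sketch, $ with re.MULTILINE stands in for [lineEnd], and a lookahead stands in for [ mustEndWith(cr lf or textEnd) ]; both mappings are assumptions:

```python
import re

dos_text = "one\r\ntwo"

# '$' only recognises a position before '\n' or at the end of the text,
# so the '\r' after "one" prevents a match there.
unix_style = re.findall(r"[a-z]+$", dos_text, re.MULTILINE)

# Explicitly allow "cr lf" or the end of the text instead.
dos_aware = re.findall(r"[a-z]+(?=\r\n|\Z)", dos_text)

print(unix_style)  # ['two']
print(dos_aware)   # ['one', 'two']
```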
A Comma-Separated-Value field. If fields are delimited by single or double
quotes, embedded newlines are allowed, as are doubled-up quotes. The quotes
are returned as part of the match.
[TABfield]
A Tab-delimited field. To process multiple tab fields, use e.g. [ 3 or more ( TABfield Tab) TABfield ]
[PipeField]
A Pipe-delimited field. To process multiple pipe fields, use e.g. [ 3 or more ( PipeField '|' )
PipeField ]
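The field rules described above (quoted fields, embedded newlines, doubled-up quotes) are the same ones Python's csv module implements, so it is a convenient way to see what a field looks like. One difference to note in this sketch: the csv module strips the surrounding quotes, whereas EasyPattern is described as returning them as part of the match.

```python
import csv
import io

# One record with three fields; the middle field is quoted and contains
# doubled-up quotes and an embedded newline.
data = 'a,"b ""q""\nline2",c\r\n'
rows = list(csv.reader(io.StringIO(data, newline="")))

# Tab- and pipe-delimited records without quoting can simply be split.
tab_fields = "x\ty\tz".split("\t")
pipe_fields = "x|y|z".split("|")

print(rows)
print(tab_fields, pipe_fields)
```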
A complete pattern may include many individual keywords and
many expressions. How do you know which keywords go together and where one
expression stops and another begins? If in doubt, just enclose every expression
in parentheses. But, EasyPattern has rules for combining keywords into expressions,
so parentheses aren't always required. The traditional way of expressing these rules
is to list the "precedence" of various operators or terms, highest first:
(...), including numbered groups
[or] for character sets and single-character literals
[not]
quantity specifiers ( oneormore, 2+, 3..7, etc)
character set keywords (e.g. letter, digit) and single-character
literals
multi-character literals
[or] as alternative, for groups and multi-character literals
Items with high precedence don't need parentheses; they group together
automatically. For example, let's build a pattern step-by-step using the "high
precedence" operators:
[letter or digit] -- "or" for character set keywords
[letter or digit or '.'] -- and single-character literal
[letter or digit or '.' or <!?>] -- and arbitrary set
[not letter or digit or '.' or <!?>] -- reverse the meaning with not
[1+ not letter or digit or '.' or <!?>] -- add a quantity specifier
[1+ (not (letter or digit or '.' or <!?>))] -- if you like parentheses, though the meaning is the same
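The end result of that step-by-step build corresponds to a single negated character class with a quantifier in conventional regex. A Python sketch, assuming ASCII letters and digits for [letter] and [digit]:

```python
import re

# Assumed equivalent of [1+ not letter or digit or '.' or <!?>]
pattern = re.compile(r"[^A-Za-z0-9.!?]+")

runs = pattern.findall("ab, -- cd!")
print(runs)  # [', -- ']
```

The whole "or" chain is negated as one set, so the run stops at "c" and the final "!" is excluded as well.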
Adding lower precedence terms before, after or both doesn't change the
grouping, though the expression is long enough that you may find a pair of
commas, brackets, or parentheses helpful. As long as you understand how EasyPattern
is doing the grouping, it doesn't matter whether you choose commas, brackets or
parentheses. If the parentheses are added around something that is already a group, they
don't change the meaning.
[punctuation 1+ not letter or digit or '.' or <!?> symbol]
[punctuation, 1+ not letter or digit or '.' or <!?>, symbol] -- same meaning but easier to read
[punctuation][1+ not letter or digit or '.' or <!?>][symbol] -- same meaning
[punctuation (1+ not letter or digit or '.' or <!?>) symbol] -- same meaning
Remember, commas and brackets don't change the meaning, only the look. If you
put them in the middle of high precedence terms, you might confuse yourself:
[punctuation 1+ not letter][or][digit or '.' or <!?> symbol] -- same meaning but HARDER to read
[punctuation 1+ not letter, or, digit or '.' or <!?> symbol] -- same meaning but HARDER to read
Only parentheses change the meaning:
[(punctuation 1+ not letter) or (digit or '.' or <!?> symbol)] -- different meaning
Note that [or] for character sets and
[or] as alternative have opposite
precedence. See
Character Sets and
Alternatives (above) for details &
examples.
EasyPattern vs. perl regex or grep
At its core, EasyPattern uses "regular expression"
technology that is similar to the "regex" or "grep" tools that originated on
UNIX. EasyPattern's primary benefit is that the patterns are much easier to read
and write.
For those who have some experience with regex, here are a few specific
differences:
Quantity is specified as a prefix rather than a suffix. We believe
prefix notation is much more natural.
e.g. [1+ digit] rather than "[0-9]+"
Parentheses groups are not automatically numbered. Drawback (to some): you
have to include a number if you want to refer to that matched portion.
Benefits: the parentheses that are there just for logical grouping don't get
numbered.
No backslashes are required to "escape" special characters
(instead, EasyPattern provides keywords such as
[rightBracket]). Benefit: Other pattern
languages already use backslash as an escape character so extra backslashes make
patterns even more difficult to read.
EasyPattern includes keywords for many character sets that require a custom
bracketed set in regex, e.g. punctuation, whitespace, paragraph, column, etc.
EasyPattern keywords generally include Macintosh-specific characters, e.g.
[letter] includes letters with umlauts and other diacritical marks
EasyPattern can combine character sets with
[or] (as well as use
[or] for
alternatives).
EP's [character] or
[char] will match any character; the
"equivalent" in some products will match anything except carriage return. If you
want any character except a return (or formfeed), use
[paragraphChar]; that is,
any character that could appear in a paragraph. Of course,
[not return] works
too.