RegexSearch is a Java application that performs find and find-and-replace searches for regular expressions on multiple text files.
RegexSearch has the following features:
RegexSearch is made available under two licences:
uk.blankaspect.regexsearch package (including resources) may be used under the terms of version 3 of the GNU General Public License.
This section assumes that you have some experience of building and running a Java application.
A Java Development Kit (JDK) that supports Java 17, such as Eclipse Temurin from Adoptium, is required to build RegexSearch.
The src directory of the repository contains all the source code and resources that constitute the application; there are no external dependencies. The directory conforms to the Maven standard directory layout: the source code is in src/main/java and the resources are in src/main/resources.
The supplied Kotlin-DSL-based Gradle build script, build.gradle.kts, modifies the compileJava and jar tasks (provided by the java plug-in) to compile the application and to create an executable JAR file for it.
The supplied Kotlin-DSL-based Gradle build script, build.gradle.kts, defines two tasks that may be used to run RegexSearch:
compileJava task.
jar task.
Both tasks extend the JavaExec task.
When it starts up, RegexSearch is configured with configuration properties that are read from two sources: the command line of the Java launcher and a configuration file whose location may be explicitly specified. If the same property is specified on the command line and in a configuration file, the value from the configuration file takes precedence.
The recommended method of setting the properties in a configuration file is with the Options > Preferences command. For command-line properties, which must be edited manually, the form of the property values is given in an appendix, and it can also be inferred by generating a configuration file with the desired values and inspecting the contents of the file.
When RegexSearch is run, configuration properties may be passed to the java launcher on the command line as system properties; eg, -Dapp.general.mainWindowLocation="100, 0". RegexSearch's command-line configuration properties all have the prefix "app." (including the dot). A list of all the properties that are recognised by RegexSearch is given in an appendix.
System properties can be added to the two supplied Gradle tasks, runMain and runJar, with the systemProperties(Map<String, ?>) method.
One particular property, app.configDir, is used to specify the directory that contains a configuration file, as described below.
The configuration file is named regexSearch-config.xml. RegexSearch doesn't require a configuration file: it uses a default value for any configuration property that is missing from the source(s) of configuration. Similarly, if it finds a property value to be invalid, RegexSearch will display a message to this effect and use the default value of the property. If the configuration file contains a property that was specified on the command line, the value from the configuration file is used.
If the configuration has changed when you exit the application normally (ie, using the File > Exit command or an equivalent), RegexSearch will save its configuration to a configuration file. If a configuration file was read on startup, it will overwrite that file; otherwise, it will write a configuration file to the default directory described below, unless the value of the app.configDir property was an empty string.
A configuration file can be written explicitly with the Save configuration command within the Preferences dialog.
When it starts up, RegexSearch is informed of the location of the configuration file with the app.configDir property. The value of the property may contain special constructs for system properties, environment variables and the user's home directory.
The app.configDir property may be set in two ways:
java launcher, or
regexSearch-properties.xml that resides in the same directory as the regexSearch.jar JAR file.
The regexSearch-properties.xml file is a legacy from the time when RegexSearch was distributed as a JAR file with an installer; the properties file was written by the installer. If you create a properties file, it should have the following form, with the example pathname replaced by the actual pathname:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <entry key="app.configDir">/home/slothrop/.blankaspect/regexSearch</entry>
</properties>
If the app.configDir property is set both on the command line and in the properties file, the value in the properties file takes precedence.
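The properties file shown above is in the standard Java XML properties format, so it can be read with java.util.Properties.loadFromXML. The following is a minimal sketch of reading the app.configDir entry; whether RegexSearch itself reads the file this way is an assumption, and the class and method names (PropertiesFileExample, readConfigDir) are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

final class PropertiesFileExample {

    // The example properties file from the text, held in a string for illustration
    static final String XML =
        "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n"
        + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
        + "<properties>\n"
        + "<entry key=\"app.configDir\">/home/slothrop/.blankaspect/regexSearch</entry>\n"
        + "</properties>\n";

    // Parses XML properties text and returns the value of app.configDir,
    // or null if the property is absent.
    static String readConfigDir(String xml) {
        Properties props = new Properties();
        try (InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))) {
            props.loadFromXML(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty("app.configDir");
    }

    public static void main(String[] args) {
        System.out.println(readConfigDir(XML));
    }
}
```

In practice the text would be read from the regexSearch-properties.xml file rather than from a string.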
The existence and value of the app.configDir property determine the locations that are searched for a configuration file:
$PWD on Linux/UNIX systems, depends on the system and the way in which RegexSearch was run.
$HOME/.blankaspect/regexSearch, where $HOME is the environment variable that denotes the user's home directory. On Windows systems, the default directory is %APPDATA%\blankaspect\regexSearch, where %APPDATA% is an environment variable.
The parameters of a search consist of:
The search parameters are stored collectively as an XML file. The DTD of the current version of a search-parameter file can be found in a file in the RegexSearch repository. It is provided only for reference because RegexSearch does not validate search-parameter files against the DTD.
A search-parameter file may contain multiple file sets, targets and replacements, and each file set may contain multiple pathnames and pathname filters, though only a single pathname, inclusion filter, exclusion filter, target and replacement are used in a search. For each of those five parameters, RegexSearch's user interface allows you to select from the list of available values, to edit or delete existing values and to add new ones to the list.
When RegexSearch is run, search parameters are read from the file denoted by the configuration property path.defaultSearchParameters. The current set of search parameters can be saved with the File > Save search parameters command, and files saved in this way can be opened with the File > Open search parameters command. When you open a new search-parameter file or exit RegexSearch, you will be prompted to save the current search parameters if a search-parameter file was read, either automatically at startup or explicitly, and the parameters have changed since the file was read. A change to the file-set index or a parameter index is regarded as a change to the file.
A file set specifies the files that are searched in a find or find-and-replace operation. A file set has a file-set kind and, depending on its file-set kind, it may also have a pathname and two kinds of pathname filter: an inclusion filter and an exclusion filter.
Depending on the file-set kind, the pathname of a file set may be direct or indirect. A direct pathname denotes either a single file or a base directory that, in conjunction with inclusion and exclusion filters, defines the scope of a search. An indirect pathname denotes a file that contains a list of files and directories to be searched. A pathname may be absolute or relative; a relative pathname is relative to the current working directory.
The inclusion filters and exclusion filters of a file set are two kinds of pathname filter. A pathname filter is a set of patterns that determines, usually in conjunction with a base pathname, the files that are searched. A file is included in a search if it matches at least one pattern in the inclusion filter AND none of the patterns in the exclusion filter. The maximum number of patterns in a filter is 64.
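The rule for combining the two filters can be sketched as follows. This is an illustration only, not RegexSearch's code: the class name FilterLogic is hypothetical, and each pattern is represented by a plain Predicate so that the example does not depend on the wildcard syntax.

```java
import java.util.List;
import java.util.function.Predicate;

final class FilterLogic {

    // A file is searched if at least one inclusion pattern matches it and no
    // exclusion pattern matches it.
    static boolean isIncluded(String pathname,
                              List<Predicate<String>> inclusion,
                              List<Predicate<String>> exclusion) {
        boolean included = inclusion.stream().anyMatch(p -> p.test(pathname));
        boolean excluded = exclusion.stream().anyMatch(p -> p.test(pathname));
        return included && !excluded;
    }

    public static void main(String[] args) {
        // Stand-in patterns; real patterns would use the wildcards of a pathname filter
        List<Predicate<String>> include = List.of(s -> s.endsWith(".java"), s -> s.endsWith(".xml"));
        List<Predicate<String>> exclude = List.of(s -> s.contains("/test/"));
        System.out.println(isIncluded("src/main/App.java", include, exclude));
        System.out.println(isIncluded("src/test/AppTest.java", include, exclude));
    }
}
```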
A pattern is a pathname that may include wildcards. There are three wildcards: two filename wildcards and a pathname wildcard.
The filename wildcards, ? and *, have their usual meaning: ? matches a single character and * matches zero or more characters in a pathname component (a filename or directory name). For example, the pattern "foo*.txt" will match the filenames foo.txt, food.txt and football.txt. Following the UNIX convention (but differing from the Windows convention), a dot (.) has no special significance in patterns: it is matched by the * wildcard. Thus, the pattern "foo*" will match the filenames foo, football.txt and food.store.log.
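The behaviour of the filename wildcards can be reproduced by translating a pattern to a regular expression. The sketch below is one way of doing this and is not taken from RegexSearch; the class name FilenameWildcards is hypothetical, and matching is case-sensitive here, which may differ from the behaviour on some platforms.

```java
import java.util.regex.Pattern;

final class FilenameWildcards {

    // Converts a filename pattern to a regular expression: '?' matches a single
    // character, '*' matches zero or more characters, and every other character
    // (including '.') is matched literally. Wildcards do not cross a '/'.
    static Pattern toRegex(String pattern) {
        StringBuilder regex = new StringBuilder();
        for (char ch : pattern.toCharArray()) {
            switch (ch) {
                case '?': regex.append("[^/]"); break;
                case '*': regex.append("[^/]*"); break;
                default:  regex.append(Pattern.quote(String.valueOf(ch)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    static boolean matches(String pattern, String filename) {
        return toRegex(pattern).matcher(filename).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("foo*.txt", "football.txt"));
        System.out.println(matches("foo*", "food.store.log"));
        System.out.println(matches("foo*.txt", "foo"));
    }
}
```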
The pathname wildcard, **, matches zero or more pathname components. Its use in a pathname pattern is analogous to the use of * in a filename pattern. By itself, the pattern ** is the recursive analogue of *: it matches all files in or below the base directory. Used as a pathname component in a larger pattern, ** denotes a recursive portion of the pathname that may be bounded above or below by a non-recursive pathname. For example,
.txt in the base directory and in all directories below the base directory;
.xml in all directories named config that are in or below the directory named editor in the base directory;
.java in or below any directory named test that is below the base directory (or in the base directory, if it is named test).
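The "zero or more pathname components" rule can be sketched by matching a pattern against a pathname one component at a time. This is an illustration under stated assumptions, not RegexSearch's algorithm: the class name PathnameWildcards and the patterns in main are hypothetical, pathnames are relative to the base directory and use / as the separator, and the implicit ** after a trailing separator or an existing directory is not handled.

```java
import java.util.regex.Pattern;

final class PathnameWildcards {

    static boolean matches(String pattern, String path) {
        return match(pattern.split("/"), 0, path.split("/"), 0);
    }

    private static boolean match(String[] pat, int i, String[] path, int j) {
        if (i == pat.length)
            return j == path.length;
        if (pat[i].equals("**")) {
            // '**' matches zero or more pathname components
            for (int k = j; k <= path.length; k++) {
                if (match(pat, i + 1, path, k))
                    return true;
            }
            return false;
        }
        return j < path.length
                && componentMatches(pat[i], path[j])
                && match(pat, i + 1, path, j + 1);
    }

    // Matches a single component using the filename wildcards '?' and '*'
    private static boolean componentMatches(String pattern, String name) {
        StringBuilder regex = new StringBuilder();
        for (char ch : pattern.toCharArray()) {
            if (ch == '?')
                regex.append('.');
            else if (ch == '*')
                regex.append(".*");
            else
                regex.append(Pattern.quote(String.valueOf(ch)));
        }
        return Pattern.matches(regex.toString(), name);
    }

    public static void main(String[] args) {
        System.out.println(matches("**/*.txt", "notes.txt"));
        System.out.println(matches("editor/**/config/*.xml", "editor/ui/config/keys.xml"));
        System.out.println(matches("**/test/**/*.java", "src/Main.java"));
    }
}
```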
A pathname-filter pattern may be either relative to a base directory (in a directory or list file set), or it may be absolute. The pathname components in a pattern are separated with a / character (U+002F). (A \ may be used as the directory separator on the Windows platform, but / is recommended because \ is used as the escape character in filter fields.) A pattern that ends with a directory separator is assumed to be followed by an implicit **. A pattern that, when appended to its base directory, denotes an existing directory is assumed to be followed by an implicit /**.
A pattern may contain dot and double-dot components (. and ..), but only if they appear before the first wildcard in the pattern.
A file is matched against a pathname-filter pattern by converting both the pattern (appended to a base directory, if the pattern is relative) and the pathname of the target file to a canonical form. An error may occur in converting the pattern to canonical form if, for example, the resulting pathname is illegal or access to part of the file system is not permitted.
A file-set kind may be one of File, Directory, List, Results or Clipboard. The five kinds of file set are described below.
The search is performed on the file denoted by the pathname of the file set. No inclusion filter or exclusion filter is applied.
The pathname of the file set denotes the base directory of the search. (A search is not necessarily confined to this directory and the directories below it because the inclusion filter may contain patterns that denote pathnames outside the base directory.) An inclusion filter and an exclusion filter may be specified. If no inclusion filter is specified, a filter consisting of the single pattern ** (match all recursively) is assumed. Any relative patterns (see pathname filter) in the inclusion filter and exclusion filter are relative to the base directory.
The files in a directory are searched in order of filename. The ordering is lexicographic (ie, the Unicode values of characters in the filename are compared) and platform-dependent: it is case-sensitive on Linux/UNIX systems, but alphabetic case is ignored on Windows systems. Recursion is specified implicitly by pathname wildcards. A recursive search on a directory is depth-first: files in subdirectories are searched before the files in the directory. Like files, subdirectories are searched in order of name. If the inclusion filter contains any absolute patterns, the files or directories specified by those patterns are searched after the base directory.
The pathname of the file set is assumed to denote a text file, each of whose non-empty lines denotes the pathname of a file or directory that is to be searched. A line of the list file may contain a comment, beginning with a ; character. If a line contains a comment, any characters after the last non-space character before the comment are ignored (eg, the line "simple-filename.txt ; file #23" is parsed as "simple-filename.txt"). Empty lines are ignored. The pathnames are validated before the search starts, and the search will not proceed unless each pathname denotes an existing directory or a regular file.
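The parsing of a list file, as described above, might look like the following sketch. It is not RegexSearch's parser: the class name ListFileParser is hypothetical, the first ; on a line is assumed to start the comment, and the validation of pathnames against the file system is omitted.

```java
import java.util.ArrayList;
import java.util.List;

final class ListFileParser {

    // Extracts the pathnames from the lines of a list file: the comment (if
    // any) and the spaces before it are removed, and empty lines are skipped.
    static List<String> parse(List<String> lines) {
        List<String> pathnames = new ArrayList<>();
        for (String line : lines) {
            int index = line.indexOf(';');
            String text = (index < 0) ? line : line.substring(0, index);
            text = text.stripTrailing();
            if (!text.isEmpty())
                pathnames.add(text);
        }
        return pathnames;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("simple-filename.txt ; file #23", "", "; comment only", "src/main");
        System.out.println(parse(lines));
    }
}
```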
An inclusion filter and an exclusion filter may be specified. If no inclusion filter is specified, a filter consisting of the single pattern ** (match all recursively) is assumed.
The search of the pathnames in the list is equivalent to a sequence of searches on file sets whose file-set kind is either File or Directory according to whether a pathname denotes a file or directory. If the pathname denotes a directory, the inclusion filter and any exclusion filter are applied to it. If the inclusion filter contains any absolute patterns, the files or directories specified by those patterns are searched after all the pathnames in the list.
The files that will be searched are those from the last list of files to be saved with the Search > Save results command. The current list of saved results can be viewed with the Search > View saved results command. No inclusion filter or exclusion filter is applied.
This denotes a "pseudo-file": the contents of the system clipboard. If the clipboard contains text, the text will be searched in the same way as if it were read from a file. In a find-and-replace search, if any changes are made to the text, the modified text is put back on the clipboard at the end of the search. No inclusion filter or exclusion filter is applied.
The target of the search — the pattern that you are attempting to match in the files that are searched — can be either literal text or a regular expression. The corresponding kinds of search are referred to below as literal-text search and regular-expression search.
The replacement is an expression that will be used to replace occurrences of the target pattern in a find-and-replace search. The interpretation of the replacement differs according to whether the target is literal text or a regular expression. Both types of replacement may contain metasymbols: special sequences that are introduced with an escape character. By default, the escape character is a backslash, \, but it can be changed with the general.replacementEscapeCharacter configuration property if, for example, you want to avoid having to escape the backslashes in Windows pathnames. In a replacement, an escape character must always be escaped by prefixing another escape character to it (eg, \\, if the escape character is \).
The following metasymbols may appear in a literal-text replacement string. It is assumed that the escape character is \.
\t | Tab character, U+0009 |
\n | Line-feed character, U+000A |
\unnnn | Unicode character U+nnnn, where n is a hexadecimal digit character, [0-9A-Fa-f] |
\\ | Literal escape character |
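The metasymbols in the table above can be expanded with a single pass over the replacement string. The following is a minimal sketch, not RegexSearch's implementation: the class name LiteralReplacement is hypothetical, and throwing an exception for an unknown or incomplete metasymbol is an assumption about error handling.

```java
final class LiteralReplacement {

    // Expands the metasymbols of a literal-text replacement. The escape
    // character is a parameter because it is configurable.
    static String expand(String replacement, char escape) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < replacement.length()) {
            char ch = replacement.charAt(i++);
            if (ch != escape) {
                out.append(ch);
                continue;
            }
            if (i >= replacement.length())
                throw new IllegalArgumentException("Dangling escape character");
            char next = replacement.charAt(i++);
            if (next == escape) {
                out.append(escape);            // literal escape character
            } else if (next == 't') {
                out.append('\t');              // tab, U+0009
            } else if (next == 'n') {
                out.append('\n');              // line feed, U+000A
            } else if (next == 'u') {
                // Unicode character given by four hexadecimal digits
                String hex = replacement.substring(i, Math.min(i + 4, replacement.length()));
                if (!hex.matches("[0-9A-Fa-f]{4}"))
                    throw new IllegalArgumentException("Invalid Unicode metasymbol");
                out.append((char) Integer.parseInt(hex, 16));
                i += 4;
            } else {
                throw new IllegalArgumentException("Unknown metasymbol: " + escape + next);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("name:\\tvalue\\\\end", '\\'));
    }
}
```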
The following metasymbols may appear in a regular-expression replacement string. It is assumed that the escape character is \.
\t | Tab character, U+0009 |
\n | Line-feed character, U+000A |
\unnnn | Unicode character U+nnnn, where n is a hexadecimal digit character, [0-9A-Fa-f] |
\\ | Literal escape character |
\n | Capturing group in the target pattern, where n is the decimal index of the group |
\Ln | Capturing group in the target pattern, where n is the decimal index of the group. All alphabetic characters in the group are converted to lower case. |
\Un | Capturing group in the target pattern, where n is the decimal index of the group. All alphabetic characters in the group are converted to upper case. |
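The capturing-group metasymbols can be illustrated with java.util.regex. The sketch below handles only the three group forms (plain, lower case and upper case) and passes every other escape sequence through unchanged; it is not RegexSearch's implementation. The class name RegexReplacement is hypothetical, the escape character is assumed to be \, and the group index is assumed to extend over all the decimal digits that follow.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexReplacement {

    private static boolean isDigit(String s, int index) {
        return index < s.length() && s.charAt(index) >= '0' && s.charAt(index) <= '9';
    }

    // Expands the group metasymbols of a replacement for the current match
    static String expand(String replacement, Matcher match) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < replacement.length()) {
            char ch = replacement.charAt(i++);
            if (ch != '\\' || i >= replacement.length()) {
                out.append(ch);
                continue;
            }
            // An optional L or U before the group index selects a case conversion
            char mode = replacement.charAt(i);
            if ((mode == 'L' || mode == 'U') && isDigit(replacement, i + 1))
                i++;
            else
                mode = 0;
            if (isDigit(replacement, i)) {
                int start = i;
                while (isDigit(replacement, i))
                    i++;
                String group = match.group(Integer.parseInt(replacement.substring(start, i)));
                if (group == null)
                    group = "";
                if (mode == 'L')
                    group = group.toLowerCase(Locale.ROOT);
                else if (mode == 'U')
                    group = group.toUpperCase(Locale.ROOT);
                out.append(group);
            } else {
                // Other metasymbols are passed through unchanged in this sketch
                out.append('\\').append(replacement.charAt(i++));
            }
        }
        return out.toString();
    }

    // Replaces every match of regex in text with the expanded replacement
    static String replaceAll(String regex, String replacement, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        StringBuilder out = new StringBuilder();
        int end = 0;
        while (m.find()) {
            out.append(text, end, m.start()).append(expand(replacement, m));
            end = m.end();
        }
        return out.append(text.substring(end)).toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceAll("(\\w+) (\\w+)", "\\U1-\\L2-\\1", "Hello World"));
    }
}
```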
The main display consists of two windows:
The main window is always visible. The control dialog may be hidden and made visible with the Hide control dialog / Show control dialog command.
The width and/or height of some text components are specified in logical units of columns and rows. The width of a column and the height of a row are determined by the font that is used to display text within the component: the height of a row is the height of the font, and the width of a column is the width of a zero character (U+0030), or, if the font doesn't have a glyph for the zero character, the width of the glyph that is used for characters that are not defined.
The text view is the text area at the top of the main window in which the contents of a file are displayed. The text view is not editable. The following attributes of the text view are configurable:
The number of viewable columns in the text view is also used as the number of viewable columns in the result area. If the two areas use different fonts, the actual width of the text view and result area is the wider of the two areas.
The colours of the text view are also applied to the result area and to the fields in the Search options dialog.
When a file containing tab characters (U+0009) is displayed in the text view, RegexSearch uses two configuration properties — tab-width filters and default tab width — to determine how the tab characters are converted to spaces. A tab-width filter maps a filename filter (a set of patterns that are used to match filenames) to the number of spaces that will be used to replace tab characters when displaying a matching file. If none of the defined tab-width filters matches the file, the default tab width is used. If the tab width is zero, tab characters are not expanded but rendered as a U+2192 (rightwards arrow) character, or as the "not defined" glyph if the font doesn't contain a glyph for the arrow character.
The filename-filter part of the tab-width filter consists of one or more filename patterns separated by spaces; for instance, "*.cpp *.h". If a filename matches one of the patterns, it is included in the search in the case of the inclusive filter or excluded from the search in the case of the exclusive filter. A pattern may be a literal filename or it may contain the wildcards * and ?, which have their usual meaning: * matches zero or more characters and ? matches a single character.
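The selection of a tab width and the expansion of tabs can be sketched as follows. This is an illustration under several assumptions, not RegexSearch's code: the class name TabWidths is hypothetical, the first matching filter is assumed to win, and a tab is assumed to advance to the next tab stop rather than being replaced by a fixed number of spaces.

```java
import java.util.Map;
import java.util.regex.Pattern;

final class TabWidths {

    // Matches a filename against a pattern with the wildcards '*' and '?'
    static boolean matches(String pattern, String filename) {
        StringBuilder regex = new StringBuilder();
        for (char ch : pattern.toCharArray()) {
            if (ch == '?')
                regex.append('.');
            else if (ch == '*')
                regex.append(".*");
            else
                regex.append(Pattern.quote(String.valueOf(ch)));
        }
        return Pattern.matches(regex.toString(), filename);
    }

    // Returns the width of the first filter that has a pattern matching the
    // filename, or the default width if no filter matches. Each key is a
    // filename filter: one or more patterns separated by spaces.
    static int tabWidthFor(String filename, Map<String, Integer> filters, int defaultWidth) {
        for (Map.Entry<String, Integer> filter : filters.entrySet()) {
            for (String pattern : filter.getKey().trim().split(" +")) {
                if (matches(pattern, filename))
                    return filter.getValue();
            }
        }
        return defaultWidth;
    }

    // Expands tabs to the next tab stop; a width of zero renders each tab as
    // a rightwards arrow instead.
    static String expandTabs(String line, int tabWidth) {
        StringBuilder out = new StringBuilder();
        for (char ch : line.toCharArray()) {
            if (ch != '\t')
                out.append(ch);
            else if (tabWidth == 0)
                out.append('\u2192');
            else
                out.append(" ".repeat(tabWidth - out.length() % tabWidth));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        int width = tabWidthFor("main.cpp", Map.of("*.cpp *.h", 4), 8);
        System.out.println("[" + expandTabs("a\tb", width) + "]");
    }
}
```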
The result area is the text area at the bottom of the main window in which the results of a search are displayed. The result area is not editable. The following information is displayed in the result area after a search:
After a search, files that are listed in the result area can be opened with an external editor as though the Edit > Edit file command were issued on the file during a search. The command that invokes the external editor is issued by holding down the Ctrl key and clicking the left mouse button on the chosen pathname in the result area.
The following attributes of the result area are configurable:
The number of viewable columns in the text view is also used as the number of viewable columns in the result area. If the two areas use different fonts, the actual width of the text view and result area is the wider of the two areas.
The maximum number of columns in the result area is fixed at 1024.
The colours of the result area are also applied to the text view.
The file-set controls, which can be found in the top row of the control dialog, consist of a drop-down list for selecting the file-set kind, a group of three buttons for inserting, duplicating and deleting file sets, and a group of four buttons for navigating the list of file sets and changing the position of the current file set in the list.
The index of the current file set and the number of file sets in the list are shown in a box between the two pairs of navigation buttons. "End" indicates that the file-set position is at the end of the list; there is no current file set. A file set may be inserted at the end of the list.
The drop-down list is used to select the file-set kind. The Pathname field and Include and Exclude fields are enabled or disabled according to the file-set kind.
A file set can be added to and removed from the list of file sets with the commands that are associated with the group of three buttons in the top row of the control dialog. Each command can also be issued from the keyboard.
The Insert command inserts a new file set into the list at the current file-set index. To add a new file set to the end of the list, first navigate to the end of the list. The Insert command can be issued by pressing the F2 key.
The Duplicate command makes a copy of the current file-set, inserts the copy into the list after the current index, then selects the copy. The Duplicate command can be issued by pressing the F3 key.
The Delete command deletes the current file-set after you have confirmed the deletion. The Delete command can be issued by pressing the F4 key.
The list of file sets can be navigated and the position of the current file set in the list can be changed with the commands that are associated with the group of four arrow buttons and barred-arrow buttons in the top row of the control dialog. Each command can also be issued from the keyboard.
The arrow buttons select the previous or next file set in the list. The current file-set index continues to change while the mouse button is pressed or until the start or end of the list is reached. Holding down the Ctrl key while clicking on or pressing the arrow buttons will move the current file set up or down the list. The Go to previous and Go to next commands can be issued by pressing the F6 and F7 keys respectively. The Move up and Move down commands can be issued by pressing Ctrl+F6 and Ctrl+F7 respectively.
The barred-arrow buttons select the first file set in the list or go to the end of the list (where no file set is selected). The Go to start and Go to end commands can be issued by pressing the F5 and F8 keys respectively.
The five most prominent components in the control dialog are referred to as parameter fields. Two of the fields — the Target and Replacement fields — are text areas rather than fields, and the size of these text areas can be configured with the appearance.parameterEditorSize property. The width of these two fields determines the width of the other three parameter fields.
Each parameter field maintains a history list: a list of the most recent values that were entered in the field, up to a maximum of 64 items. A parameter field is similar in operation to a combo box except that an item is not moved to the top of the list when it is selected. (The order of items in the list may be changed in the editor; see below.)
A value is entered into the field explicitly by pressing Ctrl+Enter or implicitly when
When a value is entered in the field, it is inserted at the top of the list.
A history list can be navigated and edited in several ways. Navigation and editing commands are available from a pop-up menu that is activated in a system-dependent manner (eg, by pressing or releasing the right mouse button) or by pressing the context-menu key when the field has keyboard focus. The Select previous item and Select next item commands that are available from the pop-up menu can also be issued by pressing Ctrl+PageUp and Ctrl+PageDown respectively. The Delete command can be issued by pressing Ctrl+Shift+Delete.
All the parameter fields have an Edit command that displays an editor in which the items in the field's history list can be edited. The command is available from the field's pop-up menu and can also be issued by pressing Alt+Enter. (For the filter fields, the command can be issued with the Edit button adjacent to the field.) Within the list editor, the position of an item in the list can be changed by dragging it with the mouse, or by pressing Ctrl+Shift+Up or Ctrl+Shift+Down when the list has keyboard focus. The Delete key and Delete button delete the selected item after confirmation, while Shift+Delete and Shift+left-click on the Delete button delete the selected item without confirmation.
A pathname can be entered in the field by typing, by selecting a file using the "…" button adjacent to the field, or by dragging a file or directory from, for example, a file browser and dropping it onto the field or onto other parts of the control dialog or onto the main window.
The fields contain a pathname filter: a set of patterns separated by spaces. Within the field, the backslash, \, acts as an escape character to allow the inclusion of space characters in patterns. The escape convention in the filter fields is that a character following a \ is treated as a literal character, and a single trailing \ is ignored. Thus, you would use a \ followed by a space for a literal space and \\ for a literal backslash. Because of this, it is recommended that you use a / to separate pathname components in patterns on the Windows platform.
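The escape convention of the filter fields can be sketched as a small tokeniser. This is an illustration of the rules stated above, not RegexSearch's parser; the class name FilterFieldParser is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class FilterFieldParser {

    // Splits the text of a filter field into patterns. An unescaped space
    // separates patterns, a character that follows a backslash is taken
    // literally, and a single trailing backslash is ignored.
    static List<String> split(String field) {
        List<String> patterns = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < field.length(); i++) {
            char ch = field.charAt(i);
            if (ch == '\\') {
                if (i + 1 < field.length())
                    current.append(field.charAt(++i));
            } else if (ch == ' ') {
                if (current.length() > 0) {
                    patterns.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(ch);
            }
        }
        if (current.length() > 0)
            patterns.add(current.toString());
        return patterns;
    }

    public static void main(String[] args) {
        // The field text is: src/*.java my\ docs/*.txt
        System.out.println(split("src/*.java my\\ docs/*.txt"));
    }
}
```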
The individual patterns of a pathname filter can be edited from the Edit pattern dialog — the third-level editor that is invoked with the Edit command in the Edit filter dialog that is invoked by the Edit command in the list editor that is invoked by the Edit command in the Include or Exclude field. (Got that?) Note that no escape character is used in the Pattern field of the Edit pattern dialog.
The history list of the Exclude field may contain an empty string (ie, "exclude nothing").
The Target and Replacement fields are actually text areas that can contain multiple lines of text. The Replacement field is enabled only if the Replace check box is selected.
Text in the Target and Replacement fields can include tab characters (U+0009) and line-feed characters (U+000A), which are entered in the field by pressing Ctrl+Tab and Enter respectively. Line feeds are not displayed in a special way in the field, so, if your target or replacement isn't behaving as you expected, it may be that you have an unwanted (and invisible) line feed at the end of the field.
The fields may use a tab surrogate to display tab characters. Some characters in the fields may be escaped in two different ways: tabs and line feeds can be escaped separately, and an Escape command can be applied to the field. The Escape command behaves differently in the Target field and the Replacement field.
Within the Target and Replacement fields, tabs are replaced with the character that is denoted by the appearance.tabSurrogate configuration property. The default tab surrogate is the tab character (U+0009) itself; in this case, tabs are displayed as a number of spaces up to the next tab stop, and the tab width is denoted by the tabWidth.targetAndReplacement configuration property. It is important to understand that the tab surrogate is not just a substitute glyph: it actually replaces each occurrence of the tab character in the field unless tabs are escaped. When the content of the field is used (eg, in a search), the tab surrogate is converted either to a tab character or to a tab sequence ("\t") as appropriate, so you should choose as tab surrogate a character that is unlikely to appear in any target or replacement text.
Tabs and line feeds may be escaped (ie, converted to the escape sequences "\t" and "\n" respectively) in the Target and Replacement fields by selecting the Tabs escaped or Line feeds escaped item in the field's pop-up menu. (In reality, it is the tab surrogate that is converted to "\t", but the existence of tab surrogates is ignored in this section so as not to complicate matters.) Deselecting the menu item reverses the procedure: each occurrence of "\t" or "\n" is converted to a tab character or line-feed character, even if the \ is itself escaped with another backslash. You will need to be careful about toggling the escaping of tabs and line feeds if the text contains literal "\t" or "\n" sequences. Within the field, the escaping of tabs and line feeds can be toggled from the keyboard with Ctrl+T and Ctrl+N respectively. Indicators appear alongside a field in which tabs or line feeds are escaped.
When a regular-expression search is performed, any tab characters and line-feed characters in the Target field are escaped automatically in the target pattern that is used in the search. Tabs and line feeds are also escaped in the list of target or replacement items displayed in the editor, in the Select item submenu displayed in the field's pop-up menu, and when targets and replacements are saved to a search-parameter file.
The Escape button adjacent to the Target field is enabled only when the Regular expression check box is selected. The Escape command for the Target field prefixes a \ to each metacharacter in the field. The set of metacharacters on which the command operates is denoted by the general.escapedMetacharacters configuration property. The default value of the property is the set of characters that are used in metasymbols outside a character class delimited by square brackets:
$ ( ) * + . ? [ \ ] ^ { | }
(] and } are not metacharacters but are included in the set for symmetry.)
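The effect of the Escape command on the Target field can be sketched in a few lines. This is an illustration only; the class name TargetEscaper is hypothetical, and the constant holds the default set of metacharacters listed above.

```java
final class TargetEscaper {

    // The default set of escaped metacharacters: $ ( ) * + . ? [ \ ] ^ { | }
    static final String DEFAULT_METACHARACTERS = "$()*+.?[\\]^{|}";

    // Prefixes a backslash to each character of the text that is in the set
    // of metacharacters.
    static String escape(String text, String metacharacters) {
        StringBuilder out = new StringBuilder();
        for (char ch : text.toCharArray()) {
            if (metacharacters.indexOf(ch) >= 0)
                out.append('\\');
            out.append(ch);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("price: $9.99 (approx.)", DEFAULT_METACHARACTERS));
    }
}
```

Because the backslash is itself in the default set, a backslash that is already in the text (for example, the one in an escaped tab) is also prefixed with a backslash.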
If tabs or line feeds are escaped in the Target field and the backslash character is in the set of escaped metacharacters, the \ prefix to the escaped tabs and line feeds will itself be escaped by the Escape command. Unless the text contains literal "\t" or "\n" sequences, it may be best to unescape tabs and line feeds before issuing the Escape command.
The Escape command for the Replacement field, which can be issued with the button adjacent to the field, prefixes an escape character to each escape character in the field. (The escape character for a replacement is specified by the general.replacementEscapeCharacter configuration property.) If the escape character is \, it may be best to unescape tabs and line feeds before issuing the Escape command unless the text contains literal "\t" or "\n" sequences.
At the bottom of the control dialog are four check boxes.
If the Replace check box is selected, the search mode is find-and-replace. If the check box is not selected, the search mode is find.
If the Regular expression check box is selected, the search target is a regular expression. If the check box is not selected, the search target is literal text.
If the Ignore case check box is selected, the case of alphabetic characters is ignored when matching a search target. If the check box is not selected, matching is case-sensitive.
If the Show not found check box is selected, the results that are displayed in the result area at the end of a search will include a list of files that were searched but in which the target was not found. If the check box is not selected, the list of "not found" files will be omitted from the search results.
The main window is not directly resizeable but its size can be modified indirectly by means of some of the configuration properties that relate to the text view and result area.
The control dialog is resizeable. The initial size of the dialog is determined by the appearance.parameterEditorSize configuration property, which can be edited in the Preferences dialog. If the control dialog has been resized using the GUI, the value of the appearance.parameterEditorSize property that is written to the configuration file when RegexSearch exits is obtained from the actual dimensions of the Target and Replacement fields, overriding any changes to the configuration property that were made in the Preferences dialog.
As was mentioned above, the size of some text components — including the text view and result area — is determined by the font that they use to display text, as well as any properties that explicitly control their dimensions in terms of columns and rows. Any changes to configuration properties that affect the size of the main window or control dialog will not take effect until the next time that RegexSearch is run.
RegexSearch's main commands are accessible from its main menu. Some of the commands are also accessible from a pop-up menu that is activated in a system-dependent manner (eg, by pressing or releasing the right mouse button) while the mouse cursor is over one of the text areas or the background of the control dialog.
The Open search parameters command brings up a file-selection dialog in which you can choose the file that you want to open. If the file has the correct format, the search parameters are loaded from it and the application's display is updated. If the current search parameters were read from a file, either automatically at startup or explicitly, and the parameters have changed since the file was read, you will be asked whether you want to save the current parameters before the new parameters are loaded.
The Save search parameters command brings up a file-selection dialog in which you can choose the file to which you want to save the current set of search parameters. A file that is saved in this way can be specified as the default search parameters that will be loaded when RegexSearch starts up.
This command terminates the application. If you have made changes to search parameters that were read from a file, you will be asked whether you want to save them.
The Edit file command executes a specified system command in a separate process. The command line, which is specified with the editor.command configuration property, may include a placeholder for the pathname of the file that is currently displayed in the text view. The intended purpose of the command line is to open the currently displayed file in a text editor, though it can be used for other purposes.
When using the Edit file command during a find-and-replace search, remember that the file in the text editor will not be synchronised with the file in RegexSearch's buffer, which may subsequently be written back to the file system with modifications if replacements have been made in the file, even if the replacements were made before the Edit file command was issued. (If the Edit file command is issued while the Search options dialog is displayed, the Next file option in the dialog can be used to discard any changes to the current file.)
After a search has finished, the external editor can be invoked on files that are selected from those listed in the result area.
This command is available only during a find-and-replace search. It behaves similarly to the Edit file command except that the associated system command (specified with the editor.command configuration property) is not executed until the search of the current file is finished and, if any replacements have been made, the modified file has been written to the file system.
If the control dialog is visible, this command is named Hide control dialog and it makes the control dialog invisible. If the control dialog is hidden, this command is named Show control dialog and it makes the control dialog visible.
When you issue a Search command, RegexSearch first validates the search parameters and displays an error message for the first parameter that is invalid. If the file-set kind is List, the specified list file is read and parsed. In a search of multiple files, the files are searched in the order described in the Directory and List file-set kinds.
Within a file, the search proceeds from the start of the file to the end. If a match of the target expression is found, the search will resume at the first character after the last character in the matched text, or, if a replacement is made, at the first character after the replacement.
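The resumption rule for matches can be illustrated with the Java regex engine directly. The following is a minimal sketch, not RegexSearch's own code: the class and method names are hypothetical, and it covers only the find case (no replacements).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ResumeAfterMatchSketch {

    /** Returns the start index of each successive match of the target in the text. */
    static List<Integer> matchStarts(String target, CharSequence text) {
        List<Integer> starts = new ArrayList<>();
        Matcher matcher = Pattern.compile(target).matcher(text);
        // After a non-empty match, find() resumes at the first character after the matched text
        while (matcher.find())
            starts.add(matcher.start());
        return starts;
    }

    public static void main(String[] args) {
        // The target "aa" is matched at indices 0 and 2 of "aaaaa"; matches never overlap
        System.out.println(matchStarts("aa", "aaaaa"));
    }
}
```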
When the first match of the target expression is found, the file in which the match occurred is displayed in the text view, and the matched text is highlighted. A Search options dialog box is displayed; the type of dialog depends on the search mode, find or find-and-replace. Because the Search options dialog is non-modal, the text in the text view can be scrolled while the dialog is displayed.
The options in the Search options dialog can be selected either by clicking on the appropriate button or by pressing a key or key combination. In addition to the usual Java Alt+<key> combination, each option (apart from Cancel, whose keyboard equivalent is the Escape key) can be selected by pressing the key by itself (ie, without the Alt key).
At the end of a search, the aggregate results are displayed in the result area. The results include a list of any files or directories that were not processed because of an error and a list of files or directories whose pathname could not be converted to canonical form. If the file-writing mode is Use a temporary file, preserve attributes, the results of a find-and-replace search include a separate list of files that were written but whose attributes were not set.
In find mode, the Search options dialog has four options:
In find-and-replace mode, the Search options dialog has seven options:
Some aspects of RegexSearch's behaviour when processing files are worth noting in order that you may avoid the unintended consequences of that behaviour. RegexSearch assumes that the files it reads during a search are text files that have a specified character set and encoding. It also assumes that certain characters or character sequences in the files are line separators. The implications of these two assumptions are discussed below.
When a file is read during a search, the bytes of the file are converted to 16-bit Unicode according to the configuration property general.characterEncoding. A character encoding, such as UTF-8, maps between sequences of bytes and 16-bit Unicode values.
Within the file, all occurrences of the characters LF (U+000A) and CR (U+000D), and the character sequence (CR, LF) are treated as line separators. The kind of line separator is recorded for possible later use. If the file contains more than one kind of line separator, the most numerous kind of line separator prevails. If the numbers of different kinds are equal, the precedence from highest to lowest is: LF – CR – CR+LF.
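The rule for choosing the prevailing kind of line separator could be sketched as follows. This is an illustration only, not the code that RegexSearch uses: the class, enum and method names are hypothetical, and returning LF for a file that contains no line separators is an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LineSeparatorSketch {

    /** Kinds of line separator, declared in order of precedence (highest first). */
    enum Kind { LF, CR, CR_LF }

    /** Returns the most numerous kind of line separator in the text. */
    static Kind detect(CharSequence text) {
        int[] counts = new int[Kind.values().length];
        // CR+LF is tried first so that the pair is not counted as a CR and an LF
        Matcher matcher = Pattern.compile("\r\n|\r|\n").matcher(text);
        while (matcher.find()) {
            String separator = matcher.group();
            if (separator.equals("\n"))
                counts[Kind.LF.ordinal()]++;
            else if (separator.equals("\r"))
                counts[Kind.CR.ordinal()]++;
            else
                counts[Kind.CR_LF.ordinal()]++;
        }
        // A strict comparison means that, on a tie, the kind with higher precedence is kept.
        // Assumption: LF is returned if the text contains no line separators.
        Kind result = Kind.LF;
        for (Kind kind : Kind.values()) {
            if (counts[kind.ordinal()] > counts[result.ordinal()])
                result = kind;
        }
        return result;
    }
}
```

For example, `detect("a\r\nb\r\nc\n")` yields `CR_LF` (two CR+LF against one LF), whereas `detect("a\rb\nc")` yields `LF` because the counts are equal and LF has the highest precedence.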
In find mode, the processing of a file ends at this point: the processing is internal, and no physical changes are made to the stored file. In find-and-replace mode, a file may be modified as a result of a replacement, and the file written back to the file system. If the general.preserveLineSeparator configuration property has the value true, the file is written with the kind of line separator that was detected when it was read; otherwise, it is written with an LF line separator.
The way in which a modified file is written to the file system is determined by the general.fileWritingMode configuration property. A file may be written directly, or it may be written first to a temporary file that is renamed after the entire file has been written. If a temporary file is used, the owner, group and permissions of the file may be set to those of the original file on systems that support it. (Linux is the only system that is known to do so.) See the description of the general.fileWritingMode property for more details on its use.
The Copy results command copies the contents of the result area to the system clipboard. The general.copyResultsAsListFile configuration property controls the format of the text that is placed on the clipboard: the results can be either in the form in which they appear in the result area or in a form that is suitable for use as a list file in a new search, with match/replacement counts converted to comments.
The Save results command saves the list of files from the results of the last search (ie, the files in which an occurrence of the target was found). A list of files that is saved with this command can be used as the file set for a further search if you select Results as the file-set kind.
The View saved results command displays the last list of files to be saved with the Save results command, which allows you to see the files that will comprise the file set if Results is selected as the file-set kind.
The Preferences command brings up a tabbed dialog box in which the configuration properties of RegexSearch can be edited. The properties on the various tabbed pages are described below.
Some of the configuration properties in the Preferences dialog are edited with a spinner — a graphical component that consists of a text field adjacent to a pair of small buttons. The value in the text field may be edited manually, or it may be incremented and decremented by one of the following methods:
Using the last two methods, the amount by which the value is incremented or decremented can be modified by holding down the Ctrl, Shift or Ctrl+Shift keys, which correspond to increments of 10, 100 and 1000 respectively.
<default encoding>
, which denotes the platform- and
locale-dependent default character encoding.
\
prefixed to them) when the Escape command is applied to a regular-expression
target.
$()*+.?[\]^{|}
\
(backslash, U+005C).
Yes
, alphabetic case will be ignored when matching
pathnames against the patterns in an inclusion or exclusion filter and when
matching filenames against the patterns in a tab-width filter (eg, the filename
pattern "*.txt" will match the filenames foo.txt
and
BAR.TXT
).
No
.
chmod, chgrp and chown commands are issued with the --reference option, which should set the file's permissions, group and owner to those of the original file. Linux is known to support the --reference option for these three commands; other UNIX-like systems may support it. Because it involves the additional execution of three system commands, this file-writing mode is slower than the other two.
Use a temporary file
.
Yes
, a file in which replacements are made during a
find-and-replace search will be written with the same kind of line
separator — LF (U+000A), CR (U+000D) or CR+LF — that it had when it
was read. (Files that have more than one kind of line separator will be written
with the kind of line separator that is most numerous.) If this property has
the value No
, files modified by RegexSearch will be written with LF
(UNIX-style) line separators.
Yes
.
Yes
, pathnames are displayed in a reduced
"UNIX style" in some parts of the GUI. A pathname is converted from
its platform-specific form in two steps:
~
.
\
on Windows systems) is replaced
by /
.
No
.
Yes
, all the text in a text field will be
automatically selected when the field gains keyboard focus, regardless of how
the focus is transferred.
Yes
.
Yes
, the location of the main window on the screen
will be saved to the configuration file when you exit the application. The next
time that RegexSearch is run, its main window will be positioned at the
previously saved location.
No
.
Yes
and the control dialog has not
been explicitly hidden, it is automatically hidden during a search and made
visible again when the search ends. If the control dialog is hidden in this
way, the Show control dialog command
can be used to make it visible during a search.
No
.
No
, the results are in the form in which they appear
in the result area of the main window. If you select Yes
, the
results are converted into a form that is suitable for use as a list file in a new search.
No
.
awt.useSystemAAFontSettings
.
Default
.
Default
.
% (U+0025) acts as an escape character. %f is a placeholder for the pathname of the file that is to be edited. All other characters that follow % are treated as themselves; thus, a literal space is represented by % followed by a space (ie, U+0025, U+0020), and a literal % is represented by %%.
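The escape rules for the editor command might be applied along the following lines. This is a sketch under stated assumptions, not RegexSearch's implementation: the class and method names are hypothetical, and it assumes that unescaped spaces separate the arguments of the command and that a trailing lone % is kept as it is.

```java
import java.util.ArrayList;
import java.util.List;

final class EditorCommandSketch {

    /** Splits a command into arguments, expanding %f and the other %-escapes. */
    static List<String> toArguments(String command, String pathname) {
        List<String> arguments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < command.length(); i++) {
            char ch = command.charAt(i);
            if (ch == '%' && i + 1 < command.length()) {
                // %f is the pathname; any other character after % stands for itself
                char next = command.charAt(++i);
                current.append(next == 'f' ? pathname : String.valueOf(next));
            } else if (ch == ' ') {
                // Assumption: an unescaped space ends the current argument
                if (current.length() > 0) {
                    arguments.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(ch);
            }
        }
        if (current.length() > 0)
            arguments.add(current.toString());
        return arguments;
    }
}
```

With the command `my% editor --line 1 %f` and the pathname `/tmp/a b.txt`, the sketch produces the three arguments `my editor`, `--line`, `1` followed by the pathname as a single fourth argument, because the space in the pathname is never seen by the splitter.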
Some of the configuration properties will take effect when the Preferences dialog is accepted (by closing it with OK); other properties (eg, the look-and-feel and fonts) will not take effect until the next time that RegexSearch is run.
The configuration file is normally saved automatically when RegexSearch exits, if the configuration has changed. The Save configuration command in the Preferences dialog can be used to save a configuration file explicitly.
Within RegexSearch, the parsing and matching of regular expressions is performed by the Java regex engine. The purpose of this section is to present a summary of the syntax of Java's regular expressions, which is similar to that of Perl and Python. This section is not intended to be a tutorial on the use of regular expressions; see the references at the end of this section for suggested sources of further information.
Note: There are several differences between the syntax of regular expressions in Java and the syntax of regular expressions in Linux/UNIX tools such as sed and (g)awk.
In a search, the target pattern, replacement pattern and file are all composed of Unicode characters. RegexSearch converts files from bytes to 16-bit Unicode characters according to the scheme described in How files are processed. In particular, the line separators CR and CR+LF are converted to LF before a file is searched. Thereafter, by default, the only line separator recognised during a search is the line feed character (U+000A) unless the (?-d) flag appears in the target pattern.
When selected, the Ignore case check box in the control dialog enables the default form of case-insensitive matching, which applies only to characters in the US-ASCII character encoding. To apply case-insensitive matching to all Unicode characters, use the (?u) flag in the target pattern.
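The difference between the two forms of case-insensitive matching can be seen with java.util.regex directly. In this sketch, which is only an illustration, the embedded (?i) flag stands in for the Ignore case check box, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class CaseInsensitiveSketch {

    /** Returns true if the regular expression matches somewhere in the input. */
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String target = "r\u00e9sum\u00e9";   // "résumé"
        String text = "R\u00c9SUM\u00c9";     // "RÉSUMÉ"

        // ASCII-only case folding: "é" (U+00E9) does not match "É" (U+00C9)
        System.out.println(found("(?i)" + target, text));

        // With (?u), case folding is extended to all Unicode characters
        System.out.println(found("(?iu)" + target, text));
    }
}
```

The first call reports no match and the second reports a match.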
Within a regular expression, all characters are treated as literal characters except for twelve metacharacters — characters that have a special meaning and don't behave normally in regular expressions. The metacharacters are:
$ ( ) * + . ? [ \ ^ { |
A metacharacter can be escaped (that is, its special meaning can be removed) by prefixing a backslash, \, to it. An escaped metacharacter represents its corresponding literal character; thus, \? represents the character ?, and \\ represents a literal backslash.
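As an illustration of escaping, the following sketch prefixes a backslash to each of the twelve metacharacters listed above so that a piece of text can be used as a literal target. The class and method names are hypothetical, and the sketch is not the code behind RegexSearch's Escape command.

```java
import java.util.regex.Pattern;

final class EscapeSketch {

    /** The twelve metacharacters listed above. */
    static final String METACHARACTERS = "$()*+.?[\\^{|";

    /** Prefixes a backslash to each metacharacter in the text. */
    static String escape(String text) {
        StringBuilder buffer = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (METACHARACTERS.indexOf(ch) >= 0)
                buffer.append('\\');
            buffer.append(ch);
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        String text = "1+1=2?";
        // Unescaped, + and ? act as quantifiers, so the pattern does not match its own text
        System.out.println(Pattern.matches(text, text));
        // Escaped, every character is literal
        System.out.println(Pattern.matches(escape(text), text));
    }
}
```

Java also provides `Pattern.quote()`, which achieves a similar result by wrapping the text in \Q and \E.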
Some metacharacters are used by themselves within regular expressions; others are used to create special sequences called metasymbols. (In the documentation for java.util.regex.Pattern, metasymbols are referred to as constructs.) For example, several alphanumeric characters become metasymbols when preceded by a backslash.
. | By default, a dot matches any single character except a newline. The (?s) flag enables a mode in which a dot matches any character including a newline. |
^ | Matches the beginning of a line. Example: ^# matches a # character at the beginning of a line. |
$ | Matches the end of a line or the end of the input string (in RegexSearch, the end of a file). Example: ;$ matches a ; character at the end of a line or at the end of a file. |
\ |
The backslash has two roles:
|
| |
The vertical bar separates alternatives.
Example:
his|her|its matches any one of the strings "his",
"her" or "its".
|
[ ] |
Matches one character from a character class — a set
of characters enclosed within the square brackets. The set of
characters can be specified in a number of ways. It may be:
If the first character within the square brackets is a circumflex,
^ ,
the set of characters is negated; that is, the character class matches one
character that is not in the set of characters that follows the
^ .
Example:
[^0-9] matches any character except a (Western) decimal
digit; [a-z&&[^ij]] is equivalent to [a-hk-z] .
|
( ) | Encloses a capturing group. The set of characters within the parentheses is treated as a unit; eg, ^(foo|bar) matches either "foo" or "bar" at the beginning of a line. The group is called capturing because the text that it matched can be included later in the target pattern or in the replacement by specifying the index of the group in a metasymbol (see \n in Alphanumeric metasymbols). A cluster (a non-capturing group) can be specified by enclosing a set of characters between (?: and ) (eg, (?:foo|bar) matches either "foo" or "bar" without capturing it). |
Quantifiers specify how many times the preceding character or group should match. The different types of quantifier are available in three flavours, which Java refers to as greedy, reluctant and possessive. (Greedy quantifiers are also known as maximal, and reluctant quantifiers are also known as lazy or minimal.)
A greedy (maximal) quantifier starts by matching as much as possible of the input string. If this doesn't allow the whole pattern to be matched, the greedy quantifier matches progressively less of the input string until either the whole pattern is matched or the match fails.
A reluctant (minimal) quantifier starts by matching as little as possible of the input string. If this doesn't allow the whole pattern to be matched, the reluctant quantifier matches progressively more of the input string until either the whole pattern is matched or the match fails.
A possessive quantifier starts, like a greedy quantifier, by matching as much as possible of the input string. However, if this doesn't allow the whole pattern to be matched, no backing-up is performed, and the match fails.
Greedy | Reluctant | Possessive | Meaning |
---|---|---|---|
* | *? | *+ | Matches zero or more times |
+ | +? | ++ | Matches one or more times |
? | ?? | ?+ | Matches once or not at all |
{n} | {n}? | {n}+ | Matches exactly n times |
{n,} | {n,}? | {n,}+ | Matches at least n times |
{n,m} | {n,m}? | {n,m}+ | Matches at least n times but not more than m times |
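The three flavours of quantifier can be compared by applying them to the same input. The following sketch uses java.util.regex directly; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuantifierSketch {

    /** Returns the text of the first match, or null if there is no match. */
    static String firstMatch(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";

        // Greedy: .* takes everything, then backs off until "foo" can match
        System.out.println(firstMatch(".*foo", input));    // xfooxxxxxxfoo

        // Reluctant: .*? takes as little as possible
        System.out.println(firstMatch(".*?foo", input));   // xfoo

        // Possessive: .*+ takes everything and never backs off, so "foo" cannot match
        System.out.println(firstMatch(".*+foo", input));   // null
    }
}
```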
\0n | The character with octal value 0n, where n is in [0-7] |
\0nn | The character with octal value 0nn, where n is in [0-7] |
\0mnn | The character with octal value 0mnn, where m is in [0-3] and n is in [0-7] |
\n | The sequence matched by the nth capturing group, where n is a decimal number (eg, \1) |
\a | The alert character (BEL), U+0007 |
\A | The beginning of the input string (in RegexSearch, the beginning of a file) |
\b | A word boundary |
\B | Not a word boundary |
\cX | The control character, Control-X |
\d | A digit, [0-9] |
\D | A non-digit, [^0-9] |
\e | The escape character (ESC), U+001B |
\E | End the quotation of metacharacters started by \Q |
\f | The form feed character (FF), U+000C |
\n | The line feed character (LF), U+000A |
\p{prop} | Any character in the character class named prop |
\P{prop} | Any character not in the character class named prop |
\Q | Quote (escape) metacharacters until \E |
\r | The carriage return character (CR), U+000D |
\s | A whitespace character, [ \t\n\x0B\f\r] |
\S | A non-whitespace character, [^\s] |
\t | The tab character (HT), U+0009 |
\unnnn | The Unicode character U+nnnn, where n is a hexadecimal digit character, [0-9A-Fa-f] |
\w | A word character, [0-9A-Za-z_] |
\W | A non-word character, [^\w] |
\xnn | The character with hexadecimal value 0xnn |
\z | The end of the input string (in RegexSearch, the end of a file) |
\Z | The end of the input string (in RegexSearch, the end of a file), apart from a final \n |
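Two of the metasymbols in the table, the back-reference \1 and the quotation pair \Q and \E, are shown in the following sketch. It uses java.util.regex directly, and the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetasymbolSketch {

    /** Returns the text of the first match, or null if there is no match. */
    static String firstMatch(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        // \1 refers to the text captured by the first group: this finds a repeated word
        System.out.println(firstMatch("\\b(\\w+)\\s+\\1\\b", "it is is here"));   // is is

        // Between \Q and \E, metacharacters such as + are treated as literal characters
        System.out.println(firstMatch("\\Q1+1\\E", "so 1+1 is 2"));               // 1+1
    }
}
```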
Named character classes are metasymbols of the form \p{name} or \P{name}. There are three types of named character class: POSIX, Unicode and Java.
Lower | A lowercase alphabetic character, [a-z] |
Upper | An uppercase alphabetic character, [A-Z] |
ASCII | An ASCII character, [\x00-\x7F] |
Alpha | An alphabetic character, [\p{Lower}\p{Upper}] |
Digit | A decimal digit character, [0-9] |
Alnum | An alphanumeric character, [\p{Alpha}\p{Digit}] |
Punct | Punctuation: one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ |
Graph | A visible character, [\p{Alnum}\p{Punct}] |
Print | A printable character, [\p{Graph}\x20] |
Blank | A space or a tab character, [ \t] |
Cntrl | A control character, [\x00-\x1F\x7F] |
XDigit | A hexadecimal digit character, [0-9a-fA-F] |
Space | A whitespace character, [ \t\n\x0B\f\r] |
The Unicode character classes are too numerous to list all of them here. They include Unicode character blocks (eg, Greek) and character categories (eg, uppercase letters). When forming a metasymbol, In is prefixed to the name of a Unicode block (eg, \p{InGreek}), and Is is optionally prefixed to the name of a Unicode category (eg, \p{Lu} or \p{IsLu}).
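The block and category forms can be tried out with a small helper that keeps only the characters matched by a class. This is an illustrative sketch that uses java.util.regex directly; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnicodeClassSketch {

    /** Returns the characters of the input that are matched by the character class. */
    static String keep(String characterClass, String input) {
        StringBuilder buffer = new StringBuilder();
        Matcher matcher = Pattern.compile(characterClass).matcher(input);
        while (matcher.find())
            buffer.append(matcher.group());
        return buffer.toString();
    }

    public static void main(String[] args) {
        // Category: uppercase letters, with and without the optional "Is" prefix
        System.out.println(keep("\\p{Lu}", "\u00c9cole Polytechnique"));
        System.out.println(keep("\\p{IsLu}", "\u00c9cole Polytechnique"));

        // Block: characters in the Greek block
        System.out.println(keep("\\p{InGreek}", "a\u03b1b\u03b2"));
    }
}
```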
The following table lists abbreviations for values in the Unicode General Category:
L | Letter |
Lu | Letter, uppercase |
Ll | Letter, lowercase |
Lt | Letter, titlecase |
Lm | Letter, modifier |
Lo | Letter, other |
M | Mark |
Mn | Mark, non-spacing |
Mc | Mark, spacing combining |
Me | Mark, enclosing |
N | Number |
Nd | Number, decimal digit |
Nl | Number, letter |
No | Number, other |
P | Punctuation |
Pc | Punctuation, connector |
Pd | Punctuation, dash |
Ps | Punctuation, open |
Pe | Punctuation, close |
Pi | Punctuation, initial quote (may behave like Ps or Pe depending on usage) |
Pf | Punctuation, final quote (may behave like Ps or Pe depending on usage) |
Po | Punctuation, other |
S | Symbol |
Sm | Symbol, mathematical |
Sc | Symbol, currency |
Sk | Symbol, modifier |
So | Symbol, other |
Z | Separator |
Zs | Separator, space |
Zl | Separator, line |
Zp | Separator, paragraph |
Cc | Other, control |
Cf | Other, format |
Cs | Other, surrogate |
Co | Other, private use |
Cn | Other, not assigned |
The Java character classes will probably be of interest only to Java programmers. The name of the character class is formed by substituting java for is in the name of a method of the java.lang.Character class that begins with is. For example, the character class javaLetterOrDigit is equivalent to java.lang.Character.isLetterOrDigit().
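A Java character class is used in a pattern like any other named character class. In this brief sketch, the class and method names are hypothetical.

```java
final class JavaClassSketch {

    /** Removes every character for which Character.isLetterOrDigit() is false. */
    static String lettersAndDigits(String input) {
        return input.replaceAll("[^\\p{javaLetterOrDigit}]", "");
    }

    public static void main(String[] args) {
        // The hyphen and underscore are removed
        System.out.println(lettersAndDigits("a-1_b"));
    }
}
```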
Extended sequences are metasymbols of the form (?...). The modifiers, [dimsux], and their "off" versions (preceded by a minus sign) can be concatenated within an extended sequence; for example, (?iu-ms) switches on i and u and switches off m and s.
(?:…) | Non-capturing group (cluster). |
(?>…) | Non-capturing group referred to in Perl as a nonbacktracking subpattern. |
(?d) (?-d) | Enable/disable UNIX lines mode. If enabled, only the UNIX line separator (\n, U+000A) is recognised by the metacharacters ., ^ and $; otherwise, the following characters and character sequences are recognised as line separators: \n (U+000A), \r (U+000D), \r\n (U+000D, U+000A), U+0085, U+2028, U+2029. UNIX lines mode is enabled by default. |
(?i) (?-i) | Enable/disable case-insensitive matching. Case sensitivity is initially determined by the ignore case search parameter, but it can be changed within the target pattern by means of this flag. By default, case-insensitive matching applies only to characters in the US-ASCII character encoding, but this can be extended to all Unicode characters with the (?u) flag. |
(?m) (?-m) | Enable/disable multiline mode. In multiline mode, the metacharacters ^ and $ match at the beginning and end, respectively, of a line; otherwise, they match only at the beginning and end of the input string (ie, the file). Multiline mode is enabled by default. |
(?s) (?-s) | Enable/disable dotall mode. In dotall mode (known in Perl as single-line mode), the . (dot) metacharacter matches any one character including a line separator; otherwise, . matches any one character except for a line separator. |
(?u) (?-u) | By default, the case-insensitive matching that is controlled by the ignore case search parameter and the (?i) flag applies only to characters in the US-ASCII character encoding. Using the (?u) flag, case-insensitive matching can be extended to all Unicode characters. |
(?x) (?-x) | Enable/disable comments mode. In comments mode, whitespace and comments in the target pattern are ignored. A comment starts with a # character and ends at the end of the line. |
(?=pattern) | Positive lookahead: a zero-width assertion that is true if pattern immediately follows the assertion. |
(?!pattern) | Negative lookahead: a zero-width assertion that is true if pattern does not immediately follow the assertion. |
(?<=pattern) | Positive lookbehind: a zero-width assertion that is true if pattern immediately precedes the assertion. |
(?<!pattern) | Negative lookbehind: a zero-width assertion that is true if pattern does not immediately precede the assertion. |
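Some of the flags and assertions in the table are shown in the following sketch, which uses java.util.regex directly. The class and method names are hypothetical. Note that plain Java does not enable multiline mode by default, so the sketch switches it on explicitly with (?m), whereas RegexSearch enables it for you.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ExtendedSequenceSketch {

    /** Returns the number of matches of the regular expression in the input. */
    static int count(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (matcher.find())
            count++;
        return count;
    }

    /** Returns the text of the first match, or null if there is no match. */
    static String firstMatch(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        // Multiline mode: ^ matches at the beginning of each line, not just of the input
        System.out.println(count("^#", "#a\nb\n#c"));        // 1
        System.out.println(count("(?m)^#", "#a\nb\n#c"));    // 2

        // Dotall mode: the dot also matches a line separator
        System.out.println(count("a.b", "a\nb"));            // 0
        System.out.println(count("(?s)a.b", "a\nb"));        // 1

        // Positive lookahead: a word that is immediately followed by "("
        System.out.println(firstMatch("\\w+(?=\\()", "call foo(bar)"));        // foo

        // Positive lookbehind: digits that are immediately preceded by "$"
        System.out.println(firstMatch("(?<=\\$)\\d+", "cost $42 and 17"));     // 42
    }
}
```

In each case the assertion itself consumes no characters: the "(" and the "$" are not part of the matched text.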
The following sources were used in writing this section:
java.util.regex.Pattern
class.
The Java documentation recommends the following book as providing a detailed treatment of the use of regular expressions:
Friedl, Jeffrey, Mastering regular expressions 3rd ed., O'Reilly, 2006. ISBN 0596528124.
Where indicated elsewhere in this document, pathname parameters and properties in RegexSearch can contain special constructs for system properties, environment variables and the user's home directory. The special constructs are expanded when the pathname is used.
Java system properties (eg, user.home) and environment variables (eg, PATH) are referenced by enclosing them between ${ and }; that is, they must have the form ${<name>}. A Java system property takes precedence over an environment variable with the same name.
${user.home}/projects
${HOME}/projects
sys.
to it.
${sys.user.home}/projects
env.
to it.
${env.HOME}/projects
~ in a pathname is expanded into the user's home directory using the user.home system property, which is usually equivalent to the environment variable $HOME on Linux/UNIX systems or %USERPROFILE% on Windows systems.
~/projects
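The expansion rules described in this section could be sketched as follows. This is not RegexSearch's implementation. The class and method names are hypothetical, the two lookups are passed in as functions so that the sketch can be tried with test data, and two points are assumptions: a reference that cannot be resolved is left unchanged, and ~ is expanded only at the start of a pathname.

```java
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PathnameExpansionSketch {

    private static final Pattern REFERENCE = Pattern.compile("\\$\\{([^}]+)\\}");

    static String expand(String pathname,
                         Function<String, String> systemProperties,
                         Function<String, String> environment) {
        String text = pathname;

        // Assumption: only a leading ~ denotes the user's home directory
        String home = systemProperties.apply("user.home");
        if (home != null && (text.equals("~") || text.startsWith("~/")))
            text = home + text.substring(1);

        StringBuilder buffer = new StringBuilder();
        Matcher matcher = REFERENCE.matcher(text);
        int position = 0;
        while (matcher.find()) {
            buffer.append(text, position, matcher.start());
            String value = lookUp(matcher.group(1), systemProperties, environment);
            // Assumption: an unresolved reference is left as it is
            buffer.append(value != null ? value : matcher.group());
            position = matcher.end();
        }
        return buffer.append(text, position, text.length()).toString();
    }

    private static String lookUp(String name,
                                 Function<String, String> systemProperties,
                                 Function<String, String> environment) {
        // A "sys." or "env." prefix restricts the lookup to one source
        if (name.startsWith("sys.") && name.length() > 4)
            return systemProperties.apply(name.substring(4));
        if (name.startsWith("env.") && name.length() > 4)
            return environment.apply(name.substring(4));
        // Otherwise, a system property takes precedence over an environment variable
        String value = systemProperties.apply(name);
        return (value != null) ? value : environment.apply(name);
    }

    public static void main(String[] args) {
        System.out.println(expand("~/projects", System::getProperty, System::getenv));
    }
}
```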
The table below lists the configuration properties of RegexSearch. Apart from the app.configDir property, which, for obvious reasons, cannot be used within a configuration file, all properties can be used in the two configuration locations: command-line properties and configuration file.
When used in a -D command-line property, app. must be prefixed to the property key (eg, app.general.mainWindowLocation).
The <index> of an indexed property must be a three-digit decimal-string representation of the zero-based index of the property (eg, the third tab-width filter would be app.tabWidth.fileFilter.002).
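If the keys of indexed properties are generated by a program, the three-digit index can be produced with a format string. In this small sketch, the class and method names are hypothetical.

```java
import java.util.Locale;

final class IndexedKeySketch {

    /** Returns the command-line key of the tab-width filter with the given zero-based index. */
    static String tabWidthFilterKey(int index) {
        if (index < 0 || index > 999)
            throw new IllegalArgumentException("Index out of bounds: " + index);
        // %03d pads the index with leading zeros to three digits
        return String.format(Locale.ROOT, "app.tabWidth.fileFilter.%03d", index);
    }

    public static void main(String[] args) {
        // The third filter has the zero-based index 2
        System.out.println(tabWidthFilterKey(2));
    }
}
```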
When used in a configuration file, the components of the property keys become element names in the XML document hierarchy. (The app prefix of the property key in command-line properties corresponds to the root element of the XML document.) The form of properties in a configuration file was changed in version 2.1 of RegexSearch. Configuration files in the old format can be read, but files are written in the new format.
Any commas (,) or backslash characters (\) in the name of a font must be escaped by prefixing a \ character to them.
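The escaping of a font name could be done as in the following sketch, in which the class and method names are hypothetical.

```java
final class FontNameEscapeSketch {

    /** Prefixes a backslash to each comma and backslash in the font name. */
    static String escape(String fontName) {
        StringBuilder buffer = new StringBuilder();
        for (int i = 0; i < fontName.length(); i++) {
            char ch = fontName.charAt(i);
            if (ch == ',' || ch == '\\')
                buffer.append('\\');
            buffer.append(ch);
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        // A made-up font name that contains a comma
        System.out.println(escape("Foo, Bar"));
    }
}
```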
In the table below, the initial character of an italicised component of a property value denotes its data type according to the following convention:
i | integer |
p | platform-specific pathname, which may contain special constructs |
s | string |
c | character |
Property key | Property value |
---|---|
configDir | pPathname |
appearance.lookAndFeel | sName |
appearance.parameterEditorSize | iNumColumns, iNumRows |
appearance.resultAreaNumRows | iNumRows |
appearance.tabSurrogate | sUnicode4 |
appearance.textAntialiasing | default | none | normal | subpixelHRgb | subpixelHBgr | subpixelVRgb | subpixelVBgr |
appearance.textAreaColour.background | iRed, iGreen, iBlue |
appearance.textAreaColour.highlightBackground | iRed, iGreen, iBlue |
appearance.textAreaColour.highlightText | iRed, iGreen, iBlue |
appearance.textAreaColour.text | iRed, iGreen, iBlue |
appearance.textViewMaxNumColumns | iNumColumns |
appearance.textViewTextAntialiasing | default | none | normal | subpixelHRgb | subpixelHBgr | subpixelVRgb | subpixelVBgr |
appearance.textViewViewableSize | iNumColumns, iNumRows |
editor.command | sCommand |
font.comboBox | sName, plain | bold | italic | boldItalic, iSize |
font.main | sName, plain | bold | italic | boldItalic, iSize |
font.parameterEditor | sName, plain | bold | italic | boldItalic, iSize |
font.resultArea | sName, plain | bold | italic | boldItalic, iSize |
font.textField | sName, plain | bold | italic | boldItalic, iSize |
font.textView | sName, plain | bold | italic | boldItalic, iSize |
general.characterEncoding | sName |
general.controlDialogLocation | iX, iY |
general.copyResultsAsListFile | false | true |
general.escapedMetacharacters | sCharacters |
general.ignoreFilenameCase | false | true |
general.fileWritingMode | direct | useTempFile | useTempFilePreserveAttributes |
general.hideControlDialogWhenSearching | false | true |
general.mainWindowLocation | iX, iY |
general.preserveLineSeparator | false | true |
general.replacementEscapeCharacter | cCharacter |
general.selectTextOnFocusGained | false | true |
general.showUnixPathnames | false | true |
path.defaultSearchParameters | pPathname |
tabWidth.default | iNumChars |
tabWidth.fileFilter.<index> | sPatterns : iNumChars |
tabWidth.targetAndReplacement | iNumChars |