Is the default separator only space for awk?
- The default field delimiter or field separator (FS) is `[ \t]+`, i.e. one or more space and tab characters. – Thor May 22 '15 at 20:58
- @Thor no, it's not. See the man page. – Ed Morton May 22 '15 at 21:05
- @EdMorton: Right, I missed newlines, i.e. `FS='[ \t\n]+'`. But this only has an effect when RS does not include newlines. – Thor May 24 '15 at 22:31
- @Thor not exactly since even if you have an `RS` that includes newlines the default `FS` will have an effect if you construct a string containing newlines and you do `split(string,arr)`. – Ed Morton May 24 '15 at 23:10
- Good question, not stupid at all – Timo Mar 14 '18 at 11:12
4 Answers
Here's a pragmatic summary that applies to all major Awk implementations:
- GNU Awk (gawk) - the default awk in some Linux distros
- Mawk (mawk) - the default awk in some Linux distros (e.g., earlier versions of Ubuntu; crysman reports that version 19.04 now comes with GNU Awk - see his comment below)
- BWK Awk - the default awk on BSD-like platforms, including macOS
On Linux, awk -W version will tell you which implementation the default awk is.
BWK Awk only understands awk --version (which GNU Awk understands in addition to awk -W version).
Recent versions of all these implementations follow the POSIX standard with respect to field separators[1] (but not record separators).
Glossary:
- RS is the input-record separator, which describes how the input is broken into records:
  - The POSIX-mandated default value is a newline, also referred to as \n below; that is, input is broken into lines by default.
  - On awk's command line, RS can be specified as -v RS=<sep>.
  - POSIX restricts RS to a literal, single-character value, but GNU Awk and Mawk support multi-character values that may be extended regular expressions (BWK Awk historically did not support that; a comment further below reports that newer versions do).
- FS is the input-field separator, which describes how each record is split into fields; it may be an extended regular expression.
  - On awk's command line, FS can be specified as -F <sep> (or -v FS=<sep>).
  - The POSIX-mandated default value is formally a space (0x20), but that space is not interpreted literally as the (only) separator; it has special meaning, see below.
By default:
- any run of spaces and/or tabs and/or newlines is treated as a field separator
- with leading and trailing runs ignored.
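A quick way to see the default behavior is to feed awk a line padded with spaces and a tab. This is only an illustrative sketch; the sample input and the variable name default_split are made up for the demonstration:

```shell
# Default FS: any run of spaces/tabs separates fields,
# and leading/trailing runs are ignored.
default_split=$(printf '  a \t b  \n' | awk '{ printf "%d:<%s>:<%s>", NF, $1, $2 }')
echo "$default_split"   # 2:<a>:<b>
```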
The POSIX spec. uses the abstraction <blank> for spaces and tabs, which is true for all locales, but could comprise additional characters in specific locales - I don't know if any such locales exist.
Note that with the default input-record separator (RS), \n, newlines typically do not enter the picture as field separators, because no record itself contains \n in that case.
Newlines as field separators do come into play, however:
- When RS is set to a value that results in records themselves containing \n instances (such as when RS is set to the empty string; see below).
- Generally, when the split() function is used to split a string into array elements without an explicit field-separator argument.
  - Even though the input records won't contain \n instances when the default RS is in effect, the split() function, when invoked without an explicit field-separator argument on a multi-line string from a different source (e.g., a variable passed via the -v option or as a pseudo-filename), always treats \n as a field separator.
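The split() case can be sketched as follows. The variable names (s, parts, n_default, n_space) are hypothetical; the example relies on -v processing escape sequences, so s ends up containing a real newline:

```shell
# s holds: a, newline, "b c"
# Without a separator argument, split() uses the default FS,
# so the newline and the space both separate: 3 elements.
n_default=$(awk -v s='a\nb c' 'BEGIN { print split(s, parts) }')

# With an explicit [ ] regex, only the literal space separates: 2 elements.
n_space=$(awk -v s='a\nb c' 'BEGIN { print split(s, parts, "[ ]") }')

echo "default: $n_default, [ ]: $n_space"   # default: 3, [ ]: 2
```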
Important NON-default considerations:
- Assigning the empty string to RS has special meaning: it reads the input in paragraph mode, meaning that records are separated by runs of empty lines (each run of non-empty lines forms one record), with leading and trailing runs of empty lines ignored.
- When you assign anything other than a literal space to FS, the interpretation of FS changes fundamentally:
  - A single character or each character from a specified character set is recognized individually as a field separator - not runs of it, as with the default.
    - For instance, setting FS to [ ] - even though it effectively amounts to a single space - causes every individual space instance in each record to be treated as a field separator.
    - To recognize runs, the regex quantifier (duplication symbol) + must be used; e.g., [\t]+ would recognize runs of tabs as a single separator.
  - Leading and trailing separators are NOT ignored, and, instead, separate empty fields.
  - Setting FS to the empty string means that each character of a record is its own field.
- As mandated by POSIX, if RS is set to the empty string (paragraph mode), newlines (\n) are also considered field separators, irrespective of the value of FS.
[1] Unfortunately, GNU Awk up to at least version 4.1.3 complies with an obsolete POSIX standard with respect to field separators when you use the option to enforce POSIX compliance, -P (--posix): with that option in effect and RS set to a non-empty value, newlines (\n instances) are NOT recognized as field separators. The GNU Awk manual spells out the obsolete behavior (but neglects to mention that it doesn't apply when RS is set to the empty string). The POSIX standard changed in 2008 (see comments) to also consider newlines field separators when FS has its default value - as GNU Awk has always done without -P (--posix).
Here are 2 commands that verify the behavior described above:
- With -P in effect and RS set to the empty string, \n is still treated as a field separator:
  gawk -P -F' ' -v RS='' '{ printf "<%s>, <%s>\n", $1, $2 }' <<< $'a\nb'
- With -P in effect and a non-empty RS, \n is NOT treated as a field separator - this is the obsolete behavior:
  gawk -P -F' ' -v RS='|' '{ printf "<%s>, <%s>\n", $1, $2 }' <<< $'a\nb'
A fix is coming, according to the GNU Awk maintainers; expect it in version 4.2 (no time frame given).
(Tip of the hat to @JohnKugelman and @EdMorton for their help.)
- Thanks mklement0, I read the reply from John as well and it seems only space is the default delimiter? However, you mentioned both space and Tab? Please feel free to correct me if I am wrong. :) – Lin Ma May 23 '15 at 04:07
- In short: While a space is _formally_ the default `FS` value, it _stands for_ space, tabs, and newlines. I've updated my answer to clarify. – mklement0 May 23 '15 at 20:12
- The POSIX standard changed and gawk is supporting the older version. See the 2004 standard (http://pubs.opengroup.org/onlinepubs/009695399/utilities/awk.html) which states `a field is a string of non-<blank>s` vs the 2013 standard (http://pubs.opengroup.org/onlinepubs/9699919799/utilities/awk.html) which states `a field is a string of non-<blank> non-<newline> characters`. You should email bug-gawk@gnu.org about that. – Ed Morton May 25 '15 at 15:46
- @EdMorton: Turns out I misread the POSIX spec and you were correct: the spec did change as you describe in POSIX.1-2008 (SUS v4). Per your suggestion (thanks) I've emailed bug-gawk@gnu.org and have since heard back that a fix is coming in the "next major release", but I'm unclear on what version that is. – mklement0 May 31 '15 at 18:55
- @mklement0 it seems Ubuntu 19.04 comes with both gawk and mawk, here is where you find it: http://releases.ubuntu.com/19.04/ubuntu-19.04-desktop-amd64.manifest . BUT, indeed, gawk is default: `❱ which -a "awk" /usr/bin/awk ❱ file /usr/bin/awk /usr/bin/awk: symbolic link to /etc/alternatives/awk ❱ file /etc/alternatives/awk /etc/alternatives/awk: symbolic link to /usr/bin/gawk` – crysman Jul 13 '19 at 22:20
- Thanks, @crysman - I've updated the answer to point to your comment. – mklement0 Jul 13 '19 at 22:32
The question "the default delimiter is only space for awk?" is ambiguous, but I'll try to answer both of the questions you might be asking.
The default value of the FS variable (which holds the field separator that tells awk how to separate records into fields as it reads them) is a single space character.
The thing that awk uses to separate records into fields is a "field separator" which is a regular expression with some additional functionality that only applies when the field separator is a single blank character. That additional functionality is that:
- Leading and trailing white space is ignored during field splitting.
- Fields are separated at chains of contiguous space characters, which include blanks, tabs and newlines.
- If you want to use a literal blank character as a field separator you must specify it as [ ] instead of just a standalone literal blank char like you could in a regexp.
In addition to field separators being used to split records into fields as the input is read they are used in some other contexts, e.g. the 3rd arg for split(), so it's important for you to know which contexts require a string or a regexp or a fieldsep and the man page clearly specifies each.
Among other things, the above explains this:
$ echo ' a b c ' | awk '{printf "%d: <%s> <%s> <%s>\n", NF, $1, $2, $3}'
3: <a> <b> <c>
$ echo ' a b c ' | awk -F' ' '{printf "%d: <%s> <%s> <%s>\n", NF, $1, $2, $3}'
3: <a> <b> <c>
$ echo ' a b c ' | awk -F'[ ]' '{printf "%d: <%s> <%s> <%s>\n", NF, $1, $2, $3}'
5: <> <a> <b>
so if you don't understand why the first 2 produce the same output but the last is different, please ask.
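The same distinction shows up in the 3rd arg of split(). A small sketch (the string and the names s, parts, parts2 and split_counts are only for illustration): a separator of " " behaves like the default field separator, while "[ ]" splits at every single blank.

```shell
split_counts=$(awk 'BEGIN {
    s = " a b "
    n_default = split(s, parts, " ")      # " " means default whitespace splitting
    n_literal = split(s, parts2, "[ ]")   # every single blank separates
    print n_default, n_literal
}')
echo "$split_counts"   # 2 4
```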
- Please don't confuse "blank" with "space". "space" is the actual space character (`0x20`), whereas "blank" is a [potentially locale-specific _abstraction_](http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html#tag_07_04_01): "In the POSIX locale, only the <space> and <tab> shall be included. In a locale definition file, the <space> and <tab> are automatically included in this class." (I can see no umbrella term in the POSIX spec that covers both "blank" and newlines.) – mklement0 May 22 '15 at 21:40
Let's take a look at the GNU awk man page:
FS - The input field separator, a space by default. See Fields, above.
To the Fields section!
As each input record is read, gawk splits the record into fields, using the value of the FS variable as the field separator. If FS is a single character, fields are separated by that character. If FS is the null string, then each individual character becomes a separate field. Otherwise, FS is expected to be a full regular expression. In the special case that FS is a single space, fields are separated by runs of spaces and/or tabs and/or newlines.
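To try the rules from that excerpt, here is a rough sketch with made-up input and hypothetical variable names. The null-string case is what the quoted gawk man page documents; I'm assuming your awk behaves the same way, so check it if you're not on gawk:

```shell
# FS is a single character: each ':' separates, so '::' yields an empty field.
nf_colon=$(echo 'a:b::c' | awk -F':' '{ print NF }')

# FS is the null string: each character becomes its own field
# (documented for gawk; assumed here for other implementations).
nf_chars=$(echo 'abc' | awk -v FS='' '{ print NF }')

# Otherwise FS is a regular expression: runs of digits separate here.
nf_regex=$(echo 'a12b3c' | awk -F'[0-9]+' '{ print NF }')

echo "$nf_colon $nf_chars $nf_regex"
```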
- Hi John, a bit lost in your reply. Does it mean only space, or both space/Tab are used as default delimiter? – Lin Ma May 23 '15 at 04:06
- Just to complement this answer: Even though the quotes are from the GNU Awk manual page, they also apply to the other Awk implementation that some Linux distros come with by default, Mawk (`mawk`; e.g., on Ubuntu) - and they also apply to BWK Awk, as found on BSD-like platforms, including macOS. – mklement0 Sep 09 '22 at 12:05
- @mklement0 : diff note : now ghetto `BWK awk` can also handle regex in `RS` : `jot -s '' -c - 33 126 | gtr -d '\n' | nawk '$-_ =NR "=NR:{ "($-_)" }:NF=" NF' RS='(:|[0-9]|\42)+' 1=NR:{ ! }:NF=1 2=NR:{ #$%&'()*+,-./ }:NF=1 3=NR:{ ;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_\`abcdefghijklmnopqrstuvwxyz{|}~ }:NF=1` – RARE Kpop Manifesto Sep 09 '22 at 15:00
'[ ]+' works for me.
Run awk -W version to get the awk version. Mine is GNU Awk 4.0.2.
# cat a.txt
tcp 0 0 10.192.25.199:65002 0.0.0.0:* LISTEN
tcp 0 0 127.0.0.1:26895 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:111 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:18422 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN
tcp 0 0 10.192.25.199:8888 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:50010 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:50075 0.0.0.0:* LISTEN
tcp 0 0 10.192.25.199:8093 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:8670 0.0.0.0:* LISTEN
For example, I want to get the listening port, so I need to split on runs of spaces plus ':'.
# cat a.txt | awk -F '[ ]+|:' '{print $5}'
65002
26895
111
18422
22
8888
50010
50075
8093
8670
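If you just want to approximate the default delimiter (runs of spaces only; unlike the true default, this does not cover tabs or newlines and does not ignore leading or trailing spaces), you can run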
# cat a.txt | awk -F '[ ]+' '{print $4}'
10.192.25.199:65002
127.0.0.1:26895
0.0.0.0:111
0.0.0.0:18422
0.0.0.0:22
10.192.25.199:8888
0.0.0.0:50010
0.0.0.0:50075
10.192.25.199:8093
0.0.0.0:8670
The result is as expected.
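One caveat worth checking: '[ ]+' and the real default are not the same when a line starts with spaces. A small sketch (the sample line and variable names are made up):

```shell
line='  tcp 0'

# True default: the leading spaces are ignored, giving 2 fields.
nf_default=$(echo "$line" | awk '{ print NF }')

# '[ ]+': the leading run is a separator, so an empty first field appears.
nf_regex=$(echo "$line" | awk -F'[ ]+' '{ print NF }')

echo "$nf_default $nf_regex"   # 2 3
```

For a.txt above this makes no difference, because no line begins with a space.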