awk Script Study Notes: Basic Syntax, Flow Control, and User-Defined Functions

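awk is an excellent text-processing tool that works record by record (by default, line by line), and it is also a programming language in its own right (a pattern scanning and processing language). It offers regular-expression matching, pattern-action rules, flow control, arithmetic, and a set of built-in variables and functions. These notes cover awk's basic syntax, flow control, and user-defined functions.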

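1. Basic Syntax

1.1 Creating and Running Scripts

(1) Creating

The first line, #!/usr/bin/awk -f, names the interpreter. It is only needed when the script is executed directly, but adding it is still recommended. A file extension (usually .awk) is not required either, but it makes scripts easier to manage.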

#!/usr/bin/awk -f

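(2) Running

./test.awk <file list>        the script must first be given execute permission
awk -f test.awk <file list>   execute permission is not required

The commands above run awk from a script file. For short programs, the awk program can also be given directly on the command line, quoted so that the shell does not interpret it. For example:

$ awk '{ sum += $2 } END { print sum }' Pride_and_Prejudice_count_words.txt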
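As a minimal sketch of the command-line form, the snippet below feeds three made-up "word count" lines to awk on standard input instead of a file and sums the second column. The input data and the shell variable name total are assumptions for illustration only.

```shell
# Hypothetical input: one word and its count per line.
total=$(printf 'the 4331\nof 3610\nand 3585\n' |
    awk '{ sum += $2 } END { print sum }')   # quote the program so the shell leaves $2 alone
echo "$total"   # prints 11526
```

The single quotes matter: without them the shell would expand $2 and split the program into several arguments before awk ever sees it.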

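1.2 Comments and Statement Blocks

(1) Comments

A comment starts with a hash sign (#) and runs to the end of the line. awk has no C-style /* */ comments and no dedicated multi-line comment syntax; to comment out several lines, start each of them with #.

(2) Statement blocks

Like most programming languages, awk uses braces to delimit statement blocks. A statement ends at the end of its line, so no trailing semicolon is needed, but semicolons can be used to put several statements on one line.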

1.3 Program Structure

An awk script is a sequence of pattern-action rules, as follows:

pattern { 
    action 
}
pattern { 
    action 
}
...

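Note: the opening brace must be on the same line as its pattern. If the brace starts a new line, the pattern becomes a rule with no action, and the braced block becomes a separate rule with an empty pattern. awk reads the input one line at a time and runs the action of every rule whose pattern matches. An empty pattern matches every line, so its action runs for all of them. The action may also be omitted: a rule that consists of only a pattern (no braces and no action) prints each matching line in full, which is equivalent to { print $0 }. The BEGIN and END patterns, however, must have an action.

awk supports the following patterns: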

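(1) BEGIN/END

Actions that run at the very beginning and the very end of the script (initialization and cleanup), loosely comparable to a constructor and a destructor. Note that even with several input files, BEGIN and END each run only once. BEGIN is a convenient place to initialize global variables, and END to print the final results.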
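The following sketch shows how BEGIN, an ordinary rule, and END interact. The three input lines and the variable names n, total, and report are made up for illustration.

```shell
report=$(printf 'a 1\nb 2\nc 3\n' | awk '
BEGIN { print "start" }                       # runs once, before any input is read
      { n++; total += $2 }                    # empty pattern: runs for every line
END   { print n " records, total " total }    # runs once, after all input
')
echo "$report"
# start
# 3 records, total 6
```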

(2) Empty

The empty pattern matches every record, so the action runs for every line. It has the form:

{ 
    action 
}

(3)Regexp Patterns

Regular-expression patterns; see Section 4 for details on regular expressions.

/regexp/ { action }

(4)Expression Patterns

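The action runs when the expression is true. The expression can use comparison operators as well as the Boolean operators &&, ||, and !. It has the form:

expression { action }

For example:

awk '$1 == "li" { print $2 }' mail-list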

(5)Ranges

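begpat, endpat { action }

A range pattern selects every line from one that matches begpat through the next one that matches endpat (inclusive), and runs the action on each of them. For example:

awk '$1 == "on", $1 == "off"' myfile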
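To make the inclusive behaviour concrete, here is a small sketch that runs the same range pattern on five made-up lines supplied on standard input. With no action given, each line in the range is printed whole.

```shell
selected=$(printf 'x 1\non 2\ny 3\noff 4\nz 5\n' |
    awk '$1 == "on", $1 == "off"')
echo "$selected"
# on 2
# y 3
# off 4
```

Both boundary lines are part of the output; the lines before "on" and after "off" are not.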

(6)BEGINFILE/ENDFILE

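Two special patterns for advanced control, available in gawk only. BEGINFILE runs just before the first record of each input file is read, and ENDFILE runs after the last record of each file.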

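1.4 Built-in Variables [1]

$0           The current record, as a whole
$1 ... $n    The n-th field of the current record; fields are separated by FS
FS           Input field separator; the default is a single space, which splits on runs of whitespace
NF           Number of fields in the current record (the number of columns)
NR           Number of records read so far across all input (the line number, starting at 1)
RS           Input record separator; the default is a newline

OFS          Output field separator; the default is a space
ORS          Output record separator; the default is a newline
ARGC         Number of command-line arguments
ARGV         Array of command-line arguments, indexed from 0 to ARGC-1
FILENAME     Name of the current input file
IGNORECASE   If true, matching ignores case (gawk)
ARGIND       Index in ARGV of the file currently being processed (gawk)
CONVFMT      Format for converting numbers to strings, e.g. %.6g
ENVIRON      Array of UNIX environment variables
ERRNO        UNIX system error message (gawk)
FIELDWIDTHS  Whitespace-separated list of input field widths (gawk)
FNR          Number of records read so far from the current input file
OFMT         Output format for numbers, e.g. %.6g
RSTART       Position of the first character matched by match()
RLENGTH      Length of the string matched by match()
SUBSEP       Subscript separator for multidimensional arrays; the default is \034

Note: by default, a record in awk is simply a line.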
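A short sketch of a few of these variables in use: FS and OFS are set in BEGIN, and NR, NF, and $1 are printed for each record. The colon-separated input lines are made up for illustration.

```shell
fields=$(printf 'root:x:0\nbin:x:1\n' |
    awk 'BEGIN { FS = ":"; OFS = "-" } { print NR, NF, $1 }')
echo "$fields"
# 1-3-root
# 2-3-bin
```

Because the items passed to print are separated by commas, they are joined with OFS on output.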

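1.5 Constants and Variables

(1) Constants

Numeric constants: integers, floating-point numbers, or scientific notation (e.g. 1.05e+2)
Octal and hexadecimal: prefixed with 0 and 0x/0X respectively (in program source, a gawk extension)
Strings: enclosed in double quotes
Regexp constants: enclosed in slashes, e.g. matches = /foo/ (which assigns the result of $0 ~ /foo/)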

(2) String operations

toupper(string)                     return a copy of string with each lowercase character converted to uppercase
tolower(string)                     return a copy of string with each uppercase character converted to lowercase
asort(source [, dest [, how ] ])    sort by element value (gawk extension)
asorti(source [, dest [, how ] ])   sort by index (gawk extension)

substr(string, start [, length])    return a substring; positions start at 1
index(in, find)                     return the position in the string in of the first occurrence of the string find
match(string, regexp [, array])     search string for regexp; regexp may be either a regexp constant (/…/) or a string constant ("…")

strtonum(str)                       examine str and return its numeric value (gawk extension)
length([string])                    return the number of characters in string; length() returns the length of $0
sprintf(format, expression1, …)     return the formatted string instead of printing it

split(string, array [, fieldsep [, seps ] ])        divide string into pieces separated by fieldsep; store the pieces in array and the separator strings in the seps array
patsplit(string, array [, fieldpat [, seps ] ])     like split(), but the pieces are defined by fieldpat (gawk extension)
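The sketch below exercises only the portable functions from the list above (split, substr, index, length, toupper). The sample string and the names s, parts, and strdemo are assumptions for illustration.

```shell
strdemo=$(awk 'BEGIN {
    s = "hello,awk,world"
    n = split(s, parts, ",")      # parts[1]="hello", parts[2]="awk", parts[3]="world"
    print n, parts[2]
    print substr(s, 7, 3)         # 3 characters starting at position 7
    print index(s, "awk")         # position of the first occurrence
    print length(s), toupper(parts[1])
}')
echo "$strdemo"
# 3 awk
# awk
# 7
# 15 HELLO
```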

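String substitution

sub(regexp, replacement [, target])         replace only the leftmost, longest match; target defaults to $0
gsub(regexp, replacement [, target])        replace every match (g stands for global)
gensub(regexp, replacement, how [, target]) if how begins with "g" or "G", replace every match as gsub does; otherwise how is a number selecting which match to replace (gawk extension; it returns the result and leaves target unchanged)

Note: an & in replacement stands for the substring that was matched (the precise substring). For example:

str = "daabaaa"
sub(/a+/, "C&C", str)   # str is now "dCaaCbaaa"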
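A runnable version of the example above, extended with gsub to contrast "first match" with "every match". The angle brackets in the gsub replacement are just an illustrative marker.

```shell
subdemo=$(awk 'BEGIN {
    str = "daabaaa"
    sub(/a+/, "C&C", str)         # leftmost, longest match only
    print str
    str = "daabaaa"
    n = gsub(/a+/, "<&>", str)    # every match; returns the number of replacements
    print n, str
}')
echo "$subdemo"
# dCaaCbaaa
# 2 d<aa>b<aaa>
```

Both functions modify their target in place and return a count, so the result is read from str rather than from the return value.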

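(3) Arrays

awk arrays do not need a declared size. Subscripts can be numbers or strings, so an array behaves more like a key-value map. There is no fixed starting index; functions that fill an array, such as split(), number the elements from 1. All subscripts are stored as strings, so a[2*5] has the subscript "10".

① Some operations

if (key in array)                  test whether an element exists; unlike if (array[key] != ""), it does not create the element and does not miss one whose value is empty
delete array[index-expression]     remove an element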
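The difference between the in operator and comparing an element with the empty string can be seen in a small sketch. The array name a and the keys "x" and "y" are made up for illustration.

```shell
members=$(awk 'BEGIN {
    a["x"] = ""                              # the element exists, its value is empty
    print (("x" in a) ? "yes" : "no")        # yes: in tests existence, not value
    print (("y" in a) ? "yes" : "no")        # no: never assigned
    tmp = a["y"]                             # merely referencing a["y"] creates it
    print (("y" in a) ? "yes" : "no")        # yes
    delete a["y"]
    print (("y" in a) ? "yes" : "no")        # no
}')
echo "$members"
```

This is why an existence test is usually written with in: a plain reference has the side effect of adding the key to the array.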

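② Traversing an array

The order in which for (var in array) visits the elements depends on the awk implementation. In gawk, the order can be chosen through PROCINFO["sorted_in"], used as follows: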

# can be set in a BEGIN rule
PROCINFO["sorted_in"] = "@ind_str_asc"

for (var in array)
    body
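A common use of for (var in array) is counting occurrences. The sketch below avoids PROCINFO so that it also runs on awk implementations other than gawk, and pipes the result through sort to get a stable order. The input words are made up.

```shell
counts=$(printf 'b\na\nb\nc\nb\na\n' | awk '
    { count[$1]++ }                               # one counter per distinct word
END { for (w in count) print w, count[w] }        # traversal order is unspecified
' | sort)
echo "$counts"
# a 2
# b 3
# c 1
```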

A custom comparison function can be supplied to define the order, but gawk already provides several predefined orders:

"@unsorted"     in arbitrary order, which is the default awk behavior.


"@ind_str_asc"        in ascending order compared as strings.
"@ind_num_asc"        in ascending order but force them to be treated as numbers. Any index with a non-numeric value will end up positioned as if it were zero.

"@ind_str_desc"       String indices ordered from high to low.
"@ind_num_desc"       Numeric indices ordered from high to low.


"@val_type_asc"       Element values, based on type, ordered from low to high. 
"@val_str_asc"        Element values, treated as strings, ordered from low to high. 
"@val_num_asc"        Element values, treated as numbers, ordered from low to high.

"@val_type_desc"    Element values, based on type, ordered from high to low. 
"@val_str_desc"       Element values, treated as strings, ordered from high to low. 
"@val_num_desc"       Element values, treated as numbers, ordered from high to low. 

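1.6 Printing

(1) print

The print statement writes its items followed by a newline. Items separated by commas are printed with a space between them by default; this can be changed through the output field separator OFS (e.g. OFS = ";").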

print item1, item2, …

(2)printf

printf format, item1, item2, …

Note: printf does not append a newline. The format-control letters are:

%c        Print a number as an ASCII character.
%s        Print a string.
%u        Print an unsigned decimal integer.
%d, %i    Print a decimal integer.     
%e, %E    Print a number in scientific (exponential) notation.
%f        Print a number in floating-point notation.
%F        Like ‘%f’ but the infinity and “not a number” values are spelled using uppercase letters.
%g, %G    Print a number in either scientific notation or in floating-point notation, whichever uses fewer characters.
%o        Print an unsigned octal integer.
%x, %X    Print an unsigned hexadecimal integer.
%%        Print a single ‘%’. 

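As with printf in C, a format specifier can take modifiers: N$ (e.g. 2$ selects the second argument; a gawk extension), - (left-justify), space, +, 0, ', a field width, and .prec (precision). See the awk documentation for details.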
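A brief sketch of the width, precision, and justification modifiers, using only specifiers that are portable across awk implementations. The values are arbitrary.

```shell
formatted=$(awk 'BEGIN {
    # left-justified string, right-justified integer, zero-padded float, hex
    printf "%-6s|%5d|%08.3f|%x\n", "awk", 42, 3.14159, 255
}')
echo "$formatted"   # prints: awk   |   42|0003.142|ff
```

Note the explicit \n in the format string, since printf adds no newline of its own.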

(3) Redirecting output

print items > output-file
print items >> output-file
print items | command
print items |& command
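As an illustration of the pipe form, the sketch below sends each line to an external sort command and closes the pipe in END so that the sorted output is flushed before awk exits. The input numbers are made up; the |& (coprocess) form is a gawk extension and is not shown here.

```shell
piped=$(printf '3\n1\n2\n' | awk '
    { print $1 | "sort -n" }        # every print goes to the same sort process
END { close("sort -n") }            # wait for sort to finish and emit its output
')
echo "$piped"
# 1
# 2
# 3
```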

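1.7 Passing Parameters

(1) -v var=val

Pass a value at run time, e.g. awk -f test.awk -v sum=100 test.txt. Note that each variable needs its own -v. A var=val argument without -v is placed in the command-line argument array ARGV and is assigned only when awk reaches it among the input files, so it is not yet set in BEGIN.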
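A minimal sketch of -v, assuming two made-up input lines: the variable is already set when BEGIN runs, and the main rule then adds to it.

```shell
vdemo=$(printf '1\n2\n' |
    awk -v sum=100 'BEGIN { print "initial", sum } { sum += $1 } END { print sum }')
echo "$vdemo"
# initial 100
# 103
```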

(2)ARGV

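Parameters can also be passed through the built-in array ARGV (the command-line arguments). By default, awk treats each element of ARGV after ARGV[0] as an input file, so a script that uses one as a plain parameter should read it in BEGIN and then remove it from ARGV.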
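The sketch below only inspects ARGC and ARGV in BEGIN. Because the program has no main or END rules, awk never tries to open the arguments as files. The arguments first and second are hypothetical.

```shell
args=$(awk 'BEGIN {
    print ARGC                                        # ARGV[0] (the awk name) plus two arguments
    for (i = 1; i < ARGC; i++) print i, ARGV[i]       # skip ARGV[0], which varies by implementation
}' first second)
echo "$args"
# 3
# 1 first
# 2 second
```

In a script that also reads input, setting ARGV[i] = "" after reading a parameter is one way to keep awk from treating it as a file name.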

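2. Flow Control

2.1 break/continue/next/nextfile/exit

break leaves the enclosing loop; continue starts the next iteration of the loop; next stops processing the current line and moves on to the next one; nextfile stops processing the current file and moves on to the next file; exit ends the program, but the END actions still run before awk exits.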

exit [return code]
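A sketch of next and exit working together, on made-up input: comment lines are skipped, reading stops at a line whose first field is "stop", and END still reports the sum collected so far.

```shell
partial=$(printf '# comment\n1\n2\nstop\n3\n' | awk '
/^#/         { next }           # skip this line; later rules do not see it
$1 == "stop" { exit }           # stop reading input; END still runs
             { sum += $1 }
END          { print "sum", sum }
')
echo "$partial"   # prints: sum 3
```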

2.2 Conditions

(1) Comparison operators

x < y     True if x is less than y.
x <= y    True if x is less than or equal to y.
x > y     True if x is greater than y.
x >= y    True if x is greater than or equal to y.
x == y    True if x is equal to y.
x != y    True if x is not equal to y.
x ~ y    True if the string x matches the regexp denoted by y.
x !~ y    True if the string x does not match the regexp denoted by y.
subscript in array    True if the array has an element with the subscript.

(2) Boolean operators

boolean1 && boolean2
boolean1 || boolean2
! boolean

(3) Regular-expression matching

exp ~ /regexp/
exp !~ /regexp/

2.3 if

if (condition)
{
    body
}
else if (condition)
{
    body
}
else
{
    body
}

2.4 for

for (initialization; condition; increment) { body }

for (var in array) { body }

2.5 while

while (condition)
{
    body
}
  
do
{
    body
}
while (condition)
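The if, for, while, and do-while forms above can be combined in one small sketch. The variable names and loop bounds are arbitrary.

```shell
flow=$(awk 'BEGIN {
    for (i = 1; i <= 5; i++) {
        if (i % 2 == 0) {
            kind = "even"
        } else {
            kind = "odd"
        }
        printf "%d:%s ", i, kind
    }
    n = 3
    while (n > 0) {
        n--                     # counts down to 0
    }
    do {
        n++                     # the body runs before the condition is tested
    } while (n < 2)
    print "n=" n
}')
echo "$flow"   # prints: 1:odd 2:even 3:odd 4:even 5:odd n=2
```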

2.6 switch (gawk extension)

switch (expression) 
{
case value or regular expression:
    case-body
default:
    default-body
}

3. Functions

3.1 User-Defined Functions

function name([parameter-list])
{
     body-of-function
}
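A minimal sketch of a user-defined function. The name max3 is hypothetical. awk has no local-variable declarations, so the usual convention is to list locals as extra parameters (here m), separated from the real ones by extra spaces.

```shell
largest=$(awk '
# max3 is a made-up helper; m is a local variable declared as an extra parameter.
function max3(a, b, c,    m) {
    m = a
    if (b > m) m = b
    if (c > m) m = c
    return m
}
BEGIN { print max3(3, 9, 4) }
')
echo "$largest"   # prints 9
```

Callers simply omit the extra parameter; it starts out empty on every call.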

3.2 Built-in Functions

Numeric Functions     Functions that work with numbers.
String Functions      Functions for string manipulation.
I/O Functions         Functions for files and shell commands.
Time Functions        Functions for dealing with timestamps.
Bitwise Functions     Functions for bitwise operations.
Type Functions        Functions for type information.
I18N Functions        Functions for string translation.

4. Regular Expressions

4.1 Escape Sequences

\\       A literal backslash, ‘\’.
\a       The “alert” character (bell).
\b       Backspace, ASCII code 8 (BS).
\f       Formfeed.
\n       Newline.
\r       Carriage return (CR).
\t       Horizontal TAB (HT).
\v       Vertical tab (VT).
\/       A literal slash (necessary for regexp constants only).
\"       A literal double quote (necessary for string constants only).
\nnn     The octal value nnn.
\xhh     The hexadecimal value hh.

4.2 Operators

\       suppress the special meaning of a character. eg, ‘\$’ matches the character ‘$’.
^       the beginning of a string. It does not match the beginning of a line embedded in a string, eg, if ("line1\nLINE 2" ~ /^L/) is false.
$       the end of a string. It does not match the end of a line embedded in a string, eg, if ("line1\nLINE 2" ~ /1$/) is false.
.       any single character, including the newline character. In POSIX mode, however, ‘.’ does not match NUL.
[…]    a bracket expression. eg.‘[MVX]’ matches any one of the characters ‘M’, ‘V’, or ‘X’ in a string.
[^ …]    a complemented bracket expression. eg.‘[^awk]’ matches any character that is not an ‘a’, ‘w’, or ‘k’.
|       the alternation operator. the lowest precedence of all the regular expression operators. eg, ‘^P|[[:digit:]]’ 
(…)    grouping in regular expressions, as in arithmetic. eg, ‘@(samp|code)\{[^}]+\}’ matches both ‘@code{foo}’ and ‘@samp{bar}’. 
*       be repeated as many times as necessary to find a match. eg, ‘ph*’ matches one ‘p’ followed by any number (including 0) of ‘h’s. The ‘*’ repeats the smallest possible preceding expression. 
+       similar to ‘*’, but must be matched at least once. 
?       similar to ‘*’, but can be matched either once or not at all.  
{n}     repeated n times only
{n,}    repeated at least n times
{n,m}   repeated n to m times

Note: in regular expressions, ‘*’, ‘+’, ‘?’, ‘{’ and ‘}’ have the highest precedence, followed by concatenation, and finally by ‘|’.
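The anchoring and precedence rules above can be checked with a short sketch. The input lines are made up, and interval expressions ({n}) are left out because support for them varies between awk implementations.

```shell
regex=$(printf 'Pat\nspam\n42\nphhh\n' | awk '
BEGIN      { print ("line1\nLINE 2" ~ /^L/) }   # 0: ^ anchors at the start of the string only
/^P|[0-9]/ { print "alt:", $0 }                 # | binds last: (^P) or ([0-9])
/^ph*$/    { print "star:", $0 }                # * applies to the h alone, not to ph
')
echo "$regex"
# 0
# alt: Pat
# alt: 42
# star: phhh
```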

4.3 Bracket Expressions

(1)[: :]

[:alnum:]    Alphanumeric characters.
[:alpha:]    Alphabetic characters.
[:blank:]    Space and TAB characters.
[:cntrl:]    Control characters.
[:digit:]    Numeric characters.
[:graph:]    Characters that are both printable and visible. (A space is invisible)
[:lower:]    Lowercase alphabetic characters.
[:print:]    Printable characters (characters that are not control characters).
[:punct:]    Punctuation characters
[:space:]    Space characters (such as space, TAB, and formfeed, to name a few).
[:upper:]    Uppercase alphabetic characters.
[:xdigit:]   Characters that are hexadecimal digits.
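Character classes are only valid inside a bracket expression, hence the doubled brackets in the sketch below. The input lines and labels are made up, and very old awk implementations may not support these classes.

```shell
classes=$(printf 'abc123\nHELLO\nfoo bar\n' | awk '
/^[[:alpha:]]+[[:digit:]]+$/ { print "letters-then-digits:", $0 }
/^[[:upper:]]+$/             { print "upper:", $0 }
/[[:space:]]/                { print "has-space:", $0 }
')
echo "$classes"
# letters-then-digits: abc123
# upper: HELLO
# has-space: foo bar
```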

(2)[. .]

Collating symbols, used for multicharacter collating elements. eg, if ‘ch’ is a collating element, ‘[[.ch.]]’ is a regexp that matches this collating element.

(3)[= =]

Equivalence classes: locale-specific names for a list of characters that are equal. eg, ‘[[=e=]]’ is a regexp that matches any of ‘e’, ‘é’, or ‘è’. This is useful when processing non-English text.

4.4 gawk-Specific Regexp Operators

\s    Matches any whitespace character. Equivalent to ‘[[:space:]]’.
\S    Matches any character that is not whitespace. Equivalent to ‘[^[:space:]]’.

\w    Matches any letter, digit, or underscore. Equivalent to ‘[[:alnum:]_]’.
\W    Matches any character that is not word-constituent. Equivalent to ‘[^[:alnum:]_]’.

\<    Matches the beginning of a word. eg, /\<away/ matches ‘away’ but not ‘stowaway’.
\>    Matches the end of a word. eg, /stow\>/ matches ‘stow’ but not ‘stowaway’.

\y    Matches the word boundary (the beginning or the end of a word). 
\B    Essentially the opposite of ‘\y’. eg, /\Brat\B/ matches ‘crate’ but does not match ‘dirty rat’.

\`    Matches the beginning of a buffer (string).
\'    Matches the end of a buffer (string).

5. Resources

awk manual: http://www.gnu.org/software/gawk/manual/gawk.html

References:
[1] Blog post "linux awk 内置变量使用介绍" (an introduction to awk's built-in variables on Linux)
[2] awk manual
