Linux shell text processing: grep, awk, sed

Published 2017-10-05 · 18,366 characters in total · about 53 minutes to read

Sorting data: sort

Usage Examples

Find the lines that occur more than once in a file

sort ./test.txt | uniq -d

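Find the lines shared by two files (this assumes neither file repeats a line internally, because uniq -d cannot tell which file a duplicate came from)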

cat file1 file2 | sort | uniq -d
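The two one-liners above can be tried on throwaway data. This is a minimal sketch: the file names and contents are made up for illustration, and the `sort -u` step is an added safeguard (not part of the original command) for the case where one file repeats a line internally.

```shell
# Work in a scratch directory so nothing outside it is touched.
tmp=$(mktemp -d)
printf 'b\na\nb\nc\n' > "$tmp/file1"   # "b" is repeated inside file1
printf 'c\nd\n'       > "$tmp/file2"   # "c" is shared with file1

# Lines that occur more than once inside a single file.
dupes=$(sort "$tmp/file1" | uniq -d)
echo "$dupes"

# Deduplicate each file first, so only lines present in BOTH files
# survive the final uniq -d.
common=$( { sort -u "$tmp/file1"; sort -u "$tmp/file2"; } | sort | uniq -d)
echo "$common"

rm -rf "$tmp"
```

Without the per-file `sort -u`, the repeated "b" in file1 would also be reported as if it were shared.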

help

sort --help
Usage: sort [OPTION]... [FILE]...
  or:  sort [OPTION]... --files0-from=F
Write sorted concatenation of all FILE(s) to standard output.

Mandatory arguments to long options are mandatory for short options too.
Ordering options:

  -b, --ignore-leading-blanks  ignore leading blanks
  -d, --dictionary-order      consider only blanks and alphanumeric characters
  -f, --ignore-case           fold lower case to upper case characters
  -g, --general-numeric-sort  compare according to general numerical value
  -i, --ignore-nonprinting    consider only printable characters
  -M, --month-sort            compare (unknown) < 'JAN' < ... < 'DEC'
  -h, --human-numeric-sort    compare human readable numbers (e.g., 2K 1G)
  -n, --numeric-sort          compare according to string numerical value
  -R, --random-sort           sort by random hash of keys
      --random-source=FILE    get random bytes from FILE
  -r, --reverse               reverse the result of comparisons
      --sort=WORD             sort according to WORD:
                                general-numeric -g, human-numeric -h, month -M,
                                numeric -n, random -R, version -V
  -V, --version-sort          natural sort of (version) numbers within text

Other options:

      --batch-size=NMERGE   merge at most NMERGE inputs at once;
                            for more use temp files
  -c, --check, --check=diagnose-first  check for sorted input; do not sort
  -C, --check=quiet, --check=silent  like -c, but do not report first bad line
      --compress-program=PROG  compress temporaries with PROG;
                              decompress them with PROG -d
      --debug               annotate the part of the line used to sort,
                              and warn about questionable usage to stderr
      --files0-from=F       read input from the files specified by
                            NUL-terminated names in file F;
                            If F is - then read names from standard input
  -k, --key=KEYDEF          sort via a key; KEYDEF gives location and type
  -m, --merge               merge already sorted files; do not sort
  -o, --output=FILE         write result to FILE instead of standard output
  -s, --stable              stabilize sort by disabling last-resort comparison
  -S, --buffer-size=SIZE    use SIZE for main memory buffer
  -t, --field-separator=SEP  use SEP instead of non-blank to blank transition
  -T, --temporary-directory=DIR  use DIR for temporaries, not $TMPDIR or /tmp;
                              multiple options specify multiple directories
      --parallel=N          change the number of sorts run concurrently to N
  -u, --unique              with -c, check for strict ordering;
                              without -c, output only the first of an equal run
  -z, --zero-terminated     end lines with 0 byte, not newline
      --help     display this help and exit
      --version  output version information and exit

KEYDEF is F[.C][OPTS][,F[.C][OPTS]] for start and stop position, where F is a
field number and C a character position in the field; both are origin 1, and
the stop position defaults to the line's end.  If neither -t nor -b is in
effect, characters in a field are counted from the beginning of the preceding
whitespace.  OPTS is one or more single-letter ordering options [bdfgiMhnRrV],
which override global ordering options for that key.  If no key is given, use
the entire line as the key.

SIZE may be followed by the following multiplicative suffixes:
% 1% of memory, b 1, K 1024 (default), and so on for M, G, T, P, E, Z, Y.

With no FILE, or when FILE is -, read standard input.

*** WARNING ***
The locale specified by the environment affects sort order.
Set LC_ALL=C to get the traditional sort order that uses
native byte values.

GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
For complete documentation, run: info coreutils 'sort invocation'

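grep, awk, sed

grep: locating data

The name comes from the ed command g/re/p: globally search for a regular expression and print the matching lines.

Print lines matching a pattern

Finds the lines that satisfy a condition expressed as a regular expression.

SQL analogy: select * from table;

awk: slicing data

Processes individual fields within the lines that have been located.

SQL analogy: select field from table;

sed: modifying data

stream editor

Modifies the data in the lines that have been located.

SQL analogy: update table set field=new where field=old;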
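To make the division of labour concrete, here is a small sketch that runs all three tools over the same two-line input. The sample data and variable names are invented for illustration.

```shell
# Two "rows" with two whitespace-separated "columns" each.
data=$(printf 'alice 30\nbob 25\n')

# grep: pick whole rows that match a pattern.
rows=$(printf '%s\n' "$data" | grep 'bob')

# awk: pick one field from the matching rows.
field=$(printf '%s\n' "$data" | awk '/bob/ { print $2 }')

# sed: rewrite part of the matching rows and pass the rest through.
edited=$(printf '%s\n' "$data" | sed 's/bob/robert/')

echo "$rows"
echo "$field"
echo "$edited"
```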

grep

Command-line Options

1.Generic Program Information

-V

grep -V
grep (GNU grep) 2.20
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Mike Haertel and others, see <http://git.sv.gnu.org/cgit/grep.git/tree/AUTHORS>.

2.Matching Control

-e patterns --regexp=patterns

-f file --file=file

-i -y --ignore-case --no-ignore-case

-v --invert-match

-w --word-regexp

-x --line-regexp

3.General Output Control

-c --count

--color[=WHEN] --colour[=WHEN]

-L --files-without-match

-l --files-with-matches

-m num --max-count=num

-o --only-matching

-q --quiet --silent

-s --no-messages

4.Output Line Prefix Control

-b --byte-offset

-H --with-filename

-h --no-filename

-n --line-number

-T --initial-tab

-Z --null

5.Context Line Control

-A num --after-context=num Print num lines of trailing context after matching lines.

-B num --before-context=num Print num lines of leading context before matching lines.

-C num -num --context=num Print num lines of leading and trailing output context.
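The three context options are easiest to compare side by side. The sketch below uses a made-up five-line file and asks for one line of context around the match "c".

```shell
tmp=$(mktemp -d)
printf '%s\n' a b c d e > "$tmp/letters.txt"

# -A 1: the match plus one line after it.
after=$(grep -A 1 'c' "$tmp/letters.txt")

# -B 1: one line before the match plus the match.
before=$(grep -B 1 'c' "$tmp/letters.txt")

# -C 1: one line on each side of the match.
around=$(grep -C 1 'c' "$tmp/letters.txt")

printf '%s\n---\n%s\n---\n%s\n' "$after" "$before" "$around"
rm -rf "$tmp"
```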

6.File and Directory Selection

-a --text

Process a binary file as if it were text; this is equivalent to the '--binary-files=text' option.

--binary-files=type

-I

Process a binary file as if it did not contain matching data; this is equivalent to the '--binary-files=without-match' option.

--include=glob

--exclude=glob

--exclude-from=file

--exclude-dir=glob

-r --recursive

-R --dereference-recursive

Usage example

1. How can I list just the names of matching files? (file names only)

grep -l "404" *.html
404.html

-l, --files-with-matches print only names of FILEs containing matches

grep -L "404" *.html
ie.html

-L, --files-without-match print only names of FILEs containing no match

2. How do I search directories recursively? (recurse into directories)

grep -r 'linux' ./posts
./posts/Python/2017-10-06-python-shell-os.md:https://www.ibm.com/developerworks/cn/linux/l-cn-pexpect2/index.html
./posts/Kubernetes/2017-10-22-Kubernetes-resources-code-cloud.md:    export GOPKG=$GOROOT/pkg/tool/linux_amd64
./posts/Kubernetes/2017-10-22-Kubernetes-resources-code-cloud.md:    export GOOS=linux

3. What if a pattern or file has a leading '-'? (pass the pattern with -e)

-e, --regexp=PATTERN use PATTERN for matching

4. Suppose I want to search for a whole word, not a part of a word? (whole-word matching, -w)

5. How do I output context around the matching lines? (print context, -A, -B, -C)

6. How do I force grep to print the name of the file? (print the file name)

-H, --with-filename print the file name for each match
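Items 3, 4 and 6 above are listed without a command, so here is a hedged sketch of each on a scratch file. The file name words.txt and its contents are invented for the example.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'cat' 'concatenate' '-v flag' > "$tmp/words.txt"

# Item 3: -e marks the next argument as the pattern, even though it
# starts with '-' and would otherwise be parsed as an option.
dash=$(grep -e '-v' "$tmp/words.txt")

# Item 4: -w matches whole words only, so "concatenate" is skipped.
whole=$(grep -w 'cat' "$tmp/words.txt")

# Item 6: -H forces the file name prefix even with a single input file.
named=$(cd "$tmp" && grep -H 'concat' words.txt)

echo "$dash"
echo "$whole"
echo "$named"
rm -rf "$tmp"
```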

7. Why do people use strange regular expressions on ps output? (filtering the output of other programs)

ps -ef | grep '[c]ron'
root      2042     1  0  2018 ?        00:00:55 /usr/sbin/crond -n
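The bracket trick works because the grep process itself shows up in the ps listing. With a plain pattern its command line contains the text "cron" and matches itself; with '[c]ron' its command line contains the literal characters "[c]ron", which the regular expression (meaning "c" followed by "ron") does not match. The sketch below simulates both listings with hard-coded text, since real ps output differs from machine to machine; the process lines are invented.

```shell
# Simulated "ps" listing while running: ps -ef | grep cron
plain_listing=$(printf '%s\n' \
  'root 2042 /usr/sbin/crond -n' \
  'me   3001 grep cron')

# Simulated "ps" listing while running: ps -ef | grep '[c]ron'
bracket_listing=$(printf '%s\n' \
  'root 2042 /usr/sbin/crond -n' \
  'me   3001 grep [c]ron')

# The plain pattern also matches the grep process line.
plain_hits=$(printf '%s\n' "$plain_listing" | grep -c 'cron')

# The bracketed pattern matches only the real daemon.
bracket_hits=$(printf '%s\n' "$bracket_listing" | grep -c '[c]ron')

echo "$plain_hits $bracket_hits"
```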

8. Case-insensitive matching

-i, --ignore-case ignore case distinctions

grep -i 'hello.*world' menu.h main.c

9. Count how many times a specific string occurs in a text

grep -o 'haha' file | wc -l
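The reason for combining -o with wc -l, rather than using grep -c, is that -c counts matching lines while -o emits one output line per match. A small sketch on invented input shows the difference; the `tr -d ' '` is only there because some wc implementations pad the number with spaces.

```shell
text=$(printf 'haha haha\nhaha\nnone\n')

# One output line per match, then count those lines.
occurrences=$(printf '%s\n' "$text" | grep -o 'haha' | wc -l | tr -d ' ')

# -c counts lines that contain at least one match.
matching_lines=$(printf '%s\n' "$text" | grep -c 'haha')

echo "occurrences=$occurrences lines=$matching_lines"
```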

help

 grep --help
Usage: grep [OPTION]... PATTERN [FILE]...
Search for PATTERN in each FILE or standard input.
PATTERN is, by default, a basic regular expression (BRE).
Example: grep -i 'hello world' menu.h main.c

Regexp selection and interpretation:
  -E, --extended-regexp     PATTERN is an extended regular expression (ERE)
  -F, --fixed-strings       PATTERN is a set of newline-separated fixed strings
  -G, --basic-regexp        PATTERN is a basic regular expression (BRE)
  -P, --perl-regexp         PATTERN is a Perl regular expression
  -e, --regexp=PATTERN      use PATTERN for matching
  -f, --file=FILE           obtain PATTERN from FILE
  -i, --ignore-case         ignore case distinctions
  -w, --word-regexp         force PATTERN to match only whole words
  -x, --line-regexp         force PATTERN to match only whole lines
  -z, --null-data           a data line ends in 0 byte, not newline

Miscellaneous:
  -s, --no-messages         suppress error messages
  -v, --invert-match        select non-matching lines
  -V, --version             display version information and exit
      --help                display this help text and exit

Output control:
  -m, --max-count=NUM       stop after NUM matches
  -b, --byte-offset         print the byte offset with output lines
  -n, --line-number         print line number with output lines
      --line-buffered       flush output on every line
  -H, --with-filename       print the file name for each match
  -h, --no-filename         suppress the file name prefix on output
      --label=LABEL         use LABEL as the standard input file name prefix
  -o, --only-matching       show only the part of a line matching PATTERN
  -q, --quiet, --silent     suppress all normal output
      --binary-files=TYPE   assume that binary files are TYPE;
                            TYPE is 'binary', 'text', or 'without-match'
  -a, --text                equivalent to --binary-files=text
  -I                        equivalent to --binary-files=without-match
  -d, --directories=ACTION  how to handle directories;
                            ACTION is 'read', 'recurse', or 'skip'
  -D, --devices=ACTION      how to handle devices, FIFOs and sockets;
                            ACTION is 'read' or 'skip'
  -r, --recursive           like --directories=recurse
  -R, --dereference-recursive
                            likewise, but follow all symlinks
      --include=FILE_PATTERN
                            search only files that match FILE_PATTERN
      --exclude=FILE_PATTERN
                            skip files and directories matching FILE_PATTERN
      --exclude-from=FILE   skip files matching any file pattern from FILE
      --exclude-dir=PATTERN directories that match PATTERN will be skipped.
  -L, --files-without-match print only names of FILEs containing no match
  -l, --files-with-matches  print only names of FILEs containing matches
  -c, --count               print only a count of matching lines per FILE
  -T, --initial-tab         make tabs line up (if needed)
  -Z, --null                print 0 byte after FILE name

Context control:
  -B, --before-context=NUM  print NUM lines of leading context
  -A, --after-context=NUM   print NUM lines of trailing context
  -C, --context=NUM         print NUM lines of output context
  -NUM                      same as --context=NUM
      --group-separator=SEP use SEP as a group separator
      --no-group-separator  use empty string as a group separator
      --color[=WHEN],
      --colour[=WHEN]       use markers to highlight the matching strings;
                            WHEN is 'always', 'never', or 'auto'
  -U, --binary              do not strip CR characters at EOL (MSDOS/Windows)
  -u, --unix-byte-offsets   report offsets as if CRs were not there
                            (MSDOS/Windows)

'egrep' means 'grep -E'.  'fgrep' means 'grep -F'.
Direct invocation as either 'egrep' or 'fgrep' is deprecated.
When FILE is -, read standard input.  With no FILE, read . if a command-line
-r is given, - otherwise.  If fewer than two FILEs are given, assume -h.
Exit status is 0 if any line is selected, 1 otherwise;
if any error occurs and -q is not given, the exit status is 2.

Report bugs to: bug-grep@gnu.org
GNU Grep home page: <http://www.gnu.org/software/grep/>
General help using GNU software: <http://www.gnu.org/gethelp/>

awk

In theory, awk can replace grep.

The name awk comes from the initials of its designers: Alfred V. Aho, Peter J. Weinberger, and Brian W. Kernighan. The original version of awk was written in 1977 at AT&T Bell Laboratories.

help

awk --help
Usage: awk [POSIX or GNU style options] -f progfile [--] file ...
Usage: awk [POSIX or GNU style options] [--] 'program' file ...
POSIX options:          GNU long options: (standard)
        -f progfile             --file=progfile
        -F fs                   --field-separator=fs
        -v var=val              --assign=var=val
Short options:          GNU long options: (extensions)
        -b                      --characters-as-bytes
        -c                      --traditional
        -C                      --copyright
        -d[file]                --dump-variables[=file]
        -e 'program-text'       --source='program-text'
        -E file                 --exec=file
        -g                      --gen-pot
        -h                      --help
        -L [fatal]              --lint[=fatal]
        -n                      --non-decimal-data
        -N                      --use-lc-numeric
        -O                      --optimize
        -p[file]                --profile[=file]
        -P                      --posix
        -r                      --re-interval
        -S                      --sandbox
        -t                      --lint-old
        -V                      --version

To report bugs, see node `Bugs' in `gawk.info', which is
section `Reporting Problems and Bugs' in the printed version.

gawk is a pattern scanning and processing language.
By default it reads standard input and writes standard output.

Examples:
        gawk '{ sum += $1 }; END { print sum }' file
        gawk -F: '{ print $1 }' /etc/passwd
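The two examples from the help text, plus the grep-like use mentioned at the start of this section, can be run on inline data. This sketch calls `awk` rather than `gawk` so it also runs where only another awk is installed; the input lines are made up.

```shell
# A bare /regex/ pattern with no action prints matching lines, like grep.
like_grep=$(printf 'foo\nbar\n' | awk '/ba/')

# Sum the first field of every line and print the total at the end.
total=$(printf '1\n2\n3\n' | awk '{ sum += $1 } END { print sum }')

# Use ':' as the field separator and print the first field.
users=$(printf 'root:x:0\ndaemon:x:1\n' | awk -F: '{ print $1 }')

echo "$like_grep"
echo "$total"
echo "$users"
```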

sed

sed, a stream editor

help

sed
Usage: sed [OPTION]... {script-only-if-no-other-script} [input-file]...

  -n, --quiet, --silent
                 suppress automatic printing of pattern space
  -e script, --expression=script
                 add the script to the commands to be executed
  -f script-file, --file=script-file
                 add the contents of script-file to the commands to be executed
  --follow-symlinks
                 follow symlinks when processing in place
  -i[SUFFIX], --in-place[=SUFFIX]
                 edit files in place (makes backup if SUFFIX supplied)
  -c, --copy
                 use copy instead of rename when shuffling files in -i mode
  -b, --binary
                 does nothing; for compatibility with WIN32/CYGWIN/MSDOS/EMX (
                 open files in binary mode (CR+LFs are not treated specially))
  -l N, --line-length=N
                 specify the desired line-wrap length for the `l' command
  --posix
                 disable all GNU extensions.
  -r, --regexp-extended
                 use extended regular expressions in the script.
  -s, --separate
                 consider files as separate rather than as a single continuous
                 long stream.
  -u, --unbuffered
                 load minimal amounts of data from the input files and flush
                 the output buffers more often
  -z, --null-data
                 separate lines by NUL characters
  --help
                 display this help and exit
  --version
                 output version information and exit

If no -e, --expression, -f, or --file option is given, then the first
non-option argument is taken as the sed script to interpret.  All
remaining arguments are names of input files; if no input files are
specified, then the standard input is read.

GNU sed home page: <http://www.gnu.org/software/sed/>.
General help using GNU software: <http://www.gnu.org/gethelp/>.

examples

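1. Replacing text

# Replace the first occurrence of "old" on each line with "new"
sed 's/old/new/' input.txt

# Replace every occurrence of "old" on each line with "new"
sed 's/old/new/g' input.txt

# Replace only the n-th occurrence of "old" on each line (here n = 2)
sed 's/old/new/2' input.txt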

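2. Deleting lines

# Delete every line that contains "pattern"
sed '/pattern/d' input.txt

# Delete the n-th line (here n = 3)
sed '3d' input.txt

# Delete empty lines (lines holding only spaces or tabs are not matched)
sed '/^$/d' input.txt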

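3. Inserting and appending text

# Insert text before each line that matches "pattern" (GNU sed one-line form)
sed '/pattern/i\inserted text' input.txt

# Append text after each line that matches "pattern" (GNU sed one-line form)
sed '/pattern/a\appended text' input.txt

4. Using variables

# Substitute using shell variables (double quotes let the shell expand them)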
old="foo"
new="bar"
sed "s/$old/$new/" input.txt
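One caveat with shell variables: if a value contains "/", it collides with the delimiter of the s command. sed accepts other delimiter characters, so a common workaround is to pick one that does not appear in the data. The sketch below uses "|" and invented path values; regular-expression metacharacters inside the variable would still be interpreted by sed.

```shell
old="/usr/local"
new="/opt"

# "|" replaces "/" as the s-command delimiter, so the slashes in the
# variables are treated as ordinary characters.
result=$(printf '%s\n' '/usr/local/bin' | sed "s|$old|$new|")
echo "$result"
```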

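5. Saving changes to the original file

# Write the changes back to the original file
sed -i 's/old/new/' input.txt

# Keep the original as input.txt.bak and write the changes to input.txt
sed -i.bak 's/old/new/' input.txt

6. Printing specific lines

# Print the n-th line (here n = 3)
sed -n '3p' input.txt

# Print the lines that match "pattern"
sed -n '/pattern/p' input.txt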

7. Running several commands

# Run several sed commands in one invocation
sed -e 's/old/new/' -e '/pattern/d' input.txt
sed -e 's/old/new/' -e '/pattern/d' input.txt
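Several of the commands above can be checked against a small scratch file. This is only a sketch: the file contents are invented, and the concrete numbers (second occurrence, second line) stand in for the generic n.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'old old old' 'keep' '' 'pattern line' > "$tmp/input.txt"

# First occurrence only, then only the second occurrence.
first=$(sed 's/old/new/' "$tmp/input.txt" | head -n 1)
second=$(sed 's/old/new/2' "$tmp/input.txt" | head -n 1)

# Print just line 2; -n suppresses the default output.
line2=$(sed -n '2p' "$tmp/input.txt")

# Delete empty lines and count what is left.
remaining=$(sed '/^$/d' "$tmp/input.txt" | wc -l | tr -d ' ')

# Edit in place while keeping the original as input.txt.bak.
sed -i.bak 's/old/new/' "$tmp/input.txt"
edited=$(head -n 1 "$tmp/input.txt")
backup=$(head -n 1 "$tmp/input.txt.bak")

echo "$first | $second | $line2 | $remaining | $edited | $backup"
rm -rf "$tmp"
```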

head

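The head command displays the beginning of a file. By default it prints the first 10 lines.

Syntax

head [options] [files]

Options

-n <number>: number of lines to print from the start of the file
-c <number>: number of bytes to print from the start of the file
-v: always print the file name header
-q: never print the file name header

Arguments

File list: the files whose leading content should be displayed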

Usage Examples
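A few typical invocations, sketched on inline input. The negative line count relies on GNU head, as documented in the help text below; other implementations may not accept it.

```shell
# First two lines.
first_two=$(printf '%s\n' 1 2 3 4 | head -n 2)

# Everything except the last line (GNU head).
all_but_last=$(printf '%s\n' 1 2 3 4 | head -n -1)

# First three bytes.
first_bytes=$(printf 'abcdef\n' | head -c 3)

echo "$first_two"
echo "$all_but_last"
echo "$first_bytes"
```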

help

head --help
Usage: head [OPTION]... [FILE]...
Print the first 10 lines of each FILE to standard output.
With more than one FILE, precede each with a header giving the file name.
With no FILE, or when FILE is -, read standard input.

Mandatory arguments to long options are mandatory for short options too.
  -c, --bytes=[-]K         print the first K bytes of each file;
                             with the leading '-', print all but the last
                             K bytes of each file
  -n, --lines=[-]K         print the first K lines instead of the first 10;
                             with the leading '-', print all but the last
                             K lines of each file
  -q, --quiet, --silent    never print headers giving file names
  -v, --verbose            always print headers giving file names
      --help     display this help and exit
      --version  output version information and exit

K may have a multiplier suffix:
b 512, kB 1000, K 1024, MB 1000*1000, M 1024*1024,
GB 1000*1000*1000, G 1024*1024*1024, and so on for T, P, E, Z, Y.

GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
For complete documentation, run: info coreutils 'head invocation'

Formatting JSON

A JSON string

echo '{"uid":100120,"token":"1fa9fb8004b04f66b7da57393641eddc"}' | jq .

A file

cat abc.json | jq .

Pretty-print the response of a curl test request

result=$(curl http://xxxxx)
echo "$result" | jq .

wc

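print newline, word, and byte counts for each file

wc reports how many lines, words, and characters a file contains.

Example: cat /etc/man.config | wc

-l: print the line count
-w: print the word count
-m: print the character count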

Usage Examples
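A quick sketch of the individual counters on a two-line ASCII input. The `tr -d ' '` strips the padding that some wc implementations add in front of the number.

```shell
text='hello world
foo'

lines=$(printf '%s\n' "$text" | wc -l | tr -d ' ')   # newline count
words=$(printf '%s\n' "$text" | wc -w | tr -d ' ')   # word count
bytes=$(printf '%s\n' "$text" | wc -c | tr -d ' ')   # byte count

echo "lines=$lines words=$words bytes=$bytes"
```

For pure ASCII input, -m (characters) gives the same number as -c (bytes); the two differ only when the text contains multibyte characters.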

help

wc --help
Usage: wc [OPTION]... [FILE]...
  or:  wc [OPTION]... --files0-from=F
Print newline, word, and byte counts for each FILE, and a total line if
more than one FILE is specified.  With no FILE, or when FILE is -,
read standard input.  A word is a non-zero-length sequence of characters
delimited by white space.
The options below may be used to select which counts are printed, always in
the following order: newline, word, character, byte, maximum line length.
  -c, --bytes            print the byte counts
  -m, --chars            print the character counts
  -l, --lines            print the newline counts
      --files0-from=F    read input from the files specified by
                           NUL-terminated names in file F;
                           If F is - then read names from standard input
  -L, --max-line-length  print the length of the longest line
  -w, --words            print the word counts
      --help     display this help and exit
      --version  output version information and exit

GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
For complete documentation, run: info coreutils 'wc invocation'

cut

remove sections from each line of files

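Syntax

cut -d[delimiter] -f[fields] file

Common options

-b (bytes): select by byte position. Byte positions ignore multibyte character boundaries unless -n is also given.
-c (characters): select by character position.
-d: set the field delimiter; the default is TAB.
-f (fields): select the listed fields; normally combined with -d.
-n: do not split multibyte characters. Only used together with -b: a character is written out if its last byte falls within the range given in the -b LIST, and is dropped otherwise.

Usage Examples

-b 1,3,5    select bytes 1, 3 and 5 of every line
-b 1-5      select bytes 1 through 5 of every line
-b -5       select bytes 1 through 5 of every line
-b 3-       select everything from byte 3 to the end of the line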
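The range forms can be verified on a short ASCII string, and the -d/-f pair on a colon-separated record. The input values are invented for the sketch.

```shell
# Individual bytes, an open-start range and an open-end range.
picked=$(echo 'abcdef' | cut -b 1,3,5)
head5=$(echo 'abcdef' | cut -b -5)
from3=$(echo 'abcdef' | cut -b 3-)

# Fields 1 and 3 of a ':'-separated record; the output keeps the
# input delimiter between the selected fields.
fields=$(echo 'root:x:0:0' | cut -d: -f1,3)

echo "$picked $head5 $from3 $fields"
```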

help

cut --help
Usage: cut OPTION... [FILE]...
Print selected parts of lines from each FILE to standard output.

Mandatory arguments to long options are mandatory for short options too.
  -b, --bytes=LIST        select only these bytes
  -c, --characters=LIST   select only these characters
  -d, --delimiter=DELIM   use DELIM instead of TAB for field delimiter
  -f, --fields=LIST       select only these fields;  also print any line
                            that contains no delimiter character, unless
                            the -s option is specified
  -n                      with -b: don't split multibyte characters
      --complement        complement the set of selected bytes, characters
                            or fields
  -s, --only-delimited    do not print lines not containing delimiters
      --output-delimiter=STRING  use STRING as the output delimiter
                            the default is to use the input delimiter
      --help     display this help and exit
      --version  output version information and exit

Use one, and only one of -b, -c or -f.  Each LIST is made up of one
range, or many ranges separated by commas.  Selected input is written
in the same order that it is read, and is written exactly once.
Each range is one of:

  N     N'th byte, character or field, counted from 1
  N-    from N'th byte, character or field, to end of line
  N-M   from N'th to M'th (included) byte, character or field
  -M    from first to M'th (included) byte, character or field

With no FILE, or when FILE is -, read standard input.

GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
For complete documentation, run: info coreutils 'cut invocation'

uniq

-c, --count: prefix lines by the number of occurrences

-d, --repeated: only print duplicate lines, one for each group

-D, --all-repeated[=METHOD]: print all duplicate lines; groups can be delimited with an empty line; METHOD={none(default),prepend,separate}

-f, --skip-fields=N: avoid comparing the first N fields

--group[=METHOD]: show all items, separating groups with an empty line; METHOD={separate(default),prepend,append,both}

-i, --ignore-case: ignore differences in case when comparing

-s, --skip-chars=N: avoid comparing the first N characters

-u, --unique: only print unique lines

-z, --zero-terminated: end lines with 0 byte, not newline

-w, --check-chars=N: compare no more than N characters in lines

--help: display this help and exit

--version: output version information and exit

help

[root@qwq ~]# uniq --help
Usage: uniq [OPTION]... [INPUT [OUTPUT]]
Filter adjacent matching lines from INPUT (or standard input),
writing to OUTPUT (or standard output).

With no options, matching lines are merged to the first occurrence.

Mandatory arguments to long options are mandatory for short options too.
  -c, --count           prefix lines by the number of occurrences
  -d, --repeated        only print duplicate lines, one for each group
  -D, --all-repeated[=METHOD]  print all duplicate lines
                          groups can be delimited with an empty line
                          METHOD={none(default),prepend,separate}
  -f, --skip-fields=N   avoid comparing the first N fields
      --group[=METHOD]  show all items, separating groups with an empty line
                          METHOD={separate(default),prepend,append,both}
  -i, --ignore-case     ignore differences in case when comparing
  -s, --skip-chars=N    avoid comparing the first N characters
  -u, --unique          only print unique lines
  -z, --zero-terminated  end lines with 0 byte, not newline
  -w, --check-chars=N   compare no more than N characters in lines
      --help     display this help and exit
      --version  output version information and exit

A field is a run of blanks (usually spaces and/or TABs), then non-blank
characters.  Fields are skipped before chars.

Note: 'uniq' does not detect repeated lines unless they are adjacent.
You may want to sort the input first, or use 'sort -u' without 'uniq'.
Also, comparisons honor the rules specified by 'LC_COLLATE'.

GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
For complete documentation, run: info coreutils 'uniq invocation'
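The note about adjacency is the most common pitfall with uniq, so it is worth a small sketch. The three-line input is invented; the awk step only normalises the space-padded output of uniq -c.

```shell
input=$(printf 'a\nb\na\n')

# Unsorted: the two "a" lines are not adjacent, so nothing is merged.
unsorted=$(printf '%s\n' "$input" | uniq | wc -l | tr -d ' ')

# Sorted first: duplicates become adjacent and collapse.
sorted=$(printf '%s\n' "$input" | sort | uniq | wc -l | tr -d ' ')

# -d keeps one copy of each repeated line, -u keeps lines seen once.
repeated=$(printf '%s\n' "$input" | sort | uniq -d)
single=$(printf '%s\n' "$input" | sort | uniq -u)

# -c prefixes each line with its count.
counts=$(printf '%s\n' "$input" | sort | uniq -c | awk '{ print $1 ":" $2 }')

echo "$unsorted $sorted $repeated $single"
echo "$counts"
```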

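Practice: analyzing nginx logs

1. Top ten client IPs by request count (the listing below leaves out the count column that uniq -c prints in front of each IP):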

[root@qwq ~]# awk '{print $1}' nginx/access.log | sort |uniq -c |sort -nr |head -10
124.250.97.124
101.133.224.19
106.38.46.51
120.232.182.137
40.77.189.255
157.55.39.68
66.249.79.254
74.82.47.5
66.249.79.225
47.92.95.89

2. Filter by date

[root@qwq ~]# grep "27/Jul/2017" nginx/access.log |  awk '{print $1}' |  sort | uniq -c | sort -nr | head -10
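Both pipelines can be tried without a real server log. The sketch below writes a four-line sample in the common nginx access log layout (client IP in the first field, timestamp in brackets); the IP addresses are documentation-range placeholders and the entries are invented.

```shell
tmp=$(mktemp -d)
cat > "$tmp/access.log" <<'EOF'
192.0.2.1 - - [27/Jul/2017:10:00:00 +0800] "GET / HTTP/1.1" 200 612
198.51.100.7 - - [27/Jul/2017:10:00:01 +0800] "GET /a HTTP/1.1" 200 100
192.0.2.1 - - [27/Jul/2017:10:00:02 +0800] "GET /b HTTP/1.1" 404 50
192.0.2.1 - - [28/Jul/2017:09:00:00 +0800] "GET / HTTP/1.1" 200 612
EOF

# Extract the IP, group identical IPs, count them, sort by count
# descending, and keep the busiest client. The final awk only
# normalises the space padding that uniq -c adds.
top_all=$(awk '{ print $1 }' "$tmp/access.log" \
  | sort | uniq -c | sort -nr | head -n 1 | awk '{ print $1, $2 }')

# Same ranking restricted to a single day.
top_day=$(grep '27/Jul/2017' "$tmp/access.log" | awk '{ print $1 }' \
  | sort | uniq -c | sort -nr | head -n 1 | awk '{ print $1, $2 }')

echo "$top_all"
echo "$top_day"
rm -rf "$tmp"
```

The first `sort` is required because uniq only merges adjacent lines; the second `sort -nr` orders the resulting counts numerically, largest first.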

refs

http://www.gnu.org/software/grep/manual/grep.html

http://www.gnu.org/software/sed/manual/sed.html

http://www.gnu.org/software/gawk/manual/gawk.html
