Linux shell text processing: grep, awk, sed
Sorting data: sort
Usage Examples
Find lines that occur more than once in a file
sort ./test.txt | uniq -d
Find lines that appear in both of two files
cat file1 file2 | sort | uniq -d
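One caveat with the two-file pipeline: `uniq -d` reports any line that occurs more than once in the combined stream, so a line repeated inside file1 alone is reported too. A minimal sketch of one way around this, assuming two small sample files created only for illustration, is to de-duplicate each file first, or to use `comm -12` on sorted input:

```shell
# Hypothetical sample files, created only for this demonstration.
tmp=$(mktemp -d)
printf 'a\na\nb\n' > "$tmp/file1"   # "a" is repeated inside file1
printf 'b\nc\n'    > "$tmp/file2"

# Naive pipeline: also reports "a", which never appears in file2.
naive=$(cat "$tmp/file1" "$tmp/file2" | sort | uniq -d)

# De-duplicate each file first so only cross-file repeats remain.
sort -u "$tmp/file1" > "$tmp/s1"
sort -u "$tmp/file2" > "$tmp/s2"
common=$(sort "$tmp/s1" "$tmp/s2" | uniq -d)

# comm -12 prints only the lines common to two sorted inputs.
common_comm=$(comm -12 "$tmp/s1" "$tmp/s2")

printf 'naive:\n%s\ncommon: %s\ncomm -12: %s\n' "$naive" "$common" "$common_comm"
rm -rf "$tmp"
```

If each input file is known to contain no internal duplicates, the original one-liner gives the same answer.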
help
sort --help
Usage: sort [OPTION]... [FILE]...
or: sort [OPTION]... --files0-from=F
Write sorted concatenation of all FILE(s) to standard output.
Mandatory arguments to long options are mandatory for short options too.
Ordering options:
-b, --ignore-leading-blanks ignore leading blanks
-d, --dictionary-order consider only blanks and alphanumeric characters
-f, --ignore-case fold lower case to upper case characters
-g, --general-numeric-sort compare according to general numerical value
-i, --ignore-nonprinting consider only printable characters
-M, --month-sort compare (unknown) < 'JAN' < ... < 'DEC'
-h, --human-numeric-sort compare human readable numbers (e.g., 2K 1G)
-n, --numeric-sort compare according to string numerical value
-R, --random-sort sort by random hash of keys
--random-source=FILE get random bytes from FILE
-r, --reverse reverse the result of comparisons
--sort=WORD sort according to WORD:
general-numeric -g, human-numeric -h, month -M,
numeric -n, random -R, version -V
-V, --version-sort natural sort of (version) numbers within text
Other options:
--batch-size=NMERGE merge at most NMERGE inputs at once;
for more use temp files
-c, --check, --check=diagnose-first check for sorted input; do not sort
-C, --check=quiet, --check=silent like -c, but do not report first bad line
--compress-program=PROG compress temporaries with PROG;
decompress them with PROG -d
--debug annotate the part of the line used to sort,
and warn about questionable usage to stderr
--files0-from=F read input from the files specified by
NUL-terminated names in file F;
If F is - then read names from standard input
-k, --key=KEYDEF sort via a key; KEYDEF gives location and type
-m, --merge merge already sorted files; do not sort
-o, --output=FILE write result to FILE instead of standard output
-s, --stable stabilize sort by disabling last-resort comparison
-S, --buffer-size=SIZE use SIZE for main memory buffer
-t, --field-separator=SEP use SEP instead of non-blank to blank transition
-T, --temporary-directory=DIR use DIR for temporaries, not $TMPDIR or /tmp;
multiple options specify multiple directories
--parallel=N change the number of sorts run concurrently to N
-u, --unique with -c, check for strict ordering;
without -c, output only the first of an equal run
-z, --zero-terminated end lines with 0 byte, not newline
--help display this help and exit
--version output version information and exit
KEYDEF is F[.C][OPTS][,F[.C][OPTS]] for start and stop position, where F is a
field number and C a character position in the field; both are origin 1, and
the stop position defaults to the line's end. If neither -t nor -b is in
effect, characters in a field are counted from the beginning of the preceding
whitespace. OPTS is one or more single-letter ordering options [bdfgiMhnRrV],
which override global ordering options for that key. If no key is given, use
the entire line as the key.
SIZE may be followed by the following multiplicative suffixes:
% 1% of memory, b 1, K 1024 (default), and so on for M, G, T, P, E, Z, Y.
With no FILE, or when FILE is -, read standard input.
*** WARNING ***
The locale specified by the environment affects sort order.
Set LC_ALL=C to get the traditional sort order that uses
native byte values.
GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
For complete documentation, run: info coreutils 'sort invocation'
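As a sketch of the KEYDEF syntax described in the help text, the following sorts colon-separated records on one or two fields. The records are made up for illustration, and `LC_ALL=C` is set so that the result does not depend on the locale, as the warning in the help text suggests.

```shell
# Made-up records in name:shell:uid form (illustrative only).
data=$(printf 'carol:zsh:1002\nalice:bash:30\nbob:bash:1001\n')

# -t: sets the field separator; -k3,3n sorts on field 3 only, numerically.
by_uid=$(printf '%s\n' "$data" | LC_ALL=C sort -t: -k3,3n)

# Two keys: field 2 as text, then field 3 numerically in reverse order.
by_shell_then_uid=$(printf '%s\n' "$data" | LC_ALL=C sort -t: -k2,2 -k3,3nr)

printf '%s\n---\n%s\n' "$by_uid" "$by_shell_then_uid"
```

Writing the key as `-k3,3` rather than `-k3` stops the key at the end of field 3 instead of letting it run to the end of the line.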
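grep, awk, sed
- grep: locating data
The name comes from the ed command g/re/p: globally search for a regular expression and print the matching lines
Print lines matching a pattern
Finds the lines that match a regular expression
SQL analogy: select * from table where <condition>;
- awk: slicing data
Processes individual fields within the lines that were located
SQL analogy: select field from table;
- sed: modifying data
stream editor
Modifies data in the lines that were located
SQL analogy: update table set field=new where field=old;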
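To make the three roles concrete, here is a small sketch on a made-up, space-separated table. The contents and the field layout are assumptions chosen for illustration.

```shell
# Made-up "table": name, role, city.
table=$(printf 'alice admin beijing\nbob dev shanghai\ncarol dev beijing\n')

# grep: select whole rows that match a pattern.
rows=$(printf '%s\n' "$table" | grep 'dev')

# awk: project a single field from every row.
names=$(printf '%s\n' "$table" | awk '{ print $1 }')

# sed: rewrite matching text in the stream.
updated=$(printf '%s\n' "$table" | sed 's/beijing/shenzhen/')

printf '%s\n--\n%s\n--\n%s\n' "$rows" "$names" "$updated"
```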
grep
Command-line Options
1.Generic Program Information
-V
grep -V
grep (GNU grep) 2.20
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Mike Haertel and others, see <http://git.sv.gnu.org/cgit/grep.git/tree/AUTHORS>.
2.Matching Control
-e patterns --regexp=patterns
-f file --file=file
-i -y --ignore-case --no-ignore-case
-v --invert-match
-w --word-regexp
-x --line-regexp
3.General Output Control
-c --count
--color[=WHEN] --colour[=WHEN]
-L --files-without-match
-l --files-with-matches
-m num --max-count=num
-o --only-matching
-q --quiet --silent
-s --no-messages
4.Output Line Prefix Control
-b --byte-offset
-H --with-filename
-h --no-filename
-n --line-number
-T --initial-tab
-Z --null
5.Context Line Control
-A num --after-context=num Print num lines of trailing context after matching lines.
-B num --before-context=num Print num lines of leading context before matching lines.
-C num -num --context=num Print num lines of leading and trailing output context.
6.File and Directory Selection
-a --text
Process a binary file as if it were text; this is equivalent to the '--binary-files=text' option.
--binary-files=type
-I
Process a binary file as if it did not contain matching data; this is equivalent to the '--binary-files=without-match' option.
--include=glob
--exclude=glob
--exclude-from=file
--exclude-dir=glob
-r --recursive
-R --dereference-recursive
Usage example
1.How can I list just the names of matching files? (print file names only)
grep -l "404" *.html
404.html
-l, --files-with-matches print only names of FILEs containing matches
grep -L "404" *.html
ie.html
-L, --files-without-match print only names of FILEs containing no match
2.How do I search directories recursively? (recurse into directories)
grep -r 'linux' ./posts
./posts/Python/2017-10-06-python-shell-os.md:https://www.ibm.com/developerworks/cn/linux/l-cn-pexpect2/index.html
./posts/Kubernetes/2017-10-22-Kubernetes-resources-code-cloud.md: export GOPKG=$GOROOT/pkg/tool/linux_amd64
./posts/Kubernetes/2017-10-22-Kubernetes-resources-code-cloud.md: export GOOS=linux
3.What if a pattern or file has a leading '-'? (pass the pattern with -e)
-e, --regexp=PATTERN use PATTERN for matching
4.Suppose I want to search for a whole word, not a part of a word? (whole-word match)
5.How do I output context around the matching lines? (print context)
6.How do I force grep to print the name of the file? (print the file name)
-H, --with-filename print the file name for each match
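7.Why do people use strange regular expressions on ps output? (filtering the output of other programs)
The pattern '[c]ron' matches "cron" but not the literal text "[c]ron" in grep's own command line, so the grep process itself does not appear in the result.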
ps -ef | grep '[c]ron'
root 2042 1 0 2018 ? 00:00:55 /usr/sbin/crond -n
8.Ignore case
-i, --ignore-case ignore case distinctions
grep -i 'hello.*world' menu.h main.c
9.Count the occurrences of a specific string in a text
grep -o 'haha' file | wc -l
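Questions 3 to 6 above list no commands. The following sketch shows one plausible command for each, run against a small sample file created for the purpose. The file name and its contents are assumptions, not taken from the original notes.

```shell
tmp=$(mktemp -d)
printf 'one\n--verbose flag\ncat\nconcatenate\nlast\n' > "$tmp/notes.txt"

# 3. Pattern with a leading '-': pass it with -e so it is not read as an option.
q3=$(grep -e '--verbose' "$tmp/notes.txt")

# 4. Whole-word match: -w matches "cat" but not "concatenate".
q4=$(grep -w 'cat' "$tmp/notes.txt")

# 5. Context: -C 1 prints one line before and one line after each match.
q5=$(grep -C 1 '^cat$' "$tmp/notes.txt")

# 6. Force the file name prefix even when searching a single file: -H.
q6=$(cd "$tmp" && grep -H 'last' notes.txt)

printf '%s\n--\n%s\n--\n%s\n--\n%s\n' "$q3" "$q4" "$q5" "$q6"
rm -rf "$tmp"
```

For a file whose name starts with '-', prefixing it with `./` is a common workaround.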
help
grep --help
Usage: grep [OPTION]... PATTERN [FILE]...
Search for PATTERN in each FILE or standard input.
PATTERN is, by default, a basic regular expression (BRE).
Example: grep -i 'hello world' menu.h main.c
Regexp selection and interpretation:
-E, --extended-regexp PATTERN is an extended regular expression (ERE)
-F, --fixed-strings PATTERN is a set of newline-separated fixed strings
-G, --basic-regexp PATTERN is a basic regular expression (BRE)
-P, --perl-regexp PATTERN is a Perl regular expression
-e, --regexp=PATTERN use PATTERN for matching
-f, --file=FILE obtain PATTERN from FILE
-i, --ignore-case ignore case distinctions
-w, --word-regexp force PATTERN to match only whole words
-x, --line-regexp force PATTERN to match only whole lines
-z, --null-data a data line ends in 0 byte, not newline
Miscellaneous:
-s, --no-messages suppress error messages
-v, --invert-match select non-matching lines
-V, --version display version information and exit
--help display this help text and exit
Output control:
-m, --max-count=NUM stop after NUM matches
-b, --byte-offset print the byte offset with output lines
-n, --line-number print line number with output lines
--line-buffered flush output on every line
-H, --with-filename print the file name for each match
-h, --no-filename suppress the file name prefix on output
--label=LABEL use LABEL as the standard input file name prefix
-o, --only-matching show only the part of a line matching PATTERN
-q, --quiet, --silent suppress all normal output
--binary-files=TYPE assume that binary files are TYPE;
TYPE is 'binary', 'text', or 'without-match'
-a, --text equivalent to --binary-files=text
-I equivalent to --binary-files=without-match
-d, --directories=ACTION how to handle directories;
ACTION is 'read', 'recurse', or 'skip'
-D, --devices=ACTION how to handle devices, FIFOs and sockets;
ACTION is 'read' or 'skip'
-r, --recursive like --directories=recurse
-R, --dereference-recursive
likewise, but follow all symlinks
--include=FILE_PATTERN
search only files that match FILE_PATTERN
--exclude=FILE_PATTERN
skip files and directories matching FILE_PATTERN
--exclude-from=FILE skip files matching any file pattern from FILE
--exclude-dir=PATTERN directories that match PATTERN will be skipped.
-L, --files-without-match print only names of FILEs containing no match
-l, --files-with-matches print only names of FILEs containing matches
-c, --count print only a count of matching lines per FILE
-T, --initial-tab make tabs line up (if needed)
-Z, --null print 0 byte after FILE name
Context control:
-B, --before-context=NUM print NUM lines of leading context
-A, --after-context=NUM print NUM lines of trailing context
-C, --context=NUM print NUM lines of output context
-NUM same as --context=NUM
--group-separator=SEP use SEP as a group separator
--no-group-separator use empty string as a group separator
--color[=WHEN],
--colour[=WHEN] use markers to highlight the matching strings;
WHEN is 'always', 'never', or 'auto'
-U, --binary do not strip CR characters at EOL (MSDOS/Windows)
-u, --unix-byte-offsets report offsets as if CRs were not there
(MSDOS/Windows)
'egrep' means 'grep -E'. 'fgrep' means 'grep -F'.
Direct invocation as either 'egrep' or 'fgrep' is deprecated.
When FILE is -, read standard input. With no FILE, read . if a command-line
-r is given, - otherwise. If fewer than two FILEs are given, assume -h.
Exit status is 0 if any line is selected, 1 otherwise;
if any error occurs and -q is not given, the exit status is 2.
Report bugs to: bug-grep@gnu.org
GNU Grep home page: <http://www.gnu.org/software/grep/>
General help using GNU software: <http://www.gnu.org/gethelp/>
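The exit status described at the end of the help text makes grep usable directly in shell conditionals. A minimal sketch, with `-q` suppressing output and the sample input supplied inline:

```shell
# Exit status 0: at least one line was selected.
if printf 'alpha\nbeta\n' | grep -q 'beta'; then
  found=yes
else
  found=no
fi

# Exit status 1: no line was selected. Capture it without aborting the script.
status=0
printf 'alpha\nbeta\n' | grep -q 'gamma' || status=$?

echo "found=$found status=$status"
```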
awk
In theory, awk can replace grep.
The name awk comes from the initials of its designers: Alfred V. Aho, Peter J. Weinberger, and Brian W. Kernighan. The original version of awk was written in 1977 at AT&T Bell Laboratories.
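To illustrate the idea that awk can stand in for grep, here is a rough sketch of awk equivalents for three common grep invocations. The sample text is invented for the example.

```shell
text=$(printf 'error: disk full\ninfo: ok\nERROR: net down\n')

# A bare /regex/ pattern prints matching lines, like grep 'error'.
awk_match=$(printf '%s\n' "$text" | awk '/error/')
grep_match=$(printf '%s\n' "$text" | grep 'error')

# A negated pattern prints the other lines, like grep -v 'error'.
awk_invert=$(printf '%s\n' "$text" | awk '!/error/')

# Counting matches in an END block, like grep -c 'error'.
awk_count=$(printf '%s\n' "$text" | awk '/error/ { n++ } END { print n + 0 }')

printf '%s\n--\n%s\n--\n%s\n' "$awk_match" "$awk_invert" "$awk_count"
```

grep remains the shorter tool for plain matching; awk becomes worthwhile once fields or arithmetic are involved.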
help
awk --help
Usage: awk [POSIX or GNU style options] -f progfile [--] file ...
Usage: awk [POSIX or GNU style options] [--] 'program' file ...
POSIX options: GNU long options: (standard)
-f progfile --file=progfile
-F fs --field-separator=fs
-v var=val --assign=var=val
Short options: GNU long options: (extensions)
-b --characters-as-bytes
-c --traditional
-C --copyright
-d[file] --dump-variables[=file]
-e 'program-text' --source='program-text'
-E file --exec=file
-g --gen-pot
-h --help
-L [fatal] --lint[=fatal]
-n --non-decimal-data
-N --use-lc-numeric
-O --optimize
-p[file] --profile[=file]
-P --posix
-r --re-interval
-S --sandbox
-t --lint-old
-V --version
To report bugs, see node `Bugs' in `gawk.info', which is
section `Reporting Problems and Bugs' in the printed version.
gawk is a pattern scanning and processing language.
By default it reads standard input and writes standard output.
Examples:
gawk '{ sum += $1 }; END { print sum }' file
gawk -F: '{ print $1 }' /etc/passwd
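The two examples in the help text, plus the `-v` option from the POSIX list, can be tried on inline data. This is a sketch with made-up records in `user:uid:shell` form; it uses the `awk` command name, which is gawk on the system shown here.

```shell
# Made-up colon-separated records: user:uid:shell.
records=$(printf 'root:0:/bin/bash\ndaemon:1:/sbin/nologin\nalice:1000:/bin/bash\n')

# -F: sets the field separator; print the first field of each line.
users=$(printf '%s\n' "$records" | awk -F: '{ print $1 }')

# Sum a numeric column and print the total after the last line.
uid_sum=$(printf '%s\n' "$records" | awk -F: '{ sum += $2 } END { print sum }')

# -v passes a shell value into the program as an awk variable.
bash_users=$(printf '%s\n' "$records" | awk -F: -v sh=/bin/bash '$3 == sh { print $1 }')

printf '%s\n--\n%s\n--\n%s\n' "$users" "$uid_sum" "$bash_users"
```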
sed
sed, a stream editor
help
sed
Usage: sed [OPTION]... {script-only-if-no-other-script} [input-file]...
-n, --quiet, --silent
suppress automatic printing of pattern space
-e script, --expression=script
add the script to the commands to be executed
-f script-file, --file=script-file
add the contents of script-file to the commands to be executed
--follow-symlinks
follow symlinks when processing in place
-i[SUFFIX], --in-place[=SUFFIX]
edit files in place (makes backup if SUFFIX supplied)
-c, --copy
use copy instead of rename when shuffling files in -i mode
-b, --binary
does nothing; for compatibility with WIN32/CYGWIN/MSDOS/EMX (
open files in binary mode (CR+LFs are not treated specially))
-l N, --line-length=N
specify the desired line-wrap length for the `l' command
--posix
disable all GNU extensions.
-r, --regexp-extended
use extended regular expressions in the script.
-s, --separate
consider files as separate rather than as a single continuous
long stream.
-u, --unbuffered
load minimal amounts of data from the input files and flush
the output buffers more often
-z, --null-data
separate lines by NUL characters
--help
display this help and exit
--version
output version information and exit
If no -e, --expression, -f, or --file option is given, then the first
non-option argument is taken as the sed script to interpret. All
remaining arguments are names of input files; if no input files are
specified, then the standard input is read.
GNU sed home page: <http://www.gnu.org/software/sed/>.
General help using GNU software: <http://www.gnu.org/gethelp/>.
examples
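- Replace text
# Replace the first occurrence of "old" on each line with "new"
sed 's/old/new/' input.txt
# Replace every occurrence of "old" on each line with "new"
sed 's/old/new/g' input.txt
# Replace the Nth occurrence of "old" on each line with "new" (here N=2)
sed 's/old/new/2' input.txt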
- Delete lines
# Delete every line that contains "pattern"
sed '/pattern/d' input.txt
# Delete the Nth line (here N=3)
sed '3d' input.txt
# Delete empty lines
sed '/^$/d' input.txt
- Insert and append text
# Insert text before each line that matches "pattern"
sed '/pattern/i\inserted text' input.txt
# Append text after each line that matches "pattern"
sed '/pattern/a\appended text' input.txt
- Use variables
# Substitute using shell variables (double quotes let the shell expand them)
old="foo"
new="bar"
sed "s/$old/$new/" input.txt
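- Save changes to the original file
# Write the changes back to the original file
sed -i 's/old/new/' input.txt
# Keep a backup of the original as input.txt.bak, then write the changes
sed -i.bak 's/old/new/' input.txt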
- Print specific lines
# Print the Nth line (here N=3)
sed -n '3p' input.txt
# Print the lines that match "pattern"
sed -n '/pattern/p' input.txt
- Run multiple commands
# Run several sed commands in one invocation
sed -e 's/old/new/' -e '/pattern/d' input.txt
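A few of the commands above can be checked against a throwaway file. This is a sketch only: the file contents are invented, and the concrete line and occurrence numbers (2 and 3) stand in for the N of the examples.

```shell
tmp=$(mktemp -d)
printf 'old old\n\nkeep\nold pattern\n' > "$tmp/input.txt"

# Replace only the 2nd "old" on each line; look at the first output line.
second=$(sed 's/old/new/2' "$tmp/input.txt" | head -n 1)

# Print only line 3.
line3=$(sed -n '3p' "$tmp/input.txt")

# Delete empty lines, then lines containing "pattern".
cleaned=$(sed -e '/^$/d' -e '/pattern/d' "$tmp/input.txt")

# Edit in place, keeping the original as input.txt.bak.
sed -i.bak 's/old/new/g' "$tmp/input.txt"
first_after=$(head -n 1 "$tmp/input.txt")
first_backup=$(head -n 1 "$tmp/input.txt.bak")

printf '%s\n%s\n%s\n%s\n%s\n' "$second" "$line3" "$cleaned" "$first_after" "$first_backup"
rm -rf "$tmp"
```

Running a command without `-i` first, and only then adding `-i.bak`, is a cautious way to try an edit before touching the file.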
head
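The head command displays the beginning of a file. By default it shows the first 10 lines.
Syntax
head [options] [files]
Options
-n<number>: number of lines to display from the start of the file
-c<number>: number of bytes to display from the start of the file
-v: always print the file name header
-q: never print the file name header
Arguments
File list: the files whose beginning is to be displayed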
Usage Examples
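A few illustrative invocations, sketched on generated input rather than a real file. The negative line count is a GNU extension, so that line is an assumption about the head implementation in use.

```shell
# Generate the numbers 1..20, one per line, as sample input.
nums=$(seq 1 20)

# With no options, head prints the first 10 lines.
default_count=$(printf '%s\n' "$nums" | head | wc -l)

# First 3 lines.
first3=$(printf '%s\n' "$nums" | head -n 3)

# First 5 bytes.
first5bytes=$(printf 'abcdefghij\n' | head -c 5)

# GNU head only: a negative count prints all but the last 18 lines.
all_but_last_18=$(printf '%s\n' "$nums" | head -n -18)

printf '%s\n--\n%s\n--\n%s\n--\n%s\n' "$default_count" "$first3" "$first5bytes" "$all_but_last_18"
```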
help
head --help
Usage: head [OPTION]... [FILE]...
Print the first 10 lines of each FILE to standard output.
With more than one FILE, precede each with a header giving the file name.
With no FILE, or when FILE is -, read standard input.
Mandatory arguments to long options are mandatory for short options too.
-c, --bytes=[-]K print the first K bytes of each file;
with the leading '-', print all but the last
K bytes of each file
-n, --lines=[-]K print the first K lines instead of the first 10;
with the leading '-', print all but the last
K lines of each file
-q, --quiet, --silent never print headers giving file names
-v, --verbose always print headers giving file names
--help display this help and exit
--version output version information and exit
K may have a multiplier suffix:
b 512, kB 1000, K 1024, MB 1000*1000, M 1024*1024,
GB 1000*1000*1000, G 1024*1024*1024, and so on for T, P, E, Z, Y.
GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
For complete documentation, run: info coreutils 'head invocation'
Formatting JSON
JSON text
echo '{"uid":100120,"token":"1fa9fb8004b04f66b7da57393641eddc"}' | jq .
File
cat abc.json | jq .
Pretty-print the data returned by a curl test request
result=$(curl http://xxxxx)
echo "$result" | jq .
wc
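print newline, word, and byte counts for each file
wc reports how many lines, words, and characters a file contains
For example: cat /etc/man.config | wc
-l: print the line count
-w: print the word count
-m: print the character count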
Usage Examples
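A short sketch of the individual counters on inline ASCII sample text. For multibyte text, `-m` (characters) and `-c` (bytes) would differ; here they coincide.

```shell
# Two lines of sample text: "hello world" and "foo".
sample=$(printf 'hello world\nfoo\n')

lines=$(printf '%s\n' "$sample" | wc -l)    # newline count
words=$(printf '%s\n' "$sample" | wc -w)    # word count
bytes=$(printf '%s\n' "$sample" | wc -c)    # byte count
longest=$(printf '%s\n' "$sample" | wc -L)  # GNU wc: length of the longest line

echo "lines=$lines words=$words bytes=$bytes longest=$longest"
```

Reading from standard input, as here, keeps the file name out of the output, which is convenient when the number is used in a script.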
help
wc --help
Usage: wc [OPTION]... [FILE]...
or: wc [OPTION]... --files0-from=F
Print newline, word, and byte counts for each FILE, and a total line if
more than one FILE is specified. With no FILE, or when FILE is -,
read standard input. A word is a non-zero-length sequence of characters
delimited by white space.
The options below may be used to select which counts are printed, always in
the following order: newline, word, character, byte, maximum line length.
-c, --bytes print the byte counts
-m, --chars print the character counts
-l, --lines print the newline counts
--files0-from=F read input from the files specified by
NUL-terminated names in file F;
If F is - then read names from standard input
-L, --max-line-length print the length of the longest line
-w, --words print the word counts
--help display this help and exit
--version output version information and exit
GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
For complete documentation, run: info coreutils 'wc invocation'
cut
remove sections from each line of files
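Syntax
cut -d[delimiter] -f[fields] file
Common options
-b (bytes): select by byte position. Byte positions ignore multibyte character boundaries unless -n is also given.
-c (characters): select by character position.
-d: set a custom field delimiter; the default is TAB.
-f (fields): used together with -d; selects which fields to print.
-n: do not split multibyte characters. Only used with -b. A character is written if its last byte falls within the range given by the -b LIST; otherwise it is excluded.
Usage Examples
-b 1,3,5 // select bytes 1, 3 and 5 of each line
-b 1-5 // select bytes 1 to 5 of each line
-b -5 // select bytes 1 to 5 of each line
-b 3- // select from byte 3 to the end of each line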
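The four byte ranges above, plus one field-mode example, can be sketched on inline input. The sample strings are invented for illustration.

```shell
line='abcdefgh'

b135=$(printf '%s\n' "$line" | cut -b 1,3,5)   # bytes 1, 3 and 5
b1to5=$(printf '%s\n' "$line" | cut -b 1-5)    # bytes 1 to 5
bto5=$(printf '%s\n' "$line" | cut -b -5)      # from the first byte to byte 5
b3on=$(printf '%s\n' "$line" | cut -b 3-)      # byte 3 to end of line

# Field mode: -d sets the delimiter, -f picks the fields.
rec='root:x:0:0:root:/root:/bin/bash'
user_shell=$(printf '%s\n' "$rec" | cut -d: -f1,7)

printf '%s %s %s %s %s\n' "$b135" "$b1to5" "$bto5" "$b3on" "$user_shell"
```

Selected fields are joined with the input delimiter unless `--output-delimiter` is given.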
help
cut --help
Usage: cut OPTION... [FILE]...
Print selected parts of lines from each FILE to standard output.
Mandatory arguments to long options are mandatory for short options too.
-b, --bytes=LIST select only these bytes
-c, --characters=LIST select only these characters
-d, --delimiter=DELIM use DELIM instead of TAB for field delimiter
-f, --fields=LIST select only these fields; also print any line
that contains no delimiter character, unless
the -s option is specified
-n with -b: don't split multibyte characters
--complement complement the set of selected bytes, characters
or fields
-s, --only-delimited do not print lines not containing delimiters
--output-delimiter=STRING use STRING as the output delimiter
the default is to use the input delimiter
--help display this help and exit
--version output version information and exit
Use one, and only one of -b, -c or -f. Each LIST is made up of one
range, or many ranges separated by commas. Selected input is written
in the same order that it is read, and is written exactly once.
Each range is one of:
N N'th byte, character or field, counted from 1
N- from N'th byte, character or field, to end of line
N-M from N'th to M'th (included) byte, character or field
-M from first to M'th (included) byte, character or field
With no FILE, or when FILE is -, read standard input.
GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
For complete documentation, run: info coreutils 'cut invocation'
uniq
-c, --count prefix lines by the number of occurrences
-d, --repeated only print duplicate lines, one for each group
-D, --all-repeated[=METHOD] print all duplicate lines; groups can be delimited with an empty line, METHOD={none(default),prepend,separate}
-f, --skip-fields=N avoid comparing the first N fields
--group[=METHOD] show all items, separating groups with an empty line, METHOD={separate(default),prepend,append,both}
-i, --ignore-case ignore differences in case when comparing
-s, --skip-chars=N avoid comparing the first N characters
-u, --unique only print unique lines
-z, --zero-terminated end lines with 0 byte, not newline
-w, --check-chars=N compare no more than N characters in lines
--help display this help and exit
--version output version information and exit
help
[root@qwq ~]# uniq --help
Usage: uniq [OPTION]... [INPUT [OUTPUT]]
Filter adjacent matching lines from INPUT (or standard input),
writing to OUTPUT (or standard output).
With no options, matching lines are merged to the first occurrence.
Mandatory arguments to long options are mandatory for short options too.
-c, --count prefix lines by the number of occurrences
-d, --repeated only print duplicate lines, one for each group
-D, --all-repeated[=METHOD] print all duplicate lines
groups can be delimited with an empty line
METHOD={none(default),prepend,separate}
-f, --skip-fields=N avoid comparing the first N fields
--group[=METHOD] show all items, separating groups with an empty line
METHOD={separate(default),prepend,append,both}
-i, --ignore-case ignore differences in case when comparing
-s, --skip-chars=N avoid comparing the first N characters
-u, --unique only print unique lines
-z, --zero-terminated end lines with 0 byte, not newline
-w, --check-chars=N compare no more than N characters in lines
--help display this help and exit
--version output version information and exit
A field is a run of blanks (usually spaces and/or TABs), then non-blank
characters. Fields are skipped before chars.
Note: 'uniq' does not detect repeated lines unless they are adjacent.
You may want to sort the input first, or use 'sort -u' without 'uniq'.
Also, comparisons honor the rules specified by 'LC_COLLATE'.
GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
For complete documentation, run: info coreutils 'uniq invocation'
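The note about adjacency is the most common source of surprises with uniq. A minimal sketch on inline data shows the difference that sorting first makes. The awk step only normalises the padding that `uniq -c` puts in front of the count.

```shell
data=$(printf 'b\na\nb\na\na\nc\n')

# Without sorting, only the adjacent "a a" pair is merged.
unsorted=$(printf '%s\n' "$data" | uniq)

# Sorting first brings equal lines together, so the counts are global.
counted=$(printf '%s\n' "$data" | sort | uniq -c | awk '{ print $1 ":" $2 }')

# -u keeps lines that occur exactly once; -d keeps one copy of each repeated line.
only_unique=$(printf '%s\n' "$data" | sort | uniq -u)
only_dups=$(printf '%s\n' "$data" | sort | uniq -d)

printf '%s\n--\n%s\n--\n%s\n--\n%s\n' "$unsorted" "$counted" "$only_unique" "$only_dups"
```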
Practice: nginx log analysis
1.Top ten client IPs by request count:
[root@qwq ~]# awk '{print $1}' nginx/access.log | sort |uniq -c |sort -nr |head -10
36 124.250.97.124
9 101.133.224.19
6 106.38.46.51
5 120.232.182.137
4 40.77.189.255
4 157.55.39.68
3 66.249.79.254
2 74.82.47.5
2 66.249.79.225
2 47.92.95.89
2.Filter by date
[root@qwq ~]# grep "27/Jul/2017" nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -nr | head -10
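The same count-and-rank pattern works for other fields. The sketch below runs it on a synthetic log so that it can be tried without a real server. The log lines, the documentation-range IP addresses, and the assumption that the status code is field 9 (as in nginx's default "combined" layout) are all illustrative and should be checked against the actual log_format in use.

```shell
tmp=$(mktemp -d)
# Synthetic access log; the client IP is field 1 and the status code is field 9.
cat > "$tmp/access.log" <<'EOF'
192.0.2.1 - - [27/Jul/2017:10:00:00 +0800] "GET / HTTP/1.1" 200 612 "-" "curl/7.29.0"
192.0.2.2 - - [27/Jul/2017:10:00:01 +0800] "GET /a HTTP/1.1" 404 169 "-" "curl/7.29.0"
192.0.2.1 - - [27/Jul/2017:10:00:02 +0800] "GET /b HTTP/1.1" 200 612 "-" "curl/7.29.0"
192.0.2.1 - - [28/Jul/2017:09:00:00 +0800] "GET / HTTP/1.1" 200 612 "-" "curl/7.29.0"
EOF

# Requests per status code, most frequent first (awk trims uniq's padding).
by_status=$(awk '{ print $9 }' "$tmp/access.log" | sort | uniq -c | sort -nr | awk '{ print $1, $2 }')

# Busiest client IP on a single day, using the date filter from step 2.
top_ip=$(grep '27/Jul/2017' "$tmp/access.log" | awk '{ print $1 }' | sort | uniq -c | sort -nr | head -n 1 | awk '{ print $1, $2 }')

printf '%s\n--\n%s\n' "$by_status" "$top_ip"
rm -rf "$tmp"
```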
refs
http://www.gnu.org/software/grep/manual/grep.html
http://www.gnu.org/software/sed/manual/sed.html
http://www.gnu.org/software/gawk/manual/gawk.html