Master regular expressions from scratch: a hardcore guide! [worth bookmarking]


Preface

Regular expressions are available in almost every language. Whether it is front-end JavaScript or back-end Java, Python, C# and so on, each provides corresponding interfaces or functions to support them.

Before learning regular expressions, all we can do is watch the regex masters write strings that look like an alien language. We have no idea what they mean, yet they can replace a long chain of if/else data-validation logic.

Today we will use plain language to cover the basics of regular expressions and lift the veil of mystery around them!


1. Introduction to regular expressions

Regular expression: a special piece of text made up of letters and symbols that helps us extract the text we need from a complex string.

In real development we often need to find strings that follow some complex rule, such as email addresses, image URLs, or mobile phone numbers. In these cases, a regular expression can be used to match or find the strings that follow the rule.

2. Introduction to the re module

This article demonstrates regular expressions with examples in Python. To use regular expressions in Python, first import the re module.

Note: re.match() matches the string from its beginning against the regular expression. If the beginning does not match, it returns None, and calling group() on that result raises an error. The examples below use match because it is convenient for explanation.

Importing the re module

In Python, use the re module whenever you need to match strings with regular expressions.

# Import the re module
import re

# Use the match method for matching
# pattern and string are placeholders for the regular expression and the string to match
result = re.match(pattern, string)

# If the previous step matched, use the group method to extract the data
result.group()

Using the re module

import re

# Use the match method for matching
result = re.match("csdn", "csdn.net")
# Get the matched text
info = result.group()
print(info)
# csdn
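
Because re.match() returns None when the beginning of the string does not match, calling group() directly on the result can fail. Below is a minimal sketch of guarding against that. The helper name safe_match and the sample strings are made up for illustration and are not part of the re module.

```python
import re

def safe_match(pattern, text):
    """Return the matched text, or None if the start of text does not match."""
    match_obj = re.match(pattern, text)
    # re.match returns None on failure, so check before calling group()
    if match_obj is None:
        return None
    return match_obj.group()

print(safe_match("csdn", "csdn.net"))      # csdn
print(safe_match("csdn", "www.csdn.net"))  # None
```

The second call returns None because match only looks at the beginning of the string, even though "csdn" appears later in it.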

Regular expressions are so powerful because they have many special operators (also called "metacharacters"), special characters, and modifiers.

To make them easier to learn and remember, I divide them into four groups: matching a single character, matching multiple characters, matching the beginning and end, and matching groups.

3. Match a single character

Code    Function
.       Match any single character (except the newline \n)
[ ]     Match any one of the characters listed in []
\d      Match a digit, i.e. 0-9
\D      Match a non-digit
\s      Match whitespace, e.g. space or tab
\S      Match non-whitespace
\w      Match a word character, i.e. a-z, A-Z, 0-9, _, and Chinese characters
\W      Match a special character, i.e. not a letter, digit, underscore, or Chinese character

Example 1:

import re
# .  Match any single character (except \n)
# Argument 1: the regular expression
# Argument 2: the string to match
# re.match returns a match object
ret = re.match(".", "M")
print(ret.group())

ret = re.match("t.o", "too")
print(ret.group())

ret = re.match("t.o", "two")
print(ret.group())

match_obj = re.match("t.o", "t\no")
if match_obj:
    # Get the matched text
    result = match_obj.group()
    print(result)
else:
    # Match failed: match_obj is None
    print("Matching failed")

Operation results:

M
too
two
Matching failed

Example 2: []

import re
# [ ]  Match any one of the characters listed in []
# Argument 1: the regular expression
# Argument 2: the string to match
# re.match returns a match object

# If the first character of hello is lowercase, the regular expression needs a lowercase h
ret = re.match("h", "hello Python")
print(ret.group())

# If the first character of Hello is uppercase, the regular expression needs an uppercase H
ret = re.match("H", "Hello Python")
print(ret.group())

match_obj = re.match("Gourd baby[12]", "Gourd baby1")
if match_obj:
    # Get the matched text
    result = match_obj.group()
    print(result)
else:
    # Match failed: match_obj is None
    print("Matching failed")

# Match one digit of a bank card password
match_obj = re.match("[0123456789]", "7")
if match_obj:
    # Get the matched text
    result = match_obj.group()
    print(result)
else:
    # Match failed: match_obj is None
    print("Matching failed")

match_obj = re.match("[0-9]", "7")
if match_obj:
    # Get the matched text
    result = match_obj.group()
    print(result)
else:
    # Match failed: match_obj is None
    print("Matching failed")

Operation results:

h
H
Gourd baby1
7
7

Example 3: \d

# \d => [0-9] => [0123456789]
match_obj = re.match(r"\d", "7")
if match_obj:
    # Get the matched text
    result = match_obj.group()
    print(result)
else:
    # Match failed: match_obj is None
    print("Matching failed")

Operation results:

7

Example 4: \D

# \D: match one non-digit character
match_obj = re.match(r"\D", "a")
if match_obj:
    # Get the matched text
    result = match_obj.group()
    print(result)
else:
    # Match failed: match_obj is None
    print("Matching failed")

Operation results:

a

Example 5: \s

# \s: match one whitespace character, such as a space or tab
match_obj = re.match(r"Gourd baby\s[12]", "Gourd baby 1")
if match_obj:
    # Get the matched text
    result = match_obj.group()
    print(result)
else:
    # Match failed: match_obj is None
    print("Matching failed")

Operation results:

Gourd baby 1

Example 6: \S

# \S: match one non-whitespace character
match_obj = re.match(r"Gourd baby\S[12]", "Gourd baby+1")
if match_obj:
    # Get the matched text
    result = match_obj.group()
    print("Non-whitespace match:", result)
else:
    # Match failed: match_obj is None
    print("Non-whitespace match: match failed")

Operation results:

Non-whitespace match: Gourd baby+1

Example 7: \w

# \w: match one letter, digit, underscore, or Chinese character
match_obj = re.match(r"\w", "Ha")
if match_obj:
    # Get the matched text; \w matches a single character, so only the first one is returned
    result = match_obj.group()
    print(result)
else:
    # Match failed: match_obj is None
    print("Matching failed")

Operation results:

H

Example 8: \W

# \W: match one special character
match_obj = re.match(r"\W", "&")
if match_obj:
    # Get the matched text
    print(match_obj.group())
else:
    print("Matching failed")

Operation results:

&

4. Match multiple characters

Code    Function
*       Match the previous character 0 or more times, i.e. it is optional
+       Match the previous character 1 or more times, i.e. at least once
?       Match the previous character 0 or 1 times, i.e. at most once
{m}     Match the previous character exactly m times
{m,n}   Match the previous character from m to n times

Example 1: *

Requirement: match a string whose first character is an uppercase letter, followed by lowercase letters, where the lowercase letters are optional

# *  Match the previous character 0 or more times, i.e. it is optional
import re

ret = re.match("[A-Z][a-z]*", "M")
print(ret.group())

ret = re.match("[A-Z][a-z]*", "MnnM")
print(ret.group())

ret = re.match("[A-Z][a-z]*", "Aabcdef")
print(ret.group())

Operation results:

M
Mnn
Aabcdef

Example 2: +

Requirement: match a string whose first character is t, whose last character is o, and which has at least one character in between

import re

match_obj = re.match("t.+o", "two")
if match_obj:
    print(match_obj.group())
else:
    print("Matching failed")

Operation results:

two

Example 3: ?

Requirement: match both https and http, i.e. the s may or may not be present

match_obj = re.match("https?", "http")
if match_obj:
    print(match_obj.group())
else:
    print("Matching failed")

Operation results:

http

Example 4: {m}

# {m}: the previous character must occur exactly m times
match_obj = re.match("ht{2}p", "http")
if match_obj:
    # Get the matched text
    result = match_obj.group()
    print(result)
else:
    print("Matching failed")

Operation results:

http

Example 5: {m,n}

# {m,n}: match the previous character at least m times and at most n times
match_obj = re.match("ht{1,3}p", "httttp")
if match_obj:
    # Get the matched text
    result = match_obj.group()
    print(result)
else:
    print("Matching failed")

# Extension: {m,} matches the previous character at least m times
match_obj = re.match("ht{2,}p", "htttttp")
if match_obj:
    # Get the matched text
    result = match_obj.group()
    print(result)
else:
    print("Matching failed")

Operation results:

Matching failed
htttttp

5. Match beginning and end

Code    Function
^       Match the beginning of the string; inside brackets, [^specified characters] matches any character except the specified ones
$       Match the end of the string

Example 1: ^

# Match a string that starts with a digit
match_obj = re.match(r"^\d.*", "1abc")
if match_obj:
    # Get the matched text
    result = match_obj.group()
    print(result)
else:
    print("Matching failed")

Operation results:

1abc

Example 2: $

# Match a string that ends with a digit
match_obj = re.match(r".*\d$", "aa3")
if match_obj:
    # Get the matched text
    result = match_obj.group()
    print(result)
else:
    print("Matching failed")

Operation results:

aa3

Example 3: ^ and $

# Match a string that begins with a digit and ends with a digit; the content in between does not matter
match_obj = re.match(r"^\d.*\d$", "2asdfa3")
if match_obj:
    # Get the matched text
    result = match_obj.group()
    print(result)
else:
    print("Matching failed")

# [^specified characters] matches any character except the specified ones

# [^47] matches any character except 4 and 7
# ^ outside brackets: the string must start with what follows
# [^...]: match any character except those listed
match_obj = re.match(r"^\d.*[^47]$", "2asdfa7")
if match_obj:
    # Get the matched text
    result = match_obj.group()
    print(result)
else:
    print("Matching failed")

Operation results:

2asdfa3
Matching failed
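
A common practical use of ^ and $ together is checking that a whole string fits a rule, such as the mobile phone numbers mentioned in the introduction. The sketch below assumes a simplified, hypothetical rule (11 digits starting with 1); the names PHONE_PATTERN and looks_like_phone are invented for illustration, and real validation rules are likely stricter.

```python
import re

# Hypothetical rule for illustration only: 11 digits, starting with 1
PHONE_PATTERN = r"^1\d{10}$"

def looks_like_phone(text):
    """Return True if the whole string fits the illustrative pattern."""
    return re.match(PHONE_PATTERN, text) is not None

print(looks_like_phone("13634567413"))     # True
print(looks_like_phone("13634567413abc"))  # False, $ rejects the trailing text
print(looks_like_phone("136-3456-7413"))   # False, - is not a digit
```

Without the trailing $, the second call would succeed, because match only requires the beginning of the string to fit.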

6. Matching grouping

Code          Function
|             Match either the left or the right expression
(ab)          Treat the characters in parentheses as a group
\num          Reference the string matched by group number num
(?P<name>)    Give the group an alias
(?P=name)     Reference the string matched by the group with alias name

Example 1:|

# Fruit list
fruit_list = ['apple', 'banana', 'orange', 'pear', 'peach']

for value in fruit_list:
    # Match each string against the regular expression
    # |  Match either the left or the right expression
    match_obj = re.match("banana|pear", value)

    if match_obj:
        result = match_obj.group()
        print("Fruit I want to eat:", result)
    else:
        print("Fruit I don't want to eat:", value)

Operation results:

Fruit I don't want to eat: apple
Fruit I want to eat: banana
Fruit I don't want to eat: orange
Fruit I want to eat: pear
Fruit I don't want to eat: peach

Example 2: ()

# Match 163, 126, and qq email addresses
# \. escapes the . in the regular expression so that it becomes an ordinary dot and only matches the . character
# (163|126|qq) is a group. Each pair of parentheses is one group, and group numbers start from 1
# If there are several pairs of parentheses, the groups are numbered from left to right
match_obj = re.match(r"[a-zA-Z0-9_]{4,20}@(163|126|qq)\.com", "hello@163.com")
if match_obj:
    # Get the whole match; group() defaults to group 0
    result = match_obj.group(0)
    # Get the data matched by group 1
    mail_type = match_obj.group(1)
    print(mail_type)
    print(result)
else:
    print("Matching failed")

# Match a QQ number such as "qq:3014587"
match_obj = re.match(r"(qq:)([1-9]\d{4,11})", "qq:666666")
if match_obj:
    result = match_obj.group()
    print(result)

    result = match_obj.group(1)
    print(result)

    result = match_obj.group(2)
    print(result)
else:
    print("Matching failed")

Operation results:

163
hello@163.com
qq:666666
qq:
666666

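Example 3: \num

Requirement: match <html>hh</html>

# Without a back-reference, the closing tag can be any tag name, so mismatched tags also match
match_obj = re.match("<[a-zA-Z1-6]+>.*</[a-zA-Z1-6]+>", "<html>hh</div>")

if match_obj:
    print(match_obj.group())
else:
    print("Matching failed")

# \1 refers to the text matched by group 1, so the closing tag must equal the opening tag
match_obj = re.match(r"<([a-zA-Z1-6]+)>.*</\1>", "<html>hh</html>")

if match_obj:
    print(match_obj.group())
else:
    print("Matching failed")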

Operation results:

<html>hh</div>
<html>hh</html>

Requirement: match <html><h1>www.baidu.com</h1></html>

match_obj = re.match(r"<([a-zA-Z1-6]+)><([a-zA-Z1-6]+)>.*</\2></\1>", "<html><h1>www.baidu.com</h1></html>")

if match_obj:
    print(match_obj.group())
else:
    print("Matching failed")

Operation results:

<html><h1>www.baidu.com</h1></html>

Example 4: (?P<name>) and (?P=name)

# <html><h1>www.baidu.com</h1></html>

match_obj = re.match(r"<(?P<name1>[a-zA-Z1-6]+)><(?P<name2>[a-zA-Z1-6]+)>.*</(?P=name2)></(?P=name1)>", "<html><h1>www.baidu.com</h1></html>")
if match_obj:
    result = match_obj.group()
    print(result)
else:
    print("Matching failed")

Operation results:

<html><h1>www.baidu.com</h1></html>
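
Group aliases are also handy when reading results, since a group can be fetched by name instead of by number. The sketch below reuses the email pattern from Example 2; the alias names user and domain are chosen here for illustration and do not come from the original examples.

```python
import re

# Same email pattern as Example 2, with aliases added to the two parts
pattern = r"(?P<user>[a-zA-Z0-9_]{4,20})@(?P<domain>163|126|qq)\.com"
match_obj = re.match(pattern, "hello@163.com")

# Fetch groups by alias instead of by number
print(match_obj.group("user"))    # hello
print(match_obj.group("domain"))  # 163
# groupdict() returns all named groups as a dictionary
print(match_obj.groupdict())      # {'user': 'hello', 'domain': '163'}
```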

7. Introduction to common methods in Python

Three functions are used to find matches: match(), search(), and findall(). One function, sub(), is used for replacement, and one function, split(), is used to split strings.

  • match(): matches from the beginning of the string. If the beginning does not match, it returns None;
  • search(): scans the whole string and returns as soon as the first match is found, without looking further;
  • findall(): scans the whole string and returns all matches as a list;
  • compile(): compiles a pattern string into a regular expression object for use with match(), search(), and findall();
  • sub(): scans the whole string and replaces the matching parts;
  • split(): scans the whole string and splits it at each match of the specified separator pattern;
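s1 = 'ab less than cd years abc'
r2 = re.search('b', s1)
print(r2.group())
# b

# Inside [], | is a literal character, so [ab] is enough to match a or b
# findall returns a list, which has no group() method, so print it directly
r3 = re.findall('[ab]', s1)
print(r3)
# ['a', 'b', 'a', 'a', 'a', 'b']

r4 = re.findall('f', s1)
print(r4)
# []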
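
The list above mentions compile(), which the examples so far have not used. Here is a minimal sketch of compiling a pattern once and reusing it; the variable names and the sample text are illustrative assumptions.

```python
import re

# Compile once, then reuse the pattern object's own methods
digits = re.compile(r"\d+")

text = "tel: 136-3456-7413"

print(digits.match(text))           # None, the text does not start with a digit
print(digits.search(text).group())  # 136
print(digits.findall(text))         # ['136', '3456', '7413']
```

The three calls also show the difference between the search functions: match only looks at the beginning, search stops at the first hit, and findall collects every hit.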

Next, take the kind of HTML data a crawler often sees. How do we use regular expressions to get the content of the li tags or the href attributes?

html = '''<html>
<head lang="en">
<title>Title</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
<div id="content">
<ul id="ul1">
<li>first</li>
<li>second</li>
<li>third</li>
</ul>
<ul id="ul2">
<li>alpha</li>
<li>beta</li>
</ul>
</div>
<div id="url">
<a href="http:www.baidu.com" title="baidu">baidu</a>
<a href="http:prog3.com" title="csdn">csdn</a>
</div>
</body>
</html>
'''
# Get the content of the li tags
re.findall('<li>(.*?)</li>', html)
# ['first', 'second', 'third', 'alpha', 'beta']

# Get the href attributes
re.findall('<a href="(.*?)" ', html)
# ['http:www.baidu.com', 'http:prog3.com']

The approach is simple and follows a fixed recipe: copy the source string from start to end, then replace the part we want with (.*?) so that the parentheses capture it.
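
As a hedged sketch of the same recipe applied to a different target, the snippet below pulls both the href and the title out of each a tag. The shortened html string is a stand-in for the full page above, and the variable name links is chosen for illustration.

```python
import re

# A shortened stand-in for the page above
html = '''<div id="url">
<a href="http:www.baidu.com" title="baidu">baidu</a>
<a href="http:prog3.com" title="csdn">csdn</a>
</div>'''

# Copy the tag text, then replace each wanted part with (.*?)
links = re.findall('<a href="(.*?)" title="(.*?)">', html)
print(links)
# [('http:www.baidu.com', 'baidu'), ('http:prog3.com', 'csdn')]
```

With more than one group in the pattern, findall returns a list of tuples, one tuple per match.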

The pattern above uses * and ?. Here * indicates greedy matching, and the ? after it makes the match non-greedy.

  • . matches any character except a line feed;
  • * matches the preceding character any number of times, and as many times as possible;
  • ? on its own matches the preceding element at most once; placed right after *, as in .*?, it makes the match stop as early as possible;
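
The difference between greedy and non-greedy matching is easiest to see side by side. This is a small sketch on a made-up one-line string; the variable names are illustrative.

```python
import re

text = "<li>first</li><li>second</li>"

# Greedy: .* grabs as much as possible, up to the last </li>
greedy = re.findall("<li>(.*)</li>", text)
# Non-greedy: .*? stops at the first </li> it can
lazy = re.findall("<li>(.*?)</li>", text)

print(greedy)  # ['first</li><li>second']
print(lazy)    # ['first', 'second']
```
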
# '\D{3,5}' matches 3 to 5 consecutive non-digit characters.
s3 = '12one34two56three78four'
# No maximum number of replacements is given here, so all matches are replaced by default
re.sub(r'\D{3,5}', 'letters', s3)
# '12letters34letters56letters78letters'

# count=2 sets the maximum number of replacements, so only the first two matches are replaced
re.sub(r'\D{3,5}', 'letters', s3, count=2)
# '12letters34letters56three78four'

# With count=3, the first three matches are replaced
re.sub(r'\D{3,5}', 'letters', s3, count=3)
# '12letters34letters56letters78four'

# '\D' matches one non-digit character
s4 = '136-3456-7413'
# If no maximum number of splits is given, the string is split at every match
re.split(r'\D', s4)
# ['136', '3456', '7413']

# With maxsplit=1, only the first separator is used for splitting
re.split(r'\D', s4, maxsplit=1)
# ['136', '3456-7413']

Well, that's all for today. Let's continue our efforts tomorrow!

Writing takes effort, and reading without giving anything back is no way to go. Your support and recognition are the biggest motivation for my writing. See you in the next article!

Dragon youth

If there are any mistakes in this blog, please point them out in the comments. Thank you very much!
