Copyright notice: this article is the blogger's original work and may not be reproduced without permission.
Introduction
Anyone who searches with Google or Baidu has seen this: you type a query and the engine offers an excellent spelling correction. For example, if you type speling, Google immediately suggests spelling.
A few days ago I came across http://norvig.com/spell-correct.html. I translated it, added my own understanding, and wrote this post.
The following is a simple but fully functional spelling corrector in 21 lines of Python.
Code
import re, collections

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(open('big.txt').read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)
The correct() function is the program's entry point: given a misspelled word, it returns the correct one. For example:
>>> correct('cpoy')
'copy'
>>> correct('engilsh')
'english'
>>> correct('sruprise')
'surprise'
In addition to the code, as with any machine-learning approach we need a large amount of sample data; big.txt serves as our training corpus.
The principle behind it
The code above is based on Bayes' theorem. In fact, Baidu and Google also implement spell checking with Bayesian methods, though certainly in far more sophisticated forms.
First, a brief introduction to the underlying principle; readers who already know it can skip this section.
Given a typed word, we try to choose the most likely correct spelling as the suggestion. Sometimes the choice is ambiguous (should lates be corrected to late or latest?), so we use probabilities to decide. Among all possible correct spellings, we look for the candidate c that is most likely given the original word w:
argmaxc P(c|w)
By Bayes' theorem, this can be rewritten as
argmaxc P(w|c) P(c) / P(w)
The meanings of the terms in this formula:
- P(c|w) is the probability that the user meant word c, given that they typed w.
- P(w|c) is the probability that the user typed w while intending word c; this is the error model and is treated as given.
- P(c) is the probability that word c appears in the sample data.
- P(w) is the probability that word w appears in the sample data.
P(w) is the same for every candidate c, so it can be dropped, and the formula becomes
argmaxc P(w|c) P(c)
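To make the formula concrete, here is a toy calculation for the lates example above. All numbers below are made-up illustrative assumptions, not values measured from big.txt:

```python
# Toy illustration of argmax_c P(w|c) P(c); all probabilities are hypothetical.
# Typed word w = "lates"; candidate corrections c = "late" and "latest".
candidates = {
    "late":   {"p_w_given_c": 0.02, "p_c": 0.005},    # common word
    "latest": {"p_w_given_c": 0.10, "p_c": 0.0008},   # likelier slip, rarer word
}

def score(c):
    # P(w|c) * P(c); P(w) is dropped because it is the same for every c
    return candidates[c]["p_w_given_c"] * candidates[c]["p_c"]

best = max(candidates, key=score)
print(best)  # "late" wins: 0.02 * 0.005 = 1e-4 beats 0.10 * 0.0008 = 8e-5
```

Even though latest is the likelier slip here (higher P(w|c)), late is so much more common (higher P(c)) that it wins the product.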
All of the code is built on this formula. Let's analyze the implementation step by step.
Code analysis
The words() function extracts the words from big.txt:
def words(text): return re.findall('[a-z]+', text.lower())
re.findall('[a-z]+', ...) uses Python's regular-expression module to extract every run of letters matching '[a-z]+', i.e. the words. (Regular expressions are not covered in detail here; interested readers can consult an introduction to regular expressions.) text.lower() converts the text to lowercase, so that "the" and "The" are treated as the same word.
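A quick check of what words() produces (the sample string here is my own, not from big.txt):

```python
import re

def words(text): return re.findall('[a-z]+', text.lower())

sample = "The quick brown fox -- THE fox!"
print(words(sample))  # ['the', 'quick', 'brown', 'fox', 'the', 'fox']
```

Punctuation and whitespace fall away, and the two spellings of "the" collapse to one word.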
The train() function counts how many times each word occurs and trains a model from those counts:

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(open('big.txt').read()))

NWORDS[w] is then the number of times the word w occurs in the sample. What about words that never appear in our sample? We treat their count as 1 by default, implemented here with the collections module and a lambda expression: collections.defaultdict() creates a dictionary whose missing keys default to the value of lambda: 1, i.e. 1. (Readers unfamiliar with lambda can consult an introduction to lambda expressions.)
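The effect of the lambda: 1 default can be seen with a tiny word list (the three-word list is purely for illustration):

```python
import collections

model = collections.defaultdict(lambda: 1)
for w in ['the', 'of', 'the']:
    model[w] += 1  # first access creates the entry with the default value 1

print(model['the'])     # counted twice on top of the default 1 -> 3
print(model['unseen'])  # never counted, falls back to the default -> 1
```

Because unseen words get a count of 1 rather than 0, no candidate ever has zero probability; this is a simple form of smoothing.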
We have now handled P(c) in argmaxc P(w|c) P(c). Next comes P(w|c), the probability that the user typed word w while intending word c. We measure it with edit distance: the number of edits needed to turn one word into the other, where a single edit is a deletion, a transposition (of two adjacent letters), an insertion, or a replacement. The following function returns the set of all words w that can be produced from c by a single edit:
def edits1(word):
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)
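For a word of length n, edits1() builds n deletions, n - 1 transpositions, 26n replacements, and 26(n + 1) insertions, i.e. 54n + 25 candidate strings before set() removes duplicates. A quick count (restating the construction without the final set(), so the raw total is visible):

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1_raw(word):
    # same construction as edits1(), but returning the raw list
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b for a, b in splits for c in alphabet]
    return deletes + transposes + replaces + inserts

print(len(edits1_raw('something')))  # 54 * 9 + 25 = 511
```

A few of those 511 strings are duplicates (for instance, replacing a letter with itself reproduces the original word), which is why the real edits1() wraps the result in set().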
Related research shows that 80-95% of spelling errors are within edit distance 1 of the intended word. If one edit is not enough, we simply edit once more:
def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)
Candidates may also be at edit distance 0, i.e. the word is already spelled correctly:

def known(words): return set(w for w in words if w in NWORDS)
We assume that edit distance 0 is far more likely than 1, and 1 far more likely than 2. The correct() function therefore takes the words with the smallest edit distance (whose P(w|c) is accordingly the largest) as candidates, and then selects the candidate with the largest P(c) as the spelling suggestion.
def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)
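Putting it all together on a tiny in-memory corpus (the sentence below stands in for big.txt and is purely illustrative):

```python
import re, collections

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

# Tiny stand-in corpus instead of big.txt (assumption for this demo).
corpus = "spelling errors in spelling are common and spelling is hard"
NWORDS = train(words(corpus))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(ws): return set(w for w in ws if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)

print(correct('speling'))  # one insertion away from 'spelling'
print(correct('commen'))   # one replacement away from 'common'
```

With a real corpus like big.txt the dictionary is of course far larger, but the mechanics are identical: fall through edit distances 0, 1, 2 in turn, then pick the most frequent surviving candidate.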