MasterofProject

A spelling checker in 21 lines of Python code

Tags: Python, Bayes, machine learning

Introduction

When you search on Google or Baidu, the search engine is very good at catching spelling mistakes. For example, if you type speling, Google immediately suggests spelling.
A few days ago I read the article at http://norvig.com/spell-correct.html. I translated it and added my own understanding, and the result is this blog post.
Below is a simple but fully functional spelling checker, implemented in 21 lines of Python code.

Code

import re, collections

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(open('big.txt').read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)

The correct() function is the entry point of the program: pass it a misspelled word and it returns the corrected word. For example:

>>> correct("cpoy")
'copy'
>>> correct("engilsh")
'english'
>>> correct("sruprise")
'surprise'

Besides the code, this is a machine learning problem, so it needs a large amount of sample data. Here the file big.txt serves as our sample data.

The principle behind it

The code above is based on Bayes' theorem. In fact, the spell checking in Baidu and Google is also built on Bayesian methods, although it is certainly more complex than this.
First, a brief introduction to the principle behind the code. Readers who already understand it can skip this section.
Given a word w, we try to choose the most likely correct spelling c among the candidate suggestions. Sometimes the right answer is not clear (for example, should lates be corrected to late or to latest?), so we use probability to decide which one to suggest. Among all possible corrections c of the original word w, we look for the one that maximizes the probability:

argmax_c P(c|w)

By Bayes' theorem, this is equivalent to

argmax_c P(w|c) P(c) / P(w)

The terms in the formula above mean the following:

  1. P(c|w) is the probability that the user meant to type the word c, given that the word w was typed.
  2. P(w|c) is the probability that the user types w when the intended word is c. For now this can be treated as given.
  3. P(c) is the probability that the word c appears in the sample data.
  4. P(w) is the probability that the word w appears in the sample data.

P(w) is the same for every candidate word c, so it does not affect which c wins, and the formula simplifies to

argmax_c P(w|c) P(c)
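
To make the formula concrete, here is a small sketch that scores the two candidates for lates. All the numbers are hypothetical values made up for illustration; they are not measured from big.txt, and best_candidate is a helper name introduced only for this example.

```python
# Hypothetical numbers, for illustration only.
# P(c): how common each candidate word is in some sample text.
p_c = {'late': 0.0004, 'latest': 0.0001}
# P(w|c): how likely someone who meant c ends up typing 'lates'.
p_w_given_c = {'late': 0.01, 'latest': 0.02}

def best_candidate(candidates):
    # argmax over c of P(w|c) * P(c).
    # P(w) is the same for every c, so it is left out.
    return max(candidates, key=lambda c: p_w_given_c[c] * p_c[c])

print(best_candidate(['late', 'latest']))   # late
```

With these made-up values, late wins even though its P(w|c) is lower, because its P(c) is four times larger.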

All of our code is based on the formula argmax_c P(w|c) P(c). The next section analyzes how the code implements it.

Code analysis

Use the words() function to extract the words in big.txt:

def words(text): return re.findall('[a-z]+', text.lower())

re.findall('[a-z]+', ...) uses Python's regular expression module to extract every piece of text that matches '[a-z]+', that is, every word made up of letters. (Regular expressions are not covered in detail here; interested readers can look at an introduction to regular expressions.) text.lower() converts the text to lowercase, so that "the" and "The" are treated as the same word.
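
As a quick check of what words() returns, here is a minimal sketch on a made-up sentence (the sample string is only an illustration):

```python
import re

def words(text):
    # Lowercase first, then pull out every maximal run of the letters a-z.
    return re.findall('[a-z]+', text.lower())

sample = "The cat's hat, the Cat!"
print(words(sample))
# ['the', 'cat', 's', 'hat', 'the', 'cat']
```

Note that anything outside a-z acts as a separator, so the apostrophe splits "cat's" into cat and s.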

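Use the train() function to count how many times each word occurs, and so train a suitable model:

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(open('big.txt').read()))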

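This way NWORDS[w] represents the number of times the word w occurs in the sample. What should we do about a word that never appears in the sample? The approach is to give such words a default count of 1, which is done here with the collections module and a lambda expression. collections.defaultdict() creates a dictionary with default values, and lambda: 1 makes every value in the dictionary default to 1. (For lambda expressions, see an introduction to lambda.)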
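
The behavior of the default value can be seen in a small sketch with a made-up word list. One detail worth knowing: indexing a missing key inserts it with the default, while .get() and the in operator do not.

```python
import collections

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

counts = train(['the', 'cat', 'the'])   # made-up sample words
print(counts['the'])       # 3: the default 1 plus two occurrences
print('dog' in counts)     # False: 'dog' has not been seen or looked up
print(counts.get('dog'))   # None: .get() does not trigger the default
print(counts['dog'])       # 1: indexing a missing key inserts the default
print('dog' in counts)     # True: the lookup above added 'dog'
```

Because of the default, every count is one higher than the raw number of occurrences, which does not change the ranking between words.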

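We have now dealt with P(c) in the formula argmax_c P(w|c) P(c). Next comes P(w|c), the probability of typing the word w by mistake when the intended word was c. It is measured by edit distance: the number of edits needed to turn one word into another. A single edit can be a deletion, a transposition (swapping two adjacent letters), an insertion, or a replacement. The following function returns the set of all words w that can be obtained by applying one edit to c: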

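def edits1(word):
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)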
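
To see what each list comprehension produces, here is a step-by-step sketch for the short word cat (an arbitrary example word). It runs the same expressions as edits1() outside the function so the intermediate lists can be printed.

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'
word = 'cat'   # arbitrary short example word

# Every way to cut the word into a left part and a right part.
splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
print(splits)       # [('', 'cat'), ('c', 'at'), ('ca', 't'), ('cat', '')]

# Drop the first letter of the right part.
deletes = [a + b[1:] for a, b in splits if b]
print(deletes)      # ['at', 'ct', 'ca']

# Swap the first two letters of the right part.
transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
print(transposes)   # ['act', 'cta']

replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
inserts = [a + c + b for a, b in splits for c in alphabet]
print(len(replaces), len(inserts))   # 78 104

total = len(deletes) + len(transposes) + len(replaces) + len(inserts)
print(total)        # 187, which is 54 * 3 + 25 before set() removes duplicates
```

For a word of length n the four lists hold n, n - 1, 26n and 26(n + 1) entries, so the candidate set grows only linearly with the word length.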

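Related papers report that 80-95% of spelling errors are within an edit distance of 1 from the intended word. If one edit is not enough, we simply apply it a second time: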

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

There may also be words at an edit distance of 0, that is, words that are already spelled correctly:

def known(words):
    return set(w for w in words if w in NWORDS)

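We assume that a word at edit distance 1 is far more probable than one at distance 2, and that a word at distance 0 is far more probable than one at distance 1. The correct() function below first picks the known words with the smallest edit distance as candidates (the smaller the distance, the larger the corresponding P(w|c)), and then chooses the candidate with the largest P(c) as the spelling suggestion.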

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)
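
The two steps in correct() can be illustrated with a small sketch. The counts and the three candidate sets below are hypothetical, hand-written stand-ins for NWORDS and for the results of known() and known_edits2() on the input lates; they are not computed from big.txt.

```python
# Hypothetical word counts, standing in for NWORDS.
counts = {'late': 50, 'latest': 20, 'lattes': 2, 'last': 300}

# Hand-written stand-ins for the three tiers computed for the input 'lates'.
exact     = set()                          # 'lates' itself is not a known word
one_edit  = {'late', 'latest', 'lattes'}   # known words one edit away
two_edits = {'last'}                       # a known word two edits away

# An empty set is falsy, so `or` returns the first non-empty tier.
candidates = exact or one_edit or two_edits or ['lates']
print(sorted(candidates))                  # ['late', 'latest', 'lattes']

# Within the chosen tier, the most frequent word wins.
print(max(candidates, key=counts.get))     # late
```

Even though last has the highest count in this made-up data, it is never considered, because a closer tier already contains known words.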