NLP series (2) _ with Naive Bayesian for text classification (on)

label: NLPnatural language processingBias methodmachine learningdata mining
9660 people read comment(4) Collection report

Author:Long Xinchen& &Han Xiaoyang

Time: January 2016.




Statement: All rights reserved, please contact the author, please contact the author and indicate the source.

1 Introduction

Bias method is a long history, has a solid theoretical basis of the method, while dealing with a lot of problems directly and efficiently, many senior Natural Language Processing models can also be evolved from it. Therefore, learning Bayesian approach is a very good entry point for the research of Natural Language Processing problem.

2 Bias formula

A line of Bias's formula:


In fact, it is derived from the following joint probability formula:


amongP(Y)A priori probability,P(Y|X)Posterior probability,P(Y,X)Joint probability.

Well, no, no, the core of the Bayesian formula is so.

3 use machine learning to understand Bayesian formula

In the perspective of machine learning, we putXUnderstand"With a characteristic"The.YUnderstand"Category label"(general machine learning problems areX=> features,Y=> resultsRight. In the simplest two classification problemyesandnoDetermine), we willYUnderstand"Belongs to a class"Label. So the Bayes formula becomes the following:


We try to explain the above formula (Shuo) in a more oral (ren) language (Hua):

P("genustosomeclass"|"haveyessomespecialsign")=Under the condition that a certain sample has a certain characteristic, the sample "belongs to a kind of" probability. So calledPosterior probability..

P("haveyessomespecialsign"|"genustosomeclass")=Under the condition that a certain sample belongs to a certain class, the probability of "having a characteristic".

P("genustosomeclass")=(in the case of an unknown sample with that "having a characteristic"), the sample "belongs to a class". So calledA priori probability.

P("haveyessomespecialsign")=(in the case of an unknown "belonging to a class") the sample "has a characteristic" probability.

And the ultimate goal of our two classification problem is tojudgeP("genustosomeclass"|"haveyessomespecialsign")Is greater than 1/2Just enough. Bias method to calculate"With a certain characteristic of a class"The probability of conversion to the need to calculate"Belonging to a certain class of conditions has a characteristic"The probability of the latter method is much simpler, we only need to find some of the known features of the label can be trained. And the samples of the category labels are clear, so the Bayesian approach in machine learning is a supervised learning methods.

Here to add, generalThe prior probability and posterior probability are relative.Appear, for exampleP(Y)andP(Y|X)Is aboutYThe prior probability and posterior probability,P(X)andP(X|Y)Is aboutXThe prior probability and posterior probability.

4 spam identification

For example, we now have to classify the mail, identify spam and email, if we choose to use the simple Bias classifier, the goal is tojudgeP(".GarbagePostpiece"|"haveyessomespecialsign")Is greater than 1/2. Now suppose that we have 10 thousand letters of spam and normal mail as the training set. Need to determine whether the following message is spam:

"I can handle our formal invoice (fidelity) 17% VAT invoice discount points!"

that isJudgment probabilityP(".GarbagePostpiece"|"Idepartmentcandoreasonjustgaugehairticket(protectreally)Seventeen%increasevaluetaxhairticketspotnumberexcellentbenefit!")Is greater than 1/2.

Ahem, wood found and converted into this probability, the calculation method: is to write a counter and + 1 + 1 + 1 Statistics out many times these words appear in the all spam and normal mail!!! Well, the specific point:



5 word segmentation

And then the students began to throw rotten cabbage and rotten eggs at me!! mislead and cause harm to the young men!! Do you think that the people who send spam messages are stuck in twentieth Century!! Do you think they send mail like copy homework do not change content!! Which come so much the same sentence!!".

Cough, alarm clock. Indeed, in our sample size, "completely hit" sentences with little or no (and fail to satisfy the law of large numbers), calculated the probability will very distorted. On the one hand, find a large training set is a very difficult thing, another in fact for any training set, we can construct a never in the training set of sentences as spam (heart before seen naive Bayesian classification error mail pieces, I think a big Chinese compatriots (Zao) new (JIA) simply stunning (FA) ah (Zhi).

A very sad but very realistic conclusion:

The training set is limited, and the probability of a sentence is infinite. So the training set covering all the possibilities of the sentence does not exist.

So the solution is?

Right!The probability of a sentence is infinite, but the word is so!!The Chinese characters commonly used 2500 commonly used words, also 56000 (you finally understand primary school Chinese teacher's well intentioned.). According to people's experience, the meaning of the two words are not forced to have each word, words are the same. such as"I can handle regular invoices, 17% value-added tax invoices!", this sentence is less than before the sentence."(Bao Zhen)"The word, but the meaning is the same. If these conditions are taken into account, then the number will increase, which is convenient for us to calculate the.

So, we can not take the sentence as a feature, but take the sentence inside the words (combination) as a feature to consider. such as"Regular invoice"Can be used as a single word,"Value added tax"Can also be used as a single word and so on.

sentence"I can handle regular invoices, 17% value-added tax invoices!" Can become ("I" and "our", "may", "management", "formal invoice", "fidelity", "value-added tax", "invoice", "points", "discount").

So you come into contact with the Chinese NLP, the most important one of the most important technology:participle!!! that isThe sentence is split into more fine-grained words to say. Well, in addition, after the wordThe removal of punctuation, numbers, and even irrelevant components (withdrawal) is a technique of feature preprocessing..

Chinese word segmentation is a special field of Technology (I will not tell you a search engine factory code has a special word to do the word segmentation!!! In the next article, we will discuss it as a known case. Please see the details next time.

We observe ("I" and "our", "may", "management", "formal invoice", "fidelity", "value-added tax", "invoice", "points", "discount").This can be understood as a vector: each dimension of the vector represents the specific location of the feature word in the text. This will feature split into smaller units, based on these more flexible, more fine-grained features to determine the way of thinking, Natural Language Processing and machine learning are very common and effective.

So the Bayesian formula becomes:





6 conditional independence assumption

Some students say... As if... Seems... After the above toss, looks more complicated -_-|| probability

The... Then we simplify it...

probabilityP(("I","department","can","doreason","justgaugehairticket","protectreally","increasevaluetax","hairticket","spotnumber","excellentbenefit")|".GarbagePostpiece")Still not good enough, we introduce aVery simple approximation. In order to make the formula seem more compact, we make the letter S means "junk mail", which makes the letter H means "normal mail" ". Approximate formulas are as follows:




This is the legend.Conditional independence assumption. Based on the "normal mail" of the conditional independence assumption. And the like, here omitted. Then, the conditional independence assumption of Bayesian formula, the above two adverse events.

So we only need to compare the size of the following two:





Li (WO) harm (Cao)! Jiangzi after treatmentEach item in the formula are particularly good for! it needs onlyStatistics of various types of messages in the probability of the emergence of the keywordCan be!!! For example:


The number of statistics is very convenient, and the number of samples is large enough, the probability is relatively close to the real probability. So the problem of spam identification can be solved.

7 plain Bias (Bayes Naive), "Naive" where?

Bayesian method with conditional independence assumption is naive Bayes method (Bayes Naive).The pronunciation of Naive is "a dirty," meaning "plain" and "naive" ","Stupid". Well, that is to say, people say the name is a kind of methodology, more adorable stupid?

Sentence ("I" and "our", "may", "management" and "formal invoice) in (" I "," our ") and (" formal invoice ") a change order, it becomes a new sentence (" formal invoice "," may "," management "," I "," our "). The new sentence is completely different from the meaning of the old one.But because of the law of multiplication, the conditional probability of the two is exactly the same as that of the naive Bayes method!The calculation procedure is as follows:





That is, in the eyes of plain Bias, "I can handle the formal invoice" and "regular invoices can handle me" exactly the same. Naive Bayes lost the order information between words.This is equivalent to the vocabulary in all in a bag, casual mix, Bayesian that they like. So this is the caseWord bag model (of words bag).

With the bag of words

The word bag model is quite different from people's daily experience. For example, in the case of conditional independence assumptions,"Wu Song killed the tiger" and "the tiger killed Wu Song".Grace, simple Bias is so simple and direct, compared to other classifiers, it seems that there is a little crazy.

8 simple and efficient, hanging wire counter attack

Although naive Bayesian method is stupid and stupid, practice proves that the application of spam identification is stillAmazingly good. Mr. Graham Paul made a simple Bias classifier,"1000 spam messages can be filtered out of 995, and there is no one".(Graham Paul "hackers and painters")

That... The effect is good?

"People puts forward a theoretical explanation, and established when the effect of naive Bayes to equivalent to necessary and sufficient conditions for a non naive Bayes, explain the core is some independence assumptions between each category distribution is uniform so for the relative size of the likelihood does not affect; if not more so, there are great possibilitiesThe negative effects or positive effects produced by each individual hypothesis offset each other, and ultimately lead to little effect on the results.. Specific mathematical formula, please refer toThis paper." (Liu Weipeng ": ordinary and magical Bayesian method")

Well, this classifier is the most simple and direct seemingly cute little naive Bayes naive Bayes, in fact, it isSimple, practical, and powerfulThe.

Three 9 ways to deal with the repetition of words

WeBefore the spam vector (the "I", "our", "may", "for", "formal invoice", "fidelity", "value-added tax", "invoice", "points", "discount"), which each word do not repeat.And this is actually very rare in reality. Because if the text length increases, or the word segmentation method to change,There must be a lot of words to repeat., therefore, it is necessary to further explore this situation. For example, the following message:

"On behalf of the invoice. VAT invoice, regular invoice."

After word segmentation:

("on behalf of", "invoice", "value added tax", "invoice", "normal", "invoice"

The invoice was repeated three times.

9.1 polynomial model:

If we consider the situation of repeating words, that is to say,Repetition of words we see as many times as it appearsIn the way of conditional independence assumption, there are




Pay attention to this one:PThree("hairticket"|S).

In the statistical calculation of P ("invoice" |S), each of the statistics of the spam samples were repeated in the statistics.


You look at the number of results appear in the probability of the index / time, so this model is calledPolynomial model.

9.2 Bernoulli model

Another more simplified method isThe repetition of words and expressions are regarded as the only 1 times.,



Statistical calculationP("Wordlanguage"|S)So is the time.


Such a model is calledBernoulli model(also known asTwo independent models). This way is more simple and convenient. Of course, it lost the frequency information, so the effect may be worse.

9.3 mixed model

The third way is in the calculation of the probability of a sentence, without considering the word repetition times, but in statistical computing word probability p ("word" |S), but considering the number of occurrences of the word repetition, such a model may be calledhybrid model.

We show the relationship between the three models through the following figure.

Three forms

Which model is used in practice, the key to look at the specific business scenarios. The author's simple experience is,For spam identification, the hybrid model is better.

10 remove stop words and choose key words

We continue to observe("I" and "our", "may", "management", "formal invoice", "fidelity", "value-added tax", "invoice", "points", "discount")This sentence. Actually, like"I", "but"Such words are very neutral, regardless of whether they appear in the spam can not help to determine the useful information. So you can directly do not consider these typical words. These words do not help us to classify"Stop words" (Words Stop). This canReduce our training model, determine the time of classification.

So the previous sentence has become(the "company", "management", "formal invoice", "fidelity", "value added tax", "invoice", "points", "preferential").

We further analyze. With human experience, in fact"Regular invoice", "invoice""This kind of word if appear, mail as the probability of the spam is very large, can be used as our distinction between spam"Key words". And like"Division", "management", "preferential""This kind of word is a little chicken ribs, may contribute to the classification, but not so strong. If you want to save, do a simple classifier, can directly use "Keywords" statistics and judgment, the rest of the word, regardless of the first. So before the spam sentences will become("regular invoice", "invoice"). In this way, we can reduce our training model, judge the time of classification, and the speed is very fast.

"Stop words" and "key words" generally can be specified in advance by artificial experience. Different "stop words" and "key words" training out of the effect of the classifier will be some differences. So there are no quantitative indicators to assess the ability of the different words? In our previous article"Machine learning series (6) _ feature selection and pretreatment of the blind white Formica (under)"In fact, provides an evaluation method, we can refer to. Do not repeat here.

11 on the smooth Technology

We have a problem (the Chinese NLP in the problem of super, cry blind T_T), such as in the calculation of the probability of the following independent conditions:



We scan the training set and find out.The word "formal invoice" has appeared!!!SoP("justgaugehairticket"|S)=Zero... The problem is serious, the whole probability has become 0!!! Naive Bayesian method facing a pile of 0, failed miserably... More cruel isThis situation is actually very common, because even if the training set is large, there may be no coverage of the words. Essentially orThe number of samples is too small to satisfy the law of large numbers, and the probability of distortion is calculated.. In order to solve this problem, an analysis of the idea is to directly do not consider such a word, but this approach is equivalent to the default to P (formal invoice |S) assigned to 1. In fact, the effect is not good, a lot of statistical information to waste away. We further analyze, since you can default assignment to 1, why can not the default assignment to a very small number? This is the basic idea of smoothing technology, still maintained a consistent style,Simple / soilhoweverDirectly and effectively.

For the Bernoulli model, a smoothing algorithm for P ("regular invoice" |S) is:


For polynomial models, a smoothing algorithm for P ("regular invoice" for S) is:


To say, the type of smoothing technology is actually very much, there is interest in the words back to us specifically to pull a special topic to talk about the good. Just to mention here, that's all.Smoothing techniques are given for the probability that a word is not appearing in the training set, and the corresponding reduction in the probability of other words that have appeared..

Smoothing technique is a realistic requirement because of the small data set.If the data set is large enough, the impact of smoothing techniques on the results will be small.

12 summary

We find the most simple and common example: spam identification, illustrate the idea of naive Bayesian text classification process. Basic idea is to distinguish between good training set and test set, the text set segmentation, remove punctuation characteristics such as pretreatment of the operation, and then use the conditional independence assumption, the original probability converted into word probability product, and subsequent processing.

Bias formula + conditional independence assumption = Naive Bayesian method

Based on the lexical repetition in the training phase and judgment (test) three kinds of treatment methods on stage, we have the corresponding Bernoulli model, polynomial model and mixed model. In the training phase, if the sample set is so small that some words did not appear, we can use the technology of smoothing gave an estimate of the probability. But not all words need statistics, we can according to the corresponding "stop words" and "Keywords" the model is further simplified, and improve the training speed of judgment.

Because the formula is compared, in order to prevent see the formula on the dog of, we try to use the mouth (Shuo) language (ren) of (Hua) in the expression formula, rigorous place still hope will forgive me, have flaws in between welcome everyone points out.

step on
Guess you're looking for
View comments
* the above user comments represent the personal views do not represent the views or position of the CSDN website
    personal data
    • visit:250617 times
    • Integral:Two thousand one hundred and one
    • Grade:
    • Ranking:11581st
    • original:28
    • Reprint:0
    • Translation:0
    • Comment:123
    Personal introduction and contact way

    Long Xinchen

    "I graduated from Wudaokou College of computer", a few years of data mining machine learning / working experience. A factory doing odd jobs, several NLP, Web attack recognition, intrusion detection self-learning projects. Welcome to contact and exchange.


    QQ: 3253950332

    Machine learning exchange group: 439183906 (full), 373038809

    Professional work or research staff share group: 472059892

    Latest comments