Winnowing: a generic algorithm for document fingerprinting

Classification: Computer security 965 people read Comment(0) Collection Report

This paper is a study of the Local Algorithms for Document Fingerprinting Winnowing:.

In today's era, the electronic content will be through many ways to the same situation, such as: reference, version of the amendment, plagiarism, etc.. While the document fingerprint is an effective way to specify a copy, even in a large number of documents, there are a small part of the copy.

In this paper, the authors introduce a generic document fingerprinting algorithm, which seems to capture the generic properties of any fingerprint to guarantee copy detection. In addition, the author also proposed a new effect of the algorithm, the name is "Winnowing". The algorithm is very efficient for the detection of document fingerprints.

First, the author starts from the background, tells the practical significance of the document plagiarism detection. Plagiarism on the document, not just a large space of reference or plagiarism, but also includes a small part of the copy, and then more subtle, and will not easily be found, but now the latter is a phenomenon has been widespread concern, there are many studies in the latter attempt to solve the problem.

Before a lot of detection algorithms have used a point of view, that is k-gram, the length of the K's continuous string. These algorithms, first of all, the document is divided into a number of k-grams, and the K value is specified by the user. Then hash each k-gram, and then select some subset of these hash values as the fingerprint of this document. In all practical algorithms, the fingerprint is also the location information, which represents the document location of these fingerprints. Note here is that if the hash function is selected in the actual process, then the probability of conflict between the two different documents will be very small, so it greatly reduces the chance that, in turn, once the document is the same as the two documents, then the use of the two documents is the same k-gram.

So, in order to efficiency, and not all of the hash value as a document of the fingerprint, but the selection of a part of the document as a fingerprint. Then there will be a problem of how to choose the hash value, this problem is also a key point of this paper, a good selection strategy can make the algorithm efficiency, high availability, and small interference noise. In the selection strategy, there is a way to be widely used, that is, 0 P mod, and the p value is specified by the user. Such a method is easy to implement because it makes all of the 1/p values in the hash are preserved as a document of the fingerprint. And as a method to detect whether there is a similar document, is to detect the same fingerprint of the document.

But there is a drawback, that is not to ensure that all the matching documents will be detected, because it relies on all the choices of 0 P mod. And the method proposed in this paper, winnowing, can be selected from the hash sequence to select the appropriate set as the document fingerprint, and can guarantee that any total long enough matching can be detected. This winnowing algorithm is so efficient, because in the selection of hash values, using the concept of sliding window (window) to select the hash value, the use of sliding window to select the benefits of the fingerprint: fingerprint selection depends more on the hash value of the window, but 0 P mod, more depends on 0 Mod P function. That is, in the most original hash values, the author selects the minimum W value in the window, and then moves the window to the right, and then selects the minimum hash value in the window. After this process, it will choose a lot of hash value, but it can be predicted that in these hash values, there will be a lot of adjacent hash values are the same, which is due to the sliding window. Then the same hash values are processed, and the hash value set is obtained, which is a lot less than the original number. This is the document fingerprint after using winnowing.

Of course, there are some documents that are not applicable, for example, the entire document is the same character of the document. Because of this, there is almost only one hash value, and this will greatly reduce the representation of document fingerprints.

Subsequently, the author also introduces some background and related work in the field of document similarity detection.

First, if you want a document to be extracted from the fingerprint, then this fingerprint extraction algorithm must meet the 3 characteristics. First, the algorithm is not sensitive to the space. That is, the initial processing of the document will take measures to makeFileThe number of spaces, punctuation, etc., is not affected by the preliminary results. Second, the algorithm itself has the characteristics of noise suppression. The first note is that the words "the", such as "", appear in the document, and then it is not meaningful. So in terms of matching, matching must be long enough, so that the matching can indicate that there is indeed a copy of the document or similar phenomenon. Similarly, for example, there are the same proverbs in different documents, although the length of the proverb may be long enough, but should not be the same as the same as that the document exists. Third, there is no dependence on the location of the algorithm and the document content. For example, the two documents already exist in a similar phenomenon, so in one of the articles in the insertion of a paragraph will not affect the original similarity results; or delete some irrelevant content will not affect the final similarity results.

On the first feature, the document for a simple pre - processing that can be done, the text is not a big story.

But the second characteristics of the story, the author puts forward a threshold concept, that is to say, the need to choose a sufficiently long K, so that the K value is longer than the average length of the proverb. So here is a hypothesis, this assumption is not less than the length of the K string matching is meaningful, and the length is less than k of the sub string matching is not meaningful.

In fact, the third characteristics of the story is very interesting and meaningful, in order to achieve this feature, the author introduces a previously widely used strategy String Match Karp-Rabin.

Karp-Rabin algorithm is implemented to achieve fast sub string matching. In the calculation of hash, the main is to present a polynomial calculation. But because of the special nature of hash, the former one has a lot of similarities with the hash, because the former one is the difference between the former one and the latter one. So we can use which use calculation of the former existing values were calculated, in the calculation only on the results of the former twice addition calculation and two multiplication calculation, which greatly reduce the computational complexity of the original computation.

In the following paper, the author describes and analyzes the winnowing algorithm, in fact, it can be found that the key point of the algorithm is how to select the final fingerprint from many hash values from k-grams. In the detection process, the author hopes to meet the two characteristics in the process of matching: 1. If the length of a sub string matches the threshold value of T, so that the matching can be detected; 2, the algorithm will not detect the length less than k, the K value is called the noise threshold.

It can be seen, if the more K, then I will have confidence to prove that the matching is not accidental. In winnowing, select the document fingerprint, if there is a lot of adjacent to the same minimum, then select the most right side of the value, and finally save all the selected hash value as a document of the document.

While the use of this algorithm, although the fingerprint can be obtained by the algorithm, but in the specific algorithm process, there may be a problem, which is how to make the hash value was once elected. First of all these values are stored in the database, and then they will be indexed. If we take another window Fw to achieve the fingerprint selection, then the use of memory and disk access in the implementation process will become less, more conducive to the search for fingerprints.

In the back part of the paper, the author uses a series of experimental data to test the winnowing algorithm. As well as the C language to show the specific code of the winnowing algorithm.

Copyright statement: This article is the original article of the main Bo, without the permission of the owner.

Guess what you're looking for.
View comments
* above the user's comments only represent their personal views, do not represent the views or position of the CSDN website
    Personal data
    • Visit77259 times
    • Integral:1277
    • Grade
    • Rank:18768th
    • Original47
    • Reprint:0
    • Translation:1
    • Comment:49
    Blog column
    Contact information
    Latest comments