
Jason___Bourne   21:05 2015.07.25   question

Given a file containing 100 million QQ numbers, how do you quickly find one particular QQ number? 50C

As the title says: I'm looking for an efficient approach and welcome everyone to discuss it. If the answer really is that a brute-force scan is fastest, I'll give myself three hundred boxes on the ear!

58 answers

Caozhy   22:23 2015.07.25

The simplest way is a dictionary tree (trie). Search cost is proportional to the number of digits in the key, i.e. O(log N).
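For reference, here is a minimal sketch of such a digit trie in Python; the class and function names are my own illustration, not something from this answer:

# A minimal digit trie: each node has up to 10 children, one per digit 0-9.
class TrieNode:
    def __init__(self):
        self.children = {}    # maps a digit character to a child TrieNode
        self.is_end = False   # True if a complete QQ number ends at this node

def insert(root, qq):
    node = root
    for digit in qq:          # qq is a digit string, e.g. "123456789"
        node = node.children.setdefault(digit, TrieNode())
    node.is_end = True

def contains(root, qq):
    node = root
    for digit in qq:
        if digit not in node.children:
            return False
        node = node.children[digit]
    return node.is_end

Lookup visits one node per digit, so the cost depends on the length of the number (about ten digits), not on how many numbers are stored.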

Hadues   23:58 2015.07.25

For a file containing 100 million QQ numbers, with that much data to process while still meeting the requirement of the fastest possible lookup, I would approach it from the following angles.

1. Hardware configuration

2. Programming language

3. Algorithm

4. Pre-checks before querying

5. Storage layout

Let me walk through them one by one ---------------------------

1. Hardware configuration: a command-line machine, the fastest available

If you want to search fast, your weapons must be sharp, so:

1) Operating system

Give up the graphical user interface for a command-line-only operating system; no explanation needed. If the choice is between Linux and Windows, it seems it would have to be Linux.

2) Use the fastest computer in the world

The Tianhe-2 (Milky Way 2) supercomputer, a Cray supercomputer, a quantum computer, a biological computer: whichever is fastest, use that one.

3) Form the fastest computers in the world into a server cluster so they cooperate on the job.

2. Programming language: as close to the hardware as possible

High-level languages never seem to match the processing speed of low-level ones, so use a language as close to the underlying hardware as you can.

3. Algorithm: the operating system's paging principle plus a low-complexity algorithm

The question only demands that queries be fast, which means that as long as the time complexity is low, sacrificing some extra space doesn't matter.

I've always felt the operating system's paged lookup principle is excellent; build a low-time-complexity algorithm on top of it and speed should improve a great deal.

4. Pre-checks before querying

1) Judge the digit count of the QQ number first

Before querying, determine how many digits the QQ number has; that alone narrows which part of the paged data has to be searched.

2) Digit-by-digit page selection

Judge the digits one by one from the most significant position to decide which page, and which table, the number belongs to.

5. Storage layout: build a high-speed cache

Wherever possible, store the data in structures that are fast to access, place them in the fastest storage tier, and only load a page into memory once the digit count has been matched; a rough sketch of points 4 and 5 follows.
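A rough sketch of points 4 and 5 together, in Python; the one-number-per-line file format and the qq_len_N.txt naming scheme are assumptions of mine:

# Bucket the QQ numbers by digit count, so a query only ever
# loads the one bucket whose length matches (hypothetical layout).
def build_buckets(src_path):
    outs = {}
    with open(src_path) as f:
        for line in f:
            qq = line.strip()
            if len(qq) not in outs:
                outs[len(qq)] = open('qq_len_%d.txt' % len(qq), 'w')
            outs[len(qq)].write(qq + '\n')
    for out in outs.values():
        out.close()

def lookup(qq):
    try:
        with open('qq_len_%d.txt' % len(qq)) as f:   # only this "page" is read
            return any(line.strip() == qq for line in f)
    except FileNotFoundError:
        return False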

Just a daydreamer showing off slight skill in front of the experts, talking rubbish; forgive the unintelligible rambling.

Sun_xiaofan   20:01 2015.08.18

The problem lies mainly not in the algorithm but in how the data is stored. A file holding 100 million QQ numbers is several GB in size. If your memory is large enough you could read it all in, but generally you can't, and reading it from the file is O(n) anyway.
To find one QQ number you would have to read several GB into memory, and that overhead is far too large. So the real question is not the lookup itself but how the data should be structured on disk.
Normally something like a database indexes itself and is easy to use, but maintaining your own indexed structure on disk is quite complex, so here is a simple, easy-to-understand alternative.
Use the filesystem's directories directly as a crude index; directories are themselves backed by a B-tree index, so managing them is not inefficient. For example, take the first six digits of the QQ number as the index: all QQ numbers sharing the same first six digits go into one file, and the file is named after those six digits. Putting every file into a single folder may be hard to manage, so apply the same idea again: use the first five digits as a first-level directory index, and so on; the whole directory tree ends up as a 10-way tree.
The reason a single flat file is inefficient is that reading the entire file wastes time; once you scatter the data across many small files, each file reads quickly and the search range is tiny.
Overall cost = one random disk seek to locate the file + roughly 1 MB of disk reads, after which a direct linear scan is very fast.
In fact the cost of such a search lies not in the in-memory work but in avoiding reading large amounts of data from disk. A database does this faster than plain files for two reasons: first, it caches, keeping recently queried collections in memory for a while; second, what sits on disk is a proper primary-index structure, more efficient than the file tree we simulated ourselves.
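Here is a sketch of that directory index in Python; the paths and the one-number-per-line format are my own assumptions, and in practice you would batch the writes instead of reopening a shard for every number:

import os

# First five digits pick the directory, first six pick the file,
# mirroring the prefix index described above.
def shard_path(qq, root='qq_index'):
    return os.path.join(root, qq[:5], qq[:6] + '.txt')

def build_index(src_path, root='qq_index'):
    with open(src_path) as f:
        for line in f:
            qq = line.strip()
            path = shard_path(qq, root)
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, 'a') as out:   # slow but simple; batch in practice
                out.write(qq + '\n')

def lookup(qq, root='qq_index'):
    try:
        with open(shard_path(qq, root)) as f:
            return any(line.strip() == qq for line in f)   # tiny linear scan
    except FileNotFoundError:
        return False

With 100 million numbers spread over roughly a million six-digit prefixes, each shard holds only a few hundred entries, which matches the answer's estimate of one seek plus a small read.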

U011767611   11:25 2015.07.26

It depends on the type of your data:
If the QQ numbers are stored as integers in a database, you can simply query for the one you want;
If they are stored as strings, you can use a double traversal (algorithm implemented in Python):

import time

index = 0
current_src = []

# iterate over the digits of the target QQ number
for dst_qq in dst_qqs:
    print('match -->', dst_qq, 'at index', index)
    for src_qq in src_qqs:
        if src_qq[index] == dst_qq:
            current_src.append(src_qq)
            print('match:', src_qq)
        else:
            print('not match:', src_qq)
    time.sleep(1)
    index += 1
    src_qqs = current_src   # keep only the candidates that matched this digit
    current_src = []
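Hypothetical inputs for the snippet above (values invented purely for illustration):

dst_qqs = '529'                            # digits of the target QQ number
src_qqs = ['529111', '530222', '529333']   # candidate QQ numbers
# After the loop, src_qqs is ['529111', '529333']: the candidates
# that matched the target digit by digit.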

Principle: traverse the digits of the target QQ number one at a time, and for each digit traverse the current source list of QQ numbers, appending to a new list every number whose digit at that position matches.
Now suppose there are 100 million QQ numbers and the digits 0-9 occur with equal probability at each position; then the first pass selects about 10 million of them.
Continuing the traversal (say the target QQ number has 9 digits), the outer loop runs 9 times and the inner traversals shrink in size:
100 million (100,000,000)
Ten million
One million
One hundred thousand
Ten thousand
One thousand
One hundred
Ten
One
The time complexity you can work out for yourself;
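For the record, under the uniform-digit assumption the inner traversals form a geometric series:

100,000,000 + 10,000,000 + ... + 10 + 1 = 111,111,111 ≈ 1.11 × 10^8

so the whole procedure still does O(N) comparisons for N stored numbers, only a small constant factor away from a single straight scan.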

Gtitanlq   21:47 2015.07.25

struct Node {
    int a;                   /* the digit stored at this node */
    int flag;                /* 1 if a complete number ends here */
    long pos;                /* file offset recorded via ftell */
    struct Node *next[10];   /* one child per digit 0-9 */
    struct Node *father;
};

Read the file one character at a time; each character maps to a node. Check whether the node for that character already exists under the current node; if not, create it dynamically. When a number is complete, set flag = 1 on the node for its last character.
Use ftell to get the file position and copy it into that node's pos, then keep reading.
In the end you have one big tree.
When the reading loop is done, looking a number up is just a walk down the tree.
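A sketch of that build loop in Python; the node layout mirrors the struct above (minus father), and the one-number-per-line file format is my assumption:

# Build the digit tree from the file, recording each number's
# starting offset (the ftell value) in its final node.
def build_tree(path):
    root = {'next': {}, 'flag': 0, 'pos': -1}
    with open(path) as f:
        while True:
            pos = f.tell()        # like ftell in the C answer
            line = f.readline()
            if not line:
                break
            node = root
            for ch in line.strip():
                if ch not in node['next']:
                    node['next'][ch] = {'next': {}, 'flag': 0, 'pos': -1}
                node = node['next'][ch]
            node['flag'] = 1      # a complete QQ number ends here
            node['pos'] = pos     # where that number starts in the file
    return root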

Yufuerhuigood replying to Xiao thousand songs: we're all academic slackers here, no looking down on anyone
5 months ago
Gtitanlq: Don't ignore me, and don't look down on an academic slacker like me
6 months ago
Tabe123   22:12 2015.07.25

A 10-way tree (one branch per digit) is worth considering.

Zuishikonghuan   08:19 2015.07.26

Use a low-level programming language
Multithreading
Algorithmic optimization
If conditions permit, distributed computing

Zuishikonghuan   08:23 2015.07.26

If the platform supports it, you can also use interfaces the platform provides; on Windows, for example, you can memory-map the file. Of course, the best option is to write a driver and control the disk device from inside it, avoiding everything that drags down read speed at the application layer: API calls, kernel interrupts, system service routines, I/O management, buffer copies, filter devices, and so on.
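For the memory-mapped-file idea, a small sketch in Python, whose mmap module wraps the same mechanism on both Windows and Linux; the one-number-per-line layout (with a trailing newline) is my assumption:

import mmap

def find_qq(path, qq):
    # Map the whole file; the OS pages it in on demand instead of
    # copying it through read() buffers.
    with open(path, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            target = qq.encode()
            if m[:len(target) + 1] == target + b'\n':   # match on the first line
                return True
            return m.find(b'\n' + target + b'\n') != -1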

Qq_27220973   12:29 2015.07.26

I recommend trying the data-mining approach: preprocess the data, then process it in blocks. I feel that would solve the big-data problem of memory being too small.

Wingfiring   18:45 2015.08.18

To the OP: how many times do you need to query? If it's only once, you can just start scanning from the beginning.
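To spell that out: building any index requires at least one full read of the file anyway, so for a one-off query a plain sequential scan, sketched below under the one-number-per-line assumption, is already as good as it gets:

def scan_once(path, qq):
    # One sequential pass; nothing is built and nothing stays in memory.
    with open(path) as f:
        for line in f:
            if line.strip() == qq:
                return True
    return False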
