All of the data mining code involved in these articles is on my GitHub: https://github.com/linyiqun/DataMiningAlgorithm
I spent probably close to two months studying and implementing the code for 18 classic data mining algorithms, covering classification, clustering, link mining, association rule mining, pattern mining, and so on. It counts as a small entry into the field of data mining. Here is a little summary, followed by links to my blog posts on the corresponding algorithms, in the hope that...
I kept at it for about a month, working through the Redis source code from the beginning: starting from the struct definitions and finishing with the analysis of the main program, picking apart and studying the code of each module one by one along the way. In a word, the harvest was great; I had not had the patience to study a framework this thoroughly in a long time. For learning a framework, being able to thoroughly understand the principles behind even just a small part of it is the real skill. In the final stage of this study, it is time for the real substance: I have summed up some of the good code and design ideas from this month, which I originally wanted to gather together.
Preface: if you do the daily maintenance of a Hadoop cluster, then you must routinely handle the work of bringing nodes online and offline. For example, as business scale expands rapidly and cluster resources gradually run short, the normal practice is to add machines to achieve a linear scaling effect. Conversely, as machines age during use they develop problems of their own, such as bad disks, or machines whose network can no longer be reached; at that point those machines should be moved out of the cluster, and on no account should they be kept in the cluster for the sake of temporary convenience.
Preface: in HDFS, all data lives on the DataNodes, and each DataNode stores its data under several configured directories, each of which we usually map to a separate disk to expand the basic storage space. With so many nodes and so many disks, an HDFS write operation has to choose a valid disk, and an improper choice will inevitably degrade write performance and thus the performance of the whole cluster. This article discusses the disk selection strategies that currently exist in HDFS and their shortcomings, and then the...
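For reference, HDFS itself ships two volume-choosing policies, RoundRobinVolumeChoosingPolicy (the default) and AvailableSpaceVolumeChoosingPolicy. A minimal Python sketch of the two ideas follows; the class names, the dict-based volume records, and the simplified imbalance rule are illustrative, not the actual Hadoop API:

```python
import random

class RoundRobinChooser:
    """Cycle through volumes, skipping any without enough free space."""
    def __init__(self, volumes):
        self.volumes = volumes          # list of {"path": str, "free": int}
        self.idx = 0

    def choose(self, block_size):
        for _ in range(len(self.volumes)):
            vol = self.volumes[self.idx]
            self.idx = (self.idx + 1) % len(self.volumes)
            if vol["free"] >= block_size:
                return vol
        raise IOError("no volume has enough free space")

class AvailableSpaceChooser:
    """When volumes are badly imbalanced, steer writes toward the freer
    ones; otherwise fall back to plain round-robin."""
    def __init__(self, volumes, threshold=10 * 1024 ** 3):
        self.volumes = volumes
        self.threshold = threshold      # allowed free-space imbalance (bytes)
        self.rr = RoundRobinChooser(volumes)

    def choose(self, block_size):
        free = [v["free"] for v in self.volumes]
        if max(free) - min(free) <= self.threshold:
            return self.rr.choose(block_size)   # balanced enough
        cutoff = min(free) + self.threshold
        candidates = [v for v in self.volumes
                      if v["free"] > cutoff and v["free"] >= block_size]
        return random.choice(candidates) if candidates else self.rr.choose(block_size)

vols = [{"path": "/data1", "free": 90 * 1024 ** 3},
        {"path": "/data2", "free": 10 * 1024 ** 3}]
chooser = AvailableSpaceChooser(vols)
print(chooser.choose(128 * 1024 ** 2)["path"])  # /data1 (far more free space)
```

The trade-off the sketch captures: pure round-robin keeps disk I/O evenly spread but lets free space drift apart when disks differ in size, while the available-space variant sacrifices some I/O balance to keep volumes from filling up unevenly.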
1 Introduction: recently, because of a machine-room relocation, one of our Hadoop clusters (undoubtedly the one with the largest number of machines) had to be moved from one room to another in a way that would have only a small effect, or ideally almost no effect, on the data above it and on the business running on it. That was the goal we wanted to reach. But during the relocation we noticed some interesting phenomena, including the one described in the article's title. Why say "a" phenomenon? Because it does not happen every time, which is what makes it worth watching.
DataNode migration goals: due to external factors, the machines hosting the original DN nodes must move from room A to room B, which involves changes to hostnames and IPs. The ultimate goal is that the migration causes no major impact on the cluster:
Services remain available, and no data is lost. Some background: migrating a DN inevitably stops the migrating node's heartbeat, and if the outage exceeds the heartbeat timeout, the node will be marked as a dead node. To restore the replica counts, the cluster will then start replicating a large number of blocks. If you do not want, within a short time...
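For context, the dead-node judgment mentioned above is derived from two HDFS settings, dfs.namenode.heartbeat.recheck-interval and dfs.heartbeat.interval; with their default values the timeout works out to 10 minutes 30 seconds:

```python
# HDFS marks a DataNode dead after:
#   2 * dfs.namenode.heartbeat.recheck-interval + 10 * dfs.heartbeat.interval
recheck_interval_ms = 300_000   # dfs.namenode.heartbeat.recheck-interval default (5 min)
heartbeat_interval_s = 3        # dfs.heartbeat.interval default

timeout_s = 2 * recheck_interval_ms / 1000 + 10 * heartbeat_interval_s
print(timeout_s)  # 630.0 seconds, i.e. 10 min 30 s
```

So a DN that comes back within that window is never declared dead, and no mass re-replication is triggered; this is the knob a migration plan has to work around.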
Preface: in a Hadoop cluster, all tasks ultimately run as applications, whether you write your own MR program or write Hive SQL that gets turned into MR tasks; in the end they all run under the identity of an application. After these applications finish running, their information can be seen in JobHistory, and it is fair to say Hadoop does this quite completely. But even the complete can be improved: JobHistory is, after all, a post-mortem analysis...
Preface: anyone who has done troubleshooting work on a Hadoop cluster must have used JobHistory, a very handy "weapon". Why say so? As the tool's name suggests, it helps you find information about historical job runs, and the recorded information is very detailed, from job down to task and TaskAttempt. If at this point a job suddenly fails to execute and you want to find out the reason, clicking the Details link in JobHistory's web interface will basically reveal it. But something that appears this perfect...
Preface: among the Hadoop FsShell commands, I estimate the most commonly used are hadoop fs -ls, -lsr, -cat, and other such commands that are almost identical to their Linux file-system counterparts. But think about it: there are some differences. First of all, from the point of view of scale, a stand-alone file system is limited in its number of files and their content, while HDFS is a distributed system that can hold a huge number of files and directories. Under that premise, if you casually run an ls or lsr command, you will sometimes get...
Introduction: recently, while operating and maintaining one of our department's Hadoop clusters, we found many jobs hitting OOM, and a command run on the machines showed that full GC was quite severe. As we all know, the consequences of a full GC are relatively large: "stop the world". Once one of your full GCs lasts more than a few minutes, all other activity is suspended for that long. So once full GC behavior becomes abnormal, you must find the root cause and fix it. This article tells everyone about my...
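One quick way to quantify such a problem is to pull the pause time of every full GC out of the JVM's GC log. A small sketch, assuming the classic -XX:+PrintGCDetails line format (the sample log lines and the regex are illustrative):

```python
import re

# Matches the pause time at the end of a classic PrintGCDetails line, e.g.
#   ... [Full GC (Ergonomics) ... , 0.4154270 secs] ...
FULL_GC_RE = re.compile(r"\[Full GC.*?([\d.]+) secs\]")

def full_gc_pauses(lines):
    """Return the pause time in seconds of every Full GC line."""
    pauses = []
    for line in lines:
        m = FULL_GC_RE.search(line)
        if m:
            pauses.append(float(m.group(1)))
    return pauses

log = [
    "2016-01-01T10:00:00: [GC (Allocation Failure) 0.0212300 secs]",
    "2016-01-01T10:00:05: [Full GC (Ergonomics) 102400K->65536K(204800K), 1.2345670 secs]",
]
print(full_gc_pauses(log))  # [1.234567]
```

Summing or plotting these pauses per process makes it obvious which jobs are losing minutes to "stop the world" and are worth digging into first.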
Introduction: recently at work I solved a slow-disk problem, and I found the whole analysis process very interesting and instructive. Disk monitoring in current Hadoop is not done very completely; most of it sits at the DataNode level, so you could call this a blind spot. Actually, on reflection, the monitoring Hadoop itself does is reasonable and necessary, because issues like this one are basically hardware problems, and arguably such things should not be monitored in software. But then we thought: what if you could find such machines by means of software-level monitoring?
Preface: before starting to write this article, I kept thinking about what title would express the theme most accurately without using too many words. Metric monitoring already exists in YARN, and the metrics it supports are numerous, so this article is certainly not simply an introduction to the meaning of a few monitored metrics, or to how to add custom metrics for monitoring; the key point lies in one word: refinement. Refinement carries two implied meanings. One is to add finer-grained monitoring on top of the original metrics, to improve the original monitoring...
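As a toy illustration of what "finer-grained" means here, the sketch below splits one cluster-wide counter into a per-queue breakdown; the class and counter names are invented for illustration and are not YARN's actual Metrics2 API:

```python
from collections import defaultdict

class QueueMetrics:
    """Toy metrics registry: a cluster-wide counter plus a per-queue breakdown."""
    def __init__(self):
        self.apps_submitted = 0
        self.apps_submitted_by_queue = defaultdict(int)

    def on_app_submitted(self, queue):
        self.apps_submitted += 1                  # coarse-grained: cluster total
        self.apps_submitted_by_queue[queue] += 1  # fine-grained: per queue

m = QueueMetrics()
for q in ["root.default", "root.hive", "root.default"]:
    m.on_app_submitted(q)
print(m.apps_submitted)                           # 3
print(m.apps_submitted_by_queue["root.default"])  # 2
```

The cluster total tells you that load went up; only the per-queue counter tells you which tenant caused it, and that is the kind of refinement the article is after.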
Preface: we all know that in Hadoop, executing a job requires splitting it into tasks, and tasks come in two types: map tasks and reduce tasks. Of course, this still is not the lowest level: inside a task there are also TaskAttempts, that is, attempted executions of the task; in this article's terminology, each task attempt will of course have its own resource usage. Broadly speaking, resources divide into two concepts: one is memory...
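A small sketch of rolling attempt-level usage up to the task and job level; the record fields here (peak memory, CPU time) are assumptions for illustration, not JobHistory's actual schema:

```python
# Each record: (task_id, attempt_id, peak_memory_mb, cpu_ms) -- fields assumed
attempts = [
    ("task_m_000000", 0, 812, 4200),
    ("task_m_000000", 1, 830, 4100),   # a retried attempt of the same task
    ("task_r_000000", 0, 1536, 9800),
]

def rollup(attempts):
    """Aggregate attempt-level usage: per-task peak memory, job-wide CPU."""
    per_task_peak = {}
    total_cpu_ms = 0
    for task_id, _attempt, mem_mb, cpu_ms in attempts:
        per_task_peak[task_id] = max(per_task_peak.get(task_id, 0), mem_mb)
        total_cpu_ms += cpu_ms  # every attempt consumed real CPU, retries included
    return per_task_peak, total_cpu_ms

peaks, cpu = rollup(attempts)
print(peaks)  # {'task_m_000000': 830, 'task_r_000000': 1536}
print(cpu)    # 18100
```

Note the asymmetry: memory is summarized as a per-task peak (what a container had to hold), while CPU is summed across all attempts, since retried attempts still burned real cluster time.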
Preface: in current Hadoop clusters, the attitude toward all users' jobs is consistent, which is to say "generous". But if the average number of jobs running on the cluster climbs, resource abuse will inevitably appear. Several articles introduced earlier, such as the one on a custom Hive SQL job analysis tool and the one on analyzing abnormal Hadoop tasks, leaned toward monitoring rather than solutions. To get back to the main point: generally, once a cluster grows to a certain scale, this problem should appear.
Introduction: as a fairly mature distributed system, Hadoop is often characterized as a place to store massive data, which sets it sharply apart from a traditional RDBMS such as MySQL. With its natural advantages, Hadoop can store PB-level data: as long as you have enough machines, it can store that much, and it takes care of the backup-replica mechanism for you. One can only say that the Hadoop system as a whole is really very complete. Speaking of the massive data stored in Hadoop, each day's data increment can basically reach the TB level; for a class...
Preface: as a professional programmer, setting aside factors such as salary and starting purely from the technical point of view, how do you reach a relatively high level? The answer is to exchange ideas and cooperate with the top group of people. Of course, most of that top group are probably not at your side, and many are abroad. But that is no obstacle: do not forget that the network exists; you can exchange ideas through communities and mailing lists and put forward your own views. These people are often active in many open-source communities, such as Apache, many of whose sub-projects are very good systems. So this article...
Preface: starting from the simplest analysis: as a large distributed system, when Hadoop's scale keeps expanding and is amplified to a certain extent, its usage will naturally change a great deal. Different business lines will submit various types of tasks: some submit Hive query tasks, some write their own MapReduce programs. Then, slowly, a concept called "multi-tenancy" emerges. A direct way to understand multi-tenancy is a large public bicycle lot: a crowd of people use the bikes in common, and if a bike is in use, sorry, you cannot take it and will have to wait. But when the users...
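The bicycle analogy maps naturally onto a bounded shared resource: whoever arrives when all bikes are taken must wait. A toy sketch with a counting semaphore (purely illustrative; this is not how YARN actually schedules):

```python
import threading

bikes = threading.Semaphore(2)   # a lot with 2 shared bicycles
rides_completed = []
lock = threading.Lock()

def tenant(name):
    with bikes:                  # blocks while all bikes are in use
        with lock:
            rides_completed.append(name)

threads = [threading.Thread(target=tenant, args=(f"user{i}",)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(rides_completed))  # 5 -- everyone eventually rides, some had to wait
```

All five tenants complete, but at most two hold a bike at once; the open question multi-tenancy raises, and what queue-based schedulers address, is how to keep one heavy user from hogging the bikes.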