
A small team leveraging big data: machine learning practice of the Dangdang recommendation team

Published: 2015-10-16 09:15 | Source: CSDN | Author: Zhang Xiangyu

Abstract: Zhang Xiangyu, development manager for personalized recommendation at Dangdang, shares in depth the machine learning experience of Dangdang's recommendation team. The talk is process-oriented: practices in building the system, the pits the team fell into and how it climbed out of them, all from the perspective of a small team.

[Editor's note] Zhang Xiangyu, development manager for personalized recommendation at Dangdang, shares in depth the machine learning experience of the recommendation team, focusing on how a small team built its system, the pits it fell into, and how it got out of them. The pits Dangdang stepped in while building its machine learning system mainly include: seeing the model but not the system; neglecting visualization and analysis tools; relying too much on algorithms; not keeping key processes and data within the team; a team that is not "full stack"; and the giant system. As for tools, Dangdang chose R and Python for the exploration stage and relies mainly on Hadoop and Spark for the big data stage; the current cluster has several hundred machines. Through continual iteration after the "whole-process build", Dangdang also distilled a set of feature engineering tools, Dmilch.

First, my motivation. I do not need to repeat how popular and powerful machine learning systems are now. However, because of their special nature, building one that is reliable and easy to use is not easy. Every time I see an excellent talk from peers, I wonder how those complex and exquisite systems were built. What was the construction process like? Were there pits along the way? What was learned? Can I "steal" some of it to learn from?

So I want to give a talk that is more process-oriented, sharing some of our practices in building systems, some of the pits we fell into, and how we got out of them.

In addition, this talk emphasizes the small team. One reason is that the team doing ML at Dangdang is still relatively small; another is that, as I understand it, not every company is as big as BAT. So the experience and practice of small teams may have their own reference value, and I hope this talk offers you a different point of view.

Today's talk comes from the small ML team in Dangdang's recommendation group.

Our team is responsible for building Dangdang's recommendation and advertising machine learning system from scratch, and for optimizing, maintaining and improving it. Apart from the computing platform, which another team maintains, we are responsible for every link in the ML pipeline. The models we produce are used for ranking in some of the recommendation modules and some of the advertising modules.

Before starting, it is necessary to clarify the scope of this talk. The topics shown above are not covered here; readers who need them can refer to other excellent CSDN talks.

The topics above are covered in this talk, and process is the main focus. The intended audience is as follows:

Whatever stage of construction your machine learning system is at, if you gain something from this talk or find it inspiring, then the slides the author made while suffering from "post-holiday syndrome" were not made in vain...

Here is the outline of the talk. First I will briefly give my understanding of small teams, then spend most of the time on our small team's machine learning practice. After that I will summarize some of the pits we stepped in along the way and what we learned from them. Then, using a few references as examples, I will discuss future work and possible directions. The last part is a question and answer session.

A brief discussion of small teams

First, my understanding of small teams.

Why do small teams appear? At first glance the question sounds silly, because every team grows from small to large. That is true, but machine learning teams have some characteristics of their own.

  1. Compared with purely functional systems, one characteristic of machine learning systems is uncertainty: how well the system will work once it is built cannot be quantified at the start. This makes investment decisions more cautious, so not many people are committed at the beginning.
  2. Talent in this area is relatively scarce and hard to recruit. There seem to be plenty of résumés, but very few candidates have real ability or hands-on experience. Following the principle of "better to go without than to settle for less", a small but excellent team is the better choice.

What challenges does a small team face in building a system? This is the first question we care about. The essence of the small-team challenge comes down to two words: few people. A number of specific challenges derive from this fundamental limit.

  1. The first is high requirements on individual ability. This is easy to understand: with fewer people, everyone has to play a bigger role, so the bar for individual competence is relatively high. There is no shortcut here; it comes down to external recruiting and internal training.
  2. The second is that, during system development, each of us usually has to take charge of several tasks at once. This challenges both individual ability and the ability to cooperate. On the other hand, it is the best possible training for team members and lets them grow as fast as possible.
  3. The third is choosing directions and requirements. With few people, every decision on what to pursue must be made very carefully to keep wasted effort to a minimum. Sometimes this really is a limitation, but from another angle it forces us to concentrate on what matters most, putting the good steel on the blade's edge.
  4. The last is higher single-point risk. Because each person is responsible for more, anyone's departure, leave or reassignment has a greater impact on the system. This is again addressed mainly through internal training and external recruiting, but there is one more way: retaining people by giving them challenging work. Which approach to take depends on the specific circumstances.

Seen this way, the challenges for a small team are not small. But in turn, a small team also has some unique advantages.

  1. The first is that the team is easy to unite. This is a natural advantage of any small team.
  2. The second is easy collaboration. Many things do not need a meeting; you turn around, exchange a few words, and it is settled.
  3. The third is iteration speed. Because everything in the process is handled by a few people, there is no need to coordinate many resources, so as long as these people work hard, iteration is fast.
  4. The last point is very important: team growth. Because everyone is responsible for many things, people grow quickly and their sense of achievement is relatively high. Managed properly, this keeps the whole team in a very dynamic, active state.

Machine learning practice in Dangdang recommendation

Next, we spend some time on how Dangdang's machine learning team managed to move the big stone that is a machine learning system.

The picture above shows the overall architecture of the recommendation back end. As the top-level architecture diagram shows, the machine learning system is a subsystem that interacts directly with the recommendation job platform (the offline job platform for recommendation).

These framework diagrams are only meant to show where the machine learning system sits in the overall recommendation system and what role it plays. They are not the focus of this talk, and there is no need to understand them in detail.

The architecture diagram on this slide shows the details of the red box on the previous slide. As the figure shows, the machine learning system plays its role in ranking the results. I will not expand on this architecture here; interested readers can refer to a talk given some time ago by engineers at Meituan, which describes a similar structure.

The chart above expands the red box of the previous slide further and is the architecture diagram of the machine learning system itself. Experienced readers will see that it includes the main parts of a machine learning pipeline.

Now let us talk about how this system was built and what the process was like. The initial stage of the system is the exploration stage. Its purpose is to find out whether your problem is one that is suitable for ML techniques.

Machine learning is very powerful but not universal. Especially in areas that require strong human prior knowledge, it may not be the most appropriate approach, and in particular it may not be suitable as the starting approach for a system. At this stage, the tools we used were R and Python.

On the right side of the figure, the parts in red boxes can be handled with R, the parts in blue boxes are better suited to Python, and the parts in green boxes need both.

Why choose R and Python?

First, R.

  1. R is versatile; it is known as the Swiss Army knife of the data science community.
  2. R has been popular for many years and is a mature tool, so it is easy to find solutions to the problems you run into.
  3. At that time (2013), sklearn and similar tools were not yet mature enough to use, and when problems came up it was hard to find solutions.

Next, Python.

  1. Python is highly productive and suits rapid development and iteration, although you still need to pay attention to engineering quality.
  2. Python is strong at text processing, which suits the handling of text-related features.
  3. Python combines well with computing platforms such as Hadoop and Spark, so it scales when the amount of data grows.

However, the parts handled by R can now be done in Python as well, because toolkits such as sklearn, Pandas and Theano have become more mature.

But once the exploration stage is over and it is time to build a system for large data volumes, R is no longer suitable. There are two main reasons: the amount of data it can handle is small, and processing is slow.

  1. First, plain R supports only a single machine, and the data must be loaded entirely into memory, which is an obvious obstacle for large data. Some newer techniques may ease this problem, but we have not tried them.
  2. Second, computation is relatively slow; this of course refers to speed on large amounts of data.

So, as shown on the left side of the figure, once the data volume reaches that stage, tools represented by Hadoop and Spark take the stage and become the main tools.

After the initial exploration and validation stage comes the engineering iteration stage.

The figure shows our typical development process.

Once validation has passed, we enter the next important step, which I call the "whole-process build". It means building out every link of the ML system, including the downstream parts that use the model, so that they form a complete development environment.

The word to emphasize here is "complete". It is not enough to set up the sample, feature and training links related to the model; the links that use the model afterwards, such as ranking and display, must be built as well. I will come back to this point later.

If this is your first system, the whole-process build will take a relatively long time. But this step is the cornerstone of all later work, so the time and effort are worth it.

Once this step is done, a system has in fact been constructed. Of course, it is a system in form only and not yet in spirit, because no part has been fully optimized and some parts may be only an empty shell.

After that we enter the "Infernal Affairs" of iterative optimization. The work here is to keep looking for points that can be optimized, try various solutions, validate them offline and, if they meet the launch standard, run an online A/B test. Once the system pipeline has been built, we basically stay in this loop. The term "Infernal Affairs" originally refers to the eighteenth level of hell and signifies an endless cycle of suffering.

This development process is in fact much like building a house: first you lay the foundation, then you put up a bare shell, then you keep decorating and inspecting it until you can move in. After living there for a while you may feel dissatisfied with something, or a new and more attractive style of decoration appears, so you redecorate. And so on, over and over, until one day you get rich and change houses, which corresponds to rebuilding and upgrading the whole system.
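The online A/B step of the iteration loop can be illustrated with a minimal sketch of deterministic traffic splitting. This is not Dangdang's implementation; the function name `assign_bucket`, the experiment name and the hashing scheme are assumptions made only for illustration.

```python
import hashlib


def assign_bucket(user_id, experiment, treatment_share=0.5):
    """Deterministically assign a user to 'treatment' or 'control'.

    Hypothetical helper: hashing the experiment name together with the user id
    keeps a user's bucket stable within one experiment, while different
    experiments split the same users independently.
    """
    digest = hashlib.md5(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a position in [0, 1).
    position = int(digest[:8], 16) / 0x100000000
    return "treatment" if position < treatment_share else "control"


# Example: route a few hypothetical users for a hypothetical ranking experiment.
buckets = {uid: assign_bucket(uid, "rank_model_v2") for uid in ["u1", "u2", "u3"]}
```

Because the assignment depends only on the ids, the same user sees the same variant on every request, which keeps the online comparison clean.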

This slide introduces the tools we use. They are all common mainstream tools, except for the Dmilch tool set.

Dmilch (milch), the Dangdang MachIne Learning toolCHain, is a set of feature engineering tools that we distilled from our continual iteration. It contains a number of commonly used feature processing tools, such as feature regularization, normalization, and the calculation of common metrics. Its purpose is similar to that of FeatureFu, which LinkedIn open-sourced some time ago: both aim to make feature processing more convenient, but they approach it from different angles.
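As an illustration of the kind of feature processing mentioned above, here is a minimal sketch of two common normalization routines. Dmilch's actual API is not described in this talk, so the function names `min_max_scale` and `z_score` are hypothetical and use only the Python standard library.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly into [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Standardize to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Example: a hypothetical raw price feature.
scaled_prices = min_max_scale([10, 20, 30])
```

Keeping such routines in one shared tool set means every experiment applies the same transformation at training time and at prediction time.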

This slide introduces several key points in our workflow. Small teams actually have a natural advantage here, so our central idea is to take small steps and run fast.

The first key point is to make changes serially. This may be a characteristic unique to algorithmic systems such as machine learning: when several improvements go in together, you sometimes cannot tell which factor really made the difference. It is like a traditional Chinese medicine prescription that works without anyone knowing which ingredient did it, whereas we hope to extract the real "artemisinin".

The second point is the mechanism for moving the project forward. We hold one or two discussion sessions a week, mainly to verify the effect of improvements, discuss proposals, and confirm the next actions.

Engineers do not really like meetings, so why meet every week? I think one of the most important purposes is to let everyone take part in the discussion, share responsibility for the project, and grow together. The work is divided among people, but the discussion is not: everyone must have ideas and suggestions about the system. This also lets each of us pick up things we are unfamiliar with from the others, which helps us grow.

Another topic that has to be addressed is trying new technology. To continue the house-building example, new technologies are like high-end furniture: without one or two showpieces at home, you are embarrassed to greet your guests.

Our experience here is: first understand existing technology thoroughly and use it to the full; it is not too late to talk about new technology after that. Take the collaborative filtering algorithm commonly used in recommendation: we compute it on purchase, browsing, review, favorites and other data, along different dimensions, to see which works better. Once the value of familiar technology has been squeezed dry, it is still not too late to try new technology.

Another very important point is that other people's technology may not suit you. Business scenarios, data sizes and data characteristics differ from company to company, so technology that is new to you and comes from others should be adopted with care.

We once tried, full of confidence, a technique from a major international vendor, but repeated attempts produced no good results and instead introduced a great deal of complexity. Later, talking with peers, we found that they had not obtained good results either. So the foreign moon may be rounder only when it is abroad. Which technology to use depends on which seedling suits the soil of your own system.
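Computing collaborative filtering separately for each behavior type, as described above, could look like the following sketch. It uses a simple item-to-item cosine similarity on binary data; the toy logs, the behavior names and the function `item_similarity` are assumptions for illustration, not the team's actual algorithm.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def item_similarity(user_items):
    """Cosine similarity between items, from binary user -> set-of-items data."""
    item_count = defaultdict(int)   # number of users who touched each item
    pair_count = defaultdict(int)   # number of users who touched both items
    for items in user_items.values():
        for item in items:
            item_count[item] += 1
        for a, b in combinations(sorted(items), 2):
            pair_count[(a, b)] += 1
    return {
        pair: n / sqrt(item_count[pair[0]] * item_count[pair[1]])
        for pair, n in pair_count.items()
    }


# Hypothetical toy logs, one per behavior type.
behavior_logs = {
    "purchase": {"u1": {"A", "B"}, "u2": {"A", "B"}, "u3": {"B", "C"}},
    "browse": {"u1": {"A", "C"}, "u2": {"A", "B", "C"}, "u3": {"C"}},
}

# One similarity table per behavior, so each can be evaluated on its own.
similarity_by_behavior = {
    name: item_similarity(log) for name, log in behavior_logs.items()
}
```

Each table can then feed the same downstream evaluation, which makes it straightforward to compare which behavior signal works best before reaching for a new technique.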

Before ending this part, let me briefly describe the online effect of the model. For recommendation, the first-screen click rate increased by 15%~20%. For advertising, the click rate increased by about 30% and RPM by about 20%. The effect is clearly significant.

The pits we stepped in over the years

Next comes another important part of the talk: the various pits we have stepped in.

As the saying goes, the past is prologue, and the pits are perhaps part of the value of every talk. We stepped in many pits while building our system. Here I share a few that I consider relatively large, and I hope they are helpful. I will first introduce the pits, then say what we felt and learned after climbing out of them.

Seeing the model but not the system

If the pits were ranked, this one would certainly come first, because if you fall into it, the whole direction of your system is likely to be wrong.

Specifically, when we first built the system, we basically paid attention only to the quality metrics of the machine learning model, such as AUC and NE, and not to the model's final online effect. The consequence was that the model looked very good on every metric, but once it went online there was no effect. Because we ignored how the model would be used, we kept "optimizing" the model in isolation, and the final effect was poor.

What is the right approach? From our experience, early in system building you must be clear that you are building not a model but a model-centered system. Always knowing what happens after the model is produced and how it will be used matters a great deal to the overall picture.

The model is the center of the system, but it is not the whole system. At every stage of design, development and optimization, look at problems from the system's point of view; do not let the model fill your eyes while the system (the product) disappears from view. Otherwise, by the time you have tuned a model to AUC=0.99, you may look up and find that the system has drifted further and further away.

A machine learning system therefore has to attend to both the model and the system. If you see only the model and not the system, you are likely to build a "vase system" that has good metrics but no real effect.
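The distinction between an offline model metric and an online system metric can be made concrete with a small sketch. The two functions below are illustrative assumptions (the names `auc` and `relative_lift` are not from the talk): one computes AUC offline from labels and scores, the other computes the relative change of an online rate such as click rate between a control and a treatment group.

```python
def auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win. This is O(P * N), which is fine for a small
    illustration but not for production-scale data.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def relative_lift(control_rate, treatment_rate):
    """Relative change of an online rate (e.g. click rate) versus control."""
    return (treatment_rate - control_rate) / control_rate


# Offline: made-up labels and scores for the model-quality view.
offline_auc = auc([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.2])

# Online: made-up click rates for the system-effect view.
online_lift = relative_lift(0.10, 0.12)
```

Tracking both numbers for every change makes it visible when the first improves while the second does not, which is exactly the situation described above.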

Neglecting visualization and analysis tools

This is a problem that is easy to ignore at the start but that makes life very hard later on (this refers to systems that are not deep learning systems).

Because a machine learning system is to some extent a black box, our energy tends to go into parameters and models, and it is natural to think that the internal workings of the model need not concern us. But our experience is that if you look only at the outside of the black box and not at the inside, then when the model performs badly it is hard to locate the problem. Conversely, if the effect is good, you are left a little confused, just as when the bathroom light at home suddenly turns itself on or the TV switches on by itself: it always leaves you uneasy.

We feel this very deeply. When we first built the system and found the effect was poor, we had few ways to locate the problem. All we could do was keep changing the features and the sample processing in various ways: if the effect improved, good; if not, we tinkered again.

Later we built a web page that displays everything: for each sample, its features and the parameter of each feature, the number of samples, its position in the ranking of the candidate set, and so on. It is like performing an autopsy on the whole model and system, letting us see as much of the internal detail as possible, which is a great help in analyzing problems.

This system has helped us a great deal. It is nothing fancy, but once all these things are laid out in front of you, you find things that differ from what you assumed and things you would never have thought of. For a black-box system such as machine learning, that is especially valuable. To this day, it is one of the things we rely on most every time we verify an effect. You could say it is our second pair of eyes.
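The per-sample view described above can be sketched as follows, under the assumption of a linear model such as logistic regression (the talk does not say which model family is used). The function name `explain_sample`, the weights and the feature names are all hypothetical.

```python
from math import exp


def explain_sample(weights, bias, features):
    """Return the predicted probability and per-feature contributions.

    Assumes a logistic regression style model: each feature contributes
    weight * value to the logit. Features without a learned weight contribute 0.
    """
    contributions = {
        name: weights.get(name, 0.0) * value for name, value in features.items()
    }
    logit = bias + sum(contributions.values())
    probability = 1.0 / (1.0 + exp(-logit))
    # Largest absolute contribution first, as one might list them on a debug page.
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return probability, ranked


# Hypothetical model parameters and one hypothetical sample.
weights = {"price": -0.5, "ctr_7d": 2.0}
probability, ranked = explain_sample(
    weights, bias=0.0, features={"price": 1.0, "ctr_7d": 1.0, "unknown": 3.0}
)
```

Rendering `ranked` for each candidate next to its final position makes it much easier to see why an item was ranked where it was.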

Over-reliance on algorithms

I believe many of you have met this pit. Let me give an example. We had a text processing problem in which we needed to filter out a large amount of irrelevant text. At first we tried many algorithms and all kinds of tuning, but the results were never satisfactory.

In the end we played our trump card: manual filtering. Specifically, three people spent three days going through the terms purely by hand (several thousand of them), and the effect was immediate. There may well have been an algorithm that would have done better on that problem, but measured at the level of the whole system and project, the manual approach had the highest ROI.

So although a machine learning system is built on algorithms, our thinking must not become rigid and reach for an algorithm for everything. In some places the low-tech approach of "millet plus rifles" is more appropriate.

Critical processes and data not controlled by the team itself

This pit is not easy to spot, and it is especially well hidden in the early days of a system. We only discovered it after suffering several losses.

In many companies, front-end display, log collection and similar work are handled by dedicated teams, and teams such as recommendation and advertising simply take the output and use it. The benefit is obvious: the machine learning team can focus on its own work. The downside is that the data those teams collect is not always what we expect to get.

For example, the exposure data we first used was produced for us by a sister team, but when we used it we found that it did not match other data, and it took a long time to find the problem. This directly affected the correctness of our samples, so the impact was very large.

What caused the problem? It was not that the sister team was careless. They did not fully understand our data requirements and did not use the data themselves, so its quality was at risk. After this loss, we took over this part of the work ourselves, so that we can monitor the correctness of the data end to end and resolve problems internally without coordinating various resources.
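One simple check that can reveal this kind of mismatch is to verify that every click has a matching exposure. The sketch below assumes both logs can be keyed by a (request id, item id) pair; that key and the function name `check_click_exposure_consistency` are assumptions for illustration, not the team's actual log schema.

```python
def check_click_exposure_consistency(exposures, clicks):
    """Compare exposure and click logs keyed by (request_id, item_id).

    Returns the clicks that have no matching exposure and their share of all
    clicks. A high share suggests the exposure log is incomplete or misaligned.
    """
    exposed = set(exposures)
    orphan_clicks = [c for c in clicks if c not in exposed]
    orphan_ratio = len(orphan_clicks) / len(clicks) if clicks else 0.0
    return orphan_clicks, orphan_ratio


# Hypothetical toy logs.
exposures = [("r1", "A"), ("r1", "B"), ("r2", "A")]
clicks = [("r1", "A"), ("r3", "C")]
orphans, ratio = check_click_exposure_consistency(exposures, clicks)
```

Running a check like this routinely, rather than only when results look odd, is one way to keep the correctness of the samples under the team's own monitoring.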

The team is not "full stack" enough

This pit is a complicated one. In the previous pit, I mentioned that we found a data quality problem and afterwards took over the collection of exposure data ourselves. But locating the problem and taking over the work did not happen as soon as the problem appeared. The reason is simple and cruel: there was no front-end talent in our group.

The exposure problem involves a series of actions from the browser to the back-end system, and the front end is the first link in that chain. But when we formed the machine learning team, we did not realize that front-end work would be involved; we thought back end plus modeling was enough. As a result we were relatively weak when facing this problem. Only after an experienced front-end engineer joined our group were we able to locate the problem and decide to take over the work ourselves.

The lesson for us is: be more careful when setting up a team and look at it from a more systematic perspective. Doing machine learning does not mean recruiting only algorithm engineers. That leaves the team with a weak spot and sets the stage for problems later.

However, some problems are hard to predict before you meet them, so this pit really is a complicated one.

Giant system

The last pit is also a big one. I call it "the giant system".

What does a giant system mean? Put simply, it means building the whole system as "one" system rather than as several subsystems divided by module. In such a system the modules are tightly coupled and strongly interdependent: samples, features, training, prediction and so on are all stuck together and cannot be separated. What are the consequences?

Take a direct example. For our first version, just going online took about a week, and afterwards maintenance was very hard and changes were very difficult. Why did it turn out this way? My reflection is: when learning the theory, we treat the pipeline of samples, features and training as one thing, and carrying that way of thinking directly into the system yields a giant system. With only a dozen features and a few hundred samples there may be no problem. But when features and samples both grow into the tens of millions, you need to consider whether a single system can still cope.

What is the better way? Our later solution was: big system, done small. "Big system, done small" is not my invention; it is a concept I saw the WeChat team mention around Spring Festival this year (or last year) when describing the architecture of their red envelope system. I think it is very well put and I strongly agree with it. It means that although your system is large and complex, you should still separate it into modules when building it, which helps development and also helps extension and maintenance.

A characteristic of machine learning systems is that at the start you may use very little, so a single system can handle everything. But as you go on, you need all kinds of feature transformations and sample processing, and the system grows large without your noticing. If you pay attention only to the model, it is very easy to end up with a giant system that cannot be maintained.
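The idea of separating samples, features, training and prediction into modules can be sketched as below. This is a deliberately tiny illustration with hypothetical function names and a placeholder "model" that only learns the positive rate; the point is the boundaries between stages, not the learning algorithm.

```python
def build_samples(raw_events):
    """Sample module: turn raw (attributes, clicked) events into labeled samples."""
    return [{"raw": attrs, "label": int(clicked)} for attrs, clicked in raw_events]


def extract_features(sample, feature_fns):
    """Feature module: apply independent, named feature functions to one sample."""
    return {name: fn(sample["raw"]) for name, fn in feature_fns.items()}


def train(feature_rows, labels):
    """Training module placeholder: 'learns' only the overall positive rate."""
    return {"positive_rate": sum(labels) / len(labels)}


def predict(model, feature_row):
    """Prediction module placeholder: returns the same score for every row."""
    return model["positive_rate"]


def run_pipeline(raw_events, feature_fns):
    """Wire the modules together; each stage only sees the previous stage's output."""
    samples = build_samples(raw_events)
    rows = [extract_features(s, feature_fns) for s in samples]
    model = train(rows, [s["label"] for s in samples])
    return model, [predict(model, row) for row in rows]


# Hypothetical toy input.
model, scores = run_pipeline(
    [({"price": 10}, True), ({"price": 30}, False)],
    {"price": lambda raw: raw["price"]},
)
```

Because each stage communicates only through plain data, one stage (for example the feature functions) can be replaced or scaled out without touching the others.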

The long march has just started

After going through the pits just described, our team can be said to have built a system, but this is only the first step of the Long March. A machine learning system is still a new thing for us; it differs from traditional software systems, has complexities of its own, and poses many challenges still to be solved. Here I use two references to introduce these complexities and challenges briefly. Readers interested in a deeper understanding can look up the articles themselves.

The first is a paper from Google Research about technical debt in machine learning. The title is also interesting: "Machine Learning: The High-Interest Credit Card of Technical Debt".

The paper's main message is that building a machine learning system is very complex. If you lack experience or are not careful enough, it is easy to take on debt in many places. The debt seems to have little effect when you incur it, but because the interest is high, paying it back later is very painful.

The figure above is my own summary, made after reading the paper, of the specific dimensions of technical debt. These dimensions match our own practice closely; reading the paper, we felt hit in the knee again and again.

For example, the "blurred subsystem boundaries" item on the right of the figure resembles the "giant system" I described earlier: the inside of the system is not partitioned.

For example, consider "system-level spaghetti". Spaghetti code commonly refers to messy, tangled code. Because a machine learning system is built up through exploration rather than being fully designed before construction like other systems, it easily produces spaghetti code.

If you consider these dimensions before building a system, later development, upgrading and maintenance will be much easier. I believe this experience was also distilled by Google after falling into many pits. If it is this hard even for a giant, it is certainly not easy for us.

The next reference is a talk given at ICML 2015 by Léon Bottou, the SGD expert now at FB. Its title is "Two Big Challenges in Machine Learning". It is a fairly systematic, practice-oriented piece that describes two new challenges facing machine learning.

The first point sounds alarming: machine learning disrupts software engineering. But if you think about it, it is true. The development of a machine learning system is mostly exploratory and incremental, which is very different from traditional software engineering and is a challenge for system developers. I think there will be a dedicated role of "machine learning system architect" in the future.

The second point is that the current experimental paradigm is reaching its limits. At first glance this seems to be about scientific experiments, but it is not. Because machine learning system development is exploratory, we often run various experiments during development to verify effects, and the overall experimental framework also needs to be carefully designed. In Bottou's view, the current approach is clearly not adequate.

Zhang Xiangyu

Dangdang personalized recommendation development manager

Born in 1986, he holds bachelor's and master's degrees from Renmin University of China and is currently development manager for personalized recommendation at Dangdang. He works on recommendation systems and machine learning systems, and follows new applications of machine learning in areas such as Internet finance, anti-fraud and risk control.


(commissioning editor / Zhou Jianding)
