10 More Lessons Learned from Building Real-Life Machine Learning Systems (Part II)

Original text: 10 More Lessons Learned from Building Real-Life Machine Learning Systems, Part II
Translator: Liu Xiangyu
Reviewer: Zhao Yihua
Commissioning editor: Zhou Jianding

In my previous blog post, I introduced 10 new lessons and described the first 5 of them. Now let me cover the remaining 5.

6. The pains and gains of feature engineering

The main properties of a well-behaved machine learning feature are:

  • Reusability
  • Transformability
  • Interpretability
  • Reliability

What do these properties mean?

  • Reusability: you should be able to reuse features across different models, applications, and teams.
  • Transformability: besides reusing a feature directly, it should be easy to use a transformation of it (such as log(f), max(f), or ∑ft over a time window).
  • Interpretability: to do any of the above, you need to understand what a feature means and be able to interpret its values.
  • Reliability: it should be easy to monitor features and to detect bugs or other problems in them.
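As a rough illustration of transformability, the sketch below derives a log transform, a windowed sum, and a windowed maximum from one raw feature. The feature (daily upvote counts), the function names, and the 7-day window are assumptions made for the example. It uses log(1 + f) instead of plain log(f) so that zero counts remain valid.

```python
import math
from datetime import datetime, timedelta

def log_transform(value):
    """Return log(1 + f), which compresses heavy-tailed counts and accepts f = 0."""
    return math.log1p(value)

def windowed_sum(events, now, window):
    """Sum f_t over the events whose timestamp lies in (now - window, now]."""
    start = now - window
    return sum(value for ts, value in events if start < ts <= now)

def windowed_max(events, now, window):
    """Maximum f_t over the same window, or 0.0 if the window is empty."""
    start = now - window
    return max((value for ts, value in events if start < ts <= now), default=0.0)

# Hypothetical raw feature: (timestamp, upvote count) pairs, most recent first.
now = datetime(2016, 1, 8)
counts = [3, 0, 7, 1, 0, 2, 9, 4, 5, 6]
daily_upvotes = [(now - timedelta(days=d), c) for d, c in enumerate(counts)]

week = timedelta(days=7)
derived = {
    "upvotes_7d_sum": windowed_sum(daily_upvotes, now, week),
    "upvotes_7d_max": windowed_max(daily_upvotes, now, week),
    "log_upvotes_7d_sum": log_transform(windowed_sum(daily_upvotes, now, week)),
}
print(derived)
```

Because each transformation is a small function over the same raw events, the derived features can be shared across models without recomputing the raw data.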

Let's look at a real example of feature engineering: Quora's answer ranking. Which features do we need in order to build a good answer ranking model? First of all, we have to define what a "good" answer means at Quora. Fortunately, Quora's answer policies describe this in detail. According to that description, a good answer is:

  • Truthful
  • Reusable
  • Explanatory
  • Well formatted

Figure: Quora's answer ranking is a very interesting machine learning problem that requires good feature engineering

So how do we translate the dimensions we care about into features that can be fed into a machine learning model? The trick is to work out which of the product data we already have captures those qualities. We can use features related to the quality of the writing, interaction features (such as upvotes or comments), and user features (such as the user's expertise and trustworthiness in the topic).
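To make the three feature families concrete, here is a minimal sketch of a feature extractor. It is not Quora's implementation: the `Answer` fields, the feature names, and the crude sentence splitting are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Answer:
    """Hypothetical product object; the fields are assumptions for this sketch."""
    text: str
    upvotes: int
    comments: int
    author_answers_in_topic: int  # rough proxy for the author's topic expertise

def extract_features(answer: Answer) -> dict:
    """Map an answer to writing-quality, interaction, and user features."""
    words = answer.text.split()
    # Crude sentence split: treat '.', '!' and '?' as sentence terminators.
    normalized = answer.text.replace("!", ".").replace("?", ".")
    sentences = [s for s in normalized.split(".") if s.strip()]
    return {
        # Writing-quality features
        "word_count": len(words),
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        # Interaction features
        "upvotes": answer.upvotes,
        "comments": answer.comments,
        # User features
        "author_answers_in_topic": answer.author_answers_in_topic,
    }

example = Answer(
    text="Sample first. Sample the data second.",
    upvotes=12,
    comments=3,
    author_answers_in_topic=40,
)
features = extract_features(example)
print(features)
```

Returning a flat dictionary keyed by feature name keeps each value easy to inspect and monitor, which fits the interpretability and reliability properties listed earlier.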

7. The two faces of your machine learning infrastructure

Whenever you develop machine learning infrastructure, you need to support two different modes:

  • Mode 1: machine learning experimentation. In this mode, we care about flexibility, ease of use, and reusability.
  • Mode 2: machine learning in production. In this mode, we care about all of the above, plus performance and scalability.

Ideally, we want the two modes to be as similar as possible. So how can we combine them?

One possible solution is to favor experimentation and only worry about production once something works. This might mean, for example, letting machine learning researchers use R and then having engineers reimplement the result for production in a language of their choice. Another solution is the opposite: favor production and let the researchers figure out how to run their experiments. For example, you might use highly optimized C++ code and only let machine learning researchers experiment through data in logs or in a database.

The reality is that neither of these two options works. They are inefficient, they waste resources, and they ultimately leave at least one of the two modes unable to work properly.

The trick is to find intermediate solutions that serve the needs of both modes. One example is to have machine learning researchers experiment in IPython notebooks with Python tools (scikit-learn, Theano), reuse those same tools in production wherever possible, and implement optimized versions only when needed. Another option is to implement abstraction layers on top of the optimized implementations so that they can be accessed from friendlier experimentation tools.
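The abstraction-layer idea can be sketched as a shared interface with two interchangeable implementations. Everything here is hypothetical: `Scorer`, `ReferenceScorer`, `OptimizedScorer`, and `fake_native_dot` are illustrative names, and the "native" routine is a pure-Python stand-in for what would in practice be a binding to optimized code.

```python
import math
from typing import Dict, List, Protocol, Sequence

class Scorer(Protocol):
    """Interface shared by experimentation and production code."""
    def score(self, features: Sequence[float]) -> float: ...

class ReferenceScorer:
    """Plain-Python linear scorer: easy to read and change during experiments."""
    def __init__(self, weights: Sequence[float], bias: float = 0.0):
        self.weights = list(weights)
        self.bias = bias

    def score(self, features: Sequence[float]) -> float:
        return self.bias + sum(w * x for w, x in zip(self.weights, features))

def fake_native_dot(weights: Sequence[float], features: Sequence[float]) -> float:
    """Stand-in for a compiled dot-product routine (assumption for this sketch)."""
    return math.fsum(w * x for w, x in zip(weights, features))

class OptimizedScorer:
    """Thin wrapper exposing an optimized routine behind the same interface."""
    def __init__(self, native_dot, weights: Sequence[float], bias: float = 0.0):
        self._native_dot = native_dot
        self.weights = list(weights)
        self.bias = bias

    def score(self, features: Sequence[float]) -> float:
        return self.bias + self._native_dot(self.weights, features)

def rank(items: Dict[str, List[float]], scorer: Scorer) -> List[str]:
    """Return item ids by descending score, using whichever scorer is supplied."""
    return sorted(items, key=lambda item_id: scorer.score(items[item_id]), reverse=True)

candidates = {"a": [1.0, 0.0], "b": [0.5, 2.0], "c": [0.0, 0.1]}
weights = [2.0, 1.0]
print(rank(candidates, ReferenceScorer(weights)))
print(rank(candidates, OptimizedScorer(fake_native_dot, weights)))
```

Because `rank` depends only on the interface, the same ranking code can run in a notebook against the reference scorer and in production against the optimized one.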

8. Why you should care about answering questions about your model

The value of a machine learning model is the value it brings to the product. Product owners and stakeholders have expectations about how the product should behave, and they want to verify them. It is important to be able to explain why the model does something, or why it fails. Model debuggability can actually bridge the gap between product design and machine learning algorithms.

I would go so far as to say that model debuggability is essential: it can determine, or at least influence, the model you use, the features you rely on, and the tools you implement it with.

Figure: Debugging a decision tree can be a simple task (shown in the BigML UI)

Figure: A computation graph in TensorFlow

For example, at Quora we have a debugging tool that lets us analyze why we see a particular question or answer on the home page. The tool reports not only the individual item but also the features associated with it. It can also compare different items and show which features characterize the top-ranked ones. This is very useful when debugging problems, and it also helps the product team and other stakeholders understand how the model behaves and what matters to it and what does not.
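The kind of report such a tool produces can be sketched for the simplest case, a linear scoring model, where each feature's contribution is its weight times its value. This is only an illustration under that assumption, not a description of Quora's tool; the weights and feature names are invented.

```python
def explain(weights, features):
    """Per-feature contribution (weight * value), largest magnitude first."""
    contributions = {
        name: weights.get(name, 0.0) * value for name, value in features.items()
    }
    return sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)

def compare(weights, features_a, features_b):
    """Contribution gap (a minus b) per feature, largest magnitude first."""
    names = set(features_a) | set(features_b)
    gaps = {
        name: weights.get(name, 0.0)
        * (features_a.get(name, 0.0) - features_b.get(name, 0.0))
        for name in names
    }
    return sorted(gaps.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Hypothetical model weights and two hypothetical answers.
weights = {"upvotes": 0.5, "author_expertise": 2.0, "word_count": 0.01}
answer_a = {"upvotes": 10, "author_expertise": 0.9, "word_count": 300}
answer_b = {"upvotes": 4, "author_expertise": 0.2, "word_count": 800}

for name, contribution in explain(weights, answer_a):
    print(f"{name:>18}: {contribution:+.2f}")

print("Largest gap between A and B:", compare(weights, answer_a, answer_b)[0][0])
```

For non-linear models the contributions cannot be read off this directly, which is one reason debuggability can influence the choice of model.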

Figure: Debugging the feature values of an input at Quora

Figure: Being able to go from a product "object" (such as a Quora answer) to its features is great!

9. You don't need distributed machine learning algorithms

There is a recent industry trend that seems to suggest you should use distributed machine learning algorithms by default. If you don't, it must be because your data is not big enough, right? Well, I don't think that is the case. In fact, most of what people need to do in machine learning applications can be done on a single machine. Of course, there are some notable exceptions, for example if you are building a large-scale deep neural network to recognize cats. But most of us do not need to build that kind of system.

Figure: OK, if you are building a cat model, you may need distributed machine learning.

Of course, to make everything work on a single machine you need to know about other approaches. For example, you need to understand the benefits of (smart) data sampling, how to use offline processing, and how to exploit parallelism on a single machine. I have discussed these approaches in more detail elsewhere.
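As one simple instance of data sampling, the sketch below uses reservoir sampling (Algorithm R) to draw a fixed-size uniform sample from a stream in a single pass, so the full dataset never has to fit in memory. It is plain uniform sampling, chosen for illustration; "smart" sampling schemes such as stratified or importance-based sampling would build on the same idea. The function name and parameters are assumptions.

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Return k items drawn uniformly at random from an iterable of unknown length.

    Uses Algorithm R: keep the first k items, then replace a random slot
    with item i with probability k / (i + 1).
    """
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample

# A generator stands in for a large log file or table scan.
rows = (row_id for row_id in range(100_000))
training_sample = reservoir_sample(rows, k=1_000, seed=42)
print(len(training_sample))
```

A model trained on such a sample can often be fit comfortably on one machine, with the full data used only for offline scoring or evaluation.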

Figure: Much has been written about distributed machine learning

In addition, I think that offering an "easy" way to use something as complex as a distributed processing platform, as tools such as Spark and Hadoop do, is dangerous in some ways. In particular, if you care about cost or latency, it is a good idea to keep both transparent and easy to understand.

Here is an interesting example from Quora that illustrates some of the issues. At some point, we realized that one of our Spark implementations was very inefficient. A job took 6 hours to run on 15 machines, and a rough calculation told us it should take far less time than that. One engineer spent 4 days analyzing it, looking for example at how the Spark scheduler launched the queries. The final C++ implementation now runs on a single machine and takes only 10 minutes to complete!

10. The untold story of data science and machine learning engineering

We have all heard and read answers to the question of what a data scientist is. Most of them are about how data scientists combine mathematics, software engineering, and domain expertise.

Figure: The famous data scientist Venn diagram

Figure: The modern data scientist (as opposed to the "old" data scientist?)

A different question is how data science teams fit into the organization. Many companies have struggled, or are still struggling, with this. Most of them agree that it is very important to have strong data scientists who can extract value and knowledge from data. However, whatever some people may say, data scientists with strong engineering skills are unicorns, and finding them is not easy. This often leads to a situation where data scientists depend on engineers to take their work to production, while engineers are reluctant to do so because they already have enough on their plate without productionizing other people's ideas.

So how do we solve this? My advice is to think of a typical machine learning project as a funnel with three distinct stages.

  1. The first part of the funnel is data research. Here, the team explores the data and tries to understand the problem in order to form hypotheses. Do users click the red button more on Friday nights than on any other day? Do users prefer fresh content even if its quality may be lower? How do we handle the trade-off between cold-starting new content and promoting mature content?

  2. In the second part of the funnel, once hypotheses have been formed, we need to implement a machine learning solution. This includes model selection, feature engineering, and implementing the solution in production. It covers the initial version of the solution as well as later iterations that optimize and improve the existing system.

  3. The third and final part of the funnel focuses on running online experiments (A/B tests) and analyzing the results. This tells us how the solution actually performs and whether our initial hypotheses hold.

Note that these three stages should not take a long time; they should be completed as quickly as possible as part of an iterative, agile process.
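For the third stage, one common way to analyze an A/B test on a conversion-style metric is a two-proportion z-test. The sketch below is a minimal version under the usual large-sample assumptions; the function name and the example counts are invented for illustration and do not come from any real experiment.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in rates between B and A."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / standard_error
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical experiment: control (A) versus the new ranking model (B).
z, p_value = two_proportion_z_test(conv_a=200, n_a=10_000, conv_b=250, n_b=10_000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the observed difference is unlikely under the null hypothesis, which feeds directly back into the hypotheses formed in the first stage.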

Based on what I have seen, heard, and experienced, I suggest that data scientists own the first and third parts of the funnel and engineers own the second. Having data scientists own parts one and three also makes iteration faster, because experiment results feed directly back into new hypotheses. At Quora, we involve both data scientists and engineers in every project and have them work closely together, while still keeping their areas of responsibility clear. See William Chen's answer to "What is the difference between a Quora machine learning engineer and a data scientist?" for more details.

Figure: The data-driven innovation funnel

Finally, it is worth noting that, to make a machine learning engineering team highly effective, you may need to broaden the definition of a machine learning engineer. Again, it is hard to find a large group of engineers who are excellent at both machine learning and software engineering (much as it is hard to field a team of 11 star players). A good machine learning engineering team includes both expert coders with solid machine learning knowledge and machine learning experts with solid software skills.


There are many issues to deal with when building practical machine learning solutions. Some of them are quite different from what you read about, and even from common practice. If I had to summarize these 10 new lessons along a few dimensions, I would stress the following points:

  • Make sure you train your model to learn what you want it to learn.
  • Combining supervised and unsupervised techniques is key to many machine learning applications.
  • It is very important to focus on feature engineering.
  • Think carefully about your machine learning infrastructure and tools.
  • Organize your teams well.

I hope these lessons prove useful, although I acknowledge that some of them are open to debate. I am very happy to hear different views and approaches, so please share them in the comments section.