Original text: 10 More Lessons Learned from Building Real-Life Machine Learning Systems, Part II
Translator: Liu Xiangyu
Reviewer: Zhao Yihua
Commissioning editor: Zhou Jianding
In my previous blog post, I introduced 10 new lessons and described the first 5 of them. Here I cover the remaining 5.
The main properties of a well-behaved machine learning feature are:
What do these properties mean?
Let's look at a real feature engineering case: Quora's answer ranking. How do we know which features a good answer ranking model needs? First of all, we have to define what a "good" answer is on Quora. Fortunately, Quora's answer policies describe this in detail. In that description we find the following qualities:
So, how do we convert the dimensions we care about into features that can be fed into a machine learning model? The trick is to work out which of the product data we have relates to those dimensions. We can use features related to the quality of the writing, interaction features (such as upvotes or comments), and user features (such as the user's expertise and integrity in the topic).
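As a rough illustration of turning those dimensions into model inputs, the sketch below maps one answer's product data to a small feature dictionary. The `Answer` fields, the feature names, and the proxies chosen (word count, average sentence length, number of answers the author has written in the topic) are all hypothetical; they are not Quora's actual features.

```python
from dataclasses import dataclass


@dataclass
class Answer:
    """Hypothetical product data available for one answer."""
    text: str
    upvotes: int
    comments: int
    author_topic_answers: int  # answers the author has written in this topic


def extract_features(answer: Answer) -> dict[str, float]:
    """Turn raw product data into numeric features for a ranking model."""
    words = answer.text.split()
    normalized = answer.text.replace("?", ".").replace("!", ".")
    sentences = [s for s in normalized.split(".") if s.strip()]
    return {
        # Writing-quality proxies
        "word_count": float(len(words)),
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        # Interaction features
        "upvotes": float(answer.upvotes),
        "comments": float(answer.comments),
        # User feature: a crude proxy for expertise in the topic
        "author_topic_answers": float(answer.author_topic_answers),
    }
```

Each entry is a simple, inspectable number, which makes it easier to check whether a feature really tracks the quality it is meant to capture.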
Whenever you develop machine learning infrastructure, you need to support two different modes: experimentation and production.
Ideally, we want the two modes to be as similar as possible. So, how can we combine them?
One possible solution is to favor experimentation and only move something to production once it has been shown to work. This may mean, for example, that machine learning researchers use R and engineers then implement the production version in a language of their choice. The opposite solution is to favor production and let researchers figure out how to run their experiments. For example, you can choose highly optimized C++ code and let researchers experiment only through logs or database data.
In reality, neither of these two options works. They are inefficient, they consume resources, and they ultimately leave at least one of the two modes not working properly.
The trick is to implement an intermediate solution that serves the needs of both modes. One example is to have machine learning researchers run experiments in IPython notebooks using Python tools (scikit-learn, Theano), reuse those tools in production as far as possible, and write optimized versions only when needed. Another option is to implement abstraction layers on top of the optimized implementations so that they can be accessed from friendlier experimentation tools.
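The abstraction-layer option can be sketched as a small shared interface. Everything below is an assumption made for illustration: the `Scorer` protocol, `ReferenceScorer`, `NativeScorer`, and `rank` are hypothetical names, and `native_fn` stands in for a binding to an optimized implementation (for example, compiled C++ code) that is not shown here.

```python
from typing import Callable, Protocol, Sequence


class Scorer(Protocol):
    """Interface shared by experimentation and production code."""

    def score(self, features: Sequence[float]) -> float: ...


class ReferenceScorer:
    """Plain-Python linear scorer, convenient to tweak in a notebook."""

    def __init__(self, weights: Sequence[float]) -> None:
        self.weights = list(weights)

    def score(self, features: Sequence[float]) -> float:
        return sum(w * x for w, x in zip(self.weights, features))


class NativeScorer:
    """Thin wrapper around an optimized scoring function (hypothetical binding)."""

    def __init__(self, native_fn: Callable[[Sequence[float]], float]) -> None:
        self._native_fn = native_fn

    def score(self, features: Sequence[float]) -> float:
        return self._native_fn(features)


def rank(scorer: Scorer, candidates: dict[str, Sequence[float]]) -> list[str]:
    """Order candidate ids by descending score; works with any Scorer."""
    return sorted(candidates, key=lambda cid: scorer.score(candidates[cid]), reverse=True)
```

Because `rank` depends only on the interface, the same calling code can run against the reference scorer during experiments and against the optimized one in production.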
The value of a machine learning model is the value it brings to the product. Product owners and stakeholders have expectations about how the product should behave, and they want to verify them. It is important to be able to answer why the model did something or why it failed. Model debuggability can actually bridge the gap between product design and machine learning algorithms.
I would go so far as to say that model debuggability is essential: it can determine, or at least influence, the model you use, the features you rely on, or the tools you implement it with.
As an example, at Quora we have a debugging tool that lets us analyze why we see a specific question and answer on the home page. The tool reports not only a single question and answer but also the features associated with it. It can also compare different questions and answers and show the features of the top-ranked ones. This is very useful when debugging problems, and it also helps the product team and other stakeholders better understand the model's behavior and what matters and what does not.
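The internals of that tool are not described here, so the following is only a minimal sketch of the idea, assuming a linear scoring model in which each feature contributes weight times value. The function names `explain_score` and `compare_items` and the feature names in the usage note are hypothetical.

```python
def explain_score(weights: dict[str, float], features: dict[str, float]) -> list[tuple[str, float]]:
    """Per-feature contributions to a linear score, largest magnitude first."""
    contributions = {
        name: weights.get(name, 0.0) * value for name, value in features.items()
    }
    return sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)


def compare_items(
    weights: dict[str, float],
    item_a: dict[str, float],
    item_b: dict[str, float],
) -> list[tuple[str, float]]:
    """Contribution difference (item_a minus item_b) for each feature."""
    names = set(item_a) | set(item_b)
    diffs = {
        name: weights.get(name, 0.0) * (item_a.get(name, 0.0) - item_b.get(name, 0.0))
        for name in names
    }
    return sorted(diffs.items(), key=lambda kv: abs(kv[1]), reverse=True)
```

For instance, calling `compare_items(weights, top_item, lower_item)` with two feature dictionaries lists the features that most explain why one item outranks the other, which is the kind of answer a product team typically asks for.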
Recently, an industry trend seems to suggest that distributed machine learning algorithms should be used by default. If you don't use them, it must be because your data is not big enough, right? Well, I don't think that's the case. In fact, most of what people need to do in machine learning applications can be done on a single machine. Of course, there are some notable exceptions, for example if you are building a large-scale deep neural network to recognize cats. But most of us don't need to build that kind of system.
Of course, to make everything work on one machine you need to know about other techniques. For example, you need to understand the benefits of (smart) data sampling, how to use offline processing, and how to parallelize computation on a single machine. I discuss these in more detail elsewhere.
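One common way to sample data that does not fit in memory is reservoir sampling, which keeps a uniform random sample of fixed size while reading the data once. The sketch below is a generic illustration of that technique (Algorithm R), not a method taken from this article; the function name and the `seed` parameter are assumptions.

```python
import random
from typing import Iterable, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: int = 0) -> list[T]:
    """Return a uniform random sample of up to k items from a stream in one pass."""
    rng = random.Random(seed)  # seeded so that experiments are repeatable
    sample: list[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample
```

Memory use depends only on `k`, so a model can be trained on a manageable sample even when the full log is far larger than a single machine's RAM.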
In addition, I think that platforms such as Spark and Hadoop, which offer an "easy" way to use very complex distributed processing, are dangerous in a way. In particular, if you care about cost or latency, it is a good idea to keep those transparent and easy to understand.
Here's an interesting example from Quora that illustrates some of these problems. At some point, we realized that one of our Spark implementations was very inefficient. It took 15 machines 6 hours to run a job, and a back-of-the-envelope calculation told us it should take far less than that. It took one of our engineers 4 days to analyze it, looking, for example, at how the Spark scheduler launched the queries. The resulting C++ implementation now runs on a single machine and completes the computation in only 10 minutes!
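To put those numbers side by side, the short calculation below compares total machine time and wall-clock time for the two runs. It uses only the figures quoted above and assumes, for simplicity, that the machines are comparable.

```python
# Figures quoted in the text.
spark_machines, spark_minutes = 15, 6 * 60
cpp_machines, cpp_minutes = 1, 10

spark_machine_minutes = spark_machines * spark_minutes  # 5400
cpp_machine_minutes = cpp_machines * cpp_minutes        # 10

machine_time_ratio = spark_machine_minutes // cpp_machine_minutes
wall_clock_ratio = spark_minutes // cpp_minutes

print(f"machine time: {machine_time_ratio}x less")  # 540x
print(f"wall clock:   {wall_clock_ratio}x faster")  # 36x
```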
We've all heard and read answers to the question of what a data scientist is. Most of them describe how data scientists combine mathematics, software, and domain expertise.
A different question is how the data science team fits into the organization. Many companies have struggled, or are still struggling, with this. Most of them agree that it is very important to have strong data scientists who can extract value and knowledge from data. However, no matter what some people say, strong data scientists who also have solid engineering skills are unicorns, and finding them is not easy. This often leads to a situation in which data scientists have to rely on engineers to take things to production, while engineers, who already have enough to do, do not want to spend their time productionizing other people's ideas.
So, how do we solve this? My advice is to think of a typical machine learning project as a funnel with 3 distinct stages.
The first part of the funnel is data research. Here, the team explores the data and tries to understand the problem in order to formulate hypotheses. Do users click the red button more on Friday nights than on any other day? Do users prefer fresh content even if its quality may be lower? How do we handle the trade-off between cold-starting new content and showing mature content?
In the second part of the funnel, once a hypothesis has been formulated, we need to implement a machine learning solution. This includes model selection, feature engineering, and implementing the solution in production. It covers the initial version of the solution as well as later iterations that optimize and improve the existing system.
The third and last part of the funnel focuses on running online experiments (A/B tests) and analyzing the results. This is how we learn how the solution really performs and confirm our initial hypothesis.
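The article does not say how the experiment results are analyzed. As one common possibility, the sketch below applies a two-proportion z-test to the conversion counts of a control group (A) and a treatment group (B). The function name and parameters are hypothetical, and the test assumes large, independent samples.

```python
import math


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in conversion rate, B minus A."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both groups convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; rates are all 0 or all 1")
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

A call such as `two_proportion_z_test(100, 1000, 150, 1000)` returns a positive z when B converts better than A, together with a p-value that can be compared against a threshold chosen before the experiment starts.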
Note that none of these three stages should take a long time; they should be run as quickly as possible as part of an iterative, agile process.
Based on what I have seen, heard, and experienced, I suggest that data scientists be responsible for the first and third parts and engineers for the second. Having data scientists own parts one and three makes iteration faster, because experiment results feed quickly back into new hypotheses. At Quora, both data scientists and engineers are involved in every project and work closely together, while each still has clear areas of responsibility. See William Chen's answer to "What is the difference between a Quora machine learning engineer and a data scientist?" for more details.
Finally, it is worth noting that to make a machine learning engineering team highly effective, you may need to broaden the definition of a machine learning engineer. Again, it is hard to find a large group of engineers who are excellent at both machine learning and software engineering (it is hard to put together a team of 11 star players). A good machine learning engineering team combines machine learning experts who can code with expert software engineers.
Many issues come up when implementing practical machine learning solutions. Some of them are quite different from what you may have read about, and even from common practice. If I had to summarize these 10 new lessons along a few dimensions, I would stress the following points:
I hope these suggestions are useful, but I also acknowledge that some of them are open to debate. I am very happy to hear different views and approaches, so please share them in the comments section.