Some thoughts on machine learning in production

Translated October 29, 2017 16:46:35

Original: Machine Learning in Production
Author: Szilard Pafka
Translator: Teixeira10

In this article, the author covers several aspects of putting machine learning into production, including model training, model evaluation, and model deployment, for readers to learn from and discuss.
The translation follows:

This article discusses how to put machine learning into production (system components, processes, challenges, pitfalls, etc.). Related blog posts, papers, and talks on best practices for machine learning in production will also be linked. Everyone is welcome to discuss these questions on GitHub.

Initial thoughts and outline

The following figure summarizes the components and processes involved:

Historical data

It can live in a database, CSV files, a data warehouse, or HDFS.

Feature engineering

In typical structured / tabular business data, feature engineering (FE) can involve joins and aggregates (for example, the number of clicks by a given user in a given time period).
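As a rough illustration of such an aggregate feature, the sketch below counts clicks per user inside a time window using only the Python standard library. The event layout (`user_id`, `timestamp` pairs) and the function name `clicks_per_user` are assumptions made for this example, not part of the original article; in practice this step would more likely be a SQL `GROUP BY` in an analytical database.

```python
from collections import Counter
from datetime import datetime


def clicks_per_user(events, start, end):
    """Count click events per user with start <= timestamp < end."""
    counts = Counter()
    for user_id, timestamp in events:
        if start <= timestamp < end:
            counts[user_id] += 1
    return dict(counts)


# Hypothetical click log: (user_id, timestamp) pairs.
events = [
    ("u1", datetime(2017, 10, 1, 9, 0)),
    ("u1", datetime(2017, 10, 1, 9, 5)),
    ("u2", datetime(2017, 10, 1, 10, 0)),
    ("u1", datetime(2017, 10, 3, 8, 0)),  # outside the window used below
]

# One feature value per user for the window [Oct 1, Oct 2).
features = clicks_per_user(events, datetime(2017, 10, 1), datetime(2017, 10, 2))
# features == {'u1': 2, 'u2': 1}
```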

This "ETL" is heavy processing and is not suitable for operational systems (such as MySQL); it is usually run in an analytical database (Vertica, Redshift) or in Spark.

Finding good features takes trial and error, iteration, research, exploration, and persistence (as does the whole upper half of the figure above: FE, model training, and evaluation).

Categorical variables: some modeling tools require them to be converted to numbers (for example, with one-hot encoding).
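For readers unfamiliar with one-hot encoding (one 0/1 indicator column per category), here is a minimal pure-Python sketch. The function name `one_hot_encode` is made up for this example; real projects would normally use the encoder that ships with their modeling tool.

```python
def one_hot_encode(values):
    """Return (categories, rows); each row is a list of 0/1 indicators."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # exactly one indicator is set per row
        rows.append(row)
    return categories, rows


categories, rows = one_hot_encode(["red", "green", "red", "blue"])
# categories == ['blue', 'green', 'red']
# rows == [[0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```

A production version would also need a rule for categories that appear at scoring time but were never seen during training.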

Training and tuning

The result of feature engineering is a "data matrix" with features and labels (in the case of supervised learning).

This data is usually smaller and most often does not require a distributed system.

The algorithms with the best performance are usually gradient boosting (GBM), random forests, neural networks (including deep learning), and support vector machines (SVM).

In some cases (sparse data, or when model interpretability is required), a linear model (such as logistic regression) must be used.

There are good open source tools for all of this (R packages, Python sklearn, xgboost, VW, H2O, etc.).

Overfitting must be avoided (by using regularization, etc.).

Training also requires unbiased evaluation; see the next section.

Models can be tuned by searching the hyperparameter space (grid or random search, Bayesian optimization methods, etc.).
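A minimal sketch of random search follows, under stated assumptions: `random_search` and `toy_validation_score` are names invented for this example, and the toy score function merely stands in for "train a model with this configuration and score it on a validation set". It is not tied to any particular library.

```python
import random


def random_search(evaluate, space, n_trials, seed=0):
    """Sample n_trials configurations from `space` and keep the best one.

    `evaluate` maps a configuration dict to a validation score (higher is better).
    `space` maps each hyperparameter name to a list of candidate values.
    """
    rng = random.Random(seed)  # seeded so the search is reproducible
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = {name: rng.choice(candidates) for name, candidates in space.items()}
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


def toy_validation_score(config):
    """Hypothetical stand-in for training a model and scoring it on validation data."""
    return -(config["learning_rate"] - 0.1) ** 2 - (config["max_depth"] - 6) ** 2


space = {"learning_rate": [0.01, 0.03, 0.1, 0.3], "max_depth": [2, 4, 6, 8, 10]}
best_config, best_score = random_search(toy_validation_score, space, n_trials=20)
```

Grid search would replace the sampling loop with an exhaustive loop over all combinations; random search is often preferred when only a few of the hyperparameters matter.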

Performance can be improved further by ensembling several models (averaging, stacking, etc.), at the cost of more complexity when deploying such models.

Model evaluation

This is extremely important; spend a lot of time here.

Unbiased evaluation is done with a test set or cross validation (some algorithms with "early stopping" also need a validation set).

If you did hyperparameter tuning, you need an extra, separate validation set (or cross validation).

The real world is non-stationary, so use a time-gapped test set.
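One reading of a time-gapped test set is that the test data comes from a later period than the training data, with a gap in between that mimics the delay between training a model and using it. The sketch below implements that reading; the function name `time_gapped_split`, the row layout, and the choice to drop the rows that fall inside the gap are assumptions of this example.

```python
from datetime import date, timedelta


def time_gapped_split(rows, train_end, gap_days):
    """Split (day, record) rows into train and test sets separated by a gap.

    Train gets rows with day <= train_end; test gets rows with
    day > train_end + gap_days; rows inside the gap are dropped.
    """
    gap_end = train_end + timedelta(days=gap_days)
    train = [row for row in rows if row[0] <= train_end]
    test = [row for row in rows if row[0] > gap_end]
    return train, test


# Ten daily rows: January 1 to January 10, 2017, with a dummy record i.
rows = [(date(2017, 1, 1) + timedelta(days=i), i) for i in range(10)]
train, test = time_gapped_split(rows, train_end=date(2017, 1, 5), gap_days=2)
# train covers January 1-5, the gap drops January 6-7, test covers January 8-10
```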

Diagnostics: distribution of predicted probabilities, ROC curves, and so on.
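As one example of such a diagnostic, the area under the ROC curve (AUC) can be computed directly from its pairwise interpretation: the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below is a plain O(n²) illustration with an invented function name `roc_auc`, not a replacement for a library implementation.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)  # 3 of the 4 positive/negative pairs are ordered correctly: 0.75
```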

Also evaluate using relevant business metrics (the impact of the model in business terms).

Model deployment

Scoring live data

This is usually considered an "engineering" task (it marks the boundary between data scientists and software engineers).

Deploy the model with the same tool that trained it. Do not rewrite it in another language or tool (SQL, PMML, Java, C++, or a custom JSON-like format) unless the export is produced by that same tool. (The risk of subtle bugs in edge cases is very high.)

Scoring runs on different servers, with different CPU / RAM needs than training (scoring requires low latency, high availability, and scalability).

Live data comes from different systems, so FE usually has to be replicated (duplicated code is evil, but here it is unavoidable); transformations and cleaning already applied to the historical data may need to be duplicated here as well.

Scoring can be done in batch (simpler: read from the database, score, and write the results back to the database) or in real time. The modern approach to real-time scoring, which gives good separation of concerns, is an HTTP REST API.
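The batch path (read from the database, score, write back) might be sketched as follows. Everything specific here is an assumption for illustration: the in-memory SQLite database, the `customers` table and its columns, and the `score` function, which is only a placeholder for a call into the trained model.

```python
import sqlite3


def score(clicks):
    """Hypothetical stand-in for the trained model's scoring function."""
    return min(1.0, 0.1 * clicks)


def batch_score(conn):
    """Read unscored rows, score them, and write the results back."""
    rows = conn.execute(
        "SELECT id, clicks FROM customers WHERE score IS NULL"
    ).fetchall()
    updates = [(score(clicks), row_id) for row_id, clicks in rows]
    conn.executemany("UPDATE customers SET score = ? WHERE id = ?", updates)
    conn.commit()
    return len(updates)


# Illustrative in-memory database with two unscored customers.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, clicks INTEGER, score REAL)"
)
conn.executemany("INSERT INTO customers (id, clicks) VALUES (?, ?)", [(1, 3), (2, 20)])
n_scored = batch_score(conn)  # scores both rows; a second call would score none
```

A real-time variant would wrap the same `score` call behind an HTTP endpoint instead of a scheduled job.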

It is better if the data science team owns this part of the work as well.

Take action

The main goal of an ML system in a company is to deliver business value (customers, money, etc.).

This work may have to be done by the engineering team.

The model should be rolled out gradually and A/B tested.
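One common way to combine a gradual rollout with an A/B comparison is deterministic, hash-based bucketing, sketched below. This is an illustrative technique rather than something the original article prescribes; the function name `assign_variant`, the salt string, and the 100-bucket scheme are assumptions of this example.

```python
import hashlib


def assign_variant(user_id, rollout_percent, salt="ml-rollout"):
    """Deterministically assign a user to 'model' or 'control'.

    The user's bucket (0-99) is fixed by the hash, so raising
    rollout_percent only ever moves users from control to model.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return "model" if bucket < rollout_percent else "control"


users = [f"user-{i}" for i in range(1000)]
at_10 = {u for u in users if assign_variant(u, 10) == "model"}
at_50 = {u for u in users if assign_variant(u, 50) == "model"}
# Every user exposed at 10% is still exposed at 50%.
```

Because the assignment is stable, business metrics for the "model" and "control" groups can be compared over time while the exposed share is increased step by step.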

Evaluation and monitoring

A model may perform differently in production than in train/test evaluation (non-stationarity, changed conditions, wrong assumptions, bugs, etc.).

It is critical to evaluate the model after deployment.

Evaluate using both ML metrics (distribution of scores, etc.) and business metrics.

Evaluate after deployment and monitor continuously (dashboards and alerts) to detect external changes and breakages, and because models can degrade slowly over time.
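One way to monitor the distribution of scores is the population stability index (PSI), which compares the binned score distribution seen at training time with the one seen in production. PSI is a technique chosen for this example, not one named in the original article; the function name, the equal-width bins, and the `eps` floor for empty bins are likewise assumptions.

```python
import math


def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between two samples of scores in [0, 1], using equal-width bins."""

    def bin_fractions(scores):
        counts = [0] * n_bins
        for s in scores:
            counts[min(int(s * n_bins), n_bins - 1)] += 1
        # eps avoids log(0) and division by zero for empty bins
        return [max(c / len(scores), eps) for c in counts]

    e = bin_fractions(expected)
    a = bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


training_scores = [i / 100 for i in range(100)]          # spread over [0, 1)
production_scores = [0.5 + i / 200 for i in range(100)]  # shifted into [0.5, 1)
psi_same = population_stability_index(training_scores, training_scores)
psi_shifted = population_stability_index(training_scores, production_scores)
```

Identical distributions give a PSI of zero, and larger values indicate a larger shift. A commonly quoted rule of thumb treats values above roughly 0.25 as a significant shift, but any alert threshold should be validated against the system's own history.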

This should also be owned by the data science team (so that production behavior can be compared with the model developed offline).


The ML system creates a tight coupling, which is considered evil from an engineering point of view.

Some of these problems have been identified but not yet solved; we should still keep them in mind and reduce them as much as possible.

Some of the ideas about the framework are described here.

Examples of coupling: the data schema and FE, FE duplicated for scoring, and actions taken in the engineering and business domains.

ML needs to be "sold" to the business side (management and the business units in the application domain of each ML product).

Educate the business side about how ML works, and keep showing results to the business (reports, dashboards, alerts) on a continuous basis, which helps build trust and support.

Learning and improving

Iterate over all components and learn from experience in practice (for example, take ideas from the business, add new features to FE, and retrain models when performance degrades).

To iterate quickly, most of the above should be automated with tools such as RStudio + R Markdown / Jupyter notebooks, Git, Docker, and so on.
