Thoughts on Machine Learning in Production
Original: Machine Learning in Production
Author: Szilard Pafka
In this article, the author covers several aspects of putting machine learning into production, including model training, model evaluation, and model deployment, for readers to study and discuss.
The translation follows:
This article discusses how to put machine learning into production (system components, processes, challenges, pitfalls, etc.). Related blog posts and papers on best practices for machine learning in production will follow. Everyone is welcome to discuss these topics on GitHub.
Original ideas and outline
The following figure summarizes the components and processes involved:
Data
Data may live in databases, CSV files, a data warehouse, or HDFS.
For typical structured/tabular business data, features can include joins and aggregations (for example, the number of times a specific user clicked within a particular time window).
This "ETL"-style preprocessing is not well suited to operational systems (such as MySQL); it is usually done in analytical systems such as Vertica, Redshift, or Spark.
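As a minimal sketch of such an aggregation feature in pandas (the click log and column names here are hypothetical, for illustration only):

```python
import pandas as pd

# Hypothetical click log: one row per click event
clicks = pd.DataFrame({
    "user_id": [1, 1, 2, 1, 2],
    "ts": pd.to_datetime([
        "2023-01-01", "2023-01-02", "2023-01-03",
        "2023-01-20", "2023-01-21",
    ]),
})

# Aggregation feature: number of clicks per user within a time window
window = clicks[(clicks["ts"] >= "2023-01-01") & (clicks["ts"] < "2023-01-08")]
feature = window.groupby("user_id").size().rename("clicks_in_window")
```

In a real system this aggregation would typically run in the analytical database or Spark rather than in pandas, but the logic is the same.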
Good feature engineering requires trial and error / iteration / research / exploration / persistence (the whole upper half of the figure above: FE, model training, and evaluation).
Categorical variables: some modeling tools require them to be converted to numbers (for example, one-hot encoding).
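A minimal one-hot encoding sketch with pandas (the "device" column is a made-up example):

```python
import pandas as pd

# Toy data with one categorical column ("device" is a hypothetical feature)
df = pd.DataFrame({"device": ["mobile", "desktop", "mobile", "tablet"]})

# One-hot encode: each category becomes its own 0/1 indicator column
encoded = pd.get_dummies(df, columns=["device"])
```

scikit-learn's OneHotEncoder does the same job when the encoding must be fitted on training data and reapplied at scoring time.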
Training and tuning
The result of feature engineering is a "data matrix" with features and, in supervised learning, labels.
This data is usually much smaller, so a distributed system is usually not needed.
The best-performing algorithms are typically gradient boosting machines (GBM), random forests, neural networks (deep learning), and support vector machines (SVM).
In some cases (sparse data, need for model interpretability), a linear model such as logistic regression must be used.
There are good open-source tools (R packages, Python scikit-learn, xgboost, VW, H2O, etc.).
Overfitting must be avoided (using regularization, etc.).
This also requires an unbiased evaluation; see the next section.
Models can be tuned by searching the hyperparameter space (grid or random search, Bayesian optimization, etc.).
Ensembling multiple models (averaging, stacking, etc.) can improve performance further, at the cost of a more complex deployment.
This stage is very important, and a lot of time is typically spent here.
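A small sketch of random hyperparameter search with scikit-learn, here tuning a GBM on synthetic data (the parameter grid is illustrative, not a recommendation):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the feature-engineered data matrix
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Randomly sample hyperparameter combinations, scoring each by cross-validation
param_dist = {
    "n_estimators": [50, 100, 200],
    "max_depth": [2, 3, 4],
    "learning_rate": [0.03, 0.1, 0.3],
}
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_dist,
    n_iter=5,          # number of sampled combinations
    cv=3,              # 3-fold cross-validation for each
    random_state=0,
)
search.fit(X, y)
```

Grid search and Bayesian optimization plug into the same workflow; random search is just the cheapest starting point.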
Evaluation
An unbiased evaluation uses a held-out test set or cross-validation (algorithms with "early stopping" also need a validation set).
If hyperparameters are optimized, a separate validation set (or cross-validation) is needed for that as well.
The real world is non-stationary, so the test set should be separated from the training data by a time gap.
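A time-gap split can be sketched like this (dates and column names are hypothetical); the gap mimics the delay between training a model and scoring with it in production:

```python
import pandas as pd

# Hypothetical dataset with one event timestamp per row
df = pd.DataFrame({
    "ts": pd.date_range("2023-01-01", periods=10, freq="D"),
    "y": range(10),
})

# Train on the past, leave a gap, test on the future
train = df[df["ts"] < "2023-01-06"]
gap = df[(df["ts"] >= "2023-01-06") & (df["ts"] < "2023-01-08")]
test = df[df["ts"] >= "2023-01-08"]
```

A random split would leak future information into training and overstate the model's production performance.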
Diagnostics: probability distributions, ROC curves, and so on.
The model can also be evaluated against the relevant business metrics (its impact in business terms).
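A minimal diagnostics sketch with scikit-learn (labels and scores are toy values):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Toy true labels and predicted probabilities
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

auc = roc_auc_score(y_true, y_score)               # area under the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points on the curve
```

The (fpr, tpr) pairs are what get plotted as the ROC curve; the AUC summarizes it in one number.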
Real-time scoring
This is usually considered an "engineering" task (it marks the boundary between data scientists and software engineers).
Deploy the model with the same tool that trained it; avoid translating it into other languages or formats (SQL, PMML, Java, C++, custom JSON-like formats, etc.), because the risk of small bugs in edge cases is very high.
Scoring needs different servers with different CPU/RAM requirements, because it must provide low latency, high availability, and scalability.
Real-time data comes from different systems, so feature engineering usually has to be replicated for scoring (duplicated code is evil, but here it is hard to avoid); transformations and cleaning applied to the historical data may also need to be replicated.
Scoring can be done in batch mode (simpler: read records from a database, score them, write the results back) or in real time; the common approach now is a separate service exposed via an HTTP REST API (separation of concerns).
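One way to keep batch and real-time scoring consistent is to route both through the same scoring function. A stdlib-only sketch, with a hypothetical logistic model stand-in (the weights and field names are made up):

```python
import json
import math

# Stand-in for a trained model: made-up weights of a logistic model.
# In a real system this would be the model object produced by training.
WEIGHTS = {"age": 0.02, "clicks_in_window": 0.1}
BIAS = -1.0

def score(record):
    """Score one record; both batch jobs and a REST endpoint call this."""
    z = BIAS + sum(w * record.get(k, 0.0) for k, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))

# Batch mode: read records, score them, write results back
batch = [{"age": 30, "clicks_in_window": 5}, {"age": 50, "clicks_in_window": 0}]
results = [score(r) for r in batch]

# Real-time mode: an HTTP handler would decode the JSON body and call
# the very same function, so the two paths cannot drift apart
request_body = '{"age": 30, "clicks_in_window": 5}'
realtime_result = score(json.loads(request_body))
```

A real REST service would wrap `score` in an HTTP framework; the point of the sketch is that one function serves both modes.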
It is best if the data science team can own this part of the work as well.
The main goal of a company's ML system is to deliver business value (customers, revenue, etc.).
Some of this work may have to be done by the engineering team.
New models can be rolled out gradually with A/B testing.
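An A/B rollout is often judged with a two-proportion z-test; a sketch with made-up conversion counts (the numbers are purely illustrative):

```python
import math

# Hypothetical A/B results: conversions out of visitors
conv_a, n_a = 200, 5000   # control: current model
conv_b, n_b = 260, 5000   # treatment: new model

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled conversion rate
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se                              # two-proportion z statistic
significant = abs(z) > 1.96                       # 5% level, two-sided
```

If the difference is significant and positive, the new model's traffic share can be increased step by step.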
Evaluation and monitoring
In production the model may perform differently than during training and testing (non-stationarity, changed conditions, wrong assumptions, bugs, etc.).
It is critical to evaluate the model after deployment.
Evaluate with both ML metrics (score distributions, etc.) and business metrics.
Post-deployment evaluation and continuous monitoring (dashboards and alerts) are needed to detect external changes and breakages, as well as gradual model degradation over time.
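One common monitoring check on score distributions is the population stability index (PSI); a sketch on synthetic scores (the 0.1/0.25 thresholds are a common rule of thumb, not from the original article):

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population stability index between a reference (training-time)
    score distribution and a production score distribution."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range scores
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    e_frac = np.clip(e_frac, 1e-6, None)   # avoid log(0)
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(0)
train_scores = rng.normal(0.5, 0.1, 10_000)   # scores at training time
stable = rng.normal(0.5, 0.1, 10_000)         # production, no drift
drifted = rng.normal(0.6, 0.1, 10_000)        # production, shifted scores
```

By the rule of thumb, PSI below 0.1 suggests a stable population and above 0.25 suggests a significant shift worth an alert.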
This should also be owned by the data science team (just like offline model development).
ML systems create tight coupling, which is considered evil from an engineering point of view.
Although some of the problems discussed here have not been fully solved, we should keep them in mind and reduce them as much as possible.
Some ideas about frameworks for this are described here.
Examples of coupling: the FE data schema, the replication of FE for scoring, and dependencies across many engineering and business areas.
ML needs to be "sold" to various business stakeholders (management and the business units in each ML product's application domain).
Involving the business in the internals of ML and continuously delivering business-facing artifacts (reports, dashboards, alerts, etc.) helps build trust and support.
Learning and improving
Iterate on all components and learn from production experience (for example, take new ideas from the business side, add new features in FE, and retrain the model if performance degrades).
To iterate quickly, most of the above should be automated with tools such as RStudio + R Markdown / Jupyter notebooks, git, Docker, and so on.