This article from: company data
Machine learning model to support large dimensions, its data platform with Hong Kong University of science and technology to develop a distributed computing framework for machine learning–Angel 1.0.
Angel is a proprietary machine learning system developed using Java language, users can use the Spark, MapReduce, use it to complete a machine learning model training. Angel has supported the SGD, ADMM optimization algorithm, and we also provide some commonly used machine learning model; however, if the user has custom needs, optimization algorithms can also be provided in our top package model with relative ease.
Angel Chukonu as a network solution of the Hong Kong University of science and technology in high dimension parameters of machine learning in the process of updating, targeted to lag computing task parameter speed, on the whole, shorten the time of machine learning algorithms. This innovative use of the Hong Kong University of science and technology Professor Chen and his research team developed the perceived upper-level application (Application-aware) network optimization solutions, as well as the large-scale machine learning research programme led by Professor Yang Qiang.
In addition, Bin Cui, Peking University Professor and his students also participate in the Angel project research and development.
In the actual production, Angel in the tens of millions of levels feature latitude conditions SGD performance Spark several times is a mature open source systems to dozens of times times. Angel has been in Tencent video testimonials, Canton stop precision recommended business practice, and we are currently expanding the application range of Tencent, the goal is to support enterprise-level large-scale machine learning tasks such as Tencent.
The overall structure
Angel in the overall schema reference to Google DistBelief. DistBeilef was originally designed for deep learning, it uses the parameter server to address the huge model update problem in training. Parameter Server also can be used in non-deep learning in machine learning models, such as the SGD, ADMM, LBFGS optimization algorithm on each iteration in the face of billions of update scenario, the need to expand performance parameters distributed caching. Angel supported BSP during operation, SSP, ASP three calculation model, which SSP is validated by the EricXing in the Petuum project at Carnegie Mellon University model in machine learning that particular operations scenario to enhance reduce the convergence time. System has five roles: FENDI plus case
Master: responsible for the application and allocation of resources, and task management.
Task: responsible for task execution, in the form of thread.
Worker: independent process runs on the Yarn in the Container, is a container Task execution.
ParameterServer: built with the start of a task, the task ends and the destruction of, is responsible for the update task parameters in the training process and storage.
WorkerGroup is a virtual concept, made up of several Worker, the metadata maintained by the Master. Considering for the parallel development of the model, all running Worker training in a WorkerGroup data is the same. Although we provide some general models, but does not guarantee meeting demand, and user-defined model can achieve the common interface, formally equivalent to MapReduce or Spark.
1) user friendly
1. automated data segmentation: Angel system provides users with a capability to automatically cut the training data, user-friendly data parallel computing: system default Hadoop FS interface is compatible, the original stored training samples in support of the Hadoop distributed file system FS interface such as HDFS, and Tachyon.
2. rich data management: sample data is stored in a distributed file system, system to be read from the file system into calculation calculation process is cached in memory to speed up iterations; if the in-memory cache of data is temporarily saved to the local disk without communications to the distributed file system again.
3. the rich library of linear algebra and optimization algorithms: Angel also provided an efficient vector and matrix computation library (sparse/dense), and convenience to users of free choice of data, parameters, forms of expression. In terms of optimization algorithms, Angel has realized the SGD, ADMM; models support the Latent DirichletAllocation (LDA), MatrixFactorization (MF), LogisticRegression (LR), Support Vector Machine (SVM).
4. optional model: we mentioned in the review, Angel of the BSP,SSP,ASP model parameter server can support.
5. more granular fault-tolerant: fault-tolerance in the system is divided into Master of fault tolerance parameter server disaster recovery, snapshot cache within the Worker process parameters, fault-tolerant RPC calls.
6. friendly task to run and monitor: Angel also has friendly run support tasks run patterns based on Yarn. Meanwhile, Angel’s progress also facilitates the user to view the Web App page clusters.
2) parameter to the server
In actual of production environment in the, can intuitive of feel to Spark of Driver single points update parameter and broadcast of bottleneck, although can through linear expand to reduced calculation Shi of took, but brings has convergence sex declined of problem, while more serious of is in data parallel of operation process in the, due to each Executor are keep a full of parameter snapshot, linear expand brings has n x parameter snapshot of flow, and this flow concentrated to has Driver a a node Shang!
Seen from the diagram, task in machine learning and Spark even more machine resources are not used, the machine only under certain smaller scale can bring out the best performance, but the best performance is also not ideal.
Using parameter-server scenario, Spark and we made the following comparison: there are 50 million on a data set of training samples using SGD solution of logistic regression models, with 10 nodes (Worker), for different dimensions of each iteration of each characteristic and comparison of overall convergence time (Angel using the BSP model here).
Visible through the data, model Angel contrast Spark advantages are more apparent.
3) memory optimization
During operation to reduce memory consumption and improve operational convergence within the single process uses asynchronous lock-free Hogwild! Pattern. N Task with in the course of an operation if the parameters in the operation remain a separate snapshot memory overhead for parameters n times, model dimensions are consumed when the more obvious! Optimization algorithm of SGD, the actual scene, the vast majority of cases are sparse training data, so update conflicts greatly reduces the probability of, even if the conflict in the gradient is also not entirely to the poor direction, after all, are moving in the direction of the gradient descent updates. We use the Hogwild! mode, so that more than one Task within a process share the same parameter to the snapshot, reduce memory consumption and improve the rate of convergence.
4) network optimization
We have two main point of optimization:
1) process Task parameters after the operation smooth merger pushed the update to update the parameters of the server, which reduces consumption upstream of the machine where the Task, have cut down consumption parameters the server while reducing the push update bottleneck during peak times;
2) further network optimized for SSP: because SSP is a semi-synchronous operation coordination mechanism, in a limited window to run a train, when it reached the edge of the node, node tasks have to be stopped to wait for the slow update parameters. To solve this problem, we flow through the network to speed up the slow process of redistribution of the nodes. We give the slower node with higher bandwidth; accordingly, fast working nodes less bandwidth. In this way, fastest node number of iterations and the slowest node to control the gap, reducing the window be broken (waiting for) the probability, which reduced the working nodes idle wait time due to SSP window.
As shown in the following figure, in 100 million 30 rounds result dimensions, iteration test, you can see the Chukonu of idle waiting for the cumulative time is greatly reduced, was 3.79 times times.
Figure below shows before and after optimization of execution time, with 50 million model dimensions, for example, 20 server nodes and 10 parameters, Staleness=5, the implementation of 30 rounds of iteration. As can be seen, open Chukonu average after completion of the round in 7.97 seconds, compared to the original tasks than average 15% increase was 9.2 seconds per round. FENDI plus case
In addition, the accelerated target node can slow the slow nodes are more likely to get the most recent parameter calculation model for comparing the original SSP, convergence has improved. As shown in the following figure, also is targeting 50 million dimension model under the SSP results evaluation, native Angel task after 30 rounds of iterative (276) loss reached 0.0697, opened after Chukonu, in the 19th round of the iteration (145) have reached a lower loss. From this particular scene is a close 90% of convergence speed boost.
Next, the team will expand the size of the application, at the same time, the project team has continued to develop the next version of Angel, next version will be further optimized in parallel model. In addition, the project team is planning to open source Angel, we will follow the right time to open.
This data by Tencent exclusive reprinted, for reprint, please contact the original author.