Co-edited by Tsung-jen and Huang Xin
This summer, Lei Feng's network (search for "Leiphone" on WeChat) will hold an unprecedented Global Summit on Artificial Intelligence and Robotics (CCF-GAIR) in Shenzhen. Heads of the AI laboratories of giants such as Google DeepMind, Uber, and Microsoft will visit Shenzhen and show us, at close range, how AI is stirring the world abroad. If you don't want to miss this carnival, please click the link at the end of this article to purchase early-bird tickets.
At this year's CVPR 2016, deep learning had almost become the standard for computer vision research: papers on face recognition, image recognition, video recognition, pedestrian detection, and large-scale scene recognition all used deep learning methods. Add the push from big companies such as Google and Facebook, and many people wonder: why has deep learning so thoroughly crushed the other AI approaches?
For this installment of the Hard Innovation Open Class we invited Cao Xudong, Executive R&D Director at SenseTime, who has just returned from CVPR 2016, to explain why deep learning has become the near-standard in computer vision research, and to discuss the current state and future trends of CV and deep learning.
Cao Xudong is Executive R&D Director at SenseTime and a deep learning expert. He graduated from Tsinghua University and was formerly a researcher at Microsoft Research, responsible for the face algorithms behind well-known products such as Microsoft Xbox and How-Old.net, the latter a phenomenon with hundreds of millions of users. He has published more than 10 papers at the top computer vision conferences CVPR/ICCV/ECCV, several of which received the honor of oral presentation at ICCV and CVPR (top 5% of submissions).
SenseTime's Cao Xudong: why has deep learning become almost the standard in computer vision research? | Hard Innovation Open Class
Object detection based on deep learning. Q: What are the advantages and disadvantages of current deep-learning-based object detection?
First, a brief comparison of traditional object detection methods and deep-learning-based object detection.
Traditional methods use a sliding-window framework: an image is broken down into millions of sub-windows at different positions and scales, and a classifier decides whether each window contains an object. Traditional methods tend to design different features and classifiers for different object classes: the classic face detection algorithm is Haar features + AdaBoost; the classic pedestrian detection method is HOG (histogram of oriented gradients) + SVM; classic general object detection combines HOG features with the DPM (deformable part model) algorithm.
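To make the sliding-window framework concrete, here is a minimal NumPy sketch. The brightness-based classifier is a stand-in for a real learned one such as Haar+AdaBoost; the single scale, window size, stride, and toy image are illustrative choices, not details from the interview.

```python
import numpy as np

def sliding_windows(image, win=32, stride=16):
    """Enumerate fixed-size windows over an image (one scale only)."""
    h, w = image.shape[:2]
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            yield (x, y, win, win), image[y:y + win, x:x + win]

def detect(image, classifier, win=32, stride=16):
    """Keep the boxes whose window the classifier scores as positive."""
    return [box for box, patch in sliding_windows(image, win, stride)
            if classifier(patch)]

# Toy "object": a bright square; toy classifier: mean brightness test.
img = np.zeros((64, 64))
img[0:48, 0:48] = 1.0
boxes = detect(img, lambda p: p.mean() > 0.9)
```

A real detector would repeat this over an image pyramid (many scales) and apply non-maximum suppression to the surviving boxes.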
The deep-learning-based object detection algorithms are the R-CNN series: R-CNN, Fast R-CNN (Ross Girshick), and Faster R-CNN (Shaoqing Ren, Kaiming He, Jian Sun, and Ross Girshick). Their three core ideas are: use a CNN model to classify candidate regions more accurately; share the feature map across candidate regions to speed up training and detection; and generate the candidate regions themselves from the shared feature map to raise speed further. Deep-learning-based detection can still be seen as massive sliding windows, only implemented with convolution.
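The feature-sharing idea behind Fast/Faster R-CNN can be illustrated with a toy example: run an expensive feature extractor once over the whole image, then crop per-proposal features from the shared map instead of recomputing features per window. The box-filter "backbone" and the proposal boxes below are hypothetical stand-ins for a real CNN and a real proposal mechanism.

```python
import numpy as np

def conv_features(image):
    """Stand-in for an expensive CNN backbone: a 3x3 box filter."""
    h, w = image.shape
    out = np.zeros((h - 2, w - 2))
    for y in range(h - 2):
        for x in range(w - 2):
            out[y, x] = image[y:y + 3, x:x + 3].mean()
    return out

# Fast/Faster R-CNN idea: compute the feature map ONCE, then crop per
# proposal, instead of re-running the backbone on every window (R-CNN).
image = np.random.rand(32, 32)
shared = conv_features(image)                 # computed once
proposals = [(0, 0, 10, 10), (5, 5, 15, 15)]  # (y0, x0, y1, x1) boxes
roi_feats = [shared[y0:y1, x0:x1] for y0, x0, y1, x1 in proposals]
```

Cropping from the shared map gives the same values as running the backbone on the corresponding image crop, which is why sharing is a pure speedup for this kind of translation-invariant feature.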
The R-CNN series still splits object detection into two stages. There are also end-to-end detectors, for example YOLO (You Only Look Once: Unified, Real-Time Object Detection) and SSD (Single Shot MultiBox Detector). Both are claimed to reach accuracy similar to Faster R-CNN while running faster. In object detection, positive and negative samples are extremely imbalanced, and a two-stage cascade copes with that imbalance better; whether end-to-end learning can surpass Faster R-CNN still needs more study.
Why deep learning became the standard in CV. Q: Regarding deep learning becoming almost the standard in this year's computer vision research, Nikos Paragios, a researcher at Inria in France, expressed concern on LinkedIn that the field seems to be growing too monolithic. What do you think about this?
First, why deep learning has become the standard method in computer vision.
First and most important: deep learning achieves accuracy that conventional methods cannot reach. This is the key of keys; if this is the 1, the other advantages are the 0s that follow it. The deep learning revolution happened in 2011-2012: in 2011 there was a major breakthrough in speech recognition, and in 2012 a major breakthrough in image recognition. The revolution lifted many computer vision applications to a practical level and gave birth to a large number of industrial applications. This is why before 2011 computer vision and AI PhDs struggled to find jobs, while after 2012 they became sought-after hires that many companies pay well for.
Beyond accuracy, deep learning's rise to standard status has other advantages behind it.
First, deep learning algorithms are very general. Take detection: traditional methods required a custom algorithm for each kind of object. By comparison, deep-learning-based algorithms are far more generic; Faster R-CNN, for example, achieves very good results on face, pedestrian, and general object detection alike.
Second, deep learning features transfer well. Feature transferability means that features learned on task A also give very good results on task B. For example, features learned on ImageNet (an object-centric dataset) also achieve very good results on scene classification tasks.
Third, the engineering cost of development, optimization, and maintenance is low. Deep learning computation is essentially convolution and matrix multiplication; optimize those two operations and every deep learning algorithm gets faster. Moreover, by combining existing layers we can implement many complex network structures and algorithms, so development and maintenance costs stay low. By contrast, developing and maintaining algorithms such as Boosting or Random Forests is a painful business.
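As a sketch of why optimizing matrix multiplication speeds up all of deep learning: a 2-D convolution (strictly, cross-correlation) can be rewritten as a single matrix multiply via the classic im2col trick. This is an illustrative NumPy version, not any particular framework's implementation.

```python
import numpy as np

def im2col(a, k):
    """Unfold each k*k patch of a 2-D array into a row (stride 1, no pad)."""
    h, w = a.shape
    rows = [a[y:y + k, x:x + k].ravel()
            for y in range(h - k + 1) for x in range(w - k + 1)]
    return np.array(rows)

def conv2d_as_matmul(a, kernel):
    """2-D convolution (cross-correlation) as one matrix multiply."""
    k = kernel.shape[0]
    h, w = a.shape
    out = im2col(a, k) @ kernel.ravel()   # the whole conv is one matmul
    return out.reshape(h - k + 1, w - k + 1)

x = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((3, 3)) / 9.0            # 3x3 box (mean) filter
y = conv2d_as_matmul(x, kernel)           # 2x2 output
```

Because the heavy lifting lands in one `@` call, any optimization of the matrix-multiply kernel (BLAS, GPU, etc.) benefits every convolutional model for free.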
Next, the question of whether deep learning is too simple.
Saying deep learning is "too simple" is, I think, inaccurate. It is like saying a universe that contains everything is too simple.
Put simply, machine learning is a mapping from input to output. Traditional methods use a simple, shallow mapping; deep learning composes many layers of mappings. Deep learning has enormous freedom: there are many choices of learning objective and learning method, countless possible ways to connect layer to layer, and no restriction on the concrete form of each layer's mapping, whether convolutional, fully connected, or something else. In fact, beyond convolution and full connection, other mapping forms work too; for example, a piece of work from Microsoft Research at last year's ICCV used Random Forests as a new mapping form.
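A minimal NumPy sketch of the "composite mapping" view: each layer is one mapping (here an affine transform plus ReLU, just one of the many possible choices mentioned above), and a deep model is simply their composition. The sizes and number of layers are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, W, b):
    """One learned mapping: affine transform followed by a ReLU."""
    return np.maximum(0.0, x @ W + b)

# A "shallow" model is one mapping; a deep model composes several.
x = rng.standard_normal((5, 8))                 # batch of 5 inputs, 8-dim
Ws = [rng.standard_normal((8, 8)) for _ in range(3)]
bs = [np.zeros(8) for _ in range(3)]

h = x
for W, b in zip(Ws, bs):                        # deep net = composed mappings
    h = layer(h, W, b)
```

Swapping `layer` for a convolution, a Random Forest, or anything differentiable leaves the composition idea unchanged, which is the point being made.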
Deep learning's technology tree. Q: The four papers SenseTime submitted to CVPR 2016 focus on object segmentation, clothing recognition and search, identification and localization, and joint training of cascaded convolutional neural networks for face detection. What is the significance of these four directions, and how do they relate to your current business focus?
The deep learning technology framework is a tree-like structure.
Training platforms such as TensorFlow and Caffe are the root. Deep learning is still in an experimental phase, so experimental efficiency largely determines research efficiency; a good training platform can shorten an experiment cycle from a month to a day, which is vital for deep learning R&D.
Models are the trunk. After the concept of deep learning was proposed in 2006, academia took six years to realise that model structure is the key to deep learning. Representative results include AlexNet, VGGNet, GoogleNet, ResNet, and so on. Academia mainly researches how to make models more accurate; in industry, we also have to consider how to make models faster and smaller.
The trunk has several main branches corresponding to the core tasks of computer vision: detection, recognition, segmentation, feature point localization, and sequence learning. Almost any concrete computer vision application can be composed from these five tasks. Take face recognition: the full pipeline involves face detection, facial feature point localization, feature extraction, and verification, which covers three of the five tasks: detection, feature point localization, and recognition.
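The face recognition pipeline just described can be sketched as a composition of the core tasks. Every stage below is a hypothetical placeholder (the "detector" returns the whole image, the "feature" is just a normalized pixel vector); only the structure of the pipeline reflects the text.

```python
import numpy as np

def detect_face(image):            # detection -> a bounding box
    return (0, 0, image.shape[0], image.shape[1])

def locate_landmarks(image, box):  # feature point localization
    y0, x0, y1, x1 = box
    return [((y0 + y1) // 2, (x0 + x1) // 2)]   # e.g. a nose tip

def extract_features(image, landmarks):         # recognition features
    v = image.ravel().astype(float)
    return v / (np.linalg.norm(v) + 1e-8)       # unit-length embedding

def verify(feat_a, feat_b, threshold=0.8):      # cosine-similarity check
    return float(feat_a @ feat_b) > threshold

img = np.random.rand(16, 16)
feat = extract_features(img, locate_landmarks(img, detect_face(img)))
same = verify(feat, feat)   # comparing an image with itself
```

Each placeholder would be replaced by a trained model in practice; the composition of detection, localization, and recognition stages is what the "branches" metaphor describes.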
We put a great deal of research into the five main directions mentioned above: on one hand to stay at the academic frontier, and on the other we have developed, in parallel with academia, a set of methods for our most important applications that achieve roughly 10x speedups and hundreds-fold model compression while maintaining good accuracy. The four papers in this question are research results on those five core tasks. Beyond the published results, we have more practical results in industry: our face detection matches the best academic results while running at 300 FPS, and our facial feature point localization matches the best academic results at 3000 FPS. I have not seen this kind of performance in any published academic paper.
Q: The object segmentation paper (first author Shi Jianping) mainly addresses instance segmentation (also known as Simultaneous Detection and Segmentation), which has recently become a hot topic. It merges object detection and semantic segmentation into one problem: compared with detection it needs more precise object boundary information, and compared with semantic segmentation it needs to distinguish different individual objects. Detection is moving from 2D toward 3D and 4D, and semantic segmentation has already been working on distinguishing individuals, so how does this differ from previous semantic segmentation? Is it semantic segmentation combined with semantic understanding of the scene?
There is a simple but very general principle in deep learning: the richer and more precise the supervision, the better the learning generally turns out.
A simple example: with a sufficient amount of data, if my image annotations are only "animal", "plant", and "scene", the learned model and features may be mediocre. But if the annotation is refined, say from the initial handful of categories down to 1000, splitting dogs into Dalmatians, pit bulls, and so on, and cats into Persians, tabbies, and so on, the learned model and features are usually better.
Another example is object detection: supervising with information beyond the bounding box usually gives better results. Annotating the positions of the eyes, nose, and mouth, the face angle, and attributes such as ethnicity and gender, and then training with a multi-task learning algorithm, usually yields better results.
Two representative papers are: Joint Cascade Face Detection and Alignment, and Facial Landmark Detection by Deep Multi-task Learning.
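A minimal sketch of the multi-task idea: one weighted objective summed over several annotation types. The task names, squared-error losses, and weights below are illustrative assumptions, not details taken from the cited papers.

```python
import numpy as np

def multitask_loss(preds, targets, weights):
    """Weighted sum of per-task losses: a common multi-task objective."""
    total = 0.0
    for name, w in weights.items():
        diff = preds[name] - targets[name]
        total += w * float(np.mean(diff ** 2))   # squared error per task
    return total

# Hypothetical tasks from the text: face box, landmarks, attributes.
preds   = {"box": np.array([0.1, 0.1, 0.9, 0.9]),
           "landmarks": np.zeros(10),
           "attributes": np.array([0.7])}
targets = {"box": np.array([0.0, 0.0, 1.0, 1.0]),
           "landmarks": np.zeros(10),
           "attributes": np.array([1.0])}
loss = multitask_loss(preds, targets,
                      {"box": 1.0, "landmarks": 1.0, "attributes": 0.5})
```

Training a shared network against this summed objective is what lets the auxiliary annotations (landmarks, attributes) improve the primary task (detection).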
Sometimes multiple annotations/tasks are parallel, and they can be learned in a Multi-Task Learning framework. In other cases the tasks are progressive: the result of an earlier task helps a later one, for example detecting each person first and then segmenting each body's mask. Making good use of this progressive relationship can give better results than treating the tasks as parallel, and that is in fact the central idea of instance segmentation. Unlike traditional semantic segmentation, which only needs to classify object categories without distinguishing individuals, instance segmentation must distinguish both the categories and the different individuals within the same category, so the network has to learn more information than for semantic segmentation. Dai Jifeng at Microsoft Research Asia has done very ground-breaking work on this, and the work of Shi Jianping, a senior researcher at SenseTime, is also very innovative: through multi-scale local-region fusion, it unifies category information and individual discrimination into end-to-end instance object segmentation.
Computer vision's "black technology". Q: There have been some recent "black technology" CV applications, such as MIT's machine that "watches TV" to predict human behavior, MIT's AI that dubs sound onto video, and Disney Research's AI that directly recognizes what is happening in a video. Are these a gimmick, or genuinely meaningful?
People doing deep learning have an ultimate pursuit. The current mode of deep learning is actually rather dumb: given data and the corresponding labels, say one picture labeled "cat" and another labeled "dog", we feed the data into a neural network to learn, and eventually achieve good results. This is called supervised learning. It is very effective, but it is not how humans learn. Deep learning researchers hope machines can become more intelligent and learn the way a person does.
After supervised learning achieved significant results, people put more energy into semi-supervised and unsupervised learning, which are closer to human learning. On one hand, we want a deeper understanding of the mechanisms of human vision and human intelligence. On the other hand, supervised learning needs a lot of data; if semi-supervised or unsupervised learning could bypass the heavy labeling problem while reaching the same accuracy, that would be very attractive to industry.
The "black technology" mentioned in the question is exploratory work toward human-like learning, and it makes perfect sense.
There is a lot of work moving in this direction. These works use unlabeled images or video. Although the data carry no labels, they contain internal structure: objects in video move according to specific rules, and an object in a picture has a specific structure. Using these structures, we can turn an unsupervised problem into a supervised one, and then learn with supervised methods.
There are two typical works. The first divides an image into a 2x2 or 3x3 grid of regions and, given any two regions, predicts their relative position, exploiting the inherent structure of objects and scenes, such as the sky being above the road and legs below the torso. The other uses video data to learn edges, mainly exploiting the fact that objects in video move substantially relative to the background.
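The first of those works (predicting the relative position of two image regions) can be sketched as follows: the "label" is manufactured from the image itself, which is exactly what turns an unsupervised problem into a supervised one. Patch size and the sampling details here are illustrative, not from the original paper.

```python
import numpy as np

rng = np.random.default_rng(0)

OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
           (0, 1), (1, -1), (1, 0), (1, 1)]   # 8 relative positions

def make_pair(image, patch=8):
    """Cut an anchor patch and one neighbour from a 3x3 grid; the label
    is the neighbour's relative position, free supervision from structure."""
    grid = 3 * patch
    y0 = rng.integers(0, image.shape[0] - grid + 1)
    x0 = rng.integers(0, image.shape[1] - grid + 1)
    label = rng.integers(0, len(OFFSETS))
    dy, dx = OFFSETS[label]
    anchor = image[y0 + patch:y0 + 2 * patch, x0 + patch:x0 + 2 * patch]
    ay, ax = y0 + (1 + dy) * patch, x0 + (1 + dx) * patch
    neighbour = image[ay:ay + patch, ax:ax + patch]
    return anchor, neighbour, int(label)

img = rng.random((32, 32))
anchor, neighbour, label = make_pair(img)
```

Pairs generated this way can then be fed to an ordinary supervised classifier; the features it learns transfer to other vision tasks, which is the value of the approach.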
In the longer term, exploring semi-supervised and unsupervised versions of the human learning process, and learning from multi-sensory input, is another trend in deep learning.
On the best papers. Q: Microsoft Research's Deep Residual Learning for Image Recognition won the best paper award at this CVPR 2016, and Stanford's Structural-RNN: Deep Learning on Spatio-Temporal Graphs won best student paper. What do you think of these two papers?
Kaiming He and Jian Sun's best paper can be understood in 10 minutes, and its results reproduced in a day, yet its effect on subsequent work will be long-lasting. In addition, Jian Sun's research style has been a big influence on me: problem-oriented, addressing important problems, doing real research work. That methodology is valuable not only in academia; it is even more important in industry.
Back to the paper itself. It addresses the problem that when a plain network goes deeper than about 20-30 layers, training and test loss stop decreasing and even gradually increase as layers are added, and it gives an effective solution. The method works, and there are many explanations for why. One is that the skip-layer connections can pass the loss back to the middle layers, solving the vanishing gradient problem. Another is that through skip-layers, ResNet can train much deeper models.
My own explanation is a bit more complex. In my view, without down-sampling, once a certain depth is reached the marginal learning capacity of additional convolution layers diminishes. When the network is too deep, the extra convolution layers can only learn to add noise and cause information loss, which drives training and test loss up. Skip layers adaptively adjust each layer's learning target, which solves this problem.
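The skip connection at the heart of ResNet is easy to state in code: the block computes y = F(x) + x, so it only has to learn a residual F. A toy NumPy version (two weight matrices stand in for the block's conv layers) also illustrates the point above: with zero weights the block reduces exactly to the identity, so added depth need not hurt, which plain stacked layers cannot guarantee.

```python
import numpy as np

def residual_block(x, W1, W2):
    """y = F(x) + x : the identity path lets gradients skip layers."""
    h = np.maximum(0.0, x @ W1)     # first layer + ReLU (simplified)
    return x + h @ W2               # add the skip connection

# With zero weights, F is zero and the block is exactly the identity.
x = np.random.rand(4, 8)
W_zero = np.zeros((8, 8))
y = residual_block(x, W_zero, W_zero)
```

A real residual block uses convolutions, batch normalization, and a second ReLU after the addition; this sketch keeps only the skip-connection structure being discussed.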
ResNet also has a lot of redundancy: removing some of the later layers of the 152-layer network barely changes its accuracy, as if those layers were skipped. Removing that redundancy while maintaining accuracy, to obtain a smaller, more economical network, would be very valuable.
Industrial landing: from academia to industry. Q: Papers broadly divide into basic research papers and papers proposing concrete solutions. What is the right attitude for industry toward each, for example, how long after publication is a paper ripe for practical use, and how can it be used most effectively?
Industry and academia now carry out research in parallel. Overall, industry is not behind academia, and academia is not behind industry; they simply focus on different things.
Deep learning research now iterates very, very fast, surprisingly fast. In other fields, academic exchange happens mainly through journal publication, with cycles of a year at the shortest and two or three years at the longest. In computer science, more results are published at conferences, with a cycle of about half a year. In deep learning specifically, people post results as preprints on arXiv the moment they have them, and the conference paper follows half a year later.
In companies commercializing the technology, many researchers have developed the habit of checking arXiv for the latest preprints every day; if a paper's idea is valuable or its results are outstanding, they immediately try to reproduce it and run some exploratory experiments.
So in the specific field of deep learning, I think the landing cycle of new technology is almost zero.
Q: CVPR's main conference has so many sessions. Which sections' content is most useful?
I think many of the CVPR sections are very interesting. As for which is most useful, from the practical point of view of industry, it is of course the detection and recognition sections.
Q: What was your biggest takeaway from attending this CVPR?
My biggest impression is that the Chinese computer vision community is genuinely formidable. Last year all the first prizes in the ImageNet contest at ICCV were won by Chinese teams, and at this CVPR there were again many excellent papers by Chinese authors; Kaiming He, Shaoqing Ren, Xiangyu Zhang, and Jian Sun won the best paper award. The research level of ethnic Chinese in computer vision keeps rising, which is very exciting. A little chicken soup: we missed the industrial revolution, China missed the electrical revolution, and we only followed in the information revolution. But in the artificial intelligence revolution we are running side by side with the world's leading countries. Riding the tide of the times in a great undertaking is often so exciting one cannot sleep at night.
In this installment of the Hard Innovation Open Class, Cao Xudong focused on the four features that give deep learning its dominance over other AI methods: high accuracy, algorithmic generality, good feature transferability, and a unified engineering framework. This can be read as the reason deep learning is now so popular in the AI community.
In addition, he identified the five core tasks of computer vision, detection, recognition, segmentation, feature point localization, and sequence learning, outlining a clearer map for building concrete computer vision applications.
Most striking, though, was his view of deep learning's extraordinary iteration speed: on the reasonable period for turning a paper into practice, he believes the landing cycle of new technology in this particular field should be zero. While Google, Facebook, and other large companies lead the industrialization boom from written paper to practice, that cycle is clearly a thrilling pace for the industry.