What will chips bring once they take flight on the wings of AI?

Since 2010, the growth of the big data industry has produced an explosion in data volume, and traditional computing architectures can no longer support the massively parallel computing that deep learning demands. The research community has therefore launched a new round of research, development, and applied work on AI chips. This emerging technology has opened the way for technology giants to upgrade and expand their businesses, and it also gives new ventures a chance to change the existing landscape.
The AI chip is one of the core technologies of the artificial intelligence era, and it determines the infrastructure and development ecosystem of the platform. As a top priority of the artificial intelligence industry, AI chips have become one of the most popular investment fields, and new AI chips emerge one after another.
In a broad sense, any chip that can run an artificial intelligence algorithm could be called an AI chip. In the usual sense, however, an AI chip refers to a chip specially designed for artificial intelligence algorithms. At this stage, those algorithms are generally deep learning algorithms, though they may also include other machine learning algorithms.
Generally speaking, the so-called AI chip is an ASIC (application-specific integrated circuit) aimed at AI algorithms. Traditional CPUs and GPUs can execute AI algorithms, but they are slow, their performance is low, and they are not viable for commercial use.
For example, autonomous driving needs to recognize road conditions such as pedestrians, vehicles, and traffic lights. If a current CPU were doing the computation, the car might well have driven into the river before it finished detecting the river ahead; it is simply too slow, and time is life. A GPU is much faster, but its power consumption is high, and a car's battery cannot be expected to sustain it for long in normal use. Moreover, GPUs are expensive and out of reach for ordinary consumers. In addition, because the GPU is not an ASIC developed specifically for AI algorithms, its speed has not reached the limit and there is still room for improvement. In a field like intelligent driving, it must be fast! On the smartphone, AI applications such as face recognition and voice recognition must have low power consumption.
What is an AI chip?
Before answering this question, let's first understand two concepts: what are the CPU and the GPU?
Simply put, the CPU is the "brain" of the smartphone, the "commander-in-chief" of its normal operation. GPU stands for graphics processing unit, and its main job is indeed image processing.
Let's look at the division of labor between the CPU and the GPU. The CPU follows the von Neumann architecture, whose core is "stored programs, executed sequentially." It is like a steward who does everything personally, step by step. If you ask the CPU to plant a tree, the work of digging the hole, watering, placing the tree, and filling in the soil is carried out one step at a time.
If you ask the GPU to plant a tree, it calls in helpers A, B, C, and so on, splitting the digging, watering, planting, and soil-filling into different subtasks. This is because the GPU performs parallel operations: it breaks a problem into several parts, each completed by an independent computing unit. Image processing happens to require a calculation for every single pixel, which matches the GPU's working principle perfectly.
To use an analogy: the CPU is like an old professor who can handle integrals and differentials, but some of the work is just a huge number of additions, subtractions, multiplications, and divisions within one hundred. The ideal approach is of course not to have the old professor grind through them, but to hire dozens of primary school students and divide up the tasks. This is the division of labor between CPU and GPU: the CPU handles complex general-purpose computation, while the GPU was born for image processing, from computers to smartphones.
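To make the analogy concrete, here is a minimal Python sketch of the same idea: the identical element-wise job done one item at a time (serial, CPU-style) versus all items at once (bulk, data-parallel, GPU-style). NumPy runs on the CPU, so its vectorized call is only a stand-in for a GPU's many simple compute units; the numbers illustrate the principle, not real GPU speedups.

```python
# Serial vs. bulk processing of "every pixel" of a large image.
import time
import numpy as np

pixels = np.random.rand(4_000_000)  # e.g. every pixel of a large image

# Serial: step by step, like the steward doing everything in order
start = time.perf_counter()
serial_out = [p * 0.5 + 0.1 for p in pixels]
serial_time = time.perf_counter() - start

# Bulk: one operation over all pixels at once, GPU-style division of labor
start = time.perf_counter()
vector_out = pixels * 0.5 + 0.1
vector_time = time.perf_counter() - start

print(f"serial loop:   {serial_time:.3f} s")
print(f"vectorized op: {vector_time:.3f} s")
```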
However, when the demand for artificial intelligence appeared, this division of labor between CPU and GPU ran into problems. Deep learning on an AI device differs from traditional computation: the parameters used on the device are first distilled in the background from a large amount of training data. For example, if the training samples are face image data, the function implemented on the device is face recognition.
A CPU often needs hundreds or even thousands of instructions to complete the processing of a single neuron and cannot support large-scale parallel computing, while the GPU in a smartphone already has to handle the image processing needs of various applications. Forcing the CPU and GPU to take on artificial intelligence tasks generally results in low efficiency and overheating.
This is where the AI chips found in the Apple A12, Kirin 980, and Exynos 9820 come in. Put simply, such a chip is an artificial intelligence accelerator: the GPU processes data in blocks, but AI applications on a smartphone need to be processed in real time. The artificial intelligence accelerator solves exactly this pain point, taking over work related to deep learning and thereby relieving the pressure on the CPU and the GPU.
They split the computational load away from the CPU and GPU, offloading AI-related tasks such as face recognition and speech recognition to an ASIC for processing. Adding a dedicated AI core has already become an industry trend.
On the one hand, the value of the AI chip lies in its division of labor with the CPU and GPU: piling too many tasks on the CPU and GPU only consumes power and raises the temperature.
On the other hand, with the help of the AI chip, the device can learn user behavior, predict the user's usage scenarios, and then allocate performance sensibly: for example, letting the CPU run at full efficiency while you are playing a game, and avoiding wasted performance while you are reading an e-book.
AI chip development process
From Turing's paper "Computing Machinery and Intelligence" and the Turing test, to the earliest neuron simulation unit, the perceptron, and now to deep neural networks with hundreds of layers, humanity's exploration of artificial intelligence has never stopped. In the 1980s, the emergence of multi-layer neural networks and the back-propagation algorithm ignited a new spark in the artificial intelligence industry. The main innovation of back-propagation is that it feeds the error between the network's output and the target output back through the multi-layer network to the earlier layers, so that the final output converges toward a target range. In 1989, Bell Labs successfully used the back-propagation algorithm in a multi-layer neural network to build a handwriting recognizer. In 1998, Yann LeCun and Yoshua Bengio published "Gradient-based learning applied to document recognition," a paper on handwriting recognition neural networks and back-propagation optimization that opened the era of convolutional neural networks.
After that, artificial intelligence went through a long, quiet period of development. Only when IBM's Deep Blue defeated the chess world champion in 1997, and IBM's Watson system won the quiz show Jeopardy! in 2011, did artificial intelligence again attract wide attention. In 2016, AlphaGo defeated the South Korean 9-dan professional Go player, marking another wave of artificial intelligence. From basic algorithms and underlying hardware to tool frameworks and practical application scenarios, the field of artificial intelligence at this stage is blossoming in every direction.
The underlying hardware, the AI chip, which sits at the core of artificial intelligence, has also experienced many ups and downs. In general, the development of AI chips has gone through roughly four major phases, as shown in the figure.
1. Before 2007, the AI chip industry had not yet matured. At the same time, due to factors such as algorithms and data volume, there was no particularly strong market demand for AI chips at this stage; general-purpose CPU chips could meet application needs.
2. With the development of high-definition video, VR, AR games, and other industries, GPU products achieved rapid breakthroughs. At the same time, people found that the parallel computing characteristics of GPUs happened to match the needs of artificial intelligence algorithms and big-data parallel computing; for example, a GPU can improve the efficiency of deep learning algorithms by tens of times compared with a traditional CPU. So people began to try using GPUs for artificial intelligence computation.
3. After 2010, cloud computing was widely adopted. Artificial intelligence researchers could use the cloud to run hybrid computations across large numbers of CPUs and GPUs, further promoting the application of AI chips and driving the development of AI chips of all kinds.
4. Artificial intelligence's demands on computing power kept rising rapidly. After 2015, the GPU's performance-per-watt was no longer sufficient, subjecting it to various restrictions in practical applications, so the industry began to develop chips dedicated to artificial intelligence, hoping to further improve performance and energy efficiency through better hardware and chip architectures.
What is the difference between an AI chip and a regular chip?
Among AI algorithms, CNNs (convolutional networks) are commonly used in image recognition, while RNNs dominate speech recognition, natural language processing, and related fields; these are two different types of algorithms. Essentially, however, both boil down to multiplications and additions of matrices or vectors, together with a few other operations such as division and exponentiation.
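As a minimal sketch of that claim, the Python snippet below shows one fully connected layer reduced to a matrix-vector multiply, an add, and a simple non-linearity. The shapes and sizes are arbitrary and chosen only for illustration.

```python
# One network layer is, at its core, a matrix multiply-add.
import numpy as np

def dense_layer(x, W, b):
    """One fully connected layer: matrix-vector multiply, add bias, ReLU."""
    return np.maximum(W @ x + b, 0.0)

x = np.random.rand(1024)        # input vector (e.g. flattened features)
W = np.random.rand(256, 1024)   # weights determined during training
b = np.random.rand(256)         # biases

y = dense_layer(x, W, b)
# Multiply-adds in this single layer: one per weight, i.e. 256 * 1024
print("multiply-add count:", W.size)
```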
A mature AI algorithm such as YOLO-V3 consists of a large number of convolutions, residual connections, fully connected layers, and so on, whose essence is multiplication and addition. For YOLO-V3, once a specific input image size is fixed, the total number of multiply-add operations is fixed, say on the order of one trillion (the real figure is much larger than this). To run YOLO-V3 quickly, those trillion-odd multiply-add operations must be completed quickly.
Now consider IBM's POWER8, one of the most advanced superscalar server CPUs, running at 4 GHz with 128-bit SIMD. Assuming 16-bit data, each SIMD register holds 8 numbers, so in one cycle it can perform at most 8 multiply-add operations, that is, at most 16 operations per cycle. This is the theoretical limit, which is unlikely to be reached in practice. The CPU's peak operation count per second is then 16 x 4G = 64 Gops. From this you can calculate how long the CPU would take for one pass; likewise, if you switch to a GPU, you can work out its execution time the same way.
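Writing out that back-of-the-envelope arithmetic as a short script (the figures are the same idealized assumptions used above, not measurements):

```python
# Idealized CPU peak throughput and the time for the YOLO-V3-style workload.
simd_width_bits = 128
data_width_bits = 16
lanes = simd_width_bits // data_width_bits      # 8 numbers per SIMD register
ops_per_cycle = lanes * 2                       # a multiply and an add per lane -> 16
clock_hz = 4e9                                  # 4 GHz

cpu_peak_ops = ops_per_cycle * clock_hz         # 64 Gops
workload_ops = 1e12                             # ~one trillion multiply-adds (YOLO-V3 example)

print(f"CPU peak: {cpu_peak_ops / 1e9:.0f} Gops")
print(f"Ideal time for the workload: {workload_ops / cpu_peak_ops:.1f} s")
```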
Now let's look at an AI chip, for example the famous Google TPU1.
The TPU1 runs at roughly 700 MHz and contains a 256x256 systolic array, as shown below. That is 256x256 = 64K multiply-accumulate units in total, each able to perform one multiplication and one addition per cycle, i.e. 128K operations per cycle (counting the multiplication and the addition separately).
Beyond the systolic array there are other modules, such as the activation unit, which also perform multiplications, additions, and the like. So the TPU1's peak operation count per second is at least 128K x 700 MHz = 89,600 Gops, roughly 90 Tops. Comparing the CPU with the TPU1, you will find that their computing power differs by several orders of magnitude; this is why the CPU is slow.
Of course, the figures above are completely idealized theoretical values; in practice perhaps only around 5% of them can be reached. The on-chip memory is not large enough, so data is stored in DRAM, and fetching data from DRAM is very slow, which means the multiply units often have to wait. In addition, an AI algorithm consists of many network layers that must be computed layer by layer, so when switching layers the multiply units are idle again. Many factors therefore prevent the actual chip from reaching its theoretical peak, and the gap is huge.
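The same arithmetic for the TPU1, together with the utilization caveat just described. All numbers are the idealized figures quoted above, and the 5% factor is only the rough estimate from this section:

```python
# Idealized TPU1 peak throughput, compared against the CPU estimate above,
# then discounted by the rough ~5% utilization suggested in the text.
mac_units = 256 * 256                    # 64K units in the systolic array
ops_per_cycle = mac_units * 2            # one multiply and one add each -> 128K ops
clock_hz = 700e6                         # ~700 MHz

tpu_peak_ops = ops_per_cycle * clock_hz  # ~90 Tops
cpu_peak_ops = 64e9                      # the CPU estimate from above

workload_ops = 1e12                      # the same ~one trillion multiply-adds
utilization = 0.05                       # the ~5% of peak suggested above

print(f"TPU1 peak: {tpu_peak_ops / 1e12:.1f} Tops "
      f"(~{tpu_peak_ops / cpu_peak_ops:.0f}x the CPU peak)")
print(f"Time at peak:        {workload_ops / tpu_peak_ops * 1e3:.1f} ms")
print(f"Time at ~5% of peak: {workload_ops / (tpu_peak_ops * utilization) * 1e3:.1f} ms")
```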
Today's neural networks keep getting larger, with more and more parameters. With a large NN model, training takes several weeks or even a month or two; a sudden power failure means starting all over again, and after modifying the model it takes weeks to find out whether the change was right or wrong. Who can afford to wait? Then suddenly there is the TPU, and you find that you can come back from lunch to a finished run, tune the parameters, and keep going. How great is that!
In general, the CPU and GPU are not AI-specific chips. In order to implement other functions, they contain a great deal of other logic that is completely useless for today's AI algorithms, so naturally they cannot achieve the best price/performance ratio.
At present, in fields such as image recognition, speech recognition, and natural language processing, the most accurate algorithms are based on deep learning; traditional machine learning has been surpassed in accuracy, and the most widely used algorithms today are arguably deep learning. Moreover, traditional machine learning requires far less computation than deep learning, so when we discuss AI chips we are targeting deep learning with its especially large computational load. For algorithms with a small computational load, frankly, the CPU is already fast enough. And the CPU is well suited to executing complex scheduling algorithms, which GPUs and AI chips cannot do, so the three simply target different application scenarios, each with its own home field.
Classification and technology of AI chips
There are two development paths for artificial intelligence chips. One continues the traditional computing architecture and accelerates hardware computing power; it is mainly represented by three types of chips, namely the GPU, the FPGA, and the ASIC, though the CPU still plays an irreplaceable role. The other abandons the classic von Neumann computing architecture and uses brain-like neural structures to improve computing power, represented by IBM's TrueNorth chip.
1. Traditional CPUs
The computer industry has used the term CPU since the early 1960s. Since then the CPU has undergone tremendous changes in form, design, and implementation, but its basic working principle has not changed much. A CPU usually consists of two main components, the control unit and the arithmetic unit. The internal structure of a traditional CPU is shown in Figure 3. From the figure we can see that essentially only a single ALU module (arithmetic logic unit) performs the data calculation; the other modules exist to ensure that instructions are executed one after another in an orderly manner. This general-purpose structure suits the traditional programming model very well, and calculation speed can be raised by increasing the CPU frequency (increasing the number of instructions executed per unit time). However, for deep learning, which does not need many program instructions but does demand massive amounts of data computation, this structure is somewhat powerless. In particular, under power constraints the operating frequency of the CPU and memory cannot be increased without limit to speed up instruction execution, which leads to an insurmountable bottleneck for the CPU-based approach.
2. GPUs for parallel accelerated computing
The GPU was the first processor to be used for parallel accelerated computing. It is faster than the CPU and more flexible than other accelerator chips.
The reason the traditional CPU is not well suited to executing artificial intelligence algorithms is that its instructions follow a serial execution model, which fails to exploit the chip's full potential. In contrast, the GPU has a highly parallel architecture and is more efficient than the CPU at processing graphics data and complex algorithms. Comparing their structures, most of the CPU's area is devoted to control logic and registers, while the GPU has many more ALUs (arithmetic logic units) for data processing, a structure that suits parallel processing of dense data. The structural comparison between CPU and GPU is shown in the figure. Programs often run tens or even thousands of times faster on a GPU system than on a single-core CPU. As companies such as NVIDIA and AMD continue to advance support for massively parallel GPU architectures, GPUs for general-purpose computing (GPGPUs, general-purpose graphics processing units) have become an important means of accelerating parallel applications.
At present, GPU technology has reached a relatively mature stage. Companies such as Google, Facebook, Microsoft, Twitter, and Baidu use GPUs to analyze image, video, and audio files to improve features such as search and image tagging. Many car manufacturers are also using GPU chips to develop driverless cars, and GPUs are used in VR/AR-related industries as well. But GPUs also have limitations. A deep learning algorithm has two parts, training and inference: the GPU platform is very efficient for training, but when inferring on a single input, the advantages of parallel computing cannot be fully exploited.
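A rough Python sketch of why batch size matters for that last point: the same layer applied to one input at a time versus a whole batch at once. NumPy's matmul here is only a stand-in for a GPU kernel, and the effect is far larger on real GPUs, but the principle is the same: a single input leaves most of the parallel hardware idle.

```python
# Single-input "inference" vs. a batched "training-style" pass.
import time
import numpy as np

W = np.random.rand(1024, 1024).astype(np.float32)
batch = np.random.rand(256, 1024).astype(np.float32)

# Single-input inference: 256 separate matrix-vector products
start = time.perf_counter()
for x in batch:
    _ = W @ x
one_by_one = time.perf_counter() - start

# Batched pass: one matrix-matrix product over all 256 inputs
start = time.perf_counter()
_ = batch @ W.T
batched = time.perf_counter() - start

print(f"256 single inputs: {one_by_one * 1e3:.1f} ms")
print(f"one batch of 256:  {batched * 1e3:.1f} ms")
```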
3. Semi-custom FPGAs
The FPGA is a further development of programmable devices such as the PAL, GAL, and CPLD. Users define the logic gates and the connections between them and memory by burning in an FPGA configuration file, and this burn-in is not one-off: a user can configure an FPGA as a microcontroller (MCU), and after use, edit the configuration file to turn the same FPGA into an audio codec. The FPGA therefore both addresses the inflexibility of custom circuits and overcomes the limited gate count of earlier programmable devices.
FPGAs can perform both data-parallel and task-parallel computing, with even more significant efficiency gains on specific applications. For a particular operation, a general-purpose CPU may need many clock cycles; an FPGA, by contrast, can be programmed into a dedicated circuit that completes the operation in a few clock cycles, or even a single one.
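A tiny back-of-the-envelope comparison of that claim, using purely hypothetical cycle counts and clock rates (none of these figures come from the text): even at a much lower clock, a dedicated circuit that needs only one cycle can finish sooner than a fast CPU that needs many.

```python
# Hypothetical numbers only: latency = cycles needed / clock frequency.
cpu_clock_hz, cpu_cycles_per_op = 4e9, 100     # assumed general-purpose CPU
fpga_clock_hz, fpga_cycles_per_op = 300e6, 1   # assumed dedicated FPGA circuit

cpu_latency = cpu_cycles_per_op / cpu_clock_hz
fpga_latency = fpga_cycles_per_op / fpga_clock_hz

print(f"CPU latency:  {cpu_latency * 1e9:.1f} ns")
print(f"FPGA latency: {fpga_latency * 1e9:.1f} ns")
```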
In addition, thanks to the flexibility of FPGAs, many low-level hardware control operations that are difficult to implement with general-purpose processors or ASICs can be implemented easily, leaving more room for implementing and optimizing algorithms. At the same time, the one-time cost of an FPGA (the cost of lithographic mask fabrication) is much lower than that of an ASIC. While chip demand has not yet reached scale and deep learning algorithms are still unstable and need constant iteration, an FPGA-based semi-custom artificial intelligence chip, with its reconfigurability, is one of the best choices.
In terms of power consumption, the FPGA also has an inherent architectural advantage. In the traditional von Neumann architecture, an execution unit (such as a CPU core) may execute arbitrary instructions, so an instruction memory, a decoder, arithmetic units for the various instructions, and branch/jump handling logic all have to participate in every operation. In an FPGA, the function of each logic unit is already fixed at reprogramming (burn-in) time, so no instructions and no shared instruction memory are required, which greatly reduces the power spent per operation and improves overall energy efficiency.
Because of the flexibility and speed of FPGAs, there is a trend toward replacing ASICs with them in many areas. The application of FPGAs in the field of artificial intelligence is shown in the figure.
4. Fully custom ASICs
At present, artificial intelligence computing, as represented by deep learning, mainly uses GPUs, FPGAs, and other existing general-purpose chips suited to parallel computing for acceleration. While industrial applications are still taking off, using such existing general-purpose chips avoids the high investment and high risk of developing dedicated custom chips (ASICs). However, because these general-purpose chips were not designed specifically for deep learning, they have inherent limitations in performance and power consumption, and as artificial intelligence applications grow in scale these problems become increasingly prominent.
The GPU is a graphics processor designed for the massively parallel computing of image processing, so when applied to deep learning algorithms it has three limitations. First, the advantage of parallel computing cannot be fully exploited in deployment: deep learning involves both training and inference, and while the GPU is very efficient at training deep learning algorithms, its parallelism cannot be fully used when inferring on a single input. Second, the hardware structure cannot be flexibly configured: the GPU uses the SIMT computing model with a relatively fixed hardware structure, and since deep learning algorithms are not yet completely stable, if an algorithm changes substantially the GPU cannot reconfigure its hardware the way an FPGA can. Third, running deep learning algorithms is less energy-efficient than on an FPGA.
Although FPGAs are viewed very favorably, and even the new generation of the Baidu Brain was developed on an FPGA platform, they were not developed specifically for deep learning algorithms either, and they have many limitations in practice. First, the basic units have limited computing power: to achieve reconfigurability, an FPGA contains a large number of very fine-grained basic units, but the computing power of each unit (which relies mainly on LUT look-up tables) is far lower than that of the ALU modules in CPUs and GPUs. Second, the proportion of resources devoted to computation is relatively low: to achieve reconfigurability, a large share of the FPGA's internal resources goes to configurable on-chip routing and wiring. Third, there is still a considerable gap in speed and power consumption compared with dedicated custom chips (ASICs). Fourth, FPGAs are more expensive: at scale, the cost of a single FPGA is much higher than that of a dedicated custom chip.
Once the deep learning algorithm stabilizes, the AI chip can be fully customized using ASIC design methods, so that performance, power consumption, and area are all optimized for the deep learning algorithm.
5. Brain-like chips
The brain-like chip does not use the classic von Neumann architecture but is based on a neuromorphic architecture, with IBM's TrueNorth as the representative. IBM researchers used memory cells as synapses, computing units as neurons, and transmission units as axons to build a neural chip prototype. TrueNorth is currently manufactured on Samsung's 28 nm low-power process; its on-chip network of 5.4 billion transistors contains 4,096 synaptic cores, and its real-time operating power consumption is only 70 mW. Because synapses require variable weights and memory, IBM has experimentally implemented novel synapses using phase-change non-volatile memory (PCM) technology compatible with the CMOS process, accelerating commercialization.
AI chip application field
As artificial intelligence chips continue to develop, their application fields will keep expanding in multiple directions over time.


1. Smartphones
In September 2017, Huawei released the Kirin 970 chip at the IFA consumer electronics show in Berlin, Germany. Equipped with the Cambricon NPU, it became the "world's first smartphone AI chip." In mid-October 2017, the Mate 10 series was launched (the processor of this smartphone series is the Kirin 970). Equipped with the NPU, the Huawei Mate 10 series has strong deep learning and local inference capabilities, enabling all kinds of deep-neural-network-based photography and image processing applications and giving users a better experience.
Apple released smartphones represented by the iPhone X with its built-in A11 Bionic chip. The A11 Bionic includes a self-developed dual-core Neural Engine (neural network processing engine) that can handle up to 600 billion neural network operations per second. The Neural Engine is what makes the A11 Bionic a true AI chip. It greatly improved the iPhone X's photography experience and enabled some creative new uses.
2. ADAS (Advanced Driver Assistance System)
ADAS is one of the artificial intelligence applications that most attracts public attention. It needs to process large amounts of real-time data collected by sensors such as LiDAR, millimeter-wave radar, and cameras. Compared with traditional vehicle control methods, intelligent control is mainly reflected in the application of control-object models and of comprehensive information learning, including neural network control and deep learning methods.
3. Computer vision (CV) devices
Devices that require computer vision technology, such as smart cameras, drones, dashboard cameras, face recognition welcome robots, and smart tablets, often need local inference. If they could only work while connected to the network, the experience would undoubtedly be poor. Computer vision appears to be one of the most fertile grounds for artificial intelligence applications, and computer vision chips will have broad market prospects.
4. VR devices
The representative chip for VR devices is the HPU, developed by Microsoft for its own VR device, the HoloLens. Manufactured by TSMC, the chip can simultaneously process data from five cameras, one depth sensor, and motion sensors, and it accelerates computer vision matrix operations and CNN computation. This allows VR devices to reconstruct high-quality 3D images of people and transfer them anywhere in real time.
5. Voice interaction devices
For voice interaction device chips, there are two companies in China, Intellect and Yunzhisheng. The chip solutions they provide both have a built-in deep neural network acceleration scheme optimized for speech recognition, enabling offline recognition on the device; this stable recognition capability makes it possible for voice technology to land in real products. At the same time, major breakthroughs have been made in the core links of voice interaction: speech recognition has moved beyond a single-point capability, with major progress from far-field recognition to speech analysis and semantic understanding, forming an overall interaction solution.
6. Robots
Both home robots and commercial service robots require dedicated software-plus-chip artificial intelligence solutions. A typical company is Horizon Robotics, founded by Yu Kai, the former head of Baidu's deep learning lab. Horizon Robotics also provides other embedded artificial intelligence solutions such as ADAS and smart home.
