AI-Optimized Chipsets
Part II: Computing Hardware
May 2018

Computing Hardware
Previously in Part I, we reviewed the ADAC loop and key factors driving innovation for AI-optimized chipsets. In this instalment, we explore how AI-led computing demands are powering these trends:
- Deep learning is expected to drive training for neural networks, requiring massive datasets for AI algorithm development
- This in turn shifts the performance focus of computing from general applications to neural nets, increasing demand for high-performance computing
- Deep learning is both computationally and memory intensive, necessitating enhancements in processor performance
- Hence the rise of startups adopting alternative, innovative approaches, and how this is expected to pave the way for different types of AI-optimized chipsets
Source: Nvidia | Graphcore

Deep learning is expected to drive training for neural networks
[Diagram: untrained neural network model + labelled training data ("Dog", "Cat") -> trained model optimized for performance -> inference on new data ("Cat")]
Training
- Training refers to neural network learning with significant data; AI algorithms are developed via training
- Consumes significant computing power
- Training loads can be divided into many concurrent tasks. This is ideal for the GPU's double-precision floating point and huge core counts
- Training can also be conducted using FPGAs
- Requires calculations with relatively high precision, often using 32-bit floating-point operations
Inference
- Inference refers to the neural network interpreting new data to generate accurate results
- Typically conducted at the application or client end-point (i.e. the edge), rather than on the server or cloud
- Requires fewer hardware resources and, depending on the application, can be performed using CPUs; it could also use FPGAs, ASICs, Digital Signal Processors (DSPs), etc.
- Inference is expected to shift locally to mobile devices
- Precision can be sacrificed in favor of greater speed or lower power consumption (see the sketch below)
“The workloads are changing dramatically” for computing as a result of machine learning, “and whenever workloads have changed in computing, it has always created an opportunity for new kinds of computing.”
Andrew Feldman, CEO | Cerebras
Source: Intel, NVIDIA, ImageNet, Ark Invest Management LLC
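To make the precision trade-off above concrete, here is a minimal NumPy sketch contrasting float32 training with reduced-precision inference. The toy linear model, the data shapes and the float16 cast are illustrative assumptions, not taken from the deck.

import numpy as np

# Toy data: 3 input features -> 1 output (shapes are illustrative assumptions).
rng = np.random.default_rng(0)
X = rng.standard_normal((256, 3)).astype(np.float32)
true_w = np.array([0.5, -1.2, 2.0], dtype=np.float32)
y = X @ true_w

# Training: many full-precision (float32) passes over the dataset.
w = np.zeros(3, dtype=np.float32)
lr = 0.1
for _ in range(200):
    grad = X.T @ (X @ w - y) / len(X)   # gradient of the mean squared error
    w -= lr * grad

# Inference: a single forward pass; precision can be reduced (here float16)
# to trade accuracy for speed and power on an edge device.
w_fp16 = w.astype(np.float16)
x_new = np.array([1.0, 0.0, -1.0], dtype=np.float16)
print("float32 prediction:", float(x_new.astype(np.float32) @ w))
print("float16 prediction:", float(x_new @ w_fp16))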
With massive datasets required for AI algorithm development and inference
[Chart: Deep Learning Growth Drivers]
Source: Deep Learning: An Artificial Intelligence Revolution by ARK Investment | Learning both Weights and Connections for Efficient Neural Networks by Song Han et al. | Icon made by Those Icons from flaticon

Shifting the performance focus of computing from general application to neural nets
Source: Deep Learning: An Artificial Intelligence Revolution by ARK Investment | Learning both Weights and Connections for Efficient Neural Networks by Song Han et al. | Convolutional Neural Network by Mathworks

Deep learning is both computationally and memory intensive
Deep learning chipsets are designed to optimize performance, power and memory:
- Algorithms tend to be highly parallel, which requires splitting data between different processing units; connecting the pipeline in the most efficient manner is key
- There is significant transfer of data back and forth to memory
- For instance, convolutional neural networks require convolution operations to be repeated throughout the pipeline, and the number of operations can be extremely large
[Figure: example of a neural network with many convolutional layers]

Driving enhancements in processor performance via matrix multiplication.
- A neural network takes input data, multiplies it with a weight matrix and applies an activation function
- Multiplying matrices is often the most computationally intensive part of running a trained model
- For inputs X1, X2, X3 and neurons Y1, Y2:
  Y1 = f(W11*X1 + W12*X2 + W13*X3)
  Y2 = f(W21*X1 + W22*X2 + W23*X3)
- This sequence of multiplications and additions can be written as a single matrix multiplication, Y = f(WX), whose outputs are then processed further by the activation function (see the sketch below)
Source: An in-depth look at Google's first Tensor Processing Unit (TPU) by Kaz Sato

Quantization in neural networks
- Quantization is a process of converting a range of input values into a smaller set of output values that closely approximates the original data
- It reduces the cost of neural network predictions and memory usage, especially for mobile and embedded deployments
- Neural network predictions may not require the precision of 16-bit or 32-bit floating-point calculations. For example, if it is raining, knowing whether it is light or heavy will suffice; there is no need to know how many droplets of water are falling per second
- 8-bit integers can still be used to calculate a neural network prediction while maintaining the appropriate level of accuracy (see the sketch below)
[Diagram: quantization in TensorFlow]
Source: An in-depth look at Google's first Tensor Processing Unit (TPU) by Kaz Sato

And graph processing.
Scalar Processing
- Processes one operation per instruction; CPUs run at clock speeds in the GHz range
- Might take a long time to execute large matrix operations via a sequence of scalar operations: ai + bi = ci, for i = 1 to n
Vector Processing
- The same operation is performed concurrently across a large number of data elements at the same time: a1:n + b1:n = c1:n (see the sketch below)
- GPUs are effectively vector processors
Source: Spark 2.x - 2nd generation Tungsten Engine
Graph Processing
- Runs many computational processes (vertices) and calculates the effects these vertices have on other points with which they interact via lines (i.e. edges)
- Overall processing works on many vertices and points simultaneously
- Low precision needed
Source: Cerebras Founder Feldman Contemplates the A.I. Chip Age by Barron's | Suffering Ceepie-Geepies! Do We Need a New Processor Architecture? by The Register | Startup Unveils Graph Processor at Hot Chips by EETimes
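A minimal NumPy sketch of the Y = f(WX) form shown in the matrix-multiplication slide above; the weight values and the choice of ReLU as the activation f are illustrative assumptions.

import numpy as np

# One neural-network layer, matching the two-neuron, three-input example above.
W = np.array([[0.2, -0.5, 1.0],    # W11 W12 W13
              [0.7,  0.1, -0.3]])  # W21 W22 W23
X = np.array([1.0, 2.0, 3.0])      # inputs X1, X2, X3

def f(z):
    return np.maximum(z, 0.0)      # activation function (ReLU, for illustration)

Y = f(W @ X)                        # matrix multiply, then activation
print(Y)                            # [Y1, Y2]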
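A minimal sketch of the 8-bit quantization described in the quantization slide above, using a generic symmetric linear scaling scheme; this scheme is an assumption and not necessarily the exact method TensorFlow uses.

import numpy as np

# Quantize float32 weights to 8-bit integers and dequantize for comparison.
w = np.random.default_rng(1).standard_normal(6).astype(np.float32)

scale = np.abs(w).max() / 127.0                   # map the observed range onto [-127, 127]
w_int8 = np.round(w / scale).astype(np.int8)      # 4x smaller to store, cheaper to multiply
w_restored = w_int8.astype(np.float32) * scale    # close to the original, small rounding error

print(w)
print(w_int8)
print(w_restored)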
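A small sketch of the scalar-versus-vector contrast from the graph-processing slide above; NumPy's vectorized add stands in for what a GPU or SIMD unit does in hardware, and the array size is arbitrary.

import numpy as np

a = np.arange(100_000, dtype=np.float32)
b = np.arange(100_000, dtype=np.float32)

# Scalar style: one add per loop iteration (ai + bi = ci, i = 1..n).
c_scalar = np.empty_like(a)
for i in range(len(a)):
    c_scalar[i] = a[i] + b[i]

# Vector style: the same operation applied to all elements at once
# (a1:n + b1:n = c1:n).
c_vector = a + b

assert np.array_equal(c_scalar, c_vector)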
Creating new approaches that focus on graph processing and sparse matrix math, emphasizing communications between inputs and outputs of calculations
ThinCI
- The key to a "graph" machine is software that captures the "intent" of the graph problems it needs to solve, processing in parallel instead of sequentially
- ThinCI's Graph Streaming Processor (GSP) is designed to understand complex data dependencies and flow
- GSPs manage this entirely on the chip, with minimal software intervention and extremely low memory bandwidth needs, reducing or eliminating inter-processor communications and synchronizations
Cerebras
- A microprocessor wastes a lot of effort with a sparse matrix, multiplying by zero; a sparse matrix is a matrix in which many elements are zero (see the sketch below)
- A new chip is needed to handle sparse matrix math and emphasize communications between the inputs and outputs of calculations
- Machine learning methods (e.g. convolutional neural networks) involve recursion and feedback: computations in one instance feed into computations elsewhere in the process
- Cerebras' solution: simple on compute and arithmetic, and very intense on communications
Graphcore
- Graphcore's Intelligence Processing Unit (IPU) has a structure which provides efficient massive compute parallelism and huge memory bandwidth, both essential for delivering a significant step-up in the graph processing power needed for machine intelligence
- The graph is a highly parallel execution plan for the IPU
- Expected to increase the speed of machine learning workloads significantly: in general by 5x, and for specific workloads (e.g. autonomous vehicles) by 50-100x
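To illustrate the sparse-matrix point in the Cerebras description above, here is a toy compressed-sparse-row (CSR) matrix-vector product that touches only the nonzero entries; the representation and values are illustrative, not any vendor's implementation.

import numpy as np

# A dense matrix-vector product multiplies by every zero; a CSR representation
# stores and processes only the nonzero entries.
A = np.array([[0.0, 0.0, 3.0],
              [4.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
x = np.array([1.0, 2.0, 3.0])

# CSR form of A: nonzero values, their column indices, and where each row starts.
values  = [3.0, 4.0]
cols    = [2, 0]
row_ptr = [0, 1, 2, 2]   # row i uses entries row_ptr[i]:row_ptr[i+1]

y = np.zeros(3)
for i in range(3):
    for k in range(row_ptr[i], row_ptr[i + 1]):   # only nonzeros are touched
        y[i] += values[k] * x[cols[k]]

assert np.allclose(y, A @ x)   # same result as the dense product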
Source: Horizon Robotics | Hailo | Gyrfalcon Technology

As well as AI processing in memory architectures and massively parallel compute capabilities
Hailo
- Deep learning processor for edge devices offering datacenter-class performance in an embedded device
- Dataflow approach, based on the structure of Neural Networks (NNs)
- Distributed memory fabric, combined with purpose-made pipeline elements, allowing very low-power memory access (without using batch processing)
- Novel control scheme based on a combination of hardware and software, reaching very low Joules/operation metrics with a high degree of flexibility
- Extremely efficient computational elements, which can be variably applied according to need
- Low-overhead interconnect, allowing Near Memory Processing (NMP) and balancing the changing requirements of memory, compute and control along the NN
Gyrfalcon Technology
- Gyrfalcon's Intelligent Matrix Processor, the Lightspeeur® 2801S, delivers an APiM (AI Processing in Memory) architecture featuring massively parallel compute capabilities
- The APiM architecture uses memory as the AI processing unit, eliminating the huge data movement that results in high power consumption
- The architecture features true on-chip parallelism and in-situ computing, and eliminates memory bottlenecks. It has roughly 28,000 parallel computing cores and does not require external memory for AI inference
- It runs in various open frameworks like TensorFlow, Caffe and others to complete deep learning training and inference tasks
Horizon Robotics
- The Brain Processing Unit (BPU) by Horizon Robotics is a heterogeneous Multiple Instruction, Multiple Data (MIMD) computation system
- By heterogeneity, the BPU uses multiple kinds of Processing Units (PUs) designed specifically for neural network inference. It gains performance or energy efficiency by adding dissimilar PUs, incorporating specialized processing capabilities to handle particular tasks
- MIMD is a technique employed to achieve parallelism, with a number of PUs that function asynchronously and independently; at any one time, different PUs may be executing different instructions on different pieces of data
- The first-generation BPU employs a Gaussian architecture, allowing each vision task to be divided into two stages (i.e. attention and cognition) for optimal allocation of computations. This offers a parallel and fast filter of task-irrelevant information, on-demand cognition, and edge learning to adjust models after deployment
- This design enables the BPU to achieve a performance of up to 1 TOPS at a low power of 1.5 W. It can process 1080p video input at 30 frames per second, and detect and recognize up to 200 objects per frame

The choice of chipset depends on use - for training, inference, in the cloud, at the edge, or a hybrid of both
Cloud
- Some cloud providers have been creating their own chips, using alternative architectures to GPUs (e.g. FPGAs and ASICs)
- Cloud-based systems can handle both neural network training and inference
Edge
- Edge devices, from phones to drones, are expected to focus mainly on inference, due to energy efficiency and low-latency computation considerations
- Inference will be moved to edge devices for most applications (AR is expected to be a key driver)
- New entrants will have the best chance of success in the end-device market given its nascence
- Chips for end-devices have power requirements as low as 1 watt
- The devices market is too large and diverse for a single chip design to address, and customers will ultimately want custom designs

With industry players adopting different approaches
Cloud
- Google: TPUs are ASICs. The high non-recurring costs associated with designing an ASIC can be absorbed due to Google's large scale; using TPUs across multiple operations, ranging from Street View to search queries, helps save costs, and TPUs save more power than GPUs
- Microsoft: rolling out FPGAs in its own datacenter revamp - similar to ASICs, but reprogrammable so that their algorithms can be updated
Edge
- Smartphone System-on-Chips (SoCs) are likely to incorporate ASIC logic blocks, creating opportunities for new IP licensing companies (e.g. Cambricon has licensed its ASIC design to Huawei for its Kirin 970 SoC)
- Specialized chips for mobile devices are an increasing trend, with dedicated AI chips appearing in Apple's iPhone X, Huawei's Mate 10 and Google's Pixel 2; ARM has reconfigured its chip design to optimize for AI, and Qualcomm has launched its own mobile AI chips
[Photo: Huawei Mate 10's Kirin 970]
Source: Google | Microsoft | Huawei

Latency and contextualization of locales are key drivers of edge computing
Key Drivers of Edge Computing
- Learning typically happens in the cloud; devices do not do any learning from their environment or experience. Besides inference, it will also be essential to push training to the edge
- Latency: for many applications the delay will be unacceptable (e.g. the high latency risk of sending signal data to the cloud for self-driving prediction, even with 5G networks)
- Context: devices will soon need to be powerful enough to learn at the edge of the network. Devices will be used in situ, and those locales will be increasingly contextualized; the environment where the device is placed will be a key input to its operation, allowing the network to learn from the experience of edge devices and the environment
Source: Artificial Intelligence: 10 Trends to Watch in 2017 and Beyond by Tractica | Expect Deeper and Cheaper Machine Learning by IEEE Spectrum | MIT Technology Review | Google Rattles the Tech World with a New AI Chip for All by Wired | Back to the Edge: AI Will Force Distributed Intelligence Everywhere by Azeem | When Moore's Law Met AI - Artificial Intelligence and the Future of Computing by Azeem

Going forward, we are likely to see Federated Learning - a multi-faceted infrastructure where learning happens at the edge of the network and in the cloud
Federated Learning
- Allows for smarter models, lower latency and lower power consumption, while availing differential privacy and personalized experiences
- Allows the network to learn from the experience of many edge devices and their experiences of the environment
- In a federated environment, edge devices could do some learning and efficiently send back deltas (or weights) to the cloud, where a central model could be more efficiently updated, instead of sending their raw experiential data back to the cloud for analysis (see the sketch below)
- Differential privacy also ensures that the aggregate data in a database capture significant patterns while protecting individual privacy
Source: Artificial Intelligence: 10 Trends to Watch in 2017 and Beyond by Tractica | Expect Deeper and Cheaper Machine Learning by IEEE Spectrum | MIT Technology Review | Google Rattles the Tech World with a New AI Chip for All by Wired | Back to the Edge: AI Will Force Distributed Intelligence Everywhere by Azeem | When Moore's Law Met AI - Artificial Intelligence and the Future of Computing by Azeem
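A minimal sketch of the federated pattern described above, in the style of federated averaging: each device trains locally and returns only a weight delta, and the cloud averages the deltas into the shared model. The local least-squares updates, the plain averaging, and all names are illustrative assumptions rather than the deck's specification.

import numpy as np

def local_update(global_w, X, y, lr=0.1, steps=20):
    """Train briefly on one device's private data; return only the weight delta."""
    w = global_w.copy()
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(X)   # local gradient on device data
        w -= lr * grad
    return w - global_w                      # delta sent to the cloud, not raw data

rng = np.random.default_rng(0)
global_w = np.zeros(3)

# Simulate one round with three edge devices, each holding its own data.
deltas = []
for _ in range(3):
    X = rng.standard_normal((50, 3))
    y = X @ np.array([1.0, -2.0, 0.5])
    deltas.append(local_update(global_w, X, y))

# Cloud update: average the deltas and apply them to the central model.
global_w += np.mean(deltas, axis=0)
print(global_w)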
Google designed its original TPU for execution (i.e. inference); its new Cloud TPU offers a chip that handles training as well
- Amazon and Microsoft offer GPU processing via cloud services, but they do not offer bespoke AI chips for both training and executing neural networks
[Photo: Google's Cloud TPU]

Bitmain claims to have built 70% of all the computers on the Bitcoin network
- It makes specialized chips to perform the critical hash functions involved in mining and trading bitcoins, and packages those chips into the top mining rig, the Antminer S9
- In 2017, Bitmain unveiled details of its new AI chip, the Sophon BM1680, specialized for both training and executing deep learning algorithms
[Photo: Bitmain's Sophon BM1680]