AI-Optimized Chipsets
Part II: Computing Hardware
May 2018

Computing Hardware
Previously in Part I, we reviewed the ADAC loop and key factors driving innovation for AI-optimized chipsets. In this instalment, we explore how AI-led computing demands are powering these trends:
- Deep learning is expected to drive training for neural networks, requiring massive datasets for AI algorithm development
- This in turn shifts the performance focus of computing from general applications to neural nets, increasing demand for high-performance computing
- Deep learning is both computationally and memory intensive, necessitating enhancements in processor performance
- Hence the rise of startups adopting alternative, innovative approaches, and how this is expected to pave the way for different types of AI-optimized chipsets
Source: Nvidia | Graphcore

Deep learning is expected to drive training for neural networks
[Diagram: untrained neural network model + labelled training data ("Dog", "Cat") -> trained model optimized for performance -> inference on new data ("Cat")]
Training
- Training refers to neural network learning with significant data; AI algorithms are developed via training
- Consumes significant computing power
- Training loads can be divided into many concurrent tasks. This is ideal for the GPU's double-precision floating point and huge core counts
- Training can also be conducted using FPGAs
- Requires calculations with relatively high precision, often using 32-bit floating-point operations
Inference
- Inference refers to the neural network interpreting new data to generate accurate results
- Typically conducted at the application or client end-point (i.e. the edge), rather than on the server or cloud
- Requires fewer hardware resources and, depending on the application, can be performed using CPUs; it could also use FPGAs, ASICs, Digital Signal Processors (DSPs), etc.
- Inference is expected to shift locally to mobile devices
- Precision can be sacrificed in favor of greater speed or lower power consumption (see the sketch below)
“The workloads are changing dramatically” for computing as a result of machine learning, “and whenever workloads have changed in computing, it has always created an opportunity for new kinds of computing.”
Andrew Feldman, CEO | Cerebras
Source: Intel, NVIDIA, ImageNet, Ark Invest Management LLC
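To make the precision trade-off above concrete, here is a minimal NumPy sketch contrasting float32 training with reduced-precision inference. The toy linear model, the data shapes and the float16 cast are illustrative assumptions, not taken from the deck.

import numpy as np

# Toy data: 3 input features -> 1 output (shapes are illustrative assumptions).
rng = np.random.default_rng(0)
X = rng.standard_normal((256, 3)).astype(np.float32)
true_w = np.array([0.5, -1.2, 2.0], dtype=np.float32)
y = X @ true_w

# Training: many full-precision (float32) passes over the dataset.
w = np.zeros(3, dtype=np.float32)
lr = 0.1
for _ in range(200):
    grad = X.T @ (X @ w - y) / len(X)   # gradient of the mean squared error
    w -= lr * grad

# Inference: a single forward pass; precision can be reduced (here float16)
# to trade accuracy for speed and power on an edge device.
w_fp16 = w.astype(np.float16)
x_new = np.array([1.0, 0.0, -1.0], dtype=np.float16)
print("float32 prediction:", float(x_new.astype(np.float32) @ w))
print("float16 prediction:", float(x_new @ w_fp16))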
With massive datasets required for AI algorithm development and inference
[Chart: Deep Learning Growth Drivers]
Source: Deep Learning: An Artificial Intelligence Revolution by ARK Investment | Learning both Weights and Connections for Efficient Neural Networks by Song Han et al. | Icon made by Those Icons from flaticon

Shifting the performance focus of computing from general application to neural nets
Source: Deep Learning: An Artificial Intelligence Revolution by ARK Investment | Learning both Weights and Connections for Efficient Neural Networks by Song Han et al. | Convolutional Neural Network by Mathworks

Deep learning is both computationally and memory intensive
Deep learning chipsets are designed to optimize performance, power and memory:
- Algorithms tend to be highly parallel, which requires splitting data between different processing units; connecting the pipeline in the most efficient manner is key
- There is significant transfer of data back and forth to memory
- For instance, convolutional neural networks require convolution operations to be repeated throughout the pipeline, and the number of operations can be extremely large
[Figure: example of a neural network with many convolutional layers]

Driving enhancements in processor performance via matrix multiplication.
- A neural network takes input data, multiplies it with a weight matrix and applies an activation function
- Multiplying matrices is often the most computationally intensive part of running a trained model
- For inputs X1, X2, X3 and neurons Y1, Y2:
  Y1 = f(W11*X1 + W12*X2 + W13*X3)
  Y2 = f(W21*X1 + W22*X2 + W23*X3)
- This sequence of multiplications and additions can be written as a single matrix multiplication, Y = f(WX), whose outputs are then processed further by the activation function (see the sketch below)
Source: An in-depth look at Google's first Tensor Processing Unit (TPU) by Kaz Sato

Quantization in neural networks
- Quantization is a process of converting a range of input values into a smaller set of output values that closely approximates the original data
- It reduces the cost of neural network predictions and memory usage, especially for mobile and embedded deployments
- Neural network predictions may not require the precision of 16-bit or 32-bit floating-point calculations. For example, if it is raining, knowing whether it is light or heavy will suffice; there is no need to know how many droplets of water are falling per second
- 8-bit integers can still be used to calculate a neural network prediction while maintaining the appropriate level of accuracy (see the sketch below)
[Diagram: quantization in TensorFlow]
Source: An in-depth look at Google's first Tensor Processing Unit (TPU) by Kaz Sato

And graph processing.
Scalar Processing
- Processes one operation per instruction; CPUs run at clock speeds in the GHz range
- Might take a long time to execute large matrix operations via a sequence of scalar operations: ai + bi = ci, for i = 1 to n
Vector Processing
- The same operation is performed concurrently across a large number of data elements at the same time: a1:n + b1:n = c1:n (see the sketch below)
- GPUs are effectively vector processors
Source: Spark 2.x - 2nd generation Tungsten Engine
Graph Processing
- Runs many computational processes (vertices) and calculates the effects these vertices have on other points with which they interact via lines (i.e. edges)
- Overall processing works on many vertices and points simultaneously
- Low precision needed
Source: Cerebras Founder Feldman Contemplates the A.I. Chip Age by Barron's | Suffering Ceepie-Geepies! Do We Need a New Processor Architecture? by The Register | Startup Unveils Graph Processor at Hot Chips by EETimes
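A minimal NumPy sketch of the Y = f(WX) form shown in the matrix-multiplication slide above; the weight values and the choice of ReLU as the activation f are illustrative assumptions.

import numpy as np

# One neural-network layer, matching the two-neuron, three-input example above.
W = np.array([[0.2, -0.5, 1.0],    # W11 W12 W13
              [0.7,  0.1, -0.3]])  # W21 W22 W23
X = np.array([1.0, 2.0, 3.0])      # inputs X1, X2, X3

def f(z):
    return np.maximum(z, 0.0)      # activation function (ReLU, for illustration)

Y = f(W @ X)                        # matrix multiply, then activation
print(Y)                            # [Y1, Y2]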
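A minimal sketch of the 8-bit quantization described in the quantization slide above, using a generic symmetric linear scaling scheme; this scheme is an assumption and not necessarily the exact method TensorFlow uses.

import numpy as np

# Quantize float32 weights to 8-bit integers and dequantize for comparison.
w = np.random.default_rng(1).standard_normal(6).astype(np.float32)

scale = np.abs(w).max() / 127.0                   # map the observed range onto [-127, 127]
w_int8 = np.round(w / scale).astype(np.int8)      # 4x smaller to store, cheaper to multiply
w_restored = w_int8.astype(np.float32) * scale    # close to the original, small rounding error

print(w)
print(w_int8)
print(w_restored)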
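A small sketch of the scalar-versus-vector contrast from the graph-processing slide above; NumPy's vectorized add stands in for what a GPU or SIMD unit does in hardware, and the array size is arbitrary.

import numpy as np

a = np.arange(100_000, dtype=np.float32)
b = np.arange(100_000, dtype=np.float32)

# Scalar style: one add per loop iteration (ai + bi = ci, i = 1..n).
c_scalar = np.empty_like(a)
for i in range(len(a)):
    c_scalar[i] = a[i] + b[i]

# Vector style: the same operation applied to all elements at once
# (a1:n + b1:n = c1:n).
c_vector = a + b

assert np.array_equal(c_scalar, c_vector)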
Creating new approaches that focus on graph processing and sparse matrix math, emphasizing communications between inputs and outputs of calculations
ThinCI
- The key to a "graph" machine is software that captures the "intent" of the graph problems it needs to solve, processing in parallel instead of sequentially
- ThinCI's Graph Streaming Processor (GSP) is designed to understand complex data dependencies and flow
- GSPs manage this entirely on the chip, with minimal software intervention and extremely low memory bandwidth needs, reducing or eliminating inter-processor communications and synchronizations
Cerebras
- A microprocessor wastes a lot of effort with a sparse matrix, multiplying by zero; a sparse matrix is a matrix in which many elements are zero (see the sketch below)
- A new chip is needed to handle sparse matrix math and emphasize communications between the inputs and outputs of calculations
- Machine learning methods (e.g. convolutional neural networks) involve recursion and feedback: computations in one instance feed into computations elsewhere in the process
- Cerebras' solution: simple on compute and arithmetic, and very intense on communications
Graphcore
- Graphcore's Intelligence Processing Unit (IPU) has a structure which provides efficient massive compute parallelism and huge memory bandwidth, both essential for delivering a significant step-up in the graph processing power needed for machine intelligence
- The graph is a highly parallel execution plan for the IPU
- Expected to increase the speed of machine learning workloads significantly: in general by 5x, and for specific workloads (e.g. autonomous vehicles) by 50-100x
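To illustrate the sparse-matrix point in the Cerebras description above, here is a toy compressed-sparse-row (CSR) matrix-vector product that touches only the nonzero entries; the representation and values are illustrative, not any vendor's implementation.

import numpy as np

# A dense matrix-vector product multiplies by every zero; a CSR representation
# stores and processes only the nonzero entries.
A = np.array([[0.0, 0.0, 3.0],
              [4.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
x = np.array([1.0, 2.0, 3.0])

# CSR form of A: nonzero values, their column indices, and where each row starts.
values  = [3.0, 4.0]
cols    = [2, 0]
row_ptr = [0, 1, 2, 2]   # row i uses entries row_ptr[i]:row_ptr[i+1]

y = np.zeros(3)
for i in range(3):
    for k in range(row_ptr[i], row_ptr[i + 1]):   # only nonzeros are touched
        y[i] += values[k] * x[cols[k]]

assert np.allclose(y, A @ x)   # same result as the dense product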
Source: Horizon Robotics | Hailo | Gyrfalcon Technology

As well as AI processing in memory architectures and massively parallel compute capabilities
Hailo
- Deep learning processor for edge devices offering datacenter-class performance in an embedded device
- Dataflow approach, based on the structure of Neural Networks (NNs)
- Distributed memory fabric, combined with purpose-made pipeline elements, allowing very low-power memory access (without using batch processing)
- Novel control scheme based on a combination of hardware and software, reaching very low Joules/operation metrics with a high degree of flexibility
- Extremely efficient computational elements, which can be variably applied according to need
- Low-overhead interconnect, allowing Near Memory Processing (NMP) and balancing the changing requirements of memory, compute and control along the NN
Gyrfalcon Technology
- Gyrfalcon's Intelligent Matrix Processor, the Lightspeeur® 2801S, delivers an APiM (AI Processing in Memory) architecture featuring massively parallel compute capabilities
- The APiM architecture uses memory as the AI processing unit, eliminating the huge data movement that results in high power consumption
- The architecture features true on-chip parallelism and in-situ computing, and eliminates memory bottlenecks. It has roughly 28,000 parallel computing cores and does not require external memory for AI inference
- It runs in various open frameworks like TensorFlow, Caffe and others to complete deep learning training and inference tasks
Horizon Robotics
- The Brain Processing Unit (BPU) by Horizon Robotics is a heterogeneous Multiple Instruction, Multiple Data (MIMD) computation system
- By heterogeneity, the BPU uses multiple kinds of Processing Units (PUs) designed specifically for neural network inference. It gains performance or energy efficiency by adding dissimilar PUs, incorporating specialized processing capabilities to handle particular tasks
- MIMD is a technique employed to achieve parallelism, with a number of PUs that function asynchronously and independently; at any one time, different PUs may be executing different instructions on different pieces of data
- The first-generation BPU employs a Gaussian architecture, allowing each vision task to be divided into two stages (i.e. attention and cognition) for optimal allocation of computations. This offers a parallel and fast filter of task-irrelevant information, on-demand cognition, and edge learning to adjust models after deployment
- This design enables the BPU to achieve a performance of up to 1 TOPS at a low power of 1.5 W. It can process 1080p video input at 30 frames per second, and detect and recognize up to 200 objects per frame

The choice of chipset depends on use - for training, inference, in the cloud, at the edge, or a hybrid of both
Cloud
- Some cloud providers have been creating their own chips, using alternative architectures to GPUs (e.g. FPGAs and ASICs)
- Cloud-based systems can handle both neural network training and inference
Edge
- Edge devices, from phones to drones, are expected to focus mainly on inference, due to energy efficiency and low-latency computation considerations
- Inference will be moved to edge devices for most applications (AR is expected to be a key driver)
- New entrants will have the best chance of success in the end-device market given its nascence
- Chips for end-devices have power requirements as low as 1 watt
- The devices market is too large and diverse for a single chip design to address, and customers will ultimately want custom designs

With industry players adopting different approaches
Cloud
- Google: TPUs are ASICs. The high non-recurring costs associated with designing an ASIC can be absorbed due to Google's large scale; using TPUs across multiple operations, ranging from Street View to search queries, helps save costs, and TPUs save more power than GPUs
- Microsoft: rolling out FPGAs in its own datacenter revamp - similar to ASICs, but reprogrammable so that their algorithms can be updated
Edge
- Smartphone System-on-Chips (SoCs) are likely to incorporate ASIC logic blocks, creating opportunities for new IP licensing companies (e.g. Cambricon has licensed its ASIC design to Huawei for its Kirin 970 SoC)
- Specialized chips for mobile devices are an increasing trend, with dedicated AI chips appearing in Apple's iPhone X, Huawei's Mate 10 and Google's Pixel 2; ARM has reconfigured its chip design to optimize for AI, and Qualcomm has launched its own mobile AI chips
[Photo: Huawei Mate 10's Kirin 970]
Source: Google | Microsoft | Huawei

Latency and contextualization of locales are key drivers of edge computing
Key Drivers of Edge Computing
- Learning typically happens in the cloud; devices do not do any learning from their environment or experience. Besides inference, it will also be essential to push training to the edge
- Latency: for many applications the delay will be unacceptable (e.g. the high latency risk of sending signal data to the cloud for self-driving prediction, even with 5G networks)
- Context: devices will soon need to be powerful enough to learn at the edge of the network. Devices will be used in situ, and those locales will be increasingly contextualized; the environment where the device is placed will be a key input to its operation, allowing the network to learn from the experience of edge devices and the environment
Source: Artificial Intelligence: 10 Trends to Watch in 2017 and Beyond by Tractica | Expect Deeper and Cheaper Machine Learning by IEEE Spectrum | MIT Technology Review | Google Rattles the Tech World with a New AI Chip for All by Wired | Back to the Edge: AI Will Force Distributed Intelligence Everywhere by Azeem | When Moore's Law Met AI - Artificial Intelligence and the Future of Computing by Azeem

Going forward, we are likely to see Federated Learning - a multi-faceted infrastructure where learning happens at the edge of the network and in the cloud
Federated Learning
- Allows for smarter models, lower latency and lower power consumption, while availing differential privacy and personalized experiences
- Allows the network to learn from the experience of many edge devices and their experiences of the environment
- In a federated environment, edge devices could do some learning and efficiently send back deltas (or weights) to the cloud, where a central model could be more efficiently updated, instead of sending their raw experiential data back to the cloud for analysis (see the sketch below)
- Differential privacy also ensures that the aggregate data in a database capture significant patterns while protecting individual privacy
Source: Artificial Intelligence: 10 Trends to Watch in 2017 and Beyond by Tractica | Expect Deeper and Cheaper Machine Learning by IEEE Spectrum | MIT Technology Review | Google Rattles the Tech World with a New AI Chip for All by Wired | Back to the Edge: AI Will Force Distributed Intelligence Everywhere by Azeem | When Moore's Law Met AI - Artificial Intelligence and the Future of Computing by Azeem
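A minimal sketch of the federated pattern described above, in the style of federated averaging: each device trains locally and returns only a weight delta, and the cloud averages the deltas into the shared model. The local least-squares updates, the plain averaging, and all names are illustrative assumptions rather than the deck's specification.

import numpy as np

def local_update(global_w, X, y, lr=0.1, steps=20):
    """Train briefly on one device's private data; return only the weight delta."""
    w = global_w.copy()
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(X)   # local gradient on device data
        w -= lr * grad
    return w - global_w                      # delta sent to the cloud, not raw data

rng = np.random.default_rng(0)
global_w = np.zeros(3)

# Simulate one round with three edge devices, each holding its own data.
deltas = []
for _ in range(3):
    X = rng.standard_normal((50, 3))
    y = X @ np.array([1.0, -2.0, 0.5])
    deltas.append(local_update(global_w, X, y))

# Cloud update: average the deltas and apply them to the central model.
global_w += np.mean(deltas, axis=0)
print(global_w)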
Google designed its original TPU for execution (i.e. inference); its new Cloud TPU offers a chip that handles training as well
- Amazon and Microsoft offer GPU processing via cloud services, but they do not offer bespoke AI chips for both training and executing neural networks
[Photo: Google's Cloud TPU]

Bitmain claims to have built 70% of all the computers on the Bitcoin network
- It makes specialized chips to perform the critical hash functions involved in mining and trading bitcoins, and packages those chips into the top mining rig, the Antminer S9
- In 2017, Bitmain unveiled details of its new AI chip, the Sophon BM1680, specialized for both training and executing deep learning algorithms
[Photo: Bitmain's Sophon BM1680]