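Alongside Moore's law came Dennard scaling: as transistor density increases, the power consumed by each transistor falls, so the power per square millimeter of silicon stays nearly constant. Dennard scaling began to slow markedly in 2007, however, and had all but broken down by 2012.
Against this backdrop, at the 2021 T-EDGE Global Innovation Conference, co-hosted by TMTPost (钛媒体) and the National New Media Industry Base, John Hennessy, chairman of the board of Google's parent company Alphabet, 2017 Turing Award laureate, and former president of Stanford University, delivered a talk titled "Trends and Challenges in Deep Learning and Semiconductor Technologies."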
Hello, I'm John Hennessy, professor of computer science and electrical engineering at Stanford University, and co-winner of the Turing Award in 2017.
It's my pleasure to participate in the 2021 T-EDGE conference.
Today I'm going to talk to you about the trends and challenges in deep learning and semiconductor technologies, and about how these two technologies, one a critical building block for computing and the other an incredible new breakthrough in how we use computers, are interacting and conflicting, and how they might go forward.
AI has been around for roughly 60 years, and for many years it continued to make progress, but at a slow rate, much slower than many of the early prophets of AI had predicted.
And then there was a dramatic breakthrough around deep learning. There were several smaller examples, but AlphaGo defeating the world's Go champion, at least ten years before it was expected, was certainly a dramatic breakthrough. It relied on deep learning technologies, and it exhibited what even professional Go players would call creative play.
That was the beginning of a world change.
Today we've seen many other deep learning breakthroughs where deep learning is being used for complex problems. It is obviously crucial for image recognition, which enables self-driving cars. It is becoming more and more useful in medical diagnosis, for example looking at images of skin to tell whether or not a lesion is cancerous. And it has applications in natural language, particularly around machine translation.
For Latin-based languages, machine translation is now basically as good as professional translators, and it is improving constantly for Chinese to English, a much more challenging translation problem, where we are nonetheless seeing significant progress.
Most recently we've seen AlphaFold 2, DeepMind's approach to using deep learning for protein folding, which advanced the field by at least a decade in terms of what is doable when applying this technology to biology, and which is going to dramatically change the way we do drug discovery in the future.
What drove this incredible breakthrough in deep learning? Clearly the technology concepts have been around for a while, and in fact in many cases they had been discarded earlier.
So why was it able to make this breakthrough now?
First of all, we had massive amounts of data for training. The Internet is a treasure trove of data that can be used for training. ImageNet was a critical tool for training image recognition. Today, close to 100,000 objects are on ImageNet, with more than 1,000 images per object, enough to train image recognition systems really well. So that was one key.
Obviously we have lots of other data we're using here. Whether it's protein folding, medical diagnosis, or natural language, we're relying on data that's available on the Internet and that has been accurately labeled so it can be used for training.
Second, we were able to marshal massive computational resources, primarily through large data centers and cloud-based computing. Training takes hours and hours using thousands of specialized processors. We simply didn't have this capability earlier. So that was crucial to solving the training problem.
I want to emphasize that training is the computationally intensive problem here. Inference is much simpler by comparison. Here you see the rate of growth of performance demand, in petaflop/s-days, needed to train a series of models. Training AlphaZero, for example, requires 1,000 petaflop/s-days, roughly a week on the largest computers available in the world.
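To make the unit concrete: one petaflop/s-day is 10^15 operations per second sustained for a full day. The short sketch below converts the 1,000 petaflop/s-days quoted for AlphaZero into a total operation count and into the sustained rate that "roughly a week" implies. The function names are illustrative and do not come from the talk.

```python
SECONDS_PER_DAY = 24 * 60 * 60
PETA = 10**15


def pfs_days_to_operations(pfs_days):
    """Total operations contained in the given number of petaflop/s-days."""
    return pfs_days * PETA * SECONDS_PER_DAY


def sustained_petaflops(pfs_days, wall_clock_days):
    """Sustained petaflop/s needed to finish the work in wall_clock_days."""
    return pfs_days / wall_clock_days


total_ops = pfs_days_to_operations(1000)
rate = sustained_petaflops(1000, 7)
print(f"{total_ops:.2e} operations; about {rate:.0f} petaflop/s sustained for one week")
```

Read this way, the "largest computers available in the world" in the talk are machines that can sustain on the order of a hundred petaflop/s for days at a time.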
This demand has actually been growing faster than Moore's law. So the demand is going up faster than anything semiconductors ever delivered, even in the very best era. We've seen a 300,000-fold increase in compute from training simple models like AlexNet up to AlphaGo Zero, and new models like GPT-3 have billions of parameters that need to be set. So the training, and the amount of data these models have to look at, is truly massive. And that's where the real challenge comes.
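A quick way to see why a 300,000-fold increase outruns Moore's law is to count doublings. The sketch below assumes the two-year doubling period quoted for Moore's law in this talk; the function names are illustrative.

```python
import math


def doublings(growth_factor):
    """Number of doublings needed to reach the given growth factor."""
    return math.log2(growth_factor)


def years_at_moore_pace(growth_factor, years_per_doubling=2.0):
    """Years needed to reach the growth factor at a fixed doubling period."""
    return doublings(growth_factor) * years_per_doubling


print(f"{doublings(300_000):.1f} doublings")
print(f"{years_at_moore_pace(300_000):.1f} years at one doubling every two years")
```

At a two-year doubling period, a 300,000-fold gain would take more than three decades, far longer than the gap between AlexNet and AlphaGo Zero.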
Moore's law, in the version that Gordon Moore gave in 1975, predicted that semiconductor density would continue to grow quickly, basically doubling every two years, but we began to diverge from that. The divergence really began around 2000, and the gap is growing even wider. As Gordon said on the 50th anniversary of the first prediction: no exponential is forever. Moore's law is not a theorem or something that definitely must hold true. It's an ambition that the industry was able to focus on and keep on target. If you look at this curve, you'll notice that over roughly 50 years we fell short by only a factor of about 15 while gaining a factor of almost 10,000.
So we've largely been able to stay on this curve, but we began diverging. And when you factor in the increasing cost of new fabs and new technologies, you see that this curve, when converted to price per transistor, is not dropping nearly as fast as it once did.
We have also faced another problem, which is the end of so-called Dennard scaling. Dennard scaling is an observation made by Robert Dennard, the inventor of the DRAM that is ubiquitous in computing technology. He observed that as dimensions shrank, so would the voltage and other electrical characteristics, and that this would result in nearly constant power per square millimeter of silicon. Because the number of transistors in each square millimeter was going up dramatically from one generation to the next, the power per computation was actually dropping quite quickly. That really came to a halt around 2007, and you see this red curve, which was going up slowly at the beginning, really begin to take off between 2000 and 2007. That meant that power was really the key issue, and figuring out how to get energy efficiency would become more and more important as these technologies went forward.
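The reasoning behind Dennard scaling can be sketched with the standard idealized model of dynamic power, P = C · V² · f. This model and the scaling factor `k` are textbook assumptions, not something stated in the talk, and the function name is hypothetical. Shrinking every dimension by `k` cuts capacitance and voltage by `k`, raises the clock by `k`, and packs `k²` more transistors into the same area.

```python
def power_density_ratio(k, voltage_scales=True):
    """Power per unit area after shrinking dimensions by k, relative to before.

    Idealized model: dynamic power per transistor = C * V**2 * f.
    """
    capacitance = 1 / k                          # smaller devices, less capacitance
    voltage = 1 / k if voltage_scales else 1.0   # post-Dennard: voltage stops shrinking
    frequency = k                                # shorter gate delay, faster clock
    power_per_transistor = capacitance * voltage**2 * frequency
    transistors_per_area = k**2
    return power_per_transistor * transistors_per_area


print(power_density_ratio(1.4))                        # classic scaling: about 1
print(power_density_ratio(1.4, voltage_scales=False))  # voltage stuck: about k**2
```

Once voltage can no longer be reduced, the same shrink raises power density by roughly `k²` in this model, which is the squeeze described above.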
The combined result of this is that we've seen a leveling off of uniprocessor performance, that is, single-core performance. The early period of the industry saw rapid growth of roughly 25% a year. Then came a remarkable period, with the introduction of RISC technologies and instruction-level parallelism, of over 50% a year, and then a slower period which focused very much on multicore and on building on these technologies.
In the last two years, we've seen less than 5% improvement in performance per year. Even if you look at multicore designs, with the inefficiencies that come with them, you see that they don't significantly improve things.
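The annual growth rates quoted above are easier to compare as doubling times. A minimal sketch (the function name and era labels are illustrative):

```python
import math


def doubling_time_years(annual_growth):
    """Years for performance to double at a constant annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth)


eras = [("early era", 0.25), ("RISC and ILP era", 0.50), ("last two years", 0.05)]
for label, rate in eras:
    years = doubling_time_years(rate)
    print(f"{label}: {rate:.0%} per year doubles performance in about {years:.1f} years")
```

Since the talk says "less than 5%", the doubling time for recent single-core performance is, if anything, longer than the figure this prints for 5%.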
And indeed we are in the era of dark silicon, where multicore processors often slow down or shut off a core to prevent overheating, and that overheating comes from power consumption.
So what are we going to do? We're in a dilemma. We've got a new technology, deep learning, which seems able to solve problems that we never thought we could solve quite so effectively, but it requires massive amounts of computing power to go forward. At the same time, the slowing of Moore's law and the end of Dennard scaling are creating a squeeze on the industry's ability to do what it relied on for many years, namely just getting the next generation of semiconductor technology so that everything gets faster.
So we have to think about a new solution. There are three possible directions to go.
Software-centric mechanisms, where we look at improving the efficiency of our software so it makes more efficient use of the hardware. Consider in particular the move to scripting languages such as Python, which are dynamically typed. They make programming very easy, but they're not terribly efficient, as you will see in just a second.
Hardware-centric approaches. Can we change the way we think about the architecture of these machines to make them much more efficient? This approach is called domain-specific architectures, or domain-specific accelerators. The idea is to do just a few tasks, but to tune the hardware to do those tasks extremely well. We've already seen examples of this in graphics, for example, or in the modem that's inside your cell phone. Those are special-purpose architectures that use intensive computational techniques but are not general purpose. They are not programmed for arbitrary things. They are designed to do a range of graphics operations, or the operations required by a modem.
And then of course some combination of these. Can we come up with languages that match these new domain-specific architectures? Domain-specific languages can improve efficiency and let us code a range of applications very effectively.
This is a fascinating slide from a paper by Charles Leiserson and his colleagues at MIT, published in Science, called "There's plenty of room at the Top."
What they observe is that software inefficiency, and the inefficiency of matching software to hardware, mean that we have lots of opportunity to improve performance. They took an admittedly very simple program, matrix multiply, written initially in Python, and ran it on an 18-core Intel processor. Simply by rewriting the code from Python to C, they got a factor of 47 improvement. Then introducing parallel loops gave them another factor of approximately eight.
Then they introduced memory optimizations. If you're familiar with large-scale matrix multiply, doing it in a blocked fashion can dramatically improve the ability to use the caches effectively, and from that they got another factor of a little under 20, about 15. And then finally, using the vector instructions inside the Intel processor, they were able to gain another factor of 10. Overall, this final program runs more than 62,000 times faster than the initial Python program.
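The "blocked" memory optimization mentioned above can be illustrated with a small sketch. This is not the code from the paper: it is a plain-Python illustration of the loop structure only, with hypothetical function names and an arbitrary `block` size. In pure Python it will not run faster, because the benefit comes from cache reuse in a compiled language, where each small tile of the matrices stays in cache while it is being used.

```python
def matmul_naive(a, b):
    """Textbook triple loop: c[i][j] = sum over k of a[i][k] * b[k][j]."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_blocked(a, b, block=2):
    """Same arithmetic, but iterating tile by tile so each tile is reused."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i0 in range(0, n, block):
        for k0 in range(0, m, block):
            for j0 in range(0, p, block):
                # Work on one block-by-block tile of a, b, and c at a time.
                for i in range(i0, min(i0 + block, n)):
                    for k in range(k0, min(k0 + block, m)):
                        a_ik = a[i][k]
                        for j in range(j0, min(j0 + block, p)):
                            c[i][j] += a_ik * b[k][j]
    return c


a = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
b = [[1, 0, 2, 1, 3], [0, 1, 1, 2, 0], [4, 1, 0, 1, 2], [2, 2, 1, 0, 1]]
print(matmul_blocked(a, b) == matmul_naive(a, b))
```

Both functions perform exactly the same multiplications and additions; only the order of memory accesses changes, which is why blocking is a pure efficiency gain on real hardware.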
Now this is not to say that you would get this for larger-scale programs or in all kinds of environments, but it's an example of how much inefficiency there is, at least for one simple application. Of course, not many performance-sensitive things are written in Python, but even the improvement from C to the fully parallel version of C that uses SIMD instructions is similar to what you would get if you used a domain-specific processor. It is significant in its own right. That's nearly a factor of 100; more than 100, in fact, almost 150.
So there are lots of opportunities here, and that's the key point behind this slide.
So what are these domain-specific architectures? They are architectures that achieve higher efficiency by tailoring the architecture to the characteristics of the domain.
We're not trying to do just one application; we're trying to do a domain of applications, like deep learning, computer graphics, or virtual reality applications. So it's different from a strict ASIC that is designed to do only one function, like a modem for example.
It requires more domain-specific knowledge. So we need to have a language which conveys important properties of the application that are hard to deduce if we start with a low-level language like C. This is a product of co-design. We design the applications and the domain-specific processor together, and that's critical to get these to work together.
Notice that these are not going to be things on which we run general-purpose applications. The intention is not that we take a hundred C programs. The intention is that we take an application designed to run on that particular DSA, and we use a domain-specific language to convey from the application to the processor the information it needs to get significant performance improvements.
The key goal here is to achieve higher efficiency in the use of both power and transistors. Remember, those are the two limiters: the rate at which transistor growth is going forward, and the issue of power from the lack of Dennard scaling. So we're trying to really improve the efficiency of that.
Good news? The good news here is that deep learning is a broadly applicable technology. It's the new programming model: programming with data rather than writing massive amounts of highly specialized code. We use data to train a deep learning model to detect that kind of specialized circumstance in the data.
And so we have a good target domain here. We have applications which really demand massive increases in performance, and for which we think there are appropriate domain-specific architectures.
It's important to understand why these domain-specific architectures can win. In particular, there's no magic here.
People who are familiar with the books Dave Patterson and I co-authored know that we believe in quantitative analysis and an engineering, scientific approach to designing computers. So what makes these domain-specific architectures more efficient?
First of all, they use a simple model for parallelism that works in a specific domain, and that means they can have less control hardware. So for example, we switch from the multiple instruction, multiple data (MIMD) model of a multicore to a single instruction, multiple data (SIMD) model. That means we dramatically reduce the energy associated with fetching instructions, because now we have to fetch one instruction rather than many instructions.
We move to VLIW rather than speculative out-of-order mechanisms, that is, to designs that rely on being able to analyze the code better, know about dependences, and therefore create and structure parallelism at compile time rather than having to do it dynamically at runtime.
Second, we make more effective use of memory bandwidth. We go to user-controlled memory systems rather than caches. Caches are great, except when you have large amounts of data streaming through them. Then they're extremely inefficient; that's not what they were meant to do. They are meant to work when the program does repetitive things but in a somewhat unpredictable fashion. Here we have repetitive things in a very predictable fashion, but very large amounts of data.
So we go to an alternative, using prefetching and other techniques to move data into the memory within the domain-specific processor. Once we get the data there, we can make heavy use of it before moving it back to main memory.
We eliminate unneeded accuracy. It turns out we need much less accuracy here than we do for general-purpose computing. In the case of integers, we need 8-bit to 16-bit integers. In the case of floating point, we need 16-bit to 32-bit numbers, not 64-bit floating-point numbers. So we gain efficiency both by making data items smaller and by making the arithmetic operations more efficient.
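As a rough illustration of trading accuracy for smaller data items, the sketch below quantizes floating-point values to 8-bit integers with a single per-vector scale and computes a dot product using integer arithmetic. The scheme (symmetric scaling to the range -127..127) and all names are assumptions chosen for simplicity; the talk does not describe a specific quantization method.

```python
from array import array


def quantize_int8(values):
    """Map floats to signed 8-bit integers using one shared scale factor."""
    largest = max(abs(v) for v in values)
    scale = largest / 127.0 if largest > 0 else 1.0
    quantized = array("b", (round(v / scale) for v in values))  # "b" = signed 8-bit
    return quantized, scale


def int8_dot(qa, scale_a, qb, scale_b):
    """Dot product with integer multiply-accumulate, rescaled once at the end."""
    accumulator = sum(x * y for x, y in zip(qa, qb))
    return accumulator * scale_a * scale_b


weights = [0.5, -1.0, 0.25, 0.75]
activations = [1.0, 0.5, -0.5, 0.25]
exact = sum(w * a for w, a in zip(weights, activations))

qw, sw = quantize_int8(weights)
qa, sa = quantize_int8(activations)
approx = int8_dot(qw, sw, qa, sa)

print(f"exact={exact:.4f} approx={approx:.4f} bytes per value={qw.itemsize}")
```

Each quantized value occupies one byte instead of the eight bytes of a 64-bit float, and the inner loop needs only small integer multiplies, at the cost of a small approximation error.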
The key is that the domain-specific programming model matches the application to the processor. These are not general-purpose processors. You are not going to take a piece of C code, throw it on one of these processors, and be happy with the results. They're designed to match a particular class of applications, and that structure is determined by the interface between the domain-specific language and the underlying architecture.
So this just shows you an example, so you get an idea of how we're using silicon rather differently in these environments than we would in a traditional processor.
What I've done here is take the first-generation TPU, TPU-1, the first tensor processing unit from Google, but I could take the second, third, or fourth and the numbers would be very similar. I show you a block diagram of what the chip area is devoted to. There's a very large matrix multiply unit that can do 256 x 256 8-bit multiplies, and the later generations actually have floating-point versions of that multiplier. It has a unified buffer used as local memory for activations, accumulators, a little bit of control, and interfaces to DRAM.
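To get a feel for the size of a 256 x 256 multiply array, the sketch below counts multiply-accumulate units and derives a peak operation rate. The clock frequency is an assumption added for illustration (the talk does not state one), and counting each multiply-accumulate as two operations is a common convention, also an assumption here.

```python
ASSUMED_CLOCK_HZ = 700_000_000  # assumption for illustration; not stated in the talk


def macs_per_cycle(rows, cols):
    """Multiply-accumulate units in a rows x cols matrix multiply array."""
    return rows * cols


def peak_ops_per_second(rows, cols, clock_hz):
    """Peak rate, counting each multiply-accumulate as two operations."""
    return 2 * macs_per_cycle(rows, cols) * clock_hz


units = macs_per_cycle(256, 256)
peak = peak_ops_per_second(256, 256, ASSUMED_CLOCK_HZ)
print(f"{units} multiply-accumulate units, peak {peak / 1e12:.1f} tera-operations per second")
```

The point of the arithmetic is that a single, simply controlled array performs tens of thousands of 8-bit operations every cycle, which is where the compute share of the chip area goes.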
Today that would be high-bandwidth DRAM; early on it was DDR3. Now look at the way in which the area is used: 44% is used for memory, to store temporary results, weights, and the things being computed. Almost 40% is used for compute, 15% for the interfaces, and 2% for control.
Compare that to a single Skylake core from an Intel processor. In that case, 33% is used for cache. So notice that we have more memory capacity in the TPU than we have on the Skylake core. In fact, if you were to remove the tags from the cache, because they are overhead and not real data, the difference would be even larger. The amount on the Skylake core would probably drop to about 30%, so the TPU has almost 50% more area being used for active data.
30% of the area is used for control. That's because the Skylake core is an out-of-order, dynamically scheduled processor, like most modern general-purpose processors, and that requires significantly more area for control, roughly 15 times more. That control is overhead. Unfortunately, the control unit is also energy intensive, so it's a big power consumer too. 21% is used for compute.
So notice that the big advantage here is that the compute area is almost double what it is in a Skylake core. Then there's memory management overhead and, finally, miscellaneous overhead. So the Skylake core is using a lot more area for control, a lot less for compute, and somewhat less for memory.
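The area comparison can be checked with a few lines of arithmetic. The percentages below are the rounded figures quoted in the talk ("almost 40%" is entered as 40); the split of the remaining Skylake area between memory management and miscellaneous overhead is not given, so it is left out. The variable names are illustrative.

```python
# Percent of chip (or core) area, as quoted in the talk.
TPU_AREA = {"memory": 44, "compute": 40, "interfaces": 15, "control": 2}
SKYLAKE_AREA = {"cache": 33, "control": 30, "compute": 21}

control_ratio = SKYLAKE_AREA["control"] / TPU_AREA["control"]
compute_ratio = TPU_AREA["compute"] / SKYLAKE_AREA["compute"]
skylake_unlisted = 100 - sum(SKYLAKE_AREA.values())

print(f"Skylake spends {control_ratio:.0f}x the area share on control")
print(f"TPU spends {compute_ratio:.1f}x the area share on compute")
print(f"{skylake_unlisted}% of the Skylake core is memory management and miscellaneous overhead")
```

The ratios line up with the "roughly 15 times" and "almost double" figures given above.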
So where does this bring us? We've come to an interesting time in the computing industry, and I want to conclude by reflecting on this and saying something about how things are likely to go forward, because I think we're at a real turning point in the history of computing.
From the 1960s, with the introduction of the first real commercial computers, to 1980, we had largely vertically integrated companies.
IBM, Burroughs, Honeywell, and the early spin-outs from the activity at the University of Pennsylvania that built ENIAC, the first electronic computer, were all vertically integrated.
IBM is the perfect example of a vertically integrated company in that period. They did everything. They built their own chips. They built their own disks; in fact, the West Coast operation of IBM here in California was originally opened to do disk technology, and the first Winchester disks were built on the West Coast.
They built their own processors: the 360 series, the 370 series, and so on. On top of that they built their own operating systems and their own compilers. They even built their own databases. They built their networking software. In some cases they even built application programs, but certainly the core of the system, from the fundamental hardware up through the databases, OS, and compilers, was all built by IBM. The driver here was technical concentration. IBM could put together the expertise across this wide set of things, assemble a world-class team, and really optimize across the stack in a way that enabled their operating system to do things such as virtual memory long before other commercial systems could.
And then the world changed, really changed, with the introduction of the personal computer, and the microprocessor began to take off.
Then we changed from a vertically organized industry to a horizontally organized industry. We had silicon manufacturers: Intel, for example, doing processors, along with AMD and initially several other companies such as Fairchild and Motorola. We had a company like TSMC arise as a silicon foundry, making silicon for others. That was something that didn't exist earlier, but in the late 80s and the 90s it really began to take off, and it enabled other people to build chips for graphics or other functions outside the processor.
But Intel didn't do everything. Intel did the processors, and Microsoft then came along and did the OS and compilers on top of that. And companies like Oracle came along and built their databases and other applications on top of that. So it became a very horizontally organized industry. The key driver behind this was obviously the introduction of the personal computer.
Then came the rise of shrink-wrap software, something a lot of us did not foresee but which became a crucial driver. It meant that the number of architectures you could easily support had to be kept fairly small, because the companies producing shrink-wrap software did not want to have to port their software to lots of different architectures and verify that it worked on each of them.
And of course there was the dramatic growth of the general-purpose microprocessor. This is the period in which the microprocessor replaced all other technologies, including the largest supercomputers. And I think it happened much faster than we expected. By the mid-80s the microprocessor had put a serious dent in the minicomputer business, by the early 90s the mainframe business was struggling, and by the mid-90s to 2000s it was really taking a bite out of the supercomputer industry. So even the supercomputer industry converted from customized special architectures to arrays of these general-purpose microprocessors. They were just far too efficient in terms of cost and performance to be ignored.
Now we're all of a sudden in a new era. It's a new era not because general-purpose processors are going to go away completely. They're going to remain important, but they'll be less central to the drive to the edge, to the very fastest, most important applications, where domain-specific processors will begin to play a key role. So rather than such a horizontal organization, we will again see more vertical integration between the people who have the models for deep learning and machine learning systems and the people who build the OS and compilers that enable those models to run efficiently, train efficiently, and be deployed in the field.
Inference is a critical part. It means that when we deploy these in the field, we will probably have lots of very specialized processors that each handle one particular problem. The processor that sits in a security camera, for example, is going to have a very limited use. The key is going to be to optimize for power and efficiency in that key use, and for cost of course. So we see a different kind of integration, and Microsoft, Google, and Apple are all looking at this.
The Apple M1 is a perfect example. It's a processor designed by Apple with a deep understanding of the applications that are likely to run on it. So it has a special-purpose graphics processor and a special-purpose machine learning accelerator, and then it has multiple cores, but even the cores are not completely homogeneous. Some are slow, low-power cores, and some are high-speed, high-performance, higher-power cores. So we see a completely different design approach, with lots more co-design and vertical integration.
We're optimizing in a different way than we did in the past, and I think this is going to slowly but surely change the entire computer industry. It's not that the general-purpose processor will go away, and it's not that the companies that make software that runs on multiple machines will completely go away, but we'll have a whole new driver, and that driver is created by the dramatic breakthroughs we've seen in deep learning and machine learning. I think this is going to make for a really interesting next 20 years.
Thank you for your kind attention, and I'd like to wish the 2021 T-EDGE conference great success. Thank you.