What just happened? Amazon has announced that it is migrating its artificial intelligence processing to custom AWS Inferentia chips. This means that Amazon’s biggest inferencing services, like virtual assistant Alexa, will run on faster, specialized silicon instead of general-purpose GPUs.
Amazon has already shifted about 80% of Alexa processing onto Elastic Compute Cloud (EC2) Inf1 instances, which use the new AWS Inferentia chips. Compared to the G4 instances, which use traditional GPUs, the Inf1 instances push throughput up by as much as 30% and cost per inference down by as much as 45%. Amazon reckons that they’re the best instances on the market for natural language and voice processing inference workloads.
Alexa works like this: the actual speaker box (or cylinder, as it may be) does basically nothing, while AWS processors in the cloud do everything. Or to put it more technically… the system kicks in once the wake word has been detected by the Echo’s on-device chip. The Echo then starts streaming the audio to the cloud in real time. Off in a data center somewhere, the audio is turned into text (this is an example of inferencing). Then, meaning is extracted from the text (another example of inferencing). Any required actions are completed, like pulling up the day’s weather information.
Once Alexa has completed your request, she needs to communicate the answer to you. What she’s supposed to say is chosen from a modular script. Then the script is turned into an audio file (another example of inferencing) and sent to your Echo device. The Echo plays the file and you decide to bring an umbrella to work with you.
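The request flow described above can be sketched as a chain of small functions. This is only an illustration of the order of the steps, not Amazon’s code: every function name, the `Intent` type, and the canned responses are hypothetical, and each “model” is a trivial stand-in so the sketch runs on its own.

```python
from dataclasses import dataclass, field


@dataclass
class Intent:
    """Hypothetical container for the meaning extracted from a request."""
    name: str
    slots: dict = field(default_factory=dict)


def speech_to_text(audio: bytes) -> str:
    # Inference step 1 (stand-in): a real system runs a speech recognition
    # model here. The "audio" in this sketch is just UTF-8 text.
    return audio.decode("utf-8").lower()


def extract_intent(text: str) -> Intent:
    # Inference step 2 (stand-in): a real system runs a language
    # understanding model here instead of a keyword check.
    if "weather" in text:
        return Intent("get_weather", {"day": "today"})
    return Intent("unknown")


def fulfill(intent: Intent) -> dict:
    # Action step (stand-in): a real service would query a weather provider.
    if intent.name == "get_weather":
        return {"forecast": "rain"}
    return {}


# Hypothetical "modular script": one response template per intent.
RESPONSES = {
    "get_weather": "Expect {forecast} today.",
    "unknown": "Sorry, I didn't catch that.",
}


def choose_response(intent: Intent, result: dict) -> str:
    return RESPONSES[intent.name].format(**result)


def text_to_speech(script: str) -> bytes:
    # Inference step 3 (stand-in): a real system synthesizes an audio file.
    return script.encode("utf-8")


def handle_request(audio: bytes) -> bytes:
    """Run the cloud-side steps in the order the article describes."""
    text = speech_to_text(audio)
    intent = extract_intent(text)
    result = fulfill(intent)
    script = choose_response(intent, result)
    return text_to_speech(script)
```

Calling `handle_request(b"What's the weather like?")` walks through all five steps. Three of them (speech to text, meaning extraction, text to speech) are the inference workloads that would run on chips like Inferentia.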
Evidently, inferencing is a big part of the job. It’s unsurprising that Amazon has invested millions of dollars into making the perfect inferencing chips.
Speaking of which, each Inferentia chip contains four NeuronCores. Each one implements a “high-performance systolic array matrix multiply engine.” More or less, each NeuronCore is a grid of a very large number of small data processing units (DPUs), each of which performs a simple multiply-and-add and hands its data straight to its neighbors, so values flow through the array in lockstep instead of making repeated trips to memory. Each Inferentia chip also has a large on-chip cache, which reduces latency by cutting down on accesses to external memory.
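To make the idea of a systolic array matrix multiply concrete, here is a minimal software simulation. It assumes a generic textbook “output-stationary” layout in which one processing unit per output element accumulates its own result. It says nothing about how NeuronCores are actually wired, and the function name `systolic_matmul` is made up for this sketch.

```python
def systolic_matmul(A, B):
    """Multiply A (M x K) by B (K x N) on a simulated M x N systolic grid.

    Rows of A enter from the left edge and columns of B enter from the top
    edge, each delayed by one cycle per row/column so that matching operands
    meet in the right unit. Every unit multiplies the two values it holds,
    adds the product to its accumulator, and passes the values on.
    """
    M, K, N = len(A), len(B), len(B[0])
    acc = [[0] * N for _ in range(M)]    # one accumulator per unit
    a_reg = [[0] * N for _ in range(M)]  # values travelling left to right
    b_reg = [[0] * N for _ in range(M)]  # values travelling top to bottom

    # The last operands reach the far corner after M + N + K - 2 cycles.
    for t in range(M + N + K - 2):
        new_a = [[0] * N for _ in range(M)]
        new_b = [[0] * N for _ in range(M)]
        for i in range(M):
            for j in range(N):
                if j == 0:
                    k = t - i  # row i is fed in with a delay of i cycles
                    new_a[i][j] = A[i][k] if 0 <= k < K else 0
                else:
                    new_a[i][j] = a_reg[i][j - 1]  # from the left neighbor
                if i == 0:
                    k = t - j  # column j is fed in with a delay of j cycles
                    new_b[i][j] = B[k][j] if 0 <= k < K else 0
                else:
                    new_b[i][j] = b_reg[i - 1][j]  # from the upper neighbor
        a_reg, b_reg = new_a, new_b

        # All units do their multiply-accumulate in the same cycle.
        for i in range(M):
            for j in range(N):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc
```

For example, `systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])` returns `[[19, 22], [43, 50]]`, the same as an ordinary matrix product. The point of the layout is that each operand is read from memory once and then reused as it moves from unit to unit, which is what makes the design attractive for the matrix-heavy math behind inferencing.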