
Amazon debuts Trainium, a custom chip for machine learning training in the cloud

Amazon today debuted AWS Trainium, a chip custom-designed to deliver what the company describes as cost-effective machine learning model training in the cloud. The announcement comes ahead of the availability of new Amazon Elastic Compute Cloud (EC2) instances built specifically for machine learning training and powered by Intel's new Habana Gaudi processors.

“We know that we want to keep pushing the price performance on machine learning training, so we’re going to have to invest in our own chips,” AWS CEO Andy Jassy said during a keynote address at Amazon’s re:Invent conference this morning. “You have an unmatched array of instances in AWS, coupled with innovation in chips.”

Amazon AWS Trainium

Amazon claims that Trainium will offer the most teraflops of any machine learning instance in the cloud, where one teraflop corresponds to one trillion floating-point calculations per second. (Amazon is quoting 30% higher throughput and 45% lower cost-per-inference compared with the standard AWS GPU instances.) When Trainium becomes available to customers in the second half of 2021, as EC2 instances and in SageMaker, Amazon’s fully managed machine learning development platform, it will support popular frameworks including Google’s TensorFlow, Facebook’s PyTorch, and Apache MXNet. Moreover, Amazon says it will use the same Neuron SDK as Inferentia, the company’s cloud-hosted chip for machine learning inference.
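To make the teraflop figure concrete, here is a back-of-the-envelope sketch of how a chip's rated throughput bounds training time. All numbers below are illustrative assumptions, not Amazon's specifications:

```python
# Back-of-the-envelope: how long a fixed training workload takes
# at a given sustained throughput. All figures are hypothetical.

def training_time_seconds(total_flops: float, teraflops: float) -> float:
    """Seconds needed to execute `total_flops` floating-point operations
    at a sustained rate of `teraflops` trillion operations per second."""
    return total_flops / (teraflops * 1e12)

# A model requiring 1e18 operations on a hypothetical 100-TFLOPS chip:
secs = training_time_seconds(1e18, 100.0)
print(f"{secs:.0f} s (~{secs / 3600:.1f} h)")  # 10000 s (~2.8 h)
```

In practice, sustained throughput falls well short of a chip's peak rating, so figures like this are an optimistic lower bound on training time.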

“While Inferentia addressed the cost of inference, which constitutes up to 90% of ML infrastructure costs, many development teams are also limited by fixed ML training budgets,” AWS wrote in a blog post. “This puts a cap on the scope and frequency of training needed to improve their models and applications. AWS Trainium addresses this challenge by providing the highest performance and lowest cost for ML training in the cloud. With both Trainium and Inferentia, customers will have an end-to-end flow of ML compute from scaling training workloads to deploying accelerated inference.”

Absent benchmark results, it’s unclear how Trainium’s performance might compare with Google’s tensor processing units (TPUs), the search giant’s chips for AI training workloads hosted in Google Cloud Platform. Google says its forthcoming fourth-generation TPU offers more than double the matrix multiplication teraflops of a third-generation TPU. (Matrices are often used to represent the data that feeds into AI models.) It also offers a “significant” boost in memory bandwidth while benefiting from unspecified advances in interconnect technology.
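Matrix multiplication dominates these workloads, which is why vendors quote matrix-multiply teraflops: multiplying an (m × k) matrix by a (k × n) matrix takes roughly 2·m·k·n floating-point operations, one multiply and one add per inner-product term. A minimal sketch (the shapes chosen here are arbitrary):

```python
import numpy as np

def matmul_flops(m: int, k: int, n: int) -> int:
    """Approximate FLOP count for an (m x k) @ (k x n) multiply:
    one multiply and one add per inner-product term."""
    return 2 * m * k * n

a = np.random.rand(512, 256)
b = np.random.rand(256, 1024)
c = a @ b                            # the operation AI accelerators optimize
print(matmul_flops(512, 256, 1024))  # 268435456
```

Because a model's layers are chained matrix multiplies like this one, the matrix-multiply rating is a reasonable first-order proxy for training throughput.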

Machine learning deployments have historically been constrained by the size and speed of algorithms and the need for costly hardware. In fact, a report from MIT found that machine learning might be approaching computational limits. A separate Synced study estimated that the University of Washington’s Grover fake news detection model cost $25,000 to train in about two weeks. OpenAI reportedly racked up a whopping $12 million to train its GPT-3 language model, and Google spent an estimated $6,912 training BERT, a bidirectional transformer model that redefined the state of the art for 11 natural language processing tasks.
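Those dollar figures follow from simple cloud economics: cost scales with instance-hours. A hedged sketch showing how an estimate like Grover's can arise (the instance count and hourly rate here are assumptions, not AWS list prices):

```python
def training_cost(num_instances: int, hours: float, hourly_rate: float) -> float:
    """Total cost of a cloud training run: instances x hours x price/hour."""
    return num_instances * hours * hourly_rate

# e.g. 8 hypothetical accelerator instances running for two weeks
# at an assumed $9.30/hour each:
print(f"${training_cost(8, 14 * 24, 9.30):,.2f}")  # $24,998.40
```

Under these assumed numbers the total lands near the $25,000, two-week Grover estimate, which illustrates why fixed training budgets cap how often teams can retrain.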

Amazon has increasingly leaned into AI and machine learning training and inferencing services as demand in the enterprise grows. According to one estimate, the global machine learning market was valued at $1.58 billion in 2017 and is expected to reach $20.83 billion by 2024. In November, Amazon announced that it had shifted part of the computing for Alexa and Rekognition to Inferentia-powered instances, aiming to make the work faster and cheaper while moving it away from Nvidia chips. At the time, the company claimed the shift to Inferentia for some of its Alexa work resulted in 25% better latency at a 30% lower cost.
