The five technical challenges Cerebras overcame in building the first trillion-transistor chip

Superlatives abound at Cerebras, the until-today stealthy next-generation silicon chip company looking to make training a deep learning model as quick as buying toothpaste from Amazon. The company unveiled its Wafer Scale Engine today at the Hot Chips conference at Stanford: a single chip with more than a trillion transistors and hundreds of thousands of cores, to no small amount of awe from attendees.

You can read more about the chip from Tiernan Ray at Fortune and read the white paper from Cerebras itself. Superlatives aside though, the technical challenges that Cerebras had to overcome to reach this milestone are, I think, the more interesting story here.
I sat down with founder and CEO Andrew Feldman this afternoon to discuss what his 173 engineers have been building quietly just down the street here these past few years with $112 million in venture capital funding from Benchmark and others.

Going big means nothing but challenges

First, a quick background on how the chips that power your phones and computers get made.
Fabs like TSMC take standard-sized silicon wafers and divide them into individual chips by using light to etch the transistors into the chip. Wafers are circles and chips are squares, and so there is some basic geometry involved in subdividing that circle into a clean array of individual chips.

One big challenge in this lithography process is that errors can creep into the manufacturing process, requiring extensive testing to verify quality and forcing fabs to throw away poorly performing chips. The smaller and more compact the chip, the less likely any individual chip will be inoperative, and the higher the yield for the fab. Higher yield equals higher profits.

Cerebras throws out the idea of etching a bunch of individual chips onto a single wafer in favor of just using the whole wafer itself as one gigantic chip.
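To make the yield intuition above concrete, here is a back-of-the-envelope sketch using the textbook dies-per-wafer approximation and a simple Poisson defect model. The wafer size, die areas and defect density are illustrative assumptions of mine, not numbers from TSMC or Cerebras, and the model is deliberately crude.

```python
import math

def dies_per_wafer(wafer_diameter_mm, die_area_mm2):
    """Classic approximation: gross dies from wafer area, minus an edge-loss term."""
    d = wafer_diameter_mm
    gross = math.pi * (d / 2) ** 2 / die_area_mm2
    edge_loss = math.pi * d / math.sqrt(2 * die_area_mm2)
    return max(gross - edge_loss, 1)  # clamp: the approximation breaks down for a wafer-scale die

def poisson_yield(die_area_mm2, defects_per_mm2):
    """Simple Poisson defect model: probability a die has zero killer defects."""
    return math.exp(-die_area_mm2 * defects_per_mm2)

# Illustrative inputs only: a 300 mm wafer and 0.1 defects per cm^2
# (0.001 per mm^2), a plausible ballpark for a mature process, not a
# published figure from TSMC or Cerebras.
WAFER_MM, D0 = 300.0, 0.001
for die_area in (100.0, 600.0, 46_000.0):  # small die, big GPU-class die, wafer-scale die
    n = dies_per_wafer(WAFER_MM, die_area)
    y = poisson_yield(die_area, D0)
    print(f"die area {die_area:>8.0f} mm^2: ~{n:5.0f} candidate dies, zero-defect yield ~{y:.1%}")
```

Under those assumptions a conventional die loses a modest share of candidates to defects, while a flawless wafer-scale die essentially never happens, which is exactly why the redundancy scheme described below matters.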
The chip's architecture and design were led by co-founder Sean Lie. Feldman and Lie worked together on a previous startup called SeaMicro, which sold to AMD in 2012 for $334 million.

The first challenge the team took on was handling communication across the scribe lines that normally separate individual chips on a wafer. Even though the end product is one giant chip, today's lithography equipment still has to act as if it is etching an array of individual chips into the silicon wafer.
So the company had to invent new techniques to allow each of those individual chips to communicate with each other across the whole wafer. Working with TSMC, they not only invented new channels for communication, but also had to write new software to handle chips with a trillion-plus transistors.

The second challenge was yield.
With a chip covering an entire silicon wafer, a single imperfection in the etching of that wafer could render the entire chip inoperative. This has been the blocker for decades on whole-wafer technology: due to the laws of physics, it is essentially impossible to etch a trillion transistors with perfect accuracy repeatedly.

Cerebras approached the problem using redundancy, adding extra cores throughout the chip that act as backups in case an error crops up in a neighboring core on the wafer. Leaving extra cores allows the chip to essentially self-heal, routing around the lithography error and making a whole-wafer silicon chip viable.
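Cerebras has not published the details of how its cores route around defects, but the general idea of presenting a clean logical grid on top of a physical grid that contains spares can be sketched in a few lines of Python. The grid dimensions, spare count per row and remapping policy here are purely illustrative assumptions, not the company's actual design.

```python
import random

# Hypothetical region of the wafer: a 12 x 12 physical grid of cores,
# of which each row keeps 2 as spares. (Real spare ratios and layout
# are Cerebras' own; these numbers are only for illustration.)
PHYS_ROWS, PHYS_COLS, SPARES_PER_ROW = 12, 12, 2
LOGICAL_COLS = PHYS_COLS - SPARES_PER_ROW

def build_core_map(defective):
    """Map each logical core (row, col) to a working physical core in the
    same row, skipping defective cores. Returns None if any row has more
    defects than spares, i.e. the region cannot self-heal."""
    mapping = {}
    for r in range(PHYS_ROWS):
        good_cols = [c for c in range(PHYS_COLS) if (r, c) not in defective]
        if len(good_cols) < LOGICAL_COLS:
            return None
        for logical_c, phys_c in zip(range(LOGICAL_COLS), good_cols):
            mapping[(r, logical_c)] = (r, phys_c)
    return mapping

# Sprinkle a few random lithography defects and check whether the region
# can still present a full logical grid to the software above it.
random.seed(0)
defects = {(random.randrange(PHYS_ROWS), random.randrange(PHYS_COLS)) for _ in range(3)}
core_map = build_core_map(defects)
print("defective cores:", sorted(defects))
print("self-healed" if core_map else "too many defects in one row")
```

The point of the sketch is only that a small fraction of spare cores, combined with a remapping table, lets the software see an intact grid even when lithography leaves a handful of dead cores behind.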
Scribe-line communication and yield were known problems, though, and Feldman said they were actually easier to solve than expected by re-approaching them with modern tools. He likens the challenge to climbing Mount Everest: no other chip designer had gotten past the scribe-line communication and yield challenges to find out what actually happened next.

The third challenge Cerebras confronted was handling thermal expansion.
Chips get extremely hot in operation, but different materials expand at different rates. That means the connectors tethering a chip to its motherboard also need to thermally expand at precisely the same rate, lest cracks develop between the two. To handle that mismatch, Cerebras had to invent a connecting material that could absorb some of the difference in expansion.
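The size of that mismatch is easy to estimate with the standard linear-expansion relation ΔL = αLΔT. The expansion coefficients below are textbook ballpark values for silicon and FR-4-style board material; the span and temperature swing are assumptions for illustration, not Cerebras' figures.

```python
# Linear thermal expansion: delta_L = alpha * L * delta_T
ALPHA_SILICON = 2.6e-6   # per deg C, typical textbook value
ALPHA_FR4_PCB = 14e-6    # per deg C, typical in-plane value for FR-4 board material

span_m = 0.21            # assume ~21 cm across a wafer-scale chip (illustrative)
delta_t = 50.0           # assume a 50 deg C swing from idle to full load (illustrative)

growth_si = ALPHA_SILICON * span_m * delta_t
growth_pcb = ALPHA_FR4_PCB * span_m * delta_t
mismatch_um = (growth_pcb - growth_si) * 1e6

print(f"silicon grows ~{growth_si * 1e6:.0f} um, board grows ~{growth_pcb * 1e6:.0f} um")
print(f"mismatch across the span: ~{mismatch_um:.0f} um")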
The fourth challenge comes after fabrication. Once a chip is manufactured, it needs to be tested and packaged for shipment to original equipment manufacturers (OEMs) who add the chips into the products used by end customers (whether data centers or consumer laptops). There is a challenge though: absolutely nothing on the market is designed to handle a whole-wafer chip, so Cerebras had to design its own testing and packaging equipment.
"That is the truth," Feldman said. "Nobody had a printed circuit board this size. Nobody had connectors. Nobody had a cold plate. Nobody had tools. Nobody had tools to align them. Nobody had tools to handle them."
The fifth and final challenge is power and cooling: packing that much processing power into one chip requires immense amounts of both. The chip draws a prodigious amount of power for a single piece of silicon, roughly comparable to a modern-sized AI cluster. All that power also needs to be cooled, and Cerebras had to design a new way to deliver both for such a large chip.

It essentially approached the problem from above: rather than trying to move power and cooling horizontally across the chip, as is traditional, power and cooling are delivered vertically at all points across the chip, giving every region even and consistent access to both.
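A crude resistance estimate shows why edge-fed, horizontal power delivery stops working at this scale. Every number below (total current, copper thickness, via-stack geometry) is a hypothetical round figure chosen only to illustrate the horizontal-versus-vertical argument, not a Cerebras specification.

```python
# Crude IR-drop comparison: feeding current across the chip horizontally
# versus dropping it in vertically from directly above.
RHO_COPPER = 1.7e-8         # ohm*m, resistivity of copper (standard value)

current_a = 1000.0          # assume ~1 kA of supply current at ~1 V logic (illustrative)
span_m = 0.21               # assume a ~21 cm wafer-scale die edge (illustrative)
plane_thickness_m = 70e-6   # assume a 70 um copper power plane (illustrative)

# Horizontal: current traverses the full span through a plane whose
# cross-section is (span * thickness).
r_horizontal = RHO_COPPER * span_m / (span_m * plane_thickness_m)
drop_horizontal = current_a * r_horizontal

# Vertical: current only crosses a short stack of vias/posts, say ~2 mm,
# spread over (assume) 10% of the die area.
stack_height_m = 2e-3
effective_area_m2 = 0.10 * span_m * span_m
r_vertical = RHO_COPPER * stack_height_m / effective_area_m2
drop_vertical = current_a * r_vertical

print(f"horizontal delivery: ~{drop_horizontal:.2f} V of IR drop")
print(f"vertical delivery:   ~{drop_vertical * 1e6:.1f} microvolts of IR drop")
```

Losing a sizable fraction of a roughly one-volt supply to resistance is untenable, while short vertical paths spread over the whole area are negligible, which is the intuition behind delivering power (and, analogously, cooling) from directly above.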
Those are the five challenges that the company has worked around the clock to solve these past few years.

From theory to reality

Cerebras has a demo chip (I saw one, and yes, it is roughly the size of my head), and it has started to deliver prototypes to customers, according to reports.
The big challenge, though, as with all new chips, is scaling production to meet customer demand.

For Cerebras, the situation is a bit unusual: because a single chip takes up an entire wafer, customers will not need to buy dozens or hundreds of chips and stitch them together to create a compute cluster. Instead, they may only need a handful of Cerebras chips for their deep-learning needs. The company packages the chip as part of a complete system that also includes its proprietary cooling technology.

Expect to hear more details of Cerebras' technology in the coming months, particularly
as the fight over the future of deep learning processing workflows continues to heat up.