NVIDIA's Ampere architecture has arrived at this point, and at present — at time of filming — we're working on the review for the RTX 3080, but we wanted to take a look at the block diagrams and talk about the architecture between Ampere and Turing: what the differences are and where the similarities are. There are a lot of similarities between them at a purely block-diagram level, but as we zoom into the parts, NVIDIA has changed a lot this time around for its Ampere architecture. This content is based off of first-party architectural disclosures from NVIDIA, so we have information from the press tech day, for example, where some of the architects from NVIDIA who worked on Ampere were presenting information on the architectural changes and what the meaning of them is. That means the architecture details we've received aren't marketing-level details — they're actual white paper and architectural discussion information. So this is more of an educational piece that's focused on how the product works at a deeper level. Keep in mind that, as always, we'd advise you reserve any judgment of product performance, or any hype for the product, until the actual reviews go live; that way you can make a buying decision then, with all the information. This one's just going to be focused on how stuff works.

Before that, this video is brought to you by Thermaltake's Core P3 case. The Core P3 is one of the most unique cases on the market: it can serve as an open-air standing chassis, a test bench in vertical or horizontal orientation, or as a wall-mounted showcase PC. The Core P3 now comes with a five-millimeter-thick tempered glass panel for its side, but keeps the front, top, and back open for air. The Core P3's versatility as a display piece, test bench, or standard desktop is reinforced by its price of roughly $110 on Amazon. You can learn more at the link in the description below.

One of the most obvious
improvements this time has little to do with architecture and more to do with process improvements. Turing GPUs were manufactured by TSMC using a 12nm FFN process; Ampere is being manufactured by Samsung this time, with an 8nm process — NVIDIA has been calling this "8N" as the designator. Thanks to this change, a new Ampere GPU has a higher transistor density than its preceding Turing GPUs, and that's something we'll sort of keep in mind as we open the cards up, too, and show you what the actual silicon looks like physically on the card, because that'll give us the angle of looking at the die size — which is at this point known, but once you see it on Turing versus Ampere in person, like on actual PCBs, it means a little bit more.

OK, so in the case of the GA102-300-K1-A1 die and the TU102-300-K1-A1 — from this point forward we'll just use TU102 and GA102 to reference those — for comparison, the GA102 packs in 28 billion transistors on 628 square millimeters of space, while the TU102 used 754 square millimeters to lay out 18.6 billion transistors. We've taken apart Turing cards in the past and can show those.
But right now, we can't show what Ampere looks like — at least not until the review date.

Often, a reduction in the size of the process node or transistor size is correlated with a reduction in power consumption, but with this generation of GPUs, that's not the case. The reference number for power, in TDP, of the GA102 is 320W, while the TU102 was lower, at 250W. Be careful, though, not to read this as "this card runs hotter" — it's not really that simple, and it never is. The performance per watt is reportedly better
with Ampere: at 240W TDP, Turing was running 60 FPS at 4K in NVIDIA's early marketing images, for example, while Ampere was at 80+ FPS at 4K, and we've also got some power-normalized results from NVIDIA's first-party reveals as well.

The larger part of this content, though, will talk about architecture, not the process changes. So we'll be taking a look at what's different in Ampere in terms of SMs and the layout — or the containerization, as we'll call it — of the GPU. We'll be looking at the third-gen Tensor cores, the second-gen RT cores that NVIDIA has disclosed in brief, PCIe Gen4 changes, memory changes with GDDR6X, and NVLink briefly as well. There's a whole lot more to this — the whitepaper, actually, we just received immediately after finishing this script, and we've added a little more detail in, but there's more still that could be added. The white paper is about 42 pages, just to give you an example, and this content alone is written based off of about a one-hour-long presentation from an NVIDIA architect. So the point is, there's a lot more detail we could get into, but this content will give you all the basics that you need to know for the key changes. Once we get into discussing more about cache, for example, or some of the other more minute details, or lower-down, deeper-level details, that's going to require a different content piece altogether. Just for
the sake of time, we need to start with a quick refresher course on NVIDIA's containers — containerization and identification for its GPU components.

If we look at the Turing block diagram first, you'll see that it's split into GPCs, L2 cache, and more GPCs, with memory controllers at the flanks of the image and interfaces at the vertical edges of the image. A GPC is a Graphics Processing Cluster, by NVIDIA's terminology, and a GPC contains a raster engine, then some more containers, with six TPCs per GPC. A TPC is a Texture Processing Cluster, and each TPC contains two SMs, or Streaming Multiprocessors — and we have separate block diagrams for what SMs look like. So there are 12 SMs per GPC in Turing. If we look at the big Turing block diagrams, the SMs contain the components that handle floating point and integer math, among other things, like cache, which we'll get into.
looks more like this this is theblock diagram for ga102as a reminder there will
be variationsof this at different silicon sizesbut there'll probably also be
somethingwith some of these sms disabled in factthe 3080 is one of thoseif you
need an introduction to nvidia'snaming scheme for the actual identifierthat
nvidia puts on the die so if youtake youryour video card apart and you look atthe
silicon at thediffusion barrier on top of it thatsilver barrier it'll have an
id on itand that might be something like tu102t106t104 or we might get ga 106
most likelyga104ga102 stuff like that so the basicsin a short version would be
nvidiaincrements this numberin a positive direction as the componentgets lower
endso if you're looking at ga 102 that'sgoing to be one of the higher or
highestend parts and once you get down to likega 106 for examplethat's
typically more in a 60 class gpuwe don't know exactly where that's goingto be
this generationbut typically you'll see that insomething likea 60 class cheaper
card the closer thename is to the 100 designationthe larger the silicon is the
higher endit is the closer it is to thefullest of the fall of that blockdiagram
that we've shownand once you get into things like ga 100or pascal 100now you're
looking at a data center orscientific compute or some otherprofessional classcard
and the ga 102 diagram as comparedto the taurian diagramswe see that there are
seven GPCs now, with a maximum possible SM count of 84. That doesn't mean they're all there — or all enabled, rather — but that's the maximum possible count. There are also more ROPs now, at 16 ROP units per GPC. There are still six memory controllers at each flank, although the memory subsystem has changed this generation; the PCIe interface has changed to 4.0, and NVLink, if you compare the Turing and GA diagrams, has changed to four x4 links rather than two x8 links — the same amount of total lanes, but apportioned differently. Within a GPC, there are 12 total possible SMs, or 6 total possible TPCs; some of these will be shed as the silicon gets smaller for lower-end parts, but that's what it looks like at the large scale.

Within the SM: looking at the SM block diagram, Ampere has second-
generation RT cores rather than the original Turing cores, so this is actually a major change that we'll be exploring in this content today. It also has CUDA cores that are FP32-only, or capable of either FP32 or INT32. You may not have all of those CUDA cores listed in the spec sheet available to you for floating point at any given time, because some of them could be doing other things, like integer processing, instead. In the SM block diagram, you can see the load/store units for each block, and the shared cache between the entire SM at 128KB of L1. GA102 also has 168 FP64 units, or 2 units per SM, for double precision — you'll probably want a different class of card for double-precision workloads, as is typically the case.

NVIDIA has also, critically, improved the ROPs for this generation. On the rasterization side, NVIDIA is decoupling the ROPs from the memory controller and the L2 cache, which they were previously coupled to, and is instead tying the ROPs to the GPC. In initial numbers, NVIDIA says that this improves the raster operations: it allows an increase in ROP units, and the company also claims that the performance change will be reflected — you'll see this, probably, in benchmarks of things like games. Some of your performance uplift can be directly attributed to this change in the architecture.
Here's a quick comparison table from NVIDIA. It is, unfortunately, complete with a document timestamp on it, but do your best to ignore that. The RTX 3080 FE card is running six of the total seven GPCs in Ampere and its GA102 die. We'll be curious, though, to see if there's a part between the 3080 and the 3090 — like a possible 3080 Ti closer to $1,000. There's certainly room for it, at least in the pricing stack, and potentially here as well.

As a quick note while you're looking at this table: it's not really appropriate to compare CUDA cores directly, or other core counts directly, generationally. Once you cross generations, it is non-linear. That'd be sort of like comparing a modern four-core, eight-thread CPU to an older four-core, eight-thread CPU and then extrapolating performance purely on that number and maybe the frequency — there are other changes there as well, like cache, in CPUs, and the same is true for GPUs. The newer RT cores, Tensor cores, and CUDA cores all have improvements, particularly in efficiency, that make a purely count-based comparison non-linear. So don't do just a CUDA-core-versus-CUDA-core comparison as for which card is more powerful.

For the new architecture, we'll start with the streaming multiprocessors, or the SMs, as the
basis for discussion, and a quote from an NVIDIA engineer that we used in our Turing architecture write-up can start us off here. It was from the Turing deep dive; it referred to the integer processing approach in Turing, and it read as follows: quote, "Note that while we call the unit an integer unit, it actually performs both integer and simple floating point operations, like the floating point compare and min/max operations mentioned in the other part of the content," end quote.

If we'd picked up on what NVIDIA had really been trying to say a couple of years ago, then maybe we could have predicted where NVIDIA was going with SMs back then — and that's towards a new data path. On a recent call with NVIDIA's architecture team, the company noted that its research had emphasized, once again, that many workloads are floating-point-heavy, and so it's trying to direct a hardware architectural solution towards this challenge. This need to process more floating point data prompted NVIDIA to alter the SM data path, such that the INT32 portion of the data path also has an FP32 unit in it, thereby giving the SM the ability to execute either INT or floating point operations in that given data path. The alternative to this would be, unfortunately, having units sitting around doing nothing while they wait for some kind of task. There was also a point a couple of years back — we'd like to say it was the Pascal era, it was Pascal — when NVIDIA talked about how execution of certain instructions would block other instructions, and that's where you start talking about asynchronous processing, things like that; but that was something we discussed more in the Pascal era, and then a bit in the Turing era of cards.

So each SM can execute 128 floating point 32 operations per clock, up from 64 per clock in Turing. This new data path doesn't produce a two-times performance gain, as INT32 operations must still occur for data fetches, data compares, and things like that, but this architectural change certainly contributes to the performance gains NVIDIA has alluded to.
to the performancegains that nvidia has alluded tonext section is briefly on l1
cachethere's a lot more about cash that wejustwe opened up that new white paper
andlooked at it there's a lot there todiscussperhaps in a future piece but
we'll giveyou the basics here to feed the faster128 floating point 32operations
per cycle execution of thenew smthe incoming data pipeline had to beredesigned
in ampere and so the l1cache now has double the bandwidth weneed to check on
the the precise numbersfor thatbut it is double it's got 33 morecapacity and
thatis up from 96 kilobytes to 128 kilobytesand twice the cache partition size
sothe cache partition size increase showsbenefitsspecifically with a quote long
andcomplicated shader programsas mentioned by nvidia and itsarchitecture deep
dive so together thenew data pathand the increased l1 cache are providinga
combined 2.7 times performance bumpin the sms but remember again that this2.7
times numberthat doesn't mean that your games aregoing to be 2.7 timeshigher
fps than whatever you'recomparing it touh so this is not total card performanceit'sperformance
increase for a significantpart of the card and that's importantbut we just want
to be really clear witheveryone that it's not total cardperformance because the
plot tends toget lost with that stuffso a great way to remember this or thinkabout
this would be that gallium nitridesolution that corsair introduced at ces2017
or 18 and its power supplies wherethepfc efficiency increased to somethinglike
90maybe 97.98 99 range very high 90sbut that didn't mean the entire powersupply
was 98efficient it meant that that part ofthat power supplywas 98 efficient so
you can think of thesame thing here whereit is not a direct linear increase inoverall
card performance that's not todiminish it it's tomake sure people kind of keep
a realitycheckso third gen tensor cores up next thisis an importantchange for
Ampere versus Turing. NVIDIA likes to advertise the Tensor core, or TC, as the AI portion of the GPU, effectively; they're good at doing linear algebra. For the gaming community, this mostly means improved frame rates at very high resolutions with certain technologies, like DLSS — particularly DLSS 2.0. Simplifying things: in the Turing architecture, a network was trained on dense matrices, and those dense matrices could then be fed to the second-gen Tensor cores for AI inferencing. With Ampere's architecture, NVIDIA is now utilizing the concept of sparse matrices to boost performance. NVIDIA has a blog post with more information on sparse matrices related to Ampere, if you're interested in that. We won't try to do a better job than Wikipedia at explaining a sparse matrix — this isn't something we typically work with in our part of the industry — so here's a direct quote: "In numerical analysis and scientific computing, a sparse matrix or sparse array is a matrix in which most of the elements are zero. By contrast, if most of the elements are non-zero, then the matrix is considered dense. The number of zero-valued elements divided by the total number of elements (for example, m times n for an m x n matrix) is called the sparsity of the matrix (which is equal to 1 minus the density of the matrix)." Using those definitions, Wikipedia writes, a matrix will be sparse when its sparsity is greater than 0.5.
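That definition is easy to demonstrate. Below is a small sketch of the prune-for-sparsity idea; the matrix and the threshold are made up for illustration and have nothing to do with NVIDIA's actual training pipeline:

```python
# Sparsity = zero-valued elements / total elements, per the quoted
# definition above. Pruning small weights raises sparsity.
def sparsity(matrix):
    flat = [x for row in matrix for x in row]
    return sum(1 for x in flat if x == 0) / len(flat)

def prune(matrix, threshold):
    """Drop low-valued weights (set them to zero)."""
    return [[x if abs(x) >= threshold else 0 for x in row] for row in matrix]

dense = [[0.9, 0.02, 0.5, 0.01],
         [0.03, 0.8, 0.04, 0.7],
         [0.6, 0.05, 0.9, 0.02],
         [0.01, 0.7, 0.03, 0.04]]

print(sparsity(dense))                      # 0.0 -- every element is non-zero
print(sparsity(prune(dense, 0.1)))          # 0.5625 -- "sparse" by the >0.5 rule
```
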
NVIDIA's presentation also has some good graphics showing all of this — images for different types of deep learning, the applications, and the implications of processing those on its architectures. So, the new learning process in Ampere still works with a dense matrix, and the network is trained to assign weights throughout the matrix. Unlike Turing, the process then continues: the matrix is pruned to create a sparse matrix by dropping the low-valued weights. The new sparse matrix is a more efficient version that's nearly as accurate when compared to the trained dense matrix. In the final step of this process, the new sparse matrix is retrained to drop its dependency on those weights. The Tensor cores can then use a compressed version of the more efficient sparse matrix for AI inferencing.

From a hardware perspective, the third-gen Tensor core has been optimized to work with these sparse matrices, supposedly further boosting the performance improvements that we've been talking about. In the Ampere architecture, we notice that there are fewer Tensor cores per SM — you likely saw this in the block diagrams as well at initial announce — and that's as compared to Turing, of course. NVIDIA's goal here was to reduce the number of cores but make sure each core was more powerful, and this is why you shouldn't directly compare the count. From a numerical standpoint, it appears that NVIDIA achieved its goal, as outlined in this
graphic. In the Turing GPU, there were eight TCs per SM, and each TC could perform 64 FP16 fused multiply-add (you'll see this as "FMA," normally) operations; this resulted in 512 FP16 FMA operations per SM. In the enhanced Ampere third-gen Tensor core, there are only four Tensor cores per SM; however, each Tensor core can perform either 128 FP16 FMA operations on a dense matrix or 256 FP16 FMA operations on a sparse matrix. This yields either a matching level of performance on dense matrices per SM — at worst, 512 there — or a doubling of performance on sparse matrices per SM, at 1024.
In other words, Ampere is capable of at least as good performance as Turing when working with dense matrices, and in cases where sparse matrices can be used instead, the performance gains are higher — by avoiding math that we don't need to do — producing yet-to-be-verified improvements of 2.7x, or 89 Tensor TFLOPS to 238, using NVIDIA's numbers. Of course, how that's realized in, well, real-world applications — outside of just doing the math on paper like NVIDIA has presented here — will depend on everything else in the architecture and the pipeline as well, because you can still get hung up on one part of it, and that's the stuff that will be shown in benchmarking.

OK, time to talk about the second generation of ray tracing cores, or RT cores, for Ampere. Starting with the Turing GPU, the ray tracing core was defined as a multi-part unit whose entire objective was to solve intersection problems — and specific ones, at that. Its purpose is to fully offload ray tracing calculations from the shader cores and handle them independently of those shader cores, on a dedicated hardware unit. This means that the shader makes a call to the RT core, the RT core handles the RT — ray tracing — workload, and it then returns the output to the shader once that's complete; the shader core itself doesn't do any ray tracing calculation. NVIDIA's design of the RT core utilizes three hardware units and a separate memory stack that all interoperate to provide the desired functionality. Inside the RT core, there are two math units: one to handle bounding box intersections, and another to perform triangle intersection calculations. The third interior unit is a MIMD — multiple instruction, multiple data — hardware state machine that handles the BVH, or bounding volume hierarchy, traversal for each ray.
there wereseveral small alterations to the rt corebut they only shared details
on two ofthem improved triangle intersectioncalculation hardwareand an added
triangle positioninterpolation unitideally in both turin and ampere rtcores
triangle intersectionand bounding box intersectioncalculations should be donein
parallel so ideally what's happeningis that a bounding box intersectioncalculation
should be occurring on somerays while the triangle intersectioncalculationsshould
be happening on other race thuskeeping each piece of hardwareat high
utilization however what nvidiaobserved during operation as is usuallythe case
withideal scenarios versus how it executesin realitywas that during testing uh
triangleintersection calculationrates were too low and this created abottlenecknear
the end of the data flow so as aresultnvidia altered the hardware in thetriangle
intersection area to reportedlyincreasethroughput twofold the other rt corehardware
improvement that we know aboutrelates mostly to making motion blurwork with ray
tracingthey added a module before the triangleintersection areato interpolate
the triangle position ata given point in timebasic ray tracing or basic rt works
withclearly defined bounding volumes or bvand triangles the bv is located thetriangle
intersection is solved forwithin that spaceand the sample is output with motionblur
things become a bit more complexbecause there are no longer fixedpositiontriangles
and because nobody in theentire world likes motion blurbut we should apparently
still spendhardware on adding itinstead those triangles are replacedwith a
formula that exists inside the bvthat represents where the triangle wouldbe
located at a given point in timejokes about motion blur aside this isuseful for
things outside of gaming 2becauseit's not all real time ray tracing allthe time
there's also just simple raytracingbeing done well simple and in thecontext of
not being real timebeing done in things like blender orother rendering
applicationscycles render for example when you as anartist might be working on
something cgsome kind of filmsome kind of graphic something alongthose lines
this would still be usefulthankfully as rays go through the rthardwarethey
enter with a defined time variablewhich is then plugged into the equationthat
then determines the position of thetriangle from there the triangleintersection
can be calculated and anoutput sample is produced in practicethere's more than
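The interpolation step itself can be pictured as a simple lerp of the triangle's vertices between two keyed positions, evaluated at the ray's time value. This is a toy model — the actual formula NVIDIA's unit evaluates hasn't been disclosed:

```python
# Toy model of triangle position interpolation: each vertex moves
# linearly between its position at t=0 and t=1, and a ray's time
# value selects where the triangle is when that ray samples it.
def lerp(a, b, t):
    return tuple(a_i + (b_i - a_i) * t for a_i, b_i in zip(a, b))

def triangle_at_time(tri_t0, tri_t1, ray_time):
    """Vertex positions of a moving triangle at the ray's time value."""
    return [lerp(v0, v1, ray_time) for v0, v1 in zip(tri_t0, tri_t1)]

start = [(0, 0, 0), (1, 0, 0), (0, 1, 0)]  # triangle at t=0
end   = [(2, 0, 0), (3, 0, 0), (2, 1, 0)]  # same triangle at t=1

print(triangle_at_time(start, end, 0.5))
# [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
```

Once the interpolated position is in hand, the ordinary triangle intersection hardware can take over for that ray.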
In practice, there's more than one ray for any given scene, clearly, and so each ray will hit a different triangle based on its function of time. More rays at different triangle intersections means a more mathematically correct and complex output sample. The increased complexity of the motion blur RT process could create a potential bottleneck inside of an RT core if it weren't somehow accelerated, and therefore, to allow motion blur to work within RT while still maintaining acceptable performance, another logical unit had to be added within the RT core. NVIDIA shows this change in its "Interpolate Tri Position in Time" Ampere RT core diagram. Again, we're not exactly sure what changed specifically in the transistor logic, but we can see the possibility that the added hardware could improve performance for motion blur ray tracing.

These two added units, in addition to several undisclosed hardware modifications, purportedly allow Ampere-generation RT cores to run roughly double the throughput of Turing RT cores. If you've seen that 1.7x performance number floating around, it's from NVIDIA's first-party calculation of RT TFLOPS — teraflops — going from 34 to 58, depending on what card you're looking at. When combining the performance enhancements of the SMs (namely, the two-times floating point 32 and the increased data cache sizes), the RT core improvements (the claimed two-times throughput improvement assigned to those RT cores), and the Tensor cores — no explanation of what improved there, at least that we've read yet; again, the white paper was just distributed when we started working on this piece — NVIDIA claims that frame production time, with all of this in mind, can be reduced by as much as 44%.

The next thing to talk about
thing to talk aboutis going to be memory for graphicsmemory nvidia teamed up
with micronthis is something that you've all heardat this point where it is now
usinggddr6xpreviously there was a gddr5x thatnvidia worked onand of course
there are hbm options onsome specific cards as wellg6x though is now going to
be thefastest video memory availableon the market today and ampere gpusfeature
specifically a 320 bit widememory busthat operates or allows the the peakoperation
of 19 gigabits per second andthis really shouldn't ever be necessaryfor us to
remind peoplebut we've seen a lot of viewercommenters and even some reviewers
uhmixing up lowercase and capital b's forthis number19 gigabits per second is a
lowercase bthe capitalization or lack thereof ofthat bdoes matter it is eight
bits per bytemost of the timeand uh that's 19 gigabits per second not19
gigabytes per secondand further still we saw a lot ofconfusionthat probably
there shouldn't be toomuch of this in the audience watchingthis video butwe saw
a lot of confusion where peoplethought this 19 gigabits per secondnumber meant
19 gigabytes of memoryit's a different thing that's that's notwhat it is so
just to be clear on thatuh anyway there's an effective 760gigabyte per second
memory bandwidthwiththis specific configuration we'retalking about so comparing
this with theturing gddr6 memoryit's a 23 improvement in memorybandwidth while
reducing the size of thememory busthis is in part accomplished by the newpam4
signaling used in gdr6xnvidia also said the following quoteinstead of binary
bits of datapam4 sends one of four different voltagelevels in 250 millivolt
voltage stepsevery clock cycle so in the same periodof time gddr6x can transmit
twice asmuch dataas gddr6 non-x memory micron had todeploy a few techniques to
maximize thedata throughputwithout introducing any form of errorrate the first
technique maximumtransitionavoidance coding prevents errors in thedata by
encoding itin such a way that it never crosses morethan one voltage levelin the
graphic provided a voltage changefrom one level to the nextleaves a clearly
defined gap between itand the next voltage changeeven a voltage change from one
level toa level two tiers aboveor below would still leave a clearlydefined gap
between it and the nextvoltage changehowever if a voltage change from onelevel
to a level three tiers above orbelow it were to occurthen the gap between
voltage changeswould not be large enoughfor clear differentiation in turn thiswould
cause the design teamto require a reduction in frequencywhich is obviously
undesirablethus micron's mta coding encodes thedatain a manner that it never
crosses morethan one voltage level at a timesimilar techniques are also used innetwork
Similar techniques are also used in network engineering, if you know that field. The second technique is currently less clearly defined than the first. Basically, our understanding now is that, as the system is trying to read the data, it wants to be as close to a stable state as possible — a stable reading state — so the hardware in GDDR6X is designed to continuously train and adapt the sampling positioning to be as close to the center of the eye as possible, and there's a graphic for that as well, from NVIDIA and Micron.

There's a lot more to discuss — we're at a long enough video at this point. L0 and L2 cache would be worth talking about as well, where there's improved concurrency — referring to asynchronous compute, specifically — that's been updated for this generation. We don't know exactly how some of this stuff is happening yet, but NVIDIA is running DLSS on frame n while, in some cases, being able to run RT on frame n+1 to get the 6.7-millisecond number that you see in some of the marketing images. So this is stuff we'll need to look into further, but for now, that should give you an introduction to Ampere at the basic level and prepare you for the reviews coming up. As stated: wait for the reviews before you get too excited about this stuff. The technology changes are always very interesting to read about — it's a fun part of the job — but we do want to remind everyone that, at a product level, just try to keep an eye on the reality of what you really need versus what you just want, because there's a lot of internet hype. That's not to downplay anything about the Ampere cards; it's just to kind of maintain a level of realistic expectations.

That's it for this one. Thanks for watching. You can go to store.gamersnexus.net to help us out directly by buying things like our mouse mats, modmats, toolkits, or other products if you want to help fund this type of long-form reporting, or you can go to patreon.com/gamersnexus for behind-the-scenes videos. Thanks for watching; subscribe for more. We'll see you all next time.