NVIDIA's Ampere architecture has arrived. At the time of filming, we're working on the review for the RTX 3080, but we wanted to take a look at the block diagrams and talk about the architectural differences between Ampere and Turing, and where the similarities are. There are a lot of similarities between them at a purely block-diagram level, but as we zoom into the parts, NVIDIA has changed a lot this time around for its Ampere architecture.

This content is based on first-party architectural disclosures from NVIDIA. We have information from the press tech day, for example, where some of the architects from NVIDIA who worked on Ampere presented the architectural changes and what they mean. That means the architecture details we've received aren't marketing-laden details; they're actual white paper and architectural discussion information. So this is more of an educational piece focused on how the product works at a deeper level. Keep in mind that, as always, we'd advise you to reserve any judgment of product performance, or any hype for the product, until the actual reviews go live; that way, you can make a buying decision with all the information. This one is just going to focus on how stuff works.

Before that, this video is brought to you by Thermaltake's Core P3 case. The Core P3 is one of the most unique cases on the market: it can serve as an open-air standing chassis, a test bench in vertical or horizontal orientation, or a wall-mounted showcase PC. The Core P3 now comes with a five-millimeter-thick tempered glass panel for its side, but keeps the front, top, and back open for air. The Core P3's versatility as a display piece, test bench, or standard desktop is reinforced by its price of roughly $110 on Amazon. You can learn more at the link in the description below.

One of the most obvious improvements this time has little to do with architecture and more to do with process improvements.
Turing GPUs were manufactured by TSMC using a 12nm FFN process; Ampere is being manufactured by Samsung this time, on an 8nm process that NVIDIA has been calling "8N." Thanks to this change, a new Ampere GPU has a higher transistor density than its preceding Turing GPUs, and that's something we'll keep in mind as we open the cards up, too, to show you what the actual silicon looks like physically on the card. That'll give us the angle of looking at the die size, which is known at this point, but once you see Turing versus Ampere in person, on actual PCBs, it means a little bit more.

In the case of the GA102-300-K1-A1 die and the TU102-300-K1-A1 (from this point forward, we'll just use "GA102" and "TU102" to reference those), the GA102 packs 28 billion transistors into 628mm² of space, while the TU102 used 754mm² to lay out 18.6 billion transistors. We've taken apart Turing cards in the past and we can show those, but right now, we can't show what Ampere looks like, at least not until the review date.

Often, a reduction in the process node or transistor size is correlated with a reduction in power consumption, but with this generation of GPUs, that's not the case. The reference TDP of the GA102 is 320W, while the TU102 was lower at 250W. Be careful not to read this as "this card runs hotter," though; it's not really that simple, and it never is. The performance per watt is reportedly better with Ampere: at 240W TDP, Turing ran 60FPS at 4K in NVIDIA's early marketing images, for example, while Ampere was at 80+ FPS at 4K, and we've also got some power-normalized results from NVIDIA's first-party reveals as well.

The larger part of this content will talk about architecture, not the process changes, so we'll be looking at what's different in Ampere in terms of SMs, the layout (or "containerization," as we'll call it) of the GPU, the third-gen Tensor Cores, and the second-gen RT cores that NVIDIA has disclosed in brief.
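Before moving on, a quick back-of-the-envelope check on the die figures above shows what the node change buys in transistor density. This is purely illustrative arithmetic using NVIDIA's published numbers:

```python
# Transistor density from NVIDIA's published die figures (illustrative only)
ga102 = {"transistors": 28.0e9, "area_mm2": 628}   # Ampere, Samsung 8N
tu102 = {"transistors": 18.6e9, "area_mm2": 754}   # Turing, TSMC 12nm FFN

# Density in millions of transistors per square millimeter
ga102_density = ga102["transistors"] / ga102["area_mm2"] / 1e6
tu102_density = tu102["transistors"] / tu102["area_mm2"] / 1e6

print(f"GA102: {ga102_density:.1f} Mtransistors/mm^2")   # ~44.6
print(f"TU102: {tu102_density:.1f} Mtransistors/mm^2")   # ~24.7
print(f"Density ratio: {ga102_density / tu102_density:.2f}x")  # ~1.81x
```

So the smaller die still fits roughly 1.8x the transistors per unit area, which is where a lot of the extra hardware budget this generation comes from.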
We'll also briefly cover PCIe Gen4 changes, memory changes with GDDR6X, and NVLink. There's a whole lot more to this: we actually received the white paper immediately after finishing this script, and we've added a little more detail in, but there's more still that could be added. The white paper is about 42 pages, just to give you an example, and this content alone is written based on about a one-hour-long presentation from an NVIDIA architect. The point is that there's a lot more detail we could get into, but this content will give you all the basics you need to know for the key changes. Once we get into discussing more about cache, for example, or some of the other more minute, deeper-level details, that's going to require a different content piece altogether, just for the sake of time.

We need to start with a quick refresher course on NVIDIA's containers, containerization, and identification for its GPU components. If we look at the Turing block diagram first, you'll see that it's split into GPCs and L2 cache, with memory controllers at the flanks of the image and interfaces at the vertical edges. A GPC is a Graphics Processing Cluster, by NVIDIA's terminology, and a GPC contains a raster engine, then some more containers, with 6 TPCs per GPC. A TPC is a Texture Processing Cluster, and each TPC contains two SMs, or Streaming Multiprocessors; we have separate block diagrams for what SMs look like. So there are 12 SMs per GPC in Turing, if we look at the big Turing block diagrams. The SMs contain the components that handle floating-point and integer math, among other things like cache, which we'll get into.

Turing to Ampere looks more like this: this is the block diagram for GA102. As a reminder, there will be variations of this at different silicon sizes, but there will probably also be versions with some of these SMs disabled; in fact, the 3080 is one of those.
If you need an introduction to NVIDIA's naming scheme for the actual identifier that NVIDIA puts on the die: if you take your video card apart and look at the silicon, the diffusion barrier on top of it (that silver barrier) will have an ID on it, and that might be something like TU102, TU106, or TU104, or we might get GA106 and, most likely, GA104 and GA102. The short version of the basics is that NVIDIA increments this number in a positive direction as the component gets lower-end. So if you're looking at GA102, that's going to be one of the higher- or highest-end parts, and once you get down to something like GA106, that's typically more of a 60-class GPU. We don't know exactly where that's going to land this generation, but typically you'll see it in something like a 60-class, cheaper card. The closer the name is to the 100 designation, the larger the silicon is, the higher-end it is, and the closer it is to the fullest version of the block diagram we've shown. Once you get into things like GA100 or GP100, you're looking at a data center, scientific compute, or other professional-class card.

Comparing the GA102 diagram to the Turing diagrams, we see that there are seven GPCs now, with a maximum possible SM count of 84.
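That 84 figure falls straight out of the container hierarchy described above. As a sketch (the 128 FP32-capable CUDA cores per SM are covered later in this piece):

```python
# Full GA102 configuration from the container hierarchy in NVIDIA's block diagram
gpcs = 7
tpcs_per_gpc = 6
sms_per_tpc = 2

max_sms = gpcs * tpcs_per_gpc * sms_per_tpc
print(max_sms)  # 84 SMs on a fully enabled die

# Each Ampere SM carries 128 FP32-capable CUDA cores (discussed later on)
cuda_cores_per_sm = 128
print(max_sms * cuda_cores_per_sm)  # 10752 CUDA cores, full die
```

Remember that shipping parts like the 3080 disable some of these containers, so real products land below these maximums.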
That doesn't mean they're all there, or all enabled rather, but that's the maximum possible count. There are also more ROPs now, at 16 ROP units per GPC. There are still six memory controllers at each flank, although the memory subsystem has changed this generation. The PCIe interface has changed to 4.0, and NVLink, if you compare the Turing and GA diagrams, has changed to 4x4 links rather than 2x8 links; that's the same amount of total lanes, apportioned differently. Within a GPC, there are 12 total possible SMs, or 6 total possible TPCs. Some of these will be shed as the silicon gets smaller for lower-end parts, but that's what it looks like at the large scale.

Within the SM, looking at the SM block diagram, Ampere has second-generation RT cores rather than the original Turing cores, and this is actually a major change that we'll be exploring in this content today. It also has CUDA cores that are either FP32-only or capable of either FP32 or INT32. You may not have all of the CUDA cores listed in the spec sheet available for floating point at any given time, because some of them could be doing other things, like integer processing, instead. In the SM block diagram, you can see the load/store units for each block and the shared cache for the entire SM, at 128KB of L1. GA102 also has 168 FP64 units, or 2 units per SM, for double precision; you'll probably want a different class of card for double-precision workloads, as is typically the case.

NVIDIA has also critically improved the ROPs for this generation. On the rasterization side, NVIDIA is decoupling the ROPs from the memory controller and the L2 cache, as they were previously coupled to those, and is instead tying the ROPs to the GPC. In initial numbers, NVIDIA says that this improves raster operations and allows an increase in ROP units, and the company also claims that the performance change will be reflected in benchmarks of things like games, for example; some of your performance uplift can be directly attributed to this change in the architecture.
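Since the ROPs now travel with the GPCs rather than the memory subsystem, the maximum ROP count scales with GPC count. A quick sketch against Turing's figures (the TU102's 96-ROP maximum is from its public specs, not from this presentation):

```python
# Ampere: ROPs are per-GPC now, at 16 each, per NVIDIA's disclosure
ampere_max_rops = 7 * 16      # 7 GPCs on a full GA102
print(ampere_max_rops)        # 112

# Turing: ROPs were coupled to the memory controllers; TU102 topped out at 96
turing_max_rops = 96
print(f"{ampere_max_rops / turing_max_rops:.2f}x")  # ~1.17x more ROPs on paper
```

As with everything else here, this is a paper maximum; cut-down parts with disabled GPCs lose ROPs along with them.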
Here's a quick comparison table from NVIDIA. It unfortunately comes complete with a document timestamp on it, but do your best to ignore that. The RTX 3080 FE card runs six of the total seven GPCs in Ampere's GA102 die. We'll be curious to see if there's a part between the 3080 and the 3090, like a possible 3080 Ti closer to $1,000; there's certainly room for it in the pricing stack, at least, and potentially here as well.

As a quick note while you're looking at this table: it's not really appropriate to compare CUDA cores, or other core counts, directly across generations. Once you cross generations, it's non-linear. That'd be sort of like comparing a modern four-core, eight-thread CPU to an older four-core, eight-thread CPU and extrapolating performance purely from that number and maybe the frequency. There are other changes there as well, like cache, in CPUs, and the same is true for GPUs. The newer RT cores, Tensor Cores, and CUDA cores all have improvements, particularly in efficiency, that make a purely count-based comparison non-linear, so don't do just a CUDA-core-versus-CUDA-core comparison to decide which card is more powerful.

For the new architecture, we'll start with the Streaming Multiprocessors, or SMs, as the basis for discussion, and a quote from an NVIDIA engineer that we used in our Turing architecture write-up can start us off here. It was from the Turing deep dive, and it referred to the integer processing approach in Turing: "Note that while we call the unit an integer unit, it actually performs both integer and simple floating point operations, like the floating point compare and min/max operations mentioned in the other part of the content." If we had picked up on what NVIDIA had really been trying to say a couple of years ago, then maybe we could have predicted where NVIDIA was going with SMs back then: towards a new data path. On a recent call with NVIDIA's architecture team, the company noted that its research had once again emphasized that many workloads are floating-point heavy.
NVIDIA is trying to direct a hardware architectural solution towards this challenge, and this need to process more floating-point data prompted NVIDIA to alter the SM data path such that the INT32 portion of the data path also has an FP32 unit in it, thereby giving the SM the ability to execute either INT or floating-point operations in that given data path. The alternative to this would be, unfortunately, having units sitting around doing nothing while they wait for some kind of task. A couple of years back, in the Pascal era, NVIDIA also talked about how execution of certain instructions would block other instructions, which is where you start talking about asynchronous processing and things like that, but that was something we discussed more in the Pascal era and then a bit in the Turing era of cards.

Each SM can now execute 128 FP32 operations per clock, up from 64 per clock in Turing. This new data path doesn't produce a two-times performance gain, as INT32 operations must still occur for data fetches, data compares, and things like that, but this architectural change certainly contributes to the performance gains that NVIDIA has alluded to.

The next section is briefly on L1 cache. There's a lot more about cache to discuss (we just opened up that new white paper and looked at it, and there's a lot there, perhaps for a future piece), but we'll give you the basics here. To feed the faster 128-FP32-operations-per-cycle execution of the new SM, the incoming data pipeline had to be redesigned in Ampere, and so the L1 cache now has double the bandwidth (we need to check on the precise numbers for that, but it is double), 33% more capacity (up from 96KB to 128KB), and twice the cache partition size. The cache partition size increase shows benefits specifically with "long and complicated shader programs," as mentioned by NVIDIA in its architecture deep dive.
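Backing up to the doubled FP32 path for a second, here's the usual paper-TFLOPS arithmetic it feeds into. The SM count and boost clock below are the RTX 3080's reference figures, used purely as an example:

```python
# Paper FP32 throughput for an SM configuration (illustrative arithmetic)
def fp32_tflops(sms, fp32_ops_per_sm_per_clock, boost_clock_hz):
    # Each FMA counts as two floating-point operations (multiply + add)
    return sms * fp32_ops_per_sm_per_clock * 2 * boost_clock_hz / 1e12

# RTX 3080 reference figures: 68 enabled SMs, ~1.71 GHz boost
print(f"{fp32_tflops(68, 128, 1.71e9):.1f} TFLOPS")  # ~29.8 with Ampere's 128/clk

# The same SM count at Turing's 64 FP32 ops per clock would be half that
print(f"{fp32_tflops(68, 64, 1.71e9):.1f} TFLOPS")   # ~14.9
```

Keep in mind this is exactly the kind of paper math the next paragraph warns about: real frame rates depend on how often that second data path is actually free to do FP32 work.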
Together, the new data path and the increased L1 cache provide a combined 2.7x performance bump in the SMs. But remember, again, that this 2.7x number doesn't mean your games are going to run at 2.7x higher FPS than whatever you're comparing against. This is not total card performance; it's a performance increase for a significant part of the card, and that's important, but we want to be really clear with everyone that it's not total card performance, because the plot tends to get lost with that stuff. A great way to think about this would be the gallium nitride solution that Corsair introduced at CES 2017 or 2018 in its power supplies, where the PFC efficiency increased to something in the very high 90s, maybe the 97-99% range. That didn't mean the entire power supply was 98% efficient; it meant that that part of the power supply was 98% efficient. You can think of the same thing here: it is not a direct, linear increase in overall card performance. That's not to diminish it; it's to make sure people keep a reality check.

Third-gen Tensor Cores are up next, and this is an important change for Ampere versus Turing. NVIDIA likes to advertise the Tensor Core, or TC, as the AI portion of the GPU; effectively, they're good at doing linear algebra. For the gaming community, this mostly means improved frame rates at very high resolutions with certain technologies like DLSS, particularly DLSS 2.0. Simplifying things: in the Turing architecture, a network was trained on dense matrices, and those dense matrices could then be fed to the second-gen Tensor Cores for AI inferencing. With Ampere's architecture, NVIDIA is now utilizing the concept of sparse matrices to boost performance. NVIDIA has a blog post with more information on sparse matrices as related to Ampere, if you're interested in that. We won't try to do a better job than Wikipedia at explaining a sparse matrix, as this isn't something we typically work with in our part of the industry, so here's a direct quote.
"In numerical analysis and scientific computing, a sparse matrix or sparse array is a matrix in which most of the elements are zero. By contrast, if most of the elements are non-zero, then the matrix is considered dense. The number of zero-valued elements divided by the total number of elements (e.g., m × n for an m × n matrix) is called the sparsity of the matrix (which is equal to 1 minus the density of the matrix)." Using those definitions, Wikipedia writes, a matrix will be sparse when its sparsity is greater than 0.5. NVIDIA's presentation also has some good graphics showing all of this, with images for different types of deep learning, the applications, and the implications of processing those on its architectures.

The new learning process in Ampere still works with a dense matrix, and the network is trained to assign weights throughout the matrix. Unlike Turing, the process then continues: the matrix is pruned to create a sparse matrix by dropping the low-valued weights. The new sparse matrix is a more efficient version that's nearly as accurate when compared to the trained dense matrix. In the final step of this process, the new sparse matrix is retrained to reduce its dependency on the dropped weights, and the Tensor Cores can then use a compressed version of the more efficient sparse matrix for AI inferencing.

From a hardware perspective, the third-gen Tensor Core has been optimized to work with these sparse matrices, supposedly further boosting the performance improvements we've been talking about. In the Ampere architecture, you'll notice there are fewer Tensor Cores per SM (you likely saw this in the block diagrams at the initial announcement) as compared to Turing, of course. NVIDIA's goal here was to reduce the number of cores but make each core more powerful, and this is why you shouldn't directly compare the counts. From a numerical standpoint, it appears that NVIDIA achieved its goal, as outlined in this graphic: in the Turing GPU, there were eight TCs per SM, and each TC could perform 64 FP16 fused multiply-add (you'll normally see this as "FMA") operations. This resulted in 512 FP16 FMA operations per SM.
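The sparsity definition from that quote is easy to make concrete. A minimal sketch:

```python
# Sparsity as defined above: zero-valued elements / total elements = 1 - density
def sparsity(matrix):
    elements = [x for row in matrix for x in row]
    return sum(1 for x in elements if x == 0) / len(elements)

m = [
    [0, 0, 3, 0],
    [0, 4, 0, 0],
]
s = sparsity(m)
print(s)       # 0.75: "sparse" by the >0.5 rule above
print(1 - s)   # 0.25: the matrix's density
```

Pruned neural network weights are exactly this kind of matrix: after the low-valued weights are dropped, most entries are zero, so the hardware can skip them.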
In the enhanced Ampere third-gen Tensor Core, there are only four Tensor Cores per SM; however, each Tensor Core can perform either 128 FP16 FMA operations on a dense matrix or 256 FP16 FMA operations on a sparse matrix. This yields either a matching level of performance on dense matrices per SM at worst (512 there) or a doubling of performance on sparse matrices per SM (1,024). Ampere is capable of at least as good performance as Turing when working with dense matrices, in other words, and in cases where sparse matrices can be used instead, the performance gains are higher, by avoiding math that we don't need to do. This produces yet-to-be-verified improvements of 2.7x, or 89 Tensor TFLOPS to 238, using NVIDIA's numbers. Of course, how that's realized in real-world applications, outside of just doing the math on paper as NVIDIA has presented here, will depend on everything else in the architecture and the pipeline as well, because you can still get hung up on one part of it; that's the stuff that will be shown in benchmarking.

OK, time to talk about the second generation of ray tracing cores, or RT cores, for Ampere. Starting with the Turing GPU, the ray tracing core was defined as a multi-part unit whose entire objective was to solve intersection problems, and specific ones at that. Its purpose is to fully offload ray tracing calculations from the shader cores and handle them independently on a dedicated hardware unit. This means that the shader makes a call to the RT core, the RT core then handles the ray tracing workload, and it returns the output to the shader once that's complete; the shader core itself doesn't do any ray tracing calculation. NVIDIA's design of the RT core utilizes three hardware units and a separate memory stack that all interoperate to provide the desired functionality. Inside the RT core, there are two math units: one to handle bounding box intersections, and another to perform triangle intersection calculations.
The third interior unit is a MIMD (multiple instruction, multiple data) hardware state machine that handles the BVH, or bounding volume hierarchy, traversal for each ray. NVIDIA stated that in the Ampere architecture, there were several small alterations to the RT core, but it only shared details on two of them: improved triangle intersection calculation hardware and an added triangle position interpolation unit.

Ideally, in both Turing and Ampere RT cores, triangle intersection and bounding box intersection calculations should be done in parallel. So ideally, a bounding box intersection calculation should be occurring on some rays while triangle intersection calculations are happening on other rays, thus keeping each piece of hardware at high utilization. However, what NVIDIA observed during operation (as is usually the case with ideal scenarios versus how things execute in reality) was that triangle intersection calculation rates were too low during testing, and this created a bottleneck near the end of the data flow. As a result, NVIDIA altered the hardware in the triangle intersection area to reportedly increase throughput twofold.

The other RT core hardware improvement that we know about relates mostly to making motion blur work with ray tracing. NVIDIA added a module before the triangle intersection area to interpolate the triangle position at a given point in time. Basic ray tracing, or basic RT, works with clearly defined bounding volumes (BVs) and triangles: the BV is located, the triangle intersection is solved for within that space, and the sample is output. With motion blur, things become a bit more complex, because there are no longer fixed-position triangles (and because nobody in the entire world likes motion blur, but we should apparently still spend hardware on adding it). Instead, those triangles are replaced with a formula that exists inside the BV and represents where the triangle would be located at a given point in time. Jokes about motion blur aside, this is useful for things outside of gaming, too, because it's not all real-time ray tracing all the time.
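Before moving on: that "formula inside the bounding volume" idea can be pictured as simple interpolation between a triangle's start and end positions over the frame interval. This is a toy sketch of the concept, not NVIDIA's actual hardware logic:

```python
# Toy version of time-interpolated triangle position for motion blur.
# Each vertex has a position at t=0 and at t=1; each ray carries a time value.
def lerp(a, b, t):
    # Linear interpolation between two points, component by component
    return tuple(a_i + (b_i - a_i) * t for a_i, b_i in zip(a, b))

def triangle_at_time(tri_t0, tri_t1, ray_time):
    # Move all three vertices to where the triangle sits when this ray samples it;
    # the ray-triangle intersection test then runs against this interpolated triangle.
    return [lerp(v0, v1, ray_time) for v0, v1 in zip(tri_t0, tri_t1)]

tri_start = [(0, 0, 0), (1, 0, 0), (0, 1, 0)]
tri_end   = [(2, 0, 0), (3, 0, 0), (2, 1, 0)]  # triangle moved +2 on x over the frame
print(triangle_at_time(tri_start, tri_end, 0.5))
# [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
```

Because every ray can carry a different time value, rays sampled across the frame interval hit the triangle at different positions, which is what produces the blur.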
There's also just simple ray tracing being done (well, "simple" in the context of not being real-time) in things like Blender or other rendering applications, Cycles render for example, when you as an artist might be working on something CG: some kind of film, some kind of graphic, something along those lines. This would still be useful there. Thankfully, as rays go through the RT hardware, they enter with a defined time variable, which is then plugged into the equation that determines the position of the triangle. From there, the triangle intersection can be calculated, and an output sample is produced. In practice, there's clearly more than one ray for any given scene, and so each ray will hit a different triangle based on its function of time. More rays at different triangle intersections means a more mathematically correct and complex output sample.

The increased complexity of the motion blur RT process could create a potential bottleneck inside an RT core if it weren't somehow accelerated, and therefore, to allow motion blur to work within RT while still maintaining acceptable performance, another logical unit had to be added within the RT core. NVIDIA shows this change as "interpolate tri position, time" in its Ampere RT core diagram. Again, we're not exactly sure what changed specifically in the transistor logic, but we can see the possibility that the added hardware could improve performance for motion blur ray tracing. These two added units, in addition to several undisclosed hardware modifications, purportedly allow Ampere-generation RT cores to run at roughly double the throughput of Turing RT cores. If you've seen that 1.7x performance number floating around, it's from NVIDIA's first-party calculation of RT TFLOPS going from 34 to 58, depending on what card you're looking at. That's when combining the performance enhancements of the SMs (namely the 2x FP32 and the increased data cache sizes), the RT core improvements (the claimed 2x throughput assigned to those RT core changes), and the Tensor Cores.
For the Tensor Cores, there's no explanation of what improved here, at least that we've read yet; again, the white paper was just distributed when we started working on this piece. NVIDIA claims that frame production time, with all of this stuff in mind, can be reduced by as much as 44%.

The next thing to talk about is memory. For graphics memory, NVIDIA teamed up with Micron (this is something you've all heard at this point) and is now using GDDR6X. Previously, there was GDDR5X that NVIDIA worked on, and of course, there are HBM options on some specific cards as well. G6X is now going to be the fastest video memory available on the market today, and these Ampere GPUs specifically feature a 320-bit-wide memory bus that allows peak operation of 19Gb/s. It really shouldn't ever be necessary for us to remind people of this, but we've seen a lot of viewer commenters, and even some reviewers, mixing up lowercase and capital B's for this number: 19 gigabits per second uses a lowercase b. The capitalization, or lack thereof, of that B does matter; it's eight bits per byte most of the time, and that's 19 gigabits per second, not 19 gigabytes per second. Further still, we saw a lot of confusion (though there probably shouldn't be too much of it in the audience watching this video) where people thought this 19Gb/s number meant 19GB of memory capacity; that's a different thing entirely, so just to be clear on that. Anyway, there's an effective 760GB/s of memory bandwidth with this specific configuration.

Comparing this with Turing's GDDR6 memory, it's a 23% improvement in memory bandwidth while reducing the size of the memory bus. This is accomplished in part by the new PAM4 signaling used in GDDR6X. NVIDIA also said the following: "Instead of binary bits of data, PAM4 sends one of four different voltage levels in 250mV voltage steps every clock cycle, so in the same period of time, GDDR6X can transmit twice as much data as GDDR6 non-X memory."
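The bus width and per-pin rate above multiply out to the quoted bandwidth, and the arithmetic also shows why the lowercase-b/uppercase-B distinction matters. The Turing comparison point below uses the 2080 Ti's public GDDR6 specs as an assumption:

```python
# GDDR6X bandwidth on the 320-bit bus at 19 Gb/s per pin described above
bus_width_bits = 320
per_pin_gbps = 19            # gigaBITS per second, per pin (lowercase b)

bandwidth_gbytes = bus_width_bits * per_pin_gbps / 8  # 8 bits per byte
print(bandwidth_gbytes)      # 760.0 GB/s (uppercase B: gigaBYTES)

# Against the 2080 Ti's GDDR6 (352-bit bus at 14 Gb/s per pin):
turing_gbytes = 352 * 14 / 8
print(turing_gbytes)         # 616.0 GB/s
print(f"+{(bandwidth_gbytes / turing_gbytes - 1) * 100:.0f}%")  # ~+23%
```

Note that the new configuration gets there with a narrower bus: the per-pin rate increase from PAM4 more than offsets the 352-bit to 320-bit reduction.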
Micron had to deploy a few techniques to maximize the data throughput without introducing errors. The first technique, Maximum Transition Avoidance (MTA) coding, prevents errors in the data by encoding it in such a way that it never makes the maximum possible voltage transition. In the graphic provided, a voltage change from one level to the next leaves a clearly defined gap between it and the next voltage change. Even a voltage change from one level to a level two tiers above or below would still leave a clearly defined gap between it and the next voltage change. However, if a voltage change from one level to a level three tiers above or below it were to occur, then the gap between voltage changes would not be large enough for clear differentiation; in turn, this would force the design team to reduce frequency, which is obviously undesirable. Thus, Micron's MTA coding encodes the data in a manner that avoids these maximum three-tier transitions. Similar techniques are also used in network engineering, if you know that field.

The second technique is currently less clearly defined than the first. Basically, our understanding right now is that as the system is trying to read the data, it wants to be in as stable a reading state as possible, so the hardware in GDDR6X is designed to continuously train and adapt the sampling position to be as close to the center of the eye as possible. There's a graphic for that as well, from NVIDIA and Micron.

There's a lot more to discuss, but we're at a long enough video at this point. L0 and L2 cache would be worth talking about as well, and there's improved concurrency (referring specifically to asynchronous compute) that's been updated for this generation. We don't know exactly how some of this stuff is happening yet, but NVIDIA is running DLSS on frame n while, in some cases, being able to run RT on frame n+1 to get the 6.7ms number that you see in some of the marketing images. This is stuff we'll need to look into further.
For now, though, that should give you an introduction to Ampere at a basic level and prepare you for the reviews coming up. As stated, wait for the reviews before you get too excited about this stuff. The technology changes are always very interesting to read about (it's a fun part of the job), but we do want to remind everyone that at a product level, just try to keep an eye on the reality of what you really need versus what you just want, because there's a lot of internet hype. That's not to downplay anything about the Ampere cards; it's just to maintain a level of realistic expectations.

That's it for this one. Thanks for watching. You can go to store.gamersnexus.net to help us out directly by buying things like our mouse mats, modmats, toolkits, or other products if you want to help fund this type of long-form reporting, or you can go to patreon.com/gamersnexus for behind-the-scenes videos. Thanks for watching; subscribe for more, and we'll see you all next time.
