So, for those attending remotely: this lecture has quite a lot of animation in it, so apologies that the PDF won't be particularly useful. I had thought you would get the full-screen feed in the live stream, but you won't. However, I have a recording of this lecture — it's the only lecture with significant animation in it — on YouTube, and I'll put the link up; in fact, there's already a link to the final part of this lecture on the materials page. This was a talk I gave a few years ago as part of a science fair — I think it was the British Science Festival — so it's pitched at a slightly high level, but I think it covers all the major points of the course, so it should be a useful introduction to the areas we're going to drill down into over the next few lectures and weeks. It's called "Supercomputers in Science: from the Big Bang to Climate Change". Again, a quick word about EPCC: we're the supercomputing centre at the University of Edinburgh. There's a picture making it look very leafy and green where we are, but as ever the scientists and engineers are relegated to the outskirts of town: we're in the bottom right-hand corner of Edinburgh, on what used to be the edge of the city, and that's our building there.

So what am I going to talk about? What are computers used for, and why? This talk was aimed at a very general audience. Computers are ubiquitous nowadays — computing is everywhere — and people use them for playing games, for updating their Facebook status and for watching videos; occasionally people might browse the web and send an email, and heaven forbid they might even do some work with them. What I want to show you, and what I hope to cover in this lecture, is that computers are also used for scientific discovery, through what this course calls computer simulation. I'm going to cover the kind of hardware and programming techniques people use to program big supercomputers, but also touch on the kinds of algorithms and techniques people use to simulate real-life systems, and hopefully bring the two together with some simple examples.

The field I'm going to talk about is called computational science. That's not computer science: computer science is the scientific investigation of computers and computing, whereas computational science is doing science — physics, chemistry, biology, engineering — using computers. So there's an overloading of terminology, but what I mean by computational science is not the study of computers, but using computers to do scientific research.

Here's a picture of Peter Higgs, who is based at the University of Edinburgh and won the Nobel Prize a couple of years ago, and this is a reasonably good example. Back in the 1960s he came up with the theory that there would be a Higgs boson responsible for the fundamental masses of the particles around us. To try and find it, an enormous experiment was built at CERN — the Large Hadron Collider — with hundreds of millions of pounds spent, and I think that is actually Peter standing there in the corner, which gives you a sense of the scale of these enormous experiments.
Classically — and the modern view of science started right about when people began doing things this way — you come up with a theory, you predict that something will happen (an apple will fall from a tree), you do an experiment, and if you get the right answer you're happy; if you don't, you go back, refine the theory, and do the experiment again. So there was this loop between the two pillars of modern scientific discovery: theory and experiment. But that's not true anymore, because in almost all areas of science there is now computer simulation in this loop. When they built the Large Hadron Collider, they didn't just build it, turn it on and hope they'd find the Higgs boson: they had already run sophisticated computer simulations, and this is a simulated event of finding a Higgs boson, done decades before the machine was ever built. So now, in almost all areas of science, there's a circle: you have a theory, you do some computer simulation — you may or may not even be able to do the experiment — and then you go back and update your theory. The important point is that computer simulation, doing things on a computer, is intrinsic to this loop: almost all areas of science, at some point in the research cycle — or the development cycle, if you're doing engineering or commercial work — involve computer simulation. So the question is: why? Why was Newton quite happy with his pencil and paper, while nowadays we have to resort to computing?

Well, there are a lot of reasons for that. First of all, the theory can simply be too complicated. This is an equation from fluid dynamics: nowadays it's very easy to write equations down and very difficult to solve them, so in most areas it's simply not possible to solve a theory using pencil and paper. You can tackle it on a computer with numerical methods, but you can't do it analytically.

It might be too expensive. If you're going to produce a new car, it has to pass very stringent crash tests to make sure the passengers will be safe if there's a crash. That's a very expensive experiment to do, for two reasons: obviously you waste two cars, and if you get it wrong you have to redesign your car. So maybe the experiment is too expensive to do, or you only want to do it once: you want to make sure that when your robot lander lands on Mars it gets it right, because you only get one shot — or, as with the cars, you get multiple shots but each one is very expensive.

It might be physically impossible to do the experiment. You might want to understand what's going on in the centre of a volcano, to predict when the next volcanic eruption is going to be. You can't send a probe into the centre of a volcano, but if you simulate it on a computer you can look right inside and see what's going on.

It may be too big. One of the big worries we have at the moment is climate change, and almost all the evidence for climate change being influenced by human activity comes from computer simulation, because you cannot do that experiment. You might ask: what if we double our CO2 emissions — what's going to happen in a hundred years' time? The experiment is too big to do. We can't create another world, tell one world not to reduce its CO2 emissions and another world to reduce them, and see what happens. Because the experiment is too big — or possibly too unethical — to do, the only way is computer simulation. Everyone knows the world is getting warmer — you just stick a thermometer in it — but is that caused by human activity? That's a different question, and only by running computer simulations of the climate from the 1700s onwards — one run without the Industrial Revolution, and another with the Industrial Revolution as input — and seeing what happens can you actually tie the causality between human activity and climate change.

It might be too small: you might want to look right into the centre of an atom, at some subatomic particle, where you simply can't do the experiment directly, but on a computer you can. Or it might be too far away, or the timescales might be too long: what happens when two galaxies collide? I can look up in the night sky and find two galaxies that are colliding, but I'd have to wait a billion years for them to finish colliding. I need a computer to run the simulation massively faster than real time to try and understand what's going on.

So here's a very, very simple example, but it does illustrate a few interesting concepts, and I'm going to use it again later: what is the world's yearly income? Suppose I've found myself a list of all 7
billion people in the world, in alphabetical order, amazingly. At the top there are a couple of brothers from Afghanistan who unfortunately don't earn very much — a few hundred pounds each. Go further down the list and there I am, my salary strangely obscured, at position five billion, four hundred and five million and something. Further down there are some figures you might recognise: there's a woman called Elizabeth Windsor in the UK who a couple of years ago earned 38 million pounds — she's doing quite well. And all the way down at the bottom of the list is the seven-billionth person in the world, from Zimbabwe, who again unfortunately doesn't earn very much, though a bit more than our friends at the top: three or four thousand pounds.

So how am I going to work out what the world's total income is? Well, I take my list of seven billion numbers and I add them up (and divide by 7 billion if I want the average). That seems quite obvious, so I go to write a computer program to do it: I set some running total to zero, I start at the top of the list, I add the income to the total, I go to the next item in the list, and I repeat if I'm not at the end of the list — if I'm not, I go back to the start of the loop. This is pseudocode, but the important point is that in the loop I have to do three things: add my income to the total, go to the next item in the list, and check whether I'm at the end of the list. So — being very naive here — I'm going to count that as three operations, and say that the inner loop, which I have to execute seven billion times, takes three operations; then I print the total at the end. The question is: how long does that take?
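Just to make the pseudocode concrete, here's a minimal sketch of that loop in C. The array size and the income data are hypothetical stand-ins, of course — nobody has the real list:

```c
#include <stdio.h>

#define N 1000000            /* stand-in for the 7 billion entries */

double income[N];            /* one income per person */

int main(void)
{
    double total = 0.0;                /* set the running total to zero  */
    for (long i = 0; i < N; i++)       /* advance to next item, test end */
        total += income[i];            /* add this income to the total   */
    printf("total income = %.2f\n", total);
    return 0;
}
```

The loop body is exactly the three operations counted above: one add, one advance, one end-of-list test.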

To answer that, I'm going to illustrate the evolution of computing technology since the early 1970s: how processors have got faster, but also what the limits to that are, which will motivate why parallel computing — high-performance computing — has become so much more important recently. I'm going to start in 1966, which might seem a strange time to go back to: it's when I was born. As a processor, I reckon I could add one number a second — I could do one operation a second, so my frequency is one hertz. One second: add the number. Next second: move to the next item on the list. Next second: go back to the top of the loop. So my time per operation is one second, which means the time per loop iteration is three seconds, since the loop has three operations. If you work it out, it's going to take me 650 years to add up that list, by which time it would be a sadly out-of-date list — who knows what the population will be in 650 years. That tells you two things: (a) people aren't very good at numerical calculations, and (b) that is a really very long time. This is clearly what motivated computing in the early days: mechanical calculations like this were simply not doable by people.

So, back in 1971, what you might consider one of the first modern microprocessors was produced by Intel: the 4004. It had a frequency of 100 kilohertz. Work it through: a hundred thousand operations per second, so the time per operation is ten microseconds and the time per loop is 30 microseconds, which comes out at about two and a half days. So already, over 40 years ago, early computer technology was able to take calculations which were simply impossible by pencil and paper and turn them into things which were tractable, even if you had to wait a couple of days; you can see this was going to have a huge impact. Wind forward 20 years: the Pentium chip came out in the early 90s with a 60 megahertz frequency — 60 million operations a second, about 17 nanoseconds per operation, roughly 50 nanoseconds per loop, and we're down to six minutes. These things are getting much faster. Then wind forward to 2012 and an Intel chip called the Core i7 with a three gigahertz frequency: work that through and you're talking about only a nanosecond per loop, and the whole calculation takes seven seconds. So over 40 years we've gone from something which took two and a half days to something which takes 7 seconds.

That looks great: as time goes on, computing gets faster, so if I have a scientific calculation I want to do, I just wait, and magically in a couple of years somebody will bring out a faster processor and my calculation will run faster. Unfortunately that's no longer true, and I'm going to go through why. But just to get the figures straight — we get a bit blasé about these things nowadays — think about what a three gigahertz chip means. A third of a nanosecond is the time it takes a modern processor to do an instruction, and in a third of a nanosecond light only travels ten centimetres. Light goes, in old money, about a foot in a nanosecond; so in a third of a nanosecond light — the fastest thing there is — goes about ten centimetres. We're really down to quite phenomenal speeds here. But what I'm going to tell you is that this has stopped, and that's quite an important point.
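Here's that back-of-envelope arithmetic in one place — 7 billion iterations times 3 operations, divided by the clock rate, taking the naive one-operation-per-cycle model above at face value:

```c
#include <stdio.h>

int main(void)
{
    double ops = 7e9 * 3.0;    /* 21 billion operations in total */
    const char *name[] = { "human (1966)", "Intel 4004 (1971)",
                           "Pentium (1993)", "Core i7 (2012)" };
    double hz[] = { 1.0, 100e3, 60e6, 3e9 };   /* operations per second */

    for (int i = 0; i < 4; i++)
        printf("%-18s %g seconds\n", name[i], ops / hz[i]);
    return 0;
}
/* 2.1e10 s is about 650 years; 2.1e5 s is about 2.5 days;
   350 s is about 6 minutes; and finally 7 seconds. */
```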
So here's a graphic of how much faster things got. In 1971 — let's call that the Intel 4004 — its speed, the inverse of the time, was 1. By 1975 we were going about five times faster, then ten times faster; by 1980 a lot faster; then 1985, 1990 — and you can see, I'll come back to this, that performance is growing exponentially. If I ran that computer program I showed you — start at the top, loop, add, go back — the time would be getting exponentially shorter. Let's wind forward a bit and rescale things: make that 300 back down at the start of the 1990s, so our baseline is now 300, and go forward again: 1995, faster again; 2000, faster; 2005, faster. And the thing I'm going to tell you is that between 2005 and 2010 there was no improvement. If I'd run my program, exactly as I showed it to you, in 2005, it would have gone thirty thousand times faster than it did in 1971; if I ran the same program in 2010, it would still have gone only thirty thousand times faster. That seems counterintuitive, because everyone knows that between 2005 and 2010 computers got a lot more powerful — and they continue to get more powerful — yet I'm telling you the program I wrote wouldn't have gone any faster. So why is that?

Well, look at what was happening back in 1971. That chip had about 2,000 transistors, and you can actually see them: you can almost pick out the individual transistors and tracks. A very simple processor. Wind forward, and what happens is described by a law — really an observation — due to Gordon Moore, who co-founded Intel, dating I think from the mid-1960s. He noticed that manufacturing technology was developing

at such a rate that you could put twice as many transistors on a fixed size of silicon — a fixed chip — roughly every two years. That increase in transistor density translated into an increase in frequency, which translated into faster processing, and that's what drove this arms race of increasing gigahertz over the years: the manufacturing technology enabled you to build faster and faster processors. So we wind forward to 1992-93 and there are three million transistors — you can no longer see the individual transistors. Then we go down to 2004-2005, which is when I said things stopped happening, and there are a hundred million transistors: between 1993 and 2004 we've got about 30 times as many. So you go to your computer engineer, your hardware designer, and say "you can have 30 times more transistors", and he designs a chip which is 30 times more complicated — incredibly complicated. Then in 2006 you come back and say "I've got twice as many transistors again, what are you going to do?", and he comes up with this. Now, I'm not a hardware engineer, but you can see there's something suspicious going on there: he's just given me two of what I had before. Rather than going away and making a chip which is twice as complicated, he has just given me two of the processors he gave me before. And that is what caused the stalling in the speed of the program I wrote: this is a dual-core processor. Since about the mid-2000s, rather than using increasing transistor density to produce faster processors, what they've done is give you processors which are the same speed, but more of them. Since 2005 processors haven't got any faster — clock speeds are still a few gigahertz — but we get more of them.

The reason is that two cores at the same frequency produce less power and less heat than one core at twice the frequency. You can't keep clocking up the gigahertz, because eventually the power consumption gets too high and you couldn't use the processor in any normal device without special cooling. If you went to a 10 gigahertz processor, your laptop would get so hot it would burn your lap — no one's going to buy that kind of processor — and if you have lots of them, powering and cooling them becomes too costly. So it's physical limitations like heat dissipation that mean that, since the mid-2000s, we have not had faster processors: we've got more of them. And that's why I showed my calculation stalling: as written, that program could run on the left-hand core, or it could run on the right-hand core, but it can't run on both cores, so the individual program runs at the same speed. Now, you may say that's fine: I've got a laptop, I'll run that program on the left-hand core and run a game on the right-hand core, or check my Facebook status. If you've got more than one thing to do, multi-core processors are great, because they can do different things at once. But if you've got a single calculation that you want to go fast, this is a problem.
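Why does doubling the clock cost so much more power than doubling the cores? A common rule of thumb — not from the lecture slides, just the standard first-order model — is that dynamic power scales as P ∝ C·V²·f, and raising the frequency generally requires raising the voltage roughly in proportion. A toy calculation:

```c
#include <stdio.h>

int main(void)
{
    /* Toy model: power ~ V^2 * f, with voltage assumed to scale
       linearly with frequency. Both options double the throughput. */
    double one_core = 1.0;                   /* one core at clock f       */
    double two_cores = 2.0 * one_core;       /* same f, same V: 2x power  */
    double twice_clock = 2.0 * 2.0 * 2.0;    /* f doubled, V doubled:
                                                (2V)^2 * 2f = 8x power    */
    printf("two cores at f:       %.0fx power\n", two_cores);
    printf("one core at 2f: about %.0fx power\n", twice_clock);
    return 0;
}
```

Under those assumptions, two slower cores deliver the same nominal throughput for roughly a quarter of the power of one double-speed core — which is exactly why the hardware designer hands you two of what you had before.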
So what you need to do to take advantage of this is to parallelize the calculation: take the single calculation and run it simultaneously on multiple cores. Serial computing is old-fashioned computing, like I had there: I just write a program and run it, and the limitation is that, without some intervention, it will only run on one of the two cores at once, leaving the other one idle. If you look it up in the dictionary — I think this is the Oxford English Dictionary — serial, as applied to computing, means a process running as a single task: a normal program, what I'd call a serial program, is a single task, and it can only run on one of those cores. Parallel processing, in computing terms, is a mode of operation in which a process is split into many parts which execute simultaneously on different processors attached to the same computer. Well, we already have the second part — different processors attached to the same computer — so we just need the first part: to split the process into many parts. For something simple like addition, that turns out to be a relatively simple process. The important point about addition is that you can do it in parts: if I had to add up a hundred thousand numbers and I had 10 people to do it, I could just give 10,000 numbers to each person; they each add up their own sub-list, and we add the sub-totals together at the end, because addition is an associative operation. So in simple cases like this it's quite straightforward to take a program and parallelize it — make it run simultaneously on multiple cores. In pseudocode, what I do is write a program which I run simultaneously on both cores, but I say: if I'm core 1,

I sum the top half of the list — I run the same loop I had before, but restricted to i = 1 to N/2, i.e. one to three and a half billion — and if I'm core 2, I sum the bottom half of the list, from three billion five hundred million and one up to seven billion. Then, having got those two totals, I'm going to add them together. But the important point is that to add them together, we both have to have finished. So the additional thing we need, over and above running the serial program twice and arranging for each copy to do a different part of the calculation, is some synchronization — some communication between the cores. In this simple example it turns out to be just a wait: we need to wait for both cores to finish, and only when both cores have finished can I form the total as total = total1 + total2. In this example, with big integers, I'll get the same answer, and I can print the total. Now, that may seem like a fairly trivial operation — I just wait for both cores to finish — but doesn't it make you think: what happens if computing total1 and total2 took different amounts of time? In this example, adding the top half of the list and adding the bottom half are operations of equivalent computational complexity, but if they weren't, what would I do? That is one of the challenges of parallel computing: keeping all your processors busy all the time, in real situations — not synthetic ones like this — where you can't predict the calculation time.
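Here's a minimal sketch of that two-core scheme as an actual C program, using OpenMP threads for the "if I'm core 1 / if I'm core 2" split; the data is the same hypothetical stand-in as before:

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000                 /* stand-in for 7 billion */

double income[N];

int main(void)
{
    double total1 = 0.0, total2 = 0.0;

    #pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num() == 0) {          /* if I'm core 1...    */
            for (long i = 0; i < N / 2; i++)      /* ...sum the top half */
                total1 += income[i];
        } else {                                  /* if I'm core 2...    */
            for (long i = N / 2; i < N; i++)      /* ...the bottom half  */
                total2 += income[i];
        }
    }   /* implicit barrier: the "wait for both cores to finish" */

    double total = total1 + total2;  /* only safe once both have finished */
    printf("total income = %.2f\n", total);
    return 0;
}
```

The closing brace of the parallel region is exactly the synchronization point described above: neither thread proceeds past it until both have finished their half.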
Moving forward, we now have 4 billion transistors on a chip, and if you count, there are 32 cores there — that was a precursor, I think, of the Intel Xeon Phi. You can cram in more and more transistors, but all we can do with the extra transistors now — because of the practical considerations like heat and power — is miniaturize the individual core and have more of them. Nowadays you couldn't buy a single-core processor if you wanted one: even your mobile phone is probably dual-core or quad-core. So, as I said: winding forward, my serial program's performance stalled in 2005 at thirty thousand times the 1971 baseline. But if I run the parallel program, I can carry on riding Moore's law: I can take advantage of those extra transistors and go four times faster by 2010, when I have a quad-core chip — and clearly, if I can split the calculation over two cores, it's relatively trivial to split it over four. But it is an important observation that an individual serial program is not getting any faster, despite the fact that computers are getting more powerful.

So what do we do in supercomputing? We saw that a dual-core processor plus a parallel program gave us a faster program, and it's always true that, however fast one processor is, more than one is faster. So the approach in parallel computing is to take many, many processors. A single processor already has multiple cores — multiple individual CPUs — on it, but we can always do better than that: we can just buy lots of them and stick them together. As long as we can write a program which can take advantage of multiple distinct processors, not just multiple cores on the same processor, we can carry on: we can effectively keep pace with Moore's law just by adding more processors. And this is the approach supercomputing has taken for over 20 years: rather than developing special-purpose processors, you buy fairly standard processors, because they're cheap, because they're mass-produced, and you put lots of them together. Until three or four years ago we ran a system at EPCC, a Cray XE6, which had about five and a half thousand CPUs: 90,000 cores, since each CPU was a 16-core CPU, with a gigabyte of memory per core. It took about a megawatt of power, which is ballpark a million pounds a year in electricity: these are quite large installations, taking a lot of power and cooling. One detail I haven't really covered: the cores on a single chip can communicate easily — we'll see why — because they can read and write the same memory, but distinct processors are like two laptops. There are a couple of laptops in this room, each perhaps with a quad-core processor; to get those two laptops to talk to each other requires some external networking — in that case Wi-Fi or Ethernet — but for a parallel supercomputer you buy a very fast network. So you buy lots and lots of fairly standard, relatively high-spec processors, and you link them together with a dedicated network that allows the processors to communicate quickly and efficiently.

The current system is called ARCHER, and it's been around for a couple of years, since late 2013. This machine has about 10,000 CPUs, each CPU has 12 cores, so we have almost 120,000 cores, and a bit more memory — but still the same power consumption, because that is now the limiting factor. The limiting factor in computing is power. A laptop can't run too hot, otherwise it would burn you; with a mobile phone, people want their batteries to last a long time. So typically what constrains you is your power envelope, and if you're running a real supercomputer, you're limited by how much electricity you can afford and how much you can physically get into your computing centre — a megawatt or a couple of megawatts is the kind of envelope we operate in. But even within that fixed power budget, because of the increase in both the number of CPUs and the number of cores, we carry on increasing performance: this machine has about four times the performance of its predecessor. And of course you need a faster network to support that — more about this in the next few weeks.

Now, I had a very simple example there — adding up numbers — which is a useful test case, but it's not what we really use these computers for. We use them, for example, for weather modelling. When you turn on the news and are told it's going to rain over all of Britain, or it's going to be sunny, or wet, or cold — whatever suitably bleak picture is being painted — what's happened is that the UK Met Office, who have their own Cray, have run a parallel computer program to simulate the weather, to work out what the weather will be tomorrow, faster than real time. There's no point predicting tomorrow's weather if it takes two days to do it: the prediction has to run much faster than real time. And it turns out that in weather forecasting, to a first approximation, what you do is split the map up into sections: you divide the map — here, of Great Britain and Ireland — into squares, and you assign each square to a different processor. The way the equations work, the communication between squares is localized in some sense; we'll come back to this specific example later. That works reasonably well: although the weather over the south-east of England, where the green processor is working, may be slightly different from the weather over Edinburgh, where the red processor is, the calculation you do is effectively the same — simulating the weather over a hundred-mile grid square is about the same amount of work in each case. That's a naive explanation, but a lot of large-scale scientific computations can be split up into different physical domains, and those domains can be operated on almost independently. Clearly the weather is coupled between neighbouring squares, so dividing up the map incurs the cost of additional communication, but you win, because overall you can use parallel processing to speed up the computation.
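A minimal sketch of that map-splitting idea — a hypothetical square grid dealt out in equal blocks, one per process, with the process count assumed for simplicity to form a square arrangement:

```c
#include <stdio.h>

int main(void)
{
    int nx = 128, ny = 128;    /* hypothetical weather grid, 128x128 cells */
    int pdim = 2;              /* 2x2 arrangement of 4 processes           */

    for (int p = 0; p < pdim * pdim; p++) {
        int px = p % pdim;     /* this process's column of the map */
        int py = p / pdim;     /* this process's row of the map    */
        printf("process %d owns cells x:%3d-%3d, y:%3d-%3d\n", p,
               px * nx / pdim, (px + 1) * nx / pdim - 1,
               py * ny / pdim, (py + 1) * ny / pdim - 1);
    }
    return 0;
}
```

Every block has the same number of cells, so — since a cell costs about the same to simulate wherever it sits on the map — every process gets about the same amount of work, and only the cells along the block edges need to be communicated to neighbouring processes.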
I'll cover a much simpler example in detail, hopefully at the end of today. Another thing you might want to do is simulate the planets — I'm old-fashioned, so Pluto is a planet — going around the Sun. You might say: we'll take the same approach here and split the physical domain up into blocks; I've got four processors, so I'll just split the void of space into four regions. But we have a problem here, and it's twofold. Even as I've drawn it, you can see the processor in the top right-hand corner has five of the nine planets to do, so already there's a load-balance issue: that wait statement I had in my very simple program is going to give us a problem, because the processor in the top right-hand corner is going to take a long time to finish and the others are going to be waiting for it. Secondly, the planets move: over time they might all migrate towards the bottom left-hand processor. And I can't do something clever like dividing the void of space up in a slightly different way — shifting the boundaries to make sure everyone has the same number of planets — because they keep moving around. So different problems require different parallelization approaches, and here what you do is divide up not space, but the entities you're simulating: you give a fixed number of planets to each processor. If I have three processors — blue, red and green — I might give the inner three planets to the red processor, the next three planets to the green processor, and the outer three planets to the blue processor.
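In code, that particle decomposition is just dealing out contiguous chunks of the body list, regardless of where the bodies sit in space — a minimal sketch with the nine planets and three processors:

```c
#include <stdio.h>

int main(void)
{
    int nbodies = 9;        /* planets, ordered from innermost outwards */
    int nproc = 3;          /* the red, green and blue processors       */

    for (int p = 0; p < nproc; p++) {
        int first = p * nbodies / nproc;             /* first body owned */
        int last = (p + 1) * nbodies / nproc - 1;    /* last body owned  */
        printf("process %d owns bodies %d to %d\n", p, first, last);
    }
    return 0;
}
```

Each process always owns exactly three bodies, however those bodies move through space — which is the point of decomposing over particles rather than over the spatial domain.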

That ensures load balance: it ensures, to a first approximation, that the processors have equal amounts of work to do and are kept busy all the time. So different problems can have different parallelization approaches, and we'll look at a couple of these over the next few weeks.

In some sense, I think of computers as being universal: to a scientist or an engineer, a computer is something like a universal experiment. We can build very complicated physical pieces of experimental equipment — complicated microscopes, the Hubble telescope, the Curiosity rover with its amazing sky-crane way of landing on Mars. All of these are amazing pieces of equipment, but they are effectively built to do one thing. A computer, by contrast, is like a blank sheet: someone buys a computer and gives it to you, and if you didn't program it, it would just sit there. So what computational scientists do — and increasingly all scientists and engineers do — is write software: although the computer can be seen as a universal experiment, to simulate the weather, or materials at the atomic scale, or galactic collisions, you need to write software to do it. Increasingly, computational science is — and has been for a long time — about writing good software, and nowadays that almost universally means good parallel software, to take advantage of parallel processing for your particular simulations.

Okay — the last part of the talk was a bunch of movies. I don't have them here; they're too big to cart around and they tend to kill machines like this one, so I'm not going to go through them, but there's a link on the web page to a YouTube video where I've gone through all of those. So I'm going to stop that talk at that point. It is very much a high-level overview, written for a public-understanding audience. Are there any questions? — oops, I've lost all my windows... okay, most people seem to be fine. I was slightly concerned the machine would disappear into itself if I live-cast the live cast.

So what I'm going to do now is, I think, just a very brief talk — let's see how far I get; I want to break at three to give people a break. This lecture slightly overlaps the previous one, but I'll go through it, and that will be a good place to break, because we started slightly late. So again: briefly, why high-performance computing is useful. The previous talk has answered a lot of that, so I'll go through a few of the other issues, give a few ideas of what the drivers for HPC are, and some explanation of the hardware. I'm not a particularly massive fan of hardware — not a hardware junkie: if I get a laptop, I don't really care what it is as long as it's fast enough. However, for parallel computing you do need to understand the hardware, at least at a conceptual level, to get the best out of it. Maybe in a few decades' time we'll have compilers which can automatically parallelize your software, and you won't need to know anything about the hardware; we are not at that stage at the moment. Parallel computing is still a surprisingly manual process, and you need to understand two things. You need to understand how, in principle, you can decompose your
problem — split your program, your task, up into subtasks that can run in parallel — but you also need to understand how the hardware works, to be able to allocate those subtasks effectively to the different computing units.

So what is HPC used for? Well, we've seen it, but to recap: scientific simulation and modelling drive the need for ever greater computing power. I've talked a lot about scientific simulation, and all my examples were from science, but of course this is equally true in engineering: designing any physical device — cars, indeed computers themselves — requires vast amounts of computer simulation. And we've seen that making processors with faster clock speeds is difficult: we have heat and power limitations. Now, you could go to a computer manufacturer and ask them to make you a very fast processor, and say "look, I'll buy special cooling equipment, it's fine, I'll cool it, I don't care" — but the problem is that chip development is incredibly expensive: it costs billions of pounds to

develop a new processor technology. And you might think that ARCHER has a lot of cores — it has 120,000 — but that's nothing: there's maybe ten times that much computing in Edinburgh alone, on people's laptops and desktops and iPhones. The HPC market, although the individual computers are very big, is globally a minuscule fraction of the IT sector, so all the investment goes into commodity technology. Although you could go out and get someone to build you a special-purpose processor, it's simply not economically viable: the most efficient, best-built, most performant processors in the world — the leading-edge ones — are the ones you have in everyday devices, because that's a multi-billion-dollar market, as opposed to the rather niche market that we inhabit. You may also have seen that ARCHER has a large amount of memory — over 300 terabytes, which is quite a lot — and you can't really put that amount on a single processor. So you largely use parallel computing to access large amounts of computing power, but it also gives you access to huge amounts of memory, because, as we'll see, the amount of memory scales with the number of processors: by adding more processors you add more memory, and that can be useful as well.

So, a generic parallel machine. The best conceptual model I have for a parallel machine is a lot of laptops connected together by a network. Modern parallel computers are lots of individual small computers — each, think of it as a laptop with a few tens of cores — linked together by some network. Of course, in a machine like ARCHER the individual computers don't have keyboards or screens, and the network is far more performant, but conceptually that's what they're built of. Each of these individual computers runs its own operating system: you effectively have a very souped-up cluster — a very large number, thousands, of individual computers, each multi-core, each running its own operating system, linked by some network. And it's up to the programmer — the parallel programmer — to divide their problem up and distribute it across these distributed resources.

Typically, the terminology we use is that each "laptop" would be called a compute node: you talk about how many nodes a parallel computer has. A node runs one operating system, but it also has one connection to the network: if we think of the communications network as a graph, the endpoints — the nodes of the graph — are the computers that talk down it, each with its own operating system and its own network connection. So on ARCHER, for example — I may have this number slightly wrong — there are about 5,000 copies of Linux running: five thousand individual Linux systems talking to each other. And if each node in this diagram were a quad-core laptop, we would say the total system had 20 cores; but it's actually quite important to know that although the machine has 20 cores, it's five individual nodes — five individual computers — each of which is a quad-core system. Modern parallel computers have this hierarchy: there's parallelism within a node — within a computer — because of multi-core technology, and then we buy lots of nodes and string them together.
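Since each node runs its own operating system and its own copy of your program, the standard way to program across them is message passing — we'll come back to this properly. As a taster, here's a minimal sketch in which every copy of the program identifies itself and the node it's running on:

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char node[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which copy am I?        */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many copies in all? */
    MPI_Get_processor_name(node, &len);     /* which node am I on?     */

    printf("process %d of %d running on node %s\n", rank, size, node);

    MPI_Finalize();
    return 0;
}
```

Compiled with an MPI wrapper (e.g. mpicc) and launched as, say, 20 processes across five quad-core nodes, you'd see four processes reporting each node name — the node/core hierarchy just described, made visible.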
As for the kinds of simulation you run: there's a nice picture there of someone simulating a running dinosaur on ARCHER, and as I said, the video linked on the web page — the final part of the previous talk — goes through a lot of these examples and explains how they work.

Fundamentals: as I said before, parallel computing and high-performance computing are intimately related. Twenty-five or thirty years ago you could do high-performance computing by going out and buying a special processor which was really fast, but for well over 20 years now, to get high performance you have needed to go to parallel programming. And there are two very different programming models you can use to program in parallel: one relies on shared memory, and one on distributed memory. I'll talk about both, and you do need to understand how they work to get the best out of your hardware. So again, why do you need to know about the hardware? Well, there are lots of different parallel computers out there, and understanding them allows you to use the appropriate resource for your application. You have a problem you want to solve, you've got a way of parallelizing it, you may have written your software — and then you've got a whole plethora of different computers to run it on. What kind of computer do you want? One with a lot of nodes, each with a small number of cores? A small number of nodes, each with a large number of cores? Or a

machine with some alternative accelerator attached — GPUs, say, or the Xeon Phi? If you understand how the hardware works, you can make an informed decision about the best resource to run on. Equally, understanding how parallel computers work can inform how best to parallelize your application: there's always more than one way to parallelize an application, and if you understand how parallel computing works, you can see that one way isn't going to be particularly efficient while another might be.

There are differences from desktop computing. You don't log on to the compute nodes of a parallel computer directly: we'll see, when you do the exercises, that you submit jobs through some batch scheduling system. You log on to some gateway or login node — I'll show a diagram in a second — and from there you submit jobs to the cluster, to the parallel computer. It's not a GUI-based environment; for someone like me it's comfortably old-fashioned, in a good way: it's almost universally Linux-based and fairly command-line based. There are more GUI-type environments coming in, but it's still fairly old-fashioned command-line stuff. You share the system with many users: a machine like ARCHER has 120,000 cores and thousands of users, and at any one time many users will be running jobs on the system. And the resources are tightly monitored and controlled. For example, if you want to use ARCHER at large scale, you make an application and get an allocation of computing time, which is effectively a budget: a certain number of CPU hours. Every time you run a job, its cost is decremented against that, and when you're out of CPU hours you can't run any more. So it's not a free-for-all like a departmental server might be: both CPU usage and disk usage tend to be much more locked down and tightly controlled — there's only a certain amount of disk and a certain number of processors, and they're allocated.

I've talked about performance a lot, so I should quantify what I mean by it. In scientific and technical computing we use floating-point operations per second: flops. A floating-point operation is adding two double-precision numbers, or multiplying two double-precision numbers — modern scientific computing is almost universally done in double precision rather than single; I'll cover how floating-point arithmetic actually works on a computer in a couple of weeks. A modern processor can issue one of these instructions per cycle: there's a single assembly-language, machine-code instruction for multiplying two double-precision floating-point numbers together — which seems incredible — and that's one floating-point operation. Modern supercomputers are measured at the petaflop scale: kilo is a thousand, mega a million, then giga, tera, peta, exa. We're at the petascale: 10^15 floating-point operations per second. ARCHER has a peak performance of about two and a half petaflops — two and a half thousand million million floating-point operations per second.
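That peak figure is just multiplication, and it's worth seeing where it comes from. A sketch of the arithmetic, assuming — these two numbers aren't in the lecture — a 2.7 GHz clock and the usual 8 double-precision operations per core per cycle for this generation of Intel processors:

```c
#include <stdio.h>

int main(void)
{
    double nodes = 4920.0;          /* ARCHER compute nodes            */
    double cores_per_node = 24.0;   /* cores per node                  */
    double clock_hz = 2.7e9;        /* assumed clock rate              */
    double flops_per_cycle = 8.0;   /* assumed DP operations per cycle */

    printf("peak = %.3g flop/s\n",
           nodes * cores_per_node * clock_hz * flops_per_cycle);
    /* prints peak = 2.55e+15, i.e. about 2.5 petaflops */
    return 0;
}
```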
Now, we'll see that the peak figure is something akin to the environmental rating of a Volkswagen: what you get when you actually drive it is very different from what it says on the tin, because there are other limiting factors. Jumping ahead: in practice, modern scientific and technical computing is limited by access to memory. You might be able to do 10^15 floating-point operations per second, but you cannot fetch 10^15 floating-point numbers per second: memory access speeds are lagging way, way behind clock speeds. It's a major issue — to a first approximation, what matters in modern scientific computing is memory bandwidth; the clock rate is really not your limiting factor. That's the scientific computing metric; other disciplines have their own measures. If you're in graphics, you care about frames per second; if you're working for a bank, you'd want database accesses per second. But in scientific computing, floating-point operations per second is the benchmark — the metric that's used, for good or ill.

So, very briefly, a schematic of how we use an HPC system — and you'll encounter this whether you use ARCHER or, if you're a local user, Eddie, the University cluster. From outside, you ssh into some login nodes, to upload and download

data, to compile, and then to submit jobs. The login nodes are a few front-end systems; the hundreds or thousands of cores you actually want to run your parallel program on — typically called the compute nodes — you don't log on to directly. You have an intermediary, the batch system: you say "please run this job, it's going to take an hour and I need 100 cores", and the batch system is the thing which schedules those jobs onto the compute nodes. There will be shared disk between the two, so your program and its data can be moved back and forth through some shared disk system, but the important point is that you log on to a front-end system and interact at least second-hand, through this batch system. In fact, the batch system is about the only thing which has a global view of the computer: each compute node is just a little Linux machine sitting there, running Linux and thinking it's a nice little self-contained multi-core machine. It's not really aware of the fact that it's in a big cluster hanging off a high-performance network; each is an individual computer. The only global view you have of the machine is through the batch system — it's the only piece of software with a view of the machine as a whole. Typically, as I said, you write code and compile on the login nodes, then you execute on the compute nodes, and you go round that cycle.

On ARCHER, to reiterate — we've seen this picture already — the peak is about two and a half petaflops: 2.5 × 10^15 floating-point operations per second. The machine is a Cray XC30; it has Intel Ivy Bridge processors, typically twenty-four cores per node, and about 4,920 nodes — hence roughly 120,000 cores — each node running a version of Linux called Compute Node Linux. Now, you might say: wait a second — ARCHER costs about 43 million pounds. You've given 43 million pounds to Cray, and they've given you processors from Intel and an operating system based on Linux, which they found free on the web — sounds like a very good deal for Cray. But what makes a parallel computer special is the interconnect, and what Cray have put their effort into is creating very, very high-performance, high-bandwidth, low-latency networks. On ARCHER we have something called the Aries interconnect, which has a strange topology called the Dragonfly topology, but the main point is that this is Cray's input, at least at the hardware level (they also do the full software stack): the thing which makes ARCHER different from a bunch of Linux laptops is the interconnect, plus the software stack they run on it.

So, in summary: high-performance computing equals parallel computing, and has done for over 20 years — you have to run on multiple processors or cores at the same time, and it may seem strange, but in most real applications that is still very much a manual, developer-driven process. We typically use very standard processors, because that's where the mass market is, that's where the billions of dollars of investment go: we buy standard processors, but use thousands of them. And the one additional feature you need to make this work is a very fast interconnect — a very fast network for inter-processor communication. Okay, so that was that talk, and we're almost back on time. Let me check whether there are any
questions at all, about any of that? It's all very general and high-level; we'll go into a bit more detail in the next talk. What I'll cover next is a brief overview of the hardware, and then the two exercises which I hope people will work on. One is a prepackaged parallel program, and it's purely designed to make sure that, between now and next week, you can get on and use a parallel machine: if you haven't done parallel computing before, I don't expect you to understand how it works, but I want to make sure you run a parallel program early, just to get all the issues flushed out with your use of Eddie or ARCHER, depending on whether you're a local or remote user. The other program is a simple cellular automaton model, which is relatively straightforward to program. The point of programming it is that it's the model we'll look at next week, when we say: okay, you've written this very simple model in serial, in whatever language you want — Python, C, C++, Java, whatever — now how would we parallelize it? It's a very simple example, but it's a surprisingly good model for how real scientific computation is parallelized, and I'll use it to illustrate the two basic programming models: distributed memory and shared memory. So I'll stop there; we'll come back — to get back on time — at half past three, and I should be finished by around half past four.

So, I'll start again, and apologies to

those viewing remotely: it looks like we still don't have a separate PowerPoint feed, but you can get the slides on the web. The talk I'm going to give now is the fourth talk, "HPC Architectures", under the lecture slides on the materials page. I'm going to give a brief overview of the kinds of HPC architecture around at the moment, and allude to how we program them — I'll go into details next week. I'll talk about shared-memory architectures, distributed-memory architectures, hybrid distributed-/shared-memory architectures, and a bit about accelerators, which means GPUs and such like, with a bit of an overview of how these machines are classified.

First, shared-memory architectures. This is nowadays synonymous with multi-core: a shared-memory architecture is one where you have multiple processors, or processor cores, attached to the same memory, and that's the architecture your laptop or desktop has — a single block of memory with multiple processor cores attached to it. In fact, multiprocessor systems have been around for a long time, since the early 90s, but what used to happen is that manufacturers produced single-core processors: you went to the hardware store, you bought a processor, you got one processor, and if you wanted a shared-memory architecture you had to build a special motherboard, where you plugged lots of single-core processors in, with external wiring to connect them all to the same memory. So shared-memory architectures have been around for a long, long time, and we tend to call these multi-socket systems: the motherboard has multiple sockets, you stick a processor in each socket, and there's external wiring attaching them to the memory. Nowadays, modern multi-core processors are just packaging technology: you get a shared-memory system on a single chip, and you couldn't buy a single-core processor if you wanted one.

It's a slightly confusing situation, because the word "processor" has ceased to really mean anything. To me, as more of a software person, the processor is a core: it's a CPU, something which can issue an instruction and multiply two numbers together. To a hardware person, the processor is the thing you buy off the shelf, which might have lots of cores on it. So "processor" is now an ambiguous term, and probably the correct terminology is "core" for what I would call a single CPU, and "socket" for the package: a multi-socket system is something with, say, a couple of processor packages in it, where the wiring between them is external. The important point is that a single operating system controls the entire shared-memory system: your laptop runs one copy of macOS, or one copy of Linux, and all those cores, however they're connected together, are attached to the same block of memory and are under the control of the same operating system. That's really what a node is in a parallel supercomputer: what I would informally call "a computer" — a single system with a single operating system; it's really the OS which defines that domain. So conceptually it looks like this: a single block of memory with a lot of processor cores attached to it through some shared bus — there'll be some circuitry that allows all the processors to access
the same shared memory. In this architecture all cores have the same access to memory — they're all equivalent in that respect. That's what a multi-core laptop is like: a bunch of processor cores all attached to one block of memory. But you can immediately see there's a problem, because there's a bottleneck: there's only one link going into the memory, and multiple processors sharing it. I've already said that memory access speed is a major limiting factor, and this makes it even worse: you're attaching multiple cores to the same memory through a single bottleneck. So this doesn't scale particularly well, and typically of order 10 to 16 cores attached to the same block of memory in this architecture is about the limit of practical usability. But this is what your laptop looks like. And you may worry — you might say: wait a second, surely lots of processors writing to the same block of memory is going to be dangerous, they could overwrite each other? Well, yes: in principle they can.
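Here's a minimal sketch of exactly that danger, using OpenMP to put two threads onto the same counter with no synchronization. Each increment is a read-modify-write, so updates from the two threads can be lost:

```c
#include <stdio.h>
#include <omp.h>

int main(void)
{
    long counter = 0;

    /* Two threads, both hammering the same shared variable. */
    #pragma omp parallel num_threads(2)
    {
        for (long i = 0; i < 1000000; i++)
            counter++;          /* unsynchronized read-modify-write */
    }

    /* Should be 2000000, but will usually come out smaller. */
    printf("expected 2000000, got %ld\n", counter);
    return 0;
}
```

Run it a few times (compiled with -fopenmp) and the answer will typically differ from run to run — a race condition of the kind discussed in a moment.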

For normal programming it's fine: if you run lots of individual applications, each of them becomes an operating-system process, and processes are isolated from each other. But we'll see that to do parallel programming we use threads, and threads can access the same physical memory — so these issues of race conditions, of multiple threads reading and writing the same memory location at the same time, become things we have to think about.

These are called symmetric multiprocessing (SMP) architectures: multiprocessing because there are multiple processors, or cores, and symmetric because they all have the same access speed to memory — nobody is privileged. I've also indicated on the diagram that modern processors have cache memory: very fast memory where they store data for subsequent reuse. The caches tend to be local to each processor: although you have a single block of main memory, to try and alleviate the fact that main memory access is cripplingly slow, each processor typically has its own cache.

What we do to go beyond that is build non-uniform memory access (NUMA) architectures. What this means is you build, say, a motherboard with four sockets — as I said, in the early days, if you wanted a four-processor system, you built a motherboard with four sockets and stuck four single-core processors in it; nowadays you can stick four quad-core processors in it. Each socket takes a multi-core processor — here with four cores — and has memory physically attached to it, but there is also external wiring which allows all the cores to access any of the memory. So although each block of memory is physically attached to a particular processor, there's a locality there: in the top right-hand corner, these cores have faster access to that memory, over a direct link, than they do to this other memory, which is off over some external links. It's still a single operating system, so this would appear to the user as a 16-core system governed by one OS. However, in terms of its performance characteristics — if you try hard, or indeed not particularly hard — you can see that this is a non-uniform memory access architecture: some memory is fast to access, namely the memory local to you; some memory is slower. Conceptually it's still a shared-memory architecture — a large block of memory with a lot of cores attached — but it's implemented in a different way, and the fact that cores have faster access to their own local memory can give you issues if you're concerned about performance; we may be able to look at some of those issues later on.

So most computers are shared-memory architectures. Some, as I said, are true shared memory with symmetric multiprocessing, but most have some level of NUMA: most machines now have multiple multi-core processors connected up externally, so you get some non-uniform memory access element. As a user, you program it as if it were a plain symmetric multiprocessor — all the cores are controlled by a single OS — even though it's implemented differently. However, this is difficult to scale: day-to-day, a shared-memory system might have a few tens of cores.
There are dedicated manufacturers who try to scale these even larger — Silicon Graphics scale shared memory systems up to thousands of cores — but it's very, very difficult to build shared memory systems with very large core counts, for two reasons. The first reason, as I said, is that you have this bottleneck to memory: you can keep sticking cores on, but at some stage they become useless because they can't actually read or write data at any reasonable speed. But secondly — and I may not have time to go into this fully — each of these processors has its own cache. You cache local data, and if you only ever read data, that's fine; the problem is that people like to write data as well, and that's a bit of a problem. Whenever you update data in your cache, you need to tell all the other processors; you need to say: wait a second, I've just updated my local copy of this data, so next time you access it, if you have a cached copy, it's invalid. Two processors can read the same data and each hold their own cached copy, which is fine; but if one of them alters that data, it needs to tell all the other cores: hey, I've just changed the variable x — if you've got a copy of it, you'll have to invalidate your cached copy and get it again from main memory. And clearly that's very difficult to scale: if every time I change some data I have to tell tens, hundreds, thousands of other cores that I've changed it, that eventually just runs out of steam. That's really the limiting factor — this cache coherency problem is the limiting factor on the size of these machines. So typically a modern shared memory system will have a few tens of cores in it.

But what we can always do is just buy lots of them, and this is the distributed memory architecture, where we build clusters. As I said, whatever the building block is — whatever the most power-efficient, cheapest, most effective multi-core node is — we buy lots of them. So effectively, modern parallel computers are multiple computers, each running their own operating system, connected by some interconnect. What that interconnect is will depend: if you're building a small-scale system you might just buy Gigabit Ethernet; for a moderate-scale system you might buy something more performant like InfiniBand; and for a very large-scale system like ARCHER you might have a bespoke interconnect. As I said, if I were in a training lab here with a lot of desktop machines, that would be the best conceptual model for a modern supercomputer: lots of individual computers, each multi-core, each running their own operating system, connected by some network. Each element taking part is called a node, and as I said, each runs its own copy of the OS. Almost all HPC machines are distributed memory, because of the fact that you cannot easily scale the shared memory architecture beyond tens, maybe hundreds, of cores. That means that if you write a parallel program, you have to communicate over this interconnect — different nodes, running different operating systems, have to communicate with each other — and performance is then largely limited by the performance of the interconnect. As I said, you can buy various different types of interconnect. It turns out that, at some level, any program only runs as fast as its slowest part — if you made the interconnect infinitely fast, you might then be CPU-bound or memory-bound or I/O-bound — but to a large extent, to get a program to a very large scale, to many thousands of cores, you will need a very good interconnect. That's why the very high-end machines built by Cray and the IBM Blue Gene series have their own dedicated, specifically engineered interconnects — not a product you buy off the shelf — while in the mid-range, InfiniBand is the dominant technology.

One thing which may not be obvious, though, is that high bandwidth is relatively easy to achieve. It's like a motorway: if I want to expand the capacity of a motorway, I add an extra lane; if I want to expand the bandwidth of an interconnect, I can just put in two wires, three wires, four wires. Bandwidth isn't particularly hard to increase. The problem is latency — the delay, the time taken to send a small message — and that's the thing which is harder to reduce. In fact, it turns out that in high-performance parallel computing we tend to send a relatively large number of small messages, so the time taken to transmit the data is at least as much dominated by the latency as by the bandwidth. And the problem is that in the commercial space, nobody really cares about latency. If you want to watch some high-definition TV at home, you want large bandwidth; but if you're watching a second behind, you don't really care. If you're a gamer playing on a network you do want low latency, but if the delay in a game is less than, say, a millisecond, you're simply not going to be able to perceive it: if you're — heaven forbid — virtually shooting somebody in some arena in a network game, a delay of a millisecond is just not noticeable. In high-performance computing, though, a millisecond is a huge amount of time: a millisecond is a million operations if you're operating at gigahertz frequencies, so every such delay means that many millions of wasted operations. So what we want is low latency, and that's why, at least currently, these networks are bespoke: although we're lucky in that we can ride the wave of very fast, cheap processors from the commercial sector, for networks, low latency isn't something which is really targeted by commercial products.

So almost everything in high-performance computing now falls into this class: distributed/shared memory hybrids. You network together a lot of computers — a lot of nodes — where each one is a shared memory architecture, so we have multi-core nodes, each with their own memory. That's why the memory in these large parallel systems scales with the number of nodes: each node has its own memory, and you just bolt them together. The network will have some topology: for example, the very early incarnations of the Cray systems used quite a simple regular grid — the systems lived in a fairly simple 3D grid, communicating by routing up–down, left–right, forwards–backwards — and more modern networks are more complicated, to get better bandwidth and fault tolerance, but there is some topology in there. And these hybrid architectures have not just multi-core nodes but NUMA nodes: multiple multi-core processors in a node, connected by some external network. As I said, it's very normal to have these NUMA, non-uniform memory access nodes — multi-socket systems with multi-core processors — but it's important to note that a single node is still one operating system: in this example, although each node is made up of four physical processors, each with four cores, there is still a single copy of your OS — in this case Linux — running on, and in charge of, that entire shared memory system.

So how do we program these? This is what we'll talk about a bit next week, but most applications in modern high-performance computing use something called message passing. Basically, in order for two different nodes to communicate with each other — and remember, these are effectively two distinct computers connected by some network — they have to communicate data, and we parallelize programs by message passing: when one processor wants to talk to another processor, it packages up the information and sends it down the wire. To all intents and purposes, it's like writing a parallel program where your individual cores communicate with each other by sending and receiving emails — it's a pretty good analogy. Whenever you want to exchange data with somebody, you have to explicitly package the data up into a single box and say: I want you to send these 45 kilobytes of data to processor number 53. It's sent over the network, and processor number 53 hopefully is issuing a receive to receive that data. So it's two-sided send and receive; it's very manual; it's been around for 20–25 years; but it's still the dominant way that these systems are programmed, and the library which is used to do it is called MPI. Typically, we program using standard languages — there have been a lot of parallel languages invented, but very few of them have ever taken off — so people write in C, C++ or Fortran and call external library routines to do the message passing. The compiler isn't involved at all: the compiler just compiles your serial code, and you have explicit calls to the message-passing routines.
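As a minimal sketch of what that two-sided send and receive looks like in practice — the buffer size and the ranks here are made up for illustration, this isn't from the course exercises — here is an MPI exchange in C:

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;
    double data[100];          /* the "box" of data we package up */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (int i = 0; i < 100; i++) data[i] = i;
        /* explicitly send 100 doubles to process number 1 */
        MPI_Send(data, 100, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* process 1 must issue a matching receive */
        MPI_Recv(data, 100, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received data[99] = %f\n", data[99]);
    }

    MPI_Finalize();
    return 0;
}
```

Launched with something like `mpirun -n 2 ./a.out`, each copy of the program is a separate process — possibly on a separate node — and the only way data moves between them is through these explicit send/receive pairs.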
The other difference in HPC is that we typically run a single process per core. You might have a multi-core laptop with four cores that is running hundreds of processes at any one time — if you look at Task Manager, or type top on Linux, or whatever the equivalent is on the Mac, you'll see it's running hundreds of processes, all being swapped in and out, time-shared. For us, if you're only concerned about performance, there's no point running more processes than you have cores: if you have eight processes and four cores, what's going to happen is that they'll get time-sliced, running in batches of four. So for high-performance computing you tend not to use the sophisticated process scheduling, or even the virtual memory capabilities, of modern operating systems; it's really quite stripped down. You really want to say: I've got a quad-core node, I want to run four processes, and I want process one on core one, process two on core two, process three on core three and process four on core four. An out-of-the-box OS is much too general, so on a machine like ARCHER, the operating system that runs on the compute nodes is massively stripped down, to get rid of all the extraneous functionality. In the old days, people used to write their own bespoke operating systems for high-performance computing — the company would produce both the hardware and the software — but you can see the temptation; you can see how the conversation went: you go to your product management and say, you know, we've got a team developing an operating system — but there's this one on the web that's free. What did you say? Free? That sounds really good. So everyone uses Linux. Linux has a lot of advantages, but it's not designed for high-performance computing, so a lot of the customization people do is exactly that: taking a lot of stuff out, stripping it right back.

Now, you might say this looks a bit weird, because I'm saying that to communicate between two nodes — which are physically distinct computers, running different operating systems, joined by some network — we're going to have to do something special, and this turns out to be sending messages; but on a node, in shared memory, the cores can communicate just by reading and writing the same memory. My analogy — I was going to cover this later — is a large blackboard: the cores are like four of you in an office sharing a big blackboard; you can all read and write, and you communicate by writing on the blackboard. You can do that, and the way that shared memory programming is done — typically using multiple threads — is through something called OpenMP. OpenMP is a system, which requires compiler support, for generating and managing multiple threads and programming in the shared memory environment. However, typically, people don't do that. Typically, people actually ignore the fact that we have this multi-level architecture, and we'll see that you can write a message-passing program where two cores on the same node communicate by sending messages to each other — which seems wasteful, when they could in fact interact directly. But it doesn't harm you, because you're always limited by your slowest operation, and communication over the network is going to be the slow operation: speeding up the communication within a node doesn't really help you, because you've still got to send messages over the interconnect. So it may seem unnatural, but until very recently people tended not to take advantage of the fact that some of the cores are actually physically connected to the same memory. That is happening more and more now, with people using a hybrid message-passing-plus-shared-memory model: what you'd typically do in this system is have one process per physical processor — one process per quad-core processor — and then the cores within a single multi-core processor communicate using threads. And then you get into lots of issues: modern operating systems try to isolate you from where processes and threads are physically running, but we want control over that, so again, that's part of the customization which needs to be done.
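To give a flavour of the shared memory side — a minimal sketch of my own, not part of the course exercises; the array size is made up — here is an OpenMP loop in C. The threads all read and write the same arrays directly (the blackboard), and the compiler, told by the directive, generates and manages the threads:

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N];   /* static: too big for the stack */
    double sum = 0.0;

    for (int i = 0; i < N; i++) b[i] = i;

    /* the compiler splits this loop across threads; all threads
       access the same arrays a and b in shared memory */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = 2.0 * b[i];
        sum += a[i];            /* the reduction clause avoids a race on sum */
    }

    printf("sum = %f, threads available: %d\n", sum, omp_get_max_threads());
    return 0;
}
```

Note the contrast with the MPI sketch earlier: here nothing is sent anywhere — the parallelism comes entirely from multiple threads touching the same physical memory, which is exactly why compiler support and care about races are needed.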
So, just to instantiate that with ARCHER: each node on ARCHER is a single 24-core system controlled by a single copy of Linux, but it's actually two 12-way multi-core processors per node. They're fairly standard processors — about 2.7 GHz, fairly standard Ivy Bridge — and we have about 5,000 of these nodes connected by the Aries network. That's typical of a fairly high-end modern system: many thousands of nodes, many thousands of copies of Linux, each node being a NUMA architecture with multiple processors and a few tens of cores per node.

You might have heard that accelerators are becoming more popular — people are looking at graphics processors. There's a drive towards very efficient processors for desktop and laptop computing, but there's even more of a drive towards very efficient graphics processors, because the games market is absolutely huge, so an awful lot of effort has gone into designing very fast, very performant graphics processors — designed for doing graphics. What people realized maybe five or ten years ago was that there was nothing to stop you using these processors for scientific and technical calculation: they're fundamentally doing floating-point operations. Graphics is all about geometry and rotations, and those are exactly the floating-point operations the graphics processor is optimized for. So people have started to use accelerators — GPUs — for doing non-graphical calculations. How are they incorporated? Well, typically you have a hybrid architecture. For example, if we look at ARCHER: ARCHER has two 12-way multi-core processors per node, so if you looked at a node you'd see two sockets with two processors stuck into them. If you wanted to include an accelerator, then instead of having two processors per node, you'd have one multi-core processor and one GPU, for example. So you have a hybrid, heterogeneous architecture: a single node is a combination of, say, CPUs and GPUs, and the nodes are connected using the standard interconnect, just the same as before. On a node you might have a number of accelerators — although with two sockets you can only have one here, because you need to have a host processor in one of the sockets. They're not particularly easy to program, although things are coming along.

At the moment, largely, to communicate between accelerators you basically have to copy the data from the accelerator to the CPU; the CPU communicates over the network to another CPU; and then the data is transferred to the other GPU. So the GPUs are there to accelerate the performance of the CPU, but they can't really communicate directly with each other, which introduces an extra hop. Also, communicating via CPU memory involves lots of extra copy operations: the CPU and the GPU don't share memory, so you have to physically copy between them, and that's a real bottleneck at the moment in all accelerators — you have a potentially very, very fast accelerated processor like a GPU, but its interface to the CPU is through some relatively slow link, so you're limited by copying data between the two. It's exacerbated by the fact that GPUs are so efficient — so good at doing floating-point operations — that they can eat up data at a very fast rate.
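As a sketch of that extra hop — this assumes an NVIDIA-style setup using the CUDA runtime API alongside MPI, without any of the newer GPU-aware MPI support that can hide these copies; it's a fragment of my own for illustration, not code from the course:

```c
#include <mpi.h>
#include <cuda_runtime.h>

/* Send n doubles from this node's GPU to the GPU on another node,
   the long way round: device -> host -> network -> host -> device. */
void gpu_to_gpu(double *d_buf, double *h_buf, int n, int peer, int rank)
{
    if (rank == 0) {
        /* hop 1: copy off the accelerator into CPU memory */
        cudaMemcpy(h_buf, d_buf, n * sizeof(double),
                   cudaMemcpyDeviceToHost);
        /* hop 2: the CPU talks to the other CPU over the interconnect */
        MPI_Send(h_buf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
    } else {
        MPI_Recv(h_buf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        /* hop 3: copy from CPU memory onto this node's accelerator */
        cudaMemcpy(d_buf, h_buf, n * sizeof(double),
                   cudaMemcpyHostToDevice);
    }
}
```

Every exchange pays for two memory copies over the slow host–device link on top of the network transfer, which is the bottleneck described above.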
When people talk about these machines, they typically talk about different tiers. In Europe we have a classification of this sort — there's PRACE, a big project aimed at tying up supercomputing across Europe — where, at the very high level, you have what are called Tier 0 machines: pan-national facilities shared by different nations. In this classification, ARCHER would be a Tier 1 machine — it's a national facility; then there are regional facilities, run by various regional organizations in the UK; and then at Tier 3 you have institutional facilities like Eddie. So the two machines you might be using for this course are a Tier 1 machine — the national facility, ARCHER — and, in this classification, a Tier 3 machine — a university-level facility, Eddie. Where they differ is not so much in the processor technology but more in the interconnect, and, because of that, in the size: Tier 0 systems can have more nodes simply because they have a better interconnect, and therefore they can support more nodes.

So, in summary: the vast majority of HPC machines are fairly simple — really very basic — architectures: shared memory nodes linked by some interconnect, and most are programmed using this pure message-passing model, which we'll come back to next time. These machines span a wide variety of sizes, from multi-petaflop machines with millions of cores, down to workstations with multiple CPUs and accelerators. And I would say that although I've been talking about the very high-end, very large systems, the programming techniques we're going to be talking about — shared memory and distributed memory programming — are equally applicable to your laptop or a small cluster; it's just that they become more and more important as the size goes up.

Let me check if there are any questions... no, fine. So what I wanted to do, unless there are any questions, is just go briefly through the exercises in the last 20 minutes. The first one — again, this comes from another presentation, one of those public-understanding talks, but it's useful here — well, there are two exercises. One is just to write a simple serial program — in, as I said, any language you want: Python, Java, Fortran, C, whatever you like — to do traffic modelling. The reason is that this traffic model, while very simple, is actually a very nice analogue of a real scientific computation, which we can then think about how to parallelize. So, very briefly, I'll go through this traffic modelling example. We want to predict traffic flow, which is very useful, and modelling traffic is a bit like simulating the weather: you have two modes. One is the equivalent of weather forecasting — what's the weather going to be like tomorrow; what's the traffic going to be like at rush hour — so you can make local decisions: how are we going to alter the traffic lights, how are we going to close lanes here and there. So, like weather forecasting, people want short-term predictions of how traffic is going to evolve, and you want to avoid congestion, where things lock up. (Something's going slightly funny with the slides here — these were originally videos, I think it's getting confused because it can't find them.)

But there's also the equivalent of climate forecasting in weather: if you can predict traffic flow, you can say, OK, we're going to build a new bridge over the Forth — how is that going to alter the traffic flow? Where are we going to need new junctions? Where are we going to need new road networks? That's longer-term, 'what if' questions: if we added a lane to this motorway, would it increase the throughput of traffic? What if we built a new junction here? So people use pretty sophisticated models for longer-term traffic planning, which is a bit more like long-term climate change simulation. (The controls have gone slightly sticky here... yep.) We build computer models — and this is actually a computer model that was run at EPCC about 20–25 years ago, to simulate the Newbridge roundabout and try to optimize it. If you're not British, you may not realize that we love roundabouts — we love traffic lights too; we even put traffic lights on roundabouts — so this was a simulation to optimize the sequence of the traffic lights on this roundabout, to try to maximize throughput, and they also worried about pollution effects.

But I'm going to use a very, very simple model — the simplest model you can imagine. You may have heard of the Game of Life, which is a simple 2D cellular automaton — Conway's Game of Life has been around for many years. Well, this is even simpler: we're going to do a 1D cellular automaton. We want to simulate traffic, so we divide the road into a series of cells — I've got seven cells here — and cells are either occupied or unoccupied. And we have one rule: a car moves forward if it can, and doesn't if it can't. We perform a number of time steps — each time step you could say is a second, or whatever you want it to be — and at each step, a car moves forward if it can, or not if it can't. So in this situation, the first car can move, the second car currently can't, the third car can't, so they move like that; then the second car and the third car can move; and now you can see that once you get this car–gap–car–gap pattern, they can all move, and they're all happy. It's important to note that this is an instantaneous update. What I'm trying to say is that the state of the road at the next time step is independent of whether you update left-to-right or right-to-left: you say, that first car can move, that second car can't, that third car can't — and then you move them all at once. (The animations are being a bit sticky here.) You could do this by moving pawns on a chessboard — in fact, a long time ago, before personal computers were invented, my brother and I used to simulate the Game of Life on a Go board, with lots of halfpennies (halfpenny pieces don't even exist any more): if you've ever played the Game of Life, you have a large grid of cells which become alive or dead, and we used to simulate that by hand. You could imagine doing this traffic model the same way: a chessboard with lots of squares and lots of pawns on it.

Next week, we're going to think about how you'd parallelize this calculation; for the moment, I'm just going to say that this traffic model predicts a number of interesting features. Traffic lights actually work reasonably well: if you start at some traffic lights with four cars in a row and then a gap, and the lights go green, you get realistic behaviour — they move off in a block; they congest at the traffic lights and then move away reasonably. So that's reasonably realistic. And you can actually run this model and get congestion, and this is the exercise: to see if you can reproduce this graph. The density of cars is the number of cars divided by the length of the road — so at one hundred percent density, every position is full. The velocity is the number of cars that move divided by the total number of cars: the velocity is one if all the cars move; the velocity is a half if half the cars move. At one hundred percent filling, clearly, the velocity is zero, because the cars are jammed and none of them can move. At low density — at least asymptotically, when you run the model for a while — the cars end up arranging themselves car–gap–car–gap: up to fifty percent filling, if you run the model for long enough, they eventually space themselves out.
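To make the exercise concrete, here is a minimal serial sketch in C — one possible implementation of my own, not the official one from the exercise sheet; the road length, density and step count are made up, and I've used periodic boundaries (the road wraps around) so the number of cars is conserved:

```c
#include <stdio.h>
#include <stdlib.h>

#define NCELL 1000   /* length of the road */
#define NSTEP 500    /* number of time steps */

int main(void)
{
    int road[NCELL], next[NCELL];
    double density = 0.4;        /* fraction of cells occupied */
    int ncars = 0;

    /* fill the road at random to the requested density */
    for (int i = 0; i < NCELL; i++) {
        road[i] = (rand() / (double) RAND_MAX) < density;
        ncars += road[i];
    }

    for (int step = 1; step <= NSTEP; step++) {
        int nmove = 0;

        /* "instantaneous" update: the new state depends only on the
           old state, never on cells already updated this step */
        for (int i = 0; i < NCELL; i++) {
            int behind = road[(i - 1 + NCELL) % NCELL];
            int here   = road[i];
            int ahead  = road[(i + 1) % NCELL];

            if (here == 1)
                next[i] = ahead;    /* car stays only if blocked ahead */
            else
                next[i] = behind;   /* empty cell is filled from behind */

            if (here == 1 && ahead == 0) nmove++;
        }

        for (int i = 0; i < NCELL; i++) road[i] = next[i];

        if (step == NSTEP)
            printf("density %.2f: average velocity %.3f\n",
                   density, (double) nmove / ncars);
    }
    return 0;
}
```

If you sweep the density from 0 to 1, and let the model settle before measuring, you should be able to reproduce the graph: velocity one below fifty percent filling, dropping off towards zero above it.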

speed is one; but above fifty percent filling it's impossible for every car to move, and you get a rapid drop-off in velocity. So you get a curve a bit like this: below fifty percent filling you expect velocity one, and then some quite rapid drop-off down to zero. It's almost a trivial model, but it's a useful computational exercise, and more importantly — and it may be quite surprising — it's actually a surprisingly good analogue of quite large-scale parallel computations. We use more complicated models in practice: multiple lanes, different kinds of vehicles, overtaking and so on. I thought I'd made this model up on my own, but it's actually well known: if you look on Wikipedia, it's called the Rule 184 model, because there are 256 possible update rules for one-dimensional cellular automata of this kind, and this is number 184.

So how fast could we run the model? As I said, this lecture was originally written for a public-understanding-of-science talk, so I tried to put in some fairly poor attempts at humour: the idea was that we measure performance in car operations per second, which is conveniently 'cops'. We'll talk about how to do this in parallel later on, but I reckoned that somebody who's quite good at chess — this is Bobby Fischer, the slightly eccentric world chess champion from the seventies — could update this model at two car operations per second. And what we're going to ask next week is: if we had three Bobby Fischers, could we update this model at three times the rate — could we update it at six cops? That's the interesting question, and we'll talk about it next week; for the moment, that's just a passing comment about performance — the exercise is just to write the code in serial.

This is a surprisingly useful model, though, because of what we did: we took a real situation, real traffic; we came up with some way of simulating it, some update rules; and then we modelled it — by hand, pawns on a chessboard, or as a serial program — and next week we'll see how we can model it in parallel. That loop — from a real situation, to a model with update rules, to a serial implementation, to a parallel implementation — is analogous, obviously infinitely simpler, but analogous to how you might model the weather: you have the real weather; you describe it in terms of mathematical equations; you solve those using some numerical solution method (we'll talk a bit about this in the fourth installment of this lecture); you then write a computer program to solve them; and then you write a parallel program. So although the steps are much simpler in the traffic model, fundamentally they're analogous. More importantly — and it might be surprising — the communications pattern you need for something like parallel weather simulation is remarkably similar to the communications pattern you need for the parallel traffic model. It's kind of obvious in the traffic model: the state of each cell depends on its two neighbours — the state of a cell at the next iteration depends on what's happening immediately upstream and downstream — so you have a limited domain of interaction: each cell i depends on its neighbours i+1 and i-1. It might not be obvious, but the same is true — I'll go back, I have the slide ready — when we talked about dividing the map of the UK into squares to do a parallel simulation: it turns out that the communication is typically just around the boundaries. Each square only needs to communicate with its nearest neighbours, because it's only the boundary information that matters; there's a limited domain of influence — the weather in the southeast of Britain doesn't instantaneously depend on the weather in the northwest of Scotland; there's some locality to it, just as the state of a cell depends only on its nearest neighbours — and that allows relatively simple parallelization strategies. (As I said, real computers are of course measured in flops, not cops.) And here's the diagram I had: for parallel weather modelling, you take the map — Great Britain and Ireland here — divide it into squares, and give a different square to each separate computer, each separate node of your parallel computer.
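Jumping ahead slightly, here is a minimal sketch — my own illustration in the same MPI style as before, not the course's official solution; NLOCAL is made up — of what that nearest-neighbour communication looks like for the 1D traffic model. Each process owns a chunk of road plus two 'halo' cells, and before each update it swaps boundary cells with its left and right neighbours only:

```c
#include <mpi.h>

#define NLOCAL 250   /* road cells owned by each process */

/* road[0] and road[NLOCAL+1] are halo copies of the
   neighbouring processes' boundary cells */
void haloswap(int road[NLOCAL + 2], int rank, int size)
{
    int left  = (rank - 1 + size) % size;   /* periodic road */
    int right = (rank + 1) % size;

    /* send my last real cell right, receive my left halo */
    MPI_Sendrecv(&road[NLOCAL],   1, MPI_INT, right, 0,
                 &road[0],        1, MPI_INT, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* send my first real cell left, receive my right halo */
    MPI_Sendrecv(&road[1],        1, MPI_INT, left,  1,
                 &road[NLOCAL+1], 1, MPI_INT, right, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
```

No matter how many processes you use, each one talks to exactly two others, so the communication per step stays fixed while the computation is shared out — exactly the locality that makes the weather decomposition work too.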

The important point — and it's not obvious unless you look at the equations — is that if every domain here had to communicate with every other domain, it wouldn't work: every time you needed to update the model to the next time step, every core would have to communicate with every other core, and you would just be lost — you'd spend all your time communicating and no time calculating. However, it turns out, at least in the simplest cases, that the state of a cell really depends only on its nearest neighbours, which translates into a processor — a core — only needing to communicate with its nearest neighbours: in two dimensions, possibly its eight nearest neighbours, the ones it's connected to. So the communication is localized and the communication overhead is manageably small: although there is an overhead to communication, the increase in computational speed from parallelizing your code greatly outweighs the increased communication. And again, the very simple traffic model, which we'll study next week, illustrates that.

If you look on the web at the exercise material, I have an exercise sheet for the traffic modelling exercise. It goes on to talk about parallelization in sections three and four — those are beyond the scope of today — but sections one and two are a description of the model, making the rules quantitative, and then writing a serial program, with some hints as to how you might play around with it to try to reproduce that graph of average velocity against density. I think it's a genuinely interesting exercise in itself, but the main reason for doing it is that once you've written a serial program, you've got a concrete idea of how it works, and then next week we can think about how you might parallelize it using the two different parallelization models, shared and distributed memory. So that's something you can do on your laptop — there's no parallelism in it.

The other example is called the sharpen exercise, and I have a very short lecture to introduce it. This is a program I'm giving you which you can run, and it's very useful because it lets you check that you can run on a parallel computer, and it does two things: first, it's a parallel program, so you should be able to see it get faster when you run it on more cores; second, it does file I/O — it reads in an image and writes out an image — and that's important because, at least on systems like ARCHER, you need to make sure you're running on the right file system, or you get problems. So it checks those kinds of basic things. Very briefly: this is an example that I stole from — it was done by, oh no, it actually was a Bob Fisher — I found it in the Hypermedia Image Processing Reference, from slightly over 20 years ago, by Bob Fisher and co-workers in the Department of AI at Edinburgh, back in 1994. The real reason for it is to familiarize yourself with running a parallel program — to give you something concrete to do, on ARCHER or on Eddie: to run a real parallel code that does file I/O. You can measure the performance of this code, and then we can see how well that correlates with something called Amdahl's law, which we'll cover later — a very simple, zeroth-order model of how you'd expect parallel performance to vary with core count.
Obviously, you'd like it if, when you run on ten times as many cores, your program goes ten times as fast; that's essentially never the case, and Amdahl's law is what this code illustrates in practice. At home you can measure the run times, and we can come back and see how they do or don't relate to what we'd expect. So: get your accounts sorted out, and sort out all the details on your Windows, Mac or Linux laptop — all those other bits and pieces. For those of you attending remotely: if you're trying to access ARCHER, it's actually down today for maintenance — five o'clock is the intended return time. It's at-risk every other Wednesday afternoon, and it's fairly rare that it actually goes down, but you'll definitely have access beyond five o'clock today, so that shouldn't be a problem.

To introduce the example: pictures can be fuzzy for two main reasons — random noise, or blurring. (Or they could be fuzzy because they're a fake. There's a very famous 1930s picture of the Loch Ness monster, which even as a ten-year-old I knew was a fake — recently the guy who took it admitted that he faked it; to me it's obviously fake, the ripples are about an inch high — but it was the most famous photo of the Loch Ness monster for many decades.) So: random noise, or blurring — and we can improve the picture

quality by doing two things. First, if the noise is random, then if you average over enough of it, it tends to zero; so we can smooth to remove noise: we take a pixel and average it with its nearest neighbours, to try to smooth the noise out. Secondly, though, if the image is blurred, that doesn't help us; what we need to do is enhance the edges. So: to deal with the noise, we smooth by averaging; and then, to sharpen, we detect the edges, multiply them up by some factor, and add them back in again. This is an example taken from that reference — I don't know who this is; it isn't Bob Fisher — you take the fuzzy picture, you detect the edges, and then, as well as averaging, you add the edges back in with some increased weight, and you get a sharper image. That's what the program I'm going to give you does.

A bit of technical detail — it doesn't matter exactly what the numbers are, but the important point is that each pixel is replaced by a weighted average of its neighbours, and you weight the nearby pixels with a higher value. So what we're going to do is average over a 17×17 square: each pixel is averaged with all the pixels up to plus-or-minus 8 from it. But we don't average uniformly — we want to weight the close pixels more highly — so we weight them with a Gaussian, like that. Secondly, we want to detect the edges, and the standard way to do that is to take the second derivative: if something is flat, it's not an edge; if it's increasing linearly, it's not an edge — both of those have zero second derivative — but if it's curving up or curving down, that looks like an edge. So you effectively take the second derivative of the image to detect the edges; this is called the Laplacian — the grad-squared operator — but that doesn't matter. And you can do both at once: it turns out that these are both convolution operations, so if you average each pixel with its neighbours using this funny inverted-top-hat weighting — (whoops; you're not seeing anything? I'm seeing something funny here... OK, that's fine) — then you do both at once. So what the program does is take each pixel in the image and average it with all the neighbours in the 17×17 square surrounding it — 17 squared is 289 pixels — with this funny weighting, which actually does two things at once: it averages the image with a Gaussian weight, and it takes the second derivative. All you really need to know, in terms of parallelization, is that you find the edges by summing: for every image pixel, you compute the edge value, which is the sum, over all displacements of plus-or-minus 8 in each direction, of the image at that displacement times the filter. So it's a convolution operation: you're averaging each pixel with all its neighbours, with some weight. That gives us the edges, and we add those back into the original image with some scaling factor — I think I use a factor of two — and then you have to rescale the image to get it back into the 0–255 range, and that kind of thing. But basically, that's what it does, and the most important point is that this is clearly a parallel operation: while one process is computing one pixel, another process can be doing another pixel, because they're independent — you're just averaging each pixel with its neighbours.
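As a sketch of that core loop — this is my illustration of the idea, not the actual code I'm handing out; `filter` is assumed to hold the 17×17 combined Gaussian-plus-Laplacian weights, and SCALE is the add-back factor:

```c
#define D 8                      /* filter reaches +/- 8 pixels */
#define SCALE 2.0                /* how strongly edges are added back */

/* Sharpen one pixel (i,j) of an nx-by-ny image: convolve with the
   17x17 filter to get the edge value, then add it back in.
   Assumes i and j are at least D away from the image boundary. */
double sharpen_pixel(int nx, int ny, double image[nx][ny],
                     double filter[2*D+1][2*D+1], int i, int j)
{
    double edge = 0.0;

    for (int di = -D; di <= D; di++)
        for (int dj = -D; dj <= D; dj++)
            edge += filter[di+D][dj+D] * image[i+di][j+dj];

    return image[i][j] + SCALE * edge;   /* rescaling to 0-255 comes later */
}
```

Each call reads only pixel (i,j)'s neighbourhood and writes nothing shared, which is exactly why distributing the pixels across processes is trivially safe.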
So the computation of the edge for any pixel is independent of any other pixel, and I do a completely trivial parallelization — this is all deliberately very naive. A master process reads the image from disk and broadcasts it to every other process, so every core has a complete copy of the image — which is very wasteful, but very simple. Then you just scan the image line by line, and we have to decide how to divide the pixels up between the processes: if I had four processes — four cores — each process computes every fourth pixel, so mine would do pixels 0, 4, 8 and 12; the next would do pixels 1, 5, 9 and 13; and so on. Very, very simple. At the end, we add them all back together, and then we save to disk. The important point is that the program reports two times: the time for just computing the edges on each processor — which, because the pixels are computed independently, we'd expect to scale linearly — but also the overall time, and the overall time includes the I/O. The I/O is a serial operation: we just nominate one boss process to read and write the disk. And we'll see that the fundamental assumption in these parallel performance models is that you have a part which is parallelizable and a part which is not, and what matters is the relative weight of the two; this program was deliberately
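(That split is exactly what Amdahl's law quantifies — we'll derive it properly later in the course, but as a preview: if a fraction f of the runtime is parallelizable and the rest is serial, the best speedup on p cores is

    S(p) = 1 / ((1 - f) + f/p)

so even with f = 0.95 — 95% of the work parallel — the speedup can never exceed 20, however many cores you throw at it. In this code, the serial I/O plays the role of the (1 - f) term.)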

designed to model exactly that. So, in the diagram: you read the picture in on a particular process; you broadcast it to everybody; and then, as I said, if we had four processes — processes one, two, three and four — we'd split the image up in that way, scanning line by line, each taking every fourth pixel. I give you a reference serial version and an MPI message-passing version on the web, and there's a fairly verbose exercise sheet for it. For this audience it may be a bit low-level — the exercise sheet sort of assumes that this is possibly your first introduction to Linux as well — so some of you will be able to skim through quite a bit of it. But there it is: the lecture slides and the exercise sheet for the sharpen exercise. It's fairly verbose, but hopefully it's fairly explicit. It's actually written for ARCHER, so I have a little crib sheet for those of you who might be using Eddie — just a few things you need to do differently. The one thing I don't know is this: when you run on a compute resource, you typically have very constrained resources, so as an Informatics user you will have a budget of time that you're allowed to charge your jobs to, and I don't know what yours is called — you'll have to find out. The default job we've given you charges to ecdf_physics, which is clearly only relevant if you're somebody from the School of Physics and Astronomy; you just need to replace that with your relevant charging code, which is just your Unix group — so if you find out what your Unix group is when you're on Eddie, that will be your charging code. That's explained in the sheet. And the source code is there too: it's a compressed tar file, because I distribute the sample images in a very raw text format, so I've compressed them — otherwise they're very bloated.

Hopefully, over the next week — come back next week and we can talk more about parallel programming models — you'll find both of these examples very useful; I'll use both as references next week. I'll use the traffic model as a very simple example of a program that you can think about parallelizing in the two programming models we'll talk about, shared memory and distributed memory; and the sharpen example will get you up and running on a parallel machine, and is also a very useful example for thinking about simple performance models for parallel programs. Then, in weeks 3 and 4, I'll stop talking about parallel programming so much and talk more about numerical analysis: floating-point numbers in week 3, and then random numbers, PDEs and particle methods in week 4. So the first two weeks are very much focused on parallel computing, and then I'll go on to numerics — a bit about various Monte Carlo, particle-based and PDE-based techniques. OK — thanks, everyone. For those attending remotely, apologies for the audio-visual issues; all the slides are on the web, and hopefully by next week we'll be able to get you a live feed set up properly: you should have me, flapping my arms about, appearing as a little icon, with the major part of the screen taken up by a fairly high-resolution copy of the slides. That's the idea — we'll try to get that sorted out for next week.