Herbert: “Okay, so, welcome to this talk. As you can see, here at the VU, we overcommit resources, so basically we then have to get into backup scenarios. We may actually have to disappear from here at some point, if another group appears. That’s our evacuation plan: if we lose the fight, we go there. ‘We can send them there, right?’ ‘We can also send them there, yeah.’ That’s a smart move. So we’re very happy to have Joel here, to tell us about a real-time operating system for embedded systems that’s actually been places. As in, beyond planet Earth. Which is rather impressive. And I’m not going to talk very long, because we’ve already wasted 15 minutes of Joel’s speaking time.” Joel: There are probably hundreds of operating systems in existence, and each one has its own little target audience. Minix was meant for teaching; Windows was for something; Linux was because Linus wanted to do it. Where we fit in is what we call deeply embedded systems, and these are the kind of systems that have admitted to using us. The common characteristic is that most of them don’t have what you would call a user interface. This is actually a video jockey system, so it probably has got more of a user interface. If you notice, there are a lot of space systems. Curiosity, we’ve got a couple of systems on it. All data from the surface of Mars is sent from Curiosity via radio through the Mars Reconnaissance Orbiter; that’s one that we’re real proud of. Every Galileo satellite will run RTEMS. The ones in orbit already do. In general terms, about 75 to 80 percent of everything the European Space Agency, NASA, or JPL launches includes RTEMS. We’ve also got a lot of high-energy physics. It’s really not very easy to put up a mile-long linear accelerator, excuse me, a 2 km-long linear accelerator picture and have anything to look at. The fun one, and I’ve got a great picture of Ben somewhere, is behind that motorcycle: it actually has 3 embedded control systems on it, and one runs RTEMS.
So, where RTEMS fits: this is a talk that I gave in France back in August, and then I got robbed in France, so I’m hoping I’m going to do a little better here this time. It tries to show where RTEMS fits into the operating system spectrum, what kind of systems we’re targeting. We’ve done a lot of work; we’ve gone back and revisited the concept of how threads are managed. Because RTEMS is single-process, one address space, and multithreaded. So that’s like running a single process on a dedicated small computer. And when I say small, say 128k of memory; 32 meg would be a huge system. Normally most of that would be used for data collection and data buffering. And a lot of the space systems only have 256k of memory or so. A lot of the ARM systems, with something like 512 meg of flash and 64k RAM, are very typical new targets. This was the focus of a lot of what we’ve done with the symmetric multiprocessing, and some of it was Google Summer of Code projects. The person who is listed as co-author on this was my Google Summer of Code student in 2010, and is now a core developer who has mentored a lot of students. So, to start, to get in the right frame of mind (and you can tell me to skip if any of you guys remember this or know this): the focus in computer science is usually on the logical results of computation, and that they’re precise and accurate. But in real-time, you add the notion of time. So you want the correct result, but it’s also a matter of when it’s produced. And by that, there can be a deadline, or there can be a window around a deadline; the value can decrease after the deadline, or if you’re too early. And the scary thing is, as time goes on, we get more and more of these systems that are real-time systems and small, embedded systems that we deal with, and we don’t even think about it. I think Ben said he had a lot more respect for pushing the pedestrian button and not jumping out.
That’s a real-time system, but it’s on the order of seconds, when you think about it. But when the lights are wrong at an intersection,
and they’re both green, cars hit each other. It’s a real-time system, and it has safety constraints, and it’s one we use every day. But we don’t really think so much about it as being safety-critical and a computer we’re interacting with. The type of system RTEMS works with usually does not have anything like a keyboard or a mouse or a graphics display; they have analog and discrete I/Os, and usually kind of strange custom devices. A lot of people build FPGA devices that end up interfacing with things like gamma-ray burst detectors or other odd things. There are various types of sensors that you wouldn’t normally think of an embedded system having. But on the other hand, we do have a lot of standard things: we have a FreeBSD TCP/IP stack, so we have a lot of routers and network filter type products. We have a lot of control systems that have network interfaces. A lot of consumer electronic type products. But the idea is that it’s not a normal computer system. These are usually collections of unique hardware put together with a processor; it could be an ARM, but we support 18 architectures, because people select different architectures for different reasons. For example, the ARM at this point does not have a space-hardened version, so the ARM is only being flown on cube-sats in low Earth orbit. So there’s a hardened PowerPC, a few hardened SPARCs, a hardened ColdFire, which is a 68000 derivative, and a couple of others, which is pretty much what the whole space community flies. But when you start looking at this (and MINIX is no different from this), we really back off to the old-school definition of an operating system: just a collection of software services, an abstraction layer, that basically allows you to focus on the job of what your application’s doing rather than focusing on devices, so you can move code around.
That’s all an operating system is, and that’s why you have application portability. So in the old days, it was called a master control program. And the class of operating systems we deal with were actually called executives, because all they did was manage a set of threads. And when you start adding file systems and network stacks, then they become more of an operating system. So the important thing about being real-time is that you want to have other services and frameworks which help you meet application requirements that involve time. So instead of saying... I’ll think of an example. A fire alarm. A fire alarm has only so long from when the temperature reaches a certain level, or smoke is detected, before an alarm is generated, and there’s going to be a timing path along that. It’s not just enough to generate a signal; there’s going to be a timing requirement. So these often interact with each other. There may be minimums. Lots of times you deal with things like electrical switches and mechanical switches which have delays, excuse me, minimum times in between two operations. You can’t close a light switch again until some minimum time has passed; it has a minimum inter-arrival time. But the focus is on predictability, and predictability is the keyword that you hear a lot. Robustness is another keyword. Robustness gets tied in with program correctness and program proofs. But you don’t want algorithms that are O(N), because then you can’t predict how long they’re going to take to execute. You don’t even want algorithms, if you can avoid ’em, that are O(log_2(N)). You want algorithms that execute in fixed time, in actually a fixed number of instructions on each architecture. The other thing that makes real-time embedded systems unique is not only the timing requirements, but also size, weight and power. A Pebble watch is an embedded system that has very unique size, weight and power constraints. A satellite has size, weight and power constraints.
The hardware we deal with is usually one-of-a-kind: a collection of some commercial parts with some custom parts. Safety and reliability, of course, are concerns. Cars are full of systems where you want them to be 100% reliable and safe. The other thing I always want people to remember is: you’re interacting with the real world. So it’s often hard to debug, because you can’t set a fire to test a fire alarm. You have to come up with a way to test a fire alarm without burning something down. I worked on a system with a 2-ton rotating platform. So when we tested operating it, we actually had to have a safety zone, and a safety operator, to
kill it in case something went wrong. Which it did, a couple of times in early development. The high-energy physics people often worry about high voltage and getting killed. So your user interface... even a petrol pump has got an oddball user interface. It’s oddball and it’s not standard. And they interact with the real environment. So there are lots of examples of these. So RTEMS is designed for what I call deeply embedded systems. And this has become more pronounced as you’re starting to see things like Linux move into things more like set-top boxes and entertainment centers; not air traffic control management systems so much as in-flight entertainment systems, which seem to be so unreliable. But notice that there’s a real focus on full-time operation and correctness. Medical devices down there have to go through serious regulatory control. RTEMS is used in some industrial power control and monitoring systems, at what we in the US call substations: the large substations where power is distributed. Those have extremely rigorous regulation requirements. That is kind of the environment in which RTEMS plays. And I’m hoping that the slide that I want is up here in a minute. So RTEMS is designed to be a deterministic, embedded operating system, very conscious of and predictable in our execution time and our resource use. We are free in the sense that we do not place any restrictions on anybody who links with us, and I think y’all pretty much follow the same type of license: you don’t care what anybody does. It’s a BSD-style license. Even if everything is not BSD-style, you’re not going to get any obligations when you link with the code. Because nobody is ever going to field an embedded system which requires you to ship a kit of source for your medical device. You’re just never going to ship it. Nobody’s going to meet the GPL if you shipped a heart monitor with the GPL. RTEMS supports POSIX.
Various networking standards, sockets, JFFS2 and the DOS filesystem. We actually support 18 processor architectures, and have about 175 boards that we run on. We’re pretty open, and the reason there’s a community... one of the things that’s happened as a result of GSoC is that most of our hosting is moving to Oregon State’s Open Source Lab. And the reason I’m actually here is because I’m going out to ESA to do some training later in the week. This is the slide I wanted. This kind of helps you see where we fit into the operating system spectrum. And it’s really hard to do this without three dimensions, because you don’t get into hypervisors unless you get the third dimension. So the distinction here is operating systems that have a strictly real-time focus versus ones that have some real-time focus. I was hoping MINIX would migrate to that box someday; but MINIX is kind of in this general community that is the POSIX process model, and Windows is out here in its own land. When you come over to real-time, LynxOS is a small, embedded Unix that feels a lot like Linux. There is an avionics standard called ARINC 653, and pretty much any time you get on an airplane, you’re flying an ARINC operating system. These are very expensive: 80,000 to 120,000 Euro per-seat licenses. And then you get into... there are probably a hundred single-process, multithreaded OSes, but those are probably the biggest four. And we usually compete against VxWorks when we’re compared, or ThreadX. VxWorks is now owned by Intel, so it’s kind of interesting to see how things have changed. So I think I’ve said all that. We can go down to as small as 16k on an ARM Thumb. We’ve tried really hard to give you an extremely large number of configuration features, and a lot is left out by the linker.
I’ve said we’re single-process, multi-threaded. We also have, and nobody probably cares where it came from, something called the Classic API, which is geared toward truly hard real-time systems, while POSIX threads came from an environment focused on workstations. Multiple filesystems. I’m proud to say that we now have a functioning port of the FreeBSD 9 TCP/IP stack and USB stack; it’s working great with IPv4 and v6. Apparently it uses too many locks and needs optimisation from our standpoint, but that’s a different problem. We do removable mass storage, which means, in our world, that you can
use this for a data logger with a USB device, run the FAT filesystem or something on it, take the device off, and analyze it somewhere else. So as a data logger it’s really good. We have something that looks a lot like a simple busybox shell, with about 100 commands, a lot of which came from BSD land, along with things to look at the system. If you’ve ever used a router or some other product where you could telnet into the dedicated box and run commands on it, it’s geared towards building that kind of thing, where you’ve got a custom little interface to do device maintenance and setup. But where a lot of our focus has been lately... well, we’ve always had: if you study much real-time you’ll know about the priority inversion problem, which is where a high-priority task or thread has to wait on a low-priority thread which is holding a resource. And there are two protocols, priority inheritance and priority ceiling, to help you deal with that. We’ve supported those since nearly the very beginning. We keep a lot of statistics on how much stack you’re using and your CPU usage, and we do rate monotonic scheduling for strictly periodic tasks. So if something has to execute every 20 milliseconds, you can set it up as a special periodic task, and we will track how much CPU time and how much wall time it used on each iteration. So you can see if you are using 19 milliseconds on average of a 20-millisecond period, which you know is bad, because that means you’re really, really close to the edge of not meeting your system requirement. We’ve been doing a lot of work on symmetric multiprocessing, and at this point, in our OS space, I think we’re pretty much in the lead. Because of the space community we support the SPARC on the LEON3 and what’s called the Next Generation Microprocessor. We have ARM, x86, and then PowerPC on QorIQ. We also have an older asynchronous/distributed multiprocessing that used RPC, which was more based on a distributed shared memory model.
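A toy way to see what the inheritance protocol just described does is to compute a holder's effective priority. This is an invented illustrative helper, not the RTEMS API; following the RTEMS Classic API convention, a lower number means more important:

```c
#include <assert.h>

/* Toy model of the priority inheritance protocol: while a thread holds
 * a resource, it runs at the most important priority among itself and
 * all threads blocked waiting on that resource.  Lower number = more
 * important, as in the RTEMS Classic API.  Invented helper for
 * illustration, not RTEMS source. */
static int inherited_priority(int holder_prio, const int *waiter_prios, int n)
{
    int best = holder_prio;
    for (int i = 0; i < n; i++)
        if (waiter_prios[i] < best)
            best = waiter_prios[i];     /* a more important waiter boosts us */
    return best;
}
```

So a low-priority holder at 200 with an important waiter at 5 temporarily runs at 5, and the medium-priority threads that caused the inversion can no longer preempt it.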
Just to throw this up: based on the version, you get a yes or no, and then you get lots of architectures. Notice this ends at M for MIPS, which means there are still more M’s, and then it comes over and hits architectures that you’ve probably never heard of. This is an open source processor. These are old Japanese processors. This is the V9 SPARC. An old DSP. And that is an automotive processor from NEC. There’s your kind of block view of the operating system. The important thing looking at the operating system is: what’s unique about this environment? Remember I said, for everybody we deal with, we have to assume they have their own hardware. Which means we have to assume we have to put together their own unique Board Support Package. So we have to look at a piece of hardware as a set of components. And the components are partially made out of the processor architecture. So is it an ARM? Is it a PowerPC? Is it a SPARC? Then we look over here and go: okay, we knew what it was as an architecture, but which ARM is it? I mean, there are hundreds and hundreds of ARM processor models with different peripherals on them. The peripherals generally live either in this shared area for board support packages or in libcpu. I think as we move to a new build system, this will fold into here, and this probably will too. These are peripheral chips like UARTs that are very common and that you see across all architectures. But at some point you look at these as reusable software. I should’ve mentioned, this also has the framework that every BSP has to follow. You have to put in your own specific information. Like, what’s your address map? What address did you put your peripherals at?
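As a sketch of the kind of board-specific answers a BSP has to provide, a header like this might exist. Every name and address below is invented for illustration; it is not taken from a real RTEMS BSP:

```c
/* Hypothetical board-specific BSP header -- all names and addresses
 * are invented for illustration, not from a real RTEMS BSP. */
#ifndef MY_BOARD_BSP_H
#define MY_BOARD_BSP_H

/* Address map: where did this board put its peripherals? */
#define BSP_UART0_BASE   0x40001000UL
#define BSP_UART1_BASE   0x40002000UL
#define BSP_ETH0_BASE    0x40010000UL

/* Feature configuration: what did this particular SoC build include? */
#define BSP_NUM_UARTS    2
#define BSP_NUM_ETH      1

/* Pin-sharing constraint of the kind mentioned below: not every
 * peripheral can be enabled at the same time on some boards. */
#define BSP_UART1_SHARES_PINS_WITH_ETH0  1

#endif
```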
When you put your System-on-Chip together, assuming it’s a custom one: did you put in three UARTs, four UARTs, two network controllers, zero network controllers? There’s a lot of information in here that is board-specific. And one of the strange things, if you haven’t looked at these types of boards, you wouldn’t have noticed: a lot of the embedded ARM and PowerPC boards look really good when you read the specs. And when you look at them really closely, you can’t use all the peripherals at the same time. So there are I/O and feature configuration flags, so if you get the board from one vendor, you will see one set of settings for those. You get the same chip on a different board
and you will get another set of settings in these registers. Those are the tiny details that are often in this board-specific part. Some of these board-specific parts of the BSP are only half a dozen files. The parts that are shared all the time... first you’ve got the architecture, like SPARC, PowerPC, MIPS. And then you have this basic layer we call the SuperCore. And unfortunately our naming was really bad, and probably still is. The joke here was that the trend in operating system design at the time was the microkernel, and we weren’t doing a microkernel. We were doing something that was more akin to an object-oriented threading library, where the SuperCore implemented textbook computer science concepts, and all of them. So if you’ve ever seen any feature in a mutex, it’s in here. And our idea was that when you went up here to look at the standards, you would see an API like pthread mutexes, which, my recollection is, have priority ceiling but not priority inheritance, or vice versa. So you only get certain features out of certain APIs. The message queue here, in the Classic API, you can’t block when you send if the queue is full. In POSIX, you can block when the message queue is full. POSIX also has message priorities. So the message queue class at this layer actually supports both of those, and each API is just a personality, or a facade, whatever the pattern name is these days. The Classic API is what was implemented first. The POSIX API is about 90% of the standard Open Group Single UNIX Specification, 1003.1b. And this is actually a joke: you know the English word “sap”, which comes out of a tree? Which is sticky? Sappy.
These are the parts of the operating system which every operating system has; no standard’s ever gonna go near them because they’re too sticky to discuss. Nobody’s ever going to standardize how you configure an operating system, or how you initialize it; loosely, there’s a device driver framework that always looks alike but is always different. So these... how do you put in plugins for extensions? They’re sappy. They’re sticky. We all know we have ’em. They’re all different. And we’re never going to standardize them. So that’s the bad joke. There’s an API for monitoring threads and stack utilization. There’s the shell. Various file systems. We support GNU Ada, and remote debugging for GNU Ada. We have a graphics toolkit. There are some graphics applications. (That would be because I’m on battery, wouldn’t it.) There are some graphics applications, and we have some frameworks built on these. Microwindows effectively implements the Win32 API for embedded systems, and you can actually play Minesweeper if you use that. The graphics toolkit also includes libjpeg, PNG, TIFF, Adobe Type 1 fonts, and FreeType. There are a lot of libraries that port to RTEMS. Python: you can run Python on RTEMS. The cameras that monitor Sydney harbour off the bridge run Python. And since you know who lives in Sydney, you know who wrote that code. So now you know who’s a big Python fan, among other things. We ran the standard BSD TCP/IP stack for about 15 years. We’ve talked a lot about how not to do it; the original stack was a great example of how not to do it. For the new port, we implemented the bottom layer of the BSD kernel’s APIs (I think you guys took a similar kind of approach), the APIs that they kind of agreed to for threads, synchronization, timers and interrupts. We implemented their APIs; we basically cut the bottom out of the BSD kernel and run their drivers on top of their API on top of our operating system. We are actually running a pretty much unmodified FreeBSD 9.3 TCP/IP stack and USB stack.
So one of the ideas of RTEMS, philosophically, is that this part has to be done for each application. There are only so many of these architectures, but you want everything to be the same as much as possible up to here. And if you notice, these are all pretty standard packages. We try to have as much POSIX support as possibly makes sense. We have a lot of BSD support. So it’s easy to port any kind of open source package that will run within the assumptions that
we have, which is, again, multithreaded, single-process. And we’ve got all of the communication objects, including some that people don’t use very often, like barriers. And the other thing we have that is kind of unique: since when you build an RTEMS application you’re building one process with multiple threads, we actually let you select your scheduling algorithm when you configure your application. So most people are going to use a single priority-based scheduler, but you can use other algorithms. And this is our thread state model, which is not that exciting. Threads start out not existing. Because of the original API, there is a distinction between creating and starting a thread; in pthreads, those are the same operation. Once a thread is ready, it can be dispatched to be executing. This is the version with the typo: it can yield or be preempted. I fixed that in the version I was going to present today, so we’ll see if the other typo is still in here. Once you’re executing, you can either voluntarily yield the processor or you can be preempted by something that’s more important. We have a concept called suspend and resume, and while you’re blocked you can be suspended and resumed. Any time you’re ready, if the scheduler determines you should execute, then you will become the executing thread. And you can voluntarily block or suspend. So the focus in the design of RTEMS is very thread-oriented, and at any given time a single action or state change happens to a single thread. So what happens? You’re executing and you voluntarily yield the processor. You’re executing and you block, waiting on a mutex. All these things are only one action or decision point for the scheduler. Obviously you can be deleted. This goes through the states; usually it’s just a check to make sure I’ve said the right thing. We track the executing threads. The other neat thing about RTEMS is it’s linked in with the application. There’s no separation between kernel and user space.
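The configuration-time scheduler selection mentioned above is done in the application’s configuration header. Here is a rough sketch following the RTEMS confdefs.h convention; the option names follow the documented RTEMS configuration scheme, but check the manual for your version before relying on any of them:

```c
/* Sketch of application configuration via <rtems/confdefs.h>.
 * Option names follow the RTEMS convention; verify against the
 * Classic API Guide for your RTEMS version. */
#define CONFIGURE_APPLICATION_NEEDS_CLOCK_DRIVER
#define CONFIGURE_APPLICATION_NEEDS_CONSOLE_DRIVER

#define CONFIGURE_MAXIMUM_TASKS 8

/* Pick the scheduling algorithm: the default deterministic priority
 * scheduler here, or, e.g., EDF instead. */
#define CONFIGURE_SCHEDULER_PRIORITY
/* #define CONFIGURE_SCHEDULER_EDF */

#define CONFIGURE_RTEMS_INIT_TASKS_TABLE
#define CONFIGURE_INIT
#include <rtems/confdefs.h>
```

Everything not configured defaults to zero or off, which is how the linker can leave so much out.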
The whole processor is available because you’re one POSIX process on the processor. On a single-core system there’s only one thread running at any given time, and inside the OS there’s a variable called _Thread_Executing, which is now a macro because of the SMP work. The neat thing is you can always trace in, because you build RTEMS as a library and link it with your application, which means you can step into it. So as we did the SMP work, we didn’t redesign RTEMS, but we really rethought it, refactored it, and restructured a lot of the way the thread stuff was done. And one of the things that came out of that is: what does RTEMS do? It really just manages sets of threads. There’s a set of threads executing. There’s a set of threads ready to execute. There are multiple sets waiting on resources. And you can change, on a per-thread basis, your priority and actually whether you’re preemptible or not. So in most environments your thread is always preemptible; in RTEMS on a single-core system you can say you’re not preemptible. So this is the overarching thread-set view we’ve gone to with the symmetric multiprocessing. At any given time there is a set of executing threads. In a uniprocessor system it is a set of one thread, but in a symmetric multiprocessing system the big change is that now there is, obviously, one on each core. And that has implications for application correctness. There’s also a set of ready threads. It’s the same kind of set. And actually, all the scheduler does is manage this set of threads: it decides when it’s time to dispatch and change the set that’s executing. And the final thing is that at any given time, you have an arbitrary number of mutexes and message queues, and you can wait on time, with delays, or until midnight, and this didn’t change for multicore. So all that changed with multicore is: you can have more threads running. You have more than one processor resource to allocate. You need different scheduling algorithms.
Because most of the historic algorithms only know how to schedule one core, you go into a whole different class of scheduling algorithms. So you want
to be able to swap them in and out. The other thing about symmetric multiprocessing scheduling algorithms is that they’re still an open research topic, which is an even stronger reason to have a plugin architecture. Because no matter what you do today, it’s wrong. I’m not gonna go to the mat and say I know what the perfect algorithm is, because I don’t. So here are our scheduling mechanisms, and those are pretty straightforward. Manual round-robin is how old Windows and MacOS worked. It’s real handy, lots of times, in embedded systems, just to cycle through a set of activities. Rate monotonic is pretty much a topic by itself: it’s a way to schedule periodic tasks and guarantee that your task set is schedulable. We have up to 256 levels of priority. Each API defines whether numeric high or low is the most important, but the important thing is you get to change it on a per-thread basis. And it’s important how you do it, because these priorities are supposed to reflect what’s important in the external world. In rate monotonic scheduling, the activity that happens the most frequently should have the highest priority. Preemption says that when something more important becomes ready, it can take the processor away. So if I’m running and Ben has woken up, and Ben’s more important, he would run. That simple. And if Dr. Tanenbaum came in, he would preempt Ben. But you get the priority inversion problem if Ben is holding something Dr. Tanenbaum wants, and Ben won’t give it back until he’s done. That’s the priority inversion problem in a nutshell. Time slicing limits the length of time a task runs before it’s re-evaluated for execution. There’s a subtle difference between the way the Classic API and the POSIX API handle when the time quantum is reallocated. Basically, if you’re not preemptible, then you can’t be time-sliced, because it’s implicit that you’re going to give up the processor. Time slicing just uses the clock tick to round-robin tasks as necessary.
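The rate monotonic schedulability guarantee mentioned above is usually checked with the classic Liu and Layland utilization bound: n periodic tasks are guaranteed schedulable under rate monotonic scheduling if total CPU utilization is at most n(2^(1/n) - 1). A sketch of that textbook test; this is the standard sufficient (not necessary) condition, not an RTEMS API:

```c
#include <assert.h>
#include <math.h>

/* Liu & Layland sufficient schedulability test for rate monotonic
 * scheduling: n periodic tasks are guaranteed to meet their deadlines
 * if total utilization <= n * (2^(1/n) - 1).  Textbook condition, not
 * an RTEMS API. */
static int rm_schedulable(const double *exec_time, const double *period, int n)
{
    double u = 0.0;
    for (int i = 0; i < n; i++)
        u += exec_time[i] / period[i];   /* per-task utilization */
    return u <= n * (pow(2.0, 1.0 / n) - 1.0);
}
```

For two tasks the bound is about 0.828, which is why the "19 ms out of a 20 ms period" case from earlier is so alarming: one task alone is already using 95% of the processor.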
But as I mentioned, manual round-robin allows you to go through the set of tasks manually, and it allows you complete and utter control. If you run for a long time and don’t yield the processor, you run for a long time, and that’s why Windows 3.x would get an hourglass forever. Rate monotonic is based on periods: the shorter the period, the higher the priority. A thread set is said to be schedulable if they all meet their deadlines. Most of the schedulers in RTEMS are priority-based, and the goal of the scheduler is to guarantee that the highest-priority ready task executes. This is normally the only class of scheduler you will see in practical use. So it’s your job to assign priorities in a way that the ready task you want to execute is allocated the processor. This priority is taken into account along with preemption and timeslicing; timeslicing cycles through tasks of equal priority. If you are not preemptible, even if you are not the highest-priority task, you can continue to run. Which can lead, again, to cases of priority inversion, but it is a simple way to do mutual exclusion on a single-processor system. The other neat thing about going to a pluggable scheduler algorithm is that we finally had to name the scheduling algorithm we’d had for over fifteen years. It’s now called the deterministic priority scheduler. This is pretty close to a textbook algorithm: there is an array of linked lists, one list per priority. The lowest-indexed list with a task on it, the front of that list, is the highest-priority task that’s ready to execute. And we keep a bitmap of priorities which have one or more ready tasks. It’s really optimised for fixed-time execution: it only takes a couple of bit scans, a couple of loads and some shifts, and you can find the highest-priority ready task. And if anybody’s ever seen this before, somebody will know why the maximum number of priorities in Windows NT and above, and VMS,
was 32: because they had a 32-bit bitscan. We do a double-level bitscan, two 16-bit bitscans, 16*16 = 256, to do the trick. So this is a twist on an old-school algorithm. We use the top nibble of the priority to put you into what we call a major. So there’s a 16-bit major bitmap, and if bit 7 is set, that indicates that there are threads ready between priority 112 and 127, and we know we have to come over into the minor bitmap to see the bits which indicate which specific priorities have tasks on them. So that’s why I say there is a bitscan, an indexing and a load, another bitscan, and then you shift and combine, and then you’ve got the highest priority. So here’s an example. When you initialize almost any of the default RTEMS configurations, without too much tinkering, you’ll end up with a thread at priority 1 and a thread at priority 255. 255 is the idle thread, and priority 1, unless you change the default, is the initialisation thread. So that shows you the major and the minor, and the major’s got a bit set at each end. That would be priority 1; that would be priority 255. The interesting thing when you port: this is kind of a logical representation. Sometimes you find the first bit set scanning from the left, sometimes from the right. So it’s the port’s responsibility to present this concept of a level: you can scan from left or right, as long as you make it look like 0-15 coming out. Because I think the PowerPC even numbers the bits opposite of the way everybody else does: what most would call bit 31, the most significant bit, I think they call bit zero. Now that’s the scheduler that everybody has used. It’s flying on every mission that’s flown. But to prove that we could have a pluggable interface, the second algorithm that was implemented is the one that’s in every textbook: a simple, straight, sorted list of tasks. It’s really simple, has the same behaviour, it’s terribly non-deterministic, but it doesn’t use much memory.
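The double-level bitscan described above can be sketched in plain C. This is illustrative code, not the RTEMS implementation, and it uses a portable loop where a real port would use the architecture’s bit-scan instruction:

```c
#include <assert.h>
#include <stdint.h>

/* Two-level bitmap for 256 priorities: a 16-bit "major" bitmap indexed
 * by the top nibble of the priority, and sixteen 16-bit "minor" bitmaps
 * indexed by the bottom nibble.  Finding the highest-priority ready
 * task is two bit scans, an index, a shift and a combine -- fixed time
 * no matter how many tasks are ready.  Illustrative, not RTEMS source. */
static uint16_t major_map;
static uint16_t minor_map[16];

/* Portable "find first set bit"; a real port would use a single
 * bit-scan instruction here, scanning from whichever end the CPU
 * prefers, as long as the result looks like 0..15. */
static int first_set(uint16_t v)
{
    for (int i = 0; i < 16; i++)
        if (v & (uint16_t)(1u << i))
            return i;
    return -1;
}

static void mark_ready(unsigned prio)          /* prio in 0..255 */
{
    major_map            |= (uint16_t)(1u << (prio >> 4));
    minor_map[prio >> 4] |= (uint16_t)(1u << (prio & 0xFu));
}

static void mark_not_ready(unsigned prio)
{
    minor_map[prio >> 4] &= (uint16_t)~(1u << (prio & 0xFu));
    if (minor_map[prio >> 4] == 0)             /* last one in this major */
        major_map &= (uint16_t)~(1u << (prio >> 4));
}

static int highest_ready(void)                 /* lower number = higher prio */
{
    int major = first_set(major_map);
    if (major < 0)
        return -1;                             /* nothing ready */
    return (major << 4) | first_set(minor_map[major]);
}
```

With the default configuration from the example (threads at priorities 1 and 255), the major bitmap has a bit set at each end, and the lookup returns 1 until that thread blocks.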
And the cool thing is, this is engineering; this is not theoretical computer science sometimes. If you have a low-end system, it probably only has 4 or 5 tasks in it. And it doesn’t matter what algorithm you use if you only have 4 or 5 of something. The algorithm only becomes important if you have tens or hundreds of somethings. So if you’ve only got a processor that’s capable of an application with 6 tasks, it’s not going to take you long to put them on a linked list. So that’s the tradeoff: you save 3 or 4k of memory, and you’ve got it. Has anyone ever seen Earliest Deadline First? This is an implementation of Earliest Deadline First. It’s based on our rate monotonic period: you set up a task that runs on a period, and the start of its next period is used as its deadline. So logically you have two bands of tasks: the periodic, deadline-oriented ones, and the ones that aren’t. The ones with deadlines are always run before the ones without deadlines. That’s a pretty standard way to design a system on paper. I don’t know if this algorithm has actually been used in a fielded RTEMS system. This is similar to the same algorithm in Linux, the Constant Bandwidth Server. It’s basically the Earliest Deadline First scheduler extended so that now you have a per-CPU budget, and when the budget’s exceeded, a callback’s invoked. One of the cool things this algorithm does is it does what you do: you get to Christmas Eve, the stores are all closed, and you decide, it doesn’t matter what I do, I can’t shop for everyone, I quit. And then you start drinking at 2 o’clock, because you know you can’t finish your shopping before 6 anyway. That’s what this scheduling algorithm does: it knows when it’s pointless to try, so it just gives up and continues. If you’re not going to make your deadline anyway, why care, right? Then we get into the SMP scheduling algorithms. I should also say it sets off an alert, which you probably don’t do on
Christmas; you kind of slither away in shame in the Christmas presents example. The simple SMP scheduler was the first SMP algorithm we had. And the idea was just to take the simple single-core algorithm and extend it to multiple cores. And the irony of this is that by the time it was done, it was no longer a simple algorithm; it was quite a complicated algorithm. But it still used a linear linked list. The deterministic priority scheduler has now been extended to multiple cores too, so now we do have the deterministic scheduler. So how does all this work? Basically there are a dozen to fifteen system points at which a scheduler callback is.. A scheduler is nothing but a set of functions which are invoked in the life of a system based on particular events. And that lists some of them out. When you build an RTEMS application you specify lots of things. What filesystems do you want? Do you want a filesystem at all? What’s the maximum number of threads you want? What’s the maximum number of barriers? And lots of things default to zero. The scheduler is just one of the parameters that’s included. You basically get to pick the algorithm that meets your time and memory requirements and your scheduling behaviour. You can also provide your own scheduler without modifying the RTEMS source. So at system initialisation it’s invoked. When a thread is created or deleted, the scheduler’s allowed to keep some little bookkeeping information. It’s invoked when the thread yields the processor. When a thread is blocked, like when it can’t get a resource, or is blocking for time. When it’s unblocked. An extract is what happens when you’re waiting for a mutex, you have a timeout of a second, and you don’t get it: you’re automatically extracted from the blocking operation. You didn’t get the resource, and you’re arbitrarily anywhere in the blocking set. 
Some of the schedulers cache information, precompute pointers and indexes on the majors and minors, so if you change the priority or some of the execution modes, the scheduler gets invoked to precompute things. And then there’s the method schedule(), which sometimes is very simple: some of the deterministic algorithms tend to do small actions when you block or unblock or extract, while the simpler algorithms tend to go out and half-evaluate the world in the schedule() operation. So it’s a matter of whether you’re doing things a little bit at a time, as little bits and pieces change, or whether you take a big global look over the system. And then you have to enqueue a thread where it goes into its priority group. And there actually is one case, I think in POSIX, where you are required to go to the head of your priority group instead of the rear, which is kind of an abnormal thing to do, but it’s required. We’ve realized as we’ve gone along, though it’s kind of pedantic at this point and I don’t think it really matters, that the priority of two threads, the meaning of that, is actually dependent on the scheduler. In deadline-based schedulers you can release the job; the clock tick is when time-slicing and other actions occur. The scheduler can assist you in starting the idle thread. That’s what it really comes down to. It’s a large plugin framework, 15 or so entry points, and any scheduling algorithm you want to implement. So far it looks like the classic types of scheduling algorithms all fit into it. The other thing RTEMS has is.. I don’t know how much y’all have played with scheduling algorithms. They are very hard to debug, because it’s very hard to create scenarios and run them on real systems to get repeatable results. 
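A scheduler plugin of the kind described here boils down to a table of callbacks. A rough C sketch follows; the field names are illustrative, and the real RTEMS operations table differs in detail:

```c
#include <stdbool.h>

typedef struct thread Thread;   /* opaque thread control block */

/* A scheduler is nothing but a set of functions invoked at a dozen or
 * so points in the life of the system. Swapping the table swaps the
 * scheduling algorithm without touching the rest of the OS. */
typedef struct {
  void (*initialize)(void);                 /* system initialisation */
  void (*node_create)(Thread *t);           /* per-thread bookkeeping */
  void (*node_destroy)(Thread *t);
  void (*yield)(Thread *t);                 /* thread gives up the CPU */
  void (*block)(Thread *t);                 /* can't get a resource, or sleeping */
  void (*unblock)(Thread *t);
  void (*extract)(Thread *t);               /* timed out of a blocking operation */
  void (*priority_changed)(Thread *t);      /* re-precompute cached pointers */
  void (*enqueue)(Thread *t, bool to_head); /* to_head covers the POSIX
                                               head-of-group special case */
  void (*schedule)(void);                   /* decide who runs next */
  void (*tick)(void);                       /* time-slicing, job release */
} scheduler_ops;
```

An application-supplied scheduler would just fill in such a table and hand it to the configuration, which is how the "pick your algorithm at build time" parameter works conceptually.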
Once I realised how hard it was to debug on multicore systems, and to create applications of threads that did things the way you want, when you weren’t even sure what you really wanted the threads, or the scheduler, to do anyway, I put together a subset of the RTEMS shell with enough of RTEMS to run on Unix, where you basically write very simple scripts which create tasks, advance the clock tick, have them block on mutexes, change priority, for an arbitrary number of
cores; it’s basically a discrete simulation using all of the relevant OS source. So you can debug new algorithms. The cool thing is, nobody has the hardware to test all of these scenarios on any number of cores; you just don’t have that kind of hardware. And even if you do, it’s really hard to debug on multicore. So it’s fully deterministic, and it doesn’t require hardware. And if your algorithm stinks, you haven’t invested a whole lot of time trying to debug it, because I can tell you for a fact that it’s pretty tough to debug these when they get complicated. The last part of this is the challenges we’ve really seen. So remember, we’re in a single address space. A lot of these problems still apply in multicore POSIX-based systems and in ARINC systems; I’ve actually seen some ARINC SMP presentations that have even worse predictability problems. The real issue with a lot of these is this very simple thing. Traditionally in embedded systems that were multithreaded, single-process, single address space, you worked on the assumption that when your thread was running, it was the only thing happening. The only other thing that could happen was an interrupt. So as long as you kept control of your thread, you were perfectly fine. And if you were the highest priority thread in the system, you didn’t have to worry about anything! And guess what: as we went on, we realized how many bad multithreaded programming habits we were just as guilty of having as the next person. So the first discussions of how to make RTEMS support SMP rapidly turned into “Oh my God, how many times did we violate all these rules everywhere in the support code?” So this is where I think, as a practicing community outside the university writing applications, we’ll see an enormous amount of broken code. The first one is the highest priority task assumption, which I just mentioned. 
So with multiple cores, you should assume an application is executing on each core, which means you are no longer running by yourself. So your implicit critical section is violated. My next favourite one: it used to be that on a single core you could disable preemption, and if you didn’t allow the operating system to take the processor away from you, you just kept running. Same broken rule. Other things are running at the same time. You’re broken. And the sad thing is, I’ve taught people for ten years in RTEMS classes that this is one of the cheapest ways to get a critical section. So now I’ve got to go back and say this only works on a uniprocessor system. This is another bad one. If you disable interrupts, which was the classic thing to do, and which is even cheaper than disabling preemption, then if you were the thread running and you disabled interrupts, nothing could happen; therefore you could continue to run. For this rule to hold, number one, you would have to disable all interrupts on all the other cores, or you’d have to start doing it at the interrupt controller. And you still have other threads executing. So basically what we’re down to is, we all should’ve been using mutexes or message queues or something. And now what you’re seeing is the trend toward lighter and lighter locks, like the atomic stuff that you’re seeing in C++ and that’s been added to GCC and Clang. If you look for papers, you’ll see the aviation community, for manned flight, especially commercial, like the FAA and, I can’t remember, the European equivalent of the FAA, they’ve done a lot of research on this. And there’s been some ESA-sponsored research. Caching is really, really hard to predict on an SMP system, because you don’t necessarily know what’s running. 
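The fix described above, a real lock or the newer atomics instead of disabling preemption or interrupts, can be shown in a minimal C11 sketch. The counter is a stand-in for any shared data:

```c
#include <stdatomic.h>

/* On one core, "disable interrupts, counter++, enable" was a valid and
 * very cheap critical section: nothing else could run. On SMP the other
 * cores keep executing, so that trick protects nothing. A C11 atomic
 * read-modify-write is one of the "lighter locks" that replaces it. */
static atomic_long shared_counter;   /* data touched from several cores */

void counter_increment(void) {
  /* Safe on one core or N cores; the hardware arbitrates the update. */
  atomic_fetch_add(&shared_counter, 1);
}

long counter_read(void) {
  return atomic_load(&shared_counter);
}
```

For anything bigger than a single word, a mutex or message queue is still the right tool; the atomic is the lightweight end of the same spectrum.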
Basically you normally have one memory bus, and everybody’s hitting the memory, and everybody’s hitting this cache, trying to keep cache coherence going, so there’s a lot of cross-contention. Somebody at a conference told me they had actually seen a case on one architecture where it was indeterminate what would end up in memory in certain cases. This is a really hard problem. Assuming the cache coherence really works, the real problem is that two cores can basically fight for half the cache each. So where you used to run
code that would completely fit in the cache, and you would’ve run fine, now somebody else is chewing up half the cache, or a third of the cache, or a fourth of the cache. Or you get into a more complicated two-level design, and you spin on memory; and now you’ve locked the cache coherence to do the atomic lock with another processor, and you’re keeping him from even updating it. So there are lots of caching issues that are very subtle and hard. So, since I did this last time, I thought it was good to end with: what would really help us? Really practical scheduling algorithms. I remember when I was in grad school, distributed OSes were all the rage. And every algorithm started with the assumption of global state information, which is impossible in a distributed system, and then they would run an O(N^3) algorithm. That is just not practical in the real world. You can’t run O(N^3) algorithms and meet submillisecond response times. They also have to be predictable and meet the real-world implementation constraints. I really don’t know what the best practice is on how to assign threads to separate cores based on application patterns. I’ve got a couple in mind, and generally they boil down to “put all the computation on one core and put all the I/O on the other core”, because at least that way they’re naturally related; your interrupts will tend to migrate toward one processor. Chris’ big beef, and he’ll give you about an hour lecture on this if you ask, is that the debugging aids really stink. We’ve all gotten spoiled with Raspberry Pis and BeagleBones and the $19 tin-can JTAG. Chris is dealing with a $15,000 debugger and no cheap open source alternative for the ARM board he’s using. He’s looked at, I can’t remember which Intel chip they’re pushing for embedded; I think it’s on the order of US$10,000 to get the documentation under NDA to know what the pinout of the chip should do to debug. 
These are real barriers. Plus the issue: how do you actually represent what’s happening on all these cores? The best we’ve really got is a multithreaded timeline for debugging. And the classic old problem of debugging embedded systems is that you have to stop something while you’re controlling a train: you’ve got a train control system, and you stop something in the debugger. The train doesn’t stop. It keeps rolling, and it keeps rolling a long, long time. That’s the danger of debugging embedded systems. In multicore, if you stop one core you somehow probably need to stop all the cores. And worst-case execution analysis is a big issue in embedded systems, because you want to field the system knowing that it will always execute and meet its deadlines. Well, if you’ve got all these cache effects, then the worst-case analysis starts to become very hard. It’s interesting, just as an aside: I graduated with my bachelors in 1986. In the late 80s, early 90s, there was a joint UK-Europe-NATO-US fighter, and there was an enormous amount of work on real-time scheduling and caching effects. And a lot of this is coming back, except with multicore on the end of it. So at that time we were starting to get 7-stage PowerPC processors. And some of the other papers I remember, from either the University of York or Carnegie Mellon, basically ended with: “All this makes it impossible to do worst-case analysis. You need to disable it and get back to simple processors.” Well, we’ve just jumped up an order or two in complexity. SMP is really just now being considered for safety-critical systems. I know the ARINC committee, who do manned flight standards for commercial aviation, are discussing SMP for the next version of the standard. And a lot of it comes down to: we’re not going to do things on two cores at the same time, or you’re going to manually schedule the cores. Which is kind of a pain. So that’s the real fast view of the whole thing, and I think I did it in about the same length of
time as it was intended to be done, even with all of the weirdness. So that covers a lot of territory. But we’re open to questions and anything. Q: “Where does the drive come from? SMP. It was simple, nice, easy; why?” A: Why? Because in commercial embedded you usually don’t want a fan on the processor, and you don’t want a high-heat processor. Which means you tend to focus; SMP keeps your clock rate and your heat and power lower, but still gives you the natural concurrency and the throughput you need. The other side of that, and where we got our initial funding to do some of this, is that space processors are built slow. The SPARCs I mentioned, the ERC32 that still flies on most missions, I think topped out at 25MHz. So these are on old processes. The current state of the art, the Leon3 and the RAD750 PowerPC: I think the RAD750 will go to 133MHz, and the Leon3 is in the same range, 100. The next generation is trying to go to 300 or 400 MHz. And if that’s all you can do and stay in the processes you have to use to be space-hardened, you’re better off going multicore, because you can’t raise the clock rate. Also there is some belief you can disable unused cores and save power, although the Leon designers say that’s not true for them. And I talked to some guys from one of the ARINC vendors, and they had measured it on the PowerPC, and it wasn’t true there either. So we’re doing some work now, and we’d actually proposed turning off cores as a feature that would be useful. And we went back and backpedaled and said: well, it sounds good, but it’s really not right, because we could do all the software and it wouldn’t save you any power, so let’s find a feature that’s actually meaningful and useful. So it really comes down to the same thing you see in a phone, except the clock rate can be a little higher. 
You get better performance by having more cores do more things in parallel than by having a 3GHz processor in your pants that’s going to melt your leg. Q: “Does it not save power because it’s actually not a substantial part of the power budget that’s consumed by the cores? Or are there other things?” I think the heat. The heat, and the process that space processors have to be made to. Even on embedded PC systems I’ve worked on, when you get into systems that have low height requirements or can’t have moving fans, you end up having to stay under about a GHz to actually be able to go fanless. If you need more processor power than you can get in a GHz PC, then your only choice is to go multicore. And when you add in the restriction that you have to make it on a process that lets you space-harden it, you’ve dropped your processor another factor of four, so you definitely want a second core. The funny thing is, they want the performance, but the technical challenges are pretty high at this point. Q: “For SMP processors for PCs also, scheduling is not so easy at this point.” One of the reasons I wanted the plug-in scheduling framework is, first, that my PhD was on real-time disk and cache scheduling algorithms, and I believed there was no one true algorithm there either. For a lot of these, if you knew what your system was doing, you could pick a better algorithm. These embedded systems are custom, special-purpose applications or devices; they’re only doing one thing their entire life. If I gave you the set of requirements for what you are doing with the disk, you could pick intelligently how to manage it better. I didn’t answer the question. The other part of why there’s a plugin framework: we knew that history says every operating system did one core, then did a lot of work to get to two and four. Eight was another leap of faith. And somewhere beyond that the algorithm fell over and you needed another algorithm. So we really
anticipated that, on top of that, we would have real-time research coming up with better algorithms along the way, or ones tailored for different situations. And actually there is an architecture called the TILE64 where I think they already have 64 cores. I don’t know how useful each core is, but they’re going to get there; it’s just a matter of time. Q: “What’s the API for applications that want to use [realtime]? So how do I set deadlines, and do you do monitoring, for example, that you make the deadlines within the application, or..?” Okay. When you’re doing rate-monotonic scheduling.. Most embedded tasks look like this: there’s some kind of blocking operation, so they sit their entire life waiting for something to happen, and then they do something, and that’s all they do. You press a key; I wake up; I process the key; I go to sleep and wait for the next OS event. The operation is actually the RTEMS rate monotonic period call; it’s really just the period and 10. So you’d wake up every 10ms, and that would be it. That’s how you set your loops. So it looks exactly like you were waiting on a message queue, or a mutex, or anything else. Q: “And suppose I have an algorithm that’s for example quadratic, or something. And that’s the only thing I can run. Is that a problem?” But remember, is it running in the background to completion? See, normally you do things that are time-focused, or in response to events in the system: somebody pushes a button; the door opens. So if you’ve got this long-running algorithm, it could just run in the background. Or you know that, let’s say, this wakes up and processes this as a scancode: you know how long that algorithm takes to run, and when it gets a whole key event it might send it to something that’s more intelligent. I’m trying to come up with a good example, because you gave me a broad one and I’m trying to come back with a good example. 
In another example, a lot of applications, like video processing, do pipeline processing. You do one codec, and then another codec, and then another codec; so you could run each codec as a thread on a frame, then pass a pointer to the frame to the next thread, to the next thread, to the next thread. You could set it up so that rather than having an enormously long sequence of call-this, call-this, call-this that you don’t know how long it’s going to take, you split the algorithms out into separate threads. And then on an SMP system those threads would naturally run on different cores once you got a few frames in, because you’d start off with frame one here, then frame one would be there, and then you’d be doing frame two. So by naturally breaking it into threads, you would naturally be able to take advantage of all the processing power. And the cool thing about that is that you can break the chain of processing and insert another codec. So a lot of signal processing systems are done that way. You fill up with the data, then you do this with it, then you do that with it, then you do this with it. And if a better algorithm comes along the way, you replace it. So the goal is you try not to have the big monstrous thing; you have these very discrete things. You understand what they do. And it’s the classic ‘small is better’, because you break it into things you understand and can put timing constraints on. So this task wakes up, does something, possibly transforms some data, does some computations, but there are finite bounds on what it can do. And based on its period it would do something like this: run at time 0, and then be scheduled at t=10, and t=20. And depending on whether he was.. the first time he might run all of this, and then wait until here. But he might not be the most important thing, even though he was ready to run. 
He might run here, but he would be ready again; he was ready there. So his deadline is that he completes by the end of the period, not that he starts on the deadline. That’s the way the rate monotonic algorithm works. The rate monotonic algorithm was actually developed for the Apollo program; it’s that old. Did I beat that, or did I just go all the way around it and not answer? Q: “So then how do you tune it? So if you have a specific architecture, you install RTEMS on it, and then you think, OK, this should be fast enough, and it turns out not to be?” A lot of these systems really are periodic, because you’re dealing with events in the real world that happen with a frequency, or a maximum frequency. So this math really helps you. You’re running every 10. Here he took 2; this time he took 1.5. Well, what happens if this time he took 5? If you find out that this 10-millisecond task is averaging 7, right off the bat you know he’s taking 70% of your CPU. So he is the place to optimize. So normally, by focusing on the things that are naturally periodic, it’s so easy to compute how much they’re using that you quickly find the things you should be worried about. The other thing you get with our stats is minimum, maximum and average, on CPU time and wall time, and wall time is calendar time. This is a good example. I had a system that actually had a max of 1.5, and it was a 10-millisecond task. But occasionally it was taking 15 milliseconds of wall time. So it’s never executing very long, but sometimes it’s taking too long to get done. And it was the highest priority task in the system, and that means that a lower priority task was holding a resource that it needed. And in this case it turned out that a fairly benign, obvious mistake in a design pattern had been made. The data logger woke up, read all the data, and locked all the data he wanted. And this task was the task that woke up and updated the data. 
So every now and again, the guy who updated the data would wake up while the data was being logged, and couldn’t update it until the logging was done. So what we ended up doing was turning the system around: everybody wrote their newest data to the logger, so the logger could run independent of everyone. And it was pretty obvious: you want everyone to lock the data, so it makes sense for the logger to wake up and go gather it, right? Well, it also led to a priority inversion problem. Which, if you look, which was the Mars rover before Curiosity? Spirit? The one before it had a horrible priority inversion problem that had to be fixed on the ground. And it’s a pretty famous one. Q: “On the ground there?” On the ground there. They had to do a code upload. Q: “The other ground?” Yeah, on Mars. It was on Mars when it happened. These are easy to do because they’re not obvious. It’s not that the people doing the system are stupid! They did exactly what you were supposed to do: they locked the data every time it was accessed or updated. We just happened to get the priorities.. we got the accessor pattern wrong. Nobody said these were easy. It’s multithreaded; that’s why we get the big bucks, right? That’s why you go to school. That’s actually one of my favourite examples. I actually have the stats from that system in my class slides, because that’s a real example, and it’s also a hint that.. in the US we say, you’re always looking for horses, but sometimes there’s a zebra. So when you see the zebra, you need to find out why it’s there. And this was a zebra. This task was running perfectly, except once in a while. And this was it. Keep going. I’ve got nowhere to be except Noordwijk, and I’m going to get there easily. Q: “Maybe it’s a stupid question, but do you sometimes bake the operating system into the hardware or so?” The most I’ve ever seen anybody do was a research project where they actually.. I don’t remember if he finished this or not. He was implementing
actually a scheduler, using the plugin interface, with an FPGA assist. That’s an interesting thing to do. That had actually been done with Ada in the early nineties, and it’s kind of painful, I think. It’s standard FPGA work, so it’s hard. Normally the OS is just linked in with the application and shipped with the application, so you don’t know it’s there. We don’t know it’s there. Q: “In computer systems you discover at some point that you just can’t meet some deadlines. And that’s where usually you then start adding hardware or whatever. I guess this happens less?” If it happens, it will either be a horrible design problem early on, or it’ll be something where things have been added to the system. And at that point the priorities are really supposed to reflect the things you don’t need to do. So by being completely priority-based, you’re supposed to avoid these problems. The other thing is the rate monotonic scheduling algorithm. One of its rules is called the processor utilization rule, and I think the limit for the formula is like 67%. So to use that rule, by the math, you can never have over 67% utilization for periodic tasks. But there’s another rule, called the first deadline rule, where you basically draw out a timeline, and you can go a lot higher. But I’ve got an example in my class slides where it gets high enough that if you actually saw this on a system, you would probably stop the project until you fixed the problem, because it’s within a few percentage points of being a complete overload. Because that’s the problem with all of these: you start wondering how accurate your measurements are. And if I’m reading 90% utilization and I haven’t fielded the system, I start getting nervous very rapidly. But these systems are supposed to survive and not fail. We’ve focused on the scheduling. We actually do instruction-level test coverage, and we have near-100% generated-assembly test coverage. Did that sink in? 
“Holy shit.” We execute nearly every instruction of generated code in the threading and synchronization code, and about 95% of the branch paths. So if you look at every branch instruction as taken/not-taken: in the core set of the OS, I think the number is 1980 branch paths, and we hit 97% of those, taken or not-taken. And that has been an ongoing Google Summer of Code project and high school student project, and anytime anyone wants to do something, there’s always test coverage to improve. The numbers aren’t as high across all the support code, but a lot of the support code is near 100%. We recently added the filesystems into the set, and that dropped.. We’ve got what we call the core number and the full number. The full number dropped into the 80s, but the amount of binary code analyzed went from 75k to 275k. So dropping 12% wasn’t bad considering how much code we added. We just said we’re going to have to start adding a lot of test code to the filesystems, which is pretty hard to do. And all that testing is automated. Except for some commercial simulators for the space processors, all of it is reproducible on open source, complete open source. “Let’s do two more.” Q: “I’m curious about, for the code in the system that you didn’t write yourself, or let’s put it this way, code that was written that you want to import, without any realtime constraints or predictability in mind, and you don’t want to modify it heavily, how on Earth can you do that?” You should leave it by itself with its own constraints. “You just can’t depend on its behaviour?” “Like the network stack, for instance.” That uses the system malloc on the fly. But the BSD stack pre-allocates, like I already said. Everything is pre-allocated; no dynamic memory allocations. Once the application reaches the point where it’s said to be initialized, you don’t dynamically allocate memory, because if you do that you can leak,
and if you leak you can fail. So when we bring in code, we try to define the requirements for it. And normally it runs inside a thread, so you have control over it running inside a thread, so you still have the priority. The shell normally runs at a fairly low priority, so if you run dd to a large disk, it’s running in the background. “Any final questions?” Q: “How do you.. You mentioned processors in space and that they get hardened. And you mentioned that they have reasonable memory and stuff like that. Is the problem actually size? Or weight? Or is it a price problem?” When I said process problem, I meant manufacturing process. Because once the lithography drops to a certain level, there is a higher probability of a cosmic ray hitting the actual circuit than with an older circuit, where the lithography is larger; a lot of the cosmic rays pass right through, which is again above my area of expertise. But apparently something about it makes the speed lower. And there’s also size, weight and power on a satellite. And everything is expensive. I had a Canadian guy tell me that they had to come in under their memory budget, because adding a memory chip was 100 grand to their satellite. It’s numbers like that. This has to survive launch and shock and all those things. And you normally only build two: you build one for the air and one for the ground to debug on. And the one for the ground is often called a flat-sat, because it’s laid out in a lab with the boards spread out. It’s a strange world. “Good note to end on.” “Thank you very much.”