[Week 6, Continued] [David J. Malan] [Harvard University] [This is CS50.] [CS50.TV] This is CS50 and this is the end of week 6 So CS50x, one of Harvard’s first courses involved in the edX initiative indeed debuted this past Monday If you would like to get a glimpse of what others on the Internet are now following along with, you can head to x.cs50.net That will redirect you to the appropriate place on edx.org, which was where this and other courses from MIT and Berkeley now live You’ll have to sign up for an account; you’ll find that the material is largely the same as you’ve had this semester, albeit a few weeks delayed, as we get everything ready But what students in CS50x will now see is an interface quite like this one This, for instance, is Zamyla leading the walkthrough for problem set 0 Upon logging in to edx.org, a CS50x student sees the sorts of things you would expect to see in a course: the lecture for the Monday, lecture for Wednesday, various shorts, the problem sets, the walkthroughs, PDFs In addition, as you see here, machine translations of English transcripts into Chinese, Japanese, Spanish, Italian, and a whole bunch of other languages that will certainly be imperfect as we roll them out programmatically using something called an API, or application programming interface, from Google that allows us to convert English to these other languages But thanks to the wonderful spirit of some hundred-plus volunteers, random people on the Internet who have kindly offered to get involved in this project, we’ll gradually be improving the quality of those translations by having humans correct the mistakes that our computers have made So it turns out we had a few more students show up on Monday than we initially expected In fact, now CS50x has 100,000 people following along at home So realize you are all part of this inaugural class of making this course in computer science education more generally, more broadly, accessible And the reality is now, with some of these massive online courses, they all start with these very high numbers, as we seem to have done here But the goal, ultimately, for CS50x is really to get as many people to the finish line as possible By design, CS50x is going to be offered from this past Monday all the way through April 15, 2013, so that folks who have school commitments elsewhere, work, family, other conflicts and the like, have a bit more flexibility with which to dive into this course, which, suffice it to say, is quite ambitiously done if only over the course of just three months during a usual semester But these students will be tackling the same problem sets, viewing the same content, having access to the same shorts and the like So realize that we are all really in this together And one of the end goals of CS50x is not just to get as many folks to the finish line and give them this newfound understanding of computer science and programming but also to have them have this shared experience One of the defining characteristics of 50 on campus, we hope, has been this sort of communal experience, for better or for worse, sometimes, but having these people to turn to to the left and to the right, and office hours and the hackathon and the fair It’s a little harder to do that in person with folks online, but CS50x will conclude in April with the first ever CS50 Expo, which will be an online adaptation of our idea of the fair where these thousands of students will all be invited to submit a 1- to 2-minute video, either a screencast of their final project or video of them waving hello and talking about their project and demoing it, much like your predecessors have done here on campus in the fair, so that by semester’s end, the hope is to have a global exhibition of the CS50x students’ final projects, much like that which awaits you this December here on campus So more on that in the months to come With 100,000 students, though, comes a need for a few more CAs Given that you guys are blazing the trail here and taking CS50 several weeks in advance of this material’s release to the folks on edX, realize we would love to involve as many of our own students as possible in this initiative, both during the semester as well as this winter and this coming spring So if you would like to get involved in CS50x, particularly joining in on CS50x Discuss, the edX version of CS50 Discuss, which many of you have been using on campus, the online bulletin board, please do head to that URL, let us know who you are, because we’d love to build up a team of students and staff and faculty alike on campus who are simply playing along and helping out And when they see a question that’s familiar to them, you hear a student reporting some bug somewhere out there in some country on the Internet, and that rings a bell because you too had that same issue in your d-hall some time ago, hopefully then you can chime in and share your own experience So please do partake if you would like Computer science courses at Harvard have a bit of a tradition, CS50 among them, of having some apparel, some clothes, that you can wear proudly at semester’s end, saying quite proudly that you finished CS50 and took CS50 and the like, and we always try to involve students in this process as much as possible, whereby we invite, around this time of the semester, students to submit designs using Photoshop, or whatever tool of choice you’d like to use if you’re a designer, to submit designs for T-shirts and sweatshirts
and umbrellas and little bandanas for dogs we now have and the like And everything is then–the winners each year are then exhibited on the course’s website at store.cs50.net Everything is sold at cost there, but the website just runs itself and allows people to choose the colors and designs that they like So I thought we’d just share some of last year’s designs that were on the website besides this one here, which is an annual tradition “Every Day I’m Seg Faultn” was one of the submissions last year, which is still available there for alumni We had this one, “CS50, Established 1989.” One of our Bowdens, Rob, was very popular last year “Team Bowden” was born, this design was submitted, among the top sellers As was this one here. Many people had “Bowden Fever” according to the sales logs Realize that that could now be your design there, up on the Internet More details on this in the next problem sets to come One more tool: you’ve had some exposure and hopefully now some hands-on experience with GDB, which is, of course, a debugger and allows you to manipulate your program at a fairly low level, doing what kinds of things? What does GDB let you do? Yeah? Give me something. [Student answer, unintelligible] Good. Step into function, so you don’t just have to type run and have the program blow through its entirety, printing out things to standard output Rather, you can step through it line by line, either typing next to go line by line by line or step to dive into a function, typically one that you wrote What else does GDB let you do? Yeah? [Student answer, unintelligible] Print variables. So if you want to do a little introspection inside of your program without having to resort to writing printf statements all over the place, you can just print a variable or display a variable What else can you do with a debugger like GDB? [Student answer, unintelligible] Exactly. You can set breakpoints; you can say break execution at the main function or the foo function You can say break execution at line 123 And breakpoints are a really powerful technique because if you have a general sense of where your problem probably is, you don’t have to waste time stepping through the program’s entirety You can essentially jump right there and then start typing– stepping through it with step or next or the like But the catch with something like GDB is that it helps you, the human, find your problems and find your bugs It doesn’t necessarily find them so much for you So we introduced the other day style50, which is a short command line tool that tries to stylize your code a little bit more cleanly than you, the human, might have done But that, too, is really just an aesthetic thing But it turns out there’s this other tool called Valgrind that is a little more arcane to use Its output is atrociously cryptic at first glance But it’s wonderfully useful, especially now that we’re at the part of the term where you’re starting to use malloc and dynamic memory allocation Things can go really, really wrong quickly Because if you forget to free your memory, or you dereference some NULL pointer, or you dereference some garbage pointer, what is typically the symptom that results? Seg fault. And you get this core file of some number of kilobytes or megabytes that represents the state of your program’s memory when it crashed, but your program ultimately seg faults, segmentation fault, which means something bad happened almost always related to a memory-related mistake that you made somewhere So Valgrind helps you find things like this It’s a tool that you run, like GDB, after you’ve compiled your program, but rather than run your program directly, you run Valgrind and you pass to it your program, just like you do with GDB Now, the usage, to get the best kind of output, is a little long, so right there atop of the screen you’ll see Valgrind -v “-v” almost universally means verbose when you’re using programs on a Linux computer So it means spit out more data than you might by default “–leak-check=full.” This is just saying check for all possible memory leaks, mistakes that I might have made. This, too, is a common paradigm with Linux programs Generally, if you have a command line argument that’s a “switch”, that’s supposed to change the program’s behavior, and it’s a single letter, it’s -v, but if that’s switched, just by design of the programmer, is a full word or series of words, the command line argument starts with — These are just human conventions, but you’ll see them increasingly And then, finally, “a.out” is the arbitrary name for the program in this particular example And here’s some representative output Before we look at what that might mean, let me go over to a snippet of code over here And let me move this out of the way, coming soon, and let’s take a look at memory.c, which is this short example here So in this program, let me zoom in on the functions and questions We have a function main that calls a function, f, and then what does f proceed to do, in slightly technical English? What does f proceed to do? How about I’ll start with line 20, and the star’s location doesn’t matter, but I’ll just be consistent here with last lecture What’s line 20 do for us? On the left hand side. We’ll break it down further Int* x: what does that do?
Okay. It’s declaring a pointer, and now let’s be even more technical What does it mean, very concretely, to declare a pointer? Someone else? Yeah? [Student answer, unintelligible] Too far So you’re reading to the right-hand side of the equal sign Let’s focus just on the left, just on int* x This does “declare” a pointer, but now let’s dive in deeper to that definition What does that concretely, technically mean? Yeah? [Student answer, unintelligible] Okay. It’s preparing to save an address in memory Good. And let’s take this one step further; it’s declaring a variable, x, that’s 32 bits And I know it’s 32 bits because–? It’s not because it’s an int, because it’s a pointer in this case Coincidence that it’s one and the same with an int, but the fact that there’s the star there means this is a pointer and in the appliance, as with many computers, but not all, pointers are 32 bits On more modern hardware like the latest Macs, the latest PCs, you might have 64-bit pointers, but in the appliance, these things are 32 bits So we’ll standardize on that. More concretely, the story goes as follows: We “declare” a pointer; what does that mean? We prepare to store a memory address What does that mean? We create a variable called x that takes up 32 bits that will soon store the address of an integer And that’s probably about as precise as we can get It’s fine moving forward to simplify the world and just say declare a pointer called x Declare a pointer, but realize and understand what’s actually going on even in just those few characters Now, this one’s almost a little easier, even though it’s a longer expression So what is this doing, that’s highlighted now: “malloc(10* sizeof (int));” Yeah? [Student answer, unintelligible] Good. And I’ll take it there. It’s allocating a chunk of memory for ten integers And now let’s dive in slightly deeper; it’s allocating a chunk of memory for ten integers What is malloc then returning? The address of that chunk, or, more concretely, the address of the first byte of that chunk How then am I, the programmer, to know where that chunk of memory ends? I know that it’s contiguous. Malloc, by definition, will give you a contiguous chunk of memory No gaps in it. You have access to every byte in that chunk, back to back to back, but how do I know where the end of this chunk of memory is? When you use malloc? [Student answer, unintelligible] Good You don’t. You have to remember I have to remember that I used the value 10, and I don’t even seem to have done that here But the onus is entirely on me. Strlen, which we’ve become slightly reliant on for strings, works only because of this convention of having \0 or this special nul character, N-U-L, at the end of a string That does not hold for just arbitrary chunks of memory It’s up to you. So line 20, then, allocates a chunk of memory that can store ten integers, and it stores the address of the first byte of that chunk of memory in the variable called x Ergo, which is a pointer. So line 21, unfortunately, was a mistake But first, what is it doing? It’s saying store at location 10, 0 indexed, of the chunk of memory called x the value 0 So notice a couple of things are going on Even though x is a pointer, recall from a couple weeks ago that you can still use the array-style square bracket notation Because that’s actually short-hand notation for the more cryptic-looking pointer arithmetic where we would do something like this: Take the address x, move 10 spots over, then go there to whatever address is stored at that location But frankly, this is just atrocious to read and get comfortable with So the world typically uses the square brackets just because it’s so much more human-friendly to read But that’s what’s really going on underneath the hood; x is an address, not an array, per se So this is storing 0 at location 10 in x Why is this bad? Yeah? [Student answer, unintelligible] Exactly We only allocated ten ints, but we count from 0 when programming in C, so you have access to 0 1 2 3 4 5 6 7 8 9, but not 10 So either the program is going to seg fault or it’s not But we don’t really know; this is sort of a nondeterministic behavior It really depends on whether we get lucky If it turns out that the operating system doesn’t mind if I use that extra byte, even though it hasn’t given it to me, my program might not crash It’s raw, it’s buggy, but you might not see that symptom, or you might see it only once in a while But the reality is that the bug is, in fact, there And it’s really problematic if you’ve written a program that you want to be correct, that you’ve sold the program that people are using that every once in a while crashes because, of course, this is not good. In fact, if you have an Android phone or an iPhone
and you download apps these days, if you’ve ever had an app just quit, all of a sudden it disappears, that’s almost always the result of some memory-related issue, whereby the programmer screwed up and dereferenced a pointer that he or she shouldn’t have, and the result of iOS or Android is to just kill the program altogether rather than risk undefined behavior or some kind of security compromise There’s one other bug in this program besides this one What else have I screwed up in this program? I’ve not practiced what I’ve preached. Yeah? [Student answer, unintelligible] Good I haven’t freed the memory. So the rule of thumb now has to be anytime you call malloc, you must call free when you are done using that memory Now, when would I want to free this memory? Probably, assuming this first line was correct, I would want to do it here Because I couldn’t, for instance, do it down here. Why? Just out of scope. So even though we’re talking about pointers, this is a week 2 or 3 issue, where x is only in scope inside of the curly braces where it was declared So you definitely can’t free it there. My only chance to free it is roughly after line 21 This is a fairly simple program; it was fairly easy once you kind of wrapped your mind around what the program’s doing, where the mistakes were And even if you didn’t see it at first, hopefully it’s a little obvious now that these mistakes are pretty easily solved and easily made But when a program is more than 12 lines long, it’s 50 lines long, 100 lines long, walking through your code line by line, thinking through it logically, is possible but not particularly fun to do, constantly looking for bugs, and it’s also difficult to do, and that’s why a tool like Valgrind exists Let me go ahead and do this: let me open my terminal window, and let me not just run memory, because memory seems to be fine I’m getting lucky. Going to that additional byte at the end of the array doesn’t seem to be too problematic But let me, nonetheless, do a sanity check, which just means to check whether or not this is actually correct So let’s do valgrind -v –leak -check=full, and then the name of the program in this case is memory, not a.out So let me go ahead and do this. Hit Enter Dear God. This is its output, and this is what I alluded to earlier But, if you learn to read through all of the nonsense here, most of this is just diagnostic output that’s not that interesting What your eye really wants to be looking for is any mention of error or invalid Words that suggest problems And indeed, let’s see what’s going wrong down here I have a summary of some sort, “in use at exit: 40 bytes in 1 blocks.” I’m not really sure what a block is yet, but 40 bytes actually feels like I could figure out where that’s coming from 40 bytes. Why are 40 bytes in use at exit? And more specifically, if we scroll down here, why have I definitely lost 40 bytes? Yeah? [Student answer, unintelligible] Perfect. Yeah, exactly There were ten integers, and each of those is size of 4, or 32 bits, so I’ve lost precisely 40 bytes because, as you proposed, I haven’t called free That’s one bug, and now let’s look down a little further and see next to this, “invalid write of size 4.” Now what is this? This address is expressed what base notation, apparently? This is hexadecimal, and any time you see a number starting with 0x, it means hexadecimal, which we did way back in, I think, pset 0’s section of questions, which was just to do a warmup exercise, converting decimal to hex to binary and so forth Hexadecimal, just by human convention, is usually used to represent pointers or, more generally, addresses. It’s just a convention, because it’s a little easier to read, it’s a little more compact than something like decimal, and binary is useless for most humans to use So now what does this mean? Well, it looks like there’s an invalid write of size 4 on line 21 of memory.c So let’s go back to line 21, and indeed, here is that invalid write So Valgrind isn’t going to completely hold my hand and tell me what the fix is, but it is detecting that I’m doing an invalid write I’m touching 4 bytes that I shouldn’t be, and apparently that’s because, as you pointed out, I’m doing  instead of  maximally or  or something in between With Valgrind, realize any time you’re now writing a program that uses pointers and uses memory, and malloc more specifically, definitely get into the habit of running this long but very easily copied and pasted command of Valgrind to see if there’s some errors in there And it’ll be overwhelming every time you see the output, but just parse through visually all of the output and see if you see mentions of errors or warnings or invalid or lost Any words that sound like you screwed up somewhere So realize that’s a new tool in your toolkit Now on Monday, we had a whole bunch of folks come up here and represent the notion of a linked list And we introduced the linked list as a solution to what problem?
Yeah? [Student answer, unintelligible] Good Arrays cannot have memory added to them If you allocate an array of size 10, that’s all you get You can call a function like realloc if you initially called malloc, and that can try to grow the array if there is space toward the end of it that no one else is using, and if there’s not, it will just find you a bigger chunk somewhere else But then it will copy all of those bytes into the new array This sounds like a very correct solution Why is this unattractive? I mean it works, humans have solved this problem Why did we need to resolve it on Monday with linked lists? Yeah? [Student answer, unintelligible] It could take a long time In fact, any time you’re calling malloc or realloc or calloc, which is yet another one, any time you, the program, are talking to the operating system, you tend to slow the program down And if you’re doing these kinds of things in loops, you’re really slowing things down You’re not going to notice this for the simplest of “hello world” type programs, but in much larger programs, asking the operating system again and again for memory or giving it back again and again tends not to be a good thing Plus, it’s just sort of intellectually–it’s a complete waste of time Why allocate more and more memory, risk copying everything into the new array, if you have an alternative that lets you allocate only as much memory as you actually need? So there’s pluses and minuses in here One of the pluses now is that we have dynamism Doesn’t matter where the chunks of memory are that are free, I can just sort of create these bread crumbs via pointers to string my whole linked list together But I pay at least one price What do I have to give up in gaining linked lists? Yeah? [Student answer, unintelligible] Good You need more memory. Now I need space for these pointers, and in the case of this super simple linked list that is only trying to store integers, which are 4 bytes, we keep saying well, a pointer is 4 bytes, so now I’ve literally doubled the amount of memory I need just to store this list But again, this is a constant tradeoff in computer science between time and space and development, effort and other resources What’s another downside of using a linked list? Yeah? [Student answer, unintelligible] Good. Not as easy to access. We can no longer leverage week 0 principles like divide and conquer And more specifically, binary search. Because even though we humans can see roughly where the middle of this list is, the computer only knows that this linked list starts at address called first And that’s 0x123 or something like that And the only way the program can find the middle element is to actually search the whole list And even then, it literally has to search the whole list because even once you reach the middle element by following the pointers, you, the program, have no idea how long this list is, potentially, until you hit the end of it, and how do you know programmatically that you’re at the end of a linked list? There’s a special NULL pointer, so again, a convention Rather than use this pointer, we definitely don’t want it to be some garbage value pointing off stage somewhere; we want it to be hand down, NULL, so that we have this terminus in this data structure so we know where it ends What if we want to manipulate this? We did most of this visually, and with humans, but what if we want to do an insertion? So the original list was 9, 17, 20, 22, 29, 34 What if we then wanted to malloc space for number 55, a node for it, and then we want to insert 55 into the list just as we did on Monday? How do we do this? Well, Anita came up and she essentially walked the list She started at the first element, then the next, the next, the next, the next, the next Finally hit the left-hand all the way down and realized oh, this is NULL. So what pointer manipulation needed to be done? The person who was on the end, number 34, needed his left hand raised to point at 55, 55 needed their left arm pointing down to be the new NULL terminator. Done Pretty easy to insert 55 into a sorted list And how might this look? Let me go ahead and open up some code example here I’ll open up gedit, and let me open up two files first One is list1.h, and let me just remind that this was the chunk of code that we used to represent a node A node has both an int called n and a pointer called next that just points to the next thing in the list That is now in a .h file. Why? There’s this convention, and we haven’t taken advantage of this a huge amount ourselves, but the person who wrote printf and other functions gave as a gift to the world all of those functions by writing a file called stdio.h And then there’s string.h, and then there’s map.h, and there’s all these h files that you might have seen or used during the term written by other people Typically in those .h files are only things like typedefs or declarations of custom types or declarations of constants You don’t put functions’ implementations in header files You put, instead, just their prototypes You put things you want to share with the world what they need in order to compile their code. So just to get into this habit, we decided to do the same thing. There’s not much in list1.h, but we’ve put something that might be of interest to people in the world
who want to use our linked list implementation Now, in list1.c, I won’t go through this whole thing because it’s a bit long, this program, but let’s run it real quickly at the prompt Let me compile list1, let me then run list1, and what you’ll see is we’ve simulated a simple little program here that’s going to allow me to add and remove numbers to a list So let me go ahead and type 3 for the menu option 3 I want to insert the number–let’s do the first number, which was 9, and now I’m told the list is now 9 Let me go ahead and do another insertion, so I hit menu option 3 What number do I want to insert? 17 Enter. And I’ll do just one more Let me insert the number 22 So we have the beginnings of the linked list that we had in slide form a moment ago How is this insertion actually happening? Indeed, 22 is now at the end of the list So the story we told on stage on Monday and recapped just now must actually be happening in code Let’s take a look. Let me scroll down in this file We’ll gloss over some of the functions, but we’ll go down to, say, the insert function Let’s see how we go about inserting a new node into this linked list Where is the list declared? Well, let’s scroll all the way up at the top, and notice that my linked list is essentially declared as a single pointer that’s initially NULL So I’m using a global variable here, which in general we’ve preached against because it makes your code a little messy to maintain, it’s sort of lazy, usually, but it’s not lazy and it’s not wrong and it’s not bad if your program’s sole purpose in life is to simulate one linked list Which is exactly what we’re doing So rather than declare this in main and then have to pass it to every function we’ve written in this program, we instead realize oh, let’s just make it global because the whole purpose of this program is to demonstrate one and only one linked list So that feels okay. Here are my prototypes, and we won’t go through all of these, but I wrote a delete function, a find function, an insert function, and a traverse function But let’s now go back down to the insert function and see how this one works here Insert is on line–here we go Insert. So it doesn’t take any arguments, because we’re going to ask the user inside of this function for the number they want to insert But first, we prepare to give them some space This is sort of copy and paste from the other example In that case, we were allocating an int; this time we’re allocating a node I don’t really remember how many bytes a node is, but that’s fine Sizeof can figure that out for me And why am I checking for NULL in line 120? What could go wrong in line 119? Yeah? [Student answer, unintelligible] Good. Just could be the case that I’ve asked for too much memory or something’s wrong and the operating system doesn’t have enough bytes to give me, so it signals as much by returning NULL, and if I don’t check for that and I just blindly proceed to use the address returned, it could be NULL It could be some unknown value; not a good thing unless I– actually won’t be an unknown value. It could be NULL, so I don’t want to abuse it and risk dereferencing it If that happens, I just return and we’ll pretend like I didn’t get back any memory at all Otherwise, I tell the user give me a number to insert, I call our old friend GetInt, and then this was the new syntax we introduced on Monday ‘newptr->n’ means take the address that you were given by malloc which represents the first byte of a new node object, and then go to the field called n A little trivia question: This is equivalent to what more cryptic line of code? How else could I have written this? Want to take a stab? [Student answer, unintelligible] Good. Using the .n, but it’s not quite as simple as this What do I first need to do? [Student answer, unintelligible] Good. I need to do *newptr.n So this is saying new pointer’s obviously an address. Why? Because it was returned by malloc. The *newptr saying “go there,” and then once you’re there, then you can use the more familiar .n, but this just looks a little ugly, especially if we humans are going to draw pointers with arrows all the time; the world has standardized on this arrow notation, which does exactly the same thing So you only use the -> notation when the thing on the left is a pointer Otherwise, if it’s an actual struct, use the .n And then this: Why do I initialize newptr->next to NULL? We don’t want a dangling left hand off of the end of the stage We want it pointing straight down, which means the end of this list could potentially be at this node, so we better make sure it is NULL And, in general, initializing your variables or your data members and structs to something is just good practice Just letting garbage exist and continue to exist generally gets you in trouble if you forget to do something later on Here’s a few cases. This, again, is the insert function, and the first thing I check for is if the variable called first, that global variable is NULL, that means there is no linked list We haven’t inserted any numbers, so it’s trivial to insert this current number
into the list, because it just belongs at the start of the list So this was when Anita was just standing up here alone, pretending no one else was up here on stage until we allocated a node, then she could raise her hand for the first time, if everyone else had come up on the stage after her on Monday Now here, this is a little check where I have to say if the new node’s value of n is < the value of n in the current first node, that means there is a linked list that's begun There's at least one node in the list, but this new guy belongs before it, so we need to move things around In other words, if the list has started with just, let's say, just the number 17, that's the--actually, we can do this more clearly If we start our story with a pointer here called first, and initially it's NULL, and we insert the number 9, the number 9 clearly belongs at the start of the list So let's pretend we just malloced the address or the number 9 and put it here If first is 9 by default, the first scenario we discussed just means let's point this guy here, leave this as NULL; now we have the number 9 The next number we want to insert is 17 17 belongs over here, so we're going to have to do some logical stepping through this So let's instead, before we do that, let's pretend that we wanted to insert the number 8 So just for convenience's sake, I'm going to draw here But remember, malloc can put it most anywhere But for drawing's sake, I'll put it here So pretend I've just allocated a node for the number 8; this is NULL by default What now has to happen? A couple of things We made this mistake on stage on Monday where we updated a pointer like this, then did this, and then we claimed--we orphaned everyone else on stage Because you can't--the order of operations here is important, because now we've lost this node 9 that is just sort of floating in space So this was not the right approach on Monday We first have to do something else The state of the world looks like this. Initially, 8 has been allocated What would be a better way of inserting 8? Instead of updating this pointer first, just update this one here instead So we need a line of code that's going to turn this NULL character into an actual pointer that's pointing at node 9, and then we can safely change first to point at this guy here Now we have a list, a linked list, of two elements And what does this actually look like here? If we look at the code, notice that I've done exactly that I've said newptr, and in this story, newptr was pointing at this guy So let me draw one more thing, and I should have left a little more room for this So forgive the tiny little drawing This guy is called newptr That is the variable we declared a few lines earlier, in line--just above 25 And it's pointing to 8. So when I say newptr->next, that means go to the struct that’s being pointed at by newptr, so here we are, go there Then the arrow is saying get the next field, and then the = is saying put what value there? The value that was in first; what value was in first? First was pointing at this node, so that means this should now point at this node In other words, what looks albeit a ridiculous mess with my handwriting, what’s a simple idea of just moving these arrows around translates to code with just this one liner Store what is in first in the next field and then update what first actually is Let’s go ahead and fast-forward through some of this, and look only at this tail insertion for now Suppose I get to the point where I find that the next field of some node is NULL And at this point in the story, a detail that I’m glossing over is that I’ve introduced another pointer up here in line 142, predecessor pointer Essentially, at this point in the story, once the list gets long, I kind of need to walk it with two fingers because if I go too far, remember in a single-length list, you can’t go backwards So this idea of predptr is my left finger, and newptr–not newptr Another pointer that’s here is my other finger, and I’m just kind of walking the list That’s why that exists. But let’s only consider one of the simpler cases here If that pointer’s next field is NULL, what’s the logical implication? If you are traversing this list and you hit a NULL pointer? You’re at the end of the list, and so the code to then append this one additional element is sort of the intuitive will take that node whose next pointer is NULL, so this is currently NULL, and change it, though, to be the address of the new node So we’re just drawing in code the arrow that we drew on stage by raising someone’s left hand And the case that I’ll wave my hands at for now, just because I think it’s easy to get lost when we do it in this sort of environment, is checking for insertion at the list’s middle
But just intuitively, what needs to happen if you want to figure out where some number belongs in the middle is you do have to walk it with more than one finger, more than one pointer, figure out where it belongs by checking is the element < the current one, > the current one, and once you find that place, then you have to do this sort of shell game where you move the pointers around very carefully And that answer, if you’d like to reason through this at home on your own, boils down just to these two lines of code, but the order of those lines is super important Because if you drop someone’s hand and raise someone else’s in the wrong order, again, you could end up orphaning the list To summarize more conceptually, the insertion at the tail is relatively straightforward The insertion at the head is also relatively straightforward, but you need to update an additional pointer this time to squeeze number 5 into the list here, and then insertion in the middle involves even more effort, to very carefully insert the number 20 in its correct location, which is between 17 and 22 So you need to do something like have the new node 20 point to 22, and then, which node’s pointer needs to be updated last? It’s 17, to actually insert it So again, I’ll defer the actual code for that particular implementation At first glance, it’s a little overwhelming, but it’s really just an infinite loop that’s looping, looping, looping, looping, and breaking as soon as you hit the NULL pointer, at which point you can do the requisite insertion This, then, is representative linked list insertion code That was kind of a lot, and it feels like we’ve solved one problem, but we’ve introduced a whole other one. Frankly, we’ve spent all this time on big O and Ω and running time, trying to solve problems more quickly, and here we are taking a big step backwards, it feels And yet, if the goal is to store data, it feels like the Holy Grail, as we said on Monday, would really be to store things instantly In fact, suppose that we did put aside linked list for a moment and we instead introduced the notion of a table And let’s just think of a table for a moment as an array This array and this case here has some 26 elements, 0 through 25, and suppose that you needed some chunk of storage for names: Alice and Bob and Charlie and the like And you need some data structure to store those names Well, you could use something like a linked list and you could walk the list inserting Alice before Bob and Charlie after Bob and so forth And, in fact, if you want to see code like that as an aside, know that in list2.h, we do exactly that We won’t go through this code, but this is a variant of the first example that introduces one other struct we’ve seen before called student, and then what it actually stores in the linked list is a pointer to a student structure rather than a simple little integer, n So realize there’s code there that involves actual strings, but if the goal at hand really now is to address the efficiency problem, wouldn’t it be nice if we’re given an object called Alice, we want to put her into the right location in a data structure, it feels like it’d be really nice to just put Alice, whose name starts with A, in the first location And Bob, whose name starts with B, in the second location With an array, or let’s start calling it a table, a hash table at that, we can do exactly that. If we are given a name like Alice, a string like Alice, where do you put A-l-i-c-e? We need a hueristic. We need a function to take some input like Alice and return an answer, “Put Alice at this location.” And this function, this black box, is going to be called a hash function A hash function is something that takes an input, like “Alice”, and returns to you, typically, the numeric location in some data structure where Alice belongs In this case, our hash function should be relatively simple Our hash function should say, if you are given “Alice”, which character should I care about? The first one. So I look at , and then I say if  character is A, return the number 0 If it’s B, return 1. If it’s C, return 2, and so forth All 0 index, and that would allow me to insert Alice and then Bob and then Charlie and so forth into this data structure. But there’s a problem What if Anita comes along again? Where do we put Anita? Her name, too, starts with the letter A, and it feels like we’ve made an even bigger mess of this problem We now have immediate insertion, constant time insertion, into a data structure rather than worse-case linear, but what can we do with Anita in this case? What are the two options, really? Yeah? [Student answer, unintelligible] Okay, so we could have another dimension That’s good. So we can build things out in 3D like we talked about verbally on Monday We could add another access here, but suppose that no, I’m trying to keep this simple The whole goal here is to have immediate constant-time access, so that’s adding too much complexity What are other options when trying to insert Anita into this data structure? Yeah? [Student answer, unintelligible] Good. So we could move everyone else down, like Charlie nudges down Bob and Alice, and then we put Anita where she really wants to be
Of course, now, there’s a side effect of this This data structure is probably useful not because we want to insert people once but because we want to check if they’re there later if we want to print out all of the names in the data structure We’re going to do something with this data eventually So now we’ve kind of screwed over Alice, who’s no longer where she’s supposed to be Nor is Bob, nor is Charlie So maybe this isn’t such a good idea But indeed, this is one option. We could shift everyone down, or heck, Anita came late to the game, why don’t we just put Anita not here, not here, not here, let’s just put her a little lower in the list But then this problem starts to devolve again You might be able to find Alice instantly, based on her first name And Bob instantly, and Charlie. But then you look for Anita, and you see, hmm, Alice is in the way Well, let me check below Alice. Bob is not Anita Charlie is not Anita. Oh, there is Anita And if you continue that train of logic all the way, what’s the worst-case running time of finding or inserting Anita into this new data structure? It’s O(n), right? Because in the worst case, there’s Alice, Bob, Charlie . . all the way down to someone named “Y”, so there’s only one spot left Thankfully, we have no one called “Z”, so we put Anita at the very bottom We haven’t really solved that problem So maybe we do need to introduce this third dimension And it turns out, if we do introduce this third dimension, we can’t do this perfectly, but the Holy Grail is going to be getting constant-time insertion and dynamic insertions so that we don’t have to hard-code an array of size 26 We can insert as many names as we want, but let’s take our 5-minute break here and then do that properly All right. I set the story up pretty artificially there by choosing Alice and then Bob and then Charlie and then Anita, whose name was obviously going to collide with Alice But the question we ended on Monday with is just how probable is it that you would get these kinds of collisions? In other words, if we start to use this tabular structure, which is really just an array, in this case of 26 locations, what if our inputs are instead uniformly distributed? It’s not artificially Alice and Bob and Charlie and David and so forth alphabetically, it’s uniformly distributed over A through Z Maybe we’ll just get lucky and we’re not going to have two A’s or two B’s with very high probability, but as someone pointed out, if we generalized this problem and not do 0 to 25 but, say, 0 through 364 or 65, often the number of days in a typical year, and asked the question, “What’s the probability that two of us in this room have the same birthday?” Put it another way, what’s the probability that two of us have a name starting with A? The sort of question is the same, but this address space, this search space, is bigger in the case of birthdays, because we have so many more days in the year than letters in the alphabet What’s the probability of a collision? Well, we can think of this by figuring out the math the opposite way What’s the probability of no collisions? Well, this expression here says that what’s the probability if there’s just one person in this room, that they have a unique birthday? It’s 100%. Because if there’s only one person in the room, his or her birthday can be any of the 365 days out of the year So 365/365 options gives me a value of 1 So the probability in question at the moment is just 1 But if there’s a second person in the room, what’s the probability that their birthday is different? There’s only 364 possible days, ignoring leap years, for their birthday not to collide with the other persons So 364/365. If a third person comes in, it’s 363/365, and so forth So we keep multiplying together these fractions, which are getting smaller and smaller, to figure out what is the probability that all of us have unique birthdays? But then we can, of course, just take that answer and flip it around and do 1 minus all of that, an expression we’ll eventually get if you remember the back of your math books, it looks a little something like this, which is much more easily interpreted graphically And this graphic here has on the x axis the number of birthdays, or number of people with birthdays, and on the y axis is the probability of a match And what this is saying is that if you have, let’s say, even, let’s choose something like 22, 23 If there’s 22 or 23 people in the room, the probability that two of those very few people are going to have the same birthday is actually super high, combinatorially 50% odds that in a class of just 22 people, a seminar, practically, 2 of those people are going to have the same birthday Because there’s so many ways in which you can have the same birthday Even worse, if you look at the right-hand side of the chart, by the time you have a class with 58 students in it, the probability of 2 people having a birthday is super, super high, nearly 100% Now, that’s sort of a fun fact about real life But the implications, now, for data structures and storing information means that just assuming you have a nice, clean, uniform distribution of data and you have a big enough array to fit a bunch of things doesn’t mean you’re going to get people in unique locations You’re going to have collisions. So this notion of hashing, as it’s called,
taking an input like “Alice” and massaging it in some way and then getting back an answer like 0 or 1 or 2 Getting back some output from that function is plagued by this probability of collision So how can we handle those collisions? Well, on the one case, we can take the idea that was suggested We can just shift everyone down, or maybe, a little more simply, rather than move everyone else, let’s just move Anita to the bottom of the available spot So if Alice is in 0, Bob is in 1, Charlie is in 2, we’ll just put Anita at location 3 And this is a technique in data structures called linear probing Linear because you’re just walking this line, and you’re sort of probing for available spots in the data structure Of course, this devolves into O(n) If the data structure’s really full, there’s 25 people in it already, and then Anita comes along, she ends up at what would be location Z, and that’s fine She still fits, and we can find her later But this was contrary to the goal of speeding things up So what if we instead introduced this third dimension? That technique is generally called separate chaining, or having chains And what a hash table now is, this tabular structure, your table is just an array of pointers But what those pointers point to is guess what? A linked list. So what if we take the best of both of these worlds? We use arrays for the initial indexes into the data structure so we can instantly go to  ,  or so forth, but so that we have some flexibility and we can fit Anita and Alice and Adam and any other A name, we instead let the other axis grow arbitrarily And we finally, as of Monday, have that expressive capability with linked list We can grow a data structure arbitrarily Alternatively, we could just make a huge 2-dimensional array, but that’s going to be an awful situation if one of the rows in a 2-dimensional array isn’t large enough for the additional person whose name happens to start with A God forbid we have to reallocate a huge 2-dimensional structure just because there’s so many people named A, especially when there’s so few people named Z something It’s just going to be a very sparse data structure So it’s not perfect by any means, but now we at least have the ability to instantly find where Alice or Anita belongs, at least in terms of the vertical axis, and then we just have to decide where to put Anita or Alice in this linked list If we don’t care about sorting things, how quickly could we insert Alice into a structure like this? It’s constant time. We index into , and if no one’s there, Alice goes at the start of that linked list But that’s not a huge deal. Because if Anita then comes along some number of steps later, where does Anita belong? Well, . Oop. Alice is already in that linked list But if we don’t care about sorting these names, we can just move Alice over, insert Anita, but even that is constant time Even if there’s Alice and Adam and all these other A names, it’s not really shifting them physically. Why? Because we just did here with linked list, who knows were these nodes are anyway? All you have to do is move the bread crumbs Move the arrows around; you don’t have to physically move any data around So we can insert Anita, in that case, instantly. Constant time So we have constant-time lookup, and constant-time insertion of someone like Anita But kind of oversimplifying the world What if we later want to find Alice? What if we later want to find Alice? How many steps is that going to take? [Student answer, unintelligible] Exactly. The number of people before Alice in the linked list So it’s not quite perfect, because our data structure, again, has this vertical access and then it has these linked lists hanging–actually, let’s not draw it an an array It has these linked lists hanging off of it that looks a little something like this But the problem is if Alice and Adam and all these other A names end up more and more over there, finding someone could end up taking a bunch of steps, bcause you have to traverse the linked list, which is a linear operation So really, then, the insertion time ultimately is O(n), where n is the number of elements in the list Divided by, let’s arbitrarily call it m, where m is the number of linked lists that we have in this vertical axis In other words, if we truly assume a uniform distribution of names, totally unrealistic. There’s obviously more of some letters than others But if we assume for the moment a uniform distribution, and we have n total people, and m total chains available to us, then the length of each of these chains fairly simply is going to be the total, n divided by the number of chains So n/m. But here’s where we can be all mathematically clever m is a constant, because there’s a fixed number of these You’re going to declare your array at the beginning, and we’re not resizing the vertical axis. By definition, that stays fixed It’s only the horizontal axis, so to speak, that’s changing
So technically, this is a constant. So now, the insertion time is pretty much O(n) So that doesn’t feel all that much better But what’s the truth here? Well, all this time, for weeks, we’ve been saying O(n²). O(n), 2 x n², – n, divided by 2. . . ech It’s just n². But now, in this part of the semester, we can start talking about the real world again And n/m is absolutely faster than just n alone If you have a thousand names, and you break them up into multiple buckets so that you have only ten names in each of these chains, absolutely searching ten things is going to be faster than a thousand things And so one of the upcoming problem sets is going to challenge you to think about exactly that even though, yeah, asymptotically and mathematically, this is still just linear, which sucks in general when trying to find things In reality, it’s going to be faster than that because of this divisor And so there’s again going to be this trade-off and this conflict between theory and reality, and one of the knobs will start turning at this point in the semester is more of the reality one as we sort of prepare for semster’s end, as we introduce the world of web programming, where really, performance is going to count because your users are going to start to feel and appreciate poor design decisions So how do you go about implementing a linked–a hash table with 31 elements? And the previous example was arbitrarily about birthdays If someone has a birthday of January 1 or February 1, we’ll put them in this bucket If it’s January 2, February 2, March 2, we’ll put them in this bucket That’s why it was 31. How do you declare a hash table? It can be pretty simple, node* table is my arbitrary name for it,  This gives me 31 pointers to nodes, and that allows me to have 31 pointers to linked lists even if those chains are initially NULL What do I want to put if I want to store “Alice,” “Bob,” “Charlie”? Well, we need to wrap those things in a structure because we need Alice to point to Bob, to point to Charlie, and so forth We can’t just have the names alone, so I could create a new structure called node here What is an actual node? What is a node in this new linked list? The first thing, called word, is for the person’s name LENGTH, presumably, relates to the maximum length of a human’s name, whatever that is, 20, 30, 40 characters in crazy corner cases, and +1 is for what? It’s just the extra NULL character, \0 So this node is wrapping “something” inside of itself, but it also declares a pointer called next so that we can chain Alice to Bob to Charlie and so forth Can be NULL but doesn’t necessarily have to be Any questions on these hash tables? Yeah? [Student asking question, unintelligible] An array–good question Why is this char word in an array rather than just char*? In this somewhat arbitrary example, I did not want to have to resort to malloc for each of the original names I wanted to declare a maximum amount of memory for the string so that I could copy into the structure A-l-i-c-e\0 and not have to deal with malloc and free and the like But I could do that if I wanted to be more conscious of space use. Good question So let’s try to generalize away from this and focus the remainder of today on data structures more generally and other problems that we can solve using the same fundamentals even though the data structures themselves might differ in their particulars So it turns out in computer science, trees are very common And you can think of a tree sort of like a family tree, where there’s some roots, some matriarch or patriarch, grandma or grandpa or earlier back, beneath which are mom and dad or various siblings or the like So a tree structure has nodes and it has children, usually 0 or more children for each node And some of the jargon that you see in this picture here is any of the little kids or grandkids on the edges who have no arrows emanating from them, those are the so-called leaves, and anyone on the inside is an inner node; you can call it anything along those lines But this structure is pretty common. This one’s a little arbitrary We have one child on the left, we have three children on the right, two children on the bottom left So we can have different-sized trees, but if we start to standardize things, and you might recall this from Patrick’s video on binary search from a previous short online, binary search doesn’t have to be implemented with an array or pieces of paper on a blackboard Suppose that you wanted to store your numbers in a more sophisticated data structure You could create a tree like this You could have a node declared in C, and that node can have at least two elements inside of it One is the number you want to store, and the other is–well, we need one more The other is its children So here’s another data structure. This time, a node is defined as storing a number n and then two pointers; left child and right child And they’re not arbitrary. What’s interesting about this tree? What’s the pattern in how we’ve laid this out or how Patrick laid it out in his video?
It’s kind of obvious that there’s some sorting going on here, but what’s the simple rule? Yeah? [Student answer, unintelligible] Perfect. If you glance at this, you see the small numbers on the left, big numbers on the left, but that’s true for every node For each node, its left child less than it, and its right child greater than it What this means now is if I want to search this data structure for, say, the number 44, I have to start at the root, because as with all of these more complex data structures now, we only have a pointer to one thing, the beginning And in this case, the beginning is the root. It’s not the left end, it’s the root of this structure. So I see here’s 55, and I’m looking for 44 Which direction do I want to go? Well, I want to go to the left, because obviously, to the right is going to be too big So notice here, you’re sort of conceptually chopping the tree in half because you’re never going down to the right-hand side So now I go from the 55 to the 33. It’s too small of a number I’m looking for 44, but now I know if 44 is in this tree, I can go obviously to the right So again, I’m pruning the tree in half It’s pretty much identical conceptually to the phone book It’s identical to what we did with the papers on the blackboard, but it’s a more sophisticated structure that allows us to actually do this divide and conquer by design of the algorithm, and in fact, traversing a structure like this–whoops Traversing a structure like this, where it’s only “go this way or go that way,” means all that code that bent your mind at first when implementing it in section or walking through it at home, for binary search, using recursion or iteration, it’s a pain in the neck. Find the middle element, then do your rounding up or down There’s a beauty to this because we can now use recursion again, but much more cleanly. Indeed, if you’re at the number 55 and you want to find 44, you go left in this case, then what do you do? You run the exact same algorithm You check the value of the node, then you go left or right Then you check the value of the node, go left or right This is perfectly suited to recursion So even though in the past we’ve done some fairly arbitrary examples involving recursion that didn’t need to be recursive, with data stuctures, especially trees, it’s a perfect application of this idea of taking a problem, shrinking it, and then solving the same type of, but smaller, program So there’s another data structure that we can introduce This one is designed at first glance to look cryptic, but this one’s amazing So this is a data structure called a trie, t-r-i-e, which is inherited from the word retrieval, which is not pronounced re-try-val, but that’s what the world calls these things. Tries. T-r-i-e It is a tree structure of some sort, but each of the nodes in a trie appears to be what? And this is a bit misleading because it’s kind of abbreviated But it looks like each node in this trie is actually an array And even though the author of this diagram hasn’t shown it, in this case, this trie is a data structure whose purpose in life is to store words like A-l-i-c-e or B-o-b And the way in which this data stores Alice and Bob and Charlie and Anita and so forth is it uses an array whereby to store Alice in a trie, we start at the root node which looks like an array, and it’s been written in shorthand notation The author omitted a b c d e f g because there were no names with that They only showed M and P and T, but in this case, let’s move away from Alice and Bob and Charlie to some names that are here Maxwell is actually in this diagram So how did the author store M-a-x-w-e-l-l? He or she started at the root node, and went to [M], so roughly 13, the 13th location in the array Then from there, there’s a pointer A pointer leading to another array From there the author indexed into that array at location A, as depicted there at top left, and then he or she followed that pointer to another array, and went to the pointer at location X Then in the next array location W, E, L, L, and so forth, and finally, let’s actually try to put a picture to this What does a node look like in code? A node in a trie contains an array of pointers to more nodes But there’s also got to be some kind of boolean value, at least in this implementation I happen to call it is_word. Why? Because when you’re inserting M-a-x-w-e-l-l, you’re not inserting anything into this data structure You’re not writing M. You’re not writing X All you’re doing is following pointers The pointer that represents M, then the pointer that represents A, then the pointer that represents X, then W, E, L, L, but what you need to do at the end is sort of go, check, I reached this location There was a word that ends here in the data structure So what a trie is really filled with and the author chose to represent these terminuses with little triangles This just means that the fact this triangle’s here, this boolean value of true means if you go backwards in the tree, that means a word named Maxwell is in this
But the word foo, for instance, is not in the tree, because if I start at the root node up here at top, There’s no f pointer, no o pointer, no o pointer. Foo is not a name in this dictionary But by contrast, turing, t-u-r-i-n-g. Again, I didn’t store t or u or r or i or n or g But I did store in this data structure a value of true way down here in this node–in the tree by setting this boolean value of is_word to true So a trie is kind of this very interesting meta structure, where you’re not really storing the words themselves for this kind of dictionary To be clear, you’re just storing yes or no, there is a word that ends here Now what’s the implication? If you have 150,000 words in a dictionary that you’re trying to store in memory using something like a linked list, you are going to have 150,000 nodes in your linked list And finding one of those words alphabetically could take O(n) time Linear time. But in the case here of a trie, what’s the running time of finding a word? It turns out the beauty here is that even if you have 149,999 words already in this dictionary, as implemented with this data structure, how much time does it take to find or insert one more person into that, like Alice, A-l-i-c-e? Well, it’s only 5, maybe 6 steps for the trailing character Because the presense of other names in the structure does not get in the way of inserting Alice Moreover, finding Alice once there are 150,000 words in this dictionary does not get in your way of finding Alice at all, because Alice is . . . . . here, because I found a boolean value And if there is no boolean true, then Alice is not in this data structure of words In other words, the running time of finding things and inserting things into this new data structure of trie is O of–it’s not n Because the presense of 150,000 people has no effect on Alice, it seems So let’s call it k, where k is the maximum length of a word in English which is typically no more than 20-something characters So k is a constant. So the Holy Grail we seem to have found now is that of a trie, constant time for inserts, for lookups, for deletions Because the number of things already in the structure, which aren’t even physically there. Again, they’re just sort of checked off, yes or no, has no impact on its future running time But there’s got to be a catch, otherwise we wouldn’t have wasted so much time on all these other data structures just to finally get to the secret one that’s amazing So what price are we paying to achieve this greatness here? Space This thing is massive. And the reason that the author didn’t present it here, notice that all of these things that look like arrays, he didn’t draw the rest of the tree, the rest of the trie, because they’re just not relevant to the story But all of these nodes are super wide, and every node in the tree takes up 26 or actually, could be 27 characters because in this case I was including space for the apostrophe so that we could have apostrophized words In this case, these are wide arrays. So even though they’re not picutured, this takes up a massive amount of RAM Which might be fine, especilly in modern hardware, but that’s the tradeoff. We get less time by spending more space So where is this all going? Well, let’s do–let’s see here Let’s do a jump to this guy here Believe it or not, as much fun as C has been for some time now, we’re reaching the point in the semester where it’s time to transition to things more modern Things on a higher level. And even though for the next couple of weeks we’ll still continue to immerse ourselves in the world of pointers and memory management to get that comfort with which we can then build on, the end game is ultimately to introduce,ironically, not this language We’ll spend, like 10 minutes talking about HTML All HTML is is a markup language, and what a markup language is is these series of open brackets and closed brackets that say ‘make this bold’ ‘make this italics’ ‘make this centered.’ It’s not all that intellectually interesting, but it’s super useful And it’s certainly omnipresent these days But what’s powerful about the world of HTML, and web programming more generally, is building dynamic things; writing code in languages like PHP or Python or Ruby or Java or C# Really, whatever your language of choice is, and generating HTML dynamically Generating something called CSS dynamically Cascading style sheets, which is also about aesthetics And so even though, today, if I go to some website like the familiar Google.com, and I go to view, developer, view source, which maybe you’ve done before, but going to view source, this stuff probably looks pretty cryptic But this is the underlying code that implements Google.com On the front end. And actually all this is fluffy aesthetics stuff This is CSS up here. If I keep scrolling down we’ll get some color-coded stuff