Stanford University. Welcome to EE380, fall 2011-2012. I'm Andy Freeman; the other organizer of the class is Dennis Allison. We're approaching the end of the quarter, so for those of you taking it for credit, it's a good time to catch up. A number of trends have given us the ability to do computation on a massive scale for not a lot of money, or on a truly massive scale for a lot more money. Several years ago Jeff Dean of Google wrote a couple of papers describing some Google infrastructure, including the Google File System and MapReduce. Google, to my knowledge, has never released any of the code related to this stuff. Other people have, starting with Doug Cutting, first independently, then as part of Yahoo, and now Cloudera, under the name Hadoop. And Hadoop — well, he can tell us what Hadoop is; I'll let him tell the Hadoop story, so the rest of us can concentrate on our problems instead of our tools. As with Linux, the other great open-source success, various companies have sprung up to push this along. Amr Awadallah of Cloudera is here to tell us about Hadoop, the various tools in this infrastructure, how it all works together, and how the ecosystem is going to continue to make it possible for the rest of us to do big data.

Thank you for the nice introduction. Let me quickly introduce myself. I'm the CTO and one of the founders at Cloudera, and before Cloudera I was at Yahoo for eight and a half years. They had acquired my first company, which I had started immediately after finishing — actually I hadn't finished my degree yet at the time — at Stanford, so I am a Stanford alum. While at Yahoo is where I got to witness the birth of this technology, and I was fortunate to see that; it's very rare to see a technology causing so much disruption. It's something that happens maybe once a decade — VMware is one example, Red Hat Linux is another example, and so on. So I saw the birth of this technology. My co-founder at Cloudera, Jeff Hammerbacher, was at Facebook, and he saw the same thing there; he saw how this system was being born. We went to investors and told them, essentially, we came from the future, we have the DeLorean parked outside, and we're telling you this is going to disrupt the whole data processing space. And they believed us, luckily, and now it's happening. I'll tell you that story today: I'll tell you why this technology is so foundational, what business problems it solves, and I'll cover the technology itself, what it does. You will feel that I'm very passionate about this and believe in how big it is going to be, so my opinions are my opinions — they're a bit biased — but I believe them to be very, very true.

So with that, I want to jump into: what's the problem? Why did we need something different? A good technology only works if there is a good problem it is trying to solve, and this was the problem that I experienced at Yahoo and my co-founder Jeff Hammerbacher experienced at Facebook. This is what the traditional data analytics and business intelligence stack looks like in any company in the world. You have an instrumentation layer that instruments the points where data is created: your web servers, your cash registers, your mobile devices, your network equipment, your Java application servers, your syslogs from UNIX machines, and so on. You collect all of that information and then you dump it into storage on a grid.
You don't just put it in one big RAID; there are companies that build these massive, massive filers, and we had a massive farm of filers at Yahoo with petabytes and petabytes of data stored in there. But these grids only store data, they don't process data, so you have to have a compute grid on top. There's an industry term called ETL, which is short for extract, transform, load; it essentially means taking the unstructured data, munging it, and preparing it to be structured so it can be loaded into a database. So we had this big grid of machines that essentially did this task: we take the unstructured data, convert it, and then load it into a relational database management system, like Oracle for example. And then there are lots of tools you can run on top of that to get your business intelligence, to power interactive applications, and to do your data analytics and so on. So the three problems that we got hit with are the three red stars that you see there on the chart.

The first one was a forcing function — there is no escape from it. The amount of data we were accumulating every day was reaching the point where we couldn't economically finish processing the data from the previous day before the new day started. Once you hit that point there's nothing you can do, you're hosed, unless you can find a new solution. So we had to solve that problem, and it came from the fact that moving all of this data from the storage grid to the compute grid put massive pressure on the underlying infrastructure, both on the filer heads of the filers and on the network infrastructure in the middle. So we looked at the volumes of data, and at how small the code doing the processing is compared to the data, and it became very clear: we need the code to move to where the data lives, not the other way around. That was the first thing that hit us.

The other two important things were, first, the archiving issue, which is on the right there. Archiving is necessary because as data grows older it consumes more disk space, and if the economics of the disk space don't allow you to keep the data alive, you will archive it: you send it to tape, or now Blu-ray discs, and so on. But once it goes to tape — from my experience at Yahoo and from others who worked at large corporations with backup policies — you never see the data again; it's as good as dead. The cost of archiving solutions is very economical for storage but extremely expensive for retrieval: getting something back from archival storage is extremely expensive. So we wanted to keep the data alive for longer, and that meant we had to find a solution where the economics of storing the bytes justified doing that. Actually, there is a term we use called return on bytes, as opposed to return on investment, and the return on a byte is how much value I'm able to extract out of that byte as a function of the cost of storing that byte. So essentially we had to bring the cost down to a point where we could justify keeping the data.

Last but not least, we wanted the ability to go back and explore the original, highest-fidelity data, because when you go through this ETL process you're actually aggregating your data, you are losing fidelity — you're either aggregating or you're normalizing to conformed dimensions, for those of you with business intelligence experience — and that process means you're going to lose data. Every now and then you'll get a question from the business, or a question that you, as somebody trying to help the business, want to ask, and you can't ask it up here because the data for it is not up here. In many ways schema modeling and business intelligence schema design is based on the premise: what is the list of questions you want to ask? Then I take these questions and build a schema that allows me to answer them. If you come up with a new question that wasn't one of those questions, you're not going to be able to ask it, because the data is not there in the schema; it's not supported by the model. We wanted to break out of that mold, because to go back and amend our schema to add a new place for that, amend our ETL logic to process it, and get the data out of our storage was an extremely expensive proposition. It took weeks if not months, and so it was very hard to justify doing.
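One way to write down the "return on bytes" idea just described — this is not a formula from the slides, just a compact restatement of the definition above:

```latex
% "Return on bytes" as described in the talk (informal restatement):
\text{Return on bytes} \;=\; \frac{\text{value extracted from a byte}}{\text{cost of storing that byte}}
```

The point of the Hadoop economics argument is to shrink the denominator enough that keeping the raw data alive is justified.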
These are the three driving problems, and I'll go over them from the opposite side as well, as benefits — what are the benefits of overcoming these problems? I hope it's very clear: scalability of compute over the data; scalability and economics of keeping the data alive for much longer; and flexibility and agility in going back and asking questions of the raw, unstructured data. These are the forcing functions.

So here comes Hadoop. What is Hadoop? Hadoop is an elephant. For those who don't know, the creator of Hadoop, Doug Cutting — his son, a three-year-old at the time, had a nice plush elephant toy, a small yellow elephant, and his son one day, out of the blue, called it "Hadoop." He just made up that name, and Doug Cutting decided to take it as the name of this framework. The son is actually eleven years old now and he's very proud of himself because of this accomplishment. So Hadoop, in a nutshell, is a scalable, fault-tolerant, distributed system for data storage and processing, and it's licensed under the Apache license, which I can talk about briefly later in the presentation; it's one of the friendliest licenses out there for open-source consumption.

I like to describe Hadoop in two ways. One way is: what's an operating system? When I ask about Linux or Windows, the heart of these systems is two things: the ability to store files and the ability to run applications on top of those files. That's the core, and then there's the windowing environment, the GUI, the device drivers, security credentials, access lists, libraries and so on — things around that core of storing files and running stuff on top of those files. And that's what Hadoop is, with the difference that Hadoop does it across many, many, many machines.

It's a data-center-scale operating system, as opposed to a single-node operating system; in fact it's an abstraction above that — Hadoop leverages Windows or Linux underneath to give you that layer. The other way I like to describe Hadoop is that it's like the opposite of a virtual machine. If you think of VMware, or any other virtualization technology, it's about taking a physical server and chopping it up into many small virtual servers. Hadoop is the other way around: it's about taking many, many physical servers and merging them all together to look like one big, massive virtual server. So that's another way to think about Hadoop. The two underlying systems, which were inspired by Google — and thanks, Google, for publishing the papers; they didn't publish the source code for their internal systems, but they did publish the papers that had all the concepts behind them — are the Hadoop Distributed File System, which is about creating this scalable, distributed, self-healing storage layer, and MapReduce, which has a dual nature: MapReduce is both a scheduling system for scheduling resources on top of a big cluster of machines and a programming model that makes it easy for developers to think in a parallel way. It's important to differentiate between those two natures of MapReduce, which I'll highlight later in the presentation. So that's what Hadoop is in a nutshell: an operating system purposely built for data processing that can be installed on many, many nodes and make them look like one big mainframe.

This is the key slide, because this is the key thing that differentiates Hadoop from previous technologies, and I would like you to pay attention to this one. It's a very key concept; it's really what makes Hadoop stand out, and it is the reason it's being adopted by companies across the world. I'm going to test you if I meet you in the hallway afterwards: I'll ask you what the money slide was, and you have to remember this slide. Relational databases employ what's called a schema-on-write model. With a relational database you have to create your schema first — you create the schema in your database — and then you go through a load operation which takes your original data and converts it into that schema, and then that gets serialized into an internal format proprietary to that database. That's the schema-on-write model. It has benefits: because you pre-created the schema you can do optimizations — you can do indexes, you can do compression, you can do special data structures, you can do partitioning — all these nice things which allow your reads to be very fast for certain operations like joins, multi-table joins, and so on. It also means that by having a common schema across your organization you have standards: different groups in a big enterprise can talk to each other and refer to a column by name and they know what they're talking about, so it brings that kind of sanity to the data jungle you might have. These are the benefits. Now the problem is that, because of this explicit load operation and because of the explicit schema, new data cannot flow in until you've prepared for it — until you've created the column for it and you've created the ETL logic for it. And that is the problem: it limits your agility, it limits your ability to be flexible and to grow at the speed that your data is evolving.
At Yahoo — and I would like to say I had a very agile team that knew how to do things very quickly — we still couldn't add a new column to our data warehouses any faster than a month, four weeks, before we could get something in. And whoever was asking for that column — the engineer, the business unit, or the analyst — had to make a very strong case for why we should even bother to do it in the first place. It's not just technology; governance is a big part of it: you need to get agreement across the organization before that new thing can go in, and there are lots of processes you have to go through before it becomes part of the pipeline that materializes your data inside your schema. [Audience: So it's not about machine speed, it's people speed?] Yes — you need to get agreement, yes. [Audience: And what you're saying with schema on read is: just write the stuff and then interpret it when you get there?] Yes, exactly. [Audience: It's like accepting the fact that there are differences.] Yes, yes, and that's the point: the two systems are not built to replace each other but to augment each other. It's a very important point, which I'll cover further: you want both environments. The problem is we only had this one and we didn't have that one, and now we have it.

So schema on read is about late binding: let's not bind our schema until the latest stage in the process. With Hadoop you load your data as is — you don't really load it, you copy the data, you essentially drag and drop the files into your Hadoop cluster — and then at read time, when you're actually querying your data, that's when you apply your own lens (we call it a SerDe in the Hadoop world), your own lens for how you want to parse that unstructured data and expose the schema that you want. That gives you huge flexibility, it gives you huge agility. It also means new data can start flowing in, and you can come in nine months later, augment your lens, your parser, and suddenly you have access to the full history of that new piece of data without having to go through reloading and so on. So lots of flexibility. It also makes the loads very fast, because there is no load — you're not doing anything at load time, you're just copying files into the system.
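To make that "lens at read time" idea concrete, here is a minimal toy sketch in Java — this is not the actual Hive SerDe interface, and the log format and field names are assumptions — showing the schema being imposed only when a raw line is read:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SchemaOnReadLens {
    // Hypothetical raw line format: "<timestamp> <userId> <url>".
    private static final Pattern LENS = Pattern.compile("^(\\S+)\\s+(\\S+)\\s+(\\S+)$");

    public static void main(String[] args) {
        // The raw line is what sits in the cluster, untouched by any load step.
        String raw = "2011-11-16T10:15:00Z user42 /index.html";
        Matcher m = LENS.matcher(raw);
        if (m.matches()) {
            // The "columns" exist only at read time; change the regex later and the
            // same raw bytes expose a different schema, with no reload required.
            String timestamp = m.group(1);
            String userId = m.group(2);
            String url = m.group(3);
            System.out.println(userId + " viewed " + url + " at " + timestamp);
        }
    }
}
```

The design point is that the stored bytes never change; only the parsing code (the "lens") evolves as new questions come up.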

However, on the read side, obviously, if you're doing very interactive, low-latency stuff, you don't have the same kind of indexing structures that would allow you to do sub-second joins and things like that, and I'll talk about that in more detail later on. But the systems really augment each other; they both have their pros and cons, and it's not about replacing one with the other, it's about using the right one at the right time. I'll elaborate on that later in the presentation and give you some rules of thumb for when you want to use the database for sure, when you want to use Hadoop for sure, and the overlap in the middle where you can go either way.

So again, innovation is one of the key things behind this. Here is the data committee — we have to go get approvals from everybody before we can do anything — and here is: you take your data scientists and just throw them into the middle of your raw field of data, and they go and discover something new. To drive the point home I use this layman's example. Imagine you're a cook, and you came to my kitchen. There are two ways you can cook the meal. One way is that I pre-cut for you the carrots, the tomatoes, the meat, the chicken, the fish, and give them to you in nice containers — the schema. I did the ETL work; this is the schema I'm giving you. As a cook you'll be able to cook very quickly: you take the meat, put it in, and you get a meal done very, very quickly. However, you're limited, in terms of the shape and the taste of the meal, to the stuff that I pre-decided for you in the schema I gave you. That's the database world. The Hadoop world is: you come to my kitchen and I give you the produce — the bag of tomatoes, the bag of potatoes as they came in, the livestock, the cow, all of it, the chicken, all of it — and now you start to cook. Obviously it's going to take you much longer, because you have to clean up and cut all the meat, and yes, there's going to be lots of — sorry for the word — crap that you'll have to deal with, because you have the raw, original, highest-fidelity data — food, in this case — as input. But that allows you to go and create a new meal that not only tastes new but looks new, because you control all the parameters. And this is really what this is about; this is why this movement is so foundational: it's about going back and accessing your original, raw, highest-fidelity data, with the flexibility of using any language.

That's the other thing: SQL is really good, we all know SQL and we can ask questions with SQL, but SQL is not a Turing-complete language. You can't do everything in SQL — you can't do every kind of complex data mining you can think of in SQL, you cannot do image processing in SQL. We want to go beyond that, and this is the other very fundamental thing: we want to go beyond it without having to learn very complex parallel-programming frameworks, and that's what the MapReduce framework gives you, which I'll highlight later on. So these are the core ways you can actually run algorithms inside Hadoop. To start with, the assembly language of Hadoop is what we call Java MapReduce. I say "assembly language" because if you go in with that API it gives you the most performance and the most flexibility; you can do anything you would like to do if you go in with Java MapReduce. However, you can also shoot yourself in the foot if you don't know what you're doing, just like assembly language.
And it can take you much longer to write the simplest loop if you go in with Java MapReduce, so you only want to go to Java MapReduce when you know what you're doing and you do care about the special things you can't get from the higher-abstraction frameworks. Streaming MapReduce is an extension that still depends on the MapReduce model — so you still need to know how to think in that model, and I'll cover the MapReduce model briefly later on — however it extends the universe of languages: you don't have to be in Java, you can use Python, Perl, C, C++, Ruby, whatever language you're comfortable with. That makes it more appealing for folks; you take a little bit of a hit in performance and you won't have as much flexibility, but at least you're working with a language you're more comfortable with. Crunch is a very interesting library, open source, built on top of another system that Google has published called FlumeJava, which makes it easier to create data pipelines in Java; it encapsulates many of the common operations like joining files, sorting files, and so on, so you don't have to reinvent the wheel every time — it's a library that makes this much, much easier. Pig Latin is a higher abstraction that came out of Yahoo; it makes it much, much easier to build data pipelines on top of Hadoop, it doesn't look like MapReduce at all — you don't have to think about those concepts if they are hard for you — and it takes the Pig Latin script and converts it into MapReduce for you. Hive essentially is SQL; Facebook created that system and released it into open source. Hive takes SQL — it's a variant of SQL, not fully ANSI SQL-92 compliant, but it's mostly there — and converts it into MapReduce for you, and it has extensions so you can plug in your own MapReduce functions, user-defined functions, user-defined aggregates, and so on. Last but not least, Oozie is a very high-level abstraction that can link all of these together: it can create a workflow of jobs in Hive, Pig, and MapReduce, and it has a dependency graph, so it will only start a job once these two other jobs finish, because they are needed as its input, and if a job fails it can retry it, and so on. So that's the spectrum, and you can see it's getting higher and higher level. Now, with Hive, because it is SQL, we have ODBC and JDBC drivers, so we can hook it into standard BI tools like Microsoft Excel.
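As a rough illustration of that last point, a client can talk to Hive the same way a BI tool would, over JDBC. This is a minimal sketch; the driver class name, connection URL, port, and the `docs` table are assumptions about the Hive/HiveServer setup of that era, so treat them as placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Assumed HiveServer JDBC driver and URL for Hive of that era.
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(
                "jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = conn.createStatement();

        // Hive compiles this SQL into MapReduce jobs behind the scenes.
        ResultSet rs = stmt.executeQuery(
                "SELECT word, COUNT(1) AS freq FROM docs GROUP BY word");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }

        rs.close();
        stmt.close();
        conn.close();
    }
}
```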

So we're getting higher and higher with the ability of accessing Hadoop. This next point is very, very important, and it's fundamental to the MapReduce model itself, but also to the way the Hadoop system — and the MapReduce and GFS systems that materialized in Hadoop — have been built. Two concepts here. The first is that the system itself is scalable: you can start with a few nodes, and when you add more nodes the system just automatically grows. It re-partitions the data, it distributes the data, and the jobs coming in start taking advantage of the new nodes, and so on, and that's all done transparently for you. The other key thing is that, because the MapReduce model is a prescriptive model — there's a design pattern you have to work with — it guides you, the developer, into writing jobs that are parallelized such that the same job you wrote for a few nodes will work on thousands and thousands of nodes. You don't have to go re-architect or redesign the program you developed, and that's another very key point: scalability of people. Developers don't have to think about how to make their job work not just multi-threaded but multi-machine; that's taken care of for you. So again, a very, very important concept.

There's a very nice position paper out of Google called "The Unreasonable Effectiveness of Data," or, as we like to say, data beats algorithm. The premise is that if you take the simplest algorithms and give them lots of data, you will beat the hardest, most sophisticated algorithms. One of the examples used was natural language translation. Natural language translation between different human languages has existed for many, many years; it's a very hard problem, with lots of very sophisticated, very complex algorithms trying to attack it. Google came in, and by just looking at n-grams and the correlations between them across the different corpora of text they have in their whole web index, they were able to build a translator that rivaled the most sophisticated algorithms out there. That's an example of how, if you have lots and lots of data to throw at your problem, you can beat the best algorithms. You can think about the implications of that in finance, in biotech, and so on.

And this one is just a reiteration of the point I made earlier: if your data doesn't have an economical system to live in, your data will die, and I don't want to see data dying anymore. That's one of the things Hadoop allows you to do: it brings down the economics of storage by orders of magnitude.

So now, before I jump into the internals of Hadoop, I want to very quickly show you which cases Hadoop is the right tool for versus which cases the relational database is the right tool for. The analogy I use here is a sports car — not necessarily because sports cars are expensive, though that is part of it — versus a freight train. A sports car can get from 0 to 60 miles per hour faster than any freight train you can think of, just because of the enormous momentum the train has to build up.
However, once a train gets to full speed, that full speed can actually be higher than a sports car's — bullet trains, as you know, go faster than cars — but more importantly, the throughput, the number of people, the amount of data you can put through that pipe, is way bigger than what you can do with the car. Now, you could still go buy a hundred sports cars and link them together and get the same throughput, but obviously the economics start to fail at that point. So that's the high-level analogy, and it comes back to this chart. Interactive OLAP — online analytical processing: if you want to do joins on a star schema or a snowflake schema, you can finish that in milliseconds inside an Oracle system or a Teradata system; that same query would take seconds in Hadoop, if not minutes. However, queries that take hours or days in relational database systems, for the same economics, will take seconds in Hadoop. So it's kind of the inverse: on latency these guys win, on throughput Hadoop wins. Hadoop can do interactive work, but not at the same speed these systems can, because of all the optimizations they are able to do — they paid the cost of parsing the data at load time, as opposed to Hadoop, which pays it at read time. Obviously, for the big queries that cost is amortized, and hence it's a win for the bigger queries. Multi-step ACID transactions: Hadoop can't even touch that. If you're doing banking transactions, moving money between accounts, where you do begin-transaction statements, selects with a cursor, inserts, updates, deletes, end transaction — Hadoop doesn't even have that, so keep using the relational systems for that. SQL compliance: these systems have been around for decades, and hence the maturity of their SQL and their ability to work with other systems that depend on that maturity is way ahead of where Hadoop is right now, though we are moving very quickly along that dimension. Now, what Hadoop has, obviously, is flexibility with unstructured data. It's very, very hard to work with unstructured data in a traditional relational database. You can, by just creating a column of type BLOB and loading your unstructured data — your text file or XML file or Apache log file — into that BLOB.

But that's the wrong way of using a database: it's not going to be efficient, it's not the right way to spend money to solve that problem. We have this very nice proverb from Egypt, where I'm originally from, about a wise man called Goha. Goha said there are two ways you can touch your ear — it illustrates the concept of using the right way to solve your problem. The first way is this, which is the right way, and the other way is this, reaching around: you're still achieving the same result, you are touching your ear, but you're doing it in a completely inefficient way. That's the point here. Same thing with complex data processing, the ability to go beyond SQL. You might have heard about this trend called NoSQL; we don't like the term "no SQL," because Hadoop does have SQL — rather, we say it's "not only SQL," so the "no" is "not only." And that's true: we have SQL with Hive, but you can also do Java, you can do C, you can do Python, and there's Mahout, which I'll talk about later, for data mining and so on. It opens up the possibilities for complex data processing. Last but not least — and I say this with confidence — except for Google, nothing scales to the level that Hadoop scales today, whether commercial or open source, other than Google's internal infrastructure. And that exists in deployments ranging from Yahoo, with more than 40,000 servers running in Hadoop clusters, to Facebook, with a single namespace, a single file system, holding 70 petabytes. Nothing else is proven to run that big.

With that, I'm going to start jumping into the internals of Hadoop itself. HDFS, the Hadoop Distributed File System — and again I'm giving a very quick overview, this is a very introductory talk; if you want to hear more, maybe you should invite the Google folks over and they can tell you about this in way more depth than I can. Essentially, the idea of this file system, in very simple terms, is: a file comes in, you chop that file up into blocks — the default is 64 megabytes — and then you take these blocks and spread them out across your infrastructure, across the data nodes that store the data, and you replicate each block a number of times, the default being three. The system is built and optimized for these kinds of operations; it's not a fully POSIX-compliant file system. It's optimized for throughput, not for latency. You can put a file, you can get a file, you can open a file and scan it, and you can delete a file, but you can't insert stuff in the middle of a file, you cannot update stuff in the middle of a file — to do that you have to create another file. That said, you can do appends, so you can append stuff at the end of files. So you can see that by simplifying the problem with these assumptions, that's why we're able to achieve the scalability that we achieve.

Now, block replication: some people look at the replication and say it's very wasteful — why create three copies when we could use RAID-like erasure coding techniques and be way more efficient for durability? That is true: for durability you can achieve it at a much lower cost than 3x. However, the availability and throughput you cannot. By having that block replicated three times on three different machines, you can now have three different jobs, leveraging the cores inside those machines, reading that data at the same time at full throughput from the local disks.
So the performance of running multi-tenant, multi-job workloads on the same cluster benefits a lot from that replication. In fact, when you have a hot file — a file that came in that you know a lot of people are going to want — you can set its replication much higher, and you can do that selectively on a per-file basis. I can say this file is hot, let's make 10 replicas of every block in that file, so now I can have multiple users accessing it at full speed. Last but not least, availability: by having these replicas spread out across more than one machine, and across racks as well, we protect against a single machine failure and against rack failure too — a whole rack can go down with all the servers on it and you still have a copy of your data somewhere else. The placement algorithm for the blocks takes that into account, so it minimizes the chances of your data not being available. That said, we do now have, thanks to Facebook, an erasure-coded, RAID-like storage option for Hadoop that has an overhead of 1.2x as opposed to 3x. It will give you the durability, very good durability, but not the same availability and throughput characteristics, so it is useful for your older data: your newer data you should keep at higher replication, and your older data you can move into a scheme that is more about durability than about availability or throughput. And that is HDFS. Any very high-level questions on HDFS? [Audience question, roughly: does the erasure coding affect the compression you're able to do? I assume you're compressing each of these blocks and getting a lot of compression at that layer.] So the compression is not at that layer; it's just like compressing a file with zip or gzip on your file system. The native file system does not give you compression out of the box. Now, we do have in CDH, which we'll talk about later on, the Snappy compressor from Google available, but that is still done at a higher layer than the file system. OK, any other HDFS questions?
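A minimal sketch of those per-file knobs using the standard `org.apache.hadoop.fs.FileSystem` API — the paths and the replication factor of 10 are made-up values for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HotFileExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml, etc.
        FileSystem fs = FileSystem.get(conf);

        Path local = new Path("/tmp/clicks.log");       // hypothetical local file
        Path inHdfs = new Path("/data/raw/clicks.log"); // hypothetical HDFS path

        // "Loading" is just copying bytes in; no schema, no transformation.
        fs.copyFromLocalFile(local, inHdfs);

        // Selectively bump replication for this one hot file (the default is 3).
        fs.setReplication(inHdfs, (short) 10);

        System.out.println("Stored " + inHdfs + " with replication 10");
        fs.close();
    }
}
```

Note that `setReplication` affects only this file; other files keep the cluster-wide default.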

OK, key points: very scalable, very distributed, highly available, and I'll go over the architecture of the subsystems in a future slide. [Audience question about block placement.] Yes, the block placement — very good point, I'll talk about that shortly.

OK, so this is MapReduce. As I said earlier in the talk, MapReduce has a dual nature. There is MapReduce the framework, for making us as developers think in a parallel way, with a prescriptive development framework imposed on us so we can write jobs that scale without having to rewrite them. But there is also MapReduce the execution engine, the resource manager that manages lots of resources and executes jobs on them. I'm going to cover both. First, the framework — and again, you should invite the folks from Google; they can tell you way more about this than I can — but at a very high level, you as a developer only write two little functions: a map function and a reduce function, and that's all you write. The system takes care of everything else: the distribution, the fault tolerance, the aggregation, the shuffle, the sorting — all of these are taken care of for you. A very simple example: if you have a bunch of documents and you want to count the frequency of words across the documents, then each one of your mappers is given a part of the data automatically — the system figures out which part depending on the files you're going over — and your simple map function just counts each of the words it sees and how many times they showed up, and spits out the word and the number of times you saw it. Then there is a consistent hashing algorithm that partitions on that key and makes sure that the words with the same key go to the same reducer function on the other side, so you can start your aggregation phase. And in your aggregation phase you have a very simple reducer function: as long as I'm seeing the same word, I just add up the numbers I'm seeing, and then I emit the count. That's it; that's the MapReduce model, at a very high-level, simple explanation. What it gives you is that you as a developer don't have to worry about the scalability; you solve this very simple problem, you write this very simple function here and there, and you're done. Google would brag about this all the time, and we say it too: we have interns come in, and within a week they're writing algorithms running on thousands and thousands of machines — which is very true.
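Here is roughly what that word-count example looks like in the Java MapReduce API, the "assembly language" mentioned earlier. This is a sketch following the classic Hadoop word-count pattern; the input and output paths are placeholders:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map: for each word in the input line, emit (word, 1).
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: all counts for the same word arrive together; sum them up.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/data/raw/docs"));       // placeholder
        FileOutputFormat.setOutputPath(job, new Path("/data/out/wordcount")); // placeholder
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The same two functions run unchanged whether the cluster has four nodes or four thousand; only the number of input splits and task slots changes.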
Now, this is MapReduce the resource manager and scheduler, the other half of MapReduce — and in the new version of Hadoop these two are actually being split out from each other, the MapReduce framework from the MapReduce resource manager. What the resource manager does is: it sees the job coming in, and we have the data spread out, as you can see — the blue boxes are the data, the red is the job running. The job is broken down into tasks, each task is given the piece of data it is going to process, and the scheduler tries to schedule each task onto the node holding the sub-piece of data it needs. There are actually three levels of locality that the scheduler shoots for. The first level is it tries to get you onto the same server that has the data you want, so the task reads the data from local disk. If it can't find that, because some other user is running on that server — you can actually say "I want to wait until it frees up," and there are optimizations for these kinds of things — then the next level is: OK, find me another server on the same rack, because typically two servers on the same rack are connected to the same top-of-rack switch, which has enough bisection bandwidth for them to talk at full speed, so you're not hit by the network slowing you down. And if all of that fails, it will just find any free slot across your whole cluster, and that's where performance can start to be hit. There's lots of room for research and optimization here: how can we properly place the data and schedule the tasks so we improve overall system throughput and reduce the average latency of jobs? There's research going on at many, many universities on this topic.

MapReduce is optimized, as I said earlier, for batch processing, for the big stuff. It's not optimized for doing something very quickly and giving back an answer in a second — in fact, run a very simple query that fetches one byte and MapReduce will still take a second or more; even if your function fetches nothing and just sleeps, it takes that. It's not optimized for that problem, it's optimized for the big stuff. There is research, and now even companies in practice, starting to make it more interactive. It's also optimized for failure recovery: all of these machines are constantly sending heartbeats back to the manager, announcing that they're alive, and it is continuously monitoring the progress of the tasks. So if you have a hundred tasks for your job, each getting an equal amount of data, and one task is a laggard — it's moving much slower than all the other tasks — then something is wrong with that machine: either the JVM is garbage collecting, or the hard disks are flaky, the memory is flaky, or somebody else is running something really bad on that machine. The scheduler will detect that and will speculatively — we call this speculative execution, a name I don't like because it collides with CPU speculative execution — fire up a backup task doing the same work as that task on some other node. Because I have three copies of the block: I have block three here, I detect this one is running slow, I have another copy of block three over there and free cores over there, so I fire off another task doing the same work.

Whichever one finishes first, the system takes the output from it and kills the other one. So there are lots of things like that to improve the fault tolerance and the failure recovery of the whole system. Before I go to the next slide — actually, how am I doing time-wise? What's the time now? Oh, OK, so I should slow down; I've been going at a great rate, sorry about that. Any questions on this slide before I move on? OK, great.

So I frequently get this question — it's a debate, and I'm biased in my opinion, but I want to illustrate it anyway. People say: network speeds are getting so much faster than disks, we're going to have big switches that can switch terabits and terabits of data, so we don't need to merge compute and storage — of course, all the incumbents still building filers argue this — we can keep compute over here, keep storage over there, have a big fat pipe between them, and we don't need this Hadoop thing. Obviously my job is to say no, that is not true, and this is how I illustrate it. Yes, hard disk speeds are not growing at the rate of network speeds; network speeds are growing way, way faster than hard disk speeds. However, hard disk density, the number of disks you can have per server, and the number of cores you can have per server are going up so quickly, and because of that we want locality: we want more cores able to process data locally from the disks on that same server. One hard disk gives you about 100 megabytes per second, which is about a gigabit per second in network terms — I'm just multiplying by ten to make the math simple. A typical server for Hadoop today has 12 hard disks at 2 terabytes each: that's 24 terabytes of storage, and the cost is not that much, a few thousand dollars for a server like that. That's 1.2 gigabytes per second of disk bandwidth — and that's not maxing out the SATA bus speeds, so there's room to grow beyond that — which is about 12 gigabits per second, so one server is already at the limit of 10G network speeds. Now take that server and build a rack of them, which is 20 servers: that rack is going to be 240 gigabits per second, way above the 40G speeds that are the next generation of network technology coming up. Then if you have an average cluster — we did a survey, and the average cluster size is about 120 nodes, so six racks of servers — that gives you about 1.4 terabits per second, which is roughly the bisection bandwidth you can get from the best switches out there. And if you have a big cluster — cluster sizes are doubling every year; when we did the survey last year the average was about sixty nodes, now it's about a hundred and twenty, so people are putting more and more data into these systems — the large clusters, the 4,000-node kind of clusters, can get you on the order of 48 terabits per second. So the whole point is: with the scalability of this divide-and-conquer parallelism, you just can't beat that with a single pipe. You just can't. And just an example here: 4.8 terabytes is not that much data — you can fit 4.8 terabytes on the hard disks in your laptop — however, if you try to scan 4.8 terabytes using your laptop, it's going to take you about 13 hours. So it's not just about the size of the data, it's about the speed at which you can actually process that data.
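Putting the back-of-the-envelope numbers from that argument in one place (assuming roughly 100 MB/s per disk, 12 disks per server, 20 servers per rack, and the same multiply-by-ten simplification used in the talk):

```latex
\begin{align*}
\text{1 disk} &\approx 100~\text{MB/s} \approx 1~\text{Gb/s} \\
\text{1 server} &= 12 \times 100~\text{MB/s} = 1.2~\text{GB/s} \approx 12~\text{Gb/s} \\
\text{1 rack} &= 20 \times 12~\text{Gb/s} = 240~\text{Gb/s} \\
\text{average cluster (6 racks, 120 nodes)} &\approx 1.4~\text{Tb/s} \\
\text{large cluster (4{,}000 nodes)} &\approx 48~\text{Tb/s} \\[4pt]
\text{Scan 4.8 TB on one disk:} &\quad \frac{4.8~\text{TB}}{100~\text{MB/s}} = 48{,}000~\text{s} \approx 13~\text{h} \\
\text{Scan 4.8 TB on a 4{,}000-node cluster:} &\quad \frac{4.8~\text{TB}}{\approx 6~\text{TB/s}} \approx 1~\text{s}
\end{align*}
```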
If you do this on a cluster of that size, you're going to finish scanning 4.8 terabytes in about a second, so you can see the scalability of the solution. [Audience: Let me challenge that. You're co-locating compute and storage, and those two things have very different technology trends and very different refresh cycles; they both consume power in different but significant ways. It's hard to imagine being efficient from a power perspective in both the compute domain and the storage domain if you're co-locating them — can you really schedule so that the nodes are efficiently utilized for throughput from the disks as well as efficiently utilized for compute?] So that's an argument I hear often, and I have two answers. First, there's a trend moving towards power-efficient CPUs, obviously, with Atom and ARM and so on, and there are designs now being built specifically for this type of workload that try to turn off, from a power-consumption point of view, the parts of the CPU you're not using. It's very early, but we'll get there. [Partly inaudible exchange with the audience about whether such hardware, and faster networking such as multi-fiber links, is actually available yet; the gist: it's not shipping yet, which is why I say it's early; the networks are getting fast, but you may not want to pay for them; economics is a big part of this; and this is my prediction, and I'm admittedly biased.] [Audience: I totally agree with that. And there's a running question of whether you want the locality of computation to the data — which you clearly do today with the hardware most people are buying —

or whether that will flatten out, and then making the software simpler is the better thing to do. But right now nobody — or not very many people — is building out the cheap networks; it's prototype and experimentation, so your customers can't really get it, which makes your solution appropriate.] Yes, my solution is what's selling today. The second answer I give is that today you need the throughput anyway: you want the scalability of the hard disks, and with many, many hard disks the cores come essentially for free — you're getting the extra cores for free in many ways — so you're really paying for the throughput and for the disks. The economics of this system, when you factor in the watts and in fact the price, compared to a standard storage area network solution, is one tenth — an order of magnitude less when you factor in everything. So that's the key today; that might change in the future, but that's how it is today. [Audience: Another thing — don't these clusters run hot, in both dimensions? And does the configuration have to be balanced? It seems the rack has to be balanced, but any one server in the rack doesn't have to be: based on your knowledge of your workload, you could have a bunch of hefty compute servers and just a tiny amount of storage, or a whole bunch of storage.] You can do that balancing; you don't have to be homogeneous within a rack. However, there is imbalance today — I would say there is imbalance today — but even with that imbalance the economics are still way better than the other options. So yes, there is room for lots of research, and there are lots of people working on power management for Hadoop, looking at job profiles and how we can proactively turn parts of the system on and off, and so on. There is lots of room for improvement; however, the economics today still beat the other options. So I'm agreeing with you. [Audience: So does it make sense to have, within a cluster, the concept of a storage-only node or a compute-only node, rather than every node being a heavy node?] Yes, there are workloads which are more towards "I want to store — I have my little disks spinning, but I'm not necessarily running very heavy compute," and there are jobs that are more about compute. An example I love to give is a company called eHarmony — I don't know if you've heard of them; they help people get married, or at least help people meet each other — and they use Hadoop for their matching algorithms. Their data set is small — I gave you the 4.8-terabyte example; they have a data set on the order of a terabyte — and they still needed a solution like this because of the throughput they can get, even though the hard disks were only about one percent full. It's the throughput they wanted. [Audience: So one aspect is the I/O throughput, but there's also the archival storage case, from the earlier discussion of data going dead. If you could just keep it on the same spindles without accessing it — though that's hugely power-hungry, unless the argument is that you turn off the CPUs on these nodes when they're not running, so they can
go dormant.] Yes. [Audience: Do you have examples of the cost metrics — an I/O metric or a throughput metric — for a typical cluster?] I don't have that with me right now; if you're curious about it we can continue offline.

So this is the very high-level architecture of Hadoop, and this is where somebody usually tries to get me to say Hadoop has a single point of failure — which it does — and I want to speak about that and about what we're doing about it, which we are. The data nodes — these are the actual servers, and there are lots of them, hundreds to thousands — each run two subsystems: a DataNode subsystem and a TaskTracker subsystem. The DataNode subsystem, all it does is manage the blocks, which are just files in ext4 on Linux or NTFS on Windows. It just manages these blocks; it doesn't know what's inside them, it doesn't know which file belongs to which blocks — they just have block IDs, and that's it. It reports back to a central node called the NameNode, which maintains the namespace — the file allocation table, if you will — for the whole cluster, and knows the mapping from a file name to its blocks to which servers have each block. The NameNode is responsible for that. So that's the storage side; those are the HDFS subsystems. And then these are the MapReduce subsystems: you have a TaskTracker, which is given a piece of Java and runs that Java encapsulated inside a JVM, and there is a JobTracker that essentially manages the job flow and hands out tasks when there are free slots on the different servers. So that's the very high-level architecture: you have a client that submits jobs or submits data; when it's dealing with data it goes through here, and when it's dealing with jobs it goes through there. And these two guys know how to talk to each other, because the JobTracker needs to get information from the NameNode so it can do a better mapping of the tasks and have them run as close as possible to where the data is.

So the problem is that these two subsystems, today, in the current version of Hadoop, are a single point of failure. These others are not — there are lots of videos on the internet, on YouTube, of somebody running a Hadoop cluster who brings out a hammer and just hammers down one of the nodes in the cluster, and it keeps running; any one of those can go down with no issues. If these two guys go down, from a durability point of view there's still no issue — you're not going to lose any data; the state of the NameNode is replicated. Where there is an availability issue is that when the NameNode goes down, you can't issue file operations until a new NameNode takes over, and that can take anywhere from a minute to fifteen minutes depending on the size of your cluster. Same thing with the JobTracker: its state is maintained, but when the JobTracker goes down, the jobs running in the pipeline are flushed, even if some of them have finished — some of the intermediate jobs still running have to be restarted from the beginning. So there are some inefficiencies because of that. You're not losing any data and you're not getting incorrect results, but you're getting availability issues with this one and inefficiencies with that one.

We honestly didn't think this was a big problem we needed to fix: it's very rare that the NameNode goes down. When you have thousands of nodes it's very likely that some node will go down, but when you have one node — especially one you're taking good care of — the probability of that one node going down is much, much lower. However, our customers kept wanting, over and over again, to check that checkbox; they want to say "we have high availability," so we're doing it for them. And it's not easy — the answer is easily said but much harder to do. The answer is that the NameNode needs to be running on more than one node at the same time. Right now we're building the simplest solution, which is an active primary and a warm standby — and we actually don't call it "secondary," because Hadoop already has something called the secondary NameNode, which is for checkpointing — but essentially there is a primary and there is a backup, and when the primary goes down the backup takes over, and there's a coordinator that makes sure to shoot the primary in the head — pull the power out — so that state inconsistencies don't take place. That is built and being tested right now, and at Hadoop World last week there was a very nice talk about the architecture of that system, if you want to hear more about it in detail.

The JobTracker is also being handled in a very interesting way. As I alluded to earlier, the MapReduce scheduler has two things in it: the scheduler for managing resources, and the application logic. These are now being split from each other: there will be a resource manager purely focused on which machines have empty slots and resources, and then there is the MapReduce application logic that actually executes the MapReduce flow, and that is now going to be on a per-job basis. Each job will have its own application master responsible for executing its logic; if that master goes down, only that job goes down, as opposed to today, where when the JobTracker goes down everything goes down. So that will improve the availability and the efficiency of the system significantly.
The other nice byproduct of doing that split is that we are no longer limited to just MapReduce: we have the resource scheduler over here, and we can run MapReduce, but we can also run MPI, or an implementation of Pregel, the graph processing framework from Google — other computational frameworks leveraging the same hardware, the same infrastructure, the same storage. That was released — actually, I think yesterday was the first cut of that codebase, Apache Hadoop 0.23; it's being tested right now, it's not production yet, and it should hopefully be production by the first half of next year. Any questions about this? The single point of failure: it's going away, it's being tested, we don't want people complaining about it anymore. [Audience question, roughly: what about scalability, in addition to availability?] Yes, thank you for reminding me — scalability is also enhanced significantly by doing this. On that note, we're implementing federation right now: in the existing model there's just one namespace server handling the full namespace; in the federation model there are multiple servers, each handling part of the namespace, and each client can see the full namespace and reach any one of them, so virtually it still looks like one single big namespace, but it's much more scalable. Right now, with the single NameNode model, you can only go up to about 4,000 servers in a single cluster; this will let us go way bigger than that in terms of namespace scalability. Same thing with the JobTracker: in the current model the JobTracker manages the state of all the jobs; in the new one each job has its own master, so scalability there is also significantly enhanced. The bigger issue for our customers today is availability, but scalability is being addressed too; very few customers are in the thousand-node range, but they're getting there.

So I think I have two slides before I conclude. This slide is — sorry, it's kind of a blurb about our company, I apologize for that — but I wanted to give it to you very quickly. We have this thing called CDH, and CDH is short for Cloudera's Distribution Including Apache Hadoop. We say "including" because Hadoop is only the kernel, only part of it. Think about Linux: there's kernel.org — you can go to kernel.org and get the kernel as open source.

But then there's Red Hat, and Red Hat is the distribution: it has the kernel, but it also has libraries, X Windows, the Apache web server — all the things that make it a complete OS. Same thing here; that's what we're doing. We take all of these elements — many of them come from the Apache Software Foundation, some are ours — and put them together into a complete suite, which essentially becomes your data operating system. I'm going to highlight a couple of the important ones. I spoke about Hadoop already, so I won't repeat that, and I spoke about Apache Pig and Apache Hive, which are among the languages that make it easier to work with MapReduce. A very key one, very successful and growing very quickly right now, is Apache HBase. Apache HBase is an implementation of, again, a very nice paper from Google — thank you, Google — called BigTable, and BigTable is about how to build a columnar, key-based, low-latency store on top of HDFS, which is a high-throughput store: how to hide the latencies of HDFS behind a higher-layer abstraction that is very quick for lookups, and that's exactly what HBase does. So it's a very low-latency access system, but it's still based on a key, an index kind of concept: you have a key and you look up the column families and columns for that key. You cannot do joins with very low latency; it doesn't have OLAP indexes, cubes, and things like that. The other key thing it adds is this: I told you before that HDFS is optimized for append, so you cannot insert and delete in place. HBase adds transactional support: it holds changes in memory and then writes the files behind the scenes, so you don't see that, but essentially with HBase, for every key, for every row, you have atomic transactions on that row. You can have multiple clients accessing that row, doing updates, inserts, and deletes, and they will be properly serialized and consistency will be properly maintained for them — but only at the row level; there is research now on how to extend that across multiple rows. So that's HBase, a very key system. Another key system is Apache Flume, and Apache Flume is about getting data in: if you have a Hadoop system but you have no data in it, then what am I going to do with it? OK, there's a question over there. [Audience: When you were talking about HBase transactions — it's atomic at the row level, but only on a single row?] Only on a single row, yes. So if you have multiple rows, the application has to handle that itself.
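A minimal sketch of that row-level atomicity using the HBase client API of that era (`HTable`, `Put`, `Get`); the table name, column family, and row key are made-up values for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRowExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "users");   // hypothetical table

        // All cells written in one Put against one row are applied atomically;
        // concurrent writers to the same row are serialized for you.
        Put put = new Put(Bytes.toBytes("user42"));
        put.add(Bytes.toBytes("profile"), Bytes.toBytes("city"), Bytes.toBytes("Palo Alto"));
        put.add(Bytes.toBytes("profile"), Bytes.toBytes("visits"), Bytes.toBytes("17"));
        table.put(put);

        // Low-latency point lookup by key: fetch that row back.
        Result row = table.get(new Get(Bytes.toBytes("user42")));
        byte[] city = row.getValue(Bytes.toBytes("profile"), Bytes.toBytes("city"));
        System.out.println("city = " + Bytes.toString(city));

        // Anything spanning multiple rows is the application's problem, as noted above.
        table.close();
    }
}
```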
Another key system is Apache Flume. Apache Flume is about data collection: if you have a Hadoop system but you have no data in it, what am I going to do with it? Flume gives you an agent you can deploy across your web servers, your application servers, your network equipment; it's scalable and reliable, and it will trickle all of that data in and materialize it inside Apache Hadoop for you. There's also Oozie, which we talked about before, for scheduling jobs as graphs and executing them in the right order. And then there's metadata. This is actually a question I was waiting for somebody to ask: how can I say we have SQL when I said we don't have schemas? The key point is that we do have schemas, otherwise you can't have SQL; however, we only apply those schemas at the latest point that we can. The metadata store for Hive has the schema, the column names, the column types, but it also has a parser that knows how to parse the raw data. When you execute a query against a Hive table, it looks up the table name, figures out the schema and the parser, finds the files behind that table, and then goes and parses them and fits them into that schema for you. So Hive has a metadata system that does that, and there is a new project at Apache called Apache HCatalog that is trying to extend that concept beyond just Hive, so you can have this kind of metadata and schema store for Hive, for Pig, and for MapReduce itself if you're coding directly in MapReduce, because it's easier not to have to go look up a wiki somewhere for how do I read this file, how do I break down the columns in this file, how do I get the things I care about out of it, and so on.
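To make the schema-on-read idea concrete, here is a minimal Java sketch that declares a Hive table over raw, tab-delimited log files already sitting in HDFS and then queries them; the table, columns, paths, and connection URL are hypothetical, and the JDBC driver class and URL format differ between Hive versions.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class SchemaOnReadSketch {
      public static void main(String[] args) throws Exception {
        // HiveServer1-era JDBC driver; newer Hive uses org.apache.hive.jdbc.HiveDriver
        // and jdbc:hive2:// URLs instead.
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();

        // The schema is declared over files that are already in HDFS; nothing is
        // reloaded or transformed, and structure is applied only when we read.
        stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS weblogs ("
            + " ip STRING, ts STRING, url STRING, status INT)"
            + " ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'"
            + " LOCATION '/data/raw/weblogs'");

        // At query time Hive looks up the schema and parser in the metastore,
        // finds the files behind the table, and parses them into rows.
        ResultSet rs = stmt.executeQuery(
            "SELECT url, COUNT(*) AS hits FROM weblogs GROUP BY url");
        while (rs.next()) {
          System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
        con.close();
      }
    }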

Apache Mahout, which Stanford played a big role in, is essentially a collection of data mining algorithms, support vector machines, clustering algorithms, statistical models, and so on, built to leverage the Hadoop parallel processing model. There is an NFS mount, so you can mount Hadoop from your Windows servers and UNIX servers. And there is a very interesting visual environment for Hadoop; you can think of it as the X Windows for Hadoop: you have Linux, and then you have X so you can run visual applications. Hue, the Hadoop User Environment, lets you have applications running in a browser that leverage the browser's rendering environment while, on the back end, they talk to your Hadoop "mainframe". So that's what Hue is; Hue is the X Windows of Hadoop. We also released a project called Apache Bigtop. Apache Bigtop is the build-and-test system that takes all of these individual projects and cooks them into final bits that become RPM packages or Debian packages and so on. That includes, of course, building the source code, but it also includes testing: the unit testing from the individual projects, but also cross-project integration testing, so does this version of Hive work with this version of MapReduce, with that version of HDFS, with that version of HBase. I forgot to say this: you can actually run Hive queries that join HDFS tables with HBase tables, so all of these components talk to each other, and all of that testing is done by Apache Bigtop, which you are welcome to look at and hopefully contribute to as well. We also released from Cloudera, and this one is not open source, I should point that out, however it is free, an installer for the system. You can download just one simple program, point it at a bunch of IPs, and they could be Amazon EC2 IPs or your local servers; they need to be running Linux right now, and that's it. You give it the credentials and it takes everything else from there. You say, I want HBase, I want Hive, I don't want MapReduce, and then you get nice installation progress bars, it goes and fetches all the code, and you have a cluster up and running in a few minutes. We brag about this and say that our CEO, Mike Olson, was able to get a cluster up and running using it, so that's how we knew we had made it simple enough. Any questions about the stack, the completeness of the stack? Yes: do we have integration with any of the other NoSQL databases out there, like Couchbase? Integration is a different thing; the question there is how we get data between these systems and others, which is a good question, because Sqoop addresses that problem. Apache Sqoop, which is short for "SQL to Hadoop", is about bridging the gap between the Hadoop world and the SQL database world. Sqoop is a very simple framework where you essentially say: sqoop, the name of the file in HDFS, the name of the table in the database, the credentials, and it takes care of copying that over, and vice versa. Out of the box Sqoop works over JDBC, which is single threaded and very slow; however, with a number of the big database providers we built parallel versions of Sqoop, so you can do parallel transfer between these systems, and CouchDB is one of them as well. In the NoSQL world there are a bunch of distributions, so it depends on the vendor. We want everything that is part of the core platform to be open source; however, some vendors, and I won't mention names right now, because we're doing this in collaboration with them, half of it on our side and the other half on their side to get the parallelism, may say that their part cannot be open source.
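Conceptually, what Sqoop automates looks something like the following minimal Java sketch (this is not how Sqoop is implemented; real Sqoop generates parallel MapReduce jobs to do the transfer): read rows from a relational table over JDBC and write them as delimited text into HDFS. The JDBC URL, driver, table, and paths are hypothetical.

    import java.io.BufferedWriter;
    import java.io.OutputStreamWriter;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class JdbcToHdfsSketch {
      public static void main(String[] args) throws Exception {
        // Source: a relational table reachable over JDBC (single threaded here;
        // Sqoop parallelizes this across multiple map tasks).
        Class.forName("com.mysql.jdbc.Driver");   // driver class depends on the database
        Connection db = DriverManager.getConnection(
            "jdbc:mysql://dbhost:3306/shop", "user", "secret");
        Statement stmt = db.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT id, name, price FROM products");

        // Destination: a tab-delimited file in HDFS.
        FileSystem fs = FileSystem.get(new Configuration());
        BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
            fs.create(new Path("/user/etl/products/part-00000"))));
        while (rs.next()) {
          out.write(rs.getLong("id") + "\t" + rs.getString("name") + "\t" + rs.getDouble("price"));
          out.newLine();
        }
        out.close();
        db.close();
      }
    }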
Any other questions along these lines? Cool. There are these two nice books if you want to dig deeper and learn more about Hadoop and HBase; the Hadoop book actually covers Pig and Hive and a number of other technologies as well, and I highly recommend you go buy them from Amazon or your favorite bookstore. In conclusion, the things I would like you to remember: again, the money slide I had early on, the schema point. The differentiating thing about Hadoop is its ability to work with unstructured data using any language you would like, so you have the agility and flexibility of evolving at the speed of your data. Once you find the perfect meal, once you invent the McDonald's, then you can move it into a pipeline and run it with a predefined schema, but you have the agility to go and discover, to go in with any programming language and solve almost any problem; the MapReduce model doesn't work for everything, but we're adding new frameworks to expand that. You have very high scalability of your storage and your compute, both in terms of systems and in terms of people being able to write parallel programs, but also in terms of economics, which allows you to keep your data live forever. And the two core subsystems are HDFS and MapReduce. So, how much time do I have for questions? Nice. Question: cool talk about Hadoop and so forth, but tell us about Cloudera: how do you relate to this, what's your business model, what are you doing related to it? Thanks, that's a great question, thanks for asking it. Cloudera was started three years ago, so we're three years old as a company, and Hadoop and this whole suite of software, which we call essentially the data operating system for the data center, is our business; that's what we do. It's open source, so people frequently ask how we can make money when this thing is 100% open source, 100% free; we don't charge anything for it, neither the source code nor the compiled bits. So do we make it up in volume? Do we charge for deployment and just do a lot of them? No. Open source business models were new ten years ago, and people were still figuring out how to make money; today it's just a business, and there are well-established techniques for doing it. Our CEO, Mike Olson, ran one of the first open source companies, Sleepycat, thank you, and Sleepycat essentially commercialized Berkeley DB; that was one of the first companies in the open source world. Some of our investors, like Accel Partners and Greylock Partners, were behind companies like Red Hat and others. So this is how it works. Imagine this is the Oprah show, you know, and I give you cars: there's a key under your chair, you're going to go outside, and there's a nice car waiting for you, 100 percent free, 100 percent open source. Do you not expect to pay any money going forward for that car?

No: you will pay for gas, you'll pay for insurance, and you will pay for mechanics to maintain the car. Now, you can choose to go hire your own mechanics and become a mechanic yourself, which is what Facebook and Google do; they go and hire. But for corporate America at large, their business is to fly the airplane from point A to point B, maximize the efficiency of doing that, and extract lots of money. Their business is not to build the airplane, and not to maintain the airplane, so they come to companies like us in the open source world and say, can you please do this for us? That's how we monetize the solution, essentially. By having this core platform be open source and free, especially when it's a foundational platform like this, it spreads virally inside organizations much more easily than if you had to go convince somebody to buy a new proprietary solution. So one of the benefits of open source is that viral consumption. Another benefit we get is that Facebook is contributing to this platform, we are contributing, Yahoo is contributing, Twitter is, LinkedIn is, Hortonworks, which is a spin-off from Yahoo, is, and so on, so we get all of that free IP coming in. So there are lots of benefits to open source, but the viral spreading of the software makes the adoption of the technology much easier within companies. And then once the technology goes into production, once it's running a mission-critical pipeline that has revenue behind it, the CIO or CTO of the company can't have that without insurance, without having a maintenance contract with a company that guarantees that if there's an issue it will be resolved within a given time, or without going and hiring their own super rocket-scientist engineers who can do that for them; it has to be one of those two things. In fact, a very nice thing that Mårten Mickos, whom many of you know, he was the CEO of MySQL, which Oracle acquired, and he's now the CEO of Eucalyptus Systems, once told me: the word "free", as in open source free, means two things in English, and in every other language in the world there are two separate words for these two concepts; only in English are they combined. I don't know if that's absolutely true; if somebody knows a language where it isn't, let me know. The two meanings are free as in freedom, liberty, no lock-in, and free as in no money, zero cost. Open source is about the former, not the latter. Open source is about liberty; it's about choosing your destiny by having all of your data in an open platform. A year from now Cloudera could come and say, hey, we're going to raise our prices by this much, and there would be nothing you could do if all of your data were locked into a proprietary platform. Instead you can say, thank you Cloudera, we were not happy with you, we're going to go work with this other company instead, because you have an open source foundation underneath. So that's really what it's about; it's not about cost. You're either going to hire people who know how to run this system, or you're going to work with a company like Cloudera that runs the system for you; you're still going to pay for it either way. Now, to augment that, half of my engineering team, the team I manage, works on this open source platform, and the other half works on a management suite that is proprietary.
You only get that proprietary management suite when you are a maintenance subscriber with us, and it makes it easier for you to deploy the software, configure the software, monitor it, and proactively predict problems before they happen. When you call our support line with an issue, there is a snapshot button you click; it takes a snapshot of the state across the cluster and sends it back to us, just the metadata and metrics, not the actual data, and that helps us debug what the problem is and tell you what to change. So you only get that when you are a customer of Cloudera; however, it's on the periphery, so a year from now, again, if you're not happy with us or the service we're providing, you can go somewhere else. That obviously makes our job harder than it is for a company whose proprietary software has totally locked you in, but that's the model we chose, our CEO and the founders completely believe in that model, and it is a very successful model. Thanks for the question, and sorry for the long answer. Any other questions? Yes, about the applicability, the opportunity: the claim is that unstructured data is large, big data, but if you look at corporations, what percentage of their data is actually unstructured? A lot of it is obviously relational, just because that's how it started. Well, that's why I have this chart for you. There are lots of surveys about this: relational data tends to come from humans, transactions, buying something, booking a ticket, while unstructured data comes from machines, log files from Linux servers, from web servers, from Java application servers, from network equipment, from mobile devices, and the stats show that unstructured data is growing at a rate that far exceeds relational data in today's world. So we believe this space is going to be bigger than the relational marketplace. How much money is spent today doing analytics on unstructured data versus structured data? That's a big question. Again I go back to Mårten Mickos, because he once gave a very nice talk about this.

He said that his job at MySQL was to shrink the database space down to a six-billion-dollar industry but come out at the top of it, and that's similar to what we're doing here. And I'd say there is a gray area in the middle, between the structured world and the unstructured world, and that's history: if you want to keep lots of data for a very long time, the economics of Hadoop allow you to keep the data for much longer periods. So yes, the return on bytes, the value you're getting out of every single byte, is not the same, but volume is making up for it. You're right that the value of the business intelligence you extract from structured data on a per-byte basis is higher; that is a correct observation. There's a question over there: how unstructured do you mean? You've mentioned log files; does that also include raw images? Yes, and thank you, I have a very good example, a very nice one from Hadoop World, from a company called Skybox Imaging; I love that example, and I got them to come speak there. What Skybox Imaging is doing is building commodity satellites, if you can believe that's possible: very cheap, very small satellites, getting them up into space, taking a continuous video stream of locations across the earth, and selling analytics on top of that data. All of these images feed into Hadoop, and then they analyze, for example, in front of Home Depot, not only how many cars are parked but what stuff is loaded inside those cars, so you can figure out what people are buying, which is very valuable information for competitors. And it is out in the open; you can't go inside people's houses. It's like Google Earth, or the Street View thing, but you can do this. It's also very useful for governments that don't have the ability to launch their own satellites, and very useful for urban planning and so on. But the key point is that this is not just about business intelligence, it's really about complex data processing on data of any type. Business intelligence is one of the hottest areas right now just because companies believe in it and see the value proposition right away; however, advanced analytics is another very important one. So these are some examples of industries, and there are two core use cases. One is data processing, which is the one I talked about most of today: taking unstructured data and converting it into structured data. Once companies do that over and over reliably, once they've figured out the data, tested the system, and already have the data in there, they move on to what's called advanced analytics: can I actually also do my analytics in there and do things that I can't do in my database, which goes into the realm of image processing, OCR, video processing, and so on. I'm not going to say how they use this, but it is public information that the CIA and NSA are customers of Cloudera, and they're also investors in us as well, and you can only try to imagine what they're doing with this technology over there. Other questions? Okay, this question over there; I'll come back to you after this.
The question is about others who are going after the same market, and whether we keep track of them. Absolutely, of course. We're a very competitive company and we keep an eye on the competition, and there are a bunch of other companies now starting to enter the space, including very big players like EMC and IBM. But we believe we were there at the birth; we started three years earlier, when everybody else was very skeptical that this technology would go anywhere, and hence we have the first-mover advantage. That gave us the experience of solving these problems within these industries, which is very important when you go and try to sign up a new financial customer or a big retail customer; they want to see that you've already delivered this for somebody else like them, and it means a lot when you have that. So we have that lead. We also have the founders of the projects: Doug Cutting, the creator of the technology, works at Cloudera, and the creators of many of the other key projects on that big slide are at the company. So we have all these things that differentiate us, but the most important thing is that we came in way ahead of these other players. We believe we're going to be the leader of this wave, the way VMware is for the virtualization wave; there were many other companies that tried to do what VMware did, XenSource is one example, and some of them died, some of them make good money, but they were not the leader, and our goal is definitely to stay the leader. Next question: applications are the layer on top of Apache Hadoop, right, so where do you see them coming from? Do you see them being developed by the individual companies, or do you see third parties coming up to provide applications that layer on top? Excellent question: all of the above. First, one of the key things we do is massive investment in existing application layers. Companies already have big deployments of BI tools like Business Objects and Cognos, and SAS for the data mining, and many other ones, and they really want to use them with Hadoop; they want to figure out how to bridge that gap. So one thing we're working on is better integration between this ecosystem and the existing IT investments, which are very important.

And we foresee that there are going to be many other solutions built from the ground up to work with this system, the same way that many new solutions got built when the client-server wave came. We see many new startups trying to do that, and we see big companies trying to do that; IBM, for example, has this thing called BigSheets, and BigSheets is essentially a massive spreadsheet model built on top of Hadoop. Accel Partners, which was one of the first VCs behind Cloudera, just launched a hundred-million-dollar fund in collaboration with Cloudera, and the only goal of that fund is to fund companies building innovative new solutions on top of this stack. So it's going to be all of the above. We had a very good keynote at Hadoop World from JPMorgan Chase, and they stressed more than once that, yes, we love these new solutions, but you also must figure out how to talk to our existing IT infrastructure. So both are very important: building the new, but also coexisting with the present. Very good question. Yes, back there: this data will keep exploding, so does Hadoop start hitting limitations when it gets to that scale? Very good question. Luckily we have leading indicators: most of our customers have clusters averaging 120 to 150 nodes, while Facebook and Yahoo are running at the 4,000-node limit right now, so they, and some of the government customers as well, are our predictors of what's coming next, and hence we are seeing these problems and working on them right now. I talked earlier about HDFS Federation, which tries to remove the node-count limits by partitioning the HDFS namespace, and about the JobTracker changes in the MapReduce system, where each job gets its own manager. The next bottleneck is the resource managers that manage the resources of the cluster; we need to split those out to be more scalable, and for that we'll use the ZooKeeper technology. ZooKeeper essentially is about maintaining scalable, consistent state across many nodes using a consensus algorithm very similar to Paxos; that deserves its own talk at a separate time, but yes, we are proactively working on these problems right now.
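For a flavor of the kind of coordination ZooKeeper provides, here is a minimal Java sketch in which a process registers itself in a group with an ephemeral node and lists its peers; the ensemble address and paths are hypothetical, and real deployments layer watches, retries, and leader-election recipes on top of primitives like these.

    import java.util.List;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class CoordinationSketch {
      public static void main(String[] args) throws Exception {
        // Connect to the ZooKeeper ensemble (hostnames are hypothetical).
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30000, event -> {});

        // Parent path for this group of workers; persistent so it survives restarts.
        if (zk.exists("/workers", false) == null) {
          zk.create("/workers", new byte[0],
              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Each worker registers an ephemeral, sequential node; if the process dies,
        // ZooKeeper removes the node, so group membership stays consistent.
        String me = zk.create("/workers/worker-", new byte[0],
            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

        // Every client sees the same agreed-upon view of who is alive.
        List<String> members = zk.getChildren("/workers", false);
        System.out.println("I am " + me + ", current members: " + members);
        zk.close();
      }
    }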
But the nice thing, the benefit we have, is that companies like Google publish papers about things that are ten years ahead of everybody else, then we have Facebook and Yahoo, which are three to five years out, and then we have the industry at large, so I don't think it will be a problem; luckily we have our lens on the future. You had a question in that area: there are companies trying to do MapReduce within SQL, right? Yes, there are a number of companies building MapReduce for SQL; Aster Data is a very good example of that, Greenplum I think has it too, and there are a bunch of other ones. MapReduce in SQL addresses the "any language" part. When I put up that slide and said any language, any problem, and then the flexibility of the data: having MapReduce in SQL is about any language, about going beyond just SQL and building algorithms that span beyond SQL. That's what it's about, but the underlying data model is still a structured data model, so that's the differentiator; the similarity is that they have the any-language part, which is a good thing. Did that answer your question? The follow-up is that it's still for structured data, so what was broken, since you could already do large database clustering and scaling? Where does MapReduce fit in, where is the need for it? For data that isn't structured I can see the need. Well, even with structured data, as I said, you cannot answer all of your questions with SQL, so even within structured data you want to be able to write algorithms beyond just SQL. There is absolutely a benefit to having MapReduce over structured data, for things like grep and other scalable algorithms. And if you look at something like Oracle RAC, Oracle I think only supports maybe Java and C, using very strict UDFs, user-defined functions. And what's the biggest Oracle installation in the world? Maybe a hundred machines, and that would be a very rich company, for a single instance. I'm sure some big companies have bigger ones; I had one of the biggest ones at Yahoo, an Oracle installation with a petabyte of data in it, and it was a very big, massive, essentially RAC system with Fibre Channel between the nodes and so on, and that was about 20 servers, if I remember.
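As an illustration of that kind of beyond-SQL algorithm, here is a minimal sketch of a distributed grep written as a map-only MapReduce job in Java; the input path, output path, and pattern are hypothetical.

    import java.util.regex.Pattern;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class DistributedGrep {

      public static class GrepMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        private Pattern pattern;

        @Override
        protected void setup(Context context) {
          // The pattern is passed in through the job configuration.
          pattern = Pattern.compile(context.getConfiguration().get("grep.pattern"));
        }

        @Override
        protected void map(LongWritable offset, Text line, Context context)
            throws java.io.IOException, InterruptedException {
          // Emit only matching lines; each mapper scans its own block of the data in parallel.
          if (pattern.matcher(line.toString()).find()) {
            context.write(line, NullWritable.get());
          }
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("grep.pattern", "ERROR");          // hypothetical pattern
        Job job = Job.getInstance(conf, "distributed grep");
        job.setJarByClass(DistributedGrep.class);
        job.setMapperClass(GrepMapper.class);
        job.setNumReduceTasks(0);                   // map-only: matches are written directly
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path("/data/transactions"));
        FileOutputFormat.setOutputPath(job, new Path("/tmp/grep-out"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }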

Next question: Hadoop has the schema-on-read aspect, but HBase also adds updateability, and teasing those apart is really interesting. I think it's fascinating: you're talking about HBase, and the question is how you are going to match the malleability of the data, as it constantly changes, with the analysis of the data, which inherently is about taking an immutable input and creating a new immutable output. Believe me, I believe it can be done, but I haven't heard that discussion, and how those semantics evolve is going to be a very interesting area. Yes, that's one of the deficiencies right now; metadata management at large is one of those deficiencies, and so is consistency management: how do I interpret this changing stuff, which is different from interpreting the unchanging stuff that classic MapReduce deals with? In MapReduce I take a set of inputs and functionally calculate an output, which is very different from a living, breathing HBase table that is constantly changing, and of course you're going to want to analyze it while it changes. That's going to be fun; I think you're going to do it, and it could be great. You're right, and one of the things being worked on, we don't have it yet, is snapshots for HBase, so you can take a snapshot of HBase and have a query run against the snapshot without being affected by new updates taking place in the meantime. Right now we push that back on the application and say, you need to know how to deal with it; we're not going to do it for you. So I think we're hitting our planned time, and we'll turn off the video now. Did you tell us all the cool stuff? I think I did, I hope so. Okay, thank you, and for more please visit us at stanford.edu