Marja Hoeltta
Parsing JavaScript – better lazy than eager

MARJA: Hi, everybody. I’m super excited to be here. I work in the V8 team as a software engineer, and I will talk about parsing JavaScript. This talk is going to be about V8, the engine behind Google Chrome, and how we parse JavaScript. I figure if I’m going to give a parser talk, I should tell you what parsing is, to get everybody on the same page, and why you should care about it. Then I’m going to talk about how V8 parses JavaScript and what you as a web developer can do to help us parse better.

What is parsing? So, for those who didn’t see the previous talk: the parser gets the JavaScript source code and constructs data structures called the AST and scopes based on it. I will tell you in a minute what those are. Then the bytecode generator walks those structures and generates bytecode. The bytecode is then interpreted by the interpreter, and the optimising compiler also gets the bytecode and generates machine code based on it, and the machine code is executed directly.

So let’s look at some simple source code and what the parser does with it. The parser constructs an AST, an abstract syntax tree, that describes the structure of the source code. For the function, there is a function literal, and there is a variable declaration and an assignment. Every time we see a variable name, we create a variable proxy for it. We don’t know yet which variable it is; the proxy just represents the variable. The zero is a literal. Then there is an if statement. It has the condition, which is empty here. The “then” part contains more code. There is another variable declaration and an assignment, and we create variable proxy objects to represent the names of the variables, and then there is a return statement, again with a variable proxy. This is the AST that the parser generates based on the code.

In addition to the AST, the parser also generates scopes where the variables are declared, so now for the function, there is one declaration scope, and for the if-statement body, there is another scope. The
declaration scope contains the variable A, and the if-statement scope contains the variable B. The variable proxies occur in the code which belongs to that scope.

Then we do scope analysis. This means we connect the variable proxies to the declared variables, so, in this phase, we figure out that all references to A actually mean the variable declared in the function, and all references to B mean a different variable. To do this, it’s not enough to just look at the scope we are currently in. For example, with the return statement, there is return A, but A is not declared in that scope. Instead, we need to walk up the scope chain to find where the variable is declared, and now we find it in the parent scope, the declaration scope.

Okay, so, this is quite a lot of detail on one slide, and, in reality, it is even more interconnected. I just couldn’t draw all the arrows on the slide. This was exhausting to make and also probably super exhausting to look at. Look at this for a second!

Why should you care about parsing? Here is a diagram of where real-world web pages spend their time in V8. Parsing is the orange blob on the left, and it turns out that web pages spend around 15 to 20 per cent of their V8 time in parsing. Parsing is on the critical path for web page start-up. According to Google’s production web app study, a typical page spends around 370 milliseconds parsing on mobile, so that’s quite a lot of time, if you think about it. So this means that our parsing speed is roughly one megabyte per second on mobile.

How does V8 parse JavaScript? I’m going to talk about the two parsing modes, eager and lazy, and why parsing is hard and
why benchmarking it is hard.

We don’t actually have one parser, we have two. They are called parser and pre-parser, for historical reasons. Parser is the full, eager one. It builds the AST and the scopes and finds the syntax errors in the code. The pre-parser is the fast, lazy one. It basically just finds where the function ends so that we can carry on. It doesn’t build an AST. It builds scopes, but it doesn’t put variable references or variable declarations in the scopes. It is approximately twice as fast as parser, and it only finds a restricted set of errors, so it doesn’t actually comply with the ECMAScript spec, but we are somehow getting away with it!

Here’s an example of how we use the two parsers to parse your JavaScript code. All top-level code is eager. We use the actual parser to parse it. If we see a paren before a function, and in this example there is one, we guess this is because you want to call the function right after, so we should eager-parse it. In this case, the guess is correct, and there actually is a call to this function, so this is eager-parsed. Other top-level functions are lazy-parsed, so we use the pre-parser for parsing that function body. Later on, at some point in time, you might want to call this function, so at that point, when you call the function, it is eager-parsed, compiled and executed. There are some other heuristics. If there is an exclamation mark before the function, it turns it eager. If there is a comma, it turns it eager. These are all eager.

Here are some trickier lazy versus eager cases. The problem is we need to make the decision about which parser to use before we see the function body or anything that follows it. We need to make it when we see the function token. In the first example, we assign a function to a variable. This function is lazy: there is no paren before it. The second example looks just like the first one, except we call this function and assign the return value of that call to F2, but we cannot know that when we need to make the decision on whether to parse or
pre-parse, that is, whether to eager-parse or lazy-parse this function, so we end up making the exact same decision in both cases. In the second one, we also lazy-parse the function, and, when this line is executed, we need to eager-parse it right after and compile it. It was kind of the wrong decision, but we just couldn’t know based on the code that we had seen so far.

These lazy versus eager rules are not specified in the spec. Each engine is free to implement them as it sees fit, or it doesn’t need to implement lazy parsing at all. V8 just tries to guess, based on the syntax, which functions are probably called, and then eager-parses those functions and lazy-parses the rest.

So why is this relevant for you? It turns out we need lazy-parsing because web pages ship a lot of code they don’t execute, at least not on start-up, so we want to do as little work as possible for processing the code that is not needed. It is also important that we pick the right parser to use. If we eager-parse something that is not needed, we’re just wasting time; it is not necessary. On the other hand, if we lazy-parse something that’s needed, then we pay the cost of the pre-parse and the cost of the parse. The pre-parse is half the cost, so it’s like one and a half times the parse cost we need to pay. The problem is knowing what code is executed at start-up. You can also force eager parsing by wrapping functions that are critical for start-up in parens. There is a library called optimize-js that does this, so it might be of use, and that results in speed-ups with most of the browsers.
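[Editor’s sketch] The lazy versus eager cases described above could look roughly like the following. All names are hypothetical, and the comments restate the heuristic as described in the talk; it is an engine detail that may change, not something the spec guarantees. The functions behave the same either way; only the moment at which the body is fully parsed differs.

```javascript
// Hypothetical example code; the eager/lazy notes reflect the heuristic
// described in the talk, not a guarantee from any specification.

// No paren before `function`: the engine would probably lazy-parse the body
// and fully parse it only when lazyLooking is first called.
const lazyLooking = function () {
  return 1;
};

// Up to the `function` token this looks exactly like lazyLooking, so the same
// lazy decision is made, even though the function is called immediately.
const calledAtOnce = function () {
  return 2;
}();

// The "parens hack": a paren right before `function` hints that the function
// is about to be called, so the body would probably be eager-parsed.
const eagerLooking = (function () {
  return 3;
})();

console.log(lazyLooking(), calledAtOnce, eagerLooking);
```

Whether the hack pays off depends on the engine and its version, so it is worth measuring before and after.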

We should also minimise the cost for the cases where we get the guess wrong. This is an area we are actively working on at the moment.

Lazy-parsing inner functions is more complicated than lazy-parsing top-level code. To understand why, we need to look at context allocation. Here is some example code. There is a function outer, which we eager-parse. It has a local variable called a. Then it has an inner function that returns this local variable. And then this function outer returns a reference to the inner function. Now we call this function and assign the return value to f. Now f will be a reference to inner. And then we call f and we print out the return value. So, this will print out the value of a, as you might expect, but where is it coming from?

Normally, when you call a function, its local variables are put on the stack, but here, when we are calling f, we’re not inside a call to outer, so a definitely cannot be on the stack. So where is it? The answer is: it is in the function context. A function context is an object which the inner function refers to and keeps alive. We have the reference to the inner function, because f is a reference to it, so that is how the function context is kept alive. When inner accesses the variable a, it reads it from the function context.

If we want to lazy-parse inner in this case, we need to know which variables it refers to, so that we can put those variables in the function context and not put other variables there. We don’t want to put all variables in the function context, because accessing them from there is way slower than accessing them from the stack. So normally, … so we need something like lazy parsing with names, and the speed for doing that is somewhere between parser and pre-parser. This means that lazy-parsing inner functions will always be heavier than lazy-parsing top-level functions, just because of the semantics.

Modern JavaScript is heavily nested.
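[Editor’s sketch] The outer/inner example described above might be written as follows. The value 20 and the extra variable onlyUsedInOuter are assumptions added for illustration; the comments about stack versus context placement restate the talk and are not observable from JavaScript itself.

```javascript
// Sketch of the closure example from the talk. The concrete value of `a`
// and the variable `onlyUsedInOuter` are assumptions for illustration.
function outer() {
  const a = 20;              // referenced by inner, so it has to outlive the call to outer
  const onlyUsedInOuter = 1; // not referenced by inner, so it need not be kept alive

  function inner() {
    return a;                // read after outer has already returned
  }

  return inner;
}

const f = outer();           // f is now a reference to inner
console.log(f());            // 20
```

To decide where a and onlyUsedInOuter should live, the engine has to know which names inner refers to, even if it only lazy-parses inner.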
Everything is wrapped in functions, everything is a module now. This is a price you have to pay for nesting functions like that.

In some situations, V8 has to reparse code that it has already lazy-parsed. In this example, there is a lazy outer function: there is no paren before the function, so we lazy-parse it and lazy-parse everything inside the function too. When we call this lazy outer function, we need to do something for inner, and how it currently works is that we need to pre-parse, or lazy-parse, inner again, even though we have already done it once. It gets even worse if you nest more. So now, in the first run, when we go through the code, we lazy-parse lazy outer and lazy-parse everything in it. At some point, you call lazy outer, so we eager-parse lazy outer and then lazy-parse inner and inner 2 again. Then we call inner, so we eager-parse inner, and need to lazy-parse inner 2 for the third time. Obviously, this is quite bad. This is not how it should work. This is something I’m working on. Instead of lazy-parsing them again, we should skip those functions if we have already lazy-parsed them once.

Why is parsing hard? The JavaScript grammar is not ambiguous as such, but it contains some constructs where we don’t know up front what we are parsing. One is arrow function parameter lists and comma expressions. They look just the same. If you see (a, b, c), you don’t know what that is. Maybe it is a comma expression. It is also possible that it is a valid comma expression but not a valid arrow function parameter list. If you see (a, 1, 2), that’s a valid comma expression, but (a, 1, 2) is not a valid arrow function parameter list. We don’t know whether it is a comma expression or a parameter list until we see whether an arrow follows the expression.
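[Editor’s sketch] The parameter list versus comma expression cases described above can be checked with a small probe. The helper name parsesOk is hypothetical. It relies on the standard Function constructor, which parses the given source and throws a SyntaxError if it is invalid; the resulting function is never called, so nothing is executed.

```javascript
// Hypothetical helper: reports whether `source` parses as a function body.
// `new Function` only parses here; the created function is never invoked.
function parsesOk(source) {
  try {
    new Function(source);
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e; // anything else is unexpected, so do not hide it
  }
}

const results = {
  commaExpression: parsesOk('(a, b, c);'),            // true
  arrowFunction: parsesOk('(a, b, c) => 0;'),         // true
  literalsInExpression: parsesOk('(a, 1, 2);'),       // true
  literalsInParameters: parsesOk('(a, 1, 2) => 0;'),  // false
};

console.log(results);
```

In all four cases the text up to the closing paren is the same kind of token sequence; only what follows the paren decides which reading applies.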

We cannot know, for example, when we see the 1. The other way round is also possible: (a, ...b) followed by an arrow is okay, it is an arrow function with a rest parameter, but (a, ...b) as an expression is not okay, and this is not something we can know while we are parsing it. We don’t know whether the user intends to use it as an arrow function parameter list.

How the parser solves this is that it never rewinds. Instead, when it is parsing an unknown construct, it parses a very permissive grammar that allows both kinds of constructs, and it records whether it has seen something that makes it an invalid parameter list or an invalid comma expression. Then, when we see the closing paren, we can check if there is an arrow, and then we just check the information that we recorded. We don’t jump back and reparse it or anything like that. So the parser has high feature complexity, and new language features are added to it all the time.

Here is a typical parser bug that I found some time ago. Eager parsing fails with this code, but, like, what is this even? So, there is a variable g, and we assign to g an arrow function. The arrow function has two parameters. There is the destructuring x and then there is g, and the parameter g has a default value. That’s again an arrow function with a body, and the body is eval x. So now, if I force eager parsing, if I disable lazy-parsing, this fails. We call the function g without providing a value for g, so the default value kicks in, and this eval is confused and says, “I have no idea what you are talking about?
What is this x?”, even though it should resolve to the parameter x. The features involved in this bug are lazy versus eager; destructuring (the destructuring of x, though it turns out this is not relevant); default parameters; arrow functions, and now an arrow function is used as a default parameter to another arrow function; and then there is eval, and it’s important that the eval is in the body of an arrow function which is a default parameter. So this is to give you an idea of the complexity we are dealing with in our everyday work.

Benchmarking parsing is also non-trivial. Here I have some mock benchmarks. Benchmark 1 is not a parsing benchmark. It is a function with lots of code. It looks like a lazy function, there is no paren before it, and then the actual benchmark starts the timer, calls this function, and measures how long it took. But now, if you implement lazy parsing the way I described, we need to parse the function while the timer is running, and this is really bad for the benchmark. For the benchmark, it would be way better just to do as much work up front as we can, and parse and compile everything when the timer is not running yet. Even though we need lazy-parsing for the web, it really makes the benchmark score here worse. It’s a tough trade-off.

There is another benchmark, benchmark 2, that tries to be a parsing benchmark. We start the timer, eval a lot of code and then measure how long it took. Okay, this is fair, this definitely exercises parsing while the timer is running, but this is totally not how you load JavaScript. When you load JavaScript from a file, from a resource, a wholly different code path kicks in than here. And there are some improvements we do for the normal code path, for example, we download and parse scripts in parallel, and these kinds of improvements don’t benefit this kind of benchmark at all, because eval is just not using the same code path.

So, what
can you do to help us parse better? None of the stuff I talk about is black and white, like, “Do this,” or, “Absolutely don’t do that.” I can only tell you how things look from the parser’s point of view, and then you can figure out what’s a good trade-off for you and your
use case. A lot of this stuff you can find in a blog post called “JavaScript start-up performance”.

The first is: ship less JavaScript, so we don’t need to parse so much. [Applause] You can also use the code coverage functionality in DevTools to see what parts of your code are not needed, or not needed on start-up, so maybe it is possible to lazy-load some of that code. You can measure the parsing of your code and its dependencies with Chrome tracing and the runtime stats in it, and you can see the concrete number of milliseconds that it spends parsing your code.

We have code caching. When you load the same script, V8 detects that and puts it in the cache. The next time you load the script, we don’t need to parse it or compile it, we just read the bytecode directly from the cache. This affects bundling. If you bundle a lot of your JavaScript libraries into one file and then you want to update one part of it, you lose the code cache for the full bundle. We won’t be able to figure out that you have updated just one part of the bundle, so this is something to be aware of when bundling and updating your code.

I already mentioned streaming. That means we start parsing a script while it’s downloading, before it has finished downloading the full script, so it makes sense to use this for big scripts. To use it optimally, you should load them as early as possible, and async, so the streamer kicks in, and you can also make sure that the streamer is streaming your script with Chrome tracing. In the trace, there will be a background event, and you can see the name of the script that got streamed.

There is very little we can do for eval. There won’t be streaming for that. There won’t be a code cache for that. It makes sense to avoid evaling big chunks of code if you can.

In some situations it makes sense to use the parens hack to force the compilation of the critical path in your code. This makes sense, for example, if you need to support older Chrome versions, if you need
performance across browsers, or if you need performance right now and can’t wait for us to fix our code. We are working on making these hacks less and less relevant in the future.

There is time for bonus content. This is code from the V8 parser. It is a hand-written recursive descent parser, so here we are inside the function that parses an if statement. The return type of that is a statement. The first thing we expect to see is the token “if”. This is already checked above, before calling this function. Then we expect to see a left paren. If there is no left paren, then this is a syntax error and we bail out of this function. So then we recurse. We call a function called ParseExpression, and then we expect the right paren after it. We recurse again, now for parsing the “then” part. Then we check: is the next token ELSE? If it is, we recurse again for parsing the ELSE part. If there is no ELSE, we do nothing, and in the end, we construct the node for the statement. It is handwritten, not generated from any rule file or anything like that.

Here are some things you might want to remember from the talk. If you have further questions or comments, or want to talk about parsing in general, just please get in touch. Thanks for listening. [Applause]

>> Hello, everyone. As you’re going for the break for the wonderful coffee, remember that we have a community track. There will be a lightning talk from a couple of local meet-ups, including Up Front, and a few user groups. If you’re curious how we organise events in Berlin, go and check out the community lounge.