All right, so I guess I'll get started. I'm Badrish Chandramouli, a researcher in the database group here at MSR Redmond, and today I'll be talking about open-source technologies for streaming and state management from our group, focusing really on two systems that we have built recently and open sourced: Trill and FASTER. As brief background, I work in the MSR database group here in Building 99. Our group works on a diverse set of research areas such as streaming, big-data analytics, key-value stores, storage, security, transactions, scale-out: you name it, basically everything related to database platforms. Recently we have been open sourcing a lot of the research we have been doing in our group, and I wanted to focus on some of these projects in this talk. These are all available on GitHub under an MIT license. The first is Trill, which is the streaming engine that we built here in the database group; it's a streaming engine for real-time and offline analytics, which I'll talk about shortly. The next piece is something more recent that we built over the last couple of years: FASTER, a fast key-value store for resilient state management. I'll be focusing today in a little more detail on these two systems. The others that we have also open sourced recently include CRA, which stands for Common Runtime for Applications. CRA is a powerful distributed runtime for dataflow graphs, and it allows you to build distributed analytics or dataflow systems; it's available at this URL. The other project we've open sourced recently is Ambrosia, which is a framework for building highly robust applications and microservices. Today's focus will really be on streaming and state management, but you can talk to a bunch of us who are here about all of these projects; several folks from our group are out here as well,
so feel free to talk to us about these projects. Let's start with a brief description of Trill. A lot of you may be familiar with Trill from several years back: it was a project that we started way back in 2012, and its goal was to be a streaming engine for both the cloud and the edge. A lot of scenarios for big-data analytics that we were looking at across Microsoft drove this project. One is simple real-time monitoring of applications: the exhaust that is generated by all these cloud applications, and having to raise alerts when problems are detected. The other kind of use case, apart from pure real-time, was the combination of real-time with historical data, where we wanted to be able to take real-time streams and correlate them with, for example, what happened say a week back, to detect anomalies and things like that. Offline analytics was also very important, where the logs being collected in these data centers had to be useful for back-testing your real-time queries, or for doing analysis, such as time-series analysis, on the offline logs themselves. So there was a need for an engine that could support all of these applications, and we distilled several key requirements for an engine to handle all of those kinds of scenarios. The first is performance, which is important for both real-time and offline: particularly for offline, because you take a large amount of data and you want to process it in, say, tens of seconds or minutes, to give interactive responses to your queries; at the same time you want low latency for real-time. Fabric and language integration was the other requirement, which really meant that we wanted to build a streaming engine as a library that could be embedded in a variety of distributed fabrics. This turned out to be a key reason for Trill's success: it is just a library that you can embed, for example, in systems like Orleans or SCOPE, and in all of these different kinds of fabrics.
As part of the language integration, Trill is also written in a high-level language, which allows us to support rich data types; for example, this is super important when you are performing operations that use a superset of the SQL data type system. Finally, the query model that Trill supports is a superset of SQL: it supports temporal queries, it supports pattern detection, and all of those kinds of features. Basically, in terms of performance, Trill was around two to four orders of magnitude faster than the best streaming engines at that time, and for relational workloads we were able to get performance comparable to the best columnar database systems at the time. The way we approached that was through this notion of user-controlled latency specifications: the user says how much latency they are willing to tolerate, and then we use that to perform micro-batching and columnarization, applying database techniques within these micro-batches to get better performance, and thereby straddle the latency-throughput trade-off.
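The latency-specification idea can be sketched in a few lines. This is only a toy illustration (Trill itself is a C# library and its real batching logic is far more sophisticated); the class and parameter names here are made up for the example:

```python
class MicroBatcher:
    """Toy sketch of a user-controlled latency specification driving
    micro-batching: events are buffered, and the batch is flushed either
    when it is full (throughput) or when the oldest buffered event has
    waited past the latency budget (latency)."""

    def __init__(self, latency_budget_s, max_batch=1024):
        self.latency_budget_s = latency_budget_s
        self.max_batch = max_batch
        self.buffer = []
        self.oldest = None
        self.batches = []  # stand-in for the downstream columnar operator

    def add(self, event, now):
        if not self.buffer:
            self.oldest = now
        self.buffer.append(event)
        if (len(self.buffer) >= self.max_batch
                or now - self.oldest >= self.latency_budget_s):
            self.flush()

    def flush(self):
        if self.buffer:
            self.batches.append(self.buffer)
            self.buffer = []

# A 100 ms latency budget with a max batch size of 3 events.
b = MicroBatcher(latency_budget_s=0.1, max_batch=3)
for t, e in [(0.00, "a"), (0.01, "b"), (0.02, "c"), (0.03, "d"), (0.20, "e")]:
    b.add(e, now=t)
b.flush()  # end of stream
# b.batches == [["a", "b", "c"], ["d", "e"]]
```

A larger budget lets batches grow bigger (better throughput, more columnar work per batch); a smaller budget flushes sooner (lower latency), which is the trade-off being straddled.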

We built Trill as a high-level-language component in C#. It works with arbitrary language data types and libraries, and the query model is fairly rich, with support for complex windows, pattern detection, and pretty much what you would need in these kinds of real-time and time-series analysis environments. Trill is used across Microsoft, including in the Azure Stream Analytics service, and we have folks from that team here if you want to talk more about that service as well. The key enabler for Trill was the combination of performance, the fabric and language integration through its library nature, and the query model. The current status is that we open sourced it just a couple of months back, and we are actually looking for use cases outside Microsoft, for people to both use and contribute to the system. The libraries are also available on NuGet, which is the package management system used with Visual Studio. It works on .NET Core, so you're not really restricted to Windows: you can use it on Linux, it works on the edge, it works on the cloud, and it's a very reusable component that might be useful in any form of real-time or offline time-oriented analysis you may be interested in. We've also been doing work on combining Trill with the other systems I mentioned earlier: for example, Trill with CRA gives you a system called Quill, which is a multi-node streaming analytics system; you can use Trill with Ambrosia to build resilient real-time pipelines; and with FASTER you can externalize your operator state beyond the main memory of your system.
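To make the query model above a little more concrete: Trill queries are written as C# LINQ expressions over temporal streams. As a rough sketch of the kind of windowed computation involved, not Trill's actual API, here is a tumbling-window count in Python (the helper name is mine):

```python
from collections import defaultdict

def tumbling_count(events, window_size):
    """Toy sketch of a tumbling-window count: each (timestamp, key) event
    is assigned to the window containing it, and events are counted per
    (window, key). Trill's real API is a C# LINQ library over temporal
    streams; this only illustrates the shape of such a query."""
    counts = defaultdict(int)
    for ts, key in events:
        window_start = (ts // window_size) * window_size
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(1, "click"), (2, "click"), (7, "click"), (8, "view")]
print(tumbling_count(events, window_size=5))
# {(0, 'click'): 2, (5, 'click'): 1, (5, 'view'): 1}
```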
So the next project I want to briefly overview is FASTER, a relatively newer project that we've been working on: an embedded key-value store for state management. The basic problem FASTER addresses grew out of our experience with Trill being used in these large-scale distributed pipelines, managing huge amounts of state. For example, in Bing there are billions of users or ads in the system for which you have to maintain some state and update it as new data arrives. This also includes stateful applications such as IoT device tracking and data center monitoring; all of them are super stateful. By state I mean you have a large number of independent objects (these could be users in the system or advertisements in your pipeline), and you have some amount of state for every single object in the system; for example, this could be a machine learning model or some statistics. Now, the aggregate state is so large that it does not fit in memory, and this is particularly problematic for edge or multi-tenant use cases, where you are indexing a huge amount of state but you don't want to provision memory for the entire state. The kinds of operations you perform on this state are very simple: typically point operations, where you just do a hash lookup of a particular object and maybe do some operation on it, or you perform an update on the object, say you read the old sensor reading and increment it, and things like that. Of course, the state needs to be recoverable. There is a particular property we can exploit in these kinds of applications: even though the amount of state is large (for example, in a search engine you may have billions of users who are alive in, say, the seven-day window of data you are maintaining), the number of users who are actively surfing the web or performing actions at a given point may be a small fraction; for example, millions may be actively surfing and updating their statistics at any given moment. So we wanted a system that could exploit this and reduce the memory footprint required to handle the state in these kinds of applications. FASTER is the result of
these requirements: it is a latch-free, concurrent, multi-core hash key-value store. It's designed for a shared-everything environment, where you have a bunch of threads talking to shared memory, with some kind of backing store behind it. It aims for high in-memory performance and scalability across threads; it's a multi-threaded, latch-free system; and it supports data larger than memory, so you can basically tune things so that the hot working set stays in your main memory, exploiting temporal locality, while the older data spills off to secondary or tiered storage, which could be things like Azure Blobs or cloud storage in general. The performance is very good when your working set fits in memory: even if the data you are indexing is larger than memory, if your hot working set happens to fit in memory, we show that we can actually outperform pure in-memory data structures by significant margins, and when you start comparing with systems that can handle larger-than-memory data, we are orders of magnitude better than the systems out there today. Running a YCSB workload on a single high-end machine with two sockets, we're able to get around 150 to 200 million operations per second.

The interface of FASTER consists of point reads and blind updates, as I mentioned, but also atomic read-modify-write operations, which are particularly interesting in streaming scenarios where you read a value and make an update to it: it could be incrementing a counter or updating some field in your vector of machine-learning model parameters. Here's a brief graph of the scalability of FASTER with respect to the number of threads. It shows that FASTER on a multi-socket CPU gets pretty much linear scalability for this YCSB workload, and interestingly it's comparable to pure in-memory data structures such as the Intel TBB hash map, which a lot of us are familiar with. And when you compare it to today's indexes that are used to handle larger-than-memory data, it is significantly better. Very briefly, I'll geek out on the next couple of slides about the system architecture before stepping back and giving the current status of the project. You have a bunch of threads talking to a hash index in FASTER, and the index is backed by what we call a hybrid log, which is a log of all the records that are being accessed. The hash index basically contains the root of what is essentially a linked list of all the records that collide at that location in the hash chain. There are three key technical innovations, described in the paper we presented at SIGMOD last year, which you can look at for details: the indexing, the record storage, and the threading model. These are the innovations that allowed us to build a system that can handle larger-than-memory data. The important thing here, very briefly, is that the record log goes across disk and main memory: the hot records stay in memory, while the colder records are pushed out to disk.
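The hybrid log idea can be caricatured with a toy Python key-value store. This is only a sketch of the concept (in-place updates in a mutable tail, read-copy-updates for older records); all names here are hypothetical, and the real FASTER is a latch-free, epoch-protected C#/C++ system that actually spills cold portions of the log to storage:

```python
class ToyHybridLogKV:
    """Toy sketch of the design described above: a hash index points into a
    single logical log. Records in the mutable tail are updated in place;
    records that have aged into the read-only (or on-disk) region are
    updated with a read-copy-update, i.e. a fresh copy appended at the tail."""

    def __init__(self, mutable_fraction=0.5):
        self.log = []              # list of [key, value]; index = logical address
        self.index = {}            # hash index: key -> latest address in the log
        self.mutable_fraction = mutable_fraction

    def _read_only_boundary(self):
        # Addresses at or beyond this boundary are in the mutable tail.
        return int(len(self.log) * (1 - self.mutable_fraction))

    def upsert(self, key, value):
        addr = self.index.get(key)
        if addr is not None and addr >= self._read_only_boundary():
            self.log[addr][1] = value        # in-place update in the mutable tail
        else:
            self.log.append([key, value])    # read-copy-update: new copy at tail
            self.index[key] = len(self.log) - 1

    def read(self, key):
        addr = self.index.get(key)
        return None if addr is None else self.log[addr][1]

    def rmw(self, key, update, initial):
        """Read-modify-write, e.g. incrementing a counter."""
        old = self.read(key)
        self.upsert(key, initial if old is None else update(old))

kv = ToyHybridLogKV()
kv.rmw("clicks", lambda v: v + 1, initial=1)   # creates the record
kv.rmw("clicks", lambda v: v + 1, initial=1)   # updates it in place
print(kv.read("clicks"))                       # 2
```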
In a little more detail, the record log consists of a single address space that spans both external memory and main memory. The hot records that are currently being updated go into the mutable region of the log, which sits at the tail, so you can do in-place updates of those records; as records become colder, they age out into the read-only region, and you perform what is known as a read-copy-update to bring them back to the tail. So essentially it's a hybrid between the traditional log-structured systems and the pure in-memory systems. I'm going to skip over some of the technical details and just talk about the current status of the system. It was open sourced in August, it's available in both C# and C++, and it is in pretty good shape; we have a lot of contributors who are also interested in providing new features. If you want to learn about the research innovations, you should look at our papers. One of the interesting things we are doing right now is to integrate FASTER into Trill, which are now both open source. So to summarize, we have recently open sourced a bunch of these projects, and we invite everyone to use them, contribute, and perform follow-up research; talk to us for more details. Thank you.

Question: So this works on a single node; how does it spread out?

Answer: Right, so these are components; we approached this in a layered way. We first built the single-node components, in fact single-core components, before scaling out to multi-node. We have distribution layers that I didn't talk about in this talk: pieces like CRA and Ambrosia are the ones that take these single-node components as unmodified building blocks and build distributed pipelines out of them, and all of them are open source.

Our next speaker is from Simon Fraser University and is going to talk about democratizing data preparation for AI.
Okay, thank you all for coming under this special weather condition. This is Jiannan from Simon Fraser University. In today's talk I'm going to cover three topics: in the first part, I'm going to briefly introduce the Simon Fraser University database and data mining group; in the second, I'm going to talk about my vision to democratize data preparation for AI; and in the last part I'm going to talk about my recent research project, called TARS, which can help you speed up data labeling. Okay, so as the first part, I want to briefly introduce our database and data mining group. We have been working on this area for over 30 years, and we have made a lot of impact, on both the education side and in industry.

For example, we developed a number of very famous data mining algorithms, such as DBSCAN and PrefixSpan, which are widely used in industry, and we also wrote a textbook called Data Mining that is widely used in many, many universities in North America. We are working on a wide range of research topics, and we publish papers intensively in data mining conferences and also database conferences. I want to particularly mention Tianzheng Wang, who is sitting in the back; he is actually a new faculty member who joined our school a few months ago. He's working on cloud databases and also on how we can develop modern database systems using new hardware. All right, so now I'm going to talk about my vision to democratize AI. How many of you have heard of this phrase? Okay. So if you think about democratizing AI, what does it mean? It means that we want to make AI a technology for everyone. And you can see that the core innovation of AI is really a combination of computing, algorithms, and training data. For computing, I think the problem is solved, thanks to all the cloud providers, because right now it is easy for a grad student to launch hundreds of machines within minutes. For algorithms, I think the problem is also kind of solved, because although deep learning is very hard to use, thanks to deep learning software many, many undergrad students can write just a few lines of code to run a very complex deep learning algorithm to detect objects in images. But what about training data? Although there are a few projects that focus on training data, I want to say this part is still the bottleneck. About three years ago I was a postdoc in a lab at UC Berkeley, and at that time I heard many data scientists complaining: I spend more than 80% of my time on data preparation. Nowadays, when I talk with them, they even say: oh, I spend even more than 80% of my time on data preparation.
So I think that in order to democratize AI, we really need to turn our focus to democratizing data preparation. Some of you may say: this problem is so hard, is it really worth solving? But think about how much time and how much money we have spent to democratize computing and to democratize algorithms, and how much time and money we have spent to democratize data preparation. I think it is really time that we think hard about how to democratize data preparation for AI. I want to be more precise about what I mean by data preparation. Nowadays many companies put all their data in a centralized place they call the data lake, and they hire a bunch of data scientists who, hopefully, can build some models using the data stored in the data lake. The first step is that they have to generate the training data, and the process of turning the data in the data lake into training data, which has the features and the labels, is called data preparation. So why is this problem so hard? I remember I had a conversation with Phil, who is sitting here, at CIDR about two years ago, and I asked him why he thinks data cleaning is very, very challenging. He said that he thinks data cleaning is not a single problem; it's a collection of many, many challenging problems. And think about this: data cleaning is just one step in data preparation, which means that if you really want to solve the data preparation problem, you have to solve all the challenging problems that are listed here. That's why the data preparation problem is so hard to solve. And some may say: for each of these problems we already have tons of papers and many, many algorithms, so what's new here? In my opinion, there are actually two great opportunities for the database community. One is that in the past, when we developed an algorithm, we were focusing on how to reduce the machine time, but nowadays I think we need to
turn our focus to how to reduce the data scientist's time. So imagine we develop an algorithm: we need to ask ourselves how much time the data scientist would spend to do the task without our approach, and how much time they spend to do the task with it. If the time is reduced a lot, then it's a good tool. In order to reduce the data scientist's time, we have to make sure our method is easy to use and extensible.

It should also have the composability property. Another great opportunity I think we have is to apply advanced machine learning technologies. I know there's a talk later about automated machine learning, and I think we can apply automated machine learning to solve some challenging data preparation problems, because many problems in data preparation are essentially machine learning problems, and we don't really want the data scientist to have to build an entire machine learning pipeline to solve them. There is also meta learning. When I talked with several banks in Canada, basically they all wanted to do the same things: they want to build a churn prediction model, they want to build a fraud detection model, they want to build a financial product recommendation model. Why can't we leverage the effort that one bank spent to prepare the data, so that if a new bank wants to build all the same models, they don't have to rebuild their entire data preparation pipeline? So I think these are two great opportunities for the database community in this data preparation challenge. In my lab we have a few projects that aim to fulfill this vision. Last year I presented two projects, one focused on data enrichment, the other focused on exploratory data analysis, and today I'm going to talk about TARS, which is focused on the data labeling part. We know that data labeling is very, very expensive and takes time; think about how much time was spent to build ImageNet. I think a very promising idea to reduce the labeling cost is to explore the trade-off between quality and human cost. Imagine that I just randomly label the data: then I don't have any human cost, but my labels are super noisy, so they are useless. If I only ask experts to label the data, then my human cost is very, very high, but my labels are of high quality. I believe there are going to be
more and more points in between. For example, one point in between is weak supervision, where the idea is that we ask people to define some rules, so that they can use the rules to automatically label data. For example, we can say that for every tweet, if it has a smiling face inside, then it might be a positive tweet; that's called a rule, and when we apply the rule to the data, we label the data, but in a noisy way. Another idea is that we can use crowdsourcing: crowd workers are not as good as experts, but they can actually still provide good labels for us. So then the question is this: in the future, I'm pretty sure there is going to be more and more noisy training data, so how can we deal with noisy training data? In fact, this problem has been studied in the machine learning community for many, many years, and there's a good survey paper. Basically, at a high level, there are two ideas. The first idea is that we don't do any data cleaning: we just treat all the noisy labels as if they were ground-truth labels, hoping that the model we build can tolerate the noise, because many machine learning models are robust and can tolerate noisy labels. The other idea is machine-based label cleaning, where we try to build another machine learning model to predict which training examples could have noisy labels. Now, I have been working on data cleaning for many, many years, and I think that the best way to clean data is to have a human in the loop. So I'm thinking: why can't we have human-in-the-loop label cleaning, so that humans can help to clean the noisy labels? But of course we cannot ask humans to clean everything; the question is how we can make the best use of humans to clean the noisy labels. That leads to our recent paper on TARS, which is named after the intelligent robot in the movie Interstellar that can always provide insightful comments to humans. At a high level, TARS can give data scientists two pieces of advice.
otherwise so the further what otherwise they imagine they have a noisy tighter tater and they have a model and they want to know that if I use this noise it has the data to evaluate the model and how good my model will be and heard can give the dentist the estimation of the true accuracy and also the confidence able to count if I the uncertainty and the centerpiece out otherwise Attard can give their centers to tell their center how to clean the noisy labels so that they can improve the model the most and so due to the time consider I cannot talk about how we do that in detail but we have a paper that is online then you

You can read it to understand how it works. So in summary, I think democratizing data preparation for AI is a really exciting topic, and as a community we have a great opportunity to solve this problem. Today I talked about TARS, which is a label-cleaning advisor that reduces the data labeling cost for AI. And last but not least, we also have two posters from my students. One is about how we can extract highlights from recorded live videos; this work was done by our intern. The other work is somewhat related to the keynote: SQL query explanation has been widely studied, and nowadays we have machine learning explanation, so the question is how different they are, and whether we can have a unified explanation framework so that the data scientist can explain either a SQL query or a machine learning model. That is ongoing work in collaboration with an expert in SQL explanation. Okay, thank you very much. Any questions?

Question: If we ask people to help clean this labeling data, how different or similar is that to asking people to, for example, help clean duplicate data, that is, decide whether something is a duplicate or not? Is that going to be a similar problem?

Answer: The goals are different. If you ask the human to clean the labels, the goal is to train a better model; if your goal is to remove duplicates, it's a different goal. Basically, we have formally defined how to interact with humans so that we can label the data efficiently, and the kind of item shown to the user is different in the two settings.

Question: There are these other labeling approaches, where experts write these mini rules, as in machine teaching; weak supervision is an
example of that. How does that work relate to yours?

Answer: Yes, that is weak supervision, and in that setting they don't do any label cleaning: they hope to train a noise-tolerant model so that it can tolerate the noisy labels. We think we are complementary: if you use weak supervision to train the model but you are still not satisfied with the model's performance, then our approach can bring a human in to clean the noisy labels so that we can improve it.

All right, thanks again. The next speaker is from the University of Victoria, and she's going to talk about influence maximization in massive graphs.

First of all, let me introduce myself. My name is Diana Popova; I am a researcher at the University of Victoria, Canada. I would like to talk about two of our latest papers: one was published last year in the Proceedings of SSDBM, and another one was just submitted to VLDB 2019; we have not received an acceptance or rejection yet. I will talk about influence discovery in graphs in general, then about algorithm scalability, and finally I will tell you about our algorithmic solution to the influence maximization problem. First of all, influence discovery: what do we want? We want to take a graph that was given to us, mine it, and discover the most important, or influential, nodes. For our algorithm it doesn't matter what exact phenomenon,
man-made or natural, this graph models. What we get is an incidence list, which is a long, long list, just text actually: a node ID and another node ID, a list of the edges of the graph. This long, long list can be produced, for example, by a web crawler that goes from page to page following all the links; it produces something like that without any idea of what was actually written on a page or why a link is given. It doesn't matter: that's the input. What we want to get is the meaningful graph structure you can see in the middle: for example, what appears to be a cluster of the most influential nodes, the most important to some extent. We don't know what importance actually is, but that's what we are trying to figure out, to compute from the incidence list. Now, I must tell you that our particular concern and our particular interest was to scale existing methods of computing this importance up to massive graphs. By massive I mean graphs with billions of edges: graphs that model, for example, Facebook, or Twitter, or the constituency of, I don't know, the United States of America for the next general election; something really massive. We wanted to figure out how to compute a meaningful solution, with theoretically proven guarantees, for a graph structure of that size. Now, a few words about scalability. There exist tons of algorithms that do graph mining and can figure out the most important nodes, for example. Unfortunately, it's often impossible in practice, because if the graph is massive it will take close to infinite time, plus unlimited resources, to really do it. So data scientists invent more and more algorithms every year, and the main point of them is scalability: so that, for example, on this laptop I could figure out the structure of a huge graph from its incidence list. It's very difficult to do, and we don't even know how to compare the scalability of different algorithms: different teams develop different algorithms, written in different
languages and implementations, so how would you compare them? Anyway, recently, in 2017, a very interesting paper was published, where a team tested eleven of the most recognizable and, at that time, latest influence maximization algorithms, and what they produced was a whole decision tree that helps customers pick this algorithm or that algorithm, depending on how important quality, memory footprint, or processing time is for them. Now, specifically, within scalable analytics of massive graphs, we were working on the influence maximization problem, so let me define it. What is it? Given the graph, find the most important nodes. The way we interpret "most important" is the nodes that are connected to the most other nodes. In Twitter we can kind of
approximate or evaluate this: we can see, okay, this particular person has 1 million followers, so they must be very influential; this particular person has 30 million followers, so probably they are even more so. It's roughly true, but not exactly true, because in general it's not just your degree, your direct connections, but how influential those connected nodes are in turn, and so on. Two theoretical works were the foundation for our research: one from long ago, and a second, very interesting one from a Microsoft Research team that in 2014 got a US patent for a particular algorithm; they suggested a new way of doing graph mining for influence maximization. The influence maximization problem in general can be formulated as: find a given number of seed nodes such that information would spread as far and wide as possible. The problem is NP-hard, and even the approximation algorithms for it are randomized, so we are doing randomized algorithms. What is different in our approach? All previous teams working on this problem were focused on cutting the processing time, because the time is enormous: even with randomized algorithms you have to take so many samples that it takes a lot of time, so they were trying to cut time complexity. We decided, because we are data scientists, to come at it from a slightly different angle, not, to the best of our knowledge, used by any other team before us: to focus on data structures. Data structures with a small memory footprint; data structures that would allow us, first of all, to load the graph into main memory using just a small part of that memory, and, most importantly, to keep the intermediate results of our sampling in a data structure which again takes very little memory. So we did a lot of research, and there were several papers, but the breakthrough came when we started using WebGraph for storing intermediate results. Now, WebGraph is a compression framework developed by an Italian team from Milano, and it is not just a way to compress a graph: it includes a lot of different supporting programs they wrote, all written in Java.
wrote and they wrote in Java which was very close to us because I write in Java 8 it’s really easy for me to understand what exactly they did and how exactly they did it and this particular compression allows to up to 10% to get to decrease to go down to add up to 10% of graph in compressed form comparatively to the original graph when we figure out a possibility how to compress on the flight during computation during the sampling from graphs how to do it for intermediate results that was our big breakthrough second break through is this idea and that is our last year paper we call this algorithm uno singles and that is like that when you do sampling you do trial you try to understand how much will information spread from randomly picked up node very often in the real world graph it will spread no because there is a probability for each edge and it’s never one which totally corresponds to your life it’s never one even from the most tightly connected people like one family that exactly information will spread to everyone know somebody will not accept it somebody will be busy with something else and so on so anyway the idea and this another

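The per-edge coin-flip trial just described can be sketched roughly like this. This is a toy independent-cascade style simulation of my own, not code from the talk; the graph, the uniform probability, and the function names are all illustrative:

```python
import random

def spread_sample(adj, p, seed, rng):
    """One randomized trial: flip a biased coin for each edge reached
    from `seed`; return the set of nodes the information reaches."""
    reached = {seed}
    frontier = [seed]
    while frontier:
        node = frontier.pop()
        for nbr in adj.get(node, []):
            # The edge is traversed only with probability p; since p is
            # never 1, the spread very often dies out immediately.
            if nbr not in reached and rng.random() < p:
                reached.add(nbr)
                frontier.append(nbr)
    return reached

# Tiny illustrative graph: a chain 0 -> 1 -> 2 -> 3.
adj = {0: [1], 1: [2], 2: [3]}
rng = random.Random(7)
samples = [spread_sample(adj, 0.1, 0, rng) for _ in range(10_000)]
# A "single" is a trial whose spread never left the seed node; with
# p = 0.1 roughly 90% of trials are singles and need not be stored.
singles = sum(1 for s in samples if len(s) == 1)
print(singles / len(samples))
```

Discarding the singles instead of storing them is exactly what cuts the memory for the intermediate sampling results.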
Why would we store a particular sample that goes nowhere? We just picked a node at random and tried to spread information, and no, no, no: we randomly decide, for each edge, according to the given probability on that edge, whether the information will traverse it or not, and in most cases it will not go anywhere. Here is a little bit of statistics on how many of those singles appeared in our samples. We found a way, while keeping the theoretically proven guarantee (it was proved by the Microsoft Research team), to really cut the memory footprint used by our algorithm. To compare: I am working with Professor Ken-ichi Kawarabayashi at the National Institute of Informatics in Japan, and they have a supercomputer with one terabyte of RAM, which is what we had to use so we could compare our algorithm with the existing algorithms, and it is at least three, if not five, orders of magnitude less memory used by us. Then we expanded this particular breakthrough I was talking about and wrote another paper, which is now under review, but we can already draw the conclusion: the choice of data structure proved to be instrumental in raising the scalability of graph analytics, and focusing not on time but on space complexity allowed us to design and implement algorithms that process large graphs on this laptop. The biggest one I processed here, with the theoretically proven guarantee, was Arabic, which has 640 million edges, so it is almost a billion-size graph. Any questions? [Question: did you try to apply this in medicine?] Actually, it is really interesting, because now I am working on a new paper on identifying clusters, and I am working with a protein-protein interaction graph right now, and yes, we are using these techniques. When the earlier speaker was presenting, I was immediately thinking: oh my god, they are trying to understand different factors and how important they are; that is what I am doing. But my approach here is
purely mathematical: I don't care what the graph is. But come to think of it, yes, the answer is yes, absolutely it can be done, and it is very important, because if you can do it on a laptop, you can imagine the practical possibilities. [Moderator: let's take it offline, we are just on time.] All right, any other questions? I am here all day; I am going back to Canada only tomorrow. Our next talk is by Batya Kenig from the University of Washington, and she is going to talk about integrity constraints. [Applause] Thank you. This is joint work with Dan Suciu. I will start by stating the problem really informally. We have a relation, and we know about integrity constraints; in this talk we focus on functional dependencies and multivalued dependencies. In the current setting, an integrity constraint either holds in the relation or it does not: it is binary. But in real life, with the data we see, we can think of the relation as almost meeting the conditions

of an integrity constraint, or we can say that the relation satisfies an integrity constraint to some degree, and basically we are looking for a systematic way to relax exact implication. Suppose we have a set of integrity constraints Sigma and another one, tau, and suppose we know that this implication holds; I will define formally what this implication means very soon. What can we say if the antecedents, those in Sigma, hold only to a large extent, not exactly? What can we say about the extent to which tau holds in the relation? This type of question has a lot of applications: there is a lot of work on mining approximate integrity constraints in a database instance, it is also used for data cleaning, and in the probabilistic inference literature it is used for learning the structure of probabilistic graphical models. In order to formally define the implication problem, and what relaxation is, I will introduce some key concepts. The first is conditional independence in probability distributions, and I will very soon also explain the exact relationship between integrity constraints and conditional independencies. We have a probability distribution over a set of random variables, and we say that A and B are independent given C if their joint probability factorizes as you see here. We say that an independence statement is saturated if it covers all of the variables, in this case all of X, and that it is marginal if it is conditioned on the empty set. So now that you know what conditional independencies are, we can formally define the conditional independence implication problem. Assume that Sigma is a set of conditional independence statements, like those I showed on the previous slide, and tau is
another conditional independence statement. We say that Sigma implies tau (this is the notation) if every probability distribution that satisfies the conditional independence statements in Sigma also has to satisfy the conditional independence tau. In the late 80s, Judea Pearl came up with the set of axioms that you can see here, simple rules for deriving additional conditional independence statements, and showed that they are sound and that they are complete for the saturated and marginal conditional independencies, meaning that just by applying these axioms we can find all of the saturated conditional independencies, or all of the marginal ones, that hold in the distribution. OK, so back to the database setting: a quick review of functional dependencies and multivalued dependencies. We say that a relation satisfies the functional dependency A determines B if every pair of tuples that agree on A also agree on B. An embedded multivalued dependency, like the one shown here, holds if the projection onto the attributes involved factorizes as you can see here; you see the similarity with probability distributions, the same type of factorization. A multivalued dependency, or MVD, is simply an embedded MVD that covers all of the attributes. It is known that Armstrong's axioms are sound and complete for implying all of the functional dependencies, and that Beeri's axioms are sound and complete for deriving all of the MVDs. OK, so what exactly is the relationship between integrity constraints and conditional independencies? We can view a relation R as an empirical distribution, where each tuple t has a uniform probability of 1 over N, where N is the cardinality. And basically what we can see is that if we have the MVD stating that, given A, the
disjoint sets B and C factorize according to what we saw, then the MVD holds in the relation if and only if B and C are independent given A in this empirical distribution. So this is the relationship. Note that this holds only for MVDs, not for embedded MVDs.

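As a concrete illustration of this correspondence (a toy sketch of my own, not code from the talk), one can compute the conditional mutual information I(B;C|A) of a relation's empirical distribution and check that it is zero exactly when the MVD holds:

```python
from collections import Counter
from math import log2

def cond_mutual_info(rel):
    """I(B;C|A) of the empirical (uniform) distribution of a relation
    whose tuples are (a, b, c); zero iff the MVD A ->> B | C holds."""
    n = len(rel)
    pabc = Counter(rel)
    pa = Counter(a for a, _, _ in rel)
    pab = Counter((a, b) for a, b, _ in rel)
    pac = Counter((a, c) for a, _, c in rel)
    return sum((cnt / n) * log2((cnt * pa[a]) / (pab[a, b] * pac[a, c]))
               for (a, b, c), cnt in pabc.items())

# MVD holds: for each A-value, the B-values and C-values combine freely.
good = [(0, b, c) for b in "xy" for c in "st"] + [(1, "x", "s")]
# MVD violated: one (b, c) combination for a = 0 is missing.
bad = good[1:]
print(cond_mutual_info(good), cond_mutual_info(bad))
```

The first call returns 0 (independence in the empirical distribution), the second a strictly positive value, which is exactly the "degree" of violation used later in the talk.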
For an embedded MVD, for example, it may be that the embedded MVD where B and C factorize holds in the relation, but it is not the case that B and C are independent in the empirical distribution; this is easy to see. OK, so as I said, we want to be able to look at what we call soft constraints: how exactly can we quantify the extent to which an integrity constraint holds in the relation? We use information theory for this. Here are the formulas; they are not particularly important. The most important things are that this is the conditional mutual information, and that all of these terms are always non-negative, for every probability distribution. For conditional independencies, X and Y are independent given Z if and only if the appropriate conditional mutual information is zero, and we will use this mutual information to quantify the degree of independence between X and Y given Z. It was shown already in the 80s by Lee that a functional dependency holds in the relation if and only if the corresponding conditional entropy in the empirical distribution is 0, and an MVD holds in the relation if and only if the appropriate mutual information is zero in the empirical distribution. It is worth noting that for these implication problems in general, both for databases and for conditional independencies, there are impossibility results. OK, so now to the main result. The relaxation problem is as follows: we have a set of conditional independencies, we assume that we know that Sigma implies tau using one of the known axiomatic systems, and we want to be able to bound tau in terms of Sigma. We look at two types of bounds. The first we call relaxation, where we want to bound tau using a positive linear combination of the items in Sigma; and we want to see cases in which we can find a tighter bound, where basically all of the coefficients are 1, which we call unit relaxation. So the first result is that functional
dependencies admit unit relaxation. Basically, it means that if a set of functional dependencies implies another functional dependency, then this also holds in the soft sense: we have a good bound on the extent to which the implied functional dependency holds in the database. Here you can see an example: these three functional dependencies imply that ABD determines F, and therefore what you see below is a valid information-theoretic inequality that holds in every relation. However, in the general case, conditional independencies do not admit relaxation at all. In a paper by Kaced and Romashchenko, they showed that this implication always holds, but that for any set of positive coefficients we can always find a distribution such that the mutual information between C and D is unbounded as a function of these terms. So we cannot hope that this relaxation always holds; but in some sense it holds in the limit, because for every epsilon we can find some set of coefficients that bounds it, so we have a trade-off here in that sense. This was a relatively negative result. A positive one is for saturated conditional independencies, which are exactly those that correspond to multivalued dependencies. We say that two conditional independence statements are disjoint if at least one of the following conditions holds: either X and the conditioning set have a non-empty intersection, or Y and the conditioning set do, or symmetrically the other way around. It is worth noting that Pearl's axioms, the semi-graphoid axioms, are basically disjoint: you always use two disjoint conditional independencies to imply another one, so this condition can actually be found to hold in practice. And basically we showed that if Sigma is a set of disjoint conditional independencies and tau
is saturated, that is, corresponds to an MVD, then the implication admits unit relaxation. So in that case we can find a really very tight bound.

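In symbols, the unit-relaxation guarantee for this case can be written as follows; this is my paraphrase of the stated result, using conditional mutual information as the measure of approximate independence:

```latex
% If each \sigma_i = (X_i \perp Y_i \mid Z_i) \in \Sigma, the statements in
% \Sigma are pairwise disjoint, and \tau = (X \perp Y \mid Z) is saturated
% (i.e., corresponds to an MVD), then
\Sigma \models \tau
\quad\Longrightarrow\quad
I(X ; Y \mid Z) \;\le\; \sum_{i} I(X_i ; Y_i \mid Z_i)
% for every probability distribution; so if every I(X_i ; Y_i \mid Z_i)
% is at most \varepsilon, the implied MVD holds up to |\Sigma|\,\varepsilon.
```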
Here is a very simple example: these two conditional independencies imply the one at the end, and you can see that the disjointness condition holds, because Y appears here in one of the conditioning sets; and indeed this can be relaxed to this inequality, so we are capable of bounding the extent to which that MVD holds in the relation using the statements used to imply it. To conclude: the connection between integrity constraints and information theory has been known for a long time, but this relaxation problem is a new one, really trying to find the extent to which an integrity constraint holds in a relation, and there are practical directions for this, because real data rarely satisfies constraints in a precise manner, only approximately. The main open problems here are finding good bounds on those coefficients for more cases: we showed some cases where the coefficients are all one, we showed that in general the coefficients can be unbounded, and we can look at other sets of conditional independencies where bounds exist but are not necessarily one. So this is really a big open problem in this area. Any questions? [Question: does the domain size have anything to do with this?] That is a good question. I am not sure; maybe if an attribute has a small number of values we can use more exhaustive techniques to check, but in terms of actually finding a theoretical bound for the coefficients, I am not sure. It is a good question. [Question: many years ago, when I worked in this general area, the rude question I never knew how to answer was what to do about the fact that these dependencies almost hold, but not quite, and I am really delighted to see work that is trying to get at this. Your last slide says this is of potential practical importance. I agree
that constraints that are almost satisfied are certainly important, but what do I do with this insight? Now that you have this axiomatization you can solve certain inference problems, but how can I use that?] For example, if you want to decompose a relation, normally you would say: I know I cannot decompose it, I will lose some of the tuples in the join. But given that you know that certain integrity constraints almost hold, you can say: OK, I am going to decompose it, I know I will lose some data, but it will not be too much, and I can quantify the amount I will lose. The same goes for learning probabilistic graphical models: you can assume the independence and say, I know it is wrong, I know they are not completely independent, but I can still treat them as such, I am willing to drop that edge, and you get an approximation to the answer. [Applause] Thanks again. Our next speaker is from the University of Washington, and he is going to talk about machine learning with big clinical data: automating machine learning. OK, so like most of you I am also a database PhD, but I am hiding in a medical school, so today I am going to talk about how to use database techniques to support machine learning from a medical point of view; the techniques I present are all general, though, so you can also apply them to non-medical data. We all have lots of medical data, and everybody in medicine is talking about how we can do predictive modeling, building

models to predict patient outcomes and to guide interventions. As computer scientists we say machine learning is the natural choice, but if you come to the medical school you will see that machine learning is rarely used and almost never deployed in clinical practice, and there are several reasons behind that; I am going to talk about two of those reasons in this talk. The first one is this. As computer scientists we have strong computing expertise, so we know that in order to use machine learning, where we have many machine learning algorithms to choose from, we pick one algorithm, and this algorithm has many hyper-parameter values, like the number of decision trees in a random forest. We set those values somewhat arbitrarily and use this combination of machine learning algorithm and hyper-parameter values to build a model. At the very beginning the model accuracy is typically low, so, having computing expertise, we change the machine learning algorithm or change the hyper-parameter values and rebuild the model, and we often do this hundreds to thousands of times until we get a model whose accuracy is good enough. This is very hard for healthcare researchers to do, because they do not have computing expertise. Second, even if we can build such a model with high accuracy, as we said at the beginning, most machine learning models are black boxes: they just give you a prediction, saying this patient will have a bad outcome next year, without any explanation of why, and this is critical for clinical practice. So now let me go over those two challenges in a little more detail. The first one is really about model selection, and to address this problem computer scientists have started to work on automatic methods for model selection in the past few years; the goal is to help people with little computing expertise to do machine learning.
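The trial-and-error loop just described is what automatic model selection has to replace. A minimal sketch of the search space it explores, in pure Python with a stand-in scoring function instead of real model training (all names and numbers here are illustrative, not from the talk):

```python
import itertools
import random

def train_and_score(algorithm, params, rng):
    """Stand-in for 'train a model and measure its accuracy'; in
    reality this is the expensive step that can take hours per run."""
    base = {"random_forest": 0.80, "svm": 0.75}[algorithm]
    return base + 0.001 * params["capacity"] + rng.uniform(-0.05, 0.05)

# One hyper-parameter per algorithm keeps the sketch small; real
# spaces have many parameters and are effectively infinite.
search_space = {
    "random_forest": {"capacity": [10, 50, 100]},  # e.g. number of trees
    "svm": {"capacity": [1, 10, 100]},             # e.g. a cost parameter
}

rng = random.Random(0)
best = None
for algorithm, grid in search_space.items():
    for values in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        score = train_and_score(algorithm, params, rng)
        if best is None or score > best[0]:
            best = (score, algorithm, params)
print(best)
```

Every iteration of this loop is one full model build, which is exactly why the naive search does not scale to large datasets.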
Google even released a tool called Google Vizier two years ago to help with this process; it is a commercial product. But the existing methods cannot handle large datasets. If you take a small dataset, say tens of thousands of patients where each patient has 130 attributes, an automatic model search commonly takes several days; and if you have millions of patients, each with thousands of attributes, which is also common in medical practice, then the search time is daunting. The Google paper showed two years ago that at the scale of Google datasets it takes half a year to automatically do model selection for deep learning: half a year on a super-large computer cluster, which in a medical school we unfortunately do not have, so the situation is even worse. So in order for machine learning to be widely used in a medical school, we need the method to be completely automatic, because the users do not have computing expertise, and we also want it to be efficient on large datasets. Second, even if you can use a tool to build a highly accurate model, that is still insufficient, because if you do not explain why a patient will have a bad outcome next year, the clinicians will not trust your results and will refuse to use them. They need to understand the reasons in order to design tailored interventions, and if they get sued, which is also very common in practice, they have to defend their decisions in court; and for biomedical research this can help formulate new theories or hypotheses. Most accurate models are complex: a decision tree with a few layers is easy to understand but has low accuracy, while deep learning, support vector machines, or random forests are complex and can give you high accuracy, but with a single model you cannot achieve both goals, which is typically what happens. In the medical school we want to achieve both goals concurrently; we do not want to
sacrifice even 1% of accuracy, because that is a patient's life, and at the same time we also want to explain the model's prediction results for each particular patient, so that they can be used in clinical practice. So now I am going to talk about our work addressing those two problems. The first one is about model selection. The way current automatic model selection methods work is called Bayesian optimization. You can choose among many combinations of machine learning algorithm and hyper-parameter values, and you build a regression model, a prediction model, to predict the performance of a particular combination: you try multiple combinations, using each to build a model and get an accuracy; you fit a regression model on those results; you then use this regression model to find the next promising combination to try; you use that combination on the entire dataset to build a model, get an accuracy, and update your regression model; and then you iterate this process, typically thousands of times.

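That loop can be sketched as follows. Here the "regression model" is just a nearest-neighbour surrogate over the configurations already tried, which is far cruder than a real Bayesian optimizer but shows the shape of the algorithm; all names and the toy accuracy function are illustrative:

```python
import random

def expensive_eval(x):
    """Stand-in for training a model with hyper-parameter x on the
    full dataset and returning its accuracy (the costly step)."""
    return 1.0 - (x - 0.3) ** 2  # toy accuracy surface, peak at x = 0.3

def surrogate(history, x):
    """Predict performance of x from the closest configuration tried."""
    nearest = min(history, key=lambda h: abs(h[0] - x))
    return nearest[1]

rng = random.Random(1)
history = [(x, expensive_eval(x)) for x in (0.0, 0.5, 1.0)]  # seed trials
for _ in range(30):
    # Propose candidates cheaply, rank them with the surrogate, and
    # spend the expensive evaluation only on the most promising one.
    candidates = [rng.random() for _ in range(20)]
    x = max(candidates, key=lambda c: surrogate(history, c))
    history.append((x, expensive_eval(x)))

best_x, best_acc = max(history, key=lambda h: h[1])
print(round(best_x, 2), round(best_acc, 3))
```

The catch the speaker raises next is that each `expensive_eval` call uses the entire dataset, so on large data even this guided search is too slow.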
Then you get the final combination you want. So this is the state of the art, and it tends to work very well. The problem is that if you have a large dataset, it takes a long time to do model building on this dataset even once, for a single combination. In the extreme case, say you have tens of thousands of patients, each with 133 attributes, and you build an ensemble model; an ensemble model like this actually won one of the major open data science competitions. On a modern server it takes two days just to build the model once, and in reality you have many more features, many machine learning algorithms, and an almost infinite number of hyper-parameter value combinations to search over, so the search is very slow. How do we address this problem? If you just go the machine learning way, deriving formulas, that is almost hopeless; but we are database people, so we can do sampling. That is the idea: you start from a small training set, like hundreds or thousands of data instances, which is doable; you build models on this small set; and then you keep expanding the training set, making it bigger and bigger. At the very beginning, with a very small training set, you can afford to run many thousands of tests very cheaply and very quickly, and most combinations will just turn out to be unpromising, so you throw them away. For the others you are not really sure, because you are only using a small sample to test, so you are not sure whether the accuracy you get is a reliable indicator of the combination's performance. So you narrow down your search space and expand your training sample (this is called progressive sampling, shown on the previous slide), test the remaining combinations in the smaller search space with higher confidence in their capability, and again a lot of them will turn out to be unpromising and get thrown away, and you keep doing this again and again. Over the rounds, the training sample becomes bigger and bigger and the search space becomes smaller and smaller, and eventually, once you have narrowed it down to, say, two or three combinations, you can afford to use the entire dataset, or a large sample of it, to test which one you really want. This is the idea. We tried it on small-to-moderate-sized datasets, not at Google scale, and we showed that our method can speed up the search by 28 times; also, within a given amount of time we can search many more configurations, so we have better luck finding a good combination, which also improves the model's prediction accuracy: it actually cut the prediction error by 11 percent. So we are faster and also more accurate. At Google-scale datasets the speed improvement would be even bigger, because we start from a fixed-size sample that is independent of the size of the entire dataset, so the larger the dataset, the bigger the performance advantage we have. Now, say the healthcare researchers use our automatic tool to find a good model with high accuracy; that is not the end of it. You still want to explain the model's prediction results, and, as we said, we also want to suggest interventions: if it is not actionable, it is not very useful. So how do we achieve both goals, giving explanations and suggesting interventions automatically? This is what we do. We already said that we want both high model accuracy and an explanation of the model's prediction for every individual patient, and with a single model this is very hard to do, so the trick we play is to use two models concurrently. The first model is the most accurate model you found; you do not sacrifice even one percent of its accuracy: when a new patient comes in, this model makes a prediction with high accuracy, and you do not touch it. Then you have a second model, which is association
rule-based. Everybody here knows how association rules work; these rules are mined from the historical patient dataset, like a market basket analysis. The second model is not there to make predictions: for a new patient, some rules will say the patient will have a good outcome and other rules will say the patient will have a bad outcome, and resolving that discrepancy is hard, so if you used the second model for prediction you would get low accuracy. But the highly accurate first model has already told you, with high confidence, that this patient will have a bad outcome next year.

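The division of labour between the two models might look like this in outline; the rules, the patient record, and all names here are toy inventions of mine, not the paper's:

```python
def accurate_model(patient):
    """Stand-in for the highly accurate black-box model."""
    return "bad"  # e.g. predicts type 2 diabetes next year

# Association rules mined offline from historical patients, each with
# a condition, a predicted outcome, and a pre-compiled intervention.
rules = [
    {"if": lambda p: p["bmi"] >= 35, "then": "bad",
     "explain": "obesity", "intervene": "weight-loss program"},
    {"if": lambda p: p["on_hypertension_drug"], "then": "bad",
     "explain": "hypertension treatment", "intervene": "cardiac review"},
    {"if": lambda p: p["age"] < 30, "then": "good",
     "explain": "young age", "intervene": None},
]

patient = {"bmi": 37, "on_hypertension_drug": True, "age": 28}
prediction = accurate_model(patient)
# Keep only the rules that fire on this patient AND agree with the
# black-box prediction; they become the explanations + interventions.
explanations = [r for r in rules
                if r["if"](patient) and r["then"] == prediction]
for r in explanations:
    print(r["explain"], "->", r["intervene"])
```

The rule-based model never has to resolve its own contradictions: the accurate model settles the outcome, and the agreeing rules only supply reasons and pre-compiled interventions.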
So you can throw away the rules that apply to this patient but say the patient will have a good outcome; only the rules that apply to this patient and say the patient will have a bad outcome next year matter, and each such rule gives you a reason why. Those rules are all known beforehand, so once you have mined them from the historical dataset you can ask clinicians to examine them, and the rules give you hints about what kinds of interventions can be applied to those patients; those interventions can be pre-compiled beforehand. When a new patient comes in, you give an explanation and at the same time suggest interventions automatically, so we achieve both goals concurrently. We cannot do this for every patient, because some patients will have bad outcomes for rare reasons, but we showed on our dataset of 10,000 patients, predicting who will develop type 2 diabetes next year, that for 87% of the patients correctly predicted by the highly accurate model to develop type 2 diabetes next year, we can give explanations of why. Here is one randomly chosen rule, just to give you an idea of how this works. The rule says: the patient is using a drug that treats hypertension and congestive heart failure, and both diseases are known to correlate with type 2 diabetes; and the second condition is that the patient's maximum body mass index is at least 35, which, if you have a medical background, you know means the patient is obese, and obesity is also known to correlate with type 2 diabetes. Because of those two reasons, we say that is why the patient is likely to develop type 2 diabetes next year. And here you already have an intervention pre-compiled for this rule. The intervention says: because
the patient is obese, you should put the patient into a weight-loss program, and it is known in medicine that if you can reduce a patient's weight, you can reduce the patient's likelihood of developing type 2 diabetes next year, so it is an effective intervention. So we can give explanations and also suggest interventions automatically. Thank you very much. [Question about whether the explanations are the true causes.] Actually, we are not claiming those explanations are the exact causes, because every patient is different; we just say those are the possible reasons why this patient will have a bad outcome. Eventually, in medical practice, each patient has lots of variables scattered over hundreds of pages of historical medical notes. If you ask the doctor to check all those hundreds of pages to figure out the exact reasons, for example to count how many emergency room visits the patient had in the last three months, that is very hard; you only have 15 minutes in clinical practice. But the tool can do all that analysis beforehand and present the candidate reasons, and then the doctor can check in a few minutes whether those are the real reasons, and whether the suggested interventions, although pre-compiled with medical knowledge, really apply to this patient; you also have to consider the patient's personal situation, which you cannot get from the medical record, only from talking with the patient. [Comment: I think there is a piece of related work with exactly this focus.] Yes, and we published our work on this about three years ago. [Question: on the explanation part, you said you were able to explain 87%; is the reason you are not at 100% that the models differ?] No, it is because, I think, the remaining 14% of patients
are going to have bad outcomes for rare reasons, and those reasons are not discovered through association rules: an association rule has a support threshold, so it only covers the

common ones. [Applause] Thanks again. It is my pleasure to share our work on generating application-specific in-memory databases. The title is a little broad, but we are actually focusing on one type of application: database applications with an object-oriented programming interface, because many database applications are developed using object-oriented languages like Java, Python, or Ruby. Instead of embedding SQL queries in the program, people usually issue object queries through an object-oriented API and use an object-relational mapping framework to translate the object queries into SQL queries and convert the relational data back into objects. Examples of such applications are web applications, which usually use frameworks like Hibernate, Django, or Ruby on Rails. We profiled 12 very popular open-source applications built with Ruby on Rails, and surprisingly we found that they are pretty slow: with a small amount of data, usually less than 1 gigabyte, on average the slow pages take more than two seconds to load, and most of them spend over 80% of that time just querying the data. So why is this querying so slow, even with a small amount of data? There are three major reasons: the mismatched data model, predicates involving associated objects, and program-generated predicates. Before I go into the details of these causes, I will first give an overview of Chestnut, which is the name of our tool. It generates an application-specific in-memory database; it is just like a physical designer: it takes in a query workload and a memory budget, and it customizes the data layout such that the overall query time is minimized. But unlike other physical designers for relational databases, it is specific to database applications using an object-oriented programming interface, and it solves the issues I just introduced: basically, it uses a different storage model and a different way to generate query plans. Next, I will go
into the three causes of slow queries. The first is the data model: the problem here is the mismatch between how the application accesses the data and how the data is stored, which results in very slow data conversion. Let's take an example: assume we have a chatting application, just like Slack, so you have channels, activities in each channel, and users, and the application manages this data using three classes: the Channel class, the Activity class, and the User class. Assume we have a query that shows the top channels, and for each channel it includes the activities in the channel, and for each activity it includes the user who created that activity. The object query looks like this: it starts with the top class, Channel, includes the activities in each channel and, for each activity, the user, and then does an order-by and a limit. To answer this query, three relational queries are generated: basically three selections from the three tables, the channel table, the activity table, and the user table, and the result is three relations. But the object query requires the result as objects, so there is a process to translate this relational data into objects: the first step converts each tuple into an object; then it inserts each user into each activity; after that it groups the activities by their channel ID and inserts the activities into each channel as a nested array of objects. So even though the relational queries finish really quickly, within 1.7 seconds, the data conversion from relational data to objects is quite expensive, taking up to 55 seconds; the bottleneck is clearly in the data conversion. To solve this problem, Chestnut considers storing data non-relationally: it will consider a storage model that stores the data in the form of objects, with nested objects, just like this. Because the number of activities in each
channel is different so this storage model is not relational and with this storage model the data conversion is

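The relational-to-object translation just described (build objects, attach users to activities, then group activities by channel and nest them) can be sketched in a few lines of plain Ruby; the classes, field names, and rows here are invented for illustration, not the application's actual code:

```ruby
# Hypothetical rows returned by the three relational queries.
channels   = [{ id: 1, name: "general" }, { id: 2, name: "random" }]
activities = [{ id: 10, channel_id: 1, user_id: 100, type: "message" },
              { id: 11, channel_id: 1, user_id: 101, type: "join" },
              { id: 12, channel_id: 2, user_id: 100, type: "message" }]
users      = [{ id: 100, name: "alice" }, { id: 101, name: "bob" }]

# Step 1: attach the creating user to each activity.
users_by_id = users.to_h { |u| [u[:id], u] }
activities.each { |a| a[:user] = users_by_id[a[:user_id]] }

# Step 2: group activities by channel ID and nest them into each channel
# as an array of objects.
by_channel = activities.group_by { |a| a[:channel_id] }
channels.each { |c| c[:activities] = by_channel.fetch(c[:id], []) }

channels.first[:activities].map { |a| a[:user][:name] }  # => ["alice", "bob"]
```

On small inputs this is trivial, but over millions of rows these per-object allocations and group-by passes are exactly where the 55 seconds go.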
With this storage model, the data conversion is only from C++ objects to top-level Ruby objects. This is because the database is implemented in C++, and this conversion does not change the structure of the objects; it takes only 1.5 seconds to finish. So using this storage model can accelerate the query by 15x. The second cause of slow queries is query predicates that involve associated objects. First, consider a query that doesn't involve associated objects: a very simple query that selects active channels and orders them by ID. For this query we can create a partial index over only the active channels, with ID as the key; this is just a normal partial index as supported by relational databases. But what if we change the query so that we want to select the channels that contain a message activity, again ordered by ID? Can we still create a partial index, just as we did for the first query? With a relational database the answer is no, because in most relational databases the predicate and the key of a partial index can only involve fields of the table being indexed, and here the activities are not part of the channel table, so this partial index is not possible. Chestnut, however, considers such indexes: it extends the index syntax by allowing fields of associated objects to appear in the key and the predicate. For example, it can create an index on the channels that contain a message activity, or an index on channels with the activity ID as the key. The third problem is program-generated query predicates. In these object-oriented database applications, partial predicates are usually defined in multiple functions, and at runtime the function calls are chained to produce the final query predicate. The resulting predicates often contain overlapping and redundant sub-predicates, and in such cases a relational query optimizer has a hard time finding an optimized query plan.
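To make the second cause concrete: Chestnut's actual index syntax is not shown here, but a plain-Ruby stand-in for a "partial index whose predicate ranges over associated objects" might look like this (all class names and data are illustrative):

```ruby
Channel  = Struct.new(:id, :activities)
Activity = Struct.new(:id, :type)

channels = [
  Channel.new(3, [Activity.new(1, "join")]),
  Channel.new(1, [Activity.new(2, "message")]),
  Channel.new(2, [Activity.new(3, "message"), Activity.new(4, "leave")])
]

# A "partial index" keyed on channel ID, restricted to channels that contain
# at least one message activity. The predicate refers to the *associated*
# Activity objects, which an ordinary relational partial index cannot do,
# since activities live in a different table from the one being indexed.
message_channel_index =
  channels.select { |c| c.activities.any? { |a| a.type == "message" } }
          .sort_by(&:id)

message_channel_index.map(&:id)  # => [1, 2]
```

Because Chestnut already stores activities nested inside their channels, maintaining such an index is a local operation rather than a cross-table one.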
Let's see an example. Say we have a web page showing join or leave activities, which are also non-message activities, that were created or updated recently. The query's predicate says: type is not message, and type is join, and created or updated is later than a certain time. Assume we have two indexes available: the first is a composite index created on the fields type and created, and the second is also a composite index, on type and updated. If we give this query to Postgres, can you guess whether it will find a plan that uses these two indexes? Seemingly it could use them to answer this query, but the answer is actually no: Postgres generates a sequential-scan plan, which takes 2.6 seconds to finish. Notice, though, that the predicate "type is not message" is actually redundant here. If we rewrite the query into an equivalent one that removes the redundant predicate, this time Postgres can generate a plan that does use the two indexes, and that plan takes only 0.5 seconds to finish. This means the first query could in fact have leveraged the indexes, but the existing query optimizer is unable to do so because of the redundant predicate. Most query optimizers use rules to rewrite the query predicate and to determine which indexes to use and how to use them. Chestnut instead takes a different approach: it enumerates plans, from small plans to large plans. A small plan, like the first one here, is just an index scan, and Chestnut enumerates many different plans with different parameters passed to that scan. It also enumerates from small plans to larger ones; for example, the plan at the bottom has four index scans and then unions the results. So it enumerates many plans, and

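The redundancy in that predicate can be checked directly. Here is a small Ruby sketch, with invented rows and timestamps, showing that dropping the conjunct implied by "type == join" does not change the result:

```ruby
require "time"

cutoff = Time.parse("2020-01-01")
rows = [
  { type: "join",    created: Time.parse("2020-06-01"), updated: Time.parse("2020-06-01") },
  { type: "message", created: Time.parse("2020-06-02"), updated: Time.parse("2020-06-02") },
  { type: "leave",   created: Time.parse("2019-01-01"), updated: Time.parse("2019-01-01") }
]

# Original, program-generated predicate: the "type != message" conjunct is
# implied by "type == join" and is therefore redundant.
original = rows.select do |r|
  r[:type] != "message" && r[:type] == "join" &&
    (r[:created] > cutoff || r[:updated] > cutoff)
end

# Simplified, equivalent predicate with the redundant conjunct removed.
simplified = rows.select do |r|
  r[:type] == "join" && (r[:created] > cutoff || r[:updated] > cutoff)
end

original == simplified  # => true
```

A human can see the equivalence instantly; the point of the talk is that a rule-based optimizer may not, which is why Chestnut verifies candidate plans with a solver instead of relying on rewrite rules.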
obviously, in this step many plans are actually invalid: they cannot answer the query. So Chestnut uses program verification to verify whether each plan is valid or not, basically via symbolic execution. Symbolic execution is a process where you create symbolic tables, tables that have symbolic values instead of concrete values. When you run the query on a symbolic table you get an expression, and when you run the query plan you get another expression; it then uses a solver to check whether these two expressions are equivalent over all possible values of the symbolic tables. If yes, the plan is valid. So it is able to tell that the first plan, using only one index scan, cannot answer the query (it's an incorrect plan), while the second can. Of course, this enumeration process is slower than existing query optimizers, but it is able to find good plans, for example plans that use the indexes even when the query contains redundant predicates. And although it is slower than existing query optimizers, that is tolerable as long as the optimizer runs offline, and Chestnut is an offline physical designer. This figure shows the workflow of Chestnut. It takes in the query workload and a memory budget. First it enumerates storage models and the plans for each query; then it uses heuristics to prune the plans; after that it formulates the physical design problem as an integer linear programming problem. Basically, the constraints are that each query uses some data structures and that the data structures used fit within the memory budget, and the goal is to minimize the overall query time. It uses an external solver to solve this ILP problem, and the result tells us which storage model to use and which plan every query uses. Based on the result, it generates C++ code, and this code implements an in-memory database engine that is
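Chestnut delegates the real optimization to an external ILP solver. As a toy stand-in for "choose data structures that fit the memory budget and minimize total query time", a brute-force version might look like this (all structure names, costs, and the two-query workload are invented for illustration):

```ruby
# Each candidate structure: its memory cost, and the query time it yields
# for the one query it serves (the real ILP has per-query variables).
structures = {
  nested_objects: { mem: 6, time: 1 },
  flat_rows:      { mem: 2, time: 5 },
  assoc_index:    { mem: 3, time: 2 },
  no_index:       { mem: 0, time: 9 }
}

budget = 8

# Query 1 can be answered by nested_objects or flat_rows;
# query 2 by assoc_index or no_index.
choices = [%i[nested_objects flat_rows], %i[assoc_index no_index]]

best = choices[0].product(choices[1])
                 .select { |picks| picks.sum { |p| structures[p][:mem] } <= budget }
                 .min_by { |picks| picks.sum { |p| structures[p][:time] } }

best  # => [:flat_rows, :assoc_index]
```

Note how the budget forces a trade-off: the fastest structure for query 1 (nested_objects) cannot be combined with the index for query 2 without exceeding 8 units of memory, so the global optimum picks the cheaper layout for query 1. An ILP solver finds this kind of optimum without enumerating every combination.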
designed specifically for this workload. Now for the evaluation. We evaluated three open-source web applications, and we compared against three in-memory database engines, including the original application settings, which use MySQL and Postgres; but we used an automatic indexer to fill in missing indexes that may not have been created in the original applications, and we also used HyPer with the automatic indexer as a baseline. This figure shows the average query time of these three applications, relative to the original setting, which means that shorter bars are better. We can see that the database Chestnut generates has much shorter query times than the other databases, and the number at the top shows the speedup over HyPer, which gives the best performance among the relational databases. Note that the shaded area shows the portion spent on data conversion, meaning converting from relational data into objects, and this portion actually takes a large fraction of the query time. Because Chestnut uses a different storage model, it can significantly accelerate this part. And for all three of these applications, Chestnut is able to generate the database within one hour. To conclude, we introduced a tool, Chestnut, that generates an in-memory, application-specific database. It takes in a workload and a memory budget and customizes the data layout to optimize overall query performance. It uses a non-relational storage model that stores data as objects and nested objects, it extends the indexing syntax, and it uses verification-based plan enumeration to find good plans even when redundant or overlapping query predicates are involved. And we showed that it achieves significant speedups on real-world web applications. Thanks. Any questions? How do you handle the evolution of the workload? So currently it's a static process, because we take the application, and we assume we can get the source code of the application, so we

analyze the source code and collect all the queries. So we do not handle dynamically changing workloads; it's an offline, static analysis. But even then, let's say your application itself is evolving, right, then you have to rerun it? Yes: because it's an in-memory database, if the application evolves you rerun the tool and generate a new database, and the new database will load the data from the backend. It also uses a relational database as a back end, but that back end only persists the data without answering queries, so you rebuild the in-memory database by loading the data from it. Sorry, I'm not sure I understand correctly, but your question is: if you already have, say, a Postgres database, can Chestnut translate a query into SQL and issue it to Postgres? Chestnut is a query system: it handles all the queries from the application, but it holds the data in memory. It still needs a back-end store to persist all the data, to make sure your data is not lost, but it serves as a kind of middle tier that answers all the queries much faster; if you have a read query, you don't need to go through the back-end database. It's a way to hold the data in memory, stored in a different form, such that the data can answer the queries more efficiently. I understand; I'm just wondering, if we do have a database right now in Postgres and we want to translate it into a setting where we can query much faster using Chestnut, would it be able to translate that, or make up some database? We are not routing the queries to the back-end database; we assume the workload contains all the queries and that we can answer all of them using the generated engine, because if you have an unseen query that is not part of the workload, then it will go to the
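The middle-tier behavior described in this answer, serving known queries from memory and falling through to the backend for anything outside the compiled workload, can be sketched roughly like this (the class and its API are hypothetical, not Chestnut's actual interface):

```ruby
# Toy sketch of a query middle tier: queries in the compiled workload are
# answered by in-memory handlers; unseen queries fall through to the
# backing relational database. Names and lambdas are illustrative only.
class MiddleTier
  def initialize(in_memory, backend)
    @in_memory = in_memory   # { query_name => handler lambda }
    @backend   = backend     # fallback lambda for unseen queries
  end

  def run(query_name, *args)
    handler = @in_memory[query_name]
    handler ? handler.call(*args) : @backend.call(query_name, *args)
  end
end

store = { top_channels: ->(n) { (1..n).to_a } }
tier  = MiddleTier.new(store, ->(name, *) { "sent #{name} to backend" })

tier.run(:top_channels, 3)   # => [1, 2, 3]
tier.run(:ad_hoc_report)     # => "sent ad_hoc_report to backend"
```

The key property is that the hot, known read path never touches the relational back end; the back end exists for durability and for queries outside the workload.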
back-end database. Last question: on slide 15, you talked about the indexes. For that first query, Postgres doesn't use the index; do you know why, since you have the source code? Yes, so, this is a hard question. I tried to read the Postgres source code, but it was too complicated for me to fully understand, so I tried different approaches, and it turns out that it's the cardinality estimation that has the problem. Postgres will search for plans that use the index, but because of the redundant predicate, which greatly affects the cardinality and cost estimation, it determines that the plan with the index has a very high cost, and so it instead uses a sequential scan. I don't know exactly how it performs the estimation, and I believe it knows it could use the index for this query, but we can discuss it offline. It's a good question. So I think all the speakers will be available now during lunch; we'll have a full stretch session during lunch as well. Phil has a bunch of logistics emails, but not much, really. Lunch is out back; we want to stay on schedule, and if we get out early we should be back at 1:15. In addition to all the drinks out there, if you go through these doors to the end there's a refrigerator full of soft drinks and other things to drink. For the afternoon, please come up here and sign the video release form, and check that your laptop works.