All right, so welcome back to the NIH Library bibliometrics training series. My name is Chris Belter, and I'll be co-presenting today with Ya-Ling Lu. This is the final session in our 13-part series, so congratulations for making it this far.

What we want to do today is a kind of wrap-up session. We'll go back over a little bit of what we've covered so far and revisit our standard report template, walking through it to see where everything fits, where we're getting the data from, and how it all comes together into what a final product actually looks like. Then we'll talk briefly about the ways we go off script: how we modify that template, or move away from it, depending on what our customers need. We'll then spend probably the majority of the time telling our stories about how we learned to do bibliometrics and to provide these kinds of services, along with some of the lessons we learned along the way, just to give you some ideas of things we picked up as we went. And then we'll close with a couple of final thoughts to wrap up the whole thing.

So let's start by talking a little bit about the library's standard report. We'll walk through it page by page to give you a sense of what it looks like, what we're doing on each page, and where we're getting the data, again just to show you where all of the pieces we've talked about over the past 13 courses fit together into a final product.

The first page of the report is the introduction and methods. The report I'll be talking about is, again, related to the Fogarty International Center papers, the same data set we've been using all along, so you can see what a final product would actually look like for this particular data set. The introduction page is really about the methodology we used, and a little bit about some of the caveats, although we don't go into a whole lot of detail about those; it's more focused on the methodology, the data sources, and the analytical techniques and methods, to introduce what we did and what's going to be in the report. We also include a standard disclaimer at the end of this page in every report we produce, talking about the limitations of the method itself. There's standard language we use saying that there are limitations to the bibliometric approach, and we recommend that people never use bibliometrics as the sole decision-making criterion; it should always be used in combination with other forms of evaluation, such as peer review, to actually make decisions. We put that on every report we produce as a way of recommending how the report ought to be interpreted and how we recommend people use it. We're thinking about expanding that to touch on limitations in other areas, but we haven't gotten to that yet, so we lean on this disclaimer paragraph to do a lot of that work, and then mention some of the limitations in more detail either in the email we send or in a follow-up report or meeting.

We organized the report around the major themes of productivity, collaboration, topics, and citation impact, and we start with productivity. Most of the productivity charts and graphs come directly from the Web of Science or Scopus interface. When we're working with the publication set in whatever database we happen to be using, we can use the analyze results features to pull out things like the document types and the number of publications per year, per subject category, and per journal. A lot of that is built into the databases, so a lot of it comes directly from the database interface itself, or in some cases we will export the data and do the analyses from the exported data. It depends on what you're interested in, but either way works.
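To make that concrete, here is a minimal R sketch of the kind of counting we do when we work from an exported file instead of the database interface. The file name and the column names (PY for publication year, DT for document type) are assumptions about the export, not fixed field names.

```r
# Minimal sketch: productivity counts from an exported publication list.
# Assumes a CSV export with columns PY (publication year) and DT (document type);
# the actual column names depend on the database and export settings.
library(readr)
library(dplyr)

pubs <- read_csv("exported_records.csv")

# Publications per year
pubs_per_year <- pubs %>%
  count(PY, name = "n_pubs") %>%
  arrange(PY)

# Publications per document type
pubs_per_type <- pubs %>%
  count(DT, name = "n_pubs") %>%
  arrange(desc(n_pubs))
```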

Again, though, a lot of this comes directly from the database interface itself. There are other things we might also be interested in looking at here; sometimes we also look at the number of publications per MeSH term, publications per funding agency, and a bunch of other things, but those aren't in the standard report. Most of that comes up in custom analyses that we work on in collaboration with patrons.

We then move into the collaboration analyses, and we do this in two parts: the first collaboration page, page 3, is focused at the institutional level, and the second part is focused on the individual author level. The analyses on the institutional page are based on the cleaned version of the organizational affiliations, so the co-authorship rates, the collaborations per institution, and the co-authorship network are all based on that cleaned version of the affiliation data. We pull it down, parse it with Sci2, clean it with OpenRefine, and then do the counts of publications per institution based on that cleaned affiliation data. We also estimate the co-authorship rates by searching within that field for things like "university": the number of records with "university" in that field divided by the total gives you the collaboration proportion with universities, and we do similar things for collaboration outside of the particular institution. If there's a delimiter in the cleaned affiliation field, that's an indication of some form of collaboration, so that's where those rates come from. For the publications per institution chart, we just count the number of publications per institution within that cleaned affiliation set. The co-authorship network is the more involved process, where we go back and create the co-authorship network in Sci2, visualize it with Gephi, clean it up in Inkscape, and then export the final result, which is the image you would actually see here. So again, all of this is based on the cleaned affiliation data from the data exported from whatever database we happen to be using.

The second part of the collaboration analyses is the individual co-authorship analysis. This is again based on the cleaned version of the data, but this time the author names, so we'll export them, clean them in OpenRefine, and then run through the exact same process for creating the network and putting it into the report. It's a series of steps to get to this one image: we get the data from whatever database, clean it in whatever tool we're using, create the network in Sci2, visualize it in Gephi, clean it up in Inkscape, and then create the final version. Most of these things are based on the cleaned versions of the data, so that has to happen first in order for us to have accuracy and confidence in the final co-authorship network.
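Here is a rough sketch, in R, of the affiliation-based counts just described. It assumes a data frame with one row per paper, a paper_id column, and a cleaned_affiliations column in which multiple institutions are separated by a semicolon; those names and that delimiter are assumptions for illustration, not our exact workflow.

```r
# Minimal sketch of affiliation-based collaboration counts from cleaned data.
library(dplyr)
library(tidyr)
library(stringr)

# Share of papers with at least one university co-author
univ_rate <- mean(str_detect(tolower(pubs$cleaned_affiliations), "university"))

# Share of papers with any collaboration (more than one institution in the field,
# indicated by the presence of the delimiter)
collab_rate <- mean(str_detect(pubs$cleaned_affiliations, ";"))

# Publications per institution: split the field and count each institution once per paper
pubs_per_institution <- pubs %>%
  separate_rows(cleaned_affiliations, sep = ";\\s*") %>%
  distinct(paper_id, cleaned_affiliations) %>%
  count(cleaned_affiliations, name = "n_pubs") %>%
  arrange(desc(n_pubs))
```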
The next page of the analysis is our research topics page. Here the bibliographic coupling network is the article topic map on the left, and the charts on the right are based on that bibliographic coupling network. We'll pull the publications from Web of Science, create the bibliographic coupling network, and export all of the results once they've been clustered, as well as the topic map, from Gephi. Once we have that exported node list with the topic areas assigned by the clustering algorithm, we can go in, name the research topics, and count up either the publications per topic, which is the chart in the upper right, or the publications per topic per year, which is in the bottom right. Because we used that aggregate function file, we have both the article titles and the publication years in the bibliographic coupling network we're creating in Gephi, so when we export the node table we have the articles, the years, the titles, and the clusters they belong to, which is all of the information we need for these kinds of analyses. So once we export that list of publications, we'll go in and look at the publications in each topic area, or in each cluster, to get a sense for what they have in common; we'll name the topic and go from there.
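As a minimal illustration, counting publications per topic and per topic per year from the exported node table might look something like this in R; the file and column names (title, pub_year, modularity_class) and the example topic names are assumptions for illustration.

```r
# Minimal sketch: topic counts from the node table exported from Gephi.
# Assumes columns title, pub_year, and modularity_class (the cluster assigned by
# the algorithm), plus a lookup from cluster number to the topic name we chose
# after reviewing the papers.
library(readr)
library(dplyr)
library(tibble)

nodes <- read_csv("bibliographic_coupling_nodes.csv")

topic_names <- tribble(
  ~modularity_class, ~topic,
  0, "Influenza and viral genetics",  # hypothetical names for illustration
  1, "Malaria"
)

nodes <- left_join(nodes, topic_names, by = "modularity_class")

# Publications per topic
pubs_per_topic <- count(nodes, topic, name = "n_pubs")

# Publications per topic per year
pubs_per_topic_year <- count(nodes, topic, pub_year, name = "n_pubs")
```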

Within the NIH Library we will also sometimes do an extra step that doesn't actually get included in the report but helps us understand what these topics are. We will export that node list that has all of the publication titles in it and then create a word co-occurrence network, where we calculate the modularity class for each word. Essentially, we build the word co-occurrence network and assign words to topics based on which topic each word occurs in most frequently. That gives us a list of terms that appear in each one of those topic areas, which gives us a head start on what each cluster is probably about. If in cluster five we see influenza, virus, genetic, and genomic as terms frequently occurring in that particular topic, that gives us an indication that it's probably about influenza and viral genetics, whereas in topic two we might see much more about malaria. So we start to get a sense for what these topics are based on that word co-occurrence network. It doesn't get included in the actual report, but it gives us more to go on when we're trying to decide what the topic areas actually are. It's a separate workflow, and it isn't strictly necessary, but it certainly speeds up the process. So again, we'll do the bibliographic coupling network, then do the word co-occurrence network, where we assign terms to those clusters, and then figure out what the clusters should be named from there.
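As a simplified stand-in for that word co-occurrence step, here is a rough R sketch that assigns each title word to the cluster it occurs in most often and lists the top terms per cluster; it skips the actual network and modularity calculation and reuses the nodes table from the previous sketch.

```r
# Simplified stand-in for the word co-occurrence shortcut: tokenize the article
# titles from the node table and, for each term, find the cluster in which it
# occurs most often, giving a rough list of characteristic terms per cluster.
library(dplyr)
library(tidytext)

term_by_cluster <- nodes %>%                  # nodes from the previous sketch
  select(title, modularity_class) %>%
  unnest_tokens(word, title) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, modularity_class, name = "n_occurrences")

# For each term, the cluster where it appears most frequently
term_home_cluster <- term_by_cluster %>%
  group_by(word) %>%
  slice_max(n_occurrences, n = 1, with_ties = FALSE) %>%
  ungroup()

# Top characteristic terms per cluster, a head start for naming the topics
top_terms <- term_home_cluster %>%
  group_by(modularity_class) %>%
  slice_max(n_occurrences, n = 10) %>%
  arrange(modularity_class, desc(n_occurrences))
```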
Once we have all of that done, we then move into the citation impact analyses. We grab the standard bibliometric indicators, mostly from InCites in our internal working environment, where we get the total publications, citations, citations per year, those types of things; we may also get those from the Create Citation Report feature in Web of Science, it just depends. The percentile rank information, though, all comes from InCites. Once we have the publication set in Web of Science, we'll export it over into InCites and then download the data from InCites in the way Ya-Ling showed you in the citation impact sessions, where we download the actual percentile ranks. We then assign the individual percentile numbers to classes: if it's below one we'll call it top 1%, if it's below ten we'll call it top 10%, and so on. We assign those percentile rank classes based on the percentile numbers we get from InCites and then aggregate them to say how many of those papers are in the top 1%, top 10%, and so forth. Once we have that, we can also count up the number of publications per subject category, and we can do the number of publications per percentile rank per institution. If we merge the InCites data back with our original data set, we can use the cleaned organizational affiliation information to break that down by individual collaborating institution, so we can count up the number of papers per percentile class per institution. So we'll merge the InCites data back into the original Web of Science data that we cleaned, so we have the cleaned version of the affiliations to work with, and that way we can count up the number of publications per percentile rank per institution.

The next page is part two of the citation impact, where we look at citation impact by individual authors. This is again where we're merging the data we get from InCites back into the original data set, so we can use the cleaned version of the author names. That's very important: to get the numbers of citations and publications right, and to have accuracy in this particular analysis, we need to merge the InCites data back onto those cleaned author names. So we'll use the Web of Science accession number to merge the InCites data with our original data set, and once that merge is done, we can sum up the total number of publications, times cited, and percentile ranks per individual author in our data set. Again, we're merging that back with the original cleaned version of the data set so we can have confidence in this particular set of analyses.
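Here is a minimal sketch of that merge-and-classify step in R. The data frame names (wos_clean, incites) and column names (accession_number, percentile, times_cited, cleaned_authors) are assumptions about the cleaned exports, and the class breaks are illustrative.

```r
# Minimal sketch: merge InCites percentile data onto the cleaned Web of Science
# records by accession number, bin percentiles into rank classes, and summarize
# per author.
library(dplyr)
library(tidyr)

merged <- wos_clean %>%                        # cleaned Web of Science records
  left_join(incites, by = "accession_number")  # InCites download with percentiles

# Percentile rank classes: below 1 -> top 1%, below 10 -> top 10%, and so on
merged <- merged %>%
  mutate(rank_class = cut(percentile,
                          breaks = c(0, 1, 10, 25, 50, 100),
                          labels = c("Top 1%", "Top 10%", "Top 25%",
                                     "Top 50%", "Bottom 50%"),
                          include.lowest = TRUE))

# Papers per rank class for the whole set
count(merged, rank_class)

# Per-author totals, using the cleaned author names (authors separated by "; ")
per_author <- merged %>%
  separate_rows(cleaned_authors, sep = ";\\s*") %>%
  group_by(cleaned_authors) %>%
  summarise(n_pubs = n(),
            total_citations = sum(times_cited, na.rm = TRUE),
            n_top10 = sum(percentile < 10, na.rm = TRUE))
```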

The final pages in the report are our notes and explanations, and this is where we go into a little more detail about the things we just presented. We'll go into more explanation and notes about how things are done: where the Web of Science subject categories come from, how they're assigned, what they're based on, information about the co-authorship networks and how to interpret them, and some of the filters we applied to those networks. Most of the time, with the collaboration networks, we'll filter by either the number of co-authored works or the number of total authored publications per institution or author, and we'll document what those filters are here in the notes so that people are aware of what they're looking at and what we did. We can also talk more about what the topic map is; as I mentioned in the topic analysis course, the bibliographic coupling network needs a lot of explanation, and this is where we do that in our reports, in the notes about the article topic map. We can talk in more detail about what the dots are, what the lines are, how to interpret it, and some of the filtering options we used. So we do a lot of that explanatory work here in the notes.

The second notes page is again more notes, this time about the citation impact analyses. We'll talk a little bit about the actual percentile rank values, where those come from and how they're calculated, and about some of the other metrics in here, like the h-index. One thing we find a lot is that people get very confused about data set h-indexes: they'll think what we're talking about is the average h-index of all the authors in the data set, and that's not what we're actually doing. So when we present a data set h-index, we'll go into a little bit of explanation about what we're actually measuring and how that h-index is calculated for a data set rather than for an author, just to make clear what we're measuring. We'll also talk a little about the subject categories, the collaborating institutions, the filtering options we set, all of those types of things. And we'll talk about the individual authors, where we'll stress that the authorship analyses are based only on the publications in this particular set of years; they're not for an author's entire career. Again, it's about making it very clear what we're talking about. So we do a lot of this explanatory work here in the notes, to make people aware of some of the limitations. Obviously we can't cover everything here, but we'll at least cover the things that are relevant to understanding the charts and graphs presented in the earlier parts of the report.

And then finally we always close with a little bit of information about the authors: who we are, what we look like, contact information, a little bit about our bios, those types of things, again just to put a face with a name and let people know that we work here, we're available, here's some of our expertise, and if you're interested in getting in contact with us, here's the contact information. We do this both to say here's who we are, but also because these reports have a habit of ending up in other people's hands. When we give a report to somebody, they will often give it to other people without our knowledge, which is fine, it's great, but when that sort of transfer takes place it's nice to say here's who we are and here's how to contact us, so that if people get hold of a report and want to contact us, either for a new analysis or to ask questions about this particular analysis, they have the contact information to do so. I've had a number of people contact me that way, saying, hey, I got hold of this report, I had a question about it, and the contact information was right there in the report. So it's always a good idea to include some of that information to make that connection possible.
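Since the data set h-index mentioned in the notes causes so much confusion, here is a minimal sketch of how it is computed for a set of papers, as opposed to averaging the authors' individual h-indexes.

```r
# Minimal sketch of the data set h-index: the largest h such that at least h papers
# in the set have at least h citations each. Computed for the publication set as a
# whole, not averaged over the authors' individual h-indexes.
dataset_h_index <- function(times_cited) {
  cites <- sort(times_cited, decreasing = TRUE)
  sum(cites >= seq_along(cites))
}

# Example: a small set of citation counts
dataset_h_index(c(25, 14, 9, 6, 3, 1))  # returns 4
```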

So that was a very quick introduction to what the report template looks like, where we're getting the data, and where all of those pieces fit into a larger report. Now I want to talk a little bit about how we modify that report in certain situations and based on feedback from other people. When we're working with customers, they often have their own questions and their own ideas about what they're interested in, so we often use our standard report template as a guideline to focus some of those conversations. These are ways we'll modify it in response to some of those customer requests.

One deviation we do a lot at NIH is to look at funding agencies. At NIH we are very interested in who funds research, who funds research in different areas, and who collaborates on funding or co-funds particular research, things like that. Practically speaking, most of these funding analyses come from PubMed rather than Web of Science or Scopus, just because the PubMed funding data is at least standardized, so we don't have to do the kind of cleaning on the funding agencies that we might have to do with author affiliations and author names. That has the advantage that we don't have to clean it, but it also has the disadvantage that PubMed only covers certain funding agencies; it's really only accurate for NIH funding agencies and a very select list of international funders. So we have to be very clear that, if we're doing these kinds of funding analyses, we're really only looking at certain funding agencies; we're not, for example, looking at anything from China or Asia in general, anything from Latin America, or anything from Germany or a number of other countries. There are definite limitations to using this, but even so there are certain advantages within NIH: we're very interested in which ICs tend to fund research in which institutions, or which institutes and centers tend to co-fund research. The co-funding network you're looking at here on the right is based on grants that are co-cited in papers: if a single paper acknowledges a grant from both NHLBI and NINDS, we can create a link between those two ICs to say these are co-funded articles between those two ICs. We can build that in an almost identical way to our affiliation co-authorship networks, to look at these co-funding networks and see not only what the funding agencies are, but also where the co-funding happens, and maybe where there are opportunities for co-funding. So that's one way we deviate from the template. We'll also sometimes look at articles per funding agency per topic, citation impact per funding agency within a document set, and all kinds of other elaborations like that.
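A minimal sketch of how that co-funding edge list can be built, assuming a table named paper_funders with one row per paper and funding agency pair (columns pmid and agency), for example parsed from PubMed grant data; the names are assumptions for illustration.

```r
# Minimal sketch: co-funding edge list, linking agencies acknowledged on the same paper.
library(dplyr)

cofunding_edges <- paper_funders %>%
  distinct(pmid, agency) %>%
  inner_join(paper_funders %>% distinct(pmid, agency),
             by = "pmid", suffix = c("_a", "_b")) %>%
  filter(agency_a < agency_b) %>%             # keep each unordered pair once
  count(agency_a, agency_b, name = "n_cofunded_papers") %>%
  arrange(desc(n_cofunded_papers))
```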
We also do a number of deviations in the co-authorship and collaboration analyses. Some of them involve adding additional metadata on top of an affiliation co-authorship analysis. One example here is building an institutional co-authorship network from biomedical articles from Peru, where we were also interested in how many of those affiliations were actually from Peru versus the United States versus other countries, so that we could look at where those collaboration patterns are actually happening: whether there's collaboration happening within Peru, or whether it's more a single institution in Peru collaborating with somebody in the United States, which then goes back to another affiliation in Peru. So we can look at some of these other collaboration patterns by adding additional information on top of the network. We can also look at country-level collaboration, so Peru plus USA, or Peru's collaborations with China, Brazil, and other countries, all kinds of things like that.

We can also do elaborations of individual-level collaboration networks, where we're looking at the affiliations of those individual authors, or we can add other information, such as what topics those authors tend to publish on. We can color code an individual-level co-authorship network based on the topic areas that people tend to work on, and see whether people tend to collaborate with people who work on the same topic or whether they work across topics. We can also color code people based on their affiliations, and all kinds of other things.

Probably the area where we deviate the most is research topics; we do things differently in a lot of ways. Sometimes we're looking at MeSH terms rather than the publication data itself, so a lot of times we'll look at the number of publications per MeSH term, or publications per MeSH term per year, to get another look at a publication set that's a little more consistent over time. Sometimes we'll also use text mining approaches rather than citation-based approaches. As I mentioned in my course on topic analysis, there are these two branches of topic analyses: citation-based methods and text-based methods. Citation-based methods are great for certain situations but really bad for others. Using citation-based methods has two major limitations. One is that it requires very clean citation data: you need citation information for everything in your data set, in the exact same citation format for everything in your data set. If you're getting publications from Scopus, for instance, the citation data you get from Scopus is not clean enough to actually create a citation network, because there are variants of the citation formats; or if you're working with grant data, you don't necessarily have citation data to begin with. So that forces you to move over to other types of topic analyses. The other major limitation of citation-based approaches is scale. When you start talking about more than about 5,000 publications, it becomes very difficult to work with, because it requires a lot of computing power to create one of these citation networks. You become limited by the amount of computing power you have available, and also by the software you're using, because you have to modify Sci2 to increase its memory allocation, and that becomes problematic in certain situations. So again, at more than about five thousand papers you start running into these issues. For a lot of those reasons I've moved much more into text mining approaches to get around these things. With text mining approaches you don't have that citation accuracy problem, because you're not working with citations, you're working with text, so you can theoretically apply it to grants, to Scopus abstracts, to PubMed abstracts, to just about anything. The other advantage is scale: it scales very well. What you're looking at here is actually the result of a topic modeling algorithm run on about fifty thousand publications by NIH authors over about a five-year time period, so we're talking about an order of magnitude more than what we were able to do with citation-based methods, in a relatively compact and easy-to-use way. With this type of approach, the individual dots correspond to research topics rather than publications, and they're connected based on whether publications in the data set were about both of the connected topics. With this particular algorithm you can assign publications to multiple topics; it's not a one-to-one thing. If a paper is about both, say, genetic variants and clinical trials, you can assign it to both of those subject categories; it doesn't have to be shoehorned into one or the other. So that's another advantage of this particular approach: you can assign a publication to multiple categories.
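The specific topic modeling algorithm isn't covered here, but as a rough illustration of the text-based approach, here is a minimal LDA sketch on abstracts using the topicmodels package; the number of topics, the probability threshold for assigning a paper to multiple topics, and the column names (paper_id, abstract) are arbitrary choices for illustration, not our production workflow.

```r
# Rough illustration of a text-based topic model on abstracts: fit LDA and let a
# paper belong to every topic above a probability threshold, so papers are not
# shoehorned into a single topic.
library(dplyr)
library(tidytext)
library(topicmodels)

# Document-term matrix from abstracts (assumes columns paper_id and abstract)
dtm <- pubs %>%
  unnest_tokens(word, abstract) %>%
  anti_join(stop_words, by = "word") %>%
  count(paper_id, word) %>%
  cast_dtm(paper_id, word, n)

lda_fit <- LDA(dtm, k = 20, control = list(seed = 1234))  # k chosen for illustration

# Paper-topic probabilities; assign a paper to every topic with probability > 0.2
doc_topics <- posterior(lda_fit)$topics
assignments <- which(doc_topics > 0.2, arr.ind = TRUE)    # (paper, topic) pairs
```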

Then we can get a similar kind of visualization out of it, where we can look at the topic structure that emerges from the set of publications and do the same kinds of things: publications per topic, publications per topic per year, and all of those other things, because the results of the algorithm give you that type of data. So it's just another way we vary the approaches we're using based on the limitations of the particular data set we happen to be working with.

In the arena of citation impact, probably the most common thing we get asked to do is benchmark a particular laboratory or institution against other laboratories or institutions; people want to compare what their citation impact numbers look like as compared to other institutions. We get a lot of that. The way we tend to do that is again with these percentile ranks, because they control for a lot of the variation in citation volumes per year and per subject category and all the rest, which makes those comparisons a little more accurate. It becomes problematic, though, because not every research organization is the same, so actually getting an accurate comparator set is a real challenge, frankly. We can make these citation comparisons, but whether or not we're actually making an apples-to-apples comparison depends not only on the citation impact measures but also on the characteristics of the organizations themselves. So there are always caveats and concerns to build into this kind of benchmarking exercise, but we do it on occasion and we can do it. In this case we were looking at all the publications by all of the institutes and centers at NIH, which are always very interested in how they compare to other institutes and centers at NIH. This would be the result of that comparison; I have removed all of the institute names just for purposes of anonymity, because I don't want people making those comparisons without an understanding of how these metrics were actually generated, but this would be one of the ways we would do this kind of comparison.

So that's a little bit about our reports, about how we do things and how we modify that report in various ways. Now we're going to shift gears and start talking about our learning processes. What we're really trying to do here is talk about how we learned to do these things, share some of our lessons learned, and give you an idea of what our process looked like along the way.

I graduated from library school in the fall of 2009 and started immediately out of library school working for the National Oceanic and Atmospheric Administration. I was hired to work for one of NOAA's grant funding agencies called the Office of Ocean Exploration and Research, or OER. These are some really great folks, and their mission is to quite literally explore the ocean: the vast majority of the ocean floor has not been explored, and nobody has been to at least half of it, so their job was literally to go out and see what was there. It's a very cool mission, but because it's so exploratory, people in the scientific realm didn't really take them seriously. They thought OER wasn't doing real science, that they were just going out on these explorations to see what was there and not actually doing science while they were there. So they hired me essentially to get a sense of what publications had actually come out of these expeditions, to be able to say, no, look, we actually do science too; we're not just exploring. They hired me to figure out what those publications were, assign them to particular expeditions, do all of that kind of stuff, and basically maintain that publication bibliography for them moving forward.

So I spent the first couple of months getting all of that in order. I set up a whole bunch of search alerts to alert me to new publications, and then I kind of sat back and said, all right, what else can I do? They had hired me as a contractor, and they were paying my salary in full out of their discretionary funding. As a contractor being paid with discretionary funding, they could decide to just not fund me, and I would be out of a job, so my job security depended on giving them good value for the money they were paying to fund my salary. I was always looking for ways to add value to what I was already doing, and one way to do that was to look at the bibliometric aspects of the publication set I had just created. Once I had created all of that, I could then say, all right, are these papers having a citation impact, to again say, yes, we're doing real science, and this real science is actually being used in other scientific research, essentially to help them keep their funding.

In 2010 we contracted out with a bibliometric analysis firm to basically do that research for us, and the problem was they were only able to successfully match about 80% of the publications that I knew were in their database. So they weren't able to match one publication in five, which, with a pretty small data set of only about four or five hundred publications, really just wasn't useful; it was not good enough. So I thought, well, I can at least match the publications myself, maybe I should just do this analysis. And that's what I did. I suggested to them, you know, maybe I can go ahead and do this out of the funds you're already paying; you wouldn't have to pay this contracting company, you would get a full set of citation indicators, and I could just add this to the services I provide to you. They were thrilled and said, great, go do that. So that's how I really got started in bibliometrics: doing that kind of citation impact work for this particular program.

Then I challenged myself that every time I updated it, I wanted to add something. They wanted updates to this report on a quarterly basis, to have accurate and updated numbers, so I challenged myself to add something to the report each new quarter, whether it was making a chart look a little nicer, doing a different cut at a new metric, or adding a new analysis, whatever it happened to be. I wanted to add things as I went with this particular publication set, both to add value to my services to OER and also to build my own skills in analytics, because I saw this as a real opportunity for myself moving forward. So this time period from 2010 to 2011 was really about building skills in citation impact measurement, and as I was starting to do this, I started getting questions from other laboratories who also wanted to do similar things. What you're looking at here is an actual figure from one of my reports in 2011. I'm very embarrassed by it looking back on it now, but it is what it was, and that's what I was able to do at the time: a percentile rank comparison for one of the laboratories at NOAA that they then used for their own laboratory evaluation.

So again, I was able to build my skills and use those skills in other areas as I was learning all of this. There were really no courses, no books, no trainings, no listservs; there really weren't any resources available to me in this time period for learning how to do this stuff. So I was learning completely by reading journal articles, which is a terrible way to learn anything, because they always take things for granted: they'll just mention something in passing as if everybody knows it, and, well, I didn't know that, so I'd have to go back and start learning about that thing too, like, oh, now I have to learn about this other thing before I can even understand the first one.

So it was this whole long process of trying to figure out how to do things. As I was reading these journal articles and keeping up with the bibliometrics journals, I kept seeing all of these things about network science: co-authorship networks, topic analyses, all of that type of stuff. It seemed to be this whole other branch of bibliometrics that I had not even thought about up to that point, but by 2011 I was like, all right, I need to learn how to do this stuff because there's a lot of opportunity here, so let me just buckle down and start doing some of it. This is actually one of my first network visualizations; it's from one of my publications from 2013 about the Ocean Exploration and Research bibliography, which I published in Scientometrics. So this is again one of the ways I was trying to learn how to do bibliometrics on this very limited set with a captive audience.

Which leads me to my nemesis at the time, the Science of Science (Sci2) tool. When I did the demos of the collaboration analyses and the topic analyses, I made the Science of Science tool look really easy to use. It did look really easy; it's not. I made it look easy because I've been using it since 2011, so I've had a good eight years to learn how to use this tool. But when I was first learning how to use it, I was trying to learn by myself, without anybody to ask, from the documentation that's available on the Science of Science tool wiki, and it was not an easy process. I made mistakes constantly; I probably made every mistake that it is possible to make with the Science of Science tool. It felt like the tool would break simply because it felt like breaking: it would give me errors seemingly at random, when I didn't think I had done anything differently, or it would give me errors all the time. Maybe I was getting good at things and hadn't seen an error in a while, and then all of a sudden, as if to say, now I'm getting too cocky, it would give me another error message to bring me back down to earth. So there were a lot of errors and a lot of trial and error involved in figuring out how to actually use this thing. Be aware that getting errors is completely normal; I had a lot of them when I was learning.

Moving on through 2012 to 2014, this was mostly about consolidating the skills I was building. I was getting much more confident in my use of the tools and in my understanding of the theoretical underpinnings, all of those types of things, so I started working more and more outside of the Ocean Exploration and Research bibliography. I was still doing those quarterly reports every quarter, but at this point I was gaining enough confidence to apply these things to other areas and started working with other groups within NOAA. What we're looking at here is actually a collaboration with one of the scientists at NOAA in the area of geoengineering, or climate engineering: a topic analysis of all of the publications on climate engineering, to get a sense for what was going on in the research area, what the balance of research was across these different approaches, those kinds of things. This is from a publication from, I believe, 2013 or early 2014 that I co-authored with this particular investigator.

I also started working with NOAA's office of evaluation to build some of this bibliometric analysis into their evaluation workflows. One of the things they were doing was reporting the number of publications that NOAA authors had written every quarter, and they were doing that with a manual data call to all of the researchers, asking, what did you publish last quarter? They were compiling that manually and saying, this is what we published. Obviously that was a whole lot of work for a very questionable set of data, because not everybody responds, everybody responds differently, and there was double counting; it was kind of a mess. So what we did is we literally set up a publications database of NOAA articles, and we started pulling articles from different databases to compile a master list of all NOAA-authored publications.

We would then go in and assign them to the centers and divisions within NOAA, to say not only what these publications were but where they came from, what the fiscal year was, what the quarter was, all of that. So we started building this database in collaboration with that office, and we started thinking about applying network analyses to these things in late 2013 and early 2014, which is when I actually moved to NIH.

So 2014 to 2015 is when I made the transition from NOAA to my current position here at NIH, where I was really focused on doing these things full-time. At NOAA I was starting to specialize in this, but I was still doing a lot of reference and a lot of other library-specific things, whereas moving here to NIH I was able to be much more specialized, with bibliometrics as my primary work activity. One of the things we did very quickly at NIH was figure out that report template I just walked you all through. We were very interested in coming up with both a template for what the reports were going to look like and something we could use to sell the program. We were hired here at NIH to literally invent this program from scratch; Ya-Ling and I were hired to create this program, figure out what it would look like, market it within NIH, all of those things. So what we needed was a product we could literally market, and we used the IC reports as a way to build that template, both for marketing and getting the word out, but also for giving people something to react to. A lot of times when we talk to people, they don't necessarily know what's possible with bibliometrics, so we can often use these profile reports as a way of saying, here are some of the things that are possible, that we can do, and then use that as a springboard for other things. So a lot of those first couple of years was really us developing this template, figuring out what the services program was going to look like, how we were going to manage our time, all of that kind of stuff.

Within NIH, though, it became very clear that the challenges were going to be very different, and the primary challenges were going to be around capacity and scale. At NOAA I was working with relatively small data sets and a relatively small number of analyses; I was doing maybe one or two reports per quarter, something like that, on very small data sets. Once we got over to NIH, and once word started getting around NIH that we could do things, people started asking us to do more and more analyses, in more complicated ways, on larger data sets. The capacity and scale relate both to the number of publications we were working with, where we very quickly had to scale from about 500 publications to 50,000 publications in a matter of years, which was one of the major challenges for me, and also to the number of requests we started getting, which was pretty intense; we'll show some data later about just how many reports we do every quarter, but we do a lot. So we had to build in ways of both handling these very large data sets and streamlining the analytical process, to be able to turn things around very quickly and deal with the number of requests we were getting.

Which leads me to my nemesis, part two: R and RStudio. I started learning how to write code after I got here to NIH; I started learning in maybe 2015 and got really serious about it in 2016, and it was literally to deal with these scale and capacity issues, because R and RStudio are much more able to handle these very large data sets than the things I was working with in the past, i.e., Excel. R was also able to pull these very large data sets from the databases via APIs, so I wouldn't have to spend as much time downloading things, and I could write modular blocks of code that I could reuse from analysis to analysis to speed up some of the processing and pull things together much more quickly.
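As an illustration of what I mean by a modular, reusable block (not our actual production code), here is a small R function that takes any cleaned publication data frame and returns the standard productivity counts, so the same code can be dropped into each new analysis.

```r
# Illustration of a small, reusable block: standard productivity counts from any
# cleaned publication data frame. The default column names are assumptions.
library(dplyr)

productivity_summary <- function(pubs, year_col = "pub_year", type_col = "doc_type") {
  list(
    per_year = count(pubs, .data[[year_col]], name = "n_pubs"),
    per_type = count(pubs, .data[[type_col]], name = "n_pubs")
  )
}

# Reused for each new request (hypothetical data frames):
# fic_summary  <- productivity_summary(fic_pubs)
# noaa_summary <- productivity_summary(noaa_pubs, year_col = "PY", type_col = "DT")
```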

Again, I got lots and lots of errors in learning how to use this thing, because again I was pretty much teaching myself how to write code. I did take a course on Coursera, which will remain nameless, which was an extremely difficult and labor-intensive course. I wouldn't necessarily recommend it, but it is the way I actually learned how to work with code; I remember spending maybe 60 hours a week for a couple of weeks just taking that particular course. It was a really intense experience, but it got me up to speed so that now I'm much more able to do things. And having gone through this once with Sci2 made it much easier to go through the same process with RStudio, because I kind of knew what to expect, I knew what I was getting into, and I wasn't as frustrated by all of the error messages I was getting. So I was more able to deal with these things the second time around, though it was still a very intensive and very mistake-ridden process.

Which leads me to the present, where I am still learning how to do things. What you're looking at here is an alluvial diagram where I'm tracing the citation flows between research topics among several generations of publications. I'm starting with a final systematic review, and we wanted to trace two generations of citations to say how these different research areas converged into a single clinical trial, a single research outcome: what body of research was necessary in order to actually make this breakthrough, what the fields looked like, and how they converged on this particular breakthrough in a particular clinical trial setting. So I'm tracing citation flows through these different generations, through various topic areas, to try to get at that kind of question. And this is one of the things that's really great about working here at NIH: we get challenged with these kinds of questions all the time. The question from the customer was, can we figure out what knowledge was necessary in order to get to this particular breakthrough, and it was up to me to figure out literally how to do that. I invented the method, I invented the workflow, and I invented the way of visualizing it in order to respond to that particular question. That's one of the really great things about working here: we get these kinds of questions, and we get the opportunity to literally invent new ways of analyzing and visualizing data in order to respond to them. To do that, obviously, I need to keep working, keep learning different analytical methods, keep learning different areas of research, keep learning all of these other things. So I'm still within this learning process: just a couple of weeks ago I learned how to use a new implementation of a topic modeling algorithm I've been working with, and just a couple of months ago I learned how to use a new gender estimation algorithm to estimate genders from first names. It's just this constant process of building skills and building new ways of analyzing data in order to respond to customers and meet their particular information needs.

Which leads me to my lessons learned; I have three of them. The first is persistence: doing basically the same thing over and over again and expecting different results. This harkens back to those error messages I kept getting. When you're learning how to use these tools, and particularly when I was learning how to use them, I didn't have anybody to ask what an error message meant or how to get around it, so I had to figure out what the problem was and just keep plugging away at it in order to come up with the final analysis product.

So I had to just keep going and going, and I learned how to make mistakes in a way that I could learn from them, doing it over and over again and plugging away at it in order to actually build the knowledge and the skills necessary to do what I need to do.

The second lesson, again related to those error messages, is figuring out what the heck an error message actually means. If you're at all familiar with error messages in computer programming, they are really cryptic; they don't give you a lot of information, and they don't give you a lot to go on as to how to actually fix the problem. So it was up to me to decode some of those error messages and figure out what was actually going on, what the actual error was, and how to fix it. I had to learn how to do this in order to fix the errors, and it has become very valuable to me now, because it allows me to invent these new analytical procedures. It's the same kind of thing: it's a problem to be solved. If I know what the end result is going to be, I can work my way backwards to figure out how to get to that result, or if I need to learn how to use a particular tool or write a particular function to automate a series of processes, I need to think logically about what the steps are in order to get to that final product. So it was a very valuable lesson for moving forward and doing a lot of what I do.

And the final lesson learned is that you can actually learn from your mistakes; in fact, that is how you learn, by making mistakes. I've gotten to sort of a Zen attitude toward mistakes, where they are not necessarily bad things, they are learning opportunities. I will admit that if I'm still at the office at 7:30 at night, after a full day, after spending maybe five hours trying to solve a particular problem, I don't always have that kind of Zen attitude, but I try to, and I know that in the bigger picture making mistakes is in fact how you learn things, and it is in fact how you learn to work with analytical tools. So just get to a place where you're okay making mistakes; that's probably the most valuable lesson of them all. So that was a very quick introduction to my learning process. I'm now going to turn it over to Ya-Ling so that she can tell you about hers.

All right. Following up on what Chris just said, I'm going to talk a little bit about my process; it's kind of interesting that I'll focus a little bit more on why I didn't learn, instead of my learning process. Generally, people agree that there are four stages of a learning process: the first is unconscious incompetence, then conscious incompetence, then conscious competence, and then unconscious competence. Chris has gone through all of these stages, and I think I have a little extra one, which probably all of you have gone through already, and which I'll be talking about later.

Starting with these four stages: the first one, for me, actually lasted a long time, since I was a library school student. I think that's the first time I heard about citation analysis; at that time I don't think it was even called bibliometrics yet. My attitude was, I didn't know that I didn't know how to do this; it seemed like anyone could do it, because it's part of collection development, and people were just talking about the journal impact factor. It seemed very straightforward to me, and I was like, anybody can do it. The second stage I actually moved into when I graduated from school and started publishing my own work. My attitude back then was, it's just a number, do I really care? It's just a citation count, and as a researcher myself I was like, do I need anybody to tell me how good or bad my research is, or what I should be doing, or to guide me in what to do?

how good my were bad my research is or what I should be doing or anybody to guide me what to do so it’s sort of a like I don’t really care it’s just a number okay so at that time it was really unconscious incompetence okay uh what Chris and I share I think was the turning point of 2010 okay so that year uh I was pregnant with my second child and I don’t know as fortunately or unfortunately my husband got another job in Maryland and back then we were in met in New Jersey so I was in the dilemma of either raising two young kids on my own or I would have to change switch jobs and relocate to Maryland and then start other things and it’s actually uh quite a struggle to us because I’ve been in New Jersey for a long time hi ma’am I actually took a year off to have my kids delivered and then I I commuted for another year and decided that it’s not working and then I started to think about what are the things are what are other options that I can do and one thing I didn’t mention was that when I was teaching my area was actually not be the metrics at all it was children services children’s literature and also information behaviors services and one thing that I have to teach is research methods and that has a lot to do with analytics in okay so 2010 one of my colleagues back then had address out in research methods so that’s a book about research methods in one section so me like one in half pages is about the biometrics and I have to acknowledge that as to that’s actually the first time I heard about the term do metrics okay and it’s only like five paragraphs and basically talking about some theory of Abib metrics and it’s not the thing that we talk about at all it talks about the journal distribution or article distribution the preferred distribution law and I was like okay what is it do I really have to do that because I’m one of the febri member that have to teach research methodology okay so for me research methods is both basically just two paradigms qualitative quantitative or qualitative we look at the non numerical data how you can gather the data interpret that for the numerical data you do statistic analysis either descriptive or inferential things like that but now it seems to me that I have to another dimension about the glue metrics in the numerical data part okay and doing analysis is not difficult at best for me at least for me I don’t think it’s difficult so do analysis I think yes I can do it okay but how do I really make sense of it okay so that’s the time I I actually understand that this is a some thing I don’t really know what to do okay so the next stage feels the conscious incompetence is that I don’t want to teach it I know I’m supposed to include it in my class because it’s part of the method although it’s new to me but I don’t really want to teach it for several reasons okay for all of the methodology courses in library school it always have very low enrollment why a lot of times people think it’s not important it’s boring it’s too theoretical it has to do a lot of numbers it is analytical it has nothing to do with the practical life the pretty over that they’re going to do okay and a second reason there is very low demand from employers back van not too many library or I can probably say that not too many library here in the United States had a demand about analyzing publications for citation analysis or biometric analysis so basically the library schools do nothing is important to include that part my students do nothing is important to include that

part and there are not many materials back saying talking about bibliometrics especially the practical part there are some theories but again it’s different from what we have talked about recently okay and I think one of the most important thing is that I know that I don’t know how to do this so I don’t know how to exactly per us to do the geometric analysis and how to really make sense of it to interpret that one of the things that I did was actually to look for others Silva’s or by other family members so I went back to my alma mater and Jonathan further was one of the favorite members that I had earlier before FF when I was in UC life and he was actually one of the first people I think in the United States to teach liberal metrics the syllabus and you can tell that it’s actually again very Theory Laden so it talks about all sorts of a different theory Interop including statistics probability and modeling things like that and I said to myself no I just cannot do it even if I can help copy his syllabus I don’t think that would be helpful to my students or to help people to really understand what is going on so up I didn’t really include liberal metrics in in-depth discussion in my courses it’s kind of a like what my colleague did I just include one session talking about the possibility of what different metrics can do okay and then as I mentioned that there was the change for a lot of things 2010 for me is a personal change but it’s also a change for a lot of the scholarly communications you can see they’re a lot more publications popped out and there is also the emphasis being pushed out on evidence you want everything is evidence-based evidence-based services evidence evidence based medicine evidence-based a lot of a thing so there actually was a push to look for can you provide evidence on the research on the research performance okay so combining a lot of the things that just took place at the same time I talk to myself let me just try to learn okay so I think I started a little bit later so I was lucky enough to get more readings a little bit more online tutorials and experts and colleagues even when I started here okay and I think one of the most important thing for me is the change of the mindset so Bipper metrics is not just quantitative so what I understood was that there are a lot more things beyond the citations so there is the text analysis part there’s a subject analysis okay and it is not just about citations there are so many other things that you can look at because of the development of the technology the more data that is being included or being crawled by those data pods the vendor databases the more things that we can do to make sense of it so that actually leads me to the conscious competence okay to the major change of the mindset that civil metrics is not just in analysis not just an analysis of numbers it actually has to do a lot of other things say for example you have to do with data visualizations it is something that it’s quite new to me okay and then it is not just analysis it’s actually a type of library services although it’s very rare for people to look at it as a service but it is the support services to research so I dive in okay there are a lot of learning through the trials and the mistakes and again I almost did the same thing I went to the site you and I they have their manual they have their work full I just download their data sets and did each one of them follow their workflows made a mistakes and then create a things that I had to do but I think 
one of the things where I think I had an advantage was that I already had a grounding in the concepts, and I knew the limitations and the methods already. So for

me it was more about how to really make use of the tools, how to really get to know the systems and what they can offer, and then how to communicate with the clients, the users, because there are so many different stakeholders, and how do I communicate and interpret these results to them. But the unconscious-competence part came really late, I would say probably only in the most recent two or three years. It is the stage when I think I can say that I can teach you how to do this, and people have told me that I make it look really easy. But around the same period I started asking myself a lot of questions, like, are there other ways of doing this? Just as with any research, there are always different ways of doing it, and for a bibliometric analysis there are different tools and different methods, so are there other ways of doing it, and can it be connected with other areas? As I mentioned, I came from a humanities and social sciences background, so can I connect bibliometrics with, say, children's literature? Can I connect bibliometrics with public library services? That is something I have been thinking about: how do I connect these different things? And that actually led to my next stage, seeking meaning.

So I constantly ask myself, do I really know what I am doing? I know the data, I know the sources, I know the analyses, I know the tools, I know the limitations, but do I really know the big picture behind it? When I am done with an analysis, when I tell my clients that, okay, 20% of your publications are ranked in the top 10%, or when I do a topic analysis and say that most of your publications were in the field of public health, what does it mean in the end, and what meaning does it bring to their research? And what am I really promoting? Am I really promoting their research, or am I just promoting the tools and the strategies, or maybe I am just promoting research assessment? And am I really a neutral agent? Given the various tools that we can pick from, when I say that I think this one works better than that one, am I really a neutral agent at that moment? And when I know that there are so many limitations and mistakes or errors in the systems, do I really trust what they can provide me, and can I really trust the results that I provide to my clients?

And because of the development of the technology back then, I would say that these ten years, from 2010 to 2020, were the first decade of bibliometrics in the professional world, or in library services. A lot of tools were being developed, systems were being updated constantly, and you can see that because of the work, the research, and the development in AI, a lot of things can be automated. So in the end, a lot of the things that we are doing right now can probably be replaced by software, by tools, in just a few years. So the final question that I ask myself a lot is, how am I different from the vendor products? Right now we have a lot of custom projects that the vendor systems or tools cannot really do yet, but I believe that, with the pace of technology development, this can be done easily in just a few years. So I constantly ask myself, what can I do, how am I different from those products ultimately? And that actually brings me back to my learning process.
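As a concrete illustration of the kind of "top 10%" figure mentioned above, here is a minimal sketch of how such a share might be computed from a spreadsheet of publications. This is not the specific workflow used in our reports; the file name, the column name, and the percentile convention are all assumptions that would need to be adapted to whatever export your database actually provides.

import pandas as pd

# Minimal sketch, assuming a CSV export with one row per publication and a
# "citation_percentile" column where a lower value means more highly cited
# (some exports use the opposite convention; flip the comparison if so).
pubs = pd.read_csv("publications.csv")  # hypothetical file name

# A paper counts as "top 10%" if its citation percentile places it among the
# 10% most cited papers in its field and publication year.
top10_share = (pubs["citation_percentile"] <= 10).mean() * 100

print(f"{top10_share:.1f}% of these publications are in the top 10% most cited")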

So when I am trying to seek meaning out of what I am doing, a lot of the time I have to go back to conscious incompetence, because I just do not know a lot of the things that are going on. I started with the analytics, and then I realized I was not good at visualization, so I learned it; and then I realized I was not good at programming, so I am learning that; and then, okay, what is the next step, what is the next big wave we are going to face? I do not know, but I am still exploring, and that is actually the cycle of my learning, or probably non-learning.

All right, so for the final thoughts, I actually wanted to share with you what our little team has accomplished. Oh, I'm sorry, yeah, no, I just had a couple more things. So a couple of final thoughts, take-home messages, from a lot of this work. Again, learning is a process. I think we both spoke to that already: you're never done learning anything, particularly bibliometrics. There's always more to learn in the analytic space, and more in the meaning part of it, what does it mean; there's always more out there to learn, so you're never actually done learning anything.

The other thing is we always want to acknowledge limitations, both our own limitations and also the limitations of the tools that we're using. I think we both spoke to that a little bit already, but again, just being aware that we have limitations, being humble about what we do, what we think we're doing, what we may be doing, what we may not be doing, what we may not be capturing; we always need to be very aware of that. So whenever we're talking to people about bibliometrics, we always want to have these limitations in mind and bring them up in discussions with customers, just to make them aware that there are a lot of these limitations we need to be aware of.

But at the same time, there are a huge number of opportunities in this space. With library services in particular, there's a real opportunity for us to add value and provide new kinds of services within libraries that haven't really been provided in the same way before. It is a new way of doing things, so there are a lot of opportunities related to that, and we want to be open to those new opportunities and new ideas. To speak to how many opportunities there are, I wanted to go back to just how busy we are in the library space. This shows you the total number of services provided per quarter over the five years that we've been at NIH. You'll see it did take about a year and a half for us to really get established and for the number of requests to peak, so we definitely took a while to get the word out and to build awareness around NIH. So be patient when doing bibliometrics and trying to get things out there; it just takes a long time to gain acceptance at whatever institution you're working in. But subsequently you can see that we provide somewhere around 50 services per quarter, which ends up being around 18 a month between the two of us, so that keeps us very, very busy. We tend to average about eight actual full analyses every month, so each one of us is doing maybe three to five analyses every single month. So again, that speaks not only to the scale of what we're doing but also to the opportunities that are there for us to be able to do these kinds of
things, because there's a need for it, at least here at NIH. There's a need for competence and experience in actually providing these services, doing these kinds of analyses, and also being able to interpret them in ways that are actually appropriate and do not overstate their

importance or overstate what we're actually learning from these things. Again, a friend of mine in the evaluation space always used to talk about how evaluation never gets you to the truth; it only gets you to a less imperfect understanding of the truth. I liked that way of putting things, and I think having that as our perspective, where we're coming from, is: we're not going to tell you what the truth is, we're just going to give you a less imperfect understanding of what it might be. Using these methods to get there is a real opportunity, both here at NIH and, I think, more generally in libraries around the world. So that's it. Thanks to all of you for making it this far and for learning with us through this whole series of courses. Once again, we will be adding additional training series or additional trainings as we can, to supplement this whole process. But once again, congratulations for making it this far, and thanks all for listening.