>> From the Library of Congress in Washington, DC. >> DAME WENDY HALL: Thank you for finding your way back again, I hope you had a good something to eat and a breath of fresh air. There will be two panels this afternoon. The first one is putting data to work. For me, and I'm going to talk more about this later, the key to helping people find the archives that are around is all the data. And that's what this panel is about. Focusing on types of data, how you produce it, what you do with it and some of the ethical questions of handling this type of data, which is very different to handling data from the physical sciences, for example. Now, you have the bios of everybody on the panel. So, I'm not going to introduce people at length. I'm going to take people in the order that they're on this sheet. And luckily for me everyone on the panel is on this sheet. So, we're okay [laughter]. So, we're going to start with Lee Rainie who's infamous, no, famous, sorry, for being the Director of Internet Science and Technology Research at the Pew Research Center. And we first met in Hong Kong. I've got that right this time, haven't I? Yeah, and this was, oh, a year or more ago. And I said, oh, I'm coming to Washington. Oh, let's do something together. And here we are. So, Lee, you have your time in the limelight now. >> LEE RAINIE: Thank you very much and thank you for doing this, Wendy, it's an amazing resource and if the Doris Kearns Goodwin of 100 years from now is a good enough historian she will discover what an important event this was and why it led to the good things that she was writing about in the next century. And anytime you can be in the same room with Vint Cerf, first of all, you have to say thanks. And so, we all say thanks to him. I just wanted to start by underlining a point that Richard Marciano made about how data quality now is so different from before and it's the phase change in data that really makes this a different kind of environment for us.
So, everybody talks about volume, but one of the artifacts of greater volume is that just more people get to participate. One of the striking things we see in the big national surveys we do about the impact of the internet on people is how grateful they are that they can tell their stories now. And they're not mediated through industrial era media companies. And they get to tell their own stories in their own way. And they get to find other people who share those stories and those circumstances. And it's an enormous impact. They're annoyed at the millions and billions of other people who tell their stories, especially the wrong ones, or the ones that are spun up or lied up. But, by and large, when people are assessing their own information environment, they're so grateful now that they have a voice and they have a place, in a way that, you know, GeoCities gave them a long while back. The second dimension, so volume is one, velocity is the other. So, it's not just the pace of information that's coming through sort of normal channels, but it's real time information. You know, we can now talk about location as a really important indicator and predictor of people's activities and behaviors and things like that in a way we never could before. You know, Pew is a big survey organization and people are not very reliable when you ask them specific questions about their activities. Their media behaviors, their location behaviors, their frequency of doing things and stuff. Well, now we have reality checks on that when we have real time information coming at the speed that it does and being manipulated in ways that can send signals about what they're doing. And the variety of information is also changing. Sort of in the same way that people are glad to be able to tell their stories, there are so many other places now that can tell their truth to the world. And this is enriching the amount of information that's available. It's certainly expanding the number of gatekeepers of information.
It's changing the relationship that people have with expertise. Who's an expert, and who do you rely on, and where do you allocate your trust, and things like that. Which brings me to one of the three points I want to make. I'm going to make three points and two pleas in my little time here. Insofar as you have to make choices about what information to archive and what to focus on and stuff like that, I would argue, let's be selfish about it. Let's tell what we think, those of us in the room think, might be the biggest story certainly of our era, and maybe for a long period of time. It's the rise of the internet and the web itself. It is changing the way people think about their social worlds. It's changing the nature of their social worlds. It's changing how they're in community. It's changing how they think about expertise. It's changing their relationship to big institutions, little institutions and everything in between. It's changing the way they allocate their attention. It's changing the way they can act as agents on their own behalf economically, socially and politically. So, across the board. We're in the midst of a huge story. And I'm trying to think of another time in history where people were so aware that the ground was changing underneath them. Maybe the arrival of atomic weapons might be something that generated as much sort of immediate commentary about its impact on humanity. But that's the only other one I can think of. So, we've got this big story to tell, and why not be intentional in the data that we collect and the data that we preserve for the future in telling the story of this era and all of the change that it initiated. You know, in a way that justifies archiving lots of stuff, but maybe as you're sort of sitting down, figuring out what collections to amass, that could be one filter through which you pass your judgement.
The second thing is that ironically even as the volume, velocity and variety of information is exploding, there is an information market that is degrading. And archivists can really play a big role in helping undergird that market, and it's the civic information market. One of the sort of great things of the industrial era was that newspapers arose and newspapers were awesomely good at bundling together all kinds of different content. But serving it up to you in a way where the mayor got covered, the school board got covered, the zoning board got covered, the state house got covered. National and international capitals got covered. Well, as that bundle has been blown up by the internet and each sort of constituent part of the bundle has found its own business model and its own advertising base, what has been orphaned in that process in some sense is civic information. So, you don't get as much information about the mayor's office and you don't find out what's going on on the school board. And you don't necessarily know what the zoning board is up to in a way that you did when there was a reporter in the room when all those things were happening and when the material was being released. And so, archivists can maybe play a role in sort of uplifting that segment, even as the business model for it is in trouble. I was so struck, maybe these are the marching orders to the datathon people, but I was so struck that all of them marched to civic information and civic insight as a thing that they thought was so precious and so interesting and so easily demonstrable there. Yay for you guys. And everybody that's part of that team. And that instinct to fill in that information gap is a really good and important one. And finally, there's a movement that's just poised to help you in the archiving process and in the explication process, which is the open data movement. You know, it's hard to archive stuff.
It's hard to sort of think through some of the ethical decisions about scraping stuff and things like that, but this is all ours, you know? The open data movement is built around information we paid for and whose copyright status and whose ethical status is not nearly in question to the degree that the proprietary information or other information is. And these folks are, you know, in the same tribe as you guys. They love to hack. They want to figure out how to make this information more useful for civic purposes. So, those are my three points. Organize around the big story, the rise of the web and what it's meant to us all. Help a degrading information environment cope with the collapse of newspapers. And give a hand to the open data movement. Now I have two pleas to end with. One plea is an affirmative one. Another ally group for you guys is public librarians. We do a lot of work related to public libraries and they are poised to help, especially in the model that Vint was articulating about, distributed archiving. They kind of have the skillset and interest already. But they need marching orders and they need help from you. And I think that you could take advantage of that. They also, when I go out and give speeches to librarians, we do a lot of research about the role of public libraries in people's lives and in communities' lives. One of the things I say that librarians can do in the future, a real service that sort of matches an industrial age service that they provided, is to be the algorithm fact checkers. They need training in that, they need to know how to do that. But one of the things is that as we move into the age of the internet of things and big data analytics and stuff like that, somebody's got to watch the watchers. And the wonderful computer scientists in the room are going to do a good job at that. But there's an army out there that might be willing to help you on that [laughter]. And the final.
So, think of public librarians as good, useful, ready to charge into battle allies for you. And the final plea I'll make is that even as all of these data arise and the exciting ways to think about capturing it and analyzing it and stuff like that, don't make the mistake of thinking that it represents everything and everyone who's important. Twenty percent of American adults, one-fifth of American adults, use Twitter. Great resource, wonderful to analyze, fabulous to do social network analysis off. But, you know, four-fifths of adults don't use it in America, and the number is even higher in other developed countries. And even higher in developing countries. And even for Facebook, a third of American adults don't use Facebook. So, don't necessarily think that you are telling the full, full story, or certainly a representative story, by thinking that all of this stuff that you're amassing and analyzing stands in for everyone, because it doesn't. And there are ways that you can backfill for some of those issues. And there, certainly, sort of every day you get up you should be thinking about how can I serve to narrow digital divides rather than expand them. Thanks. [ Applause ] >> DAME WENDY HALL: That was wonderful. And you're very neat, you kept time. And I'm fascinated by how we move, particularly with the elections coming up, well, our referendum and your election, which all seem completely crazy to me. I wish we weren't having a referendum. It's driving us apart and it didn't need to happen. But how we go from, you know, the century we've had of polling to the century of social media, and how, you know, we do begin to forecast what the people are thinking. Now, it's my great pleasure to introduce someone who has become a great friend of mine, Professor Katy Börner, who is a professor at Indiana University. You have the list of the things that she does there. She's one of the most amazing people. She's a force of nature.
She produces the most amazing maps and I know she's going to show you this, but she produces things like this, sorry, this here, this amazing book of maps, of science. And she's also one of the most generous people I've met in terms of her time and her wish to work with others. Katy, over to you. >> KATY BÖRNER: Thank you, Wendy. And thanks to all the organizers for hosting us in this beautiful palace of knowledge and wisdom. Forward. So, I wanted to start with a suggestion to chart the web, to map it, to create an atlas of all the different ways that this web has been helping us to find our way, to understand structures and trends, to identify problems with the data that exists, truly. To also understand what data we don't have, because there are geospatial regions which would just be black if you overlaid the number of holes there. And to then take these maps and bundle them all together in what is known as an atlas. And the presentation will start with a few maps that are part of the mapping science exhibit and that show that not only science today, but also [inaudible], truly connects us globally. And the first map I'm showing here is a map of scientific collaborations between 2005 and 2009. Some of you might have seen a similar map of Facebook linkages. And so, this is, I think, inspired by that map. And it really shows that we today are very much globally connected, but you also get to see that there are some areas which are densely interlinked and others which are more sparsely interlinked. I also saw this recent map, which some of you might have seen in the news media, and it shows a cartogram-style map of how many domain names actually exist and how many different people use these different domain names. Another map, which was actually hand drawn, is shown here and you can zoom in to see the big Google landscape and data oceans etcetera. All these maps, by the way, are available online. So, you can zoom into them.
So, if you go to scimaps.org you will find a hundred maps that have been created by more than 250 scholars. This is a map of Twitter activity and you see that in the Netherlands, and you might also know that in the Netherlands, people are holding a drink in one hand, and they're texting with the other hand, while they're driving. [ Laughter ] But you also get to see that these maps of course redraw major capitals, major railway and other transportation lines. And you also get to see what areas in Europe are multi-lingual and what kind of languages that might be. So, I think the plea here is to take the data we have and to try to not only explore that data, but also to communicate it to a much larger audience. In an attempt to communicate what we know about the world, but also to communicate what we don't know, where the monsters are, where the darknesses are. A team which also includes some of my colleagues at Indiana University has used Twitter data to map the mood of a nation. And it's interesting to see how grumpy some are in the morning and how happy they are when the sun gets up. And I won't have time to look at all these maps in detail here. But if you're interested in maps, there are some handouts in the front and, again, they're all online. This is a map which is near and dear to my heart. It shows not only how science cites itself, which is how typically these maps of science are created, but how it's used, for instance in doctors' offices by practitioners who are desperate to have parents go home with their child and not without their children. And so, here you get to see how these different scientific papers are downloaded, one, and then the next, and then the next. And when there is a sequence between them, they are connected by links. And the more often they're downloaded after each other, the stronger the linkages are. And the closer in space they are, given these layout algorithms we are using here.
Now, it would be wonderful if all of us could actually read those maps, and could make sense of them, and could find our ways not only in geospatial space, but also in knowledge space. Unfortunately, however, this might not be the case. So, we recently did a study in six science museums in the US. It's still my hope that it's different in Europe or Asia, but in the US, we asked a thousand youths and their caregivers to read 20 maps, or each one of them got 5 maps to read. And it turns out that those who make it into science museums actually do have a hard time reading scatter plots, reading cartogram maps, reading networks. Networks are almost impossible for many, many out there. So, there is a huge need to help people not only read and write text, but also help them read and write data. And I would like to make the argument that if you only read text, but never write text, then it's very hard to actually be a good reader. And similarly, here, if you only read data visualizations which are shown in newspapers or textbooks, but you never get to make one, then it's also very, very hard to enter the practice of visualization. So, we have been trying to bring materials online that empower anyone from high school to beyond 100 years of age (we have some of those in our MOOC) to understand how to actually take a dataset and analyze it and visualize it and communicate that result to others. And we use a visualization framework, which first of all introduces different task types, from statistical to temporal, topical, geospatial and network analysis. And different levels of analysis. So, from micro, individual level studies, to meso, to macro, population level studies. And if you now have that table, which you see here, then each cell actually refers to a set of tools that are meant for that particular purpose, that are developed by geographers or cartographers if it's a "where" question you have, or by network scientists if it's a network, "with whom" question.
Or by linguists if it's a "what" question. So, these tools are really developed in very different areas of science, but oftentimes you might have multiple of these questions. So, this helps you to kind of maneuver the landscape of different tools and workflows. And then, this is a very simple workflow of how you can work data into knowledge. And it's always one, two, three. Because it has to be so simple. You read data, you analyze it, you visualize it. And then you select a reference system, could be just two axes or a map of the US for instance. Then you overlay the data and then you start encoding the data. And then you will probably see that you don't have the right data, or the data has a problem, and then you do it all again. So, it's a very iterative process. And again, these are just different reference systems you could swap out, and this is a course schedule. And there are two books, one is more timely, and it's actually co-written with a librarian. So, Ted Polley, he now is working in Indianapolis as a librarian. And he has been instrumental in bringing these materials to many libraries in the US, but also abroad. And then "The Atlas of Knowledge" is a collection of these different maps, has good examples, but also has the visualization framework. And we also use our own macroscope tools. And apparently, I don't have any time to go into that detail. But ultimately students are supposed to understand that if they have an Excel file with different columns, then they can point to different types of columns. Let's say the year and topics. And then they can run a topical analysis of bursts of activity using these two columns as input. Or if they have a column with addresses, then they can geolocate them and they can overlay them over a map of the world, or a map of the US. And just by helping them understand which columns in a normal Excel file facilitate what kind of analysis, they are a far step ahead. And then the tools also use that organization in the menu system.
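[Editor's note: the one-two-three workflow described above can be sketched in a few lines of code. This is a minimal illustration, not from the talk or the speaker's tools; the column names, rows, and text rendering are invented for the example.]

```python
from collections import Counter

# Hypothetical rows from a spreadsheet with "year" and "topic" columns
# (all values invented for illustration).
rows = [
    {"year": 2005, "topic": "networks"},
    {"year": 2005, "topic": "networks"},
    {"year": 2006, "topic": "visualization"},
    {"year": 2006, "topic": "networks"},
    {"year": 2007, "topic": "visualization"},
]

def read(rows, columns):
    """Step 1: read -- pull out just the columns the analysis needs."""
    return [tuple(r[c] for c in columns) for r in rows]

def analyze(pairs):
    """Step 2: analyze -- count how often each (year, topic) pair occurs."""
    return Counter(pairs)

def visualize(counts):
    """Step 3: visualize -- overlay the counts on a simple reference
    system (years down one axis, topics across the other), as text."""
    years = sorted({y for y, _ in counts})
    topics = sorted({t for _, t in counts})
    lines = ["year  " + "  ".join(topics)]
    for y in years:
        cells = "  ".join(str(counts.get((y, t), 0)).center(len(t)) for t in topics)
        lines.append(f"{y}  {cells}")
    return "\n".join(lines)

counts = analyze(read(rows, ["year", "topic"]))
print(visualize(counts))
```

The point of the sketch is the loop structure: after seeing the output you would typically discover a data problem and run the three steps again, which is the iterative process described above.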
And I'm out of time. Okay, so the visualization framework can be found in that book if you like, but what I wanted to do, as I'm out of time, is make sure that you also take home a flyer. So, maybe while I distract Wendy. Yeah, yeah. I'm not going to do it. I am out of time. I don't know why I wanted so many slides. I'm very, very sorry [laughter]. >> DAME WENDY HALL: This is a typical academic, all right. >> KATY BÖRNER: Found my slide, sorry. So, those of you who do not only want to look backwards, but want to look forward, might like to check out that recent conference, which Wendy was also involved in as a keynote speaker. Because now all the slides and all the recordings are online. So, if you want to also use data to, to a certain degree, predict the future, I would recommend that website. Sorry. [ Applause ] >> DAME WENDY HALL: I love the way we academics supposedly have a different view of the space time continuum to everybody else [laughter]. But I would recommend you to look at this material. Delve into those websites. And some of the students from here are going to visit you next week. So, they'll be playing with all this data. And the conference as well was a really brilliant get together, everybody. Now, it's my pleasure to introduce someone I've worked with for a long time. A very good friend of mine, Professor James Hendler. Lee talked about open data, and of course Jim, who is at RPI, is very well known for the work he's done on data at data.gov. And also, his seminal work with Tim Berners-Lee and others on the semantic web, which I think is really important in this world. This is not a conference about the semantic web, but maybe Tim will, Jim will touch on it. Sorry, slip of the tongue there. And he's also the chair of the Web Science Trust, so we work together on that as well. Jim, over to you. >> JAMES HENDLER: Thank you, Wendy, and on behalf of the Web Science Trust let me welcome everyone.
We're very excited to be helping to sponsor this event to some degree. And I'll let it go at that because I don't want to use all my time on that. But, I also was going to take some time to refute the most controversial claim made this morning, which was that the internet was about puppies. But I don't have time for that. Just Google Tim Berners-Lee and cats and you'll find the correct answer. But what I'd really like to do today is start with a thought experiment. We're all sitting here in the Library of Congress and particularly those of you who are librarians and library trained, you won't need much thought about this, but for everyone else, imagine if when this building was first being conceived, when this concept was first being conceived, we had a search engine. That was a keyword-based search engine like what we have on the web today. Right? So, instead of putting these books in here, they were just thrown in kind of at random, each book was just given a number and you would use the search engine, right? That would have been a terrible way to build a library. Right? And 140 years ago this year, the person who every librarian sort of loves to hate and hates to love, Melvil Dewey, came up with a very famous system: the Dewey Decimal System. Which was sort of a way of taxonomizing the whole world. But it put books near each other if they were sort of conceptually close. And again, we can argue lots of things, there've been many, many attempts over the years to improve on that system, and yet it still hangs on. Partly for historic reasons, but partly because it's hard to do it differently. The problem is, for data, even the keyword search engine isn't a viable approach. Right? The number 13 sitting in a column in my database could be anything. Could be somebody's age, could be representative of some set of columns, anything.
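[Editor's note: the "number 13" point can be made concrete with a small sketch. This is an editorial illustration, not from the talk; the dataset, field names, and publisher are all invented, loosely modeled on dataset-description vocabularies like DCAT.]

```python
# A bare column of numbers is meaningless on its own: the same value
# could be an age, a count, or a code. A dataset-level metadata record
# is what lets a catalog or archive interpret and index it.
raw_column = [13, 47, 2, 31]

# Hypothetical metadata record (every name here is invented).
dataset_metadata = {
    "title": "City permit processing times",
    "description": "Days taken to process each permit application",
    "columns": {
        "days_to_process": {"type": "integer", "unit": "days"},
    },
    "publisher": "Example City Clerk",
    "issued": "2016-01-15",
}

def interpret(values, meta, column):
    """Attach meaning to bare values using the metadata record."""
    info = meta["columns"][column]
    return [f"{v} {info['unit']}" for v in values]

print(interpret(raw_column, dataset_metadata, "days_to_process"))
```

Without the metadata record, nothing distinguishes these integers from ages or zoning codes; with it, a search system has something to discover and explain, which is the index-structure problem described above.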
So, one of the real problems we have with data, in fact with all of the indexing of the web, but particularly the data on the web, is that we need an index structure. We need that Dewey Decimal System replacement; however, it clearly won't work to do something very similar to Dewey. So, what I want to sort of hit on today is four things we really need to think about in terms of organizing the web archive, but particularly the data aspects of it. So, for the web itself the URL system provides at least some kind of organizational structure. And for the internet archive that's to a large degree what's been used. And then on top of it you just try to build very topical things, and that works. But again, in the data world, in the next talk we'll talk more about linking data so I don't have to. But linked data is happening, but that's not sufficient, because again, that's about how to get data organized in certain ways, but it still doesn't tell us how to find it, so discovery. It doesn't tell us how to integrate it, doesn't tell us whether what this person was using here is similar to what this person is using there. So, some observations, you know, sort of based on those 140 years versus what we need to do. So, the first thing, clearly anything we're going to do about web data can't be a taxonomy. It can't be a tree. Vint put it very beautifully, he said, the web is a web, right? And it's not even just a graph. It's a complex hypergraph structure with a lot of different connectivity and a lot of different kinds. And the data that lives in and powers that also tends to live in that representational scheme. So, number one is how do we even begin to build something that's an organizational structure that doesn't live in the kind of trees that we're taught in library school are the best way to organize things. The second thing is we really cannot name down to the individual data elements, right? Really a lot of the data sharing is about metadata.
And you've heard that word a lot today. And I'm not going to spend the four or five hours I'd love to spend explaining it in detail. But, we really need to be thinking a lot about: there's this data here, what is it? What is it about? How do we find it? So, once we get there, a lot of techniques can be used to figure out what's in it. But the real question is how do I know, and a lot of the things that I've been trying to do, learning about data bottom up, are missing this point. So, having figured out that this is a name and that this column is a date, right, makes a huge difference if I'm reading a mortality database, or a maternity database. Right? One is the day somebody was born, the other is the day someone died. Right? But there's nothing in the database that tells you this. It's in the description of what the database is about. So, the second thing is we really need to figure out ways to talk about the datasets and how they change and what's in them. The third one I'm going to call locality. And this really talks to that question we were asked about cultures, languages, and I'm going to add context to that. So, we need linking across those things, but we also need the ability to find things across these different ways of thinking about it. We're going to see something a little later on, I think Wendy's going to talk about a project called the Web Observatory, which we've been involved in together, but one of the problems in that is again we're still organizing the data based on sort of the simple terms that are put in by the creators, and we don't have a way of saying these people who put the dataset in Shanghai are thinking about the communication differently than these people who put it in in San Francisco. Even though there may be shared terms or context. And finally, we need a temporal model, because data is not static. Books change slowly. And you know it's fairly easy to refer to the second printing of something, or the second edition of something.
Very hard to talk about this particular data that was pulled on such and such a date at such and such a time, where it is changing at this kind of speed. So, when we were doing the open data for data.gov one of the big issues we were playing with was how do we tell people how often a given dataset changes? And the decision was made at a higher level than mine to ignore that issue, right? Because from the user's point of view it didn't really matter much. It was just the developer making sure they were pointing at the most up-to-date API. But from the archivist's view, that's crucial. If we want to know what the data was when, we have to have a way of talking about that when. There are some others, but I think those are the big four: we need a way of doing it not as a taxonomy. We need something that goes to the metadata level, rather than to the data element level. We need some way of linking across localities, cultures, and things. And we need a way of capturing the temporal and changing nature of data. So, actually when the semantic web was first created there was a view of building information sources that live near the data, linking them together and having that grow bottom up. And what sort of has happened over time is what's been called linked data has sort of moved down to just trying to put data elements together, and that sort of organizational structure has moved up to languages like OWL and complex ontology languages, which have a role, but it's not this role. So, a lot of what I'm pushing nowadays, and starting to try to get a conversation and really find the place to hold that conversation, is how do we start to organize the information about discovering data, integrating data, once we've pulled it together validating data, some of the stuff that would then feed explaining and exploring data, the [inaudible] etcetera. So, I'll stop there, but we really need new models of web ontology, of linked data etcetera.
That are tied to data description, linking, ingest, integration and all those things Vint was talking about today, so that when we go looking for it, we can find it. I'll stop there, thank you. [ Applause ] >> DAME WENDY HALL: Jim, you're my most disciplined panelist so far, well done. And that was brilliant. Thank you. And our last panelist for this panel is Philip, I can't pronounce your name, you've got too many consonants. >> PHILIP E. SCHREUR: Schreur. >> DAME WENDY HALL: Schreur. I've tried, I've practiced and practiced and completely failed. So, there you go. Philip's come all the way from Stanford to be with us today. He's the assistant university librarian there. He's responsible for technical and access services in the library at Stanford. Which is so crucial to what we're talking about here. So, we're really grateful that he's come all this way to talk. And I'm very interested to see he has a PhD in medieval music theory. We're getting quite a lot of these analogies with the past to where we are today, in terms of we think we're struggling and it is a very different environment. So, we see the impact of what we're doing as we do it, almost. But people have struggled with all sorts of ideas to do with knowledge in the past. And so, there's nothing new there. It's maybe just scale. But the other thing I do know about Stanford, because Mike Keller, who's the head librarian there, is a good friend of mine. And they have done amazing work on the semantic web and digital libraries. I don't know if Philip is going to talk about that today, but they are one of the leading libraries in the world in that respect. So, Philip, over to you. >> PHILIP E. SCHREUR: So, thank you very much. I'm here to talk about the topic today really from the point of view of an academic research library. So, the first thing I wanted to say was that I am very fortunate to work for a university librarian who likes to throw out almost impossible challenges with a certain amount of regularity.
So, I remember back in the late nineties when I was the head of the metadata department at Stanford, he said, could you provide intellectual access to all of the internet and bring it into our catalog? So, I have been puzzling over that for the past, almost 20 years. And not in particular relation to that, but I resigned that position shortly thereafter. Although I did come back to Stanford. More recently we have been very focused on trying to shift the basic technology of the way we do our business towards linked data. And then in a conversation with Mike six or seven years ago, he said, once we make this transition I don't want you to provide access simply to the resources that the library traditionally holds, I want you to provide access to all of the data that the university generates. So, another nearly impossible task. But I think both thoughts are very pertinent to what we are talking about today. I think for us in the library the key to putting the data to work is discovering the data in the first place. So, in my case that would mean either bringing all of the data and metadata that's on the web into our catalog, which I think is both impractical and impossible. Or, it means putting our data out onto the web, which is something that we are working on right now. Or, it means providing some sort of federated search between what is in our catalog and what is out on the web. So, unfortunately, the library environment is a very closed one right now. It is really focused on metadata; it's not focused on data. But it's also focused on metadata primarily in the MARC formats. Which were wonderful in their time, but they were developed in the 1960s in order to represent metadata that was printed or typed on library cards. So, when you think about trying to represent the data on the web in the MARC formats, thinking about them as catalog cards, it's not a very successful transition. So, we really needed to move on to something else. And I think linked data will be that one.
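[Editor's note: the contrast between a card-style record and linked data can be sketched briefly. This is an editorial illustration, not Stanford's pipeline; the URIs and predicate names are invented, and real conversions target vocabularies such as BIBFRAME. MARC tags 100 and 245 genuinely are the personal-name and title fields.]

```python
# A card-style record is flat text: the relationship between the name
# and the title lives only in the reader's head, not in the data.
marc_like_record = {
    "100": "Dewey, Melvil, 1851-1931",                # main entry, personal name
    "245": "Decimal Classification and Relative Index",  # title statement
}

def to_triples(record, work_uri="http://example.org/work/1"):
    """Make the implicit relationships explicit as subject-predicate-object
    triples. Predicate names here are invented placeholders."""
    triples = []
    if "100" in record:
        triples.append((work_uri, "hasAuthor", record["100"]))
    if "245" in record:
        triples.append((work_uri, "hasTitle", record["245"]))
    return triples

for s, p, o in to_triples(marc_like_record):
    print(s, p, o)
```

The sketch also shows why the conversion is lossy in the other direction: any relationship the card layout merely implied, but never encoded, has nothing to convert and simply never becomes a link.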
So, as I said, for the past number of years there have been a lot of efforts in libraries to transform our data into linked data, to be able to get it out onto the web where we see people are searching for data, to be able to incorporate what we do with what is out there on the web. Right now, we are focused on an effort to change our basic functionality, the way we create data, to create data natively in RDF instead of focusing on creation of MARC and conversion. Because we've discovered that that conversion process is extremely lossy. MARC was developed trying to reproduce a catalog card, so when you saw the data on the screen, it's as if it were a catalog card. And there were a lot of relationships which were left up to you in your own mind to create. They're not coded there in the data, so when you convert it to linked data none of those links are actually there. But as I said, it's turning out to be a very costly transition for us. It's not only a particular library that needs to change the way it works. We are in our own web of data suppliers. We have data suppliers who give us cataloging in MARC, we communicate with other libraries nationally and internationally in MARC. Our ILS system, by which we do things like circulation and payments, all the data is in MARC. So, all of that needs to change as an entire ecosystem before we can successfully make that transition, and it's going to be a lengthy and expensive one. As I thought about how libraries need to make use of the data that's out on the web, I think one of the biggest challenges for us is that the data are so extremely variable out there. One of the things which gave us most consternation recently was discovering that the Bulgarian census is now put out as a web document, as opposed to a traditionally printed document. So, that caused a lot of problems for us as far as archiving went.
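The lossiness Schreur describes can be sketched in a few lines. This is a toy illustration, not Stanford's actual pipeline: the MARC field values, the tag-to-predicate crosswalk, and the `example.org` namespace are all invented for the example.

```python
# Toy sketch of why MARC-to-linked-data conversion is lossy.
# The record, crosswalk, and URIs below are illustrative, not real.

# A MARC record is essentially a flat list of tagged text fields,
# like lines typed on a catalog card.
marc_record = [
    ("100", "Austen, Jane, 1775-1817."),   # main entry: personal name
    ("245", "Pride and prejudice /"),      # title statement
    ("260", "London : T. Egerton, 1813."), # publication statement
]

# A minimal, hypothetical tag-to-predicate crosswalk.
CROSSWALK = {
    "100": "http://example.org/vocab/creator",
    "245": "http://example.org/vocab/title",
}

def marc_to_triples(record, subject):
    """Convert mapped MARC fields to (subject, predicate, object) triples.

    Unmapped fields are silently dropped, and relationships a human
    reader would infer from the card (e.g. that 1813 in field 260 is
    a publication date) never become explicit links.
    """
    triples, dropped = [], []
    for tag, value in record:
        predicate = CROSSWALK.get(tag)
        if predicate is None:
            dropped.append((tag, value))
        else:
            triples.append((subject, predicate, value))
    return triples, dropped

triples, dropped = marc_to_triples(marc_record, "http://example.org/work/1")
# Two fields map cleanly; the publication statement falls out entirely.
```

The point of the sketch is that the conversion can only ever emit the links the crosswalk anticipates; everything else, including the implicit relationships a cataloger relied on the reader to see, is lost.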
But again, it's a great example of a resource on the web which really closely mimics the traditional library resource. And in those cases, we would like to bring it in, give it our traditional cataloging, to mix it with the resources that it's very closely related to. So, that is one thing. But on the other hand, there's a very small percentage of internet resources which are really of that type. And so, we are simply overwhelmed by those numbers. Our current model of using humans to provide intellectual access to individual resources simply will not scale to the web, it's an impossible situation. So, even if we shift to linked data, by which we could incorporate that data into the web, we could never describe the web. So, we're still left with a conundrum. I think the other thing is that this symposium really is about making the data work, it really wasn't about making the metadata work. Although metadata has been mentioned a number of times. So, when I think metadata, it is a type of data, but it's really data with an ulterior motive. You know it is sneakily trying to link things together in a way that you can make them discoverable. But often, in doing that, it loses some specificity of description in order to do that linking across domains. So, it is a very tricky concept to be able to work with. So, I think if we really are going to try to incorporate the data that's out on the web with the data that's in traditional library resources, we are going to have to focus on a much more automated process for providing access to that information. So, in dealing with sort of more traditional web resources, look at things like automated entity extraction and assignment of identifiers, and reconciliation behind the scenes to be able to draw things together. Also, I think we need to look at things like automated semantic analysis of content, and use a common vocabulary so that we can draw things together.
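As a minimal sketch of what automated entity extraction with identifier assignment might look like: here a simple gazetteer lookup stands in for a real trained named-entity recognizer, and the surface forms and `example.org` identifiers are invented for illustration.

```python
# Minimal sketch of automated entity extraction with identifier
# assignment: a gazetteer lookup standing in for a trained NER model.
# The names and URIs below are invented examples.

# Gazetteer: known surface forms -> canonical identifier.
GAZETTEER = {
    "stanford university": "http://example.org/id/org/stanford",
    "stanford": "http://example.org/id/org/stanford",
    "library of congress": "http://example.org/id/org/loc",
}

def extract_entities(text):
    """Scan text for known surface forms, longest match first,
    and return the distinct canonical identifiers found."""
    found = set()
    lowered = text.lower()
    for surface in sorted(GAZETTEER, key=len, reverse=True):
        if surface in lowered:
            found.add(GAZETTEER[surface])
    return found

ids = extract_entities(
    "A dataset from Stanford was deposited at the Library of Congress."
)
# Different surface forms resolve to the same canonical identifiers,
# which is what lets documents be drawn together behind the scenes.
```

Production systems would use statistical models rather than a fixed lookup, but the shape is the same: text in, canonical identifiers out, with reconciliation happening against the identifier space rather than the strings.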
For about five years I worked at a publisher called HighWire Press. And there we were very interested in doing the semantic analysis of text. And it ended up, at the end of the five-year period, that some of our major subscribers to the service liked the machine semantic analysis of the text better than the human semantic analysis of the text, because they said, well, you know, you both make mistakes, but at least the machine makes a consistent mistake. So, they found it easier to deal with. And at least it's high quality. So, just in summary, when I was thinking about what's really important for academic libraries, from our point of view, what will be our greatest challenges in trying to make the data from the web work? So, the first one is how we select which portions of the web to capture. So, I know we've talked about it a lot here. I think it will come down to either we divvy up the web and decide on certain parts to collect, or our collection development people look at what makes most sense for Stanford and decide to focus on those types of resources. But then, once that's done, we really need to be able to find an automated way of extracting those resources which really kind of mimic our more traditional library resources. Because we really want to provide access to them in the way we have, to be able to integrate those more easily into our online catalog in that traditional sense. As I said, for the rest of the materials, we need to focus on those more automated ways of providing access and providing metadata, through automated entity extraction and semantic analysis. A huge problem, which we're only beginning to deal with, is the assignment of identifiers and their reconciliation, for integration and discovery purposes across the web and the library catalog. We know linked data sounds intriguing and it sounds like a great solution, but you know worldwide there will be many identifiers assigned for that same entity.
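One common automated approach to the reconciliation problem is to treat pairwise "same entity" assertions (in the style of owl:sameAs links) as edges and cluster equivalent identifiers with a union-find. A toy sketch, with invented identifiers loosely modeled on VIAF/LCNAF/Wikidata prefixes:

```python
# Toy sketch of identifier reconciliation: given pairwise "same
# entity" assertions, cluster all equivalent identifiers with a
# union-find. The identifiers below are invented.

class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

# sameAs-style assertions gathered from different sources.
same_as = [
    ("viaf:0001", "lcnaf:n0001"),
    ("lcnaf:n0001", "wikidata:Q0001"),
    ("isni:0001", "viaf:0001"),
]

uf = UnionFind()
for a, b in same_as:
    uf.union(a, b)

# All four identifiers now reconcile to one canonical representative,
# so a search on any one of them can retrieve records using the others.
cluster = {x for x in list(uf.parent) if uf.find(x) == uf.find("wikidata:Q0001")}
```

The hard part in practice is not the clustering but deciding which pairwise assertions to trust, since one wrong link silently merges two distinct entities.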
And we need to find a way, an automated way, of reconciling all of that, otherwise nothing is linked. And I think the last thing that we've been puzzling over, and I really don't know the answer to. We really want to be able to make sure that those semantic associations that we create persist over time. So, we've talked a lot about preservation here, but I think it's a different concept. It's persistence. It's not preservation. And I think it's one of those topics that we're just really beginning to struggle with now. Yep. [ Applause ] >> DAME WENDY HALL: All right. We've got plenty of time. Well done, panel. And I could sit and listen to you forever, Philip, because you talk my language, and now I know why I love working with Stanford Library. Because they get the semantics bit. Yeah? And that's a message to all librarians here. At some point, you've got to deal with that properly. Okay. So, it's hard to see from here, isn't it, where the mics are. Where are the mics? Ah-ha. There's a man with two mics over there. Okay, who wants to take the conversation forward then? You look so tired, come on, this is great stuff here. Questions, answers. >> KATY BÖRNER: I've got a question. >> DAME WENDY HALL: Oh, let's take Professor Jain, here in the front row. You save it, save it. I couldn't see, didn't see it. Here, right? >> RAMESH JAIN: Hello, this is Ramesh Jain. I've been hearing a lot about data and metadata, and when we talk about metadata, maybe this is my perception, but I am getting the impression that we consider that it's second class data, is that right? >> DAME WENDY HALL: Do you want to take that first, Phil, because you brought up metadata. >> PHILIP E. SCHREUR: Right. I did. So, I wouldn't say it's second class. But it certainly is different. I mean, we had a conversation once with the Stanford history faculty, and it's one of the few times in my life concerning metadata that they truly stunned me, and I didn't quite know what to say.
So, they were not in favor of library metadata. What they said was, they love metadata that describes their collection, but they wanted metadata that was focused on their collection, that described it well, as opposed to having more generic metadata that could link across domains. And they found that not helpful for them. So, since the whole point of everything that I do is trying to draw things together through metadata, I was very puzzled, or surprised, to hear that there was a large group of faculty who were not necessarily in favor of it. So, I certainly wouldn't say that it's second class, but you have to realize that it does have that ulterior motive, and not everybody is happy with how it works. >> DAME WENDY HALL: Go on, Jim, please. >> JAMES HENDLER: So, I would throw in, I think what people miss is that metadata has a completely different purpose than the data. So, it's not that it's second class, it's that it's different. In the database community, where you're traditionally looking at what's inside one entity that you understand very well no matter how large it is, that notion of metadata as a second-class citizen becomes clear, because we know about our data. We're using the same store, we're doing mining, etcetera. But when you start to move into, I use the acronym DIVE in one of my talks, DIVE into data: discovery, integration, validation, and exploration. When you want to move into those things: what's out there that I can use, how do I bring it together? Did the thing that came out make sense? Right? So, again, we run some data miner on something and it's telling us something that clearly doesn't pass the sniff test, we've got to watch the watchers. And then, you know, a lot of the human perception, you look at the graphs that Katy was showing, humans are much better at looking at those than machines are, still, in terms of understanding whether they're telling us anything new and different.
And so, those kinds of problems don't live in the traditional database box. In fact, I would argue for archiving and things like that, the metadata is probably more important. For example, we're going to have to do data triage. Right? There's no way that the amount of data being produced on a daily basis, when you take open data, social media data, web data, changing data, can be preserved. But a lot of the metadata about it can. So, we could know that there were articles about this particular topic being published at this particular rate at this particular time, even if we couldn't keep track; articles we can probably keep. But a lot of the videos, a lot of the details of real time, when we talk of the internet of things and sensor data. You know, we're not going to be able to preserve the sensor data. We're going to be able to say this many people were using this kind of sensor for these kinds of techniques. And I think that's, to me, where the semantics comes back in, as you said. And that's why so much of this linking stuff together has to happen at that stuff about the data, not the data itself. >> LEE RAINIE: Do you know, in the Dewey Decimal System, speaking of metadata. Dogs and cats, just to go back to the pinpoint of this conference. Dogs and cats are archived as industrial agricultural equipment [laughter]. >> DAME WENDY HALL: Really? >> LEE RAINIE: Because that's what they did in the late 19th century. They worked on farms. >> JAMES HENDLER: Yeah, so when I mentioned the temporal stuff I should have mentioned it changed over time [laughter]. >> DAME WENDY HALL: Ah yes, we should have Wolfgang Nejdl here for the temporal stuff. We haven't talked about temporal stuff. I did invite him but he didn't come. Does anyone in the audience want to pick up a piece about metadata? Because I actually think it's different, but as important, if not more so, in some circumstances. Do you want to say anything? No? Oh, okay. Well, you could [laughter].
Hang on, hang on, hold that thought, because of the recording, Jefferson, that's all. >> JEFFERSON BAILEY: I do think there's recognition in the library archives community that human-provided descriptive, or technical, or administrative detail will obviously never scale to the rate at which digital collections are being acquired and ingested. So, I don't know what the future is, machine learning, automated extraction, semantic entities, it will have to be highly automated. And anything automated of course is human created, but there's going to be all sorts of levels of mediation that are sort of opaque to users, and I think that's something to think about. >> DAME WENDY HALL: This was the point Philip made in his talk, this will have to scale. This has to be automated. >> JAMES HENDLER: So, I'm going to disagree to a slight extent, in that I believe the extraction and collection will have to be automated. But I think if we're not using human concepts to organize the data, then we're not going to have humans able to find the data, right? So, I think again, one of the things that I believe, and when I was talking to Vint before he was talking about bottom up. Well, what won't happen is we won't get a Dewey Decimal system that designs the whole world from the top down. We need something that can merge bottom up. But a lot of the bridging, a lot of that semantics, is going to have to come from a human saying, these are key concepts in a domain. Now we can use computers to extract those concepts or types. So, I don't believe that just clustering and things like that is going to get us where we need to go. Assuming it's us as humans who want to do the research. >> PHILIP E. SCHREUR: So, there's one other thing I wanted to add as well, one of the things I mentioned but we haven't talked about very much, is that there then needs to be a commonly agreed upon vocabulary for pulling those things together.
And one of the quotes that somebody mentioned once, which I loved, and I use it whenever I can: the wonderful thing about ontologies is that there are so many of them. So, when I was working at HighWire, the most bitter fights I ever had were over vocabulary development, between groups like the American Heart Association and the American Physiological Society, talking about exactly the same system, but they fought bitterly to have their vocabulary and their structure represent the data, or rather the metadata. So, it's one of the big challenges we have, not only within a country, but across countries, to come up with that vocabulary that can be used. >> JAMES HENDLER: But imagine doing the linking thing you're doing to data to those vocabularies. So, Tim Berners-Lee and Ora Lassila and I wrote an article about that 15 years ago called "The Semantic Web," and that's really what we were talking about, is that we have to link the concept space, right? So, this group can represent their data, but someone else can help map it, and then we can take those mappings and map it together. >> DAME WENDY HALL: Ramesh wants to come back here. I can't see the people holding the mics, they're hiding behind the pillars. Where's the other, ah, can you sort of stand in my line of sight, so [laughter]. >> RAMESH JAIN: I really think that metadata is one of the most misunderstood things. And we started with the term context, now we just heard the terms ontology and languages. The fact is, languages always divide, whereas the signals, the original information, unify people. And that's why we've got a very different perspective on metadata. And the metadata should not be considered something which is secondary, and which is qualifying some central piece of data. When we talk about books, we consider that books are the primary thing. Anything associated with that is going to be the secondary thing. But those days are gone.
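Hendler's point, that each community keeps its own vocabulary and the mappings are composed afterwards, can be sketched very simply. The medical terms and the bridge concept below are invented for illustration, not taken from either society's actual vocabulary:

```python
# Sketch of linking two local vocabularies through a shared bridge
# concept, then composing the mappings so terms align end to end.
# The terms and the bridge vocabulary are invented for illustration.

# Each community keeps its own preferred term for the same concept.
heart_assoc_to_bridge = {"myocardial infarction": "concept:heart_attack"}
physiology_to_bridge = {"cardiac infarction": "concept:heart_attack"}

def align(vocab_a, vocab_b):
    """Pair terms from two vocabularies that map to the same bridge concept."""
    by_concept = {}
    for term, concept in vocab_a.items():
        by_concept.setdefault(concept, []).append(term)
    pairs = []
    for term, concept in vocab_b.items():
        for other in by_concept.get(concept, []):
            pairs.append((other, term))
    return pairs

pairs = align(heart_assoc_to_bridge, physiology_to_bridge)
# Neither group has to give up its vocabulary; the alignment lives
# in the mappings, which a third party can maintain.
```

This is the design choice that defused the vocabulary fights Schreur describes: nobody's structure "wins", the linking happens in a separate mapping layer.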
When you look at a photo, and in the next panel possibly we talk about that, but what is metadata and what is data? I think that distinction disappears. So, what, how is it that we, I would like to hear your perspective, in these days when all these different types of data are coming at us, how do we start really considering what makes the semantics of the data emerge? It's not one data, another metadata, it's just different streams of data, and there the semantics is emerging. >> DAME WENDY HALL: I was struck when I look at Katy's visualizations, you're visualizing, well, this is incredibly hard work to produce those visualizations, I think, because you've got to, I shouldn't be talking about this, you should. But it's there, that to me, what about the metadata in that process, does that make any sense to you, I'm not explaining it very well. >> KATY BÖRNER: So, I don't think you will like this, but there are datasets where you could have way more metadata than the dataset itself. Because if you have multiple insight needs. >> DAME WENDY HALL: He likes that, yeah. >> KATY BÖRNER: So, every single insight need will actually require potentially a reconceptualization of what's in that data. And depending on how you represent that data, you can run different types of analysis, you can get different kinds of maps. And you get different types of insights. I do agree with Jim that it is very helpful if you actually have a few human curated kind of data dictionaries, as I would call them. However, I do also agree that this doesn't scale, so very soon we will have a very heterogeneous landscape, where you have some human generated kind of historical data dictionaries, if you will, like the Dewey Decimal System, and then also organizations. And at Stanford University you have NCBO, which is aligning like 300 plus different ontologies just in the biomedical domain itself. But then you will also have many data-derived structures, and you will want to interlink those.
You'd want to use the machine intelligence whenever possible. But then, if you have historical or current structures, then you want to interlink them with those machine generated ones. And I don't think there is much work yet on that. Even though there are now programming languages which give certain tasks to machines and certain tasks to human beings on mechanical [inaudible] etcetera, and then it goes back into tasks that machines are better at doing, and certain tasks that the human users are better at doing. Realizing that heterogeneous compiling and running of code will also be visible in how we deal with data. So, I see a very heterogeneous future for that. >> JAMES HENDLER: Yeah, so in those four or five hours I'm not going to spend, let me just say that the notion of metadata is driven by the web to smaller, multiply linked things, not these very large, catalog-driven, very heavyweight metadata. So, the traditional notion of metadata in the library is like the Dewey Decimal System. The thing we need for data is a small set of terms at a very high level, and then sub communities that can expand them in their own ways and provide linking through this graph and network of this complex entity, and I think when you look at that model you can see both scaling and you can see places where you can tie in machines. So, we've been doing that with extraction and things. So, again, I think it's where the technology space is starting to hit the conceptual space, but at the moment, the real problem is no one owns it. That's what I find the biggest frustration. I go to the funding agencies and they all say the same thing to me, which is, that's a great idea, I wish we funded that, right. And so again, it's very hard. It's an infrastructure need. It's a library need, it's an archiving need, and those are very complicated at the moment to find ways to do at the kind of scales we need, to really understand the web.
Wendy and I were, 10 years ago, co-authors of an article with some other people, including this Berners-Lee guy, called "Creating a Science of the Web," and it's led to this whole web science thing. But you know, again, studying it is that kind of interdisciplinary effort: it must have social science involved, it must have library science involved, but it also needs the technical and fundamental engineering skills for the web. It's a very complex space, but you know, the fact that you're all here speaks to the importance of this and the key issues. I promised Wendy I'd mention web science. >> DAME WENDY HALL: You have, two or three times now. Anyone want to take another direction? Yes, to my left. >> SPEAKER 1: Hi, my name is Baha Ecminar [assumed spelling], I work in the Library of Congress here. I wanted to ask you the question about helping us contextualize a good and effective role for the Library of Congress in the century ahead. So, in the past, we have had plenty of data, but most of it publication related, manuscript related, traditional library items. We're beginning to get large amounts of data in, but we don't always collect it as data, we collect it as an item, or as related to a publication. So, if I can make the question concise, what can the Library of Congress do with its consistent but very finite set of resources in the century ahead to make data valuable? >> DAME WENDY HALL: Yeah, I think everyone will want to answer that, I will talk to it, but Katy, you go first. >> KATY BÖRNER: So, I liked Jim's question about how do you index data. And one way that you just mentioned was to connect it to the publication in which said dataset was used. This way you can potentially even rerun that analysis, you get to these rerunnable papers, which is what many of us want.
Another way is also to connect the papers now to software artifacts, to the authors, which they are typically connected to, but now that ORCID is in existence, try to make people use it, find carrots for them to actually use it, and then you get these very heterogeneous networks where you can pull out one author and all the artifacts he or she created, be it datasets, or software artifacts, or papers, they just all come with it. And I think that will create a much richer search structure and much richer result sets when you actually start querying those datasets. And it will really lead to more reproducibility, because you know whom to ask how to run that tool, or that workflow, etcetera. >> PHILIP E. SCHREUR: So, the one thing I would add is something the Library of Congress is doing now. So, I mentioned there is this transition in library metadata from MARC to linked data. And the one thing that LC has really led us on is the development of a new schema called BIBFRAME, in RDF, as something which will better represent the library data. There are a number of schemas out there, but they really took the lead in trying to develop this schema to represent library data. It eventually will be community driven, but they really pushed it to get it going, and that was a great effort. >> DAME WENDY HALL: Well, yes. >> JAMES HENDLER: So, the only thing I would throw in is I think the, so the standards that everyone wants grow from de facto standards of use. And I think this is a place that has the authority, the power and the driving need, in looking at archiving the web and things like that, to be the place that some of that work could be done. And you know, again, I don't know enough about the organization, the internal structures, but I think things that grow out of here, that look at those technologies, look at the vocabularies, look at the new things, and say, here's stuff that can be done, would be something that would attract a lot of attention and grow.
>> DAME WENDY HALL: Can I just add that I think librarians, there's always been a lot of data science in libraries. I think there's a big shift for the next century, where if I were the Library of Congress, I'd be investing a lot in data science training for my staff. And I don't just mean from the concept of library data science, I mean just generally getting to grips with data science, and getting as many people as possible in the library understanding this world. Because in the future, libraries will be data warehouses, largely. If I was running the place, that's what I'd do. But I'm not, so. Is that a hand up there? Yes. >> LIZ MADDEN: I'm Liz Madden from the Library of Congress, and this is more of a plea from the bottom up to all of you who have students in colleges and things. I work with the data, the metadata in particular, and also the regular data that comes in, and with a lot of the interns and the new library students. Along those lines, it would be really helpful for people being taught also to understand basic concepts like relational databases. Because I think that's what we see when we get datasets. I looked at the map of the domains and I saw how tiny the US was, and I know that's just because they use .US, and that's a database error to me. And so, that kind of concept seems really basic, probably, to the people in this room, but it's a really foreign concept to a lot of people who are studying humanities and stuff like that. So. >> DAME WENDY HALL: That's very interesting. I started teaching relational databases when I first taught as a computer scientist in the 1980s, I don't know exactly when it was. It's interesting you should bring it up now. I would widen it beyond relational. I think it's about understanding the world of data. No, actually. No, I would disagree with that. I would disagree with that. Sorry, Jim, you teach databases. >> JAMES HENDLER: Yeah, I would.
Well, I don't teach databases, because in computer science, relational has always gone with databases; what I teach is about data. And I think understanding that difference, that databases are one of many ways of holding, thinking about, using, retrieving, storing, etcetera, data. That's really the key. And you know, again, my early work in AI led to sort of semantic data, but then to the realization that it's about describing all that other stuff. So, the problem is, once you're into the relational database, you're forgetting all that outside context. But, that said, I think teaching students about data. We, at RPI, I'll put in a plug, are actually going to start requiring every student. So, not just the computer scientists, but every student is going to have to take data in tons of courses, right. So, we think data is to this generation what writing was to the last one, the thing they didn't learn in high school that they should have. And we're going to really push hard on that across the entire curriculum, across the entire university. >> DAME WENDY HALL: And digital literacy generally for the entire population. Oh, one over, I think this is probably going to be our last question, so we have a chance to hand over between panels. So, yeah. Tell us who you are. >> SPEAKER 2: [Inaudible] and I'm a student from the George Washington University. So, I have a question about, when people enter into these large-scale datathons, they may face the issue of how to solve the storage problem of the data they are being given or retrieving, especially for the metadata, because the size is already large. So, when people try to recover them, how do we solve these problems? Thank you. >> DAME WENDY HALL: Thank you. Jim, you're nodding. >> JAMES HENDLER: Yeah, it's a great question. In fact, when I was first asked to be on this panel, I thought that's what I was going to talk about, until I realized I really had nothing very smart to say about it. But it's clearly the challenge.
And the scale with which our ability to store data has been growing has, up until recently, been keeping pace with the increase in data. But you can look at almost any company that's publishing those data now showing we're heading for the crossover point. So, one of the hardest things I think we're going to have to learn is how do we throw away data. How do we decide what to collect, what not to collect. And I don't mean now in the sense of meaningfulness, in the sense that an archivist thinks about whether we're going to store these or those, but really, the Large Hadron Collider generates so much data that a lot of what the different sensors do is figure out which data they can ignore. Right? So, 90% of what gets created never gets put into the petabytes that they generate. And it happens at a very low, automated level and things like that. I think we're going to have to start really grappling with some of those kinds of issues. So, not just what do we want in the library, but what must we keep, because there's going to be a lot we can't. And I think one of the reasons people tend to move toward journal articles, to books, and even to Twitter, is because those things tend to be bounded in such a way that we can think about keeping all or almost all of it. But when you start moving to, you know, every video that was taken on the web yesterday, right, on anybody's phone yesterday, every Vine, every, sorry, I'm going to get on my favorite soapbox, I'll get off of it, but again, we're going to have to figure out. I mean, a lot of that is crucial, but a lot of it isn't, and in some way those kinds of collection management type issues are going to be very, very crucial to the next generation of what you all do. >> DAME WENDY HALL: I don't think storage is the issue, personally I think. I was just looking up something my colleagues in the Optoelectronics Research Centre are doing in Southampton, where they're predicting, you too, yeah?
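The triage Hendler describes, discarding most raw records at ingest while keeping aggregate metadata about everything that arrived, can be sketched as a simple streaming filter. The record format and the "interestingness" threshold below are invented for illustration:

```python
# Sketch of data triage at ingest: keep only records that pass a
# filter, but retain summary metadata about everything seen, so we
# can still say "this many readings of this kind arrived" even after
# the raw data is gone. Record format and threshold are invented.
from collections import Counter

KEEP_THRESHOLD = 0.9  # hypothetical interestingness cutoff

def triage(stream):
    kept = []
    seen = Counter()   # this metadata survives even when raw data doesn't
    for record in stream:
        seen[record["sensor"]] += 1
        if record["score"] >= KEEP_THRESHOLD:
            kept.append(record)
    return kept, seen

stream = [
    {"sensor": "camera", "score": 0.95},
    {"sensor": "camera", "score": 0.10},
    {"sensor": "thermometer", "score": 0.50},
    {"sensor": "camera", "score": 0.99},
]
kept, seen = triage(stream)
# Only two records survive, but the counts still record that three
# camera readings and one thermometer reading arrived.
```

This mirrors the LHC pattern in miniature: the filtering decision is cheap and automated, and what is preserved about the discarded records is the metadata summary, not the data itself.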
They're predicting, you know, that you'll be able to store as much data as you could possibly want. And so, I think materials science is just going to give us solutions to that. I think the problem is, making the haystack bigger doesn't enable you to find the needle. I mean, this is the problem. I think discovery and navigation is going to be the problem. And the question is, do you store everything, or what do you decide, you know, to throw away, if you're not storing. I mean, Ted Nelson, who started the whole hypertext thing, always argued you absolutely store everything. That's always been his position and he's never changed it. And I think that's a big issue for debate. And it's one that the libraries have got to tackle. You know, they are the first frontline of this really. Do you want any final comments from our fellow panelists? People feel, no? Okay. So, everyone, I can sense the room is a bit hot, so we're going to do a handover. There's water. There's coffee. I'd love there to be tea, I don't know if there's tea. It's 3 o'clock, I'm a Brit. And we'll just unmic this panel and mic the next panel in just a five-minute break. And I'd like to say thank you very much to the panelists this afternoon. [ Applause ] This is the final panel this afternoon. And it's chaired, I'll let him make the introduction to the panelists. It's chaired by Matt Weber, who it's been my great privilege to work with. We have been working on this event for a year, I think, at least. And Matt, with Jimmy and Ian, organized the datathon, but he also helped me, with David and Noshir Contractor, who unfortunately couldn't be with us today because he had to be in France, anyway. No, he has a very genuine reason for not being able to be here. But the four of us put this symposium together. But it was really Matt who did all the work.
Well, together we did it, but Matt was the person who was there for me when we were organizing this and working with the Kluge Center to get this to happen. So, Matt, now it's your turn to chair this panel, over to you. >> MATTHEW WEBER: Well, thank you to Wendy for her leadership in this, it's really a privilege to be here and to have an opportunity to chair this closing panel. I just want to make a few quick remarks and then get to the meat of the conversation. The title of this panel is "Saving Media," and I always think of news media, because that's really what I study day in, day out. And on that topic, I was looking at Facebook last night and happened to see a post by one of my graduate students who was talking about her 7-year-old daughter, and she was reporting a story of a conversation she'd had with her daughter last night. And her daughter had said to her, 'did you know years and years ago people used to get their news from papers' [laughter]. And she's reporting the story back, and she said, 'I turned to my daughter and I said yes, they're called newspapers.' And her daughter looked at her and said, 'were you alive then?' [ Laughter ] And that's what it's come to, folks. But this idea, not that we necessarily need to have newspapers, I'm very split on the idea of whether we'll have printed papers in the next 40, 50 years. But that is not the goal of this panel. But this idea that we need to preserve the media, and preserve the content that we're producing today, is front and center to the discussion that I'm hoping we're going to engage with right now. In my own research, I spend a lot of time looking at archived digital news media. And I have found that it's critically important to have rich and detailed archives of news and of events that have occurred across the globe, for many, many different reasons, and I'll give you just two really quick examples. One is, go back to, for instance, the events that occurred around Hurricane Katrina.
CNN at the time was an online news source that did quite a bit to cover Hurricane Katrina. And they created their own special archive for Hurricane Katrina, and you can go to CNN's website and I believe it's still on there. I did not check just now to see if it is. But they have a repository of Hurricane Katrina news coverage. And if you go and you look at that, it's been, it's essentially been cleaned up. It is not the coverage that you will find if you, for instance, go to the Internet Archive. And if you go to the Internet Archive and look at the very same coverage from CNN and actually look at what was on the web at the time, it's a very different story. And the point being, we need accurate archives because we do need to understand actually what occurred at the time. So, as we start to think about these issues and the importance of archiving news media in the modern landscape of media and social media, just a couple thoughts that I'll put out there as you hear our presentations today. Think about this idea of what is news, and what are media. What are we saving? What is a journalist? Who is reporting on an event? Who is reporting on news? And how do we record that? These are clearly hard questions to answer, but these are some of the challenges that we face in thinking about saving the media. So, with that I'm going to introduce our first panelist and then I'm going to take a seat down here so that I can help to keep us on track today. And our first panelist is my colleague Phil Napoli from Rutgers University. Phil is a professor of journalism and media studies as well as the associate dean of research at the School of Communication and Information. Beyond being a fantastic colleague and someone I have been fortunate enough to work with for the past couple of years, Phil is leading a new project looking at local news media and developing new methods and measures for understanding local news media coverage.
And then, lastly, I will say that unfortunately Phil is leaving Rutgers University and breaking my research heart. And joining Duke University as the James R. Shipley Chair of Public Policy. He'll be starting there in the fall. So, with that I'll hand things over to Phil. >> PHILIP NAPOLI: All right, thanks. So, seated I'm good? >> MATTHEW WEBER: What's your preference? >> PHILIP NAPOLI: I'm happy to sit. I don't have any slides or anything like that. Hi everybody. Thanks, Matt. I'm talking today, I guess, really as a social scientist, as a wannabe user of all of this archival data that don't yet exist to serve all of my needs. And as Matt mentioned, I sort of have an interest in the relationship between media and public policy. And so, I wanted to kick off really sort of continuing what you were talking about, which is why this matters so much. Just to give you an example. I came here because, you know, being here in DC reminded me of something that happens in this city every four years. And I hope this will sort of illustrate why this is such an important area to archive, but every four years the Federal Communications Commission is required by Congress to evaluate the adequacy of its media ownership regulations. And you're like, what's he talking about, why does this matter? But this is something I get involved with every four years, and have for, you know, quite a while now. And the fascinating part about it is that every four years the FCC commissions a series of studies, and the goal of these studies is to determine whether or not these regulations need to be altered, strengthened, relaxed, etcetera. And every four years most of these studies say the exact same thing; they all have a disclaimer that says something along the lines of yes, it would have been nice to really investigate this question in a rigorous way, however, the data don't exist.
So, you know, the FCC is trying to answer questions like, you know, what effect does concentration of media ownership have on the diversity, the quality or the quantity of local news? Is there some relationship between the characteristics of the owners of various news outlets and the ideological orientation of the content? And literally every four years they try to remake the wheel and try to come up with studies, and they've never developed any kind of systematic data archive to answer these questions. And every four years the studies basically say, yep, we're a bad study, but here it is, go try to make decisions based on it. And to me it just highlights how important having that sort of archive of the performance of our journalistic system is, and, a point I really want to emphasize today, is that this sort of archiving, and we haven't talked about it much in this dimension, has to drill down to the local level. And that's something that we're dealing with now in our current research. So, you know, from a lot of these types of concerns, the FCC did something else recently: they wanted to assess the extent to which communities' critical information needs were being met. I don't know if anyone followed this, this was a couple years ago, and just doing that research was so politically, you know, toxic, that the research got cancelled, because they were actually going to go in and actually talk to journalists about how they do their job. But nobody had a problem, which is kind of interesting, with the process, which has been happening for years, of actually studying journalistic output. But once again the data aren't really there to do that. So, we've been working on a project where we're essentially going to gather, with the help of the Internet Archive and Matt, the totality of journalistic output for a sample of 100 communities in the US, for only about 7 days, I should mention that.
And the idea is, whether they are a television station or a radio station, or a hyper local news site, your web presence represents, you know, some reasonable indicator of your commitment to producing journalism. And the idea is to start to try to inform this other very pressing policy issue today, of course, which is essentially what do we do about the ongoing crisis in journalism? And how do we inform efforts to try to address it? So, there are a number of foundations, fortunately, that we have here in the US that really want to provide assistance to keep local journalism robust in the face of the variety of technological and economic conditions that are making it a less and less viable enterprise. But they don't know, you know, what are the real contours of this crisis? In what kinds of communities is the crisis worse versus better? What are the conditions that help us understand where this crisis is really pronounced? So, we're trying to gather that kind of data, essentially on the output and the performance and the infrastructure of journalism at the local level. And it was interesting, as we started to do this work we realized there really is no precedent for it. You know, the archiving of journalistic content, especially at the local level. If you wanted to do a study on, you know, local radio's coverage of an issue from six months ago, you'd have basically nothing. Local TV, same thing. And again, we know this is the case when we talk about online content. So, this entire media ecosystem, and we use that term a lot in our research now to really try to understand the interconnections between old and new media, but also recognizing that our media have converged to a sufficient extent now that you really can use web content as a window into the totality of a community's media ecosystem. And so, that's where the archiving of this content becomes so important.
Because we don't know; you know, we have these broad generalizations about what's happening in journalism, and in fact very little of it is based on actual efforts to understand the context in which journalism maybe matters the most, which is that local level. The community media. The context in which you vote, etcetera. You know, when you look at these issues from a national or international level, the same kinds of problems don't exist. There are really great archives of nightly network newscasts and things like that. But if you wanted to get a sense of how the local news was covering something like Katrina, or whether or not, you know, different levels of competition had positive or negative effects on the state of local news. Answering questions that could truly guide policies directed at making sure that we have a media ecosystem that really serves citizens well, that creates an informed citizenry, that really, you know, enhances the democratic process. We have been flying blind in that area for a very long time. And again, that's the irony: it's not a web-based problem, it's something that's characterized all of our electronic media leading up to this, that, you know, it's still easier to do a study of news coverage from literally 1940 than it is to do a study of news coverage from, you know, a year ago. Because, you know what, they did a great job of archiving newspapers. It's all there, but if you want to, you know, really study the complexities of today's media ecosystem, it's challenging. And so, that's what's fun about what we're doing with the internet archiving. And Jefferson is fortunately saddled with solving this problem, even there. We handed them a list of URLs for all these local media outlets. And they said, jeez, yeah, you know, this falls below our radar, a lot of this stuff, because it comes and goes too. Local media these days is so, you know, outlets come, they go, you know, they go out of business. It's a very, you know, sort of dynamic space.
And it's a space in which, again, the audiences for a lot of this stuff fall below thresholds that you might think would matter for a sort of large archiving project. But we're trying to drill down to this local level now, and hoping, you know, trying with the logic that whether it's the economy, or the environment, or jobs, or education, we've always had these regular indicators, these ways we could take the temperature of how's the economy in this state, or how's the economy in this region. We've never had that about, you know, hey, how's journalism doing here? Because journalism, you know, all the ways you learn about the economy, education, etcetera, journalism, at least traditionally, has been, maybe to a decreasing extent, the mechanism for that. But we have, sort of, you know, thank God for you guys, the state of the news media that we have sort of broadly, right? But now, you know, can we drill down deeper? And sort of understand how a community in New Jersey's journalism, you know, ecosystem has evolved? Is it getting stronger? Is it getting weaker? And does it need help? So, these are the big questions we're trying to deal with, and thank God folks like Matt and Jefferson and the Internet Archive are helping develop the tools that will allow us to do this kind of work. But again, it's not just, you know, historical internet, it's truly trying to inform policy; that's the kind of relevance of this kind of work. So, I'll stop there. Thanks. [ Applause ] >> MATTHEW WEBER: All right, thank you Phil. Our second panelist is Ramesh Jain, he's the Donald Bren Professor in Information and Computer Sciences at the University of California, Irvine. We were fortunate earlier today to be joined by one of the fathers of the internet, and we are now joined by the father of multimedia [laughter]. >> RAMESH JAIN: So, everybody's awake? Looks like that.
So, what I'm going to do is to talk about some of the issues that in my opinion are burning issues, particularly when we are talking about saving the web and the way information is nowadays getting created. Let me start with one simple chart, and that is, in fact, if we think about it, this century is really very different from the last century. There is a pretty interesting diagram that maybe Miguel put up last year, or somebody else put it up last year. When you look at this diagram, this brings in some very interesting perspective. When we start looking at this diagram, what you find is that these are the analog cameras that were being sold. And then you suddenly come here and the world completely changes. And when you look at these trends, as you see, this trend really started taking place between 1999 and 2001. This is very interesting and has serious implications. Matt told you a story about a daughter and about the newspaper. I had a very similar experience, not secondhand but firsthand. My own granddaughter, who is now about 8 years old. Two years ago, she comes to me with a CD and, she calls me nana, she says, 'nana, my mother is not telling me the truth. She says that music comes out of these CDs,' right? 'Out of this device.' And I said, 'yeah, that is true.' She said, 'no, that is not true. Music comes out of iPod.' And that's part of the world happening, but more interesting than that is really this particular case. When you start looking at the world, for most of us here, and this is very closely related to what we were talking about, data, metadata and all these kinds of things. For most of us, until very recently, we tolerated photos, we tolerated videos in our documents. We still have to pay extra for color photos for getting things printed and things like that. That was the world that we grew up in. That's the world that people are thinking about, even in this room, indirectly or directly, about libraries. What is happening is that the world has changed completely.
When you start looking at the world, the type of thing that people started doing, the type of devices that started coming, the way information started getting created is very different for the young generation now. At our age, we tolerated photos and videos in text, reluctantly. For the young generation, they go to Instagram and Snapchat. Why? They don't want to use text. They tolerate text very reluctantly. And that's the world that we now have to consider. That's what is happening. So, when you start considering this, it has very serious philosophical and long-term implications. When you think about the world, and when we present the world, there are two things that we use: objects and events. Our current thinking, dominated by computer scientists: we deal with object-oriented programming, we talked about object-oriented databases and all these kinds of things. We were very object-centric. But in the modern world, when we are dealing with all of these streams of data coming in, etcetera, and when we are dealing with the Twitters of the world, when we are dealing with Facebook, events are becoming as important, possibly more important, than objects. What does that mean? That means the following thing. We have to learn how to represent events and how to deal with events. And when we deal with events, the famous 5 Ws and 1 H become extremely important. For events, when you want to represent events, there are these six important facets: who, what, when, where, why and how. And of all this information, none of it is metadata. Everything is data. The pictures and the documents, they are the experiential information. The informational component is there in order to structure that: to the experiential information that you start capturing, you start adding this information. You cannot interpret photos. You cannot interpret videos. So, what do you do?
You analyze those, you use deep learning, or you use context learning or whatever, to get some information that you can use for indexing purposes, and you start dealing with that, okay. So, that's what you start doing. So, when you start looking at these charts, you start seeing that all that information, the 5 Ws and 1 H, starts coming in directly from different sources. And one of your goals becomes how do we populate this information. Okay. Once you do this for an event, then you have some structure for events. Okay. So, you can start representing all those things completely. And once you have done this, then you can very easily start dealing with this. And in computing terms you start representing this in whatever way you want to, and you start populating it. So, for each event you start creating this kind of a structure. Indeed, the structures are nested. You have the substructures, you have the superstructures, everything gets linked. Effectively what you start doing now is, in place of the web of documents that you have created, now you are starting to create a web of events. You remember we talked about doing these things very differently. When you start thinking about this, and when you start thinking about doing search, the way you start searching, the way you start navigating also becomes very different. See, for example, suppose you are looking at some pictures. We are told how to search for a picture. That's where you will start thinking more about metadata. If I want to search for a picture, I have to go to a steering wheel of sorts. We have to get out of that narrow [inaudible] box. We have to get out of that, and we have to have the complete view of this space, and we have to start navigating. So, when you start navigating this, in the current world then, for example, in order to demonstrate this, there are four components.
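The event structure described here, a record built around the 5 Ws and 1 H, with sub-events linking into a web of events, can be sketched roughly in code. This is a minimal illustration only, not the speaker's actual system; the class and field names are invented for the example:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Event:
    """A micro-report: an event indexed by the 5 Ws and 1 H."""
    who: List[str]                     # people or accounts involved
    what: str                          # short label for the event
    when: str                          # ISO-style timestamp
    where: Optional[Tuple[float, float]] = None  # (lat, lon) if known
    why: Optional[str] = None          # interpretation, kept separate from facts
    how: Optional[str] = None
    media: List[str] = field(default_factory=list)  # photo/video URLs: the experiential data
    sub_events: List["Event"] = field(default_factory=list)  # links forming a web of events

# A toy web of events: a talk linked as a sub-event of a symposium.
talk = Event(who=["a speaker"], what="panel talk", when="2016-06-15T15:30")
symposium = Event(who=[], what="symposium", when="2016-06-15",
                  sub_events=[talk])
print(symposium.sub_events[0].what)  # → panel talk
```

Because every event carries the same six facets, the substructure and superstructure links are what turn isolated micro-reports into a navigable web of events rather than a web of documents.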
So, if you start looking for, I want to see all the events related to this particular picture, or what happened in this particular thing, oops, I'm going the wrong direction. Okay, it will show me all those pictures. If I want to see the same location, I want to see the images that resulted from the same location, or the same person who is in the picture. So, this is how we start navigating these things. We are organizing this now as a different kind of web. When we start considering this, and we were talking about journalism and all this kind of thing, what Twitter does is it creates a microblog. What you require really is a micro report. In journalism, the first principles are you should be objective, you should try to report things as honestly as possible. Opinions should be separated from facts. That's the first principle. The difference between facts and opinions. If you want to capture reports, then in each report for the event, you have to learn how to do that. Luckily for us, we can do that very easily now, because we have these smartphones. When I take a picture, my smartphone is really taking a report. And in this report, I have the location, I have the time; based on this I can do the contextual reasoning, and I can show you a system working on my phone where, as I'm taking each picture, it will tell me that this is the event going on here. It becomes a report here, it will tell me exactly which room the particular thing is going on in. It could even tell me that today's program was available, who is speaking, and what's going on in that thing. So, your report will be very, very objective. Of course, you can add your subjective information to those things also there. So, what we did was to run an experiment on a library. Flickr from Yahoo announced that they would make 100 million photos available to people as experimental data.
And these 100 million photos have some of the contextual data, the cloud metadata; they have some of the concept [inaudible] applied. So, what we did was to run an experiment. We said, suppose that in the library of the future all these kinds of reports are available, these kinds of things; can we go and look at each photo as a report, or as a micro report, and can we get some interesting information out of that? From that experiment we did learn something. And I just want to mention that while this is currently done for Flickr data, there's no reason that every picture that you are taking in this room or anywhere else cannot become part of that. So, when you start doing this, something really starts happening there. Suppose that you want to find out about Olympic games, and I must say this example was motivated because I was invited by the London people here and so on. So how do we detect Olympic games from all these photos? So, what you can do is this: Olympic games take place every fourth year. And for this I define some particular concepts. Now, in each picture I have the concepts. So, what I start doing is I go and plot those concepts. And this is now showing you the plots of different things. I look at the peaks which are appearing and I immediately find out that in London there are Olympic games going on at this particular time. And you can see the peaks there that immediately start showing you that. You see a secondary peak, this was for the [inaudible] Olympics that were going on immediately after the Olympics and so on. So, without doing anything, just from the photos, you can start getting all this information. So, when you start thinking about reports, when you start thinking about, in future, what kind of information we are going to be archiving. Everybody talks about every fourth year the election, every year this game is going on and that event. That's what really happens. For each event you generate enough experiential data.
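The detection idea sketched in this example, count concept occurrences per time period and read off the peaks, can be illustrated in a few lines. This is a toy sketch with invented data, not the actual Flickr experiment; the concept labels and the crude peak threshold are assumptions:

```python
from collections import Counter

# Invented photo records: each photo is treated as a micro-report carrying
# a timestamp and automatically detected concepts (standing in for the
# 100-million-photo Flickr collection described above).
photos = [
    {"month": "2012-06", "concepts": ["street", "food"]},
    {"month": "2012-07", "concepts": ["stadium", "athlete", "crowd"]},
    {"month": "2012-07", "concepts": ["athlete", "torch"]},
    {"month": "2012-08", "concepts": ["stadium", "crowd"]},
    {"month": "2012-08", "concepts": ["athlete"]},
    {"month": "2012-09", "concepts": ["park"]},
]

SPORT_CONCEPTS = {"stadium", "athlete", "torch"}  # assumed Olympics-related concepts

def concept_counts_by_month(photos, concepts):
    """Count, per month, how many photos mention any of the given concepts."""
    counts = Counter()
    for p in photos:
        if concepts & set(p["concepts"]):
            counts[p["month"]] += 1
    return counts

def peak_months(counts, threshold=2):
    """A crude 'peak' detector: any month at or above the threshold."""
    return sorted(m for m, c in counts.items() if c >= threshold)

counts = concept_counts_by_month(photos, SPORT_CONCEPTS)
print(peak_months(counts))  # → ['2012-07', '2012-08']
```

A real analysis would plot these monthly counts and look for local maxima, which is where the London peak and the secondary peak after it would show up.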
You generate enough reports related to that. And later on you then make comments about that, and all those things start getting linked. So, the important thing is, if you do not have event-based indexing mechanisms, we will be creating a lot of write-only data. Thank you very much. [ Applause ] >> MATTHEW WEBER: And our last speaker on this panel is Katrin Weller, who is an information scientist and senior researcher at the GESIS Leibniz Institute for the Social Sciences in Cologne, Germany. She is also one of the inaugural digital fellows at the Kluge Center here and was here last year. She focuses on the use of social media as a potential resource for future historical perspectives from the viewpoint of computational social science. >> KATRIN WELLER: Yes, great. Okay. It's such a pleasure to be here today and I'm already quite overwhelmed by all the input I got throughout the day, and will try to relate to some of that during my talk now. I'm an information scientist. I did my PhD in information science, but I also studied history in my magister degree. So, I also have some experience in working as a historian and doing some archival work, and somehow the internet distracted me and I got into information science. And I'm so happy that I now can bring these two perspectives together again. What I want to highlight today is the importance of also archiving information about the context of digital media, of web data, of online data, in addition to the technical parts that we may already be archiving in some projects, which we have seen today. I'm saying that because there's quite some resemblance between social media data and prior sources that historians used to work with. But there are also some additions that are new to that, and that we will have to keep track of in order to make sense of it later on. So, I'm not so much worried about getting every single piece of information, collecting every single tweet, archiving every single user interaction that has been on the web.
I'm much more worried about whether we lose the important context about how these types of media were used, how users were actually interacting through these types of portals. I mean, if children today don't believe that a CD can play music, or that news can come in the form of paper, how's anyone to believe that we used the phone to take a photo and show it to someone, in like 100 years or so. So, we have to capture this kind of context. We have to show people how this type of information was produced and is being used. Historical sources and historical source criticism, as the way that historians work and interact with their material, often deal with text, but text is not equal to text. So, there's a lot of different types of text. Like we've got diaries. We've got newspapers, we've got official documents, like kind of notices to the people by a specific ruler for example. We've also got multimedia content. So, in my magister degree thesis I was studying film propaganda, British film propaganda during the Second World War, and so I was watching these propaganda movies, but I also had to study the circumstances under which they were produced. So, I had to read a lot about different types of directors, about how the funding would work to produce such a movie, how the cinema systems worked, who could afford buying a cinema ticket to see one of these movies. A lot of things like that. There are artifacts of different sorts. And some of these things directly relate to digital media. So, for example we've got war diaries in the form of blogs. So people blog about their experience during war. We've got, like, what used to be the kind of leaflets distributed to the people can now come in the form of a tweet by the president for example. And we've also got lots and lots of eyewitness accounts in multimedia formats, for example on Flickr, on YouTube and on other channels, which are added. We've got lots of maps, we've already seen pretty examples of maps.
And we've got maps that track how users interact, how they behave, how they move through space. So, lots of new kinds of data in this regard. This is the comment section of an online newspaper. So, we've got news, quite similar to the news we got on paper. Or we've got users directly sharing it on different platforms, commenting on it on the news website. And finally, we've got people reporting about news events. This is the history of the Wikipedia article about the landing of the plane on the Hudson River, an event that was prominently tweeted about first, before it entered the traditional news. And people also kind of responded to that by setting up a Wikipedia article. We can see the history of how that article grew. And there are differences whether you look it up in the English Wikipedia version or another language version, of course. So, lots of different types of sources. Okay, so in traditional historical studies, for example, if you want to make sense of something like that, you will have to know quite a bit about how to handle such types of sources. So, numismatics, the study of coins, is a field where you have to learn what a symbol that is displayed on a coin means. What does it mean if a head is displayed in a specific way? What are the symbols that may be placed around them? Why are they using specific types of images? Why are they writing specific types of letters? There's an entire, you can kind of spend two semesters in university just learning about how this works. Also, how the material was chosen, how the coins were produced. All these kinds of things. What if our artifact looks like that? What do we know about how something like that was produced? What symbols are used here? Why is something displayed in the way it is? What does @POTUS mean in this sense? We have to know that maybe this account is moving on from one president to the other. Maybe it is staying with one president.
We have to know how multimedia is displayed in this kind of information. What it means to retweet something. All this kind of information. What if we don't get this, but only that? So, what if instead of the visual display of a tweet the only thing that's being stored is its tweet ID, or maybe the JSON file with the textual information behind it, but not the visual information around it? Does that ring a bell with someone? What's this hashtag January 25? Well, you may probably rather recognize this one. So, during the Arab Spring movement people were quite frequently using the hashtag Egypt, but only after like five or six days after the events unfolded. Prior to that they were using other hashtags, like the January 25 hashtag. So the context we have to be aware of is that these standards may change. If you want to study an event like the Arab Spring movement in the future, you will have to be aware of how people actually talked about it. What were the key hashtags, probably, or the key references you needed in order to take part in the discussion. The same goes for user account names. So, user account names, for example on Twitter, are changing a lot. I only studied soccer clubs in different countries and already had trouble with these soccer clubs always changing their Twitter handles, and was really in trouble figuring out which was the current official handle of that soccer club. So if you have the same with eyewitnesses during events like that, or with presidents, or with other actors, that's getting really complicated. So, you need this kind of context. And of course there's lots and lots of other events, and this kind of multiplies. We also need context in order to understand the evolution of platforms. And I think that's kind of a thing to start now: to preserve the evolution of different platforms, how they emerged. This is one of the first drawings of Twitter, how it could look as a platform.
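The hashtag shift described here, #Jan25 giving way to #Egypt as the event unfolded, is exactly the kind of context a simple day-by-day tally can surface, provided the tweets were archived in the first place. A minimal sketch with invented counts (the real distribution would come from an archived collection):

```python
from collections import Counter

# Hypothetical day-by-day tweet samples from an unfolding event, illustrating
# how the dominant hashtag can shift over time; all counts here are invented.
daily_hashtags = {
    "day1": ["#jan25"] * 80 + ["#egypt"] * 5,
    "day2": ["#jan25"] * 70 + ["#egypt"] * 20,
    "day6": ["#jan25"] * 30 + ["#egypt"] * 90,
}

def dominant_hashtag_by_day(daily):
    """For each day, return the single most frequent hashtag."""
    return {day: Counter(tags).most_common(1)[0][0]
            for day, tags in daily.items()}

print(dominant_hashtag_by_day(daily_hashtags))
# → {'day1': '#jan25', 'day2': '#jan25', 'day6': '#egypt'}
```

A future researcher searching only for the hashtag that eventually became standard would miss the early days of the event entirely, which is why this kind of usage context needs to be preserved alongside the raw data.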
How many of you remember how Twitter looked in 2007, for example, or how it changed its look and feel in 2010? It's really difficult to understand that from our perspective today. So, something we need to think about today is how we can, maybe, retrace how platforms like Twitter evolved. One thing you could look at, for example, are these kinds of things: we have books telling people how to use Twitter, which will give you a lot of screenshots and images, which can sort of trace back the history of Twitter over time. You've also got videos, people sharing YouTube videos of how to use Twitter, which gives you an idea of the look and feel. But it's something you really need to look into now, in order to understand how it works, and not leave it to the kind of people who don't recognize CDs and newspapers anymore in 50 years. The same about users. So, Lee told us about the demographics of different platforms, and that's something you really also have to keep in mind. Who's using these platforms at different points in time? How are these platforms being used? How many people are actually people? Or maybe the ISS Space Station. And again, what is being promoted to the public in terms of how you use these things, so there's even "Facebook and Twitter for Seniors for Dummies." So, it's kind of quite interesting. And with that I'd like to close here and look forward to your questions. [ Applause ] >> MATTHEW WEBER: So, we have plenty of time for questions; I have a list that could probably take a half hour. I'm going to take moderator's privilege and ask just one question, and this actually isn't my question, this is a question that was emailed to me by Edward McCain, who works on digital journalism preservation at the University of Missouri and couldn't be here today, but he said, I have a burning question I have to ask. And it's a great one.
Which is, he said to me over email, what are the ethics of news collection, and we could say more broadly with this panel, collecting media around events, collecting news around events. What are the ethics of news collection without permission? Because in none of these cases were we talking explicitly about asking people, can we take your local news site, can we take your photos of yourself at Sydney Harbor. So, what are the ethics of collecting without permission, and should we or shouldn't we be doing that? >> PHILIP NAPOLI: I, you know, thinking strictly in terms of ethics, to me the thornier question goes to the issues of legality. Because I feel like ethically, once something has been disseminated, the act of keeping and analyzing should almost be implicitly permitted by the act of disseminating. But, and I can keep going, you know, as long as it's not with the goal of further commercializing it, right? Again, as someone who's worked in this space for so long, it's been so frustrating to watch as sort of research with a, you know, legitimate public interest purpose can be stymied by denials of access to content that's already been made publicly available. So to me it's, you know, it's more these issues that have come up in the past about issues of copyright and things like that, that I actually don't know the answer to. I just know that to me, yeah, I mean from a research standpoint, those seem to be the greater challenges, but you know, maybe that's just me being a totally heartless social scientist there, but I just, I mean ethics, I mean, you know, again, you know, naivety, I don't know, but I don't think ethically, given that it's already been disseminated, that there's a problem if someone wants to keep it and study it. >> MATTHEW WEBER: What about then, the photos of myself at an event or in front of Sydney Harbor? >> RAMESH JAIN: Right, I think it's related to two things here. Number one is really, the first issue you are talking about is the privacy issue.
Whenever I'm asking somebody whether I can take your picture or not, whether I can do that or not, that's more of a privacy issue. And when you are in public spaces, usually it is considered that that's a weaker issue, because people are going to report in public spaces. So, that's a privacy issue. To me, the ethics issue is more in terms of reporting. Are you reporting objectively, or are you reporting completely subjectively? Now, what is very interesting is that even with pictures, which are normally considered objective, people have used a picture without giving you the context, and then they start talking about the picture as if something else had happened. And there are lots of things that you can find on the web which will show that whole pictures misrepresent the situation, unless you have all the information related to the picture, which is nowadays very easily available. So, when you take a picture from a smartphone, all the things about location, time, and the context in which you took the picture, etcetera, are usually available. That starts making it a little bit more objective. So, to me the issue in reporting, particularly with journalism and particularly for events, is more: am I able to differentiate between fact and fiction, between facts and opinions, and am I representing those things correctly or not? >> MATTHEW WEBER: Which is not easily done. >> RAMESH JAIN: Which is not easily done. >> KATRIN WELLER: So, back home, with my colleague Katarina Cinda-Colanda [assumed spelling], we are doing a project where we're interviewing social media researchers, or people from different disciplines who study social media for different types of purposes. And within these qualitative interviews, amongst others, we've asked them about research ethics, what kind of ethical considerations they have when working with social media data. 
And that's been quite interesting, because there are really, really different perspectives. Often it drills down to privacy questions. You've got people saying, well, it's all public anyway, so we don't really have to care about that. But we also have people who say that it's really a very complicated topic. Lots of people may not be aware, or may not have been aware, of how public things get once they, for example, post a comment on a public Facebook page, and they still feel like they're in a protected space where only their friends can see what they are doing. And of course it aggregates: once you work with aggregated data and run some sort of algorithms on them, which will for example then say something like, this person is likely voting for that party, or is likely a terrorist, or whatever, and this implies new forms of things. We also got people who kind of feel an obligation to share the data, because some people want to be read. So, we also have this effect that once something has already gone really viral on the web, or something can be attributed to a specific author, you get a kind of feeling that you have to quote this person. So, it's really a nontrivial task, and really something that's still being very much debated in the community of researchers who work with social media data. >> RAMESH JAIN: I just want to add one more thing to what I said, this [inaudible] item: some particular person doesn't want, at his events, reporters to come and report objectively, okay. Now, is it an ethical issue? Or what is that? Because these reporters are trying to do their job, and they are not being allowed to come because somebody's privacy is being violated. Privacy, or the perception of privacy that that person wants to preserve. >> MATTHEW WEBER: We're treading dangerously into political territory. >> RAMESH JAIN: That's why I'm not mentioning any names. >> MATTHEW WEBER: Without mentioning names. And with that, thank you for the comments. 
>> RAMESH JAIN: But that's what the question is. >> MATTHEW WEBER: No, I think it is important. >> RAMESH JAIN: That is exactly related to that. >> MATTHEW WEBER: Well, and then can others step in as reporters and objectively report? I think with that we should open it to questions. We've got a couple over here. >> SPEAKER 3: Clifford Stultz [assumed spelling], who is a political scientist in the UK, did some early work on analyzing the causes of the London riots a couple of years ago, and he was early, and he used contemporary YouTube reports that people involved in the riots had posted up onto YouTube. But he commented on the fact that he would use these and then a couple of days later they would be taken down, because the people in question had learned that the police were using YouTube as a source of information to try to identify people. So, perhaps I'd just like to ask you to reflect on that, because you talked about privacy, you talked about copyright, and the fact that these activities that we are researching, or archiving, or undertaking with our smartphones are not morally neutral. It's what originally got me into the topic of archiving social media, using institutional repository technology as a basis of doing that for researchers. But, given that archivists' ethical frameworks are not necessarily the same as researchers' ethical frameworks, which are not necessarily the same as journalists' ethical frameworks, what do you think something like the Library of Congress should be doing? I just turn it always onto the Library of Congress. >> KATRIN WELLER: One thought that just came to me: I remember, I can't give the credit to who is doing that, but there's a project, for example, on archiving blogs by, you know, people who write blog posts in China. And they are being archived before they get censored and thus disappear because the Chinese government takes them off. 
So, there's a deliberate project on archiving this kind of content, which is at risk of being censored and disappearing forever, kind of adding another ethical perspective to that: feeling obliged to help these people to keep that stuff public in some form before it gets taken down. And I think that's part of the more archiving perspective on that. When it comes to social media data, I think you also have to distinguish between, let's say, everyday user accounts and more official accounts, like the president, the press accounts and so on, where you have different sets of ethical considerations to take. And one thing I think that maybe the Library of Congress or other institutions involved in this could start with is archiving the content that comes from official accounts by political leaders and other types of elites, where you can clearly assume that these people know what they are doing. They are doing it for professional reasons, and that kind of is the first piece of the puzzle in understanding the entire framework. But again, it's not trivial. >> PHILIP NAPOLI: I think things start to get muddy here if we're truly focusing on news media and journalism, or more broadly, because, you know, I'm thinking about something like the right to be forgotten, right, and if you think about the layering of ethical considerations that goes with that. There is a news outlet that presumably went through the right ethical protocol involved in reporting a story. Now, if that story lives on in Google's archive, a link essentially, and it lives on in this particular news outlet's content archive for longer than the subject likes, the subject can of course petition for Google to remove the link to the story. 
So, there's an interesting, you know, presumably ethically there's a reason why the archive of this story should not be as accessible as it currently is, essentially, but yet the right to be forgotten has never been extended to the original reporting outlet, if I understand it right. You can't go to the news outlet and say, you have to take this story down because, and we went through various case studies in one of my classes, well, I haven't, you know, beaten my wife in 15 years, so how can you still have this story accessible? I mean, that's a real example. But the fact is that nobody is claiming that that person has a right to demand that the news outlet remove that story from their archive, yet there is some right that they have for Google to not make it easy to find. You know, I'm not answering the question here, except sort of again showing you how these layers of ethical considerations sort of cascade over each other. But yet when you get to these questions around the issue of, yeah, you know, this was my Facebook page and all my postings and my comments, and what right do people have to store that and make use of it, to me the level of overlap is almost minimal there. They are very different kettles of fish. >> MATTHEW WEBER: We may need Jefferson to come back up and talk about legal requests that the Internet Archive has received. Although he's shaking his head no. I think we had another hand directly behind Les. >> NICHOLAS HILAGIS: Hi. My name is Nicholas Hilagis [assumed spelling] and I'm a research assistant in Southampton working on [inaudible] and a lawyer. So, we talked a bit about the legal aspect of things, but to me there's also a research ethics aspect, which is completely different. 
Archiving, even for historical reasons, is going to be mainly used for research purposes, and in any university, in any study, in order to get ethical approval, you need to, a, inform your subjects, and b, allow them the right to withdraw their information at any time. So, no notification happens, as I understand it, currently in archiving, and since we also archive comments of people who have gone to a webpage and are not even aware, and it's up to the discretion of the publisher to notify them, maybe we should start thinking about changing archiving practice to allow compliance on letting subjects withdraw their consent. I would just like your opinion on that. >> MATTHEW WEBER: I'll make a quick comment on that, which is that this is one of the thorny issues that, as academic researchers, we muddle through with institutional review boards at US institutions, and IRBs, which is what we refer to them as, do not know the answer to this question. And in fact, in most cases, they don't even want to wade into this water right now because it is such a thorny issue. The attitude more or less to date has been, well, the comment is put up there, you knew you were putting it on a public site, so we're just going to wave our hands. But I think long term that's going to become a more complicated issue. So, my short answer is there is no answer, at least in the US at research institutions right now. But it's important. >> RAMESH JAIN: This is not an exact answer to your question, but in 1994 David Brin wrote a book called "The Transparent Society." And in that book he argued that in the future there are going to be sensors everywhere, which are going to record each and everything. And he argued that there is certainly no point discussing whether sensors should be put in or not, okay, because that's going to happen independent of everything. What needs to be discussed is who has what information and how that information could be used. And then he argued something very interesting. 
He argued that from the beginning of technology we have always argued about convenience versus privacy, and in the long term convenience always won out, starting from the use of checkbooks to credit cards to internet shopping. These are the examples that he used. So, the point is, with respect to capturing where these things are being put etcetera, it is going to be very difficult to stop those, particularly with the kind of cameras and all these things that are coming in. What the [inaudible] society will decide is what kind of information can be used ethically and legally, where it could be used, and things like that. And this is really going to become more important, in a slightly different context, with video devices. Because everybody's collecting this information. That information is your very personal information, and that's going to the cloud and is being collected. So, how is that information going to be used? Google has access to that. So, people aren't always thinking about putting in the architectures for those things. So. >> KATRIN WELLER: Maybe two additions. First of all, I live in a country where there actually is no equivalent of IRBs. So, German universities don't run through that process. Yeah. >> RAMESH JAIN: I should work with you. >> KATRIN WELLER: We've got other legal frameworks to cover that, but I mean, it's not the standard that as a researcher you have to file something through an IRB process. Well, they've got some forms they have to fill out. But it's not an institutionalized board such as the IRB; you have to check some boxes on a form, and there are legal restrictions for what you can do and what you can't. But it works quite differently than in other parts of the world. The next thing is that I don't believe that just getting clearance through an IRB solves everything. 
Because on the one hand, that's again more of a legal issue, if you just need this in order to get permission to continue some processes. And if you need someone to sign an informed consent form without really being informed, or understanding what he or she is signing up for in that moment, that doesn't really solve the ethical issue around that. So, yeah, I really join in with your answer: there's no easy answer. >> PHILIP NAPOLI: I think what makes it interesting in this context, at least for the kind of work we do on news and journalism, is when it comes to the IRB, it's human subjects, all right, and, okay, how do we define when you are a human subject? Is it your digital expressions online? At least in part of the research we're doing right now, we're not doing research that actually reaches the level of studying human subjects, even though we are studying the outputs of human effort, right? And so, you know, what's the threshold for counting as a human subject is kind of interesting when you're studying digital media. >> SPEAKER 4: Do you think that the role of libraries might expand to, or even currently does include, you know, providing people with more digital literacy, and kind of the tools that they need to actually do informed consent, and to understand how the content they're generating online is actually living and what it's potentially going to be used for? Is that something that you think libraries should be undertaking, like that education, or is it, you know, the responsibility of parents? Or society in general? I don't know if you want to speak to that, but? >> KATRIN WELLER: Well, I think information literacy is a very important concept. But it also adds another level of making everything more complicated, in that sense, because as people get more literate in using different tools, they will take up new practices of how to use these things. 
So, people may turn to different channels for talking to one another. So, for example, once you figure out that Facebook is doing certain things, you will use WhatsApp for a specific task, and then you move on to some other tool, and so on and so forth. So, it's really a very dynamic field. It's a moving target that you try to capture, and it's constantly evolving. I'm not sure whether libraries should be the ones to be in that position. They should be aware of what's going on, and they should be able to take up that role when they're asked for it. But they shouldn't be the only ones who take care of that role, who will be responsible for all this. I think the entire dynamic on the web is like users taking up practices and establishing forms of how to communicate with each other, platforms changing their features according to the users taking up these new practices, and everything goes on, and on, and on. And that's really something we need to be aware of and keep considering in these kinds of efforts. >> PHILIP NAPOLI: I mean, my short answer is just going to be all of the above. I was waiting for the issue of digital literacy to come up in this conversation, because I would love to know, from a research standpoint, how variations in levels of digital literacy amongst individuals, and even at the organizational level, relate to what we might call chilling effects, right, and how much that impacts what gets produced, and ultimately then what gets archived. I mean, from a research standpoint, we just don't know enough there yet. >> MATTHEW WEBER: I think we have time for one or two additional questions. One over here. 
>> SPEAKER 5: I know that the title of this panel is Saving Media, but my question is about taking it into the future, really, that is, not just saving it historically. Take your example of an article about somebody who beat his wife 15 years ago: wouldn't it be nice if, for example, it was proven not to be correct, that that particular article would be annotated with the newer information about the case? Similarly, we read all kinds of things: vitamin D is good for you, and then five years later, don't take it, you will die. Okay? Now, I think that we have an opportunity in journalism and in social media to try to do this linking to more recent information on that particular topic. In other words, a [inaudible] is very, very important, and we can try to use the dynamic web to represent the current state of knowledge. >> RAMESH JAIN: I like the question a lot, because what it is also emphasizing is that in our discussions today we have talked very little about the importance of time and the importance of location. And both those things are extremely important, because based on time and location you can decide many of these things. And going forward, we will have all the tools that will link all these things very easily. And that's happening increasingly. >> MATTHEW WEBER: I think, as well, we are starting to become better at working with the existing archives that we have. For instance, just reflecting on the datathon that we held over the past two days: as we're learning about ways that we can work with the data, I think we will start to slowly answer these questions. By analyzing the past we'll know more about how we can tie together bits of information in the future to create a more coherent longitudinal archive. Personally, as somebody who researches news media, that's something I would certainly like to work towards as we continue to tinker with the data. 
>> SPEAKER 6: So, maybe a two-parter, if I don't get in trouble for that. One, I guess, to go back to the ethical considerations: I think with the concept of public, sort of capital-P Public, you know, I might be publicly discussing something with a specific community that I'm involved in, maybe a community of researchers, but as soon as someone studies this, or a journalist decides to publish about this, you've now sort of raised the profile of what I was discussing there. How do we sort of update our conceptions of harm? Because IRBs, and most of the institutional review boards across the world, are sort of legal frameworks to protect universities. They're not necessarily ethics bodies, because of how they were founded. So, one is, how do we come together to sort of update our conceptions of what it means to collect someone's geographic information that's attached to the photos they're posting on Flickr, and to do this en masse? Like with our Occupy datasets: in my lab we figured out where followers lived, you know, where they were protesting; we saw the use of turning location on and off as a protest tactic; we saw the use of deletion as a protest tactic. So, on one hand we want to accurately reflect what was happening at the time, but on the other hand, what about the harm that we could cause? You know, the Oakland Police Department would like to subpoena our data. They did not, but if they could subpoena our dataset, they could, you know, find some of the protesters that they might like to arrest. How do we understand those conceptions of harm? I guess I'll stop at that one, because that might be big enough. >> RAMESH JAIN: To me, the issues that we are discussing here are really very complex. We want to understand them at two different levels. Number one is the technical level, and then the societal level. 
Which is more politics and ethics related. So, in the archiving component, we have to see carefully what could be done. For example, there are lots of views today about what information should be shared and what should be discarded. So, is it based on politics or ethics, or is it based on the information content of it? We all know that we do image compression, we do video compression. Are we going to do the compression of all the data that's coming in based only on that kind of issue, or are we going to start making the decision based on politics and ethics? These two things should be kept separate, because the technical development is going to take place based on what could be done, and then we modulate that by societal standards. Because these societal standards are going to be governed by the current socioeconomic and political situation. What is allowed today may not be allowed tomorrow, and vice versa. >> SPEAKER 6: But the technical objects, the technical system, that's forever there, and you can't go, oops, I want it back, right? I mean, so? >> RAMESH JAIN: Most of the technical systems, when they are developed, then the society decides. For example, Facebook, Twitter, and all these things are banned now in China, because the society decided there that it's not in their interest to allow users to use those things. But based on the technology that has been developed, they have their own systems, which are sometimes correctly used and sometimes are being abused by people in government. So, the technical issues and the political and social issues are different. At least that's how I see it. >> MATTHEW WEBER: Wow. >> SPEAKER 6: We disagree, but respect that. >> MATTHEW WEBER: On that contentious but interesting point, I think we may draw a line and end this panel. I want to thank everyone for joining us up here today. [ Applause ] Wendy, for closing remarks. 
>> DAME WENDY HALL: So, for closing remarks, this is where I get to say what I want to say. All that we've been talking about today has been driven by a revolution that started when people started talking about machines and computers. But in terms of the real impact on the world, of the internet, it started when Tim Berners-Lee put the first website up at Christmas 1990. And in that very short space of time our world has changed dramatically. Lee talked about this on the panel this morning; it's hard, thinking back in history, to see retrospectively when things hugely changed, the industrial revolution and things like that, and agricultural revolutions, and the different, you know, ways of measuring how we changed society. But this has been something that's happened in our lifetimes, which is probably the first time. And, as I said this morning, the system that underlies all this is still very fragile. And the thing we've been talking about today is so very, very, very complex that it's almost impossible to grasp. What we've done today is just the tip of the iceberg. We could have had days talking about this, and of course in the future people will. You know, when you think about the, oh, sorry, I've got to do this so I can see it here, but when you think about the way libraries have evolved. And somebody earlier, I can't remember, it was lovely, was it Vint? Talked about the Library of Alexandria and, you know, way back in history. And the big libraries we have today have been evolving for centuries. And the material in them is centuries old as well as decades old, as well as today's. And the sort of ways that society has evolved, you know, people come into reading rooms and there's these cataloging indexes. If you want a good tour of the library, talk to the wonderful Dave Brunthin [assumed spelling], who I saw sneak in somewhere over there. He took us around the catalogs; there's some wonderful gems in the catalogs alone. 
These themselves are amazing artifacts now. Right? The catalogs. And I've been privileged to be in this wonderful Kluge Center where people come and do research. And most people in the Kluge Center read books and write books. I don't do that. I network, but there you go. The thing is that this has taken centuries to evolve into these types of systems. And I'm very well aware, from my time on the board of the British Library, I was a non-exec director of the British Library for eight years, and then being immersed in this library for three months, how much of the library is to do with this world. Right? Because it's only 15 years, really, since the library had to start thinking about the digital world. That sort of period. Most of the staff who work here, most of the operation of the library, is to deal with the old world. And that's going to continue for a long time. So, there are major issues for the people who manage these organizations about transitioning from one to the other. And the other thing that's happening is that, as I travel around the world, well, I don't even have to travel around the world, I come to Washington and I have talked to many different archivists. They've all been in this room. There's the Senate archivist. There are people from the National Archives, the Congress, all the university archives, the library here. Just in this one town there are any number of archives, and there are people grappling with this issue of moving into the digital. You then scale that up to all the libraries in the world, and then you scale out, you add in my actual world, which is the universities, the research labs. And we, as much as you, are harvesting stuff off the web all the time. There are huge amounts of duplication of effort. We're still losing stuff. And what we're actually ending up with is a very complicated jigsaw puzzle for future historians to sort out. 
Now, Vint talked about this in a much more eloquent way than I can, about the complexity of this amazing thing that we have created. I mean, Tim Berners-Lee gave us the protocols, but the web was created by us. We create this content. I was struck by a comment from Abbie, when she was sort of cursing YouTube, you know, for coming into existence. Why did YouTube come into existence? Because all of the world wants to look at cats doing stupid things on videos, right? Who would have predicted? But you know? YouTube exists because of us. No technologist said, you will put your video on YouTube. They provided the infrastructure and we put our videos on. So, now of course it's become a big business and it's completely changing the world of entertainment, advertising, everything. And I mean, my mentor sitting in the front row there, Professor Ramesh Jain, who, you know, I met at Michigan in '89, and he was the founding father of multimedia, and we have been watching this happen for a long time. And as he said in his talk, it's only today we're beginning to realize it isn't just text we've got to sort, it isn't just, you know, data, well, it is data, at the end of the day it's all bits and bytes. But it's pictures, it's videos, it's audio, it's a huge amount of information. And we have invented the technology to enable all of us who are on the internet and have a mobile phone to create this stuff. Isn't that wonderful? You know, and, I mean, what do you think about it? Drawing the analogy with the old days, we wouldn't expect a library to have kept every little letter that was written in America, right? You'd expect a national library to store the important letters that were written in America. Not every single personal letter. And not every single personal postcard. Some people collect postcards; that's a different madness. And where have postcards gone? To Twitter. I mean, you know. 
Largely you don't send a postcard anymore; you tweet, or you Instagram, or you Snapchat, or you whatever, you put it on your Facebook page. We don't have to collect it all, and yet there's a mess. The mess isn't just what we collect, it's how we collect it, and the fact that currently everything's in silos, and we've got bits of the jigsaw puzzle in silos and they're all full of different standards. Right? There's no common way of these jigsaw pieces being cut so that they mesh together. So, I look back to that table Vint put up on a slide with the hieroglyphics on it, and I think, you know, how long it took people to understand that Egyptian language and put the tablets together. And as someone else here said, I can't remember who it was now, I think it was on the data panel, someone said the most important thing is the context. I think it was Vint that said it: put the context of the data up. It's all about the semantics of the data. Because if you don't do that, people in the future will not be able to understand what it was about, because they won't understand the world we lived in. They won't understand who Snowden was, or, I have no idea who that judge was that you were all doing the thing about, with the court, okay, the Supreme Court group. I don't know who that judge was. I can look him up on Wikipedia, but how long is Wikipedia going to last? Who's preserving Wikipedia? You know? I mean, I'm sure they are, but can we assume that the people running Wikipedia will be in existence in 100 years' time, or that a system running Wikipedia will be in existence in 100 years' time? Certainly the people doing it today won't be. Now, these are assumptions we make that are totally foolish assumptions to make in this world. So, at the Library of Congress, you know, we've heard about the millions of resources we have now, and that in a short amount of time. Just imagine if a library ingested those numbers of books in 15 years, real books, you know. 
This is the scale that they're working at. That's a lot of websites. Just this library, and that's replicated all over the world. And, you know, you deliver an API service. How do people find out about that? And how many people know what an API is and what they can do with it? Right, this is the trouble. It's the same as what's happened with open data. All sorts of people put bits of data out there and nobody uses them, because they don't know how to use them. And Jim said to me earlier, when we were sitting listening to someone, what we've got to do is write the tools to enable people, just regular people, but particularly researchers who aren't computer programmers, to be able to do things with this data, and that means sharing. It means doing this as a community, because, you know, otherwise we're just going to waste a lot of time and effort duplicating and getting nowhere. So, I'm going to tell you a bit in this talk about a project we've done. The Web Science Trust, which you've heard a bit about, sponsors this project. And we've been building, at Southampton, a simple tool to enable people to find archives, to find out what's been archived. Because at the moment we have no clue who's doing what around the world. We hear about it at symposiums like this, we hear about it in conferences, and also when students do projects. Interns come in, PhD students come in, they do projects. They do wonderful things with data. They leave and it just goes. You know, their laptop goes, the server goes, the URL goes. And so, we've been building this thing called the Web Observatory, which I'll explain in a minute. It's all about trying to make data about archives, and data about the datasets that are being used in research projects, the knowledge about that, available to people. It's effectively about creating data catalogs, and being able to discover them. 
So, Romain, who's sitting in the audience somewhere, produced this slide: instead of just putting out an API, you put out some simple metadata, which we talked about a lot on the panel this afternoon. We are currently using schema.org, which is a sort of watered-down version of RDF. But at least it's a standard that we can propose, the idea being that everybody who's doing this type of work agrees to a common description of the datasets. Not the data, we're not talking about formats of the data, just the descriptions of the datasets. Who owns them, what they're about, what you can do with them, who can access them. That sort of level. Oh, I'm sorry, I knew I'd confuse you. I'm looking here because I want to be able to read, you can't read from there. All right, so to go back, this is the Web Observatory that we're building. We're using schema.org. It's all about metadata about datasets, so that everybody can flag that they've got some datasets. And so you put out an API, absolutely, because you want people to use that data and build tools to use it. You also put out the metadata about that dataset. And if we all do that in a common standard then wonderful things can happen. For example, and this is something that we're just beginning to experiment with, you can begin to search across different datasets. And this is actually a beta, beta, beta thing, or maybe an alpha-minus thing. We're just developing it. So, it's like a Google for data, or metadata anyway. And when Romain did this yesterday, he typed in pollution. I think, Kate, we were in Dunkin' Donuts at the time, weren't we? Oh, there you are, right? And he typed in pollution and he gets back, this is searching across all the observatories that we currently have registered in the system, who's got datasets about pollution. It could be Boulder, it could be Zika, or it could be Trump, it could be whatever you want. 
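That cross-observatory keyword search can be sketched in a few lines. This is a toy illustration, not the Web Observatory's actual code; the observatory names and record fields below are made up for the example:

```python
# Toy sketch: keyword search across dataset metadata registered by
# several observatories. Only descriptions are searched, never the
# underlying data, mirroring the metadata-only catalog idea.

def search_catalogs(catalogs, keyword):
    """Return (observatory, dataset name) pairs whose metadata mentions keyword."""
    keyword = keyword.lower()
    hits = []
    for observatory, records in catalogs.items():
        for record in records:
            text = " ".join([record.get("name", ""), record.get("description", "")])
            if keyword in text.lower():
                hits.append((observatory, record["name"]))
    return hits

# Illustrative metadata records (not real catalog entries).
catalogs = {
    "Southampton": [
        {"name": "City air quality sensors",
         "description": "Streaming pollution readings from roadside sensors"},
        {"name": "UK election tweets 2015",
         "description": "Tweets harvested around the UK general election"},
    ],
    "Library of Congress": [
        {"name": "Environmental web archive",
         "description": "Archived websites on pollution and climate"},
    ],
}

print(search_catalogs(catalogs, "pollution"))
```

Because every observatory publishes the same descriptive fields, one query can fan out across all of them, which is the whole point of agreeing on a common description.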
And so he searched, and you can't read it, but the first one that comes up is a dataset from Southampton and the second two are datasets from the Library of Congress. Now, you're not actually issuing this metadata at the moment. I hope you will, but we fudged it for this. No, we didn't? Oh, it's real. It's real. It's in the Southampton Observatory and these are datasets that we used for the datathon, but it's only the metadata, all right, we haven't sucked the data out, just the metadata that says the Library of Congress has these datasets. So, imagine you can go to something like, well, it won't be like a Google, because it won't be as big, maybe Google will run the service at some point in the future. The idea is you're searching across the world's datasets of web archives. Just imagine being able to do that. So, here's another one where, oh yeah, you highlighted, is that the same one, the pollution one? You highlighted Library of Congress on this one. Okay. Now, imagine a world where, so here we've typed in elections, and in fact it's UK elections. And we come up with the dataset that we are hosting at Southampton about last year's UK election, which we harvested from Twitter. And at the datathon, the group that worked on this data, and they had the US election data from the Library of Congress, they produced a dataset with the work they'd done, and that dataset is now, well, it's pointed to by the observatory. So, here's the metadata on there; you can't read it, but these are all screenshots of a live search, and it says this is the data that was produced after this datathon. And what I want to do is, every time students or anyone does a project with some of the data from anywhere, like from the Library of Congress, or the British Library, or Southampton University, or George Washington, or Rutgers University, or RPI. If you're doing something with data that you got from somewhere else. 
When you produce that data, you then put the metadata back to say what you've done and where the results are. Right? So, you build up a community of people working in this area and we know who's archiving what. So, the idea of the observatory came from the physicists, who are very good at sharing data in order to interpret what's happening out there in the heavens. The thing about that is, that's the physical world and you don't change it by observing it, we don't think. With the data we're dealing with, there are all sorts of issues of using data that's about people, as we have touched on a bit today. But the idea is the same. It's that as a global community we've got to share this data to enable people to replicate our experiments, to reproduce experiments, to put metadata on top of metadata, to add some context to the datasets, to help people in the future interpret what's going on. So, this is about sharing. There's a diagram, I think Romain came up with this as well, many years ago now; that was the web in 2003, I think. It's often used to illustrate the connectivity of the web, and it's all about observing it and experimenting with the data on it. So, just on the Web Observatory, those of you who are techy can look at webscience.org, which is hosted by the Trust, and you can look at the index of the Web Observatories that exist at the moment and how you get into being one. It's very simple. And there's a screenshot of the list of web observatories that exist at the moment. Jim, we need to make yours live again; I think RPI is at the top and the datasets need to be live again. The trouble is, if people don't use this, and you don't have someone curating these datasets, they do stagnate. This is not something that just happens automatically. 
But if you go to, for example, the SNAP group at Stanford, they flag, this is an amazing group up there, they flag that they have datasets they've been collecting for years. And they flag them here so that others can see what they've got. And we can put them into the search engine. So, hopefully in the future we're going to be creating a distributed catalog across web observatory sites. That's the plan, the common standards. The idea is you host your data. It's your data, and it's all about telling people what they can do with it. You can tell people it's there, but not allow them to access it on the web without going through you, without signing whatever forms, whatever terms and conditions they have to accept. But it's all about cataloging what you've got. As a community, and not just as individual websites produced by individual institutions, that's the difference. So, then you have communities based around different web observatories and then linking between them. This is the Southampton one, and this is live on the web. And if you go into the datasets, you get lists of datasets, and you actually get told, we've got access controls, whether you've got access or you need permission. And if you look at the apps you get a list of them. Now, this is the thing that takes a long time to do. This is coming to Katy's material, right? All of these apps are visualizations. That's much harder to share than it is to share the metadata about the datasets. And the amount of effort that goes into building those apps, we don't want to be replicating that all around the world. We really, really, really need to share those tools wherever possible, if they're open source, or open access, whatever, we really need to be able to share them. And this is about telling people what you've done with your datasets. So, for example, you can't read it here, but you can look at this, it's all online. 
There's a dataset here about Wikipedia, live revisions of Wikipedia. We have that constantly running in Southampton; we have a screen up with who's editing Wikipedia at that moment in time. So you've got streaming data, and that's all stored in the observatory and flagged, and then you can actually give people that, I won't, that is a video, you can run it as a video and see who's doing what with Wikipedia. And you can get that data, you can look at it streaming live or you can get the past data as well. And of course, the wonderful thing about Wikipedia for researchers is it is all open. So, the idea is we have web observatories all over the world. And they're all linked up. We have a couple in Southampton, RPI, we've just had one set up in Bangalore, with a lot of industry data actually, telecoms I think. And then there's the new one in South Australia. And the idea is that libraries have them too, because actually this is the modern version of your catalog, I think. And that's the plea today, that we develop common ways of sharing our catalogs about our datasets. In order to, as I said earlier, describe the content of the observatories, or what's in the datasets, we're using schema.org. This was developed by Jim and his group at RPI; it's in the w3.org list of schemas. Anyone can use it. And that really is very simple, just a description of what's the name of the project, what type of data, who owns it, what access you can have. It's just at that level, but that is enough to enable us to do initial searching. And we can put more and more descriptors in, and then over time we'll be able to go semantic, which is what I want to do with Stanford, to link these two things together, because it's only when you really get rich semantic descriptors that wonderful things begin to happen. So, we have a small ambition here to map the digital universe. That may be NP-complete. Anyway, who knows. 
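A descriptor at that level, name of the project, type of data, owner, access terms, might look something like the following. This is a sketch using schema.org's published Dataset vocabulary (name, description, creator, license, and so on), not the actual Web Observatory schema extension; the dataset values are illustrative:

```python
import json

# Illustrative schema.org-style dataset descriptor. Only metadata is
# published here, never the data itself: enough for a catalog to be
# searched, with access terms stated up front.
descriptor = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "UK election tweets 2015",
    "description": "Tweets harvested around the 2015 UK general election.",
    "creator": {"@type": "Organization", "name": "University of Southampton"},
    "license": "https://creativecommons.org/licenses/by-nc/4.0/",
    "isAccessibleForFree": True,
    "conditionsOfAccess": "Registration with the host observatory required",
}

print(json.dumps(descriptor, indent=2))
```

Because every field name comes from a shared vocabulary, any observatory, or a future search engine, can index these records without knowing anything about the underlying data formats.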
The idea is to get a better picture of the digital universe than we have today. So, fewer broken jigsaw puzzles and more bits of the picture emerging. And I just want to almost finish, this is my last-but-one slide, with a video that we produced at Southampton, which may explain some of this a bit better than I did. And the voice, if you play it in a minute, the voice in this video is Julian Rhind-Tutt. We were so lucky to get him, and they didn't charge us very much for making the video. Julian Rhind-Tutt, if you watch British television, he's always in it as a character actor, and he happens to be the brother of someone I went to school with. But anyway, it all links. Links. I love links. Could you play the video? [ Music ] >> JULIAN RHIND-TUTT: [Background Music] The World Wide Web. Pages, people, data. Lots of data. Generated and stored by people, businesses, communities. Imagine a world in which your data, combined with other data, could change things dramatically. Government, health, education, the environment, and in business. To unlock the potential of your data, we've developed a Web Observatory. A system that locates and describes existing datasets around the world. It has common technical standards, safeguards, collection, and analysis tools. It allows people to observe the web and use the intelligence it holds to unlock new opportunities. You choose how your data is used and by whom, whether it's free or has to be paid for. But until you put it out there, you won't have any idea of what it can do, how it can change the world. So, join us and we can show you how. The Web Observatory, unlocking the potential of data. [ Music ] >> DAME WENDY HALL: Thank you. [ Applause ] I should give credit to Thanassis Tiropanis [assumed spelling] and Joanna Luis for doing that. But this is not a commercial company, right? This is just talking about this project, which is a community project, with open source and open development tools we hope. And we can't do it on our own. 
We need to do it together. So, it's a plea for everybody to put their metadata out there in the schema.org format and tell us they've done it. And just to summarize today, this is my last slide. The center of what we've been talking about is this web, internet, whatever you want to call it. The thing that is incredibly complex, as Vint said. Very, very interconnected. As Ted Nelson, who's another great mentor of mine, quite mad, oh no, not quite mad, but a mentor of mine, said: everything is deeply intertwingled. He said that in the seventies and boy have we proved it. You know, he didn't quite get Xanadu off the ground, and he says two cheers to Tim and the Web, not three, but you know, the web has shown us a way to do this and it's the center of where we're going. And I love what Vint said this morning, the self-archiving web, that's something we really need to try to build. It's very complicated. I love the term digital vellum; that resonates so much in a building like this. This is really what the libraries need, they need digital vellum. And that's what we need to provide them with. All right? They can't do it; we, the community, have got to do that for them. We need to know who's archiving what, otherwise we get huge amounts of replication, duplication, wasted effort. That's where I hope something like the Web Observatory will come in. And for those of you who can understand all the three-letter acronyms, the DOI stuff is really important, because in the Web Observatory at the moment we're focusing on sharing metadata and we're using URLs. And as Vint said, URLs are inherently unstable because people move stuff around. And so, the work that Bob Kahn and CNRI are doing on DOI, as complicated as it feels, you have to have that sort of system going at some point in order to do the preservation. So, on that note, I'm going to finish. I'm going to say thank you. I think Janice is actually going to close. All right. 
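The point about DOI-style identifiers comes down to one layer of indirection. Here is a toy illustration of that principle, not the real DOI or Handle System protocol; the identifier, registry, and URLs are invented for the example:

```python
# Toy sketch of persistent-identifier resolution: citations hold a
# stable identifier, and only the resolver's mapping changes when
# content moves, so old references keep working.

registry = {"10.9999/demo.dataset.1": "https://archive.example.org/old/dataset1"}

def resolve(identifier):
    """Look up the current location for a persistent identifier."""
    return registry[identifier]

# A citation minted years ago...
citation = "10.9999/demo.dataset.1"

# ...still resolves after the hosting institution reorganizes its site,
# because only the registry entry is updated, not the citation.
registry[citation] = "https://archive.example.org/new-home/dataset1"
print(resolve(citation))
```

A plain URL embedded in a paper would have broken at the move; the identifier survives because readers never dereference the location directly.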
I'm just going to say thank you to all the people who helped here at the Kluge Center, the people who did the organizing, Jason, who's been at the back, you can now have a rest Jason, I know you've got an event tomorrow. The team who produced the food, ran the mics, the technical team, thank you very much. All the behind-the-scenes people. And then the front-of-scene people, all the panelists, the chairs, the people who ran the datathon, thank you so much for everything. And thank you to the Kluge Center for letting me be here. Thank you. [ Applause ] >> JANICE HYDE: Wow, we've heard a lot today. And I want to thank all of the participants for their thoroughly engaging and thought-provoking presentations. I don't know about you, but I need to look at some of this again and reflect. And so, I just want to remind all of you that we did record the presentations and they will be up on the Kluge Center website and the Library of Congress' website; it might take a couple of months, but they will be there. And by then we'll all be ready to go back and look at all this again. So, I would especially like to express our gratitude to Dame Wendy Hall for organizing this event. [ Applause ] And really for serving as the catalyst for this great exchange of ideas. And during her tenure here at the Kluge Center she has really generously shared her abundant energy, talent and knowledge throughout the institution and across the Hill. And I want to thank her for truly embodying the Kluge Center vision. Thank you [applause]. >> This has been a presentation of the Library of Congress. Visit us at loc.gov.