>> Abbie Grotke: Hello and welcome, everybody. I'm Abbie Grotke, Assistant Head of the Digital Content Management Section in the Library of Congress where I lead the Web Archiving Program Team. I'm delighted that you've joined us tonight for #WhyWebArchiving: Preserving Internet Content for Research Use. This evening's event is kicking off the International Internet Preservation Consortium, otherwise known as the IIPC. The events this week are all virtual and completely free and co-hosted this year by the Library of Congress. The Library of Congress recently celebrated its 20th year of web archiving since 2000. We've made available over 29,000 web archive items on the loc.gov website in 85 event and thematic collections. Web archiving collections at the Library of Congress contain digital materials from almost languages. While my team provides program and technical support, the collections are developed by subject experts all over the library and many of our divisions. Web archiving is just one aspect of the library's varied digital collections efforts. Before I hand things over to our moderator for our panel discussion, I have a few housekeeping notes to share. This event is being recorded and livestreamed, and it will be available on YouTube for 24 hours after the event. An edited version will be made available at a later date on the LoC and IIPC channels. We encourage you to submit questions for the panelists and the moderator by using the Q&A icon at the bottom of the Zoom screen. You can ask the questions anonymously if you'd like. You may also upvote questions that look to be of interest to you. The chat is also open and we encourage you to use it to engage with other participants and share information and say hello, like many of you are doing. We will not be monitoring the chat for questions, however. Again, please be sure to use the Q&A for those. Please be courteous in the Q&A.
And while using the chat, we will share a link to the IIPC code of conduct for your information. Lastly, if you would like to use closed captions, click the CC button in your menu bar. Now, on to the show. I'd like to invite on camera and introduce my good web archiving friend and colleague Ian Milligan, Associate Professor of History and Associate Vice-President in Research Oversight and Analysis at the University of Waterloo. Ian is our moderator for this evening's event. Hey, Ian. >> Ian Milligan: Thanks so much. >> Abbie: All right, take it away. >> Ian: So thanks so much, Abbie, and so great to see so many familiar names in the audience. So I was asked just to put a few quick framing remarks together for this event. And I started by thinking, you know, what will we call a historian studying something like the COVID pandemic, in 10, 20 or 30 years? They're going to be using the web. How else would they learn about our news ecosystem, our social media feeds, the information and misinformation that shaped the world? Will we be calling them a web historian, or will we just call them a historian? And that's because the web and its archives lie at the heart of the future of, well, almost, I think, any study that's looking back to the mid-1990s or later, that if we as a society want to do justice to understanding almost any dimension of our society, culture, politics, since the mid-1990s, we're going to have to draw on, access, and use web-based primary sources. And that means we're going to have researchers that need to use all of this information. And we need to find a way to preserve all of that data.
Now, that's why we have events like the International Internet Preservation Consortium, an international gathering of libraries, archives, and other memory organizations, as well as scholars, researchers, curators, activists, thought leaders and beyond to think about how we collectively can rise to the challenge of being able to tackle our memory in the future. It's interesting to reflect that web archiving itself has a surprisingly long history, which is sometimes really neat because we still think of the web as new media. In 1996, Internet entrepreneur Brewster Kahle started the Internet Archive in a tiny house in the Presidio of San Francisco, that Little House on the Prairie stage of web archiving, beginning to save large quantities of the internet onto tape. And two years later, in October of 1998, the Library of Congress received with triumph and fanfare, a donation of their own first web archive. It came with a statue, it was two terabytes and 63 inches, half a million archived pages. And that seemed big then, but now it's something that all of us could carry around in our pockets, which underscores just how complicated this world is. That web archiving is, we're going to learn from our colleagues and in our discussion tonight and for the conference that follows this opening event, is complicated, that we take so much of the internet for granted. But just think about how much of an interplay it takes to display even one website. If you went to loc.gov, you'd see images, you'd see links, the ability to go to Instagram, or Pinterest, or Twitter or YouTube, you can zoom into documents and into the tiniest smudges on a page. But if I go to loc.gov, tomorrow, it will be different than it was today. And it will be different than it was yesterday. So what does it mean to preserve loc.gov? Do we preserve it every day? Do we preserve it every month, every week? Every second? 
What does it mean that I use my laptop to view the webpage and I made these remarks and checked it with my iPhone this morning? What if it's interactive? We might want to look at previous versions of the Library of Congress' page through a web archive, often through the Wayback Machine named after Rocky and Bullwinkle's machine from that cartoon, which lets us see these older websites as they once were. Or we might want to think about viewing them differently, looking at them as text, or all of the images or all of the pictures. That the possibilities are limitless, but also terrifying. As the amount of data we're talking about, in these conversations, scales to levels few could have imagined two decades ago, hundreds of billions of archived pages, hundreds of petabytes, and a petabyte is itself 1000 terabytes, which is itself 1000 gigabytes, which hurts my head when I think about it. So in 20 years, when we look back at COVID, and other events that are sweeping our world today, this is the kind of landscape we're going to be leaving our researchers. So we will need librarians and archivists and memory professionals to help preserve and steward this material, and scholars to imagine what we can do with it all. So at this point, my goal in these short opening remarks was to just introduce problems, which I will then walk away from and leave all of these problems and thoughts with our panelists. And so what I'd like to do now is introduce our panel that will be joining us today. So I asked them to come up onto the virtual stage, so you can see their faces as I introduce them. And I'll introduce our four speakers in the order that they'll be speaking. So our first speaker tonight is JJ Harbster. JJ is Head of the Science Reference Services in the Science Technology and Business Division at the Library of Congress. 
Her first introduction to web archiving was in 2005, nominating content for the Hurricane Katrina collaborative web archive, and she has been involved with collecting and preserving websites ever since. Our second speaker is Beth Osborne, a librarian at the Law Library of Congress. In addition to providing reference, research, and instructional services in the area of US law, she has been involved in web archiving for the Law Library for the past four years. Prior to entering the field of librarianship, Beth was an attorney in the New York State court system. She is currently on a supervisory detail to the US Copyright Office's Outreach and Education section and returns to the Law Library this summer. Our third speaker is Ben Lee, a fourth year PhD candidate in the Paul G. Allen School for Computer Science and Engineering at the University of Washington, the other UW, where he is a National Science Foundation graduate research fellow in machine learning. He recently served as a 2020 Innovator in Residence at the Library of Congress and the Willner Memorial Fellow in the Stroum Center for Jewish Studies at the University of Washington. Previously, he was the inaugural Digital Humanities Associate Fellow at the United States Holocaust Memorial Museum, and a visiting fellow in Harvard's history department. And then last but not least, our final speaker tonight will be Dr. Amelia Acker, an assistant professor at the University of Texas at Austin in the School of Information where she leads the Critical Data Studies Lab. Her research on data archives and preservation has been funded by the National Science Foundation and the Institute for Museum and Library Services. In 2016, Acker was awarded early access to the Obama White House social media archive, and has since then studied preservation and access issues related to social media data.
Acker's current research focuses on cultures of mobile computing, emerging digital preservation models, data literacy, social media data archives, and metadata standards for exchange between private and public archives. Previously, Acker worked as a librarian and archivist and a mobile app developer. And when she's at home in Austin, you could find her biking, bouldering, or swimming about town. So thank you to all of our panelists. And with that, I will invite all of us to turn off our cameras except for JJ, who will lead off our conversation tonight. So thank you. >> JJ Harbster: Thank you, and hello everyone. I'm really enjoying seeing where everyone is from. This is truly an international conference. And so it brings me a lot of joy to see all of the geographic locations in the chat. So to introduce myself, I am JJ Harbster. I am the Head of the Science Reference Section at the Library of Congress. And when I was posed the question, why web archiving, as a librarian, and now a champion of librarians, or a supervisor, but I like to consider myself a champion of librarians, I spent many years, and those years were through trial and error, of how to integrate web archiving in traditional collection development activities. And so part of my job is to help the library develop the science collections. And so I like to ask myself questions like, this is a big one, am I collecting and preserving scientific knowledge for Congress, for current users, and for future generations? That is a very big charge and a very big question. But I also ask questions like, is there something at risk of not being collected? And what can I do about it? So communication of science, as everyone knows, has expanded outside of this realm of print, you know, we ingest things that are born digital, you know, in blogs, and social media, and all those other things, YouTube. So there's so many factors that prompt me to archive something from the web. And oftentimes, it's documenting a moment in time.
So one of the good examples is Hurricane Katrina in 2005. Or maybe there is a theme of things that I'm collecting. I truly love polar bears. And so I have been selecting content that documents how these beautiful animals more than likely are going to become extinct, probably in my lifetime, which is very sad. So I'm collecting all sorts of material about polar bears. I also collect material that I'm afraid is getting lost, where there's some sort of risk involved. And that led me to do the Science Blog Archive. And let's see, I would just, I have little prompts to keep me on track, because I have five minutes. So through the years, what I've now come to understand about web archiving is that sort of my style and purpose of archiving tends to go more towards the thematic or the event based, and I see myself more as a curator than a librarian. And, you know, because, as a librarian, and at the Library of Congress, I can't collect and preserve everything. I feel like that's a romantic notion and nobody can achieve that. I don't think that's achievable. But what I can achieve is a curated, representative collection. And so that's kind of the direction I've gone in. So you know, taking a snapshot of a moment in time, or identifying representative content that tells a story, answers a question - those are the sorts of things that I'm trying to do when I'm archiving for the web. And I think I probably, am I at my five minutes, Ian? >> Ian Milligan: I haven't been keeping very strict timing. >> JJ Harbster: Okay, I'll just quickly share a couple of the projects that I've worked on, over the years. One of the projects, which is my baby, I consider it my baby, and I treat it like a baby, is the Science Blog Web Archive project.
I started that almost... working on that project, I kind of started understanding more about science communication, how information is distributed, how it's sort of organized, and that sort of thing. But one thing that unexpectedly happened with that collection is that I noticed, because it's been almost 10 years now, that I was documenting some early careers of science writers that now have very big jobs in science writing. But I was collecting their blogs when they were just starting out. So that was a very unexpected thing that happened with that. I also led an Earth Day 2020 Web Archive project, as Earth Day was turning 50. Little did I know in 2019, when I was creating that project, that there was going to be a global pandemic. And we would be on lockdown in 2020. I still decided to continue with that project, with the Earth Day 2020. And again, something unexpected happened: I started documenting the first digital Earth Day, which was not my intention from the start. The last project that I'm working on right now is the Coronavirus web archive. And with that archive, what I've been focusing on is balancing the science and the business, industry, and policy stuff with more human stories. And how I got to that was when I started thinking about historical pandemics. And, being in science, I started reading up on historical pandemics, especially the influenza of 1918. And I was really noticing things that authors were citing, topics they were focusing on, there was a lot of talk about mask wearing and such. And so I thought it was really important when we were creating this web archive that we really needed to balance that science and policy and business with very human stories. And so, the impact on everyday lives and what those people were doing.
And so we did a lot on what we call Corona cuisine, or something, you know, cooking and fashion, and religion, performing arts, you know, all of that kind of really human stuff that makes us human. So I will stop now and give the floor back to Ian. >> Ian Milligan: Wonderful. Thanks for kicking us off JJ. Beth, I'll turn it over to you for your remarks. >> Elizabeth Osborne: Hi, thanks, Ian. And like everybody else, I'm so happy to be here and talk about web archiving from, you know, my perspective. I am a librarian. I'm a law librarian and my background as an attorney informs that. And, you know, I come to you today with the perspective of somebody who's also a curator like JJ, and I have been doing this for about four years. But I did not start this archive. The archive has been in place since before I assumed responsibility for the collection. And so that was interesting in and of itself, since I kind of took over from something that had been started by other folks, you know, in a different time period. But, you know, so we're talking today about, you know, what is the value of web archiving and what are its potential uses. Why do we archive legal blogs? - that's, you know, that's my area of expertise. The Law Library decided to participate in web archiving, kind of, like JJ said, to preserve at risk born digital content. And so the archiving of legal blogs fits within that mission. Web archiving ensures that these collections will be accessible to future generations, or we hope they will, future generations of researchers. And we do this because blogs are temporary. They can go in and out of existence, you know, at the drop of a dime. They're relevant to the legal profession, they can be very niche, so they can be on a very specific topic. The third thing is, they're being cited for their legal analysis by courts. And when we talk about law, we talk about primary sources of law; case law is an example of that.
That's coming from the courts. And then we also talk about secondary sources. So a secondary source is something that informs those primary sources. So we would consider legal blogs to be a secondary source of law. The legal blogs enhance the Law Library's collection, you know, the Law Library of Congress, their mission is to provide access to an unrivaled collection of US, foreign, comparative, and international law. And so, we believe that the archiving of legal blogs is important to this mission. So I mean, one of the things that I talk about a lot when I talk about the collection that I work on is that I love the fact that the blogs are kind of several types of information, right? It's overlapping attributes of, like, the medium and the content. So the legal blogs are a source of information, right? In this case, legal scholarship and analysis, bloggers are pushing out, you know, kind of newer ideas or discussing areas where the law is changing, you know, emerging law, and they have the ability to publish quickly. And then there's also drawbacks to that, you know, in terms of authoritativeness, and long term relevancy, the depth of their analysis, and the gender of the author. So all these things are kind of in comparison to more traditional forms of legal scholarship, like, you know, law reviews, or kind of more peer reviewed publications. And then two: blogs are a form of social media, right, it's an avenue for discourse. And they typically invite engagement and critique in the form of comments or responses. So it's more interactive than your traditional secondary source, like a treatise or, you know, something that doesn't really invite responses. And, of course, they're websites, right, that's the third thing, which is interesting in and of itself, because the physical design and the layout are important to the conveyance of the message. And fourth: they're a brand for some people.
They're a form of advertising, they're a form of marketing. A lot of these blogs are written by law firms for the purposes of marketing their services. But the field is also, you know, heavily dominated by academics who have their own brand to sell in terms of relevance, prominence, and quality of research. So all four of those things sort of come together in a legal blog. And so that's where I'm interested. I don't know which of those things necessarily is going to be interesting to researchers in the future. Are they going to be interested in, you know, the physical design of a legal blog? Are they going to be interested in the content? I hope they're interested in the content; that would be, like, my biggest hope. But I think, you know, because I had to be sort of comfortable with that ambiguity, you know, I have made it my goal to try to make sure that I have cast a wide net, and that we cast a wide net and bring into the collection blogs that are diverse, represent a variety of topics and voices, present a consistent level of intellectual debate, and that could be in danger of being lost to time or neglect. You know, along those lines, I also had to get comfortable with, you know, a willingness to reassess items that maybe are in the collection, and are no longer in scope or have become defunct. So I think one of the best parts about web archiving is I get to look at those things, you know, I go through my reviews. And if I find that, you know, the crawling period is over, the website is defunct, the blog is no longer useful or no longer in the scope of the collection, you know, I get to free up room and bring in new content, and new voices, and new blogs into the collection. So that's what you know, that's what I really love about it. And that's, you know, what I got out of web archiving and why I think it's important. >> Ian Milligan: Wonderful. Thanks so much, Beth.
And I'll go to our third panelist now, Ben Lee, thank you again. >> Ben Lee: Thanks. Hi, everyone, it's very nice to see you all. And I'm really excited to be a part of this conversation today. So I'll give a little bit of background on my own research, and then perhaps talk a little bit about what I see to be so exciting about web archives, especially from the perspective of somebody in computer science. My research generally, what I'm interested in, and as Ian mentioned, you know, I'm working on my PhD in Computer Science at the University of Washington, is really how we do what I might call computing cultural heritage, how we really think about doing projects involving machine learning in order to improve access, or facilitate search and discovery for cultural heritage collections. So thinking along the lines of libraries, archives, museums, memory institutions. And along these lines, I think what I'm interested in particular is really some questions at scale about how we search over massive collections. My dissertation is largely centered around a project called Newspaper Navigator, which I began with the Library of Congress through their Innovator in Residence program in 2019. And the goal of this project is really trying to reimagine how we search over millions of newspaper pages, in particular from Chronicling America, which is maintained by the Library of Congress and, at the time, had about 16 million digitized historic newspaper pages. And so one of the real challenges that we wanted to ask was, first, how could we try to extract this visual content at scale? And then second, how can we reimagine searching over it? So we released a dataset and then also a search application that allows you to search not only by keyword search or by facets, so looking by state or by date of publication, but also to search by visual similarity using some underlying machine learning mechanisms.
And so I say all this because I think, despite the scale that emerges through Chronicling America and thinking through, you know, around a terabyte of data from all of the digitized newspapers, as Ian mentioned, when we start to level up and think about web archives, as we move toward, you know, a single petabyte let alone hundreds of petabytes, the questions become, I think, really fascinating from a number of different perspectives, especially computationally in terms of how we even make some of these search affordances possible, and then what we can do with machine learning. I do want to just mention a little bit about what I think to be one of the exciting things about web archives, in terms of how it highlights so many different disciplines. Here, I'd point to, I think, even to Ian's book, "History in the Age of Abundance", and all of his work in terms of describing how web archives will of course, change the contours of historical research in the future, and even currently in the present. And I'd add all the work in computational social sciences and the digital humanities, but also from computer science and STEM as well. For those of you who might be aware of some of these bigger machine learning projects that train what I call foundational models, like GPT-3 and these really large machine learning models, pretty much all of them are trained off of web archives as well. And I think that presents a really interesting opportunity to try to really imagine how we search over these collections and understand their composition, because it really affects how we train these models, and think about them from a computer science perspective as well. And so I mentioned that I think, really the exciting frontier of web archives, for me personally, is thinking about these kinds of search and discovery systems that scale and how we can really think about web archives as unique in this case, not just in terms of how big they are.
But in terms of the ability to try to do interesting new mechanisms for searching over text and images jointly, how we can use webpage layout as an interesting way of trying to serve as recommendations for search in general. And so here I'd mention projects like Archives Unleashed, but also thinking through these kinds of general challenges and what machine learning might offer us, I'll mention that I started some of this work with Trevor Owens here at the Library of Congress and thinking through how we can try to search over born digital government PDFs. So in particular, leveraging the Library of Congress's wonderful datasets in the 1000 .gov PDF data set. And I know there was a question in the Q&A already about the places to get started. And I think some of the Library of Congress datasets for born digital collections are a great starting point, just to think through some of the questions of how we actually try to search over these materials using mechanisms other than just keyword search. So this includes things like visual similarity, where we uncover all sorts of, you know, interesting PDF documents, you know, containing maps or heavily redacted documents that might not be surfaced if we just look at the metadata that were provided. And then I think I'll wrap up my section here by talking a little bit about, you know, who the audiences would be for these kinds of, you know, really massive search and discovery systems that I envision and talk so much about here. And for me, I think there's one question, of course, which is the research perspective, and I've mentioned all the disciplines involved.
I also think there's a really, really compelling curatorial component of this too, for subject matter experts in terms of, you know, if you inherit a born digital collection, or a web archive, and try to provide a finding aid or some descriptions, where do we even begin if it's too big? And I think being able to use some of these new, you know, novel computer science affordances, to be able to understand the broader contours of some patterns that are emerging, gives at least a place to begin to bootstrap up. And then lastly, of course, I'd mention, I think, for the general public, being able to think through new ways of searching and discovering materials, sort of per the larger goal, I think, in cultural heritage these days of thinking about serendipitous search, and also just generally new ways of searching. Web archives certainly fit this mold very well. And I really hope that as the research develops here, around search and discovery, it really brings new people to the collections as well. So with that, I'll go ahead and turn it over. >> Ian Milligan: Wonderful. Thanks so much, Ben. And then I'll turn to Amelia for our last speaker of the panel. >> Amelia Acker: Great. Thanks so much, Ian. And thanks to the organizers of this event, LoC and IIPC. And I'm really excited to be here with the panelists for the web archiving conference. So yeah, I'm an information scientist who researches and teaches about how digital preservation approaches are changing and how that impacts the way we understand society and how we know ourselves with these new tools, as we create digital information more and more online.
And so in my research, I'm really trying to examine the methodological and ethical challenges of collecting and analyzing and preserving data at scale, especially from our new information and communication technologies, like social media platforms, mobile apps, and the wireless networks that we use when we're connecting with our mobile phones and using all these new tools. And so as an information scientist who relies on data from the internet to do my research, I'm really concerned about the rise of increasingly privatized platforms of information infrastructures that we're now using to create personal records, corporate documents, even government communications. And what internet researchers and web historians like Ian and Ben know is that the information that we create today really does need to have really important thoughtful people like yourselves thinking about how to save it, because the stuff that we create today in networks is really radically ephemeral. And it's really difficult for us to rely on platforms to preserve and provide access to this information over time. So for empirical and historical scholarship, we're really facing a replication and reproducibility crisis. Because we don't have access to things like social media data in particular. And that's why web archives are so important. So that's why we need really visionary archivists and web archiving efforts and institutional leaders like LoC, and other organizations to preserve our digital culture that's now just not only born digital, but born networked in systems like the teleconferencing software that we're using right now. 
So I thought I'd use a few minutes of my "why web archiving" section to tell you a little bit about a research study that I supervised with some master's students a few years ago in a class I teach on metadata, where we played around with the Library of Congress's web archive of memes from Meme Generator, which is this online site that allows users to create and publish memes to share across networks online. So at the iSchool at UT, I have the great privilege of teaching all kinds of information professionals, archivists, librarians and data scientists to learn how to sort of play and leverage and learn about metadata with lots of different kinds of datasets, but we're always looking for historical and really internet driven collections to play with. And LoC's web archive has a lot of different kinds of collections that we play around with. And in the course, I teach students to use OpenRefine, and some of you may have heard of this software before, it's a great open source toolkit that allows journalists, researchers, data practitioners to clean messy datasets, but it also has really powerful analysis tools, and features that allow you to sort of parse texts or identify languages and bundle it up or locate patterns in strings of text. So you can group together different kinds of trends across structured data. So OpenRefine works really well for metadata, or the titles from memes - the text that we find on top of the image macros. So it's really great for analyzing web archiving data and datasets that are derived from web crawls. So in the metadata course, we use LoC's archive of Meme Generator and these related public datasets that archivists had created from crawling the whole site in 2012. So my students were very excited. It's very old, vintage memes, if you like and follow memes. And so Meme Generator is sort of like a printing press for memes.
It allows people to copy not just images, but also copy text that lets you update and change them in different ways. And vice versa. And so you can see the titles change over time, it gives us a really good insight into cultural expectations of humor, politics, current events, and so on. The public data set from the Meme Generator archive is about 57,000 unique images or memes, and it links to archived versions of those images. So you can bring it into OpenRefine and study these trends across not only groups of images, but different kinds of texts. So we were able to do time series of changes in phrases, we were able to detect five different languages across the dataset, and work with this huge corpus of what we call historic memes. So I'm happy to share more about that in the discussion. But I just wanted to talk a little bit about my own experience using web archives that already exist in my teaching, and in research, training new librarians and archivists. We publish our results and we'll share this too. I'll have the analysis in an IEEE conference, Social Media and Society. And if you want to check it out, it's kind of fun. But I just want to remind researchers and reference folks today here at the conference, to consider how often internet scholars, our students, and our colleagues don't know about the web archives that already exist of new media or the internet now, and we really need to share and tell them because internet researchers have a bias towards collecting their own data sets and creating silos. So we really need to publicize and share different kinds of scholarly and pedagogical approaches to using and reusing these collections. Like the Web Cultures Web Archive at LoC, which contains like hundreds, probably thousands of really exciting archives that are not only born digital, but are born networked records that we're creating together on the internet. And, as Ian said at the top, that we've been preserving with web archives for decades now.
So thanks, those are my comments. >> Ian Milligan: Wonderful, thanks so much, Amelia. And I might, just as I invite the rest of the panel to turn their cameras on and come back in, ask the audience to give us that sort of cavalcade of applause for those four wonderful presentations. You know, four presentations, I think, that have done such an excellent job of laying a foundation. And I'm really in awe of the amazing work that these scholars are doing, helping to build collections, helping to think about access, reflecting on the role in society today. And, you know, questions of sharing, and education and discussion that lie at the heart of our community of web archiving practitioners, very broadly defined. So what we're gonna do now is I'm gonna kick our conversation off for a few minutes. And as I pose questions to the panelists, and I know a few people already have, just again, a reminder: if you do want to add to the conversation, do click on that Q&A tab at the bottom and ask a question, and we'll shortly begin asking some of the audience questions. But just as people put their questions in and, you know, catch a breath, I want to pose a question to our four panelists here. And that is that we've all seen web archiving in the news, I think, quite a bit over the last few months, you know, with the Ukraine crisis, certainly, with political debates in the United States, and with, which has come up today, you know, the response to COVID-19 in the United States and in the world. You know, Canada had its moment in the sun a few months ago. And, like, these are wild events, and just like, you know, the legal blogs, lots of transitory things, lots of emotions, everything's moving really, really quickly. So I'm really curious if the panel could reflect on how you react to, like, these big events happening in real time. 
So I definitely will turn to JJ and Beth for this first, and then I'm curious as well about, you know, maybe Ben's thoughts on access and Amelia's thoughts around, you know, the memes and the private platforms. But JJ, what are your thoughts on this? >> JJ Harbster: I'm still trying to process the whole Coronavirus thing. You know, working on that web archive, it feels like it's never going to end. And it just keeps... there's like another phase or another something. And I feel like I'm so lost, not lost, maybe not lost. It's not the right word. But I'm so focused on that, that I'm not seeing what else needs to be done. Like what's happening in the Ukraine. And I would imagine, you know, the library is very international in scope. And so I would imagine some of our area study folks are preserving Ukraine web content. I would imagine. I'm not currently involved in that just because Coronavirus is still around. I know there will be a sort of end, or will there? You know, I know it has to sort of wind down, but I feel like... really, it's been a very long two years. For everyone, that is. So we were collecting ... we have about 85 active collections on a range of topics. I will pass it off to Elizabeth; she might be doing something. >> Elizabeth Osborne: I mean, I think, you know, Ian, your question was how do you react to it. I mean, I would say, like, I try not to overreact to the exclusion of... - I hope I'm phrasing this right - I try not to kind of solely focus on that when these events come up. What I try to make sure is that my collections cover those topics and ideas, and if I can see a gap because something arises in current events, then I kind of redouble my efforts. So, you know, maybe the collection was slim on health law blogs. So during the pandemic, you know, we tried to refocus and make sure that we were getting, like, the newer voices, different voices in that perspective. 
So I try not to ... you know, my collections are very pragmatic because they deal with the law, and I try to make sure that I'm just getting enough voices in there. And when these things happen, it's time to look at the collection and see if there are gaps, a very librarianship approach to it. >> JJ Harbster: And we've done a lot of analysis, like Elizabeth was alluding to, with the Coronavirus web archive. So we were looking at the various subjects, seeing, you know, do we have gaps in, say, what religions' response was, or a specific religion's response. And so there was a lot of analysis going on as we were developing, as well as collecting, content, and things change, directions change, scopes change, purposes change. And I think it's good just to kind of be aware and flexible, and not be so anchored to a certain purpose, and allow that flexibility. >> Amelia Acker: One thing that I have found really exciting is that we're beginning to see that web archiving is becoming something of a citizen science project. We could probably go back and mark it to the last presidential transition, I think, with some data rescue projects: EDGI [Environmental Data and Governance Initiative], our colleagues in Toronto, made big efforts towards rescuing online data. But we're seeing that with SUCHO. Someone's mentioned that in the chat, with Saving Ukrainian Cultural Heritage Online. And then many of the early COVID tracking dashboards, the COVID Tracking Project and many others, started with volunteer efforts, people really wanting to save our online cultural heritage and share it for different kinds of things. So from a cultural perspective, it's super exciting that increasingly these tools are becoming accessible and easy. And you just need concerned citizens or a group of people who really want to save stuff. And we can see informal to more formal web archiving projects just happen really quickly, which I think is really, really exciting. 
And maybe one of the bright things that came out of, you know, lockdown, the pandemic, and suddenly switching to Zoom life and Slack and these new tools. >> Benjamin Lee: I guess what I might add, you know, coming from the perspective of other researchers, is that I think it really underscores the distinction between digitized materials and born digital materials. That's the specific immediacy of it all. And I think it's really compelling, I mean, being able to get up and running with, you know, a dataset, and Google might say, under the left language, have collections of data or something like that, in such a short turnaround, I think is really compelling. Digitization, I think, you know, has its own certain cadence or limitations in terms of how long it takes practically to go from a physical material into something that you can use in a digital realm. And so for me, as a researcher, I think, one, it's super exciting, echoing what everybody else has said here. But second, I think, it also once again necessitates all sorts of really interesting and compelling tools that allow you to get up and running really quickly. And, I think, that's also a great opportunity to rethink this on the computer science end too. >> Ian Milligan: Wonderful, thanks to the panel for such a thoughtful discussion. One day COVID will end, I had to knock on wood, when I think JJ was saying that, but certainly, you know, a lot going on and really fascinating responses. So turning to some audience questions. You know, Christopher Werner asked a question directed primarily at JJ, but I think others on the panel might have thoughts on this as well. And the question from Christopher is: if you can't collect everything, how does one justify your bounds? What topics or events are worthy of being collected? How can we at this moment predict what will not be useful in the future? 
So JJ, maybe I'll throw to you for your first answer, but I certainly think the other panelists could think of this. >> JJ Harbster: I always say I don't have a crystal ball. So I don't know what future researchers, historians, scholars and the like will need 50 years from today. But what I do know is what current, you know, historians, scholars, what they're using and what they're looking for. And it changes. You know, there's definitely ... things ebb and flow. One thing that really helped us - and I think it's a good opportunity to bring this up - is with the Coronavirus web archive. One of our web archivists created a rubric for us. Because there was so much information. And we had a lot of things to go through, a lot of topics, because it's international; the whole entire planet was affected by this. Every single living being was affected. And so we created a rubric. And so whenever we looked at content, we would ask questions like: does this have research value? We also asked: can this be easily crawled? Is this something that we can actually capture? I think we had about 17 different questions we would ask ourselves. Also, developing a scope and a purpose, I think, is super, super vital to any collection. But I think you also need to reflect and revisit your scope and purpose as you're going on. So if it's a long term project, like the Coronavirus one - we're in year two on this - we are reflecting and doing analysis. We're looking at where we have gap areas, but we're also asking questions like ... now we're pivoting towards, you know, vaccines, and going back to, whatever, the endemic phase, you know, stuff like that. And I'll share the stage with my colleagues. >> Ian Milligan: Amelia, Beth, or Ben, any thoughts to add to that? >> Elizabeth Osborne: JJ, you nailed it. It was a very comprehensive response. I mean, there were a couple of questions in the chat about scope and sticking to it. 
It's a dance, right, between adhering to your scope and also trying to think outside your scope, which is something that, I think, the people who are involved with any collections work struggle with, but it's not a perfect science. And I think the most important thing is: don't work in a bubble. I have colleagues and a lot of other people who are involved in my project. You know, I may be kind of calling it my collection, but it's not my collection; there's a lot of other people who are involved in, you know, looking at things, approving, discussing. And, you know, we try to kind of work collaboratively on that. >> Ben Lee: I'd say, you know, approaching this as a researcher, one of the most compelling things about web archives for me is oftentimes seeing the strategies adopted in terms of what's collected. I think it's incredibly useful, you know, once again, maybe bringing this back to other collections. You know, oftentimes one of the challenges with newspapers is not even that, you know, of course, we have to accept that there are all sorts of newspapers out there that we'll never have copies of, or that are incomplete. But part of the challenge is not just having that understanding, but not knowing why certain materials were chosen to be kept, you know, hundreds of years ago, what the process was like, and what the genealogy is. And so I think it's really exciting to be at a time where we actually can trace that genealogy a bit more clearly. And it, I think, really makes it much more straightforward to use web archives, at least from that perspective. And I think it also will allow us, maybe 50 years down the road, as JJ said, even though it presents all sorts of challenges about who knows what people will be interested in, to at least have an understanding of what the process was like. 
>> Ian Milligan: Perfect, and I see that I think Zinab asked in the chat as well about the bias and selection criteria, and you discussed that really well there. So we have a question. I see some upvotes on it, so it's obviously a pressing question for the community, which is from Carolyn Frainger. You know, she says it's a basic question. It's not a basic question: can anybody contribute to web archiving? How do you get started with contributing to web archiving as a whole? So maybe, because, you know, Amelia works with students in this area. Maybe Amelia - what are your thoughts on that question? >> Amelia Acker: I'd like to think that there are two kinds of, sort of, swim lanes of web archiving: there's macro efforts that we see from huge organizations, that's formal collecting, lots of resources. And then micro efforts. That will be individuals, small groups, people doing not-for-profit work. Both swim lanes, both types of web archiving efforts, are just as important. So there's lots of different ways that you can have one click and be involved. "Save-page-now" is a great plugin that I use with my students when we're looking for things. That's a way to contribute to the Internet Archive tacitly as you're encountering things. You can also send efforts or lobby our librarians and archivists at the Library of Congress, and so on, to say, "please collect this or give us advice". But I feel like there's a whole continuum of ways to get involved. At a professional level, we train web archivists, or teach web archiving, at the master's level, usually, at least in US i-schools. But there are many, like, workshops and programs that you can learn about, and they are publicized through different professional associations like this one as well. I don't know, maybe other colleagues have ideas. >> Ian Milligan: Possibly to hear from JJ, Beth, or Ben. Although I think, Amelia, you've laid it out really well. >> Elizabeth Osborne: Exactly. 
Besides my professional capacity, I'm also just an interested citizen. And when I see things on the web that I think should be captured, I try to capture them on the Internet Archive or use the features that you talked about. So absolutely. I mean, I think it's great to have it coming from both sides. >> JJ Harbster: And I often share things I discover that might not be under my purview. And I will send it to somebody else going "Gosh, I found this really great thing. I think we need to preserve it". So I participate in that as well. And I agree with everything Amelia was saying. >> Benjamin Lee: I might add to that, from the perspective of getting engaged with web archives in terms of using them, I think it's oftentimes surprising how many people engage with web archives, but they might not be familiar with the term or the language behind it. I mean, so many people use the Wayback Machine or encounter it in their daily media consumption. And so in a sense, we're already all using web archives. It's just about the right framing or language for it sometimes, too, which is, I think, compelling. >> Ian Milligan: That's a really good way to put it, and some good resources in the chat, too. So the next question we got from the audience is from Melanie Solomon, which I think is a great question: is anyone actively preserving "bad info" or misinformation-driven web content sites? I suspect we can all speak to that. Maybe Beth? I'm sure no law blog has ever published misinformation but ... >> Elizabeth Osborne: I think we're passively doing it in that collection, probably. I mean, the thing about content that's coming from a blog or an individual is that it's not always well vetted. That's kind of the interesting part of the legal blog archive and things like that, like, you know, social media archives: why are we saving this, and how will it be useful? 
There are some big questions and, you know, what impact is saving going to have on its reputation over time. These are things I think about from a scholarly perspective, you know, and concerns I have. But are there efforts to, you know, deliberately preserve that? I do think that some of the citizen archiving projects that are going to the Internet Archive and other organizations are doing that; it's just that that's not really something that the Library of Congress, that I'm aware of, has set out to do in an assertive way. We may be capturing content that later proves to be inaccurate or misinformation, but I don't believe that there's a... JJ, are there any concerted efforts to do that? >> JJ Harbster: How we approached it for the Coronavirus web archive: because there was a lot of misinformation or disinformation, both were prevalent, we really didn't want to nominate content and give that misinformation authority. Because if they get an email from the Library of Congress saying "we're going to preserve your website", we did not want that to say that we agree with the information they're distributing. So what we did is that we approached a big sort of misinformation aggregator, I think it was called NewsGuard. I could be wrong on that. But it's a database of just all sorts of misinformation and where you find it. So rather than going to individual sites that are providing this misinformation, we decided to use an aggregator, and I think that was a really good solution to that problem. >> Amelia Acker: I can speak a little bit about what some disinformation and misinformation researchers have been doing in collaboration with platforms. It's a real ethical issue because if you save it and preserve it, in some ways it is an archive of all the skullduggery that people can reproduce, right. And so some of the stuff we actually do want removed from the internet. But perhaps we do want researchers or students to have access in some way. 
So how to provide access is really hard. Twitter for a number of years has been publishing datasets of state-backed operations, misinformation operations, and they published the accounts and some of the reasoning and metadata behind how this is a malicious botnet. Some disinformation researchers have begun to establish things like data co-ops, because the platforms remove access to misinformation a certain amount of time after it's been published, and there are a few co-ops beginning to start. Of course, we are still hemmed in by the terms of service with which we access the data. So usually, the platform APIs don't allow you to share data once it's been deleted or taken down. So it's an area that's a really big concern, especially for scholars of political communication, malicious content, and disinformation. And it's something that we really need to think about: how to provide access to some of that, I don't know, secret, sensitive content that we maybe do want to provide access to, but in limited sort of ways or in gatekeeping ways. >> JJ Harbster: I see the value in researching that stuff. It's just that we just didn't want to give this authority. You know, saying that we think their content is good. We were collecting it for other reasons, I guess. >> Ian Milligan: Perfect. In my own work, I've seen that in the 90s. Right. When people put a stamp on their website, saying, hey, this has been archived by somebody. That means my website matters. So we have a question here from Herbjørn Andresen, who writes: "as an archival science professor, I find it interesting that the term curation is emphasized, while appraisal might signal both expectations of accountability and transparency, and also a more open approach to successive recreations". And then he's got the question: "is there a risk of web archiving being too tightly connected to the views and perspectives of the collectors? 
A risk that, of course, is also relevant to more traditional appraisal work?" So because I see JJ smiling, I see Beth nodding, maybe I'll throw it to one of you for your thoughts on this question. Maybe JJ, because you're smiling the most. >> JJ Harbster: It just made me smile because, just as librarians, we're always biased. And depending on what environment you are in, you're trying to reach certain goals, but I'm always going to have a bias towards a certain type of information or a certain topic, like the polar bears. I'm really sad about that. And it affects me, and so I am making sure to collect all sorts of information about polar bears. So, I think, it's something that's always going to be there. I don't think we should hide it; we should sort of embrace it. But I think also you need to not - I think Elizabeth mentioned this - work in a bubble. And I think you need to be communicating with your colleagues and sharing and getting fresh new eyes and different perspectives. And, you know, being open and receptive to maybe going to some sort of other type of information that you were thinking "No, I don't want to collect that". But, you know, it's important too. But it's always there, bias. >> Elizabeth Osborne: One of the components of the question was, you know, kind of focused on the word curation as opposed to appraisal. I think we're doing both. Because we are limited in resources, just the fact of evaluating for scope is part of the appraisal process. I mean, no one else is inside my own brain. But I'm doing the best that I can, I would say, when it comes to that, and the way that I do that is by inviting other voices, trying to have a well-written, clear public scope and reasoning behind the collection, and making sure that other people in the library are always aware of what we're doing. So that's the best... I try to do the best I can, you know. >> Ian Milligan: Perfect. I think that's the right approach. 
And so I see we have another question here about scope, but I think we've talked about scope enough, so if we have time, I might come back to that. But I think we've at least partially answered that for our anonymous attendee. And so I see Marlayna Christensen has an interesting question here, which is: how do you facilitate searching the web archives? She's struggling to get people to use web archives because, of course, it's difficult to search for specific topics. Ben, maybe I'll look to you, as someone who's been, you know, working on a lot of these questions. >> Ben Lee: Yeah, sure thing. I mean, I think web archives, in terms of searching, are really interesting because, in a sense, we're already so familiar with how we search the web from something like Google, and we have some sort of fluency in what we might expect when we try to search a system. On the other hand, looking back on web archives, though, there are so many possibilities for new ways of searching, and I think oftentimes we might be a little bit restricted in that we all have this sort of shared vision of entering text into a box and seeing what happens, because there's a lot of exciting computer science research, information science research, and research in library science as well, going back decades, on reimagining how we do that. I generally say that I fully anticipate, as we move forward here, these kinds of search systems will become more expressive, allowing us to do things beyond just basic keyword search. Really compelling examples could include searching by image, and obviously, like Amelia and others have spoken about, how we can trace fidelity or things like that and look at network analysis and things of that nature. And I think that all falls under the domain or purview of search as well. 
And I'd say, too, we can think of specific use cases in government collections, you know, trying to retrieve different document types by, you know, congressional seals appearing, or hearings, or things like that, which can be really valuable too. I mean, I actually think we need to tie this back to the previous question too. Maybe the perspective that I might take on how we access web archives is that it's not just the bias of how they're collected, or how we curate them, the big question that emerged and was just addressed, but also that for the person sitting in front of the computer trying to search a web archive, there are all sorts of other mediating layers that occur throughout the search tool and everything else, that imprint on what they're able to find. And I think looking at that sort of closed loop is really fascinating. And I think trying to reimagine what it's like to sit there as the end user using one of these systems, and compiling lists of the specific kinds of affordances that are useful to people - how do they want to be able to search, what kinds of questions do they have - these remain really interesting questions, and, you know, ones that I'm personally very invested in as well. >> Ian Milligan: Awesome. Thanks, Ben. And Amelia, you were name-checked there. So I just want to see if you have any thoughts on searching with your students and these difficult collections you're using. >> Amelia Acker: Ben raises this really good point that we see everything through the Google search bar. And before the ascendancy of the keyword, there were all these other ways of finding information in objects, even webpages. 
So my favorite thing to do is to try and give students or friends, my brother, for example, a challenge, a quest, to try and find a bit of information or fact from one of the oldest websites that they can find on the Wayback Machine, for example, or even going to your library's catalog and messing around with the electronic resources options. And just seeing what the world was like, even 5, 10, 15 years ago, really gets people into a different way of thinking about what is potentially possible. And, I don't know, that's one thing that I try to get people to think about: just actually looking for information, not through known knowns, known keywords, but trying to answer it with stuff. And just remembering that the web is full of all these different kinds of stuff. And web archives have those, like, amazing resources like the Wayback Machine. I want to say the web cultures collection at LOC has these sort of, like, vignettes, like baseball cards, and that was really helpful for starting and inspiring, so that helps me when I'm trying to get people excited about, I don't know, researching and using web archives as evidence in their, I don't know, scholarship, learning and so on. >> Ian Milligan: Perfect. >> JJ Harbster: Metadata, really, really good metadata. >> Ian Milligan: Yes, metadata is one of those critical words. So we're nearing the end, so I'm gonna go to a lightning round question. So I'll see what people think, but really quickly, because it's a big one, from Karen Liston, which is: how do you avoid duplicating your efforts? So maybe Beth and JJ: how do you handle what we're all doing to make sure we don't all collect the same website a million times? >> Elizabeth Osborne: We are duplicating our efforts, in some respects. I mean, that's like saying there's two collections in the Library of Congress that hold congressional materials. I mean, there is some overlap, and, you know, putting them together in a collection hopefully is meaningful to that. 
And so it's possible we are duplicating some efforts, but when we co-locate those things with like other things, you know, in kind of the original library-sciency type of way, then hopefully, even if it's a duplicated effort, it makes sense to the researcher. >> JJ Harbster: And yes, there's definitely duplication going on. One thing that I did during the Coronavirus web archiving project, which is still going on, was to attend meetings and to read what other people were doing in this respect in terms of web archiving. So I couldn't understand everything that was going on across this planet, but I was definitely trying to be aware of what was going on in the U.S., what was going on in other countries, in terms of web archiving, and seeing where we could fit in. But there will be duplication. >> Ian Milligan: I see in the chat that Amelia put "Lots of Copies Keep Stuff Safe" [LOCKSS] in it. It dovetails with a giant question that, unfortunately, I don't think we'll have time to get to tonight, about the succession planning of the Internet Archive, etc. So, you know, there's value to this work. But I want to ask one last question here really quickly before we wrap up tonight's event, and this came from Dorcus Johnson, who is a grad student at Kent State, majoring in digital humanities and archival studies. And their question is: "having taken an introductory DH course acquiring many new skills, how/where do I look for internships to advance those skills and practice more beyond theory and an initial DH project potentially, with many things?". So this is a big question, I guess a good way to end tonight's conversation: you know, what do we do when we all want to get into this big field? So who wants to tackle that first? Ben - do you want to tackle it? And then we'll see. >> Benjamin Lee: Definitely. 
I definitely appreciate the question too, because I think I'm certainly a relative newcomer to the space compared to everybody else on this panel, and certainly many of the attendees as well. I mean, I think one of the best first steps is just immersing yourself in the community. And I think the conference over the next few days is a really fantastic way of doing that. I know, practically, I think Abbie dropped some resources in the chat about LOC internships and things like that as well, which I personally advocate; you know, here I will make a plug for being as close to the collections as you possibly can to do your research. And I think that's a great opportunity as well. And yeah, I'd say finding these kinds of active communities where you can participate in the research and also share ideas, I think, is even part of the deduplication effort, maybe in terms of a very liberal interpretation of that. In terms of research, there's nothing like meeting other people and seeing what they are working on and learning from them as well. >> Ian Milligan: Amelia, or JJ or Beth, any quick parting thoughts on how do we get where we all want our next generation of practitioners to be? >> Amelia Acker: Well, I was gonna say you should check out all the cool web archiving projects and tools, and Ian is a great member of our community. The Archives Unleashed project is a good place to start. Conifer, which used to be called Webrecorder, also, I think, funded by Mellon, is big into preserving digital art online. The Programming Historian is really cool. Just maybe some of the library and software carpentry resources, but just getting your feet wet and learning about the different tools, approaches, and different ways that researchers and librarians have partnered with, you know, institutions, philanthropies, and so on is a great place to start. That's what I would recommend. 
>> JJ Harbster: I think the same thing, like going out there and talking with people and going to conferences like this, or just watching YouTube videos and getting inspired. I think just going out there and talking with other folks about what's going on is a great way to kind of set a path that you want to go down. >> Amelia Acker: And LC Labs. That's where I heard about the meme archive. So follow that. Put that on your RSS reader, dip into their blog. They talk a lot about all the different stuff that you can do and new collections coming out. It's a great place to see how other people are using these collections as well. >> Abbie Grotke: And let us know when you do, because we're always interested in talking to the researchers that are starting to use our datasets. So please be in touch if you attempt to use any of the stuff we've put out there for you. So with that, I think we need to close the event. Thank you so much to Ian, Amelia, Beth, JJ, Ben. This has been really fabulous. So exciting. Thank you to all the attendees for the fabulous questions. Look for the recording, available on loc.gov and IIPC in the future, and please do share it. I also want to say thank you to the IIPC colleagues: Olga, Kelsey, and Robin, for your tremendous support tonight. Robin is on UK time, so we're really grateful that she is here with us tonight. Late for her, but we're really happy that you all joined us, and this was a fabulous conversation. So thank you very much and hope you have a good evening.