>> The Library of Congress in Washington, D.C. [ Silence ] >> Well, good evening, everybody. I think we'll go ahead and get started. I don't like to get started too late. My name is Leslie Johnston and I'm the acting director of the National Digital Information Infrastructure and Preservation Program here at the Library of Congress. And I am happy to welcome you, and thank you for joining us tonight at the Rosenzweig Forum on Technology and the Humanities. The forum is a collaboration between the Center for History and New Media at George Mason University, the Center for New Designs in Learning and Scholarship at Georgetown University, and the Maryland Institute for Technology in the Humanities at the University of Maryland. Tonight's very exciting discussion topic is preserving and interpreting born-digital collections. You'll be hearing about how some of our national cultural heritage institutions are addressing the acquisition and the preservation of some of their born-digital collections, as well as the perspective of a scholar and researcher using these unique collections. The forum is named in honor of Roy Rosenzweig. Roy was the Mark and Barbara Fried Professor of History and New Media at George Mason University, where he also headed the Center for History and New Media. Among his other accomplishments as the founder and director of CHNM, he was involved in a number of different digital history projects including web sites on US history, historical thinking, the French Revolution, the history of science and technology, world history, and the September 11 attacks. He won a posthumous award from Wikimedia France recently for an essay he wrote in 2006, "Can History Be Open Source? Wikipedia and the Future of the Past." So we're thrilled to be co-hosting the forum tonight as one of the special events in celebration of the American Library Association's Preservation Week 2013. And someone asked if there was a hashtag, and it's #preswk, for Preservation Week; there's no hashtag for this event, but there is one for Preservation Week, so if you're tweeting please go ahead and use that. This is an initiative that is supported by the Library of Congress and ALCTS, the Association for Library Collections and Technical Services in ALA, the Institute of Museum and Library Services, the American Institute for Conservation, the Society of American Archivists, and Heritage Preservation. Preservation Week is an event that encourages libraries and other collections to connect to our local and regional communities and inspire action to preserve our collective heritage across all of our institutions. We're having a number of events here at the library this week; we started today and we're going through Friday. We're very excited about these sorts of events, and that we got to co-sponsor this event tonight in particular. A special focus this year is on saving the mementoes of military members and their families. So before we launch into the speakers I would like to take this opportunity to introduce Monica Mohindra of the Library of Congress Veterans History Project, because she'd like to share some information with you about the project. >> Monica Mohindra: Good evening everyone, I'm short and I move around a lot. And those of you who heard me earlier will remember that I said that. So if you can't hear me you just need to wave, and if you can't see me, duck around and I'll try to duck to see you.
Before we get to this really exciting topic, and one that we're also really interested in at the Veterans History Project, I'm going to give you a sales pitch. I assume all of you are here because you're working with these kinds of materials and objects, and we need your help. How many of you have heard of the Veterans History Project? Oh, be still my beating heart, that's really fantastic. Okay, well I'll go really fast because you know of the basics, but hopefully I can dispel a couple of myths and encourage you to work with renewed passion with us. So the Veterans History Project was started in October of 2000, and the idea was actually, believe it or not, legislated through unanimous legislation in Congress. Sometimes we joke it was the last [Inaudible] legislation in Congress signed into law by President Bill Clinton, and it went into effect essentially six months later, when we started collecting actual items. What we collect: the focus is of course oral histories. But we also want first person experiences that are told through that first person narrative that get to that same feeling. So collections of letters, documents, photographs, those kinds of materials, journals, diaries. And on our web site, which is back there on that [Inaudible] off LOC, on our web site, I'll show you in just a quick minute how to get to exactly what our scope is. For you, the pitch I want you to understand is the legislation was written in such a way that individuals would interview the veterans in their lives and their communities and submit those materials to the Library of Congress. So I give this talk a lot, I do it in five minute, ten minute, hour, two hour increments, I do it for people all over the country, I do it for students, I do it for laureates. It doesn't really matter who I do it for. What happens when I do this is people start seeing the veteran that they're connected to, and Aunt Millie and Uncle Bob and their cousin, and they stop listening right about when I say what we want, and so at the end, however long I have talked, somebody invariably comes up to me and says what you really need to do is come interview Aunt Millie. I need you to come interview Aunt Millie, because Aunt Millie was a WAC, and she has all this great stuff to tell you. I'm one person, and we need you, because we don't have a cadre of oral historians fanning out across the country going to get these stories. So we particularly need people who are working with organizations to get the stories of the veterans in your communities, because we need -- to borrow a treasured phrase from the Library of Congress, we need multiplier effects, we need people who can help spread the word. Why do we need those people? At the moment we get between 100 and 150 collections per week, voluntarily submitted. That's amazing. What does that add up to? We have about 86,000 individual collections currently. Also a number we're very proud of. Guess what: 86,000 versus 22 million. That's the universe of veterans. And so we start with World War I, where we have about 200 collections, and we go through the conflicts in Iraq and Afghanistan, of which we have about a thousand collections. 60% of our collections are World War II, and with apologies to the white male population in the room, 60% of those are enlisted white gentlemen. So we need the broad spectrum of stories so that we can tell the full story of America in conflict, and we need your help to particularly reach underserved populations: the Korean War, the Vietnam War, women, people of different faiths, backgrounds, et cetera.
So my point here tonight, and what's so exciting to us, is this is a unique opportunity, with a special focus on military service members and their families and people who are reaching out to work with them: we're talking to people about how to preserve items born-digital, how to preserve items for personal use, how to preserve items with small institutions, with large institutions. My message for everyone who is participating in this week is consider the Veterans History Project for the military person in your family as a repository for the original materials, and then do all the other exciting stuff that you're going to do for your family members and for your locality. And so just a couple of quick things on the web site and then I'll get out of your hair. But I want you all to now start thinking about that veteran or two or three that you're going to be working with to get to the Library of Congress. So on our web site what you'll see are a few things that are really important for you to pay attention to. One is how to participate. This is a great little area to learn all kinds of things, including learn more about what we collect. This is within the next couple of months going to be updated. But the scope is still the same; that gives you this information here. And then this is the part that's going to be updated a little bit, when we accept. But you can pay attention to this for now. And then working backwards to our homepage, the other thing to know about is this wonderful area up here, which is where our 15-minute video is. It requires RealPlayer, which is annoying, so we also have it on iTunes U and on YouTube. And you can find helpful information like that on our RSS feed, which I highly recommend everybody subscribe to; look at this, we're celebrating Preservation Week. But if you go through here you find all kinds of really fantastic tips and things that are happening in this universe. So this is our feature this week, which we're really excited to share. And one thing that's a myth that a lot of people don't know is that we do accept these first person narratives that come in a form of this experience without necessarily being an oral history; that's important for deceased veterans, it's important for people who can't tell their first person story for a variety of reasons. We do not take proxy or second-hand interviews. So it has to be the original materials. People say, why originals? Well, I think I don't have to persuade all of you. I think many of you probably understand why originals. But just one quick story: this gentleman here, John Albert Chapter, is one of my favorites, because his family sent us his diary from World War I, and later thanked us for existing, which we thought was kind of sweet. And then we found out why. The diary was kept in a hall closet with a family Bible in a special box with all of these other special, you know, hand prints and baby shoes and all those things we all hold dear. And then Hurricane Rita came by and left that entire closet and that whole first floor wet. So if they hadn't turned in the original to us and taken nice digital copies for family members, it would no longer exist. I like to tell these dire stories because they pull at our heart strings and make us think about, yes, the Veterans History Project at the Library of Congress. So that's a really great series of collections that gives you a quick sense of who we are, and my colleague at the very back, Rachel Telford, helped put this together.
So if you have questions for her as the evening winds down, she's the person to ask about this. So that's pretty much it. This is the database, so you know how to access some of our 86 -- information about some of our 86,000 individual collections, over 13,000 of which have been digitized, and we hope for more of that. And then this is the last in our series about Vietnam, but we have 39 of these total Experiencing War features, another great way to thematically introduce yourself to our collections. I have to let go of the stage now for somebody else to speak more substantively about tonight's topic, but before I do I think I could probably take two questions. None? Great. Thanks for your time, and think about that veteran and, you know, get in touch. Appreciate it. [ Applause ] >> Thank you so very, very much, Monica. I'd like to introduce our first speakers, who are coming to us from the Smithsonian Institution Archives. Riccardo Ferrante is an information technology archivist and director of the Smithsonian Institution Archives' Electronic Records Program. He oversees digital preservation there, and digitization, and their electronic records management activities. I'm also welcoming Lynda Schmitz Fuhrig, who is the electronic records archivist who handles everything from word processing files to born-digital audio and video and the Smithsonian web sites that are in the archives' collections. Welcome Rick and Lynda. [ Background noise ] >> Lynda Schmitz Fuhrig: Hello everyone, thank you for coming out this evening. [ Background noise ] >> First of all, a little background about us. We're kind of a mystery to some folks. Some people think we're part of the National Archives; you know, they're a little confused about the role of the Smithsonian Institution Archives. And actually there are 16 special collections, libraries, and archives at the Smithsonian. This includes the Smithsonian Libraries, the Archives of American Gardens, the Archives of American Art, the National Air and Space Museum Archives, and so on. We were formally established in 1891, so we are the institutional memory and the record keeper of the Smithsonian. The electronic records program was formally established in 2003, but we were taking in electronic records prior to then. So obviously we are concerned about the digital history of the institution. The photo you see here is from the collection of a zoo researcher, and it documents many decades of her work; in addition to all the paper that we have in this accession, we also have three decades of computing as well. So we have everything that you could think of in terms of born-digital and electronic records. We have text and audio, CAD drawings. Everything comes to us in a variety of ways; we have about 500 collections now with electronic records in them. Some of them are a mix of paper and born-digital, and then some are strictly born-digital, such as e-mail collections and web sites. And we get this material on removable media such as hard drives, five and a quarter inch floppies still come our way, three-and-a-half inch diskettes, we're starting to get some Blu-ray now as well as jump drives, we do have [Inaudible] transfer, server transfer, and we'll occasionally get some e-mail attachments as well. And some of this information that we get, anything you can think of.
We have exhibit files that include photos of various extinct exhibits, we have museum memos, we have CAD drawings from the National Museum of the American Indian, and then we also have Anna, the forensic anthropologist from Natural History, who has her very own Facebook page. So some of our best practices include inspecting the media, if it comes to us on removable media, before we do any work with it. We conduct virus scans on all our material, we take in the material and we ingest it with checksums to verify the integrity and the authenticity of the files. We also make another source copy, so this is how we can do our preservation work. We do keep the media as well, and we do analysis of the files that come to us using tools such as JHOVE and DROID to determine formats and any issues that we may have with these files. And if we have proprietary files, which we tend to get a lot of, we will work to convert those into more stable open formats as soon as we can. One of our management tools is a MySQL database. This gives us a more granular level at which to track what we're doing with our electronic records; we're able to determine how much we're getting in every year, how far we are in the preservation track with these records, as well as how many WordPerfect files we have, how many image files we have, and so on. These are some of our current preservation formats. So anything that's a word processing format, we'll want to get that into PDF/A or PDF. We do the same with PowerPoint and Excel when possible. Images, we like to have those in uncompressed TIFF. Access databases, we're using something from the Swiss archives called SIARD, which is XML-based. We've had good success with our Access databases; it's also supposed to work with Oracle and MySQL, but we haven't had as much luck with those. CAD has been very tricky for us. We've done some research into this; we were able to convert some of them into PDF, going from [Inaudible] to PDF, but there are issues with losing various layers when you do the conversion. So we're tending to rely on CAD viewers now for that. Audio goes into Broadcast Wave. Our web sites are crawled and captured as WARC, which is an archival web container format. E-mail, which I'll get into a little more later on, is saved as an XML preservation format, and born-digital video is not straightforward at all. And our tool kit includes quite a bit. We use both open source and proprietary software. I mentioned JHOVE and DROID already; we use FITS a little bit as well, which is another tool from Harvard. But keep in mind FITS is also a format that's used by the astronomy community as well as the Vatican. There's also MediaInfo, which is a good tool for AV files, and we have some of our own in-house batch scripts we run on our files. We use BagIt as well as the [Inaudible] File Analyzer. Jacksum is a checksum tool, [Inaudible] copies is another tool that helps us copy over files. Quick View Plus is a piece of software that's gaining a lot of attention in our field; people are using this as a viewer. And then the RipStation, which helps us actually get content off our CDs.
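Several of the tools just mentioned, like BagIt and Jacksum, revolve around checksums, the ingest step described above. Here is a minimal sketch of that fixity workflow in Python. It is an illustration only, not the Archives' actual scripts; the hash algorithm and the directory-per-accession layout are assumptions.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, bufsize: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large AV files don't exhaust memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(bufsize):
            digest.update(chunk)
    return digest.hexdigest()

def make_manifest(accession_dir: Path) -> dict:
    """At ingest: record a checksum for every file in the accession."""
    return {str(p.relative_to(accession_dir)): sha256_of(p)
            for p in sorted(accession_dir.rglob("*")) if p.is_file()}

def verify_manifest(accession_dir: Path, manifest: dict) -> list:
    """Later fixity check: list files whose checksum no longer matches."""
    return [name for name, expected in manifest.items()
            if sha256_of(accession_dir / name) != expected]
```

Re-running `verify_manifest` periodically against the stored manifest is what lets an archive assert, years later, that a file has or has not changed.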
So some of the difficult collections that we do have are ones that contain older, possibly unknown formats dating back twenty years or so. Social media and web sites: social media tend to be very tricky. We're dealing with third party vendors, they're constantly changing what they're doing on the front end and the back end, and they're pretty hard to crawl. And then there's also the whole can of worms with the terms of service. E-mail we'll discuss in a little bit. For video, we are now making disc images of DVDs that come to us so we can retain the playback and chapter functionality of those. And I've talked about CAD already a little bit. So some of the older files that we've been dealing with: we have something called Gerber, and these are actually files that describe images of printed circuit boards, and we do have some of these from the Smithsonian Astrophysical Observatory. And I can't show you those because they're actually in a restricted collection, but here's what they should look like. And actually part of what we do in the archives is research and try to find out what these old formats are and what solutions may be out there, and we did find a tool called GerbView and we're actually able to convert them into PDF. PCD is the Kodak Photo CD; this dates back to '92, when we were all still using film and taking that in to get developed. At some point, you know, in the '90s and 2000s you were able to get a CD as well with your digital images on there. Photoshop used to have a plug-in that would work with Photo CD, and it no longer does. So again, doing some research we found IrfanView, which is a free tool that can convert these files into TIFF; otherwise they're unreadable. Executables, some of them will still work, other ones don't execute any more, so we need to do some DOS emulator research in this area. This is what I like to call our files in disguise. Again, we have many, many word processing files, and some of them have no extension, or they have weird extensions. I know Leslie recently talked about this. You know, they'll have .MEM or .LBL for memo and label, or it might be .DOC and it actually really is a WordPerfect file and not a Microsoft Word file. But again we'll run our script and see if JHOVE, DROID, or anything else can figure out what they are. Sometimes they can, sometimes they can't. But we are sometimes able to open up these files in Notepad and actually view the various coding. Here you'll see the WPC, which means this is actually a WordPerfect file. If it was a Microsoft Word file there would be coding toward the bottom of the file saying Microsoft; it would say MS Word and then what version it is. But you can also use hex editors to see the hexadecimal and figure out what the file signature is, and determine from there what it might be. We're still getting these files in as well; it doesn't stop. I recently had a file that had a PowerPoint extension on it, and it actually turned out to be a TIFF file. And again, we have other files that came to us maybe 15 years ago. We did the bit-level preservation on them and still haven't been able to figure out what they are yet. But we're going to go ahead and keep them and keep working at them.
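This "files in disguise" problem, a .DOC that is really WordPerfect, a PowerPoint extension on a TIFF, comes down to checking the file signature rather than trusting the extension. A toy sketch, assuming a handful of well-known signatures; real identification tools like DROID consult the full PRONOM registry:

```python
from pathlib import Path

# A few well-known file signatures ("magic numbers"). This table is
# deliberately tiny and illustrative, not a complete registry.
SIGNATURES = {
    b"\xffWPC": "WordPerfect document",
    b"%PDF": "PDF",
    b"II*\x00": "TIFF (little-endian)",
    b"MM\x00*": "TIFF (big-endian)",
    b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1": "OLE2 container (e.g. legacy MS Office)",
}

def sniff(path: Path) -> str:
    """Guess a format from the leading bytes, ignoring the file extension."""
    with path.open("rb") as f:
        head = f.read(16)
    for magic, label in SIGNATURES.items():
        if head.startswith(magic):
            return label
    return "unknown -- time for a hex editor"
```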
On the other end of the spectrum, when you get to the actual preservation: here's a straightforward word processing file, it's Microsoft Word, it's a cover letter from one of our units that's using our standard letterhead at the Smithsonian. We went ahead and used Acrobat to put it into PDF/A, which is PDF Archival, I think everybody in here knows that. And then when you actually run Preflight on it, it comes back and tells you, no. The problem here is with the font, actually. It's Minion Regular, which is a font that cannot be embedded due to copyright issues, but Minion Pro can be embedded. But I'm not going to be able to convince the entire institution to change their letterhead. But this is something we know about, we've documented it, and we're aware of it. Getting into the issue of e-mail account preservation: for those of you who don't know, we did a three-year e-mail preservation project with the Rockefeller Archive Center. We developed a preservation schema; we co-developed that with the EMCAP project, which was another e-mail preservation project with the North Carolina state archives, Kentucky, and Pennsylvania. We also developed a parsing tool as part of that project, and we still use these tools today and we're still very happy with what we had done with the project. At the time, the largest account that came to us was a gig-and-a-half, and it had 89,000 messages. And now we're starting to run into issues with scale. The largest account we have right now is about 26 gig; it's 208,000 e-mails, very attachment heavy. So it's kind of getting up on the edge of what our tools can do, and we're encountering issues with corrupt files coming to us as well, and things breaking with the PST files too. But the point being that digital preservation is a continuing process, and we have to keep working at it and keep looking for tools, and know some things will be getting better. And at this point I'm going to turn it over to Rick, and here we go. Yup. >> Riccardo Ferrante: That e-mail illustration really gives me a launching point. So this is my segment, getting from mine to ours. When the electronic records program started in 2003 our focus was very much on capturing that piece of history that was slipping away, the born-digital records. And we did a great job of that; our focus has been very much on capturing that and preserving that and getting it out of those kind of really risky situations. Most of what we got was part of a mixed media accession, and we only had one copy, right? So you have to go through all of those best practices to make sure that it doesn't disappear. The end result of that is we were basically -- as far as electronic records were concerned, we were a dark archive. And moving -- transitioning from the dark to the light has been full of challenges. So we have a number of ways to get at our information. One is the collection search on our web site. And then we have a couple of others that are really kind of the focal points right now. One is the Smithsonian's archives catalog, where we participate with the other archival units at the Smithsonian, and we call that SIRIS; it stands for Smithsonian Institution Research Information System. And then we've been involved with the Natural History Museum, their botany department, in a CLIR-funded project to expose hidden field book collections, various scientists' field books that they've collected over the past 150 years that haven't been accessible on line because there was no cataloguing repository for that. The field book project is doing that through the Smithsonian Collections Search Center, and we have 7,000-plus titles currently cataloged there. And so that is again another slice on the material that we have. Here's where we begin to see the rub. So it's great to have lots of different ways to see the material. On the back end, we have a number of different repositories that we have to deal with for our digital assets. Electronic records is big and green; that's because that's the largest set of born-digital material that we have. We have a little over a quarter of a million born-digital records at this point in time.
And we typically take in a little bit over 40,000 a year. And so every year the archives adds to its collections overall by about 300 -- 350 accessions, and a good third of those at this point in time contain born-digital media. Digitized collection material, our history on that has been, as people need it we'll do what they're asking for. Right? So that might be a photocopy of a page. Then we realized, oh, you do that over and over and over again, it's going to have an impact on that material. So we started making better choices about that. And again, it was very on-demand driven, and not a whole lot of critical mass to it. We've since switched to a more project-driven approach, and then we take those reference requests to trigger a focus on a given project. And that is overlapping with the Smithsonian central digital asset management system, which is based on Artesia, and here's where you get into the internal politics. Because so much of going from dark to light is about negotiating relationships. And every organization is -- this is such an organic process, but as the Smithsonian DAMS got started it was funded with discretionary funds from four different units, and their metadata model didn't really exist, as everybody had their fields and they kind of were all in a flat model. And as there became more substance and support to that, to make that rigorous, to integrate that into what was there, they started limiting what went in. So right now the centralized DAMS supports images and audio and video. So it's a limited set of formats. Now there are great reasons for that in terms of building a system and getting it out in a federated environment like the Smithsonian, but that impacts our -- the choices we have to make with regards to using that system. Our digitized collection material is largely image, audio, and video. So that works; that's why you see the overlap there. Electronic records, our approach to that is an archival information package, and I'm now going to refer to them as AIPs, because that's so much easier. We take a whole accession and treat that as an archival information package. If I have to then pull that apart to be able to put the images and the audio and the video in one place, and leave the CAD and the databases, and put textual documents and the web sites and social media in a different repository, that just becomes a nightmare to manage. And so what we've done has been working with our digitized material and storing that in the DAMS. Not 100%, but we're working towards that, because that was kind of late to the game. The electronic records have been largely dark. And now we'll move on, as we begin to solve this problem. As we're dealing with all of this, you have context now. We found that process and ownership, as we try and drive this to the light, is, bluntly said, a turf war, which if you want to throw a positive spin on it is a great way to identify where the gaps are. Okay? And then bring those parties together to help resolve those gaps. So technically, the issues really come into what do we call it? I want to call it this, call it that. Do we stick with the file names that they came in with? If we don't, is there sufficient justification for doing that, how do we document that, what does that do to all the things we care about in terms of authenticity and completeness, et cetera. Or do we do that with access masters that we create for this and work from there. Storage and disaster recovery then get encompassed. So what we have found ourselves quickly pushed into is this idea of digital curation.
So we're not just dealing with our born-digital material but the digitized material, because we want to keep that for a long period of time, and we need to apply the same sensibilities to the care of that material. Which is great, and the dark info system, the MySQL database that Lynda referred to, is a way of providing that curatorial management for both sets of material. We've also kind of organized our groups inside the Institution Archives so that the digitization team, the electronic records program, and then the web and new media outreach team all fall under digital services, which is really, really good. And makes coordination a lot easier. The acquisitions team are the folks who actually do the cataloguing, they do A and D for us, and they are the owners of the finding aids; they also produce MARC records, and the cataloguing group also produces MODS records for the field book registry. So this is part of the dialogue. We had a conversation earlier today about, as we begin to push out enhancing our finding aids, how do we communicate back and forth, and who gets to make the decisions about how those finding aids are annotated. The important thing is those conversations are being had, and we're keeping that final decision and ownership in one place. So we went for low hanging fruit. Here's an example of our finding aid, sliced and diced. So you see what the header is, and it's kind of faint here, but you'll see underneath the folder names that we've been able to use the digital archival object tag, the DAO tag in EAD, to put that information and access object right there into the finding aid. So as they're going through they don't have to go somewhere else to get to it; they can pop it up, look at that really closely, really closely. These are actually watercolor illustrations of Egyptian -- they're ground [Inaudible] or beef flies. I'm not the scientist so I don't know. But these are hand drawn watercolor illustrations, an amazing amount of detail. And this is what we can deliver off the web. So yeah, it's fun. We also are doing audio, and here is one of our secretaries, who also did a lot of work in Panama collecting. And so we have some of his oral history clips on here, as well as PDFs of the transcripts of that material, and then we also have video. Again, all embedded so they can experience this right in the context of the finding aid. And so as we've now got that capability, the next thing is how do we push out, in a massive push, the digital material that we have in [Inaudible]; this is where we get into ideas of working with Power Point 2010 and expanding it, Power Point 2007. Low hanging fruit is great, but we can't stop there. And that's, I think in terms of organizational dynamics, the hardest thing to do: to think in a more abstract sense, further out on the horizon. It's very easy to see, you know, to the end of the block. But that approach has been a limitation for us in terms of dealing with scale. And you know, a given accession from the Smithsonian Channel, which is the group that puts together the programs that appear on the Smithsonian Channel on your cable, a given accession from them may have over 200 proposed DVD programs. So that's a lot of tape to take in, and that's just one of 150 accessions during the course of the year. How do we push that out, how do we manage that.
So this is why I'm saying here, it requires a sustained commitment and a tolerance for a level of abstraction that you haven't had before; you're looking a little bit further down the road. Research is key, and that's what we found as we have worked with folks coming in from various programs, including George Mason and the University of Michigan and GW and UMD, as we tackle the issues of what is behind the issues with CAD, you know, and it turns out that the latest version of CAD actually uses layers, and you have to have multiple files in order to have the complete CAD drawing. We have to dig into that. Doing an assessment of all the born-digital video and figuring out whether or not we have all the codecs, right? That becomes part of what we do, and that's the research that has to be done: what are the tools available to do that. And it means that we also have to play nice with our service providers. You saw the picture about the centralized DAMS; it's really easy to focus on what they can't do for us or aren't willing to do for us. But that's not productive on either side, so we really have to look at the positives and look at how to help them accomplish their agenda so that they will help us accomplish ours. Building for engagement. Clearly, the way that we're approaching our finding aids is all about taking that information and making it accessible. But we're also pushing out in terms of engaging communities through Facebook, through Flickr; we're also involved in something called Historypin, if you've experienced that. But it's a way of getting people involved. Our blog pulls from the same source and provides the same metadata along with this material, and that also links back to the finding aids or other resources on our web site. Again, this is cycle around, cycle around, cycle around. The next place that we are headed with this is crowdsourcing. So those scientific field books with the wonderful handwriting that nobody knows how to read, because we're not taught that in school any more, have to be transcribed, but there are communities of interest. We don't have the staff to transcribe those, we don't have the money to hire people to transcribe those, but we have communities who are very interested in a particular topic. We can tell them that it's -- you guys heard about the New York Public Library menus thing? That's a different thing, that's general interest; collecting field books for birds in South America is a small group, but deeper. So how do we get to them? Virtual reading rooms, and applying digital humanities tools to doing research with this material now that it's digitized, in what I call a semi-permeable form. People are able to get past the surface. E-mail preservation is kind of where I'm going to wrap up here. And the issue with this is we -- we did a -- what I feel is a stellar job -- of course I have a stake in that, disclaimer -- figuring out how to capture a whole account in a single XML file, leaving you the choice to embed your attachments or not, capturing the structure of how that user organized their material and all the nested threads. But being able to deliver that to a researcher without requiring them to have a fairly good grasp of XML and be able to read raw XML is a place where we fall short. And the funding ended at the point of being able to preserve it. The next step is providing that front end that gives a user friendly view. And something meaningful.
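As a rough illustration of the "whole account in a single XML file" idea, here is a sketch that serializes an mbox account to XML. It is hypothetical: the actual SIA/Rockefeller preservation schema has its own element names, structure, and attachment handling, none of which are specified in the talk.

```python
import mailbox
import xml.etree.ElementTree as ET

def account_to_xml(mbox_path: str, out_path: str) -> None:
    """Serialize one e-mail account (an mbox file) into a single XML file."""
    account = ET.Element("account", source=mbox_path)
    for msg in mailbox.mbox(mbox_path):
        m = ET.SubElement(account, "message")
        for header in ("From", "To", "Date", "Subject"):
            ET.SubElement(m, header.lower()).text = msg.get(header, "")
        body = msg.get_payload()
        if isinstance(body, str):  # this toy skips multipart and attachments
            ET.SubElement(m, "body").text = body
    ET.ElementTree(account).write(out_path, encoding="utf-8",
                                  xml_declaration=True)
```

The point of such a format is exactly what Rick describes: the whole account, including its folder structure and threads in a fuller schema, lives in one self-describing file, but reading it still demands some comfort with raw XML.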
Obviously, when we have a CIO who has departed and left us 15 gig of e-mail, a quarter of a million messages, browsing through that is not an effective way to approach that material. So a couple of things that we've seen and explored and are very excited about is MUSE, and the next generation of MUSE is ePADD. We involved Peter Chan from Stanford with that recently, as we are kind of working with them on this next stage. And we pulled in our chief reference archivist to go through this demo that we had done for us. And she has been very hesitant to encourage people to look at e-mail as part of a person's correspondence, just because it's so difficult to approach. And as she saw the different ways that these tools let you visualize trends, let you modify a lexicon with which to approach this, her eyes just lit up; she was very excited about being able to now offer this to researchers as they dig into this material. So things are coming; it's a concentrated push, you just have to keep pushing and pushing and pushing. But at the end of the day you've got to start with facts. Not a gut feeling, not hearsay. So we began in 2003 to do physical inventories of all of our digital media and what was on it, and Leslie, yell at me if I'm over. >> You're about to go over. >> Okay. Last year we did the first phase of a [Inaudible] institutional born-digital collections holdings survey, where we looked at seven different archival units. We've had at the archives a ten year program at this point in time, come September, focusing specifically and formally on born-digital material. The other ones had not. So we had a good grasp of what we had and the condition it was in. They didn't know where it was, they didn't know what it was, they didn't know what condition it was in. However, as we went through this and actually did find the bulk of everything, almost everybody -- the predominant type of media was optical discs. The only other thing that really kind of popped up was five and a quarter inch diskettes, which largely were us and the National Anthropological Archives, because of how they get their material. And in reasonably good condition. So at that point in time, Leslie is cutting me off, and I respect Leslie and don't ever want to make Leslie mad. So -- >> Wow. I didn't know I had that much power. >> No -- she does, she really does. And for a good reason. Is there time for Q and A? Questions? >> Can you say more about what you envision as a [Inaudible] -- >> Our goal behind a virtual reading room is being able to have available environments where we can pull in digitized and born-digital material for a researcher to work on, to build their own data projects and products, but not be able to take our records out of that environment. Right? So you want them to be able to take their data products with them, but not to be able to take our material out. And to do it in a way where it's timed, right? So we -- we put a -- a keyed access to it, and we give them the key and it times out in a period of time. Unless they ask for it to be renewed, all right? So when that key expires, that environment gets wiped away. But the drag and drop capability for reference staff to be able to quickly set those up is really key. That environment really is great if we can allow the researchers to bring their tools to it so that they can really do things, like doing natural language analysis of that material and so on. Great. Thanks.
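That timed, keyed access can be pictured with a small sketch. This is purely conceptual; how the Smithsonian would actually provision and wipe a research environment is not described in the talk, and the two-week lifetime is an arbitrary placeholder.

```python
import secrets
import time

KEY_LIFETIME = 14 * 24 * 3600  # say, a two-week research session
sessions = {}  # key -> expiry timestamp

def issue_key() -> str:
    """Grant a researcher a key to a freshly provisioned environment."""
    key = secrets.token_urlsafe(16)
    sessions[key] = time.time() + KEY_LIFETIME
    return key

def renew_key(key: str) -> bool:
    """Extend a session if the researcher asks before it expires."""
    if key in sessions:
        sessions[key] = time.time() + KEY_LIFETIME
        return True
    return False

def check_key(key: str) -> bool:
    """Valid until expiry; after that the environment would be wiped."""
    if key not in sessions or time.time() > sessions[key]:
        sessions.pop(key, None)  # stand-in for tearing the environment down
        return False
    return True
```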
[ Applause ] >> I need to talk to you about that virtual reading room system, because I think that's something I could use. I am very pleased to introduce Bertram Lyons. Bert is a colleague here at the Library of Congress who is a certified archivist working in our American Folklife Center, working with our digital assets in the Folklife Center, and I'm going to cut short my introduction because I want to give Bert time to talk. [ Background noise ] >> Bertram Lyons: Thanks. Thanks for having me. I'm very excited to be here to talk about one of the American Folklife Center's oldest born-digital collections, which seems to get a lot of air play in the public from StoryCorps and through NPR, but I don't think it gets a lot of air play in the archives world in terms of what the collection is, what it looks like. But I'm going to focus specifically on how we acquire the collection and what steps we are taking to make sure the collection survives into the future as a valuable resource. First, however, I'll talk a little bit about the OAIS reference model, which -- thanks, Rick, for going ahead and talking about that first, because I never like to be the first person to bring up [Inaudible] -- [laughter] this particular model, as I think many people know, is not something that's a -- necessarily a guided workflow, but it does kind of talk about functions. And it's really important for a lot of the terminology that I'm going to use today and that Rick also broke -- broke the seal on. But in the second half of the presentation I'll use the StoryCorps collection as an example of this model in action. I'm going to give, first of all, because I'm an archivist and I like context, I am going to give two points of context before I jump into this. First, keep in mind the StoryCorps collection is only one of 3,000 collections in the American Folklife Center. The American Folklife Center archive has been around since the 1920's, it's home to more than 3 million photographs, manuscripts, sound recordings, moving images, and all of this documentation is of traditional and expressive culture from around the globe, from the earliest field recordings made in the 1890's on wax cylinders to contemporary digital oral histories made just yesterday for the Civil Rights History Project. Also I should say the Folklife Center itself is only one of many collecting divisions within the Library of Congress. I'm speaking here today for the Folklife Center. Second point of context, I want to note that the Folklife Center's collection management program is an emerging one, and this echoes some of what Rick was saying a second ago. Although we've been creating digital collections since 2002, and acquiring them, it really wasn't until 2009 that we began to develop systematic processes for the acquisition -- for the creation and the preservation of digital content, to align those processes with those throughout the Library of Congress as well as those throughout the general cultural heritage community. To give you a sense of scope, last year the Folklife Center acquired 77,627 digital objects for our collections. 60,720 were born-digital acquisitions, which we can consider unique collection material, and let's see here, we've also created 10,000-plus preservation masters of original analog or other digital objects, and about 7,000 just general reference copies, made available on line or for research in our reading room.
Of these assets, real quickly, just to also kind of give you the scope, you can see here just under 35,000 are manuscripts, 24,000 still images, 17,000 sound recordings, and about 2,000 moving images. So as is the case with many cultural heritage institutions, we're getting better at the Folklife Center at acquiring digital content, at being able to tell you what we acquired, when we got it, where we put it, whether it's undergone any changes. Because at the heart of our concerns, I would say, being an organization that holds collections in the public trust, we want to be able to say 10 years or 100 years from now: here's the resource that we acquired, and it hasn't changed over the years, or if it has changed, this is how it's changed and this is why it's changed. All right. Wrong direction. Our general workflows are based here in the OAIS model. Which everybody loves. These -- these circles here, the SIP, the AIP, the DIP, are really some of the -- for me, some of the most important parts of the model that we take away in our practice. Looking at how digital content is coming to us, what's happening when it gets here, and how it's being used to be made available to others. The SIP, the submission information package, represents for us content as it arrives, as we prepare to move it into the library's environment. The AIP, the archival information package, is the way that we talk about representing content as it's stabilized and as it's documented and organized on library servers for long term storage. The DIP is something we're not quite using yet, but we see ourselves moving into that particular area; it represents how content is then prepared for access in the future. All of these different instances, and I like the way Rick mentioned this because this is how we've been building our systems, are representations of content in some state. They're not necessarily one to one relationships; one SIP doesn't have to equal one AIP, which doesn't have to equal one DIP, and we do exactly the same. We see the AIP, the archival information package, as all the content related to one given collection in our collections, and multiple SIPs can come in to create that one AIP in some iterative process. I'll talk about that here shortly. These six boxes -- ingest, data management, access, archival storage, administration, preservation planning -- these are about functions, and I'm going to use these real quickly to talk a little bit about the Folklife Center's work, though, before I look at StoryCorps. At ingest, what we mean by ingest and what we think about with ingest is the act of acquisition and the act of accepting physical control over the content. When we can, we work with donors to provide checksums and manifests for the files upon delivery. And when we can't, we create those kinds of security measures, those kinds of inventories, for the content upon receipt here at the Folklife Center. It's during this ingest phase, this ingest function, that we're preparing a SIP for storage as an AIP and for future access. Data management is something that is really at the heart of our work in terms of knowing what we have, knowing what we're doing with what we have. It gives us these abilities to understand where we're moving, where we're putting content on the library servers, what we may need to do with it in the future. At this point we collect information about objects at a file level and at an aggregate level.
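A minimal sketch of that file-level plus aggregate-level pass, assuming a simple directory-per-SIP layout; the Folklife Center's actual inventory database and its fields are not specified in the talk:

```python
import mimetypes
from collections import Counter
from pathlib import Path

def inventory(sip_dir: Path):
    """One pass over a SIP: item-level records plus aggregate totals."""
    items, totals = [], Counter()
    for p in sorted(sip_dir.rglob("*")):
        if not p.is_file():
            continue
        mime = mimetypes.guess_type(p.name)[0] or "application/octet-stream"
        size = p.stat().st_size
        items.append({"path": str(p.relative_to(sip_dir)),
                      "bytes": size, "mime": mime})
        totals[mime] += size  # aggregate view: bytes per format
    return items, totals
```

Note that a pass like this captures only what can be extracted automatically; as the talk goes on to say, some characterization still takes a human.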
As you can see, in order to be able to generate the stats I was just showing, which [Inaudible] holdings on arrival at item level, we extract a minimal amount of information from each of the objects. We're also at this point able to apply characterization to the files that only a human could provide. Because not every bit of information that you might want to keep about a file, and I'm not talking about descriptive information, can be extracted from it. For instance, questions like: is this an original, unique, born-digital object, or is it a digital representation of some other digital object, or is it a digital representation of some analog object? It's often necessary to have some kind of human interaction to get this kind of information. I'm sure we've all experienced seeing hundreds, thousands of files without any way of understanding the difference, what was the source of a file; I mean, you can imagine over years it's going to be very difficult for us to understand these. Anyway, we work really hard to do this. After we generate item level information in this data management stage, we're also documenting the aggregates, these SIPs, in our own Folklife Center digital inventory database so we can track each SIP as we move it into the library environment. And lucky for us, we also benefit from our colleagues here at the library who build larger inventory systems. One is specifically called the Content Transfer Services system, which has been a huge help to us here in the collecting divisions. It's kind of a misnomer, the Content Transfer Services, at this point. Services is really important. But this is a user interface to a system here at the library that helps us all kind of track, as it says, the inventory of our collections, what do we have, what are we doing, and helps us move the collections around on the servers and perform certain services, certain functions, certain things like characterization with [Inaudible] other kinds of analysis that we might want to use. All right, I'm boring myself right now. Sorry. I'm much better with questions. We can just jump to questions here. Try to go real quickly here. Archival storage relies entirely on the knowledge of our [Inaudible]; again, we're at the Library of Congress, and myself, I don't have any control over the IT infrastructure here. But I do get to work with colleagues here at the library and learn from them, and make use of their expertise. So we still have all of our digital collections on tape, linear tape, at the library -- not everybody, but we store what we consider our archival collections on archival tape. Which is both redundant and distributed. I'm going to quickly run through here. Administration, you know what that is. And I would say that our weakest links are preservation planning and access -- not here at the Library of Congress, but in the Folklife Center; these are things we're working on. We feel really strong about our ability to ingest content, to manage the information about that content, know where it's stored, work with it, all of that. We are now at the point where we're looking forward to preservation planning and format obsolescence, and moving even further into thinking about what's the best way to make our collections accessible. We're really good at giving you access one at a time; if you come into the reading room, we can get you to the digital content that we have.
But I want to -- we're starting to push further, where we think about data sets that we have that we can make available to researchers, or instead of 1 or 2 or 15 objects, how about a batch of 15,000 objects, how do we get to that point. That's what we're thinking of now at the Folklife Center. All right, so StoryCorps. I promised to talk about this. I'm going to give you a quick overview. For those who don't know, this project is a national project actually based in Brooklyn, New York. It was created by Dave Isay in 2003, and it's modelled actually on the Works Progress Administration of the 1930's, through which oral histories with everyday Americans across the country were recorded. Many of those collections are also here at the library. Each interview in this collection is usually around 40 minutes, the interviews are conducted usually by friends and family members of the interviewees, and they're almost always accompanied by photographs and manuscripts. I'll show you a little bit about that here in a second. StoryCorps does provide some access to these collections on line through their web site, but usually it's a one at a time kind of thing and it's a more curated, specially chosen collection. Specially chosen interviews that they want to highlight for any given moment. We've had a relationship with StoryCorps since 2004 at the American Folklife Center, and we've been collecting the content from StoryCorps every year since then. Early on in the relationship StoryCorps would send CDs to us, they would send Microsoft Access databases, they would send hard drives. But they would usually only do this once a year, and they would batch all that up into one big package and send it to us, and we would work our way through it. Over the past three years, though, we've been working with them to try to make those deliveries smaller by making those deliveries more frequent -- I suppose. Which has been good for us, actually. And good for them in terms of controlling and being more accurate with our work. To give you a quick overview of the collection itself: there's 389,114 files currently in this collection, those digital files represent about 46,134 interviews, and the collection itself is about 16 terabytes in size. If you break it down by resource you can see that manuscripts by far are the biggest group. 202,000-plus manuscripts in the collection. See, sound recordings are only about 47,000. One measly moving image. No idea where that came from. But it's there. [Laughter] But it's kind of an important point that I'm starting to realize, that if you were to look at this by file size you see a very different picture. The sound recordings are making up the clear majority of bits in this collection. Those 200,000-plus manuscripts seem like nothing. And that's really important to think about in terms of collection management; I mean, this is a big -- big issue. Working with audio-visual collections is really, as we all talk about, how scared we are of born-digital video, and things like this. It's really a lot because of the size. If we just had a lot of different file formats the size of a manuscript it wouldn't be so scary, I don't think, to us. Those are gigabytes. Yes, sir. Dr. [Inaudible] -- [laughter] All right, so what does the digital stewardship look like for the StoryCorps collection here at the Folklife Center? It's created entirely by StoryCorps employees, and it's only made up of unique interviews.
This is not a traditional, typical kind of archive collection; this is very much a -- a hand created collection by the StoryCorps organization. Each interview itself usually consists of about one sound recording, a few still images, and manuscripts. For instance, here is a quick rundown of all the items related to one given interview. You might see them on the server. Rendered for human consumption, this is about what you'd get for a given StoryCorps interview. The documentation that you see there, we usually get three types. At the top you have the logs, in the middle there, the release forms. And then a demographic survey for each of the participants in the interview. Every two or three months at this point, StoryCorps employees aggregate those interviews into SIPs of less than 100 gigabytes each, and deliver those to the AFC via hard drives. We usually get about 10 to 15 SIPs on a given hard drive. We receive those hard drives, check the manifests, verify all the files. Sometimes, as many, many people have encountered, different file systems, like Mac file systems versus PC file systems, can create or inject certain problems with text, which will obviously make it impossible to verify the manifests in these different SIPs. Anyway, we do a little bit of work here processing as content shows up. Afterwards, we extract all the item level metadata that we need, create aggregate level metadata, and prepare this for ingest. We take the drives down to the ITS division here in the Madison building, actually, and plug them into some special computers, at which point we can copy the content to staging locations. We then use the CTS database, the library central inventory system, to move the collections to where we want them to go. Every collection that we have has a final resting home on a particular tape storage related to that particular collection. And as those SIPs land there in that particular directory, we unpack them to create the AIPs for the given collection, at which point, well, if you remember correctly, there was the AIP I was talking about, and we have the AIPs added to the [Inaudible] inventory as a whole. So there's the big process that we do, and let me move on a little bit here.
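Since these SIPs arrive as bags (as Bert notes in the Q&A below, StoryCorps delivers them in roughly 100-gigabyte bags), the receive-and-verify step can be sketched with the Library of Congress's own bagit-python package. Error handling here is simplified, and the Mac-versus-PC filename-encoding problems he mentions would surface as validation errors:

```python
import bagit  # the Library of Congress bagit-python package

def receive_sip(bag_path: str) -> bool:
    """Re-verify a delivered bag: payload files must match manifest checksums."""
    bag = bagit.Bag(bag_path)
    try:
        bag.validate()  # recomputes and compares every payload checksum
        return True
    except bagit.BagError as err:
        # filename encoding mismatches between file systems often show up here
        print(f"Bag failed validation: {err}")
        return False
```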
I mentioned in the title that this is a holistic process, and what I was thinking about when I wrote that was that early in our relationship with StoryCorps, they were giving us databases, Microsoft Access databases of all the different [Inaudible] around the country were sending in to them. And we were for years, every year, trying to recreate that database and use it internally in our reading room. But the minute we got that database recreated, it was already out of date. StoryCorps is continually -- it's a living, breathing institution. They're creating new interviews every day. They're updating their own databases every day, so what we've been trying to do these days is to think a little bit more creatively about our relationship with them and to think about what it might mean to be the archive for a living organization. And in that sense, we've now changed our -- changed our strategy with them and -- to the -- I'm stuttering. Sorry. Changed our strategy in working with StoryCorps to allow them to do their work and for us not to do redundant work in order to provide access for our own researchers. So StoryCorps brought all of their databases into one central location now, and instead of us mounting our own database, we take exports every three months from their database, as well as the code that supports this [Inaudible] web site, to serve now as a dark archive for their own services here. In return, they let our archive staff and our reference staff access their databases, which are web accessible and password protected, in order to get -- provide access or do research within the database that our staff [Inaudible] and for us that's become a pretty -- taken some of the load off of our [Inaudible] also allowed us to serve as a -- as more of a living archive for our StoryCorps colleagues. So I'll end by saying that by using these preservation and collection management strategies, with the holistic approach to collections development, we hope that we can provide long term access to the interviews created by StoryCorps for current and future research needs. And we also really invite you all to come visit and explore the collection. It's open to the public, you're welcome to use it whenever. Thanks. [Applause] And if anybody has any questions I'd be happy to answer them. Yes, Deb? [ Inaudible audience comment ] >> Yes. You mean the files -- the file -- it's definitely 100% consistent. Yes? [ Inaudible audience comment ] >> No, these are the raw interviews. Sometimes -- actually, some of them we do get a couple of takes if there is a bad take. This is -- this is everything. And when interviews are redacted from the collection, even if they've already been here, we move them out. But often StoryCorps keeps everything and then they just note in their data that the content was redacted by the interviewee or the interviewer or something. Sorry. No, go ahead. [ Inaudible audience comment ] >> From them to us? They put them on -- usually terabyte drives. I -- on average, every three months I get two terabyte drives from them, and those might have 10 to 15 bags on them of content; they bag them in little 100 gigabyte size bags, which makes it easier for me to put those bags into the system. Instead of having to work with a terabyte at a time, I work with chunks. Yes? [ Inaudible audience comment ] >> You see that interview I.D. on the far left here? The DDO 139 FFB 100. That interview I.D. is there, is standard for StoryCorps. Those first three letters represent one of their initiatives, and then the number is a one-up number. Every file that shows up in the system, every file that shows up with us, has these -- these I.D.s on them. That's really the connector to everything. So I can go and extract all of those I.D.s out of the file, out of these files, and at any time I can go and extract all of the data out of their database and make those connections, make a join based on any of those -- okay -- unique [Inaudible] yeah. Does that answer your question at all? Yeah, that's a big one. Well, thank you very much. [Applause] >> Thank you very much, Bert. Our final speaker tonight is Kari Kraus, who is an assistant professor in the College of Information Studies and the Department of English at the University of Maryland. Kari is very well known to us, having been the co-PI on a grant for Preserving Virtual Worlds, PI on a digital humanities internship grant from IMLS, and pretty much everything else we're interested in at the Library of Congress. So I will cut that short to give Kari time to talk. [ Background noise ] >> Kari Kraus: So thanks for having me tonight, and especially thanks to my [Inaudible] friends.
I'm really thrilled to be here. So as Leslie mentioned, I have this unique position where I cross the sciences and the humanities; I have a joint appointment in the College of Information Studies and in the English department. So I approach born-digital from at least two perspectives. That is, I'm a humanities researcher interested in studying and interpreting and creating creative and expressive content, but I'm also a digital preservation researcher, and I help preserve different kinds of works, particularly video games and virtual 3D worlds. So I decided that tonight one of the things I will do is first give you a sense of several current projects that illustrate that dual perspective. And then I'm going to spend the bulk of my time, about 15 minutes, covering some really great interview data that we collected as part of Preserving Virtual Worlds 2, which was funded by the Institute of Museum and Library Services; we just wrapped up that project. Current projects: the first one was just funded. We got some seed money for a grant called Exploring Invisible Traces in Historic Recordings. This is with my colleagues Min Wu, in electrical engineering, and Doug Oard in the information school. We're building on Min Wu's research where she's been able -- it sounds like CSI stuff but it's really not -- she's been able to extract native fingerprints, or native metadata, from audio and video recordings. What I mean by that is that there are natural fluctuations in the alternating current of the electrical grid that our recording devices automatically pick up and embed in our recordings subliminally, without us noticing them. But with the right tools you can extract them and use those native signatures to geocode and timecode files. So you can imagine there are all kinds of forensic and legal applications for this, around authenticating when audio and video files were actually made. We're interested in other things too, for example: how far back can we go in time to actually extract these signals -- can we go back, if not to the earliest cylinder recordings, which were produced in a mechanical process, then not far after that? What happens as files are reformatted from one format to another, from analog to digital? And then we're using a case study around some of President Kennedy's recordings from around the time of the Cuban missile crisis. We don't exactly know when some of those recordings were made; we just have an approximate time period. And so we're going to see if we can nail some of that down. A second project that I thought I would mention in passing is with another iSchool colleague. We started crawling congressional web sites a few months ago using Archive-It, the Internet Archive's web crawling service, and with my colleague we want to develop some infrastructure for web archiving that will allow us to do more with WARC files and text mine them. I decided to mention this one particular challenge because, as we started doing this, there was, as some of you might have seen, a press release that came out around the Senate issuing some new restrictions on the NSF, the National Science Foundation, funding political science projects. In fact, the new restrictions are such that the NSF can't fund them at all unless they relate directly to national security or the economy. The idea, of course, is that Congress does not want to be in the research spotlight, imagine that. And so we are currently at a standstill here.
We don't really know how to proceed because of the chilling environment, and we had actually hoped to go after some NSF funding for this. Then, I teach in English a course called Book 2.0: The History of the Book and the Future of Reading. And one of the things we've been doing a lot in that class is thinking about the future of the book through a lot of hands-on engagement and experimentation. So one of the things the students have been doing is embedding microcontrollers in physical books. Microcontrollers are these very small computers, often two or three inches in diameter, that can endow the analog world, the physical world, with interactive properties. And so what intrigues me about these books -- here's one by one of my students, [Inaudible] it's an artist's book, an altered book. She's sort of retelling Edward Lear's classic nonsense poem "The Owl and the Pussy Cat" with a lot of illustrations and watercolors, and she's embedded, as I said, microcontrollers in there to make some of the illustrations light up. So in this case, you see the microcontroller; she's actually created a recess in the book to conceal it. And then you see on one side of the page some paper that is kind of translucent; if you were to flip the page back, there's an illustration on the other side, and it lights up and does various things with the LEDs that the microcontroller is controlling. So this is an example of an artifact that is neither born-digital nor digitized, nor completely paper-based. And I think this represents an interesting new domain for digital preservation research. And then finally, I also work in the area of alternate reality games; this was funded by the National Science Foundation to develop an alternate reality game for K through 12 that we implemented in an Indianapolis middle school in 2011. An alternate reality game is not a video game, and it's not augmented reality. It's the kind of game that's told and played out across multiple media. So a player might, for example, purchase a book from Amazon, and that book can be read and enjoyed just like any other novel. But there will also be, for example, maybe an evidence pack in the book that contains photographs, newspaper clippings, and then phone numbers to call and web sites to visit. And in order to get the full experience of the game you've got to branch out to all of these media and piece together all of the clues and help advance the story line. So as part of our own game we ended up creating a ton of game-based artifacts -- this is very typical for alternate reality games -- that are both analog and digital, and that also, again, contain a lot of microcontrollers. So there's a lot of the physical computing aspect that I just showed with the bibliocircuitry. Some of this is being exhibited in a couple of months at the Games+Learning+Society conference. And it's just another domain, this other sort of hybrid space, where you've got a mix of analog, digital, and then these interesting artifacts that are neither analog nor digital, but both. So moving on to Preserving Virtual Worlds 2: this was a successor project to PVW1, which as Leslie mentioned was funded by [Inaudible]. This project we just wrapped up; it was funded by the Institute of Museum and Library Services.
It was a mix of basic and applied research, but primarily basic research. Like PVW 1, we adopted a case set approach. We had a total of seven games in our case set. They were all nominally education-based games, although we really pushed the limit on what we mean by educational. So there are some classics and some famous ones like Oregon Trail and Super Mario Brothers, but we also snuck Doom in there, which is of course the classic first-person shooter, and I'm not even sure we tried to make an educational argument around that one. And then my favorite, up in the upper right corner, Typing of the Dead, which was based on an earlier arcade game called House of the Dead 2; essentially the plot has you adopting the role of a secret agent running around in an urban environment trying to kill zombies. In the education-based version, Typing of the Dead, they reimplemented it as an educational game that taught you how to type. So instead of shooting the zombies with a gun, you are typing, as quickly and as accurately as you can, the words that appear above their heads. And as you can see, the secret agents no longer wield guns; they have keyboards strapped to them and laptop computers on their backs. So this was our case set. I won't go into detail about all of the deliverables; I'll just highlight maybe a few things. A big interest of ours was studying and researching the significant properties of video games. Significant properties has emerged as a key term in the archival literature, and it's fairly self-descriptive: it refers to those properties of digital objects that are particularly salient or consequential. And the idea is that if you can identify those significant properties of your objects, you can adopt preservation strategies that will ensure those properties survive over time, even if consequently you sacrifice others. That's the theory. In practice, I have to say, I don't think there has been as much uptake of this term as there has been in the research literature, and that's one of the problems, actually. I thought I'd also point to a really interesting project that was spearheaded by my colleague Henry Lowood at Stanford University; it's the last bullet point there: methods for auditing the integrity of game software that involve testing the software's ability to play back demo, save-state, or replay files. So taking Doom as an example, Henry was interested in the role of what are called demo files or lump files -- lump refers to their extension, the .LMP extension -- and how they might be used to, again, audit the integrity of various kinds of video games, and particularly Doom. These are very interesting files: they're recordings of a game, a single game play session. Now, that's not a video recording. It is instead a kind of notation of game play that is both machine readable and, for the expert, human readable. And to replay a demo file you've got to use the same version of the game software that was used to create it in the first place. And so if you can successfully replay a demo file at some point in the future, then you know that that version of the game software has been successfully preserved.
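A rough sketch of how such an audit might be automated, assuming a vanilla-compatible Doom engine that accepts a -timedemo flag and reports how many gametics it replayed. The engine path, IWAD, demo filename, and expected tic count are placeholders, not details from the talk:

```python
import re
import subprocess

ENGINE = "./chocolate-doom"   # hypothetical path to a vintage-accurate engine build
IWAD = "doom.wad"             # hypothetical game data file archived with the software
EXPECTED_GAMETICS = 12345     # hypothetical length recorded when the demo was archived

def audit_demo(demo: str) -> bool:
    """Replay an archived .lmp demo and check that it runs to its recorded length."""
    # Vanilla-style engines accept -timedemo, which replays a demo as fast as
    # possible and reports a line like "timed N gametics in M realtics" on exit.
    result = subprocess.run(
        [ENGINE, "-iwad", IWAD, "-timedemo", demo],
        capture_output=True, text=True, timeout=300,
    )
    match = re.search(r"timed (\d+) gametics", result.stdout + result.stderr)
    # A missing report, or a run that ends early, suggests the replay desynced
    # from the recording, i.e., the software version is not the one that made it.
    return bool(match) and int(match.group(1)) == EXPECTED_GAMETICS

if __name__ == "__main__":
    ok = audit_demo("archive_run.lmp")
    print("integrity OK" if ok else "possible desync or version mismatch")
```

The wrinkle the next passage describes is exactly what a harness like this has to guard against: under the wrong engine version the demo still plays, it just quietly diverges from the recording.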
Versioning is very important. So, for example, if you try to run a demo file in the wrong version of a game, in this case Doom, it's not that it won't replay -- it will -- but your player avatars will do things like run into walls or seem to shoot randomly rather than purposefully at targets, and so forth. I mentioned significant properties; these are two very influential definitions of significant properties in the research literature: those properties of digital objects that affect their quality, usability, rendering, and behavior; and then, similarly, essential characteristics that must be preserved over time in order to ensure the continued accessibility, usability, and meaning of the objects. So if you're dealing with something like video games, significant properties might be things like sound effects or character levels or dialogue or narrative or cut scenes. And the way in which I just isolated some is very typical: they tend to be understood and articulated as sort of discrete properties. As part of PVW2 we interviewed something like 23 game developers, creators, or others who had a significant role in the creation of the video games in our case set. These included people like Will Wright, creator of The Sims and Spore and a bunch of other games; Sid Meier, the creative mind behind the Civilization franchise; let's see, Derek Paxton, who is famous for making one of the best, most popular mods in the Civilization franchise's history -- a mod is a kind of user-based reimagining of the game; Claire Curtin; let's see what else do I want to highlight here; John Romero, who is one of the masterminds behind Doom; and Larry Bond, Matt's favorite, Matt Kirschenbaum's favorite, who was responsible for Harpoon, a naval warfare game. Larry Bond was also a co-author with Tom Clancy of "Red Storm Rising." I'm going to concentrate just on the interview data around the Civilization franchise, because I had a direct role in that; I was involved in almost all of the interviews. So the people we interviewed related to Civilization included Sid Meier, who co-created Civ 1 in 1991 with Bruce Shelley at a game company he founded called MicroProse, which has since been superseded by Firaxis Games, a Maryland-based game company. And we basically interviewed every significant developer in Civilization's game development history. So Bruce Shelley, who is responsible for Civ 1; Brian Reynolds, lead designer for Civ 2; Jeff Briggs, Civ 3; Soren Johnson, Civ 4; Jon Shafer, Civ 5; and then Derek Paxton, whom I just mentioned a moment ago, who was a player of the game who created, again, one of the most consequential and popular mods, called "Fall From Heaven." I've actually coded, using a social science analytics package called NVivo, the three interviews I'll be concentrating on. Those interviews are Brian Reynolds of Civ 2, Jeff Briggs of Civ 3, and Soren Johnson of Civ 4, and I'll be touching on some of the others a little bit. One of the ideas around this case set was working with games that exist in a series, where multiple versions of the game have been developed for multiple platforms: Civilization was developed for the PC, for the Mac, and for various game consoles, and it often existed in sequel form.
Part of the original idea behind that was that we were wondering what game developers did to manage the tension between continuity and change in a game franchise, and what that might tell us about significant properties. In other words, are the most significant properties of the games those that remain intact as you develop new versions of the game? And so we thought we could base a series of interviews around some of that. All of our games have very complex genealogies, and Civ certainly falls in that category. And I should say that when I say genealogy, I'm not just talking about games that have the Civ name in their title; it's much more complicated than that. I tried to create a little network -- it's not really a genealogy, it's not so much a branching structure of modification and inheritance as a kind of jungle or rhizome -- but just to point out a few things: Civilization the video game was based in part on a pre-existing board game called Civilization put out by Avalon Hill. It was also based in part on another video game called Empire that pre-dated it. For those who don't know anything about the game, Civilization is a turn-based strategy game created in 1991 where you build an empire that will withstand the test of time, starting in 4000 BC and moving forward in time up to the near future. It's competitive -- you're competing against other civilizations -- and you can adopt diplomatic strategies, more martial strategies, and so forth. Civ 1 in turn spawned, or helped inspire, something called Rise of Nations, which was an attempt at another game company, by one of the original Civ designers, to reimagine Civ not as a turn-based strategy game but as a real-time strategy game. Civ 2 also in part inspired a game called Alpha Centauri, which was put out by Electronic Arts, not by Firaxis; but interestingly, several of the Civ designers we interviewed claim that Civ 3, which was a Firaxis game, reused part of the code base of Alpha Centauri. So you have code bases migrating across game companies in interesting ways. These were just some of the interview themes; there were something like 64 of them when I coded the interviews. They include things like different game developers, the names of significant players and significant creators, various concepts that emerged, and technical terms. And I don't know why this slide shows up like this, so dark. But it's just to demonstrate that one of the most prevalent themes in all of the interviews was this one of continuity and change. The game developers had thought extensively about this problem: how do you honor the core characteristics of the original game while continuing to let it evolve? They all had different answers to the question of what are the most significant properties of the game. So I'll give you sort of the primary answer from each of the people that we interviewed. Sid Meier said it was the turn-based nature of the game. Jeff Briggs, on Civ 3, said it was the technology tree, which is this tree of innovations that you develop over the course of your civilization: things like writing systems, the wheel, pottery, spacecraft, nuclear fusion, and so forth.
Brian Reynolds said the game mechanics. Soren Johnson said it was the tile-based nature of the game -- not the turn-based nature of the game but the tile-based nature. Jon Shafer said the tile- and turn-based nature of the game. And Derek Paxton said empire-building. So, no consensus, but each of them, when they explained their rationale, adopted a very similar strategy. I'm going to highlight that with a quotation, part of which is here. This is from Soren Johnson, who said it's the tile-based nature of Civ that is the most significant property. I'll read it first and then unpack it a little bit, because, again, what he does in terms of rationalizing his choice is a tactic we saw across the interviews regardless of what people chose. "Well, I'd say one of the key things for Civ is the tile-based nature of the game. Everything revolves around the tile. Most people first identify it as a turn-based game, and that's also important. But I think what's most important is that everything revolves around the tile. Your units are on specifically one tile. They're not somewhere in some sort of more analog, less discrete space. Which determines everything: it determines how units can move, what's adjacent to the city you're on, if you can build an improvement, if it's a good place for a farm, all these specific things. Everything flows out from the tile. Your culture grows one tile at a time; your resources come from specific tiles. It's a very, very important thing." So the strategy that he uses, which we saw again and again, was that the developers would first think about significant properties by sort of decomposing the game into component parts or properties or characteristics -- thinking about properties discretely. But then, when they described the importance of the characteristic, it was to show or demonstrate the cascade effects it had across the entire world of the game and across game play viewed as a Gestalt, viewed holistically. So I guess you could say the strategy was that they would identify a master property that seemed to govern or control everything else: take an initially discrete property, but show its essentially analog effects. I mentioned that they all had thought a great deal about managing the tension between continuity and change. There were a number of principles of change that emerged, and I want to highlight specifically in this list complexity, the one-third rule, and platform considerations. Complexity came up again and again. A number of the developers said that for this game and other games, maybe particularly in the first-person shooter genre, there is a tendency, if you're not careful as a developer, to skew the game more and more over time in the direction of what are sometimes called the lead players or the hard-core players, sacrificing the novice players or users, and you would often sacrifice that sort of sweet spot in the middle. So they were all very conscious of that and had to work through it very consciously to mitigate it. The platform considerations are also very interesting. This is the degree to which the game developers either tried to just superimpose a game originally created for one platform onto a new platform, or whether they really tried to understand and honor and respect the affordances of the new platform. And so we have this wonderful quotation by one of the game developers.
This is Brian Reynolds on Civ 2, talking about creating a version of the game for Windows: "Most games up to this point had been done for MS-DOS as a platform; certainly Civilization had, and almost every other game that MicroProse had done. But we were starting to get that antsy 'Windows is coming' feeling. And so we decided to do Civ 2 as a Windows game. Windows 3.1 was the last one that kind of ran on top of DOS: you'd boot up DOS and then tell Windows to load, and then you'd be in Windows. So it was kind of an early phase of even doing games for Windows. And one of the things that really worked out was the decision to really embrace Windows in the new version of the game. There was a lot of resistance to the operating system in general, because everyone wanted to have their super-fancy graphicky whatever, and it would take over the full screen. That was the thing. But we not only didn't do that, we actually kind of went for Windows, in the sense that if you play Civ 2 you can not only size the window of the game, but you can size the smaller windows within it and all of that stuff. So we kind of just said, oh, we're making the game for Windows, so let's just own it and really make a game for Windows. And that paid huge dividends when Windows 95 came out, because it became the dominant operating system. So we really picked the right platform at the right time, and at the same time we leaned into the platform" -- there's my title -- "rather than a lot of the games that were kind of leaning away. I think that the really surprising magnitude of the success of Civ 2 has a lot to do with the fact that it wasn't just fun and polished; it also leaned hard into that platform at a really good time. So yes, platform is very significant, and the choice of platform and market timing were very significant." I mentioned the one-third rule. This was a kind of formulaic response or strategy for managing continuity and change that at least three of our interviewees mentioned independently of one another, without being prompted. And it directly answers the question: how do you manage the balance between tradition and innovation in the development of the franchise, that is, being true to the original while still changing and evolving? So here's Soren Johnson again: "There's this other phrase we use that originates from Bing Gordon, that a good sequel should be one third old, one third improved, and one third new. You know, where you're kind of looking for this nice mix of stuff people are going to be very familiar with, stuff you weren't happy with before that you're going to change, and then sort of the big bullets that are going to be on the box -- these are the new things. For Civ 4 the new things were stuff like religions and great people and promotions." This is actually one of the most interesting parts of the project to me. Almost all of the definitions of significant properties in the research literature emphasize visible surface characteristics. So I ticked off some that people might mention for games: things like character levels, sound effects, dialogue, look-and-feel type things. But for games, and probably for other artifacts, surface visible characteristics aren't enough. You need to be able to get at the underlying data structures and game model and physics engine. And there was a whole range of data that we got that spoke to this.
A lot of players do things like, for example, speed runs, where you try to make it through a single game session as quickly as possible -- either an entire game session or a particular level -- and you're competing against other players to see how fast you can do this. Players exploit glitches in the game, things that shouldn't be there, or anomalies in the underlying physics engine, in order to do this. So they do things like run or walk diagonally across walls; you shouldn't really be able to do that. They exploit the glitches. There's a really famous piece by a gamer, a player of the first-person shooter game Quake, called "Zigzagging Through a Strange Universe" or something like that. And he calls what he does physics 101: you're probing the underlying game model as a player, trying to understand how it works. When we interviewed Don Rawitsch of Oregon Trail, he made much the same point: that what made the simulation of Oregon Trail so interesting was the underlying game model, where he really researched original settlers' diaries to understand how settlers died along the Oregon Trail in the 19th century, from Missouri to Oregon. He tried to get the probabilities right in the game -- in other words, how often would settlers die of dysentery, or typhoid, or being killed by wild animals, or whatever. And he specifically mentioned players who would, for example, do things like intentionally play passively and try to go as long as possible without buying provisions. He called this probing the underlying game model. And so players do this all the time. We asked our developers whether there were any ways in which they tried to surface those game properties, make them visible at the player level, and they all had interesting things to say about this. How am I doing on time, Leslie? Okay. So I'm going to read a quotation that speaks to this from Civ, if I can identify the right one. Soren Johnson again; he gives the best quotes. "Generally speaking, we found that the more explicit you are, the better players respond to it." So it's all about the game model, which is normally invisible, and how they're making it visible to the player. "And I don't think it's necessarily true for every game, but for us it's very true, because Civ kind of feels like essentially a board game inside your computer -- the type of board game that would be fun to play but logistically impossible to do in the real world. You're using your computer to help you run it. A good example of this would be diplomacy. So Civ is a single-player game, and one of the interesting ripples of that is how do you do diplomatic AI, artificial intelligence: how do you decide what the enemy civilizations are going to do? You know, are they going to declare war, are they going to be your friend, how aggressive are they going to be, how easy are they going to be to trade with, and so on and so forth. And that's really kind of a very subjective question. The Civ series has gone back and forth with this question a lot. So with Civ 4, one of the things we did that was a little different is that when you went into the diplomacy window, when you moused over the leader's head, it would literally give you the mathematical breakdown of all of the factors that made them like you or dislike you, right? Plus two if you shared the same religion, minus three if you invaded them in the past, minus one if we share borders, plus four if we had a trade agreement."
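A toy sketch of the kind of itemized attitude calculation being described. The four factors and their weights come straight from the quotation; the surrounding data model is invented for illustration and is not Civ 4's actual code:

```python
# Factor weights as named in the quotation above.
ATTITUDE_FACTORS = {
    "shares your religion": +2,
    "you invaded them in the past": -3,
    "you share a border": -1,
    "you have a trade agreement": +4,
}

def attitude(relations: dict[str, bool]) -> tuple[int, list[str]]:
    """Sum the applicable modifiers and keep the itemized breakdown
    that the diplomacy window surfaced to players."""
    total, breakdown = 0, []
    for factor, weight in ATTITUDE_FACTORS.items():
        if relations.get(factor):
            total += weight
            breakdown.append(f"{weight:+d}: {factor}")
    return total, breakdown

score, lines = attitude({
    "shares your religion": True,
    "you invaded them in the past": True,
    "you have a trade agreement": True,
})
print(score)        # 3
for line in lines:  # the per-factor breakdown a player would see
    print(line)
```

The design point is that the itemized list, not just the total, is what gets surfaced: a player can see exactly which term to change.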
This gave players a very tangible sense of why a civilization likes them or dislikes them and what they could do to change that. So the algorithms that dictated how the game engine would respond to these things were initially, of course, invisible, and they decided to make them very explicit to the gamer community. When they then tried to take that out of the game again, there was an uproar from the player community, who demanded that they restore it. Interestingly, Sid Meier was the only developer of the Civ franchise who contradicted this line of thinking. He thought you never want to surface this kind of information, that it destroys the suspension of disbelief, of course. But he was a sort of outlier voice among the developers. And I'll skip that. I'll end by covering a few implications for collection development. I think one takeaway is that any particular title manifests complex genealogies of which the curator should be aware, and recognition of such familial relationships might be used to guide collection policy. And then finally, the implications for preservation. I won't go over all of these; I'll just highlight a few. Obviously, the idea behind the interviews was that understanding development history and roles can help guide preservation actions and strategies. Another is that users care about the underlying game structures and models, and making them visible, accessible, and manipulable aids in long-term preservation; that promoting reuse is a form of preservation. And I quoted Bethany Nowviskie here: the used key is always bright. This is from a keynote she gave at the Rare Books and Manuscripts conference in 2012. This is part of the quotation from Bethany: in contrast to physical documents and artifacts, where the best-preserved specimens are the ones that time and good housekeeping forget, the more a digital object is handled and manipulated and shared, and even kicked around, the longer it will endure; the harder they work, the longer they last. And certainly we saw that Civ at later stages in the franchise made it very easy for game players to mod the game, to mash it up, to sort of take the game in new directions. And these new generations remake the game to make it relevant for their own time, so I would say that was another takeaway. And I'll go ahead and wrap it up there. Thanks. [Applause] >> Happy to take questions. Yeah, Susan, hi? >> Did you, or are you considering, doing a comparable level of interviews with a set of players of Civilization? >> We did not. The only one we did for Civ was Derek Paxton, and yes, I think ideally we should have, and we didn't. I think almost all 23, except for Derek Paxton, were developers. But ideally, yeah, it would be great to have that perspective as well. One of the things we did do was create a player questionnaire; in other words, when you're thinking about preservation strategies, we created different questionnaires for different stakeholders, including developers, players, and curators. So we did think about that, but no, we don't have interview data on that. That's a great point. Yeah? >> Not a question but a comment, and it has to do with [Inaudible] significant properties in the [Inaudible] in the archives around that time -- >> Yes. >> And I'm interested in what you said about the [Inaudible] an example I think of, again, dealing with millions of these, is an HTML file that renders very differently in different browsers.
[Inaudible] well it's [Inaudible] the functional fallacy [Inaudible] called a design feature now. So it's still limited to the [Inaudible] a particular producer at a particular point in time; it creates huge problems. >> It creates huge problems, and, yeah, I think one thing that tends to happen in the research literature is there's this embracing of the idea of significant properties, which in turn implies that there is such a thing as insignificant properties. But when you pose that question, particularly to archivists and curators, you hear crickets. They have tons to say about significant properties, which are all about minute material properties. In games, they'll say everything is important: all of the hardware is important, the kind of input device you're using is important. But no one is ever willing to go on the record to say what those insignificant properties might be. And I think your example is a great point, and I have to say I don't feel that much closer to answering the question. I think the key is determining a designated user community, as the OAIS model specifies, and you use that community and its perspective to help you adjudicate those decisions. Thank you. Thanks. [Applause] >> We want to thank everyone for coming, thank everyone for staying a little bit late tonight, and please come back for some of our other Preservation Week events. We have some going on every day this week, ranging from working with digital scrapbooks to learning about digital imaging and preserving your digital photos to every aspect of personal digital archiving. So please come back to the library, and we look forward to seeing you later in the week. Thank you very much. >> This has been a presentation of the Library of Congress.