I attended the International Internet Preservation Consortium (IIPC) General Assembly (#IIPCGA15), held at Stanford University, April 27-28, 2015. IIPC is made up of member institutions from around the world. They meet annually and open the first two days of the conference to non-members. This year’s conference included representatives from over 30 countries and a wide range of professional fields. The presentations ranged well beyond web archiving, covering solutions and research on broader born-digital issues, digital humanities, and IT infrastructure. General takeaways included:
- Web archives have value, and are a potential gold mine of primary source documents and data
- More collaboration is needed between archivists, librarians, users, ethnographers, and computer analysts and engineers to support and enhance the value of web archives
- There is a lot going on in the world of web archiving beyond what archivists and librarians are doing (the traditional acquisition, preservation, and access facets). In fact, most of the presenters were information technologists and humanities professors. The World Wide Web is a new kind of primary source, and everyone is interested in what this means and how it can benefit their research.
In two days, I saw over twenty presentations. While I’d love to recount each of them, here are the highlights:
- OLIVE, presented by keynote speakers Vinton Cerf, Vice President and Chief Internet Evangelist for Google, and Mahadev Satyanarayanan, Professor at Carnegie Mellon University. The Olive Executable Archive is an online repository from which obsolete software can be streamed across the internet. Although it is not available yet due to intellectual property rights issues, it was exciting to see a potential source of obsolete software available at the touch of a button. At the end of the presentation, I wanted to give the presenters and their work a standing ovation, as obsolete software is a major obstacle to appraising and accessing born-digital material. I’m not sure exactly how this might impact the world of born-digital archives, but it is a good sign that leaders in the internet world are thinking about this issue and doing something about it.
- According to a study by Cathy Marshall, Adjunct Professor in the Center for the Study of Digital Libraries at Texas A&M, most Facebook users believe that their Facebook data is fine where it is; that its value is immediate, not historical; and that a Facebook archive is not a welcome idea. She interviewed 250 Facebook users and found that 22% thought everything on the Internet was in the public domain, but most participants were concerned about violations of privacy and confidentiality.
- Meghan Dougherty, Assistant Professor of Digital Communication at Loyola University in Chicago, gave a very interesting presentation about her research into how people use the web and social media. This included videos of people describing what they were looking at on their mobile phones and PCs. She is collecting data on web usage, which is fraught with challenges, but will help increase understanding of how researchers locate the artifacts of digital living, and how people experience life in web media. She also showed this hilarious video produced by The Onion, Internet Archaeologists Find Ruins of ‘Friendster’ Civilization.
- A collaborative project between four UK partners called “BUDDAH” (Big UK Domain Data for the Arts and Humanities). This project aims to deliver a methodological framework for the analysis of web archives. One of the project deliverables was a short video explaining to the public what web archives are, because apparently most researchers think web archives are digitized material from an archive made available online. The project offered grants to arts and humanities researchers to conduct research using web archives, and case studies will be available soon. This project has led to many findings, including that researchers often don’t want what they think they want, that sentiment analysis was less useful than hoped or even problematic, and that researchers want to be able to take a corpus of data and curate and analyze it, but nothing too complex. There are also ethical implications, as it is possible to pull personal stories from big data on the web. And the data is inherently messy.
- Also, be on the lookout in early 2016 for “The Web as History: Using Web Archives to Understand the Past and Present,” a book published from this project.
- My favorite presentation was probably Warcbase: Building a Scalable Web Archiving Platform on Hadoop and HBase, by Jimmy Lin, Professor at the University of Maryland. It was a little over my head technically, but Dr. Lin was a fantastic presenter and explained technical terms and processes. Basically, Warcbase is an open source platform for managing web archive data. If I understood correctly, it addresses researchers’ need to take a corpus of data and analyze it. The cool thing about it is that it can be used straight from a personal computer, and possibly even a mobile device. It can handle a large amount of data (over one terabyte), although it does take a while to ingest the data and then extract the analysis. Ian Milligan wrote about his experience testing it. I hope the UCI Libraries can play with something like this after we have Archive-It ready to roll.
- Did you know that when you look at a page in a web archive you might actually be looking at pieces of data pulled from various times? It’s true! And it was one of the biggest takeaways from this conference. Michael Nelson, Associate Professor in Computer Science at Old Dominion University, presented his research on this important, and little known, aspect of web archives, which is termed temporal coherence. Using HTTP metadata, you can see what point in time each part of a page comes from. There are four coherence states: existed/true, did not exist, might have existed, and probably did not exist. The prevalence of each varies, but according to Nelson 5% of websites did not exist as they appear. Approximately 40% of web pages did exist as presented, and the other 55% were somewhere in between. He then discussed the significance of this finding and, using the Domino’s Pizza logo as an example, asked whether it mattered if the new Domino’s logo appeared on a web page created long ago, or vice versa. It may not matter to the casual observer, but it could very well make a difference if there was, say, a lawsuit over the logo and the web archive was used as evidence in a court of law. The bottom line is, as mentioned by a member of the audience, we’re not archiving the web reliably. In a world where authenticity matters, this is a major issue and one that needs to be addressed further.
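For the technically curious, here is a minimal Python sketch of how a temporal coherence check might work. It is a simplified illustration under stated assumptions, not Nelson's actual method: the function name `classify_embedded`, its parameters, and the decision rules are all hypothetical. It compares the capture time of an archived page with the capture time of one embedded resource (say, a logo image) and, when the archive recorded one, the resource's `Last-Modified` HTTP header, then returns one of the four states listed above.

```python
from datetime import datetime
from typing import Optional


def classify_embedded(
    root_captured: datetime,
    embedded_captured: datetime,
    last_modified: Optional[datetime] = None,
) -> str:
    """Guess whether an embedded resource matches the page it is shown in.

    Assumed (simplified) rules, not taken from the talk:
    - captured at the same moment as the page: it existed as shown
    - captured before the page: it might have changed in between
    - captured after the page, unmodified since before the page: it existed
    - captured after the page, modified after the page: it did not exist
    - captured after the page, no Last-Modified header: probably did not exist
    """
    if embedded_captured == root_captured:
        return "existed"
    if embedded_captured < root_captured:
        return "might have existed"
    # From here on the resource was captured after the page itself.
    if last_modified is None:
        return "probably did not exist"
    if last_modified <= root_captured:
        return "existed"
    return "did not exist"


if __name__ == "__main__":
    page = datetime(2010, 6, 1)   # when the archive captured the page
    logo = datetime(2013, 3, 1)   # when the archive captured the logo

    # Logo last changed in 2012, after the page was captured.
    print(classify_embedded(page, logo, datetime(2012, 9, 1)))  # did not exist
    # Logo unchanged since 2009, so the page really did look like this.
    print(classify_embedded(page, logo, datetime(2009, 1, 1)))  # existed
    # No Last-Modified header recorded for the logo.
    print(classify_embedded(page, logo))                        # probably did not exist
```

In the Domino’s example, the first call is the case of a new logo showing up on an old page: the logo was captured years after the page and had been modified in the meantime, so that combination never existed on the live web.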
And those were just some of the highlights! If you’re interested in hearing more about the IIPC General Assembly, please see links to most of the abstracts in the online schedule. Or, feel free to contact me!