Information Services

Frequently Asked Questions

Some often-asked questions about how web archiving works, and how the University of Edinburgh is preserving historic web-based content.

What is web archiving?

Web archiving is the process of creating reliable copies of web-based content for long-term preservation. Paper records may seem delicate, but digital content is often at a higher risk of loss due to fragile physical carriers and rapid technological obsolescence. In average conditions, paper records can survive for decades stashed away in a box or filing cabinet, but digital content can become inaccessible quickly without appropriate action. This is particularly true for websites, as content is changed, updated, and removed frequently – the average lifespan of a website is around just two and a half years.

How does web archiving work?

The most common type of web archiving is ‘crawling’. A web crawler is a tool that navigates through sites via links, making copies of the content as it goes. Copied pages are collated into a standardised file format (WARC) for long-term preservation. WARC files also include useful metadata for archivists and future users, such as information about how a website works or when it was captured, which allows us to assert the authenticity of the captured resources. The archived files can then be loaded into a playback tool, such as Wayback or ReplayWeb, where they can be viewed.
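To make this concrete, the sketch below uses the open-source Python library warcio (an illustrative choice, not necessarily the tooling used by any particular crawler) to write a single captured page into a WARC file; the URL and HTML payload are placeholders.

from io import BytesIO
from warcio.warcwriter import WARCWriter
from warcio.statusandheaders import StatusAndHeaders

# Write one captured page into a gzipped WARC file.
with open('example.warc.gz', 'wb') as output:
    writer = WARCWriter(output, gzip=True)

    # The HTTP response headers as they were received from the live site.
    http_headers = StatusAndHeaders('200 OK',
                                    [('Content-Type', 'text/html')],
                                    protocol='HTTP/1.1')

    # create_warc_record bundles WARC metadata (target URI, capture date,
    # content digests) together with the captured payload itself.
    record = writer.create_warc_record('https://www.example.com/',
                                       'response',
                                       payload=BytesIO(b'<html>...</html>'),
                                       http_headers=http_headers)
    writer.write_record(record)

In a real crawl the headers and payload would come from fetching the live page, and one WARC file would typically hold many such records alongside request and metadata records.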

Archiving a website isn’t the same as saving the HTML files from a website: it involves creating a file of the website as it appeared on the live web, preserving as much of its functionality as possible. Web archiving does not take a static image or screenshot of a site – instead, we aim to reproduce archived websites in the way they functioned on the live web.

Why do we need to archive the University’s websites?

Heritage Collections hold records that document over 400 years in the life of the University, including records of institutions, organisations, and departments that have merged with the University or are no longer operating. Like many other institutions, the University’s communications are increasingly disseminated via online channels. We want to capture as full a picture as possible of how the University has communicated with its staff, students, and the wider community in the twenty-first century, and web archiving allows us to extend our collecting into the digital age, ensuring we can continue to demonstrate the global influence of the University over time.

Archiving historical snapshots of the University’s many web pages provides a valuable record of its research activities and community engagement, enables the University to comply with information compliance legislation and its own digital preservation policy, and supports lifecycle management of the University’s Web Estate – keeping our important memories secure and making space for newer, up-to-date information.

Does the University archive social media?

All web content within the collecting scope of the University Archives – regardless of the platform where it is published – may be archived. Due to the technical and regulatory restrictions around some platforms, it may not be possible to capture content to an acceptable quality, or at all. Content published on Facebook, for example, cannot be easily captured, even if pages are shared publicly. Content shared on Twitter/X can only be captured as web pages and not harvested as data due to platform restrictions around API access.

Some Facebook or other social media content may be collected as personal digital archives. This process requires account owners to download their own data via the platform’s own download function and to donate these downloaded files to the Archives. The Digital Preservation Coalition publishes helpful guidance on personal digital archiving for social media on its website.

Does the University Archives collect non-University web content, either by donation or purchase?

Yes, in some instances, the University Archives may purchase web-based publications (such as web-based art works) or take in web pages donated as part of personal digital archives. These purchases and donations are acquired through agreements and documented in line with Heritage Collections collecting processes.

How are University websites being archived?

The University’s online history is predominantly archived through the UK Web Archive. By law, all UK print and digital publications - including websites - must be deposited with the British Library and, on request, with the other five Legal Deposit Libraries (the National Library of Scotland, National Library of Wales, Bodleian Libraries, Cambridge University Libraries and Trinity College, Dublin). The UK Web Archive is a partnership of these six UK Legal Deposit Libraries and aims to collect a copy of all UK websites at least once per year.

There are two modes of collection used to gather sites within the UKWA. A yearly ‘domain crawl’ attempts to make a copy of any website that has a UK top-level domain name (such as .uk, .scot, .wales, .cymru, and .london). University sites are often captured through this process.

Additionally, curators are able to build targeted collections around specific topics and themes. The frequency of these captures varies depending on the nature of the site and its content – for example, if a page is regularly updated, it might be collected on a daily or weekly basis. University sites are regularly captured through this targeted collection. The University Web Archivist can instruct the UKWA crawler to copy a site and provide detailed descriptive metadata that helps to make sites findable in the future.

Some websites aren’t well suited to being captured by the UK Web Archive, and need bespoke capture using open access web archiving tools. These allow a user to create a WARC (Web ARChive) or WACZ (Web Archive Collection Zipped) file of a site as it appears on the live web and store this locally in the University’s digital archive.
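As a rough sketch of how such a locally stored WARC file might later be inspected (again using the open-source warcio library as an illustrative example, not a description of the University’s actual workflow), the records it contains can be listed like this:

from warcio.archiveiterator import ArchiveIterator

# Print the URL and capture date of each page stored in a local WARC file.
with open('example.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == 'response':
            print(record.rec_headers.get_header('WARC-Target-URI'),
                  record.rec_headers.get_header('WARC-Date'))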

Can anyone see a website that has been archived?

Access to websites that have been archived in the UK Web Archive is by default restricted to users at computer terminals onsite in Legal Deposit Libraries, unless open access permission has been explicitly granted by the website owner.

Web content that has been captured using manual tools can be accessed through the Heritage Collections Research Service.

How do I know if my site has been archived?

You can search across multiple repositories using Memento’s Time Travel tool.

To view captures of a site in the UK Web Archive, append the URL of your site to the following:

https://www.webarchive.org.uk/wayback/archive/*?url=

This will bring up a calendar view showing all the captures of that URL over time.
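For example, assuming the University homepage has been crawled at least once, its captures would be listed at:

https://www.webarchive.org.uk/wayback/archive/*?url=https://www.ed.ac.uk/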

If you can't find any captures of your site, it may need to be manually added to the UK Web Archive - please contact the Web Archivist who can assist you with this.

The archived copy of my site doesn’t look right. What can I do to change this?

The best way to get a good archived copy is to build a site that is ‘archivable’, and there are a few simple steps that you can follow during the design process that can make your website more crawler-friendly and improve any copy that is made for preservation. You can find more information on our page 'Making archive-friendly websites'.

There are some limitations to what a web crawler can do, so an archived website won’t always look and behave the same as the live site did. A web crawler can’t interact with a website, so it can’t fill out a form, search a database, or scroll down a page to load more content, for example. Similarly, crawlers can’t input passwords, so anything behind a login is out of reach to them. Any content that is generated through user interaction like this will not be accessible to the crawler.

Some websites might be better suited to bespoke capture using open access web archiving tools. These allow users to create a WARC (Web ARChive) or WACZ (Web Archive Collection Zipped) file of a site as it appears on the live web and store this locally. If you think your site would benefit from being captured in this way, please contact the Web Archivist (heritagecollections@ed.ac.uk) who can assist with this.

The information in the archived copies of my site is out of date. Is it likely that a search engine will find and return the archived version in the UK Web Archive above more recent content?

Content that has been archived in the UK Web Archive isn’t indexed in the same way as content on live sites, and so shouldn’t appear in search engine results.

Captures in the UK Web Archive include a blue banner at the top of the page to indicate that the user is viewing an archived copy. This banner contains the page title, a timestamp of when the capture was made and links to the calendar page that will show other captures of that URL.

My site has been archived but I don’t want it to be. How do I get it removed from the UK Web Archive?

The University of Edinburgh expects all website owners and content contributors to adhere to the Website Terms and Conditions, and ensure that all content on University sites is fit to be in the public domain before it is published.

The UK Web Archive is committed to ensuring that UK-published web material that is collected under legal deposit legislation is preserved and made available for researchers to use in the Libraries’ premises. However, the Libraries are also committed to ensuring that material is archived and displayed lawfully. It is possible for access to a specified page or site to be restricted to users at computer terminals onsite in Legal Deposit Libraries.

If you feel you have reasonable grounds for requesting that content preserved in the UK Web Archive be restricted, please contact the Web Archivist (heritagecollections@ed.ac.uk) who can assist with formulating a takedown request.