What is web archiving?
Web archiving is the process of using tools such as crawlers to collect web content for long-term preservation in an archive.
Why do we need to archive the web?
Heritage Collections hold records that document over 400 years in the life of the University, including records of institutions, organisations, and departments that have merged with the University or are no longer operating. Web archiving allows us to extend these collections and capture a picture of how the University has communicated with its staff, students, and the wider community in the twenty-first century.
Like many other institutions, the University increasingly disseminates its communications via online channels. The capture of these digital outputs plays an important role in enabling the University to comply with information legislation and its own digital preservation policy. Many research funders require that project outputs, like websites, remain available for many years into the future. Historical snapshots of the University’s many webpages provide a valuable record of its research activities and community engagement. Archiving these historical snapshots supports lifecycle management of the University’s Web Estate – keeping our important memories and making space for newer, up-to-date information.
Paper records may seem delicate, but digital content is often at a higher risk of loss. In average conditions, paper records can survive for decades stashed away in a box or filing cabinet, but digital content can become unusable quickly without appropriate action. This is particularly true for websites as content is changed, updated, and removed – the average lifespan of a website is around just two and a half years.
How does web archiving work?
Archiving a website isn’t the same as saving the HTML files from a website. It involves creating a file of the website as it appeared on the live web, preserving as much of its functionality as possible.
The most common type of web archiving is ‘crawling’ – using a tool that navigates through links, making copies of the content as it goes. This approach is automated by providing a crawler with a starting URL (like a homepage). Crawling aims to capture a large number of web pages at once, but this approach has limitations. Crawlers are often unable to capture content behind a login, or content that requires user interaction in order to be generated, like embedded videos.
Copied pages are collated into a standardised file format (WARC or WACZ) for long-term preservation, along with useful metadata for archivists and future users, such as information about how a website works or when it was captured. This information allows archivists to assert the authenticity of the captured resources.
The archived files can then be loaded into a playback tool where they can be viewed. Web archiving does not take a static image or screenshot of a site. Web archiving enables users to reproduce archived websites the way they functioned on the live web.
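The link-following behaviour described above can be sketched in a few lines of Python. This is a simplified illustration only, not how Heritrix or any other production crawler is implemented: the `crawl` function, the `fetch` callable and the in-memory `SITE` dictionary are all hypothetical stand-ins for the live web, and nothing is written to a WARC or WACZ file.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch):
    """Breadth-first crawl from `seed`, staying on the seed's host.

    `fetch` is a hypothetical callable returning the HTML for a URL,
    or None if the page cannot be retrieved.
    """
    host = urlparse(seed).netloc
    queue = deque([seed])
    captured = {}
    while queue:
        url = queue.popleft()
        if url in captured:
            continue
        html = fetch(url)
        if html is None:
            continue
        captured[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any "#fragment" part.
            target, _fragment = urldefrag(urljoin(url, href))
            if urlparse(target).netloc == host and target not in captured:
                queue.append(target)
    return captured


# A tiny in-memory "website" (hypothetical) standing in for the live web.
SITE = {
    "https://www.example.com/": '<a href="/about">About</a> '
                                '<a href="https://other.example.org/">Elsewhere</a>',
    "https://www.example.com/about": '<a href="/">Home</a> <a href="/news#latest">News</a>',
    "https://www.example.com/news": "<p>No further links</p>",
    "https://www.example.com/hidden": "<p>Only reachable through a search form</p>",
}

captured = crawl("https://www.example.com/", SITE.get)
print(sorted(captured))
```

In this sketch the page at `/hidden` is never captured because no link points to it, and the page on the other domain is skipped as out of scope. These are the same limitations that the guidance on making sites ‘archivable’ is designed to work around.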
The UK Web Archive
By law, all UK print and digital publications – including websites – must be deposited with the British Library and by request to the other five Legal Deposit Libraries (the National Library of Scotland, National Library of Wales, Bodleian Libraries, Cambridge University Libraries and Trinity College, Dublin). The UK Web Archive (UKWA) is a partnership of these six UK Legal Deposit Libraries and aims to collect a copy of all UK websites at least once per year.
How does the UKWA capture websites?
There are two modes of collection used to gather sites within the UKWA. A yearly domain ‘crawl’ attempts to pick up any website that has a UK top-level domain name (such as .uk, .scot, .wales, .cymru, and .london). Additionally, library curators and other affiliated institutions build collections of eligible sites around specific topics and themes – if a website contains a UK postal address (or the website owner confirms UK residence or place of business) it can be included. The frequency for these collections varies depending on the nature of the site and its content – for example, some news sites are collected on a daily basis. Building topical collections also allows curators to provide more detailed descriptive metadata that helps to make sites findable in the future, and to build collections to support particular research interests.
The UKWA and the University of Edinburgh
As part of the Collecting Covid-19 Initiative to document the University’s response to the coronavirus pandemic, the Centre for Research Collections (CRC) joined forces with the UKWA in April 2020 in order to collect web-based submissions to the Initiative.
In 2022, the University partnered with the UKWA again as part of the Wellcome Trust funded Archive of Tomorrow project to explore how the internet is used to disseminate, access and share health information. The project ended in February 2023, and the ‘Talking About Health’ collection can be viewed online.
How can the UKWA be accessed?
Access to archived websites is by default restricted to users at computer terminals onsite in Legal Deposit Libraries, unless open access permission has been explicitly granted by the website owner.
The University is currently pursuing an open access licence which would automatically allow archived University pages to be accessed outwith the Legal Deposit Library network. This licence will allow University content creators to check the status of any captures, view their archived web pages remotely, and share links to archived content.
How do I preserve my website?
How you choose to preserve your website will depend on the needs of your site, the type of content you have on the site, and how you wish to use the captures.
In the UK Web Archive
The UKWA encourages anyone to nominate eligible pages for capture in the UKWA by filling in the nomination form here.
Note that the UK Web Archive does not archive sites in which audio-visual material is the predominant content, private intranets and emails, or websites only available to restricted groups.
Other web archiving tools
Some websites might be better suited to bespoke capture using open access web archiving tools. These allow users to create a WARC (Web ARChive) or WACZ (Web Archive Collection Zipped) file of a site as it appears on the live web and store this locally.
ArchiveWeb.page is a tool created by Webrecorder that captures pages interactively as you browse. It can be installed on your desktop or used as a Chrome, Edge or Brave browser extension, and the resulting captures can then be accessed through ReplayWeb.page, a browser-based playback tool. For information on getting started with these tools, see the ArchiveWeb.page User Guide.
What support is available to help me preserve my university site?
If you manage web content that needs to be preserved, please get in touch with the Web Archivist, Alice Austin (email@example.com), who can help you to identify key components for preservation, provide guidance on how to prepare your site for crawling, and troubleshoot common issues.
More general guidance on web archiving can be found on the International Internet Preservation Consortium (IIPC) website: https://netpreserve.org/. The Digital Preservation Coalition also provides guidance on archiving websites and other forms of digital content.
Making ‘archivable’ sites
There are a few simple steps that you can follow during the design process that can make your website more crawler-friendly and improve any copy that is made for preservation.
Make sure your content can be found
- Use a sitemap. Crawlers use links to move around a site, so listing all the pages of your website in a sitemap (in XML or HTML) can help them to find everything. The sitemap should ideally be called ‘sitemap.XXX’ and placed at the top level of your web server (e.g. http://www.example.com/sitemap.xml). You can find guidance on creating a sitemap for WordPress sites on the Blogs.ed support pages.
- Provide standard links to content which would otherwise only be accessed through dynamic navigation like search forms or drop-down menus – crawlers can’t interact with a site, so make sure any pages that would normally be accessed this way are represented in your sitemap.
- Make sure your site is accessible. Ensuring your website adheres to accessibility standards helps to make your site more usable by everyone, including web crawlers used for archiving. The University has guidance and policy to support this – find out more here.
- Use robots.txt to help the crawler find the right content. Crawlers navigate using links, so if your site includes features like databases or infinitely-running calendars, the crawler may get stuck in a loop. Using robots.txt exclusions allows you to block crawlers from accessing specific pages or directories and prevents these crawler traps. You can also use your robots.txt file to be sure that any directories which contain stylesheets or images are not restricted. The UK Web Archive uses the Heritrix crawler which identifies itself as ‘bl.uk_lddc_bot’. To provide full access to the UKWA crawler, include the following two lines in your robots.txt file:
- User-agent: bl.uk_lddc_bot
- Keep URLs stable, clean and operational. Keeping URLs consistent (using redirects where necessary) can help minimise ‘link rot’ and allows users to see the evolution of your site over time. Avoid using variable (e.g. “?”, “=” and “&”) or unsafe (such as the space character or the “#” character) characters in your URLs – these can prevent the crawler from properly accessing all pages. Finally, make sure the links on your site are up-to-date and working – if your website contains broken links, the archived copy will too!
- Stick to one domain. By default, the crawler operates on a domain-name basis. If a link takes the crawler to a different domain, it will assume these pages are out of scope and stop crawling. This is also true for images and other ‘secondary’ content – where possible, host objects on the same domain as the pages that use them.
- Use explicit links for audio and video content where possible. Crawlers can download common audio-visual file formats, but only if they can find them.
- Avoid embedding content if you can. Third-party services (such as YouTube, Flickr, Soundcloud etc) are useful, but they effectively hide content from the crawler.
- Use open file formats. Proprietary file formats can be susceptible to upgrade issues and obsolescence, meaning that even where content has been saved in the long-term, it can’t be opened. Using formats that can be read by open source software improves the chances that a site and its contents can be properly accessed in the future.
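One way to check the robots.txt advice above before a crawl is to test your rules with Python’s standard `urllib.robotparser` module. The sketch below is illustrative only: the `ROBOTS_TXT` content, the example URLs and the `check_access` helper are hypothetical, and the rules shown are not the UKWA’s recommended text. Only the crawler name `bl.uk_lddc_bot` comes from the guidance above.

```python
from urllib.robotparser import RobotFileParser

UKWA_AGENT = "bl.uk_lddc_bot"  # crawler name given in the guidance above

# Hypothetical robots.txt: blocks a calendar "crawler trap" but leaves
# stylesheets and ordinary pages open to the UKWA crawler.
ROBOTS_TXT = """\
User-agent: *
Disallow: /calendar/
Disallow: /search

User-agent: bl.uk_lddc_bot
Disallow: /calendar/
"""


def check_access(robots_txt, agent, urls):
    """Return a {url: bool} mapping of what `agent` may fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {url: parser.can_fetch(agent, url) for url in urls}


# Hypothetical URLs on an example site.
URLS = [
    "https://www.example.com/",
    "https://www.example.com/css/site.css",
    "https://www.example.com/calendar/2031/01",
]

access = check_access(ROBOTS_TXT, UKWA_AGENT, URLS)
for url, allowed in access.items():
    print("allowed" if allowed else "blocked", url)
```

Here the stylesheet and homepage are reported as allowed and the calendar URL as blocked. Running a check like this against your real robots.txt can help confirm that exclusions meant to prevent crawler traps are not also hiding the stylesheets and images needed for a faithful capture.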
Be aware of limitations
There are limits to what a crawler is able to access and copy. Being aware of these limitations can help you to assess the archivability of your site and identify any areas that might need further attention.
Crawlers can’t interact with websites. They can’t fill out a form, search a database, or scroll down a page to load more content. Anything that is generated through visitor interaction like this will not be accessible to the crawler – making sure these pages are listed in your sitemap can overcome this issue. Similarly, crawlers can’t input passwords, so anything behind a login is out of reach to them.
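As a rough illustration of listing such pages in a sitemap, the sketch below builds a minimal XML sitemap with Python’s standard library. The page URLs and the `build_sitemap` helper are hypothetical, and the namespace is the one defined by the sitemaps.org protocol. A content management system or plugin may well generate this file for you, so treat this as a sketch of the file’s shape rather than a recommended workflow.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Hypothetical pages, including ones a visitor would normally only
# reach through a search form or drop-down menu.
PAGES = [
    "https://www.example.com/",
    "https://www.example.com/projects/",
    "https://www.example.com/projects/archive-2019",
    "https://www.example.com/staff/directory/a",
]


def build_sitemap(urls):
    """Return a minimal XML sitemap listing each URL in `urls`."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


sitemap_xml = build_sitemap(PAGES)
print(sitemap_xml)
```

The output would be saved as sitemap.xml at the top level of the web server, giving a crawler a plain link to every page it could not otherwise discover. Pages behind a login will remain out of reach even if they are listed.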
Useful tools and resources
Memento – Search across multiple web archive repositories to see if captures have already been made of your site. The video below explains how to search using Memento’s Time Travel tool.
ArchiveReady.com – Useful tool for evaluating the ‘archive readiness’ of your site.
Validator – Use this tool for checking whether HTML and CSS are validated and compliant with current standards.
Link Checker – Use this tool to check for any broken links on your site. If the links between pages on your site aren’t working, the crawler won’t be able to move around your site properly.