If you have some old Omeka sites that are still valuable resources but are no longer being actively maintained, you might want to consider converting them to static sites and archiving the PHP code and database. This means that a site can stay online in much the form that it’s in now, at the same URLs, and you still have the code and database to bring it back if you want to. From a maintenance perspective this is a big win, since you no longer have the problem of keeping PHP, Omeka and MySQL up to date and backed up. The big trade-off is that the site becomes truly static: making any changes across the static assets would be quite tedious. So only consider this if you are confident that the project will no longer be actively curated.

I have done this with several WordPress sites before, but Omeka is a little bit different in a few ways, so I thought it was worth a quick blog post to jot down the steps.

Disable Search

This step is kind of important for the usability of your static site. Since there will no longer be any server-side PHP code or a database for it to query, the code that performs the search will be gone. It’s probably not a great idea to leave the search form there inviting people to enter a query only to get an error, so you can remove it. Losing search may well be a deal breaker for you, depending on how you are using it. The good news, however, is that people can still find your site via Google, DuckDuckGo, or some other search engine, instead of it being completely offline.

To disable search, take a look in your Omeka theme, often in common/header.php, and simply comment out the code that generates the search form.
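If you are not sure where your theme builds the form, a quick search can narrow it down. Here is a minimal sketch that assumes the form is produced by a helper called search_form() (or simple_search() in older themes). Both names are assumptions about your theme, and find_search_form is just a name I made up for this example.

```shell
# List theme lines that appear to render a search form.
# Assumption: the form comes from a helper named search_form() or
# simple_search(); adjust the pattern if your theme does something else.
find_search_form() {
  theme_dir=$1
  # /dev/null is passed as an extra file so grep always prints file names.
  find "$theme_dir" -type f -name '*.php' \
    -exec grep -nE '(search_form|simple_search)[[:space:]]*\(' /dev/null {} +
}
```

Running find_search_form on your theme directory prints each matching file, line number and line, which you can then wrap in a PHP comment.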

It might be nice to be able to generate a static Lunr.js index for your database and drop it into your Omeka site before creating the static version. This is an approach that the minimal computing project Wax has taken, and it should work well for average-sized collections. Or perhaps you could configure a Google Custom Search Engine, and similarly drop that into your Omeka before conversion. But it may be easiest to simply accept that some functionality will be lost as part of the archiving process.

Localize External Resources

It’s fairly common to use JavaScript and CSS files from various CDNs. To find them, view the source of one of your Omeka pages and scan for http to review the JavaScript and CSS files that the pages need in order to work properly. If you find any, try downloading them into your theme and then updating your theme to reference the local copies.
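Scanning by eye gets tedious on big pages, so here is a rough sketch of the same check as a shell function. It assumes you have saved a page to a file, that attributes use double quotes, and that grep supports -o. The name external_resources is made up for this example.

```shell
# Print the distinct external URLs that <script>, <link> and <img> tags in a
# saved HTML page load via src= or href=, skipping the site's own host.
# Dots in the host name are treated as regex wildcards, which is loose but
# good enough for a quick review.
external_resources() {
  page=$1
  own_host=$2
  grep -oE '<(script|link|img)[^>]*(src|href)="https?://[^"]+"' "$page" |
    sed -E 's/.*(src|href)="//; s/"$//' |
    grep -vE "^https?://$own_host(/|\$)" |
    sort -u
}
```

For example, external_resources page.html archive.blackgothamarchive.org would list what still needs localizing. Protocol-relative URLs (starting with //) and single-quoted attributes are not caught, so a manual skim is still worthwhile.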

Use Slash URLs

This one is kind of esoteric, but could be important. Most Omeka installs don’t use trailing slashes on URLs, for example:

https://archive.blackgothamarchive.org/items
https://archive.blackgothamarchive.org/items/show/121

The problem with this is that when you use a tool like wget to mirror a website it will download those pages using a .html extension:

archive.blackgothamarchive.org/items/show/121.html
archive.blackgothamarchive.org/items.html

This works just fine when you mount it on the web, but if anyone has linked to https://archive.blackgothamarchive.org/items/show/121 they will get a 404 Not Found.

One way around this is to convert your application URLs to end in a slash prior to creating your mirror. You can do this by modifying the url function which can be found in libraries/globals.php or in application/helpers/Url.php in older versions of Omeka. This issue ticket has some more details.

Then your URLs will look like this:

https://archive.blackgothamarchive.org/items/
https://archive.blackgothamarchive.org/items/show/121/

and will be saved by wget as:

archive.blackgothamarchive.org/items/index.html
archive.blackgothamarchive.org/items/show/121/index.html

Then when someone comes asking for an old link like:

https://archive.blackgothamarchive.org/items/show/121

Apache will happily redirect them to:

https://archive.blackgothamarchive.org/items/show/121/

and serve up the index.html that’s there. Whew. Yeah, all that for a forward slash. But if links are important to you it might be worth the code spelunking to get it working.
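To make the difference concrete, here is a simplified model of the file names described above. It is only a sketch of the naming behavior for HTML pages fetched with --html-extension, not wget’s actual logic: query strings and non-HTML files are ignored, and mirror_path is a name I made up.

```shell
# Simplified model of where `wget --mirror --html-extension` stores an HTML
# page, given its URL. Query strings and non-HTML files are ignored here.
mirror_path() {
  url=${1#*://}    # drop the scheme, keeping host and path
  case $url in
    */)            printf '%s\n' "${url}index.html" ;;  # directory-style URL
    *.html|*.htm)  printf '%s\n' "$url" ;;              # already has an extension
    */*)           printf '%s\n' "$url.html" ;;         # page without an extension
    *)             printf '%s\n' "$url/index.html" ;;   # bare host name
  esac
}
```

Calling it with the slashless and slashed versions of an item URL shows why only the second leaves a directory behind for Apache to redirect to.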

Do the Crawl

I’ve used wget for this in the past. It’s a venerable tool that has been battle-hardened over the years. It won’t execute JavaScript in your pages, but most Omeka applications don’t rely too heavily on that. It could be a problem if you use this approach to archive other types of sites.

The one problem with wget is it has many, many options, many of which interact in weird ways. Here’s an example wget command I use:

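# Example values; substitute your own site, name and log file.
url=https://archive.blackgothamarchive.org
name=bga
log=$name.log

wget \
  --output-file "$log" \
  --warc-file "$name" \
  --mirror \
  --page-requisites \
  --html-extension \
  --convert-links \
  --wait 1 \
  --execute robots=off \
  --no-parent "$url" 2>/dev/null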

This is painful, so I’ve developed a little helper utility I call bagweb so that I don’t need to remember the options and what they do every time I want to mirror a website. The --warc-file option tells wget to also write a WARC file as it goes, which can be useful, as we’ll see in a second. You run bagweb by giving it a URL and a name to use for a new directory that will contain a BagIt package:

% bagweb https://archive.blackgothamarchive.org bga

This will run for a while writing a log to bga.log. Once it’s done you’ll see a directory structure like this:

% tree bga
bga
├── bag-info.txt
├── bagit.txt
├── data
│   ├── bga.warc.gz
│   └── archive.blackgothamarchive.org.tar.gz
├── manifest-md5.txt
└── tagmanifest-md5.txt
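The manifest-md5.txt file records a checksum for each file under data, so you can confirm the payload is intact after copying the bag around. A minimal sketch, assuming GNU md5sum is installed and the manifest lines look like "<md5>  <path>" (verify_bag is a made-up name):

```shell
# Check every payload file in a bag against manifest-md5.txt.
# Assumes md5sum is available and the manifest uses the "<md5>  <path>"
# layout that `md5sum -c` understands.
verify_bag() {
  bag_dir=$1
  # Run in a subshell so the caller's working directory is left alone.
  ( cd "$bag_dir" && md5sum -c manifest-md5.txt )
}
```

verify_bag bga prints one line per payload file and returns a non-zero status if any checksum fails. This only covers the payload; a full BagIt validator also checks the tag manifest.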

You can zip up that directory or copy it to an archive. But before we do that, let’s test the mirror.

Test!

You can unpack your mirrored website and make sure it works properly, using Docker to easily start up an Apache instance on your computer:

% tar xvfz bga/data/archive.blackgothamarchive.org.tar.gz
% cd archive.blackgothamarchive.org
% docker run -v `pwd`:/usr/local/apache2/htdocs -p 8080:80 httpd

And then turn off your Internet connection (wi-fi, ethernet, whatevs) and visit this URL in your browser:

http://localhost:8080/

You should see your Omeka site! For extra points you can download Webrecorder Player and open the generated WARC file and interact with it that way.
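Clicking around only covers the pages you happen to visit. If you have a list of old links that matter to you, a rough way to spot-check them is to ask whether the unpacked mirror has something to serve at each path. This is a sketch under the assumption that a path is servable if it is a file or a directory containing index.html; missing_from_mirror is a made-up name.

```shell
# Read URL paths (one per line, e.g. /items/show/121) on stdin and print
# those that the unpacked mirror could not serve: neither a file at that
# path nor a directory containing index.html.
missing_from_mirror() {
  mirror_dir=$1
  while IFS= read -r path; do
    target=$mirror_dir/${path#/}   # map the URL path onto the mirror directory
    target=${target%/}             # ignore a trailing slash
    if [ -f "$target" ] || [ -f "$target/index.html" ]; then
      continue
    fi
    printf '%s\n' "$path"
  done
}
```

From inside the unpacked directory you could run missing_from_mirror . with a file of paths on standard input. Any path it prints would likely return a 404 once the static site is live.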

Install

Now that you have your static version of the website you need to move it up to your production web server. That should be as simple as copying the tarball up to your server and unpacking it to a directory that your Apache configuration identifies as the DocumentRoot.

You may also want to create a tarball of the Omeka server side code and a MySQL dump of the Omeka database to save in your bag. It’s probably worth noting some details about external dependencies in the bag-info.txt such as the version of Apache, PHP, MySQL and the operating system type/version for anyone courageous enough to try to get the code running again in the future.
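Here is a hedged sketch of those two steps. The paths, database name and user are placeholders you would replace with your own, and the function names are made up for this example. It assumes tar, gzip and mysqldump are installed on the server.

```shell
# Bundle the Omeka PHP code into a gzipped tarball.
archive_code() {
  omeka_dir=$1    # directory holding the Omeka install (placeholder)
  out_file=$2     # tarball to create, e.g. omeka-code.tar.gz
  tar -czf "$out_file" -C "$(dirname "$omeka_dir")" "$(basename "$omeka_dir")"
}

# Dump the Omeka database to a gzipped SQL file.
# mysqldump prompts for the password because --password is given no value.
dump_database() {
  db_name=$1
  db_user=$2
  out_file=$3     # e.g. omeka-db.sql.gz
  mysqldump --user="$db_user" --password "$db_name" | gzip > "$out_file"
}
```

If you add the resulting files to the data directory of an existing bag, the manifests need to be regenerated so that the bag still validates.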

So, admittedly this is hardly a walk in the park. But if the Omeka environment is at risk for whatever reason this is a pretty satisfying process that ensures that the data is preserved, and still available on the web for people to use.