Omeka Staticify with 👜🕸
If you have some old Omeka sites that are still valuable resources, but are no longer being actively maintained, you might want to consider converting them to a static site and archiving the PHP code and database. This means that the site can stay online in much the form that it’s in now, at the same URLs, and you still have the code and database to bring it back if you want to. From a maintenance perspective this is a big win since you no longer have the problem of keeping the PHP, Omeka and MySQL code up to date and backed up. The big trade off is that the site becomes truly static. Making any changes across the static assets would be quite tedious. So only consider this if you really anticipate that the project is no longer being actively curated.
I have done this a with several Wordpress sites before, but Omeka is a little bit different in a few ways so I thought it was worth a quick blog post to jot down the steps.
Disable Search
This is kind of important for the usability of your static site. Since there will no longer be any server side PHP code and database for it to query, the code that performs the search will be gone. Since it’s probably not a great idea to leave the search form there inviting people to enter a query only to get an error, you can remove it. For good reason this may be a deal breaker for you, depending on how you are using the search. The good news however is that people can still find your site via Google, DuckDuckGo, or some other search engine, instead of it being completely offline.
To disable search take a look in your Omeka theme, often in
common/header.php
and simply comment out the code that
generates the search form:
<!-- <?php //echo simple_search(); ?> -->
It might be nice to be able to generate a static Lunr.js index for your database and drop it into your Omeka site before creating the static version. This an approach that the minimal computing project Wax has taken, and should work well for average size collections. Or perhaps you could configure a Google Custom Search Engine, and similarly drop that into your Omeka before conversion. But it may be easiest to simply accept that some functionality will be lost as part of the archiving process.
Localize External Resources
It’s fairly common to use JavaScript and CSS files from various CDNs.
To find them view the source of one of your Omeka pages and scan for
http
to review the types of JavaScript and CSS files that
might be needed for the pages to work properly. If you find any try
downloading them into your theme and then updating your theme to
reference them there.
Use Slash URLs
This one is kind of esoteric, but could be important. Most Omeka installs don’t use trailing slashes on URLs, for example:
https://archive.blackgothamarchive.org/items
https://archive.blackgothamarchive.org/items/show/121
The problem with this is that when you use a tool like wget to mirror a website
it will download those pages using a .html
extension:
archive.blackgothamarchive.org/items/show/121.html
archive.blackgothamarchive.org/items.html
This works just fine when you mount it on the web, but if anyone has linked to https://archive.blackgothamarchive.org/items/show/121 they will get a 404 Not Found.
One way around this is to convert your application URLs to end in a
slash prior to creating your mirror. You can do this by modifying the
url
function which can be found in
libraries/globals.php
or in
application/helpers/Url.php
in older versions of Omeka.
This issue
ticket has some more details.
Then your URLs will look like this:
https://archive.blackgothamarchive.org/items/
https://archive.blackgothamarchive.org/items/show/121/
and will be saved by wget as:
archive.blackgothamarchive.org/items/index.html
archive.blackgothamarchive.org/items/show/121/index.html
Then when someone comes asking for an old link like:
https://archive.blackgothamarchive.org/items/show/121
Apache will happily redirect them to:
https://archive.blackgothamarchive.org/items/show/121/
and serve up the index.html
that’s there. Whew. Yeah,
all that for a for a forward slash. But if links are important to you it
might be worth the code spelunking to get it working.
Do the Crawl
I’ve used wget for this in the past. It’s a venerable tool, that has been battle hardened over the years. It won’t execute JavaScript in your pages, but most Omeka applications don’t rely too heavily on that – it could be a problem if you use this approach to archive other types of sites.
The one problem with wget is it has many, many options, many of which interact in weird ways. Here’s an example wget command I use:
wget \
--output-file $log \
--warc-file $name \
--mirror \
--page-requisites \
--html-extension \
--convert-links \
--wait 1 \
--execute robots=off \
--no-parent $url 2>/dev/null
This is painful so I’ve developed a little helper utility I
call bagweb so I don’t
need to remember the options and what they do every time I want to
mirror a website. The --warc-file
option will also create a
WARC file as it
goes if you tell it too, which can be useful, as we’ll see in a second.
You run bagweb giving it a URL and a name to use for a new directory
that will contain a BagIt package:
% bagweb https://archive.blackgothamarchive.org bga
This will run for a while writing a log to bga.log. Once it’s done you’ll see a directory structure like this:
% tree bga
bga
├── bag-info.txt
├── bagit.txt
├── data
│ ├── bga.warc.gz
│ └── archive.blackgothamarchive.org.tar.gz
├── manifest-md5.txt
└── tagmanifest-md5.txt
You can zip up that directory or copy it to an archive. But before we do that let’s test them.
Test!
You can unpack your mirrored website and make sure they work properly using Docker to easily start up an Apache instance on your computer:
% tar xvfz bga/data/archive.blackgothamarchive.org.tar.gz
% cd archive.blackgothamarchive.org
% docker run -v `pwd`:/usr/local/apache2/htdocs -p 8080:80 httpd
And then turn off your Internet connection (wi-fi, ethernet, whatevs) and visit this URL in your browser:
http://localhost:8080/
You should see your Omeka site! For extra points you can download Webrecorder Player and open the generated WARC file and interact with it that way.
Install
Now that you have your static version of the website you need to move it up to your production web server. That should be as simple as copying the tarball up to your server and unpacking it to a directory that your Apache configuration identifies as a <DocumentRoot>.
You may also want to create a tarball of the Omeka server side code
and a MySQL dump of the Omeka database to save in your bag. It’s
probably worth noting some details about external dependencies in the
bag-info.txt
such as the version of Apache, PHP, MySQL and
the operating system type/version for anyone courageous enough to try to
get the code running again in the future.
So, admittedly this is hardly a walk in the park. But if the Omeka environment is at risk for whatever reason this is a pretty satisfying process that ensures that the data is preserved, and still available on the web for people to use.