Assorted Pelican tricks

Posted on: Thursday, 2023-11-16

Category: Tech

There are a zillion static website generators out there, as this list shows. I settled on Pelican years and years ago because I specifically needed something that could handle multiple languages. It ended up doing what I need it to do with minimal faff, and so here we are today.

A quick intro first: Pelican is written in Python. It lets you write content in either reStructuredText or Markdown; I've personally used Markdown. There are a bunch of provided themes, or you can write your own somewhat easily by using the templating engine Jinja2. That means you can write HTML files with a bunch of logic in them to do things like replace blocks or iterate over collections. You can add fixed pages in addition to blog posts, it can generate RSS and Atom feeds, the default values are reasonable, and there are a whole bunch of plugins. It's a good choice if, like me, you're somewhat technically inclined but extremely low on patience for Typical Webdev Bullshit. I specifically installed it via the pip package manager inside a virtual environment so I can minimize the contamination of my wider environment settings, as explained by the longer version of the Pelican starter guide. The most popular plugins can also be installed via pip. Now, onto the tricks.

First off, period archives. That's the name for collections of posts based on publication date - year, month, day if you're especially prolific. Pelican can generate those pages for you via variables like YEAR_ARCHIVE_SAVE_AS (actual filename saved in the output folder) and YEAR_ARCHIVE_URL (web address at which said file is made available), documented in the Settings page. The steps after that can be a bit trickier, however. Period archives are a special case in theme development: they seem to have a default, hidden template that uses your base one but changes the content block to a description list without really telling you. Once you write your own period_archives.html, note that it does not use pagination and therefore does not have access to the exact same variables. And see that period_archives variable mentioned in the section about creating links? That was not there a week ago, because it's a new feature in 4.9.0. I discovered this by happily using it in my templates, only for my site to fail to compile until I upgraded from 4.8.0.
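As a rough sketch of what those settings can look like: the variable names below are the Pelican ones, but the 'posts/...' folder layout is just an assumption for illustration, not necessarily what this site uses. The placeholders follow Python's format syntax with a date, so plain str.format gives a preview of the paths that come out:

```python
from datetime import datetime

# Pelican settings: where each period archive is written in the output
# folder, and the URL it is reachable at.
# The 'posts/...' layout is an illustrative assumption; any path works.
YEAR_ARCHIVE_SAVE_AS = 'posts/{date:%Y}/index.html'
YEAR_ARCHIVE_URL = 'posts/{date:%Y}/'
MONTH_ARCHIVE_SAVE_AS = 'posts/{date:%Y}/{date:%m}/index.html'
MONTH_ARCHIVE_URL = 'posts/{date:%Y}/{date:%m}/'

# Preview the expansion for an arbitrary date (not part of the settings file).
example = datetime(2023, 11, 16)
year_path = YEAR_ARCHIVE_SAVE_AS.format(date=example)
month_url = MONTH_ARCHIVE_URL.format(date=example)
print(year_path)   # posts/2023/index.html
print(month_url)   # posts/2023/11/
```

Only the four settings belong in the Pelican configuration; the preview lines are there to show how the date placeholders get filled in.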

Second is heading downgrading, or offsetting, or whatever you want to call it. I'm trying to do better on accessibility, and part of that is structuring the HTML as soundly as possible. Notably, a document should only have one <h1>, and lower heading levels - <h2>, <h3> - should be properly nested. Those various levels can be expressed in Markdown as described by the documentation. Where this becomes complicated is if you are trying to keep a clean hierarchy in different contexts. For example, in a page with a single article visible, the article title will be <h1>, and headings inside the article should start with <h2>. But if articles are printed in full in a category page, the category itself must be the <h1>, the article titles must be <h2>, and then the internal headings must start with <h3>! Additionally, needing to remember to start headings with ## (or is it ###, then?) instead of # in the source Markdown documents seems annoying and error-prone.

The second problem is easier to solve than the first. The Table Of Contents extension for Python-Markdown includes a baselevel setting doing exactly what I wanted, which is interpreting <h1> in Markdown as actually being <h2> (or is it <h3>?). In Pelican, this is enabled by adding the MARKDOWN variable in settings, copying its defaults and adding both the Table Of Contents extension and the baselevel setting, like this:

MARKDOWN = {
    'extension_configs': {
        'markdown.extensions.codehilite': {'css_class': 'highlight'},
        'markdown.extensions.extra': {},
        'markdown.extensions.meta': {},
        'markdown.extensions.toc': {'marker': '', 'baselevel': 2}
    },
    'output_format': 'html5',
}
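To see what baselevel does outside of Pelican, here is a small sketch that calls Python-Markdown directly with the same extension and setting. The sample text is made up, and this only approximates what Pelican does internally with the MARKDOWN settings:

```python
import markdown  # the Python-Markdown package, used by Pelican for .md content

# Made-up sample content: one top-level heading and one subheading.
source = "# Main heading\n\nSome text.\n\n## Subheading\n"

html = markdown.markdown(
    source,
    extensions=["markdown.extensions.toc"],
    # baselevel 2 means a Markdown '#' is rendered as <h2>, '##' as <h3>, etc.
    extension_configs={"markdown.extensions.toc": {"baselevel": 2}},
)
print(html)
```

With this in place, the Markdown source can keep using # for its top-level headings, and the output never contains an <h1> that would compete with the article title.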

This leaves the issue of context-sensitive heading nesting. I was stuck on this for a while, until az of Geeks For Social Change asked the right question: why have the entire post in index pages anyway, if the single view exists? Sometimes, the best answer to "how do I do this" actually is "don't". Therefore, my templates put the article summaries in index views, and the full thing is only copied into single-article pages, making the relevant base heading level <h2> in that scenario.

Third problem: images! Markdown does have a syntax for images, but it just generates a bare <img> element with alt-text and an optional title, and doesn't handle captions or anything more structured. Previously I hand-copied the same chunk of HTML over and over again inside Markdown pages, since you can do that, which is error-prone and looks ugly. But hey, this whole thing runs a template engine, surely we can use a macro of some sort to solve this problem, right? Not by default, since Markdown content is converted to HTML first and only then handed to the Jinja2 templates as finished content, so template syntax inside a post is never evaluated. However, as is often the case, someone else got annoyed about this and wrote a plugin solving this exact problem, called jinja2content. With this installed, I added a content_macros.html file to my theme templates, with things like this:

{% macro clickable_image_with_alt(url_small, url_big, alt) -%}
    <a href="{{ url_big }}"><img src="{{ url_small }}" alt="{{ alt }}"></a>
{%- endmacro %}

The Markdown file can then include something like this:

{% from 'content_macros.html' import clickable_image_with_alt %}
{{ clickable_image_with_alt("{static}/images/MyCoolImageSmall.png", "{static}/images/MyCoolImageBig.png", "Suitable alt-text goes here.") }}

This gives me an <img> element with alt-text, embedded inside an <a> which turns it into a clickable link to a larger version of the image. I use this format a lot with craft pictures because I cannot be bothered to include a lightbox zoom library and I like to avoid JavaScript as much as possible. For captions, I instead generate the <img> inside a <figure>, paired with a <figcaption>.
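For the caption variant, a macro along these lines would do the job. This is a sketch rather than my exact template: the macro name captioned_image, the image path and the texts are all hypothetical, and the in-memory DictLoader merely stands in for the theme's template folder and a Markdown file so the result can be rendered with plain Jinja2:

```python
from jinja2 import DictLoader, Environment

# Stand-ins for the theme's macro file and for a Markdown post using it.
# 'captioned_image' and the sample values are hypothetical.
TEMPLATES = {
    "content_macros.html": (
        "{% macro captioned_image(url, alt, caption) -%}\n"
        "<figure>\n"
        '    <img src="{{ url }}" alt="{{ alt }}">\n'
        "    <figcaption>{{ caption }}</figcaption>\n"
        "</figure>\n"
        "{%- endmacro %}\n"
    ),
    "post.md": (
        "{% from 'content_macros.html' import captioned_image %}\n"
        '{{ captioned_image("{static}/images/MyCoolImage.png", '
        '"Suitable alt-text goes here.", "Suitable caption goes here.") }}\n'
    ),
}

env = Environment(loader=DictLoader(TEMPLATES))
rendered = env.get_template("post.md").render()
# The {static} placeholder is left as-is here; Pelican resolves it later.
print(rendered)
```

The two lines inside "post.md" are what would go into the actual Markdown file, the same way as with the clickable image macro.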

The last "tricks" are significantly less far-fetched: adding the sitemap plugin to generate a sitemap, listing that in a robots.txt file, and adding a whole bunch of exclusion clauses to try and stop the tide of so-called-AI content scrapers somewhat. One useful bit there is setting up special copy paths for such files in order to make them end up unmodified at the website root, like this:

STATIC_PATHS = ['images', 'extras']
EXTRA_PATH_METADATA = {
    'extras/robots.txt': {'path': 'robots.txt'},
    'extras/.htaccess': {'path': '.htaccess'},
}
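As for the exclusion clauses themselves, here is a minimal sketch of what such a robots.txt can contain, checked with Python's standard library parser. The two crawler names and the example.com address are illustrative assumptions, not my actual list:

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: two example crawlers are shut out entirely,
# everyone else is allowed, and the sitemap location is advertised.
# The crawler names and the domain are assumptions for this sketch.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask the parser what a given user agent is allowed to fetch.
blocked = parser.can_fetch("GPTBot", "https://example.com/")
allowed = parser.can_fetch("Mozilla/5.0", "https://example.com/")
sitemaps = parser.site_maps()  # requires Python 3.8 or newer

print(blocked, allowed, sitemaps)
```

Keep in mind that robots.txt is only a polite request: it does nothing against scrapers that choose to ignore it, hence the "somewhat".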

And with all that plus some CSS fun, the result is a website that I'm happy with, hopefully mostly sound from an accessibility perspective, and most importantly not a colossal pain in the butt to work with.