Route360

Generating a custom XML sitemap for a multilingual Gatsby site

This blog (route360.dev) is a multilingual website generated by Gatsby.js.

In this entry, I'll explain how to customize the XML sitemap with Gatsby's official gatsby-plugin-sitemap plugin.

Check out the repository of this blog if you're interested. The code in this entry is simplified for the sake of explanation.

Environment:

  • gatsby v5.10.0
  • gatsby-plugin-sitemap v6.10.0
  • react v18.2.0
  • node v18.16.0

Prerequisite

This entry assumes a website with Markdown content. If you use a CMS, adapt the query to your own data source.

URL paths

The URL paths of this blog are as follows:

  • Single entry pages: /[lang]/post/[slug]/
  • Individual pages (e.g. the about page): /[lang]/[slug]/
  • Tag archive pages: /[lang]/tag/[slug]/
  • Tag archive pages (page 2 onward): /[lang]/tag/[slug]/page/[num]/
  • Front page: /[lang]/
  • Front page (page 2 onward): /[lang]/page/[num]/

Key points:

  • Translated page paths share the same slug
  • The language code is placed right after the root domain

If your translated pages have their own slugs, or if the default language code doesn't appear in the URL, adjust the example code accordingly.

The code also handles the case where the number of pages differs by locale, i.e. where some pages aren't translated into every language.
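To make the path convention concrete, here is a minimal sketch of splitting a path into its language code and the remaining path. splitLocalePath is a hypothetical helper, not part of Gatsby or of this blog's code, and it assumes two-letter lowercase language codes as in the paths listed above. Unlike the simplified regular expression used later in this entry, it leaves paths without a language prefix untouched, which may be a useful starting point if your default language has no prefix.

```javascript
// Hypothetical helper: split a path into its language code and the rest.
// Assumes two-letter lowercase language codes (en, fr, ja, ...).
const LANG_PREFIX = /^\/([a-z]{2})(\/.*)$/

function splitLocalePath(path) {
  const match = path.match(LANG_PREFIX)
  if (!match) {
    // No language prefix, e.g. "/about/" on a site whose default
    // language is served without a prefix
    return { lang: null, pagepath: path }
  }
  return { lang: match[1], pagepath: match[2] }
}

console.log(splitLocalePath("/en/post/gatsby-i18n/"))
// { lang: 'en', pagepath: '/post/gatsby-i18n/' }
```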

Goal

The goal is to generate a sitemap in the format described in Google's guidelines, as follows:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
  xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://www.example.com/english/page.html</loc>
    <xhtml:link
               rel="alternate"
               hreflang="de"
               href="https://www.example.de/deutsch/page.html"/>
    <xhtml:link
               rel="alternate"
               hreflang="de-ch"
               href="https://www.example.de/schweiz-deutsch/page.html"/>
    <xhtml:link
               rel="alternate"
               hreflang="en"
               href="https://www.example.com/english/page.html"/>
  </url>
  <url>
    <loc>https://www.example.de/deutsch/page.html</loc>
    <xhtml:link
               rel="alternate"
               hreflang="de"
               href="https://www.example.de/deutsch/page.html"/>
    <xhtml:link
               rel="alternate"
               hreflang="de-ch"
               href="https://www.example.de/schweiz-deutsch/page.html"/>
    <xhtml:link
               rel="alternate"
               hreflang="en"
               href="https://www.example.com/english/page.html"/>
  </url>
  <url>
    <loc>https://www.example.de/schweiz-deutsch/page.html</loc>
    <xhtml:link
               rel="alternate"
               hreflang="de"
               href="https://www.example.de/deutsch/page.html"/>
    <xhtml:link
               rel="alternate"
               hreflang="de-ch"
               href="https://www.example.de/schweiz-deutsch/page.html"/>
    <xhtml:link
               rel="alternate"
               hreflang="en"
               href="https://www.example.com/english/page.html"/>
  </url>
</urlset>

Code

Here is the code.

gatsby-config.js
module.exports = {
  plugins: [
    {
      resolve: "gatsby-plugin-sitemap",
      options: {
        query: `
        {
          site {
            siteMetadata {
              siteUrl
            }
          }
          allSitePage {
            nodes {
              path
            }
          }
        }`,
        resolvePages: ({ allSitePage: { nodes: allPages } }) => {
          const pages = allPages.map(page => {
            const alternateLangs = allPages
              .filter(
                alterPage =>
                  alterPage.path.replace(/\/.*?\//, "/") ===
                  page.path.replace(/\/.*?\//, "/")
              )
              .map(alterPage => alterPage.path.match(/^\/([a-z]{2})\//))
              .filter(match => match)
              .map(match => match[1])

            return {
              ...page,
              ...{ alternateLangs },
            }
          })

          return pages
        },
        serialize: ({ path, alternateLangs }) => {
          const pagepath = path.replace(/\/.*?\//, "/")

          const xhtmlLinks =
            alternateLangs.length > 1 &&
            alternateLangs.map(lang => ({
              rel: "alternate",
              hreflang: lang,
              url: `/${lang}${pagepath}`,
            }))

          let entry = {
            url: path,
            changefreq: "daily",
            priority: 0.7,
          }

          if (xhtmlLinks) {
            entry.links = xhtmlLinks
          }

          return entry
        },
      },
    },
  ],
}

Note: the actual code of this blog is more complicated because it also adds lastmod to each post.
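For reference, adding lastmod could look roughly like the sketch below. This is not the code this blog actually uses. It assumes that resolvePages has already merged a hypothetical modified property (an ISO date string) into each page object, for example from your Markdown frontmatter.

```javascript
// Sketch only: `modified` is a hypothetical property that you would have to
// merge into each page object yourself in resolvePages.
const serializeWithLastmod = ({ path, modified }) => {
  const entry = {
    url: path,
    changefreq: "daily",
    priority: 0.7,
  }

  // Only emit <lastmod> when a date is actually known for this page
  if (modified) {
    entry.lastmod = modified
  }

  return entry
}

// The date below is made up for illustration
console.log(
  serializeWithLastmod({ path: "/en/post/codeium/", modified: "2023-06-01" })
)
```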

Sitemap output example

<?xml version="1.0" encoding="UTF-8"?>
<urlset
  xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
  xmlns:image="http://www.google.com/schemas/sitemap-image/1.1"
  xmlns:news="http://www.google.com/schemas/sitemap-news/0.9"
  xmlns:video="http://www.google.com/schemas/sitemap-video/1.1"
  xmlns:xhtml="http://www.w3.org/1999/xhtml"
>
  <url>
    <loc>https://route360.dev/en/post/gatsby-i18n/</loc>
    <changefreq>daily</changefreq>
    <priority>0.7</priority>
    <xhtml:link rel="alternate" hreflang="en" href="https://route360.dev/en/post/gatsby-i18n/" />
    <xhtml:link rel="alternate" hreflang="fr" href="https://route360.dev/fr/post/gatsby-i18n/" />
    <xhtml:link rel="alternate" hreflang="ja" href="https://route360.dev/ja/post/gatsby-i18n/" />
  </url>
  <url>
    <loc>https://route360.dev/en/post/codeium/</loc>
    <changefreq>daily</changefreq>
    <priority>0.7</priority>
    <xhtml:link rel="alternate" hreflang="en" href="https://route360.dev/en/post/codeium/" />
    <xhtml:link rel="alternate" hreflang="fr" href="https://route360.dev/fr/post/codeium/" />
    <xhtml:link rel="alternate" hreflang="ja" href="https://route360.dev/ja/post/codeium/" />
  </url>
  <!-- omitted below -->
</urlset>

The sitemap of this blog is here.

What I do in the code

An overview of the code above:

  1. Go through all the page paths and, for each path, collect the language codes of the pages sharing the same slug (including the page itself) into an array named alternateLangs
  2. If alternateLangs contains two or more language codes, add <xhtml:link rel="alternate" hreflang="lang_code" href="page_path" /> elements to the url element

Generating an array of language code(s) for each path

First, the first half: resolvePages.

module.exports = {
  plugins: [
    {
      resolve: "gatsby-plugin-sitemap",
      options: {
        // ...

        resolvePages: ({ allSitePage: { nodes: allPages } }) => {
          const pages = allPages.map(page => {
            const alternateLangs = allPages
              // Extract translated pages (including the URL itself) for each URL path
              // ex) /en/first-post/ and /ja/first-post/ -> true
              .filter(
                alterPage =>
                  alterPage.path.replace(/\/.*?\//, "/") ===
                  page.path.replace(/\/.*?\//, "/")
              )
              // Match the two-letter language code at the start of each translated path
              .map(alterPage => alterPage.path.match(/^\/([a-z]{2})\//))
              // Drop null results for paths without a language code (.filter(Boolean) works as well)
              .filter(match => match)
              // Keep only the captured language code
              .map(match => match[1])

            return {
              ...page,
              ...{ alternateLangs }, // Add the language codes array
            }
          })

          return pages
        },

        // ...
      },
    },
  ],
}

The regular expressions extract the following strings from each URL path:

  • the language code
  • the URL path without its language code
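To see what this produces, here is the same logic run outside of Gatsby on a few sample nodes. The paths are made up for illustration, and the function is simply the resolvePages option above extracted into a plain function.

```javascript
// Sample nodes shaped like allSitePage.nodes (the paths are made up)
const nodes = [
  { path: "/en/post/first-post/" },
  { path: "/ja/post/first-post/" },
  { path: "/en/about/" },
]

// Same logic as the resolvePages option, as a plain function
const resolvePages = ({ allSitePage: { nodes: allPages } }) =>
  allPages.map(page => {
    const alternateLangs = allPages
      .filter(
        alterPage =>
          alterPage.path.replace(/\/.*?\//, "/") ===
          page.path.replace(/\/.*?\//, "/")
      )
      .map(alterPage => alterPage.path.match(/^\/([a-z]{2})\//))
      .filter(Boolean)
      .map(match => match[1])

    return { ...page, alternateLangs }
  })

const pages = resolvePages({ allSitePage: { nodes } })

for (const { path, alternateLangs } of pages) {
  console.log(path, alternateLangs)
}
// /en/post/first-post/ [ 'en', 'ja' ]
// /ja/post/first-post/ [ 'en', 'ja' ]
// /en/about/ [ 'en' ]
```

The about page exists only in English here, so its alternateLangs holds a single language code.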

On this blog, I could instead read the language code from pageContext or the slug from markdownRemark, because I add both in gatsby-node.js. I avoided that here to keep the example simple and applicable to other setups.

If you use a CMS, you could get the language code of each post from GraphQL.
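As a rough sketch of the pageContext alternative: the code below assumes that gatsby-node.js passes a lang value through the context of createPage, and that the sitemap query selects pageContext alongside path. The field name lang and the sample nodes are assumptions for illustration, so rename them to match your own setup. It also groups the language codes in a single pass with a Map instead of filtering the whole page list for every page.

```javascript
// Assumed node shape: `pageContext.lang` is a hypothetical field set in gatsby-node.js
const nodes = [
  { path: "/en/post/codeium/", pageContext: { lang: "en" } },
  { path: "/fr/post/codeium/", pageContext: { lang: "fr" } },
  { path: "/en/about/", pageContext: { lang: "en" } },
]

// Strip the "/<lang>" prefix using the known language code instead of a regex
const stripLang = ({ path, pageContext }) =>
  pageContext && pageContext.lang && path.startsWith(`/${pageContext.lang}/`)
    ? path.slice(pageContext.lang.length + 1)
    : path

// Group the language codes by language-less path in a single pass
const langsByPagepath = new Map()
for (const node of nodes) {
  if (!node.pageContext || !node.pageContext.lang) continue
  const key = stripLang(node)
  const langs = langsByPagepath.get(key) || []
  langs.push(node.pageContext.lang)
  langsByPagepath.set(key, langs)
}

// Attach the grouped language codes to every page
const pages = nodes.map(node => ({
  path: node.path,
  alternateLangs: langsByPagepath.get(stripLang(node)) || [],
}))

console.log(pages)
```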

Next, the second half: serialize.

module.exports = {
  plugins: [
    {
      resolve: "gatsby-plugin-sitemap",
      options: {
        //...

        serialize: ({ path, alternateLangs }) => {
          // Get page path without language code
          const pagepath = path.replace(/\/.*?\//, "/")

          // Generate xhtml for translated pages (including the URL itself)
          const xhtmlLinks =
            alternateLangs.length > 1 && // If the number of translations is 2 or more
            alternateLangs.map(lang => ({
              rel: "alternate",
              hreflang: lang,
              url: `/${lang}${pagepath}`,
            }))

          // Default <url> element
          let entry = {
            url: path,
            changefreq: "daily",
            priority: 0.7,
          }

          // Add child element <xhtml:link rel="alternate" hreflang="lang"> to <url> if translations are available
          if (xhtmlLinks) {
            entry.links = xhtmlLinks
          }

          return entry
        },
      },
    },
  ],
}

With the above code, <xhtml:link rel="alternate" hreflang="lang"> under <url> is generated and added only when translations are available.
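You can check this behavior by calling the same serialize function by hand. The two input objects below are made-up examples of what resolvePages returns: one page with three translations and one page that exists in a single language.

```javascript
// Same logic as the serialize option, as a plain function
const serialize = ({ path, alternateLangs }) => {
  const pagepath = path.replace(/\/.*?\//, "/")

  const xhtmlLinks =
    alternateLangs.length > 1 &&
    alternateLangs.map(lang => ({
      rel: "alternate",
      hreflang: lang,
      url: `/${lang}${pagepath}`,
    }))

  const entry = {
    url: path,
    changefreq: "daily",
    priority: 0.7,
  }

  if (xhtmlLinks) {
    entry.links = xhtmlLinks
  }

  return entry
}

// A page translated into three languages
const translated = serialize({
  path: "/en/post/codeium/",
  alternateLangs: ["en", "fr", "ja"],
})

// A page that exists in one language only
const single = serialize({ path: "/en/about/", alternateLangs: ["en"] })

console.log(translated.links.map(link => link.url))
console.log("links" in single)
```

The first entry carries one link per language (/en/post/codeium/, /fr/post/codeium/ and /ja/post/codeium/), while the second entry has no links property at all.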

That's it!
