Generating a custom XML sitemap for a multilingual Gatsby site
This blog (route360.dev) is a multilingual website generated by Gatsby.js.
In this entry, I'll explain how to customize the XML sitemap with Gatsby's official gatsby-plugin-sitemap.
Check out the repository of this blog if you're interested. The code in this entry is simplified for the sake of explanation.
Environment:
- gatsby v5.10.0
- gatsby-plugin-sitemap v6.10.0
- react v18.2.0
- node v18.16.0
Prerequisite
This entry assumes a website with Markdown content. If you use a CMS, adapt the GraphQL query to your own data source.
URL paths
The URL paths of this blog are as follows:
- Single entry pages
/[lang]/post/[slug]/
- Individual pages (e.g. the about page)
/[lang]/[slug]/
- Tag archive pages
/[lang]/tag/[slug]/
- Tag archive pages (page 2 and after)
/[lang]/tag/[slug]/page/[num]/
- Front page
/[lang]/
- Front page (page 2 and after)
/[lang]/page/[num]/
Key points:
- Translated pages share the same slug
- The language code is placed right after the root domain
If your translated pages have their own slugs, or if the default language code doesn't appear in the URL, adapt the example code accordingly.
The code also handles cases where the number of pages differs by locale.
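To make the URL scheme concrete, here is a minimal sketch of how a path can be split into its language code and the rest of the path. The helper name splitLocalizedPath is hypothetical (it is not part of the plugin or of this blog's code), and it assumes two-letter lowercase language codes.

```javascript
// Hypothetical helper: split a localized path into its language code and
// the remainder. Assumes two-letter lowercase codes placed right after "/".
function splitLocalizedPath(path) {
  const match = path.match(/^\/([a-z]{2})(\/.*)$/)
  // Paths without a language prefix (e.g. "/404.html") are returned as-is
  if (!match) return { lang: null, rest: path }
  return { lang: match[1], rest: match[2] }
}

const post = splitLocalizedPath("/en/post/gatsby-i18n/")
const front = splitLocalizedPath("/ja/")
const notLocalized = splitLocalizedPath("/404.html")
```

Two pages are translations of each other when their rest values are equal and their lang values differ.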
Goal
The goal is to generate a sitemap that follows Google's guidelines for localized pages, like this:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
  xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://www.example.com/english/page.html</loc>
    <xhtml:link
      rel="alternate"
      hreflang="de"
      href="https://www.example.de/deutsch/page.html"/>
    <xhtml:link
      rel="alternate"
      hreflang="de-ch"
      href="https://www.example.de/schweiz-deutsch/page.html"/>
    <xhtml:link
      rel="alternate"
      hreflang="en"
      href="https://www.example.com/english/page.html"/>
  </url>
  <url>
    <loc>https://www.example.de/deutsch/page.html</loc>
    <xhtml:link
      rel="alternate"
      hreflang="de"
      href="https://www.example.de/deutsch/page.html"/>
    <xhtml:link
      rel="alternate"
      hreflang="de-ch"
      href="https://www.example.de/schweiz-deutsch/page.html"/>
    <xhtml:link
      rel="alternate"
      hreflang="en"
      href="https://www.example.com/english/page.html"/>
  </url>
  <url>
    <loc>https://www.example.de/schweiz-deutsch/page.html</loc>
    <xhtml:link
      rel="alternate"
      hreflang="de"
      href="https://www.example.de/deutsch/page.html"/>
    <xhtml:link
      rel="alternate"
      hreflang="de-ch"
      href="https://www.example.de/schweiz-deutsch/page.html"/>
    <xhtml:link
      rel="alternate"
      hreflang="en"
      href="https://www.example.com/english/page.html"/>
  </url>
</urlset>
Code
Here is the code, which goes in gatsby-config.js.
module.exports = {
  plugins: [
    {
      resolve: "gatsby-plugin-sitemap",
      options: {
        query: `
          {
            site {
              siteMetadata {
                siteUrl
              }
            }
            allSitePage {
              nodes {
                path
              }
            }
          }`,
        resolvePages: ({ allSitePage: { nodes: allPages } }) => {
          const pages = allPages.map(page => {
            const alternateLangs = allPages
              .filter(
                alterPage =>
                  alterPage.path.replace(/\/.*?\//, "/") ===
                  page.path.replace(/\/.*?\//, "/")
              )
              .map(alterPage => alterPage.path.match(/^\/([a-z]{2})\//))
              .filter(match => match)
              .map(match => match[1])
            return {
              ...page,
              ...{ alternateLangs },
            }
          })
          return pages
        },
        serialize: ({ path, alternateLangs }) => {
          const pagepath = path.replace(/\/.*?\//, "/")
          const xhtmlLinks =
            alternateLangs.length > 1 &&
            alternateLangs.map(lang => ({
              rel: "alternate",
              hreflang: lang,
              url: `/${lang}${pagepath}`,
            }))
          let entry = {
            url: path,
            changefreq: "daily",
            priority: 0.7,
          }
          if (xhtmlLinks) {
            entry.links = xhtmlLinks
          }
          return entry
        },
      },
    },
  ],
}
*The actual code of this blog is more complicated because I added lastmod for each post.
Sitemap output example
<?xml version="1.0" encoding="UTF-8"?>
<urlset
  xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
  xmlns:image="http://www.google.com/schemas/sitemap-image/1.1"
  xmlns:news="http://www.google.com/schemas/sitemap-news/0.9"
  xmlns:video="http://www.google.com/schemas/sitemap-video/1.1"
  xmlns:xhtml="http://www.w3.org/1999/xhtml"
>
  <url>
    <loc>https://route360.dev/en/post/gatsby-i18n/</loc>
    <changefreq>daily</changefreq>
    <priority>0.7</priority>
    <xhtml:link rel="alternate" hreflang="en" href="https://route360.dev/en/post/gatsby-i18n/" />
    <xhtml:link rel="alternate" hreflang="fr" href="https://route360.dev/fr/post/gatsby-i18n/" />
    <xhtml:link rel="alternate" hreflang="ja" href="https://route360.dev/ja/post/gatsby-i18n/" />
  </url>
  <url>
    <loc>https://route360.dev/en/post/codeium/</loc>
    <changefreq>daily</changefreq>
    <priority>0.7</priority>
    <xhtml:link rel="alternate" hreflang="en" href="https://route360.dev/en/post/codeium/" />
    <xhtml:link rel="alternate" hreflang="fr" href="https://route360.dev/fr/post/codeium/" />
    <xhtml:link rel="alternate" hreflang="ja" href="https://route360.dev/ja/post/codeium/" />
  </url>
  <!-- omitted below -->
</urlset>
The sitemap of this blog is here.
What I do in the code
Here is an overview of the code above:
- Go through all the page paths and, for each path, collect the language codes of the pages that share the same slug (including the page itself) into an array named alternateLangs
- If the length of alternateLangs (the number of locales) is 2 or more, add <xhtml:link rel="alternate" hreflang="lang_code" href="page_path" /> to the url element
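The two steps above can be tried outside Gatsby. The following is a small sketch that runs the same regular expressions on a handful of made-up paths; the sample paths and the variable names are assumptions for illustration only.

```javascript
// Made-up sample paths: post "a" exists in two languages, post "b" in one
const samplePaths = ["/en/post/a/", "/ja/post/a/", "/en/post/b/", "/en/", "/ja/"]

// Same expressions as in the config: drop the first path segment,
// and capture a two-letter language code
const stripLang = path => path.replace(/\/.*?\//, "/")
const langOf = path => (path.match(/^\/([a-z]{2})\//) || [])[1]

// Step 1: attach the language codes of all pages sharing the same slug
const annotated = samplePaths.map(path => ({
  path,
  alternateLangs: samplePaths
    .filter(other => stripLang(other) === stripLang(path))
    .map(langOf)
    .filter(Boolean),
}))

// Step 2: only pages available in 2 or more languages get xhtml:link elements
const pathsWithLinks = annotated
  .filter(page => page.alternateLangs.length > 1)
  .map(page => page.path)

console.log(annotated)
console.log(pathsWithLinks)
```

In this sample, /en/post/b/ has no translation, so it is the only path left out of pathsWithLinks.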
Generating an array of language code(s) for each path
First, the first half.
module.exports = {
  plugins: [
    {
      resolve: "gatsby-plugin-sitemap",
      options: {
        // ...
        resolvePages: ({ allSitePage: { nodes: allPages } }) => {
          const pages = allPages.map(page => {
            const alternateLangs = allPages
              // Extract translated pages (including the URL itself) for each URL path
              // ex) /en/first-post/ and /ja/first-post/ -> true
              .filter(
                alterPage =>
                  alterPage.path.replace(/\/.*?\//, "/") ===
                  page.path.replace(/\/.*?\//, "/")
              )
              // Match the language code at the start of each translated page path
              .map(alterPage => alterPage.path.match(/^\/([a-z]{2})\//))
              // Drop null matches (.filter(Boolean) works as well)
              .filter(match => match)
              // Keep only the captured language codes
              .map(match => match[1])
            return {
              ...page,
              ...{ alternateLangs }, // Add the language codes array
            }
          })
          return pages
        },
        // ...
      },
    },
  ],
}
Regular expressions extract the following strings from each URL path:
- language code(s)
- URL paths without their language code
In the case of this blog, I could get language codes from pageContext or slugs from markdownRemark because I've added them in gatsby-node.js. I didn't do that here, to keep the code easy to explain and reusable.
If you use a CMS, you could get the language code of each post from GraphQL.
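As a rough sketch of the pageContext alternative, the function below groups pages by their path without the language prefix, reading the language from the page context instead of a regular expression. It assumes that pageContext is added to the nodes of allSitePage in the query and that each localized page carries a lang field in its context; both the field name lang and the function name resolvePagesFromContext are hypothetical, so adjust them to whatever your gatsby-node.js actually passes.

```javascript
// Sketch only: `pageContext.lang` is an assumed field set in gatsby-node.js
function resolvePagesFromContext({ allSitePage: { nodes } }) {
  // Path with the "/<lang>" prefix removed, or null for pages without a language
  const keyOf = ({ path, pageContext }) => {
    const lang = pageContext && pageContext.lang
    return lang && path.startsWith(`/${lang}/`) ? path.slice(lang.length + 1) : null
  }

  // First pass: collect the language codes available for each key
  const langsByKey = new Map()
  for (const node of nodes) {
    const key = keyOf(node)
    if (key === null) continue
    langsByKey.set(key, [...(langsByKey.get(key) || []), node.pageContext.lang])
  }

  // Second pass: attach the collected codes to every page
  return nodes.map(node => ({
    path: node.path,
    alternateLangs: langsByKey.get(keyOf(node)) || [],
  }))
}

// Usage with made-up data shaped like the query result
const resolved = resolvePagesFromContext({
  allSitePage: {
    nodes: [
      { path: "/en/post/a/", pageContext: { lang: "en" } },
      { path: "/ja/post/a/", pageContext: { lang: "ja" } },
      { path: "/404/", pageContext: {} },
    ],
  },
})
```

Grouping with a Map visits each page twice instead of comparing every page with every other page, which may matter on sites with many pages.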
Adding xhtml:link only if the number of languages is 2 or more
Next, the second half.
module.exports = {
  plugins: [
    {
      resolve: "gatsby-plugin-sitemap",
      options: {
        //...
        serialize: ({ path, alternateLangs }) => {
          // Get page path without language code
          const pagepath = path.replace(/\/.*?\//, "/")
          // Generate xhtml for translated pages (including the URL itself)
          const xhtmlLinks =
            alternateLangs.length > 1 && // If the number of translations is 2 or more
            alternateLangs.map(lang => ({
              rel: "alternate",
              hreflang: lang,
              url: `/${lang}${pagepath}`,
            }))
          // Default <url> element
          let entry = {
            url: path,
            changefreq: "daily",
            priority: 0.7,
          }
          // Add child element <xhtml:link rel="alternate" hreflang="lang"> to <url> if translations are available
          if (xhtmlLinks) {
            entry.links = xhtmlLinks
          }
          return entry
        },
      },
    },
  ],
}
With the above code, <xhtml:link rel="alternate" hreflang="lang"> under <url> is generated and added only when translations are available.
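To check this behavior without running a build, the serialize logic can be copied into a plain function and called by hand. This is a minimal sketch: the name serializeEntry and the sample inputs are made up for illustration, and the objects it returns are the entries handed back to the plugin, not the final XML.

```javascript
// Standalone copy of the serialize logic, for experimenting in Node
const serializeEntry = ({ path, alternateLangs }) => {
  // Page path without the language code
  const pagepath = path.replace(/\/.*?\//, "/")
  const entry = { url: path, changefreq: "daily", priority: 0.7 }
  // Only add links when the page exists in 2 or more languages
  if (alternateLangs.length > 1) {
    entry.links = alternateLangs.map(lang => ({
      rel: "alternate",
      hreflang: lang,
      url: `/${lang}${pagepath}`,
    }))
  }
  return entry
}

// A page with a translation, and a page without one
const translated = serializeEntry({ path: "/en/post/a/", alternateLangs: ["en", "ja"] })
const single = serializeEntry({ path: "/en/post/b/", alternateLangs: ["en"] })

console.log(translated.links.map(link => link.url)) // [ '/en/post/a/', '/ja/post/a/' ]
console.log("links" in single) // false
```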
That's it!