Recently I needed to analyse the text content of all the pages and news posts in a few SharePoint Online site collections and upload it to Azure Blob Storage.

After checking whether https://pnp.github.io/script-samples already had a ready-made sample I could reuse and adapt, I ended up writing my own script.

This turned out to be as easy as fetching the CanvasContent1 list field and dumping it as HTML to a file. Additional metadata, such as the author and publishing date, can be included as needed.
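For instance, the author and last-modified date can be retrieved together with the page content. This is only a sketch, not part of the original script: Author and Modified are standard SharePoint columns, but verify the field names against your own site before relying on them. Inside the export loop of the script below, the metadata could be folded into the HTML prefix like this:

```powershell
# Sketch: request the extra fields together with the page content
$pageItems = Get-PnPListItem -List $list -Fields CanvasContent1,Title,FileLeafRef,Author,Modified |
    Where-Object { $_["FileLeafRef"] -like "*.aspx" }

# Inside the loop: Author is a user field, so read its .LookupValue
$author   = $pageItem["Author"].LookupValue
$modified = $pageItem["Modified"]
$prefix   = "<div><h1>$($pageItem["Title"])</h1><p>By $author, last modified $modified</p></div>"
```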

The script is below; you can also view it on the PnP Script Samples website at https://pnp.github.io/script-samples/spo-export-page-html/README.html?tabs=pnpps . As with every script sample, review what the script does and run it in a test environment first.

$url = "<spo site url>"
$destFolder = "C:\SitePages"

$ErrorActionPreference = 'Stop'

# Create the destination folder if it doesn't exist
New-Item -ItemType Directory -Path $destFolder -Force | Out-Null

# Connect to SPO using PnP PowerShell
Connect-PnPOnline $url -Interactive
# Get all the pages. Credits to https://pnp.github.io/script-samples/spo-export-stream-classic-webparts/README.html for the filter that keeps only pages
$list = Get-PnPList "SitePages"
$pageItems = Get-PnPListItem -List $list -PageSize 500 -Fields CanvasContent1,Title,FileLeafRef | Where-Object { $_["FileLeafRef"] -like "*.aspx" }
foreach ($pageItem in $pageItems)
{
    try
    {
        # Save the html content of each page to a .html file
        $content = $pageItem["CanvasContent1"]
        $filename = $pageItem["FileLeafRef"]
        # Additional metadata could be added here in its own paragraph
        $prefix = "<div><h1>$($pageItem["Title"])</h1></div>"
        ($prefix + $content) | Out-File -LiteralPath "$($destFolder)\$($filename.Replace(".aspx",".html"))" -Encoding utf8
    }
    catch
    {
        Write-Warning "Failed to export '$($pageItem["FileLeafRef"])': $_"
    }
}
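The exported files can then be uploaded to Azure Blob Storage, which was the goal mentioned at the start. The original script does not include this step, so the following is only a sketch using the Az.Storage module; the storage account name, account key, and container name are placeholders to replace with your own:

```powershell
# Sketch: upload every exported .html file to a blob container (requires the Az.Storage module)
Import-Module Az.Storage

$context = New-AzStorageContext -StorageAccountName "<storage account>" -StorageAccountKey "<account key>"

Get-ChildItem -Path $destFolder -Filter *.html | ForEach-Object {
    Set-AzStorageBlobContent -File $_.FullName `
                             -Container "sitepages" `
                             -Blob $_.Name `
                             -Context $context `
                             -Force | Out-Null
}
```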

Below is an example of a source page and the resulting HTML file:
