Recently I needed to analyse the text content of all the pages and news posts in a few SharePoint Online site collections and upload it to Azure Blob Storage.
After checking whether https://pnp.github.io/script-samples already had a ready-made script sample that I could reuse and adapt, I ended up writing my own.
This turned out to be as easy as fetching the CanvasContent1 list field for each page and dumping it as HTML to a file. Additional metadata, such as the author and publishing date, can be included as needed.
The script is below; you can also view it on the PnP Script Samples website at https://pnp.github.io/script-samples/spo-export-page-html/README.html?tabs=pnpps. As with every script sample, please check what the script does and run it in a test environment first.
$url = "<spo site url>"
$destFolder = "C:\SitePages"
$ErrorActionPreference = 'Stop'
# Create the destination folder if it doesn't exist
mkdir $destFolder -ErrorAction:SilentlyContinue | Out-Null
# Connect to SPO using PnP PowerShell
Connect-PnPOnline $url -Interactive
# Get all the pages. Credits to https://pnp.github.io/script-samples/spo-export-stream-classic-webparts/README.html to filter only pages
$list = Get-PnPList "SitePages"
$pageItems = Get-PnPListItem -List $list -Fields CanvasContent1,Title,FileLeafRef | Where-Object { $_["FileLeafRef"] -like "*.aspx" }
foreach ($pageItem in $pageItems)
{
    try
    {
        # Save the HTML content of each page to a .html file
        $content = $pageItem["CanvasContent1"]
        $filename = $pageItem["FileLeafRef"]
        # Additional metadata could be added here in its own paragraph
        $prefix = "<div><h1>$($pageItem["Title"])</h1></div>"
        $prefix + $content | Out-File -LiteralPath "$($destFolder)\$($filename.Replace(".aspx",".html"))"
    }
    catch
    {
        Write-Host $_
    }
}
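To include the extra metadata mentioned earlier, the $prefix line can be extended. This is just a sketch, not part of the published sample: it assumes Author and FirstPublishedDate are added to the -Fields list of Get-PnPListItem, and FirstPublishedDate can be empty for pages that were never published.

```powershell
# Hypothetical extension of the $prefix line: also emit author and publishing date.
# Requires adding Author,FirstPublishedDate to Get-PnPListItem -Fields.
$author = $pageItem["Author"].LookupValue        # person field, exposed as FieldUserValue
$published = $pageItem["FirstPublishedDate"]     # may be empty for never-published pages
$prefix = "<div><h1>$($pageItem["Title"])</h1>" +
          "<p>Author: $author</p>" +
          "<p>Published: $published</p></div>"
```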
Below is an example of the source page and the resulting HTML file:
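To complete the scenario from the introduction, the exported files can then be pushed to Azure Blob Storage. A minimal sketch, assuming the Az.Storage module is installed; the storage account name, key, and container name are placeholders:

```powershell
# Upload the exported .html files to a blob container (placeholder names).
$ctx = New-AzStorageContext -StorageAccountName "<account name>" -StorageAccountKey "<account key>"
Get-ChildItem $destFolder -Filter *.html | ForEach-Object {
    Set-AzStorageBlobContent -File $_.FullName -Container "<container name>" -Blob $_.Name -Context $ctx -Force
}
```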