How to Find All Existing and Archived URLs on a Website

There are many reasons you might need to find every URL on a website, but your exact goal will determine what you're looking for. For example, you may want to:

Identify every indexed URL to analyze issues like cannibalization or index bloat
Collect current and historic URLs Google has seen, especially for site migrations
Find all 404 URLs to recover from post-migration errors
In each scenario, a single tool won't give you everything you need. Unfortunately, Google Search Console isn't exhaustive, and a "site:example.com" search is limited and difficult to extract data from.

In this article, I'll walk you through some tools to build your URL list before deduplicating the data using a spreadsheet or Jupyter Notebook, depending on your website's size.

Old sitemaps and crawl exports
If you're looking for URLs that recently disappeared from the live site, there's a chance someone on your team saved a sitemap file or a crawl export before the changes were made. If you haven't already, check for these files; they can often provide what you need. But if you're reading this, you probably didn't get so lucky.
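If you do turn up an old sitemap, pulling the URLs out of it takes only a few lines. Here's a minimal sketch in Python, assuming a standard single sitemap file saved locally (the file name is a placeholder; a sitemap index would need one pass per child sitemap):

```python
import xml.etree.ElementTree as ET

# Extract every <loc> URL from a saved sitemap file.
tree = ET.parse("old-sitemap.xml")  # placeholder file name
namespace = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
urls = [loc.text.strip() for loc in tree.getroot().findall(".//sm:loc", namespace)]

with open("sitemap_urls.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(urls))
print(f"{len(urls)} URLs recovered from the old sitemap")
```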

Archive.org
Archive.org is an invaluable tool for SEO tasks, funded by donations. If you search for a domain and select the "URLs" option, you can access up to 10,000 listed URLs.

However, there are some limitations:

URL limit: You can only retrieve up to 10,000 URLs, which is insufficient for larger sites.
Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
No export option: There isn't a built-in way to export the list.
To work around the lack of an export button, use a browser scraping plugin like Dataminer.io. However, these limitations mean Archive.org may not provide a complete solution for larger sites. Also, Archive.org doesn't indicate whether Google indexed a URL, but if Archive.org found it, there's a good chance Google did, too.
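If you'd rather skip the browser plugin, the Wayback Machine also exposes a CDX API you can query directly. Here's a minimal sketch, assuming the standard CDX endpoint and a placeholder domain; the quality caveats above still apply, so expect resource files and malformed entries in the output:

```python
import requests

# Query the Wayback Machine CDX API for URLs captured under a domain.
def wayback_urls(domain, limit=10000):
    resp = requests.get(
        "https://web.archive.org/cdx/search/cdx",
        params={
            "url": f"{domain}/*",   # everything under the domain
            "output": "json",
            "fl": "original",       # only return the original URL
            "collapse": "urlkey",   # deduplicate by normalized URL
            "limit": limit,
        },
        timeout=60,
    )
    resp.raise_for_status()
    rows = resp.json()
    return [row[0] for row in rows[1:]]  # skip the header row

urls = wayback_urls("example.com")  # placeholder domain
print(f"{len(urls)} URLs found in the Wayback Machine")
```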

Moz Pro
Although you would typically use a link index to find external sites linking to you, these tools also discover URLs on your own site in the process.


How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs from your site. If you're dealing with a massive website, consider using the Moz API to export data beyond what's manageable in Excel or Google Sheets.

It's important to note that Moz Pro doesn't confirm whether URLs are indexed or discovered by Google. However, since most sites apply the same robots.txt rules to Moz's bots as they do to Google's, this approach generally works well as a proxy for Googlebot's discoverability.
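Once you have the inbound links CSV from Moz Pro, extracting a clean list of target URLs is a one-step job. A small sketch, assuming a pandas workflow; the file name and the "Target URL" column header are assumptions you should check against your actual export:

```python
import pandas as pd

# Pull the unique target URLs out of a Moz Pro inbound-links export.
links = pd.read_csv("moz_inbound_links.csv")  # placeholder file name
target_urls = (
    links["Target URL"]   # assumed column header; verify in your export
    .dropna()
    .str.strip()
    .drop_duplicates()
    .sort_values()
)
target_urls.to_csv("moz_target_urls.csv", index=False)
print(f"{len(target_urls)} unique target URLs")
```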

Google Search Console
Google Search Console offers several valuable sources for building your list of URLs.

Links reports:


Similar to Moz Pro, the Links section provides exportable lists of target URLs. Unfortunately, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don't apply to the export, you might need to rely on browser scraping tools, limited to 500 filtered URLs at a time. Not ideal.

Performance → Search results:


This export gives you a list of pages receiving search impressions. While the export is limited, you can use the Google Search Console API for larger datasets. There are also free Google Sheets plugins that simplify pulling more extensive data.
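If you go the API route, the searchanalytics.query endpoint lets you page well past the UI export cap. A rough sketch, assuming a service account with read access to the property; the key file, property URL, and date range below are placeholders:

```python
from googleapiclient.discovery import build
from google.oauth2 import service_account

# Page-level export via the Search Console API (searchanalytics.query).
creds = service_account.Credentials.from_service_account_file(
    "service-account.json",  # placeholder key file
    scopes=["https://www.googleapis.com/auth/webmasters.readonly"],
)
service = build("searchconsole", "v1", credentials=creds)

pages, start_row = [], 0
while True:
    response = service.searchanalytics().query(
        siteUrl="https://example.com/",  # placeholder property
        body={
            "startDate": "2024-01-01",
            "endDate": "2024-12-31",
            "dimensions": ["page"],
            "rowLimit": 25000,       # API maximum per request
            "startRow": start_row,   # paginate past that limit
        },
    ).execute()
    rows = response.get("rows", [])
    if not rows:
        break
    pages += [row["keys"][0] for row in rows]
    start_row += len(rows)

print(f"{len(pages)} pages with search impressions")
```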

Indexing → Pages report:


This section provides exports filtered by issue type, though these are also limited in scope.

Google Analytics
The Engagement → Pages and screens default report in GA4 is an excellent source for collecting URLs, with a generous limit of 100,000 URLs.


Even better, you can apply filters to create different URL lists, effectively surpassing the 100k limit. For example, if you want to export only blog URLs, follow these steps:

Step 1: Add a segment to your report

Step 2: Click "Create a new segment."


Step 3: Define the segment with a narrower URL pattern, such as URLs containing /blog/


Note: URLs found in Google Analytics might not be discoverable by Googlebot or indexed by Google, but they provide important insights.
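If building segments in the UI becomes tedious, the GA4 Data API can produce the same filtered lists programmatically. A sketch along those lines, assuming the google-analytics-data Python client, Application Default Credentials, and a placeholder property ID:

```python
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Filter, FilterExpression, Metric, RunReportRequest,
)

# Pull page paths containing /blog/ from the GA4 Data API.
client = BetaAnalyticsDataClient()  # uses GOOGLE_APPLICATION_CREDENTIALS
request = RunReportRequest(
    property="properties/123456789",  # placeholder property ID
    dimensions=[Dimension(name="pagePath")],
    metrics=[Metric(name="screenPageViews")],
    date_ranges=[DateRange(start_date="2024-01-01", end_date="2024-12-31")],
    dimension_filter=FilterExpression(
        filter=Filter(
            field_name="pagePath",
            string_filter=Filter.StringFilter(
                value="/blog/",
                match_type=Filter.StringFilter.MatchType.CONTAINS,
            ),
        )
    ),
    limit=100000,
)
response = client.run_report(request)
blog_paths = [row.dimension_values[0].value for row in response.rows]
print(f"{len(blog_paths)} blog paths")
```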

Server log files
Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive list of every URL path requested by users, Googlebot, or other bots during the recorded period (a minimal parsing sketch follows the considerations below).

Considerations:

Data size: Log files can be huge, and many sites only retain the last two months of data.
Complexity: Analyzing log files can be challenging, but many tools are available to simplify the process.
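As a starting point, here's a minimal sketch that pulls unique request paths out of an access log in the common/combined format; the file name is a placeholder, and CDN logs may use a different layout that needs a different pattern:

```python
import re
from urllib.parse import urlsplit

# Extract unique URL paths from a common/combined-format access log.
REQUEST_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+) HTTP/[^"]*"')

paths = set()
with open("access.log", encoding="utf-8", errors="replace") as log:  # placeholder
    for line in log:
        match = REQUEST_RE.search(line)
        if match:
            # Strip query strings so /page?a=1 and /page?a=2 collapse.
            paths.add(urlsplit(match.group(1)).path)

with open("log_paths.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(sorted(paths)))
print(f"{len(paths)} unique paths")
```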
Combine, and good luck
Once you've gathered URLs from all these sources, it's time to combine them. If your site is small enough, use Excel; for larger datasets, use tools like Google Sheets or a Jupyter Notebook. Ensure all URLs are consistently formatted, then deduplicate the list.
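If you end up in a Jupyter Notebook, the combine, normalize, and dedupe step might look something like this sketch with pandas; the input file names are placeholders for whatever exports you actually collected, each assumed to be a single column of URLs:

```python
import pandas as pd
from urllib.parse import urlsplit, urlunsplit

# Combine URL lists from the sources above, normalize them, and dedupe.
sources = ["archive_urls.csv", "moz_urls.csv", "gsc_urls.csv",
           "ga4_urls.csv", "log_urls.csv"]  # placeholder file names
frames = [pd.read_csv(path, header=None, names=["url"]) for path in sources]
urls = pd.concat(frames, ignore_index=True)["url"].dropna().astype(str)

def normalize(url: str) -> str:
    # Lowercase the scheme and host, drop fragments, and strip trailing
    # slashes so trivially different variants collapse to one row.
    parts = urlsplit(url.strip())
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path.rstrip("/") or "/", parts.query, ""))

deduped = urls.map(normalize).drop_duplicates().sort_values()
deduped.to_csv("all_urls.csv", index=False)
print(f"{len(deduped)} unique URLs")
```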

And voilà: you now have a comprehensive list of current, old, and archived URLs. Good luck!
