
Sitemap links are not returned for some websites (e.g. https://www.sainsburys.co.uk) #17

@kvamsij

When you try to get the sitemap for a website like https://www.sainsburys.co.uk, it returns an empty array. But I have checked https://www.sainsburys.co.uk/robots.txt, and the sitemap URL exists in robots.txt.

After some digging, I found that the server was denying the request. This was the response:

`https://www.sainsburys.co.uk/robots.txt

<TITLE>Access Denied</TITLE>

Access Denied

You don't have permission to access "http://www.sainsburys.co.uk/robots.txt" on this server.


Reference #18.878f7b5c.1723733159.22628e9

https://errors.edgesuite.net/18.878f7b5c.1723733159.22628e9

`
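
A denied response like the one above is an HTML error page rather than a robots.txt file, so no sitemap lines can be parsed from it and the result ends up empty. As a rough sketch, a check along these lines could tell the two cases apart. The function name `looksLikeAccessDenied` is hypothetical, and the status codes it tests are an assumption, since the status code of the response above was not recorded:

```javascript
// Hypothetical helper, not part of the library: guesses whether a robots.txt
// response is really an "Access Denied" page like the one quoted above.
// Assumption: the server answers with 401/403 or with an HTML error page.
function looksLikeAccessDenied(statusCode, body) {
  const text = String(body || '');
  // Matches the <TITLE>Access Denied</TITLE> marker, case-insensitively.
  const hasDeniedTitle = /<title>\s*Access Denied\s*<\/title>/i.test(text);
  return statusCode === 401 || statusCode === 403 || hasDeniedTitle;
}
```

With a check like this, a blocked request could be reported as an error instead of silently producing an empty array.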

I can see that no headers are sent when requesting the robots.txt URL. So I added the following headers to the get.concat call, and it worked for me:
headers: { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8', 'Accept-Language': 'en-US,en;q=0.9', 'Accept-Encoding': 'gzip, deflate, br' }
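
For illustration, the change could look roughly like the sketch below. The helper name `buildRobotsRequest` is hypothetical, and the commented usage assumes `get.concat` is the one from the `simple-get` package, which accepts an options object with `url` and `headers`:

```javascript
// Browser-like headers, copied from the list above.
const BROWSER_HEADERS = {
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
  'Accept-Language': 'en-US,en;q=0.9',
  'Accept-Encoding': 'gzip, deflate, br',
};

// Hypothetical helper: builds the request options for a site's robots.txt.
// Callers may override or extend the default headers via extraHeaders.
function buildRobotsRequest(siteUrl, extraHeaders = {}) {
  const robotsUrl = new URL('/robots.txt', siteUrl).href;
  return { url: robotsUrl, headers: { ...BROWSER_HEADERS, ...extraHeaders } };
}

// Usage sketch (assumes simple-get is installed):
// const get = require('simple-get');
// get.concat(buildRobotsRequest('https://www.sainsburys.co.uk'), (err, res, data) => {
//   if (err) throw err;
//   console.log(data.toString());
// });
```

Keeping the headers in one constant makes them easy to override per request. If the HTTP client in use does not decode Brotli responses, `br` may need to be dropped from `Accept-Encoding`.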

I'd be happy to contribute a fix, as it is a small change.
Regards,
Vamse.
