Our delta crawl runs hourly and checks the post dates in the sitemap to index new or updated posts. For the delta crawl to pick up a change, the ‘lastmod’ field in the sitemap must be later than the last indexed date.
However, URLs removed from the sitemap require a re-index; otherwise the deleted pages may still appear in search results.
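The sketch below illustrates the lastmod comparison under stated assumptions: the inline sitemap content, the `last_indexed` value, and the overall logic are illustrative only, not our crawler's actual implementation.

```python
# Minimal sketch: parse a sitemap, compare each <lastmod> against the last
# indexed date, and keep only URLs that need (re)indexing.
from datetime import datetime, timezone
from xml.etree import ElementTree

SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/blog/new-post</loc>
    <lastmod>2024-05-02</lastmod>
  </url>
  <url>
    <loc>https://example.com/blog/old-post</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
last_indexed = datetime(2024, 4, 1, tzinfo=timezone.utc)  # assumed value

root = ElementTree.fromstring(SITEMAP_XML)
for url in root.findall("sm:url", NS):
    loc = url.findtext("sm:loc", namespaces=NS)
    lastmod = url.findtext("sm:lastmod", namespaces=NS)
    if lastmod is None:
        continue  # without lastmod, the delta crawl cannot tell whether the post changed
    modified = datetime.fromisoformat(lastmod).replace(tzinfo=timezone.utc)
    if modified > last_indexed:
        print("needs indexing:", loc)
```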
XPath is a query language that provides an easy way to select nodes from an XML/HTML document. You can read more about the language here.
Below we provide some examples:
- Post title:
/html/head/title
- Short description:
/html/head/meta[@name="description"]/@content
- Post image:
//div[@id="container"]/article/img/@src
- Post content:
//div[@id="container"]/article
- Post author:
//div[@id="container"]/span[@id="author"]
If the HTML structure differs from one post to another, you can set two or more XPath queries separated by commas.
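As a rough illustration of how these queries resolve, here is a minimal sketch using Python with lxml. The sample HTML, the helper name `run`, and the way comma-separated queries are tried in order are assumptions for demonstration, not part of our crawler.

```python
# Run the XPath examples above against a small sample page with lxml.
from lxml import html

SAMPLE_HTML = """
<html>
  <head>
    <title>My first post</title>
    <meta name="description" content="A short description of the post.">
  </head>
  <body>
    <div id="container">
      <article><img src="/img/cover.png">Post body text.</article>
      <span id="author">Jane Doe</span>
    </div>
  </body>
</html>"""

doc = html.fromstring(SAMPLE_HTML)

def run(query_list: str):
    # Try each comma-separated XPath query in turn and return the first match,
    # mirroring the "two or more queries separated by commas" setting above.
    for query in query_list.split(","):
        matches = doc.xpath(query.strip())
        if matches:
            first = matches[0]
            # Attribute and text results come back as strings, elements as nodes.
            return first if isinstance(first, str) else first.text_content()
    return None

print(run('/html/head/title'))                                # Post title
print(run('/html/head/meta[@name="description"]/@content'))  # Short description
print(run('//div[@id="container"]/article/img/@src'))        # Post image
print(run('//div[@id="container"]/article'))                  # Post content
print(run('//div[@id="container"]/span[@id="author"]'))       # Post author
```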
If your site is a WordPress site, it does not need to be publicly accessible, because indexing is performed via the WordPress plugin.
Otherwise, your site is indexed by our crawler, which requires it to be publicly accessible.
If you provide us a sitemap and the folder/subpath is listed there, we will ignore your robots.txt. To exclude those pages without changing your sitemap, you can use our blacklist feature.