Xpath image extraction guide

“Xpath uses path expressions to select nodes or node-sets in an XML document.”

-From w3schools.com

In our case, an Xpath tells the program where to look for the images that we want. Since almost every website is built in its own way, with its own layout and structure, the location of the content can differ completely from one site to another.
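As a rough illustration of why the path differs from site to site, the sketch below runs different path expressions against two made-up pages. The page snippets, the `image_sources` helper and the `gallery` and `posts` names are assumptions for this example, not part of the program. It uses Python's built-in ElementTree module, which only understands a subset of XPath and needs well-formed markup, so the `src` attribute is read with `.get("src")` instead of a trailing `/@src`.

```python
import xml.etree.ElementTree as ET

# Two hypothetical pages that keep their images in different places.
SITE_A = "<html><body><div id='gallery'><img src='a1.jpg'/><img src='a2.jpg'/></div></body></html>"
SITE_B = "<html><body><ul class='posts'><li><a href='#'><img src='b1.png'/></a></li></ul></body></html>"


def image_sources(page, element_path):
    """Return the src attribute of every element matched by element_path."""
    root = ET.fromstring(page)
    return [img.get("src") for img in root.findall(element_path)]


# Each page needs a path that matches its own structure.
print(image_sources(SITE_A, ".//div[@id='gallery']/img"))   # ['a1.jpg', 'a2.jpg']
print(image_sources(SITE_B, ".//ul[@class='posts']//img"))  # ['b1.png']

# A path written for site A finds nothing on site B.
print(image_sources(SITE_B, ".//div[@id='gallery']/img"))   # []
```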

When using the “Custom” alternative, you can scrape a page from a website that isn’t otherwise supported. To do this, however, you need to help the program find the images by giving it a custom Xpath. This guide will show you a simple way to do it.

First you need to get an addon for your browser that will help you test xpaths. For Chrome you can use this and for Firefox you can use this. This guide will be using Chrome, but it’s similar for Firefox. I will use this website in this example. It is a page that contains ~100 images that we will scrape.

Right-click on an image and click “Inspect”. This will open Chrome’s Developer Tools in a panel on the right or at the bottom.

You should now be seeing a sidebar with the URL to the image visible.

Now right-click the link and select “Copy” and then “Copy XPath”. The Xpath to the image should now be saved to your clipboard.

Open your Xpath-addon.

Paste the Xpath into the left part of the box that popped up. If the path doesn’t end with “//img/@src”, add it. It should now look like this, with the URL to the image showing up in the right part of the box. We now have a working Xpath.
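The same check can be sketched outside the browser. This is a minimal sketch and not how the program itself evaluates the path: the page snippet, the copied path and the helpers `ensure_img_src` and `image_urls` are all made up for illustration. It assumes the page is well-formed and that the path starts with “//”, because Python's built-in ElementTree only supports a subset of XPath. For that reason the final “/@src” step is split off and read with `.get()`. Real pages are rarely well-formed, so an HTML-aware parser would be needed in practice.

```python
import xml.etree.ElementTree as ET

IMG_SUFFIX = "//img/@src"

# Hypothetical, well-formed stand-in for the page being scraped.
PAGE = """<html><body><div id="gallery">
<a href="/view/1"><img src="https://example.com/thumbs/1.jpg"/></a>
<a href="/view/2"><img src="https://example.com/thumbs/2.jpg"/></a>
</div></body></html>"""


def ensure_img_src(xpath):
    """Append //img/@src unless the path already ends with it."""
    return xpath if xpath.endswith(IMG_SUFFIX) else xpath + IMG_SUFFIX


def image_urls(page, xpath):
    """Evaluate a path of the form //...//img/@src against a well-formed page."""
    # Split "//...//img/@src" into the element path and the attribute name.
    element_path, _, attribute = xpath.rpartition("/@")
    root = ET.fromstring(page)
    # ElementTree only accepts relative paths, so "//..." becomes ".//...".
    matches = root.findall("." + element_path)
    return [el.get(attribute) for el in matches if el.get(attribute) is not None]


copied = '//*[@id="gallery"]/a[1]'  # assumed result of "Copy XPath" on the first link
xpath = ensure_img_src(copied)
print(xpath)                    # //*[@id="gallery"]/a[1]//img/@src
print(image_urls(PAGE, xpath))  # ['https://example.com/thumbs/1.jpg']
```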

Here things get a bit more complicated, but once you have done it a few times it will become clearer.

We want all the images on the page. To do this we modify our Xpath a bit. What we want to do is identify a top-level element or “container” in which all of the images are located. You can do this by moving your mouse cursor over the elements inside the Developer Tools and looking for one where all the images get a blue overlay.

Here I have found a spot that seems to get what we want. Now we repeat the earlier steps for this element: copy its Xpath, open the Xpath-addon and paste the path in. Remember to add “//img/@src”.

This is the result you should now be seeing. We now have an Xpath which the program can use to scrape the images.
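To see what moving up to the container changes, the sketch below compares a path that points at a single link with a path that points at the surrounding container. Everything in it is assumed for illustration: the page snippet, the `gallery` and `header` ids and the `image_urls` helper. It relies on Python's built-in ElementTree, which needs well-formed markup and supports only a subset of XPath, so the trailing “/@src” is handled by the helper and not by the library.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: a logo outside the gallery and three images inside it.
PAGE = """<html><body>
<div id="header"><img src="logo.png"/></div>
<div id="gallery">
  <a href="/view/1"><img src="1.jpg"/></a>
  <a href="/view/2"><img src="2.jpg"/></a>
  <a href="/view/3"><img src="3.jpg"/></a>
</div>
</body></html>"""


def image_urls(page, xpath):
    """Evaluate a path of the form //...//img/@src against a well-formed page."""
    element_path, _, attribute = xpath.rpartition("/@")
    root = ET.fromstring(page)
    # ElementTree only accepts relative paths, so "//..." becomes ".//...".
    return [el.get(attribute) for el in root.findall("." + element_path)]


single = '//*[@id="gallery"]/a[1]//img/@src'  # path to one image
container = '//*[@id="gallery"]//img/@src'    # path to the container

print(image_urls(PAGE, single))     # ['1.jpg']
print(image_urls(PAGE, container))  # ['1.jpg', '2.jpg', '3.jpg']
```

In this made-up page the container path picks up every gallery image while leaving out the logo, which is the effect the blue overlay in the Developer Tools helps you check by eye.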