Golang Web Scraper Example




I stumbled across a scraper and crawler framework written in Go called Colly. Colly makes it really easy to scrape content from web pages thanks to its speed and simple interface. I have been interested in web scrapers ever since I did a project for my university studies, and you can read about that project here. Before continuing, please note that scraping websites is not always allowed and is sometimes even illegal. In the guide below we will be parsing this blog, GoPHP.io.


To begin, let’s take a look at the Colly GitHub page and scroll down to the example code listed there. We will create a new project with a new main.go file that looks like this:

You may need to run go get -u github.com/gocolly/colly/... to download the framework into your Go directory. Now let’s go ahead and change the URL to the gophp.io website.
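Conceptually, a crawler like this visits a page, collects its links, and then follows each link in turn. Here is a minimal sketch of that loop using only the standard library and an in-memory link graph instead of Colly and the network. The function name crawl, the depth limit, and the sample URLs are assumptions made for illustration; none of them come from Colly's API.

```go
package main

import "fmt"

// crawl walks the link graph depth-first starting at start. It never visits
// the same URL twice and does not go deeper than maxDepth levels.
// links is a hypothetical callback returning the outgoing links of a page.
func crawl(start string, maxDepth int, links func(url string) []string) []string {
	visited := map[string]bool{}
	order := []string{}

	var visit func(url string, depth int)
	visit = func(url string, depth int) {
		if visited[url] || depth > maxDepth {
			return
		}
		visited[url] = true
		order = append(order, url)
		for _, next := range links(url) {
			visit(next, depth+1) // follow each link before moving on
		}
	}

	visit(start, 0)
	return order
}

func main() {
	// Hypothetical link graph standing in for real pages.
	site := map[string][]string{
		"https://gophp.io/":           {"https://gophp.io/some-post/", "https://gophp.io/about/"},
		"https://gophp.io/some-post/": {"https://www.virtualbox.org/", "https://gophp.io/"},
		"https://www.virtualbox.org/": {"https://www.virtualbox.org/downloads/"},
	}
	pages := crawl("https://gophp.io/", 2, func(url string) []string { return site[url] })
	for _, page := range pages {
		fmt.Println("Visiting", page)
	}
}
```

Without the visited set and the depth limit, the loop would keep following links for as long as it finds new ones. Colly provides collector options for restricting depth and allowed domains, which serve the same purpose in a real crawl.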


Then we can run the script by typing go run main.go in the terminal, making sure we are in the project directory when we do this. You can press Ctrl+C in your terminal to cancel, as it may run for a long time. What do we get as our output? For me it looked like this:

What we see here is exactly what you would expect. Our program parsed all the URLs on the main gophp.io page and then proceeded to the first link. This first link is a post at gophp.io, but the first link on that page is a link to VirtualBox, and our program will keep looping until it stops finding links. That could take a long time, and unless you want to build a search engine spider it won’t be very efficient. What I want is a server that I can call from a PHP script and that just fetches and formats the data I need. Luckily, Colly has a complete example of what we need: a scraper server.


Our scraper will basically try to download each web page in a list we give it and check that it gets a 200 HTTP status code (meaning the server returned an HTML page without an error). We’re not dealing with parsing the HTML results here, since the goal is to focus on.

What does the scraper server do? It starts a web server running locally on your machine on port 7171. It takes a url parameter and returns all the links found at the URL you pass in. Let’s give it a go by visiting http://127.0.0.1:7171/?url=https://gophp.io/. Here is an example of the JSON-encoded output we get:

The JSON output above is only one level deep: the server does not go on to follow the links on the pages it finds. This is great because we can now use this program as a sort of microservice. A PHP application could call this microservice, receive all the links for a specified URL, and process them itself. Links are good, but we might want to parse other content on the page as well. Let’s customize our code for this purpose.

Queries For Specific Content With Colly

If we take a look at the source of gophp.io, we can see that every title has the CSS class entry-title, which we can use for our query. We will modify the handler function by adding another map for headings. I am only including the section of code that I have changed below:

Now if we restart our program and navigate to our page on port 7171 again, we will see some additional output in our JSON response.


As you can see, we have now parsed all the titles on the page and added them to our JSON output. Using queries, we can make very general or very specific parsers for any kind of website.
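To show the idea behind a class query without Colly, here is a minimal standard-library sketch that collects the text of every element carrying a given class. With Colly, the same query is expressed as the selector string .entry-title passed to OnHTML. The helper names hasClass and textByClass and the sample markup are hypothetical, and the lenient encoding/xml decoder only handles fairly tidy HTML, so treat this as an illustration rather than a replacement for a real HTML parser.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// hasClass reports whether el lists class among its space-separated class names.
func hasClass(el xml.StartElement, class string) bool {
	for _, attr := range el.Attr {
		if attr.Name.Local != "class" {
			continue
		}
		for _, name := range strings.Fields(attr.Value) {
			if name == class {
				return true
			}
		}
	}
	return false
}

// textByClass returns the trimmed text content of every element that has the
// given class, including text inside nested child elements.
func textByClass(page, class string) []string {
	dec := xml.NewDecoder(strings.NewReader(page))
	dec.Strict = false
	dec.AutoClose = xml.HTMLAutoClose
	dec.Entity = xml.HTMLEntity

	var found []string
	var text strings.Builder
	depth := 0 // greater than zero while inside a matching element
	for {
		tok, err := dec.Token()
		if err != nil {
			return found // io.EOF, or markup the lenient parser cannot handle
		}
		switch t := tok.(type) {
		case xml.StartElement:
			if depth > 0 {
				depth++
			} else if hasClass(t, class) {
				depth = 1
				text.Reset()
			}
		case xml.EndElement:
			if depth > 0 {
				depth--
				if depth == 0 {
					found = append(found, strings.TrimSpace(text.String()))
				}
			}
		case xml.CharData:
			if depth > 0 {
				text.Write(t)
			}
		}
	}
}

func main() {
	// Hypothetical markup resembling a blog index page.
	page := `<div>
  <h2 class="entry-title"><a href="/golang-web-scraper-example/">Golang Web Scraper Example</a></h2>
  <p class="entry-summary">Not a title.</p>
  <h2 class="entry-title featured">Another Post</h2>
</div>`
	for _, title := range textByClass(page, "entry-title") {
		fmt.Println(title)
	}
}
```

Matching on one name within the class attribute, rather than on the whole attribute value, mirrors how a CSS class selector behaves when an element carries several classes.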

I hope this guide helps someone get started with web scraping. There are several real-world examples in the documentation if you would like to learn more. I would love to hear your feedback, questions and comments below!

Web scraping is a technique for parsing the HTML output of a website. Most online bots rely on the same technique to get the information they need about a particular website or page.

Using an XML parser, we can parse an HTML page and get the required information. However, jQuery selectors are a better fit for parsing HTML pages, so in this tutorial we will be using a jQuery library in Golang to parse the HTML document.
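The XML-parser route mentioned above can be sketched with Go's encoding/xml package, which maps elements onto structs through field tags. This only works when the page is well-formed XML, which is part of why a selector-based library is usually the more practical choice. The type names, the parsePosts helper, and the article/h2/a layout of the sample page are assumptions for illustration.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// link captures one <a> element: its href attribute and its text.
type link struct {
	URL   string `xml:"href,attr"`
	Title string `xml:",chardata"`
}

// listing collects every <a> found at body > article > h2 > a.
// This page layout is hypothetical.
type listing struct {
	Posts []link `xml:"body>article>h2>a"`
}

// parsePosts decodes a well-formed XHTML document into its post links.
func parsePosts(doc string) ([]link, error) {
	var l listing
	if err := xml.Unmarshal([]byte(doc), &l); err != nil {
		return nil, err
	}
	return l.Posts, nil
}

func main() {
	doc := `<html><body>
<article><h2><a href="http://www.code2succeed.com/getting-started-with-reactjs/">Getting started with ReactJs</a></h2></article>
<article><h2><a href="http://www.code2succeed.com/caesar-encryption-of-string-using-javascript/">Caesar encryption of string using JavaScript</a></h2></article>
</body></html>`

	posts, err := parsePosts(doc)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for i, p := range posts {
		fmt.Printf("Post #%d: %s - %s\n", i+1, p.Title, p.URL)
	}
}
```

The struct tags fix the element path in advance, so any change to the page layout means changing the types, whereas a selector string can simply be edited.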


Project Setup and dependencies

As mentioned above, we will be using a jQuery library as the parser, so go get the library using the following command:

Create a file named webscraper.go and open it in your favorite text editor.

Web scraper code to get posts from a website

Getting started with ReactJs
 - http://www.code2succeed.com/getting-started-with-reactjs/
Intro to React
Post #2:
 - http://www.code2succeed.com/caesar-decryption-of-string-using-javascript/
Caesar encryption of string using JavaScript
 - http://www.code2succeed.com/caesar-encryption-of-string-using-javascript/

Stay tuned for more updates and tutorials!
