How to Scrape Data from the Web with Golang in 2024


Web scraping is a powerful technique for extracting data from websites, and Golang (Go) is an excellent language for the task. Known for its performance and efficiency, Go can handle web scraping with ease. So how do you scrape data from the web with Golang? This guide walks you through the process of scraping web pages with Golang, covering related techniques and tips.

Is Golang Good for Scraping Data from Web?

Before diving into scraping data from the web with Golang, it’s important to understand why you would choose Golang for web scraping and what advantages it offers.

Golang is a strong choice for web scraping due to its high performance, efficient concurrency model, and robust standard library. With its ability to handle multiple requests concurrently using goroutines and its built-in packages for HTTP requests and HTML parsing, Go can efficiently scrape large volumes of data. Its simplicity and error-handling capabilities further streamline the development process, while third-party libraries like Colly and Goquery offer additional functionality. Although less common than Python for web scraping, Go’s advantages make it a compelling option for those familiar with the language.

Basic Configuration to Scrape Web Data with Golang

Scraping data from the web with Go (Golang) involves making HTTP requests to retrieve web pages and then parsing the HTML content to extract the desired information. Below is a step-by-step guide to scraping data from the web using Go:

    1. Setting Up The Environment

      First, make sure Go is installed on your system. If it isn’t, you can download it from the official website.

    2. Installing Necessary Packages

      A few packages are needed to help with HTTP requests and HTML parsing. The most popular are net/http for HTTP requests and goquery for parsing HTML.

      Install the goquery package by running:

      go get github.com/PuerkitoBio/goquery
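
      Note: with Go modules (the default in recent Go versions), initialize a module in your project directory before running go get, so the dependency is recorded in go.mod; the module path below is only a placeholder:

      go mod init example.com/scraper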

    3. Writing the Scraper

      Here’s a simple example demonstrating how to scrape data from a website using Golang:

      package main
      
      import (
          "fmt"
          "log"
          "net/http"
      
          "github.com/PuerkitoBio/goquery"
      )
      
      func main() {
          // URL of the website to scrape
          url := "https://example.com"
      
          // Make an HTTP GET request
          res, err := http.Get(url)
          if err != nil {
              log.Fatal(err)
          }
          defer res.Body.Close()
      
          // Check the response status code
          if res.StatusCode != 200 {
              log.Fatalf("Failed to fetch data: %d %s", res.StatusCode, res.Status)
          }
      
          // Parse the HTML
          doc, err := goquery.NewDocumentFromReader(res.Body)
          if err != nil {
              log.Fatal(err)
          }
      
          // Find and print the data
          doc.Find("h1").Each(func(index int, item *goquery.Selection) {
              heading := item.Text()
              fmt.Println(heading)
          })
      }

      Making HTTP Requests:

      http.Get(url) makes an HTTP GET request to the specified URL.
      res.Body.Close() ensures that the response body is closed after reading.

      Parsing HTML:

      goquery.NewDocumentFromReader(res.Body) parses the HTML response and returns a goquery.Document object.

      Extracting Data:

      doc.Find("h1").Each() finds all h1 elements in the HTML and iterates over them.
      item.Text() extracts the text content of each h1 element.

    4. Running the Scraper

      Save the above code in a file, for example, main.go, and run it using:

      go run main.go

Additional Considerations

Handling Errors: Always handle errors appropriately to ensure your scraper doesn’t crash unexpectedly.

Respecting robots.txt: Check the robots.txt file of the website to ensure you’re allowed to scrape it.
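
As a sketch of what this check can look like, the snippet below fetches robots.txt and tests a path using the third-party github.com/temoto/robotstxt parser (an assumption; any robots.txt parser, or a manual check, works just as well):

package main

import (
    "fmt"
    "log"
    "net/http"

    "github.com/temoto/robotstxt"
)

func main() {
    // Fetch the site's robots.txt
    res, err := http.Get("https://example.com/robots.txt")
    if err != nil {
        log.Fatal(err)
    }
    defer res.Body.Close()

    // Parse it and test whether our user agent may fetch a given path
    robots, err := robotstxt.FromResponse(res)
    if err != nil {
        log.Fatal(err)
    }

    allowed := robots.TestAgent("/some/path", "Golang_Scraper/1.0")
    fmt.Println("Allowed to scrape /some/path:", allowed)
}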

Rate Limiting: Implement rate limiting to avoid overwhelming the server with requests.

User-Agent: Set a custom User-Agent header to identify your scraper, for example:

req, err := http.NewRequest("GET", url, nil)
if err != nil {
    log.Fatal(err)
}
req.Header.Set("User-Agent", "Golang_Scraper/1.0")

client := &http.Client{}
res, err := client.Do(req)
if err != nil {
    log.Fatal(err)
}
defer res.Body.Close()

// Parse the HTML as before

Advanced Techniques to Scrape Web Data with Golang

Handling Pagination

Many websites use pagination to split content across multiple pages. To scrape all the data, you need to handle pagination by making requests to each page sequentially.

Here’s an example of handling pagination:

package main

import (
    "fmt"
    "log"
    "net/http"
    "strconv"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    baseURL := "https://example.com/page/"
    page := 1

    for {
        url := baseURL + strconv.Itoa(page)
        res, err := http.Get(url)
        if err != nil {
            log.Fatal(err)
        }
        if res.StatusCode != 200 {
            res.Body.Close()
            log.Println("No more pages to fetch, stopping.")
            break
        }

        doc, err := goquery.NewDocumentFromReader(res.Body)
        // Close the body explicitly; defer inside an unbounded loop would
        // keep every response open until the function returns.
        res.Body.Close()
        if err != nil {
            log.Fatal(err)
        }

        doc.Find(".item").Each(func(index int, item *goquery.Selection) {
            title := item.Find(".title").Text()
            fmt.Println(title)
        })

        page++
    }
}

Handling JavaScript-Rendered Content

Some websites use JavaScript to render content dynamically. Go doesn’t have a built-in way to execute JavaScript, but you can drive a headless browser, such as Chrome, through the chromedp library.

go get -u github.com/chromedp/chromedp

Example of using Chromedp to scrape JavaScript-rendered content:

package main

import (
    "context"
    "fmt"
    "log"

    "github.com/chromedp/chromedp"
)

func main() {
    ctx, cancel := chromedp.NewContext(context.Background())
    defer cancel()

    var htmlContent string

    err := chromedp.Run(ctx,
        chromedp.Navigate("https://example.com"),
        chromedp.OuterHTML("body", &htmlContent),
    )
    if err != nil {
        log.Fatal(err)
    }

    fmt.Println(htmlContent)
}

Managing Sessions and Cookies

If a website requires login or session management, you can handle cookies and sessions using an http.CookieJar.

Example of managing cookies:

package main

import (
    "fmt"
    "log"
    "net/http"
    "net/http/cookiejar"
    "net/url"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    jar, _ := cookiejar.New(nil)
    client := &http.Client{Jar: jar}

    // Log in and save cookies
    loginURL := "https://example.com/login"
    loginForm := url.Values{}
    loginForm.Set("username", "your_username")
    loginForm.Set("password", "your_password")

    res, err := client.PostForm(loginURL, loginForm)
    if err != nil {
        log.Fatal(err)
    }
    res.Body.Close()

    // Access a protected page
    protectedURL := "https://example.com/protected-page"
    res, err = client.Get(protectedURL)
    if err != nil {
        log.Fatal(err)
    }
    defer res.Body.Close()

    doc, err := goquery.NewDocumentFromReader(res.Body)
    if err != nil {
        log.Fatal(err)
    }

    doc.Find(".protected-content").Each(func(index int, item *goquery.Selection) {
        content := item.Text()
        fmt.Println(content)
    })
}

Throttling and Rate Limiting

To avoid being blocked by websites, implement rate limiting by introducing delays between requests.

Example of rate limiting:

package main

import (
    "fmt"
    "log"
    "net/http"
    "time"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    urls := []string{"https://example.com/page1", "https://example.com/page2"}

    for _, url := range urls {
        res, err := http.Get(url)
        if err != nil {
            log.Fatal(err)
        }
        defer res.Body.Close()

        doc, err := goquery.NewDocumentFromReader(res.Body)
        if err != nil {
            log.Fatal(err)
        }

        doc.Find(".item").Each(func(index int, item *goquery.Selection) {
            title := item.Find(".title").Text()
            fmt.Println(title)
        })

        // Delay to avoid getting blocked
        time.Sleep(2 * time.Second)
    }
}

Handling AJAX Requests

Some websites load data dynamically through AJAX requests. You can capture and replicate these requests using tools like browser developer tools to find the API endpoints.

Example of fetching data from an AJAX API endpoint:

package main

import (
    "encoding/json"
    "fmt"
    "log"
    "net/http"
)

type Item struct {
    Title string `json:"title"`
}

func main() {
    url := "https://example.com/api/items"

    res, err := http.Get(url)
    if err != nil {
        log.Fatal(err)
    }
    defer res.Body.Close()

    var items []Item
    if err := json.NewDecoder(res.Body).Decode(&items); err != nil {
        log.Fatal(err)
    }

    for _, item := range items {
        fmt.Println(item.Title)
    }
}

Handling Captchas and Anti-Scraping Mechanisms

Websites often use CAPTCHAs and other anti-scraping mechanisms. While solving CAPTCHAs programmatically is complex and often against terms of service, you can use techniques like rotating user-agents and proxies to avoid detection.

Example of rotating user agents:

package main

import (
    "fmt"
    "log"
    "net/http"
    "math/rand"
    "time"
)

func main() {
    userAgents := []string{
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:54.0) Gecko/20100101 Firefox/54.0",
        // Add more user agents here
    }

    client := &http.Client{}
    rand.Seed(time.Now().UnixNano())

    for i := 0; i < 5; i++ {
        req, err := http.NewRequest("GET", "https://example.com", nil)
        if err != nil {
            log.Fatal(err)
        }

        req.Header.Set("User-Agent", userAgents[rand.Intn(len(userAgents))])
        res, err := client.Do(req)
        if err != nil {
            log.Fatal(err)
        }
        res.Body.Close()

        fmt.Println("Request sent with user-agent:", req.Header.Get("User-Agent"))
    }
}

Using Proxies

To further protect your IP from getting banned, you can use proxies. Services like OkeyProxy or MacroProxy provide proxy solutions.

As one of the best proxy providers, OkeyProxy supports HTTP/HTTPS/SOCKS and provides 150 million+ real rotating residential IPs covering 200+ countries/areas, which helps avoid IP bans as much as possible and ensures secure, reliable, and stable network connections.


Example of using a proxy with http.Client:

package main

import (
    "fmt"
    "log"
    "net/http"
    "net/url"
)

func main() {
    proxyURL, _ := url.Parse("http://proxyusername:proxypassword@proxyserver:port")
    transport := &http.Transport{
        Proxy: http.ProxyURL(proxyURL),
    }

    client := &http.Client{Transport: transport}

    res, err := client.Get("https://example.com")
    if err != nil {
        log.Fatal(err)
    }
    defer res.Body.Close()

    fmt.Println("Response status:", res.Status)
}
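
If your proxy provider gives you several endpoints, you can also rotate them per request by supplying a custom Proxy function to the transport. A minimal sketch, assuming placeholder proxy URLs:

package main

import (
    "fmt"
    "log"
    "math/rand"
    "net/http"
    "net/url"
)

func main() {
    // Placeholder proxy endpoints; replace them with the ones from your provider
    proxies := []string{
        "http://proxyusername:proxypassword@proxy1.example.com:8000",
        "http://proxyusername:proxypassword@proxy2.example.com:8000",
    }

    transport := &http.Transport{
        // Pick a random proxy for every outgoing request
        Proxy: func(req *http.Request) (*url.URL, error) {
            return url.Parse(proxies[rand.Intn(len(proxies))])
        },
    }

    client := &http.Client{Transport: transport}

    res, err := client.Get("https://example.com")
    if err != nil {
        log.Fatal(err)
    }
    defer res.Body.Close()

    fmt.Println("Response status:", res.Status)
}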

Concurrent Scraping

To speed up scraping, you can use goroutines to handle multiple requests concurrently. This is useful for scraping large datasets.

Example of concurrent scraping with goroutines:

package main

import (
    "fmt"
    "log"
    "net/http"
    "sync"

    "github.com/PuerkitoBio/goquery"
)

func scrape(url string, wg *sync.WaitGroup) {
    defer wg.Done()

    res, err := http.Get(url)
    if err != nil {
        log.Println(err)
        return
    }
    defer res.Body.Close()

    doc, err := goquery.NewDocumentFromReader(res.Body)
    if err != nil {
        log.Println(err)
        return
    }

    doc.Find(".item").Each(func(index int, item *goquery.Selection) {
        title := item.Find(".title").Text()
        fmt.Println(title)
    })
}

func main() {
    urls := []string{
        "https://example.com/page1",
        "https://example.com/page2",
        // Add more URLs
    }

    var wg sync.WaitGroup

    for _, url := range urls {
        wg.Add(1)
        go scrape(url, &wg)
    }

    wg.Wait()
}
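
The example above launches one goroutine per URL, which is fine for a handful of pages but can overwhelm your machine and the target server for large lists. A common pattern, shown here as a sketch, is to cap concurrency with a buffered channel used as a semaphore:

package main

import (
    "fmt"
    "log"
    "net/http"
    "sync"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    urls := []string{
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3",
    }

    var wg sync.WaitGroup
    sem := make(chan struct{}, 2) // allow at most 2 requests in flight

    for _, u := range urls {
        wg.Add(1)
        go func(url string) {
            defer wg.Done()

            sem <- struct{}{}        // acquire a slot
            defer func() { <-sem }() // release it when done

            res, err := http.Get(url)
            if err != nil {
                log.Println(err)
                return
            }
            defer res.Body.Close()

            doc, err := goquery.NewDocumentFromReader(res.Body)
            if err != nil {
                log.Println(err)
                return
            }

            doc.Find(".item").Each(func(_ int, item *goquery.Selection) {
                fmt.Println(item.Find(".title").Text())
            })
        }(u)
    }

    wg.Wait()
}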

Scraping Data from APIs

Many websites offer APIs to access data. Using APIs is often easier and more efficient than scraping HTML.

Example of calling an API:

package main

import (
    "encoding/json"
    "fmt"
    "log"
    "net/http"
)

func main() {
    url := "https://api.example.com/data"

    res, err := http.Get(url)
    if err != nil {
        log.Fatal(err)
    }
    defer res.Body.Close()

    var data map[string]interface{}
    if err := json.NewDecoder(res.Body).Decode(&data); err != nil {
        log.Fatal(err)
    }

    fmt.Println("API Data:", data)
}

Storing Data

Depending on your requirements, you might need to store scraped data in a database or file. Here’s an example of writing data to a CSV file:

package main

import (
    "encoding/csv"
    "fmt"
    "log"
    "os"
    "net/http"
    "github.com/PuerkitoBio/goquery"
)

func main() {
    file, err := os.Create("data.csv")
    if err != nil {
        log.Fatal(err)
    }
    defer file.Close()

    writer := csv.NewWriter(file)
    defer writer.Flush()

    urls := []string{"https://example.com/page1", "https://example.com/page2"}

    for _, url := range urls {
        res, err := http.Get(url)
        if err != nil {
            log.Fatal(err)
        }
        defer res.Body.Close()

        doc, err := goquery.NewDocumentFromReader(res.Body)
        if err != nil {
            log.Fatal(err)
        }

        doc.Find(".item").Each(func(index int, item *goquery.Selection) {
            title := item.Find(".title").Text()
            writer.Write([]string{title})
        })
    }

    fmt.Println("Data written to data.csv")
}

Error Handling and Logging

Robust error handling and logging are essential for troubleshooting and maintaining scrapers. You can use Go’s logging capabilities or external libraries like logrus for advanced logging.
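
For example, a scraper might emit structured, leveled logs with logrus so that failed requests are easy to filter later; a brief sketch (the field names are illustrative):

package main

import (
    "net/http"

    log "github.com/sirupsen/logrus"
)

func main() {
    // JSON output and a minimum level make logs easier to search and filter
    log.SetFormatter(&log.JSONFormatter{})
    log.SetLevel(log.InfoLevel)

    url := "https://example.com"
    res, err := http.Get(url)
    if err != nil {
        log.WithFields(log.Fields{"url": url}).WithError(err).Error("request failed")
        return
    }
    defer res.Body.Close()

    log.WithFields(log.Fields{
        "url":    url,
        "status": res.StatusCode,
    }).Info("page fetched")
}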

Essential Libraries for Web Scraping in Golang

  1. Colly: Powerful and easy-to-use web scraping framework (a minimal example follows this list). Installation: go get -u github.com/gocolly/colly
  2. Goquery: jQuery-like library for manipulating and querying HTML. Installation: go get -u github.com/PuerkitoBio/goquery
  3. Req: Simplified HTTP client for making requests. Installation: go get -u github.com/imroc/req
  4. Grequests: Higher-level HTTP requests library, similar to Python’s Requests. Installation: go get -u github.com/levigross/grequests
  5. Chromedp: Browser automation using the Chrome DevTools Protocol. Installation: go get -u github.com/chromedp/chromedp
  6. Rod: Modern browser automation library for Go, with an emphasis on ease of use and modern features. Installation: go get -u github.com/go-rod/rod
  7. Go-Selenium: A Selenium WebDriver client for Go, useful for automating browsers. Installation: go get -u github.com/tebeka/selenium
  8. Scolly: A wrapper around Colly for simplified web scraping. Installation: go get -u github.com/scolly/scolly
  9. Browshot: A Go client for the Browshot API to take screenshots and scrape content from web pages. Installation: go get -u github.com/browshot/browshot-go
  10. Puppeteer-go: A Go port of Puppeteer, a Node library for controlling headless Chrome. Installation: go get -u github.com/chromedp/puppeteer-go
  11. Go-requests: Simple HTTP request library inspired by Python’s Requests. Installation: go get -u github.com/deckarep/golang-set
  12. Httpproxy: A simple HTTP proxy server for Go, useful for routing web scraping traffic. Installation: go get -u github.com/henrylee2cn/httpproxy
  13. Crawling: A library for building distributed web crawlers. Installation: go get -u github.com/whyrusleeping/crawling
  14. K6: Although primarily a load testing tool, K6 can be used for scraping web data with its scripting capabilities. Installation: go get -u github.com/loadimpact/k6
  15. Net/http: The standard library for making HTTP requests in Go. Installation: built into Go, no separate installation needed.
  16. Goquery-html: Another HTML parsing library with Goquery-based enhancements. Installation: go get -u github.com/PuerkitoBio/goquery-html
  17. Httpclient: A high-level HTTP client for Go, offering advanced request features. Installation: go get -u github.com/aymerick/raymond
These libraries and tools cover a range of functionalities, from simple HTTP requests to full browser automation, making them versatile for different web scraping needs.
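
As a taste of the first entry on the list, here is a minimal Colly sketch (the selector and target URL are placeholders):

package main

import (
    "fmt"
    "log"

    "github.com/gocolly/colly"
)

func main() {
    c := colly.NewCollector(
        colly.UserAgent("Golang_Scraper/1.0"),
    )

    // Called for every element matching the CSS selector
    c.OnHTML("h1", func(e *colly.HTMLElement) {
        fmt.Println("Heading:", e.Text)
    })

    c.OnError(func(_ *colly.Response, err error) {
        log.Println("request failed:", err)
    })

    if err := c.Visit("https://example.com"); err != nil {
        log.Fatal(err)
    }
}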

Summary

Scraping data from the web with Golang offers several advantages, including performance efficiency and ease of concurrency. Go’s lightweight goroutines and channels enable the handling of multiple simultaneous requests with minimal resource overhead, making it ideal for large-scale data extraction tasks. Additionally, Go’s strong standard library supports robust HTTP and HTML parsing capabilities, simplifying the development of efficient and reliable web scraping applications. This combination of speed, concurrency, and built-in tools makes Golang a compelling choice for web scraping projects that require high performance and scalability.
