Web scraping is a powerful technique for extracting data from websites, and Golang (Go) is an excellent language for the task thanks to its performance and efficiency. So how do you scrape data from the web with Golang? This guide walks through the process of scraping a webpage with Golang, covering related techniques and tips.
Is Golang Good for Scraping Data from Web?
Before learning more about scraping data from the web with Golang, it’s important to understand why you might choose Golang for web scraping and what advantages it offers.
Golang is a strong choice for web scraping due to its high performance, efficient concurrency model, and robust standard library. With goroutines to handle many requests concurrently, the built-in net/http package for HTTP requests, and HTML parsers such as golang.org/x/net/html, Go can efficiently scrape large volumes of data. Its simplicity and explicit error handling further streamline development, while third-party libraries like Colly and Goquery offer additional functionality. Although less common than Python for web scraping, Go’s advantages make it a compelling option for those familiar with the language.
Basic Configuration to Scrape Web Data with Golang
Scraping data from the web with Go (Golang) involves making HTTP requests to retrieve web pages and then parsing the HTML content to extract the desired information. Below is a step-by-step guide to scraping data from the web using Go:
Setting Up The Environment
First, make sure Go is installed on your system. If it isn’t, you can download it from the official website.
Installing Necessary Packages
A few packages are needed to help with HTTP requests and HTML parsing. The most popular choices are net/http (part of the standard library) for HTTP requests and goquery for parsing HTML.
Install goquery by running:
go get github.com/PuerkitoBio/goquery
Writing the Scraper
Here’s a simple example that demonstrates how to scrape data from a website using Golang:
package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	// URL of the website to scrape
	url := "https://example.com"

	// Make an HTTP GET request
	res, err := http.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()

	// Check the response status code (res.Status already includes the code)
	if res.StatusCode != 200 {
		log.Fatalf("Failed to fetch data: %s", res.Status)
	}

	// Parse the HTML
	doc, err := goquery.NewDocumentFromReader(res.Body)
	if err != nil {
		log.Fatal(err)
	}

	// Find and print the data
	doc.Find("h1").Each(func(index int, item *goquery.Selection) {
		heading := item.Text()
		fmt.Println(heading)
	})
}
Making HTTP Requests:
http.Get(url) makes an HTTP GET request to the specified URL.
defer res.Body.Close() ensures that the response body is closed after reading.
Parsing HTML:
goquery.NewDocumentFromReader(res.Body) parses the HTML response and returns a goquery.Document object.
Extracting Data:
doc.Find("h1").Each() finds all h1 elements in the HTML and iterates over them.
item.Text() extracts the text content of each h1 element.
Running the Scraper
Save the above code in a file, for example, main.go, and run it using:
go run main.go
Additional Considerations
Handling Errors: Always handle errors appropriately to ensure your scraper doesn’t crash unexpectedly.
Respecting robots.txt: Check the robots.txt file of the website to ensure you’re allowed to scrape it.
Rate Limiting: Implement rate limiting to avoid overwhelming the server with requests.
User-Agent: Set a custom User-Agent header to identify your scraper, such as:
req, err := http.NewRequest("GET", url, nil)
if err != nil {
log.Fatal(err)
}
req.Header.Set("User-Agent", "Golang_Scraper/1.0")
client := &http.Client{}
res, err := client.Do(req)
if err != nil {
log.Fatal(err)
}
defer res.Body.Close()
// Parse the HTML as before
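The robots.txt check mentioned above can be sketched with the standard library alone. The following is a deliberately simplified illustration, not a complete parser: it only reads Disallow rules in groups addressed to all crawlers (User-agent: *) and ignores Allow rules, wildcards, and Crawl-delay. The helper names disallowedPaths and isAllowed are hypothetical and introduced only for this example.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// disallowedPaths collects the Disallow prefixes from groups that apply to
// all crawlers ("User-agent: *"). Simplified on purpose: Allow rules,
// wildcards, and agent-specific groups are ignored.
func disallowedPaths(robots string) []string {
	var rules []string
	inStarGroup, prevWasAgent := false, false
	scanner := bufio.NewScanner(strings.NewReader(robots))
	for scanner.Scan() {
		line := scanner.Text()
		if i := strings.Index(line, "#"); i >= 0 {
			line = line[:i] // drop comments
		}
		key, value, ok := strings.Cut(line, ":")
		if !ok {
			continue
		}
		key = strings.ToLower(strings.TrimSpace(key))
		value = strings.TrimSpace(value)
		switch key {
		case "user-agent":
			// Consecutive User-agent lines belong to the same group.
			if !prevWasAgent {
				inStarGroup = false
			}
			if value == "*" {
				inStarGroup = true
			}
			prevWasAgent = true
		default:
			if key == "disallow" && inStarGroup && value != "" {
				rules = append(rules, value)
			}
			prevWasAgent = false
		}
	}
	return rules
}

// isAllowed reports whether path avoids every Disallow prefix in rules.
func isAllowed(path string, rules []string) bool {
	for _, rule := range rules {
		if strings.HasPrefix(path, rule) {
			return false
		}
	}
	return true
}

func main() {
	// In a real scraper this text would come from a GET request to /robots.txt.
	robots := "User-agent: *\nDisallow: /private/\n\nUser-agent: otherbot\nDisallow: /\n"
	rules := disallowedPaths(robots)
	fmt.Println(isAllowed("/public/page", rules))  // true
	fmt.Println(isAllowed("/private/data", rules)) // false
}
```

For production use, a dedicated robots.txt parser is likely a better fit than this prefix check.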
Advanced Techniques to Scrape Web Data with Golang
Handling Pagination
Many websites use pagination to split content across multiple pages. To scrape all the data, you need to handle pagination by making requests to each page sequentially.
Here’s an example of handling pagination:
package main

import (
	"fmt"
	"log"
	"net/http"
	"strconv"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	baseURL := "https://example.com/page/"
	for page := 1; ; page++ {
		url := baseURL + strconv.Itoa(page)
		res, err := http.Get(url)
		if err != nil {
			log.Fatal(err)
		}
		// A non-200 status is treated as the end of the listing.
		if res.StatusCode != http.StatusOK {
			res.Body.Close()
			log.Println("No more pages to fetch, stopping.")
			break
		}
		doc, err := goquery.NewDocumentFromReader(res.Body)
		// Close the body on every iteration; a defer inside the loop would
		// keep every response open until main returns.
		res.Body.Close()
		if err != nil {
			log.Fatal(err)
		}
		doc.Find(".item").Each(func(index int, item *goquery.Selection) {
			title := item.Find(".title").Text()
			fmt.Println(title)
		})
	}
}
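Not every site puts the page number in the path. As a minimal sketch, assuming a site that paginates with a page query parameter (the parameter name, the example URL, and the pageURL helper are all assumptions made for illustration), the page URLs can be built with net/url so that existing query parameters are preserved:

```go
package main

import (
	"fmt"
	"log"
	"net/url"
	"strconv"
)

// pageURL returns base with its "page" query parameter set to page,
// keeping any other query parameters that base already carries.
func pageURL(base string, page int) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	q := u.Query()
	q.Set("page", strconv.Itoa(page))
	u.RawQuery = q.Encode() // Encode sorts parameters by key
	return u.String(), nil
}

func main() {
	for page := 1; page <= 3; page++ {
		u, err := pageURL("https://example.com/items?sort=new", page)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println(u) // pass this URL to http.Get as in the loop above
	}
}
```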
Handling JavaScript-Rendered Content
Some websites use JavaScript to render content dynamically. Go doesn’t have a built-in way to execute JavaScript, but you can drive a headless browser with a library like Chromedp.
go get -u github.com/chromedp/chromedp
Example of using Chromedp to scrape JavaScript-rendered content:
package main
import (
"context"
"fmt"
"log"
"github.com/chromedp/chromedp"
)
func main() {
ctx, cancel := chromedp.NewContext(context.Background())
defer cancel()
var htmlContent string
err := chromedp.Run(ctx,
chromedp.Navigate("https://example.com"),
chromedp.OuterHTML("body", &htmlContent),
)
if err != nil {
log.Fatal(err)
}
fmt.Println(htmlContent)
}
Managing Sessions and Cookies
If a website requires login or session management, you can handle cookies and sessions using the http.CookieJar.
Example of managing cookies:
package main

import (
	"fmt"
	"log"
	"net/http"
	"net/http/cookiejar"
	"net/url"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	// The jar stores cookies set by the server and replays them on later requests.
	jar, err := cookiejar.New(nil)
	if err != nil {
		log.Fatal(err)
	}
	client := &http.Client{Jar: jar}

	// Log in and save cookies
	loginURL := "https://example.com/login"
	loginForm := url.Values{}
	loginForm.Set("username", "your_username")
	loginForm.Set("password", "your_password")
	res, err := client.PostForm(loginURL, loginForm)
	if err != nil {
		log.Fatal(err)
	}
	res.Body.Close()

	// Access a protected page (named pageURL so it does not shadow the url package)
	pageURL := "https://example.com/protected-page"
	res, err = client.Get(pageURL)
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()

	doc, err := goquery.NewDocumentFromReader(res.Body)
	if err != nil {
		log.Fatal(err)
	}
	doc.Find(".protected-content").Each(func(index int, item *goquery.Selection) {
		content := item.Text()
		fmt.Println(content)
	})
}
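To see the cookie jar at work without touching a real site, the login flow can be exercised against a local test server. This is only a sketch: the server, its /login and /protected paths, the demo credentials, and the session cookie are all invented for the demonstration, and a real site will have its own form fields and cookie names.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/http/cookiejar"
	"net/http/httptest"
	"net/url"
)

// newTestSite starts a local stand-in for a site with a login form.
// The paths, credentials, and cookie name are made up for this demo.
func newTestSite() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/login", func(w http.ResponseWriter, r *http.Request) {
		if r.FormValue("username") != "demo" || r.FormValue("password") != "secret" {
			http.Error(w, "bad credentials", http.StatusUnauthorized)
			return
		}
		http.SetCookie(w, &http.Cookie{Name: "session", Value: "abc123", Path: "/"})
		fmt.Fprint(w, "logged in")
	})
	mux.HandleFunc("/protected", func(w http.ResponseWriter, r *http.Request) {
		if c, err := r.Cookie("session"); err != nil || c.Value != "abc123" {
			http.Error(w, "login required", http.StatusUnauthorized)
			return
		}
		fmt.Fprint(w, "protected content")
	})
	return httptest.NewServer(mux)
}

// fetchProtected logs in, then requests the protected page with the same
// client so the jar can replay the session cookie.
func fetchProtected(baseURL string) (int, string, error) {
	jar, err := cookiejar.New(nil)
	if err != nil {
		return 0, "", err
	}
	client := &http.Client{Jar: jar}

	form := url.Values{"username": {"demo"}, "password": {"secret"}}
	res, err := client.PostForm(baseURL+"/login", form)
	if err != nil {
		return 0, "", err
	}
	res.Body.Close()

	res, err = client.Get(baseURL + "/protected")
	if err != nil {
		return 0, "", err
	}
	defer res.Body.Close()
	body, err := io.ReadAll(res.Body)
	return res.StatusCode, string(body), err
}

func main() {
	site := newTestSite()
	defer site.Close()

	status, body, err := fetchProtected(site.URL)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(status, body)
}
```

Creating the client without the Jar field should make the second request fail with 401, which is a quick way to confirm that the jar is what carries the session.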
Throttling and Rate Limiting
To avoid being blocked by websites, implement rate limiting by introducing delays between requests.
Example of rate limiting:
package main

import (
	"fmt"
	"log"
	"net/http"
	"time"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	urls := []string{"https://example.com/page1", "https://example.com/page2"}
	for _, url := range urls {
		res, err := http.Get(url)
		if err != nil {
			log.Fatal(err)
		}
		doc, err := goquery.NewDocumentFromReader(res.Body)
		// Close right away rather than deferring inside the loop, so
		// responses do not stay open until main returns.
		res.Body.Close()
		if err != nil {
			log.Fatal(err)
		}
		doc.Find(".item").Each(func(index int, item *goquery.Selection) {
			title := item.Find(".title").Text()
			fmt.Println(title)
		})
		// Delay to avoid getting blocked
		time.Sleep(2 * time.Second)
	}
}
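An alternative to sleeping after each request is to pace requests with a time.Ticker, which spaces the start of each request by a fixed interval regardless of how long the previous request took. The sketch below is one possible shape; forEachThrottled is a hypothetical helper, and the callback only prints where a real scraper would fetch and parse the page.

```go
package main

import (
	"fmt"
	"time"
)

// forEachThrottled calls fn for each URL in order, starting at most one
// call per interval. The first call runs immediately. The interval must
// be positive, because time.NewTicker panics otherwise.
func forEachThrottled(urls []string, interval time.Duration, fn func(url string)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for i, u := range urls {
		if i > 0 {
			<-ticker.C // wait for the next tick before every later call
		}
		fn(u)
	}
}

func main() {
	urls := []string{
		"https://example.com/page1",
		"https://example.com/page2",
		"https://example.com/page3",
	}
	start := time.Now()
	forEachThrottled(urls, 500*time.Millisecond, func(url string) {
		// Replace this with the http.Get and goquery logic for the page.
		elapsed := time.Since(start).Round(100 * time.Millisecond)
		fmt.Printf("%v: fetching %s\n", elapsed, url)
	})
}
```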
Handling AJAX Requests
Some websites load data dynamically through AJAX requests. You can capture and replicate these requests using tools like browser developer tools to find the API endpoints.
Example of fetching data from an AJAX API endpoint:
package main
import (
"encoding/json"
"fmt"
"log"
"net/http"
)
type Item struct {
Title string `json:"title"`
}
func main() {
url := "https://example.com/api/items"
res, err := http.Get(url)
if err != nil {
log.Fatal(err)
}
defer res.Body.Close()
var items []Item
if err := json.NewDecoder(res.Body).Decode(&items); err != nil {
log.Fatal(err)
}
for _, item := range items {
fmt.Println(item.Title)
}
}
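Some endpoints only answer when the request looks like the one the page itself sends. As a hedged sketch, the variant below sets Accept and X-Requested-With headers and checks the status code before decoding. Which headers actually matter is site-specific and should be copied from the request shown in the developer tools; the local test server, its header check, and the fetchItems helper are assumptions made so the example runs offline.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"net/http/httptest"
)

type Item struct {
	Title string `json:"title"`
}

// fetchItems replays an AJAX call with explicit headers and rejects any
// response other than 200 before decoding the JSON body.
func fetchItems(client *http.Client, endpoint string) ([]Item, error) {
	req, err := http.NewRequest(http.MethodGet, endpoint, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Accept", "application/json")
	req.Header.Set("X-Requested-With", "XMLHttpRequest")

	res, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer res.Body.Close()
	if res.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %s", res.Status)
	}
	var items []Item
	if err := json.NewDecoder(res.Body).Decode(&items); err != nil {
		return nil, err
	}
	return items, nil
}

// newTestAPI is a local stand-in for the real endpoint; it only answers
// requests that carry the X-Requested-With header.
func newTestAPI() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("X-Requested-With") != "XMLHttpRequest" {
			http.Error(w, "forbidden", http.StatusForbidden)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		fmt.Fprint(w, `[{"title":"First"},{"title":"Second"}]`)
	}))
}

func main() {
	api := newTestAPI()
	defer api.Close()

	items, err := fetchItems(api.Client(), api.URL+"/api/items")
	if err != nil {
		log.Fatal(err)
	}
	for _, item := range items {
		fmt.Println(item.Title)
	}
}
```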
Handling Captchas and Anti-Scraping Mechanisms
Websites often use CAPTCHAs and other anti-scraping mechanisms. While solving CAPTCHAs programmatically is complex and often against terms of service, you can use techniques like rotating user-agents and proxies to avoid detection.
Example of rotating user agents:
package main
import (
"fmt"
"log"
"net/http"
"math/rand"
"time"
)
func main() {
userAgents := []string{
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:54.0) Gecko/20100101 Firefox/54.0",
// Add more user agents here
}
client := &http.Client{}
rand.Seed(time.Now().UnixNano())
for i := 0; i < 5; i++ {
req, err := http.NewRequest("GET", "https://example.com", nil)
if err != nil {
log.Fatal(err)
}
req.Header.Set("User-Agent", userAgents[rand.Intn(len(userAgents))])
res, err := client.Do(req)
if err != nil {
log.Fatal(err)
}
res.Body.Close()
fmt.Println("Request sent with user-agent:", req.Header.Get("User-Agent"))
}
}
Using Proxies
To further protect your IP from getting banned, you can use proxies. Services like OkeyProxy or MacroProxy provide proxy solutions.
One of the best proxy providers, OkeyProxy supports HTTP, HTTPS, and SOCKS and provides 150 million+ real rotating residential IPs covering 200+ countries and areas, which helps avoid IP bans and keeps network connections secure, reliable, and stable.
Example of using a proxy with http.Client:
package main

import (
	"fmt"
	"log"
	"net/http"
	"net/url"
)

func main() {
	// Replace the credentials, host, and port with your proxy's details.
	// 8080 is only a placeholder: url.Parse rejects a non-numeric port.
	proxyURL, err := url.Parse("http://proxyusername:proxypassword@proxyserver:8080")
	if err != nil {
		log.Fatal(err)
	}
	transport := &http.Transport{
		Proxy: http.ProxyURL(proxyURL),
	}
	client := &http.Client{Transport: transport}
	res, err := client.Get("https://example.com")
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()
	fmt.Println("Response status:", res.Status)
}
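If you have several proxy addresses, the Proxy field of http.Transport accepts any function that picks a proxy per request, so rotation can be done on the client side. The following is a minimal round-robin sketch; roundRobinProxy is a hypothetical helper and the proxy addresses are placeholders, not real endpoints from any provider.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"net/http"
	"net/url"
	"sync/atomic"
)

// roundRobinProxy returns a function suitable for http.Transport.Proxy
// that cycles through the given proxy URLs in order.
func roundRobinProxy(rawURLs []string) (func(*http.Request) (*url.URL, error), error) {
	if len(rawURLs) == 0 {
		return nil, errors.New("no proxies given")
	}
	proxies := make([]*url.URL, 0, len(rawURLs))
	for _, raw := range rawURLs {
		u, err := url.Parse(raw)
		if err != nil {
			return nil, fmt.Errorf("parse proxy %q: %w", raw, err)
		}
		proxies = append(proxies, u)
	}
	var next uint64
	return func(*http.Request) (*url.URL, error) {
		// The atomic counter keeps rotation safe across goroutines.
		i := atomic.AddUint64(&next, 1) - 1
		return proxies[i%uint64(len(proxies))], nil
	}, nil
}

func main() {
	// Placeholder addresses; substitute the ones from your provider.
	pick, err := roundRobinProxy([]string{
		"http://user:pass@proxy1.example.com:8080",
		"http://user:pass@proxy2.example.com:8080",
	})
	if err != nil {
		log.Fatal(err)
	}
	client := &http.Client{Transport: &http.Transport{Proxy: pick}}
	_ = client // use client.Get(...) exactly as in the example above

	// Show the rotation order without touching the network.
	for i := 0; i < 3; i++ {
		u, _ := pick(nil)
		fmt.Println(u.Host)
	}
}
```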
Concurrent Scraping
To speed up scraping, you can use goroutines to handle multiple requests concurrently. This is useful for scraping large datasets.
Example of concurrent scraping with goroutines:
package main
import (
"fmt"
"log"
"net/http"
"sync"
"github.com/PuerkitoBio/goquery"
)
func scrape(url string, wg *sync.WaitGroup) {
defer wg.Done()
res, err := http.Get(url)
if err != nil {
log.Println(err)
return
}
defer res.Body.Close()
doc, err := goquery.NewDocumentFromReader(res.Body)
if err != nil {
log.Println(err)
return
}
doc.Find(".item").Each(func(index int, item *goquery.Selection) {
title := item.Find(".title").Text()
fmt.Println(title)
})
}
func main() {
urls := []string{
"https://example.com/page1",
"https://example.com/page2",
// Add more URLs
}
var wg sync.WaitGroup
for _, url := range urls {
wg.Add(1)
go scrape(url, &wg)
}
wg.Wait()
}
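Launching one goroutine per URL with no upper bound can overwhelm both the target site and your own machine once the URL list grows. A common way to cap concurrency is a buffered channel used as a semaphore. The sketch below assumes a hypothetical fetchStatuses helper and uses a local test server (newDemoServer, also invented here) so it runs without network access; it records status codes where a real scraper would parse the page.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/http/httptest"
	"sync"
)

// fetchStatuses requests every URL with at most limit requests in flight
// and returns the status code observed for each URL that responded.
func fetchStatuses(client *http.Client, urls []string, limit int) map[string]int {
	if limit < 1 {
		limit = 1 // a zero-capacity semaphore would block forever
	}
	results := make(map[string]int, len(urls))
	var mu sync.Mutex
	var wg sync.WaitGroup
	sem := make(chan struct{}, limit)

	for _, u := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it when done

			res, err := client.Get(u)
			if err != nil {
				log.Println(err)
				return
			}
			defer res.Body.Close()
			io.Copy(io.Discard, res.Body) // drain so the connection can be reused

			mu.Lock() // maps are not safe for concurrent writes
			results[u] = res.StatusCode
			mu.Unlock()
		}(u)
	}
	wg.Wait()
	return results
}

// newDemoServer is a local stand-in for a real site.
func newDemoServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path == "/missing" {
			http.NotFound(w, r)
			return
		}
		fmt.Fprintln(w, "ok")
	}))
}

func main() {
	srv := newDemoServer()
	defer srv.Close()

	urls := []string{srv.URL + "/page1", srv.URL + "/page2", srv.URL + "/missing"}
	for url, status := range fetchStatuses(srv.Client(), urls, 2) {
		fmt.Println(status, url)
	}
}
```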
Scraping Data from APIs
Many websites offer APIs to access data. Using APIs is often easier and more efficient than scraping HTML.
Example of calling an API:
package main
import (
"encoding/json"
"fmt"
"log"
"net/http"
)
func main() {
url := "https://api.example.com/data"
res, err := http.Get(url)
if err != nil {
log.Fatal(err)
}
defer res.Body.Close()
var data map[string]interface{}
if err := json.NewDecoder(res.Body).Decode(&data); err != nil {
log.Fatal(err)
}
fmt.Println("API Data:", data)
}
Storing Data
Depending on your requirements, you might need to store scraped data in a database or file. Here’s an example of writing data to a CSV file:
package main

import (
	"encoding/csv"
	"fmt"
	"log"
	"net/http"
	"os"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	file, err := os.Create("data.csv")
	if err != nil {
		log.Fatal(err)
	}
	defer file.Close()
	writer := csv.NewWriter(file)

	urls := []string{"https://example.com/page1", "https://example.com/page2"}
	for _, url := range urls {
		res, err := http.Get(url)
		if err != nil {
			log.Fatal(err)
		}
		doc, err := goquery.NewDocumentFromReader(res.Body)
		// Close right away rather than deferring inside the loop.
		res.Body.Close()
		if err != nil {
			log.Fatal(err)
		}
		doc.Find(".item").Each(func(index int, item *goquery.Selection) {
			title := item.Find(".title").Text()
			if err := writer.Write([]string{title}); err != nil {
				log.Println("write error:", err)
			}
		})
	}

	// Flush buffered rows and surface any write error before reporting success.
	writer.Flush()
	if err := writer.Error(); err != nil {
		log.Fatal(err)
	}
	fmt.Println("Data written to data.csv")
}
Error Handling and Logging
Robust error handling and logging are essential for troubleshooting and maintaining scrapers. You can use Go’s logging capabilities or external libraries like logrus for advanced logging.
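As one illustration of what this can look like in practice, the sketch below retries a request on transport errors and 5xx responses, logs each failed attempt with the standard log package, and wraps the last error when it gives up. The getWithRetry helper, the doubling backoff, and the flaky local test server are assumptions made for the example rather than a prescribed policy.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// getWithRetry issues a GET up to attempts times, doubling the wait after
// each failure. Transport errors and 5xx responses count as failures.
// On success the caller is responsible for closing the response body.
func getWithRetry(client *http.Client, url string, attempts int, backoff time.Duration) (*http.Response, error) {
	if attempts < 1 {
		attempts = 1
	}
	var lastErr error
	for i := 1; i <= attempts; i++ {
		res, err := client.Get(url)
		switch {
		case err != nil:
			lastErr = err
		case res.StatusCode >= 500:
			res.Body.Close()
			lastErr = fmt.Errorf("server error: %s", res.Status)
		default:
			return res, nil
		}
		log.Printf("attempt %d/%d for %s failed: %v", i, attempts, url, lastErr)
		if i < attempts {
			time.Sleep(backoff)
			backoff *= 2
		}
	}
	return nil, fmt.Errorf("giving up on %s after %d attempts: %w", url, attempts, lastErr)
}

// newFlakyServer answers the first failures requests with 503 and every
// later request with 200. It stands in for an unreliable site.
func newFlakyServer(failures int32) *httptest.Server {
	var calls int32
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if atomic.AddInt32(&calls, 1) <= failures {
			http.Error(w, "try again", http.StatusServiceUnavailable)
			return
		}
		fmt.Fprint(w, "ok")
	}))
}

func main() {
	srv := newFlakyServer(2)
	defer srv.Close()

	res, err := getWithRetry(srv.Client(), srv.URL, 3, 100*time.Millisecond)
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()
	fmt.Println("final status:", res.Status)
}
```

Because the final error wraps the last failure with %w, callers can still inspect the underlying cause with errors.Is or errors.As.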
Essential Libraries for Web Scraping in Golang
- Colly: Powerful and easy-to-use web scraping framework. Installation: go get -u github.com/gocolly/colly
- Goquery: jQuery-like library for manipulating and querying HTML. Installation: go get -u github.com/PuerkitoBio/goquery
- Request: Simplified HTTP client for making requests. Installation: go get -u github.com/imroc/req
- Grequests: Higher-level HTTP requests library, similar to Python’s Requests. Installation: go get -u github.com/levigross/grequests
- Chromedp: Browser automation using Chrome DevTools Protocol. Installation: go get -u github.com/chromedp/chromedp
- Rod: Modern browser automation library for Go, with an emphasis on ease of use and modern features. Installation: go get -u github.com/go-rod/rod
- Go-Selenium: A Selenium WebDriver client for Go, useful for automating browsers. Installation: go get -u github.com/tebeka/selenium
- Scolly: A wrapper around Colly for simplified web scraping. Installation: go get -u github.com/scolly/scolly
- Browshot: A Go client for Browshot API to take screenshots and scrape content from web pages. Installation: go get -u github.com/browshot/browshot-go
- Puppeteer-go: A Go port of Puppeteer, a Node library for controlling headless Chrome. Installation: go get -u github.com/chromedp/puppeteer-go
- Golang-set: Set data structure for Go, useful for deduplicating URLs during a crawl. Installation: go get -u github.com/deckarep/golang-set
- Httpproxy: A simple HTTP proxy server for Go, useful for routing web scraping traffic. Installation: go get -u github.com/henrylee2cn/httpproxy
- Crawling: A library for building distributed web crawlers. Installation: go get -u github.com/whyrusleeping/crawling
- K6: Although primarily a load testing tool, K6 can be used for scraping web data with its scripting capabilities. Installation: go get -u github.com/loadimpact/k6
- Net/http: The standard library package for making HTTP requests in Go. Installation: Built-in with Go, no need for separate installation.
- Goquery-html: Another HTML parsing library with Goquery-based enhancements. Installation: go get -u github.com/PuerkitoBio/goquery-html
- Raymond: Handlebars template engine for Go, which can help render scraped data into templated output. Installation: go get -u github.com/aymerick/raymond
These libraries and tools cover a range of functionalities, from simple HTTP requests to full browser automation, making them versatile for different web scraping needs.
Summary
Scraping data from the web with Golang offers several advantages, including performance efficiency and ease of concurrency. Go’s lightweight goroutines and channels enable the handling of multiple simultaneous requests with minimal resource overhead, making it well suited to large-scale data extraction tasks. Additionally, Go’s standard library provides a robust HTTP client, and packages such as golang.org/x/net/html and goquery handle HTML parsing, simplifying the development of efficient and reliable web scraping applications. This combination of speed, concurrency, and mature tooling makes Golang a compelling choice for web scraping projects that require high performance and scalability.