Web Scraping Google Search Results
I am web scraping Google Scholar search results page by page. After a certain number of pages, a captcha pops up and interrupts my code. I read that Google limits the requests that
Solution 1:
I feel your pain, since I have scraped Google in the past. Here are the things I tried to get the job done, ordered from the easiest to the hardest technique.
- Throttle your requests per second: Google and many other websites will identify a large number of requests per second coming from the same machine and block them automatically as a defensive action against Denial-of-Service attacks. All you need to do is to be gentle and do just 1 request every 1-5 seconds, for instance, to avoid being banned quickly.
- Randomize your sleep time: Making your code sleep for exactly 1 second is too easy to detect as being a script. Make it sleep for a random amount of time at every iteration (see the sketch after this list). This StackOverflow answer shows an example of how to randomize it.
- Use a web scraper library with cookies enabled: If you write scraping code from scratch, Google will notice that your requests don't send back the cookies they received. Use a good library, such as Scrapy, to circumvent this issue.
- Use multiple IP addresses: Throttling will definitely reduce your scraping throughput. If you really need to scrape your data fast, you will need to use several IP addresses in order to avoid being banned. There are several companies providing this kind of service on the Internet for a fee. I have used ProxyMesh and was happy with their quality, documentation, and customer support.
- Use a real browser: Some websites will recognize your scraper if it doesn't process JavaScript or doesn't have a graphical interface. Using a real browser with Selenium, for instance, will solve this problem (see the Selenium sketch at the end of this answer).
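To make the first few points concrete, here is a minimal sketch of a randomized, throttled scraping loop built on a requests.Session (the URLs, the User-Agent string, and the proxy placeholders are illustrative assumptions, not values from this answer):

import random
import time

import requests

session = requests.Session()  # a Session automatically stores and resends cookies
session.headers.update({"User-Agent": "Mozilla/5.0"})  # placeholder User-Agent

# Optional (point 4): route traffic through a proxy service to rotate IP addresses.
# session.proxies.update({"http": "http://USER:PASS@HOST:PORT",
#                         "https": "http://USER:PASS@HOST:PORT"})

urls = ["https://example.com/page1", "https://example.com/page2"]  # hypothetical pages
for url in urls:
    response = session.get(url, timeout=30)
    # ... parse response.text here ...
    time.sleep(random.uniform(1, 5))  # sleep a random 1-5 seconds between requests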
You can also take a look at my crawler project, written for the Web Search Engines course at New York University. It does not scrape Google per se, but it contains some of the aforementioned techniques, such as throttling and randomizing the sleep time.
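For the last point, a minimal sketch of the "real browser" approach with Selenium might look like the following; it assumes Chrome and a matching chromedriver are installed locally, and the query URL is just an example:

from selenium import webdriver

driver = webdriver.Chrome()  # requires Chrome plus a chromedriver on the PATH
driver.get("https://scholar.google.com/scholar?q=%22policy+shaping%22")  # example query
html = driver.page_source  # fully rendered HTML, with JavaScript executed
# ... parse html with BeautifulSoup or similar ...
driver.quit()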
Solution 2:
From personal experience scraping Google Scholar, 45 seconds between requests is enough to avoid CAPTCHA and bot detection. I have had a scraper running for more than 3 days without detection. If you do get flagged, waiting about 2 hours is enough to start again. Here is an extract from my code:
import logging
import re
import time
import urllib.parse

import requests
from bs4 import BeautifulSoup

logger = logging.getLogger(__name__)


class ScholarScrape:
    def __init__(self):
        self.page = None
        self.last_url = None
        self.last_time = time.time()
        # ConfigFile is my own configuration helper; it supplies the minimum
        # delay between requests and the User-Agent string.
        self.min_time_between_scrape = int(ConfigFile.instance().config.get('scholar', 'bot_avoidance_time'))
        self.header = {'User-Agent': ConfigFile.instance().config.get('scholar', 'user_agent')}
        self.session = requests.Session()

    def search(self, query=None, year_lo=None, year_hi=None, title_only=False,
               publication_string=None, author_string=None,
               include_citations=True, include_patents=True):
        url = self.get_url(query, year_lo, year_hi, title_only, publication_string,
                           author_string, include_citations, include_patents)
        while True:
            # Wait until at least min_time_between_scrape seconds have passed
            # since the previous request.
            wait_time = self.min_time_between_scrape - (time.time() - self.last_time)
            if wait_time > 0:
                logger.info("Delaying search by {} seconds to avoid bot detection.".format(wait_time))
                time.sleep(wait_time)
            self.last_time = time.time()

            logger.info("SCHOLARSCRAPE: " + url)
            self.page = BeautifulSoup(self.session.get(url, headers=self.header).text, 'html.parser')
            self.last_url = url

            if "Our systems have detected unusual traffic from your computer network" in str(self.page):
                raise BotDetectionException("Google has blocked this computer for a short time because it has detected this scraping script.")

            return

    def get_url(self, query=None, year_lo=None, year_hi=None, title_only=False,
                publication_string=None, author_string=None,
                include_citations=True, include_patents=True):
        base_url = "https://scholar.google.com.au/scholar?"
        url = base_url + "as_q=" + urllib.parse.quote(query)

        if year_lo is not None and bool(re.match(r'.*([1-3][0-9]{3})', str(year_lo))):
            url += "&as_ylo=" + str(year_lo)

        if year_hi is not None and bool(re.match(r'.*([1-3][0-9]{3})', str(year_hi))):
            url += "&as_yhi=" + str(year_hi)

        # "Where my words occur": in the title only, or anywhere in the article.
        if title_only:
            url += "&as_occt=title"
        else:
            url += "&as_occt=any"

        if publication_string is not None:
            url += "&as_publication=" + urllib.parse.quote('"' + str(publication_string) + '"')

        if author_string is not None:
            url += "&as_sauthors=" + urllib.parse.quote('"' + str(author_string) + '"')

        if include_citations:
            url += "&as_vis=0"
        else:
            url += "&as_vis=1"

        if include_patents:
            url += "&as_sdt=0"
        else:
            url += "&as_sdt=1"

        return url

    def get_results_count(self):
        e = self.page.findAll("div", {"class": "gs_ab_mdw"})
        try:
            item = e[1].text.strip()
        except IndexError as ex:
            if "Our systems have detected unusual traffic from your computer network" in str(self.page):
                raise BotDetectionException("Google has blocked this computer for a short time because it has detected this scraping script.")
            else:
                raise ex

        if self.has_numbers(item):
            return self.get_results_count_from_soup_string(item)

        # Fall back to scanning all matching divs for one that contains a number.
        for item in e:
            item = item.text.strip()
            if self.has_numbers(item):
                return self.get_results_count_from_soup_string(item)

        return 0

    @staticmethod
    def get_results_count_from_soup_string(element):
        # The results bar reads either "About 1,234 results" or "1,234 results".
        if "About" in element:
            num = element.split(" ")[1].strip().replace(",", "")
        else:
            num = element.split(" ")[0].strip().replace(",", "")
        return num

    @staticmethod
    def has_numbers(input_string):
        return any(char.isdigit() for char in input_string)


class BotDetectionException(Exception):
    pass


if __name__ == "__main__":
    s = ScholarScrape()
    s.search(**{
        "query": "\"policy shaping\"",
        # "publication_string": "JMLR", "author_string": "gilboa",
        "year_lo": "1995",
        "year_hi": "2005",
    })
    x = s.get_results_count()
    print(x)
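The snippet depends on a ConfigFile helper that is not shown in the answer. As a purely hypothetical stand-in (the class shape, the scholar.ini filename, and the example values are my assumptions, not part of the original code), something like this would satisfy the two lookups the scraper makes:

import configparser

class ConfigFile:
    # Hypothetical singleton wrapper around configparser; the scraper only calls
    # ConfigFile.instance().config.get('scholar', 'bot_avoidance_time') and
    # ConfigFile.instance().config.get('scholar', 'user_agent').
    _instance = None

    def __init__(self, path="scholar.ini"):
        self.config = configparser.ConfigParser()
        self.config.read(path)

    @classmethod
    def instance(cls):
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

with an accompanying scholar.ini along these lines (45 seconds matches the delay recommended above; the user_agent value is a placeholder):

[scholar]
bot_avoidance_time = 45
user_agent = Mozilla/5.0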