[Python] Trying Out Google News Web Scraping

Posted by 여니 on 2021. 1. 11. 16:43

Reference: 실무자를 위한 파이썬 100제 (100 Python Exercises for Practitioners)

import requests
from bs4 import BeautifulSoup

base_url = "https://news.google.com"
search_url = base_url + "/search?q=python&hl=ko&gl=KR&ceid=KR%3Ako"

resp = requests.get(search_url)
html_src = resp.text
soup = BeautifulSoup(html_src, 'html.parser')

# Select the news item blocks.
news_items = soup.select('div[class="xrnccd"]')
print(len(news_items))
print(news_items[0])
print("\n")

# Parse the link, title, content, source, and publication date/time from each news item.
# The loop only covers the first three items.
for item in news_items[:3]:
    # Use get() to pull out the href attribute on its own.
    link = item.find('a', attrs={'class': 'VDXfz'}).get('href')
    # link starts with ./articles/~, so slice from the second character to drop the leading ".".
    news_link = base_url + link[1:]
    print("_____________________________")
    print(news_link)

    # Calling getText() on the <a> tag extracts its text.
    news_title = item.find('a', attrs={'class': 'DY5T1d'}).getText()
    print("_____________________________")
    print(news_title)

    # The text attribute of the <span> extracts its text in the same way.
    news_content = item.find('span', attrs={'class': 'xBbh9'}).text
    print("_____________________________")
    print(news_content)

    news_agency = item.find('a', attrs={'class': 'wEwyrc AVN2gc uQIVzc Sksgp'}).text
    print("_____________________________")
    print(news_agency)

    news_reporting = item.find('time', attrs={'class': 'WW6dff uQIVzc Sksgp'})
    # split('T') separates the date and time parts of the string.
    news_reporting_datetime = news_reporting.get('datetime').split('T')
    # Keep the date part whole; slicing it with [:-1] would cut off the last digit of the day.
    news_reporting_date = news_reporting_datetime[0]
    # Drop the final character of the time part (the trailing time-zone marker).
    news_reporting_time = news_reporting_datetime[1][:-1]
    print("_____________________________")
    print(news_reporting_date, news_reporting_time)
    print("\n")
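The link slicing and the split('T') step in the loop above can also be handled with the standard library. The following is only a minimal sketch, not code from the referenced book: it assumes the href is a relative path such as ./articles/... and that the datetime attribute is an ISO 8601 string ending in "Z". The helper names build_news_link and split_datetime and the sample values are hypothetical.

```python
from datetime import datetime
from urllib.parse import urljoin

BASE_URL = "https://news.google.com"


def build_news_link(href, base_url=BASE_URL):
    """Resolve a relative href such as './articles/abc123' against base_url."""
    # urljoin handles the leading "./" itself, so no manual slicing is needed.
    return urljoin(base_url + "/", href)


def split_datetime(value):
    """Split a string like '2021-01-11T07:43:00Z' into (date, time) strings."""
    # Assumption: the value always has this exact format, including the final "Z".
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.date().isoformat(), parsed.time().isoformat()


if __name__ == "__main__":
    # Hypothetical sample values, used only for illustration.
    print(build_news_link("./articles/abc123"))
    print(split_datetime("2021-01-11T07:43:00Z"))
```

Compared with link[1:] and [:-1], this version raises a ValueError when the datetime string has an unexpected shape instead of silently cutting off a character.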

# Define a Google News clipping function using the code above.
def google_news_clipping(url, limit=5):
    resp = requests.get(url)
    html_src = resp.text
    soup = BeautifulSoup(html_src, 'html.parser')
    news_items = soup.select('div[class="xrnccd"]')

    links = []
    titles = []
    contents = []
    agencies = []
    reporting_dates = []
    reporting_times = []

    for item in news_items[:limit]:
        link = item.find('a', attrs={'class': 'VDXfz'}).get('href')
        news_link = base_url + link[1:]
        links.append(news_link)

        news_title = item.find('a', attrs={'class': 'DY5T1d'}).getText()
        titles.append(news_title)

        news_content = item.find('span', attrs={'class': 'xBbh9'}).text
        contents.append(news_content)

        news_agency = item.find('a', attrs={'class': 'wEwyrc AVN2gc uQIVzc Sksgp'}).text
        agencies.append(news_agency)

        news_reporting = item.find('time', attrs={'class': 'WW6dff uQIVzc Sksgp'})
        # split('T') separates the date and time parts of the string.
        news_reporting_datetime = news_reporting.get('datetime').split('T')
        news_reporting_date = news_reporting_datetime[0]
        news_reporting_time = news_reporting_datetime[1][:-1]
        # Append to the lists; plain assignment would overwrite them with a single string.
        reporting_dates.append(news_reporting_date)
        reporting_times.append(news_reporting_time)

    result = {'link': links, 'title': titles, 'content': contents, 'agency': agencies,
              'date': reporting_dates, 'time': reporting_times}
    return result

# Run the function to collect the news list.
news = google_news_clipping(search_url, 2)
print(news)
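The function above returns a dict of parallel lists, which is awkward to read one article at a time. As a rough sketch, and assuming every value in the result is a list of the same length, the lists can be regrouped into one dict per article. The helper name to_records and the sample data below are made up for illustration; they are not real scraping results.

```python
def to_records(news):
    """Turn {'link': [...], 'title': [...], ...} into a list of per-article dicts."""
    keys = list(news.keys())
    # zip(*columns) walks the parallel lists together, one article per step.
    columns = [news[key] for key in keys]
    return [dict(zip(keys, values)) for values in zip(*columns)]


# Hypothetical sample in the same shape as the function's return value.
sample = {
    'link': ['https://news.google.com/articles/a1',
             'https://news.google.com/articles/b2'],
    'title': ['First title', 'Second title'],
    'content': ['First summary', 'Second summary'],
    'agency': ['Agency A', 'Agency B'],
    'date': ['2021-01-11', '2021-01-10'],
    'time': ['07:43:00', '09:30:00'],
}

records = to_records(sample)
for record in records:
    print(record['date'], record['time'], record['agency'], record['title'])
```

Each record then holds one article's link, title, content, agency, date, and time together, which is easier to print or write out row by row.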

License: Attribution-NonCommercial-ShareAlike
