[crawling] Press release list-up

A space for recording my data studies

STUDY/MLOPS

BOTTLE6 2022. 3. 14. 22:16

# version_2
import requests
from bs4 import BeautifulSoup
import pandas as pd
import datetime

# Today's date as YYYYMMDD, used in the output file name
today = datetime.datetime.now().strftime("%Y%m%d")

####################################################################
######### 산업부 (Ministry of Trade, Industry and Energy) ###########
####################################################################
산업부 = "https://www.motie.go.kr/motie/ne/presse/press2/bbs/bbsList.do?bbs_cd_n=81"
res = requests.get(산업부)
soup = BeautifulSoup(res.content, 'html.parser')

# Titles / URLs
titles = soup.select("tr > td.al > div > a")
href_front = "https://www.motie.go.kr/motie/ne/presse/press2/bbs/"
href = [href_front + a.get('href') for a in titles]
title = [a.get('title') for a in titles]
# Dates
# Intended to pick the <td> cells that have no data-device attribute
dates = soup.find_all("td", attrs={'class':"", 'data-device':""})
date = [x.get_text() for x in dates]
# dataframe
df_1 = pd.DataFrame({"title":title, "href":href, "date":date})
df_1['구분'] = '산업부'  # '구분' column = source ministry
df_1
####################################################################
################ 환경부 (Ministry of Environment) ###################
####################################################################
환경부 = "https://me.go.kr/home/web/board/list.do?menuId=286&boardMasterId=1&boardCategoryId=39"
res = requests.get(환경부)
soup = BeautifulSoup(res.content, 'html.parser')

# Titles
titles = soup.select("td.al")
title = [td.get_text().strip() for td in titles]
# URLs
hrefs = soup.select("td.al > a[href]")
href = ["https://me.go.kr/" + a.get('href') for a in hrefs]
# Dates
dates = soup.find_all("td", attrs={"class":"",'span':""})
# Each row yields 5 matching cells and the date is the 4th of them,
# so keep only the cells where (i+1) % 5 == 4
date = []
for i, x in enumerate(dates):
    if (i+1)%5==4:
        date.append(x.get_text().strip())
# dataframe
df_2 = pd.DataFrame({"title":title, "href":href, "date":date})
df_2['구분'] = '환경부'  # '구분' column = source ministry
df_2

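# Combine both ministries and sort by the date string, newest first
df = pd.concat([df_1, df_2], ignore_index=True).sort_values(by='date', ascending=False)
# cp949 is the Korean Windows code page, so the CSV opens cleanly in Excel there
df.to_csv(f"{today}_보도자료.csv",index=False, encoding='cp949')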

#####################################################################
############################ Keywords ###############################
#####################################################################
# 원자력 = nuclear power, 탄소중립 = carbon neutrality
keywords = ['원자력','탄소중립']
for x in df['title']:
    # any() prints a title once even when it contains several keywords
    if any(keyword in x for keyword in keywords):
        print(x)
df
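
The keyword loop above only prints the matching titles. As a rough sketch, the same filter can return the matching rows as a DataFrame by using pandas' `Series.str.contains`. The helper name `filter_by_keywords` and the sample titles below are made up for illustration and are not real press releases.

```python
import re

import pandas as pd


def filter_by_keywords(df, keywords, column="title"):
    """Return the rows whose `column` contains at least one keyword."""
    # Escape each keyword so it is matched literally, then OR them together
    pattern = "|".join(re.escape(k) for k in keywords)
    # na=False treats missing titles as "no match" instead of raising
    mask = df[column].str.contains(pattern, na=False, regex=True)
    return df[mask]


# Made-up sample rows, only to show the shape of the data
sample = pd.DataFrame({
    "title": ["원자력 안전 대책 발표", "수출 동향", "탄소중립 이행 계획", None],
    "구분": ["산업부", "산업부", "환경부", "환경부"],
})

matched = filter_by_keywords(sample, ["원자력", "탄소중립"])
print(matched)
```

Keeping the result as a DataFrame means the `href` and `date` columns stay attached to each matching title, so the filtered rows could be saved to their own CSV if needed.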

<Resulting Excel file>
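
One thing to watch with `encoding='cp949'`: if a title contains a character that cp949 cannot represent, `to_csv` raises a `UnicodeEncodeError`. A possible alternative, shown here only as a sketch, is `utf-8-sig`, which writes a byte-order mark that Excel generally uses to detect UTF-8. The helper name `save_for_excel`, the sample row, and the output file name are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd


def save_for_excel(df, path):
    """Write df as UTF-8 with a BOM so Excel can detect the encoding."""
    df.to_csv(path, index=False, encoding="utf-8-sig")


# Made-up sample row; the emoji is a character cp949 cannot encode
sample = pd.DataFrame({
    "title": ["탄소중립 이행 계획 🌱"],
    "date": ["2022-03-14"],
    "구분": ["환경부"],
})

# Write to a temporary directory so the example leaves nothing behind in the project
out_path = os.path.join(tempfile.mkdtemp(), "sample_press_releases.csv")
save_for_excel(sample, out_path)

# Read the file back to confirm the text survives the round trip
restored = pd.read_csv(out_path, encoding="utf-8-sig")
print(restored)
```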
