IT Cookie
Web Crawling in Practice
🎵 Crawling Genie Music ranks 1 to 50
Genie Music site (2023.02.01)
Genie Chart > Monthly - Genie (www.genie.co.kr)
1. Basic setup
# Libraries
import requests
from bs4 import BeautifulSoup
# Fetch the web page
headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36'}
data = requests.get('https://www.genie.co.kr/chart/top200?ditc=M&rtm=N&ymd=20230201',headers=headers)
# Parse the web page
soup = BeautifulSoup(data.text, 'html.parser')
2. Locate the data you want to fetch
Use Chrome DevTools for reference.
2-1. rank
- Open DevTools and click the element that shows the rank.
- Right-click the tag you want to inspect, then choose Copy → Copy selector.
- Copy the selector for the rank 2 element the same way.
rank 1 selector: #body-content > div.newest-list > div > table > tbody > tr:nth-child(1) > td.number
rank 2 selector: #body-content > div.newest-list > div > table > tbody > tr:nth-child(2) > td.number
2-2. title
- In the same way, copy the selectors for the first two song titles.
rank 1 selector: #body-content > div.newest-list > div > table > tbody > tr:nth-child(1) > td.info > a.title.ellipsis
rank 2 selector: #body-content > div.newest-list > div > table > tbody > tr:nth-child(2) > td.info > a.title.ellipsis
2-3. artist
- In the same way, copy the selectors for the first two artists.
rank 1 selector: #body-content > div.newest-list > div > table > tbody > tr:nth-child(1) > td.info > a.artist.ellipsis
rank 2 selector: #body-content > div.newest-list > div > table > tbody > tr:nth-child(2) > td.info > a.artist.ellipsis
3. Select and fetch the data you want
3-1. tr tags
The selectors copied above differ only in the tr:nth-child(n) part, so each chart entry is one tr row.
genie[0] holds the contents of tr:nth-child(1)
genie[1] holds the contents of tr:nth-child(2)
...
genie[49] holds the contents of tr:nth-child(50)
The genie variable collects every tr tag found at the path below.
genie = soup.select('#body-content > div.newest-list > div > table > tbody > tr')
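To check that list indexing lines up with the nth-child numbering, here is a minimal sketch on a made-up HTML fragment. SAMPLE_HTML and its contents are hypothetical stand-ins, not Genie's real markup; only the select and select_one calls and the selector path come from this post.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified fragment; the real Genie page is much larger.
SAMPLE_HTML = """
<div id="body-content"><div class="newest-list"><div><table><tbody>
<tr><td class="number">1</td></tr>
<tr><td class="number">2</td></tr>
<tr><td class="number">3</td></tr>
</tbody></table></div></div></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
base = '#body-content > div.newest-list > div > table > tbody > tr'

# select() returns a list of every matching tr, in document order.
rows = soup.select(base)
# select_one() with the selector copied from DevTools returns a single tr.
second = soup.select_one(base + ':nth-child(2)')

# Python lists are 0-based while nth-child is 1-based.
same_row = rows[1] is second
print(len(rows), same_row)
```

In this fragment rows[1] and the tr:nth-child(2) match are the same element, which is why one select call can replace fifty copied selectors.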
3-2. Select the rank, title, and artist data with a loop
- rank location: tr > td.number
- title location: tr > td.info > a.title.ellipsis
- artist location: tr > td.info > a.artist.ellipsis
4. Output
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36'}
data = requests.get('https://www.genie.co.kr/chart/top200?ditc=M&rtm=N&ymd=20230201',headers=headers)
soup = BeautifulSoup(data.text, 'html.parser')
genie = soup.select('#body-content > div.newest-list > div > table > tbody > tr')
for tr in genie:
    rank = tr.select_one('td.number').text
    title = tr.select_one('a.title.ellipsis').text
    artist = tr.select_one('a.artist.ellipsis').text
    print(rank, title, artist)
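Problem found
Printing the result showed that the rank and song title did not come out cleanly: both carried surrounding whitespace, and the rank came with extra text after the number.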
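One way to see this kind of problem is to print the text with repr(), which makes newlines and spaces visible. The sketch below uses a made-up row (SAMPLE_ROW is hypothetical and only loosely modelled on a chart table, not copied from Genie), so the exact whitespace on the real page may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical row; the nested span stands in for rank-change information.
SAMPLE_ROW = """
<table><tbody><tr>
  <td class="number">1
      <span class="rank">no change</span>
  </td>
  <td class="info">
    <a class="title ellipsis">
        Song A</a>
    <a class="artist ellipsis">Artist A</a>
  </td>
</tr></tbody></table>
"""

tr = BeautifulSoup(SAMPLE_ROW, 'html.parser').select_one('tr')
rank_text = tr.select_one('td.number').text
title_text = tr.select_one('a.title.ellipsis').text

# repr() shows whitespace characters instead of rendering them.
print(repr(rank_text))
print(repr(title_text))
```

Note that .text on td.number also includes the text of any nested tags, which is how extra words can end up next to the rank.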
4-1. Remove whitespace from rank and title
➡️ Use strip(), a built-in Python string method
# Updated loop (rank/title)
for tr in genie:
    rank = tr.select_one('td.number').text.strip()
    title = tr.select_one('a.title.ellipsis').text.strip()
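As a quick illustration of what strip() does and does not remove, here is a small sketch on a made-up string (raw_title is a hypothetical value, not real scraped output).

```python
# Hypothetical scraped text with a leading newline and trailing spaces.
raw_title = '\n        Song A   '

stripped = raw_title.strip()     # trims whitespace at both ends
left_only = raw_title.lstrip()   # trims leading whitespace only
right_only = raw_title.rstrip()  # trims trailing whitespace only

# Whitespace inside the string is left untouched.
inner = '  Song   A  '.strip()

print(repr(stripped), repr(left_only), repr(right_only), repr(inner))
```

Because strip() only touches the two ends, it cleans the title but cannot remove extra text that follows the rank number.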
4-2. Print only the rank number
➡️ Keep only the first two characters → use text[0:2]
# Updated loop (rank)
for tr in genie:
    rank = tr.select_one('td.number').text[0:2].strip()
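The two-character slice is enough for ranks 1 to 50. The sketch below compares it with an alternative that is not from this post, taking the first whitespace-separated token with split(), on made-up strings (samples is hypothetical and assumes the number comes first in the cell text).

```python
# Hypothetical td.number texts: the rank, then whitespace, then change info.
samples = ['1\n      no change', '10\n      up 2', '100\n      down 1']

for text in samples:
    by_slice = text[0:2].strip()  # the post's approach: first two characters
    by_split = text.split()[0]    # alternative: first whitespace-separated token
    print(by_slice, by_split)

sliced = [t[0:2].strip() for t in samples]
tokens = [t.split()[0] for t in samples]
```

The slice would cut a three-digit rank down to two digits, so split() may be the safer choice if the chart is ever read past rank 99.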
[Final code]
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36'}
data = requests.get('https://www.genie.co.kr/chart/top200?ditc=M&rtm=N&ymd=20230201',headers=headers)
soup = BeautifulSoup(data.text, 'html.parser')
genie = soup.select('#body-content > div.newest-list > div > table > tbody > tr')
# Updated loop
for tr in genie:
    rank = tr.select_one('td.number').text[0:2].strip()
    title = tr.select_one('a.title.ellipsis').text.strip()
    artist = tr.select_one('a.artist.ellipsis').text
    print(rank, title, artist)
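select_one returns None when nothing matches, so calling .text on a row that lacks one of these cells would raise an AttributeError. Whether Genie's table ever contains such rows is not verified here. As a defensive sketch only, the hypothetical helper parse_chart below skips incomplete rows and returns a list of dicts; it uses a shortened selector and a made-up SAMPLE_HTML so that it runs without a network request.

```python
from bs4 import BeautifulSoup

def parse_chart(html):
    """Return a list of {'rank', 'title', 'artist'} dicts (hypothetical helper)."""
    soup = BeautifulSoup(html, 'html.parser')
    songs = []
    # Shortened selector for the sample below; the post uses the full path.
    for tr in soup.select('table > tbody > tr'):
        rank = tr.select_one('td.number')
        title = tr.select_one('a.title.ellipsis')
        artist = tr.select_one('a.artist.ellipsis')
        # select_one returns None when nothing matches; skip incomplete rows.
        if rank is None or title is None or artist is None:
            continue
        songs.append({
            'rank': rank.text[0:2].strip(),
            'title': title.text.strip(),
            'artist': artist.text.strip(),
        })
    return songs

# Made-up fragment; the second row deliberately lacks the expected cells.
SAMPLE_HTML = """
<table><tbody>
<tr><td class="number">1
  <span>no change</span></td>
  <td class="info"><a class="title ellipsis"> Song A </a>
  <a class="artist ellipsis">Artist A</a></td></tr>
<tr><td>advertisement</td></tr>
</tbody></table>
"""

songs = parse_chart(SAMPLE_HTML)
print(songs)
```

Returning a list of dicts instead of printing inside the loop also makes it easier to save or sort the results later.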