IT Cookie

Web Crawling in Practice

ahhyeon 2023. 3. 13. 22:21

🎵 Crawling Genie Music ranks 1 to 50

Genie Music site (2023.02.01)

Genie Chart > Monthly - Genie
www.genie.co.kr

1. Basic setup

# Libraries
import requests
from bs4 import BeautifulSoup

# Fetch the web page
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36'}
data = requests.get('https://www.genie.co.kr/chart/top200?ditc=M&rtm=N&ymd=20230201', headers=headers)

# Parse the web page
soup = BeautifulSoup(data.text, 'html.parser')
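
The request URL above carries three query parameters. As a minimal sketch using only the standard library, the snippet below splits the URL apart and rebuilds it for another `ymd` value. The helper name `chart_url` is hypothetical, and treating `ymd` as the chart date is an assumption based on it matching the 2023.02.01 date in this post; whether the site serves a chart for any given value is not checked here.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

CHART_URL = 'https://www.genie.co.kr/chart/top200?ditc=M&rtm=N&ymd=20230201'

# Split the URL from the post into its parts and read the query parameters.
parts = urlsplit(CHART_URL)
params = {key: values[0] for key, values in parse_qs(parts.query).items()}
print(params)  # {'ditc': 'M', 'rtm': 'N', 'ymd': '20230201'}


def chart_url(ymd):
    """Hypothetical helper: rebuild the chart URL with another ymd value."""
    query = urlencode({**params, 'ymd': ymd})
    return f'{parts.scheme}://{parts.netloc}{parts.path}?{query}'


print(chart_url('20230201') == CHART_URL)  # True
```

With `requests`, the same dictionary can also be passed as `requests.get(url, params=params, headers=headers)` instead of writing the query string by hand.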

2. Locating the data you want to fetch

👀 See Chrome DevTools

2-1. rank

  • Open the developer tools → click the part of the page where the rank is shown

  • Right-click the tag you want to inspect, then Copy → Copy selector

  • Do the same for the rank 2 entry: Copy selector
rank 1 selector: #body-content > div.newest-list > div > table > tbody > tr:nth-child(1) > td.number
rank 2 selector: #body-content > div.newest-list > div > table > tbody > tr:nth-child(2) > td.number

2-2. title

  • In the same way, Copy selector for the titles of the first two songs
rank 1 selector: #body-content > div.newest-list > div > table > tbody > tr:nth-child(1) > td.info > a.title.ellipsis
rank 2 selector: #body-content > div.newest-list > div > table > tbody > tr:nth-child(2) > td.info > a.title.ellipsis

2-3. artist

  • In the same way, Copy selector for the artists of the first two songs
rank 1 selector: #body-content > div.newest-list > div > table > tbody > tr:nth-child(1) > td.info > a.artist.ellipsis
rank 2 selector: #body-content > div.newest-list > div > table > tbody > tr:nth-child(2) > td.info > a.artist.ellipsis
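
Each pair of copied selectors differs only in the `tr:nth-child(n)` part. As a rough sketch of that observation, the snippet below removes the index with a regular expression, which leaves one selector that matches every row. The function name `generalize` is made up for this illustration.

```python
import re

rank_1 = '#body-content > div.newest-list > div > table > tbody > tr:nth-child(1) > td.number'
rank_2 = '#body-content > div.newest-list > div > table > tbody > tr:nth-child(2) > td.number'


def generalize(selector):
    """Drop the :nth-child(n) part so the selector matches every row."""
    return re.sub(r':nth-child\(\d+\)', '', selector)


all_ranks = generalize(rank_1)
print(all_ranks == generalize(rank_2))  # True

# Split into the row selector and the part that is relative to each row.
row_selector, cell_selector = all_ranks.rsplit(' > ', 1)
print(row_selector)   # #body-content > div.newest-list > div > table > tbody > tr
print(cell_selector)  # td.number
```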

3. Selecting and fetching the data you want

3-1. tr tag

genie[0] stores the information contained in tr:nth-child(1)
genie[1] stores the information contained in tr:nth-child(2)
...
genie[49] stores the information contained in tr:nth-child(50)
The genie variable holds every tr tag found at the path below:
genie = soup.select('#body-content > div.newest-list > div > table > tbody > tr')
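
To see the index mapping without hitting the site, here is a small sketch that runs the same selector against made-up, heavily simplified markup. The HTML string only imitates the path used above and is not the real page source.

```python
from bs4 import BeautifulSoup

# Made-up, simplified markup; the real page has more columns and rows.
sample_html = """
<div id="body-content"><div class="newest-list"><div><table><tbody>
<tr><td class="number">1</td></tr>
<tr><td class="number">2</td></tr>
<tr><td class="number">3</td></tr>
</tbody></table></div></div></div>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# select() returns a list of every matching tag, in document order.
rows = soup.select('#body-content > div.newest-list > div > table > tbody > tr')
print(len(rows))                      # 3
print([row.td.text for row in rows])  # ['1', '2', '3']

# rows[0] holds the same content as the tr:nth-child(1) row.
first = soup.select_one('#body-content > div.newest-list > div > table > tbody > tr:nth-child(1)')
print(first.td.text == rows[0].td.text)  # True
```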

3-2. Use a loop to select the rank, title, and artist data

- rank location: tr > td.number
- title location: tr > td.info > a.title.ellipsis
- artist location: tr > td.info > a.artist.ellipsis
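
Because each `tr` is itself a tag, `select_one()` can be called on it with a selector relative to that row, so the long page-level prefix is no longer needed. The sketch below tries this on a single made-up row; the markup only reuses the class names listed above, and the extra `span` is a hypothetical stand-in for whatever else the rank cell contains.

```python
from bs4 import BeautifulSoup

# Made-up row that only imitates the class names quoted above.
row_html = """
<table><tbody><tr>
  <td class="number">1
    <span>hypothetical extra text</span>
  </td>
  <td class="info">
    <a class="title ellipsis">
      Sample Title</a>
    <a class="artist ellipsis">Sample Artist</a>
  </td>
</tr></tbody></table>
"""

tr = BeautifulSoup(row_html, 'html.parser').select_one('tr')

# Selectors are evaluated relative to this row.
rank = tr.select_one('td.number').text
title = tr.select_one('td.info > a.title.ellipsis').text
artist = tr.select_one('td.info > a.artist.ellipsis').text

# repr() makes any whitespace inside the raw text visible.
print(repr(rank))
print(repr(title))
print(repr(artist))  # 'Sample Artist'
```

Note that `select_one()` returns `None` when nothing matches, so calling `.text` on a row that lacks one of these elements would raise an `AttributeError`.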

4. Output

import requests
from bs4 import BeautifulSoup

# Fetch the chart page with a browser-like User-Agent
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36'}
data = requests.get('https://www.genie.co.kr/chart/top200?ditc=M&rtm=N&ymd=20230201', headers=headers)

# Parse the HTML
soup = BeautifulSoup(data.text, 'html.parser')

# Every chart row (one tr per song)
genie = soup.select('#body-content > div.newest-list > div > table > tbody > tr')

# Read rank, title, and artist relative to each row
for tr in genie:
    rank = tr.select_one('td.number').text
    title = tr.select_one('a.title.ellipsis').text
    artist = tr.select_one('a.artist.ellipsis').text

    print(rank, title, artist)

👀 Problem found
Printing the result shows that the rank and song title do not come out cleanly: they carry surrounding whitespace, and the rank cell contains more than just the number.

4-1. Remove whitespace from rank and title

➡️ Use Python's built-in string method strip()

# Updated loop (rank/title)
for tr in genie:
    rank = tr.select_one('td.number').text.strip()
    title = tr.select_one('a.title.ellipsis').text.strip()
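
As a quick illustration of what `strip()` does and does not remove, here is a sketch on two made-up strings. They only imitate text with surrounding whitespace and are not copied from the real page.

```python
# Made-up raw strings imitating scraped text with surrounding whitespace.
raw_title = '\n        Sample Title\n    '
print(repr(raw_title.strip()))  # 'Sample Title'

# strip() only trims the two ends; anything in the middle stays.
raw_rank = '1\n    extra\n'
print(repr(raw_rank.strip()))  # '1\n    extra'
```

So `strip()` alone is enough for a title like this one, while a rank cell that has more text after the number needs one more step.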

4-2. Output only the rank number

➡️ Keep only the first two characters → use text[0:2]

# Updated loop (rank)
for tr in genie:
    rank = tr.select_one('td.number').text[0:2].strip()
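
The two-character slice is enough for ranks 1 to 50. The sketch below compares it with one possible alternative, taking the first whitespace-separated token, on made-up cell texts. Both function names are hypothetical, and the three-digit case is only there to show where the slice would stop working.

```python
def rank_by_slice(text):
    """The approach above: keep the first two characters, then strip."""
    return text[0:2].strip()


def rank_by_split(text):
    """Alternative sketch: keep the first whitespace-separated token."""
    return text.split()[0]


# Made-up cell texts: a rank followed by extra text on later lines.
for raw in ['1\n    extra\n', '10\n    extra\n', '100\n    extra\n']:
    print(rank_by_slice(raw), rank_by_split(raw))
# 1 1
# 10 10
# 10 100
```

The slice also assumes the text starts directly with the number; `split()` skips leading whitespace, but it raises an `IndexError` on text that is empty or whitespace only.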

 

[Final code]

import requests
from bs4 import BeautifulSoup

# Fetch the chart page with a browser-like User-Agent
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36'}
data = requests.get('https://www.genie.co.kr/chart/top200?ditc=M&rtm=N&ymd=20230201', headers=headers)

# Parse the HTML
soup = BeautifulSoup(data.text, 'html.parser')

# Every chart row (one tr per song)
genie = soup.select('#body-content > div.newest-list > div > table > tbody > tr')

# Updated loop: slice the rank to two characters and strip whitespace
for tr in genie:
    rank = tr.select_one('td.number').text[0:2].strip()
    title = tr.select_one('a.title.ellipsis').text.strip()
    artist = tr.select_one('a.artist.ellipsis').text

    print(rank, title, artist)
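
One way to make the loop reusable is to wrap the parsing step in a function that takes an HTML string and returns the rows. This is only a sketch: the name `parse_chart`, the choice to skip incomplete rows, and the sample markup are all assumptions made for illustration, not part of the original code.

```python
from bs4 import BeautifulSoup

ROW_SELECTOR = '#body-content > div.newest-list > div > table > tbody > tr'
CELL_SELECTORS = ('td.number', 'a.title.ellipsis', 'a.artist.ellipsis')


def parse_chart(html):
    """Return (rank, title, artist) tuples, skipping incomplete rows."""
    soup = BeautifulSoup(html, 'html.parser')
    songs = []
    for tr in soup.select(ROW_SELECTOR):
        cells = [tr.select_one(selector) for selector in CELL_SELECTORS]
        if any(cell is None for cell in cells):
            continue  # select_one() returns None when nothing matches
        rank, title, artist = (cell.text.strip() for cell in cells)
        songs.append((rank[0:2].strip(), title, artist))
    return songs


# Made-up markup: the second row has no artist link and is skipped.
sample_html = """
<div id="body-content"><div class="newest-list"><div><table><tbody>
<tr><td class="number">1
  <span>extra</span></td>
  <td class="info"><a class="title ellipsis">
    Song A</a><a class="artist ellipsis">Artist A</a></td></tr>
<tr><td class="number">2</td>
  <td class="info"><a class="title ellipsis">Song B</a></td></tr>
</tbody></table></div></div></div>
"""
print(parse_chart(sample_html))  # [('1', 'Song A', 'Artist A')]
```

With the real page this would be called as `parse_chart(data.text)`. Calling `data.raise_for_status()` right after `requests.get(...)` is a common way to stop early when the request fails.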
