Fetching a list of articles on Windows

Linux is only suitable when the site does not block bots with a CAPTCHA or Cloudflare.

If you are blocked, you need to crawl on Windows with a real Chrome browser.

Step 1. Create the virtual environment myenv

This is necessary when you run several crawl jobs and do not want them to affect one another or the base Windows installation.

--> Each project/crawl job should have its own myenv, located inside that job's own folder.
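
A minimal sketch of creating such a per-job environment with Python's standard `venv` module, assuming the environment is named myenv as in this article. The helper name `create_job_env` is hypothetical; from a command prompt, `py -3 -m venv myenv` is the usual equivalent.

```python
import venv
from pathlib import Path


def create_job_env(job_dir: str, name: str = "myenv", with_pip: bool = True) -> Path:
    """Create a virtual environment called `name` inside `job_dir` (hypothetical helper)."""
    env_path = Path(job_dir) / name
    # venv.create builds the folder layout; with_pip=True also bootstraps pip.
    venv.create(env_path, with_pip=with_pip)
    return env_path


# Example call, run from inside the crawl job's folder:
#   create_job_env(".")
```

Keeping the environment inside the job folder means that deleting the folder removes the job and its packages together.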

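For reference, the hint printed by the batch file in step 3 uses: py -3 -m venv myenv

Step 2. Activate the environment you created

The same hint activates it with: myenv\Scripts\activate.bat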

Step 3. Create the file RunFirst.bat

Purpose of this file:

1️⃣ Use the correct Python environment

If the myenv folder exists → the script runs with the virtual environment's Python.
If it does not → the system Python is used (py -3 or python).

➡️ This way ds2000.py runs with exactly the packages you installed in myenv.
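
To confirm which interpreter actually ran the script, a small check like the one below can be placed near the top of the crawler. This is a sketch, not part of the original ds2000.py; the function names are hypothetical, and it relies on the standard behaviour that `sys.prefix` differs from `sys.base_prefix` inside a venv.

```python
import sys


def running_in_venv() -> bool:
    """Return True when the interpreter runs inside a virtual environment."""
    # Inside a venv, sys.prefix points at the environment,
    # while sys.base_prefix still points at the base installation.
    return sys.prefix != sys.base_prefix


def describe_interpreter() -> str:
    """Build a log line similar to the [INFO ] line printed by the .bat file."""
    kind = "venv" if running_in_venv() else "system"
    return f"[INFO ] Python in use: {sys.executable} ({kind})"
```

Printing `describe_interpreter()` at startup makes it obvious when the system Python was picked up instead of myenv.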

2️⃣ Open a separate Chrome for crawling

Create a separate profile named DS so that:
- It does not affect the Chrome you use every day.
- Selenium can attach easily through the port --remote-debugging-port=9333.
A proxy can be used (if you enable the HTTP_PROXY line).

➡️ This Chrome is dedicated to the crawl script and keeps its own cookies/session.

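3️⃣ Launch the `ds2000.py` script automatically

Once Chrome has opened, the .bat file runs the script with the system Python (py -3 ds2000.py)

or (if using the venv) with myenv\Scripts\python.exe ds2000.py

➡️ The script starts automatically; you do not have to type anything.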

4️⃣ Show logs and keep the window open

It prints log lines: which Python is in use, whether Chrome was launched, and whether the crawl failed.
Whether the script finishes or fails, the window stays open (pause) so you can read the messages.

5️⃣ Make sure the script and Chrome match

Both use the same values:

DEBUG_PORT=9333
USER_DATA_DIR=%LOCALAPPDATA%\Google\Chrome\User Data - DS

➡️ This ensures ds2000.py can attach to Chrome on the debug port you opened.
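
One way to verify the match before starting a long crawl is to ask Chrome's DevTools HTTP endpoint for its version on that port. The sketch below is not part of the original script: the helper name `devtools_version` is hypothetical, and it assumes Chrome was started with --remote-debugging-port=9333 as described above. Proxies are disabled for this request so that an HTTP_PROXY setting does not reroute the call to 127.0.0.1.

```python
import http.client
import json
import urllib.request
from typing import Optional

DEBUG_HOST = "127.0.0.1"
DEBUG_PORT = 9333  # must match DEBUG_PORT in the .bat file


def devtools_version(host: str = DEBUG_HOST, port: int = DEBUG_PORT,
                     timeout: float = 2.0) -> Optional[dict]:
    """Return Chrome's /json/version payload, or None if nothing answers on the port."""
    url = f"http://{host}:{port}/json/version"
    # An empty ProxyHandler bypasses HTTP_PROXY/HTTPS_PROXY for this local request.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except (OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeouts, and malformed replies all mean "not attachable".
        return None
```

If the function returns a dictionary, Chrome is listening on the port and Selenium should be able to attach; `None` suggests Chrome is not running with that port or another program is using it.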

💡 When you need the .bat file

When you want to start everything with a click, without opening CMD by hand.
When you want to be sure the script always uses the right environment and Chrome profile.
When you want someone else (for example another staff member) to start the crawl with a click, without knowing any Python commands.

@echo off
setlocal ENABLEEXTENSIONS ENABLEDELAYEDEXPANSION
REM ============================================================
REM start_ds.bat: run ds2000.py in "wait for the CAPTCHA" mode
REM - When ds2000.py suspects a CAPTCHA -> it exits with code 90 -> this batch file STOPS and waits for you to press Enter.
REM - After solving the CAPTCHA in Chrome, return to CMD and press Enter to continue.
REM ============================================================

REM ===== Paths =====
set "SCRIPT_DIR=%~dp0"
set "DS_SCRIPT=%SCRIPT_DIR%ds2000.py"
set "CHROME_EXE=C:\Program Files\Google\Chrome\Application\chrome.exe"
if not exist "%CHROME_EXE%" set "CHROME_EXE=C:\Program Files (x86)\Google\Chrome\Application\chrome.exe"

REM ===== Check ds2000.py exists =====
if not exist "%DS_SCRIPT%" (
    echo [ERROR] Cannot find "%DS_SCRIPT%".
    echo Put this .bat file in the same folder as ds2000.py
    echo.
    pause
    exit /b 1
)

REM ===== Pick Python (prefer the venv .\myenv) =====
REM The venv path is stored with its own quotes so that folders containing spaces still work.
set "PY_EXE="
if exist "%SCRIPT_DIR%myenv\Scripts\python.exe" set "PY_EXE="%SCRIPT_DIR%myenv\Scripts\python.exe""
if not defined PY_EXE ( where py >nul 2>&1 && set "PY_EXE=py -3" )
if not defined PY_EXE ( where python >nul 2>&1 && set "PY_EXE=python" )
if not defined PY_EXE (
    echo [ERROR] Python not found. Create the venv: "%SCRIPT_DIR%myenv".
    echo Hint: py -3 -m venv "%SCRIPT_DIR%myenv" ^&^& "%SCRIPT_DIR%myenv\Scripts\activate.bat"
    echo.
    pause
    exit /b 1
)
echo [INFO ] Python in use: %PY_EXE%

REM ===== (Optional) Proxy =====
REM set "HTTP_PROXY=http://100.82.200.6:8888"
REM set "HTTPS_PROXY=%HTTP_PROXY%"

REM ===== Chrome attach settings (must match ds2000.py) =====
set "USER_DATA_DIR=%LOCALAPPDATA%\Google\Chrome\User Data - DS"
set "PROFILE_NAME=DS"
set "DEBUG_PORT=9333"

REM ===== Build proxy arg only if defined =====
set "PROXY_ARG="
if defined HTTP_PROXY set "PROXY_ARG=--proxy-server=%HTTP_PROXY%"

REM ===== Launch Chrome once (it stays open) =====
echo [BOOT ] Launching Chrome with the dedicated profile...
start "" "%CHROME_EXE%" ^
    --remote-debugging-port=%DEBUG_PORT% ^
    --user-data-dir="%USER_DATA_DIR%" ^
    --profile-directory="%PROFILE_NAME%" ^
    %PROXY_ARG% ^
    --lang=vi

REM ===== Give DevTools a moment =====
timeout /t 2 /nobreak >nul 2>&1

REM ===== Run loop =====
set "PYTHONIOENCODING=utf-8"

:RUN_LOOP
echo.
echo [RUN ] %PY_EXE% "%DS_SCRIPT%"
%PY_EXE% "%DS_SCRIPT%"
set "ERRLVL=%ERRORLEVEL%"
echo.

if "%ERRLVL%"=="0" (
    echo [DONE ] ds2000.py finished successfully.
    goto END_ALL
)

REM ===== Code 90: suspected CAPTCHA/WAF -> really STOP so you can act =====
if "%ERRLVL%"=="90" (
    echo [PAUSE] The script suspects a WAF/CAPTCHA block.
    echo Switch to the open Chrome window and solve the CAPTCHA or wait for the page to pass.
    echo When done, return to this window and PRESS Enter to continue...
    pause >nul
    goto RUN_LOOP
)

REM ===== Other errors: ask whether to run again =====
echo [FAIL ] ds2000.py exited with error code %ERRLVL%.
choice /C YN /N /M "Run again? (Y/N): "
if errorlevel 2 goto END_ALL
goto RUN_LOOP

:END_ALL
echo.
echo (Press any key to close...)
pause >nul
endlocal

Step 4. Create the Python script ds2000.py that crawls the list

Example script below:

If a CAPTCHA is encountered --> the script stops --> it updates the file resume_state.json
You then need to open the site in Chrome and solve the CAPTCHA manually
After solving it, go back to CMD and press Enter
The script reads resume_state.json and continues from where it stopped
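
The hand-off between the crawler and its launcher rests on one convention: exit code 90 means "a human needs to act". As an illustration only, here is the launcher's decision logic sketched in Python instead of batch. The names `classify_exit` and `run_once` are hypothetical; the code 90 and the three outcomes come from the .bat file above.

```python
import subprocess
import sys

CAPTCHA_EXIT_CODE = 90  # ds2000.py exits with this code when it suspects a CAPTCHA


def classify_exit(code: int) -> str:
    """Map the crawler's exit code to the action the launcher takes."""
    if code == 0:
        return "done"           # finished normally
    if code == CAPTCHA_EXIT_CODE:
        return "wait_for_user"  # pause, let the user solve the CAPTCHA, then rerun
    return "ask_retry"          # any other failure: ask whether to run again


def run_once(cmd: list) -> str:
    """Run the crawler once and report what should happen next."""
    completed = subprocess.run(cmd)
    return classify_exit(completed.returncode)


# Example call: run_once([sys.executable, "ds2000.py"])
```

Because the checkpoint is written before the script exits with 90, rerunning the same command is enough to continue from the saved day and page.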

# ds2000.py
# -*- coding: utf-8 -*-

from bs4 import BeautifulSoup
import pandas as pd
import re
import time
import random
import traceback
from datetime import datetime, timedelta
import os
import sys
import socket
import subprocess
import json

# ===== Selenium (real Chrome) =====
from selenium import webdriver
from selenium.webdriver.chrome.options import Options as ChromeOptions
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException, WebDriverException, JavascriptException

# ================== CONFIGURATION ==================
SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
LOG_FILE = os.path.join(SCRIPT_DIR, "1-1-LayDanhSach-2000.log")
OUTPUT_FILE = os.path.join(SCRIPT_DIR, "1-1-LayDanhSach-2000.csv")
RESUME_FILE = os.path.join(SCRIPT_DIR, "resume_state.json")  # checkpoint file

# Per-DAY URL template: must contain {bdate}, {edate}, {page}
TARGET_URL_TPL = (
    "https://WebSiteNguon.vn/link"
    "?bdate={bdate}&edate={edate}&page={page}"
)

# Date range to fetch
START_DATE_STR = "18/04/2004"  # dd/mm/YYYY
END_DATE_STR = "20/10/2025"  # dd/mm/YYYY
DATE_FMT = "%d/%m/%Y"

# Maximum number of pages to try for EACH DAY (guards against an infinite loop)
TOTAL_PAGE = 200

# Delays in seconds, as (min, max) ranges
DELAY_NEXT = (3.1, 7.5)
DELAY_EACHDAY = (23.1, 27.5)

# Go to the next page only if the current page has a full 20 documents
PAGE_SIZE_TRIGGER = 20

# Timeout & Retry
PAGELOAD_TIMEOUT_S = 240
DOM_READY_TIMEOUT_S = 60
NAV_MAX_RETRIES = 3
NAV_BACKOFF_RANGE = (3.0, 8.0)

# Chrome DevTools attach
DEBUG_HOST = "127.0.0.1"
DEBUG_PORT = 9333
DEBUG_ADDR = f"{DEBUG_HOST}:{DEBUG_PORT}"
USER_DATA_DIR = os.path.join(os.environ.get("LOCALAPPDATA", ""), "Google", "Chrome", "User Data - DS")
PROFILE_DIR = "DS"

HEADER_COLUMNS = ["tt", "tieude", "ngaybanhanh", "capnhat", "url", "uuid", "LanDau", "NgayKiemTraTrungLap"]

def _sleep_range(val):
    # Accepts either a (min, max) pair for a random delay or a single number
    if isinstance(val, (list, tuple)) and len(val) == 2:
        time.sleep(random.uniform(float(val[0]), float(val[1])))
    else:
        time.sleep(float(val))

def _sleep_next_page(): _sleep_range(DELAY_NEXT)
def _sleep_each_day(): _sleep_range(DELAY_EACHDAY)
def now_str(): return datetime.now().strftime("%Y-%m-%d %H:%M:%S")

def log(message: str):
    timestamp = now_str()
    line = f"[{timestamp}] {message}"
    os.makedirs(os.path.dirname(LOG_FILE), exist_ok=True)
    with open(LOG_FILE, "a", encoding="utf-8") as log_file:
        log_file.write(line + "\n")
    print(line)

# ---------- CSV state ----------
def ensure_csv_initialized():
    # Create the CSV with headers if it is missing or empty
    if not os.path.exists(OUTPUT_FILE) or os.path.getsize(OUTPUT_FILE) == 0:
        pd.DataFrame(columns=HEADER_COLUMNS).to_csv(OUTPUT_FILE, index=False, encoding="utf-8-sig")
        return
    # Otherwise make sure all expected columns exist, in the expected order
    df = pd.read_csv(OUTPUT_FILE, dtype=str, encoding="utf-8-sig")
    changed = False
    for col in HEADER_COLUMNS:
        if col not in df.columns:
            df[col] = ""
            changed = True
    if list(df.columns) != HEADER_COLUMNS:
        df = df.reindex(columns=HEADER_COLUMNS)
        changed = True
    if changed:
        df.to_csv(OUTPUT_FILE, index=False, encoding="utf-8-sig")

def load_state():
    df = pd.read_csv(OUTPUT_FILE, dtype=str, encoding="utf-8-sig").fillna("")
    uuid_to_idx = {str(u).strip(): i for i, u in enumerate(df["uuid"].astype(str))}
    try:
        tt_vals = pd.to_numeric(df["tt"], errors="coerce")
        tt_max = int(tt_vals.max()) if not tt_vals.isna().all() else 0
    except Exception:
        tt_max = 0
    return df, uuid_to_idx, tt_max

def append_new_row(row_dict, tt_next: int):
    r = {
        "tt": tt_next,
        "tieude": row_dict.get("tieude", ""),
        "ngaybanhanh": row_dict.get("ngaybanhanh", ""),
        "capnhat": row_dict.get("capnhat", ""),
        "url": row_dict.get("url", ""),
        "uuid": row_dict.get("uuid", ""),
        "LanDau": now_str(),
        "NgayKiemTraTrungLap": ""
    }
    pd.DataFrame([r], columns=HEADER_COLUMNS).to_csv(
        OUTPUT_FILE, mode="a", index=False, header=False, encoding="utf-8-sig"
    )

def update_duplicate_timestamp(df, uuid_to_idx, uuid_val: str):
    idx = uuid_to_idx.get(uuid_val, None)
    if idx is None:
        return False
    current = str(df.at[idx, "NgayKiemTraTrungLap"]) if "NgayKiemTraTrungLap" in df.columns else ""
    ts = now_str()
    df.at[idx, "NgayKiemTraTrungLap"] = (current + ", " + ts).strip(", ").strip()
    df.to_csv(OUTPUT_FILE, index=False, encoding="utf-8-sig")
    return True

# ---------- CAPTCHA / banner ----------
CAPTCHA_HINTS = [
    "captcha", "recaptcha", "hcaptcha", "please verify you are a human",
    "ddos protection by cloudflare", "access denied", "to continue, please verify",
    "cf-chl", "/cdn-cgi/challenge-platform", "blocked", "xác minh bạn không phải là robot",
    "just a moment..."
]

# Site banner meaning "No documents found!"; kept verbatim because it is matched against the page
NO_RESULT_TEXT = "Không tìm thấy văn bản nào!"

def has_no_result_banner(html_bytes: bytes) -> bool:
    try:
        html = (html_bytes or b"").decode("utf-8", errors="ignore")
    except Exception:
        return False
    if ('<p id="ketqua"' in html) and (NO_RESULT_TEXT in html):
        return True
    try:
        soup = BeautifulSoup(html, "lxml")
        node = soup.select_one("p#ketqua strong")
        if node:
            return NO_RESULT_TEXT in (node.get_text(strip=True) or "")
    except Exception:
        pass
    return False

def has_ketqua_present(html_bytes: bytes) -> bool:
    try:
        soup = BeautifulSoup((html_bytes or b"").decode("utf-8", errors="ignore"), "lxml")
    except Exception:
        return False
    node = soup.select_one("p#ketqua")
    if not node:
        return False
    text = node.get_text(" ", strip=True).lower()
    return NO_RESULT_TEXT.lower() not in text  # p#ketqua exists and is not the 'no documents found' banner

def suspected_block(html_text: str, html_bytes: bytes = None) -> bool:
    # Do not flag a CAPTCHA if the page is a valid result page
    if html_bytes is not None:
        if has_no_result_banner(html_bytes): return False
        if has_ketqua_present(html_bytes): return False
    text = (html_text or "").lower()
    return any(hint in text for hint in CAPTCHA_HINTS)

def save_snapshot(tag: str, content: bytes) -> str:
    path = os.path.join(SCRIPT_DIR, f"snapshot_{tag}.html")
    try:
        with open(path, "wb") as f:
            f.write(content or b"")
        return path
    except Exception as e:
        log(f"⚠️ Could not save snapshot {tag}: {e}")
        return ""

# ---------- Checkpoint (resume) ----------
def save_resume(day_str: str, page: int):
    data = {"day": day_str, "page": page}
    with open(RESUME_FILE, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False)
    log(f"💾 Checkpoint saved: {data}")

def load_resume():
    if not os.path.exists(RESUME_FILE):
        return None
    try:
        with open(RESUME_FILE, "r", encoding="utf-8") as f:
            data = json.load(f)
        if not data or "day" not in data or "page" not in data:
            return None
        log(f"📂 Checkpoint found: {data}")
        return data
    except Exception as e:
        log(f"⚠️ Could not read checkpoint: {e}")
        return None

def clear_resume():
    try:
        if os.path.exists(RESUME_FILE):
            os.remove(RESUME_FILE)
            log("🧹 Checkpoint removed.")
    except Exception as e:
        log(f"⚠️ Could not remove checkpoint: {e}")

# ------------------------------------------------------

def parse_page(html: bytes) -> list:
    soup = BeautifulSoup(html, "lxml")
    entries = soup.find_all("div", class_=re.compile(r"content-\d+"))
    rows = []
    for entry in entries:
        title_tag = entry.select_one("div.left-col div.nq p.nqTitle a")
        tieude = title_tag.text.strip() if title_tag else ""
        url = title_tag["href"].strip() if title_tag and title_tag.has_attr("href") else ""
        # The document id is the number just before ".aspx" in the URL
        match = re.search(r"-(\d+)\.aspx", url)
        uuid = match.group(1) if match else ""
        ngaybanhanh_tag = entry.select_one("div.right-col p:nth-of-type(1)")
        ngaybanhanh = (ngaybanhanh_tag.text or "").replace("Ban hành:", "").strip() if ngaybanhanh_tag else ""
        capnhat_tag = entry.select_one("div.right-col p:nth-of-type(4)")
        capnhat = (capnhat_tag.text or "").replace("Cập nhật:", "").strip() if capnhat_tag else ""
        rows.append({
            "tieude": tieude,
            "ngaybanhanh": ngaybanhanh,
            "capnhat": capnhat,
            "url": url,
            "uuid": uuid
        })
    return rows

def build_url(day_str: str, page: int) -> str:
    return TARGET_URL_TPL.format(bdate=day_str, edate=day_str, page=page)

# ================== DRIVER ==================
DRIVER = None

def _is_port_open(host, port):
    try:
        with socket.create_connection((host, port), timeout=1.5):
            return True
    except OSError:
        return False

def _launch_chrome_ds():
    chrome_exe = r"C:\Program Files\Google\Chrome\Application\chrome.exe"
    if not os.path.exists(chrome_exe):
        chrome_exe = r"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe"
    os.makedirs(USER_DATA_DIR, exist_ok=True)
    proxy = os.environ.get("HTTP_PROXY") or globals().get("HTTP_PROXY", "")
    args = [
        chrome_exe,
        f"--remote-debugging-port={DEBUG_PORT}",
        f'--user-data-dir={USER_DATA_DIR}',
        f'--profile-directory={PROFILE_DIR}',
        "--lang=vi",
    ]
    if proxy:
        args.append(f'--proxy-server={proxy}')
    subprocess.Popen(args, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

def _create_driver_attach():
    # Attach to the already running Chrome instead of starting a new browser
    options = ChromeOptions()
    options.set_capability("pageLoadStrategy", "eager")
    options.add_experimental_option("debuggerAddress", DEBUG_ADDR)
    drv = webdriver.Chrome(options=options)
    drv.set_page_load_timeout(PAGELOAD_TIMEOUT_S)
    return drv

def _driver_alive(drv) -> bool:
    try:
        return drv.execute_script("return 1") == 1
    except Exception:
        return False

def _open_blank_tab(drv):
    try:
        drv.execute_script("window.open('about:blank','_blank');")
        drv.switch_to.window(drv.window_handles[-1])
    except JavascriptException:
        pass

def _reset_driver(reason: str = ""):
    global DRIVER
    try:
        if DRIVER:
            try: DRIVER.quit()
            except Exception: pass
    finally:
        DRIVER = None
    log(f"🔁 Recreating driver because: {reason}")
    return create_driver()

def create_driver():
    global DRIVER
    if DRIVER and _driver_alive(DRIVER):
        return DRIVER
    if not _is_port_open(DEBUG_HOST, DEBUG_PORT):
        # Chrome is not listening yet: launch it and wait for the debug port
        _launch_chrome_ds()
        for _ in range(80):
            if _is_port_open(DEBUG_HOST, DEBUG_PORT): break
            time.sleep(0.3)
    if not _is_port_open(DEBUG_HOST, DEBUG_PORT):
        raise RuntimeError(
            f"Could not attach to DS at {DEBUG_ADDR}. Run start_ds.bat first."
        )
    DRIVER = _create_driver_attach()
    proxy = os.environ.get("HTTP_PROXY") or globals().get("HTTP_PROXY", "")
    log(f"🟢 Attached DS at {DEBUG_ADDR} | user-data-dir={USER_DATA_DIR} | proxy={proxy or 'DIRECT'}")
    return DRIVER

def fetch_page(url: str):
    backoff = lambda: time.sleep(random.uniform(*NAV_BACKOFF_RANGE))
    attempt = 0
    last_html = b""
    while attempt < NAV_MAX_RETRIES:
        attempt += 1
        driver = create_driver()
        try:
            if not _driver_alive(driver):
                raise WebDriverException("DevTools/driver is not responding")
            _open_blank_tab(driver)
            log(f"🌐 Navigating (attempt {attempt}/{NAV_MAX_RETRIES}): {url}")
            driver.get(url)
            WebDriverWait(driver, DOM_READY_TIMEOUT_S).until(
                lambda d: d.execute_script("return document.readyState") == "complete"
            )
            html = driver.page_source or ""
            last_html = html.encode("utf-8", errors="ignore")
            return 200, last_html
        except TimeoutException:
            # Stop loading and keep whatever HTML is available for the snapshot
            try:
                driver.execute_script("window.stop();")
                html = driver.page_source or ""
                last_html = html.encode("utf-8", errors="ignore")
            except Exception:
                pass
            snapshot = save_snapshot(f"timeout_attempt{attempt}", last_html)
            log(f"⏱️ TIMEOUT (attempt {attempt}) while loading the URL. Snapshot: {snapshot}")
            if attempt >= 2:
                _reset_driver("Repeated timeouts / Chromedriver may be hung")
            backoff(); continue
        except WebDriverException as e:
            snapshot = save_snapshot(f"wderr_attempt{attempt}", last_html)
            log(f"⚠️ WebDriverException (attempt {attempt}): {e}. Snapshot: {snapshot}")
            _reset_driver("WebDriverException"); backoff(); continue
        except Exception as e:
            snapshot = save_snapshot(f"unknown_attempt{attempt}", last_html)
            log(f"⚠️ Unknown error (attempt {attempt}): {e}. Snapshot: {snapshot}")
            _reset_driver("Unknown exception"); backoff(); continue
    # All attempts failed: status 0 signals a fetch error to the caller
    return 0, last_html

# ====== Stop so the user can solve the CAPTCHA (write checkpoint, then exit 90) ======
def pause_for_user_and_exit(day_str: str, page: int, html_bytes: bytes, reason: str, url: str):
    snapshot = save_snapshot(f"{day_str.replace('/','-')}_p{page}_pause", html_bytes or b"")
    save_resume(day_str, page)  # SAVE the checkpoint at the exact page that hit the problem
    log("────────────────────────────────────────────────────────")
    log(f"🛑 SUSPECTED WAF/CAPTCHA. REASON: {reason}")
    log(f"🔗 URL: {url}")
    log(f"🧾 HTML snapshot: {snapshot}")
    log("👉 Switch to Chrome and solve the CAPTCHA or wait for the page to pass.")
    log("👉 Then return to CMD and PRESS Enter (the .bat file reruns from the checkpoint).")
    log("────────────────────────────────────────────────────────")
    sys.exit(90)

# =======================================================================

def crawl_one_day(day_str: str, df, uuid_to_idx, tt_max: int, start_page: int = 1):
    """Crawl one day, starting from page start_page (supports resume)."""
    total_new = 0
    total_dup_updates = 0
    reason = "exhausted"

    for page in range(start_page, TOTAL_PAGE + 1):
        url = build_url(day_str, page)
        status, content = fetch_page(url)

        if status != 200:
            snapshot = save_snapshot(f"{day_str.replace('/','-')}_p{page}_http{status}", content)
            log(f"❌ {day_str} - Page {page} - HTTP {status} or fetch error. Snapshot: {snapshot}")
            reason = "error"
            break

        # 1) 'No documents found' banner -> no_results (not a CAPTCHA)
        if has_no_result_banner(content):
            log(f"ℹ️ {day_str} - Page {page} - No documents for this date filter ('ketqua' banner present).")
            reason = "no_results"
            break

        # 2) p#ketqua present -> valid page
        ketqua_ok = has_ketqua_present(content)

        # 3) Only consider a CAPTCHA when p#ketqua is missing
        if not ketqua_ok:
            decoded = (content or b"").decode("utf-8", errors="ignore")
            if suspected_block(decoded, content):
                pause_for_user_and_exit(day_str, page, content, "No p#ketqua and the HTML shows CAPTCHA/WAF signs", url)

        rows = parse_page(content)

        if not rows:
            # Valid page but no rows: snapshot and pause so the user can inspect it
            pause_for_user_and_exit(day_str, page, content, "Valid page but no rows (structure changed?)", url)

        log(f"✅ {day_str} - Page {page} - Got {len(rows)} documents")

        for r in rows:
            u = (r.get("uuid") or "").strip()
            if not u: continue
            if u in uuid_to_idx:
                if update_duplicate_timestamp(df, uuid_to_idx, u):
                    total_dup_updates += 1
            else:
                tt_max += 1
                append_new_row(r, tt_max)
                # Mirror the appended CSV row in the in-memory DataFrame
                df = pd.concat([df, pd.DataFrame([{
                    "tt": str(tt_max),
                    "tieude": r.get("tieude", ""),
                    "ngaybanhanh": r.get("ngaybanhanh", ""),
                    "capnhat": r.get("capnhat", ""),
                    "url": r.get("url", ""),
                    "uuid": u,
                    "LanDau": now_str(),
                    "NgayKiemTraTrungLap": ""
                }], columns=HEADER_COLUMNS)], ignore_index=True)
                uuid_to_idx[u] = len(df) - 1
                total_new += 1

        log(f"📥 {day_str} - Page {page}: new {total_new} | duplicates {total_dup_updates} (cumulative)")

        # Current page done => move the checkpoint to the NEXT PAGE
        save_resume(day_str, page + 1)

        if len(rows) < PAGE_SIZE_TRIGGER:
            log(f"⛳ {day_str} - Page {page}: only {len(rows)}/{PAGE_SIZE_TRIGGER} documents ⇒ stopping this day.")
            reason = "no_more_pages"
            break

        _sleep_next_page()

    return df, uuid_to_idx, tt_max, total_new, total_dup_updates, reason

def iter_dates(start_str: str, end_str: str):
    start = datetime.strptime(start_str, DATE_FMT)
    end = datetime.strptime(end_str, DATE_FMT)
    cur = start
    while cur <= end:
        yield cur.strftime(DATE_FMT)
        cur += timedelta(days=1)

def crawl_all_by_day():
    start_time = datetime.now()
    log(f"🚀 Starting per-DAY crawl from {START_DATE_STR} to {END_DATE_STR}")

    ensure_csv_initialized()
    df, uuid_to_idx, tt_max = load_state()

    # Read the checkpoint, if any
    resume = load_resume()
    resume_day = resume["day"] if resume else None
    resume_page = resume["page"] if resume else None

    # Parse the checkpoint date: dd/mm/YYYY strings do not sort chronologically,
    # so days must be compared as datetime values, not as text
    resume_dt = None
    if resume_day:
        try:
            resume_dt = datetime.strptime(resume_day, DATE_FMT)
        except (TypeError, ValueError):
            log(f"⚠️ Invalid checkpoint date, ignoring it: {resume_day!r}")
            resume_day = None

    overall_new = 0
    overall_dup = 0

    for day_str in iter_dates(START_DATE_STR, END_DATE_STR):
        # With a checkpoint: skip the days before it
        if resume_dt and datetime.strptime(day_str, DATE_FMT) < resume_dt:
            continue

        # On the checkpoint day -> start from the saved page; otherwise from page 1
        start_page = resume_page if (resume_day == day_str and isinstance(resume_page, int) and resume_page >= 1) else 1
        if start_page < 1: start_page = 1

        log(f"===== ▶️ Starting day {day_str} (start_page={start_page}) =====")
        try:
            df, uuid_to_idx, tt_max, n_new, n_dup, reason = crawl_one_day(day_str, df, uuid_to_idx, tt_max, start_page=start_page)
            overall_new += n_new
            overall_dup += n_dup
            log(f"🏁 Finished day {day_str}: new {n_new}, duplicates {n_dup}, reason: {reason}")

            # Day complete -> move the checkpoint to the NEXT DAY, page 1
            next_day = (datetime.strptime(day_str, DATE_FMT) + timedelta(days=1)).strftime(DATE_FMT)
            save_resume(next_day, 1)

        except SystemExit as se:
            # Code 90: the checkpoint was saved before exiting; re-raise so the .bat file sees it
            if se.code == 90:
                raise
            else:
                log(f"❌ Unexpected SystemExit with code {se.code}"); raise
        except Exception as e:
            log(f"❌ Unexpected error on day {day_str}: {e}")
            traceback.print_exc()

        log(f"🛌 Resting between days: DELAY_EACHDAY={DELAY_EACHDAY}")
        _sleep_each_day()

    # Everything finished -> remove the checkpoint
    clear_resume()

    log(f"✅ All done. Newly added: {overall_new} | Duplicates updated: {overall_dup}")
    log(f"⏱️ From: {start_time} → {datetime.now()}")

if __name__ == "__main__":
    crawl_all_by_day()
    # NO driver.quit(): the DS Chrome stays open so you can keep using it
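
One detail worth checking when resuming: dates written as dd/mm/YYYY do not sort chronologically as plain strings, so a comparison between a loop date and the checkpoint date is only reliable on parsed `datetime` values. A minimal sketch of such a comparison, where the helper name `is_before` is hypothetical and the format matches DATE_FMT in the script:

```python
from datetime import datetime

DATE_FMT = "%d/%m/%Y"  # same format as in ds2000.py


def is_before(day_str: str, other_str: str, fmt: str = DATE_FMT) -> bool:
    """Return True if day_str falls chronologically before other_str."""
    return datetime.strptime(day_str, fmt) < datetime.strptime(other_str, fmt)


# 31 December 2004 is before 1 January 2005 ...
earlier = is_before("31/12/2004", "01/01/2005")
# ... but the raw strings compare the other way round, because "3" > "0".
string_says_earlier = "31/12/2004" < "01/01/2005"
```

Here `earlier` is `True` while `string_says_earlier` is `False`, which is the situation in which a string-based check would skip or repeat days after a resume.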
