Security researchers have uncovered a troubling privacy leak in Microsoft Copilot. Data exposed to the internet—even briefly—can persist in AI chatbots long after being made private.
Israeli cybersecurity firm Lasso discovered the vulnerability when its own private GitHub repository appeared in Copilot's results. The repository had been accidentally public for a brief period before being locked down.
"Anyone in the world could ask Copilot the right question and get this data," warned Lasso co-founder Ophir Dror.
The problem extends far beyond Lasso. The firm's investigation found more than 20,000 since-private GitHub repositories still accessible through Copilot, affecting over 16,000 organizations including Google, IBM, PayPal, Tencent, and Microsoft itself.
The exposed repositories contain damaging material: confidential archives, intellectual property, and even access keys and tokens. In one case, Lasso retrieved the contents of a deleted Microsoft repository that hosted a tool for creating "offensive and harmful" AI images.
Microsoft classified the issue as "low severity" when notified in November 2024, calling the caching behavior "acceptable." While the company stopped showing Bing cache links in search results by December, Lasso says the underlying problem persists—Copilot still accesses this hidden data.