A Python program for web scraping, MongoDB interaction, data analysis, and statistics collection.
This Python script performs various operations such as scraping news articles from a website, storing data in MongoDB, analyzing word frequency, and collecting statistics.
- The `scrape_data` function extracts news articles from a specified website.
- The `scrape_and_store_data_worker` function retrieves data from a specific page and stores it in MongoDB.
- The `scrape_and_store_data` function retrieves and processes data from a range of pages in parallel.
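The parallel page processing described above can be sketched with `concurrent.futures`, which is listed among the dependencies. This is a minimal illustration, not the project's actual code: the helper name `scrape_pages_in_parallel`, its parameters, and the idea of passing the per-page worker in as an argument are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def scrape_pages_in_parallel(worker, first_page, last_page, max_workers=4):
    """Run worker(page_number) for every page in [first_page, last_page].

    Hypothetical helper: returns (results, failures), where results maps a
    page number to the worker's return value and failures maps a page number
    to the exception the worker raised.
    """
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit one task per page and remember which future belongs to which page.
        futures = {
            pool.submit(worker, page): page
            for page in range(first_page, last_page + 1)
        }
        for future in as_completed(futures):
            page = futures[future]
            try:
                results[page] = future.result()
            except Exception as exc:
                # A single failed page should not stop the remaining pages.
                failures[page] = exc
    return results, failures
```

In a real run, `worker` would be something like `scrape_and_store_data_worker`; threads suit this workload because each page spends most of its time waiting on network I/O.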
- The `connect_to_mongodb` function establishes a connection to a MongoDB database.
- The `scrape_and_store_data_worker` function adds the scraped data to MongoDB.
- The `group_and_display_by_update_date` function groups data in MongoDB based on update dates.
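Grouping by update date is typically done with a MongoDB aggregation pipeline. The sketch below shows one way this could look; the field name `update_date`, the database and collection names, and both helper functions are assumptions for illustration and may differ from the project's code.

```python
def build_group_by_date_pipeline(date_field="update_date"):
    """Build an aggregation pipeline that counts documents per update date.

    The field name is an assumption; adjust it to the stored document schema.
    """
    return [
        {"$group": {"_id": f"${date_field}", "count": {"$sum": 1}}},
        {"$sort": {"_id": 1}},  # oldest update date first
    ]


def group_by_update_date(collection, date_field="update_date"):
    """Return a list of (update date, article count) pairs."""
    pipeline = build_group_by_date_pipeline(date_field)
    return [(doc["_id"], doc["count"]) for doc in collection.aggregate(pipeline)]


# Usage against a local server (requires pymongo and a running MongoDB):
#   from pymongo import MongoClient
#   client = MongoClient("mongodb://localhost:27017/")
#   collection = client["news_db"]["articles"]  # hypothetical names
#   for date, count in group_by_update_date(collection):
#       print(date, count)
```

Keeping the pipeline in its own function makes it possible to inspect or test the query without a database connection.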
- The `analyze_and_store_word_frequency` function analyzes word frequency in the text content of scraped articles.
- It generates bar charts for the top 10 most used words and stores the results in MongoDB.
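The counting step of the word frequency analysis can be sketched with the standard library alone. The function name `top_words` and the simple tokenization rule (lowercase, split on non-word characters) are assumptions for this example; the project may tokenize differently.

```python
import re
from collections import Counter


def top_words(texts, n=10):
    """Return the n most frequent words across texts as (word, count) pairs.

    Hypothetical helper: words are lowercased and extracted with a simple
    regular expression, with no stop-word filtering.
    """
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"\w+", text.lower()))
    return counts.most_common(n)


# The pairs can then be charted, for example with matplotlib:
#   import matplotlib.pyplot as plt
#   words, freqs = zip(*top_words(article_texts))
#   plt.bar(words, freqs)
#   plt.savefig("barchart.png")
```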
- The `update_stats_collection` function collects statistics such as elapsed time and success and failure counts, and stores them in MongoDB.
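A statistics document of the kind described above might be assembled as follows. This is a sketch under assumptions: the helper `run_with_stats` and the field names in the returned dictionary are invented for illustration and are not taken from the project.

```python
import time
from datetime import datetime, timezone


def run_with_stats(tasks):
    """Run each zero-argument callable in tasks and summarize the run.

    Hypothetical helper: returns a dictionary with the elapsed time and the
    number of tasks that succeeded or raised an exception.
    """
    started = time.perf_counter()
    success = failure = 0
    for task in tasks:
        try:
            task()
            success += 1
        except Exception:
            failure += 1
    return {
        "recorded_at": datetime.now(timezone.utc),
        "elapsed_seconds": time.perf_counter() - started,
        "success_count": success,
        "failure_count": failure,
    }


# With pymongo, the document could then be stored with:
#   stats_collection.insert_one(run_with_stats(tasks))
```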
- The `main` function orchestrates the above functionalities in sequence.
`try-except` blocks handle potential errors during execution, and errors are logged in the `logs.log` file.
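The error handling pattern can be illustrated with the standard `logging` module. The logger name `scraper_example`, the log format, and the `run_step` wrapper are assumptions made for this sketch; only the `logs.log` file name comes from the description above.

```python
import logging


def make_file_logger(path="logs.log"):
    """Create (or reuse) a logger that appends messages to the given file."""
    logger = logging.getLogger("scraper_example")  # hypothetical logger name
    if not logger.handlers:
        handler = logging.FileHandler(path, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


def run_step(logger, name, func, *args, **kwargs):
    """Call func and log any exception instead of letting it stop the program."""
    try:
        return func(*args, **kwargs)
    except Exception:
        # logger.exception records the message together with the traceback.
        logger.exception("Step %r failed", name)
        return None
```

Returning `None` on failure lets a caller such as `main` continue with the next step while the traceback remains available in the log file.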
- Clone the repository to your local machine.
- Install the required dependencies using `pip install -r requirements.txt`.
- Make sure you have MongoDB installed and running on your local machine.
- Run the `main` function to execute the entire data processing workflow.
- BeautifulSoup
- requests
- pymongo
- matplotlib
- datetime
- concurrent.futures
- logging
- The word frequency analysis results are stored in MongoDB.
- Bar charts for the top 10 most used words can be found in the project directory (`barchart.png`).
- Log information is available in the `logs.log` file.
- Ensure compliance with the terms of use of the website being scraped.
- Be aware of legal regulations related to web scraping.
This project is licensed under the MIT License.