IMPORTANT: This program only works for https://www.alestlelive.com
This Python program visits each URL for The Alestle's web pages listed in a text file and scrapes or downloads the content from those URLs, providing an efficient way to extract article text from the site.
- URL Extraction: The program parses a text file containing a list of URLs and extracts each URL for further processing.
- Headline Extraction: It visits each URL and extracts the headline from the web page.
- Content Extraction: It visits each URL and extracts the content from the web page.
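As a rough illustration of what the headline and content extraction might look like for a single page, here is a minimal sketch using Requests and BeautifulSoup. The tag selectors (`h1`, `p`) are guesses, not the project's actual ones; inspect the live alestlelive.com pages to find the real markup.

```python
import requests
from bs4 import BeautifulSoup

def scrape_article(url):
    """Fetch one Alestle page and return (headline, body text)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Selector guesses: inspect the live pages for the real tags/classes.
    headline_tag = soup.find("h1")
    headline = headline_tag.get_text(strip=True) if headline_tag else ""
    paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
    return headline, "\n".join(paragraphs)
```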
To run the tests for this project, you will need to use pytest. If you haven't installed pytest yet, you can do so by running the following command:
conda install pytest
Once pytest is installed, navigate to the project directory and run the following command:
pytest test.py
The tests in this project cover the following functionalities:
- URL Extraction: Tests if the FileReader class correctly parses a text file containing a list of URLs and extracts each URL for further processing.
- HTML Page Retrieval: Tests if the FileReader class correctly retrieves the HTML pages from the URLs.
- Content Extraction: Tests if the HtmlProcessor class correctly processes the HTML pages and extracts the content from the web pages.
- Output File Creation: Tests if the HtmlProcessor class correctly creates output files with the extracted content.
- HtmlProcessor get_output_filenames Method: Tests if the HtmlProcessor class correctly returns the filenames of the output files.
- FileReader Initialization: Tests if the FileReader class is correctly initialized with the given input file name.
- HtmlProcessor Initialization: Tests if the HtmlProcessor class is correctly initialized with the given pages and output directory.
- AI Initialization: Tests if the AI class is correctly initialized with the given files and output directory.
Each of these tests is designed to ensure that the individual components of the project are working as expected, and that they correctly interact with each other to produce the expected output.
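As a sketch of what one such test might look like, here is a hedged example of the URL-extraction check. The import path and the `urls` attribute are assumptions about FileReader's interface, not the project's actual test code.

```python
# test.py -- illustrative only; FileReader's interface is assumed.
from main import FileReader  # assumed import path

def test_file_reader_parses_urls(tmp_path):
    # pytest's tmp_path fixture gives us a throwaway directory.
    input_file = tmp_path / "input.txt"
    input_file.write_text("https://example.com/a\nhttps://example.com/b\n")

    reader = FileReader(str(input_file))
    # Assumed interface: parsed URLs are exposed as reader.urls.
    assert reader.urls == ["https://example.com/a", "https://example.com/b"]
```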
- Input File: Prepare a text file containing a list of URLs, with each URL on a separate line.
- Python Environment: Ensure that you have Python installed on your system.
- Install Dependencies: Install the required Python packages by running:
conda install [package_name]
- Run the Program: Execute the main Python script, main.py (a sketch of its flow appears after this list).
- Output: The program will process each URL, extract the content, and save it to an external .txt file.
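Here is a plausible, self-contained outline of main.py's flow. Everything about the class interfaces (method names such as `read_urls`, `process`, and the `article_<n>.txt` naming) is an assumption based on the feature and test descriptions above, not the project's actual code.

```python
# Illustrative sketch of main.py; real interfaces may differ.
import os

import requests
from bs4 import BeautifulSoup

class FileReader:
    def __init__(self, input_file):
        self.input_file = input_file

    def read_urls(self):
        # One URL per line; blank lines are ignored.
        with open(self.input_file) as f:
            return [line.strip() for line in f if line.strip()]

    def get_html_pages(self):
        # Fetch each URL and return the raw HTML of every page.
        return [requests.get(url, timeout=10).text for url in self.read_urls()]

class HtmlProcessor:
    def __init__(self, pages, output_dir):
        self.pages = pages
        self.output_dir = output_dir
        self._filenames = []

    def process(self):
        os.makedirs(self.output_dir, exist_ok=True)
        for i, html in enumerate(self.pages):
            text = BeautifulSoup(html, "html.parser").get_text(
                separator="\n", strip=True)
            path = os.path.join(self.output_dir, f"article_{i}.txt")
            with open(path, "w") as f:
                f.write(text)
            self._filenames.append(path)

    def get_output_filenames(self):
        # Method named in the test list above.
        return self._filenames

if __name__ == "__main__":
    reader = FileReader("input.txt")
    processor = HtmlProcessor(reader.get_html_pages(), "output")
    processor.process()
    print(processor.get_output_filenames())
```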
This repository contains a simple demonstration of how to use the OpenAI API to interact with powerful GPT models.
To access OpenAI's powerful GPT models and utilize its API, you'll need to create an account on the OpenAI platform and generate an API key. Here's a step-by-step guide:
- Sign Up for an Account:
  - Go to the OpenAI website and sign up for an account if you haven't already. Follow the instructions to complete the registration process.
- Navigate to API Settings:
  - After signing in, navigate to your account settings. You can find this by clicking on your profile icon or by directly visiting the account settings page.
- Generate an API Key:
  - Once you're in the account settings, find the section related to API access or API keys. Click on the option to generate a new API key. You might be asked to provide additional information or agree to terms of service.
- Copy the API Key:
  - After generating the API key, it will be displayed on the screen. Copy this key and store it securely. Treat your API key like a password, as it grants access to your OpenAI account and can incur charges based on usage.
- Add the API Key to Your .bash_profile:
  - Open Terminal: You can find it in the Applications folder or search for it using Spotlight (Command + Space).
  - Edit Bash Profile: Use the command nano ~/.bash_profile or nano ~/.zshrc (for newer macOS versions) to open the profile file in a text editor.
  - Add Environment Variable: In the editor, add the line below, replacing your-api-key-here with your actual API key:
    export OPENAI_API_KEY='your-api-key-here'
  - Save and Exit: Press Ctrl+O to write the changes, followed by Ctrl+X to close the editor.
  - Load Your Profile: Use the command source ~/.bash_profile or source ~/.zshrc to load the updated profile.
  - Verification: Verify the setup by typing echo $OPENAI_API_KEY in the terminal. It should display your API key (a Python-side check is sketched below).
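Once the variable is exported, Python can read it from the environment. The OpenAI client picks up OPENAI_API_KEY automatically, but a quick sanity check like this can confirm the key is visible before you make any calls:

```python
import os

# Confirm the key exported in your shell profile is visible to Python.
api_key = os.environ.get("OPENAI_API_KEY")
if api_key is None:
    raise RuntimeError("OPENAI_API_KEY is not set; check your shell profile.")
print("Key loaded, ends with:", api_key[-4:])
```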
Here's how you can use the generated API key to make a simple API call in macOS using Python:
- Install the OpenAI Python Library:
  - Before making API calls, ensure you have the OpenAI Python library installed. You can install it via conda:
    conda install openai
- Write Python Script:
  - In your script, import the client class from the library:
    from openai import OpenAI
- Make API Call:
  - Use the library functions to make API calls. For example, to use the chat completions endpoint of the API:
    client = OpenAI()
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "assistant", "content": "~Article content~"},
            {"role": "user", "content": "Please make the article concise, up to 50 words"},
        ],
    )
- Print or Use Response:
  - Once you receive the response, you can print it or utilize it as required. For example:
    print(completion.choices[0].message.content)
- Run the Script:
  - Execute the Python script in your terminal (a consolidated version of these steps is sketched below):
    python run.py
  - Ensure that you are in the directory containing your Python script.
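Putting the steps together, a minimal run.py might read one of the scraped .txt files and ask the model to condense it. The input file name is an assumption about the scraper's output, and the 50-word limit simply mirrors the example above:

```python
# run.py -- illustrative; the file name is assumed, not fixed by the project.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("output/article_0.txt") as f:  # a file produced by the scraper (assumed name)
    article = f.read()

completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "assistant", "content": article},
        {"role": "user", "content": "Please make the article concise, up to 50 words"},
    ],
)
print(completion.choices[0].message.content)
```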
- URL File Path: Specify the path to the text file containing the list of URLs.
- Content Extraction Method: Choose the desired method for extracting content from the web pages (e.g., using BeautifulSoup).
- Download Options: Enable or disable the downloading of resources linked within the web pages.
- Output Directory: Set the directory where the scraped content will be saved.
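How these options are exposed is not specified here; as a purely hypothetical illustration, they might live as constants near the top of the script:

```python
# Hypothetical configuration constants; the real program may expose
# these differently (e.g., as command-line arguments).
URL_FILE_PATH = "input.txt"          # text file with one URL per line
EXTRACTION_METHOD = "beautifulsoup"  # how page content is parsed
DOWNLOAD_LINKED_RESOURCES = False    # fetch resources linked in pages?
OUTPUT_DIR = "output"                # where scraped .txt files are saved
```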
Suppose you have a text file named input.txt with the following content:
https://example.com/page1
https://example.com/page2
https://example.com/page3
You can run the program with the default configuration to visit each URL, extract the content, and save it in the specified output directory.
- Python 3.x: The program was developed with Python 3.12.1 and should work with recent Python 3.x versions.
- Requests: Used for making HTTP requests to retrieve web pages.
- BeautifulSoup: A Python library for pulling data out of HTML and XML files.
- The development of this program was inspired by the need to efficiently scrape web content for various applications.
- Special thanks to the developers of Requests and BeautifulSoup for providing powerful tools for web scraping in Python.