-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Updated excel_knowledge_source.py to account for excel files with multiple tabs. #1921
base: main
Are you sure you want to change the base?
Conversation
…ve multiple tabs. The old implementation contained a single df=pd.read_excel(excel_file_path), which only reads the first or most recently used excel sheet. The updated functionality reads all sheets in the excel workbook.
Disclaimer: This review was made by a crew of AI Agents. Code Review Comment for PR: ExcelKnowledgeSource UpdatesOverviewThis pull request significantly enhances the Positive Aspects
Issues and Recommendations1. Memory Efficiency ConcernsThe current implementation loads all sheets into memory at once, which poses a risk of excessive memory usage for large files. Recommendation: def load_content(self) -> Dict[Path, Dict[str, str]]:
content_dict = {}
for file_path in self.safe_file_paths:
with pd.ExcelFile(file_path) as xl:
sheet_dict = {sheet_name: pd.read_excel(xl, sheet_name).to_csv(index=False)
for sheet_name in xl.sheet_names}
content_dict[file_path] = sheet_dict
return content_dict 2. CSV Conversion ImplementationUsing string concatenation for CSV conversion is prone to errors, especially with special characters. Recommendation: import csv
from io import StringIO
def _sheet_to_csv(self, worksheet) -> str:
output = StringIO()
writer = csv.writer(output)
for row in worksheet.values:
writer.writerow(row)
return output.getvalue() 3. Type HintsThe absence of type hints affects both maintainability and IDE functionality. Recommendation: from typing import Dict, Path
def load_content(self) -> Dict[Path, Dict[str, str]]:
... 4. Debug Print StatementUnnecessary print statements can clutter logs. Recommendation: # Remove this line
print(sheet_str) 5. Simplifying Content Processing in the
|
…e memory usage and provide better documentation
Made the changes suggested by the AI Agents. Please let me know if I should change anything else. Thank you! |
…mmit - corrected this
I realized I made a mistake in the 2nd to last commit.. did not delete the old load_content function - corrected this in the latest commit. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, good catch. Removed them. Thanks for pointing that out!
…nowledgeSource to account for the change in the load_content method to support excel files with multiple tabs/sheets. This change should ensure it passes the type check test, as it failed before since content was assigned a different type in BaseFileKnowledgeSource
I just made another commit that should fix the issue with the type check test failing. It failed because the new load_content method returned a slightly different type (still a dict but a Dict[Path,Dict[str,str]]) in order to support excel files with multiple tabs/sheets. Apologies that I did not catch that before. |
Also I just removed the commented out imports, apologies I should have done that and fixed the type test in one commit. |
Updated excel_knowledge_source.py to account for excel sheets that have multiple tabs. The old implementation contained a single df=pd.read_excel(excel_file_path), which only reads the first or most recently used excel sheet. The updated functionality reads all sheets in the excel workbook.