feat: better IFrame and shadowDom elements scraping #409

Draft
wants to merge 9 commits into
base: develop

RohitR311
Collaborator

@RohitR311 RohitR311 commented Jan 28, 2025

Closes: #408

Summary by CodeRabbit

  • New Features

    • Enhanced selector generation for complex DOM structures, including improved handling of shadow DOMs and iframes.
    • Refined selector generation logic with more robust error handling.
  • Bug Fixes

    • Improved selector selection process for different action types by eliminating unnecessary checks for iframes and shadow selectors.
  • Refactor

    • Consolidated selector generation functions into a unified approach.
    • Updated function signatures for selector-related methods.

@RohitR311 RohitR311 added the Type: Enhancement Improvements to existing features label Jan 28, 2025

coderabbitai bot commented Jan 28, 2025

Walkthrough

The pull request introduces substantial enhancements to the selector generation process in the selector.ts file, specifically targeting elements within shadow DOMs and iframes. Key changes include the removal of outdated functions, the introduction of new types, and a unified selector generation function, generateBoundaryAwareSelector. Additionally, the logic for handling boundaries has been refined, and error handling has been improved to better manage cross-origin iframe scenarios.

Changes

File | Change Summary
server/src/workflow-management/selector.ts
  • Removed genSelectorForIframe and genSelectorForShadowDOM functions
  • Introduced generateBoundaryAwareSelector function
  • Added getBoundaryPath for boundary path retrieval
  • Introduced types: ShadowBoundary, IframeBoundary, and Boundary
  • Updated signatures for getSelectors and getNonUniqueSelectors functions
  • Enhanced error handling and logging for selector generation
server/src/workflow-management/utils.ts
  • Removed conditional checks for selectors?.iframeSelector?.full and selectors?.shadowSelector?.full in getBestSelectorForAction for ActionType.Click, ActionType.Hover, and ActionType.DragAndDrop

Assessment against linked issues

Objective Addressed Explanation
Improve IFrame and shadowDom elements scraping [#408]

Possibly related PRs

Suggested labels

Type: Feature

Suggested reviewers

  • amhsirak

Poem

🐰 In shadows deep and iframes wide,
Selectors dance with newfound pride.
Code weaves a path both smart and neat,
Where elements and context meet,
A rabbit's magic, selector's delight! 🔍✨

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (4)
server/src/workflow-management/selector.ts (4)

1343-1364: Make shadow DOM depth configurable
Currently, the maximum shadow DOM depth (MAX_DEPTH = 4) is hard-coded, which might miss elements in deeper nested shadow roots. Consider making this limit configurable or documenting the rationale behind the chosen value.


1366-1398: Unify and streamline shadow-aware selector logic
The genSelectors function and its nested generateShadowAwareSelector are clear, but could be further streamlined by directly leveraging the new getShadowPath approach. This may avoid future duplication and improve maintainability if you unify your shadow DOM traversal logic in a single helper.


1409-1409: Exercise caution with the all-attributes mode
Using { attr: () => true } can potentially produce very lengthy or non-unique selectors if the element has many attributes. Confirm that this is the desired behavior.


1439-1441: Revisit ignoring IDs starting with digits
Currently, you skip IDs that begin with a digit, but CSS allows them if properly escaped. Evaluate whether escaping these IDs or falling back to attributes is preferable for completeness.
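
To illustrate the escaping option mentioned above: CSS permits an ID selector that starts with a digit if the digit is written as a hex code-point escape followed by a space. The sketch below is a deliberately simplified, hypothetical helper (`idSelector` is not part of the PR) that handles only a leading ASCII digit and assumes the rest of the ID contains valid identifier characters. In a browser context, the built-in `CSS.escape` covers the general case.

```typescript
// Hypothetical helper: builds an ID selector, escaping a leading digit.
// Simplification: characters after the first are assumed to be valid
// CSS identifier characters and are passed through unchanged.
function idSelector(id: string): string {
  const first = id.charAt(0);
  if (first >= '0' && first <= '9') {
    // A leading digit is escaped as its hex code point plus a space,
    // e.g. "1" (U+0031) becomes "\31 ".
    const hex = first.charCodeAt(0).toString(16);
    return `#\\${hex} ${id.slice(1)}`;
  }
  return `#${id}`;
}
```

With this approach an ID such as `1abc` yields the selector `#\31 abc` rather than being skipped.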

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4484abf and e254624.

📒 Files selected for processing (2)
  • server/src/workflow-management/selector.ts (1 hunks)
  • server/src/workflow-management/utils.ts (0 hunks)
💤 Files with no reviewable changes (1)
  • server/src/workflow-management/utils.ts
🔇 Additional comments (6)
server/src/workflow-management/selector.ts (6)

1403-1403: No concerns
The call to generateShadowAwareSelector() is wrapped in a try/catch block to handle potential errors, which is good practice.


1414-1414: No code changes
Nothing was modified on this blank line.


1415-1417: No issues with href filter
Generating a shadow-aware selector filtering on ['href'] is straightforward and appears correct.


1418-1420: Potential collisions on 'name', 'placeholder', or 'for'
Multiple elements can share the same 'name', 'placeholder', or 'for' attributes, leading to non-unique selectors. Verify if additional checks or fallback strategies are needed.


1421-1423: Possible duplication of aria-label, alt, or title
Elements can reuse these attributes, potentially resulting in non-unique selectors. Ensure this is acceptable, or consider an additional fallback for uniqueness.
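
One possible fallback strategy for the two uniqueness concerns above is to try candidate selectors in priority order and keep the first one that matches exactly one element. This is only a sketch: `firstUniqueSelector` and the `QueryRoot` interface are hypothetical names, and `QueryRoot` is a structural stand-in for a `Document` or `ShadowRoot` so the snippet runs outside a browser.

```typescript
// Structural stand-in for Document / ShadowRoot (assumption for this sketch).
interface QueryRoot {
  querySelectorAll(selector: string): { length: number };
}

// Returns the first candidate that matches exactly one element under `root`,
// or null when every candidate is missing or ambiguous.
function firstUniqueSelector(
  root: QueryRoot,
  candidates: Array<string | null>
): string | null {
  for (const candidate of candidates) {
    if (candidate && root.querySelectorAll(candidate).length === 1) {
      return candidate;
    }
  }
  return null;
}
```

Note that a composite selector containing `>>` or `:>>` is not valid CSS, so a check like this would have to run on the final segment against its own shadow root or frame document.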


1425-1434: Excellent practice of supporting test ID attributes
Using a custom test ID approach improves maintainability and reliability of selectors. Implementation looks solid.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (2)
server/src/workflow-management/selector.ts (2)

1359-1395: Consider making the depth limit configurable.

The getBoundaryPath implementation is robust and handles both shadow DOM and iframe contexts well. However, the MAX_DEPTH constant is hardcoded. Consider making it configurable through options or environment variables for better flexibility.

-const MAX_DEPTH = 4;
+const MAX_BOUNDARY_DEPTH = process.env.MAX_BOUNDARY_DEPTH 
+  ? parseInt(process.env.MAX_BOUNDARY_DEPTH, 10) 
+  : 4;
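
One caveat with the diff above: `parseInt` returns `NaN` for a non-numeric value, and `depth < NaN` is always false, so a malformed setting would silently disable boundary traversal. A minimal sketch of a more defensive parser follows. `parseBoundaryDepth` and `DEFAULT_BOUNDARY_DEPTH` are hypothetical names, and the helper takes the raw string as a parameter so it does not assume `process.env` is available wherever the selector code runs.

```typescript
const DEFAULT_BOUNDARY_DEPTH = 4;

// Hypothetical helper: parses a depth setting, falling back to the default
// for missing, non-numeric, zero, or negative values.
function parseBoundaryDepth(
  raw: string | undefined,
  fallback: number = DEFAULT_BOUNDARY_DEPTH
): number {
  if (raw === undefined) return fallback;
  const parsed = Number.parseInt(raw, 10);
  return Number.isInteger(parsed) && parsed > 0 ? parsed : fallback;
}
```

A caller could then write `const MAX_BOUNDARY_DEPTH = parseBoundaryDepth(rawValue);` with `rawValue` sourced from whatever configuration mechanism the project settles on.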

1422-1462: Consider caching boundary-aware selectors for performance.

The generateBoundaryAwareSelector function performs well but might be called multiple times for the same element with different options. Consider implementing a caching mechanism for the boundary path calculations to improve performance.

+const boundaryPathCache = new WeakMap<HTMLElement, string>();

 const generateBoundaryAwareSelector = (elementOptions = {}) => {
+  const cachedPath = boundaryPathCache.get(element);
+  if (cachedPath && !elementOptions.attr) {
+    return cachedPath;
+  }

   if (boundaryPath.length === 0) {
     return finder(element, elementOptions);
   }
   // ... existing code ...
   
+  if (!elementOptions.attr) {
+    boundaryPathCache.set(element, selectorParts.join(delimiter));
+  }
   return selectorParts.join(delimiter);
 };
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 22a3dc3 and 58a893c.

📒 Files selected for processing (1)
  • server/src/workflow-management/selector.ts (4 hunks)
🔇 Additional comments (1)
server/src/workflow-management/selector.ts (1)

765-780: Well-structured type definitions for DOM boundaries!

The introduction of ShadowBoundary and IframeBoundary types with the unified Boundary type provides a clear and type-safe way to handle different DOM contexts. This improves code maintainability and makes the boundary handling more robust.
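
For readers without the diff at hand, the boundary types might look roughly like the sketch below. The shapes are inferred from how `context.host`, `context.root`, `context.frame`, `context.document`, and `element` are used in the snippets elsewhere in this thread; the actual definitions in selector.ts may differ. The `*Like` interfaces and `boundaryAnchor` are hypothetical, standing in for `HTMLElement`, `ShadowRoot`, and `Document` so the snippet runs outside a browser.

```typescript
// Hypothetical stand-ins for DOM types (assumption for this sketch).
interface ElementLike { tagName: string }
interface ShadowRootLike { mode: 'open' | 'closed' }
interface DocumentLike { body: ElementLike }

// Assumed shapes, inferred from usage in this thread, not copied from the PR.
type ShadowBoundary = {
  type: 'shadow';
  host: ElementLike;    // element that owns the shadow root
  root: ShadowRootLike; // the shadow root itself
};

type IframeBoundary = {
  type: 'iframe';
  frame: ElementLike;     // the <iframe> element in the parent document
  document: DocumentLike; // the document rendered inside the frame
  element: ElementLike;   // node from which the frame was reached
};

type Boundary = ShadowBoundary | IframeBoundary;

// The `type` discriminant lets TypeScript narrow each branch safely.
function boundaryAnchor(boundary: Boundary): ElementLike {
  return boundary.type === 'shadow' ? boundary.host : boundary.frame;
}
```

The discriminated union is what allows code such as `if (context.type === 'shadow')` to access `context.host` without a cast.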

} catch (e) {
}

// const iframeSelector = genSelectorForIframe(element);

🛠️ Refactor suggestion

Remove commented-out code.

The code contains commented-out sections related to the old iframe selector implementation. Since this functionality has been replaced with the new boundary-aware selector, these comments should be removed to maintain code cleanliness.

-// const iframeSelector = genSelectorForIframe(element);

-// iframeSelector: iframeSelector ? {
-//   full: iframeSelector.fullSelector,
-//   isIframe: iframeSelector.isFrameContent,
-// } : null,

Also applies to: 1519-1522

Comment on lines 1379 to 1391
const ownerDocument = current.ownerDocument;
const frameElement = ownerDocument?.defaultView?.frameElement as HTMLIFrameElement;
if (frameElement) {
  path.unshift({
    type: 'iframe',
    frame: frameElement,
    document: ownerDocument,
    element: current
  });
  current = frameElement;
  depth++;
  continue;
}

⚠️ Potential issue

Add origin check for iframe access.

Browsers block script access to the content of cross-origin frames, so the traversal should validate the iframe's origin before crossing the boundary and skip any frame it cannot read.

 const ownerDocument = current.ownerDocument;
 const frameElement = ownerDocument?.defaultView?.frameElement as HTMLIFrameElement;
 if (frameElement) {
+  try {
+    // Check if we can access the iframe's origin
+    const iframeOrigin = new URL(frameElement.src).origin;
+    const currentOrigin = window.location.origin;
+    if (iframeOrigin !== currentOrigin) {
+      console.warn(`Skipping cross-origin iframe: ${iframeOrigin}`);
+      break;
+    }
+
     path.unshift({
       type: 'iframe',
       frame: frameElement,
       document: ownerDocument,
       element: current
     });
     current = frameElement;
     depth++;
     continue;
+  } catch (error) {
+    console.warn('Cannot access iframe origin:', error);
+    break;
+  }
 }
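
A point worth checking in the suggestion above: `new URL(frameElement.src)` throws when `src` is empty, so frames populated via `srcdoc` or left at `about:blank` would fall into the `catch` branch and be skipped. The sketch below shows one way to treat those cases explicitly. `isSameOriginFrameSrc` is a hypothetical helper, and it assumes such frames inherit the embedding page's origin, which holds only when the frame is not sandboxed.

```typescript
// Hypothetical helper: decides whether a frame's src shares the page's origin.
// Assumption: an empty src or about:blank inherits the embedding page's
// origin (true for non-sandboxed frames only).
function isSameOriginFrameSrc(src: string, pageUrl: string): boolean {
  if (src === '' || src === 'about:blank') return true;
  try {
    // Resolve relative values against the page URL before comparing origins.
    return new URL(src, pageUrl).origin === new URL(pageUrl).origin;
  } catch {
    // Unparseable URLs are treated as inaccessible.
    return false;
  }
}
```

Inside the traversal this could replace the inline comparison, for example `if (!isSameOriginFrameSrc(frameElement.src, window.location.href)) break;`.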

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
server/src/workflow-management/selector.ts (1)

1313-1325: ⚠️ Potential issue

Add origin check for iframe access.

Browsers block script access to the content of cross-origin frames, so the traversal should validate the iframe's origin before crossing the boundary.

🧹 Nitpick comments (2)
server/src/workflow-management/selector.ts (2)

1293-1329: Consider extracting the depth limit constant.

The MAX_DEPTH constant is defined within the function. Consider moving it to a module-level constant for better maintainability and reusability.

+const MAX_BOUNDARY_DEPTH = 4;
+
 const getBoundaryPath = (element: HTMLElement): Boundary[] => {
   const path: Boundary[] = [];
   let current = element;
   let depth = 0;
-  const MAX_DEPTH = 4;
   
-  while (current && depth < MAX_DEPTH) {
+  while (current && depth < MAX_BOUNDARY_DEPTH) {

1356-1396: Consider extracting delimiter constants and improving error handling.

The function effectively handles different boundary types, but the delimiter handling could be more maintainable. Also, consider adding more robust error handling for edge cases.

+const SHADOW_DELIMITER = ' >> ';
+const IFRAME_DELIMITER = ' :>> ';
+
 const generateBoundaryAwareSelector = (elementOptions = {}) => {
   if (boundaryPath.length === 0) {
     return finder(element, elementOptions);
   }

   const selectorParts: string[] = [];
   
   boundaryPath.forEach((context, index) => {
     const root = getRootElement(index);

     if (context.type === 'shadow') {
       const hostSelector = finder(context.host, { root });

       if (index === boundaryPath.length - 1) {
         const elementSelector = finder(element, {
           ...elementOptions,
           root: context.root as unknown as Element
         });
-        selectorParts.push(`${hostSelector} >> ${elementSelector}`);
+        selectorParts.push(`${hostSelector}${SHADOW_DELIMITER}${elementSelector}`);
       } else {
         selectorParts.push(hostSelector);
       }
     } else {
       const frameSelector = finder(context.frame, { root });

       if (index === boundaryPath.length - 1) {
         const elementSelector = finder(element, {
           ...elementOptions,
           root: context.document.body as Element
         });
-        selectorParts.push(`${frameSelector} :>> ${elementSelector}`);
+        selectorParts.push(`${frameSelector}${IFRAME_DELIMITER}${elementSelector}`);
       } else {
         selectorParts.push(frameSelector);
       }
     }
   });

   const lastBoundary = boundaryPath[boundaryPath.length - 1];
-  const delimiter = lastBoundary.type === 'shadow' ? ' >> ' : ' :>> ';
+  const delimiter = lastBoundary.type === 'shadow' ? SHADOW_DELIMITER : IFRAME_DELIMITER;
   return selectorParts.join(delimiter);
 };
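
As far as the excerpt above shows, all parts are joined with the delimiter of the last boundary, so a path that mixes iframes and shadow roots would use a single delimiter throughout. If mixed paths need to stay distinguishable, one option is to emit each boundary's own delimiter. The sketch below illustrates that idea with plain strings; `BoundarySegment` and `joinBoundarySelector` are hypothetical names, not part of the PR.

```typescript
const SHADOW_DELIMITER = ' >> ';
const IFRAME_DELIMITER = ' :>> ';

// Hypothetical description of one hop across a boundary.
interface BoundarySegment {
  kind: 'shadow' | 'iframe';
  selector: string; // selector for the shadow host or the iframe element
}

// Joins the hops outermost-first, following each boundary selector with the
// delimiter for its own kind, then appends the target element's selector.
function joinBoundarySelector(
  segments: BoundarySegment[],
  elementSelector: string
): string {
  let result = '';
  for (const segment of segments) {
    const delimiter =
      segment.kind === 'shadow' ? SHADOW_DELIMITER : IFRAME_DELIMITER;
    result += segment.selector + delimiter;
  }
  return result + elementSelector;
}
```

For an element inside a shadow root that itself lives in an iframe, this produces a selector of the form `iframe#a :>> my-host >> button`.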
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 58a893c and 8b82d33.

📒 Files selected for processing (1)
  • server/src/workflow-management/selector.ts (3 hunks)
🔇 Additional comments (2)
server/src/workflow-management/selector.ts (2)

765-780: Well-structured type definitions for boundary handling!

The new type definitions provide a clear and type-safe way to handle both shadow DOM and iframe contexts. The union type Boundary effectively combines both contexts while maintaining their distinct properties.


1410-1410: Excellent cleanup of commented-out code!

The removal of the old iframe selector implementation comments improves code cleanliness.

Also applies to: 1453-1456

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
server/src/workflow-management/selector.ts (2)

1293-1342: Enhance error handling for cross-origin iframe access.

The getBoundaryPath function correctly implements origin checking for iframes, but the error handling could be improved to provide more context about the failure.

Consider enhancing the error handling:

 } catch (error) {
-  console.warn('Cannot access iframe origin:', error);
+  console.warn(`Cannot access iframe origin for frame ${frameElement.id || frameElement.src}:`, error);
   break;
 }

1425-1444: Consider consolidating attribute selectors.

The code defines multiple selector types with overlapping attribute sets. Consider consolidating these into a more maintainable structure.

Here's a suggested refactor:

+const SELECTOR_ATTRIBUTES = {
+  href: ['href'],
+  form: ['name', 'placeholder', 'for'],
+  accessibility: ['aria-label', 'alt', 'title'],
+  testId: [
+    'data-testid',
+    'data-test-id',
+    'data-testing',
+    'data-test',
+    'data-qa',
+    'data-cy',
+  ]
+};

-const hrefSelector = generateBoundaryAwareSelector({
-  attr: genValidAttributeFilter(element, ['href'])
-});
+const hrefSelector = generateBoundaryAwareSelector({
+  attr: genValidAttributeFilter(element, SELECTOR_ATTRIBUTES.href)
+});
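
Taking the refactor one step further, the attribute table could drive selector generation in a single loop. The following is a sketch under assumptions: `buildSelectors` is a hypothetical helper, and its `generate` callback stands in for a call such as `generateBoundaryAwareSelector({ attr: genValidAttributeFilter(element, attributes) })`, whose exact signature is not shown in this thread.

```typescript
const SELECTOR_ATTRIBUTES = {
  href: ['href'],
  form: ['name', 'placeholder', 'for'],
  accessibility: ['aria-label', 'alt', 'title'],
  testId: [
    'data-testid',
    'data-test-id',
    'data-testing',
    'data-test',
    'data-qa',
    'data-cy',
  ],
} as const;

type SelectorKind = keyof typeof SELECTOR_ATTRIBUTES;

// Hypothetical helper: runs `generate` once per attribute group and
// collects the results under the group's name.
function buildSelectors(
  generate: (attributes: readonly string[]) => string | null
): Record<SelectorKind, string | null> {
  const result = {} as Record<SelectorKind, string | null>;
  for (const kind of Object.keys(SELECTOR_ATTRIBUTES) as SelectorKind[]) {
    result[kind] = generate(SELECTOR_ATTRIBUTES[kind]);
  }
  return result;
}
```

Adding a new selector type then only requires a new entry in `SELECTOR_ATTRIBUTES`.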
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8b82d33 and 83afef7.

📒 Files selected for processing (1)
  • server/src/workflow-management/selector.ts (4 hunks)
🔇 Additional comments (3)
server/src/workflow-management/selector.ts (3)

765-780: Well-structured type definitions for boundary handling.

The new type definitions provide a clear and type-safe way to handle different types of DOM boundaries. The separation into ShadowBoundary and IframeBoundary with a union type Boundary follows TypeScript best practices.


1369-1409: LGTM! Robust boundary-aware selector generation.

The generateBoundaryAwareSelector function effectively handles both shadow DOM and iframe contexts, using appropriate delimiters (>> for shadow DOM and :>> for iframes) and maintaining proper selector hierarchy.


1466-1469: Remove commented-out code.

The code contains commented-out sections related to the old iframe selector implementation. Since this functionality has been replaced with the new boundary-aware selector, these comments should be removed to maintain code cleanliness.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
server/src/workflow-management/selector.ts (2)

1293-1342: Document the MAX_DEPTH constant and its implications.

The MAX_DEPTH constant is used to prevent infinite loops but lacks documentation explaining its purpose and why 4 was chosen as the limit.

Add a comment explaining the constant:

-        const MAX_DEPTH = 4;
+        // Maximum depth for traversing shadow DOM and iframe hierarchies
+        // to prevent infinite loops and excessive recursion
+        const MAX_DEPTH = 4;

1669-1690: Extract repeated cross-origin check into a helper function.

The cross-origin iframe check is duplicated across multiple functions. This should be extracted into a reusable helper function to improve maintainability and reduce code duplication.

Create a helper function:

+        function checkIframeSameOrigin(frameElement: HTMLIFrameElement): boolean {
+          try {
+            const iframeOrigin = new URL(frameElement.src).origin;
+            const currentOrigin = window.location.origin;
+            if (iframeOrigin !== currentOrigin) {
+              console.warn(`Skipping cross-origin iframe: ${iframeOrigin}`);
+              return false;
+            }
+            return true;
+          } catch (error) {
+            console.warn('Cannot access iframe origin:', error);
+            return false;
+          }
+        }

Then use it in both locations:

-              try {
-                // Check if we can access the iframe's origin
-                const iframeOrigin = new URL(frameElement.src).origin;
-                const currentOrigin = window.location.origin;
-                if (iframeOrigin !== currentOrigin) {
-                  console.warn(`Skipping cross-origin iframe: ${iframeOrigin}`);
-                  break;
-                }
+              if (!checkIframeSameOrigin(frameElement)) {
+                break;
+              }

Also applies to: 1922-1943

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 83afef7 and 1a8cfbc.

📒 Files selected for processing (1)
  • server/src/workflow-management/selector.ts (12 hunks)
🔇 Additional comments (2)
server/src/workflow-management/selector.ts (2)

8-22: LGTM! Well-structured type definitions.

The new types ShadowBoundary, IframeBoundary, and Boundary are well-defined and provide good type safety for handling different DOM contexts.


1423-1424: Remove commented-out code.

The code contains commented-out sections related to the old iframe selector implementation. Since this functionality has been replaced with the new boundary-aware selector, these comments should be removed.

Also applies to: 1466-1469

@RohitR311 RohitR311 marked this pull request as draft February 1, 2025 07:10
@RohitR311 RohitR311 added the Status: Work In Progess This issue/PR is actively being worked on label Feb 1, 2025