-
Notifications
You must be signed in to change notification settings - Fork 696
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: better IFrame and shadowDom elements scraping #409
base: develop
Are you sure you want to change the base?
Conversation
WalkthroughThe pull request introduces substantial enhancements to the selector generation process in the Changes
Assessment against linked issues
Possibly related PRs
Suggested labels
Suggested reviewers
Poem
✨ Finishing Touches
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (4)
server/src/workflow-management/selector.ts (4)
1343-1364
: Make shadow DOM depth configurable
Currently, the maximum shadow DOM depth (MAX_DEPTH = 4
) is hard-coded, which might miss elements in deeper nested shadow roots. Consider making this limit configurable or documenting the rationale behind the chosen value.
1366-1398
: Unify and streamline shadow-aware selector logic
ThegenSelectors
function and its nestedgenerateShadowAwareSelector
are clear, but could be further streamlined by directly leveraging the newgetShadowPath
approach. This may avoid future duplication and improve maintainability if you unify your shadow DOM traversal logic in a single helper.
1409-1409
: Exercise caution with the all-attributes mode
Using{ attr: () => true }
can potentially produce very lengthy or non-unique selectors if the element has many attributes. Confirm that this is the desired behavior.
1439-1441
: Revisit ignoring IDs starting with digits
Currently, you skip IDs that begin with a digit, but CSS allows them if properly escaped. Evaluate whether escaping these IDs or falling back to attributes is preferable for completeness.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
server/src/workflow-management/selector.ts
(1 hunks)server/src/workflow-management/utils.ts
(0 hunks)
💤 Files with no reviewable changes (1)
- server/src/workflow-management/utils.ts
🔇 Additional comments (6)
server/src/workflow-management/selector.ts (6)
1403-1403
: No concerns
The call togenerateShadowAwareSelector()
is wrapped in atry/catch
block to handle potential errors, which is good practice.
1414-1414
: No code changes
Nothing was modified on this blank line.
1415-1417
: No issues with href filter
Generating a shadow-aware selector filtering on['href']
is straightforward and appears correct.
1418-1420
: Potential collisions on 'name', 'placeholder', or 'for'
Multiple elements can share the same 'name', 'placeholder', or 'for' attributes, leading to non-unique selectors. Verify if additional checks or fallback strategies are needed.
1421-1423
: Possible duplication of aria-label, alt, or title
Elements can reuse these attributes, potentially resulting in non-unique selectors. Ensure this is acceptable, or consider an additional fallback for uniqueness.
1425-1434
: Excellent practice of supporting test ID attributes
Using a custom test ID approach improves maintainability and reliability of selectors. Implementation looks solid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (2)
server/src/workflow-management/selector.ts (2)
1359-1395
: Consider making the depth limit configurable.The
getBoundaryPath
implementation is robust and handles both shadow DOM and iframe contexts well. However, theMAX_DEPTH
constant is hardcoded. Consider making it configurable through options or environment variables for better flexibility.-const MAX_DEPTH = 4; +const MAX_BOUNDARY_DEPTH = process.env.MAX_BOUNDARY_DEPTH + ? parseInt(process.env.MAX_BOUNDARY_DEPTH, 10) + : 4;
1422-1462
: Consider caching boundary-aware selectors for performance.The
generateBoundaryAwareSelector
function performs well but might be called multiple times for the same element with different options. Consider implementing a caching mechanism for the boundary path calculations to improve performance.+const boundaryPathCache = new WeakMap<HTMLElement, string>(); const generateBoundaryAwareSelector = (elementOptions = {}) => { + const cachedPath = boundaryPathCache.get(element); + if (cachedPath && !elementOptions.attr) { + return cachedPath; + } if (boundaryPath.length === 0) { return finder(element, elementOptions); } // ... existing code ... + if (!elementOptions.attr) { + boundaryPathCache.set(element, selectorParts.join(delimiter)); + } return selectorParts.join(delimiter); };
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
server/src/workflow-management/selector.ts
(4 hunks)
🔇 Additional comments (1)
server/src/workflow-management/selector.ts (1)
765-780
: Well-structured type definitions for DOM boundaries!The introduction of
ShadowBoundary
andIframeBoundary
types with the unifiedBoundary
type provides a clear and type-safe way to handle different DOM contexts. This improves code maintainability and makes the boundary handling more robust.
} catch (e) { | ||
} | ||
|
||
// const iframeSelector = genSelectorForIframe(element); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Remove commented-out code.
The code contains commented-out sections related to the old iframe selector implementation. Since this functionality has been replaced with the new boundary-aware selector, these comments should be removed to maintain code cleanliness.
-// const iframeSelector = genSelectorForIframe(element);
-// iframeSelector: iframeSelector ? {
-// full: iframeSelector.fullSelector,
-// isIframe: iframeSelector.isFrameContent,
-// } : null,
Also applies to: 1519-1522
const ownerDocument = current.ownerDocument; | ||
const frameElement = ownerDocument?.defaultView?.frameElement as HTMLIFrameElement; | ||
if (frameElement) { | ||
path.unshift({ | ||
type: 'iframe', | ||
frame: frameElement, | ||
document: ownerDocument, | ||
element: current | ||
}); | ||
|
||
// For the last context, get selector for target element | ||
if (index === shadowPath.length - 1) { | ||
const elementSelector = finder(element, { | ||
root: context.root as unknown as Element | ||
}); | ||
selectorParts.push(`${hostSelector} >> ${elementSelector}`); | ||
} else { | ||
selectorParts.push(hostSelector); | ||
} | ||
}); | ||
|
||
return { | ||
fullSelector: selectorParts.join(' >> '), | ||
mode: shadowPath[shadowPath.length - 1].root.mode | ||
}; | ||
} catch (e) { | ||
console.warn('Error generating shadow DOM selector:', e); | ||
return null; | ||
current = frameElement; | ||
depth++; | ||
continue; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add origin check for iframe access.
When accessing iframe content, it's important to check the iframe's origin to prevent potential security issues with cross-origin frames. Consider adding origin validation before attempting to access iframe content.
const ownerDocument = current.ownerDocument;
const frameElement = ownerDocument?.defaultView?.frameElement as HTMLIFrameElement;
if (frameElement) {
+ try {
+ // Check if we can access the iframe's origin
+ const iframeOrigin = new URL(frameElement.src).origin;
+ const currentOrigin = window.location.origin;
+ if (iframeOrigin !== currentOrigin) {
+ console.warn(`Skipping cross-origin iframe: ${iframeOrigin}`);
+ break;
+ }
+
path.unshift({
type: 'iframe',
frame: frameElement,
document: ownerDocument,
element: current
});
current = frameElement;
depth++;
continue;
+ } catch (error) {
+ console.warn('Cannot access iframe origin:', error);
+ break;
+ }
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
const ownerDocument = current.ownerDocument; | |
const frameElement = ownerDocument?.defaultView?.frameElement as HTMLIFrameElement; | |
if (frameElement) { | |
path.unshift({ | |
type: 'iframe', | |
frame: frameElement, | |
document: ownerDocument, | |
element: current | |
}); | |
// For the last context, get selector for target element | |
if (index === shadowPath.length - 1) { | |
const elementSelector = finder(element, { | |
root: context.root as unknown as Element | |
}); | |
selectorParts.push(`${hostSelector} >> ${elementSelector}`); | |
} else { | |
selectorParts.push(hostSelector); | |
} | |
}); | |
return { | |
fullSelector: selectorParts.join(' >> '), | |
mode: shadowPath[shadowPath.length - 1].root.mode | |
}; | |
} catch (e) { | |
console.warn('Error generating shadow DOM selector:', e); | |
return null; | |
current = frameElement; | |
depth++; | |
continue; | |
} | |
const ownerDocument = current.ownerDocument; | |
const frameElement = ownerDocument?.defaultView?.frameElement as HTMLIFrameElement; | |
if (frameElement) { | |
try { | |
// Check if we can access the iframe's origin | |
const iframeOrigin = new URL(frameElement.src).origin; | |
const currentOrigin = window.location.origin; | |
if (iframeOrigin !== currentOrigin) { | |
console.warn(`Skipping cross-origin iframe: ${iframeOrigin}`); | |
break; | |
} | |
path.unshift({ | |
type: 'iframe', | |
frame: frameElement, | |
document: ownerDocument, | |
element: current | |
}); | |
current = frameElement; | |
depth++; | |
continue; | |
} catch (error) { | |
console.warn('Cannot access iframe origin:', error); | |
break; | |
} | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (1)
server/src/workflow-management/selector.ts (1)
1313-1325
:⚠️ Potential issueAdd origin check for iframe access.
When accessing iframe content, it's important to check the iframe's origin to prevent potential security issues with cross-origin frames.
🧹 Nitpick comments (2)
server/src/workflow-management/selector.ts (2)
1293-1329
: Consider extracting the depth limit constant.The
MAX_DEPTH
constant is defined within the function. Consider moving it to a module-level constant for better maintainability and reusability.+const MAX_BOUNDARY_DEPTH = 4; + const getBoundaryPath = (element: HTMLElement): Boundary[] => { const path: Boundary[] = []; let current = element; let depth = 0; - const MAX_DEPTH = 4; - while (current && depth < MAX_DEPTH) { + while (current && depth < MAX_BOUNDARY_DEPTH) {
1356-1396
: Consider extracting delimiter constants and improving error handling.The function effectively handles different boundary types, but the delimiter handling could be more maintainable. Also, consider adding more robust error handling for edge cases.
+const SHADOW_DELIMITER = ' >> '; +const IFRAME_DELIMITER = ' :>> '; + const generateBoundaryAwareSelector = (elementOptions = {}) => { if (boundaryPath.length === 0) { return finder(element, elementOptions); } const selectorParts: string[] = []; boundaryPath.forEach((context, index) => { const root = getRootElement(index); if (context.type === 'shadow') { const hostSelector = finder(context.host, { root }); if (index === boundaryPath.length - 1) { const elementSelector = finder(element, { ...elementOptions, root: context.root as unknown as Element }); - selectorParts.push(`${hostSelector} >> ${elementSelector}`); + selectorParts.push(`${hostSelector}${SHADOW_DELIMITER}${elementSelector}`); } else { selectorParts.push(hostSelector); } } else { const frameSelector = finder(context.frame, { root }); if (index === boundaryPath.length - 1) { const elementSelector = finder(element, { ...elementOptions, root: context.document.body as Element }); - selectorParts.push(`${frameSelector} :>> ${elementSelector}`); + selectorParts.push(`${frameSelector}${IFRAME_DELIMITER}${elementSelector}`); } else { selectorParts.push(frameSelector); } } }); const lastBoundary = boundaryPath[boundaryPath.length - 1]; - const delimiter = lastBoundary.type === 'shadow' ? ' >> ' : ' :>> '; + const delimiter = lastBoundary.type === 'shadow' ? SHADOW_DELIMITER : IFRAME_DELIMITER; return selectorParts.join(delimiter); };
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
server/src/workflow-management/selector.ts
(3 hunks)
🔇 Additional comments (2)
server/src/workflow-management/selector.ts (2)
765-780
: Well-structured type definitions for boundary handling!The new type definitions provide a clear and type-safe way to handle both shadow DOM and iframe contexts. The union type
Boundary
effectively combines both contexts while maintaining their distinct properties.
1410-1410
: Excellent cleanup of commented-out code!The removal of the old iframe selector implementation comments improves code cleanliness.
Also applies to: 1453-1456
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
server/src/workflow-management/selector.ts (2)
1293-1342
: Enhance error handling for cross-origin iframe access.The
getBoundaryPath
function correctly implements origin checking for iframes, but the error handling could be improved to provide more context about the failure.Consider enhancing the error handling:
} catch (error) { - console.warn('Cannot access iframe origin:', error); + console.warn(`Cannot access iframe origin for frame ${frameElement.id || frameElement.src}:`, error); break; }
1425-1444
: Consider consolidating attribute selectors.The code defines multiple selector types with overlapping attribute sets. Consider consolidating these into a more maintainable structure.
Here's a suggested refactor:
+const SELECTOR_ATTRIBUTES = { + href: ['href'], + form: ['name', 'placeholder', 'for'], + accessibility: ['aria-label', 'alt', 'title'], + testId: [ + 'data-testid', + 'data-test-id', + 'data-testing', + 'data-test', + 'data-qa', + 'data-cy', + ] +}; -const hrefSelector = generateBoundaryAwareSelector({ - attr: genValidAttributeFilter(element, ['href']) -}); +const hrefSelector = generateBoundaryAwareSelector({ + attr: genValidAttributeFilter(element, SELECTOR_ATTRIBUTES.href) +});
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
server/src/workflow-management/selector.ts
(4 hunks)
🔇 Additional comments (3)
server/src/workflow-management/selector.ts (3)
765-780
: Well-structured type definitions for boundary handling.The new type definitions provide a clear and type-safe way to handle different types of DOM boundaries. The separation into
ShadowBoundary
andIframeBoundary
with a union typeBoundary
follows TypeScript best practices.
1369-1409
: LGTM! Robust boundary-aware selector generation.The
generateBoundaryAwareSelector
function effectively handles both shadow DOM and iframe contexts, using appropriate delimiters (>>
for shadow DOM and:>>
for iframes) and maintaining proper selector hierarchy.
1466-1469
: Remove commented-out code.The code contains commented-out sections related to the old iframe selector implementation. Since this functionality has been replaced with the new boundary-aware selector, these comments should be removed to maintain code cleanliness.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
server/src/workflow-management/selector.ts (2)
1293-1342
: Document the MAX_DEPTH constant and its implications.The
MAX_DEPTH
constant is used to prevent infinite loops but lacks documentation explaining its purpose and why 4 was chosen as the limit.Add a comment explaining the constant:
- const MAX_DEPTH = 4; + // Maximum depth for traversing shadow DOM and iframe hierarchies + // to prevent infinite loops and excessive recursion + const MAX_DEPTH = 4;
1669-1690
: Extract repeated cross-origin check into a helper function.The cross-origin iframe check is duplicated across multiple functions. This should be extracted into a reusable helper function to improve maintainability and reduce code duplication.
Create a helper function:
+ function checkIframeSameOrigin(frameElement: HTMLIFrameElement): boolean { + try { + const iframeOrigin = new URL(frameElement.src).origin; + const currentOrigin = window.location.origin; + if (iframeOrigin !== currentOrigin) { + console.warn(`Skipping cross-origin iframe: ${iframeOrigin}`); + return false; + } + return true; + } catch (error) { + console.warn('Cannot access iframe origin:', error); + return false; + } + }Then use it in both locations:
- try { - // Check if we can access the iframe's origin - const iframeOrigin = new URL(frameElement.src).origin; - const currentOrigin = window.location.origin; - if (iframeOrigin !== currentOrigin) { - console.warn(`Skipping cross-origin iframe: ${iframeOrigin}`); - break; - } + if (!checkIframeSameOrigin(frameElement)) { + break; + }Also applies to: 1922-1943
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
server/src/workflow-management/selector.ts
(12 hunks)
🔇 Additional comments (2)
server/src/workflow-management/selector.ts (2)
8-22
: LGTM! Well-structured type definitions.The new types
ShadowBoundary
,IframeBoundary
, andBoundary
are well-defined and provide good type safety for handling different DOM contexts.
1423-1424
: Remove commented-out code.The code contains commented-out sections related to the old iframe selector implementation. Since this functionality has been replaced with the new boundary-aware selector, these comments should be removed.
Also applies to: 1466-1469
Closes: #408
Summary by CodeRabbit
New Features
Bug Fixes
Refactor