You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
But there is often the values that are corrupted and have a negative impact on downstream transformations, e.g.: too big values to write into a Firestore document, corrupted values popping up in aggregations.
But in the meantime and if not feasible to achieve 100% valid technologies on wptagent - we should add a cleanup stage for the production crawl.pages table.
It will save the time when working with the crawl data.
SQL draft:
DECLARE crawl_month DATE DEFAULT DATE('2024-11-01');
CREATE TEMP TABLE technologies_cleaned AS (
WITH wappalyzer AS (
SELECT
name as technology,
category
FROMwappalyzer.apps,
UNNEST(categories) AS category
), pages AS (
SELECTdate,
client,
page,
technologies
FROMcrawl.pagesWHEREdate= crawl_month
), impacted_pages AS (
SELECT DISTINCTdate,
client,
page
FROM pages,
UNNEST (technologies) AS tech,
UNNEST (tech.categories) AS category
LEFT JOIN wappalyzer
USING (technology, category)
WHEREwappalyzer.category IS NULLORwappalyzer.technology IS NULL
), flattened_technologies AS (
SELECTdate,
client,
page,
technology,
category,
info
FROM pages,
UNNEST(technologies) AS tech,
UNNEST(tech.categories) AS category
WHERE page IN (SELECT DISTINCT page FROM impacted_pages)
), whitelisted_technologies AS (
SELECTdate,
client,
page,
f.technology,
f.category,
f.infoFROM flattened_technologies f
INNER JOIN wappalyzer
USING (technology, category)
), reconstructed_technologies AS (
SELECTdate,
client,
page,
ARRAY_AGG(STRUCT(
technology,
categories,
info
)) AS technologies
FROM (
SELECTdate,
client,
page,
technology,
ARRAY_AGG(DISTINCT category IGNORE NULLS) AS categories,
info
FROM whitelisted_technologies
GROUP BYdate, client, page, technology, info
)
GROUP BYdate, client, page
)
SELECTdate,
client,
page,
r.technologiesFROM impacted_pages
LEFT JOIN reconstructed_technologies r
USING (date, client, page)
);
UPDATEcrawl.pagesSET technologies =technologies_cleaned.technologiesFROM technologies_cleaned
WHEREpages.date= crawl_month ANDpages.client=technologies_cleaned.clientANDpages.page=technologies_cleaned.page;
The text was updated successfully, but these errors were encountered:
But there is often the values that are corrupted and have a negative impact on downstream transformations, e.g.: too big values to write into a Firestore document, corrupted values popping up in aggregations.
We can do some efforts to fix the technologies detections on wptagent side.
But in the meantime and if not feasible to achieve 100% valid technologies on wptagent - we should add a cleanup stage for the production
crawl.pages
table.It will save the time when working with the crawl data.
SQL draft:
The text was updated successfully, but these errors were encountered: