Parsing HTML #12

@kcambrek

I have trained the model myself and would now like to use it for inference.

How did you parse the HTML to get all the relevant nodes of the DOM tree? This is how I implemented it, based on what I assume you used, but the model's results on new data are way off.

import os

import pandas as pd
from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.common.by import By

# `driver` is a Selenium WebDriver and `urls` a list of page URLs, both created elsewhere
for c, url in enumerate(urls):
    driver.get(url)
    driver.save_screenshot(os.path.join("test_data", "imgs", f"{c}.png"))
    locations = []

    # every element that has an id attribute
    ids = driver.find_elements(By.XPATH, '//*[@id]')
    for ii in ids:
        try:
            if ii.is_displayed():
                location_dic = {}
                location_dic.update(ii.location)  # x, y
                location_dic.update(ii.size)      # height, width
                # check if bounding box in screenshot
                if all(i < 1280 for i in location_dic.values()):
                    locations.append(location_dic)
        except StaleElementReferenceException:
            # element was detached from the DOM after the lookup; skip it
            continue

    # save bounding boxes in csv
    bbox_df = pd.DataFrame(locations)
    print(len(bbox_df))
    bbox_df = bbox_df.astype(float)
    bbox_df.to_csv(os.path.join("test_data", "bboxes", f"{c}.csv"), sep=",", index=False)
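One thing that might be worth checking in the snippet above: the `1280` test compares x, y, width and height to the limit one at a time, so a box starting at x=1000 with width 500 passes even though it extends past 1280 pixels. A minimal sketch of a stricter check follows. The helper name `bbox_in_viewport` and the 1280x720 viewport are assumptions for illustration, not something taken from the repository, and this may well not be the cause of the poor results.

```python
def bbox_in_viewport(box, viewport_width, viewport_height):
    """Return True if the box has non-zero area and lies fully inside the viewport.

    `box` is a dict with the keys "x", "y", "width" and "height", i.e. the
    result of merging a Selenium element's `location` and `size` dicts.
    """
    x, y = box["x"], box["y"]
    w, h = box["width"], box["height"]
    if w <= 0 or h <= 0:
        # zero-sized elements cannot be cropped from the screenshot
        return False
    return x >= 0 and y >= 0 and x + w <= viewport_width and y + h <= viewport_height


# Hypothetical boxes, made up for illustration only
boxes = [
    {"x": 10, "y": 20, "width": 100, "height": 50},    # fully inside
    {"x": 1000, "y": 20, "width": 500, "height": 50},  # right edge at 1500
    {"x": 0, "y": 0, "width": 0, "height": 0},         # empty box
]

# Assumed viewport of 1280x720; use the real size of the saved screenshot instead
kept = [b for b in boxes if bbox_in_viewport(b, 1280, 720)]
```

In the loop above this would replace the `all(...)` condition, with the viewport size read from the saved screenshot or from `driver.get_window_size()` (the latter reports the outer window, so it can be slightly larger than the rendered page area).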
