Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
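As a minimal sketch of what that looks like (using a tiny HTML snippet defined inline, just for illustration), you hand a string to the `BeautifulSoup` constructor and navigate the result:

```python
from bs4 import BeautifulSoup

# a tiny inline document, just for illustration
html = "<html><body><p class='intro'>Hello, <b>world</b>!</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.p.get_text())      # Hello, world!
print(soup.p.attrs["class"])  # ['intro']  (the class attribute is multi-valued, so it's a list)
print(soup.find("b"))         # <b>world</b>
```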
```python
# print(soup_2)

# write back to the original file
# with open(file_path, "w", encoding="utf-8") as file:
#     # write the new str
#     file.write(soup_2.prettify())
```
class NavigableString
A tag can contain strings as pieces of text. Beautiful Soup uses the NavigableString class to contain these pieces of text:
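For instance (a minimal sketch with a made-up `<p>` tag), the `.string` of a tag is not a plain `str` but a `NavigableString`, which also behaves like a string:

```python
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup("<p>some text</p>", "html.parser")
text = soup.p.string

print(type(text))                         # <class 'bs4.element.NavigableString'>
print(isinstance(text, NavigableString))  # True
print(isinstance(text, str))              # True, NavigableString subclasses str
```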
```python
from bs4 import NavigableString

# for child in tag_installation.children:
#     print(child)

# .children is an iterator, not a list!
# print(type(tag_installation.children))

# you can transform it into a list
child_list = list(tag_installation.children)
print(len(child_list))
print(r"Number of \n:", child_list.count("\n"))

# if you want to ignore '\n' (whitespace-only strings):
filtered_string = [
    child
    for child in tag_installation.children
    if not (isinstance(child, NavigableString) and child.strip() == "")
]
print(len(filtered_string))
print(tag_installation.string)
```
```python
# .contents gives a list, the same as list(tag_installation.children)
print(tag_installation.contents)

filtered = list(filter(lambda x: x != "\n", tag_installation.contents))
print(filtered)

for content in filtered:
    print(content.string)
```

```python
# or you can use the .descendants attribute
all_des = list(filter(lambda x: x != "\n", tag_installation.descendants))
print(all_des)
for content in all_des:
    print(content)
    print(content.string)
```
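The difference: `.children` only yields a tag's direct children, while `.descendants` walks the whole subtree recursively. A minimal sketch (with a made-up snippet) to show the contrast:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>a <b>b</b></p></div>", "html.parser")
div = soup.div

print([str(c) for c in div.children])     # direct children only: ['<p>a <b>b</b></p>']
print([str(d) for d in div.descendants])  # whole subtree: ['<p>a <b>b</b></p>', 'a ', '<b>b</b>', 'b']
```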
Using requests to see a website’s HTML
The key to scraping a page is understanding its structure.
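One way to inspect that structure (a sketch, assuming any URL you want to study) is to fetch the page with requests and print the prettified tree:

```python
import requests
from bs4 import BeautifulSoup

url = "https://example.com"  # placeholder URL, swap in the page you want to study
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# prettify() indents the tree so the nesting is easy to read
print(soup.prettify()[:500])
```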
Let’s look at another demo. How would you copy the content of a web page? You could scroll down with your mouse, or press Ctrl + A and copy everything by hand. With scraping, you can automate this instead!
We will use Lilian Weng’s blog post “How we think” as a demo.
```python
import requests
from bs4 import BeautifulSoup

# fetch the post first (post_url is a placeholder for the article's URL)
response = requests.get(post_url)

soup_lilian = BeautifulSoup(response.text, "html.parser")
print(soup_lilian.find("title").get_text())

paras = soup_lilian.find_all("p")
for para in paras:
    print(para.get_text())
```
After that, you can join the paragraphs and get the whole article as plain text!
```python
content_string = [para.get_text().strip() for para in paras]
final_string = "\n".join(content_string)

print(final_string)

# write into a file (the with block closes it automatically, no file.close() needed)
with open("demo.md", "w") as file:
    file.write(final_string)

refs = soup_lilian.find_all("a")
for ref in refs:
    print(ref.attrs["href"])  # attrs is a dict
```