python - Using BeautifulSoup4 to retrieve text between 2 tags at different levels
Here's a snippet of a "real-world" HTML file I'm trying to scrape with BeautifulSoup4 (Python 3) using the XML parser (the other parsers don't work with the kind of dirty HTML files I'm working with):
    <html>
    <p>
    hello
    </p>
    <a name='one'>item one</a>
    <p>
    text to scrape.
    </p>
    <p>
    more text to scrape.
    <table>
      <tr>
        <td>
          <a name='two'>item two</a>
        </td>
      </tr>
    </table>
    a bunch of text that shouldn't be scraped.
    more text.
    and more text.
    </p>
    </html>
My goal is to scrape the text sitting between <a name='one'>item one</a> and <a name='two'>item two</a> without scraping the 3 lines of text in the last <p>.
I've tried traversing from the first <a> tag using the find_next() function and then invoking get_text(), but when I hit the last <p>, the text at the end gets scraped too, which isn't what I want.
Sample code:

    tag_one = soup.find('a', {'name': 'one'})
    tag_two = soup.find('a', {'name': 'two'})
    found = False
    tag = tag_one
    while found == False:
        tag = tag.find_next()
        if tag == tag_two:
            found = True
        print(tag.get_text())
Any ideas on how to solve this?
You could use the find_all_next method to iterate over the tags that follow, and get the list of strings for each tag using its stripped_strings generator.
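    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'xml')
    tag_one = soup.find('a', {'name': 'one'})
    tag_two = soup.find('a', {'name': 'two'})
    text = None
    for tag in tag_one.find_all_next():
        # Stop as soon as the second anchor is reached
        if tag is tag_two:
            break
        strings = list(tag.stripped_strings)
        # Print only the first string of each tag, skipping repeats
        if strings and strings[0] != text:
            text = strings[0]
            print(text)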
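A possible variant of the same idea, as a rough sketch rather than a definitive fix: instead of visiting tags and taking the first string of each, walk tag_one.next_elements, which yields tags and text nodes in document order, and keep only the text nodes until tag_two is reached. The helper name text_between and the inline HTML sample are made up for illustration, and the sketch assumes the built-in 'html.parser' backend so that it runs without lxml; with the dirty files from the question, the 'xml' parser may still be needed.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical sample, modelled on the snippet in the question
HTML = """<html>
<p>hello</p>
<a name='one'>item one</a>
<p>text to scrape.</p>
<p>more text to scrape.
<table><tr><td><a name='two'>item two</a></td></tr></table>
text that shouldn't be scraped.
</p>
</html>"""


def text_between(start, end):
    """Collect the stripped text nodes found after `start` and before `end`."""
    chunks = []
    for node in start.next_elements:
        # Stop once the closing marker tag is reached
        if node is end:
            break
        # Keep text nodes only, and skip the start tag's own text
        if isinstance(node, NavigableString) and node.parent is not start:
            text = node.strip()
            if text:
                chunks.append(text)
    return chunks


# Assumption: 'html.parser' is used here only to keep the sketch self-contained
soup = BeautifulSoup(HTML, "html.parser")
tag_one = soup.find("a", {"name": "one"})
tag_two = soup.find("a", {"name": "two"})
result = text_between(tag_one, tag_two)
print(result)
```

Because the loop stops at tag_two in document order, the text that follows the table inside the last <p> is never visited, whatever the nesting level of the two anchors.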