Python - Using BeautifulSoup4 to retrieve text between 2 tags at different levels


Here's a snippet of a "real-world" HTML file I'm trying to scrape with BeautifulSoup4 (Python 3) using the XML parser (the other parsers don't work with the kind of dirty HTML files I'm working with):

    <html>
        <p> hello </p>
        <a name='one'>item one</a>
        <p> text to scrape. </p>
        <p> more text to scrape.
            <table>
                <tr>
                    <td>
                        <a name='two'>item two</a>
                    </td>
                </tr>
            </table>
            a bunch of text that shouldn't be scraped.
            more text.
            and more text.
        </p>
    </html>

My goal is to scrape the text sitting between <a name='one'>item one</a> and <a name='two'>item two</a> without scraping the 3 lines of text in the last <p>.

I've attempted to traverse from the first <a> tag using the find_next() function and invoking get_text(), but what happens is that when I hit the last <p>, the text at the end gets scraped too, which isn't what I want.

Sample code:

    tag_one = soup.find('a', {'name': 'one'})
    tag_two = soup.find('a', {'name': 'two'})

    found = False
    tag = tag_one
    while not found:
        tag = tag.find_next()
        if tag == tag_two:
            found = True
        # get_text() returns the text of the tag and of all its descendants
        print(tag.get_text())

Any ideas on how to solve this?

You could use the find_all_next method to iterate over the next tags, and get a list of strings for each tag using its stripped_strings generator.

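    from bs4 import BeautifulSoup

    # html holds the markup shown in the question
    soup = BeautifulSoup(html, 'xml')
    tag_one = soup.find('a', {'name': 'one'})
    tag_two = soup.find('a', {'name': 'two'})
    text = None

    for tag in tag_one.find_all_next():
        # stop as soon as the second anchor is reached
        if tag is tag_two:
            break
        strings = list(tag.stripped_strings)
        # print only the first string of each tag, skipping repeats
        if strings and strings[0] != text:
            text = strings[0]
            print(text)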
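As a variation on the same idea, one could walk the individual text nodes instead of the tags, which avoids looking at the text of wrapper tags (such as the <table>) at all. The following is only a minimal sketch, not the answer's method: the helper name text_between is hypothetical, the markup is a trimmed copy of the sample, and it assumes "html.parser" for portability (the question uses the 'xml' parser, which requires lxml).

```python
from bs4 import BeautifulSoup, NavigableString

# Trimmed copy of the sample markup from the question (an assumption for this sketch).
HTML = """
<html>
    <p> hello </p>
    <a name='one'>item one</a>
    <p> text to scrape. </p>
    <p> more text to scrape.
        <table>
            <tr>
                <td>
                    <a name='two'>item two</a>
                </td>
            </tr>
        </table>
        a bunch of text that shouldn't be scraped.
    </p>
</html>
"""


def text_between(start, end):
    """Collect non-empty text nodes found after `start` and before `end`.

    Hypothetical helper: walks the document in order via next_elements.
    """
    chunks = []
    for node in start.next_elements:
        if node is end:
            break  # reached the closing anchor, stop collecting
        if not isinstance(node, NavigableString):
            continue  # tags carry no text of their own; their strings come later
        if any(parent is start for parent in node.parents):
            continue  # skip the start tag's own text ("item one")
        text = node.strip()
        if text:
            chunks.append(text)
    return chunks


# "html.parser" is an assumption made so the sketch runs without lxml.
soup = BeautifulSoup(HTML, "html.parser")
tag_one = soup.find("a", {"name": "one"})
tag_two = soup.find("a", {"name": "two"})
result = text_between(tag_one, tag_two)
print(result)
```

Because the loop stops at the second anchor, the text that follows the table inside the last <p> is never visited. Note that comments and CDATA sections are also NavigableString subclasses, so real-world pages may need an extra filter.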
