Java - JSoup: Problems getting text from elements
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class Main {
    public static void main(String[] args) throws Exception {
        Document d = Jsoup.connect("https://osu.ppy.sh/u/charless").get();
        for (Element line : d.select("div.profilestatline")) {
            System.out.println(d.select("b").text());
        }
    }
}
I'm having problems getting the text "2027pp (#97,094)" from the b element inside div.profilestatline. It should be printed, but it isn't. URL: https://osu.ppy.sh/u/charless
Parts of the page are loaded with JavaScript, which is why you can't see the divs you're looking for.
You can use a browser to load the page and interpret the JavaScript before parsing. A library such as WebDriverManager can help with this.
public static void main(String[] args) throws Exception {
    // Set up the ChromeDriver binary via WebDriverManager
    ChromeDriverManager.getInstance().setup();
    ChromeDriver chromeDriver = new ChromeDriver();
    // Let a real browser load the page and run its JavaScript
    chromeDriver.get("https://osu.ppy.sh/u/charless");
    // Hand the rendered HTML to JSoup
    Document d = Jsoup.parse(chromeDriver.getPageSource());
    chromeDriver.close();
    for (Element line : d.select("div.profilestatline")) {
        System.out.println(line.select("b").text());
    }
}
The alternative is to examine the JavaScript in the page and make the same calls yourself to retrieve the data.
The page loads the profile from https://osu.ppy.sh/pages/include/profile-general.php?u=4084042&m=0. It looks like u is the user ID, which is relatively simple to extract from the page:
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ProfileScraper {
    // Matches the user ID assignment in the page's inline script
    private static final Pattern UID_PATTERN = Pattern.compile("var userid = (\\d+);");

    public static void main(String[] args) throws IOException {
        String uid = getUid("charless");
        // Request the profile fragment directly, as the page's JavaScript does
        Document d = Jsoup.connect("https://osu.ppy.sh/pages/include/profile-general.php?u=" + uid).get();
        for (Element line : d.select("div.profilestatline")) {
            System.out.println(line.select("b").text());
        }
    }

    public static String getUid(String name) throws IOException {
        Document d1 = Jsoup.connect("https://osu.ppy.sh/u/" + name).get();
        // Search each script element for the user ID
        for (Element script : d1.select("script")) {
            String text = script.data();
            Matcher uidMatcher = UID_PATTERN.matcher(text);
            if (uidMatcher.find()) {
                return uidMatcher.group(1);
            }
        }
        throw new IOException("no such character");
    }
}
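To check the ID extraction step without touching the network, the regular expression can be run against a string. This is only a minimal sketch: the class name UidExtractDemo, the helper names extractUid and profileUrl, and the sample script text are made up for illustration, and the pattern is copied exactly as it appears in the answer above, so its casing may need adjusting to match the real page.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UidExtractDemo {
    // Same pattern as in the answer above; the casing is taken as-is from that code
    private static final Pattern UID_PATTERN = Pattern.compile("var userid = (\\d+);");

    // Hypothetical helper: returns the user ID if the script text contains one
    static Optional<String> extractUid(String scriptText) {
        Matcher m = UID_PATTERN.matcher(scriptText);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    // Hypothetical helper: builds the profile fragment URL for a given user ID
    static String profileUrl(String uid) {
        return "https://osu.ppy.sh/pages/include/profile-general.php?u=" + uid;
    }

    public static void main(String[] args) {
        // Made-up script content standing in for what script.data() might return
        String sample = "var foo = 1; var userid = 4084042; var bar = 2;";
        Optional<String> uid = extractUid(sample);
        if (uid.isPresent()) {
            System.out.println(profileUrl(uid.get()));
        } else {
            System.out.println("no user ID found");
        }
    }
}
```

Returning an Optional instead of throwing keeps the "not found" case explicit for the caller, which can be handy when looping over several script elements.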