git.pterodactylus.net Git - Sone.git/commit

🚸 Improve text extraction even further

This actually revamps the way the first paragraph is extracted from freesites and could cause a lot more descriptions to show up in Sone — which was the goal!

Previously, I tried to locate all top-level nodes (directly under <body>) that had text nodes below them and whose names did not start with “h” (to exclude the heading tags). It turns out this is easily defeated by wrapping the whole site in, e.g., a <center> tag, and I’m sure that a <div> tag would do exactly the same…

So now I use a CSS selector query to get all <p> and <div> nodes, keep those with text nodes below them, and then get their text (which flattens them for me and removes embedded tags like <a> or <span>).
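
The approach described above could be sketched roughly as follows. This is not the code from DefaultElementLoader.kt. It assumes the jsoup library is on the classpath, the function name firstParagraph is hypothetical, and it interprets “text nodes below them” as direct, non-blank text children, which may differ from the actual implementation.

```kotlin
import org.jsoup.Jsoup

// Hypothetical helper, not taken from the Sone code base.
// Returns the flattened text of the first <p> or <div> that directly
// contains non-blank text, or null if there is no such element.
fun firstParagraph(html: String): String? =
    Jsoup.parse(html)
        // CSS selector query: all <p> and <div> elements, in document order.
        .select("p, div")
        // Assumption: keep only elements with at least one direct, non-blank text node.
        .filter { element -> element.textNodes().any { !it.isBlank() } }
        // text() flattens the element, dropping embedded tags like <a> or <span>.
        .map { it.text() }
        .firstOrNull { it.isNotBlank() }
```

Because the selector matches elements at any depth, wrapping the whole page in a <center> or <div> tag no longer hides the paragraph.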

author	David ‘Bombe’ Roden <bombe@pterodactylus.net>
	Fri, 2 Sep 2022 15:27:11 +0000 (17:27 +0200)
committer	David ‘Bombe’ Roden <bombe@pterodactylus.net>
	Fri, 2 Sep 2022 15:59:29 +0000 (17:59 +0200)
commit	ca05f37d6d77ebad800b252719b0ff03877fc968
tree	5e4d877d74f0d4c19bc6a8a95848ec917294844a
parent	17a659821355e6396f464e50a9b4048c0ea01ff7

src/main/kotlin/net/pterodactylus/sone/core/DefaultElementLoader.kt
src/test/kotlin/net/pterodactylus/sone/core/DefaultElementLoaderTest.kt
src/test/resources/net/pterodactylus/sone/core/element-loader5.html	[new file with mode: 0644]