
Problems Using Extended Escape Mode For Jsoup Output

I need to transform an HTML file by removing certain tags from it. To do this I have something like this - import org.jsoup.Jsoup; import org.jsoup.helper.Validate; import o

Solution 1:

What output encoding character set are you using? It defaults to the charset of the input, which, if you are loading from URLs, will vary from site to site.

You probably want to set it explicitly: either to UTF-8 or, if you are working with systems that cannot deal with UTF-8, to ASCII or some other restricted charset. If you set the escape mode to base (the default) and the charset to ASCII, then any character (like rsquo) that cannot be represented natively in the selected charset will be output as a numeric escape.

For example:

String check = "<p>&rsquo; <a href='../'>Check</a></p>";
Document doc = Jsoup.parse(check);
doc.outputSettings().escapeMode(Entities.EscapeMode.base); // default

doc.outputSettings().charset("UTF-8");
System.out.println("UTF-8: " + doc.body().html());

doc.outputSettings().charset("ASCII");
System.out.println("ASCII: " + doc.body().html());

Gives:

UTF-8: <p>’ <a href="../">Check</a></p>
ASCII: <p>&#8217; <a href="../">Check</a></p>
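
To make the rule above concrete, here is a rough sketch of the idea using only the Java standard library. It is not jsoup's actual code: the class EscapeSketch and the method escapeUnencodable are hypothetical names made up for this illustration, and it only handles the "cannot be encoded, so write a numeric escape" case, not named entities or HTML special characters.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class EscapeSketch {

    // Hypothetical helper (not part of jsoup): copies the text, replacing every
    // code point the target charset cannot encode with a decimal numeric escape.
    static String escapeUnencodable(String text, String charsetName) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            String ch = new String(Character.toChars(codePoint));
            if (encoder.canEncode(ch)) {
                out.append(ch);                                   // representable: keep as-is
            } else {
                out.append("&#").append(codePoint).append(';');  // e.g. &#8217;
            }
            i += Character.charCount(codePoint);                  // step over surrogate pairs
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "\u2019 Check"; // right single quotation mark (rsquo)
        System.out.println("UTF-8: " + escapeUnencodable(text, "UTF-8"));
        System.out.println("ASCII: " + escapeUnencodable(text, "US-ASCII"));
    }
}
```

With US-ASCII the quotation mark comes out as &#8217;, while with UTF-8 it is left untouched. The exact spelling of the escape in real jsoup output may differ by version (newer releases may write the hexadecimal form, such as &#x2019;), so treat the decimal form here as an assumption that matches the output shown above.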

Hope this helps!
