Usuń znaczniki HTML z ciągu znaków

Question

Usuń znaczniki HTML z ciągu znaków

Czy jest dobry sposób na usunięcie HTML z ciągu Java? Prosty regex jak

replaceAll("\\<.*?>", "")

Będzie działać, ale rzeczy takie jak & nie zostaną poprawnie przekonwertowane i nie-HTML między dwoma nawiasami kątowymi zostanie usunięty (tzn. .*? w wyrażeniu regularnym zniknie).

439

java html regex parsing

Author: D. Pardal, 2008-10-27

Source

30 answers

Jeśli piszesz dla Android możesz to zrobić...

android.text.HtmlCompat.fromHtml(instruction, HtmlCompat.FROM_HTML_MODE_LEGACY).toString()

282

Author: Ken Goodridge,
Warning: date(): Invalid date.timezone value 'Europe/Kyiv', we selected the timezone 'UTC' for now. in /var/www/agent_stack/data/www/doraprojects.net/template/agent.layouts/content.php on line 54
2021-01-21 13:32:40

Jeśli użytkownik wprowadzi hey!, Czy chcesz wyświetlić hey! lub hey!? Jeśli pierwszy, escape less-thans, i kodowanie HTML ampersands (i opcjonalnie cudzysłowy) i jesteś w porządku. Modyfikacja kodu w celu implementacji drugiej opcji to:

replaceAll("\\<[^>]*>","")

Ale napotkasz problemy, jeśli użytkownik wprowadzi coś zniekształconego, na przykład <bhey!.

Możesz również sprawdzić JTidy które parsuje "brudne" wejście html i powinno dać ci sposób na usunięcie tagów, zachowując tekst.

Problem z próbą usunięcia html jest taki, że przeglądarki mają bardzo pobłażliwe parsery, bardziej pobłażliwe niż jakakolwiek biblioteka, którą możesz znaleźć, więc nawet jeśli zrobisz wszystko, co w twojej mocy, aby usunąć wszystkie znaczniki (używając metody replace powyżej, biblioteki DOM lub JTidy), będziesz nadal musisz upewnić się, że kodujesz pozostałe znaki specjalne HTML, aby zachować bezpieczeństwo danych wyjściowych.

87

Author: Chris Marasti-Georg,
Warning: date(): Invalid date.timezone value 'Europe/Kyiv', we selected the timezone 'UTC' for now. in /var/www/agent_stack/data/www/doraprojects.net/template/agent.layouts/content.php on line 54
2016-09-20 14:00:24

Innym sposobem jest użycie javaxa.swing.tekst.html.HTMLEditorKit , aby wyodrębnić tekst.

import java.io.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;

public class Html2Text extends HTMLEditorKit.ParserCallback {
    StringBuffer s;

    public Html2Text() {
    }

    public void parse(Reader in) throws IOException {
        s = new StringBuffer();
        ParserDelegator delegator = new ParserDelegator();
        // the third parameter is TRUE to ignore charset directive
        delegator.parse(in, this, Boolean.TRUE);
    }

    public void handleText(char[] text, int pos) {
        s.append(text);
    }

    public String getText() {
        return s.toString();
    }

    public static void main(String[] args) {
        try {
            // the HTML to convert
            FileReader in = new FileReader("java-new.html");
            Html2Text parser = new Html2Text();
            parser.parse(in);
            in.close();
            System.out.println(parser.getText());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Ref: usuń znaczniki HTML z pliku, aby wyodrębnić tylko tekst

30

Author: RealHowTo,
Warning: date(): Invalid date.timezone value 'Europe/Kyiv', we selected the timezone 'UTC' for now. in /var/www/agent_stack/data/www/doraprojects.net/template/agent.layouts/content.php on line 54
2012-02-02 17:27:05

Myślę, że najprostszym sposobem filtrowania znaczników html jest:

private static final Pattern REMOVE_TAGS = Pattern.compile("<.+?>");

public static String removeTags(String string) {
    if (string == null || string.length() == 0) {
        return string;
    }

    Matcher m = REMOVE_TAGS.matcher(string);
    return m.replaceAll("");
}

24

Author: Serge,
Warning: date(): Invalid date.timezone value 'Europe/Kyiv', we selected the timezone 'UTC' for now. in /var/www/agent_stack/data/www/doraprojects.net/template/agent.layouts/content.php on line 54
2010-11-04 10:13:09

Również bardzo proste użycie Jericho , i można zachować niektóre formatowanie (na przykład podziały linii i linki).

    Source htmlSource = new Source(htmlText);
    Segment htmlSeg = new Segment(htmlSource, 0, htmlSource.length());
    Renderer htmlRend = new Renderer(htmlSeg);
    System.out.println(htmlRend.toString());

18

Author: Josh,
Warning: date(): Invalid date.timezone value 'Europe/Kyiv', we selected the timezone 'UTC' for now. in /var/www/agent_stack/data/www/doraprojects.net/template/agent.layouts/content.php on line 54
2014-06-30 16:13:46

Akceptowana odpowiedź robienia po prostu Jsoup.parse(html).text() ma 2 potencjalne problemy (z JSoup 1.7.3):

usuwa podziały linii z tekstu
zamienia tekst <script> na <script>

Jeśli używasz tego do ochrony przed XSS, jest to trochę denerwujące. Oto moja najlepsza szansa na ulepszone rozwiązanie, używając zarówno JSoup, jak i Apache StringEscapeUtils: {]}

// breaks multi-level of escaping, preventing &amp;lt;script&amp;gt; to be rendered as <script>
String replace = input.replace("&amp;", "");
// decode any encoded html, preventing &lt;script&gt; to be rendered as <script>
String html = StringEscapeUtils.unescapeHtml(replace);
// remove all html tags, but maintain line breaks
String clean = Jsoup.clean(html, "", Whitelist.none(), new Document.OutputSettings().prettyPrint(false));
// decode html again to convert character entities back into text
return StringEscapeUtils.unescapeHtml(clean);

Zauważ, że ostatnim krokiem jest użycie wyjścia jako zwykłego tekstu. Jeśli potrzebujesz tylko wyjścia HTML to powinien być w stanie go usunąć.

A oto kilka przypadków testowych (wejście na wyjście):

{"regular string", "regular string"},
{"<a href=\"link\">A link</a>", "A link"},
{"<script src=\"http://evil.url.com\"/>", ""},
{"&lt;script&gt;", ""},
{"&amp;lt;script&amp;gt;", "lt;scriptgt;"}, // best effort
{"\" ' > < \n \\ é å à ü and & preserved", "\" ' > < \n \\ é å à ü and & preserved"}

Jeśli znajdziesz sposób, aby to poprawić, proszę daj mi znać.

17

Author: Damien,
Warning: date(): Invalid date.timezone value 'Europe/Kyiv', we selected the timezone 'UTC' for now. in /var/www/agent_stack/data/www/doraprojects.net/template/agent.layouts/content.php on line 54
2016-01-08 09:59:45

Na Androidzie spróbuj tego:

String result = Html.fromHtml(html).toString();

16

Author: Ameen Maheen,
Warning: date(): Invalid date.timezone value 'Europe/Kyiv', we selected the timezone 'UTC' for now. in /var/www/agent_stack/data/www/doraprojects.net/template/agent.layouts/content.php on line 54
2015-06-25 18:23:43

Ucieczkowanie HTML jest naprawdę trudne do zrobienia dobrze - zdecydowanie sugerowałbym użycie kodu biblioteki do tego, ponieważ jest to o wiele bardziej subtelne niż myślisz. Sprawdź StringEscapeUtils Apache dla całkiem dobrej biblioteki do obsługi tego w Javie.

11

Author: Tim Howland,
Warning: date(): Invalid date.timezone value 'Europe/Kyiv', we selected the timezone 'UTC' for now. in /var/www/agent_stack/data/www/doraprojects.net/template/agent.layouts/content.php on line 54
2013-05-23 09:42:53

To powinno zadziałać -

Użyj tego

  text.replaceAll('<.*?>' , " ") -> This will replace all the html tags with a space.

I to

  text.replaceAll('&.*?;' , "")-> this will replace all the tags which starts with "&" and ends with ";" like &nbsp;, &amp;, &gt; etc.

8

Author: Sandeep1699,
Warning: date(): Invalid date.timezone value 'Europe/Kyiv', we selected the timezone 'UTC' for now. in /var/www/agent_stack/data/www/doraprojects.net/template/agent.layouts/content.php on line 54
2017-07-01 03:57:28

Możesz po prostu użyć domyślnego filtra HTML Androida

    public String htmlToStringFilter(String textToFilter){

    return Html.fromHtml(textToFilter).toString();

    }

Powyższa metoda zwróci przefiltrowany łańcuch HTML dla Twojego wejścia.

8

Author: Anuraganu Punalur,
Warning: date(): Invalid date.timezone value 'Europe/Kyiv', we selected the timezone 'UTC' for now. in /var/www/agent_stack/data/www/doraprojects.net/template/agent.layouts/content.php on line 54
2019-07-01 17:37:41

Możesz zastąpić znaczniki   i  nowymi liniami przed usunięciem HTML, aby zapobiec nieczytelnemu bałaganowi, jak sugeruje Tim.

Jedynym sposobem na usunięcie znaczników HTML, ale pozostawienie nie-HTML między nawiasami kątowymi byłoby sprawdzenie listy znaczników HTML. Coś w tym stylu...

replaceAll("\\<[\s]*tag[^>]*>","")

Następnie dekoduj znaki specjalne HTML, takie jak &. Wynik nie powinien być uważany za dezynfekowany.

6

Author: foxy,
Warning: date(): Invalid date.timezone value 'Europe/Kyiv', we selected the timezone 'UTC' for now. in /var/www/agent_stack/data/www/doraprojects.net/template/agent.layouts/content.php on line 54
2008-10-27 23:52:37

Alternatywnie można użyć HtmlCleaner :

private CharSequence removeHtmlFrom(String html) {
    return new HtmlCleaner().clean(html).getText();
}

5

Author: Stephan,
Warning: date(): Invalid date.timezone value 'Europe/Kyiv', we selected the timezone 'UTC' for now. in /var/www/agent_stack/data/www/doraprojects.net/template/agent.layouts/content.php on line 54
2014-02-17 20:19:48

Użycie Html.fromHtml

HTML Tagi to

<a href=”…”> <b>,  <big>, <blockquote>, <br>, <cite>, <dfn>
<div align=”…”>,  <em>, <font size=”…” color=”…” face=”…”>
<h1>,  <h2>, <h3>, <h4>,  <h5>, <h6>
<i>, <p>, <small>
<strike>,  <strong>, <sub>, <sup>, <tt>, <u>

Zgodnie z oficjalne dokumenty Androida wszystkie znaczniki w HTML będą wyświetlane jako ogólny zamiennik String, który twój program może następnie przejść i zastąpić rzeczywistym struny.

Html.formHtml metoda zajmuje Html.TagHandler i Html.ImageGetter jako argumenty, a także tekst do parse.

Przykład

String Str_Html=" <p>This is about me text that the user can put into their profile</p> ";

Then

Your_TextView_Obj.setText(Html.fromHtml(Str_Html).toString());

Wyjście

To jest o mnie tekst, który użytkownik może umieścić w swoim profilu

5

Author: IntelliJ Amiya,
Warning: date(): Invalid date.timezone value 'Europe/Kyiv', we selected the timezone 'UTC' for now. in /var/www/agent_stack/data/www/doraprojects.net/template/agent.layouts/content.php on line 54
2015-11-23 12:11:20

Zaakceptowana odpowiedź nie zadziałała dla mnie w przypadku testowym, który wskazałem: wynik "a c" to "a b lub b > c".

Więc zamiast tego użyłem TagSoup. Oto ujęcie, które zadziałało w moim przypadku testowym (i kilku innych):

import java.io.IOException;
import java.io.StringReader;
import java.util.logging.Logger;

import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

/**
 * Take HTML and give back the text part while dropping the HTML tags.
 *
 * There is some risk that using TagSoup means we'll permute non-HTML text.
 * However, it seems to work the best so far in test cases.
 *
 * @author dan
 * @see <a href="http://home.ccil.org/~cowan/XML/tagsoup/">TagSoup</a> 
 */
public class Html2Text2 implements ContentHandler {
private StringBuffer sb;

public Html2Text2() {
}

public void parse(String str) throws IOException, SAXException {
    XMLReader reader = new Parser();
    reader.setContentHandler(this);
    sb = new StringBuffer();
    reader.parse(new InputSource(new StringReader(str)));
}

public String getText() {
    return sb.toString();
}

@Override
public void characters(char[] ch, int start, int length)
    throws SAXException {
    for (int idx = 0; idx < length; idx++) {
    sb.append(ch[idx+start]);
    }
}

@Override
public void ignorableWhitespace(char[] ch, int start, int length)
    throws SAXException {
    sb.append(ch);
}

// The methods below do not contribute to the text
@Override
public void endDocument() throws SAXException {
}

@Override
public void endElement(String uri, String localName, String qName)
    throws SAXException {
}

@Override
public void endPrefixMapping(String prefix) throws SAXException {
}


@Override
public void processingInstruction(String target, String data)
    throws SAXException {
}

@Override
public void setDocumentLocator(Locator locator) {
}

@Override
public void skippedEntity(String name) throws SAXException {
}

@Override
public void startDocument() throws SAXException {
}

@Override
public void startElement(String uri, String localName, String qName,
    Attributes atts) throws SAXException {
}

@Override
public void startPrefixMapping(String prefix, String uri)
    throws SAXException {
}
}

4

Author: dfrankow,
Warning: date(): Invalid date.timezone value 'Europe/Kyiv', we selected the timezone 'UTC' for now. in /var/www/agent_stack/data/www/doraprojects.net/template/agent.layouts/content.php on line 54
2010-08-12 23:24:27

Wiem, że to stare, ale właśnie pracowałem nad projektem, który wymagał od mnie filtrowania HTML i to działało dobrze:

noHTMLString.replaceAll("\\&.*?\\;", "");

Zamiast tego:

html = html.replaceAll("&nbsp;","");
html = html.replaceAll("&amp;"."");

4

Author: rqualis,
Warning: date(): Invalid date.timezone value 'Europe/Kyiv', we selected the timezone 'UTC' for now. in /var/www/agent_stack/data/www/doraprojects.net/template/agent.layouts/content.php on line 54
2011-06-07 16:35:23

Oto nieco bardziej rozbudowana aktualizacja, aby spróbować poradzić sobie z formatowaniem przerw i list. Użyłem jej jako przewodnika.

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Stack;
import java.util.logging.Logger;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class HTML2Text extends HTMLEditorKit.ParserCallback {
    private static final Logger log = Logger
            .getLogger(Logger.GLOBAL_LOGGER_NAME);

    private StringBuffer stringBuffer;

    private Stack<IndexType> indentStack;

    public static class IndexType {
        public String type;
        public int counter; // used for ordered lists

        public IndexType(String type) {
            this.type = type;
            counter = 0;
        }
    }

    public HTML2Text() {
        stringBuffer = new StringBuffer();
        indentStack = new Stack<IndexType>();
    }

    public static String convert(String html) {
        HTML2Text parser = new HTML2Text();
        Reader in = new StringReader(html);
        try {
            // the HTML to convert
            parser.parse(in);
        } catch (Exception e) {
            log.severe(e.getMessage());
        } finally {
            try {
                in.close();
            } catch (IOException ioe) {
                // this should never happen
            }
        }
        return parser.getText();
    }

    public void parse(Reader in) throws IOException {
        ParserDelegator delegator = new ParserDelegator();
        // the third parameter is TRUE to ignore charset directive
        delegator.parse(in, this, Boolean.TRUE);
    }

    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        log.info("StartTag:" + t.toString());
        if (t.toString().equals("p")) {
            if (stringBuffer.length() > 0
                    && !stringBuffer.substring(stringBuffer.length() - 1)
                            .equals("\n")) {
                newLine();
            }
            newLine();
        } else if (t.toString().equals("ol")) {
            indentStack.push(new IndexType("ol"));
            newLine();
        } else if (t.toString().equals("ul")) {
            indentStack.push(new IndexType("ul"));
            newLine();
        } else if (t.toString().equals("li")) {
            IndexType parent = indentStack.peek();
            if (parent.type.equals("ol")) {
                String numberString = "" + (++parent.counter) + ".";
                stringBuffer.append(numberString);
                for (int i = 0; i < (4 - numberString.length()); i++) {
                    stringBuffer.append(" ");
                }
            } else {
                stringBuffer.append("*   ");
            }
            indentStack.push(new IndexType("li"));
        } else if (t.toString().equals("dl")) {
            newLine();
        } else if (t.toString().equals("dt")) {
            newLine();
        } else if (t.toString().equals("dd")) {
            indentStack.push(new IndexType("dd"));
            newLine();
        }
    }

    private void newLine() {
        stringBuffer.append("\n");
        for (int i = 0; i < indentStack.size(); i++) {
            stringBuffer.append("    ");
        }
    }

    public void handleEndTag(HTML.Tag t, int pos) {
        log.info("EndTag:" + t.toString());
        if (t.toString().equals("p")) {
            newLine();
        } else if (t.toString().equals("ol")) {
            indentStack.pop();
            ;
            newLine();
        } else if (t.toString().equals("ul")) {
            indentStack.pop();
            ;
            newLine();
        } else if (t.toString().equals("li")) {
            indentStack.pop();
            ;
            newLine();
        } else if (t.toString().equals("dd")) {
            indentStack.pop();
            ;
        }
    }

    public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        log.info("SimpleTag:" + t.toString());
        if (t.toString().equals("br")) {
            newLine();
        }
    }

    public void handleText(char[] text, int pos) {
        log.info("Text:" + new String(text));
        stringBuffer.append(text);
    }

    public String getText() {
        return stringBuffer.toString();
    }

    public static void main(String args[]) {
        String html = "<html><body><p>paragraph at start</p>hello<br />What is happening?<p>this is a<br />mutiline paragraph</p><ol>  <li>This</li>  <li>is</li>  <li>an</li>  <li>ordered</li>  <li>list    <p>with</p>    <ul>      <li>another</li>      <li>list        <dl>          <dt>This</dt>          <dt>is</dt>            <dd>sdasd</dd>            <dd>sdasda</dd>            <dd>asda              <p>aasdas</p>            </dd>            <dd>sdada</dd>          <dt>fsdfsdfsd</dt>        </dl>        <dl>          <dt>vbcvcvbcvb</dt>          <dt>cvbcvbc</dt>            <dd>vbcbcvbcvb</dd>          <dt>cvbcv</dt>          <dt></dt>        </dl>        <dl>          <dt></dt>        </dl></li>      <li>cool</li>    </ul>    <p>stuff</p>  </li>  <li>cool</li></ol><p></p></body></html>";
        System.out.println(convert(html));
    }
}

4

Author: Mike,
Warning: date(): Invalid date.timezone value 'Europe/Kyiv', we selected the timezone 'UTC' for now. in /var/www/agent_stack/data/www/doraprojects.net/template/agent.layouts/content.php on line 54
2012-02-02 17:26:46

Oto jeszcze jeden wariant zastępowania wszystkich (znaczniki HTML | encje HTML / puste miejsce w treści HTML)

content.replaceAll("(<.*?>)|(&.*?;)|([ ]{2,})", ""); gdzie treść jest ciągiem znaków.

4

Author: silentsudo,
Warning: date(): Invalid date.timezone value 'Europe/Kyiv', we selected the timezone 'UTC' for now. in /var/www/agent_stack/data/www/doraprojects.net/template/agent.layouts/content.php on line 54
2018-06-20 07:36:44

Jeszcze jednym sposobem może być użycie com.google.gdata.util.pospolite.html.Klasa HtmlToText jak

MyWriter.toConsole(HtmlToText.htmlToPlainText(htmlResponse));

Nie jest to jednak kod kuloodporny, a gdy uruchamiam go na wpisach Wikipedii, otrzymuję również informacje o stylu. Uważam jednak, że w przypadku małych / prostych zadań byłoby to skuteczne.

3

Author: rjha94,
Warning: date(): Invalid date.timezone value 'Europe/Kyiv', we selected the timezone 'UTC' for now. in /var/www/agent_stack/data/www/doraprojects.net/template/agent.layouts/content.php on line 54
2010-08-06 18:23:34

Wygląda na to, że chcesz przejść z HTML do zwykłego tekstu.
Jeśli tak jest spójrz na www.htmlparser.org. oto przykład, który usuwa wszystkie znaczniki z pliku html znalezionego pod adresem URL.
Korzysta z org.htmlparser.fasola.StringBean .

static public String getUrlContentsAsText(String url) {
    String content = "";
    StringBean stringBean = new StringBean();
    stringBean.setURL(url);
    content = stringBean.getStrings();
    return content;
}

3

Author: CSchulz,
Warning: date(): Invalid date.timezone value 'Europe/Kyiv', we selected the timezone 'UTC' for now. in /var/www/agent_stack/data/www/doraprojects.net/template/agent.layouts/content.php on line 54
2012-02-02 17:26:23

Oto inny sposób, aby to zrobić:

public static String removeHTML(String input) {
    int i = 0;
    String[] str = input.split("");

    String s = "";
    boolean inTag = false;

    for (i = input.indexOf("<"); i < input.indexOf(">"); i++) {
        inTag = true;
    }
    if (!inTag) {
        for (i = 0; i < str.length; i++) {
            s = s + str[i];
        }
    }
    return s;
}

2

Author: blackStar,
Warning: date(): Invalid date.timezone value 'Europe/Kyiv', we selected the timezone 'UTC' for now. in /var/www/agent_stack/data/www/doraprojects.net/template/agent.layouts/content.php on line 54
2012-02-02 17:28:51

Można również użyć Apache Tika do tego celu. Domyślnie zachowuje białe spacje z pozbawionego kodu html, co może być pożądane w pewnych sytuacjach:

InputStream htmlInputStream = ..
HtmlParser htmlParser = new HtmlParser();
HtmlContentHandler htmlContentHandler = new HtmlContentHandler();
htmlParser.parse(htmlInputStream, htmlContentHandler, new Metadata())
System.out.println(htmlContentHandler.getBodyText().trim())

2

Author: Maksim Sorokin,
Warning: date(): Invalid date.timezone value 'Europe/Kyiv', we selected the timezone 'UTC' for now. in /var/www/agent_stack/data/www/doraprojects.net/template/agent.layouts/content.php on line 54
2012-09-04 08:42:58

Jednym ze sposobów na zachowanie informacji o nowej linii w jsoup jest poprzedzenie wszystkich nowych znaczników linii jakimś atrap ciągiem, wykonanie JSoup i zastąpienie atrapa ciągiem "\n".

String html = "<p>Line one</p><p>Line two</p>Line three<br/>etc.";
String NEW_LINE_MARK = "NEWLINESTART1234567890NEWLINEEND";
for (String tag: new String[]{"</p>","<br/>","</h1>","</h2>","</h3>","</h4>","</h5>","</h6>","</li>"}) {
    html = html.replace(tag, NEW_LINE_MARK+tag);
}

String text = Jsoup.parse(html).text();

text = text.replace(NEW_LINE_MARK + " ", "\n\n");
text = text.replace(NEW_LINE_MARK, "\n\n");

1

Author: RobMen,
Warning: date(): Invalid date.timezone value 'Europe/Kyiv', we selected the timezone 'UTC' for now. in /var/www/agent_stack/data/www/doraprojects.net/template/agent.layouts/content.php on line 54
2015-09-04 20:53:40

classeString.replaceAll("\\<(/?[^\\>]+)\\>", "\\ ").replaceAll("\\s+", " ").trim()

1

Author: Guilherme Oliveira,
Warning: date(): Invalid date.timezone value 'Europe/Kyiv', we selected the timezone 'UTC' for now. in /var/www/agent_stack/data/www/doraprojects.net/template/agent.layouts/content.php on line 54
2019-07-01 17:44:01

Moje 5 centów:

String[] temp = yourString.split("&amp;");
String tmp = "";
if (temp.length > 1) {

    for (int i = 0; i < temp.length; i++) {
        tmp += temp[i] + "&";
    }
    yourString = tmp.substring(0, tmp.length() - 1);
}

0

Author: Alexander,
Warning: date(): Invalid date.timezone value 'Europe/Kyiv', we selected the timezone 'UTC' for now. in /var/www/agent_stack/data/www/doraprojects.net/template/agent.layouts/content.php on line 54
2012-02-02 17:27:00

Aby uzyskać formateed zwykły tekst html możesz to zrobić:

String BR_ESCAPED = "&lt;br/&gt;";
Element el=Jsoup.parse(html).select("body");
el.select("br").append(BR_ESCAPED);
el.select("p").append(BR_ESCAPED+BR_ESCAPED);
el.select("h1").append(BR_ESCAPED+BR_ESCAPED);
el.select("h2").append(BR_ESCAPED+BR_ESCAPED);
el.select("h3").append(BR_ESCAPED+BR_ESCAPED);
el.select("h4").append(BR_ESCAPED+BR_ESCAPED);
el.select("h5").append(BR_ESCAPED+BR_ESCAPED);
String nodeValue=el.text();
nodeValue=nodeValue.replaceAll(BR_ESCAPED, "<br/>");
nodeValue=nodeValue.replaceAll("(\\s*<br[^>]*>){3,}", "<br/><br/>");

To get formateed plain text change by \n and change last line by:

nodeValue=nodeValue.replaceAll("(\\s*\n){3,}", "<br/><br/>");

0

Author: surfealokesea,
Warning: date(): Invalid date.timezone value 'Europe/Kyiv', we selected the timezone 'UTC' for now. in /var/www/agent_stack/data/www/doraprojects.net/template/agent.layouts/content.php on line 54
2013-04-25 17:02:24

Wiem, że minęło trochę czasu od tego pytania, jak zostało zadane, ale znalazłem inne rozwiązanie, to jest to, co działało dla mnie:

Pattern REMOVE_TAGS = Pattern.compile("<.+?>");
    Source source= new Source(htmlAsString);
 Matcher m = REMOVE_TAGS.matcher(sourceStep.getTextExtractor().toString());
                        String clearedHtml= m.replaceAll("");

0

Author: Itay Sasson,
Warning: date(): Invalid date.timezone value 'Europe/Kyiv', we selected the timezone 'UTC' for now. in /var/www/agent_stack/data/www/doraprojects.net/template/agent.layouts/content.php on line 54
2020-05-25 11:14:50

Warto zauważyć, że jeśli próbujesz to osiągnąć w projekcie stos usług , jest to już wbudowane rozszerzenie łańcucha

using ServiceStack.Text;
// ...
"The <b>quick</b> brown <p> fox </p> jumps over the lazy dog".StripHtml();

0

Author: Jared Beach,
Warning: date(): Invalid date.timezone value 'Europe/Kyiv', we selected the timezone 'UTC' for now. in /var/www/agent_stack/data/www/doraprojects.net/template/agent.layouts/content.php on line 54
2020-07-15 17:53:29

Często stwierdzam, że muszę tylko usunąć komentarze i elementy skryptu. To działa niezawodnie dla mnie przez 15 lat i może być łatwo rozszerzony do obsługi dowolnej nazwy elementu w HTML lub XML:

// delete all comments
response = response.replaceAll("<!--[^>]*-->", "");
// delete all script elements
response = response.replaceAll("<(script|SCRIPT)[^+]*?>[^>]*?<(/script|SCRIPT)>", "");

0

Author: vallismortis,
Warning: date(): Invalid date.timezone value 'Europe/Kyiv', we selected the timezone 'UTC' for now. in /var/www/agent_stack/data/www/doraprojects.net/template/agent.layouts/content.php on line 54
2020-08-23 21:14:52

Czasami łańcuch html pochodzi z xml z takim &lt. Podczas używania Jsoup musimy go przeanalizować, a następnie wyczyścić.

Document doc = Jsoup.parse(htmlstrl);
Whitelist wl = Whitelist.none();
String plain = Jsoup.clean(doc.text(), wl);

Podczas gdy tylko Jsoup.parse(htmlstrl).text() nie można usunąć tagów.

0

Author: jiamo,
Warning: date(): Invalid date.timezone value 'Europe/Kyiv', we selected the timezone 'UTC' for now. in /var/www/agent_stack/data/www/doraprojects.net/template/agent.layouts/content.php on line 54
2020-09-03 09:03:15

score 598 · Accepted Answer

Użyj parsera HTML zamiast wyrażeń regularnych. To jest bardzo proste z Jsoup .

public static String html2text(String html) {
    return Jsoup.parse(html).text();
}

Jsoup obsługuje również usuwanie znaczników HTML z konfigurowalną białą listą, co jest bardzo przydatne, jeśli chcesz zezwolić tylko na np. ,  i .

Usuń znaczniki HTML z ciągu znaków

30 answers

Zobacz też:

Przykład