Strip HTML from text in JavaScript

Question

Is there an easy way to take a string of HTML in JavaScript and strip out the HTML?

507

javascript html string

Author: Gideon, 2009-05-05

30 answers

myString.replace(/<(?:.|\n)*?>/gm, '');

433

Author: nickf, 2011-11-16 10:12:35

The simplest way:

jQuery(html).text();

This retrieves all the text from a string of HTML.

224

Author: Mark, 2012-08-24 18:18:28

As an extension of the jQuery method: if your string might not contain HTML (for example, if you are trying to remove HTML from a form field),

jQuery(html).text();

will return an empty string when there is no HTML.

Use

jQuery('<p>' + html + '</p>').text();

instead.

Update: as pointed out in the comments, in some circumstances this solution will execute JavaScript contained in html. If the value of html could be influenced by an attacker, use a different solution.

49

Author: user999305, 2017-06-18 13:06:14

I would like to share an edited version of Shog9's approved answer.

As Mike Samuel pointed out in a comment, that function can execute inline JavaScript code.
But Shog9 is right when saying "let the browser do it for you..."

So here is my edited version, using DOMParser:

function strip(html){
   var doc = new DOMParser().parseFromString(html, 'text/html');
   return doc.body.textContent || "";
}

Here is the code to test the inline JavaScript:

strip("<img onerror='alert(\"could run arbitrary JS here\")' src=bogus>")

Also, it does not request resources (such as images) on parse:

strip("Just text <img src='https://assets.rbl.ms/4155638/980x.jpg'>")

45

Author: Saba, 2017-12-06 10:15:15

Converting HTML for plain-text email while keeping hyperlinks (a href) intact

The function posted above by hypoxide works fine, but I was after something that would basically convert HTML created in a web rich-text editor (for example FCKeditor) and clear out all the HTML while leaving all the links, because I wanted both the HTML and the plain-text version to help create the correct parts of an SMTP email (both HTML and plain text).

After a long time searching Google, my colleagues and I came up with this using the regex engine in JavaScript:

str='this string has <i>html</i> code i want to <b>remove</b><br>Link Number 1 -><a href="http://www.bbc.co.uk">BBC</a> Link Number 1<br><p>Now back to normal text and stuff</p>';
str=str.replace(/<br>/gi, "\n");
str=str.replace(/<p.*>/gi, "\n");
str=str.replace(/<a.*href="(.*?)".*>(.*?)<\/a>/gi, " $2 (Link->$1) ");
str=str.replace(/<(?:.|\s)*?>/g, "");

The str variable starts out like this:

this string has <i>html</i> code i want to <b>remove</b><br>Link Number 1 -><a href="http://www.bbc.co.uk">BBC</a> Link Number 1<br><p>Now back to normal text and stuff</p>

And then after the code has run it looks like this:

this string has html code i want to remove
Link Number 1 -> BBC (Link->http://www.bbc.co.uk)  Link Number 1


Now back to normal text and stuff

As you can see, all the HTML has been removed and the link has been preserved, with the hyperlinked text still intact. I also changed the <p> and <br> tags to \n (newline character) so that some sort of visual formatting is retained.

To change the link format (e.g. BBC (Link->http://www.bbc.co.uk)) just edit the $2 (Link->$1), where $1 is the href URL/URI and $2 is the hyperlinked text. With the links directly in the body of the plain text, most SMTP mail clients convert them so the user can click on them.

Hope you find this useful.
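
One thing to watch in patterns like `<p.*>` is that the greedy `.*` runs to the last `>` on the line, so the paragraph text can disappear together with the tag. A minimal sketch of the same idea with tighter patterns follows; the function name `htmlToPlainText` and the exact regexes are assumptions of this example rather than part of the original answer, and regexes remain a heuristic, not a real HTML parser.

```javascript
// Sketch: convert simple HTML to plain text while keeping link targets.
// htmlToPlainText is a hypothetical helper name used only for illustration.
function htmlToPlainText(html) {
  return html
    // <br>, <br/> and <br /> become line breaks
    .replace(/<br\s*\/?>/gi, "\n")
    // an opening <p ...> tag becomes a line break; [^>]* stops at the tag's own ">"
    .replace(/<p\b[^>]*>/gi, "\n")
    // <a href="url">text</a> becomes " text (Link->url) "
    .replace(/<a\b[^>]*\bhref="([^"]*)"[^>]*>(.*?)<\/a>/gi, " $2 (Link->$1) ")
    // drop every remaining tag
    .replace(/<[^>]+>/g, "");
}

const sample = 'this string has <i>html</i> code i want to <b>remove</b><br>' +
  'Link Number 1 -><a href="http://www.bbc.co.uk">BBC</a> Link Number 1<br>' +
  '<p>Now back to normal text and stuff</p>';

console.log(htmlToPlainText(sample));
```

With this version the text inside the `<p>` element survives, because `[^>]*` cannot run past the end of the opening tag.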

35

Author: Jibberboy2000, 2015-06-18 14:21:56

An improvement to the accepted answer.

function strip(html)
{
   var tmp = document.implementation.createHTMLDocument("New").body;
   tmp.innerHTML = html;
   return tmp.textContent || tmp.innerText || "";
}

This way, something like this will do no harm:

strip("<img onerror='alert(\"could run arbitrary JS here\")' src=bogus>")

Firefox, Chromium and Explorer 9+ are safe. Opera Presto is still vulnerable. Also, images mentioned in the strings are not downloaded in Chromium and Firefox, which saves HTTP requests.

26

Author: Janghou, 2018-09-19 15:26:03

This should work in any JavaScript environment (including NodeJS).

text.replace(/<[^>]+>/g, '');
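
As a small usage sketch of that one-liner (the wrapper name `stripTags` is an assumption of this example): it needs no DOM, which is why it also runs in Node.js, but it only removes tag-shaped substrings. It does not decode entities, and a `>` inside an attribute value ends the match early.

```javascript
// Sketch: tag removal with a single regex, no DOM required.
// stripTags is a hypothetical wrapper name.
function stripTags(text) {
  return String(text).replace(/<[^>]+>/g, "");
}

console.log(stripTags("<p>Hello <b>world</b>!</p>")); // Hello world!

// Limitation: the ">" inside the attribute value ends the match early,
// so part of the tag leaks into the output.
console.log(stripTags('<img alt="a>b">text')); // b">text
```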

16

Author: Karl.S, 2017-01-20 05:49:54

I modified Jibberboy2000's answer to handle several <BR /> tag formats, remove everything inside <SCRIPT> and <STYLE> tags, tidy the resulting text by collapsing multiple line breaks and spaces, and convert some HTML-encoded characters into normal ones. After some testing it appears that you can convert most full web pages into simple text in which the page title and content are retained.

In the simple example,

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<!--comment-->

<head>

<title>This is my title</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<style>

    body {margin-top: 15px;}
    a { color: #D80C1F; font-weight:bold; text-decoration:none; }

</style>
</head>

<body>
    <center>
        This string has <i>html</i> code i want to <b>remove</b><br>
        In this line <a href="http://www.bbc.co.uk">BBC</a> with link is mentioned.<br/>Now back to &quot;normal text&quot; and stuff using &lt;html encoding&gt;                 
    </center>
</body>
</html>

becomes

This is my title

This string has html code i want to remove

In this line BBC (http://www.bbc.co.uk) with link is mentioned.

Now back to "normal text" and stuff using <html encoding>

The JavaScript function and test page look like this:

function convertHtmlToText() {
    var inputText = document.getElementById("input").value;
    var returnText = "" + inputText;

    //-- remove BR tags and replace them with line break
    returnText=returnText.replace(/<br>/gi, "\n");
    returnText=returnText.replace(/<br\s\/>/gi, "\n");
    returnText=returnText.replace(/<br\/>/gi, "\n");

    //-- remove P and A tags but preserve what's inside of them
    returnText=returnText.replace(/<p.*>/gi, "\n");
    returnText=returnText.replace(/<a.*href="(.*?)".*>(.*?)<\/a>/gi, " $2 ($1)");

    //-- remove all inside SCRIPT and STYLE tags
    returnText=returnText.replace(/<script.*>[\w\W]{1,}(.*?)[\w\W]{1,}<\/script>/gi, "");
    returnText=returnText.replace(/<style.*>[\w\W]{1,}(.*?)[\w\W]{1,}<\/style>/gi, "");
    //-- remove all else
    returnText=returnText.replace(/<(?:.|\s)*?>/g, "");

    //-- get rid of more than 2 multiple line breaks:
    returnText=returnText.replace(/(?:(?:\r\n|\r|\n)\s*){2,}/gim, "\n\n");

    //-- get rid of more than 2 spaces:
    returnText = returnText.replace(/ +(?= )/g,'');

    //-- get rid of html-encoded characters:
    returnText=returnText.replace(/&nbsp;/gi," ");
    returnText=returnText.replace(/&amp;/gi,"&");
    returnText=returnText.replace(/&quot;/gi,'"');
    returnText=returnText.replace(/&lt;/gi,'<');
    returnText=returnText.replace(/&gt;/gi,'>');

    //-- return
    document.getElementById("output").value = returnText;
}

It was used with this HTML:

<textarea id="input" style="width: 400px; height: 300px;"></textarea><br />
<button onclick="convertHtmlToText()">CONVERT</button><br />
<textarea id="output" style="width: 400px; height: 300px;"></textarea><br />
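
One detail of the entity step in convertHtmlToText is worth a closer look: because `&amp;` is replaced before `&lt;` and `&gt;`, a double-encoded input such as `&amp;lt;` ends up decoded twice. A possible sketch that decodes each entity exactly once, in a single pass, is shown here; `decodeBasicEntities` and its lookup table are names made up for this example, and they cover only the five entities handled above.

```javascript
// Sketch: decode a handful of named entities in one pass.
const BASIC_ENTITIES = { nbsp: " ", amp: "&", quot: '"', lt: "<", gt: ">" };

function decodeBasicEntities(text) {
  // One regex pass: text produced by a replacement is never scanned again,
  // so "&amp;lt;" becomes "&lt;" rather than "<".
  return text.replace(/&(nbsp|amp|quot|lt|gt);/gi, function (match, name) {
    return BASIC_ENTITIES[name.toLowerCase()];
  });
}

console.log(decodeBasicEntities("&quot;normal text&quot; &lt;html encoding&gt;"));
// "normal text" <html encoding>
console.log(decodeBasicEntities("&amp;lt;b&amp;gt;"));
// &lt;b&gt;
```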

15

Author: Elendurwen, 2017-05-23 11:54:58

Another, admittedly less elegant solution than nickf's or Shog9's, would be to recursively walk the DOM starting at the <body> tag and append each text node.

var bodyContent = document.getElementsByTagName('body')[0];
var result = appendTextNodes(bodyContent);

function appendTextNodes(element) {
    var text = '';

    // Loop through the childNodes of the passed in element
    for (var i = 0, len = element.childNodes.length; i < len; i++) {
        // Get a reference to the current child
        var node = element.childNodes[i];
        // Append the node's value if it's a text node
        if (node.nodeType == 3) {
            text += node.nodeValue;
        }
        // Recurse through the node's children, if there are any
        if (node.childNodes.length > 0) {
            text += appendTextNodes(node);
        }
    }
    // Return the final result
    return text;
}
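
The key point in the recursion is that each call returns its own text, so the caller has to add that return value to its accumulator. To make this easy to try outside a browser, the sketch below runs the same walk over plain objects that mimic the `nodeType`, `nodeValue` and `childNodes` properties of DOM nodes; the `el` and `txt` helpers and the function name `collectText` are inventions of this example.

```javascript
// Sketch: depth-first text collection over a DOM-like tree.
const TEXT_NODE = 3;    // same value as Node.TEXT_NODE in browsers
const ELEMENT_NODE = 1; // same value as Node.ELEMENT_NODE

// Hypothetical helpers that build plain objects shaped like DOM nodes.
function el(...children) {
  return { nodeType: ELEMENT_NODE, nodeValue: null, childNodes: children };
}
function txt(value) {
  return { nodeType: TEXT_NODE, nodeValue: value, childNodes: [] };
}

function collectText(node) {
  let text = "";
  for (const child of node.childNodes) {
    if (child.nodeType === TEXT_NODE) {
      text += child.nodeValue;
    } else {
      // Accumulate what the recursive call returns.
      text += collectText(child);
    }
  }
  return text;
}

// Mimics <body><p>Hello <b>world</b>!</p></body>
const body = el(el(txt("Hello "), el(txt("world")), txt("!")));
console.log(collectText(body)); // Hello world!
```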

7

Author: Bryan, 2009-05-04 23:14:30

If you want to keep the links and the structure of the content (h1, h2, etc.), you should check out TextVersionJS. You can use it with any HTML, although it was created to convert HTML emails to plain text.

Usage is very simple. For example, in node.js:

var createTextVersion = require("textversionjs");
var yourHtml = "<h1>Your HTML</h1><ul><li>goes</li><li>here.</li></ul>";

var textVersion = createTextVersion(yourHtml);

Or in the browser with plain JS:

<script src="textversion.js"></script>
<script>
  var yourHtml = "<h1>Your HTML</h1><ul><li>goes</li><li>here.</li></ul>";
  var textVersion = createTextVersion(yourHtml);
</script>

It also works with require.js:

define(["textversionjs"], function(createTextVersion) {
  var yourHtml = "<h1>Your HTML</h1><ul><li>goes</li><li>here.</li></ul>";
  var textVersion = createTextVersion(yourHtml);
});

5

Author: gyula.nemeth, 2016-08-04 07:38:10

After trying all of the answers mentioned, most if not all of them had edge cases and could not completely cover my needs.

I started exploring how PHP does it and came across the php.js library, which replicates the strip_tags method here: http://phpjs.org/functions/strip_tags/

4

Author: Deminetix, 2015-06-11 22:06:11

function stripHTML(my_string){
    var charArr   = my_string.split(''),
        resultArr = [],
        htmlZone  = 0,
        quoteZone = 0;
    for( var x=0; x < charArr.length; x++ ){
     switch( charArr[x] + htmlZone + quoteZone ){
       case "<00" : htmlZone  = 1;break;
       case ">10" : htmlZone  = 0;resultArr.push(' ');break;
       case '"10' : quoteZone = 1;break;
       case "'10" : quoteZone = 2;break;
       case '"11' : 
       case "'12" : quoteZone = 0;break;
       default    : if(!htmlZone){ resultArr.push(charArr[x]); }
     }
    }
    return resultArr.join('');
}

Accounts for > inside attributes and <img onerror="javascript"> in newly created DOM elements.

Usage:

clean_string = stripHTML("string with <html> in it")

Demo:

https://jsfiddle.net/gaby_de_wilde/pqayphzd/

Demo of the top answer doing terrible things:

https://jsfiddle.net/gaby_de_wilde/6f0jymL6/1/

4

Author: user40521, 2016-03-27 07:29:37

A lot of people have answered this already, but I thought it might be useful to share the function I wrote, which strips HTML tags from a string but lets you pass an array of tags that you do not want stripped. It is pretty short and has been working nicely for me.

function removeTags(string, array){
  return array ? string.split("<").filter(function(val){ return f(array, val); }).map(function(val){ return f(array, val); }).join("") : string.split("<").map(function(d){ return d.split(">").pop(); }).join("");
  function f(array, value){
    return array.map(function(d){ return value.includes(d + ">"); }).indexOf(true) != -1 ? "<" + value : value.split(">")[1];
  }
}

var x = "<span><i>Hello</i> <b>world</b>!</span>";
console.log(removeTags(x)); // Hello world!
console.log(removeTags(x, ["span", "i"])); // <span><i>Hello</i> world!</span>

4

Author: Harry Stevens, 2017-01-27 06:55:53

I made some modifications to the original Jibberboy2000 script. Hope it'll be useful for someone.

str = '**ANY HTML CONTENT HERE**';

str=str.replace(/<\s*br\/*>/gi, "\n");
str=str.replace(/<\s*a.*href="(.*?)".*>(.*?)<\/a>/gi, " $2 (Link->$1) ");
str=str.replace(/<\s*\/*.+?>/ig, "\n");
str=str.replace(/ {2,}/gi, " ");
str=str.replace(/\n+\s*/gi, "\n\n");

3

Author: Jaxolotl, 2011-10-04 14:02:41

Here is a version that sort of addresses @MikeSamuel's security concern:

function strip(html)
{
   try {
       var doc = document.implementation.createDocument('http://www.w3.org/1999/xhtml', 'html', null);
       doc.documentElement.innerHTML = html;
       return doc.documentElement.textContent||doc.documentElement.innerText;
   } catch(e) {
       return "";
   }
}

Note that it will return an empty string if the HTML markup is not valid XML (that is, tags must be closed and attributes must be quoted). This is not ideal, but it does avoid the security exploit potential.

If requiring valid XML markup is not acceptable for you, you could try using:

var doc = document.implementation.createHTMLDocument("");

But that is not a perfect solution either, for other reasons.

3

Author: Jeremy Johnstone, 2012-07-12 21:10:24

I think the simplest way is to use regular expressions, as someone mentioned above, although there is no reason to use several of them. Try:

stringWithHTML = stringWithHTML.replace(/<\/?[a-z][a-z0-9]*[^<>]*>/ig, "");

2

Author: Byron Carasco, 2011-01-10 05:40:34

With jQuery you can simply retrieve it using

$('#elementID').text()

2

Author: ianaz, 2012-09-03 15:03:35

The code below lets you keep some HTML tags while stripping all others:

function strip_tags(input, allowed) {

  allowed = (((allowed || '') + '')
    .toLowerCase()
    .match(/<[a-z][a-z0-9]*>/g) || [])
    .join(''); // making sure the allowed arg is a string containing only tags in lowercase (<a><b><c>)

  var tags = /<\/?([a-z][a-z0-9]*)\b[^>]*>/gi,
      commentsAndPhpTags = /<!--[\s\S]*?-->|<\?(?:php)?[\s\S]*?\?>/gi;

  return input.replace(commentsAndPhpTags, '')
      .replace(tags, function($0, $1) {
          return allowed.indexOf('<' + $1.toLowerCase() + '>') > -1 ? $0 : '';
      });
}
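
A short usage sketch, assuming the function above; it is repeated here in condensed form only so that the snippet runs on its own, and the sample string is made up for illustration.

```javascript
// Condensed copy of the strip_tags function from the answer above.
function strip_tags(input, allowed) {
  // Normalise the allowed list to a lowercase string such as "<b><i>".
  allowed = (((allowed || "") + "").toLowerCase().match(/<[a-z][a-z0-9]*>/g) || []).join("");
  var tags = /<\/?([a-z][a-z0-9]*)\b[^>]*>/gi,
      commentsAndPhpTags = /<!--[\s\S]*?-->|<\?(?:php)?[\s\S]*?\?>/gi;
  return input.replace(commentsAndPhpTags, "").replace(tags, function ($0, $1) {
    return allowed.indexOf("<" + $1.toLowerCase() + ">") > -1 ? $0 : "";
  });
}

var html = "<p>Hello <b>world</b> <i>again</i><!-- note --></p>";

console.log(strip_tags(html));           // Hello world again
console.log(strip_tags(html, "<b><i>")); // Hello <b>world</b> <i>again</i>
```

Note that an allowed tag is returned unchanged, attributes included, so this is a formatting helper rather than a sanitizer for untrusted input.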

2

Author: aWebDeveloper, 2015-07-14 12:56:53

It is also possible to use the fantastic htmlparser2, a pure JS HTML parser. Here is a working demo:

var htmlparser = require('htmlparser2');

var body = '<p><div>This is </div>a <span>simple </span> <img src="test"></img>example.</p>';

var result = [];

var parser = new htmlparser.Parser({
    ontext: function(text){
        result.push(text);
    }
}, {decodeEntities: true});

parser.write(body);
parser.end();

result.join('');

The output will be This is a simple example.

See it in action here: https://tonicdev.com/jfahrenkrug/extract-text-from-html

This works in both Node and the browser if you pack your web application using a tool like webpack.

2

Author: Johannes Fahrenkrug, 2015-12-29 19:11:59

I just needed to strip out the <a> tags and replace them with the text of the link.

This seems to work great.

htmlContent= htmlContent.replace(/<a.*href="(.*?)">/g, '');
htmlContent= htmlContent.replace(/<\/a>/g, '');
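
The greedy `.*` in the first pattern can span several links when they sit on the same line, which drops the text between them. A hedged alternative that unwraps each link in one step is sketched below; `unwrapLinks` is a hypothetical name, and like the other regex approaches it assumes reasonably well-formed markup with each link on a single line.

```javascript
// Sketch: replace each <a ...>text</a> with just its text.
function unwrapLinks(html) {
  // [^>]* keeps the match inside one opening tag; (.*?) is the link text.
  return html.replace(/<a\b[^>]*>(.*?)<\/a>/gi, "$1");
}

console.log(unwrapLinks('<a href="/one">One</a> and <a href="/two" class="x">Two</a>'));
// One and Two
```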

2

Author: FrigginGlorious, 2016-01-06 18:57:29

var text = html.replace(/<\/?("[^"]*"|'[^']*'|[^>])*(>|$)/g, "");

This is a regex version that is more resilient to malformed HTML, such as:

Unclosed tags

Some text <img

"<", ">" inside tag attributes

Some text <img alt="x > y">

Newlines

Some <a href="http://google.com">

Code

var html = '<br>This <img alt="a>b" \r\n src="a_b.gif" />is > \nmy<>< > <a>"text"</a'
var text = html.replace(/<\/?("[^"]*"|'[^']*'|[^>])*(>|$)/g, "");
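
To see what the pattern does with the cases listed above, here is a small sketch; the helper name `stripTagsLoose` and the sample inputs are assumptions of this example.

```javascript
// Sketch: the regex from this answer wrapped in a helper.
function stripTagsLoose(html) {
  // Quoted attribute values are consumed as a unit, so a ">" inside quotes
  // does not end the tag; "(>|$)" also accepts a tag left open at the end.
  return html.replace(/<\/?("[^"]*"|'[^']*'|[^>])*(>|$)/g, "");
}

// JSON.stringify makes trailing spaces visible in the console.
console.log(JSON.stringify(stripTagsLoose("Some text <img")));
// "Some text "
console.log(JSON.stringify(stripTagsLoose('Some text <img alt="x > y">')));
// "Some text "
console.log(JSON.stringify(stripTagsLoose('Some <a\nhref="http://google.com">link</a>')));
// "Some link"
```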

2

Author: hegemon, 2018-07-06 10:39:57

I created a working regular expression myself:

str=str.replace(/(<\?[a-z]*(\s[^>]*)?\?(>|$)|<!\[[a-z]*\[|\]\]>|<!DOCTYPE[^>]*?(>|$)|<!--[\s\S]*?(-->|$)|<[a-z?!\/]([a-z0-9_:.])*(\s[^>]*)?(>|$))/gi, '');

1

Author: MarekJ47, 2012-11-09 16:06:12

A simple two-line jQuery snippet to strip the HTML.

 var content = "<p>checking the html source&nbsp;</p><p>&nbsp;\n" +
  "  </p><p>with&nbsp;</p><p>all</p><p>the html&nbsp;</p><p>content</p>";

 var text = $(content).text();//It gets you the plain text
 console.log(text);//check the data in your console

 $("#text_area_id").val(text);//set your content to text area using text_area_id

1

Author: Developer, 2013-07-05 09:18:26

The accepted answer works fine mostly; however, in IE, if the html string is null you get "null" (instead of ""). Fixed:

function strip(html)
{
   if (html == null) return "";
   var tmp = document.createElement("DIV");
   tmp.innerHTML = html;
   return tmp.textContent || tmp.innerText || "";
}

1

Author: basarat, 2016-05-27 00:12:48

Using jQuery:

function stripTags(textToEscape) {
    return $('<p></p>').html(textToEscape).text();
}

1

Author: math2001, 2016-12-09 08:41:42

The input element supports only one-line text:

The text state represents a one-line plain text edit control for the element's value.

function stripHtml(str) {
  var tmp = document.createElement('input');
  tmp.value = str;
  return tmp.value;
}

Update: this works as expected

function stripHtml(str) {
  // Remove some tags
  str = str.replace(/<[^>]+>/gim, '');

  // Remove BB code
  str = str.replace(/\[(\w+)[^\]]*](.*?)\[\/\1]/g, '$2 ');

  // Remove html and line breaks
  const div = document.createElement('div');
  div.innerHTML = str;

  const input = document.createElement('input');
  input.value = div.textContent || div.innerText || '';

  return input.value;
}

1

Author: Mike, 2017-10-27 02:13:08

You can safely strip HTML tags using the iframe sandbox attribute.

The idea is that instead of trying to regex our string, we take advantage of the browser's native parser by injecting the text into a DOM element and then querying the textContent/innerText property of that element.

The best-suited element for injecting our text is a sandboxed iframe, because that way we can prevent any arbitrary code execution (also known as XSS).

The downside of this approach is that it only works in browsers.

Here is what I came up with (not battle-tested):

const stripHtmlTags = (() => {
  const sandbox = document.createElement("iframe");
  sandbox.sandbox = "allow-same-origin"; // <--- This is the key
  sandbox.style.setProperty("display", "none", "important");

  // Inject the sandbox in the current document
  document.body.appendChild(sandbox);

  // Get the sandbox's context
  const sanboxContext = sandbox.contentWindow.document;

  return (untrustedString) => {
    if (typeof untrustedString !== "string") return ""; 

    // Write the untrusted string in the iframe's body
    sanboxContext.open();
    sanboxContext.write(untrustedString);
    sanboxContext.close();

    // Get the string without html
    return sanboxContext.body.textContent || sanboxContext.body.innerText || "";
  };
})();

Usage (demo):

console.log(stripHtmlTags(`<img onerror='alert("could run arbitrary JS here")' src='bogus'>XSS injection :)`));
console.log(stripHtmlTags(`<script>alert("awdawd");</` + `script>Script tag injection :)`));
console.log(stripHtmlTags(`<strong>I am bold text</strong>`));
console.log(stripHtmlTags(`<html>I'm a HTML tag</html>`));
console.log(stripHtmlTags(`<body>I'm a body tag</body>`));
console.log(stripHtmlTags(`<head>I'm a head tag</head>`));
console.log(stripHtmlTags(null));
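
Since DOM-based approaches exist only in browsers, code that must also run elsewhere needs a fallback. One possible shape is sketched below: use `DOMParser` when the global exists, otherwise fall back to a plain regex. The function name `stripHtml` and the choice of fallback are assumptions of this example, and the fallback path neither decodes entities nor copes with malformed markup.

```javascript
// Sketch: pick a stripping strategy based on the runtime.
function stripHtml(untrusted) {
  if (typeof untrusted !== "string") return "";

  if (typeof DOMParser !== "undefined") {
    // Browser: scripts in documents created by DOMParser are not executed.
    const doc = new DOMParser().parseFromString(untrusted, "text/html");
    return doc.body.textContent || "";
  }

  // No DOM available (for example Node.js): remove tag-shaped substrings only.
  return untrusted.replace(/<[^>]+>/g, "");
}

console.log(stripHtml("<strong>I am bold text</strong>")); // I am bold text
console.log(stripHtml(null)); // prints an empty line
```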

1

Author: Etienne Martin, 2018-04-04 18:20:29

    (function($){
        $.html2text = function(html) {
            if($('#scratch_pad').length === 0) {
                $('<div id="scratch_pad"></div>').appendTo('body');
            }
            return $('#scratch_pad').html(html).text();
        };

    })(jQuery);

Define this as a jQuery plugin and use it as follows:

$.html2text(htmlContent);

0

Author: Shiv Shankar, 2012-03-16 06:25:57

This will also work for escaped characters, using pattern matching:

myString.replace(/(?:&lt;|<)(?:.|\n)*?(?:&gt;|>)/gm, '');

0

Author: Abhishek Dhanraj Shahdeo, 2016-11-16 06:00:59

score 633 · Accepted Answer

If you are running in a browser, the easiest way is just to let the browser do it for you...

function strip(html)
{
   var tmp = document.createElement("DIV");
   tmp.innerHTML = html;
   return tmp.textContent || tmp.innerText || "";
}

Note: as people have pointed out in the comments, this is best avoided if you do not control the source of the HTML (for example, do not run this on anything that could have come from user input). For those scenarios you can still let the browser do the work for you: see Saba's answer on using the now widely available DOMParser.