
Capture Content Inside Html Tags With Regex

First off, I'm aware this is a bad practice and I have answered many questions even saying so, but to clarify I am forced to use regex because this application stores regexes in a

Solution 1:

It sounds like you need to enable the "dot all" (s) flag. This will make . match all characters including line breaks. For example:

preg_match('/<div\s*class="intro-content">(.*)<\/div>/s', $html, $matches);
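To see what the flag changes, here is a minimal sketch that runs the same pattern with and without s. The sample markup and variable names are made up for illustration.

```php
<?php
// Hypothetical sample input; the markup is made up for illustration.
$html = "<div class=\"intro-content\">\n  <p>Hello</p>\n</div>";

$pattern = '<div\s*class="intro-content">(.*)<\/div>';

// Without the s flag, . stops at the first line break, so nothing matches.
$withoutFlag = preg_match('/' . $pattern . '/', $html, $m1);

// With the s flag, . also matches \n and the capture spans all lines.
$withFlag = preg_match('/' . $pattern . '/s', $html, $m2);

var_dump($withoutFlag); // int(0)
var_dump($withFlag);    // int(1)
var_dump($m2[1]);       // everything between the opening and closing tag
```

The third argument to preg_match receives the captures, so $m2[1] holds the text matched by the parenthesised group.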

Solution 2:

You should not use regexes to parse HTML like this. div tags can be nested, and since a regex has no notion of nesting depth, it cannot reliably tell which closing tag belongs to which opening tag. Use an HTML parser instead. For example:

$doc = new DOMDocument();
$doc->loadHTML($html);
// DOMDocument has no getElementsByClassName(); select the divs by tag name.
foreach ($doc->getElementsByTagName("div") as $div) {
  var_dump($div);
}

See: DOMDocument
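If you only want the div with a particular class, one option is DOMXPath. This is a rough sketch, not the only way to do it; the sample markup and variable names are assumptions for illustration.

```php
<?php
// Hypothetical sample input for illustration.
$html = '<div class="intro-content"><div class="inner">Hi</div> there</div>';

$doc = new DOMDocument();
// Collect libxml warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Match div elements whose class list contains the token intro-content.
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " intro-content ")]';

$contents = [];
foreach ($xpath->query($query) as $div) {
    // Serialize the child nodes to get the inner HTML, nested divs included.
    $inner = '';
    foreach ($div->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    $contents[] = $inner;
}

var_dump($contents);
```

Because the parser tracks nesting, the inner div stays inside the captured content instead of cutting the match short.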

Edit:

And then I saw your note:

I am forced to use regex because this application stores regexes in a database and only functions this way. I absolutely cannot change the functionality

Well. At least make sure that you match non-greedily. That way it will match correctly as long as there are no nested tags:

preg_match('/<div\s*class="intro-content">(.*?)<\/div>/s', $html);
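A quick sketch of the difference, and of the nested-tag limitation. The sample strings are made up for illustration.

```php
<?php
// Hypothetical sample input: two sibling divs.
$html = '<div class="intro-content">first</div><div class="other">second</div>';

preg_match('/<div\s*class="intro-content">(.*)<\/div>/s', $html, $greedy);
preg_match('/<div\s*class="intro-content">(.*?)<\/div>/s', $html, $lazy);

var_dump($greedy[1]); // runs on to the last </div> in the input
var_dump($lazy[1]);   // stops at the first </div>

// Hypothetical nested input: the lazy match ends at the inner closing tag.
$nested = '<div class="intro-content">a<div>b</div>c</div>';
preg_match('/<div\s*class="intro-content">(.*?)<\/div>/s', $nested, $cut);

var_dump($cut[1]); // the trailing "c" is lost
```

With nested divs the lazy version returns too little and the greedy version can return too much, which is why neither is a real substitute for a parser.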

Solution 3:

This obviously doesn't work because the . character will not match space characters.

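It does match spaces and tabs; what it skips by default is line breaks. If that's the problem, we can just add them in. Note that a . inside [ and ] is a literal dot, so the whitespace has to sit in an alternation next to the dot rather than in the same character class:

<div\s*class="intro-content">((?:.|[ \t\r\n])*)</div>

You then need to make it lazy, so it captures everything up to the first </div> and not the last. We do this by adding a question mark:

<div\s*class="intro-content">((?:.|[ \t\r\n])*?)</div>

There. Give that a shot. You might be able to replace the space characters (\t\r\n) between [ and ] with a single \s too.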
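If the stored pattern cannot carry flags, a class such as [\s\S] is another way to match any character, line breaks included. Below is a hedged sketch with made-up sample markup; the ~ delimiter is just a choice that avoids escaping the slash in </div>. It also shows that a dot placed inside [ ] is treated as a literal dot.

```php
<?php
// Hypothetical sample input for illustration.
$html = "<div class=\"intro-content\">\n  line one\n  line two\n</div><div>tail</div>";

// [\s\S] matches any character, line breaks included, with no flag needed.
$anyChar = '~<div\s*class="intro-content">([\s\S]*?)</div>~';
preg_match($anyChar, $html, $m);
var_dump($m[1]);

// Inside a character class a dot is literal, so this class only accepts
// whitespace and actual "." characters, and fails on ordinary text.
$literalDot = '~<div\s*class="intro-content">([ \t\r\n.]*?)</div>~';
$found = preg_match($literalDot, $html);
var_dump($found); // int(0)
```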
