Monday, October 3, 2011

Using regular expressions to extract content - php extract texts from html content


Hello PHP Googler,

PHP provides a number of really neat regular expression functions. You can find the list of the regex functions at the PHP site.

But the one that I’ve had most fun with is the preg_match_all() function which I’ve been using to do content extraction from an HTML page.

I’m not going to explain what a regular expression (regex) is in this post. There are whole books on just this one topic alone; I would be crazy to think I can explain it all in just a few paragraphs. But in order to understand how to use the regex functions, you need to have a basic understanding of regular expressions.

If you think back to your childhood days, you may remember a toy where you match shaped blocks to the corresponding holes. Well, regular expressions are very much like that toy, except that you have to define your own ‘shape’ (or pattern, as it’s known) and apply your content to it. Any text that matches the pattern will ‘fall’ through it.

Let’s say you have a block of text like the one below and you want to extract all the links from it. You can use preg_match_all() to do just that.

 
$content = "He's goin' everywhere, <a href=\"http://www.bjmckay.com\">B.J. McKay</a> and his best friend Bear. Rollin' down to <a href=\"http://www.dallas.net\">Dallas</a>, who's providin' my palace, off to New Orleans or who knows where.";
The pattern you want to look for is the link anchor pattern, like <a href="(something)">(something)</a>.
The actual regular expression might look something like this:
// (.*?) is lazy, so each link is matched on its own even when both sit on one line
$regex_pattern = "/<a href=\"(.*?)\">(.*?)<\/a>/";
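One detail worth checking in a pattern like this is greediness: a plain (.*) grabs as much text as it can, while (.*?) stops at the first point where the rest of the pattern can match. The snippet below is only a rough sketch of that difference; $sample is a shortened, made-up stand-in for the text above, and the variable names are mine.

```php
<?php
// $sample is a shortened, hypothetical stand-in for the sample text.
$sample = '<a href="http://www.bjmckay.com">B.J. McKay</a> and <a href="http://www.dallas.net">Dallas</a>';

// Greedy: (.*) takes as much as it can, so both links collapse into one match.
$greedy_count = preg_match_all('/<a href="(.*)">(.*)<\/a>/', $sample, $greedy);

// Lazy: (.*?) stops at the first place the rest of the pattern can match.
$lazy_count = preg_match_all('/<a href="(.*?)">(.*?)<\/a>/', $sample, $lazy);

echo "greedy: $greedy_count match(es)\n";
echo "lazy:   $lazy_count match(es)\n";
print_r($lazy[1]); // the captured URLs from the lazy pattern
```

When several links share a single line, the lazy form is usually the one you want.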


Once you have your pattern, you pass $regex_pattern and $content to preg_match_all() like this:
preg_match_all($regex_pattern, $content, $matches);
print_r($matches);
preg_match_all() will store all the matches in the array $matches, so if you output the array, you’ll see something like this.

Array
(
    [0] => Array
        (
            [0] => <a href="http://www.bjmckay.com">B.J. McKay</a>
            [1] => <a href="http://www.dallas.net">Dallas</a>
        )

    [1] => Array
        (
            [0] => http://www.bjmckay.com
            [1] => http://www.dallas.net
        )

    [2] => Array
        (
            [0] => B.J. McKay
            [1] => Dallas
        )

)



From this array, $matches, you should be able to loop through and get the information you need.
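As a rough sketch of that loop, the snippet below walks the URL group and the link-text group side by side. It assumes a shortened copy of the sample text and a lazy (.*?) version of the link pattern so each link matches separately; $links is a name picked just for this example.

```php
<?php
// Shortened copy of the sample text, with the link pattern in its lazy form.
$content = "He's goin' everywhere, <a href=\"http://www.bjmckay.com\">B.J. McKay</a> and his best friend Bear. Rollin' down to <a href=\"http://www.dallas.net\">Dallas</a>.";
$regex_pattern = "/<a href=\"(.*?)\">(.*?)<\/a>/";

preg_match_all($regex_pattern, $content, $matches);

// $matches[1] holds the URLs and $matches[2] the link texts, at matching indexes.
$links = array(); // hypothetical name for the collected results
foreach ($matches[1] as $i => $url) {
    $links[] = array('url' => $url, 'text' => $matches[2][$i]);
    echo $matches[2][$i] . ' => ' . $url . "\n";
}
```

If you would rather get one array per link, passing PREG_SET_ORDER as the fourth argument to preg_match_all() groups the results by match instead of by capture group.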

I hope this has been useful to you. I know it doesn’t cover all the things this function can do, but for first-timers, it should be a simple look at a very powerful PHP function.

Incidentally, PHP also provides the function preg_match(). The difference is that preg_match() stops after the first match of the pattern, whereas preg_match_all() tries to find all matching instances within the content.
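To see the difference side by side, here is a small sketch that runs both functions with the same pattern. $sample is a shortened, made-up version of the text used earlier, and the other variable names are only for illustration.

```php
<?php
// $sample is a shortened, hypothetical stand-in for the sample text.
$sample = '<a href="http://www.bjmckay.com">B.J. McKay</a> and <a href="http://www.dallas.net">Dallas</a>';
$pattern = '/<a href="(.*?)">(.*?)<\/a>/';

// preg_match() stops after the first match; $first is a flat array.
$found_one = preg_match($pattern, $sample, $first);

// preg_match_all() keeps searching; $all is an array of arrays, one per group.
$found_all = preg_match_all($pattern, $sample, $all);

echo "preg_match found $found_one: {$first[1]}\n";
echo "preg_match_all found $found_all: " . implode(', ', $all[1]) . "\n";
```

Note the shape of the results: $first[1] is a string, while $all[1] is a list of strings.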

Contact:
bhavinrana07[@]gmail.com


About

Professional & Experienced Freelance Developer From India, Technologist, Software Engineer, internet marketer and Open Sources Developer with experience in Finance, Telecoms and the Media. Contact Me for freelancing projects.
