How to Parse HTML DOM with PHP

Submitted by nanodano on Mon, 07/27/2015 - 23:52

PHP Simple HTML DOM is a one-file library that lets you traverse an HTML document and search for specific elements using CSS-style selectors. The examples below show how to use this library. To learn how to crawl (or spider) websites to collect many pages to process, see the post How to Crawl Web Pages with PHP.

<?php
// Download simple_html_dom.php first from http://simplehtmldom.sourceforge.net/
require_once('simple_html_dom.php');

// Get the contents of the HTML document using cURL, a crawling
// framework, or the provided file_get_html() function.
$html = file_get_html('http://www.devdungeon.com/');
if ($html === false) {
    // file_get_html() returns false when the page cannot be loaded
    exit("Could not load the page\n");
}

// Find all images and print each src attribute
foreach ($html->find('img') as $element) {
    echo $element->src . '<br>';
}

// Find all links and print each href attribute
foreach ($html->find('a') as $element) {
    echo $element->href . '<br>';
}

// Translate the whole document to plain text (reuses the document loaded above)
echo $html->plaintext;
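
If the markup is already in a string (for example, a response body fetched with cURL or handed over by a crawler), the library's str_get_html() function parses it without making another request. The sketch below is only an illustration: the inline $markup string is made up and stands in for a downloaded page, and it assumes simple_html_dom.php sits next to the script.

```php
<?php
// Assumes simple_html_dom.php is in the same directory as this script.
require_once('simple_html_dom.php');

// Made-up markup standing in for a page body fetched with cURL or a crawler.
$markup = '<ul>'
        . '<li><a href="/a">First</a></li>'
        . '<li><a href="/b">Second</a></li>'
        . '</ul>';

// str_get_html() parses a string instead of fetching a URL.
$html = str_get_html($markup);
if ($html === false) {
    exit("Could not parse the markup\n");
}

// Map each link's href attribute to its visible text.
$links = [];
foreach ($html->find('a') as $a) {
    $links[$a->href] = trim($a->plaintext);
}

print_r($links);
```

Parsing from a string keeps downloading and parsing separate, so the same parsing code can be run against saved pages while you work out the selectors.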

Scrape DevDungeon.com

The following example scrapes the DevDungeon.com archive page and prints all of the post titles. The script depends on the page's current markup, so it may break if the page changes. YMMV.

<?php
// This snippet prints all of the post titles in the DevDungeon.com archive.
// Get simple_html_dom.php from http://simplehtmldom.sourceforge.net/
require_once('simple_html_dom.php');

$html = file_get_html('http://devdungeon.com/archive');
if ($html === false) {
    exit("Could not load the archive page\n");
}

// Each post title is a link inside the .view-blog-archive container
foreach ($html->find('.view-blog-archive a') as $archiveLink) {
    echo $archiveLink->plaintext . "\n";
}
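
Because a scraper like this breaks silently when the page layout changes, it can help to notice when a selector stops matching anything. The following is a rough sketch of one way to do that, not a definitive approach: extractTitles() is a hypothetical helper (not part of the library), and the inline markup is invented to stand in for the archive page, whose real structure may differ.

```php
<?php
// Assumes simple_html_dom.php is in the same directory as this script.
require_once('simple_html_dom.php');

// Hypothetical helper: collect the non-empty text of every element
// matching $selector in an already parsed document.
function extractTitles($html, string $selector): array
{
    $titles = [];
    foreach ($html->find($selector) as $link) {
        $title = trim($link->plaintext);
        if ($title !== '') {
            $titles[] = $title;
        }
    }
    return $titles;
}

// Invented markup standing in for the archive page.
$markup = '<div class="view-blog-archive">'
        . '<a href="/one">Post One</a>'
        . '<a href="/two">Post Two</a>'
        . '</div>';

$html = str_get_html($markup);
$titles = ($html === false) ? [] : extractTitles($html, '.view-blog-archive a');

if (count($titles) === 0) {
    // An empty result usually means the selector no longer matches the page.
    echo "No titles found; the page layout may have changed.\n";
} else {
    echo implode("\n", $titles) . "\n";
}
```

Swapping str_get_html($markup) for file_get_html() with the archive URL would run the same check against the live page.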