This “walking” behavior can be visualized even better by adding `declare -p stack` to every loop iteration:
Figure 5 – stack in action
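For reference, a reduced sketch of what such a loop can look like – the input and the tag-splitting step here are my own stand-ins, not the article’s actual parser:

```shell
# Stand-in sketch: walk a flat list of tag names, pushing on open and
# popping on close, and dump the stack after every iteration.
xml='<a><b><c></c><d></d></b></a>'
stack=()
while read -r tag; do
    if [[ $tag == /* ]]; then
        # Closing tag: pop the top element (bash 3.2-compatible form).
        unset "stack[$(( ${#stack[@]} - 1 ))]"
    else
        stack+=("$tag")        # Opening tag: push.
    fi
    declare -p stack           # Print the stack's current contents.
done < <(grep -oE '<[^>]*>' <<<"$xml" | tr -d '<>')
```

Each `declare -p` line shows the stack growing as tags open and shrinking as they close – the “walking” from Figure 5.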
Due to the single-pass nature of our parser (which combines tokenization and a few other steps into one), I had to remove some repetition. Furthermore, this parser is for demonstration purposes only and cannot parse arbitrary XML. Real-world XML has a lot of special objects, self-closing tags, and other gotchas that have to be accounted for, even during a simple text extraction.
How your brain reads XML
Now that you have the gist of how an algorithm for parsing XML may work (and hopefully understand that writing a parser is a lot of pain), let’s step back and consider how we, creatures of protein and flesh, parse XML. To make things harder, let’s look at the raw, true form of XML – no pretty-printing allowed.
Figure 6 – example from before, compacted
To an untrained eye, this doesn’t look like a tree.
Figure 7 – the same structure, with whitespace arranged to form an x-mas tree
Ah, much better! This is semantically equivalent to all the snippets I’ve attached before, but you have to think really hard to picture that a > b > (c, d). To me, this snippet is first and foremost a string.
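If you’d rather not squint, here is a minimal re-indenter sketch in plain bash (the input string is a stand-in) that makes the nesting visible again:

```shell
# Stand-in sketch: tokenize compact XML into tags and text, then print
# each token indented by the current nesting depth.
xml='<a><b><c>hi</c><d></d></b></a>'
depth=0
while read -r tok; do
    case $tok in
        '</'*) depth=$((depth - 1))                        # Dedent before printing.
               printf '%*s%s\n' $((depth * 2)) '' "$tok" ;;
        '<'*)  printf '%*s%s\n' $((depth * 2)) '' "$tok"
               depth=$((depth + 1)) ;;                     # Indent children.
        *)     printf '%*s%s\n' $((depth * 2)) '' "$tok" ;; # Text node.
    esac
done < <(grep -oE '<[^>]*>|[^<>]+' <<<"$xml")
```

On balanced input, `depth` returns to zero at the end – the same invariant the stack-based walk relies on.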
String parsing
Approaching XML – or any other structured data format – as a string is like dumpster-diving for parts. I don’t mean this in a bad way; both regex and dumpster diving have netted me some great finds. But they also give me the urge to shower immediately afterwards.
To continue the analogy, you can’t inquire why something got thrown out (as in, why given data is present and why it is formatted the way it is). This information is lost. You can make educated guesses if you stare at it long enough, but you can’t know for sure. Worse yet, if your data changes (as may happen with XML returned by an API), the whole tree may get laid out in a slightly different way, rendering your meticulously crafted parser useless. For this – and many other reasons – it’s best to parse XML with a real parser.
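In bash, “a real parser” usually means shelling out to one. A minimal sketch, assuming `xmllint` (shipped with libxml2) is installed, using a made-up document:

```shell
xml='<a><b><c>hello</c><d>world</d></b></a>'
# Ask an actual XML parser for the text content of <c>,
# instead of pattern-matching the raw string:
xmllint --xpath 'string(//c)' - <<<"$xml"
```

If the input stops being well-formed, `xmllint` fails loudly instead of guessing – which is precisely the behavior you want from XML.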
I’ll explore actual string parsing techniques later in this post. Before that, we have an elephant in the room to address…
HTML: XML but quirky
Pedantry Corner
Some might argue that both HTML and XML were derived from SGML, not from each other, so this section title doesn’t make sense.
In response, I’d argue that while XML inspires fear in CS majors and hackers alike, virtually nobody knows about SGML. HTML is quirky XML.
HTML is the main language used for presentation online. The web lives and breathes HTML. You can make webapps without WebAssembly, without ECMAScript, or even without CSS. But you absolutely need[2] HTML (… or XHTML – hold that thought).
#2
Before publication, Lisa argued that you technically can make pages without HTML:
SVG, Java Applets, Flash, PDF
One could dismiss the last three options, as they’re external technologies that aren’t part of any Web spec. However, SVG is much tougher to ignore. It’s a W3C Recommendation, which makes it at least adjacent. It also specifies the tag, so technically SVG could be used “without HTML” to create a webpage. I remain sceptical.
A few thousand bytes ago, I touched on how extremely strict XML is about its layout. HTML is the exact opposite, allowing for unclosed tags and broken grammar. An XML parser would have a heart attack if asked to parse HTML found online.
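For instance, every browser will build a sensible DOM from a made-up fragment like this, while a conforming XML parser rejects it at the first unclosed tag:

```html
<ul>
  <li>Unclosed list items are fine in HTML
  <li>So is <b>an unterminated inline element
</ul>
<p>And a paragraph that never ends
```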
Parsing HTML is near-impossible
Well-formed HTML is fine. However, browsers are designed to make educated guesses instead of failing outright when the markup doesn’t conform. This was a compromise made for accessibility. Today’s devtools make debugging easy, but in the early 90s? There was virtually no tooling for this. Having parsers accept slightly mangled input no doubt improved adoption when HTML was all new.
Sadly, this means that HTML is already two layers removed from XML. Quirks mode is largely based on how things got implemented by IE and Netscape 30 years ago. Standards compliance mode somewhat improves the situation, but it will still accept missing closing tags or quotes.
That being said, virtually all of those situations are defined by the standard, and contemporary browsers implement it extremely closely. Why is it “near-impossible” then? The HTML living standard dwarfs the XML spec, weighing in at over 1500 pages! …Okay, perhaps that’s a bit unfair – at the time of writing, only 114 of those pages actually deal with parsing (thanks for checking, Linus!). Regardless, that’s still over twice the length of the XML standard, and most of that growth goes into defining edge cases! Unless you’re using an actual browser, chances are that your parser will build a slightly different DOM tree on pages that aren’t well-formed.
HTML4.01? Ridiculous! We need to develop a better alternative that suits everyone’s needs
Situation: there are two sibling standards.
XHTML is… a weird creature. It was first introduced in late 1998 and refined into a standard that was adopted as a W3C recommendation in January 2000. Unfortunately, it wasn’t widely adopted (unlike later HTML5)…
The attempt to get the world to switch to XML, including quotes around attribute values and slashes in empty tags and namespaces all at once didn’t work. The large HTML-generating public did not move, largely because the browsers didn’t complain. Some large communities did shift and are enjoying the fruits of well-formed systems, but not all.
  ~ Tim Berners-Lee, 2005
I’m only mentioning XHTML here because, technically, we’ve had a strict, well-defined HTML alternative for almost three decades now, despite not many people knowing about it. Heck, XHTML5 exists too! You can use it right now! It’s really cool! (famfo keeps telling me about it, so it has to be true.)
Finally: actually parsing HTML with regex
The following section is entirely a product of my attempts to scrape various webpages over the years. I’m aware of how badly the practice of scraping is viewed in some circles, and I’d like to assure the reader that the bots I’ve built in the past have always made requests slowly and used extensive caching. GenAI scrapers constantly DoSing the internet can go to hell.
Benefits
Haruhi says…
“Bet you didn’t expect them to talk about benefits after they spent so long rambling about how hard it is to parse HTML. Ha!”
- Development speed
- Adaptability
Modern websites often have hundreds, if not thousands of nested elements. Writing a selector for something really deep down can take a while, especially if additional constraints are present (randomized class names? the developer only knowing about div-s?).
Writing a regex takes me 30 seconds. But hacking up a good selector and debugging why it doesn’t work on the next request? Tens of minutes of cursing.
Selectors are strict. They either give you a result or fail. This is great when you trust the other side of the system to send you good, accurate markup. HOWEVER, this is not something you can expect when scraping. For instance:
(...)
This leaves us with the following payload:
scroll0" class="scrollable">
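A hedged sketch of what digging into such a payload can look like in bash – the pattern and variable names here are mine, purely for illustration:

```shell
# The markup is mangled, but a regex only needs a recognizable shape:
payload='scroll0" class="scrollable">'
re='^([a-z0-9]+)" class="([a-z]+)">'
if [[ $payload =~ $re ]]; then
    echo "id leftover: ${BASH_REMATCH[1]}, class: ${BASH_REMATCH[2]}"
fi
```

A CSS selector has nothing to grab onto here – the element’s opening bracket is gone – but `=~` happily matches the fragment and hands the pieces back in `BASH_REMATCH`.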
