03:16:37 [@jgmac1106] ↩️ I spent day hacking on Blogger adding microformats so it can play with #IndieWeb tools. In terms of syndication POSSE copies with IFTT best I can do for now, but still blogger with webmentions and a social reader is awesome… https://quickthoughts.jgregorymcverry.com/2019/02/12/alexstubenbort-i-spent-day-hacking-on (https://twitter.com/_/status/1095159685009891328)
01:08:02 thanks sknebel and zegnat!
01:08:02 cjwillcock: sknebel left you a message 1 day, 5 hours ago: congrats on your parser progress! It'd be great if you could put a testing page like https://php.microformats.io up, helps with manual testing and sharing results?
01:08:02 cjwillcock: sknebel left you a message 16 hours, 53 minutes ago: FYI https://bugzilla.gnome.org/show_bug.cgi?id=769760
01:09:04 sknebel that's disappointing about libxml, I'll need to think about that some more; and I'll put a test page on my list
01:09:18 nice!
01:10:20 generally, good HTML parsers have been somewhat of an issue, since HTML5 allows a bunch of stuff older parsers don't get, and e.g. HTML minimizer tools tend to exploit all that's allowed
01:12:45 I was able to get around libxml not understanding the tags from 5 by adding the recover and noerror flags - but I wasn't expecting it to have that open exploit, unfixed for +2 years :/
01:14:59 I somehow remembered seeing the maintainer of something else rant about that a few weeks back
01:15:22 ah, found the thread: https://twitter.com/tenderlove/status/1088888141958733824
01:15:22 [@tenderlove] FYI, if you use upstream libxml2 you're subject to multiple CVEs https://bugzilla.gnome.org/show_bug.cgi?id=769760
01:16:15 Is PHP using libxml2 for its parsing too? I know we had issues with DOMDocument parsing and moved to a userland PHP implementation
01:16:26 zegnat: yep
01:16:57 Then there are definitely limitations, cjwillcock.
In fact, we already know those limitations to be out there in the wild because that made us look for a userland one…
01:19:09 cjwillcock: this would be an interesting test: https://github.com/microformats/php-mf2/blob/14e8c5e9c0f2725a99528eb0f20bfc418c2c1a2c/tests/Mf2/ParserTest.php#L791-L803
01:19:40 I think we wrote that one based on a live example someone had. Sadly we didn’t include an issue number there so would have to go looking
01:22:44 * Zegnat idly wonders how hard it would be to wrap something like https://github.com/servo/html5ever in a PHP ext
01:22:46 [servo] html5ever: High-performance browser-grade HTML5 parser
01:23:32 I was looking at wrapping up one of: http://xerces.apache.org/xerces-c/ or https://pugixml.org/
01:24:07 I haven't finished going through my options for which one to use
01:24:14 (maybe some other)
01:25:21 HTML really isn’t XML, so I would always be a little sceptical of those XML parsers.
01:25:43 I know someone was working on bringing Google’s Gumbo parser as a PHP ext, but both the ext and the parser itself have gone stale :(
01:29:22 zegnat: thanks for the pointer to the test (failing)
01:31:36 No problem! We’ve been here before. If we can get someone (you? :P) to provide a nice modern HTML5 parser to PHP, we will all rejoice, haha
01:32:28 lol
01:32:42 well, step one, make me want one is done
01:36:56 oh, that test doesn't fail because of html5 tags - that one is because of bad html
01:37:09 I'm inclined to leave that as an exercise for userland code
01:38:59 ?
01:39:58 What is bad HTML?
01:40:09 unclosed <p>
01:40:13 No, that is 100% valid
01:40:19 oh
01:40:20 yes!
01:40:23 you ARE right
01:40:34 Closing tags are optional for P elements. It is one of the things an XML-based browser will get wrong.
01:40:45 thanks for that
01:40:57 so libxml is out lol
01:41:15 Older XML based parsers try to get it right by finding the location they need to close the P (before block elements) but they do not know […] is a block element because it is an HTML5 element
01:41:16 and I think the one in HTML knew that, but didn't know that […] would force the close to happen
01:41:22 Yep
01:41:24 *one in PHP
01:41:59 […]Something[…]Something else[…] worked, IIRC. But as soon as HTML5 comes in it is game over
01:53:53 exactly right. Running the html through the tidy extension first resolves it. So I can either internally use the tidy extension - or strip out libxml and replace with a good html5 parser
01:54:21 however, that use of tidy may not work in the case described in the CVE (I'll check it out)
01:55:17 Tidy may work too
02:00:06 <[kevinmarks]> An advantage of working in node or go is that they have actual html5 parsers, not xml hacks?
02:00:21 node's a little slower
02:00:35 go wins
02:01:14 Does node really have an actual html5 parser available?
02:02:40 Or are you referring to a userland implementation as well? I found https://github.com/inikulin/parse5
02:02:42 [inikulin] parse5: HTML parsing/serialization toolset for Node.js. WHATWG HTML Living Standard (aka HTML5)-compliant.
02:03:12 what does "have actual html5 parsers" mean? afaik Go is the only one where it is part of the official language project
02:03:59 That’s what I meant. Official language part, or otherwise available as some sort of official extension/plugin. Like how Node offers file system functionality on top of the V8/ECMAScript that powers it
02:04:03 but the ones we use in php-mf2, mf2py and I think microformats-node are html5 parsers
02:04:51 php-mf2 will use the official/native/default DOMDocument parser of the language, unless you provide a userland implementation (which we recommend, because the official isn’t HTML5 safe)
02:04:53 <[kevinmarks]> Well, mf2py uses beautiful soup which can use html5lib.
02:05:15 exactly
02:05:25 <[kevinmarks]> But can also use the very bad default python parser if you're not careful.
02:07:45 I am guessing the Python default is also just xmllib? :P
02:07:51 That seems to be the case for most places
02:09:12 <[kevinmarks]> No, https://docs.python.org/3/library/html.parser.html
02:35:24 Oh interesting
08:29:22 <[tantek]> Kevinmarks, 15 years ago last night (!!!)
https://twitter.com/t/status/433494367601717248
08:29:22 [@t] Ten years ago to the hour, @KevinMarks and I introduced #microformats at an #ETech BoF session: http://tantek.com/presentations/2004etech/realworldsemanticspres.html (ttk.me t4UY2)
09:00:01 jekyll postcss
09:00:48 wrong console. my mistake
09:22:12 [[invisible-data-considered-harmful]] http://microformats.org/wiki/index.php?title=invisible-data-considered-harmful&diff=66982&oldid=65063&rcid=103869 * Tantek * (+143) geourl archive link for map visualization of common lat long errors (enough to show up in data aggregations)
09:26:29 [[invisible-data-considered-harmful]] http://microformats.org/wiki/index.php?title=invisible-data-considered-harmful&diff=66983&oldid=66982&rcid=103870 * Tantek * (+453) another geourl errors citation, via kevinmarks
09:26:55 [[invisible-data-considered-harmful]] M http://microformats.org/wiki/index.php?title=invisible-data-considered-harmful&diff=66984&oldid=66983&rcid=103871 * Tantek * (+0) /* invisible metadata failures */ -cr
11:17:37 [@jgmac1106] You know...I really like the minimalist features of Blogger and Classic Theme...
11:17:37 always said I just want a blank HTML box with all the plumbing..in a way I am getting this vibe
11:17:37  My stylesheet, my ideas, my HTML  and now with microformats all my metadat… http://bit.ly/2SqCiB0 (https://twitter.com/_/status/1095461920650547206)
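
The 01:40 to 01:41 exchange about optional P end tags can be illustrated with a rough sketch. It assumes Python's standard `html.parser` as the tokenizer and uses two abridged, hypothetical element lists (`HTML4_BLOCKS`, `HTML5_BLOCKS`). The `section` element in the sample is only a stand-in, since the element named in the chat did not survive in the log. This is not how libxml, DOMDocument, or php-mf2 work internally, and the real HTML5 tree-construction rules are more involved. It only shows why a parser whose block-element list predates HTML5 leaves the P open.

```python
from html.parser import HTMLParser

# Abridged, illustrative lists; not the full sets from any specification.
HTML4_BLOCKS = {"div", "p", "ul", "ol", "table", "blockquote", "h1", "h2", "pre"}
HTML5_BLOCKS = HTML4_BLOCKS | {"section", "article", "nav", "aside", "header", "footer"}


class OutlineBuilder(HTMLParser):
    """Record (depth, tag) pairs, closing an open <p> before known blocks."""

    def __init__(self, p_closers):
        super().__init__()
        self.p_closers = p_closers  # start tags that imply </p>
        self.stack = []             # currently open elements
        self.outline = []           # (depth, tag) in document order

    def handle_starttag(self, tag, attrs):
        # Optional end tag rule: a known block element implies </p>.
        if tag in self.p_closers and "p" in self.stack:
            while self.stack.pop() != "p":
                pass
        self.outline.append((len(self.stack), tag))
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Close everything up to and including the matching open element.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def outline(html, p_closers):
    builder = OutlineBuilder(p_closers)
    builder.feed(html)
    builder.close()
    return builder.outline


# The <p> is never explicitly closed, which is valid HTML.
SAMPLE = "<div><p>Something<section>Something else</section></div>"

print(outline(SAMPLE, HTML4_BLOCKS))  # section ends up nested inside p
print(outline(SAMPLE, HTML5_BLOCKS))  # section becomes a sibling of p
```

With the older list, `section` is recorded one level deeper than `p`. With the HTML5 list, it sits at the same depth, which is the tree a browser would build.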