It's neat to see this getting attention. I've used similar techniques in production RAG systems that query over large collections of HTML docs. In our case the primary motivator was token efficiency (i.e., representing the same semantic content with fewer tokens).
I've found that LLMs are often bad at understanding "standard" markdown tables (of which there are many different types, see: https://pandoc.org/chunkedhtml-demo/8.9-tables.html). In our case, we got the best results by keeping the HTML tables as HTML and converting only the non-table parts of the HTML doc to markdown. We also strip all attributes from the table tags, with the exception of colspan and rowspan, which are semantically meaningful for more complicated HTML tables. I'd be curious whether there are LLM performance differences between the approach the author uses here (seems like it's based on repeating column names for each cell?) and just preserving the original HTML table structure.
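For anyone who wants to try the attribute-stripping step, here is a minimal sketch using only Python's standard library. It is an illustration under a few assumptions, not the exact production implementation: the names `TABLE_TAGS`, `KEEP_ATTRS`, `TableAttrStripper`, and `strip_table_attrs` are made up for this example, the set of tags treated as "table tags" is my own guess at a sensible default, and comments and doctypes are simply dropped. It does not do the markdown conversion of the non-table parts.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: which tags count as "table tags" is a choice, adjust as needed.
TABLE_TAGS = {"table", "caption", "colgroup", "col",
              "thead", "tbody", "tfoot", "tr", "th", "td"}
# The only attributes kept on table tags.
KEEP_ATTRS = {"colspan", "rowspan"}


class TableAttrStripper(HTMLParser):
    """Re-emit HTML, dropping table-tag attributes except colspan/rowspan."""

    def __init__(self):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _render(self, tag, attrs, closing=">"):
        if tag in TABLE_TAGS:
            attrs = [(k, v) for k, v in attrs if k in KEEP_ATTRS]
        parts = []
        for key, value in attrs:
            if value is None:  # bare attribute, e.g. <input disabled>
                parts.append(f" {key}")
            else:  # the parser unescapes values, so escape them again
                parts.append(f' {key}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(parts)}{closing}")

    def handle_starttag(self, tag, attrs):
        self._render(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._render(tag, attrs, closing=" />")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def strip_table_attrs(html_text):
    """Return html_text with table tags reduced to colspan/rowspan only."""
    parser = TableAttrStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    sample = ('<table class="grid" style="width:100%">'
              '<tr><th colspan="2" align="center">Q1</th></tr>'
              '<tr><td id="a">1</td><td rowspan="2">2</td></tr></table>')
    print(strip_table_attrs(sample))
```

Non-table tags pass through with their attributes intact, since in this setup they would be handed to whatever HTML-to-markdown converter you use afterwards.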