Five file types you should avoid if you want to work with data

Jairo G. Sarmiento Sotelo

In the world of data analysis, not all files are created equal. Receiving information in an unsuitable format can turn a task that should take minutes into a nightmare of hours, filled with manual cleaning and restructuring. To make your data work easier, we at Datasketch bring you five file types you should avoid at all costs.

1. Tables in PDF Files

This is public enemy number one. PDFs are designed to preserve a document’s appearance, not to share data. Extracting a table from a PDF is often an odyssey: columns get jumbled, numbers are converted to text, and rows get split. Although tools exist to “liberate” this data, the process is rarely perfect and almost always requires an exhaustive manual review. If you’re asking for data, always request the original source file, not a PDF.
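To illustrate the "numbers are converted to text" problem, here is a minimal Python sketch of the cleanup that usually follows a PDF extraction. The function name `parse_number`, the sample rows, and the number conventions (comma as thousands separator, dot as decimal mark, parentheses for negatives) are all assumptions made for this example, not part of any specific tool. Real files may follow other conventions.

```python
import re
from typing import Optional


def parse_number(cell: str) -> Optional[float]:
    """Turn a text cell such as '1,234.50' or '(45)' into a float, or None."""
    # Remove regular and non-breaking spaces left over from the extraction.
    text = cell.strip().replace("\u00a0", "").replace(" ", "")
    if not text:
        return None
    # Assumption: accounting style, where parentheses mark a negative value.
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1]
    # Assumption: comma is a thousands separator, dot is the decimal mark.
    text = text.replace(",", "")
    if not re.fullmatch(r"[-+]?\d+(\.\d+)?", text):
        return None  # not clearly a number: leave it for manual review
    value = float(text)
    return -value if negative else value


# Made-up rows, shaped like the output of a PDF table extractor.
rows = [
    ["Region", "Sales"],
    ["North", "1,234.50"],
    ["South", "(45)"],
    ["East", "n/a"],
]

cleaned = [[row[0], parse_number(row[1])] for row in rows[1:]]
needs_review = [row[0] for row in cleaned if row[1] is None]
```

Returning `None` instead of guessing keeps doubtful cells visible, so the manual review can focus on them.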

2. Images of Tables (JPG, PNG)

A screenshot of a table isn’t data; it’s pixels. This format is even worse than a PDF because the information doesn’t exist as text or numbers but as a static image. Short of retyping it by hand, the only way to extract it is through Optical Character Recognition (OCR) software, a technology that can make mistakes, especially with long numbers, decimals, or dense text.
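Because OCR output cannot be trusted blindly, it helps to run simple sanity checks before using it. The Python sketch below shows two such checks under stated assumptions: it flags cells that mix digits with letters that are often mistaken for digits, and it compares a row's parts against its printed total. The names `looks_misread` and `total_matches`, the list of lookalike letters, and the sample cells are hypothetical choices for this example.

```python
import re

# Assumption: letters that OCR engines commonly confuse with digits
# (O/o with 0, l/I with 1, S with 5, B with 8).
LOOKALIKES = re.compile(r"[OolISB]")


def looks_misread(cell: str) -> bool:
    """True if a cell mixes digits with letters that resemble digits."""
    has_digit = re.search(r"\d", cell) is not None
    has_lookalike = LOOKALIKES.search(cell) is not None
    return has_digit and has_lookalike


def total_matches(parts, total, tolerance=0.01):
    """True if the parts add up to the stated total, within a tolerance."""
    return abs(sum(parts) - total) <= tolerance


# Made-up cells, shaped like text returned by an OCR tool.
ocr_cells = ["105", "1O5", "2l", "Bogota", "3.5"]
flagged = [cell for cell in ocr_cells if looks_misread(cell)]
```

Checks like these only catch some errors. A digit misread as another digit (a 3 read as an 8) passes the first check, which is why comparing against printed totals is a useful second step.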

3. Word Documents (.docx) with Tables

Much like PDFs, Word documents are designed for narrative, not for structured data. A table in Word might look organized, but its internal structure is fragile. It can easily contain merged cells, hidden line breaks, or inconsistent formatting that will break any attempt at automatic importing.

4. Spreadsheets with Excessive Formatting (Excel)

Yes, Excel is a data tool, but it’s often used like a drawing canvas. A file filled with merged cells, data differentiated by color instead of columns, multiple tables on a single sheet, or complex headers is a minefield for analysis. For a spreadsheet to be useful, it must be simple: a single table per sheet, with clear headers and no merged cells.
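Merged cells are a good example of why formatting hurts analysis. When a sheet with merged cells is exported, the merged value typically survives only in the first row of the merge and the rows below it come out blank. The Python sketch below shows one common repair, filling values downward. The function name `fill_down` and the sample rows are made up for this example, and the sketch assumes blank cells arrive as empty strings or `None`.

```python
def fill_down(rows, column):
    """Copy the last non-empty value of `column` into the blank cells below it."""
    filled = []
    last_value = None
    for row in rows:
        row = list(row)  # copy, so the original rows are left untouched
        if row[column] in ("", None):
            row[column] = last_value
        else:
            last_value = row[column]
        filled.append(row)
    return filled


# Made-up export where the region cells were merged in the spreadsheet.
raw = [
    ["Andina", "Bogotá", 120],
    ["", "Medellín", 95],
    ["Caribe", "Barranquilla", 60],
    ["", "Cartagena", 40],
]

tidy = fill_down(raw, 0)
for row in tidy:
    print(row)
```

After the fill, every row carries its own region, so each row can be filtered, sorted, or grouped without depending on its position in the sheet.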

5. Presentations (PowerPoint, Google Slides)

Presentations contain data summaries, not the data itself. The information is usually represented in simplified charts, images, or bullet points. Asking for a .pptx file to analyze data is like asking for the movie trailer instead of the full movie. It’s always better to go to the original source of those graphics.

Avoiding these formats will save you countless hours of frustration and allow you to get straight to what matters: analyzing and visualizing information. Explore our blog to learn more tips on how to work with data, create engaging charts, or transform your data work.

At Datasketch, we’re committed to making information more accessible, reusable, and useful for everyone who works with data. We know that it’s not always possible to receive files in the ideal format, which is why we’re developing extensions that help convert files or work with data in unconventional formats.

