In this tutorial you'll learn how to download files from URLs in your data, how to extract text from PDF files, and how to calculate checksums for files.
Here is a small dataset (15 articles) in CSV format:
The file download node is under Process data > File operations > Download file. You must give the file path field as input.
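For readers who want to see roughly what a download step involves, here is a minimal Python sketch. It is not the node's actual implementation: the function name `download_file`, the target directory argument and the choice to name the local file after the last segment of the URL are all assumptions made for illustration.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import unquote, urlparse


def download_file(url, target_dir):
    """Download `url` into `target_dir` and return the absolute local path.

    Hypothetical helper, not part of the tool described in this tutorial.
    """
    directory = Path(target_dir)
    directory.mkdir(parents=True, exist_ok=True)

    # Assumption: the last path segment of the URL is a usable file name.
    name = Path(unquote(urlparse(url).path)).name or "download.bin"
    target = directory / name

    # Stream the response to disk so large files are not held in memory.
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)

    return target.resolve()
```

The function returns an absolute path, which is convenient because later steps (such as text extraction) expect an absolute file path as input.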
The text extraction node is under Process data > File operations > PDF: extract text. You need to give the absolute file path as input to the node.
Checksums are used to check the integrity of files. In the case of Wikimedia Commons, a checksum can also be used to check whether a certain file is already on Commons. The checksum node can be found in the File operations section.
Create the Checksum node, give the file path field as an input and execute it.
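To illustrate what a checksum calculation does, here is a minimal Python sketch using the standard `hashlib` module. The function name `file_checksum` is hypothetical, and the default of SHA-1 is an assumption: this tutorial does not say which algorithm the checksum node uses. SHA-1 is chosen here because the Wikimedia Commons API can look up files by their SHA-1 hash.

```python
import hashlib


def file_checksum(path, algorithm="sha1", chunk_size=65536):
    """Return the hex digest of the file at `path`.

    Hypothetical helper; the algorithm default is an assumption.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two files with identical contents always produce the same digest, so comparing digests is a cheap way to detect duplicates or to confirm that a download was not corrupted.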