Web Scraping - Database

Database

Why do we need database?

There are many reasons for a big organization. For an individual user like us, the biggest advantage is, EFFICIENCY. Assume a website publishes data every day and you need consistent time series to run some models. For economic reasons, we only need to scrape the latest report each day in theory. What if one report gets delayed by some 'technical issues' (usually this is a lame excuse)? We have to scrape two reports next day. But we cannot keep track of everything every day and machines are supposed to do the grunt work for us. If we scrape the entire historical dataset as an alternative, it will take too much time and computing capacity. Additionally, we are running a risk of being blacklisted by the website. Some of those APIs even implement daily limit for each user. This is when database kicks in and keep tracks of everything. With records in the database, we can always start from where we left off last time. Of course, there are many other benefits of database, e.g. Data Integrity, Data Management.

Enough of sales pitch, let's get into the technical details of database. The package we installed is called sqlite3, referring to SQLite database. The setup of SQLite database is hassle-free, in contrast to other relational database such as MySQL or PostgreSQL. Other benefits of SQLite include rapide execution, petit size. Since we are not running a big organization, we shouldn't be bothered with things like Azure, SQL server or Mongo DB.

To create a database, we simply do

conn = sqlite3.connect('database.db')
c = conn.cursor()

The above command would create a database if it does not exist in a given directory. If it exists, it will automatically connect to the database instead.

If you are connecting to corporate server, you might consider the following tutorial on pyodbc. To connect to SQL server, simply do

import pyodbc
conn = pyodbc.connect("""
DRIVER={SQL Server};SERVER=some_ip_address;DATABASE=some_database_name;UID=some_username;PWD=some_password
""")
c = conn.cursor()

Next step is to create a table in the database, we can do

c.execute("CREATE TABLE table_name ([column1] DATATYPE, [column2] DATATYPE, [column3] DATATYPE, PRIMARY KEY ([column1], [column2], [column3]));")
conn.commit()

Some key notes

An interesting feature of SQL is its case insensitivity. Still, the upper case letters make things more distinguishable. The brackets [] serve the same purpose.
For some reason, sqlite3 requests ; at the end of each SQL command.
Always remember to commit changes to database. Think of it as a pop up window to ask you if you want to save all the changes and you click HELL YEAH.
The data types in SQL are very sophisticated and precise. Unless your duty is to maintain the efficiency of the database, I'd suggest you go with python data types such as 'FLOAT','DATE','TEXT'.
Primary key is crucial to your data integrity. Primary key guarantees the uniqueness of the datasets. For instance, if the website modifies one historical data point, without the primary key constraint, we could end up with two values for the same period. With the primary key constraint, we can update the old value to skip any data corruption.
Only one primary key is allowed in each table (or no primary key at all). Even though, primary key can involve multiple columns, like the above statement.

Now that tables are set up, let's insert some scraped data into the database, we can do

c.execute("INSERT INTO table_name VALUES (?,?,?,?)",[data1,data2,data3,data4])
conn.commit()
conn.close()

We should not forget the last statement. SQLite3 database does not allow multiple modification at the same time. Other users cannot make changes inside the table if we don't close the database, similar to Excel in a way.

To make query directly from database, we do

c=conn.cursor()
c.execute("SELECT * FROM table_name WHERE [column1]=value1;")
rows=c.fetchall()
conn.commit()

The above is a conventional query method in sqlite3. However, pandas provide a much more convenient way. The output goes straight into dataframe instead of tuples within a list. Easy peasy lemon squeezy!

df=pd.read_sql("SELECT * FROM table_name WHERE [column1]=value1",conn)

One of the very common issues from query is encoding. Unfortunately, I haven't managed to solve it so far. Though there is a way to get around like this

'C'était des loques qui se traînaient'.encode('latin-1').decode('ISO-8859-1')

There are other useful SQL sentences as well. For more details, you can check w3schools. I personally believe fluency in SELECT, DELETE, UPDATE and INSERT is enough to cover most of your daily tasks, unless you aim to be a data architect dealing with numerous schemas.

Some other useful statements including

UPDATE table_name SET [column1]=value1
DELETE FROM table_name WHERE [column1]=value1

Feel free to take a look at LME for more coding details.

Click the icon below to be redirected to GitHub Repository