Why do we need database?
There are many reasons for a big organization. For an individual user like us, the biggest advantage is, EFFICIENCY. Assume a website publishes data every day and you need consistent time series to run some models. For economic reasons, we only need to scrape the latest report each day in theory. What if one report gets delayed by some 'technical issues' (usually this is a lame excuse)? We have to scrape two reports next day. But we cannot keep track of everything every day and machines are supposed to do the grunt work for us. If we scrape the entire historical dataset as an alternative, it will take too much time and computing capacity. Additionally, we are running a risk of being blacklisted by the website. Some of those APIs even implement daily limit for each user. This is when database kicks in and keep tracks of everything. With records in the database, we can always start from where we left off last time. Of course, there are many other benefits of database, e.g. Data Integrity, Data Management.
Enough of sales pitch, let's get into the technical details of database. The package we installed is called sqlite3, referring to SQLite database. The setup of SQLite database is hassle-free, in contrast to other relational database such as MySQL or PostgreSQL. Other benefits of SQLite include rapide execution, petit size. Since we are not running a big organization, we shouldn't be bothered with things like Azure, SQL server or Mongo DB.
To create a database, we simply do
conn = sqlite3.connect('database.db')
c = conn.cursor()
The above command would create a database if it does not exist in a given directory. If it exists, it will automatically connect to the database instead.
If you are connecting to corporate server, you might consider the following tutorial on pyodbc. To connect to SQL server, simply do
import pyodbc
conn = pyodbc.connect("""
DRIVER={SQL Server};SERVER=some_ip_address;DATABASE=some_database_name;UID=some_username;PWD=some_password
""")
c = conn.cursor()
Next step is to create a table in the database, we can do
c.execute("CREATE TABLE table_name ([column1] DATATYPE, [column2] DATATYPE, [column3] DATATYPE, PRIMARY KEY ([column1], [column2], [column3]));")
conn.commit()
Some key notes
Now that tables are set up, let's insert some scraped data into the database, we can do
c.execute("INSERT INTO table_name VALUES (?,?,?,?)",[data1,data2,data3,data4])
conn.commit()
conn.close()
We should not forget the last statement. SQLite3 database does not allow multiple modification at the same time. Other users cannot make changes inside the table if we don't close the database, similar to Excel in a way.
To make query directly from database, we do
c=conn.cursor()
c.execute("SELECT * FROM table_name WHERE [column1]=value1;")
rows=c.fetchall()
conn.commit()
The above is a conventional query method in sqlite3. However, pandas provide a much more convenient way. The output goes straight into dataframe instead of tuples within a list. Easy peasy lemon squeezy!
df=pd.read_sql("SELECT * FROM table_name WHERE [column1]=value1",conn)
One of the very common issues from query is encoding. Unfortunately, I haven't managed to solve it so far. Though there is a way to get around like this
'C'était des loques qui se traînaient'.encode('latin-1').decode('ISO-8859-1')
There are other useful SQL sentences as well. For more details, you can check w3schools. I personally believe fluency in SELECT, DELETE, UPDATE and INSERT is enough to cover most of your daily tasks, unless you aim to be a data architect dealing with numerous schemas.
Some other useful statements including
UPDATE table_name SET [column1]=value1
DELETE FROM table_name WHERE [column1]=value1
Feel free to take a look at LME for more coding details.
Click the icon below to be redirected to GitHub Repository