doodle

UCSD CSE29 FA25 Syllabus and Logistics

Aaron Schulman (Instructor)
Joe Gibbs Politz (Instructor)

Basics - Staff & Resources - Schedule - Course Components - Grading - Policies

CSE 29 introduces you to the broad field of systems programming, including 1) the basics of how programs execute on a computer, 2) programming in C with direct access to memory and system calls, 3) software tools to manage and interact with code and programs. All very cool stuff that makes every programmer better!

Basics

Lecture (attend the one you're enrolled in):
- Joe/A section: 10am Center Hall 119
- Aaron/B section: 1pm Warren Lecture Hall 2005
Discussions (attend either or both):
- Mon 3pm Center 115
- Mon 4pm Ledden Auditorium
Labs: Tuesdays (check your schedule!). Either B240 or B250 in the CSE building
Exams: AP&M Testing Center, flexible scheduling in weeks 3, 6, and 9 on PrairieTest
Final exam: AP&M Testing Center, flexible scheduling in final exam week on PrairieTest
Professor office hours – Joe and Aaron each have 2 hours. Come to either and ask anything you need.
- Aaron: Monday 3:30pm-4:30pm and Wednesday 11am-12pm CSE3120
- Joe: Tuesday 1:00pm-2:00pm (CSE B260 in the labs) and Wednesday 1:00-2:00pm (CSE 3206 Joe's Office)
Office Hours: See the Office Hours Calendar
Q&A forum: Piazza
PrairieLearn: https://us.prairielearn.com
Textbook/readings: Dive Into Systems, plus additional readings we will assign (all free/online)
- Free: MIT Missing Semester
- Not free but pretty cheap: Julia Evans Zines, especially The Pocket Guide to Debugging

Staff Resources

Office Hours Calendar

Schedule

The schedule below outlines topics, due dates, and links to assignments. We'll typically update the material for the upcoming week before Monday's lecture so you can see what's coming.

Week 4 – Processes and Memory

Readings
- Processes, especially fork and exec
- C Structs
- make and Makefiles
Lecture Materials
- Monday:
  - Joe's Lecture: Annotated Handout

Week 3 – Where (Some) Things Are in Memory

Readings
Lecture Materials
- Friday:
  - Joe's Lecture: Annotated Handout
- Wednesday:
  - Joe's Lecture: Annotated Handout
  - Aaron's Lecture: Slides
- Monday:
  - Joe's Lecture: Handout | Annotated Handout
  - Aaron's Lecture: Slides | Code on ieng6 in dir /home/linux/ieng6/CSE29_FA25_B00/public/lectures/10-10-strcat-memory
Lab
- Lab 3 Reference

Week 2 - Number Representations, Sizes, and Signs

Announcements
- PA 1 is due Thursday 10/9 at 11:59pm.
- Exam 1 is next week, please make a reservation ASAP for exam slots on PrairieTest by logging in with your @ucsd.edu account. See more info in the under Exams part of the syllabus below.
Readings
- Binary and Data Representaion (Sec 4.1-4.6)
- Arrays in C
- Scanf and fgets (though this week we'll only use fgets with stdin)
Lecture Materials
- Friday:
  - Repository
  - Joe's Lecture: Handout | Annotated Handout
  - Aaron's Lecture: Code on ieng6 in dir /home/linux/ieng6/CSE29_FA25_B00/public/lectures/10-10-strcat-memory
- Wednesday:
  - Joe's Lecture: Handout | Annotated Handout
  - Aaron's Lecture: Code on ieng6 in dir /home/linux/ieng6/CSE29_FA25_B00/public/lectures/10-08-utf8
- Monday:
  - Repository
  - Joe's Lecture: Annotated Handout
  - Aaron's Lecture: Slides
Lab
- Lab 2 Reference

Week 1 – Strings, Memory and Bitwise Representations (in C)

Announcements
- Problem Set 1 will be available on PrairieLearn, after lab, due Mon Oct 6 at 11:59pm
Readings
- The UNIX Command Line
- Arrays and Strings (In C)
Bitwise Videos
- Brief Bitwise Operators Intro
- Counting 0 bits
Lecture Materials
- Friday:
  - Joe's Lecture: Annotated Handout
  - Aaron's Lecture: Slides
- Wednesday:
  - Repository
  - Joe's Lecture: Annotated Handout
  - Aaron's Lecture: Slides
- Monday:
  - Repository
  - Joe's Lecture: Annotated Handout
  - Aaron's Lecture: Slides
Lab
- Lab 1 Form and Reference

Week 0 – Welcome

Announcements
- Lab attendance is required and a lot happens there, make sure to go to lab.
Lecture Materials
- Friday:
  - Joe's Lecture: Notes
  - Aaron's Lecture: Slides

Syllabus

There are several components to the course:

Lab sessions
Lecture and discussion sessions
Problem sets
Assignments
Exams

Labs

The course's lab component meets for 2 hours. In each lab you'll switch between working on your own, working in pairs, and participating in group discussions about your approach, lessons learned, programming problems, and so on.

The lab sessions and groups will be led by TAs and tutors, who will note your participation in these discussions for credit. Note that you must participate, not merely attend, for credit.

For each lab there are 4 possible points to earn:

1 for being present
2 for being an active, professional, productive participant while present
1 for submitting and/or getting work checked off that was done in lab

If you miss lab, you can earn the third point by submitting work late, but cannot earn back the 3 points related to participation.

Lecture and Discussion Sessions

Lecture sessions are on Monday, Wednesday, and Friday, and discussion sections are Monday. We recommend attending every lecture and one of the two discussion sections.

In a typical discussion section two things will happen:

TAs will go over programming work in the problem sets, sometimes reviewing, sometimes doing work on the active problem set.
There will be time to work on problem sets and ask questions of the TAs and your peers

Problem Sets

On Mondays of even weeks (2, 4, 6, 8, 10), a problem set is due.

Problem sets have a collection of small programming problems that provide practice on concepts and techniques from lecture, lab, and the reading. In addition, a subset of each problem set has programming problems that are directly related to completing the current assignment (e.g. some of the code may even be copy-pasteable).

Assignments

The course has 5 assignments that involve programming and writing. Individual assignments will have detailed information about submission components; in general you'll submit some code and some written work (to PrairieLearn).

For each assignment, we will give a 0-4 score along with feedback:

4 for a complete submission with high code and writing quality with few mistakes, and no significant errors
3 for a complete submission with some mistakes or some unclear writing
2 for a complete submission with some mistakes and some unclear writing, or a submission with high code or writing quality, but missing the other
1 for a submission missing key components, or with clear inaccuracies in multiple components
0 for no submission or a submission unrecognizable as a partial or complete submission

After each assignment is graded, you'll have a chance to resubmit it based on the feedback you received, which will detail what you need to do to increase your score. You can increase your score by up to 2 points on resubmit (e.g. 0 to 2, 1 to 3, 2 to 4, 3 to 4)

This is also the only late policy for assignments. Unsubmitted assignments are initially given a 0, and can get a maximum of 2 points on resubmission.

Exams

Exams will use the testing facility in AP&M B349, which is a computer lab. You will schedule your exam on PrairieTest by logging in with your @ucsd.edu account. You can schedule the exam at a time that's convenient for you in the given exam week, and you will go to that lab and check in for your exam at the time you picked. The exam will be proctored by staff from the Triton Testing Center (not by the course staff from this course). No study aids or devices are allowed to be used in the testing center. You will need only a photo ID and something to write with (scratch paper is available on request).

The Triton Testing Center has shared a document of rules and tips for using the testing center.

The exams will be administered through PrairieLearn; we will give you practice exams and exercises so you can get used to the format we'll use before you take the first one. The exams will have a mix of questions; they will typically include some that involve programming and interacting with a terminal.

There are three exams during the quarter in weeks 3, 6, and 9. On each you'll get a score from 0-4.

We don't have a traditionally-scheduled final exam for this course (you can ignore the block provided in Webreg). Instead, in final exam week, you'll have the opportunity to retake exams from during the quarter to improve your score up to a 4, regardless of the score on the first attempt. The retakes may be different than the original exam, but will test the same learning outcomes. This is also the only make-up option for missed exams during the quarter: if you miss an exam for any reason it will be scored as 0, and you can use one of your retake opportunities on that exam.

Exams during the quarter are all 45m long, the make-up slot is 2h long and gives the opportunity to make up any or all 3 of the in-quarter exams.

Grading

Each component of the course has a minimum achievement level to get an A, B, or C in the course. You must reach that achievement level in all of the categories to get an A, B, or C.

A achievement:
- ≥34/40 lab points
- ≥16/20 assignment points
- ≥10/12 exam points
B achievement:
- 30-33 lab points
- 13-15 assignment points
- 8-9 exam points
C achievement:
- 24-29 lab points
- 10-12 assignment points
- 6-7 exam points

Problem sets: Problem set performance will determine pluses and minuses. We don't publish an exact number for these in advance, but it's consistent across the class.

Some general examples: if you complete all the problem sets completely, correctly, and on time, you'll get a + modifier. If you submit no problem sets on time or don't get any of them done completely or correctly, you will get a - modifier.

Policies

Individual assignments describe policies specific to the assignment. Some general policies for the course are here.

Assignments and Academic Integrity

You can use code that we provide or that your group develops in lab as part of your assignment. If you use code that you developed with other students (whether in lab or outside it), got from Piazza, or got from the internet, say which students you worked with and a sentence or two about what you did together in CREDITS.txt. All of the writing in assignments (e.g. in open-ended written questions) must be your own.

You can use an AI assistant like ChatGPT or Copilot to help you author assignments in this class. If you do, you are required to include in CREDITS.txt:

The prompts you gave to the AI chat, or the context in which you used Copilot autocomplete
What its output was and how you changed the output after it was produced (if at all)

This helps us all learn how these new, powerful, and little-understood tools work (and don't).

If you don't include a CREDITS.txt and it's clear you included code from others or from an AI tool, you may lose credit or get a 0 on the assignment, and repeated or severe violations can be escalated to reports of academic integrity violations.

Exams and Academic Integrity

Problem sets are the best preparation for the exams. You're free to collaborate with others on preparing for the exam, trying things out beforehand, and so on.

You cannot share details of your exam with others until after you receive your grade for it. You cannot communicate with anyone during the exam.

Problem Sets and Academic Integrity

You can work on problem sets with other students.

FAQ/AFQ (Anticipated Frequent Questions)

Can I attend a lab section other than the one I'm enrolled in?

No, please do not try to do this. The lab sections have limited seating and are full. We cannot accommodate switching.

How can I switch sections?

You have to drop and re-add (which may involve getting [back on] the waitlist). Sorry.

Can I leave lab early if I'm done or have a conflict?

The labs are designed to not be things you can “finish”. Labs have plenty of extension and exploration activities at the end for you to try out, discuss, and help one another with. Co-located time with other folks learning the same things is precious and what courses are for. Also, if you need an extrinsic motivation, you won't get credit for participation if you don't stay, and participate, the whole time. We can often accommodate one-off exceptions – if you have a particular day where you need to leave early, it's a good idea to be extra-engaged in your participation so your lab leader can give you participation credit before you leave.

Do I have to come to lab?

Yes, see grading above.

What should I do if I'm on the waitlist?

Attend and complete all the work required while waitlisted (this is consistent with CSE policy).

I missed lab, what should I do?

The lab page will have instructions on how to submit the make-up, which can get you 1 (of 4 possible) points.

I missed a problem set deadline, what should I do?

You can submit it late until the end of the quarter. Generally we allow lots (think like 1/3 to 1/2) of the psets to be late without it impacting your grade. They are there to give you focused practice and to prepare you for the exams and the assignments.

I missed an assignment deadline, what should I do?

Some time after each assignment deadline (usually around 2 weeks) there is a late/resubmission deadline. You can resubmit then. See the assignment section above for grading details about resubmissions.

I missed a assignment resubmission deadline, what should I do?

You cannot get an extension on assignment resubmissions; we cannot support multiple late deadlines and still grade all the coursework on time.

I missed my exam time, what should I do?

Stay tuned for announcements about scheduling make-ups in final exam week.

Where is the financial aid survey?

We do this for you; as long as you submit a quiz or do a lab participation in the first two weeks, we will mark you as commencing academic activity.

When are the midterms scheduled?

The midterms will be flexibly scheduled during the quarter using a testing center. More details will come; you will need to set aside some outside-of-class time to do them, but there is not a specific class-wide time you have to put on your calendar.

I have a conflict with the final exam time, what can I do?

The final exam will also be flexibly scheduled during final exam week using the testing center.

Lab 1

September 30, 2025

Fill out the Lab 1 Form

References

Terminal

On the lab machines, you can open a terminal either one of two ways:

On the toolbar at the bottom of the screen, there are a few icons on the far right. The last one (shaped like a black rectangle) will open a new terminal.

⠀

Right-click on the desktop, and a menu of options will appear. Select the option to create a new terminal.

⠀

The following are some commands you can run in your terminal. To run a command, type it in and press enter.

`pwd`

Prints the name of the working directory, which is the location in the file system that the terminal is currently in.

`whoami`

Prints the username of the account you are using on the computer.

`uname -a`

Prints information about the computer that the terminal is connected to.

`cd [directoryname]`

Changes to the directory named [directoryname]. The special directory .. takes you up one level.

`ls`

Prints a list of all the files and directories that are in the working directory.

`mkdir [directoryname]`

Creates a new directory called [directoryname].

`touch [filename]`

Creates a new file called [filename].

`rm [filename]`

Removes the file called [filename].

`man [commandname]`

Displays a documentation page for the terminal command [commandname].

Logging into ieng6

`ssh yourusername@ieng6.ucsd.edu`

Your account name is the same account name as the one that’s used for your school Google account, i.e. it is the string that precedes “@ucsd.edu” in your school email address. In case you need to check the status of your student account, refer to the UCSD Student Account Lookup page.

Your password is the same password that you use for your student account on other websites (i.e. Canvas). The terminal does not show the characters that you type when you enter your password.
The first time you use this command, you will get a message like this:

⠀$ ssh yourusername@ieng6.ucsd.edu
The authenticity of host 'ieng6.ucsd.edu (128.54.70.227)' can't be established.
RSA key fingerprint is SHA256:ksruYwhnYH+sySHnHAtLUHngrPEyZTDl/1x99wUQcec.
Are you sure you want to continue connecting (yes/no/[fingerprint])?

Copy and paste the one of the corresponding listed public key fingerprints and press enter. ¹

If you see the phrase ED25519 key fingerprint answer with: SHA256:8vAtB6KpnYXm5dYczS0M9sotRVhvD55GYz8EjN1DYgs
If you see the phrase ECDSA key fingerprint answer with: SHA256:/bQ70BSkHU8AEUqommBUhdAg0M4GaFIHLKq0YQyKvmw
If you see the phrase RSA key fingerprint answer with: SHA256:npmS8Gk0l+zAXD0nNGUxr7hLeYPn7zzhYWVKxlfNaeQ

Getting the fingerprint from a trusted source is the best thing to do here. You can also just type “yes”, as it's pretty unlikely anything nefarious is going on. If you get this message when you're connecting to a server you connect to often, it could mean someone is trying to listen in on or control the connection. This answer is a decent description of what's going on and how you might calibrate your own risk assessment: Ben Voigt's answer

PrairieLearn

PrairieLearn URL: us.prairielearn.com

Enroll in CSE 29: Systems Programming and Software Tools, Fall 2025 at https://us.prairielearn.com/pl/enroll

Complete parts 1 and 2 of the vimtutor demo. (Run “vimtutor” in the terminal)

Part 1: git motivated

git: command line program that enables version control (in case you break it)

Repository: folder containing code

GitHub: website that holds (repos)itories; can contribute and share with others

Part 2: SSH Keys on GitHub

Command to get the Github People repo

$ git clone git@github.com:Monip1/lab2-people.git

This command should fail. We need to set up SSH Keys on GitHub to fix that!

Setting up SSH Keys on GitHub

Recall from last lab how to log into your ieng6 account:


$ ssh yourusername@ieng6.ucsd.edu

Please log into ieng6. Within your ieng6 account, use the following command to generate a new key pair, replacing github_email with your GitHub email address:


$ ssh-keygen -t rsa -b 4096 -C github_email

You’ll be prompted to “Enter a file in which to save the key”. Press Enter to accept the default location. You’ll then be prompted to enter a passphrase, which isn’t really necessary. Press Enter twice to continue without setting a passphrase. Though if you really want to set a passphrase, refer to GitHub docs on passphrases.

Adding your SSH key to GitHub

By default, the public SSH key is saved to a file at ~/.ssh/id_rsa.pub.

Instead of typing out the whole filename, you can type out some prefix of the name (e.g. ~/.ssh/id), and press Tab to autocomplete the name. In this case, tab complete won’t complete the full filename, since the private key happens to be named id_rsa. Please be too lazy to type out entire filenames and use tab complete instead!

View the contents of ~/.ssh/id_rsa.pub (using cat), then copy the contents of the public key file to your clipboard.

On the GitHub website, click your profile picture in the top right to open a menu, and click on “Settings”.

Click on Settings in your GitHub account menu

On the left, open “SSH and GPG keys”, then click on “New SSH key”.

Go to "New SSH key"

Populate the fields as follows:

Title: “ieng6”
- The title doesn’t affect the functionality of the key, it’s just a note for you that this key is tied to your ieng6 account.
Key type: “Authentication key”
Key: Paste the contents of the public key file here (entire block including the email portion).

Go to the "Key" box

Click “Add SSH key”. You may need to confirm access to your account on GitHub at this point.

Testing your SSH key

Finally, test your connection to GitHub with the command:

$ ssh -T git@github.com

If this is your first time connecting to GitHub, you might get a warning about the “authenticity of host can’t be established”. This is a warning for you to make sure that you’re connecting to the right thing. For the purposes of this lab, we assume that GitHub didn’t suddenly get hacked, so you can safely respond with “yes”. But if you’re really paranoid, you can check GitHub’s public key fingerprint here.

After a successful connection, it should output Hi <your-username>! You've successfully authenticated, but GitHub does not provide shell access.

Command to get the Github People repo (Take 2!)

If you set up the SSH Keys correctly in the previous steps, now you should be able clone the repo:

$ git clone git@github.com:Monip1/lab2-people.git

Part 3: Whiteboard Activity - UTF-8 Strings

★彡:)
¿Sí?
ㅋ😂!
Jé😀

Part 4: Submitting the PA (Lab Work Checkoff)

PA 1 GitHub Classroom Link: https://classroom.github.com/a/pGoD-4Uz

To get the final 1 point for lab work checkoff this week (out of 4 total points), you just need to make a submission on Gradescope for pa1!

Important git commands

$ git status

Gets the status of our repository. It returns which files are untracked (new) modified (changed) and deleted.

$ git add [filename(s)]

When we are done making changes to a file, we "stage" it to mark it as ready to be committed. Using the git add command with the path of the changed file(s) will stage each to be included in the next commit.

$ git commit -m “message”

A commit is a package of associated changes. Running the git commit command will take all of our staged files, and package them into a single commit. We specify the -m option to specify we want to write a commit message, and write our commit message in the “”. Without -m, git opens a vim window to write the commit message.

$ git push

Pushes the commit with our changes to the remote repository (i.e. GitHub).

Part 1: SSH Keys to ieng6

Note: if you have a personal laptop you can set up ssh keys on it to be able to ssh to ieng6. If you're on windows, you’ll need a linux-like terminal, so use the terminal in VSCode, a program like git bash, or WSL for this.

If you don’t have a laptop or it isn’t set up with a linux terminal, work from your lab computer for today. You can set up ssh keys on your laptop at home on your own time.

Generate a new pair of ssh keys (on your laptop, or on your lab computer)

cd to your .ssh directory, (if it doesn’t exist, first create it with mkdir ~/.ssh)
IF YOU ALREADY HAVE AN 'id_rsa', SKIP THIS STEP (otherwise you'll overwrite your github keys)
If you don't have an id_rsa, run: ssh-keygen -t rsa -b 4096 -C YOUR_EMAIL_ADDRESS
- You'll be using the default filename and no password, so just press enter twice
Copy the contents of your new public key (you can print out the contents with cat)

Copy your public key onto ieng6, into the file ~/.ssh/authorized_keys

On ieng6, in your .ssh directory, create a file called authorized_keys (or open it if it exists already)
Use vim to paste in the contents of the new public key into the authorized_keys file
- (if your authorized_keys already existed, paste the new key on a new line at the end)

Log out from ieng6 (use exit), and make sure you can ssh into it without typing your password

Part 2: Debugging with -Wall

Clone the repository we’ll be using for this lab (you’ll need to get the ssh URL). You’ll be working with one of the three programs, depending on your group: https://classroom.github.com/a/X-DdsPq-

Many common errors can be caught by the compiler, but a lot of these checks aren't enabled by default. We can ask the compiler to warn us by adding the -Wall flag ("- W(arn) all") when you compile:

$ gcc -Wall YOUR_CODE.c -o YOUR_CODE

Part 3: Debugging with `gdb`

Intro to gdb commands:

To use gdb, make sure you’ve compiled your program with “debug symbols”

$ gcc -g YOUR_CODE.c

Run your compiled code in gdb

$ gdb YOUR_BINARY

This puts you into a gdb prompt: normal terminal commands don’t work here, and you can instead run gdb-specific commands

(gdb) run     # starts running your program

If your program stops running or segfaults in gdb, print out the backtrace (also called a “stack trace”)

(gdb) backtrace

(gdb) bt

Use the following commands to print out values at the point the program stopped:

(gdb) info locals
(gdb) info args
(gdb) print VALUE (or p VALUE)
- You can print any variable or expression, e.g.
  - print x, p arr[5], p ((x & 0b1111) << 3)
- You can also specify a format to print in
  - print/t (binary), print/x (hex), print/d (decimal)
(gdb) x ADDRESS
- This prints out memory at an address, e.g. strings / arrays / pointers
(gdb) x/16cb str1
- This prints the first 16 bytes of str1 as characters
(gdb) x/20xb str2
- This prints 20 bytes of str2 in hex
(gdb) x/4dw arr
- This prints 4 “words” (i.e. int32s) of arr, as decimal numbers
You can use the following reference card for reference on gdb commands, and format commands for x and print.

If you’re done early, use the reference card to try exploring other gdb commands!

Lab 3 Work Check-Off (due Monday, Oct 20):

In your copy of the lab3-buggy repo: fix the bugs in your assigned program, then commit and push your changes so they show up on GitHub.

Lab 4 Reference Document

Part 1: `.vimrc` and `.bash_profile`

In the file ~/.bash_profile on ieng6, add:

alias ls="ls --color"

Create the file ~/.vimrc on ieng6, and copy in:

set tabstop=4
set softtabstop=4
set shiftwidth=4
set autoindent
set number

Part 2: Writing a Search Program

(clone the github classroom repo from here: https://classroom.github.com/a/nzV8A2UB)

For this lab, you'll be testing your program on the "rockyou" password list from your PA2, you can copy it over from this path on ieng6:

/home/linux/ieng6/CSE29_FA25_A00/public/pa2/rockyou_clean.txt

You'll be writing "mysearch.c". Your program should take 1 argument (the string to search for). It should read lines from standard input, and print out all of the input lines that contain the search string.

./mysearch PATTERN

You can use the strstr() function to search for a string within another string.

Part 3: Extra Options

Expand your program to handle extra flags:

./mysearch -n PATTERN : print line numbers before each line

./mysearch -v PATTERN : print only lines that don't contain PATTERN

./mysearch -c PATTERN : don't print pattern matches, just print out the count of matching lines at the end

Each person in your group should implement a different one of these; you'll need all three to do the whiteboard activity.

You can use the strcmp() function to check whether two strings are equal.

Work Check-off

Make sure your program supports at least one of -n, -v, or -c, and commit/push your code.

If done early, implement some of the following (in no particular order)

Make your program handle all 3 of -n, -v, and -c.
Add the option -i, for case-insensitive search (i.e. search -i pattern should match lines containing "Pattern" or "PATTERN")
Use ANSI Escape Codes to make your program bold or highlight the matches in every matching line, either always or with a --color option. (more info on Wikipedia)
Add the option -A to print extra context after a match, so e.g. search -A 2 pattern would print an extra 2 lines after every match.
Add the option -B to print extra context before a match (search -B 2 pattern would print an extra 2 lines before every match.) (Note: to be able to do this in C with the tools we've seen so far, you might need to set an upper limit on how many lines back your program will be able to support)
Make your program handle options more flexibly, e.g. it could:
- be able to handle multiple options simultaneously
- accept arguments in any order
- exit with a help message for unrecognized options (e.g. -f, -o) instead of treating them as search patterns. (Or if you pass it the -h or --help options)
- if it sees the option "--", treat all following arguments as search patterns, even if they start with a "-". (This is how actual command line programs allow you to search for the string "-n" instead of specifying the "-n" option.)
- allow long versions of options, e.g. --invert-match, --line-number, --count, etc

PA1 - UTF-8:

Due Date: Thursday 10/9 at 11:59pm

Looking for resubmission? Click here

UTF-8

Representing text is straightforward using ASCII: one byte per character fits well within char[] and it represents most English text. However, there are many more than 256 characters in the text we use, from non-Latin alphabets (Cyrillic, Arabic, and Chinese character sets, etc.) to emojis and other symbols like €, to accented characters like é and ü.

The UTF-8 encoding is the default encoding of text in the majority of software today. If you've opened a web page, read a text message, or sent an email in the past 15 years that had any special characters, the text was probably UTF-8 encoded.

Not all software handles UTF-8 correctly! For example, Joe got a marketing email recently with a header “Take your notes further with Connectâ€‹” We're guessing that was supposed to be an ellipsis (…), UTF-8 encoded as the three bytes 0x11100010 0x10000000 0x10100110, and likely the software used to author the email mishandled the encoding and treated it as three extended ASCII characters.

This can cause serious problems for real people. For example, people with accented letters in their names can run into issues with sign-in forms (check out Twitter/X account @yournameisvalid for some examples). People with names best written in an alphabet other than Latin can have their names mangled in official documents, and need to have a "Latinized" version of their name for business in the US. Joe had trouble writing lecture notes because LaTeX does not support UTF-8 by default.

UTF-8 bugs can and do cause security vulnerabities in products we use every day. A simple search for UTF-8 in the CVE database of security vulnerabilities turns up hundreds of results.

It's useful to get some experience with UTF-8 so you understand how it's supposed to work and can recognize when it doesn't. To that end, you'll write several functions that work with UTF-8 encoded text, and use them to analyze some example texts.

Getting Started

You can do your programming however you like; using vim on ieng6 will give you good practice for labs, exams, and problem sets, but you are not required to use any particular environment (we'd have no way to check anyway).

To get started, follow the instructions in lab 2 and create a GitHub Classroom assignment.

UTF-8 Analyzer

You'll write a program that reads UTF-8 input and prints out some information about it.

Here's what the output of a sample run of your program should look like:

$ ./utf8analyzer
Enter a UTF-8 encoded string: My 🐩’s name is Erdős.
Valid ASCII: false
Uppercased ASCII: MY 🐩’S NAME IS ERDőS.
Length in bytes: 27
Number of code points: 21
Bytes per code point: 1 1 1 4 3 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1
Substring of the first 6 code points: My 🐩’s
Code points as decimal numbers: 77 121 32 128041 8217 115 32 110 97 109 101 32 105 115 32 69 114 100 337 115 46
Animal emojis: 🐩

You can also test the contents of files by using the < operator:

$ cat utf8test.txt
My 🐩’s name is Erdős.
$ ./utf8analyzer < utf8test.txt
Enter a UTF-8 encoded string:
Valid ASCII: false
Uppercased ASCII: MY 🐩’S NAME IS ERDőS.
Length in bytes: 27
Number of code points: 21
Bytes per code point: 1 1 1 4 3 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1
Substring of the first 6 code points: My 🐩’s
Code points as decimal numbers: 77 121 32 128041 8217 115 32 110 97 109 101 32 105 115 32 69 114 100 337 115 46
Animal emojis: 🐩

Using the Problem Set for your PA

Most of the needed functionality for the PA is in the problem set problems! So part of your workflow for this PA should be to move code over from the appropriate part of your problem set and into your PA code.

A good first task is to move over only is_ascii, and then write the corresponding part of main needed to read input and print the result for is_ascii, and make sure you can test that. Then move onto capitalize_ascii, and so on.

You can and should save your work by using git commits (if you're comfortable with that), or even just saving copies of your .c file when you hit important milestones. We may ask to see your work from an earlier milestone if you ask us for help on a function from a later one.

Here we give some hints about how specific problem set problems correspond to specific lines of output.

Valid ASCII: – HW1.15. is_ascii
Uppercased ASCII: – HW1.13. capitalize_ascii

Here it is important to avoid changing the string that's going to be used for the rest of the printing!
Length in bytes: – Are there any built-in string.h functions that can help?
Number of code points: – HW1.17. Count UTF-8 String Length
Bytes per code point: – HW1.16. UTF8 Codepoint Size

Here it could be useful to use the function from the problem set inside a loop!
Substring of the first 6 code points: – HW1.18. utf8_substring
Code points as decimal numbers: – HW1.5. Find UTF-8 Codepoint at Index

Another use of a loop with a problem set problem!
Animal emojis: – HW1.19. is_animal_emoji_at

Another use of a loop, maybe also combined with some conditionals!

Testing

We provide 3 basic tests in the tests folder - which contain simple tests identifying valid ASCII and converting ASCII lowercase to uppercase characters.

You can see the result for a single test by using:

./utf8analyzer < test-file

Here are some other ideas for tests you should write. They aren't necessarily comprehensive (you should design your own!) but they should get you started. For each of these kinds of strings, you should check how UTF-8 analyzer handles them:

Strings with a single UTF-8 character that is 1, 2, 3, 4 bytes
Strings with two UTF-8 characters in all combinations of 1/2/3/4 bytes. (e.g. "aa", "aá", "áa", "áá", and so on)
Strings with and without animal emojii, including at the beginning, middle, and end of the string, and at the beginning, middle, and end of the range
Strings of exactly 5 characters

We recommend saving your input in files and using redirection to test so you don't have to figure out how to type the same UTF8 characters over and over.

PA Design Questions

You will answer the following questions in your DESIGN.md file.

Answer each of these with a few sentences or paragraphs; don't write a whole essay, but use good writing practice to communicate the essence of the idea. A good response doesn't need to be long, but it needs to have attention to detail and be clear. Examples help!

Another encoding of Unicode is UTF-32, which encodes all Unicode code points in 4 bytes. For things like ASCII, the leading 3 bytes are all 0's. What are some tradeoffs between UTF-32 and UTF-8?
UTF-8 has a leading 10 on all the bytes past the first for multi-byte code points. This seems wasteful – if the encoding for 3 bytes were instead 1110XXXX XXXXXXXX XXXXXXXX (where X can be any bit), that would fit 20 bits, which is over a million code points worth of space, removing the need for a 4-byte encoding. What are some tradeoffs or reasons the leading 10 might be useful? Can you think of anything that could go wrong with some programs if the encoding didn't include this restriction on multi-byte code points?

Resources and Policy

Refer to the policies on assignments for working with others or appropriate use of tools like ChatGPT or Github Copilot.

You can use any code from class, lab, or discussion in your work.

What to Hand In

Any .c files you wrote (can be one file or many; it's totally reasonable to only have one). We will run gcc *.c -o utf8analyzer to compile your code, so you should make sure it works when we do that.
A file DESIGN.md (with exactly that name) containing the answers to the design questions
Your tests are in files tests/*.txt

Hand in to the pa1 assignment on Gradescope. You can either submit a link to your GitHub Classroom Repository or upload the files from your computer.

The submission system will show you the output of compiling and running your program on the test input described above to make sure the baseline format of your submission works. You will not get feedback about your overall grade before the deadline.

PA1 Resubmission: Due Date 10/23 at 11:59pm

If you want to resubmit PA1, please read this section carefully. You need to pass all the tests in the original PA1, while also implementing an extra function described below.

`void next_utf8_char(char str[], int32_t cpi, char result[])`

Takes a UTF-8 encoded string and a codepoint index. Calculates the codepoint at that index. Then, calculates the code point with value one higher (so e.g. for ”é“ U+00E9 that would be “ê” (U+00EA), and for “🐩” (U+1F429) that would be “🐪” (U+1F42A)). Saves the encoding of that code point in the result array starting at index 0.

Example Usage:

char str[] = "Joséph";
char result[100];
int32_t idx = 3;
next_utf8_char(str, idx, result);
printf("Next Character of Codepoint at Index 3: %s\n",result);
// 'é' is the 4th codepoint represented by the bytes 0xC3 0xA9
// 'ê' in UTF-8 hex bytes is represented as 0xC3 0xAA

=== Output ===
Next Character of Codepoint at Index 3: ê

Now, Your final output on running the utfanalyzer code that will be graded should contain this extra line:

Next Character of Codepoint at Index 3: FILL

Note: If the number of codepoints in the input string is less than 4, this added line would only have the prompt without any character as follows:

Next Character of Codepoint at Index 3:

The complete program output for example, should look like:

$ ./utf8analyzer
Enter a UTF-8 encoded string: My 🐩’s name is Erdős.
Valid ASCII: false
Uppercased ASCII: "MY 🐩’S NAME IS ERDőS."
Length in bytes: 27
Number of code points: 21
Bytes per code point: 1 1 1 4 3 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1
Substring of the first 6 code points: "My 🐩’s"
Code points as decimal numbers: 77 121 32 128041 8217 115 32 110 97 109 101 32 105 115 32 69 114 100 337 115 46
Animal emojis: 🐩
Next Character of Codepoint at Index 3: 🐪

(All our tests will check for this newly added line, in addition to lines from the original PA)

You will also need to answer the following updated DESIGN question in your resubmission

Consider the 3-byte sequence 11100000 10000000 10100001. Answer the following questions:

What code point does it encode in UTF-8, and what character is that?
What are the three other ways to encode that character?
Give an example of a character that has exactly three encodings (but not four, like the one in the previous example does)
What are some problems with having these multiple encodings, especially for ASCII characters? A web search for “overlong UTF-8 encoding” may be useful here.

PA2 - Hashing and Passwords: Due 10/23 at 11:59pm - Github Classroom Link

Cryptographic hash functions take an input of arbitrary length and produces a fixed length output. The special features are that the outputs are deterministic (the same input always generates the same output) and yet the outputs are apparently “random” – two similar inputs are likely to produce very different outputs, and it's difficult to determine the input by looking at the ouput.

A common application of hashing is in storing passwords. Many systems store only the hash of a user's password along with their username. This ensures that even if someone gains access to the stored data, users' actual passwords are not exposed. When a user types in a password on such a system, the password handling software grants access if the hash value generated from the user's entry matches the hash stored in the password database.

We said above that it is difficult to determine an input given an output. Password cracking is a family of techniques for accomplishing this difficult (but possible!) task. That is, let's say we have access to a user's password hash only. Can we figure out their password? We could then use it to log in, and it may also be shared across their accounts which we could also access.

In some cases, password cracking can exploit the structure of a hash function; this is a topic for a cryptography class. In our case, we will take a more brute-force approach: trying variations on existing known passwords, under the assumption that many passwords are easy to guess.

Getting Started

To get started, visit the Github Classroom assignment link. Select your username from the list (or if you don't see it, you can skip and use your Github username). A repository will be created for you to use for your work. This PA should be completed on the ieng6 server. Refer this section in Week 1's Lab for instructions on logging in to your account on ieng6 and working with files on the server.

Note - The autograder uses the C11 standard of the C programming language.

Overall Task

We'll start by describing the overall task and goal so it's clear where you're going. However, don't start by trying to implement this exactly – use the milestones below to get there.

`pwcrack`, a Password Cracker

Write a program pwcrack that takes one command-line argument, the SHA256 hash of a password in hex format.

The program then reads from stdin potential passwords, one per line (assumed to be in UTF-8 format).

The program should, for each password:

Check if the SHA256 hash of the potential password matches the given hash
Check if the SHA256 hash of the potential password with each of its ASCII characters uppercased or lowercased matches the given hash

If a matching password is found, the program should print

Found password: SHA256(<matching password>) = <hash>

If a matching password is not found, the program should print:

Did not find a matching password

Examples

seCret has a SHA256 hash of a2c3b02cb22af83d6d1ead1d4e18d916599be7c2ef2f017169327df1f7c844fd, and notinlist has a SHA256 hash of 21d3162352fdd45e05d052398c1ddb2ca5e9fc0a13e534f3c183f9f5b2f4a041

$ ./pwcrack a2c3b02cb22af83d6d1ead1d4e18d916599be7c2ef2f017169327df1f7c844fd
Password
NeverGuessMe!!
secret
Found password: SHA256(seCret) = a2c3b02cb22af83d6d1ead1d4e18d916599be7c2ef2f017169327df1f7c844fd
$ ./pwcrack 21d3162352fdd45e05d052398c1ddb2ca5e9fc0a13e534f3c183f9f5b2f4a041
Password
NeverGuessMe!!
secret
<Press Ctrl-D for end of input>
Did not find a matching password

We could also put the potential passwords in a file:

$ cat guesses.txt
Password
NeverGuessMe!!
secret
$ ./pwcrack a2c3b02cb22af83d6d1ead1d4e18d916599be7c2ef2f017169327df1f7c844fd < guesses.txt
Found password: SHA256(seCret) = a2c3b02cb22af83d6d1ead1d4e18d916599be7c2ef2f017169327df1f7c844fd

Note that we only consider single-character changes when trying uppercase/lowercase variants of the guesses (i.e. we DON'T try all possible capitalizations of the string): In the example below, the correct password is NOT found because going from SECRET to seCret would require 5 characters to be changed.

$ ./pwcrack a2c3b02cb22af83d6d1ead1d4e18d916599be7c2ef2f017169327df1f7c844fd
SECRET
<Press Ctrl-D for end of input>
Did not find a matching password

Using the Problem Set for your PA

PSet 2 11. cap_variants helps create all the variants of the strings needed for checking
PSet 2 15. sha256_stdin helps with the main loop that does SHA on values from input and has examples of calling SHA256 (problems 14 and 16 are also related)
Pset 2 9. shacheck helps with understanding how to compare SHA values

Design Questions

Real password crackers try many more variations than just uppercasing and lowercasing. Do a little research on password cracking and suggest at least 2 other ways to vary a password to crack it. Describe them both, and for each, write a sentence or two about what modifications you would make to your code to implement them. (You don't have to actually implement them).
How much working memory is needed to store all of the variables needed to execute the password cracker? Based on your response would you say that a password cracker is more memory-limited or is it more limited by how fast the process can run the code?

What to Hand In

Any .c files you wrote (can be one file or many; it's totally reasonable to only have one). We will run gcc *.c -o pwcrack -lcrypto to compile your code, so you should make sure it works when we do that.
A file DESIGN.md (with exactly that name) containing the answers to the design questions

You will hand in your code to the pa2 assignment on Gradescope. An autograder will give you information about if your code compiles and works on some simple examples like the ones from this writeup. Your implementation and design questions will be graded after the deadline with a mix of automatic and manual grading.

Fun (and Authentic!) Testing

To help testing your PA, we are providing you with a file containing 3 million real plaintext passwords famously found a data breach of the RockYou social network in 2009. You can use the password file present in the ieng6 servers by reading it into pwcrack using the following commandline if you're in Joe's section:

$./pwcrack a2c3b02cb22af83d6d1ead1d4e18d916599be7c2ef2f017169327df1f7c844fd < /home/linux/ieng6/CSE29_FA25_A00/public/pa2/rockyou_clean.txt

and the following command if you're in Aaron's section:

$./pwcrack a2c3b02cb22af83d6d1ead1d4e18d916599be7c2ef2f017169327df1f7c844fd < /home/linux/ieng6/CSE29_FA25_B00/public/pa2/rockyou_clean.txt

Note: these are real human-generated passwords, so they may contain profane words (and offensive concepts). We are providing a censored version of the RockYou password list. Our filtering methodology was to remove any passwords that matched a widely-used 2,800 profane word list. If you read the contents of the password file, note that you are doing so at your own risk, as we can not guarantee that we removed all offensive passwords from the list.

Staff Passwords Scenario

You see in the news and on social media that there was a “data breach” of PrairieLearn, and the passwords of all instructors and TAs were exposed on the “dark web”. A classmate sends you a mysterious link that clearly contains usernames for the CSE29 staff and gibberish after each name.

PrairieLearn Password Dump

Can you figure out the passwords of each of the staff members using your pwcrack program, the data at that link, and the rockyou dataset shared above?

Cracking these is ungraded, and just provided for fun and to get your attention with a provocative scenario where password cracking would be interesting (if unethical). They aren't our real passwords.

It also gives us a moment to discuss something important: just because you can crack passwords doesn't mean you should. There are plenty of laws prohibiting the use of passwords that aren't yours to access data that isn't yours, regardless of how you obtained the password. Responsibility and ethics are another – it may simply be wrong to reverse-engineer (access to) someone else's private data, even if it happens to be legal. Of course, there are good uses for doing this – cracking can recover a lost password needed for access to critical data, for example, if the hash is available. Like any powerful tool, your knowledge of C programming and passwords should be used with care and appropriately in context!

In many cases, finding and reporting security issues is possible to do responsibly – there's an entire theory and practice of responsible disclosure. Researchers like Prof Aaron do work on responsibly informing companies, governments, and the public about vulnerabilities in real-world systems. So it's critical to learn about how systems work, how secure systems fail, and how to navigate the surrounding ethical, legal, and social issues!